**Task**
####This is a regression problem: predict the median house value of a block.
**Dataset**
####Both the train and test datasets were generated from a deep learning model trained on the California Housing Dataset.
**Attributes**

####MedInc - Median income for households within a block of houses.
####HouseAge - Age of a house within a block; a lower number is a newer building.
####AveRooms - Average number of rooms within a block.
####AveBedrms - Average number of bedrooms within a block.
####Population - Total number of people residing within a block.
####AveOccup - Average number of household members.
####Longitude - A measure of how far west a house is; a more negative value is farther west.
####Latitude - A measure of how far north a house is; a higher value is farther north.
####MedHouseVal - Median house value for households within a block (the target variable).

**Library imports**

In [None]:
import numpy as np             #provides support for multi-dimensional arrays and mathematical functions.
import pandas as pd            #provides data manipulation and analysis tools.
import matplotlib.pyplot as plt #plotting library for creating visualizations
import seaborn as sns           #data visualization library based on matplotlib, providing additional tools for creating statistical graphics
%matplotlib inline              

**Exploratory Data Analysis**

In [None]:
#reading input files (named df_train/df_test to match the cells below)
df_train=pd.read_csv("train_data.csv")
df_test=pd.read_csv("test_data.csv")

#####Analysing the data

In [None]:
df_train.head()   # displays the first five rows of the dataframe

In [None]:
df_test.head()   # displays the first five rows of the dataframe

In [None]:
df_train.shape    # returns a (rows, columns) tuple

In [None]:
df_test.shape    # returns a (rows, columns) tuple

In [None]:
df_train.info()   # prints the information about the dataframe

In [None]:
df_test.info()   # prints the information about the dataframe

Check for missing values and handle them

In [None]:
df_train.isnull().sum()

In [None]:
df_test.isna().sum()
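
The cells above only count missing values. If any were found, one possible way to handle them is median imputation. The sketch below uses a small made-up frame (`toy`) rather than the real data, and the choice of the median is an assumption, not something the notebook prescribes.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train; the values are made up for illustration.
toy = pd.DataFrame({
    "MedInc": [3.2, np.nan, 5.1, 4.0],
    "HouseAge": [20.0, 35.0, np.nan, 15.0],
})

# Count missing values per column, as in the cells above.
missing_before = toy.isna().sum()

# Fill each numeric column with its own median (an assumed strategy).
medians = toy.median(numeric_only=True)
toy_filled = toy.fillna(medians)

missing_after = toy_filled.isna().sum()
print(missing_before.to_dict(), missing_after.to_dict())
```

If the counts are all zero, as they may well be for this dataset, no imputation step is needed.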

Check for duplicate values

In [None]:
df_train.duplicated().any()

In [None]:
df_test.duplicated().any()

Check for outliers

In [None]:
df_train.skew()

In [None]:
from scipy.stats import skew
plt.style.use('ggplot')                       # set the plot style once, before the loop
for column in df_train:
    print(column)
    print(f"Skewness: {skew(df_train[column])}")
    plt.figure(figsize=(3,3))
    sns.histplot(df_train[column], kde=True)  # distplot is deprecated in recent seaborn releases
    plt.grid(False)
    plt.show()

The above code uses the **scipy.stats.skew** function to calculate the skewness of each column in the DataFrame df_train. Skewness measures the degree of asymmetry of a distribution, with positive values indicating a right-skewed (long tail to the right) distribution and negative values indicating a left-skewed (long tail to the left) distribution.

The code then draws a histogram with a density curve for each column using **seaborn.histplot** and displays it using **matplotlib.pyplot.show**. **plt.grid**(False) turns off the grid in the histogram, and **plt.style.use**('ggplot') applies a style to all plots.

This code is useful for identifying columns with highly skewed distributions, which may indicate that they need to be transformed before modeling.
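
To make the definition concrete, the sketch below computes skewness by hand with NumPy. It also shows why `df_train.skew()` and `scipy.stats.skew` can print slightly different numbers: pandas reports the bias-corrected estimator, while SciPy's default is the uncorrected one. The function names and the sample list are illustrative only.

```python
import numpy as np
import pandas as pd

def sample_skewness(values):
    """Uncorrected Fisher-Pearson skewness g1 = m3 / m2**1.5 (SciPy's default)."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)   # second central moment
    m3 = np.mean(centered ** 3)   # third central moment
    return m3 / m2 ** 1.5

def adjusted_skewness(values):
    """Bias-corrected skewness G1, the estimator pandas reports."""
    n = len(values)
    return sample_skewness(values) * np.sqrt(n * (n - 1)) / (n - 2)

right_skewed = [1.0, 1.0, 2.0, 2.0, 3.0, 10.0]   # made-up sample with a long right tail
g1 = sample_skewness(right_skewed)
G1 = adjusted_skewness(right_skewed)

print("g1 is positive:", g1 > 0)
print("G1 matches pandas:", np.isclose(G1, pd.Series(right_skewed).skew()))
```

For large samples the two estimators are nearly identical, so the difference matters mainly for small datasets.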

In [None]:
#handle skewness
#cap values at the first and third quartiles
quantile1=df_train["AveBedrms"].quantile(0.25)
quantile2=df_train["AveBedrms"].quantile(0.75)

In [None]:
df_train["AveBedrms"]=np.where(df_train["AveBedrms"]<quantile1,quantile1,df_train["AveBedrms"])
df_train["AveBedrms"]=np.where(df_train["AveBedrms"]>quantile2,quantile2,df_train["AveBedrms"])

In [None]:
a = round(df_train['AveBedrms'].skew(),6)
print(a)

The above code calculates the first quartile (25th percentile) and third quartile (75th percentile) of the "AveBedrms" column in the DataFrame df_train and assigns them to the variables quantile1 and quantile2, respectively.
It then uses numpy.where to replace any value in the column that is below the first quartile with quantile1, and any value that is above the third quartile with quantile2.
Finally, it recomputes the skewness of the column with the Pandas skew() method, rounds the result to 6 decimal places with round(), and stores it in the variable a.

Quartiles divide a dataset into four equal parts, with the first quartile being the value below which 25% of the data falls and the third quartile being the value below which 75% of the data falls. The range between the first and third quartiles is known as the interquartile range (IQR) and is often used to identify outliers in a dataset. In this notebook every value outside the IQR is treated as an outlier and replaced with the nearest quartile, so the column ends up confined to the interval from quantile1 to quantile2. This pulls in the long tails and reduces skewness, which can make models less sensitive to extreme values. It is an aggressive choice, however: by definition about half of the values lie outside the IQR, so roughly half of the column is overwritten and much of its variation is lost.
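
For comparison, a more conventional use of the IQR is Tukey's rule, which only caps values beyond `Q1 - 1.5*IQR` and `Q3 + 1.5*IQR`. The sketch below contrasts the two approaches on a made-up series; the helper name `clip_with_tukey_fences` and the factor `k=1.5` are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

def clip_with_tukey_fences(series, k=1.5):
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] and return the bounds used."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series.clip(lower=lower, upper=upper), (lower, upper)

toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])  # one obvious outlier

# Tukey fences: only the extreme value is changed.
fence_clipped, (lower, upper) = clip_with_tukey_fences(toy)
changed_by_fences = int((fence_clipped != toy).sum())

# Quartile capping as done in the notebook: everything outside [Q1, Q3] is changed.
quartile_clipped = toy.clip(lower=toy.quantile(0.25), upper=toy.quantile(0.75))
changed_by_quartiles = int((quartile_clipped != toy).sum())

print(lower, upper, changed_by_fences, changed_by_quartiles)
```

On this toy series the fences change one value while quartile capping changes four, which illustrates how much more data the latter rewrites.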



In [None]:
# Interquartile range: cap values at the first and third quartiles
quantile1=df_train["MedInc"].quantile(0.25)
quantile2=df_train["MedInc"].quantile(0.75)
df_train["MedInc"]=np.where(df_train["MedInc"]<quantile1,quantile1,df_train["MedInc"])
df_train["MedInc"]=np.where(df_train["MedInc"]>quantile2,quantile2,df_train["MedInc"])
b = round(df_train['MedInc'].skew(),6)
print(b)

In [None]:
# Interquartile range: cap values at the first and third quartiles
quantile1=df_train["AveOccup"].quantile(0.25)
quantile2=df_train["AveOccup"].quantile(0.75)
df_train["AveOccup"]=np.where(df_train["AveOccup"]<quantile1,quantile1,df_train["AveOccup"])
df_train["AveOccup"]=np.where(df_train["AveOccup"]>quantile2,quantile2,df_train["AveOccup"])
c = round(df_train['AveOccup'].skew(),6)
print(c)

In [None]:
# Interquartile range: cap values at the first and third quartiles
quantile1=df_train["AveRooms"].quantile(0.25)
quantile2=df_train["AveRooms"].quantile(0.75)
df_train["AveRooms"]=np.where(df_train["AveRooms"]<quantile1,quantile1,df_train["AveRooms"])
df_train["AveRooms"]=np.where(df_train["AveRooms"]>quantile2,quantile2,df_train["AveRooms"])
d = round(df_train['AveRooms'].skew(),6)
print(d)

In [None]:
# Interquartile range: cap values at the first and third quartiles
quantile1=df_train["Population"].quantile(0.25)
quantile2=df_train["Population"].quantile(0.75)
df_train["Population"]=np.where(df_train["Population"]<quantile1,quantile1,df_train["Population"])
df_train["Population"]=np.where(df_train["Population"]>quantile2,quantile2,df_train["Population"])
e = round(df_train['Population'].skew(),6)
print(e)

In [None]:
# Interquartile range: cap values at the first and third quartiles
quantile1=df_train["MedHouseVal"].quantile(0.25)
quantile2=df_train["MedHouseVal"].quantile(0.75)
df_train["MedHouseVal"]=np.where(df_train["MedHouseVal"]<quantile1,quantile1,df_train["MedHouseVal"])
df_train["MedHouseVal"]=np.where(df_train["MedHouseVal"]>quantile2,quantile2,df_train["MedHouseVal"])
f = round(df_train['MedHouseVal'].skew(),6)
print(f)

In [None]:
df_train.skew()

Transformation of target

In [None]:
fig,ax = plt.subplots(figsize=(3,3))
sns.histplot(np.log(df_train['MedHouseVal']))
plt.show()

This code creates a histogram of the natural logarithm of the "MedHouseVal" column in the DataFrame df_train using seaborn.histplot and displays it using matplotlib.pyplot.show. The figsize argument sets the size of the figure to 3x3 inches.

Taking the logarithm of a variable is a common transformation that can help to reduce the impact of extreme values and improve the distribution's normality. In this case, taking the logarithm of "MedHouseVal" may help to make the distribution more symmetrical and closer to a normal distribution. The resulting histogram can be useful for identifying patterns or trends in the data that may not be as apparent in the original scale.
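
Note that the cell above only plots the log-transformed target. If a model were actually trained on the log scale, its predictions would need to be mapped back with the exponential before use. A minimal sketch of that round trip, using made-up values in place of `MedHouseVal`:

```python
import numpy as np

# Made-up positive values standing in for MedHouseVal.
house_values = np.array([0.5, 1.2, 2.5, 5.0])

# Forward transform: what a model would be trained on.
log_values = np.log(house_values)

# Inverse transform: what must be applied to predictions made on the log scale.
restored = np.exp(log_values)

print(np.allclose(restored, house_values))
```

The logarithm is only defined for positive values, so this assumes the target contains no zeros or negatives.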

**Correlation** (Train Data)

In [None]:
correlation = df_train.corr()
correlation

This code calculates the correlation matrix of the DataFrame df_train using the corr() function from Pandas and assigns the result to the variable correlation.
By printing the correlation variable, we can see the correlation coefficients between each pair of variables in the dataset. This information can be useful for identifying relationships between variables and for selecting variables for modeling.

In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(correlation,annot=True,cmap='crest',linewidths=0.2)
plt.show()

This code creates a heatmap of the correlation matrix calculated in the previous step using the seaborn.heatmap function and displays it using matplotlib.pyplot.show(). The figsize argument sets the size of the figure to 10x10 inches, while the annot=True argument displays the correlation coefficients in each cell of the heatmap. The cmap argument sets the color scheme to 'crest', while the linewidths argument sets the width of the lines separating each cell to 0.2.
The diagonal line of the heatmap shows the correlation of each variable with itself, which is always 1. The heatmap can be a useful tool for quickly identifying strong positive or negative correlations between variables, which can help in feature selection for predictive modeling.
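
When the goal is feature selection, it is often enough to look at one column of the correlation matrix: the correlation of each feature with the target. The sketch below ranks features by absolute correlation on a small made-up frame; the column values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Made-up data: MedInc tracks the target closely, HouseAge does not.
toy = pd.DataFrame({
    "MedInc": [1.0, 2.0, 3.0, 4.0, 5.0],
    "HouseAge": [30.0, 10.0, 40.0, 20.0, 25.0],
    "MedHouseVal": [1.1, 1.9, 3.2, 3.9, 5.1],
})

correlation = toy.corr()

# Keep only the target's column and drop its correlation with itself.
target_corr = correlation["MedHouseVal"].drop("MedHouseVal")

# Rank by strength, ignoring the sign of the relationship.
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)
```

Correlation only captures linear relationships, so a low value does not by itself prove that a feature is useless to a tree-based model.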

**Check for outliers** (Test data)

In [None]:
df_test.skew()

In [None]:
# Interquartile range: cap values at the first and third quartiles
quantile1=df_test["AveBedrms"].quantile(0.25)
quantile2=df_test["AveBedrms"].quantile(0.75)
df_test["AveBedrms"]=np.where(df_test["AveBedrms"]<quantile1,quantile1,df_test["AveBedrms"])
df_test["AveBedrms"]=np.where(df_test["AveBedrms"]>quantile2,quantile2,df_test["AveBedrms"])
a = round(df_test['AveBedrms'].skew(),6)
print(a)

In [None]:
# Interquartile range: cap values at the first and third quartiles
quantile1=df_test["AveOccup"].quantile(0.25)
quantile2=df_test["AveOccup"].quantile(0.75)
df_test["AveOccup"]=np.where(df_test["AveOccup"]<quantile1,quantile1,df_test["AveOccup"])
df_test["AveOccup"]=np.where(df_test["AveOccup"]>quantile2,quantile2,df_test["AveOccup"])
b = round(df_test['AveOccup'].skew(),6)
print(b)

In [None]:
# Interquartile range: cap values at the first and third quartiles
quantile1=df_test["AveRooms"].quantile(0.25)
quantile2=df_test["AveRooms"].quantile(0.75)
df_test["AveRooms"]=np.where(df_test["AveRooms"]<quantile1,quantile1,df_test["AveRooms"])
df_test["AveRooms"]=np.where(df_test["AveRooms"]>quantile2,quantile2,df_test["AveRooms"])
c = round(df_test['AveRooms'].skew(),6)
print(c)

In [None]:
# Interquartile range: cap values at the first and third quartiles
quantile1=df_test["Population"].quantile(0.25)
quantile2=df_test["Population"].quantile(0.75)
df_test["Population"]=np.where(df_test["Population"]<quantile1,quantile1,df_test["Population"])
df_test["Population"]=np.where(df_test["Population"]>quantile2,quantile2,df_test["Population"])
d = round(df_test['Population'].skew(),6)
print(d)

In [None]:
# Interquartile range: cap values at the first and third quartiles
quantile1=df_test["MedInc"].quantile(0.25)
quantile2=df_test["MedInc"].quantile(0.75)
df_test["MedInc"]=np.where(df_test["MedInc"]<quantile1,quantile1,df_test["MedInc"])
df_test["MedInc"]=np.where(df_test["MedInc"]>quantile2,quantile2,df_test["MedInc"])
e = round(df_test['MedInc'].skew(),6)
print(e)

In [None]:
df_test.skew()
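
The test-set cells above repeat the same four lines per column and compute the quartiles from df_test itself. One alternative, sketched here as an assumption rather than as the notebook's method, is to learn the bounds once from the training data and reuse them on both sets, so that train and test are capped identically. The helper names `fit_clip_bounds` and `apply_clip_bounds` and the toy frames are hypothetical.

```python
import pandas as pd

def fit_clip_bounds(train, columns, lower_q=0.25, upper_q=0.75):
    """Compute (lower, upper) quantile bounds per column from the training frame."""
    return {c: (train[c].quantile(lower_q), train[c].quantile(upper_q)) for c in columns}

def apply_clip_bounds(frame, bounds):
    """Return a copy of the frame with each listed column capped at its bounds."""
    clipped = frame.copy()
    for column, (lower, upper) in bounds.items():
        clipped[column] = clipped[column].clip(lower=lower, upper=upper)
    return clipped

# Made-up frames standing in for df_train and df_test.
toy_train = pd.DataFrame({"AveRooms": [1.0, 2.0, 3.0, 4.0, 5.0]})
toy_test = pd.DataFrame({"AveRooms": [0.0, 3.5, 9.0]})

bounds = fit_clip_bounds(toy_train, ["AveRooms"])
toy_test_clipped = apply_clip_bounds(toy_test, bounds)

print(bounds, toy_test_clipped["AveRooms"].tolist())
```

The same two calls would work for any list of columns, which removes the need for one cell per feature.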

In [None]:
#correlation(test data)
correlation = df_test.corr()
correlation

In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(correlation,annot=True,cmap='crest',linewidths=0.2)
plt.show()

**Splitting the data**

In [None]:
X_train = df_train.iloc[:,1:9]
X_train.head(2)

In [None]:
y_train = df_train.MedHouseVal
y_train.head(2)

In [None]:
X_test = df_test.iloc[:,1:]
X_test.head(2)

The code creates a new DataFrame X_train that contains a subset of the columns of df_train. The iloc indexer selects all rows (:) and the columns at positions 1 through 8 (1:9; the end position is excluded). This assumes that the first column of df_train (position 0) is the id column and that the last column is the target MedHouseVal, so the slice keeps only the eight predictor variables. X_test is built the same way: df_test.iloc[:,1:] drops the id column and keeps every remaining column.
The code also creates y_train, a Series that contains the target variable (MedHouseVal) from df_train. The target is the variable we are trying to predict, so it must be separated from the predictors in order to train the model. y_train has the same number of rows as X_train, and each entry is the target value for the corresponding row of X_train.
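
Positional slicing breaks silently if the column order changes. A name-based split is a possible alternative; the sketch below assumes the frame has an `id` column and a `MedHouseVal` column, and uses a made-up two-feature frame for brevity.

```python
import pandas as pd

# Made-up frame with the same layout as df_train: id, features, target.
toy_train = pd.DataFrame({
    "id": [0, 1],
    "MedInc": [3.0, 4.5],
    "HouseAge": [20.0, 35.0],
    "MedHouseVal": [1.5, 2.8],
})

# Everything that is neither the identifier nor the target is a feature.
feature_columns = [c for c in toy_train.columns if c not in ("id", "MedHouseVal")]

X_toy = toy_train[feature_columns]
y_toy = toy_train["MedHouseVal"]

print(list(X_toy.columns), y_toy.tolist())
```

Reusing `feature_columns` to index the test frame also guarantees that train and test features appear in the same order.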


**Modelling**
#####A voting regressor is an ensemble meta-estimator that fits several base regressors, each on the whole dataset. Then it averages the individual predictions to form a final prediction.

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
reg1 = GradientBoostingRegressor(random_state=1)

from sklearn.ensemble import RandomForestRegressor
reg2 = RandomForestRegressor(random_state=1)

from sklearn.linear_model import LinearRegression
reg3 = LinearRegression()

from sklearn.ensemble import VotingRegressor
regressor = VotingRegressor(estimators=[('gb', reg1), ('rf', reg2), ('lir', reg3)])

regressor.fit(X_train, y_train)
prediction = regressor.predict(X_test)
prediction

This code defines three regression models using scikit-learn: GradientBoostingRegressor, RandomForestRegressor, and LinearRegression. The two tree-based models are instantiated with random_state set to 1, which makes their results reproducible; LinearRegression is deterministic and takes no such parameter.

The code then creates an ensemble of these three models using the VotingRegressor class from scikit-learn. The estimators argument is a list of tuples, where each tuple contains a name for the estimator and the estimator object itself. Because no weights are passed, the ensemble makes predictions by taking the plain (equally weighted) average of the predictions from the three base models.

The VotingRegressor object is then fit to the training data (X_train and y_train) using the fit() method. Once the model is trained, it is used to make predictions on the test set (X_test) using the predict() method.

The prediction variable contains the predicted target variable values for the test set.
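
The averaging behaviour can be checked directly. The sketch below fits the same three model types on a small synthetic dataset (generated here only for illustration, with fewer trees to keep it fast) and compares the ensemble's output with a manual average of the fitted base models exposed through `estimators_`.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression

# Synthetic regression data; not related to the housing dataset.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

ensemble = VotingRegressor(estimators=[
    ("gb", GradientBoostingRegressor(random_state=1, n_estimators=20)),
    ("rf", RandomForestRegressor(random_state=1, n_estimators=20)),
    ("lir", LinearRegression()),
])
ensemble.fit(X_toy, y_toy)

# One column of predictions per fitted base model.
base_predictions = np.column_stack(
    [est.predict(X_toy[:5]) for est in ensemble.estimators_]
)
manual_average = base_predictions.mean(axis=1)
ensemble_prediction = ensemble.predict(X_toy[:5])

print(np.allclose(manual_average, ensemble_prediction))
```

Passing a `weights` list to VotingRegressor would turn this into a weighted average instead.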

In [None]:
submission = pd.DataFrame({'id': df_test.id, 'MedHouseVal': prediction})
submission.head()

This code creates a new DataFrame submission containing the predicted MedHouseVal values for the test set, along with the id column from the original test set (df_test.id). The pd.DataFrame() function is used to create the DataFrame, with a dictionary of column names and values as the input.

The resulting DataFrame has two columns: id and MedHouseVal. The id column contains the same values as the original test set, while the MedHouseVal column contains the predicted values from the ensemble model.

Printing the head of submission using submission.head() displays the first five rows of the new DataFrame, allowing us to inspect the predicted values.
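
The notebook stops at inspecting the DataFrame. If the predictions need to be saved, `to_csv` with `index=False` writes just the two columns. The sketch below writes to an in-memory buffer with made-up values; in practice a file name such as "submission.csv" (a hypothetical name) would be passed instead.

```python
import io
import pandas as pd

# Made-up ids and predictions standing in for the real submission frame.
toy_submission = pd.DataFrame({"id": [100, 101], "MedHouseVal": [1.25, 2.5]})

# index=False keeps the row index out of the output.
buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```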