# Housing Price Prediction

## Introduction

##### This notebook explores which features determine the sale price of a house.
##### The dataset contains the features used to predict the sale price.
##### A complete description of the data is given in the description.txt file that accompanies the data.

### Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
%matplotlib inline
sns.set(rc={'figure.figsize':(24,8),'figure.dpi':500})


### Import the Data

In [None]:
df = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')

### Data Overview

In [None]:
df.head()

##### Let's look at the number of rows and columns in the dataset

In [None]:
pd.DataFrame([df.shape],index=['#'],columns=['Rows','Columns'])

##### Summary of the dataset

In [None]:
df.info()

In [None]:
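# Correlate numeric columns only; recent pandas releases raise an error on non-numeric columns
corr_data=df.select_dtypes(include=[np.number]).corr().drop('SalePrice').sort_values('SalePrice',ascending=False)[['SalePrice']]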

In [None]:
corr_data
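
As a small illustration of the ranking above, the sketch below builds a toy frame and sorts its numeric features by their correlation with a target column. The column names and values are made up for the example and are not taken from train.csv.

```python
import pandas as pd

# Toy data: column names and values are illustrative only
toy = pd.DataFrame({
    'area':  [50, 60, 80, 100, 120],
    'age':   [40, 35, 20, 10, 5],
    'zone':  ['A', 'B', 'A', 'C', 'B'],   # non-numeric, excluded below
    'price': [100, 115, 160, 210, 240],
})

# Keep numeric columns only, then rank features by correlation with the target
toy_corr = (toy.select_dtypes(include='number')
               .corr()['price']
               .drop('price')
               .sort_values(ascending=False))
print(toy_corr)
```

Sorting by the signed correlation puts strongly negative features at the bottom; sorting by the absolute value instead would rank features by strength alone.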

In [None]:
pd.DataFrame([[df.isna().sum().sum()]],columns=['Dataset'],index=['Missing Values'])

In [None]:
def missing_percent(df):
    # Percentage of missing values per column, largest first
    nan_percent=100*(df.isnull().sum().sort_values(ascending=False)/len(df))
    # Keep only the columns that have at least one missing value
    return nan_percent[nan_percent>0]

##### First, let's see what percentage of values is missing in each column

In [None]:
temp=missing_percent(df)
pd.DataFrame({'Column_Name':temp.index, 'Missing_Percentage':temp.values}).style.background_gradient(cmap='Reds')
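
The same percentage can be written more compactly with `isnull().mean()`, since the mean of a boolean column is the fraction of `True` values. The sketch below checks this on a toy frame; the frame and the name `toy_missing` are made up for the example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],   # 2 of 4 missing -> 50%
    'b': ['x', 'y', None, 'z'],        # 1 of 4 missing -> 25%
    'c': [1, 2, 3, 4],                 # complete -> filtered out
})

# isnull().mean() is the fraction of missing entries per column
toy_missing = (100 * toy.isnull().mean()).sort_values(ascending=False)
toy_missing = toy_missing[toy_missing > 0]
print(toy_missing)
```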

# Data Visualisation

#### Distribution of Target variable (SalePrice)

In [None]:
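# distplot is deprecated in recent seaborn releases; histplot with kde=True draws the same kind of plot
sns.histplot(df['SalePrice'], kde=True)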
plt.show()

In [None]:
res = stats.probplot(df['SalePrice'], plot=plt)
plt.show()

###### Many regression models, linear regression in particular, tend to work better when the target is approximately normally distributed

###### The target variable (SalePrice) is right-skewed, so we transform it to make it closer to a normal distribution.
###### Let's transform the target variable by taking its natural logarithm

In [None]:
df['SalePrice']= np.log(df['SalePrice'])
sns.histplot(df['SalePrice'], kde=True)
plt.show()

In [None]:
stats.probplot(df['SalePrice'], plot=plt)
plt.show()

##### Now SalePrice is approximately normally distributed
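
The effect of the log transform can also be checked numerically through the sample skewness. The sketch below uses synthetic log-normal "prices" rather than the real SalePrice column, so the numbers only illustrate the idea. It also shows that `np.exp` undoes the transform, which is needed to turn log-scale predictions back into prices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed "prices" (log-normal); not the real SalePrice column
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5000))

log_prices = np.log(prices)

skew_before = prices.skew()
skew_after = log_prices.skew()
print(f'skew before: {skew_before:.2f}, after: {skew_after:.2f}')

# Values on the log scale are mapped back to prices with np.exp
restored = np.exp(log_prices)
```

A skewness close to 0 after the transform is consistent with a roughly symmetric distribution, although it does not prove normality on its own.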

### Plotting missing values in the dataset

In [None]:
def draw_barplot(data):
    bar_plot=sns.barplot(x=data.index,y=data)
    plt.xticks(rotation=45)
    bar_plot.bar_label(bar_plot.containers[0],fmt='%.2f')
    plt.show()

In [None]:
nan_percent=missing_percent(df)

In [None]:
draw_barplot(nan_percent)

# Exploratory Data Analysis (EDA)

In [None]:
df.groupby(['YrSold','MoSold']).Id.count().plot(kind='bar')
plt.title('Number of properties sold by year and month')
plt.show()

##### The highest number of property sales is in **June** and **July**


In [None]:
df.groupby('Neighborhood').Id.count().sort_values().plot(kind='bar')
plt.title('Number of properties per neighborhood')
plt.show()

##### Most of the properties are located in the **NAmes** and **CollgCr** neighborhoods

# Data Preparation

In [None]:
# Finding numerical features
numeric_data=df.select_dtypes(include=[np.number]).drop(['SalePrice','Id'],axis=1)
# Finding categorical features
categorical_data=df.select_dtypes(include='object')

In [None]:
pd.DataFrame([[numeric_data.shape[1] , categorical_data.shape[1]]],columns=['Numerical_Features','Categorical_Features'],index=['Count'])

In [None]:
temp_numeric_data=pd.melt(df,value_vars=sorted(numeric_data))
temp_numeric_data

In [None]:
facet_grid=sns.FacetGrid(temp_numeric_data,col='variable',col_wrap=3,sharex=False,sharey=False)
facet_grid.map(sns.histplot,'value',kde=True)
plt.show()

### Transforming some numerical variables that are really categorical
##### Some features with an int data type take only a small set of discrete values.

##### When a numeric feature is really a code or a label, it is better to convert it to a categorical variable so that it is analysed and encoded as such

In [None]:
#will convert those columns into dummy variables later.
int_to_object = ['MSSubClass','YrSold','MoSold','OverallCond']

for feature in int_to_object:
    df[feature] = df[feature].astype(object)
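
To see why the cast matters for the later encoding step, the sketch below runs `pd.get_dummies` on a toy frame before and after the conversion. The toy values are illustrative; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy frame with illustrative values
toy = pd.DataFrame({'MSSubClass': [20, 60, 20, 120],
                    'LotArea': [8450, 9600, 11250, 9550]})

# As an integer column, MSSubClass is passed through unchanged by get_dummies
as_numbers = pd.get_dummies(toy)

# After casting to object, each code becomes its own indicator column
toy['MSSubClass'] = toy['MSSubClass'].astype(object)
as_categories = pd.get_dummies(toy)

print(as_numbers.columns.tolist())
print(as_categories.columns.tolist())
```

Without the cast, a model would treat the code 120 as six times the code 20, which has no meaning for a label.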

###### Let's plot the categorical features

In [None]:
# Finding numerical features
numeric_data=df.select_dtypes(include=[np.number]).drop(['SalePrice','Id'],axis=1)
# Finding categorical features
categorical_data=df.select_dtypes(include='object')

In [None]:
pd.DataFrame([[numeric_data.shape[1] , categorical_data.shape[1]]],columns=['Numerical_Features','Categorical_Features'],index=['Count'])

In [None]:
temp_categorical_data=pd.melt(df,value_vars=sorted(categorical_data))

In [None]:
facet_grid=sns.FacetGrid(temp_categorical_data,col='variable',col_wrap=3,sharex=False,sharey=False,height=3, aspect= 2)
facet_grid.map(sns.countplot,'value')
# Rotate the tick labels on every subplot so that long category names stay readable
for ax in facet_grid.axes.flat:
    plt.setp(ax.get_xticklabels(),rotation=60,ha='right')
facet_grid.fig.tight_layout()
plt.show()

##### Delete the Id column from df because it is not needed for the correlation plot

In [None]:
del df['Id']

### Outliers

In [None]:
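def draw_corr_plot(df):
    # Correlate numeric columns only and rank them by correlation with SalePrice
    data=df.select_dtypes(include=[np.number]).corr().drop('SalePrice').sort_values('SalePrice',ascending=False)[['SalePrice']]
    # Pass x and y as keywords; recent seaborn releases reject positional data arguments
    chart=sns.barplot(x=data.index,y=data['SalePrice'], palette='Blues_d')
    plt.xticks(rotation=90)
    plt.title('Correlation with Sale Price')
    chart.bar_label(chart.containers[0],fmt='%.3f')
    plt.show()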

In [None]:
draw_corr_plot(df)

In [None]:
sns.scatterplot(data=df,x='OverallQual', y='SalePrice')
plt.axhline(y=12.3,color='r')
plt.axvline(x=9,color='g')

In [None]:
df[(df['OverallQual']>9) & (df['SalePrice']<12.3)][['SalePrice','OverallQual']]

In [None]:
sns.scatterplot(data=df,x='GrLivArea', y='SalePrice')
plt.axhline(y=12.5,color='r')
plt.axvline(x=4500,color='g')

In [None]:
df[(df['GrLivArea']>4500) & (df['SalePrice']<12.5)][['SalePrice','GrLivArea']]

In [None]:
sns.scatterplot(data=df,x='TotalBsmtSF', y='SalePrice')
plt.axhline(y=12.2,color='r')
plt.axvline(x=6000,color='g')

In [None]:
df[(df['TotalBsmtSF']>6000) & (df['SalePrice']<12.2)][['SalePrice','TotalBsmtSF']]

In [None]:
sns.scatterplot(data=df,x='GarageCars', y='SalePrice')
plt.axhline(y=10.7,color='r')
plt.axvline(x=1,color='g')

In [None]:
df[(df['GarageCars']==1) & (df['SalePrice']<10.7)][['SalePrice','GarageCars']]

In [None]:
sns.scatterplot(data=df,x='GarageArea', y='SalePrice')
plt.axhline(y=12.5,color='r')
plt.axvline(x=1300,color='g')

In [None]:
df[(df['GarageArea']>1300) & (df['SalePrice']<12.5)][['SalePrice','GarageArea']]

In [None]:
sns.scatterplot(data=df,x='1stFlrSF', y='SalePrice')
plt.axhline(y=12.1,color='r')
plt.axvline(x=4000,color='g')

In [None]:
df[(df['1stFlrSF']>4000) & (df['SalePrice']<12.1)][['SalePrice','1stFlrSF']]

#### There are outliers in some features

###### Deleting the six rows that contain outliers ( 30 , 523 , 581 , 916 , 1190 , 1298 )

In [None]:
# Remove 30 , 916 , 523 , 581 , 1190 , 1298 
df=df.drop([30 , 916 , 523 , 581 , 1190 , 1298])
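
Instead of copying index values by hand, the labels to drop can be collected from the same threshold filters used in the scatter plots. The sketch below shows the idea on a toy frame with made-up values; `mask`, `outlier_labels` and `cleaned` are names introduced for the example.

```python
import pandas as pd

# Toy frame with a non-default index to make the label lookup visible
toy = pd.DataFrame(
    {'GrLivArea': [1500, 1800, 5000, 2100],
     'SalePrice': [12.0, 12.2, 12.1, 12.4]},
    index=[10, 11, 12, 13],
)

# Rows that sit in the "large area, low price" corner of the scatter plot
mask = (toy['GrLivArea'] > 4500) & (toy['SalePrice'] < 12.5)
outlier_labels = toy.index[mask].tolist()

# drop() matches index labels, so the collected list can be passed straight to it
cleaned = toy.drop(index=outlier_labels)
print(outlier_labels, len(cleaned))
```

Because `drop` works on index labels rather than positions, the hard-coded list stays valid only as long as the index has not been reset.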

In [None]:
draw_corr_plot(df)

##### After removing the outliers, the correlations with SalePrice have improved

### Visualising missing values

In [None]:
print(f'There are {df.isnull().sum().sum()} missing values')

### Plot the percentage of missing data for each feature with missing values

In [None]:
missing_percent(df)

In [None]:
def draw_missing_barplot(df,ylim_min=None,ylim_max=None):
    nan_percent= missing_percent(df).sort_values(ascending=True)
    bar_plot=sns.barplot(x=nan_percent.index,y=nan_percent)
    plt.xticks(rotation=60)
    bar_plot.bar_label(bar_plot.containers[0],fmt='%.3f')
    # Apply y-axis limits only when both bounds are given
    if ylim_min is not None and ylim_max is not None:
        plt.ylim(ylim_min,ylim_max)
    plt.show()

In [None]:
draw_missing_barplot(df,0,1)

In [None]:
# Percentage of the dataset that a single row represents
100/len(df)

#### **Electrical**

In [None]:
df[df['Electrical'].isnull()]

###### **Electrical** : Fill the missing value with the most frequent category, which is "SBrkr"

In [None]:
df['Electrical'] = df['Electrical'].fillna(df['Electrical'].mode()[0])

#### **MasVnrArea , MasVnrType**

##### NA most likely means no masonry veneer for these houses, so we fill the area with 0 and the type with "None"

In [None]:
df["MasVnrType"] = df["MasVnrType"].fillna("None")
df["MasVnrArea"] = df["MasVnrArea"].fillna(0)

In [None]:
draw_missing_barplot(df,0,20)

#### **BsmtQual , BsmtCond , BsmtExposure , BsmtFinType1 , BsmtFinType2**

###### NaN values in these categorical basement features mean there is no basement

In [None]:
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    # Replacing the missing values with None 
    df[col] = df[col].fillna('None')

In [None]:
draw_missing_barplot(df,0,20)

#### **GarageYrBlt**

##### Replacing missing data with 0 

In [None]:
df['GarageYrBlt']= df['GarageYrBlt'].fillna(0)

#### **GarageType, GarageFinish, GarageQual , GarageCond**

##### Replacing missing data with None

In [None]:
for col in ['GarageType', 'GarageFinish', 'GarageQual', 'GarageCond']:
    df[col] = df[col].fillna("None")

In [None]:
draw_missing_barplot(df)

#### **LotFrontage**

##### LotFrontage is the length of street connected to the property, and it is most likely similar to that of other houses in the same neighborhood.
##### We can therefore fill in missing values with the mean LotFrontage of the neighborhood.

In [None]:
# transform keeps the original row index, so the result aligns with df on assignment
df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(lambda x: x.fillna(x.mean()))
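
Group-wise mean imputation can be checked on a toy frame. The sketch below uses `transform`, which returns a Series aligned with the original rows; the neighborhood labels and frontage values are made up for the example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B'],
    'LotFrontage':  [60.0, np.nan, 80.0, 40.0, np.nan],
})

# Each missing value is replaced by the mean of its own neighborhood
toy['LotFrontage'] = (toy.groupby('Neighborhood')['LotFrontage']
                         .transform(lambda s: s.fillna(s.mean())))
print(toy)
```

A neighborhood whose values are all missing would keep its NaN entries, because its group mean is itself NaN; a fallback such as the overall mean would be needed in that case.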

#### **FireplaceQu**

##### The data description says NA means "no fireplace"

In [None]:
df["FireplaceQu"] = df["FireplaceQu"].fillna("None")

#### **Fence**

##### The data description says NA means "no fence"

In [None]:
df["Fence"] = df["Fence"].fillna("None")

#### **Alley**

##### The data description says NA means "no alley access"

In [None]:
df["Alley"] = df["Alley"].fillna("None")

#### **MiscFeature**

##### The data description says NA means "no misc feature"

In [None]:
df["MiscFeature"] = df["MiscFeature"].fillna("None")

#### **PoolQC**

##### The data description says NA means "No Pool"

In [None]:
df["PoolQC"] = df["PoolQC"].fillna("None")
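
The per-column fills above all follow the same pattern, so they could also be written as a single `fillna` call with a dictionary. The sketch below shows this on a toy frame with made-up values; `none_cols` is a name introduced for the example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'PoolQC':  [np.nan, 'Gd', np.nan],
    'Fence':   ['MnPrv', np.nan, np.nan],
    'LotArea': [8450, 9600, 11250],
})

# Columns where, per the data description, NA means "feature not present"
none_cols = ['PoolQC', 'Fence']

# A dict passed to fillna maps each column name to its own fill value
toy = toy.fillna({col: 'None' for col in none_cols})
print(toy)
```

Keeping the column list in one place makes it easier to apply the identical fill to a separate test set later.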

#### No missing values remain.

In [None]:
df.isnull().sum().sum()

In [None]:
df.to_csv('clean_df.csv', encoding='utf-8',index=False)

#### Creating dummy variables for the categorical features

In [None]:
df_num=df.select_dtypes(exclude='object')
df_obj=df.select_dtypes(include='object')

In [None]:
# use one-hot encoding
df_obj= pd.get_dummies(df_obj, drop_first=True)

In [None]:
Final_df= pd.concat([df_num, df_obj], axis=1)
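
The split, encode and concatenate steps can be traced on a toy frame. The sketch below uses one categorical and one numeric column with illustrative values, and shows which category `drop_first=True` removes.

```python
import pandas as pd

toy = pd.DataFrame({
    'Street':  pd.Series(['Pave', 'Grvl', 'Pave'], dtype=object),
    'LotArea': [8450, 9600, 11250],
})

toy_num = toy.select_dtypes(exclude='object')
toy_obj = pd.get_dummies(toy.select_dtypes(include='object'), drop_first=True)

# 'Grvl' sorts first, so it is dropped and becomes the implicit baseline
encoded = pd.concat([toy_num, toy_obj], axis=1)
print(encoded.columns.tolist())
```

Dropping one level per feature avoids perfectly collinear indicator columns, which matters for an unregularised linear regression.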

### Linear Regression

##### Determine the Features & Target Variable

In [None]:
X = Final_df.drop('SalePrice',axis=1)
y = Final_df['SalePrice']

#### Split the Dataset to Train & Test

In [None]:
from sklearn.model_selection import train_test_split
X_train_linear, X_test_linear, y_train_linear, y_test_linear = train_test_split(X, y, test_size=0.3, random_state=101)

#### Train the Model

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
linear_model=LinearRegression()

In [None]:
linear_model.fit(X_train_linear, y_train_linear)

#### Predicting Test Data

In [None]:
y_pred_linear=linear_model.predict(X_test_linear)

#### Evaluating the Model

In [None]:
from sklearn import metrics

In [None]:
MAE_linear=metrics.mean_absolute_error(y_test_linear,y_pred_linear)
MSE_linear=metrics.mean_squared_error(y_test_linear,y_pred_linear)
RMSE_linear=np.sqrt(MSE_linear)

In [None]:
metrics_linear=pd.DataFrame([MAE_linear,MSE_linear,RMSE_linear], index=['MAE', 'MSE', 'RMSE'], columns=['Metrics Of Linear'])

In [None]:
metrics_linear
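
Because SalePrice was log-transformed earlier, these metrics are measured in log units rather than in currency. The sketch below computes the three metrics by hand with NumPy on hypothetical values and converts the RMSE into an approximate multiplicative error on the price scale.

```python
import numpy as np

# Hypothetical true and predicted prices, moved to the log scale
y_true_log = np.log(np.array([200_000.0, 150_000.0, 320_000.0]))
y_pred_log = np.log(np.array([210_000.0, 140_000.0, 300_000.0]))

errors = y_true_log - y_pred_log
mae = np.mean(np.abs(errors))
mse = np.mean(errors ** 2)
rmse = np.sqrt(mse)

# A log-scale error e corresponds to a multiplicative factor exp(e) in price
typical_factor = np.exp(rmse)
print(f'MAE={mae:.4f} MSE={mse:.5f} RMSE={rmse:.4f} factor={typical_factor:.3f}')
```

A factor of about 1.06 would mean that predictions are typically off by roughly 6% of the price, which is easier to interpret than the raw log-scale RMSE.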

#### Residuals

In [None]:
residuals_linear=y_test_linear-y_pred_linear

In [None]:
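sns.scatterplot(x=y_test_linear,y=y_pred_linear)
# Pass x and y as keywords; recent seaborn releases reject positional data arguments
sns.regplot(x=y_test_linear,y=y_pred_linear,ci=None)
# Label the axes after plotting, since regplot may set its own axis labels
plt.ylabel('Y-Pred_Linear')
plt.xlabel('Y-Test_Linear')
plt.show()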

In [None]:
sns.histplot(residuals_linear, bins=20, kde=True)
plt.show()

#### The residuals are approximately normally distributed

In [None]:
sns.scatterplot(x=y_test_linear, y=residuals_linear)
plt.ylabel('Residuals_Linear')
plt.xlabel('Y-Test_Linear')
plt.axhline(y=0, color='r', ls='--')
plt.show()

### Polynomial Regression

#### Preprocessing

In [None]:
from sklearn.preprocessing import  PolynomialFeatures

In [None]:
polynomial_converter=PolynomialFeatures(degree=2,include_bias=False)

In [None]:
polynomial_features=polynomial_converter.fit_transform(X)
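
A degree-2 expansion grows quickly with the number of input columns. With all powers and interactions included, the number of generated terms equals the number of monomials of total degree at most `degree`, which is C(n + degree, degree), minus one when the bias column is left out. The helper `n_poly_features` below is a hypothetical name introduced to illustrate that count; it is not part of scikit-learn.

```python
from math import comb

def n_poly_features(n_inputs: int, degree: int, include_bias: bool = False) -> int:
    """Number of monomials of total degree <= `degree` in `n_inputs` variables."""
    total = comb(n_inputs + degree, degree)
    # The constant term (bias) is one of those monomials
    return total if include_bias else total - 1

for n in (3, 50, 250):
    print(n, n_poly_features(n, degree=2))
```

With a few hundred one-hot encoded columns the expansion produces tens of thousands of features for fewer than 1,500 rows, so the fit can be slow and prone to overfitting.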

#### Split the Dataset to Train & Test

In [None]:
from sklearn.model_selection import train_test_split
X_train_poly, X_test_poly, y_train_poly, y_test_poly = train_test_split(polynomial_features, y, test_size=0.3, random_state=101)

#### Train the Model

In [None]:
poly_model=LinearRegression()

In [None]:
poly_model.fit(X_train_poly,y_train_poly)

#### Predicting Test Data

In [None]:
y_pred_poly=poly_model.predict(X_test_poly)

#### Evaluating the Model

In [None]:
MAE_poly=metrics.mean_absolute_error(y_test_poly,y_pred_poly)
MSE_poly=metrics.mean_squared_error(y_test_poly,y_pred_poly)
RMSE_poly=np.sqrt(MSE_poly)

In [None]:
metrics_poly=pd.DataFrame([MAE_poly,MSE_poly,RMSE_poly], index=['MAE', 'MSE', 'RMSE'], columns=['Metrics Of Polynomial'])

In [None]:
metrics_df= pd.concat([metrics_linear, metrics_poly], axis=1).T

In [None]:
metrics_df

#### Linear regression performs better than polynomial regression on this dataset
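
One common reason for a more flexible model to score worse on test data is overfitting: adding terms never increases the training error of a least-squares fit, but the test error can go up. The sketch below illustrates this with `np.polyfit` on synthetic one-dimensional data; it is a toy setup under assumed noise levels, not a rerun of the models above.

```python
import numpy as np

rng = np.random.default_rng(101)
x = rng.uniform(-1, 1, size=60)
y = 2.0 * x + rng.normal(scale=0.3, size=60)   # truly linear signal plus noise

x_train, y_train = x[:40], y[:40]
x_test, y_test = x[40:], y[40:]

def rmse(y_true, y_hat):
    return float(np.sqrt(np.mean((y_true - y_hat) ** 2)))

# Fit a straight line and a degree-8 polynomial on the same training data
results = {}
for degree in (1, 8):
    coefs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = (rmse(y_train, np.polyval(coefs, x_train)),
                       rmse(y_test, np.polyval(coefs, x_test)))
    print(degree, results[degree])
```

Comparing the train and test RMSE of each model in this way is a quick check for overfitting; cross-validation would give a more stable comparison than a single split.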

