# **House Price Prediction**
   
Before we discuss the code, note that this is one of my first Kaggle projects, and this notebook is inspired by several others.

I would like to thank the authors of the notebooks listed below, which helped me understand the basic concepts and have been very useful in my own work.

1. https://www.kaggle.com/s/10533521 by Naresh bhatt
2. https://www.kaggle.com/s/314923 by Serigne
3. https://www.kaggle.com/s/96093 by Alexandru Papiu
4. https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python by Pedro Marcelino


Let's load our dataset

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')

In [None]:
train.head()

In [None]:
train.shape

In [None]:
train.info()

As you can see, the training data has 80 columns besides the target, SalePrice (one of them is the Id column).
We have both continuous and categorical data.
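A quick way to see that split is to group columns by dtype. The sketch below uses a tiny made-up frame (`toy` and its values are assumptions for illustration, not the real data), so treat it as a pattern rather than the exact cell used in this notebook.

```python
import pandas as pd

# Hypothetical miniature of the training frame; the real one has far more columns.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "OverallQual": [7, 6, 7],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr"],
    "SalePrice": [208500, 181500, 223500],
})

# Numeric columns (int/float) versus everything else (strings here).
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
object_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("non-numeric:", object_cols)
```

Keep in mind that a numeric dtype does not guarantee a continuous feature; some integer columns are really categories.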

### **Visualization**

Before preprocessing and feature engineering, it helps to build some basic intuition about our features, such as how they relate to SalePrice.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
fig,ax = plt.subplots(1,2,figsize=(15,5))
ax[0].hist(x = train.SalePrice)
ax2 = sns.distplot(x = train.SalePrice,ax=ax[1])

With 80 features, it's hard to visualize every one, so we start with a few educated guesses.

House price most likely depends on the size and quality of the house. GrLivArea (above-ground living area) and OverallQual (overall quality of the house) seem to be the best options to try first.

In [None]:
sns.boxplot(x= train.OverallQual , y = train.SalePrice)

OverallQual affects SalePrice considerably.

In [None]:
sns.scatterplot(x= train.GrLivArea , y = train.SalePrice)

So does GrLivArea.

Let's try some other features. It's always good to know how your features relate to the target, since that tells you which features may need to be treated separately from the others for better results.

In [None]:
sns.boxplot(x= train.TotRmsAbvGrd , y = train.SalePrice)

In [None]:
sns.scatterplot(x= train.TotalBsmtSF , y = train.SalePrice)

## Pre Processing

Because our data has so many features, it is good to store and track categorical and numerical features separately while preprocessing (believe me, it costs a lot of time if you mix them up).

Categorical features are:
1) Every feature with object datatype
2) Some integer/float features that are actually categorical (e.g. OverallQual)

The following features are actually categorical.



In [None]:
cat = ['OverallQual','TotRmsAbvGrd','GarageCars','OverallCond','MSSubClass', 'BsmtFullBath','BsmtHalfBath','FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr', 'Fireplaces']
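One way to shortlist numeric columns that behave like categories is to count their distinct values. This is only a sketch on made-up data: `toy` and the `max_levels` threshold are assumptions, and the list above was not necessarily built this way.

```python
import pandas as pd

# Hypothetical sample: one continuous column and two integer-coded ones.
toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786, 1717, 2198, 1362],
    "OverallQual": [7, 6, 7, 7, 8, 5],
    "Fireplaces": [0, 1, 1, 1, 1, 0],
})

max_levels = 4  # assumed cut-off; tune it to the size of the dataset

# Columns with only a few distinct values are candidates for categorical treatment.
n_unique = toy.nunique()
candidates = n_unique[n_unique <= max_levels].index.tolist()

print(n_unique)
print(candidates)
```

The shortlist still needs a manual check against the data description, since a low count of distinct values is only a hint.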

In [None]:
y = train['SalePrice']

Let's concatenate our train and test data and preprocess them together, which avoids doing everything twice and saves a lot of time (80 features!).

In [None]:
test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
test.head()

In [None]:
print('Total no. of train samples: ',len(train))
print('Total no. of test samples: ',len(test))

In [None]:
total_data = pd.concat([train.drop('SalePrice',axis=1),test],axis=0,ignore_index = True)
print(total_data.shape)

Of course, the test data doesn't contain the SalePrice column :)
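The concatenate-then-split pattern can be sketched on two tiny made-up frames (`toy_train` and `toy_test` are assumptions for illustration). Slicing from the end by the number of test rows keeps working even if some training rows are dropped along the way.

```python
import pandas as pd

# Hypothetical stand-ins for the real train and test frames.
toy_train = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotArea": [8450, 9600, 11250],
    "SalePrice": [208500, 181500, 223500],
})
toy_test = pd.DataFrame({"Id": [4, 5], "LotArea": [11622, 14267]})

# Stack the feature columns of both sets; the target stays out of the combined frame.
combined = pd.concat(
    [toy_train.drop(columns="SalePrice"), toy_test],
    axis=0,
    ignore_index=True,
)

# ... shared preprocessing on `combined` would go here ...

# Split back positionally: the last len(toy_test) rows are the test set.
n_test = len(toy_test)
train_part = combined.iloc[:-n_test]
test_part = combined.iloc[-n_test:]

print(train_part.shape, test_part.shape)
```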

In [None]:
total_data.drop('Id',axis=1,inplace=True)

In [None]:
object_type = total_data.dtypes[total_data.dtypes == 'object'].index
object_type

One of the best ways to get to know your data when there are lots of features is seaborn's heatmap. The notebook by Pedro Marcelino helped me a lot here.

Now it's time to view the correlation between the numerical features and SalePrice.

In [None]:
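# Restrict to numeric columns; recent pandas versions raise an error if object columns are passed to corr()
correlation_matrix = train.select_dtypes(include='number').corr()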
plt.figure(figsize=(12,12))
sns.heatmap(correlation_matrix,vmax=0.8,square=True)

Look at the feature pairs (GarageCars, GarageArea), (TotalBsmtSF, 1stFlrSF), (GarageYrBlt, YearBuilt) and (TotRmsAbvGrd, GrLivArea). The two features in each pair are highly correlated with each other, so it's better to remove one of them or to combine the two into a new feature.
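Reading such pairs off a heatmap by eye gets tedious, so here is a minimal sketch of listing them programmatically. The data is synthetic (`toy` is built so that two columns track each other) and the 0.8 threshold is an assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
area = rng.normal(500, 100, n)

# Synthetic frame: GarageCars is constructed to follow GarageArea closely.
toy = pd.DataFrame({
    "GarageArea": area,
    "GarageCars": area / 250 + rng.normal(0, 0.1, n),
    "LotFrontage": rng.normal(70, 20, n),  # unrelated to the other two
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is excluded.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

threshold = 0.8  # assumed cut-off for "highly correlated"
high_pairs = pairs[pairs > threshold].sort_values(ascending=False)
print(high_pairs)
```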

Let's look at the features which are highly correlated with SalePrice

In [None]:
k = 10
cols = correlation_matrix.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(train[cols].values.T)
sns.set(font_scale=1.25)
plt.figure(figsize=(10,10))
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

We can see that GarageCars and GarageArea are highly related to SalePrice and also to each other, so we will drop GarageArea. Similarly, we will drop 1stFlrSF and GarageYrBlt.

### Filling Missing Values

In [None]:
null = total_data.isnull().sum()
null_values = pd.DataFrame({'No. of null': null[null != 0].sort_values(ascending=False)})
null_values

We will drop 'PoolQC', 'MiscFeature', 'Alley' and 'Fence' because they contain a lot of null values.

In [None]:
total_data.drop(['PoolQC','MiscFeature','Alley','Fence'],axis=1,inplace=True)
object_type = object_type.drop(['PoolQC','MiscFeature','Alley','Fence'])
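Raw null counts are easier to judge as a fraction of the rows. The sketch below shows that on a small made-up frame; `toy` and the 0.5 threshold are assumptions, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one mostly-empty column.
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd", np.nan],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0, 84.0],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
})

# Mean of the boolean null mask gives the share of missing values per column.
missing_frac = toy.isnull().mean().sort_values(ascending=False)

threshold = 0.5  # assumed cut-off
to_drop = missing_frac[missing_frac > threshold].index.tolist()

print(missing_frac)
print(to_drop)
```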

In [None]:
total_data['FireplaceQu'].describe()

In [None]:
total_data["FireplaceQu"] = total_data["FireplaceQu"].fillna("None")

In [None]:
total_data['LotFrontage'].median()

In [None]:
x = total_data['LotFrontage'].median()
total_data['LotFrontage'] = total_data['LotFrontage'].fillna(x)

In [None]:
for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    total_data[col] = total_data[col].fillna('None')

In [None]:
total_data.drop('GarageYrBlt',axis=1,inplace=True)

In [None]:
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    total_data[col] = total_data[col].fillna('None')

In [None]:
total_data['MasVnrArea'] = total_data['MasVnrArea'].fillna(0)

In [None]:
total_data['MasVnrType'].value_counts()

In [None]:
total_data['MasVnrType'] = total_data['MasVnrType'].fillna('None')

In [None]:
total_data['Electrical'].value_counts()

In [None]:
total_data['Electrical'] = total_data['Electrical'].fillna('SBrkr')

In [None]:
total_data['Utilities'].value_counts()

In [None]:
total_data.drop('Utilities',axis=1,inplace=True)

In [None]:
object_type = object_type.drop('Utilities')

In [None]:
for col in ['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF','GarageArea']:
    total_data[col] = total_data[col].fillna(0)

In [None]:
total_data['MSZoning'] = total_data['MSZoning'].fillna(total_data['MSZoning'].mode()[0])

In [None]:
total_data['MSZoning'].value_counts()

In [None]:
total_data['KitchenQual'] = total_data['KitchenQual'].fillna(total_data['KitchenQual'].mode()[0])
total_data['Exterior1st'] = total_data['Exterior1st'].fillna(total_data['Exterior1st'].mode()[0])
total_data['Exterior2nd'] = total_data['Exterior2nd'].fillna(total_data['Exterior2nd'].mode()[0])

In [None]:
for col in ['GarageCars','BsmtFullBath','BsmtHalfBath'] :
    # Assign the result back; inplace fillna on a selected column may not update total_data
    total_data[col] = total_data[col].fillna(0)

In [None]:
total_data['SaleType'].value_counts()

In [None]:
total_data['SaleType'] = total_data['SaleType'].fillna(total_data['SaleType'].mode()[0])
total_data['Functional'] = total_data['Functional'].fillna('Typ')

In [None]:
total_data.isnull().sum().sum()

No missing values remain.

We will convert MSSubClass to object type and then use the label encoder, because its values are large numbers even though they are categorical codes.

In [None]:
total_data['MSSubClass'] = total_data['MSSubClass'].apply(str)

In [None]:
object_type = list(object_type) + ['MSSubClass']

In [None]:
import sklearn
from sklearn.preprocessing import LabelEncoder

In [None]:
for i in object_type:
    le = LabelEncoder()
    le.fit(total_data[i].unique())
    total_data[i] = le.transform(total_data[i])
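To make the encoding step concrete, here is a small sketch with a few made-up MSSubClass-style codes stored as strings. Note that `LabelEncoder` sorts its classes, and for strings that order is lexicographic rather than numeric.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string codes, as produced by converting integers with str.
values = ["20", "60", "120", "20"]

le = LabelEncoder()
codes = le.fit_transform(values)

print(list(le.classes_))  # sorted as strings: "120" comes before "20"
print(codes.tolist())

# The mapping can be reversed when the original labels are needed again.
restored = le.inverse_transform(codes).tolist()
```

The resulting integers are just identifiers, so their order should not be read as a ranking of the categories.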

In [None]:
categorical_features = object_type + cat

We will create some new features.

In [None]:
total_data['TotalSF'] = total_data['TotalBsmtSF']+ total_data['1stFlrSF'] + total_data['2ndFlrSF']

total_data['Exterior'] = total_data['Exterior1st'] + total_data['Exterior2nd']

In [None]:
total_data.drop(['GarageArea','1stFlrSF'],axis=1,inplace=True)
total_data.drop(['Exterior1st','Exterior2nd'],axis=1,inplace=True)

In [None]:
categorical_features.append('Exterior')
categorical_features.remove('Exterior1st')
categorical_features.remove('Exterior2nd')

## Outliers and skewness

We will remove outliers in the features most strongly related to SalePrice, because they have a larger effect on the model. We will also reduce skewness in the numerical features using log and Box-Cox transformations.

In [None]:
from scipy import stats
from scipy.stats import norm,skew

In [None]:
fig, ax = plt.subplots(1,2,figsize=(15,5))
# Plot from train so that x and y have the same length (these columns are unchanged in total_data so far)
sns.scatterplot(x = train.GrLivArea,y=train.SalePrice,ax=ax[0])
sns.scatterplot(x = train.OverallQual,y=train.SalePrice,ax=ax[1])

In [None]:
# Very large living area but low price; training rows share the same index in train and total_data
outliers = train[(train['GrLivArea']>4000)&(train['SalePrice']<300000)].index
total_data = total_data.drop(outliers)
y = y.drop(outliers)
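The important part of this step is that features and target lose exactly the same rows. A minimal sketch of the same rule on made-up numbers (`toy_X` and `toy_y` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical features and target sharing one index.
toy_X = pd.DataFrame({"GrLivArea": [1710, 4676, 2198, 5642, 4316]})
toy_y = pd.Series([208500, 184750, 250000, 160000, 755000], name="SalePrice")

# Same rule as above: very large living area combined with a low price.
is_outlier = (toy_X["GrLivArea"] > 4000) & (toy_y < 300000)
outlier_idx = toy_X.index[is_outlier]

# Drop the same index labels from both objects so they stay aligned.
toy_X_clean = toy_X.drop(outlier_idx)
toy_y_clean = toy_y.drop(outlier_idx)

print(list(outlier_idx))
print(toy_X_clean.index.equals(toy_y_clean.index))
```

The large but expensive house (the last row) is kept, because it follows the overall trend.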

In [None]:
# Select the remaining training rows so that x and y line up
sns.scatterplot(x = total_data.loc[y.index,'GrLivArea'],y=y)

In [None]:
sns.scatterplot(x = total_data.loc[y.index,'TotalSF'],y=y)

No obvious outliers in TotalSF.

In [None]:
sns.distplot(y,fit=norm)
fig =plt.figure()
r = stats.probplot(y,plot = plt)

SalePrice is positively skewed; we can use a log transformation to reduce it.

In [None]:
y = np.log1p(y)  # log(1 + x), so that np.expm1 recovers the prices later
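Predictions are mapped back to prices with `np.expm1` further down, so it is worth checking which forward transform it inverts. A short sketch on made-up prices (the values in `prices` are assumptions):

```python
import numpy as np

prices = np.array([34900.0, 163000.0, 755000.0])  # hypothetical sale prices

log_prices = np.log1p(prices)     # log(1 + x)
recovered = np.expm1(log_prices)  # exp(x) - 1, the inverse of log1p

# Pairing np.log with np.expm1 instead leaves every value short by 1.
mismatched = np.expm1(np.log(prices))

print(recovered)
print(prices - mismatched)
```

At house-price scale the difference of 1 is negligible for the score, but keeping the pair consistent avoids a silent bias.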

In [None]:
sns.distplot(y,fit=norm)
fig =plt.figure()
r = stats.probplot(y,plot = plt)

In [None]:
sns.distplot(total_data.GrLivArea,fit=norm)
fig =plt.figure()
r = stats.probplot(total_data.GrLivArea,plot = plt)

In [None]:
total_data['GrLivArea'] = np.log(total_data['GrLivArea'])

In [None]:
sns.distplot(total_data.GrLivArea,fit=norm)
fig =plt.figure()
r = stats.probplot(total_data.GrLivArea,plot = plt)

In [None]:
sns.distplot(total_data.TotalBsmtSF,fit=norm)
fig =plt.figure()
r = stats.probplot(total_data.TotalBsmtSF,plot = plt)

In [None]:
total_data.loc[total_data['TotalBsmtSF']>0,'TotalBsmtSF'] = np.log(total_data.loc[total_data['TotalBsmtSF']>0,'TotalBsmtSF'])

In [None]:
sns.distplot(total_data[total_data['TotalBsmtSF']>0]['TotalBsmtSF'], fit=norm);
fig = plt.figure()
r = stats.probplot(total_data[total_data['TotalBsmtSF']>0]['TotalBsmtSF'], plot=plt)

In [None]:
numerical_features = total_data.drop(categorical_features,axis=1).columns

In [None]:
# Use a new name for the result so the imported skew() function is not overwritten
skew_values = total_data[numerical_features].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
skewness = pd.DataFrame({'skewness': skew_values})
skewness

Let's apply a Box-Cox transformation to the features with skewness greater than 0.5.

In [None]:
cols = skewness[skewness['skewness'] > 0.5].index

In [None]:
from scipy.special import boxcox1p

In [None]:
for col in cols :
    total_data[col] = boxcox1p(total_data[col],0.15)
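For a non-zero parameter, `boxcox1p(x, lmbda)` computes ((1 + x)^lmbda - 1) / lmbda. The sketch below checks that against a manual computation and shows the effect on skewness; the sample `x` is synthetic and only the value 0.15 comes from the cell above.

```python
import numpy as np
from scipy.special import boxcox1p
from scipy.stats import skew

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # synthetic right-skewed feature
lam = 0.15

# Manual Box-Cox of (1 + x) for a non-zero lambda.
manual = ((1.0 + x) ** lam - 1.0) / lam
transformed = boxcox1p(x, lam)

print("skew before:", skew(x))
print("skew after: ", skew(transformed))
```

Because the transform works on 1 + x, zeros in the data (for example, houses without a basement) are handled without special cases.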

In [None]:
x_train = total_data[:-len(test)]
x_test = total_data[-len(test):]

In [None]:
print(x_train.shape)
print(x_test.shape)

Let's try our data on basic regularized regression models (Lasso, Ridge, ElasticNet).
We will use cross_validate instead of a single train_test_split.

In [None]:
from sklearn.linear_model import LinearRegression, Lasso , ElasticNet ,Ridge
from sklearn.preprocessing import StandardScaler,RobustScaler
from sklearn.model_selection import cross_val_score,cross_validate,GridSearchCV,RandomizedSearchCV
from sklearn.metrics import mean_squared_error,make_scorer
from sklearn.pipeline import make_pipeline

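We will scale our data with RobustScaler() before feeding it to Lasso(), because the model is sensitive to outliers and RobustScaler() scales using the median and interquartile range, which outliers barely affect.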

In [None]:
lasso = make_pipeline(RobustScaler(),Lasso())
scores = cross_validate(lasso,x_train,y,cv=5,scoring='neg_mean_squared_error',return_train_score=True)

In [None]:
np.sqrt(-scores['train_score'].mean())

In [None]:
np.sqrt(-scores['test_score'].mean())
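The two cells above take the square root of the mean fold MSE. A closely related summary is the mean of the per-fold RMSE values, which can never be larger. A sketch on synthetic data (`X`, `y_toy` and the Ridge model are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y_toy = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=200)

cv = cross_validate(Ridge(alpha=1.0), X, y_toy, cv=5,
                    scoring="neg_mean_squared_error")

# The scorer returns negated MSE, so flip the sign to get one MSE per fold.
fold_mse = -cv["test_score"]

rmse_of_mean_mse = np.sqrt(fold_mse.mean())  # what the cells above compute
mean_of_fold_rmse = np.sqrt(fold_mse).mean()  # average of the per-fold RMSEs

print(rmse_of_mean_mse, mean_of_fold_rmse)
```

Either summary works for comparing models, as long as the same one is used for every model.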

We used the default Lasso. We can improve the model by tuning it with GridSearchCV.

In [None]:
param_grid = {'alpha' : [0.00005,0.0005,0.007,0.1,0.00009], 'max_iter':[1000,2000,1500,2500]}
grid = GridSearchCV(Lasso(),param_grid=param_grid,scoring='neg_mean_squared_error')
grid.fit(x_train,y)

In [None]:
grid.best_params_

In [None]:
np.sqrt(-grid.best_score_)
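The search above tunes a bare Lasso, while the earlier baseline scaled the features first. If you want to tune alpha inside the scaling pipeline, a possible sketch looks like this. The data is synthetic and the alpha grid is an assumption; the `lasso__alpha` name follows from how `make_pipeline` names its steps after the lowercased class names.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
# Synthetic features on very different scales.
X = rng.normal(size=(150, 6)) * np.array([1, 10, 100, 1, 10, 100])
y_toy = 2.0 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.1, size=150)

pipe = make_pipeline(RobustScaler(), Lasso(max_iter=10000))

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {"lasso__alpha": [0.0005, 0.005, 0.05, 0.5]}

search = GridSearchCV(pipe, param_grid=param_grid,
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X, y_toy)

best_alpha = search.best_params_["lasso__alpha"]
best_rmse = np.sqrt(-search.best_score_)
print(best_alpha, best_rmse)
```

Searching over the pipeline also means the scaler is refit on each training fold, so no information from the validation fold leaks into the scaling.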

The score improved. Similarly, we will try Ridge and ElasticNet to see which gives the best score.

In [None]:
ridge = make_pipeline(RobustScaler(),Ridge(alpha=10))
scores2 = cross_validate(ridge,x_train,y,cv=5,scoring='neg_mean_squared_error',return_train_score=True)

In [None]:
np.sqrt(-scores2['train_score'].mean())

In [None]:
np.sqrt(-scores2['test_score'].mean())

In [None]:
elastic = make_pipeline(RobustScaler(),ElasticNet(alpha=0.0005))
scores3 = cross_validate(elastic,x_train,y,cv=5,scoring='neg_mean_squared_error',return_train_score=True)

In [None]:
np.sqrt(-scores3['train_score'].mean())

In [None]:
np.sqrt(-scores3['test_score'].mean())

We will also test SVR, because it can fit better if our data has non-linear relationships.

In [None]:
from sklearn.ensemble import RandomForestRegressor,  GradientBoostingRegressor
from sklearn.svm import SVR

In [None]:
svm = SVR()
scores4 = cross_validate(svm,x_train,y,cv=5,scoring='neg_mean_squared_error',return_train_score=True)

In [None]:
print(np.sqrt(-scores4['train_score'].mean()))
print(np.sqrt(-scores4['test_score'].mean()))      

In [None]:
grid = {'degree':[3,5,7,9,10,15,20],
        'gamma':['scale','auto'],
        'C':[0.0001,0.001,0.01,0.1,1,10,100,1000],
        'epsilon':[0.001,0.01,0.1,1,5,10,100]}
random = RandomizedSearchCV(SVR(),grid,scoring='neg_mean_squared_error',cv=5,n_iter=20)
random.fit(x_train,y)

In [None]:
print(random.best_params_)

In [None]:
np.sqrt(-random.best_score_)

After experimenting with the parameter C using GridSearchCV, I found 80000 to be the best value.

In [None]:
svm = SVR(C=80000,epsilon=0.01)
scores5 = cross_validate(svm,x_train,y,cv=5,scoring='neg_mean_squared_error',return_train_score=True)

In [None]:
np.sqrt(-scores5['train_score'].mean())

In [None]:
np.sqrt(-scores5['test_score'].mean())

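We can see the score is not much of an improvement over our linear models. Note that in SVR a larger C means weaker regularization, so C=80000 lets the model follow the training data closely rather than forcing a simpler fit. That the cross-validated score still ends up close to the linear models suggests there is little non-linear structure left to exploit in these features. Without the regularization parameter, the train score is 0.09 (which is better than our linear models).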

In [None]:
random_forest = RandomForestRegressor()
scores6 = cross_validate(random_forest,x_train,y,cv=5,scoring='neg_mean_squared_error',return_train_score=True)

In [None]:
print(np.sqrt(-scores6['train_score'].mean()))
print(np.sqrt(-scores6['test_score'].mean()))

I tried optimizing the parameters for the random forest, but it didn't improve much. This might be because a random forest cannot extrapolate (it cannot predict values outside the range of targets seen in the training data). I will illustrate this on our test dataset using the stronger XGBoost model.
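The extrapolation limit is easy to see on a toy problem that has nothing to do with the housing data. In this sketch (all names and numbers are made up), both models learn y = 2x on the range 0 to 10 and are then asked about x = 20.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Training data: x from 0 to 10, target exactly 2x (so targets range from 0 to 20).
X_fit = np.arange(0, 11, dtype=float).reshape(-1, 1)
y_fit = 2.0 * X_fit.ravel()

# A point far outside the training range.
X_new = np.array([[20.0]])

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_fit, y_fit)
linear = LinearRegression().fit(X_fit, y_fit)

forest_pred = forest.predict(X_new)[0]
linear_pred = linear.predict(X_new)[0]

print("forest:", forest_pred)  # cannot exceed the largest training target
print("linear:", linear_pred)  # follows the trend out to about 40
```

Tree predictions are averages of training targets in a leaf, so the forest is capped at the training maximum, while the linear model continues the line.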

In [None]:
import xgboost as xgb

In [None]:
model_xgb = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468, 
                             learning_rate=0.05, max_depth=3, 
                             min_child_weight=1.7817, n_estimators=2200,
                             reg_alpha=0.4640, reg_lambda=0.8571,
                             subsample=0.5213, 
                             random_state =7, n_jobs = -1)

In [None]:
scores7 = cross_validate(model_xgb,x_train,y,cv=5,scoring='neg_mean_squared_error',return_train_score=True)

In [None]:
print(np.sqrt(-scores7['train_score'].mean()))
print(np.sqrt(-scores7['test_score'].mean()))

XGBoost is clearly better than the random forest regressor, but let's see its performance on the test data.

In [None]:
ridge.fit(x_train,y)
ridge_pred = ridge.predict(x_test)

In [None]:
elastic.fit(x_train,y)
elastic_pred = elastic.predict(x_test)

In [None]:
model_xgb.fit(x_train,y)
xgb_pred = model_xgb.predict(x_test)

In [None]:
pd.Series(np.expm1(elastic_pred)).describe()

In [None]:
pd.Series(np.expm1(xgb_pred)).describe()

In [None]:
pd.Series(np.expm1(ridge_pred)).describe()

In [None]:
np.expm1(y).describe()

Look at the maximum SalePrice! Our linear models predict up to about 900000, but XGBoost only predicts up to about 600000, because the maximum in our training set is around 700000. It seems safe to say that our test set demands out-of-range values. My submissions support this too: my XGBoost submission scored worse than ElasticNet.

But when it comes to in-range predictions, XGBoost outperforms every other model. So, for a better score, we can take a weighted average of the XGBoost and ElasticNet predictions.

In [None]:
y_pred = 0.6*np.expm1(elastic_pred) + 0.4*np.expm1(xgb_pred)

In [None]:
submission = pd.DataFrame()
submission['Id'] = test['Id']
submission['SalePrice'] = y_pred
submission.to_csv('submission.csv',index=False)

Upvote my notebook if you like it. Please send me your feedback; I'm just a beginner and could have made mistakes :)