# House Price Prediction - Advanced Regression Techniques
Dataset Link - https://www.kaggle.com/c/house-prices-advanced-regression-techniques
## Model Building 

We will train and validate 4 different models and then compare them:
- Multiple Linear Regression
- Random Forest Regression
- Support Vector Regression
- XGBoost Regression

- The data comes from Kaggle, so we don't have the labels for the test set ('y_test') and can't directly evaluate the test error in this Jupyter notebook. Instead, we will check the cross-validation scores on the training set.
- We can check the test error by uploading the predictions to Kaggle, and then select the best performing model.

#### Importing required libraries and train and test data

In [1]:
# Importing required libraries
import pickle
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.svm import SVR
from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

warnings.filterwarnings('ignore')

In [2]:
# Importing train set
df_train = pd.read_csv('train_final.csv')
df_train.head()

Unnamed: 0,MSZoning,LotShape,BldgType,OverallQual,YearRemodAdd,ExterQual,BsmtQual,BsmtExposure,BsmtFinType1,HeatingQC,...,1stFlrSF,GrLivArea,BsmtFullBath,KitchenQual,Fireplaces,GarageType,GarageFinish,GarageCars,PavedDrive,SalePrice
0,0.5,1.0,0.0,0.666667,0.098361,0.333333,0.5,1.0,0.333333,0.0,...,0.356155,0.577712,0.333333,0.666667,0.0,0.0,0.666667,0.5,1.0,12.247694
1,0.5,1.0,0.0,0.555556,0.52459,1.0,0.5,0.25,0.0,0.0,...,0.503056,0.470245,0.0,1.0,0.333333,0.0,0.666667,0.5,1.0,12.109011
2,0.5,0.0,0.0,0.666667,0.114754,0.333333,0.5,0.75,0.333333,0.0,...,0.383441,0.593095,0.333333,0.666667,0.333333,0.0,0.666667,0.5,1.0,12.317167
3,0.5,0.0,0.0,0.666667,0.606557,1.0,1.0,1.0,0.0,0.5,...,0.399941,0.579157,0.333333,0.666667,0.333333,0.6,1.0,0.75,1.0,11.849398
4,0.5,0.0,0.0,0.777778,0.147541,0.333333,0.5,0.0,0.333333,0.0,...,0.466237,0.666523,0.333333,0.666667,0.333333,0.0,0.666667,0.75,1.0,12.429216


In [3]:
# Importing test set
df_test = pd.read_csv('test_final.csv')
df_test.head()

Unnamed: 0,MSZoning,LotShape,BldgType,OverallQual,YearRemodAdd,ExterQual,BsmtQual,BsmtExposure,BsmtFinType1,HeatingQC,CentralAir,1stFlrSF,GrLivArea,BsmtFullBath,KitchenQual,Fireplaces,GarageType,GarageFinish,GarageCars,PavedDrive
0,1.0,1.0,0.0,0.444444,0.819672,1.0,1.0,1.0,0.833333,1.0,1.0,0.373438,0.349081,0.0,1.333333,0.0,0.2,1.0,0.25,1.0
1,0.5,0.0,0.0,0.555556,0.868852,1.0,1.0,1.0,0.0,1.0,1.0,0.522632,0.488544,0.0,0.666667,0.0,0.2,1.0,0.25,1.0
2,0.5,0.0,0.0,0.444444,0.213115,1.0,0.5,1.0,0.333333,0.5,1.0,0.386718,0.560546,0.0,1.333333,0.333333,0.2,0.0,0.5,1.0
3,0.5,0.0,0.0,0.555556,0.213115,1.0,1.0,1.0,0.333333,0.0,1.0,0.385901,0.555075,0.0,0.666667,0.333333,0.2,0.0,0.5,1.0
4,0.5,0.0,1.0,0.777778,0.311475,0.666667,0.5,1.0,0.0,0.0,1.0,0.508416,0.475254,0.0,0.666667,0.0,0.2,0.666667,0.5,1.0


In [4]:
# Splitting into X_train, y_train, X_test
X_train = df_train.drop(['SalePrice'], axis=1)
y_train = df_train['SalePrice']

X_test = df_test

- We will use RandomizedSearchCV with 5-fold cross-validation to search for the best combination of hyperparameters. The grids defined below are smaller than n_iter, so every combination ends up being evaluated.
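A minimal sketch of how RandomizedSearchCV behaves when the grid is smaller than n_iter. The data here is synthetic and the names `X_demo`, `y_demo`, `demo_params` and `demo_search` are hypothetical stand-ins, not part of the notebook. With only list-valued parameters, scikit-learn samples without replacement, so the search falls back to trying each combination once.

```python
import warnings
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (hypothetical; not the house price features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=120)

# 2 x 3 = 6 combinations, far fewer than n_iter=60
demo_params = {'n_estimators': [10, 20], 'max_depth': [2, 4, 6]}

demo_search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=demo_params,
    n_iter=60, cv=3, random_state=0)

with warnings.catch_warnings():
    # scikit-learn warns that the parameter space is smaller than n_iter
    warnings.simplefilter('ignore')
    demo_search.fit(X_demo, y_demo)

n_candidates = len(demo_search.cv_results_['params'])
print('Candidates evaluated:', n_candidates)
print('Best params:', demo_search.best_params_)
```

For regressors the default score is R2, which is why the "Best Score" values reported by the searches in this notebook are comparable to the R2 printed for Linear Regression.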

#### Linear Regression

In [5]:
# Creating a Linear Regression model
linear_regressor = LinearRegression(n_jobs=-1)
linear_regressor.fit(X_train, y_train)
y_pred_train = linear_regressor.predict(X_train)

# Printing training regression metrics
print('Mean Absolute Error:', mean_absolute_error(y_train, y_pred_train))
print('Mean Squared Error:', mean_squared_error(y_train, y_pred_train))
print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y_train, y_pred_train)))
print('R2 Score:', r2_score(y_train, y_pred_train))

Mean Absolute Error: 0.10079186373964102
Mean Squared Error: 0.020208603655192153
Root Mean Squared Error: 0.1421569683666339
R2 Score: 0.8732625523280031


In [6]:
# Predicting on test data
y_pred_linear = linear_regressor.predict(X_test)

- Linear Regression has no hyperparameters to tune here, so we simply fit it once and get an R2 Score of ~0.87326 on the training set. Note that this is a training score, not a cross-validated one.
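To put Linear Regression on the same footing as the tuned models, its R2 can also be estimated with cross-validation. The sketch below uses `cross_val_score` on synthetic data. `X_demo` and `y_demo` are hypothetical placeholders, so the printed numbers say nothing about the house price data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (hypothetical; replace with X_train, y_train in the notebook)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([1.5, -0.5, 0.0, 2.0, 0.3]) + rng.normal(scale=0.5, size=200)

# 5-fold cross-validated R2, the same kind of score the searches report
cv_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring='r2')
print('Fold R2 scores:', cv_scores)
print('Mean CV R2:', cv_scores.mean())
```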

#### Random Forest Regression

In [7]:
# Defining a dictionary of hyperparameters with values to tune over
params = {'n_estimators' : [100, 200, 500],
          'max_depth' : [int(x) for x in np.linspace(10, 100, 5)],
         }

In [8]:
# Creating a RandomizedSearchCV object to search for best parameters
random_forest_regressor = RandomizedSearchCV(estimator=RandomForestRegressor(n_jobs=-1), param_distributions=params, cv=5, 
                                             n_iter=60, random_state=0, verbose=2, n_jobs=-1)
search = random_forest_regressor.fit(X_train, y_train)
print('Best Score:', search.best_score_)
print('Best Hyperparameters:', search.best_params_)

Fitting 5 folds for each of 15 candidates, totalling 75 fits
Best Score: 0.8650452734810703
Best Hyperparameters: {'n_estimators': 200, 'max_depth': 77}


In [9]:
# Printing the cross validation results
results = pd.DataFrame(random_forest_regressor.cv_results_)
results

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_n_estimators,param_max_depth,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.484296,0.108014,0.190813,0.112489,100,10,"{'n_estimators': 100, 'max_depth': 10}",0.865886,0.845211,0.870539,0.881625,0.847051,0.862062,0.013989,15
1,1.157156,0.118701,0.228134,0.142028,200,10,"{'n_estimators': 200, 'max_depth': 10}",0.873041,0.846388,0.872055,0.883837,0.847004,0.864465,0.015087,2
2,3.334736,0.412259,0.740915,0.199021,500,10,"{'n_estimators': 500, 'max_depth': 10}",0.870338,0.845164,0.870787,0.88334,0.846458,0.863217,0.014965,13
3,0.882296,0.382537,0.314095,0.226758,100,32,"{'n_estimators': 100, 'max_depth': 32}",0.872476,0.843318,0.872211,0.879857,0.849878,0.863548,0.014261,10
4,2.761146,0.102341,0.456638,0.502983,200,32,"{'n_estimators': 200, 'max_depth': 32}",0.869856,0.849052,0.872676,0.883546,0.84378,0.863782,0.014991,9
5,5.923206,0.680913,1.036784,0.348288,500,32,"{'n_estimators': 500, 'max_depth': 32}",0.870189,0.846546,0.873096,0.883097,0.846621,0.86391,0.014781,5
6,1.806049,0.569165,0.707598,0.148945,100,55,"{'n_estimators': 100, 'max_depth': 55}",0.868597,0.847446,0.868609,0.8837,0.845633,0.862797,0.014385,14
7,2.989858,0.592692,0.351863,0.164129,200,55,"{'n_estimators': 200, 'max_depth': 55}",0.868531,0.848024,0.871942,0.885747,0.845063,0.863861,0.015299,7
8,6.256883,0.425029,0.865495,0.493275,500,55,"{'n_estimators': 500, 'max_depth': 55}",0.871402,0.847753,0.873332,0.883673,0.846161,0.864464,0.0149,3
9,2.063718,0.809194,1.048688,0.688438,100,77,"{'n_estimators': 100, 'max_depth': 77}",0.868382,0.84534,0.872724,0.886304,0.84462,0.863474,0.016218,11


In [10]:
# Creating final model with tuned hyperparameter values
# Note: the search above reported n_estimators=200 as best; this cell uses n_estimators=500
random_forest_regressor = RandomForestRegressor(n_estimators=500, max_depth=77, n_jobs=-1)
random_forest_regressor.fit(X_train, y_train)

RandomForestRegressor(max_depth=77, n_estimators=500, n_jobs=-1)

In [11]:
# Predicting on test set
y_pred_random_forest = random_forest_regressor.predict(X_test)
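Retyping the best hyperparameters by hand makes it easy to end up with values that differ from what the search found. One alternative, sketched here on synthetic data with hypothetical names (`demo_search`, `best_model`), is to take the refitted model from `best_estimator_`. With the default `refit=True`, it is already trained on the full data passed to `fit`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (hypothetical)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=100)

demo_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={'n_estimators': [10, 20, 30], 'max_depth': [2, 4]},
    n_iter=4, cv=3, random_state=0)
demo_search.fit(X_demo, y_demo)

# best_estimator_ carries the winning hyperparameters and is refit on all of X_demo
best_model = demo_search.best_estimator_
best_model_params = best_model.get_params()
demo_pred = best_model.predict(X_demo[:5])
print('Best params:', demo_search.best_params_)
```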

#### Support Vector Regression

In [12]:
# Defining a dictionary of hyperparameters with values to tune over
params = {
          'C' : [1, 2, 3],
          'epsilon' : [0.1, 0.2, 0.3]
         }

In [13]:
# Creating a RandomizedSearchCV object to search for best parameters
support_vector_regressor = RandomizedSearchCV(estimator=SVR(), param_distributions=params, cv=5, 
                                              n_iter=100, random_state=0, verbose=2, n_jobs=-1)
search = support_vector_regressor.fit(X_train, y_train)
print('Best Score:', search.best_score_)
print('Best Hyperparameters:', search.best_params_)

Fitting 5 folds for each of 9 candidates, totalling 45 fits
Best Score: 0.8653337861450655
Best Hyperparameters: {'epsilon': 0.1, 'C': 1}


In [14]:
# Printing the cross validation results
results = pd.DataFrame(support_vector_regressor.cv_results_)
results

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_epsilon,param_C,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.184155,0.03094,0.073425,0.01151,0.1,1,"{'epsilon': 0.1, 'C': 1}",0.898353,0.84447,0.872176,0.862573,0.849097,0.865334,0.019202,1
1,0.112715,0.029405,0.025199,0.00856,0.2,1,"{'epsilon': 0.2, 'C': 1}",0.879163,0.829931,0.852913,0.855657,0.846079,0.852749,0.015946,4
2,0.036925,0.005226,0.018288,0.003134,0.3,1,"{'epsilon': 0.3, 'C': 1}",0.854202,0.79486,0.805466,0.811822,0.806119,0.814494,0.020596,7
3,0.233444,0.039044,0.068167,0.011971,0.1,2,"{'epsilon': 0.1, 'C': 2}",0.889802,0.839076,0.870831,0.860941,0.840974,0.860325,0.019003,2
4,0.108686,0.018528,0.043775,0.024555,0.2,2,"{'epsilon': 0.2, 'C': 2}",0.87515,0.822448,0.845996,0.855277,0.845597,0.848894,0.017022,5
5,0.038082,0.005127,0.01645,0.002088,0.3,2,"{'epsilon': 0.3, 'C': 2}",0.848531,0.786175,0.79661,0.7977,0.797329,0.805269,0.022052,8
6,0.256519,0.025414,0.062224,0.008081,0.1,3,"{'epsilon': 0.1, 'C': 3}",0.882041,0.835235,0.866211,0.8565,0.835772,0.855151,0.017997,3
7,0.11855,0.007655,0.030398,0.013863,0.2,3,"{'epsilon': 0.2, 'C': 3}",0.871131,0.815193,0.84182,0.851468,0.843047,0.844532,0.018033,6
8,0.046303,0.010181,0.009518,0.008341,0.3,3,"{'epsilon': 0.3, 'C': 3}",0.840737,0.779064,0.792769,0.788038,0.792927,0.798707,0.02161,9


In [15]:
# Creating final model with best hyperparameter values
support_vector_regressor = SVR(C=1, epsilon=0.1)
support_vector_regressor.fit(X_train, y_train)

SVR(C=1)

In [16]:
# Predicting on test data
y_pred_support_vector = support_vector_regressor.predict(X_test)

#### XGBoost Regression

In [17]:
# First trying out a simple XGBoost model without any tuning
xgb = XGBRegressor()
xgb.fit(X_train, y_train)
print('R2 Score:', r2_score(y_train, xgb.predict(X_train)))

R2 Score: 0.9958298394729684


- We get a training R2 score of ~0.996, far above the cross-validated scores of ~0.865 seen for the other models, which suggests the untuned model is overfitting
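A high training score alone does not prove overfitting; the gap between the training score and the cross-validated score is the clearer signal. The sketch below shows that gap using an unpruned `DecisionTreeRegressor` as a stand-in for XGBoost (a deliberate swap, so the example needs only scikit-learn) on synthetic noisy data. All names are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score

# Synthetic noisy data (hypothetical): only the first feature carries signal
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo[:, 0] + rng.normal(scale=1.0, size=200)

# An unpruned tree can memorise the training set
tree = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)
train_r2 = r2_score(y_demo, tree.predict(X_demo))

# Cross-validation scores the model on folds it has not seen
cv_r2 = cross_val_score(DecisionTreeRegressor(random_state=0),
                        X_demo, y_demo, cv=5, scoring='r2').mean()

print('Training R2:', train_r2)
print('Mean CV R2:', cv_r2)
```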

In [18]:
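# Defining a dictionary of hyperparameters with values to tune over
# Note: eval_metric only affects how evaluation is reported (and early stopping, if used), not the fitted trees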
params = {'eta' : [0.001, 0.005, 0.01, 0.05],
          'max_depth' : [1, 2, 3, 4, 5, 6],
          'eval_metric' : ['rmse', 'mae']
         }

In [19]:
# Creating a RandomizedSearchCV object to search for best parameters
xgboost_regressor = RandomizedSearchCV(estimator=XGBRegressor(), param_distributions=params, cv=5, 
                                       n_iter=100, random_state=0, verbose=2, n_jobs=-1)
search = xgboost_regressor.fit(X_train, y_train)
print('Best Score:', search.best_score_)
print('Best Hyperparameters:', search.best_params_)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best Score: 0.8416858872203108
Best Hyperparameters: {'max_depth': 5, 'eval_metric': 'rmse', 'eta': 0.05}


In [20]:
# Creating final model with best hyperparameter values
xgboost_regressor = XGBRegressor(eta=0.05, max_depth=5, eval_metric='rmse')
xgboost_regressor.fit(X_train, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, eta=0.05,
             eval_metric='rmse', gamma=0, gpu_id=-1, importance_type='gain',
             interaction_constraints='', learning_rate=0.0500000007,
             max_delta_step=0, max_depth=5, min_child_weight=1, missing=nan,
             monotone_constraints='()', n_estimators=100, n_jobs=8,
             num_parallel_tree=1, random_state=0, reg_alpha=0, reg_lambda=1,
             scale_pos_weight=1, subsample=1, tree_method='exact',
             validate_parameters=1, verbosity=None)

In [21]:
# Predicting on test set
y_pred_xgboost = xgboost_regressor.predict(X_test)

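The best R2 scores for all the models are listed as follows. The Linear Regression value is the training score; the other three are the best mean 5-fold cross-validation scores from the searches above:
- Linear Regression ~ 0.873
- Random Forest Regression ~ 0.865
- Support Vector Regression ~ 0.865
- XGBoost Regression ~ 0.842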

#### Exporting the Predictions

We export the predictions of only one model at a time and then verify its performance on Kaggle. After checking all 4 models, we will keep the best model.

In [22]:
# Creating CSV file according to Kaggle requirements
df = pd.DataFrame(y_pred_xgboost)
df.columns = ['SalePrice']
df = pd.concat([pd.read_csv('test.csv')['Id'], df], axis=1)
df.head()

Unnamed: 0,Id,SalePrice
0,1461,11.34598
1,1462,11.799376
2,1463,11.953595
3,1464,11.997738
4,1465,12.097059


In [23]:
# Exporting the predictions to CSV file
df.to_csv('prediction.csv', index=False)

We get the following scores from Kaggle:
- Linear Regression = 9.46135
- Random Forest Regression = 9.46118
- Support Vector Regression = 9.46062
- XGBoost Regression = 9.46696

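Kaggle scores this competition by the RMSE between the logarithm of the predicted price and the logarithm of the observed price, so a lower score is better. By that measure Support Vector Regression has the lowest score (9.46062) and XGBoost the highest (9.46696). The differences are tiny, so we will keep XGBoost for the deployment purposes.
- We can notice that all the four algorithms performed almost equally, with a score of ~9.46 and a spread of less than 0.007
- A score of ~9.46 is very large for this metric. The exported predictions are on the log scale (values around 11 to 12), while Kaggle expects SalePrice in the original price units, which may explain it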
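The Kaggle metric for this competition is the RMSE between the log of the predicted price and the log of the observed price. The sketch below shows how strongly that score depends on the scale of the submitted values. The prices are made up, `rmsle` is a hypothetical helper, and the use of `np.exp` assumes the target was transformed with a plain natural log; if `np.log1p` was used during preprocessing, `np.expm1` would be the matching inverse.

```python
import numpy as np

def rmsle(actual_price, predicted_price):
    """RMSE between the natural logs of predicted and actual prices."""
    diff = np.log(predicted_price) - np.log(actual_price)
    return np.sqrt(np.mean(diff ** 2))

# Hypothetical actual sale prices in dollars
actual_price = np.array([120000.0, 180000.0, 250000.0])

# Hypothetical model output on the log scale, each 0.1 above the true log price
pred_log = np.log(actual_price) + 0.1

# Submitting the log-scale values directly: the metric takes the log a second time
score_log_scale = rmsle(actual_price, pred_log)

# Back-transforming to prices first (assumes a plain natural-log target)
score_price_scale = rmsle(actual_price, np.exp(pred_log))

print('Score when submitting log-scale values:', score_log_scale)
print('Score after back-transforming:', score_price_scale)
```

In this toy case the back-transformed score equals the 0.1 log error built into the predictions, while the log-scale submission scores above 9, the same order of magnitude as the Kaggle scores listed above.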

#### Dumping Models to Pickle File

In [24]:
# Creating a pickle file and dumping all 4 models to it
# The context manager closes the file even if pickle.dump raises
with open('models.pkl', 'wb') as pickle_file:
    pickle.dump([linear_regressor, support_vector_regressor, random_forest_regressor, xgboost_regressor], pickle_file)
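A minimal sketch of reading such a file back, using a small stand-in model and a temporary path (`demo_model` and `demo_path` are hypothetical). For the file written above, the list would unpack in the order it was dumped: linear, support vector, random forest, XGBoost.

```python
import os
import pickle
import tempfile
import numpy as np
from sklearn.linear_model import LinearRegression

# Small stand-in model (hypothetical): fits y = 2x + 1 exactly
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])
demo_model = LinearRegression().fit(X_demo, y_demo)

# Dump a list of models to a temporary file, mirroring the cell above
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_models.pkl')
with open(demo_path, 'wb') as f:
    pickle.dump([demo_model], f)

# Load the list back and predict with the restored model
with open(demo_path, 'rb') as f:
    loaded_models = pickle.load(f)

loaded_pred = loaded_models[0].predict(np.array([[4.0]]))
print('Prediction from reloaded model:', loaded_pred)
```

Pickled models should be loaded with the same library versions they were saved with, and only from files you trust, since unpickling can execute arbitrary code.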