# House Price Prediction with Scikit-Learn

The goal of this project is to predict house prices from a set of variables describing each home. This is a well-known machine-learning challenge hosted on Kaggle, and it is well suited to testing ML concepts on real-world data. More information can be found on the competition's [Kaggle page](https://www.kaggle.com/c/house-prices-advanced-regression-techniques).

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

# import warnings
# warnings.simplefilter(action='ignore', category=FutureWarning)

import os

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.impute import SimpleImputer 
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge

## Dataset

In [None]:
# get train and test data set
data_loc = './data'

train_data_base = pd.read_csv(os.path.join(data_loc,'train.csv'), index_col='Id')
test_data_base = pd.read_csv(os.path.join(data_loc,'test.csv'), index_col='Id')

In [None]:
train_data_base.info()

In [None]:
target = 'SalePrice'

There are both numerical and categorical features. According to the dataset description, the following features are numerical:

In [None]:
num_feat = [
    'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
    'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF',
    '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
    'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars',
    'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch',
    'PoolArea', 'MiscVal', 'YrSold'
]

## Exploratory Data Analysis

In [None]:
ax = sns.histplot(train_data_base, x='SalePrice')
ax.set_title('Label Distribution')
plt.show()

### Numerical Features

Top 10 positively and negatively correlated features (the first list also contains `SalePrice` itself, with a correlation of 1, hence `head(11)`):

In [None]:
train_data_base[num_feat+[target]].corr()['SalePrice'].sort_values(ascending=False).head(11)

In [None]:
train_data_base[num_feat+[target]].corr()['SalePrice'].sort_values(ascending=True).head(10)
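Both rankings can also be read from a single correlation series once the target's own entry is dropped. The following is a minimal sketch on synthetic data; the frame `toy` and its columns `a`, `b`, `c` are made up for illustration and are not part of the Kaggle data set.

```python
import numpy as np
import pandas as pd

# synthetic stand-in for train_data_base[num_feat + [target]] (assumption)
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    'a': rng.normal(size=n),
    'b': rng.normal(size=n),
    'c': rng.normal(size=n),
})
toy['SalePrice'] = 3.0 * toy['a'] - 2.0 * toy['b'] + rng.normal(scale=0.5, size=n)

# correlation of every feature with the target, without the target itself
corr = toy.corr()['SalePrice'].drop('SalePrice')

top_positive = corr.nlargest(2)   # strongest positive correlations first
top_negative = corr.nsmallest(2)  # strongest negative correlations first

print(top_positive)
print(top_negative)
```

Computing the correlation matrix once and slicing it twice avoids repeating the `.corr()` call.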

## Feature Engineering

Some knowledge of the data set makes it possible to combine existing features into new, potentially more predictive ones.

### Create New Features

In [None]:
new_train_features = pd.DataFrame()
new_test_features = pd.DataFrame()

# age of the house
new_train_features['age'] = train_data_base['YrSold'] - train_data_base['YearBuilt']
new_test_features['age'] = test_data_base['YrSold'] - test_data_base['YearBuilt']

# total indoor square footage
new_train_features['indoor_sf'] = train_data_base['TotalBsmtSF'] + train_data_base['1stFlrSF'] + train_data_base['2ndFlrSF']
new_test_features['indoor_sf'] = test_data_base['TotalBsmtSF'] + test_data_base['1stFlrSF'] + test_data_base['2ndFlrSF']

# total number of bathrooms (half baths count as 0.5)
new_train_features['n_bath'] = train_data_base['BsmtFullBath'] + 0.5*train_data_base['BsmtHalfBath'] \
    + train_data_base['FullBath'] + 0.5*train_data_base['HalfBath']
new_test_features['n_bath'] = test_data_base['BsmtFullBath'] + 0.5*test_data_base['BsmtHalfBath'] \
    + test_data_base['FullBath'] + 0.5*test_data_base['HalfBath']
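Because the same three formulas are applied to both the training and the test frame, they could also live in one helper. The sketch below assumes a hypothetical function name, `add_engineered_features`, and uses two made-up rows in place of the real data; only the column names come from the data set.

```python
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return the three engineered features, indexed like `df` (hypothetical helper)."""
    out = pd.DataFrame(index=df.index)
    # age of the house at the time of sale
    out['age'] = df['YrSold'] - df['YearBuilt']
    # total indoor square footage
    out['indoor_sf'] = df['TotalBsmtSF'] + df['1stFlrSF'] + df['2ndFlrSF']
    # total number of bathrooms, half baths count as 0.5
    out['n_bath'] = (df['BsmtFullBath'] + 0.5 * df['BsmtHalfBath']
                     + df['FullBath'] + 0.5 * df['HalfBath'])
    return out

# two made-up rows standing in for train_data_base
toy = pd.DataFrame({
    'YrSold': [2008, 2010], 'YearBuilt': [2003, 1976],
    'TotalBsmtSF': [856, 1262], '1stFlrSF': [856, 1262], '2ndFlrSF': [854, 0],
    'BsmtFullBath': [1, 0], 'BsmtHalfBath': [0, 1],
    'FullBath': [2, 2], 'HalfBath': [1, 0],
}, index=pd.Index([1, 2], name='Id'))

toy_features = add_engineered_features(toy)
print(toy_features)
```

With such a helper, the training and test features would be built by calling it once per frame, which keeps the two definitions from drifting apart.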

### Explore New Features

In [None]:
new_train_features.merge(train_data_base[target], left_index=True, right_index=True).corr()[target]

In absolute value, age is slightly more strongly correlated with the sale price than the year the house was built. The indoor square footage and the number of bathrooms are among the top correlated features.

## Feature Selection

In [None]:
features_selected = ['n_bath', 'age', 'indoor_sf', 'OverallQual', 'GrLivArea', 'GarageCars']

train_data = new_train_features.merge(train_data_base, left_index=True, right_index=True)[features_selected+[target]]
test_data = new_test_features.merge(test_data_base, left_index=True, right_index=True)[features_selected]

In [None]:
train_data.isna().sum()

## ML Pipeline

In [None]:
# define train and test data
X_train = train_data.drop(target, axis=1).to_numpy()
y_train = np.log(train_data[target]).to_numpy()

X_test = test_data.to_numpy()
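The target is log-transformed above. A common motivation is that prices tend to be right-skewed, and the logarithm makes the distribution more symmetric. The sketch below illustrates this on synthetic, log-normally distributed prices (an assumption made for illustration, not the actual `SalePrice` column) and shows that `np.exp` undoes the transform.

```python
import numpy as np
import pandas as pd

# synthetic right-skewed prices (assumption: log-normal, parameters made up)
rng = np.random.default_rng(0)
prices = pd.Series(np.exp(rng.normal(loc=12.0, scale=0.4, size=5000)), name='SalePrice')

skew_raw = prices.skew()          # clearly positive for right-skewed data
skew_log = np.log(prices).skew()  # close to zero after the transform

# np.exp maps log-scale values back to the original price scale
roundtrip = np.exp(np.log(prices))

print(f'skew raw: {skew_raw:.2f}, skew log: {skew_log:.2f}')
```

Since the model is trained on the log scale, its predictions have to be passed through `np.exp` before they can be read as prices.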

In [None]:
# set up the ML pipeline

# imputing strategy for missing values
imputer = SimpleImputer(missing_values=np.nan)

# scale values
scaler = StandardScaler()

# regression
regressor = Ridge()

# pipeline
pipe = Pipeline([('imputer', imputer), ('scaler', scaler), ('regressor', regressor)])

In [None]:
# tune hyperparameters with cross validation
param_grid = {'regressor__alpha':[0.001, 0.01, 0.1, 1, 10, 50, 100, 500, 1000], 
              'imputer__strategy':['mean', 'median']}

search = GridSearchCV(pipe, param_grid, scoring='neg_mean_squared_error', n_jobs=-1)
search.fit(X_train,y_train)

print(search.best_params_)
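Beyond `best_params_`, the fitted search object exposes the score of every parameter combination in `cv_results_`. The following sketch runs the same kind of pipeline on a small synthetic regression problem; all `toy_*` names and the data are assumptions for illustration, and the grid is shortened.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic regression data with a few missing values (made up)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=120)
X_toy[::10, 0] = np.nan

toy_pipe = Pipeline([('imputer', SimpleImputer()),
                     ('scaler', StandardScaler()),
                     ('regressor', Ridge())])
toy_grid = {'regressor__alpha': [0.01, 1, 100],
            'imputer__strategy': ['mean', 'median']}

toy_search = GridSearchCV(toy_pipe, toy_grid, scoring='neg_mean_squared_error', cv=5)
toy_search.fit(X_toy, y_toy)

# one row per parameter combination, best combination first
results = pd.DataFrame(toy_search.cv_results_)
results['rmse'] = np.sqrt(-results['mean_test_score'])
summary = results[['param_regressor__alpha', 'param_imputer__strategy',
                   'rmse', 'rank_test_score']].sort_values('rank_test_score')
print(summary)
```

Looking at the full table shows how sensitive the score is to `alpha` and to the imputation strategy, which a single best setting cannot convey.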

In [None]:
# extract the best model
best_pipe = search.best_estimator_

## Evaluate Model

In [None]:
cv_scores = cross_val_score(best_pipe, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
display(cv_scores)

In [None]:
np.sqrt(-1*cv_scores.mean())
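Because `y_train` holds log prices, the value above is a root-mean-squared error on the log scale. To my understanding, this is the quantity the competition scores: the RMSE between the logarithm of the predicted and the logarithm of the observed price. A minimal sketch of that metric on the original price scale follows; the helper name `rmse_log` and the example prices are hypothetical.

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """RMSE between log(y_pred) and log(y_true); inputs must be positive prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# made-up prices: every prediction is 10 % too high
actual = np.array([200000.0, 150000.0, 300000.0])
predicted = actual * 1.10

score = rmse_log(actual, predicted)
print(score)
```

On this scale an error depends only on the ratio between predicted and actual price, so a 10 % miss on a cheap house counts as much as a 10 % miss on an expensive one.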

In [None]:
# predict the prices for the test set
y_test_pred = np.exp(best_pipe.predict(X_test))

In [None]:
# create the submission file for the Kaggle competition
# (the original astype call discarded its result; the cast is applied here directly)
submission = pd.DataFrame({'Id': test_data_base.index.astype('int'), 'SalePrice': y_test_pred})

submission.to_csv(os.path.join(data_loc,'submission.csv'), index=False)