# House Price Prediction with Scikit-Learn

The goal of this project is to predict house prices from a set of variables describing each home. This is a well-known machine-learning challenge hosted on Kaggle, and it is well suited to testing ML concepts on real-world data. More information can be found on the competition's [Kaggle page](https://www.kaggle.com/c/house-prices-advanced-regression-techniques).

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

# import warnings
# warnings.simplefilter(action='ignore', category=FutureWarning)

import os

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.impute import SimpleImputer 
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge

## Dataset

In [None]:
# Load the train and test data sets, using the 'Id' column as the index
data_loc = './data'

train_data_base = pd.read_csv(os.path.join(data_loc,'train.csv'), index_col='Id')
test_data_base = pd.read_csv(os.path.join(data_loc,'test.csv'), index_col='Id')

In [None]:
train_data_base.info()

In [None]:
target = 'SalePrice'

There are both numerical and categorical features. According to the dataset description, the following features are numerical:

In [None]:
num_feat = [
    'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
    'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF',
    '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
    'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars',
    'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch',
    'PoolArea', 'MiscVal', 'YrSold'
]
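
Because the numerical features are listed by hand, the remaining columns can be derived as the complement of that list. The following is a minimal sketch on a tiny synthetic frame (the values are made up, and `cat_feat` and `missing` are names introduced here for illustration, not part of the original notebook). It also checks that every name in `num_feat` is actually present, which guards against typos in a hand-written list.

```python
import pandas as pd

# Tiny synthetic frame standing in for train_data_base (values are made up).
df = pd.DataFrame(
    {
        "LotArea": [8450, 9600, 11250],
        "Neighborhood": ["CollgCr", "Veenker", "CollgCr"],
        "OverallQual": [7, 6, 7],
        "SalePrice": [208500, 181500, 223500],
    }
)
target = "SalePrice"
num_feat = ["LotArea", "OverallQual"]

# Every column that is neither numerical nor the target is treated as categorical.
cat_feat = [col for col in df.columns if col not in num_feat and col != target]

# Names listed in num_feat that do not exist in the frame (ideally empty).
missing = [col for col in num_feat if col not in df.columns]

print(cat_feat)
print(missing)
```

Deriving the split from the column dtypes alone would be an alternative, but integer-coded categories would then be counted as numerical, which is presumably why the list above follows the dataset description instead.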

## Exploratory Data Analysis

In [None]:
ax = sns.histplot(train_data_base, x='SalePrice')
ax.set_title('Label Distribution')
plt.show()

### Numerical Features

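The 10 features most positively and most negatively correlated with the sale price (the first list has 11 rows because `SalePrice` itself appears at the top with a correlation of 1):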

In [None]:
train_data_base[num_feat+[target]].corr()['SalePrice'].sort_values(ascending=False).head(11)

In [None]:
train_data_base[num_feat+[target]].corr()['SalePrice'].sort_values(ascending=True).head(10)
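
The two cells above compute the same correlation matrix twice. As a hedged alternative, the sketch below computes the correlations once, drops the target's correlation with itself, and reads both ends of the sorted series. It runs on a small synthetic frame (made-up values), and `corr_with_target`, `top_positive`, and `top_negative` are names assumed for this example.

```python
import pandas as pd

# Synthetic stand-in for train_data_base[num_feat + [target]] (values are made up).
df = pd.DataFrame(
    {
        "GrLivArea": [1000, 1500, 2000, 2500],
        "age": [40, 30, 20, 10],
        "MiscVal": [0, 500, 0, 500],
        "SalePrice": [100000, 150000, 210000, 240000],
    }
)
target = "SalePrice"

# Pearson correlation of every feature with the target, computed once.
# Dropping the target removes its trivial correlation of 1 with itself.
corr_with_target = df.corr()[target].drop(target).sort_values(ascending=False)

top_positive = corr_with_target.head(2)  # strongest positive correlations
top_negative = corr_with_target.tail(1)  # strongest negative correlations

print(top_positive)
print(top_negative)
```

On the real data, `head(10)` and `tail(10)` would give the two top-10 lists without the extra `SalePrice` row.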

## Feature Engineering

Some knowledge of the data set makes it possible to combine several existing features into new, potentially more informative ones.

In [None]:
new_train_features = pd.DataFrame()
new_test_features = pd.DataFrame()

### Create New Features

#### House Age

In [None]:
new_train_features['age'] = train_data_base['YrSold'] - train_data_base['YearBuilt']
new_test_features['age'] = test_data_base['YrSold'] - test_data_base['YearBuilt']

#### Total Indoor SF

In [None]:
new_train_features['indoor_sf'] = train_data_base['TotalBsmtSF'] + train_data_base['1stFlrSF'] + train_data_base['2ndFlrSF']
new_test_features['indoor_sf'] = test_data_base['TotalBsmtSF'] + test_data_base['1stFlrSF'] + test_data_base['2ndFlrSF']

#### Number of Bathrooms

In [None]:
# Half baths count as 0.5; parentheses replace backslash line continuations
new_train_features['n_bath'] = (
    train_data_base['BsmtFullBath'] + 0.5*train_data_base['BsmtHalfBath']
    + train_data_base['FullBath'] + 0.5*train_data_base['HalfBath']
)
new_test_features['n_bath'] = (
    test_data_base['BsmtFullBath'] + 0.5*test_data_base['BsmtHalfBath']
    + test_data_base['FullBath'] + 0.5*test_data_base['HalfBath']
)
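
Each new feature above is written out twice, once for the train set and once for the test set. One way to keep the two in sync is to wrap the three formulas in a single function and call it on both frames. The sketch below assumes a hypothetical helper named `build_new_features` (not part of the original notebook) and checks it on one made-up row.

```python
import pandas as pd


def build_new_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return the engineered features for one data set, indexed like df."""
    new = pd.DataFrame(index=df.index)
    # Age of the house at the time of sale, in years.
    new["age"] = df["YrSold"] - df["YearBuilt"]
    # Total indoor area: basement plus first and second floor.
    new["indoor_sf"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]
    # Bathrooms, with each half bath counted as 0.5.
    new["n_bath"] = (
        df["BsmtFullBath"] + 0.5 * df["BsmtHalfBath"]
        + df["FullBath"] + 0.5 * df["HalfBath"]
    )
    return new


# One made-up row to verify the arithmetic by hand.
toy = pd.DataFrame(
    {
        "YrSold": [2008], "YearBuilt": [2003],
        "TotalBsmtSF": [856], "1stFlrSF": [856], "2ndFlrSF": [854],
        "BsmtFullBath": [1], "BsmtHalfBath": [0],
        "FullBath": [2], "HalfBath": [1],
    },
    index=pd.Index([1], name="Id"),
)
toy_features = build_new_features(toy)
print(toy_features)
```

With the notebook's frames this would read `new_train_features = build_new_features(train_data_base)` and likewise for the test set. Note that if any input column has a missing value in a row, the corresponding sum is `NaN` for that row and has to be handled by a later imputation step.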

### Explore New Features

In [None]:
new_train_features.merge(train_data_base[target], left_index=True, right_index=True).corr()[target]

Age is slightly more strongly correlated with the sale price (in absolute value) than the year the house was built alone. The indoor square footage and the number of bathrooms are among the most strongly correlated features.