<h1>House prices prediction</h1>

<img src="https://olegleyz.github.io/images/header.jpg" alt="Header" width="800"/><br>

This kernel describes the method I used to predict house sale prices. <br>
The method consists of the following steps:
<ul>
<li>Observation</li>
<li>Dealing with missing values</li>
<li>Fixing skewness</li>
<li>Adding new features</li>
<li>Modeling</li>
</ul>

<h3>Importing necessary libraries and reading datasets</h3>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
from scipy.stats import norm, skew

import warnings
warnings.filterwarnings(action="ignore")


target_name = 'SalePrice'
dataset_train_raw = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
dataset_test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

dataset_train_raw

The test data must go through exactly the same preprocessing as the training data. The easiest way to guarantee this is to concatenate the train and test datasets, preprocess them together, and then split them again. It is also a good idea to drop features that carry no information about the price, in our case 'Id'.

In [None]:
ignore_feature = ['Id']
y_train = dataset_train_raw[target_name]
dataset_train = dataset_train_raw.drop([target_name] + ignore_feature, axis=1, inplace=False)
dataset_test.drop(ignore_feature, axis=1, inplace=True)

all_data = pd.concat([dataset_train, dataset_test], axis=0, sort=False)
all_data

<h3>Observation</h3>

In [None]:
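# Only numeric columns can be correlated; recent pandas versions raise an error on string columns otherwise
correlation_train = dataset_train_raw.corr(numeric_only=True)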
sb.set(font_scale=0.5)
plt.figure(figsize=(15, 10))
ax = sb.heatmap(correlation_train, annot=True, annot_kws={'size': 10}, fmt='.1f', cmap='PiYG', linewidths=.2)
plt.show()

As we can see, several features are strongly correlated with each other, so multicollinearity is present. We will keep these features for now for the sake of learning and let the regularized models handle the redundancy later on.
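
The heatmap shows multicollinearity visually, but it can be hard to read with this many features. As a rough sketch, the strongly correlated pairs could also be listed explicitly. The helper name `top_correlated_pairs`, the 0.8 threshold, and the toy data are assumptions made for illustration and are not part of the original kernel.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy data: 'b' is almost a copy of 'a', while 'c' is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({'a': a,
                    'b': a + rng.normal(scale=0.1, size=200),
                    'c': rng.normal(size=200)})
pairs = top_correlated_pairs(toy)
print(pairs)
```

On the real data the same helper could be called on `dataset_train_raw` to see which feature pairs the regularized models will have to cope with.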

<h3>Dealing with missing values</h3>

In [None]:
fig, ax = plt.subplots(figsize=(20, 10))
sb.heatmap(all_data.isnull(), yticklabels=False, cbar=True)
plt.show()

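At first it may seem that there are a lot of missing values, but for some features a missing value is intentional: it means the house does not have that feature (for example, no pool or no garage). For these features I fill the gaps with a separate 'None' category.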

In [None]:
specially_missed = ['Alley',
                    'PoolQC',
                    'MiscFeature',
                    'Fence',
                    'FireplaceQu',
                    'GarageType',
                    'GarageFinish',
                    'GarageQual',
                    'GarageCond',
                    'BsmtQual',
                    'BsmtCond',
                    'BsmtExposure',
                    'BsmtFinType1',
                    'BsmtFinType2',
                    'MasVnrType']

for feature in specially_missed:
    all_data[feature] = all_data[feature].fillna('None')
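
The heatmap gives an overall impression of the gaps. A small table of counts per column can complement it. The following is a minimal sketch on toy data; the helper name `missing_summary` is hypothetical and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Count and percentage of missing values per column, largest first."""
    counts = df.isnull().sum()
    counts = counts[counts > 0].sort_values(ascending=False)
    return pd.DataFrame({'missing': counts, 'percent': 100 * counts / len(df)})

# Toy frame: two columns with gaps, one complete column.
toy = pd.DataFrame({'PoolQC': [np.nan, np.nan, 'Gd', np.nan],
                    'Alley': ['Pave', np.nan, np.nan, 'Grvl'],
                    'LotArea': [8450, 9600, 11250, 9550]})
summary = missing_summary(toy)
print(summary)
```

Calling the same helper on `all_data` before and after imputation would show which columns still need attention.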

If the missing feature is numerical, fill it with zero. Filling it with the median of the feature would also be a reasonable choice.

In [None]:
numeric_missed = ['BsmtFinSF1',
                  'BsmtFinSF2',
                  'BsmtUnfSF',
                  'TotalBsmtSF',
                  'BsmtFullBath',
                  'BsmtHalfBath',
                  'GarageYrBlt',
                  'GarageArea',
                  'GarageCars',
                  'MasVnrArea']

for feature in numeric_missed:
    all_data[feature] = all_data[feature].fillna(0)
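
To make the difference between the two options concrete, here is a small sketch on made-up values. Zero suits features where a gap plausibly means "absent" (no garage, so no garage area), while the median may be preferable when the value exists but was simply not recorded. Which case applies to a given feature is an assumption that should be checked against the data description.

```python
import numpy as np
import pandas as pd

# Toy column with one gap; the values are illustrative only.
toy = pd.DataFrame({'GarageArea': [480.0, np.nan, 0.0, 600.0, 520.0]})

# Option 1: treat the gap as "no garage".
zero_filled = toy['GarageArea'].fillna(0)

# Option 2: treat the gap as an unknown but typical value.
median_filled = toy['GarageArea'].fillna(toy['GarageArea'].median())

print(zero_filled.tolist())
print(median_filled.tolist())
```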

Some numerical features take only a small number of distinct values and behave more like labels than quantities (for example, the year or month of sale). Let's turn them into categorical features by casting them to strings.

In [None]:
all_data['MSSubClass'] = all_data['MSSubClass'].astype(str)
all_data['YrSold'] = all_data['YrSold'].astype(str)
all_data['MoSold'] = all_data['MoSold'].astype(str)

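Fill the remaining missing values with the most common value of each feature. 'LotFrontage' is filled with the mean of its neighborhood, and 'MSZoning' with the most common value within its 'MSSubClass' group.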

In [None]:
all_data['Functional'] = all_data['Functional'].fillna('Typ')
all_data['Utilities'] = all_data['Utilities'].fillna('AllPub')
all_data['KitchenQual'] = all_data['KitchenQual'].fillna('TA')
all_data['Electrical'] = all_data['Electrical'].fillna('SBrkr')

all_data['Exterior1st'] = all_data['Exterior1st'].fillna(all_data['Exterior1st'].mode()[0])
all_data['Exterior2nd'] = all_data['Exterior2nd'].fillna(all_data['Exterior2nd'].mode()[0])
all_data['SaleType'] = all_data['SaleType'].fillna(all_data['SaleType'].mode()[0])

all_data['LotFrontage'] = all_data.groupby('Neighborhood')['LotFrontage'].transform(lambda x: x.fillna(x.mean()))
all_data['MSZoning'] = all_data.groupby('MSSubClass')['MSZoning'].transform(lambda x: x.fillna(x.mode()[0]))
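
The group-wise fill used for 'LotFrontage' can be illustrated on a tiny example. This is a sketch on made-up data that shows how `groupby(...).transform(...)` fills each gap with the mean of its own group rather than the global mean.

```python
import numpy as np
import pandas as pd

# Toy data: two neighborhoods, each with one missing frontage value.
toy = pd.DataFrame({'Neighborhood': ['A', 'A', 'A', 'B', 'B'],
                    'LotFrontage': [60.0, np.nan, 80.0, 30.0, np.nan]})

# transform() returns a result aligned with the original rows,
# so it can be assigned straight back to the column.
toy['LotFrontage'] = (toy.groupby('Neighborhood')['LotFrontage']
                         .transform(lambda x: x.fillna(x.mean())))
print(toy)
```

One caveat: if every value in a group is missing, the group mean is itself NaN and the gap remains, and `x.mode()[0]` raises an error on such a group, so a global fallback may be needed on other datasets.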

In [None]:
fig, ax = plt.subplots(figsize=(20, 10))
sb.heatmap(all_data.isnull(), yticklabels=False, cbar=True)
plt.show()

No more missing values!

<h3>Fixing skewness</h3>

Let's create a histogram to see if the target variable (SalePrice) is normally distributed.

In [None]:
plt.subplots(figsize=(14, 9))
sb.distplot(y_train, kde=True, hist=True, fit=norm)
plt.show()

As we can see, the target variable is not normally distributed. Let's try to fix this by applying a log transform (log1p).

In [None]:
y_train = np.log1p(y_train)

plt.subplots(figsize=(14, 9))
sb.distplot(y_train, kde=True, hist=True, fit=norm)
plt.show()
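
The effect of the log transform can also be checked numerically rather than by eye. The sketch below uses synthetic, right-skewed "prices" (an assumption, not the competition data) to show that `np.log1p` reduces skewness and that `np.expm1` restores the original scale, which matters later when predictions are converted back to prices.

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed values standing in for sale prices.
rng = np.random.default_rng(42)
prices = rng.lognormal(mean=12, sigma=0.4, size=1000)

log_prices = np.log1p(prices)
skew_before = skew(prices)
skew_after = skew(log_prices)

# expm1 is the exact inverse of log1p.
restored = np.expm1(log_prices)

print('skew before:', skew_before)
print('skew after: ', skew_after)
```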

Now let's fix the high skewness in the numerical features.

In [None]:
numeric_feats = all_data.dtypes[all_data.dtypes != 'object'].index
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x)).sort_values(ascending=False)
high_skew = skewed_feats[abs(skewed_feats) > 0.5]
high_skew

In [None]:
for feature in high_skew.index:
    all_data[feature] = np.log1p(all_data[feature])
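
Note that the filter above uses the absolute value of the skewness, so negatively skewed features are log-transformed as well, even though a log transform mainly compresses a long right tail. As a possible variation (not what this kernel does), the transform could be limited to positively skewed features. The toy columns below are synthetic and the 0.5 threshold is simply reused from above.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'right_skewed': rng.lognormal(mean=0, sigma=1, size=500),
    'left_skewed': 10 - rng.exponential(scale=1, size=500).clip(max=9),
    'symmetric': rng.normal(loc=50, scale=5, size=500),
})

skews = toy.apply(skew)

# Variation: select only features with a long right tail.
to_transform = skews[skews > 0.5].index

transformed = toy.copy()
transformed[to_transform] = np.log1p(toy[to_transform])

print(skews)
print(list(to_transform))
```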

<h3>Adding new features</h3>

Let's add new, more informative features built by combining existing ones.

In [None]:
all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']

all_data['SqFtPerRoom'] = all_data['GrLivArea'] / (all_data['TotRmsAbvGrd'] + all_data['FullBath'] +
                                                       all_data['HalfBath'] + all_data['KitchenAbvGr'])

all_data['TotalHomeQuality'] = all_data['OverallQual'] + all_data['OverallCond']

all_data['TotalBathrooms'] = (all_data['FullBath'] + (0.5 * all_data['HalfBath']) +
                                  all_data['BsmtFullBath'] + (0.5 * all_data['BsmtHalfBath']))

<h3>Processed dataset</h3>

In [None]:
X_all = pd.get_dummies(all_data)
X_train = X_all[:len(y_train)]
X_test = X_all[len(y_train):]

X_train

In [None]:
X_test
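
This step is where concatenating train and test pays off. A minimal sketch on toy data shows the problem it avoids: encoding the two sets separately can yield different columns when a category appears in only one of them, while encoding the combined frame always gives matching columns.

```python
import pandas as pd

# Toy data: the 'Grvl' category appears only in the training set.
train = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave']})
test = pd.DataFrame({'Street': ['Pave', 'Pave']})

# Encoding separately gives mismatched feature sets.
separate_train_cols = list(pd.get_dummies(train).columns)
separate_test_cols = list(pd.get_dummies(test).columns)

# Encoding the concatenated frame and splitting by position keeps them aligned.
combined = pd.get_dummies(pd.concat([train, test], axis=0, ignore_index=True))
X_train_toy = combined[:len(train)]
X_test_toy = combined[len(train):]

print(separate_train_cols, separate_test_cols)
print(list(X_train_toy.columns), list(X_test_toy.columns))
```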

<h3>Modeling</h3>

To begin with, let's run a quick comparison of the models with default hyperparameters in order to weed out those that are unsuitable for this task.

In [None]:
import xgboost as xg
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, KFold, cross_val_score
from sklearn.linear_model import LinearRegression, ElasticNet, Lasso
from sklearn.svm import SVR
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from catboost import CatBoostRegressor
from mlxtend.regressor import StackingRegressor


kf = KFold(n_splits=8, random_state=42, shuffle=True)

def cv_rmse(model):
    return -cross_val_score(model, X_train, y_train, scoring='neg_root_mean_squared_error', cv=kf)

models = ['Linear', 'SVR', 'Random_Forest', 'XGBR', 'Cat_Boost', 'Ridge', 'Elastic_Net', 'Lasso', 'Stack']
scores = []

lin = LinearRegression()
score_lin = cv_rmse(lin)
scores.append(score_lin.mean())

svr = SVR()
score_svr = cv_rmse(svr)
scores.append(score_svr.mean())

rfr = RandomForestRegressor()
score_rfr = cv_rmse(rfr)
scores.append(score_rfr.mean())

xgb = xg.XGBRegressor()
score_xgb = cv_rmse(xgb)
scores.append(score_xgb.mean())

catb = CatBoostRegressor(verbose=0, allow_writing_files=False)
score_catb = cv_rmse(catb)
scores.append(score_catb.mean())

rid = Ridge()
score_rid = cv_rmse(rid)
scores.append(score_rid.mean())

el = ElasticNet()
score_el = cv_rmse(el)
scores.append(score_el.mean())

las = Lasso()
score_las = cv_rmse(las)
scores.append(score_las.mean())

stack_gen = StackingRegressor(regressors=(CatBoostRegressor(verbose=0, allow_writing_files=False),
                                          Ridge(),
                                          xg.XGBRegressor(),
                                          RandomForestRegressor()),
                              meta_regressor=CatBoostRegressor(verbose=0, allow_writing_files=False),
                              use_features_in_secondary=True)
score_stack_gen = cv_rmse(stack_gen)
scores.append(score_stack_gen.mean())

cv_score = pd.DataFrame(models, columns=['Regressors'])
cv_score['RMSE_mean'] = scores
cv_score
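
The comparison above repeats the same three lines for every model. As a sketch of a more compact alternative, the models could be kept in a dictionary and scored in a loop, which keeps names and scores together by construction. The example uses synthetic data from `make_regression` and only three scikit-learn models; the names `candidates` and `cv_table` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem standing in for X_train / y_train.
X_toy, y_toy = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
kf_toy = KFold(n_splits=4, shuffle=True, random_state=42)

candidates = {'Linear': LinearRegression(), 'Ridge': Ridge(), 'Lasso': Lasso()}

rows = []
for name, model in candidates.items():
    # The scorer returns negative RMSE, so flip the sign.
    rmse = -cross_val_score(model, X_toy, y_toy,
                            scoring='neg_root_mean_squared_error', cv=kf_toy)
    rows.append({'Regressors': name, 'RMSE_mean': rmse.mean(), 'RMSE_std': rmse.std()})

cv_table = pd.DataFrame(rows).sort_values('RMSE_mean').reset_index(drop=True)
print(cv_table)
```

Reporting the standard deviation next to the mean also helps judge whether the difference between two models is larger than the fold-to-fold variation.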

In [None]:
plt.figure(figsize=(15, 11))
sb.barplot(x=cv_score['Regressors'], y=cv_score['RMSE_mean'])
plt.xlabel('Regressors', fontsize=16)
plt.ylabel('CV_Mean_RMSE', fontsize=16)
plt.xticks(rotation=45, fontsize=12)
plt.yticks(fontsize=12)
plt.show()

Now let's move on to optimizing the hyperparameters of the models that showed the best results.

<h4>XGBR</h4>

In [None]:
predictions = {}

def xgbr(X_train, y_train, X_test):
    xgbrM = xg.XGBRegressor()
    params = {'max_depth': [3, 4, 5, 6, 7, 8],
              'min_child_weight': [0, 4, 5, 6, 7, 8],
              'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.25, 0.8, 1],
              'n_estimators': [10, 30, 50, 100, 200, 400, 1000]}

    grid_search_xg = RandomizedSearchCV(estimator=xgbrM, scoring='neg_root_mean_squared_error', param_distributions=params, n_iter=200, cv=4, verbose=2,
                                         random_state=42, n_jobs=-1)
    grid_search_xg.fit(X_train, y_train)
    xgbrModel = grid_search_xg.best_estimator_
    print('Best params(XGBR):',grid_search_xg.best_params_)
    print('RMSE(XGBR):', -grid_search_xg.best_score_)
    return xgbrModel

#xgbrModel = xgbr(X_train, y_train, X_test)
xgbrModel = xg.XGBRegressor(n_estimators=400, min_child_weight=5, max_depth=7, learning_rate=0.05)
xgbrModel.fit(X_train, y_train)
predictions['XGBR'] = xgbrModel.predict(X_test)

<h4>Ridge</h4>

In [None]:
def ridge(X_train, y_train, X_test):
    # alpha is a regularization strength and must be non-negative
    alpha_ridge = {'alpha': [1e-15, 1e-10, 1e-8, 1e-5, 1e-4, 1e-3, 1e-2, 0.5, 1, 1.5, 2, 3, 4, 5, 10, 20, 30, 40]}

    rd = Ridge()
    grid_search_rd = GridSearchCV(estimator=rd, scoring='neg_root_mean_squared_error', param_grid=alpha_ridge, cv=4, n_jobs=-1, verbose=3)
    grid_search_rd.fit(X_train, y_train)
    ridgeModel = grid_search_rd.best_estimator_
    print('Best params(Ridge):', grid_search_rd.best_params_)
    print('RMSE(Ridge):', -grid_search_rd.best_score_)
    return ridgeModel

#ridgeModel = ridge(X_train, y_train, X_test)
ridgeModel = Ridge(alpha=10)
ridgeModel.fit(X_train, y_train)
predictions['Ridge'] = ridgeModel.predict(X_test)
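
Since the regularization strength of Ridge typically matters on an order-of-magnitude scale, a logarithmically spaced grid is a common alternative to a hand-written list. The following is a sketch on synthetic data; the grid bounds (1e-4 to 100) are an assumption chosen for illustration, not values tuned for this competition.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for X_train / y_train.
X_toy, y_toy = make_regression(n_samples=150, n_features=8, noise=10.0, random_state=0)

# 13 values evenly spaced on a log scale between 1e-4 and 1e2.
alpha_grid = {'alpha': np.logspace(-4, 2, 13)}

search = GridSearchCV(Ridge(), param_grid=alpha_grid,
                      scoring='neg_root_mean_squared_error', cv=4)
search.fit(X_toy, y_toy)

best_alpha = search.best_params_['alpha']
best_rmse = -search.best_score_
print('Best alpha:', best_alpha)
print('RMSE:', best_rmse)
```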

<h4>Cat Boost</h4>

In [None]:
def catBoost(X_train, y_train, X_test):
    catM = CatBoostRegressor(verbose=0, allow_writing_files=False)
    params = {'learning_rate': [0.01, 0.05, 0.005, 0.0005],
              'depth': [4, 6, 10],
              'l2_leaf_reg': [1, 2, 3, 5, 9]}

    grid_search_cat = RandomizedSearchCV(estimator=catM, scoring='neg_root_mean_squared_error', param_distributions=params, n_iter=10, cv=4, verbose=2,
                                     random_state=42, n_jobs=-1)
    grid_search_cat.fit(X_train, y_train)
    catModel = grid_search_cat.best_estimator_
    print('Best params(CatBoost):',grid_search_cat.best_params_)
    print('RMSE(CatBoost):', -grid_search_cat.best_score_)
    return catModel

#catModel = catBoost(X_train, y_train, X_test)
catModel = CatBoostRegressor(verbose=0, allow_writing_files=False, learning_rate=0.05, l2_leaf_reg=2, depth=4)
catModel.fit(X_train, y_train)
predictions['CatBoost'] = catModel.predict(X_test)

<h4>Blending models</h4>

In [None]:
final_prediction = 0.25 * predictions['XGBR'] + 0.35 * predictions['CatBoost'] + 0.4 * predictions['Ridge']
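
The blend is a weighted average of the three models' predictions, with weights that sum to 1 so the result stays on the same (log) scale as the inputs. A small sketch with made-up prediction values shows the arithmetic and adds a guard on the weights; the dictionary names are hypothetical.

```python
import numpy as np

# Made-up log-scale predictions for two houses from each model.
toy_predictions = {'XGBR': np.array([12.0, 11.5]),
                   'CatBoost': np.array([12.2, 11.7]),
                   'Ridge': np.array([11.9, 11.6])}
weights = {'XGBR': 0.25, 'CatBoost': 0.35, 'Ridge': 0.4}

# Guard against weights that accidentally do not sum to 1.
assert np.isclose(sum(weights.values()), 1.0)

blended = sum(weights[name] * toy_predictions[name] for name in weights)
print(blended)
```

Keeping the weights in a dictionary keyed by model name makes it harder to pair a weight with the wrong model when the blend is changed.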

<h4>Result</h4>

In [None]:
result = pd.DataFrame([len(y_train) + 1 + i for i in range(len(X_test))], columns=['Id'])
result[target_name] = np.expm1(final_prediction)
result.to_csv('result.csv', index=False, header=True)
result
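
The Ids above are reconstructed from the length of the training set, which works only as long as the test Ids directly continue the training Ids. A more defensive sketch, under the assumption that the 'Id' column is saved before it is dropped, could look like this. The toy frame and the stand-in predictions are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy test set; in the kernel this would be dataset_test before 'Id' is dropped.
toy_test = pd.DataFrame({'Id': [1461, 1462, 1463],
                         'LotArea': [11622, 14267, 13830]})

# Save the Ids first, then drop the column from the features.
test_ids = toy_test['Id'].copy()
features = toy_test.drop(columns=['Id'])

# Stand-in for model output on the log1p scale.
log_predictions = np.log1p(np.array([120000.0, 150000.0, 180000.0]))

# expm1 undoes the log1p applied to the target before training.
submission = pd.DataFrame({'Id': test_ids,
                           'SalePrice': np.expm1(log_predictions)})
print(submission)
```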