The notebook has the following structure:
1. Exploratory data analysis
1. Data cleaning
1. Feature engineering
1. Machine learning part
1. Hyperoptimization
1. Test part

In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns
sns.set()
import matplotlib.pyplot as plt


In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
pd.options.display.max_rows = None
pd.options.display.max_columns = None

## Exploratory data analysis

Let's load the train data (and later the test data) and have a look at it. The train dataset contains the target variable 'SalePrice'; the test dataset does not.

In [None]:
train_df = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
train_df.shape

In [None]:
train_df.info()

In [None]:
train_df.head()

In [None]:
train_df.describe().transpose()

In [None]:
train_df.hist(figsize=(20, 20), bins=20);

In [None]:
plt.figure(figsize=(26, 16))
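# numeric_only=True (pandas >= 1.5) skips text columns, which newer pandas no longer drops silently
sns.heatmap(train_df.corr(numeric_only=True), cmap='rocket', annot=True, fmt='.1f', cbar=False);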

Let's take a closer look at our target feature.

In [None]:
plt.figure(figsize=(12, 4))
# histplot with kde=True replaces the deprecated distplot
sns.histplot(train_df['SalePrice'], kde=True);

The data is right-skewed. Let's see if taking the log of the price can handle the outliers.

In [None]:
plt.figure(figsize=(12, 4))
# histplot with kde=True replaces the deprecated distplot
sns.histplot(np.log(train_df['SalePrice']), kde=True);

Yes, the logarithm of 'SalePrice' looks better because its distribution is close to normal, so I will use LogPrice as the target variable.
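
The effect of the log transform can also be checked numerically. The sketch below uses synthetic log-normal values (not the competition data; all names are illustrative) and compares the sample skewness before and after the transform.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (log-normal); NOT the competition data
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5000))

log_prices = np.log(prices)

# Sample skewness: clearly positive for raw prices, close to 0 after the log
skew_raw = prices.skew()
skew_log = log_prices.skew()
print(f"skew raw: {skew_raw:.2f}, skew log: {skew_log:.2f}")

# np.exp inverts the transform, which is how predictions are mapped back later
restored = np.exp(log_prices)
```

A skewness near 0 is only one indication of symmetry, so the plot above remains the main check.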

In [None]:
train_df.shape

In [None]:
# Add price logarithm to dataset
train_df['LogPrice'] = np.log(train_df['SalePrice'])

# and remove SalePrice 
train_df = train_df.drop('SalePrice', axis=1)

In [None]:
# Correlation of the target with the other numerical features
train_df.corr(numeric_only=True)['LogPrice'].sort_values(ascending=False)

Let's look at some features that correlate strongly with the target.

In [None]:
sns.barplot(x='OverallQual', y='LogPrice', data=train_df);

In [None]:
sns.scatterplot(x='GrLivArea', y='LogPrice', data=train_df);

In [None]:
sns.scatterplot(x='GarageArea', y='LogPrice', data=train_df);

In [None]:
sns.scatterplot(x='TotalBsmtSF', y='LogPrice', data=train_df);

In [None]:
sns.scatterplot(x='LotFrontage', y='LogPrice', data=train_df);

In [None]:
sns.scatterplot(x='LotArea', y='LogPrice', data=train_df);

## Data cleaning

I'm going to concatenate the train and test data to avoid duplicating code while cleaning the data. Later I will separate them again before the training part.

In [None]:
test_df = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

In [None]:
# Save train and test ID for final prediction on test part
test_id = test_df.pop('Id')
train_id = train_df.pop('Id')

# Save train length 
n_train = train_df.shape[0]

# Set target variable and drop it from dataset
labels = train_df.pop('LogPrice')

In [None]:
# Concatenate train and test part
df = pd.concat([train_df, test_df], axis=0)
df.reset_index(inplace=True, drop=True)
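
The concatenate-then-split pattern can be tried on a toy example first. This is a minimal sketch with made-up frames and hypothetical names (`toy_train`, `toy_test`); it only illustrates that positional slicing by the saved train length recovers both parts.

```python
import pandas as pd

# Toy frames standing in for the train and test sets (hypothetical values)
toy_train = pd.DataFrame({'LotArea': [8450, 9600, 11250],
                          'Street': ['Pave', 'Pave', 'Grvl']})
toy_test = pd.DataFrame({'LotArea': [11622, 14267],
                         'Street': ['Pave', None]})

n_toy_train = len(toy_train)

# Stack the rows and rebuild a clean 0..n-1 index
toy_all = pd.concat([toy_train, toy_test], axis=0).reset_index(drop=True)

# Shared cleaning happens once on the combined frame
toy_all['Street'] = toy_all['Street'].fillna(toy_all['Street'].mode()[0])

# Positional slicing recovers the two parts in their original order
toy_train_clean = toy_all.iloc[:n_toy_train]
toy_test_clean = toy_all.iloc[n_toy_train:]
```

One consequence of this design is that statistics such as the mode or median are computed over train and test rows together.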

In [None]:
df.shape

In [None]:
test_df.shape, train_df.shape

In [None]:
# Check empty values
pd.DataFrame({'Amount': df.isnull().sum(),
             'Percent': (df.isnull().sum() / len(df)) *100}).sort_values(by='Percent', ascending=False)


In PoolQC, MiscFeature, Alley and Fence most of the data is missing. I'm going to drop all these columns.

In [None]:
df = df.drop(['PoolQC', 'MiscFeature', 'Alley', 'Fence'], axis=1)

In [None]:
# FireplaceQu
df['Fireplaces'].value_counts()

In [None]:
df['FireplaceQu'].value_counts()

It seems FireplaceQu is empty for houses without a fireplace at all. I will fill it with 'NA'.

In [None]:
# Assign the result back; in-place fillna on a selected column is unreliable in newer pandas
df['FireplaceQu'] = df['FireplaceQu'].fillna('NA')

In [None]:
# LotFrontage 
sns.histplot(df['LotFrontage'], kde=True);  # replaces the deprecated distplot

'LotFrontage' has 486 empty records, which is too many to delete. The distribution contains no values equal to 0, so I will fill the gaps with the median rather than the mean, because the feature has some outliers.

In [None]:
lot_frontage_median = df['LotFrontage'].median()
df['LotFrontage'] = df['LotFrontage'].fillna(lot_frontage_median)
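
Why the median is preferred here can be seen on a handful of made-up values (hypothetical numbers, not taken from the dataset): a single extreme value shifts the mean a lot, while the median stays in the bulk of the data.

```python
import numpy as np
import pandas as pd

# Toy frontage values with one extreme outlier and two gaps (hypothetical numbers)
frontage = pd.Series([60.0, 65.0, 70.0, 75.0, 80.0, 313.0, np.nan, np.nan])

mean_value = frontage.mean()      # pulled upwards by the outlier
median_value = frontage.median()  # stays in the bulk of the data

# NaNs are ignored by both statistics and then replaced by the median
filled = frontage.fillna(median_value)
print(mean_value, median_value)
```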

In [None]:
# Garages' features
df[df['GarageYrBlt'].isnull()].head()



Where 'GarageYrBlt' is NaN, the other garage features are empty as well. According to the data description, this means the house has no garage.


In [None]:
# Fill the numerical feature with 0
df['GarageYrBlt'] = df['GarageYrBlt'].fillna(0) 

# Fill the categorical features with 'NA' (no garage)
for column in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    df[column] = df[column].fillna('NA')

In [None]:
# Bsmts' features
df[df['BsmtExposure'].isnull()].head()

The same situation as above: where 'BsmtExposure' is null, the other basement features are null as well.

In [None]:
for column in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    df[column] = df[column].fillna('NA')

In [None]:
# MasVnrType
df[df['MasVnrType'].isnull()].head() 

Same here: where 'MasVnrType' is null, 'MasVnrArea' should be 0.

In [None]:
df['MasVnrType'].value_counts()

In [None]:
df['MasVnrType'] = df['MasVnrType'].fillna('None')
df['MasVnrArea'] = df['MasVnrArea'].fillna(0)

The remaining columns have only a few empty values, but since I concatenated the train and test parts I can't delete those rows. <br>
I'm going to fill the remaining values with 0, 'None' or the most frequent value.
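
Filling with the most frequent value repeats for several columns below, so it could be wrapped in a small helper. This is only a sketch: `fill_with_mode` is a hypothetical function that is not used elsewhere in the notebook, and the toy frame is made up.

```python
import pandas as pd

def fill_with_mode(frame, columns):
    """Return a copy of `frame` with NaNs in `columns` replaced by each column's mode."""
    result = frame.copy()
    for column in columns:
        # mode() returns a Series (ties are possible); take the first entry
        result[column] = result[column].fillna(result[column].mode()[0])
    return result

# Toy frame with one gap per column (hypothetical values)
toy = pd.DataFrame({'MSZoning': ['RL', 'RL', 'RM', None],
                    'Electrical': ['SBrkr', None, 'SBrkr', 'FuseA']})
toy_filled = fill_with_mode(toy, ['MSZoning', 'Electrical'])
```

Returning a copy keeps the original frame untouched, which makes the helper easier to test.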

In [None]:
# MSZoning
df['MSZoning'].value_counts()

In [None]:
df['MSZoning'] = df['MSZoning'].fillna(df['MSZoning'].mode()[0])

In [None]:
# BsmtBaths
df['BsmtHalfBath'].value_counts()

In [None]:
df['BsmtFullBath'].value_counts()

In [None]:
df['BsmtFullBath'] = df['BsmtFullBath'].fillna(0)
df['BsmtHalfBath'] = df['BsmtHalfBath'].fillna(0)

In [None]:
# Functional
df['Functional'].value_counts()

In [None]:
df['Functional'] = df['Functional'].fillna(df['Functional'].mode()[0])

In [None]:
# Electrical
df['Electrical'].value_counts()

In [None]:
df['Electrical'] = df['Electrical'].fillna(df['Electrical'].mode()[0])

In [None]:
# Utilities
df['Utilities'].value_counts()

In [None]:
df['Utilities'] = df['Utilities'].fillna(df['Utilities'].mode()[0])

In [None]:
# TotalBsmtSF
df[df['TotalBsmtSF'].isnull()]

In [None]:
df.head(5)

TotalBsmtSF has a strong correlation with 1stFlrSF, so I will use 1stFlrSF to fill the missing TotalBsmtSF.

In [None]:
df['TotalBsmtSF'] = df['TotalBsmtSF'].fillna(df['1stFlrSF'])

In [None]:
# BsmtUnfSF, BsmtFinSF2, BsmtFinSF1
df['BsmtUnfSF'] = df['BsmtUnfSF'].fillna(0)
df['BsmtFinSF2'] = df['BsmtFinSF2'].fillna(0)
df['BsmtFinSF1'] = df['BsmtFinSF1'].fillna(0)

In [None]:
# Garage
df[df['GarageCars'].isnull()]

In [None]:
df['GarageCars'] = df['GarageCars'].fillna(0)
df['GarageArea'] = df['GarageArea'].fillna(0)

In [None]:
# Fill the remaining empty categorical values with the mode
df['Exterior1st'] = df['Exterior1st'].fillna(df['Exterior1st'].mode()[0])
df['Exterior2nd'] = df['Exterior2nd'].fillna(df['Exterior2nd'].mode()[0])
df['SaleType'] = df['SaleType'].fillna(df['SaleType'].mode()[0])
df['KitchenQual'] = df['KitchenQual'].fillna(df['KitchenQual'].mode()[0])

In [None]:
pd.DataFrame({'Amount': df.isnull().sum(),
              'Percent': (df.isnull().sum() / len(df)) *100}).sort_values(by='Percent', ascending=False)


## Feature engineering



So, now there are no more empty values. Let's transform the categorical data to numbers.
I will use 3 methods:

*     Label encoding
*     One-hot encoding
*     Mapping, for features with a clear scale

In general there are 3 main feature types with a clear scale; I will group these features by the scale they use.

In [None]:
qual_columns = ['GarageCond', 'GarageQual', 'FireplaceQu', 'KitchenQual', 'HeatingQC', 
           'BsmtCond', 'BsmtQual', 'ExterCond', 'ExterQual'] 
bsmt_columns = ['BsmtFinType2', 'BsmtFinType1'] 
exposure_columns = ['BsmtExposure']

qual_rates = {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'NA': 0}
bsmtype_rates = {'GLQ': 5, 'ALQ': 4, 'BLQ': 3, 'Rec': 2, 'LwQ': 1, 'Unf': -1, 'NA': 0}
exposure_rates = {'Gd': 4, 'Av': 3, 'Mn': 2, 'No': -1, 'NA': 0}

In [None]:
# Map features with clear scale
for feats, rate in ((qual_columns, qual_rates),  (bsmt_columns, bsmtype_rates), (exposure_columns, exposure_rates)):
    for feat in feats:
        df[feat] = df[feat].map(rate)
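
One thing to watch with `map` and a dictionary: any category without a key in the dictionary becomes NaN. The sketch below shows a possible check on toy data (the values and the deliberately misspelled category are made up).

```python
import pandas as pd

toy_rates = {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'NA': 0}
toy_quality = pd.Series(['Gd', 'TA', 'NA', 'Excellent'])  # last value is a deliberate typo

mapped = toy_quality.map(toy_rates)

# Values present before mapping but NaN afterwards had no key in the dict
unmapped = toy_quality[mapped.isnull() & toy_quality.notnull()].unique().tolist()
print(unmapped)
```

Running a check like this after the mapping loop would show whether any column holds a category the dictionaries do not cover.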

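Now let's encode the rest of the categorical features with label encoding and one-hot encoding. I will use the pandas functions **factorize** and **get_dummies**, which serve the same purpose as LabelEncoder and OneHotEncoder from sklearn (factorize numbers categories in order of appearance rather than in sorted order, so the integer codes can differ).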

In [None]:
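# Label encoding
encode = ['Functional', 'CentralAir', 'PavedDrive', 'GarageFinish', 'Street', 'LandSlope']

for feat in encode:
    df['{0}_cat'.format(feat)] = pd.factorize(df[feat])[0]

# One-hot encoding of the remaining text columns
# (np.object was removed from NumPy, so select by the 'object' dtype name)
categorical_features = [x for x in df.select_dtypes(include='object').columns if x not in encode]

for feat in categorical_features:
    # dtype=int keeps the dummies numeric; newer pandas versions default to bool,
    # which select_dtypes(include=np.number) would skip later on
    dummies = pd.get_dummies(df[feat], prefix='{0}'.format(feat), drop_first=True, dtype=int)
    df = pd.concat([df, dummies], axis=1)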
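
What the two pandas functions produce is easiest to see on a tiny column. The sketch below uses a made-up four-row Series; `sort=True` is shown only to illustrate the sorted numbering that sklearn's LabelEncoder follows.

```python
import pandas as pd

street = pd.Series(['Pave', 'Grvl', 'Pave', 'Pave'])

# factorize numbers categories in order of first appearance
codes, uniques = pd.factorize(street)

# sort=True numbers them in sorted order instead
sorted_codes, sorted_uniques = pd.factorize(street, sort=True)

# One-hot encoding; drop_first=True drops the first (sorted) category as the baseline
dummies = pd.get_dummies(street, prefix='Street', drop_first=True, dtype=int)
print(codes, sorted_codes, list(dummies.columns))
```

The integer labels differ between the two numberings, but they encode the same grouping of rows.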

Let's add some new features.

In [None]:
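# Total finished basement area
df['BsmtFin'] = df['BsmtFinSF1'] + df['BsmtFinSF2'] 
# Despite its name, TotalBsmt is the total area: basement plus both floors
df['TotalBsmt'] = df['TotalBsmtSF'] + df['1stFlrSF'] + df['2ndFlrSF']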

In [None]:
df.shape

I will split the data back into train and test datasets and keep the test data for the final prediction. <br>

In [None]:
train_set = df[:n_train]
test_set = df[n_train:]

## Machine learning part

In [None]:
# import necessary libraries
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from sklearn.preprocessing import MinMaxScaler

from sklearn.linear_model import Lasso, ElasticNet
from sklearn.svm import SVR
from sklearn.ensemble import ExtraTreesRegressor, VotingRegressor
from xgboost import XGBRegressor
from sklearn.neighbors import KNeighborsRegressor

I will use all numerical features for training the models (the labels are LogPrice). I will try a few models and pick the best ones.

In [None]:
# Set X and y
X = train_set[train_set.select_dtypes(include=np.number).columns].values

# Normalise features
scalar = MinMaxScaler()
X_scaled = scalar.fit_transform(X)
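
For reference, min-max scaling maps each column to [0, 1] using the minimum and maximum seen during fitting. The sketch below uses made-up arrays and a separate `toy_scaler` to show the formula and what happens to rows that were not part of the fit.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_fit = np.array([[0.0, 100.0], [5.0, 300.0], [10.0, 200.0]])
X_new = np.array([[2.5, 400.0]])

toy_scaler = MinMaxScaler()
X_fit_scaled = toy_scaler.fit_transform(X_fit)

# Same result by hand: (x - min) / (max - min), per column
manual = (X_fit - X_fit.min(axis=0)) / (X_fit.max(axis=0) - X_fit.min(axis=0))

# New rows reuse the fitted min/max, so values outside the fitted range leave [0, 1]
X_new_scaled = toy_scaler.transform(X_new)
print(X_new_scaled)
```

This is why the test features are later passed through `transform` only, never through `fit_transform`.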

In [None]:
# Create list of models
lasso_model = Lasso()
elastic_model = ElasticNet()
svr_model = SVR()
tree_model = ExtraTreesRegressor()
xgb_model = XGBRegressor()
knn_model = KNeighborsRegressor()

models = {'lasso_model': lasso_model,
         'elastic_model': elastic_model,
         'svr_model': svr_model,
         'tree_model': tree_model,
         'xgb_model': xgb_model,
         'knn_model': knn_model}

In [None]:
def cross_validation(model, X, y):
    """Return the mean RMSE over 5 cross-validation folds."""
    score = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    cross_score = np.sqrt(-score)
    return round(np.mean(cross_score), 4)
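
The sign flip inside the function deserves a short illustration: with `scoring='neg_mean_squared_error'`, scikit-learn returns negated MSE values so that higher is always better. A minimal sketch on synthetic data, with LinearRegression chosen only to keep the example fast:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem with noise of standard deviation 0.1
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

neg_mse = cross_val_score(LinearRegression(), X_toy, y_toy, cv=5,
                          scoring='neg_mean_squared_error')

# Scores are negated MSEs, so flip the sign before taking the square root
rmse_per_fold = np.sqrt(-neg_mse)
mean_rmse = rmse_per_fold.mean()
print(rmse_per_fold, mean_rmse)
```

Recent scikit-learn versions also accept `scoring='neg_root_mean_squared_error'`, which would avoid the manual square root.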

In [None]:
# Check models with cross validation
models_evaluation = {}
for model_name, model in models.items():
    models_evaluation[model_name] = cross_validation(model, X_scaled, labels)
    
pd.DataFrame(data=models_evaluation.items(), columns=['Model', 'RMSE']).sort_values(by='RMSE')

I'm going to combine SVR and XGB and train them with VotingRegressor. But first I will tune the models' parameters.

## Hyperoptimization

So let's try to achieve a little bit more. I'm going to improve the models using: <br>

* Feature importances (keep only significant features).
* A search for better model parameters

In [None]:
xgb_model.fit(X_scaled, labels)

In [None]:
# Get features importances
features_list = sorted(zip(xgb_model.feature_importances_, train_set.select_dtypes(include=np.number).columns), reverse=True)
features_list

In [None]:
# Leave only useful features
imp_feats = [feat for (n, feat) in features_list if n > 0.001]
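
The filtering step can be followed on made-up numbers. In the sketch below the importances and the feature names are hypothetical; a fitted tree ensemble would supply the array through `feature_importances_`.

```python
import numpy as np

# Hypothetical importances for four features
toy_names = ['OverallQual', 'GrLivArea', 'Street_Pave', 'Utilities_NoSeWa']
toy_importances = np.array([0.55, 0.40, 0.0495, 0.0005])

threshold = 0.001
order = np.argsort(toy_importances)[::-1]  # indices, most important first

# Keep only the features whose importance exceeds the threshold
kept = [toy_names[i] for i in order if toy_importances[i] > threshold]
print(kept)
```

The threshold is a tuning choice in its own right, so it may be worth comparing cross-validation scores for a few different cut-offs.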

In [None]:
# Set X with new feature set
X = train_set[imp_feats].values
X_scaled = scalar.fit_transform(X)

In [None]:
cross_validation(xgb_model, X_scaled, labels)

In [None]:
# Search for better parameters for xgb_model
param_grid = {'n_estimators': np.arange(100, 1500),
             'learning_rate': np.arange(0.01, 1, 0.01),
             'max_depth': np.arange(1, 20),
             # colsample_bytree is documented for the range (0, 1], so start at 0.1
             'colsample_bytree': np.linspace(0.1, 1.0, 10)}

random_search = RandomizedSearchCV(xgb_model, param_grid, cv=10, scoring='neg_mean_squared_error', n_iter=100)
random_search.fit(X_scaled, labels)

best_xgb = random_search.best_estimator_
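
Random search does not try every combination: it draws `n_iter` candidates from the search space. The sketch below uses scikit-learn's `ParameterSampler` on a reduced, illustrative space (`toy_space`) to show what such draws look like.

```python
import numpy as np
from sklearn.model_selection import ParameterSampler

toy_space = {'n_estimators': np.arange(100, 1500),
             'max_depth': np.arange(1, 20)}

# Each draw is one candidate configuration; random_state makes the draws repeatable
candidates = list(ParameterSampler(toy_space, n_iter=5, random_state=0))
for params in candidates:
    print(params)
```

With a space this large, 100 iterations cover only a small fraction of the combinations, so results can vary between runs unless `random_state` is fixed.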

In [None]:
cross_validation(best_xgb, X_scaled, labels)

In [None]:
# Search for better parameters for the SVR model
svr_params = {'C': np.arange(1, 30),
             'kernel': ('linear', 'poly', 'rbf', 'sigmoid')}

hyperopt_svr = RandomizedSearchCV(svr_model, svr_params, cv=10, scoring='neg_mean_squared_error', n_iter=100)
hyperopt_svr.fit(X_scaled, labels)

best_svr = hyperopt_svr.best_estimator_

In [None]:
cross_validation(best_svr, X_scaled, labels)

So, these steps improved our models. Now let's combine the best models into one and train our final model.

In [None]:
# Ensemble of the best models
voting_reg = VotingRegressor(estimators=[('xgb', best_xgb), ('svr', best_svr)])

cross_validation(voting_reg, X_scaled, labels)
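
With default settings, VotingRegressor predicts the unweighted mean of its members' predictions. The sketch below checks this on synthetic data with two simple stand-in models (LinearRegression and KNeighborsRegressor, chosen only for speed).

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X_toy = rng.uniform(size=(100, 2))
y_toy = 3.0 * X_toy[:, 0] - X_toy[:, 1] + rng.normal(scale=0.05, size=100)

toy_voting = VotingRegressor(estimators=[('lin', LinearRegression()),
                                         ('knn', KNeighborsRegressor(n_neighbors=3))])
toy_voting.fit(X_toy, y_toy)

# estimators_ holds the fitted clones; the ensemble output is their unweighted mean
individual = np.column_stack([est.predict(X_toy) for est in toy_voting.estimators_])
ensemble = toy_voting.predict(X_toy)
```

Because the ensemble fits clones of the estimators it is given, the tuned models are refitted with their chosen hyperparameters when the ensemble is trained.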

In [None]:
# Final model training 
voting_reg.fit(X_scaled, labels)

## Test part

In [None]:
# Get X for test part
X_test = test_set[imp_feats].values
X_test_scaled = scalar.transform(X_test)

# Make prediction
y_pred = voting_reg.predict(X_test_scaled)

# Convert LogPrice back to the price scale and save it to csv for upload to Kaggle
test_file = pd.DataFrame({'Id': test_id, 'SalePrice': np.exp(y_pred)})
test_file.to_csv('submission.csv', index=False)

So, I achieved a score of 0.12089 on the test set in the Kaggle competition (Kaggle uses the RMSLE metric).
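
The link between this metric and the log target can be sketched as follows, assuming the metric is the RMSE between the logarithm of the predicted price and the logarithm of the observed price. The prices below are made up and `rmse_of_logs` is a hypothetical helper.

```python
import numpy as np

def rmse_of_logs(y_true, y_pred):
    """RMSE between log(y_pred) and log(y_true); both must be strictly positive."""
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# Hypothetical prices, not competition data
true_prices = np.array([200000.0, 150000.0, 320000.0])
log_errors = np.array([0.1, -0.1, 0.0])
log_predictions = np.log(true_prices) + log_errors

# The submission stores exp(prediction), so the metric on prices
# equals the plain RMSE of the model's log-scale errors
submitted_prices = np.exp(log_predictions)
score = rmse_of_logs(true_prices, submitted_prices)
print(score)
```

Under that assumption, the cross-validated RMSE on LogPrice is directly comparable to the leaderboard score.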