<h1><center> Content </center></h1>

* [I. Getting started](#I)
    * [1. Importing some basic libraries](#I_1)
    * [2. Loading data](#I_2)
    * [3. Preliminary cleaning](#I_3)


* [II. Missing values](#II)
    * [1. Visualising NaNs](#II_1)
    * [2. Imputing NaN values. Training set](#II_2)
    * [3. Imputing NaN values. Test set](#II_3)
    
   
* [III. EDA](#III)
    * [1. Visualising potential numeric variables](#III_1)
    * [2. Visualising categorical variables](#III_2)


* [IV. Feature engineering](#IV)
    * [1. Dealing with outliers](#IV_1)
    * [2. Adding some new variables](#IV_2)
    * [3. Binning imbalanced features](#IV_3)
    * [4. Transforming skewed variables](#IV_4)
    * [5. Encoding variables](#IV_5)
    * [6. Getting the final training and test sets](#IV_6)


* [V. Building models](#V)
    * [1. Tuning models](#V_1)
    * [2. Stacking](#V_2)


* [VI. Some techniques that could have been useful](#VI)
    * [1. Feature interactions](#VI_1)

<h1><center> I. Getting started </center></h1> <a class="anchor" id = "I"></a>

## 1. Importing some basic libraries <a class="anchor" id = "I_1"></a>

In [None]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt

## 2. Loading data <a class="anchor" id = "I_2"></a>

In [None]:
df_train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
df_test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

## 3. Preliminary cleaning <a class="anchor" id = "I_3"></a>

I want to delete "Id" columns, as they won't provide us with any useful information.

In [None]:
df_train = df_train.drop(columns = 'Id')
df_test = df_test.drop(columns = 'Id')

Next, let's make sure that columns with a limited range of values don't contain any obviously incorrect observations. To do so, we can simply look at the maximum of every year column:

In [None]:
Check_years = df_train.columns[df_train.columns.str.contains(pat = 'Year|Yr')] 

In [None]:
df_train[Check_years.values].max().sort_values(ascending = False)

In [None]:
df_test[Check_years.values].max().sort_values(ascending = False)

The year 2207 was replaced with the mode of the column.

In [None]:
Replace_year = df_test.loc[(df_test['GarageYrBlt'] > 2050), 'GarageYrBlt'].index.tolist()
# mode() returns a Series, so take its first element to assign a scalar
df_test.loc[Replace_year, 'GarageYrBlt'] = df_test['GarageYrBlt'].mode()[0]
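
A minimal sketch of this kind of replacement on a toy column (the data below is made up for illustration). Since `mode()` returns a Series, the sketch takes its first element so that a scalar is assigned to the selected rows:

```python
import pandas as pd

# Toy column with one impossible year (hypothetical data, not the real test set)
toy = pd.DataFrame({'GarageYrBlt': [2005.0, 2005.0, 1998.0, 2207.0]})

# Rows whose year lies in the future
bad_rows = toy.index[toy['GarageYrBlt'] > 2050]

# mode() returns a Series; [0] picks the most frequent value as a scalar
toy.loc[bad_rows, 'GarageYrBlt'] = toy['GarageYrBlt'].mode()[0]
```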

<h1><center> II. Missing values </center></h1> <a class="anchor" id = "II"></a>

## 1. Visualising NaNs <a class="anchor" id = "II_1"></a>

Before we can build any models or engineer some features, we have to deal with missing values. First of all, I created a visual representation of NaN values that helped me understand their structure.

In [None]:
train_missing = df_train.count().loc[df_train.count() < 1460].sort_values(ascending = False)
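
As a side note, the hard-coded row count can be avoided by counting NaNs directly. This is only a sketch on a toy frame (the values are invented), not the cell used in the notebook:

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing values (hypothetical data)
toy = pd.DataFrame({'LotFrontage': [65.0, np.nan, 80.0],
                    'Alley': [np.nan, np.nan, 'Grvl'],
                    'LotArea': [8450, 9600, 11250]})

# Count NaNs per column and keep only the columns that have any
n_missing = toy.isna().sum()
n_missing = n_missing[n_missing > 0].sort_values(ascending=False)
```

Note that this counts missing values, whereas the plots in this section show non-missing counts.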

Setting some global parameters for all plots was done with the aid of <code style = "background-color: #faedde">sns.set_theme(rc = {})</code>.

In [None]:
sns.set_theme(rc = {'grid.linewidth': 0.6, 'grid.color': 'white',
                    'axes.linewidth': 1, 'axes.facecolor': '#ECECEC', 
                    'axes.labelcolor': '#000000',
                    'figure.facecolor': 'white',
                    'xtick.color': '#000000', 'ytick.color': '#000000'})

In [None]:
with plt.rc_context(rc = {'figure.dpi': 120, 'axes.labelsize': 8.5, 
                          'xtick.labelsize': 6, 'ytick.labelsize': 6}): 

    fig, ax = plt.subplots(1, 1, figsize = (6, 4))

    sns.barplot(x = train_missing.values, y = train_missing.index, palette = 'viridis')

    plt.xlabel('Non-Na values')

    plt.show()

In [None]:
test_missing = df_test.count().loc[df_test.count() < 1459].sort_values(ascending = False)

In [None]:
with plt.rc_context(rc = {'figure.dpi': 120, 'axes.labelsize': 8.5, 
                          'xtick.labelsize': 6, 'ytick.labelsize': 6}):
    
    fig, ax = plt.subplots(1, 1, figsize = (7, 6))

    sns.barplot(x = test_missing.values, y = test_missing.index, palette = 'viridis')

    plt.xlabel('Non-Na values')

    plt.show()

Based on the data description, we can conclude that NaN values in some columns are actually a category, namely "Not present". So, instead of dropping these columns, we can make them "clean".

In [None]:
None_category = ['PoolQC', 'MiscFeature', 'Alley', 'Fence', 
                 'FireplaceQu', 'GarageCond', 'GarageQual', 
                 'GarageFinish', 'GarageType', 'BsmtCond', 
                 'BsmtExposure', 'BsmtQual', 'BsmtFinType1', 
                 'BsmtFinType2']

In [None]:
for column in None_category:
    
    df_train.loc[df_train[column].isnull(), column] = 'None'
    df_test.loc[df_test[column].isnull(), column] = 'None'
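
The same replacement can probably be written without a loop by calling `fillna` on the whole group of columns. A small sketch with invented data and a hypothetical `none_cols` list:

```python
import numpy as np
import pandas as pd

# Toy frame where NaN means "feature not present" (hypothetical data)
toy = pd.DataFrame({'PoolQC': [np.nan, 'Gd', np.nan],
                    'Fence': ['MnPrv', np.nan, np.nan]})

none_cols = ['PoolQC', 'Fence']

# Replace NaN with the explicit category 'None' in all listed columns at once
toy[none_cols] = toy[none_cols].fillna('None')
```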

## 2. Imputing NaN values. Training set <a class="anchor" id = "II_2"></a>

In [None]:
df_train.loc[:, df_train.isna().sum() > 0].isna().sum().sort_values(ascending = False)

I used a KNN imputer when the number of missing values was relatively large; however, when there were only a few NaNs, I thought that replacing them with the mode or mean of the respective column was a reasonable choice.

### 2.1 "LotFrontage"

In [None]:
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler

Strictly speaking, the variables had already been separated by type (by analysing graphs) before this step; since EDA is never linear, I reused some pieces of code that had been written beforehand.

I also want to mention that I imputed NaN values in numeric columns using only numeric variables and categorical columns using only categorical variables.

In [None]:
cont_vars = ['LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1', 
             'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', 
             '2ndFlrSF', 'GrLivArea', 'GarageArea', 'WoodDeckSF', 
             'OpenPorchSF', 'EnclosedPorch', 'ScreenPorch']

Remember, it is important to scale your data before using KNN because this algorithm is distance-based. Robust scaler was used since outliers had not been dealt with yet.

In [None]:
knn_vars_train_cont = df_train[cont_vars].copy()

Scaler = RobustScaler()

knn_vars_train_cont = pd.DataFrame(Scaler.fit_transform(knn_vars_train_cont), 
                                   columns = ["col" + str(i) for i in range(0, 15)])

train_imp_cont = KNNImputer(n_neighbors = 5, weights = 'uniform', metric = 'nan_euclidean')

Also, don't forget to inverse transform your data.

In [None]:
train_imp_cont_results = train_imp_cont.fit_transform(knn_vars_train_cont)

train_imp_cont_results = pd.DataFrame(Scaler.inverse_transform(train_imp_cont_results), 
                                      columns = ["col" + str(i) for i in range(0, 15)])

In [None]:
df_train['LotFrontage'] = train_imp_cont_results['col0']
df_train['MasVnrArea'] = train_imp_cont_results['col2'].astype('float64')
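
To make the scale, impute and inverse-transform sequence easier to follow, here is a minimal sketch on a toy frame. The data and the choice of `n_neighbors = 2` are assumptions for illustration; the sketch also keeps the original column names instead of generic "col" labels:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import RobustScaler

# Toy numeric frame with gaps in one column (hypothetical data)
toy = pd.DataFrame({'LotFrontage': [60.0, np.nan, 80.0, 75.0, np.nan, 65.0],
                    'LotArea': [8000.0, 9000.0, 11000.0, 10000.0, 8500.0, 8700.0]})

# 1. Scale (RobustScaler ignores NaNs when fitting and keeps them on transform)
scaler = RobustScaler()
scaled = scaler.fit_transform(toy)

# 2. Impute in the scaled space, where distances are comparable
imputer = KNNImputer(n_neighbors=2, weights='uniform', metric='nan_euclidean')
imputed_scaled = imputer.fit_transform(scaled)

# 3. Go back to the original units, keeping column names and index
toy_imputed = pd.DataFrame(scaler.inverse_transform(imputed_scaled),
                           columns=toy.columns, index=toy.index)
```

Observed values should come back unchanged (up to floating point error), and only the gaps are filled.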

### 2.2 Other variables (few missing values)

In [None]:
for column in ['MasVnrType', 'Electrical']:
    
    df_train.loc[df_train[column].isnull(), column] = df_train[column].mode()[0]

### 2.3 "GarageYrBlt"

In [None]:
from sklearn.preprocessing import LabelEncoder

Label encoder was used to transform categorical data before feeding it to KNN.

In [None]:
knn_vars_train_cat = df_train.drop(cont_vars, axis = 1)
knn_vars_train_cat = knn_vars_train_cat.drop('SalePrice', axis = 1)

obj_vars = knn_vars_train_cat.select_dtypes(include = ['object', 'category']).columns

In [None]:
for column in obj_vars:
    
    knn_vars_train_cat[column] = LabelEncoder().fit_transform(knn_vars_train_cat[column])

In [None]:
train_imp_cat = KNNImputer(n_neighbors = 5, weights = 'uniform', metric = 'nan_euclidean')

train_imp_cat_results = train_imp_cat.fit_transform(knn_vars_train_cat)

train_imp_cat_results = pd.DataFrame(train_imp_cat_results, 
                                     columns = ["col" + str(i) for i in range(0, 64)])

In [None]:
df_train['GarageYrBlt'] = train_imp_cat_results['col48']
df_train['GarageYrBlt'] = df_train['GarageYrBlt'].astype('int64')

## 3. Imputing NaN values. Test set <a class="anchor" id = "II_3"></a>

### 3.1 "LotFrontage"

The process was the same for the test set, except that the cell below uses StandardScaler rather than RobustScaler.

In [None]:
knn_vars_test_cont = df_test[cont_vars].copy()

Scaler = StandardScaler()

knn_vars_test_cont = pd.DataFrame(Scaler.fit_transform(knn_vars_test_cont), 
                                  columns = ["col" + str(i) for i in range(0, 15)])

test_imp_cont = KNNImputer(n_neighbors = 5, weights = 'uniform', metric = 'nan_euclidean')

In [None]:
test_imp_cont_results = test_imp_cont.fit_transform(knn_vars_test_cont)

test_imp_cont_results = pd.DataFrame(Scaler.inverse_transform(test_imp_cont_results), 
                                     columns = ["col" + str(i) for i in range(0, 15)])

In [None]:
df_test['LotFrontage'] = test_imp_cont_results['col0']

### 3.2 Other variables (few missing values)

In [None]:
for column in df_test.columns: 
    
    if ((df_test[column].isnull().sum() <= 60) & (df_test[column].isnull().sum() > 0) & 
        ((df_test[column].dtypes == 'O') | (df_test[column].dtypes == 'float64')) & 
        (df_test[column].nunique() < 20)):
        
        df_test.loc[df_test[column].isnull(), column] = df_test[column].mode()[0]
        
    elif ((df_test[column].isnull().sum() <= 60) & (df_test[column].isnull().sum() > 0) & 
          (df_test[column].dtypes == 'float64') & (df_test[column].nunique() > 100)):
        
        df_test.loc[df_test[column].isnull(), column] = df_test[column].mean()
        
    else: pass

### 3.3 "GarageYrBlt"

In [None]:
knn_vars_test_cat = df_test.drop(cont_vars, axis = 1)

In [None]:
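# Encode only the object columns (as for the training set), so that the NaNs 
# in numeric columns such as 'GarageYrBlt' stay visible to the imputer
for column in knn_vars_test_cat.select_dtypes(include = ['object', 'category']).columns: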
    
    knn_vars_test_cat[column] = LabelEncoder().fit_transform(knn_vars_test_cat[column])

In [None]:
test_imp_cat = KNNImputer(n_neighbors = 5, weights = 'uniform', metric = 'nan_euclidean')

test_imp_cat_results = test_imp_cat.fit_transform(knn_vars_test_cat)

test_imp_cat_results = pd.DataFrame(test_imp_cat_results, 
                                    columns = ["col" + str(i) for i in range(0, 64)])

In [None]:
df_test['GarageYrBlt'] = test_imp_cat_results['col48']
df_test['GarageYrBlt'] = df_test['GarageYrBlt'].astype('int64')

Finally, we can check the number of NaN values left:

In [None]:
print(df_train.isna().sum().any(), df_test.isna().sum().any(), sep = '\n')

<h1><center> III. EDA </center></h1> <a class="anchor" id = "III"></a>

For starters, we should separate variables by their type in order to figure out which columns are categorical and which are numeric; this is crucial for further analysis.

In [None]:
train_obj = df_train.select_dtypes(include = ['object', 'category']).columns

train_int_float = df_train.select_dtypes(include = ['int64', 'float64'])
col_order = train_int_float.nunique().sort_values(ascending = False).index.tolist()
train_int_float = train_int_float[col_order].columns

## 1. Visualising potential numeric variables <a class="anchor" id = "III_1"></a>

<div style = "color: #000000;
             display: fill;
             padding: 8px;
             border-radius: 5px;
             border-style: solid;
             border-color: #a63700;
             background-color: rgba(235, 125, 66, 0.3)">
    
<span style = "font-size: 20px; font-weight: bold">Note:</span> 
<span style="font-size: 15px">If you want to learn more about efficiently creating neat visualisations, please refer to this <a href="https://www.kaggle.com/suprematism/house-prices-advanced-visualisation">notebook</a>.</span>
</div>

In [None]:
with plt.rc_context(rc = {'figure.dpi': 500, 'axes.labelsize': 7, 
                          'xtick.labelsize': 5, 'ytick.labelsize': 5}): 

    fig, ax = plt.subplots(5, 5, figsize = (8.5, 10), sharey = True)

    for idx, (column, axes) in list(enumerate(zip(train_int_float[0:22], ax.flatten()))):
    
        sns.scatterplot(ax = axes, x = df_train[column], 
                        y = np.log(df_train['SalePrice']), 
                        hue =  np.log(df_train['SalePrice']), 
                        palette = 'viridis', alpha = 0.7, s = 8)
    
        axes.legend([], [], frameon = False)
    
    else:
    
        [axes.set_visible(False) for axes in ax.flatten()[idx + 1:]]

    plt.tight_layout()
    plt.show()

In [None]:
with plt.rc_context(rc = {'figure.dpi': 500, 'axes.labelsize': 7, 
                          'xtick.labelsize': 5, 'ytick.labelsize': 5}): 

    fig, ax = plt.subplots(5, 4, figsize = (8.5, 9), sharey = True)

    for idx, (column, axes) in list(enumerate(zip(train_int_float[22:], ax.flatten()))):
    
        sns.scatterplot(ax = axes, x = df_train[column], 
                        y = np.log(df_train['SalePrice']), 
                        hue =  np.log(df_train['SalePrice']), 
                        palette = 'viridis', alpha = 0.7, s = 8)
    
        axes.legend([], [], frameon = False)
    
    else:
    
        [axes.set_visible(False) for axes in ax.flatten()[idx + 1:]]
    
    plt.tight_layout()
    plt.show()

Based on graphs like the ones above, we can easily determine which variables are actually continuous. In addition, I kept imbalanced predictors separate from balanced ones.

In [None]:
train_cont_balanced = ['LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1', 
                       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', 
                       '2ndFlrSF', 'GrLivArea', 'GarageArea', 'WoodDeckSF', 
                       'OpenPorchSF', 'EnclosedPorch', 'ScreenPorch']

train_cont_unbalanced = ['LowQualFinSF', '3SsnPorch' , 'PoolArea' , 'MiscVal']

## 2. Visualising categorical variables <a class="anchor" id = "III_2"></a>

Let's get only categorical features:

In [None]:
train_cat = df_train.drop(train_cont_balanced, axis = 1).columns.tolist()
train_cat.remove('SalePrice')

### 2.1 High cardinality features

It is important to visualise variables with lots of categories, as you can observe whether they have some kind of relationship with the target. If they do, you should encode them properly rather than drop them.

In [None]:
df_train[train_cat].loc[:, df_train.nunique() > 25].nunique().sort_values(ascending = False)

In [None]:
train_high_cat = df_train[train_cat].loc[:, df_train.nunique() > 25].copy()

In [None]:
for column in train_high_cat.columns:
    
    train_high_cat[column] = train_high_cat[column].astype('category')

In [None]:
with plt.rc_context(rc = {'figure.dpi': 450, 'axes.labelsize': 5, 
                          'xtick.labelsize': 4, 'ytick.labelsize': 4}): 

    fig, ax = plt.subplots(1, 3, figsize = (6, 7.5))

    for idx, (column, axes) in list(enumerate(zip(train_high_cat.columns, ax.flatten()))): 
    
        sns.stripplot(ax = axes, x = np.log(df_train['SalePrice']), 
                      y = train_high_cat[column], 
                      palette = 'viridis', alpha = 0.95, size = 1.5)

        sns.boxplot(ax = axes, x = np.log(df_train['SalePrice']), 
                    y = train_high_cat[column],
                    showmeans = True, meanline = True, zorder = 10,
                    meanprops = {'color': 'r', 'linestyle': '-', 'lw': 0.8},
                    medianprops = {'visible': False},
                    whiskerprops = {'visible': False},
                    showfliers = False, showbox = False, showcaps = False)
        
        sns.pointplot(ax = axes, x = np.log(df_train['SalePrice']), 
                      y = train_high_cat[column],
                      ci = None, color = 'r', scale = 0.15)
    
    else: 
    
        [axes.set_visible(False) for axes in ax.flatten()[idx + 1:]]
    
    plt.tight_layout()
    plt.show()

### 2.2 Variables with manageable cardinality

For the sake of experimenting with <span style = "color: #E85E40"> matplotlib </span> and <span style = "color: #E85E40"> seaborn </span>, I plotted some categorical features. In my judgment, stripplots with categories ordered by the target mean can be quite insightful. First and foremost, you can clearly see how many observations each category contains, which is vital if you want to isolate imbalanced features. On top of that, once the categories are ordered, a relationship (if present) between an independent variable and the target becomes evident.

In [None]:
train_norm_cat = df_train[train_cat].loc[:, df_train.nunique() <= 25].columns.tolist()

In [None]:
with plt.rc_context(rc = {'figure.dpi': 500, 'axes.labelsize': 7, 
                          'xtick.labelsize': 5.5, 'ytick.labelsize': 5.5}): 

    fig, ax = plt.subplots(5, 3, figsize = (8, 13), sharey = True)

    for idx, (column, axes) in list(enumerate(zip(train_norm_cat[: 15], ax.flatten()))):
    
        order = df_train.groupby(column)['SalePrice'].mean().sort_values(ascending = True).index
    
        sns.violinplot(ax = axes, x = df_train[column], 
                       y = np.log(df_train['SalePrice']),
                       order = order, scale = 'width',
                       linewidth = 0.3, palette = 'viridis',
                       saturation = 0.5, inner = None)
    
        plt.setp(axes.collections, alpha = 0.3)
    
        sns.stripplot(ax = axes, x = df_train[column], 
                      y = np.log(df_train['SalePrice']),
                      palette = 'viridis', s = 1.3, alpha = 0.9,
                      order = order)
    
        sns.boxplot(ax = axes, x = df_train[column], order = order,
                    y = np.log(df_train['SalePrice']),
                    showmeans = True, meanline = True, zorder = 10,
                    meanprops = {'color': 'r', 'linestyle': '--', 'lw': 0.6},
                    medianprops = {'visible': False},
                    whiskerprops = {'visible': False},
                    showfliers = False, showbox = False, showcaps = False)
        
        if df_train[column].nunique() > 5: 
        
            plt.setp(axes.get_xticklabels(), rotation = 90)
    
    else:
    
        [axes.set_visible(False) for axes in ax.flatten()[idx + 1:]]

    plt.tight_layout()
    plt.show()
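
The ordering used in the plots above boils down to a group-by on the target. A small sketch with invented values shows both pieces of information at once, the category order and the number of observations per category:

```python
import pandas as pd

# Toy categorical feature and target (hypothetical data)
toy = pd.DataFrame({'KitchenQual': ['TA', 'Gd', 'Ex', 'TA', 'Gd', 'Fa'],
                    'SalePrice': [140000, 210000, 330000, 150000, 200000, 105000]})

# Categories sorted by their mean target value, lowest first
order = (toy.groupby('KitchenQual')['SalePrice']
            .mean()
            .sort_values(ascending=True)
            .index.tolist())

# Observations per category, listed in the same order
counts = toy['KitchenQual'].value_counts().reindex(order)
```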

<h1><center> IV. Feature engineering </center></h1> <a class="anchor" id = "IV"></a>

## 1. Dealing with outliers <a class="anchor" id = "IV_1"></a>

How did I determine which observations were outliers? I simply ran a Lasso model and collected the largest residuals. I didn't do it in this segment because all columns needed to be properly encoded first (remember that actual analysis is not linear, while writing a notebook is).

In [None]:
indx_final = [30, 462, 495, 523, 588, 632, 968, 1298, 1324]

In [None]:
df_train = df_train.drop(indx_final, axis = 0).reset_index(drop = True)

Here is an example of how you can do it yourself:

In [None]:
##### Training a model #####

# Lasso_outliers = linear_model.Lasso(alpha = 0.0005)

# Lasso_fit = Lasso_outliers.fit(X_train, y)

##### Getting outliers #####

# rows_to_drop = (Lasso_fit.predict(X_train) - y)**2   # y is the log of 'SalePrice'
# rows_to_drop[rows_to_drop > 0.2].index

Before setting a threshold, try plotting residuals. It can help a lot.
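
For a self-contained illustration of the idea, here is a sketch on synthetic data with one injected outlier. The data, the coefficients and the threshold of 0.2 are assumptions made for this example only; on real data, inspect the sorted residuals (or plot them) before picking a cut-off:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic regression problem (hypothetical data)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.uniform(0, 1, size=(200, 3)), columns=['a', 'b', 'c'])
y = pd.Series(X.to_numpy() @ np.array([1.0, -0.5, 2.0]) 
              + rng.normal(0, 0.05, size=200))

# Inject one obvious outlier into the target
y.iloc[17] += 3.0

# Fit the model and compute squared residuals for every row
model = Lasso(alpha=0.0005).fit(X, y)
squared_resid = (model.predict(X) - y) ** 2

# Look at the largest residuals first, then apply a threshold
largest = squared_resid.sort_values(ascending=False).head(5)
flagged = squared_resid[squared_resid > 0.2].index.tolist()
```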

## 2. Adding some new variables <a class="anchor" id = "IV_2"></a>

I added some variables that made sense to me. For instance, I calculated the total number of rooms (kitchens, bathrooms and other rooms).

In [None]:
# 'OpenPorchSF' is assumed to be the intended fourth term 
# ('ScreenPorch' was previously added twice)
df_train['TotalPorch'] = (df_train['ScreenPorch'] + df_train['EnclosedPorch'] + 
                          df_train['3SsnPorch'] + df_train['OpenPorchSF'])

df_train['Rooms_kitchens'] = (df_train['TotRmsAbvGrd'] + df_train['BsmtFullBath'] + 
                              df_train['BsmtHalfBath'] + df_train['FullBath'] + 
                              df_train['HalfBath'])

df_train['Sqr_feet_per_room'] = ((df_train['1stFlrSF'] + 
                                  df_train['2ndFlrSF']) / df_train['TotRmsAbvGrd'])

In [None]:
train_cont_balanced.append('TotalPorch')
train_cont_balanced.append('Sqr_feet_per_room')

In [None]:
df_test['TotalPorch'] = (df_test['ScreenPorch'] + df_test['EnclosedPorch'] + 
                         df_test['3SsnPorch'] + df_test['OpenPorchSF'])

df_test['Rooms_kitchens'] = (df_test['TotRmsAbvGrd'] + df_test['BsmtFullBath'] + 
                             df_test['BsmtHalfBath'] + df_test['FullBath'] + 
                             df_test['HalfBath'])

df_test['Sqr_feet_per_room'] = ((df_test['1stFlrSF'] + 
                                 df_test['2ndFlrSF']) / df_test['TotRmsAbvGrd'])

## 3. Binning imbalanced features <a class="anchor" id = "IV_3"></a>

Imbalanced numeric variables were identified earlier, during EDA.

In [None]:
with plt.rc_context(rc = {'figure.dpi': 500, 'axes.labelsize': 7, 
                          'xtick.labelsize': 6, 'ytick.labelsize': 6,
                          'legend.fontsize': 6, 'legend.title_fontsize': 6}): 

    fig, ax = plt.subplots(1, 4, figsize = (8, 3), sharey = True)

    for idx, (column, axes) in list(enumerate(zip(train_cont_unbalanced, ax.flatten()))):
    
        sns.scatterplot(ax = axes, x = df_train[column], 
                        y = np.log(df_train['SalePrice']), 
                        hue = np.log(df_train['SalePrice']), 
                        palette = 'viridis', alpha = 0.8, s = 9)

    axes_legend = ax.flatten()

    axes_legend[0].legend(title = 'SalePrice', loc = 'lower right')
    axes_legend[1].legend(title = 'SalePrice', loc = 'lower right')
    axes_legend[3].legend(title = 'SalePrice', loc = 'lower right')
    
    plt.tight_layout()
    plt.show()

In [None]:
for column in train_cont_unbalanced:
    
    df_train.loc[(df_train[column] == 0), column] = 'None' 
    
    df_train.loc[(df_train[column] != 0) & (df_train[column] != 'None'), column] = 'Present'

In [None]:
for column in train_cont_unbalanced:
    
    df_test.loc[(df_test[column] == 0), column] = 'None' 
    
    df_test.loc[(df_test[column] != 0) & (df_test[column] != 'None'), column] = 'Present'
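
The same binning can be sketched in a single step with `np.where`, which avoids writing strings into a numeric column in two passes. The toy values are invented:

```python
import numpy as np
import pandas as pd

# Toy imbalanced feature: mostly zeros (hypothetical data)
toy = pd.DataFrame({'PoolArea': [0, 0, 512, 0, 648]})

# Zero means the feature is absent; anything else means it is present
toy['PoolArea'] = np.where(toy['PoolArea'] == 0, 'None', 'Present')
```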

## 4. Transforming skewed variables <a class="anchor" id = "IV_4"></a>

In [None]:
with plt.rc_context(rc = {'figure.dpi': 500, 'axes.labelsize': 7, 
                          'xtick.labelsize': 5, 'ytick.labelsize': 5}):
    
    fig, ax = plt.subplots(5, 4, figsize = (8.5, 9))

    for idx, (column, axes) in list(enumerate(zip(train_cont_balanced, ax.flatten()))):
    
        sns.kdeplot(ax = axes, x = df_train[column], 
                    fill = True, alpha = 0.2, color = '#006e7a',
                    linewidth = 0.8)
    
    else:
    
        [axes.set_visible(False) for axes in ax.flatten()[idx + 1:]]
    
    plt.tight_layout()
    plt.show()

As far as I can tell, making features look more "normal" is not the number one priority, especially if we take into account that we are not doing statistical analysis. We care about how accurate our models are. Thus, I simply used log transformation.

In [None]:
df_train[train_cont_balanced] = np.log(df_train[train_cont_balanced] + 1)
df_test[train_cont_balanced] = np.log(df_test[train_cont_balanced] + 1)
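
For reference, `np.log(x + 1)` is the same transformation as `np.log1p(x)`, and `np.expm1` reverses it. A quick check on invented values:

```python
import numpy as np
import pandas as pd

# Toy skewed feature that includes a zero (hypothetical data)
toy = pd.DataFrame({'LotArea': [0.0, 8450.0, 215245.0]})

via_log = np.log(toy['LotArea'] + 1)   # the form used in this notebook
via_log1p = np.log1p(toy['LotArea'])   # equivalent built-in

# expm1 is the inverse of log1p, which is handy for going back to original units
restored = np.expm1(via_log1p)
```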

## 5. Encoding variables <a class="anchor" id = "IV_5"></a>

### 5.1 Mean encoding

High cardinality features were encoded via mean encoding with cross validation and regularisation, which are a must if you want to prevent overfitting.

I used a piece of code from this excellent notebook:

https://www.kaggle.com/vprokopev/mean-likelihood-encodings-a-comprehensive-study

I can also recommend two great videos that cover the idea of mean encoding and various regularisation techniques: 

https://www.coursera.org/lecture/competitive-data-science/concept-of-mean-encoding-b5Gxv

https://www.coursera.org/lecture/competitive-data-science/regularization-LGYQ2

In [None]:
from sklearn.model_selection import KFold

In [None]:
def mean_encode(train_data, test_data, columns, target_col, alpha = 0, folds = 1):
    encoded_cols = []
    target_mean_global = train_data[target_col].mean()
    for col in columns:
        # Getting means for test data
        nrows_cat = train_data.groupby(col)[target_col].count()
        target_means_cats = train_data.groupby(col)[target_col].mean()
        target_means_cats_adj = (target_means_cats*nrows_cat + 
                                 target_mean_global*alpha)/(nrows_cat+alpha)
        # Mapping means to test data
        encoded_col_test = test_data[col].map(target_means_cats_adj)
        # Getting a train encodings
        kfold = KFold(folds, shuffle=True, random_state=1).split(train_data[target_col].values)
        parts = []
        
        for tr_in, val_ind in kfold:
            # divide data
            df_for_estimation, df_estimated = train_data.iloc[tr_in], train_data.iloc[val_ind]
            # getting means on data for estimation (all folds except estimated)
            nrows_cat = df_for_estimation.groupby(col)[target_col].count()
            target_means_cats = df_for_estimation.groupby(col)[target_col].mean()
            target_means_cats_adj = (target_means_cats*nrows_cat + 
                                         target_mean_global*alpha)/(nrows_cat+alpha)
            # Mapping means to estimated fold
            encoded_col_train_part = df_estimated[col].map(target_means_cats_adj)
 
            # Saving estimated encodings for a fold
            parts.append(encoded_col_train_part)
            encoded_col_train = pd.concat(parts, axis = 0)
            encoded_col_train.fillna(target_mean_global, inplace = True)

        # Saving the column with means
        encoded_col = pd.concat([encoded_col_train, encoded_col_test], axis = 0)
        encoded_col[encoded_col.isnull()] = target_mean_global
        encoded_cols.append(pd.DataFrame({'mean_'+ target_col + '_' + col:encoded_col}))
    all_encoded = pd.concat(encoded_cols, axis = 1)
    return (all_encoded.loc[train_data.index,:], 
            all_encoded.loc[test_data.index,:])
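
The regularisation inside the function above is the smoothing step: each category mean is pulled towards the global mean, and `alpha` controls how strongly. A tiny worked sketch with invented numbers (and without the cross-validation part):

```python
import pandas as pd

# Toy data: category 'A' has two rows, 'B' has one (hypothetical data)
toy = pd.DataFrame({'Neighborhood': ['A', 'A', 'B'],
                    'SalePrice': [100.0, 200.0, 300.0]})

alpha = 1
global_mean = toy['SalePrice'].mean()   # 200.0

# Per-category mean and number of rows
stats = toy.groupby('Neighborhood')['SalePrice'].agg(['mean', 'count'])

# Smoothed mean: (category_mean * n + global_mean * alpha) / (n + alpha)
smoothed = ((stats['mean'] * stats['count'] + global_mean * alpha) 
            / (stats['count'] + alpha))
```

Here 'A' goes from 150 to 500/3 (about 166.7) and 'B' from 300 to 250, so the category with fewer rows is shrunk more.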

<div style = "color: #000000;
             display: fill;
             padding: 8px;
             border-radius: 5px;
             border-style: solid;
             border-color: #a63700;
             background-color: rgba(235, 125, 66, 0.3)">
    
<span style = "font-size: 20px; font-weight: bold">Note:</span> 
<span style="font-size: 15px">This function works properly only if training and test sets have different indices.</span>
</div>

In [None]:
train_mean_encoding = df_train[list(train_high_cat.columns)].copy()
train_mean_encoding['SalePrice'] = df_train['SalePrice']

target_col = 'SalePrice'
columns = train_mean_encoding.columns.tolist()

columns_test = columns
columns_test.remove('SalePrice')
test_mean_encoding = df_test[columns_test]

index_0 = list(range(0, 1459))
index_1 = list(range(1451, 2910))

test_mean_encoding = test_mean_encoding.rename(index = dict(zip(index_0, index_1)))
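
If you would rather not hard-code the row counts, one possible alternative is to shift the test index by the length of the training set. A sketch on toy frames (the names `train`, `test` and `test_shifted` are made up):

```python
import pandas as pd

# Toy training and test frames that both start at index 0 (hypothetical data)
train = pd.DataFrame({'Neighborhood': ['A', 'B', 'A']})
test = pd.DataFrame({'Neighborhood': ['B', 'A']})

# Shift the test index so that it starts right after the training index
test_shifted = test.copy()
test_shifted.index = test_shifted.index + len(train)
```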

In [None]:
Mean_encoding = mean_encode(train_mean_encoding, test_mean_encoding, 
                            columns, target_col, alpha = 5, folds = 10)

In [None]:
train_high_cat_encoded = np.log(Mean_encoding[0].reset_index(drop = True))
test_high_cat_encoded = np.log(Mean_encoding[1].reset_index(drop = True))

### 5.2 One-hot encoding

The rest of the categorical variables were encoded with the help of one-hot encoding.

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
train_test_norm_cat = pd.concat([df_train[train_norm_cat], 
                                 df_test[train_norm_cat]], 
                                 axis = 0, join = 'outer', 
                                 ignore_index = True)

In [None]:
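# Note: in scikit-learn 1.2 and later, 'sparse' is named 'sparse_output' 
# and 'get_feature_names' has been replaced by 'get_feature_names_out'
OHE = OneHotEncoder(sparse = False, handle_unknown = 'ignore')

train_test_norm_cat_OHE = pd.DataFrame(OHE.fit_transform(train_test_norm_cat))
train_test_norm_cat_OHE.columns = OHE.get_feature_names(train_test_norm_cat.columns.tolist())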

At this stage, you can drop columns that almost entirely consist of a single class.

In [None]:
NULLS = pd.DataFrame({'%_nulls': train_test_norm_cat_OHE.isin([0]).mean()})
NULLS = NULLS.reset_index().sort_values(ascending = False, by = '%_nulls')
NULLS = NULLS.rename(columns = {'index': 'Variable'})

DROP = NULLS.loc[((NULLS['%_nulls'] >= 0.99) | (NULLS['%_nulls'] <= 0.005)), 'Variable'].values

In [None]:
train_test_norm_cat_OHE = train_test_norm_cat_OHE.drop(DROP, axis = 1)

In [None]:
train_norm_cat_OHE = train_test_norm_cat_OHE.iloc[:1451, ]
test_norm_cat_OHE = (train_test_norm_cat_OHE.iloc[1451:, ]).reset_index(drop = True)

### 5.3 Ordinal encoding

Playing around with different encoding techniques, I found out that "OverallQual" and "OverallCond" significantly boosted CV scores when they were encoded ordinally. But I decided to do both: keep them ordinal and one-hot encode them, allowing models to make all difficult choices for themselves.

In [None]:
train_ordinal = pd.DataFrame()
test_ordinal = pd.DataFrame()

In [None]:
train_ordinal['OverallQual'] = df_train['OverallQual']
train_ordinal['OverallCond'] = df_train['OverallCond']

In [None]:
test_ordinal['OverallQual'] = df_test['OverallQual']
test_ordinal['OverallCond'] = df_test['OverallCond']

## 6. Getting the final training and test sets <a class="anchor" id = "IV_6"></a>

In [None]:
train_cont_balanced_default = df_train[train_cont_balanced].copy()
test_cont_balanced_default = df_test[train_cont_balanced].copy()

In [None]:
train_list = [train_high_cat_encoded, train_norm_cat_OHE,
              train_cont_balanced_default, train_ordinal]

In [None]:
X_train = pd.concat(train_list, axis = 1)
y = np.log(df_train['SalePrice'])

In [None]:
test_list = [test_high_cat_encoded, test_norm_cat_OHE,
             test_cont_balanced_default, test_ordinal]

In [None]:
X_test = pd.concat(test_list, axis = 1)

<h1><center> V. Building models </center></h1> <a class="anchor" id = "V"></a>

I built 6 models: **Lasso**, **ElasticNet**, **XGB**, **LGBM**, **SVR** and **KNN**, and then stacked them using <span style = "color: #E85E40"> StackingRegressor </span>. Since tuning hyperparameters took me about 2 hours (XGB was quite slow), I run only the final (tuned) models in this notebook, but you can still see how they were tuned. I mostly relied on <span style = "color: #E85E40"> RandomizedSearchCV </span>.
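
To give a feel for the tuning workflow before the actual cells, here is a minimal sketch of a randomised search on synthetic data. The data, the parameter grid and the small `n_iter` are assumptions chosen so that the example runs in a few seconds; they are not the settings used for the final models:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic regression problem (hypothetical data)
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + rng.normal(0, 0.1, size=120)

# Candidate values for the regularisation strength
param_distributions = {'alpha': [x / 1000 for x in range(1, 51)]}

cv = KFold(n_splits=5, shuffle=True, random_state=999)

# Sample 10 of the 50 candidates and score each with cross-validated RMSE
search = RandomizedSearchCV(Lasso(max_iter=5000), param_distributions,
                            n_iter=10, cv=cv, random_state=999,
                            scoring='neg_root_mean_squared_error')
search.fit(X, y)

best_alpha = search.best_params_['alpha']
best_rmse = -search.best_score_   # the scorer returns negative RMSE
```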

## 1. Tuning models <a class="anchor" id = "V_1"></a>

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import KFold

CV = KFold(n_splits = 10, random_state = 999, shuffle = True)
CV_rep = RepeatedKFold(n_splits = 10, n_repeats = 3, random_state = 999)

# 1.1 Lasso

from sklearn import linear_model

###### Training a model ######

# %%time

# Lasso_model = linear_model.Lasso()

# alpha = {'alpha': [x / 25000 for x in range(1, 50, 1)],
#          'tol': [0.0000001], 
#          'max_iter': [3000]}

# Lasso_grid = GridSearchCV(Lasso_model, alpha, verbose = True, 
#                           scoring = 'neg_root_mean_squared_error', 
#                           n_jobs = 7, cv = CV)

# Lasso_fit = Lasso_grid.fit(X_train, y)

###### Getting scores and parameters ######

# round(-1*Lasso_fit.best_score_, 5)
# Lasso_fit.best_params_

###### Getting feature importance ######

# FI_lasso = list(zip(abs(Lasso_fit.best_estimator_.coef_), X_train.columns))
# FI_lasso = pd.DataFrame(FI_lasso, columns = ['Imp', 'Variable'])
# FI_lasso = FI_lasso.sort_values(ascending = False, by = 'Imp')

########################################

# 1.2 Elastic Net

###### Training a model ######

# %%time

# ElasticNet_model = linear_model.ElasticNet()

# alpha_l1 = {'alpha': [x / 25000 for x in range(1, 25, 1)],
#             'l1_ratio': [x / 100 for x in range(10, 100, 1)],
#             'tol': [0.000001], 
#             'max_iter': [4000]}

# ElasticNet_random = RandomizedSearchCV(ElasticNet_model, alpha_l1, verbose = True, 
#                                        scoring = 'neg_root_mean_squared_error', 
#                                        n_jobs = 7, cv = CV, n_iter = 50)

# ElasticNet_fit = ElasticNet_random.fit(X_train, y)

###### Getting scores and parameters ######

# round(-1*ElasticNet_fit.best_score_, 5)
# ElasticNet_fit.best_params_

########################################

# 1.3 XGBoost

import xgboost as xgb
from sklearn.model_selection import train_test_split

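# Hold out 10% of the training data as a validation set for early stopping
# (x_test / y_test are this validation split, not the competition test set)
x_train, x_test, y_train, y_test = train_test_split(X_train, y, 
                                   test_size = 0.1, random_state = 999, 
                                   shuffle = True)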

###### Training a model ######

# %%time

# XGB_model = xgb.XGBRegressor(use_label_encoder = False, 
#                              eval_metric = 'rmse', 
#                              n_estimators = 10000)

# XGB_param_Random = {'reg_alpha': [0.01, 0.05, 0.1, 0.2, 0.5, 1, 2, 3],
#                     'reg_lambda': [0.01, 0.05, 0.1, 0.2, 0.5, 1, 2, 3],
#                     'learning_rate': [x / 400 for x in range(1, 10, 1)],
#                     'max_depth': list(range(2, 15, 1)),
#                     'min_child_weight': list(range(2, 35, 1)),
#                     'gamma': [x / 200 for x in range(0, 50, 1)],
#                     'subsample': [0.5, 0.6, 0.7, 0.8, 0.9],
#                     'colsample_bytree': [0.5, 0.6, 0.7, 0.8, 0.9]}

# XGB_random_grid = RandomizedSearchCV(XGB_model, XGB_param_Random, cv = CV, 
#                                      verbose = False, n_jobs = 7, 
#                                      scoring = 'neg_root_mean_squared_error', 
#                                      n_iter = 65)

# XGB_fit = XGB_random_grid.fit(x_train, y_train, 
#                               early_stopping_rounds = 200, 
#                               eval_set = [[x_test, y_test]], 
#                               eval_metric = 'rmse', verbose = False)

###### Getting scores and parameters ######

# round(-1*XGB_fit.best_score_, 5)
# XGB_fit.best_params_

########################################

# 1.4 LGBM

import lightgbm as lgb

from scipy.stats import randint
from scipy.stats import uniform

###### Training a model ######

# %%time

# LGBM_model = lgb.LGBMRegressor(n_estimators = 10000)

# LGBM_param_Random = {'reg_lambda': [0.01, 0.05, 0.1, 0.2, 0.5, 1, 2],
#                      'reg_alpha': [0.01, 0.05, 0.1, 0.2, 0.5, 1, 2],
#                      'min_child_samples': randint(1, 100),
#                      'subsample': [x / 10 for x in range(1, 10, 1)], # bagging_fraction
#                      'subsample_freq': randint(1, 200), # bagging_freq
#                      'num_leaves': randint(2, 200), # LightGBM requires num_leaves > 1
#                      'max_depth': list(range(1, 15, 1)),
#                      'max_bin': randint(2, 700), # LightGBM requires max_bin > 1
#                      'learning_rate': [x / 200 for x in range(1, 10, 1)],
#                      'colsample_bytree': [x / 10 for x in range(1, 11, 1)]} # feature_fraction 
                        
                    
# LGBM_random_grid = RandomizedSearchCV(LGBM_model, LGBM_param_Random, cv = CV, 
#                                       verbose = False, n_jobs = 7, 
#                                       scoring = 'neg_root_mean_squared_error', n_iter = 100)

# LGBM_fit = LGBM_random_grid.fit(x_train, y_train, early_stopping_rounds = 100, 
#                                 eval_set = [[x_test, y_test]], 
#                                 eval_metric = 'rmse', verbose = False)

###### Getting scores and parameters ######

# round(-1*LGBM_fit.best_score_, 5)
# LGBM_fit.best_params_

########################################

# 1.5 SVR

# Before using KNN or SVR, we have to scale data!

from sklearn.preprocessing import RobustScaler

vars_for_scaling = (train_high_cat_encoded.columns.tolist() + 
                   train_cont_balanced_default.columns.tolist())

Scaler = RobustScaler()

X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

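# Fit the scaler on the training set only and reuse its statistics on the test set,
# so both sets are scaled consistently and nothing is learned from the test data.
# RobustScaler works column by column, so no explicit loop is needed.
X_train_scaled[vars_for_scaling] = Scaler.fit_transform(X_train[vars_for_scaling])
X_test_scaled[vars_for_scaling] = Scaler.transform(X_test[vars_for_scaling])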
    
from sklearn.svm import SVR

###### Training a model ######

# %%time

# SVR_model = SVR()

# parameters = {'kernel' : ['rbf'],
#               'C' : list(range(1, 100, 1)),
#               'epsilon' : [x / 2000 for x in range(1, 50, 1)],
#               'gamma' : [x / 10000 for x in range(1, 50, 1)]}

# SVR_random_grid = RandomizedSearchCV(SVR_model, parameters, cv = CV, 
#                                      verbose = False, n_jobs = 7, 
#                                      scoring = 'neg_root_mean_squared_error', 
#                                      n_iter = 60)

# SVR_fit = SVR_random_grid.fit(X_train_scaled, y)

###### Getting scores and parameters ######

# round(-1*SVR_fit.best_score_, 3)
# SVR_fit.best_params_

########################################

# 1.6 KNN

from sklearn.neighbors import KNeighborsRegressor

###### Training a model ######

# %%time

# KNN_model = KNeighborsRegressor()

# KNN_param_Random = {'leaf_size': list(range(1, 50, 1)),
#                     'n_neighbors': list(range(1, 50, 1)),
#                     'p' : [1, 2], 
#                     'weights': ('uniform', 'distance'),
#                     'metric': ('minkowski', 'chebyshev'), 
#                     'algorithm': ('ball_tree', 'kd_tree')}

# KNN_random_grid = RandomizedSearchCV(KNN_model, KNN_param_Random, cv = CV_rep,  
#                                      scoring = 'neg_root_mean_squared_error', 
#                                      verbose = True, n_jobs = 7, n_iter = 100)

# KNN_fit = KNN_random_grid.fit(X_train_scaled, y)

###### Getting scores and parameters ######

# round(-1*KNN_fit.best_score_, 5)
# KNN_fit.best_params_
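The tuning blocks above all follow the same pattern, so here is a minimal, fast sketch of it on synthetic data. Everything prefixed with `demo_` is a made-up name for illustration and the data comes from `make_regression`, not from the competition. Unlike the searches above, it samples from continuous distributions rather than fixed lists and sets `random_state` on the search itself, which should make the sampled candidates reproducible.

```python
import numpy as np
from scipy.stats import loguniform, uniform
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic stand-in for X_train / y (assumption: not the competition data)
X_demo, y_demo = make_regression(n_samples = 150, n_features = 10,
                                 noise = 5.0, random_state = 999)

demo_cv = KFold(n_splits = 5, shuffle = True, random_state = 999)

# Continuous search space: alpha on a log scale, l1_ratio uniform on [0.1, 1.0]
# (scipy's uniform takes loc and scale, so the upper bound is loc + scale)
demo_params = {'alpha': loguniform(1e-4, 1e0),
               'l1_ratio': uniform(0.1, 0.9)}

demo_search = RandomizedSearchCV(ElasticNet(max_iter = 5000), demo_params,
                                 n_iter = 20, cv = demo_cv,
                                 scoring = 'neg_root_mean_squared_error',
                                 random_state = 999)

demo_fit = demo_search.fit(X_demo, y_demo)

# The scorer is negated RMSE, so flip the sign to read it as an error
demo_best_rmse = -demo_fit.best_score_
print(round(demo_best_rmse, 5), demo_fit.best_params_)
```

Sampling `alpha` on a log scale spends the same number of trials on each order of magnitude, which is usually a reasonable default for regularisation strengths.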

## 2. Stacking <a class="anchor" id = "V_2"></a>

In [None]:
from sklearn.ensemble import StackingRegressor
from sklearn.pipeline import make_pipeline

In [None]:
base_learners = [
                 ('Lasso', linear_model.Lasso(tol = 1e-7, 
                           alpha = 0.00028, max_iter = 3000)),
    
                 ('El_Net', linear_model.ElasticNet(tol = 1e-6, 
                            alpha = 0.00044, l1_ratio = 0.61, max_iter = 4000)),
    
                 ('XGB', xgb.XGBRegressor(use_label_encoder = False, 
                         eval_metric = 'rmse',                   
                         n_estimators = 5000,
                         reg_alpha = 0.1,
                         reg_lambda = 0.005,
                         learning_rate = 0.0125,
                         max_depth = 13,
                         min_child_weight = 4,
                         gamma = 0.04,
                         subsample = 0.7,
                         colsample_bytree = 0.6)),
    
                 ('LGBM', lgb.LGBMRegressor(
                          n_estimators = 9000,
                          reg_lambda = 1.8,
                          reg_alpha = 0.01,
                          min_child_samples = 13,
                          subsample = 0.8,
                          subsample_freq = 11,
                          num_leaves = 101,
                          max_depth = 3,
                          max_bin = 160,
                          learning_rate = 0.005,
                          colsample_bytree = 0.1)),
    
                 ('KNN', make_pipeline(RobustScaler(), 
                         KNeighborsRegressor(
                         leaf_size = 25,
                         n_neighbors = 9,
                         p = 1,
                         weights = 'distance',
                         metric = 'minkowski',
                         algorithm = 'ball_tree'))),
    
                 ('SVR', make_pipeline(RobustScaler(), 
                         SVR(
                         kernel = 'rbf',
                         C =  10, 
                         epsilon =  0.017,
                         gamma =  0.0007)))
                ]

In [None]:
Final_stack = StackingRegressor(estimators = base_learners, 
                                final_estimator = linear_model.Lasso(tol = 1e-7, 
                                alpha = 0.00028, max_iter = 3000), 
                                passthrough = True, verbose = False, 
                                cv = 5)

In [None]:
Final_fit = Final_stack.fit(X_train, y)

y_pred = Final_fit.predict(X_test)
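The stack above is fitted once on the full training set, so the notebook never prints a CV score for it. The sketch below shows one way such a score could be obtained, and what `passthrough = True` does to the input of the final estimator. It uses a small synthetic dataset and two toy base learners (all `demo_` names are hypothetical), because cross-validating the real stack would take a long time.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for X_train / y (assumption: not the competition data)
X_demo, y_demo = make_regression(n_samples = 200, n_features = 8,
                                 noise = 10.0, random_state = 999)

demo_learners = [
    ('Lasso', Lasso(alpha = 0.1)),
    ('KNN', make_pipeline(RobustScaler(), KNeighborsRegressor(n_neighbors = 5))),
]

demo_stack = StackingRegressor(estimators = demo_learners,
                               final_estimator = Lasso(alpha = 0.1),
                               passthrough = True, cv = 5)

# Outer CV estimates the error of the whole stack; the inner cv = 5 is only
# used to produce out-of-fold predictions for the final estimator
demo_cv = KFold(n_splits = 5, shuffle = True, random_state = 999)
demo_scores = cross_val_score(demo_stack, X_demo, y_demo, cv = demo_cv,
                              scoring = 'neg_root_mean_squared_error')
demo_rmse = -demo_scores.mean()

# With passthrough = True the final estimator sees the base predictions
# plus the original features: 2 learners + 8 features = 10 inputs
demo_stack.fit(X_demo, y_demo)
n_meta_features = demo_stack.final_estimator_.coef_.shape[0]
print(round(demo_rmse, 3), n_meta_features)
```

Applied to `Final_stack`, the same `cross_val_score` call would refit all six tuned models once per outer fold and per inner fold, so expect it to be slow.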

In [None]:
submission = pd.DataFrame({'Id': list(range(1461, 2920)), 'SalePrice': np.exp(y_pred)})
submission.to_csv('submission.csv', index = False)

<h1><center> VI. Some techniques that could have been useful </center></h1> <a class="anchor" id = "VI"></a>

This section covers some feature engineering options that did not improve the CV scores in this case but might be valuable on other datasets.

<div style = "color: #000000;
             display: fill;
             padding: 8px;
             border-radius: 5px;
             border-style: solid;
             border-color: #a63700;
             background-color: rgba(235, 125, 66, 0.3)">
    
<span style = "font-size: 20px; font-weight: bold">Note:</span> 
<span style="font-size: 15px">If you want to explore other feature engineering techniques and get utility functions, please refer to this <a href="https://www.kaggle.com/suprematism/advanced-feature-engineering-utility-functions">notebook</a>.</span>
</div>

## 1. Feature interactions <a class="anchor" id = "VI_1"></a>

Sometimes interactions between variables may prove to be valuable. Instead of picking pairs of predictors by hand and crossing them via various mathematical operations, you can automate this process to some extent.

First, decide which variables you want to use. For instance, you can collect the features that are highly correlated with the target:

In [None]:
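# Restrict to numeric columns first: recent pandas versions raise an error
# when corr() is called on a frame that still contains string columns
Corr_vars = abs(df_train.select_dtypes(include = 'number').corr()['SalePrice']).sort_values(ascending = False)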

In [None]:
High_corr_vars = Corr_vars.loc[Corr_vars > 0.6].index.tolist()
High_corr_vars.remove('SalePrice')

In [None]:
High_corr_vars

Next, create all pairwise combinations of the selected variables and combine each pair, for example by multiplying the two columns.

In [None]:
from itertools import combinations

In [None]:
train_cont_comb = pd.DataFrame()

In [None]:
# Multiply every pair of highly correlated variables
for c_1, c_2 in combinations(High_corr_vars, 2):
    
    train_cont_comb['{0}*{1}'.format(c_1, c_2)] = df_train[c_1] * df_train[c_2]

In [None]:
train_cont_comb.head(3).round(1)

Finally, you can train a model with built-in regularisation (Lasso, for instance) and see which interactions keep non-zero coefficients, i.e. which ones are actually important.
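As a rough illustration of that last step, the sketch below builds pairwise products on a small synthetic frame and ranks them by the absolute Lasso coefficient. The data, the column names `a`, `b`, `c` and all `demo_` names are invented for the example, and `LassoCV` is used here only as a convenient way to pick `alpha`; it is not the tuned Lasso from section V.

```python
import numpy as np
import pandas as pd
from itertools import combinations
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(999)

# Synthetic predictors; the target depends on a single interaction (a*b)
demo = pd.DataFrame(rng.normal(size = (300, 3)), columns = ['a', 'b', 'c'])
demo_target = 3.0 * demo['a'] * demo['b'] + rng.normal(scale = 0.1, size = 300)

# Pairwise products, named the same way as in train_cont_comb
demo_comb = pd.DataFrame({'{0}*{1}'.format(c_1, c_2): demo[c_1] * demo[c_2]
                          for c_1, c_2 in combinations(demo.columns, 2)})

# Standardise so that coefficient magnitudes are comparable across features
demo_scaled = StandardScaler().fit_transform(demo_comb)

demo_lasso = LassoCV(cv = 5).fit(demo_scaled, demo_target)

# Rank interactions by absolute coefficient, as done for FI_lasso earlier
FI_demo = pd.DataFrame({'Variable': demo_comb.columns,
                        'Imp': np.abs(demo_lasso.coef_)})
FI_demo = FI_demo.sort_values(ascending = False, by = 'Imp')
print(FI_demo)
```

On real data the picture is rarely this clean: correlated interactions tend to share or swap importance, so it is worth confirming any candidate feature with the CV score of the full model.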

## Thanks for reading!