<h1><center> Table of contents </center></h1>

### I. Preparation

* [1. Importing basic libraries](#I_1)


* [2. Setting global parameters for plots](#I_2)

    
* [3. Importing data](#I_3)
        
        
### II. Feature engineering

* [1. K-means clustering](#II_1)


* [2. K-bins discretisation](#II_2)


* [3. Categorical variables interactions](#II_3)


* [4. Featuretools](#II_4)

<h1><center> I. Preparation </center></h1>

## 1. Importing basic libraries <a class="anchor" id = "I_1"></a>

In [None]:
import pandas as pd
import numpy as np

In [None]:
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt

## 2. Setting global parameters for plots <a class="anchor" id = "I_2"></a>

In [None]:
sns.set_theme(rc = {'grid.linewidth': 0.5,
                    'axes.linewidth': 0.75, 'axes.facecolor': '#ECECEC', 
                    'axes.labelcolor': 'black',
                    'figure.facecolor': 'white',
                    'xtick.color': 'black', 'ytick.color': 'black'})

## 3. Importing data <a class="anchor" id = "I_3"></a>

<div style = "color: #000000;
             display: fill;
             padding: 8px;
             border-radius: 5px;
             border-style: solid;
             border-color: #a63700;
             background-color: rgba(235, 125, 66, 0.3)">
    
<span style = "font-size: 20px; font-weight: bold">Note:</span> 
The purpose of this kernel is to familiarise you with some feature engineering techniques. I used the cleaned training and test sets from one of my <a href="https://www.kaggle.com/suprematism/top-7-useful-graphs-and-encoding-techniques">notebooks</a>; if you want to explore data cleaning and model building, refer to that notebook.
</div>

In [None]:
Set = pd.read_csv('../input/housing-prices-visual/Housing_prices_visual.csv')

In [None]:
df_train = Set.iloc[:1451]
df_test = Set.iloc[1451:]

<h1><center> II. Feature engineering </center></h1>

## 1. K-means clustering <a class="anchor" id = "II_1"></a>

In [None]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

Useful articles and documents:

- <a href="https://medium.com/greyatom/using-clustering-for-feature-engineering-on-the-iris-dataset-f438366d0b4b">Clustering for feature engineering</a>;
- <a href="https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/">A guide to k-means clustering</a>;
- <a href="https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html">Sklearn documentation</a>

<div style = "text-align: justify">The output of <span style="color:#E85E40">k-means clustering</span> can be used as features for our models. The cluster labels themselves can be treated as a categorical variable, and the distances to the cluster centres can serve as additional numeric predictors.</div>
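To make the two kinds of output concrete, here is a minimal sketch on made-up data (the blobs, `n_clusters = 2` and the variable names are assumptions for illustration only). `predict` gives the cluster label and `transform` gives the distance to every centre:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (made up): two well-separated blobs in 2D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size = (20, 2)),
               rng.normal(5, 0.5, size = (20, 2))])

kmeans = KMeans(n_clusters = 2, n_init = 10, random_state = 999).fit(X)

labels = kmeans.predict(X)         # one categorical feature: the cluster label
distances = kmeans.transform(X)    # n_clusters numeric features: distance to each centre

# The assigned cluster is always the one whose centre is closest
assert (labels == distances.argmin(axis = 1)).all()
```

The distance columns carry more information than the label alone, because they also say how far a sample is from the clusters it was not assigned to.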

First, we have to define numeric variables that we are going to use for clustering.

In [None]:
Num_vars = ['LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1', 
            'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', 
            '2ndFlrSF', 'GrLivArea', 'GarageArea', 'WoodDeckSF', 
            'OpenPorchSF', 'EnclosedPorch', 'ScreenPorch', 'MiscVal', 
            '3SsnPorch' , 'PoolArea' , 'LowQualFinSF']

<div style = "text-align: justify"> Before generating new variables, we have to determine the optimal number of clusters. One way to do it is to draw a so-called <span style="color:#E85E40">elbow curve</span>: we plot the number of clusters against <span style="color:#E85E40"> inertia_ </span>, the sum of squared distances of samples to their closest cluster centre. Since <span style="color:#E85E40"> inertia_ </span> tends to keep falling as clusters are added, our task is to find the number of clusters after which adding more brings only a marginal reduction. </div>
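As a quick sanity check of that definition, the sketch below recomputes the value by hand on random toy data (the data and `n_clusters = 4` are arbitrary choices for illustration) and compares it with the fitted attribute:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (made up): 100 samples, 3 features
rng = np.random.default_rng(1)
X = rng.normal(size = (100, 3))

kmeans = KMeans(n_clusters = 4, n_init = 10, random_state = 999).fit(X)

# Distance from every sample to its closest centre, squared and summed
closest = kmeans.transform(X).min(axis = 1)
manual_inertia = (closest ** 2).sum()

assert np.isclose(manual_inertia, kmeans.inertia_)
```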

For that purpose, I created a simple utility function:

In [None]:
def K_means_claster_tuning(df_train, Vars_list, max_cluster = 15):
    
    ### Scaling data
    
    scaler = StandardScaler()
    data_scaled = scaler.fit_transform(df_train[Vars_list])
    
    ### Calculating inertia_ for each number of clusters
    
    SSE = []
    
    for cluster in range(1, max_cluster):
        
        kmeans = KMeans(n_clusters = cluster, init = 'k-means++', random_state = 999)
        kmeans.fit(data_scaled)
        SSE.append(kmeans.inertia_)
        
    df_plot = pd.DataFrame({'Cluster': range(1, max_cluster), 'SSE': SSE})
    
    return(df_plot)

In [None]:
Cluster_tuning = K_means_claster_tuning(df_train, Num_vars, max_cluster = 20)

In [None]:
with plt.rc_context(rc = {'figure.dpi': 110, 'axes.labelsize': 8, 
                          'xtick.labelsize': 6, 'ytick.labelsize': 6}):
    
    fig_1, ax_1 = plt.subplots(1, 1, figsize = (5, 3.5))
    
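    ### Reference line joining the first and last points of the curve
    sns.lineplot(x = Cluster_tuning['Cluster'].iloc[[0, -1]].to_numpy(), 
                 y = Cluster_tuning['SSE'].iloc[[0, -1]].to_numpy(), 
                 color = '#86b9cf', linewidth = 1)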

    sns.lineplot(x = Cluster_tuning['Cluster'].astype('int64'), 
                 y = Cluster_tuning['SSE'], color = '#146964', 
                 marker = 'o', linewidth = 1)
    
    plt.xticks(range(1, 20))
          
plt.show()

It is evident that after 12 clusters the decrease in SSE becomes insignificant.
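Reading the elbow by eye is subjective. One possible heuristic, which is my assumption and not something used in this notebook, is to pick the point that lies furthest from the straight line joining the first and last points of the curve. The helper name `elbow_by_chord` and the SSE values below are made up for illustration:

```python
import numpy as np

def elbow_by_chord(k_values, sse):
    """Return the k whose (k, SSE) point is furthest from the chord
    joining the first and last points of the curve."""
    k = np.asarray(k_values, dtype = float)
    s = np.asarray(sse, dtype = float)

    # Rescale both axes to [0, 1] so that neither dominates the distance
    k_n = (k - k.min()) / (k.max() - k.min())
    s_n = (s - s.min()) / (s.max() - s.min())

    # Perpendicular distance to the chord via the 2D cross product
    dx, dy = k_n[-1] - k_n[0], s_n[-1] - s_n[0]
    dist = np.abs(dx * (s_n - s_n[0]) - dy * (k_n - k_n[0])) / np.hypot(dx, dy)

    return int(k[dist.argmax()])

# Made-up SSE values with an obvious bend at k = 3
print(elbow_by_chord([1, 2, 3, 4, 5, 6], [100, 60, 20, 18, 16, 14]))  # → 3
```

With the data frame built above, the call would be `elbow_by_chord(Cluster_tuning['Cluster'], Cluster_tuning['SSE'])`. Treat the result as a starting point, since it need not agree with a visual reading of the plot.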

Finally, I created a utility function for engineering features from <span style="color:#E85E40">k-means clustering</span>:

In [None]:
def K_means_clastering(df_train, df_test, Vars_list, n_clusters = 10):
    
    ### Scaling data
    
    scaler = StandardScaler()
    df_train_scaled = scaler.fit_transform(df_train[Vars_list])
    df_test_scaled = scaler.transform(df_test[Vars_list])
    
    ### Initiating KMeans algorithm
    
    kmeans = KMeans(n_clusters = n_clusters, init = 'k-means++', random_state = 999)
    
    ### Getting clusters
    
    kmeans_train = kmeans.fit_predict(df_train_scaled)
    kmeans_test = kmeans.predict(df_test_scaled)
    
    ### Getting the distance to the cluster centres
    ### (the model is already fitted above, so transform is enough)
    
    Cluster_space = []
    Cols = [f'Clust_space_{i}' for i in range(n_clusters)]
    
    Cluster_space.append(kmeans.transform(df_train_scaled))
    Cluster_space.append(kmeans.transform(df_test_scaled))
    
    ### Saving results
    
    df_clusters_train = pd.DataFrame(Cluster_space[0], columns = Cols)
    df_clusters_train['Cluster'] = kmeans_train
    
    df_clusters_test = pd.DataFrame(Cluster_space[1], columns = Cols)
    df_clusters_test['Cluster'] = kmeans_test
    
    return(df_clusters_train, df_clusters_test)

To keep the notebook less cluttered, the features below were calculated for only 5 clusters, using the first 5 numeric variables.

In [None]:
Result_1 = K_means_clastering(df_train, df_test, Num_vars[0:5], n_clusters = 5)

In [None]:
df_clusters_train = Result_1[0]
df_clusters_test = Result_1[1]

In [None]:
df_clusters_train.round(2).head(3)

## 2. K-bins discretisation <a class="anchor" id = "II_2"></a>

In [None]:
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

Useful articles and documents:

- <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html">Sklearn documentation</a>

<div style = "text-align: justify"> Sometimes, binning continuous variables can be quite valuable. It dampens the effect of outliers and of unevenly distributed features, which can help reduce overfitting. Nevertheless, binning always results in some information loss. </div>

<div style = "text-align: justify"> If you want, you can change the binning strategy. For instance, pass <code style = "background-color: #faedde">strategy = 'uniform'</code> instead of <code style = "background-color: #faedde">strategy = 'kmeans'</code>. It is also up to you to define how the binned variables are encoded: use either <code style = "background-color: #faedde">encode = 'onehot-dense'</code> or <code style = "background-color: #faedde">encode = 'ordinal'</code>. For more details, consult the official documentation. </div>
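To see how much the strategy matters, here is a small sketch on a made-up skewed column (the values are assumptions chosen for illustration). It bins the same data three ways with ordinal encoding, so each value is replaced by its bin index:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy column (made up): ten small values, two medium ones and one outlier
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30, 35, 100],
             dtype = float).reshape(-1, 1)

codes = {}

for strategy in ['uniform', 'quantile', 'kmeans']:

    binner = KBinsDiscretizer(n_bins = 3, encode = 'ordinal', strategy = strategy)
    codes[strategy] = binner.fit_transform(x).ravel().astype(int).tolist()

    print(strategy, codes[strategy])
```

With `'uniform'` the bin edges are equally spaced between the minimum and the maximum, so almost everything lands in the first bin. `'quantile'` aims for roughly equal counts per bin, and `'kmeans'` places the edges between one-dimensional cluster centres.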

In [None]:
def discretizer(df_train, df_test, Vars_list, n_bins = 5, 
                encode = 'onehot-dense', strategy = 'kmeans'):
    
    ### Initiating KBinsDiscretizer algorithm
    
    KBins_d = KBinsDiscretizer(n_bins = n_bins, encode = encode, 
                               strategy = strategy)
    
    ### Getting binned variables
    
    df_train_binned = KBins_d.fit_transform(df_train[Vars_list])
    df_test_binned = KBins_d.transform(df_test[Vars_list])
    
    ### NOT the best way of creating column names :)
    ### n_bins_ holds the number of bins actually kept for each variable
    
    if encode == 'onehot-dense':
        
        Cols = int(KBins_d.n_bins_.sum())
        
    else: Cols = df_train[Vars_list].shape[1]
    
    df_train_binned = pd.DataFrame(df_train_binned, 
                                   columns = ["col" + str(i) for i in range(0, Cols)])

    df_test_binned = pd.DataFrame(df_test_binned, 
                                  columns = ["col" + str(i) for i in range(0, Cols)])
    
    return(df_train_binned, df_test_binned)
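If you prefer descriptive column names over `col0`, `col1`, ..., one option is to build them from the fitted `n_bins_` attribute. This is only a sketch on a made-up frame. The `_bin{b}` naming scheme is my own choice, and it relies on the one-hot columns being laid out variable by variable with bins in ascending order:

```python
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Toy frame (made up) standing in for df_train[Vars_list]
df = pd.DataFrame({'LotArea': [1., 2., 3., 10., 11., 12.],
                   'GrLivArea': [5., 6., 7., 50., 60., 70.]})

binner = KBinsDiscretizer(n_bins = 2, encode = 'onehot-dense', strategy = 'uniform')
binned = binner.fit_transform(df)

# One name per (variable, bin) pair, in the same order as the output columns
names = [f'{var}_bin{b}'
         for var, n in zip(df.columns, binner.n_bins_)
         for b in range(n)]

df_binned = pd.DataFrame(binned, columns = names, index = df.index)

print(df_binned.columns.tolist())
```

Passing `index = df.index` also keeps the binned frame aligned with the original rows, which matters for the test set, whose index does not start at zero.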

To make my example cleaner, I took only the first 5 numeric variables.

In [None]:
Result_2 = discretizer(df_train, df_test, Num_vars[0:5],
                       n_bins = 3, encode = 'onehot-dense', strategy = 'kmeans')

In [None]:
df_binned_train = Result_2[0]
df_binned_test = Result_2[1]

In [None]:
df_binned_train.head(3)

## 3. Categorical variables interactions <a class="anchor" id = "II_3"></a>

In [None]:
from itertools import combinations

Useful articles and documents:

- <a href="https://www.coursera.org/lecture/competitive-data-science/feature-interactions-yt5t3">Feature interactions</a>

First, create a list of the categorical variables that you want to combine:

In [None]:
Cat_vars = ['ExterCond', 'MasVnrType', 'ExterQual']

In [None]:
def cat_var_combinations(df, Vars_list):

    Comb_train = []
    Cols = []
    
    ### Concatenating every pair of categorical variables
    
    for c_1, c_2 in combinations(Vars_list, 2):
    
        Comb_train.append(df[c_1].astype(str) + " | " + df[c_2].astype(str))
    
        Cols.append(str(c_1) + " | " + str(c_2))
    
    ### Assembling the data frame once, after the loop
    
    df_final = pd.concat(Comb_train, axis = 1)
    df_final.columns = Cols
        
    return(df_final)

In [None]:
df_comb_cat_train = cat_var_combinations(df_train, Cat_vars)
df_comb_cat_test = cat_var_combinations(df_test, Cat_vars)

In [None]:
df_comb_cat_train.head(3)
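Keep in mind that k variables produce k(k - 1)/2 pairwise columns, and each new column can have as many levels as there are distinct pairs in the data. The sketch below uses three made-up rows (the values are illustrative only) to show how to check the resulting cardinality:

```python
import pandas as pd
from itertools import combinations

# Toy frame (made up) with three categorical variables
toy = pd.DataFrame({'ExterCond': ['TA', 'Gd', 'TA'],
                    'ExterQual': ['Gd', 'Gd', 'Ex'],
                    'MasVnrType': ['None', 'Stone', 'None']})

# One interaction column per pair of variables
pairs = {f'{a} | {b}': toy[a].astype(str) + ' | ' + toy[b].astype(str)
         for a, b in combinations(toy.columns, 2)}

interactions = pd.DataFrame(pairs)

# Number of distinct levels in each interaction column
print(interactions.nunique())
```

Levels that occur only a handful of times are hard for a model to learn from, so it is worth looking at these counts before encoding the new variables.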

Lastly, you can visualise new variables:

In [None]:
df_comb_cat_train['SalePrice'] = df_train['SalePrice']

Visual_vars = df_comb_cat_train.columns.tolist()
Visual_vars.remove('SalePrice')

In [None]:
with plt.rc_context(rc = {'figure.dpi': 500, 'axes.labelsize': 7.5, 
                          'xtick.labelsize': 5.5, 'ytick.labelsize': 5.5}):

    fig_2, ax_2 = plt.subplots(3, 1, figsize = (8, 10))

    for idx, (column, axes) in enumerate(zip(Visual_vars, ax_2.flatten())):
    
        order = df_comb_cat_train.groupby(column)['SalePrice'].mean().sort_values(ascending = True).index
    
        sns.violinplot(ax = axes, x = df_comb_cat_train[column], 
                       y = np.log(df_comb_cat_train['SalePrice']),
                       order = order, scale = 'width',
                       linewidth = 0.5, palette = 'viridis',
                       inner = None)
    
        plt.setp(axes.collections, alpha = 0.3)
    
        sns.stripplot(ax = axes, x = df_comb_cat_train[column], 
                      y = np.log(df_comb_cat_train['SalePrice']),
                      palette = 'viridis', s = 1.5, alpha = 1,
                      order = order, jitter = 0.2)
        
        sns.pointplot(ax = axes, x = df_comb_cat_train[column],
                      y = np.log(df_comb_cat_train['SalePrice']),
                      order = order,
                      color = '#ff5736', scale = 0.2,
                      estimator = np.mean, ci = 'sd',
                      errwidth = 0.5, capsize = 0.15, join = True)
    
        plt.setp(axes.lines, zorder = 100)
        plt.setp(axes.collections, zorder = 100)
    
        if df_comb_cat_train[column].nunique() > 5: 
        
            plt.setp(axes.get_xticklabels(), rotation = 90)
    
    else:
    
        [axes.set_visible(False) for axes in ax_2.flatten()[idx + 1:]]

plt.tight_layout(pad = 1)
plt.show()

## 4. Featuretools <a class="anchor" id = "II_4"></a>

In [None]:
import featuretools as ft

Useful articles and documents:

- <a href="https://featuretools.alteryx.com/en/stable/index.html">Featuretools documentation</a>
- <a href="https://www.kaggle.com/liananapalkova/automated-feature-engineering-for-titanic-dataset">Automated feature engineering for Titanic dataset</a>

<span style="color:#E85E40"> featuretools </span> can do a lot more than is shown here. In this notebook, I covered only feature engineering with the help of aggregations.

<div style = "text-align: justify"> First, I defined all variables that would participate in generating features <code style = "background-color: #faedde">['GrLivArea', 'LotArea', 'Neighborhood']</code>. Following that, I isolated a single categorical variable <code style = "background-color: #faedde">['Neighborhood']</code> that was used to aggregate the rest of the continuous variables.</div>

In [None]:
All_vars = ['GrLivArea', 'LotArea', 'Neighborhood']

Cat_vars_only = ['Neighborhood']

In [None]:
def f_tools(df, Vars_list_0, Vars_list_1):
    
    df_ft = df[Vars_list_0].copy()
    df_ft['ID'] = list(range(0, df_ft.shape[0]))
    
    ### Loading data and creating an entity
    ### (pre-1.0 featuretools API; version 1.0 renamed these methods)
    
    ES = ft.EntitySet(id = 'SalePrice_data')
    ES = ES.entity_from_dataframe(entity_id = 'df_ft',                       
                                  dataframe = df_ft, index = 'ID')
    
    ### Creating relationships
    
    for column in Vars_list_1:
        
        ES = ES.normalize_entity(base_entity_id = 'df_ft', 
                             new_entity_id = str(column), index = str(column))
        
    ### Creating features via aggregations
    
    features, feature_names = ft.dfs(entityset = ES,
                                     target_entity = 'df_ft',
                                     agg_primitives = ['max', 'min', 'mean'],
                                     max_depth = 2)
    
    features = features.drop(Vars_list_0, axis = 1)
    features = features.dropna(axis = 1)
    
    return(features)

<div style = "text-align: justify"> I used only 3 basic primitives <code style = "background-color: #faedde">agg_primitives = ['max', 'min', 'mean']</code>. If you want to learn more about other options, you can either call <code style = "background-color: #faedde">ft.primitives.list_primitives()</code> or consult this <a href="https://docs.featuretools.com/en/stable/api_reference.html#feature-primitives">website</a>. </div>

In [None]:
df_ft_train = f_tools(df_train, All_vars, Cat_vars_only)
df_ft_test = f_tools(df_test, All_vars, Cat_vars_only)

Let's explore a particular variable:

In [None]:
df_ft_train[['Neighborhood.MAX(df_ft.LotArea)']].head(3)

<div style = "text-align: justify"> The name of this feature essentially speaks for itself: "LotArea" was aggregated by "Neighborhood", and the aggregation function was the maximum. We can get the same result with the help of <code style = "background-color: #faedde">.groupby()</code>:</div>

In [None]:
pd.DataFrame({'Check': df_train.groupby('Neighborhood')['LotArea'].transform('max')}).head(3)
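The same pattern extends to the other two primitives. Here is a minimal sketch on made-up data (the toy values and the column naming scheme are assumptions, loosely imitating the featuretools names) that builds all three aggregations with `.groupby().transform()`:

```python
import pandas as pd

# Toy frame (made up): two neighbourhoods, five houses
toy = pd.DataFrame({'Neighborhood': ['A', 'A', 'B', 'B', 'B'],
                    'LotArea': [100, 300, 50, 70, 90]})

# transform() broadcasts each group statistic back to every row of the group
for func in ['max', 'min', 'mean']:

    toy[f'Neighborhood.{func.upper()}(LotArea)'] = (
        toy.groupby('Neighborhood')['LotArea'].transform(func))

print(toy)
```

For a couple of variables this is often simpler than building an entity set. Featuretools pays off when there are many variables, several grouping keys or deeper feature stacks.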

You can always plot new variables against the target and explore them more meticulously:

In [None]:
with plt.rc_context(rc = {'figure.dpi': 250, 'axes.labelsize': 6, 
                          'xtick.labelsize': 5, 'ytick.labelsize': 5}):

    fig, ax = plt.subplots(2, 3, figsize = (6.5, 4.5), sharey = True)

    for idx, (column, axes) in list(enumerate(zip(list(df_ft_train.columns), 
                                                  ax.flatten()))):
    
        sns.scatterplot(ax = axes, x = df_ft_train[column], 
                        y = np.log(df_train['SalePrice']),
                        hue = np.log(df_train['SalePrice']),
                        palette = 'viridis', alpha = 0.7, s = 7)
    
        axes.legend([], [], frameon = False)
    
    else:
    
        [axes.set_visible(False) for axes in ax.flatten()[idx + 1:]]

    plt.tight_layout(pad = 1)
    plt.show()

## Thanks for reading!