# **Notebook 4: Feature Engineering**

## Objectives

* Engineer features for Regression models

## Inputs

* outputs/datasets/cleaned/train/TrainSetCleaned.csv
* outputs/datasets/cleaned/test/TestSetCleaned.csv

## Outputs

* Generate a list with variables to engineer

## Conclusions

* Feature Engineering Transformers
  * Ordinal categorical encoding: `['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual']`
  * Numerical transformation: `['1stFlrSF',
    '2ndFlrSF',
    'BedroomAbvGr',
    'BsmtFinSF1',
    'BsmtUnfSF',
    'TotalBsmtSF',
    'GarageArea',
    'GarageYrBlt',
    'GrLivArea',
    'LotArea',
    'LotFrontage',
    'MasVnrArea',
    'OpenPorchSF',
    'OverallCond',
    'OverallQual',
    'YearBuilt',
    'YearRemodAdd']`
  * Smart Correlation Selection: `[]`



---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you will need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/housing/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/housing'
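
The parent-directory lookup can be checked on its own, without changing the real working directory. A minimal sketch, assuming the path shown in the output above and treating it as a plain string (the variable names are illustrative):

```python
import os

# Path taken from the notebook output above; used as a plain string,
# so nothing on disk is read or changed.
notebook_dir = "/workspaces/housing/jupyter_notebooks"

# os.path.dirname() strips the last path component, giving the parent folder.
project_root = os.path.dirname(notebook_dir)
print(project_root)
```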

# Section 1: Load and inspect the data

Train set

In [4]:
import pandas as pd

TrainSet = pd.read_csv("outputs/datasets/cleaned/train/TrainSetCleaned.csv")
TrainSet.head(3)

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,GarageArea,GarageFinish,GarageYrBlt,...,KitchenQual,LotArea,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,YearBuilt,YearRemodAdd
0,1828,0,3,Av,48,Unf,1774,774,Unf,2007,...,Gd,11694,90,452,108,5,9,1822,2007,2007
1,894,0,2,No,0,Unf,894,308,Unf,1962,...,TA,6600,60,0,0,5,5,894,1962,1962
2,964,0,2,No,713,ALQ,163,432,Unf,1921,...,TA,13360,80,0,0,7,5,876,1921,2006


Test set

In [5]:
TestSet = pd.read_csv("outputs/datasets/cleaned/test/TestSetCleaned.csv")
TestSet.head(3)

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,GarageArea,GarageFinish,GarageYrBlt,...,KitchenQual,LotArea,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,YearBuilt,YearRemodAdd
0,2515,0,4,No,1219,Rec,816,484,Unf,1975,...,TA,32668,69,0,0,3,6,2035,1957,1975
1,958,620,3,No,403,BLQ,238,240,Unf,1941,...,Fa,9490,79,0,0,7,6,806,1941,1950
2,979,224,3,No,185,LwQ,524,352,Unf,1950,...,Gd,7015,69,161,0,4,5,709,1950,1950


# Feature Engineering Analysis

### Custom function:

In [6]:
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import warnings
from feature_engine import transformation as vt
from feature_engine.outliers import Winsorizer
from feature_engine.encoding import OrdinalEncoder
sns.set(style="whitegrid")
warnings.filterwarnings('ignore')


def FeatureEngineeringAnalysis(df, analysis_type='numerical'):
    """
    - used for quick feature engineering on numerical and categorical variables
    to decide which transformation can better transform the distribution shape
    - Once transformed, use a reporting tool, like pandas-profiling, to evaluate distributions
    """
    check_missing_values(df)
    allowed_types = ['numerical', 'ordinal_encoder', 'outlier_winsorizer']
    check_user_entry_on_analysis_type(analysis_type, allowed_types)
    list_column_transformers = define_list_column_transformers(analysis_type)

    # Loop in each variable and engineer the data according to the analysis type
    df_feat_eng = pd.DataFrame([])
    for column in df.columns:
        # create additional columns (column_method) to apply the methods
        df_feat_eng = pd.concat([df_feat_eng, df[column]], axis=1)
        for method in list_column_transformers:
            df_feat_eng[f"{column}_{method}"] = df[column]

        # Apply transformers in respective column_transformers
        df_feat_eng, list_applied_transformers = apply_transformers(
            analysis_type, df_feat_eng, column)

        # For each variable, assess how the transformations perform
        transformer_evaluation(
            column, list_applied_transformers, analysis_type, df_feat_eng)

    return df_feat_eng


def check_user_entry_on_analysis_type(analysis_type, allowed_types):
    """ Check analysis type """
    if analysis_type is None:
        raise SystemExit(
            f"You should pass analysis_type parameter as one of the following options: {allowed_types}")
    if analysis_type not in allowed_types:
        raise SystemExit(
            f"analysis_type argument should be one of these options: {allowed_types}")


def check_missing_values(df):
    if df.isna().sum().sum() != 0:
        raise SystemExit(
            "There is a missing value in your dataset. Please handle that before getting into feature engineering.")


def define_list_column_transformers(analysis_type):
    """ Set suffix columns according to analysis_type"""
    if analysis_type == 'numerical':
        list_column_transformers = [
            "log_e", "log_10", "reciprocal", "power", "box_cox", "yeo_johnson"]

    elif analysis_type == 'ordinal_encoder':
        list_column_transformers = ["ordinal_encoder"]

    elif analysis_type == 'outlier_winsorizer':
        list_column_transformers = ['iqr']

    return list_column_transformers


def apply_transformers(analysis_type, df_feat_eng, column):
    for col in df_feat_eng.select_dtypes(include='category').columns:
        df_feat_eng[col] = df_feat_eng[col].astype('object')

    if analysis_type == 'numerical':
        df_feat_eng, list_applied_transformers = FeatEngineering_Numerical(
            df_feat_eng, column)

    elif analysis_type == 'outlier_winsorizer':
        df_feat_eng, list_applied_transformers = FeatEngineering_OutlierWinsorizer(
            df_feat_eng, column)

    elif analysis_type == 'ordinal_encoder':
        df_feat_eng, list_applied_transformers = FeatEngineering_CategoricalEncoder(
            df_feat_eng, column)

    return df_feat_eng, list_applied_transformers


def transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng):
    # For each variable, assess how the transformations perform
    print(f"* Variable Analyzed: {column}")
    print(f"* Applied transformation: {list_applied_transformers} \n")
    for col in [column] + list_applied_transformers:

        if analysis_type != 'ordinal_encoder':
            DiagnosticPlots_Numerical(df_feat_eng, col)

        else:
            if col == column:
                DiagnosticPlots_Categories(df_feat_eng, col)
            else:
                DiagnosticPlots_Numerical(df_feat_eng, col)

        print("\n")


def DiagnosticPlots_Categories(df_feat_eng, col):
    plt.figure(figsize=(4, 3))
    sns.countplot(data=df_feat_eng, x=col, palette=[
                  '#432371'], order=df_feat_eng[col].value_counts().index)
    plt.xticks(rotation=90)
    plt.suptitle(f"{col}", fontsize=30, y=1.05)
    plt.show()
    print("\n")


def DiagnosticPlots_Numerical(df, variable):
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    sns.histplot(data=df, x=variable, kde=True, element="step", ax=axes[0])
    stats.probplot(df[variable], dist="norm", plot=axes[1])
    sns.boxplot(x=df[variable], ax=axes[2])

    axes[0].set_title('Histogram')
    axes[1].set_title('QQ Plot')
    axes[2].set_title('Boxplot')
    fig.suptitle(f"{variable}", fontsize=30, y=1.05)
    plt.tight_layout()
    plt.show()


def FeatEngineering_CategoricalEncoder(df_feat_eng, column):
    list_methods_worked = []
    try:
        encoder = OrdinalEncoder(encoding_method='arbitrary', variables=[
                                 f"{column}_ordinal_encoder"])
        df_feat_eng = encoder.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_ordinal_encoder")

    except Exception:
        df_feat_eng.drop([f"{column}_ordinal_encoder"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked


def FeatEngineering_OutlierWinsorizer(df_feat_eng, column):
    list_methods_worked = []

    # Winsorizer iqr
    try:
        disc = Winsorizer(
            capping_method='iqr', tail='both', fold=1.5, variables=[f"{column}_iqr"])
        df_feat_eng = disc.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_iqr")
    except Exception:
        df_feat_eng.drop([f"{column}_iqr"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked


def FeatEngineering_Numerical(df_feat_eng, column):
    list_methods_worked = []

    # LogTransformer base e
    try:
        lt = vt.LogTransformer(variables=[f"{column}_log_e"])
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_e")
    except Exception:
        df_feat_eng.drop([f"{column}_log_e"], axis=1, inplace=True)

    # LogTransformer base 10
    try:
        lt = vt.LogTransformer(variables=[f"{column}_log_10"], base='10')
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_10")
    except Exception:
        df_feat_eng.drop([f"{column}_log_10"], axis=1, inplace=True)

    # ReciprocalTransformer
    try:
        rt = vt.ReciprocalTransformer(variables=[f"{column}_reciprocal"])
        df_feat_eng = rt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_reciprocal")
    except Exception:
        df_feat_eng.drop([f"{column}_reciprocal"], axis=1, inplace=True)

    # PowerTransformer
    try:
        pt = vt.PowerTransformer(variables=[f"{column}_power"])
        df_feat_eng = pt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_power")
    except Exception:
        df_feat_eng.drop([f"{column}_power"], axis=1, inplace=True)

    # BoxCoxTransformer
    try:
        bct = vt.BoxCoxTransformer(variables=[f"{column}_box_cox"])
        df_feat_eng = bct.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_box_cox")
    except Exception:
        df_feat_eng.drop([f"{column}_box_cox"], axis=1, inplace=True)

    # YeoJohnsonTransformer
    try:
        yjt = vt.YeoJohnsonTransformer(variables=[f"{column}_yeo_johnson"])
        df_feat_eng = yjt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_yeo_johnson")
    except Exception:
        df_feat_eng.drop([f"{column}_yeo_johnson"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked

## Feature Engineering Spreadsheet Summary


We will use
- Categorical Encoding on all 4 categorical variables
- Numerical Transformation on all 17 numerical variables
- Smart Correlated Selection on all variables

## Dealing with Feature Engineering

### Categorical Encoding - Ordinal: replaces categories with ordinal numbers 

* Step 1: Select variable(s)

In [None]:
%matplotlib inline

In [None]:
variables_engineering= ['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual']

variables_engineering

* Step 2: Create a separate DataFrame, with your variable(s)

In [None]:
df_engineering = TrainSet[variables_engineering].copy()
df_engineering.head(3)

* Step 3: Create engineered variable(s) by applying the transformation(s), assess the distribution of the engineered variables, and select the most suitable method for each variable.

In [None]:
%matplotlib inline

In [None]:
df_engineering = FeatureEngineeringAnalysis(df=df_engineering, analysis_type='ordinal_encoder')

 - For all variables, the transformation is effective, since it converted categories to numbers.
 - None of the variables seem to be normally distributed.

* Step 4: Apply the selected transformation to the Train and Test set

In [None]:
# the steps are: 
# 1 - create a transformer
# 2 - fit_transform into TrainSet
# 3 - transform into TestSet 
encoder = OrdinalEncoder(encoding_method='arbitrary', variables = variables_engineering)
TrainSet = encoder.fit_transform(TrainSet)
TestSet = encoder.transform(TestSet)

print("* Categorical encoding - ordinal transformation done!")
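
The fit-on-train, transform-on-test pattern used above can be illustrated without any library. The sketch below is a simplified stand-in for an arbitrary ordinal encoding, not feature_engine's implementation: it assumes integers are assigned in order of first appearance, and the helper names and toy values are hypothetical.

```python
def fit_ordinal_mapping(values):
    """Learn a category -> integer mapping in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


def apply_ordinal_mapping(values, mapping):
    """Encode values with a previously learned mapping.

    Raises KeyError for a category that was not seen during fitting.
    """
    return [mapping[value] for value in values]


# Toy values (hypothetical), using labels that occur in KitchenQual
train_kitchen_qual = ["Gd", "TA", "TA", "Fa", "Gd"]
test_kitchen_qual = ["TA", "Fa", "Gd"]

# The mapping is learned on the train values only...
mapping = fit_ordinal_mapping(train_kitchen_qual)
train_encoded = apply_ordinal_mapping(train_kitchen_qual, mapping)
# ...and then reused unchanged on the test values.
test_encoded = apply_ordinal_mapping(test_kitchen_qual, mapping)

print(mapping)
print(train_encoded, test_encoded)
```

Learning the mapping from the train set alone keeps information from the test set out of the fitted transformer, which is why the test set is only passed to transform.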

### Numerical Transformation

* Step 1: Select variable(s)

In [7]:
variables_engineering = ['1stFlrSF',
    '2ndFlrSF',
    'BedroomAbvGr',
    'BsmtFinSF1',
    'BsmtUnfSF',
    'TotalBsmtSF',
    'GarageArea',
    'GarageYrBlt',
    'GrLivArea',
    'LotArea',
    'LotFrontage',
    'MasVnrArea',
    'OpenPorchSF',
    'OverallCond',
    'OverallQual',
    'YearBuilt',
    'YearRemodAdd'
]

variables_engineering

['1stFlrSF',
 '2ndFlrSF',
 'BedroomAbvGr',
 'BsmtFinSF1',
 'BsmtUnfSF',
 'TotalBsmtSF',
 'GarageArea',
 'GarageYrBlt',
 'GrLivArea',
 'LotArea',
 'LotFrontage',
 'MasVnrArea',
 'OpenPorchSF',
 'OverallCond',
 'OverallQual',
 'YearBuilt',
 'YearRemodAdd']

* Step 2: Divide the variables into three chunks (processing all of the variables at once caused issues with the transformation run).

In [8]:
# Get the total number of variables
total_variables = len(variables_engineering)

# Split the variables into three chunks
variables_engineering_1 = variables_engineering[:total_variables//3]
variables_engineering_2 = variables_engineering[total_variables//3:(total_variables//3)*2]
variables_engineering_3 = variables_engineering[(total_variables//3)*2:]
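
Because 17 is not divisible by 3, the floor-division slicing above gives chunks of unequal size, with the last chunk absorbing the remainder. A small sketch with placeholder names (the `var_i` names are hypothetical) makes the sizes explicit:

```python
# 17 placeholder names standing in for the real variable list
variables = [f"var_{i}" for i in range(17)]
n = len(variables)

# Same slicing logic as in the cell above
chunk_1 = variables[:n // 3]
chunk_2 = variables[n // 3:(n // 3) * 2]
chunk_3 = variables[(n // 3) * 2:]

sizes = [len(chunk_1), len(chunk_2), len(chunk_3)]
print(sizes)

# Every variable ends up in exactly one chunk, in the original order
assert chunk_1 + chunk_2 + chunk_3 == variables
```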

* Step 3: Create separate DataFrames and run the feature engineering function on each subset of data
  * After processing each chunk, we delete its DataFrame and call the garbage collector (gc.collect()) to release memory that is no longer in use. This helps prevent the notebook from running out of resources.
  * For users of this notebook: clear the cell output before running the next chunk to avoid crashing the notebook.

In [None]:
#Create the first dataframe and run the transformations
df_engineering_1 = TrainSet[variables_engineering_1].copy()
df_engineering_1 = FeatureEngineeringAnalysis(df=df_engineering_1, analysis_type='numerical')

* Assess engineered variables distribution and select the most suitable method
* For each variable, the conclusion on how effective the transformation(s) appear to be:
1. '1stFlrSF' = log-e and log-10 show similar results in normalizing the data, as do Box-Cox and Yeo-Johnson. Any of these transformations could be used. Since the latter two include a log transformation as a special case, either of them could be used.
2. '2ndFlrSF' = only power and Yeo-Johnson were applied to this variable, and neither normalizes the data. The reason seems to be that the variable contains a large number of zeros. Unlike the log transformation, which is undefined at zero, Yeo-Johnson can handle zeros. However, in this case it did not normalize the data.
3. 'BedroomAbvGr' = neither power nor Yeo-Johnson helped normalize the data.
4. 'BsmtFinSF1' = neither power nor Yeo-Johnson helped normalize the data.
5. 'BsmtUnfSF' = both power and Yeo-Johnson normalized the data somewhat, although there is still a large number of zeros at the tail end of the curve.
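
The behaviour at zero noted above can be checked directly. The sketch below implements the standard Yeo-Johnson formula for a single value in plain Python; the `lmbda` value of 0.5 and the toy values are arbitrary assumptions for illustration (in practice the parameter is estimated from the data). It shows that the log is undefined at zero, while Yeo-Johnson is defined there but maps every zero to the same output, so a spike of zeros remains a spike after the transformation.

```python
import math


def yeo_johnson(x, lmbda):
    """Yeo-Johnson transform of a single value x with parameter lmbda."""
    if x >= 0:
        if lmbda == 0:
            return math.log(x + 1)
        return ((x + 1) ** lmbda - 1) / lmbda
    if lmbda == 2:
        return -math.log(-x + 1)
    return -(((-x + 1) ** (2 - lmbda)) - 1) / (2 - lmbda)


# The natural log is undefined at zero: math.log(0) raises ValueError
try:
    math.log(0)
    log_defined_at_zero = True
except ValueError:
    log_defined_at_zero = False

# Toy values (hypothetical) with many zeros, similar in shape to '2ndFlrSF'
values = [0, 0, 0, 854, 1200]
transformed = [yeo_johnson(v, 0.5) for v in values]

print(log_defined_at_zero)
print(transformed)
```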

In [None]:
# Delete the first dataframe to free memory
del df_engineering_1
import gc
gc.collect()

In [None]:
# Create the second dataframe and run the transformations
df_engineering_2 = TrainSet[variables_engineering_2].copy()
df_engineering_2 = FeatureEngineeringAnalysis(df=df_engineering_2, analysis_type='numerical')

* Assess engineered variables distribution and select the most suitable method
* For each variable, the conclusion on how effective the transformation(s) appear to be:
1. 'TotalBsmtSF' = Power and Yeo-Johnson were applied, the latter of which improves normality somewhat.
2. 'GarageArea' = Power and Yeo-Johnson were applied, the latter of which improves normality somewhat.
3. 'GarageYrBlt' = A number of transformers were applied, but none helps normalize the data.
4. 'GrLivArea' = both log transformations, as well as Box-Cox and Yeo-Johnson, normalize the data.
5. 'LotArea' = Again, both log transformations, as well as Box-Cox and Yeo-Johnson, normalize the data.

In [None]:
# Delete the second dataframe to free memory
del df_engineering_2
gc.collect()

In [None]:
# Create the third dataframe and run the transformations
df_engineering_3 = TrainSet[variables_engineering_3].copy()
df_engineering_3 = FeatureEngineeringAnalysis(df=df_engineering_3, analysis_type='numerical')

* Assess engineered variables distribution and select the most suitable method
* For each variable, the conclusion on how effective the transformation(s) appear to be:
1. 'LotFrontage' = Both log transformations normalize the data, as do the power, Box-Cox and Yeo-Johnson transformations.
2. 'MasVnrArea' = Due to the very high number of zeros in the variable, no transformations are able to normalize the data.
3. 'OpenPorchSF' = Due to the very high number of zeros in the variable, no transformations are able to normalize the data.
4. 'OverallCond' = none of the transformations appear to normalize the data.
5. 'OverallQual' = none of the transformations improve data normality much.
6. 'YearBuilt' = none of the transformations normalize the data significantly.
7. 'YearRemodAdd' = none of the transformations normalize the data significantly.
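
The conclusions above are based on visual inspection of the plots. One possible numeric complement is the sample skewness, which is close to zero for a symmetric distribution. The sketch below is only an illustration in plain Python: the helper name and the toy values are hypothetical and are not taken from the dataset.

```python
import math


def sample_skewness(values):
    """Moment-based skewness: third central moment over std cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Toy right-skewed values (hypothetical): each value doubles the previous one
raw = [1000, 2000, 4000, 8000, 16000, 32000, 64000]
logged = [math.log(v) for v in raw]

skew_raw = sample_skewness(raw)
skew_log = sample_skewness(logged)

# The raw values are clearly right-skewed; after the log they are evenly
# spaced, so the skewness is zero up to floating-point error.
print(round(skew_raw, 2), round(abs(skew_log), 6))
```

Comparing the skewness before and after each transformation gives a single number per variable that can back up what the histogram and QQ plot suggest.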

In [None]:
# Delete the third dataframe to free memory
del df_engineering_3
gc.collect()

* Step 4 - Apply the selected transformation to the Train and Test set

In [None]:
# write code for transforming train and test set here.

### SmartCorrelatedSelection Variables

* Step 1: Select variable(s)

In [None]:
# No variable selection is needed here: this transformer uses all variables

* Step 2: Create a separate DataFrame, with your variable(s)

In [None]:
df_engineering = TrainSet.copy()
df_engineering.head(3)

* Step 3: Create engineered variable(s) by applying the transformation(s)

In [None]:
from feature_engine.selection import SmartCorrelatedSelection
corr_sel = SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.6, selection_method="variance")

corr_sel.fit_transform(df_engineering)
corr_sel.correlated_feature_sets_

In [None]:
corr_sel.features_to_drop_
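
SmartCorrelatedSelection groups features whose pairwise correlation exceeds the threshold and keeps one feature per group. The sketch below illustrates that idea in plain Python for two features, using Spearman correlation and a keep-the-higher-variance rule. It is an illustration under stated assumptions, not the library's implementation: the helper functions, the feature names and the toy values are all hypothetical.

```python
def ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    norm_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def variance(values):
    """Sample variance (n - 1 in the denominator)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


# Toy features (hypothetical): feature_b rises whenever feature_a rises
features = {
    "feature_a": [1921, 1950, 1962, 1975, 2007],
    "feature_b": [1930, 1950, 1962, 1980, 2007],
}
threshold = 0.6  # same threshold as in the cell above

rho = spearman(features["feature_a"], features["feature_b"])
to_drop = None
if abs(rho) > threshold:
    # Keep the feature with the higher variance, drop the other one
    to_drop = min(features, key=lambda name: variance(features[name]))

print(rho, to_drop)
```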

----

# Conclusion

The list below shows the transformations needed for feature engineering.
  * We will add these steps to the ML Pipeline

Feature Engineering Transformers
  * Ordinal categorical encoding: `['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual']`
  * Numerical transformation: `['1stFlrSF',
    '2ndFlrSF',
    'BedroomAbvGr',
    'BsmtFinSF1',
    'BsmtUnfSF',
    'TotalBsmtSF',
    'GarageArea',
    'GarageYrBlt',
    'GrLivArea',
    'LotArea',
    'LotFrontage',
    'MasVnrArea',
    'OpenPorchSF',
    'OverallCond',
    'OverallQual',
    'YearBuilt',
    'YearRemodAdd']`
  * Smart Correlation Selection: `[]`