# Feature Engineering Notebook

## Objectives

*   Engineer features for Regression models


## Inputs

* outputs/datasets/collection/HousePricing.csv

## Outputs

* Generate a list of variables to engineer

## Conclusions



* Feature Engineering Transformers
  * Ordinal categorical encoding: `['KitchenQual', 'GarageFinish', 'BsmtFinType1', 'BsmtExposure']`
  * Numerical Transformation: YeoJohnsonTransformer `['1stFlrSF', 'GrLivArea', 'LotArea', 'LotFrontage']`
  * Smart Correlation Selection: `['1stFlrSF', 'GarageArea', 'GarageYrBlt', 'GrLivArea', 'YearRemodAdd']`
  


---

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import warnings
import matplotlib.pyplot as plt
%matplotlib inline
warnings.filterwarnings('ignore')

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

---

# Load Cleaned Data

In [None]:
import pandas as pd
df = pd.read_csv('outputs/datasets/collection/HousePricing.csv')
df

### Split Train and Test Set

In [None]:
from sklearn.model_selection import train_test_split
TrainSet, TestSet, _, __ = train_test_split(
                                        df,
                                        df['SalePrice'],
                                        test_size=0.2,
                                        random_state=0)

print(f"TrainSet shape: {TrainSet.shape} \nTestSet shape: {TestSet.shape}")

### Data Cleaning Pipeline

In [None]:
from sklearn.pipeline import Pipeline
from feature_engine.imputation import CategoricalImputer
from feature_engine.imputation import MeanMedianImputer
from feature_engine.selection import DropFeatures
pipeline = Pipeline([
      ('Median', MeanMedianImputer(imputation_method='median', 
                                   variables=['2ndFlrSF', 'BedroomAbvGr', 'GarageYrBlt', 
                                              'LotFrontage', 'MasVnrArea']
                                              )),
      ('CategoricalImputer', CategoricalImputer(imputation_method='missing', 
                                                fill_value='No Record', 
                                                variables=['BsmtFinType1','GarageFinish']
                                                )),
      ('Drop', DropFeatures(features_to_drop=['EnclosedPorch', 'WoodDeckSF']))
])
pipeline

### Fit Pipeline

In [None]:
pipeline.fit(TrainSet)

TrainSet, TestSet = pipeline.transform(TrainSet), pipeline.transform(TestSet)

# Data Exploration

In feature engineering, you evaluate which transformations could be applied to your variables.
* Take notes in a separate spreadsheet

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=TrainSet, minimal=True)
pandas_report.to_notebook_iframe()

# Feature Engineering

## Custom function

We studied this custom function in the feature-engine lesson. It will help you with the feature engineering process.
* Do not worry if you need help understanding the full code at first, as it is expected you will take some time to absorb the use case.
* At this moment, what matters is to understand the function objective and how you can use it.

In [None]:
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import pandas as pd
import warnings
from feature_engine import transformation as vt
from feature_engine.outliers import Winsorizer
from feature_engine.encoding import OrdinalEncoder
sns.set(style="whitegrid")
warnings.filterwarnings('ignore')


def FeatureEngineeringAnalysis(df, analysis_type=None):
    """
    - used for quick feature engineering on numerical and categorical variables
    to decide which transformation can better transform the distribution shape
    - Once transformed, use a reporting tool, like ydata-profiling, to evaluate distributions
    """
    check_missing_values(df)
    allowed_types = ['numerical', 'ordinal_encoder', 'outlier_winsorizer']
    check_user_entry_on_analysis_type(analysis_type, allowed_types)
    list_column_transformers = define_list_column_transformers(analysis_type)

    # Loop in each variable and engineer the data according to the analysis type
    df_feat_eng = pd.DataFrame([])
    for column in df.columns:
        # create additional columns (column_method) to apply the methods
        df_feat_eng = pd.concat([df_feat_eng, df[column]], axis=1)
        for method in list_column_transformers:
            df_feat_eng[f"{column}_{method}"] = df[column]

        # Apply transformers in respective column_transformers
        df_feat_eng, list_applied_transformers = apply_transformers(
            analysis_type, df_feat_eng, column)

        # For each variable, assess how the transformations perform
        transformer_evaluation(
            column, list_applied_transformers, analysis_type, df_feat_eng)

    return df_feat_eng


def check_user_entry_on_analysis_type(analysis_type, allowed_types):
    """ Check analysis type """
    if analysis_type is None:
        raise SystemExit(
            f"You should pass analysis_type parameter as one of the following options: {allowed_types}")
    if analysis_type not in allowed_types:
        raise SystemExit(
            f"analysis_type argument should be one of these options: {allowed_types}")


def check_missing_values(df):
    if df.isna().sum().sum() != 0:
        raise SystemExit(
            "There is a missing value in your dataset. Please handle that before getting into feature engineering.")


def define_list_column_transformers(analysis_type):
    """ Set suffix columns according to analysis_type"""
    if analysis_type == 'numerical':
        list_column_transformers = [
            "log_e", "log_10", "reciprocal", "power", "box_cox", "yeo_johnson"]

    elif analysis_type == 'ordinal_encoder':
        list_column_transformers = ["ordinal_encoder"]

    elif analysis_type == 'outlier_winsorizer':
        list_column_transformers = ['iqr']

    return list_column_transformers


def apply_transformers(analysis_type, df_feat_eng, column):
    for col in df_feat_eng.select_dtypes(include='category').columns:
        df_feat_eng[col] = df_feat_eng[col].astype('object')

    if analysis_type == 'numerical':
        df_feat_eng, list_applied_transformers = FeatEngineering_Numerical(
            df_feat_eng, column)

    elif analysis_type == 'outlier_winsorizer':
        df_feat_eng, list_applied_transformers = FeatEngineering_OutlierWinsorizer(
            df_feat_eng, column)

    elif analysis_type == 'ordinal_encoder':
        df_feat_eng, list_applied_transformers = FeatEngineering_CategoricalEncoder(
            df_feat_eng, column)

    return df_feat_eng, list_applied_transformers


def transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng):
    # For each variable, assess how the transformations perform
    print(f"* Variable Analyzed: {column}")
    print(f"* Applied transformation: {list_applied_transformers} \n")
    for col in [column] + list_applied_transformers:

        if analysis_type != 'ordinal_encoder':
            DiagnosticPlots_Numerical(df_feat_eng, col)

        else:
            if col == column:
                DiagnosticPlots_Categories(df_feat_eng, col)
            else:
                DiagnosticPlots_Numerical(df_feat_eng, col)

        print("\n")


def DiagnosticPlots_Categories(df_feat_eng, col):
    plt.figure(figsize=(4, 3))
    # A single bar colour is passed via `color`; `palette` without `hue` is deprecated in recent seaborn
    sns.countplot(data=df_feat_eng, x=col, color='#432371',
                  order=df_feat_eng[col].value_counts().index)
    plt.xticks(rotation=90)
    plt.suptitle(f"{col}", fontsize=30, y=1.05)
    plt.show()
    print("\n")


def DiagnosticPlots_Numerical(df, variable):
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    sns.histplot(data=df, x=variable, kde=True, element="step", ax=axes[0])
    stats.probplot(df[variable], dist="norm", plot=axes[1])
    sns.boxplot(x=df[variable], ax=axes[2])

    axes[0].set_title('Histogram')
    axes[1].set_title('QQ Plot')
    axes[2].set_title('Boxplot')
    fig.suptitle(f"{variable}", fontsize=30, y=1.05)
    plt.tight_layout()
    plt.show()


def FeatEngineering_CategoricalEncoder(df_feat_eng, column):
    list_methods_worked = []
    try:
        encoder = OrdinalEncoder(encoding_method='arbitrary', 
                        variables=[f"{column}_ordinal_encoder"])
        df_feat_eng = encoder.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_ordinal_encoder")

    except Exception:
        df_feat_eng.drop([f"{column}_ordinal_encoder"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked


def FeatEngineering_OutlierWinsorizer(df_feat_eng, column):
    list_methods_worked = []

    # Winsorizer iqr
    try:
        disc = Winsorizer(
            capping_method='iqr', tail='both', fold=1.5, variables=[f"{column}_iqr"])
        df_feat_eng = disc.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_iqr")
    except Exception:
        df_feat_eng.drop([f"{column}_iqr"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked


def FeatEngineering_Numerical(df_feat_eng, column):
    list_methods_worked = []

    # LogTransformer base e
    try:
        lt = vt.LogTransformer(variables=[f"{column}_log_e"])
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_e")
    except Exception:
        df_feat_eng.drop([f"{column}_log_e"], axis=1, inplace=True)

    # LogTransformer base 10
    try:
        lt = vt.LogTransformer(variables=[f"{column}_log_10"], base='10')
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_10")
    except Exception:
        df_feat_eng.drop([f"{column}_log_10"], axis=1, inplace=True)

    # ReciprocalTransformer
    try:
        rt = vt.ReciprocalTransformer(variables=[f"{column}_reciprocal"])
        df_feat_eng = rt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_reciprocal")
    except Exception:
        df_feat_eng.drop([f"{column}_reciprocal"], axis=1, inplace=True)

    # PowerTransformer
    try:
        pt = vt.PowerTransformer(variables=[f"{column}_power"])
        df_feat_eng = pt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_power")
    except Exception:
        df_feat_eng.drop([f"{column}_power"], axis=1, inplace=True)

    # BoxCoxTransformer
    try:
        bct = vt.BoxCoxTransformer(variables=[f"{column}_box_cox"])
        df_feat_eng = bct.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_box_cox")
    except Exception:
        df_feat_eng.drop([f"{column}_box_cox"], axis=1, inplace=True)

    # YeoJohnsonTransformer
    try:
        yjt = vt.YeoJohnsonTransformer(variables=[f"{column}_yeo_johnson"])
        df_feat_eng = yjt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_yeo_johnson")
    except Exception:
        df_feat_eng.drop([f"{column}_yeo_johnson"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked
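
The `outlier_winsorizer` option above caps values with the IQR rule (`capping_method='iqr'`, `fold=1.5`). As a rough, dependency-free sketch of what that rule computes, the snippet below caps values outside `Q1 - fold * IQR` and `Q3 + fold * IQR`. The helper name `iqr_cap` and the sample values are hypothetical, and the quartile interpolation used here may differ slightly from the library's.

```python
from statistics import quantiles


def iqr_cap(values, fold=1.5):
    """Cap values outside [Q1 - fold*IQR, Q3 + fold*IQR] (illustrative helper)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - fold * iqr, q3 + fold * iqr
    # Values inside the bounds are kept; values outside are moved to the bound
    return [min(max(v, lower), upper) for v in values]


# Illustrative sample with one extreme value (not taken from the dataset)
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
capped = iqr_cap(sample)
print(capped)
```

Only the extreme value is moved to the upper bound; the remaining values are unchanged, which is why the boxplot of an `_iqr` column shows no points beyond the whiskers.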


## Feature Engineering Spreadsheet Summary

* Consider the notes taken in your spreadsheet summary. List the transformers you will use
    * Categorical Encoding
    * Numerical Transformation
    * Smart Correlation Selection

## Dealing with Feature Engineering

### Categorical Encoding - Ordinal: replaces categories with ordinal numbers 

* Step 1: Select variable(s)

In [None]:
variables_engineering= ['KitchenQual', 'GarageFinish', 'BsmtFinType1', 'BsmtExposure']

variables_engineering

* Step 2: Create a separate DataFrame, with your variable(s)

In [None]:
df_engineering = TrainSet[variables_engineering].copy()
df_engineering.head(3)

* Step 3: Create engineered variable(s) by applying the transformation(s), assess the distribution of the engineered variables and select the most suitable method for each variable.

In [None]:
df_engineering = FeatureEngineeringAnalysis(df=df_engineering, analysis_type='ordinal_encoder')

* For each variable, write your conclusion on how the transformation(s) look(s) to be effective.
  * For all variables, the transformation is effective, since it converted categories to numbers.
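
To make the "categories to numbers" step concrete, here is a minimal sketch of an arbitrary ordinal mapping, assuming integers are assigned in order of first appearance (which is how we understand `encoding_method='arbitrary'` to behave). The helper name and the category values are illustrative, not read from the dataset.

```python
def arbitrary_ordinal_mapping(categories):
    """Assign 0, 1, 2, ... to categories in order of first appearance."""
    mapping = {}
    for category in categories:
        if category not in mapping:
            mapping[category] = len(mapping)
    return mapping


# Illustrative quality labels (hypothetical sample, not the real column)
kitchen_qual = ["Gd", "TA", "Gd", "Ex", "TA", "Fa"]

mapping = arbitrary_ordinal_mapping(kitchen_qual)
encoded = [mapping[c] for c in kitchen_qual]
print(mapping)
print(encoded)

# An explicit, quality-ordered mapping would look different:
quality_order = {"Fa": 0, "TA": 1, "Gd": 2, "Ex": 3}
encoded_by_quality = [quality_order[c] for c in kitchen_qual]
print(encoded_by_quality)
```

The arbitrary codes depend on row order rather than on the meaning of the labels, so they should not be read as a quality ranking. Tree-based models are usually insensitive to this, while linear models may benefit from an explicit ordering.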



* Step 4: Apply the selected transformation to the Train and Test set

In [None]:
encoder = OrdinalEncoder(encoding_method='arbitrary', variables=variables_engineering)
TrainSet = encoder.fit_transform(TrainSet)
TestSet = encoder.transform(TestSet)

print("* Categorical encoding - ordinal transformation done!")

### Numerical Transformation

* Step 1: Select variable(s)

In [None]:
variables_engineering = ['1stFlrSF', 'GrLivArea', 'LotArea', 'LotFrontage']
variables_engineering

* Step 2: Create a separate DataFrame, with your variable(s)

In [None]:
df_engineering = TrainSet[variables_engineering].copy()
df_engineering.head(20)

* Step 3: Create engineered variable(s) by applying the transformation(s), assess the distribution of the engineered variables and select the most suitable method

In [None]:
df_engineering = FeatureEngineeringAnalysis(df=df_engineering, analysis_type='numerical')

Conclusion on how the transformation(s) look(s) to be effective

**1stFlrSF**

  - **Original**: The histogram shows a right skew, the Q-Q plot indicates the distribution tails off more steeply than normal, and there are outliers apparent in the boxplot.

  - **Logarithmic Transformation e:** The histogram looks more symmetric, indicating that the transformation has corrected some of the skewness. The Q-Q plot is more linear, especially in the central values, though there are deviations in the tails. The boxplot shows fewer outliers, suggesting improvement towards normality.
  
  - **Logarithmic Transformation 10:** Similar to the natural logarithm transformation, the histogram is more bell-shaped, the Q-Q plot looks linear for the central quantiles, and the boxplot shows a reduction in outliers. Both logarithmic transformations appear effective, but without specific values on the Q-Q plot, it's hard to say which is better.
  
  - **Reciprocal Transformation:** The histogram shows an extreme skew, the Q-Q plot exhibits strong nonlinearity, and the boxplot has a lot of spread in the data, indicating that this transformation is not appropriate for normalizing the data.

  - **Power Transformation:** The histogram reveals a right skew, the Q-Q plot shows a nonlinear pattern, and the boxplot suggests the presence of outliers. This indicates that the power transformation has not normalized the data effectively.

  - **Box-Cox Transformation:**  The histogram is fairly symmetric, the Q-Q plot demonstrates a mostly linear pattern, although there's some deviation in the tails, and the boxplot indicates fewer outliers. This transformation appears to have a substantial normalizing effect.

  - **Yeo-Johnson Transformation:** The histogram is quite symmetric, the Q-Q plot follows the line closely with only slight deviations at the tails, and the boxplot shows a reasonable range with some outliers. This transformation seems to normalize the data well.

**Summary**

The natural log, base 10 log, and Box-Cox transformations all show significant improvement in normalizing the distribution (Box-Cox is a power-transformation family that includes the logarithm as a special case).
The Yeo-Johnson transformation seems to perform similarly to the Box-Cox.

**LotArea**

  - **Original**: Shows right skewness, as we see from the long right tail in the histogram, the nonlinear pattern in the Q-Q plot, and the many outliers on the right in the boxplot.

  - **Logarithmic Transformation e:** Still shows some right skewness but to a lesser extent. There are fewer outliers, and the Q-Q plot is closer to a straight line.
  
  - **Logarithmic Transformation 10:** Shows improvements similar to the log_e transformation but does not differ significantly from it in terms of normality.
  
  - **Reciprocal Transformation:** Leads to a left-skewed distribution, which is also not normal, as is evident from the left tail in the histogram, the S-shaped Q-Q plot, and the presence of outliers in the boxplot.

  - **Power Transformation:** Doesn't result in normality, as seen by the right skew in the histogram and the nonlinear Q-Q plot.

  - **Box-Cox Transformation:** Seems to offer a significant improvement with a more symmetric histogram and a Q-Q plot that is more linear than the previous transformations.

  - **Yeo-Johnson Transformation:** Shows similar improvements as the Box-Cox, with a fairly symmetric histogram and a reasonably linear Q-Q plot.

 
**Summary**

The Box-Cox and Yeo-Johnson transformations appear to have the most effective normalizing effect on the distribution. Both result in histograms that are approximately symmetric and Q-Q plots that suggest a close resemblance to a normal distribution, although the Yeo-Johnson transformation might have a slight edge due to fewer outliers in the boxplot.
In conclusion, the Yeo-Johnson transformation seems to be the best choice.

**LotFrontage**

  - **Original**: The histogram shows a significant right skew, the Q-Q plot indicates substantial deviation from normality, particularly for larger values, and the boxplot displays many outliers.

  - **Logarithmic Transformations (e and 10):** Both improve symmetry in the histograms, and the Q-Q plots are closer to linear, especially for central values. The boxplots show fewer outliers compared to the original.
  
  - **Reciprocal Transformation:** The histogram is still right-skewed, and the Q-Q plot shows substantial deviations. The boxplot indicates the presence of extreme values.

  - **Power Transformation:** It shows an improvement in the histogram symmetry but still some skewness is present. The Q-Q plot has noticeable deviations in the tails, and the boxplot indicates outliers.

  - **Box-Cox Transformation:** The histogram is more symmetric than the original, the Q-Q plot shows smaller deviations, and the boxplot indicates a reduction in outliers, suggesting a more normal-like distribution.

  - **Yeo-Johnson Transformation:**  Similar to the Box-Cox, the histogram is relatively symmetric, the Q-Q plot aligns more closely with the reference line, and the boxplot shows fewer outliers.


**Summary**

Both the Box-Cox and Yeo-Johnson transformations seem to be the most effective at normalizing the distribution, as indicated by the more symmetric histograms and the Q-Q plots, which lie closer to the reference line. Both are suitable choices.


**GrLivArea**

  - **Original:** The histogram shows right skewness, the Q-Q plot indicates heavy tails, and the boxplot shows outliers, implying it is not normally distributed.

  - **Logarithmic Transformation e:** The histogram is more symmetric, the Q-Q plot is closer to the reference line, and the boxplot shows fewer outliers, indicating an improvement toward normality.

  - **Logarithmic Transformation 10:** This transformation also yields a more bell-shaped histogram, a Q-Q plot that aligns well with the reference line for most of the distribution, and a boxplot with fewer outliers.

  - **Reciprocal Transformation:** The histogram is very skewed, the Q-Q plot indicates non-normality, and the boxplot displays extreme outliers. This transformation does not appear to be suitable.

  - **Power Transformation:** The histogram is more symmetric than the original, the Q-Q plot is closer to linear, though there's still some deviation, and the boxplot has outliers, suggesting some improvement but not optimal.

  - **Box-Cox Transformation:** The histogram appears normally distributed, the Q-Q plot fits the reference line well except for the extreme values, and the boxplot shows a few outliers. This transformation significantly improves the normality.

  - **Yeo-Johnson Transformation:** The histogram, Q-Q plot, and boxplot are similar to those of the Box-Cox transformation, suggesting this transformation also normalizes the data effectively.

**Summary**

The Box-Cox and Yeo-Johnson transformations perform well, resulting in a distribution that is closer to normal. Given the similarity in their performance, both would work.
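
Visual comparisons like the ones above can be hard to rank when two plots look alike. One possible complement is a numeric skewness score per transformed column. The sketch below uses only the standard library and an illustrative synthetic sample (not the housing data); the helper name `skewness` and the choice of candidate transformations are assumptions made for the example.

```python
import math
from statistics import mean, pstdev


def skewness(values):
    """Population skewness: mean of cubed z-scores (about 0 for a symmetric sample)."""
    mu, sigma = mean(values), pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)


# Illustrative right-skewed sample: exp() of an evenly spaced, symmetric grid
original = [math.exp(i / 10) for i in range(-20, 21)]

candidates = {
    "original": original,
    "log_e": [math.log(v) for v in original],
    "power_0.5": [v ** 0.5 for v in original],  # square root
}

skew_by_method = {name: skewness(vals) for name, vals in candidates.items()}
for name, value in skew_by_method.items():
    print(f"{name:>10}: skewness = {value:.3f}")
```

A skewness closer to zero suggests a more symmetric distribution. With the real data, the same score could be computed for each `{column}_{method}` column returned by `FeatureEngineeringAnalysis`, for instance with `df_engineering.skew()`.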
 


* Step 4: Apply the selected transformation to the Train and Test set

In [None]:
yeo = vt.YeoJohnsonTransformer(variables=variables_engineering)
TrainSet = yeo.fit_transform(TrainSet)
TestSet = yeo.transform(TestSet)
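
For reference, the Yeo-Johnson transformation can be written as a small piecewise function. The sketch below is a plain-Python illustration with a hand-picked `lmbda`; the transformer above instead estimates one lambda per variable from the training set, so the values here are for illustration only.

```python
import math


def yeo_johnson(x, lmbda):
    """Yeo-Johnson transform of a single value for a given lambda."""
    if x >= 0:
        if lmbda == 0:
            return math.log1p(x)
        return ((x + 1) ** lmbda - 1) / lmbda
    # Negative branch: mirrored transform with exponent (2 - lambda)
    if lmbda == 2:
        return -math.log1p(-x)
    return -(((-x + 1) ** (2 - lmbda)) - 1) / (2 - lmbda)


# lambda = 1 leaves values unchanged; smaller lambdas compress large positive values
for lmbda in (1, 0.5, 0):
    print(lmbda, [round(yeo_johnson(v, lmbda), 3) for v in (0, 10, 100, 1000)])
```

Unlike Box-Cox, which requires strictly positive inputs, this definition also handles zero and negative values, which is one practical reason to prefer it when both perform similarly.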

### SmartCorrelatedSelection Variables

* Create a separate DataFrame, with your variable(s)

In [None]:
df_engineering = TrainSet.copy()
df_engineering.head(3)

* Step 3: Create engineered variable(s) by applying the transformation(s)

In [None]:
from feature_engine.selection import SmartCorrelatedSelection
corr_sel = SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.6, selection_method="variance")

corr_sel.fit_transform(df_engineering)
corr_sel.correlated_feature_sets_

In [None]:
corr_sel.features_to_drop_
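
To illustrate the idea behind the selection above, here is a simplified, standard-library sketch: group features whose absolute Spearman correlation exceeds the threshold, keep the one with the highest variance in each group, and drop the rest. It is not the library's implementation; the helper names and toy columns are hypothetical, and the rank helper assumes no tied values.

```python
from statistics import mean, pvariance


def ranks(values):
    """Rank values from 1 to n (assumes no ties, to keep the sketch short)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(a, b):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    norm = (sum((x - ma) ** 2 for x in ra) * sum((y - mb) ** 2 for y in rb)) ** 0.5
    return cov / norm


def correlated_features_to_drop(columns, threshold=0.6):
    """Return the features to drop, keeping the highest-variance one per group."""
    to_drop, seen = [], set()
    names = list(columns)
    for name in names:
        if name in seen:
            continue
        group = [name] + [
            other for other in names
            if other != name and other not in seen
            and abs(spearman(columns[name], columns[other])) > threshold
        ]
        seen.update(group)
        if len(group) > 1:
            keep = max(group, key=lambda n: pvariance(columns[n]))
            to_drop.extend(n for n in group if n != keep)
    return to_drop


# Hypothetical toy columns (not the housing data)
toy = {
    "area_a": [10, 20, 30, 40, 50, 60],
    "area_b": [1.0, 2.1, 2.9, 4.2, 5.0, 6.3],  # rises with area_a
    "year": [3, 1, 6, 2, 5, 4],
}
dropped = correlated_features_to_drop(toy)
print(dropped)
```

Because variance depends on each variable's scale, selecting by variance tends to favour features measured in larger units; this is worth keeping in mind when reading `features_to_drop_`.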

---

# So what is the conclusion? :)


The list below shows the transformations needed for feature engineering.
  * You will add these steps to the ML Pipeline


Feature Engineering Transformers
  * Ordinal categorical encoding: `['KitchenQual', 'GarageFinish', 'BsmtFinType1', 'BsmtExposure']`
  * Numerical Transformation: YeoJohnsonTransformer `['1stFlrSF', 'GrLivArea', 'LotArea', 'LotFrontage']`
  * Smart Correlation Selection: `['1stFlrSF', 'GarageArea', 'GarageYrBlt', 'GrLivArea', 'YearRemodAdd']`
  