# **Feature Engineering Notebook**

## Objectives

* Engineer features for Classification, Regression and Cluster models.

To do this, we will complete the following tasks:
* Load and inspect the data prepared during data cleaning
* Explore the data
* Engineer the features
* Summarize conclusions and next steps

## Inputs

* outputs/datasets/cleaned/TrainSetCleaned.csv
* outputs/datasets/cleaned/TestSetCleaned.csv

## Outputs

* Generate a list with variables to engineer
* We will select the transformers based on these lists to add to our ML pipeline


## Additional Comments

* This notebook was written following the guidelines provided in walkthrough project 2: 'Churnometer'.
* This notebook relates to the Data Preparation step of the CRISP-DM methodology.
* This notebook and the ones that follow represent the learning outcome of the Code Institute Predictive Analytics and Machine Learning module.


---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you will need to change the working directory.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

# Load Cleaned Data

### Train Set

In [None]:
import pandas as pd
train_set_path = "outputs/datasets/cleaned/TrainSetCleaned.csv"
TrainSet = pd.read_csv(train_set_path)
TrainSet.head(3)

### Test Set

In [None]:
test_set_path = 'outputs/datasets/cleaned/TestSetCleaned.csv'
TestSet = pd.read_csv(test_set_path)
TestSet.head(3)

# Data Exploration

* We will run the ydata-profiling report (the successor of pandas-profiling) to evaluate potential transformations in the data:

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=TrainSet, minimal=True)
pandas_report.to_notebook_iframe()

The profiling report shows that:
* `diagnosis` is a categorical variable, which we can inspect by clicking on the `More details` option.
* The rest of the variables are all numerical.

# Correlation and PPS Analysis

* We don’t expect major changes compared to the data cleaning notebook since the only data difference is the removal of `id`, so correlation levels and PPS will essentially be the same.
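
As a quick sanity check of that expectation, the Spearman correlation matrix can be recomputed and its strongest pairs listed. The sketch below uses a small synthetic DataFrame in place of `TrainSet` (the values are made up for illustration), and `top_correlations` is a hypothetical helper name, not part of the notebook.

```python
import numpy as np
import pandas as pd


def top_correlations(df, method="spearman", n=3):
    """Return the n strongest absolute pairwise correlations (hypothetical helper)."""
    corr = df.corr(method=method).abs()
    # Keep only the upper triangle so each pair is listed once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)


# Synthetic stand-in for the numerical columns of TrainSet
rng = np.random.default_rng(0)
radius = rng.uniform(5, 25, size=200)
demo = pd.DataFrame({
    "radius_mean": radius,
    "area_mean": np.pi * radius ** 2,   # monotonic function of radius
    "texture_mean": rng.normal(19, 4, size=200),
})
strongest = top_correlations(demo)
print(strongest)
```

Running the same helper on the cleaned Train set and comparing the result with the data cleaning notebook would confirm whether the correlation levels are unchanged.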

# Feature Engineering

* In this part of the notebook we will analyze and transform the variables with some custom functions.
* We will use the function from the feature-engine lesson and customize it to our needs to implement the feature engineering process.

## Custom function

* We will use the following custom function from Code Institute.
* It runs a quick feature engineering analysis on numerical and categorical variables, to help decide which transformation best improves the distribution shape.

In [None]:
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import warnings
from feature_engine import transformation as vt
from feature_engine.outliers import Winsorizer
from feature_engine.encoding import OrdinalEncoder
sns.set(style="whitegrid")
warnings.filterwarnings('ignore')


def FeatureEngineeringAnalysis(df, analysis_type=None):
    """
    - used for quick feature engineering on numerical and categorical variables
    to decide which transformation can better transform the distribution shape
    - Once transformed, use a reporting tool, like ydata-profiling, to evaluate distributions
    """
    check_missing_values(df)
    allowed_types = ['numerical', 'ordinal_encoder', 'outlier_winsorizer']
    check_user_entry_on_analysis_type(analysis_type, allowed_types)
    list_column_transformers = define_list_column_transformers(analysis_type)

    # Loop in each variable and engineer the data according to the analysis type
    df_feat_eng = pd.DataFrame([])
    for column in df.columns:
        # create additional columns (column_method) to apply the methods
        df_feat_eng = pd.concat([df_feat_eng, df[column]], axis=1)
        for method in list_column_transformers:
            df_feat_eng[f"{column}_{method}"] = df[column]

        # Apply transformers in respective column_transformers
        df_feat_eng, list_applied_transformers = apply_transformers(
            analysis_type, df_feat_eng, column)

        # For each variable, assess how the transformations perform
        transformer_evaluation(
            column, list_applied_transformers, analysis_type, df_feat_eng)

    return df_feat_eng


def check_user_entry_on_analysis_type(analysis_type, allowed_types):
    """ Check analysis type """
    if analysis_type is None:
        raise SystemExit(
            f"You should pass analysis_type parameter as one of the following options: {allowed_types}")
    if analysis_type not in allowed_types:
        raise SystemExit(
            f"analysis_type argument should be one of these options: {allowed_types}")


def check_missing_values(df):
    """ Stop early if the DataFrame contains any missing value """
    if df.isna().sum().sum() != 0:
        raise SystemExit(
            "There is a missing value in your dataset. Please handle that before getting into feature engineering.")


def define_list_column_transformers(analysis_type):
    """ Set suffix columns according to analysis_type"""
    if analysis_type == 'numerical':
        list_column_transformers = [
            "log_e", "log_10", "reciprocal", "power", "box_cox", "yeo_johnson"]

    elif analysis_type == 'ordinal_encoder':
        list_column_transformers = ["ordinal_encoder"]

    elif analysis_type == 'outlier_winsorizer':
        list_column_transformers = ['iqr']

    return list_column_transformers


def apply_transformers(analysis_type, df_feat_eng, column):
    for col in df_feat_eng.select_dtypes(include='category').columns:
        df_feat_eng[col] = df_feat_eng[col].astype('object')

    if analysis_type == 'numerical':
        df_feat_eng, list_applied_transformers = FeatEngineering_Numerical(
            df_feat_eng, column)

    elif analysis_type == 'outlier_winsorizer':
        df_feat_eng, list_applied_transformers = FeatEngineering_OutlierWinsorizer(
            df_feat_eng, column)

    elif analysis_type == 'ordinal_encoder':
        df_feat_eng, list_applied_transformers = FeatEngineering_CategoricalEncoder(
            df_feat_eng, column)

    return df_feat_eng, list_applied_transformers


def transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng):
    # For each variable, assess how the transformations perform
    print(f"* Variable Analyzed: {column}")
    print(f"* Applied transformation: {list_applied_transformers} \n")
    for col in [column] + list_applied_transformers:

        if analysis_type != 'ordinal_encoder':
            DiagnosticPlots_Numerical(df_feat_eng, col)

        else:
            if col == column:
                DiagnosticPlots_Categories(df_feat_eng, col)
            else:
                DiagnosticPlots_Numerical(df_feat_eng, col)

        print("\n")


def DiagnosticPlots_Categories(df_feat_eng, col):
    plt.figure(figsize=(4, 3))
    sns.countplot(data=df_feat_eng, x=col, palette=['#432371'],
                order=df_feat_eng[col].value_counts().index)
    plt.xticks(rotation=90)
    plt.suptitle(f"{col}", fontsize=30, y=1.05)
    plt.show()
    print("\n")


def DiagnosticPlots_Numerical(df, variable):
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    sns.histplot(data=df, x=variable, kde=True, element="step", ax=axes[0])
    stats.probplot(df[variable], dist="norm", plot=axes[1])
    sns.boxplot(x=df[variable], ax=axes[2])

    axes[0].set_title('Histogram')
    axes[1].set_title('QQ Plot')
    axes[2].set_title('Boxplot')
    fig.suptitle(f"{variable}", fontsize=30, y=1.05)
    plt.tight_layout()
    plt.show()


def FeatEngineering_CategoricalEncoder(df_feat_eng, column):
    list_methods_worked = []
    try:
        encoder = OrdinalEncoder(encoding_method='arbitrary', 
                                variables=[f"{column}_ordinal_encoder"])
        df_feat_eng = encoder.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_ordinal_encoder")

    except Exception:
        df_feat_eng.drop([f"{column}_ordinal_encoder"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked


def FeatEngineering_OutlierWinsorizer(df_feat_eng, column):
    list_methods_worked = []

    # Winsorizer iqr
    try:
        disc = Winsorizer(
            capping_method='iqr', tail='both', fold=1.5, variables=[f"{column}_iqr"])
        df_feat_eng = disc.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_iqr")
    except Exception:
        df_feat_eng.drop([f"{column}_iqr"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked


def FeatEngineering_Numerical(df_feat_eng, column):
    list_methods_worked = []

    # LogTransformer base e
    try:
        lt = vt.LogTransformer(variables=[f"{column}_log_e"])
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_e")
    except Exception:
        df_feat_eng.drop([f"{column}_log_e"], axis=1, inplace=True)

    # LogTransformer base 10
    try:
        lt = vt.LogTransformer(variables=[f"{column}_log_10"], base='10')
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_10")
    except Exception:
        df_feat_eng.drop([f"{column}_log_10"], axis=1, inplace=True)

    # ReciprocalTransformer
    try:
        rt = vt.ReciprocalTransformer(variables=[f"{column}_reciprocal"])
        df_feat_eng = rt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_reciprocal")
    except Exception:
        df_feat_eng.drop([f"{column}_reciprocal"], axis=1, inplace=True)

    # PowerTransformer
    try:
        pt = vt.PowerTransformer(variables=[f"{column}_power"])
        df_feat_eng = pt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_power")
    except Exception:
        df_feat_eng.drop([f"{column}_power"], axis=1, inplace=True)

    # BoxCoxTransformer
    try:
        bct = vt.BoxCoxTransformer(variables=[f"{column}_box_cox"])
        df_feat_eng = bct.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_box_cox")
    except Exception:
        df_feat_eng.drop([f"{column}_box_cox"], axis=1, inplace=True)

    # YeoJohnsonTransformer
    try:
        yjt = vt.YeoJohnsonTransformer(variables=[f"{column}_yeo_johnson"])
        df_feat_eng = yjt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_yeo_johnson")
    except Exception:
        df_feat_eng.drop([f"{column}_yeo_johnson"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked

### Feature Engineering Spreadsheet Summary

## Dealing with Feature Engineering

### Categorical Encoding - Ordinal: replaces categories with ordinal numbers

The following steps will replace categorical data with ordinal numbers:

* **Step 1: Declare a variable with categorical variable name**

In [None]:
variables_engineering= ['diagnosis']
variables_engineering

* **Step 2: Create a separate DataFrame, with our variable**

In [None]:
df_engineering = TrainSet[variables_engineering].copy()
df_engineering.tail()

* **Step 3: Apply the transformation to the variable**

    - Assess the engineered variable distribution.

In [None]:
%matplotlib inline
df_engineering = FeatureEngineeringAnalysis(df=df_engineering, analysis_type='ordinal_encoder')

* Nothing appears to be out of place; the variable is converted to numbers. Thus, we can now proceed with applying the transformations.

* **Step 4: Apply the selected transformation to the Train and Test set**

In [None]:
encoder = OrdinalEncoder(encoding_method='arbitrary', variables = variables_engineering)
TrainSet = encoder.fit_transform(TrainSet)
TestSet = encoder.transform(TestSet)

print("* Categorical encoding - ordinal transformation done!")

In [None]:
TestSet.head()
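
For intuition, arbitrary ordinal encoding amounts to giving each category an integer code and reusing the mapping learned on the Train set for the Test set. The sketch below reproduces that idea with plain pandas on made-up data. It illustrates the concept only and is not feature-engine's implementation; the labels `B` and `M` are assumed values of `diagnosis`.

```python
import pandas as pd

# Made-up stand-ins for the `diagnosis` column (labels are an assumption)
train = pd.DataFrame({"diagnosis": ["B", "M", "B", "B", "M"]})
test = pd.DataFrame({"diagnosis": ["M", "B"]})

# Learn the mapping on the Train set only: one integer per category,
# numbered in order of first appearance
mapping = {cat: code for code, cat in enumerate(train["diagnosis"].unique())}

# Apply the same mapping to both sets
train_encoded = train["diagnosis"].map(mapping)
test_encoded = test["diagnosis"].map(mapping)

print(mapping)
print(test_encoded.tolist())
```

Because the mapping is learned once and then reused, a category receives the same code in both sets.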

### Numerical Transformation

* **Step 1: Declare variables with the numerical variable names**

    - We declare several variable lists because applying multiple transformations to all 30 variables at once produces a large amount of output, which might slow down or crash the Jupyter notebook.

In [None]:
var_one = TrainSet.columns[1:6].tolist()
var_two = TrainSet.columns[6:11].tolist()
var_three = TrainSet.columns[11:16].tolist()
var_four = TrainSet.columns[16:21].tolist()
var_five = TrainSet.columns[21:26].tolist()
var_six = TrainSet.columns[26:31].tolist()
var_one, var_two, var_three, var_four, var_five, var_six
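
The six slices above could also be produced with a small helper, which avoids hard-coding the indices. This is only a sketch: `chunk_list` is a hypothetical name, and the column list is a placeholder for `TrainSet.columns[1:].tolist()`.

```python
def chunk_list(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Placeholder names standing in for the 30 numerical columns of TrainSet
columns = [f"feature_{i}" for i in range(1, 31)]
chunks = chunk_list(columns, size=5)

print(len(chunks))   # number of chunks
print(chunks[0])     # first chunk of names
```

Each chunk can then be passed to the analysis function in turn, in the same way as `var_one` to `var_six`.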

* **Step 2: Create a separate DataFrame, with our variables**

In [None]:
df_engineering_one = TrainSet[var_one].copy()
df_engineering_two = TrainSet[var_two].copy()
df_engineering_three = TrainSet[var_three].copy()
df_engineering_four = TrainSet[var_four].copy()
df_engineering_five = TrainSet[var_five].copy()
df_engineering_six = TrainSet[var_six].copy()

df_engineering_one.head(3)

* **Step 3: Apply the transformation to the variables.**
    - We will apply the transformations in six parts to avoid overloading the Jupyter notebook.
    - Assess the engineered variables' distributions. We need this step to find the most suitable method for each variable.

#### First assessment

In [None]:
df_engineering_one = FeatureEngineeringAnalysis(df=df_engineering_one, analysis_type='numerical')

#### Second assessment

In [None]:
df_engineering_two = FeatureEngineeringAnalysis(df=df_engineering_two, analysis_type='numerical')

#### Third assessment

In [None]:
df_engineering_three = FeatureEngineeringAnalysis(df=df_engineering_three, analysis_type='numerical')

#### Fourth assessment

In [None]:
df_engineering_four = FeatureEngineeringAnalysis(df=df_engineering_four, analysis_type='numerical')

#### Fifth assessment

In [None]:
df_engineering_five = FeatureEngineeringAnalysis(df=df_engineering_five, analysis_type='numerical')

#### Sixth assessment

In [None]:
df_engineering_six = FeatureEngineeringAnalysis(df=df_engineering_six, analysis_type='numerical')

### Plot Assessment

The following transformers were applied to the 30 variables:
1. Log Transformer with Base e
2. Log Transformer with Base 10
3. Reciprocal Transformer
4. Power Transformer
5. BoxCox transformer
6. Yeo-Johnson Transformer

From the histogram, probability plot and boxplot of each numerical variable, we deduce that:
* `concavity_mean`, `concave points_mean` and `concavity_worst` are closer to a normal distribution with the Power Transformer, although only slightly closer than with the Yeo-Johnson Transformer.
* The remaining 27 variables show a much more normal distribution shape and fewer outliers with the Yeo-Johnson Transformer.
* The BoxCox Transformer performed slightly worse than Yeo-Johnson, although much better than the Log_10, Log_e, Reciprocal and Power transformers across most of the variables.
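
The visual assessment can be complemented with a numeric one, for example by comparing skewness before and after each transformation (values closer to 0 indicate a more symmetric distribution). The sketch below uses SciPy on synthetic right-skewed data rather than the notebook's variables. The square root stands in for the Power Transformer, assuming an exponent of 0.5.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed, strictly positive sample (not the notebook's data)
rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=0.8, size=1000)

candidates = {
    "original": x,
    "power (sqrt)": np.sqrt(x),
    "log_e": np.log(x),
    "yeo_johnson": stats.yeojohnson(x)[0],  # returns (transformed data, fitted lambda)
}

# Skewness near 0 suggests a more symmetric distribution
skewness = {name: float(stats.skew(values)) for name, values in candidates.items()}
for name, value in skewness.items():
    print(f"{name:>13}: skew = {value:.3f}")
```

Skewness is only one summary statistic, so it is best read alongside the plots rather than instead of them.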

* **Step 4: Apply the transformations to the Train and Test set**

* We select the variables for Power transformation

In [None]:
power_vars = ['concavity_mean', 'concave points_mean', 'concavity_worst']
power_vars

* We select the variables for Yeo-Johnson transformation

In [None]:
yj_vars = TrainSet.drop(columns=['diagnosis', 'concavity_mean', 'concave points_mean', 'concavity_worst']).columns.tolist()
yj_vars

* We create a pipeline that applies the selected transformations, fit it on the Train set, and use the fitted pipeline to transform both the Train and Test sets

In [None]:
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ("power_transform", vt.PowerTransformer(variables=power_vars)),
    ("yj_transform", vt.YeoJohnsonTransformer(variables=yj_vars))
])

train_set = pipeline.fit_transform(TrainSet)
test_set = pipeline.transform(TestSet)

print("* The numerical transformation has been completed!")
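
Note that the pipeline learns its parameters from the Train set only and then reuses them on the Test set, which keeps information from the Test set out of the fitting step. The same idea can be sketched with SciPy's Yeo-Johnson function on synthetic data. This illustrates the principle and is not a replacement for the pipeline above.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for one numerical column of the Train and Test sets
rng = np.random.default_rng(7)
train_col = rng.lognormal(size=400)
test_col = rng.lognormal(size=100)

# "fit": estimate lambda on the Train data only
train_transformed, fitted_lambda = stats.yeojohnson(train_col)

# "transform": reuse the Train lambda on the Test data, without re-estimating it
test_transformed = stats.yeojohnson(test_col, lmbda=fitted_lambda)

print(f"lambda estimated on Train: {fitted_lambda:.3f}")
print(test_transformed[:3])
```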

In [None]:
# Inspect the transformed Train set returned by the pipeline
train_set.head(3)

### SmartCorrelatedSelection Variables

* **Step 1: Select Variables**

* We will use all the variables for `SmartCorrelatedSelection` method except `diagnosis` since that is our target variable.

* **Step 2: Create a separate DataFrame, with our variables**

In [None]:
TrainSet.head(3)

In [None]:
df_engineering = TrainSet.drop(columns=['diagnosis']).copy()
df_engineering.head(3)

* **Step 3: Create engineered variables applying the transformation**

In [None]:
from feature_engine.selection import SmartCorrelatedSelection
corr_sel = SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.8, selection_method="variance")

corr_sel.fit_transform(df_engineering)
corr_sel.correlated_feature_sets_

* We will remove the surplus correlated features, since they add largely redundant information to the model.

In [None]:
corr_sel.features_to_drop_

* `SmartCorrelatedSelection` selected the following variables to drop for our ML cases due to their high correlation (we set our threshold to 0.8):

  * ['area_mean',
 'perimeter_worst',
 'perimeter_mean',
 'radius_worst',
 'radius_mean',
 'perimeter_se',
 'radius_se',
 'texture_mean',
 'compactness_worst',
 'concavity_mean',
 'concave points_worst',
 'compactness_mean',
 'concave points_mean',
 'compactness_se']
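
To see roughly how such a drop list arises, the sketch below groups features whose absolute Spearman correlation exceeds the threshold and keeps the highest-variance member of each group. It is a simplified illustration on synthetic data, not feature-engine's actual algorithm, and `correlated_features_to_drop` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def correlated_features_to_drop(df, threshold=0.8):
    """Greedy sketch: keep the highest-variance feature of each correlated group."""
    corr = df.corr(method="spearman").abs()
    # Visit features from highest to lowest variance so the first one seen is kept
    ordered = df.var().sort_values(ascending=False).index
    kept, dropped = [], []
    for feature in ordered:
        if any(corr.loc[feature, other] > threshold for other in kept):
            dropped.append(feature)
        else:
            kept.append(feature)
    return dropped


# Synthetic data: area and perimeter are monotonic functions of radius
rng = np.random.default_rng(1)
radius = rng.uniform(5, 25, size=300)
demo = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius,
    "area": np.pi * radius ** 2,
    "texture": rng.normal(19, 4, size=300),
})
to_drop = correlated_features_to_drop(demo)
print(to_drop)
```

In this toy example the three size-related columns carry the same ranking information, so only the one with the largest variance is kept.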

---

# Conclusion

The list below shows the transformations needed for feature engineering. We will be fitting them into our ML pipeline.

### Feature Engineering Transformers:

* **Ordinal categorical encoding:** ['diagnosis']

* **Power Transformer:** ['concavity_mean', 'concave points_mean', 'concavity_worst']

* **Yeo-Johnson Transformer:** ['radius_mean',
 'texture_mean',
 'perimeter_mean',
 'area_mean',
 'smoothness_mean',
 'compactness_mean',
 'symmetry_mean',
 'fractal_dimension_mean',
 'radius_se',
 'texture_se',
 'perimeter_se',
 'area_se',
 'smoothness_se',
 'compactness_se',
 'concavity_se',
 'concave points_se',
 'symmetry_se',
 'fractal_dimension_se',
 'radius_worst',
 'texture_worst',
 'perimeter_worst',
 'area_worst',
 'smoothness_worst',
 'compactness_worst',
 'concave points_worst',
 'symmetry_worst',
 'fractal_dimension_worst']

* **Smart Correlated Selection (features to drop):** ['area_mean',
 'perimeter_worst',
 'perimeter_mean',
 'radius_worst',
 'radius_mean',
 'perimeter_se',
 'radius_se',
 'texture_mean',
 'compactness_worst',
 'concavity_mean',
 'concave points_worst',
 'compactness_mean',
 'concave points_mean',
 'compactness_se']
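
Before moving these lists into the ML pipeline, it can help to check that they are consistent with each other. The sketch below copies the lists from this conclusion into plain Python and verifies that the Power and Yeo-Johnson lists do not overlap, that together they cover 30 features, and that every feature marked for dropping is a known feature. The variable names `all_numeric` and `remaining` are hypothetical.

```python
power_vars = ["concavity_mean", "concave points_mean", "concavity_worst"]

yj_vars = [
    "radius_mean", "texture_mean", "perimeter_mean", "area_mean",
    "smoothness_mean", "compactness_mean", "symmetry_mean",
    "fractal_dimension_mean", "radius_se", "texture_se", "perimeter_se",
    "area_se", "smoothness_se", "compactness_se", "concavity_se",
    "concave points_se", "symmetry_se", "fractal_dimension_se",
    "radius_worst", "texture_worst", "perimeter_worst", "area_worst",
    "smoothness_worst", "compactness_worst", "concave points_worst",
    "symmetry_worst", "fractal_dimension_worst",
]

features_to_drop = [
    "area_mean", "perimeter_worst", "perimeter_mean", "radius_worst",
    "radius_mean", "perimeter_se", "radius_se", "texture_mean",
    "compactness_worst", "concavity_mean", "concave points_worst",
    "compactness_mean", "concave points_mean", "compactness_se",
]

all_numeric = power_vars + yj_vars

# Each feature should receive exactly one numerical transformation
assert not set(power_vars) & set(yj_vars)
assert len(all_numeric) == len(set(all_numeric)) == 30

# Every feature flagged as correlated must be one of the known features
assert set(features_to_drop) <= set(all_numeric)

# Features that would survive the correlation-based selection
remaining = [name for name in all_numeric if name not in features_to_drop]
print(f"{len(remaining)} features remain after dropping correlated ones")
```

A check like this catches naming mismatches, such as `concave_points_mean` versus `concave points_mean`, before they cause errors in the pipeline.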