# Feature Engineering

+ Feature engineering transforms the data so that its information content is easily exposed to the model.
+ This statement can mean many things and depends heavily on what exactly "the model" is.
+ As we have seen, we combine many tools to manipulate data. Thus far in this course, we have encountered pandas, Dask, and sklearn, but there are many more (PySpark, SQL, DAX, M, R, etc.).
+ It is important to discuss which tools are the right ones, specifically in the context of data leakage.

## Transform using pandas/Dask/SQL or sklearn?

+ Depending on the perspective, the answer could be neither, pandas, or sklearn:

    - Neither: 
        * Most joins and filtering should be done closer to the source, using a database or parquet/Dask operations. 
        * Map-Reduce and Group-by-Aggregate ("data warehousing") operations.
        * Indexing and reshuffling.
    - Pandas, Dask, or PySpark: 
        * Renaming tasks.
        * Use Python libraries like pandas, Dask, or PySpark to add contemporaneous features, manipulate time series (for example, adding lags), and run parallel computations (using Dask or PySpark).
        * Do not use these libraries for sample-dependent features, that is, features whose parameters (a mean, a set of categories, a vocabulary) are estimated from the data sample.
    - sklearn or PyTorch:
        * Use Python libraries like sklearn or PyTorch to add sample-dependent features such as scaling and normalization, one-hot encoding, tokenization, and vectorization.
        * Model-dependent transformations: PCA, embeddings, iterative/kNN imputation, etc.
+ Decisions must be guided by optimization criteria (time and resources) while avoiding data leakage.
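
The leakage concern in the list above can be illustrated with a minimal sketch. It assumes a synthetic one-column feature (not the course data) and shows a sample-dependent transform, `StandardScaler`, estimating its mean and variance from the training split only and then reusing those estimates on the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature, used only for illustration (an assumption, not course data).
rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 1))

# Split first, so the test rows never influence the fitted parameters.
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Fit on the training split only; transform both splits with the same parameters.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("mean learned from train:", scaler.mean_)
print("mean of scaled test split:", X_test_scaled.mean(axis=0))
```

Fitting the scaler on the full data before splitting would let test-set statistics shape the training features. Placing the transform inside a `Pipeline` that is cross-validated avoids this, because the transform is refit on each training fold.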

## Example Transforms in sklearn

The list below is taken from [scikit-learn's documentation](https://scikit-learn.org/stable/modules/preprocessing.html), which also includes convenience interfaces for the classes below.

Work with categorical variables:

+ `preprocessing.Binarizer(*[, threshold, copy])`: Binarize data (set feature values to 0 or 1) according to a threshold.
+ `preprocessing.KBinsDiscretizer([n_bins, ...])`:  Bin continuous data into intervals.
+ `preprocessing.LabelBinarizer(*[, neg_label, ...])`: Binarize labels in a one-vs-all fashion.
+ `preprocessing.LabelEncoder()`: Encode target labels with value between 0 and n_classes-1.
+ `preprocessing.MultiLabelBinarizer(*[, ...])`:  Transform between iterable of iterables and a multilabel format.
+ `preprocessing.OneHotEncoder(*[, categories, ...])`: Encode categorical features as a one-hot numeric array.
+ `preprocessing.OrdinalEncoder(*[, ...])`: Encode categorical features as an integer array.

Scale and normalize:

+ `preprocessing.StandardScaler(*[, copy, ...])`: Standardize features by removing the mean and scaling to unit variance.
+ `preprocessing.MaxAbsScaler(*[, copy])`: Scale each feature by its maximum absolute value.
+ `preprocessing.MinMaxScaler([feature_range, ...])`: Transform features by scaling each feature to a given range.
+ `preprocessing.Normalizer([norm, copy])`:  Normalize samples individually to unit norm.
+ `preprocessing.RobustScaler(*[, ...])`: Scale features using statistics that are robust to outliers.


Nonlinear transforms:

+ `preprocessing.FunctionTransformer([func, ...])`: Constructs a transformer from an arbitrary callable.
+ `preprocessing.KernelCenterer()`: Center an arbitrary kernel matrix.
+ `preprocessing.PolynomialFeatures([degree, ...])`: Generate polynomial and interaction features.
+ `preprocessing.PowerTransformer([method, ...])`: Apply a power transform featurewise to make data more Gaussian-like.
+ `preprocessing.QuantileTransformer(*[, ...])`: Transform features using quantiles information.
+ `preprocessing.SplineTransformer([n_knots, ...])`: Generate univariate B-spline bases for features.
+ `preprocessing.TargetEncoder([categories, ...])`: Target Encoder for regression and classification targets.
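
As a small illustration of how these classes are used, the sketch below applies three of them to toy arrays. The data and variable names are made up for the example; every transformer follows the same `fit`/`transform` interface.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, MinMaxScaler, OneHotEncoder

# Toy inputs (illustrative only): one categorical column and one numeric column.
colors = np.array([["red"], ["green"], ["red"]])
values = np.array([[0.0], [5.0], [10.0]])

# One-hot encode the categories; the result is sparse, so convert it to a dense array.
one_hot = OneHotEncoder().fit_transform(colors).toarray()

# Rescale the numeric column to the default [0, 1] range.
scaled = MinMaxScaler().fit_transform(values)

# Flag values strictly greater than the threshold.
flags = Binarizer(threshold=4.0).fit_transform(values)

print(one_hot)
print(scaled.ravel())
print(flags.ravel())
```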


## What are we doing?

<div>
<img src="./images/04_column_transform_1.png" width="75%">
</div>

### The Objectives

Build a pipeline that: 

+ Adds indicators: 

    - A subject-matter expert (SME) indicated that a debt ratio above 100% is too high.
    - Missing-value indicators for `monthly_income` and `num_dependents`.

+ Imputes missing values, where required.
+ Standardizes variables.
+ Evaluates whether a transform (Yeo-Johnson or Box-Cox) of selected variables (`debt_ratio`, `monthly_income`, and `revolving_unsecured_line_utilization`) is beneficial.

Feature selection:

+ We are looking for informative features: their contribution to prediction is valuable.
+ We prefer parsimonious models.
+ We want to retain evidence of our work and afford reproducibility. 

# Data Source

+ For this example, we will use [Give Me Some Credit from Kaggle](https://www.kaggle.com/c/GiveMeSomeCredit/data), a widely referenced example. 
+ To run the examples below, download the data set and extract `cs-training.csv` to `../05_src/data/credit/`.
 

## Our data




In [2]:
# Load environment variables
%load_ext dotenv
%dotenv 
# Add src to path
import os
import sys
sys.path.append(os.getenv('SRC_DIR'))

# Standard libraries
import pandas as pd
import numpy as np


# Load data
ft_file = os.getenv("CREDIT_DATA")
df_raw = pd.read_csv(ft_file)



In [3]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 12 columns):
 #   Column                                Non-Null Count   Dtype  
---  ------                                --------------   -----  
 0   Unnamed: 0                            150000 non-null  int64  
 1   SeriousDlqin2yrs                      150000 non-null  int64  
 2   RevolvingUtilizationOfUnsecuredLines  150000 non-null  float64
 3   age                                   150000 non-null  int64  
 4   NumberOfTime30-59DaysPastDueNotWorse  150000 non-null  int64  
 5   DebtRatio                             150000 non-null  float64
 6   MonthlyIncome                         120269 non-null  float64
 7   NumberOfOpenCreditLinesAndLoans       150000 non-null  int64  
 8   NumberOfTimes90DaysLate               150000 non-null  int64  
 9   NumberRealEstateLoansOrLines          150000 non-null  int64  
 10  NumberOfTime60-89DaysPastDueNotWorse  150000 non-null  int64  
 11  

In [4]:
df = df_raw.drop(columns = ["Unnamed: 0"]).rename(
    columns = {
        'SeriousDlqin2yrs': 'delinquency',
        'RevolvingUtilizationOfUnsecuredLines': 'revolving_unsecured_line_utilization', 
        'age': 'age',
        'NumberOfTime30-59DaysPastDueNotWorse': 'num_30_59_days_late', 
        'DebtRatio': 'debt_ratio', 
        'MonthlyIncome': 'monthly_income',
        'NumberOfOpenCreditLinesAndLoans': 'num_open_credit_loans', 
        'NumberOfTimes90DaysLate':  'num_90_days_late',
        'NumberRealEstateLoansOrLines': 'num_real_estate_loans', 
        'NumberOfTime60-89DaysPastDueNotWorse': 'num_60_89_days_late',
        'NumberOfDependents': 'num_dependents'
    }
).assign(
    high_debt_ratio = lambda x: (x['debt_ratio'] > 1)*1,
    missing_monthly_income = lambda x: x['monthly_income'].isna()*1,
    missing_num_dependents = lambda x: x['num_dependents'].isna()*1, 
)

## Manual Solution

+ To get deeper insights into the task, first approach it manually.

In [5]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.naive_bayes import GaussianNB

num_cols = ['revolving_unsecured_line_utilization', 'age',
       'num_30_59_days_late', 'debt_ratio', 'monthly_income',
       'num_open_credit_loans', 'num_90_days_late', 'num_real_estate_loans',
       'num_60_89_days_late', 'num_dependents', 
       # Although expressed as numbers, these columns are boolean:
       # 'high_debt_ratio',
       # 'missing_monthly_income', 
       # 'missing_num_dependents' 
       ]

pipe_num_simple = Pipeline([
    ('imputer', SimpleImputer(strategy = 'median')),
    ('standardizer', StandardScaler())
])

ctransform_simple= ColumnTransformer([
    ('numeric_simple', pipe_num_simple, num_cols),
], remainder='passthrough')

pipe_simple = Pipeline([
    ('preprocess', ctransform_simple),
    ('model', GaussianNB())
])
pipe_simple


In [14]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", GaussianNB())
])

In [16]:
from sklearn.compose import ColumnTransformer
num_cols= ['revolving_unsecured_line_utilization','age']
ctransform = ColumnTransformer([
    ("a_name",StandardScaler(),num_cols)
],remainder= "drop")
ctransform.fit_transform(df)

array([[-0.02115001, -0.49385982],
       [-0.02038516, -0.83234222],
       [-0.02158222, -0.96773518],
       ...,
       [-0.02323239,  0.38619443],
       [-0.02421753, -1.50930703],
       [-0.02081306,  0.79237332]])

In [17]:
from sklearn.compose import ColumnTransformer
num_cols= ['revolving_unsecured_line_utilization','age']
ctransform = ColumnTransformer([
    ("a_name",StandardScaler(),num_cols)
],remainder= "passthrough")
ctransform.fit_transform(df)

array([[-0.02115001, -0.49385982,  1.        , ...,  0.        ,
         0.        ,  0.        ],
       [-0.02038516, -0.83234222,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [-0.02158222, -0.96773518,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [-0.02323239,  0.38619443,  0.        , ...,  1.        ,
         1.        ,  0.        ],
       [-0.02421753, -1.50930703,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [-0.02081306,  0.79237332,  0.        , ...,  0.        ,
         0.        ,  0.        ]])

## Cross-validation of simple pipeline

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 14 columns):
 #   Column                                Non-Null Count   Dtype  
---  ------                                --------------   -----  
 0   delinquency                           150000 non-null  int64  
 1   revolving_unsecured_line_utilization  150000 non-null  float64
 2   age                                   150000 non-null  int64  
 3   num_30_59_days_late                   150000 non-null  int64  
 4   debt_ratio                            150000 non-null  float64
 5   monthly_income                        120269 non-null  float64
 6   num_open_credit_loans                 150000 non-null  int64  
 7   num_90_days_late                      150000 non-null  int64  
 8   num_real_estate_loans                 150000 non-null  int64  
 9   num_60_89_days_late                   150000 non-null  int64  
 10  num_dependents                        146076 non-null  float64
 11  

In [20]:
X = df.drop(columns = 'delinquency')
Y = df['delinquency']

scoring = ['neg_log_loss', 'roc_auc', 'f1', 'accuracy', 'precision', 'recall']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 42)



In [22]:
X

| | revolving_unsecured_line_utilization | age | num_30_59_days_late | debt_ratio | monthly_income | num_open_credit_loans | num_90_days_late | num_real_estate_loans | num_60_89_days_late | num_dependents | high_debt_ratio | missing_monthly_income | missing_num_dependents |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.766127 | 45 | 2 | 0.802982 | 9120.0 | 13 | 0 | 6 | 0 | 2.0 | 0 | 0 | 0 |
| 1 | 0.957151 | 40 | 0 | 0.121876 | 2600.0 | 4 | 0 | 0 | 0 | 1.0 | 0 | 0 | 0 |
| 2 | 0.658180 | 38 | 1 | 0.085113 | 3042.0 | 2 | 1 | 0 | 0 | 0.0 | 0 | 0 | 0 |
| 3 | 0.233810 | 30 | 0 | 0.036050 | 3300.0 | 5 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0 |
| 4 | 0.907239 | 49 | 1 | 0.024926 | 63588.0 | 7 | 0 | 1 | 0 | 0.0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 149995 | 0.040674 | 74 | 0 | 0.225131 | 2100.0 | 4 | 0 | 1 | 0 | 0.0 | 0 | 0 | 0 |
| 149996 | 0.299745 | 44 | 0 | 0.716562 | 5584.0 | 4 | 0 | 1 | 0 | 2.0 | 0 | 0 | 0 |
| 149997 | 0.246044 | 58 | 0 | 3870.000000 | NaN | 18 | 0 | 1 | 0 | 0.0 | 1 | 1 | 0 |
| 149998 | 0.000000 | 30 | 0 | 0.000000 | 5716.0 | 4 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0 |


In [23]:
res_simple_dict = cross_validate(pipe_simple, X_train, Y_train, cv = 5, scoring = scoring)
res_simple = pd.DataFrame(res_simple_dict).assign(experiment = 1)
res_simple


| | fit_time | score_time | test_neg_log_loss | test_roc_auc | test_f1 | test_accuracy | test_precision | test_recall | experiment |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.33276 | 0.103442 | -0.379323 | 0.682019 | 0.038892 | 0.932042 | 0.39759 | 0.020446 | 1 |
| 1 | 0.272029 | 0.100692 | -0.360565 | 0.689999 | 0.06422 | 0.932 | 0.430769 | 0.034696 | 1 |
| 2 | 0.245021 | 0.099352 | -0.35492 | 0.691366 | 0.04325 | 0.931792 | 0.381443 | 0.022924 | 1 |
| 3 | 0.233692 | 0.104528 | -0.353023 | 0.702334 | 0.055109 | 0.931417 | 0.375 | 0.02974 | 1 |
| 4 | 0.29809 | 0.10348 | -0.361277 | 0.691373 | 0.045322 | 0.931542 | 0.364486 | 0.024164 | 1 |


In [24]:
res_simple_dict

{'fit_time': array([0.3327601 , 0.27202916, 0.2450211 , 0.23369193, 0.29809046]),
 'score_time': array([0.10344172, 0.10069227, 0.09935164, 0.10452843, 0.10347986]),
 'test_neg_log_loss': array([-0.37932251, -0.36056473, -0.3549203 , -0.35302294, -0.36127748]),
 'test_roc_auc': array([0.68201926, 0.68999898, 0.69136577, 0.70233422, 0.691373  ]),
 'test_f1': array([0.03889216, 0.06422018, 0.04324956, 0.05510907, 0.04532249]),
 'test_accuracy': array([0.93204167, 0.932     , 0.93179167, 0.93141667, 0.93154167]),
 'test_precision': array([0.39759036, 0.43076923, 0.3814433 , 0.375     , 0.36448598]),
 'test_recall': array([0.0204461 , 0.03469641, 0.02292441, 0.02973978, 0.02416357])}

In [25]:
pd.DataFrame(res_simple_dict).assign( experiment = 1)

| | fit_time | score_time | test_neg_log_loss | test_roc_auc | test_f1 | test_accuracy | test_precision | test_recall | experiment |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.33276 | 0.103442 | -0.379323 | 0.682019 | 0.038892 | 0.932042 | 0.39759 | 0.020446 | 1 |
| 1 | 0.272029 | 0.100692 | -0.360565 | 0.689999 | 0.06422 | 0.932 | 0.430769 | 0.034696 | 1 |
| 2 | 0.245021 | 0.099352 | -0.35492 | 0.691366 | 0.04325 | 0.931792 | 0.381443 | 0.022924 | 1 |
| 3 | 0.233692 | 0.104528 | -0.353023 | 0.702334 | 0.055109 | 0.931417 | 0.375 | 0.02974 | 1 |
| 4 | 0.29809 | 0.10348 | -0.361277 | 0.691373 | 0.045322 | 0.931542 | 0.364486 | 0.024164 | 1 |


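On average, we obtain a log-loss of about 0.362. Note that sklearn reports it with the sign flipped, as `test_neg_log_loss`, so that higher scores are always better.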

In [26]:
res_simple.mean()

fit_time             0.276319
score_time           0.102299
test_neg_log_loss   -0.361822
test_roc_auc         0.691418
test_f1              0.049359
test_accuracy        0.931758
test_precision       0.389858
test_recall          0.026394
experiment           1.000000
dtype: float64
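
To make the sign convention concrete, the sketch below defines binary log-loss by hand and recovers the average from the per-fold `test_neg_log_loss` values printed above. The helper `log_loss` is written for illustration only and is not the sklearn implementation.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary log-loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# An uninformative prediction of 0.5 for every row costs log(2) per row.
print(round(log_loss([1, 0], [0.5, 0.5]), 4))

# Per-fold scores copied from the cross-validation output above.
fold_scores = [-0.37932251, -0.36056473, -0.3549203, -0.35302294, -0.36127748]

# The scorer is the negative log-loss, so flip the sign to read it as a loss.
mean_log_loss = -sum(fold_scores) / len(fold_scores)
print(round(mean_log_loss, 3))  # 0.362
```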

## Alternative Pipeline

+ The pipeline below is more complex:

    - Treat selected numericals using [Yeo-Johnson transformation](https://feature-engine.trainindata.com/en/latest/user_guide/transformation/YeoJohnsonTransformer.html).
    - Treat other numericals with scaling only.
    - Do not treat booleans.
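
Before building the pipeline, it may help to see what the transformation does in isolation. The sketch below assumes a synthetic, right-skewed feature (log-normal draws, not the credit data) and a hand-written `skewness` helper, and compares skewness before and after `PowerTransformer(method="yeo-johnson")`. Unlike Box-Cox, Yeo-Johnson also accepts zero and negative values.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

def skewness(x):
    """Sample skewness: third central moment over the cubed standard deviation."""
    centered = np.asarray(x, dtype=float) - np.mean(x)
    return np.mean(centered**3) / np.mean(centered**2) ** 1.5

# Synthetic right-skewed feature (illustrative assumption).
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

# The exponent is estimated from the data; the output is standardized by default.
yj = PowerTransformer(method="yeo-johnson")
skewed_yj = yj.fit_transform(skewed)

print(f"skewness before: {skewness(skewed):.2f}")
print(f"skewness after:  {skewness(skewed_yj):.2f}")
print("estimated lambda:", yj.lambdas_)
```

Because the exponent is estimated from the sample, this is a sample-dependent transform and belongs inside the pipeline rather than in the pandas preparation step.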

In [27]:
num_cols = ['age',
       'num_30_59_days_late', 'num_open_credit_loans', 'num_90_days_late', 'num_real_estate_loans',
       'num_60_89_days_late', 'num_dependents', 
       ]

num_cols_transform = ['revolving_unsecured_line_utilization', 'debt_ratio', 'monthly_income',]

pipe_num_simple = Pipeline([
    ('imputer', SimpleImputer(strategy = 'median')),
    ('standardizer', StandardScaler())
])

pipe_num_yj = Pipeline([
    ('imputer', SimpleImputer(strategy = 'median')),
    ('standardizer', StandardScaler()),
    ('transform', PowerTransformer(method='yeo-johnson'))
])

ctransform_yj = ColumnTransformer([
    ('numeric_std', pipe_num_simple, num_cols),
    ('numeric_yj', pipe_num_yj, num_cols_transform),
], remainder='passthrough')

pipe_yj = Pipeline([
    ('preprocess', ctransform_yj),
    ('clf', GaussianNB())
])
pipe_yj

In [11]:
res_yj_dict = cross_validate(pipe_yj, X_train, Y_train, cv = 5, scoring = scoring)
res_yj = pd.DataFrame(res_yj_dict).assign(experiment = 2)
res_yj

| | fit_time | score_time | test_neg_log_loss | test_roc_auc | test_f1 | test_accuracy | test_precision | test_recall | experiment |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.90189 | 0.118112 | -0.442556 | 0.788478 | 0.044802 | 0.930708 | 0.307087 | 0.024164 | 2 |
| 1 | 0.792456 | 0.114147 | -0.435881 | 0.774738 | 0.0625 | 0.93125 | 0.376712 | 0.034077 | 2 |
| 2 | 0.826098 | 0.107431 | -0.442304 | 0.78824 | 0.047045 | 0.930792 | 0.317829 | 0.025403 | 2 |
| 3 | 0.675088 | 0.10542 | -0.446322 | 0.793749 | 0.052392 | 0.930667 | 0.323944 | 0.028501 | 2 |
| 4 | 0.818696 | 0.12527 | -0.448388 | 0.780519 | 0.044957 | 0.930958 | 0.322314 | 0.024164 | 2 |


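The average log-loss rose to about 0.443 (from about 0.362), so by this criterion the additional transformation is not beneficial. Note, however, that the average ROC AUC improved from about 0.691 to about 0.785, so the conclusion depends on the metric we optimize.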

In [12]:
res_yj.mean()

fit_time             0.802846
score_time           0.114076
test_neg_log_loss   -0.443090
test_roc_auc         0.785145
test_f1              0.050339
test_accuracy        0.930875
test_precision       0.329577
test_recall          0.027261
experiment           2.000000
dtype: float64
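
The `experiment` column makes it easy to compare runs side by side. The sketch below re-types two of the metrics from the fold-level outputs above into small frames (the names `folds_simple`, `folds_yj`, and `summary` are illustrative) and averages them per experiment.

```python
import pandas as pd

# Fold-level values copied from the two cross-validation outputs above.
folds_simple = pd.DataFrame({
    "test_neg_log_loss": [-0.379323, -0.360565, -0.354920, -0.353023, -0.361277],
    "test_roc_auc": [0.682019, 0.689999, 0.691366, 0.702334, 0.691373],
}).assign(experiment=1)

folds_yj = pd.DataFrame({
    "test_neg_log_loss": [-0.442556, -0.435881, -0.442304, -0.446322, -0.448388],
    "test_roc_auc": [0.788478, 0.774738, 0.788240, 0.793749, 0.780519],
}).assign(experiment=2)

# Stack both experiments and average each metric within an experiment.
summary = (
    pd.concat([folds_simple, folds_yj], ignore_index=True)
    .groupby("experiment")[["test_neg_log_loss", "test_roc_auc"]]
    .mean()
)
print(summary.round(3))
```

In the notebook itself, the same comparison can be made directly with `pd.concat([res_simple, res_yj])`.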

# Reflection

+ We are currently evaluating two feature engineering procedures using the same classifier. 

    - However, feature engineering is classifier-dependent: each classifier is a specialized tool to learn a certain type of hypothesis. 
    - Different classifiers will benefit from different types of engineered features (see, for example, [Kuhn and Silge's recommendations on TMWR.org](https://www.tmwr.org/pre-proc-table)).

+ We are producing data from our experiments.

    - The data that we produced is more or less structured: we are using standard performance metrics, for instance.
    - Each preprocessing pipeline will be different and may accept different configuration parameters.
    - Likewise, classifiers will tend to have different configuration parameters. 
    
+ We modify code to produce experiments:

    - Our experiment results will be a function of our algorithm's logic, its implementation (code), and our data.
    - Code tracking is done with Git.
    - Data tracking is in development.

**It is generally a good idea to use software for experiment tracking once you move out of the Proof of Concept stage.** Some solutions include:

- [MLflow](https://mlflow.org/).
- [Weights & Biases](https://wandb.ai/site).
- [Sacred](https://sacred.readthedocs.io/en/stable/).

# Sacred

+ Sacred is a Python package that automates tasks related to experiment tracking:

    - Keep track of experiment parameters.
    - Run experiments using different settings.
    - Save configurations for individual experiment runs in files or databases.
    - Reproduce results.

+ A few features that may be useful:

    - Automatically set and store random seeds.
    - Keep track of code and artifacts associated with an experiment: record the GitHub repo, hash, and code of the experiment.
    - Store experiment run times and system characteristics.
    - Work with different backends ("[Observers](https://sacred.readthedocs.io/en/stable/observers.html)"): SQL, Mongo, S3, files, Telegram, Slack, and event messages, among others.

An important note from [Sacred's documentation](https://sacred.readthedocs.io/en/stable/experiment.html):

> By default, Sacred experiments will fail if run in an interactive environment like a REPL or a Jupyter Notebook. This is an intended security measure since in these environments reproducibility cannot be ensured.

The safeguard can be relaxed, but production systems generally do not involve Jupyter notebooks.

## Experiments in Sacred


+ Experiments in Sacred are organized in modules (.py files):

    - The main file is called the main *Experiment* file.
    - Auxiliary files are called *Ingredients*.

+ In the main experiment file, we will instantiate an `Experiment` object:

```
from sacred import Experiment
ex = Experiment("Experiment Name")
```

+ The `Experiment` object will allow us to use two function decorators:

    - `@ex.config`: decorates the configuration function. All variables declared in this function are observed and made available to all captured functions.
    - `@ex.capture`: decorates one or more captured functions, which can access all variables defined in the `@ex.config` function.

+ The `Ingredient` objects also have `config` and `capture` decorators that can be used within their own modules. 

+ `SqlObserver` is the connector between Sacred and a SQL database. It uses SQLAlchemy as its underlying library, so URL strings are formatted accordingly.

    - SQLAlchemy database URLs are documented [here](https://docs.sqlalchemy.org/en/20/core/engines.html#database-urls).
    - Common database URLs are:

        * Postgres: `postgresql://user:password@localhost/mydatabase`
        * MySQL: `mysql://user:password@localhost/foo`
        * SQLite: `sqlite:///foo.db`

    - The database URL for the Docker-based implementation in this repo is in the `../05_src/.env` file, under "DB_URL".
    - Note that we are passing usernames and passwords through these strings. Although this may be acceptable for a development environment, usernames and passwords should never be published on GitHub for production. Use a secrets manager to pass credentials as environment variables in production.
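
One way to keep credentials out of the code is to assemble the URL from environment variables at run time. The sketch below is an assumption for illustration: the variable names `DB_USER`, `DB_PASSWORD`, `DB_HOST`, and `DB_NAME` and the helper `build_db_url` are hypothetical (this repo stores the complete string under `DB_URL`), and the values shown are placeholders.

```python
import os
from urllib.parse import quote_plus

def build_db_url(env=os.environ):
    """Assemble a Postgres URL from (hypothetical) environment variables."""
    user = env["DB_USER"]
    # Escape characters such as '@' or '/' that would otherwise break the URL.
    password = quote_plus(env["DB_PASSWORD"])
    host = env.get("DB_HOST", "localhost")
    name = env["DB_NAME"]
    return f"postgresql://{user}:{password}@{host}/{name}"

# Placeholder values standing in for what a secrets manager would inject.
demo_env = {"DB_USER": "user", "DB_PASSWORD": "p@ss/word", "DB_NAME": "mydatabase"}
print(build_db_url(demo_env))
```

In a real deployment, the function would be called without arguments so that it reads from `os.environ`.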

## Our First Experiment

Continuing with our example, the following setup will track an experiment to compare the two feature engineering pipelines:

+ DB Backend:

    - We assume a database backend, which can be set up using Docker:

        * In a terminal, navigate to `./05_src/db/`.
        * If the containers are not up, use: `docker compose up -d`
        * If you would like to stop the containers use: `docker compose down`.
        * Bring the containers down and remove all volumes with: `docker compose down -v` (this will erase all your data in the DB).
    
    - Notice that all relevant environment variables for the DB server are in `./05_src/db/env`.

+ The Experiment file:

    - The main file for this experiment is `./05_src/credit_experiment_nb.py`
    - Lines 25-26 instantiate the experiment and import all the ingredients: `data_ingredient`, `preproc_ingredient`, and `db_ingredient`.
    - We use our standard logger (line 24), and we share it with the experiment tracker (line 28).
    - Configuration of the experiment is defined in lines 31-35:

        * Line 31 has the decorator `@ex.config`.
        * We define the preproc_pipe, number of folds, and scoring metrics as the experiment's configuration. 
        
    - Captured functions:
        
        * Notice that the function `get_pipe()` in line 40 requires `preproc_pipe`. When the function is called, Sacred will ensure that the relevant value of `preproc_pipe` is used in place of this input parameter.

    - Main function:

        * The main function is identified by `@ex.automain` or `@ex.main`.
        * In addition, lines 79-80 add commands to modify experiments from [the CLI](https://sacred.readthedocs.io/en/stable/command_line.html).
        * This is the function that is run from the CLI: `python credit_experiment_nb.py`
    

+ Ingredients:

    - We create other modules to organize our ideas and code.
    - The preproc ingredient (`./05_src/credit_preproc_ingredient.py`) encapsulates the preprocessing logic: selecting the right pipeline, for example.
    - The data ingredient (`./05_src/credit_data_ingredient.py`) loads the data and performs the pandas-based manipulations. 
    - The db ingredient (`./05_src/credit_db_ingredient.py`) keeps all functions related to db interactions and authentication.


After ensuring your docker containers are up, run the experiment with 

```
cd ./05_src/
python credit_experiment_nb.py
```

After running the experiment, take a look at your database:

+ Navigate to [http://localhost:5051](http://localhost:5051).
+ Log in using posgres/HumanAfterAll.
+ Connect to the database (its name is `db` in the local network) and query the `runs` and `model_cv_results` tables.