# Feature Engineering and Selection

Main steps:
- We only select numerical feature: we drop `name` and `company`
- We use `median` strategy for missing values in `horsepower` since it is tail heavy.
- We standardize numerical features. Not all algorithms need scaling to perform well. For example, linear regression (when not trained with gradient descent) and tree-based algorithms don't suffer from features not being on the same scale and centred around zero. We will, however, scale features in case we want to use algorithms other than the latter.
- We one-hot-encode `region` and drop the column corresponding to `Europe` to limit colinearity in the dataset.

In [1]:
# use black formatter
%load_ext nb_black

<IPython.core.display.Javascript object>

In [2]:
%load_ext autoreload
%autoreload 2

<IPython.core.display.Javascript object>

#### Feature selection and engineering pipeline 

In [6]:
import pandas as pd
import numpy as np
import os

from src.utils import data_path, split_features_target


from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Continous features
CONTINUOUS_FEATURES = ["displacement", "horsepower", "weight", "acceleration"]
# Categorical features
ORDINAL_FEATURES = ["cylinders", "year"]
NOMINAL_FEATURES = ["region"]


def make_final_transformation_pipe():

    # Build transformation pipelines adapted to feature types
    cont_pipeline = Pipeline(
        [
            ("imputer_cont", SimpleImputer(strategy="median")),
            ("std_scaler_cont", StandardScaler()),
        ]
    )

    ord_pipeline = Pipeline(
        [
            ("imputer_ord", SimpleImputer(strategy="most_frequent")),
            ("std_scaler_ord", StandardScaler()),
        ]
    )

    full_pipeline = ColumnTransformer(
        [
            ("cont", cont_pipeline, CONTINUOUS_FEATURES),
            ("ord", ord_pipeline, ORDINAL_FEATURES),
            ("nom", OneHotEncoder(), NOMINAL_FEATURES),
        ]
    )

    return full_pipeline


def get_interim_data(dataset):
    if dataset not in ["train", "test"]:
        raise Exception("dataset type argument is train or test)")
    filename = f"{dataset}_cleaned.pkl"
    filepath = data_path("interim", filename)
    return pd.read_pickle(filepath)


def make_final_sets():
    df_train = get_interim_data("train")
    df_test = get_interim_data("test")
    X_train, y_train = split_features_target(df_train, "mpg")
    X_test, y_test = split_features_target(df_test, "mpg")

    full_pipeline = make_final_transformation_pipe()
    X_train_processed_values = full_pipeline.fit_transform(X_train)
    X_test_processed_values = full_pipeline.transform(X_test)
    # Add columns names to build the processed dataframe
    region_ohe_features = list(
        full_pipeline.named_transformers_["nom"].get_feature_names()
    )
    column_names = CONTINUOUS_FEATURES + ORDINAL_FEATURES + region_ohe_features
    X_train_processed = pd.DataFrame(X_train_processed_values, columns=column_names)
    X_test_processed = pd.DataFrame(X_test_processed_values, columns=column_names)

    # Drop one of the ohe features to limit correlations in the data set
    for df in (X_train_processed, X_test_processed):
        df.drop("x0_EUROPE", axis=1, inplace=True)

    # Save the data
    df_train_processed = X_train_processed.join(y_train)
    df_train_processed.to_csv(data_path("processed", "train_processed.pkl"))

    df_test_processsed = X_test_processed.join(y_test)
    df_test_processsed.to_csv(data_path("processed", "test_processed.pkl"))

    return df_train_processed, df_test_processsed

<IPython.core.display.Javascript object>

In [7]:
df_train, df_test = make_final_sets()

/Users/Corentin/Documents/ml_projects/auto-mpg/src
/Users/Corentin/Documents/ml_projects/auto-mpg/src
/Users/Corentin/Documents/ml_projects/auto-mpg/src
/Users/Corentin/Documents/ml_projects/auto-mpg/src


<IPython.core.display.Javascript object>

We can ignore this warning. It is expected that Standardizing ordinal features converted them to float. 

#### Check training and test sets 

In [33]:
df_train.head()

Unnamed: 0,displacement,horsepower,weight,acceleration,cylinders,year,x0_ASIA,x0_USA,mpg
0,1.090196,1.266232,0.552826,-1.319334,1.527188,-1.696667,0.0,1.0,
1,-0.922996,-0.407925,-0.999667,-0.413182,-0.850515,-1.696667,1.0,0.0,15.0
2,-0.98135,-0.947975,-1.124772,0.927922,-0.850515,1.638975,1.0,0.0,18.0
3,-0.98135,-1.163996,-1.392854,0.275493,-0.850515,0.527094,1.0,0.0,16.0
4,-0.747936,-0.218907,-0.327675,-0.231952,-0.850515,-0.306816,0.0,0.0,17.0


<IPython.core.display.Javascript object>

In [35]:
df_train.shape

(318, 9)

<IPython.core.display.Javascript object>

In [36]:
df_test.head()

Unnamed: 0,displacement,horsepower,weight,acceleration,cylinders,year,x0_ASIA,x0_USA,mpg
0,-0.98135,-1.353013,-1.398812,0.637953,-0.850515,-0.028846,1.0,0.0,18.0
1,-0.699308,-0.650948,-0.409887,1.072906,-0.850515,1.638975,0.0,1.0,
2,0.389956,-0.083895,-0.399163,-0.956873,0.338337,-1.418697,0.0,1.0,
3,1.226354,1.266232,1.156905,-0.884381,1.527188,-0.028846,0.0,1.0,
4,1.226354,1.266232,1.510773,-0.413182,1.527188,-0.862757,0.0,1.0,


<IPython.core.display.Javascript object>

In [37]:
df_test.shape

(80, 9)

<IPython.core.display.Javascript object>

Next, we move this code to the module `feature_engineering.py` in the source folder.  