# Feature Engineering and Selection

Main steps:
- We keep the numerical features plus the categorical `region`, and drop `name` and `company`.
- We impute missing values in `horsepower` with the `median` strategy, since its distribution is tail-heavy and the median is less sensitive to extreme values than the mean.
- We standardize the numerical features. Not all algorithms need scaling to perform well: linear regression (when not trained with gradient descent) and tree-based algorithms are not affected by features that sit on different scales or are not centred around zero. We scale the features anyway, in case we later want to use algorithms that are sensitive to scale.
- We one-hot-encode `region` and drop the column corresponding to `Europe` to limit collinearity in the dataset.
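
To illustrate the last step, here is a minimal sketch of why all three one-hot columns together are redundant once a model includes an intercept, and how dropping the `EUROPE` column removes that redundancy. The toy labels and the variable names are assumptions made for illustration; this is not the notebook's data.

```python
import numpy as np

# Toy region labels (illustrative only); the names mirror the dataset's categories.
regions = np.array(["USA", "EUROPE", "ASIA", "USA", "ASIA", "EUROPE"])
categories = np.array(["ASIA", "EUROPE", "USA"])

# Full one-hot encoding: one indicator column per category.
one_hot = (regions[:, None] == categories[None, :]).astype(float)

# With an intercept column, the three indicators always sum to the intercept,
# so the design matrix has 4 columns but only rank 3.
intercept = np.ones((len(regions), 1))
full_design = np.hstack([intercept, one_hot])
rank_full = np.linalg.matrix_rank(full_design)

# Dropping the EUROPE indicator removes the exact linear dependence.
keep = categories != "EUROPE"
reduced_design = np.hstack([intercept, one_hot[:, keep]])
rank_reduced = np.linalg.matrix_rank(reduced_design)

print(full_design.shape[1], rank_full)
print(reduced_design.shape[1], rank_reduced)
```

After the drop, a row with zeros in both remaining indicator columns stands for a European car, so no information is lost.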

In [1]:
# use black formatter
%load_ext nb_black


In [2]:
%load_ext autoreload
%autoreload 2


#### Feature selection and engineering pipeline 

In [3]:
import os
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from src.utils import data_path, save_data

# Continuous features
CONTINUOUS_FEATURES = ["displacement", "horsepower", "weight", "acceleration"]
# Categorical features
ORDINAL_FEATURES = ["cylinders", "year"]
NOMINAL_FEATURES = ["region"]


def make_final_transformation_pipe():

    # Build transformation pipelines adapted to feature types
    cont_pipeline = Pipeline(
        [
            ("imputer_cont", SimpleImputer(strategy="median")),
            ("std_scaler_cont", StandardScaler()),
        ]
    )

    ord_pipeline = Pipeline(
        [
            ("imputer_ord", SimpleImputer(strategy="most_frequent")),
            ("std_scaler_ord", StandardScaler()),
        ]
    )

    full_pipeline = ColumnTransformer(
        [
            ("cont", cont_pipeline, CONTINUOUS_FEATURES),
            ("ord", ord_pipeline, ORDINAL_FEATURES),
            ("nom", OneHotEncoder(), NOMINAL_FEATURES),
        ]
    )

    return full_pipeline


# def get_interim_data(dataset):
#     if dataset not in ["train", "test"]:
#         raise Exception("dataset type argument is train or test")
#     filename = f"{dataset}_cleaned.pkl"
#     filepath = data_path("interim", filename)
#     return pd.read_pickle(filepath)


def get_cleaned_train_test_df():
    clean_data_path = data_path("interim", "data_cleaned.pkl")
    df = pd.read_pickle(clean_data_path)
    X = df.drop("mpg", axis=1)
    y = df["mpg"]
    return train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)


def make_final_sets():
    X_train, X_test, y_train, y_test = get_cleaned_train_test_df()

    full_pipeline = make_final_transformation_pipe()
    X_train_processed_values = full_pipeline.fit_transform(X_train)
    X_test_processed_values = full_pipeline.transform(X_test)

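    # Add column names to build the processed dataframe.
    # Note: newer scikit-learn releases replace get_feature_names() with
    # get_feature_names_out(), which may generate different column names
    # than the "x0_" prefix relied on below.
    region_ohe_features = list(
        full_pipeline.named_transformers_["nom"].get_feature_names()
    )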
    column_names = CONTINUOUS_FEATURES + ORDINAL_FEATURES + region_ohe_features
    X_train_processed = pd.DataFrame(X_train_processed_values, columns=column_names)
    X_test_processed = pd.DataFrame(X_test_processed_values, columns=column_names)

    # Drop one of the ohe features to limit correlations in the data set
    for df in (X_train_processed, X_test_processed):
        df.drop("x0_EUROPE", axis=1, inplace=True)

    # Save the data
    df_train_processed = X_train_processed.join(y_train.reset_index(drop=True))
    save_data(df_train_processed, "processed", "train_processed.pkl")

    df_test_processed = X_test_processed.join(y_test.reset_index(drop=True))
    save_data(df_test_processed, "processed", "test_processed.pkl")

    return df_train_processed, df_test_processed

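
As a possible alternative to dropping the `x0_EUROPE` column by hand, reasonably recent scikit-learn releases let `OneHotEncoder` drop a chosen category itself through its `drop` parameter. The sketch below uses a toy frame (only the `region` column and its three labels mirror the dataset). It is not the approach taken in this notebook, and swapping it in could change the generated column names.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame (illustrative only); the labels mirror the dataset's regions.
toy = pd.DataFrame({"region": ["USA", "EUROPE", "ASIA", "USA"]})

# `drop` takes one category to remove for each encoded feature.
encoder = OneHotEncoder(drop=["EUROPE"])
encoded = encoder.fit_transform(toy[["region"]]).toarray()

# Categories are sorted, so the remaining columns are ASIA then USA.
print(encoder.categories_[0])
print(encoded)
```

The EUROPE row is encoded as all zeros, which matches what the manual drop in `make_final_sets` produces.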

In [4]:
df_train, df_test = make_final_sets()




We can ignore the warning raised by this cell: standardizing the ordinal features is expected to convert them from integers to floats.
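
The conversion is easy to reproduce in isolation. The sketch below uses a toy integer column that stands in for `cylinders` (the values are made up) and shows that `StandardScaler` returns floats for integer input. The warning text itself is not reproduced in this notebook, so this only demonstrates the dtype change it presumably reports.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Integer-valued toy column, similar in spirit to `cylinders`.
cylinders = np.array([[4], [6], [8], [4]])

scaled = StandardScaler().fit_transform(cylinders)

print(cylinders.dtype)  # an integer dtype on input
print(scaled.dtype)     # a float dtype after standardization
```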

#### Check training set 

In [5]:
df_train.head()

| | displacement | horsepower | weight | acceleration | cylinders | year | x0_ASIA | x0_USA | mpg |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.090196 | 1.266232 | 0.552826 | -1.319334 | 1.527188 | -1.696667 | 0.0 | 1.0 | 16.0 |
| 1 | -0.922996 | -0.407925 | -0.999667 | -0.413182 | -0.850515 | -1.696667 | 1.0 | 0.0 | 27.0 |
| 2 | -0.98135 | -0.947975 | -1.124772 | 0.927922 | -0.850515 | 1.638975 | 1.0 | 0.0 | 37.0 |
| 3 | -0.98135 | -1.163996 | -1.392854 | 0.275493 | -0.850515 | 0.527094 | 1.0 | 0.0 | 36.1 |
| 4 | -0.747936 | -0.218907 | -0.327675 | -0.231952 | -0.850515 | -0.306816 | 0.0 | 0.0 | 23.0 |



In [6]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 318 entries, 0 to 317
Data columns (total 9 columns):
displacement    318 non-null float64
horsepower      318 non-null float64
weight          318 non-null float64
acceleration    318 non-null float64
cylinders       318 non-null float64
year            318 non-null float64
x0_ASIA         318 non-null float64
x0_USA          318 non-null float64
mpg             318 non-null float64
dtypes: float64(9)
memory usage: 22.5 KB



In [7]:
df_train.shape

(318, 9)


#### Check test set

In [8]:
df_test.head()

| | displacement | horsepower | weight | acceleration | cylinders | year | x0_ASIA | x0_USA | mpg |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.98135 | -1.353013 | -1.398812 | 0.637953 | -0.850515 | -0.028846 | 1.0 | 0.0 | 33.0 |
| 1 | -0.699308 | -0.650948 | -0.409887 | 1.072906 | -0.850515 | 1.638975 | 0.0 | 1.0 | 28.0 |
| 2 | 0.389956 | -0.083895 | -0.399163 | -0.956873 | 0.338337 | -1.418697 | 0.0 | 1.0 | 19.0 |
| 3 | 1.226354 | 1.266232 | 1.156905 | -0.884381 | 1.527188 | -0.028846 | 0.0 | 1.0 | 13.0 |
| 4 | 1.226354 | 1.266232 | 1.510773 | -0.413182 | 1.527188 | -0.862757 | 0.0 | 1.0 | 14.0 |



In [9]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80 entries, 0 to 79
Data columns (total 9 columns):
displacement    80 non-null float64
horsepower      80 non-null float64
weight          80 non-null float64
acceleration    80 non-null float64
cylinders       80 non-null float64
year            80 non-null float64
x0_ASIA         80 non-null float64
x0_USA          80 non-null float64
mpg             80 non-null float64
dtypes: float64(9)
memory usage: 5.8 KB



In [10]:
df_test.shape

(80, 9)
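
One further sanity check, sketched here on synthetic data rather than the real dataset: because the pipeline is fitted on the training split only, the standardized training columns should have mean 0 and standard deviation 1 up to rounding error, while the test columns should only be close to those values. The feature values below are randomly generated and all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one continuous feature (not the real dataset).
rng = np.random.default_rng(42)
X = rng.normal(loc=3000.0, scale=800.0, size=(398, 1))

# Same split proportions as in the notebook: 318 training rows, 80 test rows.
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Fit on the training split only, then reuse the fitted statistics on test.
scaler = StandardScaler().fit(X_train)
train_scaled = scaler.transform(X_train)
test_scaled = scaler.transform(X_test)

# Training data is centred and scaled exactly (up to rounding error).
print(train_scaled.mean(), train_scaled.std())
# Test data reuses the training statistics, so it is only approximately centred.
print(test_scaled.mean(), test_scaled.std())
```

Applying the same check to the continuous columns of `df_train` and `df_test` would be a quick way to confirm that no test-set statistics leaked into the fitted pipeline.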


Next, we move this code to the module `feature_engineering.py` in the source folder.  