# Preprocessing for numerical features

In this notebook, we introduce the following new aspects of working with numerical data:

* an example of preprocessing, namely **scaling numerical variables**;
* using a scikit-learn **pipeline** to chain preprocessing and model training;
* assessing the generalization performance of our model via **cross-validation** instead of a single train-test split.

## Data Preparation

We start as usual by importing the required libraries and loading the data set.

In [2]:
import pandas as pd
from sklearn import set_config
set_config(display='diagram')

df = pd.read_csv("data/adult-census.csv")
df.sample(3)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
21697,74,Private,211075,HS-grad,9,Married-civ-spouse,Other-service,Husband,White,Male,0,0,30,United-States,<=50K
24696,20,Private,216436,Bachelors,13,Never-married,Sales,Other-relative,Black,Female,0,0,30,United-States,<=50K
18517,60,Private,207665,HS-grad,9,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,40,United-States,>50K


We then drop the features that don't contain valuable information.

In [3]:
# drop columns that are not needed
df = df.drop(columns=['fnlwgt', 'education'])

# split target and features
target = df['class']
# we are considering only numerical features
data = df.select_dtypes(include='number')

Now we can start preparing our model.  
The first step is to divide our data into training and test sets:

In [5]:
from sklearn.model_selection import train_test_split
data_train, data_test, target_train, target_test = train_test_split(data, target, test_size=.25,
                                                                   random_state=23)

## Model Fitting with Preprocessing

A range of preprocessing algorithms in scikit-learn allow us to transform the input data before training a model. In our case, we will standardize the data and then train a new logistic regression model on the standardized version of the dataset.

Let's start by printing some statistics about the training data:

In [6]:
data_train.describe()

Unnamed: 0,age,education-num,capital-gain,capital-loss,hours-per-week
count,36631.0,36631.0,36631.0,36631.0,36631.0
mean,38.689525,10.083645,1115.03164,84.717753,40.414976
std,13.735819,2.572581,7611.916806,397.034139,12.443421
min,17.0,1.0,0.0,0.0,1.0
25%,28.0,9.0,0.0,0.0,40.0
50%,37.0,10.0,0.0,0.0,40.0
75%,48.0,12.0,0.0,0.0,45.0
max,90.0,16.0,99999.0,4356.0,99.0


From the table above, we can see that the features span very different ranges. Some algorithms make assumptions about the feature distributions, and normalizing the features usually helps to meet these assumptions.

There are two main reasons to scale the features:

1. Models that rely on the distance between a pair of samples, for instance k-nearest neighbors, should be trained on normalized features so that each feature contributes approximately equally to the distance computations.
2. Many models such as logistic regression use an iterative numerical solver (typically gradient-based) to find their optimal parameters. Such a solver usually converges faster when the features are scaled.

Whether or not a machine learning model requires scaling the features depends on the model family. Linear models such as logistic regression generally benefit from scaling the features, while other models such as decision trees do not need such preprocessing (but will not suffer from it).
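To make the first reason concrete, here is a minimal sketch with three hypothetical samples described by `age` and `capital-gain`. The sample values are invented for illustration, and the standard deviations are rounded from the table above. Before scaling, the distance is dominated by `capital-gain`; after dividing by the standard deviations, the 40-year age gap counts for more.

```python
import numpy as np

# Three hypothetical samples described by (age, capital-gain).
a = np.array([25.0, 0.0])
b = np.array([65.0, 0.0])     # differs from `a` by 40 years of age
c = np.array([25.0, 5000.0])  # differs from `a` by 5000 in capital gain

# Standard deviations rounded from the `describe()` table (assumption:
# rounding is good enough for this illustration).
std = np.array([13.7, 7611.9])

# Euclidean distances on the raw features.
raw_ab = np.linalg.norm(a - b)
raw_ac = np.linalg.norm(a - c)

# Euclidean distances after dividing each feature by its standard deviation
# (the means cancel out when taking differences).
scaled_ab = np.linalg.norm((a - b) / std)
scaled_ac = np.linalg.norm((a - c) / std)

print(f"raw:    d(a, b) = {raw_ab:.1f}, d(a, c) = {raw_ac:.1f}")
print(f"scaled: d(a, b) = {scaled_ab:.2f}, d(a, c) = {scaled_ac:.2f}")
```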

We will show how to apply such normalization using a scikit-learn transformer called `StandardScaler`. This transformer shifts and scales each feature individually so that each has zero mean and unit standard deviation.

We will investigate different steps used in scikit-learn to achieve such a transformation of the data.

First, one needs to call the method `fit` in order to learn the scaling from the data.

In [7]:
from sklearn.preprocessing import StandardScaler

# the standard scaler is usually called scaler
scaler = StandardScaler()
# We fit only the training data (without target)
scaler.fit(data_train)

The `fit` method for transformers is similar to the `fit` method for predictors. The main difference is that the former takes a single argument (the data matrix), whereas the latter takes two arguments (the data matrix and the target).

![image.png](attachment:5680ab76-c106-4e4c-a1e3-61436af828b6.png)

In this case, the algorithm needs to compute the mean and standard deviation of each feature and store them in NumPy arrays. Here, these statistics are the model states.

We can inspect the computed means and standard deviations:

In [8]:
scaler.mean_

array([  38.68952527,   10.083645  , 1115.03163987,   84.71775272,
         40.41497639])

In [9]:
scaler.scale_

array([1.37356316e+01, 2.57254607e+00, 7.61181291e+03, 3.97028720e+02,
       1.24432516e+01])

<div class="alert alert-block alert-info">
<b>scikit-learn convention:</b> <br>
     If an attribute is learned from the data, its name ends with an underscore (i.e. _), as in `mean_` and `scale_` for the `StandardScaler`.</div>
     
Scaling the data is applied to each feature individually. For each feature, we subtract its mean and divide by its standard deviation.

Once we have called the `fit` method, we can perform data transformation by calling the method `transform`.

In [10]:
data_train_scaled = scaler.transform(data_train)
data_train_scaled

array([[ 0.53222705,  1.13364539,  1.82728721, -0.21337941, -0.03334951],
       [-0.4142165 , -0.42123444, -0.146487  , -0.21337941, -0.03334951],
       [ 1.04185051, -0.42123444, -0.146487  , -0.21337941, -0.03334951],
       ...,
       [ 0.1682103 ,  1.13364539, -0.146487  , -0.21337941,  2.53832556],
       [ 0.24101365,  0.74492544, -0.146487  , -0.21337941,  0.28810987],
       [ 0.7506371 ,  1.52236535,  0.81228591, -0.21337941, -0.03334951]])

Let's illustrate the internal mechanism of the `transform` method and put it into perspective with what we already saw with predictors.

![image.png](attachment:b88c28fe-2721-4e42-bb17-996b4d13a8a4.png)

The `transform` method for transformers is similar to the `predict` method for predictors. It uses a predefined function, called a **transformation function**, and uses the model states and the input data. However, instead of outputting predictions, the job of the transform method is to output a transformed version of the input data.
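The mechanism described above can be sketched with a small hand-written class. `MinimalScaler` is a hypothetical, simplified stand-in written for this illustration, not scikit-learn's implementation: `fit` stores the model states and `transform` applies the transformation function to new input. It ignores edge cases such as constant features, which would cause a division by zero here.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


class MinimalScaler:
    """Hypothetical, simplified stand-in for StandardScaler."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Model states: per-feature mean and (population) standard deviation.
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        return self

    def transform(self, X):
        # Transformation function: combines the stored states and the input.
        return (np.asarray(X, dtype=float) - self.mean_) / self.scale_


# Toy data invented for the comparison.
X = np.array([[20.0, 0.0], [40.0, 1000.0], [60.0, 5000.0]])
ours = MinimalScaler().fit(X).transform(X)
reference = StandardScaler().fit(X).transform(X)
print(np.allclose(ours, reference))
```

On this toy array the two results should agree up to floating-point round-off.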

Finally, the method `fit_transform` is a shorthand that calls `fit` and then `transform`:

![image.png](attachment:990b0ab3-4085-4abd-a2f4-0880e962afea.png)

In [11]:
data_train_scaled == scaler.fit_transform(data_train)

array([[ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       ...,
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]])
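The element-wise comparison returns one boolean per entry. As a sketch on toy data (the arrays below are invented for illustration), `np.allclose` collapses such a comparison into a single value and tolerates floating-point round-off. The same sketch shows the usual pattern for held-out data: the scaler is fitted on the training set only, and `transform` alone is applied to the test set so that the training statistics are reused.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data standing in for a training and a test set.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[2.0], [4.0]])

scaler = StandardScaler()
# Learn mean and standard deviation on the training data, then scale it.
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the training statistics on the test data: no new call to `fit`.
X_test_scaled = scaler.transform(X_test)

# Single boolean instead of an array of booleans.
same = np.allclose(X_train_scaled, scaler.transform(X_train))
print(same)
print(X_test_scaled.ravel())
```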

Now we need to convert the result from a NumPy array to a DataFrame with the proper column names:

In [12]:
data_train_scaled = pd.DataFrame(data_train_scaled, 
                                 columns=data_train.columns)

In [13]:
data_train_scaled.describe()

Unnamed: 0,age,education-num,capital-gain,capital-loss,hours-per-week
count,36631.0,36631.0,36631.0,36631.0,36631.0
mean,2.560444e-16,1.87572e-16,3.830968e-18,2.366471e-17,-2.168619e-16
std,1.000014,1.000014,1.000014,1.000014,1.000014
min,-1.57907,-3.530994,-0.146487,-0.2133794,-3.167579
25%,-0.7782333,-0.4212344,-0.146487,-0.2133794,-0.03334951
50%,-0.1230031,-0.03251448,-0.146487,-0.2133794,-0.03334951
75%,0.6778338,0.7449254,-0.146487,-0.2133794,0.3684747
max,3.735574,2.299805,12.99086,10.75812,4.708176
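The standard deviations in this table are 1.000014 rather than exactly 1. This is because `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `describe` reports the sample standard deviation (`ddof=1`). The ratio between the two is `sqrt(n / (n - 1))`, which the sketch below evaluates for our number of training rows and then reproduces on a small toy column invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Number of training rows reported by `describe()` above.
n = 36631
ratio = np.sqrt(n / (n - 1))
print(f"{ratio:.6f}")

# Same effect on a tiny toy column, where the gap is easier to see.
values = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})
scaled = pd.DataFrame(StandardScaler().fit_transform(values), columns=["x"])

pop_std = scaled["x"].std(ddof=0)  # convention used by StandardScaler
sample_std = scaled["x"].std()     # pandas default (ddof=1), used by describe()
print(pop_std, sample_std)
```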


We can easily combine these sequential operations with a scikit-learn `Pipeline`, which chains together operations and is used like any other classifier or regressor. The helper function `make_pipeline` creates a `Pipeline`: it takes as arguments the successive transformations to perform, followed by the classifier or regressor model.

In [14]:
import time
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(StandardScaler(), LogisticRegression())

In [15]:
model

The `make_pipeline` function did not require us to give a name to each step. Names are assigned automatically from the class names: a `StandardScaler` becomes a step named `standardscaler` in the resulting pipeline. We can check the name of each step of our model:

In [16]:
model.named_steps

{'standardscaler': StandardScaler(),
 'logisticregression': LogisticRegression()}
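If explicit step names are preferred, the `Pipeline` class can be instantiated directly with a list of `(name, estimator)` pairs. The names `"scaler"` and `"classifier"` and the variable `named_model` below are arbitrary choices made for this sketch.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same two steps as with `make_pipeline`, but with names chosen by hand.
named_model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# Steps can be listed and retrieved by name.
step_names = list(named_model.named_steps)
print(step_names)
print(named_model["classifier"])
```

The rest of this notebook keeps using the `model` created with `make_pipeline`.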

This predictive pipeline exposes the same methods as the final predictor: `fit` and `predict` (and additionally `predict_proba`, `decision_function` or `score`).

In [18]:
start = time.time()
model.fit(data_train, target_train)
elapsed_time = time.time() - start
print(elapsed_time)

0.16499996185302734


We can present the internal mechanism of a pipeline when calling `fit` by the following diagram:

![image.png](attachment:969aedcc-8d2c-4ecd-a2c4-ccb393d5976c.png)

When calling `model.fit`, the method `fit_transform` of each underlying transformer (here a single transformer) in the pipeline is called to:

* learn its internal model states;
* transform the training data.

Finally, the preprocessed data are provided to train the predictor.

To predict the targets given a test set, one uses the `predict` method.

In [19]:
predicted_target = model.predict(data_test)
predicted_target[:5]

array([' <=50K', ' <=50K', ' <=50K', ' <=50K', ' <=50K'], dtype=object)

Let's show the underlying mechanism:

![image.png](attachment:3cca2a11-2b33-4713-859c-1c9c0ee22600.png)
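The two mechanisms can be reproduced by hand. The sketch below uses a synthetic dataset from `make_classification` (an assumption made so that the example is self-contained; it is not the census data) and checks that chaining the calls manually gives the same predictions as the pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the adult census dataset).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pipeline version: a single `fit` and a single `predict`.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

# Manual version of `fit`: fit_transform the scaler, then fit the predictor.
scaler = StandardScaler()
clf = LogisticRegression()
clf.fit(scaler.fit_transform(X_train), y_train)

# Manual version of `predict`: transform only, then predict.
manual_pred = clf.predict(scaler.transform(X_test))

print(np.array_equal(manual_pred, pipe_pred))
```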

Let's check the computational and generalization performance of such a predictive pipeline:

In [20]:
model_name = model.__class__.__name__
score = model.score(data_test, target_test)
print(f"The accuracy using a {model_name} is {score:.3f} "
      f"with a fitting time of {elapsed_time:.3f} seconds "
      f"in {model[-1].n_iter_[0]} iterations")

The accuracy using a Pipeline is 0.821 with a fitting time of 0.165 seconds in 13 iterations


Now let's compare this predictive model, trained on scaled features, with the predictive model used in the previous notebook, which did not scale the features.

In [21]:
model = LogisticRegression()
# refresh the name so the report below refers to the unscaled model
model_name = model.__class__.__name__
start = time.time()
model.fit(data_train, target_train)
elapsed_time = time.time() - start

In [22]:
score = model.score(data_test, target_test)
print(f"The accuracy using a {model_name} is {score:.3f} "
      f"with a fitting time of {elapsed_time:.3f} seconds "
      f"in {model.n_iter_[0]} iterations")

The accuracy using a LogisticRegression is 0.821 with a fitting time of 0.377 seconds in 76 iterations


We see that scaling the data before training the logistic regression was beneficial in terms of computational performance: the number of iterations dropped from 76 to 13, and the fitting time from about 0.38 to about 0.17 seconds (timings depend on the machine). The generalization performance did not change, since both models converged.
