# Proper Data Preprocessing Steps

In this notebook we take a quick aside on how we should preprocess both our training and testing data. We'll review some more advanced pipeline techniques that were touched upon in Regression Notebook 7 as well.

## What You'll Accomplish

We'll:
<ul>
    <li>emphasize the importance of fitting transformers to the training data not the test data,</li>
    <li>show why pipelines really are useful,</li>
    <li>give a quick review of more advanced pipeline techniques that are also covered in Regression Notebook 7.</li>
</ul>

In [1]:
## For data handling
import pandas as pd
import numpy as np

## For plotting
import matplotlib.pyplot as plt
import seaborn as sns

## This sets the plot style
## to have a grid on a white background
sns.set_style("whitegrid")

## fit, transform, fit_transform, and train vs test data

Many people are confused about how to properly preprocess data. For example, here is an image with over ten questions about the proper application of `StandardScaler` alone.
<img src="train_test_question.png" style="width:70%;"></img>

While this may be review for many of you, it is such an important concept that it bears repeating now that we've got a large array of preprocessing techniques.

Let's start with a simple `StandardScaler` example.

Recall that `StandardScaler` takes in a data set, subtracts off the arithmetic mean, and divides the result by the standard deviation of the data (computed with `ddof=0`, as `np.std` does by default); see <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html">https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html</a>, i.e.
$$
\frac{\bullet - \overline{X}}{s_X}
$$
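
As a quick sanity check of this formula, the sketch below standardizes a small array by hand and compares the result with `StandardScaler`. The array `x_demo` is made up purely for illustration and is not part of the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small illustrative array (made up for this check)
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 10.0]).reshape(-1, 1)

# Standardize by hand: subtract the mean, divide by the standard deviation
# (np.std defaults to ddof=0, which is what StandardScaler uses)
by_hand = (x_demo - x_demo.mean()) / x_demo.std()

# Standardize with sklearn
by_sklearn = StandardScaler().fit_transform(x_demo)

# The two results should agree up to floating point error
print(np.allclose(by_hand, by_sklearn))
```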



We now load some random data.

In [2]:
X = 10*np.random.randn(1000) + 20

In [3]:
# let's make a train test split
from sklearn.model_selection import train_test_split

In [4]:
X_train, X_test = train_test_split(X, test_size = .25, random_state = 440, shuffle=True)

Now, when you scale the training set, you subtract off the training mean and divide by the training standard deviation. What about the test set?

This is where people often get confused. While your instinct may be to scale the test data by subtracting off the test mean and dividing by the test standard deviation, this is NOT the correct approach. Instead, you scale the test data by subtracting off the training mean and dividing by the training standard deviation. Counterintuitive, I know, but this is the scaling procedure that we used to train our algorithm, so we have to repeat it when we predict on new data, like the test set.

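This is exactly why `sklearn` transformers have a `fit`, a `transform`, and a `fit_transform` method: `fit` learns the statistics (here the mean and standard deviation) from the data it is given, `transform` applies the already-learned statistics to any data, and `fit_transform` does both steps on the same data.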
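
To make the roles of the three methods concrete, here is a minimal sketch on made-up data (`demo_train` and `demo_test` are hypothetical stand-ins, not the notebook's `X_train` and `X_test`). It checks that `transform` on the test set uses the training statistics stored in `mean_` and `scale_`, and that `fit_transform` matches `fit` followed by `transform`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data (hypothetical, not the notebook's X)
rng = np.random.default_rng(0)
demo_train = rng.normal(20, 10, size=(750, 1))
demo_test = rng.normal(20, 10, size=(250, 1))

# fit: learn the mean and standard deviation from the training data only
demo_scaler = StandardScaler()
demo_scaler.fit(demo_train)
print("Learned mean:", demo_scaler.mean_[0])
print("Learned SD:", demo_scaler.scale_[0])

# transform: apply the learned (training) statistics to the test data
demo_test_scaled = demo_scaler.transform(demo_test)
manual_test_scaled = (demo_test - demo_train.mean()) / demo_train.std()
print(np.allclose(demo_test_scaled, manual_test_scaled))

# fit_transform: shorthand for fit followed by transform on the same data
one_step = StandardScaler().fit_transform(demo_train)
two_steps = demo_scaler.transform(demo_train)
print(np.allclose(one_step, two_steps))
```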

In [5]:
# look at the means and standard deviations of the test and train sets
print("Train Mean",np.mean(X_train))
print("Test Mean",np.mean(X_test))
print("Train SD",np.std(X_train))
print("Test SD",np.std(X_test))

Train Mean 20.343719218196068
Test Mean 20.035797954465977
Train SD 9.959174058499645
Test SD 9.757869041275374


In [6]:
from sklearn.preprocessing import StandardScaler

In [7]:
# now we scale
scaler = StandardScaler()

X_train_scale = scaler.fit_transform(X_train.reshape(-1,1))
X_test_scale = scaler.transform(X_test.reshape(-1,1))

In [8]:
print("Train Mean",np.mean(X_train_scale))
print("Test Mean",np.mean(X_test_scale))
print("Train SD",np.std(X_train_scale))
print("Test SD",np.std(X_test_scale))

Train Mean 1.4210854715202004e-17
Test Mean -0.0309183534619821
Train SD 1.0
Test SD 0.979786976706922


Notice the slight difference here: the scaled training set has a mean that is essentially $0$, but the scaled test set does not. Let's see what happens if I perturb the test set a little. Go ahead and play around with the value of `perturb`.

In [9]:
# now we scale
scaler = StandardScaler()

perturb = 100

X_train_scale = scaler.fit_transform(X_train.reshape(-1,1))
X_test_scale = scaler.transform(X_test.reshape(-1,1) + perturb)

print("Train Mean",np.mean(X_train_scale))
print("Test Mean",np.mean(X_test_scale))
print("Train SD",np.std(X_train_scale))
print("Test SD",np.std(X_test_scale))

Train Mean 1.4210854715202004e-17
Test Mean 10.010074947047222
Train SD 1.0
Test SD 0.979786976706922
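
The perturbed test mean above is just the shifted test mean pushed through the training statistics: $(20.0358 + 100 - 20.3437)/9.9592 \approx 10.01$, while the standard deviation is unchanged because adding a constant does not affect the spread. Here is a small sketch that checks this relationship on made-up data (`demo_train`, `demo_test`, and `demo_scaler` are hypothetical names, and the random draws differ from the notebook's).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data (hypothetical, not the notebook's split)
rng = np.random.default_rng(440)
demo_train = rng.normal(20, 10, size=(750, 1))
demo_test = rng.normal(20, 10, size=(250, 1))
perturb = 100

# Fit on the training data, then transform the shifted test data
demo_scaler = StandardScaler().fit(demo_train)
shifted_scaled = demo_scaler.transform(demo_test + perturb)

# Expected mean: (test mean + perturb - train mean) / train SD
expected_mean = (demo_test.mean() + perturb - demo_train.mean()) / demo_train.std()

# Expected SD: test SD / train SD (the shift does not change the spread)
expected_sd = demo_test.std() / demo_train.std()

print(np.isclose(shifted_scaled.mean(), expected_mean))
print(np.isclose(shifted_scaled.std(), expected_sd))
```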


## Practice

On your own time, go through and find what is wrong with the following code chunks. Chunks that are correct as written are marked as such.

In [10]:
# Make data
X = np.array([2,4])*np.random.randn(100,2) + [-1,2]

# I need to scale my data
scaler = StandardScaler()

X_scaled = scaler.fit_transform(X)

# Now I do a train test split
X_train, X_test = train_test_split(X_scaled,test_size=.2,random_state=44,shuffle=True)

In [11]:
## This line of code is correct!
from sklearn.decomposition import PCA

In [12]:
# Make Data
X = np.random.randn(1000,50) + np.random.randint(-100,100,(1000,50))

# train test split
X_train, X_test = train_test_split(X,test_size=.1,shuffle=True,random_state=44)

# I want to perform PCA
pca = PCA(n_components=10)

X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.fit_transform(X_test)

In [13]:
## This chunk of code is correct!
from sklearn.preprocessing import PolynomialFeatures,FunctionTransformer
from sklearn.pipeline import Pipeline

In [14]:
# a processing function
def process(X):
    process_X = np.zeros((np.shape(X)[0],4))
    
    process_X[:,0] = X[:,0]
    process_X[:,1] = np.sqrt(X[:,1])
    process_X[:,2] = X[:,2]
    
    scale = StandardScaler()
    process_X[:,1:3] = scale.fit_transform(X[:,1:3])
    
    process_X[:,3] = process_X[:,0]*process_X[:,1]
    
    
    return process_X

In [15]:
# make some data
X = np.zeros((1000,3))

X[:,0] = np.random.randint(0,2,1000)
X[:,1] = 5*np.random.random(1000) + 10
X[:,2] = 10*np.random.randn(1000) - 12

# train test split
X_train, X_test = train_test_split(X,test_size=.1,shuffle=True,random_state=44)

In [16]:
# make a pipe
pipe = Pipeline([('process',FunctionTransformer(process))])

process_train = pipe.fit_transform(X_train)
process_test = pipe.transform(X_test)

### A Reminder on More Advanced Pipelines

The last example in the practice set illustrates the need for more advanced pipelines.

Luckily we introduced these back in Regression Notebook 7.

We quickly review them now before signing off.

The key requirement is that things like scalers, imputers, PCA, and other transformers need a `fit`, a `transform`, and a `fit_transform` method.
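
As a minimal sketch of what that means in code, the hypothetical `MeanCenterer` below (not something from Regression Notebook 7) defines only `fit` and `transform`; the `fit_transform` method comes for free from `TransformerMixin`. The tiny arrays are made up so the results are easy to check by hand.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Hypothetical transformer: subtracts the column means learned in fit.
# The mixin is listed first, as in sklearn's own transformers.
class MeanCenterer(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        # Learn the statistics from the data passed to fit (the training set)
        self.means_ = np.mean(X, axis=0)
        return self

    def transform(self, X, y=None):
        # Apply the stored training statistics and return a new array
        return X - self.means_

# Tiny made-up training and test sets
demo_train = np.array([[1.0, 10.0], [3.0, 30.0]])
demo_test = np.array([[2.0, 40.0]])

centerer = MeanCenterer()
train_centered = centerer.fit_transform(demo_train)  # supplied by TransformerMixin
test_centered = centerer.transform(demo_test)        # uses the training means

print(train_centered)
print(test_centered)
```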

This can all be done with `sklearn`.

We'll end with an example that features categorical and continuous variables.

We want to one-hot encode the categorical variable, and we want to scale the continuous variables and then put them through PCA.

In [17]:
# make data
X = np.zeros((1000,9))

X[:,0] = np.random.randint(0,3,1000)
X[:,1:8] = np.random.randn(1000,7)
X[:,8] = X[:,1] + 2*X[:,3] - 4*X[:,6] + np.random.randn(1000)


In [18]:
# We first create this function that takes in 
# X and makes one hot encoded columns
def get_X_ready(X):
    new_X = np.zeros((np.shape(X)[0],10))
    
    # one hot encode
    new_X[X[:,0]==0,0] = 1
    new_X[X[:,0]==1,1] = 1

    # copy the rest
    new_X[:,2:] = X[:,1:]
    
    return new_X

In [19]:
# This allows you to make
# a custom transformer
from sklearn.base import BaseEstimator, TransformerMixin

In [20]:
# Define our custom transformer
# It should take in our X with the one hot encoded columns
# and return X with the continuous columns scaled
class Scaler(BaseEstimator, TransformerMixin):
    # Class constructor
    # This runs when you create an object by calling
    # Scaler()
    def __init__(self):
        # I want to initiate each object with
        # its own StandardScaler object
        self.StandardScaler = StandardScaler()
    
    # For my fit method I'm just going to "steal"
    # StandardScaler's fit method using only the
    # columns I want
    def fit(self, X, y = None ):
        self.StandardScaler.fit(X[:,2:])
        return self
    
    # Now I want to transform the columns I want
    # and return a copy with scaled columns, so
    # the input array is not modified in place
    def transform(self, X, y = None):
        X_copy = X.copy()
        X_copy[:,2:] = self.StandardScaler.transform(X[:,2:])
        return X_copy

In [21]:
from sklearn.decomposition import PCA
# we now make a custom PCA transformer
class CustomPCA(BaseEstimator, TransformerMixin):
    # Class constructor
    # This runs when you create an object by calling
    # CustomPCA()
    def __init__(self):
        # I want to initiate each object with
        # its own PCA object
        self.PCA = PCA()
    
    # For my fit method I'm just going to "steal"
    # PCA's fit method using only the
    # columns I want
    def fit(self, X, y = None ):
        self.PCA.fit(X[:,2:])
        return self
    
    # Now I want to transform the columns
    # and return a copy with the PCA values, so
    # the input array is not modified in place
    def transform(self, X, y = None):
        X_copy = X.copy()
        X_copy[:,2:] = self.PCA.transform(X[:,2:])
        return X_copy

In [22]:
# Now we put it all together with a pipe
pipe = Pipeline([('get_X_ready',FunctionTransformer(get_X_ready)),
                ('scale',Scaler()),
                ('pca',CustomPCA())])

In [23]:
# train test split
X_train, X_test = train_test_split(X,test_size=.1,shuffle=True,random_state=44)

In [24]:
# Processed training set
X_train_processed = pipe.fit_transform(X_train)

In [25]:
# Processed testing set
X_test_processed = pipe.transform(X_test)

In [26]:
X_train_processed

array([[ 0.        ,  1.        ,  0.01959462, ...,  2.12517514,
         0.39893159,  0.17110421],
       [ 1.        ,  0.        ,  0.68136149, ..., -0.47417259,
         0.70362549, -0.10537797],
       [ 1.        ,  0.        , -2.13647321, ..., -1.4762095 ,
        -0.26700619, -0.14958157],
       ...,
       [ 1.        ,  0.        ,  0.0327174 , ...,  0.42355316,
        -0.60359615,  0.12598118],
       [ 1.        ,  0.        ,  1.29427975, ...,  0.71041834,
        -0.63478924,  0.36667208],
       [ 0.        ,  1.        ,  0.25386409, ..., -0.60850776,
        -1.16975088, -0.19820841]])

In [27]:
X_test_processed

array([[ 0.00000000e+00,  0.00000000e+00, -1.18157105e+00,
         3.08637062e-01, -1.79897606e+00, -1.32392574e+00,
         3.79064272e-01,  1.15973460e+00,  6.29259101e-01,
        -1.66040965e-01],
       [ 0.00000000e+00,  0.00000000e+00,  7.11527458e-01,
         5.22406100e-01, -5.05242177e-01, -3.71303617e-01,
         7.23545399e-01,  7.89895588e-01, -8.10115913e-01,
        -3.56530649e-01],
       [ 0.00000000e+00,  0.00000000e+00,  1.87994706e+00,
        -7.59367360e-01,  7.70514521e-01,  6.36154622e-02,
        -4.98931448e-01, -4.14265681e-01,  9.78746265e-01,
        -4.92020393e-02],
       [ 0.00000000e+00,  0.00000000e+00, -1.21395695e+00,
         1.09941692e+00, -4.80966173e-01, -1.76246158e+00,
         3.31771300e-01,  1.79201102e-01, -1.98682632e-01,
         3.28612637e-02],
       [ 1.00000000e+00,  0.00000000e+00, -9.65211408e-01,
        -1.77320437e+00,  1.08915267e-01, -1.42623559e+00,
        -1.25431610e+00,  1.25134010e+00,  4.63535586e-01,
        -3.

In [28]:
# we can check the means again
np.mean(X_train_processed[:,2:],axis=0)

array([ 6.82787160e-17, -1.24591695e-17, -5.20571240e-17,  1.48029737e-17,
        1.83803590e-17,  1.08555140e-17, -2.26978929e-17,  3.30908140e-17])

In [29]:
# test mean
np.mean(X_test_processed[:,2:],axis=0)

array([-0.05276432, -0.00484614, -0.11133791,  0.0490695 , -0.13454378,
       -0.09040187,  0.03284448,  0.00323857])
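
For comparison, here is a rough sketch of the same kind of preprocessing using `sklearn`'s built-in `ColumnTransformer` and `OneHotEncoder` instead of hand-written classes. This is an alternative, not the approach used above. The data and the names `X_demo`, `continuous_steps`, and `preprocess` are made up for illustration, and `drop='first'` drops a different baseline category than `get_X_ready` does.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data with the same layout as above:
# column 0 is categorical, columns 1 through 8 are continuous
rng = np.random.default_rng(440)
X_demo = np.zeros((1000, 9))
X_demo[:, 0] = rng.integers(0, 3, 1000)
X_demo[:, 1:8] = rng.standard_normal((1000, 7))
X_demo[:, 8] = X_demo[:, 1] + 2*X_demo[:, 3] - 4*X_demo[:, 6] + rng.standard_normal(1000)

demo_train, demo_test = train_test_split(X_demo, test_size=.1, shuffle=True, random_state=44)

# Continuous columns: scale, then PCA
continuous_steps = Pipeline([('scale', StandardScaler()),
                             ('pca', PCA())])

# Route each group of columns to its own transformer;
# sparse_threshold=0 asks for a dense array as output
preprocess = ColumnTransformer([('onehot', OneHotEncoder(drop='first'), [0]),
                                ('continuous', continuous_steps, list(range(1, 9)))],
                               sparse_threshold=0)

# Fit on the training set only, then reuse the fitted steps on the test set
demo_train_processed = preprocess.fit_transform(demo_train)
demo_test_processed = preprocess.transform(demo_test)

print(demo_train_processed.shape, demo_test_processed.shape)
print(np.mean(demo_train_processed[:, 2:], axis=0))
```

As with the custom classes, the training means of the continuous columns should be essentially $0$, while the test means will generally be close to, but not exactly, $0$.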

This more advanced pipeline involves more complicated Python than what we've covered up to this point. It's okay if you don't get it right away!

I encourage you to review this and Regression Notebook 7 to get more practice. It may also help to review object-oriented programming in Python; this is a helpful resource: <a href="https://python.swaroopch.com/oop.html">https://python.swaroopch.com/oop.html</a>.

That's it for this aside, I hope this notebook was helpful!