# Preprocessing

## After this lecture, you

- Can prepare categorical variables for `sklearn` models
- Know that different imputation strategies exist and can use them
- Know that standardizing continuous variables can improve your models
- Know that you shouldn't apply preprocessing transformations with info from the testing dataset. That's called "data leakage" and is akin to letting your model "see the future" while training
- Know that you should apply the **exact** transformations to the testing data that you applied to the training data before making predictions

All of these can be accomplished by using `pipelines`.

## Preprocessing categorical variables

Depending on the variable you have, you can turn to
- `DictVectorizer`, which turns string categorical variables stored in dicts into usable numeric variables
- `OneHotEncoder`, which does the same for array-like inputs instead of dicts

Let's start by borrowing a clear example from [PDSH](https://jakevdp.github.io/PythonDataScienceHandbook/05.04-feature-engineering.html)

In [1]:
data = [
    {'price': 850, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 650, 'rooms': 3, 'neighborhood': 'Queen Anne'},
    {'price': 700, 'rooms': 1, 'neighborhood': 'Wallingford'},
    {'price': 650, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 700, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 600, 'rooms': 2, 'neighborhood': 'Fremont'}
]
data


[{'price': 850, 'rooms': 4, 'neighborhood': 'Queen Anne'},
 {'price': 650, 'rooms': 3, 'neighborhood': 'Queen Anne'},
 {'price': 700, 'rooms': 1, 'neighborhood': 'Wallingford'},
 {'price': 650, 'rooms': 3, 'neighborhood': 'Wallingford'},
 {'price': 700, 'rooms': 3, 'neighborhood': 'Fremont'},
 {'price': 600, 'rooms': 2, 'neighborhood': 'Fremont'}]

`sklearn` can't use `neighborhood` in a regression like `sm` could:

In [2]:
import pandas as pd    
from statsmodels.formula.api import ols as sm_ols
print('The coefs from SM:')
print(sm_ols('price ~ neighborhood - 1', data = pd.DataFrame(data)).fit().params)
# "-1" means no intercept. Don't do this! It's here for illustration


The coefs from SM:
neighborhood[Fremont]        650.0
neighborhood[Queen Anne]     750.0
neighborhood[Wallingford]    675.0
dtype: float64


So, we need to preprocess that data to run the same regression in `sklearn`. The data here is a list of dicts, so `DictVectorizer` is the right tool:

In [3]:
# create an object ("vec") that can do the transform
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False, dtype=int) 

# apply vec with ".fit_transform", save to new data obj
data2 = vec.fit_transform(data) 
print(data2, '\n')              
print(vec.get_feature_names())  # recovers names; in scikit-learn >= 1.0, use .get_feature_names_out()

# now we can repeat the regression here
from sklearn.linear_model import LinearRegression
print('Reg coefs:')
LinearRegression(fit_intercept=False).fit(data2[:,:3],data2[:,3]).coef_


[[  0   1   0 850   4]
 [  0   1   0 650   3]
 [  0   0   1 700   1]
 [  0   0   1 650   3]
 [  1   0   0 700   3]
 [  1   0   0 600   2]] 

['neighborhood=Fremont', 'neighborhood=Queen Anne', 'neighborhood=Wallingford', 'price', 'rooms']
Reg coefs:


array([650., 750., 675.])
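
If your data lives in a `DataFrame` or array rather than a list of dicts, `OneHotEncoder` can do the same job. Here is a minimal sketch, assuming the same six houses as above (the names `df`, `enc`, and `dummies` are mine, not from the PDSH example):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

# same six houses as above, stored as a DataFrame instead of a list of dicts
df = pd.DataFrame({
    'price': [850, 650, 700, 650, 700, 600],
    'neighborhood': ['Queen Anne', 'Queen Anne', 'Wallingford',
                     'Wallingford', 'Fremont', 'Fremont'],
})

# OneHotEncoder wants a 2D input, hence the double brackets
enc = OneHotEncoder()
dummies = enc.fit_transform(df[['neighborhood']]).toarray()  # sparse -> dense

print(enc.categories_[0])  # categories are sorted alphabetically

# same no-intercept regression as before
coefs = LinearRegression(fit_intercept=False).fit(dummies, df['price']).coef_
print(coefs)  # approximately [650, 750, 675], matching the coefs above
```

Each coefficient is just the average price in that neighborhood, which is why it matches both the `statsmodels` and the `DictVectorizer` results.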

## Imputation / Missing Values

_We talked [about imputation a bit before](https://ledatascifi.github.io/lectures-spr2020/02/05_outro.html#Dealing-With-Missing-Values) in the context of `pandas`. [These slides](https://github.com/matthewbrems/ODSC-missing-data-may-18/blob/master/Analysis%20with%20Missing%20Data.pdf) on missing data are quite good! [This article](https://www.geeksforgeeks.org/working-with-missing-data-in-pandas/) has examples too._

Before modeling, you have to decide how to deal with missing values. You can 
1. Drop observations with any missing values, 
2. Impute missing values (mean, median, mode, interpolation, deduction, mean-of-group, etc), 
3. Or model the missing values explicitly (e.g. in a regression, as an incremental intercept but with no impact on the slope). 

What's the right choice? It depends. On the data, the domain, the question, and economic theory. My choices change from project to project. You might use a combination of these!
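
As a reminder of what the three options look like mechanically, here is a small `pandas` sketch on a made-up variable (the `df` below and the column names are hypothetical, and filling with 0 in option 3 is just one convention):

```python
import numpy as np
import pandas as pd

# hypothetical toy data: one variable with one missing value
df = pd.DataFrame({'x': [1.0, np.nan, 3.0, 5.0]})

# option 1: drop observations with any missing values
dropped = df.dropna()

# option 2: impute, here with the mean of the non-missing values (3.0)
imputed = df.fillna(df['x'].mean())

# option 3: model missingness explicitly with a dummy for "x was missing",
# and fill x with a constant so the row can stay in the regression
flagged = df.assign(x_missing=df['x'].isna().astype(int),
                    x=df['x'].fillna(0))

print(dropped, imputed, flagged, sep='\n\n')
```

In a regression, the `x_missing` dummy acts as the incremental intercept for observations where `x` was missing.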

**You should focus on the whys and hows of dealing with missing data rather than mechanics. (You can look up mechanics later.)** You should have some livecoding from the prior lecture showing imputation in `pandas`.

`sklearn` comes with an `impute` module described in the [official docs](https://scikit-learn.org/stable/modules/impute.html).



In [4]:
# silly data
import numpy as np
X = np.array([[ np.nan, 0,   3  ],
              [ 3,   7,   9  ],
              [ 3,   5,   2  ],
              [ 4,   np.nan, 6  ],
              [ 8,   8,   1  ]])
print(X,'\n')

# it's this easy:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='mean')
imp.fit_transform(X) 


[[nan  0.  3.]
 [ 3.  7.  9.]
 [ 3.  5.  2.]
 [ 4. nan  6.]
 [ 8.  8.  1.]] 



array([[4.5, 0. , 3. ],
       [3. , 7. , 9. ],
       [3. , 5. , 2. ],
       [4. , 5. , 6. ],
       [8. , 8. , 1. ]])

`imp.fit_transform(X)` is the combination of `imp.fit(X)` and `imp.transform(X)`. 

If you have a train/test split, you shouldn't call `fit_transform` on the full dataset. Instead, use `imp.fit(X_train)` to get the means in the training sample, and then `imp.transform(X_train)` and `imp.transform(X_test)` to apply those same means to both the training and test data.
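
A minimal sketch of that fit-on-train, transform-on-test pattern, using made-up arrays (`X_train` and `X_test` here are hypothetical):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# hypothetical training and testing samples
X_train = np.array([[np.nan, 0.],
                    [3.,     7.],
                    [3.,     5.]])
X_test = np.array([[np.nan, np.nan],
                   [8.,     8.]])

imp = SimpleImputer(strategy='mean')
imp.fit(X_train)          # learn the column means from the training data only
print(imp.statistics_)    # [3. 4.]

X_train_filled = imp.transform(X_train)
X_test_filled = imp.transform(X_test)   # test NaNs get the TRAINING means
print(X_test_filled)
```

Note that the missing values in `X_test` are filled with 3 and 4, the training means, and the 8s in the test data play no role in the imputation.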

## Standardization

Effectively, this means rescaling **continuous** variables so that each has a mean of 0 and a variance of 1.

The [`sklearn` documentation](https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling) on this is quite good.

> Standardization of datasets is a **common requirement for many machine learning estimators** implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.

Why does this matter? "If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected."

**In other words: STANDARDIZATION WILL OFTEN IMPROVE YOUR PREDICTIONS.**


In [5]:
# a very simple example
from sklearn import preprocessing
import numpy as np
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
X_scaled = preprocessing.scale(X_train)

print(' X_scaled\n',         '-'*40,'\n',X_scaled,'\n')
print(' Mean of each var:\n','-'*40,'\n',X_scaled.mean(axis=0),'\n')
print(' STD of each var:\n', '-'*40,'\n',X_scaled.std(axis=0),'\n')


 X_scaled
 ---------------------------------------- 
 [[ 0.         -1.22474487  1.33630621]
 [ 1.22474487  0.         -0.26726124]
 [-1.22474487  1.22474487 -1.06904497]] 

 Mean of each var:
 ---------------------------------------- 
 [0. 0. 0.] 

 STD of each var:
 ---------------------------------------- 
 [1. 1. 1.] 



`sklearn` can scale variables in many ways. Some alternative transforms are faster, and some transform non-normal distributions into approximately normal distributions (which can improve the efficacy of many models).

Visit (you guessed it!) [the documentation](https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling) for more.
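
`preprocessing.scale` standardizes in one shot but doesn't remember the means and standard deviations it used. When you have a train/test split, `StandardScaler` is the version that can be fit on the training data and then reused. A minimal sketch, where the single-row `X_test` is a made-up observation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
X_test = np.array([[-1., 1., 0.]])   # hypothetical new observation

scaler = StandardScaler().fit(X_train)   # learns training means and stds
print(scaler.mean_)                      # the saved training means

X_train_scaled = scaler.transform(X_train)  # mean 0, std 1 by construction
X_test_scaled = scaler.transform(X_test)    # uses the TRAINING means and stds
print(X_test_scaled)
```

The scaled test data will generally not have a mean of exactly 0 or a variance of exactly 1, and that's fine: the point is that it was transformed exactly the way the training data was.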

## Data leakage

Now you know how to transform your data before training a model. You might be tempted to do something like:

```python
import #a bunch of sklearn stuff
X, y = #load data
X = transform(X) # imputation, encode cat vars, standardize

# and then you either do these lines:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=9,train_size=.8)
model = # something
model.fit(Xtrain, ytrain)
y_predict = model.predict(Xtest) # using Xtest (out-of-sample data), predict ytest
accuracy_score(ytest, y_predict)

# or this:
cross_validate(model,X,y)
```

The problem here is that `transform(X)` used info from the **ENTIRE** dataset, including observations that ended up in `Xtest`!

**This means that your cross-validation scores are unreliable.** They will be, at the very least, overoptimistic, and in some cases the resulting models are downright invalid.
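
For a simple train/test split, one way to avoid the problem is to split first and fit the transformation on the training rows only. A sketch on the iris data, where the `StandardScaler` and `SVC` are my choices for illustration rather than part of the pseudocode above:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 1. split FIRST, before any transformation sees the data
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=9, train_size=.8)

# 2. learn the transformation from the training rows only
scaler = StandardScaler().fit(Xtrain)

# 3. fit the model on the transformed training data
model = SVC(C=1).fit(scaler.transform(Xtrain), ytrain)

# 4. apply the SAME (already fitted) transformation to the test data
y_predict = model.predict(scaler.transform(Xtest))
acc = accuracy_score(ytest, y_predict)
print(acc)
```

This works for one split, but it gets tedious and error-prone once you have several transformations or want cross-validation.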

### The absolute golden rule of prediction modeling is...

**YOUR MODEL CAN'T HAVE ACCESS TO ANY DATA THAT IT WOULDN'T HAVE IN PRACTICE WHEN IT MAKES THE PREDICTION.**

I know I already said that, and repetition is usually bad writing, but it must be said again. And again.

### Data leakage can be tricky

Here are some more examples:
- The outcome variable is a predictor (implicitly or explicitly)
- Predictor variables that are in response to the result (after the fact) or the possibility (anticipatory)
- When predicting loan default, the data might include employee IDs for recent customer service contacts. But the most recent contact might be with trouble-loan specialists (because the firm anticipated possible default due to some other signal). Using that employee's customer contacts to predict default would add no value - the lender already knew to assign that employee!
- The smell test - is it too good to be true? I've seen some asset pricing models with suspicious out-of-sample R2s. Predicting stock prices is hard! _The best OOS predictive R2 for individual stocks [in this paper](https://dachxiu.chicagobooth.edu/download/ML.pdf) is 1.80% per month._


## The solution, or: Safety first, via Pipelines

1. Be very familiar with the data and how it was collected and built 
1. Do your data prep within CV folds

The second part of the solution is [relatively easy to implement in `sklearn`: PIPES](https://scikit-learn.org/stable/modules/compose.html)!
- Pipelines apply all of their steps, in order, to the data they receive
- In `cross_validate`'s training fold, the entire pipeline is applied to the training data
- In `cross_validate`'s testing fold, the saved transformations and model fits are applied to the test data

Examples and walkthroughs:
- [PDSH has an example of imputing, creating polynomial features, then fitting a regression in one line](https://jakevdp.github.io/PythonDataScienceHandbook/05.04-feature-engineering.html)
    - [Another walkthrough with a pipeline](https://jakevdp.github.io/PythonDataScienceHandbook/05.06-linear-regression.html)
    - [A pipeline used to optimize model parameters](https://jakevdp.github.io/PythonDataScienceHandbook/05.07-support-vector-machines.html)
- [sklearn doc with details](https://scikit-learn.org/stable/modules/compose.html)
- [sklearn doc with two walkthroughs](https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html), these walkthroughs use a pipeline to optimize a model's parameters

Let's quickly follow this walkthrough on [scaling the iris data and building a classification model](https://chrisalbon.com/machine_learning/model_evaluation/cross_validation_pipeline/)

In [6]:
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn import preprocessing
from sklearn.model_selection import cross_validate
from sklearn import svm

iris = load_iris() # data

# set up the pipeline, which will, given a set of observations 
# 1. fit and apply these steps to the training fold
# 2. in the testing fold, apply the transform and model to predict (no estimation)

classifier_pipeline = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))

# ok, go!
scores = cross_validate(classifier_pipeline, iris.data, iris.target, cv=5)
scores

{'fit_time': array([0.00099707, 0.00099754, 0.        , 0.0009973 , 0.        ]),
 'score_time': array([0.00099659, 0.        , 0.00099778, 0.        , 0.0009973 ]),
 'test_score': array([0.96666667, 0.96666667, 0.96666667, 0.93333333, 1.        ])}
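
Pipelines can also bundle the other preprocessing steps from this lecture. The sketch below is one possible setup, not the only one: it reuses the housing data from the top of the page, with one `rooms` value set to missing by me for illustration, and uses `make_column_transformer` to impute and scale the numeric column while one-hot encoding the categorical one. The `new_house` row is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

houses = pd.DataFrame({
    'price': [850, 650, 700, 650, 700, 600],
    'rooms': [4, 3, np.nan, 3, 3, 2],   # one value made missing for illustration
    'neighborhood': ['Queen Anne', 'Queen Anne', 'Wallingford',
                     'Wallingford', 'Fremont', 'Fremont'],
})

# numeric columns: impute the mean, then standardize
numeric_steps = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())

# route each column to the right preprocessing
preprocess = make_column_transformer(
    (numeric_steps, ['rooms']),
    (OneHotEncoder(handle_unknown='ignore'), ['neighborhood']),
)

# preprocessing + model in a single object
model = make_pipeline(preprocess, LinearRegression())
model.fit(houses[['rooms', 'neighborhood']], houses['price'])

# a hypothetical new house: the pipeline applies the saved imputation,
# scaling, and encoding from the training data before predicting
new_house = pd.DataFrame({'rooms': [np.nan], 'neighborhood': ['Fremont']})
pred = model.predict(new_house)
print(pred)
```

Because everything lives inside one pipeline object, passing `model` to `cross_validate` would refit every preprocessing step within each training fold.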