- In this section, we show:
    - how to prepare data to improve our models
    - how to deal with more challenging situations (e.g. outliers, missing data, etc.)

- Real-world data is usually messy
    - It is almost never ready to use without preprocessing

___

# Numeric feature scaling

- In chapter 3, we discussed mapping each predictor to a z-score to be used in gradient descent
    - In this chapter, we discuss how feature scaling can be helpful when working with data that is missing/faulty

- There are two ways to deal with missing data
    1. Actively deal with missing values
    2. Passively deal with the values by either:
        - throwing an error
        - ignoring the value

- If we standardize our data, every feature has a mean of zero, so a missing value can be replaced with zero (the feature's mean)
    - We can show this in practice
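
As a minimal sketch of this idea on toy data (the array `x` and its values are hypothetical, not taken from the dataset used below), we can standardize a feature using only its observed values and then fill the missing entry with zero:

```python
import numpy as np

# Toy feature with one missing entry (hypothetical values)
x = np.array([2.0, 4.0, np.nan, 6.0, 8.0])

# Mean and standard deviation computed from the observed values only
mean = np.nanmean(x)
std = np.nanstd(x)  # population standard deviation (ddof=0)

# Standardize; the missing entry is still NaN at this point
z = (x - mean) / std

# Replace the missing z-score with 0, the mean of the standardized feature
z_filled = np.where(np.isnan(z), 0.0, z)

print(z_filled)
```

Filling with zero after standardizing is equivalent to filling with the feature mean before standardizing, so the filled feature keeps a mean of zero.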

### Example

In [1]:
import numpy as np
import pandas as pd
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older release
from sklearn.datasets import load_boston
from sklearn import linear_model
import matplotlib.pyplot as plt
%matplotlib inline

In [4]:
boston = load_boston()
df = pd.DataFrame(boston.data, columns = boston.feature_names)
df['target'] = boston.target

In [6]:
X = df.iloc[:,:-1]
y = df['target'].values

- To be able to give an example for logistic regression as well, we'll create a binary version of the `y` vector (1 where the target exceeds 25, 0 otherwise)

In [10]:
y2 = 1*(y>25)

___

# Mean centering

- scikit-learn has the tools we need to do all of our feature scaling
    - `StandardScaler` converts our values to z-scores
    - `MinMaxScaler` rescales the variables setting a new min and max
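
To make the two transformations concrete, here is a rough NumPy sketch of the arithmetic behind them on a toy matrix (`X_toy` and its values are hypothetical). It assumes the scalers' default settings: a population standard deviation for the z-scores and a target range of [0, 1] for min-max scaling.

```python
import numpy as np

# Toy feature matrix: 4 observations, 2 features (hypothetical values)
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0],
                  [4.0, 40.0]])

# z-scores: subtract each column's mean, divide by its standard deviation
X_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# Min-max scaling: map each column's minimum to 0 and its maximum to 1
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
X_minmax = (X_toy - col_min) / (col_max - col_min)

print(X_std)
print(X_minmax)
```

After z-scoring, every column has mean 0 and standard deviation 1; after min-max scaling, every column spans exactly 0 to 1.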

In [19]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [20]:
# Note: the normalize argument was removed from LinearRegression in scikit-learn 1.2
linear_regression = linear_model.LinearRegression(normalize = False,
                                                  fit_intercept = True)
linear_regression.fit(X,y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [21]:
linear_regression.coef_

array([-1.07170557e-01,  4.63952195e-02,  2.08602395e-02,  2.68856140e+00,
       -1.77957587e+01,  3.80475246e+00,  7.51061703e-04, -1.47575880e+00,
        3.05655038e-01, -1.23293463e-02, -9.53463555e-01,  9.39251272e-03,
       -5.25466633e-01])

In [22]:
linear_regression.intercept_

36.4911032803614

- The intercept is the predicted target value when all predictors are zero
    - But let's think for a second: how likely is it that they'll all be zero at the same time?

In [23]:
X.min()

CRIM         0.00632
ZN           0.00000
INDUS        0.46000
CHAS         0.00000
NOX          0.38500
RM           3.56100
AGE          2.90000
DIS          1.12960
RAD          1.00000
TAX        187.00000
PTRATIO     12.60000
B            0.32000
LSTAT        1.73000
dtype: float64

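- Only `ZN` and `CHAS` ever take the value zero; the other 11 predictors have strictly positive minimums
    - So no observation has all predictors equal to zero, and the intercept never corresponds to an observed data point
    - For our observed values, the model predicts a target equal to the intercept only if the predictor contributions happen to cancel out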
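
The effect of mean centering on the intercept can be sketched on synthetic data. Everything here is an assumption made for illustration: the data is randomly generated, and `fit_ols` is a hypothetical helper built on `np.linalg.lstsq` rather than the scikit-learn model used above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic predictors that are strictly positive, so they are never all zero
X_toy = rng.uniform(1.0, 10.0, size=(50, 3))
y_toy = 5.0 + X_toy @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.1, size=50)

def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns (intercept, coefficients)."""
    A = np.column_stack([np.ones(len(X)), X])  # prepend a column of ones
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[0], beta[1:]

# Fit on the raw predictors
intercept_raw, coef_raw = fit_ols(X_toy, y_toy)

# Fit again after subtracting each column's mean
X_centered = X_toy - X_toy.mean(axis=0)
intercept_centered, coef_centered = fit_ols(X_centered, y_toy)

print(intercept_raw, intercept_centered, y_toy.mean())
```

Centering leaves the slope coefficients unchanged, but the intercept becomes the mean of the target: it is now the prediction for an observation whose predictors all sit at their average values, which is a far more plausible point than the all-zero one.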