# Preprocessing

This notebook will give you a taste of what scikit-learn provides for preprocessing data.

## Data used
We will be using the planets data and red wine data:

### Data License for Planet Data
Copyright (C) 2012 Hanno Rein

Permission is hereby granted, free of charge, to any person obtaining a copy of this database and associated scripts (the "Database"), to deal in the Database without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Database, and to permit persons to whom the Database is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Database. A reference to the Database shall be included in all scientific publications that make use of the Database.

THE DATABASE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE DATABASE OR THE USE OR OTHER DEALINGS IN THE DATABASE.

### Citations for Red Wine Data
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. 
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at:
- [@Elsevier](http://dx.doi.org/10.1016/j.dss.2009.05.016)
- [Pre-press (pdf)](http://www3.dsi.uminho.pt/pcortez/winequality09.pdf)
- [bib](http://www3.dsi.uminho.pt/pcortez/dss09.bib)

Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

## Setup

In [1]:
import numpy as np
import pandas as pd

planets = pd.read_csv('data/planets.csv')
red_wine = pd.read_csv('data/winequality-red.csv')
wine = pd.concat([
    pd.read_csv('data/winequality-white.csv', sep=';').assign(kind='white'), 
    red_wine.assign(kind='red')
])

## Train Test Split
We can use scikit-learn's `train_test_split()` function to get our training and testing sets. (We will discuss the validation set in [chapter 10](https://github.com/stefmolin/Hands-On-Data-Analysis-with-Pandas/blob/master/ch_10).)

In [2]:
from sklearn.model_selection import train_test_split

X = planets[['eccentricity', 'semimajoraxis', 'mass']]
y = planets.period

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

The original data had this shape:

In [3]:
X.shape, y.shape

((3814, 3), (3814,))

Our training data has this shape:

In [4]:
X_train.shape, y_train.shape

((2860, 3), (2860,))

Our testing data has this shape:

In [5]:
X_test.shape, y_test.shape

((954, 3), (954,))
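
These shapes follow from `test_size=0.25`: the test set size is rounded up, so 3,814 rows yield 954 testing rows and the remaining 2,860 go to training. Here is a minimal sketch on synthetic arrays (`X_demo` and `y_demo` are stand-ins, not the planets data) that reproduces the shapes and shows that a fixed `random_state` makes the split repeatable:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the planets data
X_demo = np.arange(3814 * 3).reshape(3814, 3)
y_demo = np.arange(3814)

split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)  # (2860, 3) (954, 3)

# The same random_state selects the same rows each time
print(np.array_equal(X_te, split_b[1]))  # True
```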

Let's look at the first 5 entries:

In [6]:
X_train.head()

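| | eccentricity | semimajoraxis | mass |
|---|---|---|---|
| 1846 | NaN | NaN | NaN |
| 1918 | NaN | NaN | NaN |
| 3261 | NaN | 0.074 | NaN |
| 3000 | NaN | NaN | NaN |
| 1094 | 0.073 | 2.6 | 1.6 |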


Our `y_train` data contains the same rows as `X_train` (note the matching index values):

In [7]:
y_train.head()

1846      52.661793
1918      23.622591
3261       7.008151
3000     226.890470
1094    1251.000000
Name: period, dtype: float64

## Scaling data
### Standardizing with `StandardScaler`

In [8]:
from sklearn.preprocessing import StandardScaler

standardized = StandardScaler().fit_transform(X_train)

# examine some of the non-NaN values
standardized[~np.isnan(standardized)][:30]

array([-0.17649619, -0.5045706 ,  0.14712504, -0.12764807,  1.09368797,
        0.67240099, -0.01731216, -0.84296411, -0.18098025, -0.13758824,
       -0.00791305, -0.09869129, -0.26808282, -0.18032045, -0.15945662,
        0.36484041, -0.15305095, -0.28352985, -0.17803358, -0.18017312,
       -0.238978  ,  0.05247717, -0.16875798, -0.26094578, -0.18022437,
       -0.22704979, -0.25988606, -0.17954522, -0.2851004 , -0.68678249])

In [9]:
standardized = StandardScaler(with_mean=False).fit_transform(X_train)
standardized[~np.isnan(standardized)][:30]

array([9.48059029e-03, 3.80041941e-01, 3.33101821e-01, 1.59042751e-01,
       1.97830051e+00, 8.58377769e-01, 2.69378660e-01, 4.16484319e-02,
       4.99652731e-03, 1.49102579e-01, 8.76699491e-01, 8.72854887e-02,
       1.86080019e-02, 5.65632515e-03, 1.27234201e-01, 1.24945296e+00,
       3.29258338e-02, 3.16097468e-03, 7.94319727e-03, 5.80365865e-03,
       4.77128253e-02, 9.37089717e-01, 1.72188018e-02, 2.57450453e-02,
       5.75241221e-03, 5.96410317e-02, 6.24726478e-01, 6.43155558e-03,
       1.59042751e-03, 1.97830051e-01])

In [10]:
standardized = StandardScaler(with_std=False).fit_transform(X_train)
standardized[~np.isnan(standardized)][:30]

array([-1.37762709e+00, -9.69199710e-02,  1.14837291e+00, -1.28416362e+00,
        2.10080029e-01,  5.24837291e+00, -1.74163617e-01, -1.61919971e-01,
       -1.41262709e+00, -1.38416362e+00, -1.51997104e-03, -7.70327089e-01,
       -2.69696362e+00, -1.40747709e+00, -1.60416362e+00,  7.00800290e-02,
       -1.19462709e+00, -2.85236362e+00, -1.38962709e+00, -1.40632709e+00,
       -2.40416362e+00,  1.00800290e-02, -1.31722709e+00, -2.62516362e+00,
       -1.40672709e+00, -2.28416362e+00, -4.99199710e-02, -1.40142609e+00,
       -2.86816362e+00, -1.31919971e-01])
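
Standardizing subtracts each column's mean and divides by its standard deviation; `with_mean=False` skips the centering and `with_std=False` skips the scaling. The sketch below checks this against NumPy on made-up values (`demo` is hypothetical). It assumes the scaler ignores `NaN` when fitting and uses the population standard deviation, which is how recent scikit-learn versions behave:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values with one missing entry
demo = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [8.0, 80.0]])

mean = np.nanmean(demo, axis=0)
std = np.nanstd(demo, axis=0)  # population standard deviation (ddof=0)

z_scores = StandardScaler().fit_transform(demo)
scaled_only = StandardScaler(with_mean=False).fit_transform(demo)
centered_only = StandardScaler(with_std=False).fit_transform(demo)

# assert_allclose treats NaN in matching positions as equal by default
np.testing.assert_allclose(z_scores, (demo - mean) / std, atol=1e-12)
np.testing.assert_allclose(scaled_only, demo / std, atol=1e-12)
np.testing.assert_allclose(centered_only, demo - mean, atol=1e-12)
```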

### Normalizing with `MinMaxScaler`

In [11]:
from sklearn.preprocessing import MinMaxScaler

normalized = MinMaxScaler().fit_transform(X_train)

# examine some of the non-NaN values
normalized[~np.isnan(normalized)][:30]

array([3.93117161e-04, 7.63598326e-02, 1.46646600e-02, 6.08362087e-03,
       3.97489540e-01, 3.78290803e-02, 1.03041533e-02, 8.36820084e-03,
       1.95372110e-04, 5.70339272e-03, 1.76150628e-01, 3.82427629e-03,
       7.11757595e-04, 2.24468882e-04, 4.86689080e-03, 2.51046025e-01,
       1.42704129e-03, 1.20883052e-04, 3.25318858e-04, 2.30966220e-04,
       1.82506561e-03, 1.88284519e-01, 7.34368621e-04, 9.84761405e-04,
       2.28706276e-04, 2.28133939e-03, 1.25523013e-01, 2.58656177e-04,
       6.08070051e-05, 3.97489540e-02])
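
With the default `feature_range` of (0, 1), min-max scaling maps each column's minimum to 0 and its maximum to 1. A small sketch on made-up values (`demo` is hypothetical) comparing the scaler to the formula `(x - min) / (max - min)`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

demo = np.array([[1.0, 200.0], [3.0, 400.0], [5.0, 1000.0]])  # made-up values

col_min = demo.min(axis=0)
col_max = demo.max(axis=0)
manual = (demo - col_min) / (col_max - col_min)

normalized_demo = MinMaxScaler().fit_transform(demo)

# The scaler agrees with the formula up to floating-point rounding
np.testing.assert_allclose(normalized_demo, manual, atol=1e-12)
```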

### Using the Median and IQR with `RobustScaler`

In [12]:
from sklearn.preprocessing import RobustScaler

robust_scaled = RobustScaler().fit_transform(X_train)

# examine some of the non-NaN values
robust_scaled[~np.isnan(robust_scaled)][:30]

array([-0.06900269, -0.16197792,  2.09219752,  0.29241339,  1.18200107,
        5.60008385,  0.75183146, -0.44653374, -0.09894806,  0.25102438,
        0.25566245,  0.45059228, -0.29233062, -0.09454181,  0.15996854,
        0.56911162,  0.08756882, -0.35664915, -0.07926968, -0.0935579 ,
       -0.17114358,  0.30644472, -0.01732554, -0.2626133 , -0.09390013,
       -0.12147676,  0.04377782, -0.08936469, -0.36318861, -0.31520028])
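
By default, the `RobustScaler` subtracts the median and divides by the interquartile range (75th percentile minus 25th percentile), so a single extreme value has little influence on how the other values are scaled. A minimal sketch on a made-up column with one outlier (`demo` is hypothetical):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One extreme value (1000) in an otherwise small column
demo = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

median = np.median(demo, axis=0)                 # 3.0
q1, q3 = np.percentile(demo, [25, 75], axis=0)   # 2.0 and 4.0
manual = (demo - median) / (q3 - q1)

robust_demo = RobustScaler().fit_transform(demo)
np.testing.assert_allclose(robust_demo, manual, atol=1e-12)

print(robust_demo.ravel().tolist())  # [-1.0, -0.5, 0.0, 0.5, 498.5]
```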

## Encoding
### Binary encoding with `np.where()`

In [13]:
np.where(wine.kind == 'red', 1, 0)

array([0, 0, 0, ..., 1, 1, 1])

We can also use the `LabelBinarizer` from scikit-learn:

In [14]:
from sklearn.preprocessing import LabelBinarizer

binary_labels = LabelBinarizer().fit(wine.kind)
binary_labels

LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)

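We can use the `classes_` attribute to see the classes that were found and the `inverse_transform()` method to map encoded values back to their labels. Here, `red` is encoded as 0 and `white` as 1, which is the reverse of the `np.where()` encoding above: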

In [15]:
binary_labels.inverse_transform(np.array([0, 1]))

array(['red', 'white'], dtype='<U5')
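
For completeness, here is a small sketch of the `classes_` attribute and of `transform()` on made-up labels (`kinds` is hypothetical). The classes are stored in sorted order, and a class's position in `classes_` determines its encoding:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

kinds = np.array(['white', 'white', 'red', 'white', 'red'])  # made-up labels

binarizer = LabelBinarizer().fit(kinds)
print(binarizer.classes_)  # ['red' 'white']

# With two classes, the result is a single column: red -> 0, white -> 1
encoded = binarizer.transform(kinds)
print(encoded.ravel())  # [1 1 0 1 0]
```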

We can use the `Binarizer` for binary encoding of values based on a threshold. Values less than or equal to `threshold` will be 0; values greater than `threshold` will be 1:

In [16]:
from sklearn.preprocessing import Binarizer

pd.Series(
    Binarizer(threshold=6).fit_transform(
        red_wine.quality.values.reshape(-1, 1)
    ).flatten()
).value_counts()

0    1382
1     217
dtype: int64
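
To see the threshold rule on its own, here is a tiny sketch with made-up scores (`scores` is hypothetical). A score equal to the threshold maps to 0:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

scores = np.array([[5], [6], [7]])  # hypothetical quality scores

flags = Binarizer(threshold=6).fit_transform(scores)
print(flags.ravel())  # [0 0 1]
```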

### Ordinal Encoding with the `LabelEncoder`

In [17]:
from sklearn.preprocessing import LabelEncoder

set(LabelEncoder().fit_transform(pd.cut(
    red_wine.quality.sort_values(),
    bins=[-1, 3, 6, 10],
    labels=['low', 'med', 'high']
)))

{0, 1, 2}
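
Note that the `LabelEncoder` assigns integers in sorted label order, so with these labels `high` becomes 0, `low` becomes 1, and `med` becomes 2. The sketch below illustrates this on made-up scores (`quality_demo` is hypothetical) and, as one possible alternative rather than the approach used above, reads the codes from the pandas categorical, which keep the `low < med < high` order passed to `pd.cut()`:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

quality_demo = pd.Series([3, 5, 6, 7, 8])  # hypothetical quality scores
levels = pd.cut(
    quality_demo, bins=[-1, 3, 6, 10], labels=['low', 'med', 'high']
)

# LabelEncoder sorts the labels before numbering them
encoder = LabelEncoder().fit(levels.astype(str))
print(list(encoder.classes_))  # ['high', 'low', 'med'] -> codes 0, 1, 2

# The categorical's own codes follow the order given in `labels`
print(levels.cat.codes.tolist())  # [0, 1, 1, 2, 2]
```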

### One-hot encoding with the `OneHotEncoder`
In some cases, label encoding implies an ordering or distance between categories that doesn't actually exist, and we don't want the model to learn it. A safer strategy is to use one-hot encoding.

Our planets data has a `list` column that we can one-hot encode:

In [18]:
planets.list.value_counts()

Confirmed planets                    3683
Controversial                         106
Retracted planet candidate             11
Solar System                            9
Kepler Objects of Interest              4
Planets in binary systems, S-type       1
Name: list, dtype: int64

### Using `pd.get_dummies()`

In [19]:
pd.get_dummies(planets.list).head()

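| | Confirmed planets | Controversial | Kepler Objects of Interest | Planets in binary systems, S-type | Retracted planet candidate | Solar System |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 1 | 0 | 0 | 0 | 0 |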


This gives us a redundant column: each row contains exactly one 1, so any one column can be derived from the others, and we only need one column fewer than the number of planet lists. Pandas makes it easy to drop one of the columns to address multicollinearity:

In [20]:
pd.get_dummies(planets.list, drop_first=True).head()

| | Controversial | Kepler Objects of Interest | Planets in binary systems, S-type | Retracted planet candidate | Solar System |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 |


We can also use the `LabelBinarizer`:

In [21]:
from sklearn.preprocessing import LabelBinarizer

LabelBinarizer().fit_transform(planets.list)

array([[1, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0],
       ...,
       [1, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0]], dtype=int32)
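
The `OneHotEncoder` itself works the same way, but expects two-dimensional input and returns a sparse matrix by default. A minimal sketch on a few made-up rows (`lists` is a hypothetical stand-in for the `list` column), calling `toarray()` to get a dense result:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the planets `list` column (one column, 2D)
lists = np.array([
    ['Confirmed planets'], ['Controversial'],
    ['Confirmed planets'], ['Solar System']
])

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(lists).toarray()  # default output is sparse

# One output column per category, in sorted order
print(encoder.categories_[0])
print(one_hot)
```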

## Imputing
### `SimpleImputer`
We can fill with the mean, median, `most_frequent` (mode), or a constant value by specifying the `strategy`. The default is the mean:

In [22]:
from sklearn.impute import SimpleImputer

SimpleImputer().fit_transform(
    planets[['semimajoraxis', 'mass', 'eccentricity']]
)

array([[ 1.29      , 19.4       ,  0.231     ],
       [ 1.54      , 11.2       ,  0.08      ],
       [ 0.83      ,  4.8       ,  0.        ],
       ...,
       [ 1.61032944,  0.3334    ,  0.31      ],
       [ 1.61032944,  0.4       ,  0.27      ],
       [ 1.61032944,  0.42      ,  0.16      ]])

Changing to the median is just a matter of passing that as the `strategy`:

In [23]:
from sklearn.impute import SimpleImputer

SimpleImputer(strategy='median').fit_transform(
    planets[['semimajoraxis', 'mass', 'eccentricity']]
)

array([[ 1.29    , 19.4     ,  0.231   ],
       [ 1.54    , 11.2     ,  0.08    ],
       [ 0.83    ,  4.8     ,  0.      ],
       ...,
       [ 0.163518,  0.3334  ,  0.31    ],
       [ 0.163518,  0.4     ,  0.27    ],
       [ 0.163518,  0.42    ,  0.16    ]])
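
The fill value is computed per column and stored in the `statistics_` attribute after fitting. The sketch below compares the strategies on made-up values (`demo` is hypothetical) and shows the `constant` strategy with `fill_value`:

```python
import numpy as np
from sklearn.impute import SimpleImputer

demo = np.array([[1.0, 2.0], [np.nan, 2.0], [1.0, np.nan], [7.0, 5.0]])

# Per-column fill values for each strategy
stats = {}
for strategy in ['mean', 'median', 'most_frequent']:
    stats[strategy] = SimpleImputer(strategy=strategy).fit(demo).statistics_
    print(strategy, stats[strategy])

# A constant fill uses the same value for every column
filled = SimpleImputer(strategy='constant', fill_value=0).fit_transform(demo)
print(filled)
```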

### `MissingIndicator`
We can mark where values are missing and use this as a feature in our model:

In [24]:
from sklearn.impute import MissingIndicator

MissingIndicator().fit_transform(
    planets[['semimajoraxis', 'mass', 'eccentricity']]
)

array([[False, False, False],
       [False, False, False],
       [False, False, False],
       ...,
       [ True, False, False],
       [ True, False, False],
       [ True, False, False]])
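
By default, the `MissingIndicator` only returns columns that contained missing values when it was fit (`features='missing-only'`); all three columns appear above because each has gaps. A small sketch on made-up values (`demo` is hypothetical), which also uses the `add_indicator` parameter of `SimpleImputer`, assuming scikit-learn 0.21 or later, to append the indicator columns to the imputed data:

```python
import numpy as np
from sklearn.impute import MissingIndicator, SimpleImputer

demo = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0]])  # made-up values

# Only the first column has a gap, so only it gets an indicator by default
missing_only = MissingIndicator().fit_transform(demo)
all_columns = MissingIndicator(features='all').fit_transform(demo)
print(missing_only.shape, all_columns.shape)  # (3, 1) (3, 2)

# Imputed values followed by the indicator column(s)
combined = SimpleImputer(add_indicator=True).fit_transform(demo)
print(combined)
```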

## Additional Transformers
### `FunctionTransformer`
With the `FunctionTransformer`, we can apply any function to the data. Passing `validate=True` checks the input and converts it to a two-dimensional NumPy array before the function is applied, raising an error if there is an issue:

In [25]:
from sklearn.preprocessing import FunctionTransformer

FunctionTransformer(
    np.abs, validate=True
).fit_transform(X_train.dropna())

array([[0.073 , 2.6   , 1.6   ],
       [0.38  , 6.7   , 2.71  ],
       [0.008 , 0.039 , 1.5   ],
       ...,
       [0.4   , 1.9   , 5.3   ],
       [0.2   , 0.172 , 0.0162],
       [0.249 , 4.62  , 2.99  ]])
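
As a further illustration, the sketch below wraps a log transformation (chosen here only as an example; `demo` is hypothetical) and supplies `inverse_func` so the transformation can be undone. It also shows the error that `validate=True` raises for one-dimensional input:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

demo = np.array([[0.0, 9.0], [99.0, 999.0]])  # made-up skewed values

log_transformer = FunctionTransformer(
    np.log1p, inverse_func=np.expm1, validate=True
)
logged = log_transformer.fit_transform(demo)
restored = log_transformer.inverse_transform(logged)

# The inverse recovers the original values up to rounding
np.testing.assert_allclose(restored, demo, atol=1e-9)

# validate=True rejects input that isn't two-dimensional
try:
    log_transformer.fit_transform(np.array([1.0, 2.0, 3.0]))
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```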

### `ColumnTransformer`
Sometimes we don't want to perform the same transformation on all of our features. The `ColumnTransformer` lets us specify which transformations to use on each column. We pass a list of tuples in the form (name, transformer object, columns to apply to):

In [26]:
from sklearn.compose import ColumnTransformer 
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

ColumnTransformer(
    [
        ('standard_scale', StandardScaler(), [0, 1]),
        ('min_max', MinMaxScaler(), [2]),
        ('impute', SimpleImputer(), [0, 2])
    ]
).fit_transform(X_train)[15:20] 

array([[            nan,             nan,             nan,
         1.69919971e-01,  2.88416362e+00],
       [-7.91305129e-03, -9.86912907e-02,  7.11757595e-04,
         1.68400000e-01,  1.87200000e-01],
       [            nan,             nan,             nan,
         1.69919971e-01,  2.88416362e+00],
       [            nan, -1.80320454e-01,  4.86689080e-03,
         1.69919971e-01,  1.28000000e+00],
       [ 3.64840414e-01, -1.53050946e-01,  1.20883052e-04,
         2.40000000e-01,  3.18000000e-02]])

We can also use the `make_column_transformer()` function, which will name the transformers for us:

In [27]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical = [
    col for col in planets.columns
    if col in [
        'list', 'name', 'description',
        'discoverymethod', 'lastupdate'
    ]
]
numeric = [col for col in planets.columns if col not in categorical]

make_column_transformer(
    (StandardScaler(), numeric),
    # `sparse` was renamed to `sparse_output` in scikit-learn 1.2
    (OneHotEncoder(sparse=False), categorical)
).fit_transform(planets.dropna())

array([[-0.49212919, -0.00209303, -0.22454741, ...,  1.        ,
         0.        ,  0.        ],
       [-0.40533358, -0.68276379, -0.23169035, ...,  1.        ,
         0.        ,  0.        ],
       [-0.55610666, -1.04338405, -0.23162366, ...,  1.        ,
         0.        ,  0.        ],
       ...,
       [-0.5979688 , -0.68276379, -0.76124811, ...,  0.        ,
         0.        ,  1.        ],
       [-0.62883808, -0.88110493,  0.40907393, ...,  1.        ,
         0.        ,  0.        ],
       [-0.5785095 , -1.04338405, -0.20563417, ...,  1.        ,
         0.        ,  0.        ]])
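
The generated names are the lowercased class names of the transformers. A small sketch on a made-up dataframe (`demo` and its values are hypothetical) that lists them:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

demo = pd.DataFrame({
    'mass': [1.0, 2.0, 3.0],
    'list': ['Confirmed planets', 'Controversial', 'Confirmed planets'],
})

transformer = make_column_transformer(
    (StandardScaler(), ['mass']),
    (OneHotEncoder(), ['list']),
)

# Each entry is (name, transformer, columns)
names = [name for name, _, _ in transformer.transformers]
print(names)  # ['standardscaler', 'onehotencoder']

# One scaled column plus one column per category
result = transformer.fit_transform(demo)
print(result.shape)  # (3, 3)
```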

## `Pipeline`
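Pipelines chain preprocessing steps and a model into a single object, so the transformations fit on the training data are applied in exactly the same way when we predict on the testing data. To make a pipeline, we pass in a list of steps as tuples of (name, object):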

In [28]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

Pipeline([('scale', StandardScaler()), ('lr', LinearRegression())])

Pipeline(memory=None,
     steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('lr', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False))])

We can also use the `make_pipeline()` function to make the pipeline without naming the steps ourselves:

In [29]:
from sklearn.pipeline import make_pipeline

make_pipeline(StandardScaler(), LinearRegression())

Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('linearregression', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False))])
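
Once built, the pipeline is used like a single model. The sketch below fits one on synthetic data (`X_demo` and `y_demo` are made up for illustration) and checks that the scaler's statistics come from the training rows only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: y depends linearly on two features plus a little noise
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 100, size=(200, 2))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(0, 1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_tr, y_tr)  # the scaler is fit on the training rows only

# Steps are reachable by name; the scaler's means come from X_tr
scaler = pipeline.named_steps['standardscaler']
np.testing.assert_allclose(scaler.mean_, X_tr.mean(axis=0))

# score() applies the already-fitted scaler to X_te before predicting
r_squared = pipeline.score(X_te, y_te)
print(r_squared)
```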