# Normalisation and Preprocessing

[sklearn.preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html) and [sklearn.impute](https://scikit-learn.org/stable/modules/impute.html) offer several ways to clean and prepare data:

* Standardisation with [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html), [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html), [MaxAbsScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html) or [RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html).
* Centring of kernel matrices with [KernelCenterer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KernelCenterer.html).
* Non-linear transformations with [QuantileTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html) or [PowerTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html).
* Normalisation with [normalize](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html).
* Encoding of categorical features with [OrdinalEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html) or [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).
* [Discretisation](https://en.wikipedia.org/wiki/Discretization_of_continuous_features) (also known as quantisation or binning) with [KBinsDiscretizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html).
* Binarisation of features with [Binarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html).
* Imputation of missing values with [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html), [IterativeImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html) or [KNNImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html), where the imputed values can be flagged with [MissingIndicator](https://scikit-learn.org/stable/modules/generated/sklearn.impute.MissingIndicator.html).
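
To make two of the items above more concrete, the following is a minimal sketch of categorical encoding and discretisation. The arrays `systems` and `ages` are hypothetical toy data and are not taken from any data set used in this section.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder, OrdinalEncoder

# Hypothetical toy data: one categorical and one numeric feature.
systems = np.array([["heat"], ["cool"], ["heat"], ["fan"]])
ages = np.array([[1.0], [5.0], [12.0], [20.0]])

# OrdinalEncoder maps each category to an integer code;
# the categories are sorted, so cool=0, fan=1, heat=2.
ordinal = OrdinalEncoder().fit_transform(systems)

# OneHotEncoder creates one binary column per category.
# The default output is a sparse matrix, hence toarray().
one_hot = OneHotEncoder().fit_transform(systems).toarray()

# KBinsDiscretizer with strategy="uniform" assigns each value
# to one of n_bins equal-width bins and returns the bin index.
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
binned = binner.fit_transform(ages)

print(ordinal.ravel())
print(one_hot)
print(binned.ravel())
```

Ordinal codes imply an order between categories, which is rarely meaningful for nominal features such as a system type; one-hot encoding avoids this at the cost of more columns.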

<div class="alert alert-block alert-info">

**See also:**

* [statsmodels](https://www.statsmodels.org/stable/index.html)
</div>

## Example

In the following example, we replace missing values with column means and then scale a feature:

### 1. Imports

In [1]:
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd
from datetime import datetime

In [2]:
hvac = pd.read_csv('https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/HVAC_with_nulls.csv')

### 2. Check data quality

Display data types with [pandas.DataFrame.dtypes](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html):

In [3]:
hvac.dtypes

Date           object
Time           object
TargetTemp    float64
ActualTemp      int64
System          int64
SystemAge     float64
BuildingID      int64
10            float64
dtype: object

Return dimensions of the DataFrame as a tuple with [pandas.DataFrame.shape](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html):

In [4]:
hvac.shape

(8000, 8)

Return the first *n* rows (five by default) with [pandas.DataFrame.head](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html):

In [5]:
hvac.head()

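|   | Date | Time | TargetTemp | ActualTemp | System | SystemAge | BuildingID | 10 |
|---|------|------|------------|------------|--------|-----------|------------|----|
| 0 | 6/1/13 | 0:00:01 | 66.0 | 58 | 13 | 20.0 | 4 | NaN |
| 1 | 6/2/13 | 1:00:01 | NaN | 68 | 3 | 20.0 | 17 | NaN |
| 2 | 6/3/13 | 2:00:01 | 70.0 | 73 | 17 | 20.0 | 18 | NaN |
| 3 | 6/4/13 | 3:00:01 | 67.0 | 63 | 2 | NaN | 15 | NaN |
| 4 | 6/5/13 | 4:00:01 | 68.0 | 74 | 16 | 9.0 | 3 | NaN |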
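
The first rows already show gaps in some columns. A quick way to quantify them is to count the missing values per column. The following is a minimal sketch; the DataFrame `df` is a hypothetical stand-in with a few made-up rows, not the HVAC data itself.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few rows of a table with gaps.
df = pd.DataFrame({
    "TargetTemp": [66.0, np.nan, 70.0, 67.0],
    "SystemAge": [20.0, 20.0, 20.0, np.nan],
    "ActualTemp": [58, 68, 73, 63],
})

# isna() marks each missing cell with True;
# sum() counts them per column, mean() gives the share per column.
missing_per_column = df.isna().sum()
missing_share = df.isna().mean()

print(missing_per_column)
print(missing_share)
```

Applied to the real data, the same two calls would be `hvac.isna().sum()` and `hvac.isna().mean()`.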


### 3. Replace missing values with the mean

For this we use the `mean` strategy of [sklearn.impute.SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn-impute-simpleimputer):

In [6]:
imp = SimpleImputer(missing_values=np.nan,
                    strategy='mean')

In [7]:
hvac_numeric = hvac[['TargetTemp', 'SystemAge']]

In [8]:
imp = imp.fit(hvac_numeric.loc[:10])

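Here, `fit` only learns the column means, and because `.loc[:10]` is label-based and inclusive, it learns them from the rows labelled 0 to 10. For more information on `fit`, see the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer.fit).

[fit_transform](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer.fit_transform) then refits the imputer on the complete data and transforms it in a single step, which replaces the means learned before: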

In [9]:
transformed = imp.fit_transform(hvac_numeric)

In [10]:
transformed

array([[66.        , 20.        ],
       [67.50773481, 20.        ],
       [70.        , 20.        ],
       ...,
       [67.50773481,  4.        ],
       [65.        , 23.        ],
       [66.        , 21.        ]])
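
The difference between `fit` followed by `transform` and a single `fit_transform` can be sketched on a small array. The values in `X` are hypothetical and chosen only so that the means are easy to check by hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one gap in each column.
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [8.0, 30.0],
])

# fit() only learns the column means, here from the first three rows.
imp_partial = SimpleImputer(missing_values=np.nan, strategy="mean")
imp_partial.fit(X[:3])
# transform() fills the gaps using those previously learned means.
filled_partial = imp_partial.transform(X)

# fit_transform() learns the means from all rows and fills the gaps at once.
imp_full = SimpleImputer(missing_values=np.nan, strategy="mean")
filled_full = imp_full.fit_transform(X)

print(imp_partial.statistics_)  # means of the first three rows
print(imp_full.statistics_)     # means of all rows
```

The learned means are stored in the `statistics_` attribute. Fitting on a subset and transforming the rest is the usual pattern when the statistics should come from training data only.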

In [11]:
hvac['TargetTemp'], hvac['SystemAge'] = transformed[:,0], transformed[:,1]

Now we display the first rows with the imputed values:

In [12]:
hvac.head()

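|   | Date | Time | TargetTemp | ActualTemp | System | SystemAge | BuildingID | 10 |
|---|------|------|------------|------------|--------|-----------|------------|----|
| 0 | 6/1/13 | 0:00:01 | 66.0 | 58 | 13 | 20.0 | 4 | NaN |
| 1 | 6/2/13 | 1:00:01 | 67.507735 | 68 | 3 | 20.0 | 17 | NaN |
| 2 | 6/3/13 | 2:00:01 | 70.0 | 73 | 17 | 20.0 | 18 | NaN |
| 3 | 6/4/13 | 3:00:01 | 67.0 | 63 | 2 | 15.386643 | 15 | NaN |
| 4 | 6/5/13 | 4:00:01 | 68.0 | 74 | 16 | 9.0 | 3 | NaN |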
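
After imputation, filled-in values can no longer be told apart from measured ones. One way to keep that information is the `add_indicator` parameter of `SimpleImputer`, which appends indicator columns in the manner of `MissingIndicator`. The following is a minimal sketch on hypothetical toy values, not on the HVAC data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: two columns, each with one missing value.
X = np.array([
    [66.0, 20.0],
    [np.nan, 20.0],
    [70.0, np.nan],
])

# add_indicator=True appends one 0/1 column for every feature
# that had missing values when the imputer was fitted.
imp = SimpleImputer(strategy="mean", add_indicator=True)
result = imp.fit_transform(X)

# Columns 0-1: imputed data, columns 2-3: missing-value flags.
print(result)
```

The flag columns can be kept as additional features so that a later model can still distinguish imputed from observed values.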


### 4. Scale

[sklearn.preprocessing.scale](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html) standardises a feature by subtracting its mean and dividing by its standard deviation, so that the result has zero mean and unit variance. This works best for data that is roughly normally distributed. Each scaled value then states how many standard deviations the original value lies above or below the mean. We use this to scale the actual temperature.

In [13]:
hvac['ScaledTemp'] = preprocessing.scale(hvac['ActualTemp'])

In [14]:
hvac['ScaledTemp'].head()

0   -1.293272
1    0.048732
2    0.719733
3   -0.622270
4    0.853934
Name: ScaledTemp, dtype: float64
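
To illustrate what `scale` computes, the following sketch compares it with the manual calculation. The array `temps` holds a few hypothetical temperatures, so the numbers differ from the output above, which is based on all rows of the data.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical temperatures, for illustration only.
temps = np.array([58.0, 68.0, 73.0, 63.0, 74.0])

scaled = preprocessing.scale(temps)

# Manual standardisation: subtract the mean, divide by the
# population standard deviation (NumPy's default, ddof=0).
manual = (temps - temps.mean()) / temps.std()

print(np.allclose(scaled, manual))
print(scaled.mean(), scaled.std())
```

Note that `pandas.Series.std` uses the sample standard deviation (`ddof=1`) by default, so a manual calculation in pandas needs `std(ddof=0)` to match `scale`.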

[sklearn.preprocessing.MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) scales each feature so that its values lie between a given minimum and maximum, by default between zero and one. One advantage is that this scaling is robust to very small standard deviations of features.

In [15]:
min_max_scaler = preprocessing.MinMaxScaler()

In [16]:
temp_minmax = min_max_scaler.fit_transform(hvac[['ActualTemp']])

In [17]:
temp_minmax

array([[0.12],
       [0.52],
       [0.72],
       ...,
       [0.56],
       [0.32],
       [0.44]])

Now we also add `temp_minmax` as a new column:

In [18]:
hvac['MinMaxScaledTemp'] = temp_minmax[:,0]
hvac['MinMaxScaledTemp'].head()

0    0.12
1    0.52
2    0.72
3    0.32
4    0.76
Name: MinMaxScaledTemp, dtype: float64
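
The min-max transformation itself is `(x - min) / (max - min)` for the default range of zero to one. The following sketch checks this on hypothetical values. They are chosen so that 58, 68 and 73 map to 0.12, 0.52 and 0.72, which assumes a column minimum of 55 and a maximum of 80; the data shown here does not confirm these bounds.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical temperatures; 55 and 80 are assumed bounds.
temps = np.array([[55.0], [58.0], [68.0], [73.0], [80.0]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(temps)

# Manual min-max scaling to the range [0, 1].
manual = (temps - temps.min()) / (temps.max() - temps.min())

# inverse_transform maps scaled values back to the original units.
restored = scaler.inverse_transform(scaled)

print(scaled.ravel())
print(np.allclose(scaled, manual), np.allclose(restored, temps))
```

Because the result depends only on the minimum and maximum, a single extreme value compresses all other values into a narrow part of the range.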

In [19]:
hvac.head()

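|   | Date | Time | TargetTemp | ActualTemp | System | SystemAge | BuildingID | 10 | ScaledTemp | MinMaxScaledTemp |
|---|------|------|------------|------------|--------|-----------|------------|----|------------|------------------|
| 0 | 6/1/13 | 0:00:01 | 66.0 | 58 | 13 | 20.0 | 4 | NaN | -1.293272 | 0.12 |
| 1 | 6/2/13 | 1:00:01 | 67.507735 | 68 | 3 | 20.0 | 17 | NaN | 0.048732 | 0.52 |
| 2 | 6/3/13 | 2:00:01 | 70.0 | 73 | 17 | 20.0 | 18 | NaN | 0.719733 | 0.72 |
| 3 | 6/4/13 | 3:00:01 | 67.0 | 63 | 2 | 15.386643 | 15 | NaN | -0.62227 | 0.32 |
| 4 | 6/5/13 | 4:00:01 | 68.0 | 74 | 16 | 9.0 | 3 | NaN | 0.853934 | 0.76 |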
