# Scikit Learn Preprocessing

The scikit-learn library ships many ready-made machine learning tools. This notebook concentrates on the data preprocessing techniques that scikit-learn makes easy.

In [1]:
import warnings
import numpy as np
warnings.filterwarnings(action='ignore', category=DeprecationWarning)
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]

## The two stage process
Most scikit-learn estimators work in two stages: 1) a fit stage and 2) a transform or predict stage.

### Fit
In the fit stage, the estimator computes and stores all the parameters/statistics it needs from the data passed to `fit`.

### Transform or Predict
For data preprocessing, the second stage is the transform stage. Using the statistics calculated in the fit stage, we transform the data accordingly.

For machine learning models, the second stage is the predict stage. Using the model parameters learned in the fit stage, we make predictions on test data.
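The two stages can be sketched as follows. This is a minimal illustration rather than a recommended pipeline: the `train` and `test` arrays are hypothetical values chosen for this example, and `MinMaxScaler` (introduced later in this notebook) stands in for any preprocessing estimator.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical train/test data, used only for illustration.
train = np.array([[-1.0, 2.0], [-0.5, 6.0], [0.0, 10.0], [1.0, 18.0]])
test = np.array([[0.5, 14.0], [2.0, 22.0]])

scaler = MinMaxScaler()

# Stage 1 (fit): learn the per-column min and max from the training data only.
scaler.fit(train)

# Stage 2 (transform): reuse those stored statistics on unseen data.
test_scaled = scaler.transform(test)
print(test_scaled)
```

Because the statistics come from `train` alone, values in `test` that fall outside the training range are mapped outside `[0, 1]`.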

# Preprocessing Continuous Variables

## Min-Max Scaling

In [2]:
from sklearn.preprocessing import MinMaxScaler

In [3]:
scaler = MinMaxScaler()
scaler.fit(data)

MinMaxScaler(copy=True, feature_range=(0, 1))

In [4]:
# The learned statistics from the fit stage
print(scaler.data_max_)
print(scaler.data_min_)

[ 1. 18.]
[-1.  2.]


In [5]:
# Using the learned statistics for min-max scaling
print(scaler.transform(data))

[[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [1.   1.  ]]


In [6]:
# We can also combine both the steps in one function call
print(scaler.fit_transform(data))

[[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [1.   1.  ]]
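To make the computation explicit, here is a minimal NumPy sketch of what min-max scaling does under the default `feature_range=(0, 1)`. It assumes no column is constant (otherwise the denominator would be zero); the variable names `X`, `col_min`, `col_max` and `manual` are introduced only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]], dtype=float)

# Per-column statistics, the same quantities stored in data_min_ and data_max_.
col_min = X.min(axis=0)
col_max = X.max(axis=0)

# Rescale each column so that its minimum maps to 0 and its maximum to 1.
manual = (X - col_min) / (col_max - col_min)

# Compare against scikit-learn's result on the same data.
print(np.allclose(manual, MinMaxScaler().fit_transform(X)))
```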


## Standardization Scaling

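If `x` is the original variable, then after standardization it becomes (`x` - `mean`) / `sigma`, where `mean` and `sigma` are the mean and standard deviation of that column, computed in the fit stage.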
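The formula can be checked by hand with NumPy. This is a sketch on the notebook's sample data, assuming the default `StandardScaler` settings; note that `sigma` here is the population standard deviation (`ddof=0`), which is NumPy's default for `std`. The names `X`, `mean`, `sigma` and `manual` are local to this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]], dtype=float)

# Per-column mean and population standard deviation (ddof=0).
mean = X.mean(axis=0)
sigma = X.std(axis=0)

# Apply (x - mean) / sigma column by column.
manual = (X - mean) / sigma

# Compare against scikit-learn's result on the same data.
print(np.allclose(manual, StandardScaler().fit_transform(X)))
```

After this transformation each column has mean 0 and variance 1.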


In [7]:
from sklearn.preprocessing import StandardScaler

In [8]:
scaler = StandardScaler()
scaler.fit(data)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [9]:
# The learned statistics from the fit stage
print(scaler.mean_)
print(scaler.var_)

[-0.125  9.   ]
[ 0.546875 35.      ]


In [10]:
# Using the learned statistics for standardization
print(scaler.transform(data))

[[-1.18321596 -1.18321596]
 [-0.50709255 -0.50709255]
 [ 0.16903085  0.16903085]
 [ 1.52127766  1.52127766]]


In [11]:
# We can also combine both the steps in one function call
print(scaler.fit_transform(data))

[[-1.18321596 -1.18321596]
 [-0.50709255 -0.50709255]
 [ 0.16903085  0.16903085]
 [ 1.52127766  1.52127766]]


# Preprocessing Categorical Variables

Categorical variables in a typical dataset may be present either as arbitrary integers or as strings. The `LabelEncoder` and `OneHotEncoder` classes are well suited to preprocessing them. There are two steps to preprocess categorical variables.

## Step 1 - Map categories to a Sequence of Increasing Numbers

## Step 2 - Use this Sequence for a One-Hot Encoding

# Step 1

## If Categorical Variables are Numbers

In [12]:
data_type_num = [1,2,2,6,2,1,1,1,6,2,6,6]

In [13]:
from sklearn.preprocessing import LabelEncoder

In [14]:
le_num = LabelEncoder()
le_num.fit(data_type_num)
le_num.classes_

array([1, 2, 6])

In [15]:
print("Transform : {}".format(le_num.transform(data_type_num)))
print("Inverse Transform : {}".format(le_num.inverse_transform([0, 0, 1, 2])))

Transform : [0 1 1 2 1 0 0 0 2 1 2 2]
Inverse Transform : [1 1 2 6]



In [16]:
print(le_num.fit_transform(data_type_num))

[0 1 1 2 1 0 0 0 2 1 2 2]
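As a rough mental model, the mapping in Step 1 behaves like `np.unique` with `return_inverse=True`: the sorted unique values play the role of `classes_`, and the inverse indices are the encoded labels. This is an illustrative equivalence on the sample data, not a description of how `LabelEncoder` is implemented internally.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

values = [1, 2, 2, 6, 2, 1, 1, 1, 6, 2, 6, 6]

# Sorted unique categories, plus the index of each value within them.
classes, codes = np.unique(values, return_inverse=True)
print(classes)
print(codes)

# Indexing the categories with the codes recovers the original values,
# which mirrors what inverse_transform does.
recovered = classes[codes]

# Compare with LabelEncoder on the same input.
print(np.array_equal(codes, LabelEncoder().fit_transform(values)))
```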


## If Categorical Variables are Strings

In [17]:
data_type_str = ["paris", "paris", "tokyo", "amsterdam"]

In [18]:
le_str = LabelEncoder()
le_str.fit(data_type_str)
le_str.classes_

array(['amsterdam', 'paris', 'tokyo'], dtype='<U9')

In [19]:
print("Transform : {}".format(le_str.transform(data_type_str)))
print("Inverse Transform : {}".format( le_str.inverse_transform([2, 2, 1]) ))

Transform : [1 1 2 0]
Inverse Transform : ['tokyo' 'tokyo' 'paris']



In [20]:
print(le_str.fit_transform(data_type_str))

[1 1 2 0]


# Step 2

After we get a sequence of numbers, it is important to represent this information as one-hot vectors. One reason is that any machine learning model fed with this data must be able to differentiate between an ordinary numerical variable and a categorical variable: integer codes imply an ordering and a distance between categories that do not actually exist. An example of one-hot encoding is given below.

![title](one-hot-encoding.png)
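One-hot encoding can also be sketched by hand: each integer code selects a row of the identity matrix. The snippet below assumes the codes are the label-encoded city names from Step 1 and that every category appears at least once, so that `codes.max() + 1` gives the number of categories.

```python
import numpy as np

# Label-encoded ["paris", "paris", "tokyo", "amsterdam"] from Step 1.
codes = np.array([1, 1, 2, 0])

# Number of categories, assuming codes run from 0 to n_classes - 1.
n_classes = codes.max() + 1

# Row i of the identity matrix is the one-hot vector for category i.
one_hot = np.eye(n_classes)[codes]
print(one_hot)
```

Each row contains a single 1, placed in the column of its category.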

In [21]:
from sklearn.preprocessing import OneHotEncoder

In [22]:
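# sparse=False returns a dense NumPy array instead of a sparse matrix.
# Note: scikit-learn 1.2 and later rename this parameter to sparse_output.
one_hot_num = OneHotEncoder(sparse=False)
one_hot_str = OneHotEncoder(sparse=False)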

In [23]:
# Pass the output of the label encoder to one hot encoder
data_num = np.expand_dims(le_num.fit_transform(data_type_num),axis=1)
one_hot_num.fit_transform(data_num)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.]])

In [24]:
# Pass the output of the label encoder to one hot encoder
data_str = np.expand_dims(le_str.fit_transform(data_type_str),axis=1)
one_hot_str.fit_transform(data_str)

array([[0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])
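In scikit-learn 0.20 and later, `OneHotEncoder` accepts string categories directly, so the two steps can be collapsed into one for input features. The sketch below assumes such a version; it calls `.toarray()` on the default sparse output to avoid depending on the name of the dense-output parameter, which differs between releases.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects a 2-D array: one row per sample, one column per feature.
cities = np.array(["paris", "paris", "tokyo", "amsterdam"]).reshape(-1, 1)

encoder = OneHotEncoder()
# The default output is a sparse matrix; convert it to a dense array for display.
dense = encoder.fit_transform(cities).toarray()

print(encoder.categories_)
print(dense)
```

The columns follow the sorted order stored in `categories_`, which matches the order `LabelEncoder` produced above.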