# Pre Processing

<div class="list-group" id="list-tab" role="tablist">
  <h3 class="list-group-item list-group-item-action active" data-toggle="list"  role="tab" aria-controls="home">Notebook Content</h3>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#Data-Preprocessing" role="tab" aria-controls="profile">Data Preprocessing<span class="badge badge-primary badge-pill"></span></a>
     <div class="list-group" id="list-tab" role="tablist">
  <a class="list-group-item list-group-item-action active" data-toggle="list" href="#Missing-Value-Treatments" role="tab" aria-controls="home">Missing Value Treatment</a>
 <a class="list-group-item list-group-item-action" data-toggle="list" href="#Mean-or-median-or-other-summary-statistic-substitution" role="tab" aria-controls="messages"> Missing Value Treatment with mean<span class="badge badge-primary badge-pill"></span></a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#Forward-fill-and-backward-fill-(can-be-used-according-to-business-problem)" role="tab" aria-controls="messages">Forward and Backward fill<span class="badge badge-primary badge-pill"></span></a>
   <a class="list-group-item list-group-item-action" data-toggle="list" href="#Nearest-neighbors-imputation" role="tab" aria-controls="messages">Nearest neighbors imputation<span class="badge badge-primary badge-pill"></span></a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#MultiOutputRegressor" role="tab" aria-controls="messages">MultiOutputRegressor<span class="badge badge-primary badge-pill"></span></a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#IterativeImpute" role="tab" aria-controls="messages">IterativeImpute<span class="badge badge-primary badge-pill"></span></a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#Time-Series-Specific-Methods" role="tab" aria-controls="messages">Time-Series Specific Methods<span class="badge badge-primary badge-pill"></span></a>
<div class="list-group" id="list-tab" role="tablist">
  <a class="list-group-item list-group-item-action active" data-toggle="list" href="#Rescalling-Data" role="tab" aria-controls="home">Rescalling</a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#MinMaxScaler" role="tab" aria-controls="settings">MinMaxScaler<span class="badge badge-primary badge-pill"></span></a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#MaxAbsScaler" role="tab" aria-controls="settings">MaxAbsScaler<span class="badge badge-primary badge-pill"></span></a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#Robust-Scaler" role="tab" aria-controls="settings">Robust Scaler<span class="badge badge-primary badge-pill"></span></a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#StandardScaler" role="tab" aria-controls="settings">StandardScaler<span class="badge badge-primary badge-pill"></span></a>
 <div class="list-group" id="list-tab" role="tablist">
  <a class="list-group-item list-group-item-action active" data-toggle="list" href="#Data-Transformation" role="tab" aria-controls="home">Data Transformation</a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#Quantile-Transformation" role="tab" aria-controls="settings">Quantile Transformation<span class="badge badge-primary badge-pill"></span></a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#Power-Transformation" role="tab" aria-controls="settings">Power Transformation<span class="badge badge-primary badge-pill"></span></a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#Custom-Transformation" role="tab" aria-controls="settings">Custom Transformation<span class="badge badge-primary badge-pill"></span></a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#Data-Normalization" role="tab" aria-controls="settings">Data-Normalization<span class="badge badge-primary badge-pill"></span></a>
<div class="list-group" id="list-tab" role="tablist">
  <a class="list-group-item list-group-item-action active" data-toggle="list" href="#Handling-Categorical-Variable" role="tab" aria-controls="home">Handling Categorical Variable</a>

# Data Preprocessing

Pre-processing refers to the **transformations applied to data** before it is fed to an algorithm. Data preprocessing **converts raw data into a clean dataset**. Data gathered from different sources usually arrives in a raw format that is not suitable for analysis; pre-processing helps bring it into the desired format.

## Need for Data Preprocessing

**To achieve better results** from a machine learning model, the data has to be in a suitable format. Some models require input in a specific form. For example, many Random Forest implementations do not accept null values, so nulls have to be handled in the raw data set before the algorithm can run. The data set should also be formatted so that several machine learning and deep learning algorithms can be run on the same data and the best of them chosen.

## Different data preprocesses

The different pre-processing techniques are listed below; we will look at each of them in detail:

![Types%20of%20Pre_Processing.PNG](attachment:Types%20of%20Pre_Processing.PNG)

<a class="list-group-item list-group-item-action" data-toggle="list" href="#Pre-Processing" role="tab" aria-controls="settings">Go to Top<span class="badge badge-primary badge-pill"></span></a>

# __Missing Value Treatments__
The methods used to handle missing values are as follows:<br>
1. Drop missing values
2. Fill missing values with a summary statistic
3. Predict missing values with a machine learning algorithm

In [1]:
import pandas as pd
import numpy as np
# Build a small dataset with some missing values
# (named `scores` so the built-in `dict` is not shadowed)
scores = {'First Score': [100, 90, np.nan, 95, 75],
          'Second Score': [30, 45, 56, np.nan, np.nan],
          'Third Score': [np.nan, 40, 98, 98, 56]}

# Create a DataFrame from the dictionary
df = pd.DataFrame(scores)
print(df)
print('\nNo of null values:')
df.isnull().sum()

   First Score  Second Score  Third Score
0        100.0          30.0          NaN
1         90.0          45.0         40.0
2          NaN          56.0         98.0
3         95.0           NaN         98.0
4         75.0           NaN         56.0

No of null values:


First Score     1
Second Score    2
Third Score     1
dtype: int64

In [2]:
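# Missing values are not always stored as NaN; some sources use a sentinel such as 0.
# To simulate that case, NaN is replaced with 0 here. Such sentinels should be mapped
# back to NaN (for example with .replace(0, np.nan)) before imputation.
df_2 = df.copy()
df_2['First Score'] = df_2['First Score'].replace(np.nan, 0)
df_2[df_2['First Score'] == 0].head(2)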

Unnamed: 0,First Score,Second Score,Third Score
2,0.0,56.0,98.0


## Imputation vs Removing Data
Before jumping to the methods of data imputation, we have to understand why data goes missing.
1. **Missing completely at random**: The probability of a value being missing is the same for all observations. For example, respondents in a survey decide whether to declare their earnings by tossing a fair coin: if heads comes up, the respondent declares his or her earnings, otherwise not. Here each observation has an equal chance of having a missing value.
2. **Missing at random**: The value is missing at random, but the missing ratio varies with the values or levels of other input variables. For example, we are collecting data on age and females have a higher share of missing values than males.
3. **Missing that depends on unobserved predictors**: The missing values are not random and are related to an unobserved input variable. For example, in a medical study, if a particular diagnostic causes discomfort, there is a higher chance of dropping out of the study. These missing values are not at random unless "discomfort" is included as an input variable for all patients.
4. **Missing that depends on the missing value itself**: The probability of a value being missing is directly correlated with the value itself. For example, people with very high or very low incomes are more likely not to report their earnings.
 

**Simple approaches**<br>
A number of simple approaches exist. For basic use cases, these are often enough.<br><br>
**Dropping rows with null values**
1. Appropriate when the number of data points is high enough that dropping some of them will not reduce the generalizability of the models built (a learning curve can be used to check whether this is the case)
2. Dropping too much data is dangerous
3. If the data set is large and missing values make up roughly 3-5% of it, dropping them is feasible

In [3]:
df_3=df.copy()
df_3.dropna()

Unnamed: 0,First Score,Second Score,Third Score
1,90.0,45.0,40.0


**Dropping features with high nullity**

A feature that has a high number of empty values is unlikely to be very useful for prediction. It can often be safely dropped.
<br>**Note:** Before deciding that a variable is not useful, validate this with a feature importance test; a tree-based method can be used.

In [4]:
df_2.drop(['Second Score'], axis= 1, inplace = True)
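
The rule of thumb above can be sketched as a small helper step that measures the share of missing values per column and drops the columns above a cut-off. The 30% threshold and the variable names (`max_missing`, `cols_to_drop`, `reduced`) are assumptions chosen for illustration, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

# Same toy scores as in the cells above
scores = pd.DataFrame({
    'First Score': [100, 90, np.nan, 95, 75],
    'Second Score': [30, 45, 56, np.nan, np.nan],
    'Third Score': [np.nan, 40, 98, 98, 56],
})

# Fraction of missing values in each column
missing_fraction = scores.isnull().mean()

# Hypothetical cut-off: drop columns with more than 30% missing values
max_missing = 0.30
cols_to_drop = missing_fraction[missing_fraction > max_missing].index.tolist()
reduced = scores.drop(columns=cols_to_drop)

print(missing_fraction)
print('Dropped columns:', cols_to_drop)
```

In practice the cut-off depends on the data set and should be checked against feature importance, as noted above.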

<a class="list-group-item list-group-item-action" data-toggle="list" href="#Pre-Processing" role="tab" aria-controls="settings">Go to Top<span class="badge badge-primary badge-pill"></span></a>

## Mean or median or other summary statistic substitution
When to use:
1. Check for outliers; if there are few outliers, mean imputation can be used
2. If there are many outliers, median imputation can be used
3. For categorical variables, mode imputation can be used

<br>**NOTE:** This is acceptable if less than about 3% of the data is missing; otherwise it introduces too much bias and artificially lowers the variability of the data.

In [5]:
# Simple illustration of missing value imputation with the mean
# Other SimpleImputer strategies include 'median' and 'most_frequent' (mode)
import numpy as np
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
# transform() replaces every NaN with the mean learned for its column during fit()
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
print(imp_mean.transform(X))

[[ 7.   2.   3. ]
 [ 4.   3.5  6. ]
 [10.   3.5  9. ]]
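
The cell above only uses the mean. As a complementary sketch, the median and mode strategies can be tried on made-up data: a numeric column with one outlier and a small categorical column. The arrays and variable names here are illustrative assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Numeric column with an outlier (100): the median is less affected than the mean
X_num = np.array([[1.0], [2.0], [3.0], [100.0], [np.nan]])
median_imputer = SimpleImputer(strategy='median')
X_num_filled = median_imputer.fit_transform(X_num)
print(X_num_filled.ravel())

# Categorical column: 'most_frequent' fills the gap with the mode
X_cat = np.array([['red'], ['blue'], ['red'], [np.nan]], dtype=object)
mode_imputer = SimpleImputer(strategy='most_frequent')
X_cat_filled = mode_imputer.fit_transform(X_cat)
print(X_cat_filled.ravel())
```

With the outlier present, the mean of the observed values would be 26.5, while the median stays at 2.5, which is why the median is the safer choice in that case.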


In [6]:
import pandas as pd
import numpy as np

df=pd.DataFrame([["XXL", 8, "black", "class 1", 22],
["L", np.nan, "gray", "class 2", 20],
["XL", 10, "blue", "class 2", 19],
["M", np.nan, "orange","class 1", 17],
["M", 11, "green", "class 3", np.nan],
["M", 7, "red", "class 1", 22]])

df.columns=["size", "price", "color", "class", "boh"]
df_copy= df.copy()
df

Unnamed: 0,size,price,color,class,boh
0,XXL,8.0,black,class 1,22.0
1,L,,gray,class 2,20.0
2,XL,10.0,blue,class 2,19.0
3,M,,orange,class 1,17.0
4,M,11.0,green,class 3,
5,M,7.0,red,class 1,22.0


In [7]:
# Impute a single column; the strategy can be 'mean', 'median' or 'most_frequent' (mode)
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit(df[['boh']])
df["boh"]=imp_mean.transform(df[["boh"]])
df

Unnamed: 0,size,price,color,class,boh
0,XXL,8.0,black,class 1,22.0
1,L,,gray,class 2,20.0
2,XL,10.0,blue,class 2,19.0
3,M,,orange,class 1,17.0
4,M,11.0,green,class 3,20.0
5,M,7.0,red,class 1,22.0


In [8]:
# some other column 
imp_mean.fit(df[['price']])
df["price"]=imp_mean.transform(df[["price"]])
df

Unnamed: 0,size,price,color,class,boh
0,XXL,8.0,black,class 1,22.0
1,L,9.0,gray,class 2,20.0
2,XL,10.0,blue,class 2,19.0
3,M,9.0,orange,class 1,17.0
4,M,11.0,green,class 3,20.0
5,M,7.0,red,class 1,22.0


In [9]:
# Column-specific imputation in a DataFrame using fillna
# Note: 'price' was already imputed above, so fillna changes nothing in this run
df_4 = df.copy()
mean_value=df_4['price'].mean()
df_4['First score']=df_4['price'].fillna(mean_value)
# This replaces all NaN values with the mean of the non-null values
# For the median
median_value=df_4['price'].median()
df_4['Second Score']=df_4['price'].fillna(median_value)
print(df_4)

  size  price   color    class   boh  First score  Second Score
0  XXL    8.0   black  class 1  22.0          8.0           8.0
1    L    9.0    gray  class 2  20.0          9.0           9.0
2   XL   10.0    blue  class 2  19.0         10.0          10.0
3    M    9.0  orange  class 1  17.0          9.0           9.0
4    M   11.0   green  class 3  20.0         11.0          11.0
5    M    7.0     red  class 1  22.0          7.0           7.0


<a class="list-group-item list-group-item-action" data-toggle="list" href="#Pre-Processing" role="tab" aria-controls="settings">Go to Top<span class="badge badge-primary badge-pill"></span></a>

### Forward fill and backward fill (can be used according to business problem)
Forward filling fills each missing value with the previous observed value. Backward filling fills each missing value with the next observed value.

In [10]:
# Creating the Series 
sr = pd.Series([100, None, None, 18, 65, None, 32, 10, 5, 24, 60]) 

# Create the index (month-end dates; pandas 2.2+ prefers the alias 'ME' over 'M')
index_ = pd.date_range('2010-10-09', periods = 11, freq ='M')   

# set the index
sr.index = index_   

# Print the series
print('Series  :\n',sr) 


Series  :
 2010-10-31    100.0
2010-11-30      NaN
2010-12-31      NaN
2011-01-31     18.0
2011-02-28     65.0
2011-03-31      NaN
2011-04-30     32.0
2011-05-31     10.0
2011-06-30      5.0
2011-07-31     24.0
2011-08-31     60.0
Freq: M, dtype: float64


In [11]:
result = sr.ffill()  # fillna(method='ffill') is deprecated in recent pandas versions
print('Series after forward fill :\n',result)


Series after forward fill :
 2010-10-31    100.0
2010-11-30    100.0
2010-12-31    100.0
2011-01-31     18.0
2011-02-28     65.0
2011-03-31     65.0
2011-04-30     32.0
2011-05-31     10.0
2011-06-30      5.0
2011-07-31     24.0
2011-08-31     60.0
Freq: M, dtype: float64


In [12]:
result = sr.bfill()  # fillna(method='bfill') is deprecated in recent pandas versions
print('Series after backward fill :\n',result)


Series after backward fill :
 2010-10-31    100.0
2010-11-30     18.0
2010-12-31     18.0
2011-01-31     18.0
2011-02-28     65.0
2011-03-31     32.0
2011-04-30     32.0
2011-05-31     10.0
2011-06-30      5.0
2011-07-31     24.0
2011-08-31     60.0
Freq: M, dtype: float64


<a class="list-group-item list-group-item-action" data-toggle="list" href="#Pre-Processing" role="tab" aria-controls="settings">Go to Top<span class="badge badge-primary badge-pill"></span></a>

### Nearest neighbors imputation
Nearest neighbors imputation can be used for continuous, discrete, ordinal and categorical data, which makes it useful for many kinds of missing data. The assumption behind using KNN for missing values is that a point's value can be approximated by the values of the points closest to it, based on the other variables. <br><br>The distance metric varies according to the type of data:
1. **Continuous Data**: The commonly used distance metrics for continuous data are Euclidean, Manhattan and Cosine
2. **Categorical Data**: Hamming distance is generally used in this case. It compares all the categorical attributes and counts the ones on which two points differ
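
To make the Hamming distance concrete, here is a minimal pure-Python sketch. The helper `hamming_distance` and the two example records are hypothetical and only illustrate the definition; scikit-learn's `KNNImputer`, used in the cells below, works on numeric data with a NaN-aware Euclidean metric by default.

```python
def hamming_distance(a, b):
    """Count the attributes on which two records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b))

# Two records described by (size, color, class)
row_a = ("M", "red", "class 1")
row_b = ("M", "green", "class 3")

# The records agree on size and differ on color and class
print(hamming_distance(row_a, row_b))
```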

In [13]:
import numpy as np
from sklearn.impute import KNNImputer
X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
imputer = KNNImputer(n_neighbors=2)
imputer.fit_transform(X)

array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])

In [14]:
df

Unnamed: 0,size,price,color,class,boh
0,XXL,8.0,black,class 1,22.0
1,L,9.0,gray,class 2,20.0
2,XL,10.0,blue,class 2,19.0
3,M,9.0,orange,class 1,17.0
4,M,11.0,green,class 3,20.0
5,M,7.0,red,class 1,22.0


In [15]:
# KNN Imputer for a dataframe
# Note: 'boh' has no missing values left at this point, so it is returned unchanged
import numpy as np
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
q = imputer.fit_transform(df[['boh']])
df['boh'] = q
df

Unnamed: 0,size,price,color,class,boh
0,XXL,8.0,black,class 1,22.0
1,L,9.0,gray,class 2,20.0
2,XL,10.0,blue,class 2,19.0
3,M,9.0,orange,class 1,17.0
4,M,11.0,green,class 3,20.0
5,M,7.0,red,class 1,22.0


In [16]:
q = imputer.fit_transform(df[['price']])
df['price'] = q
df

Unnamed: 0,size,price,color,class,boh
0,XXL,8.0,black,class 1,22.0
1,L,9.0,gray,class 2,20.0
2,XL,10.0,blue,class 2,19.0
3,M,9.0,orange,class 1,17.0
4,M,11.0,green,class 3,20.0
5,M,7.0,red,class 1,22.0


<a class="list-group-item list-group-item-action" data-toggle="list" href="#Pre-Processing" role="tab" aria-controls="settings">Go to Top<span class="badge badge-primary badge-pill"></span></a>

### MultiOutputRegressor


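This strategy consists of fitting one regressor per target. This is a simple strategy for extending regressors that do not natively support multi-target regression. In the context of missing values, it could be used to predict several incomplete columns at once from the complete ones.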

In [17]:
import numpy as np
from sklearn.datasets import load_linnerud
from sklearn.multioutput import MultiOutputRegressor
from sklearn.linear_model import Ridge
X, y = load_linnerud(return_X_y=True)
print(X)
print(y)
clf = MultiOutputRegressor(Ridge(random_state=123)).fit(X, y)
pred =clf.predict(X[[0]])

[[  5. 162.  60.]
 [  2. 110.  60.]
 [ 12. 101. 101.]
 [ 12. 105.  37.]
 [ 13. 155.  58.]
 [  4. 101.  42.]
 [  8. 101.  38.]
 [  6. 125.  40.]
 [ 15. 200.  40.]
 [ 17. 251. 250.]
 [ 17. 120.  38.]
 [ 13. 210. 115.]
 [ 14. 215. 105.]
 [  1.  50.  50.]
 [  6.  70.  31.]
 [ 12. 210. 120.]
 [  4.  60.  25.]
 [ 11. 230.  80.]
 [ 15. 225.  73.]
 [  2. 110.  43.]]
[[191.  36.  50.]
 [189.  37.  52.]
 [193.  38.  58.]
 [162.  35.  62.]
 [189.  35.  46.]
 [182.  36.  56.]
 [211.  38.  56.]
 [167.  34.  60.]
 [176.  31.  74.]
 [154.  33.  56.]
 [169.  34.  50.]
 [166.  33.  52.]
 [154.  34.  64.]
 [247.  46.  50.]
 [193.  36.  46.]
 [202.  37.  62.]
 [176.  37.  54.]
 [157.  32.  52.]
 [156.  33.  54.]
 [138.  33.  68.]]


In [18]:
pred

array([[176.16484296,  35.0548407 ,  57.09000136]])
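
The "one regressor per target" idea can be checked directly: after fitting, the wrapper exposes the per-target models in its `estimators_` attribute. The sketch below reuses the Linnerud data from the cell above; the variable names `multi` and `stacked` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_linnerud
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor

X, y = load_linnerud(return_X_y=True)
multi = MultiOutputRegressor(Ridge(random_state=123)).fit(X, y)

# One fitted Ridge model per target column (y has three columns)
print(len(multi.estimators_))

# Predicting with each per-target model and stacking the results
# gives the same values as the wrapper's own predict()
stacked = np.column_stack([est.predict(X[[0]]) for est in multi.estimators_])
print(np.allclose(stacked, multi.predict(X[[0]])))
```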

<a class="list-group-item list-group-item-action" data-toggle="list" href="#Pre-Processing" role="tab" aria-controls="settings">Go to Top<span class="badge badge-primary badge-pill"></span></a>

### IterativeImpute

`IterativeImputer` is a multivariate imputer that estimates each feature from all the others. It imputes missing values by modeling each feature with missing values as a function of the other features in a round-robin fashion.

In [19]:
import numpy as np
# IterativeImputer is experimental and must be enabled explicitly before it is imported
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imp_mean = IterativeImputer(random_state=0)
imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
imp_mean.transform(X)

array([[ 6.95847623,  2.        ,  3.        ],
       [ 4.        ,  2.6000004 ,  6.        ],
       [10.        ,  4.99999933,  9.        ]])

<a class="list-group-item list-group-item-action" data-toggle="list" href="#Pre-Processing" role="tab" aria-controls="settings">Go to Top<span class="badge badge-primary badge-pill"></span></a>

## Time-Series Specific Methods
1. **Last Observation Carried Forward (LOCF) & Next Observation Carried Backward (NOCB)**
<br>This is a common statistical approach to the analysis of longitudinal repeated-measures data where some follow-up observations may be missing. Longitudinal data track the same sample at different points in time. Both methods can introduce bias into the analysis and perform poorly when the data has a visible trend.
2. **Data without trend and seasonality**
<br>Mean, mode, median and random sample imputation can be used.
3. **Linear Interpolation**
<br>This method works well for a time series with some **trend** but is not suitable for **seasonal data**.
4. **Seasonal Adjustment + Linear Interpolation**
<br>This method works well for data with both **trend and seasonality**.
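
As a small sketch of how LOCF, NOCB and linear interpolation differ on a trending series, the following uses a made-up daily series with gaps. The dates and values are assumptions chosen so the differences are easy to see.

```python
import numpy as np
import pandas as pd

# Made-up daily series with an upward trend and three gaps
idx = pd.date_range('2021-01-01', periods=6, freq='D')
ts = pd.Series([10.0, np.nan, np.nan, 16.0, np.nan, 20.0], index=idx)

locf = ts.ffill()                         # last observation carried forward
nocb = ts.bfill()                         # next observation carried backward
linear = ts.interpolate(method='linear')  # straight line between known points

print(pd.DataFrame({'original': ts, 'LOCF': locf, 'NOCB': nocb, 'linear': linear}))
```

LOCF and NOCB repeat a neighbouring value and so flatten the trend inside each gap, while linear interpolation follows it.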


<a class="list-group-item list-group-item-action" data-toggle="list" href="#Pre-Processing" role="tab" aria-controls="settings">Go to Top<span class="badge badge-primary badge-pill"></span></a>

# Rescalling Data

When data is comprised of **attributes with varying scales**, many machine learning algorithms can benefit from rescaling the attributes so that they all have the same scale. This is useful for the optimization algorithms at the core of many machine learning methods, such as gradient descent.

It is also useful for algorithms that weight inputs, like regression and neural networks, and for algorithms that use distance measures, like K-Nearest Neighbors. Some of the techniques for rescaling data are listed below.

When features differ widely in scale or units, classifiers and regressors that rely on Euclidean distance, such as k-nearest neighbours, will fail or be sub-optimal. The same goes for models that rely on gradient-descent-based optimisation, such as logistic regression, Support Vector Machines and neural networks. Tree-based models are the main exception: they are largely unaffected by the scale of the features.

**NOTE:** 
1. Before scaling, check for outliers and treat them
2. See the EDA Notebook for various outlier treatment methods

### MinMaxScaler

Transform features by **scaling each feature to a given range**.
This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.<br><br>The transformation is given by:<br>`X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))`<br>`X_scaled = X_std * (max - min) + min`<br>where `min` and `max` are the bounds of the target range.

In [20]:
from sklearn.preprocessing import MinMaxScaler
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler()
print(scaler.fit(data))
print(scaler.transform(data))

MinMaxScaler()
[[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [1.   1.  ]]
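
The transformation formula can be checked by hand against the scaler. This sketch assumes the default target range of 0 to 1 and uses the same toy data as the cell above; `feature_min` and `feature_max` are illustrative names for the range bounds.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]], dtype=float)

# Manual min-max scaling, column by column
feature_min, feature_max = 0.0, 1.0
X_std = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))
X_scaled = X_std * (feature_max - feature_min) + feature_min

# The scaler should produce the same result
sk_scaled = MinMaxScaler(feature_range=(feature_min, feature_max)).fit_transform(data)
print(np.allclose(X_scaled, sk_scaled))
```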


In [21]:
df

Unnamed: 0,size,price,color,class,boh
0,XXL,8.0,black,class 1,22.0
1,L,9.0,gray,class 2,20.0
2,XL,10.0,blue,class 2,19.0
3,M,9.0,orange,class 1,17.0
4,M,11.0,green,class 3,20.0
5,M,7.0,red,class 1,22.0


In [22]:
# MinMaxScaler on a column of a dataframe
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df['price'] = scaler.fit_transform(df[['price']])
df

Unnamed: 0,size,price,color,class,boh
0,XXL,0.25,black,class 1,22.0
1,L,0.5,gray,class 2,20.0
2,XL,0.75,blue,class 2,19.0
3,M,0.5,orange,class 1,17.0
4,M,1.0,green,class 3,20.0
5,M,0.0,red,class 1,22.0


In [23]:
df

Unnamed: 0,size,price,color,class,boh
0,XXL,0.25,black,class 1,22.0
1,L,0.5,gray,class 2,20.0
2,XL,0.75,blue,class 2,19.0
3,M,0.5,orange,class 1,17.0
4,M,1.0,green,class 3,20.0
5,M,0.0,red,class 1,22.0


<a class="list-group-item list-group-item-action" data-toggle="list" href="#Pre-Processing" role="tab" aria-controls="settings">Go to Top<span class="badge badge-primary badge-pill"></span></a>

###  MaxAbsScaler

This estimator **scales each feature individually** such that the maximal absolute value of each feature in the training set will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity.<br><br>
This scaler can also be applied to sparse CSR or CSC matrices.

In [24]:
from sklearn.preprocessing import MaxAbsScaler
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
transformer = MaxAbsScaler().fit(X)
transformer.transform(X)


array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])
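
A short sketch of what the scaler computes, and of the sparse-matrix support mentioned above: each column is divided by its largest absolute value, and a SciPy CSR input stays sparse. Variable names are illustrative.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Manual version: divide each column by its maximum absolute value
max_abs = np.abs(X).max(axis=0)
manual = X / max_abs
scaled = MaxAbsScaler().fit_transform(X)
print(np.allclose(manual, scaled))

# Zeros stay zeros, so a sparse input remains sparse after scaling
X_sparse = sparse.csr_matrix(X)
scaled_sparse = MaxAbsScaler().fit_transform(X_sparse)
print(sparse.issparse(scaled_sparse))
```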

In [25]:
df

Unnamed: 0,size,price,color,class,boh
0,XXL,0.25,black,class 1,22.0
1,L,0.5,gray,class 2,20.0
2,XL,0.75,blue,class 2,19.0
3,M,0.5,orange,class 1,17.0
4,M,1.0,green,class 3,20.0
5,M,0.0,red,class 1,22.0


In [26]:
# 'price' was already scaled to [0, 1] above (maximum absolute value 1.0), so it is unchanged here
from sklearn.preprocessing import MaxAbsScaler
transformer = MaxAbsScaler().fit(df[['price']])
df['price'] = transformer.transform(df[['price']])
df

Unnamed: 0,size,price,color,class,boh
0,XXL,0.25,black,class 1,22.0
1,L,0.5,gray,class 2,20.0
2,XL,0.75,blue,class 2,19.0
3,M,0.5,orange,class 1,17.0
4,M,1.0,green,class 3,20.0
5,M,0.0,red,class 1,22.0


<a class="list-group-item list-group-item-action" data-toggle="list" href="#Pre-Processing" role="tab" aria-controls="settings">Go to Top<span class="badge badge-primary badge-pill"></span></a>

### Robust Scaler

Scale features using statistics that are **robust to outliers**. RobustScaler transforms the feature vector by subtracting the median and then dividing by the interquartile range (75th percentile minus 25th percentile).<br><br>
**Centering and scaling happen independently on each feature** by computing the relevant statistics on the samples in the training set. The median and interquartile range are then stored to be used on later data via the `transform` method.<br>
Standardization of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance. However, outliers can often distort the sample mean and variance. In such cases, the median and the interquartile range often give better results.<br><br>**Use RobustScaler to reduce the effects of outliers**, relative to MinMaxScaler.

In [27]:
from sklearn.preprocessing import RobustScaler
X = [[ 1., -2.,  2.],
     [ -2.,  1.,  3.],
     [ 4.,  1., -2.]]
transformer = RobustScaler().fit(X)
transformer.transform(X)


array([[ 0. , -2. ,  0. ],
       [-1. ,  0. ,  0.4],
       [ 1. ,  0. , -1.6]])
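
The median and interquartile range computation can be reproduced by hand with NumPy. This is a sketch assuming the scaler's default quantile range of 25 to 75, on the same toy matrix as the cell above.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1., -2., 2.],
              [-2., 1., 3.],
              [4., 1., -2.]])

# Per-column median and interquartile range
median = np.median(X, axis=0)
q25, q75 = np.percentile(X, [25, 75], axis=0)
manual = (X - median) / (q75 - q25)

scaled = RobustScaler().fit_transform(X)
print(np.allclose(manual, scaled))
```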

In [28]:
df

Unnamed: 0,size,price,color,class,boh
0,XXL,0.25,black,class 1,22.0
1,L,0.5,gray,class 2,20.0
2,XL,0.75,blue,class 2,19.0
3,M,0.5,orange,class 1,17.0
4,M,1.0,green,class 3,20.0
5,M,0.0,red,class 1,22.0


In [29]:
# robustscaler for dataframe
from sklearn.preprocessing import RobustScaler
transformer = RobustScaler().fit(df[['boh']])
df['boh'] = transformer.transform(df[['boh']])
df

Unnamed: 0,size,price,color,class,boh
0,XXL,0.25,black,class 1,0.888889
1,L,0.5,gray,class 2,0.0
2,XL,0.75,blue,class 2,-0.444444
3,M,0.5,orange,class 1,-1.333333
4,M,1.0,green,class 3,0.0
5,M,0.0,red,class 1,0.888889


<a class="list-group-item list-group-item-action" data-toggle="list" href="#Pre-Processing" role="tab" aria-controls="settings">Go to Top<span class="badge badge-primary badge-pill"></span></a>

### StandardScaler

StandardScaler standardizes a feature by subtracting the mean and then scaling to unit variance. Scaling to unit variance means dividing all the values by the standard deviation. Unlike MinMaxScaler, StandardScaler does not map the values into a fixed range.

**When to use**
<br>It can be used when a model expects features that are centered around zero with comparable variance. It does not change the shape of the distribution, so it works best on features that are already roughly normally distributed.

**NOTE**
1. It results in a distribution with a standard deviation equal to 1
2. If there are outliers in the feature, they inflate the mean and standard deviation, so most of the data ends up squeezed into a small interval


In [30]:
import pandas as pd
import scipy.stats as ss
from sklearn.preprocessing import StandardScaler


data= [[1, 1, 1, 1, 1],[2, 5, 10, 50, 100],[3, 10, 20, 150, 200],[4, 15, 40, 200, 300]]

df = pd.DataFrame(data, columns=['N0', 'N1', 'N2', 'N3', 'N4']).astype('float64')

sc_X = StandardScaler()
df = sc_X.fit_transform(df)

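# df = pd.DataFrame(df, columns=['N0', 'N1', 'N2', 'N3', 'N4'])
# Uncomment the line above to get a DataFrame back for further analysis



# Summary statistics for each scaled column
# (ss.describe reports the sample variance with ddof=1, hence 4/3 rather than 1 for 4 rows)
num_cols = len(df[0,:])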
for i in range(num_cols):
    col = df[:,i]
    col_stats = ss.describe(col)
    print(col_stats)

DescribeResult(nobs=4, minmax=(-1.3416407864998738, 1.3416407864998738), mean=0.0, variance=1.3333333333333333, skewness=0.0, kurtosis=-1.3599999999999999)
DescribeResult(nobs=4, minmax=(-1.2828087129930659, 1.3778315806221817), mean=-5.551115123125783e-17, variance=1.3333333333333335, skewness=0.11003776770595125, kurtosis=-1.394993095506219)
DescribeResult(nobs=4, minmax=(-1.155344148338584, 1.53471088361394), mean=0.0, variance=1.3333333333333333, skewness=0.48089217736510326, kurtosis=-1.1471008824318165)
DescribeResult(nobs=4, minmax=(-1.2604572012883055, 1.2668071116222517), mean=-5.551115123125783e-17, variance=1.3333333333333333, skewness=0.0056842140599118185, kurtosis=-1.6438177182479734)
DescribeResult(nobs=4, minmax=(-1.338945389819976, 1.3434309690153527), mean=5.551115123125783e-17, variance=1.3333333333333333, skewness=0.005374558840039456, kurtosis=-1.3619131970819205)
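
The output above reports a variance of about 1.33 rather than 1. The sketch below reproduces the scaling by hand and shows where that number comes from: the scaler divides by the population standard deviation (`ddof=0`), while a sample variance (`ddof=1`) on four rows is larger by a factor of 4/3. Variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1, 1, 1, 1, 1],
                 [2, 5, 10, 50, 100],
                 [3, 10, 20, 150, 200],
                 [4, 15, 40, 200, 300]], dtype=float)

scaled = StandardScaler().fit_transform(data)

# Manual standardization: subtract the column mean, divide by the population std (ddof=0)
manual = (data - data.mean(axis=0)) / data.std(axis=0)
print(np.allclose(manual, scaled))

# Population variance is 1; sample variance with n=4 is n / (n - 1) = 4/3
print(scaled.var(axis=0, ddof=0))
print(scaled.var(axis=0, ddof=1))
```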


<a class="list-group-item list-group-item-action" data-toggle="list" href="#Pre-Processing" role="tab" aria-controls="settings">Go to Top<span class="badge badge-primary badge-pill"></span></a>

# Data Transformation

Two types of transformations are available: quantile transforms and power transforms.<br> 

###  Quantile Transformation
A quantile transformation maps a feature to a __uniform distribution__ (the default) or to a normal distribution. By performing a rank transformation, a quantile transform smooths out unusual distributions and is less influenced by outliers than scaling methods. It does, however, distort correlations and distances within and across features. <br><br> An example of Quantile Transformation is given below: 

In [31]:
import numpy as np
from sklearn.preprocessing import QuantileTransformer
rng = np.random.RandomState(0)
X = np.sort(rng.normal(loc=0.5, scale=0.25, size=(25, 1)), axis=0)
qt = QuantileTransformer(n_quantiles=10, random_state=0)
qt.fit_transform(X)

array([[0.        ],
       [0.09871873],
       [0.10643612],
       [0.11754671],
       [0.21017437],
       [0.21945445],
       [0.23498666],
       [0.32443642],
       [0.33333333],
       [0.41360794],
       [0.42339464],
       [0.46257841],
       [0.47112236],
       [0.49834237],
       [0.59986536],
       [0.63390302],
       [0.66666667],
       [0.68873101],
       [0.69611125],
       [0.81280699],
       [0.82160354],
       [0.88126439],
       [0.90516028],
       [0.99319435],
       [1.        ]])
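
The example above maps to a uniform distribution. As a sketch of the other option, `output_distribution='normal'` maps a skewed feature to an approximately Gaussian one. The lognormal sample, its size and the number of quantiles are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import QuantileTransformer

# Strongly right-skewed synthetic feature
rng = np.random.RandomState(0)
X_skewed = rng.lognormal(size=(1000, 1))

qt_normal = QuantileTransformer(n_quantiles=100,
                                output_distribution='normal',
                                random_state=0)
X_normal = qt_normal.fit_transform(X_skewed)

# Skewness before and after the transformation
skew_before = skew(X_skewed[:, 0])
skew_after = skew(X_normal[:, 0])
print(skew_before, skew_after)
```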

In [32]:
df

array([[-1.34164079, -1.28280871, -1.15534415, -1.2604572 , -1.33894539],
       [-0.4472136 , -0.52262577, -0.53456222, -0.63816599, -0.45080071],
       [ 0.4472136 ,  0.4276029 ,  0.15519548,  0.63181608,  0.44631513],
       [ 1.34164079,  1.37783158,  1.53471088,  1.26680711,  1.34343097]])

<a class="list-group-item list-group-item-action" data-toggle="list" href="#Pre-Processing" role="tab" aria-controls="settings">Go to Top<span class="badge badge-primary badge-pill"></span></a>

### Power Transformation
Power transforms are a family of parametric transformations that aim to __map data from any distribution to as close to a Gaussian distribution__ as possible in order to stabilize variance and minimize skewness.<br><br>There are two methods for power transformation: __Yeo-Johnson and Box-Cox__.<br><br> Box-Cox can only be applied to strictly positive data, while Yeo-Johnson also accepts zero and negative values. In both methods, the transformation is parameterized by λ, which is determined through maximum likelihood estimation. Here is an example of using Box-Cox to map samples drawn from a lognormal distribution to a normal distribution:

In [33]:
import numpy as np
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method='box-cox', standardize=False)
X_lognormal = np.random.RandomState(616).lognormal(size=(3, 3))
pt.fit_transform(X_lognormal)


array([[ 0.49024349,  0.17881995, -0.1563781 ],
       [-0.05102892,  0.58863195, -0.57612414],
       [ 0.69420009, -0.84857822,  0.10051454]])
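
For data that contains zeros or negative values, Yeo-Johnson is the method to reach for. The following is a minimal sketch on made-up values; the array `X_mixed` and the flag `box_cox_failed` are illustrative, and the estimated λ is exposed through the fitted `lambdas_` attribute.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Made-up feature containing negative values and a zero
X_mixed = np.array([[-2.0], [-0.5], [0.0], [1.0], [3.0], [10.0]])

# Yeo-Johnson accepts these values; standardize=True centers and scales the result
pt_yj = PowerTransformer(method='yeo-johnson', standardize=True)
X_yj = pt_yj.fit_transform(X_mixed)
print('Estimated lambda:', pt_yj.lambdas_)
print(X_yj.ravel())

# Box-Cox refuses the same data because it is not strictly positive
box_cox_failed = False
try:
    PowerTransformer(method='box-cox').fit(X_mixed)
except ValueError as err:
    box_cox_failed = True
    print('Box-Cox error:', err)
```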

In [34]:
df_copy

Unnamed: 0,size,price,color,class,boh
0,XXL,8.0,black,class 1,22.0
1,L,,gray,class 2,20.0
2,XL,10.0,blue,class 2,19.0
3,M,,orange,class 1,17.0
4,M,11.0,green,class 3,
5,M,7.0,red,class 1,22.0


In [35]:
import numpy as np
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method='box-cox', standardize=False)
df_copy['boh'] = pt.fit_transform(df_copy[['boh']])
df_copy

Unnamed: 0,size,price,color,class,boh
0,XXL,8.0,black,class 1,7546.664382
1,L,,gray,class 2,5524.660719
2,XL,10.0,blue,class 2,4670.996737
3,M,,orange,class 1,3245.914188
4,M,11.0,green,class 3,
5,M,7.0,red,class 1,7546.664382


<a class="list-group-item list-group-item-action" data-toggle="list" href="#Pre-Processing" role="tab" aria-controls="settings">Go to Top<span class="badge badge-primary badge-pill"></span></a>

### Custom Transformation
We might want to __convert an existing Python function into a transformer__ to assist in data cleaning or processing. FunctionTransformer builds a transformer from an arbitrary function. <br><br>For example, a transformer that applies a log transformation and can be used in a pipeline is built as follows:

In [36]:
import numpy as np
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(np.log1p, validate=True)
X = np.array([[0, 1], [2, 3]])
transformer.transform(X)

array([[0.        , 0.69314718],
       [1.09861229, 1.38629436]])
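
The same transformer can be placed inside a pipeline. The sketch below is one possible arrangement, assuming a StandardScaler as the second step (that step is chosen here only for illustration); it also passes `inverse_func` so that the log transformation can be undone.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# log1p going forward, expm1 as its inverse
log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1, validate=True)

X = np.array([[0, 1], [2, 3]], dtype=float)
X_log = log_transformer.fit_transform(X)
X_back = log_transformer.inverse_transform(X_log)  # recovers X

# Illustrative pipeline: log transformation followed by standardization
pipe = make_pipeline(log_transformer, StandardScaler())
X_scaled = pipe.fit_transform(X)
print(X_scaled)
```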

In [37]:
df_copy

Unnamed: 0,size,price,color,class,boh
0,XXL,8.0,black,class 1,7546.664382
1,L,,gray,class 2,5524.660719
2,XL,10.0,blue,class 2,4670.996737
3,M,,orange,class 1,3245.914188
4,M,11.0,green,class 3,
5,M,7.0,red,class 1,7546.664382


In [38]:
import numpy as np
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(np.log1p, validate=True)
# dropna() removes every row that has a missing value in any column;
# .copy() makes df_copy an independent DataFrame, so the assignment below
# does not trigger a SettingWithCopyWarning
df_copy = df_copy.dropna().copy()
df_copy['boh'] = transformer.fit_transform(df_copy[['boh']])
df_copy


Unnamed: 0,size,price,color,class,boh
0,XXL,8.0,black,class 1,8.928993
2,XL,10.0,blue,class 2,8.449342
5,M,7.0,red,class 1,8.928993


<a class="list-group-item list-group-item-action" data-toggle="list" href="#Pre-Processing" role="tab" aria-controls="settings">Go to Top<span class="badge badge-primary badge-pill"></span></a>

# Data Normalization

Normalization is the process of **scaling individual samples to have unit norm**. The function normalize provides a quick and easy way to perform this operation on a single array-like dataset, using either the l1 or l2 norm. Normalizer __works on the rows, not the columns!__ 
<br><br>By default, L2 normalization is applied to each observation so that the values in a row have unit norm. Unit norm with L2 means that if each element were squared and summed, the total would equal 1. Alternatively, L1 (aka taxicab or Manhattan) normalization can be applied instead of L2 normalization, in which case the absolute values in each row sum to 1.

In [39]:
from sklearn import preprocessing
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
X_normalized = preprocessing.normalize(X, norm='l2')
X_normalized


array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])

The preprocessing module further provides a utility class Normalizer that implements the same operation using the Transformer API. The fit method does nothing in this case: the class is stateless, as this operation treats samples independently.

In [40]:
normalizer = preprocessing.Normalizer().fit(X)  # fit does nothing
normalizer.transform(X)

array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])
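
The row-wise behaviour can be checked directly. The sketch below recomputes the norms with NumPy for the same X as above; the variable names are introduced here for illustration.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

X_l2 = preprocessing.normalize(X, norm='l2')
X_l1 = preprocessing.normalize(X, norm='l1')

# L2: the squared entries of every row sum to 1
l2_row_norms = np.sqrt((X_l2 ** 2).sum(axis=1))
# L1: the absolute entries of every row sum to 1
l1_row_norms = np.abs(X_l1).sum(axis=1)
# Columns are not normalized, so their norms generally differ from 1
l2_col_norms = np.sqrt((X_l2 ** 2).sum(axis=0))

print(l2_row_norms, l1_row_norms, l2_col_norms)
```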

In [41]:
df_copy

Unnamed: 0,size,price,color,class,boh
0,XXL,8.0,black,class 1,8.928993
2,XL,10.0,blue,class 2,8.449342
5,M,7.0,red,class 1,8.928993


In [42]:
from sklearn import preprocessing
# normalize works row-wise: with a single column every row holds one value,
# so each non-zero entry is divided by its own absolute value and becomes 1.0
df_copy = df_copy.copy()  # explicit copy avoids the SettingWithCopyWarning
df_copy['price'] = preprocessing.normalize(df_copy[['price']], norm='l1')
df_copy


Unnamed: 0,size,price,color,class,boh
0,XXL,1.0,black,class 1,8.928993
2,XL,1.0,blue,class 2,8.449342
5,M,1.0,red,class 1,8.928993


Both normalize and Normalizer accept dense array-like data and sparse matrices from scipy.sparse as input. Sparse input is converted to the Compressed Sparse Rows representation (see scipy.sparse.csr_matrix) before being fed to efficient Cython routines. To avoid unnecessary memory copies, it is recommended to choose the CSR representation upstream.
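
As a minimal sketch of the sparse case, assuming a small made-up matrix that is already stored in CSR format:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import normalize

# Hypothetical sparse data, built directly in CSR format
X_sparse = sparse.csr_matrix([[1., 0., 2.],
                              [0., 0., 3.],
                              [4., 0., 0.]])

X_norm = normalize(X_sparse, norm='l2')

# The result stays sparse; convert to dense only for display
print(sparse.issparse(X_norm))
print(X_norm.toarray())
```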

<a class="list-group-item list-group-item-action" data-toggle="list" href="#Pre-Processing" role="tab" aria-controls="settings">Go to Top<span class="badge badge-primary badge-pill"></span></a>

### __To summarize__
• Use MinMaxScaler as the default if we are transforming a feature. It keeps the shape of the original distribution and only rescales it to a fixed range (0 to 1 by default).
<br>• Use RobustScaler if we have outliers and want to reduce their influence. However, we might be better off removing the outliers instead.
<br>• Use StandardScaler if we need a feature with zero mean and unit variance; it works best when the data is already roughly normally distributed.
<br>• Use Normalizer sparingly: it normalizes sample rows, not feature columns. It can use l2 or l1 normalization.

![Scalers.png](attachment:Scalers.png)
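
To make the summary above concrete, the following sketch applies the three column-wise scalers to a made-up feature with a single outlier (the array `x` is hypothetical and not taken from the notebook's data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical feature: four ordinary values and one outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

x_minmax = MinMaxScaler().fit_transform(x)      # rescaled to the range [0, 1]
x_robust = RobustScaler().fit_transform(x)      # centered on the median, scaled by the IQR
x_standard = StandardScaler().fit_transform(x)  # zero mean, unit variance

print(x_minmax.ravel())
print(x_robust.ravel())
print(x_standard.ravel())
```

With MinMaxScaler the outlier pushes the ordinary values close to 0, whereas RobustScaler keeps them spread around 0 because the median and interquartile range are barely affected by the outlier.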

<a class="list-group-item list-group-item-action" data-toggle="list" href="#Pre-Processing" role="tab" aria-controls="settings">Go to Top<span class="badge badge-primary badge-pill"></span></a>

# __Handling Categorical Variable__

### One Hot Encoding
In this method, each category is mapped to a vector of 1s and 0s denoting the presence or absence of that category. The number of vectors depends on the number of categories in the feature.<br><br>This method produces a lot of columns, which slows down learning significantly if the feature has a very high number of categories.<br><br>One Hot Encoding is very popular. All categories can be represented by **N-1 columns (N = number of categories)**, as that is sufficient to encode the one that is not included. Usually, for **Regression, N-1** columns (drop the first or last column of the one-hot encoded feature) are used, **but for classification, the recommendation is to use all N columns, as most tree-based algorithms build a tree based on all available variables.**

In [43]:
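import numpy as np
import pandas as pd

# note: a NumPy array holds a single dtype, so the numbers below are stored as strings
my_data = np.array([[5, 'a', 1],
                    [3, 'b', 3],
                    [1, 'b', 2],
                    [3, 'a', 1],
                    [4, 'b', 2],
                    [7, 'c', 1],
                    [7, 'c', 1]])

df = pd.DataFrame(data=my_data, columns=['y', 'dummy', 'x'])
# get_dummies creates one indicator column per category of 'dummy'
df = pd.get_dummies(df, columns = ['dummy'])
df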


Unnamed: 0,y,x,dummy_a,dummy_b,dummy_c
0,5,1,True,False,False
1,3,3,False,True,False
2,1,2,False,True,False
3,3,1,True,False,False
4,4,2,False,True,False
5,7,1,False,False,True
6,7,1,False,False,True


<a class="list-group-item list-group-item-action" data-toggle="list" href="#Pre-Processing" role="tab" aria-controls="settings">Go to Top<span class="badge badge-primary badge-pill"></span></a>

In [44]:
my_data = np.array([[5, 'a', 1],
                    [3, 'b', 3],
                    [1, 'b', 2],
                    [3, 'a', 1],
                    [4, 'b', 2],
                    [7, 'c', 1],
                    [7, 'c', 1]])                


df = pd.DataFrame(data=my_data, columns=['y', 'dummy', 'x'])
df = pd.get_dummies(df, columns = ['dummy'], drop_first = True)
# to run the regression we want to get rid of the strings 'a', 'b', 'c'
# and we want to drop one dummy variable to avoid the dummy variable trap;
# drop_first=True drops "a", so the coefficients on "b" and "c" show the effect
# of "b" and "c" relative to "a"
df

Unnamed: 0,y,x,dummy_b,dummy_c
0,5,1,False,False
1,3,3,True,False
2,1,2,True,False
3,3,1,False,False
4,4,2,True,False
5,7,1,False,True
6,7,1,False,True
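
pd.get_dummies derives its columns from whatever data it is given, so new data that lacks some categories would yield different columns. One possible alternative, sketched below under the assumption that scikit-learn's OneHotEncoder is acceptable for the task, is to fit the encoder once and reuse it; the `train` and `new` frames are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data and new data with the same feature
train = pd.DataFrame({'dummy': ['a', 'b', 'b', 'a', 'b', 'c', 'c']})
new = pd.DataFrame({'dummy': ['c', 'a']})

# drop='first' removes the first category ('a'), giving N-1 columns
enc = OneHotEncoder(drop='first')
enc.fit(train)

# The output is sparse by default; toarray() makes it dense for display
encoded_new = enc.transform(new).toarray()
print(enc.categories_)
print(encoded_new)
```

The columns of `encoded_new` correspond to "b" and "c", even though "b" does not occur in the new data.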


### Label Encoding
In this encoding, __each category is assigned an integer value__ (0 through N-1 in scikit-learn's LabelEncoder); here N is the number of categories for the feature. One major issue with this approach is that the classes may have no inherent order or relation, yet the algorithm might treat the integers as ordered or otherwise related.

In [45]:
my_data = np.array([[5, 'a', 1],
                    [3, 'b', 3],
                    [1, 'b', 2],
                    [3, 'a', 1],
                    [4, 'b', 2],
                    [7, 'c', 1],
                    [7, 'c', 1]])                

df = pd.DataFrame(data=my_data, columns=['y', 'dummy', 'x'])
df

Unnamed: 0,y,dummy,x
0,5,a,1
1,3,b,3
2,1,b,2
3,3,a,1
4,4,b,2
5,7,c,1
6,7,c,1


In [46]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
df['dummy'] = le.fit_transform(df.dummy)
df

Unnamed: 0,y,dummy,x
0,5,0,1
1,3,1,3
2,1,1,2
3,3,0,1
4,4,1,2
5,7,2,1
6,7,2,1
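
The fitted encoder keeps the mapping, so the integer codes can be turned back into the original labels. When a feature does have a natural order, one option is OrdinalEncoder with the order given explicitly. The sketch below assumes the ordering M < L < XL < XXL for a made-up `size` column; that ordering is an assumption made here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# LabelEncoder: codes follow the sorted order of the classes
le = LabelEncoder()
codes = le.fit_transform(['a', 'b', 'b', 'a', 'b', 'c', 'c'])
decoded = le.inverse_transform(codes)  # back to the original labels
print(list(le.classes_), codes, decoded)

# OrdinalEncoder with an explicit (assumed) order for an ordinal feature
sizes = pd.DataFrame({'size': ['XXL', 'L', 'XL', 'M', 'M', 'M']})
oe = OrdinalEncoder(categories=[['M', 'L', 'XL', 'XXL']])
size_codes = oe.fit_transform(sizes)
print(size_codes.ravel())
```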


<a class="list-group-item list-group-item-action" data-toggle="list" href="#Pre-Processing" role="tab" aria-controls="settings">Go to Top<span class="badge badge-primary badge-pill"></span></a>