<center><h1>Diving into Data Preprocessing</h1></center>

## Table of contents

* [Feature Engineering](#feat_engg)
    * [Missing Values Treatment](#missing)
    * [Outliers Treatment](#outliers)
    * [Categorical Data Handling](#cat)
    * [Imbalanced Class Handling](#imbal)
    * [Data Transformation](#trans)
    * [Extracting Date](#date)


<a id='feat_engg'></a>
## Feature Engineering

### What is Feature Engineering?
**Feature engineering is about creating new input features from your existing ones.**

This is often one of the most valuable tasks a data scientist can do to improve model performance, for 3 big reasons:

* You can isolate and highlight key information, which helps your algorithms "focus" on what’s important.
* You can bring in your own domain expertise.
* Most importantly, once you understand the "vocabulary" of feature engineering, you can bring in other people’s domain expertise!
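
As a minimal illustration of creating a new feature from existing ones, the sketch below derives a body-mass index column from height and weight. The DataFrame and its column names (`height_m`, `weight_kg`, `bmi`) are hypothetical and are not part of this notebook's data.

```python
import pandas as pd

# Hypothetical data, used only for illustration
people = pd.DataFrame({"height_m": [1.60, 1.75, 1.82],
                       "weight_kg": [55.0, 70.0, 95.0]})

# New feature built from two existing ones: BMI = weight / height^2
people["bmi"] = people["weight_kg"] / people["height_m"] ** 2

print(people)
```

A single ratio like this can carry information that a model would otherwise have to learn from the two raw columns on its own.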

*"The features you use influence more than everything else the result. No algorithm alone, to my knowledge, can supplement the information gain given by correct feature engineering."* — Luca Massaron

#### Toy dataset

In [None]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

dt = {"col1":[51,22,13,64,50,np.nan,17,580,19,1000],
      "col2":[np.nan,np.nan,89,np.nan,76,np.nan,53,np.nan,900,np.nan],
      "col3":['male','male','male','male','male','male','male',np.nan,'female',np.nan],
      "col4":['good','bad','good',np.nan,'good','bad','bad',np.nan,'good',np.nan]}

data = pd.DataFrame(dt)

data

<a id='missing'></a>
### Missing Values Treatment

#### Analyze missing values

In [None]:
data.isna().sum()
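
Raw counts are easier to compare across columns when expressed as a fraction of rows. The sketch below computes that rate on a small stand-in frame; `df` and `missing_rate` are illustrative names, chosen so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Small stand-in frame (hypothetical), so the snippet runs on its own
df = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0],
                   "b": [np.nan, np.nan, "x", "y"]})

# isna().mean() gives the fraction of missing values per column
missing_rate = df.isna().mean().sort_values(ascending=False)
print(missing_rate)
```

The same expression, `isna().mean()`, is what the threshold-based drop in the next cells relies on.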

### Drop columns

In [None]:
threshold = 0.5

# Dropping columns with missing value rate higher than threshold
data[data.columns[data.isnull().mean() < threshold]]

In [None]:
# Dropping rows with missing value rate higher than threshold
data.loc[data.isnull().mean(axis=1) < threshold]

### Impute

#### Numerical Imputation

In [5]:
num_cols = ['col1', 'col2']

In [None]:
# Filling all missing values with 0
data[num_cols].fillna(0)

In [None]:
# Filling missing values with medians of the columns
data['col1'].fillna(data['col1'].median())

In [None]:
# Fill all numerical columns
for col in num_cols:
    data[col] = data[col].fillna(data[col].median())
    
data

#### Categorical Imputation

In [9]:
cat_cols = ['col3', 'col4']

In [None]:
# Fill missing values with the most frequent category
data['col3'].fillna(data['col3'].value_counts().idxmax())

In [None]:
# Fill all categorical columns
for col in cat_cols:
    data[col] = data[col].fillna(data[col].value_counts().idxmax())
    
data
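
The median fill shown earlier can also be done with scikit-learn's `SimpleImputer`, which stores the learned medians so they can be reused on new data. This is a sketch that assumes scikit-learn is installed; `toy` and `filled` are illustrative names, and the values are copied from the numerical columns of the toy dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Numerical columns copied from the toy dataset
toy = pd.DataFrame({
    "col1": [51, 22, 13, 64, 50, np.nan, 17, 580, 19, 1000],
    "col2": [np.nan, np.nan, 89, np.nan, 76, np.nan, 53, np.nan, 900, np.nan],
})

# Learn each column's median, then fill the missing entries with it
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(imputer.statistics_)  # medians learned per column
print(filled)
```

Fitting on training data and calling `imputer.transform` on later data keeps the fill values consistent between the two.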

<a id='outliers'></a>
### Outliers Treatment

#### Detect outliers using boxplots

In [None]:
data.plot.box();

#### Detect outliers using interquartile range

In [13]:
def detect_outlier(feature):
    Q1 = feature.quantile(0.25)
    Q3 = feature.quantile(0.75)
    
    IQR = Q3-Q1
    
    lower_bound = Q1-(1.5*IQR)
    upper_bound = Q3+(1.5*IQR)
    
    return feature.index[(feature<lower_bound)|(feature>upper_bound)].tolist()

In [None]:
for col in num_cols:
    print(col,'-->',detect_outlier(data[col]))

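#### Capping the outliers

In [15]:
# Note: this replaces each outlier with the column median
for col in num_cols:
    indx = detect_outlier(data[col])
    # Use a single .loc call; chained data[col].loc[...] assignment may not update data
    data.loc[indx, col] = data[col].median()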

In [None]:
data
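
Strictly speaking, capping (also called winsorizing) means limiting extreme values to a boundary rather than replacing them with the median. A minimal sketch of that variant, using the same 1.5 × IQR bounds as `detect_outlier`; the series `values` is hypothetical.

```python
import pandas as pd

# Hypothetical values; 580 and 1000 lie far above the rest
values = pd.Series([51, 22, 13, 64, 50, 17, 580, 19, 1000], dtype=float)

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# clip() pulls anything outside the bounds back to the nearest bound
capped = values.clip(lower=lower_bound, upper=upper_bound)
print(capped)
```

Unlike the median replacement, clipping keeps the capped points at the edge of the distribution, so their rank order relative to the other values is preserved.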

#### Dropping the rows that contain outliers

In [17]:
for col in num_cols:
    indx = detect_outlier(data[col])
    # Mark outliers as missing so that dropna() below removes their rows
    data.loc[indx, col] = np.nan

data.dropna(inplace=True)

<a id='cat'></a>
### Categorical Data Handling

#### Label Encoding

In [None]:
for col in cat_cols:
    data[col] = data[col].astype('category')
    print(col,'---->', dict(enumerate(data[col].cat.categories)))
    data[col] = data[col].cat.codes

In [None]:
data

#### One Hot Encoding

In [None]:
data = pd.get_dummies(data, columns=cat_cols, prefix=cat_cols)

data
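
Label encoding assigns integers, which can suggest an order that the categories do not have. One-hot encoding avoids this by giving every category its own indicator column. A small sketch on a hypothetical `color` column:

```python
import pandas as pd

# Hypothetical column with three unordered categories
colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(colors, columns=["color"], prefix="color", dtype=int)
print(encoded)
```

Each row has exactly one indicator set to 1, so no category is numerically "larger" than another.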

<a id='imbal'></a>
### Imbalanced Class Handling

In [None]:
# !pip install -U imbalanced-learn

#### Toy dataset

In [None]:
from imblearn.datasets import make_imbalance
from sklearn.datasets import load_iris

data = load_iris()
data

In [None]:
from imblearn.datasets import make_imbalance
from sklearn.datasets import load_iris

data = load_iris()
X, y = data.data, data.target
X, y = make_imbalance(X, y, sampling_strategy={0: 10, 1: 20, 2: 30}, random_state=42)

print("Data shape:", y.shape)
pd.Series(y).value_counts().plot.bar();

#### Oversampling: SMOTE

In [None]:
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)

X_smt, y_smt = sm.fit_resample(X, y)

print("Data shape:", y_smt.shape)
pd.Series(y_smt).value_counts().plot.bar();
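
SMOTE creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below shows only that interpolation step in NumPy. It is a simplified illustration, not imbalanced-learn's implementation: it uses the single nearest neighbour, whereas SMOTE picks at random among several. All names and points are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class points in two dimensions
minority = np.array([[1.0, 1.0], [1.3, 0.9], [0.8, 1.1], [1.1, 1.3]])

# Pick one sample and find its nearest minority neighbour
i = 0
dists = np.linalg.norm(minority - minority[i], axis=1)
dists[i] = np.inf                      # exclude the sample itself
neighbour = minority[np.argmin(dists)]

# New point = sample + gap * (neighbour - sample), with gap in [0, 1)
gap = rng.random()
synthetic = minority[i] + gap * (neighbour - minority[i])
print(synthetic)
```

The synthetic point always lies on the line segment between the two original points, which is why SMOTE fills in the minority region instead of duplicating rows.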

#### Undersampling

In [None]:
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
X_rus, y_rus = rus.fit_resample(X, y)

print("Data shape:", y_rus.shape)
pd.Series(y_rus).value_counts().plot.bar();
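
With its default settings, random undersampling reduces every class to the size of the smallest one by discarding randomly chosen samples. A NumPy-only sketch of that idea, using hypothetical labels with the same class counts as the imbalanced iris data (10, 20 and 30); it illustrates the technique rather than reproducing `RandomUnderSampler` itself.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical labels with class counts 10, 20 and 30
labels = np.repeat([0, 1, 2], [10, 20, 30])

classes, counts = np.unique(labels, return_counts=True)
n_keep = counts.min()                  # size of the smallest class

# Keep n_keep randomly chosen indices from every class
kept_idx = np.concatenate([
    rng.choice(np.flatnonzero(labels == c), size=n_keep, replace=False)
    for c in classes
])
print(np.bincount(labels[kept_idx]))
```

The trade-off is that information in the discarded majority samples is lost, which matters most when the dataset is already small.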

<a id='trans'></a>
### Data Transformation

Data transformation is performed to bring the features of the data onto comparable ranges. Because the range of values can vary widely between features, it is often a necessary preprocessing step before applying machine learning algorithms.

There are 3 popular methods to transform data:
* Scaling
* Normalization
* Standardization

### Scaling

In scaling, you transform the data such that the features are within a specific range e.g. [0, 1].

${\displaystyle x'={\frac {x-{\text{min}}(x)}{{\text{max}}(x)-{\text{min}}(x)}}}$

where ${\displaystyle x}$ is an original value, ${\displaystyle x'}$ is the rescaled value. 
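
To make the formula concrete, the sketch below applies it by hand to a small hypothetical array and compares the result with scikit-learn's `minmax_scale`.

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

x = np.array([2.0, 4.0, 6.0, 10.0])   # hypothetical feature values

# Direct translation of the formula: (x - min) / (max - min)
x_manual = (x - x.min()) / (x.max() - x.min())

# scikit-learn's helper should give the same result
x_sklearn = minmax_scale(x)
print(x_manual)
```

The smallest value maps to 0 and the largest to 1; everything else keeps its relative position in between.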

Scaling is important in algorithms such as support vector machines (SVM) and k-nearest neighbors (KNN), where the distance between data points matters. For example, in a dataset containing product prices in different currencies, an SVM without scaling might treat 1 USD as equivalent to 1 INR, even though 1 USD = 65 INR.

In [23]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import minmax_scale

# set seed for reproducibility
np.random.seed(0)

# generate 1000 data points randomly drawn from an exponential distr
original_data = np.random.exponential(size = 1000)

In [24]:
# min-max scale the data between 0 and 1
scaled_data = minmax_scale(original_data)

In [None]:
# plot both together to compare
fig, ax=plt.subplots(1,2)

# histplot with kde=True replaces the deprecated distplot
sns.histplot(original_data, kde=True, ax=ax[0])
ax[0].set_title("Original Data")

sns.histplot(scaled_data, kde=True, ax=ax[1])
ax[1].set_title("Scaled data")
plt.show()

### Normalization
The point of normalization is to change your observations so that they can be described as a normal distribution.

The normal distribution (Gaussian distribution), also known as the bell curve, is a specific statistical distribution where roughly equal numbers of observations fall above and below the mean, the mean and the median are the same, and there are more observations closer to the mean.

The term is also used for mean normalization, whose general formula is:

${\displaystyle x'={\frac {x-{\text{mean}}(x)}{{\text{max}}(x)-{\text{min}}(x)}}}$

where ${\displaystyle x}$ is an original value, ${\displaystyle x'}$ is the normalized value. This formula only shifts and rescales the data and does not change the shape of its distribution, so the example below uses the Box-Cox transformation instead.

In [26]:
# for Box-Cox Transformation
from scipy import stats

# normalize the exponential data with boxcox
normalized_data = stats.boxcox(original_data)

In [None]:
# plot both together to compare
fig, ax=plt.subplots(1,2)

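# histplot with kde=True replaces the deprecated distplot
sns.histplot(original_data, kde=True, ax=ax[0])
ax[0].set_title("Original Data")

# boxcox returns (transformed data, fitted lambda); index 0 is the data
sns.histplot(normalized_data[0], kde=True, ax=ax[1])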
ax[1].set_title("Normalized data")
plt.show()

In scaling, you’re changing the range of your data, while in normalization you’re changing the shape of its distribution.

You need to normalize your data if you’re going to use a machine learning or statistics technique that assumes the data is normally distributed, e.g. t-tests, ANOVA, linear regression, linear discriminant analysis (LDA) and Gaussian Naive Bayes.
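
One way to check that a transformation such as Box-Cox changed the shape of the data is to compare skewness before and after. The sketch below assumes SciPy is available; `skewed`, `transformed` and `fitted_lambda` are illustrative names.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.exponential(size=1000)    # strictly positive, right-skewed

# boxcox returns the transformed data and the fitted lambda
transformed, fitted_lambda = stats.boxcox(skewed)

# Skewness near 0 indicates a roughly symmetric distribution
print("skew before:", stats.skew(skewed))
print("skew after: ", stats.skew(transformed))
```

Box-Cox only accepts strictly positive input, so data containing zeros or negative values would need to be shifted or handled with a different transformation first.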

### Standardization
Standardization transforms your data such that the resulting distribution has a mean of 0 and a standard deviation of 1.

The general method of calculation is to determine the distribution mean and standard deviation for each feature. Next we subtract the mean from each feature. Then we divide the values (mean is already subtracted) of each feature by its standard deviation.

${\displaystyle x'={\frac {x-{\bar {x}}}{\sigma }}}$

Where $x$ is the original feature vector, ${\bar{x}={\text{average}}(x)}$ is the mean of that feature vector, and $\sigma$ is its standard deviation.
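
The formula can be applied by hand in a couple of lines. The sketch below uses a hypothetical feature vector and the population standard deviation (NumPy's default), which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])    # hypothetical feature vector

# Subtract the mean, then divide by the (population) standard deviation
x_std = (x - x.mean()) / x.std()

print(x_std)
print(x_std.mean(), x_std.std())
```

After the transformation the values have mean 0 and standard deviation 1, up to floating-point error.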

In [28]:
from sklearn.preprocessing import StandardScaler

# StandardScaler expects shape (n_samples, n_features), so use a single column
standardized_data = StandardScaler().fit_transform(original_data.reshape(-1, 1)).ravel()

In [None]:
# plot both together to compare
fig, ax=plt.subplots(1,2)

sns.histplot(original_data, kde=True, ax=ax[0])
ax[0].set_title("Original Data")

sns.histplot(standardized_data, kde=True, ax=ax[1])
ax[1].set_title("Standardized data")
plt.show()

Standardization is widely used in SVMs, logistic regression and neural networks.

#### Applications of Data Transformation
In stochastic gradient descent, feature scaling can sometimes improve the convergence speed of the algorithm. In support vector machines, it can reduce the time to find support vectors.

<a id='date'></a>
### Extracting Date
We can engineer the following features from datetime variables:
* Extracting the parts of the date into different columns: Year, month, day, etc.
* Extracting the time period between the current date and columns in terms of years, months, days, etc.
* Extracting some specific features from the date: Name of the weekday, Weekend or not, holiday or not, etc.

In [None]:
from datetime import date

data = pd.DataFrame({'date':['01-01-2017', 
                             '04-12-2008', 
                             '23-06-1988', 
                             '25-08-1999', 
                             '20-02-1993',]})
data

In [None]:
#Transform string to date
data['date'] = pd.to_datetime(data.date, format="%d-%m-%Y")

data

In [None]:
#Extracting Year
data['year'] = data['date'].dt.year

data

In [None]:
#Extracting Month
data['month'] = data['date'].dt.month

data

In [None]:
#Extracting passed years since the date
data['passed_years'] = date.today().year - data['date'].dt.year

data

In [None]:
#Extracting passed months since the date
data['passed_months'] = (date.today().year - data['date'].dt.year) * 12 + date.today().month - data['date'].dt.month

data

In [None]:
#Extracting the weekday name of the date
data['day_name'] = data['date'].dt.day_name()

data
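
Two more features from the list at the start of this section, a weekend flag and an exact elapsed-day count, can be sketched as follows. The reference date is a fixed, hypothetical value chosen so the result is reproducible; `dates`, `is_weekend` and `days_elapsed` are illustrative names.

```python
import pandas as pd

dates = pd.DataFrame({"date": pd.to_datetime(
    ["01-01-2017", "04-12-2008", "23-06-1988", "25-08-1999", "20-02-1993"],
    format="%d-%m-%Y")})

# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend
dates["is_weekend"] = dates["date"].dt.dayofweek >= 5

# Exact elapsed days up to a fixed (hypothetical) reference date
reference = pd.Timestamp("2020-01-01")
dates["days_elapsed"] = (reference - dates["date"]).dt.days

print(dates)
```

Subtracting timestamps gives exact day counts, whereas subtracting only the year or month components, as in the cells above, ignores the day of the month.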