# Data Processing with sklearn

*aka feature engineering*

Feature engineering is the process of representing data in a way that improves model performance.

> It's easy to feed raw data into a machine learning model. **The difference is in the data processing**

Many factors influence a model's performance. Clearly, if a feature has no relation to the result, no representation of that feature will help.

Keep in mind that each model has its own needs and limitations. Examples are:

- Some models can't handle multicollinearity or correlation between features
- Many models can't handle NaNs
- Some models are severely penalized by irrelevant features.

Feature engineering and feature selection play an important role in dealing with these problems.
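
As a small illustration of the NaN limitation, the sketch below tries to fit a `LinearRegression` on a feature matrix that contains a missing value. The toy data is made up for this example; the point is only that the fit is rejected until the NaN is dealt with.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (made up for illustration): one feature with a missing value
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

try:
    LinearRegression().fit(X, y)
    fit_failed = False
except ValueError as err:
    # scikit-learn validates the input and refuses NaNs for this estimator
    fit_failed = True
    print("fit failed:", err)
```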

## Data processing workflow

<img src="https://miro.medium.com/max/875/1*QRVI4dwzTN89P8awT8FC6A.png"  width=700>

## Data pre-processing steps
<img src="./images/pre-processing_order.png" width=700>

## Pre-processing

- Split
    - Train-test split 
- Cleaning
    - drop irrelevant columns
    - remove duplicates
    - remove samples based on filter techniques
    - remove outliers
    - remove incorrect data
    - change dtypes
    - flag missing values as NaN
    
Data cleaning must be done before missing values imputation and one-hot encoding.
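
A minimal sketch of a few of these cleaning steps with pandas, assuming a hypothetical raw table in which `id` is irrelevant, `"?"` marks a missing value, and `area` was read as text. The column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: column names and placeholder values are made up
raw = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "city": ["Paris", "Lyon", "Lyon", "?"],
    "area": ["105.4", "533.68", "533.68", "18.13"],
})

clean = (
    raw.drop(columns=["id"])       # drop an irrelevant column
       .drop_duplicates()          # remove duplicated rows
       .replace("?", np.nan)       # flag missing values as NaN
       .astype({"area": float})    # change the dtype from text to float
       .reset_index(drop=True)
)
print(clean)
```

Outlier removal and filtering depend on the dataset, so they are left out of this sketch.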

- Impute missing values
    - SimpleImputer (mean, mode)
    - KNNImputer
    - IterativeImputer
    
Imputation must be done before one-hot encoding (the one-hot encoder does not accept NaNs)

- Transform features:
    - Categorical Features:
        - OrdinalEncoder
        - LabelEncoder
        - One-Hot Encoding
    - Numerical Features:
        - Binarizer
        - KBinsDiscretizer
        - MinMaxScaler
        - StandardScaler
        - RobustScaler

*Flagging NaNs and fixing dtypes must be done before imputation.

- Feature Engineering
    - PolynomialFeatures
    - PowerTransformer
    - Feature aggregation (combining features)
    
- Feature Selection:
    - Univariate Statistical Test
    - Recursive Feature Elimination (RFE)
    - `mutual_info_classif`
    - Variance inflation factor (VIF)
    
- Dimensionality Reduction:
    - Principal Component Analysis (PCA)
    - Linear Discriminant Analysis (LDA)
    - t-SNE
    
These classes are called **transformers**. The idea is to transform the data so that the model can run on it and perform better.

> Fit is a CrossFitter thing, and it's all about training

It's important that transformers are fitted on the training data only, to avoid **data leakage** (when test data is used for training).

Follow this simple recipe:

1. Make `train-test split`
2. Use `.fit()` on training data
3. Use `.transform()` on training data
4. Use `.transform()` on test data (with the same transformer trained on the training data)
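
The recipe above can be sketched as follows. The toy data, the split size and the choice of `SimpleImputer` as the transformer are assumptions made for illustration; any transformer follows the same four steps.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy feature matrix with missing values (made up for illustration)
X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0], [np.nan], [7.0], [8.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# 1. Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2. Fit on the training data only
imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)

# 3. Transform the training data
X_train_imp = imputer.transform(X_train)

# 4. Transform the test data with the same fitted transformer
X_test_imp = imputer.transform(X_test)

# The learned statistic comes from the training rows alone
print(imputer.statistics_)
```

Because the mean is learned from `X_train` only, nothing about the test rows influences the values used to fill them.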

In [14]:
# General imports
import numpy as np
import pandas as pd

# Import data viz
import matplotlib.pyplot as plt
import seaborn as sns

# Missing data
<img src="images/pre-processing_order.png" width=600>

## SimpleImputer

[**SimpleImputer**](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) is a class in the `sklearn.impute` module used to fill missing values. Strategy options are:

- `mean`
- `median`
- `most_frequent`
- `constant`

### Recipe
```python
# import class
from sklearn.impute import SimpleImputer

# instantiate imputer
imputer = SimpleImputer(strategy='mean')

# train imputer (fit returns the imputer itself, not the data)
imputer.fit(X_train)

# transform data
X_imp = imputer.transform(X_train)
```
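
To see how the strategies listed above differ, the sketch below fills the same missing value with each of them. The small array and the `fill_value=-1` used for the `constant` strategy are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One column with a missing value in the third row (toy data)
X_num = np.array([[1.0], [1.0], [np.nan], [10.0]])

results = {}
for strategy in ["mean", "median", "most_frequent"]:
    imputer = SimpleImputer(strategy=strategy)
    filled = imputer.fit_transform(X_num)
    results[strategy] = filled[2, 0]  # value that replaced the NaN

# 'constant' needs the replacement value to be given explicitly
constant_imputer = SimpleImputer(strategy="constant", fill_value=-1)
results["constant"] = constant_imputer.fit_transform(X_num)[2, 0]

print(results)
# mean -> 4.0, median -> 1.0, most_frequent -> 1.0, constant -> -1.0
```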

In [15]:
# Make a dataset
X = np.array([[-1.0], 
              [-0.5], 
              [np.nan], 
              [0.5], 
              [1.0]])

In [16]:
from sklearn.impute import SimpleImputer

# Instantiate imputer
imputer = SimpleImputer(strategy='median')

# Train imputer and transform input data
imputer.fit(X)
X_imp = imputer.transform(X)

# Show Results
print(X_imp)

[[-1. ]
 [-0.5]
 [ 0. ]
 [ 0.5]
 [ 1. ]]


## [KNNImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html#sklearn.impute.KNNImputer)

Each missing value is imputed with the mean of that feature over the k nearest neighbors found in the training data. Two samples are close when the features that are present in both are close.

> Avoid it on large datasets: finding the neighbors requires computing distances between samples, which gets expensive.

In [17]:
# Make data
X = np.array([[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]])
print(X)

[[ 1.  2. nan]
 [ 3.  4.  3.]
 [nan  6.  5.]
 [ 8.  8.  7.]]


In [18]:
# Recipe
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=2)

imputer.fit_transform(X)

array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])
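
To make the neighbor logic concrete, the sketch below reproduces the value imputed for the first row by hand. It assumes the default settings of `KNNImputer` (uniform weights and the NaN-aware Euclidean metric) and uses `nan_euclidean_distances` from `sklearn.metrics.pairwise` to compute distances on the features that both rows have.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]])

# Distances from row 0 to every row, using only features present in both rows
dist = nan_euclidean_distances(X)[0]

# Candidate donors: the other rows where the third feature is present
candidates = [i for i in range(len(X)) if i != 0 and not np.isnan(X[i, 2])]

# Keep the 2 closest candidates and average their third feature
nearest = sorted(candidates, key=lambda i: dist[i])[:2]
manual_value = X[nearest, 2].mean()

sklearn_value = KNNImputer(n_neighbors=2).fit_transform(X)[0, 2]
print(nearest, manual_value, sklearn_value)
# nearest rows are 1 and 2, so the imputed value is (3 + 5) / 2 = 4.0
```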

# Transform Features - categorical data 
<img src="images/pre-processing_order.png" width=600>

We already know how to use `pd.get_dummies()`. To make it easier to integrate this step into a pipeline, it's important to use sklearn instead.

The relevant classes are:
- [OrdinalEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder)
- [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html?highlight=labelencoder#sklearn.preprocessing.LabelEncoder)
- [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder)

`LabelEncoder` should only be used on target.
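
As a sketch of what pipeline integration buys, the example below chains an imputer and a `OneHotEncoder` with `make_pipeline`, fits them on training data and reuses the fitted steps on test data. The color values are invented, and `handle_unknown="ignore"` is one possible choice for categories that only appear at test time. `pd.get_dummies()` keeps no fitted state, so it cannot guarantee the same columns on train and test.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical feature with one missing value
X_train = np.array([["red"], ["blue"], [np.nan], ["blue"]], dtype=object)
# "green" never appears in the training data
X_test = np.array([["red"], ["green"]], dtype=object)

categorical_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),   # impute first
    OneHotEncoder(handle_unknown="ignore"),    # then one-hot encode
)

# Both steps are fitted on the training data only
categorical_pipeline.fit(X_train)

# The test data gets the columns learned from training: ['blue', 'red']
X_test_encoded = categorical_pipeline.transform(X_test).toarray()
print(X_test_encoded)
# [[0. 1.]
#  [0. 0.]]
```

The unseen category `"green"` becomes a row of zeros instead of adding a new column.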

## OrdinalEncoder and LabelEncoder

Both transform categories into a sequence of integers; by default the order is alphabetical. For categories with a meaningful order, `OrdinalEncoder` should be used and the order must be passed as an array.

<img src="https://i.imgur.com/tEogUAr.png" width=1000>

### What's the difference between them?
- `OrdinalEncoder` is for 2D data with shape `(n_samples, n_features)`, so it is used to transform features;

- `LabelEncoder` is for 1D data with shape `(n_samples,)`, so it is used to transform labels (the target).

Another difference is the learned parameter:
- `LabelEncoder` learns `classes_`
- `OrdinalEncoder` learns `categories_`

In [22]:
# Creating data
X = np.array([['Paris', 'Île-de-France', 105.4],
              ['Yvelines', 'Île-de-France', 2284.0],
              ['Grenoble', 'Auvergne-Rhône-Alpes', 18.13],
              ['Lyon Metropolis', 'Auvergne-Rhône-Alpes', 533.68]])

# Which ones are cool and uncool?
y = ['uncool', 'cool', 'uncool', 'cool']

In [23]:
# Imports
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder

# Instantiate encoders
feature_encoder = OrdinalEncoder()
label_encoder = LabelEncoder()

# Train and transform encoders
X_encoded = feature_encoder.fit_transform(X[:, :2]) # choose 2 first columns (categorical)
y_encoded = label_encoder.fit_transform(y)

In [25]:
# Before
print("X:\n", X)

# Result X
print("\n\nX_encoded:\n", X_encoded)
print("\ncategories: ", feature_encoder.categories_)

# Result y
print("\n\ny_encoded: \n", y_encoded)
print("\nclasses: ", label_encoder.classes_)

X:
 [['Paris' 'Île-de-France' '105.4']
 ['Yvelines' 'Île-de-France' '2284.0']
 ['Grenoble' 'Auvergne-Rhône-Alpes' '18.13']
 ['Lyon Metropolis' 'Auvergne-Rhône-Alpes' '533.68']]


X_encoded:
 [[2. 1.]
 [3. 1.]
 [0. 0.]
 [1. 0.]]

categories:  [array(['Grenoble', 'Lyon Metropolis', 'Paris', 'Yvelines'], dtype='<U32'), array(['Auvergne-Rhône-Alpes', 'Île-de-France'], dtype='<U32')]


y_encoded: 
 [1 0 1 0]

classes:  ['cool' 'uncool']


In [26]:
# Give OrdinalEncoder a specific order:
OrdinalEncoder(categories=[['cold', 'warm', 'hot']])\
.fit_transform([['hot'], ['warm'], ['warm'], ['cold']])\
.reshape((1,-1))[0]

array([2., 1., 1., 0.])

## [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html?highlight=onehotencoder#sklearn.preprocessing.OneHotEncoder)

Transforms each category into a binary column.

<img src="https://i.imgur.com/TW5m0aJ.png" width=1000>

In [28]:
# Create data
X = np.array([['abacate'], 
              ['Irmão do Jorel'], 
              ['Vovó Juju'], 
              ['Vovó Juju'], 
              ['Irmão do Jorel']])

In [29]:
from sklearn.preprocessing import OneHotEncoder

# Instantiate encoder
dummy_encoder = OneHotEncoder()

# Train encoder and transform input features (add .toarray() for a dense array)
X_onehot = dummy_encoder.fit_transform(X)

# Show results
print(X_onehot)
print(' ')
print(dummy_encoder.categories_) # sorted by character code: uppercase comes before lowercase

  (0, 2)	1.0
  (1, 0)	1.0
  (2, 1)	1.0
  (3, 1)	1.0
  (4, 0)	1.0
 
[array(['Irmão do Jorel', 'Vovó Juju', 'abacate'], dtype='<U14')]


By default, OneHotEncoder returns a scipy sparse matrix

In [31]:
print(type(X_onehot))

<class 'scipy.sparse.csr.csr_matrix'>


A sparse matrix is one way to represent a matrix: only the non-zero entries are stored. In the printed output, the position (row, column) is written inside the parentheses and the assigned value outside.

**To return a numpy array, pass `sparse=False` (renamed to `sparse_output=False` in scikit-learn 1.2) or call `X_onehot.toarray()`.**
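
The sketch below shows the same idea outside of `OneHotEncoder`: a small dense array (made up for illustration) is converted to a scipy CSR matrix, its stored entries are listed as (row, column, value), and `.toarray()` brings back the dense form.

```python
import numpy as np
from scipy import sparse

# A mostly-zero array, like the output of a one-hot encoding
dense = np.array([[0.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])

X_sparse = sparse.csr_matrix(dense)

# Only the non-zero entries are stored: (row, column) -> value
coo = X_sparse.tocoo()
entries = [(int(r), int(c), float(v)) for r, c, v in zip(coo.row, coo.col, coo.data)]
print(entries)
# [(0, 2, 1.0), (1, 0, 1.0), (2, 1, 1.0)]

# Converting back gives the original dense array
restored = X_sparse.toarray()
```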

`drop='first'` excludes the column of the first category. **Check the documentation before using it.**

In [33]:
# Make data
X = np.array([['abacate'], 
              ['Irmão do Jorel'], 
              ['Vovó Juju'], 
              ['Vovó Juju'], 
              ['Irmão do Jorel']])

In [34]:
# Instantiate encoder
dummy_encoder = OneHotEncoder(drop='first')

# Train encoder, transform input features and convert to a dense array
X_onehot = dummy_encoder.fit_transform(X).toarray()

# Show results
print(X_onehot)
print(' ')
print(dummy_encoder.categories_)

[[0. 1.]
 [0. 0.]
 [1. 0.]
 [1. 0.]
 [0. 0.]]
 
[array(['Irmão do Jorel', 'Vovó Juju', 'abacate'], dtype='<U14')]


# Transform Features - Numerical data