# Day 2 - ML Pipelines

In this series of exercises, you will learn how to build a robust ML pipeline using [Sklearn Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).

An important part of an ML pipeline is pre-processing. For this, you will learn how to use
Sklearn encoders and transformers from the [Sklearn preprocessing module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing).


In [36]:
import warnings
warnings.filterwarnings('ignore')

## Preprocessing Data

1. [Scaling with StandardScaler](#exo1)
2. [Encoding Categorical Variables](#exo2)
3. [Dealing with Missing Data](#exo3)
4. [Custom Transformers and Encoders](#exo4)

### 1. Scaling with StandardScaler <a id='exo1'/>

Standardizing features by removing the mean and scaling to unit variance is a common pre-processing step that helps many machine learning algorithms perform better.

[Sklearn StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) can do the scaling transformation for you.
The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the mean of the training samples and s is their standard deviation.

The goal of this exercise is to re-implement it.

As you know, any encoder/transformer has 2 main methods:
- `fit`, which computes the mean and std to be used for later scaling.
- `transform`, which performs standardization by centering and scaling.
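
To make the `fit` / `transform` split concrete, here is a minimal sketch on hypothetical toy arrays (`X_train` and `X_new` are illustrative names, not the exercise data). It assumes the population standard deviation (`ddof=0`), which is what `np.std` computes by default and what `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (not the exercise arrays): one feature per column.
X_train = np.array([[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]])
X_new = np.array([[3.0, 40.0]])

scaler = StandardScaler()
scaler.fit(X_train)                  # learns the per-column mean and std
z_sklearn = scaler.transform(X_new)  # applies the statistics learned in fit

# Same computation by hand: z = (x - u) / s, with the population std (ddof=0).
u = X_train.mean(axis=0)
s = X_train.std(axis=0)
z_manual = (X_new - u) / s

print(scaler.mean_)   # per-column u learned during fit
print(scaler.scale_)  # per-column s learned during fit
print(np.allclose(z_sklearn, z_manual))
```

Note that the statistics come from the training data only; new data is transformed with those stored values rather than with its own mean and std.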

#### Exercise
- Given the numpy arrays `data` and `test_data`, write a simple custom implementation of a standard scaler. To test it, fit the scaler with `data` and transform `test_data` with it.
- Compare your results with `StandardScaler`
- Rewrite the custom implementation as a Python class

In [2]:
import numpy as np
data = np.array([[1, 10], [2, -1], [0, 22], [3, 15]])
test_data = np.array([[2, 1], [5, 1], [3, 55], [3, 1]])

In [15]:
## Simple custom implementation of Standard Scaler

def fit(X):
    """implement fit method"""
    pass

def transform(X, **kwargs):
    """implement transformation method"""
    pass

_ = fit(data)
transformed_test_data = transform(test_data)
transformed_test_data

In [21]:
# Use Sklearn StandardScaler and compare results
from sklearn.preprocessing import StandardScaler

# your code that uses StandardScaler 
# compare that you get the same transformed_test_data from above.

In [22]:
# Custom Implementation with a Class
class Scaler(object):
    pass

# your code
# compare that you get the same transformed_test_data from above.

### 2. Encoding Categorical Variables <a id='exo2'/>

Features are often categorical rather than continuous. However, most machine learning algorithms only accept numerical data as input. That is why categorical variables must be encoded before they are passed to ML estimators.

One encoder that is commonly used for categorical variables is [`OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).
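
As a rough illustration of what the encoder produces, the sketch below uses hypothetical toy data (colors, not the exercise arrays). It assumes `handle_unknown="ignore"`, one possible way to deal with categories that appear at transform time but were never seen during `fit`; the default behaviour is to raise an error instead.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data (not the exercise arrays).
train = np.array([["red"], ["green"], ["blue"], ["green"]])
new = np.array([["green"], ["purple"]])  # "purple" was never seen during fit

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

encoded = encoder.transform(new).toarray()  # densify the sparse result
print(encoder.categories_)  # categories learned during fit, sorted per column
print(encoded)
```

Each learned category becomes one output column, so the width of the encoded array is fixed by the training data, not by the data being transformed.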

#### Exercise

Given `data` and `test_data`, implement a OneHotEncoder yourself and then use the Sklearn implementation to make sure you got it right.

In [18]:
import numpy as np
data = np.array([['France'], ['USA'], ['Italy'], ['Japan'], ['UK'], ['Germany'], ['USA'], ['Japan']])
test_data = np.array([['China'], ['USA'], ['Italy']])

In [19]:
class CustomOneHotEncoder(object):
    """re-implement one hot encoder"""
    pass

# your code

In [23]:
## Use Sklearn OneHotEncoder and compare results on test_data
from sklearn.preprocessing import OneHotEncoder

# your code to use OneHotEncoder and check that you get the same transformed_test_data

### 3. Dealing with Missing Data <a id='exo3'/>

For various reasons, many real world datasets contain missing values, often encoded as blanks, NaNs or other placeholders. Such datasets however are incompatible with scikit-learn estimators which assume that all values in an array are numerical, and that all have and hold meaning.

For this, Sklearn provides several ways to impute missing values in its [impute module](https://scikit-learn.org/stable/modules/impute.html#).
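
A minimal sketch of mean imputation with `SimpleImputer`, on hypothetical toy arrays (not the exercise data), assuming missing values are marked with `np.nan`:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data (not the exercise arrays); np.nan marks a missing value.
X_train = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, np.nan]])
X_new = np.array([[np.nan, np.nan]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)               # column means computed while ignoring NaNs
filled = imputer.transform(X_new)  # NaNs replaced by the training means

print(imputer.statistics_)  # the per-column fill values learned during fit
print(filled)
```

The fill values are learned from the training data and stored in `statistics_`, so new data is completed with training statistics even when one of its columns is entirely missing.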

#### Exercise
- Re-implement the [`SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer) transformer using the `mean` strategy.
- Test your implementation with `data` and `test_data`
- Compare with the data transformed by Sklearn's `SimpleImputer`
- Bonus: Implement for all 4 strategies (`mean`, `median`, `most_frequent` and `constant`)

In [78]:
import numpy as np
data = np.array([[1, 3, 3], [2, np.nan, 6], [3, 9, 9]])
test_data = np.array([[1, 1, 1], [1, np.nan, 1], [1, 1, 1]])

In [79]:
class CustomSimpleImputer(object):
    """Implement SimpleImputer"""

    def __init__(self, strategy="mean"):
        self.strategy = strategy
        
    def fit(self, X):
        pass
    
    def transform(self, X):
        pass

In [26]:
## Use Sklearn SimpleImputer and compare the transformed data with your custom implementation
from sklearn.impute import SimpleImputer

# your code

### 4. Custom Transformers and Encoders <a id='exo4'/>

Sklearn provides a large collection of transformers and encoders, but you might need to implement your own encoder to fit the needs of your data and problem.

For this, there are two very useful Sklearn classes:
1. [FunctionTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html), which lets you construct a transformer from an arbitrary callable.
2. [BaseEstimator](https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html) and [TransformerMixin](https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html), which are base classes you can use to implement completely new custom encoders.
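
The two approaches can be sketched as follows. `RangeScaler` is a hypothetical class invented for this illustration (it is not part of Sklearn), and the data is a toy array; the sketch assumes no column is constant, since a zero range would cause a division by zero.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer

# 1. Stateless transformation: wrap an existing callable.
log_transformer = FunctionTransformer(np.log1p)


# 2. Stateful transformation: learn something in fit, reuse it in transform.
class RangeScaler(TransformerMixin, BaseEstimator):
    """Hypothetical example: rescale each column to [0, 1] using the training range."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.min_ = X.min(axis=0)
        self.range_ = X.max(axis=0) - self.min_
        return self  # returning self allows chaining and fit_transform

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return (X - self.min_) / self.range_


X = np.array([[0.0, 10.0], [1.0, 20.0], [3.0, 30.0]])
logged = log_transformer.fit_transform(X)
scaled = RangeScaler().fit_transform(X)  # fit_transform comes from TransformerMixin
```

`FunctionTransformer` is usually enough when the transformation needs no information from the training data; a class with `fit` and `transform` is the better fit when something must be learned first.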


#### Exercise
With the Taxi Fare Prediction Challenge data:

- Using `FunctionTransformer`, implement a transformer that computes the haversine distance between the pickup and dropoff locations
- With `BaseEstimator` and `TransformerMixin`, implement a custom encoder that extracts time features from `pickup_datetime`
- Use these two new encoders to fit and transform the training data
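
For reference, the haversine formula itself can be sketched with NumPy as below. The function name `haversine_km`, its argument order, and the mean Earth radius of 6371 km are assumptions made for this illustration; wrapping it in a `FunctionTransformer` and mapping it to the dataset's coordinate columns is left to the exercise.

```python
import numpy as np


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))


# One degree of longitude along the equator is roughly 111 km.
d = haversine_km(0.0, 0.0, 0.0, 1.0)
print(d)
```

Because it only uses NumPy element-wise operations, the same function also accepts arrays of coordinates and returns one distance per row.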

In [27]:
# your code

## Putting It All Together as a Pipeline

A Pipeline is a very useful concept. In Machine Learning, you often need to apply a sequence of transformations (scaling, filling missing values, transforming, encoding) to a raw dataset before applying a final estimator.

A Pipeline gives you a simple interface for all these transformation steps and the resulting estimator. With it, models are easier to iterate on and improve because you can easily add, remove or re-order steps. Changing one or several parameters is also very straightforward and does not require a lot of code refactoring.

For this, you will learn how to use 2 Sklearn classes:
1. [ColumnTransformer](#exo11)
2. [Pipeline](#exo12)
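
As an illustration of how little refactoring a change requires, the sketch below edits an existing pipeline in place. The step names `scaler` and `model`, and the choice of `Ridge` and `MinMaxScaler`, are arbitrary assumptions for this example.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

pipe = Pipeline([("scaler", StandardScaler()), ("model", Ridge(alpha=1.0))])

# Change a hyper-parameter of one step with the "<step>__<param>" syntax.
pipe.set_params(model__alpha=0.1)

# Swap an entire step by assigning a new estimator to its name.
pipe.set_params(scaler=MinMaxScaler())

print(pipe.named_steps)
```

The same `<step>__<param>` naming is what Sklearn's hyper-parameter search tools use to address parameters inside a pipeline.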

### 1. Column Transformer <a id="exo11" />

Before building your pipeline, let's look at a very useful Sklearn class called [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html).

This estimator applies different transformers to different columns or column subsets of the input. The features generated by each transformer are then concatenated to form a single feature space.

It is especially useful when your input data is a pandas DataFrame, because you can select columns by name.
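
A minimal sketch of the idea, on a hypothetical toy frame (the column names `city`, `age` and `id` are illustrative only and unrelated to the exercise data):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame (column names are illustrative only).
df = pd.DataFrame({
    "city": ["Paris", "Rome", "Paris", "Berlin"],
    "age": [20.0, 30.0, 40.0, 50.0],
    "id": [1, 2, 3, 4],
})

# Each entry is (name, transformer, columns). Columns not listed are dropped
# by default (remainder="drop"), so "id" does not reach the output.
preprocessor = ColumnTransformer([
    ("city_onehot", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ("age_scaled", StandardScaler(), ["age"]),
])

features = preprocessor.fit_transform(df)
# The result may be sparse depending on its density; densify if needed.
features = features.toarray() if hasattr(features, "toarray") else features
print(features.shape)  # 3 one-hot columns for "city" + 1 scaled "age" column
```

The output columns follow the order of the transformer list, so the one-hot columns come first and the scaled `age` column last.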

#### Exercise

You are given a small dataset containing weights and heights for a few individuals.

In [239]:
import numpy as np
import pandas as pd

data = pd.DataFrame(
    [
        {'gender': 'Male', 'height': 180, 'weight': 82},
        {'gender': 'Female', 'height': np.nan, 'weight': 72},
        {'gender': 'Male', 'height': 175, 'weight': 75},
        {'gender': 'Female', 'height': 175, 'weight': 60},
        {'gender': 'Male', 'height': 170, 'weight': 76},
    ])

test_data = pd.DataFrame(
    [
        {'gender': 'Male', 'height': 170, 'weight': 72},
        {'gender': 'Female', 'height': np.nan, 'weight': 60}
    ]
)

data

|   | gender | height | weight |
|---|--------|--------|--------|
| 0 | Male   | 180.0  | 82     |
| 1 | Female | NaN    | 72     |
| 2 | Male   | 175.0  | 75     |
| 3 | Female | 175.0  | 60     |
| 4 | Male   | 170.0  | 76     |


With `ColumnTransformer`, build a single encoder that applies these transformations:
- one-hot encode `gender`
- fill missing values for `height`

In [29]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

encoder = ColumnTransformer([])
encoder.fit_transform(data)

array([], shape=(5, 0), dtype=float64)

In [30]:
encoder.transform(test_data)

array([], shape=(2, 0), dtype=float64)

### 2. Pipeline <a id="exo12" />

Now it is time to use a Sklearn [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).

#### Exercise
With the weight/height dataset, build a pipeline to predict the weight of individuals in the test set.

This pipeline should have:
- a `OneHotEncoder` for `gender`
- an imputer to fill missing values for `height`
- a scaler for `height`
- a simple estimator like a linear regression

**Tip:** You can also use [make_pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html), a shorthand for constructing a `Pipeline` that names the steps automatically, so you do not have to name the transformers yourself.
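
The difference between the two constructors can be sketched on hypothetical toy regression data (not the weight/height dataset); the step names `scaler` and `model` are arbitrary choices for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy regression data: y = 2 * x + 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2 * X.ravel() + 1

# Explicit step names.
named = Pipeline([("scaler", StandardScaler()), ("model", LinearRegression())])

# Step names generated automatically from the lowercased class names.
auto = make_pipeline(StandardScaler(), LinearRegression())

named.fit(X, y)
auto.fit(X, y)

print([name for name, _ in auto.steps])
pred = auto.predict(np.array([[4.0]]))
print(pred)
```

Both objects behave identically once built; explicit names are mostly useful when you later want to reference a step, for example in `set_params`.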

In [35]:
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# your code

## Refactor Taxi Fare Prediction Problem with a Pipeline

Refactor the model you built yesterday for the Taxi Fare Prediction Problem using:
- The custom encoders you wrote for distance and time features
- A OneHotEncoder to encode the hour and day of week features
- A SimpleImputer to fill missing values
- A simple linear regression
- A pipeline to put it all together


Then: 
- train this pipeline
- apply the pipeline on test data
- generate predictions and submit these new predictions to Kaggle

In [36]:
## your code