# Day 2 - ML Pipelines

In this series of exercises, you will learn how to build a robust ML pipeline using [Sklearn Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).

An important part of an ML pipeline is pre-processing. For this, you will learn how to master 
Sklearn encoders and transformers from the [Sklearn preprocessing module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing).


In [8]:
%load_ext autoreload
%autoreload 2

In [36]:
import warnings
warnings.filterwarnings('ignore')

## Preprocessing Data

1. [Scaling with StandardScaler](#exo1)
2. [Encoding Categorical Variables](#exo2)
3. [Dealing with Missing Data](#exo3)
4. [Custom Transformers and Encoders](#exo4)

### 1. Scaling with StandardScaler <a id='exo1'/>

Standardizing features by removing the mean and scaling to unit variance is a common pre-processing step. Many machine learning algorithms behave better when all features are on a comparable scale.

[Sklearn StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) can do the scaling transformation for you.
The standard score of a sample `x` is calculated as:

z = (x - u) / s

where `u` is the mean of the training samples and `s` is their standard deviation.

The goal of this exercise is to re-implement it.

As you know, there are 2 main methods for any encoder/transformer:
- `fit`, which computes the mean and std to be used for later scaling
- `transform`, which performs standardization by centering and scaling

#### Exercise
- Given the numpy arrays `data` and `test_data`, write a simple custom implementation of a standard scaler. To test it, fit the scaler with `data` and transform `test_data` with it.
- Compare your results with `StandardScaler`
- Rewrite the custom implementation as a Python class

In [15]:
import numpy as np
data = np.array([[1, 10], [2, -1], [0, 22], [3, 15]])
test_data = np.array([[2, 1], [5, 1], [3, 55], [3, 1]])

In [3]:
# axis 
np.nanmean(data, axis=0)

array([ 1.5, 11.5])

In [4]:
## Simple custom implementation of Standard Scaler

def fit(X):
    """implement fit method"""
    mean = np.nanmean(X, axis=0)
    std = np.nanstd(X, axis=0)
    return {"mean" : mean, "std" : std}


def transform(X, params):
    """implement transformation method"""
    return (X - params["mean"]) / params["std"]

In [5]:
params = fit(data)
transformed_test_data = transform(test_data,params)
transformed_test_data

array([[ 0.4472136 , -1.25275497],
       [ 3.13049517, -1.25275497],
       [ 1.34164079,  5.18998488],
       [ 1.34164079, -1.25275497]])

In [6]:
# Use Sklearn StandardScaler and compare results
from sklearn.preprocessing import StandardScaler

# your code that uses StandardScaler 

scaler = StandardScaler()
scaler.fit(data)
transformed_test_data_2 = scaler.transform(test_data)
transformed_test_data_2

array([[ 0.4472136 , -1.25275497],
       [ 3.13049517, -1.25275497],
       [ 1.34164079,  5.18998488],
       [ 1.34164079, -1.25275497]])

In [16]:
# Custom Implementation with a Class

class Scaler(object):
    def __init__(self):
        self.mean = None
        self.std = None
    
    def fit(self, X):
        self.mean = np.nanmean(X, axis=0)
        self.std = np.nanstd(X, axis=0)
        
    def transform(self, X):
        return (X - self.mean) / self.std
        
scaler = Scaler()
scaler.fit(data)
transformed_test_data_3 = scaler.transform(test_data)
transformed_test_data_3


array([[ 0.4472136 , -1.25275497],
       [ 3.13049517, -1.25275497],
       [ 1.34164079,  5.18998488],
       [ 1.34164079, -1.25275497]])
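
The comparison asked for in the exercise can also be done numerically rather than by eye. The block below is a minimal, self-contained sketch (it redefines `data` and `test_data` with the same values as above); the names `manual`, `reference` and `same` are only illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1, 10], [2, -1], [0, 22], [3, 15]])
test_data = np.array([[2, 1], [5, 1], [3, 55], [3, 1]])

# Manual standardization: the statistics come from `data` only
mean = data.mean(axis=0)
std = data.std(axis=0)  # population std (ddof=0), the same convention StandardScaler uses
manual = (test_data - mean) / std

# Reference result from scikit-learn
reference = StandardScaler().fit(data).transform(test_data)

# True when both results agree up to floating-point tolerance
same = np.allclose(manual, reference)
print(same)
```

Note that the statistics are learned on `data` and only applied to `test_data`; the test set never influences the mean or the standard deviation.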

In [17]:
scaler.fit(data)

### 2. Encoding Categorical Variables <a id='exo2'/>

Features are often given as categorical rather than continuous values. However, most machine learning algorithms only accept numerical data as input. That is why categorical variables need to be encoded before they are passed to ML estimators.

One encoder that is commonly used for categorical variables is [`OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).

#### Exercise

Given `data` and `test_data`, implement a OneHotEncoder yourself, then use the Sklearn implementation to make sure you got it right.

In [10]:
import numpy as np
data = np.array([['France'], ['USA'], ['Italy'], ['Japan'], ['UK'], ['Germany'], ['USA'], ['Japan']])
test_data = np.array([['China'], ['USA'], ['Italy']])

In [19]:
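class CustomOneHotEncoder(object):
    """Re-implement a one-hot encoder for a single categorical column."""
    def __init__(self):
        self.categories = None
    
    def fit(self, X):
        # Learn the sorted unique categories seen in the training data
        self.categories = np.unique(X)
        return self
        
    def transform(self, X):
        # One row per sample, one column per known category;
        # categories not seen during fit are encoded as all zeros
        X = np.asarray(X).reshape(-1, 1)
        return (X == self.categories).astype(float)
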

In [11]:
## use Sklearn OneHotEncoder and compare results on test_data
from sklearn.preprocessing import OneHotEncoder

# Fit on the training data, then transform the test data;
# handle_unknown='ignore' encodes the unseen category 'China' as all zeros
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(data)
enc.transform(test_data).toarray()

array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1.],
       [0., 0., 1., 0., 0., 0.]])
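
The `handle_unknown='ignore'` option matters as soon as the data to transform contains a category that was not seen during `fit`. The sketch below contrasts it with the default setting on a small hypothetical array (`train` and `unseen` are illustrative names, not part of the exercise data).

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([['France'], ['USA'], ['Italy']])
unseen = np.array([['China']])

# Default behaviour (handle_unknown='error'): an unknown category raises at transform time
strict = OneHotEncoder().fit(train)
try:
    strict.transform(unseen)
    raised = False
except ValueError:
    raised = True

# With handle_unknown='ignore', the unknown category becomes an all-zero row
lenient = OneHotEncoder(handle_unknown='ignore').fit(train)
encoded = lenient.transform(unseen).toarray()

print(raised)
print(lenient.categories_[0])
print(encoded)
```

Which behaviour is preferable depends on the problem: failing loudly surfaces data issues early, while ignoring unknown categories keeps a deployed pipeline running.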

### 3. Dealing with Missing Data <a id='exo3'/>

For various reasons, many real world datasets contain missing values, often encoded as blanks, NaNs or other placeholders. Such datasets however are incompatible with scikit-learn estimators which assume that all values in an array are numerical, and that all have and hold meaning.

For this, Sklearn offers several ways to impute missing data with the [impute module](https://scikit-learn.org/stable/modules/impute.html#).

#### Exercise
- Re-implement the [`SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer) transformer using the `mean` strategy.
- Test your implementation with `data` and `test_data`
- Compare with the data transformed by Sklearn's `SimpleImputer`
- Bonus: Implement all 4 strategies (`mean`, `median`, `most_frequent` and `constant`)

In [21]:
import numpy as np
data = np.array([[1, 3, 3], [2, np.nan, 6], [3, 9, 9]])
test_data = np.array([[1, 1, 1], [1, np.nan, 1], [1, 1, 1]])

data_other = np.array([[1, 3, 3], [2, np.nan, 6], [np.nan, 9, 9]])

In [31]:
ind = np.where(np.isnan(data))
print(ind)

ind_other = np.where(np.isnan(data_other))
print(ind_other)
#data[ind]
#data[ind]
ind_other[1]


data_other[ind_other[1]]

(array([1]), array([1]))
(array([1, 2]), array([1, 0]))


array([[ 2., nan,  6.],
       [ 1.,  3.,  3.]])

In [79]:
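class CustomSimpleImputer(object):
    """Re-implement SimpleImputer for the `mean` strategy."""

    def __init__(self, strategy="mean"):
        self.strategy = strategy
        
    def fit(self, X):
        # Column-wise means, ignoring NaNs
        self.mean = np.nanmean(X, axis=0)
        return self
    
    def transform(self, X):
        # Replace each NaN with the mean of its column learned in fit
        X = np.array(X, dtype=float)
        rows, cols = np.where(np.isnan(X))
        X[rows, cols] = self.mean[cols]
        return X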

In [32]:
## use Sklearn SimpleImputer and compare transformed data using your custom implementation
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="mean")
imputer.fit(data)
imputer.transform(test_data)

array([[1., 1., 1.],
       [1., 6., 1.],
       [1., 1., 1.]])
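
For the bonus question, it can help to first look at what each strategy learns. The sketch below uses Sklearn's own `SimpleImputer` on a small hypothetical array `X` (not the exercise data) and reads the fitted values from the `statistics_` attribute.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: one missing value in each column
X = np.array([[1, 2], [1, np.nan], [3, 2], [np.nan, 8]])

# Per-column fill values learned by each data-driven strategy
learned = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy).fit(X)
    learned[strategy] = imputer.statistics_

# The constant strategy does not look at the data: it always uses fill_value
constant_filled = SimpleImputer(strategy="constant", fill_value=0).fit_transform(X)

for strategy, values in learned.items():
    print(strategy, values)
print(constant_filled)
```

A custom implementation can follow the same shape: compute one fill value per column in `fit`, then reuse the NaN-replacement logic in `transform` regardless of the strategy.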

### 4. Custom Transformers and Encoders <a id='exo4'/>

Sklearn provides a large collection of transformers and encoders, but you might need to implement your own encoder to fit the needs of your data and problem.

For this, there are two very useful Sklearn classes:
1. [FunctionTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html), which lets you construct a transformer from an arbitrary callable.
2. [BaseEstimator](https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html) and [TransformerMixin](https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html), the base classes you can use to implement completely new custom encoders.
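
The difference between the two approaches can be sketched on a toy array before moving to the taxi data. The class `ColumnRangeScaler` below is a hypothetical example written for this illustration, not a Sklearn class; it only shows where learned state goes.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer

X = np.array([[0.0, 1.0], [3.0, 7.0]])

# 1. Stateless transformation: wrap an existing callable
log_transformer = FunctionTransformer(np.log1p)
logged = log_transformer.fit_transform(X)


# 2. Stateful transformation: learn something in fit, apply it in transform
class ColumnRangeScaler(BaseEstimator, TransformerMixin):
    """Hypothetical example: rescale each column to [0, 1] using the fitted min and max."""

    def fit(self, X, y=None):
        self.min_ = X.min(axis=0)
        self.range_ = X.max(axis=0) - self.min_
        return self

    def transform(self, X, y=None):
        return (X - self.min_) / self.range_


# fit_transform is not defined above: it is provided by TransformerMixin
scaled = ColumnRangeScaler().fit_transform(X)

print(logged)
print(scaled)
```

As a rule of thumb, `FunctionTransformer` is enough when nothing has to be learned from the training data, while a class is needed as soon as `transform` depends on something computed in `fit`.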


#### Exercise
With the Taxi Fare Prediction Challenge data:

- Using `FunctionTransformer`, implement a transformer that computes the haversine distance between the pickup and dropoff locations
- With `BaseEstimator` and `TransformerMixin`, implement a custom encoder that extracts time features from `pickup_datetime`
- Use these two new encoders to fit and transform the training data

In [47]:
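# your code
import os
import pandas as pd

# Machine-specific path: adjust it to the folder that contains train.csv
os.chdir('/Users/nicolasbancel/git/data')

# Load only the first 1000 rows to keep the exercise fast
df = pd.read_csv('train.csv', nrows = 1000)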

In [48]:
def haversine_vectorized(df, 
    start_lat="start_lat", 
    start_lon="start_lon", 
    end_lat="end_lat", 
    end_lon="end_lon"):

    """ 
        Calculate the great circle distance between two points 
        on the earth (specified in decimal degrees).
        Vectorized version of the haversine distance for pandas df
        Computes distance in kms
    """

    lat_1_rad, lon_1_rad = np.radians(df[start_lat].astype(float)), np.radians(df[start_lon].astype(float))
    lat_2_rad, lon_2_rad = np.radians(df[end_lat].astype(float)), np.radians(df[end_lon].astype(float))
    dlon = lon_2_rad - lon_1_rad
    dlat = lat_2_rad - lat_1_rad

    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat_1_rad) * np.cos(lat_2_rad) * np.sin(dlon / 2.0) ** 2
    c = 2 * np.arcsin(np.sqrt(a))
    return 6371 * c
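
A quick sanity check for a haversine implementation is a case with a known answer: two points on the same meridian, one degree of latitude apart, are separated by an arc of `6371 * pi / 180` km (about 111.19 km). The sketch below uses a standalone helper, `haversine_km`, written only for this check; the coordinates are arbitrary.

```python
import numpy as np


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371):
    """Great-circle distance in km between two points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    return radius_km * 2 * np.arcsin(np.sqrt(a))


# Same longitude, latitudes one degree apart
distance = haversine_km(40.0, -74.0, 41.0, -74.0)
expected = 6371 * np.pi / 180

print(distance, expected)
```

The same formula also returns 0 when the two points coincide, which is a second cheap check worth running on real data.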

In [44]:
df["hav_dist"] = haversine_vectorized(df, start_lat="pickup_latitude", start_lon="pickup_longitude",
            end_lat="dropoff_latitude", end_lon="dropoff_longitude")

In [49]:
from sklearn.preprocessing import FunctionTransformer

transformer = FunctionTransformer(haversine_vectorized, kw_args=dict(start_lat="pickup_latitude", start_lon="pickup_longitude",
                                                                end_lat="dropoff_latitude", end_lon="dropoff_longitude"))

hav_dist = transformer.fit_transform(df)

In [50]:
hav_dist

0      1.030764
1      8.450134
2      1.389525
3      2.799270
4      1.999157
         ...   
995    8.131868
996    6.833256
997    9.991246
998    1.544828
999    3.169336
Length: 1000, dtype: float64

In [52]:
df["hav_dist"] = transformer.fit_transform(df)

In [53]:
df["hav_dist"].head(10)

0    1.030764
1    8.450134
2    1.389525
3    2.799270
4    1.999157
5    3.787239
6    1.555807
7    4.155444
8    1.253232
9    2.849627
Name: hav_dist, dtype: float64

In [54]:
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd 

In [55]:
class TimeFeaturesEncoder(BaseEstimator, TransformerMixin):

    def __init__(self, time_column, time_zone_name='America/New_York'):
        self.time_column = time_column
        self.time_zone_name = time_zone_name

    def transform(self, X, y=None):
        assert isinstance(X, pd.DataFrame)
        # Work on a copy so the caller's DataFrame is left untouched
        X = X.copy()
        # Index by pickup time, converted to the local time zone
        X.index = pd.to_datetime(X[self.time_column])
        X.index = X.index.tz_convert(self.time_zone_name)
        X["dow"] = X.index.weekday
        X["hour"] = X.index.hour
        X["month"] = X.index.month
        X["year"] = X.index.year
        return X[["dow", "hour", "month", "year"]].reset_index(drop=True)

    def fit(self, X, y=None):
        return self

In [56]:
tf = TimeFeaturesEncoder("pickup_datetime")
tf.transform(df).head()

   dow  hour  month  year
0    0    13      6  2009
1    1    11      1  2010
2    2    20      8  2011
3    5     0      4  2012
4    1     2      3  2010
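
The time zone conversion is the step that changes the extracted features the most, so it is worth seeing in isolation. The sketch below is independent of `df`: it assumes timestamps written as `YYYY-MM-DD HH:MM:SS UTC` and uses two hypothetical values in that format, parsed with an explicit `format` argument.

```python
import pandas as pd

# Hypothetical raw timestamps, expressed in UTC
raw = pd.Series(["2009-06-15 17:26:21 UTC", "2010-01-05 16:52:16 UTC"])

# Parse as UTC, then convert to New York local time
utc_times = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S UTC", utc=True)
local_times = utc_times.dt.tz_convert("America/New_York")

# Same four features as the encoder, computed on local time
features = pd.DataFrame({
    "dow": local_times.dt.weekday,
    "hour": local_times.dt.hour,
    "month": local_times.dt.month,
    "year": local_times.dt.year,
})
print(features)
```

Without the conversion, the `hour` feature would be shifted by 4 or 5 hours depending on daylight saving time, which would blur any rush-hour effect.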


## Putting all together as a Pipeline

A Pipeline is a very useful concept. In Machine Learning, you often need to apply a sequence of transformations (scaling, filling missing values, transforming, encoding) to a raw dataset before applying a final estimator.

A Pipeline gives you a single interface for all these transformation steps and the final estimator. This makes it easier to iterate on and improve models, because you can easily add, remove or re-order steps. Changing one or several parameters is also very straightforward and does not require a lot of code refactoring.

For this, you will learn how to use 2 Sklearn modules:
1. [ColumnTransformer](#exo11)
2. [Pipeline](#exo12)

### 1. Column Transformer <a id="exo11" />

Before building your pipeline, let's look at a very useful Sklearn class called [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html).

This estimator allows different columns or column subsets of the input to be transformed separately. The features generated by each transformer are then concatenated to form a single feature space.

It is especially useful when your input data is a pandas dataframe, as you can select columns by name.
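
One detail that is easy to miss is what happens to columns that are not assigned to any transformer. The sketch below uses a small hypothetical dataframe `people` (its column names are illustrative) to contrast the default `remainder='drop'` with `remainder='passthrough'`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical data with three numeric columns
people = pd.DataFrame({"age": [20, 30, 40], "score": [1.0, 2.0, 3.0], "id": [7, 8, 9]})

# Default: columns that are not listed ('score', 'id') are dropped
dropped = ColumnTransformer(
    [("scale", StandardScaler(), ["age"])]
).fit_transform(people)

# remainder='passthrough': unlisted columns are appended unchanged after the transformed ones
kept = ColumnTransformer(
    [("scale", StandardScaler(), ["age"])],
    remainder="passthrough",
).fit_transform(people)

print(dropped.shape)
print(kept.shape)
```

The output columns follow the order of the transformers, not the order of the original dataframe, which is worth keeping in mind when inspecting the resulting array.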

#### Exercise

You are given a small dataset containing weights and heights for a few individuals.

In [72]:
data = pd.DataFrame(
    [
        {'gender': 'Male', 'height': 180, 'weight': 82},
        {'gender': 'Female', 'height': np.nan, 'weight': 72},
        {'gender': 'Male', 'height': 175, 'weight': 75},
        {'gender': 'Female', 'height': 175, 'weight': 60},
        {'gender': 'Male', 'height': 170, 'weight': 76},
    ])

test_data = pd.DataFrame(
    [
        {'gender': 'Male', 'height': 170, 'weight': 72},
        {'gender': 'Female', 'height': np.nan, 'weight': 60}
    ]
)

With `ColumnTransformer`, build a single encoder that applies these transformations:
- encode `gender` with OneHot
- fill missing values for `height`

In [76]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode `gender` and fill missing `height` values with the column mean;
# `weight` is not listed, so it is dropped (default remainder='drop')
encoder = ColumnTransformer([
    ('gender', OneHotEncoder(), ['gender']),
    ('fill_missing', SimpleImputer(), ['height'])])
transformed_data = encoder.fit_transform(data)

In [79]:
transformed_data

array([[  0.,   1., 180.],
       [  1.,   0., 175.],
       [  0.,   1., 175.],
       [  1.,   0., 175.],
       [  0.,   1., 170.]])

In [81]:
data.head()

   gender  height  weight
0    Male   180.0      82
1  Female     NaN      72
2    Male   175.0      75
3  Female   175.0      60
4    Male   170.0      76

In [68]:
from sklearn.preprocessing import Normalizer

# Columns can also be selected by position or slice when the input is a numpy array
ct = ColumnTransformer(
     [("norm1", Normalizer(norm='l1'), [0, 1]),
      ("norm2", Normalizer(norm='l1'), slice(2, 4))])
X = np.array([[0., 1., 2., 2.],
               [1., 1., 0., 1.]])

In [69]:
ct.fit_transform(X)

array([[0. , 1. , 0.5, 0.5],
       [0.5, 0.5, 0. , 1. ]])

In [74]:
encoder.transform(test_data)

array([[  0.,   1., 170.],
       [  1.,   0., 175.]])

### 2. Pipeline <a id="exo12" />

Now it is time to use a Sklearn [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).

#### Exercise
With the weight/height dataset, build a pipeline to predict the weight of individuals in the test set.

This pipeline should have:
- a one-hot encoder for `gender`
- an imputer to fill missing values for `height`
- a scaler for `height`
- a simple estimator like a linear regression

**Tip** You can also use [make_pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html), a shorthand for constructing a `Pipeline` that names each step automatically, so you do not have to name the transformers yourself.
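
To see what "without giving names" means in practice, the sketch below builds the same two steps both ways and lists the resulting step names. The explicit names `imputer` and `scaler` are arbitrary choices for this illustration.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# make_pipeline derives each step name from the lowercased class name
auto_named = make_pipeline(SimpleImputer(), StandardScaler())

# Pipeline requires explicit (name, estimator) pairs
explicit = Pipeline(steps=[("imputer", SimpleImputer()), ("scaler", StandardScaler())])

auto_names = [name for name, _ in auto_named.steps]
explicit_names = [name for name, _ in explicit.steps]

print(auto_names)
print(explicit_names)
```

Explicit names are usually worth the extra typing once parameters have to be referenced by step name, for example in a grid search.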

In [82]:
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

encoder = ColumnTransformer([
    ('gender', OneHotEncoder(), ['gender']),
    ('height_scaled', make_pipeline(SimpleImputer(), StandardScaler()), ['height'])
                            ])

pipe  = Pipeline(steps=[ ('features', encoder),
                         ('clf', LassoCV()) ])

pipe.fit(data, data.weight)

# your code

Pipeline(memory=None,
         steps=[('features',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('gender',
                                                  OneHotEncoder(categories='auto',
                                                                drop=None,
                                                                dtype=<class 'numpy.float64'>,
                                                                handle_unknown='error',
                                                                sparse=True),
                                                  ['gender']),
                                                 ('height_scaled',
                                                  Pipeline(memory=None,
                                                           steps=[('simpleimputer',


In [83]:
pipe.predict(test_data)

array([74.66391325, 66.08048299])
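
Once steps have names, their parameters can be changed without rebuilding the pipeline, using the `<step name>__<parameter>` convention. The sketch below is self-contained and uses hypothetical toy data; `demo_pipe` and its step names are illustrative and separate from the `pipe` object above.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one feature with a missing value
X = np.array([[170.0], [np.nan], [175.0], [180.0]])
y = np.array([72.0, 70.0, 75.0, 82.0])

demo_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])

# Change a parameter of one step in place
demo_pipe.set_params(imputer__strategy="median")
demo_pipe.fit(X, y)

# Fitted steps remain accessible by name
fill_value = demo_pipe.named_steps["imputer"].statistics_[0]
predictions = demo_pipe.predict(X)

print(fill_value)
print(predictions.shape)
```

The same `step__parameter` keys are what a grid search over a pipeline expects in its parameter grid.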

## Refactor Taxi Fare Prediction Problem with a Pipeline

Refactor the model you built yesterday for the Taxi Fare Prediction Problem using:
- Custom encoders you wrote for distance and time features
- OneHotEncoder to encode the hour and day-of-week features
- SimpleImputer to fill missing values
- A simple linear regression
- A pipeline to put all together


Then: 
- train this pipeline
- apply the pipeline on test data
- generate predictions and submit these new predictions to Kaggle

In [36]:
## your code