# Day 3 - ML Iteration

Now that we have built a simple model, we want to make it better! The ultimate goal is a model that makes more accurate predictions on the test set, that is, one with an RMSE as low as possible.

**So what can we do?**

There are many different things that make models better:
- build and try different or additional features
- test different estimators (linear, non-linear, etc.)
- tune hyperparameters


The problem is that it is often hard to keep track of these different experiments. There are many parameters that we can tune, and many possible combinations.

**[MLFlow](https://www.mlflow.org/docs/latest/concepts.html) is a very useful tool to help us iterate on machine learning models.**

In this series of exercises, you will get hands-on experience with the [MLFlow Tracking API](https://www.mlflow.org/docs/latest/tracking.html) in order to experiment with different features, models and parameters.

### Summary
0. [Workflow setup](#part0)
1. [Setup MLflow Tracking](#part1)
2. [Try different models](#part2)
3. [Features engineering](#part3)
4. [Hyperparameters tuning](#part4)
5. [MLFlow Projects](#part5)

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [85]:
%load_ext autoreload
%autoreload 2

### 0. Workflow Setup <a id="part0" />

It is time to move away from Jupyter Notebook and start writing reusable code with Python modules and classes.

Let's first create our folder structure:

1. Create a folder with the name of your project/module, for example `TaxiFareModel`
2. Inside this folder, create an `__init__.py` file to make it a Python package
3. Then create multiple python files that will contain the different python classes or methods we need


* `trainer.py`: **Trainer** class that will be our main class that trains our model
* `data.py`: **Data** class that will be responsible for getting the input raw data
* `utils.py`: any utility functions you may have
* `encoders.py`: your custom encoders and transformers

You should have something like:

* TaxiFareModel
    * __init__.py
    * trainer.py
    * data.py
    * utils.py
    * encoders.py
    * data
      * train.csv
      * test.csv
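
If you prefer, the layout above could be scaffolded with a few lines of Python instead of creating each file by hand. This is only a sketch: `scaffold_package` is a hypothetical helper, and the demo writes into a temporary directory so that it does not touch your project. You still need to copy `train.csv` and `test.csv` into the `data` folder yourself.

```python
import tempfile
from pathlib import Path

# Module files listed in the folder structure above
MODULE_FILES = ["__init__.py", "trainer.py", "data.py", "utils.py", "encoders.py"]


def scaffold_package(root, name="TaxiFareModel"):
    """Create the package folder, its empty modules and a data/ folder (hypothetical helper)."""
    package_dir = Path(root) / name
    (package_dir / "data").mkdir(parents=True, exist_ok=True)
    for filename in MODULE_FILES:
        (package_dir / filename).touch()  # creates an empty file if it is missing
    return package_dir


# Demo in a throwaway directory; pass your own project root instead.
package_dir = scaffold_package(tempfile.mkdtemp())
print(sorted(p.name for p in package_dir.iterdir()))
```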

#### data.py
- Create the `Data` class. This class should have a method `get_data` 

In [2]:
class Data(object):
    
    def get_data(self, nrows=None, test=False):
        """returns the input data"""
        pass
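
One possible way to fill in `get_data` is sketched below. It assumes the CSV files live in a `data` folder and are read with pandas; the `data_dir` constructor argument is an assumption made for this illustration, not part of the exercise template. The demo writes a tiny CSV into a temporary folder just to show the call.

```python
import os
import tempfile

import pandas as pd


class Data(object):

    def __init__(self, data_dir="data"):
        # data_dir is a hypothetical parameter: the folder holding train.csv and test.csv
        self.data_dir = data_dir

    def get_data(self, nrows=None, test=False):
        """returns the input data as a DataFrame, limited to nrows rows if given"""
        filename = "test.csv" if test else "train.csv"
        return pd.read_csv(os.path.join(self.data_dir, filename), nrows=nrows)


# Demo with a tiny made-up file in a temporary folder
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "train.csv"), "w") as f:
    f.write("fare_amount,passenger_count\n4.5,1\n16.9,2\n5.7,1\n")

df = Data(data_dir=demo_dir).get_data(nrows=2)
print(df.shape)
```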

In [84]:
# https://stackoverflow.com/questions/997797/what-does-s-mean-in-a-python-format-string

import os
"Hello %s, my name is %s" % ('john', 'mike')

'Hello john, my name is mike'

#### utils.py
- `utils.py` is where you can put:
 - `haversine_distance` method
 - `compute_rmse` method
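
As a reference point, here is a minimal pure-Python sketch of these two helpers. The signatures are assumptions (your own versions probably work on whole DataFrame columns with numpy), but the formulas are the standard haversine distance and root mean squared error.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_distance(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def compute_rmse(y_pred, y_true):
    """Root mean squared error between two equally sized sequences."""
    squared_errors = [(p - t) ** 2 for p, t in zip(y_pred, y_true)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# One degree of longitude along the equator is roughly 111 km
print(haversine_distance(0.0, 0.0, 0.0, 1.0))
print(compute_rmse([0.0, 0.0], [3.0, 4.0]))
```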

#### encoders.py
- In `encoders.py` let's put the custom encoders and transformers you have for distance and time features

#### trainer.py
- The `Trainer` class is the main class. It should have:
  - a `get_estimator()` method that returns the estimator chosen to train the model
  - a `get_pipeline()` method that builds the pipeline
  - a `fit()` method that trains the pipeline
  
  
- You can also have a `train` method that:
 - gets the training data
 - splits the data into train/validation sets
 - fits a model on that training data
 - evaluates the model on the validation set

In [16]:
from sklearn.linear_model import LassoCV, RidgeCV, LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn import svm

class Trainer(object):
    
    EXPERIMENT_NAME = "MyExperiment"
    TRAINING_NROWS = 10000
    ESTIMATOR = "Lasso"

    def __init__(self, **kwargs):
        self.pipeline = None
        self.kwargs = kwargs
        self.experiment_name = kwargs.get("experiment_name", self.EXPERIMENT_NAME)
        self.nrows = kwargs.get("nrows", self.TRAINING_NROWS)

In [5]:
trainer = Trainer()

In [6]:
print(trainer.pipeline)

None


In [7]:
trainer.experiment_name

'MyExperiment'

In [8]:
trainer.nrows

10000

In [90]:
class Trainer(object):

    EXPERIMENT_NAME = "MyExperiment"
    TRAINING_NROWS = 10000
    ESTIMATOR = "Lasso"

    # See the Jupyter notebook.
    # kwargs.get() returns a default value for keys that are not in the dictionary:
    # https://stackoverflow.com/questions/1098549/proper-way-to-use-kwargs-in-python

    def __init__(self, **kwargs):
        self.pipeline = None
        self.kwargs = kwargs
        self.experiment_name = kwargs.get("experiment_name", self.EXPERIMENT_NAME)
        self.nrows = kwargs.get("nrows", self.TRAINING_NROWS)

    def get_estimator(self):
        estimator = self.kwargs.get("estimator", self.ESTIMATOR)
        if estimator == "Lasso":
            estimator = LassoCV(cv=5, n_alphas=5)
        elif estimator == "Ridge":
            estimator = RidgeCV(cv=5)
        elif estimator == "Linear":
            estimator = LinearRegression()
        elif estimator == "GBM":
            estimator = GradientBoostingRegressor()
        else:
            estimator = LassoCV(cv=5, n_alphas=5)
        estimator_params = self.kwargs.get("estimator_params", {})
        estimator.set_params(**estimator_params)
        return estimator

In [18]:
new_trainer = Trainer()

In [92]:
new_estimator = new_trainer.get_estimator()
new_estimator

LassoCV(alphas=None, copy_X=True, cv=5, eps=0.001, fit_intercept=True,
        max_iter=1000, n_alphas=5, n_jobs=None, normalize=False, positive=False,
        precompute='auto', random_state=None, selection='cyclic', tol=0.0001,
        verbose=False)

In [26]:
import os 
#if you want to know current working dir
os.getcwd()
#if you want to change
os.chdir('/Users/nicolasbancel/git')
# if you want to list dir
#os.listdir()

In [32]:
from taxifaremodel.data import Data
from taxifaremodel.encoders import TimeFeaturesEncoder, DistanceTransformer
from taxifaremodel.utils import compute_rmse

### Test it!
- Once you have everything setup, test that it works by calling `Trainer.train()` from your notebook.
- Do not hesitate to break your code down into smaller calls for debugging
- Tip
  - add `%load_ext autoreload` and `%autoreload 2` in your notebook to automatically re-import your code whenever you make a change


In [2]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [3]:
# from TaxiFareModel.trainer import Trainer
# t = Trainer()
# t.train()

### 1. Setup MLFlow Tracking <a id="part1" />

Now that we have a good workflow for making model improvements, it is very important to track all our experiments. We want to be able to save all our different training runs and compare their performance.

This is what MLFlow tracking is about.

#### Exercise
- Read [MLFlow Quickstart](https://www.mlflow.org/docs/latest/quickstart.html#quickstart)
- Install MLFlow with `pip install mlflow`
- Setup MLFlow in your code to start logging training runs
 - Think about which parameters you want to log
 - Think about the metric you want to log
- Re-organize your code in order to easily log the different parameters and metrics you need
    - Extract out the different parameters you may have in your code and make them inputs of your `Trainer(object)` class.
    - **Tips**:
       - If you want to log the estimator used, you can do it with `estimator.__class__.__name__`
       - Look at the estimator documentation to extract the params programmatically 
    - Write a method `log_mlflow_params` to automatically log the params
- View results with `mlflow ui`
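
The two tips above can be combined into a small helper that collects everything worth logging about an estimator. This is a sketch only: `describe_estimator` is a hypothetical name, and values are converted to strings purely for display. Each key/value pair it returns could then be passed to your own logging wrapper or to `mlflow.log_param`.

```python
from sklearn.linear_model import LinearRegression


def describe_estimator(estimator):
    """Return the class name and hyperparameters of a scikit-learn estimator as a flat dict."""
    params = {"estimator_name": estimator.__class__.__name__}
    for key, value in estimator.get_params().items():
        params[key] = str(value)  # stringify so every value is easy to log and display
    return params


params = describe_estimator(LinearRegression())
for key, value in params.items():
    print(key, "=", value)
```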

To go further, look at the [full doc](https://www.mlflow.org/docs/latest/tracking.html).


In [9]:
class Trainer(object):
    
    def __init__(self, **kwargs):
        self.kwargs = kwargs
        
    def get_estimator(self):
        """use kwargs to set your estimator and params """
        
    def log_mlflow_params(self, **kwargs):
        """log params to mlflow here"""
        
    def log_mlflow_metric(self, **kwargs):
        """log metric to mlflow here"""

In [45]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LassoCV, RidgeCV, LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn import svm

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

import mlflow
from  mlflow.tracking import MlflowClient
from memoized_property import memoized_property


class Trainer(object):

    EXPERIMENT_NAME = "MyExperiment"
    TRAINING_NROWS = 10000
    ESTIMATOR = "Lasso"

    # See the Jupyter notebook.
    # kwargs.get() returns a default value for keys that are not in the dictionary:
    # https://stackoverflow.com/questions/1098549/proper-way-to-use-kwargs-in-python

    def __init__(self, **kwargs):
        self.pipeline = None
        self.kwargs = kwargs
        self.experiment_name = kwargs.get("experiment_name", self.EXPERIMENT_NAME)
        self.nrows = kwargs.get("nrows", self.TRAINING_NROWS)

    def get_estimator(self):
        estimator = self.kwargs.get("estimator", self.ESTIMATOR)
        if estimator == "Lasso":
            estimator = LassoCV(cv=5, n_alphas=5)
        elif estimator == "Ridge":
            estimator = RidgeCV(cv=5)
        elif estimator == "Linear":
            estimator = LinearRegression()
        elif estimator == "GBM":
            estimator = GradientBoostingRegressor()
        else:
            estimator = LassoCV(cv=5, n_alphas=5)
        estimator_params = self.kwargs.get("estimator_params", {})
        estimator.set_params(**estimator_params)
        return estimator

    def get_pipeline(self):
        distance_transformer = DistanceTransformer(
            start_lat="pickup_latitude", start_lon="pickup_longitude",
            end_lat="dropoff_latitude", end_lon="dropoff_longitude",
        )
        features_encoder = ColumnTransformer([
            ('time_features', make_pipeline(TimeFeaturesEncoder(time_column='pickup_datetime'), OneHotEncoder()),['pickup_datetime']),
            ('distance', make_pipeline(distance_transformer, SimpleImputer()),['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude']),
            ('passenger_count', SimpleImputer(), ['passenger_count'])
        ])
        return Pipeline(
            steps=[
                ('features', features_encoder),
                ('clf', self.get_estimator())
            ]
        )

    def fit(self, df_train):
        pipeline = self.get_pipeline()
        pipeline.fit(df_train, df_train.fare_amount)
        self.pipeline = pipeline

    def evaluate(self, df_test):
        if self.pipeline is None:
            raise ValueError("Cannot evaluate an empty pipeline")
        y_pred = self.pipeline.predict(df_test)
        return compute_rmse(y_pred, df_test.fare_amount)
    
    def log_estimator_params(self):
        clf = self.get_estimator()
        self.mlflow_log_param('estimator_name', clf.__class__.__name__)
        params = clf.get_params()
        for k, v in params.items():
            pass
          #self.mlflow_log_param(k, v)

  ### MLFlow methods
    @memoized_property
    def mlflow_run(self):
        return self.mlflow_client.create_run(self.mlflow_experiment_id)

    @memoized_property
    def mlflow_experiment_id(self):
        try:
            return self.mlflow_client.create_experiment(self.experiment_name)
        except Exception:  # e.g. the experiment already exists, so look it up instead
            return self.mlflow_client.get_experiment_by_name(self.experiment_name).experiment_id

    @memoized_property
    def mlflow_client(self):
        return MlflowClient()

    def mlflow_log_param(self, key, value):
        self.mlflow_client.log_param(self.mlflow_run.info.run_id, key, value)
        
    def mlflow_log_metric(self, key, value):
        self.mlflow_client.log_metric(self.mlflow_run.info.run_id, key, value)

    def train(self):
        df = Data().get_data(nrows=100000)
        df_train, df_val = train_test_split(df, random_state=99, test_size=0.05)
        df_train = df_train.sample(n=self.nrows)
        self.fit(df_train)

        self.log_estimator_params()
        self.mlflow_log_param("nrows", self.nrows)

        rmse = self.evaluate(df_val)
        self.mlflow_log_metric("rmse", rmse)


# if __name__ == "__main__":
#     for n in np.arange(500, 50000, 100):
#         t = Trainer(experiment_name='training_size_experiment', nrows=n)
#         t.train()

In [46]:
params = {"estimator": "Lasso", "estimator_params": {"cv": 5}}

In [83]:
root = os.path.dirname(os.path.realpath(__file__))

NameError: name '__file__' is not defined

In [89]:
import os
os.getcwd()

'/Users/nicolasbancel/git'

In [47]:
t = Trainer(**params)

In [50]:
# Need to make sure train.csv and test.csv are in the folder
df = Data().get_data(nrows=100000)

In [51]:
t.train()

In [79]:
# Accessing .name will fail here because some_expensive_load is not defined

class class_example(object):
    @property
    def name(self):
        if not hasattr(self, '_name'):
            self._name = some_expensive_load()
        return self._name

In [73]:
from memoized_property import memoized_property

class class_example_two(object):

    @memoized_property
    def name(self):
        # Boilerplate guard conditional avoided, but this is still only called once
        return some_expensive_load()

In [82]:
a = class_example()

#### Organize your runs into MLflow Experiments
In order to have your runs organized, you can leverage [MLFlow experiments](https://www.mlflow.org/docs/latest/tracking.html#organizing-runs-in-experiments).

For this, it is easier to use the MLFlow API with [`MlflowClient`](https://www.mlflow.org/docs/latest/python_api/mlflow.tracking.html#module-mlflow.tracking)

#### Exercise
- Write a few additional methods to manage the creation of experiments and runs in your class.
- Also write wrappers around `log_metric` and `log_param` MLflow methods to easily log based on your current experiment and run
- `@memoized_property` is a useful decorator for declaring properties that are computed only once; see [https://pypi.org/project/memoized-property/](https://pypi.org/project/memoized-property/)
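
To see what "computed only once" means in practice, here is a small sketch using `functools.cached_property` from the standard library (Python 3.8+). It is a different decorator from `memoized_property`, used here only because it needs no extra install and shows the same idea; the `Client` class and its `connection` property are made up for the illustration.

```python
from functools import cached_property  # standard library, Python 3.8+


class Client:

    def __init__(self):
        self.load_count = 0

    @cached_property
    def connection(self):
        # Stand-in for an expensive call, e.g. creating a client or a run
        self.load_count += 1
        return {"connected": True}


client = Client()
first = client.connection   # the method body runs here
second = client.connection  # the cached value is returned, the body does not run again
print(client.load_count)
```

The same object is returned on every access, which is why a property such as `mlflow_run` creates a single run per `Trainer` instance rather than a new one on each call.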

In [52]:
@memoized_property
def mlflow_run(self):
    return self.mlflow_client.create_run(self.mlflow_experiment_id)

@memoized_property
def mlflow_experiment_id(self):
    try:
        return self.mlflow_client.create_experiment(self.experiment_name)
    except Exception:  # e.g. the experiment already exists, so look it up instead
        return self.mlflow_client.get_experiment_by_name(self.experiment_name).experiment_id

@memoized_property
def mlflow_client(self):
    return MlflowClient()

def mlflow_log_param(self, key, value):
    self.mlflow_client.log_param(self.mlflow_run.info.run_id, key, value)
    
def mlflow_log_metric(self, key, value):
    self.mlflow_client.log_metric(self.mlflow_run.info.run_id, key, value)

In [53]:
import warnings
warnings.filterwarnings('ignore')

In [56]:
from taxifaremodel.trainer import Trainer

In [58]:
t = Trainer(experiment_name='Experiment0')
t.train()

Now write a loop that will launch multiple runs for different parameters. 

In [6]:
from TaxiFareModel.trainer import Trainer
for param in ['param1', 'param2', 'param3']:
    t = Trainer(experiment_name='Experiment1', param=param)
    t.train()
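
When several parameters vary at once, the loop can be driven by a small grid of keyword arguments. The sketch below only builds and prints the combinations; `iter_param_combinations` and the grid values are assumptions for illustration, and the commented line shows where a `Trainer` call would go.

```python
from itertools import product


def iter_param_combinations(grid):
    """Yield one kwargs dict per combination of the values in `grid`."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


param_grid = {"estimator": ["Lasso", "Ridge"], "nrows": [1000, 5000]}
combinations = list(iter_param_combinations(param_grid))

for kwargs in combinations:
    # In the notebook this would become:
    # Trainer(experiment_name="Experiment1", **kwargs).train()
    print(kwargs)
```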

In [97]:
import numpy as np
from taxifaremodel.trainer import Trainer
#for n in np.arange(500, 50000, 100):
for n in np.arange(500, 50000, 100)[0:20]:
    t = Trainer(experiment_name='training_size_experiment', nrows=n)
    t.train()

Then go to MLFlow UI to see the different runs.

In [61]:
# Starts at 500, ends at 50000, by increment of 100
np.arange(500, 50000, 100)[0:20]

array([ 500,  600,  700,  800,  900, 1000, 1100, 1200, 1300, 1400, 1500,
       1600, 1700, 1800, 1900, 2000, 2100, 2200, 2300, 2400])

### 2. Try different models <a id="part2" />

Now that you have a way to track different iterations, it is time to experiment!

**First, let's try different estimators.**

#### Exercise
- Think about the different estimators that you know that can be used to solve prediction problems
- Implement a short script that loops through all estimators, trains each model and evaluates it on a validation set.
- Be careful: make sure you always use the same validation set across all your training runs. **Tip:** you can set the random seed for `train_test_split` to make sure the split is always the same.
- Compare performance with `mlflow ui`
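
The tip about the random seed can be checked with a toy example. The list of row indices below is a stand-in for your DataFrame; calling `train_test_split` twice with the same `random_state` gives the same validation rows.

```python
from sklearn.model_selection import train_test_split

rows = list(range(100))  # stand-in for the rows of the training DataFrame

train_a, val_a = train_test_split(rows, test_size=0.05, random_state=99)
train_b, val_b = train_test_split(rows, test_size=0.05, random_state=99)

# Same seed, same split: every estimator is evaluated on the same rows
print(val_a == val_b)
```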

In [None]:
# experiment comparing estimators
import numpy as np
from TaxiFareModel.trainer import Trainer
for estimator in ["Lasso", "Ridge", "Linear", "GBM"]:
    print(estimator)
    t = Trainer(experiment_name='estimators_experiment', estimator=estimator)
    t.train()

Lasso

### 3. Features engineering and selection <a id="part3" />

**Now it is time to be creative!**

You just tried different models, and you have seen that some estimators may be more powerful than others. Another area where you can experiment is feature engineering.

#### Exercise 1
- Try different combinations of features (by removing or adding some) and track the runs.
- Use the [PolynomialFeatures](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) transformer to generate new features from distance.
- Compute other types of distance (straight-line, Manhattan, travel, etc.)
- Use some "context knowledge" to generate new features that you think might be relevant.
 - For example: we know that taxis apply a fixed fare for airport transfers
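
To make two of these ideas concrete, here is a sketch of a Manhattan-style distance and an "airport trip" flag. Everything in it is an assumption for illustration: the function names, the 2 km radius, and the airport location (approximate JFK coordinates, which only make sense if your data covers New York City).

```python
import math

EARTH_RADIUS_KM = 6371.0
JFK = (40.6413, -73.7781)  # approximate latitude/longitude, illustrative only


def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin((phi2 - phi1) / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def manhattan_distance(lat1, lon1, lat2, lon2):
    """Sum of a north-south leg and an east-west leg, each measured with haversine."""
    return haversine(lat1, lon1, lat2, lon1) + haversine(lat2, lon1, lat2, lon2)


def is_airport_trip(lat1, lon1, lat2, lon2, airport=JFK, radius_km=2.0):
    """True if the pickup or the dropoff lies within radius_km of the airport."""
    return (haversine(lat1, lon1, *airport) <= radius_km
            or haversine(lat2, lon2, *airport) <= radius_km)


pickup, dropoff = (40.6413, -73.7781), (40.75, -73.99)
print(manhattan_distance(*pickup, *dropoff), is_airport_trip(*pickup, *dropoff))
```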

#### Exercise 2
- Try different methods for outlier removal
- Look at how the size of the training set helps reduce the RMSE on the validation set.

### 4. Hyperparameters tuning <a id="part4" />

Finally, once you are satisfied with your features engineering work, **let's fine tune your model.**

For this, we recommend choosing a `Gradient Boosting Tree` estimator ([Xgboost](https://xgboost.readthedocs.io/en/latest/get_started.html) or [LightGBM](https://lightgbm.readthedocs.io/en/latest/) or [GradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html)).

The idea is to tune the hyperparameters of this estimator. The most important parameters to tune are:
- `learning_rate`
- `max_depth`
- `n_estimators`

To perform hyperparameters search, you have the choice between two `search` mechanisms:
- [GridSearch](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
- [RandomSearch](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)
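
Here is a minimal sketch of both search mechanisms on synthetic data, so that it runs quickly without the taxi dataset. The grid values, the 3-fold cross-validation and the number of random iterations are arbitrary choices for illustration; in your own code the estimator would typically be the last step of your pipeline.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Small synthetic regression problem standing in for the real features
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
    "n_estimators": [50, 100],
}

# Grid search tries every combination (2 x 2 x 2 = 8 candidates)
grid_search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
grid_search.fit(X, y)

# Random search samples only n_iter candidates from the same space
random_search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions=param_grid,
    n_iter=4,
    scoring="neg_root_mean_squared_error",
    cv=3,
    random_state=0,
)
random_search.fit(X, y)

# Scores are negated RMSE values, so flip the sign to read them as RMSE
print(grid_search.best_params_, -grid_search.best_score_)
print(random_search.best_params_, -random_search.best_score_)
```

Random search becomes more attractive as the grid grows, since its cost is fixed by `n_iter` rather than by the number of combinations.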


#### Exercise
- First, try to adjust the hyperparameters manually and do a few runs that you can track with MLFlow. Then [with MLFlow UI you can visually see how these parameters affect the performance metric](https://mlflow.org/docs/latest/tracking.html#visualizing-metrics).
- Once you have an idea of how the parameters impact RMSE, try to implement both `GridSearch` and `RandomSearch` as part of your pipeline to fully tune the model.
- Once you are satisfied with your tuned model, submit your predictions to Kaggle!

### 5. MLFlow projects <a id="part5" />

To go further, look at [MLFlow Projects](https://www.mlflow.org/docs/latest/projects.html#) and see how you can use them to perform hyperparameters tuning.