## Agenda

In this notebook, we reuse the code from week 2 and use Hyperopt to tune hyperparameters.

We start with a quick recap of ML Practice Part 1:
1. Reading in the Kaggle data and adding features
2. Using a **`Pipeline`** for proper cross-validation
3. Combining **`GridSearchCV`** with **`Pipeline`**

The following items are new:

4. Efficiently searching for tuning parameters using **`RandomizedSearchCV`**
5. Advanced hyperparameter tuning using **`Hyperopt`**
6. Adding features to a document-term matrix (using SciPy)
7. Adding features to a document-term matrix (using **`FeatureUnion`**)
8. Ensembling models

# ML Practice Part 2

In [1]:
# for Python 2: use print only as a function
from __future__ import print_function

## Part 1: Reading in the Kaggle data and adding features

- Our goal is to predict the **cuisine** of a recipe, given its **ingredients**.
- **Feature engineering** is the process through which you create features that don't natively exist in the dataset.

In [2]:
import pandas as pd
import numpy as np

In [3]:
# define a function that accepts a DataFrame and adds new features
def make_features(df):  
    # string representation of the ingredient list
    df['ingredients_str'] = df.ingredients.astype(str)
    
    return df

In [4]:
# create the same features in the training data and the new data
train = make_features(pd.read_json('../data/cuisine_data/train.json'))
new = make_features(pd.read_json('../data/cuisine_data/test.json'))

In [5]:
train.head()

| | id | cuisine | ingredients | ingredients_str |
|---|---|---|---|---|
| 0 | 10259 | greek | [romaine lettuce, black olives, grape tomatoes... | ['romaine lettuce', 'black olives', 'grape tom... |
| 1 | 25693 | southern_us | [plain flour, ground pepper, salt, tomatoes, g... | ['plain flour', 'ground pepper', 'salt', 'toma... |
| 2 | 20130 | filipino | [eggs, pepper, salt, mayonaise, cooking oil, g... | ['eggs', 'pepper', 'salt', 'mayonaise', 'cooki... |
| 3 | 22213 | indian | [water, vegetable oil, wheat, salt] | ['water', 'vegetable oil', 'wheat', 'salt'] |
| 4 | 13162 | indian | [black pepper, shallots, cornflour, cayenne pe... | ['black pepper', 'shallots', 'cornflour', 'cay... |


## Part 2: Using a `Pipeline` for proper cross-validation

In [6]:
# define X and y
X = train.ingredients_str
y = train.cuisine

In [7]:
# X is just a Series of strings
X.head()

0    ['romaine lettuce', 'black olives', 'grape tom...
1    ['plain flour', 'ground pepper', 'salt', 'toma...
2    ['eggs', 'pepper', 'salt', 'mayonaise', 'cooki...
3          ['water', 'vegetable oil', 'wheat', 'salt']
4    ['black pepper', 'shallots', 'cornflour', 'cay...
Name: ingredients_str, dtype: object

In [8]:
# replace the regex pattern that is used for tokenization
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(token_pattern=r"'([a-z ]+)'")
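To see what the custom `token_pattern` changes, the sketch below applies both regular expressions with Python's `re` module to one recipe string. It assumes the vectorizer lowercases the text first (the `CountVectorizer` default) and that a pattern with a capture group yields the captured text as the token. The sample recipe is illustrative, not read from the dataset.

```python
import re

# One recipe as it appears in ingredients_str (the str() of a Python list)
recipe = str(['romaine lettuce', 'black olives', 'grape tomatoes'])

# CountVectorizer lowercases text by default before tokenizing
text = recipe.lower()

# Word-level pattern: every run of two or more word characters is a token
word_tokens = re.findall(r"\b\w\w+\b", text)

# Custom pattern: the capture group keeps each quoted ingredient whole
ingredient_tokens = re.findall(r"'([a-z ]+)'", text)

print(word_tokens)
print(ingredient_tokens)
```

With the custom pattern, multi-word ingredients such as "black olives" stay together as a single feature. Note that `[a-z ]+` only matches lowercase ASCII letters and spaces, so an ingredient containing digits, hyphens, or accented characters would not match.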

In [9]:
# import and instantiate Multinomial Naive Bayes (with the default parameters)
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

[make_pipeline documentation](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html)

In [10]:
# create a pipeline of vectorization and Naive Bayes
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(vect, nb)

In [11]:
# examine the pipeline steps
pipe.steps

[('countvectorizer', CountVectorizer(token_pattern="'([a-z ]+)'")),
 ('multinomialnb', MultinomialNB())]

**Proper cross-validation:**

- By passing our pipeline to **`cross_val_score`**, features will be created from **`X`** (via **`CountVectorizer`**) within each fold of cross-validation.
- This process simulates the real world, in which your out-of-sample data will contain **features that were not seen** during model training.
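The idea can be illustrated without scikit-learn. The following is a minimal sketch, using hypothetical helper functions `fit_vocabulary` and `transform` as stand-ins for what a vectorizer does: the vocabulary is learned from the training fold only, so a token that appears only in the validation fold has no column and is ignored.

```python
def fit_vocabulary(docs):
    """Collect the sorted set of tokens seen in the given documents."""
    return sorted({token for doc in docs for token in doc})


def transform(docs, vocabulary):
    """Count tokens per document, ignoring tokens outside the vocabulary."""
    index = {token: i for i, token in enumerate(vocabulary)}
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for token in doc:
            if token in index:
                row[index[token]] += 1
        rows.append(row)
    return rows


# Toy folds (illustrative data, already tokenized)
train_fold = [['salt', 'pepper'], ['salt', 'olive oil']]
validation_fold = [['salt', 'saffron']]  # 'saffron' never appears in training

# Fit on the training fold only, then apply to the validation fold
vocabulary = fit_vocabulary(train_fold)
validation_matrix = transform(validation_fold, vocabulary)

print(vocabulary)
print(validation_matrix)
```

Fitting the vocabulary on all of the data before splitting would give 'saffron' its own column, which leaks information from the validation fold into training. Passing the whole pipeline to `cross_val_score` avoids this.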

In [12]:
# cross-validate the entire pipeline
from sklearn.model_selection import cross_val_score, cross_validate
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()

0.7323126392849393

In [13]:
# pipeline steps are automatically assigned names by make_pipeline
pipe.named_steps.keys()

dict_keys(['countvectorizer', 'multinomialnb'])

## Part 3: Tuning Hyperparameters via Bayesian Optimization (using Hyperopt)

- Unlike grid search and random search, Bayesian optimization aims to limit the number of evaluations of the objective function by spending more time choosing the next value to try.
- Define a probability model of P(loss | input parameters), which serves as a surrogate for the objective function.
- Select the next parameter values by applying a criterion (Expected Improvement) to the surrogate function.
- Why is it called Bayesian? After each evaluation, the probability model is updated to incorporate the new evidence.
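As an illustration of the selection criterion, the sketch below computes Expected Improvement for a minimization problem under the assumption that the surrogate predicts a Gaussian loss with mean `mu` and standard deviation `sigma` at a candidate point. This Gaussian form is an assumption made for the example; Hyperopt's TPE algorithm models the problem differently, so this shows the criterion itself rather than Hyperopt's internals. The candidate values are made up.

```python
from statistics import NormalDist


def expected_improvement(mu, sigma, best_loss):
    """Expected improvement over best_loss for a N(mu, sigma^2) prediction (minimization)."""
    if sigma == 0:
        # No uncertainty: the improvement is known exactly
        return max(best_loss - mu, 0.0)
    z = (best_loss - mu) / sigma
    standard_normal = NormalDist()
    return (best_loss - mu) * standard_normal.cdf(z) + sigma * standard_normal.pdf(z)


best_loss = -0.74  # best (lowest) loss observed so far

# Hypothetical surrogate predictions: (mean loss, standard deviation)
candidates = {
    'confident, slightly better': (-0.745, 0.001),
    'uncertain, same mean': (-0.74, 0.02),
    'confident, worse': (-0.70, 0.001),
}

for name, (mu, sigma) in candidates.items():
    print(name, expected_improvement(mu, sigma, best_loss))
```

The criterion rewards both a low predicted loss and high uncertainty, which is how the search balances exploiting known good regions against exploring untested ones.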

In [14]:
# use pip from within the Jupyter notebook to install the package if needed
#!pip install hyperopt
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from time import time
"""
All hyperparameter spaces must be combined into a single tuple or dictionary.
Hyperopt provides several search-space expressions, including:

1. hp.choice(label, options):     label is a string that names the hyperparameter;
                                  options is a list, and one of its elements is returned for that label.
2. hp.uniform(label, low, high):  label is a string that names the hyperparameter;
                                  returns a value drawn uniformly between low and high.
                                  During optimization, the variable is constrained to this two-sided interval.

"""
param_hyperopt= {
    'countvectorizer__token_pattern': hp.choice('countvectorizer__token_pattern', [r"\b\w\w+\b", r"'([a-z ]+)'"]),
    'countvectorizer__min_df':        hp.choice('countvectorizer__min_df', np.arange(1, 5,1, dtype=int)), 
    'multinomialnb__alpha':           hp.uniform('multinomialnb__alpha', 0.0, 1.0)                                   
}
hyparams_list = list(param_hyperopt.keys())

In [15]:
list(param_hyperopt.keys())

['countvectorizer__token_pattern',
 'countvectorizer__min_df',
 'multinomialnb__alpha']

In [16]:
from hyperopt import space_eval

def hyperopt(param_space, X_train, y_train, num_eval):
    
    start = time()
    
    # define the objective function
    def objective_function(params):
        # a Pipeline is not callable, so pipe(**params) does not work; use set_params instead
        clf = pipe.set_params(**params)
        score = cross_val_score(clf, X_train, y_train, cv=5).mean()
        # fmin minimizes the loss, so return the negated accuracy
        return {'loss': -score, 'status': STATUS_OK}

    trials = Trials()
    # fmin evaluates different hyperparameter sets and returns the one with the minimum loss
    best_param = fmin(objective_function, 
                      param_space, 
                      algo=tpe.suggest, # other built-in options include rand.suggest and anneal.suggest
                      max_evals=num_eval, 
                      trials=trials,
                      rstate=np.random.default_rng(42))
    loss = [x['result']['loss'] for x in trials.trials]
    
    # for hp.choice parameters, fmin returns the index of the chosen option, not its value;
    # space_eval maps those indices back to the actual hyperparameter values by name
    best_param_values = space_eval(param_space, best_param)
    
    clf_best = pipe.set_params(**best_param_values)
                                  
    clf_best.fit(X_train, y_train)
    
    print("")
    print("##### Results")
    print("Score best parameters: ", min(loss)*-1)
    print("Best parameters: ", best_param)
    print("Time elapsed: ", time() - start)
    print("Parameter combinations evaluated: ", num_eval)
    
    return trials
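The contract between the objective function and the optimizer can be sketched in plain Python. The example below uses random search as a stand-in for `tpe.suggest`, a hypothetical `toy_score` function in place of the cross-validated accuracy, and a local `STATUS_OK` constant. It is not Hyperopt's implementation; it only shows why the objective returns the negated score.

```python
import random

STATUS_OK = 'ok'  # local stand-in for the status flag returned by the objective


def toy_score(alpha):
    """Hypothetical accuracy curve that peaks at alpha = 0.3."""
    return 0.75 - (alpha - 0.3) ** 2


def objective(params):
    # Return the negated score, because the optimizer minimizes the loss
    return {'loss': -toy_score(params['alpha']), 'status': STATUS_OK}


def random_search(objective, max_evals, seed=42):
    """Evaluate max_evals random points and return the best trial plus all trials."""
    rng = random.Random(seed)
    results = []
    for _ in range(max_evals):
        params = {'alpha': rng.uniform(0.0, 1.0)}
        results.append((params, objective(params)))
    # Keep the trial with the smallest loss, i.e. the highest score
    best = min(results, key=lambda item: item[1]['loss'])
    return best, results


(best_params, best_result), results = random_search(objective, max_evals=20)
best_score = -best_result['loss']

print(best_params, best_score)
```

Random search picks each point independently of earlier results. A Bayesian method uses the losses seen so far to decide where to evaluate next, while keeping the same objective contract.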

In [18]:
num_eval = 5

In [19]:
results_hyperopt = hyperopt(param_hyperopt, X, y,num_eval)

100%|███████| 5/5 [00:09<00:00,  1.90s/trial, best loss: -0.7476995552838627]

##### Results
Score best parameters:  0.7476995552838627
Best parameters:  {'countvectorizer__min_df': 0, 'countvectorizer__token_pattern': 1, 'multinomialnb__alpha': 0.5029037546614818}
Time elapsed:  9.95956826210022
Parameter combinations evaluated:  5
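Note that for `hp.choice` parameters the reported best parameters are indices into the option lists, not the values themselves. The sketch below decodes the result printed above by hand, with the option lists copied from `param_hyperopt`. Hyperopt's `space_eval` function serves the same purpose in the library.

```python
# Option lists copied from param_hyperopt, in the same order
choice_options = {
    'countvectorizer__token_pattern': [r"\b\w\w+\b", r"'([a-z ]+)'"],
    'countvectorizer__min_df': [1, 2, 3, 4],
}

# Raw result reported by fmin in the output above
best_param = {
    'countvectorizer__min_df': 0,
    'countvectorizer__token_pattern': 1,
    'multinomialnb__alpha': 0.5029037546614818,
}

# Look up each hp.choice index by name; hp.uniform values are already actual values
decoded = {
    name: choice_options[name][value] if name in choice_options else value
    for name, value in best_param.items()
}

print(decoded)
```

Here index 0 for `min_df` corresponds to the value 1, and index 1 for `token_pattern` corresponds to the custom ingredient pattern. Looking values up by name avoids relying on the order of the keys in the returned dictionary.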


## Part 4: Tracking experiments with MLflow


#### Start the MLflow tracking server

```console
mlflow server --backend-store-uri sqlite:///mydb.sqlite
```

In [22]:
import mlflow
import mlflow.sklearn

RANDOM_SEED = 0

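# connect to the tracking server started above (the server listens on port 5000 by default)
mlflow.set_tracking_uri('http://127.0.0.1:5000')

# reuse the experiment if it already exists; otherwise create it
experiment = mlflow.get_experiment_by_name('sklearn-pipeline-hyperopt')
if experiment is None:
    EXPERIMENT_ID = mlflow.create_experiment('sklearn-pipeline-hyperopt')
else:
    EXPERIMENT_ID = experiment.experiment_id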

In [25]:
def train_model(params):
    
    # enable autologging
    #mlflow.sklearn.autolog()
    
    with mlflow.start_run(experiment_id=EXPERIMENT_ID, nested=True):
        metric_names = ['accuracy', 'f1_micro']        
        clf = pipe.set_params(**params)  # a Pipeline is not callable, so use set_params
        scores = cross_validate(clf, X, y, 
                                cv=5, scoring=metric_names, return_train_score=True)        
        training_metrics = {
            'Accuracy': scores['train_accuracy'].mean().round(3),
            'F1': scores['train_f1_micro'].mean().round(3)
        }
        training_metrics_values = list(training_metrics.values())

        validation_metrics = {
            'Accuracy': scores['test_accuracy'].mean().round(3),
            'F1': scores['test_f1_micro'].mean().round(3)
        }
        validation_metrics_values = list(validation_metrics.values())
        
        # Logging model signature, class, and name
        #signature = infer_signature(X_train, y_val_pred)
        mlflow.sklearn.log_model(clf, 'model')
        #mlflow.set_tag('estimator_name', model.__class__.__name__)
        
        # Logging each metric
        for name, metric in list(zip(metric_names, training_metrics_values)):
            mlflow.log_metric(f'training_{name}', metric)
        for name, metric in list(zip(metric_names, validation_metrics_values)):
            mlflow.log_metric(f'validation_{name}', metric)
        
        # Logging each hyper-parameters
        for name in hyparams_list:
            mlflow.log_param(name, params[name])

        # Set the loss to -1*F1 so that fmin, which minimizes the loss, maximizes F1
        return {'loss': -1*validation_metrics['F1'], 'status': STATUS_OK}

trials = Trials()
# fmin evaluates different hyperparameter sets and returns the one with the minimum loss.
# Run fmin within an MLflow run context so that each hyperparameter configuration is logged as a child run of a parent
# run called "sklearn_pipe".
with mlflow.start_run(experiment_id=EXPERIMENT_ID, run_name='sklearn_pipe'):
    best_params = fmin(
        fn=train_model, 
        space=param_hyperopt, 
        algo=tpe.suggest,
        trials=trials,
        max_evals=6
    )

100%|█████████████████████| 6/6 [02:04<00:00, 20.76s/trial, best loss: -0.75]


In [24]:
print(EXPERIMENT_ID)

1
