# MLflow, Kaggle submission

This notebook demonstrates how you can experiment with different models using MLflow.

The goal is to log the results of the different model experiments to MLflow and use the [MLflow UI](https://www.mlflow.org/docs/latest/quickstart.html#viewing-the-tracking-ui) to pick the best model. You should then have a first understanding of how MLflow can help you with model selection.

This notebook uses data from the Kaggle [Amazon Employee Access Challenge](https://www.kaggle.com/c/amazon-employee-access-challenge), which is a binary classification problem. You can make a late submission at the end of this notebook to see how your model compares to the other approaches on the leaderboard.



## Setup

### Preparing the virtualenv

First make sure you have installed the packages that we need in your virtualenv:

In [None]:
!pip install kaggle mlflow python-dotenv

### Setting MLflow and Kaggle credentials through `python-dotenv`

For tracking and storing your model experimentation results, MLflow uses a tracking server. There are two options:
1. Run the tracking server locally with the command `mlflow ui`. This exposes the UI on http://localhost:5000.
2. Set up a tracking server and connect to it by setting the right environment variables.

In order for Python to know how to write to the MLflow tracking server, you need to set a couple of environment variables. This notebook uses the [`python-dotenv`](https://pypi.org/project/python-dotenv/) package to load these into the notebook. The package assumes all your environment variables are saved in a `.env` file, which is gitignored to prevent passwords from being committed to Git.

We also need credentials to use the Kaggle API for downloading data. These can be added to the `.env` file as well.

**First task**: Copy the `.env-example` file and save it as `.env`. Then fill in the passwords and usernames required for MLflow and Kaggle.

**Note**: When running the MLflow UI locally you only need to set `MLFLOW_TRACKING_URI='http://127.0.0.1:5000'`.

## Init

Having prepared your environment variables, it's time to load them:

In [None]:
from dotenv import load_dotenv
load_dotenv()

Check which tracking URI the notebook will log to (this is also where the MLflow UI is served from):

In [None]:
!echo $MLFLOW_TRACKING_URI
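The same check can be done from Python, which also makes it easy to spot variables that `load_dotenv()` did not pick up. This is a minimal sketch: `missing_env_vars` is a hypothetical helper, and apart from `MLFLOW_TRACKING_URI` the variable names are assumptions (`KAGGLE_USERNAME` and `KAGGLE_KEY` are the names the Kaggle API reads, but your `.env-example` may list others).

```python
import os

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Assumed variable names; adapt them to the entries in your own .env file.
required = ["MLFLOW_TRACKING_URI", "KAGGLE_USERNAME", "KAGGLE_KEY"]
print("Missing:", missing_env_vars(required))
```

An empty list means every variable in `required` is available to this notebook.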

Furthermore load packages you need and set a random seed for reproducability:

In [None]:
import pandas as pd
import numpy as np
import mlflow

In [None]:
np.random.seed(42)

## Get the data

You can easily download the Kaggle data by calling the Kaggle API. First select a folder you want to store the files in and create it:

In [None]:
data_folder = '../../data/amazon-employee-access-challenge/'

In [None]:
!mkdir -p $data_folder

Now call the Kaggle API:

In [None]:
!kaggle competitions download -c amazon-employee-access-challenge -p $data_folder

And validate that the files are indeed there:

In [None]:
!ls $data_folder
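If the listing shows a `.zip` archive rather than the CSV files themselves (an assumption; whether this happens depends on the version of the Kaggle CLI), the archive has to be unpacked before `train.csv` and `test.csv` can be read. A minimal sketch using only the standard library, where `extract_archives` is a hypothetical helper:

```python
import zipfile
from pathlib import Path

def extract_archives(folder):
    """Unpack every .zip file found directly in `folder`; return the extracted names."""
    extracted = []
    for archive in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
    return extracted

# Same path as in the cell above; the folder is only scanned if it exists.
data_folder = '../../data/amazon-employee-access-challenge/'
if Path(data_folder).is_dir():
    print(extract_archives(data_folder))
```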

## Data Exploration

When submitting to Kaggle it's good to know what the required format is for your submission:

In [None]:
!head -5 $data_folder/sampleSubmission.csv

Now load in the other data sets for training and testing:

In [None]:
df_train = pd.read_csv(data_folder + 'train.csv')

df_train.shape

Use `test.csv` to create your submission:

In [None]:
df_test = pd.read_csv(data_folder + 'test.csv')

df_test.shape

**Task**: Check out the data sets for yourself to see what they look like and which variables you can use for a model.

Specifically, look into the class distribution of your target variable. What kind of model would you apply?

In [None]:
# do some exploration
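One way to inspect the class distribution is `value_counts(normalize=True)` on the target column. The sketch below uses a made-up toy frame in place of `df_train`; only the column name `ACTION` comes from this notebook, and the values are invented.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_train; the values are made up.
toy = pd.DataFrame({"ACTION": [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})

# Share of each class in the target column.
class_share = toy["ACTION"].value_counts(normalize=True)
print(class_share)
```

If one class dominates, a model that always predicts the majority class already reaches a high accuracy, so a ranking metric such as ROC AUC is usually more informative than accuracy alone.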

## Preparing data

For this demo we use `sklearn`. It expects the features and the target as separate objects, so let's split them:

In [None]:
X_train = df_train.copy()
y_train = X_train.pop('ACTION')

(X_train.shape, y_train.shape)

In [None]:
X_test = df_test.drop('id', axis=1)
y_id = df_test[['id']].rename(columns={"id": "Id"})

(X_test.shape, y_id.shape)

## Setting the MLflow experiment

In MLflow you can define experiments. Familiarize yourself a bit with the [documentation](https://mlflow.org/docs/latest/quickstart.html), and check out what `set_experiment` does.

In [None]:
# Set your own experiment here
mlflow.set_experiment(<fill-in>)

## Running a single model

Let's run a model once without logging to MLflow yet. Choose one of your own and return the metrics that report its quality of fit:

In [None]:
# Import sklearn dependencies
from sklearn.<fill-in> 

# create a model
model = <fill-in>

# fit, run, and/or score your model
<fill-in>

# print the metrics that you are interested in
<fill-in>

In [None]:
%load answers/simple-model.py
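As a hedged illustration of what such a cell might contain (this is not the content of `answers/simple-model.py`), the sketch below fits a `LogisticRegression` on synthetic data and reports ROC AUC on a held-out split. The synthetic data, the model choice, and the metric are all assumptions; swap in `X_train` and `y_train` and your own model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for (X_train, y_train); not the Kaggle data.
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.1, 0.9], random_state=42)

# Hold out a validation split so the score is measured on unseen rows.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_fit, y_fit)

# ROC AUC needs scores, so use the predicted probability of the positive class.
val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(f"validation AUC: {val_auc:.3f}")
```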

Great. Are you ready to tweak the model and use MLflow?

## Hyperparameter tuning

The next step is to tune our model, and find out the optimal hyperparameters. In this case, let's also see how we can start logging the results into MLflow!

First, define a parameter grid that you want to search over. For example, for `LogisticRegression` you could use something like:

```
C_range = np.logspace(-4.0, 4.0, num=10)
# note: 'l1' needs a solver that supports it, e.g. LogisticRegression(solver='liblinear')
param_grid = {'C': C_range, 'penalty': ['l1', 'l2']}
```

In [None]:
# set the param_grid for your model of choice:
param_grid = <fill-in>

Now for our grid search, we need to loop over all possible combinations within this `param_grid`. Here's some code that creates all combinations for you:

In [None]:
from itertools import product

# create list to iterate hyperparameters over
def create_param_grid_list(param_grid):
    keys, values = zip(*param_grid.items())
    params_list = []
    for v in product(*values):
        params_list.append(dict(zip(keys, v)))
    return params_list

params_list = create_param_grid_list(param_grid)
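To see what this helper produces, here is a small self-contained sketch with a hypothetical two-by-two grid. The function is repeated (in a slightly more compact form) so the cell runs on its own; the order of the combinations follows `itertools.product`, so the last key varies fastest.

```python
from itertools import product

def create_param_grid_list(param_grid):
    """Expand a dict of lists into a list of dicts, one per combination."""
    keys, values = zip(*param_grid.items())
    return [dict(zip(keys, combo)) for combo in product(*values)]

# Hypothetical grid, kept small so the output is easy to read.
small_grid = {'C': [0.1, 1.0], 'penalty': ['l1', 'l2']}
for params in create_param_grid_list(small_grid):
    print(params)
# {'C': 0.1, 'penalty': 'l1'}
# {'C': 0.1, 'penalty': 'l2'}
# {'C': 1.0, 'penalty': 'l1'}
# {'C': 1.0, 'penalty': 'l2'}
```

scikit-learn ships a similar expansion as `sklearn.model_selection.ParameterGrid`, which may be worth a look if you prefer not to maintain your own helper.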

We can now loop over all combinations, run our model with each set of hyperparameters, and write the results to MLflow!

Within the loop, just use the same logic as you did above when you were running just a single model. However, instead of printing your metrics, now write them to MLflow. 

**Task**: Check out the documentation on `mlflow.start_run()`. What does it do? 

**Task**: We are writing both metrics and params to MLflow. What is the difference?

In [None]:
for params in params_list:
    
    # fill in a run_name that you think is convenient
    with mlflow.start_run(run_name=<fill-in>):

        # create your model with the params of this loop
        model = <fill-in>

        # fit, run, and/or score your model
        <fill-in>

        # write the params and metrics that you are interested in to MLflow,
        # for example:
        mlflow.log_param(<fill-in>)
        mlflow.log_metric(<fill-in>)
        

In [None]:
%load answers/gridsearch-example.py

Great! Now check out your results in the MLflow UI.

**Task:** Compare the different run results and pick the model that you think is best. Did you have enough information for your decision?

## Predict and submit to Kaggle!

Great! Now that you have your best model, finish the job by retraining it once on the entire dataset and submitting your results to Kaggle!

In [None]:
# create a model
model = <fill-in>

# fit it on the entire dataset
<fill-in>

# predict on the test set
pred = <fill-in>

In [None]:
%load answers/fit-predict.py

Now put your predictions into the Kaggle submission format:

In [None]:
# save predictions
pred = pd.Series(pred, name='Action')

submission = pd.concat([y_id, pred], axis=1)
submission.head()
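Because `pd.concat(..., axis=1)` aligns rows on the index, predictions that carry a different index than the id frame would introduce `NaN` rows without raising an error. A few quick checks can catch this before uploading. The sketch below uses hypothetical ids and scores in place of `y_id` and `pred`.

```python
import pandas as pd

# Hypothetical ids and scores standing in for y_id and the model output.
toy_ids = pd.DataFrame({"Id": [1, 2, 3]})
toy_pred = pd.Series([0.91, 0.12, 0.77], name="Action")

# concat with axis=1 aligns rows on the index, so both pieces need the same index.
toy_submission = pd.concat([toy_ids, toy_pred], axis=1)

# Sanity checks: expected columns, one row per id, no missing values.
assert list(toy_submission.columns) == ["Id", "Action"]
assert len(toy_submission) == len(toy_ids)
assert not toy_submission.isna().any().any()
print(toy_submission)
```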

And finally write to CSV and upload!

In [None]:
submission.to_csv(data_folder + 'submission.csv', index=False)

In [None]:
!head -5 $data_folder/submission.csv

In [None]:
!kaggle competitions submit -c amazon-employee-access-challenge -f $data_folder/submission.csv -m "My submission!"

Nice.

**Task:** What is your score?! 

Done.