## Machine Learning and Experiment Tracking

> __Setup__:
> Windows users: install Git Bash if you do not already have it.<br><br>
> Open Git Bash/Mac Terminal and do the following:
> - Create a Python environment: `conda create -n env-name python` or `python -m venv env-name`
> - Activate the environment: `conda activate env-name`, or for a venv, `source env-name/Scripts/activate` (Git Bash on Windows) or `source env-name/bin/activate` (macOS/Linux).
> - Clone the lab's repository: `git clone https://github.com/sharonibejih/ml-exp-tracking.git`
> - Change current working directory: `cd ml-exp-tracking`<br><br>
> Open your Jupyter Notebook locally (i.e., from Anaconda, VSCode, or any other IDE you are comfortable with and can access locally).<br><br>
> Ensure that the notebook is located inside the "ml-exp-tracking" folder.<br><br>
> _P.S.: It's okay to work with the notebook that already exists in the cloned repo._


__Machine Learning:__

In the simplest terms, machine learning is teaching a computer to do what a human can do. <br><br>

>According to Chip Huyen, machine learning is an approach to learn complex patterns from existing data and use these patterns to make predictions on unseen data.<br><br>

Since learning is a process for us as humans, it is also a process for machines. To determine whether a model has learnt well enough, we evaluate it with qualitative measures, quantitative measures, or both.

<img src="ml_vs_traditional_paradigm.png">

Each time we train and evaluate a model, we carry out a machine learning experiment.


### Experiment Tracking

Just as we write down observations from every experiment in a science lab, we should record what we observe in every machine learning experiment.

Reviewing past experiments gives insight into what performed best and why some experiments failed. It also makes it possible to reproduce a past experiment or share findings with the team.


When building machine learning models, we try many things before arriving at an acceptable model. These include the data preprocessing technique, the data split size, the hyperparameters, and the machine learning algorithm used.


> Machine learning experiment tracking is the process of recording all the important details (called metadata) associated with each trained model, including its performance scores.
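To make the idea of "metadata" concrete, here is a minimal sketch of what one experiment record could contain. The `ExperimentRecord` class and all of its field names are hypothetical, chosen only for illustration; the metric is left empty because it would be filled in after evaluation.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class ExperimentRecord:
    """One experiment's metadata (illustrative field names, not a standard)."""
    run_name: str
    algorithm: str
    preprocessing: str
    test_size: float
    hyperparameters: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)


record = ExperimentRecord(
    run_name="baseline",
    algorithm="XGBClassifier",
    preprocessing="DictVectorizer",
    test_size=0.2,
    hyperparameters={"max_depth": 6},
    metrics={"f1": None},  # placeholder: set this after evaluating the model
)

# Serialise the record so it can be saved or shared with the team
record_json = json.dumps(asdict(record), indent=2)
print(record_json)
```

Writing records like this by hand works for a few experiments, but it relies on you remembering to update them every time.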


#### Why is ML Experiment Tracking Important?

What happens if your team lead asks you to use a model you trained a month ago? Can you reproduce that model if it was never saved? And even if it was saved, can you give details about it if you recorded them nowhere except in a notebook that may no longer be available?

Questions like these show why documenting your experiments is important.


__Tracking ML Experiments__
<br><br>
<img src="tracking_with_spreadsheet.png">

> <br>
> Class Question: <br>
> Do you see any problems with this method of tracking results?
> <br><br>



### MLflow

Think of MLflow as a platform that helps you manage your machine learning cycle: experimentation, deployment, and model storage. The goal is to aid reproducibility, a key ingredient in research and engineering.

Today, we will focus on the experiment tracking feature of MLflow and see how it can be used.

<img src="gui_result.png">

REF: https://mlflow.org/docs/latest/tracking.html 


#### Track your ML Experiments in 3 Basic Steps:
1. Set the tracking URI (optional): `mlflow.set_tracking_uri(uri)`. Do this first, so that the experiment is created in the location you intend.<br><br>
2. Create the experiment by giving it a name: `mlflow.set_experiment("experiment-name")`<br><br>
3. Log the metadata and artifacts for each run inside `with mlflow.start_run():`<br><br>

__DEFINITION OF TERMS__

__Experiment__ in MLflow is simply what we use to represent the name of the project, for example: "fraud-detection" or "term-deposit".

__Run__ represents each modeling attempt carried out under this project (or experiment). For example, you might try three or four fraud detection models before settling on the best one. Each try is called a run. When a run is executed, MLflow stores the model's metadata and artifacts.

__Metadata__ is the lightweight information about the model, such as metrics, tags, hyperparameter values, etc. What metadata to store is your choice.

__Artifacts__ are the heavier outputs of the model, such as the preprocessor, the exported model, images, data, etc. What artifacts to store is also your choice.

__Tracking URI__ tells MLflow where to store your results. Setting it is optional. Set a tracking URI if:
- you want to store your files (metadata and artifacts) in a specific local directory.
- you want to log your results to a specific resource, say a specific host. <br><br>If no file path or HTTP URI is specified, MLflow uses its default: __files will be stored in your working directory in a folder called `mlruns`, and once you start the UI with `mlflow ui`, it will be accessible via `http://127.0.0.1:5000`.__

## Now, let's code!

In [None]:
## pip install mlflow
## pip install xgboost

### Ctrl + Shift + P to add env to Jupyter Notebook

In [None]:
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import f1_score

from xgboost import XGBClassifier

import pickle # to export and load the model

import mlflow

The ML model was created beforehand, using the following simple code.

In [None]:
def load_data(train_path:str, test_path:str):
    train = pd.read_csv(train_path, sep=";")
    test = pd.read_csv(test_path, sep=";")
    
    return train, test

    
def preprocess_data(train_data, test_data):

    dv = DictVectorizer()

    # Separate the features from the target column "y"
    trainX = train_data.drop(columns=["y"])
    trainy = train_data["y"]

    testX = test_data.drop(columns=["y"])
    testy = test_data["y"]

    # Encode the target labels as integers
    trainy = trainy.replace({"no":0, "yes": 1})
    testy = testy.replace({"no":0, "yes": 1})

    # Fit the vectorizer on the training data only
    trainX = dv.fit_transform(trainX.to_dict(orient="records")).toarray()

    # Reuse the fitted vectorizer so the test columns match the training columns
    testX = dv.transform(testX.to_dict(orient="records")).toarray()
    
    return dv, trainX, trainy, testX, testy


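def train_model(trainX, trainy, testX):

    # Train an XGBoost classifier with its default hyperparameters
    xgb = XGBClassifier()
    
    xgb.fit(trainX, trainy)
    
    # Predict labels for the test set
    pred = xgb.predict(testX)

    return xgb, pred
    

def evaluate(testy, pred):
    
    # F1 score of the positive class ("yes" encoded as 1)
    return f1_score(testy, pred)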
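One detail worth checking in `preprocess_data` is how the `DictVectorizer` is applied to the test set: it should learn its columns from the training data and only transform the test data. The sketch below uses made-up records (not the lab's dataset) to show what happens when the vectorizer is refitted on the test data instead.

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up records for illustration only
train_records = [
    {"job": "admin", "age": 30},
    {"job": "technician", "age": 41},
]
test_records = [{"job": "admin", "age": 52}]

# Fit on the training data, then reuse the same vectorizer on the test data
dv = DictVectorizer()
train_X = dv.fit_transform(train_records).toarray()
test_X = dv.transform(test_records).toarray()

# Refitting on the test data learns a different (smaller) set of columns
refit_X = DictVectorizer().fit_transform(test_records).toarray()

print(train_X.shape)  # (2, 3): age, job=admin, job=technician
print(test_X.shape)   # (1, 3): same columns as training
print(refit_X.shape)  # (1, 2): age, job=admin only
```

A model trained on three columns cannot sensibly predict on a two-column matrix, which is why the fitted vectorizer is returned from `preprocess_data` and reused.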