# MLflow hands-on

In this session we will take what we learned previously and add MLflow configuration to an existing project.

We will:
- look into the existing project
- finish the partially filled `MLproject` file
- calculate predictions for a CSV file using the model server

In [None]:
%%bash --bg
source ./mlflow_env_vars.sh

mkdir -p data
mlflow server --host 0.0.0.0 \
    --port 5000 \
    --backend-store-uri sqlite:///mlflow.db \
    --default-artifact-root ./mlruns

## 1. The project

In real-world scenarios, MLOps work often consists of productionizing existing projects.

We have a project that uses the [Lending Club dataset](https://www.kaggle.com/datasets/wordsforthewise/lending-club) to model credit default risk.

The scripts for downloading, preprocessing data and training are already implemented.

Before running it, we need to set up a Kaggle token. To do that, follow the instructions provided [here](https://adityashrm21.github.io/Setting-Up-Kaggle/).

Our task is supervised learning - we try to predict whether a loan was repaid. We have over 100 features that can be used for training a model.

You will need a high-level understanding of the code. That mostly means understanding the function signatures.

Find the appropriate entry point in `MLproject` and run the cell below (after the Python Fire section) to download the data.

## 1.0 Python Fire

We will be using `fire`, a Python package for building convenient CLI apps from Python scripts, like `argparse` but smarter.

`fire` works by wrapping the functions in a Python script and exposing them on the command line.


`script.py:`

```python
import fire

def f(msg):
    ...
    
if __name__ == "__main__":
    fire.Fire()
```

Here `fire` will infer that `f` is a command.
You can invoke it from the command line using `python script.py f $msg`, which calls `f(msg)` in Python.
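
For comparison, here is a minimal sketch of the same CLI written with the standard-library `argparse`. It shows the boilerplate that `fire` infers from the function signature. The body of `f` and the `build_parser` and `main` helpers are illustrative assumptions, not part of the project.

```python
import argparse


def f(msg):
    """Return the text that the command would print."""
    return f"message: {msg}"


def build_parser():
    # With argparse, every command and argument is declared by hand;
    # fire derives the same structure from the signature of f.
    parser = argparse.ArgumentParser(prog="script.py")
    subparsers = parser.add_subparsers(dest="command", required=True)
    f_parser = subparsers.add_parser("f")
    f_parser.add_argument("msg")
    return parser


def main(argv):
    # argv is the list of command-line tokens, e.g. ["f", "hello"].
    args = build_parser().parse_args(argv)
    if args.command == "f":
        return f(args.msg)


# Equivalent of running: python script.py f hello
print(main(["f", "hello"]))  # message: hello
```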

We will use `fire` in our scripts.

In [1]:
# Run only if you have added a Kaggle token (see README)
# %%bash
# source ./mlflow_env_vars.sh

# mlflow run . -e download_data

## 1.1 Prepare data

The following function is used to filter the input data and define the target.

```python
def filter_input_data(
    data_path: str, # load the CSV file from this path
    target_col: str, # the column that will be treated as target and transformed accordingly
    dropped_values: List[str], # drop rows whose target is one of these values
    replace_by_zero: str, # target value mapped to 0; every other value maps to 1
    max_nan_proportion: float, # drop columns whose NaN proportion exceeds this
    max_categorical_cardinality: int, # drop categorical columns with more than this number of levels
    dst_filename: str, # output file path (the file is written as Parquet)
):
    csv_path = data_path
    df = pd.read_csv(csv_path)
    cleaned_df = cleaning.filter_df(
        df,
        target_col,
        dropped_values,
        replace_by_zero,
        max_nan_proportion,
        max_categorical_cardinality,
    )
    cleaned_df.reset_index().to_parquet(str(dst_filename), index=False)
```
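
The implementation of `cleaning.filter_df` is not shown here. As a rough illustration of what the parameters mean, the following is a hypothetical sketch (the name `filter_df_sketch`, the sample column names and the sample status values are all assumptions, and the project's real function may behave differently):

```python
from typing import List

import pandas as pd


def filter_df_sketch(
    df: pd.DataFrame,
    target_col: str,
    dropped_values: List[str],
    replace_by_zero: str,
    max_nan_proportion: float,
    max_categorical_cardinality: int,
) -> pd.DataFrame:
    # Drop rows whose target value is in the excluded list.
    df = df[~df[target_col].isin(dropped_values)].copy()
    # Binarize the target: the chosen value becomes 0, everything else 1.
    df[target_col] = (df[target_col] != replace_by_zero).astype(int)
    # Drop columns whose share of missing values exceeds the limit.
    nan_share = df.isna().mean()
    df = df.loc[:, nan_share <= max_nan_proportion]
    # Drop non-numeric columns with too many distinct levels.
    non_numeric = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
    too_wide = [c for c in non_numeric if df[c].nunique() > max_categorical_cardinality]
    return df.drop(columns=too_wide)


# Tiny made-up frame, only to show the effect of each parameter.
sample = pd.DataFrame(
    {
        "loan_status": ["Fully Paid", "Charged Off", "Current", "Fully Paid"],
        "amount": [1.0, 2.0, 3.0, 4.0],
        "mostly_nan": [None, None, None, 1.0],
        "id_like": ["a", "b", "c", "d"],
        "grade": ["A", "B", "A", "A"],
    }
)
cleaned = filter_df_sketch(sample, "loan_status", ["Current"], "Fully Paid", 0.5, 2)
print(list(cleaned.columns))  # ['loan_status', 'amount', 'grade']
```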

Implement the `prepare_data` entry point in `MLproject`.

In [None]:
!python -c "import sys; print(sys.executable)"

In [None]:
# Instead of running `!pip install pandas` here,
# we ran `conda install pandas` in the `mlops-student` environment

In [None]:
%%bash
python -c "import sys; print(sys.executable)"
source ./mlflow_env_vars.sh

mlflow run . -e prepare_data

## 1.2 Prepare train-test split

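Implement the `prepare_train_test_split_older` and `prepare_train_test_split_newer` entry points in `MLproject` (both call `prepare_train_test_split`, shown below) and use them to prepare two datasets containing records from different time periods.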

```python
def prepare_train_test_split(data_path, seed, test_size, train_path, test_path):
    df = utils.read_parquet(data_path)
    train_df, test_df = model_selection.train_test_split(
        df, test_size=test_size, random_state=seed
    )
    train_df.to_parquet(train_path, index=False)
    test_df.to_parquet(test_path, index=False)
```

In [None]:
%%bash
source ./mlflow_env_vars.sh

mlflow run . -e prepare_train_test_split_older
mlflow run . -e prepare_train_test_split_newer

## 1.3 Model training

`train_model` trains the model. You do not need to know all the details of training here.

`model_conf.yaml` contains model pipeline configuration.

Fill in the details of the `main` entry point in `MLproject`.

```python
def train_model(data_dir, config_dict_path):
    with open(config_dict_path, "r") as f:
        config_dict = yaml.safe_load(f)
    config = configs.PipelineConfig(**config_dict)
    with mlflow.start_run():
        mlflow.log_params(config_dict)
        logging.info(f"training model: {config_dict}")
        clf_pipeline = pipelines.get_classification_pipeline(config)
        X_train, y_train = prepare_input(data_dir, "train")
        clf_pipeline.fit(X_train, y_train)
        mlflow.sklearn.log_model(
            clf_pipeline, "model", registered_model_name="ChurnModel"
        )
```

In [None]:
%%bash
source ./mlflow_env_vars.sh
mlflow run . -e main

## Using trained model

Recall that MLflow models can be loaded in Python using several interfaces (scikit-learn, Keras, etc.).

Our scikit-learn model can be loaded, for example, using the MLflow `pyfunc` interface.

In [None]:
import os
import mlflow.pyfunc
import pandas as pd
from sklearn import metrics

In [None]:
os.environ["MLFLOW_TRACKING_URI"] = "http://0.0.0.0:5000"
model_name = "ChurnModel"
model_version_uri_1_0 = f"models:/{model_name}_1_0/latest"

model_1_0 = mlflow.pyfunc.load_model(model_version_uri_1_0)

In [None]:
df_1_0 = pd.read_parquet("data/test1_0.parquet")
df_1_1 = pd.read_parquet("data/test1_1.parquet")

## Evaluation

We have a binary classification problem that is somewhat imbalanced.

The following cell shows the proportion of positive examples.

In [None]:
df_1_0["target"].mean()

### Metrics

#### Question 1: what are some metrics that are better suited than accuracy for imbalanced problems?



Answer: F1 score, recall, precision, ROC AUC

#### Question 2: what is the problem with scores other than AUC?


Answer: these scores depend on a fixed decision threshold. For example, recall measures the fraction of positive examples whose predicted probability satisfies $p(x) > 0.5$.

In imbalanced problems this is an issue: the average of $p(x)$ will be close to the average of $y$ (for example, 15% in our case), so many positives will not reach $p(x) > 0.5$.

The solution to this is to use a score that does some kind of averaging over thresholds.

**ROC AUC** is one such score that is commonly used in imbalanced problems.
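
To make the threshold problem concrete, here is a small self-contained sketch on made-up scores (the labels and scores are invented for illustration). It computes recall at the 0.5 threshold and ROC AUC as the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as half. `sklearn.metrics.roc_auc_score` computes the same quantity.

```python
def recall_at_threshold(y_true, scores, threshold=0.5):
    # Fraction of positive examples whose score exceeds the threshold.
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    return sum(s > threshold for s in positives) / len(positives)


def roc_auc(y_true, scores):
    # Probability that a random positive outranks a random negative
    # (ties count as half), which equals the area under the ROC curve.
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Imbalanced toy data: 8 negatives, 2 positives, all scores below 0.5.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.08, 0.10, 0.12, 0.15, 0.18, 0.20, 0.35, 0.30, 0.45]

print(recall_at_threshold(y_true, scores))  # 0.0
print(roc_auc(y_true, scores))              # 0.9375
```

Recall is zero because no positive crosses 0.5, yet the ranking is nearly perfect, which is what ROC AUC captures.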


We will use this score now.

Run the old model on the old test set:

In [None]:
%%bash
source ./mlflow_env_vars.sh
mlflow run . -e evaluate_model -P "data_version='1_0'" -P "model_version='1_0'"
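
A likely reason for the extra pair of quotes around the version values, assuming the project scripts parse their arguments with `fire`: `fire` interprets arguments as Python literals where it can, and `1_0` is a valid integer literal. The sketch below imitates that behaviour with the standard-library `ast.literal_eval`; the helper `parse_like_a_literal` is hypothetical and is not `fire`'s actual parser.

```python
import ast


def parse_like_a_literal(arg: str):
    """Parse arg as a Python literal if possible, else keep it as a string."""
    try:
        return ast.literal_eval(arg)
    except (ValueError, SyntaxError):
        return arg


# Without inner quotes the underscore is a digit separator: the value is the int 10.
print(repr(parse_like_a_literal("1_0")))    # 10
# With inner quotes the value stays the string '1_0'.
print(repr(parse_like_a_literal("'1_0'")))  # '1_0'
```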

Run the old model on the new test set:

In [None]:
%%bash
source ./mlflow_env_vars.sh
mlflow run . -e evaluate_model -P "data_version='1_1'" -P "model_version='1_0'"

We see that the model's performance declined on the new data.

Let's train a model on the new data and see how it performs.

In [None]:
%%bash
source ./mlflow_env_vars.sh
mlflow run . -e main -P data_version="'1_1'"

In [None]:
%%bash
source ./mlflow_env_vars.sh
mlflow run . -e evaluate_model -P "data_version='1_1'" -P "model_version='1_1'"

We see that the newer model is actually better on new data.

## Serving the model

In [None]:
%%bash --bg
source ./mlflow_env_vars.sh

mlflow models serve -m models:/ChurnModel_1_1/Production -p 5001 --env-manager=conda

## Prediction

We'll load data that we can feed into the prediction server.
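
As a minimal sketch of what a request could look like, assuming MLflow 2.x (whose scoring server accepts a JSON body with a `dataframe_split` key on the `/invocations` endpoint) and the server started above on port 5001. The helper names `build_payload` and `score_with_server` are hypothetical, and older MLflow versions expect a different payload format.

```python
import json
import urllib.request

import pandas as pd


def build_payload(df: pd.DataFrame) -> str:
    # to_json turns NaN into null, which keeps the body valid JSON.
    split = json.loads(df.to_json(orient="split", index=False))
    return json.dumps({"dataframe_split": split})


def score_with_server(df: pd.DataFrame, url: str = "http://127.0.0.1:5001/invocations"):
    # Assumption: the model server from the previous section is running.
    request = urllib.request.Request(
        url,
        data=build_payload(df).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Building the payload needs no running server.
example = pd.DataFrame({"a": [1, 2], "b": [0.5, None]})
print(build_payload(example))
```

The input frame should contain only the feature columns the model was trained on, so a target column would need to be dropped before calling `score_with_server`.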