# wine_predictor Deployed

In this notebook, we cover some of the important themes around model operationalization. Most notably, we use `mlflow` to seamlessly integrate Python scripts and the associated metadata for a hassle-free machine learning operation.

- A **script** is some Python code we want to run, stored as a `.py` or `.ipynb` file. Usually, the script has a set of required or optional inputs we provide (just like a Python function). In `mlflow`, we refer to these inputs as **parameters**, but do NOT confuse this term with model parameters in ML.
- A **run** is when we fix the inputs of a script to some values and execute the script. In the context of ML, the script could be a training script, its "parameters" could be hyper-parameters of the model we wish to train, and a run is when we train a model with the hyper-parameters set to some fixed values.
- As part of a run we can log the **parameters** we used, the **metrics** we calculated such as training and test accuracy, and **artifacts** such as plots, tables, or trained models we save externally for reuse later. We can refer to these as the run meta-data. In addition to the meta-data we log explicitly in the code, `mlflow` also logs some of its own meta-data such as run ID or run time.
- An **experiment** is a collection of related runs. So to continue with the above example, if we execute the script several times, each time using another set of values for the hyper-parameters, then the experiment is the collection of all such runs. After executing all the runs, we can go to our experiment to compare them in terms of accuracy, run time, or whatever **metric** of interest.

In general we can be flexible in what exactly we define as an experiment. The general idea is that from run to run, we change things and later we want to see what worked and what didn't by looking at metrics or artifacts generated by the model. A machine learning project can consist of one or several experiments. It all depends on the complexity of the project, and how granular we think of individual runs. This is to some extent a matter of preference and can even be driven by business needs. 
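To make the vocabulary above concrete, here is a plain-Python toy model of it. This is only a sketch for illustration and not how `mlflow` stores anything: the `Run` and `Experiment` classes, the `best_run` helper, and all values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One execution of a script with its inputs fixed."""
    params: dict                                    # inputs to the script, e.g. hyper-parameters
    metrics: dict = field(default_factory=dict)     # numbers computed during the run
    artifacts: list = field(default_factory=list)   # names of files the run saved


@dataclass
class Experiment:
    """A named collection of related runs."""
    name: str
    runs: list = field(default_factory=list)

    def best_run(self, metric, lower_is_better=True):
        # Compare only the runs that actually logged this metric.
        candidates = [r for r in self.runs if metric in r.metrics]
        pick = min if lower_is_better else max
        return pick(candidates, key=lambda r: r.metrics[metric])


# Hypothetical values, chosen only to show the structure.
exp = Experiment("toy_experiment")
exp.runs.append(Run(params={"alpha": 0.5}, metrics={"rmse": 0.9}, artifacts=["model.pkl"]))
exp.runs.append(Run(params={"alpha": 0.1}, metrics={"rmse": 0.7}, artifacts=["model.pkl"]))

best = exp.best_run("rmse")
print(best.params)
```

The point of the sketch is the shape of the data: an experiment holds runs, and each run carries its own parameters, metrics, and artifacts, which is what makes runs comparable afterwards.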

Finally, of course we can do a lot of this manually. After all we know how to run scripts with different inputs, or how to save plots or models on disk. Using a **version control** tool like Git, we can also track changes to the code. So why do we need `mlflow`? The answer is simple: It takes away most of the hassle that comes with doing such things manually, and on top of that it provides us with a UI where we go to find all our runs and quickly compare them. There are other concepts in `mlflow` that we do not cover here, but we invite you to check out [their website](https://mlflow.org/).

To begin with, we create a folder to save not only the code, but also the meta-data generated by our runs. Once we begin to log runs, the project folder will be populated by such meta-data. You are advised against deleting the meta-data directly (the better way is to use the UI).

In [1]:
import os
import pandas as pd
import mlflow

Below is our example data for this lab, where we will predict wine quality from all the wine-related features.

In [2]:
# load the red-wine data from the UCI Machine Learning Repository
df_wine = pd.read_csv(
    "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv",
    sep = ";"
)
df_wine.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


We now create the project folder, with subfolders for the code, the configuration, and the data.

In [3]:
project_folder = "wine"
os.makedirs(project_folder, exist_ok=True)
os.makedirs(project_folder + "/code", exist_ok=True)
os.makedirs(project_folder + "/config", exist_ok=True)
os.makedirs(project_folder + "/data", exist_ok=True)

Then we create an **experiment**. 

In [4]:
experiment_name = "predict_wine_quality"

# reuse the experiment if it already exists, otherwise create it
experiment = mlflow.get_experiment_by_name(experiment_name)
if experiment is None:
    experiment_id = mlflow.create_experiment(experiment_name)
else:
    experiment_id = experiment.experiment_id
    
mlflow.set_experiment(experiment_name)

<Experiment: artifact_location='file:///c:/Users/silva/GH_Stage/wine_quality_predictor_deployed/mlruns/379278478217381913', creation_time=1711917878411, experiment_id='379278478217381913', last_update_time=1711917878411, lifecycle_stage='active', name='predict_wine_quality', tags={}>

In [5]:
print(experiment_id)

379278478217381913


Next, we create a training **script**.

In [6]:
%%writefile $project_folder/code/train.py 
# The data set used in this example is from http://archive.ics.uci.edu/ml/datasets/Wine+Quality
# P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
# Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

# import all required modules
import os
import warnings
import sys

import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet # most simple linear model (hybrid of L1 and L2 penalties)
from urllib.parse import urlparse
import mlflow
import mlflow.sklearn

# how we will log our results for use later
import logging

logging.basicConfig(level = logging.WARN)
logger = logging.getLogger(__name__)

# helper function that computes the evaluation metrics
# inputs are the actual values and the predictions; outputs are RMSE, MAE, and R-squared
# we treat this as a regression problem because quality is a numeric score (graded on a 0-10 scale)
def eval_metrics(actual, pred):
    rmse = np.sqrt(mean_squared_error(actual, pred))
    mae = mean_absolute_error(actual, pred)
    r2 = r2_score(actual, pred)
    return rmse, mae, r2

# main script to run in command line
if __name__ == "__main__":
    warnings.filterwarnings("ignore")
    np.random.seed(40)

    # read the wine-quality csv file from the URL
    csv_url = (
        "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv" # reading in the same dataset
    )
    try:
        data = pd.read_csv(csv_url, sep = ";")
    except Exception as e:
        logger.exception(
            "Unable to download training & test CSV, check your internet connection. Error: %s", e
        )
        raise # stop here: without the data, the rest of the script cannot run

    # split the data into training and test sets. (0.75, 0.25) split.
    train, test = train_test_split(data)

    # the predicted column is "quality", an integer score (values in this red-wine file run from 3 to 8)
    train_x = train.drop(["quality"], axis = 1)
    test_x = test.drop(["quality"], axis = 1)
    train_y = train[["quality"]]
    test_y = test[["quality"]]

    # script inputs for the elastic net, read from the command line; each falls back to a default when not supplied
    alpha = float(sys.argv[1]) if len(sys.argv) > 1 else 0.5 # regularization strength, default 0.5
    l1_ratio = float(sys.argv[2]) if len(sys.argv) > 2 else 0.5 # mix of L1 and L2 penalties, default 0.5
    experiment_name = str(sys.argv[3]) if len(sys.argv) > 3 else "predict_wine_quality" # experiment to log under, with a default

    mlflow.set_experiment(experiment_name)
    # mlflow.autolog()
    with mlflow.start_run():
        
        # first activate run process and store
        run = mlflow.active_run()
        experiment = mlflow.get_experiment(run.info.experiment_id)
        print("Experiment ID: \"{}\"".format(run.info.experiment_id))
        print("Experiment name: \"{}\"".format(experiment.name))
        print("Run ID: \"{}\"".format(run.info.run_id))
        
        # training portion: we use an elastic net linear regression algorithm
        lr = ElasticNet(alpha = alpha, l1_ratio = l1_ratio, random_state = 42)
        lr.fit(train_x, train_y)
        
        # make predictions
        predicted_qualities = lr.predict(test_x)
        
        # call eval metrics functions and get results
        (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)

        # print info to show metrics
        print("Using alpha = {:0.2f}, l1_ratio = {:0.2f} we get the following metrics:".format(alpha, l1_ratio))
        print("  metric RMSE: {:6.2f}".format(rmse))
        print("  metric MAE: {:6.2f}".format(mae))
        print("  metric R-squared: {:0.2f}".format(r2))

        # log the parameters used for this run, so that after many training runs
        # we can see which parameter values produced which metrics
        mlflow.log_param("alpha", alpha) # parameter, controls regularization amount
        mlflow.log_param("l1_ratio", l1_ratio) # split between L1 and L2 penalties
        
        # log metrics
        mlflow.log_metric("rmse", rmse)
        mlflow.log_metric("r2", r2)
        mlflow.log_metric("mae", mae)

        tracking_url_type_store = urlparse(mlflow.get_tracking_uri()).scheme

        # model registry does not work with file store
        if tracking_url_type_store != "file":

            # register the model
            mlflow.sklearn.log_model(lr, "model", registered_model_name = "ElasticnetWineModel")
        else:
            mlflow.sklearn.log_model(lr, "model")

Writing wine/code/train.py
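The script's `eval_metrics` helper relies on scikit-learn for the three metrics. As a small worked example of what those numbers mean, the sketch below computes them by hand with the standard library only. The function name `eval_metrics_by_hand` and the four actual/predicted values are made up for illustration.

```python
import math


def eval_metrics_by_hand(actual, pred):
    """Compute RMSE, MAE, and R-squared from their textbook definitions."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, pred)]

    ss_res = sum(e * e for e in errors)                     # residual sum of squares
    rmse = math.sqrt(ss_res / n)                            # root of the mean squared error
    mae = sum(abs(e) for e in errors) / n                   # mean absolute error

    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)    # spread of the actual values
    r2 = 1.0 - ss_res / ss_tot                              # share of that spread explained
    return rmse, mae, r2


# Hypothetical quality scores and predictions.
actual = [5, 6, 5, 7]
pred = [5.5, 5.5, 5.0, 6.0]

rmse, mae, r2 = eval_metrics_by_hand(actual, pred)
print(round(rmse, 4), round(mae, 4), round(r2, 4))  # 0.6124 0.5 0.4545
```

RMSE and MAE are in the same units as the quality score, so lower is better, while R-squared compares the model against always predicting the mean, so higher is better.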


Now that we created an **experiment** and training **script**, let's start a training **run**.

In a notebook, we prefix the command with `!` so that it runs in the shell, just as it would from the command line.

In [7]:
!python $project_folder/code/train.py

Experiment ID: "379278478217381913"
Experiment name: "predict_wine_quality"
Run ID: "95cc61c255514156a94e4ee74a8ac5e3"
Using alpha = 0.50, l1_ratio = 0.50 we get the following metrics:
  metric RMSE:   0.79
  metric MAE:   0.63
  metric R-squared: 0.11
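The run above passed no arguments, so the script fell back to its defaults. The sketch below isolates that fallback logic in a small function so we can try it without training anything. The name `parse_inputs` is hypothetical; the three expressions inside mirror the ones in `train.py`.

```python
def parse_inputs(argv):
    """Read positional inputs the way train.py does, using defaults for missing ones."""
    # argv[0] is the script name, so the first real input is argv[1].
    alpha = float(argv[1]) if len(argv) > 1 else 0.5
    l1_ratio = float(argv[2]) if len(argv) > 2 else 0.5
    experiment_name = str(argv[3]) if len(argv) > 3 else "predict_wine_quality"
    return alpha, l1_ratio, experiment_name


# No inputs supplied: every value comes from its default.
print(parse_inputs(["train.py"]))

# Two inputs supplied: only the experiment name falls back to its default.
print(parse_inputs(["train.py", "0.25", "0.50"]))
```

Because the inputs are positional, we cannot set `l1_ratio` without also setting `alpha`. That is one limitation of this simple approach.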


Since we defined the script with two inputs (what `mlflow` calls "parameters"), we can now change them to new values and execute the script again.

Here we specify `alpha` and `l1_ratio`.

In [8]:
!python $project_folder/code/train.py 0.25 0.50

Experiment ID: "379278478217381913"
Experiment name: "predict_wine_quality"
Run ID: "ff266ac11cae4cdab64ff9cb75b592ca"
Using alpha = 0.25, l1_ratio = 0.50 we get the following metrics:
  metric RMSE:   0.75
  metric MAE:   0.58
  metric R-squared: 0.21
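To see what the two inputs control, recall that scikit-learn's `ElasticNet` adds a penalty on the model weights to the squared-error loss: `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`. The sketch below evaluates only that penalty term for a made-up weight vector. The function name and the weights are assumptions for illustration.

```python
def elastic_net_penalty(weights, alpha, l1_ratio):
    """Penalty term of the elastic net objective, as documented for scikit-learn's ElasticNet."""
    l1 = sum(abs(w) for w in weights)           # L1 norm of the weights
    l2_squared = sum(w * w for w in weights)    # squared L2 norm of the weights
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2_squared


weights = [1.0, -2.0]  # hypothetical weight vector

# Same mix of L1 and L2, but a smaller alpha halves the penalty.
print(elastic_net_penalty(weights, alpha=0.5, l1_ratio=0.5))   # 1.375
print(elastic_net_penalty(weights, alpha=0.25, l1_ratio=0.5))  # 0.6875
```

So `alpha` scales the overall amount of regularization, while `l1_ratio` decides how that amount is split between the L1 and L2 parts.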


Let's now define an `mlflow` project and formalize what we did above. We create a file below that defines an `mlflow` project with its parameters and the command to be executed. Note that file paths are specified relative to the project directory.

The project file names the environment, the entry points, and their parameters. When the project is run, `mlflow` also creates the conda environment it points to.

In [9]:
%%writefile $project_folder/MLproject
name: Wine Quality Prediction

conda_env: config/conda.yaml

entry_points:
  main:
    parameters:
      alpha: {type: float, default: 0.5}
      l1_ratio: {type: float, default: 0.1}
    command: "python code/train.py {alpha} {l1_ratio}"

Writing wine/MLproject


The above file also points to a conda environment file which we create below. This file defines the Python runtime used by the experiment. So for example, as part of the experiment, we can update one of the packages listed below and execute a new run to see if the update breaks our script.

The file is written in YAML, a common format for configuration files.

It defines the dependencies of the environment so that our code works, pinning the package versions so that they stay consistent with the ones used for training.

In [10]:
%%writefile $project_folder/config/conda.yaml
channels:
  - conda-forge
dependencies:
  - python=3.11
  - pip
  - pip:
    - scikit-learn==1.3.0
    - mlflow==2.6.0
    - pandas==2.1.0

Writing wine/config/conda.yaml


To execute our experiment, we use the `mlflow` command. This is very similar to the way we executed the script earlier, but instead we point to the mlflow project.

In [11]:
!mlflow run $project_folder --experiment-name $experiment_name -P alpha=0.42

Experiment ID: "379278478217381913"
Experiment name: "predict_wine_quality"
Run ID: "217a7357a6784ffbb236edd2a6ec9c88"
Using alpha = 0.42, l1_ratio = 0.10 we get the following metrics:
  metric RMSE:   0.74
  metric MAE:   0.57
  metric R-squared: 0.22


2024/03/31 13:48:47 INFO mlflow.utils.conda: Conda environment mlflow-c59dcc8a5768bf4486f5b50970681f464742e7ac already exists.
2024/03/31 13:48:47 INFO mlflow.projects.utils: === Created directory C:\Users\silva\AppData\Local\Temp\tmp1q0q_lyy for downloading remote URIs passed to arguments of type 'path' ===
2024/03/31 13:48:47 INFO mlflow.projects.backend.local: === Running command 'conda activate mlflow-c59dcc8a5768bf4486f5b50970681f464742e7ac && python code/train.py 0.42 0.1' in run with ID '217a7357a6784ffbb236edd2a6ec9c88' === 
2024/03/31 13:48:53 INFO mlflow.projects: === Run (ID '217a7357a6784ffbb236edd2a6ec9c88') succeeded ===
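The log above shows the command that was actually executed: `python code/train.py 0.42 0.1`. The sketch below mimics how the command template in `MLproject` could be filled in from the defaults plus the `-P` override. It only imitates the substitution with `str.format` and is not `mlflow`'s own implementation; `render_command` is a hypothetical helper.

```python
# The entry point from the MLproject file, written out as a Python dict.
entry_point = {
    "parameters": {
        "alpha": {"type": "float", "default": 0.5},
        "l1_ratio": {"type": "float", "default": 0.1},
    },
    "command": "python code/train.py {alpha} {l1_ratio}",
}


def render_command(entry_point, overrides):
    """Fill the command template: start from the defaults, then apply the overrides."""
    values = {name: spec["default"] for name, spec in entry_point["parameters"].items()}
    values.update(overrides)
    return entry_point["command"].format(**values)


# Equivalent in spirit to passing `-P alpha=0.42` on the command line.
print(render_command(entry_point, {"alpha": 0.42}))  # python code/train.py 0.42 0.1
```

Note that the project default for `l1_ratio` (0.1) differs from the script's own default (0.5), which is why this run reports `l1_ratio = 0.10`.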


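After executing all the runs, we can go to our experiment to compare them in terms of accuracy, run time, or whatever **metric** is of interest.

`mlflow.search_runs` returns the runs logged under our experiment as a table with one row per run, so we can see how many runs we have done and how they compare.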

In [12]:
mlflow.search_runs(experiment_id).head()

Unnamed: 0,run_id,experiment_id,status,artifact_uri,start_time,end_time,metrics.mae,metrics.r2,metrics.rmse,params.alpha,...,tags.mlflow.log-model.history,tags.mlflow.source.git.repoURL,tags.mlflow.gitRepoURL,tags.mlflow.user,tags.mlflow.project.entryPoint,tags.mlflow.project.backend,tags.mlflow.runName,tags.mlflow.source.name,tags.mlflow.source.type,tags.mlflow.source.git.commit
0,217a7357a6784ffbb236edd2a6ec9c88,379278478217381913,FINISHED,file:///c:/Users/silva/GH_Stage/wine_quality_p...,2024-03-31 20:48:44.826000+00:00,2024-03-31 20:48:53.828000+00:00,0.572285,0.219785,0.742062,0.42,...,"[{""run_id"": ""217a7357a6784ffbb236edd2a6ec9c88""...",https://github.com/silvanoross/wine_quality_pr...,https://github.com/silvanoross/wine_quality_pr...,silva,main,local,mercurial-dove-676,file://c:\Users\silva\GH_Stage\wine_quality_pr...,PROJECT,4f1b8226936a5ffe76045abb270c4cfaa0c30a4d
1,ff266ac11cae4cdab64ff9cb75b592ca,379278478217381913,FINISHED,file:///c:/Users/silva/GH_Stage/wine_quality_p...,2024-03-31 20:47:50.845000+00:00,2024-03-31 20:47:53.690000+00:00,0.580695,0.205275,0.748931,0.25,...,"[{""run_id"": ""ff266ac11cae4cdab64ff9cb75b592ca""...",,,silva,,,awesome-donkey-25,wine/code/train.py,LOCAL,4f1b8226936a5ffe76045abb270c4cfaa0c30a4d
2,95cc61c255514156a94e4ee74a8ac5e3,379278478217381913,FINISHED,file:///c:/Users/silva/GH_Stage/wine_quality_p...,2024-03-31 20:47:23.844000+00:00,2024-03-31 20:47:27.252000+00:00,0.627195,0.108626,0.793164,0.5,...,"[{""run_id"": ""95cc61c255514156a94e4ee74a8ac5e3""...",,,silva,,,clean-shrike-821,wine/code/train.py,LOCAL,4f1b8226936a5ffe76045abb270c4cfaa0c30a4d


## Predict using trained model

Now let's see how we can load the model saved from one of our runs into the current Python session. One way is to copy the line with `logged_model = ...` from the model artifacts page in the UI and paste it below. Here we instead take the run ID of the first run returned by `mlflow.search_runs` and build the path to its model artifacts ourselves. We can then load a few rows of the wine data and use the model to get predictions.

In [13]:
# run ID of the first run returned by search_runs (used below to build the model path)
logged_model = mlflow.search_runs(experiment_id)['run_id'][0]
logged_model

'217a7357a6784ffbb236edd2a6ec9c88'

In [14]:
logged_model = f"./mlruns/{experiment_id}/{logged_model}/artifacts/model"
loaded_model = mlflow.pyfunc.load_model(logged_model) # load model as a PyFuncModel.

 - mlflow (current: 2.7.1, required: mlflow==2.6.0)
 - numpy (current: 1.24.3, required: numpy==1.26.2)
 - scikit-learn (current: 1.3.1, required: scikit-learn==1.3.0)
 - scipy (current: 1.11.2, required: scipy==1.11.4)
To fix the mismatches, call `mlflow.pyfunc.get_model_dependencies(model_uri)` to fetch the model's environment and install dependencies using the resulting environment file.
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
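The path built above depends on the local `mlruns` folder layout and on the notebook's working directory. `mlflow` also accepts model URIs of the form `runs:/<run_id>/<artifact_path>`, which are resolved through the tracking store instead. The sketch below only builds both strings side by side and loads nothing. The helper name `model_locations` is hypothetical, and the IDs are the ones printed earlier in this notebook.

```python
def model_locations(experiment_id, run_id, artifact_path="model"):
    """Return the local file path used above and the equivalent runs:/ URI."""
    local_path = f"./mlruns/{experiment_id}/{run_id}/artifacts/{artifact_path}"
    runs_uri = f"runs:/{run_id}/{artifact_path}"
    return local_path, runs_uri


local_path, runs_uri = model_locations(
    "379278478217381913",                 # experiment ID printed earlier
    "217a7357a6784ffbb236edd2a6ec9c88",   # run ID of the first run returned by search_runs
)
print(local_path)
print(runs_uri)

# Assumed usage, not executed here:
#   loaded_model = mlflow.pyfunc.load_model(runs_uri)
```

The artifact path `"model"` matches the name passed to `mlflow.sklearn.log_model` in the training script.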


In [15]:
df_wine_sample = pd.read_csv(
    "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv",
    sep = ";"
).head() # load some data

In [16]:
df_wine_sample

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


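The loaded object is not the scikit-learn estimator itself but a packaged model: `mlflow.pyfunc.load_model` returns a generic `PyFuncModel` wrapper around the saved model and its metadata. We therefore work with it through its `predict` method rather than through the scikit-learn API.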

In [17]:
loaded_model.predict(df_wine_sample.drop(columns="quality"))

array([5.367283  , 5.4342527 , 5.41406653, 5.57125136, 5.367283  ])
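As a quick sanity check, we can set these five predictions next to the `quality` values of the same five rows. The sketch below copies both sets of numbers from the outputs above. It is only a glance at five rows, most likely including rows the model was trained on, so it is not an evaluation of the model.

```python
# Values copied from the notebook outputs above.
predictions = [5.367283, 5.4342527, 5.41406653, 5.57125136, 5.367283]
actual = [5, 5, 5, 6, 5]

# Quality is an integer score, so round each prediction to the nearest grade.
rounded = [round(p) for p in predictions]

# Mean absolute error of the raw (unrounded) predictions on these rows.
abs_errors = [abs(a - p) for a, p in zip(actual, predictions)]
mean_abs_error = sum(abs_errors) / len(abs_errors)

print(rounded)
print(round(mean_abs_error, 4))
```

On these rows the rounded predictions match the recorded grades, but the raw predictions all sit in a narrow band around 5.4, which is consistent with the modest R-squared values reported for the runs.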

To deploy the model, we would pick a deployment target such as a Kubernetes-based cloud service, set it up, upload the model, and wrap the model within the service. The specific service has folders for artifacts. If you use Azure Machine Learning to train the models, there is a seamless way to deploy the model into a cloud service.

In practice, you typically use a specific cloud vendor that either provides wrappers for `mlflow` or has its own `mlflow`-like environment.