# MLflow

In this notebook you will build a model to predict the quality score of a wine from its physicochemical measurements. See [Cortez et al., 2009](http://www3.dsi.uminho.pt/pcortez/wine/) for more detail about the dataset.

The goal of the notebook is to walk through the steps of putting an ML model into production:
* ingest the data
* split the data into training, evaluation, and test sets
* transform the data for the model
* train and evaluate the model
* store the model
* use the stored model to predict on new data (in batch or in real time)

The aim is to show the end-to-end flow rather than to go deep into any single step.

In [1]:
import mlflow
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from pathlib import Path
import os

In [14]:
# Note: change this directory if you are not using a dev container.
# The working directory should be the src folder of the mlflow-training repo.
os.chdir("/workspaces/mlflow-training/src")

## Ingest data

In [3]:
red_df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";")
red_df["is_red"] = 1
white_df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", sep=";")
white_df["is_red"] = 0

df = pd.concat([red_df, white_df])
df

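| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | is_red |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 | 1 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8 | 5 | 1 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8 | 5 | 1 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8 | 6 | 1 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4893 | 6.2 | 0.21 | 0.29 | 1.6 | 0.039 | 24.0 | 92.0 | 0.99114 | 3.27 | 0.50 | 11.2 | 6 | 0 |
| 4894 | 6.6 | 0.32 | 0.36 | 8.0 | 0.047 | 57.0 | 168.0 | 0.99490 | 3.15 | 0.46 | 9.6 | 5 | 0 |
| 4895 | 6.5 | 0.24 | 0.19 | 1.2 | 0.041 | 30.0 | 111.0 | 0.99254 | 2.99 | 0.46 | 9.4 | 6 | 0 |
| 4896 | 5.5 | 0.29 | 0.30 | 1.1 | 0.022 | 20.0 | 110.0 | 0.98869 | 3.34 | 0.38 | 12.8 | 7 | 0 |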


## Split data

We split the data in the following proportions:
- 80% training
- 10% evaluation
- 10% test

In [4]:
y = df[["quality"]]
X = df.drop("quality", axis=1, inplace=False)

X_train, X_test_val, y_train, y_test_val = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_test_val, y_test_val, test_size=0.5, random_state=42)
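
The 80/10/10 proportions follow from chaining the two calls: the first holds out 20% of the rows, and the second cuts that held-out part in half. The sketch below reproduces the arithmetic with the standard library only. It assumes a total of 6,497 rows (1,599 red plus 4,898 white) and that `train_test_split` rounds a fractional test size up, as recent scikit-learn versions do; `split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math


def split_sizes(n_rows: int, holdout_fraction: float, test_share: float) -> tuple[int, int, int]:
    """Return (train, validation, test) row counts for a two-stage split."""
    # First split: hold out a fraction of all rows (assumed to be rounded up).
    n_holdout = math.ceil(holdout_fraction * n_rows)
    n_train = n_rows - n_holdout
    # Second split: part of the held-out rows becomes the test set, the rest validation.
    n_test = math.ceil(test_share * n_holdout)
    n_val = n_holdout - n_test
    return n_train, n_val, n_test


# Assumed row count of the combined red and white wine tables.
n_train, n_val, n_test = split_sizes(6497, holdout_fraction=0.2, test_share=0.5)
for name, size in [("train", n_train), ("validation", n_val), ("test", n_test)]:
    print(f"{name}: {size} rows ({size / 6497:.1%})")
```

Checking `len(X_train)`, `len(X_val)`, and `len(X_test)` in the notebook is a quick way to confirm the actual sizes.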

## Transform data

Apply a preprocessing step that standardizes each feature by removing the mean and scaling to unit variance.

In [5]:
# Standardize every feature column (zero mean, unit variance)
preprocessing_pipeline = Pipeline(
    [
        (
            "ct",
            ColumnTransformer(
                [
                    (
                        "standard_scaler",
                        StandardScaler(),
                        X_train.columns,
                    ),
                ]
            )
        )
    ]
)

# Fit only on the training set, then transform the validation and test sets with the statistics learned from the training data
X_train_processed = preprocessing_pipeline.fit_transform(X_train)
X_val_processed = preprocessing_pipeline.transform(X_val)
X_test_processed = preprocessing_pipeline.transform(X_test)
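
To make the fit/transform distinction concrete, here is a standard-library sketch of what standardization does to a single column: the mean and standard deviation are computed from the training values only and then reused for every other set. The numbers and the helper names `fit_scaler` and `apply_scaler` are made up for illustration; `StandardScaler` uses the population standard deviation, which `statistics.pstdev` mirrors here.

```python
from statistics import fmean, pstdev


def fit_scaler(values: list[float]) -> tuple[float, float]:
    """Learn the mean and population standard deviation from training values."""
    return fmean(values), pstdev(values)


def apply_scaler(values: list[float], mean: float, std: float) -> list[float]:
    """Standardize values with statistics learned elsewhere (no refitting)."""
    return [(v - mean) / std for v in values]


# Illustrative alcohol-like values, not taken from the dataset.
train_alcohol = [9.4, 9.8, 9.8, 11.2, 12.8]
val_alcohol = [10.0, 9.6]

mean, std = fit_scaler(train_alcohol)
train_scaled = apply_scaler(train_alcohol, mean, std)
val_scaled = apply_scaler(val_alcohol, mean, std)  # reuses the training statistics

print(f"train mean after scaling: {fmean(train_scaled):.3f}")
print(f"train std after scaling:  {pstdev(train_scaled):.3f}")
print(f"validation values scaled: {[round(v, 3) for v in val_scaled]}")
```

The validation values are not centered on zero after scaling, which is expected: they are expressed relative to the training distribution, exactly as new data will be at prediction time.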

## Train model

In [16]:
def setup_mlflow(experiment_name: str, mlflow_location: str = "metadata/mlflow"):
    """Set the MLflow tracking URI, the experiment, and the artifact location."""
    mlflow_root_path = Path.cwd().joinpath(mlflow_location)
    mlflow_root_path.mkdir(parents=True, exist_ok=True)
    # Create the SQLite file if it does not exist
    sqlite_path = mlflow_root_path.joinpath("mlruns.db")
    sqlite_path.touch()

    # Point tracking at the SQLite file: "file:///abs/path" becomes "sqlite:////abs/path"
    mlflow.set_tracking_uri(sqlite_path.as_uri().replace("file:", "sqlite:/"))
    
    # Get a list of all existing experiments
    experiments = mlflow.search_experiments()
    experiment_names = [ex.name for ex in experiments]
    # Create experiment if it does not exist
    if experiment_name not in experiment_names:
        artifact_location = mlflow_root_path.joinpath("mlartifacts")
        artifact_location.mkdir(exist_ok=True)
        mlflow.create_experiment(
            experiment_name,
            artifact_location=artifact_location.as_uri(),
        )
    
    mlflow.set_experiment(experiment_name)
    return True
    

setup_mlflow(
    experiment_name="wine_score_notebook",
)

True
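
The tracking URI line in `setup_mlflow` is terse, so here is a standalone sketch of the string conversion it performs. It uses the dev-container path from earlier in the notebook and `PurePosixPath` so that the result does not depend on the machine running it; the four slashes in the result are the SQLAlchemy-style way of writing an absolute path to a SQLite file.

```python
from pathlib import PurePosixPath

# Path of the SQLite file when the working directory is the dev-container src folder.
db_path = PurePosixPath("/workspaces/mlflow-training/src/metadata/mlflow/mlruns.db")

file_uri = db_path.as_uri()
tracking_uri = file_uri.replace("file:", "sqlite:/")

print(file_uri)
print(tracking_uri)

# The same URI can be built directly from the absolute path.
direct_uri = f"sqlite:///{db_path}"
```

Building the URI directly with an f-string may be easier to read than the `replace` call, and for an absolute POSIX path the two forms give the same string.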

In [7]:
model = LinearRegression()

# Autologging records the estimator parameters, training metrics, and the fitted model (artifact path "model")
mlflow.autolog(log_input_examples=True)
with mlflow.start_run() as run:
    model.fit(X_train_processed, y_train)
    # Also log the preprocessing pipeline and the regressor explicitly to the same run
    mlflow.sklearn.log_model(preprocessing_pipeline, "preprocessing_pipeline")
    mlflow.sklearn.log_model(model, "regressor")
    mlflow_run_id = run.info.run_id
    
mlflow.autolog(disable=True)

2023/05/27 23:02:24 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.


In [8]:
mlflow_run_id

'63195d49f9f644b6bb58de417f90634a'

## Predict with trained model

### Predict in batch

In [9]:
print(f"Model path = runs:/{mlflow_run_id}/model/")
loaded_model = mlflow.sklearn.load_model(f"runs:/{mlflow_run_id}/model/")

Model path = runs:/63195d49f9f644b6bb58de417f90634a/model/


In [10]:
y_test["prediction"] = loaded_model.predict(X_test_processed)
y_test

| | quality | prediction |
| --- | --- | --- |
| 1303 | 6 | 5.945335 |
| 3446 | 6 | 5.694131 |
| 1153 | 6 | 6.072289 |
| 323 | 6 | 5.364762 |
| 3619 | 6 | 5.822099 |
| ... | ... | ... |
| 711 | 6 | 6.319968 |
| 549 | 6 | 5.185846 |
| 2191 | 5 | 5.630085 |
| 1002 | 7 | 6.627817 |
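
The notebook stops at producing predictions, so a natural follow-up is to score them. The sketch below computes the mean absolute error and the root mean squared error by hand for the nine rows displayed above; on the full test set the same formulas apply to `y_test`. The helper names are hypothetical, and the resulting values describe only this small sample.

```python
import math

# (quality, prediction) pairs copied from the rows shown in the output above.
rows = [
    (6, 5.945335), (6, 5.694131), (6, 6.072289),
    (6, 5.364762), (6, 5.822099), (6, 6.319968),
    (6, 5.185846), (5, 5.630085), (7, 6.627817),
]


def mean_absolute_error(pairs):
    """Average absolute difference between true and predicted values."""
    return sum(abs(true - pred) for true, pred in pairs) / len(pairs)


def root_mean_squared_error(pairs):
    """Square root of the average squared difference."""
    return math.sqrt(sum((true - pred) ** 2 for true, pred in pairs) / len(pairs))


mae = mean_absolute_error(rows)
rmse = root_mean_squared_error(rows)
print(f"MAE on the sample:  {mae:.3f}")
print(f"RMSE on the sample: {rmse:.3f}")
```

Metrics like these, computed on the validation set, could also be recorded inside the training run with `mlflow.log_metric`, for instance under a name such as `val_rmse` (the name is only a suggestion).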


### Predict in real time

We can also use the MLflow model to make predictions in real time. To do so we need to:
1. run an MLflow server that distributes the model
2. create a serving endpoint that pulls the model from the MLflow server
3. query the model in real time using `curl`

In [11]:
print("Copy the command below into a new terminal in your IDE\n")

print("mlflow server \\")
print("    --backend-store-uri sqlite:///src/metadata/mlflow/mlruns.db \\")
print("    --default-artifact-root ./src/metadata/mlflow/mlartifacts \\")
print("    --host 0.0.0.0 \\")
print("    --port 5000")

Copy the command below into a new terminal in your IDE

mlflow server \
    --backend-store-uri sqlite:///src/metadata/mlflow/mlruns.db \
    --default-artifact-root ./src/metadata/mlflow/mlartifacts \
    --host 0.0.0.0 \
    --port 5000


In [12]:
print("Copy the command below into a new terminal in your IDE\n")

print("MLFLOW_TRACKING_URI=http://0.0.0.0:5000 mlflow models serve \\")
print("      --host=0.0.0.0 \\")
print("      --port=5001 \\")
print("      --env-manager=local \\")
print(f"      --model-uri runs:/{mlflow_run_id}/model/")

Copy the command below into a new terminal in your IDE

MLFLOW_TRACKING_URI=http://0.0.0.0:5000 mlflow models serve \
      --host=0.0.0.0 \
      --port=5001 \
      --env-manager=local \
      --model-uri runs:/63195d49f9f644b6bb58de417f90634a/model/


In [13]:
print("Copy the command below into one of your terminals\n")

request_data = pd.DataFrame(X_test_processed).iloc[0:4].to_json(orient="records")
print("""curl http://0.0.0.0:5001/invocations -H 'Content-Type: application/json' -d '{"dataframe_records": """ + request_data + """}'""")

Copy the command below into one of your terminals

curl http://0.0.0.0:5001/invocations -H 'Content-Type: application/json' -d '{"dataframe_records": [{"0":0.7738688092,"1":-0.8461749496,"2":0.2745893897,"3":0.5480123276,"4":-0.5979451988,"5":1.6879744754,"6":0.8423490573,"7":0.1692239584,"8":-1.0460083335,"9":-1.5541818503,"10":-0.154123837,"11":-0.5651286567},{"0":0.0751888016,"1":0.0171458612,"2":-0.140147658,"3":0.8190646823,"4":0.0337640161,"5":0.1969664016,"6":1.1095544349,"7":0.2822422616,"8":-0.4194907278,"9":-0.6123499532,"10":-0.0701790489,"11":-0.5651286567},{"0":1.8607043766,"1":0.5104720389,"2":0.2054665484,"3":-0.7029985403,"4":0.7803294518,"5":-0.7779234928,"6":-1.3843624235,"7":0.5082788679,"8":0.0817233567,"9":0.8676715995,"10":0.5174344676,"11":1.769508568},{"0":2.4041221603,"1":1.5587901663,"2":0.6893264375,"3":-0.5570472724,"4":0.9813278384,"5":0.254312866,"6":-0.7608832089,"7":1.4323696995,"8":0.3949821596,"9":1.1367664272,"10":-0.8256821416,"11":1.769508568}]}'
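
The same request can be sent from Python instead of `curl`. This sketch builds the `dataframe_records` payload with the standard library and shows how it could be posted to the serving endpoint; it assumes the serving process is listening on port 5001 and reachable at `localhost`. `build_payload` and `query_model` are hypothetical helper names, and the row is the first record from the command above.

```python
import json
import urllib.request


def build_payload(records: list[dict]) -> bytes:
    """Wrap row dictionaries in the dataframe_records format used above."""
    return json.dumps({"dataframe_records": records}).encode("utf-8")


def query_model(records: list[dict], url: str = "http://localhost:5001/invocations"):
    """POST the rows to the serving endpoint and return the decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=build_payload(records),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read())


# First preprocessed row from the curl command above (12 standardized features).
first_row = {
    "0": 0.7738688092, "1": -0.8461749496, "2": 0.2745893897, "3": 0.5480123276,
    "4": -0.5979451988, "5": 1.6879744754, "6": 0.8423490573, "7": 0.1692239584,
    "8": -1.0460083335, "9": -1.5541818503, "10": -0.154123837, "11": -0.5651286567,
}

payload = build_payload([first_row])
print(payload.decode("utf-8"))
# With the serving process running: print(query_model([first_row]))
```

The call to `query_model` is left commented out because it only succeeds while the serving process from the previous step is running.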

## To Go Further

As a next step, try combining the transformer and the predictor in a single sklearn pipeline, so that one logged model accepts raw feature values.
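
One possible shape for such a combined pipeline is sketched below on synthetic data, so that it runs without the wine dataset. The step names `scaler` and `regressor` and the generated features are assumptions for illustration; in the notebook the pipeline would be fitted on `X_train` and `y_train` instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, noise-free regression data standing in for the wine features.
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 3.0

# Scaling and regression live in one object, fitted in a single call.
full_pipeline = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("regressor", LinearRegression()),
    ]
)
full_pipeline.fit(X_demo, y_demo)

# predict() takes raw (unscaled) features; the scaler is applied internally.
predictions = full_pipeline.predict(X_demo[:5])
print(predictions)
```

Because scaling now happens inside the model, logging this single object with `mlflow.sklearn.log_model` would keep the preprocessing and the regressor together, and the serving endpoint could then receive unprocessed measurements.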