# MLflow Core Concepts

In this notebook you will build a model to predict the quality score of a wine from physicochemical measurements. See [Cortez et al., 2009](http://www3.dsi.uminho.pt/pcortez/wine/) for more detail about the dataset.

The goal of the notebook is to walk through the steps of putting an ML model into production:
* ingest the data
* split the data into training, evaluation, and test sets
* transform the data for the model
* train and evaluate the model
* store the model
* use the model above to predict on some new data (in batch or real-time)

The notebook aims to show the end-to-end flow rather than to go deep into any single step.

Throughout the notebook there are tasks for you to complete. You can find them in the code by searching for `# ToDo#: ...`

In this notebook you will be asked to:
* ToDo1: add a column to the data frame that indicates if the wine is red or white
* ToDo2: separate the target variable from the features
* ToDo3: fit the preprocessing pipeline on the training data and transform the validation and test data
* ToDo4: log the model and the preprocessing pipeline
* ToDo5: log metrics to mlflow
* ToDo6: log parameters to mlflow
* ToDo7: find your logged model in the MLflow UI, register it, and set its stage to production
* ToDo8: load the model from mlflow and make a prediction on the test data
* ToDo9: set the model uri to the model you just registered
* ToDo10: [To Go Further] rebuild a model using sklearn pipeline, log it to mlflow and deploy a serving endpoint

If you need help you can browse through the following documentation:
* [MLflow](https://mlflow.org/docs/latest/index.html)
* [scikit-learn](https://scikit-learn.org/stable/)
* [pandas](https://pandas.pydata.org/docs/)

In [1]:
import mlflow
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import os
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.base import RegressorMixin
from sklearn.base import BaseEstimator


In [2]:
# Note: change this directory if you are not using the dev container.
# The working directory should be the src folder of the mlflow-training repo.
os.chdir("/workspaces/mlflow-training/src")


# Set up MLflow with the same settings as the recipe in notebook 02
from steps.utils import setup_mlflow

setup_mlflow(
    experiment_name="wine_score_notebook",
)


True

## Ingest data

In [3]:
red_df = pd.read_csv(
    "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv",
    sep=";",
)

white_df = pd.read_csv(
    "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv",
    sep=";",
)
# ToDo1: add a column to the data frame that indicates if the wine is red or white
red_df["is_red"] = 1
white_df["is_red"] = 0

df = pd.concat([red_df, white_df])
df


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,is_red
0,7.4,0.70,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5,1
1,7.8,0.88,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5,1
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5,1
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6,1
4,7.4,0.70,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4893,6.2,0.21,0.29,1.6,0.039,24.0,92.0,0.99114,3.27,0.50,11.2,6,0
4894,6.6,0.32,0.36,8.0,0.047,57.0,168.0,0.99490,3.15,0.46,9.6,5,0
4895,6.5,0.24,0.19,1.2,0.041,30.0,111.0,0.99254,2.99,0.46,9.4,6,0
4896,5.5,0.29,0.30,1.1,0.022,20.0,110.0,0.98869,3.34,0.38,12.8,7,0
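
`pd.concat` keeps each frame's original row labels by default, so the combined frame above does not number its rows from 0 to n-1. A minimal sketch, using two tiny made-up frames instead of the real CSV files, of how the `is_red` flag and the `ignore_index` option behave:

```python
import pandas as pd

# Tiny made-up stand-ins for the red and white wine files.
red = pd.DataFrame({"alcohol": [9.4, 9.8], "quality": [5, 5]})
white = pd.DataFrame({"alcohol": [11.2, 9.6, 12.8], "quality": [6, 5, 7]})

# Flag the origin of each row before merging, as in ToDo1.
red["is_red"] = 1
white["is_red"] = 0

# Default concat keeps the original labels, so they repeat.
kept = pd.concat([red, white])

# ignore_index=True renumbers the rows 0..n-1 instead.
renumbered = pd.concat([red, white], ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
print(renumbered["is_red"].value_counts().to_dict())
```

Repeated labels do not affect the split and training steps that follow, but a label-based lookup such as `kept.loc[0]` returns one row per source frame rather than a single row.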


## Split data

We want to split the data in the following proportions:
- 80% training
- 10% evaluation
- 10% test

In [4]:
# ToDo2: separate the target variable from the features
y = df[["quality"]]
X = df.drop("quality", axis=1, inplace=False)

X_train, X_test_val, y_train, y_test_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_test_val, y_test_val, test_size=0.5, random_state=42
)
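
The two calls above reach the 80/10/10 proportions in two stages: the first holds out 20% of the rows, and the second halves that held-out part. A minimal sketch on 1,000 made-up rows (only the row counts matter here) to check the resulting sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1,000 made-up rows; the values themselves are irrelevant.
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# First split: hold out 20% of the rows.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Second split: halve the held-out rows into validation and test.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42
)

sizes = {"train": len(X_train), "val": len(X_val), "test": len(X_test)}
print(sizes)
```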


## Transform data

Apply a preprocessing step that standardizes each feature by removing the mean and scaling to unit variance.
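
Standardization replaces each value `x` with `(x - mean) / std`, where the mean and standard deviation are learned from the training data and then reused for any other data. A minimal sketch on a made-up single feature:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single feature: three training values and one unseen validation value.
train = np.array([[1.0], [2.0], [3.0]])
val = np.array([[4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean=2 and std=sqrt(2/3)
val_scaled = scaler.transform(val)          # reuses the training statistics

print(scaler.mean_, scaler.scale_)
print(train_scaled.ravel(), val_scaled.ravel())
```

The scaled training column has mean 0 and unit variance. The validation value is scaled with the same statistics, so it is not centered on its own mean.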

In [5]:
preprocessing_pipeline = Pipeline(
    [
        (
            "ct",
            ColumnTransformer(
                [
                    (
                        "standard_scaler",
                        StandardScaler(),
                        X_train.columns,
                    ),
                ]
            ),
        )
    ]
)

# ToDo3: fit the preprocessing pipeline on the training data and transform the validation and test data
X_train_processed = preprocessing_pipeline.fit_transform(X_train)
X_val_processed = preprocessing_pipeline.transform(X_val)
X_test_processed = preprocessing_pipeline.transform(X_test)


## Train model

In [6]:
model = LinearRegression()


In [7]:
def log_metrics(
    model: RegressorMixin, X: pd.DataFrame, y: pd.Series, suffix: str = "test"
) -> dict:
    """Compute MAE, MSE and R2 on a dataset and log them to MLflow."""
    y_pred = model.predict(X)
    mae = mean_absolute_error(y, y_pred)
    mse = mean_squared_error(y, y_pred)
    r2 = r2_score(y, y_pred)
    metrics = {
        f"{suffix}.mean_absolute_error": mae,
        f"{suffix}.mean_squared_error": mse,
        f"{suffix}.r2_score": r2,
    }
    # ToDo5: log metrics to mlflow
    mlflow.log_metrics(metrics)
    return metrics
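
The three metrics logged above can be checked by hand on a small example. A minimal sketch with made-up targets and predictions, comparing the manual formulas with the scikit-learn functions used in `log_metrics`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions, small enough to check by hand.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

errors = y_true - y_pred                        # [1, 0, -2]
mae = np.abs(errors).mean()                     # (1 + 0 + 2) / 3
mse = (errors ** 2).mean()                      # (1 + 0 + 4) / 3
ss_res = (errors ** 2).sum()                    # 5
ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # 4 + 0 + 4 = 8
r2 = 1 - ss_res / ss_tot                        # 1 - 5/8

print(mae, mse, r2)
```

MAE and MSE are errors (lower is better), while R2 measures the share of the target's variance explained by the model (higher is better, with 1 as the maximum).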


In [8]:
def log_parameters(
    model: BaseEstimator,
) -> dict:
    """Log parameters of interest of the model"""
    model_params = model.get_params()

    # ToDo6: log parameters to mlflow
    mlflow.log_params(model_params)
    return model_params


In [9]:
mlflow.autolog(log_input_examples=True)
with mlflow.start_run() as run:
    model.fit(X_train_processed, y_train)
    # ToDo4: log the model and the preprocessing pipeline
    mlflow.sklearn.log_model(
        preprocessing_pipeline, artifact_path="preprocessing_pipeline"
    )
    # Note: autolog already logs the fitted model under the artifact path "model" when fit() is called;
    # the explicit call below stores an extra copy under "regressor"
    mlflow.sklearn.log_model(model, artifact_path="regressor")

    # ToDo5: log metrics to mlflow (see above)
    log_metrics(model, X_val_processed, y_val, suffix="val")
    log_metrics(model, X_test_processed, y_test, suffix="test")
    # ToDo6: log parameters to mlflow (see above)
    log_parameters(model)

    # Note: we store the run id to be able to retrieve the run later
    mlflow_run_id = run.info.run_id

mlflow.autolog(disable=True)


2023/05/31 19:36:13 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
2023/05/31 19:36:13 INFO mlflow.tracking.fluent: Autologging successfully enabled for pyspark.


In [10]:
print(
    "Please copy the command below in a new terminal on your IDE and let it run until the end of the notebook \n"
)

print("mlflow server \\")
print("    --backend-store-uri sqlite:///src/metadata/mlflow/mlruns.db \\")
print("    --default-artifact-root ./src/metadata/mlflow/mlartifacts \\")
print("    --host 0.0.0.0 \\")
print("    --port 5000")

# ToDo7: find your logged model in the MLflow UI, register it, and set its stage to production
# Note: open the MLflow UI at http://localhost:5000/ or http://0.0.0.0:5000/ in your browser


Please copy the command below in a new terminal on your IDE and let it run until the end of the notebook 

mlflow server \
    --backend-store-uri sqlite:///src/metadata/mlflow/mlruns.db \
    --default-artifact-root ./src/metadata/mlflow/mlartifacts \
    --host 0.0.0.0 \
    --port 5000


## Predict with trained model

### Predict on batch inference

In [11]:
# ToDo8: load the model from mlflow and make a prediction on the test data
print(f"Model path = runs:/{mlflow_run_id}/model/")
loaded_model = mlflow.sklearn.load_model(f"runs:/{mlflow_run_id}/model/")
predictions = pd.DataFrame(
    loaded_model.predict(X_test_processed), columns=["prediction"]
)
predictions["quality"] = y_test["quality"].values
predictions


Model path = runs:/6e928de1ec974bddaa1bd42ccaa78193/model/


Unnamed: 0,prediction,quality
0,5.945335,6
1,5.694131,6
2,6.072289,6
3,5.364762,6
4,5.822099,6
...,...,...
645,6.319968,6
646,5.185846,6
647,5.630085,5
648,6.627817,7
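
A linear regression returns continuous values while `quality` is an integer score, so the two columns above rarely match exactly. A minimal sketch, hard-coding the first five rows shown above, of one way to summarize the gap. The column names `abs_error` and `rounded` are illustrative choices, not part of the notebook.

```python
import pandas as pd

# First five rows of the prediction table above, copied by hand.
table = pd.DataFrame(
    {
        "prediction": [5.945335, 5.694131, 6.072289, 5.364762, 5.822099],
        "quality": [6, 6, 6, 6, 6],
    }
)

# Absolute gap between the predicted and the actual score.
table["abs_error"] = (table["prediction"] - table["quality"]).abs()
# Nearest integer score, to compare against the integer target.
table["rounded"] = table["prediction"].round().astype(int)

mean_abs_error = table["abs_error"].mean()
exact_match_rate = (table["rounded"] == table["quality"]).mean()
print(mean_abs_error, exact_match_rate)
```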


### Predict in real time

We can also use the MLflow model for real-time prediction. To do so we need to:
1. run an MLflow server to distribute the model (already done above)
2. create a serving endpoint that pulls the model from the MLflow server
3. query the model in real time using `curl`
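
The query in step 3 can also be sent from Python instead of `curl`. A minimal sketch using only the standard library and pandas: `build_payload` and `score` are hypothetical helper names, the rows are made up, and the response is assumed to be a JSON object with a `predictions` field, as returned by MLflow 2.x scoring servers.

```python
import json
import urllib.request

import pandas as pd


def build_payload(frame: pd.DataFrame) -> str:
    """Wrap the rows in the `dataframe_records` envelope expected by the endpoint."""
    records = json.loads(frame.to_json(orient="records"))
    return json.dumps({"dataframe_records": records})


def score(url: str, frame: pd.DataFrame) -> list:
    """POST the rows to a running scoring server and return its predictions."""
    request = urllib.request.Request(
        url,
        data=build_payload(frame).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        # Assumption: the body looks like {"predictions": [...]}.
        return json.loads(response.read())["predictions"]


# Made-up, already-scaled rows; a real call would send rows of X_test_processed.
rows = pd.DataFrame([[0.77, -0.85], [0.08, 0.02]])
payload = build_payload(rows)
print(payload)
# score("http://0.0.0.0:5001/invocations", rows)  # needs the endpoint from step 2
```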

In [12]:
print("Please copy the command below in a new terminal on your IDE \n")

print("MLFLOW_TRACKING_URI=http://0.0.0.0:5000 mlflow models serve \\")
print("      --host=0.0.0.0 \\")
print("      --port=5001 \\")
print("      --env-manager=local \\")
# ToDo9: set the model uri to the model you just registered
print(f"      --model-uri runs:/{mlflow_run_id}/model/")


Please copy the command below in a new terminal on your IDE 

MLFLOW_TRACKING_URI=http://0.0.0.0:5000 mlflow models serve \
      --host=0.0.0.0 \
      --port=5001 \
      --env-manager=local \
      --model-uri runs:/6e928de1ec974bddaa1bd42ccaa78193/model/


In [13]:
print("You can copy the command below on one of your terminal \n")

request_data = pd.DataFrame(X_test_processed).iloc[0:4].to_json(orient="records")
print(
    """curl http://0.0.0.0:5001/invocations -H 'Content-Type: application/json' -d '{"dataframe_records": """
    + request_data
    + """}'"""
)


You can copy the command below on one of your terminal 

curl http://0.0.0.0:5001/invocations -H 'Content-Type: application/json' -d '{"dataframe_records": [{"0":0.7738688092,"1":-0.8461749496,"2":0.2745893897,"3":0.5480123276,"4":-0.5979451988,"5":1.6879744754,"6":0.8423490573,"7":0.1692239584,"8":-1.0460083335,"9":-1.5541818503,"10":-0.154123837,"11":-0.5651286567},{"0":0.0751888016,"1":0.0171458612,"2":-0.140147658,"3":0.8190646823,"4":0.0337640161,"5":0.1969664016,"6":1.1095544349,"7":0.2822422616,"8":-0.4194907278,"9":-0.6123499532,"10":-0.0701790489,"11":-0.5651286567},{"0":1.8607043766,"1":0.5104720389,"2":0.2054665484,"3":-0.7029985403,"4":0.7803294518,"5":-0.7779234928,"6":-1.3843624235,"7":0.5082788679,"8":0.0817233567,"9":0.8676715995,"10":0.5174344676,"11":1.769508568},{"0":2.4041221603,"1":1.5587901663,"2":0.6893264375,"3":-0.5570472724,"4":0.9813278384,"5":0.254312866,"6":-0.7608832089,"7":1.4323696995,"8":0.3949821596,"9":1.1367664272,"10":-0.8256821416,"11":1.769508568}]}'

Congratulations! You made it!

If you still have some time, take a deep breath and try to help the people around you.

Or, if you like, try to improve on what you have already done and see what could be added.

## To Go Further

You can try to combine the transformer and the predictor together in the same sklearn pipeline. 
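
Combining both steps does not change the predictions. It only moves the scaling inside the model object. A minimal sketch on made-up regression data, comparing the two-step approach with a single pipeline (the step names `scaler` and `reg` are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up regression data: 50 rows, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
X_new = rng.normal(size=(5, 3))

# Two-step approach: scale first, then fit the regressor on the scaled data.
scaler = StandardScaler().fit(X)
reg = LinearRegression().fit(scaler.transform(X), y)
manual = reg.predict(scaler.transform(X_new))

# Single pipeline: the same two steps behind one fit/predict interface.
pipe = Pipeline([("scaler", StandardScaler()), ("reg", LinearRegression())])
pipe.fit(X, y)
combined = pipe.predict(X_new)

print(np.allclose(manual, combined))
```

Because the pipeline scales its own input, a model logged this way can be served with raw measurements rather than pre-scaled values.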

In [14]:
# ToDo10: [To Go Further] rebuild a model using sklearn pipeline, log it to mlflow and deploy a serving endpoint
setup_mlflow(
    experiment_name="wine_score_pipeline_notebook",
)
pipe = Pipeline(
    [
        (
            "ct",
            ColumnTransformer(
                [
                    (
                        "standard_scaler",
                        StandardScaler(),
                        X_train.columns,
                    ),
                ]
            ),
        ),
        ("reg", LinearRegression()),
    ]
)

mlflow.autolog(log_input_examples=True)
with mlflow.start_run() as run:
    pipe.fit(X_train, y_train)

    log_metrics(pipe, X_val, y_val, suffix="val")
    log_metrics(pipe, X_test, y_test, suffix="test")
    log_parameters(pipe)

    # Note: we store the run id to be able to retrieve the run later
    mlflow_run_id = run.info.run_id

mlflow.autolog(disable=True)


2023/05/31 19:36:26 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
2023/05/31 19:36:26 INFO mlflow.tracking.fluent: Autologging successfully enabled for pyspark.


In [15]:
print("MLFLOW_TRACKING_URI=http://0.0.0.0:5000 mlflow models serve \\")
print("      --host=0.0.0.0 \\")
print("      --port=5002 \\")
print("      --env-manager=local \\")
print(f"      --model-uri runs:/{mlflow_run_id}/model/")


MLFLOW_TRACKING_URI=http://0.0.0.0:5000 mlflow models serve \
      --host=0.0.0.0 \
      --port=5002 \
      --env-manager=local \
      --model-uri runs:/b8a58307b8ca45138e49474622f44ce4/model/


In [16]:
print("You can copy the command below on one of your terminal \n")

request_data = X_test.iloc[0:4].to_json(orient="records")
print(
    """curl http://0.0.0.0:5002/invocations -H 'Content-Type: application/json' -d '{"dataframe_records": """
    + request_data
    + """}'"""
)


You can copy the command below on one of your terminal 

curl http://0.0.0.0:5002/invocations -H 'Content-Type: application/json' -d '{"dataframe_records": [{"fixed acidity":8.2,"volatile acidity":0.2,"citric acid":0.36,"residual sugar":8.1,"chlorides":0.035,"free sulfur dioxide":60.0,"total sulfur dioxide":163.0,"density":0.9952,"pH":3.05,"sulphates":0.3,"alcohol":10.3,"is_red":0},{"fixed acidity":7.3,"volatile acidity":0.34,"citric acid":0.3,"residual sugar":9.4,"chlorides":0.057,"free sulfur dioxide":34.0,"total sulfur dioxide":178.0,"density":0.99554,"pH":3.15,"sulphates":0.44,"alcohol":10.4,"is_red":0},{"fixed acidity":9.6,"volatile acidity":0.42,"citric acid":0.35,"residual sugar":2.1,"chlorides":0.083,"free sulfur dioxide":17.0,"total sulfur dioxide":38.0,"density":0.99622,"pH":3.23,"sulphates":0.66,"alcohol":11.1,"is_red":1},{"fixed acidity":10.3,"volatile acidity":0.59,"citric acid":0.42,"residual sugar":2.8,"chlorides":0.09,"free sulfur dioxide":35.0,"total sulfur dioxide":73