# MLflow with recipe

In this second notebook we take the same example as in the first notebook, but use MLflow Recipes to achieve the same result.
We will go through the same steps as before, this time with the `mlflow.recipes` module.

As in the previous notebook, there are some tasks for you to complete. You can find them in the code by searching for `# ToDo#: ...`

In this notebook, you will be asked to:
* ToDo1: Add a column to indicate if the wine is red or white
* ToDo2: specify split ratios for train, validation, and test sets
* ToDo3: Create a Pipeline object that transforms the features
* ToDo4: Create a LinearRegression estimator with the estimator_params
* ToDo5: add custom metrics to our recipe
* ToDo6: look in the UI at what the recipe logged by default. What was added compared with the last notebook?
* ToDo7: change the model uri to the one from current run
* ToDo8: query the model with some test data
* ToDo9: [To Go Further] use the AutoML estimator instead and use the UI to compare the results
* ToDo10: [To Go Further] use Databricks MLflow instead of the local MLflow server

If you need help you can browse through the following documentation:
* [MLflow](https://mlflow.org/docs/latest/index.html), in particular the [recipe module](https://mlflow.org/docs/latest/recipes.html)
* [MLflow recipe template](https://github.com/mlflow/recipes-regression-template)
* [MLflow recipe example](https://github.com/mlflow/recipes-examples)

In [None]:
from mlflow.recipes import Recipe
import os


In [None]:
# Note: please change the directory if you are not using a dev container.
# The working directory should be the src folder of the mlflow-training repo
os.chdir("/workspaces/mlflow-training/src")


In [None]:
r = Recipe(profile="local")


In [None]:
r.clean()


In [None]:
# For some reason, you might have to run this cell twice before it works
r.inspect()


## Ingest data

In [None]:
!cat steps/ingest.py

In [None]:
r.run("ingest")


## Split data

We want to split the data to have the following proportion:
- 80% training
- 10% validation
- 10% test
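
To make the proportions concrete, here is a minimal sketch of an 80/10/10 split in plain Python. It is only an illustration of the ratios, not how the recipe's split step is implemented; the helper name `split_rows` and the fixed seed are assumptions made for this example.

```python
import random


def split_rows(rows, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle rows and cut them into train, validation, and test lists.

    Hypothetical helper for illustration only; the recipe's own split step
    is configured in recipe.yaml.
    """
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("split ratios must sum to 1")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    # The test set takes whatever is left, so no row is dropped by rounding
    test = shuffled[n_train + n_val:]
    return train, validation, test


train, validation, test = split_rows(range(1000))
print(len(train), len(validation), len(test))
```

The three ratios are given in the order train, validation, test, and they should sum to 1.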

In [None]:
!cat recipe.yaml

In [None]:
r.run("split")


## Transform data

In [None]:
!cat steps/transform.py

In [None]:
r.run("transform")


## Train model

In [None]:
!cat steps/train.py

In [None]:
!cat recipe.yaml

In [None]:
r.run("train")


In [None]:
r.run("evaluate")

In [None]:
r.run("register")


In [None]:
print("If you shut down the MLflow server from notebook 01,")
print(
    "please copy the command below into a new terminal in your IDE and let it run until the end of the notebook \n"
)

print("mlflow server \\")
print("    --backend-store-uri sqlite:///src/metadata/mlflow/mlruns.db \\")
print("    --default-artifact-root ./src/metadata/mlflow/mlartifacts \\")
print("    --host 0.0.0.0 \\")
print("    --port 5000")

# ToDo6: look in the UI at what the recipe logged by default. What was added compared with the last notebook?


## Predict with trained model

### Predict on batch inference

In [None]:
# Note: this takes around 5 minutes to run.
# Only run it locally: in a Codespace it will break your environment
if "GITHUB_CODESPACE_TOKEN" not in os.environ:
    r.run("predict")


### Predict in real time

We can also use the MLflow model to make predictions in real time. To do so, we need to:
1. run an MLflow server to distribute the model (as in notebook 01)
2. create a serving endpoint that pulls the model from the MLflow server
3. query the model in real time using `curl`
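
The last step can also be done from Python instead of `curl`. The sketch below only builds the HTTP request, using the same URL, header, and `dataframe_records` payload shape as the `curl` command further down. The helper name `build_invocation_request` and the column names in `sample_records` are hypothetical; in the notebook the records would come from the recipe's test data.

```python
import json
import urllib.request


def build_invocation_request(records, url="http://0.0.0.0:5011/invocations"):
    """Build a POST request for the serving endpoint (hypothetical helper)."""
    # The records are wrapped in the "dataframe_records" key, as in the curl call
    payload = json.dumps({"dataframe_records": records}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical rows; the real column names depend on the ingested dataset
sample_records = [{"fixed_acidity": 7.0, "is_red": 1}]
request = build_invocation_request(sample_records)
print(request.get_method(), request.full_url)
```

Once the serving endpoint is running, the request could be sent with `urllib.request.urlopen(request)` and the response body parsed with `json.load`.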

In [None]:
print("Please copy the command below into a new terminal in your IDE \n")

print("MLFLOW_TRACKING_URI=http://0.0.0.0:5000 mlflow models serve \\")
print("      --host=0.0.0.0 \\")
print("      --port=5011 \\")
print("      --env-manager=local \\")
# ToDo7: change the model uri to the one from current run
print(f"      --model-uri ...")


In [None]:
# ToDo8: query the model with some test data
test_data = r.get_artifact("test_data")
request_data = test_data.iloc[0:4].to_json(orient="records")
print("You can copy the command below into one of your terminals \n")
print(
    """curl http://0.0.0.0:5011/invocations -H 'Content-Type: application/json' -d '{"dataframe_records": """
    + request_data
    + """}'"""
)


## To Go Further

You can try using `flaml` to find one of the best models.

In [None]:
# ToDo9: [To Go Further] use the AutoML estimator instead and use the UI to compare the results
!cat recipe.yaml

In [None]:
...


In [None]:
# ToDo10: [To Go Further] use Databricks MLflow instead of the local MLflow server
# Note: you will need a Databricks community account (free)
# See ANNEXE.md for more details
