# MLflow tracking
### The MLflow Tracking component is an API and UI for logging parameters, code versions, metrics, and output files when running your machine learning code and for later visualizing the results. 

### Let's log some stuff

In [1]:
import mlflow

In [2]:
mlflow.log_param("param1", "This is a param")
mlflow.log_metric("ROC AUC", 0.75)
mlflow.log_metric("ROC AUC", 0.8)
mlflow.log_metric("ROC AUC", 0.88)
with open("artifact.txt", mode="w") as f:
    f.write("This is an artifact file")
mlflow.log_artifact("artifact.txt")

### You can start a development MLflow UI server with the `mlflow ui` shell command

### MLflow Tracking is organized around the concept of *runs*, which are executions of some piece of data science code. Each run records the following information:
* Code Version
* Start & End Time
* Source
* Parameters
* Metrics
* Artifacts

### A *run* is started automatically as soon as you start logging

In [3]:
mlflow.log_param("param2", "This is in the same run as param1")

### You have to end the current run explicitly or use a context manager

In [4]:
mlflow.active_run().info.run_id

'a8778c2310454cbd80eb0b438e7e9914'

In [5]:
mlflow.end_run()

In [6]:
mlflow.active_run().info.run_id

AttributeError: 'NoneType' object has no attribute 'info'

In [7]:
with mlflow.start_run():
    mlflow.log_metrics({"ROC AUC": 0.7})
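The reason the context manager is the safer option can be shown with a toy model. The sketch below does not use MLflow at all: `ToyTracker` and its methods are hypothetical stand-ins that only imitate the idea of one active run which is closed when the `with` block exits, even if the body raises.

```python
from contextlib import contextmanager


class ToyTracker:
    """Hypothetical stand-in for a tracking client; not part of MLflow."""

    def __init__(self):
        self.active_run = None  # name of the run that is currently open, if any
        self.finished = []      # (run name, status) pairs for closed runs

    def end_run(self, status="FINISHED"):
        # Closing is a no-op when nothing is open.
        if self.active_run is not None:
            self.finished.append((self.active_run, status))
            self.active_run = None

    @contextmanager
    def start_run(self, name):
        self.active_run = name
        try:
            yield name
        except Exception:
            # Close the run with a failure status, then let the error propagate.
            self.end_run(status="FAILED")
            raise
        else:
            self.end_run(status="FINISHED")


tracker = ToyTracker()

with tracker.start_run("ok-run"):
    pass  # logging calls would go here

try:
    with tracker.start_run("broken-run"):
        raise ValueError("training crashed")
except ValueError:
    pass

print(tracker.active_run)
print(tracker.finished)
```

In both cases no run is left open afterwards, which is the property you would otherwise have to guarantee yourself with an explicit `end_run()` call.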

### You can group multiple runs as an *experiment*

In [8]:
experiment_id = mlflow.create_experiment("My first experiment")

In [9]:
experiment_id

'1'

In [10]:
with mlflow.start_run(experiment_id=experiment_id):
    mlflow.log_param("param", "param-pam-pam")

### If you don't set an experiment ID, the run falls back to the "Default" experiment

In [11]:
with mlflow.start_run():
    mlflow.log_metric("PR AUC", 1)

### You can also name your runs

In [12]:
with mlflow.start_run(experiment_id=experiment_id, run_name="Run with default hyperparameters"):
    mlflow.log_param("alpha", 0.01)
    mlflow.log_metric("PR AUC", 1)

### You can communicate with the MLflow server via `MlflowClient`

In [13]:
client = mlflow.tracking.MlflowClient()

In [14]:
client

<mlflow.tracking.client.MlflowClient at 0x7efdb888bd90>

In [15]:
experiment = client.get_experiment_by_name("My first experiment")

In [16]:
experiment

<Experiment: artifact_location='file:///home/users/vova-cmc/ozon-masters-bigdata/lectures/lect8%20-%20MlFlow/mlruns/1', experiment_id='1', lifecycle_stage='active', name='My first experiment', tags={}>

In [17]:
client.search_runs(experiment_ids=[experiment.experiment_id], filter_string="metrics.`PR AUC` > 0.9")

[<Run: data=<RunData: metrics={'PR AUC': 1.0}, params={'alpha': '0.01'}, tags={'mlflow.runName': 'Run with default hyperparameters',
  'mlflow.source.name': '/opt/conda/envs/dsenv/lib/python3.7/site-packages/ipykernel_launcher.py',
  'mlflow.source.type': 'LOCAL',
  'mlflow.user': 'vova-cmc'}>, info=<RunInfo: artifact_uri='file:///home/users/vova-cmc/ozon-masters-bigdata/lectures/lect8%20-%20MlFlow/mlruns/1/41f48d5cffef4a44bbcdd249354fd939/artifacts', end_time=1620288486216, experiment_id='1', lifecycle_stage='active', run_id='41f48d5cffef4a44bbcdd249354fd939', run_uuid='41f48d5cffef4a44bbcdd249354fd939', start_time=1620288486206, status='FINISHED', user_id='vova-cmc'>>]

### [More on search syntax](https://www.mlflow.org/docs/latest/search-syntax.html)
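Filter strings like the one above are plain text, so they can be assembled programmatically. A minimal sketch follows, assuming only the parts of the search syntax used in this notebook plus `AND`: `metric_filter` and `all_of` are hypothetical helpers written for this example, not MLflow functions.

```python
def metric_filter(name, op, value):
    """Build one filter clause for a metric (hypothetical helper)."""
    allowed = {">", ">=", "<", "<=", "=", "!="}
    if op not in allowed:
        raise ValueError(f"unsupported comparator: {op!r}")
    # Backticks around the key allow names that contain spaces, such as "PR AUC".
    return f"metrics.`{name}` {op} {value}"


def all_of(*clauses):
    """Join clauses so that a run has to satisfy every one of them."""
    return " AND ".join(clauses)


# Parameter values are stored as strings, so they are compared as quoted strings.
filter_string = all_of(
    metric_filter("PR AUC", ">", 0.9),
    "params.alpha = '0.01'",
)
print(filter_string)
```

The resulting string can be passed as the `filter_string` argument of `client.search_runs`.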

### The MLflow tracking server has two major components:
* backend store
* artifact store

### The backend store is where the MLflow tracking server stores experiment and run metadata, as well as params, metrics, and tags for runs. It is either a file store or a SQLAlchemy-compatible database. By default, the backend is file-based

In [18]:
EXPERIMENT_ID = "0"

In [19]:
!ls mlruns/$EXPERIMENT_ID

6f308dd1ac944358a07f942e4ea09a82  cb80e95af7e446ffa2e672138d094e75
a8778c2310454cbd80eb0b438e7e9914  meta.yaml


In [20]:
RUN_ID = client.search_runs(experiment_ids=[EXPERIMENT_ID])[-1].info.run_id

In [21]:
!ls mlruns/$EXPERIMENT_ID/$RUN_ID

artifacts  meta.yaml  metrics  params  tags
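Because the default backend is just a directory tree, a run can be inspected with nothing but the standard library. The sketch below rests on an assumption about the file layout: one file per parameter under `params` holding the value, and one file per metric under `metrics` with a `timestamp value step` line per logged value. It builds a small toy directory with made-up timestamps instead of reading a real `mlruns` folder, and `read_run_dir` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def read_run_dir(run_dir):
    """Read params and metric histories from a file-store run directory."""
    run_dir = Path(run_dir)
    params = {
        p.name: p.read_text().strip()
        for p in sorted((run_dir / "params").iterdir())
    }
    metrics = {}
    for metric_file in sorted((run_dir / "metrics").iterdir()):
        history = []
        for line in metric_file.read_text().splitlines():
            parts = line.split()
            # Assumed line format: "<timestamp_ms> <value> [<step>]"
            step = int(parts[2]) if len(parts) > 2 else 0
            history.append((int(parts[0]), float(parts[1]), step))
        metrics[metric_file.name] = history
    return params, metrics


with tempfile.TemporaryDirectory() as tmp:
    # Toy directory that mimics mlruns/<experiment_id>/<run_id>
    run_dir = Path(tmp) / "mlruns" / "0" / "toy_run"
    (run_dir / "params").mkdir(parents=True)
    (run_dir / "metrics").mkdir()
    (run_dir / "params" / "param1").write_text("This is a param")
    (run_dir / "metrics" / "ROC AUC").write_text(
        "1620287704818 0.75 0\n"
        "1620287704900 0.8 0\n"
        "1620287705000 0.88 0\n"
    )
    params, metrics = read_run_dir(run_dir)

print(params)
print([value for _, value, _ in metrics["ROC AUC"]])
```

Reading the files directly is only meant to show what the store contains. In practice `MlflowClient` is the supported way to query runs.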


### The artifact store is a location suitable for large data (such as an S3 bucket or shared NFS file system) and is where clients log their artifact output (for example, models).

In [22]:
!ls mlruns/$EXPERIMENT_ID/$RUN_ID/artifacts

artifact.txt


In [23]:
!cat mlruns/$EXPERIMENT_ID/$RUN_ID/meta.yaml

artifact_uri: file:///home/users/vova-cmc/ozon-masters-bigdata/lectures/lect8%20-%20MlFlow/mlruns/0/a8778c2310454cbd80eb0b438e7e9914/artifacts
end_time: 1620288215007
entry_point_name: ''
experiment_id: '0'
lifecycle_stage: active
name: ''
run_id: a8778c2310454cbd80eb0b438e7e9914
run_uuid: a8778c2310454cbd80eb0b438e7e9914
source_name: ''
source_type: 4
source_version: ''
start_time: 1620287704818
status: 3
tags: []
user_id: vova-cmc
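The `start_time` and `end_time` fields are easier to read once converted. The sketch below takes the two values from the `meta.yaml` listing above and treats them as milliseconds since the Unix epoch; `from_millis` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_millis(ms):
    """Convert milliseconds since the Unix epoch to an aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


# Values copied from the meta.yaml listing
start_time = 1620287704818
end_time = 1620288215007

started = from_millis(start_time)
ended = from_millis(end_time)
duration = ended - started

print(started.isoformat())
print(ended.isoformat())
print(duration.total_seconds())  # 510.189 seconds, about 8.5 minutes
```

The run stayed open that long because it was started implicitly by the first logging call and only closed by the later `mlflow.end_run()`.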


### A separate artifact store is useful for data scientists who share large datasets, so the datasets don't need to be rebuilt from scratch

# MLflow models
### An MLflow Model is a standard format for packaging machine learning models that can be used in a variety of downstream tools—for example, real-time serving through a REST API or batch inference on Apache Spark

In [24]:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

In [27]:
X, y = make_classification()

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [26]:
estimator = RandomForestClassifier()
estimator.fit(X_train, y_train)

RandomForestClassifier()

In [27]:
estimator.score(X_test, y_test)

0.8

In [28]:
import mlflow.sklearn

In [29]:
with mlflow.start_run():
    mlflow.sklearn.log_model(estimator, artifact_path="models")

### `log_model` saves the trained model in a special format, but it doesn't track the model's hyperparameters. How can this be resolved?

In [30]:
with mlflow.start_run():
    estimator = RandomForestClassifier()
    mlflow.log_params(estimator.get_params())
    estimator.fit(X_train, y_train)
    accuracy = estimator.score(X_test, y_test)
    mlflow.log_metric("Accuracy", accuracy)
    mlflow.sklearn.log_model(estimator, artifact_path="models")

### Some model flavors implement [automatic logging](https://www.mlflow.org/docs/latest/tracking.html#automatic-logging)

In [31]:
import xgboost
import mlflow.xgboost

In [32]:
param = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic'}
num_round = 3

In [33]:
mlflow.xgboost.autolog()

In [34]:
dtrain = xgboost.DMatrix(data=X_train, label=y_train)

In [35]:
dtest = xgboost.DMatrix(data=X_test, label=y_test)

In [36]:
with mlflow.start_run():
    bst = xgboost.train(param, dtrain, num_round)



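### Metrics are logged automatically at each iteration when `evals` is specified. If early stopping is enabled, the metrics at the best iteration are logged as an extra step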

In [37]:
param["eval_metric"] = "auc"

In [38]:
with mlflow.start_run():
    bst = xgboost.train(param, dtrain, num_round, evals=[(dtest, 'eval')], early_stopping_rounds=10)

[0]	eval-auc:0.79779
[1]	eval-auc:0.81250
[2]	eval-auc:0.84559


### This becomes especially handy for tuning hyperparameters

In [41]:
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from sklearn.metrics import roc_auc_score

ModuleNotFoundError: No module named 'hyperopt'

In [40]:
!pip install hyperopt

Collecting hyperopt
  Downloading hyperopt-0.2.5-py2.py3-none-any.whl (965 kB)
Collecting networkx>=2.2
  Downloading networkx-2.5.1-py3-none-any.whl (1.6 MB)
Collecting cloudpickle
  Downloading cloudpickle-1.6.0-py3-none-any.whl (23 kB)
Collecting future
  Downloading future-0.18.2.tar.gz (829 kB)
Collecting numpy
  Using cached numpy-1.19.5-cp36-cp36m-manylinux2010_x86_64.whl (14.8 MB)
Collecting scipy
  Downloading scipy-1.5.4-cp36-cp36m-manylinux1_x86_64.whl (25.9 MB)
Collecting tqdm
  Downloading tqdm-4.60.0-py2.py3-none-any.whl (75 kB)
Building wheels for collected packages: future
  Building wheel for future

In [42]:
mlflow.set_experiment("XGboost hyperparameters")
mlflow.xgboost.autolog()

INFO: 'XGboost hyperparameters' does not exist. Creating a new experiment


In [43]:
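def score(params):
    """Train one XGBoost model inside its own MLflow run and return the hyperopt loss."""
    with mlflow.start_run():
        # hyperopt samples n_estimators as a float; xgboost needs an integer round count
        num_round = int(params.pop("n_estimators"))
        watchlist = [(dtest, 'eval'), (dtrain, 'train')]
        gbm_model = xgboost.train(params, dtrain, num_round, evals=watchlist, verbose_eval=True)
        predictions = gbm_model.predict(dtest,
                                        ntree_limit=gbm_model.best_iteration + 1)
        auc = roc_auc_score(y_test, predictions)
        # fmin minimizes its objective, so the AUC is turned into a loss
        loss = 1 - auc
    return {'loss': loss, 'status': STATUS_OK}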

In [44]:
def optimize(random_state=5757):
    space = {
        'n_estimators': hp.quniform('n_estimators', 10, 20, 1),
        'eta': hp.quniform('eta', 0.025, 0.5, 0.025),
        'eval_metric': 'auc',
        'objective': 'binary:logistic',
        'seed': random_state
    }
    
    best = fmin(score, space, algo=tpe.suggest, max_evals=5)
    return best
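To see what this search space describes without installing hyperopt, here is a rough standard-library sketch. It swaps in plain random search for TPE, and it is not hyperopt's implementation: `quniform` imitates the documented behaviour of `hp.quniform` (a uniform draw rounded to a multiple of `q`), while `random_search` and `toy_objective` are hypothetical names with a made-up loss in place of `1 - ROC AUC`.

```python
import random


def quniform(rng, low, high, q):
    """Draw uniformly from [low, high] and round to the nearest multiple of q."""
    return round(rng.uniform(low, high) / q) * q


def random_search(objective, n_evals, seed=5757):
    """Evaluate n_evals random points and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_evals):
        params = {
            "n_estimators": int(quniform(rng, 10, 20, 1)),
            "eta": quniform(rng, 0.025, 0.5, 0.025),
        }
        loss = objective(params)
        if best is None or loss < best[0]:
            best = (loss, params)
    return best


def toy_objective(params):
    # Made-up loss surface standing in for 1 - ROC AUC of a trained model
    return abs(params["eta"] - 0.3) + abs(params["n_estimators"] - 15) / 100


best_loss, best_params = random_search(toy_objective, n_evals=50)
print(best_loss, best_params)
```

TPE differs from this in that it uses the losses seen so far to decide where to sample next, which is why every evaluation reports a `loss` back to `fmin`.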

In [45]:
optimize()

NameError: name 'hp' is not defined

### OK let's return to model logging

In [46]:
EXPERIMENT_ID = client.get_experiment_by_name("XGboost hyperparameters").experiment_id

In [47]:
EXPERIMENT_ID

'2'

In [48]:
RUN_ID = client.search_runs(EXPERIMENT_ID, order_by=["attribute.start_time"])[-1].info.run_id

IndexError: list index out of range

In [49]:
RUN_ID

'a8778c2310454cbd80eb0b438e7e9914'

In [50]:
!cat mlruns/$EXPERIMENT_ID/$RUN_ID/artifacts/model/MLmodel

cat: mlruns/2/a8778c2310454cbd80eb0b438e7e9914/artifacts/model/MLmodel: No such file or directory


In [51]:
!cat mlruns/$EXPERIMENT_ID/$RUN_ID/artifacts/model/conda.yaml

cat: mlruns/2/a8778c2310454cbd80eb0b438e7e9914/artifacts/model/conda.yaml: No such file or directory


### *Flavors* are the key concept that makes MLflow Models powerful: they are a convention that deployment tools can use to understand the model, which makes it possible to write tools that work with models from any ML library without having to integrate each tool with each library. 
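The idea of a flavor as a convention can be sketched without MLflow. Everything in the example below is hypothetical: two toy "libraries" expose different method names, and a small registry keyed by flavor name adapts both to one `predict` interface, so a downstream tool only needs to know that interface.

```python
class ThresholdModel:
    """Toy model from 'library A', which calls its inference method classify."""

    def __init__(self, threshold):
        self.threshold = threshold

    def classify(self, rows):
        return [int(sum(row) > self.threshold) for row in rows]


class MeanModel:
    """Toy model from 'library B', which calls its inference method run."""

    def run(self, rows):
        return [sum(row) / len(row) for row in rows]


# Registry: flavor name -> function that turns a native model into a predict callable
FLAVOR_ADAPTERS = {
    "threshold": lambda model: model.classify,
    "mean": lambda model: model.run,
}


class GenericModel:
    """Uniform wrapper that tools can use without knowing the source library."""

    def __init__(self, flavor, model):
        self._predict = FLAVOR_ADAPTERS[flavor](model)

    def predict(self, rows):
        return self._predict(rows)


rows = [[1.0, 2.0], [0.0, 0.5]]
models = [
    GenericModel("threshold", ThresholdModel(threshold=1.0)),
    GenericModel("mean", MeanModel()),
]
outputs = [model.predict(rows) for model in models]
print(outputs)
```

Supporting a new library in this toy setup means adding one registry entry, and none of the code that calls `predict` has to change.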

In [53]:
pymodel = mlflow.pyfunc.load_model(f"mlruns/{EXPERIMENT_ID}/{RUN_ID}/artifacts/model/")

In [54]:
type(pymodel)

mlflow.pyfunc.PyFuncModel

In [55]:
X_test

array([[ 8.97011787e-01, -4.37778410e-01, -9.90481805e-01,
         6.35516781e-01, -6.50844600e-01,  2.89035649e-01,
         2.94370350e-01,  7.46032090e-01, -3.13528302e+00,
        -1.92101562e+00,  2.18396340e-01, -3.06256217e-01,
         2.00263161e+00, -2.89424733e+00,  1.59550142e-01,
        -9.83212996e-01, -5.43435411e-01, -1.83512075e+00,
        -9.72914428e-01,  1.43375706e-01],
       [ 4.84418406e-01,  6.59529717e-01,  1.71458779e+00,
         1.69649287e+00, -3.24274390e-01,  1.79945298e-01,
        -3.35645787e-01,  2.38402533e-01, -1.78089070e+00,
        -1.38335437e+00,  1.47016258e+00,  8.13139988e-01,
        -7.62926286e-01, -1.49500396e+00, -4.86484747e-01,
        -1.07540269e+00, -2.02313826e+00, -8.31772732e-01,
         4.20055411e-01, -6.03500912e-01],
       [-7.10188645e-01,  6.34483126e-01,  1.55347620e+00,
        -8.51243987e-01,  1.87924811e+00,  3.29910554e-01,
        -5.31020943e-01,  1.74786445e-01, -1.42550691e+00,
        -8.13951026e-01, -7.6

In [56]:
pymodel.predict(X_test)

array([0.9249707 , 0.03110778, 0.967145  , 0.25543132, 0.05689516,
       0.8208202 , 0.8024059 , 0.9799483 , 0.8527327 , 0.26427805,
       0.95896983, 0.50783724, 0.95464176, 0.5780617 , 0.9454438 ,
       0.07998231, 0.7849171 , 0.72404534, 0.92745286, 0.04240661,
       0.9402491 , 0.8283368 , 0.9124435 , 0.01633013, 0.01788557],
      dtype=float32)
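The values returned here are scores between 0 and 1, not class labels, which matches a booster trained with the `binary:logistic` objective. A minimal sketch of turning them into labels follows, using the first five values from the output above. The 0.5 cut-off and the `to_labels` helper are assumptions made for this example.

```python
# First five predicted probabilities copied from the output above
probabilities = [0.9249707, 0.03110778, 0.967145, 0.25543132, 0.05689516]


def to_labels(probs, threshold=0.5):
    """Map positive-class probabilities to 0/1 labels with a fixed cut-off."""
    return [int(p >= threshold) for p in probs]


labels = to_labels(probabilities)
print(labels)  # [1, 0, 1, 0, 0]
```

A different threshold trades precision against recall, so 0.5 is only a starting point.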

### What is the `pyfunc` flavor anyway? See the [MLmodel configuration](https://www.mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#mlmodel-configuration) section of the `mlflow.pyfunc` docs

### You can also run inference with your models as a service or as a Spark UDF. Let's switch to a more realistic example for illustration

In [52]:
from sklearn.datasets import load_iris
import pandas as pd

In [53]:
data = load_iris(as_frame=True)

In [54]:
pdf = data["frame"]
target = pdf.pop("target")

In [55]:
pdf.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [56]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

In [57]:
pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression())
])

In [58]:
pdf_train, pdf_test, target_train, target_test = train_test_split(pdf, target)

In [59]:
mlflow.set_experiment("Iris with sklearn")
mlflow.sklearn.autolog()

INFO: 'Iris with sklearn' does not exist. Creating a new experiment


In [60]:
with mlflow.start_run(run_name="The run I need"):
    pdf_train.to_pickle("dataset_train.pickle")
    mlflow.log_artifact("dataset_train.pickle")
    pipeline.fit(pdf_train, target_train)

In [61]:
run = client.search_runs(experiment_ids=client.get_experiment_by_name("Iris with sklearn").experiment_id,
                         filter_string="tags.`mlflow.runName` = 'The run I need'")

In [62]:
run[0].info.artifact_uri

'file:///home/users/vova-cmc/ozon-masters-bigdata/lectures/lect8%20-%20MlFlow/mlruns/3/be0f6a0696dc4ecab735acfd25c24790/artifacts'

In [63]:
skmodel = mlflow.pyfunc.load_model(f"{run[0].info.artifact_uri}/model")

In [64]:
skmodel

mlflow.pyfunc.loaded_model:
  artifact_path: model
  flavor: mlflow.sklearn
  run_id: be0f6a0696dc4ecab735acfd25c24790

In [65]:
skmodel.predict(pdf_test)

array([1, 0, 2, 2, 2, 0, 2, 2, 0, 0, 2, 1, 2, 0, 1, 1, 1, 0, 1, 1, 0, 0,
       1, 0, 1, 1, 1, 0, 2, 1, 0, 0, 2, 2, 1, 0, 0, 1])

### Do the same with a Spark UDF

In [66]:
import os
import sys

SPARK_HOME = "/usr/hdp/current/spark2-client"
PYSPARK_PYTHON = "/opt/conda/envs/dsenv/bin/python"
os.environ["PYSPARK_PYTHON"]= PYSPARK_PYTHON
os.environ["SPARK_HOME"] = SPARK_HOME

PYSPARK_HOME = os.path.join(SPARK_HOME, "python/lib")
sys.path.insert(0, os.path.join(PYSPARK_HOME, "py4j-0.10.7-src.zip"))
sys.path.insert(0, os.path.join(PYSPARK_HOME, "pyspark.zip"))

In [67]:
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set("spark.driver.memory", "4g")
conf.set("spark.driver.extraJavaOptions", "-Dio.netty.tryReflectionSetAccessible=true")

spark = SparkSession.builder.config(conf=conf).appName("MLflow model inference with Spark").getOrCreate()

In [68]:
spark

In [69]:
spark_udf = mlflow.pyfunc.spark_udf(spark, model_uri=f"{run[0].info.artifact_uri}/model")

In [70]:
spark_udf

<function mlflow.pyfunc.spark_udf.<locals>.predict(*args)>

In [71]:
spark_df = spark.createDataFrame(pdf_test)

In [72]:
spark_df.printSchema()

root
 |-- sepal length (cm): double (nullable = true)
 |-- sepal width (cm): double (nullable = true)
 |-- petal length (cm): double (nullable = true)
 |-- petal width (cm): double (nullable = true)



In [73]:
spark_df.withColumn("prediction", spark_udf(*spark_df.schema.fieldNames())).show(10)

+-----------------+----------------+-----------------+----------------+----------+
|sepal length (cm)|sepal width (cm)|petal length (cm)|petal width (cm)|prediction|
+-----------------+----------------+-----------------+----------------+----------+
|              5.8|             2.7|              4.1|             1.0|       1.0|
|              5.4|             3.9|              1.3|             0.4|       0.0|
|              7.6|             3.0|              6.6|             2.1|       2.0|
|              5.8|             2.7|              5.1|             1.9|       2.0|
|              7.2|             3.0|              5.8|             1.6|       2.0|
|              4.8|             3.4|              1.6|             0.2|       0.0|
|              6.1|             3.0|              4.9|             1.8|       2.0|
|              6.5|             3.0|              5.8|             2.2|       2.0|
|              5.5|             4.2|              1.4|             0.2|       0.0|
|   

In [74]:
spark.stop()

# MLflow projects
### An MLflow Project is a format for packaging data science code in a reusable and reproducible way, based primarily on conventions. In addition, the Projects component includes an API and command-line tools for running projects, making it possible to chain together projects into workflows.

https://www.mlflow.org/docs/latest/projects.html