In [1]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import mlflow

In [2]:
# example dataset
dataset = datasets.load_iris()

In [3]:
# train/test split
X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.4)

In [4]:
X_train.shape, y_train.shape

((90, 4), (90,))

In [5]:
X_test.shape, y_test.shape

((60, 4), (60,))

## ML engineering example

In [6]:
# set model
model = LogisticRegression()

In [7]:
# train model
model.fit(X_train, y_train)

## In MLflow
- `mlflow.autolog()` automatically logs the run (parameters, metrics, and model artifacts) to the local directory (`./mlruns/`)
- `./mlruns/` structure
    ```bash
    mlruns/
    └── 0 # experiment ID (0 is the default experiment)
        ├── e80db17dc13d4b898a18e8b55c62f835 # run ID
        │   ├── artifacts
        │   │   └── model
        │   │       ├── conda.yaml # packaging dependencies
        │   │       ├── MLmodel
        │   │       ├── model.pkl # serialized version of the model
        │   │       ├── python_env.yaml
        │   │       └── requirements.txt
        │   ├── meta.yaml # stores basic metadata of the run (e.g. start_time, end_time, run_id, ...)
        │   ├── metrics # contains training score values
        │   │   ├── training_accuracy_score
        │   │   ├── training_f1_score
        │   │   ├── training_log_loss
        │   │   ├── training_precision_score
        │   │   ├── training_recall_score
        │   │   ├── training_roc_auc_score
        │   │   └── training_score
        │   ├── params # contains the model's parameters (default values in this run)
        │   │   ├── C
        │   │   ├── class_weight
        │   │   ├── dual
        │   │   ├── fit_intercept
        │   │   ├── intercept_scaling
        │   │   ├── l1_ratio
        │   │   ├── max_iter
        │   │   ├── multi_class
        │   │   ├── n_jobs
        │   │   ├── penalty
        │   │   ├── random_state
        │   │   ├── solver
        │   │   ├── tol
        │   │   ├── verbose
        │   │   └── warm_start
        │   └── tags
        │       ├── estimator_class
        │       ├── estimator_name
        │       ├── mlflow.log-model.history
        │       ├── mlflow.source.name
        │       ├── mlflow.source.type
        │       └── mlflow.user
        └── meta.yaml # metadata of the experiment itself
    ```
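
The layout above can also be read back without MLflow itself. The following is a minimal sketch using only the standard library. It assumes the local file store keeps one file per parameter (holding the raw value) and one file per metric (one `<timestamp> <value> <step>` line per logged value); `read_run` is a hypothetical helper name, not part of MLflow.

```python
from pathlib import Path


def read_run(run_dir):
    """Collect params and the latest metric values from one directory under mlruns/<experiment>/."""
    run_dir = Path(run_dir)

    # Assumption: each file under params/ holds the raw value of one parameter.
    params = {
        p.name: p.read_text().strip()
        for p in sorted((run_dir / "params").iterdir())
    }

    metrics = {}
    for metric_file in sorted((run_dir / "metrics").iterdir()):
        # Assumption: each non-empty line is "<timestamp> <value> [<step>]".
        lines = [ln for ln in metric_file.read_text().splitlines() if ln.strip()]
        if not lines:
            continue
        # Keep only the most recently written value.
        metrics[metric_file.name] = float(lines[-1].split()[1])

    return params, metrics


# Usage with the directory from the tree above (the path exists only after a run):
# params, metrics = read_run("mlruns/0/e80db17dc13d4b898a18e8b55c62f835")
```

Parameter values come back as strings because that is how they are stored on disk; metric values are parsed to floats.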

In [8]:
mlflow.autolog()

2022/09/13 14:10:16 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.


In [9]:
with mlflow.start_run():
    model = LogisticRegression()
    model.fit(X_train, y_train)



## Exploring MLflow components
* MLflow Tracking: Provides an API and UI to record and browse the metrics and artifacts generated by ML executions (training and inference)
* MLflow Projects: A package format to standardize ML projects
* MLflow Models: A standard format for packaging models so that they can be deployed to different types of environments, both on-premises and in the cloud
* MLflow Model Registry: A module that manages models in MLflow and their life cycle, including state

## MLproject examples

- Install MLflow in the local environment first.
    ```bash
    $ pip install mlflow
    ```

- `MLproject` file examples
    - conda project
        ```yaml
        name: condapred
        conda_env: conda.yaml
        entry_points:
          main:
            command: "python mljob.py"
        ```

    - system-based project
        ```yaml
        name: syspred
        entry_points:
          main:
            command: "python mljob.py"
        ```

    - docker project
        ```yaml
        name: stockpred
        docker_env:
          image: stockpred-docker
        entry_points:
          main:
            command: "python train.py"
        ```

## End-to-end pipeline in MLflow
- See `./stockpred/README.md`