# Quick start tutorial



**Introduction to MLRun - Use serverless functions to train and deploy models**

This notebook provides a quick overview of developing and deploying machine learning applications using the [MLRun](https://www.mlrun.org/) MLOps orchestration framework.

<a id="setup"></a>
## Set-up

Import required libraries:

In [1]:
import mlrun
import os

Load environment variables for MLRun:

In [None]:
ENV_FILE = ".mlrun.env"
if os.path.exists(ENV_FILE):
    mlrun.set_env_from_file(ENV_FILE)
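
For orientation, an env file is a plain text file of `KEY=VALUE` lines. The sketch below is a stdlib-only illustration of what loading such a file amounts to. It is not MLRun's implementation, and the `load_env_file` helper and the example file content are hypothetical. `MLRUN_DBPATH` is the variable that holds the MLRun API service address.

```python
import os
import tempfile


def load_env_file(path):
    """Read KEY=VALUE lines into os.environ (hypothetical helper, not MLRun's code)."""
    loaded = {}
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            # Skip blank lines and comments
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            loaded[key.strip()] = value.strip()
    os.environ.update(loaded)
    return loaded


# Example file content (assumption): the address matches the DB shown in this notebook's run logs
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as tmp:
    tmp.write("# MLRun service address\nMLRUN_DBPATH=http://mlrun-api:8080\n")

env = load_env_file(tmp.name)
os.remove(tmp.name)
```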

Create an MLRun project:

In [None]:
PROJECT = "demo-ml"
project = mlrun.get_or_create_project(PROJECT, "./")

<a id="generate-data"></a>
## Generate data

**Function code**

Run the following cell to generate the data prep file (or copy it manually):

In [3]:
%%writefile data-prep.py

import pandas as pd
from sklearn.datasets import load_breast_cancer

import mlrun


@mlrun.handler(outputs=["dataset", "label_column"])
def breast_cancer_generator():
    """
    A function which generates the breast cancer dataset
    """
    breast_cancer = load_breast_cancer()
    breast_cancer_dataset = pd.DataFrame(
        data=breast_cancer.data, columns=breast_cancer.feature_names
    )
    breast_cancer_labels = pd.DataFrame(data=breast_cancer.target, columns=["label"])
    breast_cancer_dataset = pd.concat(
        [breast_cancer_dataset, breast_cancer_labels], axis=1
    )

    return breast_cancer_dataset, "label"

Overwriting data-prep.py


**Create a serverless function object from the code above, and register it in the project**

In [4]:
data_gen_fn = project.set_function(
    "data-prep.py",
    name="data-prep",
    kind="job",
    image="mlrun/mlrun",
    handler="breast_cancer_generator",
)
project.save()  # save the project with the latest config

<mlrun.projects.project.MlrunProject at 0x7ff72063d460>


**Run using the SDK**

In [6]:
gen_data_run = project.run_function("data-prep", local=True)

> 2022-09-20 13:22:59,351 [info] starting run data-prep-breast_cancer_generator uid=1ea3533192364dbc8898ce328988d0a3 DB=http://mlrun-api:8080


project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
quick-tutorial-iguazio,...8988d0a3,0,Sep 20 13:22:59,completed,data-prep-breast_cancer_generator,v3io_user=iguazio; kind=; owner=iguazio; host=jupyter-5654cb444f-c9wk2,,,label_column=label,dataset





> 2022-09-20 13:22:59,693 [info] run executed, status=completed



**Print the run state and outputs**

In [7]:
gen_data_run.state()

'completed'

In [8]:
gen_data_run.outputs

{'label_column': 'label',
 'dataset': 'store://artifacts/quick-tutorial-iguazio/data-prep-breast_cancer_generator_dataset:1ea3533192364dbc8898ce328988d0a3'}
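
The `dataset` output is a store URI rather than a file path. Judging only from the value printed above, it has the shape `store://artifacts/<project>/<key>:<uid>`. The sketch below splits such a string into its parts. The `parse_store_uri` helper is hypothetical, and the layout is an assumption based on this single example rather than a description of MLRun's full URI grammar.

```python
from urllib.parse import urlparse


def parse_store_uri(uri):
    """Split a URI of the assumed form store://artifacts/<project>/<key>:<uid>."""
    parsed = urlparse(uri)
    if parsed.scheme != "store":
        raise ValueError(f"not a store URI: {uri}")
    # The path looks like /<project>/<key>:<uid>
    project, _, rest = parsed.path.lstrip("/").partition("/")
    if ":" in rest:
        key, uid = rest.rsplit(":", 1)
    else:
        key, uid = rest, ""
    return {"kind": parsed.netloc, "project": project, "key": key, "uid": uid}


parts = parse_store_uri(
    "store://artifacts/quick-tutorial-iguazio/"
    "data-prep-breast_cancer_generator_dataset:1ea3533192364dbc8898ce328988d0a3"
)
```

Here `parts["uid"]` is the same run uid that appears in the run log, which is how the artifact is tied back to the run that produced it.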


**Print the output dataset artifact (`DataItem` object) as a DataFrame**

In [9]:
gen_data_run.artifact("dataset").as_df().head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,label
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


<a id="use-hub"></a>
## Train a model using the MLRun built-in Function Hub

MLRun provides a [**Function Hub**](https://www.mlrun.org/marketplace/) that hosts a set of pre-implemented and
validated ML, DL, and data processing functions.

You can import the `auto_trainer` hub function, which can train an ML model using a variety of ML frameworks, generate
various metrics and charts, and log the model along with its metadata into the MLRun model registry.

In [10]:
# Import the function
trainer = mlrun.import_function('hub://auto_trainer')


See the `auto_trainer` function usage instructions in [the Function Hub](https://www.mlrun.org/marketplace/functions/master/auto_trainer/) or by calling `trainer.doc()`.

**Run the function on the cluster (if one is available)**

In [11]:
trainer_run = project.run_function(
    trainer,
    inputs={"dataset": gen_data_run.outputs["dataset"]},
    params={
        "model_class": "sklearn.ensemble.RandomForestClassifier",
        "train_test_split_size": 0.2,
        "label_columns": "label",
        "model_name": "cancer",
    },
    handler="train",
)

> 2022-09-20 13:23:14,811 [info] starting run auto-trainer-train uid=84057e1510174611a5d2de0671ee803e DB=http://mlrun-api:8080
> 2022-09-20 13:23:14,970 [info] Job is running in the background, pod: auto-trainer-train-dzjwz
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-3pzdch1o because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
> 2022-09-20 13:23:20,953 [info] Sample set not given, using the whole training set as the sample set
> 2022-09-20 13:23:21,143 [info] training 'cancer'
> 2022-09-20 13:23:22,561 [info] run executed, status=completed
final state: completed


project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
quick-tutorial-iguazio,...71ee803e,0,Sep 20 13:23:20,completed,auto-trainer-train,v3io_user=iguazio; kind=job; owner=iguazio; mlrun/client_version=1.1.0; host=auto-trainer-train-dzjwz,dataset,model_class=sklearn.ensemble.RandomForestClassifier; train_test_split_size=0.2; label_columns=label; model_name=cancer,accuracy=0.956140350877193; f1_score=0.967741935483871; precision_score=0.9615384615384616; recall_score=0.974025974025974,feature-importance; test_set; confusion-matrix; roc-curves; calibration-curve; model





> 2022-09-20 13:23:24,216 [info] run executed, status=completed
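
The four values in the results column are related through the standard confusion-matrix definitions. As a worked check, the counts below (75 true positives, 3 false positives, 2 false negatives, and 34 true negatives, 114 test rows in total) are inferred values that reproduce the reported metrics. They are an assumption for illustration and are not read from the run's `confusion-matrix` artifact.

```python
import math

# Inferred counts that reproduce the reported metrics (assumption, not run output)
tp, fp, fn, tn = 75, 3, 2, 34

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Values copied from the results column of the training run
reported = {
    "accuracy": 0.956140350877193,
    "precision_score": 0.9615384615384616,
    "recall_score": 0.974025974025974,
    "f1_score": 0.967741935483871,
}
computed = {
    "accuracy": accuracy,
    "precision_score": precision,
    "recall_score": recall,
    "f1_score": f1,
}

# Each computed metric agrees with the reported one up to floating-point rounding
matches = {name: math.isclose(computed[name], reported[name], rel_tol=1e-9) for name in reported}
```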


<a id="model-serving"></a>
## Build, test, and deploy the model serving functions

MLRun serving can produce managed, real-time, serverless pipelines composed of various data processing and ML tasks. The pipelines use the Nuclio real-time serverless engine, which can be deployed anywhere. For more details and examples, see [MLRun serving graphs](https://docs.mlrun.org/en/stable/serving/serving-graph.html).

**Create a model serving function**

The original tutorial uses an image for this function that may cause compatibility issues with scikit-learn versions. Create the function as follows:

In [14]:
serving_fn = mlrun.new_function("serving", image="mlrun/mlrun", kind="serving")

**Add a model**

The basic serving topology supports a router with multiple child models attached to it.
The `function.add_model()` method lets you add models and specify the model name, the `model_path` (to a model file, directory, or artifact), and the serving class via `class_name` (built-in or user-defined).

In [15]:
serving_fn.add_model(
    "cancer-classifier",
    model_path=trainer_run.outputs["model"],
    class_name="mlrun.frameworks.sklearn.SklearnModelServer",
)

<mlrun.serving.states.TaskStep at 0x7ff6da1ac190>
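
To make the router topology concrete, the following toy sketch mimics the idea in plain Python: a router holds named child models and dispatches each request to one of them based on the model name in the path. It is an illustration only and not how MLRun implements routing. `ToyRouter` and `ThresholdModel` are hypothetical names, and the two-feature sample is made up.

```python
class ThresholdModel:
    """Stand-in model: predicts 1 when the first feature exceeds a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, inputs):
        return [int(row[0] > self.threshold) for row in inputs]


class ToyRouter:
    """Routes /v2/models/<name>/infer requests to the matching child model."""

    def __init__(self):
        self.models = {}

    def add_model(self, name, model):
        self.models[name] = model

    def handle(self, path, body):
        parts = path.strip("/").split("/")
        # Expected path shape: v2/models/<name>/infer
        if len(parts) != 4 or parts[:2] != ["v2", "models"] or parts[3] != "infer":
            raise ValueError(f"unsupported path: {path}")
        name = parts[2]
        if name not in self.models:
            raise KeyError(f"unknown model: {name}")
        outputs = self.models[name].predict(body["inputs"])
        return {"model_name": name, "outputs": outputs}


router = ToyRouter()
router.add_model("low-cutoff", ThresholdModel(10.0))
router.add_model("high-cutoff", ThresholdModel(20.0))

sample = {"inputs": [[13.71, 20.83]]}
low = router.handle("/v2/models/low-cutoff/infer", sample)
high = router.handle("/v2/models/high-cutoff/infer", sample)
```

The same request body yields a different answer depending on which child model the path selects, which is the property that lets one serving function host several models.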

**Deploy a real-time serving function (over Kubernetes or Docker)**

This section requires Nuclio to be installed (over k8s or Docker).

Use the MLRun `deploy_function()` method to build and deploy a Nuclio serving function from your serving-function code.
You can deploy the function object (`serving_fn`) or reference pre-registered project functions.

In [20]:
project.deploy_function(serving_fn)

> 2022-09-20 13:24:34,823 [info] Starting remote function deploy
2022-09-20 13:24:35  (info) Deploying function
2022-09-20 13:24:35  (info) Building
2022-09-20 13:24:35  (info) Staging files and preparing base images
2022-09-20 13:24:35  (info) Building processor image
2022-09-20 13:25:35  (info) Build complete
2022-09-20 13:26:05  (info) Function deploy complete
> 2022-09-20 13:26:06,030 [info] successfully deployed function: {'internal_invocation_urls': ['nuclio-quick-tutorial-iguazio-serving.default-tenant.svc.cluster.local:8080'], 'external_invocation_urls': ['quick-tutorial-iguazio-serving-quick-tutorial-iguazio.default-tenant.app.alexp-edge.lab.iguazeng.com/']}


DeployStatus(state=ready, outputs={'endpoint': 'http://quick-tutorial-iguazio-serving-quick-tutorial-iguazio.default-tenant.app.alexp-edge.lab.iguazeng.com/', 'name': 'quick-tutorial-iguazio-serving'})

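- Define some data to use for a test, and run it against a local mock server:

In [19]:
# One sample with 30 feature values, in the dataset's column order
my_data = {
    "inputs": [
        [
            1.371e+01, 2.083e+01, 9.020e+01, 5.779e+02, 1.189e-01, 1.645e-01,
            9.366e-02, 5.985e-02, 2.196e-01, 7.451e-02, 5.835e-01, 1.377e+00,
            3.856e+00, 5.096e+01, 8.805e-03, 3.029e-02, 2.488e-02, 1.448e-02,
            1.486e-02, 5.412e-03, 1.706e+01, 2.814e+01, 1.106e+02, 8.970e+02,
            1.654e-01, 3.682e-01, 2.678e-01, 1.556e-01, 3.196e-01, 1.151e-01,
        ]
    ]
}

# Assumption: the call that produced the output below is missing from this cell.
# A local mock server (no deployment needed) is the likely source.
server = serving_fn.to_mock_server()
server.test("/v2/models/cancer-classifier/infer", body=my_data)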

X does not have valid feature names, but RandomForestClassifier was fitted with feature names


{'id': '27d3f10a36ce465f841d3e19ca404889',
 'model_name': 'cancer-classifier',
 'outputs': [0]}

- Test the live endpoint:

In [21]:
serving_fn.invoke("/v2/models/cancer-classifier/infer", body=my_data)

> 2022-09-20 13:26:06,094 [info] invoking function: {'method': 'POST', 'path': 'http://nuclio-quick-tutorial-iguazio-serving.default-tenant.svc.cluster.local:8080/v2/models/cancer-classifier/infer'}


{'id': '2533b72a-6d94-4c51-b960-02a2deaf84b6',
 'model_name': 'cancer-classifier',
 'outputs': [0]}
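
The same endpoint can be called from any HTTP client. The sketch below only constructs the POST request with the standard library and does not send it. The path is the one used in the `invoke()` call above, while `build_infer_request` and the `http://localhost:8080` base URL are placeholders, since the real address depends on your deployment.

```python
import json
from urllib.request import Request


def build_infer_request(base_url, model_name, inputs):
    """Build (but do not send) a POST request for the model's infer path."""
    url = f"{base_url.rstrip('/')}/v2/models/{model_name}/infer"
    payload = json.dumps({"inputs": inputs}).encode("utf-8")
    return Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder base URL and an all-zero row of 30 features, for illustration only
request = build_infer_request("http://localhost:8080/", "cancer-classifier", [[0.0] * 30])

# To send it against a reachable endpoint:
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       result = json.loads(response.read())
```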