# Importing MLflow models in DSS

This notebook shows, through a simple example, how to import a machine learning model trained *entirely outside of DSS* as a SavedModel in a project's Flow. We use the [CatBoost]() framework to perform a binary classification task on the [UCI Bank dataset]().

In [1]:
import dataiku
import dataikuapi
import os

from dataikuapi.dss.ml import DSSPredictionMLTaskSettings



## Step 1: train your model outside of DSS

Using the archive data and source files provided along with this notebook, perform the following actions *outside of DSS*:
* Create a virtual environment using Python >= 3.6 and install the packages listed in `requirement.txt`
* Activate the newly-created virtual environment
* Go to `src/` and train the model by running `python train_catboost.py`. The resulting model artifact is stored in the `dist/` directory; its name should be of the form `catboost-uci-bank-xxxxxxxx-xxxxxx`.
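
Because the artifact name changes with every training run, it can be convenient to locate the most recent one programmatically instead of copying its name by hand. The sketch below assumes that the `xxxxxxxx-xxxxxx` suffix is a zero-padded `YYYYMMDD-HHMMSS` timestamp, which is not confirmed by the training script. `find_latest_artifact` is a hypothetical helper, not part of DSS or MLflow.

```python
import re
from pathlib import Path

# Assumed naming scheme: catboost-uci-bank-<8 digits>-<6 digits>
ARTIFACT_PATTERN = re.compile(r"^catboost-uci-bank-\d{8}-\d{6}$")


def find_latest_artifact(dist_dir):
    """Return the name of the most recent artifact directory in dist_dir, or None."""
    candidates = [
        p.name
        for p in Path(dist_dir).iterdir()
        if p.is_dir() and ARTIFACT_PATTERN.match(p.name)
    ]
    # Zero-padded timestamps sort chronologically when compared as strings.
    return max(candidates, default=None)
```

Calling `find_latest_artifact("dist/")` from the archive root would return the directory name to use later when importing the model, or `None` if no training run has produced an artifact yet.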

> **WARNING**: Any pre-processing step applied to the training data **MUST** also be applied to the evaluation data.
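
One way to respect this constraint is to put all pre-processing in a single function and call it on both splits. The following is only an illustration: `preprocess_row` and its cleaning rules are hypothetical and do not reflect what `train_catboost.py` actually does, and the rows are toy values rather than real records.

```python
def preprocess_row(row):
    """Hypothetical cleaning step: trim whitespace and lowercase string values."""
    return {
        key.strip(): value.strip().lower() if isinstance(value, str) else value
        for key, value in row.items()
    }


# Toy rows for illustration only (not taken from the actual dataset).
train_rows = [{"job": " Admin. ", "age": 41, "y": "No"}]
eval_rows = [{"job": "ADMIN.", "age": 35, "y": "no "}]

# The same function is applied to both splits, so the model sees
# identically formatted values at training and evaluation time.
train_clean = [preprocess_row(r) for r in train_rows]
eval_clean = [preprocess_row(r) for r in eval_rows]
```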

## Step 2: create the code env in DSS

In the *Administration > Code envs* section of DSS, create a new code environment and add the packages listed in the archive's `requirement.txt` file (minus `pandas`), then build the code env.
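
To avoid editing the package list by hand, the requirement lines can be filtered before pasting them into the code env definition. This is a minimal sketch: `requirements_without` is a hypothetical helper, and it only handles simple lines such as `name==version`, not every requirement syntax that pip accepts.

```python
import re


def requirements_without(lines, excluded=("pandas",)):
    """Return the requirement lines whose package name is not in `excluded`."""
    kept = []
    for line in lines:
        stripped = line.strip()
        # Skip blank lines and comments.
        if not stripped or stripped.startswith("#"):
            continue
        # The package name is everything before the first version
        # specifier, extra, environment marker, or whitespace.
        name = re.split(r"[=<>!~\[;\s]", stripped, maxsplit=1)[0].lower()
        if name not in excluded:
            kept.append(stripped)
    return kept
```

A file object can be passed directly, for example `requirements_without(open("requirement.txt"))`, since iterating over a file yields its lines.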

> **This notebook should be running using that code env!**

Write down the name of that code env: you will need it when calling `import_mlflow_version_from_path()`.

## Step 3: get a handle on a SavedModel

In [2]:
client = dataiku.api_client()
project = client.get_default_project()

# Get or create SavedModel
sm_name = "catboost-uci-bank"
sm_id = None
for sm in project.list_saved_models():
    if sm["name"] == sm_name:
        sm_id = sm["id"]
        print("Found SavedModel {} with id {}".format(sm_name, sm_id))
        break
if sm_id:
    sm = project.get_saved_model(sm_id)
else:
    sm = project.create_mlflow_pyfunc_model(name=sm_name,
                                            prediction_type=DSSPredictionMLTaskSettings.PredictionTypes.BINARY)
    sm_id = sm.id
    print("SavedModel not found, created new one with id {}".format(sm_id))

SavedModel not found, created new one with id VdHxdbkg


## Step 4: import the evaluation dataset

Create a new Dataset in your DSS project by uploading `data/uci-bank-marketing/eval_data.csv`. Call this Dataset `eval_data`.

> **WARNING**: The evaluation Dataset **MUST** already be preprocessed using exactly the same steps as the training data in Step 1!
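
Before uploading the file, a quick local check can catch a missing target column or unexpected class labels. The sketch below uses the target column `y` and the labels `no` and `yes` that are passed to `set_core_metadata()` in Step 5. The comma delimiter is an assumption about `eval_data.csv`, and `check_eval_csv` is a hypothetical helper.

```python
import csv


def check_eval_csv(path, target="y", labels=("no", "yes"), delimiter=","):
    """Raise ValueError if the target column is missing or holds unexpected labels."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f, delimiter=delimiter)
        if target not in (reader.fieldnames or []):
            raise ValueError("Target column {!r} not found in {}".format(target, path))
        # Collect the distinct values found in the target column.
        seen = {row[target] for row in reader}
    unexpected = seen - set(labels)
    if unexpected:
        raise ValueError("Unexpected labels: {}".format(sorted(map(str, unexpected))))
    return seen
```

The function returns the set of labels actually present, which also makes it easy to spot an evaluation file that contains only one of the two classes.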

## Step 5: import the MLflow model into a SavedModel version

In [6]:
# Change the following values to match your setup!
MLFLOW_DIST_DIR = "/Users/hrajaona/Projects/repos/tce/demo-mlflow/mlflow-model-import/dist/"
CATBOOST_MODEL_DIR = "catboost-uci-bank-20211206-151341"

version_id = "v01" # Change this to iterate to a new version
model_dir = os.path.join(MLFLOW_DIST_DIR, CATBOOST_MODEL_DIR)

# Make sure the version does not already exist, then create it in the SavedModel
for v in sm.list_versions():
    if v["id"] == version_id:
        raise Exception("SavedModel version already exists! Choose a new version name.")

sm_version = sm.import_mlflow_version_from_path(version_id=version_id,
                                                path=model_dir,
                                                code_env_name="mlflow_catboost")
# Evaluate the version using the previously created Dataset
sm_version.set_core_metadata(target_column_name="y",
                             class_labels=["no", "yes"],
                             get_features_from_dataset="eval_data")
sm_version.evaluate("eval_data")
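
An MLflow model directory contains an `MLmodel` descriptor file at its root, so it can be worth checking for that file before calling `import_mlflow_version_from_path()`. The sketch below is a hypothetical pre-flight check, not part of the DSS API. It assumes the path is readable from the machine where this notebook runs.

```python
import os


def check_mlflow_model_dir(model_dir):
    """Return model_dir if it looks like an MLflow model directory, else raise."""
    if not os.path.isdir(model_dir):
        raise FileNotFoundError("Model directory not found: {}".format(model_dir))
    # MLflow stores the model descriptor in a file named "MLmodel".
    if not os.path.isfile(os.path.join(model_dir, "MLmodel")):
        raise FileNotFoundError("No MLmodel file in {}".format(model_dir))
    return model_dir
```

Passing `path=check_mlflow_model_dir(model_dir)` to the import call would then fail early, with a message that names the offending directory.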

If you go to the SavedModel's version screen, all the "Performance" visualizations should now display properly.