
# Quickstart: Train and deploy a model in Azure Machine Learning

Dataset: https://www.kaggle.com/olistbr/marketing-funnel-olist

This notebook covers the following steps:
- Create the scripts that extract the data and convert it for the model training and serving/prediction pipeline
- Create the script that builds, saves, and versions the model
- Create the script that uses the saved model to make predictions
- Create scripts to validate the models
- Create scripts to monitor model performance
- Create continuous integration and deployment scripts using open source platforms of your choice


In [1]:
import pandas as pd 
import numpy as np

from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression


## Import Data

Before you train a model, you need to understand the data you're using to train it. In this section, you learn how to:

* Load the two Olist marketing funnel tables (marketing qualified leads and closed deals) that are registered as datasets in your workspace
* Convert them to pandas DataFrames

The cells below assume that both tables from the Kaggle dataset linked above are already registered in the workspace under the names `olist_marketing_qualified_leads` and `olist_closed_deals`.

## Load the olist data

In [2]:
# azureml-core of version 1.0.72 or higher is required
# azureml-dataprep[pandas] of version 1.1.34 or higher is required
from azureml.core import Workspace, Dataset

subscription_id = '71b2b08a-790d-402f-9800-fa9ad29fcb60'
resource_group = 'rg_sm'
workspace_name = 'SM1196'

workspace = Workspace(subscription_id, resource_group, workspace_name)

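# Despite its name, load_model loads a registered dataset (not a model)
# and returns it as a pandas DataFrame.
def load_model(path):
    dataset = Dataset.get_by_name(workspace, name=path)
    return dataset.to_pandas_dataframe()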


In [3]:
# flag = 0 for train data, flag = 1 for test data
def feature_engineering(X, flag = 0, training_cols = None):
  # Work on a copy so the column assignments below do not trigger SettingWithCopyWarning
  X = X[['first_contact_date','origin','seller_id']].copy()

  # Convert date to year, month, day, quarter
  contact_date = pd.DatetimeIndex(X['first_contact_date'])
  X['year'] = contact_date.year
  X['month'] = contact_date.month
  X['day'] = contact_date.day
  # NOTE: month // 4 yields buckets 0-3 that do not match calendar quarters
  # (Jan-Mar -> 0, Apr-Jul -> 1, Aug-Nov -> 2, Dec -> 3). The calendar quarter
  # would be (month - 1) // 3 + 1. Kept as-is so the recorded outputs stay reproducible.
  X['quarter'] = X['month'] // 4

  # Drop contact date and seller id
  X = X.drop(columns=['first_contact_date','seller_id'])

  # One-hot encode the remaining categorical column (origin)
  X = pd.get_dummies(X, drop_first=True, prefix='', prefix_sep='')

  if flag == 1:
    # Align the test columns with the training columns: columns missing from the
    # test set are added with value 0, extra columns are dropped, and the
    # training column order is kept
    X = X.reindex(columns=training_cols, fill_value=0)

  return X, X.columns
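
To make the `flag == 1` branch concrete, here is a minimal sketch on toy data of how one-hot encoded test columns can be aligned with the training columns. The `train` and `test` frames and their `origin` values are invented for illustration and are not taken from the Olist data; only `pd.get_dummies` and `DataFrame.reindex` are real pandas calls.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the Olist dataset)
train = pd.DataFrame({"month": [1, 2, 3], "origin": ["email", "paid_search", "social"]})
test = pd.DataFrame({"month": [4, 5], "origin": ["social", "referral"]})

train_encoded = pd.get_dummies(train, drop_first=True, prefix="", prefix_sep="")
test_encoded = pd.get_dummies(test, drop_first=True, prefix="", prefix_sep="")

# Align the test columns with the training columns:
# - columns seen only in training are added and filled with 0
# - columns seen only in test are dropped
# - the training column order is preserved
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(train_encoded.columns))
print(test_aligned.astype(int))
```

One consequence worth keeping in mind: a category that appears only in the test set (here `referral`) ends up as an all-zero row, which makes it indistinguishable from the category that `drop_first=True` removed during training.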


In [4]:
def prepare_train_validation_data(closed_deals_path, market_lead_path):

  closed_deals = load_model(closed_deals_path)
  market_lead = load_model(market_lead_path)
  
  # Left-join the two datasets so every marketing qualified lead is kept,
  # with or without a closed deal
  mf = pd.merge(market_lead, closed_deals, on='mql_id', how='left')

  # Create target variable: 1 if the lead has a closed deal (a seller_id), else 0
  mf['converted'] = mf['seller_id'].notnull().astype(int)

  X = mf.loc[:, mf.columns != 'converted']
  y = mf.loc[:, ['converted']]

  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

  X_train, training_cols = feature_engineering(X_train)

  X_test, cols = feature_engineering(X_test, 1, training_cols)

  return X_train, X_test, y_train, y_test, training_cols
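
The join and the target construction can be checked on a toy example. The three leads and the single deal below are invented for illustration; the column names follow the ones used in the function above.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
leads = pd.DataFrame({"mql_id": ["a", "b", "c"], "origin": ["email", "social", "email"]})
deals = pd.DataFrame({"mql_id": ["b"], "seller_id": ["s1"]})

# Left join keeps every lead; leads without a deal get NaN in seller_id
funnel = pd.merge(leads, deals, on="mql_id", how="left")

# A lead counts as converted when it has a matching closed deal
funnel["converted"] = funnel["seller_id"].notnull().astype(int)

print(funnel)
```

Only lead `b` has a matching deal, so it is the only row labelled as converted.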


In [5]:
def evaluate(model, X, y):
  y_pred = model.predict(X.astype(np.int32))
  cm = confusion_matrix(y, y_pred)
  score = accuracy_score(y, y_pred)

  return y_pred, cm, score


In [6]:
closed_deals_path = 'olist_closed_deals' 
market_lead_path = 'olist_marketing_qualified_leads'

X_train, X_test, y_train, y_test, training_cols = prepare_train_validation_data(closed_deals_path, market_lead_path)

print("Done")


Done



In [7]:
def train_model(X, y, params):
  # Fit a logistic regression classifier with the given hyperparameters
  model = LogisticRegression(**params)
  model.fit(X, y)

  print(model)

  return model


## Train model and log metrics with MLflow

You'll train the model using the code below. Note that you are using MLflow autologging to track metrics and log model artifacts.

You'll be using the [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) classifier from the [scikit-learn library](https://scikit-learn.org/) to classify the data.

**Note: The model training takes approximately 2 minutes to complete.**

In [8]:
# create the model
import mlflow
import numpy as np
from sklearn.linear_model import LogisticRegression
from azureml.core import Workspace
from pprint import pprint

# connect to your workspace
ws = Workspace.from_config()

# create experiment and start logging to a new run in the experiment
experiment_name = "olist_lead_conversion"

# set up MLflow to track the metrics
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())
mlflow.set_experiment(experiment_name)
mlflow.autolog()

def fetch_logged_data(run_id):
    client = mlflow.tracking.MlflowClient()
    data = client.get_run(run_id).data
    tags = {k: v for k, v in data.tags.items() if not k.startswith("mlflow.")}
    artifacts = [f.path for f in client.list_artifacts(run_id, "model")]
    return data.params, data.metrics, tags, artifacts

# hyperparameters used to train the model
params = {'max_iter':200,'random_state':0,'penalty': 'l1','solver':'liblinear'}

with mlflow.start_run() as run:
    model = train_model(X_train, y_train, params)
    # Evaluate on the training data. The result is stored in `accuracy` so that it
    # does not shadow the imported accuracy_score function that evaluate() calls.
    y_pred, cm, accuracy = evaluate(model, X_train, y_train)
    print("Confusion Metric: ", cm)
    print("Accuracy_score: ", accuracy)

    mlflow.log_param("confusion matrix", cm)
    mlflow.log_param("accuracy_score", accuracy)
    # mlflow.sklearn.log_model(model,'model')

    # Use a separate name so the training `params` dict is not overwritten
    logged_params, metrics, tags, artifacts = fetch_logged_data(run.info.run_id)

    pprint(logged_params)

    pprint(metrics)

    pprint(tags)

    pprint(artifacts)

print("Done")

2021/08/30 12:16:42 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
2021/08/30 12:16:42 INFO mlflow.pyspark.ml: No SparkSession detected. Autologging will log pyspark.ml models contained in the default allowlist. To specify a custom allowlist, initialize a SparkSession prior to calling mlflow.pyspark.ml.autolog() and specify the path to your allowlist file via the spark.mlflow.pysparkml.autolog.logModelAllowlistFile conf.
  y = column_or_1d(y, warn=True)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=200,
                   multi_class='auto', n_jobs=None, penalty='l1',
                   random_state=0, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)
Confusion Metric:  [[5344    0]
 [ 656    0]]
Accuracy_score:  0.8906666666666667
{'C': '1.0',
 'accuracy_score': '0.8906666666666667',
 'class_weight': 'None',
 'confusion matrix': '[[5344    0]\n [ 656    0]]',
 'dual': 'False',
 'fit_intercept': 'True',
 'intercept_scaling': '1',
 'l1_ratio': 'None',
 'max_iter': '200',
 'multi_class': 'auto',
 'n_jobs': 'None',
 'penalty': 'l1',
 'random_state': '0',
 'solver': 'liblinear',
 'tol': '0.0001',
 'verbose': '0',
 'warm_start': 'False'}
{'training_accuracy_score': 0.8906666666666667,
 'training_f1_score': 0.8391612599905971,
 'training_log_loss': 0.325091456389626,
 'training_precision_score': 0.793287111111111,
 't

## View Experiment
In the left-hand menu in Azure Machine Learning Studio, select __Experiments__ and then select your experiment (`olist_lead_conversion`). An experiment is a grouping of many runs from a specified script or piece of code. Information for the run is stored under that experiment. If the name doesn't exist when you submit an experiment, a new experiment is created automatically. If you select your run, you will see various tabs containing metrics, logs, explanations, etc.

## Version control your models with the model registry

You can use model registration to store and version your models in your workspace. Registered models are identified by name and version. Each time you register a model with the same name as an existing one, the registry increments the version. The code below registers and versions the model you trained above. Once you have executed the code cell below you will be able to see the model in the registry by selecting __Models__ in the left-hand menu in Azure Machine Learning Studio.

In [9]:
model_uri = "runs:/{}/model".format(run.info.run_id)
model = mlflow.register_model(model_uri, "olist_sklearn_lr")


Registered model 'olist_sklearn_lr' already exists. Creating a new version of this model...
2021/08/30 12:16:48 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: olist_sklearn_lr, version 8
Created version '8' of model 'olist_sklearn_lr'.
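
Because each registration increments the version, a hard-coded version number can silently point at an older model. The sketch below shows one possible way to look up the newest version instead. `pick_latest_version` and `latest_registered_version` are hypothetical helper names, and the lookup assumes that the configured tracking backend supports `MlflowClient.search_model_versions`.

```python
from types import SimpleNamespace


def pick_latest_version(model_versions):
    """Return the highest version number among registry entries.

    MLflow reports `version` as a string, so compare as integers
    (as strings, "10" would sort before "9").
    """
    return max(int(mv.version) for mv in model_versions)


def latest_registered_version(name):
    """Look up the newest version of a registered model (needs a configured tracking URI)."""
    # Imported lazily so that pick_latest_version can be used without MLflow installed
    from mlflow.tracking import MlflowClient

    versions = MlflowClient().search_model_versions(f"name='{name}'")
    return pick_latest_version(versions)


# Offline illustration with stand-in objects instead of real registry entries
fake_versions = [
    SimpleNamespace(version="5"),
    SimpleNamespace(version="8"),
    SimpleNamespace(version="7"),
]
print(pick_latest_version(fake_versions))
```

In the notebook, `latest_registered_version("olist_sklearn_lr")` could then replace a fixed version number when building the `models:/` URI.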


## Use the registered model to evaluate the data

In [10]:
import mlflow.pyfunc
from sklearn.metrics import confusion_matrix, accuracy_score

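model_name = "olist_sklearn_lr"
# Pinned to an earlier registered version; set this to the version you want to evaluate
# (the registration cell above created version 8)
model_version = 5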

model = mlflow.pyfunc.load_model(
    model_uri=f"models:/{model_name}/{model_version}"
)


pprint(evaluate(model,X_test, y_test))

# y_test.converted.tolist()


(array([0, 0, 0, ..., 0, 0, 0]),
 array([[1814,    0],
       [ 186,    0]]),
 0.907)
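
Both confusion matrices reported above have an empty second column, which means the model never predicts the positive class. As a sanity check, the sketch below recomputes accuracy and positive-class recall from those two matrices in plain Python. `summarize` is a hypothetical helper and not part of the notebook.

```python
def summarize(cm):
    """Return (accuracy, positive-class recall, majority-class share).

    cm is a 2x2 confusion matrix laid out as scikit-learn does:
    rows are true labels, columns are predicted labels.
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    majority_share = max(tn + fp, fn + tp) / total
    return accuracy, recall, majority_share


# Matrices copied from the training and test outputs in this notebook
train_cm = [[5344, 0], [656, 0]]
test_cm = [[1814, 0], [186, 0]]

for name, cm in [("train", train_cm), ("test", test_cm)]:
    accuracy, recall, majority_share = summarize(cm)
    print(f"{name}: accuracy={accuracy:.4f} recall={recall:.4f} majority_share={majority_share:.4f}")
```

In both cases the accuracy equals the share of non-converted leads and the recall for converted leads is zero, so the reported accuracy is what a model that always predicts 0 would achieve. Accuracy alone is therefore not a sufficient validation metric for this imbalanced target.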


## Deploy the model for real-time inference
In this section, you learn how to deploy a model so that an application can consume it (run inference) over REST.

### Create deployment configuration
The code cell below gets a _curated environment_, which specifies all the dependencies required to host the model (for example, packages like scikit-learn). You also create a _deployment configuration_, which specifies the amount of compute required to host the model. In this case, the compute will have 1 CPU core and 1 GB of memory.

In [11]:
# create environment for the deploy
from azureml.core.environment import Environment
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.webservice import AciWebservice

# get a curated environment
env = Environment.get(
    workspace=ws, 
    name="AzureML-sklearn-0.24.1-ubuntu18.04-py37-cpu-inference",
    version=1
)
env.inferencing_stack_version='latest'

# create deployment config i.e. compute resources
aciconfig = AciWebservice.deploy_configuration(
    cpu_cores=1,
    memory_gb=1,
    tags={
            "data": "https://www.kaggle.com/olistbr/marketing-funnel-olist",
            "method": "predict"
        },
    description="Predict whether a lead will be converted or not",
)

### Deploy model

This next code cell deploys the model to Azure Container Instances (ACI).

**Note: The deployment takes approximately 3 minutes to complete.**

In [12]:
%%time
import uuid
from azureml.core.model import InferenceConfig
from azureml.core.environment import Environment
from azureml.core.model import Model

# get the registered model
model = Model(ws, "olist_sklearn_lr")

# create an inference config i.e. the scoring script and environment
inference_config = InferenceConfig(entry_script="score.py", environment=env)

pprint(inference_config)

# deploy the service
service_name = "sklearn-olist-lr-" + str(uuid.uuid4())[:4]
service = Model.deploy(
    workspace=ws,
    name=service_name,
    models=[model],
    inference_config=inference_config,
    deployment_config=aciconfig,
)

service.wait_for_deployment(show_output=True)

InferenceConfig(entry_script=score.py, runtime=None, conda_file=None, extra_docker_file_steps=None, source_directory=None, enable_gpu=None, base_image=None, base_image_registry=<azureml.core.container_registry.ContainerRegistry object at 0x7f99f19f6e80>)
Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running
2021-08-30 12:16:54+00:00 Creating Container Registry if not exists.
2021-08-30 12:16:54+00:00 Registering the environment.
2021-08-30 12:16:57+00:00 Use the existing image.
2021-08-30 12:16:58+00:00 Generating deployment configuration.
2021-08-30 12:16:59+00:00 Submitting deployment to compute.
2021-08-30 12:17:02+00:00 Checking the status of deployment sklearn-olist-lr-7d22..
2021-08-30 12:18:08+00:00 Checking the status of inference endpoint sklearn-olist-lr-7d22.
Succeeded
ACI service creation operation finished, operation "Succeeded"
CPU times: u

The [*scoring script*](score.py) file referenced in the code above can be found in the same folder as this notebook, and has two functions:

1. an `init` function that executes once when the service starts - in this function you normally get the model from the registry and set global variables
1. a `run(data)` function that executes each time a call is made to the service. In this function, you normally format the input data, run a prediction, and output the predicted result.
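
The `score.py` file itself is not shown in this notebook. As a rough illustration of the two functions described above, a scoring script could look like the sketch below. It is not the actual file: the `model/model.pkl` location under `AZUREML_MODEL_DIR` is an assumption about how the registered MLflow model is laid out, and `predict_rows` and `AlwaysZero` are hypothetical names added so the request handling can be tried without a deployed model.

```python
import json
import os
import pickle

model = None  # set once by init()


def init():
    """Load the model once when the service starts."""
    global model
    # Assumption: the registered model folder is available under
    # AZUREML_MODEL_DIR and contains model/model.pkl
    model_path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "model", "model.pkl")
    with open(model_path, "rb") as f:
        model = pickle.load(f)


def predict_rows(fitted_model, raw_data):
    """Parse a JSON body of the form {"data": [[...], ...]} and return predictions as a list."""
    rows = json.loads(raw_data)["data"]
    predictions = fitted_model.predict(rows)
    # Convert NumPy scalars (if any) to plain ints so the result is JSON-serializable
    return [int(p) for p in predictions]


def run(raw_data):
    """Called for every request made to the service."""
    return predict_rows(model, raw_data)


class AlwaysZero:
    """Stand-in model used only to exercise the request/response format."""

    def predict(self, rows):
        return [0 for _ in rows]


print(predict_rows(AlwaysZero(), '{"data": [[2018, 1, 7, 0]]}'))
```

Keeping the parsing and prediction logic in a separate function makes it possible to test the request format locally before deploying.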

### View Endpoint
Once the model has been successfully deployed, you can view the endpoint by navigating to __Endpoints__ in the left-hand menu in Azure Machine Learning Studio. There you can see the state of the endpoint (healthy/unhealthy), its logs, and how applications can consume the model.

## Test the model service

You can test the deployed model by sending a raw HTTP request to the web service.

In [13]:
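# send raw HTTP request to test the web service.
import json
import requests

# send a random row from the test set to score
# (np.random.randint excludes the upper bound, so len(X_test) covers every row)
random_index = np.random.randint(0, len(X_test))
# build the JSON payload with json.dumps; cast to int so NumPy values serialize
input_data = json.dumps({"data": [[int(v) for v in X_test.iloc[random_index]]]})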

print("Random_index: ", random_index)
pprint(input_data)

headers = {"Content-Type": "application/json"}

resp = requests.post(service.scoring_uri, input_data, headers=headers)

print("POST to url", service.scoring_uri)
print("label:", y_test.iloc[random_index,])
print("prediction:", resp.text)

Random_index:  224
'{"data": [[2018, 1, 7, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]]}'
POST to url http://a219226d-6871-4263-9f92-8cc72754df88.eastus2.azurecontainer.io/score
label: converted    0
Name: 2043, dtype: int64
prediction: [0]


## Clean up resources

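If you're not going to continue to use this model, delete the model service. Note that the cell below deletes every web service in the workspace, not only the one deployed in this notebook: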

In [16]:
# Delete all web services
from azureml.core import Webservice
for webservice in Webservice.list(ws):
    print('name:', webservice.name)
    Webservice(ws, name = webservice.name).delete()


name: sklearn-olist-lr-7d22
name: sklearn-olist-lr-fe20
name: sklearn-olist-lr-a7a6
name: sklearn-olist-lr-e247
name: sklearn-olist-lr-ffcd
name: sklearn-olist-lr-fe9f
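
If the workspace also hosts services that should be kept, a narrower cleanup may be safer. The sketch below deletes only the services whose names start with a given prefix. `names_with_prefix` and `delete_services_with_prefix` are hypothetical helpers, and `other-service` is an invented name used only for the offline check; `Webservice.list` and `delete` are the same calls used in the cell above.

```python
def names_with_prefix(names, prefix):
    """Keep only the service names that start with the given prefix."""
    return [name for name in names if name.startswith(prefix)]


def delete_services_with_prefix(ws, prefix):
    """Delete only the web services in workspace `ws` whose name starts with `prefix`."""
    # Imported lazily; requires the azureml-core package and a workspace connection
    from azureml.core import Webservice

    services = {service.name: service for service in Webservice.list(ws)}
    for name in names_with_prefix(services, prefix):
        print("deleting:", name)
        services[name].delete()


# Offline check of the filter with example names
print(names_with_prefix(["sklearn-olist-lr-7d22", "other-service"], "sklearn-olist-lr-"))
```

With the naming scheme used in the deployment cell, `delete_services_with_prefix(ws, "sklearn-olist-lr-")` would remove the services created by this notebook and leave the rest untouched.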

