# 🖥️ Monitoring Classification Model Performance Metrics

In this tutorial, we'll show how to log the performance metrics of your ML model with whylogs and how to send them to your dashboard on the WhyLabs Platform.
We'll follow a classification use case, where we're trying to predict whether a given transaction will be cancelled, using data from [Retail Case Study Data](https://www.kaggle.com/darpan25bajaj/retail-case-study-data).

We will:
- Download Model/Features/Labels data from S3
- Make predictions with the loaded model and features
- Log Input/Output features with whylogs
- Log Performance Metrics (Labels and Predictions) with whylogs
- Show Performance summary at WhyLabs

# 🛍️ The Data Story

In this example, we want to predict whether a given transaction will be cancelled, using data from a small retail business.


Most applications receive labeled data at a substantial delay, often for a subset of the total data seen at inference time. That is why, in whylogs, ML model performance metrics can be treated separately from the input and output data that we see for data profiling. To illustrate that, we will download separate data on the actual values, predictions, and a threshold or probability score to determine performance metrics.

### The Dataset

The features used in this example contain information about:

- Transaction
    - Date of transaction
    - Total amount
    - Quantity
- Product
    - Product category
    - Product subcategory
- Customer
    - Age
    - Gender
    - City code

The dataset used is based on the original dataset present in [Retail Case Study Data](https://www.kaggle.com/darpan25bajaj/retail-case-study-data), with additional preprocessing and transformations.

# Installing Required Packages

In [2]:
%%sh
pip install --upgrade pip -q
# This tutorial uses the whylogs v0 API (`whylogs.app`), so stay below 1.0
pip install -U "whylogs<1.0" -q
pip install scikit-learn -U -q

# Fetching the Artifacts from S3 (Model+Features+Labels)

In [1]:
import urllib.request
import pickle
import pandas as pd
model_path = "https://whylabs-public.s3.us-west-2.amazonaws.com/datasets/tour/perf/retail-rf-classifier.pickle"
features_path = "https://whylabs-public.s3.us-west-2.amazonaws.com/datasets/tour/perf/transformed-current.csv"
labels_path = "https://whylabs-public.s3.us-west-2.amazonaws.com/datasets/tour/perf/transformed-current-labels.csv"


# Note: only unpickle files that come from a source you trust
model = pickle.load(urllib.request.urlopen(model_path))
df = pd.read_csv(features_path)
df_metrics = pd.read_csv(labels_path)

In [2]:
type(model)

sklearn.ensemble._forest.RandomForestClassifier

Let's take a look at the feature names and types:

In [3]:
df.dtypes

Product Subcategory Code       int64
Product Category Code          int64
Quantity                       int64
Item Price                   float64
Total Tax                    float64
Total Amount                 float64
City Code                    float64
Age at Transaction Date      float64
Transaction Day of Week        int64
Store Type.Flagship store      int64
Store Type.MBR                 int64
Store Type.TeleShop            int64
Store Type.e-Shop              int64
Gender.F                       int64
Gender.M                       int64
Gender.Unknown                 int64
dtype: object

Our target field is `Purchase Canceled`:

In [4]:
df_metrics.dtypes

Purchase Canceled    float64
dtype: object

# 🔮 Making the Predictions

Now that we have the features and the model, we can make the predictions. Later, when we log the metrics, we'll also need the prediction scores, which are used to build the ROC and precision-recall curves. Our Random Forest model was trained with scikit-learn, which exposes the scores through `predict_proba`. Let's call it in addition to `predict`, so that we have both the scores and the predicted classes.

In [5]:
predict_proba = model.predict_proba(df)
predict_class = model.predict(df)

# ✔️ Setup WhyLabs/Credentials


We will follow the same instructions as in the WhyLabs Observability Platform live data example. In that workflow, you gather your organization ID and API key, if you haven't already, and then upload a number of profiles to a new model.

Detailed instructions are available in our documentation: https://docs.whylabs.ai/docs/whylabs-set-up-model

Now we can add our API key and organization ID as environment variables and add a `WhyLabsWriter` to our whylogs session for automated upload.

In [6]:
import datetime
import pandas as pd
import os
from whylogs.app import Session
from whylogs.app.writers import WhyLabsWriter
import getpass



# set your org-id here
print("Enter your WhyLabs Org ID")
os.environ["WHYLABS_DEFAULT_ORG_ID"] = input()

# set your API key here
print("Enter your WhyLabs API key")
os.environ["WHYLABS_API_KEY"] = getpass.getpass()
print("Using API Key ID: ", os.environ["WHYLABS_API_KEY"][0:10])

# Adding the WhyLabs Writer to utilize WhyLabs platform
writer = WhyLabsWriter()
session = Session(writers=[writer])

Enter your WhyLabs Org ID
Enter your WhyLabs API key
Using API Key ID:  xxtIbfnVKB
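
If you rerun the notebook often, being prompted each time can get tedious. The sketch below shows one possible way to prompt only when a variable is not already set. The helper name `read_setting` is hypothetical and is not part of whylogs; it only relies on the two environment variable names used in the cell above.

```python
import os


def read_setting(name, prompt, environ=os.environ):
    """Return environ[name]; call prompt() and store the answer if it is unset or empty."""
    value = environ.get(name)
    if not value:
        value = prompt()
        environ[name] = value
    return value


# Demonstration with a plain dict standing in for os.environ (hypothetical values).
demo_env = {"WHYLABS_DEFAULT_ORG_ID": "org-0"}
org_id = read_setting("WHYLABS_DEFAULT_ORG_ID", lambda: "never asked", demo_env)
api_key = read_setting("WHYLABS_API_KEY", lambda: "typed-by-user", demo_env)
print(org_id, api_key)
```

In the notebook itself, you would pass `input` as the prompt for the org ID and `getpass.getpass` for the API key, and leave `environ` at its default.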


# 📊 Profiling Input Data

We will first profile input data. 

Our dataframe contains transactions of one particular day. Let's log it as if it were for today.

Remember to input `datasetID` to point to the right model. If it's your first model, that would be `model-1`, for example.

In [7]:
# Run whylogs on the input features and upload the profile to WhyLabs.
now = datetime.datetime.now()
print("Enter your Dataset ID")
datasetID = input()
with session.logger(
    # Note: 'datasetId' in whylogs maps to 'model-id' that is provided when you set up a model in WhyLabs
    tags={"datasetId": datasetID}, dataset_timestamp=now
) as ylog:
    ylog.log_dataframe(df)

Using API key ID: xxtIbfnVKB


# Assembling metrics

As stated earlier, we need to log the prediction scores along with the predicted classes. The scores are needed to generate the ROC and precision-recall curves. scikit-learn's `predict_proba` gives us the score for each class. For example, in the output below, the model yields a score of 0.89 for class 0 and 0.11 for class 1 for the first prediction. Since the score for class 0 is higher, the predicted class is `0`.

In [8]:
predict_proba[:2]

array([[0.89, 0.11],
       [0.91, 0.09]])
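
To make the relationship between scores and classes concrete, here is a minimal sketch in plain Python that uses only the two rows printed above. It assumes a binary classifier whose classes are `0` and `1`, in that column order; the variable names are illustrative.

```python
# The two score rows shown above: one [class 0, class 1] pair per prediction.
predict_proba_rows = [[0.89, 0.11], [0.91, 0.09]]

# Predicted class: the index of the highest score in each row.
predicted_classes = [row.index(max(row)) for row in predict_proba_rows]

# Score of the predicted class: the highest score itself.
predicted_scores = [max(row) for row in predict_proba_rows]

print(predicted_classes)
print(predicted_scores)
```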

whylogs expects only the score of the predicted class. So let's create a `scores` list that keeps the higher of the two class scores for each row, which is the score of the predicted class.

In [9]:
df_metrics['prediction'] = predict_class

scores = [max(p) for p in predict_proba]

df_metrics['score'] = scores

df_metrics

Unnamed: 0,Purchase Canceled,prediction,score
0,0.0,0.0,0.89
1,0.0,0.0,0.91
2,0.0,0.0,0.94
3,0.0,0.0,1.00
4,0.0,0.0,0.88
...,...,...,...
832,0.0,0.0,0.82
833,0.0,0.0,0.93
834,0.0,0.0,0.95
835,1.0,0.0,0.98


Let's also cast our labels and predictions to integers, so that WhyLabs interprets the 1s as positives and the 0s as negatives. This matters when calculating metrics such as `precision` and `recall`.

In [10]:
# Cast labels and predictions from float to int
df_metrics["Purchase Canceled"] = df_metrics["Purchase Canceled"].astype(int)
df_metrics["prediction"] = df_metrics["prediction"].astype(int)
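
The sketch below illustrates why the choice of positive class matters, using the textbook definitions of precision and recall on a handful of made-up labels. It is not how WhyLabs computes these metrics internally; the function name and the sample values are hypothetical.

```python
def precision_recall(targets, predictions, positive=1):
    """Compute precision and recall, treating `positive` as the positive class."""
    pairs = list(zip(targets, predictions))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up integer labels: 1 = purchase canceled, 0 = purchase kept.
targets = [0, 0, 1, 1, 1, 0]
predictions = [0, 1, 1, 0, 1, 0]

precision, recall = precision_recall(targets, predictions)
print(precision, recall)
```

Swapping `positive` to `0` on the same data would measure how well the model identifies kept purchases instead, which is a different question.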


# 📊 Profiling Model Metrics

Notice that performance data is profiled with a different method, `log_metrics`. We also need to name the fields that represent the labels, the predictions, and the prediction scores.

In [11]:
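# Reuse the `datetime` module imported earlier. Re-importing the class under
# the same name would shadow the module and break earlier cells on rerun.
now = datetime.datetime.now()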

print("Enter your Dataset ID")
datasetID = input()

with session.logger(
    # Note: 'datasetId' in whylogs maps to 'model-id' that is provided when you set up a model in WhyLabs
    tags={"datasetId": datasetID}, dataset_timestamp=now
) as ylog:
    ylog.log_metrics(
        targets=df_metrics['Purchase Canceled'].tolist(),
        predictions=df_metrics['prediction'].tolist(),
        scores=df_metrics['score'].tolist(),
        target_field="Purchase Canceled",
        prediction_field="prediction",
        score_field="Normalized Prediction Probability",
    )

Enter your Dataset ID


In [12]:
# Close the session once we're done.
session.close()

# 🔍 Inspecting your Model's performance

We have shown how to log the performance metrics for a single day. We repeated the process for a number of consecutive days, so that the calculated metrics can be inspected on a daily basis in the model's dashboard at WhyLabs:

![alt text](images/classification_metrics.png)
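
The tutorial does not show how the additional days were logged. As one possible approach, the sketch below generates one timestamp per consecutive day, ending at a chosen date. The helper name `daily_timestamps` and the example date are hypothetical; each timestamp could then be passed as `dataset_timestamp` in the logging cells above.

```python
import datetime


def daily_timestamps(num_days, end=None):
    """Return `num_days` consecutive daily timestamps, oldest first, ending at `end`."""
    end = end or datetime.datetime.now()
    return [
        end - datetime.timedelta(days=offset)
        for offset in range(num_days - 1, -1, -1)
    ]


# Example with a fixed end date so the result is reproducible.
timestamps = daily_timestamps(3, end=datetime.datetime(2021, 1, 3))
for ts in timestamps:
    # In the notebook, this is where a logger would be opened with dataset_timestamp=ts.
    print(ts.date())
```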


Judging by accuracy alone, our model appears to perform well. A closer look at the other metrics, however, shows that it is actually performing rather poorly. The confusion matrix reveals that our data is highly imbalanced, which is why accuracy gives such a misleading picture.
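
A small worked example shows how this can happen. The counts below are hypothetical and are not taken from the retail dataset: they describe a day with 95 kept purchases and 5 cancellations, and a degenerate model that never predicts a cancellation.

```python
# Hypothetical imbalanced day: 95 kept purchases (0) and 5 cancellations (1).
targets = [0] * 95 + [1] * 5
# A degenerate model that never predicts a cancellation.
predictions = [0] * 100

pairs = list(zip(targets, predictions))
accuracy = sum(t == p for t, p in pairs) / len(targets)
true_positives = sum(t == 1 and p == 1 for t, p in pairs)
recall = true_positives / sum(t == 1 for t in targets)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

Accuracy comes out at 0.95 even though the model finds none of the cancellations, so recall is 0.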

The dashboard presents other metrics as well. For classification tasks, the following metrics are tracked:

- Total output and input count
- Accuracy
- ROC
- Precision-Recall chart
- Confusion Matrix
- Recall
- FPR (false positive rate)
- Precision
- F1
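
Several of the metrics listed above derive from the four cells of the confusion matrix. The sketch below uses the standard definitions of FPR and F1 on made-up binary labels; it is meant as a reading aid, not as a description of how WhyLabs computes them. The function names are hypothetical.

```python
from collections import Counter


def confusion_counts(targets, predictions):
    """Count (target, prediction) pairs for binary labels 0 and 1."""
    counts = Counter(zip(targets, predictions))
    return {
        "tn": counts[(0, 0)],
        "fp": counts[(0, 1)],
        "fn": counts[(1, 0)],
        "tp": counts[(1, 1)],
    }


def fpr_and_f1(c):
    """False positive rate = FP / (FP + TN); F1 = 2TP / (2TP + FP + FN)."""
    negatives = c["fp"] + c["tn"]
    fpr = c["fp"] / negatives if negatives else 0.0
    denominator = 2 * c["tp"] + c["fp"] + c["fn"]
    f1 = 2 * c["tp"] / denominator if denominator else 0.0
    return fpr, f1


# Made-up labels: 1 = purchase canceled, 0 = purchase kept.
targets = [0, 0, 0, 0, 1, 1, 1, 0]
predictions = [0, 1, 0, 0, 1, 0, 1, 0]

counts = confusion_counts(targets, predictions)
fpr, f1 = fpr_and_f1(counts)
print(counts, fpr, f1)
```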


You're free to inspect the rest of the metrics in your own dashboard. For more information, see https://docs.whylabs.ai/docs/performance-metrics#classification.