# Introduction to unit metrics


In this notebook, we introduce the concept of _Unit Metric_ and provide a step-by-step guide on how to define, execute and extract results from these measures. As an example, we use data from a customer churn use case to fit a binary classification model. To illustrate the application of these measures, we show how to run sklearn classification metrics as unit metrics, demonstrating their utility in quantifying model performance and risk.

In Model Risk Management (MRM), the primary objective is to identify, assess, and mitigate the risks associated with the development, implementation, and ongoing use of quantitative models. Measuring risk involves understanding and assessing the evidence generated through multiple tests across all stages of the model development lifecycle, from data collection and data quality to model performance and explainability.

### Evidence vs Risk

The distinction between evidence and quantifiable risk measures is a critical aspect of MRM. Evidence, in this context, refers to the outputs from various tests conducted throughout the model lifecycle. For instance, a table displaying the number of missing values per feature in a dataset is a form of evidence. It shows where data might be incomplete, which can affect the model's performance and reliability. Similarly, a Receiver Operating Characteristic (ROC) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The curve is evidence of the model's classification performance.

However, these pieces of evidence do not offer a direct measure of risk. To quantify risk, one must derive metrics from this evidence that reflect the potential impact on the model's performance and the decisions it informs. For example, the missing data rate, calculated as the percentage of missing values in the dataset, is a quantifiable risk measure that indicates the risk associated with data quality. Similarly, the accuracy score, which measures the proportion of correctly classified labels, acts as an indicator of performance risk in a classification model.
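
As a minimal sketch of this distinction, the plain-Python snippet below first builds a piece of evidence (a per-feature table of missing values) and then reduces it to a single risk measure (the missing data rate). It also computes accuracy from a list of labels. The records, column names, and labels are made up for illustration and are not part of the customer churn dataset.

```python
# Toy records; None marks a missing value. All values are made up.
rows = [
    {"age": 34, "balance": 1200.0, "tenure": 5},
    {"age": None, "balance": 560.0, "tenure": 2},
    {"age": 51, "balance": None, "tenure": None},
    {"age": 29, "balance": 80.0, "tenure": 1},
]

# Evidence: a table with the number of missing values per feature.
missing_per_feature = {
    feature: sum(1 for row in rows if row[feature] is None)
    for feature in rows[0]
}

# Risk measure: a single number that summarizes the table above.
total_cells = len(rows) * len(rows[0])
missing_data_rate = 100 * sum(missing_per_feature.values()) / total_cells

# Risk measure for performance: proportion of correctly classified labels.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(missing_per_feature)
print(f"missing data rate: {missing_data_rate:.1f}%")
print(f"accuracy: {accuracy:.2f}")
```

The table is useful for diagnosis, while the two numbers are what can be tracked against a threshold over time.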

### Unit Metric

A _Unit Metric_ is a single-value measure used to identify and monitor risks arising from the development of Machine Learning or AI models. It condenses evidence into a single actionable number that can be monitored and compared over time or across different models or datasets.

**Properties**

- They are the fundamental computation unit that returns a single value.
- They quantify risk and can be used to monitor and assess risks associated with a model's entire lifecycle.
- They are measurable, relevant, and linked to risk areas and critical business processes, e.g., regulatory requirements, risk appetite, model performance, data quality.
- They are standalone in nature, meaning they do not rely on other metrics for their calculation or interpretation.

Incorporating Unit Metrics into your ML workflow streamlines risk assessment, turning complex analyses into clear, actionable insights.


## Initialize the client library

ValidMind generates a unique _code snippet_ for each registered model to connect with your developer environment. You initialize the client library with this code snippet, which ensures that your documentation and tests are uploaded to the correct model when you run the notebook.

Get your code snippet:

1. In a browser, log into the [Platform UI](https://app.prod.validmind.ai).

2. In the left sidebar, navigate to **Model Inventory** and click **+ Register new model**.

3. Enter the model details and click **Continue**. ([Need more help?](https://docs.validmind.ai/guide/register-models-in-model-inventory.html))

   For example, to register a model for use with this notebook, select:

   - Documentation template: `Binary classification`
   - Use case: `Marketing/Sales - Attrition/Churn Management`

   You can fill in other options according to your preference.

4. Go to **Getting Started** and click **Copy snippet to clipboard**.

Next, replace this placeholder with your own code snippet:


In [None]:
# Replace with your code snippet

import validmind as vm

vm.init(
    api_host="https://api.prod.validmind.ai/api/v1/tracking",
    api_key="...",
    api_secret="...",
    project="...",
)

## Notebook Setup


In [None]:
import xgboost as xgb

%matplotlib inline

## Load the demo dataset

In this example, we load a demo dataset to fit a customer churn model.


In [None]:
from validmind.datasets.classification import customer_churn as demo_dataset

print(
    f"Loaded demo dataset with: \n\n\t• Target column: '{demo_dataset.target_column}' \n\t• Class labels: {demo_dataset.class_labels}"
)

raw_df = demo_dataset.load_data()
raw_df.head()

## Train a model for testing

We train a simple customer churn model for our test.


In [None]:
train_df, validation_df, test_df = demo_dataset.preprocess(raw_df)

x_train = train_df.drop(demo_dataset.target_column, axis=1)
y_train = train_df[demo_dataset.target_column]
x_val = validation_df.drop(demo_dataset.target_column, axis=1)
y_val = validation_df[demo_dataset.target_column]

model = xgb.XGBClassifier(early_stopping_rounds=10)
model.set_params(
    eval_metric=["error", "logloss", "auc"],
)
model.fit(
    x_train,
    y_train,
    eval_set=[(x_val, y_val)],
    verbose=False,
)

In [None]:
feature_columns = [col for col in test_df.columns if col != demo_dataset.target_column]
feature_columns

## Compute Predictions

After the model is fitted, we compute model predictions and predictive probabilities, then add them to the customer churn dataset:


In [None]:
# Compute predictive probabilities for the test dataset
# Here, we only use the probabilities for the positive class (class 1)
predictive_probabilities = model.predict_proba(
    test_df.drop(demo_dataset.target_column, axis=1)
)[:, 1]

# Add the predictive probabilities as a new column to the test dataframe
test_df["PredictiveProbabilities"] = predictive_probabilities

# Add the predictions from the predictive probabilities as a new column to the test dataframe
test_df["Predictions"] = (predictive_probabilities > 0.5).astype(int)

# Display the first few rows of the updated dataframe to verify
test_df.head()
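
The 0.5 cutoff used above is a modelling choice rather than a fixed rule. The sketch below uses made-up probabilities in place of the output of `predict_proba` and a hypothetical helper, `to_labels`, to show how the cutoff determines the class labels that later feed the classification metrics.

```python
# Made-up positive-class probabilities, standing in for predict_proba(...)[:, 1].
probabilities = [0.10, 0.45, 0.50, 0.62, 0.91]


def to_labels(probs, threshold=0.5):
    """Label a row 1 when its probability is strictly above the threshold."""
    return [int(p > threshold) for p in probs]


labels_default = to_labels(probabilities)
labels_strict = to_labels(probabilities, threshold=0.7)

print(labels_default)
print(labels_strict)
```

Note that a probability exactly equal to the cutoff is labelled 0, because the comparison is strict.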

## Initialize ValidMind objects

Once the datasets and model are prepared for validation, we initialize the ValidMind `dataset` and `model` objects, specifying the feature and target columns. The `input_id` property uniquely identifies each dataset and model. This makes it possible to create multiple versions of datasets and models and to compute metrics by specifying which versions to use as inputs.


In [None]:
import validmind as vm

vm_test_ds = vm.init_dataset(
    input_id="test_dataset",
    dataset=test_df,
    target_column=demo_dataset.target_column,
    feature_columns=feature_columns,
)

vm_model = vm.init_model(model=model, input_id="my_model")

## Assign Predictions


**Assigning Pre-computed Predictions**

We use `vm_test_ds` to incorporate a column named 'Predictions', which consists of pre-computed predictions associated with `vm_model`. The `assign_predictions` method lets you add multiple prediction columns as necessary. Linking these pre-computed predictions to a specific model through this method establishes a clear reference system, so the predictions needed for a given computation can be identified precisely.


In [None]:
vm_test_ds.assign_predictions(model=vm_model, prediction_column="Predictions")

In [None]:
vm_test_ds._extra_columns

## Running Unit Metrics


### Computing F1 Score

The following snippet shows how to set up and execute a unit metric implementation of the F1 score from `sklearn`. In this example, our objective is to compute F1 for the test dataset. Therefore, we specify `vm_test_ds` as the dataset in the inputs along with the `metric_id`.

**Dataset to Metric Input Mapping**

To compute the F1 score accurately, the prediction and target columns must be correctly aligned and contain the relevant data. The F1 score requires two inputs:

- the predictions `y_pred` and
- the true labels `y_true`

`vm_test_ds` can hold multiple prediction columns, each linked to a different model. It is therefore essential to specify both the dataset, from which the target column is extracted, and the model, here `vm_model`, so that the prediction column belonging to that specific model is selected.

When calculating the F1 score, it's essential to use the correct prediction column associated with `vm_model` within `vm_test_ds`. This prediction column is dynamically identified based on the model id, specified in `input_id`.
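
To make the mapping concrete, here is a plain-Python sketch that does not use the ValidMind API. The `dataset` dictionary and the `f1_for_model` helper are hypothetical stand-ins: the dictionary stores one prediction column per model id, and the helper selects the column for the requested model before computing the binary F1 score from the true-positive, false-positive, and false-negative counts.

```python
# Hypothetical stand-in for a dataset that stores one prediction column per model.
dataset = {
    "target": [1, 0, 1, 1, 0, 0],
    "predictions": {
        "my_model": [1, 0, 0, 1, 1, 0],
        "other_model": [1, 1, 1, 1, 0, 0],
    },
}


def f1_for_model(data, model_id):
    """Binary F1 for the prediction column registered under model_id."""
    y_true = data["target"]
    y_pred = data["predictions"][model_id]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denominator = 2 * tp + fp + fn
    # Return 0.0 when there are no positives at all, to avoid dividing by zero.
    return 2 * tp / denominator if denominator else 0.0


print(f1_for_model(dataset, "my_model"))
print(f1_for_model(dataset, "other_model"))
```

The same target column yields a different score for each model id, which is why both the dataset and the model have to be passed as inputs.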


In [None]:
metric_id = "validmind.unit_metrics.sklearn.classification.F1"

inputs = {"model": vm_model, "dataset": vm_test_ds}

result = vm.unit_metrics.run_metric(
    metric_id=metric_id,
    inputs=inputs,
)

### Accessing Metric Results

Once the metric computation is complete, the result object provides two key attributes, `value` and `summary`:


In [None]:
result.value

In [None]:
result.summary

### Passing Parameters

When using the unit metric implementation of the F1 score from `sklearn`, it's important to note that this implementation supports all parameters of the original `sklearn.metrics.f1_score` function. This flexibility allows you to tailor the metric computation to your specific needs and scenarios.

Below, we provide a brief description of the parameters you can pass to customize the F1 score calculation:

- `average`: Specifies the averaging method for the F1 score. Options include 'binary' (the `sklearn` default), 'micro', 'macro', 'samples', 'weighted', or None.
- `sample_weight`: Allows for weighting of samples. By default, it is None, but it can be an array of weights applied to the individual samples, useful for cases where some observations are more important than others.
- `zero_division`: Defines the value returned when a division by zero occurs during the F1 calculation. Options are 'warn', which returns 0 and emits a warning, or a numeric value such as 0 or 1 to return without a warning.
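
To illustrate what the `average` parameter changes, the sketch below recomputes F1 by hand on made-up labels using the textbook definitions. It is not the `sklearn` implementation, and the helper names `counts` and `f1` are hypothetical. Under these definitions, macro averaging takes the mean of the per-class scores, while micro averaging pools the counts across classes first, which for single-label problems equals accuracy.

```python
y_true = [1, 0, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]


def counts(label):
    """True positives, false positives, false negatives for one class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    return tp, fp, fn


def f1(tp, fp, fn):
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


per_class = {label: counts(label) for label in (0, 1)}

# "binary": score of the positive class only.
binary_f1 = f1(*per_class[1])

# "macro": unweighted mean of the per-class scores.
macro_f1 = sum(f1(*c) for c in per_class.values()) / len(per_class)

# "micro": pool the counts over all classes, then compute one score.
pooled = [sum(values) for values in zip(*per_class.values())]
micro_f1 = f1(*pooled)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"binary={binary_f1:.4f} macro={macro_f1:.4f} micro={micro_f1:.4f}")
```

Because the three variants can differ noticeably on imbalanced data, it helps to record the chosen `average` alongside the reported value.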


In [None]:
inputs = {"model": vm_model, "dataset": vm_test_ds}

params = {"average": "micro", "sample_weight": None, "zero_division": "warn"}

result = vm.unit_metrics.run_metric(metric_id=metric_id, inputs=inputs, params=params)
result.value

### Loading the Last Computed Value

Unit metrics are designed to optimize performance and efficiency by caching the results of metric computations. When you execute a metric a second time with the same signature (a unique combination of the metric ID, model, inputs, and parameters), validmind retrieves the last computed value instead of recalculating it. This feature ensures faster access to metrics you've previously run and conserves computational resources.


**First Computation of Precision Metric**

In this first example, the precision metric is computed for the first time with a specific dataset. The result of this computation is stored in the cache.


In [None]:
metric_id = "validmind.unit_metrics.sklearn.classification.Precision"

inputs = {"model": vm_model, "dataset": vm_test_ds}

result = vm.unit_metrics.run_metric(
    metric_id=metric_id,
    inputs=inputs,
)
result.value

In [None]:
result.summary

**Second Computation with the Same Signature**

In this second example, the same precision metric computation is requested again with the identical inputs. Since the signature (metric ID and inputs) matches the previous run, validmind loads the result directly from the cache instead of recomputing it.


In [None]:
result = vm.unit_metrics.run_metric(
    metric_id=metric_id,
    inputs=inputs,
)
result.value

**Computation with a Changed Signature**

In this example, the signature is modified by adding parameters. This change prompts validmind to compute the metric anew, as the new signature does not match any stored result. The outcome is then cached, ready for any future requests with the same signature.


In [None]:
inputs = {"dataset": vm_test_ds, "model": vm_model}

params = {"average": "micro", "sample_weight": None, "zero_division": "warn"}

result = vm.unit_metrics.run_metric(
    metric_id=metric_id,
    inputs=inputs,
    params=params,  # passing params changes the signature
)
result.value

### Unit Metrics for Model Performance


In [None]:
metric_id = "validmind.unit_metrics.sklearn.classification.Accuracy"

inputs = {"model": vm_model, "dataset": vm_test_ds}

result = vm.unit_metrics.run_metric(
    metric_id=metric_id,
    inputs=inputs,
)
result.value

In [None]:
result.summary

In [None]:
metric_id = "validmind.unit_metrics.sklearn.classification.Recall"

inputs = {"model": vm_model, "dataset": vm_test_ds}

result = vm.unit_metrics.run_metric(
    metric_id=metric_id,
    inputs=inputs,
)
result.value

In [None]:
result.summary

In [None]:
metric_id = "validmind.unit_metrics.sklearn.classification.ROC_AUC"

inputs = {"model": vm_model, "dataset": vm_test_ds}

result = vm.unit_metrics.run_metric(
    metric_id=metric_id,
    inputs=inputs,
)
result.value

In [None]:
result.summary

## Composing Complex Metrics from Individual Unit Metrics

### Run Multiple Unit Metrics as a Single "Test"

So far, we have run individual unit metrics on their own. In a typical use case, however, you will likely want to compose multiple unit metrics into a more complex metric. For instance, we may want to compose the above metrics (`f1_score`, `precision`, `recall`, `accuracy` and `roc_auc`) into a single tabular display showing the overall model performance. This can be done with the `run_test` function, which lets us run all these metrics at the same time, display the results in a single output, customize the output using HTML templates, and finally save the result as a single composite metric to the ValidMind platform. Let's see how we can do this.

In [None]:
from validmind.tests import run_test

result = run_test(
    name="ModelPerformance",
    unit_metrics=[
        "validmind.unit_metrics.sklearn.classification.F1",
        "validmind.unit_metrics.sklearn.classification.Precision",
        "validmind.unit_metrics.sklearn.classification.Recall",
        "validmind.unit_metrics.sklearn.classification.Accuracy",
        "validmind.unit_metrics.sklearn.classification.ROC_AUC"
    ],
    inputs=inputs,
)
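
Conceptually, a composite metric is a collection of single-value metrics evaluated on the same inputs and presented together. The plain-Python sketch below illustrates that idea with made-up labels and hypothetical helper functions. It does not reproduce how `run_test` builds or renders its output.

```python
y_true = [1, 0, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]


def confusion(yt, yp):
    """Return (tp, tn, fp, fn) for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(yt, yp))
    tn = sum(t == 0 and p == 0 for t, p in zip(yt, yp))
    fp = sum(t == 0 and p == 1 for t, p in zip(yt, yp))
    fn = sum(t == 1 and p == 0 for t, p in zip(yt, yp))
    return tp, tn, fp, fn


def accuracy(yt, yp):
    tp, tn, _, _ = confusion(yt, yp)
    return (tp + tn) / len(yt)


def precision(yt, yp):
    tp, _, fp, _ = confusion(yt, yp)
    return tp / (tp + fp) if tp + fp else 0.0


def recall(yt, yp):
    tp, _, _, fn = confusion(yt, yp)
    return tp / (tp + fn) if tp + fn else 0.0


# Each entry returns a single value; together they form one table.
unit_metrics = {"Accuracy": accuracy, "Precision": precision, "Recall": recall}

composite = {
    name: round(metric(y_true, y_pred), 4) for name, metric in unit_metrics.items()
}

for name, value in composite.items():
    print(f"{name:<10}{value:.4f}")
```

Keeping each metric as an independent function mirrors the standalone property of unit metrics: any of them can be run alone or as part of the table.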

If we take a look at the `result_id` for the result, we'll see that it is a unique identifier that starts with `validmind.composite_metric.<user-supplied-metric-name>`. This will be used to identify this result as coming from a composite metric and is used to rebuild the composite metric as we will see in the next section.

In [None]:
result.result_id

Let's go ahead and log the result to save it to the ValidMind platform.

In [None]:
result.log()

### Adding Composite Metrics to the Documentation Template

Now that we have run and logged the composite metric, the result and the metadata required to reconstruct it are stored in the ValidMind platform. You can now visit the documentation project that you connected to at the beginning of this notebook and add a new content block in the relevant section.

To do this, go to the documentation page of the `[Demo] Customer Churn Model - Initial Validation` project and navigate to the `Model Development` -> `Model Evaluation` section. Then hover between any two existing content blocks to reveal the `+` button as shown in the screenshot below.

![screenshot showing insert button for test-driven blocks](../images/insert-test-driven-block.png)

Click on the `+` button and select `Test-Driven Block`. This will open a dialog where you can select `Metric` as the type of the test-driven content block, and then select the `Validmind Composite Metric Model Performance` metric. This will show a preview of the composite metric and it should match the results shown above.

![screenshot showing the selected composite metric in the dialog](../images/selecting-composite-metric.png)

Finally, click on the `Insert block` button to add the composite metric to the documentation. You'll see the composite metric displayed in the documentation and now anytime you run `run_documentation_tests()`, the `Model Performance` composite metric will be run as part of the test suite. Let's go ahead and connect to the documentation project and run the tests.

#### Reconnect to the Documentation Project

In [None]:
# Replace with your code snippet like you did in the first cell of this notebook

import validmind as vm

vm.init(
    api_host="https://api.prod.validmind.ai/api/v1/tracking",
    api_key="...",
    api_secret="...",
    project="...",
)

Now that we have reconnected, we can run `vm.preview_template()` to see that our new composite metric has been added to the documentation.

In [None]:
vm.preview_template()

Let's go ahead and run `vm.run_documentation_tests()` to run the `model_evaluation` section of the documentation that includes the Model Performance composite metric that we just added. You should see the result in the output as well as in the documentation page on the ValidMind platform.

In [None]:
res = vm.run_documentation_tests(
    inputs=inputs,
    section="model_evaluation",
    fail_fast=True,
)