# Introduction to Unit Metrics

In this notebook  ...

In Model Risk Management (MRM), the primary objective is to identify, assess, and mitigate the risks associated with the development, implementation, and ongoing use of quantitative models. Measuring risk involves understanding and assessing the evidence generated through multiple tests across all stages of the model development lifecycle, from data collection and data quality to model performance and explainability.

### Evidence vs Risk

The distinction between evidence and quantifiable risk measures is a critical aspect of MRM. Evidence, in this context, refers to the outputs from various tests conducted throughout the model lifecycle. For instance, a table displaying the number of missing values per feature in a dataset is a form of evidence. It shows where data might be incomplete, which can affect the model's performance and reliability. Similarly, a Receiver Operating Characteristic (ROC) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The curve is evidence of the model's classification performance.

However, these pieces of evidence do not offer a direct measure of risk. To quantify risk, one must derive metrics from this evidence that reflect the potential impact on the model's performance and the decisions it informs. For example, the missing data rate, calculated as the percentage of missing values in the dataset, is a quantifiable risk measure that indicates the risk associated with data quality. Similarly, the accuracy score, which measures the proportion of correctly classified labels, acts as an indicator of performance risk in a classification model.
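
To make the distinction concrete, the sketch below starts from a small, made-up table of records (the evidence) and reduces it to two single numbers: a missing data rate and an accuracy score. The records, labels, and helper names (`missing_data_rate`, `accuracy`) are illustrative assumptions and are not part of the ValidMind library.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"CreditScore": 619, "Balance": 0.0, "Tenure": 2},
    {"CreditScore": None, "Balance": 83807.86, "Tenure": 1},
    {"CreditScore": 502, "Balance": None, "Tenure": None},
    {"CreditScore": 699, "Balance": 0.0, "Tenure": 1},
]

# Evidence: number of missing values per feature (a table, not a single number).
missing_per_feature = {
    col: sum(1 for row in records if row[col] is None)
    for col in records[0]
}


def missing_data_rate(rows):
    """Risk measure: percentage of all cells that are missing."""
    total_cells = sum(len(row) for row in rows)
    missing_cells = sum(
        1 for row in rows for value in row.values() if value is None
    )
    return 100.0 * missing_cells / total_cells


def accuracy(y_true, y_pred):
    """Risk measure: proportion of correctly classified labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical true labels and predictions.
y_true = [1, 0, 1, 0]
y_pred = [1, 0, 0, 0]

print(missing_per_feature)         # evidence
print(missing_data_rate(records))  # single-value data quality measure
print(accuracy(y_true, y_pred))    # single-value performance measure
```

The per-feature counts still need interpretation, whereas each of the two derived numbers can be tracked or compared directly.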

### Unit Metric

A *Unit Metric* is a single-value measure used to identify and monitor risks arising from the development of Machine Learning or AI models. It condenses evidence into a single actionable number that can be monitored and compared over time or across different models or datasets.

**Properties**
- It is the fundamental unit of computation and returns a single value.
- It quantifies risk and can be used to monitor and assess risks across a model's entire lifecycle.
- It is measurable, relevant, and linked to risk areas and critical business processes, e.g., regulatory requirements, risk appetite, model performance, data quality.
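
Because a unit metric is a single number, comparing it across datasets reduces to simple arithmetic. The following is a minimal sketch of that idea; the accuracy values, the `MAX_DROP` tolerance, and the `flag_degradation` helper are all hypothetical and only illustrate how such a comparison might look.

```python
# Hypothetical accuracy values for the same model on three datasets.
accuracy_by_dataset = {"train": 0.91, "validation": 0.88, "test": 0.84}

# Hypothetical tolerance: largest acceptable drop relative to the baseline.
MAX_DROP = 0.05


def flag_degradation(values, baseline_key, max_drop):
    """Flag every dataset whose value falls more than max_drop below the baseline."""
    baseline = values[baseline_key]
    return {
        name: (baseline - value) > max_drop
        for name, value in values.items()
        if name != baseline_key
    }


flags = flag_degradation(accuracy_by_dataset, "train", MAX_DROP)
print(flags)
```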

## Notebook Setup

In [1]:
import xgboost as xgb

%matplotlib inline

## Initialize the client library

Every documentation project in the Platform UI comes with a _code snippet_ that lets the client library associate your documentation and tests with the right project on the Platform UI when you run this notebook. As you will see later, documentation projects are useful because they act as containers for model documentation and validation reports and they enable you to organize all of your documentation work in one place. 

Get your code snippet by creating a documentation project:

1. In a browser, log into the [Platform UI](https://app.prod.validmind.ai).

2. Go to **Documentation Projects** and click **Create new project**.

3. Select **`[Demo] Customer Churn Model`** and **`Initial Validation`** for the model name and type, give the project a unique name to make it yours, and then click **Create project**.

4. Go to **Documentation Projects** > **YOUR_UNIQUE_PROJECT_NAME** > **Getting Started** and click **Copy snippet to clipboard**.

Next, replace this placeholder with your own code snippet:

In [2]:
import validmind as vm

vm.init(
  api_host = "https://api.dev.vm.validmind.ai/api/v1/tracking",
  api_key = "...",
  api_secret = "...",
  project = "..."
)

2024-02-28 14:03:52,504 - INFO(validmind.api_client): Connected to ValidMind. Project: Customer Churn Demo (1.4) - Initial Validation-demo (clnup756d051w15lf2dmzywvf)


## Load the demo dataset

In [3]:
from validmind.datasets.classification import customer_churn as demo_dataset

print(f"Loaded demo dataset with: \n\n\t• Target column: '{demo_dataset.target_column}' \n\t• Class labels: {demo_dataset.class_labels}")

raw_df = demo_dataset.load_data()
raw_df.head()

Loaded demo dataset with: 

	• Target column: 'Exited' 
	• Class labels: {'0': 'Did not exit', '1': 'Exited'}


Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


## Train a model for testing

We train a simple customer churn model for our test.

In [4]:
train_df, validation_df, test_df = demo_dataset.preprocess(raw_df)

x_train = train_df.drop(demo_dataset.target_column, axis=1)
y_train = train_df[demo_dataset.target_column]
x_val = validation_df.drop(demo_dataset.target_column, axis=1)
y_val = validation_df[demo_dataset.target_column]

model = xgb.XGBClassifier(early_stopping_rounds=10)
model.set_params(
    eval_metric=["error", "logloss", "auc"],
)
model.fit(
    x_train,
    y_train,
    eval_set=[(x_val, y_val)],
    verbose=False,
)

In [5]:
type(test_df['Exited'])

pandas.core.series.Series

In [6]:
feature_columns = [col for col in test_df.columns if col != demo_dataset.target_column]
feature_columns

['CreditScore',
 'Gender',
 'Age',
 'Tenure',
 'Balance',
 'NumOfProducts',
 'HasCrCard',
 'IsActiveMember',
 'EstimatedSalary',
 'Geography_France',
 'Geography_Germany',
 'Geography_Spain']

## Compute Predictions

In [7]:
# Compute predictive probabilities for the test dataset
# Here, we only use the probabilities for the positive class (class 1)
predictive_probabilities = model.predict_proba(test_df.drop(demo_dataset.target_column, axis=1))[:, 1]

# Add the predictive probabilities as a new column to the test dataframe
test_df['PredictiveProbabilities'] = predictive_probabilities

# Add the predictions from the predictive probabilities as a new column to the test dataframe
test_df['Predictions'] = (predictive_probabilities > 0.5).astype(int)

# Display the first few rows of the updated dataframe to verify
test_df.head()

Unnamed: 0,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_France,Geography_Germany,Geography_Spain,PredictiveProbabilities,Predictions
2394,471,0,27,4,0.0,2,1,0,122642.09,0,1,0,0,0.033179,0
5298,713,0,37,8,0.0,1,1,1,16403.41,0,1,0,0,0.18744,0
29,520,0,42,6,0.0,2,1,1,34410.55,0,0,0,1,0.081376,0
6185,649,0,41,3,130931.83,1,1,1,144808.37,0,1,0,0,0.217643,0
2856,639,0,41,5,98635.77,1,1,0,199970.74,0,0,1,0,0.424413,0


## Initialize ValidMind objects

In [8]:
import validmind as vm

vm_test_ds = vm.init_dataset(
    
    input_id='test_dataset',
    dataset=test_df, 
    target_column=demo_dataset.target_column,
    feature_columns=feature_columns,
    
)

vm_model = vm.init_model(

    model=model,
    input_id="my_model"
    
)

2024-02-28 14:03:52,739 - INFO(validmind.client): Pandas dataset detected. Initializing VM Dataset instance...


In [9]:
vm_test_ds.assign_predictions(
    
    model=vm_model, 
    prediction_column='Predictions'
    
)

In [10]:
vm_test_ds._extra_columns

{'prediction_columns': {'my_model': 'Predictions'}, 'group_by_column': None}

In [11]:
vm_test_ds.feature_columns

['CreditScore',
 'Gender',
 'Age',
 'Tenure',
 'Balance',
 'NumOfProducts',
 'HasCrCard',
 'IsActiveMember',
 'EstimatedSalary',
 'Geography_France',
 'Geography_Germany',
 'Geography_Spain']

In [12]:
vm_test_ds._df

Unnamed: 0,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_France,Geography_Germany,Geography_Spain,PredictiveProbabilities,Predictions
2394,471.0,0.0,27.0,4.0,0.00,2.0,1.0,0.0,122642.09,0.0,1.0,0.0,0.0,0.033179,0.0
5298,713.0,0.0,37.0,8.0,0.00,1.0,1.0,1.0,16403.41,0.0,1.0,0.0,0.0,0.187440,0.0
29,520.0,0.0,42.0,6.0,0.00,2.0,1.0,1.0,34410.55,0.0,0.0,0.0,1.0,0.081376,0.0
6185,649.0,0.0,41.0,3.0,130931.83,1.0,1.0,1.0,144808.37,0.0,1.0,0.0,0.0,0.217643,0.0
2856,639.0,0.0,41.0,5.0,98635.77,1.0,1.0,0.0,199970.74,0.0,0.0,1.0,0.0,0.424413,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6637,556.0,1.0,36.0,7.0,154872.08,2.0,1.0,1.0,32044.64,0.0,1.0,0.0,0.0,0.050024,0.0
3726,623.0,1.0,43.0,1.0,0.00,2.0,1.0,1.0,146379.30,0.0,1.0,0.0,0.0,0.046449,0.0
3501,721.0,1.0,68.0,4.0,136525.99,1.0,0.0,0.0,175399.14,0.0,0.0,1.0,0.0,0.921723,1.0
5159,850.0,1.0,35.0,9.0,102050.47,1.0,1.0,1.0,3769.71,0.0,1.0,0.0,0.0,0.030381,0.0


## Running Unit Metrics

### Computing F1 Score 

The following snippet shows how to set up and execute a unit metric implementation of the F1 score from `sklearn`. In this example, our objective is to compute F1 for the test dataset. Therefore, we specify `vm_test_ds` as the dataset in the inputs along with the `metric_id`. 

**Dataset to Metric Input Mapping**

The F1 score requires two inputs, and the dataset columns that supply them must be correctly aligned and contain the relevant data:

- the predictions `y_pred` and 
- the true labels `y_true`
 
By selecting `vm_test_ds`, we include the columns needed to calculate the F1 score from `sklearn`: the predictions `y_pred`, mapped to `vm_test_ds.prediction_column` (here the column 'Predictions'), and the true labels `y_true`, mapped to `vm_test_ds.target_column` (here the column 'Exited').

In [13]:
metric_id = "validmind.metrics.sklearn.classification.F1"

inputs = {"dataset": vm_test_ds}

result = vm.run_metric(

    metric_id=metric_id, 
    inputs=inputs,
    
)

Computing metric value for 'validmind.metrics.sklearn.classification.F1'
y_pred obtained from pre-computed predictions in dataset column 'Predictions' from 'my_model'
y_true obtained from column 'Exited'
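
For reference, the quantity being computed can be sketched in plain Python. This is a minimal illustration of the binary F1 definition (the harmonic mean of precision and recall for the positive class), not the code that ValidMind or `sklearn` runs; the function name `binary_f1` and the toy labels are assumptions.

```python
def binary_f1(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: the score is taken to be 0 in this sketch.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

print(binary_f1(y_true, y_pred))
```

Here precision and recall are both 2/3, so the F1 score is also 2/3.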


### Accessing Metric Results

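Once the metric computation is complete, the result object provides two key attributes: `value`, which holds the raw metric value, and `summary`, which holds a rounded value keyed by the metric name.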

In [14]:
result.metric.value

0.5992217898832685

In [15]:
result.metric.summary

{'F1': '0.60'}

### Computing F1 from Model Predictions

If the predictions are not already included in your dataset, you can pass the `model` directly as an input. Predictions are then generated on the fly and used for the metric calculation, which lets you integrate model outputs into the validation process without preparing a prediction column first.

Below is an example of how to implement this, by providing both the `model` and the `dataset`:

In [16]:
metric_id = "validmind.metrics.sklearn.classification.F1"

inputs = {
    "model": vm_model,
    "dataset": vm_test_ds
}

result = vm.run_metric(

    metric_id=metric_id, 
    inputs=inputs,
    
)
result.metric.value

Computing metric value for 'validmind.metrics.sklearn.classification.F1'
y_pred computed directly from model 'my_model'
y_true obtained from column 'Exited'


0.5992217898832685

### Passing Parameters

When using the unit metric implementation of the F1 score from `sklearn`, it's important to note that this implementation supports all parameters of the original `sklearn.metrics.f1_score` function. This flexibility allows you to tailor the metric computation to your specific needs and scenarios. 

Below, we provide a brief description of the parameters you can pass to customize the F1 score calculation:

- `average`: Specifies the averaging method for the F1 score. Options are 'binary' (the `sklearn` default), 'micro', 'macro', 'samples', 'weighted', or None.
- `sample_weight`: Allows for weighting of samples. By default, it is None, but it can be an array of per-sample weights, useful for cases where some samples should count more than others.
- `zero_division`: Defines the value returned when a division by zero occurs during the F1 calculation. Options are 'warn' (the default, which returns 0 and raises a warning) or a numeric value such as 0 or 1; recent `sklearn` versions also accept `np.nan`.

In [17]:
metric_id = "validmind.metrics.sklearn.classification.F1"

inputs = {"dataset": vm_test_ds}

params = {
    "average": "micro",
    "sample_weight": None,
    "zero_division": "warn"
}

result = vm.run_metric(

    metric_id=metric_id, 
    inputs=inputs,
    params=params
    
)
result.metric.value

Computing metric value for 'validmind.metrics.sklearn.classification.F1'
y_pred obtained from pre-computed predictions in dataset column 'Predictions' from 'my_model'
y_true obtained from column 'Exited'


0.8712500000000001
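
The jump from roughly 0.60 to 0.87 comes from the averaging method rather than from a better model. With `average="micro"`, true positives, false positives, and false negatives are pooled over all classes, and for a single-label problem such as this one the result coincides with accuracy. The sketch below illustrates this on toy labels; the helper names `micro_f1` and `accuracy` are assumptions, not ValidMind or `sklearn` functions.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP, and FN over every class, then compute F1."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label and t != label:
                fp += 1
            elif p != label and t == label:
                fn += 1
    return 2 * tp / (2 * tp + fp + fn)


def accuracy(y_true, y_pred):
    """Proportion of correctly classified labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical labels with an imbalanced positive class.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

print(micro_f1(y_true, y_pred))
print(accuracy(y_true, y_pred))
```

Both calls return the same number on these labels, which is why a micro-averaged F1 on an imbalanced binary target can look much better than the F1 of the positive class alone.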

### Loading the Last Computed Value

Unit metrics are designed to optimize performance and efficiency by caching the results of metric computations. When you execute a metric a second time with the same signature (a unique combination of the metric ID, model, inputs, and parameters), validmind retrieves the last computed value instead of recalculating it. This feature gives faster access to metrics you have previously run and conserves computational resources.
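
The caching idea can be sketched as a dictionary keyed by the signature. This is only an illustration of the concept under simplifying assumptions (inputs are identified by plain strings, and the `make_signature` and `run_metric_cached` helpers are hypothetical); it is not how validmind implements its cache.

```python
import json

# Hypothetical in-memory cache: signature string -> last computed value.
_cache = {}


def make_signature(metric_id, inputs, params=None):
    """Build a deterministic key from the metric ID, input IDs, and parameters."""
    payload = {"metric_id": metric_id, "inputs": inputs, "params": params or {}}
    return json.dumps(payload, sort_keys=True)


def run_metric_cached(metric_id, inputs, compute, params=None):
    """Return (value, status); compute() is called only on a cache miss."""
    key = make_signature(metric_id, inputs, params)
    if key in _cache:
        return _cache[key], "loaded"
    value = compute()
    _cache[key] = value
    return value, "computed"


calls = []


def compute_precision():
    """Stand-in for an expensive computation; records every real invocation."""
    calls.append(1)
    return 0.75


first = run_metric_cached("Precision", {"dataset": "test_dataset"}, compute_precision)
second = run_metric_cached("Precision", {"dataset": "test_dataset"}, compute_precision)
third = run_metric_cached(
    "Precision", {"dataset": "test_dataset", "model": "my_model"}, compute_precision
)

print(first, second, third, len(calls))
```

The second call reuses the stored value, while the third call has a different signature because a model was added to the inputs, so it triggers a new computation.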

**First Computation of Precision Metric**

In this first example, the precision metric is computed for the first time with a specific dataset. The result of this computation is stored in the cache.

In [18]:
metric_id = "validmind.metrics.sklearn.classification.Precision"

inputs = {"dataset": vm_test_ds}

result = vm.run_metric(

    metric_id=metric_id, 
    inputs=inputs,
    
)
result.metric.value

Computing metric value for 'validmind.metrics.sklearn.classification.Precision'
y_pred obtained from pre-computed predictions in dataset column 'Predictions' from 'my_model'
y_true obtained from column 'Exited'


0.751219512195122

**Second Computation with the Same Signature**

In this second example, the same precision metric computation is requested again with identical inputs. Since the signature (metric ID and inputs) matches the previous run, validmind loads the result directly from the cache instead of recomputing it.

In [19]:
result = vm.run_metric(

    metric_id=metric_id, 
    inputs=inputs,
    
)
result.metric.value

Loading last computed value value from 'validmind.metrics.sklearn.classification.Precision'


0.751219512195122

**Computation with a Changed Signature**

In this third example, the signature changes due to the inclusion of a model in the inputs. This change prompts validmind to compute the metric anew, as the new signature does not match any stored result. The outcome is then cached, ready for any future requests with the same signature.

In [20]:
inputs = {
    "dataset": vm_test_ds,
    "model": vm_model
}

result = vm.run_metric(

    metric_id=metric_id, 
    inputs=inputs,
    
)
result.metric.value

Computing metric value for 'validmind.metrics.sklearn.classification.Precision'
y_pred computed directly from model 'my_model'
y_true obtained from column 'Exited'


0.751219512195122