# Introduction to the use of ValidMind Dataset and Model Objects
As a model developer, learn how the end-to-end documentation process works based on common scenarios you encounter in model development settings.

As a prerequisite, a model documentation template must be available on the platform. You can [view the available templates](https://docs.validmind.com/guide/swap-documentation-templates.html#view-current-templates) to see what has been defined on the platform.

This notebook employs a binary classification model to illustrate the core ValidMind dataset and model objects. It further demonstrates their application in tests to generate artifacts for documentation.


## Overview of the notebook

**1. Initializing the ValidMind Developer Framework**

ValidMind’s developer framework provides a rich collection of documentation tools and test suites, from documenting descriptions of datasets to validation and testing of models using a variety of open-source testing frameworks.

**2. Explore basic components of the ValidMind library**

Learn how to use the interfaces for ValidMind's objects, such as dataset and model, which serve as the building blocks for developing both custom and built-in tests. These objects function as inputs in the tests.

For a full list of out-of-the-box tests, see [Test descriptions](https://docs.validmind.com/guide/test-descriptions.html) or try the interactive [Test sandbox](https://docs.validmind.com/guide/test-sandbox.html).

Model developers typically create their own custom tests, and it is crucial to include these in the model documentation. We will demonstrate how to develop custom tests using the ValidMind dataset and model objects.

## About ValidMind

ValidMind is a platform for managing model risk, including risk associated with AI and statistical models. You use the ValidMind Developer Framework to automate documentation and validation tests, and then use the ValidMind AI Risk Platform UI to collaborate on model documentation. Together, these products simplify model risk management, facilitate compliance with regulations and institutional standards, and enhance collaboration between yourself and model validators.


### Before you begin

This notebook assumes you have basic familiarity with Python, including an understanding of how functions work. If you are new to Python, you can still run the notebook but we recommend further familiarizing yourself with the language.

If you encounter errors due to missing modules in your Python environment, install the modules with `pip install`, and then re-run the notebook. For more help, refer to [Installing Python Modules](https://docs.python.org/3/installing/index.html).

### New to ValidMind?

If you haven't already seen our [Get started with the ValidMind Developer Framework](https://docs.validmind.ai/guide/get-started-developer-framework.html), we recommend you explore the available resources for developers at some point. There, you can learn more about documenting models, find code samples, or read our developer reference.

<div class="alert alert-block alert-info" style="background-color: #f7e4ee; color: #222425; border: 1px solid #222425;">For access to all features available in this notebook, create a free ValidMind account.

Signing up is FREE — <a href="https://app.prod.validmind.ai"><b>Sign up now</b></a></div>

### Key concepts

**Model documentation**: A structured and detailed record pertaining to a model, encompassing key components such as its underlying assumptions, methodologies, data sources, inputs, performance metrics, evaluations, limitations, and intended uses. It serves to ensure transparency, adherence to regulatory requirements, and a clear understanding of potential risks associated with the model’s application.

**Documentation template**: Functions as a test suite and lays out the structure of model documentation, segmented into various sections and sub-sections. Documentation templates define the structure of your model documentation, specifying the tests that should be run, and how the results should be displayed.

**Tests**: A function contained in the ValidMind Developer Framework, designed to run a specific quantitative test on the dataset or model. Tests are the building blocks of ValidMind, used to evaluate and document models and datasets, and can be run individually or as part of a suite defined by your model documentation template.

**Metrics**: A subset of tests that do not have thresholds. In the context of this notebook, metrics and tests can be thought of as interchangeable concepts.

**Custom metrics**: Custom metrics are functions that you define to evaluate your model or dataset. These functions can be registered with ValidMind to be used in the platform.

**Inputs**: Objects to be evaluated and documented in the ValidMind framework. They can be any of the following:

- **model**: A single model that has been initialized in ValidMind with [`vm.init_model()`](https://docs.validmind.ai/validmind/validmind.html#init_model).
- **dataset**: Single dataset that has been initialized in ValidMind with [`vm.init_dataset()`](https://docs.validmind.ai/validmind/validmind.html#init_dataset).
- **models**: A list of ValidMind models - usually this is used when you want to compare multiple models in your custom metric.
- **datasets**: A list of ValidMind datasets - usually this is used when you want to compare multiple datasets in your custom metric. See this [example](https://docs.validmind.ai/notebooks/how_to/run_tests_that_require_multiple_datasets.html) for more information.

**Parameters**: Additional arguments that can be passed when running a ValidMind test, used to pass additional information to a metric, customize its behavior, or provide additional context.

**Outputs**: Custom metrics can return elements like tables or plots. Tables may be a list of dictionaries (each representing a row) or a pandas DataFrame. Plots may be matplotlib or plotly figures.
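To make the inputs, parameters, and outputs vocabulary concrete, here is a minimal sketch in plain Python. It does not call ValidMind: `row_count_by_label` and its arguments are hypothetical names, and a plain list of labels stands in for a dataset input. It returns a table in the list-of-dictionaries form described above, with one dictionary per row.

```python
from collections import Counter


def row_count_by_label(labels, min_count=0):
    """Hypothetical stand-in for a custom metric.

    labels:    plays the role of an input (the object being evaluated)
    min_count: plays the role of a parameter (customizes behavior)
    returns:   a table as a list of dictionaries, one per row
    """
    counts = Counter(labels)
    return [
        {"Label": label, "Count": count}
        for label, count in sorted(counts.items())
        if count >= min_count
    ]


table = row_count_by_label(["no", "yes", "no", "no"], min_count=1)
print(table)
```

The same function could return a pandas DataFrame instead; the row-per-dictionary form is simply the lighter of the two table shapes mentioned above.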

![Dataset based test architecture](./dataset_image.png)
![Model based test architecture](./model_image.png)

**Test suites**: Collections of tests designed to run together to automate and generate model documentation end-to-end for specific use-cases.

Example: the [`classifier_full_suite`](https://docs.validmind.ai/validmind/validmind/test_suites/classifier.html#ClassifierFullSuite) test suite runs tests from the [`tabular_dataset`](https://docs.validmind.ai/validmind/validmind/test_suites/tabular_datasets.html) and [`classifier`](https://docs.validmind.ai/validmind/validmind/test_suites/classifier.html) test suites to fully document the data and model sections for binary classification model use-cases.


<a id='toc4_'></a>

## 1. Initializing the ValidMind Developer Framework


<a id='toc4_1_'></a>

### Install the client library

We recommend the following Python versions:

- Python 3.7 < x <= 3.11
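If you want to check your interpreter before installing, a quick sketch like the following may help. It assumes the recommendation above means any version newer than 3.7 and no newer than 3.11; `is_recommended_python` is a hypothetical helper, not part of ValidMind.

```python
import sys


def is_recommended_python(version=None):
    """Return True if (major, minor) is newer than 3.7 and at most 3.11."""
    major_minor = tuple((version or sys.version_info)[:2])
    return (3, 7) < major_minor <= (3, 11)


# Check the interpreter running this notebook
print(is_recommended_python())
```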

The client library provides Python support for the ValidMind Developer Framework. To install it, run:


In [None]:
%pip install -q validmind

<a id='toc4_2_'></a>

### Register a new model in ValidMind UI and initialize the client library

ValidMind generates a unique _code snippet_ for each registered model to connect with your developer environment. You initialize the client library with this code snippet, which ensures that your documentation and tests are uploaded to the correct model when you run the notebook.

Get your code snippet:

1. In a browser, log into the [Platform UI](https://app.prod.validmind.ai).

2. In the left sidebar, navigate to **Model Inventory** and click **+ Register new model**.

3. Enter the model details and click **Continue**. ([Need more help?](https://docs.validmind.ai/guide/register-models-in-model-inventory.html))

   For example, to register a model for use with this notebook, select:

   - Documentation template: `Binary classification`
   - Use case: `Marketing/Sales - Attrition/Churn Management`

   You can fill in other options according to your preference.

4. Go to **Getting Started** and click **Copy snippet to clipboard**.

Next, replace this placeholder with your own code snippet:


In [None]:
# Replace with your code snippet

import validmind as vm

vm.init(
    api_host="https://api.prod.validmind.ai/api/v1/tracking",
    api_key="...",
    api_secret="...",
    project="...",
)
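If you prefer not to paste credentials directly into a notebook, one option is to read them from environment variables and pass them to `vm.init()` as keyword arguments. The sketch below is only an illustration: the `VM_*` variable names and the `load_vm_settings` helper are assumptions, while the keyword names mirror the placeholder snippet above.

```python
import os


def load_vm_settings(env=None):
    """Collect the vm.init() keyword arguments from environment variables.

    The VM_* variable names are hypothetical; use whatever names your
    own secret management provides.
    """
    env = os.environ if env is None else env
    names = {
        "api_host": "VM_API_HOST",
        "api_key": "VM_API_KEY",
        "api_secret": "VM_API_SECRET",
        "project": "VM_PROJECT",
    }
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {arg: env[var] for arg, var in names.items()}


# Usage, once the variables are set in your environment:
# import validmind as vm
# vm.init(**load_vm_settings())
```

Failing early with the list of missing variables makes a misconfigured environment easier to diagnose than an authentication error later in the notebook.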

## 2. Explore Basic Components of the ValidMind Library

In this section, you will learn about the basic objects of the ValidMind library that are necessary to implement both custom and built-in tests. As explained above, these objects are:
- `VMDataset`: see the [high-level API reference](https://docs.validmind.ai/validmind/validmind/vm_models.html#VMDataset)
- `VMModel`: see the [high-level API reference](https://docs.validmind.ai/validmind/validmind/vm_models.html#VMModel)

Let's understand these objects and their interfaces step by step: 

The ValidMind Developer Framework provides a wrapper that automatically loads the demo dataset as a pandas DataFrame.

In [None]:
from validmind.datasets.classification import customer_churn as demo_dataset

print(
    f"Loaded demo dataset with: \n\n\t• Target column: '{demo_dataset.target_column}' \n\t• Class labels: {demo_dataset.class_labels}"
)

raw_df = demo_dataset.load_data()
raw_df.head()


### 2.1 VMDataset Object
### Initialize the ValidMind datasets

Now, assume we have identified some tests we want to run on the data we intend to use. The next step is to connect your data with a ValidMind `Dataset` object. This step is required whenever you want to connect a dataset to documentation and produce test results through ValidMind, but you only need to do it once per dataset.

You can initialize a ValidMind dataset object using the [`init_dataset`](https://docs.validmind.ai/validmind/validmind.html#init_dataset) function from the ValidMind (`vm`) module.

This function takes a number of arguments. Some of the arguments are:

- `dataset`: the raw dataset that you want to provide as input to tests
- `input_id`: a unique identifier that allows tracking what inputs are used when running each individual test
- `target_column`: the name of the target column in the dataset, required if tests need access to true values

The full list of arguments can be found [here](https://docs.validmind.ai/validmind/validmind.html#init_dataset).


In [None]:
# vm_raw_dataset is now a VMDataset object that you can pass to any ValidMind test
vm_raw_dataset = vm.init_dataset(
    dataset=raw_df,
    input_id="raw_dataset",
    target_column="Exited",
)

Once you have a ValidMind dataset object (`VMDataset`), you can inspect its attributes and methods using the `inspect_obj` utility function. It lists the attributes and interfaces available for use in tests.

In [None]:
from validmind.utils import inspect_obj
inspect_obj(vm_raw_dataset)
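For intuition about what this kind of inspection reports, here is a generic sketch built on Python's `dir()` and `callable()`. It is not how `inspect_obj` is implemented; `list_public_members` and `ToyDataset` are hypothetical names used only for illustration.

```python
def list_public_members(obj):
    """Split an object's public names into attributes and methods."""
    attributes, methods = [], []
    for name in dir(obj):
        if name.startswith("_"):
            continue  # skip private and dunder names
        target = methods if callable(getattr(obj, name)) else attributes
        target.append(name)
    return {"attributes": attributes, "methods": methods}


class ToyDataset:
    """Hypothetical stand-in; not a VMDataset."""

    def __init__(self):
        self.target_column = "Exited"

    def row_count(self):
        return 0


print(list_public_members(ToyDataset()))
```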

### 2.2 Custom test
A custom test is simply a Python function that takes two types of arguments: `inputs` and `params`. The `inputs` are ValidMind objects (`VMDataset`, `VMModel`), and the `params` are additional parameters required for the underlying computation of the test. We will discuss both types of arguments in the following sections.

Let's start with a custom test that requires only a ValidMind dataset object. In this example, we will check the balance of classes in the target column of the dataset:

- The custom test below requires a single argument of type `VMDataset` (dataset).
- `my_custom_metrics.ClassImbalance` is a unique test identifier assigned through the `vm.metric` decorator. This test ID is used in the UI to load test results into the documentation.
- We use the `dataset.target_column` and `dataset.df` attributes of the `VMDataset` object.

Other high-level APIs (attributes and methods) of the dataset object are listed [here](https://docs.validmind.ai/validmind/validmind/vm_models.html#VMDataset).




In [None]:
from validmind.vm_models.dataset.dataset import VMDataset
import pandas as pd

@vm.metric("my_custom_metrics.ClassImbalance")
def class_imbalance(dataset):
    # Can only run this test if we have a Dataset object
    if not isinstance(dataset, VMDataset):
        raise ValueError("ClassImbalance requires a validmind Dataset object")

    if dataset.target_column is None:
        print("Skipping class_imbalance test because no target column is defined")
        return

    # VMDataset object provides target_column attribute
    target_column = dataset.target_column
    # we can access pandas DataFrame using df attribute
    imbalance_percentages = dataset.df[target_column].value_counts(
        normalize=True
    )
    classes = list(imbalance_percentages.index) 
    percentages = list(imbalance_percentages.values * 100)

    return pd.DataFrame({"Class": classes, "Percentage": percentages})
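The `value_counts(normalize=True)` call is what turns the target column into class shares. As a rough illustration of that arithmetic in plain Python, without pandas or ValidMind (the `class_percentages` helper is hypothetical and assumes a non-empty list of labels):

```python
from collections import Counter


def class_percentages(labels):
    """Return each class's share of the rows, in percent."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: 100 * count / total for label, count in counts.items()}


# Three rows of class 0 and one row of class 1
print(class_percentages([0, 0, 0, 1]))  # {0: 75.0, 1: 25.0}
```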

### How to Run the Test

Let's run the test using the `run_test` function from the `validmind.tests` module. In the example below, we run the custom test `my_custom_metrics.ClassImbalance` and pass the `dataset` through `inputs`. You can pass `datasets`, `model`, or `models` the same way if your custom test requires them.


In [None]:
from validmind.tests import run_test
result = run_test(
    test_id="my_custom_metrics.ClassImbalance",
    inputs={
        "dataset": vm_raw_dataset
    }
)

### 2.3 Pass parameters to custom test

Similar to `inputs`, you can pass `params` to a custom test by providing a dictionary of parameters to the `run_test()` function. The parameters override any default parameters set in the custom metric definition. Note that the `dataset` is still passed through `inputs`.
Let's modify the class imbalance test so that it provides the flexibility to `normalize` the results.


In [None]:
from validmind.vm_models.dataset.dataset import VMDataset
import pandas as pd

@vm.metric("my_custom_metrics.ClassImbalance")
def class_imbalance(dataset, normalize=True):
    # Can only run this test if we have a Dataset object
    if not isinstance(dataset, VMDataset):
        raise ValueError("ClassImbalance requires a validmind Dataset object")

    if dataset.target_column is None:
        print("Skipping class_imbalance test because no target column is defined")
        return

    # VMDataset object provides target_column attribute
    target_column = dataset.target_column
    # we can access pandas DataFrame using df attribute
    imbalance_percentages = dataset.df[target_column].value_counts(
        normalize=normalize
    )
    classes = list(imbalance_percentages.index)
    if normalize:
        # value_counts returned fractions; scale them to percentages
        result = pd.DataFrame({"Class": classes, "Percentage": list(imbalance_percentages.values * 100)})
    else:
        # value_counts returned raw row counts
        result = pd.DataFrame({"Class": classes, "Count": list(imbalance_percentages.values)})
    return result

In the cell below, the `normalize` parameter is set to `True`, so each class is reported as a percentage of all rows. Set it to `False` to report raw class counts instead.

The `dataset` is passed through `inputs` and the `normalize` parameter through `params`.

In [None]:
from validmind.tests import run_test
result = run_test(
    test_id = "my_custom_metrics.ClassImbalance",
    inputs={"dataset": vm_raw_dataset},
    params={"normalize": True},
)
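One way to picture how `params` override a function's defaults is as keyword arguments merged with the inputs at call time. The sketch below is an assumption made for illustration, not a description of how `run_test` works internally; `call_with_params` and `toy_class_counts` are hypothetical, and a plain list of labels stands in for the dataset.

```python
def call_with_params(test_func, inputs, params=None):
    """Call test_func with inputs plus any parameter overrides."""
    return test_func(**inputs, **(params or {}))


def toy_class_counts(dataset, normalize=True):
    """Hypothetical test: `dataset` is just a list of labels here."""
    total = len(dataset)
    counts = {label: dataset.count(label) for label in sorted(set(dataset))}
    if normalize:
        return {label: 100 * n / total for label, n in counts.items()}
    return counts


labels = [0, 0, 0, 1]
default_result = call_with_params(toy_class_counts, {"dataset": labels})
override_result = call_with_params(
    toy_class_counts, {"dataset": labels}, {"normalize": False}
)
print(default_result)   # {0: 75.0, 1: 25.0}
print(override_result)  # {0: 3, 1: 1}
```

When `params` is omitted, the default `normalize=True` from the function signature applies; passing `{"normalize": False}` replaces it for that call only.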

### 2.4 VMModel Object
### Initialize ValidMind model object
Now let's look at the ValidMind model object, which is a wrapper around the underlying model object. Before we do that, we need to build a simple classification model.
Here, we build a simple XGBoost classification model.

In [None]:
import xgboost as xgb

train_df, validation_df, test_df = demo_dataset.preprocess(raw_df)

x_train = train_df.drop(demo_dataset.target_column, axis=1)
y_train = train_df[demo_dataset.target_column]
x_val = validation_df.drop(demo_dataset.target_column, axis=1)
y_val = validation_df[demo_dataset.target_column]

vm_train_ds = vm.init_dataset(
    dataset=train_df,
    input_id="train_dataset",
    target_column=demo_dataset.target_column,
)


model = xgb.XGBClassifier(early_stopping_rounds=10)
model.set_params(
    eval_metric=["error", "logloss", "auc"],
)
model.fit(
    x_train,
    y_train,
    eval_set=[(x_val, y_val)],
    verbose=False,
)

Now we have an `XGBClassifier` model that can be easily wrapped using the `init_model` interface of ValidMind.

In [None]:

vm_model = vm.init_model(
    model=model,
    input_id="xgb_model",
)

As with the `VMDataset` object, we can now inspect the methods and attributes of the model:

In [None]:
inspect_obj(vm_model)

### Initialize model evaluation objects and assign predictions
Now we can store predictions in the dataset object using the `assign_predictions` interface. It adds a prediction column to the dataset, which establishes the link between the model and the dataset.

In [None]:
vm_train_ds.assign_predictions(model=vm_model)

The extra prediction column (`xgb_model_prediction`) for the model (`xgb_model`) has been added in the dataset.

In [None]:
print(vm_train_ds)
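Conceptually, assigning predictions amounts to adding one column per model, named after the model's `input_id`. The sketch below mimics that idea with plain dictionaries. The `<input_id>_prediction` naming pattern is inferred from the single `xgb_model_prediction` example above, and `add_prediction_column`, `toy_predict`, and the toy rows are hypothetical.

```python
def add_prediction_column(rows, model_id, predict):
    """Return copies of rows with a `<model_id>_prediction` column added."""
    column = f"{model_id}_prediction"
    return [{**row, column: predict(row)} for row in rows]


def toy_predict(row):
    """Hypothetical rule standing in for a trained model."""
    return int(row["Age"] > 40)


rows = [{"Age": 35, "Exited": 0}, {"Age": 52, "Exited": 1}]
with_predictions = add_prediction_column(rows, "xgb_model", toy_predict)
print(with_predictions[1])
```

Keying the column on the model identifier is what lets one dataset hold predictions from several models side by side.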

### 2.5 Custom test using VMDataset and VMModel as inputs

We will now create a `@vm.metric` wrapper that will allow you to create a reusable metric. Note the following about the code below:

- The function `confusion_matrix` takes two arguments, `dataset` and `model`. These are a `VMDataset` and a `VMModel` object, respectively.
  - `VMDataset` objects allow you to access the dataset's true (target) values by accessing the `.y` attribute.
  - `VMDataset` objects allow you to access the predictions for a given model by accessing the `.y_pred()` method.
- The function docstring provides a description of what the metric does. This will be displayed along with the result in this notebook as well as in the ValidMind platform.
- The function body calculates the confusion matrix using the `sklearn.metrics.confusion_matrix` function.
- The function then returns the `ConfusionMatrixDisplay.figure_` object. This is important because the ValidMind framework expects the output of the custom metric to be a plot or a table.
- The `@vm.metric` decorator creates a wrapper around the function that allows it to be run by the ValidMind framework. It also registers the metric so it can be found by the ID `my_custom_metrics.ConfusionMatrix`.

You can use the other functionality provided by `VMDataset` and `VMModel` objects in the same way. Refer to our documentation page for all the available APIs [here](https://docs.validmind.ai/validmind/validmind.html#init_dataset).


In [None]:
from sklearn import metrics
import matplotlib.pyplot as plt
@vm.metric("my_custom_metrics.ConfusionMatrix")
def confusion_matrix(dataset, model):
    """The confusion matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known.

    The confusion matrix is a 2x2 table that contains 4 values:

    - True Positive (TP): the number of correct positive predictions
    - True Negative (TN): the number of correct negative predictions
    - False Positive (FP): the number of incorrect positive predictions
    - False Negative (FN): the number of incorrect negative predictions

    The confusion matrix can be used to assess the holistic performance of a classification model by showing the accuracy, precision, recall, and F1 score of the model on a single figure.
    """
    # Retrieve the true target values from the dataset's y attribute
    y_true = dataset.y
    # Retrieve the predictions of a specific model using the y_pred method
    y_pred = dataset.y_pred(model=model)

    confusion_matrix = metrics.confusion_matrix(y_true, y_pred)

    cm_display = metrics.ConfusionMatrixDisplay(
        confusion_matrix=confusion_matrix, display_labels=[False, True]
    )
    cm_display.plot()
    plt.close()

    return cm_display.figure_  # return the figure object itself

Here, we run the test using two inputs: `dataset` and `model`.

In [None]:
from validmind.tests import run_test
result = run_test(
    test_id = "my_custom_metrics.ConfusionMatrix",
    inputs={
        "dataset": vm_train_ds,
        "model": vm_model,
    }
)
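The four counts described in the confusion matrix docstring, and the scores derived from them, can also be worked out by hand. The following plain-Python sketch uses made-up labels, not output from the model above, and treats `1` as the positive class. `confusion_counts` and `summary_scores` are hypothetical helpers, and the sketch does not guard against the division by zero that occurs when there are no positive predictions or no positive labels.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, and FN for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    return {
        "TP": sum(1 for t, p in pairs if t == 1 and p == 1),
        "TN": sum(1 for t, p in pairs if t == 0 and p == 0),
        "FP": sum(1 for t, p in pairs if t == 0 and p == 1),
        "FN": sum(1 for t, p in pairs if t == 1 and p == 0),
    }


def summary_scores(c):
    """Derive accuracy, precision, recall, and F1 from the four counts."""
    total = c["TP"] + c["TN"] + c["FP"] + c["FN"]
    precision = c["TP"] / (c["TP"] + c["FP"])
    recall = c["TP"] / (c["TP"] + c["FN"])
    return {
        "accuracy": (c["TP"] + c["TN"]) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


counts = confusion_counts([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(counts)  # {'TP': 2, 'TN': 2, 'FP': 1, 'FN': 1}
print(summary_scores(counts))
```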

### Log the confusion matrix results

You can log any result to the ValidMind platform with the `.log()` method of the result object. This allows you to add the result to the documentation.

The cell below logs the confusion matrix results.


In [None]:
result.log()

## Where to go from here

In this notebook you have learned the end-to-end process to document a model with the ValidMind Developer Framework, running through some very common scenarios in a typical model development setting:

- Running out-of-the-box tests
- Documenting your model by adding evidence to model documentation
- Extending the capabilities of the Developer Framework by implementing custom tests
- Ensuring that the documentation is complete by running all tests in the documentation template

As a next step, you can explore the following notebooks to get a deeper understanding of how the developer framework allows you to generate model documentation for any use case:


### Use cases

- [Application scorecard demo](../code_samples/credit_risk/application_scorecard_demo.ipynb)
- [Linear regression documentation demo](../code_samples/regression/quickstart_regression_full_suite.ipynb)
- [LLM model documentation demo](../code_samples/nlp_and_llm/foundation_models_integration_demo.ipynb)


### More how-to guides and code samples

- [Explore available tests in detail](../how_to/explore_tests.ipynb)
- [In-depth guide for implementing custom tests](../code_samples/custom_tests/implement_custom_tests.ipynb)
- [In-depth guide to external test providers](../code_samples/custom_tests/integrate_external_test_providers.ipynb)
- [Configuring dataset features](../how_to/configure_dataset_features.ipynb)
- [Introduction to unit and composite metrics](../how_to/run_unit_metrics.ipynb)

### Discover more learning resources

All notebook samples can be found in the following directories of the Developer Framework GitHub repository:

- [Code samples](https://github.com/validmind/developer-framework/tree/main/notebooks/code_samples)
- [How-to guides](https://github.com/validmind/developer-framework/tree/main/notebooks/how_to)
