# ValidMind for model development — 102 Start the model development process

Learn how to use ValidMind for your end-to-end model documentation process with our series of four introductory notebooks. In this second notebook, you'll run tests and investigate results, then add the results or evidence to your documentation.

You'll become familiar with the individual tests available in ValidMind, as well as how to run them and change parameters as necessary. Using ValidMind's repository of individual tests as building blocks helps you ensure that a model is being built appropriately. 

**For a full list of out-of-the-box tests,** refer to our [Test descriptions](https://docs.validmind.ai/developer/model-testing/test-descriptions.html) or try the interactive [Test sandbox](https://docs.validmind.ai/developer/model-testing/test-sandbox.html).

## Prerequisites

To log test results or evidence to your model documentation with this notebook, you'll first need to:

- [ ] Register a model within the ValidMind Platform with a predefined documentation template
- [ ] Install and initialize the ValidMind Library, enabling you to connect to the correct model in the ValidMind Platform
- [ ] Preview the selected documentation template for your model and verify that it's appropriate for your use case

<div class="alert alert-block alert-info" style="background-color: #B5B5B510; color: black; border: 1px solid #083E44; border-left-width: 5px; box-shadow: 2px 2px 4px rgba(0, 0, 0, 0.2);border-radius: 5px;"><span style="color: #083E44;"><b>Need help with the above steps?</b></span>
<br></br>
Refer to the first notebook in this series: <a href="101-set_up_validmind.ipynb" style="color: #DE257E;"><b>101 Set up ValidMind</b></a></div>


## Setting up

### Import sample dataset

First, let's import the public [Bank Customer Churn Prediction](https://www.kaggle.com/datasets/shantanudhakadd/bank-customer-churn-prediction) dataset from Kaggle. 

In the example below, note that:

- The target column, `Exited`, has a value of `1` when a customer has churned and `0` otherwise.
- The ValidMind Library provides a wrapper to automatically load the dataset as a Pandas DataFrame object.

In [None]:
from validmind.datasets.classification import customer_churn as demo_dataset

print(
    f"Loaded demo dataset with: \n\n\t• Target column: '{demo_dataset.target_column}' \n\t• Class labels: {demo_dataset.class_labels}"
)

raw_df = demo_dataset.load_data()
raw_df.head()

### Assess data quality

Next, let's assess data quality by running a few individual tests.

Use the `vm.tests.list_tests()` function introduced by the first notebook in this series in combination with `vm.tests.list_tags()` and `vm.tests.list_tasks()` to find which prebuilt tests are relevant for data quality assessment:


In [None]:
import validmind as vm  # assumes vm.init() was already run, as in the first notebook

# Get the list of available tags
sorted(vm.tests.list_tags())

In [None]:
# Get the list of available task types
sorted(vm.tests.list_tasks())

You can pass `tags` and `task` as parameters to the `vm.tests.list_tests()` function to filter tests by tag and task type. For example, to find tests related to tabular data quality for classification models, call `list_tests()` like this:

In [None]:
vm.tests.list_tests(task="classification", tags=["tabular_data", "data_quality"])

### Initialize the ValidMind datasets

Now, assume we have identified some tests we want to run on the data we intend to use. The next step is to connect your data with a ValidMind `Dataset` object. **This step is necessary every time you want to connect a dataset to documentation and produce test results through ValidMind,** but you only need to do it once per dataset.

Initialize a ValidMind dataset object using the [`init_dataset`](https://docs.validmind.ai/validmind/validmind.html#init_dataset) function from the ValidMind (`vm`) module. This function takes a number of arguments:

- **`dataset`** — The raw dataset that you want to provide as input to tests
- **`input_id`** — A unique identifier that allows tracking what inputs are used when running each individual test
- **`target_column`** — A required argument if tests require access to true values. This is the name of the target column in the dataset


In [None]:
# vm_raw_dataset is now a VMDataset object that you can pass to any ValidMind test
vm_raw_dataset = vm.init_dataset(
    dataset=raw_df,
    input_id="raw_dataset",
    target_column="Exited",
)
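
Before calling `init_dataset`, it can help to confirm that the target column is actually present and fully populated. The check below is a minimal sketch in plain pandas: `check_target_column` is a hypothetical helper, not part of the ValidMind Library, and the small DataFrame is toy data standing in for `raw_df`.

```python
import pandas as pd


def check_target_column(df: pd.DataFrame, target_column: str) -> dict:
    """Summarize a target column before registering the dataset (hypothetical helper)."""
    if target_column not in df.columns:
        raise KeyError(f"Target column '{target_column}' not found in dataset")
    target = df[target_column]
    return {
        "missing_values": int(target.isna().sum()),  # rows without a label
        "classes": sorted(target.dropna().unique().tolist()),  # distinct labels
    }


# Toy stand-in for raw_df; the real dataset has many more rows and columns
toy_df = pd.DataFrame({"CreditScore": [619, 608, 502, 699], "Exited": [1, 0, 1, 0]})
target_summary = check_target_column(toy_df, "Exited")
print(target_summary)
```

If the summary shows missing labels or unexpected classes, it is usually easier to fix them in the DataFrame before registering it.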

## Running tests

<div class="alert alert-block alert-info" style="background-color: #B5B5B510; color: black; border: 1px solid #083E44; border-left-width: 5px; box-shadow: 2px 2px 4px rgba(0, 0, 0, 0.2);border-radius: 5px;"><span style="color: #083E44;"><b>Want to learn more about ValidMind tests?</b></span>
<br></br>
Refer to our notebook that includes code samples and usage of key functions: <a href="https://docs.validmind.ai/notebooks/how_to/explore_tests.html" style="color: #DE257E;"><b>Explore tests</b></a></div>

### Run tabular data tests

Run individual tests by calling the `run_test` function provided by the `validmind.tests` module. This function takes the following arguments:

- **`test_id`** — The ID of the test to run. To find a particular test and retrieve its ID, refer to the Explore tests notebook.
- **`params`** — A dictionary of parameters for the test. These will override any `default_params` set in the test definition. Refer to the Explore tests notebook to find the default parameters for a test. See below for examples.

The inputs expected by a test can also be found in the test definition, and you pass them to `run_test` through the `inputs` argument as a dictionary. Let's take `validmind.data_validation.DescriptiveStatistics` as an example. Note that the output of the `describe_test()` function below shows that this test expects a `dataset` as input:


In [None]:
vm.tests.describe_test("validmind.data_validation.DescriptiveStatistics")

Now, let's run a few tests to assess the quality of the dataset:

In [None]:
result = vm.tests.run_test(
    test_id="validmind.data_validation.DescriptiveStatistics",
    inputs={"dataset": vm_raw_dataset},
)
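
For a quick local cross-check of the kind of numbers a descriptive statistics test reports, pandas offers `DataFrame.describe()`. The sketch below is only a rough analogue, assuming that summary statistics such as counts, means, and quartiles are what you want to compare. It is not ValidMind's implementation, and the DataFrame is toy data standing in for `raw_df`.

```python
import pandas as pd

# Toy stand-in for raw_df with two numeric columns and one categorical column
toy_df = pd.DataFrame(
    {
        "Age": [42, 41, 39, 43],
        "Balance": [0.0, 1000.0, 2000.0, 0.0],
        "Geography": ["France", "Spain", "France", "France"],
    }
)

# Numeric columns: count, mean, std, min, quartiles, max
numeric_summary = toy_df[["Age", "Balance"]].describe()

# Non-numeric columns: count, unique, top (most frequent value), freq
categorical_summary = toy_df[["Geography"]].describe()

print(numeric_summary)
print(categorical_summary)
```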

In [None]:
result2 = vm.tests.run_test(
    test_id="validmind.data_validation.ClassImbalance",
    inputs={"dataset": vm_raw_dataset},
    params={"min_percent_threshold": 30},
)

You can see that the class imbalance test did not pass with the `min_percent_threshold` value we set. To address this data quality issue, you can re-run the test on processed data. Here, we apply a very simple rebalancing technique to the dataset:


In [None]:
import pandas as pd

# Shuffled copy of the raw dataset; a fixed random_state keeps the result reproducible
raw_copy_df = raw_df.sample(frac=1, random_state=42)

# Create a balanced dataset with the same number of exited and not exited customers
exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 1]
not_exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 0].sample(
    n=exited_df.shape[0], random_state=42
)

balanced_raw_df = pd.concat([exited_df, not_exited_df])
# Shuffle again so the two classes are interleaved
balanced_raw_df = balanced_raw_df.sample(frac=1, random_state=42)
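
The same undersampling idea can be wrapped in a small reusable function. This is a sketch under stated assumptions: `undersample_to_smallest_class` is a hypothetical helper name, the DataFrame is toy data standing in for `raw_df`, and a fixed `seed` is passed to every sampling step so the result is reproducible.

```python
import pandas as pd


def undersample_to_smallest_class(
    df: pd.DataFrame, target_column: str, seed: int = 42
) -> pd.DataFrame:
    """Randomly drop rows so every class keeps as many rows as the smallest class."""
    smallest = int(df[target_column].value_counts().min())
    # Draw the same number of rows from each class
    balanced = df.groupby(target_column).sample(n=smallest, random_state=seed)
    # Shuffle so rows of the same class are not grouped together
    return balanced.sample(frac=1, random_state=seed)


# Toy stand-in for raw_df: 2 churned customers and 8 retained customers
toy_df = pd.DataFrame({"Age": range(10), "Exited": [1, 1] + [0] * 8})
balanced_toy_df = undersample_to_smallest_class(toy_df, "Exited")
class_counts = balanced_toy_df["Exited"].value_counts().to_dict()
print(class_counts)
```

Undersampling discards majority-class rows, so it trades data volume for balance. In the notebook itself, `balanced_raw_df` from the cell above remains the dataset used in the following steps.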

With this new raw dataset, you can re-run the individual test to see if it now meets the class imbalance requirement. Remember to register a new `VMDataset` object, since that is the type of input required by `run_test()`:


In [None]:
# Register the balanced data; 'vm_balanced_raw_dataset' is now the dataset object of interest
vm_balanced_raw_dataset = vm.init_dataset(
    dataset=balanced_raw_df,
    input_id="balanced_raw_dataset",
    target_column="Exited",
)

In [None]:
result = vm.tests.run_test(
    test_id="validmind.data_validation.ClassImbalance",
    inputs={"dataset": vm_balanced_raw_dataset},
    params={"min_percent_threshold": 30},
)
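
To build intuition for why balanced data fares better, the check can be approximated in a few lines of pandas. This sketch assumes the test compares each class's share of rows, in percent, against `min_percent_threshold`. That reading is inferred from the parameter name and is not taken from ValidMind's source. `class_percentages` and `passes_min_percent` are hypothetical helpers, and both DataFrames are toy data.

```python
import pandas as pd


def class_percentages(df: pd.DataFrame, target_column: str) -> dict:
    """Return each class's share of rows as a percentage."""
    return (df[target_column].value_counts(normalize=True) * 100).to_dict()


def passes_min_percent(
    df: pd.DataFrame, target_column: str, min_percent_threshold: float
) -> bool:
    """Assumed rule: every class must reach at least the threshold percentage."""
    shares = class_percentages(df, target_column)
    return all(share >= min_percent_threshold for share in shares.values())


# Toy data: a roughly 20/80 split and a 50/50 split
imbalanced_df = pd.DataFrame({"Exited": [1] * 2 + [0] * 8})
balanced_df = pd.DataFrame({"Exited": [1] * 5 + [0] * 5})

imbalanced_passes = passes_min_percent(imbalanced_df, "Exited", 30)
balanced_passes = passes_min_percent(balanced_df, "Exited", 30)
print(imbalanced_passes, balanced_passes)
```

Under this assumed rule, a minority class holding about 20% of rows falls short of a threshold of 30, while an even split clears it.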

## Next steps

### Integrate custom tests