# Load predictions in ValidMind datasets

This notebook guides you through loading predictions into ValidMind dataset objects using the `assign_predictions()` method. The method supports several ways of loading predictions into a dataset object so that tests can make use of them.

This guide includes the code required to:

- Load the demo dataset
- Preprocess the raw dataset and train models for testing
- Initialize ValidMind objects
- Load predictions using the developer framework, in one of three ways:
  - Load predictions from a file
  - Link an existing prediction column in the dataset with a model
  - Let the developer framework run predictions and link them to a model


## Install the client library

The client library provides Python support for the ValidMind Developer Framework. To install it:


In [None]:
%pip install -q validmind

## Initialize the client library

ValidMind generates a unique _code snippet_ for each registered model to connect with your developer environment. You initialize the client library with this code snippet, which ensures that your documentation and tests are uploaded to the correct model when you run the notebook.

Get your code snippet:

1. In a browser, log into the [Platform UI](https://app.prod.validmind.ai).

2. In the left sidebar, navigate to **Model Inventory** and click **+ Register new model**.

3. Enter the model details and click **Continue**. ([Need more help?](https://docs.validmind.ai/guide/register-models-in-model-inventory.html))

   For example, to register a model for use with this notebook, select:

   - Documentation template: `Binary classification`
   - Use case: `Marketing/Sales - Attrition/Churn Management`

   You can fill in other options according to your preference.

4. Go to **Getting Started** and click **Copy snippet to clipboard**.

Next, replace this placeholder with your own code snippet:


In [None]:
# Replace with your code snippet

import validmind as vm

vm.init(
    api_host="https://api.prod.validmind.ai/api/v1/tracking",
    api_key="...",
    api_secret="...",
    project="...",
)

### Preview the documentation template

A template predefines sections for your documentation project and provides a general outline to follow, making the documentation process much easier.

You will upload documentation and test results into this template later on. For now, take a look at the structure that the template provides with the `vm.preview_template()` function from the ValidMind library and note the empty sections:


In [None]:
vm.preview_template()

## Load the sample dataset

The sample dataset used here is provided by the ValidMind library. To be able to use it, you need to import the dataset and load it into a pandas [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), a two-dimensional tabular data structure that makes use of rows and columns:


In [None]:
# Import the sample dataset from the library

from validmind.datasets.classification import customer_churn as demo_dataset

print(
    f"Loaded demo dataset with: \n\n\t• Target column: '{demo_dataset.target_column}' \n\t• Class labels: {demo_dataset.class_labels}"
)

raw_df = demo_dataset.load_data()
raw_df.head()

## Preprocess the raw dataset

Preprocessing prepares the data for the subsequent steps:

- Preprocess the data: `demo_dataset.preprocess` splits the raw DataFrame (`raw_df`) into three datasets (`train_df`, `validation_df`, and `test_df`).
- Separate features and targets: Dropping the target column creates the feature sets (`x_train`, `x_val`), and selecting it creates the target sets (`y_train`, `y_val`).


In [None]:
train_df, validation_df, test_df = demo_dataset.preprocess(raw_df)
x_train = train_df.drop(demo_dataset.target_column, axis=1)
y_train = train_df[demo_dataset.target_column]
x_val = validation_df.drop(demo_dataset.target_column, axis=1)
y_val = validation_df[demo_dataset.target_column]

## Train models for testing

Initialize and fit an XGBoost classifier and a logistic regression classifier:


In [None]:
from sklearn.linear_model import LogisticRegression
import xgboost

%matplotlib inline

xgb = xgboost.XGBClassifier(early_stopping_rounds=10)
xgb.set_params(
    eval_metric=["error", "logloss", "auc"],
)
xgb.fit(
    x_train,
    y_train,
    eval_set=[(x_val, y_val)],
    verbose=False,
)

lr = LogisticRegression(random_state=0)
lr.fit(
    x_train,
    y_train,
)


## Initialize ValidMind objects

### Initialize the ValidMind models


In [None]:
vm_model_xgb = vm.init_model(
    xgb,
    input_id="xgb",
)
vm_model_lr = vm.init_model(
    lr,
    input_id="lr",
)

### Initialize the ValidMind datasets

Before you can run tests, you must first initialize a ValidMind dataset object using the [`init_dataset`](https://docs.validmind.ai/validmind/validmind.html#init_dataset) function from the ValidMind (`vm`) module.

This function takes a number of arguments:

- `dataset` — the raw dataset that you want to provide as input to tests
- `input_id` - a unique identifier that allows tracking what inputs are used when running each individual test
- `target_column` — a required argument if tests require access to true values. This is the name of the target column in the dataset
- `class_labels` — an optional value to map predicted classes to class labels

With all datasets ready, you can now initialize the raw, training and test datasets (`raw_df`, `train_df` and `test_df`) created earlier into their own dataset objects using [`vm.init_dataset()`](https://docs.validmind.ai/validmind/validmind.html#init_dataset):


In [None]:
vm_raw_ds = vm.init_dataset(
    input_id="raw_dataset",
    dataset=raw_df,
    target_column=demo_dataset.target_column,
)

vm_train_ds = vm.init_dataset(
    input_id="train_dataset",
    dataset=train_df,
    target_column=demo_dataset.target_column,
)
vm_test_ds = vm.init_dataset(
    input_id="test_dataset", dataset=test_df, target_column=demo_dataset.target_column
)

## Options to load predictions using the developer framework

### 1. Load predictions from a file

This option creates a new column called `<model_id>_prediction` in the dataset and assigns metadata to track that the `<model_id>_prediction` column is linked to the model `<model_id>`.
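The bookkeeping described above can be pictured with a small stand-alone sketch. It is an illustration only, not ValidMind's implementation: `ToyDataset`, its `columns` dict, and its `prediction_columns` mapping are hypothetical names invented here.

```python
# Toy illustration only: ToyDataset is hypothetical and is not part of ValidMind.
class ToyDataset:
    def __init__(self, columns):
        # Column name -> list of values, standing in for a DataFrame
        self.columns = dict(columns)
        # Model ID -> name of the column that holds that model's predictions
        self.prediction_columns = {}

    def assign_predictions(self, model_id, prediction_values):
        values = list(prediction_values)
        n_rows = len(next(iter(self.columns.values())))
        if len(values) != n_rows:
            raise ValueError("expected one prediction per row")
        # New column named after the model, plus metadata linking the two
        column_name = f"{model_id}_prediction"
        self.columns[column_name] = values
        self.prediction_columns[model_id] = column_name


toy_ds = ToyDataset({"Age": [42, 35, 51], "Exited": [1, 0, 1]})
toy_ds.assign_predictions("xgb", [1, 0, 0])
toy_ds.assign_predictions("lr", [0, 0, 1])

print(toy_ds.prediction_columns)
```

Because the mapping is keyed by model ID, one dataset can carry predictions from several models side by side.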


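### Calculate predictions outside of ValidMind

The predictions below are computed in memory with the models trained earlier. They stand in for predictions that you would otherwise load from a file.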


In [None]:
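import pandas as pd

# Features of the test split, so that the test predictions line up with `test_df`
x_test = test_df.drop(demo_dataset.target_column, axis=1)

# Predictions from the XGBoost model
train_xgb_prediction = pd.DataFrame(xgb.predict(x_train), columns=["xgb_prediction"])
test_xgb_prediction = pd.DataFrame(xgb.predict(x_test), columns=["xgb_prediction"])

# Predictions from the logistic regression model
train_lr_prediction = pd.DataFrame(lr.predict(x_train), columns=["lr_prediction"])
test_lr_prediction = pd.DataFrame(lr.predict(x_test), columns=["lr_prediction"])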
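If the predictions are stored in a file rather than computed in the notebook, they first need to be read back into an array. A minimal sketch, assuming a single-column CSV file; the file name `xgb_predictions.csv` and the toy values are made up for illustration:

```python
import os
import tempfile

import pandas as pd

# Toy predictions standing in for the output of xgb.predict(x_train)
predictions = pd.DataFrame({"xgb_prediction": [1, 0, 0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, used only for this illustration
    path = os.path.join(tmp_dir, "xgb_predictions.csv")
    predictions.to_csv(path, index=False)
    loaded = pd.read_csv(path)

# One value per row, assumed to be in the same row order as the dataset
prediction_values = loaded["xgb_prediction"].values
print(prediction_values)
```

An array like this is the kind of value passed as `prediction_values` in the next step.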

### Assign predictions to the training dataset

We can now use the `assign_predictions()` method from the `Dataset` object to link existing predictions to any model:


In [None]:
vm_train_ds.assign_predictions(
    model=vm_model_xgb, prediction_values=train_xgb_prediction.xgb_prediction.values
)
vm_train_ds.assign_predictions(
    model=vm_model_lr, prediction_values=train_lr_prediction.lr_prediction.values
)

### Run an example test

Now, let's run an example test such as `MinimumAccuracy` twice. The `model` input parameter selects which model's predictions are loaded, even though both runs receive the same `vm_train_ds` dataset instance:


In [None]:
full_suite = vm.tests.run_test(
    "validmind.model_validation.sklearn.MinimumAccuracy",
    inputs={"dataset": vm_train_ds, "model": vm_model_xgb},
)

In [None]:
full_suite = vm.tests.run_test(
    "validmind.model_validation.sklearn.MinimumAccuracy",
    inputs={
        "dataset": vm_train_ds,
        "model": vm_model_lr,
    },
)
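As a rough picture of what a minimum-accuracy check involves, the sketch below computes accuracy as the share of matching labels and compares it with a threshold. This is not the ValidMind test itself: the function name, the toy labels, and the `0.7` threshold are assumptions made for illustration.

```python
def passes_minimum_accuracy(y_true, y_pred, min_threshold=0.7):
    """Return (accuracy, passed) for two equal-length label sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    return accuracy, accuracy >= min_threshold


y_true = [1, 0, 1, 1, 0]
xgb_pred = [1, 0, 1, 0, 0]  # 4 of 5 labels match
lr_pred = [0, 0, 1, 0, 1]   # 2 of 5 labels match

# Same true labels, different prediction sets: one per model
print(passes_minimum_accuracy(y_true, xgb_pred))
print(passes_minimum_accuracy(y_true, lr_pred))
```

The same true labels yield different outcomes depending on which prediction set is compared, which mirrors why the test needs both a dataset and a model as inputs.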

### 2. Link an existing prediction column in the dataset with a model

This approach loads datasets that already contain prediction columns in addition to feature and target columns. The developer framework assigns metadata to track which prediction column is linked to a given `<vm_model>` model.


In [None]:
train_df2 = train_df.copy()
train_df2["xgb_prediction"] = train_xgb_prediction.xgb_prediction.values
train_df2["lr_prediction"] = train_lr_prediction.lr_prediction.values
train_df2.head(5)

In [None]:
feature_columns = [
    "CreditScore",
    "Gender",
    "Age",
    "Tenure",
    "Balance",
    "NumOfProducts",
    "HasCrCard",
    "IsActiveMember",
    "EstimatedSalary",
    "Geography_France",
    "Geography_Germany",
    "Geography_Spain",
]

vm_train_ds = vm.init_dataset(
    dataset=train_df2,
    input_id="train_dataset",
    target_column=demo_dataset.target_column,
    feature_columns=feature_columns,
)

#### Link prediction column to a specific model

The `prediction_column` parameter tells the `Dataset` object which existing column holds the predictions of the given model.


In [None]:
vm_train_ds.assign_predictions(model=vm_model_xgb, prediction_column="xgb_prediction")
vm_train_ds.assign_predictions(model=vm_model_lr, prediction_column="lr_prediction")
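Conceptually, linking an existing column adds no new data; it only records which column belongs to which model. The sketch below illustrates that idea with plain dictionaries. The function names and the `links` mapping are hypothetical and are not ValidMind APIs.

```python
# Toy illustration only: these names are hypothetical, not ValidMind APIs.
def link_prediction_column(columns, links, model_id, prediction_column):
    """Record that an existing column holds the predictions of model_id."""
    if prediction_column not in columns:
        raise KeyError(f"column {prediction_column!r} not found in dataset")
    links[model_id] = prediction_column


def predictions_for(columns, links, model_id):
    """Look up the predictions linked to model_id."""
    return columns[links[model_id]]


# The dataset already contains one prediction column per model
columns = {
    "Exited": [1, 0, 1],
    "xgb_prediction": [1, 0, 0],
    "lr_prediction": [0, 0, 1],
}
links = {}
link_prediction_column(columns, links, "xgb", "xgb_prediction")
link_prediction_column(columns, links, "lr", "lr_prediction")

print(predictions_for(columns, links, "lr"))
```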

In [None]:
full_suite = vm.tests.run_test(
    "validmind.model_validation.sklearn.MinimumAccuracy",
    inputs={"dataset": vm_train_ds, "model": vm_model_xgb},
)

In [None]:
full_suite = vm.tests.run_test(
    "validmind.model_validation.sklearn.MinimumAccuracy",
    inputs={"dataset": vm_train_ds, "model": vm_model_lr},
)

### 3. Let the developer framework run predictions and link them to a model

With this option, the developer framework runs the model predictions, creates a new column called `<model_id>_prediction`, and assigns metadata to track that the `<model_id>_prediction` column is linked to the `<vm_model>` model.

There are two ways to run and assign model predictions with the developer framework:

- When initializing a `Dataset` with `init_dataset()`. This is the most straightforward method to assign predictions for a single model.
- Using `dataset.assign_predictions()`. This allows assigning predictions to a dataset for one or more models.
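The idea of letting a framework compute predictions can be sketched without any library. In the toy code below, a helper runs a model on the feature columns only and stores the result under `<model_id>_prediction`. `ThresholdModel` and `run_and_assign` are hypothetical names invented for this illustration and do not reflect how ValidMind is implemented.

```python
# Toy illustration only: ThresholdModel and run_and_assign are hypothetical.
class ThresholdModel:
    """Predicts 1 when the 'Age' feature exceeds a cutoff."""

    def __init__(self, cutoff):
        self.cutoff = cutoff

    def predict(self, rows):
        return [1 if row["Age"] > self.cutoff else 0 for row in rows]


def run_and_assign(rows, feature_columns, model_id, model):
    """Run the model on the feature columns and store one prediction per row."""
    # Only feature columns reach the model, never the target or earlier predictions
    features = [{name: row[name] for name in feature_columns} for row in rows]
    column_name = f"{model_id}_prediction"
    for row, prediction in zip(rows, model.predict(features)):
        row[column_name] = prediction
    return column_name


rows = [
    {"Age": 42, "Exited": 1},
    {"Age": 35, "Exited": 0},
    {"Age": 51, "Exited": 1},
]
run_and_assign(rows, ["Age"], "young_cutoff", ThresholdModel(40))
run_and_assign(rows, ["Age"], "old_cutoff", ThresholdModel(50))

print(rows[0])
```

Restricting the model input to an explicit feature list is one reason to pass `feature_columns` when a dataset may later hold prediction columns from several models.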


#### 3.1 Pass `<vm_model>` to the `init_dataset` interface


In [None]:
feature_columns = [
    "CreditScore",
    "Gender",
    "Age",
    "Tenure",
    "Balance",
    "NumOfProducts",
    "HasCrCard",
    "IsActiveMember",
    "EstimatedSalary",
    "Geography_France",
    "Geography_Germany",
    "Geography_Spain",
]

vm_train_ds = vm.init_dataset(
    model=vm_model_xgb,
    dataset=train_df,
    input_id="train_dataset",
    target_column=demo_dataset.target_column,
    feature_columns=feature_columns,
)

#### 3.2 Use the `assign_predictions` interface


In [None]:
vm_train_ds = vm.init_dataset(
    dataset=train_df,
    input_id="train_dataset",
    target_column=demo_dataset.target_column,
    feature_columns=feature_columns,
)

##### Perform predictions using the same `assign_predictions` interface


In [None]:
vm_train_ds.assign_predictions(model=vm_model_xgb)
vm_train_ds.assign_predictions(model=vm_model_lr)

### Run an example test

Now, let's run an example test such as `MinimumAccuracy` twice. The `model` input parameter selects which model's predictions are loaded, even though both runs receive the same `vm_train_ds` dataset instance:


In [None]:
full_suite = vm.tests.run_test(
    "validmind.model_validation.sklearn.MinimumAccuracy",
    inputs={"dataset": vm_train_ds, "model": vm_model_xgb},
)

In [None]:
full_suite = vm.tests.run_test(
    "validmind.model_validation.sklearn.MinimumAccuracy",
    inputs={
        "dataset": vm_train_ds,
        "model": vm_model_lr,
    },
)