# ValidMind Python Library Introduction

This interactive notebook will guide you through using the ValidMind Developer Framework to document a model built in Python. 

For this simple demonstration, we will use the following bank customer churn dataset from Kaggle: https://www.kaggle.com/code/kmalit/bank-customer-churn-prediction/data.

We will train a sample model and demonstrate the following documentation functionalities:

- Logging information about a dataset
- Running data quality tests on a dataset
- Logging information about a model
- Logging training metrics for a model
- Running model evaluation tests

## Training an Example Model
We will now train an example model to demonstrate the ValidMind client library functions. In this notebook, we'll train a model on the Bank Customer Churn dataset.

### Initializing Python environment

In [1]:
import pandas as pd
import xgboost as xgb

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

%matplotlib inline

### Loading demo dataset

In [None]:
df = pd.read_csv("./datasets/bank_customer_churn.csv")

### Preparing the training dataset

Before we train a model, we need to run a few minimal feature selection and engineering steps on the dataset:

- Dropping irrelevant variables
- Encoding categorical variables

#### Dropping irrelevant variables

The following variables will be dropped from the dataset:

- `RowNumber`: a unique identifier for the record
- `CustomerId`: a unique identifier for the customer
- `Surname`: this variable has no predictive power
- `CreditScore`: we didn't observe any correlation between `CreditScore` and our target column `Exited`

In [None]:
df.drop(["RowNumber", "CustomerId", "Surname", "CreditScore"], axis=1, inplace=True)

#### Encoding categorical variables

We will encode the two categorical variables as follows:

- `Geography`: one-hot (dummy) encoding, since only 3 unique values are found in the dataset
- `Gender`: mapped from string to integer (`Male` to 0, `Female` to 1)

In [None]:
genders = {"Male": 0, "Female": 1}
df.replace({"Gender": genders}, inplace=True)

df = pd.concat([df, pd.get_dummies(df["Geography"], prefix="Geography")], axis=1)
df.drop("Geography", axis=1, inplace=True)
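To make the effect of the two encodings concrete, here is a minimal sketch on a three-row toy frame (the rows are made up for illustration and are not taken from the dataset). It uses `Series.map` for the `Gender` mapping, which is an alternative to the `replace` call above, and passes `dtype=int` to `get_dummies` so the indicator columns come out as integers:

```python
import pandas as pd

# Toy rows for illustration only; not taken from the churn dataset.
toy = pd.DataFrame(
    {
        "Gender": ["Male", "Female", "Female"],
        "Geography": ["France", "Spain", "Germany"],
    }
)

# Encode Gender with an explicit string-to-integer mapping.
toy["Gender"] = toy["Gender"].map({"Male": 0, "Female": 1})

# One-hot encode Geography: one indicator column per unique value.
dummies = pd.get_dummies(toy["Geography"], prefix="Geography", dtype=int)
toy = pd.concat([toy.drop("Geography", axis=1), dummies], axis=1)

print(toy.columns.tolist())
```

Each row ends up with exactly one of the `Geography_*` columns set to 1, while `Gender` stays a single integer column.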

Let's take a look at the preprocessed dataset before training our model:

In [None]:
df.head()

#### Dataset preparation

For training our model, we will **randomly** split the dataset into three parts:

- `training` split with 60% of the rows
- `validation` split with 20% of the rows
- `test` split with 20% of the rows

The `test` dataset will be our held-out dataset for model evaluation.

In [None]:
train_df, test_df = train_test_split(df, test_size=0.20)

# 25% of the remaining 80% is 20% of the full dataset, giving a 60/20/20 split
train_ds, val_ds = train_test_split(train_df, test_size=0.25)

# For training
x_train = train_ds.drop("Exited", axis=1)
y_train = train_ds.loc[:, "Exited"].astype(int)
x_val = val_ds.drop("Exited", axis=1)
y_val = val_ds.loc[:, "Exited"].astype(int)

# For testing
x_test = test_df.drop("Exited", axis=1)
y_test = test_df.loc[:, "Exited"].astype(int)
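The arithmetic behind the two-step split can be checked without touching the data. The sketch below is plain Python; `split_sizes` is a hypothetical helper, and it assumes the held-out share is rounded up to a whole number of rows at each step, which is our understanding of how `train_test_split` sizes its test portion:

```python
import math


def split_sizes(n_rows, test_size=0.20, val_size_of_remainder=0.25):
    """Return (train, validation, test) row counts for the two-step split."""
    # First split: hold out the test share of all rows.
    n_test = math.ceil(n_rows * test_size)
    remainder = n_rows - n_test
    # Second split: hold out the validation share of what is left.
    n_val = math.ceil(remainder * val_size_of_remainder)
    n_train = remainder - n_val
    return n_train, n_val, n_test


print(split_sizes(10_000))  # (6000, 2000, 2000)
```

Because both calls shuffle randomly, the rows in each split change from run to run; passing the same fixed `random_state` value to both `train_test_split` calls would make the split reproducible.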

### Model training

We will train a simple XGBoost model and set its `eval_set` to `[(x_train, y_train), (x_val, y_val)]` in order to collect metrics on the training and validation datasets on every boosting round. The ValidMind library supports collecting any type of "in training" metrics so model developers can provide additional context to model validators if necessary.

In [None]:
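# Stop once the last metric (auc) on the last eval_set entry has not improved for 10 rounds
model = xgb.XGBClassifier(early_stopping_rounds=10)
# Metrics recorded for every evaluation set on each boosting round
model.set_params(
    eval_metric=["error", "logloss", "auc"],
)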
model.fit(
    x_train,
    y_train,
    eval_set=[(x_train, y_train), (x_val, y_val)],
    verbose=False,
)
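After fitting, XGBoost's scikit-learn wrapper exposes the per-round metrics through `model.evals_result()`, a nested dictionary keyed by evaluation set (`validation_0`, `validation_1`, and so on, in the order given in `eval_set`) and then by metric name. The helper below is a minimal sketch of how one might pull the final-round values out of that structure. `last_round_metrics` is a hypothetical name, and the numbers in the example dictionary are placeholders for illustration, not results from this model:

```python
def last_round_metrics(evals_result):
    """Return {eval_set: {metric: value at the last recorded round}}."""
    return {
        set_name: {metric: values[-1] for metric, values in metrics.items()}
        for set_name, metrics in evals_result.items()
    }


# Placeholder history with the same shape as model.evals_result().
example_history = {
    "validation_0": {"error": [0.20, 0.15], "logloss": [0.60, 0.45]},
    "validation_1": {"error": [0.22, 0.18], "logloss": [0.62, 0.50]},
}

print(last_round_metrics(example_history))
```

With the model trained above, the call would be `last_round_metrics(model.evals_result())`, where `validation_0` corresponds to `(x_train, y_train)` and `validation_1` to `(x_val, y_val)`. Because early stopping is enabled, the last recorded round is generally not the best one; `model.best_iteration` gives the index of the best round.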

In [None]:
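# Probability of the positive class (`Exited` = 1) for each validation row
y_pred = model.predict_proba(x_val)[:, -1]
# Round each probability to the nearest class label (a 0.5 threshold)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_val, predictions)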

print(f"Accuracy: {accuracy}")
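Rounding each probability is equivalent to applying a 0.5 decision threshold. The sketch below makes that threshold explicit in plain Python so it runs without the model; `to_labels` and `accuracy` are hypothetical helper names, and the probabilities are made up for illustration:

```python
def to_labels(probabilities, threshold=0.5):
    """Map positive-class probabilities to 0/1 labels."""
    return [int(p > threshold) for p in probabilities]


def accuracy(y_true, y_hat):
    """Fraction of labels predicted correctly."""
    matches = sum(int(t == p) for t, p in zip(y_true, y_hat))
    return matches / len(y_true)


probs = [0.10, 0.80, 0.40, 0.95]  # made-up probabilities
truth = [0, 1, 1, 1]

labels = to_labels(probs)
print(labels, accuracy(truth, labels))  # [0, 1, 0, 1] 0.75
```

Exposing the threshold as a parameter makes it easy to see how the predicted labels, and therefore the accuracy, change when a cut-off other than 0.5 is used.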

Now that we are satisfied with our model, we can begin using the ValidMind Library to generate tests and document the model.

## Initializing the ValidMind Library

Log in to the ValidMind platform with your registered email address, and navigate to the Documentation Projects page.

### Creating a new Documentation Project 

***(Note: if a documentation project has already been created, you can skip this section and head directly to the next)***

Clicking on "Create a new project" allows you to register a new documentation project for our demo model.

Select "Customer Churn model" from the Model drop-down, and "Initial Validation" as Type. Finally, click on "Create Project".

### Finding the project API key and secret 

In the "Client Integration" page of the newly created project, you can now find the initialization code that allows the client library to associate documentation and tests with the appropriate project. The initialization code configures the following arguments: 

- `api_host`: Location of the ValidMind API.
- `api_key`: Account API key.
- `api_secret`: Account secret key.
- `project`: The project identifier. The `project` argument is mandatory since it allows the library to associate all data collected with a specific account project.

<img src="https://vmai.s3.us-west-1.amazonaws.com/sdk-images/settings.png" width="600" height="300" />

The code snippet can be copied and pasted directly into your source code; running it initializes the ValidMind Developer Framework:


In [None]:
import validmind as vm

vm.init(
    api_host="https://api.staging.validmind.ai/api/v1/tracking",
    api_key="e22b89a6b9c2a27da47cb0a09febc001",
    api_secret="a61be901b5596e3c528d94231e4a3c504ef0bb803d16815f8dfd6857fac03e57",
    project="cl1jyv16o000809lg98gi9tie",
)

The Developer Framework is now initialized and connected to the correct project on the platform. 
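Since the initialization snippet embeds the API key and secret, one option is to keep them out of the notebook and read them from the environment instead. The following is a minimal sketch under that assumption: the `VM_*` variable names and the `validmind_init_kwargs` helper are hypothetical and not part of the ValidMind library, while the keyword names match the `vm.init` arguments listed above:

```python
import os


def validmind_init_kwargs(env=os.environ):
    """Build the keyword arguments for vm.init from environment variables.

    The VM_* variable names are assumptions made for this sketch.
    """
    names = {
        "api_host": "VM_API_HOST",
        "api_key": "VM_API_KEY",
        "api_secret": "VM_API_SECRET",
        "project": "VM_PROJECT",
    }
    # Fail early with a clear message if any value is missing or empty.
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {arg: env[var] for arg, var in names.items()}


# Usage (requires the variables to be set and validmind imported as vm):
# vm.init(**validmind_init_kwargs())
```

This keeps the notebook shareable without exposing account credentials, at the cost of one extra setup step per environment.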

### Viewing all test plans available in the Developer Framework

We can find all the test plans and tests available in the Developer Framework by calling the following functions:

- All test plans: `vm.test_plans.list_plans()`
- Describe a test plan: `vm.test_plans.describe_plan("tabular_dataset")`
- List all available tests: `vm.test_plans.list_tests()`

As an example, here's the output of `list_plans()` and `list_tests()`:

In [None]:
vm.test_plans.list_plans()

In [None]:
vm.test_plans.list_tests()

### Running a data quality test plan

We will now run the default data quality test plan that will collect the
following metadata from a dataset:

- Field types and descriptions
- Descriptive statistics
- Data distribution histograms
- Feature correlations

and will run a collection of data quality tests such as:

- Class imbalance
- Duplicates
- High cardinality
- Missing values
- Skewness

ValidMind evaluates if the data quality metrics are within expected ranges. These thresholds or ranges can be further configured by model validators.
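To give a feel for the kind of figures such tests are based on, here is a minimal pandas sketch that computes a few of them by hand. It is not ValidMind's implementation, `quality_summary` is a hypothetical helper, and the toy rows are made up for illustration:

```python
import pandas as pd


def quality_summary(frame, target_column):
    """Compute a few raw data quality figures for a tabular dataset."""
    return {
        # Missing values: count of NaN entries per column.
        "missing_per_column": frame.isna().sum().to_dict(),
        # Duplicates: rows identical to an earlier row.
        "duplicate_rows": int(frame.duplicated().sum()),
        # Class imbalance: share of rows in each target class.
        "class_share": frame[target_column].value_counts(normalize=True).to_dict(),
        # Cardinality: number of distinct non-missing values per column.
        "unique_per_column": frame.nunique().to_dict(),
    }


# Toy rows for illustration only.
toy = pd.DataFrame(
    {
        "Age": [42, 35, None, 35],
        "Balance": [0.0, 1200.5, 830.0, 1200.5],
        "Exited": [0, 1, 0, 1],
    }
)

summary = quality_summary(toy, target_column="Exited")
print(summary["missing_per_column"])
print(summary["duplicate_rows"])
```

A test would then compare each figure against a threshold, for example failing a column whose share of missing values exceeds a configured limit. Skewness could be added in the same way with `DataFrame.skew`.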

#### Load the demo dataset

Before running the test plan, we must first initialize a ValidMind dataset object using the `init_dataset` function from the `vm` module. This function takes the following arguments:

- `dataset`: the dataset that we want to analyze
- `target_column`: the name of the target variable
- `class_labels`: a mapping from class values to the labels used for classification model training

In [None]:
vm_dataset = vm.init_dataset(
    dataset=df,
    target_column="Exited",
    class_labels={
        "0": "Did not exit",
        "1": "Exited",
    }
)

#### Initialize and run the TabularDataset test plan

We can now initialize the `TabularDataset` test plan. The primary method of doing this is with the `run_test_plan` function from the `vm` module. This function takes in a test plan name (in this case `tabular_dataset`) and a `dataset` keyword argument (the `vm_dataset` object we created earlier):

In [None]:
vm.run_test_plan("tabular_dataset", dataset=vm_dataset)

### Running a model evaluation test plan

We will now run a basic model evaluation test plan that is compatible with the model we have trained.
Since we have trained an XGBoost model with a sklearn-like API, we will use the `SKLearnClassifier` test plan. This test plan will collect model metadata and metrics, and run a variety of model evaluation tests, according to the modeling objective (binary classification for this example).

The following model metadata is collected:

- Model framework and architecture (e.g. XGBoost, Random Forest, Logistic Regression, etc.)
- Model task details (e.g. binary classification, regression, etc.)
- Model hyperparameters (e.g. number of trees, max depth, etc.)

The model metrics that are collected depend on the model type, use case, etc. For example, for a binary classification model, the following metrics could be collected (again, depending on configuration):

- AUC
- Error rate
- Logloss
- Feature importance

Similarly, different model evaluation tests are run depending on the model type, use case, etc. For example, for a binary classification model, the following tests could be executed:

- Simple training/test overfit test
- Training/test performance degradation
- Baseline test dataset performance test
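As an illustration of the idea behind a simple training/test overfit test, the sketch below compares a higher-is-better score on the two datasets and flags a large gap. This is not ValidMind's implementation: `overfit_gap` is a hypothetical helper, `max_gap` is an arbitrary tolerance, and the scores are made up:

```python
def overfit_gap(train_score, test_score, max_gap=0.10):
    """Compare a higher-is-better score on training and test data.

    max_gap is a hypothetical tolerance chosen for illustration.
    """
    gap = train_score - test_score
    return {"gap": gap, "passed": gap <= max_gap}


# Made-up scores for illustration only.
print(overfit_gap(train_score=0.95, test_score=0.80))  # fails: gap of about 0.15
print(overfit_gap(train_score=0.87, test_score=0.85))  # passes: gap of about 0.02
```

A model that scores much better on the data it was trained on than on held-out data is likely memorizing the training set, which is what the gap is meant to surface.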

#### Initialize VM model object and train/test datasets

In order to run our `SKLearnClassifier` test plan, we need to initialize ValidMind object instances for the trained model and the training and test datasets:

In [None]:
vm_model = vm.init_model(model)
vm_train_ds = vm.init_dataset(dataset=train_ds, type="generic", target_column="Exited")
vm_test_ds = vm.init_dataset(dataset=test_df, type="generic", target_column="Exited")

We can now run the `SKLearnClassifier` test plan:

In [None]:
vm.run_test_plan("sklearn_classifier", model=vm_model, train_ds=vm_train_ds, test_ds=vm_test_ds)