# Validation of an application scorecard model 

As a model validator, your task is to independently assess the application scorecard model developed using the ValidMind Library, which is based on Kaggle's [Lending Club](https://www.kaggle.com/datasets/devanshi23/loan-data-2007-2014/data) dataset. Your role focuses on evaluating the model developer's work by conducting thorough testing and validation, potentially including the use of challenger models to benchmark performance.

An application scorecard model is a type of statistical model used in credit scoring to evaluate the creditworthiness of potential borrowers by generating a score based on various characteristics of an applicant such as credit history, income, employment status, and other relevant financial data.
 - This score assists lenders in making informed decisions about whether to approve or reject loan applications, as well as in determining the terms of the loan, including interest rates and credit limits.
 - Effective validation of application scorecard models ensures that lenders can manage risk efficiently while maintaining a fast and transparent loan application process for applicants.

This interactive notebook provides a step-by-step guide for:

 - Loading the developer model and verifying the data quality steps performed by the model developer.
 - Independently replicating the model's results and conducting additional tests to assess performance, stability, and robustness.
 - Setting up test inputs and challenger models for comparative analysis.
 - Running validation tests, analyzing results, and logging findings to ValidMind.

<a id='toc1_'></a>

## About ValidMind
ValidMind is a suite of tools for managing model risk, including risk associated with AI and statistical models.

You use the ValidMind Library to automate documentation and validation tests, and then use the ValidMind Platform to collaborate on model documentation. Together, these products simplify model risk management, facilitate compliance with regulations and institutional standards, and enhance collaboration between yourself and model developers.

<a id='toc1_1_'></a>

### Before you begin
This notebook assumes you have basic familiarity with Python, including an understanding of how functions work. If you are new to Python, you can still run the notebook but we recommend further familiarizing yourself with the language.

If you encounter errors due to missing modules in your Python environment, install the modules with `pip install`, and then re-run the notebook. For more help, refer to [Installing Python Modules](https://docs.python.org/3/installing/index.html).

<a id='toc1_2_'></a>

### New to ValidMind?
If you haven't already seen our [Get started with the ValidMind Library](https://docs.validmind.ai/developer/get-started-validmind-library.html), we recommend you begin by exploring the available resources in this section. There, you can learn more about documenting models, find code samples, or read our developer reference.

<div class="alert alert-block alert-info" style="background-color: #B5B5B510; color: black; border: 1px solid #083E44; border-left-width: 5px; box-shadow: 2px 2px 4px rgba(0, 0, 0, 0.2);border-radius: 5px;"><span style="color: #083E44;"><b>For access to all features available in this notebook, create a free ValidMind account.</b></span>
<br></br>
Signing up is FREE — <a href="https://docs.validmind.ai/guide/configuration/register-with-validmind.html" style="color: #DE257E;"><b>Register with ValidMind</b></a></div>


<a id='toc1_3_'></a>

### Key concepts

**Model documentation**: A structured and detailed record pertaining to a model, encompassing key components such as its underlying assumptions, methodologies, data sources, inputs, performance metrics, evaluations, limitations, and intended uses. It serves to ensure transparency, adherence to regulatory requirements, and a clear understanding of potential risks associated with the model’s application.

**Documentation template**: Functions as a test suite and lays out the structure of model documentation, segmented into various sections and sub-sections. Documentation templates define the structure of your model documentation, specifying the tests that should be run, and how the results should be displayed.

**Tests**: A function contained in the ValidMind Library, designed to run a specific quantitative test on the dataset or model. Tests are the building blocks of ValidMind, used to evaluate and document models and datasets, and can be run individually or as part of a suite defined by your model documentation template.

**Custom tests**: Custom tests are functions that you define to evaluate your model or dataset. These functions can be registered via the ValidMind Library to be used with the ValidMind Platform.

**Inputs**: Objects to be evaluated and documented in the ValidMind Library. They can be any of the following:

- **model**: A single model that has been initialized in ValidMind with [`vm.init_model()`](https://docs.validmind.ai/validmind/validmind.html#init_model).
- **dataset**: Single dataset that has been initialized in ValidMind with [`vm.init_dataset()`](https://docs.validmind.ai/validmind/validmind.html#init_dataset).
- **models**: A list of ValidMind models - usually this is used when you want to compare multiple models in your custom test.
- **datasets**: A list of ValidMind datasets - usually this is used when you want to compare multiple datasets in your custom test. See this [example](https://docs.validmind.ai/notebooks/how_to/run_tests_that_require_multiple_datasets.html) for more information.

**Parameters**: Additional arguments that can be passed when running a ValidMind test, used to pass additional information to a test, customize its behavior, or provide additional context.

**Outputs**: Custom tests can return elements like tables or plots. Tables may be a list of dictionaries (each representing a row) or a pandas DataFrame. Plots may be matplotlib or plotly figures.
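
As a rough illustration of the table outputs described above, the sketch below builds the same result first as a list of dictionaries and then as a pandas DataFrame. The function name `missing_value_summary` and the toy data are hypothetical and not part of the ValidMind Library; registering a function like this as a custom test is a separate step.

```python
import pandas as pd


def missing_value_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical custom test body: one row per column with its missing-value count."""
    # Each dictionary represents one row of the output table
    rows = [
        {"Column": col, "Missing": int(df[col].isna().sum())}
        for col in df.columns
    ]
    # The same rows can also be returned as a DataFrame
    return pd.DataFrame(rows)


toy = pd.DataFrame({"annual_inc": [50000, None, 72000], "term": [36, 60, 36]})
summary = missing_value_summary(toy)
print(summary)
```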

**Test suites**: Collections of tests designed to run together to automate and generate model documentation end-to-end for specific use-cases.

Example: The [`classifier_full_suite`](https://docs.validmind.ai/validmind/validmind/test_suites/classifier.html#ClassifierFullSuite) test suite runs tests from the [`tabular_dataset`](https://docs.validmind.ai/validmind/validmind/test_suites/tabular_datasets.html) and [`classifier`](https://docs.validmind.ai/validmind/validmind/test_suites/classifier.html) test suites to fully document the data and model sections for binary classification model use-cases.

<a id='toc2_'></a>

## Install the ValidMind Library

To install the library:

In [None]:
%pip install -q validmind

In [None]:
import os
os.environ["VALIDMIND_LLM_DESCRIPTIONS_CONTEXT_ENABLED"] = "1"

context = """
FORMAT FOR THE LLM DESCRIPTIONS: 
    **<Test Name>** is designed to <begin with a concise overview of what the test does and its primary purpose, 
    extracted from the test description>.

    The test operates by <write a paragraph about the test mechanism, explaining how it works and what it measures. 
    Include any relevant formulas or methodologies mentioned in the test description.>

    The primary advantages of this test include <write a paragraph about the test's strengths and capabilities, 
    highlighting what makes it particularly useful for specific scenarios.>

    Users should be aware that <write a paragraph about the test's limitations and potential risks. 
    Include both technical limitations and interpretation challenges. 
    If the test description includes specific signs of high risk, incorporate these here.>

    **Key Insights:**

    The test results reveal:

    - **<insight title>**: <comprehensive description of one aspect of the results>
    - **<insight title>**: <comprehensive description of another aspect>
    ...

    Based on these results, <conclude with a brief paragraph that ties together the test results with the test's 
    purpose and provides any final recommendations or considerations.>

ADDITIONAL INSTRUCTIONS:
    Present insights in order from general to specific, with each insight as a single bullet point with bold title.
    You are a model validator and the goal is to identify risk and/or suggest room for improvements or recommendations on what Model Developer should do in order to improve outcomes and reduce risk

    For each metric in the test results, include in the test overview:
    - The metric's purpose and what it measures
    - Its mathematical formula
    - The range of possible values
    - What constitutes good/bad performance
    - How to interpret different values

    Each insight should progressively cover:
    1. Overall scope and distribution
    2. Complete breakdown of all elements with specific values
    3. Natural groupings and patterns
    4. Comparative analysis between datasets/categories
    5. Stability and variations
    6. Notable relationships or dependencies

    Remember:
    - Keep all insights at the same level (no sub-bullets or nested structures)
    - Make each insight complete and self-contained
    - Include specific numerical values and ranges
    - Cover all elements in the results comprehensively
    - Maintain clear, concise language
    - Use only "- **Title**: Description" format for insights
    - Progress naturally from general to specific observations

""".strip()

os.environ["VALIDMIND_LLM_DESCRIPTIONS_CONTEXT"] = context

<a id='toc3_'></a>

## Initialize the ValidMind Library

ValidMind generates a unique _code snippet_ for each registered model to connect with your developer environment. You initialize the ValidMind Library with this code snippet, which ensures that your documentation and tests are uploaded to the correct model when you run the notebook.

<a id='toc3_1_'></a>

### Get your code snippet

1. In a browser, [log in to ValidMind](https://docs.validmind.ai/guide/configuration/log-in-to-validmind.html).

2. In the left sidebar, navigate to **Model Inventory** and click **+ Register Model**.

3. Enter the model details and click **Continue**. ([Need more help?](https://docs.validmind.ai/guide/model-inventory/register-models-in-inventory.html))

   For example, to register a model for use with this notebook, select:

   - Documentation template: `Credit Risk Scorecard`
   - Use case: `Credit Risk - CECL`

   You can fill in other options according to your preference.

4. Go to **Getting Started** and click **Copy snippet to clipboard**.

Next, [load your model identifier credentials from an `.env` file](https://docs.validmind.ai/developer/model-documentation/store-credentials-in-env-file.html) or replace the placeholder with your own code snippet:
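
Before initializing, it can help to confirm that the credentials are actually present in the environment. The following is a minimal sketch using only the standard library; the variable names in `REQUIRED_VARS` are assumptions for illustration, so replace them with the names your own `.env` file uses.

```python
import os

# Assumed variable names, for illustration only; adjust to match your setup
REQUIRED_VARS = ("VM_API_HOST", "VM_API_KEY", "VM_API_SECRET", "VM_API_MODEL")


def missing_credentials(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credentials()
if missing:
    print("Missing credentials:", ", ".join(missing))
else:
    print("All expected credentials are set.")
```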

In [None]:
# Load your model identifier credentials from an `.env` file
import validmind as vm

vm.init(
    # Paste the arguments from your code snippet here
)

<a id='toc4_'></a>

## Initialize the Python environment and import Model Developer Model

Next, let's import the model developer's model, which serves as the champion model.

In [None]:
import xgboost as xgb

#Load the saved model
xgb_model = xgb.XGBClassifier()
xgb_model.load_model("xgb_model_champion.pkl")
xgb_model

In [None]:
# Ensure that the feature order in the dataset matches the order used by the champion model
cols_when_model_builds = xgb_model.get_booster().feature_names

<a id='toc4_1_'></a>

### Preview the Validation Report Template

A template predefines sections for your model documentation and provides a general outline to follow, making the documentation process much easier.

You'll upload documentation and test results into this template later on. For now, take a look at the structure that the template provides with the `vm.preview_template()` function from the ValidMind library and note the empty sections:

In [None]:
vm.preview_template()

<a id='toc5_'></a>

## Load the sample dataset that the model developer provided to the validation team and used to develop, train, and test the model

The sample dataset used here is provided by the ValidMind library. To be able to use it, you'll need to import the dataset and load it into a pandas [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), a two-dimensional tabular data structure that makes use of rows and columns:

In [None]:
from validmind.datasets.credit_risk import lending_club

df = lending_club.load_data(source="offline")
df.head()

<a id='toc5_1_'></a>

### Obtain the preprocessed dataset from the model developer for data quality testing purposes

In [None]:
preprocess_df = lending_club.preprocess(df)

<a id='toc5_2_'></a>

### Obtain the final Feature engineered dataset that Model Developer uses to train and test the model

In the feature engineering phase, we apply specific transformations to optimize the dataset for predictive modeling in our application scorecard. 

Using the `lending_club.feature_engineering()` function, the model developer conducted the following operations:
- **WoE encoding**: Converts both numerical and categorical features into Weight of Evidence (WoE) values. WoE is a statistical measure used in scorecard modeling that quantifies the relationship between a predictor variable and the binary target variable. It is the natural logarithm of the ratio of the distribution of good outcomes to the distribution of bad outcomes for each category or bin of a feature. This transformation helps to ensure that the features are predictive and consistent in their contribution to the model.
- **Integration of WoE bins**: Ensures that the WoE transformed values are integrated throughout the dataset, replacing the original feature values while excluding the target variable from this transformation. This transformation is used to maintain a consistent scale and impact of each variable within the model, which helps make the predictions more stable and accurate.
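
To make the WoE calculation concrete, here is a minimal sketch on a toy dataset. It assumes that a target value of 1 marks a bad (default) outcome, and it is not necessarily how `lending_club.feature_engineering()` is implemented; the helper `woe_table` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def woe_table(df: pd.DataFrame, feature: str, target: str) -> pd.DataFrame:
    """Return per-category WoE, assuming target == 1 marks a bad outcome."""
    counts = df.groupby(feature)[target].agg(bad="sum", total="count")
    counts["good"] = counts["total"] - counts["bad"]
    # Share of all goods and of all bads that fall into each category
    counts["pct_good"] = counts["good"] / counts["good"].sum()
    counts["pct_bad"] = counts["bad"] / counts["bad"].sum()
    # WoE is the log of the ratio of the two distributions
    counts["woe"] = np.log(counts["pct_good"] / counts["pct_bad"])
    return counts


toy = pd.DataFrame({
    "home_ownership": ["RENT"] * 4 + ["OWN"] * 4,
    "loan_status": [1, 1, 1, 0, 1, 0, 0, 0],
})
table = woe_table(toy, "home_ownership", "loan_status")
print(table[["good", "bad", "woe"]])
```

With this sign convention, a positive WoE means the category holds a larger share of good outcomes than of bad outcomes. A category with zero goods or zero bads would produce an infinite value, so real implementations usually merge or smooth such bins.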

In [None]:
fe_df = lending_club.feature_engineering(preprocess_df)
fe_df.head()

<a id='toc6_'></a>

## As a model validator, we want to split the feature-engineered dataset into train and test sets for validation testing purposes. In addition, we also want to include challenger/benchmark models.

In this section, we perform a random train/test split, because the validator wants to challenge the developer's work independently:
- We begin by dividing our data, which is based on Weight of Evidence (WoE) features, into training and testing sets (`train_df`, `test_df`). 
- With `lending_club.split`, we employ a simple random split, randomly allocating data points to each set to ensure a mix of examples in both.
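
The idea of a simple random split can be sketched with pandas alone. This is an illustration under the assumption of a unique row index, not the actual implementation of `lending_club.split`; the helper `random_split` and the toy data are hypothetical.

```python
import pandas as pd


def random_split(df: pd.DataFrame, test_size: float = 0.2, seed: int = 42):
    """Randomly allocate rows to train and test sets without overlap."""
    # Draw the test rows at random, then keep everything else for training
    test_part = df.sample(frac=test_size, random_state=seed)
    train_part = df.drop(test_part.index)
    return train_part, test_part


toy = pd.DataFrame({"x_woe": range(10), "loan_status": [0, 1] * 5})
train_toy, test_toy = random_split(toy)
print(len(train_toy), len(test_toy))
```

Fixing the seed keeps the split reproducible, which makes it easier to compare validation results across runs.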

In [None]:
# Split the data
train_df, test_df = lending_club.split(fe_df, test_size=0.2)

x_train = train_df.drop(lending_club.target_column, axis=1)
y_train = train_df[lending_club.target_column]

x_test = test_df.drop(lending_club.target_column, axis=1)
y_test = test_df[lending_club.target_column]

#now let's apply the order of features from the champion model construction
x_train = x_train[cols_when_model_builds]
x_test = x_test[cols_when_model_builds]

In [None]:
cols_use = ['annual_inc_woe',
 'verification_status_woe',
 'emp_length_woe',
 'installment_woe',
 'term_woe',
 'home_ownership_woe',
 'purpose_woe',
 'open_acc_woe',
 'total_acc_woe',
 'int_rate_woe',
 'sub_grade_woe',
 'grade_woe','loan_status']


train_df = train_df[cols_use]
test_df = test_df[cols_use]
test_df.head()

### As a Model Validator I also want to investigate potential challenger models - Let's train two challenger models as basis for the testing

In [None]:
# Define the Random Forest model
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(
    n_estimators=50, 
    random_state=42,
)

# Fit the model
rf_model.fit(x_train, y_train)

In [None]:
#Second Challenger Model a Logistic Regression
from sklearn.linear_model import LogisticRegression

# Logistic Regression grid params
log_reg_params = {
    "penalty": ["l1", "l2"],
    "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    "solver": ["liblinear"],
}

# Grid search for Logistic Regression
from sklearn.model_selection import GridSearchCV

grid_log_reg = GridSearchCV(LogisticRegression(), log_reg_params)
grid_log_reg.fit(x_train, y_train)

# Logistic Regression best estimator
log_reg = grid_log_reg.best_estimator_
log_reg

<a id='toc6_1_'></a>

### Compute probabilities, the raw probabilistic output from the models of interest

In [None]:
train_xgb_prob = xgb_model.predict_proba(x_train)[:, 1]
test_xgb_prob = xgb_model.predict_proba(x_test)[:, 1]

train_rf_prob = rf_model.predict_proba(x_train)[:, 1]
test_rf_prob = rf_model.predict_proba(x_test)[:, 1]

train_log_prob = log_reg.predict_proba(x_train)[:, 1]
test_log_prob = log_reg.predict_proba(x_test)[:, 1]

<a id='toc6_2_'></a>

### Compute binary predictions

In [None]:
cut_off_threshold = 0.3 

train_xgb_binary_predictions = (train_xgb_prob > cut_off_threshold).astype(int)
test_xgb_binary_predictions = (test_xgb_prob > cut_off_threshold).astype(int)

train_rf_binary_predictions = (train_rf_prob > cut_off_threshold).astype(int)
test_rf_binary_predictions = (test_rf_prob > cut_off_threshold).astype(int)

train_log_binary_predictions = (train_log_prob > cut_off_threshold).astype(int)
test_log_binary_predictions = (test_log_prob > cut_off_threshold).astype(int)
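
The choice of cutoff changes which applicants are flagged, so it is worth checking how sensitive the outcome counts are to it. The sketch below uses hypothetical probabilities and labels, with the same strict `>` comparison as the cell above.

```python
import numpy as np


def confusion_counts(probs, y_true, cutoff):
    """Count outcomes after labelling every probability above the cutoff as 1."""
    preds = (np.asarray(probs) > cutoff).astype(int)
    y_true = np.asarray(y_true)
    return {
        "tp": int(((preds == 1) & (y_true == 1)).sum()),
        "fp": int(((preds == 1) & (y_true == 0)).sum()),
        "fn": int(((preds == 0) & (y_true == 1)).sum()),
        "tn": int(((preds == 0) & (y_true == 0)).sum()),
    }


probs = [0.10, 0.35, 0.40, 0.80]  # hypothetical default probabilities
y_true = [0, 0, 1, 1]             # hypothetical observed outcomes

for cutoff in (0.3, 0.5):
    print(cutoff, confusion_counts(probs, y_true, cutoff))
```

In this toy data, the lower cutoff catches every default at the cost of one false positive, while the higher cutoff removes the false positive but misses one default.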

<a id='toc7_'></a>

## Document the model

To document the model with the ValidMind Library, you'll need to:
1. Preprocess the raw dataset
2. Initialize some training and test datasets
3. Initialize a model object you can use for testing
4. Run the full suite of tests

<a id='toc7_1_'></a>

### Initialize the ValidMind datasets

Before you can run tests, you must first initialize a ValidMind dataset object using the [`init_dataset`](https://docs.validmind.ai/validmind/validmind.html#init_dataset) function from the ValidMind (`vm`) module.

This function takes a number of arguments:

- `dataset`: The dataset that you want to provide as input to tests.
- `input_id`: A unique identifier that allows tracking what inputs are used when running each individual test.
- `target_column`: A required argument if tests require access to true values. This is the name of the target column in the dataset.

With all datasets ready, you can now initialize the raw, preprocessed, feature-engineered, training, and test datasets (`df`, `preprocess_df`, `fe_df`, `train_df`, and `test_df`) created earlier into their own dataset objects using [`vm.init_dataset()`](https://docs.validmind.ai/validmind/validmind.html#init_dataset):

In [None]:
vm_raw_dataset = vm.init_dataset(
    dataset=df,
    input_id="raw_dataset",
    target_column=lending_club.target_column,
)

vm_preprocess_dataset = vm.init_dataset(
    dataset=preprocess_df,
    input_id="preprocess_dataset",
    target_column=lending_club.target_column,
)

vm_fe_dataset = vm.init_dataset(
    dataset=fe_df,
    input_id="fe_dataset",
    target_column=lending_club.target_column,
)

vm_train_ds = vm.init_dataset(
    dataset=train_df,
    input_id="train_dataset",
    target_column=lending_club.target_column,
)

vm_test_ds = vm.init_dataset(
    dataset=test_df,
    input_id="test_dataset",
    target_column=lending_club.target_column,
)

<a id='toc7_2_'></a>

### Initialize a model object

You will also need to initialize a ValidMind model object for each model (`vm_xgb_model`, `vm_rf_model`, and `vm_log_model`) that can be passed to other functions for analysis and tests on the data. You initialize each model object with [`vm.init_model()`](https://docs.validmind.ai/validmind/validmind.html#init_model):

In [None]:
vm_xgb_model = vm.init_model(
    xgb_model,
    input_id="xgb_model_developer_champion",
)

vm_rf_model = vm.init_model(
    rf_model,
    input_id="rf_model",
)

vm_log_model = vm.init_model(
    log_reg,
    input_id="log_model",
)


<a id='toc7_3_'></a>

### Assign prediction values and probabilities to the datasets

With our models now trained, we'll move on to assigning both the predicted probabilities coming directly from each model and the binary predictions obtained after applying the cutoff threshold described in the previous steps.
- These tasks are achieved through the use of the `assign_predictions()` method associated with the VM `dataset` object.
- This method links the model's class prediction values and probabilities to our VM train and test datasets.

In [None]:
# XGBoost Model
vm_train_ds.assign_predictions(
    model=vm_xgb_model,
    prediction_values=train_xgb_binary_predictions,
    prediction_probabilities=train_xgb_prob,
)

vm_test_ds.assign_predictions(
    model=vm_xgb_model,
    prediction_values=test_xgb_binary_predictions,
    prediction_probabilities=test_xgb_prob,
)

# Random Forest Model
vm_train_ds.assign_predictions(
    model=vm_rf_model,
    prediction_values=train_rf_binary_predictions,
    prediction_probabilities=train_rf_prob,
)

vm_test_ds.assign_predictions(
    model=vm_rf_model,
    prediction_values=test_rf_binary_predictions,
    prediction_probabilities=test_rf_prob,
)


# Logistic Regression 
vm_train_ds.assign_predictions(
    model=vm_log_model,
    prediction_values=train_log_binary_predictions,
    prediction_probabilities=train_log_prob,
)

vm_test_ds.assign_predictions(
    model=vm_log_model,
    prediction_values=test_log_binary_predictions,
    prediction_probabilities=test_log_prob,
)



### Compute credit risk scores

In this phase, we translate model predictions into actionable scores using probability estimates generated by our trained model.
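
One common way to turn a default probability into a score is the points-to-double-the-odds (PDO) scaling. The sketch below illustrates that technique; the constants `target_score`, `target_odds`, and `pdo` are hypothetical, and `lending_club.compute_scores` may use a different formula or different constants.

```python
import numpy as np


def probability_to_score(prob_default, target_score=600, target_odds=50, pdo=20):
    """Map default probabilities to scores using PDO scaling (illustrative constants)."""
    prob_default = np.asarray(prob_default, dtype=float)
    # Odds of being good: (1 - default probability) / default probability
    odds_good = (1 - prob_default) / prob_default
    # Every doubling of the odds adds `pdo` points to the score
    factor = pdo / np.log(2)
    offset = target_score - factor * np.log(target_odds)
    return offset + factor * np.log(odds_good)


example_probs = [1 / 51, 1 / 101, 0.5]  # odds of 50:1, 100:1, and 1:1
example_scores = probability_to_score(example_probs)
print(example_scores)
```

Under this scaling, lower default probabilities map to higher scores, which matches the expectation that higher scores correspond to better odds.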

In [None]:
train_xgb_scores = lending_club.compute_scores(train_xgb_prob)
test_xgb_scores = lending_club.compute_scores(test_xgb_prob)
train_rf_scores = lending_club.compute_scores(train_rf_prob)
test_rf_scores = lending_club.compute_scores(test_rf_prob)
train_log_scores = lending_club.compute_scores(train_log_prob)
test_log_scores = lending_club.compute_scores(test_log_prob)

# Assign scores to the datasets
vm_train_ds.add_extra_column("xgb_scores", train_xgb_scores)
vm_test_ds.add_extra_column("xgb_scores", test_xgb_scores)
vm_train_ds.add_extra_column("rf_scores", train_rf_scores)
vm_test_ds.add_extra_column("rf_scores", test_rf_scores)
vm_train_ds.add_extra_column("log_scores", train_log_scores)
vm_test_ds.add_extra_column("log_scores", test_log_scores)

## Model Validation Testing

In the section below, you (the model validator) will select a series of tests from ValidMind to independently challenge the model developer's evidence and assessment. In addition, you will configure custom tests (tests not available out of the box). The focus will be on the following testing:

- Ensuring Data used for training and testing the model is of appropriate data quality
- Ensuring that Raw Data has been pre-processed appropriately and the resulting feature engineered dataset reflects this
- Comprehensive testing around Model Performance of both the Developer Champion Model and challenger models developed by Validator


In [None]:
# First, we start with data quality testing, leveraging ValidMind tests for this purpose
# Explore the available tests
from validmind.tests import (
    describe_test,
    list_tests,
    list_tasks,
    list_tags,
    list_tasks_and_tags,
)

In [None]:
#let's find the data quality tests relevant for a classification use-case
list_tasks_and_tags()

In [None]:
#list the tests that we want to run as a validator for data_quality
list_tests(
    tags=["data_quality"], task="classification"
)

In [None]:
dq = list_tests(tags=["data_quality"], task="classification",pretty=False)
dq              

In [None]:
# Now let's run the list of tests, first focusing on the data quality of the datasets used for training.
# Afterwards, we will run a comparison test between the raw and preprocessed datasets, i.e. has the developer addressed potential concerns?
for test in dq:
    vm.tests.run_test(
        test,
        inputs={
            "dataset": vm_preprocess_dataset
        },
    ).log() 

### Let's compare the raw dataset with the preprocessed dataset, i.e. has the developer processed the dataset from raw to preprocessed appropriately?

In [None]:
# Next, let's run a comparison test between the raw and preprocessed datasets
for test in dq:
    vm.tests.run_test(
        test,
        input_grid={
            "dataset": [vm_raw_dataset,vm_preprocess_dataset]
        },
    ).log()

In [None]:
#Now let's do some independent testing with regards to performance of the champion model (xgboost). 
#First we want to test independently on the champion model and then we will move forward to add challenger models that we have trained and defined before
list_tests(tags=["model_performance"], task="classification")

In [None]:
mpt = ['validmind.model_validation.sklearn.ClassifierPerformance:xgboost_champion','validmind.model_validation.sklearn.ConfusionMatrix:xgboost_champion',
       'validmind.model_validation.sklearn.MinimumAccuracy:xgboost_champion', 'validmind.model_validation.sklearn.MinimumF1Score:xgboost_champion','validmind.model_validation.sklearn.ROCCurve:xgboost_champion']

In [None]:
for test in mpt:
    vm.tests.run_test(
        test,
        inputs={
            "dataset": vm_test_ds, "model" : vm_xgb_model, 
        },
    ).log()

In [None]:
# Excellent, we have now conducted similar tests as the developer in order to verify the results. Now let's provide some challenge by introducing challenger models

#first let's provide some identifiers to the tests since they are now challenger model tests
mpt_chall = ['validmind.model_validation.sklearn.ClassifierPerformance:xgboost_champion_vs_challengers','validmind.model_validation.sklearn.ConfusionMatrix:xgboost_champion_vs_challengers',
       'validmind.model_validation.sklearn.MinimumAccuracy:xgboost_champion_vs_challengers', 'validmind.model_validation.sklearn.MinimumF1Score:xgboost_champion_vs_challengers','validmind.model_validation.sklearn.ROCCurve:xgboost_champion_vs_challengers']

In [None]:
import os
os.environ["VALIDMIND_LLM_DESCRIPTIONS_CONTEXT_ENABLED"] = "1"

context = """
FORMAT FOR THE LLM DESCRIPTIONS: 
    **<Test Name>** is designed to <begin with a concise overview of what the test does and its primary purpose, 
    extracted from the test description>.

    The test operates by <write a paragraph about the test mechanism, explaining how it works and what it measures. 
    Include any relevant formulas or methodologies mentioned in the test description.>

    The primary advantages of this test include <write a paragraph about the test's strengths and capabilities, 
    highlighting what makes it particularly useful for specific scenarios.>

    Users should be aware that <write a paragraph about the test's limitations and potential risks. 
    Include both technical limitations and interpretation challenges. 
    If the test description includes specific signs of high risk, incorporate these here.>

    **Key Insights:**

    The test results reveal:

    - **<insight title>**: <comprehensive description of one aspect of the results>
    - **<insight title>**: <comprehensive description of another aspect>
    ...

    Based on these results, <conclude with a brief paragraph that ties together the test results with the test's 
    purpose and provides any final recommendations or considerations.>

ADDITIONAL INSTRUCTIONS:

    The champion model as the basis for comparison is called "xgb_model_developer_champion" and emphasis should be on the following:
    - The metrics for the champion model compared against the challenger models
    - Which model potentially outperforms the champion model based on the metrics, this should be highlighted and emphasized


    For each metric in the test results, include in the test overview:
    - The metric's purpose and what it measures
    - Its mathematical formula
    - The range of possible values
    - What constitutes good/bad performance
    - How to interpret different values

    Each insight should progressively cover:
    1. Overall scope and distribution
    2. Complete breakdown of all elements with specific values
    3. Natural groupings and patterns
    4. Comparative analysis between datasets/categories
    5. Stability and variations
    6. Notable relationships or dependencies

    Remember:
    - Champion model (xgb_model_developer_champion) is the selection and challenger models are used to challenge the selection
    - Keep all insights at the same level (no sub-bullets or nested structures)
    - Make each insight complete and self-contained
    - Include specific numerical values and ranges
    - Cover all elements in the results comprehensively
    - Maintain clear, concise language
    - Use only "- **Title**: Description" format for insights
    - Progress naturally from general to specific observations

""".strip()

os.environ["VALIDMIND_LLM_DESCRIPTIONS_CONTEXT"] = context

In [None]:
#let's do same performance tests as above, but let's challenge the actual model itself and add two additional benchmark models
for test in mpt_chall:
    vm.tests.run_test(
        test,
        input_grid={
            "dataset": [vm_test_ds], "model" : [vm_xgb_model,vm_log_model,vm_rf_model], 
        },
    ).log()

In [None]:
### Now let's dig a little deeper into one of the tests that allows the validator to customize parameters and thresholds for performance standards, and let's disregard the RF model as we have
### learned that the RF model is not a viable candidate based on the performance metrics.
# Test parameters are passed through `params`, separately from the inputs in `input_grid`
result = vm.tests.run_test(
    'validmind.model_validation.sklearn.MinimumF1Score:AdjThreshold',
    input_grid={
        "dataset": [vm_test_ds], "model" : [vm_xgb_model,vm_log_model],
    },
    params={'min_threshold': 0.35},
).log()

In [None]:
### Robustness and Stability Testing Comparison Between the Two Models
list_tests(tags=["model_diagnosis"], task="classification")

In [None]:
# Let's check whether the models show signs of overfitting and whether any sub-segments are problematic
overfit_testing = ['validmind.model_validation.sklearn.TrainingTestDegradation:Champion_vs_LogRegression','validmind.model_validation.sklearn.OverfitDiagnosis:Champion_vs_LogRegression'] 


In [None]:
for test in overfit_testing:
    vm.tests.run_test(
        test,
        input_grid={
            "datasets": [[vm_train_ds,vm_test_ds]], "model" : [vm_xgb_model,vm_log_model], 
        },
    ).log()

In [None]:
### Now finally let's conduct robustness and stability testing of the two models:
stab_robust = ['validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression'] # 'validmind.model_validation.sklearn.WeakspotsDiagnosis:Champion_vs_LogRegression'


In [None]:
for test in stab_robust:
    vm.tests.run_test(
        test,
        input_grid={
            "datasets": [[vm_train_ds,vm_test_ds]], "model" : [vm_xgb_model,vm_log_model], 
        },
    ).log()

### Let's verify the feature importance and inspect differences - different models might have more intuitive feature impacts that might lead to decisions in selection of a model

In [None]:
FI = list_tests(tags=["feature_importance"], task="classification",pretty=False)
FI

In [None]:
for test in FI:
    vm.tests.run_test(
        "".join((test,':Champion_vs_LogisticRegression')),
        input_grid={
            "dataset": [vm_test_ds], "model" : [vm_xgb_model,vm_log_model], 
        },
    ).log()

In [None]:
### Let's finish off with a custom test example - scoring (customization of output to a FICO score type)
import numpy as np
import pandas as pd
import plotly.graph_objects as go


@vm.test("my_custom_tests.ScoreToOdds")
def score_to_odds_analysis(dataset, score_column='score', score_bands=[410, 440, 470]):
    """
    Analyzes the relationship between score bands and odds (good:bad ratio).
    Odds = (1 - default_rate) / default_rate

    Higher scores should correspond to higher odds of being good.

    If results are provided for more than one score column, each column holds the scores
    of a different model. In that case, focus the assessment on the differences between
    the models' scores and indicate, with evidence, which one is preferred.
    """
    # Work on a copy so the helper column added below never leaks into the dataset
    df = dataset.df.copy()
    
    # Create score bands
    df['score_band'] = pd.cut(
        df[score_column],
        bins=[-np.inf] + score_bands + [np.inf],
        labels=[f'<{score_bands[0]}'] + 
               [f'{score_bands[i]}-{score_bands[i+1]}' for i in range(len(score_bands)-1)] +
               [f'>{score_bands[-1]}']
    )
    
    # Calculate metrics per band
    # observed=False keeps empty bands in the output and avoids a pandas FutureWarning
    results = df.groupby('score_band', observed=False).agg({
        dataset.target_column: ['mean', 'count']
    })
    
    results.columns = ['Default Rate', 'Total']
    results['Good Count'] = results['Total'] - (results['Default Rate'] * results['Total'])
    results['Bad Count'] = results['Default Rate'] * results['Total']
    results['Odds'] = results['Good Count'] / results['Bad Count']
    
    # Create visualization
    fig = go.Figure()
    
    # Add odds bars
    fig.add_trace(go.Bar(
        name='Odds (Good:Bad)',
        x=results.index,
        y=results['Odds'],
        marker_color='blue'
    ))
    
    fig.update_layout(
        title='Score-to-Odds Analysis',
        xaxis=dict(title='Score band'),
        yaxis=dict(title='Odds (Good:Bad)'),
        showlegend=False
    )
    
    return fig
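
Because the test returns only a figure, it can help to see the numbers behind the bars. The following worked example applies the same banding and odds arithmetic to a small, made-up sample; the `score` and `default` column names and all values are hypothetical and are not taken from the Lending Club data.

```python
import numpy as np
import pandas as pd

# Hypothetical scores and outcomes (1 = default, 0 = good).
df = pd.DataFrame({
    "score":   [400, 405, 420, 430, 450, 460, 465, 480, 490, 500, 510],
    "default": [1,   1,   1,   0,   1,   0,   0,   0,   0,   0,   1],
})
score_bands = [410, 440, 470]

# Same banding as in the custom test: open-ended first and last bands.
df["score_band"] = pd.cut(
    df["score"],
    bins=[-np.inf] + score_bands + [np.inf],
    labels=["<410", "410-440", "440-470", ">470"],
)

summary = df.groupby("score_band", observed=False)["default"].agg(["mean", "count"])
summary.columns = ["Default Rate", "Total"]
summary["Bad Count"] = summary["Default Rate"] * summary["Total"]
summary["Good Count"] = summary["Total"] - summary["Bad Count"]
summary["Odds"] = summary["Good Count"] / summary["Bad Count"]
print(summary)
```

In this sample the odds rise from 0 in the lowest band to 3 in the highest, which is the monotonic pattern a well-behaved scorecard should show. A band with no defaults would give infinite odds, so sparse bands deserve a closer look before drawing conclusions.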

In [None]:
result = vm.tests.run_test(
    "my_custom_tests.ScoreToOdds:Champion_vs_Challenger",
    inputs={
        "dataset": vm_test_ds,
    },
    param_grid={
        "score_column": ["xgb_scores","log_scores"],
        "score_bands": [[500, 540, 570]],
    },
).log()

### Finally, we take all of the tests that the developer provided as evidence and, as a last task, verify that the testing was recorded appropriately

In [None]:
from validmind.utils import preview_test_config

test_config = {'validmind.data_validation.DatasetDescription:raw_data': {'inputs': {'dataset': 'raw_dataset'}},
 'validmind.data_validation.DescriptiveStatistics:raw_data': {'inputs': {'dataset': 'raw_dataset'}},
 'validmind.data_validation.MissingValues:raw_data': {'inputs': {'dataset': 'raw_dataset'},
  'params': {'min_threshold': 1}},
 'validmind.data_validation.ClassImbalance:raw_data': {'inputs': {'dataset': 'raw_dataset'},
  'params': {'min_percent_threshold': 10}},
 'validmind.data_validation.Duplicates:raw_data': {'inputs': {'dataset': 'raw_dataset'},
  'params': {'min_threshold': 1}},
 'validmind.data_validation.HighCardinality:raw_data': {'inputs': {'dataset': 'raw_dataset'},
  'params': {'num_threshold': 100,
   'percent_threshold': 0.1,
   'threshold_type': 'percent'}},
 'validmind.data_validation.Skewness:raw_data': {'inputs': {'dataset': 'raw_dataset'},
  'params': {'max_threshold': 1}},
 'validmind.data_validation.UniqueRows:raw_data': {'inputs': {'dataset': 'raw_dataset'},
  'params': {'min_percent_threshold': 1}},
 'validmind.data_validation.TooManyZeroValues:raw_data': {'inputs': {'dataset': 'raw_dataset'},
  'params': {'max_percent_threshold': 0.03}},
 'validmind.data_validation.IQROutliersTable:raw_data': {'inputs': {'dataset': 'raw_dataset'},
  'params': {'threshold': 5}},
 'validmind.data_validation.DescriptiveStatistics:preprocessed_data': {'inputs': {'dataset': 'preprocess_dataset'}},
 'validmind.data_validation.TabularDescriptionTables:preprocessed_data': {'inputs': {'dataset': 'preprocess_dataset'}},
 'validmind.data_validation.MissingValues:preprocessed_data': {'inputs': {'dataset': 'preprocess_dataset'},
  'params': {'min_threshold': 1}},
 'validmind.data_validation.TabularNumericalHistograms:preprocessed_data': {'inputs': {'dataset': 'preprocess_dataset'}},
 'validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data': {'inputs': {'dataset': 'preprocess_dataset'}},
 'validmind.data_validation.TargetRateBarPlots:preprocessed_data': {'inputs': {'dataset': 'preprocess_dataset'},
  'params': {'default_column': 'loan_status'}},
 'validmind.data_validation.DescriptiveStatistics:development_data': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset']}},
 'validmind.data_validation.TabularDescriptionTables:development_data': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset']}},
 'validmind.data_validation.ClassImbalance:development_data': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset']},
  'params': {'min_percent_threshold': 10}},
 'validmind.data_validation.UniqueRows:development_data': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset']},
  'params': {'min_percent_threshold': 1}},
 'validmind.data_validation.TabularNumericalHistograms:development_data': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset']}},
 'validmind.data_validation.MutualInformation:development_data': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset']},
  'params': {'min_threshold': 0.01}},
 'validmind.data_validation.PearsonCorrelationMatrix:development_data': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset']}},
 'validmind.data_validation.HighPearsonCorrelation:development_data': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset']},
  'params': {'max_threshold': 0.3, 'top_n_correlations': 10}},
 'validmind.data_validation.WOEBinTable': {'input_grid': {'dataset': ['preprocess_dataset']},
  'params': {'breaks_adj': {'loan_amnt': [5000, 10000, 15000, 20000, 25000],
    'int_rate': [10, 15, 20],
    'annual_inc': [50000, 100000, 150000]}}},
 'validmind.data_validation.WOEBinPlots': {'input_grid': {'dataset': ['preprocess_dataset']},
  'params': {'breaks_adj': {'loan_amnt': [5000, 10000, 15000, 20000, 25000],
    'int_rate': [10, 15, 20],
    'annual_inc': [50000, 100000, 150000]}}},
 'validmind.data_validation.DatasetSplit': {'inputs': {'datasets': ['train_dataset',
    'test_dataset']}},
 'validmind.model_validation.ModelMetadata': {'input_grid': {'model': ['xgb_model',
    'rf_model']}},
 'validmind.model_validation.sklearn.ModelParameters': {'input_grid': {'model': ['xgb_model',
    'rf_model']}},
 'validmind.model_validation.statsmodels.GINITable': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset'],
   'model': ['xgb_model', 'rf_model']}},
 'validmind.model_validation.sklearn.ClassifierPerformance': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset'],
   'model': ['xgb_model', 'rf_model']}},
 'validmind.model_validation.sklearn.TrainingTestDegradation:XGBoost': {'inputs': {'datasets': ['train_dataset',
    'test_dataset'],
   'model': 'xgb_model'},
  'params': {'max_threshold': 0.1}},
 'validmind.model_validation.sklearn.TrainingTestDegradation:RandomForest': {'inputs': {'datasets': ['train_dataset',
    'test_dataset'],
   'model': 'rf_model'},
  'params': {'max_threshold': 0.1}},
 'validmind.model_validation.sklearn.ROCCurve': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset'],
   'model': ['xgb_model']}},
 'validmind.model_validation.sklearn.MinimumROCAUCScore': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset'],
   'model': ['xgb_model']},
  'params': {'min_threshold': 0.5}},
 'validmind.model_validation.statsmodels.PredictionProbabilitiesHistogram': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset'],
   'model': ['xgb_model']}},
 'validmind.model_validation.statsmodels.CumulativePredictionProbabilities': {'input_grid': {'model': ['xgb_model'],
   'dataset': ['train_dataset', 'test_dataset']}},
 'validmind.model_validation.sklearn.PopulationStabilityIndex': {'inputs': {'datasets': ['train_dataset',
    'test_dataset'],
   'model': 'xgb_model'},
  'params': {'num_bins': 10, 'mode': 'fixed'}},
 'validmind.model_validation.sklearn.ConfusionMatrix': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset'],
   'model': ['xgb_model']}},
 'validmind.model_validation.sklearn.MinimumAccuracy': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset'],
   'model': ['xgb_model']},
  'params': {'min_threshold': 0.7}},
 'validmind.model_validation.sklearn.MinimumF1Score': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset'],
   'model': ['xgb_model']},
  'params': {'min_threshold': 0.5}},
 'validmind.model_validation.sklearn.PrecisionRecallCurve': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset'],
   'model': ['xgb_model']}},
 'validmind.model_validation.sklearn.CalibrationCurve': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset'],
   'model': ['xgb_model']}},
 'validmind.model_validation.sklearn.ClassifierThresholdOptimization': {'inputs': {'dataset': 'train_dataset',
   'model': 'xgb_model'},
  'params': {'target_recall': 0.8}},
 'validmind.model_validation.statsmodels.ScorecardHistogram': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset']},
  'params': {'score_column': 'xgb_scores'}},
 'validmind.data_validation.ScoreBandDefaultRates': {'input_grid': {'dataset': ['train_dataset'],
   'model': ['xgb_model']},
  'params': {'score_column': 'xgb_scores', 'score_bands': [504, 537, 570]}},
 'validmind.model_validation.sklearn.ScoreProbabilityAlignment': {'input_grid': {'dataset': ['train_dataset'],
   'model': ['xgb_model']},
  'params': {'score_column': 'xgb_scores'}},
 'validmind.model_validation.sklearn.WeakspotsDiagnosis': {'inputs': {'datasets': ['train_dataset',
    'test_dataset'],
   'model': 'xgb_model'}},
 'validmind.model_validation.sklearn.OverfitDiagnosis': {'inputs': {'model': 'xgb_model',
   'datasets': ['train_dataset', 'test_dataset']},
  'params': {'cut_off_threshold': 0.04}},
 'validmind.model_validation.sklearn.RobustnessDiagnosis': {'inputs': {'datasets': ['train_dataset',
    'test_dataset'],
   'model': 'xgb_model'},
  'params': {'scaling_factor_std_dev_list': [0.1, 0.2, 0.3, 0.4, 0.5],
   'performance_decay_threshold': 0.05}},
 'validmind.model_validation.sklearn.PermutationFeatureImportance': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset'],
   'model': ['xgb_model']}},
 'validmind.model_validation.FeaturesAUC': {'input_grid': {'model': ['xgb_model'],
   'dataset': ['train_dataset', 'test_dataset']}},
 'validmind.model_validation.sklearn.SHAPGlobalImportance': {'input_grid': {'model': ['xgb_model'],
   'dataset': ['train_dataset', 'test_dataset']},
  'params': {'kernel_explainer_samples': 10,
   'tree_or_linear_explainer_samples': 200}}}

In [None]:
for test_id, config in test_config.items():
    print(test_id)
    try:
        # Tests configured with input_grid run once per input combination;
        # all other tests use a single set of inputs.
        input_key = 'input_grid' if 'input_grid' in config else 'inputs'
        run_kwargs = {input_key: config[input_key]}
        # Pass params only when the configuration defines them
        if 'params' in config:
            run_kwargs['params'] = config['params']
        vm.tests.run_test(test_id, **run_kwargs).log()
    except Exception as e:
        print(f"Error running test {test_id}: {e}")
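
When checking that the results were recorded, it helps to know how many results to expect for each test. The sketch below is one way to derive that from a configuration shaped like `test_config`, assuming that an `input_grid` produces one run per combination of its values and that every entry defines exactly one of `inputs` or `input_grid`. `summarize_config` is a hypothetical helper, and the sample holds two entries copied from the developer configuration.

```python
from math import prod


def summarize_config(config):
    """Return {test_id: expected number of runs} for a test configuration."""
    summary = {}
    for test_id, spec in config.items():
        has_inputs = "inputs" in spec
        has_grid = "input_grid" in spec
        if has_inputs == has_grid:
            raise ValueError(f"{test_id}: expected exactly one of 'inputs' or 'input_grid'")
        # A grid yields one run per combination; plain inputs yield a single run.
        summary[test_id] = prod(len(v) for v in spec["input_grid"].values()) if has_grid else 1
    return summary


# Two entries copied from the developer configuration.
sample_config = {
    "validmind.data_validation.DatasetSplit": {
        "inputs": {"datasets": ["train_dataset", "test_dataset"]},
    },
    "validmind.model_validation.sklearn.ClassifierPerformance": {
        "input_grid": {
            "dataset": ["train_dataset", "test_dataset"],
            "model": ["xgb_model", "rf_model"],
        },
    },
}

run_counts = summarize_config(sample_config)
print(run_counts)
```

Comparing these counts with what appears on the platform, together with any errors printed by the loop, gives a quick check that nothing was silently skipped.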