# ValidMind for model validation — 114 Finalize testing and reporting

Learn how to use ValidMind for your end-to-end model validation process with our series of four introductory notebooks. In this last notebook, you'll configure and run some custom tests, then add test results and findings to your validation report.

As we concluded in [113 Perform validation tests](113-perform_validation_tests.ipynb), our challenger random forest classification model was not a viable candidate for our use case and was eliminated as a contender. We'll finish up by comparing our champion application scorecard model against our remaining challenger logistic regression model, then use the ValidMind Platform to put together our validation report supplemented by our logged test results as evidence and findings.

## Prerequisites

In order to finalize the validation testing and reporting for your sample model, you'll need to first have:

- [ ] Registered a model within the ValidMind Platform and granted yourself access to the model as a validator
- [ ] Installed the ValidMind Library in your local environment, allowing you to access all its features
- [ ] Learned how to import and initialize datasets for use with ValidMind
- [ ] Learned how to enable custom context for test descriptions generated by ValidMind
- [ ] Understood the basics of how to identify and run validation tests
- [ ] Run data quality and model performance tests for your champion and challenger models, and logged the results of those tests to the ValidMind Platform


<div class="alert alert-block alert-info" style="background-color: #B5B5B510; color: black; border: 1px solid #083E44; border-left-width: 5px; box-shadow: 2px 2px 4px rgba(0, 0, 0, 0.2);border-radius: 5px;"><span style="color: #083E44;"><b>Need help with the above steps?</b></span>
<br></br>
Refer to the first three notebooks in this series:

<ol>
    <li><a href="111-import_champion_model.ipynb" style="color: #DE257E;"><b>111 Import the champion model</b></a></li>
    <li><a href="112-develop_challenger_models.ipynb" style="color: #DE257E;"><b>112 Develop potential challenger models</b></a></li>
    <li><a href="113-perform_validation_tests.ipynb" style="color: #DE257E;"><b>113 Perform validation tests</b></a></li>
</ol>

</div>

## Setting up

This section should be very familiar to you by now, as we performed the same actions in the previous two notebooks in this series.

### Initialize the ValidMind Library

As usual, let's first connect the ValidMind Library to the model we previously registered in the ValidMind Platform:

1. In a browser, [log in to ValidMind](https://docs.validmind.ai/guide/configuration/log-in-to-validmind.html).

2. In the left sidebar, navigate to **Inventory** and select the model you registered for this "ValidMind for model validation" series of notebooks.

3. Go to **Getting Started** and click **Copy snippet to clipboard**.

Next, [load your model identifier credentials from an `.env` file](https://docs.validmind.ai/developer/model-documentation/store-credentials-in-env-file.html) or replace the placeholder with your own code snippet:

In [None]:
# Make sure the ValidMind Library is installed

%pip install -q validmind

# Load your model identifier credentials from an `.env` file

%load_ext dotenv
%dotenv .env

# Or replace with your code snippet

import validmind as vm

vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
)
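
Before calling `vm.init()`, it can help to confirm that the `.env` file actually populated the environment. The sketch below assumes the credentials are stored under the names `VM_API_HOST`, `VM_API_KEY`, `VM_API_SECRET`, and `VM_API_MODEL`; these names are an assumption here, so adjust them to match your own `.env` file. `missing_credentials` is a hypothetical helper, not part of the ValidMind Library.

```python
import os

# Assumed variable names; change these to match the keys in your own .env file
EXPECTED_VARS = ["VM_API_HOST", "VM_API_KEY", "VM_API_SECRET", "VM_API_MODEL"]


def missing_credentials(env=None):
    """Return the expected variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in EXPECTED_VARS if not env.get(name)]


missing = missing_credentials()
if missing:
    print("Missing credentials:", ", ".join(missing))
else:
    print("All expected credentials are set.")
```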

### Import the champion model

Next, we'll import the champion model submitted by the model development team, the same one we used in the previous notebooks (**[xgb_model_champion.pkl](xgb_model_champion.pkl)**), and load in the same sample [Lending Club](https://www.kaggle.com/datasets/devanshi23/loan-data-2007-2014/data) dataset:

In [None]:
import xgboost as xgb

# Load the saved model
xgb_model = xgb.XGBClassifier()
xgb_model.load_model("xgb_model_champion.pkl")
xgb_model

# Ensure the feature order in the dataset matches the order used when the champion model was built
cols_when_model_builds = xgb_model.get_booster().feature_names

In [None]:
# Import the Lending Club dataset from Kaggle
from validmind.datasets.credit_risk import lending_club

df = lending_club.load_data(source="offline")
df.head()

# Preprocess the dataset for data quality testing purposes
preprocess_df = lending_club.preprocess(df)

# Apply feature engineering to the dataset
fe_df = lending_club.feature_engineering(preprocess_df)
fe_df.head()

In [None]:
# Split our dataset into train and test to start the validation testing process
train_df, test_df = lending_club.split(fe_df, test_size=0.2)

x_train = train_df.drop(lending_club.target_column, axis=1)
y_train = train_df[lending_club.target_column]

x_test = test_df.drop(lending_club.target_column, axis=1)
y_test = test_df[lending_club.target_column]

# Now let's apply the order of features from the champion model construction
x_train = x_train[cols_when_model_builds]
x_test = x_test[cols_when_model_builds]
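
Column order matters because a fitted model matches its inputs to features by position or by stored feature name. As a toy illustration (the column names and values below are made up and unrelated to the Lending Club data), indexing a DataFrame with a list of names returns the columns in that list's order:

```python
import pandas as pd

# Toy data with columns in a different order than the model expects
toy = pd.DataFrame({"b": [1, 2], "c": [3, 4], "a": [5, 6]})

# Hypothetical feature order, standing in for cols_when_model_builds
expected_order = ["a", "b", "c"]

# Selecting with a list returns a new DataFrame with columns in that order
reordered = toy[expected_order]
print(list(reordered.columns))
```

Selecting a name that is missing from the DataFrame raises a `KeyError`, which surfaces a feature mismatch early.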

In [None]:
cols_use = ['annual_inc_woe',
 'verification_status_woe',
 'emp_length_woe',
 'installment_woe',
 'term_woe',
 'home_ownership_woe',
 'purpose_woe',
 'open_acc_woe',
 'total_acc_woe',
 'int_rate_woe',
 'sub_grade_woe',
 'grade_woe','loan_status']


train_df = train_df[cols_use]
test_df = test_df[cols_use]
test_df.head()

### Train the challenger model

As we eliminated the random forest classification model as a challenger, we'll only train our logistic regression model here:

In [None]:
# Import the Logistic Regression model
from sklearn.linear_model import LogisticRegression

# Logistic Regression grid params
log_reg_params = {
    "penalty": ["l1", "l2"],
    "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    "solver": ["liblinear"],
}

# Grid search for Logistic Regression
from sklearn.model_selection import GridSearchCV

grid_log_reg = GridSearchCV(LogisticRegression(), log_reg_params)
grid_log_reg.fit(x_train, y_train)

# Logistic Regression best estimator
log_reg = grid_log_reg.best_estimator_
log_reg

### Extract predicted probabilities

With our challenger model trained, let's extract the predicted probabilities from our two models and convert them into binary predictions:

In [None]:
# Champion — Application scorecard model
train_xgb_prob = xgb_model.predict_proba(x_train)[:, 1]
test_xgb_prob = xgb_model.predict_proba(x_test)[:, 1]

# Challenger — Logistic regression model
train_log_prob = log_reg.predict_proba(x_train)[:, 1]
test_log_prob = log_reg.predict_proba(x_test)[:, 1]

In [None]:
# Assign the positive class (1) when the predicted probability exceeds 0.3
cut_off_threshold = 0.3

# Champion — Application scorecard model
train_xgb_binary_predictions = (train_xgb_prob > cut_off_threshold).astype(int)
test_xgb_binary_predictions = (test_xgb_prob > cut_off_threshold).astype(int)

# Challenger — Logistic regression model
train_log_binary_predictions = (train_log_prob > cut_off_threshold).astype(int)
test_log_binary_predictions = (test_log_prob > cut_off_threshold).astype(int)
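
To make the cutoff behavior concrete, here is a small sketch with made-up probabilities (the `toy_` names are hypothetical and not used elsewhere in this notebook). Because the comparison is strict, a probability exactly equal to the threshold is assigned to the negative class:

```python
import numpy as np

# Made-up probabilities, for illustration only
toy_prob = np.array([0.10, 0.30, 0.31, 0.85])
toy_threshold = 0.3

# Strictly greater than the threshold maps to 1; everything else maps to 0
toy_binary = (toy_prob > toy_threshold).astype(int)
print(toy_binary)
```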

### Initialize the ValidMind objects

Let's initialize the ValidMind `Dataset` and `Model` objects in preparation for assigning model predictions to each dataset:

In [None]:
# Initialize the raw dataset
vm_raw_dataset = vm.init_dataset(
    dataset=df,
    input_id="raw_dataset",
    target_column=lending_club.target_column,
)

# Initialize the preprocessed dataset
vm_preprocess_dataset = vm.init_dataset(
    dataset=preprocess_df,
    input_id="preprocess_dataset",
    target_column=lending_club.target_column,
)

# Initialize the feature engineered dataset
vm_fe_dataset = vm.init_dataset(
    dataset=fe_df,
    input_id="fe_dataset",
    target_column=lending_club.target_column,
)

# Initialize the training dataset
vm_train_ds = vm.init_dataset(
    dataset=train_df,
    input_id="train_dataset",
    target_column=lending_club.target_column,
)

# Initialize the test dataset
vm_test_ds = vm.init_dataset(
    dataset=test_df,
    input_id="test_dataset",
    target_column=lending_club.target_column,
)

In [None]:
# Initialize the champion application scorecard model
vm_xgb_model = vm.init_model(
    xgb_model,
    input_id="xgb_model_developer_champion",
)

# Initialize the challenger logistic regression model
vm_log_model = vm.init_model(
    log_reg,
    input_id="log_model",
)

### Assign predictions

With our models initialized, we'll assign both the predicted probabilities that come directly from each model and the binary predictions obtained after applying the cutoff threshold:

In [None]:
# Champion — Application scorecard model
vm_train_ds.assign_predictions(
    model=vm_xgb_model,
    prediction_values=train_xgb_binary_predictions,
    prediction_probabilities=train_xgb_prob,
)

vm_test_ds.assign_predictions(
    model=vm_xgb_model,
    prediction_values=test_xgb_binary_predictions,
    prediction_probabilities=test_xgb_prob,
)

# Challenger — Logistic regression model
vm_train_ds.assign_predictions(
    model=vm_log_model,
    prediction_values=train_log_binary_predictions,
    prediction_probabilities=train_log_prob,
)

vm_test_ds.assign_predictions(
    model=vm_log_model,
    prediction_values=test_log_binary_predictions,
    prediction_probabilities=test_log_prob,
)

In [None]:
# Compute the scores
train_xgb_scores = lending_club.compute_scores(train_xgb_prob)
test_xgb_scores = lending_club.compute_scores(test_xgb_prob)
train_log_scores = lending_club.compute_scores(train_log_prob)
test_log_scores = lending_club.compute_scores(test_log_prob)

# Assign scores to the datasets
vm_train_ds.add_extra_column("xgb_scores", train_xgb_scores)
vm_test_ds.add_extra_column("xgb_scores", test_xgb_scores)
vm_train_ds.add_extra_column("log_scores", train_log_scores)
vm_test_ds.add_extra_column("log_scores", test_log_scores)
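
A common way to turn a default probability into a scorecard-style score is points-to-double-odds (PDO) scaling. The sketch below shows that general technique only; it is not taken from `lending_club.compute_scores`, whose exact formula and parameters may differ, and the values `target_score=600`, `target_odds=50`, and `pdo=20` are illustrative assumptions.

```python
import math


def pdo_score(prob_default, target_score=600, target_odds=50, pdo=20):
    """Map a default probability to a score using PDO scaling.

    Odds are expressed as good:bad, so lower default probabilities
    produce higher scores.
    """
    factor = pdo / math.log(2)
    offset = target_score - factor * math.log(target_odds)
    good_bad_odds = (1 - prob_default) / prob_default
    return offset + factor * math.log(good_bad_odds)


# A default probability of 1/51 corresponds to good:bad odds of 50:1
print(round(pdo_score(1 / 51), 2))
```

Under this scaling, every doubling of the good:bad odds adds `pdo` points to the score.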

### Enable use case context

We'll also adjust the use case context to focus on comparison between our models for tests going forward:

In [None]:
import os
os.environ["VALIDMIND_LLM_DESCRIPTIONS_CONTEXT_ENABLED"] = "1"

context = """
FORMAT FOR THE LLM DESCRIPTIONS: 
    **<Test Name>** is designed to <begin with a concise overview of what the test does and its primary purpose, 
    extracted from the test description>.

    The test operates by <write a paragraph about the test mechanism, explaining how it works and what it measures. 
    Include any relevant formulas or methodologies mentioned in the test description.>

    The primary advantages of this test include <write a paragraph about the test's strengths and capabilities, 
    highlighting what makes it particularly useful for specific scenarios.>

    Users should be aware that <write a paragraph about the test's limitations and potential risks. 
    Include both technical limitations and interpretation challenges. 
    If the test description includes specific signs of high risk, incorporate these here.>

    **Key Insights:**

    The test results reveal:

    - **<insight title>**: <comprehensive description of one aspect of the results>
    - **<insight title>**: <comprehensive description of another aspect>
    ...

    Based on these results, <conclude with a brief paragraph that ties together the test results with the test's 
    purpose and provides any final recommendations or considerations.>

ADDITIONAL INSTRUCTIONS:

    The champion model as the basis for comparison is called "xgb_model_developer_champion" and emphasis should be on the following:
    - The metrics for the champion model compared against the challenger models
    - Which model potentially outperforms the champion model based on the metrics, this should be highlighted and emphasized


    For each metric in the test results, include in the test overview:
    - The metric's purpose and what it measures
    - Its mathematical formula
    - The range of possible values
    - What constitutes good/bad performance
    - How to interpret different values

    Each insight should progressively cover:
    1. Overall scope and distribution
    2. Complete breakdown of all elements with specific values
    3. Natural groupings and patterns
    4. Comparative analysis between datasets/categories
    5. Stability and variations
    6. Notable relationships or dependencies

    Remember:
    - Champion model (xgb_model_developer_champion) is the selection and challenger models are used to challenge the selection
    - Keep all insights at the same level (no sub-bullets or nested structures)
    - Make each insight complete and self-contained
    - Include specific numerical values and ranges
    - Cover all elements in the results comprehensively
    - Maintain clear, concise language
    - Use only "- **Title**: Description" format for insights
    - Progress naturally from general to specific observations

""".strip()

os.environ["VALIDMIND_LLM_DESCRIPTIONS_CONTEXT"] = context

# WIP

Now let's dig a little deeper into one of the tests that lets the validator customize parameters and thresholds for performance standards:


In [None]:
result = vm.tests.run_test(
    "validmind.model_validation.sklearn.MinimumF1Score:AdjThreshold",
    input_grid={
        "dataset": [vm_test_ds],
        "model": [vm_xgb_model, vm_log_model],
    },
    # Pass test parameters separately from the input grid
    params={"min_threshold": 0.35},
).log()
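
As a rough illustration of what a minimum F1 check evaluates, the sketch below computes an F1 score with scikit-learn on made-up labels and compares it to the threshold. It assumes that passing means the score is at or above the threshold; the ValidMind test's exact pass rule is not shown in this notebook.

```python
from sklearn.metrics import f1_score

# Made-up labels and predictions, for illustration only
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

min_threshold = 0.35  # same value passed to the test above

# F1 is the harmonic mean of precision and recall
score = f1_score(y_true, y_pred)
passed = score >= min_threshold
print(f"F1 = {score:.3f}, passed = {passed}")
```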

Next, let's compare the robustness and stability of the two models. We'll start by listing the model diagnosis tests available for classification:

In [None]:
vm.tests.list_tests(tags=["model_diagnosis"], task="classification")

Let's check whether either model shows signs of overfitting, and whether any sub-segments of the data perform poorly:

In [None]:
overfit_testing = [
    "validmind.model_validation.sklearn.TrainingTestDegradation:Champion_vs_LogRegression",
    "validmind.model_validation.sklearn.OverfitDiagnosis:Champion_vs_LogRegression",
]

In [None]:
for test in overfit_testing:
    vm.tests.run_test(
        test,
        input_grid={
            "datasets": [[vm_train_ds, vm_test_ds]],
            "model": [vm_xgb_model, vm_log_model],
        },
    ).log()
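
The idea behind a training-versus-test degradation check can be sketched as a relative drop in a metric. The helper below is hypothetical and the metric values are made up; the formula ValidMind applies internally may differ. The `max_threshold` of 0.1 mirrors the value the developer used for `TrainingTestDegradation` in the test configuration later in this notebook.

```python
def relative_degradation(train_metric, test_metric):
    """Return the relative drop from training to test performance."""
    if train_metric == 0:
        raise ValueError("train_metric must be non-zero")
    return (train_metric - test_metric) / train_metric


# Made-up metric values, for illustration only
max_threshold = 0.1
degradation = relative_degradation(train_metric=0.80, test_metric=0.76)
print(f"degradation = {degradation:.3f}, within threshold = {degradation <= max_threshold}")
```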

Now finally let's conduct robustness and stability testing of the two models:

In [None]:
stab_robust = ['validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression']

In [None]:
for test in stab_robust:
    vm.tests.run_test(
        test,
        input_grid={
            "datasets": [[vm_train_ds, vm_test_ds]],
            "model": [vm_xgb_model, vm_log_model],
        },
    ).log()
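
One general way to probe robustness is to add Gaussian noise to the inputs, scaled by a fraction of each feature's standard deviation, and watch how accuracy decays. The sketch below illustrates that idea on synthetic data with a freshly fitted scikit-learn model; it is not ValidMind's implementation of `RobustnessDiagnosis`, and the scale values are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data, for illustration only: the label depends on the first feature
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)

model = LogisticRegression().fit(X, y)
baseline = model.score(X, y)

# Add Gaussian noise scaled by a fraction of each feature's standard deviation
feature_std = X.std(axis=0)
accuracy_by_scale = {}
for scale in [0.0, 0.1, 0.3, 0.5]:
    noise = rng.normal(size=X.shape) * feature_std * scale
    accuracy_by_scale[scale] = model.score(X + noise, y)

print(f"baseline accuracy: {baseline:.3f}")
for scale, acc in accuracy_by_scale.items():
    print(f"scale {scale}: accuracy {acc:.3f}")
```

A model whose accuracy falls sharply at small noise scales is more sensitive to input perturbations than one whose accuracy decays gradually.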

# WIP2

Let's compare feature importance across the two models. One model may have more intuitive feature impacts than the other, which can inform which model to select:

In [None]:
# list_tests lives in vm.tests; it was never imported as a bare name
FI = vm.tests.list_tests(tags=["feature_importance"], task="classification", pretty=False)
FI

In [None]:
for test in FI:
    vm.tests.run_test(
        f"{test}:Champion_vs_LogisticRegression",
        input_grid={
            "dataset": [vm_test_ds],
            "model": [vm_xgb_model, vm_log_model],
        },
    ).log()
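
For intuition about one widely used importance measure, the sketch below runs scikit-learn's `permutation_importance` on synthetic data where only the first feature carries signal. It is independent of the ValidMind tests above and uses a freshly fitted toy model rather than the champion or challenger.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data: only the first feature carries signal
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Shuffle each feature in turn and measure the drop in accuracy
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
for idx, mean_drop in enumerate(result.importances_mean):
    print(f"feature {idx}: mean accuracy drop {mean_drop:.3f}")
```

Features whose shuffling barely changes the score contribute little to the model's predictions.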

Let's finish off with a custom test example based on scoring, where the model output is customized into a FICO-style score:

In [None]:
import numpy as np
import pandas as pd
import plotly.graph_objects as go


@vm.test("my_custom_tests.ScoreToOdds")
def score_to_odds_analysis(dataset, score_column='score', score_bands=[410, 440, 470]):
    """
    Analyzes the relationship between score bands and odds (good:bad ratio).
    Good odds = (1 - default_rate) / default_rate
    
    Higher scores should correspond to higher odds of being good.

    If multiple scores are provided through score_column, each score comes from a different model.

    When more than one score is provided, focus the assessment on the differences between the scores and indicate, with evidence, which one is preferred.
    """
    # Work on a copy so the added score_band column does not leak into the dataset
    df = dataset.df.copy()
    
    # Create score bands
    df['score_band'] = pd.cut(
        df[score_column],
        bins=[-np.inf] + score_bands + [np.inf],
        labels=[f'<{score_bands[0]}'] + 
               [f'{score_bands[i]}-{score_bands[i+1]}' for i in range(len(score_bands)-1)] +
               [f'>{score_bands[-1]}']
    )
    
    # Calculate metrics per band
    results = df.groupby('score_band').agg({
        dataset.target_column: ['mean', 'count']
    })
    
    results.columns = ['Default Rate', 'Total']
    results['Good Count'] = results['Total'] - (results['Default Rate'] * results['Total'])
    results['Bad Count'] = results['Default Rate'] * results['Total']
    results['Odds'] = results['Good Count'] / results['Bad Count']
    
    # Create visualization
    fig = go.Figure()
    
    # Add odds bars
    fig.add_trace(go.Bar(
        name='Odds (Good:Bad)',
        x=results.index,
        y=results['Odds'],
        marker_color='blue'
    ))
    
    fig.update_layout(
        title='Score-to-Odds Analysis',
        yaxis=dict(title='Odds Ratio (Good:Bad)'),
        showlegend=False
    )
    
    return fig

In [None]:
result = vm.tests.run_test(
    "my_custom_tests.ScoreToOdds:Champion_vs_Challenger",
    inputs={
        "dataset": vm_test_ds,
    },
    param_grid={
        "score_column": ["xgb_scores", "log_scores"],
        "score_bands": [[500, 540, 570]],
    },
).log()
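
The odds formula in the custom test's docstring, `(1 - default_rate) / default_rate`, can be checked by hand on a tiny made-up dataset (the band labels and outcomes below are illustrative only):

```python
import pandas as pd

# Made-up outcomes (1 = default, 0 = repaid), for illustration only
toy = pd.DataFrame({
    "score_band": ["low"] * 5 + ["high"] * 5,
    "default": [1, 1, 0, 0, 0, 1, 0, 0, 0, 0],
})

# Default rate per band, then good:bad odds per band
default_rate = toy.groupby("score_band")["default"].mean()
odds = (1 - default_rate) / default_rate
print(odds)
```

Here the low band has a 40% default rate (odds of 1.5) and the high band a 20% default rate (odds of 4). A band with no defaults would produce infinite odds, which is worth keeping in mind when choosing score bands.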

# WIP3

We now have all of the tests that the model developer provided as evidence. As a final task, we'll run them to verify that the testing was recorded appropriately:

In [None]:
from validmind.utils import preview_test_config

test_config = {'validmind.data_validation.DatasetDescription:raw_data': {'inputs': {'dataset': 'raw_dataset'}},
 'validmind.data_validation.DescriptiveStatistics:raw_data': {'inputs': {'dataset': 'raw_dataset'}},
 'validmind.data_validation.MissingValues:raw_data': {'inputs': {'dataset': 'raw_dataset'},
  'params': {'min_threshold': 1}},
 'validmind.data_validation.ClassImbalance:raw_data': {'inputs': {'dataset': 'raw_dataset'},
  'params': {'min_percent_threshold': 10}},
 'validmind.data_validation.Duplicates:raw_data': {'inputs': {'dataset': 'raw_dataset'},
  'params': {'min_threshold': 1}},
 'validmind.data_validation.HighCardinality:raw_data': {'inputs': {'dataset': 'raw_dataset'},
  'params': {'num_threshold': 100,
   'percent_threshold': 0.1,
   'threshold_type': 'percent'}},
 'validmind.data_validation.Skewness:raw_data': {'inputs': {'dataset': 'raw_dataset'},
  'params': {'max_threshold': 1}},
 'validmind.data_validation.UniqueRows:raw_data': {'inputs': {'dataset': 'raw_dataset'},
  'params': {'min_percent_threshold': 1}},
 'validmind.data_validation.TooManyZeroValues:raw_data': {'inputs': {'dataset': 'raw_dataset'},
  'params': {'max_percent_threshold': 0.03}},
 'validmind.data_validation.IQROutliersTable:raw_data': {'inputs': {'dataset': 'raw_dataset'},
  'params': {'threshold': 5}},
 'validmind.data_validation.DescriptiveStatistics:preprocessed_data': {'inputs': {'dataset': 'preprocess_dataset'}},
 'validmind.data_validation.TabularDescriptionTables:preprocessed_data': {'inputs': {'dataset': 'preprocess_dataset'}},
 'validmind.data_validation.MissingValues:preprocessed_data': {'inputs': {'dataset': 'preprocess_dataset'},
  'params': {'min_threshold': 1}},
 'validmind.data_validation.TabularNumericalHistograms:preprocessed_data': {'inputs': {'dataset': 'preprocess_dataset'}},
 'validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data': {'inputs': {'dataset': 'preprocess_dataset'}},
 'validmind.data_validation.TargetRateBarPlots:preprocessed_data': {'inputs': {'dataset': 'preprocess_dataset'},
  'params': {'default_column': 'loan_status'}},
 'validmind.data_validation.DescriptiveStatistics:development_data': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset']}},
 'validmind.data_validation.TabularDescriptionTables:development_data': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset']}},
 'validmind.data_validation.ClassImbalance:development_data': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset']},
  'params': {'min_percent_threshold': 10}},
 'validmind.data_validation.UniqueRows:development_data': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset']},
  'params': {'min_percent_threshold': 1}},
 'validmind.data_validation.TabularNumericalHistograms:development_data': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset']}},
 'validmind.data_validation.MutualInformation:development_data': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset']},
  'params': {'min_threshold': 0.01}},
 'validmind.data_validation.PearsonCorrelationMatrix:development_data': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset']}},
 'validmind.data_validation.HighPearsonCorrelation:development_data': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset']},
  'params': {'max_threshold': 0.3, 'top_n_correlations': 10}},
 'validmind.data_validation.WOEBinTable': {'input_grid': {'dataset': ['preprocess_dataset']},
  'params': {'breaks_adj': {'loan_amnt': [5000, 10000, 15000, 20000, 25000],
    'int_rate': [10, 15, 20],
    'annual_inc': [50000, 100000, 150000]}}},
 'validmind.data_validation.WOEBinPlots': {'input_grid': {'dataset': ['preprocess_dataset']},
  'params': {'breaks_adj': {'loan_amnt': [5000, 10000, 15000, 20000, 25000],
    'int_rate': [10, 15, 20],
    'annual_inc': [50000, 100000, 150000]}}},
 'validmind.data_validation.DatasetSplit': {'inputs': {'datasets': ['train_dataset',
    'test_dataset']}},
 'validmind.model_validation.ModelMetadata': {'input_grid': {'model': ['xgb_model',
    'rf_model']}},
 'validmind.model_validation.sklearn.ModelParameters': {'input_grid': {'model': ['xgb_model',
    'rf_model']}},
 'validmind.model_validation.statsmodels.GINITable': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset'],
   'model': ['xgb_model', 'rf_model']}},
 'validmind.model_validation.sklearn.ClassifierPerformance': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset'],
   'model': ['xgb_model', 'rf_model']}},
 'validmind.model_validation.sklearn.TrainingTestDegradation:XGBoost': {'inputs': {'datasets': ['train_dataset',
    'test_dataset'],
   'model': 'xgb_model'},
  'params': {'max_threshold': 0.1}},
 'validmind.model_validation.sklearn.TrainingTestDegradation:RandomForest': {'inputs': {'datasets': ['train_dataset',
    'test_dataset'],
   'model': 'rf_model'},
  'params': {'max_threshold': 0.1}},
 'validmind.model_validation.sklearn.ROCCurve': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset'],
   'model': ['xgb_model']}},
 'validmind.model_validation.sklearn.MinimumROCAUCScore': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset'],
   'model': ['xgb_model']},
  'params': {'min_threshold': 0.5}},
 'validmind.model_validation.statsmodels.PredictionProbabilitiesHistogram': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset'],
   'model': ['xgb_model']}},
 'validmind.model_validation.statsmodels.CumulativePredictionProbabilities': {'input_grid': {'model': ['xgb_model'],
   'dataset': ['train_dataset', 'test_dataset']}},
 'validmind.model_validation.sklearn.PopulationStabilityIndex': {'inputs': {'datasets': ['train_dataset',
    'test_dataset'],
   'model': 'xgb_model'},
  'params': {'num_bins': 10, 'mode': 'fixed'}},
 'validmind.model_validation.sklearn.ConfusionMatrix': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset'],
   'model': ['xgb_model']}},
 'validmind.model_validation.sklearn.MinimumAccuracy': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset'],
   'model': ['xgb_model']},
  'params': {'min_threshold': 0.7}},
 'validmind.model_validation.sklearn.MinimumF1Score': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset'],
   'model': ['xgb_model']},
  'params': {'min_threshold': 0.5}},
 'validmind.model_validation.sklearn.PrecisionRecallCurve': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset'],
   'model': ['xgb_model']}},
 'validmind.model_validation.sklearn.CalibrationCurve': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset'],
   'model': ['xgb_model']}},
 'validmind.model_validation.sklearn.ClassifierThresholdOptimization': {'inputs': {'dataset': 'train_dataset',
   'model': 'xgb_model'},
  'params': {'target_recall': 0.8}},
 'validmind.model_validation.statsmodels.ScorecardHistogram': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset']},
  'params': {'score_column': 'xgb_scores'}},
 'validmind.data_validation.ScoreBandDefaultRates': {'input_grid': {'dataset': ['train_dataset'],
   'model': ['xgb_model']},
  'params': {'score_column': 'xgb_scores', 'score_bands': [504, 537, 570]}},
 'validmind.model_validation.sklearn.ScoreProbabilityAlignment': {'input_grid': {'dataset': ['train_dataset'],
   'model': ['xgb_model']},
  'params': {'score_column': 'xgb_scores'}},
 'validmind.model_validation.sklearn.WeakspotsDiagnosis': {'inputs': {'datasets': ['train_dataset',
    'test_dataset'],
   'model': 'xgb_model'}},
 'validmind.model_validation.sklearn.OverfitDiagnosis': {'inputs': {'model': 'xgb_model',
   'datasets': ['train_dataset', 'test_dataset']},
  'params': {'cut_off_threshold': 0.04}},
 'validmind.model_validation.sklearn.RobustnessDiagnosis': {'inputs': {'datasets': ['train_dataset',
    'test_dataset'],
   'model': 'xgb_model'},
  'params': {'scaling_factor_std_dev_list': [0.1, 0.2, 0.3, 0.4, 0.5],
   'performance_decay_threshold': 0.05}},
 'validmind.model_validation.sklearn.PermutationFeatureImportance': {'input_grid': {'dataset': ['train_dataset',
    'test_dataset'],
   'model': ['xgb_model']}},
 'validmind.model_validation.FeaturesAUC': {'input_grid': {'model': ['xgb_model'],
   'dataset': ['train_dataset', 'test_dataset']}},
 'validmind.model_validation.sklearn.SHAPGlobalImportance': {'input_grid': {'model': ['xgb_model'],
   'dataset': ['train_dataset', 'test_dataset']},
  'params': {'kernel_explainer_samples': 10,
   'tree_or_linear_explainer_samples': 200}}}

In [None]:
for test_id, config in test_config.items():
    print(test_id)
    try:
        # Entries use either `input_grid` (comparison runs) or `inputs` (single run)
        input_key = "input_grid" if "input_grid" in config else "inputs"
        kwargs = {input_key: config[input_key]}
        # Pass params only when the entry defines them
        if "params" in config:
            kwargs["params"] = config["params"]
        vm.tests.run_test(test_id, **kwargs).log()
    except Exception as e:
        print(f"Error running test {test_id}: {e}")
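
To see how many individual runs a comparison entry implies, it can help to expand an `input_grid` by hand. The sketch below assumes that an `input_grid` is expanded as the Cartesian product of its value lists, which matches how the comparison tests earlier in this notebook pair each dataset with each model. `expand_grid` is a hypothetical helper, not a ValidMind function.

```python
from itertools import product


def expand_grid(input_grid):
    """Return one dict per combination of the grid's values."""
    keys = list(input_grid)
    return [dict(zip(keys, combo)) for combo in product(*(input_grid[k] for k in keys))]


# Input IDs taken from one of the config entries above
grid = {
    "dataset": ["train_dataset", "test_dataset"],
    "model": ["xgb_model", "rf_model"],
}

for combo in expand_grid(grid):
    print(combo)
```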

## In summary

## Next steps

### Work with your validation report

### Learn more

#### More how-to guides and code samples

#### Discover more learning resources