# Explore tests

View and learn more about the tests available in the ValidMind Developer Framework, including code examples and usage of key functions.

In this notebook, we'll dive deep into the utilities available for viewing and understanding the various tests that ValidMind provides through the `tests` module. Whether you're just getting started or looking for advanced tips, you'll find clear examples and explanations to assist you every step of the way.

Before we go into the details, let's import the functions we'll use from the `validmind.tests` module. `list_tests` and `describe_test` are the two main functions for filtering tests and viewing the details of an individual test; `list_tasks`, `list_tags`, and `list_tasks_and_tags` show the task types and tags you can filter by.


In [1]:
from validmind.tests import (
    describe_test,
    list_tests,
    list_tasks,
    list_tags,
    list_tasks_and_tags,
)

## Contents

- [Listing All Tests](#toc1_)
- [Understanding Tags and Task Types](#toc2_)
- [Searching for Specific Tests using `tags` and `tasks`](#toc3_)
- [Delving into Test Details with `describe_test`](#toc4_)
- [Next steps](#toc5_)
  - [Discover more learning resources](#toc5_1_)



<a id='toc1_'></a>

## Listing All Tests


The `list_tests` function provides a convenient way to retrieve all available tests in the `validmind.tests` module. When invoked without any parameters, it returns a pandas DataFrame containing detailed information about each test.


In [2]:
list_tests()

ID,Name,Description,Required Inputs,Params
validmind.prompt_validation.Bias,Bias,Evaluates bias in a Large Language Model based on the order and distribution of exemplars in a prompt....,['model.prompt'],{'min_threshold': 7}
validmind.prompt_validation.Clarity,Clarity,Evaluates and scores the clarity of prompts in a Large Language Model based on specified guidelines....,['model.prompt'],{'min_threshold': 7}
validmind.prompt_validation.Specificity,Specificity,"Evaluates and scores the specificity of prompts provided to a Large Language Model (LLM), based on clarity,...",['model.prompt'],{'min_threshold': 7}
validmind.prompt_validation.Robustness,Robustness,Assesses the robustness of prompts provided to a Large Language Model under varying conditions and contexts....,['model'],{'num_tests': 10}
validmind.prompt_validation.NegativeInstruction,Negative Instruction,"Evaluates and grades the use of affirmative, proactive language over negative instructions in LLM prompts....",['model.prompt'],{'min_threshold': 7}
validmind.prompt_validation.Conciseness,Conciseness,Analyzes and grades the conciseness of prompts provided to a Large Language Model....,['model.prompt'],{'min_threshold': 7}
validmind.prompt_validation.Delimitation,Delimitation,Evaluates the proper use of delimiters in prompts provided to Large Language Models....,['model.prompt'],{'min_threshold': 7}
validmind.model_validation.ModelPredictionResiduals,Model Prediction Residuals,"Plot the residuals and histograms for each model, and generate a summary table...","['datasets', 'models']","{'nbins': 100, 'p_value_threshold': 0.05, 'start_date': None, 'end_date': None}"
validmind.model_validation.BertScore,Bert Score,Evaluates the quality of machine-generated text using BERTScore metrics and visualizes the results through histograms...,"['dataset', 'model']",{}
validmind.model_validation.TimeSeriesPredictionsPlot,Time Series Predictions Plot,Plot actual vs predicted values for time series data and generate a visual comparison for each model....,"['datasets', 'models']",{}
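
Each value in the ID column is a dotted path whose last segment is the test's name. As a small illustration, the sketch below groups a few IDs copied from the output above by the segment that follows `validmind`. It uses only the standard library, and the helper name `group_by_category` is made up for this example; it is not part of the ValidMind API.

```python
from collections import defaultdict

# IDs copied from the listing above.
test_ids = [
    "validmind.prompt_validation.Bias",
    "validmind.prompt_validation.Clarity",
    "validmind.model_validation.BertScore",
    "validmind.model_validation.TimeSeriesPredictionsPlot",
]


def group_by_category(ids):
    """Group dotted test IDs by their second segment (hypothetical helper)."""
    groups = defaultdict(list)
    for test_id in ids:
        parts = test_id.split(".")
        # parts[0] is "validmind", parts[1] the category, parts[-1] the test name.
        groups[parts[1]].append(parts[-1])
    return dict(groups)


grouped = group_by_category(test_ids)
print(grouped)
```

Grouping like this can help when you only care about one family of tests, such as the prompt validation tests.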


<a id='toc2_'></a>

## Understanding Tags and Task Types

To use ValidMind's tests effectively, it helps to understand how they are classified by 'tags' and 'task types'. Here's a breakdown:

- **Task Types**: Represent the kind of modeling task associated with a test. For instance:

  - **classification:** Works with classification models and datasets.
  - **regression:** Works with regression models and datasets.
  - **text_classification:** Works with text classification models and datasets.
  - **text_summarization:** Works with text summarization models and datasets.

- **Tags**: Free-form descriptors that give more detail about a test, such as the data and models it is compatible with and the category it falls into. Some examples include:
  - **llm:** Tests that work with Large Language Models.
  - **nlp:** Tests relevant for natural language processing.
  - **binary_classification:** Tests for binary classification tasks.
  - **forecasting:** Tests for forecasting and time-series analysis.
  - **tabular_data:** Tests for tabular data like CSVs and Excel spreadsheets.


You can use the functions `list_tasks()` and `list_tags()` to view all the tasks and tags used to classify the tests available in the ValidMind Developer Framework:


In [3]:
list_tasks()

['text_qa',
 'time_series_forecasting',
 'text_generation',
 'text_summarization',
 'nlp',
 'text_classification',
 'visualization',
 'classification',
 'feature_extraction',
 'regression',
 'residual_analysis',
 'clustering']

In [4]:
list_tags()

['statsmodels',
 'anomaly_detection',
 'text_data',
 'data_quality',
 'ragas',
 'kmeans',
 'stationarity',
 'seasonality',
 'model_metadata',
 'zero_shot',
 'embeddings',
 'tabular_data',
 'qualitative',
 'forecasting',
 'correlation',
 'model_interpretation',
 'model_comparison',
 'feature_importance',
 'AUC',
 'analysis',
 'time_series_data',
 'rag_performance',
 'text_embeddings',
 'model_explainability',
 'data_validation',
 'multiclass_classification',
 'binary_classification',
 'nlp',
 'data_distribution',
 'sklearn',
 'visualization',
 'few_shot',
 'numerical_data',
 'model_predictions',
 'frequency_analysis',
 'model_performance',
 'senstivity_analysis',
 'logistic_regression',
 'unit_root_test',
 'model_selection',
 'dimensionality_reduction',
 'metadata',
 'llm',
 'statistical_test',
 'retrieval_performance',
 'model_training',
 'model_diagnosis',
 'categorical_data',
 'regression',
 'risk_analysis',
 'credit_risk']

If you want to see which tags correspond to which task type, you can use the function `list_tasks_and_tags()`:


In [5]:
list_tasks_and_tags()

Task,Tags
text_classification,"text_data, ragas, model_metadata, zero_shot, tabular_data, model_comparison, feature_importance, time_series_data, multiclass_classification, binary_classification, nlp, sklearn, visualization, few_shot, frequency_analysis, model_performance, llm, retrieval_performance, model_diagnosis"
text_summarization,"time_series_data, rag_performance, dimensionality_reduction, text_data, qualitative, ragas, nlp, llm, model_metadata, visualization, few_shot, retrieval_performance, zero_shot, frequency_analysis, embeddings, tabular_data"
residual_analysis,regression
visualization,regression
regression,"statsmodels, text_data, data_quality, stationarity, seasonality, model_metadata, tabular_data, forecasting, correlation, model_interpretation, model_comparison, feature_importance, analysis, time_series_data, model_explainability, data_validation, data_distribution, sklearn, visualization, numerical_data, model_predictions, model_performance, senstivity_analysis, unit_root_test, model_selection, metadata, statistical_test, model_training, categorical_data, risk_analysis"
time_series_forecasting,"model_explainability, metadata, data_validation, sklearn, visualization, model_training, model_predictions, model_performance"
classification,"statsmodels, anomaly_detection, text_data, data_quality, model_metadata, tabular_data, correlation, model_comparison, feature_importance, AUC, time_series_data, multiclass_classification, binary_classification, data_distribution, sklearn, visualization, numerical_data, model_performance, logistic_regression, statistical_test, model_diagnosis, categorical_data, risk_analysis, credit_risk"
clustering,"sklearn, model_performance, kmeans"
text_qa,"rag_performance, dimensionality_reduction, qualitative, ragas, llm, visualization, retrieval_performance, embeddings"
text_generation,"rag_performance, dimensionality_reduction, qualitative, ragas, llm, visualization, retrieval_performance, embeddings"
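
One way to picture this table: if every test declares its own tasks and tags, then the tags listed for a task are the union of the tags of all tests that declare that task. The sketch below builds such a mapping from a tiny hand-written catalog. The catalog entries and the helper `tags_by_task` are assumptions for illustration only, and this is not necessarily how ValidMind computes the table.

```python
from collections import defaultdict

# Hypothetical catalog: test name -> (tasks, tags). Not real ValidMind metadata.
catalog = {
    "ExampleConfusionMatrix": (["classification"], ["sklearn", "visualization"]),
    "ExampleROCCurve": (["classification"], ["sklearn", "model_performance"]),
    "ExampleResiduals": (["regression"], ["visualization"]),
}


def tags_by_task(tests):
    """Collect, for each task, the union of tags across tests with that task."""
    mapping = defaultdict(set)
    for tasks, tags in tests.values():
        for task in tasks:
            mapping[task].update(tags)
    # Sort the tags so the result is stable and easy to read.
    return {task: sorted(tags) for task, tags in mapping.items()}


task_tags = tags_by_task(catalog)
for task, tags in task_tags.items():
    print(f"{task}: {', '.join(tags)}")
```

Read this way, a tag appearing next to a task means that at least one test for that task carries the tag, not that every test does.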


<a id='toc3_'></a>

## Searching for Specific Tests using `tags` and `tasks`

While listing all tests is valuable, there are times when you need to narrow down your search. The `list_tests` function offers `filter`, `task`, and `tags` parameters to assist in this.


If you're targeting a specific test, or tests that match a search term such as a library name, the `filter` parameter comes in handy. For example, to list tests that are compatible with 'sklearn' models:


In [6]:
list_tests(filter="sklearn")

ID,Name,Description,Required Inputs,Params
validmind.model_validation.ClusterSizeDistribution,Cluster Size Distribution,Compares and visualizes the distribution of cluster sizes in model predictions and actual data for assessing...,"['model', 'dataset']",
validmind.model_validation.TimeSeriesR2SquareBySegments,Time Series R2 Square By Segments,Plot R-Squared values for each model over specified time segments and generate a bar chart...,"['datasets', 'models']",{'segments': None}
validmind.model_validation.sklearn.RegressionModelsPerformanceComparison,Regression Models Performance Comparison,"Compares and evaluates the performance of multiple regression models using five different metrics: MAE, MSE, RMSE,...","['dataset', 'models']",
validmind.model_validation.sklearn.AdjustedMutualInformation,Adjusted Mutual Information,"Evaluates clustering model performance by measuring mutual information between true and predicted labels, adjusting...","['model', 'datasets']",
validmind.model_validation.sklearn.SilhouettePlot,Silhouette Plot,"Calculates and visualizes Silhouette Score, assessing degree of data point suitability to its cluster in ML models....","['model', 'dataset']",
validmind.model_validation.sklearn.RobustnessDiagnosis,Robustness Diagnosis,Evaluates the robustness of a machine learning model by injecting Gaussian noise to input data and measuring...,"['model', 'datasets']","{'features_columns': None, 'scaling_factor_std_dev_list': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5], 'accuracy_decay_threshold': 4}"
validmind.model_validation.sklearn.AdjustedRandIndex,Adjusted Rand Index,Measures the similarity between two data clusters using the Adjusted Rand Index (ARI) metric in clustering machine...,"['model', 'datasets']",
validmind.model_validation.sklearn.SHAPGlobalImportance,SHAP Global Importance,Evaluates and visualizes global feature importance using SHAP values for model explanation and risk identification....,"['model', 'dataset']","{'kernel_explainer_samples': 10, 'tree_or_linear_explainer_samples': 200}"
validmind.model_validation.sklearn.ConfusionMatrix,Confusion Matrix,Evaluates and visually represents the classification ML model's predictive performance using a Confusion Matrix...,"['model', 'dataset']",
validmind.model_validation.sklearn.HomogeneityScore,Homogeneity Score,"Assesses clustering homogeneity by comparing true and predicted labels, scoring from 0 (heterogeneous) to 1...","['model', 'datasets']",


The `task` parameter is designed for pinpointing tests that align with a specific task type. For instance, to find tests tailored for 'classification' tasks:


In [7]:
list_tests(task="classification")

ID,Name,Description,Required Inputs,Params
validmind.model_validation.FeaturesAUC,Features AUC,Evaluates the discriminatory power of each individual feature within a binary classification model by calculating the Area Under the Curve (AUC) for each feature separately....,"['model', 'dataset']","{'fontsize': 12, 'figure_height': 500}"
validmind.model_validation.ModelMetadata,Model Metadata,Extracts and summarizes critical metadata from a machine learning model instance for comprehensive analysis....,['model'],
validmind.model_validation.sklearn.RobustnessDiagnosis,Robustness Diagnosis,Evaluates the robustness of a machine learning model by injecting Gaussian noise to input data and measuring...,"['model', 'datasets']","{'features_columns': None, 'scaling_factor_std_dev_list': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5], 'accuracy_decay_threshold': 4}"
validmind.model_validation.sklearn.SHAPGlobalImportance,SHAP Global Importance,Evaluates and visualizes global feature importance using SHAP values for model explanation and risk identification....,"['model', 'dataset']","{'kernel_explainer_samples': 10, 'tree_or_linear_explainer_samples': 200}"
validmind.model_validation.sklearn.ConfusionMatrix,Confusion Matrix,Evaluates and visually represents the classification ML model's predictive performance using a Confusion Matrix...,"['model', 'dataset']",
validmind.model_validation.sklearn.OverfitDiagnosis,Overfit Diagnosis,Detects and visualizes overfit regions in an ML model by comparing performance on training and test datasets....,"['model', 'datasets']","{'features_columns': None, 'cut_off_percentage': 4}"
validmind.model_validation.sklearn.PermutationFeatureImportance,Permutation Feature Importance,Assesses the significance of each feature in a model by evaluating the impact on model performance when feature...,"['model', 'dataset']","{'fontsize': None, 'figure_height': 1000}"
validmind.model_validation.sklearn.MinimumROCAUCScore,Minimum ROCAUC Score,Validates model by checking if the ROC AUC score meets or surpasses a specified threshold....,"['model', 'dataset']",{'min_threshold': 0.5}
validmind.model_validation.sklearn.PrecisionRecallCurve,Precision Recall Curve,Evaluates the precision-recall trade-off for binary classification models and visualizes the Precision-Recall curve....,"['model', 'dataset']",
validmind.model_validation.sklearn.ClassifierPerformance,Classifier Performance,"Evaluates performance of binary or multiclass classification models using precision, recall, F1-Score, accuracy,...","['model', 'dataset']",


The `tags` parameter facilitates searching tests by their tags. For instance, if you're interested only in tests designed for `model_performance` that also produce a plot (denoted by the `visualization` tag):


In [8]:
list_tests(tags=["model_performance", "visualization"])

ID,Name,Description,Required Inputs,Params
validmind.model_validation.sklearn.ConfusionMatrix,Confusion Matrix,Evaluates and visually represents the classification ML model's predictive performance using a Confusion Matrix...,"['model', 'dataset']",
validmind.model_validation.sklearn.PrecisionRecallCurve,Precision Recall Curve,Evaluates the precision-recall trade-off for binary classification models and visualizes the Precision-Recall curve....,"['model', 'dataset']",
validmind.model_validation.sklearn.ROCCurve,ROC Curve,Evaluates binary classification model performance by generating and plotting the Receiver Operating Characteristic...,"['model', 'dataset']",
validmind.model_validation.sklearn.TrainingTestDegradation,Training Test Degradation,Tests if model performance degradation between training and test datasets exceeds a predefined threshold....,"['model', 'datasets']","{'metrics': ['accuracy', 'precision', 'recall', 'f1'], 'max_threshold': 0.1}"
validmind.model_validation.statsmodels.GINITable,GINI Table,"Evaluates classification model performance using AUC, GINI, and KS metrics for training and test datasets....","['model', 'datasets']",


The above parameters can be combined to create more complex queries. For instance, to find model performance tests that are compatible with 'sklearn' models, designed for 'classification' tasks, and produce a plot:


In [9]:
list_tests(
    tags=["model_performance", "visualization", "sklearn"], task="classification"
)

ID,Name,Description,Required Inputs,Params
validmind.model_validation.sklearn.ConfusionMatrix,Confusion Matrix,Evaluates and visually represents the classification ML model's predictive performance using a Confusion Matrix...,"['model', 'dataset']",
validmind.model_validation.sklearn.PrecisionRecallCurve,Precision Recall Curve,Evaluates the precision-recall trade-off for binary classification models and visualizes the Precision-Recall curve....,"['model', 'dataset']",
validmind.model_validation.sklearn.ROCCurve,ROC Curve,Evaluates binary classification model performance by generating and plotting the Receiver Operating Characteristic...,"['model', 'dataset']",
validmind.model_validation.sklearn.TrainingTestDegradation,Training Test Degradation,Tests if model performance degradation between training and test datasets exceeds a predefined threshold....,"['model', 'datasets']","{'metrics': ['accuracy', 'precision', 'recall', 'f1'], 'max_threshold': 0.1}"
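
The sketch below mimics this filtering on a few hand-written records, assuming that a test must carry every requested tag and, when a task is given, must also declare that task. That assumption is consistent with the outputs above: GINI Table appears for the two-tag query but drops out once `sklearn` is added. The records and the helper `select_tests` are hypothetical and do not reproduce ValidMind's implementation.

```python
# Hypothetical records: (test ID, tasks, tags). Values are illustrative only.
tests = [
    ("example.ConfusionMatrix", {"classification"},
     {"sklearn", "model_performance", "visualization"}),
    ("example.GINITable", {"classification"},
     {"model_performance", "visualization"}),
    ("example.ResidualsPlot", {"regression"},
     {"sklearn", "visualization"}),
]


def select_tests(records, task=None, tags=None):
    """Return IDs whose task matches and whose tags include every requested tag."""
    wanted = set(tags or [])
    selected = []
    for test_id, test_tasks, test_tags in records:
        if task is not None and task not in test_tasks:
            continue  # a task was requested and this test does not declare it
        if not wanted.issubset(test_tags):
            continue  # at least one requested tag is missing
        selected.append(test_id)
    return selected


both_tags = select_tests(tests, tags=["model_performance", "visualization"])
combined = select_tests(
    tests, tags=["model_performance", "visualization", "sklearn"], task="classification"
)
print(both_tags)
print(combined)
```

Each additional tag or task can only narrow the result, so adding criteria is a quick way to shrink a long list.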


### Programmatic Use

To work with a specific set of tests programmatically, you can store the results in a variable. For instance, let's list all tests designed for text summarization and store them in `text_summarization_tests` for further use. Passing `pretty=False` returns a plain list of test IDs instead of a formatted table:


In [10]:
text_summarization_tests = list_tests(task="text_summarization", pretty=False)
text_summarization_tests

['validmind.prompt_validation.Bias',
 'validmind.prompt_validation.Clarity',
 'validmind.prompt_validation.Specificity',
 'validmind.prompt_validation.Robustness',
 'validmind.prompt_validation.NegativeInstruction',
 'validmind.prompt_validation.Conciseness',
 'validmind.prompt_validation.Delimitation',
 'validmind.model_validation.BertScore',
 'validmind.model_validation.RegardScore',
 'validmind.model_validation.BleuScore',
 'validmind.model_validation.ContextualRecall',
 'validmind.model_validation.MeteorScore',
 'validmind.model_validation.RougeScore',
 'validmind.model_validation.ModelMetadata',
 'validmind.model_validation.TokenDisparity',
 'validmind.model_validation.ToxicityScore',
 'validmind.model_validation.embeddings.CosineSimilarityComparison',
 'validmind.model_validation.embeddings.TSNEComponentsPairwisePlots',
 'validmind.model_validation.embeddings.PCAComponentsPairwisePlots',
 'validmind.model_validation.embeddings.CosineSimilarityHeatmap',
 'validmind.model_validatio
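
Because `pretty=False` gives back plain test IDs, ordinary Python list operations apply. The sketch below hard-codes a few IDs from the output above so that it runs without ValidMind installed, then keeps only those under the `embeddings` namespace. In a notebook session the list would come from `list_tests(..., pretty=False)`; the helper `ids_under` is a made-up name for this example.

```python
# A few IDs copied from the output above; in a notebook this would be
# the list returned by list_tests(task="text_summarization", pretty=False).
text_summarization_tests = [
    "validmind.prompt_validation.Bias",
    "validmind.model_validation.BertScore",
    "validmind.model_validation.embeddings.CosineSimilarityComparison",
    "validmind.model_validation.embeddings.CosineSimilarityHeatmap",
]


def ids_under(ids, namespace):
    """Keep IDs that live under the given dotted namespace (hypothetical helper)."""
    prefix = namespace.rstrip(".") + "."
    return [test_id for test_id in ids if test_id.startswith(prefix)]


embedding_tests = ids_under(
    text_summarization_tests, "validmind.model_validation.embeddings"
)
for test_id in embedding_tests:
    print(test_id)
```

Any ID selected this way can be passed to `describe_test` to inspect that test in detail.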

<a id='toc4_'></a>

## Delving into Test Details with `describe_test`

After identifying a set of potential tests, you might want to explore the specifics of an individual test. The `describe_test` function provides a deep dive into the details of a test. It reveals the test name, description, ID, test type, and required inputs. Below, we showcase how to describe a test using its ID:


In [11]:
describe_test("validmind.model_validation.sklearn.OverfitDiagnosis")

Accordion(children=(HTML(value='\n<div>\n  <h2>Overfit Diagnosis</h2>\n  <p>Detects and visualizes overfit reg…

<a id='toc5_'></a>

## Next steps

With the functions covered in this guide, you should be able to list and filter all of ValidMind's available tests and find the ones you want to run against your model, your dataset, or both. The next step is to take the IDs of the tests you'd like to run and either create a test suite for reuse or run them directly to try them out. See the other notebooks for a tutorial on how to do both.

<a id='toc5_1_'></a>

### Discover more learning resources

We offer many interactive notebooks to help you document models:

- [Run tests & test suites](https://docs.validmind.ai/developer/model-testing/testing-overview.html)
- [Code samples](https://docs.validmind.ai/developer/samples-jupyter-notebooks.html)

Or, visit our [documentation](https://docs.validmind.ai/) to learn more about ValidMind.
