# Exploring Tests in ValidMind:
## A Comprehensive Guide to List and Describe Tests

Welcome to this guide to the test utilities in the ValidMind Developer Framework! In this notebook, we'll look at the functions for listing and describing the tests that can be run against your models and datasets, with an example and a short explanation for each.

Before going into the details, let's set up the environment by importing the `validmind.tests` module.

In [6]:
import validmind.tests as vt

## Listing All Tests

The `list_tests` function retrieves all available tests in the `validmind.tests` module. When invoked without any parameters, it returns a pandas DataFrame with one row per test and four columns: `Test Type`, `Name`, `Description`, and `ID`.

In [7]:
vt.list_tests()

Test Type,Name,Description,ID
ThresholdTest,Bias,"**Purpose:** The Bias Evaluation test calculates if and how the order and distribution of exemplars (examples) in a few-shot learning prompt affect the output of a Language Learning Model (LLM). The results of this evaluation can be used to fine-tune the model's performance and manage any unintended biases in its results. **Test Mechanism:** This test uses two checks: 1. *Distribution of Exemplars:* The number of positive vs. negative examples in a prompt is varied. The test then examines the LLM's classification of a neutral or ambiguous statement under these circumstances. 2. *Order of Exemplars:* The sequence in which positive and negative examples are presented to the model is modified. Their resultant effect on the LLM's response is studied. For each test case, the LLM grades the input prompt on a scale of 1 to 10. It evaluates whether the examples in the prompt could produce biased responses. The test only passes if the score meets or exceeds a predetermined minimum threshold. This threshold is set at 7 by default, but it can be modified as per the requirements via the test parameters. **Signs of High Risk:** - A skewed result favoring either positive or negative responses may suggest potential bias in the model. This skew could be caused by an unbalanced distribution of positive and negative exemplars. - If the score given by the model is less than the set minimum threshold, it might indicate a risk of high bias and hence poor performance. **Strengths:** - This test provides a quantitative measure of potential bias, providing clear guidelines for developers about whether their Language Learning Model (LLM) contains significant bias. - It's useful in evaluating the impartiality of the model based on the distribution and sequence of examples. - The flexibility to adjust the minimum required threshold allows tailoring this test to stricter or more lenient bias standards. 
**Limitations:** - The test may not pick up on more subtle forms of bias or biases that are not directly related to the distribution or order of exemplars. - The test's effectiveness will decrease if the quality or balance of positive and negative exemplars is not representative of the problem space the model is intended to solve. - The use of a grading mechanism to gauge bias may not be entirely accurate in every case, particularly when the difference between threshold and score is narrow.",validmind.prompt_validation.Bias
ThresholdTest,Clarity,"**Purpose:** The Clarity evaluation metric is used to assess how clear the prompts of a Language Learning Model (LLM) are. This assessment is particularly important because clear prompts assist the LLM in more accurately interpreting and responding to instructions. **Test Mechanism:** The evaluation uses an LLM to scrutinize the clarity of prompts, factoring in considerations such as the inclusion of relevant details, persona adoption, step-by-step instructions, usage of examples and specification of desired output length. Each prompt is rated on a clarity scale of 1 to 10, and any prompt scoring at or above the preset threshold (default of 7) will be marked as clear. It is important to note that this threshold can be adjusted via test parameters, providing flexibility in the evaluation process. **Signs of High Risk:** - Prompts that consistently score below the clarity threshold - Repeated failure of prompts to adhere to guidelines for clarity. These guidelines could include detail inclusion, persona adoption, explicit step-by-step instructions, use of examples, and specification of output length. **Strengths:** - Encourages the development of more effective prompts that aid the LLM in interpreting instructions accurately. - Applies a quantifiable measure (a score from 1 to 10) to evaluate the clarity of prompts. - Threshold for clarity is adjustable, allowing for flexible evaluation depending on the context. **Limitations:** - Scoring system is subjective and relies on the AI’s interpretation of 'clarity'. - The test assumes that all required factors (detail inclusion, persona adoption, step-by-step instructions, use of examples, and specification of output length) contribute equally to clarity, which might not always be the case. - The evaluation may not be as effective if used on non-textual models.",validmind.prompt_validation.Clarity
ThresholdTest,Specificity,"**Purpose:** The Specificity Test evaluates the clarity, precision, and effectiveness of the prompts provided to a Language Learning Model (LLM). It aims to ensure that the instructions embedded in a prompt are indisputably clear and relevant, thereby helping to yank out ambiguity and steer the LLM towards desired outputs. This level of specificity significantly affects the accuracy and relevance of LLM outputs. **Test Mechanism:** The Specificity Test employs an LLM to grade each prompt based on clarity, detail, and relevance parameters within a specificity scale that extends from 1 to 10. On this scale, prompts scoring equal to or more than a predefined threshold (set to 7 by default) pass the evaluation, while those scoring below this threshold fail it. Users can adjust this threshold as per their requirements. **Signs of High Risk:** - Prompts scoring consistently below the established threshold - Vague or ambiguous prompts that do not provide clear direction to the LLM - Overly verbose prompts that may confuse the LLM instead of providing clear guidance **Strengths:** - Enables precise and clear communication with the LLM to achieve desired outputs - Serves as a crucial means to measure the effectiveness of prompts - Highly customizable, allowing users to set their threshold based on specific use cases **Limitations:** - This test doesn't consider the content comprehension capability of the LLM - High specificity score doesn't guarantee a high-quality response from the LLM, as the model's performance is also dependent on various other factors - Striking a balance between specificity and verbosity can be challenging, as overly detailed prompts might confuse or mislead the model.",validmind.prompt_validation.Specificity
ThresholdTest,Robustness,"**Purpose:** The Robustness test is meant to evaluate the resilience and reliability of prompts provided to a Language Learning Model (LLM). The aim of this test is to guarantee that the prompts consistently generate accurate and the expected outputs, despite being in diverse or challenging scenarios. **Test Mechanism:** The Robustness test appraises prompts under various conditions, alterations, and contexts to ascertain their stability in producing consistent responses from the LLM. Factors evaluated range from different phrasings, inclusion of potential distracting elements, and various input complexities. By default, the test generates 10 inputs for a prompt but can be adjusted according to test parameters. **Signs of High Risk:** - If the output from the tests diverges extensively from the expected results, this indicates high risk. - When the prompt doesn't give a consistent performance across various tests. - A high risk is indicated when the prompt is susceptible to breaking, especially when the output is expected to be of a specific type. **Strengths:** - The robustness test helps to ensure stable performance of the LLM prompts and lowers the chances of generating unexpected or off-target outputs. - This test is vital for applications where predictability and reliability of the LLM’s output are crucial. **Limitations:** - Currently, the test only supports single-variable prompts, which restricts its application to more complex models. - When there are too many target classes (over 10), the test is skipped, which can leave potential vulnerabilities unchecked in complex multi-class models. - The test may not account for all potential conditions or alterations that could show up in practical use scenarios.",validmind.prompt_validation.Robustness
ThresholdTest,Negative Instruction,"**Purpose:** The Negative Instruction test is utilized to scrutinize the prompts given to a Language Learning Model (LLM). The objective is to ensure these prompts are expressed using proactive, affirmative language. The focus is on instructions indicating what needs to be done rather than what needs to be avoided, thereby guiding the LLM more efficiently towards the desired output. **Test Mechanism:** An LLM is employed to evaluate each prompt. The prompt is graded based on its use of positive instructions with scores ranging between 1-10. This grade reflects how effectively the prompt leverages affirmative language while shying away from negative or restrictive instructions. A prompt that attains a grade equal to or above a predetermined threshold (7 by default) is regarded as adhering effectively to the best practices of positive instruction. This threshold can be custom-tailored through the test parameters. **Signs of High Risk:** - Low score obtained from the LLM analysis, indicating heavy reliance on negative instructions in the prompts. - Failure to surpass the preset minimum threshold. - The LLM generates ambiguous or undesirable outputs as a consequence of the negative instructions used in the prompt. **Strengths:** - Encourages the usage of affirmative, proactive language in prompts, aiding in more accurate and advantageous model responses. - The test result provides a comprehensible score, helping to understand how well a prompt follows the positive instruction best practices. **Limitations:** - Despite an adequate score, a prompt could still be misleading or could lead to undesired responses due to factors not covered by this test. - The test necessitates an LLM for evaluation, which might not be available or feasible in certain scenarios. - A numeric scoring system, while straightforward, may oversimplify complex issues related to prompt designing and instruction clarity. 
- The effectiveness of the test hinges significantly on the predetermined threshold level, which can be subjective and may need to be adjusted according to specific use-cases.",validmind.prompt_validation.NegativeInstruction
ThresholdTest,Conciseness,"**Purpose:** The Conciseness Assessment is designed to evaluate the brevity and succinctness of prompts provided to a Language Learning Model (LLM). A concise prompt strikes a balance between offering clear instructions and eliminating redundant or unnecessary information, ensuring that the LLM receives relevant input without being overwhelmed. **Test Mechanism:** Using an LLM, this test conducts a conciseness analysis on input prompts. The analysis grades the prompt on a scale from 1 to 10, where the grade reflects how well the prompt delivers clear instructions without being verbose. Prompts that score equal to or above a predefined threshold (default set to 7) are deemed successfully concise. This threshold can be adjusted to meet specific requirements. **Signs of High Risk:** - Prompts that consistently score below the predefined threshold. - Prompts that are overly wordy or contain unnecessary information. - Prompts that create confusion or ambiguity due to excess or unnecessary information. **Strengths:** - Ensures clarity and effectiveness of the prompts. - Promotes brevity and preciseness in prompts without sacrificing essential information. - Useful for models like LLMs, where input prompt length and clarity greatly influence model performance. - Provides a quantifiable measure of prompt conciseness. **Limitations:** - The conciseness score is based on an AI's assessment, which might not fully capture human interpretation of conciseness. - The predefined threshold for conciseness could be subjective and might need adjustment based on application. - The test is dependent on the LLM’s understanding of conciseness, which might vary from model to model.",validmind.prompt_validation.Conciseness
ThresholdTest,Delimitation,"**Purpose:** This test, dubbed the ""Delimitation Test"", is engineered to assess whether prompts provided to the Language Learning Model (LLM) correctly use delimiters to mark different sections of the input. Well-delimited prompts simplify the interpretation process for LLM, ensuring responses are precise and accurate. **Test Mechanism:** The test employs an LLM to examine prompts for appropriate use of delimiters such as triple quotation marks, XML tags, and section titles. Each prompt is assigned a score from 1 to 10 based on its delimitation integrity. Those with scores equal to or above the preset threshold (which is 7 by default, although it can be adjusted as necessary) pass the test. **Signs of High Risk:** - The test identifies prompts where a delimiter is missing, improperly placed, or incorrect, which can lead to misinterpretation by the LLM. - A high-risk scenario may involve complex prompts with multiple tasks or diverse data where correct delimitation is integral to understanding. - Low scores (below the threshold) are a clear indicator of high risk. **Strengths:** - This test ensures clarity in the demarcation of different components of given prompts. - It helps reduce ambiguity in understanding prompts, particularly for complex tasks. - Scoring allows for quantified insight into the appropriateness of delimiter usage, aiding continuous improvement. **Limitations:** - The test only checks for the presence and placement of delimiter, not whether the correct delimiter type is used for the specific data or task. - It may not fully reveal the impacts of poor delimitation on LLM's final performance. - Depending on the complexity of the tasks and prompts, the preset score threshold may not be refined enough, requiring regular manual adjustment.",validmind.prompt_validation.Delimitation
Metric,Bert Score,"**Purpose**: The BERTScore metric is deployed to evaluate the competence of text generation models by focusing on the similarity between the reference and the generated text. It employs the contextual embeddings from BERT models to assess the similarity of the contents. This measures the extent to which a model has learned and can generate contextually relevant results. **Test Mechanism**: The true values derived from the model's test dataset and the model's predictions are employed in this metric. BERTScore calculates the precision, recall, and F1 score of the model considering the contextual similarity between the reference and the produced text. These scores are computed for each token in the predicted sentences as compared to the reference sentences, while considering the cosine similarity with BERT embeddings. A line plot depicting the score changes across row indexes is generated for each metric i.e., Precision, Recall, and F1 Score. **Signs of High Risk**: - Observable downward trend in Precision, Recall, or F1 Score. - Noticeable instability or fluctuation in these metrics. Lower Precision implies that predictions often incorporate irrelevant contexts. - Declining Recall suggests that the model frequently omits relevant contexts during predictions. - Lower F1 score signals poor overall performance in both precision and recall. **Strengths**: - BERTScore efficiently detects the quality of text that requires to comprehend the context, a common requirement in natural language processing tasks. - This metric advances beyond the simple n-gram matching and considers the semantic similarity in the context, thereby providing more meaningful evaluation results. - The integrated visualization function allows tracking of the performance trends across different prediction sets. 
**Limitations**: - Dependence on BERT model embeddings for BERTScore implies that if the base BERT model is not suitable for a specific task, it might impair the accuracy of BERTScore. - Despite being good at understanding semantics, it might be incapable of capturing certain nuances in text similarity that other metrics like BLEU or ROUGE could detect. - Can be computationally expensive due to the utilization of BERT embeddings.",validmind.model_validation.BertScore
Metric,Bleu Score,"**Purpose**: The Bilingual Evaluation Understudy (BLEU) metric measures the quality of machine-translated text by comparing it to human-translated text. This comparison is done at the sentence level and is designed to bring machine translations closer to the quality of a professional human translation. It is commonly used in the field of translation evaluation, and its purpose is to assess the accuracy of a model's output against that of a benchmark. **Test Mechanism**: The BLEU score is implemented using the NLTK's word_tokenize function to split the text into individual words. After tokenization, the evaluate library's BLEU metric calculates the BLEU score for each translated sentence by comparing the model's translations (predictions) against the actual, correct translations (references). The test algorithm then combines these individual scores into a single score that represents the average 'distance' between the generated translations and the human translations across the entire test set. **Signs of High Risk**: - Low BLEU scores suggest high model risk. This could indicate significant discrepancies between the machine translation and its human equivalent. - This could be due to ineffective model learning, overfitting of training data, or inadequate handling of the language's nuances. - Machine biases toward a certain language style or translation mode can result in lower scores. **Strengths**: - The BLEU score's primary strength lies in its simplicity and interpretability. It offers a straightforward way to assess translated text quality, and its calculations often align with human judgments. - The BLEU score breaks down its evaluations at the sentence level, offering granular insights into any errors. - The score consolidates the model’s performance into a single, comprehensive score, making it easy to compare and monitor. **Limitations**: - The BLEU score heavily favours exact matches, which can create a bias towards literal translations. 
Thus, it may fail to fully evaluate more complex or flexible translations that shy away from a word-for-word structure. - The score does not directly measure the intelligibility or grammatical correctness of the translations. - It may miss errors originating from subtle nuances in language, cultural contexts, or ambiguities.",validmind.model_validation.BleuScore
Metric,Contextual Recall,"**Purpose**: The Contextual Recall metric is used to evaluate the ability of a natural language generation (NLG) model to generate text that appropriately reflects the given context or prompt. It measures the model's capability to remember and reproduce the main context in its resulting output. This metric is critical in natural language processing tasks, as the coherency and contextuality of the generated text are essential. **Test Mechanism**: 1. **Preparation of Reference and Candidate Texts**: - **Reference Texts**: Gather the reference text(s) which exemplify the expected or ideal output for a specific context or prompt. - **Candidate Texts**: Generate candidate text(s) from the NLG model under evaluation using the same context. 2. **Tokenization and Preprocessing**: - Tokenize the reference and candidate texts into discernible words or tokens using libraries such as NLTK. 3. **Computation of Contextual Recall**: - Identify the token overlap between the reference and candidate texts. - The Contextual Recall score is computed by dividing the number of overlapping tokens by the total number of tokens in the reference text. Scores are calculated for each test dataset instance, resulting in an array of scores. These scores are then visualized using a line plot to show score variations across different rows. **Signs of High Risk**: - Low contextual recall scores could indicate that the model is not effectively reflecting the original context in its output, leading to incoherent or contextually misaligned text. - A consistent trend of low recall scores could suggest underperformance of the model. **Strengths**: - The Contextual Recall metric provides a quantifiable measure of a model's adherence to the context and factual elements of the generated narrative. - This metric finds particular value in applications requiring deep comprehension of context, such as text continuation or interactive dialogue systems. 
- The line plot visualization provides a clear and intuitive representation of score fluctuations. **Limitations**: - Despite its effectiveness, the Contextual Recall could fail to comprehensively assess the performance of NLG models. Its focus on word overlap could result in high scores for texts that use many common words, even when these texts lack coherence or meaningful context. - This metric does not consider the order of words, which could lead to overestimated scores for scrambled outputs. - Models that effectively use infrequent words might be undervalued, as these words might not overlap as often.",validmind.model_validation.ContextualRecall
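
Because the listing is described above as a pandas DataFrame, ordinary pandas operations can be used to slice it. The sketch below does not call ValidMind. It rebuilds a small frame by hand from the `Test Type`, `Name`, and `ID` columns of the output above (descriptions omitted for brevity) and assumes the real return value exposes the same column names.

```python
import pandas as pd

# Rows copied by hand from the listing above (Description column omitted).
# Assumption: the object returned by list_tests() exposes these column names.
rows = [
    ("ThresholdTest", "Bias", "validmind.prompt_validation.Bias"),
    ("ThresholdTest", "Clarity", "validmind.prompt_validation.Clarity"),
    ("ThresholdTest", "Specificity", "validmind.prompt_validation.Specificity"),
    ("ThresholdTest", "Robustness", "validmind.prompt_validation.Robustness"),
    ("ThresholdTest", "Negative Instruction", "validmind.prompt_validation.NegativeInstruction"),
    ("ThresholdTest", "Conciseness", "validmind.prompt_validation.Conciseness"),
    ("ThresholdTest", "Delimitation", "validmind.prompt_validation.Delimitation"),
    ("Metric", "Bert Score", "validmind.model_validation.BertScore"),
    ("Metric", "Bleu Score", "validmind.model_validation.BleuScore"),
    ("Metric", "Contextual Recall", "validmind.model_validation.ContextualRecall"),
]
tests = pd.DataFrame(rows, columns=["Test Type", "Name", "ID"])

# Count how many tests of each type appear in the listing.
counts = tests["Test Type"].value_counts()

# Keep only the tests whose ID sits under the prompt_validation namespace.
prompt_tests = tests[tests["ID"].str.startswith("validmind.prompt_validation.")]

# Derive the last component of each ID (the part after the final dot).
tests["Short ID"] = tests["ID"].str.rsplit(".", n=1).str[-1]

print(counts.to_dict())
print(prompt_tests["Name"].tolist())
```

The same selections should apply to the real listing if it is a plain DataFrame. If the function returns a styled or otherwise wrapped object instead, the underlying frame would need to be extracted first.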


## Understanding Tags and Task Types

To use ValidMind's tests effectively, it helps to understand how they are labelled with 'task types' and 'tags':

- **Task Types**: Represent the kind of modeling task associated with a test. For instance:
  - **classification:** Classifying data into specific categories.
  - **regression:** Predicting a continuous outcome variable.
  - **text classification:** Classifying text into specific categories.
  - **text summarization:** Producing a concise summary for a text.

- **Tags**: Free-form descriptors providing detailed insights about a test. Some examples include:
  - **nlp:** Tests relevant for natural language processing.
  - **binary_classification:** Tests for binary classification tasks.
  - **forecasting:** Tests for forecasting and time-series analysis.
  - **tabular_data:** Tests for tabular data like CSVs and Excel spreadsheets.
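
One way to picture the two kinds of label is as two sets attached to each test. The sketch below is a hypothetical model, not ValidMind's implementation: the `CatalogEntry` class, the `select` helper, and the `example.*` test IDs are all invented for illustration, and the rule that an entry must carry every requested tag is an assumption.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CatalogEntry:
    """Hypothetical record pairing a test ID with its task types and tags."""

    test_id: str
    task_types: frozenset = field(default_factory=frozenset)
    tags: frozenset = field(default_factory=frozenset)


def select(entries, task=None, tags=None):
    """Return entries that match the task type (if given) and carry all given tags."""
    wanted = set(tags or [])
    return [
        entry
        for entry in entries
        if (task is None or task in entry.task_types) and wanted <= entry.tags
    ]


# Invented entries; the task types and tags reuse the examples listed above.
catalog = [
    CatalogEntry(
        "example.ChurnAccuracy",
        frozenset({"classification"}),
        frozenset({"binary_classification", "tabular_data"}),
    ),
    CatalogEntry(
        "example.SalesResiduals",
        frozenset({"regression"}),
        frozenset({"forecasting", "tabular_data"}),
    ),
    CatalogEntry(
        "example.ReviewTokens",
        frozenset({"text classification", "text summarization"}),
        frozenset({"nlp"}),
    ),
]

tabular = select(catalog, tags=["tabular_data"])
tabular_regression = select(catalog, task="regression", tags=["tabular_data"])
print([entry.test_id for entry in tabular])
print([entry.test_id for entry in tabular_regression])
```

In this model a test can belong to several task types at once, while tags stay free-form, which matches the distinction drawn in the list above.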

## Searching for Specific Tests using `tags` and `task_types`

Listing all tests is useful, but often you need to narrow the search. The `list_tests` function accepts `filter`, `task`, and `tags` parameters for this purpose. The next cell passes `filter="classification"`.

In [8]:
vt.list_tests(filter="classification")

Test Type,Name,Description,ID
ThresholdTest,High Cardinality,"**Purpose**: The “High Cardinality” test is used to evaluate the number of unique values present in the categorical columns of a dataset. In this context, high cardinality implies the presence of a large number of unique, non-repetitive values in the dataset. **Test Mechanism**: The test first infers the dataset's type and then calculates an initial numeric threshold based on the test parameters. It only considers columns classified as ""Categorical"". For each of these columns, the number of distinct values (n_distinct) and the percentage of distinct values (p_distinct) are calculated. The test will pass if n_distinct is less than the calculated numeric threshold. Lastly, the results, which include details such as column name, number of distinct values, and pass/fail status, are compiled into a table. **Signs of High Risk**: - A large number of distinct values (high cardinality) in one or more categorical columns implies a high risk. - A column failing the test (n_distinct >= num_threshold) is another indicator of high risk. **Strengths**: - The High Cardinality test is effective in early detection of potential overfitting and unwanted noise. - It aids in identifying potential outliers and inconsistencies, thereby improving data quality. - The test can be applied to both, classification and regression task types, demonstrating its versatility. **Limitations**: - The test is restricted to only ""Categorical"" data types and is thus not suitable for numerical or continuous features, limiting its scope. - The test does not consider the relevance or importance of unique values in categorical features, potentially causing it to overlook critical data points. - The threshold (both number and percent) used for the test is static and may not be optimal for diverse datasets and varied applications. Further mechanisms to adjust and refine this threshold could enhance its effectiveness.",validmind.data_validation.HighCardinality
ThresholdTest,Specificity,"**Purpose:** The Specificity Test evaluates the clarity, precision, and effectiveness of the prompts provided to a Language Learning Model (LLM). It aims to ensure that the instructions embedded in a prompt are indisputably clear and relevant, thereby helping to yank out ambiguity and steer the LLM towards desired outputs. This level of specificity significantly affects the accuracy and relevance of LLM outputs. **Test Mechanism:** The Specificity Test employs an LLM to grade each prompt based on clarity, detail, and relevance parameters within a specificity scale that extends from 1 to 10. On this scale, prompts scoring equal to or more than a predefined threshold (set to 7 by default) pass the evaluation, while those scoring below this threshold fail it. Users can adjust this threshold as per their requirements. **Signs of High Risk:** - Prompts scoring consistently below the established threshold - Vague or ambiguous prompts that do not provide clear direction to the LLM - Overly verbose prompts that may confuse the LLM instead of providing clear guidance **Strengths:** - Enables precise and clear communication with the LLM to achieve desired outputs - Serves as a crucial means to measure the effectiveness of prompts - Highly customizable, allowing users to set their threshold based on specific use cases **Limitations:** - This test doesn't consider the content comprehension capability of the LLM - High specificity score doesn't guarantee a high-quality response from the LLM, as the model's performance is also dependent on various other factors - Striking a balance between specificity and verbosity can be challenging, as overly detailed prompts might confuse or mislead the model.",validmind.prompt_validation.Specificity
Metric,Feature Target Correlation Plot,"**Purpose**: This test is designed to graphically illustrate the correlations between distinct input features and the target output of a Machine Learning model. Understanding how each feature influences the model's predictions is crucial - a higher correlation indicates stronger influence of the feature on the target variable. This correlation study is especially advantageous during feature selection and for comprehending the model's operation. **Test Mechanism**: This FeatureTargetCorrelationPlot test computes and presents the correlations between the features and the target variable using a specific dataset. These correlations are calculated, graphically represented in a horizontal bar plot, and color-coded based on the strength of the correlation. A hovering template can also be utilized for informative tooltips. It is possible to specify the features to be analyzed and adjust the graph's height according to need. **Signs of High Risk**: - There are no strong correlations (either positive or negative) between features and the target variable. This could suggest high risk as the supplied features do not appear to significantly impact the prediction output. - The presence of duplicated correlation values might hint at redundancy in the feature set. **Strengths**: - Provides visual assistance to interpreting correlations more effectively. - Gives a clear and simple tour of how each feature affects the model's target variable. - Beneficial for feature selection and grasping the model's prediction nature. - Precise correlation values for each feature are offered by the hover template, contributing to a granular-level comprehension. **Limitations**: - The test only accepts numerical data, meaning variables of other types need to be prepared beforehand. - The plot assumes all correlations to be linear, thus non-linear relationships might not be captured effectively. 
- Not apt for models that employ complex feature interactions, like Decision Trees or Neural Networks, as the test may not accurately reflect their importance.",validmind.data_validation.FeatureTargetCorrelationPlot
Metric,Pearson Correlation Matrix,"**Purpose**: This test is intended to evaluate the extent of linear dependency between all pairs of numerical variables in the given dataset. It provides the Pearson Correlation coefficient, which reveals any high correlations present. The purpose of doing this is to identify potential redundancy, as variables that are highly correlated can often be removed to reduce the dimensionality of the dataset without significantly impacting the model's performance. **Test Mechanism**: This metric test generates a correlation matrix for all numerical variables in the dataset using the Pearson correlation formula. A heat map is subsequently created to visualize this matrix effectively. The color of each point on the heat map corresponds to the magnitude and direction (positive or negative) of the correlation, with a range from -1 (perfect negative correlation) to 1 (perfect positive correlation). Any correlation coefficients higher than 0.7 (in absolute terms) are indicated in white in the heat map, suggesting a high degree of correlation. **Signs of High Risk**: - A large number of variables in the dataset showing a high degree of correlation (coefficients approaching ±1). This indicates redundancy within the dataset, suggesting that some variables may not be contributing new information to the model. - This could potentially lead to overfitting. **Strengths**: - The primary strength of this metric test is its ability to detect and quantify the linearity of relationships between variables. This allows for the identification of redundant variables, which in turn can help in simplifying models and potentially improving their performance. - The visualization aspect (heatmap) is another strength as it offers an easy-to-understand overview of the correlations, beneficial for those not comfortable navigating numerical matrices. 
**Limitations**: - The primary limitation of Pearson Correlation is its inability to detect non-linear relationships between variables, which can lead to missed opportunities for dimensionality reduction. - It only measures the degree of linear relationship and not the strength of effect of one variable on the other. - The cutoff value of 0.7 for high correlation is a somewhat arbitrary choice and some valid dependencies might be missed if they have a correlation coefficient less than this value.",validmind.data_validation.PearsonCorrelationMatrix
Metric,ANOVA One Way Table,"**Purpose**: The ANOVA (Analysis of Variance) One-Way Table metric is utilized to determine whether the mean of numerical variables differs across different groups identified by target or categorical variables. Its primary purpose is to scrutinize the significant impact of categorical variables on numerical ones. This method proves essential in identifying statistically significant features corresponding to the target variable present in the dataset. **Test Mechanism**: The testing mechanism involves the ANOVA F-test's performance on each numerical variable against the target. If no specific features are mentioned, all numerical features are tested. A p-value is produced for each test and compared against a certain threshold (default being 0.05 if not specified). If the p-value is less than or equal to this threshold, the feature is marked as 'Pass', indicating significant mean difference across the groups. Otherwise, it's marked as 'Fail'. The test produces a DataFrame that includes variable name, F statistic value, p-value, threshold, and pass/fail status for every numerical variable. **Signs of High Risk**: - A large number of 'Fail' results in the ANOVA F-test could signify high risk or underperformance in the model. This issue may arise when multiple numerical variables in the dataset don't exhibit any significant difference across the target variable groups. - Features with high p-values also indicate a high risk as they imply a greater chance of obtaining observed data given that the null hypothesis is true. **Strengths**: - The ANOVA One Way Table is highly efficient in identifying statistically significant features by simultaneously comparing group means. - Its flexibility allows the testing of all numerical features in the dataset when no specific ones are mentioned. 
- This metric provides a convenient method to measure the statistical significance of numerical variables and assists in selecting those variables influencing the classifier's predictions considerably. **Limitations**: - This metric assumes that the data is normally distributed, which may not always be the case leading to erroneous test results. - The sensitivity of the F-test to variance changes may hinder this metric's effectiveness, especially for datasets with high variance. - The ANOVA One Way test does not specify which group means differ statistically from others; it strictly asserts the existence of a difference. - The metric fails to provide insights into variable interactions, and significant effects due to these interactions could easily be overlooked.",validmind.data_validation.ANOVAOneWayTable
ThresholdTest,Delimitation,"**Purpose:** This test, dubbed the ""Delimitation Test"", is engineered to assess whether prompts provided to the Language Learning Model (LLM) correctly use delimiters to mark different sections of the input. Well-delimited prompts simplify the interpretation process for LLM, ensuring responses are precise and accurate. **Test Mechanism:** The test employs an LLM to examine prompts for appropriate use of delimiters such as triple quotation marks, XML tags, and section titles. Each prompt is assigned a score from 1 to 10 based on its delimitation integrity. Those with scores equal to or above the preset threshold (which is 7 by default, although it can be adjusted as necessary) pass the test. **Signs of High Risk:** - The test identifies prompts where a delimiter is missing, improperly placed, or incorrect, which can lead to misinterpretation by the LLM. - A high-risk scenario may involve complex prompts with multiple tasks or diverse data where correct delimitation is integral to understanding. - Low scores (below the threshold) are a clear indicator of high risk. **Strengths:** - This test ensures clarity in the demarcation of different components of given prompts. - It helps reduce ambiguity in understanding prompts, particularly for complex tasks. - Scoring allows for quantified insight into the appropriateness of delimiter usage, aiding continuous improvement. **Limitations:** - The test only checks for the presence and placement of delimiter, not whether the correct delimiter type is used for the specific data or task. - It may not fully reveal the impacts of poor delimitation on LLM's final performance. - Depending on the complexity of the tasks and prompts, the preset score threshold may not be refined enough, requiring regular manual adjustment.",validmind.prompt_validation.Delimitation
DatasetMetadata,Dataset Metadata,"**Purpose**: The `DatasetMetadata` test is primarily aimed at collecting and logging essential descriptive statistics related to the training datasets. This test generates essential metadata such as the types of tasks (classification, regression, text_classification, text_summarization) and tags (tabular_data, time_series_data, text_data) associated with the datasets. This transparency facilitates model validation by linking different metrics and test results to the originating dataset. **Test Mechanism**: Rather than conducting a test or implementing a grading scale, this class collects and logs dataset metadata. During post-initialization, the metadata is linked to the dataset object. The `run` method produces a `TestSuiteDatasetResult` object, which is assigned a unique ID and is bound to a dataset. The dataset metadata is associated with this ID for use in future, more focused, validation procedures. **Signs of High Risk**: - The metadata is incomplete or incorrect which can lead to inaccuracies in model risk assessment. - Dataset labels or types are missing, leading to issues in further model validation or mispresentations. **Strengths**: - The class brings transparency to model validation exercises by providing detailed information about the dataset. - It assists in error diagnosis and behaviors correlation to the model. - Ensures the correctness of tasks and data types associations and allows superior model explanations. - Supports dataset versioning by logging each dataset's metadata, maintaining a trackable history of alterations. **Limitations**: - The `DatasetMetadata` class's completeness and accuracy might be questionable, especially if metadata isn't appropriately added or is inaccurate. - It doesn't involve the evaluation of the dataset's quality or the direct validation of model predictions, hence it should be combined with other tests for a more comprehensive assessment. 
- The class cannot detect potential bias in the dataset. For bias detection, separate tests specifically tailored towards fairness and bias detection would be necessary.",validmind.data_validation.DatasetMetadata
ThresholdTest,Stop Words,"**Purpose**: The StopWords threshold test is a tool designed for assessing the quality of text data in an ML model. It focuses on the identification and analysis of ""stop words"" in a given dataset. Stop words are frequent, common, yet semantically insignificant words (for example: ""the"", ""and"", ""is"") in a language. This test evaluates the proportion of stop words to the total word count in the dataset, in essence, scrutinizing the frequency of stop word usage. The core objective is to highlight the prevalent stop words based on their usage frequency, which can be instrumental in cleaning the data from noise and improving ML model performance. **Test Mechanism**: The StopWords test initiates on receiving an input of a 'VMDataset' object. Absence of such an object will trigger an error. The methodology involves inspection of the text column of the VMDataset to create a 'corpus' (a collection of written texts). Leveraging the Natural Language Toolkit's (NLTK) stop word repository, the test screens the corpus for any stop words and documents their frequency. It further calculates the percentage usage of each stop word compared to the total word count in the corpus. This percentage is evaluated against a predefined 'min_percent_threshold'. If this threshold is breached, the test returns a failed output. Top prevailing stop words along with their usage percentages are returned, facilitated by a bar chart visualization of these stop words and their frequency. **Signs of High Risk**: - A percentage of any stop words exceeding the predefined 'min_percent_threshold'. - High frequency of stop words in the dataset which may adversely affect the application's analytical performance due to noise creation. **Strengths**: - The ability to scrutinize and quantify the usage of stop words. - Provides insights into potential noise in the text data due to stop words. This can directly aid in enhancing model training efficiency. 
- The test includes a bar chart visualization feature to easily interpret and act on the stop words frequency information. **Limitations**: - The test only supports English stop words, making it less effective with datasets of other languages. - The 'min_percent_threshold' parameter may require fine-tuning for different datasets, impacting the overall effectiveness of the test. - Contextual use of the stop words within the dataset is not considered, which may lead to overlooking their significance in certain contexts. - The test focuses specifically on the frequency of stop words, not providing direct measures of model performance or predictive accuracy.",validmind.data_validation.nlp.StopWords
Metric,IQR Outliers Table,"**Purpose**: The ""Interquartile Range Outliers Table"" (IQROutliersTable) metric has been designed for identifying and summarizing outliers within numerical features of a dataset using the Interquartile Range (IQR) method. The purpose of this exercise is crucial in the pre-processing of data as outliers can substantially distort the statistical analysis and debilitate the performance of machine learning models. **Test Mechanism**: The IQR, which is the range separating the first quartile (25th percentile) from the third quartile (75th percentile), is calculated for each numerical feature within the dataset. An outlier is defined as a data point falling below the ""Q1 - 1.5 * IQR"" or above ""Q3 + 1.5 * IQR"" range. The metric then computes the number of outliers along with their minimum, 25th percentile, median, 75th percentile, and maximum values for each numerical feature. If no specific features are chosen, the metric will apply to all numerical features in the dataset. The default outlier threshold is set to 1.5, following the standard definition of outliers in statistical analysis, although it can be customized by the user. **Signs of High Risk**: - High risk is indicated by a large number of outliers in multiple features. - Outliers that are significantly distanced from the mean value of variables could potentially signal high risk. - Data entry errors or other data quality issues could be manifested through extremely high or low outlier values. **Strengths**: - It yields a comprehensive summary of outliers for each numerical feature within the dataset. This enables the user to pinpoint features with potential quality issues. - The IQR method is not overly affected by extremely high or low outlier values as it is based on quartile calculations. - The versatility of this metric grants the ability to customize the method to work on selected features and set a defined threshold for outliers. 
**Limitations**: - The metric might cause false positives if the variable of interest veers away from a normal or near-normal distribution, notably in the case of skewed distributions. - It does not extend to provide interpretation or recommendations for tackling outliers and relies on the user or a data scientist to conduct further analysis of the results. - As it only functions on numerical features, it cannot be used for categorical data. - For data that has undergone heavy pre-processing, was manipulated, or inherently possesses a high kurtosis (heavy tails), the pre-set threshold may not be optimal for outlier detection.",validmind.data_validation.IQROutliersTable
ThresholdTest,Training Test Degradation,"**Purpose**: The 'TrainingTestDegradation' class serves as a test to verify that the degradation in performance between the training and test datasets does not exceed a predefined threshold. This test serves as a measure to check the model's ability to generalize from its training data to unseen test data. It assesses key classification metric scores such as accuracy, precision, recall and f1 score, to verify the model's robustness and reliability. **Test Mechanism**: The code applies several predefined metrics including accuracy, precision, recall and f1 scores to the model's predictions for both the training and test datasets. It calculates the degradation as the difference between the training score and test score divided by the training score. The test is considered successful if the degradation for each metric is less than the preset maximum threshold of 10%. The results are summarized in a table showing each metric's train score, test score, degradation percentage, and pass/fail status. **Signs of High Risk**: - A degradation percentage that exceeds the maximum allowed threshold of 10% for any of the evaluated metrics. - A high difference or gap between the metric scores on the training and the test datasets. - The 'Pass/Fail' column displaying 'Fail' for any of the evaluated metrics. **Strengths**: - This test provides a quantitative measure of the model's ability to generalize to unseen data, which is key for predicting its practical real-world performance. - By evaluating multiple metrics, it takes into account different facets of model performance and enables a more holistic evaluation. - The use of a variable predefined threshold allows the flexibility to adjust the acceptability criteria for different scenarios. **Limitations**: - The test compares raw performance on training and test data, but does not factor in the nature of the data. 
Areas with less representation in the training set, for instance, might still perform poorly on unseen data. - It requires good coverage and balance in the test and training datasets to produce reliable results, which may not always be available. - The test is currently only designed for classification tasks.",validmind.model_validation.sklearn.TrainingTestDegradation
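
The Training Test Degradation description in the last row of this listing defines degradation as the difference between the training score and the test score, divided by the training score, with a metric passing when that value stays below the 10% threshold. The arithmetic can be sketched as follows. This is a minimal illustration, not the ValidMind implementation: the `degradation` helper, the `MAX_THRESHOLD` name, and the scores are all hypothetical and invented for this example.

```python
def degradation(train_score: float, test_score: float) -> float:
    """Relative drop from training to test score: (train - test) / train."""
    return (train_score - test_score) / train_score


MAX_THRESHOLD = 0.10  # the 10% default quoted in the test description

# Hypothetical (train, test) scores, made up for illustration only
scores = {
    "accuracy": (0.90, 0.85),
    "precision": (0.80, 0.60),
}

results = {}
for metric, (train, test) in scores.items():
    value = degradation(train, test)
    # A metric passes when its degradation is below the threshold
    results[metric] = {"degradation": value, "passed": value < MAX_THRESHOLD}
    status = "Pass" if results[metric]["passed"] else "Fail"
    print(f"{metric}: degradation={value:.1%} -> {status}")
```

With these made-up numbers, accuracy degrades by roughly 5.6% and passes, while precision degrades by 25% and fails.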


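The ID column in the listing above appears to follow a dotted layout: `validmind`, then a category path such as `data_validation` or `model_validation.sklearn`, then the test class name. Assuming that layout holds (it is inferred from the IDs shown here, not from documented behavior), a rough sketch for grouping IDs by category could look like this. The `category` helper is hypothetical, and the IDs are copied from the listing.

```python
from collections import defaultdict

# IDs copied from the listing above
test_ids = [
    "validmind.data_validation.ANOVAOneWayTable",
    "validmind.prompt_validation.Delimitation",
    "validmind.data_validation.DatasetMetadata",
    "validmind.data_validation.nlp.StopWords",
    "validmind.data_validation.IQROutliersTable",
    "validmind.model_validation.sklearn.TrainingTestDegradation",
]


def category(test_id: str) -> str:
    """Return the path between the 'validmind' prefix and the class name."""
    parts = test_id.split(".")
    return ".".join(parts[1:-1])


# Map each category path to the test class names found under it
by_category = defaultdict(list)
for test_id in test_ids:
    class_name = test_id.rsplit(".", 1)[-1]
    by_category[category(test_id)].append(class_name)

for name, tests in sorted(by_category.items()):
    print(f"{name}: {tests}")
```

Grouping this way gives a quick overview of which areas (data validation, prompt validation, model validation) a listing covers before narrowing it further.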
If you're looking for tests that match a particular task type, the `task` parameter comes in handy. For example, to list tests for 'regression':

In [9]:
vt.list_tests(task="regression")

Test Type,Name,Description,ID
Metric,Model Metadata,"**Purpose:** This test is designed to collect and summarize important metadata related to a particular machine learning model. Such metadata includes the model's architecture (modeling technique), the version and type of modeling framework used, and the programming language the model is written in. **Test Mechanism:** The mechanism of this test consists of extracting information from the model instance. It tries to extract the model information such as the modeling technique used, the modeling framework version, and the programming language. It decorates this information into a data frame and returns a summary of the results. **Signs of High Risk:** - High risk could be determined by a lack of documentation or inscrutable metadata for the model. - Unidentifiable language, outdated or unsupported versions of modeling frameworks, or undisclosed model architectures reflect risky situations, as they could hinder future reproducibility, support, and debugging of the model. **Strengths:** - The strengths of this test lie in the increased transparency and understanding it brings regarding the model's setup. - Knowing the model's architecture, the specific modeling framework version used, and the language involved, provides multiple benefits: supports better error understanding and debugging, facilitates model reuse, aids compliance of software policies, and assists in planning for model obsolescence due to evolving or discontinuing software and dependencies. **Limitations:** - Notably, this test is largely dependent on the compliance and correctness of information provided by the model or the model developer. - If the model's built-in methods for describing its architecture, framework or language are incorrect or lack necessary information, this test will hold limitations. 
- Moreover, it is not designed to directly evaluate the performance or accuracy of the model, rather it provides supplementary information which aids in comprehensive analysis.",validmind.model_validation.ModelMetadata
Metric,Regression Models Coeffs,"**Purpose**: The 'RegressionModelsCoeffs' metric is utilized to evaluate and compare coefficients of different regression models trained on the same dataset. By examining how each model weighted the importance of different features during training, this metric provides key insights into which factors have the most impact on the model's predictions and how these patterns differ across models. **Test Mechanism**: The test operates by extracting the coefficients of each regression model using the 'regression_coefficients()' method. These coefficients are then consolidated into a dataframe, with each row representing a model and columns corresponding to each feature's coefficient. It must be noted that this test is exclusive to 'statsmodels' and 'R' models, other models will result in a 'SkipTestError'. **Signs of High Risk**: - Discrepancies in how different models weight the same features - Unexpectedly high or low coefficients - The test is inapplicable to certain models because they are not from 'statsmodels' or 'R' libraries **Strengths**: - Enables insight into the training process of different models - Allows comparison of feature importance across models - Through the review of feature coefficients, the test provides a more transparent evaluation of the model and highlights significant weights and biases in the training procedure **Limitations**: - The test is only compatible with 'statsmodels' and 'R' regression models - While the test provides contrast in feature weightings among models, it does not establish the most appropriate or accurate weighting, thus remaining subject to interpretation - It does not account for potential overfitting or underfitting of models - The computed coefficients might not lead to effective performance on unseen data",validmind.model_validation.statsmodels.RegressionModelsCoeffs
Metric,Box Pierce,"**Purpose:** The Box-Pierce test is utilized to detect the presence of autocorrelation in a time-series dataset. Autocorrelation, or serial correlation, refers to the degree of similarity between observations based on the temporal spacing between them. This test is essential for affirming the quality of a time-series model by ensuring that the error terms in the model are random and do not adhere to a specific pattern. **Test Mechanism:** The implementation of the Box-Pierce test involves calculating a test statistic along with a corresponding p-value derived from the dataset features. These quantities are used to test the null hypothesis that posits the data to be independently distributed. This is achieved by iterating over every feature column in the time-series data and applying the `acorr_ljungbox` function of the statsmodels library. The function yields the Box-Pierce test statistic as well as the respective p-value, all of which are cached as test results. **Signs of High Risk:** - A low p-value, typically under 0.05 as per statistical convention, throws the null hypothesis of independence into question. This implies that the dataset potentially houses autocorrelations, thus indicating a high-risk scenario concerning model performance. - Large Box-Pierce test statistic values may indicate the presence of autocorrelation. **Strengths:** - Detects patterns in data that are supposed to be random, thereby ensuring no underlying autocorrelation. - Can be computed efficiently given its low computational complexity. - Can be widely applied to most regression problems, making it very versatile. **Limitations:** - Assumes homoscedasticity (constant variance) and normality of residuals, which may not always be the case in real-world datasets. - May exhibit reduced power for detecting complex autocorrelation schemes such as higher-order or negative correlations. 
- It only provides a general indication of the existence of autocorrelation, without providing specific insights into the nature or patterns of the detected autocorrelation. - In the presence of trends or seasonal patterns, the Box-Pierce test may yield misleading results. - Applicability is limited to time-series data, which limits its overall utility.",validmind.model_validation.statsmodels.BoxPierce
Metric,Regression Coeffs Plot,"**Purpose**: The Regression Coefficients with Confidence Intervals plot and metric aims to understand the impact of predictor variables on the response variable in question. This understanding is achieved via the visualization and analysis of the regression model by presenting the coefficients derived from the model along with their associated 95% confidence intervals. By doing so, it offers insights into the variability and uncertainty associated with the model's estimates. **Test Mechanism**: The test begins by extracting the estimated coefficients and their related standard errors from the regression model under test. It then calculates and draws confidence intervals based on a 95% confidence level (a standard convention in statistics). These intervals provide a range wherein the true value can be expected to fall 95% of the time if the same regression were re-run multiple times with samples drawn from the same population. This information is then visualized as a bar plot, with the predictor variables and their coefficients on the x-axis and y-axis respectively and the confidence intervals represented as error bars. **Signs of High Risk**: * If the calculated confidence interval contains the zero value, it could mean the feature/coefficient in question doesn't significantly contribute to prediction in the model. * If there are multiple coefficients exhibiting this behavior, it might raise concerns about overall model reliability. * Very wide confidence intervals might indicate high uncertainty in the associated coefficient estimates. **Strengths**: * This metric offers a simple and easily comprehendible visualization of the significance and impact of individual predictor variables in a regression model. * By including confidence intervals, it enables an observer to evaluate the uncertainty around each coefficient estimate. 
**Limitations**: * The test is dependent on a few assumptions about the data, namely normality of residuals and independence of observations, which may not always be true for all types of datasets. * The test does not consider multi-collinearity (correlation among predictor variables), which can potentially distort the model and make interpretation of coefficients challenging. * The test's application is limited to regression tasks and tabular datasets and is not suitable for other types of machine learning assignments or data structures.",validmind.model_validation.statsmodels.RegressionCoeffsPlot
Metric,Regression Model Sensitivity Plot,"**Purpose**: The Regression Sensitivity Plot metric is designed to perform sensitivity analysis on regression models. This metric aims to measure the impact of slight changes (shocks) applied to individual variables on the system's outcome while keeping all other variables constant. By doing so, it analyzes the effects of each independent variable on the dependent variable within the regression model and helps identify significant risk factors that could substantially influence the model's output. **Test Mechanism**: This metric operates by initially applying shocks of varying magnitudes, defined by specific parameters, to each of the model's features, one at a time. With all other variables held constant, a new prediction is made for each dataset subjected to shocks. Any changes in the model's predictions are directly attributed to the shocks applied. In the event that the transformation parameter is set to ""integrate"", initial predictions and target values undergo transformation via an integration function before being plotted. Lastly, a plot demonstrating observed values against predicted values for each model is generated, showcasing a distinct line graph illustrating predictions for each shock. **Signs of High Risk**: - If the plot exhibits drastic alterations in model predictions consequent to minor shocks to an individual variable, it may indicate high risk. This underscores potentially high model sensitivity to changes in that variable, suggesting over-dependence on that variable for predictions. - Unusually high or unpredictable shifts in response to shocks may also denote potential model instability or overfitting. **Strengths**: - The metric allows identification of variables strongly influencing the model outcomes, paving the way for understanding feature importance. - It generates visual plots which make the results easily interpretable even to non-technical stakeholders. 
- Beneficial in identifying overfitting and detecting unstable models that over-react to minor changes in variables. **Limitations**: - The metric operates on the assumption that all other variables remain unchanged during the application of a shock. However, real-world situations where variables may possess intricate interdependencies may not always reflect this. - It is best compatible with linear models and may not effectively evaluate the sensitivity of non-linear model configurations. - The metric does not provide a numerical risk measure. It offers only a visual representation, which may invite subjectivity in interpretation.",validmind.model_validation.statsmodels.RegressionModelSensitivityPlot
Metric,Regression Models Performance,"**Purpose**: This metric is used to evaluate and compare the performance of various regression models. Through the use of key statistical measures such as R-squared, Adjusted R-squared, and Mean Squared Error (MSE), the performance of different models in predicting dependent variables can be assessed both on the data used for training (in-sample) and new, unseen data (out-of-sample). **Test Mechanism**: The test evaluates a list of provided regression models. For each model, it calculates their in-sample and out-of-sample performance by deriving the model predictions for the training and testing datasets respectively, and then comparing these predictions to the actual values. In doing so, it calculates R-squared, Adjusted R-squared, and MSE for each model, stores the results, and returns them for comparison. **Signs of High Risk**: - High Mean Squared Error (MSE) values. - Strikingly low values of R-squared and Adjusted R-squared. - A significant drop in performance when transitioning from in-sample to out-of-sample evaluations, signaling a potential overfitting issue. **Strengths**: - The test permits comparisons of multiple models simultaneously, providing an objective base for identifying the top-performing model. - It delivers both in-sample and out-of-sample evaluations, presenting performance data on unseen data. - The utilization of R-squared and Adjusted R-squared in conjunction with MSE allows for a detailed view of the model's explainability and error rate. **Limitations**: - This test is built around the assumption that the residuals of the regression model are normally distributed, which is a fundamental requirement for Ordinary Least Squares (OLS) regression; thus, it could be not suitable for models where this assumption is broken. 
- The test does not consider cases where higher R-squared or lower MSE values do not necessarily correlate with better predictive performance, particularly in instances of excessively complex models.",validmind.model_validation.statsmodels.RegressionModelsPerformance
Metric,Zivot Andrews Arch,"**Purpose**: The Zivot-Andrews Arch metric is used to evaluate the order of integration for a time series data in a machine learning model. It's designed to test for stationarity, a crucial aspect in time series analysis where data points are not dependent on time. Stationarity means that the statistical properties such as mean, variance and autocorrelation are all constant over time. **Test Mechanism**: The Zivot-Andrews unit root test is performed on each feature in the dataset using the `ZivotAndrews` function from the `arch.unitroot` module. This function returns the Zivot-Andrews metric for each feature, which includes the statistical value, p-value (probability value), the number of used lags, and the number of observations. The p-value is later used to decide on the null hypothesis (the time series has a unit root and is non-stationary) based on a chosen level of significance. **Signs of High Risk**: - A high p-value can suggest high risk. This might indicate that there's insufficient evidence to reject the null hypothesis, which would mean the time series has a unit root and is therefore non-stationary. - Non-stationary time series data can lead to misleading statistics and unreliable machine learning models. **Strengths**: - The Zivot-Andrews Arch metric dynamically tests for stationarity against structural breaks in time series data, offering robust evaluation of stationarity in features. - This metric is especially beneficial with financial, economic, or other time-series data where data observations lack a consistent pattern and structural breaks may occur. **Limitations**: - The Zivot-Andrews Arch metric assumes that data is derived from a single-equation, autoregressive model. It may, therefore, not be appropriate for multivariate time series data or data which does not align with the autoregressive model assumption. 
- It might not take into account unexpected shocks or changes in the series trend which can both have a significant impact on the stationarity of the data.",validmind.model_validation.statsmodels.ZivotAndrewsArch
Metric,Regression Model Outsample Comparison,"**Purpose**: The RegressionModelOutsampleComparison test is designed to evaluate the predictive performance of multiple regression models by means of an out-of-sample test. The primary aim of this test is to validate the model's ability to generalize to unseen data, a common challenge in the context of overfitting. It does this by computing two critical metrics — Mean Squared Error (MSE) and Root Mean Squared Error (RMSE), which provide a quantifiable measure of the model's prediction accuracy on the testing dataset. **Test Mechanism**: This test requires multiple models (specifically Ordinary Least Squares - OLS regression models) and a test dataset as inputs. Each model generates predictions using the test dataset. The residuals are then calculated and used to compute the MSE and RMSE for each model. The test outcomes, which include the model's name, its MSE, and RMSE, are recorded and returned in a structured dataframe format. **Signs of High Risk**: - High values of MSE or RMSE indicate significant risk, signifying that the model's predictions considerably deviate from the actual values in the test dataset. - Consistently large discrepancies between training and testing performance across various models may indicate an issue with the input data itself or the model selection strategies employed. **Strengths**: - This test offers a comparative evaluation of multiple models' out-of-sample performance, enabling the selection of the best performing model. - The use of both MSE and RMSE provides insights into the model's prediction error. While MSE is sensitive to outliers, emphasizing larger errors, RMSE provides a more interpretable measure of average prediction error given that it's in the same unit as the dependent variable. **Limitations**: - The applicability of this test is limited to regression tasks, specifically OLS models. 
- The test operates under the assumption that the test dataset is a representative sample of the population. This might not always hold true and can result in less accurate insights. - The interpretability and the objectivity of the output (MSE and RMSE) can be influenced when the scale of the dependent variable varies significantly, or the distribution of residuals is heavily skewed or contains outliers.",validmind.model_validation.statsmodels.RegressionModelOutsampleComparison
Metric,Regression Model Forecast Plot Levels,"**Purpose:** The `RegressionModelForecastPlotLevels` metric is designed to visually assess a series of regression models' performance. It achieves this by contrasting the models' forecasts with the observed data from the respective training and test datasets. The gauge of accuracy here involves determining the extent of closeness between forecasted and actual values. Accordingly, if any transformations are specified, the metric will handle transforming the data before making this comparison. **Test Mechanism:** The `RegressionModelForecastPlotLevels` class in Python initiates with a `transformation` parameter, which default aggregates to None. Initially, the class checks for the presence of model objects and raises a `ValueError` if none are found. Each model is then processed, creating predictive forecasts for both its training and testing datasets. These forecasts are then contrasted with the actual values and plotted. In situations where a specified transformation, like ""integrate,"" is specified, the class navigates the transformation steps (performing cumulative sums to generate a novel series, for instance). Finally, plots are produced that compare observed and forecasted values for both the raw and transformed datasets. **Signs of High Risk:** Indications of high risk or failure in the model's performance can be derived from checking the generated plots. When the forecasted values dramatically deviate from the observed values in either the training or testing datasets, it suggests a high risk situation. A significant deviation could be a symptom of either overfitting or underfitting, both scenarios are worrying. Such discrepancies could inhibit the model's ability to create precise, generalized results. **Strengths:** - Visual Evaluations: The metric provides a visual and comparative way of assessing multiple regression models at once. 
This allows easier interpretation and evaluation of their forecasting accuracy. - Transformation Handling: This metric can handle transformations like ""integrate,"" enhancing its breadth and flexibility in evaluating different models. - Detailed Perspective: By looking at the performance on both datasets (training and testing), the metric may give a detailed overview of the model. **Limitations:** - Subjectivity: Relying heavily on visual interpretations; assessments may differ from person to person. - Limited Transformation Capability: Currently, only the ""integrate"" transformation is supported, implying complex transformations might go unchecked or unhandled. - Overhead: The plotting mechanism may become computationally costly when applying to extensive datasets, increasing runtime. - Numerical Measurement: Although visualization is instrumental, a corresponding numerical measure would further reinforce the observations. However, this metric does not provide numerical measures.",validmind.model_validation.statsmodels.RegressionModelForecastPlotLevels
Metric,Feature Importance And Significance,"**Purpose**: The 'FeatureImportanceAndSignificance' test evaluates the statistical significance and the importance of features in the context of the machine learning model. By comparing the p-values from a regression model and the feature importances from a decision tree model, this test aids in determining the most significant variables from a statistical and a machine learning perspective, assisting in feature selection during the model development process. **Test Mechanism**: The test first compares the p-values from a regression model and the feature importances from a decision tree model. These values are normalized to ensure a uniform comparison. The 'p_threshold' parameter is used to determine what p-value is considered statistically significant and if the 'significant_only' parameter is true, only features with p-values below this threshold are included in the final output. The output from this test includes an interactive visualization displaying normalized p-values and the associated feature importances. The test throws an error if it does not receive both a regression model and a decision tree model. **Signs of High Risk**: - Exceptionally high or low p-values, which suggest that a feature may not be significant or meaningful in the context of the model. - If many variables with small feature importance values have significant p-values, this could indicate that the model might be overfitting. **Strengths**: - Combines two perspectives statistical significance (p-values) and feature importance (decision tree model), making it a robust feature selection test. - Provides an interactive visualization making it easy to interpret and understand the results. **Limitations**: - The test only works with a regression model and a decision tree model which may limit its applicability. 
- The test does not take into account potential correlations or causative relationships between features which may lead to misinterpretations of significance and importance. - Over-reliance on the p-value as a cut-off for feature significance can be seen as arbitrary and may not truly reflect the real-world importance of the feature.",validmind.model_validation.statsmodels.FeatureImportanceAndSignificance


The `task` parameter pinpoints tests that align with a specific task type, such as 'regression'. Tests can also be filtered by tag, as in the next cell:

In [10]:
vt.list_tests(tags=["model_performance", "visualization"])

Test Type,Name,Description,ID
Metric,Confusion Matrix,"**Purpose**: The Confusion Matrix tester is designed to assess the performance of a classification Machine Learning model. This performance is evaluated based on how well the model is able to correctly classify True Positives, True Negatives, False Positives, and False Negatives - fundamental aspects of model accuracy. **Test Mechanism**: The mechanism used involves taking the predicted results (`y_test_predict`) from the classification model and comparing them against the actual values (`y_test_true`). A confusion matrix is built using the unique labels extracted from `y_test_true`, employing scikit-learn's metrics. The matrix is then visually rendered with the help of Plotly's `create_annotated_heatmap` function. A heatmap is created which provides a two-dimensional graphical representation of the model's performance, showcasing distributions of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). **Signs of High Risk**: Indicators of high risk related to the model include: - High numbers of False Positives (FP) and False Negatives (FN), depicting that the model is not effectively classifying the values. - Low numbers of True Positives (TP) and True Negatives (TN), implying that the model is struggling with correctly identifying class labels. **Strengths**: The Confusion Matrix tester brings numerous strengths: - It provides a simplified yet comprehensive visual snapshot of the classification model's predictive performance. - It distinctly brings out True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN), thus, making it easier to focus on potential areas of improvement. - The matrix is beneficial in dealing with multi-class classification problems as it can provide a simple view of complex model performances. - It aids in understanding the different types of errors that the model could potentially make, as it provides in-depth insights into Type-I and Type-II errors. 
**Limitations**: Despite its various strengths, the Confusion Matrix tester does exhibit some limitations: - In cases of unbalanced classes, the effectiveness of the confusion matrix might be lessened. It may wrongly interpret the accuracy of a model that is essentially just predicting the majority class. - It does not provide a single unified statistic that could evaluate the overall performance of the model. Different aspects of the model's performance are evaluated separately instead. - It mainly serves as a descriptive tool and does not offer the capability for statistical hypothesis testing. - Risks of misinterpretation exist because the matrix doesn't directly provide precision, recall, or F1-score data. These metrics have to be computed separately.",validmind.model_validation.sklearn.ConfusionMatrix
Metric,Precision Recall Curve,"**Purpose**: The Precision Recall Curve metric is intended to evaluate the trade-off between precision and recall in classification models, particularly binary classification models. It assesses the model's capacity to produce accurate results (high precision), as well as its ability to capture a majority of all positive instances (high recall). **Test Mechanism**: The test extracts ground truth labels and prediction probabilities from the model's test dataset. It applies the precision_recall_curve method from the sklearn metrics module to these extracted labels and predictions, which computes a precision-recall pair for each possible threshold. This calculation results in an array of precision and recall scores that can be plotted against each other to form the Precision-Recall Curve. This curve is then visually represented by using Plotly's scatter plot. **Signs of High Risk**: * A lower area under the Precision-Recall Curve signifies high risk. * This corresponds to a model yielding a high amount of false positives (low precision) and/or false negatives (low recall). * If the curve is closer to the bottom left of the plot, rather than being closer to the top right corner, it can be a sign of high risk. **Strengths**: * This metric aptly represents the balance between precision (minimizing false positives) and recall (minimizing false negatives), which is especially critical in scenarios where both values are significant. * Through the graphic representation, it enables an intuitive understanding of the model's performance across different threshold levels. **Limitations**: * This metric is only applicable to binary classification models – it raises errors for multiclass classification models or Foundation models. 
* It may not fully represent the overall accuracy of the model if the cost of false positives and false negatives are extremely different, or if the dataset is heavily imbalanced.",validmind.model_validation.sklearn.PrecisionRecallCurve
Metric,ROC Curve,"**Purpose**: The Receiver Operating Characteristic (ROC) curve is designed to evaluate the performance of binary classification models. This curve illustrates the balance between the True Positive Rate (TPR) and False Positive Rate (FPR) across various threshold levels. In combination with the Area Under the Curve (AUC), the ROC curve aims to measure the model's discrimination ability between the two defined classes in a binary classification problem (e.g., default vs non-default). Ideally, a higher AUC score signifies superior model performance in accurately distinguishing between the positive and negative classes. **Test Mechanism**: First, this script selects the target model and datasets that require binary classification. It then calculates the predicted probabilities for the test set, and uses this data, along with the true outcomes, to generate and plot the ROC curve. Additionally, it concludes a line signifying randomness (AUC of 0.5). The AUC score for the model's ROC curve is also computed, presenting a numerical estimation of the model's performance. If any Infinite values are detected in the ROC threshold, these are effectively eliminated. The resulting ROC curve, AUC score, and thresholds are consequently saved for future reference. **Signs of High Risk**: - A high risk is potentially linked to the model's performance if the AUC score drops below or nears 0.5. - Another warning sign would be the ROC curve lying closer to the line of randomness, indicating no discriminative ability. - For the model to be deemed competent at its classification tasks, it is crucial that the AUC score is significantly above 0.5. **Strengths**: - This ROC Curve offers an inclusive visual depiction of a model's discriminative power throughout all conceivable classification thresholds, unlike other metrics that solely disclose model performance at one fixed threshold. 
- Despite the proportions of the dataset, the AUC Score, which represents the entire ROC curve as a single data point, continues to be consistent, proving to be the ideal choice for such situations. **Limitations**: - The primary limitation is that this test is exclusively structured for binary classification tasks, thus limiting its application towards other model types. - Furthermore, its performance might be subpar with models that output probabilities highly skewed towards 0 or 1. - At the extreme, the ROC curve could reflect high performance even when the majority of classifications are incorrect, provided that the model's ranking format is retained. This phenomenon is commonly termed the ""Class Imbalance Problem"".",validmind.model_validation.sklearn.ROCCurve
ThresholdTest,Training Test Degradation,"**Purpose**: The 'TrainingTestDegradation' class serves as a test to verify that the degradation in performance between the training and test datasets does not exceed a predefined threshold. This test serves as a measure to check the model's ability to generalize from its training data to unseen test data. It assesses key classification metric scores such as accuracy, precision, recall and f1 score, to verify the model's robustness and reliability. **Test Mechanism**: The code applies several predefined metrics including accuracy, precision, recall and f1 scores to the model's predictions for both the training and test datasets. It calculates the degradation as the difference between the training score and test score divided by the training score. The test is considered successful if the degradation for each metric is less than the preset maximum threshold of 10%. The results are summarized in a table showing each metric's train score, test score, degradation percentage, and pass/fail status. **Signs of High Risk**: - A degradation percentage that exceeds the maximum allowed threshold of 10% for any of the evaluated metrics. - A high difference or gap between the metric scores on the training and the test datasets. - The 'Pass/Fail' column displaying 'Fail' for any of the evaluated metrics. **Strengths**: - This test provides a quantitative measure of the model's ability to generalize to unseen data, which is key for predicting its practical real-world performance. - By evaluating multiple metrics, it takes into account different facets of model performance and enables a more holistic evaluation. - The use of a variable predefined threshold allows the flexibility to adjust the acceptability criteria for different scenarios. **Limitations**: - The test compares raw performance on training and test data, but does not factor in the nature of the data. 
Areas with less representation in the training set, for instance, might still perform poorly on unseen data. - It requires good coverage and balance in the test and training datasets to produce reliable results, which may not always be available. - The test is currently only designed for classification tasks.",validmind.model_validation.sklearn.TrainingTestDegradation
Metric,Log Regression Confusion Matrix,"**Purpose**: The Logistic Regression Confusion Matrix is a metric used to measure the performance of a logistic regression classification model. This metric is particularly useful for scenarios where a model's predictions are formulated by thresholding probabilities. The main advantage of this approach is that it includes true positives, true negatives, false positives, and false negatives in its assessment, providing a more comprehensive overview of the model's effectiveness in distinguishing between correct and incorrect classifications. **Test Mechanism**: The methodology behind the Logistic Regression Confusion Matrix uses the `sklearn.metrics.confusion_matrix` function from the Python library to generate a matrix. This matrix is created by comparing the model's predicted probabilities, which are initially converted to binary predictions using a predetermined cut-off threshold (default is 0.5), against the actual classes. The matrix's design consists of the predicted class labels forming the x-axis, and the actual class labels forming the y-axis, with each cell containing the record of true positives, true negatives, false positives, and false negatives respectively. **Signs of High Risk**: - A significant number of false positives and false negatives, indicating that the model is incorrectly classifying instances. - The counts of true positives and true negatives being substantially lower than projected, positioning this as a potential high-risk indicator. **Strengths**: - Simple, intuitive, and provides a comprehensive understanding of the model's performance. - Provides a detailed breakdown of error types, improving transparency. - Offers flexible adaptation for diverse prediction scenarios by allowing adjustments to the cut-off threshold, and enabling exploration of trade-offs between precision (minimizing false positives) and recall (minimizing false negatives). 
**Limitations**: - Acceptable performance on majority classes but potential poor performance on minority classes in imbalanced datasets, as the confusion matrix may supply misleading results. - Lack of insight into the severity of the mistakes and the cost trade-off between different types of misclassification. - Selection of the cut-off threshold can significantly alter the interpretation, and a poorly chosen threshold may lead to erroneous conclusions.",validmind.model_validation.statsmodels.LogRegressionConfusionMatrix
Metric,GINI Table,"**Purpose**: The 'GINITable' metric is designed to evaluate the performance of a classification model by emphasizing its discriminatory power. Specifically, it calculates and presents three important metrics - the Area under the ROC Curve (AUC), the GINI coefficient, and the Kolmogov-Smirnov (KS) statistic - for both training and test datasets. **Test Mechanism**: Using a dictionary for storing performance metrics for both the training and test datasets, the 'GINITable' metric calculates each of these metrics sequentially. The Area under the ROC Curve (AUC) is calculated via the `roc_auc_score` function from the Scikit-Learn library. The GINI coefficient, a measure of statistical dispersion, is then computed by doubling the AUC and subtracting 1. Finally, the Kolmogov-Smirnov (KS) statistic is calculated via the `roc_curve` function from Scikit-Learn, with the False Positive Rate (FPR) subtracted from the True Positive Rate (TPR) and the maximum value taken from the resulting data. These metrics are then stored in a pandas DataFrame for convenient visualization. **Signs of High Risk**: - Low values for performance metrics may suggest a reduction in model performance, particularly a low AUC which indicates poor classification performance, or a low GINI coefficient, which could suggest a decreased ability to discriminate different classes. - A high KS value may be an indicator of potential overfitting, as this generally signifies a substantial divergence between positive and negative distributions. - Significant discrepancies between the performance on the training dataset and the test dataset may present another signal of high risk. **Strengths**: - Offers three key performance metrics (AUC, GINI, and KS) in one test, providing a more comprehensive evaluation of the model. - Provides a direct comparison between the model's performance on training and testing datasets, which aids in identifying potential underfitting or overfitting. 
- The applied metrics are class-distribution invariant, thereby remaining effective for evaluating model performance even when dealing with imbalanced datasets. - Presents the metrics in a user-friendly table format for easy comprehension and analysis. **Limitations**: - The GINI coefficient and KS statistic are both dependent on the AUC value. Therefore, any errors in the calculation of the latter will adversely impact the former metrics too. - Mainly suited for binary classification models and may require modifications for effective application in multi-class scenarios. - The metrics used are threshold-dependent and may exhibit high variability based on the chosen cut-off points. - The test does not incorporate a method to efficiently handle missing or inefficiently processed data, which could lead to inaccuracies in the metrics if the data is not appropriately preprocessed.",validmind.model_validation.statsmodels.GINITable
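
Two of the descriptions in this listing state their formulas in words: `TrainingTestDegradation` divides the train/test score difference by the train score and compares it with a 10% threshold, and `GINITable` derives the GINI coefficient by doubling the AUC and subtracting 1. A minimal sketch of that arithmetic follows. The function names are hypothetical and the numbers are illustrative; this is not ValidMind's implementation.

```python
def degradation(train_score, test_score):
    """Relative drop from training to test: (train - test) / train."""
    return (train_score - test_score) / train_score


def passes_degradation(train_score, test_score, max_threshold=0.10):
    """Pass when degradation is below the threshold (10% per the description)."""
    return degradation(train_score, test_score) < max_threshold


def gini_from_auc(auc):
    """GINI coefficient as described for GINITable: 2 * AUC - 1."""
    return 2 * auc - 1


# Illustrative numbers only; they are not results from any real model.
print(degradation(0.90, 0.85), passes_degradation(0.90, 0.85))
print(degradation(0.90, 0.72), passes_degradation(0.90, 0.72))
print(gini_from_auc(0.75))
```

An AUC of 0.5, the line of randomness mentioned in the ROC Curve description, maps to a GINI coefficient of 0.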


The `In [10]` call uses the `tags` parameter, which searches tests by their tags; here it requests tests associated with 'model_performance' and 'visualization'.

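To work with a specific set of tests programmatically, you can store the results in a variable. Passing `pretty=False` makes `list_tests` return a plain list of test IDs rather than the table shown earlier. For instance, let's list all regression tests and store them in `regression_tests` for further use.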

In [11]:
regression_tests = vt.list_tests(task="regression", pretty=False)
regression_tests

['validmind.model_validation.ModelMetadata',
 'validmind.model_validation.statsmodels.RegressionModelsCoeffs',
 'validmind.model_validation.statsmodels.BoxPierce',
 'validmind.model_validation.statsmodels.RegressionCoeffsPlot',
 'validmind.model_validation.statsmodels.RegressionModelSensitivityPlot',
 'validmind.model_validation.statsmodels.RegressionModelsPerformance',
 'validmind.model_validation.statsmodels.ZivotAndrewsArch',
 'validmind.model_validation.statsmodels.RegressionModelOutsampleComparison',
 'validmind.model_validation.statsmodels.RegressionModelForecastPlotLevels',
 'validmind.model_validation.statsmodels.FeatureImportanceAndSignificance',
 'validmind.model_validation.statsmodels.LJungBox',
 'validmind.model_validation.statsmodels.JarqueBera',
 'validmind.model_validation.statsmodels.PhillipsPerronArch',
 'validmind.model_validation.statsmodels.KolmogorovSmirnov',
 'validmind.model_validation.statsmodels.ResidualsVisualInspection',
 'validmind.model_validation.statsmode
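
Because the result is a list of plain strings, ordinary Python is enough to slice it. The sketch below groups IDs by the dotted path that precedes the test name. The sample list holds a few IDs copied from the output above, and `group_by_module` is a hypothetical helper, not part of ValidMind.

```python
from collections import defaultdict

# A few IDs copied from the `regression_tests` output above.
sample_ids = [
    "validmind.model_validation.ModelMetadata",
    "validmind.model_validation.statsmodels.BoxPierce",
    "validmind.model_validation.statsmodels.JarqueBera",
    "validmind.model_validation.statsmodels.KolmogorovSmirnov",
]


def group_by_module(test_ids):
    """Group test names by everything before the final dot (hypothetical helper)."""
    groups = defaultdict(list)
    for test_id in test_ids:
        # rpartition splits on the last dot: module path, separator, test name.
        module, _, name = test_id.rpartition(".")
        groups[module].append(name)
    return dict(groups)


grouped = group_by_module(sample_ids)
for module, names in grouped.items():
    print(f"{module}: {names}")
```

In a live session the same helper could be applied to `regression_tests` directly.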

## Delving into Test Details with `describe_test`

After identifying a set of potential tests, you might want to explore the specifics of an individual test. The `describe_test` function shows a test's ID, name, description, test type, required inputs, and parameters. Below, we describe a test using its ID:

In [12]:
vt.describe_test(test_id="validmind.model_validation.sklearn.ConfusionMatrix")

Unnamed: 0,Unnamed: 1
ID:,validmind.model_validation.sklearn.ConfusionMatrix
Name:,Confusion Matrix
Description:,"**Purpose**: The Confusion Matrix tester is designed to assess the performance of a classification Machine Learning model. This performance is evaluated based on how well the model is able to correctly classify True Positives, True Negatives, False Positives, and False Negatives - fundamental aspects of model accuracy. **Test Mechanism**: The mechanism used involves taking the predicted results (`y_test_predict`) from the classification model and comparing them against the actual values (`y_test_true`). A confusion matrix is built using the unique labels extracted from `y_test_true`, employing scikit-learn's metrics. The matrix is then visually rendered with the help of Plotly's `create_annotated_heatmap` function. A heatmap is created which provides a two-dimensional graphical representation of the model's performance, showcasing distributions of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). **Signs of High Risk**: Indicators of high risk related to the model include: - High numbers of False Positives (FP) and False Negatives (FN), depicting that the model is not effectively classifying the values. - Low numbers of True Positives (TP) and True Negatives (TN), implying that the model is struggling with correctly identifying class labels. **Strengths**: The Confusion Matrix tester brings numerous strengths: - It provides a simplified yet comprehensive visual snapshot of the classification model's predictive performance. - It distinctly brings out True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN), thus, making it easier to focus on potential areas of improvement. - The matrix is beneficial in dealing with multi-class classification problems as it can provide a simple view of complex model performances. - It aids in understanding the different types of errors that the model could potentially make, as it provides in-depth insights into Type-I and Type-II errors. 
**Limitations**: Despite its various strengths, the Confusion Matrix tester does exhibit some limitations: - In cases of unbalanced classes, the effectiveness of the confusion matrix might be lessened. It may wrongly interpret the accuracy of a model that is essentially just predicting the majority class. - It does not provide a single unified statistic that could evaluate the overall performance of the model. Different aspects of the model's performance are evaluated separately instead. - It mainly serves as a descriptive tool and does not offer the capability for statistical hypothesis testing. - Risks of misinterpretation exist because the matrix doesn't directly provide precision, recall, or F1-score data. These metrics have to be computed separately."
Test Type:,Metric
Required Inputs:,['model']
Params:,{}
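
The 'Required Inputs' row indicates what a test needs before it can run. As a rough sketch, the check below compares that list against the inputs you have on hand. The `described` dictionary is typed in by hand from the output above, and `missing_inputs` is a hypothetical helper; neither reflects how ValidMind stores or returns this information.

```python
# Fields copied by hand from the describe_test output above (assumed layout).
described = {
    "ID": "validmind.model_validation.sklearn.ConfusionMatrix",
    "Test Type": "Metric",
    "Required Inputs": ["model"],
    "Params": {},
}


def missing_inputs(test_info, available):
    """Return the required inputs absent from `available` (hypothetical helper)."""
    return [name for name in test_info["Required Inputs"] if name not in available]


# Placeholder objects stand in for a real model and dataset.
print(missing_inputs(described, {"dataset": object()}))
print(missing_inputs(described, {"model": object()}))
```

An empty result means every input listed for the test is available.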


## Conclusion

With `list_tests` and `describe_test`, you can find the tests in the ValidMind suite that fit your model and dataset validation requirements and inspect what each one does and needs. We hope this walkthrough proves useful and encourages you to explore the rest of the ValidMind API.