# Comprehensive Guide: ValidMind Plots and Statistics Tests

This notebook demonstrates all the available tests from the `validmind.plots` and `validmind.stats` modules. These generalized tests provide powerful visualization and statistical analysis capabilities for any dataset.

## What You'll Learn

In this notebook, we'll explore:

1. **Plotting Tests**: Visual analysis tools for data exploration
   - CorrelationHeatmap
   - HistogramPlot
   - BoxPlot
   - ViolinPlot

2. **Statistical Tests**: Comprehensive statistical analysis tools
   - DescriptiveStats
   - CorrelationAnalysis
   - NormalityTests
   - OutlierDetection

Each test is highly configurable and can be adapted to different datasets and use cases.


## About ValidMind

ValidMind is a suite of tools for managing model risk, including risk associated with AI and statistical models. You use the ValidMind Library to automate documentation and validation tests, and then use the ValidMind Platform to collaborate on model documentation.


## Setting up

### Install the ValidMind Library


In [None]:
%pip install -q validmind


### Initialize the ValidMind Library

For this demonstration, we'll initialize ValidMind with the API credentials and model identifier for your model.


In [None]:
# Load your model identifier credentials from an `.env` file

%load_ext dotenv
%dotenv .env

# Or replace with your code snippet

import validmind as vm

# Note: You need valid API credentials for this to work
# If you don't have credentials, use the standalone script: test_outlier_detection_standalone.py

vm.init(
    api_host="...",
    api_key="...",
    api_secret="...",
    model="...",
)

## Import and Prepare Sample Dataset

We'll use the Bank Customer Churn dataset as our example data for demonstrating all the tests.


In [None]:
from validmind.datasets.classification import customer_churn

print(
    f"Loaded demo dataset with: \n\n\t• Target column: '{customer_churn.target_column}' \n\t• Class labels: {customer_churn.class_labels}"
)

# Load and preprocess the data
raw_df = customer_churn.load_data()
train_df, validation_df, test_df = customer_churn.preprocess(raw_df)

print("\nDataset shapes:")
print(f"• Training: {train_df.shape}")
print(f"• Validation: {validation_df.shape}")
print(f"• Test: {test_df.shape}")

raw_df.head()


### Initialize ValidMind Datasets

Initialize ValidMind dataset objects for our analysis:


In [None]:
# Initialize datasets for ValidMind
vm_raw_dataset = vm.init_dataset(
    dataset=raw_df,
    input_id="raw_dataset",
    target_column=customer_churn.target_column,
    class_labels=customer_churn.class_labels,
)

vm_train_ds = vm.init_dataset(
    dataset=train_df,
    input_id="train_dataset",
    target_column=customer_churn.target_column,
)

print("✅ ValidMind datasets initialized successfully!")


### Explore Dataset Structure

Let's examine our dataset to understand what columns are available for analysis:


In [None]:
print("📊 Dataset Information:")
print(f"\nAll columns ({len(vm_train_ds.df.columns)}):")
print(list(vm_train_ds.df.columns))

print(f"\nNumerical columns ({len(vm_train_ds.feature_columns_numeric)}):")
print(vm_train_ds.feature_columns_numeric)

# Fall back to an empty list if the dataset object does not expose categorical columns
categorical_columns = getattr(vm_train_ds, "feature_columns_categorical", [])
print(f"\nCategorical columns ({len(categorical_columns)}):")
print(categorical_columns or "None detected")

print(f"\nTarget column: {vm_train_ds.target_column}")


# Part 1: Plotting Tests

The ValidMind plotting tests provide powerful visualization capabilities for data exploration and analysis. All plots are interactive and built with Plotly.

## 1.  Correlation Heatmap

Visualizes correlations between numerical features using a heatmap. Useful for identifying multicollinearity and feature relationships.


In [None]:
# Basic correlation heatmap
vm.tests.run_test(
    "validmind.plots.CorrelationHeatmap",
    inputs={"dataset": vm_train_ds},
    params={
        "method": "pearson",
        "show_values": True,
        "colorscale": "RdBu",
        "mask_upper": False,
        "threshold": None,
        "width": 800,
        "height": 600,
        "title": "Feature Correlation Heatmap"
    }
)


In [None]:
# Advanced correlation heatmap with custom settings
vm.tests.run_test(
    "validmind.plots.CorrelationHeatmap",
    inputs={"dataset": vm_train_ds},
    params={
        "method": "spearman",  # Different correlation method
        "show_values": True,
        "colorscale": "Viridis",
        "mask_upper": True,  # Mask upper triangle
        "threshold": 0.3,  # Assumed to hide correlations with |r| below 0.3, matching the title
        "width": 900,
        "height": 700,
        "title": "Spearman Correlation (|r| > 0.3)",
        "columns": ["CreditScore", "Age", "Balance", "EstimatedSalary"]  # Specific columns
    }
)
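
To make the `method` parameter more concrete, here is a minimal standard-library sketch of how Pearson and Spearman correlation differ. It is not ValidMind's implementation, and the helper names (`pearson`, `ranks`, `spearman`) and the toy data are assumptions made for illustration only.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy data: y grows monotonically, but not linearly, with x
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```

Because Spearman only looks at the ordering of the values, a perfectly monotonic relationship scores as a perfect correlation even when it is far from linear, which is why it can be the better choice for skewed features.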


## 2.  Histogram Plot

Creates histogram distributions for numerical features with optional KDE overlay. Essential for understanding data distributions.


In [None]:
# Basic histogram with KDE
vm.tests.run_test(
    "validmind.plots.HistogramPlot",
    inputs={"dataset": vm_train_ds},
    params={
        "columns": ["CreditScore", "Balance", "EstimatedSalary", "Age"],
        "bins": 30,
        "color": "steelblue",
        "opacity": 0.7,
        "show_kde": True,
        "normalize": False,
        "log_scale": False,
        "width": 1200,
        "height": 800,
        "n_cols": 2,
        "vertical_spacing": 0.15,
        "horizontal_spacing": 0.15,
        "title_prefix": "Distribution of"
    }
)
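
If you're curious what `bins` and `show_kde` roughly correspond to, the sketch below counts values in equal-width bins and builds a Gaussian kernel density estimate using only the standard library. The function names, the rule-of-thumb bandwidth, and the sample ages are assumptions for illustration; the ValidMind test may compute both differently.

```python
from math import exp, pi, sqrt
from statistics import stdev


def histogram_counts(values, bins):
    """Count values in `bins` equal-width intervals spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value goes into the last bin instead of a new one
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


def gaussian_kde(values, bandwidth=None):
    """Return a density function made of one Gaussian kernel per observation."""
    n = len(values)
    if bandwidth is None:
        # Normal-reference rule of thumb; an assumption made for this sketch
        bandwidth = 1.06 * stdev(values) * n ** (-1 / 5)
    norm = 1 / (n * bandwidth * sqrt(2 * pi))

    def density(x):
        return norm * sum(exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)

    return density


# Hypothetical ages
ages = [22, 25, 27, 31, 34, 35, 38, 41, 45, 52, 58, 63]
counts = histogram_counts(ages, bins=4)
kde = gaussian_kde(ages)
print("Bin counts:", counts)
print(f"Estimated density at age 35: {kde(35):.4f}")
```

The histogram depends on where the bin edges fall, while the KDE curve smooths over them, so viewing both together gives a more reliable picture of the distribution's shape.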


## 3.  Box Plot

Displays box plots for numerical features, optionally grouped by a categorical variable. Excellent for outlier detection and comparing distributions.


In [None]:
# Box plots grouped by target variable
vm.tests.run_test(
    "validmind.plots.BoxPlot",
    inputs={"dataset": vm_train_ds},
    params={
        "columns": ["CreditScore", "Balance", "Age"],
        "group_by": "Exited",  # Group by churn status
        "colors": ["lightblue", "salmon"],
        "show_outliers": True,
        "width": 1200,
        "height": 600
    }
)
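
The numbers behind a grouped box plot can be sketched in a few lines. The example below splits hypothetical `(Age, Exited)` rows by group and computes the quartiles plus the points that fall outside the conventional 1.5 × IQR fences. The `box_stats` helper and the data are illustrative assumptions, and Plotly's own quartile method may differ slightly from the one used here.

```python
from statistics import quantiles


def box_stats(values, whisker=1.5):
    """Quartiles plus the points that lie beyond the whisker fences."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "outliers": [v for v in values if v < lower_fence or v > upper_fence],
    }


# Hypothetical rows: (Age, Exited)
rows = [(25, 0), (31, 0), (34, 0), (36, 0), (40, 0), (79, 0),
        (41, 1), (45, 1), (48, 1), (52, 1), (55, 1)]

# Collect the ages for each value of the grouping column
groups = {}
for age, exited in rows:
    groups.setdefault(exited, []).append(age)

summary = {label: box_stats(ages) for label, ages in groups.items()}
for label, stats in summary.items():
    print(f"Exited={label}: {stats}")
```

Comparing the medians and interquartile ranges across groups is what the side-by-side boxes let you do visually.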


## 4.  Violin Plot

Creates violin plots that combine box plots with kernel density estimation. Shows both summary statistics and distribution shape.


In [None]:
# Violin plots grouped by target variable
vm.tests.run_test(
    "validmind.plots.ViolinPlot",
    inputs={"dataset": vm_train_ds},
    params={
        "columns": ["Age", "Balance"],  # Focus on key variables
        "group_by": "Exited",
        "width": 800,
        "height": 600
    }
)


# Part 2: Statistical Tests

The ValidMind statistical tests provide comprehensive statistical analysis capabilities for understanding data characteristics and quality.

## 1.  Descriptive Statistics

Provides comprehensive descriptive statistics including basic statistics, distribution measures, confidence intervals, and normality tests.


In [None]:
# Advanced descriptive statistics with all measures
vm.tests.run_test(
    "validmind.stats.DescriptiveStats",
    inputs={"dataset": vm_train_ds},
    params={
        "include_advanced": True,  # Include skewness, kurtosis, normality tests, etc.
        "confidence_level": 0.99,  # 99% confidence intervals
        "columns": ["CreditScore", "Balance", "EstimatedSalary", "Age"]  # Specific columns
    }
)
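
As a rough guide to what the advanced measures represent, this standard-library sketch computes skewness, excess kurtosis, and a confidence interval for the mean. The `describe` helper, the normal-approximation interval, and the sample balances are assumptions for illustration; the ValidMind test may use different estimators (for example, a t-based interval).

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def describe(values, confidence_level=0.99):
    """Basic and advanced summary measures for one numeric column."""
    n = len(values)
    mu, sd = mean(values), stdev(values)
    # Central moments (population form) feed skewness and kurtosis
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    m4 = sum((v - mu) ** 4 for v in values) / n
    # Normal-approximation interval for the mean; a t-interval is wider for small n
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    half_width = z * sd / sqrt(n)
    return {
        "mean": mu,
        "std": sd,
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3,
        "ci": (mu - half_width, mu + half_width),
    }


# Hypothetical right-skewed balances
balances = [0, 0, 500, 1200, 1500, 1800, 2500, 4000, 9000, 25000]
stats_99 = describe(balances, confidence_level=0.99)
stats_95 = describe(balances, confidence_level=0.95)
print(f"Skewness: {stats_99['skewness']:.2f}")
print(f"99% CI for the mean: {stats_99['ci']}")
print(f"95% CI for the mean: {stats_95['ci']}")
```

Raising `confidence_level` widens the interval: being more certain that the interval covers the true mean costs precision.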


## 2.  Correlation Analysis

Performs detailed correlation analysis with statistical significance testing and identifies highly correlated feature pairs.


In [None]:
# Correlation analysis with significance testing
result = vm.tests.run_test(
    "validmind.stats.CorrelationAnalysis",
    inputs={"dataset": vm_train_ds},
    params={
        "method": "pearson",  # or "spearman", "kendall"
        "significance_level": 0.05,
        "min_correlation": 0.1  # Minimum correlation threshold
    }
)
# Send the result to the ValidMind Platform
result.log()


## 3.  Normality Tests

Performs various normality tests to assess whether features follow a normal distribution.


In [None]:
# Comprehensive normality testing
vm.tests.run_test(
    "validmind.stats.NormalityTests",
    inputs={"dataset": vm_train_ds},
    params={
        "tests": ["shapiro", "anderson", "kstest"],  # Multiple tests
        "alpha": 0.05,
        "columns": ["CreditScore", "Balance", "Age"]  # Focus on key features
    }
)
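
Of the tests named above, the Kolmogorov-Smirnov statistic is the easiest to sketch without extra libraries. The example below measures the largest gap between a sample's empirical CDF and a normal distribution fitted to that sample. The helper name, the synthetic data, and the asymptotic cutoff are assumptions for illustration. Because the mean and standard deviation are estimated from the same data, the cutoff is conservative, so treat this as intuition rather than a replacement for the ValidMind test.

```python
from math import log, sqrt
from statistics import NormalDist, mean, stdev


def ks_statistic_vs_normal(values):
    """Largest gap between the empirical CDF and a normal CDF fitted to the sample."""
    n = len(values)
    fitted = NormalDist(mean(values), stdev(values))
    d = 0.0
    for i, v in enumerate(sorted(values)):
        cdf = fitted.cdf(v)
        # The empirical CDF jumps from i/n to (i+1)/n at v, so check both sides
        d = max(d, cdf - i / n, (i + 1) / n - cdf)
    return d


# Synthetic samples built from evenly spaced quantiles (no randomness involved)
n = 200
probs = [(i + 0.5) / n for i in range(n)]
normal_like = [NormalDist(650, 95).inv_cdf(p) for p in probs]  # bell-shaped
skewed = [-log(1 - p) for p in probs]                          # exponential-shaped

# Asymptotic 5% critical value for a fully specified reference distribution
critical = 1.36 / sqrt(n)

d_normal = ks_statistic_vs_normal(normal_like)
d_skewed = ks_statistic_vs_normal(skewed)
print(f"Critical value:  {critical:.4f}")
print(f"Bell-shaped D:   {d_normal:.4f}")
print(f"Skewed D:        {d_skewed:.4f}")
```

A statistic above the critical value suggests rejecting normality at that significance level, which is the same decision the `alpha` parameter controls in the test above.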


## 4.  Outlier Detection

Identifies outliers using various statistical methods including IQR, Z-score, and Isolation Forest.


In [None]:
# Comprehensive outlier detection with multiple methods
vm.tests.run_test(
    "validmind.stats.OutlierDetection",
    inputs={"dataset": vm_train_ds},
    params={
        "methods": ["iqr", "zscore", "isolation_forest"],
        "iqr_threshold": 1.5,
        "zscore_threshold": 3.0,
        "contamination": 0.1,
        "columns": ["CreditScore", "Balance", "EstimatedSalary"]
    }
)
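
The IQR and Z-score methods can disagree, and a small sketch shows why. The helpers and balances below are hypothetical, and the use of the population standard deviation is an assumption; ValidMind's implementation may differ. Isolation Forest is left out because it needs a machine learning library.

```python
from statistics import mean, pstdev, quantiles


def iqr_outliers(values, threshold=1.5):
    """Flag values more than `threshold` IQRs outside the middle 50%."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - threshold * iqr, q3 + threshold * iqr
    return [v for v in values if v < lower or v > upper]


def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical balances with one extreme value
balances = [1200, 1350, 1400, 1500, 1550, 1600, 1700, 1750, 1800, 9500]

iqr_flags = iqr_outliers(balances)
z_flags = zscore_outliers(balances)
print("IQR method flags:    ", iqr_flags)
print("Z-score method flags:", z_flags)
```

In this small sample the extreme balance inflates the standard deviation enough that its own Z-score stays below the threshold, while the quartile-based fences are barely affected. That is one reason to run several methods and compare their results.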


# Part 3: Complete EDA Workflow Example

Let's demonstrate an exploratory data analysis workflow that chains several of the tests together, using mostly default parameters:


In [None]:
# Example: EDA workflow combining descriptive statistics, histograms, correlations, and outlier detection
print("🔍 Complete Exploratory Data Analysis Workflow")
print("=" * 50)

# 1. Start with descriptive statistics
print("\n1. Descriptive Statistics:")
desc_stats = vm.tests.run_test(
    "validmind.stats.DescriptiveStats",
    inputs={"dataset": vm_train_ds},
    params={"include_advanced": True}
)

print("\n2. Distribution Analysis:")
# 2. Visualize distributions
hist_plot = vm.tests.run_test(
    "validmind.plots.HistogramPlot",
    inputs={"dataset": vm_train_ds},
    params={"show_kde": True, "n_cols": 3}
)

print("\n3. Correlation Analysis:")
# 3. Check correlations
corr_heatmap = vm.tests.run_test(
    "validmind.plots.CorrelationHeatmap",
    inputs={"dataset": vm_train_ds}
)

print("\n4. Outlier Detection:")
# 4. Detect outliers
outliers = vm.tests.run_test(
    "validmind.stats.OutlierDetection",
    inputs={"dataset": vm_train_ds},
    params={"methods": ["iqr", "zscore"]}
)

print("\n✅ EDA Complete! Check the visualizations and tables above for insights.")


# Conclusion

This notebook demonstrated the plotting and statistical tests in the `validmind.plots` and `validmind.stats` modules:

## Plotting Tests Covered:
✅ **CorrelationHeatmap** - Interactive correlation matrices  
✅ **HistogramPlot** - Distribution analysis with KDE  
✅ **BoxPlot** - Outlier detection and group comparisons  
✅ **ViolinPlot** - Distribution shape analysis  

## Statistical Tests Covered:
✅ **DescriptiveStats** - Comprehensive statistical profiling  
✅ **CorrelationAnalysis** - Formal correlation testing  
✅ **NormalityTests** - Distribution assumption checking  
✅ **OutlierDetection** - Multi-method outlier identification  

## Key Benefits:
- **Highly Customizable**: All tests offer extensive parameter options
- **Interactive Visualizations**: Plotly-based plots with zoom, pan, hover
- **Statistical Rigor**: Formal testing with significance levels
- **Flexible Input**: Works with any ValidMind dataset
- **Comprehensive Output**: Tables, plots, and statistical summaries

## Best Practices:

### When to Use Each Test:

**Plotting Tests:**
- **CorrelationHeatmap**: Initial data exploration, multicollinearity detection
- **HistogramPlot**: Understanding feature distributions, identifying skewness
- **BoxPlot**: Outlier detection, comparing groups
- **ViolinPlot**: Detailed distribution analysis, especially for grouped data

**Statistical Tests:**
- **DescriptiveStats**: Comprehensive data profiling, baseline statistics
- **CorrelationAnalysis**: Formal correlation testing with significance
- **NormalityTests**: Model assumption checking
- **OutlierDetection**: Data quality assessment, preprocessing decisions

## Next Steps:
- Integrate these tests into your model documentation templates
- Customize parameters based on your specific data characteristics
- Use results to inform preprocessing and modeling decisions
- Combine with ValidMind's model validation tests for complete analysis

These tests provide a solid foundation for exploratory data analysis, data quality assessment, and statistical validation in any data science workflow.
