# Robustness evaluation for calibration

Evaluating the robustness of your calibration checks whether your weights generalize to related targets that were not explicitly used during calibration. This notebook demonstrates how to use the robustness evaluation feature to assess and improve calibration stability.

## What is robustness evaluation

Robustness evaluation uses holdout validation to test how well the calibration performs on unseen targets. The process:
1. Randomly holds out a subset of targets
2. Calibrates on the remaining targets
3. Evaluates performance on the held-out targets
4. Repeats multiple times to assess consistency

This helps identify whether your calibration is overfitting to specific targets or whether it generalizes to targets it has not seen.
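
The four steps above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the library's implementation: `fit_weights` is a hypothetical stand-in for the calibration step (simple raking over 0/1 indicator targets), and `fraction_within`, `holdout_rounds`, the toy data, and the 10% tolerance are all chosen for the example.

```python
import numpy as np

def fit_weights(X, y, n_iter=50):
    """Hypothetical stand-in for calibration: raking over 0/1 indicator targets."""
    w = np.ones(X.shape[0])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            members = X[:, j] > 0
            # Rescale the records in target j so their weighted count matches y[j]
            w[members] *= y[j] / (X[:, j] @ w)
    return w

def fraction_within(X, w, y, tol=0.10):
    """Share of targets whose weighted estimate is within tol of the target."""
    rel_err = np.abs(X.T @ w - y) / np.abs(y)
    return float(np.mean(rel_err <= tol))

def holdout_rounds(X, y, n_rounds=5, holdout_fraction=0.3, seed=0):
    rng = np.random.default_rng(seed)
    n_targets = X.shape[1]
    n_hold = max(1, int(round(holdout_fraction * n_targets)))
    results = []
    for _ in range(n_rounds):                                        # step 4: repeat
        hold = rng.choice(n_targets, size=n_hold, replace=False)     # step 1: hold out
        train = np.setdiff1d(np.arange(n_targets), hold)
        w = fit_weights(X[:, train], y[train])                       # step 2: calibrate
        results.append({
            "holdout": hold,
            "train_acc": fraction_within(X[:, train], w, y[train]),
            "holdout_acc": fraction_within(X[:, hold], w, y[hold]),  # step 3: evaluate
        })
    return results

# Toy data: two categorical variables plus an overall count (6 targets)
rng = np.random.default_rng(42)
a = rng.integers(0, 3, 300)
b = rng.integers(0, 2, 300)
X = np.column_stack(
    [a == 0, a == 1, a == 2, b == 0, b == 1, np.ones(300, dtype=bool)]
).astype(float)
true_w = np.where(a == 0, 2.0, 1.0) * np.where(b == 1, 1.5, 1.0)
y = X.T @ true_w  # targets that are consistent with one set of weights

results = holdout_rounds(X, y)
for r in results:
    print(r["holdout"], r["train_acc"], r["holdout_acc"])
```

Comparing `train_acc` with `holdout_acc` across rounds gives the generalization gap discussed later in this notebook.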

In [24]:
from microcalibrate import Calibration
import numpy as np
import pandas as pd
import logging

calibration_logger = logging.getLogger("microcalibrate.calibration")
calibration_logger.setLevel(logging.WARNING)

np.random.seed(42)

## Creating a test dataset

We'll create a dataset with correlated targets to demonstrate the robustness evaluation. Some targets will be combinations of others, making them easier to predict from partial information.

In [25]:
# Create synthetic data with structure
n_samples = 5000

# Demographics
age = np.random.randint(18, 80, n_samples)
gender = np.random.choice(['M', 'F'], n_samples)
state = np.random.choice(['CA', 'NY', 'TX', 'FL'], n_samples, p=[0.35, 0.25, 0.25, 0.15])

# Income (correlated with age and state)
base_income = 30000
state_multiplier = {'CA': 1.3, 'NY': 1.2, 'TX': 1.0, 'FL': 0.95}
income = (base_income + (age - 18) * 800).astype(float)  # Ensure float type
for s in ['CA', 'NY', 'TX', 'FL']:
    mask = state == s
    income[mask] *= state_multiplier[s]
income += np.random.normal(0, 10000, n_samples)
income = np.maximum(income, 15000)

# Employment (correlated with age)
emp_prob = 0.8 - np.maximum(0, (age - 60) / 20)
emp_prob = np.clip(emp_prob, 0, 1)  # Ensure valid probability range
employed = np.random.binomial(1, emp_prob)

# Create estimate matrix with various overlapping targets
estimate_matrix = pd.DataFrame()

# State-level targets
for s in ['CA', 'NY', 'TX', 'FL']:
    mask = state == s
    estimate_matrix[f'pop_{s}'] = mask.astype(float)
    estimate_matrix[f'income_{s}'] = mask * income
    estimate_matrix[f'employed_{s}'] = mask * employed

# Gender targets
for g in ['M', 'F']:
    mask = gender == g
    estimate_matrix[f'pop_{g}'] = mask.astype(float)
    estimate_matrix[f'income_{g}'] = mask * income

# Age group targets
age_groups = pd.cut(age, bins=[0, 35, 50, 65, 100], labels=['18-35', '36-50', '51-65', '65+'])
for ag in age_groups.unique():
    mask = age_groups == ag
    estimate_matrix[f'pop_age_{ag}'] = mask.astype(float)

# Overall targets
estimate_matrix['total_population'] = 1.0
estimate_matrix['total_income'] = income
estimate_matrix['total_employed'] = employed

# Create realistic targets
true_totals = estimate_matrix.sum().values
# Add some noise to make calibration non-trivial
noise = np.random.normal(1.0, 0.03, len(true_totals))
targets = true_totals * noise

print(f"Dataset: {n_samples} records")
print(f"Number of targets: {len(targets)}")
print(f"Target categories:")
print(f"  - State-level: 12 targets (4 states × 3 metrics)")
print(f"  - Gender: 4 targets (2 genders × 2 metrics)")
print(f"  - Age groups: 4 targets")
print(f"  - Overall: 3 targets")
print(f"\nNote: Many targets overlap (e.g., total_income = sum of state incomes)")

Dataset: 5000 records
Number of targets: 23
Target categories:
  - State-level: 12 targets (4 states × 3 metrics)
  - Gender: 4 targets (2 genders × 2 metrics)
  - Age groups: 4 targets
  - Overall: 3 targets

Note: Many targets overlap (e.g., total_income = sum of state incomes)
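
The overlap noted in the output can be checked directly: the per-state income columns partition the records, so they add up to the overall income column row by row. The sketch below uses a made-up toy frame whose column names mirror the ones above. It is an illustration only and does not touch the notebook's `estimate_matrix`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
states = ["CA", "NY", "TX", "FL"]
state = rng.choice(states, size=100)
income = rng.uniform(15_000, 90_000, size=100)

# One income column per state, plus the overall income column
toy = pd.DataFrame({f"income_{s}": (state == s) * income for s in states})
toy["total_income"] = income

state_cols = [c for c in toy.columns if c.startswith("income_")]
overlap_ok = np.allclose(toy[state_cols].sum(axis=1), toy["total_income"])

# A linearly dependent column lowers the rank below the column count
rank = np.linalg.matrix_rank(toy.to_numpy())
print(overlap_ok, rank, toy.shape[1])
```

A rank below the number of columns means at least one target carries no information beyond the others, which matters when interpreting holdout results.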


## Basic robustness evaluation

Let's evaluate the robustness of a standard calibration without regularization.

In [26]:
# Initialize calibration
weights_init = np.ones(n_samples)

cal = Calibration(
    weights=weights_init,
    targets=targets,
    estimate_matrix=estimate_matrix,
    epochs=200,
    learning_rate=1e-3,
)

print("Evaluating calibration robustness...")
print("This will perform multiple rounds of holdout validation.\n")

# Evaluate robustness
robustness_results = cal.evaluate_holdout_robustness(
    n_holdout_sets=10,    # Number of random holdout sets to test
    holdout_fraction=0.3,   # Hold out 30% of targets each round
)

print("\n" + "="*60)
print("Robustness evaluation complete!")
print("="*60)



Evaluating calibration robustness...
This will perform multiple rounds of holdout validation.



Reweighting progress: 100%|██████████| 200/200 [00:00<00:00, 1703.46epoch/s, loss=16.4, weights_mean=5.08, weights_std=2.41, weights_min=0.84] 
Reweighting progress: 100%|██████████| 200/200 [00:00<00:00, 1172.65epoch/s, loss=15.7, weights_mean=4.98, weights_std=2.41, weights_min=0.84] 
Reweighting progress: 100%|██████████| 200/200 [00:00<00:00, 1848.77epoch/s, loss=16.5, weights_mean=5.02, weights_std=2.41, weights_min=0.839]
Reweighting progress: 100%|██████████| 200/200 [00:00<00:00, 1536.10epoch/s, loss=16.3, weights_mean=5.03, weights_std=2.44, weights_min=0.844]
Reweighting progress: 100%|██████████| 200/200 [00:00<00:00, 2512.73epoch/s, loss=16.4, weights_mean=5.01, weights_std=2.42, weights_min=0.839]
Reweighting progress: 100%|██████████| 200/200 [00:00<00:00, 2495.60epoch/s, loss=16.1, weights_mean=5.03, weights_std=2.42, weights_min=0.84]
Reweighting progress: 100%|██████████| 200/200 [00:00<00:00, 2466.74epoch/s, loss=16.1, weights_mean=5.06, weights_std=2.42, weights_min=


Robustness evaluation complete!


## Analyzing robustness results

In [27]:
# Display overall metrics
metrics = robustness_results['overall_metrics']
print("Overall robustness metrics:")
print(f"  Average holdout accuracy: {metrics['mean_holdout_accuracy']:.1%}")
print(f"  Std dev of accuracies: {metrics['std_holdout_accuracy']:.1%}")
print(f"  Worst holdout accuracy: {metrics['worst_holdout_accuracy']:.1%}")
print(f"  Best holdout accuracy: {metrics['best_holdout_accuracy']:.1%}")
print()
print(f"Generalization gap:")
print(f"  Average training accuracy: {metrics['mean_train_accuracy']:.1%}")
print(f"  Average holdout accuracy: {metrics['mean_holdout_accuracy']:.1%}")
print(f"  Gap: {metrics['mean_train_accuracy'] - metrics['mean_holdout_accuracy']:.1%}")
print()
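# Calculate consistency score (1 - coefficient of variation).
# Caveat: if every round scores 0%, the std dev is also 0 and the score is 1.00,
# so read it together with the mean holdout accuracy.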
consistency_score = 1 - (metrics['std_holdout_accuracy'] / max(metrics['mean_holdout_accuracy'], 0.01))
print(f"Consistency score: {consistency_score:.2f}/1.00")
print(f"  (Higher is better - measures stability across rounds)")

Overall robustness metrics:
  Average holdout accuracy: 0.0%
  Std dev of accuracies: 0.0%
  Worst holdout accuracy: 0.0%
  Best holdout accuracy: 0.0%

Generalization gap:
  Average training accuracy: 0.0%
  Average holdout accuracy: 0.0%
  Gap: 0.0%

Consistency score: 1.00/1.00
  (Higher is better - measures stability across rounds)
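
The consistency score is 1 minus the coefficient of variation of holdout accuracy. When every round scores 0%, the standard deviation is also 0 and the formula returns 1.00, as in the output above, even though the calibration is not performing well. A minimal sketch of a guarded variant follows. `consistency_score` and its `min_mean` threshold are hypothetical helpers for illustration, not part of the library.

```python
import math

def consistency_score(mean_acc, std_acc, min_mean=0.01):
    """Return 1 - CV, or NaN when the mean accuracy is too low for CV to be meaningful."""
    if mean_acc < min_mean:
        return math.nan
    return 1.0 - std_acc / mean_acc

print(consistency_score(0.0, 0.0))    # degenerate case: nan instead of a misleading 1.0
print(consistency_score(0.8, 0.08))   # stable, accurate rounds score close to 1
```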


In [28]:
# Show target-level difficulty
target_robustness = robustness_results['target_robustness']

# Sort by accuracy (lower accuracy = higher difficulty)
target_robustness = target_robustness.sort_values('holdout_accuracy_rate', ascending=True)

print("\nMost difficult targets to predict (when held out):")
print(target_robustness.head(10)[['target_name', 'holdout_accuracy_rate', 'times_held_out']].to_string(index=False))

print("\nEasiest targets to predict (when held out):")
print(target_robustness.tail(5)[['target_name', 'holdout_accuracy_rate', 'times_held_out']].to_string(index=False))


Most difficult targets to predict (when held out):
target_name  holdout_accuracy_rate  times_held_out
     pop_CA                    0.0               2
  income_CA                    0.0               3
employed_CA                    0.0               2
  income_NY                    0.0               1
employed_NY                    0.0               3
     pop_TX                    0.0               3
  income_TX                    0.0               2
employed_TX                    0.0               1
     pop_FL                    0.0               6
  income_FL                    0.0               3

Easiest targets to predict (when held out):
     target_name  holdout_accuracy_rate  times_held_out
   pop_age_36-50                    0.0               2
   pop_age_18-35                    0.0               1
total_population                    0.0               1
    total_income                    0.0               3
  total_employed                    0.0               3


## Understanding the recommendation

The robustness evaluation provides actionable recommendations based on the results.

In [29]:
print("ROBUSTNESS EVALUATION RECOMMENDATION")
print("="*60)
print(robustness_results['recommendation'])
print("="*60)

# Additional analysis based on results
metrics = robustness_results['overall_metrics']
if metrics['mean_holdout_accuracy'] < 0.8:
    print("\nAdditional suggestions for poor robustness:")
    print("1. Consider using L0 regularization to reduce overfitting")
    print("2. Review targets with highest difficulty scores")
    print("3. Check for data quality issues in difficult targets")
    print("4. Consider removing highly correlated redundant targets")
elif metrics['std_holdout_accuracy'] > 0.1:
    print("\nAdditional suggestions for high variability:")
    print("1. Some target combinations may be inherently difficult")
    print("2. Consider grouping related targets")
    print("3. Increase epochs to ensure convergence")
else:
    print("\nYour calibration shows good robustness!")
    print("Consider saving these settings for production use.")

ROBUSTNESS EVALUATION RECOMMENDATION
❌ POOR ROBUSTNESS: The calibration shows weak generalization.
On average, 0.0% of held-out targets are within 10% of their true values.
 ⚠️ Worst-case scenario: Only 0.0% accuracy in some holdout sets.

📊 Targets with poor holdout performance (<50% accuracy):
  - pop_CA: 0.0% accuracy
  - total_population: 0.0% accuracy
  - pop_age_18-35: 0.0% accuracy
  - pop_age_36-50: 0.0% accuracy
  - pop_age_65+: 0.0% accuracy

💡 RECOMMENDATIONS:
  1. Consider enabling L0 regularization for better generalization
  2. Increase the noise_level parameter to improve robustness
  3. Try increasing dropout_rate to reduce overfitting
  4. Investigate why these targets are hard to predict: pop_CA, total_population, pop_age_18-35
  5. Consider if these targets have sufficient support in the microdata
  6. Generalization gap of 0.1811 suggests some overfitting - consider regularization

Additional suggestions for poor robustness:
1. Consider using L0 regularization to re

## Advanced: Custom holdout strategies

You can implement custom holdout strategies for specific evaluation needs.

In [30]:
# Example: Evaluate robustness by holding out entire target categories
def evaluate_by_category():
    categories = {
        'State targets': [i for i, name in enumerate(estimate_matrix.columns) 
                         if any(s in name for s in ['CA', 'NY', 'TX', 'FL'])],
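        # Caution: the substring '_F' also matches the '_FL' state columns, so this
        # category ends up with 7 targets instead of 4. Matching the suffix with
        # name.endswith(('_M', '_F')) would select only the gender columns.
        'Gender targets': [i for i, name in enumerate(estimate_matrix.columns)
                          if any(g in name for g in ['_M', '_F'])],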
        'Age targets': [i for i, name in enumerate(estimate_matrix.columns) 
                       if 'age' in name],
        'Total targets': [i for i, name in enumerate(estimate_matrix.columns) 
                         if 'total' in name],
    }
    
    results = []
    
    for category, indices in categories.items():
        if len(indices) == 0:
            continue
            
        # Create masks for train and holdout
        train_mask = np.ones(len(targets), dtype=bool)
        train_mask[indices] = False
        
        # Skip if too few training targets remain
        if train_mask.sum() < 3:
            continue
        
        # Calibrate on subset
        cal_temp = Calibration(
            weights=weights_init.copy(),
            targets=targets[train_mask],
            estimate_matrix=estimate_matrix.iloc[:, train_mask],
            epochs=100,
            learning_rate=1e-3,
        )
        
        # Suppress logging for cleaner output (logging is imported in the first cell)
        original_level = logging.getLogger().level
        logging.getLogger().setLevel(logging.WARNING)
        
        try:
            cal_temp.calibrate()
            
            # Evaluate on holdout
            holdout_estimates = (estimate_matrix.iloc[:, indices].T * cal_temp.weights).sum(axis=1).values
            holdout_targets = targets[indices]
            holdout_errors = np.abs((holdout_estimates - holdout_targets) / holdout_targets)
            
            results.append({
                'Category': category,
                'N targets': len(indices),
                'Mean error': f"{np.mean(holdout_errors):.1%}",
                'Max error': f"{np.max(holdout_errors):.1%}",
                'Within 10%': f"{100*np.mean(holdout_errors < 0.1):.0f}%"
            })
        finally:
            # Restore logging level
            logging.getLogger().setLevel(original_level)
    
    return pd.DataFrame(results)

category_results = evaluate_by_category()
print("Robustness by target category:")
print(category_results.to_string(index=False))
print("\nInterpretation:")
print("- Lower errors indicate targets that can be predicted from others")
print("- High errors suggest independent information in those targets")

Reweighting progress: 100%|██████████| 100/100 [00:00<00:00, 2407.43epoch/s, loss=20.1, weights_mean=5.5, weights_std=2.64, weights_min=0.918]
Reweighting progress: 100%|██████████| 100/100 [00:00<00:00, 2500.32epoch/s, loss=20.4, weights_mean=5.5, weights_std=2.62, weights_min=0.919]
Reweighting progress: 100%|██████████| 100/100 [00:00<00:00, 2694.62epoch/s, loss=21, weights_mean=5.56, weights_std=2.67, weights_min=0.918]
Reweighting progress: 100%|██████████| 100/100 [00:00<00:00, 2461.17epoch/s, loss=20, weights_mean=5.43, weights_std=2.68, weights_min=0.917]

Robustness by target category:
      Category  N targets Mean error Max error Within 10%
 State targets         12     449.3%    494.2%         0%
Gender targets          7     445.0%    477.2%         0%
   Age targets          4     450.4%    470.4%         0%
 Total targets          3     422.9%    425.1%         0%

Interpretation:
- Lower errors indicate targets that can be predicted from others
- High errors suggest independent information in those targets





## Best practices for robustness evaluation

### 1. Choose appropriate holdout parameters
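- **Holdout fraction**: 20-30% of targets is a reasonable starting point
- **Number of rounds**: At least 5-10 for reliable estimates
- **Epochs per round**: Enough to converge (check loss curves). In the runs above, the mean weight ends near 5 although the targets were built from unit weights, which is consistent with holdout errors of roughly 450% and suggests the fits had not converged at these settings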
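
One way to reason about these parameters: if each round holds out a random fraction of the targets, a given target is held out about `n_rounds × holdout_fraction` times on average, so its per-target accuracy rests on only a handful of evaluations. The simulation below assumes uniformly random holdout sets, which may differ from how the library draws them. `simulate_times_held_out` is a hypothetical helper.

```python
import numpy as np

def simulate_times_held_out(n_targets, n_rounds, holdout_fraction, seed=0):
    """Count how often each target lands in a uniformly random holdout set."""
    rng = np.random.default_rng(seed)
    n_hold = max(1, int(round(holdout_fraction * n_targets)))
    counts = np.zeros(n_targets, dtype=int)
    for _ in range(n_rounds):
        held = rng.choice(n_targets, size=n_hold, replace=False)
        counts[held] += 1
    return counts

counts = simulate_times_held_out(n_targets=23, n_rounds=10, holdout_fraction=0.3)
print(counts.mean(), counts.min(), counts.max())
```

With 23 targets, 10 rounds, and a 30% fraction, the average is about 3 holdouts per target, in line with the `times_held_out` column shown earlier. Increasing the number of rounds tightens the per-target estimates.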

### 2. Interpret results carefully
- **High variability** across holdout rounds: Indicates unstable calibration
- **Large generalization gap** (training accuracy well above holdout accuracy): Suggests overfitting
- **Low consistency**: Some target combinations are problematic

### 3. Be aware of data leakage
Since many calibration targets share information (e.g., 'total_income' includes all state incomes), holdout validation may give optimistic results. The evaluation includes a warning about this.
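
One way to gauge this leakage before running the evaluation is to check how well a held-out target column can be reconstructed as a linear combination of the remaining columns. The sketch below is an illustration on toy data, not a feature of the library. `explained_share` is a hypothetical helper that reports the share of a column's sum of squares recovered by a least-squares fit on the other columns.

```python
import numpy as np

def explained_share(X, holdout_idx):
    """Share of column holdout_idx (sum of squares) explained by the other columns."""
    target = X[:, holdout_idx]
    others = np.delete(X, holdout_idx, axis=1)
    coef, *_ = np.linalg.lstsq(others, target, rcond=None)
    resid = target - others @ coef
    return 1.0 - np.sum(resid ** 2) / np.sum(target ** 2)

# Toy data: four per-state income columns, total income, and a gender indicator
rng = np.random.default_rng(1)
n = 500
state = rng.integers(0, 4, n)
income = rng.uniform(15_000, 90_000, n)
gender = rng.integers(0, 2, n).astype(float)
X = np.column_stack([(state == s) * income for s in range(4)] + [income, gender])

total_share = explained_share(X, 4)    # total income is the sum of the state columns
gender_share = explained_share(X, 5)   # gender is drawn independently of income and state
print(total_share, gender_share)
```

A share near 1 means the held-out target is almost fully determined by the training targets, so a good holdout score for it says little about generalization.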

### 4. Use results to improve calibration
- Add regularization if overfitting is detected
- Remove or combine highly correlated targets
- Investigate targets with low holdout accuracy
- Consider different optimization parameters

### 5. Document your evaluation
Save robustness results along with your calibration parameters for reproducibility and comparison.
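
A minimal sketch of such a record, assuming the metrics are available as a plain dictionary of numbers (as `overall_metrics` appears to be above). The `save_robustness_record` helper and the file name are hypothetical. The example values are the settings and metrics shown earlier in this notebook.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def save_robustness_record(path, params, metrics):
    """Write calibration parameters and robustness metrics to a JSON file."""
    record = {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "calibration_params": params,
        # Cast to float so NumPy scalars serialize cleanly
        "overall_metrics": {k: float(v) for k, v in metrics.items()},
    }
    Path(path).write_text(json.dumps(record, indent=2))
    return record

params = {"epochs": 200, "learning_rate": 1e-3,
          "n_holdout_sets": 10, "holdout_fraction": 0.3}
metrics = {"mean_holdout_accuracy": 0.0, "std_holdout_accuracy": 0.0}

out_path = Path(tempfile.gettempdir()) / "robustness_record.json"
record = save_robustness_record(out_path, params, metrics)
print(out_path)
```

Keeping the parameters and the metrics in one file makes it straightforward to compare runs after changing epochs, learning rate, or regularization.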

## Next steps

After evaluating robustness:

1. If robustness is poor, try:
   - Hyperparameter tuning to find better L0 parameters
   - Reviewing and cleaning your targets
   - Increasing the dataset size

2. If robustness is good:
   - Save your calibration configuration
   - Apply to production data
   - Monitor performance over time

3. For specific issues:
   - High difficulty targets → Check data quality
   - Large generalization gap → Add regularization
   - High variability → Increase epochs or adjust learning rate