# Multi-Armed Bandit Simulation Framework Demo

This notebook demonstrates the research-grade bandit simulation framework with:
- Pre-generated counterfactual outcomes
- Per-arm variance heterogeneity
- Fair multi-policy comparison
- Heterogeneous reward distributions per arm
- Regret computation and analysis

## Setup

In [None]:
import numpy as np

from research_bandits_methods.bandits import (
    BanditEnvironment,
    BernoulliRewards,
    GaussianRewards,
    MultiPolicyComparison,
    PerArmDistribution,
)
from research_bandits_methods.bandits.policies.markovian_policies import (
    EpsilonGreedy,
    GaussianThompson,
    UCB,
)

# Set random seed for reproducibility
rng = np.random.default_rng(42)

## Example 1: Gaussian Rewards with Per-Arm Variances

We'll compare policies on a 3-armed bandit where arms have different variances.

In [None]:
# Problem parameters
T = 500  # Time horizon
K = 3  # Number of arms
R = 100  # Monte Carlo replications

# Gaussian reward distribution with different variances per arm
true_means = np.array([0.3, 0.5, 0.8])
variances = np.array([1.0, 2.0, 0.5])  # Different variance per arm
dist = GaussianRewards(means=true_means, variances=variances)

# Create environment (pre-generate all counterfactuals)
env = BanditEnvironment(dist, T=T, K=K, R=R, rng=rng)

print(f"Environment created with T={T}, K={K}, R={R}")
print(f"True means: {true_means}")
print(f"Variances: {variances}")
print(f"Optimal arm: {true_means.argmax()} (mean: {true_means.max():.2f})")
print(f"Counterfactuals shape: {env.counterfactuals.shape}")
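
The idea behind pre-generated counterfactuals can be sketched in plain NumPy. This is an illustration only, not the framework's implementation: the `(R, T, K)` array layout, the small sizes, and every name ending in `_s` are assumptions made for the sketch.

```python
import numpy as np

rng_s = np.random.default_rng(0)
R_s, T_s, K_s = 4, 6, 3  # small illustrative sizes (assumed)
means_s = np.array([0.3, 0.5, 0.8])
variances_s = np.array([1.0, 2.0, 0.5])

# Draw one potential reward for every (replication, step, arm) up front.
# The (R, T, K) layout is an assumption of this sketch.
counterfactuals_s = rng_s.normal(
    loc=means_s, scale=np.sqrt(variances_s), size=(R_s, T_s, K_s)
)

# A policy only ever observes the entry for the arm it pulls.
# Uniformly random pulls stand in for a real policy here.
arms_s = rng_s.integers(0, K_s, size=(R_s, T_s))
observed_s = np.take_along_axis(
    counterfactuals_s, arms_s[..., None], axis=2
).squeeze(-1)

print(counterfactuals_s.shape, observed_s.shape)
```

Because the rewards exist before any policy runs, two policies that pull the same arm at the same step of the same replication see exactly the same reward.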

### Run Multi-Policy Comparison

In [None]:
# Create comparison object
comparison = MultiPolicyComparison(env)

# Add policies to compare
comparison.add_policy("ε-greedy (ε=0.1)", EpsilonGreedy(K=K, R=R, epsilon=0.1, rng=rng))
comparison.add_policy(
    "ε-greedy (ε=0.05)", EpsilonGreedy(K=K, R=R, epsilon=0.05, rng=rng)
)
comparison.add_policy("UCB (c=1.0)", UCB(K=K, R=R, c=1.0))
comparison.add_policy("UCB (c=2.0)", UCB(K=K, R=R, c=2.0))
comparison.add_policy(
    "Thompson Sampling",
    GaussianThompson(K=K, R=R, prior_mean=0.5, prior_var=1.0, rng=rng),
)

# Run all policies on the same counterfactual data
comparison.run_all()

### Results Summary

In [None]:
# Print summary
comparison.print_summary()

### Regret Curves

In [None]:
# Get regret curves with confidence intervals
regret_curves = comparison.get_regret_curves_with_ci(confidence=0.95)

print("\nFinal average cumulative regret (with 95% CI):\n")
for policy_name, curve_data in regret_curves.items():
    final_mean = curve_data["mean"][-1]
    final_lower = curve_data["lower"][-1]
    final_upper = curve_data["upper"][-1]
    print(
        f"{policy_name:25s}: {final_mean:7.2f} [{final_lower:7.2f}, {final_upper:7.2f}]"
    )
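
For readers who want to see what a regret curve with a confidence band involves, here is a minimal NumPy sketch. It assumes pseudo-regret (gaps between true means) and a normal-approximation interval across replications; the framework may compute either differently, and `regret_curve_with_ci` is a hypothetical helper, not part of the library.

```python
import numpy as np


def regret_curve_with_ci(arms, arm_means, z=1.96):
    """Mean cumulative pseudo-regret with a normal-approximation band.

    arms: integer array of shape (R, T) holding the arm pulled at each step.
    arm_means: array of shape (K,) with the true mean reward of each arm.
    z: normal quantile, 1.96 for an approximate 95% interval.
    """
    gaps = arm_means.max() - arm_means[arms]   # per-step regret, shape (R, T)
    cum_regret = np.cumsum(gaps, axis=1)       # cumulative over time
    mean = cum_regret.mean(axis=0)             # average over replications
    sem = cum_regret.std(axis=0, ddof=1) / np.sqrt(arms.shape[0])
    return {"mean": mean, "lower": mean - z * sem, "upper": mean + z * sem}


# Usage with uniformly random pulls standing in for a real policy
demo_rng = np.random.default_rng(1)
demo_arms = demo_rng.integers(0, 3, size=(200, 50))
demo_curve = regret_curve_with_ci(demo_arms, np.array([0.3, 0.5, 0.8]))
print(f"{demo_curve['mean'][-1]:.2f}")
```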

## Example 2: Heterogeneous Distributions Per Arm

Different arms can have completely different reward distributions:
- Arm 0: Bernoulli(p=0.3)
- Arm 1: Gaussian(mean=0.5, variance=1.0)
- Arm 2: Gaussian(mean=0.8, variance=2.0)

In [None]:
# Define heterogeneous distributions
arm_distributions = [
    BernoulliRewards([0.3]),
    GaussianRewards([0.5], variances=1.0),
    GaussianRewards([0.8], variances=2.0),
]

# Create per-arm distribution
heterogeneous_dist = PerArmDistribution(arm_distributions)

# Create environment
env2 = BanditEnvironment(heterogeneous_dist, T=500, K=3, R=50, rng=rng)

print("Heterogeneous environment created")
print("Arm 0: Bernoulli(p=0.3)")
print("Arm 1: Gaussian(mean=0.5, variance=1.0)")
print("Arm 2: Gaussian(mean=0.8, variance=2.0) - high variance!")
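
Conceptually, a per-arm distribution fills each arm's column of the counterfactual array from its own sampler. The sketch below shows that idea in plain NumPy under stated assumptions: the `(R, T, K)` layout, the `het_` names, and the list of sampler functions are all hypothetical and do not reflect how `PerArmDistribution` is implemented.

```python
import numpy as np

het_rng = np.random.default_rng(2)
n_rep, horizon = 50, 500  # mirrors R=50, T=500 from the cell above

# One sampler per arm; the second Gaussian parameter is a variance,
# so the standard deviation passed to NumPy is its square root.
het_samplers = [
    lambda size: het_rng.binomial(1, 0.3, size=size).astype(float),
    lambda size: het_rng.normal(0.5, np.sqrt(1.0), size=size),
    lambda size: het_rng.normal(0.8, np.sqrt(2.0), size=size),
]

# Stack the per-arm draws along the last axis: shape (R, T, K)
het_counterfactuals = np.stack(
    [draw((n_rep, horizon)) for draw in het_samplers], axis=-1
)

print(het_counterfactuals.shape)
print(het_counterfactuals.mean(axis=(0, 1)).round(2))
```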

### Run Comparison on Heterogeneous Environment

In [None]:
comparison2 = MultiPolicyComparison(env2)

comparison2.add_policy("UCB", UCB(K=3, R=50, c=2.0))
comparison2.add_policy("ε-greedy", EpsilonGreedy(K=3, R=50, epsilon=0.1, rng=rng))
comparison2.add_policy(
    "Thompson", GaussianThompson(K=3, R=50, prior_mean=0.5, prior_var=1.0, rng=rng)
)

comparison2.run_all()
comparison2.print_summary()

## Example 3: Analyzing Arm Selection Patterns

In [None]:
# Get detailed results from first comparison
summary = comparison.get_results_summary()

print("\nArm Pull Counts (average across replications):\n")
print(f"{'Policy':<25s} {'Arm 0':>10s} {'Arm 1':>10s} {'Arm 2':>10s}")
print("-" * 60)
for policy_name, stats in summary.items():
    counts = stats["arm_pull_counts"]
    print(f"{policy_name:<25s} {counts[0]:10.1f} {counts[1]:10.1f} {counts[2]:10.1f}")

best_arm = int(true_means.argmax())
print(
    f"\nNote: Optimal arm is {best_arm} (mean {true_means[best_arm]:.2f}, variance {variances[best_arm]:.2f})"
)
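
Average pull counts of this kind can be derived from a record of which arm was pulled at each step. The helper below is a hypothetical sketch that assumes such a record is available as an `(R, T)` integer array; it is not the framework's own code.

```python
import numpy as np


def mean_pull_counts(arms, n_arms):
    """Average number of pulls per arm across replications.

    arms: integer array of shape (R, T) with the arm pulled at each step.
    n_arms: total number of arms K.
    """
    # Count pulls per arm within each replication, then average the counts
    per_rep = np.stack([np.bincount(row, minlength=n_arms) for row in arms])
    return per_rep.mean(axis=0)


# Usage with uniformly random pulls standing in for a real policy
pulls_rng = np.random.default_rng(3)
pulls_demo = pulls_rng.integers(0, 3, size=(100, 500))
avg_counts = mean_pull_counts(pulls_demo, n_arms=3)
print(avg_counts)
```

The averages always sum to the horizon `T`, which is a quick sanity check on any pull-count table.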

## Summary

This framework provides:

1. **Fair Comparison**: All policies run on identical counterfactual outcomes
2. **Per-Arm Heterogeneity**: Different means and variances per arm
3. **Flexible Distributions**: Mix Bernoulli, Gaussian, Student's t, etc.
4. **Efficiency**: Pre-generated counterfactuals enable fast experimentation
5. **Extensibility**: Easy to add new distributions and policies

### Key Observations

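- **UCB** comes with theoretical regret guarantees, but its confidence bonus can make it explore more than necessary; the exploration constant `c` controls this trade-off
- **ε-greedy** is simple, but with a fixed ε it keeps exploring uniformly at random at a constant rate, so its cumulative regret grows linearly in the horizon
- **Thompson Sampling** balances exploration and exploitation by sampling from each arm's posterior and pulling the arm with the highest draw
- **Per-arm variances** allow realistic modeling of heterogeneous reward uncertainty
- **Heterogeneous distributions** enable arms with fundamentally different reward structures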
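
To make the three selection rules concrete, here is a single-step sketch of each for one replication. These are common textbook forms written under stated assumptions: the function names are hypothetical, the UCB bonus `c * sqrt(ln t / n)` is only one of several variants, and the Thompson update assumes a Gaussian prior with a known noise variance. The framework's `EpsilonGreedy`, `UCB`, and `GaussianThompson` classes may differ in detail.

```python
import numpy as np


def epsilon_greedy_arm(est_means, epsilon, rng):
    """With probability epsilon pull a uniformly random arm, else the best estimate."""
    if rng.random() < epsilon:
        return int(rng.integers(len(est_means)))
    return int(np.argmax(est_means))


def ucb_arm(est_means, counts, t, c):
    """Pull the arm with the highest estimate plus exploration bonus."""
    if np.any(counts == 0):
        return int(np.argmin(counts))  # pull every arm once first
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(est_means + bonus))


def gaussian_thompson_arm(reward_sums, counts, prior_mean, prior_var, noise_var, rng):
    """Sample each arm's mean from its Gaussian posterior and pull the best draw."""
    post_precision = 1.0 / prior_var + counts / noise_var
    post_mean = (prior_mean / prior_var + reward_sums / noise_var) / post_precision
    draws = rng.normal(post_mean, np.sqrt(1.0 / post_precision))
    return int(np.argmax(draws))


# Usage: one decision from each rule given the same running statistics
step_rng = np.random.default_rng(4)
pull_counts = np.array([5, 5, 5])
reward_totals = np.array([1.0, 3.0, 4.5])
running_means = reward_totals / pull_counts
print(epsilon_greedy_arm(running_means, 0.1, step_rng))
print(ucb_arm(running_means, pull_counts, t=16, c=2.0))
print(gaussian_thompson_arm(reward_totals, pull_counts, 0.5, 1.0, 1.0, step_rng))
```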