# Statistics and Simulation for Sports Analytics: Days 17-18

This notebook covers simulation methods for understanding uncertainty and variability in sports.

**Topics covered:**
- Day 17: Monte Carlo simulation
- Day 18: Bootstrapping

**Prerequisites:** Basic Python, numpy, and matplotlib.

Let's import our libraries and set up for reproducible results.

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Set random seed for reproducibility
np.random.seed(42)

# Display settings
plt.rcParams['figure.figsize'] = (10, 6)
print("Libraries imported successfully!")
print("Random seed set to 42 for reproducible results")

# Day 17: Monte Carlo Simulation

## What is Monte Carlo Simulation?

**Monte Carlo simulation** uses many random samples to approximate probabilities, expected values, and distributions.

**The basic idea:**
1. Generate many random outcomes according to a probability model
2. Count or average these outcomes
3. Use the results to estimate quantities that might be hard to calculate exactly

**Why it's useful:**
- Approximates complex probabilities
- Works when exact formulas are difficult
- Easy to understand and implement
- Accuracy improves with more simulations

**Name origin:** Named after the Monte Carlo casino in Monaco, because it relies on randomness like gambling.

## Simple Example: Coin Flips

**Question:** If you flip a fair coin 5 times, what's the probability of getting exactly 3 heads?

We can calculate this exactly using the binomial formula, but let's use Monte Carlo simulation instead.
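
For reference, the exact calculation is short. The sketch below applies the binomial formula C(n, k) · p^k · (1 − p)^(n − k) using only the standard library; the variable names are chosen here for illustration.

```python
from math import comb

n_flips = 5
target_heads = 3
p_heads = 0.5

# Binomial formula: C(n, k) * p^k * (1 - p)^(n - k)
n_ways = comb(n_flips, target_heads)  # ways to place 3 heads among 5 flips
exact_prob = n_ways * p_heads**target_heads * (1 - p_heads)**(n_flips - target_heads)

print(f"Ways to arrange the heads: {n_ways}")
print(f"Exact probability: {exact_prob:.4f}")
```

A Monte Carlo estimate from 10,000 trials should land near this value, though not exactly on it.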

In [None]:
# Monte Carlo simulation for coin flips
# Estimate P(3 heads in 5 flips)

n_simulations = 10000
n_flips = 5
target_heads = 3

# Simulate: 1 = heads, 0 = tails
# Each row is one trial of 5 flips
flips = np.random.randint(0, 2, size=(n_simulations, n_flips))

# Count heads in each trial
heads_count = np.sum(flips, axis=1)

# Count how many trials had exactly 3 heads
successes = np.sum(heads_count == target_heads)

# Estimate the probability
estimated_prob = successes / n_simulations

# Exact probability using binomial formula
exact_prob = stats.binom.pmf(target_heads, n_flips, 0.5)

print("Coin Flip Monte Carlo Simulation")
print("="*50)
print(f"Simulations: {n_simulations}")
print(f"Trials with exactly {target_heads} heads: {successes}")
print(f"\nEstimated probability: {estimated_prob:.4f}")
print(f"Exact probability: {exact_prob:.4f}")
print(f"Difference: {abs(estimated_prob - exact_prob):.4f}")
print("\n✓ Monte Carlo estimate is close to exact value!")

## Sports Context: Team Win Probability

Now let's apply Monte Carlo to basketball.

**Scenario:**
- A team has a fixed probability p of winning each game
- We want to estimate various probabilities using simulation

**Model:**
- Each game is independent
- Win with probability p, lose with probability 1-p
- This is a **Bernoulli trial**

**Note:** In reality, p is unknown. But for teaching, we'll set p = 0.6 (a team that wins 60% of games).

## Simulating Game Results

**Method 1:** Use numpy's random choice or binomial function

**Method 2:** Generate uniform random numbers and compare to p

Let's demonstrate both methods.

In [None]:
# Method 1: Using numpy binomial
# 1 = win, 0 = loss

win_probability = 0.6
n_games = 10

# Simulate 10 games
game_results_method1 = np.random.binomial(1, win_probability, size=n_games)

print("Method 1: Using np.random.binomial")
print(f"Win probability: {win_probability}")
print(f"Simulated results (1=win, 0=loss): {game_results_method1}")
print(f"Wins: {np.sum(game_results_method1)}")
print(f"Losses: {n_games - np.sum(game_results_method1)}")

In [None]:
# Method 2: Using uniform random and threshold

uniform_draws = np.random.uniform(0, 1, size=n_games)
game_results_method2 = (uniform_draws < win_probability).astype(int)

print("Method 2: Using uniform random numbers")
print(f"Random draws: {uniform_draws}")
print(f"Game results (1=win, 0=loss): {game_results_method2}")
print(f"Wins: {np.sum(game_results_method2)}")
print("\nLogic: If random number < 0.6, team wins")

## Exercise: Simulate 10,000 Games

**Goal:** Use Monte Carlo simulation to estimate the team's win rate and verify it matches the true probability.

In [None]:
# Simulate 10,000 games

true_win_prob = 0.6
n_simulations = 10000

# Simulate all games at once (vectorized)
game_outcomes = np.random.binomial(1, true_win_prob, size=n_simulations)

# Compute simulated win rate
simulated_win_rate = np.mean(game_outcomes)

print("Monte Carlo Simulation: Team Win Probability")
print("="*50)
print(f"True win probability (p): {true_win_prob}")
print(f"Number of simulations: {n_simulations}")
print(f"\nTotal wins: {np.sum(game_outcomes)}")
print(f"Total losses: {n_simulations - np.sum(game_outcomes)}")
print(f"\nSimulated win rate: {simulated_win_rate:.4f}")
print(f"Difference from true p: {abs(simulated_win_rate - true_win_prob):.4f}")
print("\n✓ Simulation closely approximates the true probability!")

In [None]:
# Repeat simulation multiple times to show variability

n_repetitions = 5
n_sims_per_rep = 10000

print("Repeating the simulation multiple times:")
print("="*50)

for i in range(n_repetitions):
    outcomes = np.random.binomial(1, true_win_prob, size=n_sims_per_rep)
    win_rate = np.mean(outcomes)
    print(f"Repetition {i+1}: Simulated win rate = {win_rate:.4f}")

print(f"\nTrue win probability: {true_win_prob}")
print("\nNote: Estimates vary slightly but stay close to true value.")
print("With 10,000 simulations, estimates typically fall within about 0.01 of the true value.")

In [None]:
# Plot running estimate of win probability

# Simulate games one at a time (conceptually)
np.random.seed(100)  # New seed for this visualization
n_games_to_plot = 1000

game_results = np.random.binomial(1, true_win_prob, size=n_games_to_plot)

# Compute cumulative win rate after each game
cumulative_wins = np.cumsum(game_results)
game_numbers = np.arange(1, n_games_to_plot + 1)
running_win_rate = cumulative_wins / game_numbers

# Plot
plt.figure(figsize=(12, 6))
plt.plot(game_numbers, running_win_rate, linewidth=1.5, alpha=0.7, color='blue')
plt.axhline(y=true_win_prob, color='red', linestyle='--', linewidth=2, 
            label=f'True win probability: {true_win_prob}')
plt.xlabel('Number of Games Simulated', fontsize=12)
plt.ylabel('Running Win Rate Estimate', fontsize=12)
plt.title('Monte Carlo Convergence: Win Rate Estimate vs Number of Simulations', 
          fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(alpha=0.3)
plt.ylim(0.4, 0.8)
plt.show()

print("Interpretation:")
print("- Early on, the estimate is noisy and varies widely")
print("- As more games are simulated, the estimate stabilizes")
print("- Eventually, it converges close to the true probability")
print("- More simulations = less variability = better estimate")

### How Monte Carlo Approximates True Probability

**The Law of Large Numbers:**

As the number of simulations increases, the sample average converges to the true expected value.

**For our example:**
- True win probability: p = 0.6
- Each simulation: Win (1) or Loss (0)
- Average of many simulations ≈ 0.6 × 1 + 0.4 × 0 = 0.6

**Variability reduction:**
- With 10 simulations: Estimate could plausibly be anywhere from about 0.3 to 0.9 (high variance)
- With 100 simulations: Estimate will usually fall between about 0.50 and 0.70 (moderate variance)
- With 10,000 simulations: Estimate will usually fall between about 0.59 and 0.61 (low variance)

**Standard error decreases as 1/√n:**
- 100 simulations → SE ≈ 0.049
- 10,000 simulations → SE ≈ 0.005 (10 times smaller!)

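**Practical takeaway:** More simulations give more accurate estimates, but with diminishing returns: because the standard error shrinks as 1/√n, halving the error takes four times as many simulations.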
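
The 1/√n behaviour can be checked numerically. The sketch below compares the theoretical standard error of a simulated proportion, sqrt(p(1 − p)/n), with the spread observed across many repeated simulations. The function names and the number of repetitions are assumptions made for this illustration, and it uses its own `np.random.default_rng` generator so the notebook's global seed is not disturbed.

```python
import numpy as np

def bernoulli_standard_error(p, n):
    """Theoretical standard error of a simulated proportion: sqrt(p * (1 - p) / n)."""
    return np.sqrt(p * (1 - p) / n)

def empirical_standard_error(p, n, n_repeats=2000, seed=0):
    """Spread of the estimate across many independent repetitions of the simulation."""
    rng = np.random.default_rng(seed)
    # Each draw is the number of wins in n games; dividing by n gives one estimate
    estimates = rng.binomial(n, p, size=n_repeats) / n
    return estimates.std(ddof=1)

p_true = 0.6  # same win probability used in this notebook
for n in (10, 100, 10_000):
    theory = bernoulli_standard_error(p_true, n)
    observed = empirical_standard_error(p_true, n)
    print(f"n = {n:>6}: theoretical SE = {theory:.4f}, observed SE = {observed:.4f}")
```

The two columns should agree closely, and each 100-fold increase in n should shrink the standard error by roughly a factor of 10.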

## Extension: Best-of-Seven Series

**New question:** If a team wins each game with probability 0.6, what's the probability they win a best-of-7 series?

**Rules:** First team to win 4 games wins the series.

This is harder to calculate exactly, but easy to simulate!
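
As a cross-check on any simulation, the exact answer can still be worked out under the same assumptions (independent games, fixed p). The sketch below sums over the number of losses the team takes before its fourth win; `exact_series_win_prob` is a helper name introduced here, not part of the notebook.

```python
from math import comb

def exact_series_win_prob(p, wins_needed=4):
    """Exact probability of reaching `wins_needed` wins before the opponent does."""
    total = 0.0
    for losses in range(wins_needed):
        # The team wins the final game; the earlier games contain
        # (wins_needed - 1) wins and `losses` losses in any order.
        orderings = comb(wins_needed - 1 + losses, losses)
        total += orderings * p**wins_needed * (1 - p)**losses
    return total

for p in (0.5, 0.6, 0.7):
    print(f"p = {p:.1f}: exact series win probability = {exact_series_win_prob(p):.4f}")
```

A simulated estimate for p = 0.6 should fall within Monte Carlo error of the exact figure, which is a useful sanity check on the simulation code.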

In [None]:
# Simulate best-of-seven series

def simulate_series(win_prob, n_series):
    """
    Simulate best-of-7 series.
    
    Parameters:
    - win_prob: probability team wins a single game
    - n_series: number of series to simulate
    
    Returns:
    - proportion of series won by the team
    """
    series_wins = 0
    
    for _ in range(n_series):
        team_wins = 0
        opponent_wins = 0
        
        # Play games until one team gets 4 wins
        while team_wins < 4 and opponent_wins < 4:
            # Simulate one game
            game_result = np.random.binomial(1, win_prob)
            
            if game_result == 1:
                team_wins += 1
            else:
                opponent_wins += 1
        
        # Check if team won the series
        if team_wins == 4:
            series_wins += 1
    
    return series_wins / n_series

# Run simulation
np.random.seed(200)
game_win_prob = 0.6
n_series_simulations = 10000

series_win_prob = simulate_series(game_win_prob, n_series_simulations)

print("Best-of-Seven Series Simulation")
print("="*50)
print(f"Single game win probability: {game_win_prob}")
print(f"Number of series simulated: {n_series_simulations}")
print(f"\nEstimated series win probability: {series_win_prob:.4f}")
print(f"\nInterpretation: A team that wins 60% of individual games")
print(f"wins about {series_win_prob*100:.1f}% of best-of-7 series.")

In [None]:
# Visualize how series win probability varies with game win probability

game_probs = np.linspace(0.4, 0.7, 7)
series_probs = []

np.random.seed(300)
for p in game_probs:
    series_prob = simulate_series(p, 5000)
    series_probs.append(series_prob)

plt.figure(figsize=(10, 6))
plt.plot(game_probs, series_probs, 'o-', linewidth=2.5, markersize=10, color='darkblue')
plt.xlabel('Single Game Win Probability', fontsize=12)
plt.ylabel('Best-of-7 Series Win Probability', fontsize=12)
plt.title('Series Win Probability vs Game Win Probability', fontsize=14, fontweight='bold')
plt.grid(alpha=0.3)
plt.xlim(0.35, 0.75)
plt.ylim(0, 1)

# Add reference line y=x
plt.plot([0.35, 0.75], [0.35, 0.75], 'r--', linewidth=1.5, alpha=0.5, label='y = x')
plt.legend(fontsize=11)
plt.show()

print("Key insight: Series win probability is MORE extreme than game win probability.")
print("A team with 60% game win rate has ~71% series win rate.")
print("This is because better teams compound their advantage over 4-7 games.")

# Day 18: Bootstrapping

## What is Bootstrapping?

**Bootstrapping** is a resampling method that estimates the sampling variability of a statistic.

**The basic idea:**
1. You have observed data (e.g., 50 games of player points)
2. Repeatedly resample from your data **with replacement**
3. Compute your statistic (e.g., mean) for each resample
4. Use the distribution of these statistics to estimate uncertainty

**Key features:**
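- **Non-parametric:** No specific distribution (such as the Normal) is assumed for the data, though the sample should be representative and the observations roughly independent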
- **Data-driven:** Uses the actual observed data
- **Flexible:** Works for any statistic (mean, median, variance, etc.)

**Why "with replacement"?**

Sampling with replacement means:
- You can select the same observation multiple times
- Each bootstrap sample has the same size as the original data
- This mimics drawing new samples from the population
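
To see what sampling with replacement looks like in practice, here is a tiny sketch on made-up data. The five values are invented for illustration, and the local `rng` generator is an assumption of this example rather than something the notebook defines.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator; leaves the notebook's global seed untouched

toy_data = np.array([12, 18, 22, 25, 31])  # made-up points totals, for illustration only

for i in range(3):
    resample = rng.choice(toy_data, size=len(toy_data), replace=True)
    n_unique = len(np.unique(resample))
    print(f"Resample {i + 1}: {resample}  (distinct values: {n_unique} of {len(toy_data)})")

# Without replacement, a same-size "resample" is just a reshuffle of the data,
# so its mean never changes and it reveals nothing about sampling variability.
shuffled = rng.choice(toy_data, size=len(toy_data), replace=False)
print(f"Without replacement: {shuffled}, mean = {shuffled.mean():.1f} (always {toy_data.mean():.1f})")
```

Resamples drawn with replacement usually repeat some games and omit others, which is what makes their means vary from one resample to the next.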

## Basketball Example: Player Points Per Game

**Scenario:**
- We have 50 games of data for one player
- We want to estimate the uncertainty in their average points per game
- Bootstrap will give us a confidence interval for the true mean

In [None]:
# Create player points per game dataset
# Simulate realistic data: mean around 22 points, std around 6

np.random.seed(400)
n_games = 50

# Generate player points (using Normal distribution for realism)
player_points = np.random.normal(loc=22, scale=6, size=n_games)
player_points = np.maximum(player_points, 0)  # No negative points
player_points = np.round(player_points, 1)  # Round to 1 decimal

print("Player Points Per Game Dataset")
print("="*50)
print(f"Number of games: {n_games}")
print(f"\nFirst 15 games:")
print(player_points[:15])

In [None]:
# Basic statistics of the dataset

sample_mean = np.mean(player_points)
sample_std = np.std(player_points, ddof=1)  # Sample standard deviation
sample_min = np.min(player_points)
sample_max = np.max(player_points)

print("Dataset Summary Statistics:")
print("="*50)
print(f"Number of games: {n_games}")
print(f"Sample mean: {sample_mean:.2f} points")
print(f"Sample std: {sample_std:.2f} points")
print(f"Min: {sample_min:.1f} points")
print(f"Max: {sample_max:.1f} points")

In [None]:
# Plot histogram of game points

plt.figure(figsize=(10, 6))
plt.hist(player_points, bins=15, color='steelblue', edgecolor='black', alpha=0.7)
plt.axvline(sample_mean, color='red', linestyle='--', linewidth=2.5, 
            label=f'Sample Mean: {sample_mean:.2f}')
plt.xlabel('Points Per Game', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Distribution of Player Points Per Game', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(axis='y', alpha=0.3)
plt.show()

print("The histogram shows the variability in game-to-game performance.")

## Exercise: Bootstrap Confidence Interval for Mean Points

**Goal:** Estimate a 95% confidence interval for the player's true average points per game using bootstrapping.

In [None]:
# Define bootstrap function

def bootstrap_sample_mean(data):
    """
    Draw one bootstrap sample and compute its mean.
    
    Parameters:
    - data: original sample (numpy array)
    
    Returns:
    - mean of the bootstrap sample
    """
    # Sample with replacement (same size as original)
    bootstrap_sample = np.random.choice(data, size=len(data), replace=True)
    
    # Compute mean of bootstrap sample
    return np.mean(bootstrap_sample)

# Test the function
test_bootstrap_mean = bootstrap_sample_mean(player_points)
print(f"Example bootstrap sample mean: {test_bootstrap_mean:.2f}")
print(f"Original sample mean: {sample_mean:.2f}")
print("\nNote: Bootstrap means will vary around the original sample mean.")

In [None]:
# Generate 1000 bootstrap samples

n_bootstrap = 1000
bootstrap_means = np.zeros(n_bootstrap)

np.random.seed(500)
for i in range(n_bootstrap):
    bootstrap_means[i] = bootstrap_sample_mean(player_points)

print(f"Generated {n_bootstrap} bootstrap samples")
print(f"\nFirst 20 bootstrap means:")
print(bootstrap_means[:20])

In [None]:
# Plot histogram of bootstrap means

plt.figure(figsize=(12, 6))
plt.hist(bootstrap_means, bins=40, color='coral', edgecolor='black', alpha=0.7, density=True)
plt.axvline(sample_mean, color='blue', linestyle='--', linewidth=2.5, 
            label=f'Original Sample Mean: {sample_mean:.2f}')
plt.axvline(np.mean(bootstrap_means), color='red', linestyle='-', linewidth=2.5, 
            label=f'Bootstrap Mean: {np.mean(bootstrap_means):.2f}')
plt.xlabel('Bootstrap Sample Mean (Points Per Game)', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.title('Distribution of Bootstrap Sample Means', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(axis='y', alpha=0.3)
plt.show()

print("This distribution shows the sampling variability of the mean.")
print("It tells us how much the mean would vary if we collected new samples.")

In [None]:
# Compute bootstrap statistics and confidence interval

# Bootstrap estimate of the mean
bootstrap_mean_estimate = np.mean(bootstrap_means)

# 95% confidence interval: 2.5th and 97.5th percentiles
ci_lower = np.percentile(bootstrap_means, 2.5)
ci_upper = np.percentile(bootstrap_means, 97.5)

print("Bootstrap Results:")
print("="*50)
print(f"Original sample mean: {sample_mean:.2f} points")
print(f"Bootstrap estimate of mean: {bootstrap_mean_estimate:.2f} points")
print(f"\nBootstrap 95% Confidence Interval:")
print(f"  Lower bound (2.5th percentile): {ci_lower:.2f} points")
print(f"  Upper bound (97.5th percentile): {ci_upper:.2f} points")
print(f"  Interval: [{ci_lower:.2f}, {ci_upper:.2f}]")
print(f"  Width: {ci_upper - ci_lower:.2f} points")

### Interpreting the Bootstrap Confidence Interval

**What does the 95% confidence interval mean?**

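The interval printed by the previous cell (quoted here as approximately [21.12, 24.23]; check the cell output for the exact values) tells us:

1. **Point estimate:** Our best estimate of the player's true average is the sample mean, about 22.5 points per game.

2. **Uncertainty range:** We are 95% confident that the true average points per game lies between the lower and upper bounds, roughly 21.12 and 24.23.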

3. **Interpretation:** If we repeated the process of collecting 50 games and computing confidence intervals many times, about 95% of those intervals would contain the true mean.
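
The repeated-sampling interpretation in point 3 can be tested directly when the true mean is known. The sketch below repeatedly generates a fresh 50-game sample from a Normal(22, 6) model (the same model used to create this notebook's data, without the rounding and clipping), builds a percentile bootstrap interval each time, and counts how often the interval contains 22. The function names, the number of repetitions, and the seed are assumptions of this example.

```python
import numpy as np

def percentile_ci_for_mean(sample, rng, n_boot=1000, level=0.95):
    """Percentile bootstrap CI for the mean, vectorized over resamples."""
    n = len(sample)
    # Row i holds the i-th bootstrap resample (drawn with replacement)
    resamples = rng.choice(sample, size=(n_boot, n), replace=True)
    boot_means = resamples.mean(axis=1)
    tail = 100 * (1 - level) / 2
    return np.percentile(boot_means, [tail, 100 - tail])

def estimate_coverage(true_mean=22.0, true_sd=6.0, n_games=50, n_repeats=300, seed=0):
    """Fraction of repeated experiments whose CI contains the true mean."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_repeats):
        sample = rng.normal(true_mean, true_sd, size=n_games)  # a fresh set of 50 games
        lower, upper = percentile_ci_for_mean(sample, rng)
        hits += lower <= true_mean <= upper
    return hits / n_repeats

coverage = estimate_coverage()
print(f"Intervals containing the true mean: {coverage:.1%}")
```

The observed coverage should be in the neighbourhood of 95%, though percentile intervals often come in slightly below the nominal level at this sample size, and 300 repetitions leave some noise in the estimate.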

**Practical meaning:**
- The player's long-run average is likely in this range
- We have moderate uncertainty (±1.5 points roughly)
- With more games, the interval would be narrower (more certainty)
- This accounts for game-to-game variability

**Why this matters:**
- Don't overinterpret a single sample mean
- Recognize the inherent uncertainty in estimates
- Compare players accounting for uncertainty
- Make better decisions with interval estimates vs point estimates
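
One way to compare players while accounting for uncertainty is to bootstrap the difference in their means. The sketch below does this for two hypothetical players with different numbers of games; the data, the function name `bootstrap_mean_difference`, and the choice of 2,000 resamples are all assumptions made for illustration.

```python
import numpy as np

def bootstrap_mean_difference(points_a, points_b, n_boot=2000, seed=0):
    """Percentile bootstrap 95% CI for mean(points_a) - mean(points_b)."""
    rng = np.random.default_rng(seed)
    # Resample each player's games independently, keeping each sample size
    boot_a = rng.choice(points_a, size=(n_boot, len(points_a)), replace=True).mean(axis=1)
    boot_b = rng.choice(points_b, size=(n_boot, len(points_b)), replace=True).mean(axis=1)
    diffs = boot_a - boot_b
    lower, upper = np.percentile(diffs, [2.5, 97.5])
    return diffs.mean(), lower, upper

# Hypothetical data: player A has 50 games, player B only 15
data_rng = np.random.default_rng(1)
player_a = data_rng.normal(22, 6, size=50)
player_b = data_rng.normal(24, 6, size=15)

mean_diff, lower, upper = bootstrap_mean_difference(player_a, player_b)
print(f"Observed difference (A - B): {player_a.mean() - player_b.mean():.2f}")
print(f"Bootstrap 95% CI for the difference: [{lower:.2f}, {upper:.2f}]")
print("Interval includes 0:", lower <= 0 <= upper)
```

If the interval includes 0, the data do not clearly separate the two players, however different their raw averages look. The smaller sample for player B widens the interval, which is the sample-size effect in action.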

In [None]:
# Compare to Normal-based confidence interval

# Standard error of the mean
se_mean = sample_std / np.sqrt(n_games)

# Normal-based 95% CI: mean ± 1.96 * SE
# (a t critical value, about 2.01 for n = 50, would give a slightly wider interval)
normal_ci_lower = sample_mean - 1.96 * se_mean
normal_ci_upper = sample_mean + 1.96 * se_mean

print("Comparison: Bootstrap vs Normal-Based Confidence Intervals")
print("="*60)
print(f"\nBootstrap 95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]")
print(f"  Width: {ci_upper - ci_lower:.2f} points")

print(f"\nNormal-based 95% CI: [{normal_ci_lower:.2f}, {normal_ci_upper:.2f}]")
print(f"  Width: {normal_ci_upper - normal_ci_lower:.2f} points")

print(f"\nDifference in lower bounds: {abs(ci_lower - normal_ci_lower):.2f}")
print(f"Difference in upper bounds: {abs(ci_upper - normal_ci_upper):.2f}")

print("\nNote: In this case, the intervals are very similar.")
print("Bootstrap advantages:")
print("  - No assumption of Normal distribution")
print("  - Works for any statistic (median, variance, etc.)")
print("  - Often better suited to skewed data (very small samples remain hard for any method)")

In [None]:
# Visualize both confidence intervals together

plt.figure(figsize=(12, 7))

# Plot bootstrap distribution
plt.hist(bootstrap_means, bins=40, color='lightblue', edgecolor='black', 
         alpha=0.6, density=True, label='Bootstrap distribution')

# Original sample mean
plt.axvline(sample_mean, color='black', linestyle='-', linewidth=3, 
            label=f'Sample mean: {sample_mean:.2f}')

# Bootstrap CI
plt.axvline(ci_lower, color='blue', linestyle='--', linewidth=2.5, alpha=0.8,
            label=f'Bootstrap 95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]')
plt.axvline(ci_upper, color='blue', linestyle='--', linewidth=2.5, alpha=0.8)

# Normal-based CI
plt.axvline(normal_ci_lower, color='red', linestyle=':', linewidth=2.5, alpha=0.8,
            label=f'Normal 95% CI: [{normal_ci_lower:.2f}, {normal_ci_upper:.2f}]')
plt.axvline(normal_ci_upper, color='red', linestyle=':', linewidth=2.5, alpha=0.8)

plt.xlabel('Points Per Game', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.title('Bootstrap Distribution with Confidence Intervals', fontsize=14, fontweight='bold')
plt.legend(fontsize=10, loc='upper right')
plt.grid(alpha=0.3)
plt.show()

## Advanced Bootstrap Example: Median

Bootstrap is especially useful for statistics that don't have simple formulas.

Let's estimate a confidence interval for the **median** points per game.

In [None]:
# Bootstrap for median

n_bootstrap = 1000
bootstrap_medians = np.zeros(n_bootstrap)

np.random.seed(600)
for i in range(n_bootstrap):
    bootstrap_sample = np.random.choice(player_points, size=len(player_points), replace=True)
    bootstrap_medians[i] = np.median(bootstrap_sample)

# Original sample median
sample_median = np.median(player_points)

# Bootstrap CI for median
median_ci_lower = np.percentile(bootstrap_medians, 2.5)
median_ci_upper = np.percentile(bootstrap_medians, 97.5)

print("Bootstrap Confidence Interval for Median:")
print("="*50)
print(f"Sample median: {sample_median:.2f} points")
print(f"Bootstrap estimate of median: {np.mean(bootstrap_medians):.2f} points")
print(f"\nBootstrap 95% CI for median: [{median_ci_lower:.2f}, {median_ci_upper:.2f}]")
print("\nNote: There's no simple formula for median CI using Normal theory.")
print("Bootstrap makes this easy!")

# Summary: Monte Carlo and Bootstrap Methods

## Monte Carlo Simulation

**What it is:**
- Use many random samples to approximate probabilities and expected values
- Generate outcomes according to a probability model
- Count or average the results

**How it estimates win probabilities:**
1. Define the model (e.g., team wins with probability p = 0.6)
2. Simulate thousands of games
3. Compute the proportion of wins
4. This proportion estimates the true win probability

**Key insights:**
- More simulations → better estimates
- Standard error decreases as 1/√n
- Works for complex scenarios (e.g., best-of-7 series)
- Easy to implement and understand

**When to use:**
- Complex probability calculations
- Scenarios with multiple random components
- Testing "what if" scenarios
- When exact formulas are difficult
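
As one example of a "what if" scenario, the sketch below simulates full seasons for a team with a 60% per-game win probability. The 82-game season length, the 50-win threshold, and the function name are assumptions chosen for illustration, and the model keeps the notebook's simplification that games are independent with a fixed p.

```python
import numpy as np

def simulate_season_wins(win_prob, n_games=82, n_seasons=10_000, seed=0):
    """Simulated win totals for many seasons under an independent-games model."""
    rng = np.random.default_rng(seed)
    # One binomial draw per season = number of wins in n_games independent games
    return rng.binomial(n_games, win_prob, size=n_seasons)

season_wins = simulate_season_wins(win_prob=0.6)

print(f"Average wins per season: {season_wins.mean():.1f}")
print(f"Middle 95% of seasons: {np.percentile(season_wins, 2.5):.0f} to "
      f"{np.percentile(season_wins, 97.5):.0f} wins")
print(f"Estimated P(at least 50 wins): {np.mean(season_wins >= 50):.3f}")
```

Changing `win_prob` or the threshold answers a different "what if" question without any new mathematics.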

---

## Bootstrapping

**What it is:**
- Resample from observed data (with replacement) to estimate sampling variability
- Non-parametric: no specific distributional form is assumed (the sample still needs to be representative)
- Data-driven: uses actual observations

**How it estimates uncertainty for mean points:**
1. Have observed data (e.g., 50 games of player points)
2. Resample 50 games with replacement
3. Compute mean of resampled data
4. Repeat 1000+ times
5. Use percentiles of bootstrap means as confidence interval
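
The steps above can be wrapped in one reusable function that accepts any statistic. This is a minimal sketch of the percentile method, not a full-featured implementation; `bootstrap_ci` and its parameters are names introduced here, and the stand-in data is generated only so the block runs on its own.

```python
import numpy as np

def bootstrap_ci(data, statistic=np.mean, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for an arbitrary statistic."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    boot_stats = np.empty(n_boot)
    for i in range(n_boot):
        # Steps 2-3: resample with replacement, then recompute the statistic
        resample = rng.choice(data, size=len(data), replace=True)
        boot_stats[i] = statistic(resample)
    # Step 5: percentiles of the bootstrap statistics form the interval
    tail = 100 * (1 - level) / 2
    lower, upper = np.percentile(boot_stats, [tail, 100 - tail])
    return lower, upper

# Stand-in data; in this notebook you would pass `player_points` instead
example_points = np.random.default_rng(42).normal(22, 6, size=50)

for name, fn in [("mean", np.mean), ("median", np.median), ("std", lambda x: np.std(x, ddof=1))]:
    lower, upper = bootstrap_ci(example_points, statistic=fn)
    print(f"95% CI for {name}: [{lower:.2f}, {upper:.2f}]")
```

Passing the statistic as a function keeps the resampling logic in one place, so switching from the mean to the median or standard deviation is a one-argument change.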

**Key insights:**
- Provides confidence intervals without assuming Normality
- Works for any statistic (mean, median, variance, correlation, etc.)
- Reflects the actual variability in the data
- More bootstrap samples → smoother distribution

**When to use:**
- Estimating uncertainty for any statistic
- Skewed or otherwise non-Normal data where the Normal approximation is questionable (very small samples limit the bootstrap too)
- Complex statistics without known distributions
- Non-parametric inference

---

## Reasoning About Variability and Uncertainty

**Both methods help us:**

1. **Quantify uncertainty:** Not just point estimates, but ranges

2. **Make better decisions:** Account for randomness in sports

3. **Avoid overconfidence:** Recognize that estimates have error

4. **Compare fairly:** Account for different sample sizes

**Monte Carlo** answers: "What might happen?"
- Simulate future games, series, seasons
- Estimate probabilities of outcomes
- Test strategies under uncertainty

**Bootstrap** answers: "How certain are we?"
- Estimate confidence intervals for statistics
- Quantify sampling variability
- Make inferences without strong assumptions

**Together:** Powerful tools for data-driven sports analytics that respect the inherent randomness in athletic performance.

---

## Practical Takeaways

**For analysts:**
- Always report uncertainty, not just point estimates
- Use simulation to explore complex scenarios
- Bootstrap when you need confidence intervals
- More data narrows confidence intervals; more simulations reduce Monte Carlo error

**For decision-makers:**
- Understand that all estimates have error
- Consider the range of plausible values
- Don't overreact to small samples
- Use probabilistic thinking for better outcomes

**Remember:** Randomness is inherent in sports. These tools help us reason about it systematically.