# Advanced Statistics for Sports Analytics: Days 19-20

This notebook compares statistical approaches and builds season-level projection models.

**Topics covered:**
- Day 19: Bayesian vs Frequentist approaches
- Day 20: Simulating full season player stats with uncertainty

**Prerequisites:** Basic Python, numpy, matplotlib, Bayes' theorem, Monte Carlo, bootstrapping.

Let's import our libraries and set up for reproducible results.

In [None]:
# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import beta, norm

# Set random seed for reproducibility
np.random.seed(42)

# Display settings
plt.rcParams['figure.figsize'] = (12, 6)
print("Libraries imported successfully!")
print("Random seed set to 42 for reproducible results")

# Day 19: Bayesian vs Frequentist Approaches

## Overview of Two Approaches

When estimating a player's shooting percentage, we can use two different statistical frameworks.

### Frequentist Estimate

**Definition:** The sample proportion or sample mean calculated directly from observed data.

**For shooting percentage:**
- FG% = Total Makes / Total Attempts
- Example: 40 makes in 100 attempts → 40% FG%

**Key features:**
- Uses only the observed data
- No prior beliefs incorporated
- Uncertainty measured via confidence intervals
- Treats the true parameter as fixed but unknown
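
The confidence interval mentioned in the list above can be sketched for the 40-of-100 example. This is a minimal illustration using the normal-approximation (Wald) interval, which is an assumption here: the notebook does not say which interval it has in mind, and `wald_interval` is a hypothetical helper name.

```python
import math

def wald_interval(makes, attempts, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = makes / attempts
    se = math.sqrt(p_hat * (1 - p_hat) / attempts)
    return p_hat - z * se, p_hat + z * se

# 40 makes in 100 attempts, 95% level (z = 1.96)
ci_low_demo, ci_high_demo = wald_interval(40, 100)
print(f"FG% = 40.0%, 95% CI = [{ci_low_demo:.1%}, {ci_high_demo:.1%}]")
```

For this example the interval runs from roughly 30% to 50%. The Wald interval is known to behave poorly for very small samples or proportions near 0 or 1, so treat it as a starting point only.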

### Bayesian Estimate

**Definition:** The posterior mean (or mode/median) after updating prior beliefs with observed data.

**For shooting percentage:**
- Start with prior: Beta(α₀, β₀)
- Update with data: k makes, n-k misses
- Posterior: Beta(α₀+k, β₀+n-k)
- Posterior mean = (α₀+k) / (α₀+β₀+n)

**Key features:**
- Incorporates prior beliefs
- Updates beliefs with data
- Uncertainty represented by posterior distribution
- Treats the parameter as having a probability distribution

### Simple Example

**Scenario:** Player makes 2 out of 5 shots in their first game.

**Frequentist:** FG% = 2/5 = 40%

**Bayesian (with Beta(20,20) prior):**
- Prior mean: 20/40 = 50%
- Posterior: Beta(22, 23)
- Posterior mean: 22/45 = 48.9%

The Bayesian estimate is pulled toward the prior (50%) because we have limited data.
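
The arithmetic in this example can be checked with a few lines of code. The sketch below assumes the Beta(20, 20) prior from the scenario; `beta_update` is a hypothetical helper name, not something defined elsewhere in the notebook.

```python
def beta_update(alpha_prior, beta_prior, makes, attempts):
    """Return posterior (alpha, beta) after observing `makes` out of `attempts`."""
    misses = attempts - makes
    return alpha_prior + makes, beta_prior + misses

# First game: 2 makes in 5 attempts, starting from Beta(20, 20)
a_post, b_post = beta_update(20, 20, makes=2, attempts=5)
bayes_fg = a_post / (a_post + b_post)
freq_fg = 2 / 5

print(f"Posterior: Beta({a_post}, {b_post})")
print(f"Bayesian posterior mean: {bayes_fg:.1%}")
print(f"Frequentist estimate:    {freq_fg:.1%}")
```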

## Create a Toy Dataset

Let's track a player's shooting over their first 15 games.

In [None]:
# Create game-by-game shooting data
# Each row: (made_shots, attempted_shots)

game_data = [
    (5, 12),   # Game 1: 5/12 = 41.7%
    (7, 15),   # Game 2: 7/15 = 46.7%
    (6, 14),   # Game 3: 6/14 = 42.9%
    (8, 16),   # Game 4: 8/16 = 50.0%
    (4, 10),   # Game 5: 4/10 = 40.0%
    (9, 18),   # Game 6: 9/18 = 50.0%
    (7, 13),   # Game 7: 7/13 = 53.8%
    (6, 15),   # Game 8: 6/15 = 40.0%
    (8, 17),   # Game 9: 8/17 = 47.1%
    (10, 20),  # Game 10: 10/20 = 50.0%
    (5, 11),   # Game 11: 5/11 = 45.5%
    (9, 19),   # Game 12: 9/19 = 47.4%
    (7, 14),   # Game 13: 7/14 = 50.0%
    (8, 16),   # Game 14: 8/16 = 50.0%
    (6, 13),   # Game 15: 6/13 = 46.2%
]

n_games = len(game_data)

print("Player Game-by-Game Shooting Data")
print("="*50)
for i, (made, attempted) in enumerate(game_data, 1):
    fg_pct = made / attempted
    print(f"Game {i:2d}: {made:2d}/{attempted:2d} = {fg_pct:.1%}")

total_made = sum(m for m, a in game_data)
total_attempted = sum(a for m, a in game_data)
overall_fg = total_made / total_attempted

print(f"\nOverall: {total_made}/{total_attempted} = {overall_fg:.1%}")

## Frequentist View: Cumulative Sample Proportion

The frequentist estimate uses only the data observed so far.

After each game, we compute:

FG% = (Total makes so far) / (Total attempts so far)
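
As a compact alternative to an explicit loop, the running totals can be computed with `np.cumsum`. This is a sketch on the first three games of the toy dataset only; the variable names are illustrative.

```python
import numpy as np

# First three games of the toy dataset: (made, attempted)
first_games = np.array([(5, 12), (7, 15), (6, 14)])

cum_makes = np.cumsum(first_games[:, 0])      # running total of makes
cum_attempts = np.cumsum(first_games[:, 1])   # running total of attempts
running_fg = cum_makes / cum_attempts         # cumulative FG% after each game

for game, fg in enumerate(running_fg, 1):
    print(f"After game {game}: {fg:.1%}")
```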

In [None]:
# Compute cumulative frequentist estimates

cumulative_makes = []
cumulative_attempts = []
frequentist_estimates = []

running_makes = 0
running_attempts = 0

for made, attempted in game_data:
    running_makes += made
    running_attempts += attempted
    
    cumulative_makes.append(running_makes)
    cumulative_attempts.append(running_attempts)
    
    # Frequentist estimate: sample proportion
    freq_estimate = running_makes / running_attempts
    frequentist_estimates.append(freq_estimate)

print("Frequentist Estimates After Each Game:")
print("="*60)
print("Game | Cumulative | FG%")
print("-"*60)
for i in range(n_games):
    print(f"{i+1:4d} | {cumulative_makes[i]:3d}/{cumulative_attempts[i]:3d}    | {frequentist_estimates[i]:.1%}")

## Bayesian View: Posterior Mean with Beta Prior

The Bayesian approach starts with a prior belief and updates it with each game.

**Prior:** Beta(20, 20)
- Represents 20 prior "makes" and 20 prior "misses"
- Prior mean = 20/40 = 50%
- Equivalent to having seen 40 shots at 50% shooting

**Update rule:**
- After game with k makes and (n-k) misses
- New α = old α + k
- New β = old β + (n-k)
- Posterior mean = α / (α + β)
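
One consequence of this update rule is that updating game by game gives exactly the same posterior as updating once with the pooled totals. The sketch below checks that on the first three games of the toy dataset; the function names are hypothetical.

```python
def sequential_update(alpha_start, beta_start, games):
    """Apply the Beta update one game at a time."""
    a, b = alpha_start, beta_start
    for made, attempted in games:
        a += made
        b += attempted - made
    return a, b

def batch_update(alpha_start, beta_start, games):
    """Apply the Beta update once, using pooled makes and misses."""
    total_made = sum(made for made, _ in games)
    total_missed = sum(attempted - made for made, attempted in games)
    return alpha_start + total_made, beta_start + total_missed

three_games = [(5, 12), (7, 15), (6, 14)]
seq_result = sequential_update(20, 20, three_games)
batch_result = batch_update(20, 20, three_games)
print(f"Sequential: Beta{seq_result}, batch: Beta{batch_result}")
```

The order of the games does not matter either, because only the total makes and misses enter the posterior.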

In [None]:
# Bayesian estimates with sequential updates

# Prior parameters
alpha_0 = 20
beta_0 = 20
prior_mean = alpha_0 / (alpha_0 + beta_0)

print(f"Prior: Beta({alpha_0}, {beta_0})")
print(f"Prior mean: {prior_mean:.1%}")
print(f"Prior strength: {alpha_0 + beta_0} pseudo-observations\n")

# Track posterior parameters
current_alpha = alpha_0
current_beta = beta_0

bayesian_estimates = []
alpha_history = []
beta_history = []

print("Bayesian Posterior Estimates After Each Game:")
print("="*70)
print("Game | Makes/Att | Posterior      | Post Mean")
print("-"*70)

for i, (made, attempted) in enumerate(game_data, 1):
    missed = attempted - made
    
    # Update posterior
    current_alpha += made
    current_beta += missed
    
    # Compute posterior mean
    posterior_mean = current_alpha / (current_alpha + current_beta)
    
    # Store results
    bayesian_estimates.append(posterior_mean)
    alpha_history.append(current_alpha)
    beta_history.append(current_beta)
    
    print(f"{i:4d} | {made:2d}/{attempted:2d}     | Beta({current_alpha:3d}, {current_beta:3d}) | {posterior_mean:.1%}")

## Exercise: Compare Frequentist and Bayesian Estimates Over Time

Let's visualize how the two approaches differ, especially in early games.

In [None]:
# Plot both estimates on the same graph

game_numbers = np.arange(1, n_games + 1)

plt.figure(figsize=(14, 7))

# Frequentist estimates
plt.plot(game_numbers, frequentist_estimates, 'o-', linewidth=2.5, markersize=8,
         color='blue', label='Frequentist (Sample FG%)', alpha=0.8)

# Bayesian estimates
plt.plot(game_numbers, bayesian_estimates, 's-', linewidth=2.5, markersize=8,
         color='red', label='Bayesian (Posterior Mean)', alpha=0.8)

# Prior mean
plt.axhline(y=prior_mean, color='red', linestyle='--', linewidth=2,
            alpha=0.5, label=f'Prior Mean ({prior_mean:.1%})')

# Final estimate line
plt.axhline(y=frequentist_estimates[-1], color='gray', linestyle=':', linewidth=1.5,
            alpha=0.5, label=f'Final Estimate ({frequentist_estimates[-1]:.1%})')

plt.xlabel('Game Number', fontsize=13)
plt.ylabel('Field Goal Percentage', fontsize=13)
plt.title('Frequentist vs Bayesian FG% Estimates Over Time', fontsize=15, fontweight='bold')
plt.legend(fontsize=11, loc='best')
plt.grid(alpha=0.3)
plt.xticks(game_numbers)
plt.ylim(0.35, 0.55)
plt.tight_layout()
plt.show()

### Comparing the Two Approaches

**How each method treats uncertainty:**

*Frequentist:*
- Point estimate is the sample proportion
- Uncertainty measured by confidence intervals
- No probability distribution on the parameter itself
- Intervals have a frequency interpretation: "95% of such intervals contain the true value"

*Bayesian:*
- Entire posterior distribution represents uncertainty
- Can make probability statements about the parameter: "95% probability FG% is in this range"
- Credible intervals are often easier to interpret, because they are direct probability statements about the parameter
- Uncertainty shrinks as data accumulates (posterior gets narrower)
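
To make the two kinds of interval concrete, the sketch below computes both for the full 15-game totals (105 makes in 223 attempts). It assumes a Wald interval for the frequentist side and approximates the credible interval from posterior draws; exact quantiles are available from `beta.ppf`, which this notebook uses elsewhere.

```python
import math
import numpy as np

makes, attempts = 105, 223      # 15-game totals from the toy dataset
prior_a, prior_b = 20, 20       # Beta(20, 20) prior used in this notebook

# Frequentist: 95% Wald confidence interval around the sample proportion
p_hat = makes / attempts
wald_half = 1.96 * math.sqrt(p_hat * (1 - p_hat) / attempts)
wald_low, wald_high = p_hat - wald_half, p_hat + wald_half

# Bayesian: 95% equal-tailed credible interval, approximated from posterior draws
rng = np.random.default_rng(0)
draws = rng.beta(prior_a + makes, prior_b + attempts - makes, size=200_000)
cred_low, cred_high = np.percentile(draws, [2.5, 97.5])

print(f"95% confidence interval: [{wald_low:.1%}, {wald_high:.1%}]")
print(f"95% credible interval:   [{cred_low:.1%}, {cred_high:.1%}]")
```

The credible interval comes out slightly narrower here because the prior contributes 40 pseudo-observations on top of the 223 real shots.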

**How prior beliefs influence early games:**

*Early games (Games 1-5):*
- Frequentist estimate jumps around based purely on observed data
- After Game 1: 5/12 = 41.7% (ignores what we know about typical shooters)
- Bayesian estimate starts at 50% (prior) and moves gradually toward data
- After Game 1: (20+5)/(40+12) = 48.1% (more stable, less reactive)

*Why this matters:*
- With limited data, the sample proportion can be extreme (a 1-for-5 game gives 20%)
- The prior acts as regularization, pulling estimates toward a plausible value
- The Bayesian estimate stays stable in small samples, provided the prior is reasonable

**How both approaches behave with lots of data:**

*Late games (Games 10-15):*
- Both estimates converge to similar values
- After 15 games: Frequentist = 105/223 = 47.1%, Bayesian = 125/263 = 47.5%
- Prior influence fades: 40 pseudo-observations vs 223 real observations
- Data dominates the prior when sample size is large

*Mathematical insight:*

Bayesian posterior mean = (α₀ + total_makes) / (α₀ + β₀ + total_attempts)

As total_attempts → ∞, this approaches total_makes / total_attempts (frequentist estimate)
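
The same formula can be rewritten as a weighted average of the prior mean and the sample proportion, where the prior's weight is (α₀ + β₀) / (α₀ + β₀ + total_attempts). The sketch below shows that weight shrinking as attempts grow. The first two rows use the toy dataset (Game 1 and all 15 games); the third row is a made-up tenfold-larger sample for illustration.

```python
def posterior_mean_as_blend(alpha0, beta0, makes, attempts):
    """Return (prior weight, posterior mean) using the weighted-average form."""
    prior_strength = alpha0 + beta0
    prior_center = alpha0 / prior_strength
    sample_fg = makes / attempts
    prior_weight = prior_strength / (prior_strength + attempts)
    blended = prior_weight * prior_center + (1 - prior_weight) * sample_fg
    return prior_weight, blended

for makes, attempts in [(5, 12), (105, 223), (1050, 2230)]:
    weight, blended = posterior_mean_as_blend(20, 20, makes, attempts)
    print(f"{attempts:5d} attempts: prior weight = {weight:.1%}, "
          f"posterior mean = {blended:.1%}, sample FG% = {makes / attempts:.1%}")
```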

**Practical implications:**
- Early season: Bayesian helps avoid overreacting to hot/cold streaks
- Late season: Both methods give similar results
- Bayesian naturally balances prior knowledge with new data

## Extension: Visualize Posterior Distribution Evolution

Let's see how the posterior distribution changes and uncertainty shrinks.

In [None]:
# Plot posterior distributions at different points

games_to_plot = [0, 1, 5, 15]  # 0 = prior, before any games
x = np.linspace(0.3, 0.7, 1000)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

for idx, game_num in enumerate(games_to_plot):
    ax = axes[idx]
    
    if game_num == 0:
        # Prior
        alpha_plot = alpha_0
        beta_plot = beta_0
        title = "Prior (Before Any Games)"
    else:
        # Posterior after game_num games
        alpha_plot = alpha_history[game_num - 1]
        beta_plot = beta_history[game_num - 1]
        title = f"Posterior After Game {game_num}"
    
    # Plot Beta distribution
    pdf = beta.pdf(x, alpha_plot, beta_plot)
    ax.plot(x, pdf, linewidth=2.5, color='darkblue')
    ax.fill_between(x, pdf, alpha=0.3, color='skyblue')
    
    # Mark the mean
    mean_val = alpha_plot / (alpha_plot + beta_plot)
    ax.axvline(mean_val, color='red', linestyle='--', linewidth=2,
               label=f'Mean: {mean_val:.1%}')
    
    # 90% credible interval
    ci_lower = beta.ppf(0.05, alpha_plot, beta_plot)
    ci_upper = beta.ppf(0.95, alpha_plot, beta_plot)
    ax.axvline(ci_lower, color='green', linestyle=':', linewidth=1.5, alpha=0.6)
    ax.axvline(ci_upper, color='green', linestyle=':', linewidth=1.5, alpha=0.6,
               label=f'90% CI: [{ci_lower:.1%}, {ci_upper:.1%}]')
    
    ax.set_xlabel('Field Goal Percentage', fontsize=11)
    ax.set_ylabel('Density', fontsize=11)
    ax.set_title(f'{title}\nBeta({alpha_plot}, {beta_plot})', fontsize=12, fontweight='bold')
    ax.legend(fontsize=9)
    ax.grid(alpha=0.3)
    ax.set_xlim(0.3, 0.7)

plt.tight_layout()
plt.show()

print("Notice how:")
print("1. The distribution gets narrower (less uncertainty) as games increase")
print("2. The mean shifts from the prior toward the data")
print("3. The 90% credible interval shrinks significantly")

In [None]:
# Quantify uncertainty reduction

print("Uncertainty Reduction Over Time:")
print("="*70)
print("Game | Posterior      | Mean  | 90% CI Width")
print("-"*70)

# Prior
prior_ci_lower = beta.ppf(0.05, alpha_0, beta_0)
prior_ci_upper = beta.ppf(0.95, alpha_0, beta_0)
prior_width = prior_ci_upper - prior_ci_lower
print(f"Prior | Beta({alpha_0:3d}, {beta_0:3d}) | {prior_mean:.1%} | {prior_width:.1%}")

# After selected games
for game_num in [1, 5, 10, 15]:
    alpha_g = alpha_history[game_num - 1]
    beta_g = beta_history[game_num - 1]
    mean_g = alpha_g / (alpha_g + beta_g)
    ci_lower_g = beta.ppf(0.05, alpha_g, beta_g)
    ci_upper_g = beta.ppf(0.95, alpha_g, beta_g)
    width_g = ci_upper_g - ci_lower_g
    
    print(f"{game_num:5d} | Beta({alpha_g:3d}, {beta_g:3d}) | {mean_g:.1%} | {width_g:.1%}")

print("\nThe credible interval width decreases as data accumulates.")
print("This shows decreasing uncertainty about the true FG%.")

# Day 20: Simulate Full Season Player Stats

## Parameter Uncertainty

When projecting season-level performance, we must account for two types of randomness.

### Two Sources of Variability

**1. Game-to-game randomness (aleatory uncertainty):**
- Even if we knew the player's true mean, individual games vary
- A player averaging 25 PPG might score 18 one game and 32 the next
- This is inherent randomness in performance
- Represented by standard deviation σ_game

**2. Uncertainty about the true mean (epistemic uncertainty):**
- We don't know the player's "true" average with certainty
- Based on limited data, the mean could be 24, 25, or 26 PPG
- This uncertainty decreases as we observe more games
- Represented by posterior distribution of μ

### Example: Points Per Game

**Scenario:** Player has averaged 24.5 PPG over 20 games with SD of 6 points.

**Game-to-game randomness:**
- Given true mean μ = 24.5
- Game points ~ Normal(24.5, 6)
- This variability exists even if we knew μ perfectly

**Uncertainty about the mean:**
- True μ might not be exactly 24.5
- Could be 23, 24.5, or 26 based on our data
- Posterior: μ ~ Normal(24.5, τ) where τ depends on sample size
- Standard error: τ = σ_game / √n = 6 / √20 ≈ 1.34
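
The standard error formula shows directly how uncertainty about the mean shrinks with more games. A minimal sketch, assuming σ_game = 6 as in the scenario; `standard_error` is a hypothetical helper name.

```python
import math

def standard_error(sigma_per_game, games_observed):
    """Standard error of the mean: sigma / sqrt(n)."""
    return sigma_per_game / math.sqrt(games_observed)

for games_observed in (5, 20, 82):
    se = standard_error(6.0, games_observed)
    print(f"n = {games_observed:2d} games: tau = {se:.2f} PPG")
```

Quadrupling the number of observed games halves the uncertainty about the mean.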

## Bayesian Modeling for Season Projections

**The key insight:** The posterior distribution for the mean can feed into season simulations.

**Standard approach (ignoring parameter uncertainty):**
1. Estimate mean from data: μ̂ = 24.5
2. Simulate season: Each game ~ Normal(24.5, 6)
3. Problem: Treats 24.5 as the true mean (overconfident)

**Bayesian approach (accounting for parameter uncertainty):**
1. Posterior for mean: μ ~ Normal(24.5, 1.34)
2. Simulate season:
   - First, draw μ_sim from posterior
   - Then, simulate games given μ_sim: Each game ~ Normal(μ_sim, 6)
3. Benefit: Accounts for uncertainty in the true mean

**Result:** Season outcomes have wider variability, reflecting our uncertainty.

## Set Up a Simple Model

Let's use a concrete example with made-up data.

In [None]:
# Create fake season data (20 games so far)

np.random.seed(100)
n_games_observed = 20
true_mean_ppg = 24.5  # Unknown in practice
sigma_game = 6.0      # Game-to-game standard deviation

# Generate observed game points
observed_points = np.random.normal(true_mean_ppg, sigma_game, size=n_games_observed)
observed_points = np.maximum(observed_points, 0)  # No negative points

# Compute sample statistics
sample_mean = np.mean(observed_points)
sample_std = np.std(observed_points, ddof=1)

print("Observed Data (First 20 Games):")
print("="*50)
print(f"Number of games: {n_games_observed}")
print(f"Sample mean: {sample_mean:.2f} PPG")
print(f"Sample std: {sample_std:.2f} points")
print(f"\nFirst 10 games: {observed_points[:10].round(1)}")

In [None]:
# Derive posterior for the mean
# Using Normal prior and Normal likelihood → Normal posterior

# For simplicity, assume:
# - Game points are Normal(μ, σ_game) with σ_game known
# - Weak prior on μ: Normal(25, 10) [very uncertain prior]
# - Posterior for μ is Normal(μ_post, τ_post)

# Prior parameters
prior_mu = 25.0
prior_tau = 10.0

# Known game-to-game SD (assume we know this)
sigma_game_known = 6.0

# Posterior parameters (Normal-Normal conjugacy)
# Precision (inverse variance)
prior_precision = 1 / (prior_tau**2)
data_precision = n_games_observed / (sigma_game_known**2)

posterior_precision = prior_precision + data_precision
posterior_tau = 1 / np.sqrt(posterior_precision)

# Posterior mean is weighted average
posterior_mu = (prior_precision * prior_mu + data_precision * sample_mean) / posterior_precision

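# The exact conjugate posterior is computed above; print it for reference
print(f"Exact conjugate posterior: Normal({posterior_mu:.2f}, {posterior_tau:.2f})\n")

# Because the prior is weak, it is nearly identical to the flat-prior result:
# mean = sample mean, SD = standard error of the mean
se_mean = sigma_game_known / np.sqrt(n_games_observed)

# The rest of the notebook uses this simpler approximation as the posterior
mu_post = sample_mean
tau_post = se_mean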

print("Posterior Distribution for Mean PPG:")
print("="*50)
print(f"Posterior: μ ~ Normal({mu_post:.2f}, {tau_post:.2f})")
print(f"\nPosterior mean: {mu_post:.2f} PPG")
print(f"Posterior std (uncertainty about mean): {tau_post:.2f}")
print(f"\n95% credible interval for true mean:")
print(f"  [{mu_post - 1.96*tau_post:.2f}, {mu_post + 1.96*tau_post:.2f}]")

## Simulate One Full Season

Here's how to generate one possible season outcome.

In [None]:
# Function to simulate one season

def simulate_season(n_games_season, mu_post, tau_post, sigma_game):
    """
    Simulate one full season accounting for parameter uncertainty.
    
    Parameters:
    - n_games_season: number of games in the season (e.g., 82)
    - mu_post: posterior mean for true average PPG
    - tau_post: posterior standard deviation (uncertainty about mean)
    - sigma_game: game-to-game standard deviation
    
    Returns:
    - season_avg: average PPG for this simulated season
    - season_total: total points for this simulated season
    """
    # Step 1: Draw a value for the true mean from posterior
    mu_sim = np.random.normal(mu_post, tau_post)
    
    # Step 2: Simulate each game given this mean
    game_points = np.random.normal(mu_sim, sigma_game, size=n_games_season)
    game_points = np.maximum(game_points, 0)  # No negative points
    
    # Step 3: Compute season statistics
    season_avg = np.mean(game_points)
    season_total = np.sum(game_points)
    
    return season_avg, season_total

# Test the function
np.random.seed(200)
test_avg, test_total = simulate_season(82, mu_post, tau_post, sigma_game_known)

print("Example: One Simulated Season")
print("="*50)
print(f"Simulated season average: {test_avg:.2f} PPG")
print(f"Simulated season total: {test_total:.0f} points")
print("\nThis is one possible outcome given our uncertainty.")

## Exercise: Generate Many Season Outcomes

Let's simulate 5,000 possible seasons to understand the distribution of outcomes.

In [None]:
# Simulate many seasons

n_simulations = 5000
n_games_season = 82

season_averages = []
season_totals = []

np.random.seed(300)
for i in range(n_simulations):
    avg, total = simulate_season(n_games_season, mu_post, tau_post, sigma_game_known)
    season_averages.append(avg)
    season_totals.append(total)

season_averages = np.array(season_averages)
season_totals = np.array(season_totals)

print(f"Generated {n_simulations} simulated seasons")
print(f"\nSummary of Season Averages:")
print(f"  Mean: {np.mean(season_averages):.2f} PPG")
print(f"  Std: {np.std(season_averages):.2f}")
print(f"  Min: {np.min(season_averages):.2f} PPG")
print(f"  Max: {np.max(season_averages):.2f} PPG")

print(f"\nSummary of Season Totals:")
print(f"  Mean: {np.mean(season_totals):.0f} points")
print(f"  Std: {np.std(season_totals):.0f}")
print(f"  Min: {np.min(season_totals):.0f} points")
print(f"  Max: {np.max(season_totals):.0f} points")

In [None]:
# Plot histograms of simulated outcomes

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Season averages
ax1.hist(season_averages, bins=50, color='steelblue', edgecolor='black', alpha=0.7)
ax1.axvline(25, color='red', linestyle='--', linewidth=2.5, label='25 PPG threshold')
ax1.axvline(np.mean(season_averages), color='green', linestyle='-', linewidth=2,
            label=f'Mean: {np.mean(season_averages):.2f}')
ax1.set_xlabel('Season Average (PPG)', fontsize=12)
ax1.set_ylabel('Frequency', fontsize=12)
ax1.set_title('Distribution of Simulated Season Averages', fontsize=13, fontweight='bold')
ax1.legend(fontsize=11)
ax1.grid(axis='y', alpha=0.3)

# Season totals
ax2.hist(season_totals, bins=50, color='coral', edgecolor='black', alpha=0.7)
ax2.axvline(2000, color='red', linestyle='--', linewidth=2.5, label='2000 points threshold')
ax2.axvline(np.mean(season_totals), color='green', linestyle='-', linewidth=2,
            label=f'Mean: {np.mean(season_totals):.0f}')
ax2.set_xlabel('Season Total Points', fontsize=12)
ax2.set_ylabel('Frequency', fontsize=12)
ax2.set_title('Distribution of Simulated Season Totals', fontsize=13, fontweight='bold')
ax2.legend(fontsize=11)
ax2.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Estimate probabilities from simulations

# P(Season average >= 25 PPG)
prob_avg_25 = np.mean(season_averages >= 25)

# P(Season total >= 2000 points)
prob_total_2000 = np.mean(season_totals >= 2000)

print("Estimated Probabilities from Simulation:")
print("="*50)
print(f"P(Avg PPG >= 25): {prob_avg_25:.3f} ({prob_avg_25*100:.1f}%)")
print(f"  Out of {n_simulations} seasons, {np.sum(season_averages >= 25)} had avg >= 25")

print(f"\nP(Total >= 2000 pts): {prob_total_2000:.3f} ({prob_total_2000*100:.1f}%)")
print(f"  Out of {n_simulations} seasons, {np.sum(season_totals >= 2000)} had total >= 2000")

# Additional percentiles
print(f"\nPercentiles of Season Average:")
print(f"  10th: {np.percentile(season_averages, 10):.2f} PPG")
print(f"  25th: {np.percentile(season_averages, 25):.2f} PPG")
print(f"  50th: {np.percentile(season_averages, 50):.2f} PPG")
print(f"  75th: {np.percentile(season_averages, 75):.2f} PPG")
print(f"  90th: {np.percentile(season_averages, 90):.2f} PPG")

### How Uncertainty in True Mean Affects Season Outcomes

**Key observations:**

**1. Wider distribution than naive simulation:**

If we simulated with a fixed mean (ignoring uncertainty):
- SD of season averages ≈ σ_game / √82 = 6 / 9.1 ≈ 0.66

With parameter uncertainty:
- SD of season averages ≈ √(τ_post² + σ_game²/82)
- SD ≈ √(1.34² + 0.66²) ≈ 1.50

The distribution is about 2.3× wider (1.50 / 0.66) when we account for parameter uncertainty.
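
The arithmetic above can be reproduced directly from the variance decomposition. This sketch assumes the scenario's values (σ_game = 6, 20 observed games, 82-game season) and ignores the truncation of negative scores at zero, which is negligible at these values.

```python
import math

sigma_per_game = 6.0
games_observed = 20
games_in_season = 82

tau = sigma_per_game / math.sqrt(games_observed)           # uncertainty about the mean
sampling_sd = sigma_per_game / math.sqrt(games_in_season)  # fixed-mean SD of a season average
total_sd = math.sqrt(tau**2 + sampling_sd**2)              # both sources combined

width_ratio = total_sd / sampling_sd
parameter_share = tau**2 / total_sd**2

print(f"Fixed-mean SD:            {sampling_sd:.2f}")
print(f"With parameter uncertainty: {total_sd:.2f}")
print(f"Ratio of SDs:             {width_ratio:.2f}")
print(f"Variance share from tau:  {parameter_share:.0%}")
```

Under these assumptions the squared ratio equals 1 + 82/20, so the widening depends only on how the season length compares with the number of games already observed.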

**2. Two sources of variability:**

*Parameter uncertainty (τ_post = 1.34):*
- Different seasons might have different "true" averages
- One season the true mean might be 23.5, another might be 25.5
- This persists across the entire season

*Game-to-game randomness (σ_game = 6):*
- Even with fixed true mean, games vary
- But averages over 82 games: variation is σ/√82 ≈ 0.66

**3. Parameter uncertainty dominates for season averages:**
- τ_post = 1.34 is about twice σ_game/√82 = 0.66
- In variance terms, 1.34² / (1.34² + 0.66²) ≈ 80% of the spread in season averages comes from not knowing the true mean
- More historical data → smaller τ_post → tighter projections

**Practical implications:**
- Early career players: Large τ_post → wide range of season outcomes
- Established veterans: Small τ_post → narrow range (more predictable)
- Projections should reflect our uncertainty level

### Comparison: With vs Without Parameter Uncertainty

In [None]:
# Compare to simulation with fixed mean (no parameter uncertainty)

np.random.seed(400)
n_sim_comparison = 5000

# Simulation 1: With parameter uncertainty (what we did)
season_avg_with_uncertainty = season_averages

# Simulation 2: Fixed mean (ignoring uncertainty)
season_avg_fixed_mean = []
for i in range(n_sim_comparison):
    # Use sample mean as if it were the true mean
    game_pts = np.random.normal(mu_post, sigma_game_known, size=n_games_season)
    game_pts = np.maximum(game_pts, 0)
    season_avg_fixed_mean.append(np.mean(game_pts))

season_avg_fixed_mean = np.array(season_avg_fixed_mean)

# Compare
print("Comparison: Parameter Uncertainty vs Fixed Mean")
print("="*60)
print(f"\nWith parameter uncertainty:")
print(f"  Mean: {np.mean(season_avg_with_uncertainty):.2f} PPG")
print(f"  Std: {np.std(season_avg_with_uncertainty):.2f}")
print(f"  95% interval: [{np.percentile(season_avg_with_uncertainty, 2.5):.2f}, "
      f"{np.percentile(season_avg_with_uncertainty, 97.5):.2f}]")

print(f"\nFixed mean (no uncertainty):")
print(f"  Mean: {np.mean(season_avg_fixed_mean):.2f} PPG")
print(f"  Std: {np.std(season_avg_fixed_mean):.2f}")
print(f"  95% interval: [{np.percentile(season_avg_fixed_mean, 2.5):.2f}, "
      f"{np.percentile(season_avg_fixed_mean, 97.5):.2f}]")

print(f"\nRatio of standard deviations: "
      f"{np.std(season_avg_with_uncertainty) / np.std(season_avg_fixed_mean):.2f}")
print("\nParameter uncertainty adds substantial additional variability!")

In [None]:
# Visualize the comparison

plt.figure(figsize=(12, 6))

plt.hist(season_avg_fixed_mean, bins=40, alpha=0.5, color='blue', 
         label='Fixed mean (no parameter uncertainty)', edgecolor='black')
plt.hist(season_avg_with_uncertainty, bins=40, alpha=0.5, color='red',
         label='With parameter uncertainty', edgecolor='black')

plt.axvline(mu_post, color='black', linestyle='--', linewidth=2,
            label=f'Estimated mean: {mu_post:.2f}')

plt.xlabel('Season Average (PPG)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Impact of Parameter Uncertainty on Season Projections', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

print("The red distribution (with uncertainty) is wider.")
print("This better reflects the true range of possible season outcomes.")

### How This Supports Projections and Risk Analysis

**1. More realistic uncertainty quantification:**
- Traditional projections often underestimate uncertainty
- Accounting for parameter uncertainty gives wider, more honest ranges
- Better for decision-making under uncertainty

**2. Risk assessment:**
- "What's the probability the player fails to average 25 PPG?"
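- Both simulations are centered on the same estimate, so the two probabilities land on the same side of 50%
- The fixed-mean simulation gives the more extreme value (closer to 0% or 100%), which overstates how sure we can be
- Compare `np.mean(season_averages < 25)` with `np.mean(season_avg_fixed_mean < 25)` to see the gap for this dataset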

**3. Contract negotiations:**
- Player wants contract based on optimistic projection
- Team needs to account for downside risk
- Simulation shows full distribution of plausible outcomes

**4. Roster construction:**
- Need to hit certain point totals for playoffs
- Can estimate: P(combined team total > threshold)
- A full version must model each player's uncertainty and any correlation between teammates (the simulations in this notebook treat players independently)

**5. Early vs late season:**
- Early season: Large τ_post → wide projections
- Late season: Small τ_post → narrow projections
- Automatically adjusts as more data arrives
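
The roster-construction idea in point 4 can be sketched as follows. Everything specific here is an assumption for illustration: the three-player roster, its parameters, and the 4,500-point threshold are made up, and players are simulated independently (no correlation between teammates).

```python
import numpy as np

def simulate_team_totals(players, n_games=82, n_sims=5000, seed=0):
    """Simulate combined season totals for independent players.

    players: list of (posterior mean, posterior SD, game-to-game SD) tuples.
    """
    rng = np.random.default_rng(seed)
    team_totals = np.zeros(n_sims)
    for mu, tau, sigma in players:
        # One draw of the "true" mean per simulated season
        true_means = rng.normal(mu, tau, size=n_sims)
        # 82 games per simulated season, given that season's true mean
        games = rng.normal(true_means[:, None], sigma, size=(n_sims, n_games))
        team_totals += np.maximum(games, 0).sum(axis=1)  # no negative points
    return team_totals

# Hypothetical roster: (mu_post, tau_post, sigma_game) for each player
hypothetical_roster = [(24.5, 1.3, 6.0), (18.0, 2.0, 5.0), (12.0, 1.5, 4.0)]
team_totals = simulate_team_totals(hypothetical_roster)

threshold = 4500  # hypothetical combined-points target
prob_over = np.mean(team_totals > threshold)
print(f"Mean combined total: {team_totals.mean():.0f} points")
print(f"P(combined total > {threshold}): {prob_over:.3f}")
```

Positively correlated teammates (for example, through shared pace or injuries) would widen the team distribution beyond what this independent sketch shows.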

## Extension: Compare Two Players

Simulate seasons for two players and estimate the probability one outscores the other.

In [None]:
# Define two players

# Player A: More data, more certain
player_a_mu_post = 24.5
player_a_tau_post = 1.0   # Lower uncertainty
player_a_sigma_game = 6.0

# Player B: Less data, less certain, slightly higher estimate
player_b_mu_post = 25.0
player_b_tau_post = 2.5   # Higher uncertainty (less data)
player_b_sigma_game = 7.0

print("Two Player Comparison:")
print("="*50)
print(f"Player A: μ ~ N({player_a_mu_post:.1f}, {player_a_tau_post:.1f}), σ_game = {player_a_sigma_game}")
print(f"Player B: μ ~ N({player_b_mu_post:.1f}, {player_b_tau_post:.1f}), σ_game = {player_b_sigma_game}")
print("\nPlayer B has slightly higher mean but much more uncertainty.")

In [None]:
# Simulate seasons for both players

np.random.seed(500)
n_sim_compare = 10000

player_a_totals = []
player_b_totals = []

for i in range(n_sim_compare):
    # Player A
    _, total_a = simulate_season(82, player_a_mu_post, player_a_tau_post, player_a_sigma_game)
    player_a_totals.append(total_a)
    
    # Player B
    _, total_b = simulate_season(82, player_b_mu_post, player_b_tau_post, player_b_sigma_game)
    player_b_totals.append(total_b)

player_a_totals = np.array(player_a_totals)
player_b_totals = np.array(player_b_totals)

# Count how often each player has more total points
player_a_wins = np.sum(player_a_totals > player_b_totals)
player_b_wins = np.sum(player_b_totals > player_a_totals)
ties = n_sim_compare - player_a_wins - player_b_wins

prob_a_wins = player_a_wins / n_sim_compare
prob_b_wins = player_b_wins / n_sim_compare

print("Season Total Points Comparison:")
print("="*50)
print(f"Player A mean total: {np.mean(player_a_totals):.0f} points")
print(f"Player B mean total: {np.mean(player_b_totals):.0f} points")

print(f"\nP(Player A total > Player B total): {prob_a_wins:.3f} ({prob_a_wins*100:.1f}%)")
print(f"P(Player B total > Player A total): {prob_b_wins:.3f} ({prob_b_wins*100:.1f}%)")

print("\nPlayer B's higher estimated mean gives B the edge,")
print(f"but Player A still outscores Player B in {prob_a_wins*100:.0f}% of simulated seasons")
print("because Player B's true mean is so uncertain.")

In [None]:
# Visualize the distributions

plt.figure(figsize=(12, 6))

plt.hist(player_a_totals, bins=50, alpha=0.5, color='blue', 
         label=f'Player A (μ={player_a_mu_post:.1f}, τ={player_a_tau_post:.1f})', 
         edgecolor='black', density=True)
plt.hist(player_b_totals, bins=50, alpha=0.5, color='red',
         label=f'Player B (μ={player_b_mu_post:.1f}, τ={player_b_tau_post:.1f})',
         edgecolor='black', density=True)

plt.axvline(np.mean(player_a_totals), color='blue', linestyle='--', linewidth=2)
plt.axvline(np.mean(player_b_totals), color='red', linestyle='--', linewidth=2)

plt.xlabel('Season Total Points', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.title('Comparing Two Players: Season Total Points Distribution', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

print("Player B's distribution is wider due to greater uncertainty.")
print("More data on Player A leads to more consistent projections.")
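
Under the Normal model used in this section, the probability that Player A outscores Player B also has a closed form, which gives a useful check on the simulation. The sketch below reuses the two players' parameters from this section and ignores the truncation of negative scores at zero, which is negligible at these values; `normal_cdf` is a hypothetical helper.

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

games_in_season = 82
a_mu, a_tau, a_sigma = 24.5, 1.0, 6.0   # Player A
b_mu, b_tau, b_sigma = 25.0, 2.5, 7.0   # Player B

# Difference in season averages (A minus B) is Normal with:
diff_mean = a_mu - b_mu
diff_var = a_tau**2 + b_tau**2 + (a_sigma**2 + b_sigma**2) / games_in_season
diff_sd = math.sqrt(diff_var)

# P(A outscores B) = P(difference > 0)
prob_a_ahead = 1 - normal_cdf((0 - diff_mean) / diff_sd)
print(f"Analytic P(A total > B total): {prob_a_ahead:.3f}")
```

The result is about 43%: Player B's higher mean keeps B the favorite, and the large combined uncertainty keeps the probability fairly close to 50%.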

# Summary: Bayesian vs Frequentist and Season Simulations

## Bayesian vs Frequentist for Shooting Percentage

**Frequentist approach:**
- Estimate: Sample proportion (makes / attempts)
- Uses only observed data
- Can be unstable with limited data
- Confidence intervals based on sampling distribution

**Bayesian approach:**
- Estimate: Posterior mean from Beta distribution
- Combines prior beliefs with data
- More stable with limited data (prior acts as regularization)
- Credible intervals from posterior distribution

**Key differences:**
- Early games: Bayesian stays closer to prior, frequentist jumps around
- Late games: Both converge to similar values
- Bayesian estimates stay stable in small samples, provided the prior is reasonable
- Frequentist treats the parameter as fixed but unknown; Bayesian describes uncertainty about it with a probability distribution

---

## Posterior Updating Game by Game

**The update process:**
1. Start with prior Beta(α₀, β₀)
2. After each game: Add makes to α, add misses to β
3. New posterior becomes prior for next game
4. Posterior mean = α / (α + β)

**How beliefs change:**
- Game 1: Posterior close to prior (limited data)
- Game 5: Posterior between prior and data
- Game 15: Posterior dominated by data

**Uncertainty evolution:**
- Posterior distribution gets narrower over time
- Credible intervals shrink as data accumulates
- More games → more certainty about true FG%

---

## Simulating Full Seasons with Uncertainty

**Two types of uncertainty:**

*Game-to-game randomness:*
- Individual games vary even with known mean
- Represented by σ_game
- Largely averages out over 82 games (σ_game/√82 ≈ 0.66)

*Parameter uncertainty:*
- Don't know the true mean exactly
- Represented by posterior distribution for μ
- Doesn't average out (systematic across season)

**Simulation procedure:**
1. Draw true mean from posterior: μ_sim ~ N(μ_post, τ_post)
2. Simulate games given that mean: Points ~ N(μ_sim, σ_game)
3. Compute season average and total
4. Repeat thousands of times

**Result:** Distribution of season outcomes accounting for both uncertainties
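
The four steps above can also be written without an explicit loop. This is a vectorized sketch using NumPy's `Generator` API rather than the seeded global state used in the notebook's cells; the function name and the example values (24.5, 1.34, 6.0) are assumptions taken from the worked scenario.

```python
import numpy as np

def simulate_season_averages(mu_center, tau, sigma, n_games=82, n_sims=5000, seed=0):
    """Return one simulated season average per simulation."""
    rng = np.random.default_rng(seed)
    # Step 1: draw a true mean for each simulated season
    true_means = rng.normal(mu_center, tau, size=n_sims)
    # Step 2: simulate every game of every season given that season's mean
    games = rng.normal(true_means[:, None], sigma, size=(n_sims, n_games))
    games = np.maximum(games, 0)  # no negative points
    # Step 3: season average per simulation (step 4 is the n_sims dimension)
    return games.mean(axis=1)

sim_avgs = simulate_season_averages(24.5, 1.34, 6.0)
print(f"Mean of season averages: {sim_avgs.mean():.2f} PPG")
print(f"SD of season averages:   {sim_avgs.std():.2f}")
```

The SD should land near √(1.34² + 6²/82) ≈ 1.5, matching the variance decomposition.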

---

## Applications to Season Performance Questions

**Probability estimation:**
- "What's P(player averages ≥ 25 PPG)?"
- Count simulated seasons meeting criterion
- Accounts for all sources of uncertainty
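
A probability estimated by counting simulated seasons carries its own Monte Carlo error. A rough sketch, assuming independent simulations; the 0.40 below is a placeholder value, not a result from this notebook.

```python
import math

def monte_carlo_se(p_hat, n_sims):
    """Approximate standard error of a simulated probability estimate."""
    return math.sqrt(p_hat * (1 - p_hat) / n_sims)

placeholder_prob = 0.40  # illustrative value only
for n_sims in (500, 5000, 50000):
    se = monte_carlo_se(placeholder_prob, n_sims)
    print(f"{n_sims:6d} simulations: estimate ± {1.96 * se:.3f} (95% range)")
```

With 5,000 simulations the estimate is good to about ±1.4 percentage points, which is usually small next to the modeling uncertainty itself.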

**Comparison:**
- "Which player will score more this season?"
- Simulate both, compare outcomes
- The higher estimated mean gives the edge; greater uncertainty (higher τ) pulls the probability toward 50%

**Projections:**
- Not just point estimate (e.g., 24.5 PPG)
- Full distribution showing range of plausible outcomes
- Different percentiles for optimistic/pessimistic scenarios

**Risk assessment:**
- Understand downside risk
- Account for uncertainty in decision-making
- More realistic than treating estimates as certain

**Why this matters:**
- Traditional projections: Often too confident (narrow ranges)
- Bayesian simulations: Honest about uncertainty
- Better for high-stakes decisions (contracts, trades, draft)
- Naturally updates as season progresses (more data → less uncertainty)

---

## Key Takeaways

1. **Bayesian and frequentist converge with data** but differ importantly early on

2. **Prior beliefs matter when data is limited** but fade with accumulation

3. **Parameter uncertainty can outweigh game-to-game randomness** in season averages, because game noise averages out over 82 games while uncertainty about the mean does not

4. **Simulations accounting for uncertainty** give more realistic projections

5. **These methods support better decision-making** by quantifying the full range of plausible outcomes

**The Bayesian framework provides a principled way to:**
- Incorporate prior knowledge
- Update beliefs with data
- Propagate uncertainty through predictions
- Make probabilistic statements about future performance