# Probability and Statistics Mini Course: Days 4-5

This notebook continues our probability journey with:
- Day 4: Law of total probability and independence
- Day 5: Simulating probability distributions and Monte Carlo

**Prerequisites:** You should already know about random variables, PMF, PDF, CDF, expected value, variance, covariance, joint and conditional probability.

Let's start by importing our libraries.

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Set random seed for reproducibility
np.random.seed(42)

# Display settings
plt.rcParams['figure.figsize'] = (10, 6)
print("Libraries imported successfully!")

# Day 4: Law of Total Probability and Independence

## Partition of the Sample Space

A **partition** divides all possible outcomes into distinct groups that:
- Cover all possibilities (nothing is left out)
- Do not overlap (each outcome belongs to exactly one group)

**Example:** Basketball lineups can partition games into:
- Small lineup games
- Big lineup games

Every game uses exactly one lineup type. The groups cover all games and don't overlap.

## Law of Total Probability

The law of total probability lets us compute the probability of an event by breaking it down across a partition.

**Statement:** If the cases B₁, B₂, ..., Bₙ partition the sample space and each has P(Bᵢ) > 0, then:

P(A) = P(A|B₁)·P(B₁) + P(A|B₂)·P(B₂) + ... + P(A|Bₙ)·P(Bₙ)

**Compact form:**

P(A) = Σᵢ P(A|Bᵢ)·P(Bᵢ), summing over i = 1, ..., n

**Interpretation:** Calculate the probability of A by considering each possible case, weighting by how often that case occurs.
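
The formula can be wrapped in a small helper that also checks the partition condition. This is a minimal sketch: the function name `total_probability` and the three-case numbers are hypothetical, chosen only for illustration.

```python
import math

def total_probability(conditionals, weights):
    """Return P(A) = sum of P(A|B_i) * P(B_i) over a partition B_1..B_n."""
    if len(conditionals) != len(weights):
        raise ValueError("need one conditional probability per case")
    # A partition covers the whole sample space without overlap,
    # so the case probabilities have to sum to 1.
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("case probabilities must sum to 1")
    return sum(c * w for c, w in zip(conditionals, weights))

# Hypothetical three-case partition (illustrative numbers only)
p_a = total_probability(conditionals=[0.9, 0.5, 0.1], weights=[0.2, 0.5, 0.3])
print(f"P(A) = {p_a:.2f}")  # 0.18 + 0.25 + 0.03 = 0.46
```

The check on the weights catches a frequent mistake: forgetting a case, which silently turns the result into an underestimate.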

## Example: Law of Total Probability with Basketball Lineups

**Question:** What is the probability the team scores 120+ points in a game?

**Given information:**
- 60% of games use a small lineup, 40% use a big lineup
- P(120+ points | small lineup) = 0.70
- P(120+ points | big lineup) = 0.40

Let's calculate this step by step.

In [None]:
# Given probabilities
p_small_lineup = 0.60
p_big_lineup = 0.40

p_120pts_given_small = 0.70
p_120pts_given_big = 0.40

print("Given Information:")
print(f"P(Small lineup) = {p_small_lineup}")
print(f"P(Big lineup) = {p_big_lineup}")
print(f"P(120+ pts | Small lineup) = {p_120pts_given_small}")
print(f"P(120+ pts | Big lineup) = {p_120pts_given_big}")

In [None]:
# Step-by-step calculation using law of total probability

# Contribution from small lineup games
contribution_small = p_120pts_given_small * p_small_lineup

# Contribution from big lineup games
contribution_big = p_120pts_given_big * p_big_lineup

# Total probability
p_120pts_total = contribution_small + contribution_big

# Round to 2 decimals so floating-point noise (e.g. 0.16000000000000003) is not printed
print("Step-by-Step Calculation:")
print("\nContribution from small lineup:")
print(f"  P(120+ pts | Small) × P(Small) = {p_120pts_given_small} × {p_small_lineup} = {contribution_small:.2f}")

print("\nContribution from big lineup:")
print(f"  P(120+ pts | Big) × P(Big) = {p_120pts_given_big} × {p_big_lineup} = {contribution_big:.2f}")

print("\nTotal probability:")
print(f"  P(120+ pts) = {contribution_small:.2f} + {contribution_big:.2f} = {p_120pts_total:.2f}")

print(f"\n✓ The team scores 120+ points in {p_120pts_total*100:.0f}% of all games.")

### Interpretation

The law of total probability weighted each lineup type by how often it occurs.

- Small lineups are more common (60%) and more likely to score 120+ (70%)
- Big lineups are less common (40%) and less likely to score 120+ (40%)
- Overall: 58% chance of scoring 120+ points

This is a weighted average of the two conditional probabilities.
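
One way to sanity-check the 58% figure is to simulate the two-stage process directly: first pick a lineup, then decide whether the team reaches 120+ points given that lineup. The sketch below is an illustration only. The seed, the number of simulated games, and the variable names are assumptions, and it uses a local generator from `np.random.default_rng` so the notebook's global random state is left alone.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator; seed is arbitrary
n_games = 100_000               # assumed simulation size

# Stage 1: pick a lineup for each game (True = small lineup, 60% of games)
is_small = rng.random(n_games) < 0.60

# Stage 2: the chance of 120+ points depends on the lineup that was used
p_120 = np.where(is_small, 0.70, 0.40)
scored_120 = rng.random(n_games) < p_120

estimate = scored_120.mean()
print(f"Simulated P(120+ pts) = {estimate:.3f}  (formula gives 0.58)")
```

The simulated frequency should land near 0.58, with a small gap that comes from sampling noise.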

## Independence of Events

Two events A and B are **independent** if knowing one happened doesn't change the probability of the other.

**Definition (events):**

A and B are independent if:
- P(A | B) = P(A) (assuming P(B) > 0), or equivalently
- P(A and B) = P(A) × P(B)

**Definition (random variables):**

Random variables X and Y are independent if:
- P(X = x, Y = y) = P(X = x) × P(Y = y) for all x, y

**Key insight:** If events are independent, the joint probability factors into marginal probabilities.
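
For discrete random variables, the factorization can be checked cell by cell on a joint PMF table: the table must equal the outer product of its row and column marginals. A minimal sketch, where the helper name `is_independent` and both tables are hypothetical:

```python
import numpy as np

def is_independent(joint, tol=1e-12):
    """Check whether a joint PMF table equals the outer product of its marginals."""
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1)  # marginal of X (sum across each row)
    p_y = joint.sum(axis=0)  # marginal of Y (sum down each column)
    return bool(np.allclose(joint, np.outer(p_x, p_y), atol=tol))

# Hypothetical joint PMFs for two binary variables (rows: X, columns: Y)
joint_indep = np.array([[0.12, 0.28],
                        [0.18, 0.42]])  # outer product of [0.4, 0.6] and [0.3, 0.7]
joint_dep = np.array([[0.30, 0.10],
                      [0.10, 0.50]])    # extra mass on the diagonal

print(is_independent(joint_indep))  # True
print(is_independent(joint_dep))    # False
```

Note that the condition has to hold for every (x, y) pair. A single matching cell is not enough.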

## Examples of Independence and Dependence

### Independent Events Example

**Scenario:** Coin flip result and die roll result
- P(Heads) = 0.5
- P(Rolling a 6) = 1/6
- P(Heads and 6) = 0.5 × 1/6 = 1/12

The coin and die don't affect each other, so they're independent.
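
The arithmetic above can be verified exactly by listing the sample space. This sketch assumes a fair coin and a fair die with all 12 (coin, die) pairs equally likely, which is the model the example describes. The helper `prob` is a hypothetical name.

```python
from fractions import Fraction
from itertools import product

# All 12 equally likely (coin, die) outcomes
outcomes = list(product(["H", "T"], range(1, 7)))

def prob(event):
    """Exact probability of an event under equally likely outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

p_heads = prob(lambda o: o[0] == "H")
p_six = prob(lambda o: o[1] == 6)
p_both = prob(lambda o: o[0] == "H" and o[1] == 6)

print(p_heads, p_six, p_both)     # 1/2 1/6 1/12
print(p_both == p_heads * p_six)  # True
```

Using `Fraction` keeps the comparison exact, so no tolerance is needed.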

### Dependent Events Example

**Scenario:** Minutes played and points scored
- More minutes → more scoring opportunities
- P(25+ points | 35+ minutes) > P(25+ points)
- These events are dependent

Knowing minutes played changes the probability of scoring 25+ points.

## Checking Independence from Data

To check if events A and B are empirically independent:

1. Estimate P(A), P(B), and P(A and B) from data
2. Compute P(A) × P(B)
3. Compare P(A and B) to P(A) × P(B)
4. If they're approximately equal, the data are consistent with independence (small gaps are expected from sampling noise alone)

Let's see an example.

In [None]:
# Example: Check independence of two coin flips
# Simulate 10,000 pairs of coin flips

n_trials = 10000
coin1 = np.random.choice(['H', 'T'], size=n_trials)
coin2 = np.random.choice(['H', 'T'], size=n_trials)

# Count events
heads1 = np.sum(coin1 == 'H')
heads2 = np.sum(coin2 == 'H')
both_heads = np.sum((coin1 == 'H') & (coin2 == 'H'))

# Estimate probabilities
p_heads1 = heads1 / n_trials
p_heads2 = heads2 / n_trials
p_both_heads = both_heads / n_trials

# Product of marginals
p_product = p_heads1 * p_heads2

print("Independence Check: Two Coin Flips")
print(f"\nP(Heads on coin 1) = {p_heads1:.4f}")
print(f"P(Heads on coin 2) = {p_heads2:.4f}")
print(f"P(Both heads) = {p_both_heads:.4f}")
print(f"\nP(Heads1) × P(Heads2) = {p_product:.4f}")
print(f"\nDifference: {abs(p_both_heads - p_product):.4f}")
print("\n✓ Values are very close → consistent with independent coins")
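
How close is "close enough"? One common way to judge the gap, beyond the simple comparison used in this notebook, is a chi-square test of independence on the 2×2 table of counts. The sketch below uses `scipy.stats.chi2_contingency`. The seed, sample size, and variable names are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # local generator; seed is arbitrary
n = 10_000
a = rng.random(n) < 0.5  # event A: first flip is heads
b = rng.random(n) < 0.5  # event B: second flip is heads

# 2x2 table of counts: rows = A false/true, columns = B false/true
table = np.array([
    [np.sum(~a & ~b), np.sum(~a & b)],
    [np.sum(a & ~b), np.sum(a & b)],
])

# Null hypothesis: A and B are independent
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(f"chi-square = {chi2:.3f}, p-value = {p_value:.3f}")
```

A large p-value means the observed counts are compatible with independence. A very small one suggests the gap between P(A and B) and P(A) × P(B) is larger than sampling noise would explain.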

## Exercise: Independent vs Dependent Player Performance

**Goal:** Simulate two scenarios:
1. Player points and team pace are independent
2. Player points and team pace are dependent (points increase with pace)

We'll check independence by comparing P(A and B) to P(A) × P(B).

### Part 1: Independent Model

Points and pace are generated independently from separate distributions.

In [None]:
# Independent model: points and pace don't affect each other
np.random.seed(100)  # For reproducibility

n_games = 1000

# Team pace: independent of anything else
# Pace = possessions per game, range 95-105
pace_independent = np.random.uniform(95, 105, size=n_games)

# Player points: independent of pace
# Mean 20 points, std 6 points
points_independent = np.random.normal(loc=20, scale=6, size=n_games)

# Create DataFrame
data_independent = pd.DataFrame({
    'Pace': pace_independent,
    'Points': points_independent
})

print("Independent Model: First 10 games")
print(data_independent.head(10))
print(f"\nTotal games simulated: {n_games}")

In [None]:
# Define events for independence check
# Event A: Points >= 25
# Event B: High pace (pace >= 100)

event_a_ind = data_independent['Points'] >= 25
event_b_ind = data_independent['Pace'] >= 100
event_both_ind = event_a_ind & event_b_ind

# Estimate probabilities
p_a_ind = event_a_ind.sum() / n_games
p_b_ind = event_b_ind.sum() / n_games
p_both_ind = event_both_ind.sum() / n_games

# Product of marginals
p_product_ind = p_a_ind * p_b_ind

print("Independent Model: Probability Estimates")
print("="*50)
print(f"P(Points >= 25) = {p_a_ind:.4f}")
print(f"P(High pace) = {p_b_ind:.4f}")
print(f"P(Points >= 25 AND High pace) = {p_both_ind:.4f}")
print(f"\nProduct of marginals:")
print(f"P(Points >= 25) × P(High pace) = {p_product_ind:.4f}")
print(f"\nDifference: {abs(p_both_ind - p_product_ind):.4f}")
print("\n✓ Joint probability ≈ Product of marginals")
print("✓ Events appear INDEPENDENT")

### Part 2: Dependent Model

Now we make points depend on pace. Higher pace → more scoring opportunities → higher expected points.

In [None]:
# Dependent model: points increase with pace
np.random.seed(101)  # Different seed for variety

# Team pace: same as before
pace_dependent = np.random.uniform(95, 105, size=n_games)

# Player points: NOW DEPENDS ON PACE
# Base: 20 points at average pace (100)
# Effect: 1 point per possession above (or below) pace 100
# Passing an array as loc draws one noisy value per game, so no loop is needed
expected_points = 20 + 1.0 * (pace_dependent - 100)
points_dependent = np.random.normal(loc=expected_points, scale=4)

# Create DataFrame
data_dependent = pd.DataFrame({
    'Pace': pace_dependent,
    'Points': points_dependent
})

print("Dependent Model: First 10 games")
print(data_dependent.head(10))
print(f"\nTotal games simulated: {n_games}")

In [None]:
# Check independence for dependent model
event_a_dep = data_dependent['Points'] >= 25
event_b_dep = data_dependent['Pace'] >= 100
event_both_dep = event_a_dep & event_b_dep

# Estimate probabilities
p_a_dep = event_a_dep.sum() / n_games
p_b_dep = event_b_dep.sum() / n_games
p_both_dep = event_both_dep.sum() / n_games

# Product of marginals
p_product_dep = p_a_dep * p_b_dep

print("Dependent Model: Probability Estimates")
print("="*50)
print(f"P(Points >= 25) = {p_a_dep:.4f}")
print(f"P(High pace) = {p_b_dep:.4f}")
print(f"P(Points >= 25 AND High pace) = {p_both_dep:.4f}")
print(f"\nProduct of marginals:")
print(f"P(Points >= 25) × P(High pace) = {p_product_dep:.4f}")
print(f"\nDifference: {abs(p_both_dep - p_product_dep):.4f}")
print("\n✗ Joint probability ≠ Product of marginals")
print("✗ Events appear DEPENDENT")

In [None]:
# Visualize the difference between independent and dependent models

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Independent model
ax1.scatter(data_independent['Pace'], data_independent['Points'], 
            alpha=0.4, s=30, color='blue')
ax1.axhline(y=25, color='red', linestyle='--', linewidth=2, label='25 points')
ax1.axvline(x=100, color='green', linestyle='--', linewidth=2, label='Pace 100')
ax1.set_xlabel('Team Pace', fontsize=12)
ax1.set_ylabel('Player Points', fontsize=12)
ax1.set_title('Independent Model: No Relationship', fontsize=13, fontweight='bold')
ax1.legend()
ax1.grid(alpha=0.3)

# Dependent model
ax2.scatter(data_dependent['Pace'], data_dependent['Points'], 
            alpha=0.4, s=30, color='purple')
ax2.axhline(y=25, color='red', linestyle='--', linewidth=2, label='25 points')
ax2.axvline(x=100, color='green', linestyle='--', linewidth=2, label='Pace 100')
ax2.set_xlabel('Team Pace', fontsize=12)
ax2.set_ylabel('Player Points', fontsize=12)
ax2.set_title('Dependent Model: Positive Relationship', fontsize=13, fontweight='bold')
ax2.legend()
ax2.grid(alpha=0.3)

plt.tight_layout()
plt.show()

### Comparison and Interpretation

**Independent Model:**
- P(Points ≥ 25 AND High pace) ≈ P(Points ≥ 25) × P(High pace)
- The scatter plot shows no pattern
- Knowing pace doesn't help predict points

**Dependent Model:**
- P(Points ≥ 25 AND High pace) > P(Points ≥ 25) × P(High pace)
- The scatter plot shows a positive trend
- High pace increases the chance of scoring 25+ points
- Events are dependent because pace influences points

**Key insight:** When variables are dependent, their joint probability doesn't factor into marginals. The conditional probability P(A|B) differs from P(A).
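
The last point can be seen directly by estimating P(A|B) from simulated data: restrict attention to the games where B occurred and measure how often A happens there. This is a sketch that rebuilds a model with the same structure as the dependent one. The seed and the larger sample size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)  # local generator; seed is arbitrary
n = 100_000

# Same structure as the dependent model: 1 extra point per possession
pace = rng.uniform(95, 105, size=n)
points = rng.normal(loc=20 + 1.0 * (pace - 100), scale=4)

a = points >= 25  # event A: 25+ points
b = pace >= 100   # event B: high pace

p_a = a.mean()
p_a_given_b = a[b].mean()       # P(A | B): only games where B occurred
p_a_given_not_b = a[~b].mean()  # P(A | not B)

print(f"P(A)         = {p_a:.3f}")
print(f"P(A | B)     = {p_a_given_b:.3f}")
print(f"P(A | not B) = {p_a_given_not_b:.3f}")
```

Under independence all three numbers would agree up to sampling noise. Here P(A | B) sits well above P(A), which in turn sits above P(A | not B).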

# Day 5: Simulating Probability Distributions and Monte Carlo

## What is Simulation?

**Simulating from a distribution** means generating random samples that follow a specific probability distribution.

**Why simulate?**
- Approximate probabilities that are hard to calculate exactly
- Test statistical methods
- Model real-world random processes
- Estimate complex quantities

**How it works:** Use a random number generator to produce values that match the distribution's pattern.
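
One standard technique for turning uniform random numbers into draws from another distribution is inverse transform sampling: push each uniform value through the inverse CDF of the target. The sketch below applies it to an Exponential distribution. It illustrates the idea only and is not a description of how NumPy generates its samples internally. The rate, seed, and sample size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)  # local generator; seed is arbitrary
rate = 2.0                      # hypothetical rate parameter λ
n = 100_000

# Step 1: uniform random numbers on [0, 1)
u = rng.random(n)

# Step 2: apply the inverse CDF of Exponential(λ):
# F(x) = 1 - exp(-λx)  =>  F⁻¹(u) = -ln(1 - u) / λ
samples = -np.log1p(-u) / rate

print(f"Sample mean: {samples.mean():.3f}  (theory: {1 / rate:.3f})")
print(f"Sample std:  {samples.std(ddof=1):.3f}  (theory: {1 / rate:.3f})")
```

For an Exponential(λ) distribution both the mean and the standard deviation equal 1/λ, so the two printed values should both be near 0.5.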

## Monte Carlo Methods

**Monte Carlo** is a technique that uses many random samples to approximate probabilities, expected values, or distributions.

**The basic idea:**
1. Generate many random samples from a distribution
2. Compute statistics or probabilities from the samples
3. As the number of samples increases, estimates get closer to true values

**Name origin:** Named after the Monte Carlo casino in Monaco, because it relies on randomness.

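**Law of large numbers:** As the number of independent samples grows, sample averages (including the fraction of samples in which an event occurs) converge to the corresponding expected values (or probabilities).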
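
The three steps can be sketched on a problem with a known answer: the probability that two fair dice sum to 10 or more, which is exactly 6/36. The seed, sample size, and variable names below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)  # local generator; seed is arbitrary
n = 200_000

# Step 1: generate many random samples (two fair dice per trial)
die1 = rng.integers(1, 7, size=n)  # upper bound is exclusive
die2 = rng.integers(1, 7, size=n)

# Step 2: compute the quantity of interest from the samples
hit = (die1 + die2) >= 10

# Step 3: watch the running estimate as more samples are included
running = np.cumsum(hit) / np.arange(1, n + 1)
for k in (100, 10_000, n):
    print(f"n = {k:>7,}: estimate = {running[k - 1]:.4f}")
print(f"Exact value: {6 / 36:.4f}")
```

The early estimates can be noticeably off. The later ones settle near the exact value, which is the law of large numbers at work.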

## Simulating from a Normal Distribution

Let's simulate from a Normal distribution with mean μ = 100 and standard deviation σ = 15.

In [None]:
# Simulate from Normal distribution
np.random.seed(200)

# Parameters
mu = 100
sigma = 15
n_samples = 5000

# Generate samples
samples = np.random.normal(loc=mu, scale=sigma, size=n_samples)

print(f"Simulated {n_samples} samples from Normal(mean={mu}, std={sigma})")
print(f"\nFirst 20 samples:")
print(samples[:20])

In [None]:
# Compute sample statistics
sample_mean = np.mean(samples)
sample_std = np.std(samples, ddof=1)  # ddof=1 for sample std

print("Sample Statistics:")
print("="*40)
print(f"True mean (μ): {mu}")
print(f"Sample mean: {sample_mean:.2f}")
print(f"\nTrue std (σ): {sigma}")
print(f"Sample std: {sample_std:.2f}")
print("\n✓ Sample statistics are close to true parameters")

In [None]:
# Plot histogram with theoretical PDF overlay

# Create histogram
plt.figure(figsize=(10, 6))
counts, bins, patches = plt.hist(samples, bins=50, density=True, 
                                  alpha=0.6, color='skyblue', edgecolor='black')

# Overlay theoretical Normal PDF
x_range = np.linspace(mu - 4*sigma, mu + 4*sigma, 1000)
pdf_theoretical = stats.norm.pdf(x_range, loc=mu, scale=sigma)
plt.plot(x_range, pdf_theoretical, 'r-', linewidth=2.5, 
         label=f'Theoretical Normal({mu}, {sigma})')

plt.xlabel('Value', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.title('Simulated Data vs Theoretical Normal Distribution', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(alpha=0.3)
plt.show()

print("The histogram matches the theoretical PDF curve closely.")
print("This confirms our simulation is working correctly.")

## Exercise: Monte Carlo Simulation of Player Performance

**Scenario:** A player's points per game follow a Normal distribution with mean 24 and standard deviation 6.

**Goal:** Simulate 10,000 games and estimate probabilities using Monte Carlo.

In [None]:
# Monte Carlo simulation setup
np.random.seed(300)

# Player performance parameters
mean_points = 24
std_points = 6
n_simulations = 10000

# Simulate 10,000 games
simulated_points = np.random.normal(loc=mean_points, scale=std_points, size=n_simulations)

print(f"Simulated {n_simulations} games")
print(f"Player model: Normal(mean={mean_points}, std={std_points})")
print(f"\nFirst 20 simulated point totals:")
print(simulated_points[:20])

In [None]:
# Step 1: Compute sample statistics
sim_mean = np.mean(simulated_points)
sim_std = np.std(simulated_points, ddof=1)

print("Comparison: Input Parameters vs Simulation")
print("="*50)
print(f"Input mean: {mean_points}")
print(f"Sample mean: {sim_mean:.4f}")
print(f"Difference: {abs(sim_mean - mean_points):.4f}")

print(f"\nInput std: {std_points}")
print(f"Sample std: {sim_std:.4f}")
print(f"Difference: {abs(sim_std - std_points):.4f}")

print("\n✓ Sample statistics match input parameters closely")

In [None]:
# Step 2: Estimate probabilities from simulation

# P(Points >= 30)
games_30plus = np.sum(simulated_points >= 30)
prob_30plus_mc = games_30plus / n_simulations

# P(20 <= Points <= 30)
games_20to30 = np.sum((simulated_points >= 20) & (simulated_points <= 30))
prob_20to30_mc = games_20to30 / n_simulations

print("Monte Carlo Probability Estimates:")
print("="*50)
print(f"\nP(Points >= 30):")
print(f"  Games with 30+ points: {games_30plus}")
print(f"  Monte Carlo estimate: {prob_30plus_mc:.4f} ({prob_30plus_mc*100:.2f}%)")

print(f"\nP(20 <= Points <= 30):")
print(f"  Games with 20-30 points: {games_20to30}")
print(f"  Monte Carlo estimate: {prob_20to30_mc:.4f} ({prob_20to30_mc*100:.2f}%)")

In [None]:
# Step 3: Compare to exact Normal CDF calculations

# P(Points >= 30) = 1 - P(Points < 30) = 1 - CDF(30)
prob_30plus_exact = 1 - stats.norm.cdf(30, loc=mean_points, scale=std_points)

# P(20 <= Points <= 30) = CDF(30) - CDF(20)
prob_20to30_exact = stats.norm.cdf(30, loc=mean_points, scale=std_points) - \
                    stats.norm.cdf(20, loc=mean_points, scale=std_points)

print("Comparison: Monte Carlo vs Exact Calculations")
print("="*50)

print(f"\nP(Points >= 30):")
print(f"  Monte Carlo: {prob_30plus_mc:.4f}")
print(f"  Exact (CDF): {prob_30plus_exact:.4f}")
print(f"  Error: {abs(prob_30plus_mc - prob_30plus_exact):.4f}")

print(f"\nP(20 <= Points <= 30):")
print(f"  Monte Carlo: {prob_20to30_mc:.4f}")
print(f"  Exact (CDF): {prob_20to30_exact:.4f}")
print(f"  Error: {abs(prob_20to30_mc - prob_20to30_exact):.4f}")

print("\n✓ Monte Carlo estimates are very close to exact values")

In [None]:
# Step 4: Plot histogram of simulated points

plt.figure(figsize=(12, 6))

# Histogram
plt.hist(simulated_points, bins=60, density=True, alpha=0.6, 
         color='steelblue', edgecolor='black', label='Simulated data')

# Overlay theoretical PDF
x_vals = np.linspace(mean_points - 4*std_points, mean_points + 4*std_points, 1000)
pdf_vals = stats.norm.pdf(x_vals, loc=mean_points, scale=std_points)
plt.plot(x_vals, pdf_vals, 'r-', linewidth=3, label='Theoretical Normal PDF')

# Add vertical lines for probabilities of interest
plt.axvline(x=30, color='green', linestyle='--', linewidth=2, label='30 points')
plt.axvline(x=20, color='orange', linestyle='--', linewidth=2, label='20 points')

plt.xlabel('Points Scored', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.title(f'Monte Carlo Simulation: {n_simulations} Games', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(alpha=0.3)
plt.show()

### Visual Inspection

**Observations from the histogram:**
- The histogram has a bell shape, matching the Normal distribution
- The simulated data aligns closely with the theoretical PDF (red curve)
- The distribution is centered around 24 points (the mean)
- About 95% of values fall between 12 and 36 points (within 2 standard deviations of the mean)
- The spread matches the standard deviation of 6 points

**Conclusion:** The simulation accurately represents the theoretical Normal distribution.
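
The "within 2 standard deviations" observation can be made precise with the Normal CDF. The sketch below computes the exact share of games within 1, 2, and 3 standard deviations for the same Normal(mean=24, std=6) model. The dictionary name `within` is a hypothetical helper.

```python
from scipy import stats

mean_pts, std_pts = 24, 6  # same player model as in the exercise

within = {}
for k in (1, 2, 3):
    lo, hi = mean_pts - k * std_pts, mean_pts + k * std_pts
    # P(lo <= X <= hi) = CDF(hi) - CDF(lo)
    within[k] = (stats.norm.cdf(hi, loc=mean_pts, scale=std_pts)
                 - stats.norm.cdf(lo, loc=mean_pts, scale=std_pts))
    print(f"Within {k} SD ({lo} to {hi} points): {within[k]:.4f}")

# Within 1 SD (18 to 30 points): 0.6827
# Within 2 SD (12 to 36 points): 0.9545
# Within 3 SD (6 to 42 points): 0.9973
```

These shares are the same for every Normal distribution, which is why they are often quoted as the 68-95-99.7 rule.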

### How Monte Carlo Helps

**Advantages of Monte Carlo simulation:**

1. **Approximates complex probabilities:** For distributions without closed-form CDFs, simulation provides estimates.

2. **Easy to implement:** Just generate random samples and count outcomes.

3. **Flexible:** Works for any distribution you can sample from, even custom or empirical ones.

4. **Builds intuition:** Visualizing simulated data helps understand the distribution.

**Accuracy:**
- With 10,000 simulations, our estimates were within 0.01 of exact values
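- More simulations → better accuracy (but slower computation): the typical error shrinks in proportion to 1/√n, so 100 times more samples buys roughly one extra decimal digit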
- Trade-off between speed and precision
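
One common way to quantify the accuracy of a Monte Carlo probability estimate is its standard error, sqrt(p̂(1 − p̂)/n), together with a normal-approximation interval. The sketch below is a rough guide under that approximation. The helper name `mc_proportion_ci`, the seed, and the sample sizes are assumptions.

```python
import numpy as np

def mc_proportion_ci(hits, n, z=1.96):
    """Monte Carlo probability estimate with an approximate 95% interval."""
    p_hat = hits / n
    # Standard error of a sample proportion; shrinks like 1/sqrt(n)
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, se, (p_hat - z * se, p_hat + z * se)

rng = np.random.default_rng(5)  # local generator; seed is arbitrary
for n in (100, 10_000, 1_000_000):
    sim = rng.normal(loc=24, scale=6, size=n)  # same player model as above
    p_hat, se, (lo, hi) = mc_proportion_ci(np.sum(sim >= 30), n)
    print(f"n = {n:>9,}: estimate = {p_hat:.4f}, SE = {se:.4f}, "
          f"95% CI = ({lo:.4f}, {hi:.4f})")
```

Each 100-fold increase in n cuts the standard error by about a factor of 10, which makes the speed versus precision trade-off concrete.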

**When to use Monte Carlo:**
- Exact calculations are difficult or impossible
- You need to explore "what if" scenarios
- Complex models with multiple random components
- Real-world applications where we can't compute exact probabilities
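
As an illustration of a model with multiple random components, the sketch below adds up three differently distributed scoring sources and estimates the chance of a 120+ point game. The exact distribution of such a sum is awkward to work out by hand, but simulation handles it with a few lines. Every distribution choice and parameter here is made up for the example and is not taken from real data.

```python
import numpy as np

rng = np.random.default_rng(6)  # local generator; seed is arbitrary
n = 100_000

# Hypothetical three-part team scoring model (all parameters are made up)
star = rng.normal(loc=24, scale=6, size=n)       # star player's points
threes = 3 * rng.binomial(n=35, p=0.36, size=n)  # points from 35 three-point attempts
others = rng.poisson(lam=50, size=n)             # all remaining points

total = star + threes + others
p_120_plus = np.mean(total >= 120)

print(f"Mean team score:   {total.mean():.1f}")
print(f"Estimated P(120+): {p_120_plus:.3f}")
```

Changing one assumption, such as the three-point percentage `p`, and rerunning is a quick way to explore "what if" scenarios.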

In [None]:
# Bonus: Show convergence with different sample sizes
sample_sizes = [100, 500, 1000, 5000, 10000]
estimates_30plus = []

np.random.seed(400)
for n in sample_sizes:
    sim = np.random.normal(loc=mean_points, scale=std_points, size=n)
    est = np.sum(sim >= 30) / n
    estimates_30plus.append(est)

# Plot convergence
plt.figure(figsize=(10, 6))
plt.plot(sample_sizes, estimates_30plus, 'o-', linewidth=2, markersize=8, color='navy')
plt.axhline(y=prob_30plus_exact, color='red', linestyle='--', linewidth=2, 
            label=f'True probability: {prob_30plus_exact:.4f}')
plt.xlabel('Number of Simulations', fontsize=12)
plt.ylabel('Estimated P(Points >= 30)', fontsize=12)
plt.title('Monte Carlo Convergence: More Samples → Better Estimate', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(alpha=0.3)
plt.show()

print("As sample size increases, the estimate tends to move closer to the true value (a single run need not improve at every step).")

# Summary: Days 4-5 Key Concepts

## Law of Total Probability

**Formula:** P(A) = Σ P(A|Bᵢ)·P(Bᵢ)

**What it does:** Breaks down a probability across different cases or scenarios.

**Example:** Total probability of scoring 120+ points = weighted average across lineup types.

**When to use:** When you have conditional probabilities for a set of mutually exclusive and exhaustive scenarios (a partition).

---

## Independence and Dependence

**Independent events:** P(A and B) = P(A) × P(B)
- Knowing one event doesn't change the probability of the other
- Example: Two coin flips

**Dependent events:** P(A and B) ≠ P(A) × P(B)
- Events influence each other
- Example: Minutes played and points scored

**Sports example from exercise:**
- Independent: Player points and pace generated separately
- Dependent: Points increase with pace → joint probability exceeds the product of the marginals

**How to check:** Compare P(A and B) to P(A) × P(B) from data.

---

## Monte Carlo Simulation

**Key idea:** Generate many random samples to approximate probabilities and distributions.

**Process:**
1. Define the probability model (e.g., Normal distribution)
2. Simulate many samples
3. Compute statistics or count outcomes
4. Estimates improve with more samples

**Player performance example:**
- Simulated 10,000 games from Normal(mean=24, std=6)
- Estimated P(Points ≥ 30) by Monte Carlo and compared it to the exact value 1 − Φ(1) ≈ 0.1587
- Visualized distribution with histogram

**Advantages:**
- Works for any distribution you can sample from
- Easy to understand and implement
- Flexible for complex scenarios

**Trade-off:** More samples = better accuracy but slower computation.

---

## Next Steps

You now have tools to:
- Combine probabilities across different scenarios
- Check if events are independent
- Use simulation to estimate probabilities
- Model player performance with Normal distributions

**Practice ideas:**
- Simulate different player types (shooters vs defenders)
- Model game outcomes with various distributions
- Test independence of different sports statistics
- Use Monte Carlo for playoff probability estimation

**Happy simulating!**