# Introduction to Probability and Random Variables

This notebook covers fundamental probability concepts for beginners who know basic Python and basic probability.

**Topics covered:**
- Day 1: Random variables, PMF, PDF, CDF
- Day 2: Expected value, variance, covariance
- Day 3: Joint and conditional probability

Let's start by importing the libraries we'll use throughout.

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)

# Display settings
plt.rcParams['figure.figsize'] = (10, 6)
print("Libraries imported successfully!")

# Day 1: Random Variables, PMF, PDF, CDF

## What is a Random Variable?

A random variable is a variable whose value depends on the outcome of a random event.

**Two types:**
- **Discrete random variable**: Takes values from a countable set (like 0, 1, 2, 3). Example: number of goals in a game.
- **Continuous random variable**: Can take any value in a range of real numbers. Example: player height.

## Probability Mass Function (PMF)

A PMF tells us the probability that a discrete random variable equals each specific value.

**Definition:** For a discrete random variable X, the PMF is:

P(X = x) = probability that X equals x

**Properties:**
- All probabilities are between 0 and 1
- The sum of all probabilities equals 1

In [None]:
# Example: Rolling a fair six-sided die
# This is a discrete random variable

outcomes = [1, 2, 3, 4, 5, 6]
probabilities = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]

# Display the PMF as a table
pmf_table = pd.DataFrame({
    'Outcome (x)': outcomes,
    'P(X = x)': probabilities
})

print("PMF for a fair die:")
print(pmf_table)
# Floating-point rounding can leave the raw sum a hair away from 1, so format it
print(f"\nSum of probabilities: {sum(probabilities):.4f}")

In [None]:
# Plot the PMF as a bar chart
plt.figure(figsize=(8, 5))
plt.bar(outcomes, probabilities, color='steelblue', edgecolor='black', alpha=0.7)
plt.xlabel('Outcome', fontsize=12)
plt.ylabel('Probability', fontsize=12)
plt.title('PMF of a Fair Die Roll', fontsize=14, fontweight='bold')
plt.xticks(outcomes)
plt.ylim(0, 0.25)
plt.grid(axis='y', alpha=0.3)
plt.show()

## Probability Density Function (PDF)

A PDF describes the probability distribution for a continuous random variable.

**Definition:** For a continuous random variable X, the PDF is a function f(x) where:

- f(x) ≥ 0 for all x
- The area under the curve equals 1
- P(a ≤ X ≤ b) = integral of f(x) from a to b

**Important:** For continuous variables, P(X = x) = 0 for any specific x. We only talk about probability over intervals.
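
The three PDF properties can be checked numerically. The sketch below is only an illustration: the helper names `standard_normal_pdf` and `integrate` are made up for this example, the integral is approximated with a simple trapezoidal rule, and the range -8 to 8 stands in for the whole real line (the density outside it is negligible).

```python
import numpy as np

def standard_normal_pdf(x):
    """Density of the standard Normal distribution (mean 0, std 1)."""
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def integrate(f, a, b, n=100_001):
    """Approximate the integral of f over [a, b] with the trapezoidal rule."""
    x = np.linspace(a, b, n)
    y = f(x)
    dx = (b - a) / (n - 1)
    return dx * (y.sum() - 0.5 * (y[0] + y[-1]))

# Total area under the curve; [-8, 8] is an assumed stand-in for (-inf, inf)
total_area = integrate(standard_normal_pdf, -8, 8)

# Probability over an interval: P(-1 <= X <= 1)
prob_within_one = integrate(standard_normal_pdf, -1, 1)

# A single point is an interval of width zero, so its probability is zero
prob_at_point = integrate(standard_normal_pdf, 1, 1)

print(f"Total area under the PDF: {total_area:.4f}")
print(f"P(-1 <= X <= 1):          {prob_within_one:.4f}")
print(f"P(X = 1):                 {prob_at_point:.4f}")
```

The interval probability comes out near 0.68, the familiar "about 68% within one standard deviation" rule, while the single-point probability is exactly zero.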

## Cumulative Distribution Function (CDF)

The CDF gives the probability that a random variable is less than or equal to a value.

**Definition:** F(x) = P(X ≤ x)

**For discrete variables:** F(x) = sum of P(X = k) for all k ≤ x

**For continuous variables:** F(x) = integral of f(t) from -∞ to x

The CDF never decreases: it rises from 0 (as x → -∞) to 1 (as x → +∞), and it can stay flat over ranges that carry no probability.
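
For a discrete variable, the "sum of P(X = k) for all k ≤ x" is a running total of the PMF. As a small sketch using the fair-die example (the function name `die_cdf` is introduced here only for illustration):

```python
import numpy as np

outcomes = np.arange(1, 7)      # faces of a fair die
pmf = np.full(6, 1 / 6)         # P(X = k) for each face

# F(k) = P(X <= k) is the cumulative sum of the PMF
cdf = np.cumsum(pmf)

for k, F in zip(outcomes, cdf):
    print(f"F({k}) = P(X <= {k}) = {F:.4f}")

def die_cdf(x):
    """CDF of the die at any real x: add up the PMF over faces k <= x."""
    return pmf[outcomes <= x].sum()

# Between faces the CDF is flat; below 1 it is 0 and from 6 onward it is 1
print(f"F(3.5) = {die_cdf(3.5):.4f}")
print(f"F(0)   = {die_cdf(0):.4f}")
print(f"F(10)  = {die_cdf(10):.4f}")
```

The discrete CDF is a step function: it jumps by 1/6 at each face and stays constant in between.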

In [None]:
# Example: Standard Normal distribution (continuous)
# Mean = 0, Standard deviation = 1

x_values = np.linspace(-4, 4, 1000)

# PDF: probability density function
pdf_values = (1 / np.sqrt(2 * np.pi)) * np.exp(-0.5 * x_values**2)

# CDF: cumulative distribution function
from scipy import stats
cdf_values = stats.norm.cdf(x_values, loc=0, scale=1)

# Plot both PDF and CDF
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# PDF plot
ax1.plot(x_values, pdf_values, color='darkblue', linewidth=2)
ax1.fill_between(x_values, pdf_values, alpha=0.3, color='skyblue')
ax1.set_xlabel('x', fontsize=12)
ax1.set_ylabel('f(x)', fontsize=12)
ax1.set_title('PDF: Standard Normal Distribution', fontsize=13, fontweight='bold')
ax1.grid(alpha=0.3)

# CDF plot
ax2.plot(x_values, cdf_values, color='darkgreen', linewidth=2)
ax2.set_xlabel('x', fontsize=12)
ax2.set_ylabel('F(x)', fontsize=12)
ax2.set_title('CDF: Standard Normal Distribution', fontsize=13, fontweight='bold')
ax2.grid(alpha=0.3)

plt.tight_layout()
plt.show()

## Exercise: Plot PDFs of Normal and Uniform Distributions

**Task:** Plot the PDF of a Normal distribution (mean=0, std=1) and a Uniform distribution (interval 0 to 1) side by side.

In [None]:
# Normal distribution: mean = 0, standard deviation = 1
x_normal = np.linspace(-4, 4, 1000)
pdf_normal = (1 / np.sqrt(2 * np.pi)) * np.exp(-0.5 * x_normal**2)

# Uniform distribution: interval from 0 to 1
x_uniform = np.linspace(-0.5, 1.5, 1000)
pdf_uniform = np.where((x_uniform >= 0) & (x_uniform <= 1), 1.0, 0.0)

# Create side-by-side plots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Normal distribution plot
ax1.plot(x_normal, pdf_normal, color='navy', linewidth=2.5, label='Normal(0,1)')
ax1.fill_between(x_normal, pdf_normal, alpha=0.3, color='lightblue')
ax1.set_xlabel('x', fontsize=12)
ax1.set_ylabel('f(x)', fontsize=12)
ax1.set_title('Normal Distribution PDF', fontsize=13, fontweight='bold')
ax1.legend()
ax1.grid(alpha=0.3)

# Uniform distribution plot
ax2.plot(x_uniform, pdf_uniform, color='darkred', linewidth=2.5, label='Uniform(0,1)')
ax2.fill_between(x_uniform, pdf_uniform, alpha=0.3, color='lightcoral')
ax2.set_xlabel('x', fontsize=12)
ax2.set_ylabel('f(x)', fontsize=12)
ax2.set_title('Uniform Distribution PDF', fontsize=13, fontweight='bold')
ax2.legend()
ax2.grid(alpha=0.3)
ax2.set_ylim(-0.1, 1.5)

plt.tight_layout()
plt.show()

### Explanation of the Distributions

**Normal Distribution:**
- Bell-shaped and symmetric around the mean (0).
- Most probability mass is concentrated near the center.
- Probability density decreases as you move away from the mean.
- The tails extend to infinity but have very low density.

**Uniform Distribution:**
- Flat probability density between 0 and 1.
- Every value in the interval has equal probability density.
- Zero probability density outside the interval.
- Probability mass is spread evenly across the entire range.

**Key difference:** Normal concentrates probability near the center, while Uniform spreads it evenly.
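
One way to make this difference concrete is to compare the probability of equal-width intervals at different locations. The sketch below uses only the standard library; `normal_cdf` and `uniform_cdf` are helper names chosen for this example, and the intervals are arbitrary picks.

```python
import math

def normal_cdf(x):
    """CDF of the standard Normal, written with the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def uniform_cdf(x):
    """CDF of Uniform(0, 1): 0 below the interval, x inside, 1 above."""
    return min(max(x, 0.0), 1.0)

# Normal: two intervals of width 1, one at the center and one in the tail
normal_center = normal_cdf(0.5) - normal_cdf(-0.5)
normal_tail = normal_cdf(2.5) - normal_cdf(1.5)

# Uniform: two intervals of width 0.2, one in the middle and one at the edge
uniform_middle = uniform_cdf(0.6) - uniform_cdf(0.4)
uniform_edge = uniform_cdf(0.2) - uniform_cdf(0.0)

print(f"Normal  P(-0.5 <= X <= 0.5): {normal_center:.4f}")
print(f"Normal  P( 1.5 <= X <= 2.5): {normal_tail:.4f}")
print(f"Uniform P( 0.4 <= X <= 0.6): {uniform_middle:.4f}")
print(f"Uniform P( 0.0 <= X <= 0.2): {uniform_edge:.4f}")
```

For the Normal, the central interval holds several times more probability than the tail interval of the same width; for the Uniform, the two intervals are equally likely.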

# Day 2: Expected Value, Variance, Covariance

## Expected Value

The expected value is the long-run average of a random variable: the value the sample average settles toward if you repeat the experiment many times.

**For discrete random variables:**

E[X] = Σ x · P(X = x)

Sum over all possible values x, multiplying each value by its probability.

**For continuous random variables:**

E[X] = ∫ x · f(x) dx

Integrate x times the PDF over all possible values.

The expected value is also called the **mean** or **expectation**.
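
The "many repetitions" idea can be seen in a quick simulation. This is a rough sketch: it uses its own generator (`rng`) with an arbitrary seed so it does not disturb the notebook's global random seed, and the roll counts are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)   # local generator; the seed value is arbitrary

outcomes = np.arange(1, 7)
theoretical_mean = np.sum(outcomes * (1 / 6))   # E[X] = Σ x · P(X = x)

# Roll a fair die more and more times and watch the sample mean
for n_rolls in (10, 1_000, 100_000):
    rolls = rng.integers(1, 7, size=n_rolls)    # upper bound 7 is exclusive
    print(f"{n_rolls:>7} rolls: sample mean = {rolls.mean():.4f}")

print(f"Theoretical E[X] = {theoretical_mean:.4f}")
```

With few rolls the sample mean can be noticeably off; with many rolls it should sit close to the theoretical value of 3.5.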

## Variance and Standard Deviation

**Variance** measures how spread out the values are from the mean.

**Formula:**

Var(X) = E[(X - μ)²]

where μ = E[X]

**Alternative formula:**

Var(X) = E[X²] - (E[X])²

**Standard deviation:**

σ = √Var(X)

Standard deviation has the same units as X, making it easier to interpret.
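
The two variance formulas should give the same number. A minimal check on the fair-die PMF (the variable names are chosen for this sketch):

```python
import numpy as np

outcomes = np.arange(1, 7)
pmf = np.full(6, 1 / 6)

mu = np.sum(outcomes * pmf)                              # E[X]

# Definition: Var(X) = E[(X - μ)²]
var_definition = np.sum((outcomes - mu) ** 2 * pmf)

# Alternative: Var(X) = E[X²] - (E[X])²
var_shortcut = np.sum(outcomes**2 * pmf) - mu**2

sigma = np.sqrt(var_definition)                          # same units as X

print(f"Var(X) from the definition: {var_definition:.4f}")
print(f"Var(X) from the shortcut:   {var_shortcut:.4f}")
print(f"Standard deviation σ:       {sigma:.4f}")
```

The definition is easier to interpret (an average squared distance from the mean), while the alternative form is often quicker to compute by hand.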

In [None]:
# Example: Compute expectation and variance by hand
# Die roll example from earlier

outcomes = np.array([1, 2, 3, 4, 5, 6])
probabilities = np.array([1/6, 1/6, 1/6, 1/6, 1/6, 1/6])

# Expected value: E[X] = Σ x · P(X = x)
expected_value = np.sum(outcomes * probabilities)

# E[X²]: sum of x² · P(X = x)
expected_x_squared = np.sum(outcomes**2 * probabilities)

# Variance: Var(X) = E[X²] - (E[X])²
variance = expected_x_squared - expected_value**2

# Standard deviation
std_deviation = np.sqrt(variance)

print("Die Roll Statistics:")
print(f"Expected value E[X]: {expected_value:.4f}")
print(f"E[X²]: {expected_x_squared:.4f}")
print(f"Variance Var(X): {variance:.4f}")
print(f"Standard deviation σ: {std_deviation:.4f}")

## Covariance

Covariance measures how two random variables change together.

**Formula:**

Cov(X, Y) = E[(X - E[X])(Y - E[Y])]

**Interpretation:**
- Positive covariance: When X increases, Y tends to increase.
- Negative covariance: When X increases, Y tends to decrease.
- Zero covariance: No linear relationship (the variables can still be related in a nonlinear way).

**Note:** Covariance magnitude depends on the scale of X and Y. Use correlation for a standardized measure.
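
To illustrate the scale issue, the sketch below computes a sample covariance directly from the formula and then rescales one variable. The data are simulated with a local generator, and `sample_cov` is a helper written just for this example; the factor of 100 is an arbitrary stand-in for a unit change.

```python
import numpy as np

rng = np.random.default_rng(1)   # local generator; the seed value is arbitrary
x = rng.normal(10, 2, size=500)
y = 2 * x + rng.normal(0, 1, size=500)

def sample_cov(a, b):
    """Sample covariance: average product of deviations, dividing by n - 1."""
    return np.sum((a - a.mean()) * (b - b.mean())) / (len(a) - 1)

def sample_corr(a, b):
    """Correlation: covariance divided by both sample standard deviations."""
    return sample_cov(a, b) / (np.std(a, ddof=1) * np.std(b, ddof=1))

cov_xy = sample_cov(x, y)
corr_xy = sample_corr(x, y)

# Express x in units 100 times smaller, so its numbers are 100 times larger
x_rescaled = 100 * x
cov_rescaled = sample_cov(x_rescaled, y)
corr_rescaled = sample_corr(x_rescaled, y)

print(f"Covariance:  {cov_xy:.4f}  ->  {cov_rescaled:.4f} after rescaling")
print(f"Correlation: {corr_xy:.4f}  ->  {corr_rescaled:.4f} after rescaling")
```

The covariance grows by the same factor of 100, while the correlation stays the same, which is why correlation is the better choice for comparing strength of relationship across variables with different units.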

In [None]:
# Estimate expectation, variance, and covariance from sample data

# Generate sample data for two variables
sample_size = 100
x_sample = np.random.normal(loc=10, scale=2, size=sample_size)
y_sample = 2 * x_sample + np.random.normal(loc=0, scale=1, size=sample_size)

# Sample mean (expectation estimate)
mean_x = np.mean(x_sample)
mean_y = np.mean(y_sample)

# Sample variance
var_x = np.var(x_sample, ddof=1)  # ddof=1 for sample variance
var_y = np.var(y_sample, ddof=1)

# Sample covariance
covariance = np.cov(x_sample, y_sample)[0, 1]

print("Sample Statistics:")
print(f"Mean of X: {mean_x:.4f}")
print(f"Mean of Y: {mean_y:.4f}")
print(f"Variance of X: {var_x:.4f}")
print(f"Variance of Y: {var_y:.4f}")
print(f"Covariance of X and Y: {covariance:.4f}")
print("\nInterpretation: Positive covariance indicates X and Y tend to increase together.")

## Exercise: Expected Value and Variance from Player Points Data

**Task:** Given player points per game data, compute sample mean and variance.

In [None]:
# Player points per game dataset
player_points = np.array([18, 22, 15, 28, 21, 19, 25, 17, 23, 20, 
                          26, 19, 21, 24, 18, 22, 20, 27, 16, 23])

print("Player Points Per Game Data:")
print(player_points)
print(f"\nNumber of games: {len(player_points)}")

In [None]:
# Compute sample mean by hand (without built-in functions)
n = len(player_points)
sum_points = 0

for points in player_points:
    sum_points += points

sample_mean_manual = sum_points / n

print("Manual Calculation:")
print(f"Sum of points: {sum_points}")
print(f"Number of games: {n}")
print(f"Sample mean: {sample_mean_manual:.2f} points per game")

In [None]:
# Compute sample variance by hand (without built-in functions)
# Formula: s² = Σ(x - mean)² / (n - 1)

sum_squared_deviations = 0

for points in player_points:
    deviation = points - sample_mean_manual
    sum_squared_deviations += deviation**2

sample_variance_manual = sum_squared_deviations / (n - 1)
sample_std_manual = np.sqrt(sample_variance_manual)

print("Manual Calculation:")
print(f"Sum of squared deviations: {sum_squared_deviations:.2f}")
print(f"Sample variance: {sample_variance_manual:.2f}")
print(f"Sample standard deviation: {sample_std_manual:.2f} points")

In [None]:
# Verify using numpy functions
sample_mean_numpy = np.mean(player_points)
sample_variance_numpy = np.var(player_points, ddof=1)
sample_std_numpy = np.std(player_points, ddof=1)

print("Verification using NumPy:")
print(f"Sample mean: {sample_mean_numpy:.2f} points per game")
print(f"Sample variance: {sample_variance_numpy:.2f}")
print(f"Sample standard deviation: {sample_std_numpy:.2f} points")

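# Only report a match if the values actually agree (up to floating-point tolerance)
assert np.isclose(sample_mean_manual, sample_mean_numpy)
assert np.isclose(sample_variance_manual, sample_variance_numpy)
print("\n✓ Manual and NumPy calculations match!")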

### Interpretation in Sports Context

**Mean (21.20 points):** On average, this player scores about 21 points per game. This is the typical performance level.

**Variance (13.12) and Standard Deviation (3.62 points):** These measure consistency. A standard deviation of 3.62 means the player's scoring typically differs from the mean by about 3.6 points. Most games (13 of the 20) fall between 18 and 24 points, which is within one standard deviation of the mean (17.58 to 24.82).

**Lower variance** means more consistent performance. **Higher variance** means more variable performance.

# Day 3: Joint and Conditional Probability

## Joint Probability

Joint probability is the probability that two events happen together.

**Notation:** P(X = x, Y = y) or P(X = x AND Y = y)

This tells us the probability that X equals x **and** Y equals y at the same time.

We often display joint probabilities in a table.

In [None]:
# Example: Joint probability table
# Two random variables: Weather (Sunny/Rainy) and Game Outcome (Win/Loss)

# Create a joint probability table
joint_prob_data = {
    'Weather': ['Sunny', 'Sunny', 'Rainy', 'Rainy'],
    'Outcome': ['Win', 'Loss', 'Win', 'Loss'],
    'Probability': [0.35, 0.15, 0.25, 0.25]
}

joint_prob_df = pd.DataFrame(joint_prob_data)

# Reshape for better display
joint_prob_table = joint_prob_df.pivot(index='Weather', columns='Outcome', values='Probability')

print("Joint Probability Table:")
print(joint_prob_table)
print(f"\nSum of all probabilities: {joint_prob_df['Probability'].sum()}")

## Marginal Probability

Marginal probability is the probability of one event regardless of the other.

We get marginal probabilities by summing across rows or columns in the joint table.

**Example:**
- P(Sunny) = P(Sunny, Win) + P(Sunny, Loss)
- P(Win) = P(Sunny, Win) + P(Rainy, Win)

In [None]:
# Compute marginal probabilities

# Marginal probabilities for Outcome (sum across Weather)
marginal_outcome = joint_prob_table.sum(axis=0)

# Marginal probabilities for Weather (sum across Outcome)
marginal_weather = joint_prob_table.sum(axis=1)

print("Marginal Probabilities:")
print("\nP(Outcome):")
print(marginal_outcome)
print("\nP(Weather):")
print(marginal_weather)

## Conditional Probability

Conditional probability is the probability of one event given that another event has occurred.

**Formula:**

P(A | B) = P(A AND B) / P(B)

Read as: "Probability of A given B"

**Example:** What is the probability of winning given that it's sunny?

P(Win | Sunny) = P(Win AND Sunny) / P(Sunny)

In [None]:
# Compute conditional probability from the joint table
# Question: What is P(Win | Sunny)?

p_win_and_sunny = joint_prob_table.loc['Sunny', 'Win']
p_sunny = marginal_weather['Sunny']

p_win_given_sunny = p_win_and_sunny / p_sunny

print("Conditional Probability Calculation:")
print(f"P(Win AND Sunny) = {p_win_and_sunny:.2f}")
print(f"P(Sunny) = {p_sunny:.2f}")
print(f"P(Win | Sunny) = {p_win_given_sunny:.4f}")
print(f"\nInterpretation: On sunny days, the team wins {p_win_given_sunny*100:.1f}% of the time.")
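
The same calculation can be done for every cell at once by dividing each row of the joint table by that row's marginal. The sketch below rebuilds the table from the example so it runs on its own; the names `joint`, `p_weather`, and `conditional_on_weather` are chosen for this illustration.

```python
import pandas as pd

# Joint probabilities from the Weather / Outcome example
joint = pd.DataFrame(
    {"Loss": [0.25, 0.15], "Win": [0.25, 0.35]},
    index=["Rainy", "Sunny"],
)

# Marginal P(Weather): sum each row across the outcomes
p_weather = joint.sum(axis=1)

# P(Outcome | Weather) = P(Weather, Outcome) / P(Weather), row by row
conditional_on_weather = joint.div(p_weather, axis=0)

print("P(Outcome | Weather):")
print(conditional_on_weather)
print("\nRow sums (each should be 1):")
print(conditional_on_weather.sum(axis=1))
```

Each row of the result is a probability distribution over outcomes for one kind of weather, so every row sums to 1 even though the joint table sums to 1 only as a whole.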

## Connection to Sports Analytics

Conditional probability helps answer questions like:
- What's the probability a player scores 25+ points given they play 35+ minutes?
- What's the probability of winning given the opponent's strength?
- What's the probability of making the playoffs given the current record?

Let's work through an example with simulated basketball data.

## Exercise: Conditional Probability from Basketball Data

**Question:** Estimate P(Player scores ≥25 points | Player plays ≥35 minutes) from sample data.

In [None]:
# Create basketball game dataset
# Each row: one game with points scored and minutes played

np.random.seed(123)  # For reproducibility

# Generate synthetic (simulated) basketball data
# Minutes played: whole numbers from 20 to 41 (uniform draws on [20, 42), truncated to int)
# Points: correlated with minutes (more minutes → more points tendency)

minutes_played = np.random.uniform(20, 42, size=50).astype(int)
points_scored = np.zeros(50)

for i in range(50):
    # Base points related to minutes
    base_points = 0.5 * minutes_played[i] + np.random.normal(0, 5)
    points_scored[i] = max(5, int(base_points))  # Ensure at least 5 points

# Create DataFrame
basketball_data = pd.DataFrame({
    'Minutes': minutes_played,
    'Points': points_scored.astype(int)
})

print("Basketball Game Data (first 10 games):")
print(basketball_data.head(10))
print(f"\nTotal number of games in dataset: {len(basketball_data)}")

In [None]:
# Show full dataset statistics
print("Dataset Summary Statistics:")
print(basketball_data.describe())

# Visualize the relationship
plt.figure(figsize=(10, 6))
plt.scatter(basketball_data['Minutes'], basketball_data['Points'], 
            alpha=0.6, s=100, color='purple', edgecolor='black')
plt.xlabel('Minutes Played', fontsize=12)
plt.ylabel('Points Scored', fontsize=12)
plt.title('Player Performance: Points vs Minutes', fontsize=14, fontweight='bold')
plt.grid(alpha=0.3)
plt.axhline(y=25, color='red', linestyle='--', label='25 points threshold')
plt.axvline(x=35, color='blue', linestyle='--', label='35 minutes threshold')
plt.legend()
plt.show()

In [None]:
# Step 1: Compute P(Points ≥ 25)
games_with_25plus_points = len(basketball_data[basketball_data['Points'] >= 25])
total_games = len(basketball_data)

prob_25plus_points = games_with_25plus_points / total_games

print("Step 1: P(Points ≥ 25)")
print(f"Games with 25+ points: {games_with_25plus_points}")
print(f"Total games: {total_games}")
print(f"P(Points ≥ 25) = {prob_25plus_points:.4f} or {prob_25plus_points*100:.2f}%")

In [None]:
# Step 2: Compute P(Minutes ≥ 35)
games_with_35plus_minutes = len(basketball_data[basketball_data['Minutes'] >= 35])

prob_35plus_minutes = games_with_35plus_minutes / total_games

print("Step 2: P(Minutes ≥ 35)")
print(f"Games with 35+ minutes: {games_with_35plus_minutes}")
print(f"Total games: {total_games}")
print(f"P(Minutes ≥ 35) = {prob_35plus_minutes:.4f} or {prob_35plus_minutes*100:.2f}%")

In [None]:
# Step 3: Compute P(Points ≥ 25 | Minutes ≥ 35)
# This is a conditional probability

# Filter data: games where player played 35+ minutes
games_35plus_minutes = basketball_data[basketball_data['Minutes'] >= 35]

# Among those games, count how many had 25+ points
games_35min_and_25pts = len(games_35plus_minutes[games_35plus_minutes['Points'] >= 25])

# Conditional probability formula: P(A|B) = P(A and B) / P(B)
# Here: count(35+ min AND 25+ pts) / count(35+ min)
conditional_prob = games_35min_and_25pts / games_with_35plus_minutes

print("Step 3: P(Points ≥ 25 | Minutes ≥ 35)")
print(f"Games with 35+ minutes AND 25+ points: {games_35min_and_25pts}")
print(f"Games with 35+ minutes: {games_with_35plus_minutes}")
print(f"P(Points ≥ 25 | Minutes ≥ 35) = {conditional_prob:.4f} or {conditional_prob*100:.2f}%")

In [None]:
# Summary: Print all probabilities with clear labels

print("="*60)
print("PROBABILITY SUMMARY")
print("="*60)
print(f"\nTotal games analyzed: {total_games}")
print(f"\n1. P(Points ≥ 25) = {prob_25plus_points:.4f} ({prob_25plus_points*100:.2f}%)")
print(f"   → Player scores 25+ points in {prob_25plus_points*100:.1f}% of all games")

print(f"\n2. P(Minutes ≥ 35) = {prob_35plus_minutes:.4f} ({prob_35plus_minutes*100:.2f}%)")
print(f"   → Player plays 35+ minutes in {prob_35plus_minutes*100:.1f}% of all games")

print(f"\n3. P(Points ≥ 25 | Minutes ≥ 35) = {conditional_prob:.4f} ({conditional_prob*100:.2f}%)")
print(f"   → When player plays 35+ minutes, they score 25+ points")
print(f"     in {conditional_prob*100:.1f}% of those games")
print("="*60)

### Interpretation of Conditional Probability

The conditional probability **P(Points ≥ 25 | Minutes ≥ 35)** tells us:

**Given that the player plays at least 35 minutes in a game, what is the probability they score at least 25 points?**

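We would expect this to be higher than the overall probability P(Points ≥ 25), because the simulated points were generated to rise with minutes played. Compare the two values in the summary above to confirm it for this sample; with only 50 games, the estimate is noisy. In real game data, a gap like this could arise because:
- More playing time provides more scoring opportunities
- Coaches give more minutes to players who are performing well
- Together, these produce a positive relationship between minutes and points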

**Practical use:** Coaches and analysts use conditional probabilities to:
- Predict performance based on playing time
- Make substitution decisions
- Evaluate player efficiency in different situations
- Inform game strategy

# Summary: Key Concepts Review

## 1. Random Variables

A **random variable** is a variable whose value depends on random events.
- **Discrete**: Takes specific values (dice rolls, goals scored)
- **Continuous**: Takes any value in a range (height, time)

## 2. PMF, PDF, and CDF

**PMF (Probability Mass Function):** For discrete variables, gives P(X = x) for each value.

**PDF (Probability Density Function):** For continuous variables, describes probability density. Area under curve gives probability.

**CDF (Cumulative Distribution Function):** Gives P(X ≤ x). It never decreases, rising from 0 to 1.

## 3. Expected Value, Variance, and Covariance

**Expected value E[X]:** The average value over many repetitions. The "center" of the distribution.

**Variance Var(X):** Measures spread around the mean. Shows how variable the outcomes are.

**Covariance Cov(X,Y):** Measures how two variables change together. Positive means they tend to increase together.

## 4. Joint and Conditional Probability

**Joint probability P(X, Y):** Probability that X and Y occur together.

**Marginal probability:** Probability of one variable, summing over all values of the other.

**Conditional probability P(A|B):** Probability of A given that B occurred. Formula: P(A|B) = P(A,B) / P(B).

---

### Next Steps

Practice these concepts with different datasets. Try calculating probabilities for your favorite sports scenarios. Experiment with different distributions in Python.

**Happy learning!**