# Lesson 1: Foundations of Statistical Thinking

## Understanding Data and Its Characteristics

This notebook covers the fundamental concepts of statistical thinking, including:
- What statistics is and why it matters
- Populations vs. Samples
- Sampling methods and biases
- Measures of central tendency and dispersion

In [1]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.animation import FuncAnimation
from IPython.display import HTML, display
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed
import warnings
warnings.filterwarnings('ignore')

# Set style for better-looking plots
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

# Define consistent colors
COLOR_POPULATION = '#3498db'  # Blue
COLOR_SAMPLE = '#f39c12'      # Orange
COLOR_BIAS = '#e74c3c'        # Red
COLOR_GOOD = '#27ae60'        # Green

print("Libraries loaded successfully! 🎉")

Libraries loaded successfully! 🎉


---

## Slide 1: Welcome to Statistical Thinking

**Key Concept**: Statistics helps us find patterns in noisy, uncertain data. It's like using a flashlight in a dark room - not perfect, but enough to move forward safely.

In [None]:
# Interactive Coin Flip Convergence Demonstration

def coin_flip_convergence(n_flips):
    """Show how coin flips converge to 50/50 over time"""
    np.random.seed(42)
    flips = np.random.choice([0, 1], size=n_flips)
    cumulative_mean = np.cumsum(flips) / np.arange(1, n_flips + 1)
    
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    
    # Plot 1: Cumulative proportion
    x = np.arange(1, n_flips + 1)
    ax1.plot(x, cumulative_mean, color=COLOR_POPULATION, linewidth=2, alpha=0.8)
    ax1.axhline(y=0.5, color=COLOR_GOOD, linestyle='--', linewidth=2, label='True Probability')
    ax1.fill_between(x, cumulative_mean, 0.5, alpha=0.3, color=COLOR_POPULATION)
    ax1.set_xlabel('Number of Flips', fontsize=12)
    ax1.set_ylabel('Proportion of Heads', fontsize=12)
    ax1.set_title(f'Convergence to True Probability ({n_flips} flips)', fontsize=14, fontweight='bold')
    ax1.legend(loc='upper right')
    ax1.grid(True, alpha=0.3)
    ax1.set_ylim([0, 1])
    
    # Plot 2: Finding Signal in Noise
    noise = np.random.normal(0, 1, 100)
    signal = 2 * np.sin(np.linspace(0, 4*np.pi, 100))
    noisy_signal = signal + noise
    
    ax2.scatter(range(100), noisy_signal, alpha=0.5, s=20, color=COLOR_SAMPLE, label='Noisy Data')
    ax2.plot(range(100), signal, color=COLOR_POPULATION, linewidth=3, label='Hidden Pattern')
    ax2.set_xlabel('Observation', fontsize=12)
    ax2.set_ylabel('Value', fontsize=12)
    ax2.set_title('Finding Signal in Noise', fontsize=14, fontweight='bold')
    ax2.legend(loc='upper right')
    ax2.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Display insight
    final_proportion = cumulative_mean[-1]
    print(f"\n📊 After {n_flips} flips:")
    print(f"   Proportion of heads: {final_proportion:.3f}")
    print(f"   Distance from true probability: {abs(final_proportion - 0.5):.3f}")
    if n_flips < 100:
        print("   💡 Try more flips to see better convergence!")
    else:
        print("   ✨ Notice how the proportion tends to settle near 0.5 as the flips accumulate!")

# Create interactive widget
interact(coin_flip_convergence, 
         n_flips=widgets.IntSlider(min=10, max=1000, step=10, value=100, 
                                   description='# Flips:', continuous_update=False));


---

## Slide 2: Populations vs. Samples

**Key Concept**: We study small samples to understand large populations. It's like tasting soup - you don't need to eat the whole pot to know how it tastes, as long as the pot is well stirred (the sample is representative)!
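As a rough numeric companion to the soup analogy, the sketch below draws many samples from a made-up population and measures how much the sample means vary. The population parameters, the seed, and the helper name `spread_of_sample_means` are assumptions chosen for illustration. Under these assumptions the spread should shrink roughly like σ/√n as the sample size n grows.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Hypothetical population: 10,000 IQ-like scores (mean 100, SD 15)
population = rng.normal(100, 15, 10_000)

def spread_of_sample_means(sample_size, n_samples=500):
    """Draw repeated samples without replacement; return the SD of their means."""
    means = [rng.choice(population, sample_size, replace=False).mean()
             for _ in range(n_samples)]
    return np.std(means)

spread_small = spread_of_sample_means(25)
spread_large = spread_of_sample_means(400)

print(f"Population mean: {population.mean():.2f}")
print(f"SD of sample means, n=25:  {spread_small:.2f} (theory: about {15 / np.sqrt(25):.2f})")
print(f"SD of sample means, n=400: {spread_large:.2f} (theory: about {15 / np.sqrt(400):.2f})")
```

A sample 16 times larger gives sample means that are only about 4 times less variable, which is why precision gets expensive quickly.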

In [None]:
# Interactive Population vs Sample Visualization

def visualize_population_sampling(sample_size=100, n_samples=1):
    """Visualize sampling from a population"""
    np.random.seed(42)
    
    # Create a population
    population_size = 10000
    population = np.random.normal(100, 15, population_size)  # IQ scores example
    
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    
    # Plot 1: Population visualization
    ax1 = axes[0, 0]
    ax1.scatter(np.random.uniform(0, 10, 500), np.random.choice(population, 500), 
                alpha=0.3, s=30, color=COLOR_POPULATION)
    ax1.set_title(f'Population (N = {population_size:,})', fontsize=14, fontweight='bold')
    ax1.set_xlabel('Random Position')
    ax1.set_ylabel('Value')
    ax1.axhline(y=population.mean(), color='red', linestyle='--', label=f'True Mean: {population.mean():.1f}')
    ax1.legend()
    
    # Plot 2: Sample visualization
    ax2 = axes[0, 1]
    samples = []
    sample_means = []
    
    for i in range(n_samples):
        sample = np.random.choice(population, sample_size, replace=False)
        samples.append(sample)
        sample_means.append(sample.mean())
        
    # Show the last sample
    ax2.scatter(np.random.uniform(0, 10, sample_size), samples[-1], 
                alpha=0.6, s=40, color=COLOR_SAMPLE)
    ax2.set_title(f'Sample (n = {sample_size})', fontsize=14, fontweight='bold')
    ax2.set_xlabel('Random Position')
    ax2.set_ylabel('Value')
    ax2.axhline(y=samples[-1].mean(), color='orange', linestyle='--', 
                label=f'Sample Mean: {samples[-1].mean():.1f}')
    ax2.axhline(y=population.mean(), color='red', linestyle='--', alpha=0.5,
                label=f'True Mean: {population.mean():.1f}')
    ax2.legend()
    
    # Plot 3: Distribution comparison
    ax3 = axes[1, 0]
    ax3.hist(population, bins=50, alpha=0.5, color=COLOR_POPULATION, 
             density=True, label='Population')
    ax3.hist(samples[-1], bins=20, alpha=0.7, color=COLOR_SAMPLE, 
             density=True, label=f'Sample (n={sample_size})')
    ax3.set_title('Distribution Comparison', fontsize=14, fontweight='bold')
    ax3.set_xlabel('Value')
    ax3.set_ylabel('Density')
    ax3.legend()
    
    # Plot 4: Sampling distribution of means
    ax4 = axes[1, 1]
    if n_samples > 1:
        # Guard against bins=0, which would raise an error when n_samples < 3
        ax4.hist(sample_means, bins=max(1, min(30, n_samples // 3)), 
                color=COLOR_GOOD, alpha=0.7, edgecolor='black')
        ax4.axvline(x=population.mean(), color='red', linestyle='--', linewidth=2,
                   label=f'True Mean: {population.mean():.1f}')
        ax4.axvline(x=np.mean(sample_means), color='green', linestyle='--', linewidth=2,
                   label=f'Avg of Sample Means: {np.mean(sample_means):.1f}')
        ax4.set_title(f'Distribution of {n_samples} Sample Means', fontsize=14, fontweight='bold')
        ax4.set_xlabel('Sample Mean')
        ax4.set_ylabel('Frequency')
        ax4.legend()
    else:
        ax4.text(0.5, 0.5, 'Increase number of samples\nto see sampling distribution', 
                ha='center', va='center', fontsize=14, transform=ax4.transAxes)
        ax4.set_xticks([])
        ax4.set_yticks([])
    
    plt.tight_layout()
    plt.show()
    
    # Display insights
    print(f"\n📊 Sampling Insights:")
    print(f"   Population mean: {population.mean():.2f}")
    print(f"   Population std dev: {population.std():.2f}")
    print(f"   Last sample mean: {samples[-1].mean():.2f}")
    print(f"   Sampling error: {abs(samples[-1].mean() - population.mean()):.2f}")
    if n_samples > 1:
        print(f"   Average of all {n_samples} sample means: {np.mean(sample_means):.2f}")
        print(f"   Std dev of sample means (empirical standard error): {np.std(sample_means):.2f}")

# Create interactive widgets
interact(visualize_population_sampling,
         sample_size=widgets.IntSlider(min=10, max=1000, step=10, value=100,
                                       description='Sample Size:', continuous_update=False),
         n_samples=widgets.IntSlider(min=1, max=100, step=1, value=1,
                                     description='# Samples:', continuous_update=False));


---

## Slide 3: Sampling Methods Overview

**Key Concepts**: Common ways to collect a sample:
- **Simple Random**: Every member has an equal chance of selection (like drawing names from a hat)
- **Stratified**: Divide the population into subgroups (strata) and sample randomly from each
- **Cluster**: Randomly select whole groups instead of individuals
- **Systematic**: Pick every nth member after a random starting point
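The four methods above can be sketched in a few lines of NumPy without any plotting. The population of 400 member IDs, the four equal-sized strata, and the 25 clusters of 16 members are assumptions chosen so that every method returns exactly 80 members.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed for reproducibility

population = np.arange(400)     # hypothetical member IDs 0..399
strata = population // 100      # 4 equal-sized strata (assumed)
clusters = population // 16     # 25 clusters of 16 members each (assumed)
sample_size = 80

# Simple random: every member is equally likely to be chosen
simple = rng.choice(population, sample_size, replace=False)

# Stratified: draw the same number from each (equal-sized) stratum
stratified = np.concatenate([
    rng.choice(population[strata == s], sample_size // 4, replace=False)
    for s in range(4)
])

# Cluster: choose whole clusters at random and keep every member in them
chosen_clusters = rng.choice(25, sample_size // 16, replace=False)
cluster = population[np.isin(clusters, chosen_clusters)]

# Systematic: random start, then every k-th member
k = len(population) // sample_size
start = rng.integers(0, k)
systematic = population[start::k]

for name, s in [("simple", simple), ("stratified", stratified),
                ("cluster", cluster), ("systematic", systematic)]:
    print(f"{name:>10}: {len(s)} members, first five IDs {np.sort(s)[:5]}")
```

Because the strata here are equal in size, taking the same count from each one is also proportional allocation. With unequal strata the per-stratum counts would need to be scaled by stratum size.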

In [None]:
# Interactive Sampling Methods Visualization

def demonstrate_sampling_methods(method='Simple Random', sample_pct=20):
    """Demonstrate different sampling methods visually"""
    np.random.seed(42)
    
    # Create a grid population with different characteristics
    grid_size = 20
    total_population = grid_size * grid_size
    sample_size = int(total_population * sample_pct / 100)
    
    # Create strata (4 groups)
    strata = np.array([[0, 0, 1, 1] * 5] * 5 + [[2, 2, 3, 3] * 5] * 5 + 
                      [[0, 0, 1, 1] * 5] * 5 + [[2, 2, 3, 3] * 5] * 5).flatten()[:total_population]
    
    # Create clusters (groups of nearby points)
    clusters = np.repeat(np.arange(25), 16)[:total_population]
    
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
    
    # Create position grid
    positions = [(i, j) for i in range(grid_size) for j in range(grid_size)]
    x_pos = [p[0] for p in positions]
    y_pos = [p[1] for p in positions]
    
    # Determine which points are sampled based on method
    sampled = np.zeros(total_population, dtype=bool)
    
    if method == 'Simple Random':
        sampled_idx = np.random.choice(total_population, sample_size, replace=False)
        sampled[sampled_idx] = True
        
    elif method == 'Stratified':
        samples_per_stratum = sample_size // 4
        for s in range(4):
            stratum_idx = np.where(strata == s)[0]
            sampled_idx = np.random.choice(stratum_idx, 
                                         min(samples_per_stratum, len(stratum_idx)), 
                                         replace=False)
            sampled[sampled_idx] = True
            
    elif method == 'Cluster':
        n_clusters_to_sample = max(1, sample_size // 16)
        sampled_clusters = np.random.choice(25, n_clusters_to_sample, replace=False)
        for c in sampled_clusters:
            sampled[clusters == c] = True
            
    elif method == 'Systematic':
        interval = max(1, total_population // sample_size)
        start = np.random.randint(0, interval)
        sampled[start::interval] = True
    
    # Plot 1: Population with sampling highlighted
    colors = ['lightblue' if not s else COLOR_SAMPLE for s in sampled]
    sizes = [50 if not s else 150 for s in sampled]
    
    scatter = ax1.scatter(x_pos, y_pos, c=colors, s=sizes, alpha=0.7, edgecolor='black')
    ax1.set_title(f'{method} Sampling', fontsize=14, fontweight='bold')
    ax1.set_xlabel('X Position')
    ax1.set_ylabel('Y Position')
    ax1.grid(True, alpha=0.3)
    ax1.set_aspect('equal')
    
    # Add strata boundaries if stratified sampling
    # (the strata array switches every 5 units along x and every 2 units along y)
    if method == 'Stratified':
        for x_edge in (4.5, 9.5, 14.5):
            ax1.axvline(x=x_edge, color='red', linestyle='--', alpha=0.5)
        for y_edge in np.arange(1.5, grid_size - 1, 2):
            ax1.axhline(y=y_edge, color='red', linestyle=':', alpha=0.3)
    
    # Plot 2: Characteristics of the sample
    ax2.bar(['Population', 'Sample'], 
            [total_population, sampled.sum()],
            color=[COLOR_POPULATION, COLOR_SAMPLE])
    ax2.set_title('Sample Size Comparison', fontsize=14, fontweight='bold')
    ax2.set_ylabel('Count')
    
    # Add percentage labels
    ax2.text(0, total_population + 10, f'{total_population}', ha='center', fontweight='bold')
    ax2.text(1, sampled.sum() + 10, f'{sampled.sum()}\n({sampled.sum()/total_population*100:.1f}%)', 
             ha='center', fontweight='bold')
    
    plt.tight_layout()
    plt.show()
    
    # Display method description
    descriptions = {
        'Simple Random': "Each member has an equal chance of being selected. Like drawing names from a hat!",
        'Stratified': "Divide population into groups (strata), then randomly sample from each group proportionally.",
        'Cluster': "Randomly select entire groups (clusters) rather than individuals. Efficient for geographic sampling!",
        'Systematic': "Select every nth member after a random starting point. Quick and easy to implement!"
    }
    
    print(f"\n📊 {method} Sampling:")
    print(f"   {descriptions[method]}")
    print(f"   Sample size: {sampled.sum()} out of {total_population} ({sampled.sum()/total_population*100:.1f}%)")

# Create interactive widget
interact(demonstrate_sampling_methods,
         method=widgets.Dropdown(options=['Simple Random', 'Stratified', 'Cluster', 'Systematic'],
                                 value='Simple Random', description='Method:'),
         sample_pct=widgets.IntSlider(min=5, max=50, step=5, value=20,
                                      description='Sample %:', continuous_update=False));


---

## Slide 4: Common Sampling Biases

**Key Concepts**: Biases that can ruin your analysis:
- **Selection Bias**: Only certain types participate
- **Survivorship Bias**: Only studying "winners"
- **Response Bias**: People don't answer honestly
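- **Historical Example**: The 1936 *Literary Digest* poll drew its sample largely from telephone directories and automobile registrations, which over-represented wealthier voters, and it wrongly predicted that Landon would beat Roosevelt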
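To see selection bias in numbers before exploring it interactively, here is a minimal sketch. It assumes a made-up population of satisfaction scores and an assumed response model in which the chance of answering the survey equals the score divided by 100. Neither is taken from real data.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed for reproducibility

# Hypothetical population of satisfaction scores on a 0-100 scale
scores = rng.beta(5, 3, 10_000) * 100

# Unbiased reference: a simple random sample of 500 customers
random_sample = rng.choice(scores, 500, replace=False)

# Selection bias (assumed model): happier customers are more likely to respond
responds = rng.random(scores.size) < scores / 100
biased_sample = scores[responds]

print(f"Population mean:    {scores.mean():.1f}")
print(f"Random sample mean: {random_sample.mean():.1f}")
print(f"Self-selected mean: {biased_sample.mean():.1f} from {biased_sample.size} responders")
```

The self-selected group is far larger than the random sample, yet its mean sits further from the truth. A bigger sample does not repair a biased selection process.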

In [None]:
# Interactive Bias Demonstration

def demonstrate_sampling_bias(bias_type='No Bias', bias_strength=0.5):
    """Demonstrate different types of sampling bias"""
    np.random.seed(42)
    
    # Generate true population data (customer satisfaction scores)
    population_size = 10000
    true_satisfaction = np.random.beta(5, 3, population_size) * 100  # Slightly left-skewed (mean about 62.5)
    
    sample_size = 500
    
    if bias_type == 'No Bias':
        # Random sampling
        sample_idx = np.random.choice(population_size, sample_size, replace=False)
        sample = true_satisfaction[sample_idx]
        
    elif bias_type == 'Selection Bias':
        # Happier customers are more likely to respond
        response_probability = (true_satisfaction / 100) ** (2 * bias_strength)
        responded = np.random.random(population_size) < response_probability
        responders = true_satisfaction[responded]
        if len(responders) >= sample_size:
            sample = np.random.choice(responders, sample_size, replace=False)
        else:
            sample = responders
            
    elif bias_type == 'Survivorship Bias':
        # Only successful cases remain: a higher bias_strength removes a larger
        # share of low scorers (up to the bottom 50% at strength 1.0)
        threshold = np.percentile(true_satisfaction, bias_strength * 50)
        survivors = true_satisfaction[true_satisfaction > threshold]
        sample = np.random.choice(survivors, min(sample_size, len(survivors)), replace=False)
        
    elif bias_type == 'Response Bias':
        # People exaggerate their satisfaction
        sample_idx = np.random.choice(population_size, sample_size, replace=False)
        sample = true_satisfaction[sample_idx]
        # Add positive bias
        sample = np.minimum(100, sample + np.random.normal(10 * bias_strength, 5, sample_size))
    
    # Create visualization
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    
    # Plot 1: Population distribution
    ax1 = axes[0, 0]
    ax1.hist(true_satisfaction, bins=30, color=COLOR_POPULATION, alpha=0.7, edgecolor='black')
    ax1.axvline(true_satisfaction.mean(), color='red', linestyle='--', linewidth=2,
                label=f'True Mean: {true_satisfaction.mean():.1f}')
    ax1.set_title('True Population Distribution', fontsize=14, fontweight='bold')
    ax1.set_xlabel('Satisfaction Score')
    ax1.set_ylabel('Frequency')
    ax1.legend()
    ax1.set_xlim(0, 100)
    
    # Plot 2: Sample distribution
    ax2 = axes[0, 1]
    color = COLOR_SAMPLE if bias_type == 'No Bias' else COLOR_BIAS
    ax2.hist(sample, bins=30, color=color, alpha=0.7, edgecolor='black')
    ax2.axvline(sample.mean(), color='orange', linestyle='--', linewidth=2,
                label=f'Sample Mean: {sample.mean():.1f}')
    ax2.axvline(true_satisfaction.mean(), color='red', linestyle='--', linewidth=2, alpha=0.5,
                label=f'True Mean: {true_satisfaction.mean():.1f}')
    ax2.set_title(f'Sample Distribution ({bias_type})', fontsize=14, fontweight='bold')
    ax2.set_xlabel('Satisfaction Score')
    ax2.set_ylabel('Frequency')
    ax2.legend()
    ax2.set_xlim(0, 100)
    
    # Plot 3: Comparison overlay
    ax3 = axes[1, 0]
    ax3.hist(true_satisfaction, bins=30, alpha=0.4, color=COLOR_POPULATION, 
             density=True, label='True Population')
    ax3.hist(sample, bins=30, alpha=0.6, color=color,
             density=True, label='Sample')
    ax3.set_title('Distribution Comparison', fontsize=14, fontweight='bold')
    ax3.set_xlabel('Satisfaction Score')
    ax3.set_ylabel('Density')
    ax3.legend()
    ax3.set_xlim(0, 100)
    
    # Plot 4: Bias Impact Visualization
    ax4 = axes[1, 1]
    metrics = ['True Mean', 'Sample Mean', 'Bias']
    values = [true_satisfaction.mean(), sample.mean(), sample.mean() - true_satisfaction.mean()]
    colors_bar = [COLOR_GOOD, color, COLOR_BIAS if values[2] != 0 else COLOR_GOOD]
    
    bars = ax4.bar(metrics, values, color=colors_bar, alpha=0.7, edgecolor='black')
    ax4.set_title('Bias Impact', fontsize=14, fontweight='bold')
    ax4.set_ylabel('Value')
    ax4.axhline(y=0, color='black', linestyle='-', linewidth=0.5)
    
    # Add value labels on bars
    for bar, value in zip(bars, values):
        height = bar.get_height()
        ax4.text(bar.get_x() + bar.get_width()/2., height + np.sign(height) * 1,
                f'{value:.1f}', ha='center', va='bottom' if height > 0 else 'top',
                fontweight='bold')
    
    plt.tight_layout()
    plt.show()
    
    # Display insights
    bias_descriptions = {
        'No Bias': "Random sampling - our sample represents the population well!",
        'Selection Bias': "Only certain types respond - happy customers more likely to participate.",
        'Survivorship Bias': "We only see the 'winners' - failures have disappeared from our data.",
        'Response Bias': "People don't answer honestly - they say what sounds good."
    }
    
    print(f"\n⚠️ {bias_type}:")
    print(f"   {bias_descriptions[bias_type]}")
    print(f"\n📊 Results:")
    print(f"   True population mean: {true_satisfaction.mean():.2f}")
    print(f"   Sample mean: {sample.mean():.2f}")
    print(f"   Bias: {sample.mean() - true_satisfaction.mean():+.2f}")
    print(f"   Absolute error: {abs(sample.mean() - true_satisfaction.mean()):.2f} points")

# Create interactive widget
interact(demonstrate_sampling_bias,
         bias_type=widgets.Dropdown(options=['No Bias', 'Selection Bias', 'Survivorship Bias', 'Response Bias'],
                                    value='No Bias', description='Bias Type:'),
         bias_strength=widgets.FloatSlider(min=0.1, max=1.0, step=0.1, value=0.5,
                                           description='Strength:', continuous_update=False));


---

## Slide 5: Measures of Central Tendency

**Key Concepts**: Different ways to find the "middle":
- **Mean**: The average (sensitive to outliers)
- **Median**: The middle value (robust to outliers)
- **Mode**: The most common value
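A tiny worked example makes the outlier sensitivity concrete. The seven salaries below (in thousands of dollars) and the single 1,000k outlier are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries in thousands of dollars
salaries = np.array([30, 35, 35, 40, 45, 50, 55])

mean_before = salaries.mean()                      # 290 / 7, about 41.43
median_before = np.median(salaries)                # 40.0
mode_before = pd.Series(salaries).mode().iloc[0]   # 35 (appears twice)

# Add one extreme value, such as a CEO salary of $1,000k
with_ceo = np.append(salaries, 1000)
mean_after = with_ceo.mean()                       # 1290 / 8 = 161.25
median_after = np.median(with_ceo)                 # (40 + 45) / 2 = 42.5

print(f"Before: mean={mean_before:.2f}, median={median_before:.1f}, mode={mode_before}")
print(f"After:  mean={mean_after:.2f}, median={median_after:.1f}")
```

One extra value nearly quadruples the mean, while the median moves by only 2.5.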

In [6]:
# Interactive Central Tendency Demonstration

def explore_central_tendency(add_outlier=False, outlier_value=1000000):
    """Interactive demonstration of mean, median, and mode"""
    np.random.seed(42)
    
    # Create salary data (in thousands)
    base_salaries = np.concatenate([
        np.random.normal(35, 5, 30),   # Entry level
        np.random.normal(50, 8, 40),   # Mid level
        np.random.normal(75, 10, 20),  # Senior level
        np.random.normal(100, 15, 10)  # Executive level
    ])
    
    if add_outlier:
        salaries = np.append(base_salaries, outlier_value/1000)  # Add CEO salary
    else:
        salaries = base_salaries
    
    # Calculate statistics
    mean_val = np.mean(salaries)
    median_val = np.median(salaries)
    mode_val = float(pd.Series(np.round(salaries, -1)).mode().iloc[0])  # Round to nearest 10k
    
    # Create visualization
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    
    # Plot 1: Histogram with measures
    ax1 = axes[0, 0]
    n, bins, patches = ax1.hist(salaries, bins=30, alpha=0.7, color=COLOR_POPULATION, edgecolor='black')
    ax1.axvline(mean_val, color='red', linestyle='--', linewidth=2, label=f'Mean: ${mean_val:.0f}k')
    ax1.axvline(median_val, color='green', linestyle='--', linewidth=2, label=f'Median: ${median_val:.0f}k')
    ax1.axvline(mode_val, color='orange', linestyle='--', linewidth=2, label=f'Mode: ${mode_val:.0f}k')
    ax1.set_title('Salary Distribution with Central Measures', fontsize=14, fontweight='bold')
    ax1.set_xlabel('Salary (thousands $)')
    ax1.set_ylabel('Frequency')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    
    # Plot 2: Box plot
    ax2 = axes[0, 1]
    bp = ax2.boxplot(salaries, vert=True, patch_artist=True)
    bp['boxes'][0].set_facecolor(COLOR_SAMPLE)
    ax2.set_title('Box Plot View', fontsize=14, fontweight='bold')
    ax2.set_ylabel('Salary (thousands $)')
    ax2.grid(True, alpha=0.3, axis='y')
    
    # Add annotations
    ax2.text(1.2, median_val, f'Median: ${median_val:.0f}k', va='center')
    ax2.text(1.2, np.percentile(salaries, 25), f'Q1: ${np.percentile(salaries, 25):.0f}k', va='center')
    ax2.text(1.2, np.percentile(salaries, 75), f'Q3: ${np.percentile(salaries, 75):.0f}k', va='center')
    
    # Plot 3: Comparison of measures
    ax3 = axes[1, 0]
    measures = ['Mean', 'Median', 'Mode']
    values = [mean_val, median_val, mode_val]
    colors_bars = ['red', 'green', 'orange']
    bars = ax3.bar(measures, values, color=colors_bars, alpha=0.7, edgecolor='black')
    ax3.set_title('Comparison of Central Measures', fontsize=14, fontweight='bold')
    ax3.set_ylabel('Salary (thousands $)')
    
    # Add value labels
    for bar, val in zip(bars, values):
        ax3.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 2,
                f'${val:.0f}k', ha='center', fontweight='bold')
    
    # Plot 4: Sensitivity analysis
    ax4 = axes[1, 1]
    
    # Show how mean changes with outliers
    outlier_effects = []
    outlier_values = np.linspace(0, 2000, 50)
    for out_val in outlier_values:
        temp_salaries = np.append(base_salaries, out_val)
        outlier_effects.append(np.mean(temp_salaries))
    
    ax4.plot(outlier_values, outlier_effects, color='red', linewidth=2, label='Mean')
    ax4.axhline(y=np.median(base_salaries), color='green', linestyle='--', linewidth=2, label='Median (stable)')
    ax4.set_title('Sensitivity to Outliers', fontsize=14, fontweight='bold')
    ax4.set_xlabel('Outlier Value (thousands $)')
    ax4.set_ylabel('Central Measure Value')
    ax4.legend()
    ax4.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Display insights
    print("\n📊 Central Tendency Analysis:")
    print(f"   Mean (Average): ${mean_val:.2f}k")
    print(f"   Median (Middle): ${median_val:.2f}k")
    print(f"   Mode (Most Common): ${mode_val:.0f}k")
    print(f"\n💡 Insights:")
    
    if add_outlier:
        print(f"   ⚠️ With CEO salary (${outlier_value/1000:.0f}k):")
        print(f"      Mean jumped to ${mean_val:.0f}k (misleading!)")
        print(f"      Median stayed around ${median_val:.0f}k (more representative)")
        print(f"      This is why the median is often the better summary for skewed data!")
    else:
        print(f"   Without outliers, mean and median are similar.")
        print(f"   Try adding an outlier to see the difference!")

# Create interactive widget
interact(explore_central_tendency,
         add_outlier=widgets.Checkbox(value=False, description='Add CEO Salary'),
         outlier_value=widgets.IntSlider(min=500000, max=5000000, step=100000, value=1000000,
                                         description='CEO Salary $:', continuous_update=False));


---

## Slide 6: Measures of Dispersion

**Key Concepts**: How spread out is your data?
- **Range**: Max - Min (simple but affected by outliers)
- **IQR**: Middle 50% spread (robust)
- **Standard Deviation**: Typical distance from the mean (square root of the average squared deviation)
- **MAD**: Median Absolute Deviation (robust to outliers)
- **CV**: Coefficient of Variation (relative spread)
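The five measures listed above can be computed directly with NumPy. The helper name `dispersion` and the small sales dataset are assumptions made for this sketch. The standard deviation uses NumPy's default population form (`ddof=0`).

```python
import numpy as np

def dispersion(data):
    """Return the five spread measures for a 1-D collection of numbers."""
    data = np.asarray(data, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])
    median = np.median(data)
    return {
        "range": data.max() - data.min(),
        "iqr": q3 - q1,
        "std": data.std(),                        # population form (ddof=0)
        "mad": np.median(np.abs(data - median)),  # median absolute deviation
        "cv_pct": data.std() / data.mean() * 100,
    }

daily_sales = [90, 95, 100, 105, 110]             # hypothetical values
clean = dispersion(daily_sales)
with_outlier = dispersion(daily_sales + [300])    # add one extreme day

for name in clean:
    print(f"{name:>7}: {clean[name]:8.2f} -> {with_outlier[name]:8.2f}")
```

In this small example the single extreme day multiplies the standard deviation roughly tenfold, while the MAD only rises from 5 to 7.5.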

In [None]:
# Interactive Dispersion Measures Demonstration

def explore_dispersion(spread_level='Low', add_outliers=False):
    """Demonstrate different measures of spread"""
    np.random.seed(42)
    
    # Create two sales reps with same mean but different spreads
    mean_sales = 100
    
    if spread_level == 'Low':
        std_dev = 5
    elif spread_level == 'Medium':
        std_dev = 15
    else:  # High
        std_dev = 30
    
    # Generate daily sales data for 100 days
    rep_a_sales = np.random.normal(mean_sales, 5, 100)  # Consistent performer
    rep_b_sales = np.random.normal(mean_sales, std_dev, 100)  # Variable performer
    
    if add_outliers:
        # Add some outliers to Rep B
        outlier_days = np.random.choice(100, 5, replace=False)
        rep_b_sales[outlier_days] = np.random.choice([20, 200], 5)  # Very bad or very good days
    
    # Calculate all dispersion measures
    def calculate_measures(data):
        return {
            'Range': np.max(data) - np.min(data),
            'IQR': np.percentile(data, 75) - np.percentile(data, 25),
            'Std Dev': np.std(data),
            'MAD': np.median(np.abs(data - np.median(data))),
            'CV': (np.std(data) / np.mean(data)) * 100  # As percentage
        }
    
    measures_a = calculate_measures(rep_a_sales)
    measures_b = calculate_measures(rep_b_sales)
    
    # Create visualization
    fig, axes = plt.subplots(2, 3, figsize=(16, 10))
    
    # Plot 1: Time series comparison
    ax1 = axes[0, 0]
    days = np.arange(1, 101)
    ax1.plot(days, rep_a_sales, alpha=0.7, color=COLOR_GOOD, label='Rep A (Consistent)', linewidth=1.5)
    ax1.plot(days, rep_b_sales, alpha=0.7, color=COLOR_BIAS, label=f'Rep B ({spread_level} Variability)', linewidth=1.5)
    ax1.axhline(y=mean_sales, color='black', linestyle='--', alpha=0.5, label=f'Target: ${mean_sales}')
    ax1.set_title('Daily Sales Performance', fontsize=14, fontweight='bold')
    ax1.set_xlabel('Day')
    ax1.set_ylabel('Sales ($)')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    
    # Plot 2: Distribution comparison
    ax2 = axes[0, 1]
    ax2.hist(rep_a_sales, bins=20, alpha=0.5, color=COLOR_GOOD, label='Rep A', density=True)
    ax2.hist(rep_b_sales, bins=20, alpha=0.5, color=COLOR_BIAS, label='Rep B', density=True)
    ax2.set_title('Distribution of Daily Sales', fontsize=14, fontweight='bold')
    ax2.set_xlabel('Sales ($)')
    ax2.set_ylabel('Density')
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    
    # Plot 3: Box plot comparison
    ax3 = axes[0, 2]
    bp = ax3.boxplot([rep_a_sales, rep_b_sales], patch_artist=True)
    ax3.set_xticklabels(['Rep A', 'Rep B'])  # avoids the deprecated `labels` keyword
    bp['boxes'][0].set_facecolor(COLOR_GOOD)
    bp['boxes'][1].set_facecolor(COLOR_BIAS)
    ax3.set_title('Box Plot Comparison', fontsize=14, fontweight='bold')
    ax3.set_ylabel('Sales ($)')
    ax3.grid(True, alpha=0.3, axis='y')
    
    # Plot 4: Dispersion measures comparison
    ax4 = axes[1, 0]
    measure_names = list(measures_a.keys())
    x = np.arange(len(measure_names))
    width = 0.35
    
    bars1 = ax4.bar(x - width/2, list(measures_a.values()), width, 
                    label='Rep A', color=COLOR_GOOD, alpha=0.7)
    bars2 = ax4.bar(x + width/2, list(measures_b.values()), width, 
                    label='Rep B', color=COLOR_BIAS, alpha=0.7)
    
    ax4.set_title('Dispersion Measures Comparison', fontsize=14, fontweight='bold')
    ax4.set_ylabel('Value')
    ax4.set_xticks(x)
    ax4.set_xticklabels(measure_names)
    ax4.legend()
    ax4.grid(True, alpha=0.3, axis='y')
    
    # Plot 5: Visual explanation of standard deviation
    ax5 = axes[1, 1]
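    # Span both reps' data so Rep B's curve is not cut off at higher variability
    x_range = np.linspace(min(rep_a_sales.min(), rep_b_sales.min()),
                          max(rep_a_sales.max(), rep_b_sales.max()), 200)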
    
    # Normal distributions
    from scipy import stats
    dist_a = stats.norm.pdf(x_range, np.mean(rep_a_sales), np.std(rep_a_sales))
    dist_b = stats.norm.pdf(x_range, np.mean(rep_b_sales), np.std(rep_b_sales))
    
    ax5.plot(x_range, dist_a, color=COLOR_GOOD, linewidth=2, label='Rep A (Low SD)')
    ax5.fill_between(x_range, dist_a, alpha=0.3, color=COLOR_GOOD)
    ax5.plot(x_range, dist_b, color=COLOR_BIAS, linewidth=2, label='Rep B (High SD)')
    ax5.fill_between(x_range, dist_b, alpha=0.3, color=COLOR_BIAS)
    
    # Add SD markers
    mean_a = np.mean(rep_a_sales)
    std_a = np.std(rep_a_sales)
    ax5.axvline(mean_a, color=COLOR_GOOD, linestyle='--', alpha=0.5)
    ax5.axvline(mean_a - std_a, color=COLOR_GOOD, linestyle=':', alpha=0.5)
    ax5.axvline(mean_a + std_a, color=COLOR_GOOD, linestyle=':', alpha=0.5)
    
    ax5.set_title('Standard Deviation Visualization', fontsize=14, fontweight='bold')
    ax5.set_xlabel('Sales ($)')
    ax5.set_ylabel('Probability Density')
    ax5.legend()
    ax5.grid(True, alpha=0.3)
    
    # Plot 6: MAD vs SD with outliers
    ax6 = axes[1, 2]
    
    # Show robustness
    outlier_impact = []
    outlier_values = np.linspace(100, 500, 20)
    for out_val in outlier_values:
        temp_data = np.append(rep_a_sales, out_val)
        outlier_impact.append({
            'SD': np.std(temp_data),
            'MAD': np.median(np.abs(temp_data - np.median(temp_data)))
        })
    
    sd_changes = [x['SD'] for x in outlier_impact]
    mad_changes = [x['MAD'] for x in outlier_impact]
    
    ax6.plot(outlier_values, sd_changes, color='red', linewidth=2, label='Std Dev (sensitive)')
    ax6.plot(outlier_values, mad_changes, color='green', linewidth=2, label='MAD (robust)')
    ax6.set_title('Robustness to Outliers', fontsize=14, fontweight='bold')
    ax6.set_xlabel('Outlier Value ($)')
    ax6.set_ylabel('Dispersion Measure')
    ax6.legend()
    ax6.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Display insights
    print("\n📊 Dispersion Analysis:")
    print("\nRep A (Consistent):")
    for measure, value in measures_a.items():
        if measure == 'CV':
            print(f"   {measure}: {value:.1f}%")
        else:
            print(f"   {measure}: ${value:.2f}")
    
    print(f"\nRep B ({spread_level} Variability):")
    for measure, value in measures_b.items():
        if measure == 'CV':
            print(f"   {measure}: {value:.1f}%")
        else:
            print(f"   {measure}: ${value:.2f}")
    
    print("\n💡 Key Insights:")
    print(f"   Both reps average ${mean_sales}/day, but Rep B is {measures_b['CV']/measures_a['CV']:.1f}x more variable!")
    if add_outliers:
        print("   ⚠️ With outliers: SD increases dramatically, but MAD stays stable.")
        print("   This shows why MAD is better for data with outliers!")

# Create interactive widget
interact(explore_dispersion,
         spread_level=widgets.Dropdown(options=['Low', 'Medium', 'High'],
                                       value='Low', description='Variability:'),
         add_outliers=widgets.Checkbox(value=False, description='Add Outliers'));


---

## Summary and Key Takeaways

### 🎯 What We've Learned:

1. **Statistical Thinking**: Using data to make decisions under uncertainty
2. **Populations vs Samples**: We study small groups to understand large ones
3. **Sampling Methods**: Different ways to collect representative data
4. **Sampling Biases**: How bad sampling can ruin your analysis
5. **Central Tendency**: Mean, median, and mode tell different stories
6. **Dispersion**: Understanding spread is as important as understanding center

### 💡 Key Insights:

- **Good data beats fancy analysis** - focus on quality sampling
- **Outliers matter** - use robust measures (median, MAD) when appropriate
- **Variability is information** - don't just look at averages
- **Bias kills predictions** - be vigilant about sampling biases

### 🚀 Next Steps:

In the next lesson, we'll explore **Probability and Distributions** - the mathematical foundation for making predictions!

---

## Practice Exercises

Try these exercises to reinforce your learning:

In [8]:
# Exercise 1: Create your own biased sample
# TODO: Generate a population and create different biased samples from it

# Your code here:


# Exercise 2: Calculate central tendency measures
# TODO: Load a real dataset and calculate mean, median, mode

# Your code here:


# Exercise 3: Compare dispersion measures
# TODO: Create two datasets with same mean but different spreads

# Your code here:
