# MA206X Lesson 3: Measures of Location - Computational Companion

**Course:** MA206X Probability and Statistics  
**Instructor:** CPT Day  
**Term:** AY26-2

This notebook provides computational support for the Lesson 3 worksheet problems. Use it to:
- Perform calculations for large datasets
- Visualize distributions
- Verify your hand calculations

**Instructions:**
1. Run each code cell in order (Shift+Enter)
2. Complete the worksheet problems using the output from this notebook
3. Focus on *interpreting* results rather than manual computation

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set plot style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print("Libraries loaded successfully!")

---
## Problem 1: Exam Scores

Let's verify the calculations for the exam scores problem.
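
Before running the cell, it may help to see the hand-calculation steps spelled out. The sketch below uses only Python's standard library, which is a choice made for this illustration rather than something the worksheet prescribes; the names `scores`, `hand_mean`, `hand_median`, and `hand_mode` are illustrative.

```python
from collections import Counter

# Same nine exam scores used in the code cell for this problem
scores = [72, 85, 90, 78, 85, 92, 88, 85, 95]

# Mean: add every score, then divide by how many there are
hand_mean = sum(scores) / len(scores)

# Median: sort the scores and take the middle one (n = 9 is odd)
ordered = sorted(scores)
hand_median = ordered[len(ordered) // 2]

# Mode: the value that occurs most often
hand_mode, mode_count = Counter(scores).most_common(1)[0]

print(f"Sum = {sum(scores)}, n = {len(scores)}, mean = {hand_mean:.2f}")
print(f"Sorted: {ordered}, median = {hand_median}")
print(f"Mode = {hand_mode} (appears {mode_count} times)")
```

The middle-element shortcut for the median only works when n is odd; for even n, average the two middle values.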

In [None]:
# Exam scores data
exam_scores = np.array([72, 85, 90, 78, 85, 92, 88, 85, 95])

# Calculate measures of center
mean_score = np.mean(exam_scores)
median_score = np.median(exam_scores)
# Mode (most frequent value)
from scipy import stats
mode_result = stats.mode(exam_scores, keepdims=True)
mode_score = mode_result.mode[0]

# Convert to plain Python ints so the list prints cleanly under NumPy 2.x
print(f"Exam Scores: {sorted(exam_scores.tolist())}")
print(f"\nMean: {mean_score:.2f}")
print(f"Median: {median_score:.2f}")
print(f"Mode: {mode_score:.2f}")

# Visualize
plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
plt.hist(exam_scores, bins=8, edgecolor='black', alpha=0.7)
plt.axvline(mean_score, color='red', linestyle='--', linewidth=2, label=f'Mean = {mean_score:.1f}')
plt.axvline(median_score, color='blue', linestyle='--', linewidth=2, label=f'Median = {median_score:.1f}')
plt.xlabel('Score')
plt.ylabel('Frequency')
plt.title('Distribution of Exam Scores')
plt.legend()

plt.subplot(1, 2, 2)
plt.boxplot(exam_scores, vert=True)
plt.ylabel('Score')
plt.title('Boxplot of Exam Scores')

plt.tight_layout()
plt.show()

---
## Problem 2: Tech Startup Salaries

Analyzing salary distribution with an outlier (CEO salary).
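
One way to see why the median is called resistant is to vary the outlier and watch both measures. The sketch below is a minimal illustration using the standard library; the CEO salaries of 100 and 1000 are made-up values for comparison, and only 250 comes from this problem.

```python
from statistics import mean, median

# The eleven non-CEO salaries from this problem, in thousands of dollars
staff = [45, 48, 52, 55, 58, 60, 62, 65, 68, 72, 75]

# Hypothetical CEO salaries: 250 is the problem's value, the others are made up
results = {}
for ceo in (100, 250, 1000):
    everyone = staff + [ceo]
    results[ceo] = (mean(everyone), median(everyone))
    print(f"CEO = ${ceo}k -> mean = ${results[ceo][0]:.1f}k, "
          f"median = ${results[ceo][1]:.1f}k")
```

The mean moves with every change to the largest salary, while the median depends only on the two middle values and stays put.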

In [None]:
# Salaries in thousands of dollars
salaries = np.array([45, 48, 52, 55, 58, 60, 62, 65, 68, 72, 75, 250])

# With CEO
mean_with_ceo = np.mean(salaries)
median_with_ceo = np.median(salaries)

# Without CEO
salaries_no_ceo = salaries[:-1]  # Remove last element (CEO)
mean_without_ceo = np.mean(salaries_no_ceo)
median_without_ceo = np.median(salaries_no_ceo)

print("WITH CEO:")
print(f"Mean: ${mean_with_ceo:.1f}k")
print(f"Median: ${median_with_ceo:.1f}k")
print(f"\nWITHOUT CEO:")
print(f"Mean: ${mean_without_ceo:.1f}k")
print(f"Median: ${median_without_ceo:.1f}k")
print(f"\nCHANGES:")
print(f"Mean changed by: ${mean_with_ceo - mean_without_ceo:.1f}k ({100*(mean_with_ceo - mean_without_ceo)/mean_without_ceo:.1f}%)")
print(f"Median changed by: ${median_with_ceo - median_without_ceo:.1f}k ({100*(median_with_ceo - median_without_ceo)/median_without_ceo:.1f}%)")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# With CEO
axes[0].hist(salaries, bins=10, edgecolor='black', alpha=0.7, color='skyblue')
axes[0].axvline(mean_with_ceo, color='red', linestyle='--', linewidth=2, label=f'Mean = ${mean_with_ceo:.1f}k')
axes[0].axvline(median_with_ceo, color='blue', linestyle='--', linewidth=2, label=f'Median = ${median_with_ceo:.1f}k')
axes[0].set_xlabel('Salary ($1000s)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Salary Distribution (With CEO)')
axes[0].legend()

# Without CEO
axes[1].hist(salaries_no_ceo, bins=10, edgecolor='black', alpha=0.7, color='lightcoral')
axes[1].axvline(mean_without_ceo, color='red', linestyle='--', linewidth=2, label=f'Mean = ${mean_without_ceo:.1f}k')
axes[1].axvline(median_without_ceo, color='blue', linestyle='--', linewidth=2, label=f'Median = ${median_without_ceo:.1f}k')
axes[1].set_xlabel('Salary ($1000s)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Salary Distribution (Without CEO)')
axes[1].legend()

plt.tight_layout()
plt.show()

---
## Problem 4: Standardized Test Scores

Working with quartiles and percentiles.
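
Hand-computed quartiles may not match the notebook exactly. By default, `np.percentile` interpolates linearly between data points, while many textbooks take Q1 and Q3 as the medians of the lower and upper halves. Which convention the worksheet expects is not stated in this notebook, so the comparison below is a sketch under that assumption; `quartiles_by_halves` is a hypothetical helper written for this example.

```python
import numpy as np

def quartiles_by_halves(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    For odd n, the middle value is left out of both halves
    (one common textbook convention; others exist).
    """
    ordered = np.sort(np.asarray(values, dtype=float))
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[len(ordered) - half:]
    return float(np.median(lower)), float(np.median(upper))

# Same 30 test scores used in the code cell for this problem
scores = [65, 68, 70, 71, 72, 72, 73, 75, 76, 78,
          79, 80, 81, 82, 83, 85, 86, 87, 88, 89,
          90, 91, 92, 93, 94, 95, 96, 97, 98, 99]

q1_linear, q3_linear = np.percentile(scores, [25, 75])
q1_halves, q3_halves = quartiles_by_halves(scores)

print(f"np.percentile (linear): Q1 = {q1_linear}, Q3 = {q3_linear}")
print(f"Median of halves:       Q1 = {q1_halves}, Q3 = {q3_halves}")
```

Small differences like these are a matter of convention, not arithmetic error.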

In [None]:
# Synthetic dataset intended to match the statistics given in the worksheet.
# The values are hard-coded, so no random seed is needed.
test_scores = np.array([65, 68, 70, 71, 72, 72, 73, 75, 76, 78,
                        79, 80, 81, 82, 83, 85, 86, 87, 88, 89,
                        90, 91, 92, 93, 94, 95, 96, 97, 98, 99])

# Calculate statistics
mean_test = np.mean(test_scores)
median_test = np.median(test_scores)
q1_test = np.percentile(test_scores, 25)
q3_test = np.percentile(test_scores, 75)
iqr_test = q3_test - q1_test

print(f"Test Scores Summary Statistics:")
print(f"Mean: {mean_test:.1f}")
print(f"Median (Q2): {median_test:.1f}")
# Two decimals so interpolated quartiles (e.g., 75.25) are shown in full
print(f"Q1 (25th percentile): {q1_test:.2f}")
print(f"Q3 (75th percentile): {q3_test:.2f}")
print(f"IQR (Q3 - Q1): {iqr_test:.2f}")
print(f"\nMinimum: {test_scores.min()}")
print(f"Maximum: {test_scores.max()}")

# Where does a score of 85 fall? (percentage of scores at or below 85)
percentile_85 = (test_scores <= 85).sum() / len(test_scores) * 100
print(f"\nAbout {percentile_85:.0f}% of scores are at or below 85")

# Visualize with boxplot
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Histogram
axes[0].hist(test_scores, bins=12, edgecolor='black', alpha=0.7, color='green')
axes[0].axvline(mean_test, color='red', linestyle='--', linewidth=2, label=f'Mean = {mean_test:.1f}')
axes[0].axvline(median_test, color='blue', linestyle='--', linewidth=2, label=f'Median = {median_test:.1f}')
axes[0].axvline(85, color='purple', linestyle=':', linewidth=2, label='Score = 85')
axes[0].set_xlabel('Test Score')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Test Scores')
axes[0].legend()

# Boxplot with quartile labels
bp = axes[1].boxplot(test_scores, vert=True, patch_artist=True)
bp['boxes'][0].set_facecolor('lightgreen')
axes[1].set_ylabel('Test Score')
axes[1].set_title('Boxplot of Test Scores')
axes[1].text(1.15, q1_test, f'Q1 = {q1_test:.0f}', fontsize=10)
axes[1].text(1.15, median_test, f'Q2 = {median_test:.0f}', fontsize=10)
axes[1].text(1.15, q3_test, f'Q3 = {q3_test:.0f}', fontsize=10)
axes[1].axhline(85, color='purple', linestyle=':', linewidth=2, alpha=0.5)
axes[1].text(0.6, 85, '85', fontsize=10, color='purple')

plt.tight_layout()
plt.show()

---
## Problem 5: Food Delivery Times

Analyzing delivery time distribution using five-number summary.
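
The five-number summary is just five percentiles (0, 25, 50, 75, 100), so it can be computed in a single call. The sketch below uses a small made-up array of delivery times, not the worksheet data, and relies on NumPy's default linear interpolation.

```python
import numpy as np

# Hypothetical delivery times in minutes (illustrative values only)
times = np.array([18, 22, 25, 30, 34, 40, 48, 55, 65])

# Percentiles 0, 25, 50, 75, 100 correspond to min, Q1, median, Q3, max
minimum, q1, med, q3, maximum = np.percentile(times, [0, 25, 50, 75, 100])
iqr = q3 - q1

labels = ["Min", "Q1", "Median", "Q3", "Max"]
for label, value in zip(labels, (minimum, q1, med, q3, maximum)):
    print(f"{label:>6}: {value:.1f}")
print(f"   IQR: {iqr:.1f}")
```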

In [None]:
# Synthetic delivery times intended to match the worksheet's five-number summary.
# The values are hard-coded, so no random seed is needed.
delivery_times = np.array([18, 20, 22, 23, 24, 25, 26, 27, 28, 29,
                           30, 31, 32, 33, 34, 35, 36, 38, 40, 42,
                           44, 46, 48, 50, 52, 55, 58, 60, 62, 65])

# Calculate statistics
min_time = delivery_times.min()
q1_time = np.percentile(delivery_times, 25)
median_time = np.median(delivery_times)
q3_time = np.percentile(delivery_times, 75)
max_time = delivery_times.max()
iqr_time = q3_time - q1_time

# Two decimals so interpolated quartiles (e.g., a median of 34.5) are not rounded away
print("Five-Number Summary:")
print(f"Minimum: {min_time:.0f} minutes")
print(f"Q1: {q1_time:.2f} minutes")
print(f"Median (Q2): {median_time:.2f} minutes")
print(f"Q3: {q3_time:.2f} minutes")
print(f"Maximum: {max_time:.0f} minutes")
print(f"\nIQR: {iqr_time:.2f} minutes")

# What percentage of deliveries take 35 minutes or less?
percentile_35 = (delivery_times <= 35).sum() / len(delivery_times) * 100
print(f"\nAbout {percentile_35:.0f}% of deliveries take 35 minutes or less")

# Percentage at or below 42 minutes (if 42 were Q3, this would be about 75%)
percent_below_42 = (delivery_times <= 42).sum() / len(delivery_times) * 100
print(f"Percentage of deliveries ≤ 42 minutes: {percent_below_42:.0f}%")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Histogram
axes[0].hist(delivery_times, bins=15, edgecolor='black', alpha=0.7, color='orange')
axes[0].axvline(median_time, color='blue', linestyle='--', linewidth=2, label=f'Median = {median_time:.1f} min')
axes[0].axvline(35, color='red', linestyle=':', linewidth=2, label='35 min target')
axes[0].set_xlabel('Delivery Time (minutes)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Delivery Times')
axes[0].legend()

# Boxplot
bp = axes[1].boxplot(delivery_times, vert=True, patch_artist=True)
bp['boxes'][0].set_facecolor('lightyellow')
axes[1].set_ylabel('Delivery Time (minutes)')
axes[1].set_title('Boxplot of Delivery Times')
axes[1].axhline(35, color='red', linestyle=':', linewidth=2, alpha=0.5, label='35 min target')
axes[1].legend()

# Add quartile annotations
axes[1].text(1.15, q1_time, f'Q1={q1_time:.2f}', fontsize=9)
axes[1].text(1.15, median_time, f'Q2={median_time:.2f}', fontsize=9)
axes[1].text(1.15, q3_time, f'Q3={q3_time:.2f}', fontsize=9)

plt.tight_layout()
plt.show()

---
## Problem 8: Retail Store Customer Spending

Analyzing customer spending with a right-skewed distribution.

In [None]:
# Right-skewed customer spending data intended to match the worksheet's description.
# The values are hard-coded, so no random seed is needed.
spending = np.array([15, 20, 25, 30, 35, 40, 45, 45, 45, 45, 48, 50, 52, 55, 58,
                    60, 65, 70, 75, 80, 85, 85, 90, 95, 100, 110, 120, 130, 145,
                    150, 160, 175, 190, 210, 230, 250, 275, 300, 350, 400])

# Calculate statistics
mean_spending = np.mean(spending)
median_spending = np.median(spending)
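# Mode: compute the most frequent value instead of hard-coding it
values, counts = np.unique(spending, return_counts=True)
mode_spending = int(values[np.argmax(counts)])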
q1_spending = np.percentile(spending, 25)
q3_spending = np.percentile(spending, 75)
iqr_spending = q3_spending - q1_spending

print("Customer Spending Summary:")
print(f"Mean: ${mean_spending:.2f}")
print(f"Median: ${median_spending:.2f}")
print(f"Mode: ${mode_spending:.2f}")
print(f"Q1: ${q1_spending:.2f}")
print(f"Q3: ${q3_spending:.2f}")
print(f"IQR: ${iqr_spending:.2f}")
print(f"\nDifference (Mean - Median): ${mean_spending - median_spending:.2f}")
print(f"\nInterpretation: Mean > Median indicates RIGHT-SKEWED distribution")
print("This suggests some customers spend much more than typical.")

# Comprehensive visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Histogram
axes[0, 0].hist(spending, bins=20, edgecolor='black', alpha=0.7, color='teal')
axes[0, 0].axvline(mean_spending, color='red', linestyle='--', linewidth=2, label=f'Mean = ${mean_spending:.0f}')
axes[0, 0].axvline(median_spending, color='blue', linestyle='--', linewidth=2, label=f'Median = ${median_spending:.0f}')
axes[0, 0].axvline(mode_spending, color='green', linestyle=':', linewidth=2, label=f'Mode = ${mode_spending:.0f}')
axes[0, 0].set_xlabel('Spending Amount ($)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Distribution of Customer Spending (Right-Skewed)')
axes[0, 0].legend()

# Boxplot
bp = axes[0, 1].boxplot(spending, vert=True, patch_artist=True)
bp['boxes'][0].set_facecolor('lightblue')
axes[0, 1].set_ylabel('Spending Amount ($)')
axes[0, 1].set_title('Boxplot of Customer Spending')
axes[0, 1].text(1.15, q1_spending, f'Q1=${q1_spending:.0f}', fontsize=9)
axes[0, 1].text(1.15, median_spending, f'Q2=${median_spending:.0f}', fontsize=9)
axes[0, 1].text(1.15, q3_spending, f'Q3=${q3_spending:.0f}', fontsize=9)

# Cumulative distribution
sorted_spending = np.sort(spending)
cumulative = np.arange(1, len(spending) + 1) / len(spending) * 100
axes[1, 0].plot(sorted_spending, cumulative, linewidth=2, color='purple')
axes[1, 0].axhline(50, color='blue', linestyle='--', alpha=0.5, label='Median (50th %ile)')
axes[1, 0].axvline(median_spending, color='blue', linestyle='--', alpha=0.5)
axes[1, 0].set_xlabel('Spending Amount ($)')
axes[1, 0].set_ylabel('Cumulative Percentage')
axes[1, 0].set_title('Cumulative Distribution')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Summary statistics table
axes[1, 1].axis('off')
summary_text = f"""
SUMMARY STATISTICS
{'='*40}

Measures of Center:
  Mean:     ${mean_spending:.2f}
  Median:   ${median_spending:.2f}
  Mode:     ${mode_spending:.2f}

Quartiles:
  Q1:       ${q1_spending:.2f}
  Q2:       ${median_spending:.2f}
  Q3:       ${q3_spending:.2f}
  IQR:      ${iqr_spending:.2f}

Range:
  Min:      ${spending.min():.2f}
  Max:      ${spending.max():.2f}

Distribution Shape:
  RIGHT-SKEWED (Mean > Median)
  
Interpretation:
  Most customers spend modestly
  (~${mode_spending}), but some high
  spenders pull the mean up.
"""
axes[1, 1].text(0.1, 0.5, summary_text, fontsize=11, family='monospace',
                verticalalignment='center')

plt.tight_layout()
plt.show()

# Additional analysis
print("\n" + "="*50)
print("ADDITIONAL INSIGHTS:")
print("="*50)
print(f"Percentage of customers spending ≤ $100: {(spending <= 100).sum() / len(spending) * 100:.1f}%")
print(f"Percentage of customers spending ≥ $200: {(spending >= 200).sum() / len(spending) * 100:.1f}%")
print(f"\nFor 'typical customer' benchmark, use MEDIAN (${median_spending:.0f})")
print("because it's not affected by high-spending outliers.")

---
## Interactive Exploration Tool

Use this section to explore your own datasets or verify worksheet calculations.

In [None]:
def analyze_data(data, title="Dataset"):
    """
    Comprehensive analysis function for any dataset.
    
    Parameters:
    -----------
    data : array-like
        Your dataset
    title : str
        Name of your dataset for labeling
    """
    data = np.array(data)
    
    # Calculate all statistics
    mean_val = np.mean(data)
    median_val = np.median(data)
    q1_val = np.percentile(data, 25)
    q3_val = np.percentile(data, 75)
    iqr_val = q3_val - q1_val
    
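    # Determine skewness (heuristic: compare mean and median with a 1% relative
    # tolerance; np.isclose uses |median|, so zero or negative medians also work)
    if np.isclose(mean_val, median_val, rtol=0.01, atol=1e-12):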
        skew = "Approximately Symmetric"
    elif mean_val > median_val:
        skew = "Right-Skewed (Positive Skew)"
    else:
        skew = "Left-Skewed (Negative Skew)"
    
    # Print summary
    print(f"\n{'='*50}")
    print(f"ANALYSIS: {title}")
    print(f"{'='*50}")
    print(f"\nSample Size: n = {len(data)}")
    print(f"\nMeasures of Center:")
    print(f"  Mean:   {mean_val:.3f}")
    print(f"  Median: {median_val:.3f}")
    print(f"\nQuartiles:")
    print(f"  Q1: {q1_val:.3f}")
    print(f"  Q2: {median_val:.3f} (Median)")
    print(f"  Q3: {q3_val:.3f}")
    print(f"  IQR: {iqr_val:.3f}")
    print(f"\nRange:")
    print(f"  Min: {data.min():.3f}")
    print(f"  Max: {data.max():.3f}")
    print(f"\nDistribution Shape: {skew}")
    print(f"{'='*50}\n")
    
    # Visualize
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    
    # Histogram
    axes[0].hist(data, bins='auto', edgecolor='black', alpha=0.7, color='steelblue')
    axes[0].axvline(mean_val, color='red', linestyle='--', linewidth=2, label=f'Mean = {mean_val:.2f}')
    axes[0].axvline(median_val, color='blue', linestyle='--', linewidth=2, label=f'Median = {median_val:.2f}')
    axes[0].set_xlabel('Value')
    axes[0].set_ylabel('Frequency')
    axes[0].set_title(f'Distribution: {title}')
    axes[0].legend()
    
    # Boxplot
    bp = axes[1].boxplot(data, vert=True, patch_artist=True)
    bp['boxes'][0].set_facecolor('lightsteelblue')
    axes[1].set_ylabel('Value')
    axes[1].set_title(f'Boxplot: {title}')
    
    plt.tight_layout()
    plt.show()

# Example usage:
# analyze_data([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], "My Data")

print("Function loaded! Use: analyze_data(your_data, 'Your Title')")
print("\nExample:")
print("my_data = [10, 15, 12, 18, 20, 14, 16, 15, 13, 17]")
print("analyze_data(my_data, 'Practice Problem')")

In [None]:
# YOUR TURN: Enter your own data here and analyze it!

# Example: Replace this with your own data
my_data = [10, 15, 12, 18, 20, 14, 16, 15, 13, 17]

# Analyze it
analyze_data(my_data, "My Practice Problem")

---
## Summary and Key Takeaways

### When to Use Mean vs. Median:

**Use the MEAN when:**
- Data is roughly symmetric (mean ≈ median)
- No extreme outliers present
- You want to account for all values equally
- Example: Heights in a classroom

**Use the MEDIAN when:**
- Data is skewed (mean ≠ median)
- Outliers are present
- You want a measure resistant to extreme values
- Example: Home prices, salaries

### Key Relationships:

- **Right-Skewed:** Mean > Median (pulled by high values)
- **Symmetric:** Mean ≈ Median
- **Left-Skewed:** Mean < Median (pulled by low values)
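
These relationships can be checked on tiny datasets. The sketch below is a rough heuristic in the same spirit as the check inside `analyze_data`; the helper name `shape_hint`, the 1% tolerance, and the three example lists are all assumptions made for illustration.

```python
from statistics import mean, median

def shape_hint(values, rel_tol=0.01):
    """Guess the distribution shape by comparing mean and median (heuristic only)."""
    m, med = mean(values), median(values)
    if abs(m - med) <= rel_tol * abs(med):
        return "roughly symmetric"
    return "right-skewed" if m > med else "left-skewed"

# Made-up examples: one long right tail, one balanced, one long left tail
examples = {
    "right tail": [1, 2, 2, 3, 3, 4, 20],
    "balanced": [1, 2, 3, 4, 5, 6, 7],
    "left tail": [1, 17, 18, 18, 19, 19, 20],
}

for name, values in examples.items():
    print(f"{name:>10}: mean = {mean(values):.1f}, "
          f"median = {median(values):.1f} -> {shape_hint(values)}")
```

Comparing mean and median is a quick screen, not a proof of shape; a histogram is still worth a look.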

### Quartiles:

- **Q1 (25th percentile):** About 25% of data fall at or below this value
- **Q2 (50th percentile):** Same as median
- **Q3 (75th percentile):** About 75% of data fall at or below this value
- **IQR = Q3 - Q1:** Width of the interval containing the middle 50% of data
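
A quick numerical check of these definitions, as a sketch: the data (the integers 1 through 100) are chosen only because the fractions come out cleanly, and the quartiles use NumPy's default linear interpolation.

```python
import numpy as np

# Illustrative data: the integers 1 through 100
data = np.arange(1, 101)
q1, q2, q3 = np.percentile(data, [25, 50, 75])

# Fraction of observations at or below each quartile
frac_q1 = np.mean(data <= q1)
frac_q2 = np.mean(data <= q2)
frac_q3 = np.mean(data <= q3)

# Fraction of observations between Q1 and Q3 (the span measured by the IQR)
frac_middle = np.mean((data >= q1) & (data <= q3))

print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {q3 - q1}")
print(f"At or below Q1: {frac_q1:.0%}, Q2: {frac_q2:.0%}, Q3: {frac_q3:.0%}")
print(f"Between Q1 and Q3: {frac_middle:.0%}")
```

With small or tied datasets the fractions will only be approximately 25%, 50%, and 75%.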

### Next Lesson Preview:

In **Lesson 4: Measures of Spread**, we'll explore:
- Range
- Variance
- Standard Deviation
- Why spread matters as much as center!

---

**Remember:** Two datasets can have the same mean and median but very different spreads!
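
As a small illustration of that reminder, the two made-up datasets below share a mean and median of 50 but differ widely in range, the first spread measure listed in the Lesson 4 preview. The names `tight` and `wide` are illustrative.

```python
from statistics import mean, median

# Two hypothetical datasets with the same center but different spread
tight = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]

for name, values in (("tight", tight), ("wide", wide)):
    data_range = max(values) - min(values)
    print(f"{name:>5}: mean = {mean(values)}, median = {median(values)}, "
          f"range = {data_range}")
```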