# Measures of Central Tendency

Measures of central tendency are statistical metrics that identify a single value as representative of an entire dataset. They provide a summary of where the "center" or "typical" value of the data lies. Understanding these measures is fundamental for data analysis, machine learning feature engineering, and making data-driven decisions.

## Table of Contents

1. [Introduction to Central Tendency](#introduction)
2. [The Three Main Measures](#three-measures)
3. [Mean (Average)](#mean)
   - Arithmetic Mean
   - Geometric Mean
   - Harmonic Mean
4. [Median](#median)
5. [Mode](#mode)
6. [Comparison and When to Use Each](#comparison)
7. [Effect of Outliers](#outliers)
8. [Skewness and Central Tendency](#skewness)
9. [Real-World Applications](#applications)
10. [Summary](#summary)

---

## 1. Introduction to Central Tendency <a id="introduction"></a>

**What is Central Tendency?**

Central tendency describes the center of a data distribution: the value around which the observations cluster. It answers the question: "What is a typical value in this dataset?"

**Why Does it Matter in Data Science and ML?**

- **Data Summarization**: Condense large datasets into single representative values
- **Feature Engineering**: Create new features based on central values
- **Anomaly Detection**: Identify outliers by comparing values to central measures
- **Model Evaluation**: Understand baseline performance metrics
- **Data Imputation**: Fill missing values with central tendency measures
- **Business Insights**: Make informed decisions based on typical values (average sales, median income, etc.)

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import gmean, hmean
import warnings
warnings.filterwarnings('ignore')

# Set style for better visualizations
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("Libraries imported successfully!")

---

## 2. The Three Main Measures <a id="three-measures"></a>

The three primary measures of central tendency are:

| Measure | Definition | Best Used For |
|---------|------------|---------------|
| **Mean** | Average of all values | Symmetric distributions without outliers |
| **Median** | Middle value when data is ordered | Skewed distributions or data with outliers |
| **Mode** | Most frequently occurring value | Categorical data or finding the most common value |

Each measure has its strengths and weaknesses, and choosing the right one depends on your data characteristics and analysis goals.

In [None]:
# Sample dataset for demonstration
data = np.array([23, 25, 27, 28, 30, 32, 35, 38, 40, 42])

# Calculate all three measures
mean_value = np.mean(data)
median_value = np.median(data)
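# Every value in this sample is unique, so there is no meaningful mode;
# scipy.stats.mode still returns one of the tied values (here the smallest, 23).
mode_result = stats.mode(data, keepdims=True)
mode_value = mode_result.mode[0]

print("Sample Data:", data)
print(f"\nMean: {mean_value}")
print(f"Median: {median_value}")
print(f"Mode: {mode_value} (all values are unique, so this is not a meaningful mode)")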

---

## 3. Mean (Average) <a id="mean"></a>

The **mean** is the sum of all values divided by the number of values. It's the most commonly used measure of central tendency.

### 3.1 Arithmetic Mean

**Formula:**

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + ... + x_n}{n}$$

Where:
- $\bar{x}$ = mean
- $x_i$ = individual values
- $n$ = number of values

**Characteristics:**
- Takes all values into account
- Can be affected by extreme values (outliers)
- Only applicable to numerical data
- Suitable for interval and ratio scales

**When to Use:**
- Data is symmetrically distributed
- No significant outliers
- Need to use all data points in calculation
- Performing further statistical calculations

In [None]:
# Calculating Arithmetic Mean - Multiple Methods

# Sample data: Test scores
test_scores = [78, 85, 92, 88, 76, 90, 85, 88, 92, 80]

# Method 1: Using pure Python
mean_python = sum(test_scores) / len(test_scores)

# Method 2: Using NumPy
mean_numpy = np.mean(test_scores)

# Method 3: Using Pandas
df = pd.DataFrame({'scores': test_scores})
mean_pandas = df['scores'].mean()

# Method 4: Using statistics module
import statistics
mean_stats = statistics.mean(test_scores)

print("Test Scores:", test_scores)
print(f"\nMean (Python): {mean_python:.2f}")
print(f"Mean (NumPy): {mean_numpy:.2f}")
print(f"Mean (Pandas): {mean_pandas:.2f}")
print(f"Mean (Statistics): {mean_stats:.2f}")

### 3.2 Geometric Mean

The **geometric mean** is the nth root of the product of n values. It's particularly useful for data that represents rates of change or ratios.

**Formula:**

$$GM = \sqrt[n]{x_1 \times x_2 \times ... \times x_n} = \left(\prod_{i=1}^{n} x_i\right)^{1/n}$$

**When to Use:**
- Calculating average growth rates
- Analyzing percentage changes
- Combining values measured on different scales (for example, normalized benchmark ratios)
- Data spans several orders of magnitude
- Investment returns over time

**Important Note:** All values must be positive for geometric mean.
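The formula and the positivity requirement above can be sketched in standard-library Python. Multiplying many values first can overflow or underflow, so this sketch averages logarithms instead, which is mathematically equivalent for positive inputs. The helper name `geometric_mean_log` and the sample multipliers are illustrative assumptions, not part of any library.

```python
import math
import statistics


def geometric_mean_log(values):
    """Geometric mean as exp(mean(log(x))); every value must be > 0."""
    values = list(values)
    if not values:
        raise ValueError("geometric mean needs at least one value")
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs strictly positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


multipliers = [1.10, 0.95, 1.15, 1.08]  # illustrative growth multipliers

gm_logs = geometric_mean_log(multipliers)
gm_root = math.prod(multipliers) ** (1 / len(multipliers))  # direct nth-root form
gm_stdlib = statistics.geometric_mean(multipliers)  # standard-library reference

print(f"log form:      {gm_logs:.6f}")
print(f"nth-root form: {gm_root:.6f}")
print(f"statistics:    {gm_stdlib:.6f}")
```

All three forms should agree to floating-point precision; the log form is usually the safer choice for long inputs.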

In [None]:
# Geometric Mean Example: Investment Returns

# Annual growth rates (as multipliers, not percentages)
# Year 1: +10% growth = 1.10
# Year 2: -5% loss = 0.95
# Year 3: +15% growth = 1.15
# Year 4: +8% growth = 1.08

growth_rates = [1.10, 0.95, 1.15, 1.08]

# Calculate geometric mean
from scipy.stats import gmean
geometric_mean = gmean(growth_rates)

# Convert back to percentage
avg_growth_rate = (geometric_mean - 1) * 100

# Compare with arithmetic mean (which would be incorrect here)
arithmetic_mean = np.mean(growth_rates)
arithmetic_growth_rate = (arithmetic_mean - 1) * 100

print("Investment Growth Rates (as multipliers):", growth_rates)
print(f"\nGeometric Mean: {geometric_mean:.4f}")
print(f"Average Annual Growth Rate (Correct): {avg_growth_rate:.2f}%")
print(f"\nArithmetic Mean: {arithmetic_mean:.4f}")
print(f"Arithmetic Growth Rate (Incorrect): {arithmetic_growth_rate:.2f}%")
print("\n** Geometric mean is the correct measure for compound growth! **")

In [None]:
# Another Geometric Mean Example: Image Aspect Ratios

# Different aspect ratios
aspect_ratios = [16/9, 4/3, 21/9, 1/1]  # Wide, standard, ultrawide, square

# Geometric mean gives a balanced "typical" aspect ratio
typical_ratio = gmean(aspect_ratios)

print("Aspect Ratios:", [f"{ratio:.3f}" for ratio in aspect_ratios])
print(f"Geometric Mean (Typical Ratio): {typical_ratio:.3f}")
print(f"Arithmetic Mean (Not meaningful here): {np.mean(aspect_ratios):.3f}")

### 3.3 Harmonic Mean

The **harmonic mean** is the reciprocal of the arithmetic mean of reciprocals. It's useful for rates and ratios.

**Formula:**

$$HM = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + ... + \frac{1}{x_n}}$$

**When to Use:**
- Calculating average speeds or rates
- Working with ratios or reciprocal relationships
- F1 score in machine learning (harmonic mean of precision and recall)
- Averaging rates (like km/hour or items/second)

**Property:** For positive values, Harmonic Mean ≤ Geometric Mean ≤ Arithmetic Mean, with equality only when all values are identical.
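As a minimal sketch of the formula above, the harmonic mean can be written directly as "count divided by the sum of reciprocals" and checked against the ordering of the three means. The helper name `harmonic_mean_formula` and the sample values are assumptions made for illustration.

```python
import statistics


def harmonic_mean_formula(values):
    """n divided by the sum of reciprocals; every value must be > 0."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("harmonic mean needs at least one strictly positive value")
    return len(values) / sum(1 / v for v in values)


sample = [2, 4, 8, 16, 32]
hm = harmonic_mean_formula(sample)
gm = statistics.geometric_mean(sample)
am = statistics.fmean(sample)
print(f"HM = {hm:.4f}, GM = {gm:.4f}, AM = {am:.4f}")
print("HM <= GM <= AM:", hm <= gm <= am)

# When every value is the same, the three means coincide.
constant = [5, 5, 5]
hm_const = harmonic_mean_formula(constant)
gm_const = statistics.geometric_mean(constant)
am_const = statistics.fmean(constant)
print(f"Constant data: HM = {hm_const:.4f}, GM = {gm_const:.4f}, AM = {am_const:.4f}")
```

The gap between the three means widens as the values become more spread out, which is why the choice of mean matters most for data spanning a wide range.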

In [None]:
# Harmonic Mean Example: Average Speed

# A car travels 100 km at 60 km/h, then returns 100 km at 40 km/h
# What's the average speed for the entire trip?

speeds = [60, 40]  # km/h

# Calculate harmonic mean (correct here because both legs cover the same distance;
# legs of unequal length would need a distance-weighted calculation)
from scipy.stats import hmean
avg_speed_harmonic = hmean(speeds)

# Calculate arithmetic mean (incorrect for this scenario)
avg_speed_arithmetic = np.mean(speeds)

# Verify with actual calculation
total_distance = 100 + 100  # km
time1 = 100 / 60  # hours
time2 = 100 / 40  # hours
total_time = time1 + time2
actual_avg_speed = total_distance / total_time

print("Speeds:", speeds, "km/h")
print(f"\nHarmonic Mean (Correct): {avg_speed_harmonic:.2f} km/h")
print(f"Actual Average Speed: {actual_avg_speed:.2f} km/h")
print(f"\nArithmetic Mean (Incorrect): {avg_speed_arithmetic:.2f} km/h")
print("\n** Harmonic mean correctly calculates average speed! **")

In [None]:
# Harmonic Mean in Machine Learning: F1 Score

# Precision and Recall values
precision = 0.85
recall = 0.75

# F1 Score is the harmonic mean of precision and recall
f1_score = hmean([precision, recall])

# Compare with arithmetic mean
arithmetic_avg = (precision + recall) / 2

print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"\nF1 Score (Harmonic Mean): {f1_score:.4f}")
print(f"Arithmetic Mean: {arithmetic_avg:.4f}")
print("\n** F1 Score penalizes extreme differences between precision and recall **")

In [None]:
# Comparison of All Three Means

data = [2, 4, 8, 16, 32]

arithmetic = np.mean(data)
geometric = gmean(data)
harmonic = hmean(data)

print("Data:", data)
print(f"\nArithmetic Mean: {arithmetic:.2f}")
print(f"Geometric Mean: {geometric:.2f}")
print(f"Harmonic Mean: {harmonic:.2f}")
print(f"\nRelationship: HM ({harmonic:.2f}) ≤ GM ({geometric:.2f}) ≤ AM ({arithmetic:.2f})")

# Visualize the relationship
plt.figure(figsize=(10, 6))
means = [harmonic, geometric, arithmetic]
labels = ['Harmonic Mean', 'Geometric Mean', 'Arithmetic Mean']
colors = ['#FF6B6B', '#4ECDC4', '#45B7D1']

bars = plt.bar(labels, means, color=colors, alpha=0.7, edgecolor='black')
plt.ylabel('Value', fontsize=12)
plt.title('Comparison of Different Types of Means', fontsize=14, fontweight='bold')
plt.ylim(0, max(means) * 1.2)

# Add value labels on bars
for bar, mean in zip(bars, means):
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
             f'{mean:.2f}', ha='center', va='bottom', fontsize=11, fontweight='bold')

plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

---

## 4. Median <a id="median"></a>

The **median** is the middle value when data is arranged in ascending or descending order. It divides the dataset into two equal halves.

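**Formula** (with the values sorted in ascending order, so $x_{(k)}$ is the $k$-th smallest value):

For an **odd** number of values:
$$Median = x_{\left(\frac{n+1}{2}\right)}$$

For an **even** number of values:
$$Median = \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2}$$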

**Characteristics:**
- Largely unaffected by extreme values (robust to outliers)
- Represents the 50th percentile
- Only requires ordinal data (can be ranked)
- Better for skewed distributions

**When to Use:**
- Data has outliers or is skewed
- Reporting income, house prices, or other financial data
- Need a robust measure unaffected by extremes
- Ordinal data (rankings, ratings)
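The odd and even cases of the formula can be sketched as a small function that sorts first and then picks the middle position(s). The name `median_from_sorted` and the sample lists are hypothetical; Python's zero-based indexing shifts the formula's one-based positions down by one.

```python
import statistics


def median_from_sorted(values):
    """Median via sorting: middle value (odd n) or mean of the two middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median needs at least one value")
    mid = n // 2
    if n % 2 == 1:
        # One-based position (n + 1) / 2 is zero-based index n // 2.
        return ordered[mid]
    # One-based positions n / 2 and n / 2 + 1 are zero-based indices mid - 1 and mid.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_values = [30, 12, 25, 15, 22, 18, 20]       # unsorted on purpose
even_values = [35, 12, 30, 15, 25, 18, 22, 20]

print("Odd count median: ", median_from_sorted(odd_values))
print("Even count median:", median_from_sorted(even_values))
print("Matches statistics.median:",
      median_from_sorted(odd_values) == statistics.median(odd_values),
      median_from_sorted(even_values) == statistics.median(even_values))
```

Sorting is the essential step: applying the position formula to unsorted data returns an arbitrary element rather than the median.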

In [None]:
# Calculating Median - Multiple Scenarios

# Scenario 1: Odd number of values
odd_data = [12, 15, 18, 20, 22, 25, 30]
median_odd = np.median(odd_data)

print("Odd number of values:", odd_data)
print(f"Median: {median_odd} (the middle value)\n")

# Scenario 2: Even number of values
even_data = [12, 15, 18, 20, 22, 25, 30, 35]
median_even = np.median(even_data)

print("Even number of values:", even_data)
print(f"Median: {median_even} (average of two middle values: 20 and 22)\n")

In [None]:
# Median vs Mean: Impact of Outliers

# Salaries in a small company (in thousands)
salaries_normal = [45, 48, 50, 52, 55, 58, 60, 62, 65]
# CEO's salary is added
salaries_with_ceo = [45, 48, 50, 52, 55, 58, 60, 62, 65, 500]

# Calculate mean and median for both
mean_normal = np.mean(salaries_normal)
median_normal = np.median(salaries_normal)

mean_with_ceo = np.mean(salaries_with_ceo)
median_with_ceo = np.median(salaries_with_ceo)

print("Salaries without CEO:", salaries_normal)
print(f"Mean: ${mean_normal:.2f}k, Median: ${median_normal:.2f}k\n")

print("Salaries with CEO:", salaries_with_ceo)
print(f"Mean: ${mean_with_ceo:.2f}k, Median: ${median_with_ceo:.2f}k\n")

print("Impact of outlier:")
print(f"Mean increased by: ${mean_with_ceo - mean_normal:.2f}k ({((mean_with_ceo/mean_normal - 1)*100):.1f}%)")
print(f"Median increased by: ${median_with_ceo - median_normal:.2f}k ({((median_with_ceo/median_normal - 1)*100):.1f}%)")
print("\n** Median is more robust to outliers! **")

In [None]:
# Visualizing Median vs Mean with Outliers

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Without outlier
axes[0].hist(salaries_normal, bins=15, color='skyblue', edgecolor='black', alpha=0.7)
axes[0].axvline(mean_normal, color='red', linestyle='--', linewidth=2, label=f'Mean: ${mean_normal:.1f}k')
axes[0].axvline(median_normal, color='green', linestyle='--', linewidth=2, label=f'Median: ${median_normal:.1f}k')
axes[0].set_xlabel('Salary (in thousands)', fontsize=11)
axes[0].set_ylabel('Frequency', fontsize=11)
axes[0].set_title('Salaries Without CEO (No Outlier)', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Plot 2: With outlier
axes[1].hist(salaries_with_ceo, bins=15, color='salmon', edgecolor='black', alpha=0.7)
axes[1].axvline(mean_with_ceo, color='red', linestyle='--', linewidth=2, label=f'Mean: ${mean_with_ceo:.1f}k')
axes[1].axvline(median_with_ceo, color='green', linestyle='--', linewidth=2, label=f'Median: ${median_with_ceo:.1f}k')
axes[1].set_xlabel('Salary (in thousands)', fontsize=11)
axes[1].set_ylabel('Frequency', fontsize=11)
axes[1].set_title('Salaries With CEO (Outlier Present)', fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("Notice how the mean shifts significantly toward the outlier,")
print("while the median remains stable!")

---

## 5. Mode <a id="mode"></a>

The **mode** is the value that appears most frequently in a dataset. A dataset can have:
- **No mode**: All values occur with equal frequency
- **Unimodal**: One value occurs most frequently
- **Bimodal**: Two values occur with equal highest frequency
- **Multimodal**: More than two values occur with equal highest frequency

**Characteristics:**
- Can be used with categorical, ordinal, and numerical data
- Not affected by extreme values
- May not be unique
- May not exist
- Most useful for categorical data

**When to Use:**
- Categorical data (colors, brands, categories)
- Finding the most common value
- Discrete data with repeated values
- Fashion, marketing, or consumer preference analysis
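The four cases listed above (no mode, unimodal, bimodal, multimodal) can be sketched with `collections.Counter`. The helper `all_modes` is hypothetical and follows this section's convention that data in which every value occurs equally often has no mode; other conventions exist, and `statistics.multimode` instead returns every value in that situation.

```python
from collections import Counter
import statistics


def all_modes(values):
    """Return every value tied for the highest count; [] when all counts are equal."""
    counts = Counter(values)
    if not counts:
        return []
    # Several distinct values that all share one count means "no mode" in this convention.
    if len(counts) > 1 and len(set(counts.values())) == 1:
        return []
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]


print("Unimodal:", all_modes([1, 2, 2, 3, 3, 3, 4, 4, 5]))
print("Bimodal: ", all_modes([1, 2, 2, 2, 3, 4, 5, 5, 5]))
print("No mode: ", all_modes([1, 2, 3, 4, 5]))
print("Categorical:", all_modes(["Red", "Blue", "Blue", "Green"]))

# The standard library reports ties too, but treats all-unique data differently.
print("multimode on all-unique data:", statistics.multimode([1, 2, 3, 4, 5]))
```

Returning a list rather than a single value makes ties explicit, which avoids silently reporting one of several equally common values.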

In [None]:
# Calculating Mode - Different Scenarios

# Scenario 1: Unimodal data
unimodal_data = [1, 2, 2, 3, 3, 3, 4, 4, 5]
mode_uni = stats.mode(unimodal_data, keepdims=True)

print("Unimodal Data:", unimodal_data)
print(f"Mode: {mode_uni.mode[0]} (appears {mode_uni.count[0]} times)\n")

# Scenario 2: Bimodal data
bimodal_data = [1, 2, 2, 2, 3, 4, 5, 5, 5]
# scipy.stats.mode returns only one mode, so we'll find all modes manually
from collections import Counter
counter = Counter(bimodal_data)
max_count = max(counter.values())
modes = [k for k, v in counter.items() if v == max_count]

print("Bimodal Data:", bimodal_data)
print(f"Modes: {modes} (each appears {max_count} times)\n")

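# Scenario 3: No mode
no_mode_data = [1, 2, 3, 4, 5]
counter_no_mode = Counter(no_mode_data)
# If every value has the same count, no value is "most frequent"
all_equal = len(set(counter_no_mode.values())) == 1

print("Data with No Mode:", no_mode_data)
if all_equal:
    print("All values appear equally often, so there's no mode\n")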

In [None]:
# Mode with Categorical Data - Real-World Example

# Customer survey: Favorite product color
colors = ['Red', 'Blue', 'Green', 'Blue', 'Red', 'Blue', 'Yellow', 
          'Blue', 'Green', 'Blue', 'Red', 'Blue', 'Blue']

# Find mode using pandas
df_colors = pd.DataFrame({'Color': colors})
mode_color = df_colors['Color'].mode()[0]
mode_count = df_colors['Color'].value_counts()[mode_color]

print("Customer Color Preferences:", colors)
print(f"\nMost Popular Color: {mode_color}")
print(f"Number of votes: {mode_count}\n")

# Show frequency distribution
print("Frequency Distribution:")
print(df_colors['Color'].value_counts())

In [None]:
# Visualizing Mode

# Create a dataset with clear mode
shoe_sizes = [7, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 11, 11, 12]

# Calculate mode
mode_size = stats.mode(shoe_sizes, keepdims=True).mode[0]

# Create visualization
plt.figure(figsize=(10, 6))
unique, counts = np.unique(shoe_sizes, return_counts=True)

colors_bars = ['red' if size == mode_size else 'skyblue' for size in unique]
bars = plt.bar(unique, counts, color=colors_bars, edgecolor='black', alpha=0.7)

plt.xlabel('Shoe Size', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Distribution of Shoe Sizes (Mode Highlighted in Red)', fontsize=14, fontweight='bold')
plt.xticks(unique)
plt.grid(axis='y', alpha=0.3)

# Add value labels
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
             f'{int(height)}', ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()

print(f"Mode (Most Common Shoe Size): {mode_size}")

---

## 6. Comparison and When to Use Each Measure <a id="comparison"></a>

### Detailed Comparison Table

| Aspect | Mean | Median | Mode |
|--------|------|--------|------|
| **Definition** | Average of all values | Middle value | Most frequent value |
| **Calculation** | Sum / Count | Middle after sorting | Most common |
| **Data Type** | Numerical only | Numerical/Ordinal | All types |
| **Affected by outliers** | Yes (highly) | No (robust) | No |
| **Uniqueness** | Always unique | Always unique | May not be unique |
| **Existence** | Always exists | Always exists | May not exist |
| **Use in further calculations** | Yes (variance, etc.) | Limited | No |
| **Best for** | Symmetric data | Skewed data | Categorical data |

### Decision Tree: Which Measure to Use?

1. **Is your data categorical?**
   - YES → Use **Mode**
   - NO → Continue to step 2

2. **Does your data have outliers or is it highly skewed?**
   - YES → Use **Median**
   - NO → Continue to step 3

3. **Do you need to perform further statistical calculations?**
   - YES → Use **Mean** (then check step 4 for which mean)
   - NO → Use **Median** for robustness, or **Mean** if every value should contribute

4. **If you chose a mean, are you working with rates, ratios, or growth?**
   - Growth rates → **Geometric Mean**
   - Average speeds/rates → **Harmonic Mean**
   - General average → **Arithmetic Mean**
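One possible way to encode the decision steps above is a small lookup function. This is only a sketch: the function name `suggest_measure`, its parameters, and the `data_kind` labels are all hypothetical, and real choices usually also depend on context that a few flags cannot capture.

```python
def suggest_measure(is_categorical, has_outliers_or_skew,
                    needs_further_stats=True, data_kind="general"):
    """Walk the decision steps in order and return a suggested measure as text.

    data_kind is an illustrative label: "growth", "rate", or "general".
    """
    # Step 1: categorical data has no arithmetic, so only the mode applies.
    if is_categorical:
        return "mode"
    # Step 2: outliers or strong skew favor the median.
    if has_outliers_or_skew:
        return "median"
    # Step 3: without follow-up calculations either measure is defensible.
    if not needs_further_stats:
        return "median or arithmetic mean"
    # Step 4: pick the type of mean that matches the kind of data.
    means_by_kind = {"growth": "geometric mean", "rate": "harmonic mean"}
    return means_by_kind.get(data_kind, "arithmetic mean")


print(suggest_measure(is_categorical=True, has_outliers_or_skew=False))
print(suggest_measure(is_categorical=False, has_outliers_or_skew=True))
print(suggest_measure(False, False, data_kind="growth"))
print(suggest_measure(False, False, data_kind="rate"))
print(suggest_measure(False, False))
```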

In [None]:
# Comprehensive Comparison with Different Data Types

# Create different types of distributions
np.random.seed(42)

# 1. Normal distribution (Symmetric)
normal_data = np.random.normal(100, 15, 1000)

# 2. Right-skewed distribution
skewed_data = np.random.exponential(30, 1000)

# 3. Data with outliers
with_outliers = np.concatenate([np.random.normal(50, 10, 950), 
                                 np.random.normal(200, 10, 50)])

datasets = {
    'Normal (Symmetric)': normal_data,
    'Right-Skewed': skewed_data,
    'With Outliers': with_outliers
}

# Calculate all measures for each dataset
results = []
for name, data in datasets.items():
    results.append({
        'Dataset': name,
        'Mean': np.mean(data),
        'Median': np.median(data),
        'Mode': stats.mode(data.round(), keepdims=True).mode[0],
        'Std Dev': np.std(data)
    })

results_df = pd.DataFrame(results)
print("\nComparison of Central Tendency Measures Across Different Distributions:")
print("="*80)
print(results_df.to_string(index=False))
print("\nObservations:")
print("- Normal distribution: Mean ≈ Median ≈ Mode")
print("- Skewed distribution: Mean > Median (pulled by tail)")
print("- With outliers: Mean is pulled away from median")

In [None]:
# Visualize the comparison

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

for idx, (name, data) in enumerate(datasets.items()):
    mean_val = np.mean(data)
    median_val = np.median(data)
    
    axes[idx].hist(data, bins=50, color='lightblue', edgecolor='black', alpha=0.7)
    axes[idx].axvline(mean_val, color='red', linestyle='--', linewidth=2, 
                      label=f'Mean: {mean_val:.1f}')
    axes[idx].axvline(median_val, color='green', linestyle='--', linewidth=2, 
                      label=f'Median: {median_val:.1f}')
    axes[idx].set_xlabel('Value', fontsize=11)
    axes[idx].set_ylabel('Frequency', fontsize=11)
    axes[idx].set_title(name, fontsize=12, fontweight='bold')
    axes[idx].legend()
    axes[idx].grid(alpha=0.3)

plt.tight_layout()
plt.show()

---

## 7. Effect of Outliers <a id="outliers"></a>

**Outliers** are extreme values that differ significantly from other observations. They can heavily influence certain measures of central tendency.

### Impact on Each Measure:

1. **Mean**: 
   - HIGHLY affected by outliers
   - Pulled in the direction of extreme values
   - Can be misleading in presence of outliers

2. **Median**: 
   - Barely affected by a few outliers (robust)
   - Stays stable until outliers make up a large share of the data
   - Usually the better choice when outliers are present

3. **Mode**: 
   - NOT affected by outliers
   - Depends only on frequency, not magnitude
   - May not change at all with outliers

### Handling Outliers in Practice:

- **Identify**: Use IQR method, Z-score, or visualization
- **Investigate**: Determine if outliers are errors or genuine extreme values
- **Decide**: Remove, transform, or use robust measures (median)

In [None]:
# Demonstrating Impact of Outliers

# Create a dataset and progressively add outliers
np.random.seed(42)  # fixed seed so the results are reproducible
base_data = np.random.normal(100, 10, 100)

# Different scenarios
scenarios = {
    'No Outliers': base_data,
    '1 Outlier': np.append(base_data, 300),
    '5 Outliers': np.append(base_data, [300, 320, 310, 330, 315]),
    '10 Outliers': np.append(base_data, np.random.uniform(300, 350, 10))
}

# Calculate impact
impact_results = []
for scenario_name, data in scenarios.items():
    mean_val = np.mean(data)
    median_val = np.median(data)
    difference = abs(mean_val - median_val)
    
    impact_results.append({
        'Scenario': scenario_name,
        'Mean': f"{mean_val:.2f}",
        'Median': f"{median_val:.2f}",
        'Difference': f"{difference:.2f}",
        'Sample Size': len(data)
    })

impact_df = pd.DataFrame(impact_results)
print("\nImpact of Outliers on Central Tendency:")
print("="*70)
print(impact_df.to_string(index=False))
print("\nNotice how the mean increases dramatically while median stays stable!")

In [None]:
# Visualizing Outlier Impact

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

for idx, (scenario_name, data) in enumerate(scenarios.items()):
    mean_val = np.mean(data)
    median_val = np.median(data)
    
    axes[idx].boxplot(data, vert=False, widths=0.5)
    axes[idx].axvline(mean_val, color='red', linestyle='--', linewidth=2, 
                      label=f'Mean: {mean_val:.1f}')
    axes[idx].axvline(median_val, color='green', linestyle='--', linewidth=2, 
                      label=f'Median: {median_val:.1f}')
    axes[idx].set_xlabel('Value', fontsize=11)
    axes[idx].set_title(scenario_name, fontsize=12, fontweight='bold')
    axes[idx].legend()
    axes[idx].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("Box plots clearly show outliers as individual points.")
print("Notice how mean (red) moves toward outliers while median (green) stays centered!")

In [None]:
# Detecting Outliers Using IQR Method

def detect_outliers_iqr(data):
    """
    Detect outliers using the Interquartile Range (IQR) method.
    Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are considered outliers.
    """
    data = np.asarray(data)  # accept plain lists as well as NumPy arrays
    Q1 = np.percentile(data, 25)
    Q3 = np.percentile(data, 75)
    IQR = Q3 - Q1
    
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = data[(data < lower_bound) | (data > upper_bound)]
    
    return outliers, lower_bound, upper_bound

# Example with real-world data
house_prices = np.array([200, 220, 210, 230, 215, 225, 240, 235, 218, 
                         900, 205, 228, 232, 212, 1200])  # in thousands

outliers, lower, upper = detect_outliers_iqr(house_prices)

print("House Prices (in thousands):", house_prices)
print(f"\nOutlier Detection (IQR Method):")
print(f"Lower Bound: ${lower:.2f}k")
print(f"Upper Bound: ${upper:.2f}k")
print(f"Outliers Detected: {outliers}")
print(f"\nMean (with outliers): ${np.mean(house_prices):.2f}k")
print(f"Median (robust): ${np.median(house_prices):.2f}k")
print(f"\nMean (without outliers): ${np.mean(house_prices[~np.isin(house_prices, outliers)]):.2f}k")
print("\n** Median is more representative of typical house price! **")

---

## 8. Skewness and Central Tendency <a id="skewness"></a>

**Skewness** measures the asymmetry of a probability distribution. It indicates whether data is concentrated on one side of the distribution.

### Types of Skewness:

1. **Symmetric (No Skew)**:
   - Mean = Median = Mode (for a unimodal distribution)
   - Example: the bell-shaped normal curve
   - Skewness ≈ 0

2. **Right-Skewed (Positive Skew)**:
   - Typically Mode < Median < Mean
   - Long tail on the right
   - Skewness > 0
   - Examples: Income, house prices

3. **Left-Skewed (Negative Skew)**:
   - Typically Mean < Median < Mode
   - Long tail on the left
   - Skewness < 0
   - Examples: Test scores (if most students do well), age at death

### Skewness Formula:

$$Skewness = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^3$$

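Where:
- $n$ = sample size
- $\bar{x}$ = sample mean
- $s$ = sample standard deviation (computed with $n-1$ in the denominator)

This is the bias-adjusted sample skewness. Note that `scipy.stats.skew`, used in the code in this section, defaults to `bias=True` and returns the unadjusted value; pass `bias=False` to apply this correction. The two agree closely for large samples.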

### Empirical Relationship:

For moderately skewed distributions:
$$Mean - Mode \approx 3(Mean - Median)$$
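The skewness formula above can be sketched term by term in standard-library Python. The function name `adjusted_skewness` and the small sample lists are assumptions made for illustration; `statistics.stdev` supplies the sample standard deviation $s$.

```python
import statistics


def adjusted_skewness(values):
    """Sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s) ** 3)."""
    values = list(values)
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least three values")
    mean = statistics.fmean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    if s == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)


right_tailed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]   # one large value stretches the right tail
left_tailed = [-x for x in right_tailed]          # mirror image: tail on the left
balanced = [1, 2, 3, 4, 5]                        # symmetric around 3

for name, sample in [("right-tailed", right_tailed),
                     ("left-tailed", left_tailed),
                     ("balanced", balanced)]:
    print(f"{name:>12}: mean={statistics.fmean(sample):6.2f}  "
          f"median={statistics.median(sample):6.2f}  "
          f"skewness={adjusted_skewness(sample):6.2f}")
```

The sign of the result should match the direction of the longer tail, and the mean should sit on the same side of the median as that tail.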

In [None]:
# Creating Different Skewed Distributions

np.random.seed(42)

# 1. Symmetric (Normal) distribution
symmetric = np.random.normal(50, 10, 1000)

# 2. Right-skewed (Exponential) distribution
right_skewed = np.random.exponential(20, 1000)

# 3. Left-skewed (Negative of exponential, shifted)
left_skewed = 100 - np.random.exponential(20, 1000)

# Calculate statistics for each
distributions = {
    'Symmetric': symmetric,
    'Right-Skewed': right_skewed,
    'Left-Skewed': left_skewed
}

skew_results = []
for name, data in distributions.items():
    mean_val = np.mean(data)
    median_val = np.median(data)
    mode_val = stats.mode(data.round(), keepdims=True).mode[0]
    skewness = stats.skew(data)
    
    skew_results.append({
        'Distribution': name,
        'Mean': f"{mean_val:.2f}",
        'Median': f"{median_val:.2f}",
        'Mode': f"{mode_val:.2f}",
        'Skewness': f"{skewness:.2f}"
    })

skew_df = pd.DataFrame(skew_results)
print("\nSkewness and Central Tendency Relationships:")
print("="*70)
print(skew_df.to_string(index=False))
print("\nKey Observations:")
print("- Symmetric: Mean ≈ Median, Skewness ≈ 0")
print("- Right-Skewed: Mean > Median, Skewness > 0")
print("- Left-Skewed: Mean < Median, Skewness < 0")

In [None]:
# Visualizing Skewness and Central Tendency

fig, axes = plt.subplots(3, 1, figsize=(12, 12))

colors_dist = ['steelblue', 'coral', 'mediumseagreen']

for idx, (name, data) in enumerate(distributions.items()):
    mean_val = np.mean(data)
    median_val = np.median(data)
    skewness = stats.skew(data)
    
    # Plot histogram
    axes[idx].hist(data, bins=50, color=colors_dist[idx], edgecolor='black', 
                   alpha=0.7, density=True)
    
    # Add mean and median lines
    axes[idx].axvline(mean_val, color='red', linestyle='--', linewidth=2.5, 
                      label=f'Mean: {mean_val:.1f}')
    axes[idx].axvline(median_val, color='green', linestyle='--', linewidth=2.5, 
                      label=f'Median: {median_val:.1f}')
    
    # Add title with skewness
    axes[idx].set_title(f'{name} Distribution (Skewness: {skewness:.2f})', 
                        fontsize=13, fontweight='bold')
    axes[idx].set_xlabel('Value', fontsize=11)
    axes[idx].set_ylabel('Density', fontsize=11)
    axes[idx].legend(fontsize=10)
    axes[idx].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\nVisual Guide:")
print("- Symmetric: Mean and median overlap")
print("- Right-Skewed: Mean is pulled to the right of median")
print("- Left-Skewed: Mean is pulled to the left of median")

In [None]:
# Real-World Example: Income Distribution (Right-Skewed)

np.random.seed(42)

# Simulate income distribution (typically right-skewed)
# Most people earn moderate income, few earn very high income
base_income = np.random.gamma(shape=2, scale=25000, size=950)
high_earners = np.random.uniform(150000, 500000, 50)
income_distribution = np.concatenate([base_income, high_earners])

# Calculate statistics
mean_income = np.mean(income_distribution)
median_income = np.median(income_distribution)
skewness_income = stats.skew(income_distribution)

# Visualization
plt.figure(figsize=(12, 6))
plt.hist(income_distribution, bins=50, color='gold', edgecolor='black', alpha=0.7)
plt.axvline(mean_income, color='red', linestyle='--', linewidth=2.5, 
            label=f'Mean: ${mean_income:,.0f}')
plt.axvline(median_income, color='green', linestyle='--', linewidth=2.5, 
            label=f'Median: ${median_income:,.0f}')
plt.xlabel('Annual Income ($)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title(f'Income Distribution (Right-Skewed, Skewness: {skewness_income:.2f})', 
          fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print(f"\nIncome Analysis:")
print(f"Mean Income: ${mean_income:,.2f}")
print(f"Median Income: ${median_income:,.2f}")
print(f"Difference: ${mean_income - median_income:,.2f}")
print(f"\nSkewness: {skewness_income:.2f} (Right-Skewed)")
print("\n** Median is more representative of 'typical' income! **")
print("** Mean is inflated by high earners! **")

---

## 9. Real-World Applications in Data Science and ML <a id="applications"></a>

### Use Cases:

1. **Data Imputation**:
   - Fill missing values with mean (for symmetric data) or median (for skewed data)
   - Use mode for categorical features

2. **Feature Engineering**:
   - Create aggregated features using central tendency
   - Example: Average purchase amount per customer

3. **Outlier Detection**:
   - Compare values to mean ± 3*std, or to the IQR fences (Q1 - 1.5*IQR and Q3 + 1.5*IQR)
   - Identify anomalies in fraud detection

4. **Data Normalization**:
   - Mean-centering: Subtract mean from each value
   - Median-centering: Subtract the median instead; more robust to outliers

5. **Model Evaluation**:
   - Baseline prediction using mean/median
   - Compare model performance against simple central tendency

6. **Business Analytics**:
   - Average customer lifetime value
   - Median transaction amount
   - Most popular product (mode)

7. **A/B Testing**:
   - Compare means of two groups
   - Use median for robust comparison
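The "baseline prediction" use case can be sketched as follows: predicting one constant for every record, the mean is the constant that minimizes mean squared error and the median is the constant that minimizes mean absolute error. The helper names and the target values are hypothetical and chosen only to make the contrast visible.

```python
import statistics


def mean_squared_error(targets, prediction):
    """Average squared gap between each target and a single constant prediction."""
    return statistics.fmean([(t - prediction) ** 2 for t in targets])


def mean_absolute_error(targets, prediction):
    """Average absolute gap between each target and a single constant prediction."""
    return statistics.fmean([abs(t - prediction) for t in targets])


# Made-up target values with one extreme observation.
targets = [45, 48, 50, 52, 55, 58, 60, 62, 65, 500]

mean_baseline = statistics.fmean(targets)
median_baseline = statistics.median(targets)

for label, prediction in [("mean", mean_baseline), ("median", median_baseline)]:
    print(f"Predict {label:>6} ({prediction:6.2f}): "
          f"MSE={mean_squared_error(targets, prediction):10.2f}  "
          f"MAE={mean_absolute_error(targets, prediction):7.2f}")
```

A trained model is only useful if it beats the relevant baseline, so which baseline to report depends on the error metric being optimized.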

In [None]:
# Application 1: Data Imputation

# Create a dataset with missing values
np.random.seed(42)
ages = np.random.normal(35, 10, 100)
# Introduce some missing values (represented as NaN)
missing_indices = np.random.choice(100, 15, replace=False)
ages_with_missing = ages.copy()
ages_with_missing[missing_indices] = np.nan

# Create DataFrame
df_impute = pd.DataFrame({'Age': ages_with_missing})

print("Dataset with Missing Values:")
print(f"Total records: {len(df_impute)}")
print(f"Missing values: {df_impute['Age'].isna().sum()}")
print(f"\nFirst 20 values:\n{df_impute.head(20)}")

# Imputation strategies
mean_age = df_impute['Age'].mean()
median_age = df_impute['Age'].median()

# Create imputed versions
df_impute['Age_Mean_Imputed'] = df_impute['Age'].fillna(mean_age)
df_impute['Age_Median_Imputed'] = df_impute['Age'].fillna(median_age)

print(f"\nImputation Statistics:")
print(f"Mean for imputation: {mean_age:.2f}")
print(f"Median for imputation: {median_age:.2f}")
print(f"\nSample of imputed data:")
print(df_impute[df_impute['Age'].isna()].head(10))

In [None]:
# Application 2: Feature Engineering - Customer Segmentation

# Create customer transaction data
np.random.seed(42)
customer_data = {
    'CustomerID': range(1, 101),
    'Purchase1': np.random.uniform(10, 200, 100),
    'Purchase2': np.random.uniform(10, 200, 100),
    'Purchase3': np.random.uniform(10, 200, 100),
    'Purchase4': np.random.uniform(10, 200, 100),
    'Purchase5': np.random.uniform(10, 200, 100)
}

df_customers = pd.DataFrame(customer_data)

# Feature engineering: Create aggregated features
purchase_cols = ['Purchase1', 'Purchase2', 'Purchase3', 'Purchase4', 'Purchase5']

df_customers['Avg_Purchase'] = df_customers[purchase_cols].mean(axis=1)
df_customers['Median_Purchase'] = df_customers[purchase_cols].median(axis=1)
df_customers['Total_Purchase'] = df_customers[purchase_cols].sum(axis=1)
df_customers['Max_Purchase'] = df_customers[purchase_cols].max(axis=1)
df_customers['Min_Purchase'] = df_customers[purchase_cols].min(axis=1)

print("Customer Purchase Analysis with Engineered Features:")
print("="*80)
print(df_customers[['CustomerID', 'Avg_Purchase', 'Median_Purchase', 
                     'Total_Purchase']].head(10))

# Segment customers based on average purchase
df_customers['Segment'] = pd.cut(df_customers['Avg_Purchase'], 
                                  bins=[0, 75, 125, 200],
                                  labels=['Low Value', 'Medium Value', 'High Value'])

print("\nCustomer Segmentation:")
print(df_customers['Segment'].value_counts())

In [None]:
# Application 3: A/B Testing - Comparing Two Groups

# Simulate A/B test: Two different website designs
np.random.seed(42)

# Group A: Current design (control)
time_on_site_A = np.random.normal(180, 40, 200)  # seconds

# Group B: New design (treatment) - slightly better
time_on_site_B = np.random.normal(195, 40, 200)  # seconds

# Calculate central tendency for both groups
mean_A = np.mean(time_on_site_A)
median_A = np.median(time_on_site_A)
mean_B = np.mean(time_on_site_B)
median_B = np.median(time_on_site_B)

# Create comparison DataFrame
ab_comparison = pd.DataFrame({
    'Metric': ['Mean (seconds)', 'Median (seconds)', 'Sample Size'],
    'Group A (Control)': [f"{mean_A:.2f}", f"{median_A:.2f}", len(time_on_site_A)],
    'Group B (Treatment)': [f"{mean_B:.2f}", f"{median_B:.2f}", len(time_on_site_B)],
    'Improvement': [
        f"{((mean_B/mean_A - 1) * 100):.2f}%",
        f"{((median_B/median_A - 1) * 100):.2f}%",
        '-'
    ]
})

print("\nA/B Testing Results: Time on Site")
print("="*80)
print(ab_comparison.to_string(index=False))

# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].hist(time_on_site_A, bins=30, color='lightcoral', alpha=0.7, 
             label='Group A', edgecolor='black')
axes[0].axvline(mean_A, color='red', linestyle='--', linewidth=2, label=f'Mean A: {mean_A:.1f}s')
axes[0].set_xlabel('Time on Site (seconds)', fontsize=11)
axes[0].set_ylabel('Frequency', fontsize=11)
axes[0].set_title('Group A: Current Design', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

axes[1].hist(time_on_site_B, bins=30, color='lightgreen', alpha=0.7, 
             label='Group B', edgecolor='black')
axes[1].axvline(mean_B, color='green', linestyle='--', linewidth=2, label=f'Mean B: {mean_B:.1f}s')
axes[1].set_xlabel('Time on Site (seconds)', fontsize=11)
axes[1].set_ylabel('Frequency', fontsize=11)
axes[1].set_title('Group B: New Design', fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nConclusion: Group B's mean time on site is {((mean_B/mean_A - 1) * 100):.2f}% higher than Group A's (statistical significance not tested here).")
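
A difference in means alone does not show that the new design is better, because some gap will appear by chance. One assumption-light way to check is a permutation test: shuffle the group labels many times and count how often the shuffled difference is at least as large as the observed one. The sketch below is a minimal illustration using only the standard library. The two sample lists and the helper name `permutation_p_value` are hypothetical and are not part of the notebook's simulation.

```python
import random
import statistics

# Hypothetical time-on-site samples in seconds (not the notebook's simulated groups)
group_a = [170, 175, 180, 185, 190, 178, 182, 176]
group_b = [195, 200, 192, 205, 198, 202, 193, 207]

observed_diff = statistics.fmean(group_b) - statistics.fmean(group_a)

def permutation_p_value(a, b, n_permutations=10_000, seed=42):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    pooled = list(a) + list(b)
    observed = abs(statistics.fmean(b) - statistics.fmean(a))
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if abs(statistics.fmean(perm_b) - statistics.fmean(perm_a)) >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero
    return (at_least_as_extreme + 1) / (n_permutations + 1)

p_value = permutation_p_value(group_a, group_b)
print(f"Observed difference in means: {observed_diff:.2f} s")
print(f"Permutation p-value: {p_value:.4f}")
```

A small p-value suggests the gap is unlikely to come from random assignment alone. With heavily skewed metrics, running the same test on medians may be the more informative comparison.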

In [None]:
# Application 4: Baseline Model using Central Tendency

# Simulate a sample dataset: house price prediction
np.random.seed(42)
actual_prices = np.random.normal(300000, 100000, 100)  # Simulated house prices
actual_prices = np.maximum(actual_prices, 100000)  # Floor prices at $100,000 (also rules out negative values)

# Baseline predictions using central tendency
mean_baseline = np.full(len(actual_prices), np.mean(actual_prices))
median_baseline = np.full(len(actual_prices), np.median(actual_prices))

# Calculate Mean Absolute Error (MAE) for each baseline
mae_mean = np.mean(np.abs(actual_prices - mean_baseline))
mae_median = np.mean(np.abs(actual_prices - median_baseline))

# Calculate Mean Squared Error (MSE) for each baseline
mse_mean = np.mean((actual_prices - mean_baseline) ** 2)
mse_median = np.mean((actual_prices - median_baseline) ** 2)

print("\nBaseline Model Evaluation: House Price Prediction")
print("="*70)
print(f"\nMean-based Baseline:")
print(f"  Prediction: ${np.mean(actual_prices):,.2f}")
print(f"  MAE: ${mae_mean:,.2f}")
print(f"  MSE: {mse_mean:,.2f} (squared dollars)")

print(f"\nMedian-based Baseline:")
print(f"  Prediction: ${np.median(actual_prices):,.2f}")
print(f"  MAE: ${mae_median:,.2f}")
print(f"  MSE: {mse_median:,.2f} (squared dollars)")

print(f"\nNote: Among constant predictions, the mean minimizes MSE and the median minimizes MAE")
print(f"A useful ML model should beat these baseline errors!")
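
The claim that the mean minimizes MSE and the median minimizes MAE can be checked by brute force. The sketch below tries every integer constant prediction on a small hypothetical target list and reports which constant gives the lowest error under each metric. The data and the helper names `mse` and `mae` are illustrative assumptions, not part of the house-price example.

```python
import statistics

# Hypothetical targets with one extreme value
y = [1, 2, 3, 4, 100]

def mse(constant):
    """Mean squared error of predicting `constant` for every target."""
    return sum((v - constant) ** 2 for v in y) / len(y)

def mae(constant):
    """Mean absolute error of predicting `constant` for every target."""
    return sum(abs(v - constant) for v in y) / len(y)

# Try every integer constant prediction from 0 to 100
candidates = range(0, 101)
best_for_mse = min(candidates, key=mse)
best_for_mae = min(candidates, key=mae)

print(f"Mean = {statistics.mean(y)}, constant minimizing MSE = {best_for_mse}")
print(f"Median = {statistics.median(y)}, constant minimizing MAE = {best_for_mae}")
```

The extreme value pulls the MSE-optimal constant far from most of the data, while the MAE-optimal constant stays at the middle observation.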

In [None]:
# Application 5: Anomaly Detection using Central Tendency

# Simulate server response times
np.random.seed(42)
normal_response_times = np.random.normal(200, 30, 95)  # milliseconds
anomalies = np.array([500, 600, 550, 480, 520])  # Anomalous slow responses
response_times = np.concatenate([normal_response_times, anomalies])

# Calculate statistics
mean_response = np.mean(response_times)
median_response = np.median(response_times)
std_response = np.std(response_times)

# Define anomaly thresholds
# Method 1: Mean + 3*std (upper threshold only; assumes roughly normal data)
threshold_mean = mean_response + 3 * std_response

# Method 2: Q3 + 1.5*IQR (Tukey's upper fence; more robust to outliers)
Q1 = np.percentile(response_times, 25)
Q3 = np.percentile(response_times, 75)
IQR = Q3 - Q1
threshold_median = Q3 + 1.5 * IQR

# Detect anomalies
anomalies_detected_mean = response_times[response_times > threshold_mean]
anomalies_detected_median = response_times[response_times > threshold_median]

print("\nAnomaly Detection: Server Response Times")
print("="*70)
print(f"Mean Response Time: {mean_response:.2f} ms")
print(f"Median Response Time: {median_response:.2f} ms")
print(f"Standard Deviation: {std_response:.2f} ms")

print(f"\nMethod 1: Mean + 3*Std Threshold = {threshold_mean:.2f} ms")
print(f"Anomalies Detected: {len(anomalies_detected_mean)}")
print(f"Values: {anomalies_detected_mean}")

print(f"\nMethod 2: Q3 + 1.5*IQR Threshold = {threshold_median:.2f} ms")
print(f"Anomalies Detected: {len(anomalies_detected_median)}")
print(f"Values: {anomalies_detected_median}")

# Visualize
plt.figure(figsize=(12, 6))
plt.scatter(range(len(response_times)), response_times, 
            c=['red' if x > threshold_median else 'blue' for x in response_times],
            alpha=0.6, s=50)
plt.axhline(mean_response, color='green', linestyle='--', linewidth=2, 
            label=f'Mean: {mean_response:.1f} ms')
plt.axhline(median_response, color='orange', linestyle='--', linewidth=2, 
            label=f'Median: {median_response:.1f} ms')
plt.axhline(threshold_median, color='red', linestyle=':', linewidth=2, 
            label=f'Anomaly Threshold: {threshold_median:.1f} ms')
plt.xlabel('Request Number', fontsize=11)
plt.ylabel('Response Time (ms)', fontsize=11)
plt.title('Server Response Times with Anomaly Detection', fontsize=13, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
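
A small example makes the "more robust" remark concrete. A single extreme value inflates both the mean and the standard deviation, so the mean-based threshold can rise above the very point it is meant to catch. The sketch below uses hypothetical response times and the standard-library `statistics` module rather than the notebook's NumPy arrays. `method="inclusive"` is chosen on the assumption that it mirrors the default linear interpolation of `np.percentile`.

```python
import statistics

# Hypothetical response times in ms with one obvious slow request
times = [10, 11, 12, 13, 14, 15, 16, 17, 18, 60]

# Method 1: mean + 3 * standard deviation (population std, as np.std computes)
sigma_threshold = statistics.mean(times) + 3 * statistics.pstdev(times)

# Method 2: Q3 + 1.5 * IQR, with quartiles interpolated between data points
q1, _, q3 = statistics.quantiles(times, n=4, method="inclusive")
tukey_threshold = q3 + 1.5 * (q3 - q1)

flagged_sigma = [t for t in times if t > sigma_threshold]
flagged_tukey = [t for t in times if t > tukey_threshold]

print(f"Mean + 3*std threshold: {sigma_threshold:.2f} -> flagged {flagged_sigma}")
print(f"Q3 + 1.5*IQR threshold: {tukey_threshold:.2f} -> flagged {flagged_tukey}")
```

In this sample the quartile-based fence flags the slow request while the mean-based rule does not, which is the masking effect that outliers have on their own detection threshold.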

---

## 10. Summary <a id="summary"></a>

### Key Takeaways:

1. **Measures of Central Tendency** provide a single representative value for a dataset:
   - **Mean**: Average of all values (best for symmetric data)
   - **Median**: Middle value (best for skewed data or data with outliers)
   - **Mode**: Most frequent value (best for categorical data)

2. **Types of Means**:
   - **Arithmetic Mean**: Standard average (use for most purposes)
   - **Geometric Mean**: Use for growth rates, ratios, and multiplicative data
   - **Harmonic Mean**: Use for rates, speeds, and reciprocal relationships

3. **Outlier Sensitivity**:
   - Mean is highly affected by outliers
   - Median and mode are robust to outliers
   - Choose median when data has extreme values

4. **Skewness Relationships** (typical for unimodal distributions; exceptions exist):
   - Symmetric: Mean = Median = Mode
   - Right-Skewed: Mode < Median < Mean
   - Left-Skewed: Mean < Median < Mode

5. **Data Science Applications**:
   - Data imputation and cleaning
   - Feature engineering and aggregation
   - Baseline model creation
   - Anomaly detection
   - A/B testing and comparison
   - Business analytics and reporting

6. **Best Practices**:
   - Always visualize your data before choosing a measure
   - Check for outliers and skewness
   - Report multiple measures when appropriate
   - Understand your data's distribution
   - Choose the measure that best represents your use case
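
Takeaways 3 and 4 can be seen together on a very small sample. The sketch below uses only the standard library and hypothetical values. It prints the three measures before and after adding one large value, then compares how far the mean and the median move.

```python
import statistics

# Hypothetical sample: most values are small
base = [1, 2, 2, 3, 4, 5]
# The same sample with one large value appended
with_outlier = base + [20]

for label, data in [("Without outlier", base), ("With outlier", with_outlier)]:
    print(f"{label}: mode={statistics.mode(data)}, "
          f"median={statistics.median(data)}, mean={statistics.mean(data):.2f}")

mean_shift = statistics.mean(with_outlier) - statistics.mean(base)
median_shift = statistics.median(with_outlier) - statistics.median(base)
print(f"Mean shifted by {mean_shift:.2f}, median by {median_shift:.2f}")
```

The sample with the large value shows the right-skewed ordering (mode below median below mean), and the mean moves several times further than the median.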

### When to Use Which Measure:

| Situation | Best Measure | Reason |
|-----------|--------------|--------|
| Symmetric distribution | Mean | Uses all data points |
| Skewed distribution | Median | Robust to skewness |
| Data with outliers | Median | Not affected by extremes |
| Categorical (nominal) data | Mode | Only applicable measure |
| Growth rates | Geometric Mean | Correct for compound growth |
| Average speed/rates | Harmonic Mean | Correct for rate averaging |
| Further calculations | Mean | Required for variance, etc. |
| Income/House prices | Median | Typically right-skewed |
| Test scores | Mean or Median | Depends on distribution |
| Customer preferences | Mode | Finding most common choice |
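
The growth-rate and speed rows of the table are the ones most often gotten wrong, so a short numeric check may help. The sketch below uses the standard-library `statistics` module with hypothetical growth factors and speeds. It compares each specialized mean with the arithmetic mean against the quantity it is supposed to reproduce.

```python
import statistics

# Hypothetical yearly growth factors: +10%, +20%, -10%
growth_factors = [1.10, 1.20, 0.90]
arithmetic = statistics.mean(growth_factors)
geometric = statistics.geometric_mean(growth_factors)

actual_total = 1.10 * 1.20 * 0.90  # true compound growth over three years
print(f"Actual 3-year growth factor:  {actual_total:.4f}")
print(f"Geometric mean compounded:    {geometric ** 3:.4f}")
print(f"Arithmetic mean compounded:   {arithmetic ** 3:.4f}")

# Hypothetical round trip: same distance at 60 km/h out and 40 km/h back
speeds = [60, 40]
harmonic_speed = statistics.harmonic_mean(speeds)
print(f"Harmonic mean speed:   {harmonic_speed:.1f} km/h")
print(f"Arithmetic mean speed: {statistics.mean(speeds):.1f} km/h")
```

Compounding the geometric mean recovers the actual total growth, while the arithmetic mean overstates it. For the round trip, total distance divided by total time equals the harmonic mean of 48 km/h, not the arithmetic 50 km/h.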

### Further Learning:

- Explore **weighted mean** for data with different importance levels
- Study **trimmed mean** (excluding extreme values)
- Learn about **winsorized mean** (capping extreme values)
- Understand **measures of dispersion** (variance, standard deviation)
- Practice with **real-world datasets** from Kaggle or UCI ML Repository
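
As a starting point for the first three items, here is a minimal standard-library sketch. The helper names `trimmed_mean` and `winsorized_mean`, the data, and the 10% cut-off are illustrative assumptions. SciPy offers library versions (`scipy.stats.trim_mean` and `scipy.stats.mstats.winsorize`), whose exact cut-off conventions may differ from these helpers.

```python
import statistics

def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    return statistics.mean(ordered[k:len(ordered) - k])

def winsorized_mean(values, proportion):
    """Mean after capping `proportion` of the values at each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    low, high = ordered[k], ordered[len(ordered) - k - 1]
    return statistics.mean([min(max(v, low), high) for v in values])

# Hypothetical scores and their relative importance
scores = [80, 90, 70]
weights = [0.5, 0.3, 0.2]
weighted = sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# Hypothetical sample with one extreme value
sample = [10, 12, 13, 14, 15, 16, 17, 18, 19, 200]

print(f"Weighted mean:   {weighted:.2f}")
print(f"Plain mean:      {statistics.mean(sample):.2f}")
print(f"Trimmed mean:    {trimmed_mean(sample, 0.1):.2f}")
print(f"Winsorized mean: {winsorized_mean(sample, 0.1):.2f}")
```

Trimming discards the extremes, whereas winsorizing keeps the sample size and replaces the extremes with the nearest retained value. Both pull the estimate back toward the bulk of the data.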

In [None]:
# Final Comprehensive Example: Analyzing a Simulated E-commerce Dataset

# Create a realistic dataset: E-commerce transaction amounts
np.random.seed(42)

# Mix of different customer segments
small_purchases = np.random.gamma(shape=2, scale=15, size=600)  # Majority
medium_purchases = np.random.gamma(shape=3, scale=30, size=300)
large_purchases = np.random.gamma(shape=2, scale=100, size=80)
very_large_purchases = np.random.uniform(500, 2000, 20)  # Outliers

all_transactions = np.concatenate([small_purchases, medium_purchases, 
                                   large_purchases, very_large_purchases])

# Calculate all measures
arithmetic_mean = np.mean(all_transactions)
geometric_mean_val = gmean(all_transactions)
harmonic_mean_val = hmean(all_transactions)
median_val = np.median(all_transactions)
mode_val = stats.mode(all_transactions.round(), keepdims=True).mode[0]
skewness_val = stats.skew(all_transactions)

# Create summary report
print("\n" + "="*80)
print("E-COMMERCE TRANSACTION ANALYSIS - COMPREHENSIVE REPORT")
print("="*80)

print(f"\nDataset Overview:")
print(f"  Total Transactions: {len(all_transactions):,}")
print(f"  Transaction Range: ${all_transactions.min():.2f} - ${all_transactions.max():.2f}")

print(f"\nCentral Tendency Measures:")
print(f"  Arithmetic Mean:    ${arithmetic_mean:.2f}")
print(f"  Geometric Mean:     ${geometric_mean_val:.2f}")
print(f"  Harmonic Mean:      ${harmonic_mean_val:.2f}")
print(f"  Median:             ${median_val:.2f}")
print(f"  Mode:               ${mode_val:.2f}")

print(f"\nDistribution Characteristics:")
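skew_label = "Right-Skewed" if skewness_val > 0 else "Left-Skewed" if skewness_val < 0 else "Symmetric"
print(f"  Skewness:           {skewness_val:.2f} ({skew_label})")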
print(f"  Standard Deviation: ${np.std(all_transactions):.2f}")

# Percentiles
p25 = np.percentile(all_transactions, 25)
p75 = np.percentile(all_transactions, 75)
p90 = np.percentile(all_transactions, 90)

print(f"\nPercentiles:")
print(f"  25th Percentile (Q1): ${p25:.2f}")
print(f"  50th Percentile (Median): ${median_val:.2f}")
print(f"  75th Percentile (Q3): ${p75:.2f}")
print(f"  90th Percentile: ${p90:.2f}")

print(f"\nBusiness Insights:")
print(f"  - Typical transaction (Median): ${median_val:.2f}")
print(f"  - Average transaction (Mean): ${arithmetic_mean:.2f}")
print(f"  - Mean is {((arithmetic_mean/median_val - 1)*100):.1f}% higher than median")
print(f"    (consistent with a right-skewed distribution pulled up by high-value transactions)")
print(f"  - 50% of transactions are below ${median_val:.2f}")
print(f"  - 10% of transactions exceed ${p90:.2f}")

print("\n" + "="*80)

In [None]:
# Comprehensive Visualization

fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Plot 1: Histogram with all measures
axes[0, 0].hist(all_transactions, bins=60, color='steelblue', 
                edgecolor='black', alpha=0.7)
axes[0, 0].axvline(arithmetic_mean, color='red', linestyle='--', linewidth=2.5, 
                   label=f'Mean: ${arithmetic_mean:.2f}')
axes[0, 0].axvline(median_val, color='green', linestyle='--', linewidth=2.5, 
                   label=f'Median: ${median_val:.2f}')
axes[0, 0].set_xlabel('Transaction Amount ($)', fontsize=11)
axes[0, 0].set_ylabel('Frequency', fontsize=11)
axes[0, 0].set_title('Distribution of Transaction Amounts', 
                     fontsize=12, fontweight='bold')
axes[0, 0].legend()
axes[0, 0].grid(alpha=0.3)

# Plot 2: Box plot
axes[0, 1].boxplot(all_transactions, vert=False, widths=0.7)
axes[0, 1].axvline(arithmetic_mean, color='red', linestyle='--', linewidth=2, 
                   label='Mean')
axes[0, 1].axvline(median_val, color='green', linestyle='--', linewidth=2, 
                   label='Median')
axes[0, 1].set_xlabel('Transaction Amount ($)', fontsize=11)
axes[0, 1].set_title('Box Plot (Outliers Visible)', fontsize=12, fontweight='bold')
axes[0, 1].legend()
axes[0, 1].grid(alpha=0.3)

# Plot 3: Comparison of different means
means_comparison = [harmonic_mean_val, geometric_mean_val, arithmetic_mean, median_val]
means_labels = ['Harmonic\nMean', 'Geometric\nMean', 'Arithmetic\nMean', 'Median']
colors_means = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4']

bars = axes[1, 0].bar(means_labels, means_comparison, color=colors_means, 
                      alpha=0.7, edgecolor='black')
axes[1, 0].set_ylabel('Value ($)', fontsize=11)
axes[1, 0].set_title('Comparison of Central Tendency Measures', 
                     fontsize=12, fontweight='bold')
axes[1, 0].grid(axis='y', alpha=0.3)

# Add value labels
for bar in bars:
    height = bar.get_height()
    axes[1, 0].text(bar.get_x() + bar.get_width()/2., height,
                    f'${height:.2f}', ha='center', va='bottom', 
                    fontsize=10, fontweight='bold')

# Plot 4: Cumulative Distribution
sorted_transactions = np.sort(all_transactions)
cumulative = np.arange(1, len(sorted_transactions) + 1) / len(sorted_transactions) * 100

axes[1, 1].plot(sorted_transactions, cumulative, color='purple', linewidth=2)
axes[1, 1].axvline(median_val, color='green', linestyle='--', linewidth=2, 
                   label=f'Median: ${median_val:.2f}')
axes[1, 1].axhline(50, color='gray', linestyle=':', alpha=0.5)
axes[1, 1].set_xlabel('Transaction Amount ($)', fontsize=11)
axes[1, 1].set_ylabel('Cumulative Percentage (%)', fontsize=11)
axes[1, 1].set_title('Cumulative Distribution Function', 
                     fontsize=12, fontweight='bold')
axes[1, 1].legend()
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\nVisualization Complete!")
print("This comprehensive analysis demonstrates all key concepts of central tendency.")

---

## Practice Exercises

Try these exercises to reinforce your understanding:

1. **Calculate all measures** for a dataset of your choice (use pandas to load a CSV)
2. **Identify outliers** in a dataset and compare mean vs median
3. **Create visualizations** showing the relationship between skewness and central tendency
4. **Implement imputation** using different central tendency measures
5. **Compare geometric vs arithmetic mean** for investment returns
6. **Build a baseline model** using central tendency for a regression problem
7. **Analyze a real dataset** from Kaggle and report all central tendency measures

Remember: The best measure of central tendency depends on your data characteristics and analysis goals!