# Central Tendency & Dispersion

**Module: Descriptive & Inferential Statistics**

## Learning Objectives
- Calculate and interpret measures of central tendency (mean, median, mode)
- Calculate and interpret measures of dispersion (range, variance, standard deviation, IQR)
- Choose appropriate measures for different data types and distributions
- Use pandas and numpy for descriptive statistics

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Set display options
pd.set_option('display.precision', 2)
np.random.seed(42)

---
## Quick Refresher

### Central Tendency
| Measure | What it tells you | Best used when |
|---------|------------------|----------------|
| **Mean** | Average value | Data is symmetric, no extreme outliers |
| **Median** | Middle value | Data is skewed or has outliers |
| **Mode** | Most frequent value | Categorical data or finding peaks |
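
As a small illustration of the table above, the sketch below uses made-up numbers (not the notebook's data) to show how a single extreme value moves the mean far more than the median, and how the mode applies to categorical labels.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen for easy arithmetic
base = np.array([40, 45, 50, 55, 60])
with_outlier = np.append(base, 500)   # one extreme value added

print(np.mean(base), np.median(base))                   # 50.0 50.0
print(np.mean(with_outlier), np.median(with_outlier))   # 125.0 52.5

# Mode works on categorical data, where mean and median are undefined
colors = pd.Series(['red', 'blue', 'red', 'green'])
print(colors.mode()[0])                                 # red
```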

### Dispersion
| Measure | What it tells you | Formula hint |
|---------|------------------|---------------|
| **Range** | Spread from min to max | max - min |
| **Variance** | Average squared deviation from mean | σ² |
| **Std Dev** | Typical deviation from mean: the square root of the variance, in the same units as the data | σ |
| **IQR** | Spread of middle 50% | Q3 - Q1 |
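
To connect the formula hints to arithmetic, here is a worked sketch on a small made-up array. It computes the population form (divide by n) step by step, then the sample form (divide by n - 1). The array and variable names are illustrative only.

```python
import numpy as np

# Hypothetical data with mean 5, chosen so the arithmetic is easy to follow
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

deviations = data - data.mean()            # distance of each value from the mean
sum_sq = np.sum(deviations ** 2)           # 9+1+1+1+0+0+4+16 = 32

pop_var = sum_sq / len(data)               # population variance: divide by n
sample_var = sum_sq / (len(data) - 1)      # sample variance: divide by n - 1
pop_std = np.sqrt(pop_var)                 # back in the units of the data

print(data.max() - data.min())             # range: 7.0
print(pop_var, pop_std)                    # 4.0 2.0
print(round(sample_var, 3))                # 4.571
```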

---
## Working Example: Employee Salaries

Let's analyze simulated salary data for a tech company.

In [None]:
# Sample salary data (in thousands)
salaries = pd.DataFrame({
    'department': ['Engineering']*20 + ['Sales']*15 + ['Marketing']*10,
    'salary': list(np.random.normal(95, 15, 20)) + 
              list(np.random.normal(70, 20, 15)) + 
              list(np.random.normal(65, 10, 10))
})

# Add two outliers (executives)
salaries = pd.concat([salaries, pd.DataFrame({
    'department': ['Engineering', 'Sales'],
    'salary': [250, 200]
})], ignore_index=True)

salaries.head(10)

### Central Tendency in Python

In [None]:
# Mean - sensitive to outliers
print(f"Mean salary: ${salaries['salary'].mean():.2f}k")

# Median - robust to outliers
print(f"Median salary: ${salaries['salary'].median():.2f}k")

# Mode - most frequent value. With continuous data every value is usually unique,
# so mode()[0] just returns the smallest value and is not informative here
print(f"Mode salary: ${salaries['salary'].mode()[0]:.2f}k")

In [None]:
# Notice the difference! Outliers pull the mean up
print(f"\nDifference (mean - median): ${salaries['salary'].mean() - salaries['salary'].median():.2f}k")
print("A mean well above the median usually indicates right skew (positive skew)")
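
The mean-median gap is only a rule of thumb. The sketch below, on two small made-up series, shows how the sample skewness from pandas `.skew()` gives a direct number to go with it: close to zero for symmetric data, positive when a long right tail is present.

```python
import pandas as pd

# Hypothetical series: one symmetric, one with a single large value
symmetric = pd.Series([1, 2, 3, 4, 5])
right_tail = pd.Series([1, 2, 3, 4, 50])

for name, s in [('symmetric', symmetric), ('right_tail', right_tail)]:
    gap = s.mean() - s.median()          # positive gap hints at right skew
    print(f"{name}: mean - median = {gap:.1f}, skew = {s.skew():.2f}")
```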

### Dispersion in Python

In [None]:
# Range
salary_range = salaries['salary'].max() - salaries['salary'].min()
print(f"Range: ${salary_range:.2f}k")

# Variance (pandas defaults to ddof=1, the sample variance; units are squared)
print(f"Variance: {salaries['salary'].var():.2f}")

# Standard deviation
print(f"Std Dev: ${salaries['salary'].std():.2f}k")

# IQR
q1 = salaries['salary'].quantile(0.25)
q3 = salaries['salary'].quantile(0.75)
iqr = q3 - q1
print(f"IQR: ${iqr:.2f}k (Q1={q1:.2f}, Q3={q3:.2f})")
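
The same measures are available in numpy, with one difference worth knowing: `np.std` and `np.var` default to `ddof=0` (population form), while the pandas methods default to `ddof=1`. A minimal sketch on made-up values, with illustrative variable names:

```python
import numpy as np
import pandas as pd

# Hypothetical values; the names below are for illustration only
demo = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0])
arr = demo.to_numpy()

print(round(demo.std(), 3))              # pandas, ddof=1 -> 3.162
print(round(np.std(arr), 3))             # numpy, ddof=0  -> 2.828
print(round(np.std(arr, ddof=1), 3))     # matches pandas -> 3.162

# Range and IQR with numpy
np_range = arr.max() - arr.min()
np_q1, np_q3 = np.percentile(arr, [25, 75])
print(np_range, np_q3 - np_q1)           # 8.0 4.0
```

When mixing the two libraries, passing `ddof` explicitly avoids silently comparing a sample statistic with a population one.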

### Using `.describe()` for Quick Summary

In [None]:
# One-liner for all key stats
salaries['salary'].describe()

### Group-wise Statistics

In [None]:
# Compare departments
salaries.groupby('department')['salary'].agg(['mean', 'median', 'std', 'count'])
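
Built-in names such as `'mean'` and `'std'` do not cover the IQR, but `agg` also accepts callables. The sketch below assumes a hypothetical helper, `iqr_width`, and a small made-up frame. It uses pandas named aggregation to put a robust spread measure next to the median for each group.

```python
import pandas as pd

def iqr_width(series):
    """Hypothetical helper: spread of the middle 50% of a Series."""
    return series.quantile(0.75) - series.quantile(0.25)

# Made-up data with two groups on very different scales
scores = pd.DataFrame({
    'team': ['a'] * 5 + ['b'] * 5,
    'score': [1, 2, 3, 4, 5, 10, 20, 30, 40, 50],
})

# Named aggregation: output column name = aggregation to apply
robust = scores.groupby('team')['score'].agg(median='median', iqr=iqr_width)
print(robust)
```

The same call pattern should carry over to `salaries.groupby('department')['salary']`.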

---
## Exercises

### Exercise 1: E-commerce Order Values

You have order data from an online store. Calculate descriptive statistics.

In [None]:
# Order values in dollars
orders = pd.DataFrame({
    'order_id': range(1, 101),
    'value': list(np.random.exponential(50, 95)) + [500, 750, 890, 1200, 2000],  # Some big orders
    'category': np.random.choice(['Electronics', 'Clothing', 'Home', 'Books'], 100)
})

orders.head()

In [None]:
# TODO: Calculate mean, median, and std dev of order values
# Which measure of central tendency better represents a "typical" order?



In [None]:
# TODO: Calculate the IQR for order values
# Use it to identify potential outliers (values below Q1-1.5*IQR or above Q3+1.5*IQR)



In [None]:
# TODO: Find mean and median order value BY category
# Which category has the highest average order value?



### Exercise 2: Website Session Durations

Analyze how long users spend on a website.

In [None]:
# Session duration in seconds
sessions = pd.DataFrame({
    # Labels must follow the order of the durations below: 300 new, then 200 returning
    'user_type': np.repeat(['new', 'returning'], [300, 200]),
    'duration': np.concatenate([
        np.random.gamma(2, 30, 300),   # New users - shorter sessions
        np.random.gamma(4, 45, 200)    # Returning users - longer sessions
    ])
})

sessions.head()

In [None]:
# TODO: Calculate descriptive stats for session duration by user_type
# Include: count, mean, median, std, min, max



In [None]:
# TODO: Calculate the coefficient of variation (CV = std/mean) for each user type
# Which group has more relative variability in session duration?



### Exercise 3: Product Ratings Analysis

Analyze customer ratings for products.

In [None]:
# Product ratings (1-5 stars)
np.random.seed(123)
ratings = pd.DataFrame({
    'product': np.repeat(['A', 'B', 'C'], 100),
    'rating': np.concatenate([
        np.random.choice([1,2,3,4,5], 100, p=[0.05, 0.1, 0.2, 0.35, 0.3]),  # Product A - good
        np.random.choice([1,2,3,4,5], 100, p=[0.3, 0.25, 0.2, 0.15, 0.1]),  # Product B - poor
        np.random.choice([1,2,3,4,5], 100, p=[0.2, 0.1, 0.1, 0.1, 0.5])     # Product C - polarized
    ])
})

ratings.head()

In [None]:
# TODO: For each product, calculate mean, median, mode, and std dev of ratings



In [None]:
# TODO: Which product has the most consistent ratings (lowest dispersion)?
# Which has the most polarized ratings?



In [None]:
# TODO: Create a frequency table showing the distribution of ratings for each product
# Hint: use pd.crosstab() or groupby with value_counts()



### Exercise 4: Custom Aggregation Function

Create a function that returns a complete statistical summary.

In [None]:
# TODO: Write a function that takes a pandas Series and returns a dictionary with:
# - mean, median, mode (first mode if multiple)
# - std, variance
# - range, IQR
# - skewness (use scipy.stats.skew or pandas .skew())

def full_summary(series):
    """Return comprehensive descriptive statistics for a numeric series."""
    pass  # Your code here

# Test it on the salary data
# full_summary(salaries['salary'])

---
## Solutions

In [None]:
# Exercise 1 Solutions

# Mean, median, std
print("Order Value Statistics:")
print(f"Mean: ${orders['value'].mean():.2f}")
print(f"Median: ${orders['value'].median():.2f}")
print(f"Std Dev: ${orders['value'].std():.2f}")
print("\nThe median better represents a typical order (the data is right-skewed with large outliers)")

In [None]:
# IQR and outliers
q1 = orders['value'].quantile(0.25)
q3 = orders['value'].quantile(0.75)
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = orders[(orders['value'] < lower_bound) | (orders['value'] > upper_bound)]
print(f"IQR: ${iqr:.2f}")
print(f"Outlier bounds: ${lower_bound:.2f} to ${upper_bound:.2f}")
print(f"Number of outliers: {len(outliers)}")

In [None]:
# By category
orders.groupby('category')['value'].agg(['mean', 'median'])

In [None]:
# Exercise 2 Solutions

# Descriptive stats by user type
# describe() returns count, mean, std, min, quartiles, and max; the median is the 50% column
sessions.groupby('user_type')['duration'].describe()

In [None]:
# Coefficient of variation
cv_by_type = sessions.groupby('user_type')['duration'].agg(lambda x: x.std() / x.mean())
print("Coefficient of Variation by User Type:")
print(cv_by_type)
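
One reason the coefficient of variation is handy for comparing groups is that it has no units, so rescaling the data leaves it unchanged while the standard deviation shifts with the scale. A small sketch on made-up durations:

```python
import pandas as pd

# Hypothetical durations, first in seconds and then converted to minutes
seconds = pd.Series([30.0, 60.0, 90.0, 120.0])
minutes = seconds / 60

cv_seconds = seconds.std() / seconds.mean()
cv_minutes = minutes.std() / minutes.mean()

print(round(seconds.std(), 2), round(minutes.std(), 2))   # std depends on the unit
print(round(cv_seconds, 3), round(cv_minutes, 3))         # 0.516 0.516
```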

In [None]:
# Exercise 3 Solutions

# Stats by product
rating_stats = ratings.groupby('product')['rating'].agg(['mean', 'median', 'std'])
rating_stats['mode'] = ratings.groupby('product')['rating'].agg(lambda x: x.mode()[0])
print(rating_stats)
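
The cell above computes the statistics but leaves the two questions from the exercise open. One possible way to answer them, sketched here on a tiny made-up frame rather than the `ratings` data: rank products by standard deviation for consistency, and use the share of 1-star and 5-star ratings as a rough polarization signal. The "extreme share" measure is an assumption of this sketch, not a standard statistic.

```python
import pandas as pd

# Made-up ratings: X clusters in the middle, Y splits between 1 and 5
demo_ratings = pd.DataFrame({
    'product': ['X'] * 6 + ['Y'] * 6,
    'rating': [3, 3, 4, 4, 3, 4, 1, 5, 1, 5, 5, 1],
})

# Lower std = more consistent ratings
spread = demo_ratings.groupby('product')['rating'].std()

# Fraction of ratings at either extreme (1 or 5), per product
is_extreme = demo_ratings['rating'].isin([1, 5])
extreme_share = is_extreme.groupby(demo_ratings['product']).mean()

print("Most consistent:", spread.idxmin())          # X
print("Most polarized:", extreme_share.idxmax())    # Y
```

A high standard deviation together with a high extreme share points to polarization rather than ratings that are simply spread out evenly.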

In [None]:
# Frequency table
pd.crosstab(ratings['product'], ratings['rating'], normalize='index').round(2)

In [None]:
# Exercise 4 Solution

def full_summary(series):
    """Return comprehensive descriptive statistics for a numeric series."""
    return {
        'mean': series.mean(),
        'median': series.median(),
        'mode': series.mode()[0] if len(series.mode()) > 0 else None,
        'std': series.std(),
        'variance': series.var(),
        'range': series.max() - series.min(),
        'iqr': series.quantile(0.75) - series.quantile(0.25),
        'skewness': series.skew()
    }

# Test it
full_summary(salaries['salary'])