# 📊 Descriptive Statistics: Summarizing and Visualizing Data

> *"In God we trust. All others must bring data."* - W. Edwards Deming

Welcome to **Descriptive Statistics**! Now that we understand the theoretical underpinnings of probability, let's get our hands dirty with real data. This notebook will teach you how to summarize, visualize, and interpret datasets to uncover their core characteristics.

## 🎯 What You'll Master

- **Measures of Central Tendency**: Calculating and interpreting the Mean, Median, and Mode.
- **Measures of Dispersion**: Quantifying data spread with Variance, Standard Deviation, and Range.
- **Data Visualization**: Creating insightful plots like histograms, box plots, and scatter plots.
- **Correlation and Covariance**: Measuring the relationship between variables.

## 📚 Import Essential Libraries

Let's start by importing the tools of the trade.

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import scipy.stats as stats

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Set up plotting style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("viridis")
%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 7)
plt.rcParams['font.size'] = 12

# Generate a synthetic dataset for our examples
np.random.seed(42)
data = pd.DataFrame({
    'A': np.random.normal(loc=100, scale=15, size=200),
    'B': np.random.poisson(lam=7, size=200) * 10,
    'C': np.random.uniform(low=50, high=150, size=200)
})
# Add some skewness and an outlier to 'A'
data['A'] = data['A'] + np.random.gamma(2, 2, 200)
data.loc[199, 'A'] = 200 # Add an outlier

print("📊 Libraries and synthetic data loaded successfully!")
data.head()

---

# 📍 Chapter 1: Measures of Central Tendency

Measures of central tendency give us a single value that describes the center or typical value of a dataset.

- **Mean**: The average of all data points. Sensitive to outliers.
- **Median**: The middle value when the data is sorted. Robust to outliers.
- **Mode**: The most frequently occurring value. Useful for categorical data.
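To make the outlier sensitivity concrete, here is a minimal sketch on a hypothetical toy series (the values are made up for illustration and are not part of the notebook's dataset). It computes all three measures before and after appending one extreme value.

```python
import pandas as pd

# Hypothetical toy values, chosen only so the arithmetic is easy to check
values = pd.Series([2, 3, 3, 4, 5])

# Same data with one extreme value appended
with_outlier = pd.concat([values, pd.Series([100])], ignore_index=True)

summary = pd.DataFrame(
    {
        'mean': [values.mean(), with_outlier.mean()],
        'median': [values.median(), with_outlier.median()],
        # .mode() returns every tied mode in sorted order; take the first
        'mode': [values.mode()[0], with_outlier.mode()[0]],
    },
    index=['original', 'with outlier'],
)
print(summary)
```

The mean jumps from 3.4 to 19.5, the median moves only from 3 to 3.5, and the mode stays at 3.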

In [None]:
def visualize_central_tendency(data_series, series_name):
    """
    Calculate and visualize mean, median, and mode for a data series.
    """
    # Calculate measures
    mean_val = data_series.mean()
    median_val = data_series.median()
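    # .mode() returns all tied modes in sorted order; take the first.
    # For continuous data every value may be unique, in which case this is just the minimum.
    mode_val = data_series.mode()[0]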
    
    # Create plot
    plt.figure(figsize=(12, 7))
    sns.histplot(data_series, kde=True, color='skyblue', bins=30)
    
    # Add lines for each measure
    plt.axvline(mean_val, color='red', linestyle='--', linewidth=2, label=f'Mean: {mean_val:.2f}')
    plt.axvline(median_val, color='green', linestyle='-', linewidth=2, label=f'Median: {median_val:.2f}')
    plt.axvline(mode_val, color='purple', linestyle='-.', linewidth=2, label=f'Mode: {mode_val:.2f}')
    
    plt.title(f'Central Tendency for Dataset "{series_name}"', fontsize=16, weight='bold')
    plt.xlabel('Value')
    plt.ylabel('Frequency')
    plt.legend()
    plt.grid(True, which='both', linestyle='--', linewidth=0.5)
    plt.show()
    
    print(f'--- Analysis for "{series_name}" ---')
    print(f'Mean: {mean_val:.2f}')
    print(f'Median: {median_val:.2f}')
    print(f'Mode: {mode_val:.2f}')
    
    # Interpretation based on skewness
    if abs(mean_val - median_val) < 0.1 * data_series.std():
        print("💡 Interpretation: The distribution is roughly symmetric.")
    elif mean_val > median_val:
        print("💡 Interpretation: The distribution is right-skewed (positively skewed).")
        print("   The mean is pulled higher by large values (or outliers).")
    else:
        print("💡 Interpretation: The distribution is left-skewed (negatively skewed).")
        print("   The mean is pulled lower by small values.")

# Visualize for our synthetic data 'A'
visualize_central_tendency(data['A'], 'A')

Notice how the outlier we added to dataset 'A' pulls the **mean** to the right, while the **median** barely moves and stays representative of the central bulk of the data. Statistics that resist extreme values in this way are called **robust**.

---

# 🌊 Chapter 2: Measures of Dispersion (Spread)

Measures of dispersion tell us how spread out or varied our data is.

- **Range**: The difference between the maximum and minimum values. Very sensitive to outliers.
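- **Variance (σ²)**: The average of the squared differences from the mean. Measures overall variability. Note that pandas' `.var()` and `.std()` compute the sample versions by default, dividing by n - 1 rather than n.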
- **Standard Deviation (σ)**: The square root of the variance. Expressed in the same units as the data, making it more interpretable.
- **Interquartile Range (IQR)**: The range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile). Robust to outliers.
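The definitions above can be checked by hand on a small array. This is a sketch on hypothetical toy values (not the notebook's dataset). It also shows the difference between the population variance (divide by n, `ddof=0`) and the sample variance (divide by n - 1, `ddof=1`), which is what pandas uses by default.

```python
import numpy as np

# Hypothetical toy sample: mean is 5 and squared deviations sum to 32
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

value_range = x.max() - x.min()

pop_var = x.var(ddof=0)     # population variance: 32 / 8
sample_var = x.var(ddof=1)  # sample variance: 32 / 7
pop_std = x.std(ddof=0)     # square root of the population variance

# Quartiles with NumPy's default linear interpolation
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

print(f"Range: {value_range:.2f}")
print(f"Population variance: {pop_var:.2f}, sample variance: {sample_var:.2f}")
print(f"Population std dev: {pop_std:.2f}")
print(f"IQR: {iqr:.2f}")
```

For large samples the two variances are nearly identical. The distinction matters mostly for small datasets.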

In [None]:
def analyze_dispersion(data_series, series_name):
    """
    Calculate and print measures of dispersion.
    """
    range_val = data_series.max() - data_series.min()
    variance_val = data_series.var()  # sample variance (pandas defaults to ddof=1)
    std_dev_val = data_series.std()   # sample standard deviation (ddof=1)
    q1 = data_series.quantile(0.25)
    q3 = data_series.quantile(0.75)
    iqr_val = q3 - q1
    
    print(f'--- Dispersion Analysis for "{series_name}" ---')
    print(f'Range: {range_val:.2f}')
    print(f'Variance (σ²): {variance_val:.2f}')
    print(f'Standard Deviation (σ): {std_dev_val:.2f}')
    print(f'Interquartile Range (IQR): {iqr_val:.2f}')
    
    return std_dev_val, iqr_val

std_A, _ = analyze_dispersion(data['A'], 'A')
print("\n")
std_C, _ = analyze_dispersion(data['C'], 'C')

# Visualize dispersion with box plots
plt.figure(figsize=(14, 8))
sns.boxplot(data=data, orient='h', palette='pastel')
plt.title('Visualizing Dispersion with Box Plots', fontsize=16, weight='bold')
plt.xlabel('Value')
plt.show()

print("💡 Box Plot Interpretation:")
print("  - The box represents the IQR (the middle 50% of the data).")
print("  - The line inside the box is the median.")
print("  - The 'whiskers' extend to the most extreme data points within 1.5 * IQR of the box edges.")
print("  - Points outside the whiskers are flagged as potential outliers (like the one in dataset 'A').")
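The 1.5 * IQR rule that box plots use to flag points can also be applied directly. The helper below is a minimal sketch: `iqr_outliers` is a hypothetical name introduced here for illustration, and the sample values are made up.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Hypothetical toy data with one obvious extreme value
sample = pd.Series([10, 12, 12, 13, 14, 15, 16, 40])
flagged = iqr_outliers(sample)
print(flagged)
```

Here only the value 40 falls outside the fences. The same function could be called on `data['A']` to list the points drawn beyond the whiskers.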

---

# 📈 Chapter 3: Correlation and Covariance

These measures describe the relationship between two variables.

- **Covariance**: Measures how two variables change together. Its sign shows the direction of the relationship, but its magnitude is hard to interpret because it depends on the units of both variables.
- **Correlation (r)**: A standardized version of covariance. It's a value between -1 and 1.
  - **+1**: Perfect positive linear relationship.
  - **-1**: Perfect negative linear relationship.
  - **0**: No linear relationship.

**Important**: Correlation does not imply causation!
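A quick way to see why correlation is easier to interpret than covariance is to change the units of one variable. The sketch below uses hypothetical synthetic heights and weights (all names and parameters are assumptions for illustration). It also recovers r from its definition, covariance divided by the product of the two standard deviations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: heights in metres and loosely related weights in kg
height_m = rng.normal(1.7, 0.1, 500)
weight_kg = 60 + 50 * (height_m - 1.7) + rng.normal(0, 5, 500)

# Same heights expressed in centimetres
height_cm = height_m * 100

cov_m = np.cov(height_m, weight_kg)[0, 1]
cov_cm = np.cov(height_cm, weight_kg)[0, 1]
r_m = np.corrcoef(height_m, weight_kg)[0, 1]
r_cm = np.corrcoef(height_cm, weight_kg)[0, 1]

# Correlation from its definition: cov(X, Y) / (s_X * s_Y)
r_manual = cov_m / (height_m.std(ddof=1) * weight_kg.std(ddof=1))

print(f"Covariance (m, kg):  {cov_m:.4f}")
print(f"Covariance (cm, kg): {cov_cm:.4f}")
print(f"Correlation (m, kg):  {r_m:.4f}")
print(f"Correlation (cm, kg): {r_cm:.4f}")
print(f"Correlation from definition: {r_manual:.4f}")
```

Switching from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged.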

In [None]:
# Let's create some correlated data to make this interesting
np.random.seed(123)
correlated_data = pd.DataFrame()
correlated_data['X'] = np.random.normal(0, 1, 100)
correlated_data['Y_positive'] = correlated_data['X'] + np.random.normal(0, 0.5, 100)
correlated_data['Y_negative'] = -correlated_data['X'] + np.random.normal(0, 0.5, 100)
correlated_data['Y_none'] = np.random.normal(0, 1, 100)

# Calculate the correlation matrix
correlation_matrix = correlated_data.corr()

fig, axes = plt.subplots(1, 3, figsize=(20, 6))

# Positive Correlation
sns.regplot(x='X', y='Y_positive', data=correlated_data, ax=axes[0], color='g', scatter_kws={'alpha':0.6})
corr_pos = correlation_matrix.loc['X', 'Y_positive']
axes[0].set_title(f'Positive Correlation (r = {corr_pos:.2f})', fontsize=14, weight='bold')

# Negative Correlation
sns.regplot(x='X', y='Y_negative', data=correlated_data, ax=axes[1], color='r', scatter_kws={'alpha':0.6})
corr_neg = correlation_matrix.loc['X', 'Y_negative']
axes[1].set_title(f'Negative Correlation (r = {corr_neg:.2f})', fontsize=14, weight='bold')

# No Correlation
sns.regplot(x='X', y='Y_none', data=correlated_data, ax=axes[2], color='b', scatter_kws={'alpha':0.6})
corr_none = correlation_matrix.loc['X', 'Y_none']
axes[2].set_title(f'No Linear Correlation (r = {corr_none:.2f})', fontsize=14, weight='bold')

plt.tight_layout()
plt.show()

# Visualize the correlation matrix with a heatmap
plt.figure(figsize=(8, 6))
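# Fix the colour scale to [-1, 1] so that zero maps to the neutral midpoint of the colormap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=.5, vmin=-1, vmax=1)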
plt.title('Correlation Matrix Heatmap', fontsize=16, weight='bold')
plt.show()

print("💡 Heatmap Interpretation:")
print("  - Warm colors (reds) indicate a strong positive correlation.")
print("  - Cool colors (blues) indicate a strong negative correlation.")
print("  - Pale, neutral colors (values near zero) indicate a weak or no linear correlation.")

---

# 🎯 Key Takeaways

## 📊 Summarizing Data
- **Central Tendency (Mean, Median, Mode)**: Tells you where the 'center' of your data is. The choice of measure depends on the data's distribution and the presence of outliers.
- **Dispersion (Variance, Std Dev, IQR)**: Tells you how spread out your data is. Standard deviation is expressed in the data's own units, which makes it easy to interpret, while IQR is robust to outliers.

## 📈 Visualizing Data
- **Histograms**: Great for understanding the distribution of a single variable.
- **Box Plots**: Excellent for comparing distributions and identifying outliers.
- **Scatter Plots**: The go-to for visualizing the relationship between two variables.

## 🔗 Measuring Relationships
- **Correlation**: A single number that quantifies the strength and direction of a *linear* relationship between two variables. Always visualize with a scatter plot to check for non-linear patterns!
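Here is a small sketch of why the scatter plot check matters. In this hypothetical example, y is completely determined by x, yet Pearson's r is essentially zero because the relationship is not linear.

```python
import numpy as np

# Symmetric grid around zero, so the positive and negative sides cancel out
x = np.linspace(-3, 3, 201)
y = x ** 2  # perfect dependence on x, but quadratic rather than linear

r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r between x and x^2: {r:.3f}")
```

A correlation near zero only rules out a linear trend. It does not mean the variables are unrelated.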

## 🧠 AI Connections
- **Feature Engineering**: Understanding distributions helps in transforming features (e.g., log transforms for skewed data).
- **Model Evaluation**: Descriptive statistics are used to analyze model errors and residuals.
- **Exploratory Data Analysis (EDA)**: Everything in this notebook is core EDA, which is typically the first step in a machine learning project.
- **Anomaly Detection**: Outliers identified through descriptive stats can be the very thing an anomaly detection model is trying to find.
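The log-transform idea from the Feature Engineering point can be sketched as follows. The data here is a hypothetical right-skewed feature drawn from a log-normal distribution. Taking the logarithm should bring its skewness close to zero, which we measure with `scipy.stats.skew`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical right-skewed feature (strictly positive, so np.log is safe)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

skew_before = stats.skew(skewed)
skew_after = stats.skew(np.log(skewed))

print(f"Skewness before log transform: {skew_before:.2f}")
print(f"Skewness after log transform:  {skew_after:.2f}")
```

For features that contain zeros, `np.log1p` is a common alternative, since `np.log(0)` is undefined.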

---

# 🚀 What's Next?

We've learned to describe a sample of data. Now, how can we use that sample to make conclusions about the entire population? That's the job of **Inferential Statistics**.

- **Hypothesis Testing**: Making decisions based on data.
- **Confidence Intervals**: Estimating population parameters.
- **p-values**: Quantifying the evidence against a null hypothesis.

**Ready to make inferences? Let's move on to the next chapter! 🔎**