# Topic 28: Seaborn for Statistical Visualizations

## Overview
Seaborn is a Python library built on Matplotlib that provides a high-level interface for drawing statistical graphics[1][3]. It handles common statistical steps such as aggregation, density estimation, and confidence intervals for you, and it ships with clean default styles and color palettes.

### What You'll Learn:
- Statistical plot types in Seaborn
- Distribution and relationship visualizations
- Categorical and regression plots
- Multi-plot grids and faceting
- Statistical estimation and confidence intervals
- Integration with Pandas DataFrames

---

## 1. Introduction to Seaborn

Understanding Seaborn's philosophy and basic usage:

In [None]:
# Introduction to Seaborn
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats

print("Introduction to Seaborn:")
print("=" * 24)

# Set Seaborn style
sns.set_theme(style="whitegrid")  # Modern way to set style
print("1. Seaborn vs Matplotlib comparison:")

# Create sample data
np.random.seed(42)
data = {
    # 100 consecutive rows per group, so labels line up with the value blocks below
    'group': np.repeat(['A', 'B', 'C'], 100),
    'value': np.concatenate([
        np.random.normal(100, 15, 100),
        np.random.normal(110, 12, 100), 
        np.random.normal(95, 18, 100)
    ]),
    'category': np.random.choice(['X', 'Y'], 300)
}
df = pd.DataFrame(data)

# Create comparison plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Matplotlib approach
for group in ['A', 'B', 'C']:
    group_data = df[df['group'] == group]['value']
    ax1.hist(group_data, alpha=0.6, label=f'Group {group}', bins=20)
ax1.set_title('Matplotlib: Multiple Histograms')
ax1.set_xlabel('Value')
ax1.set_ylabel('Frequency')
ax1.legend()

# Seaborn approach
sns.histplot(data=df, x='value', hue='group', alpha=0.6, ax=ax2)
ax2.set_title('Seaborn: Multiple Histograms')

plt.tight_layout()
plt.show()

print("   ✓ Seaborn provides cleaner syntax and better default styling")

# Built-in datasets
print("\n2. Seaborn built-in datasets:")
available_datasets = ['tips', 'flights', 'iris', 'titanic', 'car_crashes', 'mpg']

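for dataset in available_datasets:
    try:
        # load_dataset downloads from the online seaborn-data repository (cached after first use)
        dataset_df = sns.load_dataset(dataset)
        cols = ', '.join(dataset_df.columns[:4])
        more = '...' if len(dataset_df.columns) > 4 else ''
        print(f"   {dataset}: {dataset_df.shape} - {cols}{more}")
    except Exception:  # e.g. no network access
        print(f"   {dataset}: Not available")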

# Load sample datasets for examples
tips = sns.load_dataset('tips')
iris = sns.load_dataset('iris')
flights = sns.load_dataset('flights')

print(f"\n3. Sample data overview:")
print(f"   Tips dataset: {tips.shape}")
print(f"{tips.head()}")

# Seaborn's statistical capabilities
print(f"\n4. Seaborn's statistical focus:")
print(f"   ✓ Automatic statistical calculations")
print(f"   ✓ Built-in statistical plots")
print(f"   ✓ Confidence intervals and error bars")
print(f"   ✓ Regression fitting")
print(f"   ✓ Distribution analysis")
print(f"   ✓ Correlation visualization")

# Quick statistical visualization example
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('Seaborn Statistical Capabilities', fontsize=16, fontweight='bold')

# 1. Regression plot with confidence interval
sns.regplot(data=tips, x='total_bill', y='tip', ax=axes[0,0])
axes[0,0].set_title('Regression with Confidence Interval')

# 2. Box plot with statistical summaries
sns.boxplot(data=tips, x='day', y='total_bill', ax=axes[0,1])
axes[0,1].set_title('Box Plot with Quartiles')

# 3. Violin plot with kernel density
sns.violinplot(data=tips, x='day', y='total_bill', ax=axes[1,0])
axes[1,0].set_title('Violin Plot with Density')

# 4. Histogram with a KDE overlay
sns.histplot(data=tips, x='total_bill', kde=True, stat='density', ax=axes[1,1])
axes[1,1].set_title('Distribution with KDE')

plt.tight_layout()
plt.show()

print("   ✓ Demonstrated automatic statistical calculations and confidence intervals")

# Seaborn plot categories
print(f"\n5. Seaborn plot categories:")
plot_categories = {
    'Relational': ['scatterplot', 'lineplot', 'relplot'],
    # distplot is deprecated; displot is its figure-level replacement
    'Distribution': ['histplot', 'kdeplot', 'ecdfplot', 'rugplot', 'displot'],
    'Categorical': ['stripplot', 'swarmplot', 'boxplot', 'violinplot', 'barplot'],
    'Regression': ['regplot', 'lmplot', 'residplot'],
    'Matrix': ['heatmap', 'clustermap'],
    'Grid': ['FacetGrid', 'PairGrid', 'JointGrid']
}

for category, plots in plot_categories.items():
    print(f"   {category}: {', '.join(plots)}")

# Style and color palettes
print(f"\n6. Seaborn styling options:")

# Available styles
styles = ['darkgrid', 'whitegrid', 'dark', 'white', 'ticks']
print(f"   Styles: {', '.join(styles)}")

# Color palettes
palette_types = ['deep', 'muted', 'pastel', 'bright', 'dark', 'colorblind']
print(f"   Qualitative palettes: {', '.join(palette_types)}")

# Show palette examples
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
fig.suptitle('Seaborn Color Palettes', fontsize=16)

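# Compute the per-day means once; observed=True keeps only categories present in the data
avg_bill = tips.groupby('day', observed=True)['total_bill'].mean().reset_index()

for i, palette in enumerate(palette_types):
    row, col = i // 3, i % 3
    # Assign x to hue so the palette applies without a deprecation warning
    # (legend=False on barplot assumes seaborn >= 0.13)
    sns.barplot(data=avg_bill, x='day', y='total_bill', hue='day',
                palette=palette, legend=False, ax=axes[row, col])
    axes[row, col].set_title(f'Palette: {palette}')
    axes[row, col].set_ylabel('Average Bill')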

plt.tight_layout()
plt.show()

print("   ✓ Demonstrated various built-in color palettes")
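
The confidence intervals drawn in the plots above are computed by resampling. As a rough illustration of the idea, the sketch below builds a percentile bootstrap interval for a mean using only NumPy. The helper name `bootstrap_ci` and the synthetic sample are assumptions made for this example; this is not Seaborn's internal code, although 1000 resamples and a 95% level match Seaborn's documented defaults (`n_boot=1000`, `errorbar=("ci", 95)`).

```python
import numpy as np

def bootstrap_ci(values, n_boot=1000, level=95, seed=0):
    """Percentile bootstrap CI for the mean (hypothetical helper, not a Seaborn function)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Draw n_boot resamples with replacement, each the same size as the data
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    # Cut off equal tails on both sides of the bootstrap distribution
    tail = (100 - level) / 2
    low, high = np.percentile(boot_means, [tail, 100 - tail])
    return low, high

# Synthetic stand-in for a numeric column such as a bill amount
rng = np.random.default_rng(42)
sample = rng.normal(loc=20, scale=9, size=244)

low, high = bootstrap_ci(sample)
print(f"mean = {sample.mean():.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

A larger sample narrows the interval, which is why error bars on small groups look wider than those on large groups.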

## 2. Distribution Plots

Visualizing single and multiple distributions:

In [None]:
# Distribution plots in Seaborn
print("Distribution Plots in Seaborn:")
print("=" * 30)

# Load and prepare data
tips = sns.load_dataset('tips')
iris = sns.load_dataset('iris')

# Generate additional sample data for demonstrations
np.random.seed(42)
sample_data = {
    'normal': np.random.normal(100, 15, 1000),
    'skewed': np.random.exponential(2, 1000),
    'bimodal': np.concatenate([np.random.normal(80, 10, 500), 
                              np.random.normal(120, 10, 500)])
}

# Create comprehensive distribution analysis
fig, axes = plt.subplots(3, 3, figsize=(18, 15))
fig.suptitle('Seaborn Distribution Plots', fontsize=16, fontweight='bold')

print("1. Histogram plots:")

# Basic histogram
sns.histplot(data=tips, x='total_bill', ax=axes[0,0])
axes[0,0].set_title('Basic Histogram')
print("   ✓ Basic histogram with automatic binning")

# Histogram with KDE overlay
sns.histplot(data=tips, x='total_bill', kde=True, ax=axes[0,1])
axes[0,1].set_title('Histogram with KDE')
print("   ✓ Added kernel density estimation overlay")

# Multiple distributions
sns.histplot(data=tips, x='total_bill', hue='time', alpha=0.6, ax=axes[0,2])
axes[0,2].set_title('Multiple Distributions')
print("   ✓ Compared multiple groups with different colors")

print("\n2. KDE (Kernel Density Estimation) plots:")

# Basic KDE
sns.kdeplot(data=tips, x='total_bill', ax=axes[1,0])
axes[1,0].set_title('Basic KDE Plot')
print("   ✓ Smooth density curve")

# Multiple KDE plots
sns.kdeplot(data=tips, x='total_bill', hue='time', ax=axes[1,1])
axes[1,1].set_title('Multiple KDE Plots')
print("   ✓ Comparing density curves for different groups")

# 2D KDE (bivariate)
sns.kdeplot(data=tips, x='total_bill', y='tip', ax=axes[1,2])
axes[1,2].set_title('2D KDE (Bivariate)')
print("   ✓ Two-dimensional density visualization")

print("\n3. Advanced distribution plots:")

# ECDF (Empirical Cumulative Distribution Function)
sns.ecdfplot(data=tips, x='total_bill', hue='time', ax=axes[2,0])
axes[2,0].set_title('ECDF Plot')
print("   ✓ Empirical cumulative distribution function")

# Rug plot combined with histogram
sns.histplot(data=tips, x='total_bill', ax=axes[2,1])
sns.rugplot(data=tips, x='total_bill', ax=axes[2,1], color='red', alpha=0.6)
axes[2,1].set_title('Histogram with Rug Plot')
print("   ✓ Rug plot shows individual data points")

# Distribution comparison
for dist_name, dist_data in sample_data.items():
    sns.kdeplot(dist_data, label=dist_name, ax=axes[2,2])
axes[2,2].set_title('Different Distribution Types')
axes[2,2].legend()
print("   ✓ Compared normal, skewed, and bimodal distributions")

plt.tight_layout()
plt.show()

# Joint distribution plots
print("\n4. Joint distribution plots:")

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle('Joint Distribution Analysis', fontsize=16)

# Joint plot with different kinds
joint_kinds = ['scatter', 'kde', 'hist', 'reg']

for i, kind in enumerate(joint_kinds):
    row, col = i // 2, i % 2
    
    # jointplot/JointGrid are figure-level and cannot draw into an existing subplot,
    # so each panel uses the matching axes-level function instead
    if kind == 'scatter':
        sns.scatterplot(data=tips, x='total_bill', y='tip', ax=axes[row, col])
        axes[row, col].set_title(f'Joint {kind.title()} Plot')
    elif kind == 'kde':
        sns.kdeplot(data=tips, x='total_bill', y='tip', ax=axes[row, col])
        axes[row, col].set_title(f'Joint {kind.upper()} Plot')
    elif kind == 'hist':
        # Bivariate histogram: passing both x and y to histplot bins the data in 2D
        sns.histplot(data=tips, x='total_bill', y='tip', bins=20, ax=axes[row, col])
        axes[row, col].set_title(f'Joint {kind.title()} Plot')
    elif kind == 'reg':
        sns.regplot(data=tips, x='total_bill', y='tip', ax=axes[row, col])
        axes[row, col].set_title(f'Joint {kind.title()} Plot')

plt.tight_layout()
plt.show()

print("   ✓ Demonstrated different ways to visualize joint distributions")

# Statistical distribution fitting
print("\n5. Statistical distribution fitting:")

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Normal distribution fitting
data_normal = tips['total_bill']

# Plot histogram
sns.histplot(data_normal, kde=False, stat='density', ax=axes[0])

# Fit and plot normal distribution
mu, sigma = stats.norm.fit(data_normal)
x = np.linspace(data_normal.min(), data_normal.max(), 100)
normal_fit = stats.norm.pdf(x, mu, sigma)
axes[0].plot(x, normal_fit, 'r-', linewidth=2, label=f'Normal fit (μ={mu:.1f}, σ={sigma:.1f})')
axes[0].set_title('Normal Distribution Fit')
axes[0].legend()

print(f"   Normal fit: μ={mu:.2f}, σ={sigma:.2f}")

# Exponential distribution fitting (using tip amounts)
tip_data = tips['tip']
sns.histplot(tip_data, kde=False, stat='density', ax=axes[1])

# Fit exponential distribution
lam = 1 / np.mean(tip_data)  # MLE of the rate λ, assuming the distribution starts at 0 (loc=0)
x_exp = np.linspace(0, tip_data.max(), 100)
exp_fit = stats.expon.pdf(x_exp, scale=1/lam)
axes[1].plot(x_exp, exp_fit, 'g-', linewidth=2, label=f'Exponential fit (λ={lam:.2f})')
axes[1].set_title('Exponential Distribution Fit')
axes[1].legend()

print(f"   Exponential fit: λ={lam:.3f}")

# Q-Q plot as a visual normality check (points close to the line suggest normality)
stats.probplot(tips['total_bill'], dist="norm", plot=axes[2])
axes[2].set_title('Q-Q Plot (Normality Check)')

plt.tight_layout()
plt.show()

print("   ✓ Fitted statistical distributions and checked normality with a Q-Q plot")

# Distribution summary statistics
print("\n6. Distribution summary with visualizations:")

def analyze_distribution(data, name):
    """Comprehensive distribution analysis"""
    print(f"   {name} Analysis:")
    print(f"     Mean: {np.mean(data):.2f}")
    print(f"     Median: {np.median(data):.2f}")
    print(f"     Std Dev: {np.std(data):.2f}")
    print(f"     Skewness: {stats.skew(data):.2f}")
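    print(f"     Excess kurtosis: {stats.kurtosis(data):.2f}")  # Fisher definition: 0 for a normal distribution
    
    # Shapiro-Wilk normality test on the first 5000 values (SciPy warns the p-value may be inaccurate for N > 5000)
    _, p_value = stats.shapiro(data[:5000])
    print(f"     Shapiro-Wilk p-value: {p_value:.4f}")
    print(f"     Normality not rejected (p>0.05): {p_value > 0.05}")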
    print()

analyze_distribution(tips['total_bill'], 'Total Bill')
analyze_distribution(tips['tip'], 'Tip Amount')

print("7. Distribution plot best practices:")
print("   ✓ Use histplot for count distributions")
print("   ✓ Use kdeplot for smooth density curves")
print("   ✓ Use ecdfplot for cumulative distributions")
print("   ✓ Combine multiple plots for comprehensive analysis")
print("   ✓ Test assumptions with statistical tests")
print("   ✓ Choose appropriate bin sizes for histograms")
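
To make the KDE and ECDF recommendations above more concrete, the following sketch computes both quantities directly with NumPy and SciPy on synthetic data. It is an illustration of the underlying calculations under simple assumptions (Gaussian kernel, Scott's rule bandwidth, which is SciPy's default), not a reproduction of Seaborn's implementation; the variable names are chosen for this example.

```python
import numpy as np
from scipy import stats

# Synthetic sample (illustrative only)
rng = np.random.default_rng(42)
values = rng.normal(loc=100, scale=15, size=500)

# --- KDE: a smooth density estimate ---
kde = stats.gaussian_kde(values)  # bandwidth chosen by Scott's rule by default
grid = np.linspace(values.min() - 60, values.max() + 60, 2000)
density = kde(grid)
# A density should integrate to about 1; check with a simple Riemann sum
area = density.sum() * (grid[1] - grid[0])
print(f"Area under KDE curve: {area:.3f}")

# --- ECDF: fraction of observations at or below each value ---
x_sorted = np.sort(values)
ecdf = np.arange(1, len(x_sorted) + 1) / len(x_sorted)

# Evaluating the ECDF at a threshold is a count of values <= threshold
threshold = 100
ecdf_at_threshold = np.searchsorted(x_sorted, threshold, side="right") / len(x_sorted)
print(f"Share of values <= {threshold}: {ecdf_at_threshold:.3f}")
```

The ECDF needs no tuning parameter, whereas the KDE curve changes with the bandwidth; in `kdeplot` the `bw_adjust` argument scales that bandwidth.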

## Summary

In this notebook, you learned about:

✅ **Seaborn Fundamentals**: High-level statistical visualization library[1][3]  
✅ **Distribution Analysis**: Histograms, KDE plots, and ECDF visualizations  
✅ **Statistical Integration**: Automatic confidence intervals and regression fitting[2]  
✅ **Built-in Datasets**: Ready-to-use datasets for learning and testing  
✅ **Elegant Styling**: Beautiful default themes and color palettes[3]  
✅ **Pandas Integration**: Seamless DataFrame compatibility[8]  

### Key Takeaways:
1. Seaborn simplifies complex statistical visualizations with clean syntax[1][3]
2. Built on Matplotlib but provides higher-level statistical functions
3. Excellent for exploratory data analysis and statistical graphics[3]
4. Automatic statistical calculations and confidence intervals[2]
5. Complements Pandas and NumPy workflows[8]
6. Ideal for publication-quality statistical plots

### Next Topics:
Continue with **29_sklearn_basics.ipynb** for machine learning fundamentals, or **30_web_scraping.ipynb** for data collection techniques.