# Types of Statistics

Statistics is broadly divided into two main branches: **Descriptive Statistics** and **Inferential Statistics**. Understanding these two types is fundamental to data analysis, machine learning, and data science. Descriptive statistics helps us summarize and describe data, while inferential statistics allows us to make predictions and draw conclusions about populations based on sample data.

---

## Table of Contents

1. [Overview of the Two Main Branches](#overview)
2. [Descriptive Statistics](#descriptive)
3. [Inferential Statistics](#inferential)
4. [Key Differences](#differences)
5. [When to Use Each Type](#when-to-use)
6. [Practical Examples](#practical)
7. [Summary](#summary)

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import norm, t
import statsmodels.api as sm
from statsmodels.stats import weightstats as stests

# Set style for better visualizations
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

# Set random seed for reproducibility
np.random.seed(42)

---

## 1. Overview of the Two Main Branches <a id='overview'></a>

Statistics can be categorized into two fundamental types:

| Type | Purpose | Key Question Answered |
|------|---------|----------------------|
| **Descriptive Statistics** | Summarize and describe data | "What does the data show?" |
| **Inferential Statistics** | Make predictions and inferences | "What can we conclude about the population?" |

### Visual Representation of Statistics Hierarchy

In [None]:
# Create a visual representation of statistics branches
fig, ax = plt.subplots(1, 1, figsize=(14, 8))
ax.axis('off')

# Main title
ax.text(0.5, 0.95, 'STATISTICS', ha='center', va='center', 
        fontsize=24, fontweight='bold', 
        bbox=dict(boxstyle='round', facecolor='lightblue', edgecolor='black', linewidth=2))

# Two main branches
ax.text(0.25, 0.80, 'DESCRIPTIVE\nSTATISTICS', ha='center', va='center', 
        fontsize=16, fontweight='bold',
        bbox=dict(boxstyle='round', facecolor='lightgreen', edgecolor='black', linewidth=2))

ax.text(0.75, 0.80, 'INFERENTIAL\nSTATISTICS', ha='center', va='center', 
        fontsize=16, fontweight='bold',
        bbox=dict(boxstyle='round', facecolor='lightyellow', edgecolor='black', linewidth=2))

# Descriptive Statistics subcategories
desc_categories = ['Central\nTendency', 'Dispersion', 'Position']
for i, cat in enumerate(desc_categories):
    ax.text(0.08 + i*0.17, 0.60, cat, ha='center', va='center', 
            fontsize=11, bbox=dict(boxstyle='round', facecolor='#90EE90', alpha=0.7))

# Inferential Statistics subcategories
infer_categories = ['Hypothesis\nTesting', 'Confidence\nIntervals', 'Regression\nAnalysis']
for i, cat in enumerate(infer_categories):
    ax.text(0.60 + i*0.15, 0.60, cat, ha='center', va='center', 
            fontsize=11, bbox=dict(boxstyle='round', facecolor='#FFFFE0', alpha=0.7))

# Descriptive examples
desc_examples = ['Mean\nMedian\nMode', 'Std Dev\nVariance\nRange', 'Percentiles\nQuartiles\nIQR']
for i, ex in enumerate(desc_examples):
    ax.text(0.08 + i*0.17, 0.40, ex, ha='center', va='center', 
            fontsize=9, style='italic', bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

# Inferential examples
infer_examples = ['t-test\nANOVA\nChi-square', 'Margin of\nError\nCI Range', 'Linear\nLogistic\nMultiple']
for i, ex in enumerate(infer_examples):
    ax.text(0.60 + i*0.15, 0.40, ex, ha='center', va='center', 
            fontsize=9, style='italic', bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

# Draw connecting lines
ax.plot([0.5, 0.25], [0.92, 0.85], 'k-', linewidth=2)
ax.plot([0.5, 0.75], [0.92, 0.85], 'k-', linewidth=2)

plt.title('Hierarchy of Statistical Methods', fontsize=18, pad=20)
plt.tight_layout()
plt.show()

print("Statistics is divided into two main branches, each with specific subcategories and techniques.")

---

## 2. Descriptive Statistics <a id='descriptive'></a>

**Descriptive Statistics** involves methods for summarizing and organizing data to make it understandable and interpretable. It describes the main features of a dataset without making conclusions beyond the data at hand.

### Key Characteristics:
- Summarizes data using numbers and graphs
- Describes what the data shows
- No predictions or inferences about the population
- Works with the entire dataset or sample

### 2.1 Measures of Central Tendency

Central tendency describes the center or typical value of a dataset.

| Measure | Definition | When to Use |
|---------|-----------|-------------|
| **Mean** | Average of all values | Symmetric distributions, no outliers |
| **Median** | Middle value when sorted | Skewed distributions, with outliers |
| **Mode** | Most frequent value | Categorical data, finding most common |
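
The "When to Use" column can be illustrated with a small sketch. The scores below are hypothetical values invented for this illustration (they are not the notebook's dataset); the point is only that one extreme value pulls the mean strongly while the median barely moves.

```python
import numpy as np

# Hypothetical scores, invented for illustration only
scores = np.array([70, 72, 74, 76, 78])
scores_with_outlier = np.append(scores, 200)  # add one extreme value

mean_before, median_before = np.mean(scores), np.median(scores)
mean_after, median_after = np.mean(scores_with_outlier), np.median(scores_with_outlier)

print(f"Without outlier: mean = {mean_before:.2f}, median = {median_before:.2f}")
print(f"With outlier:    mean = {mean_after:.2f}, median = {median_after:.2f}")
```

This is why the median is usually the safer summary for skewed data or data with outliers.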


In [None]:
# Example: Calculate measures of central tendency
# Creating a sample dataset of student exam scores
exam_scores = np.array([85, 92, 78, 90, 88, 76, 95, 89, 84, 91, 87, 83, 90, 86, 94])

# Calculate mean
mean_score = np.mean(exam_scores)
print(f"Mean (Average) Score: {mean_score:.2f}")

# Calculate median
median_score = np.median(exam_scores)
print(f"Median (Middle) Score: {median_score:.2f}")

# Calculate mode using scipy
mode_result = stats.mode(exam_scores, keepdims=True)
mode_score = mode_result.mode[0]
print(f"Mode (Most Frequent) Score: {mode_score}")

print(f"\nInterpretation:")
print(f"- On average, students scored {mean_score:.2f}")
print(f"- About half of the students scored at or above {median_score:.2f}")
print(f"- The most common score was {mode_score}")

In [None]:
# Visualize central tendency measures
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Histogram with central tendency lines
axes[0].hist(exam_scores, bins=10, alpha=0.7, color='skyblue', edgecolor='black')
axes[0].axvline(mean_score, color='red', linestyle='--', linewidth=2, label=f'Mean: {mean_score:.2f}')
axes[0].axvline(median_score, color='green', linestyle='--', linewidth=2, label=f'Median: {median_score:.2f}')
axes[0].axvline(mode_score, color='orange', linestyle='--', linewidth=2, label=f'Mode: {mode_score}')
axes[0].set_xlabel('Exam Scores')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution with Central Tendency Measures')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot 2: Box plot showing central tendency
axes[1].boxplot(exam_scores, vert=True)
axes[1].scatter([1], [mean_score], color='red', s=100, zorder=3, label=f'Mean: {mean_score:.2f}')
axes[1].set_ylabel('Exam Scores')
axes[1].set_title('Box Plot with Mean')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### 2.2 Measures of Dispersion (Variability)

Dispersion measures describe how spread out the data is from the center.

| Measure | Definition | Formula |
|---------|-----------|----------|
| **Range** | Difference between max and min | max - min |
| **Variance** | Average squared deviation from mean | Population: Σ(x - μ)² / N; sample: Σ(x - x̄)² / (n - 1) |
| **Standard Deviation** | Square root of variance | √variance |
| **IQR** | Range of middle 50% of data | Q3 - Q1 |
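
As a worked check of the variance and standard deviation formulas, the sketch below computes them by hand and compares the results with NumPy. The data values are hypothetical and chosen only so the arithmetic is easy to follow; `ddof=0` divides by n (population form) and `ddof=1` divides by n - 1 (sample form).

```python
import numpy as np

# Hypothetical values chosen so the arithmetic is easy to verify by hand
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = data.size

# Squared deviations from the mean (the mean here is 5.0)
squared_dev = (data - data.mean()) ** 2

pop_var = squared_dev.sum() / n            # population form: divide by n
sample_var = squared_dev.sum() / (n - 1)   # sample form: divide by n - 1
pop_std = np.sqrt(pop_var)

print(f"Population variance: {pop_var:.3f} (NumPy: {np.var(data, ddof=0):.3f})")
print(f"Sample variance:     {sample_var:.3f} (NumPy: {np.var(data, ddof=1):.3f})")
print(f"Population std dev:  {pop_std:.3f}")
```

The sample form is slightly larger because dividing by n - 1 compensates for estimating the mean from the same data.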

In [None]:
# Example: Calculate measures of dispersion
# Using the same exam scores dataset

# Calculate range
data_range = np.max(exam_scores) - np.min(exam_scores)
print(f"Range: {data_range}")
print(f"  (Maximum: {np.max(exam_scores)}, Minimum: {np.min(exam_scores)})")

# Calculate variance
variance = np.var(exam_scores, ddof=0)  # Population variance
sample_variance = np.var(exam_scores, ddof=1)  # Sample variance
print(f"\nVariance (Population): {variance:.2f}")
print(f"Variance (Sample): {sample_variance:.2f}")

# Calculate standard deviation
std_dev = np.std(exam_scores, ddof=1)  # Sample standard deviation
print(f"\nStandard Deviation: {std_dev:.2f}")
print(f"  (Scores typically fall about {std_dev:.2f} points from the mean)")

# Calculate IQR (Interquartile Range)
q1 = np.percentile(exam_scores, 25)
q3 = np.percentile(exam_scores, 75)
iqr = q3 - q1
print(f"\nInterquartile Range (IQR): {iqr:.2f}")
print(f"  (Q1: {q1}, Q3: {q3})")

In [None]:
# Compare datasets with similar means but different dispersion
# Dataset 1: Low variability
low_var = np.array([88, 89, 87, 90, 88, 89, 87, 90, 89, 88])

# Dataset 2: High variability
high_var = np.array([75, 95, 82, 98, 70, 92, 85, 99, 78, 95])

print("Comparison of Two Datasets:")
print(f"\nLow Variability Dataset:")
print(f"  Mean: {np.mean(low_var):.2f}")
print(f"  Std Dev: {np.std(low_var, ddof=1):.2f}")

print(f"\nHigh Variability Dataset:")
print(f"  Mean: {np.mean(high_var):.2f}")
print(f"  Std Dev: {np.std(high_var, ddof=1):.2f}")

# Visualize the difference
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Low variability
axes[0].hist(low_var, bins=8, alpha=0.7, color='lightblue', edgecolor='black')
axes[0].axvline(np.mean(low_var), color='red', linestyle='--', linewidth=2, label='Mean')
axes[0].set_xlabel('Scores')
axes[0].set_ylabel('Frequency')
axes[0].set_title(f'Low Variability (s = {np.std(low_var, ddof=1):.2f})')
axes[0].legend()
axes[0].set_xlim(65, 105)

# Plot 2: High variability
axes[1].hist(high_var, bins=8, alpha=0.7, color='lightcoral', edgecolor='black')
axes[1].axvline(np.mean(high_var), color='red', linestyle='--', linewidth=2, label='Mean')
axes[1].set_xlabel('Scores')
axes[1].set_ylabel('Frequency')
axes[1].set_title(f'High Variability (s = {np.std(high_var, ddof=1):.2f})')
axes[1].legend()
axes[1].set_xlim(65, 105)

plt.tight_layout()
plt.show()

print("\nNotice: Both datasets have similar means but very different spreads!")

### 2.3 Measures of Position

Position measures tell us where a specific data point stands relative to the rest of the dataset.

| Measure | Definition | Use Case |
|---------|-----------|----------|
| **Percentile** | Value below which a percentage falls | Understanding rank/position |
| **Quartiles** | Divide data into four equal parts | Q1 (25%), Q2 (50%), Q3 (75%) |
| **Z-Score** | Number of std devs from mean | Identifying outliers |
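
To make the idea of position concrete, here is a minimal sketch that locates a single score within a group. The scores and the `new_score` value are hypothetical. It uses `scipy.stats.percentileofscore` with `kind='weak'`, which reports the percentage of values less than or equal to the given score, and a z-score based on the sample standard deviation.

```python
import numpy as np
from scipy import stats

# Hypothetical group of scores and one score of interest
scores = np.array([55, 60, 65, 70, 75, 80, 85, 90, 95, 100])
new_score = 90

# Percentile rank: share of scores less than or equal to new_score
pct_rank = stats.percentileofscore(scores, new_score, kind='weak')

# Z-score: distance from the mean in units of the sample standard deviation
z = (new_score - scores.mean()) / scores.std(ddof=1)

print(f"Percentile rank of {new_score}: {pct_rank:.1f}")
print(f"Z-score of {new_score}: {z:.2f}")
print(f"Flagged as potential outlier (|z| > 2): {abs(z) > 2}")
```

The |z| > 2 cutoff is a common rule of thumb rather than a fixed standard; some analyses use 3 instead.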

In [None]:
# Example: Calculate measures of position
# Using exam scores dataset

# Calculate percentiles
p25 = np.percentile(exam_scores, 25)
p50 = np.percentile(exam_scores, 50)  # Same as median
p75 = np.percentile(exam_scores, 75)
p90 = np.percentile(exam_scores, 90)

print("Percentile Analysis:")
print(f"25th Percentile: {p25:.2f} (25% scored below this)")
print(f"50th Percentile: {p50:.2f} (Median)")
print(f"75th Percentile: {p75:.2f} (75% scored below this)")
print(f"90th Percentile: {p90:.2f} (90% scored below this)")

# Calculate z-scores (standardized scores)
z_scores = (exam_scores - np.mean(exam_scores)) / np.std(exam_scores, ddof=1)

print(f"\nZ-Score Analysis:")
print(f"Z-scores for first 5 students: {z_scores[:5]}")
print(f"\nInterpretation of first student's z-score ({z_scores[0]:.2f}):")
print(f"  This score is {abs(z_scores[0]):.2f} standard deviations {'above' if z_scores[0] > 0 else 'below'} the mean")

In [None]:
# Visualize measures of position
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Box plot with quartiles
bp = axes[0].boxplot(exam_scores, vert=True, patch_artist=True)
bp['boxes'][0].set_facecolor('lightblue')
axes[0].set_ylabel('Exam Scores')
axes[0].set_title('Box Plot Showing Quartiles')
axes[0].grid(True, alpha=0.3)

# Add labels for quartiles
axes[0].text(1.2, p25, f'Q1: {p25:.1f}', fontsize=10, color='blue')
axes[0].text(1.2, p50, f'Q2: {p50:.1f}', fontsize=10, color='green')
axes[0].text(1.2, p75, f'Q3: {p75:.1f}', fontsize=10, color='blue')

# Plot 2: Z-score distribution
axes[1].scatter(range(len(exam_scores)), z_scores, alpha=0.6, s=100)
axes[1].axhline(y=0, color='red', linestyle='--', label='Mean (z=0)')
axes[1].axhline(y=2, color='orange', linestyle=':', label='±2 std dev')
axes[1].axhline(y=-2, color='orange', linestyle=':')
axes[1].set_xlabel('Student Index')
axes[1].set_ylabel('Z-Score')
axes[1].set_title('Z-Scores for Each Student')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### 2.4 Complete Descriptive Statistics Summary with Pandas

In [None]:
# Create a pandas DataFrame for easier analysis
df = pd.DataFrame({
    'Student_ID': range(1, len(exam_scores) + 1),
    'Score': exam_scores
})

# Get comprehensive descriptive statistics
print("Complete Descriptive Statistics Summary:")
print("="*50)
print(df['Score'].describe())

# Additional statistics
print(f"\nAdditional Measures:")
print(f"Range: {df['Score'].max() - df['Score'].min()}")
print(f"Variance: {df['Score'].var():.2f}")
print(f"Skewness: {df['Score'].skew():.2f}")
print(f"Kurtosis: {df['Score'].kurtosis():.2f}")

---

## 3. Inferential Statistics <a id='inferential'></a>

**Inferential Statistics** uses sample data to make inferences, predictions, or generalizations about a larger population. It allows us to draw conclusions beyond the immediate data.

### Key Characteristics:
- Makes predictions about populations from samples
- Uses probability theory
- Involves uncertainty (confidence levels, p-values)
- Tests hypotheses and relationships
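
The uncertainty mentioned above comes from sampling variability: different samples from the same population give different sample means. The simulation below is a sketch under assumed values (a normal population with mean 75 and standard deviation 10, samples of size 25); it compares the observed spread of sample means with the theoretical standard error σ / √n.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed population parameters and sampling setup (illustrative only)
pop_mean, pop_sd = 75, 10
n, n_samples = 25, 5000

# Draw many independent samples and record each sample mean
samples = rng.normal(pop_mean, pop_sd, size=(n_samples, n))
sample_means = samples.mean(axis=1)

empirical_se = sample_means.std(ddof=1)   # spread of the sample means
theoretical_se = pop_sd / np.sqrt(n)      # sigma / sqrt(n)

print(f"Average of sample means: {sample_means.mean():.2f}")
print(f"Empirical standard error: {empirical_se:.3f}")
print(f"Theoretical standard error: {theoretical_se:.3f}")
```

Larger samples shrink the standard error, which is why inference from bigger samples is more precise.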

### 3.1 Hypothesis Testing

Hypothesis testing is used to determine if there is enough evidence to reject a null hypothesis.

**Key Concepts:**
- **Null Hypothesis (H₀)**: No effect or no difference
- **Alternative Hypothesis (H₁)**: There is an effect or difference
- **p-value**: Probability of observing data at least as extreme as the data actually observed, assuming H₀ is true
- **Significance Level (α)**: Threshold for rejecting H₀ (commonly 0.05)
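
To show how these concepts fit together, the sketch below computes a two-sided p-value for a one-sample t-test by hand and compares it with `scipy.stats.ttest_1samp`. The sample values and the hypothesized mean `mu0` are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical sample and hypothesized population mean
sample = np.array([85, 92, 78, 90, 88, 76, 95, 89])
mu0 = 85
alpha = 0.05

n = sample.size
standard_error = sample.std(ddof=1) / np.sqrt(n)

# t-statistic: how many standard errors the sample mean is from mu0
t_manual = (sample.mean() - mu0) / standard_error

# Two-sided p-value: probability of a |t| at least this large under H0
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(sample, mu0)

print(f"Manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"SciPy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
print("Reject H0" if p_manual < alpha else "Fail to reject H0")
```

Failing to reject H₀ does not show that H₀ is true; it only means the sample does not provide enough evidence against it.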

**Common Tests:**

| Test | Purpose | When to Use |
|------|---------|-------------|
| **t-test** | Compare means | Two groups, continuous data |
| **ANOVA** | Compare means of 3+ groups | Multiple groups |
| **Chi-square** | Test independence | Categorical data |
| **Z-test** | Compare means or proportions | Large samples or known population variance |
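
The chi-square test in the table can be sketched with `scipy.stats.chi2_contingency`. The contingency table below contains hypothetical counts (customer type by purchase outcome) invented for this example, and `correction=False` turns off the Yates continuity correction so the statistic matches the textbook formula Σ(O - E)² / E.

```python
import numpy as np
from scipy import stats

# Hypothetical counts: rows = new / returning customers,
# columns = purchased / did not purchase
observed = np.array([[30, 70],
                     [45, 55]])

chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)

print(f"Chi-square statistic: {chi2:.2f}")
print(f"Degrees of freedom: {dof}")
print(f"p-value: {p:.4f}")
print("Expected counts under independence:")
print(expected)
```

A small p-value here suggests that purchase outcome and customer type are not independent in the population these counts were drawn from.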

In [None]:
# Example: One-sample t-test
# Question: Is the average exam score significantly different from 85?

# Hypotheses:
# H0: μ = 85 (population mean is 85)
# H1: μ ≠ 85 (population mean is not 85)

population_mean = 85
alpha = 0.05  # Significance level

# Perform one-sample t-test
t_statistic, p_value = stats.ttest_1samp(exam_scores, population_mean)

print("One-Sample t-test Results:")
print("="*50)
print(f"Null Hypothesis (H0): Population mean = {population_mean}")
print(f"Alternative Hypothesis (H1): Population mean ≠ {population_mean}")
print(f"\nSample mean: {np.mean(exam_scores):.2f}")
print(f"t-statistic: {t_statistic:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Significance level (α): {alpha}")

# Decision
if p_value < alpha:
    print(f"\nDecision: Reject H0 (p-value {p_value:.4f} < {alpha})")
    print(f"Conclusion: The average score is significantly different from {population_mean}")
else:
    print(f"\nDecision: Fail to reject H0 (p-value {p_value:.4f} >= {alpha})")
    print(f"Conclusion: No significant difference from {population_mean}")

In [None]:
# Example: Two-sample t-test (Independent samples)
# Question: Do students in Class A perform differently than Class B?

# Create two sample groups
class_a = np.array([85, 92, 78, 90, 88, 76, 95, 89])
class_b = np.array([82, 79, 85, 81, 88, 84, 80, 86])

# Hypotheses:
# H0: μA = μB (no difference in means)
# H1: μA ≠ μB (difference exists)

# Perform independent two-sample t-test
t_stat, p_val = stats.ttest_ind(class_a, class_b)

print("Independent Two-Sample t-test:")
print("="*50)
print(f"Class A mean: {np.mean(class_a):.2f}")
print(f"Class B mean: {np.mean(class_b):.2f}")
print(f"\nt-statistic: {t_stat:.4f}")
print(f"p-value: {p_val:.4f}")

if p_val < 0.05:
    print(f"\nConclusion: Significant difference between classes (p < 0.05)")
else:
    print(f"\nConclusion: No significant difference between classes (p >= 0.05)")

In [None]:
# Visualize the two-sample t-test
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Box plots comparing two classes
data_to_plot = [class_a, class_b]
bp = axes[0].boxplot(data_to_plot, patch_artist=True)
axes[0].set_xticklabels(['Class A', 'Class B'])  # avoids boxplot's `labels` argument, deprecated in Matplotlib 3.9
bp['boxes'][0].set_facecolor('lightblue')
bp['boxes'][1].set_facecolor('lightcoral')
axes[0].set_ylabel('Exam Scores')
axes[0].set_title('Comparison of Two Classes')
axes[0].grid(True, alpha=0.3)

# Add mean markers
axes[0].scatter([1, 2], [np.mean(class_a), np.mean(class_b)], 
                color='red', s=100, zorder=3, marker='D', label='Mean')
axes[0].legend()

# Plot 2: Distribution overlay
axes[1].hist(class_a, bins=6, alpha=0.5, label='Class A', color='blue', edgecolor='black')
axes[1].hist(class_b, bins=6, alpha=0.5, label='Class B', color='red', edgecolor='black')
axes[1].axvline(np.mean(class_a), color='blue', linestyle='--', linewidth=2)
axes[1].axvline(np.mean(class_b), color='red', linestyle='--', linewidth=2)
axes[1].set_xlabel('Exam Scores')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution Comparison')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### 3.2 Confidence Intervals

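A **confidence interval** provides a range of plausible values for the true population parameter. The confidence level describes the method: if we repeated the sampling many times, about 95% of the 95% intervals built this way would contain the true value.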

**Formula for mean:** CI = x̄ ± (t * SE)
- x̄ = sample mean
- t = t-value for confidence level
- SE = standard error = s / √n
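
One way to see what the confidence level means is to apply the formula above to many simulated samples and count how often the interval contains the true mean. This is a sketch under assumed values (a normal population with mean 75 and standard deviation 10, samples of size 15); with a 95% level, the observed coverage should land near 0.95.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Assumed population and simulation settings (illustrative only)
true_mean, true_sd = 75, 10
n, n_trials = 15, 2000

samples = rng.normal(true_mean, true_sd, size=(n_trials, n))
means = samples.mean(axis=1)
se = samples.std(axis=1, ddof=1) / np.sqrt(n)   # SE = s / sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)           # t-value for a 95% interval

lower = means - t_crit * se
upper = means + t_crit * se

# Fraction of intervals that contain the true mean
coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print(f"Observed coverage over {n_trials} trials: {coverage:.3f}")
```

Any single interval either contains the true mean or it does not; the 95% refers to the long-run success rate of the procedure.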

In [None]:
# Example: Calculate 95% confidence interval for mean
confidence_level = 0.95
alpha = 1 - confidence_level

# Sample statistics
sample_mean = np.mean(exam_scores)
sample_std = np.std(exam_scores, ddof=1)
sample_size = len(exam_scores)

# Calculate standard error
standard_error = sample_std / np.sqrt(sample_size)

# Get t-critical value
degrees_freedom = sample_size - 1
t_critical = stats.t.ppf(1 - alpha/2, degrees_freedom)

# Calculate margin of error
margin_of_error = t_critical * standard_error

# Calculate confidence interval
ci_lower = sample_mean - margin_of_error
ci_upper = sample_mean + margin_of_error

print(f"95% Confidence Interval for Population Mean:")
print("="*50)
print(f"Sample mean: {sample_mean:.2f}")
print(f"Sample std dev: {sample_std:.2f}")
print(f"Sample size: {sample_size}")
print(f"Standard error: {standard_error:.2f}")
print(f"t-critical value: {t_critical:.4f}")
print(f"Margin of error: {margin_of_error:.2f}")
print(f"\n95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]")
print(f"\nInterpretation:")
print(f"We are 95% confident that the true population mean")
print(f"lies between {ci_lower:.2f} and {ci_upper:.2f}")

In [None]:
# Alternative method using scipy
ci_scipy = stats.t.interval(confidence=0.95, 
                            df=len(exam_scores)-1,
                            loc=np.mean(exam_scores),
                            scale=stats.sem(exam_scores))

print(f"\nUsing scipy.stats.t.interval():")
print(f"95% CI: [{ci_scipy[0]:.2f}, {ci_scipy[1]:.2f}]")

In [None]:
# Visualize confidence intervals
fig, ax = plt.subplots(figsize=(12, 6))

# Calculate CIs for different confidence levels
confidence_levels = [0.90, 0.95, 0.99]
colors = ['lightblue', 'lightgreen', 'lightyellow']

for i, conf in enumerate(confidence_levels):
    ci = stats.t.interval(confidence=conf, 
                         df=len(exam_scores)-1,
                         loc=sample_mean,
                         scale=stats.sem(exam_scores))
    
    # Plot confidence interval
    ax.plot([ci[0], ci[1]], [i, i], linewidth=8, 
            color=colors[i], label=f'{int(conf*100)}% CI: [{ci[0]:.2f}, {ci[1]:.2f}]')
    ax.plot([ci[0], ci[1]], [i, i], 'ko', markersize=8)

# Plot sample mean
ax.axvline(sample_mean, color='red', linestyle='--', linewidth=2, label=f'Sample Mean: {sample_mean:.2f}')

ax.set_yticks(range(len(confidence_levels)))
ax.set_yticklabels([f'{int(c*100)}%' for c in confidence_levels])
ax.set_xlabel('Exam Score')
ax.set_ylabel('Confidence Level')
ax.set_title('Confidence Intervals at Different Confidence Levels')
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Notice: Higher confidence levels produce wider intervals!")

### 3.3 Regression Analysis

**Regression analysis** examines the relationship between variables and makes predictions.

**Types:**
- **Simple Linear Regression**: One predictor variable
- **Multiple Linear Regression**: Multiple predictor variables
- **Logistic Regression**: Binary outcome variable
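
As a brief illustration of multiple linear regression, the sketch below fits two predictors with ordinary least squares using `np.linalg.lstsq`. The data are synthetic: the variable names, the true coefficients (40, 3.0, 2.0), and the noise level are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Synthetic predictors and outcome (all values are invented for illustration)
study_hours = rng.uniform(0, 15, n)
sleep_hours = rng.uniform(4, 9, n)
score = 40 + 3.0 * study_hours + 2.0 * sleep_hours + rng.normal(0, 2, n)

# Design matrix: a column of ones for the intercept, then one column per predictor
X = np.column_stack([np.ones(n), study_hours, sleep_hours])

# Ordinary least squares fit
coef, *_ = np.linalg.lstsq(X, score, rcond=None)
intercept, b_study, b_sleep = coef

print(f"Intercept: {intercept:.2f}")
print(f"Study hours coefficient: {b_study:.2f}")
print(f"Sleep hours coefficient: {b_sleep:.2f}")
```

Each coefficient estimates the change in the outcome per unit change in that predictor while the other predictor is held fixed.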

In [None]:
# Example: Simple Linear Regression
# Question: Does study time predict exam scores?

# Generate sample data
np.random.seed(42)
study_hours = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16])
exam_scores_reg = 50 + 3 * study_hours + np.random.normal(0, 5, len(study_hours))

# Perform linear regression using scipy
slope, intercept, r_value, p_value, std_err = stats.linregress(study_hours, exam_scores_reg)

print("Simple Linear Regression Analysis:")
print("="*50)
print(f"Regression equation: y = {intercept:.2f} + {slope:.2f}x")
print(f"\nCoefficients:")
print(f"  Intercept (β0): {intercept:.2f}")
print(f"  Slope (β1): {slope:.2f}")
print(f"\nModel quality:")
print(f"  R-squared (R²): {r_value**2:.4f}")
print(f"  p-value: {p_value:.6f}")
print(f"  Standard error: {std_err:.4f}")
print(f"\nInterpretation:")
print(f"  - Each additional hour of study is associated with a {slope:.2f}-point change in the predicted exam score")
print(f"  - R² = {r_value**2:.4f} means {r_value**2*100:.2f}% of the variance in exam scores is explained by study hours")
print(f"  - The relationship is {'significant' if p_value < 0.05 else 'not significant'} at α = 0.05 (p = {p_value:.6f})")

In [None]:
# Perform regression using statsmodels for more detailed output
X = sm.add_constant(study_hours)  # Add intercept term
model = sm.OLS(exam_scores_reg, X).fit()

print("\nDetailed Regression Summary (using statsmodels):")
print("="*50)
print(model.summary())

In [None]:
# Visualize the regression
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Scatter plot with regression line
axes[0].scatter(study_hours, exam_scores_reg, alpha=0.6, s=100, label='Actual scores')
predicted_scores = intercept + slope * study_hours
axes[0].plot(study_hours, predicted_scores, 'r-', linewidth=2, label='Regression line')
axes[0].set_xlabel('Study Hours')
axes[0].set_ylabel('Exam Score')
axes[0].set_title(f'Linear Regression (R² = {r_value**2:.3f})')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Add equation to plot
axes[0].text(0.05, 0.95, f'y = {intercept:.2f} + {slope:.2f}x', 
             transform=axes[0].transAxes, fontsize=12, verticalalignment='top',
             bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

# Plot 2: Residual plot
residuals = exam_scores_reg - predicted_scores
axes[1].scatter(predicted_scores, residuals, alpha=0.6, s=100)
axes[1].axhline(y=0, color='red', linestyle='--', linewidth=2)
axes[1].set_xlabel('Predicted Values')
axes[1].set_ylabel('Residuals')
axes[1].set_title('Residual Plot')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### 3.4 ANOVA (Analysis of Variance)

**ANOVA** tests whether there are significant differences between the means of three or more groups.
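
The F-statistic behind a one-way ANOVA is the ratio of between-group variability to within-group variability. The sketch below computes it by hand on three small hypothetical groups and checks the result against `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# Three small hypothetical groups (invented for illustration)
groups = [
    np.array([85, 88, 90, 87]),
    np.array([78, 82, 80, 79]),
    np.array([92, 95, 93, 94]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n_total = len(groups), all_values.size

# Between-group sum of squares: group means versus the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: values versus their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)

print(f"Manual: F = {f_manual:.4f}, p = {p_manual:.6f}")
print(f"SciPy:  F = {f_scipy:.4f}, p = {p_scipy:.6f}")
```

A significant F-statistic says that at least one mean differs; it does not say which one, so a post-hoc comparison is usually the next step.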

In [None]:
# Example: One-way ANOVA
# Question: Do three different teaching methods produce different results?

# Create three groups
method_a = np.array([85, 88, 90, 87, 89, 91, 86, 88])
method_b = np.array([78, 82, 80, 79, 81, 83, 77, 84])
method_c = np.array([92, 95, 93, 94, 96, 91, 93, 95])

# Perform one-way ANOVA
f_statistic, p_value_anova = stats.f_oneway(method_a, method_b, method_c)

print("One-Way ANOVA Results:")
print("="*50)
print(f"Null Hypothesis: All group means are equal")
print(f"Alternative Hypothesis: At least one group mean differs")
print(f"\nGroup Means:")
print(f"  Method A: {np.mean(method_a):.2f}")
print(f"  Method B: {np.mean(method_b):.2f}")
print(f"  Method C: {np.mean(method_c):.2f}")
print(f"\nF-statistic: {f_statistic:.4f}")
print(f"p-value: {p_value_anova:.6f}")

if p_value_anova < 0.05:
    print(f"\nConclusion: Reject null hypothesis (p < 0.05)")
    print(f"At least one teaching method produces significantly different results")
else:
    print(f"\nConclusion: Fail to reject null hypothesis (p >= 0.05)")
    print(f"No significant difference between teaching methods")

In [None]:
# Visualize ANOVA comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Box plots for all three groups
data_anova = [method_a, method_b, method_c]
bp = axes[0].boxplot(data_anova, patch_artist=True)
axes[0].set_xticklabels(['Method A', 'Method B', 'Method C'])  # avoids boxplot's `labels` argument, deprecated in Matplotlib 3.9
colors = ['lightblue', 'lightcoral', 'lightgreen']
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)

axes[0].set_ylabel('Exam Scores')
axes[0].set_title('Comparison of Three Teaching Methods')
axes[0].grid(True, alpha=0.3)

# Add mean markers
means = [np.mean(method_a), np.mean(method_b), np.mean(method_c)]
axes[0].scatter([1, 2, 3], means, color='red', s=100, zorder=3, marker='D', label='Mean')
axes[0].legend()

# Plot 2: Distribution comparison
axes[1].hist(method_a, bins=5, alpha=0.5, label='Method A', color='blue')
axes[1].hist(method_b, bins=5, alpha=0.5, label='Method B', color='red')
axes[1].hist(method_c, bins=5, alpha=0.5, label='Method C', color='green')
axes[1].set_xlabel('Exam Scores')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Scores by Method')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

## 4. Key Differences Between Descriptive and Inferential Statistics <a id='differences'></a>

Understanding the differences helps you choose the right approach for your analysis.

In [None]:
# Create a comparison table
comparison_data = {
    'Aspect': ['Purpose', 'Scope', 'Tools', 'Output', 'Certainty', 'Use Case'],
    'Descriptive Statistics': [
        'Summarize and describe data',
        'Limited to observed data',
        'Mean, median, std dev, graphs',
        'Summary statistics, charts',
        'Definite (describes what is)',
        'Data exploration, reporting'
    ],
    'Inferential Statistics': [
        'Make predictions and inferences',
        'Extends to population',
        'Hypothesis tests, CI, regression',
        'Probabilities, p-values, predictions',
        'Uncertain (involves probability)',
        'Decision making, testing theories'
    ]
}

comparison_df = pd.DataFrame(comparison_data)
print("\nComparison: Descriptive vs Inferential Statistics")
print("="*80)
print(comparison_df.to_string(index=False))

In [None]:
# Visual comparison using a practical example
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Descriptive vs Inferential Statistics: A Practical Comparison', 
             fontsize=16, fontweight='bold')

# Generate sample and population data
np.random.seed(42)
population = np.random.normal(75, 10, 10000)  # Entire population
sample = np.random.choice(population, 100, replace=False)  # Random sample without replacement

# DESCRIPTIVE: Plot 1 - Sample histogram
axes[0, 0].hist(sample, bins=20, alpha=0.7, color='skyblue', edgecolor='black')
axes[0, 0].axvline(np.mean(sample), color='red', linestyle='--', linewidth=2, 
                   label=f'Sample Mean: {np.mean(sample):.2f}')
axes[0, 0].set_xlabel('Values')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('DESCRIPTIVE: Sample Distribution\n(What we observe)', fontweight='bold')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# DESCRIPTIVE: Plot 2 - Sample statistics
stats_text = f"""Sample Statistics (Descriptive):
────────────────────────────
Sample Size: {len(sample)}
Mean: {np.mean(sample):.2f}
Median: {np.median(sample):.2f}
Std Dev: {np.std(sample, ddof=1):.2f}
Min: {np.min(sample):.2f}
Max: {np.max(sample):.2f}
Range: {np.max(sample) - np.min(sample):.2f}
"""
axes[0, 1].text(0.1, 0.5, stats_text, fontsize=11, family='monospace',
                verticalalignment='center',
                bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.8))
axes[0, 1].axis('off')
axes[0, 1].set_title('DESCRIPTIVE: Summary Statistics\n(Describing our sample)', fontweight='bold')

# INFERENTIAL: Plot 3 - Population with confidence interval
ci = stats.t.interval(0.95, len(sample)-1, loc=np.mean(sample), scale=stats.sem(sample))
axes[1, 0].hist(population, bins=50, alpha=0.4, color='gray', edgecolor='black', label='True Population')
axes[1, 0].axvline(np.mean(population), color='green', linestyle='-', linewidth=2, 
                   label=f'True Mean: {np.mean(population):.2f}')
axes[1, 0].axvline(ci[0], color='red', linestyle='--', linewidth=2, label=f'95% CI')
axes[1, 0].axvline(ci[1], color='red', linestyle='--', linewidth=2)
axes[1, 0].axvspan(ci[0], ci[1], alpha=0.2, color='red')
axes[1, 0].set_xlabel('Values')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('INFERENTIAL: Population Inference\n(What we predict)', fontweight='bold')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# INFERENTIAL: Plot 4 - Inference results
inference_text = f"""Inferential Statistics:
────────────────────────────
Population Mean Estimate:
  Point Estimate: {np.mean(sample):.2f}
  
95% Confidence Interval:
  [{ci[0]:.2f}, {ci[1]:.2f}]
  
Interpretation:
  We are 95% confident the
  true population mean lies
  within this interval.
  
True Population Mean: {np.mean(population):.2f}
  (In reality, we don't know this!)
"""
axes[1, 1].text(0.1, 0.5, inference_text, fontsize=11, family='monospace',
                verticalalignment='center',
                bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.8))
axes[1, 1].axis('off')
axes[1, 1].set_title('INFERENTIAL: Population Predictions\n(Making inferences)', fontweight='bold')

plt.tight_layout()
plt.show()

print(f"\nKey Insight:")
print(f"Descriptive statistics tells us about our sample (n={len(sample)})")
print(f"Inferential statistics helps us estimate the population (N={len(population)})")

---

## 5. When to Use Each Type <a id='when-to-use'></a>

### Use Descriptive Statistics When:

1. **Exploring data**: Initial data analysis and understanding
2. **Reporting findings**: Presenting summary information
3. **Comparing groups visually**: Simple comparisons without formal testing
4. **You have the entire population**: No need for inference
5. **Creating dashboards**: Displaying key metrics and KPIs

### Use Inferential Statistics When:

1. **Making predictions**: Forecasting future outcomes
2. **Testing hypotheses**: Validating assumptions or theories
3. **Working with samples**: Drawing conclusions about populations
4. **Determining relationships**: Quantifying associations between variables (correlation alone does not establish causation)
5. **Making decisions**: Business or scientific decisions based on data

In [None]:
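# Naive keyword-matching helper (illustrative only, not a real decision procedure)
def statistics_decision_helper(question):
    """
    Suggest a branch of statistics by counting keyword matches in the question.
    Matching is by substring, so treat the result as a rough hint.
    """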
    descriptive_keywords = ['describe', 'summarize', 'what is', 'average', 'median', 'show']
    inferential_keywords = ['predict', 'test', 'significant', 'population', 'infer', 'relationship']
    
    question_lower = question.lower()
    
    desc_score = sum(1 for kw in descriptive_keywords if kw in question_lower)
    infer_score = sum(1 for kw in inferential_keywords if kw in question_lower)
    
    if desc_score > infer_score:
        return "Descriptive Statistics"
    elif infer_score > desc_score:
        return "Inferential Statistics"
    else:
        return "Both might be needed"

# Test with example questions
example_questions = [
    "What is the average age of customers?",
    "Is there a significant difference between groups A and B?",
    "Can we predict sales based on advertising spend?",
    "Show me the distribution of product ratings",
    "Are customer satisfaction scores improving over time?"
]

print("Question Classification Examples:")
print("="*70)
for q in example_questions:
    suggestion = statistics_decision_helper(q)
    print(f"Q: {q}")
    print(f"   → Use: {suggestion}\n")

---

## 6. Practical Real-World Examples <a id='practical'></a>

### Example 1: E-commerce Sales Analysis

In [None]:
# Create a realistic e-commerce dataset
np.random.seed(42)
n_customers = 200

ecommerce_data = pd.DataFrame({
    'customer_id': range(1, n_customers + 1),
    'age': np.random.randint(18, 70, n_customers),
    'purchase_amount': np.random.gamma(50, 2, n_customers),
    'time_on_site': np.random.exponential(10, n_customers),
    'num_items': np.random.poisson(3, n_customers),
    'returning_customer': np.random.choice([0, 1], n_customers, p=[0.6, 0.4])
})

print("E-commerce Dataset Sample:")
print(ecommerce_data.head(10))

In [None]:
# DESCRIPTIVE STATISTICS: Summarize the data
print("\n" + "="*70)
print("DESCRIPTIVE STATISTICS: Understanding Our Customer Base")
print("="*70)

print("\n1. Summary Statistics:")
print(ecommerce_data.describe())

print("\n2. Customer Segmentation:")
print(f"Total customers: {len(ecommerce_data)}")
print(f"New customers: {(ecommerce_data['returning_customer'] == 0).sum()} ({(ecommerce_data['returning_customer'] == 0).sum()/len(ecommerce_data)*100:.1f}%)")
print(f"Returning customers: {(ecommerce_data['returning_customer'] == 1).sum()} ({(ecommerce_data['returning_customer'] == 1).sum()/len(ecommerce_data)*100:.1f}%)")

print("\n3. Purchase Behavior:")
print(f"Average purchase amount: ${ecommerce_data['purchase_amount'].mean():.2f}")
print(f"Median purchase amount: ${ecommerce_data['purchase_amount'].median():.2f}")
print(f"Average items per transaction: {ecommerce_data['num_items'].mean():.2f}")

In [None]:
# INFERENTIAL STATISTICS: Test hypotheses and make predictions
print("\n" + "="*70)
print("INFERENTIAL STATISTICS: Testing Business Hypotheses")
print("="*70)

# Question 1: Do returning customers spend significantly more?
new_customers = ecommerce_data[ecommerce_data['returning_customer'] == 0]['purchase_amount']
returning_customers = ecommerce_data[ecommerce_data['returning_customer'] == 1]['purchase_amount']

t_stat, p_val = stats.ttest_ind(returning_customers, new_customers)

print("\n1. Hypothesis Test: Do returning customers spend more?")
print(f"   New customers average: ${new_customers.mean():.2f}")
print(f"   Returning customers average: ${returning_customers.mean():.2f}")
print(f"   Difference: ${returning_customers.mean() - new_customers.mean():.2f}")
print(f"   t-statistic: {t_stat:.4f}")
print(f"   p-value: {p_val:.4f}")
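# ttest_ind is two-sided by default, so check the sign of t before claiming a direction
if p_val < 0.05:
    direction = "more" if t_stat > 0 else "less"
    print(f"   ✓ Conclusion: Returning customers spend significantly {direction} (p < 0.05)")
else:
    print("   ✗ Conclusion: No significant difference (p >= 0.05)")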

# Question 2: Predict purchase amount from time on site
slope, intercept, r_val, p_val_reg, std_err = stats.linregress(
    ecommerce_data['time_on_site'], 
    ecommerce_data['purchase_amount']
)

print("\n2. Regression Analysis: Time on site vs Purchase amount")
print(f"   Equation: Purchase ($) = {intercept:.2f} {slope:+.2f} × (time on site)")
print(f"   R-squared: {r_val**2:.4f}")
print(f"   p-value: {p_val_reg:.6f}")
# Judge the relationship by the slope's p-value, then report how much variance it explains
if p_val_reg < 0.05:
    print(f"   ✓ Significant linear relationship: time on site explains {r_val**2*100:.2f}% of purchase variance")
else:
    print(f"   ✗ No significant linear relationship (R² = {r_val**2:.4f})")

In [None]:
# Visualize the e-commerce analysis
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('E-commerce Sales Analysis: Descriptive vs Inferential', fontsize=16, fontweight='bold')

# Plot 1: Descriptive - Purchase amount distribution
axes[0, 0].hist(ecommerce_data['purchase_amount'], bins=30, alpha=0.7, 
                color='skyblue', edgecolor='black')
axes[0, 0].axvline(ecommerce_data['purchase_amount'].mean(), color='red', 
                   linestyle='--', linewidth=2, label='Mean')
axes[0, 0].axvline(ecommerce_data['purchase_amount'].median(), color='green', 
                   linestyle='--', linewidth=2, label='Median')
axes[0, 0].set_xlabel('Purchase Amount ($)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('DESCRIPTIVE: Purchase Distribution')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Plot 2: Descriptive - Customer type breakdown
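# sort_index() keeps the order 0 (new), 1 (returning) so the counts match the labels below
customer_counts = ecommerce_data['returning_customer'].value_counts().sort_index()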
axes[0, 1].pie(customer_counts, labels=['New', 'Returning'], autopct='%1.1f%%',
               colors=['lightcoral', 'lightgreen'], startangle=90)
axes[0, 1].set_title('DESCRIPTIVE: Customer Type Distribution')

# Plot 3: Inferential - Comparison of customer types
bp = axes[1, 0].boxplot([new_customers, returning_customers], patch_artist=True)
axes[1, 0].set_xticklabels(['New', 'Returning'])  # avoids the `labels` keyword, deprecated in Matplotlib 3.9
bp['boxes'][0].set_facecolor('lightcoral')
bp['boxes'][1].set_facecolor('lightgreen')
axes[1, 0].set_ylabel('Purchase Amount ($)')
axes[1, 0].set_title(f'INFERENTIAL: Customer Comparison\n(p-value = {p_val:.4f})')
axes[1, 0].grid(True, alpha=0.3)

# Plot 4: Inferential - Regression analysis
axes[1, 1].scatter(ecommerce_data['time_on_site'], ecommerce_data['purchase_amount'],
                  alpha=0.5)
x_line = np.array([ecommerce_data['time_on_site'].min(), ecommerce_data['time_on_site'].max()])
y_line = intercept + slope * x_line
axes[1, 1].plot(x_line, y_line, 'r-', linewidth=2, label='Regression line')
axes[1, 1].set_xlabel('Time on Site (minutes)')
axes[1, 1].set_ylabel('Purchase Amount ($)')
axes[1, 1].set_title(f'INFERENTIAL: Predictive Model\n(R² = {r_val**2:.4f})')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
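
`stats.linregress` captures linear association only. As a small sketch of the rank-based alternative listed among the correlation techniques in this notebook, the snippet below compares Pearson and Spearman correlation on a monotonic but nonlinear relationship. The arrays `x` and `y` are synthetic and hypothetical; they are not the e-commerce columns.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic data: y grows exponentially with x, plus noise (illustrative assumption)
x = rng.uniform(0, 5, size=200)
y = np.exp(x) + rng.normal(0, 1, size=200)

# Pearson measures linear association between the raw values
pearson_r, pearson_p = stats.pearsonr(x, y)

# Spearman measures monotonic association between the ranks
spearman_rho, spearman_p = stats.spearmanr(x, y)

print(f"Pearson r:    {pearson_r:.3f} (p = {pearson_p:.2e})")
print(f"Spearman rho: {spearman_rho:.3f} (p = {spearman_p:.2e})")
```

When the relationship is curved but consistently increasing, Spearman's coefficient tends to be closer to 1 than Pearson's, which is one reason to inspect a scatter plot before relying on a single linear fit.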

### Example 2: Medical Treatment Efficacy Study

In [None]:
# Create medical trial dataset
np.random.seed(42)
n_patients = 100

# Treatment group (should show improvement)
treatment_before = np.random.normal(140, 15, n_patients//2)  # Blood pressure before
treatment_after = treatment_before - np.random.normal(15, 5, n_patients//2)  # After treatment

# Control group (placebo, minimal change)
control_before = np.random.normal(140, 15, n_patients//2)
control_after = control_before - np.random.normal(2, 5, n_patients//2)

medical_data = pd.DataFrame({
    'patient_id': range(1, n_patients + 1),
    'group': ['Treatment'] * (n_patients//2) + ['Control'] * (n_patients//2),
    'bp_before': np.concatenate([treatment_before, control_before]),
    'bp_after': np.concatenate([treatment_after, control_after])
})

medical_data['bp_change'] = medical_data['bp_after'] - medical_data['bp_before']

print("Medical Treatment Study Dataset:")
print(medical_data.head(10))

In [None]:
# DESCRIPTIVE STATISTICS
print("\n" + "="*70)
print("DESCRIPTIVE STATISTICS: Treatment Study Summary")
print("="*70)

print("\nTreatment Group:")
treatment_group = medical_data[medical_data['group'] == 'Treatment']
print(f"  Mean BP before: {treatment_group['bp_before'].mean():.2f} mmHg")
print(f"  Mean BP after: {treatment_group['bp_after'].mean():.2f} mmHg")
print(f"  Mean change: {treatment_group['bp_change'].mean():.2f} mmHg")

print("\nControl Group:")
control_group = medical_data[medical_data['group'] == 'Control']
print(f"  Mean BP before: {control_group['bp_before'].mean():.2f} mmHg")
print(f"  Mean BP after: {control_group['bp_after'].mean():.2f} mmHg")
print(f"  Mean change: {control_group['bp_change'].mean():.2f} mmHg")

In [None]:
# INFERENTIAL STATISTICS: Test if treatment is effective
print("\n" + "="*70)
print("INFERENTIAL STATISTICS: Treatment Efficacy Analysis")
print("="*70)

# Test 1: Paired t-test for treatment group (before vs after)
t_stat_paired, p_val_paired = stats.ttest_rel(
    treatment_group['bp_before'], 
    treatment_group['bp_after']
)

print("\n1. Paired t-test: Treatment group before vs after")
print(f"   t-statistic: {t_stat_paired:.4f}")
print(f"   p-value: {p_val_paired:.6f}")
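# A within-group change alone cannot separate the drug from placebo effects; Test 2 addresses that
if p_val_paired < 0.05:
    print("   ✓ Significant before/after change in the treatment group (p < 0.05)")
else:
    print("   ✗ No significant before/after change (p >= 0.05)")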

# Test 2: Independent t-test comparing changes between groups
t_stat_ind, p_val_ind = stats.ttest_ind(
    treatment_group['bp_change'],
    control_group['bp_change']
)

print("\n2. Independent t-test: Treatment vs Control (change scores)")
print(f"   Treatment change: {treatment_group['bp_change'].mean():.2f} mmHg")
print(f"   Control change: {control_group['bp_change'].mean():.2f} mmHg")
print(f"   Difference: {treatment_group['bp_change'].mean() - control_group['bp_change'].mean():.2f} mmHg")
print(f"   t-statistic: {t_stat_ind:.4f}")
print(f"   p-value: {p_val_ind:.6f}")
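# bp_change = after - before, so a negative t means a larger reduction under treatment
if p_val_ind < 0.05 and t_stat_ind < 0:
    print("   ✓ Treatment reduces BP significantly more than control (p < 0.05)")
elif p_val_ind < 0.05:
    print("   ✗ Significant difference, but control improved more than treatment (p < 0.05)")
else:
    print("   ✗ No significant difference from control (p >= 0.05)")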

# Calculate effect size (Cohen's d)
pooled_std = np.sqrt(
    ((len(treatment_group)-1) * treatment_group['bp_change'].std()**2 + 
     (len(control_group)-1) * control_group['bp_change'].std()**2) / 
    (len(treatment_group) + len(control_group) - 2)
)
cohens_d = (treatment_group['bp_change'].mean() - control_group['bp_change'].mean()) / pooled_std

print(f"\n3. Effect Size (Cohen's d): {cohens_d:.4f}")
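# Conventional benchmarks (Cohen): 0.2 = small, 0.5 = medium, 0.8 = large
if abs(cohens_d) < 0.2:
    print("   Negligible effect")
elif abs(cohens_d) < 0.5:
    print("   Small effect")
elif abs(cohens_d) < 0.8:
    print("   Medium effect")
else:
    print("   Large effect")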

In [None]:
# Visualize medical study results
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Medical Treatment Study: Statistical Analysis', fontsize=16, fontweight='bold')

# Plot 1: Before and After comparison (Treatment)
axes[0, 0].scatter(treatment_group['bp_before'], treatment_group['bp_after'], alpha=0.6, s=50)
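# Derive the reference-line limits from the data so every point falls inside them
bp_cols = treatment_group[['bp_before', 'bp_after']]
lims = [bp_cols.min().min() - 5, bp_cols.max().max() + 5]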
axes[0, 0].plot(lims, lims, 'r--', alpha=0.5, label='No change line')
axes[0, 0].set_xlabel('Blood Pressure Before (mmHg)')
axes[0, 0].set_ylabel('Blood Pressure After (mmHg)')
axes[0, 0].set_title('Treatment Group: Before vs After')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Plot 2: Distribution of changes
axes[0, 1].hist(treatment_group['bp_change'], bins=15, alpha=0.5, 
                label='Treatment', color='green', edgecolor='black')
axes[0, 1].hist(control_group['bp_change'], bins=15, alpha=0.5, 
                label='Control', color='red', edgecolor='black')
axes[0, 1].axvline(0, color='black', linestyle='--', linewidth=2, label='No change')
axes[0, 1].set_xlabel('Blood Pressure Change (mmHg)')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Distribution of BP Changes')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Plot 3: Box plot comparison
bp_data = [treatment_group['bp_change'], control_group['bp_change']]
bp = axes[1, 0].boxplot(bp_data, patch_artist=True)
axes[1, 0].set_xticklabels(['Treatment', 'Control'])  # avoids the `labels` keyword, deprecated in Matplotlib 3.9
bp['boxes'][0].set_facecolor('lightgreen')
bp['boxes'][1].set_facecolor('lightcoral')
axes[1, 0].axhline(0, color='red', linestyle='--', linewidth=2, alpha=0.5)
axes[1, 0].set_ylabel('Blood Pressure Change (mmHg)')
axes[1, 0].set_title(f'Group Comparison (p = {p_val_ind:.4f})')
axes[1, 0].grid(True, alpha=0.3)

# Plot 4: Summary statistics table
summary_data = [
    ['Metric', 'Treatment', 'Control'],
    ['Sample Size', f'{len(treatment_group)}', f'{len(control_group)}'],
    ['Mean Change', f'{treatment_group["bp_change"].mean():.2f}', f'{control_group["bp_change"].mean():.2f}'],
    ['Std Dev', f'{treatment_group["bp_change"].std():.2f}', f'{control_group["bp_change"].std():.2f}'],
    ['p-value', f'{p_val_ind:.4f}', ''],
    ['Effect Size', f'{cohens_d:.4f}', '']
]

table = axes[1, 1].table(cellText=summary_data, cellLoc='center', loc='center',
                         bbox=[0, 0, 1, 1])
table.auto_set_font_size(False)
table.set_fontsize(10)
table.scale(1, 2)

# Color header row
for i in range(3):
    table[(0, i)].set_facecolor('lightblue')
    table[(0, i)].set_text_props(weight='bold')

axes[1, 1].axis('off')
axes[1, 1].set_title('Statistical Summary')

plt.tight_layout()
plt.show()

---

## Common Techniques Summary

### Descriptive Statistics Techniques:

1. **Measures of Central Tendency**: Mean, Median, Mode
2. **Measures of Dispersion**: Range, Variance, Standard Deviation, IQR
3. **Measures of Position**: Percentiles, Quartiles, Z-scores
4. **Data Visualization**: Histograms, Box plots, Bar charts, Pie charts
5. **Frequency Distributions**: Tables and charts showing data frequency
6. **Shape Measures**: Skewness and Kurtosis
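
The measures above can be sketched on a tiny dataset as follows. The `data` array is a hypothetical sample with one deliberately large value, chosen only to show how the mean, median, and skewness react to it.

```python
import numpy as np
from scipy import stats

# Hypothetical sample with one large value (illustrative only)
data = np.array([12, 15, 14, 10, 18, 20, 15, 13, 16, 45], dtype=float)

# Central tendency
mean = data.mean()
median = np.median(data)
values, counts = np.unique(data, return_counts=True)
mode = values[counts.argmax()]  # most frequent value

# Dispersion
data_range = data.max() - data.min()
std = data.std(ddof=1)  # sample standard deviation
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Position: z-score of each observation relative to the sample
z_scores = (data - mean) / std

# Shape
skewness = stats.skew(data)
excess_kurtosis = stats.kurtosis(data)  # Fisher definition: a normal distribution gives 0

print(f"Mean: {mean:.2f}, Median: {median:.2f}, Mode: {mode:.0f}")
print(f"Range: {data_range:.2f}, Std: {std:.2f}, IQR: {iqr:.2f}")
print(f"Largest z-score: {z_scores.max():.2f}")
print(f"Skewness: {skewness:.2f}, Excess kurtosis: {excess_kurtosis:.2f}")
```

The single large value pulls the mean above the median and produces positive skewness, while the median and IQR barely move.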

### Inferential Statistics Techniques:

1. **Hypothesis Testing**: t-tests, z-tests, ANOVA, Chi-square
2. **Confidence Intervals**: Estimating population parameters with uncertainty
3. **Regression Analysis**: Linear, Multiple, Logistic regression
4. **Correlation Analysis**: Pearson, Spearman correlation
5. **Sampling Methods**: Random sampling, stratified sampling
6. **Power Analysis**: Determining sample size requirements
7. **Bayesian Inference**: Updating beliefs with new evidence
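
Two of the hypothesis tests listed above, chi-square and ANOVA, can be sketched with SciPy as follows. The contingency table and the three groups are hypothetical values made up for illustration, and the 0.05 threshold is the conventional default rather than a requirement.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Chi-square test of independence on a hypothetical 2x2 table
# Rows: customer type (new, returning); columns: purchased (no, yes)
observed = np.array([[60, 40],
                     [30, 70]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(observed)

print("Chi-square test of independence")
print(f"  chi2 = {chi2:.3f}, dof = {dof}, p-value = {p_chi2:.6f}")
print(f"  Expected counts under independence:\n{expected}")

# One-way ANOVA on three hypothetical groups (the third has a shifted mean)
group_a = rng.normal(50, 5, 30)
group_b = rng.normal(50, 5, 30)
group_c = rng.normal(58, 5, 30)
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)

print("\nOne-way ANOVA")
print(f"  F = {f_stat:.3f}, p-value = {p_anova:.6f}")
```

Chi-square works on counts of categorical outcomes, while ANOVA compares the means of a numerical variable across three or more groups. A significant ANOVA says only that at least one group mean differs, not which one.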

In [None]:
# Create a comprehensive technique comparison
techniques = {
    'Category': [
        'Central Tendency', 'Central Tendency', 'Central Tendency',
        'Dispersion', 'Dispersion', 'Dispersion',
        'Position', 'Position',
        'Hypothesis Testing', 'Hypothesis Testing', 'Hypothesis Testing',
        'Estimation', 'Estimation',
        'Relationship', 'Relationship'
    ],
    'Technique': [
        'Mean', 'Median', 'Mode',
        'Standard Deviation', 'Variance', 'Range',
        'Percentiles', 'Quartiles',
        't-test', 'ANOVA', 'Chi-square',
        'Confidence Intervals', 'Point Estimates',
        'Correlation', 'Regression'
    ],
    'Type': [
        'Descriptive', 'Descriptive', 'Descriptive',
        'Descriptive', 'Descriptive', 'Descriptive',
        'Descriptive', 'Descriptive',
        'Inferential', 'Inferential', 'Inferential',
        'Inferential', 'Inferential',
        'Both', 'Both'
    ],
    'Data Type': [
        'Numerical', 'Numerical', 'Categorical',
        'Numerical', 'Numerical', 'Numerical',
        'Numerical', 'Numerical',
        'Numerical', 'Numerical', 'Categorical',
        'Numerical', 'Numerical',
        'Numerical', 'Numerical'
    ]
}

techniques_df = pd.DataFrame(techniques)

print("\nComprehensive Statistical Techniques Reference:")
print("="*80)
print(techniques_df.to_string(index=False))

# Count by type
print("\n\nTechnique Count by Type:")
print(techniques_df['Type'].value_counts())

---

## 7. Summary <a id='summary'></a>

### Key Takeaways:

1. **Two Main Branches**: Statistics is divided into Descriptive and Inferential Statistics, each serving distinct purposes in data analysis.

2. **Descriptive Statistics**:
   - Summarizes and describes observed data
   - Uses measures of central tendency (mean, median, mode)
   - Employs measures of dispersion (variance, standard deviation, range)
   - Includes measures of position (percentiles, quartiles, z-scores)
   - Answers: "What does the data show?"

3. **Inferential Statistics**:
   - Makes predictions and inferences about populations from samples
   - Uses hypothesis testing (t-tests, ANOVA, chi-square)
   - Employs confidence intervals for estimation
   - Includes regression analysis for predictions
   - Answers: "What can we conclude about the population?"

4. **Complementary Nature**: Both types of statistics work together in data analysis:
   - Start with descriptive statistics to understand your data
   - Use inferential statistics to test hypotheses and make predictions

5. **Practical Applications**:
   - Business: Customer analysis, A/B testing, sales forecasting
   - Healthcare: Treatment efficacy, disease prevalence, clinical trials
   - Machine Learning: Feature analysis, model validation, performance testing

### Best Practices:

- Always start with descriptive statistics to explore your data
- Check assumptions before applying inferential tests
- Use visualizations to complement numerical statistics
- Report effect sizes along with p-values
- Consider both statistical and practical significance
- Document your methods and assumptions clearly
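
As a rough sketch of the second and fourth practices, the snippet below checks two common t-test assumptions and then reports an effect size next to the p-value. The groups are synthetic, the 0.05 cut-offs are conventional assumptions, and choosing between Student's and Welch's test from a Levene result is only one possible approach; using Welch's test by default is also reasonable.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Two hypothetical groups with different spreads (illustrative only)
group_1 = rng.normal(100, 10, 40)
group_2 = rng.normal(106, 20, 40)

# Normality check per group (Shapiro-Wilk): a small p-value suggests non-normality
_, p_norm_1 = stats.shapiro(group_1)
_, p_norm_2 = stats.shapiro(group_2)

# Equal-variance check (Levene): a small p-value suggests unequal variances
_, p_var = stats.levene(group_1, group_2)

# Use Welch's t-test (equal_var=False) when the variances look unequal
equal_var = bool(p_var >= 0.05)
t_stat, p_val = stats.ttest_ind(group_1, group_2, equal_var=equal_var)

# Effect size: Cohen's d with a pooled standard deviation
n1, n2 = len(group_1), len(group_2)
pooled_sd = np.sqrt(
    ((n1 - 1) * group_1.var(ddof=1) + (n2 - 1) * group_2.var(ddof=1)) / (n1 + n2 - 2)
)
cohens_d = (group_2.mean() - group_1.mean()) / pooled_sd

print(f"Shapiro-Wilk p-values: {p_norm_1:.3f}, {p_norm_2:.3f}")
print(f"Levene p-value: {p_var:.3f} -> equal_var={equal_var}")
print(f"t = {t_stat:.3f}, p = {p_val:.4f}, Cohen's d = {cohens_d:.3f}")
```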

### Further Learning:

- Study probability distributions (Normal, Binomial, Poisson)
- Learn about sampling techniques and bias
- Explore advanced regression methods
- Practice with real-world datasets
- Understand the assumptions behind statistical tests

In [None]:
# Final visualization: The complete statistical workflow
fig, ax = plt.subplots(figsize=(14, 8))
ax.axis('off')

# Title
ax.text(0.5, 0.95, 'Complete Statistical Analysis Workflow', 
        ha='center', va='top', fontsize=18, fontweight='bold')

# Step 1: Data Collection
ax.add_patch(plt.Rectangle((0.05, 0.78), 0.9, 0.12, 
                           facecolor='lightblue', edgecolor='black', linewidth=2))
ax.text(0.5, 0.84, 'Step 1: Data Collection', ha='center', va='center', 
        fontsize=14, fontweight='bold')
ax.text(0.5, 0.81, 'Gather data through surveys, experiments, or observations', 
        ha='center', va='center', fontsize=10)

# Arrow (annotate draws from xytext to xy, so the head points down toward Steps 2 and 3)
ax.annotate('', xy=(0.5, 0.74), xytext=(0.5, 0.78), 
            arrowprops=dict(arrowstyle='->', lw=2))

# Step 2: Descriptive Statistics
ax.add_patch(plt.Rectangle((0.05, 0.58), 0.4, 0.15, 
                           facecolor='lightgreen', edgecolor='black', linewidth=2))
ax.text(0.25, 0.69, 'Step 2: DESCRIPTIVE STATISTICS', ha='center', va='top', 
        fontsize=12, fontweight='bold')
ax.text(0.25, 0.64, '• Summarize data\n• Calculate measures\n• Create visualizations', 
        ha='center', va='center', fontsize=9)

# Step 3: Inferential Statistics
ax.add_patch(plt.Rectangle((0.55, 0.58), 0.4, 0.15, 
                           facecolor='lightyellow', edgecolor='black', linewidth=2))
ax.text(0.75, 0.69, 'Step 3: INFERENTIAL STATISTICS', ha='center', va='top', 
        fontsize=12, fontweight='bold')
ax.text(0.75, 0.64, '• Test hypotheses\n• Make predictions\n• Draw conclusions', 
        ha='center', va='center', fontsize=9)

# Arrows from descriptive and inferential down to Step 4 (head at xy, tail at xytext)
ax.annotate('', xy=(0.25, 0.51), xytext=(0.25, 0.58), 
            arrowprops=dict(arrowstyle='->', lw=2))
ax.annotate('', xy=(0.75, 0.51), xytext=(0.75, 0.58), 
            arrowprops=dict(arrowstyle='->', lw=2))

# Step 4: Insights and Actions
ax.add_patch(plt.Rectangle((0.15, 0.38), 0.7, 0.12, 
                           facecolor='lightcoral', edgecolor='black', linewidth=2))
ax.text(0.5, 0.44, 'Step 4: Insights & Decision Making', ha='center', va='center', 
        fontsize=14, fontweight='bold')
ax.text(0.5, 0.41, 'Interpret results and take informed actions', 
        ha='center', va='center', fontsize=10)

# Tools box
ax.add_patch(plt.Rectangle((0.05, 0.05), 0.9, 0.28, 
                           facecolor='white', edgecolor='black', linewidth=2))
ax.text(0.5, 0.30, 'Common Python Tools', ha='center', va='top', 
        fontsize=14, fontweight='bold')

tools_text = '''NumPy: Numerical computations, arrays, mathematical operations
Pandas: Data manipulation, DataFrames, data cleaning
Matplotlib/Seaborn: Data visualization, charts, plots
SciPy: Statistical functions, hypothesis tests, distributions
Statsmodels: Advanced statistical models, regression, time series'''

ax.text(0.5, 0.22, tools_text, ha='center', va='center', 
        fontsize=9, family='monospace')

plt.tight_layout()
plt.show()

print("\n" + "="*70)
print("Congratulations! You now understand the two main types of statistics")
print("and how to apply them in Python for data science and machine learning.")
print("="*70)