shri-singh/Statistics-DataScience-Masterclass

Statistics & Data Science Masterclass

A structured, hands-on curriculum covering statistics from fundamentals to advanced inference, built entirely with Jupyter Notebooks. Every lesson pairs theory (definitions and LaTeX formulas) with practice (Python code).

License: CC BY-NC 4.0 | Python 3.13+


Why This Exists

Most statistics courses are either too theoretical (textbook-only) or too practical (library calls with no intuition). This masterclass bridges the gap — every concept is explained with definitions and formulas, then implemented from first principles in Python before using library shortcuts.

The curriculum follows a university-style progression (STATS100 → STATS500), where each level builds on the previous one.


Curriculum Overview

STATS100 — Descriptive Statistics Fundamentals

The building blocks: what statistics is, data types, frequency distributions, and the three pillars of central tendency.

| # | Notebook | Topics |
|---|----------|--------|
| 01 | Descriptive Statistics Fundamentals | Definition of statistics (descriptive vs inferential), types of data (NOIR scale), population vs sample, frequency distributions, mean (arithmetic, weighted, trimmed), median, mode, geometric & harmonic mean, when to use what, best practices |
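
As a rough illustration of the "first principles, then library" approach for this level, the sketch below computes several of the listed averages by hand and compares them with Python's standard `statistics` module. The data values and the helper names (`trimmed_mean`, `geometric_mean`, `harmonic_mean`) are made up for this example and are not taken from the notebook.

```python
import math
import statistics


def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the observations from each tail."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # observations dropped per tail
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def geometric_mean(values):
    """n-th root of the product, computed through logs (positive values only)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def harmonic_mean(values):
    """Reciprocal of the mean of reciprocals (positive values only)."""
    return len(values) / sum(1 / v for v in values)


# Hypothetical sample with one extreme value to show robustness differences.
data = [2, 4, 4, 5, 7, 9, 100]

arithmetic = sum(data) / len(data)      # pulled upward by the value 100
median = statistics.median(data)        # middle value, robust to the extreme
mode = statistics.mode(data)            # most frequent value
trimmed = trimmed_mean(data, 0.15)      # drops one value from each tail here

print(f"arithmetic={arithmetic:.2f} median={median} mode={mode} trimmed={trimmed:.2f}")
print(f"geometric={geometric_mean(data):.2f} harmonic={harmonic_mean(data):.2f}")
```

On this sample the arithmetic mean sits far above the median and the trimmed mean, which is the kind of contrast a "when to use what" discussion usually relies on.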

STATS200 — Measures of Dispersion

Central tendency alone isn't enough — learn to quantify how spread out data is.

| # | Notebook | Topics |
|---|----------|--------|
| 01 | Measures of Dispersion | Range, interquartile range (IQR), outlier detection (1.5×IQR rule), mean absolute deviation (MAD), variance (population vs sample, Bessel's correction), standard deviation, empirical rule (68-95-99.7), coefficient of variation (CV), mean absolute error (MAE) |
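
Two of the topics above, Bessel's correction and the 1.5×IQR rule, can be sketched in a few lines. This is a minimal illustration on made-up data, not the notebook's code; the function names are hypothetical, and the quartiles use the default ("exclusive") method of `statistics.quantiles`, which can differ slightly from the NumPy or pandas defaults.

```python
import math
import statistics


def variance(values, sample=True):
    """Variance from first principles.

    sample=True divides by n - 1 (Bessel's correction);
    sample=False divides by n (population variance).
    """
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return squared_deviations / (n - 1 if sample else n)


def iqr_outliers(values):
    """Return the values lying outside the 1.5 x IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile convention is an assumption
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]


data = [2, 4, 4, 5, 7, 9, 100]  # hypothetical sample

print("population variance:", variance(data, sample=False))
print("sample variance:    ", variance(data, sample=True))
print("sample std dev:     ", math.sqrt(variance(data, sample=True)))
print("outliers:           ", iqr_outliers(data))
```

The sample variance is always the larger of the two because it divides the same sum of squared deviations by n - 1 instead of n.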

STATS300 — Probability Distributions

The mathematical models that describe how data is generated — from coin flips to bell curves.

| # | Notebook | Topics |
|---|----------|--------|
| 01 | Probability Distributions | Probability axioms, random variables, PMF, PDF, CDF, expected value & variance. Discrete: Bernoulli, Binomial, Poisson, Geometric, Negative Binomial, Hypergeometric, Discrete Uniform. Continuous: Normal, Standard Normal (Z-scores), Exponential, Gamma, Beta, Weibull, Log-Normal, Chi-Square, Student's t, F-distribution. Central Limit Theorem |
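
The Central Limit Theorem listed above lends itself to a quick simulation. The sketch below is an assumption-laden illustration (Exponential(1) data, sample size 50, 5000 replicates, a fixed seed), not the notebook's implementation: it checks that sample means of a skewed distribution cluster around the population mean with a spread close to σ/√n.

```python
import math
import random
import statistics


def sample_means(n, replicates, seed=0):
    """Means of `replicates` independent samples of size `n` from Exponential(1)."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(replicates)
    ]


n = 50
means = sample_means(n, replicates=5000)

center = statistics.fmean(means)       # should be near the population mean, 1
spread = statistics.stdev(means)       # empirical standard error
predicted_spread = 1.0 / math.sqrt(n)  # sigma / sqrt(n), with sigma = 1

# Share of sample means within 1.96 standard errors of the center;
# for a roughly normal sampling distribution this is close to 0.95.
within = sum(abs(m - center) <= 1.96 * spread for m in means) / len(means)

print(f"center={center:.3f} spread={spread:.3f} predicted={predicted_spread:.3f}")
print(f"share within 1.96 standard errors: {within:.3f}")
```

Increasing `n` should shrink the spread in proportion to 1/√n, which is easy to confirm by rerunning with a few different sample sizes.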

STATS400 — Inference & Estimation

Drawing conclusions about populations from samples — the core of statistical reasoning.

| # | Notebook | Topics |
|---|----------|--------|
| 01 | Inference and Estimation | Point estimation (sample mean, variance, proportion), properties of estimators (unbiasedness, consistency, efficiency), biased vs unbiased estimators, sampling distributions, standard error, Maximum Likelihood Estimation (MLE), log-likelihood, method of moments, confidence intervals (Z, t, proportion) |
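
To make the link between a point estimate, its standard error, and a confidence interval concrete, here is a minimal sketch of a normal-approximation (Wald) interval for a proportion. The counts are invented, the function name is hypothetical, and the notebook may well use a different interval; the Wald form is chosen only because it is the simplest to write out.

```python
import math
from statistics import NormalDist


def proportion_confidence_interval(successes, n, confidence=0.95):
    """Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n                               # point estimate
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)      # about 1.96 for 95%
    margin = z * standard_error
    return p_hat - margin, p_hat + margin


# Hypothetical survey: 56 "yes" answers out of 100 respondents.
lower, upper = proportion_confidence_interval(56, 100)
print(f"95% CI for the proportion: ({lower:.3f}, {upper:.3f})")
```

The Wald interval is known to behave poorly for small samples or proportions near 0 or 1, so it is best read as a teaching device rather than a recommended default.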

STATS450 — Hypothesis Testing

The formal framework for making data-driven decisions under uncertainty.

| # | Notebook | Topics |
|---|----------|--------|
| 01 | Hypothesis Testing | Null & alternative hypotheses, test statistics, p-values (interpretation & misconceptions), significance level & confidence, Type I/II errors & power, Z-test (one-sample, two-sample), t-test (one-sample, independent/Welch's, paired), skewness (testing & visualization), kurtosis, normality tests (Shapiro-Wilk, D'Agostino), multiple testing & Bonferroni correction |
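
The sketch below walks through a one-sample Z-test and then applies a Bonferroni-adjusted threshold, two of the topics listed above. All numbers are hypothetical (a sample mean of 52 against a null mean of 50, a known σ of 5, n = 25, and three tests in the family), and the function name is an illustration rather than anything from the notebook.

```python
import math
from statistics import NormalDist


def one_sample_z_test(sample_mean, mu0, sigma, n):
    """Two-sided Z-test of H0: mu = mu0, assuming the population sigma is known."""
    standard_error = sigma / math.sqrt(n)
    z = (sample_mean - mu0) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # area in both tails
    return z, p_value


z, p_value = one_sample_z_test(sample_mean=52.0, mu0=50.0, sigma=5.0, n=25)

alpha = 0.05
number_of_tests = 3                          # hypothetical family of tests
bonferroni_alpha = alpha / number_of_tests   # stricter per-test threshold

print(f"z = {z:.2f}, p = {p_value:.4f}")
print("significant at alpha:           ", p_value < alpha)
print("significant after Bonferroni:   ", p_value < bonferroni_alpha)
```

With these inputs the result clears the unadjusted 0.05 threshold but not the Bonferroni-adjusted one, which shows how correcting for multiple tests can change a decision.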

STATS500 — Causality, Correlation & ANOVA

Measuring relationships between variables and testing for group differences.

| # | Notebook | Topics |
|---|----------|--------|
| 01 | Correlation, ANOVA & Causality | Correlation vs causation, covariance, Pearson r, Spearman ρ, Kendall τ, point-biserial correlation, correlation heatmaps, chi-square test of independence (Cramér's V), one-way ANOVA (F-test, ANOVA table, η²), post-hoc tests (Tukey HSD), two-way ANOVA (interaction effects), MANOVA (Wilks' Lambda), causal inference methods |
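
Pearson r and Spearman ρ differ in what they measure, and a small first-principles sketch makes that visible. The data are invented, the helper names are hypothetical, and the ranking helper assumes there are no tied values (tie handling is deliberately left out to keep the example short).

```python
import math


def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks starting at 1. Assumes no ties (a simplification for this sketch)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))


# Hypothetical data: y = x squared, monotone but not linear.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

print(f"Pearson r    = {pearson_r(x, y):.4f}")
print(f"Spearman rho = {spearman_rho(x, y):.4f}")
```

Because the relationship is perfectly monotone but curved, Spearman ρ reaches 1 while Pearson r stays below it.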

Learning Path

STATS100  Descriptive Stats   →  Summarize data (mean, median, mode)
  ↓
STATS200  Dispersion           →  Measure spread (variance, std, IQR)
  ↓
STATS300  Distributions        →  Model data generation (normal, binomial, etc.)
  ↓
STATS400  Estimation           →  Infer population parameters (MLE, CIs)
  ↓
STATS450  Hypothesis Testing   →  Make decisions (p-values, t-tests, z-tests)
  ↓
STATS500  Relationships        →  Measure associations (correlation, ANOVA, causality)

Tech Stack

| Category | Tools |
|----------|-------|
| Language | Python 3.13+ |
| Package Manager | uv |
| Notebooks | Jupyter (via VS Code) |
| Core Libraries | NumPy, Pandas, Matplotlib, Seaborn |
| Statistics | SciPy, statsmodels, scikit-learn |

Getting Started

Prerequisites

Setup

# 1. Clone the repository
git clone https://github.com/shri-singh/Statistics-DataScience-Masterclass.git
cd Statistics-DataScience-Masterclass

# 2. Install uv (if not already installed)
pip install uv

# 3. Create virtual environment and install all dependencies
uv sync

# 4. Activate the environment (run only the line that matches your shell)
source .venv/Scripts/activate      # Git Bash on Windows
source .venv/bin/activate           # macOS / Linux

# 5. Register the Jupyter kernel
python -m ipykernel install --user --name=stats-demo --display-name "Stats Demo"

# 6. Open in VS Code
code .

Then open any .ipynb file and select the "Stats Demo" kernel from the top-right kernel picker.
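
After selecting the kernel, a quick sanity check in the first notebook cell can confirm that the environment resolved as expected. The snippet below is a hedged sketch: the package list is an assumption assembled from the Tech Stack table and the `ipykernel` command above, not from the project's actual dependency file, and `check_environment` is a hypothetical helper.

```python
import sys
from importlib import metadata

# Assumed package list, taken from the Tech Stack table and setup step 5.
PACKAGES = [
    "numpy", "pandas", "matplotlib", "seaborn",
    "scipy", "statsmodels", "scikit-learn", "ipykernel",
]


def check_environment(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


report = check_environment(PACKAGES)

print("Python", sys.version.split()[0], "(the README asks for 3.13+)")
for name, version in report.items():
    print(f"{name:15} {version or 'NOT INSTALLED'}")
```

If any package shows as not installed, the usual causes are a notebook running on a different kernel than "Stats Demo" or a `uv sync` step that did not complete.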


Project Structure

Statistics-DataScience-Masterclass/
├── STATS100/                # Descriptive Statistics Fundamentals
├── STATS200/                # Measures of Dispersion
├── STATS300/                # Probability Distributions
├── STATS400/                # Inference & Estimation
├── STATS450/                # Hypothesis Testing
├── STATS500/                # Causality, Correlation & ANOVA
├── LICENSE                  # CC BY-NC 4.0
└── README.md

Who Is This For

  • Aspiring Data Scientists who need a strong statistical foundation
  • Analysts transitioning from Excel/SQL to Python-based statistics
  • Students who want a structured, progressive curriculum with code
  • ML Engineers who want to understand the statistical theory behind models
  • Self-learners who prefer understanding fundamentals over memorizing library calls

License

This work is licensed under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).

You are free to:

  • Share, copy, and redistribute the material
  • Adapt, remix, and build upon the material

Under these terms:

  • Attribution — Credit the original work and link to this repository
  • NonCommercial — You may not use the material for commercial purposes

See the full LICENSE file for details.


Author: Shri Singh
