A structured, hands-on curriculum covering statistics from fundamentals to advanced inference — built entirely with Jupyter Notebooks. Theory (definitions + LaTeX formulas) and practice (Python code) in every lesson.
Most statistics courses are either too theoretical (textbook-only) or too practical (library calls with no intuition). This masterclass bridges the gap — every concept is explained with definitions and formulas, then implemented from first principles in Python before using library shortcuts.
The curriculum follows a university-style progression (STATS100 → STATS500), where each level builds on the previous one.
The building blocks: what statistics is, data types, frequency distributions, and the three pillars of central tendency.
| # | Notebook | Topics |
|---|---|---|
| 01 | Descriptive Statistics Fundamentals | Definition of statistics (descriptive vs inferential), types of data (NOIR scale), population vs sample, frequency distributions, mean (arithmetic, weighted, trimmed), median, mode, geometric & harmonic mean, when to use what, best practices |
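A minimal sketch of the from-first-principles flavor of STATS100 (toy data invented here; SciPy names assumed from the stack below):

```python
import numpy as np
from scipy import stats

data = np.array([2, 3, 3, 5, 7, 8, 9, 100])  # small sample with one outlier

mean = data.sum() / len(data)                 # arithmetic mean, by hand
median = np.median(data)                      # robust to the outlier
mode = stats.mode(data, keepdims=False).mode  # most frequent value
trimmed = stats.trim_mean(data, 0.125)        # drop 12.5% from each tail

print(mean, median, mode, trimmed)
```

Note how the outlier drags the mean to 17.125 while the median stays at 6.0, which is exactly the "when to use what" discussion the notebook builds toward.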
Central tendency alone isn't enough — learn to quantify how spread out data is.
| # | Notebook | Topics |
|---|---|---|
| 01 | Measures of Dispersion | Range, interquartile range (IQR), outlier detection (1.5×IQR rule), mean absolute deviation (MAD), variance (population vs sample, Bessel's correction), standard deviation, empirical rule (68-95-99.7), coefficient of variation (CV), mean absolute error (MAE) |
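A short illustration of the dispersion measures above, computed by hand before reaching for library shortcuts (sample values invented for the example):

```python
import numpy as np

data = np.array([4.0, 8.0, 6.0, 5.0, 3.0, 7.0, 9.0, 2.0])

n = len(data)
mean = data.mean()
var_pop = ((data - mean) ** 2).sum() / n         # population variance (÷ n)
var_samp = ((data - mean) ** 2).sum() / (n - 1)  # Bessel's correction (÷ n-1)
std_samp = var_samp ** 0.5                       # sample standard deviation

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                                    # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # 1.5×IQR outlier fences

assert np.isclose(var_samp, data.var(ddof=1))    # matches NumPy's shortcut
```

Dividing by `n - 1` instead of `n` corrects the downward bias that comes from measuring spread around the sample mean rather than the unknown population mean.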
The mathematical models that describe how data is generated — from coin flips to bell curves.
| # | Notebook | Topics |
|---|---|---|
| 01 | Probability Distributions | Probability axioms, random variables, PMF, PDF, CDF, expected value & variance. Discrete: Bernoulli, Binomial, Poisson, Geometric, Negative Binomial, Hypergeometric, Discrete Uniform. Continuous: Normal, Standard Normal (Z-scores), Exponential, Gamma, Beta, Weibull, Log-Normal, Chi-Square, Student's t, F-distribution. Central Limit Theorem |
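As a taste of the Central Limit Theorem material, the sketch below (simulation parameters chosen for illustration) shows means of a heavily skewed Exponential(1) distribution converging toward a Normal:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# 10,000 samples of size 50 from Exponential(1), which is strongly skewed
sample_means = rng.exponential(scale=1.0, size=(10_000, 50)).mean(axis=1)

# Exponential(1) has mean 1 and variance 1, so by the CLT the sampling
# distribution of the mean should be approximately Normal(1, 1/50)
print(sample_means.mean())   # close to 1.0
print(sample_means.std())    # close to (1/50) ** 0.5

# Standard Normal CDF, the basis for Z-scores: P(Z <= 1.96) ≈ 0.975
print(stats.norm.cdf(1.96))
```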
Drawing conclusions about populations from samples — the core of statistical reasoning.
| # | Notebook | Topics |
|---|---|---|
| 01 | Inference and Estimation | Point estimation (sample mean, variance, proportion), properties of estimators (unbiasedness, consistency, efficiency), biased vs unbiased estimators, sampling distributions, standard error, Maximum Likelihood Estimation (MLE), log-likelihood, method of moments, confidence intervals (Z, t, proportion) |
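A compact example of the estimation ideas above (synthetic data; the true mean of 5.0 is known only because we simulated it): the MLE for a Normal variance divides by `n` and is biased, while a t-based interval quantifies uncertainty about the mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=40)

# MLE for a Normal mean is the sample mean; the MLE variance divides by n
mu_hat = sample.mean()
sigma2_mle = ((sample - mu_hat) ** 2).mean()  # biased (divides by n)
sigma2_unbiased = sample.var(ddof=1)          # Bessel's correction (n - 1)

# 95% t-based confidence interval for the mean
se = sample.std(ddof=1) / np.sqrt(len(sample))   # standard error
t_crit = stats.t.ppf(0.975, df=len(sample) - 1)  # critical t value
ci = (mu_hat - t_crit * se, mu_hat + t_crit * se)
print(ci)
```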
The formal framework for making data-driven decisions under uncertainty.
| # | Notebook | Topics |
|---|---|---|
| 01 | Hypothesis Testing | Null & alternative hypotheses, test statistics, p-values (interpretation & misconceptions), significance level & confidence, Type I/II errors & power, Z-test (one-sample, two-sample), t-test (one-sample, independent, paired/Welch's), skewness (testing & visualization), kurtosis, normality tests (Shapiro-Wilk, D'Agostino), multiple testing & Bonferroni correction |
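A sketch of a single test from the list above, Welch's t-test, on simulated groups (sample sizes and effect size invented for the example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=100, scale=15, size=30)
group_b = rng.normal(loc=110, scale=15, size=30)

# Welch's t-test (no equal-variance assumption); H0: the two means are equal
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

alpha = 0.05
reject = p_value < alpha  # evidence against H0 at the 5% level, not "proof"
print(t_stat, p_value, reject)
```

The p-value is the probability of a test statistic at least this extreme *if H0 were true*; it is not the probability that H0 is true, one of the misconceptions the notebook addresses.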
Measuring relationships between variables and testing for group differences.
| # | Notebook | Topics |
|---|---|---|
| 01 | Correlation, ANOVA & Causality | Correlation vs causation, covariance, Pearson r, Spearman ρ, Kendall τ, point-biserial correlation, correlation heatmaps, chi-square test of independence (Cramér's V), one-way ANOVA (F-test, ANOVA table, η²), post-hoc tests (Tukey HSD), two-way ANOVA (interaction effects), MANOVA (Wilks' Lambda), causal inference methods |
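Two of the ideas above in miniature (all data simulated, group means chosen for illustration): correlation between two related variables, and a one-way ANOVA across three groups.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(size=100)
y = 2 * x + rng.normal(scale=0.5, size=100)  # linear relationship plus noise

r, p_r = stats.pearsonr(x, y)      # Pearson r: linear association
rho, p_rho = stats.spearmanr(x, y)  # Spearman ρ: rank-based, monotonic

# One-way ANOVA; H0: all three group means are equal
g1 = rng.normal(0, 1, 25)
g2 = rng.normal(0, 1, 25)
g3 = rng.normal(2, 1, 25)  # this group's mean is deliberately shifted
f_stat, p_anova = stats.f_oneway(g1, g2, g3)
print(r, rho, f_stat, p_anova)
```

A strong r here reflects the construction of `y` from `x`; as the notebook stresses, the same number from observational data would not by itself establish causation.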
```
STATS100 Descriptive Stats   → Summarize data (mean, median, mode)
        ↓
STATS200 Dispersion          → Measure spread (variance, std, IQR)
        ↓
STATS300 Distributions       → Model data generation (normal, binomial, etc.)
        ↓
STATS400 Estimation          → Infer population parameters (MLE, CIs)
        ↓
STATS450 Hypothesis Testing  → Make decisions (p-values, t-tests, z-tests)
        ↓
STATS500 Relationships       → Measure associations (correlation, ANOVA, causality)
```
| Category | Tools |
|---|---|
| Language | Python 3.13+ |
| Package Manager | uv |
| Notebooks | Jupyter (via VS Code) |
| Core Libraries | NumPy, Pandas, Matplotlib, Seaborn |
| Statistics | SciPy, statsmodels, scikit-learn |
- Python 3.13 or higher
- VS Code with the Jupyter extension
- Git
```bash
# 1. Clone the repository
git clone https://github.com/shri-singh/Statistics-DataScience-Masterclass.git
cd Statistics-DataScience-Masterclass

# 2. Install uv (if not already installed)
pip install uv

# 3. Create the virtual environment and install all dependencies
uv sync

# 4. Activate the environment
source .venv/Scripts/activate   # Git Bash on Windows
source .venv/bin/activate       # macOS / Linux

# 5. Register the Jupyter kernel
python -m ipykernel install --user --name=stats-demo --display-name "Stats Demo"

# 6. Open in VS Code
code .
```

Then open any `.ipynb` file and select the "Stats Demo" kernel from the top-right kernel picker.
```
Statistics-DataScience-Masterclass/
├── STATS100/      # Descriptive Statistics Fundamentals
├── STATS200/      # Measures of Dispersion
├── STATS300/      # Probability Distributions
├── STATS400/      # Inference & Estimation
├── STATS450/      # Hypothesis Testing
├── STATS500/      # Correlation, ANOVA & Causality
├── LICENSE        # CC BY-NC 4.0
└── README.md
```
- Aspiring Data Scientists who need a strong statistical foundation
- Analysts transitioning from Excel/SQL to Python-based statistics
- Students who want a structured, progressive curriculum with code
- ML Engineers who want to understand the statistical theory behind models
- Self-learners who prefer understanding fundamentals over memorizing library calls
This work is licensed under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).
You are free to:
- Share, copy, and redistribute the material
- Adapt, remix, and build upon the material
Under these terms:
- Attribution — Credit the original work and link to this repository
- NonCommercial — You may not use the material for commercial purposes
See the full LICENSE file for details.
Author: Shri Singh