Theoretical Machine Learning (4490/5490) class, Spring 2026, at CU Boulder
Details on what we covered in previous semesters:
- spring 2020 at Lectures2020.md
- spring 2022 at Lectures2022.md
- spring 2024 at Lectures2024.md (1st day of class was Wed Jan 17)
Abbreviations for sources:
- [SSS] is Shai Shalev-Shwartz and Shai Ben-David's Understanding Machine Learning
- [Mohri] is Mehryar Mohri, Afshin Rostamizadeh and Ameet Talwalkar's Foundations of Machine Learning (2018, 2nd ed, MIT press)
- [Woodruff] is David Woodruff's 2014 Sketching as a Tool for Numerical Linear Algebra
- [Hastie] is The Elements of Statistical Learning by Hastie, Tibshirani and Friedman (2nd ed, 2009)
- [Vogel] is Computational Inverse Problems by Vogel (2002).
- [RW] is Gaussian Processes for Machine Learning by Rasmussen and Williams (2006, MIT press, free PDFs online)
- [Murphy] is Kevin Murphy's Machine Learning: a Probabilistic Perspective (2012, MIT press)
- [SuttonBarto] is Richard Sutton and Andrew Barto's Reinforcement Learning: An Introduction (2018, 2nd edition)
- [Puterman] is Martin Puterman's Markov Decision Processes: Discrete Stochastic Dynamic Programming (1994, John Wiley)
- [Fasshauer] is Kernel-based approximation methods using Matlab by G Fasshauer and M McCourt (World Scientific, 2015)
- [Fri 1/9/2026] Introduction, ch 1 [SSS], parts of ch 1.3 [Mohri]. What is ML, compare to other types of learning, types of learning (supervised, etc.), standard tasks, papaya example, inductive bias and generalization. See 01_Intro, partway through 02 More Intro and Terminology
- [Mon 1/12/2026] Continue 02 More Intro and Terminology, start 03 Adding Inductive Bias
- [Wed 1/14/2026] Finish 04 FiniteHypothesisClass and the definition of PAC learning, then the key analysis in 05 Analysis of Finite Hypothesis Class
- [Fri 1/16/2026] Finish discussion on agnostic PAC learning in 05 Analysis of Finite Hypothesis Class, define agnostic PAC learning.
- Did in class exercise on different notions of convergence for functions (e.g., pointwise, uniform, L^p)
- Later, we will discuss different types of convergence of random variables (in expectation/L1, vs in probability/measure, vs almost sure). See probability handout on Canvas
- See also the supplementary notes on measure-theoretic probability, which are a reference/cheat-sheet only (no examples or discussion). This material is standard, so any measure-theoretic probability text will do, though these notes in particular are taken from Joel Tropp's lecture notes for the class Probability Theory and Computational Mathematics
- Note: 06 StatLearningTerminology.pdf is a cheat-sheet of terminology that may be a helpful reference
- [Mon 1/19/2026] No class due to MLK Jr holiday
- [Wed 1/21/2026] In-class exercise on 06a Big-O notation.pdf (more details at wikipedia's big-O notation). Then start 07 Uniform Convergence, going up to Markov's inequality
- [Fri 1/23/2026] Continue on 07 Uniform Convergence, proving Hoeffding's inequality
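As a quick sanity check on Hoeffding's inequality, here's a small simulation (an illustrative sketch, not from the course notes; the values of n, t, and the Bernoulli setup are chosen arbitrarily) comparing the empirical tail probability of a sample mean to the bound 2·exp(-2nt^2) for variables bounded in [0,1]:

```python
# Empirical check of Hoeffding's inequality for i.i.d. Bernoulli(p) samples:
# P(|mean - p| >= t) <= 2*exp(-2*n*t^2) when each variable lies in [0,1].
import numpy as np

rng = np.random.default_rng(0)
n, t, trials = 100, 0.1, 20_000
p = 0.5

samples = rng.random((trials, n)) < p          # Bernoulli(p) draws, shape (trials, n)
deviations = np.abs(samples.mean(axis=1) - p)  # |empirical mean - p| per trial
empirical = (deviations >= t).mean()           # observed tail probability
hoeffding = 2 * np.exp(-2 * n * t**2)          # Hoeffding upper bound

print(f"empirical tail: {empirical:.4f}, Hoeffding bound: {hoeffding:.4f}")
```

The bound is loose here (as concentration inequalities typically are for small n), but the simulation confirms the empirical tail never exceeds it.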
- [Mon 1/26/2026] 08 No Free Lunch theorem, and hence the need for some inductive bias.
- [Wed 1/28/2026] 09 Bias Variance Tradeoff (ch 5), discussing double-descent and the James-Stein estimator. Did an in-class exercise to prove the James-Stein result
- [Fri 1/30/2026] 10 Intro to Rademacher Complexity, introducing Rademacher complexity. We follow the Mohri textbook for a lot of this. In-class exercise on sup of expectations
- See Supplemental notes: Commuting operators for when you can interchange operations like limit and integrals, or sup and inf, etc.
- [Mon 2/2/2026] Finish #10, start 11 Generalization via Rademacher Complexity, getting to McDiarmid's inequality
- [Wed 2/4/2026] Finish #11, then start 12 More Rademacher, and Covering Numbers.
- in class exercise: is there a difference between "bounded" and "totally bounded"?
- Note: we are postponing notes 13 and 14 until after we cover notes 15--17 so that we can cover VC dimension since we'll use it on the homework
- [Fri 2/6/2026] Finish #12 (the covering number part), then start 15 Growth Function. In-class exercises
- [Mon 2/9/2026] Finish 15 Growth Function, intro to 16 VC Dimension, and in-class exercises
- [Wed 2/11/2026] More on notes #16, in-class exercises on VC dimension
- [Fri 2/13/2026] Finish #16, and cover 17 Fundamental Theorem of ML. In-class exercises on VC dimension.
- [Mon 2/16/2026] Start 13 (Aside) Dudley's Chaining. We didn't cover this in 2024, but we're covering it in 2026. More details on the full stochastic-process version are in Wainwright 5.3.3, Tropp's ACM 217 notes, or Vershynin's chapter 8.
- in class exercise: if set A is the unit ball and B is a given rectangle, what's the Minkowski sum A+B?
- [Wed 2/18/2026] Finish 13 (Aside) Dudley's Chaining.
- in class exercise: if X is uniform[0,1] and Y is uniform[10,15] and X and Y are independent, then what's the distribution of X + Y? And is this the same as the distribution that adds the PDFs together (and scales by 1/2)? i.e., mixture model vs additive model/convolution.
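One way to see the difference asked about in this exercise is to simulate both constructions (a quick illustrative sketch, not from the course notes): the sum of independent r.v.s has a PDF given by the convolution, while averaging the two PDFs gives a mixture with very different support.

```python
# Contrast the sum of independent r.v.s (convolution of PDFs) with a 50/50
# mixture (average of PDFs), for X ~ Unif[0,1] and Y ~ Unif[10,15].
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
X = rng.uniform(0, 1, N)
Y = rng.uniform(10, 15, N)

Z_sum = X + Y                                  # convolution: support [10, 16]
pick = rng.random(N) < 0.5                     # mixture: support [0,1] U [10,15]
Z_mix = np.where(pick, rng.uniform(0, 1, N), rng.uniform(10, 15, N))

print(Z_sum.min(), Z_sum.max())   # all sums land in [10, 16]
print((Z_mix < 1.0).mean())       # about half the mixture mass is below 1
```

The sum never produces values below 10, while the mixture puts roughly half its mass on [0,1], so the two distributions are clearly not the same.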
- [Fri 2/20/2026] 14 (Aside) Johnson-Lindenstrauss.
- in class exercise: show that the sum of independent normal random variables is still normal (without doing any integrals). Solution sketch: use the MGF, or the characteristic function, or the Fourier transform.
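For reference, the MGF version of the argument can be written out as follows (a sketch of the standard derivation, assuming the normal MGF formula is known):

```latex
% MGF of X ~ N(\mu, \sigma^2):  M_X(t) = e^{\mu t + \sigma^2 t^2/2}.
% For independent X_1 \sim N(\mu_1, \sigma_1^2), X_2 \sim N(\mu_2, \sigma_2^2):
\begin{align*}
M_{X_1+X_2}(t) &= \mathbb{E}\, e^{t(X_1+X_2)}
               = M_{X_1}(t)\, M_{X_2}(t) \quad \text{(by independence)} \\
               &= e^{\mu_1 t + \sigma_1^2 t^2/2}\; e^{\mu_2 t + \sigma_2^2 t^2/2}
               = e^{(\mu_1+\mu_2)\,t + (\sigma_1^2+\sigma_2^2)\,t^2/2},
\end{align*}
% which is exactly the MGF of N(\mu_1+\mu_2,\ \sigma_1^2+\sigma_2^2); since the
% MGF (finite on a neighborhood of 0) determines the distribution, we're done.
```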
- [Mon 2/23/2026] Finish #14 on the chaining argument for subspaces; start 18 Linear Predictors (part 1 classification), cover binary predictors, introduce linear programs and discuss their complexity. Discuss ERM for binary classification (tractable iff separable).
- [Wed 2/25/2026] Finish #18 (Perceptron), start 19 Linear Predictors (part 2 regression). We spent the last 30 min on laptops doing the Least Squares Programming Challenge (solutions are on the Solutions branch of the git repo; see LeastSquaresChallenge_soln.ipynb)
- [Fri 2/27/2026] Finish #19, going over the ERM methods and then discussing pseudo-dimension.
NOTE: These are still the 2024 dates, and will be adjusted to 2026 dates gradually
- [Wed 2/28/2024] Cover 20 Linear Predictors (part 3 logistic regression).pdf. Logistic regression and GLMs; derive the loss function based on maximum likelihood; discuss the log-sum-exp trick (e.g., numpy.logaddexp and numpy.log1p)
- [Fri 3/1/2024] Start 21 Boosting.pdf. Gamma-weak-learners, motivate the need for boosting; example with a 3-piece classifier and decision stump (10.1 in [SSS]), and complexity of computing the ERM of decision stumps.
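The log-sum-exp trick mentioned above can be illustrated with a minimal numpy sketch (the function names here are my own, for illustration): the naive logistic loss log(1 + exp(z)) overflows for large z, while numpy.logaddexp(0, z) computes log(exp(0) + exp(z)) stably.

```python
# Stable vs naive evaluation of the logistic loss log(1 + exp(z)).
import numpy as np

def logistic_loss_naive(z):
    return np.log(1 + np.exp(z))   # overflows: exp(1000) -> inf

def logistic_loss_stable(z):
    return np.logaddexp(0.0, z)    # = log(exp(0) + exp(z)), stable for any z

z = np.array([-1000.0, 0.0, 1000.0])
with np.errstate(over="ignore"):
    print(logistic_loss_naive(z))  # last entry overflows to inf
print(logistic_loss_stable(z))     # [0., log 2, 1000.] -- no overflow
```

The same idea (subtract the max before exponentiating) underlies stable softmax implementations; numpy.log1p plays the analogous role for accurate log(1 + x) when x is tiny.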
- [Mon 3/4/2024] Continue on #21, complexity of sorting, top-k problems, median finding, shuffling (Fisher-Yates-Knuth shuffle). Comparison to Bootstrap and Bagging. Start 22 AdaBoost.pdf
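A minimal sketch of the Fisher-Yates-Knuth shuffle mentioned above (my own illustrative implementation, not the course's code): it produces a uniformly random permutation in O(n) time.

```python
# Fisher-Yates(-Knuth) shuffle: uniform over all n! permutations, O(n) time.
import random

def fisher_yates(xs, rng=None):
    """Return a uniformly shuffled copy of xs."""
    rng = rng or random.Random()
    xs = list(xs)
    for i in range(len(xs) - 1, 0, -1):
        j = rng.randint(0, i)           # uniform over {0, ..., i}, inclusive
        xs[i], xs[j] = xs[j], xs[i]     # swap position i with a random earlier one
    return xs

print(fisher_yates(range(5), random.Random(0)))
```

The key point is that j is drawn from {0, ..., i}, shrinking with i; the common buggy variant that draws j from the whole range at every step produces n^n equally likely outcomes, which cannot map uniformly onto n! permutations.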
- [Wed 3/6/2024] Class canceled
- [Fri 3/8/2024] Finish #22 on AdaBoost, analysis of training error convergence. Start #23 Model Selection and Validation
- [Mon 3/11/2024] Continue on #23 Model Selection and Validation. Overall topics this week and next:
- Structured Risk Minimization (SRM). Ref: [SSS]
- Validation set (the test/train/validation split), Bonferroni correction
- Mallows' C_p / Unbiased Predictive Risk Estimate (UPRE); and Stein's Unbiased Risk Estimate (SURE)
- AIC and BIC, loosely based on Hastie et al.
- Adjusted R^2 / coefficient of determination
- Minimum Description Length (ref: 7.8 Hastie et al. and Grunwald tutorial), coding theory, Kolmogorov complexity, minimum message length. See A tutorial introduction to the minimum description length principle (Peter Grunwald, 2004).
- Morozov Discrepancy Principle (ref: 7.3 Vogel)
- "L-curve" method (like elbow method), ref Vogel
- Bootstrap resampling (ref: 7.11 Hastie et al.)
- Cross-validation, loosely following Hastie et al.
- Generalized CV (GCV), Sherman-Morrison-Woodbury / matrix inversion lemma, Neumann series. Ref: Vogel.
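The Sherman-Morrison-Woodbury identity used in the GCV derivation can be sanity-checked numerically (an illustrative sketch; the matrix sizes and random inputs are arbitrary):

```python
# Numerical check of the Sherman-Morrison-Woodbury / matrix inversion lemma:
#   (A + U C V)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1},
# which underlies the fast leave-one-out / GCV formulas.
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 2
A = np.diag(rng.uniform(1, 2, n))       # easy-to-invert "base" matrix
U = rng.standard_normal((n, k))         # low-rank update factors
C = np.eye(k)
V = rng.standard_normal((k, n))

lhs = np.linalg.inv(A + U @ C @ V)      # direct n x n inverse
Ainv = np.linalg.inv(A)                 # cheap (diagonal) inverse
rhs = Ainv - Ainv @ U @ np.linalg.inv(np.linalg.inv(C) + V @ Ainv @ U) @ V @ Ainv

print(np.allclose(lhs, rhs))
```

The payoff is that the right-hand side only inverts a k x k matrix (plus the cheap A inverse), which is why rank-one updates of regularized solutions are so inexpensive.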
- [Wed 3/13/2024] Exam review
- [Fri 3/15/2024] Midterm planned but canceled due to snow day
- [Mon 3/18/2024] Midterm
- [Wed 3/20/2024] Continue on #23 Model Selection and Validation and start #24 More Model Selection and Validation
- [Fri 3/22/2024] Continue on #24 More Model Selection and Validation, start #25 More Model Selection and Validation
- [Mon 4/1/2024] Finish #25 More Model Selection and Validation
- [Wed 4/3/2024] Start #26 Convex Learning Problems on convex optimization
- [Fri 4/5/2024] Continue on #26
- We're skipping Spring2020/ch12_convexInequalities and SubgradientDescent but read on your own if you want
- [Mon 4/8/2024] Start Spring2020/ch13_stability_article
- [Wed 4/10/2024] Finish lecture from last class, start and finish Spring2020/ch13_stability_part2_OneNote
- [Fri 4/12/2024] Start ch14 SGD (L1 convergence proof). Some of this is based on Unified analysis of gradient/subgradient descent, which we mostly skipped. Discuss L1 vs L2 convergence (convergence in mean vs. quadratic mean), almost sure convergence, etc. Discuss Stochastic Approximation (SA) vs Sample Average Approximation (SAA/ERM)
- [Mon 4/15/2024] Continue SGD; discuss types of convergence of random variables (formal definitions and subtleties) and did in-class worksheet on this SGD - Random Variable Convergence Worksheet.pdf
- [Wed 4/17/2024] (In-class quiz) Continue SGD. Discuss when you can commute expectation and gradient. (2026 update: see pages 2 and 3 of Commuting operators)
- [Fri 4/19/2024] Continue SGD, start ch15 SVM
- [Mon 4/22/2024] Finish ch15 SVM, start ch16 kernels; (separable and non-separable cases, hard vs soft SVM, analysis without dimension dependence)
- [Wed 4/24/2024] More on kernels. Motivation for kernels; the kernel trick, example with kernel ridge regression. Derivation via matrix inversion lemma.
- [Fri 4/26/2024] Finish ch16 kernels. Examples of kernels (polynomial, Gaussian, Matern). Kernel-SVM, kernel-ridge regression, kernel-PCA, nearest neighbor, kernel density estimation. Thm 16.1 Representer Thm, Lemma 16.2 (simplified Mercer's Thm), Reproducing Kernel Hilbert Spaces (RKHS). Random Fourier Features (Recht and Rahimi '07) and Bochner's theorem and the Nystrom method.
- Refs: mostly ch 16 in [SSS] but also some from [Murphy] and some from [Fasshauer].
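A minimal sketch of the Random Fourier Features idea mentioned above (my own illustration, assuming a unit-bandwidth Gaussian kernel; not the course notes' code): by Bochner's theorem, k(x, y) = exp(-||x - y||^2 / 2) is the Fourier transform of a Gaussian spectral density, so it can be approximated by an inner product of random cosine features.

```python
# Random Fourier Features (Rahimi & Recht '07) for the Gaussian kernel:
#   k(x, y) = exp(-||x - y||^2 / 2)  ~=  z(x) . z(y),
#   z(x) = sqrt(2/D) * cos(W x + b),  rows of W ~ N(0, I),  b ~ Unif[0, 2*pi).
import numpy as np

rng = np.random.default_rng(0)
d, D = 5, 5000                        # input dimension, number of random features

W = rng.standard_normal((D, d))       # samples from the spectral (Gaussian) density
b = rng.uniform(0, 2 * np.pi, D)      # random phases

def z(x):
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x, y = rng.standard_normal(d), rng.standard_normal(d)
exact = np.exp(-np.linalg.norm(x - y) ** 2 / 2)
approx = z(x) @ z(y)
print(abs(exact - approx))            # error decays like O(1/sqrt(D))
```

With the explicit feature map z in hand, kernel ridge regression reduces to ordinary D-dimensional ridge regression, avoiding the n x n Gram matrix entirely.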
- [Mon 4/29/2024] Finish RKHS intro, then start Gaussian Processes on GPs for regression, Bayesian setup, estimation and forecasting, facts about Gaussians
- [Wed 5/1/2024] (Last day of class) Finish Gaussian Processes.
Note: unlike 2022 and 2020, in 2024 and 2026 we'll have project presentations during our 3 hour final exam slot, freeing up an extra 3 days of lecture
In 2020, since we were online at the end due to the pandemic, we went faster (prewritten notes) and were able to cover the following (which we didn't get to in 2022 or 2024):
Week N/A. Ch 16 Kernel methods [SSS], and Gaussian Processes [Murphy], and ch 20 Neural Nets [SSS and various sources]
- [Mon 4/18] Finish kernels, going over Random Fourier Features (Recht and Rahimi '07), Bochner's theorem, and the Nystrom method. Most of my notes followed ch 16 in [SSS] but some from Murphy. Then start Gaussian Processes, mostly following Murphy's textbook and my written notes Gaussian Processes, for classification but mostly for regression.
- [Wed 4/20] More on GPs for regression, Bayesian setup, estimation and forecasting, facts about Gaussians. Start on ch 20 Neural Networks ch20_NN_part1_approxError
- [Fri 4/22] More on approximation error of neural networks; didn't have time to talk about estimation error (generalization) ch20_NN_part2_estimationError nor optimization error ch20_NN_part3_optimizationError
- [Mon 4/6] Neural nets description, background and history, discussion of approximation error (e.g., universal function approximation, like Stone-Weierstrass style density theorems) for many variants (e.g., L^1 density, density in continuous functions with uniform norm, exact representation of Boolean functions, etc.). Lower bounds on the size of networks needed to approximate functions. Some from the book, some from recent neural net papers of the past 4 years. Discussion of shortcomings of classical theory, some mention of modern algorithm-dependent approaches. PDF of notes (handwritten) about neural net approximation error
- [Wed 4/8] Short lecture on bounding the VC dimension of neural nets. Proof for one activation function, results stated for two more activation functions. PDF of notes (handwritten) about neural net estimation error
- [Fri 4/10] Short lecture on the NP-hardness of ERM (e.g., training) for neural nets (no proof), discussion of SGD again, except in the non-convex case. Introduce reverse-mode Automatic Differentiation (at a high level, no example) and backpropagation for neural nets. PDF of notes (handwritten) about neural net optimization error
- See the related neural net demo in Matlab, showing an example of two neural nets for the same problem, both with zero empirical risk: one hand-tuned (with bad generalization error), the other trained via SGD (with much better generalization error).
- [Mon and Wed 4/13 and 4/15] Online learning for binary classification, discussing the Consistent, Halving, and Standard Optimal algorithms of Littlestone. Discuss the Littlestone dimension and shattering trees. Prove mistake bounds and regret bounds. PDF of notes (handwritten) about online classification
- [Fri 4/17] The doubling trick and online-to-batch conversion. Convex online learning (skipping the proof, as it's similar to ch 14), and briefly mention the perceptron. See Shalev-Shwartz's 2011 monograph on Online Learning for more background. PDF of notes (handwritten) about doubling/online-to-batch/online-convex
- [Mon 4/20] Introduction to Reinforcement Learning (RL), mostly following Mohri, but with examples from [SuttonBarto] and [Puterman]. Give examples: MuJoCo, AlphaGo, Tesauro's backgammon, the Pig dice game. Define an infinite-horizon, discounted Markov Decision Process (MDP), define the value of a policy, and define an optimal policy. Discuss finite MDPs and deterministic policies. PDF of notes (handwritten) about intro to RL
- [Wed 4/22] Theoretical background on optimality and the state-action value function Q, eventually deriving the Bellman Equations. PDF of notes (handwritten) about Bellman Equations
- [Fri 4/24] (Note: for the next three classes, there are presentations, but still two lectures) Planning algorithms (aka dynamic programming) including value iteration (and variants like Gauss-Seidel), policy iteration (and variants, like modified policy iteration), and the linear programming formulation. PDF of notes (handwritten) about Planning Algorithms
NOTE in 2024, Ashutosh Trivedi is teaching a special topics course on Reinforcement Learning in the CS dept (CSCI 4831/7000)
- [Mon 4/27] Learning algorithms: very short intro on Stochastic Approximation as a generalization of the law of large numbers, and on a supermartingale convergence theorem, then on the Temporal Difference TD(0) and Q-learning algorithms. PDF of notes (handwritten) about Learning Algorithms
(high-level)
Classical Statistical Learning Theory
We mainly focus on supervised statistical batch learning with a passive learner.
- Ch 1: Intro to class: what is it about?
- Ch 2: Formal models (statistical learning), Empirical Risk Minimization (ERM), finite hypothesis class
- Ch 3: Formal Learning model: Probably-Almost-Correct (PAC)
- Ch 4: Learning via Uniform Convergence (and concentration inequalities, cf Appendix B and Vershynin)
- Ch 5: Bias-Complexity Tradeoff, double-descent, no-free-lunch theorems
- Ch 6: VC-Dimension
- Ch 26: Rademacher Complexity (and ch 3.1 in Mohri)
- Ch 27: Covering Numbers
Analysis of Algorithms
As time permits, we will analyze standard algorithms.
- Ch 9: Linear predictors
- Ch 10: Boosting, AdaBoost
- Ch 11: Model selection and validation
- Ch 12: Convex learning problems (generalization bounds)
- Ch 13: Regularization and Stability
- Ch 15: Support Vector Machines (SVM)
- Ch 16: Kernel methods
- Ch 20: Neural Networks, expressive power, and new results about deep networks (2017–now)
Additional Topics
We will cover these as we have time (which we probably won't).
- Ch 21: Online Learning
- Reinforcement learning (ch 17 in Mohri)
- Background on Information Theory (Appendix E in Mohri)
- Max Entropy (ch 12 in Mohri)
- Ch 22: Clustering (K-means, spectral clustering, information bottleneck)
- Probabilistic analysis a la Arthur and Vassilvitskii's Kmeans++
- Ch 7: Nonuniform Learnability
- Computational Complexity models (Turing Machines; see Scott Aaronson book)
- Ch 8: Computational Complexity of learning
- Ch 14: Stochastic Gradient Descent (edit: we usually cover this at least partially)
- More stats, e.g., Expectation Maximization
- Variational Inference, ELBO
- Information Theory, information bottleneck
- Generative Models (GANS, Variational AutoEncoders, Diffusion Models)
- Equivariance and Invariance results; group theory
- Kernel methods in more detail; RKHS
- Recent papers from the literature
Skills we hope students develop
- Statistics
  - More comfort with multivariate random variables, e.g., the multivariate Gaussian
  - Convergence of random variables
  - Concentration inequalities
  - When is E[gradient f] = gradient E[f], etc.
  - Cross-validation and regularization techniques; bootstrap
  - Intro to chaining techniques
- Basic analysis
  - Lots of inequalities
  - Comfort with function classes, function spaces
  - Comfort with kernel methods
- Basic optimization theory, and basic stochastic processes (either algorithmic or Gaussian Processes)
- Some discrete math
  - VC dimension calculations