Day by day lecture schedule, spring 2026

Theoretical Machine Learning (4490/5490) class, Spring 2026, at CU Boulder

Details on what we covered in previous semesters:

Abbreviations for sources:

Week 1. Ch 1 [SSS]

  • [Fri 1/9/2026] Introduction, ch 1 [SSS], parts of ch 1.3 [Mohri]. What is ML, compare to other types of learning, types of learning (supervised, etc.), standard tasks, papaya example, inductive bias and generalization. See 01_Intro, partway through 02 More Intro and Terminology

Week 2. Ch 1, 2, 3 [SSS] and Finite Hypothesis Class

Week 3. Ch 4 [SSS] on uniform convergence

Week 4. No Free Lunch / Bias-Variance, Rademacher complexity

Week 5. Rademacher complexity etc. [Mohri et al.]

  • [Mon 2/2/2026] Finish #10, start 11 Generalization via Rademacher Complexity getting to McDiarmid's inequality

  • [Wed 2/4/2026] Finish #11, then start 12 More Rademacher, and Covering Numbers.

    • in class exercise: is there a difference between "bounded" and "totally bounded"?
    • Note: we are postponing notes 13 and 14 until after we cover notes 15--17 so that we can cover VC dimension since we'll use it on the homework
  • [Fri 2/6/2026] Finish #12 (the covering number part), then start 15 Growth Function. In class exercises

Week 6. VC dimension [SSS] and [Mohri et al.]

Week 7. Chaining and Johnson Lindenstrauss

  • [Mon 2/16/2026] Start 13 (Aside) Dudley's Chaining. We didn't cover this in 2024, but we're covering it in 2026. More details on the full stochastic process version are in Wainwright 5.3.3, Tropp's ACM 217 notes, or Vershynin's chapter 8.

    • in class exercise: if set A is the unit ball and B is a given rectangle, what's the Minkowski sum A+B?
  • [Wed 2/18/2026] Finished 13 (Aside) Dudley's Chaining.

    • in class exercise: if X is uniform[0,1] and Y is uniform[10,15] and X and Y are independent, then what's the distribution of X + Y? And is this the same as the distribution that adds the PDFs together (and scales by 1/2)? i.e., mixture model vs additive model/convolution.
  • [Fri 2/20/2026] 14 (Aside) Johnson-Lindenstrauss.

    • in class exercise: show that the sum of independent normal r.v.s is still normal (without doing any integrals). Solution sketch: use the MGF, or the characteristic function, or the Fourier Transform.
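
A possible write-up of the MGF route for the last exercise, as a sketch only. It assumes X and Y are independent and takes the formula for the normal MGF as known; the symbols \mu_i and \sigma_i are notation introduced here, not taken from the lecture notes.

```latex
Let $X \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $Y \sim \mathcal{N}(\mu_2, \sigma_2^2)$ be independent, so that
\[
  M_X(t) = \mathbb{E}\, e^{tX} = \exp\!\Big(\mu_1 t + \tfrac{1}{2}\sigma_1^2 t^2\Big),
  \qquad
  M_Y(t) = \exp\!\Big(\mu_2 t + \tfrac{1}{2}\sigma_2^2 t^2\Big).
\]
Independence gives
\[
  M_{X+Y}(t) = \mathbb{E}\, e^{tX} e^{tY} = M_X(t)\, M_Y(t)
  = \exp\!\Big((\mu_1 + \mu_2)\, t + \tfrac{1}{2}(\sigma_1^2 + \sigma_2^2)\, t^2\Big),
\]
which is the MGF of $\mathcal{N}(\mu_1 + \mu_2,\ \sigma_1^2 + \sigma_2^2)$.
Since an MGF that is finite in a neighborhood of $0$ determines the distribution,
$X + Y \sim \mathcal{N}(\mu_1 + \mu_2,\ \sigma_1^2 + \sigma_2^2)$.
```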

Week 8. ch 9 of [SSS] on linear predictors

2024 schedule:

NOTE: These are still the 2024 dates, and will be adjusted to 2026 dates gradually

Week 7. Linear predictors, ch 10 of [SSS] on boosting

  • [Wed 2/28/2024] Cover 20 Linear Predictors (part 3 logistic regression).pdf. Logistic regression and GLM; derive loss function based on maximum likelihood; discuss log-sum-exp trick (e.g., numpy.logaddexp and numpy.log1p)
  • [Fri 3/1/2024] Start 21 Boosting.pdf. gamma-weak-learners, motivate need for boosting; example with 3-piece classifier and decision stump (10.1 in [SSS]), and complexity of computing ERM of decision stumps.
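
To illustrate the log-sum-exp remark in the logistic regression lecture, here is a minimal sketch (not the course's own code) of the logistic loss with labels in {-1, +1}, computed stably with `numpy.logaddexp`. The function names and the toy data are assumptions made for this example.

```python
import numpy as np

def logistic_loss(w, X, y):
    """Average logistic loss for labels y in {-1, +1}.

    log(1 + exp(-m)) is evaluated as logaddexp(0, -m), which stays
    finite even when the margin m is very negative.
    """
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -margins))

def naive_logistic_loss(w, X, y):
    """Same formula written literally; exp overflows for large -m."""
    margins = y * (X @ w)
    return np.mean(np.log(1.0 + np.exp(-margins)))

# Toy data (hypothetical): both points are badly misclassified, margin = -1000.
X = np.array([[1000.0], [-1000.0]])
y = np.array([-1.0, 1.0])
w = np.array([1.0])

stable = logistic_loss(w, X, y)
with np.errstate(over="ignore"):
    naive = naive_logistic_loss(w, X, y)

# log1p handles the opposite regime: log(1 + x) for tiny x.
tiny = np.log1p(1e-20)          # keeps the 1e-20
lost = np.log(1.0 + 1e-20)      # 1 + 1e-20 rounds to 1, so this is 0.0
```

Here `stable` is about 1000 while `naive` overflows to infinity, which is the failure the trick avoids.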

Week 8. Finish boosting, start model selection/validation (ch 11 of [SSS])

  • [Mon 3/4/2024] Continue on #21, complexity of sorting, top-k problems, median finding, shuffling (Fisher-Yates-Knuth shuffle). Comparison to Bootstrap and Bagging. Start 22 AdaBoost.pdf
  • [Wed 3/6/2024] Class canceled
  • [Fri 3/8/2024] Finish #22 on AdaBoost, analysis of training error convergence. Start #23 Model Selection and Validation
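
The Fisher-Yates-Knuth shuffle mentioned on Monday can be sketched in a few lines. This is an illustrative version rather than the one from lecture; the function name and the optional `rng` argument are choices made for this example.

```python
import random

def fisher_yates_shuffle(items, rng=random):
    """Return a uniformly random permutation of items in O(n) time.

    Walk from the last position down; swap position i with a position j
    chosen uniformly from 0..i (inclusive).
    """
    a = list(items)
    for i in range(len(a) - 1, 0, -1):
        j = rng.randint(0, i)  # randint includes both endpoints
        a[i], a[j] = a[j], a[i]
    return a

# Seeded generator so the run is reproducible.
shuffled = fisher_yates_shuffle(range(10), random.Random(0))
```

Drawing j from 0..i, rather than from the whole range, is what makes every one of the n! permutations equally likely.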

Week 9. Model Selection and Validation, and midterm

  • [Mon 3/11/2024] Continue on #23 Model Selection and Validation. Overall topics this week and next:

    1. Structural Risk Minimization (SRM). Ref: [SSS]
    2. Validation set (the test/train/validation split), Bonferroni correction
    3. Mallows's C_p / Unbiased Predictive Risk Estimate (UPRE); and Stein's Unbiased Risk Estimate (SURE)
    4. AIC and BIC, loosely based on Hastie et al.
    5. Adjusted R^2 / coefficient of determination
    6. Minimum Description Length (ref: 7.8 Hastie et al. and Grünwald tutorial), coding theory, Kolmogorov complexity, minimum message length. See A tutorial introduction to the minimum description length principle (Peter Grünwald, 2004).
    7. Morozov Discrepancy Principle (ref: 7.3 Vogel)
    8. "L-curve" method (like elbow method), ref Vogel
    9. Bootstrap resampling (ref: 7.11 Hastie et al.)
    10. Cross-validation, loosely following Hastie et al.
    11. Generalized CV (GCV), Sherman-Morrison-Woodbury / matrix inversion lemma, Neumann series. Ref: Vogel.
  • [Wed 3/13/2024] Exam review

  • [Fri 3/15/2024] Midterm planned but canceled due to snow day
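
Two of the model-selection topics listed for this week, cross-validation and GCV, are linked by a rank-one update (Sherman-Morrison) identity. For ridge regression without an intercept, the leave-one-out residual equals the ordinary residual divided by 1 - H_ii, where H is the hat matrix, and GCV replaces H_ii by its average trace(H)/n. The sketch below checks the identity numerically; the function names, data, and value of `lam` are assumptions for illustration.

```python
import numpy as np

def ridge_hat_matrix(X, lam):
    """H = X (X^T X + lam I)^{-1} X^T, so fitted values are H @ y."""
    d = X.shape[1]
    return X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)

def loo_residuals_shortcut(X, y, lam):
    """Leave-one-out residuals from a single fit."""
    H = ridge_hat_matrix(X, lam)
    return (y - H @ y) / (1.0 - np.diag(H))

def loo_residuals_brute_force(X, y, lam):
    """Leave-one-out residuals by refitting n times."""
    n, d = X.shape
    res = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        Xk, yk = X[keep], y[keep]
        w = np.linalg.solve(Xk.T @ Xk + lam * np.eye(d), Xk.T @ yk)
        res[i] = y[i] - X[i] @ w
    return res

def gcv_score(X, y, lam):
    """Generalized cross-validation: H_ii replaced by trace(H) / n."""
    n = X.shape[0]
    H = ridge_hat_matrix(X, lam)
    return np.mean(((y - H @ y) / (1.0 - np.trace(H) / n)) ** 2)

# Synthetic data (hypothetical).
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.standard_normal(30)

fast = loo_residuals_shortcut(X, y, lam=0.5)
slow = loo_residuals_brute_force(X, y, lam=0.5)
gcv = gcv_score(X, y, lam=0.5)
```

The two residual vectors agree to floating-point accuracy, so leave-one-out CV for ridge costs one fit instead of n.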

Week 10. Model Selection and Validation

Spring break

Week 11. Finish Model Selection, start optimization (ch 12, 13 [SSS])

Week 12. Algorithmic Stability (ch 13, 14 [SSS])

Week 13. SGD (ch 14 [SSS])

Week 14. SVM and Kernels

  • [Mon 4/22/2024] Finish ch15 SVM, start ch16 kernels; (separable and non-separable cases, hard vs soft SVM, analysis without dimension dependence)
  • [Wed 4/24/2024] More on kernels. Motivation for kernels; the kernel trick, example with kernel ridge regression. Derivation via matrix inversion lemma.
  • [Fri 4/26/2024] Finish ch16 kernels. Examples of kernels (polynomial, Gaussian, Matérn). Kernel-SVM, kernel-ridge regression, kernel-PCA, nearest neighbor, kernel density estimation. Thm 16.1 Representer Thm, Lemma 16.2 (simplified Mercer's Thm), Reproducing Kernel Hilbert Spaces (RKHS). Random Fourier Features (Rahimi and Recht '07) and Bochner's theorem and the Nyström method.
    • Refs: mostly ch 16 in [SSS] but also some from [Murphy] and some from [Fasshauer].
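
The matrix inversion lemma step behind kernel ridge regression can be checked numerically. For the objective ||Xw - y||^2 + lam ||w||^2, the primal solution (X^T X + lam I)^{-1} X^T y equals X^T (X X^T + lam I)^{-1} y, and the second form touches the data only through the Gram matrix. The sketch below uses the linear kernel so both forms can be compared; the function names and data are assumptions for illustration.

```python
import numpy as np

def ridge_primal(X, y, lam):
    """w = (X^T X + lam I)^{-1} X^T y, a d-by-d solve."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def kernel_ridge_fit(K, y, lam):
    """alpha = (K + lam I)^{-1} y, an n-by-n solve on the Gram matrix K."""
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

# Synthetic data (hypothetical).
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = rng.standard_normal(20)
X_new = rng.standard_normal((5, 3))
lam = 0.1

# Primal prediction: x_new^T w
pred_primal = X_new @ ridge_primal(X, y, lam)

# Dual prediction with the linear kernel k(x, x') = <x, x'>:
# sum_i alpha_i k(x_new, x_i)
alpha = kernel_ridge_fit(X @ X.T, y, lam)
pred_dual = (X_new @ X.T) @ alpha
```

Replacing `X @ X.T` and `X_new @ X.T` with Gram matrices from another kernel (Gaussian, polynomial, and so on) gives kernel ridge regression with no other change to the dual code.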

Week 15. TBD

  • [Mon 4/29/2024] Finish RKHS intro, then start Gaussian Processes: GPs for regression, Bayesian setup, estimation and forecasting, facts about Gaussians
  • [Wed 5/1/2024] (Last day of class) Finish Gaussian Processes.
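
As a companion to the GP regression lectures, here is a minimal sketch of the posterior mean and covariance under a zero-mean prior with a Gaussian (RBF) kernel and Gaussian observation noise. The kernel choice, hyperparameter values, and function names are assumptions for illustration, not the setup from the notes.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 * length_scale^2))."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * length_scale**2))

def gp_posterior(X_train, y_train, X_test, noise_var=1e-2, length_scale=1.0):
    """Posterior mean and covariance of f(X_test) given noisy observations.

    mean = K_s^T (K + noise_var I)^{-1} y
    cov  = K_ss - K_s^T (K + noise_var I)^{-1} K_s
    computed through a Cholesky factor for numerical stability.
    """
    K = rbf_kernel(X_train, X_train, length_scale) + noise_var * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test, length_scale)
    K_ss = rbf_kernel(X_test, X_test, length_scale)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    V = np.linalg.solve(L, K_s)
    return K_s.T @ alpha, K_ss - V.T @ V

# Toy data (hypothetical): samples of sin on [0, 5].
X_train = np.linspace(0.0, 5.0, 8)[:, None]
y_train = np.sin(X_train).ravel()

# With almost no noise, the posterior interpolates the training points.
mean_train, cov_train = gp_posterior(X_train, y_train, X_train, noise_var=1e-8)

# Far from the data, the posterior falls back to the prior (mean 0, variance 1).
mean_far, cov_far = gp_posterior(X_train, y_train, np.array([[100.0]]), noise_var=1e-8)
```

The posterior mean has the same form as the kernel ridge regression predictor, with the noise variance playing the role of the regularization parameter.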

Note: unlike 2022 and 2020, in 2024 and 2026 we'll have project presentations during our 3 hour final exam slot, freeing up an extra 3 days of lecture

Optional content if we have time

In 2020, since we were online at the end due to the pandemic, we went faster (prewritten notes) and were able to cover the following (which we didn't get to in 2022 or 2024):

Week N/A. Ch 16 Kernel methods [SSS], and Gaussian Processes [Murphy], and ch 20 Neural Nets [SSS and various sources]

  • [Mon 4/18] Finish kernels, going over Random Fourier Features (Rahimi and Recht '07) and Bochner's theorem and the Nyström method. Most of my notes followed ch 16 in [SSS], with some material from [Murphy]. Then start Gaussian Processes, mostly following Murphy's textbook and my written notes Gaussian Processes, for classification but mostly for regression.
  • [Wed 4/20] More on GPs for regression, Bayesian setup, estimation and forecasting, facts about Gaussians. Start on ch 20 Neural Networks ch20_NN_part1_approxError
  • [Fri 4/22] More on approximation error of neural networks; didn't have time to talk about estimation error (generalization) ch20_NN_part2_estimationError nor optimization error ch20_NN_part3_optimizationError
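
For the Random Fourier Features item in this list, a small numerical sketch: by Bochner's theorem the Gaussian kernel exp(-||x - y||^2 / (2 sigma^2)) is the expectation of cos(w·(x - y)) with w drawn from N(0, I / sigma^2), so an inner product of random cosine features approximates it. The function names, feature count, and seed below are assumptions for illustration.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Exact Gram matrix of k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def make_rff_map(d, n_features, sigma=1.0, seed=0):
    """Build z(x) = sqrt(2/D) cos(W^T x + b) with fixed random W and b.

    W has i.i.d. N(0, 1/sigma^2) entries and b is uniform on [0, 2 pi],
    so that E[z(x) . z(y)] equals the Gaussian kernel.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((d, n_features)) / sigma
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)

    def feature_map(X):
        return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

    return feature_map

# Synthetic points (hypothetical).
X = np.random.default_rng(1).standard_normal((10, 3))
phi = make_rff_map(d=3, n_features=5000)

Z = phi(X)
K_approx = Z @ Z.T
K_exact = gaussian_kernel(X, X)
max_err = np.max(np.abs(K_approx - K_exact))
```

The error shrinks roughly like one over the square root of the number of features, so the feature count trades accuracy against the cost of working with explicit features instead of the full Gram matrix.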

Week N/A. Ch 20 [SSS] on artificial Neural Networks

  • [Mon 4/6] Neural nets description, background and history, discussion of approximation error (e.g., universal function approximation, like Stone-Weierstrass style density theorems) for many variants (e.g., L^1 density, density in continuous functions with uniform norm, exact representation of Boolean functions, etc.). Lower bounds on size of networks needed to approximate functions. Some from book, some from recent neural net papers in the past 4 years. Discussion of shortcomings of classical theory, some mention of modern algorithm-dependent approaches. PDF of notes (handwritten) about neural net approximation error

  • [Wed 4/8] Short lecture on bounding the VC dimension of neural nets. Proof for one activation function, results stated for two more activation functions. PDF of notes (handwritten) about neural net estimation error

  • [Fri 4/10] Short lecture on the NP-Hardness of ERM (e.g., training) for neural nets (no proof), discussion of SGD again, except in non-convex case. Introduce reverse-mode Automatic Differentiation (at a high-level, no example) and backpropagation for neural nets. PDF of notes (handwritten) about neural net optimization error

  • See the related neural net demo in Matlab, showing an example of two neural nets for the same problem, both with zero empirical risk, one of them hand-tuned (and has bad generalization error), the other trained via SGD and has much better generalization error.
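
Since reverse-mode automatic differentiation was introduced without an example, here is a toy scalar version as a sketch. The `Var` class and helper functions are invented for this illustration and support only addition, multiplication, and sine; real AD libraries are far more general.

```python
import math

class Var:
    """Scalar node in a computation graph."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuples of (parent node, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def sin(x):
    return Var(math.sin(x.value), ((x, math.cos(x.value)),))

def backward(output):
    """Accumulate d(output)/d(node) into node.grad for every node."""
    # Post-order depth-first search: inputs appear before the nodes using them.
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    # Reverse sweep: each node's grad is complete before it is pushed to parents.
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad

# f(x, y) = x * y + sin(x), so df/dx = y + cos(x) and df/dy = x.
x, y = Var(2.0), Var(3.0)
f = x * y + sin(x)
backward(f)
```

One forward pass plus one reverse sweep yields all partial derivatives at once, which is why backpropagation costs about the same as evaluating the network.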

Week N/A. Ch 21 [SSS] on Online learning

Week N/A. Reinforcement Learning (from ch 17 [Mohri])

  • [Mon 4/20] Introduction to Reinforcement Learning (RL) mostly following Mohri, but with examples from [SuttonBarto] and [Puterman]. Give examples: MuJoCo, AlphaGo, Tesauro's backgammon, Pig dice game. Define infinite-horizon, discounted, Markov Decision Process (MDP), and define the value of a policy, and define an optimal policy. Discuss finite-MDP and deterministic policies. PDF of notes (handwritten) about intro to RL

  • [Wed 4/22] Theoretical background on optimality and state-action value function Q, eventually deriving the Bellman Equations. PDF of notes (handwritten) about Bellman Equations

  • [Fri 4/24] (Note: for the next three classes, there are presentations, but still two lectures) Planning algorithms (aka dynamic programming) including value iteration (and variants like Gauss-Seidel), policy iteration (and variants, like modified policy iteration), and linear programming formulation. PDF of notes (handwritten) about Planning Algorithms
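
Value iteration, the first of the planning algorithms above, repeatedly applies the Bellman optimality operator until the value function stops changing. The sketch below runs it on a two-state, two-action toy MDP; the MDP, the array layout `P[a, s, s']` and `R[s, a]`, and the function name are assumptions for illustration.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-10, max_iter=10_000):
    """Value iteration for a finite discounted MDP.

    P[a, s, t] is the probability of moving from state s to t under action a.
    R[s, a] is the expected immediate reward.
    Returns the (approximately) optimal values and a greedy policy.
    """
    n_states = P.shape[1]
    V = np.zeros(n_states)
    for _ in range(max_iter):
        # Q[s, a] = R[s, a] + gamma * sum_t P[a, s, t] V[t]
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        V_new = Q.max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    return V, Q.argmax(axis=1)

# Toy MDP (hypothetical): action 0 stays put, action 1 switches state.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # action 0: stay
    [[0.0, 1.0], [1.0, 0.0]],   # action 1: switch
])
# Only staying in state 1 pays a reward of 1.
R = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
])

V_star, policy = value_iteration(P, R, gamma=0.9)
```

With gamma = 0.9 the optimal plan is to switch out of state 0 and then stay in state 1 forever, giving values 9 and 10; the iteration converges geometrically because the Bellman operator is a gamma-contraction in the sup norm.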

NOTE in 2024, Ashutosh Trivedi is teaching a special topics course on Reinforcement Learning in the CS dept (CSCI 4831/7000)

Week N/A, more RL

  • [Mon 4/27] Learning algorithms: very short intro on Stochastic Approximation as generalization of law of large numbers, and on a super-Martingale convergence theorem, then on Temporal Difference TD(0) and Q-learning algorithms. PDF of notes (handwritten) about Learning Algorithms
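
For reference, the update rules from this lecture in their usual textbook form; the notation (step sizes \alpha_t, discount \gamma, reward r_{t+1}) is a common convention and may differ from the handwritten notes.

```latex
Stochastic approximation view of the law of large numbers: the running mean of
samples $z_1, z_2, \dots$ satisfies
\[
  x_t = x_{t-1} + \alpha_t \,(z_t - x_{t-1}), \qquad \alpha_t = \tfrac{1}{t},
\]
and more general step sizes are typically required to satisfy
$\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$.

TD(0), for evaluating a fixed policy:
\[
  V(s_t) \leftarrow V(s_t) + \alpha_t \big[\, r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t) \,\big].
\]

Q-learning, for estimating the optimal state-action value function:
\[
  Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha_t \big[\, r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \,\big].
\]
```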

What we hope to cover in a typical course

(high-level)

Classical Statistical Learning Theory. We mainly focus on supervised statistical batch learning with a passive learner.

  1. Ch 1: Intro to class: what is it about?
  2. Ch 2: Formal models (statistical learning), Empirical Risk Minimization (ERM), finite hypothesis class
  3. Ch 3: Formal Learning model: Probably Approximately Correct (PAC)
  4. Ch 4: Learning via Uniform Convergence (and concentration inequalities, cf Appendix B and Vershynin)
  5. Ch 5: Bias-Complexity Tradeoff, double-descent, no-free-lunch theorems
  6. Ch 6: VC-Dimension
  7. Ch 26: Rademacher Complexity (and ch 3.1 in Mohri)
  8. Ch 27: Covering Numbers

Analysis of Algorithms. As time permits, we will analyze standard algorithms.

  1. Ch 9: Linear predictors
  2. Ch 10: Boosting, AdaBoost
  3. Ch 11: Model selection and validation
  4. Ch 12: Convex learning problems (generalization bounds)
  5. Ch 13: Regularization and Stability
  6. Ch 15: Support Vector Machines (SVM)
  7. Ch 16: Kernel methods
  8. Ch 20: Neural Networks, expressive power, and new results about deep networks (2017–now)

Additional Topics. We will cover these as we have time (which we probably won't).

  1. Ch 21: Online Learning
  2. Reinforcement learning (ch 17 in Mohri)
  3. Background on Information Theory (Appendix E in Mohri)
  4. Max Entropy (ch 12 in Mohri)
  5. Ch 22: Clustering (K-means, spectral clustering, information bottleneck)
  6. Ch 7: Nonuniform Learnability
  7. Computational Complexity models (Turing Machines; see Scott Aaronson book)
  8. Ch 8: Computational Complexity of learning
  9. Ch 14: Stochastic Gradient Descent (edit: we usually cover this at least partially)
  10. More stats, e.g., Expectation Maximization
  11. Variational Inference, ELBO
  12. Information Theory, information bottleneck
  13. Generative Models (GANs, Variational AutoEncoders, Diffusion Models)
  14. Equivariance and Invariance results; group theory
  15. Kernel methods in more detail; RKHS
  16. Recent papers from the literature

Skills we hope students develop

  • Statistics
    • More comfort with multivariate random variables, e.g., multivariate Gaussian
    • Convergence of random variables
    • Concentration inequalities
    • When is E[ gradient f] = gradient E[ f ], etc.
    • Cross-validation and regularization techniques; bootstrap
    • Intro to chaining techniques
  • Basic analysis
    • Lots of inequalities
    • Comfort with function classes, function spaces
    • Comfort with kernel methods
  • Basic optimization theory
    • and basic stochastic processes, either algorithmic or Gaussian Processes
  • Some discrete math
    • VC dimension calculations