forked from mitmath/18065

MIT Course 18.065/18.0651, Spring 2023

This is a repository for the course 18.065: Matrix Methods in Data Analysis, Signal Processing, and Machine Learning at MIT in Spring 2023. See also 18.065 from spring 2018 (MIT OpenCourseWare) for a previous version of this class.

Instructor: Prof. Steven G. Johnson.

Lectures: MWF1 in 2-190. Handwritten notes are posted, along with video recordings (MIT only).

Office hours (virtual): Thursdays at 4pm via Zoom.

Textbook: Linear Algebra and Learning from Data by Gilbert Strang (2019). (Additional readings will be posted in lecture summaries below.)

Resources: Piazza discussion forum, pset partners.

Grading: 50% homework, 50% final project.

Homework: Biweekly, due Fridays (2/17, 3/3, 3/17, 4/7, 4/21, 5/5) on Canvas. You may consult with other students or any other resources you want, but must write up your solutions on your own.

Exams: None.

Final project: Due May 15 on Canvas. You can work in groups of 1–3 (sign up on Canvas).

  • 1-page proposal due Monday April 3 on Canvas (right after spring break), but you are encouraged to discuss it with Prof. Johnson earlier to get feedback.
  • Pick a problem involving "learning from data" (in the style of the course, but not exactly the same as what's covered in lecture), and take it further: to numerical examples, to applications, to testing one or more solution algorithms. Must include computations (using any language).
  • Final report due May 15, as an 8–15 page academic paper in the style template of IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Like a good academic paper, you should thoroughly reference the published literature (citing both original articles and authoritative reviews/books where appropriate [rarely web pages]), tracing the historical development of the ideas and giving the reader pointers on where to go for more information and related work and later refinements, with references cited throughout the text (enough to make it clear what references go with what results). (Note: you may re-use diagrams from other sources, but all such usage must be explicitly credited; not doing so is plagiarism.) See some previous topic areas.

What follows is a brief summary of what was covered in each lecture, along with links and suggestions for further reading. It is not a good substitute for attending lecture, but may provide a useful study guide.

Lecture 1 (Feb 6)

  • Syllabus (above) and introduction.
  • 18.065 overview diagram
  • Column space, basis, rank, rank-1 matrices, A=CR, and AB=∑(col)(row)
  • See handwritten notes and lecture video linked above.

Further reading: Textbook 1.1–1.3. OCW lecture 1
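The columns-times-rows identity AB=∑(col)(row) is easy to check numerically. A small sketch in Python/NumPy for illustration (the course itself uses Julia):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 5))

# AB as a sum of p rank-1 outer products: (column k of A)·(row k of B)
AB = sum(np.outer(A[:, k], B[k, :]) for k in range(3))
assert np.allclose(AB, A @ B)
```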

Lecture 2 (Feb 8)

  • Matrix multiplication by blocks and columns-times-rows. Complexity: standard algorithm for (m×p)⋅(p×n) is Θ(mnp): roughly proportional to mnp for large m,n,p, regardless of how we rearrange it into blocks. (There also exist theoretically better, but highly impractical, algorithms.)
  • Briefly reviewed the "famous four" matrix factorizations: LU, diagonalization XΛX⁻¹ or QΛQᵀ, QR, and the SVD UΣVᵀ. QR and QΛQᵀ in the columns-times-rows picture, especially QΛQᵀ (diagonalization for real A=Aᵀ) as a sum of symmetric rank-1 projections.
  • The four fundamental subspaces for an m×n matrix A of rank r, mapping "inputs" x∈ℝⁿ to "outputs" Ax∈ℝᵐ: the "input" subspaces C(Aᵀ) (row space, dimension r) and its orthogonal complement N(A) (nullspace, dimension n–r); and the "output" subspaces C(A) (column space, dimension r) and its orthogonal complement N(Aᵀ) (left nullspace, dimension m–r).
  • pset 1, due Friday Feb 17

Further reading: Textbook 1.3–1.6. OCW lecture 2. If you haven't seen matrix multiplication by blocks before, here is a nice video.
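The four-subspace dimension count can be verified numerically. A Python/NumPy sketch (the course itself uses Julia), using the SVD to extract a nullspace basis:

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [2., 4., 6.]])              # m=2, n=3, rank r=1
m, n = A.shape
r = np.linalg.matrix_rank(A)

# the rows of Vt beyond the rank span N(A), which has dimension n - r
_, _, Vt = np.linalg.svd(A)
null_basis = Vt[r:].T
assert null_basis.shape[1] == n - r       # dim N(A) = n - r = 2
assert np.allclose(A @ null_basis, 0)     # those vectors really are in N(A)
```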

Optional Julia Tutorial: Wed Feb 8 @ 5pm via Zoom

A basic overview of the Julia programming environment for numerical computations that we will use in 18.065 for simple computational exploration. This (Zoom-based) tutorial will cover what Julia is and the basics of interaction, scalar/vector/matrix arithmetic, and plotting — we'll be using it as just a "fancy calculator" and no "real programming" will be required.

If possible, try to install Julia on your laptop beforehand using the instructions at the above link. Failing that, you can run Julia in the cloud (see instructions above).

Lecture 3 (Feb 10)

  • Orthogonal bases and unitary matrices "Q".

Choosing the right "coordinate system" (= "right basis" for linear transformations) is a key aspect of data science, in order to reveal and simplify information. The "nicest" bases are often orthonormal. (The opposite is a nearly linearly dependent "ill-conditioned" basis, which can greatly distort data.)

Orthonormal bases ⟺ QᵀQ=I, hence basis coefficients c=Qᵀx from dot products. QQᵀ is orthogonal projection onto C(Q). A square Q with orthonormal columns is known as an "orthogonal matrix" or (more generally) as a "unitary matrix": it has Qᵀ=Q⁻¹ (both its rows and columns are orthonormal). Qx preserves length ‖x‖=‖Qx‖ and dot products (angles) x⋅y=(Qx)⋅(Qy). Less obviously: any square matrix that preserves length must be unitary.
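These identities are quick to check numerically. A Python/NumPy sketch for illustration (the course itself uses Julia):

```python
import numpy as np

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((5, 2)))   # 5×2 with orthonormal columns
x = rng.standard_normal(5)

assert np.allclose(Q.T @ Q, np.eye(2))   # orthonormal columns: QᵀQ = I
c = Q.T @ x                              # basis coefficients from dot products
p = Q @ c                                # QQᵀx: orthogonal projection onto C(Q)
assert np.allclose(Q.T @ (x - p), 0)     # the residual is ⟂ C(Q)
```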

Some important examples of unitary matrices:

  • 2×2 rotation matrices
  • the identity matrix I
  • any permutation matrix P which re-orders a vector, and is simply a re-ordering of the rows/cols of I
  • Hadamard matrices: unitary matrices Hₙ/√n where Hₙ has entries of ±1 only. For n=2ᵏ they are easy to construct recursively, and are known as Walsh–Hadamard transforms. (See also the Julia Hadamard package.)
  • discrete Haar wavelets, which are unitary after a diagonal scaling and consist of entries ±1 and 0. They are a form of "time-frequency analysis" because they reveal information about both how oscillatory a vector is ("frequency domain") and where the oscillations occur ("time domain").
  • orthonormal eigenvectors can be found for any real-symmetric ("Hermitian") matrix A=Aᵀ: A=QΛQᵀ
  • the SVD A=UΣVᵀ of any matrix A gives (arguably) the "best" orthonormal basis U for C(A) and the "best" orthonormal basis V for C(Aᵀ), which reveal a lot about A.
  • orthonormal eigenvectors can also be found for any unitary matrix! (The proof is similar to that for Hermitian matrices, but the eigenvalues satisfy |λ|=1 in this case.) Often, unitary matrices are used to describe symmetries of problems, and their eigenvectors can be thought of as a kind of "generalized Fourier transform". (All of the familiar Fourier transforms, including Fourier series, sine/cosine transforms, and discrete variants thereof, can be derived in this way. For example, the symmetry of a circle gives the Fourier series, and the symmetry of a sphere gives a "spherical-harmonic transform".) For example, eigenvectors of a cyclic shift permutation give the discrete Fourier transform, which is famously computed using FFT algorithms.
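For instance, the recursive Walsh–Hadamard construction from the list above takes only a few lines. A Python/NumPy sketch (the course itself uses Julia and the Hadamard package):

```python
import numpy as np

def hadamard(n):
    """Hₙ (entries ±1) for n a power of 2, via the Walsh–Hadamard recursion."""
    if n == 1:
        return np.array([[1.0]])
    H = hadamard(n // 2)
    return np.block([[H, H], [H, -H]])   # H_{2k} = [[H_k, H_k], [H_k, -H_k]]

H8 = hadamard(8)
Q = H8 / np.sqrt(8)                      # Hₙ/√n is unitary
assert np.all(np.abs(H8) == 1)           # entries are ±1 only
assert np.allclose(Q.T @ Q, np.eye(8))
```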

Further reading: Textbook section 1.5 (orthogonality), 1.6 (eigenproblems), and 4.1 (Fourier); OCW lecture 3. The fact that preserving lengths implies unitarity is not obvious, but is proved in various textbooks; a concise summary is found here. The relationship between symmetries and Fourier-like transforms can be most generally studied through the framework of "group representation theory"; see e.g. textbooks on "group theory in physics" like Inui et al. (1996). Of course, there are whole books just on the discrete Fourier transform (DFT), just on wavelet transforms, etcetera, and you can find lots of material online at many levels of sophistication.

Lecture 4 (Feb 13)

  • Eigenproblems, diagonalization, complex inner products, and the spectral theorem.

Reviewed eigenvectors/eigenvalues Ax=λx, which make a square matrix act like a scalar λ. For an m×m matrix A, you get m eigenvalues λₖ from the roots (possibly repeated) of the characteristic polynomial det(A-λI), and you almost always get m independent eigenvectors xₖ (except in the very rare case of a defective matrix).

Went through the example of the 4×4 cyclic-shift permutation P from last lecture, and showed that P⁴x=x ⟹ λ⁴=1 ⟹ λ=±1,±i. We can then easily obtain the corresponding eigenvectors, and put them into the "discrete Fourier transform" matrix F.

I then claimed that the eigenvectors (properly scaled) are orthonormal, so that F is unitary, but there is a catch: we need to generalize our definition of "dot"/inner product for complex vectors and matrices. For complex vectors, the dot product x⋅y is conj(xᵀ)y (conj=complex conjugate), not xᵀy. And the length of a vector is then ‖x‖² = x⋅x = ∑ᵢ|xᵢ|², which is always ≥ 0 (and = 0 only for x=0). The "adjoint" conj(xᵀ) is sometimes denoted "xᴴ" (or x* in math, or x† in physics). A unitary matrix is now more generally one for which QᴴQ=I, and we see that our Fourier matrix F is indeed unitary. And unitary matrices still preserve lengths (and inner products) with the complex inner product.
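These claims are easy to verify numerically: build the 4×4 cyclic shift P and the scaled DFT matrix F, and check FᴴF=I and that F diagonalizes P with |λ|=1. A Python/NumPy sketch for illustration (the course uses Julia):

```python
import numpy as np

P = np.roll(np.eye(4), 1, axis=0)        # 4×4 cyclic-shift permutation
omega = np.exp(2j * np.pi / 4)           # = i, a primitive 4th root of unity
F = np.array([[omega**(j * k) for k in range(4)]
              for j in range(4)]) / 2    # scaled so the columns are orthonormal

assert np.allclose(F.conj().T @ F, np.eye(4))     # FᴴF = I: F is unitary
Lam = F.conj().T @ P @ F                          # change of basis to eigenvectors
assert np.allclose(Lam, np.diag(np.diag(Lam)))    # F diagonalizes P
assert np.allclose(np.abs(np.diag(Lam)), 1)       # eigenvalues ±1, ±i: |λ| = 1
```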

If we do a change of basis x=Bc, then Ax=λx is transformed to Cc=λc, where C=B⁻¹AB. C and A are called similar matrices, and we see that they are just the same linear operator in different bases; similar matrices always have identical eigenvalues, with eigenvectors related by the change of basis x=Bc.

The most important change of basis for eigenproblems is changing to the basis of eigenvectors X, which gives the diagonalization X⁻¹AX=Λ ⟺ A = XΛX⁻¹. However, this basis may be problematic in a variety of ways if the eigenvectors are nearly dependent (X is nearly singular or "ill-conditioned", corresponding to A being nearly defective).

The nice case of diagonalization is when you have orthonormal eigenvectors X=Q, and it turns out that this arises whenever A commutes with its conjugate-transpose Aᴴ (such matrices are called normal matrices), giving A=QΛQᴴ. Two important cases are:

  • Hermitian matrices A=Aᴴ, and especially the special case of real-symmetric matrices (real A=Aᵀ) — not only are their eigenvectors orthogonal, but their eigenvalues λ are real. In fact, if A is real-symmetric, then its eigenvectors are real too, and we can work with a purely real diagonalization A=QΛQᵀ.
  • Unitary matrices Aᴴ=A⁻¹. In this case, we can easily show that |λ|=1, and then prove orthogonality of eigenvectors from the fact that unitary matrices preserve inner products.
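Both cases are easy to check numerically; here is the real-symmetric case in Python/NumPy for illustration (the course uses Julia), where `eigh` returns real eigenvalues and orthonormal eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((4, 4))
A = B + B.T                                    # real-symmetric: A = Aᵀ

lam, Q = np.linalg.eigh(A)                     # real λ, orthonormal eigenvectors
assert np.all(np.isreal(lam))                  # eigenvalues are real
assert np.allclose(Q.T @ Q, np.eye(4))         # eigenvectors are orthonormal
assert np.allclose(A, Q @ np.diag(lam) @ Q.T)  # A = QΛQᵀ
```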

Further reading: OCW lecture 4 and textbook section I.6.

Lecture 5 (Feb 15)

  • (Symmetric/Hermitian) positive definite ("SPD") matrices A=Aᵀ: λ > 0 ⟺ xᵀAx > 0 (for x ≠ 0) ⟺ A = BᵀB (B full column rank) ⟺ pivots > 0 (in Gaussian elimination / LU or Cholesky factorization) ⟺ ... (but it does not mean all entries of A are necessarily positive or vice versa!)
  • BᵀB matrices arise in SVDs, least-squares, statistics (covariance matrices), and many other problems.
  • Positive-definiteness of the Hessian matrix Hᵢⱼ=∂²f/∂xᵢ∂xⱼ is the key test for determining whether x∈ℝⁿ is a local minimum of f(x), since f(x+δ)=f(x)+∇fᵀδ + ½δᵀHδ + (higher-order). For example, H=2A for the convex quadratic "bowl" function f(x)=xᵀAx+bᵀx
  • Analogous definitions of "positive semidefinite" (> 0 replaced by ≥ 0) and negative definite/semidefinite (< 0 / ≤ 0).
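The equivalent tests above can be exercised numerically. A Python/NumPy sketch (Julia's isposdef, mentioned below, works the same way) using Cholesky success/failure as the positive-definiteness test:

```python
import numpy as np

def isposdef(A):
    """Symmetric A is positive definite iff Cholesky succeeds (all pivots > 0)."""
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

B = np.array([[2., 1.], [0., 1.]])       # full column rank
assert isposdef(B.T @ B)                 # so BᵀB is SPD
# all-positive entries do NOT imply positive definiteness:
assert not isposdef(np.array([[1., 2.], [2., 1.]]))   # eigenvalues 3 and -1
```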

Further reading: OCW lecture 5 and textbook section I.7. In Julia, the isposdef function checks whether a matrix is positive definite, and does so using a Cholesky factorization (which is just Gaussian elimination sped up 2× for SPD matrices, and fails if it encounters a negative pivot). See also these lecture slides from Stanford for more properties, examples, and applications of SPD matrices. See the matrix calculus course at MIT for a more general presentation of derivatives, gradients, and second derivatives (Hessians) of functions with vector inputs/outputs.

Lecture 6 (Feb 17)

  • "Reduced"/"compact" SVD A=UᵣΣᵣVᵣᵀ=∑σₖuₖvₖᵀ, "full" SVD A=UΣVᵀ, and "thin" SVD (what computers actually return, with zero singular values replaced by tiny roundoff errors).
  • Uᵣ and Vᵣ as the "right" bases for C(A) and C(Aᵀ). Avₖ=σₖuₖ
  • SVD and eigenvalues: U and V as eigenvectors of AAᵀ and AᵀA, respectively, and σₖ² as the positive eigenvalues.
  • Deriving the SVD from eigenvalues: the key step is showing that AAᵀ and AᵀA share the same positive eigenvalues, and λₖ=σₖ², with eigenvectors related by a factor of A. From there you can work backwards to the SVD.
  • pset 1 solutions
  • pset 2: Due March 3 at 1pm
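The σₖ² eigenvalue relation is easy to confirm numerically. A Python/NumPy sketch for illustration (the course uses Julia):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # "thin"/reduced SVD
lam = np.linalg.eigvalsh(A.T @ A)                  # eigenvalues of AᵀA

assert np.allclose(np.sort(s**2), np.sort(lam))    # λₖ = σₖ²
assert np.allclose(A, U @ np.diag(s) @ Vt)         # A = UΣVᵀ
assert np.allclose(A @ Vt.T, U @ np.diag(s))       # Avₖ = σₖuₖ
```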

Further reading: OCW lecture 6 and textbook section I.8. The Wikipedia SVD article.

Lecture 7 (Feb 21)

Further reading: Textbook sections I.9, I.11, III.5. OCW lecture 7 and lecture 8.

Lecture 8 (Feb 22)

Further reading: Textbook sections I.9, V.1, V.4. OCW lecture 7 and OCW lecture 20.

Lecture 9 (Feb 24)

  • pseudo-inverse A⁺=VΣ⁺Uᵀ
  • Least-squares: solve Ax≈b by minimizing ‖b-Ax‖ ⟺ solving AᵀAx̂=Aᵀb
  • 4 methods for least squares: (1) Normal equations AᵀAx̂=Aᵀb (the fastest method, but least robust to roundoff errors etc); (2) orthogonalization A=QR ⟹ Rx̂=Qᵀb (much more robust, this is essentially what A \ b does in Julia for non-square A); (3) pseudo-inverse x̂=A⁺b (pinv(A)*b in Julia; this lets you "regularize" the problem by dropping tiny singular values); (4) "ridge" or "Tikhonov" regularization (AᵀA + δ²I)⁻¹Aᵀb ⟶ x̂ as δ→0 (δ≠0 is useful to "regularize" ill-conditioned fitting problems where A has nearly dependent columns, making the solutions more robust to errors).
  • Least-squares demo
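For a well-conditioned A, all four methods above agree to high accuracy. A Python/NumPy comparison for illustration (the course demo uses Julia's A \ b):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((20, 3))
b = rng.standard_normal(20)

x1 = np.linalg.solve(A.T @ A, A.T @ b)        # (1) normal equations AᵀAx̂ = Aᵀb
Q, R = np.linalg.qr(A)                        # (2) orthogonalization A = QR
x2 = np.linalg.solve(R, Q.T @ b)              #     solve Rx̂ = Qᵀb
x3 = np.linalg.pinv(A) @ b                    # (3) pseudo-inverse x̂ = A⁺b
delta = 1e-6                                  # (4) ridge: recovers x̂ as δ → 0
x4 = np.linalg.solve(A.T @ A + delta**2 * np.eye(3), A.T @ b)

for x in (x2, x3, x4):
    assert np.allclose(x, x1, atol=1e-6)
```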

Further reading: Textbook section II.2 and OCW lecture 9. Many advanced linear-algebra texts talk about the practical aspects of roundoff errors in QR and least-squares, e.g. Numerical Linear Algebra by Trefethen and Bau (the 18.335 textbook). A nice historical review can be found in the article Gram-Schmidt orthogonalization: 100 years and more. Rarely (on a computer) explicitly form AᵀA or solve the normal equations: it turns out that this greatly exacerbates the sensitivity to numerical errors (in 18.335, you would learn that it squares the condition number). Instead, we typically use the A=QR factorization and solve Rx̂=Qᵀb. Better yet, just do A \ b (in Julia or Matlab) or the equivalent in other languages (e.g. numpy.linalg.lstsq), which will use a good algorithm. (Even professionals can get confused about this.)

Lecture 10 (Feb 27)

  • Training vs test data: VMLS slides p. 294
  • Conditioning: κ(A) = (max σ)/(min σ) is the condition number of a matrix A, and gives us a bound on the "amplification" ‖Δx‖/‖x‖ ≤ κ(A) ‖Δb‖/‖b‖ of the relative error from inputs (b) to outputs (x) when solving Ax=b (including least-squares). "Ill-conditioned problems" (κ≫1) magnify noise and other errors, and typically require some regularization (e.g. dropping smallest σ's) that trades off robustness for accuracy ‖b-Ax‖.
  • Ridge/Tikhonov/ℓ² regularization: minimize ‖b-Ax‖₂² + δ²‖x‖₂² for some penalty δ≠0 to push the solution towards smaller x. (More generally, δ²‖Dx‖₂² for some matrix D.) This gives (AᵀA+δ²I)x̂=Aᵀb, which is similar to A⁺b but replaces 1/σ with σ/(σ²+δ²). Effectively, this drops small σ's, but doesn't require an SVD and generalizes to other types of penalties. (Example: VMLS slides pg. 346.)
  • Under-determined problems: for "wide" matrices, Ax=b has many solutions (we can add any N(A) vector to a solution). A common way to pick a solution is to pick the minimum-norm solution: minimize ‖x‖₂ subject to Ax=b. (It turns out that this gives x̂=A⁺b!)
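The minimum-norm claim can be checked directly: for a wide A, x̂=Aᵀ(AAᵀ)⁻¹b solves Ax=b, equals A⁺b, and adding any nullspace vector only increases the norm. A Python/NumPy sketch (the course uses Julia):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((2, 5))                  # "wide": Ax = b is underdetermined
b = rng.standard_normal(2)

x_min = A.T @ np.linalg.solve(A @ A.T, b)        # Aᵀ(AAᵀ)⁻¹b
assert np.allclose(A @ x_min, b)                 # it solves Ax = b
assert np.allclose(x_min, np.linalg.pinv(A) @ b) # and equals A⁺b

null_vec = np.linalg.svd(A)[2][-1]               # a unit vector in N(A)
assert np.allclose(A @ null_vec, 0)
# x_min ∈ C(Aᵀ) ⟂ N(A), so any other solution is strictly longer:
assert np.linalg.norm(x_min + null_vec) > np.linalg.norm(x_min)
```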

Further reading: Training/test data: VMLS section 13.2. Condition numbers: Strang exercises II.3, OCW video lecture 10, and these 18.06 notes; a more in-depth treatment can be found in e.g. Numerical Linear Algebra by Trefethen and Bau (the 18.335 textbook). Tikhonov regularization: Strang section II.2, OCW video lecture 10, VMLS section 15.3. Underdetermined minimum-norm solutions: Strang section II.2, OCW video lecture 11, UIUC Nonlinear Programming lecture notes.

Lecture 11 (Mar 1)

  • Minimum-norm solutions x̂=A⁺b=Aᵀ(AAᵀ)⁻¹b: smallest ‖x‖₂ for underdetermined problems Ax=b for "wide" A.
  • Other common norms: ℓ¹ and ℓ∞, and sparsity with ‖x‖₁ norm (LASSO regularization).
  • Avoid AᵀA: it squares the condition number κ(AᵀA)=κ(A)²
  • Gram–Schmidt orthogonalization and QR factorization.
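A bare-bones classical Gram–Schmidt QR, in Python/NumPy for illustration (a sketch assuming independent columns; in practice, modified Gram–Schmidt or Householder QR is more robust to roundoff):

```python
import numpy as np

def gram_schmidt(A):
    """Classical Gram–Schmidt on the columns of A, returning Q, R with A = QR."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].copy()
        for i in range(j):                 # subtract projections onto q₁…qⱼ₋₁
            R[i, j] = Q[:, i] @ A[:, j]
            v -= R[i, j] * Q[:, i]
        R[j, j] = np.linalg.norm(v)        # normalize what's left
        Q[:, j] = v / R[j, j]
    return Q, R

A = np.random.default_rng(6).standard_normal((5, 3))
Q, R = gram_schmidt(A)
assert np.allclose(Q.T @ Q, np.eye(3))     # orthonormal columns
assert np.allclose(Q @ R, A)               # A = QR, R upper-triangular
```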

Further reading: Strang II.2, II.4.
