# ECON622: Computational Economics with Data Science Applications

Generalization, Deep Learning, and Representations

Jesse Perla (University of British Columbia)

# Overview

## Summary

-   Step back from the optimization and fitting process to discuss the
    broader issues of statistics and function approximation
-   Key topics:
    -   ERM, interpolation, and generalization error
    -   Features and Representations
    -   Neural Networks and broader hypothesis classes
-   In the subsequent lecture we can briefly return to the optimization
    process itself to discuss
    -   Global vs. Local solutions
    -   Inductive bias and regularization

# Neural Networks, Part I

## Neural Networks Include Almost Everything

-   **Neural Networks** as just an especially flexible functional form
    -   Linear functions, polynomials, and all of our standard
        approximations fit in there
-   However, when we say “Neural Network” you should have in mind
    approximations that are typically
    -   parameterized by some complicated
        $\theta \equiv \{\theta_1, \theta_2, \cdots, \theta_L\}$
    -   nested with **layers**:
        $y = f_1(f_2(\ldots f_L(x;\theta_L);\theta_2);\theta_1) \equiv f(x;\theta)$
    -   **highly overparameterized**, such that $\theta$ is often
        massive, often far larger than the amount of data
-   Terminology: **depth** is number of layers $L$, **width** is the
    size of $\theta_{\ell}$

## (Asymptotic) Universal Approximation Theorems

-   At this point you may expect a theorem that says that neural
    networks are [**universal
    approximators**](https://en.wikipedia.org/wiki/Universal_approximation_theorem#Arbitrary-width_case)
    (e.g., Cybenko, Hornik, and others)
    -   i.e., for any function $f^*(x)$, there exists a neural network
        $f(x;\theta)$ that can approximate it arbitrarily well
    -   Takes limits of the number of parameters at each layer
        (e.g. $\theta_{\ell}\nearrow$), or sometimes the number of
        layers (e.g. $L\nearrow$)
-   A low bar to pass that rarely gives useful guidance or bounds
    -   We do not use enormously “wide” approximations
    -   The theorems are too pessimistic, as NNs do much better in
        practice
    -   Important when doing core functional analysis and asymptotics

## How Can we Fit Huge Approximations?

-   At this point you should not be scared off by a big $\theta$
-   The ML and deep-learning revolution is built on ideas you have
    covered
    -   With a scalar loss $L(\theta)$, you can use VJPs (reverse-mode
        AD) to get gradients $\nabla_{\theta} L(\theta)$
    -   AD software like PyTorch or JAX makes it easy to collect the
        $\theta$ and run the AD on complicated $L(\theta)$
    -   Hardware (e.g. GPUs) can make common operations like VJPs fast
    -   Optimization methods like SGD work great, using gradient
        estimates when memory is an issue for the full
        $\nabla_{\theta} L(\theta)$
    -   Regularization, in all its forms, helps when used appropriately
-   But that doesn’t explain why this can help generalization
    performance
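
The ingredients above can be sketched on a toy problem. This is a minimal illustration, not the course's implementation: the data, `learning_rate`, and `batch_size` are assumptions, and the hand-coded `grad` stands in for what reverse-mode AD in PyTorch or JAX would produce automatically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = x . theta_true + noise
N, D = 1000, 5
theta_true = np.arange(1.0, D + 1.0)
X = rng.normal(size=(N, D))
y = X @ theta_true + 0.1 * rng.normal(size=N)

def loss(theta, X, y):
    """Scalar loss L(theta): mean squared error."""
    return np.mean((y - X @ theta) ** 2)

def grad(theta, X, y):
    """Hand-coded gradient of the loss; AD software would generate this for us."""
    return -2.0 * X.T @ (y - X @ theta) / len(y)

theta = np.zeros(D)
learning_rate, batch_size = 0.05, 32
for step in range(2000):
    batch = rng.choice(N, size=batch_size, replace=False)      # random mini-batch
    theta -= learning_rate * grad(theta, X[batch], y[batch])   # noisy gradient step

print(loss(theta, X, y))
```

Each step only touches `batch_size` observations, which is the sense in which SGD uses gradient estimates rather than the full $\nabla_{\theta} L(\theta)$.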

## Puzzling Empirical Success, Ahead of Theory

-   Deep learning and AD are old, but only recently have the software
    and hardware been good enough to scale
-   Bias-variance tradeoff: adding parameters can make things worse
    -   But ML practitioners often find the opposite **empirically**
    -   Frequent success with lots of layers and massive
        over-parameterization
    -   This seems counter to all sorts of basic statistics intuition
-   ML theory is still catching up to the empirical success
    -   We will try to give some perspectives on **when/why deep
        learning works**
    -   First, we will have to be precise on what we mean by “works”

# ERM and Interpolation

## Estimation and Interpolation Setup

-   See [ProbML Book 1](https://probml.github.io/pml-book/book1.html)
    Section 5.4 on ERM
-   Generalizing that notation somewhat to better nest functional
    equations
    -   Let $p^*(x,y)$ be the **true** distribution on inputs
        $x \in \mathcal{X}$ and outputs $y \in \mathcal{Y}$
    -   Let $f : \mathcal{X} \to \mathcal{Y}$ be a function mapping
        inputs to outputs. For example, maybe $\hat{y} = f(x)$ is an
        estimator for the relationship between $x$ and $y$ in $p^*(x,y)$
    -   For now, assume that $f \in \mathcal{F}$ for some very large set
        of functions appropriate to the problem (e.g., Sobolev space,
        etc.)
    -   Define the loss as an operator $(f, x, y) \to \ell(f, x, y)$
        -   For example, $\ell(f, x, y) = (y - f(x))^2$ for LLS

## Population Risk

-   The **population risk** is then defined as

$$
R(f, p^*) \equiv \mathbb{E}_{p^*(x,y)}\left[\ell(f, x, y)\right]
$$

-   This lets us define the ideal minimizer of the population risk as

$$
f^{**} = \arg\min_{f \in \mathcal{F}} R(f, p^*)
$$

-   If $\min_{f \in \mathcal{F}} R(f, p^*) = 0$, there is an
    interpolating solution across the whole of $p^*$. Common with
    functional equations, rare in statistics applications

## Functional Equations

-   The deviations from the ProbML book are to ensure we can nest
    solving functional equations. In that case, we can drop the $y$.
-   We can loosely think of $f$ as solving the functional equation as
    long as

$$
\min_{f \in \mathcal{F}} R(f, p^*) \equiv \min_{f \in \mathcal{F}} \mathbb{E}_{p^*(x)}\left[\ell(f, x)\right] = 0
$$

-   Set the function class $\mathcal{F}$ to be consistent with the
    $\ell$
    -   e.g., for $\ell(f,x) = (\partial_x f(x) - a f(x))^2$ choose a
        Sobolev $\mathcal{F}$
    -   We are leaving it out here, but you might add on other terms
        like boundary conditions to the $R(f,p^*)$ loss
    -   Note the key difference relative to standard methods is the
        $p^*(x)$!
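
To make the role of $p^*(x)$ concrete, here is a small sketch for $\ell(f,x) = (\partial_x f(x) - a f(x))^2$. The value `a = 1`, the two uniform choices of $p^*(x)$, and the cubic Taylor candidate are all assumptions chosen for illustration, and boundary conditions are left out as in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
a = 1.0

def residual_risk(f, df, x):
    """Monte Carlo estimate of E_{p*(x)}[(f'(x) - a f(x))^2] from draws x ~ p*(x)."""
    return np.mean((df(x) - a * f(x)) ** 2)

# Candidate 1: the exact solution f(x) = exp(a x)
f_exact = lambda x: np.exp(a * x)
df_exact = lambda x: a * np.exp(a * x)

# Candidate 2: cubic Taylor polynomial of exp(a x) around x = 0
f_taylor = lambda x: 1 + a * x + (a * x) ** 2 / 2 + (a * x) ** 3 / 6
df_taylor = lambda x: a + a**2 * x + a**3 * x**2 / 2

x_narrow = rng.uniform(0.0, 1.0, size=100_000)   # p*(x) = U(0, 1)
x_wide = rng.uniform(0.0, 2.0, size=100_000)     # p*(x) = U(0, 2)

risk_exact = residual_risk(f_exact, df_exact, x_wide)
risk_taylor_narrow = residual_risk(f_taylor, df_taylor, x_narrow)
risk_taylor_wide = residual_risk(f_taylor, df_taylor, x_wide)
print(risk_exact, risk_taylor_narrow, risk_taylor_wide)
```

The same polynomial looks like a much better or much worse "solution" depending on where $p^*(x)$ puts its mass, since its residual is $-a^4 x^3 / 6$ and grows quickly away from zero.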

## Empirical Risk

-   Denote $N$ samples from $p^*(x,y)$ as
    $\mathcal{D} \equiv \{x_n, y_n\}_{n=1}^N$
-   We can think of this as an empirical distribution
    $p_{\mathcal{D}}(x,y)$
-   Which lets us define the **empirical risk**

$$
R(f, \mathcal{D}) \equiv \frac{1}{N}\sum_{n=1}^N \ell(f, x_n, y_n) = R(f, p_{\mathcal{D}}) 
$$

-   Note that the empirical risk is decoupled from the function space of
    $f$ and simply captures the role of finite-samples from $p^*(x,y)$
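
As a rough illustration of the finite-sample role of $\mathcal{D}$, the sketch below evaluates the empirical risk of a fixed $f$ for several sample sizes. The data-generating process in `sample` is an assumption, picked so that the population risk of this particular $f$ is known to be the noise variance.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw n samples from an assumed p*(x, y): x ~ N(0, 1), y = 2x + 0.5 * noise."""
    x = rng.normal(size=n)
    y = 2.0 * x + 0.5 * rng.normal(size=n)
    return x, y

def empirical_risk(f, x, y):
    """R(f, D) = (1/N) sum_n (y_n - f(x_n))^2."""
    return np.mean((y - f(x)) ** 2)

f = lambda x: 2.0 * x        # a fixed candidate function, not fit to the data
population_risk = 0.5**2     # E[(y - 2x)^2] is the noise variance here

for n in (10, 1_000, 100_000):
    x, y = sample(n)
    print(n, empirical_risk(f, x, y), population_risk)
```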

## Hypothesis Class

-   If $\mathcal{F}$ is too large then it may be hard to solve, or we
    may introduce errors due to overfitting
-   Instead, consider a **hypothesis class**
    $\mathcal{H}\subseteq \mathcal{F}$, which is parameterized in
    practice by $\theta \in \Theta$
-   For example, we might choose an $\mathcal{H}$ that consists of
    -   Linear functions
    -   Orthogonal polynomials
    -   Neural networks with a large number of parameters and
        nonlinearities

## Empirical Risk Minimization

-   Finally, we can now look at the **empirical risk minimization**
    problem, which is the core of most statistical estimation problems
    like regressions

$$
f^*_{\mathcal{D}} = \arg\min_{f \in \mathcal{H}} R(f, \mathcal{D}) = \arg\min_{f \in \mathcal{H}}\frac{1}{N}\sum_{n=1}^N \ell(f, x_n, y_n)
$$

-   If $\mathcal{H}$ is parameterized by $\theta \in \Theta$, i.e.,
    $\mathcal{H}(\Theta)$, then we could implement this as

$$
\arg\min_{\theta \in \Theta} \frac{1}{N}\sum_{n=1}^N \ell(f_{\theta}, x_n, y_n)
$$

## Linear Least Squares

-   For example, if
    -   $\mathcal{H}(\Theta)$ only includes linear functions
        -   i.e. $f_{\theta}(x) = \theta \cdot x$ for some
            $\theta \in \Theta$
    -   $\ell(f, x, y) = (y - f(x))^2$
-   Then this problem is just OLS
    -   $\arg\min_{\theta \in \Theta} \frac{1}{N}\sum_{n=1}^N (y_n - \theta \cdot x_n)^2$
-   Economists are very good at analyzing the properties of the
    $\mathcal{D}$ drawn from $p^*(x,y)$ but often ignore the differences
    between $\mathcal{H}$ and $\mathcal{F}$
-   In contrast: for solving functional equations, we are better at
    analyzing $\mathcal{F}$ vs. $\mathcal{H}$, but typically implicitly
    assume a uniform $p^*(x)$
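
The OLS case can be written directly as ERM. This is a minimal sketch on simulated data (the dimensions, `theta_true`, and noise level are assumptions); `np.linalg.lstsq` solves the least-squares problem, which has the same minimizer as the averaged loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated D = {x_n, y_n} (an assumption for illustration)
N, D = 200, 3
theta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(N, D))
y = X @ theta_true + 0.1 * rng.normal(size=N)

# ERM over H = {x -> theta . x} with squared loss
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

empirical_risk = np.mean((y - X @ theta_hat) ** 2)
print(theta_hat, empirical_risk)
```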

## Approximation Error

-   To evaluate what we lose by using $\mathcal{H}$ instead of
    $\mathcal{F}$, we can define

$$
f^* \equiv \arg\min_{f \in \mathcal{H}} R(f, p^*)
$$

-   Then the **approximation error** is defined as

$$
\varepsilon_{app}(\mathcal{H}) \equiv R(f^*, p^*) - R(f^{**}, p^*)
$$

-   This says, taking the population distribution as given, how close we
    can get a $f^*$ to the ideal solution $f^{**}$ using $\mathcal{H}$
    instead of $\mathcal{F}$
    -   The weighting by $p^*(x)$ is crucial to gauge “success”
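
A small worked case may help. Assume, purely for illustration, that $x \sim U(-1,1)$, that $y = x^2$ without noise (so $f^{**}(x) = x^2$ and $R(f^{**}, p^*) = 0$), and that $\mathcal{H}$ is the set of affine functions. The sketch computes $f^*$ and $\varepsilon_{app}(\mathcal{H})$ with Gauss-Legendre quadrature, which integrates these polynomial integrands exactly.

```python
import numpy as np

# Gauss-Legendre nodes and weights on [-1, 1]
nodes, weights = np.polynomial.legendre.leggauss(10)
weights = weights / 2.0                  # the density of U(-1, 1) is 1/2

y = nodes**2                             # noise-free target: f**(x) = x^2
Phi = np.column_stack([np.ones_like(nodes), nodes])   # H: theta_0 + theta_1 x

# Population least squares, min_theta E[(y - Phi theta)^2], via normal equations
A = Phi.T @ (weights[:, None] * Phi)
b = Phi.T @ (weights * y)
theta_star = np.linalg.solve(A, b)

risk_f_star = np.sum(weights * (y - Phi @ theta_star) ** 2)   # R(f*, p*)
approx_error = risk_f_star - 0.0                              # minus R(f**, p*) = 0
print(theta_star, approx_error)
```

In this case the best affine function is the constant $1/3$ and the approximation error is $\mathrm{Var}(x^2) = 1/5 - 1/9 = 4/45$. A different $p^*(x)$ would change both.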

## Generalization Error

-   Alternatively, we can fix the hypothesis class and ask how much
    error we are introducing by the use of finite data $\mathcal{D}$
    instead of the population distribution $p^*(x,y)$
-   The **generalization error** (or estimation error) is

$$
\varepsilon_{est}(\mathcal{H}) \equiv \mathbb{E}_{\mathcal{D} \sim p^*}\left[R(f^*_{\mathcal{D}}, \mathcal{D}) - R(f^*, p^*)\right]
$$

-   By that notation we are showing that this is taking the expectation
    over samples $\mathcal{D}$ from the true $p^*(x,y)$

## Calculating the Generalization Error

-   Since we typically do not have the true $p^*$, or can at most sample
    from it, we need to find ways to approximate the $\varepsilon_{est}$
    for a given problem.
-   A typical approach is the data splitting we discussed in the
    previous section.
    -   Partition $\mathcal{D}$ into $\mathcal{D}_{train}$ and
        $\mathcal{D}_{test}$, then solve ERM to find $$
        f^*_{\mathcal{D}_{train}} = \arg\min_{f \in \mathcal{H}} R(f, \mathcal{D}_{train}) = \arg\min_{f \in \mathcal{H}}\frac{1}{N_{train}}\sum_{n=1}^{N_{train}} \ell(f, x_n, y_n)
        $$
    -   Then, we can approximate with what is sometimes called the
        **generalization gap** $$
        \varepsilon_{est}(\mathcal{H}) \approx R(f^*_{\mathcal{D}_{train}} , \mathcal{D}_{train}) - R(f^*_{\mathcal{D}_{train}} , \mathcal{D}_{test})
        $$
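
The data-splitting recipe can be sketched as follows. The data-generating process, the split sizes, and the choice of degree-9 polynomials as $\mathcal{H}$ are assumptions, picked so that the fit interpolates the training data.

```python
import numpy as np

rng = np.random.default_rng(0)

def risk(coef, x, y):
    """Empirical risk with squared loss for a Chebyshev-basis polynomial."""
    return np.mean((y - np.polynomial.chebyshev.chebval(x, coef)) ** 2)

# Assumed data-generating process: x ~ U(-1, 1), y = sin(3x) + noise
N = 1010
x = rng.uniform(-1.0, 1.0, size=N)
y = np.sin(3.0 * x) + 0.3 * rng.normal(size=N)

# Partition D into D_train (10 points) and D_test (the rest)
x_train, y_train = x[:10], y[:10]
x_test, y_test = x[10:], y[10:]

# ERM on D_train over degree-9 polynomials: 10 coefficients for 10 data points
coef = np.polynomial.chebyshev.chebfit(x_train, y_train, deg=9)

train_risk = risk(coef, x_train, y_train)
test_risk = risk(coef, x_test, y_test)
generalization_gap = train_risk - test_risk
print(train_risk, test_risk, generalization_gap)
```

With as many coefficients as training points the training risk is essentially zero, so the gap is driven entirely by how the interpolant behaves on data it has not seen.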

## Decomposing the Error

-   Armed with these definitions, we can now decompose the error of
    using a particular hypothesis class $\mathcal{H}$ and finite set of
    samples $\mathcal{D}$ from $p^*$ as

$$
\mathbb{E}_{\mathcal{D} \sim p^*}\left[\min_{f \in \mathcal{H}} R(f, \mathcal{D}) - \min_{f \in \mathcal{F}} R(f, p^*)\right] = \varepsilon_{app}(\mathcal{H}) + \varepsilon_{est}(\mathcal{H})
$$

-   Note that this is a property of the approximation class
    $\mathcal{H}$ and the selection process for the $\mathcal{D}$, not a
    particular $\mathcal{D}$

## Tradeoffs

-   This is at the core of the bias-variance tradeoff
    -   $\varepsilon_{app}(\mathcal{H})$ is the bias, i.e. approximation
        error
    -   $\varepsilon_{est}(\mathcal{H})$ is the variance,
        i.e. estimation error
-   The classic tradeoff here is that if you make $\mathcal{H}$ too
    rich, you may decrease the $\varepsilon_{app}(\mathcal{H})$ but the
    extra flexibility may lead to a much higher
    $\varepsilon_{est}(\mathcal{H})$
    -   i.e., flexibility leads to overfitting
-   In the next lectures we will investigate this classic intuition in
    more detail and discuss when it falls apart.
    -   Hint: consider the role of regularization in all its forms
-   The bigger challenge is that as data and economic mechanisms become
    richer, you may not be able to choose the appropriate $\mathcal{H}$
    manually
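
One way to see the classic tradeoff is to sweep over nested hypothesis classes. The sketch below uses polynomials of increasing degree on simulated data; the data-generating process in `draw` and the sample sizes are assumptions. Because the classes are nested, the training risk cannot rise with the degree, while nothing forces the test risk to keep falling.

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

rng = np.random.default_rng(0)

def draw(n):
    """Assumed p*(x, y): x ~ U(-1, 1), y = sin(3x) + noise."""
    x = rng.uniform(-1.0, 1.0, size=n)
    return x, np.sin(3.0 * x) + 0.3 * rng.normal(size=n)

x_train, y_train = draw(30)
x_test, y_test = draw(5_000)

train_risks, test_risks = [], []
for degree in range(0, 13):     # richer and richer (nested) hypothesis classes
    coef = cheb.chebfit(x_train, y_train, degree)
    train_risks.append(np.mean((y_train - cheb.chebval(x_train, coef)) ** 2))
    test_risks.append(np.mean((y_test - cheb.chebval(x_test, coef)) ** 2))
    print(degree, train_risks[-1], test_risks[-1])
```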

# Features and Representations

## Features

-   The **features** of a problem start with the inputs that we use
    within the hypothesis class $\mathcal{H}(\Theta)$
-   Economists are used to **shallow** approximations, e.g.,
    -   Linear functions, $f_{\theta}(x) = \theta \cdot x$
    -   Orthogonal polynomials,
        $f_{\theta}(x) = \sum_{m=1}^M \theta_m \phi_m(x)$ for basis
        $\phi_m(x)$
-   Economists **feature engineer** to choose the appropriate
    $x\in\mathcal{X}$ from raw data, typically then used with shallow
    approximations
    -   e.g. log, dummies, polynomials, first-differences, means, etc.
    -   Embeds all sorts of priors, wisdom, and economic intuition
    -   Priors are inescapable - and a good thing as long as you are
        aware when you use them and how they affect inference

## Simple Representation in ERM

-   To abstract from this manual process, we can think of instead taking
    the raw data $x$ and transforming it into a **representation**
    $z\in \mathcal{Z}$ with $g : \mathcal{X} \to \mathcal{Z}$

-   Then, instead of finding a $f : \mathcal{X} \to \mathcal{Y}$, we can
    find a $\tilde{f} : \mathcal{Z} \to \mathcal{Y}$ and in our loss use
    $\ell(\tilde{f}\circ g, x, y)$

    $$
    \tilde{f}^*_{\mathcal{D}} = \arg\min_{\tilde{f} \in \mathcal{H}}\frac{1}{N}\sum_{n=1}^N \ell(\tilde{f} \circ g, x_n, y_n)
    $$

    -   And then define
        $f^*_{\mathcal{D}} \equiv \tilde{f}^*_{\mathcal{D}} \circ g$

-   Our approximation class $\mathcal{H}$ then changes - for better or
    worse.
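
As a sketch of how a fixed $g$ changes the problem, the example below fits the same affine $\mathcal{H}$ on raw $x$ and on $z = g(x)$ with $g = \log$. The simulated data, where $y$ is linear in $\log x$, is an assumption chosen so that the hand-picked representation happens to be the right one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed data: y is linear in log(x), not in x
N = 500
x = rng.uniform(1.0, 100.0, size=N)
y = 1.0 + 2.0 * np.log(x) + 0.1 * rng.normal(size=N)

def fit_linear(z, y):
    """ERM over affine functions of z with squared loss; returns (theta, risk)."""
    Z = np.column_stack([np.ones_like(z), z])
    theta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return theta, np.mean((y - Z @ theta) ** 2)

g = np.log                                   # hand-chosen representation z = g(x)

_, risk_raw = fit_linear(x, y)               # f in H acting directly on x
theta_rep, risk_rep = fit_linear(g(x), y)    # f_tilde in H acting on z = g(x)
print(risk_raw, risk_rep, theta_rep)
```

The hypothesis class for $\tilde{f}$ is unchanged, but the composed class $\tilde{f} \circ g$ is different, which is why the empirical risks differ.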

## Finding Representations is an Art

-   This is the process of finding **latent variables** and the inverse
    mapping from observables to them
-   What is our goal when choosing $g$?
    -   Drop irrelevant information, which is the simplest feature
        engineering
    -   Find $z$ that captures the relevant information in $x$ for the
        problem at hand (i.e., the particular $\ell, y, \mathcal{D}$)
    -   See [ProbML Book
        2](https://probml.github.io/pml-book/book2.html) Section 5.6 for
        more on information theory
    -   Disentangled (i.e., the factors of variation are separated out
        in $z$)

## Problem Specific or Reusable?

-   Are representations reusable between tasks?
    -   e.g., our wisdom of when to take logs or first-differences is
        often reusable
-   Remember: representations are on $\mathcal{X}$, not $\mathcal{Y}$
    -   Encodes age-old wisdom from working with the data sources
    -   Though wisdom probably included seeing $\mathcal{Y}$ from
        previous tasks
-   What if $\mathcal{X}$ is complicated, or we are worried we may have
    chosen the wrong $z = g(x)$?
    -   Can we learn this automatically?

## Can Representations Be Learned?

-   The short answer is: yes, but it is still an art (with finite data)
-   **Representation learning** finds representations
    $g : \mathcal{X} \to \mathcal{Z}$ using $\mathcal{D}$
    -   Hopefully: works well for $x \sim p^*(x,y)$, and for many
        $\ell(\tilde{f} \circ g, x, y)$
-   This happens in subtle ways in many different methods, e.g.,
    -   If we run “unsupervised” clustering or embeddings on our data to
        embed it into a lower dimensional space then run a regression
    -   “Learning the kernel”. See [ProbML Book
        2](https://probml.github.io/pml-book/book2.html) Section 18.6
    -   Autoencoders and variational autoencoders
    -   Deep learning approximations, which we will discuss in detail

## Benefits of Having Learned Representations

-   If you know the correct features, there is no benefit besides maybe
    dimension reduction. But are you so sure they are correct?
-   Can handle complicated data (e.g. text, networks, high-dimensions)
-   Maybe the representations are reusable across problems, just like
    they were for our manual feature engineering
    -   This is part of the process of **transfer learning** and
        **fine-tuning**
-   Using a good representation is more sample efficient because data
    ends up used in fitting $\tilde{f}$ instead of jointly finding
    $f = \tilde{f} \circ g$
-   Maybe problems which are complicated and nonlinear in $\mathcal{X}$
    are simpler in $\mathcal{Z}$ (i.e., linear regression in
    $\mathcal{Z}$)
-   Above all: good representations overfit less and **generalize
    better**

## Jointly Learning Representations and Functions?

-   Because $f \equiv \tilde{f} \circ g$, the simplest approach is just
    to jointly learn both
-   Come up with some hypothesis class $\mathcal{H}$ that is flexible
    enough to have both the representation and function of interest
-   Extra flexibility could overfit (i.e.,
    $\varepsilon_{est}(\mathcal{H})$ could increase even if
    $\varepsilon_{app}(\mathcal{H}) \searrow 0$)
    -   Hence the crucial need for regularization in various forms
-   Notice the nested structure here of $\tilde{f} \circ g$
    -   Hints at why Neural Networks might work so well

## Is this a Mixture of Supervised and Unsupervised?

-   Remember that we talked about finding representations as intrinsic
    to the data itself, and hence “unsupervised”
    -   Fitting $\tilde{f} \circ g$ jointly combines supervised and
        unsupervised
-   Nested structure means we may be able to use this for new problems
    -   Isolate the $g$ parameters and structure from the $\tilde{f}$
    -   Train $\tilde{f} \circ g$ for a new $\tilde{f}$ and fit jointly
        again or **freeze** the $g$
    -   Can work shockingly well in practice!
    -   [What do Neural Networks Learn When Trained on Random
        Labels?](https://arxiv.org/pdf/2006.10455.pdf)
-   Best to start with existing representations and **fine-tune**
    (essential in LLMs)

# Neural Networks, Part II

## Rough Intuition on The Success of Deep Learning

-   The broadest intuition on why deep learning with flexible,
    overparameterized neural networks often works well is a combination
    of:
    1.  The massive number of parameters makes the optimization process
        find more generalizable solutions (sometimes through
        regularization)
    2.  The depth of approximations allows for better representations
    3.  The optimization process seems to be able to learn those
        representations
    4.  The representations are reusable across problems
    5.  Regularization (implicit and explicit) helps us avoid
        overfitting
-   Art, not magic
    -   Need to design architectures ($\mathcal{H}$), optimization, and
        regularization

## Neural Networks and Representations

-   Neural networks are typically “deep”:
    $f_1(f_2(\ldots f_L(x;\theta_L);\theta_2);\theta_1) \equiv f(x;\theta)$
-   If we fit $f(x) \equiv \tilde{f}(g(x;\theta_1);\theta_2)$ then there
    is a chance that we could fit both a good representation and a
    generalizable function
-   So it seems that having two “layers” helps. What is less clear is
    whether
    1.  Having further nesting of representations helps
    2.  The representations themselves can be learned in this way
-   For examples on why multiple layers help see: [Mark Schmidt’s CPSC
    440](https://www.cs.ubc.ca/~schmidtm/Courses/440-W22/L6.pdf),
    [CPSC340](https://www.cs.ubc.ca/~schmidtm/Courses/340-F22/L32.pdf),
    [ProbML Book 1](https://probml.github.io/pml-book/book1.html)
    Section 13.2.1 on the XOR Problem and 13.2.5-13.2.7 for more

## Common Neural Network Designs

-   A common pattern for $\mathcal{H}$ is called the Multi-Layer
    Perceptron (MLP)
    -   See [ProbML Book
        1](https://probml.github.io/pml-book/book1.html) Section 13.2
    -   Alternate linear/nonlinear then end with linear (or match
        domain)
    -   One **hidden** layer $f(x) = W_2 \sigma(W_1 x + b_1) + b_2$
    -   Another layer:
        $f(x) = W_3 \sigma(W_2 \sigma(W_1 x + b_1) + b_2) + b_3$
    -   Where $\sigma(\cdot)$ is a nonlinear **activation function** in
        the jargon. e.g. $\tanh(\cdot)$ or $\max(0, \cdot)$ (called
        ReLU)
    -   See [ProbML Book
        1](https://probml.github.io/pml-book/book1.html) Section 13.2.4
        on numerical properties of gradients
-   If $f : \mathbb{R}^N \to \mathbb{R}$ with 2 hidden layers and
    **width** of $M$:
    -   $W_1 \in \mathbb{R}^{M \times N}, b_1 \in \mathbb{R}^M, W_2 \in \mathbb{R}^{M \times M}, b_2 \in \mathbb{R}^M, W_3 \in \mathbb{R}^{1 \times M}, b_3 \in \mathbb{R}$
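
A forward pass for this two-hidden-layer MLP might look like the sketch below. The sizes `N = 4` and `M = 16` and the random initialization are assumptions for illustration; only the shapes and the formula come from the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 4, 16          # input dimension and width (values chosen for illustration)

# Parameters theta = {W1, b1, W2, b2, W3, b3} with the shapes listed above
theta = {
    "W1": rng.normal(size=(M, N)) / np.sqrt(N), "b1": np.zeros(M),
    "W2": rng.normal(size=(M, M)) / np.sqrt(M), "b2": np.zeros(M),
    "W3": rng.normal(size=(1, M)) / np.sqrt(M), "b3": np.zeros(1),
}

def mlp(x, theta, sigma=np.tanh):
    """f(x) = W3 sigma(W2 sigma(W1 x + b1) + b2) + b3 for a single x in R^N."""
    h1 = sigma(theta["W1"] @ x + theta["b1"])
    h2 = sigma(theta["W2"] @ h1 + theta["b2"])
    return (theta["W3"] @ h2 + theta["b3"])[0]      # scalar output

relu = lambda v: np.maximum(0.0, v)                 # the max(0, .) activation

x = rng.normal(size=N)
num_params = sum(p.size for p in theta.values())
print(mlp(x, theta), mlp(x, theta, sigma=relu), num_params)
```

The parameter count is $MN + M + M^2 + M + M + 1$, so it grows quadratically in the width once $M$ exceeds $N$.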

## Many Problem Specific Variations

-   Use economic intuition and problem specific knowledge to design
    $\mathcal{H}$

    -   Encode knowledge of good representations, easier learning

-   For example, you can approximate functions
    $f : \mathbb{R}^N \to \mathbb{R}$ which are symmetric in their
    arguments (i.e. permutation invariance) with

    $$
    f(X) = \rho\left(\frac{1}{N}\sum_{x\in X} \phi(x)\right)
    $$

    -   $\rho : \mathbb{R}^M \to \mathbb{R}$,
        $\phi : \mathbb{R} \to \mathbb{R}^M$ both neural networks

-   See [Probabilistic Symmetries and Invariant Neural
    Networks](https://www.jmlr.org/papers/volume21/19-322/19-322.pdf) or
    [Exploiting Symmetry in High Dimensional Dynamic
    Programming](https://www.jesseperla.com/publication/symmetry-dynamic-programming/symmetry-dynamic-programming.pdf)
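
The symmetric form $f(X) = \rho\left(\frac{1}{N}\sum_{x\in X} \phi(x)\right)$ can be checked numerically. In this sketch $\rho$ and $\phi$ are small networks with random, untrained weights, and the sizes `N`, `M`, `H` are assumptions. The point is only that the output does not depend on the ordering of the arguments.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, H = 6, 8, 16   # number of arguments, representation size, hidden width

# phi : R -> R^M and rho : R^M -> R as one-hidden-layer networks, random weights
W_phi1, b_phi1 = rng.normal(size=(H, 1)), rng.normal(size=H)
W_phi2, b_phi2 = rng.normal(size=(M, H)) / np.sqrt(H), np.zeros(M)
W_rho1, b_rho1 = rng.normal(size=(H, M)) / np.sqrt(M), np.zeros(H)
W_rho2, b_rho2 = rng.normal(size=(1, H)) / np.sqrt(H), np.zeros(1)

def phi(x):
    """Applied to each scalar element x separately."""
    return W_phi2 @ np.tanh(W_phi1 @ np.array([x]) + b_phi1) + b_phi2

def rho(z):
    return (W_rho2 @ np.tanh(W_rho1 @ z + b_rho1) + b_rho2)[0]

def f(X):
    """f(X) = rho(mean over x in X of phi(x)); the mean ignores ordering."""
    return rho(np.mean([phi(x) for x in X], axis=0))

X = rng.normal(size=N)
X_permuted = rng.permutation(X)
print(f(X), f(X_permuted))
```

The invariance is built into $\mathcal{H}$ by construction rather than learned from data, which is the sense in which the architecture encodes problem-specific knowledge.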

## Transfer Learning and Few-Shot Learning

-   If you have a good representation, you can often use it for new
    problems
-   Take the $\theta$ as an initial condition for the optimizer and
    **fine-tune** on the new problem
-   e.g. take $g \equiv f_2 \circ \ldots \circ f_L$ (i.e., all but the
    last layer, using the nesting $f = f_1 \circ g$ from before) and
    **freeze** it, only changing the last layer $f_1$ during training
    -   Or only freeze some of them
    -   May work well even if the task was completely different. Many
        $y$ are simply linear combinations of the disentangled
        representations, which is part of why kernel methods (which we
        will discuss in a future lecture) work well
-   Can sometimes achieve “one-shot”, “few-shot”, or “zero-shot” learning
    -   e.g. [Zero-Shot Learning](https://arxiv.org/pdf/1706.03466.pdf)
        for image classification

# More Perspectives on Representations

## The Tip of the Iceberg

-   The ideas sketched out previously are just the beginning
-   This section points out a few important ideas and directions for you
    to explore on your own

## Autoencoders and Representations

-   One unsupervised approach is to consider **autoencoders**. See
    [ProbML Book 1](https://probml.github.io/pml-book/book1.html)
    Section 20.3, [ProbML Book
    2](https://probml.github.io/pml-book/book2.html) Section 21, and
    [CPSC 440
    Notes](https://www.cs.ubc.ca/~schmidtm/Courses/440-W22/L10.pdf)
-   Consider connection between representations and compression
-   Let $g : \mathcal{X} \to \mathcal{Z}$ be an **encoder** and
    $h : \mathcal{Z} \to \mathcal{X}$ be a **decoder**, parameterized by
    some $\theta_e$ and $\theta_d$ respectively. Then we want to find the
    empirical equivalent to

$$
\min_{\theta_e, \theta_d} \mathbb{E}_{p^*(x)} (h(g(x;\theta_e);\theta_d) - x)^2 + \text{regularizer}
$$

-   Whether the $g(x;\theta_e)$ is a good representation depends on
    whether “compression” of the information of $x$ is useful for
    downstream tasks
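
For intuition, the simplest special case is a linear encoder and decoder with no regularizer, where the top principal components give a minimizer of the reconstruction loss. The sketch below builds that autoencoder with an SVD on simulated data (the data and `K = 2` are assumptions). Nonlinear autoencoders would instead be fit with gradient-based optimization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed data: 200 observations in R^5 with correlated coordinates, then centered
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)

K = 2                                        # dimension of the representation z
U, s, Vt = np.linalg.svd(X, full_matrices=False)

encode = lambda x: x @ Vt[:K].T              # g(x; theta_e), linear
decode = lambda z: z @ Vt[:K]                # h(z; theta_d), linear

# Empirical reconstruction loss (no regularizer), averaged over observations
reconstruction_loss = np.mean(np.sum((decode(encode(X)) - X) ** 2, axis=1))

# In this linear case the loss equals the discarded squared singular values over N
print(reconstruction_loss, np.sum(s[K:] ** 2) / len(X))
```

The "compression" is explicit here: $z$ keeps the $K$ directions of largest variance, which is useful downstream only if those directions matter for the task.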

## Manifold Hypothesis

-   Are there always simple, reusable representations?
-   One perspective is called the [Manifold
    Hypothesis](https://en.wikipedia.org/wiki/Manifold_hypothesis), see
    [ProbML Book 1](https://probml.github.io/pml-book/book1.html)
    Section 20.4.2
-   The basic idea is that even the most complicated data sources in the
    real world ultimately are concentrated on a low-dimensional manifold
    -   including images, text, networks, high-dimensional time-series,
        etc.
    -   i.e., the world is much simpler than it appears if you find the
        right transformation
-   [ProbML Book 1](https://probml.github.io/pml-book/book1.html)
    Section 20.4.2.4 describes Manifold Learning, which tries to learn
    the geometry of these manifolds through variations on nonlinear
    dimension reduction

## Embeddings

-   A related idea is to **embed** data into a different (sometimes
    lower-dimensional) space
-   For example, text or images or networks, mapped into $\mathbb{R}^M$
-   Whether it is lower or higher dimensional, the key is to preserve
    some geometric properties, typically a norm. i.e.,
    $||x - y|| \approx ||g(x) - g(y)||$
-   You can think of learned representations inside neural networks as
    often doing embedding in some form, especially if you
    use an [information
    bottleneck](https://arxiv.org/pdf/1503.02406.pdf)
-   See [ProbML Book 1](https://probml.github.io/pml-book/book1.html)
    Section 20.5 and 23 for examples with text and networks
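
The property $||x - y|| \approx ||g(x) - g(y)||$ can be illustrated with a simple embedding that is not learned at all: a scaled Gaussian random projection. This choice of $g$, along with the dimensions `D` and `M`, is an assumption for illustration. Learned embeddings aim for a similar property on the data they are trained on.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, n_points = 1000, 500, 10   # original dimension, embedding dimension, points

# g(x) = P x with P a scaled Gaussian random matrix (a random-projection embedding)
P = rng.normal(size=(M, D)) / np.sqrt(M)
g = lambda x: P @ x

points = rng.normal(size=(n_points, D))
ratios = []
for i in range(n_points):
    for j in range(i + 1, n_points):
        original = np.linalg.norm(points[i] - points[j])
        embedded = np.linalg.norm(g(points[i]) - g(points[j]))
        ratios.append(embedded / original)

print(min(ratios), max(ratios))  # near 1 when distances are roughly preserved
```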

## Lottery Tickets and Representations

-   [The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural
    Networks](https://arxiv.org/pdf/1803.03635.pdf)
-   The idea is that inside of every huge, overparameterized
    approximation with random initialization and layers is a small
    sparse approximation
    -   You can show this by pruning most of the parameters and training
        it
-   Perhaps huge, overparameterized functions with many layers are more
    likely to contain the lottery ticket
    -   Then the optimization methods are just especially good at
        finding them
    -   Ex-post you could prune, but ex-ante you don’t know the sparsity
        structure

## Disentangling Representations

-   Representations that separate out different factors of variation
    into separate variables are often easier to use for downstream tasks
    and are more transferable
-   The keyword to look for in a literature review is [disentangled
    representations](https://arxiv.org/pdf/1812.02230.pdf)
-   It seems like this is hard to do [without
    supervision](https://proceedings.mlr.press/v97/locatello19a/locatello19a.pdf)
-   But
    [semi-supervised](https://www.cs.toronto.edu/~bonner/courses/2022s/csc2547/papers/generative/disentangled-representations/semi-supervised-disentangling,-siddharth,-nips2017.pdf)
    approaches seem to work well

## Out of Distribution Learning

-   We have been discussing a $\mathcal{D}$ from some idealized,
    constant $p^*(x,y)$
-   What if the distribution changes, or our samples are not IID (e.g.,
    in control and reinforcement learning applications where it is
    $p^*(x,y;f)$)?
-   See [ProbML Book 2](https://probml.github.io/pml-book/book2.html)
    Section 19.1 for more
-   This is called “robustness to distribution shift”, covariate shift,
    etc. in different settings
-   There are many methods, but one common element is:
    -   Models with better “generalization” tend to perform much better
        under distribution shift
    -   Good representations are much more robust
    -   With transfer learning you may be able to adapt very easily