# ECON622: Computational Economics with Data Science Applications

Optimization for Machine Learning

Jesse Perla (University of British Columbia)

# Overview

## Summary

-   This lecture continues from the previous lecture on gradients to
    further explore optimization methods in machine learning, and
    discusses training pipelines and tooling
-   Primary reference materials are:
    -   [ProbML Book 1:
        Introduction](https://probml.github.io/pml-book/book1.html)
    -   [ProbML Book 2: Advanced
        Topics](https://probml.github.io/pml-book/book2.html) including
        Section 6.3
    -   [Mark Schmidt’s ML Lecture
        Notes](https://www.cs.ubc.ca/~schmidtm/Courses/LecturesOnML/)
-   We will also give a sense of a standard machine learning pipeline of
    training, validation, and test data and discuss generalization,
    logging, etc.

## Why the Emphasis on Optimization and Gradients?

-   A huge number of algorithms for economists can be written as
    optimization problems (e.g., MLE, interpolation) or as something
    similar in spirit (e.g. Bayesian Sampling, Reinforcement Learning)
-   In practice, **all** problems with high-dimensional parameters or
    latents require gradients
-   Previous lectures on AD showed we can find gradients for extremely
    complicated functions with VJPs

# Optimization Crash Course

## Optimization Methods

-   Learning continuous optimization methods is an enormous project
-   See referenced materials and lecture notes
-   Here we will give an overview of some key concepts
-   Be warned! The details matter, so more study is required if you want
    to use these methods in practice

## Crash Course in Unconstrained Optimization

$$
\min_{\theta} \mathcal{L}(\theta)
$$

Will briefly introduce

-   First-order methods
-   Second-order methods
-   Preconditioning
-   Momentum
-   Regularization

## First-Order Methods

-   See [ProbML Book 1](https://probml.github.io/pml-book/book1.html)
    Section 8.2
-   Armed with reverse-mode AD for
    $\mathcal{L} : \mathbb{R}^N \to \mathbb{R}$ we can calculate
    $\nabla \mathcal{L}(\theta)$ with the same computational order as
    $\mathcal{L}(\theta)$
-   Furthermore, given VJPs we know we can calculate these gradients
    for extremely complicated functions (e.g., nested fixed points and
    implicit functions)
-   Iterative: take $\theta_0$ and provide $\theta_t \to \theta_{t+1}$
    -   May converge to a stationary point (hopefully close to a global
        argmin)
    -   If it doesn’t converge, the solution may still be an argmin
    -   See references for details on convergence for convex and
        non-convex problems

## Gradient Descent

-   See [Mark Schmidt’s
    Notes](https://www.cs.ubc.ca/~schmidtm/Courses/340-F22/L13.pdf)
-   Gradient descent takes an initial $\theta_0$ and stepsizes $\eta_t$,
    and iterates until $\nabla \mathcal{L}(\theta_t)$ is small or
    $\theta_t$ stops changing

$$
\theta_{t+1} = \theta_t - \eta_t \nabla \mathcal{L}(\theta_t)
$$

-   It is the simplest “first-order” method (i.e., ones using just the
    gradient of $\mathcal{L}$)
-   Will call $\eta_t$ a “learning rate schedule”
-   Think of line-search methods as choosing the stepsize $\eta_t$
    optimally. Useful as well for economists, even if used infrequently
    in ML
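
A minimal sketch of the iteration above in plain Python (rather than Pytorch or JAX, to keep it self-contained). The quadratic objective, the constant stepsize `eta`, and the tolerance are assumptions chosen for illustration, not part of the lecture code.

```python
def gradient_descent(grad, theta0, eta=0.1, tol=1e-8, max_steps=10_000):
    """Iterate theta <- theta - eta * grad(theta) until the gradient norm is small."""
    theta = list(theta0)
    for step in range(max_steps):
        g = grad(theta)
        if sum(gi * gi for gi in g) ** 0.5 < tol:  # stop when ||grad|| < tol
            break
        theta = [ti - eta * gi for ti, gi in zip(theta, g)]
    return theta, step


# Hypothetical objective: L(theta) = (theta_0 - 1)^2 + 2 * (theta_1 + 3)^2
def grad_L(theta):
    return [2.0 * (theta[0] - 1.0), 4.0 * (theta[1] + 3.0)]


theta_hat, n_steps = gradient_descent(grad_L, [0.0, 0.0])
print(theta_hat, n_steps)
```

With a constant `eta` this only works because the example is well conditioned; too large a stepsize would make the iterates diverge.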

## When and Where Does This Converge?

-   Skipping a million details, see [ProbML Book
    1](https://probml.github.io/pml-book/book1.html) Section 8.2.2 and
    [Mark Schmidt’s
    Notes](https://www.cs.ubc.ca/~schmidtm/Courses/340-F22/L14.pdf)

-   For strictly convex problems this converges to the global minimum.
    Sufficient conditions on the stepsizes include the Robbins-Monro
    conditions $\eta_t \to 0$ and

    $$
    \lim_{T\to\infty}\frac{\sum_{t=1}^T \eta_t^2}{\sum_{t=1}^T \eta_t} = 0
    $$

-   For problems that are not globally convex this may go to a local
    optimum, but if the function is locally strictly convex then it will
    converge to a local optimum

-   For other types of functions (e.g.,
    [invex](https://en.wikipedia.org/wiki/Invex_function)) it may still
    converge to the “right” solution in some important sense

## Preconditioned Gradient Descent

-   As we saw analyzing LLS, badly conditioned problems converge slowly
    with iterative methods

-   We can precondition a problem as we did with linear systems, and it
    has the same stationary point

-   Choose some $C_t$ for preconditioned gradient descent $$
    \theta_{t+1} = \theta_t - \eta_t C_t \nabla \mathcal{L}(\theta_t)
    $$

-   We saw before that the Hessian tells us the geometry, so the optimal
    preconditioner must be related to $\nabla^2 \mathcal{L}(\theta_t)$
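
To see the effect of conditioning, here is a small sketch comparing plain and preconditioned gradient descent on a badly conditioned quadratic. The objective, the diagonal preconditioner (the inverse of the Hessian diagonal), and the stepsizes are all assumptions made for illustration.

```python
def count_steps(grad, theta0, eta, precond, tol=1e-6, max_steps=100_000):
    """Run theta <- theta - eta * C * grad(theta) with diagonal C; return steps used."""
    theta = list(theta0)
    for step in range(max_steps):
        g = grad(theta)
        if max(abs(gi) for gi in g) < tol:
            return step
        theta = [ti - eta * ci * gi for ti, ci, gi in zip(theta, precond, g)]
    return max_steps


# Hypothetical objective: L(theta) = 0.5 * (100 * theta_0^2 + theta_1^2)
# Its Hessian is diag(100, 1), so the condition number is 100.
def grad_L(theta):
    return [100.0 * theta[0], 1.0 * theta[1]]


# Plain GD: the stepsize is limited by the steep direction (curvature 100)
steps_plain = count_steps(grad_L, [1.0, 1.0], eta=0.01, precond=[1.0, 1.0])
# Preconditioned GD: C rescales each direction by its inverse curvature
steps_precond = count_steps(grad_L, [1.0, 1.0], eta=0.5, precond=[1.0 / 100.0, 1.0])
print(steps_plain, steps_precond)
```

The preconditioned run needs far fewer steps because, after rescaling, both directions contract at the same rate.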

## Second-Order Methods

-   See [ProbML Book 1](https://probml.github.io/pml-book/book1.html)
    Section 8.3
-   Adapt $\eta_t C_t$ to use the Hessian (e.g., Newton’s Method)

$$
\theta_{t+1} = \theta_t - \eta_t \left[\nabla^2 \mathcal{L}(\theta_t) \right]^{-1}\nabla \mathcal{L}(\theta_t)
$$

-   Second-order methods are rarer because calculating the Hessian is
    no longer the same computational order as $\mathcal{L}(\theta)$
-   See [ProbML Book 1](https://probml.github.io/pml-book/book1.html)
    Section 8.3.2 for info on quasi-Newton methods, such as BFGS, which
    approximate the Hessian using gradients
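
As a rough illustration of the Newton update, consider a scalar problem, where the inverse Hessian is just one over the second derivative. The objective $\mathcal{L}(\theta) = e^{\theta} - 2\theta$ (minimized at $\theta = \log 2$) and the function names are assumptions for this sketch.

```python
import math


def newton(grad, hess, theta0, eta=1.0, tol=1e-12, max_steps=50):
    """Scalar Newton iteration: theta <- theta - eta * grad(theta) / hess(theta)."""
    theta = theta0
    for _ in range(max_steps):
        g = grad(theta)
        if abs(g) < tol:
            break
        theta = theta - eta * g / hess(theta)
    return theta


# Hypothetical objective L(theta) = exp(theta) - 2 * theta
theta_hat = newton(
    grad=lambda th: math.exp(th) - 2.0,  # first derivative
    hess=lambda th: math.exp(th),        # second derivative (always positive here)
    theta0=0.0,
)
print(theta_hat, math.log(2.0))
```

In higher dimensions the division becomes a linear solve against the Hessian, which is where the extra cost comes from.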

## Momentum

-   See [ProbML Book 1](https://probml.github.io/pml-book/book1.html)
    Section 8.2.4
-   Can use “momentum”, which speeds up convergence, helps avoid local
    optima, and moves fast in flat regions
-   Momentum will be a common feature of many ML optimizers (e.g. Adam,
    RMSProp, etc.) as it helps with heavily non-convex problems
-   A classic method is called Nesterov Accelerated Gradient (NAG),
    which is a modification of gradient descent for some
    $\beta_t\in (0,1)$ (e.g., $0.9$)

$$
\begin{aligned}
\hat{\theta}_{t+1} &= \theta_t + \beta_t(\theta_t - \theta_{t-1})\\
\theta_{t+1} &= \hat{\theta}_{t+1} - \eta_t \nabla \mathcal{L}(\hat{\theta}_{t+1})
\end{aligned}
$$
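
The two-line update above can be sketched as follows. The quadratic test problem, the constant $\eta$ and $\beta$, and the convention $\theta_{-1} = \theta_0$ are assumptions for illustration; setting `beta=0.0` recovers plain gradient descent for comparison.

```python
def nesterov(grad, theta0, eta=0.01, beta=0.9, n_steps=500):
    """Nesterov accelerated gradient with constant eta and beta."""
    theta_prev = list(theta0)  # theta_{-1} = theta_0, so the first step has no momentum
    theta = list(theta0)
    for _ in range(n_steps):
        # Extrapolate along the previous direction of travel
        look = [t + beta * (t - tp) for t, tp in zip(theta, theta_prev)]
        g = grad(look)  # gradient is evaluated at the extrapolated point
        theta_prev = theta
        theta = [l - eta * gi for l, gi in zip(look, g)]
    return theta


# Hypothetical badly conditioned objective: 0.5 * (100 * theta_0^2 + theta_1^2)
def grad_L(theta):
    return [100.0 * theta[0], 1.0 * theta[1]]


theta_nag = nesterov(grad_L, [1.0, 1.0])
theta_gd = nesterov(grad_L, [1.0, 1.0], beta=0.0)  # beta = 0 is plain GD
print(theta_nag, theta_gd)
```

After the same number of steps, the momentum version is much closer to the minimizer at zero in the flat direction $\theta_1$.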

## Does Uniqueness Matter?

-   Remember from our previous lecture on Sobolev norms and
    regularization that we care about functions, not parameters.
-   Consider when $\theta$ is used as parameters for a function
    (e.g. $\hat{f}_{\theta}$)
    -   Then what does a lack of convergence of the $\theta_t$ or
        multiplicity with multiple $\theta$ solutions mean?
    -   Maybe nothing! If
        $||\hat{f}_{\theta_0} - \hat{f}_{\theta_1}||_S$ is small, then
        the functions themselves may be in the same equivalence class.
        Depends on the norm, of course.
-   This topic will be discussed when we consider double-descent curves,
    but the punchline for now is that the training/optimization is a
    means to an end (i.e., generalization) and not an end in itself.

## Regularization

-   See [Mark Schmidt’s
    Notes](https://www.cs.ubc.ca/~schmidtm/Courses/340-F22/L16.pdf). For
    LLS this is the ridge regression
-   We discussed regularization as a way to deal with multiplicity

$$
\min_{\theta}\left[\mathcal{L}(\theta) + \frac{\alpha}{2} ||\theta||^2\right]
$$

-   Gradient descent becomes (called “weight decay” in ML, and “ridge
    regression” if objective is LLS)

$$
\theta_{t+1} = \theta_t - \eta_t \left[\nabla \mathcal{L}(\theta_t) + \alpha \theta_t\right]
$$

-   Mapping of regularized $\theta_t$ to a $f_{\theta_t}$ is subtle if
    nonlinear
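
A small check of the weight-decay update for scalar LLS, compared against the ridge solution in closed form. The four data points, $\alpha$, and $\eta$ are made up for illustration.

```python
# Hypothetical data, roughly y = 2 x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
N, alpha, eta = len(xs), 0.5, 0.01


def grad_loss(theta):
    """Gradient of the unregularized LLS loss (1/N) * sum (y - theta * x)^2."""
    return -2.0 / N * sum(x * (y - theta * x) for x, y in zip(xs, ys))


theta = 0.0
for _ in range(2_000):
    # Gradient step on the loss plus the weight decay term alpha * theta
    theta = theta - eta * (grad_loss(theta) + alpha * theta)

sxx = sum(x * x for x in xs) / N
sxy = sum(x * y for x, y in zip(xs, ys)) / N
theta_ridge = 2.0 * sxy / (2.0 * sxx + alpha)  # first-order condition, scalar case
theta_ols = sxy / sxx                          # unregularized solution
print(theta, theta_ridge, theta_ols)
```

The regularized solution is shrunk toward zero relative to the unregularized one, as expected.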

# Stochastic Optimization

## Are Gradients Really that Cheap to Calculate?

-   Consider that the objective often involves data (or grid points for
    interpolation)
    -   Denote the data $x_n$ and observables $y_n$ for $n=1, \ldots, N$
-   With VJPs, the computational order of
    $\nabla_{\theta} \mathcal{L}(\theta;\{x_n, y_n\}_{n=1}^N)$ may be
    the same as that of $\mathcal{L}$ itself
-   However, keep in mind that reverse-mode requires storing the
    intermediate values in the “primal” calculation (i.e.,
    $\mathcal{L}(\theta;\{x_n, y_n\}_{n=1}^N)$)
    -   Hence, the memory requirements grow with $N$
    -   This may be a big problem for large datasets or complicated
        calculations, especially with GPUs which have more limited
        memory

## Do We Need the Full Gradient?

-   In practice, it is often infeasible to calculate the full gradient
    for large datasets

-   In GD, the gradient provided the direction of steepest descent

-   Consider an algorithm with a $g_t$ as an unbiased estimate of the
    gradient

    $$
    \begin{aligned}
    \theta_{t+1} &= \theta_t - \eta_t g_t\\
    \mathbb{E}[g_t] &= \nabla \mathcal{L}(\theta_t)
    \end{aligned}
    $$

    -   Make the $\eta_t$ smaller to deal with noise if this is
        high-variance
    -   Choose $g_t$ to be far cheaper to calculate than
        $\nabla \mathcal{L}(\theta_t)$

-   Will turn out that this also adds additional regularization, which
    helps with generalization

## Stochastic Optimization

-   To formalize: Up until now our optimizers have been “deterministic”
-   Now we introduce a source of randomness $z \sim q_{\theta}(z)$,
    i.e. it might depend on the estimated parameters $\theta$ later with
    RL/etc.
    -   $z$ could be a source of uncertainty in the environment
    -   $z$ could involve latent variables
    -   $z$ could come from randomness in the optimization process
        (e.g., using subsets of data to form $g_t$)
-   Denote expectations using this distribution as
    $\mathbb{E}_{q_{\theta}(z)}$
-   For now, drop the dependence on $\theta$ for simplicity, though it
    becomes crucial for understanding reinforcement learning/etc.

## Stochastic Objective

-   The full optimization problem is then to minimize this stochastic
    objective

$$
\min_{\theta}\overbrace{\mathbb{E}_{q(z)} \tilde{\mathcal{L}}(\theta, z)}^{\equiv \mathcal{L}(\theta)}
$$

-   Under appropriate regularity conditions, could use GD on this
    objective

$$
\nabla \mathcal{L}(\theta) = \mathbb{E}_{q(z)}\left[\nabla \tilde{\mathcal{L}}(\theta, z)\right]
$$

-   But in practice, it is rare that we can marginalize out the $z$

## Unbiased Draws from the Gradient

-   Assume we can sample $z_t \sim q(z)$ IID
-   Then with enough regularity the gradient using just $z_t$ is
    unbiased

$$
\mathbb{E}_{q(z)}\left[  \nabla \tilde{\mathcal{L}}(\theta_t, z_t) \right] = \nabla \mathcal{L}(\theta_t)
$$

-   That is, on average $\nabla \tilde{\mathcal{L}}(\theta_t, z_t)$ is
    in the right direction for minimizing $\mathcal{L}(\theta_t)$
-   This basic approach of finding unbiased estimators of the gradient
    (and finding ways to lower the variance) is at the heart of most ML
    optimization algorithms
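
A quick Monte Carlo check of unbiasedness, under assumptions chosen so the answer is known: $\tilde{\mathcal{L}}(\theta, z) = \frac{1}{2}(\theta - z)^2$ with $z \sim N(\mu, 1)$, so that $\nabla \mathcal{L}(\theta) = \theta - \mu$. The values of `mu`, `theta`, and the seed are arbitrary.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible
mu = 2.0                # hypothetical mean of q(z) = Normal(mu, 1)
theta = 0.5             # point at which the gradient is evaluated


def grad_tilde(theta, z):
    """Gradient in theta of the per-draw loss 0.5 * (theta - z)**2."""
    return theta - z


true_grad = theta - mu  # gradient of L(theta) = E[0.5 * (theta - z)**2]
draws = [grad_tilde(theta, rng.gauss(mu, 1.0)) for _ in range(20_000)]
mc_grad = sum(draws) / len(draws)
print(true_grad, mc_grad)
```

Any single draw can be far from `true_grad`, but the average over many draws is close to it.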

## Stochastic Gradient Descent

-   See [ProbML Book 1](https://probml.github.io/pml-book/book1.html)
    Section 8.4, [ProbML Book
    2](https://probml.github.io/pml-book/book2.html) Section 6.3, and
    [Mark Schmidt’s
    Notes](https://www.cs.ubc.ca/~schmidtm/Courses/340-F22/L23.pdf)
-   As on the previous slide, given IID samples $z_t \sim q$ the
    gradient estimate is unbiased, which gives the simplest version of
    stochastic gradient descent (SGD)

$$
\theta_{t+1} = \theta_t - \eta_t \nabla \tilde{\mathcal{L}}(\theta_t, z_t)
$$

-   This converges to a minimum of $\mathcal{L}(\theta)$ under
    appropriate conditions
-   We can layer on all of the other features we discussed (e.g.,
    momentum, preconditioning, etc) with SGD, but some become especially
    important (e.g. the $\eta_t$ schedule)
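
To illustrate why the schedule matters, here is a sketch of SGD on the assumed per-draw loss $\tilde{\mathcal{L}}(\theta, z) = \frac{1}{2}(\theta - z)^2$ with $z \sim N(\mu, 1)$. With the particular decaying schedule $\eta_t = 1/(t+1)$ (an assumption for this example), the SGD iterate coincides with the running sample mean of the draws.

```python
import random

rng = random.Random(1)
mu = 2.0
zs = [rng.gauss(mu, 1.0) for _ in range(5_000)]  # IID draws z_t ~ q

theta = 0.0
for t, z in enumerate(zs):
    eta_t = 1.0 / (t + 1)                # decaying learning rate schedule
    theta = theta - eta_t * (theta - z)  # SGD step; the gradient is theta - z_t

sample_mean = sum(zs) / len(zs)
print(theta, sample_mean, mu)
```

A constant $\eta$ would instead leave $\theta_t$ bouncing around the minimizer with a variance proportional to the stepsize.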

## Finite-Sum Objectives

-   Consider a special case of the loss function which is the sum of $N$
    terms. For example with empirical risk minimization used in LLS/etc.

    -   $z_n \equiv (x_n, y_n)$ are typically data, observables, or grid
        points
    -   $\ell(\theta, x_n, y_n)$ is a loss function for a single data
        point (e.g., forecasting using some $f_{\theta}$)

    $$
    \mathcal{L}(\theta) = \frac{1}{N}\sum_{n=1}^N \tilde{\mathcal{L}}(\theta, z_n) \equiv \frac{1}{N}\sum_{n=1}^N \ell(\theta, x_n, y_n)
    $$

    -   For example, LLS is
        $\ell(\theta, x_n, y_n) = ||y_n - \theta \cdot x_n||^2_2$

-   In this case, the randomness of $z_t$ is which data point is chosen

## SGD for Finite-Sum Objectives

-   Hence consider sampling $z_t \equiv (x_t, y_t)$ from our data.
    -   In principle, IID with replacement
-   Then run SGD on one data point at a time

$$
\theta_{t+1} = \theta_t - \eta_t \nabla_{\theta} \ell(\theta_t, x_t, y_t)
$$

-   This may converge to a minimum of $\mathcal{L}(\theta)$, and
    potentially the storage requirements for calculating the gradient
    are radically reduced
-   You can guess that the $\eta_t$ parameter is especially sensitive to
    the variance of the gradient estimate

## Decrease Variance with Multiple Draws

-   With a single draw, the variance of the gradient estimate may be
    high

$$
\mathbb{E}\left[\left\|\nabla_{\theta} \ell(\theta_t, x_t, y_t)- \nabla \mathcal{L}(\theta_t)\right\|^2\right]
$$

-   One tool to decrease the variance is just more Monte Carlo draws.
    With finite-sum objectives draw a set of indices
    $B \subseteq \{1,\ldots, N\}$

$$
\frac{1}{|B|}\sum_{n \in B} \nabla_{\theta} \ell(\theta_t, x_n, y_n)
$$

-   Classic SGD: $|B|=1$; GD: $B = \{1, \ldots N\}$ and in between is
    called “minibatch SGD”. Usually minibatch is implied with “SGD”
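
A rough numerical illustration of the variance reduction, assuming synthetic scalar LLS data and batches drawn with replacement. The data-generating process, the evaluation point `theta`, and the number of draws are all assumptions.

```python
import random

rng = random.Random(0)
N = 200
xs = [rng.gauss(0.0, 1.0) for _ in range(N)]      # synthetic regressors
ys = [2.0 * x + rng.gauss(0.0, 0.5) for x in xs]  # synthetic observables
theta = 0.0


def batch_grad(theta, batch):
    """Average gradient of (y_n - theta * x_n)^2 over the indices in batch."""
    return sum(-2.0 * xs[n] * (ys[n] - theta * xs[n]) for n in batch) / len(batch)


def estimator_variance(batch_size, n_draws=2_000):
    """Empirical variance of the minibatch gradient across random batches."""
    draws = []
    for _ in range(n_draws):
        batch = [rng.randrange(N) for _ in range(batch_size)]  # with replacement
        draws.append(batch_grad(theta, batch))
    mean = sum(draws) / n_draws
    return sum((d - mean) ** 2 for d in draws) / n_draws


var_1 = estimator_variance(1)
var_10 = estimator_variance(10)
print(var_1, var_10)
```

With IID sampling the variance should fall roughly in proportion to $1/|B|$, at the cost of $|B|$ gradient evaluations per step.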

## Minibatch SGD

-   Algorithm is to draw $B_t$ indices at each step and execute SGD $$
    \begin{aligned}
    g_t &\equiv \frac{1}{|B_t|}\sum_{n \in B_t} \nabla_{\theta} \ell(\theta_t, x_n, y_n)\\
    \theta_{t+1} &= \theta_t - \eta_t g_t
    \end{aligned}
    $$

-   Note that we never need to calculate $\mathcal{L}(\theta_t)$
    directly, so can write our code to all operate on batches $B_t$

-   Then layer other tricks on top (e.g., momentum, preconditioning,
    etc.)

    -   In principle you could also use minibatches with second-order
        or quasi-Newton methods, but this is much rarer
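
A minimal sketch of minibatch SGD for scalar LLS in plain Python. The synthetic data, the constant $\eta$, the batch size, and the function name `minibatch_sgd` are assumptions for illustration, and batches are drawn with replacement for simplicity.

```python
import random


def minibatch_sgd(xs, ys, theta0, eta, batch_size, n_steps, seed=0):
    """Minibatch SGD for the loss (1/N) * sum (y_n - theta * x_n)^2."""
    rng = random.Random(seed)
    N = len(xs)
    theta = theta0
    for _ in range(n_steps):
        batch = [rng.randrange(N) for _ in range(batch_size)]  # draw B_t
        # g_t: average per-observation gradient over the batch only
        g = sum(-2.0 * xs[n] * (ys[n] - theta * xs[n]) for n in batch) / batch_size
        theta = theta - eta * g
    return theta


# Synthetic data (assumption): y = 3 x + small noise
data_rng = random.Random(42)
xs = [data_rng.gauss(0.0, 1.0) for _ in range(500)]
ys = [3.0 * x + data_rng.gauss(0.0, 0.1) for x in xs]

theta_hat = minibatch_sgd(xs, ys, theta0=0.0, eta=0.05, batch_size=10, n_steps=2_000)
print(theta_hat)
```

Note that the full loss over all $N$ observations is never evaluated inside the loop.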

## Choosing Batches

-   Choosing the $B_t$ process may be tricky. You could sample from
    $\{1,\ldots N\}$
    -   with replacement
    -   without replacement
    -   without replacement after shuffling the data, and then ensure
        you have gone through all of the data before repeating
    -   etc.
-   Just remember the goal: variance reduction on gradient estimates
-   You want it to be unbiased in principle (consider partitioning the
    data into batches and operating sequentially?)
-   More art than science in many cases, because it requires many priors
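
One of the options above, shuffling and then going through all of the data before repeating, might look like the following sketch. The helper name `epoch_batches` is hypothetical.

```python
import random


def epoch_batches(N, batch_size, rng):
    """Shuffle the indices 0..N-1 and partition them into consecutive batches."""
    indices = list(range(N))
    rng.shuffle(indices)
    # The last batch is smaller when batch_size does not divide N
    return [indices[i:i + batch_size] for i in range(0, N, batch_size)]


rng = random.Random(0)
batches = epoch_batches(10, 4, rng)
print(batches)
```

Each index appears exactly once per pass, and reshuffling before each pass avoids always visiting the data in the same order.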

## “Grad Student Descent”

-   This is how virtually all deep learning works. Just swap SGD with
    slightly fancier algorithms using momentum, tinker with parameters,
    etc.
-   In practice, all of these optimizer settings (e.g., how large for
    $|B_t|$, $\eta_t$, convergence criteria, etc.) are fragile and
    require a lot of tuning
    -   Part of a process called **hyperparameter optimization (HPO)**
        where you try to find the best non-model parameters for your
        goals
    -   Same issue with all numerical methods in economics
        (e.g. convergence criteria of fixed point iteration, initial
        conditions)
-   The concern is not just that it is time-consuming for researchers
    (and ML “Grad Students”), but that it is easy for priors to sneak in
    and bias results

## What was our Goal?

-   We will address this more formally next lecture, but it is worth
    stepping back to think about our goals. Loosely:
    -   If we are solving an empirical risk minimization problem (like
        regressions, etc.) or interpolation, then our goal is to use the
        “data” to find a function $\hat{f}_{\theta}$ that is close to
        the “true” function $f^*$
-   Fitting $\hat{f}_{\theta}$ is easy, but we want it to **generalize**
    within the true distribution
    -   But we don’t know that distribution (hence the “empirical”)
    -   So a typical approach is to emulate this by splitting the data
        we have
    -   But HPO is dangerous because if we are not careful we can
        “contaminate” our process for finding $\hat{f}_{\theta}$ using
        some of the data we intend to check it with. Which might lead to
        overfitting/etc.

# Training Loops

## Splitting the Data

A standard way to do this for Empirical Risk
Minimization/Regressions/etc. is to split it into three parts:

1.  **Training** data used in fitting our approximations
    -   This is just a means to an end in ML and economics
2.  **Validation** data used for HPO and checking convergence criteria
    -   Be cautious to avoid using it for training
3.  **Test** data used to evaluate the generalization performance
    -   Ensure we don’t accidentally use it in training or validation

Not all problems will have this structure, and not all will have
validation data.
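
A simple way to form such a split is to shuffle the indices once and slice them. The function name and the 70/15/15 proportions below are assumptions, not a recommendation.

```python
import random


def train_val_test_split(N, frac_train=0.7, frac_val=0.15, seed=0):
    """Return three disjoint lists of indices covering 0..N-1."""
    rng = random.Random(seed)
    idx = list(range(N))
    rng.shuffle(idx)  # shuffle once so the split does not depend on data order
    n_train = round(frac_train * N)
    n_val = round(frac_val * N)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]  # whatever remains is held out for testing
    return train, val, test


train_idx, val_idx, test_idx = train_val_test_split(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

Fixing the seed keeps the test set identical across runs, which helps avoid accidentally leaking it into training or HPO.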

## Why Separate Validation and Test?

-   As we will see in deep learning, with massive over-parameterization
    you typically can interpolate all of the training data.
    -   Minimizing training loss is a means to an end, which usually
        ends at zero
-   The validation data might be used to check stopping criteria by
    checking how well the approximation generalizes to data outside of
    training
-   But if we are using it for a stopping criterion or HPO, then it is
    **contaminated**!
    -   Distorts our picture of generalization if we combine it into
        test data

## What about Interpolation Problems?

-   When simply trying to find interpolating functions which solve
    functional equations, the risk of prior contamination is less clear
-   However, you may still want to separate out validation and test grid
    points because any data you use for HPO or convergence criteria
    can’t be used to understand generalization.
-   For example consider:
    1.  Fit until “training” loss is zero
    2.  Keep running stochastic optimizer until “validation” loss is
        zero
-   In that case, it is crudely interpolating the validation data,
    which makes it effectively equivalent to training data and not
    useful for assessing generalization
    -   May find that the model generalized better if you **stopped
        earlier**

## Level of Abstraction for Optimizers

-   While you can set up a standard optimization objective and
    optimizer, most ML frameworks work at a lower level
-   The key reasons are that:
    -   Minibatching (usually just called “batches”) requires more
        flexibility in implementation to be efficient
    -   Stopping criteria are more complicated with highly
        overparameterized models
    -   Logging and validation logic requires more flexibility
    -   Often you will want to take a snapshot of the current best
        solution and continue later for refinement (or to solve in
        parallel)

## Steps and Epochs

-   There is a great deal of flexibility in how you set up the optimizer
-   But a common approach is to randomly shuffle the data, create a set
    of batches $B_t$ (without replacement), and then iterate through
    them
-   Terminology (when relevant)
    -   Every iteration of SGD for a given batch is a **step**
    -   If you have gone through the entire dataset once, we say that
        you have completed an **epoch**
-   At the end of an epoch is a good time to log, check the validation
    loss, and potentially stop the training
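
Putting steps and epochs together, a hand-coded training loop might look like the sketch below. The synthetic data, the fixed train/validation split, the patience-based stopping rule, and all hyperparameters are assumptions for illustration.

```python
import random

rng = random.Random(0)
# Synthetic data (assumption): y = 3 x + small noise
N = 200
xs = [rng.gauss(0.0, 1.0) for _ in range(N)]
ys = [3.0 * x + rng.gauss(0.0, 0.1) for x in xs]
train, val = list(range(0, 150)), list(range(150, 200))


def mse(theta, idx):
    return sum((ys[n] - theta * xs[n]) ** 2 for n in idx) / len(idx)


def batch_grad(theta, batch):
    return sum(-2.0 * xs[n] * (ys[n] - theta * xs[n]) for n in batch) / len(batch)


theta, eta, batch_size = 0.0, 0.05, 10
history = []  # one (epoch, step, validation loss) record per epoch
step = 0
best_val, best_theta, patience, bad_epochs = float("inf"), theta, 3, 0
for epoch in range(50):
    order = train[:]
    rng.shuffle(order)  # reshuffle the training data every epoch
    for i in range(0, len(order), batch_size):
        theta -= eta * batch_grad(theta, order[i:i + batch_size])  # one step
        step += 1
    val_loss = mse(theta, val)  # end of epoch: log and check validation loss
    history.append((epoch, step, val_loss))
    if val_loss < best_val - 1e-6:
        best_val, best_theta, bad_epochs = val_loss, theta, 0  # snapshot the best
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # stop after several epochs without improvement
            break

print(len(history), best_theta, best_val)
```

Because the validation loss drives the stopping rule here, it can no longer be treated as a clean measure of generalization.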

## Software Components used in ML

Some common software components for optimization are

1.  **Autodifferentiation** and libraries of functions provide the
    approximation class
2.  **Data loaders** which will take care of providing batches to the
    optimizers
3.  **Optimizers** are typically iterative, have an internal state, and
    you can update with one sample of the gradient for that batch
4.  **Logging** and visualization tools to track progress because the
    optimization process may be slow and you want to do HPO
5.  **HPO** software using training, validation, and possibly test loss
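
To make the "internal state" of an optimizer concrete, here is a sketch of a hypothetical `MomentumSGD` class with a single `update` method. It uses heavy-ball momentum and is not the interface of any particular library.

```python
class MomentumSGD:
    """Hypothetical minimal optimizer: heavy-ball momentum with internal state."""

    def __init__(self, eta=0.1, beta=0.9):
        self.eta, self.beta = eta, beta
        self.velocity = None  # internal state, created on the first update

    def update(self, theta, grad):
        """Take one step given the gradient for the current batch."""
        if self.velocity is None:
            self.velocity = [0.0] * len(theta)
        # Accumulate an exponentially weighted sum of past gradients
        self.velocity = [self.beta * v + g for v, g in zip(self.velocity, grad)]
        return [t - self.eta * v for t, v in zip(theta, self.velocity)]


opt = MomentumSGD()
theta = [1.0, -2.0]
for _ in range(300):
    grad = list(theta)  # gradient of the hypothetical loss 0.5 * ||theta||^2
    theta = opt.update(theta, grad)
print(theta)
```

The training loop only supplies gradients one batch at a time, and everything that depends on the history lives inside the optimizer.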

## Logging and Visualization

-   Several tools exist for logging to babysit optimizers, find good
    hyperparameters, etc. including
    [Tensorboard](https://www.tensorflow.org/tensorboard)
    -   But we will use [Weights and Biases](https://wandb.ai/site)
        (W&B) because it is free for academics and seems to be the
        frontrunner
-   Many algorithms and frameworks exist for HPO:
    -   [Weights and Biases](https://wandb.ai/site) (W&B) has a built-in
        HPO framework using random search and bayesian optimization
    -   [Optuna](https://optuna.org/) and [Ray
        Tune](https://docs.ray.io/en/master/tune/index.html) are
        popular open-source HPO frameworks
-   HPO frameworks will often use the
    [command-line](https://github.com/shadawck/awesome-cli-frameworks#python)
    to run new jobs. Tools such as [Python
    Fire](https://github.com/google/python-fire) can generate a CLI
    from existing Python functions

## Broad Frameworks for Machine Learning

-   You can just hand-code loops/etc. which seems the best approach for
    JAX
    -   Even with Pytorch, it isn’t obvious that a framework is better
        ex-post, though ex-ante it can help you try different
        permutations easily
-   [Pytorch Lightning](https://www.pytorchlightning.ai/) is a popular
    framework which will formalize the training loops even across
    distributed systems and make CLI, HPO, logging, etc. convenient
    -   It remains fairly flexible because it is just wrapping Pytorch
-   [Keras](https://keras.io/) is a similar framework with the ability
    to target multiple backends (e.g., Pytorch, JAX)
    -   The challenge is that it is much less flexible for non-typical
        research
-   [Hydra](https://github.com/facebookresearch/hydra) is a
    configuration framework for more serious engineering code

## Linear Regression with SGD in Pytorch

See Pytorch implementations of solving LLS in repository

1.  Baseline: GD with full gradient
2.  SGD with minibatches
3.  Linear Model as a “Neural Network”
4.  Logging
5.  CLI

## Linear Regression with SGD in JAX

1.  Using JAX and `vmap`
2.  Using JAX and equinox