[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/rndsrc/stat2ml/blob/main/stat2ml.ipynb)

# From Statistics to Machine Learning

![meme](fig/stat2ml.png)

## Why AI?

AI is everywhere in the news these days.

People debate whether we are in an AI bubble (there is even a
[Wikipedia page about it](https://en.wikipedia.org/wiki/AI_bubble))
and whether today's massive investments will lead to transformative
breakthroughs or another AI winter.

But regardless of future speculation, one thing is clear:

> **AI has already delivered real, undeniable scientific and
>   technological breakthroughs.**

Students use ChatGPT daily, but more importantly, the last decade has
produced advances that matter directly to science:
* **AlphaGo** (2016):
  A [reinforcement learning system](https://www.nature.com/articles/nature16961)
  that defeated world champion Lee Sedol at Go, a game once believed to
  require human intuition.
  There is even a
  [documentary](https://www.youtube.com/watch?v=WXuK6gekU1Y) on it.
* **The Transformer Architecture** (2017):
  Introduced in
  "[Attention Is All You Need](https://arxiv.org/abs/1706.03762)",
  this model revolutionized sequence learning and laid the foundation
  for today's large language models including ChatGPT.
* **AlphaFold** (~2020):
  Achieved near-experimental accuracy in
  [predicting protein structures](https://www.nature.com/articles/s41586-021-03819-2),
  solving a 50-year grand challenge in biology. Its developers shared the
  [Nobel Prize in Chemistry 2024](https://www.nobelprize.org/prizes/chemistry/2024/summary/).
* **Large Language Models** (~2020-today):
  Support complex reasoning, coding, symbolic manipulation, and
  scientific workflows at scale.
  Notable startups include
  [OpenAI](https://openai.com/) and
  [Anthropic](https://www.anthropic.com/).
* **Diffusion Models** (~2020-today):
  Generate photorealistic images, molecular structures, and
  simulation surrogates using
  [probabilistic forward-reverse processes](https://arxiv.org/abs/2209.00796).
* **AI-assisted scientific discovery** (today):
  [AI systems](https://www.nature.com/articles/d41586-024-02842-3)
  now help design new materials, discover antibiotics,
  control fusion plasmas, and analyze particle-physics, astrophysics,
  and cosmology datasets.

These are true algorithmic innovations with real scientific value!

As physicists and scientists, it is important to ask:

> **What is AI?  
>   How does AI work?  
>   How can we use AI to accelerate scientific discovery?**

The last question can become an excellent open-ended homework
problem.
For this lab, we will focus primarily on the first two.

## What is AI/ML?

The current wave of AI can be viewed as a continuation of earlier
ideas that appeared under buzzwords like *"big data"* and *"data
science"*.

To understand this evolution, it helps to take a step back and look at
the history of scientific methodology.

### The First Two Paradigms: Experiment & Theory

Before modern computing, science operated through two complementary
paradigms:

1. **Empirical/Experimental Science**

   * Start with observations.
   * Identify patterns and regularities in nature.
   * Build phenomenological descriptions.

2. **Theoretical Science**

   * Start with mathematical principles.
   * Derive predictions about how systems should behave.
   * Compare theory to experiments.

Together, experiment and theory form the foundation of the classical
scientific method.

3.  **The Third Pillar: Computational Science**

    As physics, chemistry, and engineering advanced, systems became
    too complex for purely pencil-and-paper analysis: turbulence,
    weather, plasma physics, galaxy formation, quantum many-body
    systems, general relativity, etc.

    Numerical algorithms became essential:

    > Theoretical science + computing power = computational science.

    This gave rise to the **third pillar: computational science**,
    i.e., using algorithms and simulations to test and extend theory,
    and make predictions.

    ![The Third Pillar](fig/third.png)

4.  **The Fourth Paradigm: Data Science**

    In recent decades:
    * Experiments produce massive data streams (radio telescopes,
      climate satellites).
    * Sensors have become cheap and ubiquitous.
    * Digital transactions and interactions create enormous datasets.

    When **the data** becomes too large for traditional statistical
    analysis, we need new tools to find structure and correlations
    and to make predictions.

    This drove the "big data" era and the rise of **data science**:

    > Empirical science + computing power = data science.

    This is the **fourth paradigm of science**.

    ![The Fourth Paradigm](fig/fourth.jpg)

### Machine Learning: Let the Computer Learn the Pattern

One natural consequence of data science is that the algorithms we
build often become *useful beyond just summarizing data*.
For example, an algorithm that characterizes pixel patterns in images
can also be used to recognize digits, classify galaxies, or identify
particle tracks.
In other words:

> Instead of manually writing a program to perform a task, we can
> train a model to *learn* how to perform the task directly from data.

If we have enough examples and a suitable model, the computer can
infer the mapping between inputs and outputs on its own.
This idea of letting the computer learn patterns, rules, or behaviors
from data is the essence of **machine learning (ML)**.
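
As a toy illustration of learning a rule from data rather than writing
it by hand, the sketch below picks a decision threshold from labeled
one-dimensional examples.
The data, the function names `predict` and `learn_threshold`, and the
brute-force search are all made up for illustration; real ML models and
training procedures are far more elaborate.

```python
# Toy example: learn a decision rule from labeled data instead of hard-coding it.
# All data and names here are hypothetical and chosen only for illustration.

def predict(x, threshold):
    """Classify x as 1 if it lies above the threshold, else 0."""
    return 1 if x > threshold else 0

def learn_threshold(xs, ys):
    """Try the midpoints between sorted inputs; keep the one with fewest errors."""
    pts = sorted(xs)
    candidates = [(a + b) / 2 for a, b in zip(pts[:-1], pts[1:])]

    def errors(t):
        return sum(predict(x, t) != y for x, y in zip(xs, ys))

    return min(candidates, key=errors)

# Labeled examples: inputs (say, a measured brightness) and 0/1 labels.
xs = [0.1, 0.4, 0.5, 0.9, 1.3, 1.6, 2.0, 2.2]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

threshold = learn_threshold(xs, ys)
print("learned threshold:", threshold)
print("prediction for 0.7:", predict(0.7, threshold))
print("prediction for 1.8:", predict(1.8, threshold))
```

Nobody typed the threshold into the program: it was inferred from the
examples, which is the sense of "learning" used above.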

### What About AI?

The term **Artificial Intelligence (AI)** has a long, complicated, and
sometimes
[hype-filled history](https://en.wikipedia.org/wiki/History_of_artificial_intelligence).
Today, in practice:

* **ML** often refers to the specific algorithms and mathematical
  tools that learn from data.
  These include linear models, neural networks, decision trees,
  reinforcement learning, etc.
* **AI** is often used as a broad umbrella term, or sometimes a
  marketing term, for systems *powered by* machine learning.
* When an ML system becomes extremely capable, like ChatGPT, AlphaFold,
  AlphaGo, or a self-driving car, we tend to call it **AI**,
  especially outside technical circles.

A useful rule of thumb for modern usage:

> **ML is the toolbox.  
>   AI is what we call ML systems when they look impressive,  
>   or when we want other people to think they are impressive.**

## The Machine Learning Landscape

ML is not just statistics.
In fact, the meme at the top of this notebook may upset a lot of people.
However, a large portion of modern ML is deeply grounded in
**statistical reasoning**, **probability**, and **optimization**.

To understand where our lab fits, it helps to map the landscape at a
high level.

### Supervised Learning

In **supervised learning**, we are given input–output pairs $(x, y)$:

* $x$: the input data (features)
* $y$: the label or target
* The task is to learn a function $f_\theta(x) \approx y$, where the
  parameters $\theta$ are adjusted using examples.

Because each input comes with a **label**, supervised learning has a
clear objective: make predictions that match the provided examples.
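
One common way to make this objective precise is to minimize an average
loss over the labeled examples.
The squared-error loss, the number of examples $N$, the index $i$, and
the symbol $\hat{\theta}$ for the fitted parameters are assumptions
introduced here for illustration; other tasks use other losses.

$$
\hat{\theta} = \arg\min_\theta \frac{1}{N} \sum_{i=1}^{N}
  \left( f_\theta(x_i) - y_i \right)^2
$$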

This makes supervised learning fundamentally similar to **curve
fitting**:

> We choose a function with adjustable parameters and fit it to
> labeled data.

Whether the function is a line, a polynomial, or a giant neural
network, the principle is the same.
For this reason, our lab focuses on supervised learning.
It provides the cleanest conceptual connection between statistics and
modern deep learning.
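
The curve-fitting view can be sketched in a few lines of plain Python.
The example below fits a straight line $f_\theta(x) = a x + b$ to
labeled pairs by gradient descent on the mean squared error.
The data, the learning rate, the number of steps, and all function
names are illustrative assumptions, not the lab's actual code.

```python
# Minimal sketch: supervised learning as curve fitting.
# Model: f_theta(x) = a * x + b, with theta = (a, b).
# The data, learning rate, and step count are illustrative assumptions.

def f(theta, x):
    a, b = theta
    return a * x + b

def mse(theta, xs, ys):
    """Mean squared error between predictions and labels."""
    return sum((f(theta, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grad_mse(theta, xs, ys):
    """Analytic gradient of the mean squared error with respect to (a, b)."""
    n = len(xs)
    residuals = [f(theta, x) - y for x, y in zip(xs, ys)]
    da = 2 * sum(r * x for r, x in zip(residuals, xs)) / n
    db = 2 * sum(residuals) / n
    return da, db

def fit(xs, ys, lr=0.05, steps=2000):
    """Plain gradient descent starting from theta = (0, 0)."""
    theta = (0.0, 0.0)
    for _ in range(steps):
        da, db = grad_mse(theta, xs, ys)
        theta = (theta[0] - lr * da, theta[1] - lr * db)
    return theta

# Labeled pairs (x, y) generated from y = 2x + 1 (noise-free for clarity).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]

a, b = fit(xs, ys)
print(f"a = {a:.3f}, b = {b:.3f}, loss = {mse((a, b), xs, ys):.2e}")
```

Swapping the line for a polynomial or a neural network changes `f` and
the gradient computation, but the fitting loop keeps the same shape.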

### Unsupervised Learning

In contrast, **unsupervised learning** provides only the inputs $x$,
with no labels.
The goal is to find structure or patterns *without* being told what
the correct output should be.

However, unsupervised learning is *not* a single coherent class of
algorithms.
Instead, it is a loosely grouped collection of very different
techniques, such as:

* **PCA**: finds directions of maximum variance
* **$k$-means**: partitions points into clusters
* **Gaussian mixture models**: probabilistic clustering
* **Autoencoders**: neural-network-based dimensionality reduction
* **Density estimation**: learning probability distributions
* **Manifold learning**: discovering low-dimensional structure

These methods behave differently, solve different problems, and rely
on different mathematical ideas.
What unifies them is *only one thing*:

> They all learn patterns from data **without labels**.

Because unsupervised learning covers such a diverse set of tools and
has no direct analog to curve fitting, we will not cover it in this
introductory lab.
A good place to see many popular unsupervised learning algorithms is
[scikit-learn](https://scikit-learn.org/stable/).
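
Although this lab does not pursue unsupervised learning further, a
short sketch of one method from the list, $k$-means, shows what
"no labels" means in practice.
The one-dimensional data, the choice of two clusters, the initial
centers, and the function name `kmeans_1d` are illustrative
assumptions; this is not scikit-learn's implementation.

```python
# Minimal sketch of k-means in one dimension (no labels are used).
# The data, k = 2, and the initial centers are illustrative assumptions.

def kmeans_1d(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and re-averaging."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else old
                   for c, old in zip(clusters, centers)]
    return centers, clusters

points = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.9, 5.1]
centers, clusters = kmeans_1d(points, centers=[0.0, 6.0])
print("centers:", centers)
print("clusters:", clusters)
```

The algorithm only ever sees the inputs; the two groups emerge from
the distances between points, not from any provided answer.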

This notebook is a **lab-style introduction** that takes you on a
smooth path through:

* basic **statistics**;
* statistical **moments**;
* simple **curve fitting**;
* gradient-based **optimization**;
* automatic differentiation with **JAX**; and
* a first **deep learning** model on MNIST.

To see the connections between the different steps, we try to change
only a few things at a time.