# Introduction to machine learning

<a href="https://github.com/NVIDIA/FastPhotoStyle" target="_blank"><img src="https://raw.githubusercontent.com/wlough/CU-Phys2600-Fall2025/main/lectures/img/fastphotostyle.png" width=600px /></a>

## PHYS 2600: Scientific Computing

## Lecture 27

## What is machine learning?

We use computers to "learn" lots of things, but __machine learning__ usually implies _lack of supervision_: programs that come up with some new fact, pattern, or model that we didn't simply program in by hand.

A computer being able to "learn" is the first step towards __artificial intelligence__ (AI): computer programs that mimic human abilities to synthesize information, find patterns, and come up with new ideas.  The term __machine learning__ is narrower; it usually refers to algorithms that extract information from some data set, typically __big data__ (too big for a human to deal with by hand).

An example of machine learning you probably already know about is __model regression__: we have a model that we think describes some set of data, and we perform a least-squares fit to find the best parameters.  Then we can make statements about the model and its parameters, or use the model to extrapolate and predict future data.
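As a reminder of what that looks like in practice, here is a minimal sketch of a least-squares straight-line fit.  The data are made up for illustration (a line with slope 2 and intercept 1 plus a little noise), and the names `x`, `y`, and `x_new` are my own choices.

```python
import numpy as np

# Hypothetical data: a straight line y = 2x + 1 with a little Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Least-squares fit of a degree-1 polynomial; returns [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")

# Use the fitted model to extrapolate beyond the range of the data.
x_new = 12.0
y_pred = slope * x_new + intercept
print(f"predicted y({x_new}) = {y_pred:.2f}")
```

The fitted parameters should land close to the values used to generate the data, and the last two lines show the "predict future data" step.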

## Categories of machine learning

All machine learning algorithms (and there are a lot!) can be broadly divided into three categories:

* __Supervised learning__: we have a collection of input data with known "response": the output of our experiment given the inputs, or a set of weather observations labelled by station ID.
* __Unsupervised learning__: we have a collection of "input" data, but no known label or response: a list of material electrical properties, or demographic information from a city.
* __Reinforcement learning__: we have no data at all, but the algorithm is allowed to _propose_ a new "action" (new set of inputs), at which point some process (a simulation, a function, a robot interacting with the real world) provides an output.

Note that "supervision" has nothing to do with whether we specify the model by hand or not; there are plenty of supervised learning algorithms that don't need a specific model choice like regression does.  (Regression is, of course, a supervised learning problem.)


Another way to state each form of learning, in a way which is more closely analogous to the familiar regression problem:

* __Supervised learning__: we know input $\mathbf{x}$ and output $\mathbf{y}(\mathbf{x})$. We want to find a model $f(\mathbf{x})$ that produces $\mathbf{y}$.
* __Unsupervised learning__: we know only the input $\mathbf{x}$.  We want to find a model $f(\mathbf{x})$ that reproduces the _probability distribution_ of $\mathbf{x}$ itself (and therefore any patterns in the distribution.)
* __Reinforcement learning__: we don't know anything, but we have access to a _feedback mechanism_: for any input $\mathbf{x}$ we propose, we can learn the _response_ $\mathbf{y}(\mathbf{x})$.  We want to discover a model $f(\mathbf{x})$, subject to some _goal_ (minimizing how many samples of $\mathbf{x}$ we need, finding some $\mathbf{y}(\mathbf{x})$ above a certain threshold, etc.)
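The feedback loop in the reinforcement learning case can be sketched with a toy problem.  This is only an illustration under my own assumptions: the "environment" is three slot-machine-like actions with hidden payout probabilities (`true_payout`), and the learner uses a simple explore-or-exploit rule (often called epsilon-greedy) to discover which action is best.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "environment": three actions, each paying out 1 with a
# hidden probability. The learner never looks at these numbers directly.
true_payout = np.array([0.1, 0.5, 0.9])


def respond(action):
    """Feedback mechanism: return the response y for a proposed action x."""
    return float(rng.random() < true_payout[action])


n_actions = len(true_payout)
estimates = np.zeros(n_actions)  # running average response per action
counts = np.zeros(n_actions, dtype=int)  # how often each action was tried
epsilon = 0.1  # fraction of the time we explore at random

for step in range(5000):
    if rng.random() < epsilon:
        action = int(rng.integers(n_actions))  # explore a random action
    else:
        action = int(np.argmax(estimates))  # exploit the current best guess
    reward = respond(action)
    counts[action] += 1
    # Incremental update of the running mean for this action.
    estimates[action] += (reward - estimates[action]) / counts[action]

best_action = int(np.argmax(estimates))
print("estimated payouts:", np.round(estimates, 2))
print("best action found:", best_action)
```

Note that the learner starts with no data at all; everything it knows comes from proposing actions and recording the responses.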

Some concrete examples:
1. Taking your Netflix rating history (like/dislike) and predicting whether you will like the next original movie they make is a _supervised learning_ problem.
2. Using real-estate listing data to identify neighborhoods of similar houses is an _unsupervised learning_ problem.
3. Teaching a robot how to walk by allowing it to _try_ to walk is a _reinforcement learning_ problem.

Let's see some more concrete examples.  The last one (reinforcement learning) is easiest to just show:

In [None]:
from IPython.display import IFrame

IFrame("https://www.youtube-nocookie.com/embed/W_gxLKSsSIE", 640, 480)

## Classification problems

Note that in many of the examples above, the output of the model is __discrete__: we want to find a Boolean answer (will you like this movie, or not?) or apply a __label__ or __class__ to our data (which neighborhood does this house belong to?)

<img src="https://raw.githubusercontent.com/wlough/CU-Phys2600-Fall2025/main/lectures/img/chihuahua_vs_muffin.jpg" style="float:right;margin:10px" />

Problems where the output is _discrete_ are __classification__ problems: we want to assign our current (and future) data to _classes_ or _categories_.  A common example is __computer vision__, in which image files are classified based on the type of object contained in them (like chihuahuas vs. blueberry muffins, for example.)

Simply assigning a definite label is an example of _hard classification_, which has some drawbacks.  Much more commonly, machine learning algorithms (such as __logistic classification__) will assign a _probability_ between 0 and 1 as to whether a given input is in a given class.
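To illustrate the difference between hard labels and probabilities, here is a small sketch using the logistic function, which squashes any real-valued "score" into the range 0 to 1.  The scores are invented for illustration; in a real classifier they would come from a trained model.

```python
import numpy as np


def logistic(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical scores for four inputs; large positive means "probably in the class".
scores = np.array([-3.0, -0.2, 0.2, 3.0])

probabilities = logistic(scores)  # soft classification
hard_labels = probabilities >= 0.5  # hard classification at a 0.5 threshold

for s, p, h in zip(scores, probabilities, hard_labels):
    print(f"score {s:+.1f} -> P(class) = {p:.3f} -> label {int(h)}")
```

The two middle inputs get opposite hard labels even though their probabilities are both close to 0.5, which is exactly the information a hard classification throws away.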

For the following machine learning example, I'm going to borrow heavily from Guttag.  (If you're interested in machine learning, reading chapters 24-26 of Guttag would be a great place to start!)  Let's work through a classification example, first as __supervised learning__ and then as __unsupervised learning__, using the concept of __clustering__.

Suppose we're interested in the classification problem of identifying whether different animal species are reptiles or not.  Here's a table of some species and a few of their properties:

| Species name | Lays eggs | Has scales | # of legs | Venomous | __Reptile?__ |
|--------------|-----------|------------|-----------|-----------|--------------|
| Diamondback | No | Yes | 0 | Yes | Yes |
| Crocodile | Yes | Yes | 4 | No | Yes |
| Anaconda | No | Yes | 0 | No | Yes |
| Bullfrog | Yes | No | 4 | No | No |
| Tuna | Yes | Yes | 0 | No | No |
| Komodo dragon | Yes | Yes | 4 | Yes | Yes |
| Grass snake | Yes | Yes | 0 | No | Yes |

The collection of values (lays eggs, has scales, # of legs, venomous) is known as a __feature vector__: each component has one piece of information about each species of animal.  The final column is a __label__, here whether the species is a reptile or not.  Since we have these labels, this is a _supervised learning_ problem.

Let's convert the feature vectors into numerical values to work with them:

In [None]:
import numpy as np

# Features: lays eggs, has scales, # of legs, venomous
train_features = np.array(
    [
        [0, 1, 0, 1],  # Diamondback
        [1, 1, 4, 0],  # Crocodile
        [0, 1, 0, 0],  # Anaconda
        [1, 0, 4, 0],  # Bullfrog
        [1, 1, 0, 0],  # Tuna
        [1, 1, 4, 1],  # Komodo dragon
        [1, 1, 0, 0],  # Grass snake
    ]
)

# Label: reptile or not?
label_reptile = np.array([1, 1, 1, 0, 0, 1, 1])

I've called this variable "train" since it is the __training dataset__: we'll use it to determine the models in our machine learning exercise, and then use a second __testing dataset__ to see how well the model does at predicting.  This is a common division for machine learning problems.

Almost any machine learning algorithm requires a notion of _distance_ between feature vectors in order to operate.  Since feature-vector space is fictional, we're free to choose how we define distance!  Let's use the common __L2-norm__, which is just the usual vector distance:

In [None]:
def L2_norm(a, b):
    """Euclidean (L2) distance between feature vectors a and b."""
    return np.sqrt(np.sum((a - b) ** 2))


# Distance from the diamondback (row 0) to every species in the training set
print([L2_norm(train_features[0], row) for row in train_features])

So using the "L2-norm", the diamondback is closest to the anaconda...but it's closer to a tuna than a komodo dragon? This might seem surprising, but there's an obvious flaw in our distance: the number of legs goes up to _four_, but every other direction goes up to _one_.  To avoid issues like this, it's crucial to __preprocess__ our feature vectors by _rescaling_ the features to have the same range:

In [None]:
train_scaled = train_features / [1, 1, 4, 1]
print([L2_norm(train_scaled[0], train_scaled[i]) for i in range(len(train_features))])
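Dividing by `[1, 1, 4, 1]` works here because we know the ranges by eye.  A more general approach, sketched below on a small made-up array with the same column layout, is to compute each column's minimum and maximum from the data and rescale every feature to run from 0 to 1 (often called min-max scaling).  The variable names are my own.

```python
import numpy as np

# Same feature layout as above: lays eggs, has scales, # of legs, venomous
features = np.array(
    [
        [0, 1, 0, 1],
        [1, 1, 4, 0],
        [1, 0, 4, 0],
    ],
    dtype=float,
)

col_min = features.min(axis=0)
col_max = features.max(axis=0)
span = col_max - col_min
span[span == 0] = 1.0  # avoid dividing by zero for constant columns

# Each column now runs from 0 to 1, regardless of its original units.
scaled = (features - col_min) / span
print(scaled)
```

For this particular array every column has minimum 0, so the result is the same as dividing by `[1, 1, 4, 1]`.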

Now we're ready to learn!  We could work just in NumPy, but I'll use the `scikit-learn` module, which has a wide array of pre-built machine learning algorithms implemented.  One of the simplest is the __linear support vector machine__, which tries to find a hyperplane in feature space that separates the data classes (here, reptile or not.)

In [None]:
import sklearn.svm

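# Note: this fit uses the unscaled train_features, so the prediction
# further down is made on the unscaled test_features as well.
liz_classifier = sklearn.svm.LinearSVC(random_state=0, dual=True).fit(
    train_features, label_reptile
)
print(liz_classifier.coef_)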

The `.coef_` attribute gives the "weights" of the support vector machine, which describe a vector _perpendicular_ to the hyperplane separating reptile from not-reptile.  This means that the feature directions with the largest weight components (in absolute value) are _more important_ in determining reptile-ness, according to the SVM and our training data.

We see here that according to the SVM, laying eggs is very not-reptile-like, having scales is very reptile-like, and having legs or being venomous is weakly reptile-like.
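To make the geometric picture concrete: a linear classifier labels a point by which side of the hyperplane it falls on, which amounts to checking the sign of $\mathbf{w} \cdot \mathbf{x} + b$.  Here is a minimal sketch of that rule.  The weights `w` and offset `b` below are made up by hand for illustration; they are not the output of the fit above (there, the corresponding numbers live in `.coef_` and `.intercept_`).

```python
import numpy as np

# Hypothetical weights and offset, chosen by hand for illustration only.
# Order: lays eggs, has scales, # of legs, venomous
w = np.array([-0.5, 1.0, 0.1, 0.2])
b = -0.3


def predict_linear(x):
    """Return 1 if x lies on the positive side of the hyperplane w.x + b = 0."""
    return int(np.dot(w, x) + b > 0)


print(predict_linear(np.array([0, 1, 0, 1])))  # scaly, venomous, no eggs
print(predict_linear(np.array([1, 0, 4, 0])))  # lays eggs, no scales, four legs
```

A positive weight pushes a feature towards the "1" side of the hyperplane and a negative weight pushes it towards the "0" side.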

Now let's make a testing dataset, and then try out our support vector machine model on it:

In [None]:
test_features = np.array(
    [
        [1, 1, 4, 0],  # Gecko
        [1, 1, 0, 1],  # Lachesis pit viper
        [1, 0, 4, 1],  # Poison dart frog
        [0, 1, 4, 0],  # Jackson's chameleon
        [1, 0, 0, 0],  # Earthworm
    ]
)
test_label = np.array([1, 1, 0, 1, 0])
test_scaled = test_features / [1, 1, 4, 1]

print(liz_classifier.predict(test_features))

Five out of five classified correctly!
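Rather than comparing by eye, we can count the matches against `test_label` directly.  In this standalone sketch the `predicted` array is typed in by hand to match the result quoted above; in the notebook you would use the output of the classifier instead.

```python
import numpy as np

test_label = np.array([1, 1, 0, 1, 0])

# Assumed predictions, written out by hand to match the "five out of five"
# result quoted above; in the notebook, use the classifier's output here.
predicted = np.array([1, 1, 0, 1, 0])

n_correct = int(np.sum(predicted == test_label))
accuracy = n_correct / len(test_label)
print(f"{n_correct} of {len(test_label)} correct (accuracy = {accuracy:.0%})")
```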

<img src="https://raw.githubusercontent.com/wlough/CU-Phys2600-Fall2025/main/lectures/img/ml-nonlinear.png" width=400px style="float:right;" />

Here we used one of the simplest approaches for supervised learning, which is _linear_: we assume that we can draw a flat surface in feature space to separate our classes.  In general, there's no reason for this to be true!

There are all kinds of non-linear generalizations for machine learning, and one of the most exciting is __neural nets__.  These are inspired in part by how neurons in the brain work.  My one-sentence explanation is that a sufficiently large and complex network of neurons enables almost arbitrary non-linear classification and modeling.
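As a tiny taste of why this works, here is a sketch of a "network" with just two hidden neurons that reproduces XOR, a classic pattern that no single flat surface can separate.  The weights are hand-picked for illustration (my assumption), not learned from data, and the function names are my own.

```python
import numpy as np


def relu(z):
    """Rectified linear unit: a simple non-linear 'neuron' activation."""
    return np.maximum(z, 0.0)


def tiny_network(x):
    """Two hidden neurons followed by one linear output neuron.

    The weights are hand-picked for illustration, not learned.
    """
    hidden = relu(np.array([x[0] + x[1], x[0] + x[1] - 1.0]))
    return hidden[0] - 2.0 * hidden[1]


# XOR: the output is 1 when exactly one input is 1. No single straight line
# in the (x1, x2) plane separates the 1s from the 0s.
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", tiny_network(x))
```

The non-linear activation is the essential ingredient: without `relu`, the whole thing would collapse back to a single linear function.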

Neural nets can perform many machine learning tasks with surprising efficiency; they also enable __deep learning__, in which the algorithm learns abstract representations from sufficient amounts of raw data.  (The picture I showed on the first slide is an example from NVIDIA, where the "style" of a photo can be learned and applied to another photo.)

"Deep learning" is a rapidly evolving subject, at the forefront of computer science - but it's also practically useful, so you can get your hands dirty with it!  Aside from `sklearn`, well-known Python modules you can check out include TensorFlow, PyTorch, and JAX.

We can use the same training sample to try out some unsupervised learning instead.  We'll use __k-means clustering__, which is a simple but powerful technique.  Basically, we decide how many clusters we want, propose a set of mean values in feature space, and then move the means around to minimize the overall distance between each data point and its nearest mean.

In [None]:
import sklearn.cluster

kmeans = sklearn.cluster.KMeans(n_clusters=2, random_state=0, n_init="auto").fit(
    train_scaled
)
print(kmeans.labels_)

# Cluster 0: Crocodile, Bullfrog, Komodo Dragon
# Cluster 1: Diamondback, Anaconda, Tuna, Grass Snake

print(kmeans.predict(test_scaled))
# 0: gecko, frog, and chameleon, 1: viper, earthworm

We don't get to decide what the unsupervised learning algorithm finds; in this case, it seems to be separating things mostly based on number of legs (even though we rescaled that variable already.)
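If you're curious what k-means is doing under the hood, the "propose means, then move them around" procedure can be sketched in a few lines of NumPy.  This is a bare-bones illustration under my own assumptions (random data points as starting means, a fixed number of iterations, made-up toy data), not what `scikit-learn` actually runs.

```python
import numpy as np


def kmeans_sketch(points, k, n_iter=20, seed=0):
    """Minimal k-means: alternate between assigning points and moving means."""
    rng = np.random.default_rng(seed)
    # Propose initial means by picking k distinct data points at random.
    means = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Distance from every point to every mean, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Move each mean to the centroid of the points assigned to it.
        for j in range(k):
            if np.any(labels == j):
                means[j] = points[labels == j].mean(axis=0)
    return labels, means


# Hypothetical toy data: two well-separated blobs in two dimensions.
points = np.array(
    [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
)
labels, means = kmeans_sketch(points, k=2)
print(labels)
print(means)
```

Which blob ends up being called cluster 0 and which cluster 1 depends on the random starting means; only the grouping itself is meaningful.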

Before you get too excited about the power of machine learning, let's see what happens if we try to apply our minimally-trained model to some more interesting data:

In [None]:
more_test_features = np.array(
    [
        [0, 1, 4, 0],  # Pangolin
        [0, 0, 4, 0],  # Grizzly bear
        [1, 0, 2, 0],  # Blue jay
        [0, 0, 2, 0],  # Human
        [0, 1, 0, 0],  # Guppy
    ]
) / [1, 1, 4, 1]

kmeans.predict(more_test_features)

Part of the problem is that we're probably missing important features: for example, "tuna" and "grass snake" were exactly the same vector, but only one is a reptile.  And most four-legged mammals (the grizzly bear, for instance) are `[0,0,4,0]` in our current space!  We really need more information for a better prediction.



Aside from the size of our feature vectors, there's a more insidious problem here: _our training data were biased!_  Five of the seven training data points were reptiles, so we've engineered a model which is more likely to find reptiles than not.  This is an example of __data bias__, which is something that can be corrected for, but you have to watch for it!
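A quick way to spot this kind of imbalance is to count the labels before training anything.  The sketch below repeats the training labels so that it runs on its own; the "always guess the majority class" baseline is my own addition for illustration.

```python
import numpy as np

label_reptile = np.array([1, 1, 1, 0, 0, 1, 1])

# Count how many training examples fall in each class
# (index 0: not reptile, index 1: reptile).
class_counts = np.bincount(label_reptile)
print(class_counts)  # [2 5]

# A "classifier" that ignores its input and always answers "reptile" already
# scores well on this training set, which is a warning sign of imbalance.
baseline_accuracy = class_counts.max() / class_counts.sum()
print(f"always-guess-majority accuracy: {baseline_accuracy:.2f}")  # 0.71
```

Any model we train on these data should be judged against that baseline, not against zero.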

<img src="https://raw.githubusercontent.com/wlough/CU-Phys2600-Fall2025/main/lectures/img/ml-fitting-error.png" width=400px style="float:right;" />

The other source of bias to watch out for is __model bias__, which can come from both _underfitting_ the data (failing to include important feature dependence) and _overfitting_ the data (including too many model details that fail to generalize outside the training set; this is counteracted by having a testing set.)
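Here is a small sketch of how a testing set exposes both problems, using polynomial fits of increasing degree.  Everything in it is an assumption made for illustration: the "true" curve is a sine wave, the noise level is arbitrary, and degrees 1, 3, and 9 stand in for too-simple, reasonable, and too-flexible models.

```python
import numpy as np

rng = np.random.default_rng(0)


def truth(x):
    """Hypothetical 'true' relationship used to generate the toy data."""
    return np.sin(2 * np.pi * x)


# Small noisy training set and a separate, larger testing set.
x_train = np.linspace(0, 1, 12)
y_train = truth(x_train) + rng.normal(scale=0.2, size=x_train.size)
x_test = np.linspace(0.02, 0.98, 50)
y_test = truth(x_test) + rng.normal(scale=0.2, size=x_test.size)

train_err, test_err = {}, {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit using the training data only.
    model = np.polynomial.Polynomial.fit(x_train, y_train, deg=degree)
    train_err[degree] = np.mean((model(x_train) - y_train) ** 2)
    test_err[degree] = np.mean((model(x_test) - y_test) ** 2)
    print(
        f"degree {degree}: train MSE = {train_err[degree]:.3f}, "
        f"test MSE = {test_err[degree]:.3f}"
    )
```

The training error can only go down as the model gets more flexible, so it can't tell us when to stop; the error on the testing set is what reveals whether the extra flexibility is actually helping.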

Basically, the outputs of machine learning are only as good as what you put into the algorithm!