### Table of Contents:

- Intro to Linear classification
- Linear score function
- Interpreting a linear classifier
- Loss function
    - Multiclass SVM
    - Softmax classifier
    - SVM vs Softmax
- Interactive Web Demo of Linear Classification
- Summary

# Linear Classification

In the last section we introduced the problem of Image Classification, which is the task of assigning a single label to an image from a fixed set of categories. Moreover, we described the `k-Nearest Neighbor (kNN)` classifier, which labels images by comparing them to (annotated) images from the training set. As we saw, kNN has a number of disadvantages:
- The classifier must **remember all of the training data** and **store it** for future comparisons with the test data. This is space inefficient because datasets may easily be gigabytes in size.
- Classifying a test image is **expensive** since it requires a **comparison to all training images**.

**Overview.** 
<br>We are now going to develop a more powerful approach to image classification that we will eventually naturally extend to entire Neural Networks and Convolutional Neural Networks. 
<br>The approach will have two major components: 
- A **`score function`** that maps the raw data to class scores.
- A **`loss function`** that quantifies the agreement between the predicted scores and the ground truth labels. 

We will then cast this as an optimization problem in which we will **minimize the loss function with respect to the parameters of the score function**.

### Parameterized mapping from images to label scores

The first component of this approach is to **define the `score function` that maps the pixel values of an image to confidence scores for each class**. 
<br>Let’s assume a training dataset of images $x_i \in R^D$, each associated with a label $y_i$. <br>Here $i = 1 \dots N$ and $y_i \in \{ 1 \dots K \}$. That is, we have:
- **N** examples (each with a dimensionality D) 
- **K** distinct categories. 

For example, in CIFAR-10 we have a training set of:
- **N = 50,000 images**
- Each with **D = 32 x 32 x 3 = 3072 pixel values**
- **K = 10**, since there are 10 distinct classes (dog, cat, car, etc). 

We will now define the `score function` $f: R^D \mapsto R^K$ that maps the raw image pixels to class scores.

**Linear classifier.** 
<br>In this module we will start out with arguably the simplest possible function, a **linear mapping**:

$$f(x_i, W, b) =  W x_i + b$$

In the above equation, we are assuming that:
- The **image $x_i$** has all of its pixels **flattened out to a single column** vector of shape \[D x 1\]. 
- The matrix **W** (of size \[K x D\]) and the vector **b** (of size \[K x 1\]) are the `parameters` of the function. 

In CIFAR-10:
- $x_i$ contains all pixels in the i-th image \[32 x 32 x 3\] flattened into a single \[3072 x 1\] column.
- **W** is \[10 x 3072\] 
- **b** is \[10 x 1\]

So 3072 numbers come into the function (the raw pixel values) and 10 numbers come out (the class scores). 
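The shapes above can be checked with a minimal NumPy sketch. The random weights and the random stand-in image are assumptions made purely for illustration (nothing here is learned or loaded from CIFAR-10), and `linear_scores` is a hypothetical helper name:

```python
import numpy as np

K, D = 10, 32 * 32 * 3                      # number of classes, flattened image size (3072)

rng = np.random.default_rng(0)              # fixed seed, for illustration only
W = 0.01 * rng.standard_normal((K, D))      # weights, shape [K x D] (random, not learned)
b = np.zeros(K)                             # biases, one per class

# A stand-in "image" with random pixel values in [0, 255], shape [32 x 32 x 3]
image = rng.integers(0, 256, size=(32, 32, 3))
x_i = image.reshape(-1).astype(np.float64)  # flatten to a vector of D = 3072 values

def linear_scores(x, W, b):
    """Linear score function f(x, W, b) = W x + b."""
    return W @ x + b

scores = linear_scores(x_i, W, b)
print(x_i.shape, W.shape, scores.shape)     # (3072,) (10, 3072) (10,)
```

NumPy treats the 1-D array `x_i` as a column vector in the product `W @ x_i`, so no explicit \[3072 x 1\] reshape is needed.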

The parameters in **`W`** are often called the **`weights`**, and **`b`** is called the **`bias vector`** because it **influences the output scores**, but **without interacting with the actual data $x_i$**. However, you will often hear people use the terms _`weights`_ and _`parameters`_ interchangeably.

There are a few things to note:
- First, note that the single matrix multiplication $Wx_i$ is effectively evaluating 10 separate classifiers in parallel (one for each class), where each classifier is a row of **W**.
- Notice also that we think of the input data $(x_i,y_i)$ as given and fixed, but we have control over the setting of the parameters **W,b**. Our goal will be to set these in such a way that the computed scores match the ground truth labels across the whole training set. We will go into much more detail about how this is done, but intuitively we wish that the correct class has a score that is higher than the scores of incorrect classes.
- An advantage of this approach is that the training data is used to learn the parameters W,b, but once the learning is complete we can discard the entire training set and only keep the learned parameters. That is because a new test image can be simply forwarded through the function and classified based on the computed scores.
- Lastly, note that classifying the test image involves a single matrix multiplication and addition, which is significantly faster than comparing a test image to all training images.
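To make the last two points concrete, here is a small sketch of test-time prediction, assuming the parameters have already been learned. The random `W` and `b` merely stand in for learned values, and `predict` is a hypothetical helper name. Note that no training data appears anywhere in it:

```python
import numpy as np

rng = np.random.default_rng(1)
K, D, M = 10, 3072, 5                       # classes, input dimensions, number of test images

W = 0.01 * rng.standard_normal((K, D))      # stands in for learned weights
b = np.zeros(K)                             # stands in for learned biases
X_test = rng.random((M, D))                 # one flattened test image per row

def predict(X, W, b):
    """Return the index of the highest-scoring class for each row of X."""
    scores = X @ W.T + b                    # shape [M x K]: one matrix multiply for all images
    return np.argmax(scores, axis=1)

labels = predict(X_test, W, b)
print(labels.shape)                         # (5,)
```

Storing one image per row and multiplying by `W.T` is only a layout choice; it computes the same scores as applying $Wx_i + b$ to each image separately.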

> Foreshadowing: Convolutional Neural Networks will map image pixels to scores exactly as shown above, but the mapping (f) will be more complex and will contain more parameters.

## Interpreting a linear classifier

Notice that a linear classifier computes the score of a class as a weighted sum of **all of its pixel values** across all 3 of its color channels. 

Depending on precisely what values we set for these weights, the function has the capacity to **like or dislike** (depending on the sign of each weight) certain colors at certain positions in the image. 

For instance, you can imagine that the “ship” class might be more likely if there is a lot of blue on the sides of an image (which could likely correspond to water). You might expect that the “ship” classifier would then have a lot of positive weights across its blue channel weights (presence of blue increases score of ship), and negative weights in the red/green channels (presence of red/green decreases the score of ship).

![](./Piks/imagemap.jpg)

In [28]:
import numpy as np

W = np.array([[0.2, -0.5, 0.1,  2.0],
              [1.5,  1.3, 2.1,  0.0],
              [  0, 0.25, 0.2, -0.3]])
x_i = np.array([56, 231, 24, 2])
b = np.array([1.1, 3.2, -1.2])

# Note: .T is a no-op on 1-D NumPy arrays, so this is the same as W @ x_i + b
f = W @ x_i.T + b.T
f

array([-96.8 , 437.9 ,  60.75])

In [30]:
f2 = W @ x_i + b
f2

array([-96.8 , 437.9 ,  60.75])

An example of mapping an image to class scores. For the sake of visualization, we assume:
- The image has only 4 pixels (4 monochrome pixels; we are not considering color channels in this example for brevity).
- There are 3 classes: red (cat), green (dog), and blue (ship). (Clarification: the colors here simply indicate the 3 classes and are not related to the RGB channels.) 

We stretch the image pixels into a column and perform matrix multiplication to get the scores for each class. Note that this particular set of **weights W is not good at all**: the weights assign our cat image a very low cat score. In particular, this set of weights seems convinced that it's looking at a dog.

**Analogy of images as high-dimensional points.** Since the images are stretched into high-dimensional column vectors, we can interpret each image as a single point in this space (e.g. each image in CIFAR-10 is a point in 3072-dimensional space of 32x32x3 pixels). Analogously, the entire dataset is a (labeled) set of points.

Since we defined the score of each class as a weighted sum of all image pixels, each class score is a linear function over this space. We cannot visualize 3072-dimensional spaces, but if we imagine squashing all those dimensions into only two dimensions, then we can try to visualize what the classifier might be doing:

![](./Piks/pixelspace.jpeg)

*Cartoon representation of the image space, where each image is a single point, and three classifiers are visualized. Using the example of the car classifier (in red), the red line shows all points in the space that get a score of zero for the car class. The red arrow shows the direction of increase, so all points to the right of the red line have positive (and linearly increasing) scores, and all points to the left have negative (and linearly decreasing) scores.*

As we saw above:
- Every row of $W$ is a classifier for one of the classes. If we change one of the rows of $W$, the corresponding line in the pixel space will rotate in different directions. 
- The biases $b$, on the other hand, allow our classifiers to translate the lines. In particular, note that without the bias terms, plugging in $x_i=0$ would always give a score of zero regardless of the weights, so all lines would be forced to cross the origin.
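The role of the bias can be illustrated with a tiny two-dimensional sketch. The values of `w` and `b` below are arbitrary assumptions chosen for illustration:

```python
import numpy as np

w = np.array([1.0, -2.0])        # one row of W (a single class), for a toy 2-D input
b = 3.0                          # bias for that class
origin = np.zeros(2)

score_without_bias = w @ origin        # 0.0, whatever the weights are
score_with_bias = w @ origin + b       # equals the bias, 3.0

# The zero-score line is w . x + b = 0. Its signed offset from the origin,
# measured along the direction of w, is -b / ||w||: changing b slides the line.
offset = -b / np.linalg.norm(w)
```

With `b = 0` the offset is zero, which is the "forced to cross the origin" case described above.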

**Interpretation of linear classifiers as template matching.**
<br/>Another interpretation for the weights $W$ is that each row of $W$ corresponds to a **`template`** (or sometimes also called a **`prototype`**) for one of the classes. 

The score of each class for an image is then obtained by comparing each template with the image using an _inner product (or dot product)_ one by one to find the one that `"fits" best`. With this terminology, the linear classifier is doing `template matching`, where the templates are learned. 

Another way to think of it is that we are still effectively doing Nearest Neighbor, but instead of having thousands of training images we are only using a single image per class (although we will learn it, and it does not necessarily have to be one of the images in the training set), and we use the (negative) inner product as the distance instead of the L1 or L2 distance.
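As a rough sketch of the template-matching view, the snippet below scores the 4-pixel toy image against each row of the toy $W$ used earlier, one inner product at a time, and picks the best-fitting template. The bias is left out here (an assumption made to isolate the inner products):

```python
import numpy as np

# Toy weights from the earlier example: each row is one class template
W = np.array([[0.2, -0.5, 0.1,  2.0],    # cat template
              [1.5,  1.3, 2.1,  0.0],    # dog template
              [0.0, 0.25, 0.2, -0.3]])   # ship template
x_i = np.array([56, 231, 24, 2])

# Compare the image with each template separately using an inner product
template_scores = np.array([np.dot(template, x_i) for template in W])

# The template that "fits" best is the one with the largest inner product
best = int(np.argmax(template_scores))   # index 1, the dog template

# The row-by-row comparison is exactly what the matrix product computes
same = np.allclose(template_scores, W @ x_i)
```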

![](./Piks/templates.jpg)

*Skipping ahead a bit: Example learned weights at the end of learning for CIFAR-10. Note that, for example, the ship template contains a lot of blue pixels as expected. This template will therefore give a high score once it is matched against images of ships on the ocean with an inner product.*

Additionally, note that the horse template seems to contain *a two-headed horse*, which is due to both left and right facing horses in the dataset. The linear classifier merges these two modes of horses in the data into a single template. 
<br>Similarly, the car classifier seems to have merged several modes into a single template which has to identify cars from all sides, and of all colors. 
<br>In particular, this template ended up being red, which hints that there are more red cars in the CIFAR-10 dataset than of any other color. 
<br>The linear classifier is *too weak to properly account for different-colored cars*, but as we will see later neural networks will allow us to perform this task. Looking ahead a bit, a neural network will be able to develop intermediate neurons in its hidden layers that could detect specific car types (e.g. green car facing left, blue car facing front, etc.), and neurons on the next layer could combine these into a more accurate car score through a weighted sum of the individual car detectors.

**Bias trick.** 
<br>Before moving on we want to mention a common simplifying trick for representing the two parameters $W,b$ as one. Recall that we defined the score function as:

$$f(x_i, W, b) =  W x_i + b$$

As we proceed through the material it is a little cumbersome to keep track of two sets of parameters (the biases $b$ and weights $W$) separately. A commonly used trick is to combine the two sets of parameters into a single matrix that holds both of them by extending the vector $x_i$ with one additional dimension that always holds the constant 1 - a default `bias dimension`. With the extra dimension, the new score function will simplify to a single matrix multiply:

$$f(x_i, W) =  W x_i$$

With our CIFAR-10 example, $x_i$ is now \[3073 x 1\] instead of \[3072 x 1\] - (with the extra dimension holding the constant 1), and $W$ is now \[10 x 3073\] instead of \[10 x 3072\]. The extra column that $W$ now has corresponds to the bias $b$. An illustration might help clarify:

![](./Piks/wb.jpeg)

*Illustration of the bias trick. Doing a matrix multiplication and then adding a bias vector (left) is equivalent to adding a bias dimension with a constant of 1 to all input vectors and extending the weight matrix by 1 column - a bias column (right). Thus, if we preprocess our data by appending ones to all vectors we only have to learn a single matrix of weights instead of two matrices that hold the weights and the biases.*

In [61]:
import numpy as np

W = np.array([[0.2, -0.5, 0.1,  2.0],
              [1.5,  1.3, 2.1,  0.0],
              [  0, 0.25, 0.2, -0.3]])
x_i = np.array([56, 231, 24, 2])
b = np.array([[1.1, 3.2, -1.2]])

W_new = np.concatenate((W, b.T), axis=1)
W_new

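array([[ 0.2 , -0.5 ,  0.1 ,  2.  ,  1.1 ],
       [ 1.5 ,  1.3 ,  2.1 ,  0.  ,  3.2 ],
       [ 0.  ,  0.25,  0.2 , -0.3 , -1.2 ]])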

In [62]:
# Append the constant 1 as the bias dimension (independent of the number of classes)
x_i_new = np.append(x_i, 1)
x_i_new

array([ 56, 231,  24,   2,   1])

In [63]:
f = W_new @ x_i_new
f

array([-96.8 , 437.9 ,  60.75])

In [64]:
f2 = W @ x_i + b
f2

array([[-96.8 , 437.9 ,  60.75]])
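The same trick can be applied to a whole dataset at once. The sketch below assumes the examples are stored one per row in a matrix `X` (a layout chosen for illustration) and uses random values throughout:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 6, 4, 3                              # examples, input dimensions, classes
X = rng.random((N, D))                         # one example per row
W = rng.standard_normal((K, D))
b = rng.standard_normal(K)

# Preprocess once: append a column of ones to the data ...
X_ext = np.hstack([X, np.ones((N, 1))])        # shape [N x (D+1)]
# ... and fold the biases into the weight matrix as an extra column
W_ext = np.hstack([W, b[:, None]])             # shape [K x (D+1)]

scores_separate = X @ W.T + b                  # weights and biases kept apart
scores_combined = X_ext @ W_ext.T              # single matrix multiply
print(np.allclose(scores_separate, scores_combined))   # True
```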

**Image data preprocessing.**
<br>As a quick note, in the examples above we used the raw pixel values (which range from \[0…255\]). In Machine Learning, it is a very common practice to always *perform normalization* of your input features (in the case of images, every pixel is thought of as a feature). In particular, it is important to **`center your data`** by **subtracting the mean from every feature.** 
<br>In the case of images, this corresponds to computing a mean image across the training images and subtracting it from every image to get images where the pixels range from approximately \[-127 … 127\]. 
<br>Further common preprocessing is to scale each input feature so that its values range from \[-1, 1\]. Of these, zero mean centering is arguably more important but we will have to wait for its justification until we understand the dynamics of gradient descent.
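A minimal sketch of these two preprocessing steps is shown below. The random arrays stand in for flattened training and test images (an assumption for illustration), and scaling by the largest absolute value seen in training is just one of several possible ways to reach the \[-1, 1\] range:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for flattened images with raw pixel values in [0, 255]
X_train = rng.integers(0, 256, size=(100, 3072)).astype(np.float64)
X_test = rng.integers(0, 256, size=(20, 3072)).astype(np.float64)

# 1) Center: compute the mean image on the training set only
mean_image = X_train.mean(axis=0)              # shape (3072,)
X_train_centered = X_train - mean_image
X_test_centered = X_test - mean_image          # reuse the training mean at test time

# 2) Scale each feature so the training values lie in [-1, 1]
scale = np.abs(X_train_centered).max(axis=0)
scale[scale == 0] = 1.0                        # guard against constant features
X_train_scaled = X_train_centered / scale
X_test_scaled = X_test_centered / scale        # test values may fall slightly outside [-1, 1]
```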

**You can read more about normalization: [here](./Additional_info/Normalization.ipynb)**

## Loss function

In the previous section we defined a function from the pixel values to class scores, which was parameterized by a set of weights $W$. 
<br>Moreover, we saw that we don’t have control over the data $(x_i,y_i)$ (it is fixed and given), but we do have control over these `weights` and we want to set them so that the predicted class scores are consistent with the ground truth labels in the training data.

For example, going back to the example image of a cat and its scores for the classes "cat", "dog" and "ship", we saw that the particular set of `weights` in that example was **not very good at all**. We fed in the pixels that depict a cat but the cat score came out very low (-96.8) compared to the other classes (dog score 437.9 and ship score 60.75). We are going to measure our unhappiness with outcomes such as this one with a **`loss function`** (or sometimes also referred to as the cost function or the objective). Intuitively, the loss will be high if we’re doing a poor job of classifying the training data, and it will be low if we’re doing well.

### Multiclass Support Vector Machine loss

There are several ways to define the details of the loss function. As a first example we will develop a commonly used loss called the **Multiclass Support Vector Machine** (SVM) loss. <br>The SVM loss is set up so that the SVM "wants" the correct class for each image to have a score higher than the incorrect classes by some **fixed margin $\Delta$**. 
<br>Notice that it's sometimes helpful to anthropomorphise (that is, ascribe human features to) the loss functions as we did above: the SVM "wants" a certain outcome in the sense that the outcome would yield a lower loss (which is good).

Let’s now get more precise. Recall that for the i-th example we are given the pixels of image $x_i$ and the label $y_i$ that specifies the index of the correct class. The `score function` takes the pixels and computes the vector $f(x_i,W)$ of class scores, which we will abbreviate to $s$ (short for `scores`). For example, the score for the j-th class is the j-th element: $s_j = f(x_i, W)_j$. The Multiclass SVM loss for the i-th example is then formalized as follows:

$$L_i = \sum_{j\neq y_i} \max(0, s_j - s_{y_i} + \Delta)$$

**Example.** Suppose that:
- We have three classes that receive the scores s = \[13, −7, 11\]. 
- The first class is the true class (i.e. $y_i = 0$, counting classes from zero). 
- Also assume that $\Delta$ (a `hyperparameter` we will go into more detail about soon) is `10`. 

The expression above sums over all incorrect classes ($j \neq y_i$), so we get two terms:

$$L_i = \max(0, -7 - 13 + 10) + \max(0, 11 - 13 + 10)$$

You can see that the first term gives zero since \[-7 - 13 + 10\] gives a negative number, which is then thresholded to zero with the `max(0, −) function`. 
<br>We get **zero loss** for this pair because the correct **class score (13) was greater than the incorrect class score (-7) by at least the margin 10.** In fact the difference was 20, which is much greater than 10, but the SVM only cares that the difference is at least 10; any additional difference above the margin is **clamped at zero with the max operation**. 
<br>The second term computes \[11 - 13 + 10\] which gives 8. That is, even though the correct class had a higher score than the incorrect class (13 > 11), it was not greater by the desired margin of 10. The difference was only 2, which is why the loss comes out to 8 (i.e. how much higher the difference would have to be to meet the margin). 
<br>In summary, **the SVM loss function wants the score of the correct class $y_i$ to be larger than the incorrect class scores by at least $\Delta$ (delta)**. If this is not the case, we will accumulate loss.
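The worked example can be reproduced with a short, unoptimized sketch of the per-example loss. The function name `svm_loss_single` is a hypothetical choice, and class indices are assumed to start at zero:

```python
import numpy as np

def svm_loss_single(scores, y, delta=1.0):
    """Multiclass SVM loss L_i for one example.

    scores: 1-D array of class scores s
    y:      index of the correct class (zero-based)
    delta:  the margin
    """
    margins = np.maximum(0, scores - scores[y] + delta)
    margins[y] = 0            # the sum runs over incorrect classes only (j != y_i)
    return margins.sum()

s = np.array([13.0, -7.0, 11.0])
loss = svm_loss_single(s, y=0, delta=10.0)
print(loss)                   # 8.0
```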

Note that in this particular module we are working with linear score functions $(f(x_i;W)=Wx_i )$, so we can also rewrite the `loss function` in this equivalent form:

$$L_i = \sum_{j\neq y_i} \max(0, w_j^T x_i - w_{y_i}^T x_i + \Delta)$$
where $w_j$ is the j-th row of $W$ reshaped as a column. 

However, this will not necessarily be the case once we start to consider more complex forms of the `score function f`.
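For the linear case, the per-example losses of many examples can be computed together. The following is a sketch under stated assumptions: examples are stored one per row in `X`, labels are zero-based integers, and `svm_losses` is a hypothetical helper name. The identity matrix used for `W` is only a trick to make the scores equal to the inputs so the result is easy to check by hand:

```python
import numpy as np

def svm_losses(W, X, y, delta=1.0):
    """Per-example multiclass SVM losses L_i for the linear score function W x_i.

    W: [K x D] weights, X: [N x D] with one example per row,
    y: [N] integer labels in 0..K-1. Returns an array of N losses.
    """
    n = X.shape[0]
    scores = X @ W.T                                   # [N x K], row i holds W x_i
    correct = scores[np.arange(n), y][:, None]         # [N x 1], the scores w_{y_i}^T x_i
    margins = np.maximum(0, scores - correct + delta)  # hinge applied to every class
    margins[np.arange(n), y] = 0                       # drop the j = y_i terms
    return margins.sum(axis=1)

W = np.eye(3)                                          # scores equal the inputs
X = np.array([[13.0, -7.0, 11.0],
              [ 1.0,  5.0,  2.0]])
y = np.array([0, 1])
losses = svm_losses(W, X, y, delta=10.0)               # [8., 13.]
```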

A last piece of terminology we’ll mention before we finish with this section is that the threshold at zero `max(0, −) function` is often called the **hinge loss**. You'll sometimes hear about people instead using the **`squared hinge loss SVM`** (or `L2-SVM`), which uses the form $\max(0,-)^2$ that penalizes violated margins more strongly (quadratically instead of linearly). The unsquared version is more standard, but in some datasets the squared hinge loss can work better. This can be determined during `cross-validation`.
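The difference between the two variants can be seen on a small made-up example in which two margins are violated. The scores and the helper name `hinge_terms` are assumptions for illustration:

```python
import numpy as np

def hinge_terms(scores, y, delta):
    """Per-class terms max(0, s_j - s_y + delta), with the correct class zeroed out."""
    terms = np.maximum(0, scores - scores[y] + delta)
    terms[y] = 0
    return terms

s = np.array([3.0, 5.0, 4.5])              # class 0 is correct but scores lowest
terms = hinge_terms(s, y=0, delta=1.0)     # [0., 3., 2.5]

hinge_loss = terms.sum()                   # 3 + 2.5 = 5.5 (linear penalty)
squared_hinge_loss = (terms ** 2).sum()    # 9 + 6.25 = 15.25 (quadratic penalty)
```

The larger violation (3) contributes a much bigger share of the squared loss than the smaller one (2.5), which is the sense in which violated margins are penalized more strongly.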

> *The loss function quantifies our unhappiness with predictions on the training set.*

![](./Piks/margin.jpg)

*The Multiclass Support Vector Machine "wants" the score of the correct class to be higher than all other scores by at least a margin of delta. If any class has a score inside the red region (or higher), then there will be accumulated loss. Otherwise the loss will be zero. Our objective will be to find the weights that will simultaneously satisfy this constraint for all examples in the training data and give a total loss that is as low as possible.*

**Regularization.** There is **one bug** with the loss function we presented above. Suppose that we have a dataset and a set of parameters $W$ that correctly classify every example (i.e. all scores are such that all the margins are met, and $L_i=0$ for all i). 
<br>The issue is that this set of $W$ is **not necessarily unique**: there might be many similar $W$ that correctly classify the examples. One easy way to see this is that if some parameters $W$ correctly classify all examples (so loss is zero for each example), then any multiple of these parameters $\lambda W$ where $\lambda > 1$ will also give zero loss because this transformation uniformly stretches all score magnitudes and hence also their absolute differences. 
<br>For example, if the difference in scores between a correct class and a nearest incorrect class was 15, then multiplying all elements of $W$ by 2 would make the new difference 30.
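This ambiguity can be demonstrated with a toy sketch. The weights, the single example, and the margin of 1 are all assumptions chosen so that the score difference is 15, as in the example above:

```python
import numpy as np

def svm_loss_single(scores, y, delta=1.0):
    """Multiclass SVM loss for one example (y is the zero-based correct class)."""
    margins = np.maximum(0, scores - scores[y] + delta)
    margins[y] = 0
    return margins.sum()

W = 15.0 * np.eye(3)                 # toy weights, chosen so the margin is met
x = np.array([1.0, 0.0, 0.0])        # toy example whose correct class is 0
y = 0

gaps, losses = [], []
for lam in (1.0, 2.0):               # compare W with 2W
    s = (lam * W) @ x
    # correct score minus the highest incorrect score
    gaps.append(float(s[y] - np.max(np.delete(s, y))))
    losses.append(float(svm_loss_single(s, y, delta=1.0)))

print(gaps)    # [15.0, 30.0]
print(losses)  # [0.0, 0.0]
```

Both settings of the weights give zero loss, so the loss alone cannot tell them apart.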

(continue later .........)