# Convolutional Neural Networks for Visual Recognition
http://cs231n.github.io

## Image Classification

### Nearest Neighbor Classifier

**$L_1$ distance**

$d_1(I_1,I_2) = \sum_{p}|I_1^p-I_2^p|$

where $I_1$ and $I_2$ are the two images being compared, flattened into vectors, and the sum runs over all pixels $p$.

**$L_2$ distance**

$d_2(I_1,I_2) = \sqrt{\sum_{p}(I_1^p-I_2^p)^2}$
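As a minimal sketch of the two distances, assuming NumPy and two made-up 4-pixel "images" (the function names `l1_distance` and `l2_distance` are illustrative, not from the course code):

```python
import numpy as np

def l1_distance(a, b):
    """Sum of absolute pixel-wise differences."""
    return np.sum(np.abs(a - b))

def l2_distance(a, b):
    """Square root of the sum of squared pixel-wise differences."""
    return np.sqrt(np.sum((a - b) ** 2))

# Two tiny flattened "images" with made-up values.
# Stored as floats: uint8 pixel arrays would wrap around on subtraction.
I1 = np.array([56, 32, 10, 18], dtype=np.float64)
I2 = np.array([10, 20, 24, 17], dtype=np.float64)

print(l1_distance(I1, I2))  # 46 + 12 + 14 + 1 = 73.0
print(l2_distance(I1, I2))  # sqrt(46^2 + 12^2 + 14^2 + 1^2)
```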

**k-Nearest Neighbor Classifier**

Motivation: instead of finding the single closest image in the training set, we find the top k closest images and have them vote on the label of the test image.


![alt text](./img/k-nearest.png "Title")
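The voting idea can be sketched as follows. This is an illustrative toy version under stated assumptions: NumPy, $L_1$ distance, 2-D points standing in for images, and a hypothetical helper named `knn_predict`.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points."""
    # L1 distance from x to every training example (one row per example).
    distances = np.sum(np.abs(X_train - x), axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbors.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Made-up training set: two points of class 0, three points of class 1.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]])
y_train = np.array([0, 0, 1, 1, 1])
x_test = np.array([0.2, 0.1])

print(knn_predict(X_train, y_train, x_test, k=1))  # 0
print(knn_predict(X_train, y_train, x_test, k=5))  # 1
```

With k = 5 every training point votes, so the majority class wins regardless of distance. This is one reason k has to be tuned rather than simply made large.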

### Pros and Cons of Nearest Neighbor Classifier
**Pros**
* Simple
* Takes no time to train

**Cons**
* May work on low-dimensional data, but distances over high-dimensional spaces can be very counter-intuitive.
* Whether two images are close to each other depends much more on their general color distribution or type of background than on their semantic identity.
* The classifier must store all the training data for future comparisons with the test data.
* Classifying a test image is expensive, since it requires a comparison with every training image.


## Validation sets for Hyperparameter tuning

**Example of a hyperparameter:** the value of k in the k-nearest neighbor classifier.

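**Tuning hyperparameters:** split the training set in two: a slightly smaller training set and what we call a validation set. Hyperparameters are chosen by their performance on the validation set, so that the test set is only used once, at the very end.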

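When the training set is small, people sometimes use a more sophisticated technique for hyperparameter tuning called **cross-validation**: the training data is split into several folds, each fold serves as the validation set in turn, and the results are averaged.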

![alt text](./img/k-fold.png "Title")

The classic way is to split your training data randomly into train/val splits. As a rule of thumb, between 70% and 90% of your data usually goes to the train split.
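A rough sketch of k-fold cross-validation for choosing k in k-NN is shown below. Everything here is an assumption made for illustration: the synthetic two-blob dataset, the helper names (`k_fold_indices`, `cross_validate`, `knn_accuracy`), the choice of 5 folds, and the candidate values of k.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle sample indices and split them into n_folds roughly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), n_folds)

def cross_validate(X, y, n_folds, evaluate):
    """Average validation accuracy of evaluate(...) over all folds."""
    folds = k_fold_indices(len(X), n_folds)
    scores = []
    for i, val_idx in enumerate(folds):
        # Every fold except the i-th one is used for training.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        scores.append(evaluate(X[train_idx], y[train_idx], X[val_idx], y[val_idx]))
    return float(np.mean(scores))

def knn_accuracy(X_tr, y_tr, X_val, y_val, k):
    """Accuracy of a k-NN classifier (L1 distance, integer labels) on one fold."""
    # Pairwise L1 distances, shape (n_val, n_train).
    d = np.abs(X_val[:, None, :] - X_tr[None, :, :]).sum(axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    votes = y_tr[nearest]  # labels of the k neighbors, shape (n_val, k)
    pred = np.array([np.bincount(v).argmax() for v in votes])
    return np.mean(pred == y_val)

# Synthetic data: two Gaussian blobs, 50 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

accuracy_by_k = {}
for k in (1, 3, 5):
    accuracy_by_k[k] = cross_validate(
        X, y, n_folds=5, evaluate=lambda *a: knn_accuracy(*a, k=k)
    )
print(accuracy_by_k)
```

The value of k with the highest average validation accuracy would then be used to train on the full training set.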

## Linear Classification

The approach will have two major components: a **score function** that maps the raw data to class scores, and a **loss function** that quantifies the agreement between the predicted scores and the ground truth labels.

For example, in CIFAR-10 we have a training set of N = 50,000 images, each with D = 32 x 32 x 3 = 3072 pixel values, and K = 10, since there are 10 distinct classes (dog, cat, car, etc.).

### Linear Classifier

$f(x_i, W, b) = Wx_i+b$

where $x_i$ contains all pixels of the i-th image flattened into a single [3072 x 1] column. The matrix W, of size [10 x 3072], is often called the weights, and b, of size [10 x 1], is called the bias vector.
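To make the shapes concrete, here is a small sketch with CIFAR-10 dimensions. The random values standing in for an image and for the parameters are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 3072, 10  # values per image, number of classes

x_i = rng.random((D, 1))                # stand-in for one flattened image, [3072 x 1]
W = 0.01 * rng.standard_normal((K, D))  # weights, [10 x 3072]
b = np.zeros((K, 1))                    # bias vector, [10 x 1]

scores = W @ x_i + b                    # f(x_i, W, b): one score per class
predicted_class = int(np.argmax(scores))

print(scores.shape)  # (10, 1)
```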

**Example**

![alt text](./img/classifier.png "Title")

In the example shown above, the linear classifier computes the score of a class as a weighted sum of all the image's pixel values across all 3 of its color channels. For illustration, we assume the image has only 4 pixels and that there are 3 classes.


**Bias Trick**

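We can combine the two sets of parameters (the biases b and the weights W) into a single matrix that holds both of them, by extending the vector $x_i$ with one additional dimension that always holds the constant 1. The bias b becomes an extra column of W, and we get: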

$f(x_i, W) = Wx_i$
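A quick numerical check of the bias trick, assuming NumPy and random made-up values: appending b as an extra column of W and a constant 1 to the input gives the same scores as $Wx_i + b$.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 4, 3  # 4 pixels and 3 classes, as in the small example above

W = rng.standard_normal((K, D))
b = rng.standard_normal((K, 1))
x = rng.random((D, 1))

W_ext = np.hstack([W, b])        # [K x (D + 1)]: b becomes the last column
x_ext = np.vstack([x, [[1.0]]])  # [(D + 1) x 1]: constant 1 appended

same = np.allclose(W @ x + b, W_ext @ x_ext)
print(same)  # True
```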


### Loss Function

**Multiclass Support Vector Machine (SVM)**

The SVM loss is set up so that the SVM “wants” the correct class for each image to have a score higher than the incorrect classes by some fixed margin $\Delta$. It is sometimes helpful to anthropomorphise loss functions in this way: the SVM “wants” a certain outcome in the sense that the outcome would yield a lower loss (which is good).

$$L_i = \sum_{j \neq y_i}\max (0, s_j-s_{y_i}+\Delta)$$

where $y_i$ is the index of the correct class and $s_j = f(x_i,W)_j$ (the score for j-th class is the j-th element). 

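For the linear score function $f(x_i, W) = Wx_i$, we can also rewrite the loss function as:

$$L_i = \sum_{j \neq y_i}\max (0, w_j^Tx_i-w_{y_i}^Tx_i+\Delta)$$

where $w_j$ is the j-th row of W reshaped as a column.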
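The per-example loss can be sketched in a few lines. The scores below are made-up illustrative values and `svm_loss_i` is a hypothetical helper name.

```python
import numpy as np

def svm_loss_i(scores, y_i, delta=1.0):
    """Multiclass SVM (hinge) loss for one example, given its class scores."""
    margins = np.maximum(0.0, scores - scores[y_i] + delta)
    margins[y_i] = 0.0  # the sum runs over j != y_i only
    return float(np.sum(margins))

scores = np.array([13.0, -7.0, 11.0])  # class 0 is the correct class

print(svm_loss_i(scores, y_i=0, delta=10.0))  # max(0, -10) + max(0, 8) = 8.0
print(svm_loss_i(scores, y_i=0, delta=1.0))   # both margins are satisfied: 0.0
```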


### Regularization
With the loss function presented above, a potential problem is that the set of parameters W that correctly classifies every example is not unique: if some W gives zero loss, then so does any multiple $\alpha W$ with $\alpha > 1$, since scaling W only stretches the score differences. We want to encode a preference for certain sets of weights W over others to remove this ambiguity.

We can do so by extending the loss function with a regularization penalty $R(W)$. The most common regularization penalty is the squared L2 norm:

$$R(W) = \sum_{k} \sum_{l} W_{k,l}^2$$

The full multiclass SVM loss becomes:
$$L = \frac{1}{N} \sum_i L_i + \lambda R(W)$$

or expanding this out in its full form:

$$L = \frac{1}{N} \sum_i \sum_{j \neq y_i}\max(0, w_j^Tx_i-w_{y_i}^Tx_i+\Delta) + \lambda \sum_{k} \sum_{l} W_{k,l}^2$$
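A vectorized sketch of the full loss follows. It assumes examples are stored as the columns of X, matching the column-vector convention for $x_i$ used in these notes. The function name `svm_loss`, the parameter name `reg` (standing for $\lambda$), and the random data are illustrative assumptions.

```python
import numpy as np

def svm_loss(W, X, y, reg, delta=1.0):
    """Mean multiclass SVM loss over N examples plus an L2 penalty.

    W: [K x D] weights, X: [D x N] examples as columns, y: [N] integer labels.
    """
    N = X.shape[1]
    scores = W @ X                     # [K x N]
    correct = scores[y, np.arange(N)]  # score of the correct class per example
    margins = np.maximum(0.0, scores - correct + delta)
    margins[y, np.arange(N)] = 0.0     # skip the j == y_i terms
    data_loss = margins.sum() / N
    reg_loss = reg * np.sum(W * W)
    return data_loss + reg_loss

rng = np.random.default_rng(0)
X = rng.random((4, 5))                 # 5 made-up examples with 4 features each
y = np.array([0, 1, 2, 1, 0])

W0 = np.zeros((3, 4))
print(svm_loss(W0, X, y, reg=0.1))     # all scores are 0, so loss = (K - 1) * delta = 2.0
```

With all-zero weights every incorrect class violates the margin by exactly $\Delta$, which gives a handy sanity check for an implementation.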

The L2 penalty leads to the appealing max-margin property in SVMs. Its most appealing property is that penalizing large weights tends to improve generalization, because it means that no input dimension can have a very large influence on the scores by itself.

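### Practical Considerations

**Setting Delta**

It turns out that this hyperparameter can safely be set to $\Delta = 1.0$ in all cases. $\Delta$ and $\lambda$ control the same tradeoff between the data loss and the regularization loss, so only $\lambda$ needs to be tuned.

**Relation to Binary Support Vector Machine**

$$L_i = C\max(0, 1-y_iw^Tx_i)+R(W)$$

where C is a hyperparameter and $y_i \in \{-1,1\}$. This can be regarded as the special case of the multiclass SVM loss in which there are only two classes.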

### Softmax Classifier

The SVM is one of two commonly seen classifiers. The other popular choice is the Softmax classifier, whose loss has the form:

$$L_i = -\log \left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)$$

or equivalently

$$L_i = -f_{y_i}+\log \sum_j e^{f_j}$$

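The function inside the logarithm, $f_j(z) = e^{z_j} / \sum_k e^{z_k}$, is called the softmax function. It takes a vector of arbitrary real-valued scores and squashes it to a vector of values between zero and one that sum to one.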

**Information theory view**

The cross-entropy between a "true" distribution p and an estimated distribution q is defined as

$$H(p,q)=-\sum_x p(x)\log q(x)$$

The Softmax classifier is hence minimizing the cross-entropy between the estimated class probabilities and the "true" distribution, which here puts all probability mass on the correct class.

Moreover, since the cross-entropy can be written in terms of the entropy and the Kullback-Leibler divergence as $H(p,q)=H(p)+D_{KL}(p||q)$, and the entropy of the delta function p is zero, this is also equivalent to minimizing the KL divergence between the two distributions.

In other words, the cross-entropy objective wants the predicted distribution to have all of its mass on the correct answer.
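The reasoning above can be spelled out as a short worked step. As a sketch, write p as a one-hot vector with a 1 at position $y_i$, let $q_j = e^{f_j} / \sum_k e^{f_k}$, and use the usual convention $0 \log 0 = 0$:

$$H(p,q) = -\sum_j p_j \log q_j = -\log q_{y_i} = -\log \left(\frac{e^{f_{y_i}}}{\sum_k e^{f_k}}\right) = L_i$$

$$H(p) = -\sum_j p_j \log p_j = -1 \cdot \log 1 = 0 \quad \Rightarrow \quad D_{KL}(p||q) = H(p,q) - H(p) = L_i$$

So the per-example Softmax loss is exactly the cross-entropy (and the KL divergence) between the one-hot label distribution and the predicted distribution, and it reaches zero only when $q_{y_i} = 1$.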

**Probabilistic interpretation**

$$P(y_i \mid x_i; W)=\frac{e^{f_{y_i}}} {\sum_j e^{f_j}}$$

can be interpreted as the probability assigned to the correct label $y_i$, given the image $x_i$ and parameterized by W.

**Practical issues: Numeric stability**

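When writing code to compute the Softmax function in practice, the intermediate terms $e^{f_{y_i}}$ and $\sum_j e^{f_j}$ may be very large because of the exponentials, and dividing large numbers can be numerically unstable. So it is important to use a normalization trick: multiplying the numerator and the denominator by the same constant C leaves the value unchanged.

$$\frac{e^{f_{y_i}}} {\sum_j e^{f_j}} = \frac{Ce^{f_{y_i}}} {C\sum_j e^{f_j}}=\frac{e^{f_{y_i}+\log C}} {\sum_j e^{f_j+\log C}}$$

A common choice for C is to set $\log C = -\max_j f_j$, which shifts the scores so that the largest one is zero.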
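The trick can be sketched as follows, assuming NumPy. The score values are made up, chosen large enough that a naive `np.exp(f)` would overflow, and the helper names `softmax` and `softmax_loss_i` are illustrative.

```python
import numpy as np

def softmax(f):
    """Numerically stable softmax: shift scores so the largest one is 0."""
    shifted = f - np.max(f)  # corresponds to log C = -max_j f_j
    exp = np.exp(shifted)
    return exp / np.sum(exp)

def softmax_loss_i(f, y_i):
    """Cross-entropy loss -f_{y_i} + log sum_j e^{f_j}, computed on shifted scores."""
    shifted = f - np.max(f)
    return float(-shifted[y_i] + np.log(np.sum(np.exp(shifted))))

f = np.array([123.0, 456.0, 789.0])  # np.exp(789.0) alone would overflow to inf

p = softmax(f)
print(p)                     # finite values that sum to 1
print(softmax_loss_i(f, 2))  # correct class has by far the highest score: loss near 0
```

Shifting by the maximum means every exponent is at most 0, so no term can overflow.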

**Possibly confusing**

To be precise, the SVM classifier uses the hinge loss, sometimes also called the max-margin loss. The Softmax classifier uses the cross-entropy loss.

### SVM vs. Softmax

In both cases we compute the same score vector f. The difference is in the interpretation of the scores in f: the SVM interprets these as class scores, and its loss function encourages the correct class to have a score higher by a margin than the other class scores.

The Softmax classifier instead interprets the scores as (unnormalized) log probabilities for each class and then encourages the (normalized) log probability of the correct class to be high.
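A small comparison illustrates the difference in interpretation. The two score vectors are made-up examples in which class 0 is correct, and the helper functions are illustrative sketches rather than the course implementation.

```python
import numpy as np

def svm_loss_i(scores, y_i, delta=1.0):
    """Multiclass SVM (hinge) loss for one example."""
    margins = np.maximum(0.0, scores - scores[y_i] + delta)
    margins[y_i] = 0.0
    return float(np.sum(margins))

def softmax_loss_i(scores, y_i):
    """Cross-entropy loss for one example, computed on shifted scores."""
    shifted = scores - np.max(scores)
    return float(-shifted[y_i] + np.log(np.sum(np.exp(shifted))))

examples = [np.array([10.0, -100.0, -100.0]), np.array([10.0, 9.0, 9.0])]
results = [(svm_loss_i(s, 0), softmax_loss_i(s, 0)) for s in examples]

for svm, sm in results:
    print(svm, sm)
```

Both score vectors satisfy the margin $\Delta = 1$, so the SVM loss is zero for each and the SVM is indifferent between them. The Softmax loss is clearly higher for the second vector, because the incorrect classes still receive a noticeable share of the probability mass.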

In practice, the performance of the SVM and Softmax classifiers is usually comparable.