## Classification setup and notation

* Generally we have a training dataset consisting of samples

$${ \{x_i, y_i \} }_{i=1}^N$$

* $x_i$: inputs, $d$-dimensional vectors
* $y_i$: labels, one of $C$ classes

### Details about softmax

$$p(y \vert x) = \dfrac{\exp(W_y \cdot x)}{\sum_{c=1}^C \exp(W_c \cdot x)}$$

We can separate this into two steps:
1. Take the $y$'th row of $W$ and multiply it with $x$ to get the unnormalized score $f_y$; do the same for every class to get $f_c$ for $c=1, \cdots ,C$:
$$W_{y\cdot} x = \sum_{i=1}^d W_{yi} x_i = f_y$$
2. Normalize the scores into probabilities:
$$p(y \vert x) = \dfrac{\exp(f_y)}{\sum_{c=1}^C \exp(f_c)} = \mathrm{softmax}(f)_y$$

> Step 1 is not a single dot product between $W$ and $x$: each row of $W$ is multiplied element-wise with $x$ and summed, once per class.
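
The two steps above can be sketched in plain Python (no external libraries). The toy values for `W` and `x` are hypothetical, and subtracting the maximum score before exponentiating is an added numerical-stability trick that does not change the result.

```python
import math

def scores(W, x):
    """Step 1: one unnormalized score per class, f_c = W_c . x."""
    return [sum(w_ci * x_i for w_ci, x_i in zip(row, x)) for row in W]

def softmax(f):
    """Step 2: exponentiate the scores and normalize them to sum to 1."""
    m = max(f)  # shifting by the max avoids overflow; the output is unchanged
    exps = [math.exp(f_c - m) for f_c in f]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy values: C = 3 classes, d = 2 input dimensions.
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
x = [2.0, 1.0]

f = scores(W, x)  # one score per row of W
p = softmax(f)    # probabilities over the 3 classes
```

The class with the largest score also gets the largest probability, since the exponential is monotonic.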

## Training with softmax and cross-entropy error

For each training example $\{x,y\}$, our objective is to maximize the probability of the correct class $y$.

Hence, we minimize the negative log probability of that class. This is the "cross-entropy error".

**why?**

First, since the log function is monotonically increasing, maximizing the probability is the same as maximizing the log probability.

$$\arg \max P(y \vert x) = \arg \max \log P(y \vert x)$$

Second, the cross entropy.

Assume a ground truth (or gold or target) probability distribution that is 1 at the right class and 0 everywhere else, $p = [0, \cdots, 0, 1, 0, \cdots, 0]$, and let $q$ be our computed probability distribution. Then the cross entropy is:

$$H(p, q) = - \sum_{c=1}^C p(c)\log q(c)$$

Because $p$ is **one-hot**, the only term left is the negative log probability of the true class: $H(p, q) = -\log q(y)$.

In our case, $\log q(y) = \log P(y \vert x)$, and the minus sign in front turns the $\arg \max$ into an $\arg \min$.
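
A minimal sketch of this reduction: with a one-hot target, the full cross-entropy sum equals the negative log probability of the true class. The distribution `q` and the class index `y` are hypothetical values chosen for illustration.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_c p(c) log q(c), skipping terms where p(c) = 0."""
    return -sum(p_c * math.log(q_c) for p_c, q_c in zip(p, q) if p_c > 0)

q = [0.1, 0.7, 0.2]  # hypothetical model output over C = 3 classes
y = 1                # index of the correct class
p = [1.0 if c == y else 0.0 for c in range(len(q))]  # one-hot target

loss = cross_entropy(p, q)  # full sum over classes
nll = -math.log(q[y])       # negative log probability of the true class
```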

### Slide note: KL divergence
[wiki](https://ko.wikipedia.org/wiki/%EC%BF%A8%EB%B0%B1-%EB%9D%BC%EC%9D%B4%EB%B8%94%EB%9F%AC_%EB%B0%9C%EC%82%B0)

Cross-entropy can be re-written in terms of the entropy and Kullback-Leibler divergence between the two distributions:

$$H(p, q) = H(p) + D_{KL} (p\|q)$$

In our case $H(p)=- \sum_{c=1}^C p(c)\log p(c)= -(0\times \log 0 + \cdots + 1 \times \log 1 + \cdots )=0$, where the terms with $p(c)=0$ are not evaluated (by convention $0 \log 0 = 0$). Even if $H(p)$ were not zero, it would be fixed and contribute nothing to the gradient. Minimizing the cross entropy is therefore equivalent to minimizing the KL divergence between $p$ and $q$.

The KL divergence is **not a distance** but a non-symmetric measure of the difference between two probability distributions $p$ and $q$:

$$D_{KL} (p\|q) = \sum_{c=1}^C p(c) \log \dfrac{p(c)}{q(c)}$$
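
The identities above can be checked numerically. This is a small sketch with hypothetical distributions; `p_soft` is included only to show the decomposition and the asymmetry when the target is not one-hot.

```python
import math

def entropy(p):
    """H(p), skipping terms with p(c) = 0."""
    return -sum(p_c * math.log(p_c) for p_c in p if p_c > 0)

def cross_entropy(p, q):
    """H(p, q), skipping terms with p(c) = 0."""
    return -sum(p_c * math.log(q_c) for p_c, q_c in zip(p, q) if p_c > 0)

def kl_divergence(p, q):
    """D_KL(p || q), skipping terms with p(c) = 0."""
    return sum(p_c * math.log(p_c / q_c) for p_c, q_c in zip(p, q) if p_c > 0)

p_onehot = [0.0, 1.0, 0.0]  # one-hot target: entropy is zero
p_soft = [0.2, 0.5, 0.3]    # hypothetical non-one-hot target
q = [0.1, 0.7, 0.2]         # hypothetical model output

decomposition_gap = cross_entropy(p_soft, q) - (entropy(p_soft) + kl_divergence(p_soft, q))
forward, backward = kl_divergence(p_soft, q), kl_divergence(q, p_soft)
```

`decomposition_gap` is zero up to rounding, while `forward` and `backward` differ, which is why the KL divergence is not a distance.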

## Classification over a full dataset & Regularization

Cross entropy loss function over the full dataset ${ \{x_i, y_i \} }_{i=1}^N$:

$$J(\theta) = \dfrac{1}{N} \sum_{i=1}^N - \log \big( \dfrac{ e^{f_{y_i} } }{ \sum_{c=1}^C e^{f_c} } \big)$$

In practice we add an L2 regularization term over all parameters $\theta$:

$$J(\theta) = \dfrac{1}{N} \sum_{i=1}^N - \log \big( \dfrac{ e^{f_{y_i} } }{ \sum_{c=1}^C e^{f_c} } \big) + \lambda \sum_k \theta_k^2$$

Regularization prevents overfitting when we have a lot of features.
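
A minimal sketch of the regularized objective, assuming for simplicity that $\theta$ consists only of the softmax weights $W$ and that the scores for sample $i$ are $f = W x_i$. The dataset, weights, and $\lambda$ values are hypothetical.

```python
import math

def log_softmax_at(f, y):
    """log softmax(f)_y, computed stably with the log-sum-exp trick."""
    m = max(f)
    log_z = m + math.log(sum(math.exp(f_c - m) for f_c in f))
    return f[y] - log_z

def regularized_loss(W, X, Y, lam):
    """Mean cross-entropy over the dataset plus lam * sum of squared weights."""
    data_loss = 0.0
    for x, y in zip(X, Y):
        f = [sum(w * x_i for w, x_i in zip(row, x)) for row in W]  # f = W x
        data_loss -= log_softmax_at(f, y)
    data_loss /= len(X)
    penalty = lam * sum(w * w for row in W for w in row)
    return data_loss + penalty

# Hypothetical toy problem: C = 2 classes, d = 2, N = 3 samples.
W = [[1.0, -1.0], [-1.0, 1.0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = [0, 1, 0]

unregularized = regularized_loss(W, X, Y, lam=0.0)
regularized = regularized_loss(W, X, Y, lam=0.1)
```

With all-zero weights every class gets probability $1/C$, so the unregularized loss is $\log C$; this is a handy sanity check for an implementation.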

## Classification difference with word vectors

The number of parameters to update is enormous:

$$\triangledown_\theta J(\theta) = 
\begin{bmatrix} \triangledown_{W_{\cdot 1} } \\ \vdots \\ \triangledown_{W_{\cdot d} } \\ \triangledown_{x_{aardvark} } \\ \vdots \\ \triangledown_{x_{zebra} } \end{bmatrix} \in \Bbb{R}^{Cd + Vd}$$

$Vd$ is extremely large, so regularization is needed.
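
To get a feel for the sizes involved, here is a small arithmetic sketch. The values of `C`, `d`, and `V` are hypothetical and not taken from the notes.

```python
def parameter_counts(C, d, V):
    """Number of entries in the gradient: C*d softmax weights plus V*d word-vector entries."""
    return {
        "softmax_weights": C * d,
        "word_vectors": V * d,
        "total": C * d + V * d,
    }

# Hypothetical sizes: 4 classes, 300-dimensional vectors, 100k-word vocabulary.
counts = parameter_counts(C=4, d=300, V=100_000)
```

With these sizes the word vectors account for 30,000,000 of the 30,001,200 parameters, so almost all of the gradient belongs to the word vectors.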

## A pitfall when retraining word vectors
* Losing generalization by re-training word vectors

Should I train my own word vectors?
* If you only have a small training dataset, don’t train the word vectors.
* If you have a very large dataset, it may work better to train the word vectors for the task.

## Window classification

https://arxiv.org/pdf/1103.0398v1.pdf

Classifying single words is rarely done. Interesting problems like ambiguity arise in context!

In other words, this approach is commonly used when the meaning of a word can be determined from its context.

* Example: auto-antonyms:
    * "To sanction" can mean "to permit" or "to punish."
    * "To seed" can mean "to place seeds" or "to remove seeds."

* Example: ambiguous named entities:
    * Paris $\rightarrow$ Paris, France vs Paris Hilton
    * Hathaway $\rightarrow$ Berkshire Hathaway vs Anne Hathaway
    
Idea: classify a word in its context window of neighboring words

* Example: Named Entity Recognition is a 4-way classification task:
    * Person, Location, Organization, None

There are many ways to classify a single word in context
* For example: average all the words in a window
* Problem: that would lose position information

Instead, train a softmax classifier to classify a center word from the concatenation of the vectors of all words in its window, including the center word itself.

* Example: Classify “Paris” in the context of this sentence with window length 2 (two words on each side):
    * Sentence: ... museums in Paris are amazing ...
$$x_{window} = \begin{bmatrix} x_{museums} & x_{in} & x_{Paris} & x_{are} & x_{amazing} \end{bmatrix}^T$$
    * Resulting vector: $x_{window} = x \in \Bbb{R}^{5d}$, a column vector
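
The difference between averaging and concatenating can be sketched as follows. The 3-dimensional embeddings are hypothetical toy values, and `window_vector` assumes the center word is at least `window` positions away from both ends of the sentence (no padding is handled).

```python
def window_vector(tokens, center, window, embeddings):
    """Concatenate the vectors of tokens[center - window : center + window + 1]."""
    x = []
    for tok in tokens[center - window : center + window + 1]:
        x.extend(embeddings[tok])
    return x

def average_vector(tokens, embeddings):
    """Average the word vectors; word order has no effect on the result."""
    d = len(embeddings[tokens[0]])
    return [sum(embeddings[t][k] for t in tokens) / len(tokens) for k in range(d)]

# Hypothetical toy embeddings with d = 3.
d = 3
embeddings = {
    "museums": [1.0, 0.0, 2.0],
    "in":      [0.0, 1.0, 0.0],
    "Paris":   [3.0, 1.0, 1.0],
    "are":     [0.0, 0.0, 1.0],
    "amazing": [2.0, 2.0, 0.0],
}
tokens = ["museums", "in", "Paris", "are", "amazing"]

x_window = window_vector(tokens, center=2, window=2, embeddings=embeddings)
x_avg = average_vector(tokens, embeddings)
```

Reversing the sentence leaves `x_avg` unchanged but produces a different `x_window`, which is the position information that averaging loses.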
    
With $x = x_{window}$ we can use the same softmax classifier as before:

$$\hat{y}_y = p(y \vert x) = \dfrac{\exp(W_y \cdot x)}{\sum_{c=1}^C \exp(W_c \cdot x)}$$

With cross entropy error as before:

$$J(\theta) = \dfrac{1}{N} \sum_{i=1}^N - \log \big( \dfrac{ e^{f_{y_i} } }{ \sum_{c=1}^C e^{f_c} } \big)$$

**How do we update the parameters?** Take the derivatives of $J(\theta)$ with respect to all parameters.
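
As a starting point for those derivatives: for a single example, the gradient of the softmax cross-entropy loss with respect to the scores is $\partial J / \partial f_c = \hat{y}_c - t_c$, where $t$ denotes the one-hot target (a symbol introduced here for illustration). The sketch below checks this closed form against finite differences on hypothetical scores.

```python
import math

def softmax(f):
    m = max(f)  # shift by the max for numerical stability
    exps = [math.exp(f_c - m) for f_c in f]
    total = sum(exps)
    return [e / total for e in exps]

def loss(f, y):
    """Cross-entropy loss for one example with true class y."""
    return -math.log(softmax(f)[y])

def grad_wrt_scores(f, y):
    """Closed form: softmax(f) minus the one-hot target."""
    return [p_c - (1.0 if c == y else 0.0) for c, p_c in enumerate(softmax(f))]

def numerical_grad(f, y, eps=1e-6):
    """Central finite differences, used only to check the closed form."""
    g = []
    for c in range(len(f)):
        up, down = list(f), list(f)
        up[c] += eps
        down[c] -= eps
        g.append((loss(up, y) - loss(down, y)) / (2 * eps))
    return g

f = [2.0, 1.0, 3.0]  # hypothetical scores
y = 0                # hypothetical true class
analytic = grad_wrt_scores(f, y)
numeric = numerical_grad(f, y)
```

The gradient entry for the true class is negative and the others are positive, so a gradient step raises the true class score and lowers the rest.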

## The max-margin loss

* Idea for training objective: make the true window’s score larger and the corrupt window’s score lower (until they’re good enough) by minimizing

$$J = \max(0, 1-s+s_c)$$

* $s$ = score(museums in Paris are amazing), the true window
* $s_c$ = score(Not all museums in Paris), the corrupt window (negative class)
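
A minimal sketch of this loss. The score function itself is not defined in these notes, so the scores below are hypothetical numbers, and the `margin` parameter is an illustrative generalization of the constant 1 in the formula.

```python
def max_margin_loss(s, s_c, margin=1.0):
    """J = max(0, margin - s + s_c): zero once s beats s_c by at least the margin."""
    return max(0.0, margin - s + s_c)

# Hypothetical scores for a true window (s) and a corrupt window (s_c).
satisfied = max_margin_loss(s=2.5, s_c=0.5)  # true window wins by 2.0, so no loss
violated = max_margin_loss(s=1.0, s_c=0.5)   # true window wins by only 0.5
```

Once the true window outscores the corrupt one by the margin, the loss is zero, which is the "good enough" condition in the bullet above.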