# Language Modeling

**Language Modeling** is the task of predicting what word comes next.

<img src="img/predict_next.png" style="height: 150px;"/>
More formally: given a sequence of words $x^{(1)}, x^{(2)}, \dots, x^{(t)}$, compute the probability distribution of the next word $x^{(t+1)}$:
$$P(x^{(t+1)}=w_j\;|\;x^{(t)}, \dots, x^{(1)})$$
where $w_j$ is a word in the vocabulary $V = \{w_1, \dots, w_{|V|}\}$.

# Everyday use of Language Models

<table>
<tr>
<td> <img src="img/use1.png" style="height: 250px;"/> </td>
<td> <img src="img/use2.png" style="height: 250px;"/> </td>
</tr>
</table>

# n-gram Language Models
First we make a simplifying assumption: $\large x^{(t+1)}$ depends only on the preceding $(n-1)$ words. The conditional probability is then rewritten as a ratio of joint probabilities, which we estimate by counting n-grams in a large text corpus:
$$\large P(x^{(t+1)}\;|\;x^{(t)}, \dots, x^{(1)}) = P(x^{(t+1)}\;|\;x^{(t)}, \dots, x^{(t-n+2)})$$
$$\large = \frac{P(x^{(t+1)}, x^{(t)}, \dots, x^{(t-n+2)})}{P(x^{(t)}, \dots, x^{(t-n+2)})}$$
$$\large \approx \frac{\text{count}(x^{(t+1)}, x^{(t)}, \dots, x^{(t-n+2)})}{\text{count}(x^{(t)}, \dots, x^{(t-n+2)})}$$
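
The count ratio above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the toy corpus and the function names `ngram_counts` and `next_word_prob` are made up for this example, and no smoothing is applied.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram and every (n-1)-gram in a token list (n >= 2)."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    contexts = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))
    return ngrams, contexts

def next_word_prob(word, context, ngrams, contexts):
    """Estimate P(word | context) as count(context + word) / count(context)."""
    context = tuple(context)
    if contexts[context] == 0:
        return 0.0  # context never seen: the ratio is undefined, return 0 here
    return ngrams[context + (word,)] / contexts[context]

# Hypothetical toy corpus, already tokenized by whitespace.
corpus = ("the students opened their books . "
          "the students opened their laptops . "
          "the students opened their books .").split()

ngrams, contexts = ngram_counts(corpus, n=3)  # trigram model
p_books = next_word_prob("books", ["opened", "their"], ngrams, contexts)      # 2/3
p_laptops = next_word_prob("laptops", ["opened", "their"], ngrams, contexts)  # 1/3
p_pens = next_word_prob("pens", ["opened", "their"], ngrams, contexts)        # 0.0
```

Note that `p_pens` is exactly zero because "opened their pens" never occurs in the corpus, even though it is a perfectly plausible continuation.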

# Generating text
You can also use a Language Model to **generate text**.

<table>
<tr>
<td> <img src="img/generate.png" style="height: 300px;"/> </td>
<td> <img src="img/generate2.png" style="height:300px;"/> </td>
</tr>
</table>
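
One common way to generate text, assumed here since the figures are not described in the text, is to repeatedly sample the next word from the model's distribution and feed it back in as context. The sketch below uses a hand-written bigram table; the table `BIGRAM`, the `<s>` and `</s>` markers, and the function `generate` are all hypothetical.

```python
import random

# Hypothetical bigram model: P(next word | previous word), invented for illustration.
BIGRAM = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "</s>": 0.3},
    "dog": {"sat": 0.6, "</s>": 0.4},
    "sat": {"</s>": 1.0},
}

def generate(model, rng, max_len=20):
    """Sample words one at a time until the end marker or max_len is reached."""
    words = []
    prev = "<s>"  # start-of-sentence marker
    for _ in range(max_len):
        dist = model[prev]
        # Draw the next word in proportion to its conditional probability.
        nxt = rng.choices(list(dist), weights=list(dist.values()))[0]
        if nxt == "</s>":
            break
        words.append(nxt)
        prev = nxt  # the sampled word becomes the context for the next step
    return words

sample = generate(BIGRAM, random.Random(0))
print(" ".join(sample))
```

The same loop works with any model that returns a distribution over the next word; only the lookup `model[prev]` would change.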

# A fixed-window neural Language Model

![neural language model](img/fixed_window_example.png)
![neural language model](img/nlm_network.png)

**Improvements over** n-gram LM:
* No sparsity problem
* Model size is $O(n)$, not $O(\exp(n))$

Remaining **problems**:
* Fixed window is *too small*
* Enlarging window enlarges $W$ 
* Window can never be large enough!
* Each $x^{(i)}$ uses different rows of $W$. We *don’t share weights* across the window.
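
To make the window-size problem concrete, here is a rough forward pass for a fixed-window model. The architecture is an assumption, since the figures are not described in the text: concatenated word embeddings, one `tanh` hidden layer, and a softmax output. All names and sizes (`E`, `W`, `U`, `b1`, `b2`, `fixed_window_lm`) are hypothetical.

```python
import math
import random

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def fixed_window_lm(word_ids, E, W, b1, U, b2):
    """Return P(next word | window) for a window of word indices."""
    # Concatenate the embeddings of the window words into one long vector.
    e = [x for i in word_ids for x in E[i]]
    # Each window position is multiplied by its own block of W (a block of
    # columns here because we compute W e; a block of rows under the
    # transposed convention), so no weights are shared across positions.
    h = [math.tanh(a + b) for a, b in zip(matvec(W, e), b1)]
    return softmax([a + b for a, b in zip(matvec(U, h), b2)])

rng = random.Random(0)
V, d, hidden, window = 6, 4, 5, 3         # vocabulary, embedding, hidden, window sizes
E = rand_matrix(V, d, rng)                # one embedding row per vocabulary word
W = rand_matrix(hidden, window * d, rng)  # width grows linearly with the window
b1 = [0.0] * hidden
U = rand_matrix(V, hidden, rng)
b2 = [0.0] * V

probs = fixed_window_lm([2, 0, 5], E, W, b1, U, b2)
```

Doubling `window` doubles the number of columns in `W`, and the model still cannot see anything outside the window.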

# Recurrent Neural Networks (RNN)
![rnn](img/rnn.png)

# RNN Language Model
![RNNLM](img/rnn_lm.png)

RNN **Advantages**:
* Can process any length input
* Model size doesn’t increase for longer input
* Computation for step t can (in theory) use information from many steps back
* Weights are shared across timesteps: representations are shared 

RNN **Disadvantages**:
* Recurrent computation is slow
* In practice, difficult to access information from many steps back
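
The advantages listed above follow from applying the same weights at every timestep. A minimal sketch of an RNN language model forward pass, assuming the common formulation $h^{(t)} = \tanh(W_h h^{(t-1)} + W_e e^{(t)} + b_1)$ and $\hat{y}^{(t)} = \text{softmax}(U h^{(t)} + b_2)$ (the figures are not described in the text, so the exact architecture and all names here are assumptions):

```python
import math
import random

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def rnn_lm_forward(word_ids, E, W_h, W_e, b1, U, b2):
    """Return one next-word distribution per input position."""
    h = [0.0] * len(b1)  # initial hidden state
    outputs = []
    for i in word_ids:   # sequential loop: step t needs the state from step t-1
        pre = [a + b + c
               for a, b, c in zip(matvec(W_h, h), matvec(W_e, E[i]), b1)]
        h = [math.tanh(p) for p in pre]  # same W_h and W_e at every step
        outputs.append(softmax([a + b for a, b in zip(matvec(U, h), b2)]))
    return outputs

rng = random.Random(0)
V, d, hidden = 6, 4, 5  # vocabulary, embedding, and hidden sizes
E = rand_matrix(V, d, rng)
W_h = rand_matrix(hidden, hidden, rng)
W_e = rand_matrix(hidden, d, rng)
b1 = [0.0] * hidden
U = rand_matrix(V, hidden, rng)
b2 = [0.0] * V

# The same parameters handle inputs of different lengths.
short = rnn_lm_forward([1, 3], E, W_h, W_e, b1, U, b2)
long = rnn_lm_forward([1, 3, 0, 5, 2, 4, 1], E, W_h, W_e, b1, U, b2)
```

The `for` loop is also where the main disadvantage shows up: each step depends on the previous hidden state, so the steps cannot be computed in parallel.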

# Training an RNN Language Model

True next word: $\large y^{(t)} = x^{(t+1)}$ (one-hot encoded vector)
<br>
Predicted probability distribution: $\large \hat{y}^{(t)}$ (sums to one)
<br>
The loss function on step $t$ is the usual **cross-entropy** between $y^{(t)}$ and $\hat{y}^{(t)}$:

$$\large J^{(t)}(\theta) = - \sum_j y_j^{(t)} \ln(\hat{y}_j^{(t)})$$
$$\large \mathcal L(\theta) = \frac{1}{T}\sum_{t=1}^T J^{(t)}(\theta)$$
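
Because $y^{(t)}$ is one-hot, the sum over $j$ keeps only the term for the true next word, so $J^{(t)}(\theta) = -\ln \hat{y}^{(t)}_{\text{true}}$. A small sketch of both formulas, with made-up predictions over a three-word vocabulary (the function names and the numbers are illustrative only):

```python
import math

def step_loss(y_hat, target):
    """Cross-entropy at one step: minus the log-probability of the true word."""
    return -math.log(y_hat[target])

def sequence_loss(predictions, targets):
    """Average the per-step losses over all T steps."""
    total = sum(step_loss(p, t) for p, t in zip(predictions, targets))
    return total / len(targets)

# Hypothetical predicted distributions for T = 3 steps.
predictions = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.25, 0.25, 0.5],
]
targets = [0, 1, 2]  # index of the true next word at each step

loss = sequence_loss(predictions, targets)
print(round(loss, 4))
```

A model that put probability 1 on every true word would reach a loss of 0; lower probabilities on the true words raise the loss.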

# RNNs can be used for tagging
e.g. part-of-speech tagging
![tagging](img/rnn_tagging.png)

# RNNs can be used for sentence classification
e.g. sentiment classification
![classification](img/rnn_encode.png)

# RNNs can be used to generate text
e.g. speech recognition, machine translation, summarization, image captioning
![speech](img/rnn_speech.png)

### Composition of CNN and RNN
![architecture](img/convplusrnn.png)
<br>
<br>
![results](img/captioning.jpg)

## More reading
[The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
<br>
[Recurrent Neural Networks Tutorial](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/)