# Language Models

We are interested in creating summaries of texts. In order to do so, we want to find the probability distribution over the next word given all the words we have seen so far:

$$ p(w_n | \{w_i\}_{i=1}^{n-1}) $$

This is known as a language model: a probability distribution over sequences of words or, in our case, over the next word. We make a Markov assumption that the probability of the next word depends only on the previous $m$ words:

$$ p(w_n | \{w_i\}_{i=1}^{n-1}) \approx p(w_n | \{w_i\}_{i=n-m}^{n-1}) $$

For example, if we assume $m = 2$, i.e. that the distribution over the next word depends only on the two most recently seen words, we have

$$ p(w_n | \{w_i\}_{i=1}^{n-1}) \approx p(w_n | w_{n-1}, w_{n-2})$$

For the rest of this note, we use $m=2$ but the code can easily be generalized.
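
To make the $m = 2$ case concrete, the sketch below pairs each word with the $m$ words that precede it, which is the form of training example the Markov assumption implies. It is plain Lua rather than Torch, and `make_ngrams` and the toy token list are hypothetical names introduced only for illustration.

```lua
-- Build (context, target) pairs under an order-m Markov assumption.
-- `tokens` is a hypothetical list of word indices; m is the context size.
local function make_ngrams(tokens, m)
    local contexts, targets = {}, {}
    for n = m + 1, #tokens do
        -- the context is the m words immediately before position n
        local ctx = {}
        for k = n - m, n - 1 do
            ctx[#ctx + 1] = tokens[k]
        end
        contexts[#contexts + 1] = ctx
        targets[#targets + 1] = tokens[n]
    end
    return contexts, targets
end

local contexts, targets = make_ngrams({5, 8, 2, 9, 4}, 2)
-- contexts = {{5, 8}, {8, 2}, {2, 9}}, targets = {2, 9, 4}
```

Changing the second argument is all that is needed to try a longer or shorter memory.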

The statistical model we use is similar to the one we used for sentiment analysis:

$$ p(w_n | w_{n-1}, w_{n-2}) \propto \exp\{V f(w_{n-2}, w_{n-1}) + V_0\}$$

where $V \in \mathbb{R}^{|\mathcal V| \times h}$ is a weight matrix, $h$ is the hidden layer size, and $V_0 \in \mathbb{R}^{|\mathcal V|}$ is a bias term. We define $f(w_{n-2}, w_{n-1})$ to be

$$f(w_{n-2}, w_{n-1}) = \text{tanh}(W[Ew_{n-2}, Ew_{n-1}]^\top + W_0)$$

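where $E \in \mathbb{R}^{d \times |\mathcal V|}$ is our embedding matrix (each $w_i$ is treated as a one-hot vector, so $Ew_i$ selects a column of $E$), $W \in \mathbb{R}^{h \times (md)}$ is another weight matrix, and $W_0 \in \mathbb{R}^h$ is another bias term. We are interested in learning $V$, $W$, and $E$, along with the biases $V_0$ and $W_0$.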
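
Written out with the normalizing constant that the proportionality leaves implicit, the model is a softmax over the vocabulary. The index $k$ and the bracket notation $[\cdot]_k$ for the $k$-th component are introduced here only for illustration:

$$ p(w_n = k | w_{n-1}, w_{n-2}) = \frac{\exp\{[V f(w_{n-2}, w_{n-1}) + V_0]_k\}}{\sum_{j=1}^{|\mathcal V|} \exp\{[V f(w_{n-2}, w_{n-1}) + V_0]_j\}} $$

As a dimension check, the concatenated embeddings $[Ew_{n-2}, Ew_{n-1}]^\top$ form a vector in $\mathbb{R}^{md}$, the hidden representation $f(w_{n-2}, w_{n-1})$ lies in $\mathbb{R}^{h}$, and the scores $V f(w_{n-2}, w_{n-1}) + V_0$ lie in $\mathbb{R}^{|\mathcal V|}$, one per word in the vocabulary.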

In [None]:
-- Vocab processing

-- nV: size of vocab (must be defined here before the model is built)

In [None]:
nn = require "nn"
m = 2 -- size of "memory"
d = 10 -- embedding size
h = 100 -- hidden layer size (assumed placeholder value; tune as needed)

model = nn.Sequential()
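lookup = nn.LookupTable(nV, d) -- matrix E: one d-dimensional embedding per vocab index
model:add(lookup)
model:add(nn.View(m*d)) -- concatenate the m embeddings into one vector of size m*d
model:add(nn.Linear(m*d, h)) -- W and W_0
model:add(nn.Tanh()) -- activation function
model:add(nn.Linear(h, nV)) -- V and V_0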
model:add(nn.LogSoftMax()) -- log softmax

criterion = nn.ClassNLLCriterion()

In [None]:
-- Training via SGD w/ minibatch
batch_size = 100
epochs = 50
step_size = .05

In [None]:
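-- assumes train is an N x m tensor of context word indices and train_t the N target indices
for i = 1, epochs do
    nll = 0
    for j = 1, math.floor(train:size(1) / batch_size) do
        model:zeroGradParameters()
        input = train:narrow(1, (j - 1) * batch_size + 1, batch_size) -- j-th minibatch
        output = train_t:narrow(1, (j - 1) * batch_size + 1, batch_size)
        -- forward pass; ClassNLLCriterion returns the mean NLL of the batch
        pred = model:forward(input)
        nll = nll + criterion:forward(pred, output)
        -- backward pass, then one SGD step
        model:backward(input, criterion:backward(pred, output))
        model:updateParameters(step_size)
    end
    print(string.format("epoch %d: summed batch NLL %.4f", i, nll))
end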

In [None]:
-- perplexity: exp{-1/N sum_i(ln(q(x_i)))}
-- for each data point, generate distribution over words q
-- find the probability of the actual word, q(x_i)
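
The perplexity formula in the comments above can be sketched in plain Lua as follows. This is a minimal illustration, not the evaluation code for this note: `perplexity` and `logprobs` are hypothetical names, and `logprobs` is assumed to hold $\ln q(x_i)$ for each data point, for instance the entries of the model's log-softmax output at the indices of the actual next words.

```lua
-- Perplexity from per-token log-probabilities ln q(x_i):
--   exp{ -(1/N) * sum_i ln q(x_i) }
-- `logprobs` is a hypothetical list of ln q(x_i) values, one per data point.
local function perplexity(logprobs)
    local total = 0
    for _, lp in ipairs(logprobs) do
        total = total + lp
    end
    return math.exp(-total / #logprobs)
end

-- Sanity check: a model that is uniform over 4 words assigns
-- probability 1/4 to every word, so its perplexity should be about 4.
local uniform = {}
for i = 1, 10 do
    uniform[i] = math.log(1 / 4)
end
print(perplexity(uniform))
```

Lower perplexity means the model assigns higher probability to the words that actually occur; a uniform model over the whole vocabulary has perplexity equal to the vocabulary size.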