# The unreasonable effectiveness of language models

Adapted by Sravana Reddy from a [notebook](http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139) by [Yoav Goldberg](http://www.cs.biu.ac.il/~yogo).

First, we download a sample of text from Shakespeare.

In [3]:
import urllib2
shakespeare = urllib2.urlopen('http://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt').read()
print('Downloaded text with {0} characters'.format(len(shakespeare)))
print('Showing the first 100 characters in the text...')
print(shakespeare[:100])

Downloaded text with 4573338 characters
Showing the first 100 characters in the text...
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


## LSTM-Based Character Model

The code below uses Keras to train an LSTM character model. Run it if you have Keras working on your laptop. Read through the code otherwise.

First, set up the LSTM architecture.

In [4]:
from keras.models import Sequential
from keras.layers import Dense, Activation, LSTM
from keras.optimizers import RMSprop
from keras.layers.wrappers import TimeDistributed

def make_model(H, maxlen, vocabsize):
    model = Sequential()
    model.add(LSTM(H, return_sequences = True, input_shape=(maxlen, vocabsize)))
    model.add(TimeDistributed(Dense(vocabsize, activation='softmax')))
    model.compile(loss='categorical_crossentropy', optimizer = RMSprop(lr=0.001), metrics=['accuracy'])
    print('Created an LSTM model')
    print(model.summary())
    return model

Using TensorFlow backend.


Format the text data appropriately.

**Exercise:** Write down the output of the `text_to_chardata` function evaluated on the example in the handout.

In [5]:
from keras.utils import to_categorical  # for one-hot encoding
import numpy as np

def text_to_chardata(text, maxlen):
    # get data into batches with characters mapped to indices
    char2id = {}  
    curid = 0
    X = []
    Y = []
    X_sample = []
    for i, c in enumerate(text):
        if c not in char2id:
            char2id[c] = curid
            curid += 1
        if len(X_sample)==maxlen:
            X.append(X_sample)
            Y.append(X_sample[1:]+[char2id[c]])  # output shifted by 1
            X_sample = []
        X_sample.append(char2id[c])
    
    # convert data into one-hot vectors
    X = np.array(map(lambda X_sample: to_categorical(X_sample, num_classes=len(char2id)), X))
    Y = np.array(map(lambda Y_sample: to_categorical(Y_sample, num_classes=len(char2id)), Y))
    
    return X, Y, char2id
    
# load the first million characters of the shakespeare data
shakX, shakY, char2id = text_to_chardata(shakespeare[:1000000], 200)
print('Data size')
print(shakX.shape, shakY.shape)
        
    

Data size
((4999, 200, 65), (4999, 200, 65))
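
To see what the shift-by-one targets look like without the one-hot step, here is a minimal sketch on a toy string. The helper `chunk_with_shifted_targets` is a hypothetical, index-only copy of the loop in `text_to_chardata` (it is not part of the notebook), and the toy string is an assumption chosen for illustration.

```python
def chunk_with_shifted_targets(text, maxlen):
    """Index-only version of the chunking loop (no one-hot encoding)."""
    char2id = {}
    X, Y, X_sample = [], [], []
    for c in text:
        if c not in char2id:
            char2id[c] = len(char2id)  # next unused id
        if len(X_sample) == maxlen:
            X.append(X_sample)
            # target = input shifted left by one, plus the character that follows the chunk
            Y.append(X_sample[1:] + [char2id[c]])
            X_sample = []
        X_sample.append(char2id[c])
    return X, Y, char2id

X, Y, toy_char2id = chunk_with_shifted_targets("abcabcabcd", 3)
print(X)  # [[0, 1, 2], [0, 1, 2], [0, 1, 2]]
print(Y)  # [[1, 2, 0], [1, 2, 0], [1, 2, 3]]
```

Each target row is its input row shifted left by one, with the first character of the next chunk appended. A chunk is only emitted once the character after it has been read, which is why the last full chunk of the first million characters is dropped and the run above reports 4999 samples rather than 5000.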


Train the model on this data. 

In [None]:
shakmodel = make_model(200, shakX.shape[1], shakX.shape[2])
shakmodel.fit(shakX, shakY, batch_size = 100, epochs = 15)

Created an LSTM model
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_2 (LSTM)                (None, 200, 200)          212800    
_________________________________________________________________
time_distributed_2 (TimeDist (None, 200, 65)           13065     
Total params: 225,865
Trainable params: 225,865
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/15
 300/4999 [>.............................] - ETA: 159s - loss: 4.1357 - acc: 0.0911
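
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias vector; the helper names are illustrative and are not Keras functions.

```python
def lstm_param_count(hidden, input_dim):
    # 4 gates, each with (input_dim x hidden) input weights,
    # (hidden x hidden) recurrent weights, and a bias of size hidden
    return 4 * (hidden * (input_dim + hidden) + hidden)

def dense_param_count(input_dim, output_dim):
    # weight matrix plus one bias per output unit
    return input_dim * output_dim + output_dim

H, V = 200, 65  # hidden size and vocabulary size used above
print(lstm_param_count(H, V))                            # 212800
print(dense_param_count(H, V))                           # 13065
print(lstm_param_count(H, V) + dense_param_count(H, V))  # 225865
```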

Let the LSTM model above train while we talk about **non-neural language models**. Skip the block below for now.

In [None]:
def generate(trained_model, H, char2id, outlen=1000):
    print('Building Inference model...')
    inference_model = Sequential()
    # 1. The inference model only takes one sample in the batch, and it always has sequence length 1.
    # 2. The inference model is stateful, meaning it carries its hidden state ("its history state")
    #    over from one batch to the next.
    inference_model.add(LSTM(H, batch_input_shape=(1, 1, len(char2id)), stateful = True))
    # Since the above LSTM does not output sequences, we don't need TimeDistributed anymore.
    inference_model.add(Dense(len(char2id), activation='softmax'))
    # Copy the weights of the trained network. Both should have the same exact number of parameters (why?).
    inference_model.set_weights(trained_model.get_weights())
    
    id2char = {v:k for k, v in char2id.items()}

    inference_model.reset_states()  
    startChar = np.zeros((1, 1, len(char2id)))
    startChar[0, 0, 0] = 1

    # sample
    generated = []
    for i in range(outlen):
        nextCharProbs = inference_model.predict(startChar)
    
        nextCharProbs = np.asarray(nextCharProbs).astype('float64') 
        nextCharProbs = nextCharProbs / nextCharProbs.sum()  
    
        nextCharId = np.random.multinomial(1, nextCharProbs.squeeze(), 1).argmax()
        generated.append(id2char[nextCharId]) 
        startChar.fill(0)
        startChar[0, 0, nextCharId] = 1
    
    print(''.join(generated))

generate(shakmodel, 200, char2id)

## Unsmoothed Maximum Likelihood Character Level Language Model 

We want a model whose job is to guess the next character based on the previous n letters. For example, having seen `ello`, the next character is likely to be either a comma or a space (if we assume it is the end of the word "hello"), or the letter `w` if we believe we are in the middle of the word "mellow". Humans are quite good at this, but of course seeing a longer history makes things easier (had we seen 5 letters instead of 4, the choice between space and `w` would have been much easier).

We will call n, the number of letters of history we base our guess on, the _order_ of the language model.

RNNs and LSTMs can potentially learn infinite-order language models (they guess the next character based on a "state" which supposedly encodes all the previous history). Here we will restrict ourselves to a fixed-order language model.

So, we are seeing `n` letters, and need to guess the `n+1`th one. We are also given a large-ish amount of text (say, all of Shakespeare's works) that we can use. How would we go about solving this task?

Mathematically, we would like to learn a function `P(c | h)`. Here, `c` is a character, `h` is an `n`-letter history, and `P(c | h)` stands for how likely it is to see `c` after we've seen `h`.

Perhaps the simplest approach would be to just count and divide (a.k.a. **maximum likelihood estimates**). We will count the number of times each letter `c` appeared after `h`, and divide by the total number of letters appearing after `h`. The **unsmoothed** part means that if we did not see a given letter following `h`, we will just give it a probability of zero.

And that's all there is to it.
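
As a minimal sketch of count-and-divide on a toy string (the function name `mle_next_char` and the example text are illustrative assumptions, and no start padding is used here):

```python
from collections import Counter, defaultdict

def mle_next_char(text, order):
    """Unsmoothed maximum likelihood estimate of P(c | h) for histories of length `order`."""
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        history, char = text[i:i + order], text[i + order]
        counts[history][char] += 1
    probs = {}
    for history, counter in counts.items():
        total = float(sum(counter.values()))
        probs[history] = {c: n / total for c, n in counter.items()}
    return probs

probs = mle_next_char("hello mellow hello", 4)
# 'ello' is followed once by a space and once by 'w', so each gets probability 0.5
print(probs["ello"])
# a character never seen after 'ello' has no entry, i.e. probability zero
print(probs["ello"].get("x", 0.0))  # 0.0
```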


### Training Code
Here is the code for training the model. `text` is the string to read the characters from. `order` is the history size to consult. Note that we pad the data with `order` leading `<s>` (beginning-of-text) tokens so that we also learn how to start.


In [None]:
from collections import defaultdict, Counter

def train_char_lm(text, order=4):
    lm = defaultdict(Counter)
    pad = ["<s>"] * order  # beginning of sentence tags at start
    data = pad + list(text)
    for i in xrange(len(data)-order):
        history, char = data[i:i+order], data[i+order]
        lm[tuple(history)][char]+=1
    def normalize(counter):
        s = float(sum(counter.values()))
        return [(c,cnt/s) for c,cnt in counter.iteritems()]
    outlm = {hist:normalize(chars) for hist, chars in lm.iteritems()}
    return outlm
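
To make the effect of the padding concrete, the sketch below lists the (history, next character) pairs that the training loop would count for a very short text. `padded_histories` is a hypothetical helper written for this illustration; it mirrors the slicing in `train_char_lm` but does no counting.

```python
def padded_histories(text, order):
    """Return the (history, next_char) pairs seen when the text is padded with <s> tokens."""
    data = ["<s>"] * order + list(text)
    return [(tuple(data[i:i + order]), data[i + order])
            for i in range(len(data) - order)]

pairs = padded_histories("Hi!", 2)
for history, char in pairs:
    print("{0} -> {1}".format(history, char))
# ('<s>', '<s>') -> H
# ('<s>', 'H') -> i
# ('H', 'i') -> !
```

The first pair is what lets the model learn which characters tend to start a text: its history consists only of padding tokens.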

Let's train this on the Shakespeare text again.

In [None]:
maxlik_shak = train_char_lm(shakespeare)

Ok. Now let's do some queries:

In [None]:
maxlik_shak[tuple(list('ello'))]

In [None]:
maxlik_shak[tuple(list('Firs'))]

### Generating from the model
Generating is also very simple. To generate a letter, we will take the history, look at the last `order` characters, and then sample a random letter based on the corresponding distribution.

In [None]:
from random import random

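def generate_letter(lm, history, order):
    # keep only the last `order` symbols; `history` must be a tuple to match the keys of `lm`
    history = history[-order:]
    dist = lm[history]
    # subtract probabilities from a uniform draw until it is used up
    x = random()
    for c, v in dist:
        x = x - v
        if x <= 0: return c
    return c  # guard against floating-point round-off in the probabilities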
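
The sampling loop is easier to follow when the uniform draw is passed in explicitly. The sketch below uses a made-up three-letter distribution and a hypothetical helper `sample_from`; the final fallback `return` is an added guard for round-off, an assumption of this sketch.

```python
import random

def sample_from(dist, x):
    """Walk through (char, prob) pairs, subtracting probabilities from x until it is used up."""
    for c, p in dist:
        x -= p
        if x <= 0:
            return c
    return dist[-1][0]  # only reached if the probabilities sum to slightly less than x

toy_dist = [("a", 0.5), ("b", 0.3), ("c", 0.2)]
print(sample_from(toy_dist, 0.1))   # a  (falls in the first 0.5 of the unit interval)
print(sample_from(toy_dist, 0.6))   # b  (falls between 0.5 and 0.8)
print(sample_from(toy_dist, 0.95))  # c  (falls between 0.8 and 1.0)

# in actual use, x is a fresh uniform draw each time
print(sample_from(toy_dist, random.random()))
```

Each letter owns a slice of the unit interval whose width equals its probability, so a uniform draw lands in a letter's slice exactly that often.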

To generate a passage of $k$ characters, we just seed it with the initial history and run letter generation in a loop, updating the history at each turn.

In [None]:
def maxlik_generate(lm, order, nletters=1000):
    history = ["<s>"] * order
    out = []
    for i in xrange(nletters):
        c = generate_letter(lm, tuple(history), order)
        history = history[-order:] + [c]
        out.append(c)
    return "".join(out)

### Generated Shakespeare from different order models

Let's try to generate text based on different language-model orders. Let's start with the order-4 model we trained above:

### order 4

In [None]:
print(maxlik_generate(maxlik_shak, 4))

This is already quite reasonable, and reads like English, with just 4 letters of history! What if we increase it to 10?

### order 10

In [None]:
lm10 = train_char_lm(shakespeare, order=10)
print(maxlik_generate(lm10, 10))

### This works pretty well

With an order of 4, we already get quite reasonable results. Increasing the order to 10 (~two short words of history) already gets us quite passable Shakespearean text.

Just for the fun of it, let's see what our simple language model does with the linux-kernel code:

In [None]:
import urllib2
linux = urllib2.urlopen('http://cs.stanford.edu/people/karpathy/char-rnn/linux_input.txt').read()
print('Downloaded text with {0} characters'.format(len(linux)))
print('Showing the first 100 characters in the text...')
print(linux[:100])

In [None]:
lm = train_char_lm(linux, order=10)
print(maxlik_generate(lm, 10))

Order 10 is pretty much junk. We are far from keeping good indentation and brackets. 

How could we? We do not have the memory, and these things are not modeled at all. While we could quite easily enrich our model to also keep track of brackets and indentation (by adding information such as "have I seen ( but not )" to the conditioning history), this requires extra work and non-trivial human reasoning, and will make the model significantly more complex.
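
As a rough sketch of what such an enrichment could look like, the code below adds a single "inside an open parenthesis" flag to the conditioning history. Everything here is hypothetical (the function name, the flag, and the toy text), and it handles only round brackets, not indentation or other bracket types.

```python
from collections import Counter, defaultdict

def train_with_paren_flag(text, order=2):
    """Count next characters conditioned on (last `order` chars, inside-parentheses flag)."""
    counts = defaultdict(Counter)
    depth = 0  # number of "(" seen so far without a matching ")"
    for i, c in enumerate(text):
        if i >= order:
            history = (text[i - order:i], depth > 0)
            counts[history][c] += 1
        # update the bracket state only after the character has been counted
        if c == "(":
            depth += 1
        elif c == ")" and depth > 0:
            depth -= 1
    return counts

counts = train_with_paren_flag("(ab) ab;")
print(dict(counts[("ab", True)]))   # {')': 1}
print(dict(counts[("ab", False)]))  # {';': 1}
```

The same two letters `ab` now lead to different predictions depending on whether a parenthesis is still open, which a plain fixed-order history cannot express. The cost is visible even in this toy: the state-tracking logic has to be designed by hand for each property we care about.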

Let's try this with an LSTM. You may have to run this on your own after class.

In [None]:
linuxX, linuxY, linchar2id = text_to_chardata(linux, 200)
print('Data size')
print(linuxX.shape, linuxY.shape)
linuxmodel = make_model(100, linuxX.shape[1], linuxX.shape[2])
linuxmodel.fit(linuxX, linuxY, batch_size = 100, epochs = 15)
generate(linuxmodel, 100, linchar2id)

 4900/31034 [===>..........................] - ETA: 480s - loss: 3.7703 - acc: 0.0978