# Language modelling

In this tutorial we will use two character-level language models to generate dinosaur names. We will use recurrent neural networks as the main tool for language modelling.

![picture](https://vignette.wikia.nocookie.net/uncyclopedia/images/7/73/Hankk_the_dino.png/revision/latest?cb=20100127020302)

In [1]:
!wget https://raw.githubusercontent.com/artemovae/NLP-seminar-LM/master/dinos.txt

--2019-07-05 16:05:42--  https://raw.githubusercontent.com/artemovae/NLP-seminar-LM/master/dinos.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.112.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.112.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19909 (19K) [text/plain]
Saving to: ‘dinos.txt.1’


2019-07-05 16:05:42 (1.95 MB/s) - ‘dinos.txt.1’ saved [19909/19909]



## Recurrent neural networks

Input:

$x_{1:n} = x_1, x_2, \ldots, x_n$, $x_i \in \mathbb{R}^{d_{in}}$

For each input $x_{1:i}$ we get an output $y_i$:

$y_i = RNN(x_{1:i})$, $y_i \in \mathbb{R}^{d_{out}}$

For the whole input sequence $x_{1:n}$:

$y_{1:n} = RNN^{*}(x_{1:n})$, $y_i \in \mathbb{R}^{d_{out}}$

$R$ is a recursive activation function with two inputs: $x_i$ and $s_{i-1}$ (the state vector).

$RNN^{*}(x_{1:n}, s_0) = y_{1:n}$

$y_i = O(s_i) = g(W^{out} s_i + b^{out})$

$s_i = R(s_{i-1}, x_i) = g(W^{hid}[s_{i-1}, x_i] + b^{hid})$, where $[s_{i-1}, x_i]$ is the concatenation of the previous state and the current input

$x_i \in \mathbb{R}^{d_{in}}$, $y_i \in \mathbb{R}^{d_{out}}$, $s_i \in \mathbb{R}^{d_{hid}}$

$W^{hid} \in \mathbb{R}^{d_{hid} \times (d_{hid}+d_{in})}$, $W^{out} \in \mathbb{R}^{d_{out} \times d_{hid}}$
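
The recurrence can be traced by hand on a toy example. The following is a minimal sketch in plain Python, not the model trained in this tutorial. It assumes column vectors, so `W_hid` has $d_{hid}$ rows and $d_{hid}+d_{in}$ columns. It also assumes $g = \tanh$ for the state and the identity for the output, which is then computed from the state alone. The helper names `matvec`, `rnn_step` and `rand_matrix` and the tiny dimensions are made up for illustration.

```python
import math
import random

def matvec(W, v):
    """Multiply a matrix W (list of rows) by a vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(s_prev, x, W_hid, b_hid, W_out, b_out):
    """One step: s_i = tanh(W_hid [s_prev, x] + b_hid), y_i = W_out s_i + b_out."""
    concat = s_prev + x  # list concatenation plays the role of [s_{i-1}, x_i]
    s = [math.tanh(a + b) for a, b in zip(matvec(W_hid, concat), b_hid)]
    y = [a + b for a, b in zip(matvec(W_out, s), b_out)]
    return s, y

d_in, d_hid, d_out = 3, 4, 3  # toy sizes, chosen only for this example
rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

W_hid, b_hid = rand_matrix(d_hid, d_hid + d_in), [0.0] * d_hid
W_out, b_out = rand_matrix(d_out, d_hid), [0.0] * d_out

# Three one-hot inputs x_1, x_2, x_3 and the initial state s_0 = 0
xs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
s = [0.0] * d_hid
ys = []
for x in xs:
    s, y = rnn_step(s, x, W_hid, b_hid, W_out, b_out)
    ys.append(y)

print(len(ys), len(ys[0]), len(s))  # 3 outputs of size d_out, state of size d_hid
```

The same state `s` is threaded through every step, which is what lets $y_i$ depend on the whole prefix $x_{1:i}$.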

![rnn](https://github.com/enggen/Deep-Learning-Coursera/raw/1407e19c98833d2686a0748db26b594f3102301e/Sequence%20Models/Week1/Dinosaur%20Island%20--%20Character-level%20language%20model/images/rnn.png)

We are going to create an RNN-LM using PyTorch.

In [2]:
import numpy as np
import random
import torch
import torch.nn as nn
import torch.optim as optim
import pdb
from torch.utils.data import Dataset, DataLoader

%load_ext autoreload
%autoreload 2

torch.set_printoptions(linewidth=200)

In [3]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
hidden_size = 50

print(device)

cuda:0


**Step 1**. Prepare the dataset

In [4]:
class DinosDataset(Dataset):
    def __init__(self):
        super().__init__()
        with open('dinos.txt') as f:
            content = f.read().lower()
            self.vocab = sorted(set(content)) + ['<', '>']
            self.vocab_size = len(self.vocab)
            self.lines = content.splitlines()
        self.ch_to_idx = {c:i for i, c in enumerate(self.vocab)}
        self.idx_to_ch = {i:c for i, c in enumerate(self.vocab)}
    
    def __getitem__(self, index):
        line = self.lines[index]
        #teacher forcing
        x_str = '<' + line 
        y_str = line + '>' 
        x = torch.zeros([len(x_str), self.vocab_size], dtype=torch.float)
        y = torch.empty(len(x_str), dtype=torch.long)
        for i, (x_ch, y_ch) in enumerate(zip(x_str, y_str)):
            x[i][self.ch_to_idx[x_ch]] = 1
            y[i] = self.ch_to_idx[y_ch]
        
        return x, y
    
    def __len__(self):
        return len(self.lines)

In [5]:
trn_ds = DinosDataset()
trn_dl = DataLoader(trn_ds, shuffle=True)

In [6]:
trn_ds.lines[1]

'aardonyx'

In [7]:
print(trn_ds.idx_to_ch)

{0: '\n', 1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 27: '<', 28: '>'}


In [8]:
trn_ds.vocab_size

29

In [9]:
x, y = trn_ds[1]

In [10]:
x.shape

torch.Size([9, 29])

In [11]:
y.shape

torch.Size([9])
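
The shapes above follow from how `__getitem__` builds its input and target: the input is the name prefixed with `<`, the target is the name followed by `>`, so an 8-letter name yields 9 time steps. A small illustration of that shift in plain Python, without tensors:

```python
line = "aardonyx"
x_str = "<" + line   # model input: start token followed by the name
y_str = line + ">"   # target: the name followed by the end token

# At every step the target is simply the next character of the input
pairs = list(zip(x_str, y_str))
for x_ch, y_ch in pairs:
    print(f"input {x_ch!r} -> target {y_ch!r}")
```

During training the model always receives the true previous character as input, regardless of what it predicted at the previous step.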

**Step 2**. Define a model, loss function and optimization algorithm

In [28]:
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.i2h = nn.Linear(input_size + hidden_size,hidden_size)
        self.dropout = nn.Dropout(0.3)
        self.i2o = nn.Linear(hidden_size, output_size)
    
    def forward(self, h_prev, x):
        
        # Concatenate input and previous state, apply dropout to the pre-activation, then tanh
        h = torch.tanh(self.dropout(self.i2h(torch.cat((x, h_prev), dim=1))))
        result = self.i2o(h)
        return h, result

In [29]:
model = RNN(trn_ds.vocab_size, hidden_size, trn_ds.vocab_size).to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-2)
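
`nn.CrossEntropyLoss` expects raw scores (logits) and a target class index, and applies a log-softmax internally. This is why the model returns `self.i2o(h)` without a softmax. A rough plain-Python sketch of what is computed for a single time step is shown here. The functions `softmax` and `cross_entropy` are illustrative re-implementations, not the PyTorch ones, and the logits are made up.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct class index."""
    return -math.log(softmax(logits)[target])

logits = [2.0, 0.5, -1.0]                # made-up scores for a 3-character vocabulary
loss_right = cross_entropy(logits, 0)    # target is the highest-scoring character
loss_wrong = cross_entropy(logits, 2)    # target is the lowest-scoring character
print(loss_right < loss_wrong)
```

The loss is small when the model puts most of its probability on the correct next character and grows as that probability shrinks.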

**Step 3**. Declare a sampling procedure

![rnn](https://github.com/enggen/Deep-Learning-Coursera/raw/1407e19c98833d2686a0748db26b594f3102301e/Sequence%20Models/Week1/Dinosaur%20Island%20--%20Character-level%20language%20model/images/dinos3.png)

Side note: we can use the softmax probabilities to evaluate perplexity, which is a standard quality measure for this task. For a sequence of $N$ characters it is computed from the probabilities the model assigns to the true next characters:

$ PP = 2^{-\frac{1}{N}\sum_{i=1}^{N}\log_{2} p(\hat{y}^{<i>} = y^{<i>}_{train})} $
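
In practice, perplexity is usually computed from the average negative log-probability that the model assigns to each correct next character. Here is a minimal sketch of that per-character form on made-up probabilities. The function name `perplexity` is hypothetical, and it takes a plain list of probabilities rather than a model.

```python
import math

def perplexity(probs):
    """Perplexity from the probabilities given to each true next character."""
    avg_neg_log2 = -sum(math.log2(p) for p in probs) / len(probs)
    return 2 ** avg_neg_log2

# A model that guesses uniformly over 29 characters vs. a confident model
uniform = perplexity([1 / 29] * 9)
confident = perplexity([0.9] * 9)
print(uniform, confident)
```

A uniform guess over the 29-character vocabulary gives a perplexity of about 29, so a trained model should score well below that. A perfect model would reach 1.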

In [14]:
def sample(model, max_len=50):
    model.eval()
    end_idx = trn_ds.ch_to_idx['>']
    start_idx = trn_ds.ch_to_idx['<']
    indices = [start_idx]
    with torch.no_grad():
        h_prev = torch.zeros([1, hidden_size], dtype=torch.float, device=device)
        # One-hot encoding of the start token
        x = h_prev.new_zeros([1, trn_ds.vocab_size])
        x[0, start_idx] = 1

        while indices[-1] != end_idx and len(indices) <= max_len:
            h_prev, y_pred = model(h_prev, x)
            probs = torch.softmax(y_pred, dim=1).cpu().numpy().ravel()

            # Draw the next character from the predicted distribution
            idx = int(np.random.choice(np.arange(trn_ds.vocab_size), p=probs))
            indices.append(idx)

            # Feed the sampled character back in as the next input
            x = h_prev.new_zeros([1, trn_ds.vocab_size])
            x[0, idx] = 1

    # Close the name if generation stopped at the length limit
    if indices[-1] != end_idx:
        indices.append(end_idx)
    return indices
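
The sampling function draws the next character from the softmax distribution instead of always taking the most likely one. The difference can be seen on a toy distribution. This is a sketch in plain Python, where `sample_index` and `greedy_index` are illustrative names and the logits are made up.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, rng):
    """Draw an index at random, proportionally to its softmax probability."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def greedy_index(logits):
    """Always return the index with the highest score."""
    return max(range(len(logits)), key=lambda i: logits[i])

rng = random.Random(0)
logits = [1.0, 0.8, -2.0]  # two plausible characters and one unlikely one

draws = [sample_index(logits, rng) for _ in range(1000)]
counts = [draws.count(i) for i in range(len(logits))]
print("sampled counts:", counts)
print("greedy choice:", greedy_index(logits))
```

Greedy decoding would produce the same name every time for a given start token. Sampling trades some likelihood for variety, which is what we want when generating many different names.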

In [15]:
def print_sample(sample_idxs):
    print(''.join(trn_ds.idx_to_ch[x] for x in sample_idxs))

**Step 4**. Almost done! Train the model

In [16]:
def train_one_epoch(model, loss_fn, optimizer):
    model.train()
    for line_num, (x, y) in enumerate(trn_dl):
        loss = 0
        optimizer.zero_grad()
        h_prev = torch.zeros([1, hidden_size], dtype=torch.float, device=device)
        x, y = x.to(device), y.to(device)
        # Unroll the RNN over the characters of the name, summing the per-step losses
        for i in range(x.shape[1]):
            h_prev, y_pred = model(h_prev, x[:, i])
            loss += loss_fn(y_pred, y[:, i])
        loss.backward()
        optimizer.step()

        if (line_num+1) % 100 == 0:
            print_sample(sample(model))
            model.train()  # sample() switches to eval mode, so re-enable dropout

In [17]:
def train(model, loss_fn, optimizer, dataset='dinos', epochs=1):
    for e in range(1, epochs+1):
        print('Epoch:{}'.format(e))
        train_one_epoch(model, loss_fn, optimizer)
        print()

In [30]:
train(model, loss_fn, optimizer, epochs = 5)

Epoch:1
<sqa>
<mn>
<ueoonatao>
<pvctssaurus>
<tnrusaurhs>
<ainrar>
<naurus>
<ardrus>
<ciuros>
<saxrosrulus>
<gmcrosaurud>
<cjlrasausus>
<suror>
<qnysaurud>
<kbmuaosres>

Epoch:2
<laieooaurus>
<puctosaurus>
<sauruanuhustur>
<agonaratopa>
<amvrrgsurus>
<rewskrqycaurus>
<sltaariaaurus>
<tcrwsturus>
<dounysaurus>
<laltasgvcatoaueopodsausus>
<ssrngsausus>
<suaruaiod>
<gscasnaauros>
<tautotaurus>
<stiytpauras>

Epoch:3
<altciurus>
<maplunresaptosltt>
<llpph>
<snossalpaurus>
<tceonaonaurus>
<hvltasiurus>
<kyuroshirus>
<scthsaiaurus>
<habtaphurus>
<sejpshysaurus>
<kcbtatesam>
<lbkhtrarrus>
<pulrusaurus>
<throuapatrus>
<gtcdupaurus>

Epoch:4
<apcrus>
<crsuoaurus>
<lortaaurus>
<sesaeopaurus>
<ascrusturus>
<diurysaurus>
<kcatatisaurur>
<ctemasap>
<oucusaurus>
<ztlsiuous>
<epcrosaurus>
<bselaurus>
<terstor>
<rnyonosaurus>
<scrisaurun>

Epoch:5
<ciuros>
<iyurasaurus>
<kyuroshurus>
<saurucauras>
<selascntaurus>
<meotiuourus>
<hrahtcrraurus>
<lisioasaurus>
<surolaurus>
<pruaisaurus>
<tcconaonaurus>
<

### Tasks 
1. Complete the forward function for the RNN model
2. Rewrite the sampling function so that only isograms (words that contain each character only once) are generated

### Advanced tasks
3. Make yourself a dinosaur nickname: initialize the sampling procedure with your own name
4. Implement a function ```get_perplexity``` to compute perplexity. Add more layers and check whether they affect the perplexity.

# References

1. Sampling in  RNN: https://nlp.stanford.edu/blog/maximum-likelihood-decoding-with-rnns-the-good-the-bad-and-the-ugly/
2. Coursera course (main source): https://github.com/furkanu/deeplearning.ai-pytorch/tree/master/5-%20Sequence%20Models
3. Coursera course (main source): https://github.com/Kulbear/deep-learning-coursera/blob/master/Sequence%20Models/Dinosaurus%20Island%20--%20Character%20level%20language%20model%20final%20-%20v3.ipynb
4. LSTM: http://colah.github.io/posts/2015-08-Understanding-LSTMs/