# Language Modeling with RNNs
[Reference 1](https://wikidocs.net/21668), 
[Reference 2](https://github.com/yunjey/pytorch-tutorial/blob/master/tutorials/02-intermediate/language_model/main.py#L30-L50)

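Language modeling (LM) aims to model language as a phenomenon: a language model assigns a probability to a word sequence (or sentence).

In other words, a language model is a model that finds the most natural word sequence. The most common way to have it assign probabilities to word sequences is to train it to **predict the next word given the previous words**.

**Probability of a word sequence**

If a single word is denoted by $w$ and a word sequence by uppercase $W$, the probability of a word sequence $W$ containing $n$ words is: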

$$P(W)=P(w_1, w_2, \cdots, w_n)$$

**Probability of the next word**

Written as a formula, the probability of the $n$-th word given the preceding $n-1$ words is:

$$P(w_n|w_1, \cdots, w_{n-1})$$

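The probability of the whole word sequence $W$ is known only once every word has been predicted, so by the chain rule it is the product of the conditional probabilities of each word:

$$P(W)=P(w_1, w_2, \cdots, w_n)=\prod_{i=1}^{n}P(w_i|w_1, \cdots, w_{i-1})$$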
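
The chain rule above can be sketched in a few lines of Python. The conditional probability table `cond_probs` and the helper `sequence_probability` are hypothetical names, and the probability values are made up purely for illustration.

```python
# Toy table of P(w_i | w_1, ..., w_{i-1}); the numbers are invented for illustration.
cond_probs = {
    (): {"the": 0.5, "a": 0.5},
    ("the",): {"cat": 0.4, "dog": 0.6},
    ("the", "cat"): {"sat": 0.25, "ran": 0.75},
}

def sequence_probability(words, cond_probs):
    """Multiply P(w_i | w_1..w_{i-1}) over all positions (chain rule)."""
    prob = 1.0
    for i, word in enumerate(words):
        context = tuple(words[:i])       # all words before position i
        prob *= cond_probs[context][word]
    return prob

# P(the) * P(cat | the) * P(sat | the, cat) = 0.5 * 0.4 * 0.25
p = sequence_probability(["the", "cat", "sat"], cond_probs)
print(p)
```

A neural language model plays the role of `cond_probs` here: instead of looking up a table, it computes each conditional distribution from the preceding words.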

A familiar example is a search engine that suggests the next words after you type a query.

**Perplexity**

Perplexity (PPL) is a metric for evaluating the performance of a language model.

$$PPL(W)=P(w_1, w_2, \cdots, w_N)^{-\frac{1}{N}}$$

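PPL can be read as how `perplexed` the language model is. Lower values mean better performance.

For example, a language model with a PPL of 10 can be seen as choosing, on average, among 10 equally likely candidate words at every step when it predicts the next word on the test data. If the model assigns probability $\frac{1}{10}$ to each of the $N$ words:

$$PPL(W)=P(w_1, w_2, \cdots, w_N)^{-\frac{1}{N}}=\left(\left(\frac{1}{10}\right)^{N}\right)^{-\frac{1}{N}}=10$$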
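
As a quick check of the formula, here is a minimal sketch that computes PPL from a list of per-word probabilities. The function name `perplexity` and the sample probability lists are assumptions made for this example. The product is evaluated in log space, which is mathematically equivalent and avoids underflow on long sequences.

```python
import math

def perplexity(token_probs):
    """PPL = (p_1 * p_2 * ... * p_N) ** (-1/N), computed via logs for stability."""
    n = len(token_probs)
    log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-log_prob / n)

# A model that always spreads its belief evenly over 10 words has PPL close to 10.
ppl_uniform = perplexity([1 / 10] * 50)

# A model that gives the correct word probability 0.9 at every step is far less perplexed.
ppl_confident = perplexity([0.9] * 50)

print(round(ppl_uniform, 6), round(ppl_confident, 6))
```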

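**Relationship to cross entropy**

When the cross entropy is the average negative log-likelihood per word (using the natural logarithm), perplexity is its exponential:

$$PPL=\exp(\text{Cross Entropy})$$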
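
This relation can be derived from the definition of PPL and the chain rule. The derivation below assumes the cross entropy is the mean negative log-likelihood per word with the natural logarithm, which is what the training loop in this notebook relies on when it reports `np.exp(loss.item())`.

$$\log PPL(W)=-\frac{1}{N}\log P(w_1, w_2, \cdots, w_N)=-\frac{1}{N}\sum_{i=1}^{N}\log P(w_i|w_1, \cdots, w_{i-1})$$

$$PPL(W)=\exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log P(w_i|w_1, \cdots, w_{i-1})\right)$$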

## Requirements

In [1]:
from google.colab import drive 
drive.mount('/content/gdrive/')

import os
os.chdir('/content/gdrive/My Drive/Colab Notebooks/')

Mounted at /content/gdrive/


In [2]:
# Some part of the code was referenced from below.
# https://github.com/pytorch/examples/tree/master/word_language_model 
import torch
import torch.nn as nn
import numpy as np
from torch.nn.utils import clip_grad_norm_

## Preprocessing

In [3]:
import os

# Dictionary
class Dictionary(object):
    def __init__(self):
        self.word2idx = {}
        self.idx2word = {}
        self.idx = 0
    
    def add_word(self, word):
        if word not in self.word2idx:
            self.word2idx[word] = self.idx
            self.idx2word[self.idx] = word
            self.idx += 1
    
    def __len__(self):
        return len(self.word2idx)

# Corpus
class Corpus(object):
    def __init__(self):
        self.dictionary = Dictionary()

    def get_data(self, path, batch_size=20):
        # Add words to the dictionary
        with open(path, 'r') as f:
            tokens = 0
            for line in f:
                words = line.split() + ['<eos>']
                tokens += len(words)
                for word in words: 
                    self.dictionary.add_word(word)  
        
        # Tokenize the file content
        ids = torch.LongTensor(tokens)
        token = 0
        with open(path, 'r') as f:
            for line in f:
                words = line.split() + ['<eos>']
                for word in words:
                    ids[token] = self.dictionary.word2idx[word]
                    token += 1
        num_batches = ids.size(0) // batch_size
        ids = ids[:num_batches*batch_size]
        # Return the sequence of word indices, reshaped to (batch_size, -1)
        return ids.view(batch_size, -1)
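
To see what the final trimming and `view(batch_size, -1)` steps of `get_data` do, here is a plain-Python sketch on a toy id list. The helper `batchify` and the toy data are hypothetical; it mimics the reshaping with lists rather than tensors.

```python
def batchify(ids, batch_size):
    """Drop the tail that does not divide evenly, then split into batch_size rows."""
    row_len = len(ids) // batch_size          # tokens per row
    ids = ids[:row_len * batch_size]          # trim leftover tokens
    return [ids[r * row_len:(r + 1) * row_len] for r in range(batch_size)]

toy_ids = list(range(11))                     # 11 token ids: 0..10
rows = batchify(toy_ids, batch_size=3)
print(rows)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
```

Each row is a contiguous slice of the corpus, so the model can read `batch_size` parallel streams of text; the last two ids (9 and 10) are discarded because they do not fill a complete column.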

## Hyperparameters

Train data download:
https://url.kr/sulxnh
<!-- https://drive.google.com/file/d/1vQRyXr5pdJfdAlzR7WUuOFoqJL4CRqK3/view?usp=sharing -->

Please place the data at `/content/gdrive/My Drive/Colab Notebooks/train.txt`, which matches the Drive mount point and working directory set above.

In [4]:
# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Hyper-parameters
embed_size = 128
hidden_size = 1024

num_layers = 1
num_epochs = 5

num_samples = 1000     # number of words to be sampled

batch_size = 20
seq_length = 30
learning_rate = 0.002

# Load "Penn Treebank" dataset
corpus = Corpus()
ids = corpus.get_data('./train.txt', batch_size)
vocab_size = len(corpus.dictionary)
num_batches = ids.size(1) // seq_length

In [5]:
ids

tensor([[   0,    1,    2,  ...,  152, 4955, 4150],
        [  93,  718,  590,  ...,  170, 6784,  133],
        [  27,  930,   42,  ...,  392, 4864,   26],
        ...,
        [ 997,   42,  507,  ...,  682, 6849, 6344],
        [ 392, 5518, 3034,  ..., 2264,   42, 3401],
        [4210,  467, 1496,  ..., 9999,  119, 1143]])

## Model

In [6]:
# RNN based language model
class RNNLM(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super(RNNLM, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)
        
    def forward(self, x, h):
        # Embed word ids to vectors
        x = self.embed(x)
        
        # Forward propagate LSTM
        out, (h, c) = self.lstm(x, h)
        
        # Reshape output to (batch_size*sequence_length, hidden_size)
        out = out.reshape(out.size(0)*out.size(1), out.size(2))
        
        # Decode hidden states of all time steps
        out = self.linear(out)
        return out, (h, c)
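
The tensor shapes flowing through `RNNLM.forward` can be traced without running the model. The helper `rnnlm_shapes` below is a hypothetical bookkeeping function, and `vocab_size=10000` is an assumption (the printed `ids` contain index 9999, so the vocabulary has at least that many entries).

```python
def rnnlm_shapes(batch_size, seq_length, embed_size, hidden_size, vocab_size):
    """Return the shape of each intermediate tensor in the forward pass."""
    return {
        "input ids": (batch_size, seq_length),
        "embedded": (batch_size, seq_length, embed_size),
        "lstm out": (batch_size, seq_length, hidden_size),   # batch_first=True
        "reshaped": (batch_size * seq_length, hidden_size),
        "logits": (batch_size * seq_length, vocab_size),
    }

# vocab_size=10000 is assumed for illustration; the other values are this notebook's settings.
shapes = rnnlm_shapes(batch_size=20, seq_length=30, embed_size=128,
                      hidden_size=1024, vocab_size=10000)
for name, shape in shapes.items():
    print(name, shape)
```

The logits have one row per (sequence, time step) pair, which is why the training loop flattens the targets with `targets.reshape(-1)` before computing the loss.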

In [7]:
model = RNNLM(vocab_size, embed_size, hidden_size, num_layers).to(device)

In [8]:
# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Truncated backpropagation
def detach(states):
    return [state.detach() for state in states]

In [9]:
# Train the model
for epoch in range(num_epochs):
    # Set initial hidden and cell states
    states = (torch.zeros(num_layers, batch_size, hidden_size).to(device),
              torch.zeros(num_layers, batch_size, hidden_size).to(device))
    
    for i in range(0, ids.size(1) - seq_length, seq_length):
        # Get mini-batch inputs and targets
        inputs = ids[:, i:i+seq_length].to(device)
        targets = ids[:, (i+1):(i+1)+seq_length].to(device)
        
        # Forward pass
        states = detach(states)
        outputs, states = model(inputs, states)
        loss = criterion(outputs, targets.reshape(-1))
        
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        clip_grad_norm_(model.parameters(), 0.5)
        optimizer.step()

        step = (i+1) // seq_length
        if step % 100 == 0:
            print('Epoch [{}/{}], Step[{}/{}], Loss: {:.4f}, Perplexity: {:5.2f}'
                  .format(epoch+1, num_epochs, step, num_batches, loss.item(), np.exp(loss.item())))

Epoch [1/5], Step[0/1549], Loss: 9.2065, Perplexity: 9961.93
Epoch [1/5], Step[100/1549], Loss: 5.9976, Perplexity: 402.47
Epoch [1/5], Step[200/1549], Loss: 5.9204, Perplexity: 372.56
Epoch [1/5], Step[300/1549], Loss: 5.7660, Perplexity: 319.25
Epoch [1/5], Step[400/1549], Loss: 5.6974, Perplexity: 298.09
Epoch [1/5], Step[500/1549], Loss: 5.0929, Perplexity: 162.87
Epoch [1/5], Step[600/1549], Loss: 5.1624, Perplexity: 174.58
Epoch [1/5], Step[700/1549], Loss: 5.3218, Perplexity: 204.75
Epoch [1/5], Step[800/1549], Loss: 5.1488, Perplexity: 172.23
Epoch [1/5], Step[900/1549], Loss: 5.0973, Perplexity: 163.58
Epoch [1/5], Step[1000/1549], Loss: 5.1383, Perplexity: 170.43
Epoch [1/5], Step[1100/1549], Loss: 5.3104, Perplexity: 202.44
Epoch [1/5], Step[1200/1549], Loss: 5.1157, Perplexity: 166.61
Epoch [1/5], Step[1300/1549], Loss: 5.1095, Perplexity: 165.58
Epoch [1/5], Step[1400/1549], Loss: 4.7800, Perplexity: 119.10
Epoch [1/5], Step[1500/1549], Loss: 5.1351, Perplexity: 169.88
Epo
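
The training loop builds each mini-batch so that the targets are the inputs shifted one token ahead. A minimal sketch of that slicing on a single toy row follows; `windows` is a hypothetical helper and the row of integers stands in for real word ids.

```python
def windows(row, seq_length):
    """Yield (inputs, targets) pairs; targets are the inputs shifted by one token."""
    for i in range(0, len(row) - seq_length, seq_length):
        inputs = row[i:i + seq_length]
        targets = row[i + 1:i + 1 + seq_length]
        yield inputs, targets

row = list(range(10))                 # stand-in for one row of `ids`
pairs = list(windows(row, seq_length=3))
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

Stopping the range at `len(row) - seq_length` guarantees that every input window has a full-length target window, since the target needs one extra token beyond the input.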

In [10]:
# Test the model
with torch.no_grad():
    with open('sample.txt', 'w') as f:
        # Set initial hidden and cell states
        state = (torch.zeros(num_layers, 1, hidden_size).to(device),
                 torch.zeros(num_layers, 1, hidden_size).to(device))

        # Select one word id randomly
        prob = torch.ones(vocab_size)
        input = torch.multinomial(prob, num_samples=1).unsqueeze(1).to(device)

        for i in range(num_samples):
            # Forward propagate RNN 
            output, state = model(input, state)

            # Sample a word id
            prob = output.exp()
            word_id = torch.multinomial(prob, num_samples=1).item()

            # Fill input with sampled word id for the next time step
            input.fill_(word_id)

            # File write
            word = corpus.dictionary.idx2word[word_id]
            word = '\n' if word == '<eos>' else word + ' '
            f.write(word)

            if (i+1) % 100 == 0:
                print('Sampled [{}/{}] words and save to {}'.format(i+1, num_samples, 'sample.txt'))

Sampled [100/1000] words and save to sample.txt
Sampled [200/1000] words and save to sample.txt
Sampled [300/1000] words and save to sample.txt
Sampled [400/1000] words and save to sample.txt
Sampled [500/1000] words and save to sample.txt
Sampled [600/1000] words and save to sample.txt
Sampled [700/1000] words and save to sample.txt
Sampled [800/1000] words and save to sample.txt
Sampled [900/1000] words and save to sample.txt
Sampled [1000/1000] words and save to sample.txt
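
The sampling step takes `output.exp()` and draws from it with `torch.multinomial`, which treats the values as unnormalized weights, so this amounts to sampling from the softmax of the logits. The sketch below reproduces the idea with the standard library only; `sample_from_logits`, the toy logits, and the fixed seed are assumptions made for this example.

```python
import math
import random

def sample_from_logits(logits, rng):
    """Draw an index with probability proportional to exp(logit) (softmax sampling)."""
    shift = max(logits)                            # subtracting the max leaves the distribution unchanged
    weights = [math.exp(x - shift) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(0)                             # fixed seed so the run is repeatable
logits = [2.0, 0.0, -1.0]                          # toy scores for a 3-word vocabulary
counts = [0, 0, 0]
for _ in range(10000):
    counts[sample_from_logits(logits, rng)] += 1
print(counts)
```

Words with higher logits are drawn more often, but lower-scoring words still appear occasionally, which is what keeps the generated text from repeating the single most likely word.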


In [11]:
i

999