## Introduction
This tutorial introduces the PyTorch framework for analyzing free-text documents. You will learn how to model the language of the Federalist Papers with a recurrent neural network (RNN), which is a more effective approach than the simple n-gram model.

The objective and dataset are the same as in the Natural Language Processing part of Homework 3 in CMU 15688 (Spring 2017).
You can find the details by downloading the handout from https://autolab.andrew.cmu.edu/courses/15388-s18/assessments/homework3.

<b>Please go to https://drive.google.com/open?id=1EQWp3hAQ9R1y4SfkExvt2gqw1S3jl3ER, log in with your andrew.cmu.edu Google account, download "data.npy" and "new.pkl", and put them in the same directory as tutorial.ipynb.</b>

### Tutorial content
In this tutorial, we will show how to train a model and generate text in a specific style in Python, using [PyTorch](http://pytorch.org/).

You'll use a copy of the Federalist Papers downloaded from Project Gutenberg, available here: http://www.gutenberg.org/ebooks/18

The dataset has been preprocessed into an npy file so that you can load it easily.

We will cover the following topics in this tutorial:
- [Installing the libraries](#Installing-the-libraries)
- [Loading data](#Loading-data)
- [N-gram recap](#N-gram-recap)
- [From machine learning to deep learning](#From-machine-learning-to-deep-learning)
- [From RNN to LSTM](#From-RNN-to-LSTM)
- [RNNs using PyTorch](#RNNs-using-PyTorch)
- [Summary and references](#Summary-and-references)

## Installing the libraries

Before getting started, you'll need to install the libraries that we will use. We assume you have conda and will use it for installation. You can install PyTorch by following the instructions at http://pytorch.org/.
If you are using Linux without CUDA* or macOS, simply use:

    $ conda install pytorch-cpu torchvision -c pytorch

If you are using another OS or you have an NVIDIA GPU, follow the instructions at http://pytorch.org/ to install. At the time this tutorial was written, PyTorch did not officially support Windows, so we recommend using Linux or macOS.

*: CUDA is a parallel computing platform and application programming interface created by NVIDIA that can accelerate floating-point calculations. However, it requires an NVIDIA GPU.

## Loading data

First, you will use numpy to load the preprocessed Federalist Papers dataset. "papers" is a list containing 86 articles as strings.

In [1]:
import numpy as np
papers = np.load("data.npy")

## N-gram recap

In natural language processing, n-gram models can be used for text generation, as we learned in class. Due to space limitations, this tutorial will not go into the details. We will use the LanguageModel class from the Natural Language Processing part of Homework 3. Please review n-grams and the homework before you start.

In [2]:
from nlp import LanguageModel

After loading LanguageModel from nlp.py, let us look at its functions. They are slightly different from those in Homework 3.

first(): generates a text of n-1 words. This text appears somewhere in the training text.

sample(length, first): takes the desired text length and the starting text, then returns a randomly generated sample of text.

In [3]:
l_hamilton = LanguageModel(papers,10)
first = l_hamilton.first()
n_gram = l_hamilton.sample(200,first)
print(n_gram)

has been performed by some individual citizen of preeminent wisdom and approved integrity . minos , we learn , was the primitive founder of the government of crete , as zaleucus was of that of the locrians . theseus first , and after him draco and solon , instituted the government of athens . lycurgus was the lawgiver of sparta . the foundation of the original government of rome was laid by romulus , and the work completed by two of his elective successors , numa and tullius hostilius . on the abolition of royalty the consular administration was substituted by brutus , who stepped forward with a project for such a reform , which , he alleged , had been prepared by tullius hostilius , and to which his address obtained the assent and ratification of the senate and people . this remark is applicable to confederate governments also . amphictyon , we are told , was the author of that which bore his name . the achaean league received its first birth from achaeus , and its second from aratus 

However, the n-gram model has some limitations. First, memory usage and running time grow rapidly as n increases. Second, an n-gram model can only remember exactly n words, so important context outside that window cannot be taken into account. Third, if you want to change n, you must train the model again.

## From machine learning to deep learning



How do we create text in the same writing style? Whether we use an n-gram model or an RNN, the essence is to predict the next word by observing the context, add the newly generated word to the context, and loop until the text reaches the expected length. In this process, the function that predicts the next word is the key to the problem. You may recall the definition of machine learning from the 15688 class: "learning" means fitting a hypothesis function to samples and labels, and then predicting the labels of new samples. Machine learning is very good at such "learn and predict" tasks.

Deep learning is a subclass of machine learning that works particularly well on images, text, sound, and so on. We use different deep learning models, i.e. neural networks, to solve different problems. Here we use an RNN, a special kind of neural network. To understand RNNs, you need some preliminary knowledge of deep learning, so let's first introduce the neural network.
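
The predict, append, and repeat loop described above can be sketched in a few lines. This is only an illustration, not the model used in this tutorial: `NEXT_WORD` is a hypothetical hand-written lookup table standing in for a learned "predict the next word" function, and `predict_next` and `generate_text` are illustrative names.

```python
# Hypothetical stand-in for a trained model: maps the last word to the next word.
NEXT_WORD = {
    "the": "federal",
    "federal": "government",
    "government": "of",
    "of": "the",
}

def predict_next(context):
    """Predict the next word from the context (this toy only looks at the last word)."""
    return NEXT_WORD[context[-1]]

def generate_text(first_words, length):
    """Repeatedly predict a word and append it to the context until it has `length` words."""
    context = list(first_words)
    while len(context) < length:
        context.append(predict_next(context))
    return context

generated = generate_text(["the"], 6)
print(" ".join(generated))
```

An n-gram model and an RNN differ only in how `predict_next` is implemented; the surrounding loop stays the same.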



<b>Neural network</b>: you may have heard the term in many places without being sure of its exact definition. First of all, what is a neuron? The [Perceptron](https://en.wikipedia.org/wiki/Perceptron) is a typical kind of neuron, so let's take it as an example.

![Perceptron](img/Perceptron-diagram.jpg)
<center>Figure 1. Perceptron source: http://harveycohen.net/image/perceptron.html</center>

A neuron takes the outputs of its predecessor neurons as inputs and performs a calculation on them (a perceptron, for example, computes a weighted sum). An activation function then decides whether the neuron activates (you can imagine a light turning on) by comparing the result with a threshold.
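
As a minimal sketch of the neuron described above, the function below computes a weighted sum and compares it with a threshold. The weights and bias are chosen by hand for illustration (they are assumptions, not learned values) so that the neuron behaves like a logical AND.

```python
def perceptron(inputs, weights, bias, threshold=0.0):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if weighted_sum > threshold else 0

# Illustrative hand-picked parameters: the neuron fires only when both inputs are 1.
and_weights, and_bias = [1.0, 1.0], -1.5
and_outputs = [perceptron([a, b], and_weights, and_bias) for a in (0, 1) for b in (0, 1)]
print(and_outputs)
```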

A single-layer perceptron is one of the simplest neural networks, yet it can learn to solve any linearly separable classification problem. However, its learning algorithm does not terminate if the training set is not [linearly separable](https://en.wikipedia.org/wiki/Linear_separability), because no single-layer perceptron can classify such a set correctly.

A neural network consists of many connected neurons, as shown below.

<img src="img/neural-network.png" width="40%">
<center>Figure 2. neural network source: http://neuralnetworksanddeeplearning.com/chap1.html</center>

The details of neural networks are extensive, so we recommend reading https://en.wikipedia.org/wiki/Artificial_neural_network

Here we only discuss the components that help you understand this tutorial.

In theory, multi-layer neural networks can approximate arbitrary functions. Once the hyperparameters (e.g. the number of layers and the number of neurons) have been chosen by a human, we need to find a set of parameters (e.g. the weights of every layer) that makes the network fit the training set. This is an [optimization problem](https://en.wikipedia.org/wiki/Optimization_problem). [Gradient descent](https://en.wikipedia.org/wiki/Gradient_descent) or one of its variants is usually used to solve it in deep learning.

Since you can think of a neural network as a kind of function, let's first take a simple function such as $y = ax^2 + bx + c\ (a>0)$ as an example. We want to find the minimum of this function.

<img src="img/gd.png" width="30%">
<center>Figure 3. Gradient descent source: https://www.quora.com/What-is-Stochastic-Gradient-Descent</center>

In every iteration we calculate the current gradient and then take a step in the opposite direction. If everything goes well (e.g. the step length is appropriate and the function is convex), we will converge to the minimum.
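
The procedure can be sketched for the quadratic above. The coefficients, starting point, learning rate, and number of steps are arbitrary illustrative choices; with $a = 1$ and $b = -4$ the true minimum is at $x = -b / (2a) = 2$.

```python
def gradient_descent(a, b, x0, learning_rate, steps):
    """Minimize y = a*x**2 + b*x + c by repeatedly stepping against the gradient."""
    x = x0
    for _ in range(steps):
        gradient = 2 * a * x + b          # dy/dx at the current x (c does not affect it)
        x = x - learning_rate * gradient  # move in the opposite direction of the gradient
    return x

a, b, c = 1.0, -4.0, 5.0
x_min = gradient_descent(a, b, x0=0.0, learning_rate=0.1, steps=100)
print(x_min, a * x_min ** 2 + b * x_min + c)
```

A learning rate that is too large (for this function, anything above 1.0) makes the iterates diverge instead of converge, which is why the step length matters.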

But the gradient of a neural network cannot be calculated directly. Instead, we use the [chain rule](https://en.wikipedia.org/wiki/Chain_rule) to calculate the partial derivative with respect to every parameter, layer by layer. This method is called [Backpropagation](https://en.wikipedia.org/wiki/Backpropagation). 
<img src="img/giphy.gif" width="40%">
<center>Figure 4. feed forward back propagation source: https://giphy.com/explore/neural-networks</center>
This is critical for understanding neural networks, so we recommend reading through the algorithm to build intuition.
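
To build that intuition, here is a small sketch of the chain rule on a toy network with one hidden unit: $h = \tanh(w_1 x)$, $y = w_2 h$, $loss = (y - target)^2$. The network, the parameter values, and the function names are all assumptions made for illustration. The analytic gradients are checked against a finite-difference estimate.

```python
import math

def forward(w1, w2, x, target):
    """Forward pass of the toy network; returns hidden value, output and loss."""
    h = math.tanh(w1 * x)
    y = w2 * h
    loss = (y - target) ** 2
    return h, y, loss

def backward(w1, w2, x, target):
    """Backward pass: apply the chain rule from the loss back to each weight."""
    h, y, _ = forward(w1, w2, x, target)
    dloss_dy = 2.0 * (y - target)
    dloss_dw2 = dloss_dy * h                   # y = w2 * h
    dloss_dh = dloss_dy * w2
    dloss_dw1 = dloss_dh * (1.0 - h ** 2) * x  # d tanh(z)/dz = 1 - tanh(z)**2
    return dloss_dw1, dloss_dw2

def numeric_gradient(w1, w2, x, target, eps=1e-6):
    """Central finite differences, used only to check the analytic gradients."""
    g1 = (forward(w1 + eps, w2, x, target)[2] - forward(w1 - eps, w2, x, target)[2]) / (2 * eps)
    g2 = (forward(w1, w2 + eps, x, target)[2] - forward(w1, w2 - eps, x, target)[2]) / (2 * eps)
    return g1, g2

analytic = backward(0.5, -1.2, 2.0, 1.0)
numeric = numeric_gradient(0.5, -1.2, 2.0, 1.0)
print(analytic, numeric)
```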


## From RNN to LSTM
Traditional neural networks are not good at remembering context. They cannot learn the order and timing relationships between samples, so they are considered unsuitable for natural language processing problems.

The RNN [(Recurrent neural network)](https://en.wikipedia.org/wiki/Recurrent_neural_network) addresses this issue. Loops in an RNN allow information to persist.

<img src="img/RNN-unrolled.png" width="50%">
<center>Figure 5. RNN source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/</center>

A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor[[1]](http://colah.github.io/posts/2015-08-Understanding-LSTMs/). Thus the "memory" of previously processed samples can be used when processing the current sample.
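
The "copies of the same network" idea can be sketched with a scalar recurrent unit, $h_t = \tanh(w_x x_t + w_h h_{t-1} + b)$. The weights below are arbitrary illustrative values, and a real RNN uses vectors and matrices instead of single numbers.

```python
import math

def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Run a scalar RNN over a sequence, reusing the same weights at every step."""
    h = h0
    states = []
    for x in inputs:
        # The previous hidden state h carries the "memory" into the current step.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

states_a = rnn_forward([1.0, 0.0], w_x=0.5, w_h=0.8, b=0.0)
states_b = rnn_forward([0.0, 1.0], w_x=0.5, w_h=0.8, b=0.0)
print(states_a[-1], states_b[-1])
```

The two sequences contain the same values in a different order, yet they end in different hidden states, which shows that the unit is sensitive to order.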

A naive recurrent neural network has difficulty capturing long-term correlations because it suffers from the exploding gradient and vanishing gradient problems. Both of them lead to poor learning.


Using LSTM[(Long Short-Term Memory)](https://en.wikipedia.org/wiki/Long_short-term_memory) can solve this problem well.

<img src="img/LSTM3-chain.png" width="50%">
<center>Figure 6. LSTM source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/</center>


Every LSTM cell keeps and passes on not only the hidden output but also a cell state, which works somewhat like a conveyor belt. This makes it easy for long-term information to flow along the cell state unchanged. The LSTM can also remove information from or add information to the cell state, carefully regulated by structures called gates[[1]](http://colah.github.io/posts/2015-08-Understanding-LSTMs/). This improvement mitigates the exploding and vanishing gradient problems of RNNs. Please check http://colah.github.io/posts/2015-08-Understanding-LSTMs/ for details.
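
The gates and the cell state can be sketched with a scalar LSTM cell that follows the standard LSTM update equations. The parameter values are assumptions picked by hand to make the effect visible: a forget gate close to 1 and an input gate close to 0 let the cell state travel almost unchanged over many steps.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell. params[gate] = (w_x, w_h, bias)."""
    def pre_activation(gate):
        w_x, w_h, bias = params[gate]
        return w_x * x + w_h * h_prev + bias

    f = sigmoid(pre_activation("forget"))       # how much old cell state to keep
    i = sigmoid(pre_activation("input"))        # how much new information to add
    g = math.tanh(pre_activation("candidate"))  # the new information itself
    o = sigmoid(pre_activation("output"))       # how much cell state to expose
    c = f * c_prev + i * g                      # the "conveyor belt" update
    h = o * math.tanh(c)
    return h, c

# Illustrative parameters: forget gate saturated near 1, input gate near 0.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "candidate": (0.0, 0.0, 0.0),
    "output": (0.0, 0.0, 0.0),
}

h, c = 0.0, 1.0
for x in [0.5] * 50:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```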

## RNNs using PyTorch

First, we load the numpy and torch packages.

In [4]:
import numpy as np
import torch

Then, we load the preprocessed npy file as the dataset and save it as "papers".
We split each paper into a word list, "paper_list". "words" is the sorted list of unique words that appear in the dataset. "word_dict" gives each of these words a numeric label, which makes RNN training easier later, since neural networks cannot process strings directly.

In [5]:
papers = np.load("data.npy")

def word_map(papers):
    paper_list = [paper.split() for paper in papers]
    corpus = sum(paper_list, [])
    print(len(corpus))
    words = list(set(corpus))
    words.sort()
    word_dict = {word: i for i, word in enumerate(words)}
    return paper_list, words, word_dict

paper_list, words, word_dict = word_map(papers)

211879


We print the lengths of "corpus", "paper_list", "words", and "word_dict". As you can see, there are 211879 words, 86 articles, and 8686 unique words.

In [6]:
print(len(paper_list), len(words), len(word_dict))

86 8686 8686


"word_to_number" changes word list into number list. "number_to_word" changes number list into word list.

In [7]:
def word_to_number(word_list, word_dict):
    return [word_dict[word] for word in word_list]

def number_to_word(number_list, words):
    return [words[number] for number in number_list]

In the cell below, we change

"after an unequivocal experience of the inefficacy of the subsisting federal government , you are called upon to deliberate on"

into

[312, 456, 8148, 3184, 5442, 7817, 4232, 5442, 7817, 7552, 3350, 3699, 5, 8678, 607, 1107, 8274, 7900, 2149, 5476]

and recover it for validation.

In [8]:
#take the first 20 words of the first article as example
test = word_to_number(paper_list[0][:20], word_dict)
print(test)
test = number_to_word(test, words)
print(" ".join(test))

[312, 456, 8148, 3184, 5442, 7817, 4232, 5442, 7817, 7552, 3350, 3699, 5, 8678, 607, 1107, 8274, 7900, 2149, 5476]
after an unequivocal experience of the inefficacy of the subsisting federal government , you are called upon to deliberate on


In the cell below, we do some further preprocessing to build the data set and data loader. You need to understand the following definitions.

<b>Batch</b>: In deep learning, we usually chunk the raw dataset into many matrices of the same size to speed up parallel floating-point calculation. Each such matrix is called a batch. Here, the batch size is 18 and the batch length is 149, so there are 79 batches (211878 = 18 $*$ 149 $*$ 79). In general you can choose any appropriate values.

<b>Input and output</b>: The input and output are basically the same article fragment. The only difference is that the output is one word ahead of the input. For instance, if the input is [312, 456, 8148, 3184, 5442], then the output is [456, 8148, 3184, 5442, 7817].

<b>data_set and data_loader</b>: The torch.utils.data.Dataset and torch.utils.data.DataLoader classes encapsulate the batches and feed them to the neural network. They offer many convenient functions you can customize, and you can even override their methods. Here we just use the simplest data_set and data_loader.
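
The batch arithmetic and the one-word shift between input and output can be checked with plain Python lists. `make_pairs` is a hypothetical helper written for this illustration (it is not part of the tutorial code), and the row length of 3 is only used to keep the example small.

```python
def make_pairs(tokens, batch_len):
    """Split a token list into input rows and target rows shifted by one word."""
    inputs_flat = tokens[:-1]   # inputs drop the last word
    targets_flat = tokens[1:]   # targets drop the first word
    rows = len(inputs_flat) // batch_len
    inputs = [inputs_flat[r * batch_len:(r + 1) * batch_len] for r in range(rows)]
    targets = [targets_flat[r * batch_len:(r + 1) * batch_len] for r in range(rows)]
    return inputs, targets

tokens = [312, 456, 8148, 3184, 5442, 7817, 4232]
inputs, targets = make_pairs(tokens, batch_len=3)
print(inputs)
print(targets)

# The same arithmetic with the sizes used in this tutorial.
rows = (211879 - 1) // 149   # number of rows of length 149
num_batches = rows // 18     # number of batches of 18 rows each
print(rows, num_batches)
```

Each target row is the matching input row moved one word forward, so at every position the network is asked to predict the word that follows.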

In [9]:
#####preprocessing#####
from torch.utils.data.dataset import TensorDataset
from torch.utils.data.dataloader import DataLoader

BATCH_LEN = 149
BATCH_SIZE = 18

def make_batch(corpus, batch_len):
    #drop the trailing words that do not fill a complete row
    batches = len(corpus) // batch_len
    array = corpus[:batches * batch_len]
    return np.array(array).reshape(batches, batch_len)
def preprocessing(paper_list):
    np.random.shuffle(paper_list)
    #shuffle the paper list every time to guarantee randomness
    corpus = np.concatenate(paper_list)
    #concatenate them together into a long list
    corpus = word_to_number(corpus, word_dict)
    inputs = np.array(make_batch(corpus[:-1], BATCH_LEN))
    #inputs will discard the last word
    targets = np.array(make_batch(corpus[1:], BATCH_LEN))
    #targets will discard the first word
    #inputs, targets shape: 1422 * 149
    data_set = TensorDataset(torch.from_numpy(inputs), torch.from_numpy(targets))
    data_loader = torch.utils.data.DataLoader(data_set, batch_size=BATCH_SIZE)
    return data_loader
#each batch is 18 * 149; there are 79 batches

Here we use a 3-layer LSTM RNN. Before the LSTM layers, we use an embedding to obtain dense vector representations of the words, which improves learning. After the LSTM layers, a linear layer produces the output. This model and its hyperparameters are adapted from https://arxiv.org/pdf/1708.02182.pdf .

Finding a good model and hyperparameters is not intuitive, so for now just remember that we use a 3-layer LSTM RNN.

In [10]:
#####model######
from torch import nn

EMBEDDING_DIM = 400
HIDDEN_DIM = 1150

class TextModel(nn.Module):
    def __init__(self, wordcount, embedding_dim, hidden_dim):
        super(TextModel, self).__init__()
        #create each layer in sequence
        self.wordcount = wordcount
        self.embedding = nn.Embedding(num_embeddings=wordcount, embedding_dim=embedding_dim)
        self.rnns = nn.ModuleList([
            nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim, batch_first=True),
            nn.LSTM(input_size=hidden_dim, hidden_size=hidden_dim, batch_first=True),
            nn.LSTM(input_size=hidden_dim, hidden_size=embedding_dim, batch_first=True)])
        self.projection = nn.Linear(in_features=embedding_dim, out_features=wordcount)

    #forward takes the training text as input and passes the hidden results through the layers in sequence.
    #if the argument forward is greater than 0, that many additional words are generated greedily.
    def forward(self, input, forward=0):
        h = input.long()
        h = self.embedding(h)
        states = []
        for rnn in self.rnns:
            h, state = rnn(h)
            states.append(state)
        h = self.projection(h)
        logits = h
        if forward > 0:
            outputs = []
            h = torch.max(logits[:, -1:, :], dim=2)[1]
            for i in range(forward):
                h = self.embedding(h)
                for j, rnn in enumerate(self.rnns):
                    h, state = rnn(h, states[j])
                    states[j] = state
                h = self.projection(h)
                outputs.append(h)
                h = torch.max(h, dim=2)[1]
            logits = torch.cat([logits] + outputs, dim=1)
        return logits

A [Variable](http://pytorch.org/tutorials/beginner/former_torchies/autograd_tutorial.html) is a wrapper around a Tensor (the basic arithmetic type in PyTorch) that supports backpropagation. Remember that we should always check whether we can use CUDA to accelerate calculations.

In [11]:
#####Variable######
from torch.autograd import Variable
def to_variable(tensor):
    # Tensor -> Variable (on GPU if possible)
    if torch.cuda.is_available():
        # Tensor -> GPU Tensor
        tensor = tensor.cuda()
    return torch.autograd.Variable(tensor)

A loss function tells us how good our prediction is by comparing it with the expected output. Here we use CrossEntropyLoss3D. Because

<center>$Perplexity(text)=2^{Entropy(text)}$</center>

(with the entropy measured in bits; nn.CrossEntropyLoss uses the natural logarithm, so here the perplexity is $e^{loss}$), the perplexity decreases as we minimize the loss of CrossEntropyLoss3D.

In [12]:
#####loss function#####
class CrossEntropyLoss3D(nn.CrossEntropyLoss):
    def forward(self, input, target):
        return super(CrossEntropyLoss3D, self).forward(input.view(-1, input.size()[2]), target.long().view(-1))
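
To make the link between cross-entropy and perplexity concrete, here is a pure-Python sketch, not the PyTorch implementation. It assumes the loss is the mean negative natural-log probability of the target word. The helper names are illustrative, and the uniform logits model a network that has not learned anything yet.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for numerical stability)."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits_per_word, targets):
    """Mean negative log-probability (natural log) assigned to the target words."""
    losses = [-math.log(softmax(logits)[t]) for logits, t in zip(logits_per_word, targets)]
    return sum(losses) / len(losses)

VOCAB_SIZE = 8686  # number of unique words in this tutorial's dataset

# An untrained model that scores every word equally.
uniform_logits = [[0.0] * VOCAB_SIZE]
loss = cross_entropy(uniform_logits, [0])
perplexity = math.exp(loss)
print(loss, perplexity)
```

With uniform predictions the loss equals $\ln(8686) \approx 9.07$ and the perplexity equals the vocabulary size, so a loss near 9 at the very start of training is what one would expect from a model that has not learned anything yet.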

During each training iteration,

step 1: feed the current batch into the RNN to generate a prediction

step 2: use the loss function to calculate the loss

step 3: use loss.backward to backpropagate

step 4: use optimizer.step to update the weights.

Since training is usually time-consuming, here we only train for 10 iterations (if it is still too slow, just use ctrl-c to stop it). As you can see, the loss keeps going down, from 9 to 6.

In [13]:
#####train#####
LEARNING_RATE = 0.001
EPOCHS = 1
def train():
    model = TextModel(len(words),EMBEDDING_DIM,HIDDEN_DIM)
    loss_fn = CrossEntropyLoss3D()
    optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
    #Adam is a variant of gradient descent. Common choices are SGD, Adam, and RMSprop.
    
    if torch.cuda.is_available():
        model = model.cuda()
        loss_fn = loss_fn.cuda()

    for epoch in range(EPOCHS):
        data_loader = preprocessing(paper_list)
        for iteration, (input, label) in enumerate(data_loader):
            optimizer.zero_grad()
            prediction = model(to_variable(input), forward=0)
            loss = loss_fn(prediction, to_variable(label))
            print("#"*50)
            print("epoch{}iteration{}".format(epoch,iteration))
            print(loss)
            print("#"*50)
            if iteration == 10:
                break
            loss.backward()
            optimizer.step()
            
    print("training finished")
    return model
train()

##################################################
epoch0iteration0
Variable containing:
 9.0664
[torch.FloatTensor of size 1]

##################################################
##################################################
epoch0iteration1
Variable containing:
 9.0087
[torch.FloatTensor of size 1]

##################################################
##################################################
epoch0iteration2
Variable containing:
 8.6229
[torch.FloatTensor of size 1]

##################################################
##################################################
epoch0iteration3
Variable containing:
 7.9662
[torch.FloatTensor of size 1]

##################################################
##################################################
epoch0iteration4
Variable containing:
 7.4917
[torch.FloatTensor of size 1]

##################################################
##################################################
epoch0iteration5
Variable containing:
 7.0940
[torch.F

TextModel(
  (embedding): Embedding(8686, 400)
  (rnns): ModuleList(
    (0): LSTM(400, 1150, batch_first=True)
    (1): LSTM(1150, 1150, batch_first=True)
    (2): LSTM(1150, 400, batch_first=True)
  )
  (projection): Linear(in_features=400, out_features=8686)
)

The functions below are used for generating text. We trained the model above for 20 epochs on a computer with a GTX 1080, which took 30 minutes. It should perform well, since the validation loss is less than 2. You will load the trained model from "new.pkl".

In [28]:
def generate(model, sequence_length, inp=None):
    model.eval()
    logits = model(inp, forward=sequence_length)
    classes = torch.max(logits, dim=2)[1]
    return classes[:, -sequence_length:]

def generation(inp, forward):
    filepath = "new.pkl"
    model = TextModel(len(words),EMBEDDING_DIM,HIDDEN_DIM)
    #create an empty model, then load the trained weights from new.pkl
    if torch.cuda.is_available():
        model.load_state_dict(torch.load(filepath))
        model = model.cuda()
    else:
        #map tensors saved on a GPU onto the CPU when no GPU is available
        model.load_state_dict(torch.load(filepath, map_location=lambda storage, loc: storage))
        model = model.cpu()
    #to_variable moves the input to the GPU when the model is on the GPU
    inp = to_variable(torch.from_numpy(inp))
    generated = generate(model, forward, inp).data.cpu().numpy()
    return generated

def to_text(preds, words):
    #change number list into article
    return [" ".join(words[c] for c in line) for line in preds]

Here we generate text using both the n-gram and the RNN model.

In [29]:
print("n-gram output: ")
print(n_gram)
print(len(n_gram.split()))
print("#"*50)

rnn_first = [first.split()]
rnn_first = [word_to_number(i, word_dict) for i in rnn_first]
rnn = generation(np.array(rnn_first),200)
rnn =  to_text(rnn, words)[0]
print("rnn output: ")
print(rnn)
print(len(rnn.split()))

n-gram output: 
has been performed by some individual citizen of preeminent wisdom and approved integrity . minos , we learn , was the primitive founder of the government of crete , as zaleucus was of that of the locrians . theseus first , and after him draco and solon , instituted the government of athens . lycurgus was the lawgiver of sparta . the foundation of the original government of rome was laid by romulus , and the work completed by two of his elective successors , numa and tullius hostilius . on the abolition of royalty the consular administration was substituted by brutus , who stepped forward with a project for such a reform , which , he alleged , had been prepared by tullius hostilius , and to which his address obtained the assent and ratification of the senate and people . this remark is applicable to confederate governments also . amphictyon , we are told , was the author of that which bore his name . the achaean league received its first birth from achaeus , and its sec

## Summary and references

This tutorial showed another approach to language modeling, using PyTorch. Much more detail is available from the following links.

1. rnn and lstm: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
2. model: https://arxiv.org/pdf/1708.02182.pdf
3. neural network: https://en.wikipedia.org/wiki/Artificial_neural_network
4. back propagation: https://en.wikipedia.org/wiki/Backpropagation