# Lab 6: Sequence-to-sequence models

## Description:
For this lab, you will code up the [char-rnn model of Karpathy](http://karpathy.github.io/2015/05/21/rnn-effectiveness/). This is a recurrent neural network that is trained probabilistically on sequences of characters, and that can then be used to sample new sequences that are like the original.

This lab will help you develop several new skills, as well as understand some best practices needed for building large models. In addition, we'll be able to create networks that generate neat text!

## There are two parts of this lab:
###  1.   Wiring up a basic sequence-to-sequence computation graph
###  2.   Implementing your own GRU cell.


Examples of my final samples, taken after 150 passes through the data,
are shown below (more detail in the final section of this writeup).
Please generate about 15 samples for each dataset.

<code>
And ifte thin forgision forward thene over up to a fear not your
And freitions, which is great God. Behold these are the loss sub
And ache with the Lord hath bloes, which was done to the holy Gr
And appeicis arm vinimonahites strong in name, to doth piseling 
And miniquithers these words, he commanded order not; neither sa
And min for many would happine even to the earth, to said unto m
And mie first be traditions? Behold, you, because it was a sound
And from tike ended the Lamanites had administered, and I say bi
</code>


---

## Part 0: Readings, data loading, and high level training

---

There is a tutorial here that will help you build out scaffolding code and understand how to use sequences in PyTorch.

* Read the following:
  * [PyTorch sequence-to-sequence tutorial](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html)
  * [Understanding LSTM Networks](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)






In [5]:
! wget -O ./text_files.tar.gz 'https://piazza.com/redirect/s3?bucket=uploads&prefix=attach%2Fjlifkda6h0x5bk%2Fhzosotq4zil49m%2Fjn13x09arfeb%2Ftext_files.tar.gz' 
! tar -xzf text_files.tar.gz
! pip install unidecode
! pip install torch

import unidecode
import string
import random
import re
import torch.nn as nn
import numpy as np
import gc
import torch
from torch.autograd import Variable
 
import pdb
 
all_characters = string.printable
n_characters = len(all_characters)
# file = unidecode.unidecode(open('./text_files/lotr.txt').read())
file = unidecode.unidecode(open('./klyrics.txt').read())
file_len = len(file)
print('file_len =', file_len)
print(n_characters)

assert torch.cuda.is_available(), "You need to request a GPU from Runtime > Change Runtime"


In [15]:
chunk_len = 100
 
def random_chunk():
  # randint is inclusive on both ends, so stop one short to always get chunk_len + 1 characters
  start_index = random.randint(0, file_len - chunk_len - 1)
  end_index = start_index + chunk_len + 1
  return file[start_index:end_index]
  
print(random_chunk())

na salute me or seduce me
Indubitably, I'm too street--indubitably, I'ma do me
Better than your bitch


In [16]:

# Turn string into list of longs
def char_tensor(string):
  tensor = torch.zeros(len(string)).long()
  for c in range(len(string)):
      tensor[c] = all_characters.index(string[c])
  return Variable(tensor)

print(char_tensor('abcDEF'))
print(all_characters[91])

tensor([10, 11, 12, 39, 40, 41])
|
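
For intuition, here is the same character-to-index mapping without tensors, plus the inverse mapping that sampling needs later. This is only a sketch: `char_to_index`, `encode`, and `decode` are names made up for illustration and are not part of the scaffold code.

```python
import string

all_characters = string.printable
# Dictionary lookup gives the same indices as all_characters.index(c)
char_to_index = {c: i for i, c in enumerate(all_characters)}

def encode(text):
    """Map each character to its position in string.printable."""
    return [char_to_index[c] for c in text]

def decode(indices):
    """Map a list of indices back to a string."""
    return "".join(all_characters[i] for i in indices)

codes = encode("abcDEF")
print(codes)
print(decode(codes))
```

The indices match the `char_tensor('abcDEF')` output above, and `decode` is the step `evaluate` performs one character at a time.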


---

## Part 4: Creating your own GRU cell 

**(Come back to this later - it's defined here so that the GRU will be defined before it is used)**

---

The cell that you used in Part 1 was a pre-defined PyTorch layer. Now, write your own GRU class that takes the same parameters as the built-in PyTorch class.

Please try not to look at the GRU cell definition. The answer is right there in the code, and in theory, you could just cut-and-paste it. This bit is on your honor!

**TODO:**
* Create a custom GRU cell

**DONE:**



In [0]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable


class GRU(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers):
        super(GRU, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # Define weight layers
        self.W_xz = nn.Linear(self.input_size, self.hidden_size)
        self.W_hz = nn.Linear(self.hidden_size, self.hidden_size)
        self.W_xr = nn.Linear(self.input_size, self.hidden_size)
        self.W_hr = nn.Linear(self.hidden_size, self.hidden_size)
        # Initialize reset gate's (forget gate's) biases to 1, per the recommendation from lecture
        nn.init.constant_(self.W_hr.bias, 1)
        nn.init.constant_(self.W_xr.bias, 1)
        self.W_xn = nn.Linear(self.input_size, self.hidden_size)
        self.W_hn = nn.Linear(self.hidden_size, self.hidden_size)

    # inputs will be an embedding of size [1, 1, hidden_size]
    # Note: num_layers is stored, but only a single layer is computed here
    def forward(self, inputs, hidden):
        # The layer does the following:
        # z_t = sigmoid(W_xz*x_t + b_xz + W_hz*h_(t-1) + b_hz)
        # r_t = sigmoid(W_xr*x_t + b_xr + W_hr*h_(t-1) + b_hr)
        # n_t = tanh(W_xn*x_t + b_xn + r_t**(W_hn*h_(t-1) + b_hn))
        # h_t = (1 - z_t)**n_t + z_t**h_(t-1)
        # Where ** is hadamard product (not matrix multiplication, but elementwise multiplication)
        # 1. Update gate 
        z_t = torch.sigmoid(self.W_xz(inputs) + self.W_hz(hidden))
        # 2. Reset Gate
        r_t = torch.sigmoid(self.W_xr(inputs) + self.W_hr(hidden))
        # 3. Current memory
        n_t = torch.tanh(self.W_xn(inputs) + r_t*self.W_hn(hidden))
        # 4. Final memory: the scalar 1 broadcasts, so no constant tensor is needed
        hidden = (1 - z_t)*n_t + z_t*hidden

        return hidden, hidden
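
To sanity-check a custom cell, it can help to work through one step of the gate equations by hand. The sketch below is a scalar version (hidden size 1) in plain Python, following the equations listed in the `forward` comments. It is not the lab's implementation: `gru_step` and the entries of `params` are hypothetical names, and the weight values are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, p):
    """One scalar GRU step; `p` holds hypothetical scalar weights and biases."""
    # Update gate: how much of the old hidden state to keep
    z = sigmoid(p["w_xz"] * x + p["w_hz"] * h_prev + p["b_z"])
    # Reset gate: how much of the old hidden state feeds the candidate
    r = sigmoid(p["w_xr"] * x + p["w_hr"] * h_prev + p["b_r"])
    # Candidate state: the reset gate scales the hidden-side term only
    n = math.tanh(p["w_xn"] * x + p["b_xn"] + r * (p["w_hn"] * h_prev + p["b_hn"]))
    # New hidden state: convex combination of candidate and old state
    h = (1.0 - z) * n + z * h_prev
    return h, z, n

params = {
    "w_xz": 0.5, "w_hz": -0.3, "b_z": 0.0,
    "w_xr": 0.8, "w_hr": 0.1, "b_r": 1.0,
    "w_xn": 1.2, "b_xn": 0.0, "w_hn": 0.7, "b_hn": 0.0,
}

h_prev = 0.25
h, z, n = gru_step(x=1.0, h_prev=h_prev, p=params)
print("z =", z, "n =", n, "h =", h)
```

Because `z` lies strictly between 0 and 1, the new state always falls between the candidate `n` and the previous state. That is a handy property to assert when debugging the tensor version.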
  


---

##  Part 1: Building a sequence to sequence model

---

Great! We have the data in a usable form, and we can switch out which text file we read from and try to imitate.

We now want to build out an RNN model. In this section, we will use only built-in PyTorch pieces when building our RNN class.


**TODO:**
* Create an RNN class that extends from nn.Module.

**DONE:**



In [0]:
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, n_layers=1):
        super(RNN, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        # more stuff here...
        # The GRU's "input_size" is our hidden_size, because we've initialized the embedder to be of size "hidden_size"
        self.GRU = GRU(self.hidden_size,self.hidden_size,num_layers=self.n_layers)
        self.embedding = nn.Embedding(self.input_size, self.hidden_size)
        self.relu = nn.ReLU() # why use this?
        # This squishes the output from hidden_size to output_size giving us a vector representation of our prediction
        self.out = nn.Linear(self.hidden_size, self.output_size) 


    def forward(self, input_char, hidden):
        # by reviewing the documentation, construct a forward function that properly uses the output
        # of the GRU
        # Get char embedding
        input_embedding = self.embedding(input_char).view(1, 1, -1)
        # Pass embedding through ReLU (optional; keep it only if you see better results)
        input_embedding = self.relu(input_embedding)
        # Compute output and new hidden (h_n)
        output, h_n = self.GRU(input_embedding, hidden)
        # Currently output is a vector of size hidden_size. We need to squish it to be the vocabulary (output) size
        output = self.out(output)
        return output, h_n


    def init_hidden(self):
        return Variable(torch.zeros(self.n_layers, 1, self.hidden_size))

In [0]:
def random_training_set():    
  chunk = random_chunk()
  inp = char_tensor(chunk[:-1])
  target = char_tensor(chunk[1:])
  return inp, target
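
The input and target are the same chunk offset by one character, so position `i` of the target is the character that follows position `i` of the input. Here is a tiny illustration on a plain string; `make_pair` is a hypothetical helper used only for this example.

```python
def make_pair(chunk):
    """Split a chunk into (input, target) strings offset by one character."""
    return chunk[:-1], chunk[1:]

inp, target = make_pair("hello")
for x, y in zip(inp, target):
    # Each input character is paired with the character the model should predict next
    print(repr(x), "->", repr(y))
```

A chunk of `chunk_len + 1` characters therefore yields `chunk_len` training pairs.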

---

## Part 2: Sample text and Training information

---

We now want to be able to train our network, and sample text after training.

The function below outlines one training step for a sequence-style network.

**TODO:**
* Fill in the pieces.

**DONE:**




In [0]:
# This architecture is set up to train on random chunks of text of length chunk_len (defined above)
# Thus, you will write your training loop to iterate over each character in the chunks
def train(inp, target):
    ## initialize hidden layers, set up gradient and loss 
    # your code here
    ## /
    decoder_optimizer.zero_grad()
    hidden = decoder.init_hidden()
    loss = 0

    for char_x, char_y_truth in zip(inp, target):
        # 1. Get Prediction
        # char_y_hat holds unnormalized scores (logits) over the vocabulary;
        # nn.CrossEntropyLoss applies the log-softmax itself
        char_x = char_x.cuda(non_blocking=True)
        hidden = hidden.cuda(non_blocking=True)
        char_y_hat, hidden = decoder(char_x, hidden)

        # 2. Compute Loss
        char_y_truth = char_y_truth.cuda(non_blocking=True)
        char_y_truth = char_y_truth.unsqueeze(0)
        loss += criterion(char_y_hat.squeeze(0), char_y_truth)
    
    # 3. Compute Gradient (loss.backward)
    loss.backward()
    # 4. Update weights (step)
    decoder_optimizer.step()

    # Return a plain float (summed over the chunk) so callers do not hold on to GPU tensors
    return loss.item()
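
A note on reading the numbers: `train` returns the cross-entropy summed over every character in the chunk, so with `chunk_len = 100` the printed loss is roughly 100 times the per-character loss. A useful baseline is a model that guesses uniformly over the 100 printable characters. The sketch below computes that baseline in plain Python; `cross_entropy` here is a hand-written stand-in for illustration, not PyTorch's `nn.CrossEntropyLoss`.

```python
import math
import string

def cross_entropy(logits, target_index):
    """Negative log-probability of target_index under softmax(logits)."""
    m = max(logits)  # subtract the max before exponentiating for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target_index]

n_chars = len(string.printable)      # same vocabulary as all_characters
uniform_logits = [0.0] * n_chars     # a model with no preference at all
per_char = cross_entropy(uniform_logits, target_index=10)

print("per-character baseline:", per_char)        # equals ln(100)
print("per-chunk baseline:", per_char * 100)      # summed over a 100-character chunk
```

A summed loss well below this per-chunk baseline means the model has learned something about the character statistics.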

---

## Part 3: Sample text and Training information

---

At this point you can also, if you choose, write the boilerplate training loop that samples random sequences and trains your RNN. It is helpful to have this working before writing your own GRU class.

To sample from the network, either during or after training, consider using the following function. If your RNN model is instantiated as `decoder`, it will probabilistically sample a sequence of length `predict_len`.

**TODO:**


**DONE:**
* Fill out the evaluate function to generate text from a primed string


In [0]:
def evaluate(prime_str='A', predict_len=100, temperature=0.8):
    ## initialize hidden variable, initialize other useful variables 
        # your code here
    ## /
    with torch.no_grad():
        predict_str = prime_str
        # convert prime_str into a machine-readable tensor of integers
        prime_str = char_tensor(prime_str)
        # Initialize hidden
        hidden = decoder.init_hidden()

        # Prime the hidden state on every character of prime_str except the last,
        # throwing away the predictions and keeping only the hidden state
        for char_x in prime_str[:-1]:
            char_x = char_x.cuda(non_blocking=True)
            hidden = hidden.cuda(non_blocking=True)
            _, hidden = decoder(char_x, hidden)

        # finish generating output
        # the last char of prime_str is the first input of the generation loop,
        # so it is fed to the model exactly once
        char_x = prime_str[-1]
        for i in range(len(prime_str), predict_len):
            # 1. Get prediction for the next char
            char_x = char_x.cuda(non_blocking=True)
            hidden = hidden.cuda(non_blocking=True)
            pred, hidden = decoder(char_x, hidden)
            # 2. Convert pred to probability distribution
            # temperature represents how much to divide the logits by before computing the softmax.
            pred = pred.squeeze(0)
            pred = F.softmax(pred/temperature, dim=1)
            # 3. Sample
            pred_char = torch.multinomial(pred, 1)
            # 4. Convert int to char
            char = all_characters[pred_char.item()]
            # 5. Append char to predict_str
            predict_str += char
            # 6. Set the input of the next pass to be the output of this pass
            char_x = pred_char
        
        return predict_str
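
To see what `temperature` does in step 2, here is a small plain-Python sketch of temperature-scaled softmax followed by sampling. The helper names and the example logits are made up for illustration. The real function uses `F.softmax` and `torch.multinomial` instead.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize with a stable softmax."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]   # arbitrary example scores
rng = random.Random(0)     # seeded so runs are repeatable
for t in (0.2, 0.8, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], sample_index(probs, rng))
```

Lower temperatures concentrate probability on the highest-scoring character, which gives safer but more repetitive text. Higher temperatures flatten the distribution and produce more varied, more error-prone samples.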

  

---

## Part 4: (Create a GRU cell, requirements above)

---



---

## Part 5: Run it and generate some text!

---

Assuming everything has gone well, you should be able to run the main function in the scaffold code, using either your custom GRU cell or the built-in layer, and see sampled text as output. I trained on the “lotr.txt” dataset, using chunk_length=200 and hidden_size=100, for 2000 epochs.

**TODO:** 


**DONE:**
* Create some cool output


In [0]:
import time
n_epochs = 5000
print_every = 200
plot_every = 10
hidden_size = 200
n_layers = 1
lr = 0.001
 
gc.collect()
torch.cuda.empty_cache()

decoder = RNN(n_characters, hidden_size, n_characters, n_layers)
decoder.cuda()
decoder_optimizer = torch.optim.Adam(decoder.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()
 
start = time.time()
all_losses = []
loss_avg = 0


In [0]:
# n_epochs = 2000
for epoch in range(1, n_epochs + 1):
    loss_ = train(*random_training_set())       
    loss_avg += loss_

    if epoch % print_every == 0:
        print('[%s (%d %d%%) %.4f]' % (time.time() - start, epoch, epoch / n_epochs * 100, loss_))
        # print(".8",evaluate('Th', 100), '\n')
        print(".6",evaluate('[Verse 1: Kendrick]', 1000,.7), '\n')
        # print(".5",evaluate('Th', 100,.5), '\n')
        # print(".2",evaluate('Th', 100,.2), '\n')

    if epoch % plot_every == 0:
        all_losses.append(loss_avg / plot_every)
        loss_avg = 0

[57.5126678943634 (200 4%) 280.9722]
.6 [Verse 1: Kendrick]
ger
SWopkee, it ha o.dt corres the thed a tie de se ma
t aan forour beo unge calk lind ray nd out neds ind se alll iond ipg been di'me caped gay lore ,ou the alin'd mon arge le to macer bered neere ce, hol her, at tho it ou're afling youl on tou b'ig Hoh I wan bere ali,' she cwer tha ad wocl Aodt athe gr guse bet at it hee
sAn aill blo, tareis thou tee ad tthe thee tig and haet yis mive thers ime tCe g Hallk asur hamrore s tereHs e the at l wou tin pds us farringe sour on odle dceas thte te isr beat m imar theas r onr opend, f
hallisC you ma8 amed meg mer mel Ve at houte shoure ycl olr'l bee sor, tr
[ousg wad od soun' merhe mere the iftle
Be nuror bel rian the pountuuth had ne mere ant Iot gowe miuss ither to and thaar come rirnagare our er ho and dor yorinr eere, [it he at ou tacss aneas tind'its ato nurat it ho, cear at wpacke sreet anr amm the rout theent y' kante ot bheas n fomu rece bora warlol.e im ou ute wit anrht panog

In [0]:
start_strings = [" Th", " wh", " he", " I ", " ca", " G", " lo", " ra"]
for i in range(10):
  # Use a fresh name here: `start` already holds the training timer
  prompt = random.choice(start_strings)
  print(prompt)
  print(evaluate(prompt, 1000), '\n')

---

## Part 6: Generate output on a different dataset

---

**TODO:**



**DONE:**

* Choose a textual dataset. Here are some [text datasets](https://www.kaggle.com/datasets?tags=14104-text+data%2C13205-text+mining) from Kaggle 

* Generate some decent looking results and evaluate your model's performance (say what it did well / not so well)

### Evaluation
I downloaded all of Kendrick Lamar's lyrics into a text file and ran my implementation of the GRU on the text, priming the output with the string "[Verse 1: Kendrick]". Interestingly, the model learned that shorter lines tend to occur next to shorter lines, and longer lines next to longer lines. The output isn't anything crazy impressive: most lines are mostly nonsense. But it's clear that the model picked up some of Kendrick Lamar's style, using a fair number of curse words and properly capitalizing the word "Compton." The most interesting thing to me is that it appears to have learned not only how to format song lyrics, but also that there are sequential rules in that formatting: [Verse 2] occurs after [Verse 1], and [Chorus] occurs between verses. (Note: I added the periods between verses for the sake of formatting the output in this notebook. The actual model learned to output an additional newline character between verses and chorus.)


### Output 1000 Characters (After 18 minutes of training w/ Temperature .6)
[Verse 1: Kendrick]

Nixting amardor. I know a brean

You wanks a lack I deots that's a nothed all dollect

No veane takine

I dead a litch like me cangaius and I'm a know

I a sayinica

Het the pick

Then I last on the bout petben to they unt they for a poinion and the hone

Every niggas to moter your feel see gotting fuck a bist to give get at can a go that's store be off the real

That's you for me, you wastarits

.

[Chorus: Kendricked her we got got the dirster and and Kendrick

Lever up wone so some suck cass be in check

Mated midobous, this dy got the be it, you making at the gotter a bill all me how and the sonations mars

I need don't steen this a don't in I adure stare on that down I was yu'll to me ime

I was the got be gon' beatien and I got the strick thone round the bottelter sall me sone he droppen my bally Compton and shouse to man ot mate compon me

.

[Verse 2: Kendrick Lamar]

Hat a of don't got I gott me with he be don't go cause on a in the botter comieg boring

I like I day, "You can