<a href="https://colab.research.google.com/github/wingated/cs474_labs_f2019/blob/master/DL_Lab6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 6: Sequence-to-sequence models

## Description:
For this lab, you will code up the [char-rnn model of Karpathy](http://karpathy.github.io/2015/05/21/rnn-effectiveness/). This is a recurrent neural network that is trained probabilistically on sequences of characters, and that can then be used to sample new sequences that are like the original.

This lab will help you develop several new skills, as well as understand some best practices needed for building large models. In addition, we'll be able to create networks that generate neat text!

## There are two parts of this lab:
###  1.   Wiring up a basic sequence-to-sequence computation graph
###  2.   Implementing your own GRU cell.


An example of my final samples are shown below (more detail in the
final section of this writeup), after 150 passes through the data.
Please generate about 15 samples for each dataset.

<code>
And ifte thin forgision forward thene over up to a fear not your
And freitions, which is great God. Behold these are the loss sub
And ache with the Lord hath bloes, which was done to the holy Gr
And appeicis arm vinimonahites strong in name, to doth piseling 
And miniquithers these words, he commanded order not; neither sa
And min for many would happine even to the earth, to said unto m
And mie first be traditions? Behold, you, because it was a sound
And from tike ended the Lamanites had administered, and I say bi
</code>


---

## Part 0: Readings, data loading, and high level training

---

There is a tutorial here that will help build out scaffolding code, and get an understanding of using sequences in pytorch.

* Read the following

> * [Pytorch sequence-to-sequence tutorial](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html)
* [Understanding LSTM Networks](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)






In [83]:
! wget -O ./text_files.tar.gz 'https://piazza.com/redirect/s3?bucket=uploads&prefix=attach%2Fjlifkda6h0x5bk%2Fhzosotq4zil49m%2Fjn13x09arfeb%2Ftext_files.tar.gz' 
! tar -xzf text_files.tar.gz
! pip install unidecode
! pip install torch

import unidecode
import string
import random
import re
 
import pdb
 
all_characters = string.printable
n_characters = len(all_characters)
file = unidecode.unidecode(open('./text_files/lotr.txt').read())
secondFile = unidecode.unidecode(open('./text_files/alma.txt').read())
file_len = len(file)
secondFileLen = len(secondFile)
print('file_len =', file_len)

--2019-10-20 04:35:07--  https://piazza.com/redirect/s3?bucket=uploads&prefix=attach%2Fjlifkda6h0x5bk%2Fhzosotq4zil49m%2Fjn13x09arfeb%2Ftext_files.tar.gz
Resolving piazza.com (piazza.com)... 3.214.17.10, 34.205.95.128, 52.2.48.133, ...
Connecting to piazza.com (piazza.com)|3.214.17.10|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://d1b10bmlvqabco.cloudfront.net/attach/jlifkda6h0x5bk/hzosotq4zil49m/jn13x09arfeb/text_files.tar.gz [following]
--2019-10-20 04:35:08--  https://d1b10bmlvqabco.cloudfront.net/attach/jlifkda6h0x5bk/hzosotq4zil49m/jn13x09arfeb/text_files.tar.gz
Resolving d1b10bmlvqabco.cloudfront.net (d1b10bmlvqabco.cloudfront.net)... 52.84.225.14, 52.84.225.66, 52.84.225.86, ...
Connecting to d1b10bmlvqabco.cloudfront.net (d1b10bmlvqabco.cloudfront.net)|52.84.225.14|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1533290 (1.5M) [application/x-gzip]
Saving to: ‘./text_files.tar.gz’


2019-10-20 04:35:09 (5.30 MB/s)

In [84]:
chunk_len = 200
 
def random_chunk():
  start_index = random.randint(0, file_len - chunk_len)
  end_index = start_index + chunk_len + 1
  return file[start_index:end_index]
  
print(random_chunk())

def random_second_chunk():
  startIndex = random.randint(0, secondFileLen - chunk_len)
  endIndex = startIndex + chunk_len + 1
  return secondFile[startIndex:endIndex]

print(random_second_chunk())

' said Merry. 'Now! Wake all our people! They hate all 
this, you can see: all of them except perhaps one or two rascals, and a few 
fools that want to be important, but don't at all understand what is
subjects to the devil?

 I say unto you, ye will know at that day that ye cannot be saved; for there can no man be saved except his garments are washed white; yea, his garments must be purified until t


In [85]:
import torch
from torch.autograd import Variable
# Turn string into list of longs
def char_tensor(string):
  tensor = torch.zeros(len(string)).long()
  for c in range(len(string)):
      tensor[c] = all_characters.index(string[c])
  return Variable(tensor)

print(char_tensor('abcDEF'))

tensor([10, 11, 12, 39, 40, 41])


---

## Part 4: Creating your own GRU cell 

**(Come back to this later - its defined here so that the GRU will be defined before it is used)**

---

The cell that you used in Part 1 was a pre-defined Pytorch layer. Now, write your own GRU class using the same parameters as the built-in Pytorch class does.

Please try not to look at the GRU cell definition. The answer is right there in the code, and in theory, you could just cut-and-paste it. This bit is on your honor!

**TODO:**

**DONE:**
* Create a custom GRU cell


In [0]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable


class GRU(nn.Module):
  def __init__(self, input_size, hidden_size, num_layers):
    super(GRU, self).__init__()
    self.sigmoid = nn.Sigmoid()
    self.tanh = nn.Tanh()
    self.inputResetWeights = nn.init.xavier_uniform_(torch.empty(1,input_size))
    self.inputResetBias = torch.ones(1, requires_grad=False).fill_(0.8)#.detach()
    self.hiddenResetWeights = nn.init.xavier_uniform_(torch.empty(1, hidden_size))
    self.hiddenResetBias = torch.ones(1, requires_grad=False).fill_(0.8)#.detach()
    self.inputUpdateWeights = nn.init.xavier_uniform_(torch.empty(1, input_size))
    self.inputUpdateBias = torch.ones(1, requires_grad=False).fill_(0.8)#.detach()
    self.hiddenUpdateWeights = nn.init.xavier_uniform_(torch.empty(1, hidden_size))
    self.hiddenUpdateBias = torch.ones(1, requires_grad=False).fill_(0.8)#.detach()
    self.inputNewWeights = nn.init.xavier_uniform_(torch.empty(1, input_size))
    self.inputNewBias = torch.ones(1, requires_grad=False).fill_(0.8)#.detach()
    self.hiddenNewWeights = nn.init.xavier_uniform_(torch.empty(1, hidden_size))
    self.hiddenNewBias = torch.ones(1, requires_grad=False).fill_(0.8)#.detach()
  
  def forward(self, inputs, hidden):
    inputs = inputs.squeeze(0)
    hidden = hidden.squeeze(0)
    resetGate = self.sigmoid(torch.mm(self.inputResetWeights, inputs.t()) + self.inputResetBias + torch.mm(self.hiddenResetWeights, hidden.t()) + self.hiddenResetBias)
    updateGate = self.sigmoid(torch.mm(self.inputUpdateWeights, inputs.t()) + self.inputUpdateBias + torch.mm(self.hiddenUpdateWeights, hidden.t()) + self.hiddenUpdateBias)
    newGate = self.tanh(torch.mm(self.inputNewWeights, inputs.t()) + self.inputNewBias + resetGate * (torch.mm(self.hiddenNewWeights, hidden.t()) + self.hiddenNewBias))
    newHiddenLayer = (1 - updateGate) * newGate + updateGate * hidden
                                                    
    outputs = newHiddenLayer.unsqueeze(0)
    hiddens = newHiddenLayer.clone().unsqueeze(0)#detach().unsqueeze()
    # Each layer does the following:
    # resetGate = Sigmoid(inputResetWeights * inputsTransposed + inputResetBias + hiddenResetWeights * hiddenLayer + hiddenResetBias)
    # updateGate = Sigmmoid(inputUpdateWeights * inputsTransposed + inputUpdateBias + hiddenUpdateWeights * hiddenLayer + hiddenUpdateBias)
    # newGate = tanh(inputNewWeights * inputsTransposed + inputNewBias + resetGate *elementwise multiplication* (hiddenNewWeights * hiddenLayer + hiddenNewBias))
    # newHiddenLayer = (1 - updateGate) *elementwise multiplication* newGate + updateGate *elementwise multiplication* hiddenLayer
    # r_t = sigmoid(W_ir*x_t + b_ir + W_hr*h_(t-1) + b_hr)
    # z_t = sigmoid(W_iz*x_t + b_iz + W_hz*h_(t-1) + b_hz)
    # n_t = tanh(W_in*x_t + b_in + r_t**(W_hn*h_(t-1) + b_hn))
    # h_(t) = (1 - z_t)**n_t + z_t**h_(t-1)
    # Where ** is hadamard product (not matrix multiplication, but elementwise multiplication)
    
    # output is copy of last hidden layer, hiddens in array of all hidden states, which in our case is just one hidden state because we are only supporting single-layer GRU's
    
    return outputs, hiddens
  


---

##  Part 1: Building a sequence to sequence model

---

Great! We have the data in a useable form. We can switch out which text file we are reading from, and trying to simulate.

We now want to build out an RNN model, in this section, we will use all built in Pytorch pieces when building our RNN class.


**TODO:**

**DONE:**
* Create an RNN class that extends from nn.Module.


In [0]:
class RNN(nn.Module):
  # decoder = RNN(n_characters, hidden_size, n_characters, n_layers)
  def __init__(self, input_size, hidden_size, output_size, n_layers=1):
    super(RNN, self).__init__()
    self.input_size = input_size
    self.hidden_size = hidden_size
    self.output_size = output_size
    self.n_layers = n_layers
    
    self.embedding = nn.Embedding(output_size, hidden_size)
    
    # TODO: Implement this class in the cell above
    self.gru = GRU(hidden_size, hidden_size, n_layers) #nn.GRU(hidden_size, hidden_size)
    self.linear = nn.Linear(hidden_size, output_size)

  def forward(self, input_char):#, hidden):
    charEmbedding = self.embedding(input_char).view(1, 1, -1)#.detach()
    
    gruOutput, newHiddenState = self.gru(charEmbedding, self.hiddenState)
    #gruOutput.detach()
    #newHiddenState.detach()
    probabilityDistribution = self.linear(gruOutput[0])
    self.hiddenState = newHiddenState
    
    return probabilityDistribution#.detach() #, newHiddenState

  def init_hidden(self):
    self.hiddenState = Variable(torch.zeros(self.n_layers, 1, self.hidden_size))

In [0]:
def random_training_set():    
  chunk = random_chunk()
  inp = char_tensor(chunk[:-1]).detach()
  target = char_tensor(chunk[1:]).detach()
  return inp, target

def random_second_training_set():
  chunk = random_second_chunk()
  inp = char_tensor(chunk[:-1]).detach()
  target = char_tensor(chunk[1:]).detach()
  return inp, target

---

## Part 2: Sample text and Training information

---

We now want to be able to train our network, and sample text after training.

This function outlines how training a sequence style network goes. 

**TODO:**

**DONE:**
* Fill in the pieces.


In [0]:
def train(inp, target):
   
  # Turn input and target characters into indices using character list
  inputIndexTensor = inp#.detach() #char_tensor(inp)
  targetIndexTensor = target#.detach() #char_tensor(target)
  
  ## initialize hidden layers, set up gradient and loss
  decoder_optimizer.zero_grad()
  # Hidden layer is initialized along with RNN, hidden state tracked internally so there's no need to return it here.
  decoder.init_hidden()
  loss = 0

  # Feed input characters through RNN, get out probability distributions
  for tensorIndex in range(len(inputIndexTensor)):
    probabilityDistributionForNextChar = decoder(inputIndexTensor[tensorIndex])
    
    # Compute loss
    loss = criterion(probabilityDistributionForNextChar, targetIndexTensor[tensorIndex].unsqueeze(0))
    loss += loss.item()
  
  # Backprop error and adjust
  loss.backward()
  decoder_optimizer.step()
    
  return loss/len(inputIndexTensor)

---

## Part 3: Sample text and Training information

---

You can at this time, if you choose, also write out your train loop boilerplate that samples random sequences and trains your RNN. This will be helpful to have working before writing your own GRU class.

If you are finished training, or during training, and you want to sample from the network you may consider using the following function. If your RNN model is instantiated as `decoder`then this will probabilistically sample a sequence of length `predict_len`

**TODO:**

**DONE:**
* Fill out the evaluate function to generate text frome a primed string

In [0]:
def distributionToChar(probabilityDistribution):
  probabilityDistribution = F.softmax(probabilityDistribution, dim=1)
  _, greatestProbabilityIndex = probabilityDistribution.data.topk(1)
  _.detach()
  charIndex = greatestProbabilityIndex.squeeze().detach()
  return all_characters[charIndex.item()], charIndex

def evaluate(prime_str='A', predict_len=100, temperature=0.8):
  ## initialize hidden variable, initialize other useful variables
  decoder.init_hidden()
  with torch.no_grad():
    inputIndexTensor = char_tensor(prime_str)
      
    # prep RNN to predict text by giving it all input characters
    for inputIndex in range(len(inputIndexTensor)):
      distribution = decoder(inputIndexTensor[inputIndex])
        
    generatedText = [] 
    predictedChar, nextInputIndex = distributionToChar(distribution)
      
    generatedText.append(predictedChar)
      
    for predictionCount in range(predict_len - 1):
      distribution = decoder(nextInputIndex)
        
      predictedChar, nextInputIndex = distributionToChar(distribution)
      generatedText.append(predictedChar)
       
  return generatedText
  ## /
  

---

## Part 4: (Create a GRU cell, requirements above)

---



---

## Part 5: Run it and generate some text!

---

Assuming everything has gone well, you should be able to run the main function in the scaffold code, using either your custom GRU cell or the built in layer, and see output something like this. I trained on the “lotr.txt” dataset, using chunk_length=200, hidden_size=100 for 2000 epochs gave.

**TODO:** 

**DONE:**
* Create some cool output


In [0]:
import time
#n_epochs = 5000
print_every = 200
plot_every = 10
hidden_size = 200
n_layers = 1
lr = 0.001

decoder = RNN(n_characters, hidden_size, n_characters, n_layers)
decoder_optimizer = torch.optim.Adam(decoder.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()
 
start = time.time()
all_losses = []
loss_avg = 0

In [108]:
n_epochs = 2000
for epoch in range(1, n_epochs + 1):
  loss_ = train(*random_training_set())
  loss_avg += loss_

  if epoch % print_every == 0:
      print('[%s (%d %d%%) %.4f]' % (time.time() - start, epoch, epoch / n_epochs * 100, loss_))
      print(evaluate('Wh', 100), '\n')

  if epoch % plot_every == 0:
      all_losses.append(loss_avg / plot_every)
      loss_avg = 0

[38.24638271331787 (200 10%) 0.0321]
['Q', 'Q', 'Q', 'Q', 'Q', 'Q', 'Q', 'Q', 'Q', 'Q', 'Q', 'Q', 'Q', 'i', ' ', ' ', ' ', ' ', ' ', ' ', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o'] 

[72.0439178943634 (400 20%) 0.0130]
['Q', 'Q', 'Q', 'Q', 'Q', 'Q', 'Q', 'Q', 'Q', 'Q', 'Q', 'Q', 'i', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' '

In [109]:
for i in range(10):
  start_strings = [" Th", " wh", " he", " I ", " ca", " G", " lo", " ra"]
  start = random.randint(0,len(start_strings)-1)
  print(start_strings[start])
#   all_characters.index(string[c])
  print(evaluate(start_strings[start], 200), '\n')

 ra
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 

---

## Part 6: Generate output on a different dataset

---

**TODO:**

**DONE:**
* Choose a textual dataset. Here are some [text datasets](https://www.kaggle.com/datasets?tags=14104-text+data%2C13205-text+mining) from Kaggle 

* Generate some decent looking results and evaluate your model's performance (say what it did well / not so well)


In [110]:
decoder = RNN(n_characters, hidden_size, n_characters, n_layers)
decoder_optimizer = torch.optim.Adam(decoder.parameters(), lr=lr)

start = time.time()
all_losses = []
loss_avg = 0

for epoch in range(1, n_epochs + 1):
  loss_ = train(*random_second_training_set())
  loss_avg += loss_

  if epoch % print_every == 0:
      print('[%s (%d %d%%) %.4f]' % (time.time() - start, epoch, epoch / n_epochs * 100, loss_))
      print(evaluate('Wh', 100), '\n')

  if epoch % plot_every == 0:
      all_losses.append(loss_avg / plot_every)
      loss_avg = 0
      
for i in range(10):
  start_strings = [" Th", " wh", " he", " I ", " ca", " G", " lo", " ra"]
  start = random.randint(0,len(start_strings)-1)
  print(start_strings[start])
#   all_characters.index(string[c])
  print(evaluate(start_strings[start], 200), '\n')

[39.467225790023804 (200 10%) 0.0368]
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' '] 

[73.25917077064514 (400 20%) 0.0303]
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '

### Model Performance
So this model's performance level is that it generates a variety of characters, with no coherence.

Strengths:
* It generates characters with a sort of variety
* It generates the number of characters you ask for
* It seems to be sort of learning something

Weaknesses:
* It doesn't generate coherent words
  * The model seems to get stuck predicting one particular character for many other characters. I'm not sure what's causing this, as I played around with the learning rate and was sure to use Xavier initialization in the GRU. I also used Softmax in the probability vector.
  * While it bugs me that this net doesn't produce better output, I spent literally my entire Saturday and several hours on Friday working on this, so I don't have any more time to devote to this. In addition, Dr. Wingate said in class that this assignment isn't graded based on the quality of output, and that as long as the network generates something we're good for grading. So I think I've met the requirements for this lab, and if I had more time to make this better I would keep trying to improve this, but I don't have any more time.