# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot will use a Sequence to Sequence text generation architecture with an LSTM as it's memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Additionally, you have the option to use pretrained word embeddings in your model. We have loaded Brown Embeddings from Gensim in the starter code below. You can compare the performance of your model with pre-trained embeddings against a model without the embeddings.



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and LSTM unit.
- A Decoder consisting of an embedding layer, LSTM unit, and linear output unit.

The Seq2Seq model works by accepting an input into the Encoder, passing the hidden state from the Encoder to the Decoder, which the Decoder uses to output a series of token predictions.

## Dependencies

- Pytorch
- Numpy
- Pandas
- NLTK
- Gzip
- Gensim


Please choose a dataset from the Torchtext website. We recommend looking at the Squad dataset first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





# Use the Cornell Movie Dialog dataset

The Utterances dataset has been uploaded to the /data folder.

We do not need the metadata, so extract only utterance text and ID, conversation ID, and the "reply-to" field.

In [1]:
import json

raw_data = []

input_path = './data/utterances.jsonl'

with open(input_path, 'r', encoding='utf-8') as f:
    
    for line in f:
        
        line_data = json.loads(line.rstrip('\n|\r'))
        
        line_data_dict = {}
        
        line_data_dict['id'] = line_data['id']
        line_data_dict['conversation_id'] = line_data['conversation_id']
        line_data_dict['text'] = line_data['text']
        line_data_dict['reply_to'] = line_data['reply-to']
        
        raw_data.append(line_data_dict)

Create a Pandas dataframe for ease of processing

In [2]:
import pandas as pd

In [3]:
lines_df = pd.DataFrame(raw_data)

In [4]:
lines_df.head(10)

Unnamed: 0,id,conversation_id,text,reply_to
0,L1045,L1044,They do not!,L1044
1,L1044,L1044,They do to!,
2,L985,L984,I hope so.,L984
3,L984,L984,She okay?,
4,L925,L924,Let's go.,L924
5,L924,L924,Wow,
6,L872,L870,Okay -- you're gonna need to learn how to lie.,L871
7,L871,L870,No,L870
8,L870,L870,I'm kidding. You know how sometimes you just ...,
9,L869,L866,Like my fear of wearing pastels?,L868


# Create a list of exchanges and convert to a dataframe

Each entry will contain a "call" and a "response". For a given entry, we identify its "call" by using the "reply_to" field. This way, we get an appropriate response for each utterance (except the first one in each dialogue, for which the "response_to" field is NONE.

In [5]:
conversation_id_list = list(lines_df.conversation_id.unique())

# Limiting the size of the list to 10,000 entries just to get the model working; it takes too long to process the entire list...

In [6]:
conversation_id_list = conversation_id_list[:1000]

In [7]:
exchange_list = []
index = 0

for conversation_id in conversation_id_list:
    
    index += 1

    temp_df = lines_df.loc[lines_df['conversation_id'] == conversation_id]
    
    utterance_list = temp_df.to_dict('records')
    
    utterance_dict = {}

    for utterance in utterance_list:

        temp_dict = {}
        temp_dict['text'] = utterance['text']
        temp_dict['reply_to'] = utterance['reply_to']
        utterance_dict[utterance['id']] = temp_dict
        
    for utterance in utterance_dict.keys():
    
        call_id = utterance_dict[utterance]['reply_to']

        if call_id != None:
            
            exchange_list.append({'CONVERSATION':conversation_id, 'EXCHANGE': call_id + '->' + utterance, 'CALL':utterance_dict[call_id]['text'], 'RESPONSE':utterance_dict[utterance]['text']})
            
    if index % 500 == 0:
        print(index, ': processing conversation_id', conversation_id)

500 : processing conversation_id L4219
1000 : processing conversation_id L8094


In [8]:
exchange_df = pd.DataFrame.from_dict(exchange_list)

In [9]:
exchange_df.head()

Unnamed: 0,CONVERSATION,EXCHANGE,CALL,RESPONSE
0,L1044,L1044->L1045,They do to!,They do not!
1,L984,L984->L985,She okay?,I hope so.
2,L924,L924->L925,Wow,Let's go.
3,L870,L871->L872,No,Okay -- you're gonna need to learn how to lie.
4,L870,L870->L871,I'm kidding. You know how sometimes you just ...,No


# Now construct the vocabulary

First, preprocess text of call and response by converting to lower case and expanding contractions.

In [10]:
import re
from nltk.tokenize import RegexpTokenizer
from collections import Counter

In [11]:
def preprocess_text(text):
    
    clean_text = text.lower()
    
    clean_text = re.sub('can\'t', 'can not', clean_text)
    clean_text = re.sub('won\'t', 'will not', clean_text)
    clean_text = re.sub('n\'t', ' not', clean_text)
    clean_text = re.sub('\'ll', ' will', clean_text)
    clean_text = re.sub('\'m', ' am', clean_text)
    clean_text = re.sub('he\'s', 'he is', clean_text)
    clean_text = re.sub('she\'s', 'she is', clean_text)
    clean_text = re.sub('it\'s', 'it is', clean_text)
    clean_text = re.sub('how\'s', 'how is', clean_text)
    clean_text = re.sub('that\'s', 'that is', clean_text)
    clean_text = re.sub('what\'s', 'what is', clean_text)
    clean_text = re.sub('here\'s', 'here is', clean_text)
    clean_text = re.sub('there\'s', 'there is', clean_text)
    clean_text = re.sub('let\'s', 'let us', clean_text)
    clean_text = re.sub('\'re', ' are', clean_text)
    clean_text = re.sub('\'ve', ' have', clean_text)
    clean_text = re.sub('\'d', ' would', clean_text)
    
    return clean_text

In [12]:
exchange_df['CALL'] = exchange_df['CALL'].apply(preprocess_text)
exchange_df['RESPONSE'] = exchange_df['RESPONSE'].apply(preprocess_text)

In [13]:
exchange_df.head()

Unnamed: 0,CONVERSATION,EXCHANGE,CALL,RESPONSE
0,L1044,L1044->L1045,they do to!,they do not!
1,L984,L984->L985,she okay?,i hope so.
2,L924,L924->L925,wow,let us go.
3,L870,L871->L872,no,okay -- you are gonna need to learn how to lie.
4,L870,L870->L871,i am kidding. you know how sometimes you just...,no


In [14]:
def add_to_vocab(text, tokenizer, vocab):
    
    vocab.extend(tokenizer.tokenize(text))

In [15]:
tokenizer = RegexpTokenizer(r'\w+\'s|\w+')
raw_word_list = []

exchange_df['CALL'].apply(add_to_vocab, tokenizer=tokenizer, vocab=raw_word_list)
exchange_df['RESPONSE'].apply(add_to_vocab, tokenizer=tokenizer, vocab=raw_word_list)

# Create a list of unique words sorted by frequency
#
word_counter = Counter(raw_word_list)

unique_word_list = sorted(word_counter, key=word_counter.get, reverse=True)

# Add tokens for start_of_sentence, end_of_sentence, padding
#
unique_word_list.insert(0, '<sos>')
unique_word_list.insert(1, '<eos>')
unique_word_list.insert(2, '<pad>')

#  This is our vocab size
#
vocab_size = len(unique_word_list)

# Create mappings from words to IDs and back

In [16]:
# Create mappings from words to IDs and from IDs to words
#
word_to_id = {word:id for id, word in enumerate(unique_word_list)}

id_to_word = {value:key for key,value in word_to_id.items()}

# Convert calls and responses to IDs and pad

In [17]:
# Determine maximum call and response length, adding 2 for <sos> and <eos> tokens
#
# max_call_length = exchange_df.CALL.str.len().max() + 2
# max_response_length = exchange_df.RESPONSE.str.len().max() + 2


#  Set MAX_LENGTH = 128 for both call and response to reduce memory usage
#
max_call_length = 128
max_response_length = 128

In [18]:
#  Function to convert a text to IDs and to pad to LENGTH with zeros
#  If TEXT is longer than LENGTH, TEXT will be truncated
#

def text_to_ids(text, length):
    
    #  Initialize to all <pad> tokens
    #
    padded_seq = [2] * length
    
    padded_seq[0] = 0  #  index for <sos>
    
    word_list = tokenizer.tokenize(text)
    
    for index, word in enumerate(word_list):
        
        try:
        
            padded_seq[index+1] = word_to_id[word]
        
        except KeyError:
            
            print('Key Error:', word)
            
        except IndexError:
            
            break
            
    eos_index = min(length-1, len(word_list)+1)
    
    padded_seq[eos_index] = 1  #  index for <eos>
    
    return padded_seq

In [19]:
# Convert contexts and questions to IDs
#
exchange_df['call_to_ids'] = exchange_df.CALL.apply(text_to_ids, length=max_call_length)
exchange_df['response_to_ids'] = exchange_df.RESPONSE.apply(text_to_ids, length=max_response_length)

In [20]:
exchange_df.head()

Unnamed: 0,CONVERSATION,EXCHANGE,CALL,RESPONSE,call_to_ids,response_to_ids
0,L1044,L1044->L1045,they do to!,they do not!,"[0, 37, 11, 6, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...","[0, 37, 11, 7, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2..."
1,L984,L984->L985,she okay?,i hope so.,"[0, 49, 90, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...","[0, 4, 294, 51, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, ..."
2,L924,L924->L925,wow,let us go.,"[0, 2744, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...","[0, 71, 93, 55, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, ..."
3,L870,L871->L872,no,okay -- you are gonna need to learn how to lie.,"[0, 38, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,...","[0, 90, 3, 12, 87, 107, 6, 564, 47, 6, 593, 1,..."
4,L870,L870->L871,i am kidding. you know how sometimes you just...,no,"[0, 4, 19, 633, 3, 25, 47, 374, 3, 44, 1313, 2...","[0, 38, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,..."


# Define Encoder, Decoder, and Seq2Seq model

I will use a single-layer LSTM for both the Encoder and the Decoder.

In [207]:
import torch
import torch.nn as nn

In [208]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Encoder

In [209]:
class Encoder(nn.Module):
    
    def __init__(self, vocab_size, embedding_size, hidden_size):
        
        '''
        vocab_size:     the size of the vocabulary
        embedding_size: the size of the embedding
        hidden_size:    the number of hidden state features

        '''
        
        super(Encoder, self).__init__()
        
        # self.embedding provides a vector representation of the inputs to our model
        
        # self.lstm, accepts the vectorized input and passes a hidden state
        
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        
        # define the embedding layer based on the embedding_size parameter
        #
        self.embedding_size = embedding_size
        self.embedding = nn.Embedding(self.vocab_size, self.embedding_size)
        
        # initialize the hidden state to all zeros
        #
        self.hidden = torch.zeros(1, 1, hidden_size)
        
        # define the lstm layer based on the embedding dimension and hidden state size
        #
        self.lstm = nn.LSTM(self.embedding_size, self.hidden_size, 1) 
        
    
    def forward(self, source_batch):
        
        '''
        Input:   source_batch, the source vector of the size INPUT_SEQ_LENGTH x BATCH_SIZE
        Outputs: output, the encoder outputs
                 hidden, the hidden state
                 cell, the cell state
        '''
        #  Generate the embedding
        #
        embedding = self.embedding(source_batch.type(torch.LongTensor))
               
        #  Generate outputs. We care only about the hidden states and the cell states
        #
        _, hidden, cell = self.lstm(embedding)
        
        return hidden, cell

# Decoder

In [210]:
class Decoder(nn.Module):
      
    def __init__(self, embedding_size, hidden_size, vocab_size):
        
        '''
        vocab_size:     the size of the vocabulary
        embedding_size: the size of the embedding
        hidden_size:    the number of hidden state features

        '''
        
        super(Decoder, self).__init__()
        
        self.embedding_size = embedding_size
        self.hidden_size = hidden_size
        self.vocab_size = vocab_size
        
        # self.embedding provides a vector representation of the target to our model
        #
        self.embedding = nn.Embedding(self.vocab_size, self.embedding_size)
        
        # self.lstm, accepts the embeddings and outputs a hidden state
        #
        self.lstm = nn.LSTM(self.embedding_size, self.hidden_size, 1)

        # self.ouput, a vector of the size 1 x vocab_size with probabilities for each word in vocab
        #
        self.output = nn.Linear(self.hidden_size, self.vocab_size)
        
        
    def forward(self, source, hidden, cell):
        
        '''
        Inputs:  source, the target vector
                 hidden, the previous hidden state
                 cell, the previous cell state
                
        Outputs: output, the prediction
                 hidden, the new hidden state
                 cell, the new cell state
        '''
        source = source.unsqueeze(0)
        
        # generate the embedding
        #
        embedding = self.embedding(source.type(torch.LongTensor))
        
        output, (hidden, cell) = self.lstm(source.type(torch.LongTensor), (hidden, cell))
        
        output = self.output(output.squeeze(0))
        
        return output, hidden, cell

# Seq2Seq

In [211]:
class Seq2Seq(nn.Module):
    
    def __init__(self, encoder, decoder, tf_ratio = 0.5):
        
        super(Seq2Seq, self).__init__()
        
        self.encoder = encoder
        self.decoder = decoder

        self.tf_ratio = tf_ratio
    
    
    def forward(self, source_batch, target_batch):
        
        batch_size = source_batch.shape[1]

        output_sequence = torch.zeros(max_response_length, batch_size, vocab_size)

        hidden, cell = self.encoder(source_batch.type(torch.LongTensor))

        predicted_word = target[0]

        for index in range(1, max_response_length):

            word_probabilities, hidden, cell = self.decoder(predicted_word, hidden, cell)

            output_sequence[index] = word_probabilities

            if random.random() < tf_ratio:
            
                predicted_word = target[index]
                
            else:
                
                predicted_word = word_probabilities.argmax(1)

        return output_sequence

# Define hyperparameters and set up the training loop

In [212]:
import torch.optim as optim

In [213]:
tf_ratio = 0.5

learning_rate = 0.01

embedding_size = 300

hidden_size = 1024

encoder = Encoder(vocab_size, embedding_size, hidden_size).to(device)

decoder = Decoder(embedding_size, hidden_size, vocab_size).to(device)

model = Seq2Seq(encoder, decoder, tf_ratio).to(device)

#  Ignore the padding token, which has index 2 in our vocab
#
criterion = nn.CrossEntropyLoss(ignore_index=2).to(device)

optimizer = optim.Adam(model.parameters(), lr = learning_rate)

In [214]:
model

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(4164, 300)
    (lstm): LSTM(300, 1024)
  )
  (decoder): Decoder(
    (embedding): Embedding(4164, 300)
    (lstm): LSTM(300, 1024)
    (output): Linear(in_features=1024, out_features=4164, bias=True)
  )
)

In [215]:
def train(model, dataloader, optimizer, criterion):
    
    model.train()

    epoch_loss = 0
    
    for _, data in enumerate(dataloader):
        
        optimizer.zero_grad()
        
        calls, responses = data
        
        outputs = model(calls, responses)
        
        # 1. as mentioned in the seq2seq section, we will
        # cut off the first element when performing the evaluation
        # 2. the loss function only works on 2d inputs
        # with 1d targets we need to flatten each of them
        
        outputs_flatten = outputs[1:].view(-1, outputs.shape[-1])
        responses_flatten = responses[1:].view(-1)
        
        loss = criterion(outputs_flatten, responses_flatten)

        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

    return epoch_loss / len(dataloader)

In [216]:
def evaluate(model, dataloader, criterion):
    
    model.eval()

    epoch_loss = 0
    
    with torch.no_grad():
        
        for _, data in enumerate(dataloader):
            
            calls, responses = data
            
            # turn off teacher forcing
            #
            outputs = model(calls, responses, tf_ratio=0) 
            
            outputs_flatten = outputs[1:].view(-1, outputs.shape[-1])
            responses_flatten = responses[1:].view(-1)
            
            loss = criterion(outputs_flatten, responses_flatten)
            
            epoch_loss += loss.item()

    return epoch_loss / len(dataloader)

# Create training, validation, and testing datasets

In [217]:
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

In [218]:
#  Split the dataframe
#

#  Use 20% of the entire dataframe for testing
#
test_df = exchange_df.sample(frac=0.2)

#  Use 80% for training
#
train_df = exchange_df.drop(test_df.index)

#  Use 20% of the training df for validation
#
validation_df = train_df.sample(frac=0.2)

train_df = train_df.drop(validation_df.index)

In [219]:
#  Dataset class
#
class ExchangeDataset(Dataset):
    
    def __init__(self, df):
        
        x = df.iloc[:,4].values.tolist()
        y = df.iloc[:,5].values.tolist()
        
        self.x = torch.tensor(x, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.float32)
        
    def __len__(self):
        
        return len(self.y)
    
    def __getitem__(self, index):
        
        return self.x[index], self.y[index]

In [220]:
train_dataset = ExchangeDataset(train_df)

validation_dataset = ExchangeDataset(validation_df)

test_dataset = ExchangeDataset(test_df)

# Create dataloaders

In [221]:
batch_size = 5

In [222]:
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=False)

validation_dataloader = DataLoader(validation_dataset, batch_size=batch_size, shuffle=False)

test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# Run the training loop

In [223]:
num_epochs = 10
best_valid_loss = float('inf')

for epoch in range(num_epochs):
    
    train_loss = train(model, train_dataloader, optimizer, criterion)
    valid_loss = evaluate(model, validation_dataloader, criterion)

    if valid_loss < best_valid_loss:

        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'best-model.pt')

    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)