# DEMO NOTEBOOK - RETRIEVAL CHATBOT

# 1. INTRODUCTION

In this exercise we will walk step by step through the process of building a retrieval chatbot: preparing the dataset we want to work with, creating a model, and training it to obtain a dialog system that can answer users' questions.

As you know from the slides, a retrieval chatbot doesn't generate an answer from scratch. It receives a question (the user input), uses some heuristic to retrieve a set of candidate answers to that question, and finally selects the best one as the final answer. Our goal in this exercise is to build a chatbot able to perform this task in a closed domain: Ubuntu customer support.

## 1.1. Dataset
In this case we are going to work with the Ubuntu Dialogue Corpus (https://arxiv.org/pdf/1506.08909) to create a retrieval chatbot capable of answering technical support questions about the well-known OS Ubuntu. The dataset can be downloaded from **https://drive.google.com/file/d/0B_bZck-ksdkpVEtVc1R6Y01HMWM/view**. It consists of dialogs extracted from the Ubuntu IRC chat logs, and each conversation has two participants.

In the training dataset, the dialogs have been processed into a series of **context** - **utterance** pairs. Each sentence of a dialog appears as the utterance in one of the pairs, while the context of that specific pair is formed by the sentences that precede the utterance.
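
To make the pairing concrete, here is a minimal sketch of how a dialog could be turned into context - utterance pairs. The function name `dialog_to_pairs`, the toy dialog, and the choice to join sentences with a space and to skip the first sentence (which has no context) are assumptions of this sketch, not the exact preprocessing used to build the corpus.

```python
def dialog_to_pairs(sentences):
    """Turn a list of dialog sentences into (context, utterance) pairs."""
    pairs = []
    # Start at 1: the first sentence has no preceding context (assumption of this sketch)
    for i in range(1, len(sentences)):
        context = " ".join(sentences[:i])  # everything said before the utterance
        utterance = sentences[i]
        pairs.append((context, utterance))
    return pairs


dialog = ["my wifi is down", "which card do you have", "an intel one"]
pairs = dialog_to_pairs(dialog)
for context, utterance in pairs:
    print(repr(context), "->", repr(utterance))
```

Note how the context grows with every turn, which is why the preprocessing later in the notebook keeps only a fixed number of tokens.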

The testing dataset is different: each sentence of the dialog serves as **context**, and the following sentence of the dialog (from the other user) is the **utterance**. In addition, each pair also has 9 **distractors**, false utterances selected randomly from the dataset. Given a context, the model receives the correct utterance and the distractors as candidate answers, and it should give the correct one a better score than the others.


## 1.2. Model
The architecture of the neural network is called the Dual Encoder LSTM. It is also described in the paper mentioned above, and it consists of two encoders: one encodes the question we want to answer and the other encodes the candidate answer. The output of the architecture is a score between 0 and 1. The closer the score is to 1, the better the answer fits that question.

<img src="dualencoder.png">


# 2. Requirements

First of all, we need to install the libraries required to complete this project. The most important are:

* Python 3.6.5
* PyTorch 1.x
* nltk

Once installed, import them into the project and we are ready to start.

Make sure you set the value of your data_dir!

In [2]:
import csv

import random

import nltk
from nltk.stem import SnowballStemmer

import torch
from torch import nn
from torch.nn import init
from torch import optim
from torch.autograd import Variable

# The trailing slash matters: file names are appended directly to data_dir below.
data_dir = "/Users/taamucl2/Swisscom/_Research-projects/_chatbot_course/data/"

# 3. Loading and preprocessing the data

Since our data is provided as CSV files, we imported the csv library to read it.
- We have created a reader for the CSV file.
- Use the reader to create a list of rows.
- Keep in mind that the first row of a CSV file contains the column headers.
- Finally, use the random library to shuffle the training list.

In [None]:
reader = csv.reader(open(data_dir + 'train.csv'))
all_rows = #Todo
#then clear the headers:
rows = #Todo
#Now randomize the rows
random. .. #Todo

In [None]:
reader = # TODO
valid = # TODO
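
The loading steps listed above can be sketched on a small in-memory CSV. This is only an illustration under stated assumptions: the column names, the toy rows and the `demo_` variable names are made up, and `io.StringIO` stands in for the real file so the snippet runs without the dataset.

```python
import csv
import io
import random

# Stand-in for the real training file; column names and rows are made up
fake_file = io.StringIO(
    "Context,Utterance,Label\n"
    "hello bot,hi,1\n"
    "how are you,rm,0\n"
    "thanks,you are welcome,1\n"
)

demo_reader = csv.reader(fake_file)
demo_all_rows = list(demo_reader)   # every line, header included
demo_header = demo_all_rows[0]
demo_rows = demo_all_rows[1:]       # drop the header row
random.seed(0)                      # fixed seed only to make the sketch reproducible
random.shuffle(demo_rows)           # shuffles the list in place

print(demo_header)
print(len(demo_rows))
```

`random.shuffle` returns `None` and modifies the list in place, so its result should not be assigned back to the variable.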

The next step is to load the vocabulary and the pre-trained word embeddings, in our case GloVe.
We need to create two Python dictionaries. Additionally, we will load a stemmer that we will need for the preprocessing functions.
- For the vocabulary, we iterate over all the words in the file and use them as keys, while the value is the ordinal number of the word.
- The embeddings dictionary uses the ordinal number of the word as key, while the value is its pre-trained word embedding. This function is already given to you.

#### Hint:
Given a vocabulary file containing the lines:
- **hello**
- **hi**
- **bot**

The output of load_vocab will be:
- **{'hello': 0, 'hi': 1, 'bot': 2}**

In [None]:
def load_vocab(filename):
    lines = open(filename).readlines()
    return # TODO


def load_glove_embeddings(filename):
    lines = open(filename).readlines()
    embeddings = {}
    for line in lines:
        word = line.split()[0] 
        embedding = list(map(float, line.split()[1:]))
        if word in vocab:
            embeddings[vocab[word]] = embedding

    return embeddings

vocab = load_vocab(data_dir + 'vocabulary.txt')
glove_embeddings = load_glove_embeddings(data_dir + 'glove.6B.100d.txt')

stemmer = SnowballStemmer("english")
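
As an illustration of the word-to-index mapping described in the hint, here is one possible way to build it from a list of lines. The helper name `build_vocab` is hypothetical, and the sketch works on a list instead of a file so that it can be run on its own.

```python
def build_vocab(lines):
    """Map each word (one per line) to its line index."""
    return {line.strip(): index for index, line in enumerate(lines)}


demo_vocab = build_vocab(["hello\n", "hi\n", "bot\n"])
print(demo_vocab)  # {'hello': 0, 'hi': 1, 'bot': 2}
```

Stripping each line matters because lines read from a file keep their trailing newline, which would otherwise become part of the key.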

Finally, we are going to define our preprocessing functions. 

- The numberize function takes a string as input, splits it, and uses our vocabulary dictionary to turn it into a vector in which each word is represented by a number. Unknown words are mapped to 0, only the last 160 tokens are kept, and shorter strings are padded with zeros on the left.

- In the process_train function, you need to return a tuple of the numberized context vector, the numberized response vector and an integer label.

- In the process_valid function, you need to return a tuple of the numberized context vector, the numberized response vector, and a list of numberized distractor vectors.

- The process_predict_embed function does 3 things to a response: it tokenizes it, stems each token, and finally generates a numeric vector of the response.

#### Hint:
Given the row tuple:
- **("hello bot", "hi", "1")**

The output of process_train will be:

- **([0,..,0,222,909],  [0,..,0,137], 1)**

In [None]:
def numberize(inp):
    inp = inp.split()
    result = list(map(lambda k: vocab.get(k, 0), inp))[-160:]
    if len(result) < 160:
        result = [0] * (160 - len(result)) + result

    return result

def process_train(row):
    context, response, label = row

    context = # TODO
    response = # TODO
    label = # TODO

    return context, response, label

def process_valid(row):
    context = # TODO
    response = # TODO
    distractors = # TODO

    context = # TODO
    response = # TODO
    distractors = # TODO

    return context, response, distractors

def process_predict_embed(response):
    response = ' '.join(list(map(stemmer.stem, nltk.word_tokenize(response))))
    response = numberize(response)
    return response
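
The hint for process_train can be reproduced with a toy vocabulary. This is a sketch under stated assumptions: `toy_vocab` and its indices are made up to match the hint, and the `toy_` functions are simplified stand-ins that do not replace the notebook's own functions.

```python
MAX_LEN = 160  # same length the notebook's numberize uses
toy_vocab = {"hello": 222, "bot": 909, "hi": 137}  # made-up indices matching the hint


def toy_numberize(text):
    # Unknown words map to 0; keep only the last MAX_LEN tokens
    ids = [toy_vocab.get(token, 0) for token in text.split()][-MAX_LEN:]
    return [0] * (MAX_LEN - len(ids)) + ids  # left-pad with zeros


def toy_process_train(row):
    context, response, label = row
    # The label arrives as a string from the CSV, so convert it to an integer
    return toy_numberize(context), toy_numberize(response), int(label)


toy_c, toy_r, toy_y = toy_process_train(("hello bot", "hi", "1"))
print(toy_c[-3:], toy_r[-2:], toy_y)  # [0, 222, 909] [0, 137] 1
```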

# 4. Description of the model

Once we have our data initialized and all preprocessing functions ready, it's time to start defining the <b>Graph</b> of the model.

We are going to define our model with two classes. First, we will define the Encoder. For this, we are going to extend the <b>nn.Module class of torch</b>.

There are 3 methods in our class:

- __init__ is the standard Python method for initializing an instance. In this method we pass the basic parameters such as the number of network layers, the hidden size of the layers, etc. In this model we are going to use an **LSTM**, although a GRU could also work.

- forward is the most important method in the class and defines the computation that this module performs when it is given an input. Since this is an encoder, it takes as input the numberized vector of a string. The embedding layer then looks up the word embedding of each token. Finally, we pass all the tokens to the RNN and return the outputs and hidden states.

- Finally, we define init_weights, which is called by __init__ and gives us more control over how the parameters of the RNN are initialized. Additionally, we initialize self.embedding.weight with the dictionary of word embeddings we already loaded.

In [None]:
dtype = torch.FloatTensor # 

class Encoder(nn.Module):
    def __init__(
            self,
            input_size,
            hidden_size,
            vocab_size,
            num_layers=1,
            dropout=0,
            bidirectional=True,
    ):
        super(Encoder, self).__init__()
        self.num_directions = 2 if bidirectional else 1
        self.vocab_size = vocab_size
        self.input_size = input_size
        self.hidden_size = hidden_size // self.num_directions
        self.num_layers = num_layers

        self.embedding = nn.Embedding(vocab_size, input_size, sparse=False, padding_idx=0)

        self.rnn = nn.LSTM(
            input_size,
            self.hidden_size,
            num_layers=num_layers,
            dropout=dropout,
            bidirectional=bidirectional,
            batch_first=True,
        )

        self.init_weights()

    def forward(self, inps):
        embs = self.embedding(inps)
        outputs, hiddens = self.rnn(embs)
        return outputs, hiddens

    def init_weights(self):
        init.orthogonal_(self.rnn.weight_ih_l0)
        init.uniform_(self.rnn.weight_hh_l0, a=-0.01, b=0.01)
        embedding_weights = torch.FloatTensor(self.vocab_size, self.input_size)
        init.uniform_(embedding_weights, a=-0.25, b=0.25)
        for k, v in glove_embeddings.items():
            embedding_weights[k] = torch.FloatTensor(v)
        embedding_weights[0] = torch.FloatTensor([0] * self.input_size)
        del self.embedding.weight
        self.embedding.weight = nn.Parameter(embedding_weights)

Now that we defined an Encoder module, we can define our full DualEncoder Module.

Again we extend the nn.Module class and we implement the required methods:

- In the __init__ method we pass an instance of the Encoder module. Then, based on the size of the Encoder output, we define our trainable square matrix. We also define the final dense layer.

- In the forward method we first generate the encodings for both the contexts and the responses. The **prediction** for a given context is then obtained by multiplying the encoded context by the *prediction matrix* M, which will be trained. However, this prediction is not what we ultimately want. We multiply it by the actual encoded utterance and apply the *sigmoid* function to get the probability that the context-utterance pair is correct.
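
As a compact summary of the forward computation described above, and assuming $c$ and $r$ denote the encoded context and response (column vectors of size $h$) and $M$ the trainable $h \times h$ matrix, the score can be written as follows. The symbols are notation chosen for this sketch, not taken from the notebook.

```latex
\[
  \text{score}(c, r) = \sigma\left(c^{\top} M r\right),
  \qquad
  \sigma(x) = \frac{1}{1 + e^{-x}}
\]
```

Here $c^{\top} M$ plays the role of the prediction, which is then compared with the actual response encoding $r$ through a dot product.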



<img src="dualencoder.png">

In [None]:
class DualEncoder(nn.Module):
    def __init__(self, encoder):
        super(DualEncoder, self).__init__()
        self.encoder = encoder
        h_size = self.encoder.hidden_size * self.encoder.num_directions
        M = torch.FloatTensor(h_size, h_size)
        init.normal_(M)
        self.M = nn.Parameter(
            M,
            requires_grad=True,
        )
        
        dense_dim = 2 * self.encoder.hidden_size
        self.dense = nn.Linear(dense_dim, dense_dim)
      

    def forward(self, contexts, responses):
    
        context_os, context_hs = self.encoder(contexts)
        response_os, response_hs = self.encoder(responses)
        
        context_hs = context_hs[0]
        response_hs = response_hs[0]

        results = []
        response_encodings = []

        h_size = self.encoder.hidden_size * self.encoder.num_directions
        for i in range(len(context_hs[0])):
            context_h = context_os[i][-1].view(1, h_size)
            response_h = response_os[i][-1].view(h_size, 1)

            ans = torch.mm(torch.mm(context_h, self.M), response_h)[0][0]
            results.append(torch.sigmoid(ans))
            response_encodings.append(response_h)

        results = torch.stack(results)

        return results, response_encodings

Finally, we can create an instance of the model. You can play around with the size of the model at home. For now, we keep the sizes fixed so that you can load our pre-trained model.

In [None]:
encoder_model = Encoder(
    input_size=100,  # embedding dim
    hidden_size=300,  # rnn dim
    vocab_size=91620,  # vocab size
    bidirectional=False,  
)

model = DualEncoder(encoder_model)


# 5. Training

Now let's define some functions that will help us with training and testing. Our dataset is too large to feed to the model all at once; your PC would simply run out of RAM while performing the computation. For that reason, we will feed the data to our model in small parts (batches).

- We need a function that returns batches of rows of constant size. It takes 2 parameters: batch_num and batch_size.

- Be careful not to set the start index beyond the length of the list. If the current batch index is too large, you should wrap around and iterate over the rows again. Remember that rows is a global variable.

- The get_validation function returns the full validation set or a subset of the given size.

In [None]:
def get_batch(batch_num, batch_size):
    start = # TODO
    return rows[start:start+batch_size]

def get_validation(num=None):
    if num is None:
        return valid
    return # TODO

#### Hint:
Given the input:
- **(10, 1)**

The output of **get_batch** will be:
- [["be i suppos to get a question about the mode i want to be avail dure xserver-xfree86 ubuntu7 's ... __eou__ __eot__",
  'ok __eou__',
  '1']]
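
One possible way to handle the wrap-around mentioned above is to take the batch number modulo the number of full batches. The sketch below uses a made-up list `demo_batch_rows` and a hypothetical `demo_get_batch` that takes the data as a parameter instead of relying on a global. It assumes the data holds at least one full batch, and it never returns the rows left over after the last full batch.

```python
demo_batch_rows = [["c%d" % i, "r%d" % i, "1"] for i in range(10)]


def demo_get_batch(batch_num, batch_size, data=demo_batch_rows):
    # Number of full batches that fit into the data (assumes len(data) >= batch_size)
    batches_per_epoch = len(data) // batch_size
    # Wrap around once every full batch has been used
    start = (batch_num % batches_per_epoch) * batch_size
    return data[start:start + batch_size]


print(demo_get_batch(0, 4)[0][0])  # c0
print(demo_get_batch(1, 4)[0][0])  # c4
print(demo_get_batch(2, 4)[0][0])  # c0 again, because only 2 full batches fit
```

Keeping every batch the same size is convenient here because the training loop stacks the rows of a batch into one tensor.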

For training we must minimize the **mean loss** over each batch. The chosen loss function is the **binary cross entropy**, since the training dataset labels whether each utterance belongs to the context.

With this loss, if the label is 1 (the pair is correct), the loss is close to 0 only when the score is high, so a low score is penalized. The same works in the other case: if the label is 0 (the pair is wrong), a high score is penalized because the loss increases.
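
To see the behaviour of the loss in numbers, here is a small plain-Python sketch of the binary cross entropy for a single score. The function name and the example scores are chosen for illustration only; in the notebook the loss is computed by torch.nn.BCELoss over a whole batch.

```python
import math


def binary_cross_entropy(score, label):
    """Loss for a single score in (0, 1) and a label that is 0 or 1."""
    return -(label * math.log(score) + (1 - label) * math.log(1 - score))


for score, label in [(0.9, 1), (0.1, 1), (0.9, 0), (0.1, 0)]:
    print(score, label, round(binary_cross_entropy(score, label), 3))
```

A confident correct score gives a loss of about 0.105, while a confident wrong score gives about 2.303, which is the penalty described above.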

In [None]:
loss_fn = torch.nn.BCELoss()

learning_rate = 0.001
num_steps = 10
batch_size = 512
evaluate_batch_size = 250

Now let's use the optim package of torch to define an optimizer. We suggest you use Adam.

In [None]:
optimizer = # TODO
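
For reference, this is how an Adam optimizer is typically created and used together with BCELoss. The sketch runs on a toy model with random inputs; every `toy_` name, the layer sizes and the learning rate are assumptions made for illustration, and the toy model is a stand-in for the DualEncoder, not a replacement.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Toy stand-in for the DualEncoder: one linear layer followed by a sigmoid
toy_model = nn.Sequential(nn.Linear(4, 1), nn.Sigmoid())
toy_loss_fn = nn.BCELoss()
toy_optimizer = optim.Adam(toy_model.parameters(), lr=0.01)

toy_inputs = torch.randn(8, 4)
toy_labels = torch.FloatTensor([1, 0, 1, 0, 1, 0, 1, 0])  # BCELoss expects float targets

toy_losses = []
for toy_step in range(20):
    toy_preds = toy_model(toy_inputs).view(-1)  # shape (8, 1) -> (8,) to match the labels
    toy_loss = toy_loss_fn(toy_preds, toy_labels)
    toy_optimizer.zero_grad()   # clear gradients from the previous step
    toy_loss.backward()         # compute new gradients
    toy_optimizer.step()        # update the parameters
    toy_losses.append(toy_loss.item())

print(toy_losses[0], toy_losses[-1])
```

The `.view(-1)` call illustrates the kind of reshaping that may be needed so that predictions and labels have the same shape before the loss is computed.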

Before we can proceed to training we need to define an evaluation function, because we want to monitor the performance of our model during training.

Remember the structure:
* Context
* Correct utterance
* Nine distractors (wrong utterances)

The evaluation is based on **recall@k**, where k is the size of the selected subset. In other words, for each context the model evaluates all 10 possible utterances and assigns a score to each of them. For recall@1 an example counts as correct only if the correct utterance has the best score; for recall@5 it counts as correct if the correct utterance is among the 5 best scores, etc.

In [None]:
def evaluate(model, size=None):
    """
    Evaluate the model on a subset of the validation set.

    Returns a list where count[j] is the number of examples in which
    exactly j distractors scored at least as high as the correct response.
    """
    examples = list(map(process_valid, get_validation(size)))

    count = [0] * 10

    for context, response, distractors in examples:
        # Repeat the context once per candidate (1 correct response + 9 distractors)
        cs = torch.stack([torch.LongTensor(context) for _ in range(10)], 0)
        rs = [torch.LongTensor(response)]
        rs += [torch.LongTensor(distractor) for distractor in distractors]
        rs = torch.stack(rs, 0)

        # No gradients are needed for evaluation, so wrap the forward pass itself
        with torch.no_grad():
            results, responses = model(cs, rs)
        results = [r.item() for r in results]

        better_count = sum(1 for val in results[1:] if val >= results[0])
        count[better_count] += 1

    return count
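
The list returned by evaluate can be turned into recall@k values by summing its first k entries, since entry j counts the examples in which j distractors tied with or beat the correct response. Here is a minimal sketch; the helper name `recall_at_k` and the counts in `demo_count` are made up for illustration.

```python
def recall_at_k(count, k):
    """Fraction of examples where fewer than k distractors tied with or beat the correct response."""
    total = sum(count)
    return sum(count[:k]) / total


demo_count = [5, 2, 1, 1, 1, 0, 0, 0, 0, 0]  # made-up counts for 10 examples
print(recall_at_k(demo_count, 1))  # 0.5
print(recall_at_k(demo_count, 2))  # 0.7
print(recall_at_k(demo_count, 5))  # 1.0
```

With 10 candidates per context, a model that scores at random would reach roughly 0.1 for recall@1 and 0.5 for recall@5, which gives a baseline for reading the numbers printed during training.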

Now we are going to iterate in a loop for the number of steps we defined. We left a couple of things missing in the for loop.
Look at the comments and fill in the missing lines.

In [None]:
for i in range(num_steps):
    
    # First we need to get the batch for the new step
    batch = get_batch(i, batch_size) 
    # Use the process_train function to generate and make a list with all the elements in the batch
    batch_list = list(map(process_train, batch))
    count = 0

    cs = []
    rs = []
    ys = []

    for c, r, y in batch_list:
        count += 1

        # Forward pass: compute predicted y by passing x to model
        # Convert all c,r,y to tensors and append them to the defined lists
        
        # ATTENTION: think about the types
        # ATTENTION: pass y as an array
        cs.append(# TODO)
        rs.append(# TODO)
        ys.append(# TODO)


    cs = Variable(torch.stack(cs, 0))
    rs = Variable(torch.stack(rs, 0))
    ys = Variable(torch.stack(ys, 0))

    y_preds, responses = model(cs, rs)

    # Compute loss
    # Think about which parameters you need to use to compute the loss.
    # You might need to call .view on one of the inputs to avoid a mismatch in dimensions
    loss = loss_fn(# TODO)
    print(i, loss.item())
    

    
    # Every 100 steps we evaluate the model on an evaluation set of evaluate_batch_size samples
    if i % 100 == 0:
        res = evaluate(model, size=evaluate_batch_size)
        print(i)
        print("1 in 10: %0.2f, 2 in 10: %0.2f, 5 in 10: %0.2f" % (
            res[0] / evaluate_batch_size,
            sum(res[:2]) / evaluate_batch_size,
            sum(res[:5]) / evaluate_batch_size,
        ))
        

    # Every 1000 Steps we evaluate the model with a 2000 sample evaluation set    
    if i % 1000 == 0 and i > 0:
        res = evaluate(model, size=2000)

        one_in = res[0] / 2000
        two_in = sum(res[:2]) / 2000
        five_in = sum(res[:5]) / 2000

        print("!!!!!!!!!!")
        print("1 in 10: %0.2f, 2 in 10: %0.2f, 5 in 10: %0.2f" % (
            one_in,
            two_in,
            five_in,
        ))
       
    
    
    # Finally, update the model
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    del loss, batch

**Everything is ready for the training!** Note that in the original experiment the model was trained for 20000 steps, which can take a long time (about 30-40 hours) without a GPU. Feel free to change the number of steps, although the results can be notably worse.

# 6. MAKING PREDICTIONS

We have to remember that the main goal of this course is to build a chatbot that is able to interact with human beings. That means it should be able to **give answers to questions outside the dataset**. To do that, every time a question is asked we can retrieve a set of possible answers and pass them through the model to obtain their scores. Once this process is finished, we select the one with the best score as the answer that will be returned to the user!

In [None]:


def predict_val(context, response):
    c_num = process_predict_embed(context)
    r_num = process_predict_embed(response)
    c = torch.LongTensor([c_num])
    r = torch.LongTensor([r_num])

    # volatile=True no longer has any effect in PyTorch; torch.no_grad() disables gradient tracking instead
    with torch.no_grad():
        res = model(c, r)[0].cpu().numpy()[0]
    return [(context, response), res]




Populate the val_cache using the predict_val function and the POTENTIAL_RESPONSES.

In [None]:
# # Load your own data here
INPUT_CONTEXT = "What is the command to remove a file"
POTENTIAL_RESPONSES = ["cp", "rm", "mkdir", "top"]


for response in POTENTIAL_RESPONSES:
    print("") #TODO

Examine the val_cache dict.

In [None]:
val_cache
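
The idea of scoring every candidate and picking the best one can be sketched without the trained model. In the snippet below, `fake_score` is a hypothetical stand-in that returns fixed, made-up scores, and `build_val_cache` and `demo_cache` are illustrative names. With the real model, the score would come from predict_val, which returns the pair together with its score.

```python
def build_val_cache(context, candidates, score_fn):
    """Score every candidate against the context and keep the results in a dict."""
    cache = {}
    for candidate in candidates:
        cache[(context, candidate)] = score_fn(context, candidate)
    return cache


def fake_score(context, response):
    # Hypothetical stand-in for the trained model: fixed, made-up scores
    return {"cp": 0.20, "rm": 0.85, "mkdir": 0.10, "top": 0.05}[response]


demo_cache = build_val_cache(
    "What is the command to remove a file",
    ["cp", "rm", "mkdir", "top"],
    fake_score,
)
best_pair = max(demo_cache, key=demo_cache.get)  # key with the highest score
print(best_pair[1])  # rm
```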

However, in the last step we have cheated. We manually added the candidates to be evaluated, but this is not possible in a real-world scenario. To solve this, we came up with the idea of using <b>Solr</b>. Solr gives you the opportunity (among many others that we don't need here) to index the whole dataset and perform similarity queries on it.

## For motivated students

The best way to perform the indexing is to create an appropriate structure for the data. We need to query the user input (the question) against the database, select a group of the most similar existing questions, and take the answers given by the other user in the original dialog as candidates to be evaluated. Each sentence in the dataset can be stored with the following information:

- **author**: name of the user that wrote the sentence
- **recipient**: name of the other user present in the dialog
- **content**: the sentence (can be considered the <i>answer</i>)
- **responseTo**: the last sentence from the other user that came before this one (can be considered the <i>question</i>)

With this structure in Solr, we can query the user's question to the chatbot against the <i>responseTo</i> field of all the stored sentences. The ones with the highest Solr similarity score are the sentences most likely to be answering the same question the user is asking, so we can take their <i>content</i> field and add them to the set of possible answers to return to the user.
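
Before anything is sent to Solr, the dialogs have to be converted into records with these four fields. The sketch below shows one possible way to do that in plain Python. The function name, the toy dialog and the decision to skip sentences that have nothing to respond to are assumptions of this sketch, and the indexing call itself is left out.

```python
def dialog_to_documents(turns):
    """turns: list of (author, sentence) tuples from a two-person dialog."""
    documents = []
    for i, (author, sentence) in enumerate(turns):
        # Most recent earlier turn written by the other participant, if any
        previous = next(
            ((a, s) for a, s in reversed(turns[:i]) if a != author),
            None,
        )
        if previous is None:
            continue  # nothing to respond to yet (assumption: such turns are skipped)
        documents.append({
            "author": author,
            "recipient": previous[0],
            "content": sentence,
            "responseTo": previous[1],
        })
    return documents


demo_turns = [
    ("alice", "how do i remove a file"),
    ("bob", "use rm"),
    ("alice", "thanks"),
]
demo_docs = dialog_to_documents(demo_turns)
for doc in demo_docs:
    print(doc)
```

Each resulting dictionary corresponds to one record to index, with responseTo as the field to search and content as the candidate answer.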

Get started with Solr here: http://lucene.apache.org/solr/guide/7_5/solr-tutorial.html#exercise-1