# Seq2Seq Chatbot
## 2018.11.20
Homework Instruction:

In this homework you will be creating a Chatbot using a sequence-to-sequence model. You are allowed to work in
groups of up to 2 students. This homework is of an open format; all we will be providing you with is the data. It is up
to you to pre-process the data, build the seq-2-seq model with keras, and train the model. You will be submitting your
code and write-up containing the three sections deﬁned below.

<img src="https://cdn.technologyadvice.com/wp-content/uploads/2018/02/friendly-chatbot-700x408.jpg" width=500 height=500>

# Setup:
## Prepare data for neural network
We'll begin by importing the needed models and load the data.

In [2]:
import numpy as np
import pandas as pd
import re
import math
import pickle
import h5py

from keras.models import Sequential
from keras.models import Model, load_model
from keras.layers import LSTM, Embedding 
from keras.layers import Input
from keras.layers import Dense, Flatten, Reshape

# hyperparameter
mxlen = 10 # Max length for a sequence of tokens
batch_size = 64  # Batch size for training.
epochs = 3  # Number of epochs to train for.
char_dim= 50 # Embedding size
latent_dim = 50  # Latent dimensionality of the encoding space.

word2id = {}
id2word = {}

data_path = "movie_lines.tsv"

In [3]:
def load_data(filename):
    """
        Load the movie_lines.tsv file which contains the data. 
        The ﬁle has ﬁve tab separated columns containing the following ﬁelds:
        1. lineID
        2. characterID (who uttered this phrase)
        3. movieID
        4. character name
        5. text of the utterance
        
        all_lineids = [lineids1, lineids2, ...] where lineids is a sequence of 
            utterances for one movie.
        all_ids = [ids1, ids2, ...] where ids is a sequence of 
            character ids for one movie.
        all_toks = [toks1, toks2, ...] where toks is a sequence of 
            words (sentences) for one movie.
    
    """
    file = open(filename)
    all_lineids = []
    all_ids = []
    all_toks = []
    
    lineids = []
    ids = []
    toks = []
    mid = "m0"
    i = 0
    
    for line in file:
        cols = line.rstrip().split("\t")
        #Only get the data with entire five columns
        if len(cols) < 5:
            continue
        
        if cols[2] != mid:
            all_lineids.append(lineids)
            all_ids.append(ids)
            all_toks.append(toks)
            
            #Restart new movie data
            lineids = [int(cols[0].strip('"L'))]
            ids = [int(cols[1].strip('u'))]
            toks = [cols[4]]
            mid = cols[2]
            continue
        
        lineids.append(int(cols[0].strip('"L')))
        ids.append(int(cols[1].strip('u')))
        toks.append(cols[4])
        
        i += 1
        if i == 103085:
            break
    
    if len(toks) > 0:
        all_lineids.append(lineids)
        all_ids.append(ids)
        all_toks.append(toks)
    
    return all_lineids, all_ids, all_toks

In [4]:
all_lineids, all_ids, all_toks = load_data(data_path)

### 1. Text Prepare

We are only providing you with the utterances. It is up to you to process the utterances into a format that can be fed
into the model. <br>
For information on how to process the data take a look at this paper: A Neural Conversation Model. <br>
Some ideas for pre-processing may include removing infrequent words and replacing them with the <UNK> token.
In your write-up include what you did to pre-process the dialogue data.
    
** 1) Remove bad symbols and tokenization <br> **
** 2) *Count the words that appear only onces and replace them with <UNK\> <br> **

In [5]:
# 1) Remove bad symbols and tokenization
REPLACE_BY_SPACE_RE = re.compile('[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')

def text_prepare(text):
    """
        text: a string
        
        return: modified string tokens 
                [tok1, tok2 , ...] which is a single sentence from one character
    """
    tok = ["<START>"]
    text = text.lower() # lowercase text
    text = re.sub(REPLACE_BY_SPACE_RE, ' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = re.sub(BAD_SYMBOLS_RE, '', text) # delete symbols which are in BAD_SYMBOLS_RE from text
    tok += (text.split()+["<EOS>"])
    
    return tok

In [6]:
# 2) Dictionary of all words from train corpus with their counts.
#    Dictionary of all words with its ids
def count_words(all_toks):
    """
    count:
    { tok1: count1, tok2: count2, ...}
    word2id:
    { tok1: id1, tok2: id2, ...}
    id2word:
    { id1: tok1, id2: tok2, ...}
    
    """
    count = {}
    
    # Special Tokens
    word2id["<START>"] = 0
    word2id["<EOS>"] = 1
    word2id["<UNK>"] = 2
    word2id["<PAD>"] = 3
    index = 4
    
    for toks in all_toks:
        for tok in toks:
            for word in tok:
                # Count the words
                if not word in count:
                    count[word] = 1
                else:
                    count[word] += 1
                # Make dictionary
                if not word in word2id:
                    word2id[word] = index
                    index += 1
    
    return count

In [7]:
def text_tokenize(all_toks):
    
    """
     all_toks_new: 
     [
        movie 0:[ line 0: [tok1, tok2, ...],
                  line 1: [tok1, tok2, ...],
                  ... ]
        movie 1:[ line 0: [tok1, tok2, ...],
                  line 1: [tok1, tok2, ...],
                  ... ]
        movie 2:[ line 0: [tok1, tok2, ...],
                  line 1: [tok1, tok2, ...],
                  ... ]
        ...

     ]
     
     scarce_words_counts: a list of words that appear only once.
     [tok1, tok2, tok3, ...]
     
    """
    all_toks_new = []

    # Prepare the text
    for toks in all_toks:
        toks = [text_prepare(x) for x in toks]
        all_toks_new.append(toks)

    # Count the words that appears only once.
    words_counts = count_words(all_toks_new)
    scarce_words_counts = [x[0] for x in sorted(words_counts.items(), key = lambda x: x[1], reverse=True) if x[1] == 1]
    
    # Remove scarce words in word2id dictionary and reindex all words
    for word in scarce_words_counts:
        del word2id[word]
    
    return all_toks_new, scarce_words_counts

In [8]:
%%time
all_toks_new, scarce_words_counts = text_tokenize(all_toks)
word2id = {key: i for i, key in enumerate(word2id.keys())}
id2word = {i:symbol for symbol, i in word2id.items()}

CPU times: user 1.03 s, sys: 36.5 ms, total: 1.07 s
Wall time: 1.07 s


In [10]:
%%time
############################################
# Default: Skip, load the data directly
# Replace the word with <UNK> that appears only once. This should take a while
#for movie in all_toks_new:
for i in range(len(all_toks_new)):
    if i % 100 == 0:
        print("Iteration (per 100 movies): ",int(i/100))
    for toks in all_toks_new[i]:
        for j in range(len(toks)):
            if toks[j] in scarce_words_counts:
                toks[j] = "<UNK>"

file1=open("all_toks_new.bin","wb")
file2=open("word2id.bin","wb")
pickle.dump(all_toks_new,file1)
pickle.dump(word2id,file2)
file1.close()
file2.close()
############################################

Iteration (per 100 movies):  0
Iteration (per 100 movies):  1
Iteration (per 100 movies):  2
CPU times: user 6min 10s, sys: 1.49 s, total: 6min 11s
Wall time: 6min 14s


In [9]:
# Load the processed data to save time
file1=open("all_toks_new.bin","rb")
file2=open("word2id.bin","rb")
all_toks_new=pickle.load(file1)
word2id=pickle.load(file2)

### 2. Word Vector and Padding 
In the following steps, we first make the word into id vectors which will be fed into keras embedding layers. <br>
As for Padding: For the tokens into ids we have processed, we wish to pad all the sentences to a fixed mxlen, where we define 20.


In [10]:
def pad(all_toks_new, mxlen):
    """
    all_toks_new: (each with same length mxlen)
     [
        movie 0:[ line 0: [id1, id2, ...],
                  line 1: [id1, id2, ...],
                  ... ]
        movie 1:[ line 0: [id1, id2, ...],
                  line 1: [id1, id2, ...],
                  ... ]
        movie 2:[ line 0: [id1, id2, ...],
                  line 1: [id1, id2, ...],
                  ... ]
        ...

     ]
    """
    for movie in all_toks_new:
        for i in range(len(movie)):
            #token to id
            movie[i] = [word2id[x] for x in movie[i]]
            #padding
            if len(movie[i]) < mxlen:
                movie[i] += [3] * (mxlen-len(movie[i])) 
            movie[i] = np.array(movie[i][:mxlen])
    
    return all_toks_new
    

In [11]:
all_toks_new = pad(all_toks_new, mxlen)

### 3. Make encoding and decoding data
Turn the data into three main data forms: <br>
**input_tokens, output_tokens, output_target_tokens**


for training a basic LSTM-based Seq2Seq model to predict output_target_tokens given input_tokens and output_tokens.

In [12]:
def separate_conv(ids, toks):
    """
    Separate the sequence of characters and their words if they utter continuously without waiting for the other to speak
    For example:
    ids = [2, 0, 2, 0, 2, 0, 0 ,2]
    toks = [tok1, tok2, tok3, tok4, tok5, tok6, tok7, tok8]
    sep_toks = [[tok1, tok2, tok3, tok4, tok5, tok6], [tok7, tok8]]
    
    """
    sep_toks = []
    for i in range(len(ids)):
        if i == 0:
            temp = ids[i]
            idx = i
        else:
            if temp == ids[i]:
                sep_toks.append(toks[idx:i])
                idx = i
            temp = ids[i]
        
        if i == (len(ids)-1):
            sep_toks.append(toks[idx:len(ids)])
    
    return sep_toks            

In [13]:
def make_data(all_lineids, all_ids, all_toks_new):
    """
    Transform our original data with all dialogues all_toks_new into training data (input_tokens, output_tokens)
    
    A movie can be seen as an entity with sequential characters' conversations.
    We deem a conversation end when two line ids are not consecutive.
    for example, if a lineid sequence is [242, 241, 237, 236, 235]
    we can make it into two conversations: [242, 241], [237, 236, 235]
    
    After specifying the conversations, we can then prepare the training data as follows:
    for two conversations: [242, 241], [237, 236, 235] and corresponding token sequence is [toks1, toks2], [toks3, toks4, toks5]
    we make input_tokens as [toks2], [toks4, toks5]
            output_tokens as [toks1], [toks3, toks4]
    
    
    Then we combine all tokens input-output pairs of every conversation in every movie.
    so we will have
    input_tokens = [toks2, toks4, toks5, toks7, ...]
    output_tokens = [toks1, toks3, toks4, toks6, ...]
    
    Finally we wish to have our target output tokens to be almost same as output_tokens with each data ahead by one timestep.
    output_target_tokens = [toks1, toks3, toks4, toks6, ...]
    
    """
    input_tokens = []
    output_tokens = []
    
    N = len(all_lineids) #number of movies

    for i in range(N):
        #For a single movie
        movie = all_lineids[i]
        for j in range(len(movie)):
            if j == 0:
                temp = movie[j]
                idx = j
            else:
                if (temp-movie[j]) is not 1:
                    sep_toks = separate_conv(all_ids[i][idx:j], all_toks_new[i][idx:j])
                    for toks in sep_toks:
                        input_tokens += toks[1:]
                        output_tokens += toks[:-1]

                    idx = j
                temp = movie[j]

            #Last Sequence
            if j == len(movie)-1:
                sep_toks = separate_conv(all_ids[i][idx:len(movie)], all_toks_new[i][idx:len(movie)])
                for toks in sep_toks:
                    input_tokens += toks[1:]
                    output_tokens += toks[:-1]
    
    output_target_tokens = [0]*len(output_tokens)
    for i in range(len(output_tokens)):
        output_target_tokens[i] = np.concatenate((output_tokens[i][1:], [3]))
            
    return input_tokens, output_tokens, output_target_tokens

In [14]:
input_tokens, output_tokens, output_target_tokens = make_data(all_lineids, all_ids, all_toks_new)
input_tokens = np.asarray(input_tokens)
output_tokens = np.asarray(output_tokens)
output_target_tokens = np.asarray(output_target_tokens)

In [15]:
print(input_tokens[0])
print(output_tokens[0])
print(output_target_tokens[0])

[0 4 5 7 1 3 3 3 3 3]
[0 4 5 6 1 3 3 3 3 3]
[4 5 6 1 3 3 3 3 3 3]


In [16]:
input_tokens = input_tokens[:1000]
output_tokens = output_tokens[:1000]
output_target_tokens = output_target_tokens[:1000]

In [17]:
decoder_target_data = np.zeros(
    (len(output_target_tokens), mxlen, len(word2id)),
    dtype='float32')
for i, target in enumerate(output_target_tokens):
    for t, idd in enumerate(target):
        decoder_target_data[i, t, idd] = 1.

In [18]:
print(input_tokens.shape)
print(output_target_tokens.shape)
print(decoder_target_data.shape)
print(len(word2id))

(1000, 10)
(1000, 10)
(1000, 10, 18223)
18223


# -----------------------------------
# Model:
## Define a basic LSTM-based Seq2Seq model using keras

Instructions:

You will be implementing the sequence-to-sequence model described in class where the model makes predictions
using the left context and the dialogue context. More information model can be found in these lecture slides, this
paper, or the reading.

In [19]:
# Encoder
# Define an input sequence and process it.

encoder_inputs = Input(shape=(mxlen,), name="Encoder_input")

embedding = Embedding(input_dim=len(word2id), 
                      output_dim=char_dim, 
                      input_length=mxlen, name="Embedding")
encoder_e = embedding(encoder_inputs)

encoder_lstm = LSTM(latent_dim, return_state=True, name="Encoder_lstm")
encoder_outputs, state_h, state_c = encoder_lstm(encoder_e)
#We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]



# decoder, using `encoder_states` as initial state.
# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_inputs = Input(shape=(mxlen,), name="Decoder_input")

decoder_e = embedding(decoder_inputs)

decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True, name="Decoder_lstm")
decoder_outputs, _, _ = decoder_lstm(decoder_e, initial_state=encoder_states)

decoder_dense = Dense(len(word2id), activation='softmax', name="Dense_layer")
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
my_model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

In [20]:
# Run training
my_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
print(my_model.summary())

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Decoder_input (InputLayer)      (None, 10)           0                                            
__________________________________________________________________________________________________
Encoder_input (InputLayer)      (None, 10)           0                                            
__________________________________________________________________________________________________
Embedding (Embedding)           (None, 10, 50)       911150      Encoder_input[0][0]              
                                                                 Decoder_input[0][0]              
__________________________________________________________________________________________________
Encoder_lstm (LSTM)             [(None, 50), (None,  20200       Embedding[0][0]                  
__________

In [21]:
my_model.fit([input_tokens, output_tokens], decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2)

############################################
#################Optional###################
############################################
# Save model
my_model.save('s2s.h5')
############################################
#################Optional###################
############################################

Train on 800 samples, validate on 200 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


  '. They will not be included '


In [None]:
############################################
#################Optional###################
############################################
model = load_model('s2s.h5')

encoder_inputs = model.input[0]   # input_1
encoder_outputs, state_h_enc, state_c_enc = model.layers[2].output   # lstm_1
encoder_states = [state_h_enc, state_c_enc]
encoder_model = Model(encoder_inputs, encoder_states)

decoder_inputs = model.input[1]   # input_2
decoder_state_input_h = Input(shape=(latent_dim,), name='input_3')
decoder_state_input_c = Input(shape=(latent_dim,), name='input_4')
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_lstm = model.layers[3]
decoder_outputs, state_h_dec, state_c_dec = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h_dec, state_c_dec]
decoder_dense = model.layers[4]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)
############################################
#################Optional###################
############################################

# Next: Inference mode (sampling)
Here we do the sampling to retrieve initial decoder state. <br>

1) encode input and retrieve initial decoder state <br>
2) run one step of decoder with this initial state and a "start of sequence" token as target. Output will be the next target token <br>
3) Repeat with the current target token and current states

In [22]:
# Define sampling encoder models
encoder_model = Model(encoder_inputs, encoder_states)
encoder_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
Encoder_input (InputLayer)   (None, 10)                0         
_________________________________________________________________
Embedding (Embedding)        (None, 10, 50)            911150    
_________________________________________________________________
Encoder_lstm (LSTM)          [(None, 50), (None, 50),  20200     
Total params: 931,350
Trainable params: 931,350
Non-trainable params: 0
_________________________________________________________________


In [23]:
# Define sampling decoder models
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

decoder_e2 = embedding(decoder_inputs)

decoder_outputs2, state_h2, state_c2 = decoder_lstm(decoder_e2, initial_state=decoder_states_inputs)
decoder_states2 = [state_h2, state_c2]

decoder_outputs2 = decoder_dense(decoder_outputs2)
decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs2] + decoder_states2)

# Next: Predict decoder sequence
At the last stage, we could predict the input sequence by putting in our predefined model. <br>

In [24]:
def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, mxlen))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0] = word2id["<START>"]

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        outputs, h, c = decoder_model.predict([target_seq] + states_value)

        # Sample a token
        sampled_token_index = np.argmax(outputs[0,-1,:])
        sampled_token = id2word[sampled_token_index]
        decoded_sentence += (sampled_token + " ")

        # Exit condition: either hit max length or find stop character.
        if (sampled_token == "<EOS>" or len(decoded_sentence) > mxlen):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, mxlen))
        target_seq[0, 0] = word2id["<START>"]

        # Update states
        states_value = [h, c]

    return decoded_sentence

# -----------------------------------
# Testing
#### Here we can do the final prediction

In [25]:
test_tokens = []

for seq_index in range(100):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = input_tokens[seq_index]
    input_seq = input_seq.reshape(-1,input_seq.shape[0])
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('Input sentence:', input_tokens[seq_index])
    print('Decoded sentence:', decoded_sentence)

-
Input sentence: [0 4 5 7 1 3 3 3 3 3]
Decoded sentence: <PAD> <PAD> 
-
Input sentence: [ 0 11 12  1  3  3  3  3  3  3]
Decoded sentence: <PAD> <PAD> 
-
Input sentence: [ 0 15  1  3  3  3  3  3  3  3]
Decoded sentence: <PAD> <PAD> 
-
Input sentence: [ 0 22  1  3  3  3  3  3  3  3]
Decoded sentence: <PAD> <PAD> 
-
Input sentence: [ 0 23 24 25 26 20 27 25 28 29]
Decoded sentence: <PAD> <PAD> 
-
Input sentence: [ 0 41 42 25  1  3  3  3  3  3]
Decoded sentence: <PAD> <PAD> 
-
Input sentence: [ 0 43 44 45  1  3  3  3  3  3]
Decoded sentence: <PAD> <PAD> 
-
Input sentence: [ 0  8 46 47 48  7 41 44 45 49]
Decoded sentence: <PAD> <PAD> 
-
Input sentence: [ 0 60 30 61 62 63 23 35 64 65]
Decoded sentence: <PAD> <PAD> 
-
Input sentence: [ 0 43 66  1  3  3  3  3  3  3]
Decoded sentence: <PAD> <PAD> 
-
Input sentence: [ 0  5 25 67  7 30 66  1  3  3]
Decoded sentence: <PAD> <PAD> 
-
Input sentence: [ 0 22  1  3  3  3  3  3  3  3]
Decoded sentence: <PAD> <PAD> 
-
Input sentence: [ 0 68  2 69 52 25 1

-
Input sentence: [  0 232   1   3   3   3   3   3   3   3]
Decoded sentence: <PAD> <PAD> 
-
Input sentence: [  0 135 412  91 125 413   1   3   3   3]
Decoded sentence: <PAD> <PAD> 
