Two approaches were considered for this project:



1.   **Conditional language modeling**: seq2seq model using encoder-attention-decoder architecture.
2.   **Selective model**: using embeddings-based ranking


This notebook shows the results of the selective approach. The conditional language modeling approach is still experimental, but it is also promising. For that approach, please refer to the following GitHub notebook.

[encoder-attention-decoder architecture](https://github.com/ragibhassantonoy/NLP_Honor/blob/master/attention_week5_honor.ipynb)

The architecture of the selective model:

1.   Preprocess the dataset so that the model can also recognise the tone and the beginning and end of each sentence.
2.   Group sentences in pairs to construct conversations.
3.   Learn unsupervised sentence embeddings on the corpus.
4.   Select the best response (or one at random when there are several best candidates) based on the embedding of the input.

The code can be found here:
[selective architecture](https://github.com/ragibhassantonoy/NLP_Honor/)





In [0]:
"""! wget https://raw.githubusercontent.com/hse-aml/natural-language-processing/master/setup_google_colab.py -O setup_google_colab.py
import setup_google_colab
# please, uncomment the week you're working on
# setup_google_colab.setup_week1()  
# setup_google_colab.setup_week2()
# setup_google_colab.setup_week3()
# setup_google_colab.setup_week4()
# setup_google_colab.setup_project()

setup_google_colab.setup_honor()"""

In [0]:
import pickle
import tqdm
import pandas as pd
import numpy as np
import random

# Datasets

Four datasets were considered for training:

1. [Cornell Movie Dialogs](http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html)
2. [OpenSubs](http://opus.nlpl.eu/OpenSubtitles.php)
3. Scotus
4. Ubuntu dialogs


The code for mining and initial preprocessing is adapted from the following sources:



1.   [DeepQA](https://github.com/Conchylicultor/DeepQA#presentation)
2.   [HSE github](https://github.com/hse-aml/natural-language-processing/tree/master/honor)



In [86]:
# Copyright 2015 Conchylicultor. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

import ast
import os
import random
import re
from time import time

import nltk
from tqdm import tqdm

import xml.etree.ElementTree as ET
import datetime
import sys
import json
import pprint

from gzip import GzipFile
import numpy as np
import nltk

#nltk.download('stopwords')
#from nltk.corpus import stopwords
nltk.download('punkt')

"""
Load the cornell movie dialog corpus.

Available from here:
http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html

"""

class CornellData:
    """

    """

    def __init__(self, dirName):
        """
        Args:
            dirName (string): directory where to load the corpus
        """
        self.lines = {}
        self.conversations = []

        MOVIE_LINES_FIELDS = ["lineID","characterID","movieID","character","text"]
        MOVIE_CONVERSATIONS_FIELDS = ["character1ID","character2ID","movieID","utteranceIDs"]

        self.lines = self.loadLines(os.path.join(dirName, "movie_lines.txt"), MOVIE_LINES_FIELDS)
        self.conversations = self.loadConversations(os.path.join(dirName, "movie_conversations.txt"), MOVIE_CONVERSATIONS_FIELDS)

        # TODO: Cleaner program (merge copy-paste) !!

    def loadLines(self, fileName, fields):
        """
        Args:
            fileName (str): file to load
            fields (list<str>): fields to extract
        Return:
            dict<dict<str>>: the extracted fields for each line
        """
        lines = {}

        with open(fileName, 'r', encoding='iso-8859-1') as f:  # TODO: Solve Iso encoding pb !
            for line in f:
                values = line.split(" +++$+++ ")

                # Extract fields
                lineObj = {}
                for i, field in enumerate(fields):
                    lineObj[field] = values[i]

                lines[lineObj['lineID']] = lineObj

        return lines

    def loadConversations(self, fileName, fields):
        """
        Args:
            fileName (str): file to load
            fields (list<str>): fields to extract
        Return:
            list<dict<str>>: the extracted fields for each line
        """
        conversations = []

        with open(fileName, 'r', encoding='iso-8859-1') as f:  # TODO: Solve Iso encoding pb !
            for line in f:
                values = line.split(" +++$+++ ")

                # Extract fields
                convObj = {}
                for i, field in enumerate(fields):
                    convObj[field] = values[i]

                # Convert string to list (convObj["utteranceIDs"] == "['L598485', 'L598486', ...]")
                lineIds = ast.literal_eval(convObj["utteranceIDs"])

                # Reassemble lines
                convObj["lines"] = []
                for lineId in lineIds:
                    convObj["lines"].append(self.lines[lineId])

                conversations.append(convObj)

        return conversations

    def getConversations(self):
        return self.conversations


# Based on code from https://github.com/AlJohri/OpenSubtitles
# by Al Johri <al.johri@gmail.com>

"""
Load the opensubtitles dialog corpus.
"""

class OpensubsData:
    """
    """

    def __init__(self, dirName):
        """
        Args:
            dirName (string): directory where to load the corpus
        """

        # Hack this to filter on subset of Opensubtitles
        # dirName = "%s/en/Action" % dirName

        print("Loading OpenSubtitles conversations in %s." % dirName)
        self.conversations = []
        self.tag_re = re.compile(r'(<!--.*?-->|<[^>]*>)')
        self.conversations = self.loadConversations(dirName)

    def loadConversations(self, dirName):
        """
        Args:
            dirName (str): folder to load
        Return:
            array(question, answer): the extracted QA pairs
        """
        conversations = []
        dirList = self.filesInDir(dirName)
        for filepath in tqdm(dirList, "OpenSubtitles data files"):
            if filepath.endswith('xml'):
                try:
                    doc = self.getXML(filepath)
                    conversations.extend(self.genList(doc))
                except ValueError:
                    tqdm.write("Skipping file %s with errors." % filepath)
                except:
                    print("Unexpected error:", sys.exc_info()[0])
                    raise
        return conversations

    def getConversations(self):
        return self.conversations

    def genList(self, tree):
        root = tree.getroot()

        timeFormat = '%H:%M:%S'
        maxDelta = datetime.timedelta(seconds=1)

        startTime = datetime.datetime.min
        strbuf = ''
        sentList = []

        for child in root:
            for elem in child:
                if elem.tag == 'time':
                    elemID = elem.attrib['id']
                    elemVal = elem.attrib['value'][:-4]
                    if elemID[-1] == 'S':
                        startTime = datetime.datetime.strptime(elemVal, timeFormat)
                    else:
                        sentList.append((strbuf.strip(), startTime, datetime.datetime.strptime(elemVal, timeFormat)))
                        strbuf = ''
                else:
                    # Skip elements without text (elem.text is None)
                    if elem.text is not None:
                        strbuf = strbuf + " " + elem.text

        conversations = []
        for idx in range(0, len(sentList) - 1):
            cur = sentList[idx]
            nxt = sentList[idx + 1]
            # Pair consecutive subtitles that are at most maxDelta apart and both non-empty
            if nxt[1] - cur[2] <= maxDelta and cur[0] and nxt[0]:
                tmp = {}
                tmp["lines"] = []
                tmp["lines"].append(self.getLine(cur[0]))
                tmp["lines"].append(self.getLine(nxt[0]))
                if self.filter(tmp):
                    conversations.append(tmp)

        return conversations

    def getLine(self, sentence):
        line = {}

        temptext = self.tag_re.sub('', sentence).replace('\\\'','\'').strip().lower()

        # Drop a leading dialogue dash, with or without a following space ("- text" or "-text")
        line["text"] = temptext.lstrip("- ")
            
        return line

    def filter(self, lines):
        # Use the following to customize filtering of QA pairs
        #
        # startwords = ("what", "how", "when", "why", "where", "do", "did", "is", "are", "can", "could", "would", "will")
        # question = lines["lines"][0]["text"]
        # if not question.endswith('?'):
        #     return False
        # if not question.split(' ')[0] in startwords:
        #     return False
        #
        return True

    def getXML(self, filepath):
        fext = os.path.splitext(filepath)[1]
        if fext == '.gz':
            tmp = GzipFile(filename=filepath)
            return ET.parse(tmp) #, parser=ET.XMLParser(encoding="utf8"))
        else:
            return ET.parse(filepath) #, parser=ET.XMLParser(encoding="utf8"))

    def filesInDir(self, dirname):
        result = []
        for dirpath, dirs, files in os.walk(dirname):
            for filename in files:
                fname = os.path.join(dirpath, filename)
                result.append(fname)
        return result


"""
Load transcripts from the Supreme Court of the USA.
Available from here:
https://github.com/pender/chatbot-rnn
"""

class ScotusData:
    """
    """

    def __init__(self, dirName):
        """
        Args:
            dirName (string): directory where to load the corpus
        """
        self.lines = self.loadLines(os.path.join(dirName, "scotus"))
        self.conversations = [{"lines": self.lines}]


    def loadLines(self, fileName):
        """
        Args:
            fileName (str): file to load
        Return:
            list<dict<str>>: the extracted fields for each line
        """
        lines = []

        with open(fileName, 'r') as f:
            for line in f:
                l = line[line.index(":")+1:].strip()  # Strip name of speaker.

                lines.append({"text": l})

        return lines


    def getConversations(self):
        return self.conversations


"""
Ubuntu Dialogue Corpus
http://arxiv.org/abs/1506.08909
"""

class UbuntuData:
    """
    """

    def __init__(self, dirName):
        """
        Args:
            dirName (string): directory where to load the corpus
        """
        self.MAX_NUMBER_SUBDIR = np.inf #10
        self.conversations = []
        __dir = os.path.join(dirName, "dialogs")
        number_subdir = 0
        for sub in tqdm(os.scandir(__dir), desc="Ubuntu dialogs subfolders", total=len(os.listdir(__dir))):
            if number_subdir == self.MAX_NUMBER_SUBDIR:
                print("WARNING: Early stopping, only extracting {} directories".format(self.MAX_NUMBER_SUBDIR))
                return

            if sub.is_dir():
                number_subdir += 1
                for f in os.scandir(sub.path):
                    if f.name.endswith(".tsv"):
                        try:
                            self.conversations.append({"lines": self.loadLines(f.path)})
                        except ValueError:
                            tqdm.write("Skipping file %s with errors." % f.path)
                        except:
                            print("Unexpected error:", sys.exc_info()[0])
                            raise

    def loadLines(self, fileName):
        """
        Args:
            fileName (str): file to load
        Return:
            list<dict<str>>: the extracted fields for each line
        """
        lines = []
        with open(fileName, 'r', encoding="utf8") as f:
            for line in f:
                l = line[line.rindex("\t")+1:].strip()  # Strip metadata (timestamps, speaker names)

                lines.append({"text": l})

        return lines


    def getConversations(self):
        return self.conversations

"""
Load data from a dataset of simply-formatted data
from A to B
from B to A
from A to B
from B to A
from A to B
===
from C to D
from D to C
from C to D
from D to C
from C to D
from D to C
...
`===` lines just separate linear conversations between 2 people.
"""

class LightweightData:
    """
    """

    def __init__(self, lightweightFile):
        """
        Args:
            lightweightFile (string): file containing our lightweight-formatted corpus
        """
        self.CONVERSATION_SEP = "==="
        self.conversations = []
        self.loadLines(lightweightFile + '.txt')

    def loadLines(self, fileName):
        """
        Args:
            fileName (str): file to load
        """

        linesBuffer = []
        with open(fileName, 'r') as f:
            for line in f:
                l = line.strip()
                if l == self.CONVERSATION_SEP:
                    self.conversations.append({"lines": linesBuffer})
                    linesBuffer = []
                else:
                    linesBuffer.append({"text": l})
            if len(linesBuffer):  # Eventually flush the last conversation
                self.conversations.append({"lines": linesBuffer})

    def getConversations(self):
        return self.conversations

def text_prepare(text):
    """Performs tokenization and simple preprocessing."""

    # Raw strings avoid invalid-escape warnings for \[ \] \|
    replace_by_space_re = re.compile(r'[/(){}\[\]\|@,;#+_]')
    bad_symbols_re = re.compile(r'[^0-9a-z ]')
    #stopwords_set = set(stopwords.words('english'))

    text = text.lower().strip()
    text = replace_by_space_re.sub(' ', text)
    text = bad_symbols_re.sub('', text)
    #text = ' '.join([x for x in text.split() if x and x not in stopwords_set])
    text = ' '.join([x for x in text.split() if x])

    return text.strip()

def extractText(line, fast_preprocessing=True):
    if fast_preprocessing:
        # Regex-based cleanup that lowercases and drops punctuation (see text_prepare)
        return text_prepare(line)

    else:
        return nltk.word_tokenize(line.lower())


def splitConversations(conversations, max_len=20, fast_preprocessing=True):
    data = []
    for conversation in tqdm(conversations):
        lines = conversation['lines']
        for i in range(len(lines) - 1):
            request = extractText(lines[i]['text'], fast_preprocessing=fast_preprocessing)
            reply = extractText(lines[i + 1]['text'], fast_preprocessing=fast_preprocessing)
            # len() counts characters with fast preprocessing (str) and tokens otherwise (list)
            if 0 < len(request) <= max_len and 0 < len(reply) <= max_len:
                data.append((request, reply))
    return data


def readCornellData(path, max_len=20, fast_preprocessing=True):
    dataset = CornellData(path)
    conversations = dataset.getConversations()
    return splitConversations(conversations, max_len=max_len, fast_preprocessing=fast_preprocessing)


def readOpensubsData(path, max_len=20, fast_preprocessing=True):
    dataset = OpensubsData(path)
    conversations = dataset.getConversations()
    return splitConversations(conversations, max_len=max_len, fast_preprocessing=fast_preprocessing)

def readScotusData(path, max_len=20, fast_preprocessing=True):
    dataset = ScotusData(path)
    conversations = dataset.getConversations()
    return splitConversations(conversations, max_len=max_len, fast_preprocessing=fast_preprocessing)

def readUbuntuData(path, max_len=20, fast_preprocessing=True):
    dataset = UbuntuData(path)
    conversations = dataset.getConversations()
    return splitConversations(conversations, max_len=max_len, fast_preprocessing=fast_preprocessing)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [87]:
%cd "/content/drive/My Drive/HonorProject/NLP Honor"

choices=["cornell"]#, "scotus", "opensubs"] #, "ubuntu"]
Classes = [CornellData]#, ScotusData, OpensubsData] #, UbuntuData]


for choice, Class in zip(choices, Classes):
    print(choice, Class, "started")
    dataset_path = os.path.join("data", choice)
    dataset = Class(dataset_path)
    conversations = dataset.getConversations()

    print(len(conversations))

/content/drive/My Drive/HonorProject/NLP Honor
cornell <class '__main__.CornellData'> started
83097



# Data Preprocessing
1. **The NLTK tokenizer is used for the initial preprocessing of lines from the dataset. Punctuation marks such as ",", "." and "?" are kept as tokens and weighted the same as words, to capture the tone and structure of the sentences.**

2. For each preprocessed sentence, the original sentence is also stored for later use.

3. Only the Cornell dataset is shown here. The other datasets were considered in separate experiments.
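
As a rough illustration of why punctuation is kept, the sketch below turns two inputs that differ only in their final mark into different token sequences. It is not the notebook's tokenizer: `TOKEN_RE` and `tokenize_keep_punct` are hypothetical names, and the regular expression is a simplified stand-in for `nltk.word_tokenize` (which, among other differences, splits contractions such as "don't" into two tokens).

```python
import re

# Simplified stand-in for nltk.word_tokenize (an assumption for illustration):
# a word, optionally with one internal apostrophe, or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def tokenize_keep_punct(sentence):
    """Lowercase a sentence and return its tokens joined by spaces, punctuation included."""
    return ' '.join(TOKEN_RE.findall(sentence.lower()))


statement = tokenize_keep_punct("I don't know.")
question = tokenize_keep_punct("I don't know?")
print(statement)
print(question)
```

The two strings differ only in their last token, so a sentence embedding trained on such text can, in principle, separate a statement from a question.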

In [88]:
data = []
for conversation in tqdm(conversations):
    for raw in conversation['lines']:
        # Tokenize with NLTK (lowercasing happens inside extractText) and re-join,
        # so that punctuation marks become separate tokens
        line = ' '.join(extractText(raw['text'], fast_preprocessing=False))

        # Keep the original text next to the preprocessed one
        data.append((line, raw['text']))

print(len(data))


100%|██████████| 83097/83097 [00:51<00:00, 1607.24it/s]

304713





**The individual lines are grouped in pairs to form conversations. Here, each pair is treated as bidirectional: either line can serve as the response to the other.**
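
A toy sketch of the bidirectional lookup, assuming a conversation is simply a list of already preprocessed lines. The helper names `make_pairs` and `candidate_responses` are hypothetical and only illustrate the idea; the notebook's own functions work on data frames instead.

```python
def make_pairs(lines):
    """Pair every line with the line that follows it."""
    return [(lines[i], lines[i + 1]) for i in range(len(lines) - 1)]


def candidate_responses(pairs, matched):
    """Return the lines that followed `matched`; if there are none,
    fall back to the lines that preceded it (the reverse direction)."""
    followers = [reply for request, reply in pairs if request == matched]
    if followers:
        return followers
    return [request for request, reply in pairs if reply == matched]


toy_conversation = ["is it raining ?", "yeah , just started .", "take an umbrella ."]
pairs = make_pairs(toy_conversation)

forward = candidate_responses(pairs, "is it raining ?")
backward = candidate_responses(pairs, "take an umbrella .")
print(forward, backward)
```

The last line of a conversation never appears as a request, so without the reverse direction it could not be answered at all.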

In [0]:
def splitConversationsHere(conversations):
    data = []
    for conversation in tqdm(conversations):
        lines = conversation['lines']
        for i in range(len(lines) - 1):
            # Tokenized, lowercased versions are used for matching
            request = ' '.join(extractText(lines[i]['text'], fast_preprocessing=False))
            reply = ' '.join(extractText(lines[i + 1]['text'], fast_preprocessing=False))

            # Columns: tokenized request, tokenized reply, raw request, raw reply
            data.append((request, reply, lines[i]['text'], lines[i + 1]['text']))

    return data

In [90]:
convs = splitConversationsHere(conversations)

100%|██████████| 83097/83097 [01:13<00:00, 1131.00it/s]


In [0]:

dataDF = pd.DataFrame(data=data)
convDF = pd.DataFrame(data=convs)
alllines = pd.DataFrame(data=dataDF[0])

In [0]:
dataDF.to_csv(path_or_buf="all_lines_DF_only_cornell.txt", index=False, header=False)
convDF.to_csv(path_or_buf="all_convs_DF_only_cornell.txt", index=False, header=False)
alllines.to_csv(path_or_buf="all_lines_txt_only_cornell.txt", index=False, header=False)

To learn sentence embeddings, the **sent2vec** module was used. The module can be found here:

[sent2vec](https://github.com/epfml/sent2vec)

In [0]:
"""!wget -O sent2vec.zip https://codeload.github.com/epfml/sent2vec/zip/master
!unzip sent2vec.zip """

In [0]:
%cd "/content/drive/My Drive/sent2vec/sent2vec-master"
!pip install .
!make

In [66]:
%cd "/content/drive/My Drive/sent2vec/"

/content/drive/My Drive/sent2vec


# **Training**


**Unsupervised embeddings for whole sentences were learnt on the corpus data. Training was run for 9 epochs to learn embeddings of dimension 300.**

In [0]:
! sent2vec-master/./fasttext sent2vec -input "/content/drive/My Drive/HonorProject/NLP Honor/all_lines_txt_only_cornell.txt" -output my_sent2vec_cornell_model2 -dropoutK 0 -dim 300 -epoch 9 -lr 0.2 -thread 10 -bucket 100000

In [92]:
%cd "/content/drive/My Drive/sent2vec/"
import sent2vec
sent2vec_model = sent2vec.Sent2vecModel()
sent2vec_model.load_model('my_sent2vec_cornell_model.bin')

/content/drive/My Drive/sent2vec


# Experimental Observation

Because embeddings are learnt for each sentence as a whole, this approach works better than approaches based on word embeddings, since the order of the words is also taken into consideration. The punctuation and the tone of the sentence are taken into account as well, as confirmed by the responses changing with the tone of the input:

In [166]:
predict("I don't know.")

'Well, I recommend knowing before speaking. The law leaves much room for interpretation -- but very little for self-doubt.'

In [167]:
predict("I don't know?")

"You do, you do. You're just not saying."


## Caching

**Using the learnt model, sentence embeddings were computed and stored for all the lines in the dataset. This reduces the computation needed during inference. It also helps reduce RAM usage while the bot is inactive.**

In [0]:
import os
import pickle
import numpy as np

# Make sure the output directory exists before writing to it
os.makedirs("Sentence_Embeddings", exist_ok=True)

count = alllines.shape[0]
sent_vectors = np.zeros((count, 300), dtype=np.float32)

# The preprocessed sentences double as lookup keys
sent_ids = alllines[0].values

for i, sentence in enumerate(sent_ids):
    # embed_sentence returns a 2-D array; row 0 is the sentence vector
    emb_sent2vec = sent2vec_model.embed_sentence(sentence)
    sent_vectors[i, :] = emb_sent2vec[0]

with open("Sentence_Embeddings/sent2vec_vectors_only_cornell.pickle", 'wb') as f:
    # Pickle the (sent_ids, sent_vectors) tuple using the highest protocol available.
    pickle.dump((sent_ids, sent_vectors), f, pickle.HIGHEST_PROTOCOL)

# Inference

1. When a new sentence comes in, the embedding for the sentence as a whole is extracted from the model.

2. A nearest-neighbor search (cosine distance) then finds the closest sentence in the corpus.

3. A line from a conversation involving that nearest neighbor is returned as the response. If there are several candidates, one is returned at random for an element of surprise.
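
The three steps can be sketched on toy data with the standard library only. Everything below is made up for illustration: the 2-D vectors, the `replies` table and the names `cosine_distance`, `nearest_index` and `respond` are assumptions, and the notebook itself relies on scikit-learn's `pairwise_distances_argmin` with `metric='cosine'` on 300-dimensional sent2vec vectors.

```python
import math
import random


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def nearest_index(query, vectors):
    """Index of the stored vector closest to the query."""
    return min(range(len(vectors)), key=lambda i: cosine_distance(query, vectors[i]))


# Toy corpus: one made-up 2-D "embedding" per preprocessed sentence
corpus = ["is it raining ?", "what is your name ?"]
vectors = [[1.0, 0.0], [0.0, 1.0]]
replies = {
    "is it raining ?": ["yeah , just started ."],
    "what is your name ?": ["i 'm carter ."],
}


def respond(query_vector, rng=random):
    """Find the nearest corpus sentence, then pick one of its replies at random."""
    matched = corpus[nearest_index(query_vector, vectors)]
    return rng.choice(replies[matched])


print(respond([0.9, 0.1]))
```

A linear scan like this is enough for a toy corpus; for the few hundred thousand lines used here, a vectorised routine such as the scikit-learn call is the more practical choice.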

In [58]:
%cd "/content/drive/My Drive/HonorProject/NLP Honor"
import pandas as pd

convDF = pd.read_csv("all_convs_DF_only_cornell.txt", header=None)
alllines = pd.read_csv("all_lines_txt_only_cornell.txt", header=None)

/content/drive/My Drive/HonorProject/NLP Honor


In [0]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import pairwise_distances_argmin

In [0]:
import pickle
# Use a context manager so the file handle is closed after loading
with open('sent2vec_vectors_only_cornell.pickle', 'rb') as f:
    sent_ids, sent_vectors = pickle.load(f)

In [0]:
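def predict(st):
    # Tokenize the input the same way as the corpus lines
    st = ' '.join(nltk.word_tokenize(st.lower()))

    # Index of the corpus sentence closest to the input (cosine distance)
    arg = pairwise_distances_argmin(sent2vec_model.embed_sentence(st), sent_vectors, metric='cosine')[0]
    which = sent_ids[arg]

    # Prefer the raw replies (column 3) that followed the matched line
    candidates = convDF.loc[convDF[0] == which][3].values.tolist()
    if not candidates:
        # The matched line only occurs as a reply: answer with the raw line
        # that preceded it (column 2), i.e. use the pair in reverse
        candidates = convDF.loc[convDF[1] == which][2].values.tolist()

    return random.choice(candidates).strip()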

**The following are some examples of responses:**

In [125]:
predict("What are you doing?")

'You know what Mr. White said. A photographer always goes after a story.'

In [126]:
predict("Is it raining?")

'Yeah, just started.'

In [143]:
predict("What is your name?")

"That's cool.  We're gettin' to know each other. This is a good thing. I'm Carter."

In [144]:
predict("Go away.")

"You're supposed to watch me and entertain me, and make me appreciate the brief but happy years of childhood."

# Of course, there are some bad and weird responses too!

In [145]:
predict("I am sad.")

'I assume you are loitering here to learn what efficiency rating I plan to give your cadets.'

In [146]:
predict("what is your aim?")

"I'm a drunkard."

In [147]:
predict("What is your age?")

'To find the Holy Grail.'

# Probable Problems



1.   The training data was not diverse.
2.   The model was trained for only a small number of epochs.
3.   The selection step itself needs modeling. A better inference method could probably be learnt.
4.   The responses are casual; the bot cannot carry out a meaningful conversation.



# Alternative

**I would encourage you to look at the other approach (linked at the beginning of this notebook) and to provide suggestions if you like! Although still experimental, that approach is more appropriate. After all, conversations are rarely selective, are they?!**

[Providing the link here again!](https://github.com/ragibhassantonoy/NLP_Honor/blob/master/attention_week5_honor.ipynb)