# Acknowledgements

This project borrows code from the following sources:

    The PyTorch chatbot tutorial: https://pytorch.org/tutorials/beginner/chatbot_tutorial.html

    Yuan-Kuei Wu’s pytorch-chatbot implementation: https://github.com/ywk991112/pytorch-chatbot
            
    Sean Robertson’s practical-pytorch seq2seq-translation example: https://github.com/spro/practical-pytorch/tree/master/seq2seq-translation
            
    FloydHub’s Cornell Movie Corpus preprocessing code: https://github.com/floydhub/textutil-preprocess-cornell-movie-corpus

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import torch
from torch.jit import script, trace
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
import csv
import random
import re
import os
import unicodedata
import codecs
from io import open
import itertools
import math
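# Project-local module; the helpers used below (loadLines, loadMovieConversations,
# loadSitcomConversations, extractSentencePairs, extractSentencePairs1, printLines)
# are not defined in this notebook and presumably come from here.
from preprocessing import *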
import pandas as pd
USE_CUDA = torch.cuda.is_available()
device = torch.device("cuda" if USE_CUDA else "cpu")

In [21]:
!pip install tqdm

Collecting tqdm
  Downloading https://files.pythonhosted.org/packages/6c/4b/c38b5144cf167c4f52288517436ccafefe9dc01b8d1c190e18a6b154cd4a/tqdm-4.31.1-py2.py3-none-any.whl (48kB)
Installing collected packages: tqdm
Successfully installed tqdm-4.31.1




In [2]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [3]:
torch.__version__

'0.4.1'

In [4]:
CORNELL_PATH = "data/cornell"
SITCOM_PATH = "data/sitcom"
CHANDLER_PATH = "data/chandler"

# Load and preprocess data

In [5]:
lines = {}
conversations = []
MOVIE_LINES_FIELDS = ["lineID", "characterID", "movieID", "character", "text"]
MOVIE_CONVERSATIONS_FIELDS = ["character1ID", "character2ID", "movieID", "utteranceIDs"]
# Load lines and process conversations
print("\nProcessing corpus...")
lines = loadLines(os.path.join(CORNELL_PATH, "movie_lines.txt"), MOVIE_LINES_FIELDS)
movie_conversations = loadMovieConversations(
    os.path.join(CORNELL_PATH, "movie_conversations.txt"),
    lines,
    MOVIE_CONVERSATIONS_FIELDS,
)


Processing corpus...


In [6]:
movie_conversations[0]

{'character1ID': 'u0',
 'character2ID': 'u2',
 'lines': [{'character': 'BIANCA',
   'characterID': 'u0',
   'lineID': 'L194',
   'movieID': 'm0',
   'text': 'Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.\n'},
  {'character': 'CAMERON',
   'characterID': 'u2',
   'lineID': 'L195',
   'movieID': 'm0',
   'text': "Well, I thought we'd start with pronunciation, if that's okay with you.\n"},
  {'character': 'BIANCA',
   'characterID': 'u0',
   'lineID': 'L196',
   'movieID': 'm0',
   'text': 'Not the hacking and gagging and spitting part.  Please.\n'},
  {'character': 'CAMERON',
   'characterID': 'u2',
   'lineID': 'L197',
   'movieID': 'm0',
   'text': "Okay... then how 'bout we try out some French cuisine.  Saturday?  Night?\n"}],
 'movieID': 'm0',
 'utteranceIDs': "['L194', 'L195', 'L196', 'L197']\n"}
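The loaders `loadLines` and `loadMovieConversations` are not shown in this notebook. The following is a minimal sketch of how such loaders could produce the nested structure above, assuming the Cornell files separate fields with ` +++$+++ ` (as in the PyTorch chatbot tutorial credited in the acknowledgements). The names `parse_records` and `attach_lines` and the shortened sample rows are hypothetical, not the project's own code.

```python
import ast

SEP = " +++$+++ "  # field separator assumed for the Cornell corpus files


def parse_records(raw_lines, fields):
    """Split each raw line on SEP and map the values onto the field names."""
    return [dict(zip(fields, raw.split(SEP))) for raw in raw_lines]


def attach_lines(conversations, lines_by_id):
    """Resolve each conversation's utterance IDs into full line records."""
    for conv in conversations:
        # utteranceIDs is stored as the text of a Python list, e.g. "['L194', 'L195']\n"
        utterance_ids = ast.literal_eval(conv["utteranceIDs"].strip())
        conv["lines"] = [lines_by_id[line_id] for line_id in utterance_ids]
    return conversations


LINE_FIELDS = ["lineID", "characterID", "movieID", "character", "text"]
CONV_FIELDS = ["character1ID", "character2ID", "movieID", "utteranceIDs"]

# Shortened, illustrative rows (not copied verbatim from the corpus)
raw_lines = [
    "L194 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Can we make this quick?\n",
    "L195 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Well, I thought we'd start.\n",
]
raw_convs = ["u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195']\n"]

lines_by_id = {rec["lineID"]: rec for rec in parse_records(raw_lines, LINE_FIELDS)}
demo_conversations = attach_lines(parse_records(raw_convs, CONV_FIELDS), lines_by_id)
print(demo_conversations[0]["lines"][1]["character"])
```

Using `ast.literal_eval` rather than `eval` is a design choice of this sketch: it parses the list of IDs without executing arbitrary code from the data file.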

In [10]:
sitcom_conversations = loadSitcomConversations(SITCOM_PATH)

Fraiser.txt
 No. of lines in Fraiser.txt => 65853
 No. of scenes written from  Fraiser.txt => 1666
Friends.txt
 No. of lines in Friends.txt => 56482
 No. of scenes written from  Friends.txt => 3707
HIMYM1.txt
 No. of lines in HIMYM1.txt => 31896
 No. of scenes written from  HIMYM1.txt => 205
Seinfield.txt
 No. of lines in Seinfield.txt => 51254
 No. of scenes written from  Seinfield.txt => 136


In [16]:
sitcom_conversations[1666]

[" There's nothing to tell! He's just some guy I work with!",
 " C'mon, you're going out with the guy! There's gotta be something wrong with him!",
 ' All right Joey, be nice. So does he have a hump? A hump and a hairpiece?',
 ' Wait, does he eat chalk?',
 " Just, 'cause, I don't want her to go through what I went through with Carl- oh!",
 " Okay, everybody relax. This is not even a date. It's just two people going out to dinner and- not having sex.",
 ' Sounds like a date to me.']

In [17]:
delimiter = '\t'
# Unescape the delimiter
delimiter = str(codecs.decode(delimiter, "unicode_escape"))

# Combined dataset: movie pairs followed by sitcom pairs, one tab-separated pair per row
with open("data/formatted_lines_all.txt", 'w', encoding='utf-8') as outputfile:
    writer = csv.writer(outputfile, delimiter=delimiter, lineterminator='\n')
    for pair in extractSentencePairs(movie_conversations):
        writer.writerow(pair)
    for pair in extractSentencePairs1(sitcom_conversations):
        writer.writerow(pair)

# Sitcom-only dataset, written in the same format
with open("data/formatted_lines_sitcom.txt", 'w', encoding='utf-8') as outputfile:
    writer = csv.writer(outputfile, delimiter=delimiter, lineterminator='\n')
    for pair in extractSentencePairs1(sitcom_conversations):
        writer.writerow(pair)
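The pair-extraction helpers live outside this notebook. Judging by the sample output, each utterance is paired with the one that follows it in the same conversation. A minimal sketch of that idea is given here, assuming a conversation is available as a plain list of utterance strings; `consecutive_pairs` and its `strip` flag are hypothetical names, not the project's `extractSentencePairs` or `extractSentencePairs1`.

```python
def consecutive_pairs(utterances, strip=True):
    """Pair each utterance with the one that follows it in the same conversation."""
    pairs = []
    for query, reply in zip(utterances, utterances[1:]):
        if strip:
            query, reply = query.strip(), reply.strip()
        # Skip pairs where either side is empty after optional stripping
        if query and reply:
            pairs.append([query, reply])
    return pairs


# Illustrative three-turn conversation (made up for this sketch)
demo_pairs = consecutive_pairs(["Hi.\n", "Hello.\n", "Bye.\n"])
print(demo_pairs)
```

The sitcom rows in the output files keep their leading space, so the project's sitcom helper apparently does not strip whitespace; `strip=False` would mirror that behaviour in this sketch.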

In [18]:
# Print a sample of lines
print("\nSample lines from file:")
printLines("data/formatted_lines_all.txt")


Sample lines from file:
b"Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.\tWell, I thought we'd start with pronunciation, if that's okay with you.\r\n"
b"Well, I thought we'd start with pronunciation, if that's okay with you.\tNot the hacking and gagging and spitting part.  Please.\r\n"
b"Not the hacking and gagging and spitting part.  Please.\tOkay... then how 'bout we try out some French cuisine.  Saturday?  Night?\r\n"
b"You're asking me out.  That's so cute. What's your name again?\tForget it.\r\n"
b"No, no, it's my fault -- we didn't have a proper introduction ---\tCameron.\r\n"
b"Cameron.\tThe thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does.\r\n"
b"The thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does.\tSeems like she could get a date easy enough...\r\n

In [19]:
printLines("data/formatted_lines_sitcom.txt")

b' Listen to yourself, Bob! You follow her to work, you eavesdrop on her calls, you open her mail. The minute you started doing these things, the relationship was over! Thank you for your call. Roz, I think we have time for one more?\t Yes, Dr Crane. On line four, we have Russell from Kirkland.\r\n'
b" Yes, Dr Crane. On line four, we have Russell from Kirkland.\t Hello, Russell. This is Dr Frasier Crane; I'm listening.\r\n"
b" Hello, Russell. This is Dr Frasier Crane; I'm listening.\t Well, I've been feeling sort of, uh, you know, depressed lately. My life's not going anywhere and-and, er, it's not that bad. It's just the same old apartment, same old job...\r\n"
b" Well, I've been feeling sort of, uh, you know, depressed lately. My life's not going anywhere and-and, er, it's not that bad. It's just the same old apartment, same old job...\t Er, Russell, we're just about at the end of our hour. Let me see if I can cut to the chase by using myself as an example. Six months ago, I was livi
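The `b'...'` prefix and the visible `\t` and `\r\n` escapes in the output above suggest that `printLines` reads the file in binary mode and prints raw bytes. A minimal sketch of such a helper follows, modelled on the PyTorch chatbot tutorial; `print_lines`, its default `n`, and the temporary demo file are assumptions of this sketch rather than the project's actual code.

```python
import os
import tempfile


def print_lines(path, n=10):
    """Print the first n lines of a file as raw bytes and return them."""
    with open(path, "rb") as datafile:
        lines = datafile.readlines()
    for line in lines[:n]:
        print(line)
    return lines[:n]


# Demonstrate on a throwaway file so nothing under data/ is touched
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_pairs.txt")
    with open(demo_path, "wb") as f:
        f.write(b"Hello.\tHi there.\r\nHow are you?\tFine.\r\n")
    shown = print_lines(demo_path, n=1)
```

Reading bytes makes the delimiter and line endings visible. The `\r\n` endings are consistent with text-mode newline translation on Windows when a file is opened for writing without `newline=''`, although the notebook does not state the platform.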

In [69]:
# TODO: Prepare the Chandler data