## Dialogue Dataset

This notebook previews several public dialogue datasets available on the web.

In [1]:
import pandas as pd

### Ubuntu Dialogue Dataset

**Description**: Mainly used for **retrieval-based chatbots**, in particular for testing response selection algorithms. The dataset contains almost 1 million multi-turn dialogues, with a total of over 7 million utterances and 100 million words.

[download link](http://dataset.cs.mcgill.ca/ubuntu-corpus-1.0/)

[Github link](https://github.com/rkadlec/ubuntu-ranking-dataset-creator)


Papers:
1. Ryan Lowe, Nissan Pow, Iulian V. Serban and Joelle Pineau, **"The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems"**, SIGDial 2015.
[arXiv:1506.08909](https://arxiv.org/abs/1506.08909)


pros:
1. Multi turn dialogues already processed;
2. Large dataset.

cons:
1. Task-oriented;
2. Restricted to Ubuntu technical-support topics.


Related works:

1. [Multi-Turn Response Selection for Chatbots with Deep Attention Matching Network](https://www.aclweb.org/anthology/P18-1103)

2. [Sequential Matching Network: A New Architecture for Multi-turn Response Selection in Retrieval-Based Chatbots](https://arxiv.org/pdf/1612.01627.pdf)

In [5]:
ubuntu_datapath = '../datasets/UbuntuDialogs/13/2.tsv'

In [11]:
ubuntu = pd.read_csv(ubuntu_datapath, sep='\t', header=None)
# Use a flat list of names; a nested list would build a MultiIndex instead of plain columns.
ubuntu.columns = ['time', 'ID1', 'ID2', 'context']
ubuntu

Unnamed: 0,time,ID1,ID2,context
0,2004-11-08T11:19:00.000Z,Gmail,,so i'll add it to my sources.list and use apt-src
1,2004-11-08T11:21:00.000Z,Gmail,,and just use ubuntu
2,2004-11-08T11:22:00.000Z,Gmail,,i say install ubuntu get (what its name vm-war...
3,2004-11-08T11:23:00.000Z,Gmail,,its a lot more fun having not have to reboot t...
4,2004-11-08T11:23:00.000Z,Gmail,,bur[n] er: there is a cheap one i here like $15
5,2004-11-08T11:25:00.000Z,Gmail,,and linux is way better than xp and is free
6,2004-11-08T11:25:00.000Z,Gmail,,i rather save $200 and run a better os
7,2004-11-08T11:26:00.000Z,Gmail,,linux to winblows is like choc to poo
8,2004-11-08T11:26:00.000Z,lifeless,Gmail,you have chocolate poo?
9,2004-11-08T11:29:00.000Z,Gmail,lifeless,i said linux compard to winblows is like choc ...
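
The table above shows the same speaker often sending several short messages in a row. For response selection it can help to collapse such runs into single turns. The following is a minimal sketch under that assumption, not part of the original preprocessing: the helper name `merge_consecutive_turns` is hypothetical, and the sample rows are copied from the output above.

```python
import pandas as pd

# Sample rows shaped like the table above:
# ID1 is the speaker, ID2 the addressee, context the utterance.
sample = pd.DataFrame({
    'time': ['2004-11-08T11:25:00.000Z', '2004-11-08T11:25:00.000Z', '2004-11-08T11:26:00.000Z'],
    'ID1': ['Gmail', 'Gmail', 'lifeless'],
    'ID2': [None, None, 'Gmail'],
    'context': [
        'and linux is way better than xp and is free',
        'i rather save $200 and run a better os',
        'you have chocolate poo?',
    ],
})

def merge_consecutive_turns(df):
    """Collapse consecutive rows from the same speaker into one turn."""
    # A new turn starts whenever the speaker differs from the previous row.
    turn_id = (df['ID1'] != df['ID1'].shift()).cumsum().rename('turn')
    return (df.groupby(turn_id)
              .agg(speaker=('ID1', 'first'), text=('context', ' '.join))
              .reset_index(drop=True))

turns = merge_consecutive_turns(sample)
print(turns)
```

Each row of `turns` then holds one speaker and the concatenation of that speaker's consecutive messages.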


### Film corpus V2

**Description**:
This corpus is an updated version of the Film Corpus 1.0. It contains the complete script texts of 1068 films as txt files, scraped from imsdb.com in November 2015 using Scrapy. It also contains 960 film scripts in which the dialogue has been separated from the scene descriptions.

[Download link](https://nlds.soe.ucsc.edu/fc2)

Papers:
1. Walker, Marilyn A., Ricky Grant, Jennifer Sawyer, Grace I. Lin, Noah Wardrip-Fruin, and Michael Buell. ["Perceived or Not Perceived: Film Character Models for Expressive NLG."](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.380.9026&rep=rep1&type=pdf) BEST PAPER AWARD. In International Conference on Interactive Digital Storytelling (ICIDS), Vancouver, Canada, 2011.


2. Marilyn A. Walker, Grace I. Lin, Jennifer E. Sawyer. ["An Annotated Corpus of Film Dialogue for Learning and Characterizing Character Style."](http://www.lrec-conf.org/proceedings/lrec2012/pdf/1114_Paper.pdf) In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.


In [13]:
filmcorpus_datapath = '../datasets/FilmCorpusV2/dialogs/Action/15minutes_dialog.txt'

In [19]:
filmcorpus = pd.read_csv(filmcorpus_datapath,sep='\t',header=None,skiprows=5)

In [105]:
import re,chardet, os

In [250]:
from smart_open import smart_open
class FilmCorpus(object):
    def __init__(self, dirname):
        self.dirname = dirname
 
    def __iter__(self):
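        regex = r"^\b[A-Z].*?[A-Z]+\b$"  # unused alternative for detecting speaker lines
        for line in smart_open(self.dirname, 'rb'):
            encode_type = chardet.detect(line)
            # Decode with the detected encoding and reassign to the same variable;
            # fall back to UTF-8 when chardet cannot identify an encoding.
            line = line.decode(encode_type['encoding'] or 'utf-8')
            # Remove bracketed annotations and text wrapped in ♪, #, = or ¶ markers.
            line = re.sub(u"\\(.*?\\)|\\{.*?}|\\[.*?]|\\♪.*?♪|\\#.*?#|\\=.*?=|\\¶.*?¶", "", line)
            line = line.strip()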
            # label = 1 if re.findall(regex,line) else 0
            label = 1 if line.isupper() else 0
            if line:
                yield {'label':label,'line':line}

In [251]:
test = FilmCorpus(filmcorpus_datapath)
test = pd.DataFrame(test)

In [253]:
test.label.sum()

1035

In [256]:
speakers = []
lines = []
i = test.label.to_list().index(1)
while i < len(test)-1:
    if test.loc[i].label==1:
        speaker = test.loc[i].line
        speakers.append(speaker)
        i += 1
        if i == len(test)-1:
            break
        line = test.loc[i].line
        while test.loc[i].label!=1:
            line = ' '.join((line,test.loc[i].line))
            i += 1
            if i == len(test)-1: break
        lines.append(line)
    

In [257]:
pd.DataFrame({'Speaker':speakers,'Line':lines})

Unnamed: 0,Speaker,Line
0,EMIL,Just do what I do. Say the same thing I Just ...
1,OLEG,Okay. Okay.
2,EMIL,Don't fool around. Don't fool around.
3,OLEG,Okay. Okay.
4,EMIL,Did you hear what I said? Did you hear what I ...
5,OLEG,I want to document my trip to America. I want ...
6,IMMIGRATION OFFICER,"Next. Next. Could I see your documents, please?"
7,EMIL,Yes sir. Yes sir.
8,IMMIGRATION OFFICER,What is your intended purpose of your What is ...
9,EMIL,Two weeks holiday. Two weeks holiday.
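
In the output above, the first line of every turn appears twice (for example `Okay. Okay.`), because the loop initialises `line` with the current row and then joins that same row again. The sketch below shows one way to group speaker labels with their lines without the repetition. The function name `group_speaker_lines` and the toy input are illustrative assumptions; it expects the same `label` and `line` columns that `FilmCorpus` yields.

```python
import pandas as pd

def group_speaker_lines(frame):
    """Pair each speaker row (label == 1) with the text rows that follow it."""
    speakers, lines = [], []
    current_speaker, buffer = None, []
    for label, text in zip(frame['label'], frame['line']):
        if label == 1:
            # A new speaker starts: flush the previous speaker's lines, if any.
            if current_speaker is not None and buffer:
                speakers.append(current_speaker)
                lines.append(' '.join(buffer))
            current_speaker, buffer = text, []
        elif current_speaker is not None:
            # Text before the first speaker label is skipped.
            buffer.append(text)
    # Flush the final turn.
    if current_speaker is not None and buffer:
        speakers.append(current_speaker)
        lines.append(' '.join(buffer))
    return pd.DataFrame({'Speaker': speakers, 'Line': lines})

# Toy input (made up for illustration), using the label/line layout from FilmCorpus.
toy = pd.DataFrame({
    'label': [0, 1, 0, 1, 0, 0],
    'line': ['a scene description', 'EMIL', "Don't fool around.", 'OLEG', 'Okay.', 'Sure.'],
})
result = group_speaker_lines(toy)
print(result)
```

Speakers that are followed immediately by another speaker label, with no text in between, are dropped so that `Speaker` and `Line` always stay the same length.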


### TV data

**Description**: Dialogue lines from The Big Bang Theory (TBBT), Friends, and the Cornell movie-dialog corpus, annotated for emotion analysis.

pros:
1. Neatly cleaned;
2. Well-separated lines.

cons:
1. Small dataset (412826 rows);
2. Dialogue boundaries are not marked.

In [31]:
TVdata_datapath = '../datasets/TVdata/2Processed/Aggregated_Dialogues.csv'

In [32]:
TVdata = pd.read_csv(TVdata_datapath)



In [38]:
TVdata.head(5)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Actor,MovieId,SeriesName,Subtitle,Emotionalities Lexicon Based,Emotion Categories Lexicon Based
0,0,0,Sheldon,,BBT,so if a photon is directed through a plane wi...,21,Neutral
1,1,1,Leonard,,BBT,"agreed, what is your point?",21,Neutral
2,2,2,Sheldon,,BBT,"there is no point, i just think it is a good ...",21,Neutral
3,3,3,Leonard,,BBT,excuse me?,21,Neutral
4,4,4,Receptionist,,BBT,hang on.,21,Neutral


### The Office (US) Episode Scripts

**Description**: 59909 rows, 9161 scenes.

Read more: https://www.springfieldspringfield.co.uk/episode_scripts.php?tv-show=the-office-us

[Download link](https://data.world/abhinavr8/the-office-scripts-dataset)

pros:
1. Neatly cleaned;
2. Labelled with speakers and dialogue boundaries (scenes).

cons:
1. Small dataset;
2. Not every scene is a dialogue: only 5987 of the 9161 scenes contain more than one line.

In [34]:
officescript_datapath = '../datasets/US-office-scripts/the-office-lines - scripts.csv'

In [35]:
officescript = pd.read_csv(officescript_datapath)

In [50]:
officescript.head(6)

Unnamed: 0,id,season,episode,scene,line_text,speaker,deleted
0,1,1,1,1,All right Jim. Your quarterlies look very good...,Michael,False
1,2,1,1,1,"Oh, I told you. I couldn't close it. So...",Jim,False
2,3,1,1,1,So you've come to the master for guidance? Is ...,Michael,False
3,4,1,1,1,"Actually, you called me in here, but yeah.",Jim,False
4,5,1,1,1,"All right. Well, let me show you how it's done.",Michael,False
5,6,1,1,2,"[on the phone] Yes, I'd like to speak to your ...",Michael,False


In [51]:
test = officescript.groupby(['season','episode','scene']).count().reset_index()
test.loc[test.id>1].count()

season       5987
episode      5987
scene        5987
id           5987
line_text    5987
speaker      5987
deleted      5987
dtype: int64

In [49]:
test.describe()

Unnamed: 0,season,episode,scene,id,line_text,speaker,deleted
count,9161.0,9161.0,9161.0,9161.0,9161.0,9161.0,9161.0
mean,5.193865,11.544045,27.895317,6.53957,6.53957,6.53957,6.53957
std,2.442073,7.095069,19.245451,7.17049,7.17049,7.17049,7.17049
min,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,3.0,5.0,13.0,1.0,1.0,1.0,1.0
50%,5.0,11.0,25.0,4.0,4.0,4.0,4.0
75%,7.0,18.0,39.0,10.0,10.0,10.0,10.0
max,9.0,26.0,116.0,75.0,75.0,75.0,75.0
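
Because each `(season, episode, scene)` triple marks a dialogue boundary, the scenes can be turned into dialogues directly. The following is a rough sketch, assuming the column names shown above. The helper `scenes_as_dialogues` and its `min_lines` parameter are hypothetical, and the toy rows reuse two lines from the table plus a placeholder single-line scene.

```python
import pandas as pd

# Toy rows with the same columns as officescript.
toy = pd.DataFrame({
    'season': [1, 1, 1],
    'episode': [1, 1, 1],
    'scene': [1, 1, 2],
    'speaker': ['Jim', 'Michael', 'Michael'],
    'line_text': [
        'Actually, you called me in here, but yeah.',
        "All right. Well, let me show you how it's done.",
        'placeholder text for a single-line scene',
    ],
})

def scenes_as_dialogues(df, min_lines=2):
    """Map (season, episode, scene) to a list of (speaker, line) turns."""
    keys = ['season', 'episode', 'scene']
    # Drop scenes that are too short to count as a dialogue.
    kept = df.groupby(keys).filter(lambda g: len(g) >= min_lines)
    return {
        tuple(int(k) for k in key): list(zip(group['speaker'], group['line_text']))
        for key, group in kept.groupby(keys)
    }

dialogues = scenes_as_dialogues(toy)
print(dialogues)
```

With `min_lines=2`, the single-line scene is filtered out, which mirrors the `test.id>1` condition used in the count above.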


### Stanford Dialog Dataset
**Description**: Task-oriented dialogues with an in-car assistant, covering scheduling, weather, and navigation.

In [52]:
standford_datapath = '../datasets/StandfordDialog/kvret_train_public.json'

In [53]:
import json

In [56]:
#Read JSON data into the datastore variable
with open(standford_datapath, 'r') as f:
    datastore = json.load(f)


In [59]:
datastore[0]['dialogue']

[{'turn': 'driver',
  'data': {'end_dialogue': False,
   'utterance': "where's the nearest parking garage"}},
 {'turn': 'assistant',
  'data': {'end_dialogue': False,
   'requested': {'distance': True,
    'traffic_info': False,
    'poi_type': True,
    'address': True,
    'poi': False},
   'slots': {'distance': 'nearest', 'poi_type': 'parking garage'},
   'utterance': 'The nearest parking garage is Dish Parking at 550 Alester Ave. Would you like directions there? '}},
 {'turn': 'driver',
  'data': {'end_dialogue': False,
   'utterance': 'Yes, please set directions via a route that avoids all heavy traffic if possible. '}},
 {'turn': 'assistant',
  'data': {'end_dialogue': False,
   'requested': {'distance': False,
    'traffic_info': True,
    'poi_type': False,
    'address': False,
    'poi': True},
   'slots': {'traffic_info': 'avoid heavy traffic ', 'poi': 'dish parking'},
   'utterance': 'It looks like there is a road block being reported on the route but I will still find the 
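
The nested structure above alternates `driver` and `assistant` turns, each with its text under `data['utterance']`. As a minimal sketch (the helper name `utterance_pairs` is an assumption, and the sample is copied from the first two turns shown above), request/response pairs could be extracted like this:

```python
def utterance_pairs(dialogue):
    """Return (driver utterance, assistant reply) pairs from one dialogue."""
    pairs = []
    last_driver = None
    for turn in dialogue:
        utterance = turn['data']['utterance'].strip()
        if turn['turn'] == 'driver':
            last_driver = utterance
        elif turn['turn'] == 'assistant' and last_driver is not None:
            pairs.append((last_driver, utterance))
            last_driver = None
    return pairs

# The first two turns of the dialogue printed above, without the slot annotations.
sample_dialogue = [
    {'turn': 'driver',
     'data': {'end_dialogue': False,
              'utterance': "where's the nearest parking garage"}},
    {'turn': 'assistant',
     'data': {'end_dialogue': False,
              'utterance': 'The nearest parking garage is Dish Parking at 550 Alester Ave. Would you like directions there? '}},
]
pairs = utterance_pairs(sample_dialogue)
print(pairs)
```

On the loaded data, the same function could be applied as `utterance_pairs(datastore[0]['dialogue'])`.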

### Cornell Dialogue Dataset

**Description**: 220,579 conversational exchanges between 10,292 pairs of movie characters; involves 9,035 characters from 617 movies; in total 304,713 utterances.

pros:

1. neatly cleaned, aligned with speakers

cons:

1. not very large

In [12]:
cornell_datapath = '../datasets/cornell-corpus/movie_conversations.tsv'

In [18]:
cornell = pd.read_csv(cornell_datapath,sep='\t',header=None)
cornell.columns = ['ID1','ID2','MovieID','context']

In [19]:
cornell.head(5)

Unnamed: 0,ID1,ID2,MovieID,context
0,u0,u2,m0,['L194' 'L195' 'L196' 'L197']
1,u0,u2,m0,['L198' 'L199']
2,u0,u2,m0,['L200' 'L201' 'L202' 'L203']
3,u0,u2,m0,['L204' 'L205' 'L206']
4,u0,u2,m0,['L207' 'L208']


In [20]:
cornell.count()

ID1        83097
ID2        83097
MovieID    83097
context    83097
dtype: int64

In [22]:
83097/617

134.67909238249595
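
The `context` column stores each conversation as a string such as `['L194' 'L195']` rather than a Python list. A small sketch of how the line IDs might be recovered, assuming every ID has the form `L` followed by digits as in the rows above (the helper `parse_line_ids` is hypothetical):

```python
import re

import pandas as pd

def parse_line_ids(context):
    """Extract line IDs such as 'L194' from the stringified context field."""
    return re.findall(r"L\d+", context)

# Two context values copied from the table above.
toy = pd.DataFrame({'context': ["['L194' 'L195' 'L196' 'L197']", "['L198' 'L199']"]})
toy['line_ids'] = toy['context'].apply(parse_line_ids)
# Number of utterances in each conversation.
toy['n_lines'] = toy['line_ids'].str.len()
print(toy[['line_ids', 'n_lines']])
```

The same two lines could be applied to the `cornell` frame to get the number of utterances per conversation.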

### Scripts data set


In [4]:
import pandas as pd
script_datapath = '/Users/yan/Documents/document/EPFL/MA2/semesterprj/datasets/scripts/script_data_set.csv'
pd.read_csv(script_datapath).head()

Unnamed: 0,Speaker,Line,Label,MovieID,MovieName
0,MODERATOR,Tonight we'll discuss a subject most of us see...,1,m478,Midnight-Cowboy
1,IRATE WOMAN,"They always put it that way, but well, all it ...",1,m478,Midnight-Cowboy
2,COOL WOMAN,"This, this image of the, the man eating woman....",1,m478,Midnight-Cowboy
3,SAD WOMAN,"No, I never had, well, whatever it is you call...",1,m478,Midnight-Cowboy
4,SAD WOMAN'S VOICE,... but it's a problem. A big problem. With so...,1,m478,Midnight-Cowboy
