## Converting the Cornell Movie-Dialogs Corpus into ConvoKit format 

This notebook demonstrates how a custom dataset can be converted into a ConvoKit `Corpus`.

The original version of the Cornell Movie-Dialogs Corpus can be downloaded from:  https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html. It contains the following files:

* __movie_characters_metadata.txt__ contains information about each movie character
* __movie_lines.txt__ contains the actual text of each utterance
* __movie_conversations.txt__ contains the structure of the conversations
* __movie_titles_metadata.txt__ contains information about each movie title

In [4]:
from tqdm import tqdm
from convokit import Corpus, Speaker, Utterance
from collections import defaultdict

Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.


### __Constructing the Corpus from a list of Utterances__

Corpus can be constructed from a list of utterances with:

    corpus = Corpus(utterances= custom_utterance_list)
    
Our goal is to convert the original dataset into this "custom_utterance_list" and let ConvoKit do the rest of the conversion for us. Converting the Movie-Dialogs corpus takes the following steps, each explained in further detail below:

    1. create the Speaker objects for the speakers of the Utterances. Each speaker will correspond to a character in a movie.
    2. create the Utterance objects that correspond to utterances in the movie dialogs
    3. construct the Corpus from the list of Utterance objects
    4. incorporate additional information as Conversation/Corpus metadata.

We will additionally show how some simple processing can be done. 

### __1. Creating speakers__

Each character in a movie is considered a speaker, and there are 9,035 characters in total in this dataset. We will read off metadata for each speaker from __movie_characters_metadata.txt__. 

In [5]:
# replace the directory with where your downloaded cornell movie dialogs corpus is saved
data_dir = "/cornell-movie-dialogs-corpus/"

In [6]:
with open(data_dir + "movie_characters_metadata.txt", "r", encoding='utf-8', errors='ignore') as f:
    speaker_data = f.readlines()

In general, we would directly use the name of the speaker as the speaker id. However, since only the first name is given for most movie characters, names may not map uniquely to characters. We will instead use the speaker_id provided in the original dataset as the speaker id, and save the actual character name in the speaker metadata. Note that this also means we cannot account for characters who appear in a series of movies (i.e., characters who share the same name and should actually be regarded as the same character).
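To make the ambiguity concrete, here is a minimal sketch that counts how often a character name recurs across speaker ids. The records are hypothetical and exist only for illustration; they are not taken from the dataset.

```python
from collections import Counter

# Hypothetical (speaker_id, character_name) pairs, for illustration only.
toy_speakers = [
    ("u100", "JOHN"),
    ("u200", "JOHN"),
    ("u300", "MARY"),
]

# Names that occur under more than one speaker id cannot serve as unique ids.
name_counts = Counter(name for _, name in toy_speakers)
ambiguous_names = sorted(name for name, n in name_counts.items() if n > 1)

# The speaker ids themselves are unique, so they are safe to use as ids.
ids_are_unique = len({sid for sid, _ in toy_speakers}) == len(toy_speakers)

print(ambiguous_names)
print(ids_are_unique)
```

The same check could be run on the real metadata once it has been parsed, to see how many names collide.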

For this dataset, we include the following information for each speaker:  
* name of the character.
* idx and name of the movie this character is from
* gender (available for 3,774 characters)
* position on movie credits (available for 3,321 characters)

In [7]:
speaker_meta = {}
for speaker in speaker_data:
    speaker_info = [info.strip() for info in speaker.split("+++$+++")]
    speaker_meta[speaker_info[0]] = {"character_name": speaker_info[1],
                               "movie_idx": speaker_info[2],
                               "movie_name": speaker_info[3],
                               "gender": speaker_info[4],
                               "credit_pos": speaker_info[5]}

In general, a Speaker object can be instantiated with `Speaker(id=<speaker_id>, meta=<speaker_metadata>)`. The following cell creates a Speaker object for each unique character in the dataset; these will be used to create Utterance objects later.

In [8]:
corpus_speakers = {k: Speaker(id = k, meta = v) for k,v in speaker_meta.items()}

Sanity checking speaker-level data:

In [9]:
print("number of speakers in the data = {}".format(len(corpus_speakers)))

number of speakers in the data = 9035


In [10]:
corpus_speakers['u0'].meta

{'character_name': 'BIANCA',
 'movie_idx': 'm0',
 'movie_name': '10 things i hate about you',
 'gender': 'f',
 'credit_pos': '4'}
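The availability counts quoted earlier can be checked with a small helper. This is a sketch that assumes `?` is the marker for an unknown value, as in the `u9028` record printed later in this notebook; `count_available` is a hypothetical helper name, and the toy dictionary holds only three records copied from the outputs in this notebook.

```python
# Toy records shaped like the values of speaker_meta.
# Assumption: "?" marks a missing value.
toy_meta = {
    "u0": {"character_name": "BIANCA", "gender": "f", "credit_pos": "4"},
    "u2": {"character_name": "CAMERON", "gender": "m", "credit_pos": "3"},
    "u9028": {"character_name": "COGHILL", "gender": "?", "credit_pos": "?"},
}


def count_available(meta_by_speaker, field, missing="?"):
    """Count speakers whose `field` is present and not the missing marker."""
    return sum(
        1
        for meta in meta_by_speaker.values()
        if meta.get(field, missing) != missing
    )


print(count_available(toy_meta, "gender"))
print(count_available(toy_meta, "credit_pos"))
```

Calling `count_available(speaker_meta, "gender")` on the full dictionary would give the corresponding count for the whole dataset.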

### __2. Creating utterance objects__
Utterances can be found in __movie_lines.txt__. There are 304,713 utterances in total. 

In [11]:
with open(data_dir + "movie_lines.txt", "r", encoding='utf-8', errors='ignore') as f:
    utterance_data = f.readlines()

To instantiate an utterance object, we generally need the following information (all ids should be of type string):
- id: representing the unique id of the utterance. 
- speaker: a ConvoKit speaker object representing the speaker giving the utterance.
- conversation_id: the id of the root utterance of the conversation.
- reply_to: id of the utterance this was a reply to.
- timestamp: timestamp of the utterance. 
- text: text of the utterance.

Additional information associated with the utterance may be saved as utterance level metadata. In this case, we consider the movie_id from which this utterance is extracted as an example for metadata. 

An utterance possessing all the above information may be instantiated with `Utterance(id=..., speaker=..., conversation_id=..., reply_to=..., timestamp=..., text=..., meta=...)`. We now create such `Utterance` objects for the utterances in our dataset. Normally we would provide `conversation_id` and `reply_to` at the time of instantiation, but we defer this until later because that information needs to be retrieved from a different file.
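Before building the objects, it may help to see the layout of a single raw line. The sketch below assumes five fields separated by `+++$+++` (line id, speaker id, movie id, character name, text); the sample line is reconstructed from the `L1044` utterance printed in this notebook, and `parse_movie_line` is a hypothetical helper, not part of ConvoKit.

```python
SEPARATOR = "+++$+++"
# Assumed field layout of movie_lines.txt.
FIELD_NAMES = ("id", "speaker_id", "movie_id", "character_name", "text")


def parse_movie_line(raw_line):
    """Split one raw line into named fields; return None if it is malformed."""
    fields = [part.strip() for part in raw_line.split(SEPARATOR)]
    if len(fields) != len(FIELD_NAMES):
        return None
    return dict(zip(FIELD_NAMES, fields))


sample = "L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!\n"
record = parse_movie_line(sample)
print(record)
```

Returning `None` for malformed lines lets the caller decide whether to skip or report them instead of failing partway through the file.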

In [12]:
utterance_corpus = {}

for utterance in tqdm(utterance_data):

    utterance_info = [info.strip() for info in utterance.split("+++$+++")]

    # each line should have 5 fields: line id, speaker id, movie id, character name, text
    if len(utterance_info) < 5:
        # report and skip malformed lines instead of reusing values from the previous line
        print(utterance_info)
        continue

    idx, speaker, movie_id, text = utterance_info[0], utterance_info[1], utterance_info[2], utterance_info[4]

    meta = {'movie_id': movie_id}

    # conversation_id & reply_to will be updated later, timestamp is not applicable
    utterance_corpus[idx] = Utterance(id=idx, speaker=corpus_speakers[speaker], text=text, meta=meta)

print("Total number of utterances = {}".format(len(utterance_corpus)))

100%|██████████| 304713/304713 [00:04<00:00, 71861.45it/s] 

Total number of utterances = 304713





If we inspect an Utterance object, it should now contain an id, the speaker who said it, the actual text, and the movie id as metadata:

In [13]:
utterance_corpus['L1044'] 

Utterance({'obj_type': 'utterance', 'meta': {'movie_id': 'm0'}, 'vectors': [], 'speaker': Speaker({'obj_type': 'speaker', 'meta': {'character_name': 'CAMERON', 'movie_idx': 'm0', 'movie_name': '10 things i hate about you', 'gender': 'm', 'credit_pos': '3'}, 'vectors': [], 'owner': None, 'id': 'u2'}), 'conversation_id': None, 'reply_to': None, 'timestamp': None, 'text': 'They do to!', 'owner': None, 'id': 'L1044'})

#### __Updating conversation_id and reply_to information for utterances__
__movie_conversations.txt__ provides the structure of the conversations that organize the above utterances. This will allow us to add the missing conversation_id (the id of the root utterance) and reply_to information to individual utterances.
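The linking rule can be sketched on its own before applying it to the `Utterance` objects: the first id in a conversation is the root, and every other utterance replies to the one before it. The id list below is illustrative (chosen to be consistent with the `L666499` utterance shown in this notebook), and `link_conversation` is a hypothetical helper.

```python
import ast


def link_conversation(raw_id_list):
    """Map each utterance id to (conversation_id, reply_to) for a linear conversation."""
    # The id list is stored as the string form of a Python list.
    utterance_ids = ast.literal_eval(raw_id_list)
    conversation_id = utterance_ids[0]

    links = {}
    previous_id = None  # the root utterance replies to nothing
    for utt_id in utterance_ids:
        links[utt_id] = (conversation_id, previous_id)
        previous_id = utt_id
    return links


links = link_conversation("['L666497', 'L666498', 'L666499']")
print(links)
```

`ast.literal_eval` is used rather than `eval` because it only accepts Python literals, which is the safer choice when parsing text read from a file.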

In [14]:
with open(data_dir + "movie_conversations.txt", "r", encoding='utf-8', errors='ignore') as f:
    convo_data = f.readlines()

In [15]:
import ast

In [16]:
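for convo_line in tqdm(convo_data):
        
    # fields: the two speaker ids, the movie id (unused here), and the list of utterance ids
    speaker1, speaker2, _, convo = [field.strip() for field in convo_line.split("+++$+++")]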

    convo_seq = ast.literal_eval(convo)
    
    # update utterance
    conversation_id = convo_seq[0]
    
    # convo_seq is a list of utterances ids, arranged in conversational order
    for i, line in enumerate(convo_seq):
        
        # sanity checking: speaker giving the utterance is indeed in the pair of characters provided
        if utterance_corpus[line].speaker.id not in [speaker1, speaker2]:
            print("speaker mismatch in line {0}".format(i))
        
        utterance_corpus[line].conversation_id = conversation_id
        
        if i == 0:
            utterance_corpus[line].reply_to = None
        else:
            utterance_corpus[line].reply_to = convo_seq[i-1]

100%|██████████| 83097/83097 [00:02<00:00, 28279.64it/s]


Sanity checking the status of the utterances. After updating the conversation_id and reply_to information, they should now contain all mandatory fields:

In [17]:
utterance_corpus['L666499']

Utterance({'obj_type': 'utterance', 'meta': {'movie_id': 'm616'}, 'vectors': [], 'speaker': Speaker({'obj_type': 'speaker', 'meta': {'character_name': 'COGHILL', 'movie_idx': 'm616', 'movie_name': 'zulu dawn', 'gender': '?', 'credit_pos': '?'}, 'vectors': [], 'owner': None, 'id': 'u9028'}), 'conversation_id': 'L666497', 'reply_to': 'L666498', 'timestamp': None, 'text': 'How quickly can you move your artillery forward?', 'owner': None, 'id': 'L666499'})

### __3. Creating corpus from list of utterances__
We are now ready to create the movie-corpus. Recall that to instantiate a `Corpus`, we need a list of `Utterance`s.

In [18]:
utterance_list = list(utterance_corpus.values())

In [19]:
# Note that by default the version number is incremented 
movie_corpus = Corpus(utterances=utterance_list)

ConvoKit will automatically help us create conversations based on the information about the utterances we provide. 

In [20]:
print("number of conversations in the dataset = {}".format(len(movie_corpus.get_conversation_ids())))

number of conversations in the dataset = 83097


In [21]:
convo_ids = movie_corpus.get_conversation_ids()
for i, convo_idx in enumerate(convo_ids[0:5]):
    print("sample conversation {}:".format(i))
    print(movie_corpus.get_conversation(convo_idx).get_utterance_ids())

sample conversation 0:
['L1045', 'L1044']
sample conversation 1:
['L985', 'L984']
sample conversation 2:
['L925', 'L924']
sample conversation 3:
['L872', 'L871', 'L870']
sample conversation 4:
['L869', 'L868', 'L867', 'L866']
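The ids above appear to be listed in the order the utterances were read from the file rather than in reply order. If conversational order is needed, it can be recovered from the `reply_to` links. The sketch below is a plain-Python illustration for linear conversations; the reply links are hypothetical toy values, and `order_by_reply_chain` is not a ConvoKit function.

```python
def order_by_reply_chain(reply_to):
    """Return utterance ids of a linear conversation from root to last reply.

    `reply_to` maps each utterance id to its parent id (None for the root).
    """
    # Invert the mapping: parent id -> child id (assumes one reply per utterance).
    children = {parent: child for child, parent in reply_to.items()}

    ordered = []
    current = children.get(None)  # the root is the utterance with no parent
    while current is not None:
        ordered.append(current)
        current = children.get(current)
    return ordered


# Hypothetical reply links for a three-utterance conversation.
toy_reply_to = {"L872": "L871", "L871": "L870", "L870": None}
ordered_ids = order_by_reply_chain(toy_reply_to)
print(ordered_ids)
```

This only handles conversations in which each utterance has at most one reply, which matches how the links were constructed in this notebook.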


### __4. Updating Conversation and Corpus level metadata__

For each `Conversation`, we can add contextual information about the movie, such as its genres and release year, as `Conversation` metadata. To do that, we will read this metadata for each movie from __movie_titles_metadata.txt__ and attach it to all `Conversation`s taken from that movie.

In [22]:
with open(data_dir + "movie_titles_metadata.txt", "r", encoding='utf-8', errors='ignore') as f:
    movie_extra = f.readlines()

In [23]:
movie_meta = defaultdict(dict)

for movie in movie_extra:
    movie_id, title, year, rating, votes, genre  = [info.strip() for info in movie.split("+++$+++")]
    movie_meta[movie_id] = {"movie_name": title,
                            "release_year": year,
                            "rating": rating,
                            "votes": votes,
                            "genre": genre}

For our purpose, the movie_id of a given conversation can be retrieved from the root of the conversation.

In [24]:
for convo in movie_corpus.iter_conversations():
    
    # get the movie_id for the conversation by checking from utterance info
    convo_id = convo.get_id()
    movie_idx = movie_corpus.get_utterance(convo_id).meta['movie_id']
    
    # add movie idx as meta, and update meta with additional movie information
    convo.meta['movie_idx'] = movie_idx
    convo.meta.update(movie_meta[movie_idx])

If we check the `Conversation` metadata, it now includes the above-mentioned fields:

In [25]:
movie_corpus.get_conversation("L609301").meta

{'movie_idx': 'm570',
 'movie_name': 'three kings',
 'release_year': '1999',
 'rating': '7.30',
 'votes': '69757',
 'genre': "['action', 'adventure', 'comedy', 'drama', 'war']"}
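Note that every value above is stored as a string, including the genre list. If typed values are preferred, they could be converted before being attached as metadata. The following is a sketch rather than part of the original conversion; `cast_movie_meta` is a hypothetical helper, and it keeps the release year as a string whenever it is not a plain integer.

```python
import ast


def cast_movie_meta(meta):
    """Return a copy of `meta` with numeric and list fields converted from strings."""
    typed = dict(meta)
    typed["rating"] = float(meta["rating"])
    typed["votes"] = int(meta["votes"])
    # The genre field is the string form of a Python list.
    typed["genre"] = ast.literal_eval(meta["genre"])
    # Keep the year unchanged if it is not a plain integer.
    year = meta["release_year"]
    typed["release_year"] = int(year) if year.isdigit() else year
    return typed


raw_meta = {
    "movie_idx": "m570",
    "movie_name": "three kings",
    "release_year": "1999",
    "rating": "7.30",
    "votes": "69757",
    "genre": "['action', 'adventure', 'comedy', 'drama', 'war']",
}
typed_meta = cast_movie_meta(raw_meta)
print(typed_meta)
```

Converting up front makes later filtering simpler, for example selecting conversations by rating or by membership in a genre list.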

We also include the original URLs from which these conversations were extracted as corpus metadata.

In [26]:
with open(data_dir + "raw_script_urls.txt", "r", encoding='utf-8', errors='ignore') as f:
    urls = f.readlines()

In [27]:
movie2url = {}
for movie in urls:
    movie_id, _, url = [info.strip() for info in movie.split("+++$+++")]
    movie2url[movie_id] = url

In [28]:
movie_corpus.meta['url'] = movie2url

Optionally, we can also add the original name of the dataset:

In [29]:
movie_corpus.meta['name'] = "Cornell Movie-Dialogs Corpus"

### __5. Processing utterance texts__

We can also "annotate" the utterances, e.g., by computing dependency parses for them and saving the resulting parses. Here is an example of how this can be done; more examples related to text processing can be found at https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/master/examples/text-processing/text_preprocessing_demo.ipynb.

In [30]:
from convokit.text_processing import TextParser

In [31]:
parser = TextParser(verbosity=10000)

In [32]:
movie_corpus = parser.transform(movie_corpus)

10000/304713 utterances processed
20000/304713 utterances processed
30000/304713 utterances processed
40000/304713 utterances processed
50000/304713 utterances processed
60000/304713 utterances processed
70000/304713 utterances processed
80000/304713 utterances processed
90000/304713 utterances processed
100000/304713 utterances processed
110000/304713 utterances processed
120000/304713 utterances processed
130000/304713 utterances processed
140000/304713 utterances processed
150000/304713 utterances processed
160000/304713 utterances processed
170000/304713 utterances processed
180000/304713 utterances processed
190000/304713 utterances processed
200000/304713 utterances processed
210000/304713 utterances processed
220000/304713 utterances processed
230000/304713 utterances processed
240000/304713 utterances processed
250000/304713 utterances processed
260000/304713 utterances processed
270000/304713 utterances processed
280000/304713 utterances processed
290000/304713 utterances processed

The parses are saved under the `parsed` key in each utterance's metadata:

In [34]:
movie_corpus.get_utterance('L666499').retrieve_meta('parsed')

[{'rt': 4,
  'toks': [{'tok': 'How', 'tag': 'WRB', 'dep': 'advmod', 'up': 1, 'dn': []},
   {'tok': 'quickly', 'tag': 'RB', 'dep': 'advmod', 'up': 4, 'dn': [0]},
   {'tok': 'can', 'tag': 'MD', 'dep': 'aux', 'up': 4, 'dn': []},
   {'tok': 'you', 'tag': 'PRP', 'dep': 'nsubj', 'up': 4, 'dn': []},
   {'tok': 'move', 'tag': 'VB', 'dep': 'ROOT', 'dn': [1, 2, 3, 6, 7, 8]},
   {'tok': 'your', 'tag': 'PRP$', 'dep': 'poss', 'up': 6, 'dn': []},
   {'tok': 'artillery', 'tag': 'NN', 'dep': 'dobj', 'up': 4, 'dn': [5]},
   {'tok': 'forward', 'tag': 'RB', 'dep': 'advmod', 'up': 4, 'dn': []},
   {'tok': '?', 'tag': '.', 'dep': 'punct', 'up': 4, 'dn': []}]}]
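The parse can be navigated with ordinary list and dictionary operations. The sketch below reads `rt` as the index of the root token and `dn` as the indices of a token's direct children; this interpretation is inferred from the output above rather than quoted from the ConvoKit documentation. `root_and_children` is a hypothetical helper, and the sentence is an abbreviated copy of the parse shown above.

```python
def root_and_children(sentence):
    """Return the root token and the tokens directly attached to it."""
    toks = sentence["toks"]
    root = toks[sentence["rt"]]
    return root["tok"], [toks[i]["tok"] for i in root["dn"]]


# Abbreviated copy of the parse above (only the fields used here).
sentence = {
    "rt": 4,
    "toks": [
        {"tok": "How", "dn": []},
        {"tok": "quickly", "dn": [0]},
        {"tok": "can", "dn": []},
        {"tok": "you", "dn": []},
        {"tok": "move", "dn": [1, 2, 3, 6, 7, 8]},
        {"tok": "your", "dn": []},
        {"tok": "artillery", "dn": [5]},
        {"tok": "forward", "dn": []},
        {"tok": "?", "dn": []},
    ],
}

root, children = root_and_children(sentence)
print(root)
print(children)
```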

### __Saving created datasets__
As the final step of the dataset conversion, we save the dataset so that it can be loaded later for reuse. You may want to specify a name. By default, saved datasets are placed in __~/.convokit/saved-corpora__ (i.e., under your home directory), but you can also specify where you want the saved corpora to go.

In [35]:
# movie_corpus.dump("movie-corpus", base_path = <specify where you prefer to save it to>)
# the following would save the Corpus to the default location, i.e., ~/.convokit/saved-corpora
movie_corpus.dump("movie-corpus")

After saving, the information available in the dataset can be checked directly, without loading the corpus.

In [36]:
from convokit import meta_index
import os.path

In [37]:
meta_index(filename = os.path.join(os.path.expanduser("~"), ".convokit/saved-corpora/movie-corpus"))

{'utterances-index': {'movie_id': ["<class 'str'>"],
  'parsed': ["<class 'list'>"]},
 'speakers-index': {'character_name': ["<class 'str'>"],
  'movie_idx': ["<class 'str'>"],
  'movie_name': ["<class 'str'>"],
  'gender': ["<class 'str'>"],
  'credit_pos': ["<class 'str'>"]},
 'conversations-index': {'movie_idx': ["<class 'str'>"],
  'movie_name': ["<class 'str'>"],
  'release_year': ["<class 'str'>"],
  'rating': ["<class 'str'>"],
  'votes': ["<class 'str'>"],
  'genre': ["<class 'str'>"]},
 'overall-index': {'url': ["<class 'dict'>"], 'name': ["<class 'str'>"]},
 'version': 1,
 'vectors': []}

### __Other ways of conversion__

The above method is only one way to convert the dataset. Alternatively, one may follow the specifications of the expected data format described [here](https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/master/doc/source/data_format.rst) and write out the component files directly.

Additional examples of converting datasets originally released in other formats can be found inside the [datasets](https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/tree/master/datasets) folder. 