In [2]:
import convokit

In [3]:
from convokit import Corpus, download
import pandas as pd

In this notebook, we demonstrate how to generate a Corpus from a pandas DataFrame. Users with CSV data will often find it most straightforward to load the data as a DataFrame, make a few adjustments, and then generate a Corpus using the `Corpus.from_pandas()` method.
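
As a minimal sketch of that workflow, the snippet below reads CSV text into a DataFrame and renames its columns to the field names used later in this notebook. The CSV content and its original column names (`msg_id`, `time`, `body`, `author`, `parent`, `thread`) are hypothetical and exist only for illustration; the snippet uses pandas alone and does not call ConvoKit.

```python
import io
import pandas as pd

# Hypothetical CSV export; these column names are made up for illustration.
raw_csv = io.StringIO(
    "msg_id,time,body,author,parent,thread\n"
    "a1,1332788704,hello,alice,,a1\n"
    "a2,1332788800,hi there,bob,a1,a1\n"
)

# Read the ID-like columns as strings so numeric-looking IDs are not parsed as integers.
df = pd.read_csv(raw_csv, dtype={"msg_id": str, "parent": str, "thread": str})

# Rename to the utterance field names shown in this notebook's dataframes.
df = df.rename(columns={
    "msg_id": "id",
    "time": "timestamp",
    "body": "text",
    "author": "speaker",
    "parent": "reply_to",
    "thread": "conversation_id",
})

print(df)
```

The first row has an empty `parent` field, so its `reply_to` value is read as missing, which mirrors the blank `reply_to` entries of top-level utterances in the tables further down.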

We will use an existing corpus to demonstrate what your own DataFrame should look like.

In [4]:
# using an existing Corpus of the subreddit named 'hey'
corpus = Corpus(download('subreddit-hey')) 

Dataset already exists at /Users/calebchiam/.convokit/downloads/subreddit-hey


In [5]:
# this is a super small corpus, which is good for teaching purposes
corpus.print_summary_stats()

Number of Speakers: 15
Number of Utterances: 23
Number of Conversations: 16


In [6]:
# export each corpus component as a DataFrame (dropping the 'vectors' column); you can ignore this setup
utt_df = corpus.get_utterances_dataframe().drop(columns=['vectors'])
convo_df = corpus.get_conversations_dataframe().drop(columns=['vectors'])
speaker_df = corpus.get_speakers_dataframe().drop(columns=['vectors'])

Now, take a close look at each of these dataframes. Notice that each utterance, speaker, and conversation has its own ID. (In this corpus in particular, the conversation ID is based on the ID of the first utterance in the conversation.)

Utterances have the following **primary data fields**: ID, timestamp, text, speaker (a string ID), reply_to (a string ID), conversation_id (a string ID).

Conversations and Speakers have only one **primary data field**, their ID.

All other information associated with these objects is *metadata* and is included in the dataframes as *meta.[keyname]*.
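
To make the split between primary fields and metadata concrete, here is a small sketch with made-up data. The helper `split_columns` is hypothetical (it is not part of ConvoKit); it simply reports which primary utterance fields are missing from a DataFrame and which columns follow the `meta.` naming convention described above.

```python
import pandas as pd

# Primary utterance fields as they appear in the dataframes of this notebook.
PRIMARY_UTTERANCE_FIELDS = ["id", "timestamp", "text", "speaker", "reply_to", "conversation_id"]

def split_columns(df):
    """Hypothetical helper: return (missing primary fields, metadata columns)."""
    # Count a named index as available, because 'id' may be the DataFrame index.
    available = set(df.columns) | {name for name in df.index.names if name is not None}
    missing = [field for field in PRIMARY_UTTERANCE_FIELDS if field not in available]
    metadata = [col for col in df.columns if col.startswith("meta.")]
    return missing, metadata

# Made-up utterances: two messages in one conversation.
utts = pd.DataFrame({
    "id": ["u1", "u2"],
    "timestamp": [1332788704, 1332788800],
    "text": ["hello", "hi there"],
    "speaker": ["alice", "bob"],
    "reply_to": [None, "u1"],
    "conversation_id": ["u1", "u1"],
    "meta.score": [1, 3],  # any column prefixed with 'meta.' is metadata
})

missing, metadata = split_columns(utts)
print(missing)
print(metadata)
```

Running a check like this on your own data before building a Corpus is one way to spot a misnamed column early.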

In [7]:
utt_df.head(20)

id,timestamp,text,speaker,reply_to,conversation_id,meta.score,meta.top_level_comment,meta.retrieved_on,meta.gilded,meta.gildings,meta.subreddit,meta.stickied,meta.permalink,meta.author_flair_text
relat,1332788704,,falida,,relat,1,,-1,-1,,hey,False,/r/hey/comments/relat/hey/,
panyb,1328368532,,sweetshabdh,,panyb,0,,-1,-1,,hey,False,/r/hey/comments/panyb/love/,
pmgqa,1329081029,,erin1234,,pmgqa,1,,-1,-1,,hey,False,/r/hey/comments/pmgqa/taking/,
yxdgm,1346105238,,stentoft88,,yxdgm,1,,-1,-1,,hey,False,/r/hey/comments/yxdgm/hey/,
yxx8k,1346123685,,Clever_and_Original,,yxx8k,1,,-1,-1,,hey,False,/r/hey/comments/yxx8k/hey_whats_going_on/,
n5ktg,1323391011,,Betterment66,,n5ktg,0,,-1,-1,,hey,False,/r/hey/comments/n5ktg/yo_yo_yo_takes_a_deeeeep...,
nrkar,1324939653,,alfanialain,,nrkar,0,,-1,-1,,hey,False,/r/hey/comments/nrkar/i_just_watched_the_video...,
zlpg8,1347201597,"Everybody says hello to each other's username,...",[deleted],,zlpg8,1,,1413591417,0,,hey,False,/r/hey/comments/zlpg8/official_hello_thread/,
fdsjv,1296654854,,RavenStaar,,fdsjv,1,,-1,-1,,hey,False,/r/hey/comments/fdsjv/goanimatecom_news/,
9u5go,1255567761,[removed],pedroazp,,9u5go,1,,1522828010,0,,hey,False,/r/hey/comments/9u5go/no_entiendo/,


In [8]:
convo_df.head(10)

id,meta.title,meta.num_comments,meta.domain,meta.timestamp,meta.subreddit,meta.gilded,meta.gildings,meta.stickied,meta.author_flair_text
relat,hey,0,sixpackshortcuts.com,1332788704,hey,-1,,False,
panyb,LOVE,0,logs.omegle.com,1328368532,hey,-1,,False,
pmgqa,taking,1,thefreedictionary.com,1329081029,hey,-1,,False,
yxdgm,hey,0,logs.omegle.com,1346105238,hey,-1,,False,
yxx8k,Hey. What's going on?,0,youtube.com,1346123685,hey,-1,,False,
n5ktg,"""Yo, yo, yo"" **takes a deeeeep breath** ""This-...",1,self.hey,1323391011,hey,-1,,False,
nrkar,I just watched the video Lotta Woman vol 1 sce...,1,mobile.youporn.com,1324939653,hey,-1,,False,
zlpg8,Official hello thread,0,self.hey,1347201597,hey,0,,False,
fdsjv,GoAnimate.com: News,1,goanimate.com,1296654854,hey,-1,,False,
9u5go,No entiendo.,0,self.hey,1255567761,hey,0,,False,


In [9]:
# looks like the speakers have no metadata
speaker_df.head(10)

falida
sweetshabdh
erin1234
stentoft88
Clever_and_Original
Betterment66
alfanialain
[deleted]
RavenStaar
pedroazp


If your data follows this format, you can easily generate a Corpus from it. For example, using the above dataframes, we can regenerate the original corpus!

In [10]:
new_corpus = Corpus.from_pandas(utterances_df=utt_df, speakers_df=speaker_df, conversations_df=convo_df)

23it [00:00, 3007.79it/s]

ID column is not present in utterances dataframe, generated ID column from dataframe index...
ID column is not present in conversations dataframe, generated ID column from dataframe index...
ID column is not present in speakers dataframe, generated ID column from dataframe index...





In [11]:
new_corpus.print_summary_stats()

Number of Speakers: 15
Number of Utterances: 23
Number of Conversations: 16


In fact, you can **generate corpora from utterance data only**. Let's consider the simplest case, where you have the primary data fields and nothing else.

(Note that 'id' does not have to be the DataFrame index; it can simply be another column in your dataframe.)

In [12]:
# construct a simple utterance dataframe with 'id' as a regular column; you can ignore this
simple_utt_df = utt_df[['timestamp', 'text', 'speaker', 'reply_to', 'conversation_id']]
ids = list(simple_utt_df.index)
simple_utt_df = simple_utt_df.reset_index()
simple_utt_df['id'] = ids

In [13]:
# what your basic utterance data might look like
simple_utt_df

,id,timestamp,text,speaker,reply_to,conversation_id
0,relat,1332788704,,falida,,relat
1,panyb,1328368532,,sweetshabdh,,panyb
2,pmgqa,1329081029,,erin1234,,pmgqa
3,yxdgm,1346105238,,stentoft88,,yxdgm
4,yxx8k,1346123685,,Clever_and_Original,,yxx8k
5,n5ktg,1323391011,,Betterment66,,n5ktg
6,nrkar,1324939653,,alfanialain,,nrkar
7,zlpg8,1347201597,"Everybody says hello to each other's username,...",[deleted],,zlpg8
8,fdsjv,1296654854,,RavenStaar,,fdsjv
9,9u5go,1255567761,[removed],pedroazp,,9u5go


In [14]:
new_corpus = Corpus.from_pandas(simple_utt_df)

23it [00:00, 5165.12it/s]


In [15]:
new_corpus.print_summary_stats()

Number of Speakers: 15
Number of Utterances: 23
Number of Conversations: 16


We generated a Corpus with the same summary statistics! The only difference is that, because we excluded the conversations and speakers dataframes, these objects have no metadata, whereas previously the conversations had metadata.

In [16]:
# before
corpus.get_conversations_dataframe().drop(columns=['vectors']).head()

id,meta.title,meta.num_comments,meta.domain,meta.timestamp,meta.subreddit,meta.gilded,meta.gildings,meta.stickied,meta.author_flair_text
relat,hey,0,sixpackshortcuts.com,1332788704,hey,-1,,False,
panyb,LOVE,0,logs.omegle.com,1328368532,hey,-1,,False,
pmgqa,taking,1,thefreedictionary.com,1329081029,hey,-1,,False,
yxdgm,hey,0,logs.omegle.com,1346105238,hey,-1,,False,
yxx8k,Hey. What's going on?,0,youtube.com,1346123685,hey,-1,,False,


In [17]:
# after
new_corpus.get_conversations_dataframe().drop(columns=['vectors']).head()

relat
panyb
pmgqa
yxdgm
yxx8k
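
If you only have utterance data but still want conversation-level metadata, one option is to derive a conversations DataFrame from the utterances yourself. The sketch below uses made-up data and pandas only; the metadata names `meta.num_utterances` and `meta.first_timestamp` are hypothetical choices, not fields ConvoKit defines. The result is shaped like the conversations dataframe shown earlier: one row per conversation, indexed by conversation ID, with `meta.`-prefixed columns.

```python
import pandas as pd

# Made-up utterances: two conversations, rooted at 'u1' and 'u3'.
utts = pd.DataFrame({
    "id": ["u1", "u2", "u3"],
    "timestamp": [1332788704, 1332788800, 1346105238],
    "text": ["hello", "hi there", "hey"],
    "speaker": ["alice", "bob", "carol"],
    "reply_to": [None, "u1", None],
    "conversation_id": ["u1", "u1", "u3"],
})

# Aggregate per conversation; the dict unpacking allows column names containing dots.
convos = (
    utts.groupby("conversation_id")
        .agg(**{
            "meta.num_utterances": ("id", "count"),
            "meta.first_timestamp": ("timestamp", "min"),
        })
        .rename_axis("id")  # name the index 'id', as in the conversations dataframe above
)

print(convos)
```

Assuming `conversations_df` can be supplied without `speakers_df`, a frame like `convos` could then be passed next to the utterances as `Corpus.from_pandas(utterances_df=utts, conversations_df=convos)`; check the documentation before relying on that.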


This concludes our short tutorial on how to generate ConvoKit corpora from pandas dataframes. More details on `Corpus.from_pandas()` can be found in the documentation.