In [1]:
import convokit

In [2]:
from convokit import Corpus, download
import pandas as pd

In this notebook, we demonstrate how to generate a Corpus from a pandas DataFrame. Users with CSV data will often find it more straightforward to load the CSV as a DataFrame, make a few adjustments, and then generate a Corpus using the `Corpus.from_pandas()` method.
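As a rough illustration of the "load the CSV, make a few adjustments" step, the sketch below reads a tiny in-memory CSV and renames its columns to the primary data field names used later in this notebook. The CSV contents and its original column names (`message_id`, `sent_at`, and so on) are invented for this example; your own export will differ.

```python
import io

import pandas as pd

# Hypothetical CSV export; these column names and rows are invented for illustration.
raw_csv = io.StringIO(
    "message_id,sent_at,body,author,parent_id,thread_id\n"
    "m1,1325451559,Has anyone tried this?,alice,,m1\n"
    "m2,1325451600,Yes and it worked.,bob,m1,m1\n"
)

# Read the ID columns as strings so that numeric-looking IDs are not parsed as numbers.
df = pd.read_csv(
    raw_csv,
    dtype={"message_id": str, "parent_id": str, "thread_id": str},
)

# Rename the columns to the field names that the utterance DataFrame in this notebook uses.
df = df.rename(
    columns={
        "message_id": "id",
        "sent_at": "timestamp",
        "body": "text",
        "author": "speaker",
        "parent_id": "reply_to",
        "thread_id": "conversation_id",
    }
)

print(df.columns.tolist())
```

After the rename, the DataFrame has the same column names as the simple utterance DataFrame shown later in this notebook.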

We will use an existing corpus to demonstrate what your own DataFrame should look like.

In [3]:
# using an existing Corpus of the subreddit named 'ADD'
corpus = Corpus(filename="C:/Users/L/.convokit/downloads/subreddit-ADD") 

In [4]:
# this is a super small corpus, which is good for teaching purposes
corpus.print_summary_stats()

Number of Speakers: 801
Number of Utterances: 5725
Number of Conversations: 555


In [5]:
# drop the 'vectors' column from each dataframe (you can ignore this)
utt_df = corpus.get_utterances_dataframe().drop(columns=['vectors'])
convo_df = corpus.get_conversations_dataframe().drop(columns=['vectors'])
speaker_df = corpus.get_speakers_dataframe().drop(columns=['vectors'])

Now, take a close look at each of these dataframes. Notice that each utterance, speaker, and conversation has its own ID. (In this corpus in particular, the conversation ID is the ID of the first utterance in the conversation.)
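If your own data has reply links but no conversation IDs, one possible way to follow the same convention (conversation ID equals the ID of the first utterance) is to walk each utterance's `reply_to` chain up to its root. This is a minimal sketch on toy data, not part of ConvoKit; the helper `find_root` is hypothetical, and it assumes the reply links contain no cycles.

```python
import pandas as pd

# Toy utterances: "a" and "d" start conversations, "b" replies to "a", "c" replies to "b".
utts = pd.DataFrame(
    {
        "id": ["a", "b", "c", "d"],
        "reply_to": [None, "a", "b", None],
    }
)

# Map each utterance ID to the ID it replies to (missing for conversation starters).
parent = dict(zip(utts["id"], utts["reply_to"]))


def find_root(utt_id):
    """Follow reply_to links upward until reaching an utterance with no known parent.

    Hypothetical helper for illustration; assumes the reply links contain no cycles.
    """
    current = utt_id
    while pd.notna(parent.get(current)):
        current = parent[current]
    return current


utts["conversation_id"] = utts["id"].map(find_root)
print(utts["conversation_id"].tolist())  # ['a', 'a', 'a', 'd']
```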

Utterances have the following **primary data fields**: ID, timestamp, text, speaker (a string ID), reply_to (a string ID), conversation_id (a string ID).

Conversations and Speakers have only one **primary data field**, their ID.

All other information associated with these objects is *metadata* and is included in the dataframes as *meta.[keyname]*.
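Before building a Corpus from your own data, it can help to check which columns of your DataFrame match this layout. The sketch below is only an illustration: the toy data, the `upvotes` column, and the helper `describe_columns` are assumptions made for this example, not part of ConvoKit.

```python
import pandas as pd

# Primary data fields for utterances, as listed above.
PRIMARY_FIELDS = ["id", "timestamp", "text", "speaker", "reply_to", "conversation_id"]

# Toy utterance data; "upvotes" deliberately does not follow the meta.[keyname] convention.
utts = pd.DataFrame(
    {
        "id": ["u1", "u2"],
        "timestamp": [1325451559, 1325451600],
        "text": ["Has anyone tried this?", "Yes, and it worked."],
        "speaker": ["alice", "bob"],
        "reply_to": [None, "u1"],
        "conversation_id": ["u1", "u1"],
        "meta.score": [4, 1],
        "upvotes": [4, 1],
    }
)


def describe_columns(df):
    """Sort column names into missing primary fields, metadata columns, and everything else.

    Hypothetical helper for illustration only.
    """
    missing = [c for c in PRIMARY_FIELDS if c not in df.columns]
    meta = [c for c in df.columns if c.startswith("meta.")]
    other = [c for c in df.columns if c not in PRIMARY_FIELDS and not c.startswith("meta.")]
    return {"missing": missing, "meta": meta, "other": other}


print(describe_columns(utts))
# {'missing': [], 'meta': ['meta.score'], 'other': ['upvotes']}
```

Here `upvotes` lands in `other`; renaming it to `meta.upvotes` would bring it in line with the naming convention described above.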

In [6]:
utt_df.head(20)

id,timestamp,text,speaker,reply_to,conversation_id,meta.score,meta.top_level_comment,meta.retrieved_on,meta.gilded,meta.gildings,meta.subreddit,meta.stickied,meta.permalink,meta.author_flair_text
ckqoc,1277943229,and I just realized that rhymed,pastachef,,ckqoc,4,,1522931980,0,,ADD,False,/r/ADD/comments/ckqoc/i_forgot_to_take_meds_to...,
nywe5,1325451559,,djspacebunny,,nywe5,24,,-1,-1,,ADD,False,/r/ADD/comments/nywe5/dea_refuses_to_believe_t...,
nyvgx,1325450108,"Drama aside, I took an adderall XR when I firs...",[deleted],,nyvgx,3,,-1,-1,,ADD,False,/r/ADD/comments/nyvgx/oh_god_i_accidentally_to...,
o06gv,1325538002,I've heard it's very difficult to get diagnose...,ergo456,,o06gv,4,,-1,-1,,ADD,False,/r/ADD/comments/o06gv/anyone_from_the_uk_been_...,
o0401,1325534630,So I have been taking 30mg vyvanse for the las...,anotheraddthrowaway,,o0401,3,,-1,-1,,ADD,False,/r/ADD/comments/o0401/need_new_medicine_help_m...,
nzn0d,1325497332,I am 20 years of age. I have tried everything....,addthrowaway1,,nzn0d,2,,-1,-1,,ADD,False,/r/ADD/comments/nzn0d/please_help_me_radd_i_ha...,
nz89v,1325469935,"Background: 23yo, high IQ, performed very poor...",Miserlou57,,nz89v,5,,-1,-1,,ADD,False,/r/ADD/comments/nz89v/after_a_night_of_drinkin...,
o1m6s,1325628210,Myth: People who have ADHD calm down when taki...,kassem23,,o1m6s,19,,-1,-1,,ADD,False,/r/ADD/comments/o1m6s/lets_squat_the_myth_once...,
o1kar,1325625737,,21onlyne,,o1kar,1,,-1,-1,,ADD,False,/r/ADD/comments/o1kar/sys_bureau_detectives_as...,
o1j3t,1325624160,,Joemasta66,,o1j3t,3,,-1,-1,,ADD,False,/r/ADD/comments/o1j3t/robbs_reddit_add_that_ma...,


In [7]:
convo_df.head(10)

id,meta.title,meta.num_comments,meta.domain,meta.timestamp,meta.subreddit,meta.gilded,meta.gildings,meta.stickied,meta.author_flair_text
ckqoc,"I forgot to take meds today, and I frittered t...",16,self.ADD,1277943229,ADD,0,,False,
nywe5,DEA Refuses to believe there's a shortage of A...,5,msnbc.msn.com,1325451559,ADD,-1,,False,
nyvgx,OH GOD I ACCIDENTALLY TOOK TWO PILLS WHAT DO I...,26,self.ADD,1325450108,ADD,-1,,False,
o06gv,Anyone from the UK been diagnosed? How was the...,4,self.ADD,1325538002,ADD,-1,,False,
o0401,"Need new medicine, help me out /add!",8,self.ADD,1325534630,ADD,-1,,False,
nzn0d,Please help me /r/Add... I have no one else to...,5,self.ADD,1325497332,ADD,-1,,False,
nz89v,"After a night of drinking, why do I have more ...",11,self.ADD,1325469935,ADD,-1,,False,
o1m6s,Let's Squat The Myth Once And For All,21,self.ADD,1325628210,ADD,-1,,False,
o1kar,sys bureau detectives asturias,0,websempresas.es,1325625737,ADD,-1,,False,
o1j3t,Robbs reddit add that made him late for a date,0,redditmedia.com,1325624160,ADD,-1,,False,


In [8]:
# looks like the speakers have no metadata
speaker_df.head(10)

pastachef
djspacebunny
[deleted]
ergo456
anotheraddthrowaway
addthrowaway1
Miserlou57
kassem23
21onlyne
Joemasta66


If you structure your data in this format, you can easily generate a Corpus from it. For example, using the above dataframes, we can regenerate the original corpus!

In [9]:
new_corpus = Corpus.from_pandas(utterances_df=utt_df, speakers_df=speaker_df, conversations_df=convo_df)

ID column is not present in utterances dataframe, generated ID column from dataframe index...
ID column is not present in conversations dataframe, generated ID column from dataframe index...
ID column is not present in speakers dataframe, generated ID column from dataframe index...


5725it [00:00, 20325.51it/s]


In [10]:
new_corpus.print_summary_stats()

Number of Speakers: 801
Number of Utterances: 5725
Number of Conversations: 555


In fact, you can **generate corpora from utterance data only**. Let's consider the simplest case, where you have the primary data fields and nothing else.

(Note that 'id' does not have to be the DataFrame index; it can simply be another column in your dataframe.)

In [11]:
# constructing simple utterance dataframe, you can ignore this
simple_utt_df = utt_df[['timestamp', 'text', 'speaker', 'reply_to', 'conversation_id']]
ids = list(simple_utt_df.index)
simple_utt_df = simple_utt_df.reset_index()
simple_utt_df['id'] = ids

In [12]:
# what your basic utterance data might look like
simple_utt_df

,id,timestamp,text,speaker,reply_to,conversation_id
0,ckqoc,1277943229,and I just realized that rhymed,pastachef,,ckqoc
1,nywe5,1325451559,,djspacebunny,,nywe5
2,nyvgx,1325450108,"Drama aside, I took an adderall XR when I firs...",[deleted],,nyvgx
3,o06gv,1325538002,I've heard it's very difficult to get diagnose...,ergo456,,o06gv
4,o0401,1325534630,So I have been taking 30mg vyvanse for the las...,anotheraddthrowaway,,o0401
...,...,...,...,...,...,...
5720,c53ah2c,1340228511,I logged on here because I was finding it hard...,LCMV,odqmn,odqmn
5721,c53ap4r,1340229399,Using a digital calendar for EVERYTHING I DO ...,LCMV,obemr,obemr
5722,c568jpi,1340797795,You're a genius,Poffertje,c4wbq5m,omb6e
5723,c5air4w,1341634249,Thought I was just a bad reader because whenev...,Eyeredit,oe0pl,oe0pl


In [13]:
new_corpus = Corpus.from_pandas(simple_utt_df)

5725it [00:00, 17831.26it/s]


In [14]:
new_corpus.print_summary_stats()

Number of Speakers: 801
Number of Utterances: 5725
Number of Conversations: 555


We generated the same Corpus! The only difference is that because we excluded the conversations and speakers dataframes, these objects will have no metadata, whereas previously the conversations had metadata.
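If you only have utterance data but still want some conversation-level metadata, one option is to derive a conversations dataframe from the utterances yourself. The sketch below uses toy data and follows the layout of the conversations dataframe shown earlier (indexed by `id`, with `meta.[keyname]` columns). The metadata keys `meta.num_utterances` and `meta.start_time` are invented for this example.

```python
import pandas as pd

# Toy utterance data with the primary data fields relevant to this example.
utts = pd.DataFrame(
    {
        "id": ["a", "b", "c", "d"],
        "timestamp": [100, 130, 160, 200],
        "conversation_id": ["a", "a", "a", "d"],
    }
)

# One row per conversation, with hypothetical metadata derived from its utterances.
convo_meta = utts.groupby("conversation_id").agg(
    **{
        "meta.num_utterances": ("id", "count"),
        "meta.start_time": ("timestamp", "min"),
    }
)

# Name the index 'id' to match the conversations dataframe shown earlier.
convo_meta.index.name = "id"
print(convo_meta)
```

A dataframe shaped like this could then be passed as `conversations_df` alongside the utterance dataframe, in the same way as the `Corpus.from_pandas()` call with three dataframes above.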

In [15]:
# before
corpus.get_conversations_dataframe().drop(columns=['vectors']).head()

id,meta.title,meta.num_comments,meta.domain,meta.timestamp,meta.subreddit,meta.gilded,meta.gildings,meta.stickied,meta.author_flair_text
ckqoc,"I forgot to take meds today, and I frittered t...",16,self.ADD,1277943229,ADD,0,,False,
nywe5,DEA Refuses to believe there's a shortage of A...,5,msnbc.msn.com,1325451559,ADD,-1,,False,
nyvgx,OH GOD I ACCIDENTALLY TOOK TWO PILLS WHAT DO I...,26,self.ADD,1325450108,ADD,-1,,False,
o06gv,Anyone from the UK been diagnosed? How was the...,4,self.ADD,1325538002,ADD,-1,,False,
o0401,"Need new medicine, help me out /add!",8,self.ADD,1325534630,ADD,-1,,False,


In [16]:
# after
new_corpus.get_conversations_dataframe().drop(columns=['vectors']).head()

ckqoc
nywe5
nyvgx
o06gv
o0401


This concludes our short tutorial on how to generate ConvoKit corpora from pandas dataframes. More details on `Corpus.from_pandas()` can be found in the documentation.