# Reusing Reddit Data from [ConvoKit](https://convokit.cornell.edu/documentation/index.html)

ConvoKit provides tools to extract conversational features and analyze social phenomena in conversations through a single unified interface. It ships with several large conversational datasets, including a large Reddit corpus, together with scripts that demonstrate the toolkit on them.

However, the Reddit data is only available up to 2018.

In [4]:
! pip install -U pyopenssl cryptography

Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.org/simple, https://packagecloud.io/github/git-lfs/pypi/simple
Collecting pyopenssl
  Downloading pyOpenSSL-24.0.0-py3-none-any.whl.metadata (12 kB)
Collecting cryptography
  Downloading cryptography-42.0.5-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (5.3 kB)
Downloading pyOpenSSL-24.0.0-py3-none-any.whl (58 kB)
Downloading cryptography-42.0.5-cp39-abi3-manylinux_2_28_x86_64.whl (4.6 MB)
Installing collected packages: cryptography, pyopenssl
  Attempting uninstall: cryptography
    Found existing installation: cryptography 41.0.7
    Uninstalling cryptography-41.0.7:
    

In [5]:
! pip install convokit

Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.org/simple, https://packagecloud.io/github/git-lfs/pypi/simple


In [6]:
from convokit import Corpus, download
corpus = Corpus(download('subreddit-Cornell'))

Downloading subreddit-Cornell to /home/sukayna/.convokit/downloads/subreddit-Cornell
Downloading subreddit-Cornell from http://zissou.infosci.cornell.edu/convokit/datasets/subreddit-corpus/corpus-zipped/CookingScrewups~-~CrappyDesign/Cornell.corpus.zip (11.2MB)... Done
No configuration file found at /home/sukayna/.convokit/config.yml; writing with contents: 
# Default Backend Parameters
db_host: localhost:27017
data_directory: ~/.convokit/saved-corpora
default_backend: mem


[<img src="https://convokit.cornell.edu/documentation/_images/convokit_classes.svg">](https://convokit.cornell.edu/documentation/architecture.html)

**utterance**: something said by a **speaker**

**conversation**: a thread of utterances

In [11]:
# Stack Exchange 
corpus_se = Corpus(filename=download("stack-exchange-politeness-corpus"))

Dataset already exists at /home/sukayna/.convokit/downloads/stack-exchange-politeness-corpus


Each utterance corresponds to a Stack Exchange request. For each utterance, we provide:

    id: row index of the request given in the original data release.

    speaker: the author of the utterance.

    conversation_id: id of the first utterance in the conversation this utterance belongs to, which in this case is the id of the utterance itself

    reply_to: None. In this dataset, each request is seen as a full conversation, and thus all utterances are at the ‘root’ of the conversations

    timestamp: “NOT_RECORDED”.

    text: textual content of the utterance.
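
The fields listed above are enough to rebuild threads by hand. The following is a minimal sketch in plain Python, not ConvoKit's own API: the `Record` class, the helper functions, and the toy data are all hypothetical and exist only to illustrate how `conversation_id` and `reply_to` relate. In a corpus like the Stack Exchange one, where every request is its own conversation, each group would contain exactly one record.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """Hypothetical stand-in for the utterance fields listed above (not a ConvoKit class)."""
    id: str
    speaker: str
    conversation_id: str
    reply_to: Optional[str]
    text: str


def group_by_conversation(records):
    """Map each conversation_id to the list of records that belong to it."""
    threads = defaultdict(list)
    for rec in records:
        threads[rec.conversation_id].append(rec)
    return dict(threads)


def is_root(rec):
    """A root utterance replies to nothing and names itself as the conversation."""
    return rec.reply_to is None and rec.conversation_id == rec.id


# Toy data, invented for illustration only.
records = [
    Record("a1", "alice", "a1", None, "Original post"),
    Record("b2", "bob", "a1", "a1", "A reply to the post"),
    Record("c3", "carol", "c3", None, "A standalone request"),
]

threads = group_by_conversation(records)
roots = [rec.id for rec in records if is_root(rec)]
print({cid: [r.id for r in recs] for cid, recs in threads.items()})
print(roots)
```

Here `a1` behaves like a Reddit thread with one reply, while `c3` behaves like a Stack Exchange request that forms a conversation on its own.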


In [13]:
for utt in corpus.iter_utterances():
    print(utt.meta)
    print(utt.text)
    break

ConvoKitMeta({'score': 2, 'top_level_comment': None, 'retrieved_on': -1, 'gilded': -1, 'gildings': None, 'subreddit': 'Cornell', 'stickied': False, 'permalink': '/r/Cornell/comments/nyx4d/so_i_was_away_this_past_semester_whats_going_on/', 'author_flair_text': ''})
I was just reading about the Princeton Mic-Check and it's getting [national press](http://www.bloomberg.com/news/2011-12-29/princeton-brews-trouble-for-us-1-percenters-commentary-by-michael-lewis.html).

I want to get a sense of what people felt like around campus. Anything interesting happen? Anything interesting coming up?
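
The metadata printed above behaves like a dictionary of Reddit-specific fields. As a rough sketch, the example below copies those values into a plain `dict` and derives two things from them. The base URL and both helper functions are assumptions added for illustration and are not part of the corpus or of ConvoKit, and reading a missing `top_level_comment` as "this is a post" is likewise an assumption.

```python
# Metadata copied from the output above, held in a plain dict for illustration.
meta = {
    "score": 2,
    "top_level_comment": None,
    "retrieved_on": -1,
    "gilded": -1,
    "gildings": None,
    "subreddit": "Cornell",
    "stickied": False,
    "permalink": "/r/Cornell/comments/nyx4d/so_i_was_away_this_past_semester_whats_going_on/",
    "author_flair_text": "",
}

REDDIT_BASE = "https://www.reddit.com"  # assumed base URL, not stored in the corpus


def full_url(meta):
    """Join the assumed base URL with the site-relative permalink."""
    return REDDIT_BASE + meta["permalink"]


def is_post(meta):
    """Assumption: a missing top_level_comment marks a post rather than a comment."""
    return meta["top_level_comment"] is None


print(full_url(meta))
print(is_post(meta))
```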


In [8]:
udf = corpus.get_utterances_dataframe()

In [9]:
udf

| id | timestamp | text | speaker | reply_to | conversation_id | meta.score | meta.top_level_comment | meta.retrieved_on | meta.gilded | meta.gildings | meta.subreddit | meta.stickied | meta.permalink | meta.author_flair_text | vectors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| nyx4d | 1325452698 | I was just reading about the Princeton Mic-Che... | reddmau5 | | nyx4d | 2 | | -1 | -1 | | Cornell | False | /r/Cornell/comments/nyx4d/so_i_was_away_this_p... | | [] |
| o0145 | 1325530635 | I have added support for Cornell to courseoff.... | shtylman | | o0145 | 9 | | -1 | -1 | | Cornell | False | /r/Cornell/comments/o0145/course_schedule_plan... | | [] |
| o1gca | 1325620506 | i don't have a facebook, so we'd need a volunt... | moon_river | | o1gca | 3 | | -1 | -1 | | Cornell | False | /r/Cornell/comments/o1gca/should_we_advertise_... | | [] |
| o0ss4 | 1325571377 | so, i'm starting to mess with some of the css ... | moon_river | | o0ss4 | 24 | | -1 | -1 | | Cornell | False | /r/Cornell/comments/o0ss4/oh_look_a_picture/ | | [] |
| o31u0 | 1325714498 | | cchambo | | o31u0 | 27 | | -1 | -1 | | Cornell | False | /r/Cornell/comments/o31u0/cornell_scientists_c... | SNES 2015 | [] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| e8tjh94 | 1541029381 | 69% | HowYaLikeTheseApples | 9t39ho | 9t39ho | 2 | e8tjh94 | 1541687938 | 0 | {'gid_1': 0, 'gid_2': 0, 'gid_3': 0} | Cornell | False | /r/Cornell/comments/9t39ho/chem_2070_fail_rate... | : ) ----&gt;--------&lt; | [] |
| e8tjyg1 | 1541029851 | harvard is the **i̼͕̻͓̩̘͟n͙̞̙f̶̢̙̻̺͍̟͞è̶͚͙̳̩... | ultimatefishlover | 9t3tj6 | 9t3tj6 | 43 | e8tjyg1 | 1541688177 | 0 | {'gid_1': 0, 'gid_2': 0, 'gid_3': 0} | Cornell | False | /r/Cornell/comments/9t3tj6/cornell_is_the_harv... | | [] |
| e8tkb66 | 1541030205 | Agreed | KickAssEmployee | 9t4bsv | 9t4bsv | 14 | e8tkb66 | 1541688335 | 0 | {'gid_1': 0, 'gid_2': 0, 'gid_3': 0} | Cornell | False | /r/Cornell/comments/9t4bsv/house_dinners_are_t... | | [] |
| e8tkctl | 1541030250 | Why did this make me laugh so hard ahaahha | dasfsadf123 | 9t1nqa | 9t1nqa | 5 | e8tkctl | 1541688355 | 0 | {'gid_1': 0, 'gid_2': 0, 'gid_3': 0} | Cornell | False | /r/Cornell/comments/9t1nqa/i_am_a_parent_of_a_... | | [] |
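
The `timestamp` values in the dataframe look like Unix epoch seconds. Under that assumption, the sketch below converts the first and last timestamps shown above into UTC datetimes, which is a quick way to check the date range the corpus covers. The helper name `to_utc` is hypothetical, and only the standard library is used.

```python
from datetime import datetime, timezone


def to_utc(ts):
    """Convert a Unix timestamp in seconds (int or numeric string) to an aware UTC datetime."""
    return datetime.fromtimestamp(int(ts), tz=timezone.utc)


# Timestamps copied from the first and last rows shown above.
first_ts = 1325452698
last_ts = 1541030250

first_dt = to_utc(first_ts)
last_dt = to_utc(last_ts)
print(first_dt.isoformat())  # 2012-01-01T21:18:18+00:00
print(last_dt.isoformat())   # 2018-10-31T23:57:30+00:00
```

The displayed rows therefore run from January 2012 to the end of October 2018. On the full dataframe, something like `pd.to_datetime(udf["timestamp"].astype(int), unit="s")` should give the same conversion for every row, assuming the column holds values that cast cleanly to integers.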
