Computing Surprise With ConvoKit
=====================
This notebook provides a demo of how to use the Surprise transformer to compute surprise across a corpus. The transformer currently only allows computation of how surprising a speaker's utterances in one conversation (target) are compared to their utterances in all other conversations (context) in the corpus. Eventually, the functionality of the Surprise transformer will be abstracted to allow for computation of surprise between any target and context types.

In [1]:
import convokit
import numpy as np
from convokit import Corpus, download, Surprise

Step 1: Load a corpus
--------
For now, we will use data from the subreddit r/Cornell to demonstrate the functionality of this transformer.

In [2]:
corpus = Corpus(filename=download('subreddit-Cornell'))

Dataset already exists at C:\Users\rgang\.convokit\downloads\subreddit-Cornell


In [3]:
corpus.print_summary_stats()

Number of Speakers: 7568
Number of Utterances: 74467
Number of Conversations: 10744


In order to speed up the demo, we will take just the top 100 most active speakers (based on the number of conversations they participate in).

In [4]:
# Skip deleted accounts and bots, and drop utterances with empty text.
SPEAKER_BLACKLIST = ['[deleted]', 'DeltaBot', 'AutoModerator']
def utterance_is_valid(utterance):
    return utterance.speaker.id not in SPEAKER_BLACKLIST and bool(utterance.text)
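
To see how this filter behaves without loading a corpus, here is a small sketch that calls it on stand-in objects. `fake_utterance` is a hypothetical helper that mimics only the two attributes the filter reads (`.speaker.id` and `.text`); it is not a ConvoKit class, and the sample text is made up.

```python
from types import SimpleNamespace

SPEAKER_BLACKLIST = ['[deleted]', 'DeltaBot', 'AutoModerator']

def utterance_is_valid(utterance):
    # bool(...) makes the result a strict True/False instead of the text itself.
    return utterance.speaker.id not in SPEAKER_BLACKLIST and bool(utterance.text)

def fake_utterance(speaker_id, text):
    """Hypothetical stand-in exposing only `.speaker.id` and `.text`."""
    return SimpleNamespace(speaker=SimpleNamespace(id=speaker_id), text=text)

print(utterance_is_valid(fake_utterance('some_user', 'hello')))      # True
print(utterance_is_valid(fake_utterance('AutoModerator', 'hello')))  # False
print(utterance_is_valid(fake_utterance('some_user', '')))           # False
```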

In [5]:
corpus.organize_speaker_convo_history(utterance_filter=utterance_is_valid)

In [6]:
speaker_activities = corpus.get_attribute_table('speaker', ['n_convos'])

In [7]:
speaker_activities.sort_values('n_convos', ascending=False).head(10)

id,n_convos
laveritecestla,781.0
EQUASHNZRKUL,726.0
CornHellUniversity,696.0
t3hasiangod,647.0
ilovemymemesboo,430.0
omgdonerkebab,425.0
cartesiancategory,341.0
cornell256,330.0
mushiettake,321.0
Fencerman2,298.0


In [8]:
top_speakers = speaker_activities.sort_values('n_convos', ascending=False).head(100).index

In [9]:
import itertools

subset_utts = [list(corpus.get_speaker(speaker).iter_utterances()) for speaker in top_speakers]
subset_corpus = Corpus(utterances=list(itertools.chain(*subset_utts)))
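
The cell above builds one list of utterances per speaker and then flattens them into a single list. A minimal sketch of that flattening step on made-up data (the strings stand in for utterance objects):

```python
import itertools

# One inner list per speaker; a speaker may have no utterances.
per_speaker = [["u1", "u2"], ["u3"], []]

# chain.from_iterable flattens one level of nesting, like chain(*per_speaker),
# but without unpacking the outer list into arguments.
flat = list(itertools.chain.from_iterable(per_speaker))
print(flat)  # ['u1', 'u2', 'u3']
```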

In [10]:
subset_corpus.print_summary_stats()

Number of Speakers: 100
Number of Utterances: 20700
Number of Conversations: 6904


Step 2: Create instance of surprise transformer
---------------
`min_target_length` and `min_context_length` specify the minimum number of tokens that should be in the target and context, respectively. If the target or context is too short, the transformer will set the surprise to `nan`. If we set these to 1, the most surprising statements tend to be the very short ones.

In [11]:
surp = Surprise(min_target_length=100, min_context_length=100, n_samples=50)
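
To make the role of these parameters concrete, here is a rough sketch of one way such a score could be computed. It assumes surprise is the cross-entropy of target tokens under a Laplace-smoothed unigram model of the context, averaged over `n_samples` random samples of fixed size. This is an illustration under those assumptions, not ConvoKit's actual implementation; `cross_entropy` and `sketch_surprise` are hypothetical names.

```python
import math
import random
from collections import Counter

def cross_entropy(target, context):
    """Average negative log-probability of the target tokens under a
    Laplace-smoothed unigram model estimated from the context tokens."""
    counts = Counter(context)
    vocab_size = len(set(context) | set(target))
    denom = len(context) + vocab_size
    return -sum(math.log((counts[tok] + 1) / denom) for tok in target) / len(target)

def sketch_surprise(target, context, min_target_length, min_context_length,
                    n_samples, seed=0):
    """Hypothetical helper: mean cross-entropy over random fixed-size samples.
    Returns nan when either side has too few tokens."""
    if len(target) < min_target_length or len(context) < min_context_length:
        return float("nan")
    rng = random.Random(seed)
    scores = [
        cross_entropy(rng.sample(target, min_target_length),
                      rng.sample(context, min_context_length))
        for _ in range(n_samples)
    ]
    return sum(scores) / len(scores)

usual = ["the", "class", "is", "hard"] * 25   # 100 made-up context tokens
repeated = ["bill"] * 100                     # a target unlike the context

print(sketch_surprise(usual, usual, 50, 50, 20))
print(sketch_surprise(repeated, usual, 50, 50, 20))
print(sketch_surprise(["hi"], usual, 50, 50, 20))  # too short, so nan
```

Sampling a fixed number of tokens from both sides keeps scores comparable across targets of different lengths, which is why very short targets are excluded rather than scored.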

Step 3: Fit transformer to corpus
-----


In [12]:
surp = surp.fit(subset_corpus, group_models_by=['speaker'])

Step 4: Transform corpus
--------
Currently, this transforms each utterance in the corpus, adding a field to its metadata with the computed surprise.

In [13]:
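transformed_corpus = surp.transform(
    subset_corpus,
    'utterance',
    group_target_by=['speaker', 'conversation'],
    # Context: same speaker (t[0]), any conversation other than the target's (t[1]).
    context_selector=lambda s, t: (
        (s.index.get_level_values('speaker') == t[0])
        & (s.index.get_level_values('conversation_id') != t[1])
    ),
    # Pick the model by the first element of the target index (the speaker).
    model_selector=lambda ind: ind[0],
)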
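
The `context_selector` in this call keeps, for each (speaker, conversation) target, the groups from the same speaker in every other conversation. The sketch below applies the same rule to plain tuples so the logic is easy to check by hand. The group keys and the `select_context` helper are made up for illustration and are not part of ConvoKit.

```python
# Hypothetical (speaker, conversation_id) group keys.
groups = [("alice", "c1"), ("alice", "c2"), ("alice", "c3"), ("bob", "c1")]

def select_context(groups, target):
    """Keep groups from the same speaker but a different conversation."""
    speaker, convo = target
    return [g for g in groups if g[0] == speaker and g[1] != convo]

context_for_alice_c1 = select_context(groups, ("alice", "c1"))
print(context_for_alice_c1)  # [('alice', 'c2'), ('alice', 'c3')]

# A speaker with a single conversation has no context at all.
print(select_context(groups, ("bob", "c1")))  # []
```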

Analysis
------
Let's take a look at some of the most surprising speaker conversation involvements.

In [14]:
most_surprising = transformed_corpus.get_utterances_dataframe().sort_values('meta.surprise', ascending=False).head(10)
most_surprising

id,timestamp,text,speaker,reply_to,conversation_id,meta.score,meta.top_level_comment,meta.retrieved_on,meta.gilded,meta.gildings,meta.subreddit,meta.stickied,meta.permalink,meta.author_flair_text,meta.surprise
dbrix3s,1483037010,BILL BILL BILL BILL BILL BILL BILL BILL BILL B...,Straight_Derpin,dbqkxgp,5kst5l,22,dbqkxgp,1484130862,0,,Cornell,False,,CS 2020,5.08828
dbrnysy,1483043255,"Memes aside, I honestly don't know a good answ...",Straight_Derpin,5kst5l,5kst5l,2,dbrnysy,1484133363,0,,Cornell,False,,CS 2020,5.08828
diujudc,1497369546,Fall 2011 Admissions Stats:\n\nSchool | Apps |...,Dr_Narwhal,6h08sg,6h08sg,9,diujudc,1499312897,0,,Cornell,False,,Physics & Mathematics 2019,4.52956
cbnqq4d,1376508444,Even better is Tracfone's StraightTalk. They'r...,arandomaltaccount,cbnjwiu,1kbz4p,1,cbnjwiu,1429829700,0,,Cornell,False,,,4.33434
c5sqkzc,1344908188,Fuck YAF. And fuck you. Don't let that hate ...,omgdonerkebab,y6cd9,y6cd9,0,c5sqkzc,1429631413,0,,Cornell,False,,,4.32641
d5pb24c,1469403847,"They should be fine, though there are other op...",t3hasiangod,4ufm6z,4ufm6z,3,d5pb24c,1471654647,0,,Cornell,False,,,4.2999
d64vv88,1470366925,"Eh, I would be hesitant to wear those. From th...",t3hasiangod,4w8d7e,4w8d7e,5,d64vv88,1473230691,0,,Cornell,False,,,4.28591
dosp7xl,1508806551,"Far above Cayuga's waters,\nwith its waves of ...",SwissWatchesOnly,dosp7mm,78cb5y,-2,dosp7mm,1510121623,0,,Cornell,False,/r/Cornell/comments/78cb5y/alma_mater/dosp7xl/,,4.2755
78cb5y,1508806536,[removed],SwissWatchesOnly,,78cb5y,0,,1510471733,0,,Cornell,False,/r/Cornell/comments/78cb5y/alma_mater/,,4.2755
dosp7mm,1508806541,"Far above Cayuga's waters,\nwith its waves of ...",SwissWatchesOnly,78cb5y,78cb5y,-1,dosp7mm,1510121618,0,,Cornell,False,/r/Cornell/comments/78cb5y/alma_mater/dosp7mm/,,4.2755


You can see above that utterances with the same speaker and conversation id have the same surprise, as expected. Let's remove these duplicate entries so we can see more of the data.

In [15]:
most_surprising = transformed_corpus.get_utterances_dataframe().sort_values('meta.surprise', ascending=False).drop_duplicates(subset=['speaker', 'conversation_id']).head(10)
most_surprising

id,timestamp,text,speaker,reply_to,conversation_id,meta.score,meta.top_level_comment,meta.retrieved_on,meta.gilded,meta.gildings,meta.subreddit,meta.stickied,meta.permalink,meta.author_flair_text,meta.surprise
dbrix3s,1483037010,BILL BILL BILL BILL BILL BILL BILL BILL BILL B...,Straight_Derpin,dbqkxgp,5kst5l,22,dbqkxgp,1484130862,0,,Cornell,False,,CS 2020,5.08828
diujudc,1497369546,Fall 2011 Admissions Stats:\n\nSchool | Apps |...,Dr_Narwhal,6h08sg,6h08sg,9,diujudc,1499312897,0,,Cornell,False,,Physics & Mathematics 2019,4.52956
cbnqq4d,1376508444,Even better is Tracfone's StraightTalk. They'r...,arandomaltaccount,cbnjwiu,1kbz4p,1,cbnjwiu,1429829700,0,,Cornell,False,,,4.33434
c5sqkzc,1344908188,Fuck YAF. And fuck you. Don't let that hate ...,omgdonerkebab,y6cd9,y6cd9,0,c5sqkzc,1429631413,0,,Cornell,False,,,4.32641
d5pb24c,1469403847,"They should be fine, though there are other op...",t3hasiangod,4ufm6z,4ufm6z,3,d5pb24c,1471654647,0,,Cornell,False,,,4.2999
d64vv88,1470366925,"Eh, I would be hesitant to wear those. From th...",t3hasiangod,4w8d7e,4w8d7e,5,d64vv88,1473230691,0,,Cornell,False,,,4.28591
dosp7xl,1508806551,"Far above Cayuga's waters,\nwith its waves of ...",SwissWatchesOnly,dosp7mm,78cb5y,-2,dosp7mm,1510121623,0,,Cornell,False,/r/Cornell/comments/78cb5y/alma_mater/dosp7xl/,,4.2755
deydmud,1489576091,"Ah yes, because Newt is so tolerant of those w...",Bigmouthstrikesback,dexqq3j,5z98fa,3,dewdp2k,1491495064,0,,Cornell,False,,,4.20287
crd4uhk,1431970713,1933 in NY based on wikipedia https://en.wikip...,howlingchief,crd4i14,36ceei,1,crd2smi,1433132718,0,,Cornell,False,,,4.12154
90nmvb,1532157947,"My SHP waiver app got denied, after being appr...",Weinfield,,90nmvb,5,,1536641349,0,,Cornell,False,/r/Cornell/comments/90nmvb/shp_waiver_denial/,CS 19,4.07915


Now, let's look at some of the least surprising entries.

In [16]:
least_surprising = transformed_corpus.get_utterances_dataframe().sort_values('meta.surprise').drop_duplicates(subset=['speaker', 'conversation_id']).head(10)
least_surprising

id,timestamp,text,speaker,reply_to,conversation_id,meta.score,meta.top_level_comment,meta.retrieved_on,meta.gilded,meta.gildings,meta.subreddit,meta.stickied,meta.permalink,meta.author_flair_text,meta.surprise
e3b5chg,1532980787,All of the other responses here are spot-on bu...,Fencerman2,92rvm5,92rvm5,4,e3b5chg,1536885658,0,,Cornell,False,/r/Cornell/comments/92rvm5/what_to_expect_for_...,COE | CS '20,2.41803
cymjym0,1451978623,"Look, every college within Cornell is differen...",cryptkeep,cymhmfa,3zgnom,3,cymhmfa,1454289713,0,,Cornell,False,,,2.49118
d4itxwh,1466559788,The placement exams are generally final exams ...,laveritecestla,4p84yz,4p84yz,1,d4itxwh,1469205803,0,,Cornell,False,,,2.55911
e7wd72e,1539728814,a) You can't minor in AEP (but you can minor i...,rwaterbender,9ora8i,9ora8i,5,e7wd72e,1541128816,0,"{'gid_1': 0, 'gid_2': 0, 'gid_3': 0}",Cornell,False,/r/Cornell/comments/9ora8i/conflicting_time_sc...,AEP '20,2.6121
dbj1d3a,1482456805,I am not at all familiar with the transfer pro...,laveritecestla,dbiys1w,5jn3x9,1,dbhxklw,1483978661,0,,Cornell,False,,BME 18,2.61999
dxwnvbb,1524603525,Many of the classes are in the host language (...,Missora,8emxf7,8emxf7,6,dxwnvbb,1526699887,0,,Cornell,False,/r/Cornell/comments/8emxf7/can_i_study_abroad_...,,2.68823
e4mtybd,1534949183,"If you swap in that case, you would need to st...",Fencerman2,99dspg,99dspg,2,e4mtybd,1537809814,0,,Cornell,False,/r/Cornell/comments/99dspg/swap_lectures_quest...,COE | CS '20,2.68897
dmww8fw,1505241788,I don't know much about really competitive fra...,Fencerman2,dmwqw1q,6zhozg,2,dmwhmg8,1506719522,0,,Cornell,False,,CS '20,2.70664
dvir3qu,1520758241,CHEM 2150 is not a class you want to take if y...,rwaterbender,83daa2,83daa2,2,dvir3qu,1525042213,0,,Cornell,False,/r/Cornell/comments/83daa2/chem_2150_for_an_ae...,AEP '20,2.72102
d3eo3tj,1463875929,If you got it leaving the Student Agencies or ...,cryptkeep,4kg1kp,4kg1kp,3,d3eo3tj,1466012592,0,,Cornell,False,,,2.7291


### Comparison to SpeakerConvoDiversity

In [17]:
from convokit import SpeakerConvoDiversity

scd = SpeakerConvoDiversity(
    'div',
    # Compare each speaker-convo entry against the same speaker's other conversations.
    select_fn=lambda df, row, aux: (df.convo_id != row.convo_id) & (df.speaker == row.speaker),
    speaker_cols=['n_convos'],
    aux_input={'n_iters': 50, 'cmp_sample_size': 100, 'ref_sample_size': 100},
    verbosity=100,
)

In [18]:
# from sklearn.feature_extraction.text import CountVectorizer
# cv = CountVectorizer()

# for utt in corpus.iter_utterances():
#     tokens = cv.build_analyzer()(utt.text)
#     toks = [{'toks': [{'tok': x} for x in tokens]}]
#     utt.add_meta('tokens', toks)

# corpus.get_utterances_dataframe()

In [19]:
from convokit.text_processing import TextParser

tokenizer = TextParser(mode='tokenize', output_field='tokens', verbosity=1000)
subset_corpus = tokenizer.transform(subset_corpus)

1000/20700 utterances processed
2000/20700 utterances processed
3000/20700 utterances processed
4000/20700 utterances processed
5000/20700 utterances processed
6000/20700 utterances processed
7000/20700 utterances processed
8000/20700 utterances processed
9000/20700 utterances processed
10000/20700 utterances processed
11000/20700 utterances processed
12000/20700 utterances processed
13000/20700 utterances processed
14000/20700 utterances processed
15000/20700 utterances processed
16000/20700 utterances processed
17000/20700 utterances processed
18000/20700 utterances processed
19000/20700 utterances processed
20000/20700 utterances processed
20700/20700 utterances processed


In [20]:
div_transformed = scd.transform(subset_corpus)

joining tokens across conversation utterances
100 / 15394
200 / 15394
300 / 15394
400 / 15394
500 / 15394
600 / 15394
700 / 15394
800 / 15394
900 / 15394
1000 / 15394
1100 / 15394
1200 / 15394
1300 / 15394
1400 / 15394
1500 / 15394
1600 / 15394
1700 / 15394
1800 / 15394
1900 / 15394
2000 / 15394
2100 / 15394
2200 / 15394
2300 / 15394
2400 / 15394
2500 / 15394
2600 / 15394
2700 / 15394
2800 / 15394
2900 / 15394
3000 / 15394
3100 / 15394
3200 / 15394
3300 / 15394
3400 / 15394
3500 / 15394
3600 / 15394
3700 / 15394
3800 / 15394
3900 / 15394
4000 / 15394
4100 / 15394
4200 / 15394
4300 / 15394
4400 / 15394
4500 / 15394
4600 / 15394
4700 / 15394
4800 / 15394
4900 / 15394
5000 / 15394
5100 / 15394
5200 / 15394
5300 / 15394
5400 / 15394
5500 / 15394
5600 / 15394
5700 / 15394
5800 / 15394
['bill' 'bill' 'bill' ... 'at' 'least' '.']
[array(['computers', 'have', 'always', ..., 'it', "'s", 'fantastic'],
      dtype='<U59')]
5900 / 15394
6000 / 15394
6100 / 15394
6200 / 15394
6300 / 15394
6400 / 15

Here are the speaker-convo entries with the highest diversity scores.

In [21]:
div_transformed.get_speaker_convo_attribute_table(attrs=['div']).sort_values('div', ascending=False).head(10)

id,speaker,convo_id,convo_idx,div
Straight_Derpin__5kst5l,Straight_Derpin,5kst5l,34,4.586888
Dr_Narwhal__6h08sg,Dr_Narwhal,6h08sg,75,4.541451
rrrrrrr1131__8l3xht,rrrrrrr1131,8l3xht,25,4.498673
sasha07974__8v40c1,sasha07974,8v40c1,42,4.494537
t3hasiangod__5v6sqb,t3hasiangod,5v6sqb,590,4.488979
SwissWatchesOnly__9hcpip,SwissWatchesOnly,9hcpip,129,4.488228
agottler__9iyo8u,agottler,9iyo8u,66,4.486631
blackashi__2xxkm4,blackashi,2xxkm4,6,4.485264
ScottVandeberg__8tlcdl,ScottVandeberg,8tlcdl,81,4.48518
t3hasiangod__4ufm6z,t3hasiangod,4ufm6z,262,4.474613


Notice that the diversity scores returned by `SpeakerConvoDiversity` differ slightly from the scores returned by the `Surprise` transformer. The difference comes from the Laplace smoothing that the `Surprise` transformer applies to account for out-of-vocabulary (OOV) tokens. The `SpeakerConvoDiversity` transformer instead handles OOV tokens by treating their count as 1. If you run the `Surprise` transformer with the `smooth` flag set to `False`, it treats OOV tokens the same way `SpeakerConvoDiversity` does and returns the same scores.
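
The effect of the two OOV strategies can be seen on a toy example. The sketch below assumes both scores are unigram cross-entropies and that "treating the count as 1" means an unseen token gets probability 1 / (context length). It illustrates the idea only and is not code from either transformer; the function name and its `smooth` parameter are hypothetical here.

```python
import math
from collections import Counter

def cross_entropy(target, context, smooth=True):
    """Unigram cross-entropy of target under context with two OOV strategies."""
    counts = Counter(context)
    if smooth:
        # Laplace smoothing: add 1 to every count, including unseen tokens.
        vocab_size = len(set(context) | set(target))
        probs = [(counts[tok] + 1) / (len(context) + vocab_size) for tok in target]
    else:
        # No smoothing: seen tokens keep their count, unseen tokens count as 1.
        probs = [max(counts[tok], 1) / len(context) for tok in target]
    return -sum(math.log(p) for p in probs) / len(target)

context = ["a", "a", "b", "c"]
target = ["a", "d"]  # "d" never appears in the context

smoothed = cross_entropy(target, context, smooth=True)
unsmoothed = cross_entropy(target, context, smooth=False)
print(f"{smoothed:.3f}")    # 1.530
print(f"{unsmoothed:.3f}")  # 1.040
```

Smoothing changes the probability of every token, not just the unseen ones, which is why the two scores differ even for targets that are mostly in vocabulary.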

Here are the least diverse speaker-convo entries based on the SpeakerConvoDiversity transformer.

In [22]:
div_transformed.get_speaker_convo_attribute_table(attrs=['div']).sort_values('div').head(10)

id,speaker,convo_id,convo_idx,div
happysted__8qssd1,happysted,8qssd1,69,4.215634
Fencerman2__90r1nf,Fencerman2,90r1nf,231,4.219402
dedicateddan__4krfrc,dedicateddan,4krfrc,97,4.220458
Enyo287__5ipedu,Enyo287,5ipedu,281,4.223258
t3hasiangod__4ar3u0,t3hasiangod,4ar3u0,99,4.224444
kickstand__obvjl,kickstand,obvjl,6,4.227376
happysted__7mibi3,happysted,7mibi3,13,4.23038
iBeReese__1uuldh,iBeReese,1uuldh,6,4.230645
Pjcrafty__5apodz,Pjcrafty,5apodz,17,4.231452
t3hasiangod__3wtoeo,t3hasiangod,3wtoeo,46,4.23302
