Computing Surprise With ConvoKit
=====================
This notebook provides a demo of how to use the Surprise transformer to compute surprise across a corpus. The transformer currently only allows computation of how surprising a speaker's utterances in one conversation (target) are compared to their utterances in all other conversations (context) in the corpus. Eventually, the functionality of the Surprise transformer will be abstracted to allow for computation of surprise between any target and context types.
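
To make the target/context split concrete, here is a minimal sketch in plain Python that groups tokens by (speaker, conversation) pair and, for each pair, collects the same speaker's tokens from all other conversations as the context. The toy data, the whitespace tokenizer, and the helper name `build_target_and_context` are assumptions for illustration only; this is not ConvoKit's implementation.

```python
from collections import defaultdict

# Hypothetical toy data: (speaker, conversation_id, text) triples.
utterances = [
    ("alice", "c1", "the dining hall is closed today"),
    ("alice", "c1", "try the cafe instead"),
    ("alice", "c2", "prelim grades are out"),
    ("bob", "c1", "thanks for the tip"),
    ("bob", "c3", "anyone selling a bike"),
]


def build_target_and_context(utterances):
    """Return {(speaker, convo): (target_tokens, context_tokens)}."""
    # Pool each speaker's tokens per conversation (naive whitespace tokenizer).
    tokens_by_pair = defaultdict(list)
    for speaker, convo, text in utterances:
        tokens_by_pair[(speaker, convo)].extend(text.lower().split())

    pairs = {}
    for (speaker, convo), target in tokens_by_pair.items():
        # Context: the same speaker's tokens in every other conversation.
        context = [
            tok
            for (other_speaker, other_convo), toks in tokens_by_pair.items()
            if other_speaker == speaker and other_convo != convo
            for tok in toks
        ]
        pairs[(speaker, convo)] = (target, context)
    return pairs


pairs = build_target_and_context(utterances)
target, context = pairs[("alice", "c1")]
print(target)
print(context)
```

Every utterance by a speaker in a given conversation ends up in the same target, so all of those utterances share a single score.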

In [1]:
import convokit
import numpy as np
from convokit import Corpus, download, Surprise

Step 1: Load a corpus
--------
For now, we will use data from the subreddit r/Cornell to demonstrate the functionality of this transformer.

In [2]:
corpus = Corpus(filename=download('subreddit-Cornell'))

Dataset already exists at C:\Users\rgang\.convokit\downloads\subreddit-Cornell


In [3]:
corpus.print_summary_stats()

Number of Speakers: 7568
Number of Utterances: 74467
Number of Conversations: 10744


In order to speed up the demo, we will take just the top 100 most active speakers (based on the number of conversations they participate in).

In [4]:
SPEAKER_BLACKLIST = ['[deleted]']
def utterance_is_valid(utterance):
    return utterance.speaker.id not in SPEAKER_BLACKLIST and utterance.text

In [5]:
corpus.organize_speaker_convo_history(utterance_filter=utterance_is_valid)

In [6]:
speaker_activities = corpus.get_attribute_table('speaker', ['n_convos'])

In [7]:
speaker_activities.sort_values('n_convos', ascending=False).head(10)

id,n_convos
laveritecestla,781.0
EQUASHNZRKUL,726.0
CornHellUniversity,696.0
t3hasiangod,647.0
ilovemymemesboo,430.0
omgdonerkebab,425.0
cartesiancategory,341.0
cornell256,330.0
mushiettake,321.0
Fencerman2,298.0


In [8]:
top_speakers = speaker_activities.sort_values('n_convos', ascending=False).head(100).index
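
For readers without the corpus at hand, the selection step can be sketched in plain Python: drop blacklisted speakers and empty texts, count distinct conversations per speaker, and keep the most active ones. The tuple layout, the toy data, and the helper name `top_speakers_by_convos` are hypothetical; ties are broken alphabetically here, which is an assumption rather than what pandas does.

```python
from collections import defaultdict

SPEAKER_BLACKLIST = {"[deleted]"}


def top_speakers_by_convos(utterances, n):
    """Rank speakers by number of distinct conversations; return the top n.

    utterances: iterable of (speaker, conversation_id, text) tuples.
    """
    convos = defaultdict(set)
    for speaker, convo, text in utterances:
        # Same validity rule as the notebook's filter: not blacklisted, non-empty text.
        if speaker not in SPEAKER_BLACKLIST and text:
            convos[speaker].add(convo)
    # Most conversations first; alphabetical order breaks ties (an assumption).
    ranked = sorted(convos, key=lambda s: (-len(convos[s]), s))
    return ranked[:n]


toy_utterances = [
    ("alice", "c1", "hi"), ("alice", "c2", "hello"),
    ("alice", "c3", "hey"), ("alice", "c3", "again"),
    ("bob", "c1", "yo"), ("bob", "c2", "sup"),
    ("carol", "c1", ""), ("carol", "c4", "ok"),
    ("[deleted]", "c1", "x"), ("[deleted]", "c2", "x"),
    ("[deleted]", "c3", "x"), ("[deleted]", "c4", "x"),
]
print(top_speakers_by_convos(toy_utterances, 2))
```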

In [9]:
import itertools

subset_utts = [list(corpus.get_speaker(speaker).iter_utterances()) for speaker in top_speakers]
subset_corpus = Corpus(utterances=list(itertools.chain(*subset_utts)))

In [10]:
subset_corpus.print_summary_stats()

Number of Speakers: 100
Number of Utterances: 20700
Number of Conversations: 6904


Step 2: Create instance of surprise transformer
---------------
`min_target_length` and `min_context_length` specify the minimum number of tokens that should be in the target and context, respectively. If the target or context is too short, the transformer will set the surprise to `nan`. If we set these to just 1, the most surprising statements tend to be the very short ones.
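
The effect of the two thresholds can be illustrated with a small, self-contained scoring function. The sketch below assumes surprise is the cross-entropy of the target tokens under an add-one-smoothed unigram model of the context; that scoring rule and the name `surprise_score` are assumptions made for illustration, and ConvoKit's actual computation may differ.

```python
import math
from collections import Counter


def surprise_score(target, context, min_target_length=100, min_context_length=100):
    """Cross-entropy (in nats) of target tokens under a smoothed context model.

    Returns nan when the target or the context has too few tokens.
    """
    if len(target) < min_target_length or len(context) < min_context_length:
        return float("nan")
    counts = Counter(context)
    vocab = set(context) | set(target)
    denom = len(context) + len(vocab)  # add-one smoothing over the joint vocabulary
    return -sum(math.log((counts[tok] + 1) / denom) for tok in target) / len(target)


# With the default thresholds, a one-token target is not scored at all.
too_short = surprise_score(["a"], ["a"] * 200)

# With thresholds of 1, a single unseen token already scores high.
context = ["a", "a", "b", "b"]
familiar = surprise_score(["a", "b"], context, 1, 1)
novel = surprise_score(["c"], context, 1, 1)
print(too_short, familiar, novel)
```

In this toy setting, one unseen token is enough to push a very short target to a high score, which is the kind of behavior the length thresholds are meant to filter out.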

In [11]:
surp = Surprise(min_target_length=100, min_context_length=100)

Step 3: Fit transformer to corpus
-----


In [12]:
surp = surp.fit(subset_corpus)

Step 4: Transform corpus
--------
Currently, this transforms each utterance in the corpus, adding a field to its metadata that holds the calculated surprise.

In [13]:
transformed_corpus = surp.transform(subset_corpus, 'utterance')

AttributeError: 'list' object has no attribute 'lower'

Analysis
------
Let's take a look at some of the most surprising speaker conversation involvements.

In [16]:
most_surprising = transformed_corpus.get_utterances_dataframe().sort_values('meta.surprise', ascending=False).head(10)
most_surprising

id,timestamp,text,speaker,reply_to,conversation_id,meta.score,meta.top_level_comment,meta.retrieved_on,meta.gilded,meta.gildings,meta.subreddit,meta.stickied,meta.permalink,meta.author_flair_text,meta.surprise
d5pb24c,1469403847,"They should be fine, though there are other op...",t3hasiangod,4ufm6z,4ufm6z,3,d5pb24c,1471654647,0,,Cornell,False,,,11.5836
d64vv88,1470366925,"Eh, I would be hesitant to wear those. From th...",t3hasiangod,4w8d7e,4w8d7e,5,d64vv88,1473230691,0,,Cornell,False,,,10.8618
e0r92ty,1529122507,This is from a recent grad who lived in balch ...,mushiettake,8rgj59,8rgj59,1,e0r92ty,1532599439,0,,Cornell,False,/r/Cornell/comments/8rgj59/got_assigned_to_bal...,,10.2574
8rh1my,1529123054,So I spent the effort of making my initial rep...,mushiettake,,8rh1my,20,,1536383121,0,,Cornell,False,/r/Cornell/comments/8rh1my/perks_of_living_in_...,,9.92811
db144co,1481398407,You cannot move out of West halfway through th...,t3hasiangod,5hlpq0,5hlpq0,11,db144co,1483628051,0,,Cornell,False,,Computational Biology 2015,9.87174
dqi8dmr,1511965247,The physical location and environment - Ithaca...,cornell256,7gd8vb,7gd8vb,32,dqi8dmr,1513124252,0,,Cornell,False,/r/Cornell/comments/7gd8vb/awesome_things_abou...,Econ '16,9.64534
dk0e0gl,1499654310,There have been other threads in this sub talk...,mushiettake,6mbiyo,6mbiyo,3,dk0e0gl,1501108383,0,,Cornell,False,,,9.58313
e32s0to,1532622562,O-Week: Everyone here is so nice and I want to...,EQUASHNZRKUL,e31ttiw,91yv8u,22,e31ttiw,1536683082,0,,Cornell,False,/r/Cornell/comments/91yv8u/least_favorite_thin...,CS &amp; Physics 2020,9.50873
e32s42w,1532622635,Just don’t go on the meme page. Just don’t go ...,EQUASHNZRKUL,e32cqj1,91yv8u,11,e32cqj1,1536683122,0,,Cornell,False,/r/Cornell/comments/91yv8u/least_favorite_thin...,CS &amp; Physics 2020,9.50873
e32sjki,1532622986,"Its mostly deflecting, insecure Brown and Dart...",EQUASHNZRKUL,e32sdi0,91yv8u,10,e32cqj1,1536683573,0,,Cornell,False,/r/Cornell/comments/91yv8u/least_favorite_thin...,CS &amp; Physics 2020,9.50873


You can see above that utterances with the same speaker and conversation id have the same surprise, as expected. Let's remove these duplicate entries so we can see more of the data.

In [17]:
most_surprising = transformed_corpus.get_utterances_dataframe().sort_values('meta.surprise', ascending=False).drop_duplicates(subset=['speaker', 'conversation_id']).head(10)
most_surprising

id,timestamp,text,speaker,reply_to,conversation_id,meta.score,meta.top_level_comment,meta.retrieved_on,meta.gilded,meta.gildings,meta.subreddit,meta.stickied,meta.permalink,meta.author_flair_text,meta.surprise
d5pb24c,1469403847,"They should be fine, though there are other op...",t3hasiangod,4ufm6z,4ufm6z,3,d5pb24c,1471654647,0,,Cornell,False,,,11.5836
d64vv88,1470366925,"Eh, I would be hesitant to wear those. From th...",t3hasiangod,4w8d7e,4w8d7e,5,d64vv88,1473230691,0,,Cornell,False,,,10.8618
e0r92ty,1529122507,This is from a recent grad who lived in balch ...,mushiettake,8rgj59,8rgj59,1,e0r92ty,1532599439,0,,Cornell,False,/r/Cornell/comments/8rgj59/got_assigned_to_bal...,,10.2574
8rh1my,1529123054,So I spent the effort of making my initial rep...,mushiettake,,8rh1my,20,,1536383121,0,,Cornell,False,/r/Cornell/comments/8rh1my/perks_of_living_in_...,,9.92811
db144co,1481398407,You cannot move out of West halfway through th...,t3hasiangod,5hlpq0,5hlpq0,11,db144co,1483628051,0,,Cornell,False,,Computational Biology 2015,9.87174
dqi8dmr,1511965247,The physical location and environment - Ithaca...,cornell256,7gd8vb,7gd8vb,32,dqi8dmr,1513124252,0,,Cornell,False,/r/Cornell/comments/7gd8vb/awesome_things_abou...,Econ '16,9.64534
dk0e0gl,1499654310,There have been other threads in this sub talk...,mushiettake,6mbiyo,6mbiyo,3,dk0e0gl,1501108383,0,,Cornell,False,,,9.58313
e32s0to,1532622562,O-Week: Everyone here is so nice and I want to...,EQUASHNZRKUL,e31ttiw,91yv8u,22,e31ttiw,1536683082,0,,Cornell,False,/r/Cornell/comments/91yv8u/least_favorite_thin...,CS &amp; Physics 2020,9.50873
d2cqjqx,1461302875,"Well, you don't get to choose which dorm you l...",t3hasiangod,4fwz3i,4fwz3i,4,d2cqjqx,1463611881,0,,Cornell,False,,,9.50164
db7t3p5,1481777599,Co-ops and frats have some significant differe...,t3hasiangod,5ifb7k,5ifb7k,7,db7t3p5,1483777566,0,,Cornell,False,,Computational Biology 2015,9.48275
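
The de-duplication used for this table keeps, for each (speaker, conversation_id) pair, the first row in sorted order. A plain Python sketch of the same idea follows; the row dictionaries and the helper name `top_unique_pairs` are hypothetical, and rows with a `nan` score are simply skipped here (pandas instead sorts them to the end).

```python
import math


def top_unique_pairs(rows, n):
    """Return the n highest-surprise rows, one per (speaker, conversation_id)."""
    # Skip rows without a score; nan values do not sort meaningfully.
    scored = [r for r in rows if not math.isnan(r["surprise"])]
    seen = set()
    result = []
    # sorted() is stable, so ties keep their original relative order.
    for row in sorted(scored, key=lambda r: r["surprise"], reverse=True):
        key = (row["speaker"], row["conversation_id"])
        if key in seen:
            continue
        seen.add(key)
        result.append(row)
        if len(result) == n:
            break
    return result


rows = [
    {"id": "u1", "speaker": "alice", "conversation_id": "c1", "surprise": 9.5},
    {"id": "u2", "speaker": "alice", "conversation_id": "c1", "surprise": 9.5},
    {"id": "u3", "speaker": "bob", "conversation_id": "c2", "surprise": 8.0},
    {"id": "u4", "speaker": "alice", "conversation_id": "c3", "surprise": float("nan")},
    {"id": "u5", "speaker": "bob", "conversation_id": "c1", "surprise": 7.0},
]
print([r["id"] for r in top_unique_pairs(rows, 2)])
```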


Let's take a look at the full text for these most surprising speaker-conversation pairs.

In [18]:
for i,x in most_surprising.iterrows():
    speaker, convo, surprise = x['speaker'], x['conversation_id'], x['meta.surprise']
    print('Speaker: {}, Conversation: {}, Surprise: {}'.format(speaker, convo, surprise))
    utterance_ids = transformed_corpus.get_speaker_convo_info(speaker, convo)['utterance_ids']
    print(' '.join([transformed_corpus.get_utterance(utt).text for utt in utterance_ids]))
    print()

Speaker: t3hasiangod, Conversation: 4ufm6z, Surprise: 11.583617063125047
They should be fine, though there are other options that are just as good that are cheaper. If you look around stores like REI or Eastern Mountain Sports, you'll find a lot of hiking shoes and boots (men and women) that are waterproof and offer great traction in the snow. A lot of these stores even have a "winter boots" section too. 

[Merrell Waterproof Hiking Boots for $109.99 from EMS](http://www.ems.com/merrell-mens-radius-ii-mid-waterproof-hiking-boots/1365320.html#start=2)

[Hi Tec Waterproof Hiking Boots for $59.99 from EMS](http://www.ems.com/hi-tec-men%E2%80%99s-logan-waterproof-hiking-boots/1293873.html#start=1)

[Sorel Waterproof Boots for $89.93 from REI](https://www.rei.com/product/888277/sorel-paxson-tall-waterproof-winter-boots-mens)

[Merrell Waterproof Hiking Shoes for $100 from EMS](http://www.ems.com/merrell-mens-moab-ventilator-hiking-shoes-walnut/1365104.html#start=2)

Speaker: t3hasiangod, Co

We can also take a look at what the speaker with the most surprising conversation said in other conversations.

In [19]:
[utt.text for utt in transformed_corpus.get_speaker('t3hasiangod').iter_utterances(selector=lambda utt: utt.conversation_id != '4ufm6z')]

research experience after graduating. I was lucky, and most of the research done in my field is dry, rather than bench. But if you're in something like biochem or molecular bio, you'll want that bench experience.",
 "Swapping a bio class to pass/fail would almost certainly be a huge disqualification. It's better to take the crappy grade and retake the course.",
 'https://isso.cornell.edu/students/working-us/f1-internships-cpt',
 "You could. The only thing that it would save is your GPA, and while I'm not positive, I'm sure that it would still be a huge blemish. I'm also positive that med school adcoms have some way of putting S/U courses into context.",
 "I would advise against using a fake. The bars and liquor stores around Ithaca are really strict about it, and even though they probably don't have a lot of experience picking out the fake international IDs, I wouldn't play Russian Roulette with them.",
 "Congrats to all the admits! If anyone is in the Cleveland area who wants to know 

Now, let's look at some of the least surprising entries.

In [20]:
least_surprising = transformed_corpus.get_utterances_dataframe().sort_values('meta.surprise').drop_duplicates(subset=['speaker', 'conversation_id']).head(10)
least_surprising

id,timestamp,text,speaker,reply_to,conversation_id,meta.score,meta.top_level_comment,meta.retrieved_on,meta.gilded,meta.gildings,meta.subreddit,meta.stickied,meta.permalink,meta.author_flair_text,meta.surprise
dbrnysy,1483043255,"Memes aside, I honestly don't know a good answ...",Straight_Derpin,5kst5l,5kst5l,2,dbrnysy,1484133363,0,,Cornell,False,,CS 2020,0.43012
diujudc,1497369546,Fall 2011 Admissions Stats:\n\nSchool | Apps |...,Dr_Narwhal,6h08sg,6h08sg,9,diujudc,1499312897,0,,Cornell,False,,Physics &amp; Mathematics 2019,3.00525
deydmud,1489576091,"Ah yes, because Newt is so tolerant of those w...",Bigmouthstrikesback,dexqq3j,5z98fa,3,dewdp2k,1491495064,0,,Cornell,False,,,3.51494
dosp83a,1508806556,"Far above Cayuga's waters,\nwith its waves of ...",SwissWatchesOnly,dosp80m,78cb5y,1,dosp7mm,1510121625,0,,Cornell,False,/r/Cornell/comments/78cb5y/alma_mater/dosp83a/,,3.52073
de5g57y,1487928276,"&gt;Gucci, whose real name is Radric Davis, is...",chrissydablack,5vvc60,5vvc60,-2,de5g57y,1489102864,0,,Cornell,False,,,3.67407
e6nzua8,1537964565,some ideas based off of my personal experience...,agottler,9iyo8u,9iyo8u,10,e6nzua8,1539548114,0,"{'gid_1': 0, 'gid_2': 0, 'gid_3': 0}",Cornell,False,/r/Cornell/comments/9iyo8u/sending_a_care_pack...,,4.18462
d4gn2ut,1466432480,"Here is some actual data on incoming flights, ...",turtlecrk,4owh99,4owh99,3,d4gn2ut,1469167939,0,,Cornell,False,,,4.26553
dra9mbr,1513338093,Could you give more information about yourself...,Sleeppppp,7jz1e0,7jz1e0,5,dra9mbr,1514739645,0,,Cornell,False,/r/Cornell/comments/7jz1e0/to_those_who_took_c...,,4.42843
e34p00z,1532700628,I don't understand why you're downvoting. You ...,Bearclawmen,e33j95b,91yv8u,0,e321q4c,1536736848,0,,Cornell,False,/r/Cornell/comments/91yv8u/least_favorite_thin...,HUMAN BEING,4.47033
90nmvb,1532157947,"My SHP waiver app got denied, after being appr...",Weinfield,,90nmvb,5,,1536641349,0,,Cornell,False,/r/Cornell/comments/90nmvb/shp_waiver_denial/,CS 19,4.62374


In [21]:
for i,x in least_surprising.iterrows():
    speaker, convo, surprise = x['speaker'], x['conversation_id'], x['meta.surprise']
    print('Speaker: {}, Conversation: {}, Surprise: {}'.format(speaker, convo, surprise))
    utterance_ids = transformed_corpus.get_speaker_convo_info(speaker, convo)['utterance_ids']
    print(' '.join([transformed_corpus.get_utterance(utt).text for utt in utterance_ids]))
    print()

Speaker: Straight_Derpin, Conversation: 5kst5l, Surprise: 0.43012022915541887
BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BI

### Comparison to SpeakerConvoDiversity

In [22]:
from convokit import SpeakerConvoDiversity

scd = SpeakerConvoDiversity(
    'div',
    # only score speaker-convo entries with at least 100 tokens
    cmp_select_fn=lambda df, aux: df.tokens.map(len) >= 100,
    # compare against the same speaker's conversations of opposite index parity
    select_fn=lambda df, row, aux: (df.convo_idx % 2 != row.convo_idx % 2) & (df.speaker == row.speaker),
    speaker_cols=['n_convos'],
    aux_input={'n_iters': 50, 'cmp_sample_size': 100, 'ref_sample_size': 100},
    verbosity=100,
)

In [24]:
from convokit.text_processing import TextParser

tokenizer = TextParser(mode='tokenize', output_field='tokens', verbosity=1000)
subset_corpus = tokenizer.transform(subset_corpus)

1000/20700 utterances processed
2000/20700 utterances processed
3000/20700 utterances processed
4000/20700 utterances processed
5000/20700 utterances processed
6000/20700 utterances processed
7000/20700 utterances processed
8000/20700 utterances processed
9000/20700 utterances processed
10000/20700 utterances processed
11000/20700 utterances processed
12000/20700 utterances processed
13000/20700 utterances processed
14000/20700 utterances processed
15000/20700 utterances processed
16000/20700 utterances processed
17000/20700 utterances processed
18000/20700 utterances processed
19000/20700 utterances processed
20000/20700 utterances processed
20700/20700 utterances processed


In [25]:
div_transformed = scd.transform(subset_corpus)

joining tokens across conversation utterances
100 / 3097
200 / 3097
300 / 3097
400 / 3097
500 / 3097
600 / 3097
700 / 3097
800 / 3097
900 / 3097
1000 / 3097
1100 / 3097
1200 / 3097
1300 / 3097
1400 / 3097
1500 / 3097
1600 / 3097
1700 / 3097
1800 / 3097
1900 / 3097
2000 / 3097
2100 / 3097
2200 / 3097
2300 / 3097
2400 / 3097
2500 / 3097
2600 / 3097
2700 / 3097
2800 / 3097
2900 / 3097
3000 / 3097


Here are the speaker-convo entries that have the highest diversity score.

In [28]:
div_transformed.get_speaker_convo_attribute_table(attrs=['div']).sort_values('div', ascending=False).head(10)

id,speaker,convo_id,convo_idx,div
Straight_Derpin__5kst5l,Straight_Derpin,5kst5l,34,4.591143
Dr_Narwhal__6h08sg,Dr_Narwhal,6h08sg,75,4.535059
sasha07974__8v40c1,sasha07974,8v40c1,42,4.50058
rrrrrrr1131__8l3xht,rrrrrrr1131,8l3xht,25,4.497185
SwissWatchesOnly__9hcpip,SwissWatchesOnly,9hcpip,129,4.496635
blackashi__2xxkm4,blackashi,2xxkm4,6,4.490274
agottler__9iyo8u,agottler,9iyo8u,66,4.488779
t3hasiangod__5v6sqb,t3hasiangod,5v6sqb,590,4.485214
ScottVandeberg__8tlcdl,ScottVandeberg,8tlcdl,81,4.484983
t3hasiangod__4ufm6z,t3hasiangod,4ufm6z,262,4.482088


Notice that the first speaker-convo entry (Straight_Derpin, 5kst5l) is the one that was least surprising according to the Surprise transformer. The two measures disagree here, and we will want to dig further into why. One potential reason is the sampling used when calculating perplexity in the SpeakerConvoDiversity transformer, which attempts to remove any length-based effects on the perplexity. We may need to replicate this sampling method in the Surprise transformer.
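
As a rough illustration of what such length-controlled sampling could look like, the sketch below repeatedly draws fixed-size token samples from the target and the context and averages a smoothed unigram cross-entropy over the draws. The default sizes mirror the `n_iters`, `cmp_sample_size`, and `ref_sample_size` values passed to SpeakerConvoDiversity above, but the scoring rule, the choice to draw random tokens rather than contiguous spans, and the function names are all assumptions, not ConvoKit's code.

```python
import math
import random
from collections import Counter


def cross_entropy(target, context):
    """Cross-entropy (nats) of target under an add-one-smoothed unigram model."""
    counts = Counter(context)
    vocab = set(context) | set(target)
    denom = len(context) + len(vocab)
    return -sum(math.log((counts[tok] + 1) / denom) for tok in target) / len(target)


def sampled_cross_entropy(target, context, sample_size=100, n_iters=50, seed=0):
    """Average cross-entropy over equal-sized random samples of both sides."""
    if len(target) < sample_size or len(context) < sample_size:
        return float("nan")
    rng = random.Random(seed)  # seeded so repeated calls give the same score
    total = 0.0
    for _ in range(n_iters):
        # Every draw scores the same number of tokens, whatever the input length.
        target_sample = rng.sample(target, sample_size)
        context_sample = rng.sample(context, sample_size)
        total += cross_entropy(target_sample, context_sample)
    return total / n_iters


long_target = ["a", "b", "c", "d"] * 250  # 1000 tokens
context = ["a", "b", "e", "f"] * 100      # 400 tokens
score = sampled_cross_entropy(long_target, context)
print(score)
```

Because each draw uses the same number of tokens on both sides, a long target is scored on the same footing as a short one.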

Here are the least diverse speaker-convo entries based on the SpeakerConvoDiversity transformer.

In [29]:
div_transformed.get_speaker_convo_attribute_table(attrs=['div']).sort_values('div').head(10)

id,speaker,convo_id,convo_idx,div
happysted__916yut,happysted,916yut,99,4.217198
laveritecestla__4pylgl,laveritecestla,4pylgl,74,4.21986
Pjcrafty__5apodz,Pjcrafty,5apodz,17,4.226092
shadowclan98__9mfj81,shadowclan98,9mfj81,97,4.228523
dedicateddan__1bdpdq,dedicateddan,1bdpdq,7,4.229425
kickstand__obvjl,kickstand,obvjl,6,4.229869
dedicateddan__4krfrc,dedicateddan,4krfrc,97,4.234172
t3hasiangod__4ar3u0,t3hasiangod,4ar3u0,99,4.235278
cornell256__96utv3,cornell256,96utv3,292,4.238155
Fencerman2__6tiomd,Fencerman2,6tiomd,111,4.238464
