Computing Surprise With ConvoKit
=====================
This notebook demonstrates how to use the Surprise transformer to compute surprise across a corpus. Currently, the transformer only computes how surprising a speaker's utterances in one conversation (the target) are compared to their utterances in all other conversations in the corpus (the context). Eventually, the transformer will be generalized to compute surprise between any target and context types.

In [1]:
import convokit
import numpy as np
from convokit import Corpus, download, Surprise

Step 1: Load a corpus
--------
For now, we will use data from the subreddit r/Cornell to demonstrate this transformer.

In [2]:
corpus = Corpus(filename=download('subreddit-Cornell'))

Dataset already exists at C:\Users\rgang\.convokit\downloads\subreddit-Cornell


In [3]:
corpus.print_summary_stats()

Number of Speakers: 7568
Number of Utterances: 74467
Number of Conversations: 10744


To speed up the demo, we will keep only the 100 most active speakers (ranked by the number of conversations they participate in).

In [4]:
SPEAKER_BLACKLIST = ['[deleted]', 'DeltaBot', 'AutoModerator']
def utterance_is_valid(utterance):
    return utterance.speaker.id not in SPEAKER_BLACKLIST and utterance.text

In [5]:
corpus.organize_speaker_convo_history(utterance_filter=utterance_is_valid)

In [6]:
speaker_activities = corpus.get_attribute_table('speaker', ['n_convos'])

In [7]:
speaker_activities.sort_values('n_convos', ascending=False).head(10)

Unnamed: 0_level_0,n_convos
id,Unnamed: 1_level_1
laveritecestla,781.0
EQUASHNZRKUL,726.0
CornHellUniversity,696.0
t3hasiangod,647.0
ilovemymemesboo,430.0
omgdonerkebab,425.0
cartesiancategory,341.0
cornell256,330.0
mushiettake,321.0
Fencerman2,298.0


In [8]:
top_speakers = speaker_activities.sort_values('n_convos', ascending=False).head(100).index

In [9]:
import itertools

subset_utts = [list(corpus.get_speaker(speaker).iter_utterances()) for speaker in top_speakers]
subset_corpus = Corpus(utterances=list(itertools.chain(*subset_utts)))

In [10]:
subset_corpus.print_summary_stats()

Number of Speakers: 100
Number of Utterances: 20700
Number of Conversations: 6904


Step 2: Create instance of surprise transformer
---------------
`min_target_length` and `min_context_length` specify the minimum number of tokens that must be in the target and context, respectively. If the target or context is too short, the transformer sets the surprise to `nan`. If we set these to just 1, the most surprising statements tend to be the very short ones.
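To make the role of the two length thresholds concrete, here is a minimal sketch. It is not ConvoKit's implementation. It assumes that surprise behaves like the cross-entropy of the target tokens under an add-one smoothed unigram model of the context. The function name `cross_entropy_surprise`, the smoothing choice, and the toy sentences are all illustrative assumptions.

```python
import math
from collections import Counter


def cross_entropy_surprise(target_tokens, context_tokens,
                           min_target_length=1, min_context_length=1):
    """Cross-entropy of the target under an add-one smoothed unigram model of the context.

    Illustrative only: the real Surprise transformer may differ in its details.
    """
    # Too little text on either side: report nan instead of a noisy score.
    if len(target_tokens) < min_target_length or len(context_tokens) < min_context_length:
        return math.nan
    counts = Counter(context_tokens)
    vocab = set(context_tokens) | set(target_tokens)
    denom = len(context_tokens) + len(vocab)
    log_probs = [math.log((counts[tok] + 1) / denom) for tok in target_tokens]
    return -sum(log_probs) / len(target_tokens)


context = "the exam was hard but the class was fine".split()
short_target = "prelim".split()
long_target = "the class was hard".split()

print(cross_entropy_surprise(short_target, context))
print(cross_entropy_surprise(long_target, context))
# With a higher threshold the one-token target is skipped and scored as nan.
print(cross_entropy_surprise(short_target, context, min_target_length=2))
```

In this toy setting, the one-token target made of an unseen word scores higher than the longer target built from familiar words, which is the kind of short-statement effect the thresholds are meant to filter out.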

In [11]:
surp = Surprise(min_target_length=100, min_context_length=100, n_samples=50)

Step 3: Fit transformer to corpus
-----


In [13]:
surp = surp.fit(subset_corpus, group_models_by=['speaker'])

Step 4: Transform corpus
--------
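Currently, this adds a `surprise` field to the metadata of each utterance in the corpus. Targets are grouped by speaker and conversation, and the context for each target is the same speaker's text from all other conversations.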

In [14]:
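transformed_corpus = surp.transform(
    subset_corpus,
    'utterance',
    # Each target is one speaker's text within one conversation.
    group_target_by=['speaker', 'conversation'],
    # Context: the same speaker's text from every other conversation.
    context_selector=lambda s, t: (
        (s.index.get_level_values('speaker') == t[0])
        & (s.index.get_level_values('conversation_id') != t[1])
    ),
    # Pick the model keyed by the first element of the target index (the speaker).
    model_selector=lambda ind: ind[0],
)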
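The selection logic in `context_selector` can be illustrated without pandas. The sketch below is a plain-Python analogue, not the transformer's code: `groups` and `select_context` are hypothetical names, and the toy data is invented for illustration.

```python
# Hypothetical toy data: tokens grouped by (speaker, conversation_id).
groups = {
    ("alice", "c1"): ["hello", "there"],
    ("alice", "c2"): ["prelim", "grades"],
    ("alice", "c3"): ["dining", "hall"],
    ("bob", "c1"): ["hi", "alice"],
}


def select_context(groups, target_key):
    """Pool tokens from the target speaker's other conversations."""
    speaker, convo = target_key
    context = []
    for (other_speaker, other_convo), tokens in groups.items():
        # Same speaker, different conversation: the same condition as the
        # boolean mask built in context_selector.
        if other_speaker == speaker and other_convo != convo:
            context.extend(tokens)
    return context


alice_c1_context = select_context(groups, ("alice", "c1"))
print(alice_c1_context)
```

A speaker who appears in only one conversation, like `bob` here, ends up with an empty context, which is one way a target can fall below `min_context_length`.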

Analysis
------
Let's take a look at some of the most surprising speaker-conversation pairs.

In [15]:
most_surprising = transformed_corpus.get_utterances_dataframe().sort_values('meta.surprise', ascending=False).head(10)
most_surprising

Unnamed: 0_level_0,timestamp,text,speaker,reply_to,conversation_id,meta.score,meta.top_level_comment,meta.retrieved_on,meta.gilded,meta.gildings,meta.subreddit,meta.stickied,meta.permalink,meta.author_flair_text,meta.surprise
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
dvir3qu,1520758241,CHEM 2150 is not a class you want to take if y...,rwaterbender,83daa2,83daa2,2,dvir3qu,1525042213,0,,Cornell,False,/r/Cornell/comments/83daa2/chem_2150_for_an_ae...,AEP '20,2.62628
cymjym0,1451978623,"Look, every college within Cornell is differen...",cryptkeep,cymhmfa,3zgnom,3,cymhmfa,1454289713,0,,Cornell,False,,,2.59665
cylyz8l,1451942801,I don't believe there is a minimum GPA require...,cryptkeep,3zgnom,3zgnom,7,cylyz8l,1454279614,0,,Cornell,False,,,2.59665
e7wd72e,1539728814,a) You can't minor in AEP (but you can minor i...,rwaterbender,9ora8i,9ora8i,5,e7wd72e,1541128816,0,"{'gid_1': 0, 'gid_2': 0, 'gid_3': 0}",Cornell,False,/r/Cornell/comments/9ora8i/conflicting_time_sc...,AEP '20,2.56051
e3b5chg,1532980787,All of the other responses here are spot-on bu...,Fencerman2,92rvm5,92rvm5,4,e3b5chg,1536885658,0,,Cornell,False,/r/Cornell/comments/92rvm5/what_to_expect_for_...,COE | CS '20,2.53709
6v857y,1503369657,Please place all admissions related posts here...,laveritecestla,,6v857y,3,,1504709210,0,,Cornell,False,/r/Cornell/comments/6v857y/biweekly_chance_me_...,Biomedical Engineering '18,2.50844
e6ojkdq,1537982013,Right. So I want to do engineering if I do wan...,CornellMan333,e6ohi18,9j2exy,2,e6o584v,1539557297,0,"{'gid_1': 0, 'gid_2': 0, 'gid_3': 0}",Cornell,False,/r/Cornell/comments/9j2exy/physics_vs_aep/e6oj...,,2.50098
e6oa9hc,1537974575,What physics related jobs can you do with a ba...,CornellMan333,e6o584v,9j2exy,1,e6o584v,1539552969,0,"{'gid_1': 0, 'gid_2': 0, 'gid_3': 0}",Cornell,False,/r/Cornell/comments/9j2exy/physics_vs_aep/e6oa...,,2.50098
9j2exy,1537967895,Physics vs AEP\n\nSo I’m a sophomore majoring ...,CornellMan333,,9j2exy,3,,1540177615,0,"{'gid_1': 0, 'gid_2': 0, 'gid_3': 0}",Cornell,False,/r/Cornell/comments/9j2exy/physics_vs_aep/,,2.50098
e6ow0l0,1537991816,"Yeah I was considering just doing that, but it...",CornellMan333,e6o44ke,9j2exy,2,e6o44ke,1539563128,0,"{'gid_1': 0, 'gid_2': 0, 'gid_3': 0}",Cornell,False,/r/Cornell/comments/9j2exy/physics_vs_aep/e6ow...,,2.50098


You can see above that utterances with the same speaker and conversation ID have the same surprise, as expected, since surprise is computed per speaker-conversation pair. Let's remove these duplicate entries so we can see more of the data.

In [16]:
most_surprising = transformed_corpus.get_utterances_dataframe().sort_values('meta.surprise', ascending=False).drop_duplicates(subset=['speaker', 'conversation_id']).head(10)
most_surprising

Unnamed: 0_level_0,timestamp,text,speaker,reply_to,conversation_id,meta.score,meta.top_level_comment,meta.retrieved_on,meta.gilded,meta.gildings,meta.subreddit,meta.stickied,meta.permalink,meta.author_flair_text,meta.surprise
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
dvir3qu,1520758241,CHEM 2150 is not a class you want to take if y...,rwaterbender,83daa2,83daa2,2,dvir3qu,1525042213,0,,Cornell,False,/r/Cornell/comments/83daa2/chem_2150_for_an_ae...,AEP '20,2.62628
cymjym0,1451978623,"Look, every college within Cornell is differen...",cryptkeep,cymhmfa,3zgnom,3,cymhmfa,1454289713,0,,Cornell,False,,,2.59665
e7wd72e,1539728814,a) You can't minor in AEP (but you can minor i...,rwaterbender,9ora8i,9ora8i,5,e7wd72e,1541128816,0,"{'gid_1': 0, 'gid_2': 0, 'gid_3': 0}",Cornell,False,/r/Cornell/comments/9ora8i/conflicting_time_sc...,AEP '20,2.56051
e3b5chg,1532980787,All of the other responses here are spot-on bu...,Fencerman2,92rvm5,92rvm5,4,e3b5chg,1536885658,0,,Cornell,False,/r/Cornell/comments/92rvm5/what_to_expect_for_...,COE | CS '20,2.53709
6v857y,1503369657,Please place all admissions related posts here...,laveritecestla,,6v857y,3,,1504709210,0,,Cornell,False,/r/Cornell/comments/6v857y/biweekly_chance_me_...,Biomedical Engineering '18,2.50844
e6ojkdq,1537982013,Right. So I want to do engineering if I do wan...,CornellMan333,e6ohi18,9j2exy,2,e6o584v,1539557297,0,"{'gid_1': 0, 'gid_2': 0, 'gid_3': 0}",Cornell,False,/r/Cornell/comments/9j2exy/physics_vs_aep/e6oj...,,2.50098
e7y5e4h,1539798896,I think you are actually in a good position. Y...,rwaterbender,9oz71e,9oz71e,2,e7y5e4h,1541158805,0,"{'gid_1': 0, 'gid_2': 0, 'gid_3': 0}",Cornell,False,/r/Cornell/comments/9oz71e/help_guiding_a_lost...,AEP '20,2.48104
din8ic4,1496955479,"Imo, the best way to boost your chances is to ...",Fencerman2,6g3p38,6g3p38,4,din8ic4,1499182410,0,,Cornell,False,,CS '20,2.4683
e4mtybd,1534949183,"If you swap in that case, you would need to st...",Fencerman2,99dspg,99dspg,2,e4mtybd,1537809814,0,,Cornell,False,/r/Cornell/comments/99dspg/swap_lectures_quest...,COE | CS '20,2.46359
dfqnxf1,1491164477,It's based on what you think it will be worth....,agarcia900,dfpz4uj,60wqtn,2,dfpz4uj,1493735430,0,,Cornell,False,,2020,2.44975


Let's take a look at the full text for these most surprising speaker-conversation pairs.

In [17]:
for i,x in most_surprising.iterrows():
    speaker, convo, surprise = x['speaker'], x['conversation_id'], x['meta.surprise']
    print('Speaker: {}, Conversation: {}, Surprise: {}'.format(speaker, convo, surprise))
    utterance_ids = transformed_corpus.get_speaker_convo_info(speaker, convo)['utterance_ids']
    print(' '.join([transformed_corpus.get_utterance(utt).text for utt in utterance_ids]))
    print()

Speaker: rwaterbender, Conversation: 83daa2, Surprise: 2.6262797619618783
CHEM 2150 is not a class you want to take if you don't have to, from what I've heard. Know an AEP guy who was turned off from ChemE by it, so since you're not even considering ChemE you might as well take the AP credit. I did that and have not regretted it.

Also, do not take 1116 and 2217 the same semester. I mean, whether you do is up to you, but almost no one does this and it is not recommended due to the workload. If you really want to get ahead in AEP and you have the prerequisites, double up on maths (2930 and 2940) and if you have credit for those go straight into the AEP Major (MatPhys 4210 or Circuits 3630). If you have any other questions about AEP feel free to ask!

Speaker: cryptkeep, Conversation: 3zgnom, Surprise: 2.596652486329722
I don't believe there is a minimum GPA requirement, but if you are applying to transfer, the key is really going to be articulating why coming to Cornel

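We can also look at what a speaker said in their other conversations. Here we take t3hasiangod, whose conversation 4ufm6z appears among the least surprising entries below, and print their utterances from every other conversation.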

In [18]:
[utt.text for utt in transformed_corpus.get_speaker('t3hasiangod').iter_utterances(selector=lambda utt: utt.conversation_id != '4ufm6z')]

research experience after graduating. I was lucky, and most of the research done in my field is dry, rather than bench. But if you're in something like biochem or molecular bio, you'll want that bench experience.",
 "Swapping a bio class to pass/fail would almost certainly be a huge disqualification. It's better to take the crappy grade and retake the course.",
 'https://isso.cornell.edu/students/working-us/f1-internships-cpt',
 "You could. The only thing that it would save is your GPA, and while I'm not positive, I'm sure that it would still be a huge blemish. I'm also positive that med school adcoms have some way of putting S/U courses into context.",
 "I would advise against using a fake. The bars and liquor stores around Ithaca are really strict about it, and even though they probably don't have a lot of experience picking out the fake international IDs, I wouldn't play Russian Roulette with them.",
 "C

Now, let's look at some of the least surprising entries.

In [19]:
least_surprising = transformed_corpus.get_utterances_dataframe().sort_values('meta.surprise').drop_duplicates(subset=['speaker', 'conversation_id']).head(10)
least_surprising

Unnamed: 0_level_0,timestamp,text,speaker,reply_to,conversation_id,meta.score,meta.top_level_comment,meta.retrieved_on,meta.gilded,meta.gildings,meta.subreddit,meta.stickied,meta.permalink,meta.author_flair_text,meta.surprise
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
dbrix3s,1483037010,BILL BILL BILL BILL BILL BILL BILL BILL BILL B...,Straight_Derpin,dbqkxgp,5kst5l,22,dbqkxgp,1484130862,0,,Cornell,False,,CS 2020,0.0689971
diujudc,1497369546,Fall 2011 Admissions Stats:\n\nSchool | Apps |...,Dr_Narwhal,6h08sg,6h08sg,9,diujudc,1499312897,0,,Cornell,False,,Physics & Mathematics 2019,0.586454
dosp83a,1508806556,"Far above Cayuga's waters,\nwith its waves of ...",SwissWatchesOnly,dosp80m,78cb5y,1,dosp7mm,1510121625,0,,Cornell,False,/r/Cornell/comments/78cb5y/alma_mater/dosp83a/,,0.796297
cbnqq4d,1376508444,Even better is Tracfone's StraightTalk. They'r...,arandomaltaccount,cbnjwiu,1kbz4p,1,cbnjwiu,1429829700,0,,Cornell,False,,,0.904599
d5pb24c,1469403847,"They should be fine, though there are other op...",t3hasiangod,4ufm6z,4ufm6z,3,d5pb24c,1471654647,0,,Cornell,False,,,0.936434
c5sqkzc,1344908188,Fuck YAF. And fuck you. Don't let that hate ...,omgdonerkebab,y6cd9,y6cd9,0,c5sqkzc,1429631413,0,,Cornell,False,,,0.941106
deydmud,1489576091,"Ah yes, because Newt is so tolerant of those w...",Bigmouthstrikesback,dexqq3j,5z98fa,3,dewdp2k,1491495064,0,,Cornell,False,,,1.01133
crd4uhk,1431970713,1933 in NY based on wikipedia https://en.wikip...,howlingchief,crd4i14,36ceei,1,crd2smi,1433132718,0,,Cornell,False,,,1.02485
d64vv88,1470366925,"Eh, I would be hesitant to wear those. From th...",t3hasiangod,4w8d7e,4w8d7e,5,d64vv88,1473230691,0,,Cornell,False,,,1.04711
d4gn2ut,1466432480,"Here is some actual data on incoming flights, ...",turtlecrk,4owh99,4owh99,3,d4gn2ut,1469167939,0,,Cornell,False,,,1.18033


In [20]:
for i,x in least_surprising.iterrows():
    speaker, convo, surprise = x['speaker'], x['conversation_id'], x['meta.surprise']
    print('Speaker: {}, Conversation: {}, Surprise: {}'.format(speaker, convo, surprise))
    utterance_ids = transformed_corpus.get_speaker_convo_info(speaker, convo)['utterance_ids']
    print(' '.join([transformed_corpus.get_utterance(utt).text for utt in utterance_ids]))
    print()

se: 0.06899711145989863
BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL BILL B

### Comparison to SpeakerConvoDiversity

In [21]:
from convokit import SpeakerConvoDiversity

scd = SpeakerConvoDiversity(
    'div',
    # Only score speaker-convo entries with at least 100 tokens.
    cmp_select_fn=lambda df, aux: df.tokens.map(len) >= 100,
    # Reference text: the same speaker's conversations of opposite index parity.
    select_fn=lambda df, row, aux: (
        (df.convo_idx % 2 != row.convo_idx % 2) & (df.speaker == row.speaker)
    ),
    speaker_cols=['n_convos'],
    aux_input={'n_iters': 50, 'cmp_sample_size': 100, 'ref_sample_size': 100},
    verbosity=100,
)

In [22]:
from convokit.text_processing import TextParser

tokenizer = TextParser(mode='tokenize', output_field='tokens', verbosity=1000)
subset_corpus = tokenizer.transform(subset_corpus)

1000/20700 utterances processed
2000/20700 utterances processed
3000/20700 utterances processed
4000/20700 utterances processed
5000/20700 utterances processed
6000/20700 utterances processed
7000/20700 utterances processed
8000/20700 utterances processed
9000/20700 utterances processed
10000/20700 utterances processed
11000/20700 utterances processed
12000/20700 utterances processed
13000/20700 utterances processed
14000/20700 utterances processed
15000/20700 utterances processed
16000/20700 utterances processed
17000/20700 utterances processed
18000/20700 utterances processed
19000/20700 utterances processed
20000/20700 utterances processed
20700/20700 utterances processed


In [22]:
div_transformed = scd.transform(subset_corpus)

joining tokens across conversation utterances
100 / 3097
200 / 3097
300 / 3097
400 / 3097
500 / 3097
600 / 3097
700 / 3097
800 / 3097
900 / 3097
1000 / 3097
1100 / 3097
1200 / 3097
1300 / 3097
1400 / 3097
1500 / 3097
1600 / 3097
1700 / 3097
1800 / 3097
1900 / 3097
2000 / 3097
2100 / 3097
2200 / 3097
2300 / 3097
2400 / 3097
2500 / 3097
2600 / 3097
2700 / 3097
2800 / 3097
2900 / 3097
3000 / 3097


Here are the speaker-convo entries with the highest diversity scores.

In [23]:
div_transformed.get_speaker_convo_attribute_table(attrs=['div']).sort_values('div', ascending=False).head(10)

Unnamed: 0_level_0,speaker,convo_id,convo_idx,div
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Straight_Derpin__5kst5l,Straight_Derpin,5kst5l,34,4.590484
Dr_Narwhal__6h08sg,Dr_Narwhal,6h08sg,75,4.530407
rrrrrrr1131__8l3xht,rrrrrrr1131,8l3xht,25,4.503671
SwissWatchesOnly__9hcpip,SwissWatchesOnly,9hcpip,129,4.498516
sasha07974__8v40c1,sasha07974,8v40c1,42,4.494779
ScottVandeberg__8tlcdl,ScottVandeberg,8tlcdl,81,4.48823
t3hasiangod__5v6sqb,t3hasiangod,5v6sqb,590,4.485967
blackashi__2xxkm4,blackashi,2xxkm4,6,4.477411
agottler__9iyo8u,agottler,9iyo8u,66,4.476641
t3hasiangod__4ufm6z,t3hasiangod,4ufm6z,262,4.473743


Notice that the top entry here (Straight_Derpin in conversation 5kst5l) was the least surprising entry according to the Surprise transformer. The two measures disagree, and we will want to dig further into why. One potential reason is the sampling that the SpeakerConvoDiversity transformer uses when calculating perplexity, which attempts to remove length-based effects on the perplexity. We may need to replicate this sampling method in the Surprise transformer.
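As a rough illustration of the sampling idea, the sketch below averages cross-entropy over fixed-size random samples of the target and the context, so that every score is computed on the same number of tokens. This is an assumption-laden sketch, not the code used by either transformer: the function name `sampled_cross_entropy`, the add-one smoothing, and sampling without replacement are illustrative choices. The parameter names are borrowed from the `aux_input` keys above.

```python
import math
import random
from collections import Counter


def sampled_cross_entropy(target_tokens, context_tokens,
                          n_iters=50, cmp_sample_size=100, ref_sample_size=100, seed=0):
    """Average cross-entropy over fixed-size random samples (illustrative only)."""
    # Not enough tokens to draw a full sample: no score.
    if len(target_tokens) < cmp_sample_size or len(context_tokens) < ref_sample_size:
        return math.nan
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_iters):
        # Assumption: sample without replacement on both sides.
        target_sample = rng.sample(target_tokens, cmp_sample_size)
        context_sample = rng.sample(context_tokens, ref_sample_size)
        counts = Counter(context_sample)
        vocab_size = len(set(context_sample) | set(target_sample))
        denom = ref_sample_size + vocab_size
        total += -sum(math.log((counts[tok] + 1) / denom)
                      for tok in target_sample) / cmp_sample_size
    return total / n_iters


# Hypothetical toy data: a varied context and a highly repetitive target.
context = [f"w{i % 40}" for i in range(400)]
target = ["bill"] * 150

score = sampled_cross_entropy(target, context)
too_short = sampled_cross_entropy(target[:50], context)
print(score, too_short)
```

Because both samples have a fixed size, a long target and a short target that pass the threshold are scored on equal footing, which is the length effect the sampling is meant to remove.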

Here are the least diverse speaker-convo entries based on the SpeakerConvoDiversity transformer.

In [24]:
div_transformed.get_speaker_convo_attribute_table(attrs=['div']).sort_values('div').head(10)

Unnamed: 0_level_0,speaker,convo_id,convo_idx,div
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
kickstand__obvjl,kickstand,obvjl,6,4.219946
prov167__9n4jxi,prov167,9n4jxi,187,4.221791
cornell256__4g9uub,cornell256,4g9uub,34,4.222422
dedicateddan__4krfrc,dedicateddan,4krfrc,97,4.223227
cornell256__4e8hww,cornell256,4e8hww,30,4.228412
dedicateddan__1zcxhv,dedicateddan,1zcxhv,20,4.230954
cornell256__96utv3,cornell256,96utv3,292,4.233005
Fencerman2__6tiomd,Fencerman2,6tiomd,111,4.233683
kickstand__p72an,kickstand,p72an,9,4.233733
Enyo287__5ipedu,Enyo287,5ipedu,281,4.233831
