This notebook provides examples of how to use convokit to perform analyses of behaviors of particular users within conversations. In other words, we will be dealing with attributes at the (user, conversation) level.
Attributes at this granularity include linguistic diversity, described in the following paper: http://www.cs.cornell.edu/~cristian/Finding_your_voice__linguistic_development.html
They can be used to perform longitudinal analyses of user behaviors across multiple conversations.

Since we cannot publicly release the dataset of counseling conversations used in that paper, we will use the ChangeMyView subreddit as a test case. As such, this notebook mostly demonstrates how the functionality works, rather than making any substantive scientific claims about longitudinal behavior change.

## Setup

Imports and loading corpora:

In [1]:
import convokit

In [2]:
import pandas as pd
import numpy as np
from scipy import stats

In [6]:
### downloads the corpus (skipped if a local copy already exists); comment out if you load your own copy below
filename = convokit.download('subreddit-changemyview')

### modify path to load your own copy
# filename = '~/.convokit/downloads/subreddit-changemyview/'
corpus = convokit.Corpus(filename)

Dataset already exists at /home/justine/.convokit/downloads/subreddit-changemyview


In [7]:
corpus.print_summary_stats()

Number of Users: 217100
Number of Utterances: 5017556
Number of Conversations: 117492


To start, we will set up a data structure mapping each user to their conversations, and each utterance they contributed in the conversation.

To do this we run the `UserConvoHistory` transformer, which annotates each `User` in a corpus with a dict of conversations --> the user's utterances in that conversation, and the timestamp of their first utterance (i.e., when they "entered" the conversation).

Note that we can specify what counts as participating in a conversation. Here, we omit posts and focus only on comments, so a user who only submitted the root post does not count as a participant.

In [8]:
# also remove moderator bots and deleted users
USER_BLACKLIST = ['[deleted]', 'DeltaBot','AutoModerator']
def utterance_is_valid(utterance):
    return (utterance.id != utterance.root) and (utterance.user.name not in USER_BLACKLIST)

In [9]:
uchistory = convokit.user_convo_helpers.user_convo_history.UserConvoHistory(utterance_filter=utterance_is_valid)
corpus = uchistory.fit_transform(corpus)

Example of a user-level annotation:

In [10]:
corpus.get_user('ThatBelligerentSloth').meta['conversations']['2wkciy']

{'idx': 0,
 'n_utterances': 1,
 'start_time': 1424463398,
 'utterance_ids': ['cort13k']}

In [11]:
corpus.get_user('ThatBelligerentSloth').meta['n_convos']

1039

In [12]:
corpus.get_user('ThatBelligerentSloth').meta['start_time']

1424463398
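
To make the shape of these annotations concrete, here is a minimal sketch in plain Python that builds a similar user --> conversation mapping from toy data. It does not use convokit, all ids and names are made up, and it assumes that `idx` is the chronological position of a conversation in the user's history (which is consistent with the example above, but not confirmed by it).

```python
from collections import defaultdict

# Toy utterances: (utterance_id, user, conversation_id, timestamp, is_root_post).
# Every id and name here is hypothetical.
utterances = [
    ("u1", "alice", "c1", 100, True),       # root post: filtered out
    ("u2", "bob", "c1", 110, False),
    ("u3", "bob", "c1", 130, False),
    ("u4", "bob", "c2", 90, False),
    ("u5", "[deleted]", "c2", 95, False),   # blacklisted user: filtered out
]
BLACKLIST = {"[deleted]"}


def is_valid(utt):
    """Keep comments (not root posts) from users who are not blacklisted."""
    _, user, _, _, is_root = utt
    return (not is_root) and user not in BLACKLIST


def build_history(utts):
    """Map user -> conversation id -> summary of that user's utterances."""
    history = defaultdict(dict)
    # Process utterances in time order so start_time is the first timestamp seen.
    for utt_id, user, convo, ts, _ in sorted(filter(is_valid, utts), key=lambda u: u[3]):
        entry = history[user].setdefault(convo, {"utterance_ids": [], "start_time": ts})
        entry["utterance_ids"].append(utt_id)
    for convos in history.values():
        # Assumption: idx orders a user's conversations by when they entered them.
        ordered = sorted(convos.values(), key=lambda e: e["start_time"])
        for idx, entry in enumerate(ordered):
            entry["idx"] = idx
            entry["n_utterances"] = len(entry["utterance_ids"])
    return dict(history)


history = build_history(utterances)
print(history["bob"])
```

Here `bob` entered conversation `c2` before `c1`, so `c2` gets `idx` 0 even though it appears later in the input.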

To speed up this demo, we will keep only the 100 most active users. To help with this we call the `get_user_convo_attribute_table` function, explained later on.

In [13]:
user_convo_df = convokit.user_convo_helpers.user_convo_utils.get_user_convo_attribute_table(corpus, [])
top_users = user_convo_df.groupby('user').size().sort_values(ascending=False).head(100).index
subset_utts = []
for user in top_users:
    subset_utts += list(corpus.get_user(user).iter_utterances())
subset_corpus = convokit.Corpus(utterances=subset_utts)

In [14]:
subset_corpus.print_summary_stats()

Number of Users: 100
Number of Utterances: 539413
Number of Conversations: 66051
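
The selection step above ranks users by the number of conversations they took part in. A minimal sketch of the same idea in plain Python, using hypothetical (user, conversation) pairs rather than a real corpus:

```python
from collections import Counter

# Hypothetical (user, conversation) participation pairs.
pairs = [
    ("ann", "c1"), ("ann", "c2"), ("ann", "c3"),
    ("bo", "c1"), ("bo", "c2"),
    ("cy", "c3"),
]


def top_users(user_convo_pairs, n):
    """Return the n users who appear in the most distinct conversations."""
    counts = Counter(user for user, _ in set(user_convo_pairs))
    return [user for user, _ in counts.most_common(n)]


top = top_users(pairs, 2)
print(top)
```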


Finally, to finish setting things up, we tokenize the utterances (this takes a while), using the `Tokenizer` transformer.

In [15]:
### tokenizes the corpus (slow); comment out to skip
tokenizer = convokit.Tokenizer(verbosity=5000)
subset_corpus = tokenizer.fit_transform(subset_corpus)

5000 utterances tokenized
10000 utterances tokenized
...
180000 utterances tokenized
185000 utter

## Analysis

The goal of this analysis is to examine what a user's conversational behavior looks like within a single conversation, and then how it evolves over the conversations they take part in. To demonstrate what this looks like we'll start with a simple attribute, wordcount.
First, we count the words in each utterance using the `WordCount` transformer. Note that this computes _per-utterance_ statistics.

In [16]:
wordcount = convokit.WordCount()
subset_corpus = wordcount.fit_transform(subset_corpus)

Next, we aggregate per-utterance statistics over all the utterances a particular user contributed in a conversation. That is, we will turn wordcount into a user,convo-level attribute.

We call the `UserConvoAttrs` transformer to do this. Here, `agg_fn=np.mean` means that the user,convo-level attribute is an _average_ over utterance lengths, but you could replace this with your own aggregation function (e.g., `max`).

In [17]:
uc_wordcount = convokit.user_convo_helpers.user_convo_attrs.UserConvoAttrs('wordcount', agg_fn=np.mean)
subset_corpus = uc_wordcount.fit_transform(subset_corpus)

This transformer annotates each conversation in each User object with a wordcount:

In [18]:
subset_corpus.get_user('ThatBelligerentSloth').meta['conversations']['2wkciy']

{'idx': 0,
 'n_utterances': 1,
 'start_time': 1424463398,
 'utterance_ids': ['cort13k'],
 'wordcount': 100.0}
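
The aggregation step can be sketched in plain Python as a group-by over (user, conversation) keys. This is an illustration of the idea only, not convokit's implementation, and the rows below are hypothetical:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-utterance word counts: (user, conversation_id, wordcount).
utterance_wordcounts = [
    ("bob", "c1", 20),
    ("bob", "c1", 27),
    ("bob", "c2", 4),
    ("eve", "c1", 50),
]


def aggregate(rows, agg_fn=mean):
    """Collapse per-utterance values into one value per (user, conversation)."""
    grouped = defaultdict(list)
    for user, convo, value in rows:
        grouped[(user, convo)].append(value)
    return {key: agg_fn(values) for key, values in grouped.items()}


mean_wc = aggregate(utterance_wordcounts)             # average utterance length
max_wc = aggregate(utterance_wordcounts, agg_fn=max)  # longest utterance instead
print(mean_wc[("bob", "c1")], max_wc[("bob", "c1")])
```

Swapping `agg_fn` changes what the (user, conversation) attribute means without touching the grouping logic.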

We will now use this aggregate statistic to analyze how users change behavior over time. The particular question here is whether or not users systematically increase or decrease in wordcount, and in the number of utterances contributed to each conversation.

To facilitate further analyses, we'll load all the user,convo information pertaining to the attributes we want into a dataframe. We'll use the `get_user_convo_attribute_table` function to do this:

In [19]:
user_convo_len_df = convokit.user_convo_helpers.user_convo_utils.get_user_convo_attribute_table(subset_corpus, ['wordcount', 'n_utterances'])
user_convo_len_df.head()

key,convo_id,convo_idx,n_utterances,user,user_n_convos,user_start_time,wordcount
cdb03b__18x6j5,18x6j5,0,2,cdb03b,7159,1361407894,23.5
cdb03b__1adg1v,1adg1v,1,1,cdb03b,7159,1361407894,4.0
cdb03b__1cciah,1cciah,2,2,cdb03b,7159,1361407894,25.0
cdb03b__1ccvs4,1ccvs4,3,1,cdb03b,7159,1361407894,33.0
cdb03b__1e2r7u,1e2r7u,4,1,cdb03b,7159,1361407894,74.0


We perform our longitudinal analyses at the level of life-stages, i.e., contiguous blocks of conversations. Here, we compare the first two life-stages of 10 conversations each: how the user behaves in their first 10 conversations, versus their 11th to 20th.
We say that users systematically increase (or decrease) in an attribute if, for a significant majority of users, the value of this attribute increases (or decreases) from one life-stage to the next.

To this end, we need to aggregate attributes over a life-stage, e.g., mean wordcount. To perform this aggregation we'll use the `get_lifestage_attributes` function, specifying life-stages of 10 conversations each, with a max of 20 conversations (i.e., 2 life-stages). Note that this function omits all users with fewer than 20 conversations, so that both life-stages are computed over the same set of users and the comparison is not biased by survivorship.


In [20]:
stage_wc_df = convokit.user_convo_helpers.user_convo_utils.get_lifestage_attributes(user_convo_len_df, 'wordcount', 10, 20)
stage_wc_df.head()

user,convo_idx 0,convo_idx 1
ACrusaderA,98.522045,130.283333
A_Mirror,70.15,93.183333
A_Soporific,286.9,228.166667
AlphaGoGoDancer,291.55,410.733333
Amablue,160.0,295.716667


In [21]:
stage_wc_df.mean()

convo_idx
0    155.192666
1    145.403478
dtype: float64
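
The life-stage aggregation used above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the actual `get_lifestage_attributes` implementation: it assumes each row carries a conversation index, it drops any user missing one of their first `max_convos` conversations, and the data is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def lifestage_means(rows, stage_size, max_convos):
    """rows: (user, convo_idx, value). Returns user -> list of per-stage means."""
    by_user = defaultdict(dict)
    for user, convo_idx, value in rows:
        by_user[user][convo_idx] = value

    result = {}
    for user, values in by_user.items():
        # Keep only users who have all of their first max_convos conversations,
        # so every life-stage is computed over the same set of users.
        if not all(i in values for i in range(max_convos)):
            continue
        result[user] = [
            mean(values[i] for i in range(start, start + stage_size))
            for start in range(0, max_convos, stage_size)
        ]
    return result


# Toy data with life-stages of 2 conversations and a max of 4 conversations.
rows = [
    ("ann", 0, 10), ("ann", 1, 20), ("ann", 2, 30), ("ann", 3, 50),
    ("bo", 0, 5), ("bo", 1, 7), ("bo", 2, 9),  # only 3 conversations: omitted
]
stages = lifestage_means(rows, stage_size=2, max_convos=4)
print(stages)
```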

Just looking at the means, it looks like there's a slight decrease in wordcount across the population from the first to the second life-stage. To check significance, we can compute the percentage of users who experience this decrease, and see if it's significant per a binomial test against a null proportion of 50% of users (i.e., people randomly increase or decrease).

In [22]:
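def print_lifestage_comparisons(stage_df):
    # compare each life-stage with the one that follows it
    for i in range(stage_df.columns.max()):
        
        # keep only users with a value in both stages
        mask = stage_df[i+1].notnull() & stage_df[i].notnull()
        c1 = stage_df[i+1][mask]
        c0 = stage_df[i][mask]
        
        print('stages %d vs %d (%d users)' % (i + 1, i, sum(mask)))
        n_more = int(sum(c1 > c0))
        n = int(sum(c1 != c0))  # ties are excluded from the test
        # two-sided binomial test against a null proportion of 0.5;
        # stats.binomtest replaces stats.binom_test, which newer SciPy releases removed
        print('\tprop more: %.3f, binom_p=%.2f' % (n_more/n, stats.binomtest(n_more, n).pvalue))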

About 41% of users increase in wordcount (so about 59% decrease), which falls short of significance at the 0.05 level (p = 0.09). Maybe more data would help!

In [23]:
print_lifestage_comparisons(stage_wc_df)

stages 1 vs 0 (100 users)
	prop more: 0.410, binom_p=0.09
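
The significance check above is a sign test: count how many users increase, ignore ties, and ask how surprising that count is if increases and decreases were equally likely. Below is a minimal standard-library sketch of an exact two-sided binomial test. It uses the "sum the probabilities of all outcomes no more likely than the observed one" definition, which is an assumption about the test variant; for the symmetric null of 0.5 used here it should agree with the SciPy result.

```python
from math import comb


def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value for k successes out of n trials."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    # Small relative tolerance so outcomes tied with pmf[k] are always included.
    threshold = pmf[k] * (1 + 1e-7)
    return min(1.0, sum(x for x in pmf if x <= threshold))


def sign_test(before, after):
    """Proportion of non-tied pairs that increased, and its binomial p-value."""
    n_more = sum(a > b for b, a in zip(before, after))
    n = sum(a != b for b, a in zip(before, after))
    return n_more / n, binom_two_sided_p(n_more, n)


# 41 of 100 users increasing, as in the wordcount comparison above.
p_41 = binom_two_sided_p(41, 100)
print("p-value for 41/100: %.2f" % p_41)

# Hypothetical toy stages: three increases and one tie.
prop, p_toy = sign_test([1, 2, 3, 4], [2, 3, 4, 4])
print(prop, p_toy)
```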


The analogous analysis for the number of utterances contributed suggests that users don't systematically increase or decrease between their first two life-stages:

In [24]:
stage_convo_len_df = convokit.user_convo_helpers.user_convo_utils.get_lifestage_attributes(user_convo_len_df, 'n_utterances', 10, 20)

In [25]:
stage_convo_len_df.mean()

convo_idx
0    2.786
1    2.733
dtype: float64

In [26]:
print_lifestage_comparisons(stage_convo_len_df)

stages 1 vs 0 (100 users)
	prop more: 0.458, binom_p=0.48


Finally, we'll compute some attributes related to linguistic diversity, described in the following paper: http://www.cs.cornell.edu/~cristian/Finding_your_voice__linguistic_development.html

In short, for each life-stage, we compare the words used by one user in one conversation to the words they use in their other conversations, or the words that others use. As such, this is a user,convo-level attribute. Given our small sample here (and the fact that CMV and crisis counseling conversations are very different), we're not going for any scientific claims, but use the following function calls to demonstrate how the pipeline would work.

These attributes are all computed through the `UserConvoDiversity` transformer, which computes three attributes:

* `self_div`: within-diversity in the paper, comparing language use across a user's own conversations
* `other_div`: between-diversity in the paper, comparing language use across different users
* `adj_other_div`: relative diversity: between - within. (intuitively, is the diversity coming from users being different from others, beyond being diverse in their own right?)
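
To illustrate how the three attributes relate to each other, here is a toy sketch in plain Python. Note the swapped-in technique: it uses Jensen-Shannon divergence between unigram distributions as a stand-in divergence measure, which is an assumption made for illustration and not necessarily what `UserConvoDiversity` or the paper computes. The token lists are hypothetical.

```python
import math
from collections import Counter


def unigram_dist(tokens):
    """Relative frequency of each word in a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, between 0 and 1) of two distributions."""
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl_to_m(a):
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)

    return 0.5 * kl_to_m(p) + 0.5 * kl_to_m(q)


# One user's words in the target conversation, in their other conversations,
# and the words other users wrote (all hypothetical).
target = unigram_dist("i think the policy is unfair".split())
own_other = unigram_dist("i think the law is unfair and the policy is old".split())
others = unigram_dist("cats are great pets and dogs are loyal".split())

self_div = js_divergence(target, own_other)   # within-diversity stand-in
other_div = js_divergence(target, others)     # between-diversity stand-in
adj_other_div = other_div - self_div          # relative diversity: between - within
print(self_div, other_div, adj_other_div)
```

In this toy case the target conversation shares no words with the other users, so the between-diversity stand-in is at its maximum, while the overlap with the user's own conversations keeps the within-diversity stand-in lower.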

In [27]:
dt = convokit.UserConvoDiversity(10, 20, n_iters=10, test=False)

(This takes a while to run, especially with more users involved.)

In [28]:
subset_corpus = dt.fit_transform(subset_corpus)

preparing corpus
computing diversities
250 / 439


In [29]:
div_df = convokit.user_convo_helpers.user_convo_utils.get_user_convo_attribute_table(subset_corpus, ['n_utterances','self_div','other_div',
                                             'adj_other_div','tokens','wordcount'])


Note that one present limitation of this methodology is that it requires a user's activity in a conversation (and in their other conversations) to be substantive enough. If a user doesn't meet the minimum wordcount per conversation, then the function returns `np.nan` for that particular user,conversation. Filtering out these null values:

In [30]:
div_df = div_df[div_df.self_div.notnull()]
div_df.shape

(366, 11)

As with the wordcount example, we can make cross-lifestage comparisons. Here we see that there _might_ be some effect with relative diversity: about 61% of users get more diverse, although this is not significant (p = 0.15). This might be worth exploring with more users, though note that the interpretation may differ between CMV and counseling conversations, where users are randomly assigned. Here, users may appear more diverse because they self-select more esoteric challengers.

In [31]:
for attr in ['self_div', 'other_div', 'adj_other_div']:
    print(attr)
    stage_df = convokit.user_convo_helpers.user_convo_utils.get_lifestage_attributes(div_df, attr, 10, 20)
    print_lifestage_comparisons(stage_df)
    print('\n\n===')

self_div
stages 1 vs 0 (52 users)
	prop more: 0.519, binom_p=0.89


===
other_div
stages 1 vs 0 (49 users)
	prop more: 0.510, binom_p=1.00


===
adj_other_div
stages 1 vs 0 (49 users)
	prop more: 0.612, binom_p=0.15


===
