In [1]:
import os
os.chdir('../../..')

This demo shows how to:

1. Annotate a Corpus's utterances with their bag-of-words vector representations
2. Use these bag-of-words in a predictive task

In [2]:
import convokit

In [3]:
from convokit import Corpus, download

In [4]:
corpus = Corpus(filename=download('subreddit-Cornell'))

Dataset already exists at /Users/calebchiam/.convokit/downloads/subreddit-Cornell


In [5]:
corpus.print_summary_stats()

Number of Speakers: 7568
Number of Utterances: 74467
Number of Conversations: 10744


### Annotating the Corpus with bag-of-words vectors

To do this, we use ConvoKit's [Bag-of-words Transformer](https://convokit.cornell.edu/documentation/bow.html) and set it to vectorize the Corpus's utterances.

In [6]:
from convokit import BoWTransformer

In [7]:
bow_transformer = BoWTransformer(obj_type="utterance")

Initializing default unigram CountVectorizer...Done.


Note that a custom text vectorizer can be supplied through the vectorizer parameter:

e.g. BoWTransformer(obj_type="utterance", vectorizer=...)

Let's inspect one of the Corpus's utterances to see what the transformer changes.

In [8]:
# before transformation
corpus.get_utterance('dsbgljl').vectors

[]

In [9]:
bow_transformer.fit_transform(corpus)

<convokit.model.corpus.Corpus at 0x12f17cb10>

In [10]:
# after transformation
corpus.get_utterance('dsbgljl').vectors

['bow_vector']

The Corpus now has a new vector matrix associated with it.

In [11]:
corpus.vectors

{'bow_vector'}

In [12]:
corpus.get_vector_matrix('bow_vector')

ConvoKitMatrix('name': bow_vector, 'matrix': <74467x9340 sparse matrix of type '<class 'numpy.int64'>'
	with 2108383 stored elements in Compressed Sparse Row format>)
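
To make the shape of this matrix concrete, here is a minimal pure-Python sketch of unigram counting. It is not ConvoKit's or scikit-learn's implementation: the toy texts, the whitespace tokenizer, and the helper name `bow_vectors` are assumptions made for illustration.

```python
from collections import Counter

def bow_vectors(texts):
    """Return (vocabulary, rows): one count vector per text, one column per term."""
    # Naive tokenization: lowercase and split on whitespace (an assumption of this sketch).
    tokenized = [text.lower().split() for text in texts]
    # Columns are the distinct tokens across all texts, in sorted order.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for terms that do not occur in this text.
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

# Hypothetical toy utterances, not taken from the Cornell corpus.
vocabulary, rows = bow_vectors(["the exam was hard", "the the exam"])
print(vocabulary)  # column labels
print(rows)        # one row of counts per utterance
```

In the matrix shown above, the 74467 rows correspond to utterances and the 9340 columns to vocabulary terms. Only 2108383 entries are stored, roughly 0.3% of the full grid, which is why a sparse format is used.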

### Predictive task: will an utterance (i.e. Reddit comment) have a positive score?

We want to predict whether an utterance will have a positive score (i.e. more upvotes than downvotes) based on its bag-of-words vector.

Inspecting a random utterance, we see that it has a 'score' metadata attribute.

In [13]:
corpus.random_utterance().meta

{'score': -8,
 'top_level_comment': 'd7d8cd1',
 'retrieved_on': 1475425135,
 'gilded': 0,
 'gildings': None,
 'subreddit': 'Cornell',
 'stickied': False,
 'permalink': '',
 'author_flair_text': ''}

We then use ConvoKit's VectorClassifier to train a classifier that predicts whether an utterance's score is positive. The labeller specifies the binary y value that the internal model should predict, while vector_name specifies the vector feature set (i.e. the X data) used to train the classifier.
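
As a rough illustration of what a labeller does, the sketch below applies the same rule (score greater than zero) to a few stand-in objects. `ToyUtterance` and the scores are hypothetical; real ConvoKit utterances carry much more than an id and a meta dict.

```python
from dataclasses import dataclass, field

@dataclass
class ToyUtterance:
    """Hypothetical stand-in for a ConvoKit utterance: just an id and a meta dict."""
    id: str
    meta: dict = field(default_factory=dict)

# Labelling rule: a strictly positive score maps to True, anything else to False.
labeller = lambda utt: utt.meta['score'] > 0

toy_utts = [
    ToyUtterance('a', {'score': 3}),
    ToyUtterance('b', {'score': -8}),
    ToyUtterance('c', {'score': 0}),
]

# The y values a classifier would be trained to predict, one per object.
y = [labeller(utt) for utt in toy_utts]
print(y)
```

Note that under this rule a score of exactly 0 is labelled False.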

In [14]:
from convokit import VectorClassifier

In [15]:
bow_classifier = VectorClassifier(obj_type="utterance", 
                                  vector_name='bow_vector',
                                  labeller=lambda utt: utt.meta['score'] > 0)

Initialized default classification model (standard scaled logistic regression).


In [16]:
# This fit_transform() step fits the classifier and then uses it to compute predictions for all the 
# utterances in the Corpus
bow_classifier.fit_transform(corpus)

<convokit.model.corpus.Corpus at 0x12f17cb10>

In [17]:
# A DataFrame summary of the computed predictions
bow_classifier.summarize(corpus).head(10)

id,prediction,pred_score
dhhm9sa,True,1.0
dw553ml,True,1.0
dvzmhdx,True,1.0
dvzpp79,True,1.0
dw0imao,True,1.0
c3bsi2g,True,1.0
dw0mm3b,True,1.0
d5pddzi,True,1.0
dw25pga,True,1.0
5om61s,True,1.0


In [18]:
corpus.get_utterance('15enm8').text

'One, just to get this out of the way: I\'m only a sophomore in high school. In spite of this, my high school is one of the top public schools in New Jersey (and to put it bluntly it\'s a very affluent area... although I\'m not necessarily affluent like my classmates). The point of telling you guys that is kids start talking about all these amazing schools they want to go to in like eighth grade, so I know quite a bit about colleges. As stated in the title, I really want to go to Cornell, and I just was hoping that some of you guys and girls on here would be awesome enough to give out some SAT scores, ACT scores (if you took them), and extra curricular activities you guys got/did? My unweighted GPA is a 3.8 (weighted is a 4.2), and my first PSAT was an overall 1900, and from taking that I (not to sound cocky here) *know* that I\'m going to get that score up a *lot*. I\'m in all the highest level classes I can be in, and I\'m looking to take multiple AP courses next year (junior). Do yo

We can then inspect the coefficient weights assigned to the bag-of-words n-grams.

In [19]:
# The ngrams weighted most positively (i.e. utterances with these ngrams are more likely to have positive scores)
bow_classifier.get_coefs(feature_names=corpus.get_vector_matrix('bow_vector').columns).head()

feat_name,coef
hotels,1.270001
hbhs,1.11569
engine,1.109702
involves,1.081836
lincoln,1.071464


In [20]:
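# The ngrams weighted most negatively (i.e. utterances with these ngrams are less likely to have positive scores)
bow_classifier.get_coefs(feature_names=bow_transformer.get_vocabulary()).tail()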

feat_name,coef
mahogany,-0.667785
ignoreme,-0.722992
hilton,-0.742234
binary,-0.764383
creation,-0.784593


In [21]:
y_true, y_pred = bow_classifier.get_y_true_pred(corpus)

In [22]:
bow_classifier.base_accuracy(corpus)

0.9279546644822538

In [23]:
bow_classifier.accuracy(corpus)

0.9491452589737737

In [24]:
print(bow_classifier.classification_report(corpus))

              precision    recall  f1-score   support

       False       0.88      0.34      0.49      5365
        True       0.95      1.00      0.97     69102

    accuracy                           0.95     74467
   macro avg       0.91      0.67      0.73     74467
weighted avg       0.95      0.95      0.94     74467
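
The value returned by base_accuracy above matches the accuracy of always predicting the majority class, which can be checked from the support counts in the report. The sketch below is plain arithmetic on those counts, not a call into ConvoKit; the variable names are illustrative.

```python
# Support counts copied from the classification report above.
support = {False: 5365, True: 69102}

total = sum(support.values())
# The label with the largest support is the majority class.
majority_label = max(support, key=support.get)
# Accuracy obtained by predicting the majority label for every utterance.
majority_baseline = support[majority_label] / total

print(majority_label, majority_baseline)
```

Against that baseline of about 0.928, the reported accuracy of 0.949 is a gain of roughly two percentage points, and the recall of 0.34 on the False class shows that most non-positive comments are still predicted as positive. Both figures are measured on the same utterances the classifier was fitted on.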



## Bag-of-words prediction of whether a comment thread doubles in length or stays the same length, based on its first 5 utterances

In [25]:
top_level_comment_ids = [utt.id for utt in corpus.iter_utterances() if utt.id == utt.meta['top_level_comment']]

In [26]:
corpus.print_summary_stats()

Number of Speakers: 7568
Number of Utterances: 74467
Number of Conversations: 10744


In [27]:
len(top_level_comment_ids)

32893

In [28]:
threads_corpus = corpus.reindex_conversations(new_convo_roots=top_level_comment_ids)


['c3oyf4d', 'c3od15i', 'c3ocsyl', 'c3p8bze', 'c3p1rn8']


In [29]:
threads_corpus.print_summary_stats()

Number of Speakers: 6160
Number of Utterances: 63697
Number of Conversations: 32888


In [30]:
for thread in threads_corpus.iter_conversations():
    thread_len = len(list(thread.iter_utterances()))
    if thread_len == 5:
        thread.meta['thread_doubles'] = False
    elif thread_len >= 10:
        thread.meta['thread_doubles'] = True
    else:
        thread.meta['thread_doubles'] = None
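
The cells that follow pass a selector so that only labelled threads are vectorized and classified. A minimal pure-Python sketch of the labelling rule and of that filter, using made-up thread lengths (the names `label_thread`, `toy_lengths`, and the thread ids are hypothetical):

```python
def label_thread(thread_len):
    """Mirror the labelling rule above: 5 -> False, 10 or more -> True, else None."""
    if thread_len == 5:
        return False
    if thread_len >= 10:
        return True
    return None

# Hypothetical thread lengths, for illustration only.
toy_lengths = {'t1': 5, 't2': 12, 't3': 7, 't4': 1, 't5': 10}
labels = {tid: label_thread(n) for tid, n in toy_lengths.items()}

# The selector keeps only threads that received a True/False label.
selector = lambda tid: labels[tid] is not None
selected = [tid for tid in toy_lengths if selector(tid)]
print(selected)
```

Threads with any other length (fewer than 5, or 6 to 9 utterances) get None and are excluded from the task.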

In [31]:
bow_transformer2 = BoWTransformer(obj_type="conversation", vector_name='bow_vector_2')

Initializing default unigram CountVectorizer...Done.


In [32]:
bow_transformer2.fit_transform(threads_corpus, selector=lambda convo: convo.meta['thread_doubles'] is not None)

<convokit.model.corpus.Corpus at 0x131c41210>

In [33]:
bow_classifier2 = VectorClassifier(obj_type="conversation", vector_name='bow_vector_2',
                                   labeller=lambda convo: convo.meta['thread_doubles'])

Initialized default classification model (standard scaled logistic regression).


In [34]:
bow_classifier2.fit_transform(threads_corpus, selector=lambda convo: convo.meta['thread_doubles'] is not None)

<convokit.model.corpus.Corpus at 0x131c41210>

In [35]:
summary = bow_classifier2.summarize(threads_corpus, selector=lambda convo: convo.meta['thread_doubles'] is not None)

In [36]:
summary.head()

id,prediction,pred_score
cx87pi5,True,1.0
e5626fc,True,1.0
e8p3t2v,True,1.0
dxw7g0r,True,1.0
dnqc6mc,True,1.0


In [37]:
summary.tail()

id,prediction,pred_score
e0iez9l,False,4.206535e-08
cyeq0e8,False,2.781052e-08
dmtcex3,False,2.706866e-08
ck1dyvi,False,1.796866e-08
e6m7j9z,False,2.439918e-10


In [38]:
bow_classifier2.base_accuracy(threads_corpus, selector=lambda convo: convo.meta['thread_doubles'] is not None)

0.6761904761904762

In [39]:
bow_classifier2.accuracy(threads_corpus, selector=lambda convo: convo.meta['thread_doubles'] is not None)

1.0

In [40]:
print(bow_classifier2.classification_report(threads_corpus, selector=lambda convo: convo.meta['thread_doubles'] is not None))

              precision    recall  f1-score   support

       False       1.00      1.00      1.00       852
        True       1.00      1.00      1.00       408

    accuracy                           1.00      1260
   macro avg       1.00      1.00      1.00      1260
weighted avg       1.00      1.00      1.00      1260



In [41]:
bow_classifier2.get_coefs(feature_names=bow_transformer2.get_vocabulary()).head(10)

feat_name,coef
removed,0.681572
welcome,0.620917
word,0.429786
hour,0.375649
brought,0.359237
profile,0.351563
http,0.351203
head,0.323356
www,0.310282
comp,0.300921


In [42]:
bow_classifier2.get_coefs(feature_names=bow_transformer2.get_vocabulary()).tail(10)

feat_name,coef
tried,-0.265925
desk,-0.269328
internet,-0.269901
long,-0.278265
dean,-0.278963
23,-0.282094
extra,-0.316223
hill,-0.317776
goes,-0.361495
thanks,-0.362734


In [43]:
bow_classifier2.confusion_matrix(threads_corpus, selector=lambda convo: convo.meta['thread_doubles'] is not None)

array([[852,   0],
       [  0, 408]])

In [44]:
import pandas as pd

In [45]:
bow_classifier2.evaluate_with_cv(threads_corpus, selector=lambda convo: convo.meta['thread_doubles'] is not None)

Running a cross-validated evaluation...Done.


array([0.67460317, 0.68650794, 0.73015873, 0.68253968, 0.69047619])
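
These fold accuracies are far below the accuracy of 1.0 reported earlier, which was measured on the same conversations the classifier was fitted on. The short sketch below summarizes the folds against the majority baseline; it only does arithmetic on numbers printed in this notebook, and the variable names are illustrative.

```python
from statistics import mean, stdev

# Fold accuracies copied from the cross-validation output above.
cv_scores = [0.67460317, 0.68650794, 0.73015873, 0.68253968, 0.69047619]
# Majority-class baseline reported by base_accuracy for this task.
baseline = 0.6761904761904762

cv_mean = mean(cv_scores)
cv_sd = stdev(cv_scores)  # sample standard deviation across the 5 folds
gain = cv_mean - baseline

print(f"mean={cv_mean:.4f} sd={cv_sd:.4f} gain over baseline={gain:.4f}")
```

The mean is about 0.693, less than two percentage points above the 0.676 baseline, which suggests that the perfect in-sample fit reflects overfitting rather than signal that generalizes to unseen threads.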

In [46]:
bow_classifier2.evaluate_with_train_test_split(threads_corpus, 
                                               selector=lambda convo: convo.meta['thread_doubles'] is not None,
                                               test_size=0.2)

Running a train-test-split evaluation...
Done.


(0.7023809523809523, array([[142,  18],
        [ 57,  35]]))
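
The returned tuple pairs the held-out accuracy with a confusion matrix, and the two can be checked against each other. The sketch below assumes the scikit-learn convention (rows are true labels, columns are predicted labels, ordered False then True); that convention is an assumption here, and it affects the precision and recall figures but not the accuracy.

```python
# Confusion matrix copied from the train-test-split output above.
# Assumption: rows are true labels and columns are predicted labels,
# ordered [False, True], as in scikit-learn's confusion_matrix.
cm = [[142, 18],
      [57, 35]]

tn, fp = cm[0]
fn, tp = cm[1]

# Fraction of held-out threads classified correctly.
accuracy = (tn + tp) / (tn + fp + fn + tp)
# Of the threads predicted to double, the fraction that actually did.
precision_true = tp / (tp + fp)
# Of the threads that actually doubled, the fraction that were detected.
recall_true = tp / (tp + fn)

print(f"accuracy={accuracy:.4f} precision={precision_true:.4f} recall={recall_true:.4f}")
```

The computed accuracy, 177 correct out of 252, agrees with the 0.7024 in the output above.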