In [1]:
import os
os.chdir('../..')

This notebook demonstrates the use of vectors in ConvoKit, as well as the use of the bag-of-words transformer (BoWTransformer) and VectorClassifier.

In [2]:
import convokit

In [3]:
from convokit import Corpus, download

For the purposes of this demo, we will use the r/Cornell subreddit corpus.

In [4]:
corpus = Corpus(download('subreddit-Cornell'))

Dataset already exists at /Users/calebchiam/.convokit/downloads/subreddit-Cornell


In [5]:
corpus.print_summary_stats()

Number of Speakers: 7568
Number of Utterances: 74467
Number of Conversations: 10744


**The motivation for vectors**: As covered in the [introductory tutorial](https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/master/examples/Introduction_to_ConvoKit.ipynb), arbitrary data for a corpus component object can be encoded in the object's metadata field. In some cases, however, a more natural data structure is a vector embedding, e.g. GloVe or BERT vectors. One way to store these would be to add such vectors to each object's metadata. But with vectors, we often want to run linear algebra operations that are only efficient on a matrix representation of the data. With metadata storage, we would first have to construct that matrix by stacking the vectors from each corpus component object.

We thus offer another way of encoding vector information, through the ConvoKitMatrix object. A ConvoKitMatrix contains vectors for a collection of corpus component objects and is stored in the Corpus object, as opposed to individual utterances / speakers / conversations.

This allows it to be readily used as a matrix as needed. At the same time, with some nifty engineering, the vector for any corpus component object can be accessed directly from the object itself. The ConvoKitMatrix object also stores mappings from rows to object ids and mappings from columns to column names, allowing for easy interpretation of the meaning of these matrices. 
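To make the idea concrete, here is a minimal pure-Python sketch of a matrix that carries row-to-id and column-to-name mappings. The `LabeledMatrix` class and its methods are hypothetical names invented for illustration; this is not how ConvoKitMatrix is implemented, only a picture of the data it keeps together.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LabeledMatrix:
    """Hypothetical stand-in for a matrix with row ids and column names."""
    name: str
    rows: List[List[int]]   # one vector per corpus component object
    ids: List[str]          # row index -> object id
    columns: List[str]      # column index -> feature name

    def __post_init__(self) -> None:
        # Reverse lookup so that a vector can be fetched by object id.
        self._id_to_row: Dict[str, int] = {
            obj_id: i for i, obj_id in enumerate(self.ids)
        }

    def vector_for(self, obj_id: str) -> List[int]:
        """Return the row belonging to a single object."""
        return self.rows[self._id_to_row[obj_id]]

    def as_records(self) -> Dict[str, Dict[str, int]]:
        """Dataframe-like view: {object id: {column name: value}}."""
        return {
            obj_id: dict(zip(self.columns, row))
            for obj_id, row in zip(self.ids, self.rows)
        }


toy = LabeledMatrix(
    name="toy",
    rows=[[1, 0, 2], [0, 3, 0]],
    ids=["utt_a", "utt_b"],
    columns=["hello", "cornell", "world"],
)
print(toy.vector_for("utt_b"))
print(toy.as_records()["utt_a"])
```

Keeping the mappings next to the matrix is what lets the same data serve both as a whole matrix and as a per-object vector.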


### The bag-of-words transformer

We demonstrate the use of vectors with the bag-of-words transformer, which computes a vector representation of the text of a given object (typically, utterances).
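For readers unfamiliar with bag-of-words, the following is a rough sketch of what a unigram count vectorization does, written with only the standard library. The `tokenize` and `bag_of_words` helpers and the tokenization rule are assumptions made for illustration; the BoWTransformer used below relies on its own vectorizer, so the exact vocabulary will differ.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    # Lowercase, then keep alphanumeric tokens of length >= 2
    # (an illustrative rule, not necessarily the one ConvoKit uses).
    return re.findall(r"[a-z0-9]{2,}", text.lower())


def bag_of_words(
    texts: Dict[str, str]
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Return (row ids, column names, count matrix) for a mapping id -> text."""
    counts = {obj_id: Counter(tokenize(text)) for obj_id, text in texts.items()}
    # The vocabulary is the sorted set of all tokens seen in any text.
    vocabulary = sorted(set().union(*counts.values()))
    ids = list(texts)
    # A Counter returns 0 for terms that do not occur in a text.
    matrix = [[counts[obj_id][term] for term in vocabulary] for obj_id in ids]
    return ids, vocabulary, matrix


texts = {"u1": "Go Big Red, go!", "u2": "Big exams this week"}
ids, vocabulary, matrix = bag_of_words(texts)
print(vocabulary)
for obj_id, row in zip(ids, matrix):
    print(obj_id, row)
```

Each row is the count vector for one text, and each column corresponds to one vocabulary term, which is the same shape of output we will see from the transformer.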

In [6]:
from convokit import BoWTransformer, VectorClassifier

Consider an arbitrary utterance in the Corpus.

In [7]:
random_utt = corpus.random_utterance()
random_utt

Utterance({'obj_type': 'utterance', 'meta': {'score': 6, 'top_level_comment': 'cav2bmu', 'retrieved_on': 1430584908, 'gilded': 0, 'gildings': None, 'subreddit': 'Cornell', 'stickied': False, 'permalink': '', 'author_flair_text': ''}, 'vectors': [], 'speaker': Speaker({'obj_type': 'speaker', 'meta': {}, 'vectors': [], 'owner': <convokit.model.corpus.Corpus object at 0x12d3b95d0>, 'id': '[deleted]'}), 'conversation_id': '1hjndj', 'reply_to': 'cav4tfq', 'timestamp': 1372954373, 'text': '[deleted]', 'owner': <convokit.model.corpus.Corpus object at 0x12d3b95d0>, 'id': 'cavvwym'})

By default, it has no vectors associated with it.

In [8]:
# The utterance does not have any vectors associated with it
random_utt.vectors

[]

In [9]:
# The corpus does not have any vectors associated with it
corpus.vectors

set()

Let's initialize a bag-of-words transformer to vectorize the texts of the utterances in the corpus and to store these vectors in a vector matrix called 'bow'.

In [10]:
bow_transformer = BoWTransformer(obj_type="utterance", vector_name='bow')
bow_transformer.fit_transform(corpus)

Initializing default unigram CountVectorizer...Done.


<convokit.model.corpus.Corpus at 0x12d3b95d0>

The utterance we inspected earlier now has a 'bow' vector associated with it.

In [11]:
random_utt

Utterance({'obj_type': 'utterance', 'meta': {'score': 6, 'top_level_comment': 'cav2bmu', 'retrieved_on': 1430584908, 'gilded': 0, 'gildings': None, 'subreddit': 'Cornell', 'stickied': False, 'permalink': '', 'author_flair_text': ''}, 'vectors': ['bow'], 'speaker': Speaker({'obj_type': 'speaker', 'meta': {}, 'vectors': [], 'owner': <convokit.model.corpus.Corpus object at 0x12d3b95d0>, 'id': '[deleted]'}), 'conversation_id': '1hjndj', 'reply_to': 'cav4tfq', 'timestamp': 1372954373, 'text': '[deleted]', 'owner': <convokit.model.corpus.Corpus object at 0x12d3b95d0>, 'id': 'cavvwym'})

In [12]:
random_utt.vectors

['bow']

#### Fetching the vector for the utterance

In [13]:
random_utt.get_vector('bow')

<1x9340 sparse matrix of type '<class 'numpy.int64'>'
	with 1 stored elements in Compressed Sparse Row format>

In [14]:
# We can get a more interpretable display of the vector as a dataframe
random_utt.get_vector('bow', as_dataframe=True)

Unnamed: 0,00,000,00am,00pm,01,02,03,04,05,06,...,youtu,youtube,yr,yrs,yup,zero,zeus,zimride,zip,zone
cavvwym,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Notice that the dataframe has a row index corresponding to the utterance ID, and a column index corresponding to the n-grams in the bag-of-words vectorization. This is handled automatically by the BoWTransformer.

We can even get subsets of the columns:

In [15]:
random_utt.get_vector('bow', as_dataframe=True, columns=['youtu', 'youtube', 'yr'])

Unnamed: 0,youtu,youtube,yr
cavvwym,0,0,0


In [16]:
# This works for the non-dataframe format too
random_utt.get_vector('bow', as_dataframe=False, columns=['youtu', 'youtube', 'yr'])

<1x3 sparse matrix of type '<class 'numpy.int64'>'
	with 0 stored elements in Compressed Sparse Row format>

#### What does this look like at the Corpus level?

In [17]:
corpus.vectors # The corpus has a 'bow' vector associated with it

{'bow'}

We can access the ConvoKitMatrix object directly. This matrix contains the vectors for all the utterances in the Corpus, as opposed to the single utterance that we saw just above.

In [18]:
corpus.get_vector_matrix('bow')  

ConvoKitMatrix('name': bow, 'matrix': <74467x9340 sparse matrix of type '<class 'numpy.int64'>'
	with 2108383 stored elements in Compressed Sparse Row format>)

In [19]:
bow_matrix = corpus.get_vector_matrix('bow')

In [20]:
bow_matrix.to_dataframe().head()

Unnamed: 0,00,000,00am,00pm,01,02,03,04,05,06,...,youtu,youtube,yr,yrs,yup,zero,zeus,zimride,zip,zone
nyx4d,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
o0145,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
o1gca,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
o0ss4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
o31u0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The ConvoKitMatrix object also has attributes storing the 'row to ids' and 'columns to feature names' mappings:

In [21]:
bow_matrix.columns[:10]

['00', '000', '00am', '00pm', '01', '02', '03', '04', '05', '06']

In [22]:
bow_matrix.ids[:10]

['nyx4d',
 'o0145',
 'o1gca',
 'o0ss4',
 'o31u0',
 'o4ipd',
 'o456r',
 'o4544',
 'o3l7i',
 'o3fqm']

In [23]:
bow_matrix.name

'bow'

In [24]:
# Accessing the underlying matrix directly (here a SciPy sparse matrix), e.g. for linear algebra operations
bow_matrix.matrix

<74467x9340 sparse matrix of type '<class 'numpy.int64'>'
	with 2108383 stored elements in Compressed Sparse Row format>
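The output above mentions 'stored elements' and 'Compressed Sparse Row format'. As a rough illustration of what that means, the sketch below builds the three CSR arrays (`data`, `indices`, `indptr`) for a tiny dense matrix in plain Python. The `to_csr` and `csr_row` helpers are hypothetical and only mimic the layout; in practice the sparse matrix library handles this for us.

```python
from typing import List, Tuple


def to_csr(dense: List[List[int]]) -> Tuple[List[int], List[int], List[int]]:
    """Convert a dense matrix to CSR arrays (data, indices, indptr)."""
    data: List[int] = []      # the stored (nonzero) values, row by row
    indices: List[int] = []   # the column index of each stored value
    indptr: List[int] = [0]   # row i occupies data[indptr[i]:indptr[i + 1]]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(col)
        indptr.append(len(data))
    return data, indices, indptr


def csr_row(data: List[int], indices: List[int], indptr: List[int],
            i: int, n_cols: int) -> List[int]:
    """Rebuild dense row i from the CSR arrays."""
    row = [0] * n_cols
    for k in range(indptr[i], indptr[i + 1]):
        row[indices[k]] = data[k]
    return row


dense = [[0, 0, 3, 0], [0, 0, 0, 0], [1, 0, 0, 2]]
data, indices, indptr = to_csr(dense)
print(len(data), "stored elements")
print(data, indices, indptr)
print(csr_row(data, indices, indptr, 2, n_cols=4))
```

Because only the nonzero counts are kept, a bag-of-words matrix with tens of thousands of rows and thousands of columns stays small in memory.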

### Adding an arbitrary matrix

To add an arbitrary matrix, we initialize a ConvoKitMatrix object and then append it to the Corpus object.

For example, pretend for a moment we had computed the bag-of-words matrix first, instead of having had the BoWTransformer do this for us automatically.

In [25]:
matrix_data = bow_matrix.matrix

In [26]:
# We would typically have an array matrix like this
matrix_data

<74467x9340 sparse matrix of type '<class 'numpy.int64'>'
	with 2108383 stored elements in Compressed Sparse Row format>

In [27]:
from convokit import ConvoKitMatrix

We initialize a ConvoKitMatrix with the array matrix simply as follows:

In [28]:
ck_matrix = ConvoKitMatrix(name='bag-of-words', matrix=matrix_data)

In [29]:
# if no row ids and column names are provided, a default numeric index for rows and columns is used
ck_matrix.to_dataframe().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,9330,9331,9332,9333,9334,9335,9336,9337,9338,9339
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now suppose we had collected the utterance ids and column feature names ourselves before initializing the matrix. The ids and column names would look something like this:

In [30]:
column_names = bow_matrix.columns
print(column_names[:10])

['00', '000', '00am', '00pm', '01', '02', '03', '04', '05', '06']


In [31]:
row_ids = bow_matrix.ids
print(row_ids[:10])

['nyx4d', 'o0145', 'o1gca', 'o0ss4', 'o31u0', 'o4ipd', 'o456r', 'o4544', 'o3l7i', 'o3fqm']


In [32]:
# We can initialize the ConvoKitMatrix with this information 
ck_matrix = ConvoKitMatrix(name='bag-of-words',
                           matrix=matrix_data,
                           columns=column_names,
                           ids=row_ids
                          )

In [33]:
# This gives us a much more descriptive dataframe
ck_matrix.to_dataframe().head()

Unnamed: 0,00,000,00am,00pm,01,02,03,04,05,06,...,youtu,youtube,yr,yrs,yup,zero,zeus,zimride,zip,zone
nyx4d,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
o0145,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
o1gca,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
o0ss4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
o31u0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Finally, we store this ConvoKitMatrix in the Corpus object so that it is now associated with the Corpus and will be dumped with the Corpus.

In [34]:
corpus.vectors

{'bow'}

In [35]:
corpus.append_vector_matrix(ck_matrix)

In [36]:
corpus.vectors

{'bag-of-words', 'bow'}

Note, however, that the utterances are not 'automatically linked' to this newly added vector. For example, if we look at the utterance with ID 'nyx4d' (the first row in the vector matrix):

In [37]:
# It does not have a 'bag-of-words' vector
utt_example = corpus.get_utterance('nyx4d')
utt_example.vectors

['bow']

In [38]:
# this call will fail since there is no such vector associated with the utterance
utt_example.get_vector('bag-of-words') 

ValueError: This utterance has no vector stored as 'bag-of-words'.

If desired, we can add this association with the vector to the object as shown below.

In [39]:
utt_example.add_vector('bag-of-words')

Then when we call get_vector(), the corpus knows to search its 'bag-of-words' vector matrix for the vector associated with utterance ID 'nyx4d'.

In [40]:
utt_example.get_vector('bag-of-words')

<1x9340 sparse matrix of type '<class 'numpy.int64'>'
	with 42 stored elements in Compressed Sparse Row format>

Conversely, if we would like to discard this object-vector association or the ConvoKitMatrix altogether:

In [41]:
utt_example.vectors

['bow', 'bag-of-words']

In [42]:
utt_example.delete_vector('bag-of-words')
utt_example.vectors

['bow']

In [43]:
corpus.vectors

{'bag-of-words', 'bow'}

In [44]:
corpus.delete_vector_matrix('bag-of-words')
corpus.vectors

{'bow'}

The same methods (add_vector() and delete_vector()) are supported for the other Corpus components, i.e. Conversations and Speakers, and so it is equally straightforward to compute and store vector data for these Corpus components as well.
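The association mechanism walked through in this section can be pictured with a small toy model: the component object stores only the names of its vectors, and the owning corpus holds the actual matrices. `ToyCorpus` and `ToyUtterance` are hypothetical classes written for this sketch; they imitate the behavior shown above and are not ConvoKit's implementation.

```python
from typing import Dict, List


class ToyCorpus:
    """Hypothetical owner holding named matrices (rows keyed by object id)."""

    def __init__(self) -> None:
        self.matrices: Dict[str, Dict[str, List[int]]] = {}


class ToyUtterance:
    """Hypothetical component that stores only the *names* of its vectors."""

    def __init__(self, utt_id: str, owner: ToyCorpus) -> None:
        self.id = utt_id
        self.owner = owner
        self.vectors: List[str] = []

    def add_vector(self, name: str) -> None:
        if name not in self.vectors:
            self.vectors.append(name)

    def delete_vector(self, name: str) -> None:
        if name in self.vectors:
            self.vectors.remove(name)

    def get_vector(self, name: str) -> List[int]:
        if name not in self.vectors:
            raise ValueError(f"This utterance has no vector stored as '{name}'.")
        # The lookup is delegated to the owner's matrix, using this object's id.
        return self.owner.matrices[name][self.id]


toy_corpus = ToyCorpus()
toy_corpus.matrices["bag-of-words"] = {"nyx4d": [0, 2, 1]}
toy_utt = ToyUtterance("nyx4d", toy_corpus)

try:
    toy_utt.get_vector("bag-of-words")
    failed_before_linking = False
except ValueError:
    failed_before_linking = True

toy_utt.add_vector("bag-of-words")
linked_vector = toy_utt.get_vector("bag-of-words")
print(failed_before_linking, linked_vector)
```

Since the object holds only a name, adding or removing the association is cheap and never copies the vector data.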

### Dumping and loading vectors

By default, all vectors are dumped alongside the corpus as *pickle* files.

In [45]:
# dumps all vectors by default
corpus.dump('cornell-with-bow', base_path='examples/vectors')

In [64]:
os.listdir('examples/vectors/cornell-with-bow')

['utterances.jsonl',
 'conversations.json',
 'vectors.bow.p',
 'corpus.json',
 'speakers.json',
 'index.json']
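As a rough picture of what a pickle round trip involves, the sketch below writes a toy matrix-like object to a file named after the `vectors.bow.p` entry in the listing above and reads it back. The dictionary used as the matrix is a hypothetical stand-in, and the exact object ConvoKit serializes may differ.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a vector matrix: any picklable object works here.
toy_matrix = {
    "name": "bow",
    "ids": ["utt_a", "utt_b"],
    "columns": ["hello", "world"],
    "rows": [[1, 0], [0, 2]],
}

with tempfile.TemporaryDirectory() as corpus_dir:
    # The file name pattern mirrors the 'vectors.bow.p' entry listed above.
    path = os.path.join(corpus_dir, f"vectors.{toy_matrix['name']}.p")
    with open(path, "wb") as f:
        pickle.dump(toy_matrix, f)
    files = os.listdir(corpus_dir)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(files)
print(restored == toy_matrix)
```

As with any pickle file, a dumped vector file should only be loaded if it comes from a trusted source.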

### But vectors can be excluded

In [65]:
corpus.dump('cornell-no-bow', base_path='examples/vectors', exclude_vectors=['bow'])

In [66]:
os.listdir('examples/vectors/cornell-no-bow')

['utterances.jsonl',
 'conversations.json',
 'corpus.json',
 'speakers.json',
 'index.json']

Let's check if they really are excluded:

In [67]:
corpus = Corpus(filename='examples/vectors/cornell-no-bow')

In [68]:
corpus.vectors

set()

In [69]:
corpus.random_utterance().vectors

[]

### When the corpus is loaded, vectors are present 'structurally' but not actually loaded

By default, vectors are 'lazy-loaded'. This means that they are not loaded into memory until actually required. However, by inspection, they appear accessible just as if they were already loaded with the Corpus.
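The general lazy-loading pattern can be sketched as follows: the store knows which vector names exist, but only reads a file from disk the first time that vector is requested. `LazyVectorStore` is a hypothetical class written for this illustration, and its `preload` argument is an assumption that loosely mirrors eager loading; ConvoKit's internals may be organized differently.

```python
import os
import pickle
import tempfile
from typing import Any, Dict, Iterable, Set


class LazyVectorStore:
    """Hypothetical store: knows which vectors exist, loads each on first use."""

    def __init__(self, directory: str, names: Iterable[str],
                 preload: Iterable[str] = ()) -> None:
        self.directory = directory
        self.names: Set[str] = set(names)   # visible immediately
        self._loaded: Dict[str, Any] = {}   # filled in on demand
        for name in preload:
            self.get(name)

    def get(self, name: str) -> Any:
        if name not in self.names:
            raise KeyError(name)
        if name not in self._loaded:
            # First access: read the pickle file from disk and cache it.
            path = os.path.join(self.directory, f"vectors.{name}.p")
            with open(path, "rb") as f:
                self._loaded[name] = pickle.load(f)
        return self._loaded[name]

    def is_loaded(self, name: str) -> bool:
        return name in self._loaded


with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "vectors.bow.p"), "wb") as f:
        pickle.dump([[1, 0], [0, 2]], f)

    store = LazyVectorStore(d, names=["bow"])
    loaded_before = store.is_loaded("bow")   # nothing read from disk yet
    matrix = store.get("bow")                # triggers the actual load
    loaded_after = store.is_loaded("bow")

    eager = LazyVectorStore(d, names=["bow"], preload=["bow"])
    eager_loaded = eager.is_loaded("bow")

print(loaded_before, loaded_after, eager_loaded)
```

From the caller's point of view, `get` behaves the same whether or not the data was already in memory, which is why the lazy loading goes unnoticed.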

In [70]:
corpus = Corpus(filename='examples/vectors/cornell-with-bow')

In [72]:
corpus.random_utterance()

Utterance({'obj_type': 'utterance', 'meta': {'score': 5, 'top_level_comment': 'c701swp', 'retrieved_on': 1430383153, 'gilded': 0, 'gildings': None, 'subreddit': 'Cornell', 'stickied': False, 'permalink': '', 'author_flair_text': ''}, 'vectors': ['bow'], 'speaker': Speaker({'obj_type': 'speaker', 'meta': {}, 'vectors': [], 'owner': <convokit.model.corpus.Corpus object at 0x549637050>, 'id': 'macguffing'}), 'conversation_id': '131dr6', 'reply_to': '131dr6', 'timestamp': 1352697701, 'text': 'There is an inexplicably good bacon cheeseburger in the Ivy Room as well. Also the soups in Temple of Zeus, which is used by no one but English majors. They also do Gimmie Coffee, which is a REVELATION. ', 'owner': <convokit.model.corpus.Corpus object at 0x549637050>, 'id': 'c701swp'})

In [52]:
corpus.random_utterance().vectors

['bow']

In [73]:
corpus.vectors

{'bow'}

In [74]:
# fetched normally (the lazy-loading is invisible to the ConvoKit user)
corpus.get_vector_matrix('bow')

ConvoKitMatrix('name': bow, 'matrix': <74467x9340 sparse matrix of type '<class 'numpy.int64'>'
	with 2108383 stored elements in Compressed Sparse Row format>)

### We can also load the corpus with vectors fully loaded

In [75]:
corpus = Corpus(filename='examples/vectors/cornell-with-bow', preload_vectors=['bow'])

In [76]:
corpus.vectors

{'bow'}

In [77]:
corpus.get_vector_matrix('bow')

ConvoKitMatrix('name': bow, 'matrix': <74467x9340 sparse matrix of type '<class 'numpy.int64'>'
	with 2108383 stored elements in Compressed Sparse Row format>)

## In summary

In this demo, we have covered:
- how to use Transformers that compute vectors for a Corpus
- how to create our own ConvoKitMatrix objects + store them in / delete them from the Corpus
- how to interact with ConvoKitMatrix objects and Corpus component objects' vectors
- how ConvoKit dumps and loads vectors with the Corpus