# Analyzing Conversations In Python Using ConvoKit

**NLP+CSS 201 Series Spring 2022**

**Presenter: [Jonathan P. Chang](https://cs.cornell.edu/~jpchang), Department of Computer Science, Cornell University**

In this tutorial, you will learn how to use the [ConvoKit](https://convokit.cornell.edu) software package to computationally analyze *conversational data* in Python. More concretely, the learning objectives for this tutorial are:
- Understand what makes conversational data different from general natural language data, and why this can be important to computational social science research.
- Learn to install and use the ConvoKit package.
- Understand how ConvoKit computationally represents conversational data from diverse sources under a single unified model.
- Learn how to manipulate conversational data in ConvoKit via its *transformer* system.

Before we begin, let's install ConvoKit and set up its requirements.

In [1]:
!pip install convokit
# spacy setup
!python -m spacy download en_core_web_sm
# nltk setup
import nltk
nltk.download('punkt')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting convokit
  Downloading convokit-2.5.3.tar.gz (167 kB)
Collecting msgpack-numpy>=0.4.3.2
  Downloading msgpack_numpy-0.4.8-py2.py3-none-any.whl (6.9 kB)
Collecting clean-text>=0.1.1
  Downloading clean_text-0.6.0-py3-none-any.whl (11 kB)
Collecting unidecode>=1.1.1
  Downloading Unidecode-1.3.6-py3-none-any.whl (235 kB)
Collecting ftfy<7.0,>=6.0
  Downloading ftfy-6.1.1-py3-none-any.whl (53 kB)
Collecting emoji<2.0.0,>=1.0.0
  Downloading emoji-1.7.0.tar.gz (175 kB)
Building wheels for collected packages: convokit, emoji
  Building wheel for convokit (setup.py) ... done
  Created wheel for convokit: filename=convokit-2

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## 1. Why use conversational data for social science research?

NLP has historically tended to focus on *standalone* text documents, such as news articles, blog posts, or individual tweets. While this type of data can be valuable to social science (e.g., a study comparing and contrasting the language used by different news outlets), a huge part of social science research is looking at *interactions* between people, and in the arena of natural language, interactions play out in the form of *conversations*.

A naive approach to handling conversations might be to just treat them as if they were regular text documents, for example by transcribing them in a film-script-like format. But this treatment would obscure or outright remove some key properties that make conversations unique:
- Unlike documents, which are generally written by a single author or small group of collaborating authors, conversations involve **multiple interlocutors**, who may each bring their own (sometimes conflicting) goals to the table.
- Related to the above, conversations are **dynamic**: beyond just the raw text of what is said, there is also meaningful information in how utterances relate to each other and how the interlocutors interact.
- Finally, conversations are **temporal** - while a document is meant to simply be read top-to-bottom without regard to when each sentence was actually written, in a conversation it is important to consider the order in which utterances played out and the rate at which utterances were made.

Therefore, to fully leverage conversational data, it is not sufficient to simply treat it like a collection of documents and apply existing NLP techniques directly. Instead, many social science discoveries have depended on the development of new NLP techniques that are specifically designed to leverage the unique structure of conversations. Here is just a sample of research findings that took advantage of one or more of the unique properties of conversations listed above:

- Learning a typology of questions in an unsupervised way by characterizing them in terms of their expected answers ([Zhang et al., 2017](https://aclanthology.org/D17-1164/))
- Characterizing how "interesting" online discussion threads are in terms of participant engagement patterns ([Backstrom et al., 2013](https://dl.acm.org/doi/10.1145/2433396.2433401))
- Predicting the popularity of web content based on commenting activity over time ([He et al., 2014](https://dl.acm.org/doi/abs/10.1145/2600428.2609558))
- Understanding the effects of misalignment between intentions and perceptions among discussion participants ([Chang et al., 2020](https://dl.acm.org/doi/abs/10.1145/3366423.3380273))

### 1.1 How ConvoKit can help

Thus far, we have discussed the importance of treating conversations as a unique category of data and specifically designing computational techniques for such data. Unfortunately, there has historically been a major barrier to working with conversational data: the software and data ecosystem is fragmented. The lack of a common standard for representing conversational data has meant that popular datasets are distributed in different data formats with their own task-specific schemas. Similarly, code for reproducing various conversational methods tends to be ad hoc, with no guarantee that different implementations can interoperate.

This is where ConvoKit comes in. ConvoKit is designed to gather various conversational datasets and computational methods under a single umbrella. To achieve this, it provides two key offerings: a unified **representation** of conversational data, and a standardized language for describing **manipulation** of such data. We will now proceed to describe each of these through interactive examples.

To foreshadow just how useful ConvoKit can be, the following code cell contains a *complete* script for running a *linguistic coordination* analysis in ConvoKit. Linguistic coordination is a measure of how much a speaker in a discussion corpus tends to adopt the language of other speakers, or conversely, how much other speakers tend to adopt that speaker's language (for example: if a US President tends to refer to the country with first-person pronouns like "we" and "our", senators who normally refer to it in third-person terms like "America" or "the country" might switch to using first-person pronouns while addressing the President). It is a great example of a task that inherently involves the unique structure of conversational data, since it requires knowledge of the relationships between speakers and the reply-to structure of the conversations. As we can see, ConvoKit lets us run this complex and useful task in just a few lines of code.



In [2]:
## Example end-to-end script for running linguistic coordination on the r/stanford corpus
## (If you have ConvoKit installed, you can copy this code into a file and it will run with no additional modification needed)
from convokit import download, Corpus, Coordination
import random

r_stanford = Corpus(download("subreddit-stanford"))

coord = Coordination()
coord.fit(r_stanford)
r_stanford = coord.transform(r_stanford)

scores = coord.summarize(r_stanford, focus="targets").averages_by_speaker()
# Higher scores -> other users tend to coordinate more to this user
# We expect the highly active moderator u/tick_tock_clock to have a relatively high score
print("Score of highly active moderator u/tick_tock_clock:", scores[r_stanford.get_speaker('tick_tock_clock')])
# We expect the typical user to have relatively lower scores than that
random_users = random.sample(list(scores.keys()), 5)
print("Scores of 5 random users:", [scores[u] for u in random_users])

Downloading subreddit-stanford to /root/.convokit/downloads/subreddit-stanford
Downloading subreddit-stanford from http://zissou.infosci.cornell.edu/convokit/datasets/subreddit-corpus/corpus-zipped/sports101~-~stanisms/stanford.corpus.zip (3.4MB)... Done
Score of highly active moderator u/tick_tock_clock: 0.05984600038555586
Scores of 5 random users: [0.0, -0.04761904761904764, 0.002705627705627711, -0.027083333333333348, 0.0]
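
To build some intuition for what these scores measure, here is a simplified, self-contained sketch of a coordination-style score for a single linguistic marker (say, first-person plural pronouns). It is not ConvoKit's implementation: the function name `coordination_toward` and the toy `exchanges` data are made up for illustration, and the sketch assumes the common formulation in which coordination is the probability that a reply uses the marker given that the message it replies to did, minus the baseline probability that a reply uses the marker at all.

```python
def coordination_toward(exchanges):
    """Toy coordination score toward one target speaker, for one marker.

    `exchanges` is a list of (parent_has_marker, reply_has_marker) booleans,
    one pair per reply addressed to the target (a hypothetical input format).
    """
    if not exchanges:
        return None
    # replies whose parent (the target's utterance) used the marker
    triggered = [reply for parent, reply in exchanges if parent]
    if not triggered:
        return None  # the target never used the marker, so the score is undefined
    # P(reply uses marker | parent used marker)
    p_conditional = sum(triggered) / len(triggered)
    # P(reply uses marker), regardless of what the parent did
    p_baseline = sum(reply for _, reply in exchanges) / len(exchanges)
    return p_conditional - p_baseline


# six replies to the same target: three after the target used the marker, three after it did not
exchanges = [
    (True, True), (True, True), (True, False),
    (False, False), (False, True), (False, False),
]
score = coordination_toward(exchanges)
print(score)
```

Under this reading, a positive score means that replies echo the marker more often than they would by default, while scores at or below zero (like several of the random users above) indicate no such tendency.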


This example gives us a quick glance at what ConvoKit can do. Now let's go into it in more detail: this tutorial will walk you through the fundamental features of ConvoKit, and with this knowledge you'll be able to use ConvoKit to run interesting conversational analyses of your own!

## 2. Representing conversational data: The `Corpus` hierarchy

All computational social science research starts with a good dataset! ConvoKit represents conversational datasets using the `Corpus` class. Thus, the first step in working with ConvoKit is to load our dataset as a `Corpus` object. ConvoKit supports two ways to create a `Corpus`:

1. You can load one of the many pre-prepared datasets offered by the ConvoKit maintainers. Many of the most popular conversational datasets are already available this way. A full list can be found on the ConvoKit website [here](https://convokit.cornell.edu/documentation/datasets.html) and can be accessed in Python via the `convokit.download` function.
2. If you are working with custom data or a dataset that is not yet available through `convokit.download`, it is also possible to construct a `Corpus` object from scratch and populate it with any arbitrary data.

For the sake of demonstration, in this tutorial we will load a pre-prepared `Corpus`. For anyone interested in creating a `Corpus` from scratch, we invite you to consult the [official tutorial on that topic](https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/master/examples/converting_movie_corpus.ipynb).

Let's try loading the r/stanford subreddit `Corpus` (part of ConvoKit's [complete Reddit dump](https://convokit.cornell.edu/documentation/subreddit.html) collection):

In [3]:
from convokit import Corpus, download

file_path = download("subreddit-stanford") # convokit.download downloads the corpus files and returns their location on disk...
r_stanford = Corpus(filename=file_path)    # ...which we can then pass to the Corpus constructor

Dataset already exists at /root/.convokit/downloads/subreddit-stanford


### 2.1 Anatomy of a `Corpus`

Great, we now have our `Corpus`! But...what exactly is in it? Let's take a moment to explain the `Corpus` class hierarchy. Besides `Corpus` itself, there are three other classes in the hierarchy: `Conversation`, `Utterance`, and `Speaker`. The relationship between the classes is as follows: 
- A `Corpus` contains one or more `Conversation`s...
- ...each `Conversation` is made up of one or more `Utterance`s... 
- ...each `Utterance` is attributed to exactly one `Speaker` (though each `Speaker` can own multiple `Utterance`s), and...
- ...each `Utterance` can be a *reply* to another `Utterance` in the same `Conversation` (and multiple `Utterance`s can reply to the same parent `Utterance`, resulting in a tree structure).
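
The relationships listed above can be mimicked in a few lines of plain Python. The sketch below is only a mental model, not how ConvoKit is implemented: `ToyUtterance`, `ToyCorpus`, and all of their methods are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyUtterance:
    id: str
    speaker_id: str           # each utterance has exactly one speaker
    conversation_id: str      # each utterance belongs to exactly one conversation
    reply_to: Optional[str]   # ID of the parent utterance, or None for the conversation root
    text: str = ""


@dataclass
class ToyCorpus:
    utterances: Dict[str, ToyUtterance] = field(default_factory=dict)

    def add(self, utt: ToyUtterance) -> None:
        self.utterances[utt.id] = utt

    def conversation(self, conversation_id: str) -> List[ToyUtterance]:
        # a conversation is the set of utterances that share a conversation ID
        return [u for u in self.utterances.values() if u.conversation_id == conversation_id]

    def by_speaker(self, speaker_id: str) -> List[ToyUtterance]:
        # one speaker can own many utterances
        return [u for u in self.utterances.values() if u.speaker_id == speaker_id]

    def replies_to(self, utterance_id: str) -> List[ToyUtterance]:
        # several utterances may reply to the same parent, which yields a tree
        return [u for u in self.utterances.values() if u.reply_to == utterance_id]


corpus = ToyCorpus()
corpus.add(ToyUtterance("u1", "alice", "c1", None, "Which dining hall is best?"))
corpus.add(ToyUtterance("u2", "bob", "c1", "u1", "Depends on the day."))
corpus.add(ToyUtterance("u3", "carol", "c1", "u1", "The one closest to you."))
corpus.add(ToyUtterance("u4", "alice", "c1", "u2", "Fair enough!"))

print([u.id for u in corpus.replies_to("u1")])
print([u.id for u in corpus.by_speaker("alice")])
```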

It may also help to see this relationship visually:

convokit_classes.svg

As shown in the figure, each object in the `Corpus` hierarchy is uniquely identified through an `id` attribute. Furthermore, the `Utterance` class, being at the lowest level in the hierarchy, also contains additional attributes to represent the actual content of the conversation: `text`, which is what was actually said in that `Utterance`, and `timestamp`, which is when the `Utterance` was made. These reflect basic attributes that are common to all datasets. Of course, most datasets will also involve some more domain-specific information; we'll return to this point later.

But first, let's get a more concrete intuition for what the `Corpus` hierarchy is like, by unpacking it in the context of our example r/stanford `Corpus`. One basic check we can start with is asking how large the `Corpus` is:

In [4]:
r_stanford.print_summary_stats()

Number of Speakers: 4034
Number of Utterances: 21865
Number of Conversations: 4193


Now suppose we want to examine individual components in more detail. ConvoKit offers a number of options for navigating between levels in the hierarchy (e.g., selecting an `Utterance` from a `Corpus`). The most basic such operation is selecting a component by ID. At the `Corpus` level, there is a set of consistently named functions for selecting individual components of the `Corpus` by ID: `get_conversation`, `get_utterance`, and `get_speaker`. All three functions operate the same way: you provide an ID, and the function returns the corresponding component.

In [5]:
print(r_stanford.get_conversation("6txrue"))   # conversation IDs in Reddit corpora are Reddit post IDs
print(r_stanford.get_utterance("cuid83j"))     # utterance IDs in Reddit corpora are Reddit comment IDs
print(r_stanford.get_speaker("mr_something1")) # speaker IDs in Reddit corpora are Reddit usernames

Conversation('id': '6txrue', 'utterances': ['6txrue', 'dloeafz', 'dlog4vl', 'dloh5t3', 'dlohax7', 'dlohg0s'], 'meta': {'title': 'Prospective grad student here. Confused between masters programs. Should I be applying for EE or CS?', 'num_comments': 5, 'domain': 'self.stanford', 'timestamp': 1502836553, 'subreddit': 'stanford', 'gilded': 0, 'gildings': None, 'stickied': False, 'author_flair_text': ''})
Utterance(id: 'cuid83j', conversation_id: 3ioxpk, reply-to: cuick8s, speaker: Speaker(id: iam7U, vectors: [], meta: {}), timestamp: 1440741823, text: "Interesting -- I've always preferred SJC, but essentially for the same reason you prefer SFO.  You can take Caltrain to Santa Clara and take the *FREE* VTA 10 bus (which runs every 15 minutes) to the airport.  I would be surprised if SFO were easier -- but I haven't tried that Caltrain/Bart route so I can't comment.  SJC is a smaller airport and quicker to get around.\n\nBy car they're about equidistant.\n\nBut here's the real deciding facto

`Conversation`, `Utterance`, and `Speaker` also support similarly-named functions, allowing you to easily navigate not only from higher to lower levels of the hierarchy, but also the reverse. To spell it out:
- You can navigate from each `Utterance` to the `Conversation` it belongs to or the `Speaker` that made it.
- You can navigate from each `Speaker` to the `Utterance`s that they have made or the `Conversation`s they have participated in.
- You can navigate from each `Conversation` to the `Utterance`s that compose it or the `Speaker`s that participated in it.

This can be summarized more neatly in a visual "flowchart":
convokit_interop.svg

In [6]:
print("Navigating from Utterances")
utt = r_stanford.get_utterance("cuid83j")
print(utt.get_conversation())                 # navigating from an Utterance to the Conversation it belongs to (no ID needed since each Utterance belongs to exactly one Conversation)
print(utt.get_speaker())                      # navigating from an Utterance to the Speaker that made it (no ID needed since each Utterance has exactly one authoring Speaker) 
print()
print("Navigating from Speakers")
spk = r_stanford.get_speaker("mr_something1")
print(spk.get_utterance("dfynsgc"))           # navigating from a Speaker to one of the Utterances they made
print(spk.get_conversation("63xc3z"))         # navigating from a Speaker to one of the Conversations they participated in
print()
print("Navigating from Conversations")
convo = r_stanford.get_conversation("6txrue")
print(convo.get_utterance("dlog4vl"))         # navigating from a Conversation to one of its constituent Utterances
print(convo.get_speaker("Master565"))         # navigating from a Conversation to a Speaker that participated in it

Navigating from Utterances
Conversation('id': '3ioxpk', 'utterances': ['3ioxpk', 'cuick8s', 'cuid83j', 'cuievxi', 'cuil8rk', 'cuinarn', 'cuiriq3', 'cuirjj6', 'cuirpet', 'cuiwk8w', 'cuj1u55', 'cumoz6u', 'cump74e'], 'meta': {'title': 'SFO vs SJC: Which airport do you fly in/out of?', 'num_comments': 12, 'domain': 'self.stanford', 'timestamp': 1440739265, 'subreddit': 'stanford', 'gilded': 0, 'gildings': None, 'stickied': False, 'author_flair_text': ''})
Speaker(id: iam7U, vectors: [], meta: {})

Navigating from Speakers
Utterance(id: 'dfynsgc', conversation_id: 63xkov, reply-to: 63xkov, speaker: Speaker(id: mr_something1, vectors: [], meta: {}), timestamp: 1491584284, text: 'As an Econ major now, you may not want to do a double major as depending on what subjects end up being interesting to you a lot of the MCS or Econ major will be a lot less engaging and useful.', vectors: [], meta: {'score': 2, 'top_level_comment': 'dfynsgc', 'retrieved_on': 1493880157, 'gilded': 0, 'gildings': None, 

The `get_<component>` functions work great if you already know the ID of the component you want. But most often when you work with conversational data, you don't already know in advance which specific items you want to pick out. Instead, in research involving conversational data we most often want to select components based on some criteria of interest, e.g., "find all the utterances that mention a specific keyword". 

To help with this, ConvoKit provides *iterators* over `Corpus` components: `iter_conversations`, `iter_utterances`, and `iter_speakers`. Just like the case with the ID-based selection functions, all three iterators are available at the `Corpus` level, and additionally, the `Conversation`, `Utterance`, and `Speaker` classes each have their own implementations of iterators; the available directions of navigation are the same as the ones previously listed for the ID selectors. 

These iterators can be used to implement loops over `Corpus` components. In the below example, we demonstrate this by using `iter_conversations` and `iter_utterances` to find all r/stanford posts with at least 50 comments:

In [7]:
# first use iter_conversations at the Corpus level to iterate over all the posts
for convo in r_stanford.iter_conversations():
    # then use iter_utterances at the Conversation level to count the number of comments in each post
    # (the Conversation-level iter_utterances iterates over Utterances in that Conversation only;
    # note that the count includes the post itself, which is the root Utterance of the Conversation)
    n_comments = len(list(convo.iter_utterances()))
    if n_comments >= 50:
        print("Found long conversation:", convo)

Found long conversation: Conversation('id': 'sd2va', 'utterances': ['sd2va', 'c4d1qwa', 'c4d1yvf', 'c4d22nx', 'c4d2kz6', 'c4d2s52', 'c4d2wcg', 'c4d2wmq', 'c4d2zj8', 'c4d32p0', 'c4d36bf', 'c4d3ae0', 'c4d3cpc', 'c4d3mc5', 'c4d3oua', 'c4d3ush', 'c4d3xbo', 'c4d494b', 'c4d4bj8', 'c4d4bpe', 'c4d4l3f', 'c4d4lvx', 'c4d4muo', 'c4d4pqf', 'c4d4q4d', 'c4d4zfx', 'c4d4zzk', 'c4d54up', 'c4d5cby', 'c4d5hrt', 'c4d5ydj', 'c4d66lo', 'c4d67dv', 'c4d6hyg', 'c4d74d7', 'c4d7662', 'c4d7bul', 'c4d7cgs', 'c4d7d4h', 'c4d7p1b', 'c4d7wq9', 'c4d7xqw', 'c4d8md1', 'c4d8o9s', 'c4dak9n', 'c4dc950', 'c4dgax3', 'c4dglom', 'c4dgq64', 'c4dgutz', 'c4dgz6b'], 'meta': {'title': 'This is unacceptable. Get on this r/stanford admin. We need our own r/stanford theme.', 'num_comments': 45, 'domain': 'imgur.com', 'timestamp': 1334612792, 'subreddit': 'stanford', 'gilded': -1, 'gildings': None, 'stickied': False, 'author_flair_text': ''})
Found long conversation: Conversation('id': 'jsgnz', 'utterances': ['jsgnz', 'c2er9p6', 'c2erb0
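
The keyword criterion mentioned earlier ("find all the utterances that mention a specific keyword") follows the same iterate-and-filter pattern. Here is a minimal sketch; the helper name `utterances_mentioning` is made up, and the toy objects merely stand in for ConvoKit `Utterance`s by exposing a `text` attribute. With a real `Corpus`, you would presumably pass `r_stanford.iter_utterances()` as the first argument instead.

```python
import re
from types import SimpleNamespace


def utterances_mentioning(utterances, keyword):
    """Yield every utterance whose text contains `keyword` as a whole word (case-insensitive)."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for utt in utterances:
        if pattern.search(utt.text):
            yield utt


# stand-ins for Utterance objects: anything with a .text attribute works
toy_utterances = [
    SimpleNamespace(id="u1", text="Which dorm is closest to the engineering quad?"),
    SimpleNamespace(id="u2", text="Dorms near the quad fill up fast."),
    SimpleNamespace(id="u3", text="I liked my DORM a lot."),
]

matches = [u.id for u in utterances_mentioning(toy_utterances, "dorm")]
print(matches)  # ['u1', 'u3'] ("Dorms" is skipped because only whole-word matches count)
```

Writing the filter as a generator keeps it lazy, so it can consume an iterator over a large corpus without first materializing every utterance in a list.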

### 2.2 Custom attributes: the `meta` field

Previously, we mentioned that objects in the `Corpus` hierarchy have a basic set of attributes: they all have `id`s, and `Utterance`s also have `text` and `timestamp`. These attributes are meant to be generalizable, representing basic information that every conversational dataset will have. But in addition to basic attributes, most conversational datasets also have more specific information that is relevant to their specific domains. For example, in Reddit datasets, each `Utterance` represents a Reddit comment, and Reddit comments don't just have text and timestamps; they also have things like scores (upvotes minus downvotes). Such information is often extremely valuable to researchers. But it would not make sense for, say, the `Utterance` class to directly include a `score` attribute, since that property is specific to Reddit and the ConvoKit `Corpus` hierarchy is meant to be general. How can we resolve this dilemma?

ConvoKit's solution is the inclusion of *metadata attributes*: these are "extra" attributes that are not directly part of each class, but are instead contained in a dict-like object called `meta`. Metadata attributes can be freely assigned and modified by anyone using a `Corpus`, and they are meant to represent both domain-specific information (such as scores of Reddit comments) and research-specific information (such as intermediate outputs produced by code you have written).

Let's take a look at the metadata attributes that are already provided at the `Conversation`, `Utterance`, and `Speaker` levels in the r/stanford dataset:

In [8]:
print("conversation meta:", r_stanford.random_conversation().meta)
print("utterance meta:", r_stanford.random_utterance().meta)
print("speaker meta:", r_stanford.random_speaker().meta)

conversation meta: {'title': 'If you looking for great time - join here, registred and only gc6RbIKho', 'num_comments': 0, 'domain': 'sunbet15.com', 'timestamp': 1456667302, 'subreddit': 'stanford', 'gilded': 0, 'gildings': None, 'stickied': False, 'author_flair_text': ''}
utterance meta: {'score': 6, 'top_level_comment': 'dds8bsy', 'retrieved_on': 1488869115, 'gilded': 0, 'gildings': None, 'subreddit': 'stanford', 'stickied': False, 'permalink': '', 'author_flair_text': ''}
speaker meta: {}


`meta` is also *writeable*, which means that you can both modify existing metadata attributes and add new ones:

In [9]:
utt = r_stanford.random_utterance()
# check the original utt.meta before we modify it
print("original meta:", utt.meta)
# assign a new metadata attribute, then print utt.meta again to confirm the change
utt.add_meta("the_answer", 42)
print("modified meta:", utt.meta)

original meta: {'score': 53, 'top_level_comment': None, 'retrieved_on': 1504673876, 'gilded': 0, 'gildings': None, 'subreddit': 'stanford', 'stickied': False, 'permalink': '/r/stanford/comments/6t9pa8/tldr_for_the_i_got_into_x_dorm_posts/', 'author_flair_text': ''}
modified meta: {'score': 53, 'top_level_comment': None, 'retrieved_on': 1504673876, 'gilded': 0, 'gildings': None, 'subreddit': 'stanford', 'stickied': False, 'permalink': '/r/stanford/comments/6t9pa8/tldr_for_the_i_got_into_x_dorm_posts/', 'author_flair_text': '', 'the_answer': 42}


Adding and modifying metadata is typically used as a way to store the intermediate results of computations. We will return to this later in the tutorial.
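
As a rough illustration of that pattern, the sketch below computes a value once, stores it under a metadata key, and reads it back in a later step. `ToyUtterance` is a hypothetical stand-in whose `meta` is a plain dict (ConvoKit's real `meta` object is only dict-like), and the key name `n_words` is made up.

```python
class ToyUtterance:
    """Minimal stand-in for an Utterance: an ID, some text, and a writeable meta dict."""

    def __init__(self, id, text):
        self.id = id
        self.text = text
        self.meta = {}

    def add_meta(self, key, value):
        self.meta[key] = value


utts = [
    ToyUtterance("u1", "Is CS 106A hard?"),
    ToyUtterance("u2", "Not if you start the assignments early."),
]

# step 1: compute an intermediate result once and store it alongside the data
for utt in utts:
    utt.add_meta("n_words", len(utt.text.split()))

# step 2: later code can rely on the stored value instead of recomputing it
long_ids = [utt.id for utt in utts if utt.meta["n_words"] > 5]
print(long_ids)  # ['u2']
```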

### 2.3 Component class functions

In addition to basic and metadata attributes, all three classes also implement some useful functions to perform more complicated operations. One type of operation that is shared by all the classes is conversion to pandas `DataFrame` objects. This can be useful for interoperability with other packages that accept `DataFrame`s as inputs. The `DataFrame` conversion functions follow the same rules as the ID-based selectors and the iterators. For instance, from a `Speaker` object, you can generate `DataFrame`s of its associated `Utterance`s and `Conversation`s:

In [10]:
spk = r_stanford.random_speaker()
spk.get_utterances_dataframe()

| id | timestamp | text | speaker | reply_to | conversation_id | meta.score | meta.top_level_comment | meta.retrieved_on | meta.gilded | meta.gildings | meta.subreddit | meta.stickied | meta.permalink | meta.author_flair_text | vectors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| d1gm6fi | 1459189034 | I don't think you need to do anything to impro... | Pendrell_Crush | d1gj4d7 | 4cavcv | 4 | d1gij8k | 1460934757 | 0 |  | stanford | False |  |  | [] |
| d21o4fl | 1460580377 | Last I heard, [her parents were arrested](http... | Pendrell_Crush | 4edbjy | 4edbjy | 1 | d21o4fl | 1463421727 | 0 |  | stanford | False |  |  | [] |
| d2250ta | 1460604535 | No, I didn't. Hmm...someone told me about this... | Pendrell_Crush | d224sde | 4edbjy | 1 | d21o4fl | 1463429836 | 0 |  | stanford | False |  |  | [] |
| d225lyk | 1460605520 | I literally have never heard of or read anythi... | Pendrell_Crush | d225i6t | 4edbjy | 1 | d21o4fl | 1463430113 | 0 |  | stanford | False |  |  | [] |
| d3bs8tw | 1463676504 | I think the placement tests are reasonably dif... | Pendrell_Crush | 4k1bel | 4k1bel | 2 | d3bs8tw | 1465962609 | 0 |  | stanford | False |  |  | [] |
| d3dhxu1 | 1463783161 | I don't remember too many details. What I do r... | Pendrell_Crush | d3cijd2 | 4k1bel | 1 | d3bs8tw | 1465992370 | 0 |  | stanford | False |  |  | [] |


In [11]:
spk.get_conversations_dataframe()

| id | vectors | meta.title | meta.num_comments | meta.domain | meta.timestamp | meta.subreddit | meta.gilded | meta.gildings | meta.stickied | meta.author_flair_text |
|---|---|---|---|---|---|---|---|---|---|---|
| 4edbjy | [] | What happened to Azia Kim? | 9 | self.stanford | 1460416062 | stanford | 0 |  | False |  |
| 4cavcv | [] | Tips for an 8th Grader who's thinking Stanford? | 37 | self.stanford | 1459182383 | stanford | 0 |  | False |  |
| 4k1bel | [] | Placement testing for language requirement? | 4 | self.stanford | 1463642179 | stanford | 0 |  | False |  |


The `Conversation` class in particular also implements additional helpful functions to assist with navigating the tree-like reply structure of conversations. For example, we can directly visualize the structure, which can be very useful for exploration:

In [12]:
convo = r_stanford.random_conversation()
convo.print_conversation_structure()

sikarux
    Monster-Cat
    [deleted]
    [deleted]
    shoeler16


Note that by default `print_conversation_structure` represents each `Utterance` by printing the ID of its `Speaker`, but it also accepts an optional argument to customize what gets printed. For example, we might want to instead print the `Utterance` timestamps to get a sense of temporal structure as well:

In [13]:
convo.print_conversation_structure(lambda u: str(u.timestamp))

1487805980
    1487892429
    1488003941
    1488310230
    1488666204


Besides just visualizing the tree structure of a `Conversation`, we might want to leverage it to traverse the `Utterance`s in more meaningful ways (since the default iterator does not necessarily respect the tree structure). To this end, several common tree traversal methods are implemented:

In [14]:
# breadth first traversal
print("===Breadth-first traversal===")
for utt in convo.traverse('bfs'):
    print(utt.speaker.id)
print()

# depth first traversal
print("===Depth-first traversal===")
for utt in convo.traverse('dfs'):
    print(utt.speaker.id)
print()

# preorder traversal
print("===Pre-order traversal===")
for utt in convo.traverse('preorder'):
    print(utt.speaker.id)
print()

# postorder traversal
print("===Post-order traversal===")
for utt in convo.traverse('postorder'):
    print(utt.speaker.id)
print()

# unpack all linear reply chains (i.e., root-to-leaf paths), useful for algorithms that can't handle tree structure
print("===Linear reply-to chains===")
for path in convo.get_root_to_leaf_paths():
    print([utt.speaker.id for utt in path])

===Breadth-first traversal===
sikarux
Monster-Cat
[deleted]
[deleted]
shoeler16

===Depth-first traversal===
sikarux
Monster-Cat
[deleted]
[deleted]
shoeler16

===Pre-order traversal===
sikarux
Monster-Cat
[deleted]
[deleted]
shoeler16

===Post-order traversal===
Monster-Cat
[deleted]
[deleted]
shoeler16
sikarux

===Linear reply-to chains===
['sikarux', 'Monster-Cat']
['sikarux', 'shoeler16']
['sikarux', '[deleted]']
['sikarux', '[deleted]']
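
Because the conversation above is flat (every comment replies directly to the post), most of the traversal orders look identical. The following plain-Python sketch shows how such orders can be derived from nothing but reply-to pointers, using a slightly deeper toy tree so that the differences become visible. It is an illustration of the general technique rather than ConvoKit's implementation; the IDs and helper names are invented.

```python
from collections import defaultdict, deque

# (utterance_id, reply_to) pairs for a toy conversation; None marks the root
toy_replies = [("a", None), ("b", "a"), ("c", "a"), ("d", "b"), ("e", "b")]

# build a parent -> children index from the reply-to pointers
children = defaultdict(list)
root = None
for utt_id, parent in toy_replies:
    if parent is None:
        root = utt_id
    else:
        children[parent].append(utt_id)


def bfs(start):
    """Visit utterances level by level."""
    order, queue = [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(children[node])
    return order


def preorder(node):
    """Visit an utterance before its replies."""
    result = [node]
    for child in children[node]:
        result.extend(preorder(child))
    return result


def postorder(node):
    """Visit an utterance after all of its replies."""
    result = []
    for child in children[node]:
        result.extend(postorder(child))
    result.append(node)
    return result


def root_to_leaf_paths(node, prefix=()):
    """Unpack the tree into linear reply chains."""
    path = prefix + (node,)
    if not children[node]:
        return [list(path)]
    paths = []
    for child in children[node]:
        paths.extend(root_to_leaf_paths(child, path))
    return paths


print(bfs(root))                 # ['a', 'b', 'c', 'd', 'e']
print(preorder(root))            # ['a', 'b', 'd', 'e', 'c']
print(postorder(root))           # ['d', 'e', 'b', 'c', 'a']
print(root_to_leaf_paths(root))  # [['a', 'b', 'd'], ['a', 'b', 'e'], ['a', 'c']]
```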


In some cases, what you might care about is not so much the reply-to structure of the `Conversation`, but rather the chronological order in which `Utterance`s were made. The `Conversation` class offers functionality for this type of traversal as well:

In [15]:
for utt in convo.get_chronological_utterance_list():
    print(utt.timestamp)

1487805980
1487892429
1488003941
1488310230
1488666204
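
Once utterances are in chronological order, the *rate* at which they were made falls out of simple arithmetic on consecutive timestamps. The sketch below reuses the five timestamps printed above and assumes, as their magnitude suggests, that they are Unix timestamps in seconds; the variable names are made up.

```python
# the timestamps from the conversation above, deliberately shuffled
timestamps = [1488003941, 1487805980, 1488666204, 1487892429, 1488310230]

# chronological order is just a sort on the timestamp
ordered = sorted(timestamps)

# gap (in seconds) between each utterance and the one before it
gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
gap_hours = [round(gap / 3600, 1) for gap in gaps]

print(gaps)       # [86449, 111512, 306289, 355974]
print(gap_hours)
```

Here the gaps grow steadily, with the first comment arriving about a day after the post and each later comment taking longer, which is the kind of temporal signal that is lost when a conversation is flattened into a document.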


## 3. Manipulating conversational data: the `Transformer` system

Thus far, we have looked at how ConvoKit standardizes the representation of conversational data. Next, we will examine the second key contribution of ConvoKit: a unified way to express manipulations of conversational data.

In ConvoKit, manipulations of conversational data are expressed using the `Transformer` abstract class interface. At a high level, a `Transformer` is an object that takes in a `Corpus` and returns the same `Corpus` with some modifications done to it, almost always in the form of changed or added metadata (as we discussed previously):

transformer.svg

To draw an analogy to language, if `Corpus` objects are the nouns of ConvoKit, then `Transformer`s are the verbs. And just like in the case of language, even if an individual `Transformer` seems simple, *combining* various `Corpus` and `Transformer` objects, much like constructing a long sentence out of individual words, can allow you to express surprisingly complex analyses.

It should be noted that the name "Transformer" has no relation to the increasingly popular neural architectures. Rather, the name is borrowed from scikit-learn's `Transformer` class interface, which ConvoKit's `Transformer`s are directly inspired by. Indeed, the ConvoKit `Transformer` specification covers a similar set of functions to its scikit-learn counterpart:
- **`Transformer.fit`** *(optional)* prepares the `Transformer` object with any information it needs to do its job, such as training any models that the `Transformer` will use.
- **`Transformer.transform`** *(required)* does the actual modification (or "transformation") of the `Corpus`, which may involve applying any models that were previously trained in the `fit` step.
- **`Transformer.summarize`** *(optional)* generates a human-readable summary of what the `Transformer` has done; for example, a count of features that were extracted during the `transform` step. 

Implementation-wise, `Transformer` is an abstract class: it is not meant to be used directly, but instead simply defines a common set of methods (`fit`, `transform`, and `summarize`). Individual analysis methods can then be implemented as *subclasses* of `Transformer`. For those not familiar with how interfaces and subclasses work in Python, don't worry: for our purposes, all you need to know is that each manipulation of a `Corpus` is represented as an individual class, but these classes all "look the same" in that they have the same basic set of functions.
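
For readers who would like to see the abstract-class pattern spelled out, here is a minimal sketch using Python's standard `abc` module. It is not ConvoKit's actual `Transformer` code: `ToyTransformer`, `WordCounter`, and the list-of-dicts "corpus" are hypothetical stand-ins chosen to keep the example self-contained.

```python
from abc import ABC, abstractmethod


class ToyTransformer(ABC):
    """Toy interface mirroring the three methods described above."""

    def fit(self, corpus):
        # optional: nothing to learn by default
        return self

    @abstractmethod
    def transform(self, corpus):
        # required: every subclass must say how it modifies the corpus
        ...

    def summarize(self, corpus):
        # optional: subclasses may provide a human-readable summary
        raise NotImplementedError


class WordCounter(ToyTransformer):
    """Annotates each utterance with its word count, stored as metadata."""

    def transform(self, corpus):
        for utt in corpus:
            utt["meta"]["n_words"] = len(utt["text"].split())
        return corpus  # same corpus, now carrying extra metadata

    def summarize(self, corpus):
        return sum(utt["meta"]["n_words"] for utt in corpus)


# a "corpus" here is just a list of dicts with text and meta
toy_corpus = [
    {"text": "Great to hear!", "meta": {}},
    {"text": "Thank you for your detailed reply!", "meta": {}},
]

counter = WordCounter()
counter.fit(toy_corpus)
toy_corpus = counter.transform(toy_corpus)
total = counter.summarize(toy_corpus)
print(total)  # 9
```

Trying to instantiate `ToyTransformer` directly raises a `TypeError`, which is how Python enforces that the abstract class is only ever used through subclasses that implement `transform`.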

It may help to see an example in action. Let's start by looking at a very common first step that is often done to conversational data: *parsing* all the utterances to obtain dependency trees and part-of-speech tags for use in subsequent analyses. ConvoKit implements this as a `Transformer` subclass known as `TextParser`, which uses spaCy to perform the parsing.

In [16]:
from convokit import TextParser

parser = TextParser() # TextParser implements the Transformer interface, meaning that it can be used by calling transform()
r_stanford = parser.transform(r_stanford) # transform accepts a corpus and returns the modified corpus



As we've previously mentioned, `Transformer`s typically modify the `Corpus` by adding or modifying metadata attributes. For any individual `Transformer` subclass, you should check the documentation to find out what kind of new metadata is added. In the case of `TextParser`, [the documentation](https://convokit.cornell.edu/documentation/textParser.html) states that the parses are stored in an `Utterance` metadata attribute called "parsed":

In [17]:
print(r_stanford.random_utterance().meta['parsed'])

[{'rt': 0, 'toks': [{'tok': 'Great', 'tag': 'JJ', 'dep': 'ROOT', 'dn': [2, 3]}, {'tok': 'to', 'tag': 'TO', 'dep': 'aux', 'up': 2, 'dn': []}, {'tok': 'hear', 'tag': 'VB', 'dep': 'xcomp', 'up': 0, 'dn': [1]}, {'tok': '!', 'tag': '.', 'dep': 'punct', 'up': 0, 'dn': []}]}, {'rt': 0, 'toks': [{'tok': 'Thank', 'tag': 'VBP', 'dep': 'ROOT', 'dn': [1, 2, 6]}, {'tok': 'you', 'tag': 'PRP', 'dep': 'dobj', 'up': 0, 'dn': []}, {'tok': 'for', 'tag': 'IN', 'dep': 'prep', 'up': 0, 'dn': [5]}, {'tok': 'your', 'tag': 'PRP$', 'dep': 'poss', 'up': 5, 'dn': []}, {'tok': 'detailed', 'tag': 'JJ', 'dep': 'amod', 'up': 5, 'dn': []}, {'tok': 'reply', 'tag': 'NN', 'dep': 'pobj', 'up': 2, 'dn': [3, 4]}, {'tok': '!', 'tag': '.', 'dep': 'punct', 'up': 0, 'dn': []}]}]
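
The stored parse is a plain list of sentences, each holding a list of token dicts, so it can be inspected with ordinary Python. The sketch below copies the first sentence from the output above and reads the fields as follows: `rt` is the index of the root token, `tag` the part-of-speech tag, `dep` the dependency label, `up` the index of the token's head, and `dn` the indices of its children. This reading is inferred from the output shown here, and the helper names are made up.

```python
# first sentence of the parse shown above ("Great to hear!")
parsed = [
    {"rt": 0, "toks": [
        {"tok": "Great", "tag": "JJ", "dep": "ROOT", "dn": [2, 3]},
        {"tok": "to", "tag": "TO", "dep": "aux", "up": 2, "dn": []},
        {"tok": "hear", "tag": "VB", "dep": "xcomp", "up": 0, "dn": [1]},
        {"tok": "!", "tag": ".", "dep": "punct", "up": 0, "dn": []},
    ]},
]


def tagged_tokens(parsed):
    """Return (token, POS tag) pairs, one list per sentence."""
    return [[(tok["tok"], tok["tag"]) for tok in sent["toks"]] for sent in parsed]


def dependency_edges(parsed):
    """Return (head word, dependency label, dependent word) triples for non-root tokens."""
    edges = []
    for sent in parsed:
        toks = sent["toks"]
        for tok in toks:
            if "up" in tok:  # the root token has no head, hence no 'up' field
                edges.append((toks[tok["up"]]["tok"], tok["dep"], tok["tok"]))
    return edges


print(tagged_tokens(parsed))
print(dependency_edges(parsed))
```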


We have now seen an example of a `Transformer` in action. But you might be thinking "this doesn't look particularly related to *conversations*!" Indeed, `TextParser` simply implements a classical, non-conversational NLP technique, namely parsing. But this is because `TextParser` is meant to serve as a *first step* in a larger analysis. This is an important thing to note about `Transformer`s: because they store their results in the `Corpus` itself, multiple `Transformer`s can build upon each other, with one `Transformer` acting on the output of a previous `Transformer`. Indeed, "stacking" `Transformer`s in this way is exactly how we can construct complex analyses out of comparatively simple `Transformer`s (remember the "verbs and sentences" analogy)!

[Figure: transformer_chaining.svg]
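
The stacking idea can be sketched without ConvoKit at all. In the toy example below, both classes and the list-of-dicts "corpus" are hypothetical and exist only for illustration. What matters is that the second step reads a metadata field that the first step wrote.

```python
class ToyTokenizer:
    """Hypothetical preprocessing step: writes a 'tokens' metadata field."""

    def transform(self, corpus):
        for utt in corpus:
            utt["meta"]["tokens"] = utt["text"].lower().split()
        return corpus


class ToyQuestionWordFlagger:
    """Hypothetical feature extractor: reads 'tokens', writes 'has_wh_word'."""

    WH_WORDS = {"who", "what", "when", "where", "why", "how"}

    def transform(self, corpus):
        for utt in corpus:
            tokens = utt["meta"]["tokens"]  # produced by the previous step
            utt["meta"]["has_wh_word"] = any(t in self.WH_WORDS for t in tokens)
        return corpus


toy_corpus = [
    {"text": "What classes should I take", "meta": {}},
    {"text": "Congrats on your admission", "meta": {}},
]

# Apply the steps in order; each one sees the metadata left by the one before.
for step in (ToyTokenizer(), ToyQuestionWordFlagger()):
    toy_corpus = step.transform(toy_corpus)

flags = [utt["meta"]["has_wh_word"] for utt in toy_corpus]
print(flags)
```

Running the second step on its own would fail with a `KeyError`, which mirrors the real situation: a `Transformer` that depends on earlier output must be run after the step that produces it.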

Informally, we tend to group `Transformer`s based on what step in the research workflow they are designed to perform:
- **Preprocessing**: these `Transformer`s perform basic preliminary work to prepare the data for subsequent analysis. `TextParser`, which we previously saw, is an example of this: its output is not interesting in itself but is useful as input to other models. Most preprocessing steps are not conversational in nature, and are simply traditional NLP methods applied at the `Utterance` level.
- **Feature extraction**: these `Transformer`s apply more sophisticated techniques to transform arbitrary conversational information into numerical features. Such features can either be interesting results on their own, or can be used as inputs to subsequent analyses, e.g., to machine learning models.
- **Analysis**: these `Transformer`s implement conversation-centric models from the latest cutting edge research, possibly relying on features previously extracted by a feature extraction `Transformer`. The output of such models can represent interesting high-level findings about the conversational dataset.

To illustrate how `Transformer`s can be used to obtain interesting, uniquely *conversational* findings, we'll look at two examples: one of a feature extraction `Transformer`, and one of an analysis `Transformer`, both of which implement methods from the literature that are designed to take advantage of the unique structure of conversational data.

### 3.1 A feature extraction example: `PromptTypes`

`PromptTypes` is an implementation of the question typology model from the paper "Asking Too Much? The Rhetorical Role of Questions in Political Discourse" ([Zhang et al., 2017](https://aclanthology.org/D17-1164/)). This method learns categories of questions that exist in a dataset. It does this in an unsupervised way (without manual annotation), by characterizing questions in terms of what kinds of replies they tend to get.
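
Characterizing questions by their replies presupposes that each question can be matched with the utterances that answer it. The sketch below illustrates only that pairing step on a toy thread. It is not the `PromptTypes` algorithm. The dictionary layout (`id`, `reply_to`, `text`) and both helper functions are assumptions made for illustration, and the question test is the same crude question-mark heuristic used in the next cell.

```python
def is_question(text):
    """Crude heuristic: treat any text containing '?' as a question."""
    return "?" in text


def question_reply_pairs(utterances):
    """Pair each question with the utterances that reply directly to it."""
    by_id = {utt["id"]: utt for utt in utterances}
    pairs = []
    for utt in utterances:
        parent = by_id.get(utt["reply_to"])  # None for thread-starting utterances
        if parent is not None and is_question(parent["text"]):
            pairs.append((parent["text"], utt["text"]))
    return pairs


toy_thread = [
    {"id": "a", "reply_to": None, "text": "Does anyone know when housing results come out?"},
    {"id": "b", "reply_to": "a", "text": "Usually in late spring."},
    {"id": "c", "reply_to": "b", "text": "Thanks, that helps."},
]

pairs = question_reply_pairs(toy_thread)
print(pairs)
```

Only the first two utterances form a pair here, since the third replies to a statement rather than a question.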

In [18]:
from convokit import PromptTypeWrapper

# since not all threads in r/stanford are questions and answers, we'll mark in the metadata which comments look like questions, with a simple heuristic of looking for a question mark.
# we'll name this metadata field "is_question" (which PromptTypeWrapper is programmed to check by default)
for utt in r_stanford.iter_utterances():
    utt.add_meta('is_question', '?' in utt.text)

pt = PromptTypeWrapper(min_support=10)
pt.fit(r_stanford) # the PromptTypes model must first be fit to the data
r_stanford = pt.transform(r_stanford) # we can then use the trained model to extract question type features



10000/21865 utterances processed
20000/21865 utterances processed
21865/21865 utterances processed
10000/21865 utterances processed
20000/21865 utterances processed
21865/21865 utterances processed




10000/21865 utterances processed
20000/21865 utterances processed
21865/21865 utterances processed
10000/21865 utterances processed
20000/21865 utterances processed
21865/21865 utterances processed




counting frequent itemsets for 5088 sets
	first pass: counting itemsets up to and including 5 items large
	second pass: counting itemsets more than 5 items large
	second pass: checking 25 sets for itemsets of length 6
making itemset tree for 491 itemsets
deduplicating itemsets
	finding supersets
10000/21865 utterances processed
20000/21865 utterances processed
21865/21865 utterances processed
fitting 4485 input pairs
fitting reference tfidf model
fitting prompt tfidf model
fitting svd model
fitting 8 prompt types




10000/21865 utterances processed
20000/21865 utterances processed
21865/21865 utterances processed
10000/21865 utterances processed
20000/21865 utterances processed
21865/21865 utterances processed




10000/21865 utterances processed
20000/21865 utterances processed
21865/21865 utterances processed
10000/21865 utterances processed
20000/21865 utterances processed
21865/21865 utterances processed
10000/21865 utterances processed
20000/21865 utterances processed
21865/21865 utterances processed


Now that the `PromptTypes` transformer has done its work, we want to know exactly what it computed. Most feature extraction and analysis `Transformer`s implement the `summarize` method to provide a human-readable report of what they learned or computed. In the case of `PromptTypes`, the `summarize` method prints a detailed explanation of each type of question it learned, showing both the kind of language associated with that type and some examples of questions that were classified as that type.

In [19]:
pt.summarize(r_stanford)

TYPE 0
top prompt:
                          0         1         2         3         4         5  \
does>*__know_*     0.428382  1.271197  1.072943  0.977456  1.109204  1.066245   
know_*__know_does  0.480151  1.175713  1.055373  0.958883  1.182640  1.090463   
can>*              0.529171  1.158638  0.814260  0.847814  0.895091  0.936703   
are_*__what>*      0.591241  0.965245  0.916393  0.892366  0.776307  0.943002   

                          6         7  type_id  
does>*__know_*     1.477024  1.007456      0.0  
know_*__know_does  1.401752  0.996162      0.0  
can>*              1.355543  0.921045      0.0  
are_*__what>*      1.409501  1.038675      0.0  
top response:
Empty DataFrame
Columns: [0, 1, 2, 3, 4, 5, 6, 7, type_id]
Index: []
top prompts:
d5sf8xr Hey. Congrats on your admission! Your mixture of excitement and anxiety is normal and understandable. What a thoughtful post; it's clear you've given this a lot of consideration, even perhaps too much. I certainly don't have a

### 3.2 An analysis example: `Coordination`

Finally, we can return to the opening example of *linguistic coordination*. Armed with what we have covered, we can now understand exactly what the example script was doing: it took a `Corpus` (the same r/stanford `Corpus` we've been looking at throughout the tutorial) and computed coordination scores using the `Coordination` transformer, which implements the computational model proposed in "Echoes of power: Language effects and power differences in social interaction" ([Danescu-Niculescu-Mizil et al., 2012](https://dl.acm.org/doi/abs/10.1145/2187836.2187931)).

Let's look again at the relevant lines, which ran the transformation:

In [20]:
from convokit import Coordination

coord = Coordination()
coord.fit(r_stanford)
r_stanford = coord.transform(r_stanford)

We can also now understand how the example script retrieved the scores at the end: it used the `summarize` method. Indeed, we can see that `summarize` actually produces a full `Speaker`-level lookup table of scores:

In [21]:
pairwise_coord = coord.summarize(r_stanford, focus="targets")
for spk, score in sorted(pairwise_coord.averages_by_speaker().items(),
    key=lambda x: x[1], reverse=True):
    print(spk.id, score)

freudian_nipple_slip 0.7272727272727273
mjanes 0.5555555555555556
70mmIMAX 0.4642857142857143
whatdidijustread 0.4464285714285714
Derander 0.42687074829931976
stanford-chem 0.4
kgregg 0.3
keyboardgato 0.2777777777777778
word_up 0.26041666666666663
sageshadows7 0.2511904761904762
cooltrainerdq 0.25
shampistols76 0.25
jradams 0.25
stanford2016 0.25
romqA3 0.25
galikat131 0.25
Yoda_the_Jedi 0.25
francescofont10 0.25
greenlion98 0.25
ChesterEnergy 0.25
bookemdano08 0.25
omarkhatib01 0.25
natrius 0.24285714285714283
jsalsman 0.24
keenonkyrgyzstan 0.23333333333333334
favoredfortune 0.22857142857142856
alekami98 0.2276605339105339
darkslair 0.22499999999999998
smileguy91 0.21666666666666667
jeaguilar 0.21590909090909094
JMez2612 0.21428571428571425
TheAlphaNerd 0.20899621212121214
creative-heart 0.2
pussyconqueror 0.2
FootballObelisk 0.19642857142857142
Vig249 0.1908333333333333
artyj 0.19074675324675322
More-Cowbell 0.19062500000000002
hapajazz 0.19047619047619047
SnicketKnight 0.18999999999
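
For intuition about what such scores measure: roughly, the Echoes of Power paper defines the coordination of a speaker toward a target on a given linguistic marker as the probability that the speaker's reply uses the marker when the target's utterance used it, minus the overall probability that the speaker's reply uses the marker. The sketch below computes that difference for a single marker on made-up data. The function name and the toy exchanges are hypothetical, and ConvoKit's `Coordination` does considerably more (for example, combining several marker categories), so this should not be read as its implementation.

```python
def coordination(exchanges):
    """Compute P(reply has marker | target has marker) - P(reply has marker).

    exchanges: list of (target_has_marker, reply_has_marker) boolean pairs,
    one pair per (target utterance, reply) exchange.
    """
    triggered = [reply for target, reply in exchanges if target]
    if not exchanges or not triggered:
        return None  # undefined if the target never used the marker
    p_reply_given_target = sum(triggered) / len(triggered)
    p_reply = sum(reply for _, reply in exchanges) / len(exchanges)
    return p_reply_given_target - p_reply


# Made-up exchanges: the target used the marker in the first three.
toy_exchanges = [
    (True, True),
    (True, True),
    (True, False),
    (False, False),
    (False, True),
    (False, False),
    (False, False),
    (False, False),
]

score = coordination(toy_exchanges)
print(score)
```

A positive score means the replier uses the marker more often right after the target used it than they do in general.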

## What's next?

Hopefully you've now gotten a taste of the powerful things you can do using the building blocks that ConvoKit gives you! Of course, in this brief tutorial we could only scratch the surface of all that ConvoKit offers. If you're interested in learning more, we invite you to read the [official documentation](https://convokit.cornell.edu/documentation/). We'd also like to remind you that ConvoKit is [open source](https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit) and we invite any and all contributions. Even if you are unable to contribute code, we welcome feature requests and dataset contributions.

Happy conversation analyzing!

## Acknowledgements

The content of this tutorial is heavily based on material from the official ConvoKit documentation, including the [Core Concepts overview](https://convokit.cornell.edu/documentation/architecture.html) and the [interactive tutorial](https://colab.research.google.com/github/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/master/examples/Introduction_to_ConvoKit.ipynb). It also uses figures from the ConvoKit SIGDIAL presentation originally designed by [Caleb Chiam](https://github.com/calebchiam).