In this example, we load two corpora: one for training a classifier and one for testing it.

We first preprocess both corpora by computing text parses for their utterances. We then extract politeness features from these parses and train a politeness classifier on the features from the training corpus. Finally, we use the trained classifier to compute the average politeness score of utterances from two subreddits (r/aww and r/politics) in the testing corpus.

In [1]:
import convokit

In [2]:
from convokit import Corpus, download, TextParser, PolitenessStrategies, Classifier

In [3]:
train_corpus = Corpus(filename=download('wiki-politeness-annotated'))

Dataset already exists at /Users/calebchiam/.convokit/downloads/wiki-politeness-annotated


In [4]:
test_corpus = Corpus(filename=download('reddit-corpus-small'))

Dataset already exists at /Users/calebchiam/Documents/GitHub/Cornell-Conversational-Analysis-Toolkit/convokit/tensors/reddit-corpus-small


## Step 1: Preprocessing

In [5]:
parser = TextParser()
parser.transform(train_corpus)


<convokit.model.corpus.Corpus at 0x1316cc310>

In [6]:
parser.transform(test_corpus)

<convokit.model.corpus.Corpus at 0x132c08a10>

## Step 2: Feature extraction

In [7]:
ps = PolitenessStrategies()
ps.transform(train_corpus)
ps.transform(test_corpus)

<convokit.model.corpus.Corpus at 0x132c08a10>
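
To give a rough idea of what a politeness feature vector looks like, here is a minimal standalone sketch. It is not ConvoKit's implementation: the real `PolitenessStrategies` transformer works from the text parses computed in Step 1, whereas this toy version only matches a few surface patterns. The `MARKERS` lexicon and the `toy_politeness_features` helper are hypothetical names made up for illustration.

```python
import re

# Hypothetical marker lexicon, for illustration only.
# This is NOT the feature set used by ConvoKit's PolitenessStrategies.
MARKERS = {
    "please": r"\bplease\b",
    "gratitude": r"\b(thanks|thank you)\b",
    "apology": r"\b(sorry|apologi[sz]e)\b",
}

def toy_politeness_features(text):
    """Return a dict of 0/1 indicators, one per marker pattern."""
    lowered = text.lower()
    return {
        name: int(re.search(pattern, lowered) is not None)
        for name, pattern in MARKERS.items()
    }

features = toy_politeness_features("Thanks for the fix, could you please add a test?")
print(features)
```

Each utterance ends up with a fixed set of named indicators, which is the general shape of input a classifier can be trained on.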

## Step 3: Analysis

In [8]:
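# Predict from the 'politeness_strategies' features; an utterance is a
# positive example when its 'Binary' annotation equals 1.
clf = Classifier(obj_type='utterance', pred_feats=['politeness_strategies'],
                 labeller=lambda utt: utt.meta['Binary']==1)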
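
The `labeller` turns each training utterance's metadata into a boolean target. The sketch below applies the same lambda to hypothetical stand-in objects that model only a `.meta` dict. It assumes `Binary` can take the values 1, 0, and -1; check the corpus documentation for the actual encoding.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for utterances; only the .meta attribute is modeled.
# Assumption: 'Binary' may be 1, 0, or -1 in the annotated corpus.
toy_utts = [SimpleNamespace(meta={"Binary": b}) for b in (1, 0, -1)]

# Same labelling rule as in the Classifier call above.
labeller = lambda utt: utt.meta["Binary"] == 1

labels = [labeller(u) for u in toy_utts]
print(labels)
```

Under this rule, only utterances with `Binary` equal to 1 are positive examples; every other value falls into the negative class.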

In [9]:
clf.fit(train_corpus)

<convokit.classifier.classifier.Classifier at 0x3964ed190>

In [11]:
clf.transform(test_corpus)

<convokit.model.corpus.Corpus at 0x132c08a10>

In [None]:
# Collect the classifier's predictions separately for each subreddit.
aww_vals = clf.summarize(test_corpus, selector=lambda utt: utt.meta['subreddit']=='aww')
politics_vals = clf.summarize(test_corpus, selector=lambda utt: utt.meta['subreddit']=='politics')

In [None]:
print(aww_vals['pred_score'].mean())
print(politics_vals['pred_score'].mean())
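
The final comparison amounts to filtering utterances with a selector and averaging their predicted scores. A minimal plain-Python sketch of that computation follows. The records and score values are made up, and `mean_score` is a hypothetical helper, not part of ConvoKit.

```python
from statistics import mean

# Hypothetical scored utterances; the values are invented for illustration.
toy_scored = [
    {"subreddit": "aww", "pred_score": 0.75},
    {"subreddit": "aww", "pred_score": 0.5},
    {"subreddit": "politics", "pred_score": 0.25},
    {"subreddit": "politics", "pred_score": 0.5},
    {"subreddit": "other", "pred_score": 1.0},
]

def mean_score(utts, selector):
    """Average pred_score over the utterances accepted by selector."""
    return mean(u["pred_score"] for u in utts if selector(u))

aww_mean = mean_score(toy_scored, lambda u: u["subreddit"] == "aww")
politics_mean = mean_score(toy_scored, lambda u: u["subreddit"] == "politics")
print(aww_mean, politics_mean)
```

Utterances rejected by the selector (here, the `other` subreddit) do not affect either average.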