This notebook demos the `phrasing_motifs` module, which extracts representations of utterances in terms of how they are phrased. 

This is a really clear example of a method which reflects both good (we think) ideas and ad-hoc implementation decisions. As such, there are lots of options and potential variations to consider (beyond the deeper question of what phrasings even are) -- I'll detail these as I go along.

## Preliminaries

First we load the corpus (for now I'm using tennis, since it's a bit faster to run, but in the release demo I'll probably use parliament).

In [1]:
import convokit

In [2]:
ROOT_DIR = '/kitchen/convokit_corpora/tennis-corpus/'

We'll load the corpus, plus some pre-computed dependency parses (see `todo: scripts to do this parsing` for how to get these parses).

In [3]:
corpus = convokit.Corpus(ROOT_DIR)
corpus.load_info('utterance',['parsed'])

In [4]:
VERBOSITY = 10000

Our specific goal, which we'll use ConvoKit to accomplish, is to produce an abstract representation of questions asked by reporters to players after tennis matches, in terms of how they are phrased: what phrasing, or lexico-syntactic "motif", does a question have?

Here's an example of an utterance by a reporter:

In [82]:
test_utt_id = '5188_0.q'
utt = corpus.get_utterance(test_utt_id)

In [83]:
utt.text

'How do you feel? Watching the Australian Open it was very scary watching your ankle buckle. How does your ankle feel now?'

For each _sentence_ that has a question (the first and the third), we want to come up with a representation of that sentence's phrasing. Intuitively, both questions sound like they could be abstractly thought of as "how do/does X feel?" -- that is, this is a query that could be asked of things beyond "you" or "your ankle".

Intuitively, if we want to get at this higher level of abstraction, we might want to start by looking at the structural "skeleton" of the sentence, i.e., its dependency parse.

## Arcs

As a starting point, we're going to provide a representation of questions in terms of their dependency parse by extracting all the parent-to-child token edges, or "arcs". We will use the `TextToArcs` class to do this:

In [6]:
from convokit.text_processing import TextToArcs

`get_arcs` is a transformer (actually a `TextProcessor`) that will read the dependency parse of an utterance and write the resultant arcs to a field called `'arcs'`:

In [33]:
get_arcs = TextToArcs('arcs', verbosity=VERBOSITY)
corpus = get_arcs.transform(corpus)

10000/163948 utterances processed
20000/163948 utterances processed
30000/163948 utterances processed
40000/163948 utterances processed
50000/163948 utterances processed
60000/163948 utterances processed
70000/163948 utterances processed
80000/163948 utterances processed
90000/163948 utterances processed
100000/163948 utterances processed
110000/163948 utterances processed
120000/163948 utterances processed
130000/163948 utterances processed
140000/163948 utterances processed
150000/163948 utterances processed
160000/163948 utterances processed


`'arcs'` is a list where each element corresponds to a sentence in the utterance. Each sentence is represented in terms of its arcs, in a space-separated string. 

Each arc, in turn, can be read as follows:

* `x_y` means that `x` is the parent and `y` is the child token (e.g., `feel_do` = `feel --> do`)
* `x_*` means that `x` is a token with at least one descendant, which we do not resolve (this is roughly like bigrams backing off to unigrams)
* `x>y` means that `x` and `y` are the first two tokens in the sentence (the decision here was that how the sentence starts is a signal of "phrasing structure" on par with the dependency tree structure)
* `x>*` means that `x` is the first token in the sentence. 

In [43]:
utt.get_info('arcs')

['do_* feel_* feel_do feel_how feel_you how>* how>do how_* you_*',
 'ankle_* australian_* buckle_* buckle_ankle buckle_your it_* open_* open_australian open_the scary_* scary_very the_* very_* was_* was_it was_scary was_watching watching>* watching>the watching_* watching_buckle watching_open your_*',
 'ankle_* ankle_your does_* feel_* feel_ankle feel_does feel_how feel_now how>* how>does how_* now_* your_*']
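
To make the notation concrete, here is a minimal sketch that rebuilds the arcs of the first sentence from a hand-written toy parse. It is not the `TextToArcs` implementation: the `(token, head index)` input format and the `toy_arcs` helper are assumptions made for illustration, and punctuation is simply left out of the toy parse. Note that in the output above every token has an `x_*` entry, leaves included, so the sketch emits one per token.

```python
def toy_arcs(tokens):
    """Build arc strings from (text, head_index) pairs; head_index is None for the root.

    Hypothetical helper for illustration only, not ConvoKit's TextToArcs.
    """
    words = [text.lower() for text, _ in tokens]
    arcs = set()
    for text, head in tokens:
        child = text.lower()
        arcs.add(f"{child}_*")                   # the token on its own
        if head is not None:
            arcs.add(f"{words[head]}_{child}")   # parent_child edge
    if words:
        arcs.add(f"{words[0]}>*")                # first token of the sentence
    if len(words) > 1:
        arcs.add(f"{words[0]}>{words[1]}")       # first two tokens of the sentence
    return " ".join(sorted(arcs))

# "How do you feel?" with "feel" (index 3) as the root; punctuation omitted.
toy_parse = [("How", 3), ("do", 3), ("you", 3), ("feel", None)]
first_sentence_arcs = toy_arcs(toy_parse)
print(first_sentence_arcs)
```

For this toy parse, the printed string is the same as the first element of the `'arcs'` list shown above.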

## Further preprocessing: cleaned-up arcs

At this point, while we've got the methodology to start making sense of the dependency tree, we arguably haven't progressed beyond producing fancy bigram representations of sentences. One problem is perhaps that the default arc extraction is a bit too permissive -- it gives us _all_ of the arcs. We might not want this for a few reasons:

* We only want to learn about question phrasings; we don't actually care about non-question sentences.
* The structure of a question might be best encapsulated by the arcs that go out of the _root_ of the tree; as you get further down we might end up with less structural and more content-specific representations.
* Likewise, the particular _nouns_ used (e.g., `australian`) might not be good descriptions of the more abstract phrasing pattern.

All of these points are debatable, and the resultant modules I'll show below hopefully allow you to play around with them. Taking these points as is for now, though, we'll do the following.

In [35]:
from convokit.phrasing_motifs import CensorNouns, QuestionSentences
from convokit.convokitPipeline import ConvokitPipeline

We will actually create a pipeline to extract the arcs we want. This pipeline has the following components, in order:

* `CensorNouns`: a transformer that removes all the nouns and pronouns from a dependency parse. This transformer also collapses constructions like `What tournament [was it]` into `What [was it]`.
* `TextToArcs`: calling the arc extractor from above with an extra parameter, `root_only=True`, which will only extract arcs attached to the root (in addition to the first two tokens, though this is also tunable via the `use_start` parameter).
* `QuestionSentences`: a transformer that, given utterance fields consisting of a list of sentences, removes all the sentences which do not contain question marks. Here, we pass an extra parameter `input_filter=question_filter`, telling it to ignore utterances which aren't listed in the Corpus as questions (i.e., if a player asks a question, we'll discount this, since it's not labeled in the Corpus as a reporter question).
    * (You may wonder how this transformer can tell whether a sentence has a question mark in it, given that the output of `TextToArcs` doesn't have any punctuation. Under the hood, `QuestionSentences` looks at the dependency parse of the sentence and checks whether the last token is a question mark.)
    * `QuestionSentences` also omits any sentences which don't begin in capital letters. To turn this off, pass parameter `use_caps=False`.
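
The sentence-level filtering described for `QuestionSentences` can be sketched in plain Python. This is a simplification that works on raw sentence strings, whereas the real transformer consults the dependency parse; `keep_question_sentences` and its arguments are hypothetical names.

```python
def keep_question_sentences(sentences, arcs, use_caps=True):
    """Keep arcs only for sentences that end in '?' (and, optionally, start capitalized).

    `sentences` and `arcs` are parallel lists, one entry per sentence.
    Hypothetical helper; the real transformer inspects the parse, not raw text.
    """
    kept = []
    for sentence, sentence_arcs in zip(sentences, arcs):
        text = sentence.strip()
        if not text.endswith("?"):
            continue                      # not a question sentence
        if use_caps and not text[0].isupper():
            continue                      # does not begin with a capital letter
        kept.append(sentence_arcs)
    return kept

sentences = [
    "How do you feel?",
    "Watching the Australian Open it was very scary watching your ankle buckle.",
    "How does your ankle feel now?",
]
arcs = ["arcs of sentence 1", "arcs of sentence 2", "arcs of sentence 3"]
question_arcs = keep_question_sentences(sentences, arcs)
print(question_arcs)
```

Only the entries for the first and third sentences survive, since the second sentence is not a question.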

In [84]:
def question_filter(utt, aux_input=None):
    # Keep only utterances labeled in the corpus as (reporter) questions.
    return utt.meta['is_question']

In [38]:
q_arc_pipe = ConvokitPipeline([
    ('censor_nouns', CensorNouns('parsed_censored', verbosity=VERBOSITY)),
    ('shallow_arcs', TextToArcs('arcs_censored', input_field='parsed_censored',
                                root_only=True, verbosity=VERBOSITY)),
    ('question_sentence_filter', QuestionSentences('question_arcs', input_field='arcs_censored',
                                                   input_filter=question_filter, verbosity=VERBOSITY))
])

In [39]:
corpus = q_arc_pipe.transform(corpus)

10000/163948 utterances processed
20000/163948 utterances processed
30000/163948 utterances processed
40000/163948 utterances processed
50000/163948 utterances processed
60000/163948 utterances processed
70000/163948 utterances processed
80000/163948 utterances processed
90000/163948 utterances processed
100000/163948 utterances processed
110000/163948 utterances processed
120000/163948 utterances processed
130000/163948 utterances processed
140000/163948 utterances processed
150000/163948 utterances processed
160000/163948 utterances processed
10000/163948 utterances processed
20000/163948 utterances processed
30000/163948 utterances processed
40000/163948 utterances processed
50000/163948 utterances processed
60000/163948 utterances processed
70000/163948 utterances processed
80000/163948 utterances processed
90000/163948 utterances processed
100000/163948 utterances processed
110000/163948 utterances processed
120000/163948 utterances processed
130000/163948 utterances processed
140

This pipeline results in a more minimalistic representation of utterances, in terms of just the arcs at the root of dependency trees, just the questions, and no nouns:

In [44]:
utt.get_info('question_arcs')

['feel_* feel_do feel_how how>* how>do',
 'feel_* feel_does feel_how feel_now how>* how>does']
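
As a rough sketch of what the pipeline does to the second question, the toy parse below carries a coarse part-of-speech tag per token. Nouns and pronouns are dropped, only arcs leaving the root are kept, and the start-of-sentence arcs are computed on the remaining tokens. The input format, the tag names, and `toy_root_arcs` are assumptions for illustration; in particular, this ignores the collapsing behaviour of `CensorNouns` and assumes the root itself is not a noun.

```python
CENSORED_TAGS = {"NOUN", "PROPN", "PRON"}   # assumed coarse tags for nouns and pronouns

def toy_root_arcs(tokens):
    """tokens: list of (text, pos_tag, head_text); head_text is None for the root.

    Hypothetical helper: heads are given by their text, which only works
    for toy sentences without repeated words.
    """
    kept = [(text.lower(), head.lower() if head else None)
            for text, pos, head in tokens if pos not in CENSORED_TAGS]
    root = next(text for text, head in kept if head is None)
    arcs = {f"{root}_*"}
    arcs.update(f"{root}_{text}" for text, head in kept if head == root)  # root arcs only
    words = [text for text, _ in kept]
    arcs.add(f"{words[0]}>*")                 # first remaining token
    if len(words) > 1:
        arcs.add(f"{words[0]}>{words[1]}")    # first two remaining tokens
    return " ".join(sorted(arcs))

# "How does your ankle feel now?" (punctuation omitted)
toy_parse = [
    ("How", "ADV", "feel"), ("does", "AUX", "feel"), ("your", "PRON", "ankle"),
    ("ankle", "NOUN", "feel"), ("feel", "VERB", None), ("now", "ADV", "feel"),
]
second_question_arcs = toy_root_arcs(toy_parse)
print(second_question_arcs)
```

For this toy parse the result matches the second element of `question_arcs` above.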

## Phrasing Motifs

Finally, to arrive at our final representation of phrasings, we can go one level of abstraction further. In short, `feel_does` doesn't yet feel like a fully-specified phrasing; we might instead want to return `how does __ feel`. To do this, our intuition is to think of phrasings as frequently co-occurring sets of multiple arcs.

To extract these frequent arc-sets (which may remind you of the data mining idea of extracting frequent itemsets) we will use the `PhrasingMotifs` class.

In [45]:
from convokit.phrasing_motifs import PhrasingMotifs

In [46]:
pm_model = PhrasingMotifs('question_motifs', 'question_arcs', min_support=50,
                          fit_filter=question_filter, verbosity=VERBOSITY)

Here, `pm_model` will:

* extract all sets of arcs, as read from the `question_arcs` field, which occur at least 50 times in a corpus. These frequently-occurring arc sets will constitute the set, or "vocabulary", of phrasings.
* write the resultant output -- the phrasings that an utterance contains -- to a field called `question_motifs`. 

On the latter point, `pm_model` will only transform (i.e., label phrasings for) utterances which are questions, i.e., `question_filter(utterance) = True`. That is, in both the train and transform steps, we totally ignore non-questions.
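
The idea of counting frequent arc sets can be sketched with a brute-force counter. This only illustrates frequent-itemset counting under assumed names (`frequent_arc_sets`, `max_size`) and a toy support threshold; the fit log further down suggests that `PhrasingMotifs` uses a more careful multi-pass procedure with deduplication, which is not reproduced here.

```python
from collections import Counter
from itertools import combinations

def frequent_arc_sets(sentences, min_support, max_size=3):
    """Count every arc subset of up to `max_size` arcs; keep those seen often enough.

    `sentences` are space-separated arc strings, one per sentence.
    Hypothetical brute-force helper, not the PhrasingMotifs algorithm.
    """
    counts = Counter()
    for sentence in sentences:
        arcs = sorted(set(sentence.split()))
        for size in range(1, max_size + 1):
            for arc_set in combinations(arcs, size):
                counts[arc_set] += 1          # one count per sentence containing the set
    return {arc_set: n for arc_set, n in counts.items() if n >= min_support}

toy_sentences = [
    "feel_* feel_do feel_how how>* how>do",
    "feel_* feel_does feel_how feel_now how>* how>does",
    "think_* think_do do>*",
]
frequent = frequent_arc_sets(toy_sentences, min_support=2)
for arc_set, n in sorted(frequent.items()):
    print(arc_set, n)
```

With a threshold of 2, only sets built from `feel_*`, `feel_how` and `how>*` survive, since those are the arcs shared by the two "feel" questions.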

Note that the phrasings learned by `pm_model` are therefore _corpus-specific_ -- different corpora may have different frequently-occurring sets, resulting in different vocabularies of phrasings. For instance, you wouldn't expect people in the British House of Commons to ask questions that sound like questions asked to tennis players. In this respect, think of `PhrasingMotifs` like models from scikit-learn (e.g., `LogisticRegression`) -- it is fit to a particular dataset:

In [47]:
pm_model.fit(corpus)

counting frequent itemsets for 81911 sets
	first pass: counting itemsets up to and including 5 items large
	first pass: 10000/81911 sets processed
	first pass: 20000/81911 sets processed
	first pass: 30000/81911 sets processed
	first pass: 40000/81911 sets processed
	first pass: 50000/81911 sets processed
	first pass: 60000/81911 sets processed
	first pass: 70000/81911 sets processed
	first pass: 80000/81911 sets processed
	second pass: counting itemsets more than 5 items large
	second pass: checking 5897 sets for itemsets of length 6
	second pass: checking 1728 sets for itemsets of length 7
making itemset tree for 3525 itemsets
deduplicating itemsets
	counting itemset cooccurrences for 10000/78794 collections
	counting itemset cooccurrences for 20000/78794 collections
	counting itemset cooccurrences for 30000/78794 collections
	counting itemset cooccurrences for 40000/78794 collections
	counting itemset cooccurrences for 50000/78794 collections
	counting itemset cooccurrences for 6000

Here are the most common phrasings and how often they occur in the data (in # of sentences). Note that `('*',)` denotes the null phrasing -- i.e., it encapsulates sentences with _any_ root word. 

In [48]:
pm_model.print_top_phrasings(50)

('*',) 81911
('what>*',) 12510
('is_*',) 10314
('how>*',) 9354
('do>*',) 8844
('think_*',) 7148
('think_do',) 6169
('think_*', 'think_do') 6169
('is>*',) 5879
('feel_*',) 5768
('was_*',) 5467
('is>*', 'is_*') 5325
('did>*',) 4686
('are_*',) 4031
('feel_do',) 3367
('feel_*', 'feel_do') 3367
('are>*',) 2931
('do>*', 'think_*') 2816
('do>*', 'think_do') 2799
('do>*', 'think_*', 'think_do') 2799
('have_*',) 2670
('can>*',) 2598
('was>*',) 2430
('what>do',) 2345
('what>*', 'what>do') 2345
('was>*', 'was_*') 2188
('were_*',) 2109
('how>do',) 2072
('how>*', 'how>do') 2072
('when>*',) 1986
('do>*', 'feel_*') 1887
('do>*', 'feel_do') 1881
('do>*', 'feel_*', 'feel_do') 1881
('have>*',) 1767
("'s_*",) 1744
('are>*', 'are_*') 1685
('does>*',) 1620
('feel_did',) 1596
('feel_*', 'feel_did') 1596
('think_what',) 1575
('think_*', 'think_what') 1575
('feel_how',) 1529
('feel_*', 'feel_how') 1529
('were>*',) 1511
('going_*',) 1478
('have_do',) 1458
('have_*', 'have_do') 1458
('feel_*', 'how>*') 1438
('t
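
To read these counts as shares of all question sentences, one option is to divide by the count of the null phrasing `('*',)`. The sketch below parses a few of the printed lines; `parse_phrasing_line` is a hypothetical helper that assumes each line has the form shown above (a tuple literal, a space, then a count).

```python
import ast

def parse_phrasing_line(line):
    """Split a printed line such as "('what>*',) 12510" into (arc tuple, count)."""
    phrasing_repr, count = line.rsplit(" ", 1)
    return ast.literal_eval(phrasing_repr), int(count)

lines = ["('*',) 81911", "('what>*',) 12510", "('feel_*', 'feel_how') 1529"]
parsed = [parse_phrasing_line(line) for line in lines]

total = dict(parsed)[("*",)]   # the null phrasing counts every sentence
shares = {phrasing: count / total for phrasing, count in parsed}
for phrasing, share in shares.items():
    print(phrasing, f"{share:.1%}")
```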

Having "trained", or fitted, our model, we can then use it to annotate each (question) utterance in the corpus with the phrasings this utterance contains.

In [49]:
corpus = pm_model.transform(corpus)

10000/163948 utterances processed
20000/163948 utterances processed
30000/163948 utterances processed
40000/163948 utterances processed
50000/163948 utterances processed
60000/163948 utterances processed
70000/163948 utterances processed
80000/163948 utterances processed
90000/163948 utterances processed
100000/163948 utterances processed
110000/163948 utterances processed
120000/163948 utterances processed
130000/163948 utterances processed
140000/163948 utterances processed
150000/163948 utterances processed
160000/163948 utterances processed


`PhrasingMotifs` will actually annotate utterances with _two_ fields. `question_motifs` lists all the phrasings:

In [50]:
utt.get_info('question_motifs')

['feel_* feel_*__feel_do feel_*__feel_do__feel_how feel_*__feel_do__how>* feel_*__feel_how feel_*__how>* how>* how>*__how>do',
 'feel_* feel_*__feel_does feel_*__feel_does__feel_how feel_*__feel_how feel_*__feel_how__feel_now feel_*__feel_now feel_*__feel_now__how>* feel_*__how>* how>* how>*__how>does']

Note that, if we think of phrasings as sets of arcs, then a sentence that has phrasing `(arc1, arc2, arc3)` will also have phrasing `(arc1, arc2)`. Intuitively, more finely-specified phrasings (i.e., the 3-arc case) more closely specify the phrasing embodied by a sentence. As such, we list only the most finely-specified phrasings in `question_motifs__sink`. (While this roughly corresponds to the number of arcs in the phrasing, a more detailed description can be found in the paper and code.)

In [51]:
utt.get_info('question_motifs__sink')

['feel_*__feel_do__how>*',
 'feel_*__feel_does__feel_how feel_*__feel_now__how>*']
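
The field format can be unpacked with a few lines of Python: phrasings within a sentence are separated by spaces, and the arcs within a phrasing are joined by a double underscore. The sketch below parses the first sentence's `question_motifs` string and checks which of its phrasings are subsets of the sink phrasing shown above. The helper name `parse_motifs` is hypothetical, and the sketch only illustrates the subset relation between phrasings, not how sinks are chosen.

```python
def parse_motifs(motif_string):
    """Turn 'a__b c' into [frozenset({'a', 'b'}), frozenset({'c'})]."""
    return [frozenset(phrasing.split("__")) for phrasing in motif_string.split()]

# First sentence of the example utterance, copied from the output above.
motifs = parse_motifs(
    "feel_* feel_*__feel_do feel_*__feel_do__feel_how feel_*__feel_do__how>* "
    "feel_*__feel_how feel_*__how>* how>* how>*__how>do"
)
sink = frozenset("feel_*__feel_do__how>*".split("__"))

# A phrasing is implied by the sink if all of its arcs appear in the sink.
contained = [sorted(m) for m in motifs if m <= sink]
not_contained = [sorted(m) for m in motifs if not m <= sink]
print(contained)
print(not_contained)
```

Some of the listed phrasings are not subsets of the sink, so the sink field is not simply the union of everything else; the paper and code describe the actual selection.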

Finally, you may still object that a "how _do_ X feel?" question and a "how _does_ X feel?" question are roughly of the same abstract type. The difference is purely grammatical, but the phrasing algorithm won't capture the similarity because the arcs involved (`feel_do` vs. `feel_does`) are different. There are ways to associate these two sentences, but they require some linear algebra (see TODO, the other notebook for details).


For now, we'll save a subset of our output to disk, potentially for use in a later transformer.

In [54]:
corpus.dump_info('utterance', ['question_motifs', 'question_motifs__sink', 'arcs_censored'])

## Storing models

In [55]:
import os

In [56]:
pm_model.dump_model(os.path.join(ROOT_DIR, 'pm_model'))

writing itemset counts
writing downlinks
writing itemset to ids
writing meta information


In [61]:
pm_model_dir = os.path.join(ROOT_DIR, 'pm_model')
!ls $pm_model_dir

downlinks.json	itemset_counts.json  itemset_to_ids.json  meta.json


In [62]:
new_pm_model = PhrasingMotifs('question_motifs_new', 'question_arcs', min_support=50,
                              fit_filter=question_filter, verbosity=VERBOSITY)

In [63]:
new_pm_model.load_model(os.path.join(ROOT_DIR, 'pm_model'))

reading itemset counts
reading downlinks
reading itemset to ids
reading meta information


In [65]:
utt = new_pm_model.transform_utterance(utt)

In [66]:
utt.get_info('question_motifs__sink')

['feel_*__feel_do__how>*',
 'feel_*__feel_does__feel_how feel_*__feel_now__how>*']

In [67]:
utt.get_info('question_motifs_new__sink')

['feel_*__feel_do__how>*',
 'feel_*__feel_does__feel_how feel_*__feel_now__how>*']

In [75]:
q_arc_pipe_full = ConvokitPipeline([
    ('shallow_arcs_full', TextToArcs('root_arcs', input_field='parsed',
                                     root_only=True, verbosity=VERBOSITY)),
    ('question_sentence_filter', QuestionSentences('question_arcs_full', input_field='root_arcs',
                                                   input_filter=question_filter, verbosity=VERBOSITY)),
])

In [76]:
corpus = q_arc_pipe_full.transform(corpus)

10000/163948 utterances processed
20000/163948 utterances processed
30000/163948 utterances processed
40000/163948 utterances processed
50000/163948 utterances processed
60000/163948 utterances processed
70000/163948 utterances processed
80000/163948 utterances processed
90000/163948 utterances processed
100000/163948 utterances processed
110000/163948 utterances processed
120000/163948 utterances processed
130000/163948 utterances processed
140000/163948 utterances processed
150000/163948 utterances processed
160000/163948 utterances processed
10000/163948 utterances processed
20000/163948 utterances processed
30000/163948 utterances processed
40000/163948 utterances processed
50000/163948 utterances processed
60000/163948 utterances processed
70000/163948 utterances processed
80000/163948 utterances processed
90000/163948 utterances processed
100000/163948 utterances processed
110000/163948 utterances processed
120000/163948 utterances processed
130000/163948 utterances processed
140

In [77]:
noun_pm_model = PhrasingMotifs('question_motifs_full', 'question_arcs_full', min_support=50,
                               fit_filter=question_filter, max_naive_itemset_size=4,
                               verbosity=VERBOSITY)

In [78]:
noun_pm_model.fit(corpus)

counting frequent itemsets for 81911 sets
	first pass: counting itemsets up to and including 4 items large
	first pass: 10000/81911 sets processed
	first pass: 20000/81911 sets processed
	first pass: 30000/81911 sets processed
	first pass: 40000/81911 sets processed
	first pass: 50000/81911 sets processed
	first pass: 60000/81911 sets processed
	first pass: 70000/81911 sets processed
	first pass: 80000/81911 sets processed
	second pass: counting itemsets more than 4 items large
	second pass: checking 29918 sets for itemsets of length 5
	second pass: checked 10000/29918 sets for itemsets of length 5
	second pass: checked 20000/29918 sets for itemsets of length 5
	second pass: checking 21232 sets for itemsets of length 6
	second pass: checked 10000/21232 sets for itemsets of length 6
	second pass: checked 20000/21232 sets for itemsets of length 6
	second pass: checking 9206 sets for itemsets of length 7
	second pass: checking 1570 sets for itemsets of length 8
making itemset tree for 895

In [79]:
noun_pm_model.print_top_phrasings(50)

('*',) 81911
('what>*',) 12573
('is_*',) 10333
('how>*',) 9427
('do>*',) 8863
('do>you',) 8605
('do>*', 'do>you') 8605
('think_*',) 7158
('think_you',) 7000
('think_*', 'think_you') 7000
('think_do',) 6179
('think_*', 'think_do') 6179
('think_do', 'think_you') 6164
('think_*', 'think_do', 'think_you') 6164
('is>*',) 5897
('feel_*',) 5780
('was_*',) 5469
('is>*', 'is_*') 5325
('feel_you',) 4928
('feel_*', 'feel_you') 4928
('did>*',) 4756
('are_*',) 4038
('did>you',) 3539
('did>*', 'did>you') 3539
('is_it',) 3456
('is_*', 'is_it') 3456
('feel_do',) 3369
('feel_*', 'feel_do') 3369
('feel_do', 'feel_you') 3355
('feel_*', 'feel_do', 'feel_you') 3355
('are>*',) 2934
('do>*', 'think_*') 2816
('do>*', 'think_do') 2799
('do>*', 'think_*', 'think_do') 2799
('do>*', 'think_you') 2798
('do>*', 'think_*', 'think_you') 2798
('do>*', 'think_do', 'think_you') 2794
('do>*', 'think_*', 'think_do', 'think_you') 2794
('do>you', 'think_*') 2784
('do>*', 'do>you', 'think_*') 2784
('do>you', 'think_you') 276

In [80]:
corpus = noun_pm_model.transform(corpus)

10000/163948 utterances processed
20000/163948 utterances processed
30000/163948 utterances processed
40000/163948 utterances processed
50000/163948 utterances processed
60000/163948 utterances processed
70000/163948 utterances processed
80000/163948 utterances processed
90000/163948 utterances processed
100000/163948 utterances processed
110000/163948 utterances processed
120000/163948 utterances processed
130000/163948 utterances processed
140000/163948 utterances processed
150000/163948 utterances processed
160000/163948 utterances processed


In [81]:
utt.get_info('question_motifs_full')

['feel_*__feel_do__how>* feel_*__feel_you feel_*__feel_you__how>* feel_*__how>* how>* how>*__how>do',
 'feel_* feel_*__feel_does feel_*__feel_does__feel_how feel_*__feel_how feel_*__feel_how__feel_now feel_*__feel_now feel_*__feel_now__how>* feel_*__how>* how>* how>*__how>does']