This notebook demos `PromptTypeWrapper`, a transformer that produces abstract representations of an utterance in terms of its phrasing and its rhetorical intent. 

The transformer, with some minor modifications, implements the methodology detailed in the [paper](http://www.cs.cornell.edu/~cristian/Asking_too_much.html), 

```
Asking Too Much? The Rhetorical Role of Questions in Political Discourse 
Justine Zhang, Arthur Spirling, Cristian Danescu-Niculescu-Mizil
Proceedings of EMNLP 2017
```

and by default analyzes _questions_ and their responses (though this can be modified on initialization). 

Under the surface, the transformer implements two key modules, `PhrasingMotifs` and `PromptTypes`, as well as a suite of preprocessing steps. For a more detailed description of each of these steps, and examples of calling the component modules separately, see [this notebook](https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/master/examples/prompt-types/prompt-type-demo.ipynb).

First we load the corpus. We will examine a dataset of questions from question periods that take place in the British House of Commons (also detailed in the paper). 

In [1]:
from convokit import Corpus
from convokit import download
from convokit.prompt_types import PromptTypeWrapper

In [2]:
import warnings
warnings.filterwarnings('ignore')

For expedience, we load pre-computed dependency parses, which should come with the data release (see [this notebook](https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/master/examples/text-processing/text_preprocessing_demo.ipynb) for a demonstration of how to get these parses for yourself).

In [4]:
# OPTION 1: DOWNLOAD CORPUS 
# UNCOMMENT THESE LINES TO DOWNLOAD CORPUS
# DATA_DIR = '<YOUR DIRECTORY>'
# ROOT_DIR = download('parliament-corpus', data_dir=DATA_DIR)

# OPTION 2: READ PREVIOUSLY-DOWNLOADED CORPUS FROM DISK
# UNCOMMENT THIS LINE AND REPLACE WITH THE DIRECTORY WHERE THE PARLIAMENT-CORPUS IS LOCATED
# ROOT_DIR = '<YOUR DIRECTORY>'

corpus = Corpus(ROOT_DIR)
corpus.load_info('utterance',['parsed'])

In [4]:
VERBOSITY = 10000

Inspecting an example utterance:

In [5]:
test_utt_id = '1997-01-27a.4.0'
utt = corpus.get_utterance(test_utt_id)

In [6]:
utt.text

"Does my right hon Friend agree that last week 's statement about a replacement royal yacht has been widely welcomed ? Does he agree also that , ideally , Britannia should become the centrepiece of the millennium project in Portsmouth harbour , spanning Gosport and Portsmouth ? I am sure that that idea would prove very popular . As to plans for a new yacht , does my right hon Friend share my distaste for the Opposition 's tactics ? They had every opportunity to express their grudging and negative attitude during the past two years when the project was under discussion ."

Initializing a `PromptTypeWrapper` model that will infer 8 types of questions (see the docstring for other arguments):

In [7]:
pt = PromptTypeWrapper(n_types=8, random_state=1000)

In [8]:
pt.fit(corpus)

10000/433787 utterances processed
20000/433787 utterances processed
30000/433787 utterances processed
40000/433787 utterances processed
50000/433787 utterances processed
60000/433787 utterances processed
70000/433787 utterances processed
80000/433787 utterances processed
90000/433787 utterances processed
100000/433787 utterances processed
110000/433787 utterances processed
120000/433787 utterances processed
130000/433787 utterances processed
140000/433787 utterances processed
150000/433787 utterances processed
160000/433787 utterances processed
170000/433787 utterances processed
180000/433787 utterances processed
190000/433787 utterances processed
200000/433787 utterances processed
210000/433787 utterances processed
220000/433787 utterances processed
230000/433787 utterances processed
240000/433787 utterances processed
250000/433787 utterances processed
260000/433787 utterances processed
270000/433787 utterances processed
280000/433787 utterances processed
290000/433787 utterances proc

	counting itemset cooccurrences for 70000/318345 collections
	counting itemset cooccurrences for 80000/318345 collections
	counting itemset cooccurrences for 90000/318345 collections
	counting itemset cooccurrences for 100000/318345 collections
	counting itemset cooccurrences for 110000/318345 collections
	counting itemset cooccurrences for 120000/318345 collections
	counting itemset cooccurrences for 130000/318345 collections
	counting itemset cooccurrences for 140000/318345 collections
	counting itemset cooccurrences for 150000/318345 collections
	counting itemset cooccurrences for 160000/318345 collections
	counting itemset cooccurrences for 170000/318345 collections
	counting itemset cooccurrences for 180000/318345 collections
	counting itemset cooccurrences for 190000/318345 collections
	counting itemset cooccurrences for 200000/318345 collections
	counting itemset cooccurrences for 210000/318345 collections
	counting itemset cooccurrences for 220000/318345 collections
	counting i

Output. Note that this should produce the same output as calling the component transformers separately, as detailed in [this notebook](https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/master/examples/prompt-types/prompt-type-demo.ipynb):

In [9]:
pt.summarize(corpus=corpus, k=15)

TYPE 0
top prompt:
                                     0         1         2         3  \
made_*                        0.642670  1.260725  1.112202  1.119975   
made_*__made_in               0.686119  1.166733  1.092458  1.101651   
in>*__tell_*                  0.686683  1.330053  1.175609  1.265570   
made_*__made_to               0.697633  1.386197  1.226340  1.180139   
made_*__made_what             0.698968  1.247663  1.124691  0.959685   
happen_*__happen_will         0.701071  1.231780  1.202306  1.119589   
made_*__made_been             0.709813  1.263380  1.122509  1.178115   
made_*__what>*                0.716440  1.247333  1.148910  0.997612   
give_*__give_on               0.720038  1.212984  1.041744  1.057502   
include_*                     0.720358  1.198579  1.072563  1.225554   
made_*__made_been__made_what  0.722511  1.225216  1.105467  1.051967   
made_*__made_has              0.725824  1.304950  1.122941  1.108813   
give_*                        0.728294  1.168

2011-01-31b.574.6 Will the hon Gentleman consider looking at this offset very seriously ? Other countries use offset to great benefit , some using it to stimulate investment in environmental technologies . I know that the Government are consulting , as he says , so will he meet a group who have been discussing the issue and some of the industry leaders to discuss it further ?
['consider_*__will>*', 'meet_*__meet_will']

2015-10-28a.346.1 The Prime Minister will remember meeting my constituents , Neil Shepherd and Sharon Wood . Nine years ago this week , Neil took their two children , Christi aged 7 and Bobby aged 6 , on holiday to Corfu . The children tragically died of carbon monoxide poisoning . The family ’s dearest wish is that no other family suffers the heartbreak and tragedy they endured . Tomorrow in the European Parliament there will be a vote on the recommendation that the Commission brings forward legislation to improve carbon monoxide safety and fire safety for tourism prem

Transforming a single utterance. The model will annotate each utterance with a set of representations or features.

In [10]:
utt = pt.transform_utterance(utt)

The phrasing motifs, i.e., a representation of how each sentence in the utterance is phrased:

In [11]:
utt.retrieve_meta('motifs')

['agree_* agree_*__does>* does>*',
 'agree_* agree_*__agree_also agree_*__does>* does>*',
 'as>* share_* share_*__share_does']
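As a rough sketch of how this output might be inspected, the snippet below splits each motif string into a set of individual motifs. It assumes, based only on the printed output above, that there is one string per analyzed sentence (here, the three questions) and that motifs within a string are separated by single spaces. The names `motif_sets` and `shared` are illustrative and not part of ConvoKit.

```python
# Motif strings copied from the output above.
motifs = [
    'agree_* agree_*__does>* does>*',
    'agree_* agree_*__agree_also agree_*__does>* does>*',
    'as>* share_* share_*__share_does',
]

# Assumption: motifs within one string are separated by whitespace.
motif_sets = [set(sentence.split()) for sentence in motifs]

# Motifs that the first two questions have in common.
shared = motif_sets[0] & motif_sets[1]
print(sorted(shared))
```

The first two questions share the `agree_*` and `does>*` phrasings, which matches the repeated "Does ... agree" construction in the text.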

A vector representation encapsulating the utterance's rhetorical intent (in short, an embedding of the utterance based on the responses associated with questions containing its constituent phrasings; see the paper for details):

In [12]:
utt.retrieve_meta('prompt_types__prompt_repr')

[-0.17103395551495287,
 0.030694092789899603,
 -0.14371185586935595,
 0.10998245525877463,
 -0.31508472326375,
 -0.03187113204172867,
 -0.22291774431496747,
 -0.1278562931647348,
 0.17717804384550123,
 0.02097518862685271,
 -0.3543799065246014,
 -0.23905016478526944,
 -0.0635970446676691,
 -0.19447723846509896,
 -0.05206238289580816,
 -0.033106993095678466,
 -0.4151244327411294,
 -0.060491493289427684,
 -0.11375878457482796,
 -0.017597837784700098,
 -0.046578984088077695,
 -0.5431360277316315,
 0.12980649779704173,
 -0.08504893017823376]
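A quick sanity check on the vector printed above: it has 24 components and its Euclidean norm comes out very close to 1, which suggests the representation is L2-normalized. This is an observation about this particular output, not a documented guarantee of the library.

```python
import math

# Values copied from the output above.
prompt_repr = [
    -0.17103395551495287, 0.030694092789899603, -0.14371185586935595,
    0.10998245525877463, -0.31508472326375, -0.03187113204172867,
    -0.22291774431496747, -0.1278562931647348, 0.17717804384550123,
    0.02097518862685271, -0.3543799065246014, -0.23905016478526944,
    -0.0635970446676691, -0.19447723846509896, -0.05206238289580816,
    -0.033106993095678466, -0.4151244327411294, -0.060491493289427684,
    -0.11375878457482796, -0.017597837784700098, -0.046578984088077695,
    -0.5431360277316315, 0.12980649779704173, -0.08504893017823376,
]

# Euclidean (L2) norm of the representation.
norm = math.sqrt(sum(x * x for x in prompt_repr))
print(len(prompt_repr), round(norm, 4))
```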

Distances between that vector and the centroid of each inferred cluster:

In [13]:
utt.retrieve_meta('prompt_types__prompt_dists.8')

[1.130855626510634,
 0.39130608715180415,
 0.9490040025393338,
 1.1140869968500255,
 0.7542719064025534,
 1.1279773340447152,
 0.8453197995402353,
 1.1400944717972439]

The particular type of question, and how close it is to the centroid of that particular cluster:

In [14]:
utt.retrieve_meta('prompt_types__prompt_type.8')

1.0

In [15]:
utt.retrieve_meta('prompt_types__prompt_type_dist.8')

0.39130608715180415
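The values printed above are consistent with assigning the utterance to its nearest centroid: the type is the index of the smallest distance, and the type distance is that smallest value. The check below uses only the numbers shown in this notebook. It is a consistency check on the output, not a reproduction of the library's internals.

```python
# Distances to the 8 cluster centroids, copied from the output above.
dists = [
    1.130855626510634,
    0.39130608715180415,
    0.9490040025393338,
    1.1140869968500255,
    0.7542719064025534,
    1.1279773340447152,
    0.8453197995402353,
    1.1400944717972439,
]

# Index of the nearest centroid and the distance to it.
prompt_type = min(range(len(dists)), key=dists.__getitem__)
prompt_type_dist = dists[prompt_type]
print(prompt_type, prompt_type_dist)
```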

Transforming the entire corpus:

In [10]:
corpus = pt.transform(corpus)

10000/433787 utterances processed
20000/433787 utterances processed
30000/433787 utterances processed
40000/433787 utterances processed
50000/433787 utterances processed
60000/433787 utterances processed
70000/433787 utterances processed
80000/433787 utterances processed
90000/433787 utterances processed
100000/433787 utterances processed
110000/433787 utterances processed
120000/433787 utterances processed
130000/433787 utterances processed
140000/433787 utterances processed
150000/433787 utterances processed
160000/433787 utterances processed
170000/433787 utterances processed
180000/433787 utterances processed
190000/433787 utterances processed
200000/433787 utterances processed
210000/433787 utterances processed
220000/433787 utterances processed
230000/433787 utterances processed
240000/433787 utterances processed
250000/433787 utterances processed
260000/433787 utterances processed
270000/433787 utterances processed
280000/433787 utterances processed
290000/433787 utterances proc

Other examples:

In [11]:
utt1 = corpus.get_utterance('1987-03-04a.857.5')

In [18]:
utt1.retrieve_meta('motifs')

['stop_* stop_*__stop_will stop_*__stop_will__will>* stop_*__will>* will>*',
 'admit_* admit_*__admit_will admit_*__admit_will__will>* admit_*__will>* will>*',
 'does>* does>*__does>not does>*__understand_* understand_* understand_*__understand_does']

In [19]:
utt1.text

'Will the Secretary of State stop giving us what is called in the pop record industry a remix of alibis , excuses and gimmicks ? Will he admit that the number of homes built to rent last year by local authorities was the lowest in 62 years , that the housing investment programme net of capital receipts was the lowest in real terms since HIPs were invented and that , even during the past three years the number of repair and improvement grants , which would bring some private homes back into use , have dropped by 100,000 ? Does not the right hon Gentleman understand that , if the private owner and the local authority are starved of resources , we are left with lengthy queues , homelessness and all the other scandals of poor housing that exist today ?'

In [20]:
utt1.retrieve_meta('prompt_types__prompt_type.8')

7.0

We can also try out the model on arbitrary input. For instance, we see that the following question is also of type 1 -- that is, similar to other questions which voice agreement or support.

In [13]:
str_utt = pt.transform_utterance('Do you share my distaste for cockroaches?')

In [18]:
str_utt.retrieve_meta('motifs')

['do>* share_*']

In [14]:
str_utt.retrieve_meta('prompt_types__prompt_type.8')

1.0

Serializing the model. This dumps both the underlying `PhrasingMotifs` and `PromptTypes` models to disk:

In [25]:
import os

In [26]:
pt.dump_model(os.path.join(ROOT_DIR, 'full_pipe_models'))

writing itemset counts
writing downlinks
writing itemset to ids
writing meta information
dumping embedding model
dumping training embeddings
dumping type model 8


The entire pipeline can later be loaded back from disk and used to transform new data:

In [27]:
new_pt = PromptTypeWrapper(output_field='prompt_types_new',
                           min_support=100, svd__n_components=25, random_state=1000)

In [28]:
new_pt.load_model(os.path.join(ROOT_DIR, 'full_pipe_models'))

reading itemset counts
reading downlinks
reading itemset to ids
reading meta information
loading embedding model
loading training embeddings
loading type model 8


In [29]:
pt_model_dir = os.path.join(ROOT_DIR, 'full_pipe_models')
!ls $pt_model_dir

pm_model  pt_model


In [30]:
new_str_utt = new_pt.transform_utterance('Do you share my distaste for cockroaches?')

In [31]:
new_str_utt.retrieve_meta('motifs')

['do>* share_*']

In [32]:
new_str_utt.retrieve_meta('prompt_types_new__prompt_type.8')

1.0
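The metadata keys used throughout this notebook appear to follow a simple pattern: the wrapper's `output_field`, a double underscore, the attribute name, and, for cluster-dependent attributes, a suffix with the number of types. The helper below, `meta_key`, is hypothetical (it is not a ConvoKit function) and only encodes the pattern inferred from the keys shown above.

```python
from typing import Optional


def meta_key(output_field: str, attribute: str, n_types: Optional[int] = None) -> str:
    """Build a metadata key following the naming pattern seen in this notebook.

    Hypothetical helper; the pattern is inferred from the printed keys,
    not taken from the library's documentation.
    """
    key = f"{output_field}__{attribute}"
    # Cluster-dependent attributes carry a ".<n_types>" suffix.
    return key if n_types is None else f"{key}.{n_types}"


print(meta_key('prompt_types', 'prompt_repr'))
print(meta_key('prompt_types', 'prompt_dists', 8))
print(meta_key('prompt_types_new', 'prompt_type', 8))
```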