This notebook demos `PromptTypeWrapper`, a transformer that produces abstract representations of an utterance in terms of its phrasing and its rhetorical intent. 

The transformer, with some minor modifications, implements the methodology detailed in the [paper](http://www.cs.cornell.edu/~cristian/Asking_too_much.html), 

```
Asking Too Much? The Rhetorical Role of Questions in Political Discourse 
Justine Zhang, Arthur Spirling, Cristian Danescu-Niculescu-Mizil
Proceedings of EMNLP 2017
```

and by default analyzes _questions_ and their responses (though this can be modified on initialization). 

Under the surface, the transformer implements two key modules, `PhrasingMotifs` and `PromptTypes`, as well as a suite of preprocessing steps. For a more detailed description of each of these steps, and examples of calling the component modules separately, see demo notebook TODO LINK.

First we load the corpus. We will examine a dataset of questions from question periods that take place in the British House of Commons (also detailed in the paper). 

In [1]:
from convokit import Corpus
from convokit.prompt_types import PromptTypeWrapper

In [4]:
ROOT_DIR = '/kitchen/clean-corpora/parliament-corpus/'

For expedience, we load pre-computed dependency parses, which should come with the data release (see TODO LINK for a demonstration of how to get these parses for yourself).

In [5]:
corpus = Corpus(ROOT_DIR)
corpus.load_info('utterance',['parsed'])

In [6]:
VERBOSITY = 10000

Inspecting an example utterance:

In [7]:
test_utt_id = '1997-01-27a.4.0'
utt = corpus.get_utterance(test_utt_id)

In [8]:
utt.text

"Does my right hon Friend agree that last week 's statement about a replacement royal yacht has been widely welcomed ? Does he agree also that , ideally , Britannia should become the centrepiece of the millennium project in Portsmouth harbour , spanning Gosport and Portsmouth ? I am sure that that idea would prove very popular . As to plans for a new yacht , does my right hon Friend share my distaste for the Opposition 's tactics ? They had every opportunity to express their grudging and negative attitude during the past two years when the project was under discussion ."

Initializing a `PromptTypeWrapper` model that will infer 8 types of questions (see docstring for other arguments):

In [9]:
pt = PromptTypeWrapper(n_types=8, random_state=1000)

In [10]:
pt.fit(corpus)

10000/433787 utterances processed
20000/433787 utterances processed
30000/433787 utterances processed
40000/433787 utterances processed
50000/433787 utterances processed
60000/433787 utterances processed
70000/433787 utterances processed
80000/433787 utterances processed
90000/433787 utterances processed
100000/433787 utterances processed
110000/433787 utterances processed
120000/433787 utterances processed
130000/433787 utterances processed
140000/433787 utterances processed
150000/433787 utterances processed
160000/433787 utterances processed
170000/433787 utterances processed
180000/433787 utterances processed
190000/433787 utterances processed
200000/433787 utterances processed
210000/433787 utterances processed
220000/433787 utterances processed
230000/433787 utterances processed
240000/433787 utterances processed
250000/433787 utterances processed
260000/433787 utterances processed
270000/433787 utterances processed
280000/433787 utterances processed
290000/433787 utterances proc

	counting itemset cooccurrences for 90000/318345 collections
	counting itemset cooccurrences for 100000/318345 collections
	counting itemset cooccurrences for 110000/318345 collections
	counting itemset cooccurrences for 120000/318345 collections
	counting itemset cooccurrences for 130000/318345 collections
	counting itemset cooccurrences for 140000/318345 collections
	counting itemset cooccurrences for 150000/318345 collections
	counting itemset cooccurrences for 160000/318345 collections
	counting itemset cooccurrences for 170000/318345 collections
	counting itemset cooccurrences for 180000/318345 collections
	counting itemset cooccurrences for 190000/318345 collections
	counting itemset cooccurrences for 200000/318345 collections
	counting itemset cooccurrences for 210000/318345 collections
	counting itemset cooccurrences for 220000/318345 collections
	counting itemset cooccurrences for 230000/318345 collections
	counting itemset cooccurrences for 240000/318345 collections
	counting

Displaying the inferred types. Note that this should produce the same output as calling the component transformers separately, as detailed in this notebook TODO LINK:

In [11]:
for i in range(8):
    print(i)
    pt.display_type(i,  k=15)
    print('\n\n')

0
top prompt:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
made_*,0.617322,1.114613,1.25532,1.077988,1.115492,1.263859,1.081724,1.076337,0.0
made_*__made_in,0.66089,1.095649,1.298649,1.096428,1.10215,1.168903,1.113446,1.046065,0.0
made_*__made_to,0.671795,1.226338,1.22112,1.103266,1.175426,1.386013,1.211334,1.137368,0.0
made_*__made_been,0.673816,1.122863,1.279368,1.140499,1.17472,1.262665,1.173846,1.110745,0.0
in>*__tell_*,0.675825,1.177443,0.993464,0.845159,1.261066,1.330201,0.98083,1.112974,0.0
made_*__made_what,0.677538,1.119886,1.356694,1.105855,0.952826,1.247374,1.204205,1.134827,0.0
made_*__what>*,0.689943,1.144744,1.361479,1.14902,0.989999,1.24575,1.222378,1.134689,0.0
made_*__made_been__made_what,0.692419,1.102368,1.341362,1.149741,1.046976,1.222434,1.223794,1.127061,0.0
happen_*__happen_will,0.696431,1.201901,1.109235,0.865723,1.115287,1.236088,1.052395,1.155658,0.0
made_*__made_has,0.698737,1.121069,1.307046,1.132155,1.104679,1.305834,1.219281,1.176971,0.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
am_at,0.735826,0.929084,1.207077,1.02188,1.233163,1.220739,1.06773,1.056635,0.0
known_*,0.740341,1.225947,1.118957,1.03959,1.239562,1.274925,1.072987,1.153537,0.0
place_*,0.781089,1.0461,1.162646,1.036335,1.122966,1.157035,1.09392,1.046621,0.0
can_*,0.78558,1.102952,1.203068,0.991884,0.976701,1.084446,1.06726,1.06194,0.0
assure_have,0.791607,0.941692,1.2996,1.075591,0.992283,0.971913,1.023242,0.914936,0.0
was_made,0.793669,1.190258,0.974625,0.907933,1.198473,1.178469,0.93595,1.151902,0.0
give_can,0.796445,1.035412,1.152593,0.875637,1.130717,1.228405,1.071766,1.069094,0.0
make_shall,0.798254,0.875871,1.177812,0.902875,1.12698,0.959178,0.824072,0.900683,0.0
write_shall,0.806027,1.158947,1.14104,1.088976,1.183973,1.271158,1.202233,1.168124,0.0
write_with,0.806967,1.158069,1.139075,1.083898,1.179597,1.27314,1.203698,1.170615,0.0





1
top prompt:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
agree_*__agree_will__will>*,1.08642,0.495553,1.274495,0.993667,1.086782,0.936458,0.982201,0.84615,1.0
agree_*__agree_will,1.059058,0.49891,1.266501,0.955892,1.088039,0.951211,0.950979,0.847188,1.0
agree_*__will>*,1.108804,0.51298,1.26179,1.020178,1.101481,0.879219,0.964296,0.846161,1.0
meet_*,1.129959,0.542662,1.253482,0.925841,1.034177,1.012588,1.00743,0.876131,1.0
agree_*__agree_meet__will>*,1.147648,0.552666,1.306894,1.063198,1.088222,1.07946,1.121671,0.985399,1.0
agree_*__agree_meet,1.117711,0.55342,1.304049,1.041603,1.109495,1.037333,1.053328,0.925631,1.0
undertake_*,1.008312,0.570473,1.263419,1.00474,1.056928,1.025175,1.009377,0.798828,1.0
meet_*__meet_will,1.143446,0.575972,1.23586,0.91514,1.044512,0.988814,0.995963,0.876696,1.0
raise_*__raise_will,1.038031,0.579395,1.306411,1.093962,1.094611,0.997513,1.028786,0.883628,1.0
press_*__press_may,1.123324,0.583806,1.23685,0.89237,1.124593,1.114285,0.990294,0.883688,1.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
am_always,1.15777,0.581956,1.215256,1.003618,1.203998,0.967212,0.958127,0.789691,1.0
am_aware,0.928112,0.613138,1.262266,1.004899,1.170358,1.134948,1.00871,0.794206,1.0
was_aware,1.096543,0.638019,1.213199,1.101653,1.266208,1.165171,1.100288,0.99147,1.0
want_obviously,1.205883,0.641081,1.267342,1.044674,1.078098,1.006022,1.096864,0.79021,1.0
know_been,1.051636,0.647597,1.222039,1.022548,1.106331,0.879952,0.992274,0.888401,1.0
know_takes,1.081533,0.652997,1.30053,1.087232,0.952602,0.847808,1.041013,0.852032,1.0
get_back,1.176566,0.670771,1.248852,1.126007,1.203506,1.148403,1.198766,1.084509,1.0
suspect_is,1.091446,0.675325,1.094886,0.875482,1.209965,1.053102,0.945532,1.017995,1.0
am_interested,1.167113,0.677795,1.116448,0.948298,1.261779,1.211668,1.00521,1.010175,1.0
be_happy,0.992268,0.68886,1.17699,0.869903,1.110672,0.852875,0.783603,0.787082,1.0





2
top prompt:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
admit_*,1.206418,1.296074,0.572506,0.939985,1.397106,1.270348,0.955147,1.219785,2.0
why>*,1.115702,1.312999,0.582089,0.85822,1.38659,1.28456,0.915984,1.23009,2.0
admit_*__will>*,1.237791,1.316191,0.582742,0.985069,1.35979,1.245596,1.028147,1.264033,2.0
is>*__is_*__is_true,1.17647,1.325565,0.58734,1.017903,1.480818,1.1426,0.866422,1.106685,2.0
explain_*,1.085886,1.258692,0.588059,0.838155,1.366495,1.271532,0.923838,1.209288,2.0
explain_*__explain_will,1.088246,1.301927,0.591164,0.921589,1.375481,1.189579,0.886641,1.188762,2.0
is_*__why>*,1.193965,1.284715,0.606379,0.91827,1.366362,1.175479,0.864822,1.22607,2.0
is_*__is_true,1.176367,1.346108,0.608337,1.032375,1.492904,1.184561,0.916077,1.161737,2.0
does>*__realise_*__realise_does__realise_not,1.164062,1.347022,0.614524,0.958736,1.390545,1.253235,0.934875,1.149033,2.0
admit_*__admit_will__will>*,1.245443,1.312688,0.61638,0.972787,1.349487,1.277325,1.075932,1.295452,2.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
wonder_*,1.185317,1.280951,0.598627,0.862517,1.313988,1.111706,0.882447,1.191012,2.0
failed_*,1.212231,1.332105,0.623728,0.96729,1.340943,1.128558,0.946383,1.264899,2.0
were_*,1.215991,1.371217,0.639946,1.062895,1.352447,1.074506,0.939683,1.208665,2.0
is_wrong,1.176518,1.387354,0.657307,0.968697,1.312063,1.15407,0.96593,1.260657,2.0
instead>*,1.184016,1.264147,0.684,0.842224,1.264422,1.24363,1.01334,1.220351,2.0
was_*,1.183702,1.181505,0.690253,0.927966,1.289817,0.88145,0.74351,1.117129,2.0
were_there,1.218082,1.395072,0.693541,1.0812,1.356462,1.155589,1.012219,1.255169,2.0
remind_*,1.158828,1.090797,0.697386,0.903524,1.360224,1.070433,0.727951,0.965714,2.0
talks_*,1.246355,1.231195,0.698599,0.897893,1.340013,1.264851,1.045402,1.242012,2.0
talks_about,1.239044,1.226465,0.711066,0.88499,1.34072,1.301715,1.057403,1.244414,2.0





3
top prompt:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
say_*,0.850687,1.083236,0.976233,0.631583,1.19126,1.300184,0.997584,1.11812,3.0
mean_*,1.002455,1.126768,0.864314,0.637877,1.195106,1.176348,0.816977,1.116557,3.0
have_*,0.950143,0.854446,0.996713,0.652224,1.121396,1.1064,0.816675,0.844928,3.0
mean_*__mean_does,0.964144,1.153265,0.871595,0.674824,1.235184,1.222633,0.852861,1.146697,3.0
given>*,1.021614,0.820715,1.152178,0.678697,0.945995,1.157331,1.031124,0.998046,3.0
explain_*__explain_can__explain_is,1.101147,1.093955,0.82759,0.687998,1.186098,1.208066,0.88637,1.14802,3.0
said_*,1.073083,0.867322,1.036573,0.703806,1.095813,1.209321,0.948335,1.078462,3.0
have_*__have_for__have_what,1.044837,0.959032,1.108911,0.707528,1.135617,1.29228,1.086578,1.144981,3.0
prepared_*,0.95862,1.040767,0.942323,0.708201,1.128748,1.260626,0.929676,1.124596,3.0
given>*__tell_*__tell_given,1.050557,1.151014,1.038956,0.709623,1.089149,1.437138,1.235422,1.308254,3.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
said_in,1.080238,1.113407,0.880794,0.643461,1.22226,1.235136,0.917431,1.162579,3.0
said_to,1.052248,1.108776,0.963397,0.649056,1.196879,1.271119,0.999531,1.156603,3.0
said_as,1.092619,1.05457,0.985477,0.671838,1.192027,1.202552,0.958364,1.141278,3.0
secondly>*,1.167214,1.166665,0.831006,0.675069,1.14889,1.206103,1.004402,1.229828,3.0
first>*,1.176276,1.091921,0.917366,0.680748,1.110759,1.224855,1.054866,1.221327,3.0
said_was,1.088524,1.143351,0.857047,0.686672,1.235714,1.222907,0.910354,1.162873,3.0
said_*,1.069248,1.128748,0.881324,0.686962,1.254928,1.190386,0.869128,1.150134,3.0
on>*,0.910719,1.057888,0.883892,0.687215,1.24293,1.089667,0.75194,0.99603,3.0
expect_do,0.966721,1.005263,0.966996,0.689461,1.297343,1.231216,0.859171,1.0156,3.0
is_say,1.07297,0.99548,0.933367,0.69564,1.137451,1.060478,0.88021,1.079003,3.0





4
top prompt:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
doing_*__what>*,1.19052,1.170917,1.306277,1.159109,0.489782,1.163157,1.381528,1.286072,4.0
doing_*,1.197745,1.170508,1.281262,1.150488,0.505397,1.147638,1.346396,1.272227,4.0
taking_*__taking_is__what>*,1.130973,1.184592,1.34254,1.172115,0.508758,1.163605,1.407868,1.246767,4.0
doing_*__doing_is__what>*,1.197377,1.191344,1.312856,1.182806,0.529712,1.214734,1.420265,1.312049,4.0
take_*__take_what,1.161917,1.001976,1.343445,1.153361,0.534803,1.092679,1.296199,1.190162,4.0
taking_*__taking_are,1.091065,1.213732,1.36481,1.182392,0.535213,1.194818,1.402069,1.239454,4.0
taking_*,1.137747,1.2256,1.351025,1.198333,0.536375,1.192687,1.42015,1.258012,4.0
will>*__work_*__work_with,1.058973,0.960279,1.382445,1.148531,0.537547,1.015651,1.253309,1.120588,4.0
taking_*__what>*,1.0885,1.218316,1.368173,1.186242,0.540326,1.199291,1.414438,1.255039,4.0
doing_*__doing_is,1.206442,1.211483,1.28872,1.173522,0.54267,1.210846,1.397774,1.301012,4.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
through>*,1.173402,1.236664,1.337521,1.241417,0.640292,1.061877,1.373256,1.284117,4.0
is_working,1.119196,1.145401,1.280114,1.1487,0.645109,0.976848,1.213952,1.210242,4.0
ensuring_is,1.199215,1.08264,1.248994,1.137753,0.64925,0.959081,1.229484,1.118896,4.0
supporting_are,1.217661,1.240021,1.242036,1.170911,0.652102,1.184466,1.366115,1.295775,4.0
working_on,1.135818,1.218759,1.30314,1.203003,0.663693,1.310517,1.381095,1.280683,4.0
supporting_*,1.222778,1.25293,1.25403,1.1907,0.666032,1.201988,1.39002,1.322857,4.0
ensuring_*,1.178767,1.116399,1.228789,1.094961,0.667111,0.930856,1.18643,1.159537,4.0
working_are,1.137959,1.223427,1.296334,1.199166,0.669878,1.314324,1.379198,1.282322,4.0
working_with,1.13795,1.221931,1.297881,1.199787,0.672097,1.316745,1.380642,1.281343,4.0
working_*,1.139951,1.223877,1.296162,1.200937,0.672468,1.314581,1.380422,1.282824,4.0





5
top prompt:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
agree_*__agree_is,1.190007,1.063678,1.130037,1.229631,1.120525,0.389741,0.868799,0.966774,5.0
agree_*__agree_be__does>*,1.146346,1.020134,1.141136,1.201449,1.132941,0.396055,0.800297,0.897559,5.0
agree_*__agree_be,1.141823,1.022926,1.140163,1.191639,1.138213,0.396684,0.790162,0.890782,5.0
agree_*__agree_is__does>*,1.187284,1.068057,1.132884,1.23529,1.11464,0.397564,0.876472,0.968414,5.0
agree_*__agree_have,1.160925,1.060035,1.15429,1.228834,1.128729,0.437026,0.867084,0.967709,5.0
agree_*__agree_are,1.199042,1.094654,1.126822,1.249057,1.161698,0.444698,0.868741,0.951413,5.0
agree_*__agree_does__agree_have__does>*,1.146106,1.091873,1.136289,1.216599,1.115092,0.450785,0.852474,0.969716,5.0
agree_*__agree_are__agree_does__does>*,1.200686,1.101353,1.124174,1.25332,1.158171,0.456768,0.878216,0.95574,5.0
agree_*__agree_also,1.185051,1.13607,1.117248,1.261808,1.174689,0.466548,0.904168,1.037751,5.0
continue_*__will>*,1.156545,1.037004,1.200064,1.197731,0.987681,0.472096,0.96677,0.982168,5.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
agree_certainly,1.179935,1.072132,1.193617,1.288381,1.101887,0.461667,0.959144,0.983017,5.0
agree_however,1.170462,1.077594,1.185802,1.281559,1.11632,0.468359,0.939135,1.006986,5.0
agree_is,1.173939,1.072148,1.197891,1.286641,1.09541,0.468887,0.956498,0.996706,5.0
agree_will,1.181074,1.043084,1.202556,1.287871,1.122069,0.476162,0.951444,1.012685,5.0
agree_absolutely,1.194568,1.063192,1.211595,1.289561,1.068197,0.477591,0.993222,0.993978,5.0
agree_also,1.194003,1.079655,1.187382,1.290039,1.097132,0.478019,0.954352,1.003185,5.0
agree_wholeheartedly,1.185445,1.092127,1.183965,1.275667,1.089317,0.479565,0.971246,1.030695,5.0
is_also,1.198891,1.047256,1.096212,1.099916,1.003657,0.480978,0.882228,0.999332,5.0
agree_be,1.163662,1.079722,1.1962,1.283851,1.103804,0.481309,0.949732,1.004706,5.0
is_reduce,1.11324,1.054234,1.080224,1.156094,1.134554,0.482403,0.752575,0.906337,5.0





6
top prompt:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
be_*__be_not,0.989431,0.991246,0.967091,0.972888,1.314618,0.79981,0.522003,0.706736,6.0
accept_*__accept_is,1.091505,1.066082,0.922891,1.043136,1.278793,0.682295,0.524484,0.843935,6.0
be_*,0.912916,0.941977,1.017985,0.86861,1.291235,0.89263,0.527895,0.717645,6.0
accept_*__accept_does__accept_is,1.089736,1.085264,0.924839,1.060385,1.281378,0.682497,0.53315,0.876671,6.0
be_*__be_would,0.984525,0.927095,0.996208,0.954007,1.327648,0.828217,0.534855,0.682225,6.0
accept_*__accept_will,1.118606,1.058722,0.811762,0.945476,1.342034,0.794267,0.539069,0.855157,6.0
accept_*,1.120104,1.088812,0.845708,1.007693,1.327598,0.748857,0.551046,0.861249,6.0
accept_*__accept_is__does>*,1.070845,1.114954,0.920761,1.057603,1.281384,0.704334,0.554492,0.903325,6.0
does>*__recognise_*,1.14123,0.996655,0.969755,1.039976,1.320921,0.811335,0.559488,0.763572,6.0
be_*__be_not__be_would,0.996937,0.981891,0.983219,0.976455,1.317023,0.831099,0.562854,0.680899,6.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
realise_*,1.056625,1.046113,0.848916,0.903176,1.3324,0.898105,0.511501,0.832393,6.0
realise_is,1.08175,1.043998,0.883446,0.960097,1.289095,0.81402,0.527835,0.881297,6.0
therefore>*,1.079974,1.112493,0.775424,0.879703,1.355731,0.89273,0.537383,0.900992,6.0
be_right,0.990255,1.001743,1.008152,0.842092,1.215388,0.88022,0.584703,0.832125,6.0
be_however,1.007021,0.835731,1.074983,0.903659,1.201534,0.743475,0.585211,0.70862,6.0
be_decide,1.011814,1.029786,0.963236,0.878909,1.236785,0.886694,0.589493,0.896205,6.0
believe_however,1.0593,0.954726,1.020656,0.92642,1.207507,0.74282,0.590316,0.79252,6.0
be_might,1.109655,0.898952,0.997058,0.911007,1.173249,0.673955,0.59175,0.778296,6.0
be_would,1.045133,0.88064,1.045993,0.891643,1.178825,0.692003,0.59217,0.765725,6.0
remind_is,1.099299,0.961939,0.909727,0.972829,1.335105,0.952873,0.60058,0.857637,6.0





7
top prompt:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
learned_*__will_*,1.040354,0.909116,1.152059,1.154805,1.29971,0.841406,0.800129,0.543683,7.0
learned_*__will>*,1.03545,0.893771,1.162074,1.151075,1.296721,0.851857,0.821541,0.543967,7.0
bear_*__bear_in__in>*,1.083888,0.95693,1.217765,1.15147,1.223059,0.979592,0.977491,0.55036,7.0
draw_*__will>*,1.023994,0.910458,1.137336,1.057207,1.176448,0.939863,0.871607,0.551262,7.0
draw_*__draw_will,1.041282,0.913071,1.145222,1.072801,1.18168,0.902788,0.874199,0.559115,7.0
convey_*__convey_to,1.087247,0.96248,1.170417,1.086914,1.231473,1.015065,0.989478,0.565904,7.0
will_*,1.002983,0.847088,1.124928,1.101191,1.312992,0.835394,0.721011,0.572541,7.0
convey_*__convey_to__convey_will,1.112289,1.025614,1.149763,1.127112,1.216016,0.986942,0.993735,0.59198,7.0
will>*__will_*,1.002939,0.826671,1.157892,1.125611,1.321378,0.866592,0.762808,0.598355,7.0
does_*__learned_*__learned_accept,1.103301,1.001168,1.067266,1.123401,1.323748,0.872465,0.735771,0.610948,7.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
note_says,1.00226,0.963592,1.113321,1.035899,1.287303,0.980794,0.803377,0.604191,7.0
emphasise_*,1.03999,0.851872,1.190487,1.116548,1.260419,0.919627,0.807381,0.610986,7.0
note_*,1.048791,0.923442,1.065397,1.080076,1.350072,0.978475,0.758877,0.618893,7.0
learned_*,0.995554,0.941518,1.052428,0.99412,1.246673,0.778027,0.6681,0.621992,7.0
is_consider,0.981082,0.827353,1.229512,1.014184,1.14296,0.96423,0.923627,0.635198,7.0
be_important,1.087818,0.86826,1.168884,1.051799,1.141418,0.85798,0.824221,0.635525,7.0
are_always,1.076061,0.922007,1.159046,1.144859,1.267237,1.026292,0.901265,0.641044,7.0
convey_*,1.120694,0.941051,1.196546,1.1415,1.273676,1.07705,1.017874,0.64189,7.0
consider_is,0.964139,0.815137,1.196858,1.010101,1.241343,1.014127,0.817837,0.647248,7.0
consider_must,1.036941,0.897222,1.186344,1.11142,1.28173,0.963902,0.84875,0.64923,7.0
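
One way to read these tables: the numeric columns appear to be each phrasing's distance to the centroid of each type, so the phrasings listed under a type should have their smallest value in that type's column. The sketch below checks this on three rows copied from the output above. Treating the values as distances is an assumption drawn from the output itself, and `closest_type` is a hypothetical helper, not part of the library.

```python
# Three prompt rows copied from the tables above (columns 0..7).
rows = {
    'made_*': [0.617322, 1.114613, 1.25532, 1.077988,
               1.115492, 1.263859, 1.081724, 1.076337],
    'why>*': [1.115702, 1.312999, 0.582089, 0.85822,
              1.38659, 1.28456, 0.915984, 1.23009],
    'agree_*__agree_is': [1.190007, 1.063678, 1.130037, 1.229631,
                          1.120525, 0.389741, 0.868799, 0.966774],
}

def closest_type(scores):
    """Hypothetical helper: index of the column with the smallest value."""
    return min(range(len(scores)), key=lambda i: scores[i])

# Assumption: smaller values mean the phrasing is closer to that type.
closest = {phrasing: closest_type(scores) for phrasing, scores in rows.items()}
for phrasing, type_id in closest.items():
    print(phrasing, type_id)
```

Each result can be compared against the `type_id` column of the table the row was taken from.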







Transforming a single utterance. The model will annotate each utterance with a set of representations or features.

In [12]:
utt = pt.transform_utterance(utt)

The phrasing motifs, i.e., a representation of how each sentence in the utterance is phrased:

In [30]:
utt.get_info('motifs')

['agree_* agree_*__does>* does>*',
 'agree_* agree_*__agree_also agree_*__does>* does>*',
 'as>* share_* share_*__share_does']
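
These strings can be unpacked with plain string operations. The sketch below assumes, based only on the output shown here, that each list entry covers one analyzed sentence, that motifs within an entry are separated by spaces, and that a motif joins its constituent arcs with a double underscore. `split_motifs` is a hypothetical helper, not part of the library.

```python
# Hypothetical helper; the separators are inferred from the output above.
def split_motifs(sentence_motifs):
    """Map each motif in one sentence's string to the tuple of arcs it contains."""
    return {motif: tuple(motif.split('__')) for motif in sentence_motifs.split()}

# The motif strings printed for the example utterance.
motifs = ['agree_* agree_*__does>* does>*',
          'agree_* agree_*__agree_also agree_*__does>* does>*',
          'as>* share_* share_*__share_does']

parsed = [split_motifs(s) for s in motifs]
for sentence in parsed:
    print(sentence)
```

Under these assumptions, the second sentence differs from the first only by the extra motif `agree_*__agree_also`, which matches the "Does he agree also that" phrasing in the text.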

A vector representation encapsulating the utterance's rhetorical intent (in short, an embedding of the utterance based on the responses associated with questions containing its constituent phrasings; see the paper for details):

In [13]:
utt.get_info('prompt_types__prompt_repr')

[-0.1697405599956893,
 0.03632750344898307,
 -0.16960577009269515,
 0.13467740850273394,
 -0.3033313505509189,
 -0.017352642005944222,
 -0.21475294465096412,
 -0.13411435880513273,
 0.17912876584377993,
 0.021433472222178843,
 -0.3530617501679388,
 -0.24195307167099495,
 -0.06836710808789839,
 -0.18317690396891484,
 -0.03399566433694401,
 -0.030218990740537546,
 -0.41354193763410346,
 -0.06592637574242852,
 -0.10894426409285005,
 -0.02459589626813662,
 -0.04600034458088039,
 -0.5460362677858065,
 0.13112554982604735,
 0.07216726420879413]

Distances between that vector and the centroid of each inferred cluster:

In [14]:
utt.get_info('prompt_types__prompt_dists.8')

[1.1457395346487957,
 0.9504055397170942,
 1.1230208124602417,
 1.1222589550014062,
 1.1162666884161712,
 0.38896782295615245,
 0.7627830151500072,
 0.8520923241206688]

The particular type of question, and how close it is to the centroid of that particular cluster:

In [15]:
utt.get_info('prompt_types__prompt_type.8')

5.0

In [16]:
utt.get_info('prompt_types__prompt_type_dist.8')

0.38896782295615245
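
The two values above are consistent with picking the nearest centroid: the assigned type is the position of the smallest entry in the distance list, and the reported distance is that entry. A minimal sketch of that check, using the distances printed earlier (the variable names are illustrative, and this is not the library's internal code):

```python
# Distances from the example utterance to each of the 8 cluster centroids,
# copied from the output of 'prompt_types__prompt_dists.8' above.
dists = [1.1457395346487957,
         0.9504055397170942,
         1.1230208124602417,
         1.1222589550014062,
         1.1162666884161712,
         0.38896782295615245,
         0.7627830151500072,
         0.8520923241206688]

# Nearest-centroid assignment: index and value of the smallest distance.
best_type = min(range(len(dists)), key=dists.__getitem__)
best_dist = dists[best_type]
print(best_type, best_dist)
```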

Transforming the entire corpus:

In [17]:
corpus = pt.transform(corpus)

10000/433787 utterances processed
20000/433787 utterances processed
30000/433787 utterances processed
40000/433787 utterances processed
50000/433787 utterances processed
60000/433787 utterances processed
70000/433787 utterances processed
80000/433787 utterances processed
90000/433787 utterances processed
100000/433787 utterances processed
110000/433787 utterances processed
120000/433787 utterances processed
130000/433787 utterances processed
140000/433787 utterances processed
150000/433787 utterances processed
160000/433787 utterances processed
170000/433787 utterances processed
180000/433787 utterances processed
190000/433787 utterances processed
200000/433787 utterances processed
210000/433787 utterances processed
220000/433787 utterances processed
230000/433787 utterances processed
240000/433787 utterances processed
250000/433787 utterances processed
260000/433787 utterances processed
270000/433787 utterances processed
280000/433787 utterances processed
290000/433787 utterances proc

Other examples:

In [18]:
utt1 = corpus.get_utterance('1987-03-04a.857.5')

In [33]:
utt1.get_info('motifs')

['stop_* stop_*__stop_will stop_*__stop_will__will>* stop_*__will>* will>*',
 'admit_* admit_*__admit_will admit_*__admit_will__will>* admit_*__will>* will>*',
 'does>* does>*__does>not does>*__understand_* understand_* understand_*__understand_does']

In [19]:
utt1.text

'Will the Secretary of State stop giving us what is called in the pop record industry a remix of alibis , excuses and gimmicks ? Will he admit that the number of homes built to rent last year by local authorities was the lowest in 62 years , that the housing investment programme net of capital receipts was the lowest in real terms since HIPs were invented and that , even during the past three years the number of repair and improvement grants , which would bring some private homes back into use , have dropped by 100,000 ? Does not the right hon Gentleman understand that , if the private owner and the local authority are starved of resources , we are left with lengthy queues , homelessness and all the other scandals of poor housing that exist today ?'

In [20]:
utt1.get_info('prompt_types__prompt_type.8')

2.0

We can also try out the model on arbitrary input. For instance, we see that the following question is also of type 5 -- that is, similar to other questions which voice agreement or support.

In [36]:
str_utt = pt.transform_utterance('Do you share my distaste for cockroaches?')

In [38]:
str_utt.get_info('motifs')

['do>* share_*']

In [37]:
str_utt.get_info('prompt_types__prompt_type.8')

5.0

Serializing the model. This dumps both the underlying `PhrasingMotifs` and `PromptTypes` models to disk:

In [21]:
import os

In [23]:
pt.dump_models(os.path.join(ROOT_DIR, 'full_pipe_models'))

writing itemset counts
writing downlinks
writing itemset to ids
writing meta information
dumping embedding model
dumping training embeddings
dumping type model 8


The entire pipeline can later be loaded back from disk and used to transform new data:

In [26]:
new_pt = PromptTypeWrapper(output_field='prompt_types_new',
                           min_support=100, svd__n_components=25, random_state=1000)

In [27]:
new_pt.load_models(os.path.join(ROOT_DIR, 'full_pipe_models'))

reading itemset counts
reading downlinks
reading itemset to ids
reading meta information
loading embedding model
loading training embeddings
loading type model 8


In [29]:
pt_model_dir = os.path.join(ROOT_DIR, 'full_pipe_models')
!ls $pt_model_dir

pm_model  pt_model


In [39]:
new_str_utt = new_pt.transform_utterance('Do you share my distaste for cockroaches?')

In [40]:
new_str_utt.get_info('motifs')

['do>* share_*']

In [41]:
new_str_utt.get_info('prompt_types__prompt_type.8')