The paper that developed these methods can be found here: (http://www.cs.cornell.edu/~cristian/Asking_too_much.html).
The plots answer these questions:

This example extracts question types from the Tennis Interviews dataset (released with the Tie-breaker paper http://www.cs.cornell.edu/~liye/tennis.html).

This version uses precomputed motifs for speed.

In [None]:
import os
import pkg_resources

from convokit import Corpus, QuestionTypology, download

Initializing QuestionTypology Class

In [None]:
num_clusters = 8
# Get precomputed motifs. data_dir contains the downloaded data. 

data_dir = os.path.join(pkg_resources.resource_filename("convokit", ""), 'downloads', 'tennis')

# Load the corpus and filter out all non-winning tennis players, so that the only
# question-answer pairs in this model are from reporters to winners.
corpus = Corpus(filename=os.path.join(data_dir, 'tennis-corpus'))
corpus.filter_utterances_by(other_kv_pairs={'result':1})

#Extract clusters of the motifs and assign questions to these clusters
questionTypology = QuestionTypology(corpus, data_dir, dataset_name="tennis", num_dims=25, 
                                    num_clusters=num_clusters, verbose=False, random_seed=125)

`questionTypology.types_to_data` contains the data computed in the step above. Its keys are the cluster indices (here 0-7). Each value is a dictionary with the following keys:<br>
    <br>`"motifs"`: the motifs, as a list of tuples of motif terms
    <br>`"motif_dists"`: the distance of each motif from the centroid of its cluster
    <br>`"fragments"`: the answer fragments, as a list of tuples of answer terms
    <br>`"fragment_dists"`: the distance of each fragment from the centroid of its cluster
    <br>`"questions"`: the IDs of the questions in this cluster. You can get the corresponding question text with the `get_question_text_from_pair_idx(pair_idx)` method.
    <br>`"question_dists"`: the distance of each question from the centroid of its cluster
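
The structure described above can be explored with plain dictionary and list operations. The following is a minimal sketch, not part of ConvoKit: `mock_types_to_data` is a hand-written stand-in with made-up motifs, IDs, and distances, and `closest_items` is a hypothetical helper. It assumes that each item list and its `*_dists` list are parallel, so that the i-th distance belongs to the i-th item.

```python
# Hypothetical stand-in for questionTypology.types_to_data.
# All motifs, fragments, question IDs, and distances below are invented for illustration.
mock_types_to_data = {
    0: {
        "motifs": [("how", "feel"), ("what", "think")],
        "motif_dists": [0.42, 0.17],
        "fragments": [("feel", "good"), ("think", "so")],
        "fragment_dists": [0.30, 0.55],
        "questions": ["q_12", "q_7"],
        "question_dists": [0.61, 0.25],
    },
    1: {
        "motifs": [("can", "talk")],
        "motif_dists": [0.08],
        "fragments": [("played", "well")],
        "fragment_dists": [0.12],
        "questions": ["q_3"],
        "question_dists": [0.40],
    },
}


def closest_items(cluster, items_key, dists_key, k=1):
    """Return the k items nearest the cluster centroid as (item, distance) pairs.

    Assumes cluster[items_key] and cluster[dists_key] are parallel lists.
    """
    pairs = zip(cluster[items_key], cluster[dists_key])
    return sorted(pairs, key=lambda pair: pair[1])[:k]


# Print the single most central motif for each cluster, in cluster-index order.
for type_idx, cluster in sorted(mock_types_to_data.items()):
    print(type_idx, closest_items(cluster, "motifs", "motif_dists"))
```

With the real object, the same helper could be applied to `questionTypology.types_to_data[i]` for any cluster index `i`, provided the parallel-list assumption holds.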

Display Outputs

In [None]:
questionTypology.display_totals()
print(f'10 examples for each of types 0-{num_clusters - 1}:')
for i in range(num_clusters):
    questionTypology.display_motifs_for_type(i, num_egs=10)
    questionTypology.display_answer_fragments_for_type(i, num_egs=10)
    questionTypology.display_questions_for_type(i, num_egs=10)
