The purpose of this notebook is to demo some ideas I had about storing spacy parses as text, and more broadly about storing processed versions of utterance text that can be accessed in a fast, resource-friendly way.

The basic problem is that often we substantively preprocess the text of an utterance -- for instance, by removing words, or by performing some sort of transformation such as extracting bigrams. These aren't quite like computing highly-compressed features as with Transformers -- often you are storing text that's equal to or even greater in size than the original; while certain Transformers might require that such preprocessed versions be loaded, not all pipelines will require it. A good example is dependency parses, though any sort of text cleaning procedure that you want to precompute also fits under this category. In the case of the QuestionTypology module for instance, we might explore different ways to preprocess the text, like extracting dependency parse arcs.

My solution is to add an additional data structure to Corpus objects, called `processed_text`. (See [link](https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/prompt-types/convokit/model/corpus.py) and ctrl+f `processed_text` for modifications to the code.) This is a dict of dicts, where each sub-dict maps utterance IDs to processed versions of the text. 

e.g.:
`{'cleaned_text': {utterance_id: clean version of text}, 'parsed': {utterance_id: serialized spacy parse}}`

This dict isn't loaded by default, but the end-user, or a Transformer, can load particular fields (see the bottom of this notebook for a demonstration); they can also choose to write particular fields to disk. 
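
As a rough illustration of this layout, here is a minimal sketch that uses plain dicts rather than the actual `Corpus` class. The helper name `lookup_processed_text`, the utterance IDs and the stored values are all made up for the example; only the field-to-utterance-ID nesting comes from the description above.

```python
# Hypothetical stand-in for corpus.processed_text:
# field name -> {utterance ID -> processed value}.
processed_text = {
    'cleaned_text': {'utt_1': 'hello there', 'utt_2': 'how are you'},
    'tokens': {'utt_1': ['hello', 'there']},  # values need not be strings
}

def lookup_processed_text(store, utt_id, field):
    """Return the processed version of one utterance, or None if it is not stored."""
    return store.get(field, {}).get(utt_id)

print(lookup_processed_text(processed_text, 'utt_1', 'cleaned_text'))
print(lookup_processed_text(processed_text, 'utt_2', 'tokens'))
```

Because each field is its own sub-dict, a field can be loaded, dropped or written to disk without touching the others.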

This may be more general than storing processed text, in which case `processed_text` is a bad name.
(And yes, there is some ambiguity about what constitutes a "feature" versus "preprocessed text". In the extreme, someone might decide to store all features, even the float-valued ones, in this data structure -- not that we'd recommend it.)

In [1]:
import convokit

I'll demo the functionality on a subset of 100 conversations in the Tennis corpus.

In [2]:
corpus = convokit.Corpus('/kitchen/convokit_corpora/tennis-mini/')

here's the original directory structure. this should look familiar.

In [3]:
ls '/kitchen/convokit_corpora/tennis-mini/'

conversations.json  corpus.json  index.json  users.json  utterances.jsonl


the corpus object is initialized with an empty `processed_text` field.

In [4]:
corpus.processed_text

{}

here's the example utterance we'll work with.

In [5]:
test_utt_id = '1681_14.a'

In [6]:
corpus.get_utterance(test_utt_id).text

"Yeah, but many friends went with me, Japanese guy. So I wasn't -- I wasn't like homesick. But now sometimes I get homesick."

`TextProcessor` is the main class in charge of preprocessing utterance texts -- i.e., converting text that is stored either in the utterance object or in a field of `processed_text` into processed text that is stored in another field of `processed_text`.

It inherits from `Transformer`, but rather than annotating utterances, it stores its output in the relevant field of `processed_text`. The arguments it needs to take in are:

* `proc_fn`: a function for processing text that takes in string text and an optional dict `aux_input` of auxiliary inputs. (I couldn't figure out how to get `**kwargs` to work but maybe that's a better way)
* `output_field`: the field in `processed_text` that the processed text is stored in.

Other arguments:

* By default, `TextProcessor` will read text from the utterance object itself (i.e., `utterance.text`). However, specifying argument `input_field` will instead have it read text from that field of `processed_text`.
* `aux_input` stores any auxiliary arguments to `proc_fn`. Useful if we need to load something like a `spacy` object.
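
To make the roles of these arguments concrete, here is a toy sketch of the same idea on plain dicts. `MiniTextProcessor`, `strip_dashes` and `truncate` are hypothetical names invented for this example, and the class is not the actual `TextProcessor` implementation. As a simplification, it always passes `aux_input` to `proc_fn`.

```python
class MiniTextProcessor:
    """Toy stand-in for TextProcessor that works on plain dicts, not a Corpus."""

    def __init__(self, proc_fn, output_field, input_field=None, aux_input=None):
        self.proc_fn = proc_fn
        self.output_field = output_field
        self.input_field = input_field    # None means "read the raw utterance text"
        self.aux_input = aux_input or {}

    def transform(self, utterances, processed_text):
        # utterances: {utterance ID -> raw text}; processed_text: dict of dicts.
        if self.input_field is None:
            source = utterances
        else:
            source = processed_text[self.input_field]
        processed_text[self.output_field] = {
            utt_id: self.proc_fn(text, self.aux_input)
            for utt_id, text in source.items()
        }
        return processed_text


def strip_dashes(text, aux_input):
    return text.replace(' -- ', ' ')

def truncate(text, aux_input):
    return text[:aux_input.get('max_chars', 10)]

utterances = {'u1': "So I wasn't -- I wasn't like homesick."}
store = {}
store = MiniTextProcessor(strip_dashes, 'text').transform(utterances, store)
# The second processor reads the output of the first via input_field.
store = MiniTextProcessor(truncate, 'short', input_field='text',
                          aux_input={'max_chars': 4}).transform(utterances, store)
print(store['text']['u1'])
print(store['short']['u1'])
```

Chaining through `input_field` is what lets one processed version feed the next, without ever overwriting the original utterance text.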

In [7]:
from convokit.text_processing import TextProcessor

a basic example of a preprocessing function: removing instances of `--`. 

In [8]:
def preprocess_text(text):
    text = text.replace(' -- ', ' ')
    return text

In [9]:
text_prep = TextProcessor(preprocess_text, 'text')
corpus = text_prep.transform(corpus)

we now have a new field in `processed_text`, which can be accessed with the `corpus.get_processed_text` function.

In [10]:
corpus.processed_text.keys()

dict_keys(['text'])

In [11]:
corpus.get_processed_text(test_utt_id, 'text')

"Yeah, but many friends went with me, Japanese guy. So I wasn't I wasn't like homesick. But now sometimes I get homesick."

I implemented some classes that further inherit from `TextProcessor` to handle some particular preprocessing methods. In particular, `TextParser` will dependency parse the text. 

In [12]:
from convokit.text_processing import TextParser

Here, we want the output of the parse to be written to the `parsed` field of `processed_text`, and we want the parser to use our preprocessed input in the `text` field.

In [13]:
textparser = TextParser('parsed', input_field='text', verbosity=50)
corpus = textparser.transform(corpus)

050/200 utterances processed
100/200 utterances processed
150/200 utterances processed


In [14]:
corpus.processed_text.keys()

dict_keys(['text', 'parsed'])

This is what a spacy parse, serialized in text form, looks like.

In [48]:
test_parse = corpus.get_processed_text(test_utt_id, 'parsed')

The parse is a list consisting of sentences. For each sentence, the `rt` entry denotes the index of the root of the sentence, and the `toks` entry lists the tokens within the sentence.

In [49]:
test_parse[0]

{'rt': 5,
 'toks': [{'dep': 'intj', 'dn': [], 'tag': 'UH', 'tok': 'Yeah', 'up': 5},
  {'dep': 'punct', 'dn': [], 'tag': ',', 'tok': ',', 'up': 5},
  {'dep': 'cc', 'dn': [], 'tag': 'CC', 'tok': 'but', 'up': 5},
  {'dep': 'amod', 'dn': [], 'tag': 'JJ', 'tok': 'many', 'up': 4},
  {'dep': 'nsubj', 'dn': [3, 10], 'tag': 'NNS', 'tok': 'friends', 'up': 5},
  {'dep': 'ROOT', 'dn': [0, 1, 2, 4, 6, 8, 11], 'tag': 'VBD', 'tok': 'went'},
  {'dep': 'prep', 'dn': [7], 'tag': 'IN', 'tok': 'with', 'up': 5},
  {'dep': 'pobj', 'dn': [], 'tag': 'PRP', 'tok': 'me', 'up': 6},
  {'dep': 'punct', 'dn': [], 'tag': ',', 'tok': ',', 'up': 5},
  {'dep': 'amod', 'dn': [], 'tag': 'JJ', 'tok': 'Japanese', 'up': 10},
  {'dep': 'appos', 'dn': [9], 'tag': 'NN', 'tok': 'guy', 'up': 4},
  {'dep': 'punct', 'dn': [], 'tag': '.', 'tok': '.', 'up': 5}]}

Each token in `toks` contains 

* `tok`: the text of the token
* `tag`: the POS tag of the token
* `up`: the index of the parent of the token in the dependency tree.
* `dn`: the indices of the children of the token in the dependency tree.
* `dep`: the dependency in the edge between the token and its parent.

Note that `dn` is really just there for convenience; if we wanted to further save space we could remove it.
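
To show why `dn` is redundant, here is a small sketch that recomputes it from the `up` entries alone. The function name `rebuild_children` is made up for this example; the data is the first sentence of the parse above with `dn` left out.

```python
def rebuild_children(sentence):
    """Return a copy of the sentence with each token's `dn` recomputed from `up`."""
    toks = [dict(tok, dn=[]) for tok in sentence['toks']]
    for idx, tok in enumerate(toks):
        if 'up' in tok:                      # the root token has no `up` entry
            toks[tok['up']]['dn'].append(idx)
    return {'rt': sentence['rt'], 'toks': toks}

# First sentence of the example parse, stored without `dn`.
sentence = {'rt': 5, 'toks': [
    {'dep': 'intj', 'tag': 'UH', 'tok': 'Yeah', 'up': 5},
    {'dep': 'punct', 'tag': ',', 'tok': ',', 'up': 5},
    {'dep': 'cc', 'tag': 'CC', 'tok': 'but', 'up': 5},
    {'dep': 'amod', 'tag': 'JJ', 'tok': 'many', 'up': 4},
    {'dep': 'nsubj', 'tag': 'NNS', 'tok': 'friends', 'up': 5},
    {'dep': 'ROOT', 'tag': 'VBD', 'tok': 'went'},
    {'dep': 'prep', 'tag': 'IN', 'tok': 'with', 'up': 5},
    {'dep': 'pobj', 'tag': 'PRP', 'tok': 'me', 'up': 6},
    {'dep': 'punct', 'tag': ',', 'tok': ',', 'up': 5},
    {'dep': 'amod', 'tag': 'JJ', 'tok': 'Japanese', 'up': 10},
    {'dep': 'appos', 'tag': 'NN', 'tok': 'guy', 'up': 4},
    {'dep': 'punct', 'tag': '.', 'tok': '.', 'up': 5},
]}

rebuilt = rebuild_children(sentence)
print(rebuilt['toks'][5]['dn'])   # children of the root, 'went'
print(rebuilt['toks'][4]['dn'])   # children of 'friends'
```

The recomputed lists match the stored `dn` values shown above, so dropping `dn` would trade a little load-time work for a smaller file.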

If we don't want the entire dependency parse, `TextParser` can also be run in other modes:

* `tag`, which POS tags the input
* `tokenize`, which only tokenizes the input.

In both cases, the nltk sentence tokenizer is used to tokenize sentences, and each sentence is then passed into a spacy object. A bit annoying.

In [16]:
texttagger = TextParser('tagged', 'tag', input_field='text')
corpus = texttagger.transform(corpus)

Here, we see that only the POS tags are available for each token. Across all modes, `TextParser` output is in a similar format, in case we later change our minds and decide to dependency-parse it after all. (Note that upgrading a tagged field to a full parse isn't implemented, and it would rely on spacy tokenization being consistent across runs.)

In [17]:
corpus.get_processed_text(test_utt_id, 'tagged')

[{'toks': [{'tag': 'UH', 'tok': 'Yeah'},
   {'tag': ',', 'tok': ','},
   {'tag': 'CC', 'tok': 'but'},
   {'tag': 'JJ', 'tok': 'many'},
   {'tag': 'NNS', 'tok': 'friends'},
   {'tag': 'VBD', 'tok': 'went'},
   {'tag': 'IN', 'tok': 'with'},
   {'tag': 'PRP', 'tok': 'me'},
   {'tag': ',', 'tok': ','},
   {'tag': 'JJ', 'tok': 'Japanese'},
   {'tag': 'NN', 'tok': 'guy'},
   {'tag': '.', 'tok': '.'}]},
 {'toks': [{'tag': 'RB', 'tok': 'So'},
   {'tag': 'PRP', 'tok': 'I'},
   {'tag': 'VBD', 'tok': 'was'},
   {'tag': 'RB', 'tok': "n't"},
   {'tag': 'PRP', 'tok': 'I'},
   {'tag': 'VBD', 'tok': 'was'},
   {'tag': 'RB', 'tok': "n't"},
   {'tag': 'UH', 'tok': 'like'},
   {'tag': 'NN', 'tok': 'homesick'},
   {'tag': '.', 'tok': '.'}]},
 {'toks': [{'tag': 'CC', 'tok': 'But'},
   {'tag': 'RB', 'tok': 'now'},
   {'tag': 'RB', 'tok': 'sometimes'},
   {'tag': 'PRP', 'tok': 'I'},
   {'tag': 'VBP', 'tok': 'get'},
   {'tag': 'NN', 'tok': 'homesick'},
   {'tag': '.', 'tok': '.'}]}]

One last thing I've found myself frequently doing is to store a subset of the information returned by the preprocessing as a string that contains only the tokens (but properly sentence-and-word tokenized). This allows me to quickly analyze the data when I don't actually need the full parse. I felt this merited yet another `TextProcessor` class:

In [18]:
from convokit.text_processing import TokensToString

In [19]:
tok_to_str = TokensToString('tok_str')
corpus = tok_to_str.transform(corpus)

By default, `TokensToString` will access the `parsed` field of `processed_text` and output text in which sentences are newline-separated. (Note that here, spacy gets a bit confused about sentence boundaries because of the speaker's disfluency.)

In [21]:
print(corpus.get_processed_text(test_utt_id, 'tok_str'))

Yeah , but many friends went with me , Japanese guy .
So I was n't
I was n't like homesick .
But now sometimes I get homesick .


However, we can also format the outputted tokens in different ways, or only output a subset. Here, I'll ignore all punctuation and output the token and its tag.

In [22]:
tag_to_str = TokensToString('tok_tag',
                            token_formatter=lambda x: '%s_%s' % (x['tok'].lower(), x['tag']),
                            token_filter=lambda x: any(ch.isalpha() for ch in x['tok']))
corpus = tag_to_str.transform(corpus)

In [24]:
print(corpus.get_processed_text(test_utt_id, 'tok_tag'))

yeah_UH but_CC many_JJ friends_NNS went_VBD with_IN me_PRP japanese_JJ guy_NN
so_RB i_PRP was_VBD n't_RB
i_PRP was_VBD n't_RB like_UH homesick_NN
but_CC now_RB sometimes_RB i_PRP get_VBP homesick_NN
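
For readers who want to see how the formatter and filter interact, here is a minimal sketch of the same idea as a standalone function. `tokens_to_string` is a hypothetical helper written for this example, not the `TokensToString` implementation; it only assumes the sentence/`toks` layout shown earlier.

```python
def tokens_to_string(sentences,
                     token_formatter=lambda tok: tok['tok'],
                     token_filter=lambda tok: True):
    """Join the kept tokens of each sentence with spaces, one sentence per line."""
    return '\n'.join(
        ' '.join(token_formatter(tok) for tok in sent['toks'] if token_filter(tok))
        for sent in sentences
    )

# The last sentence of the tagged example above.
tagged = [{'toks': [{'tag': 'CC', 'tok': 'But'}, {'tag': 'RB', 'tok': 'now'},
                    {'tag': 'RB', 'tok': 'sometimes'}, {'tag': 'PRP', 'tok': 'I'},
                    {'tag': 'VBP', 'tok': 'get'}, {'tag': 'NN', 'tok': 'homesick'},
                    {'tag': '.', 'tok': '.'}]}]

# Default behaviour: keep every token as-is.
print(tokens_to_string(tagged))

# Lowercased token plus tag, skipping tokens with no alphabetic characters.
print(tokens_to_string(
    tagged,
    token_formatter=lambda tok: '%s_%s' % (tok['tok'].lower(), tok['tag']),
    token_filter=lambda tok: any(ch.isalpha() for ch in tok['tok']),
))
```

The filter runs before the formatter, so the formatter never sees the tokens that are dropped.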


Finally, a transformation that is used in the QuestionTypology paper is to extract arcs from the dependency parse of the text (fancy bigrams, basically). Here's the `TextProcessor` object that does it.

In [25]:
from convokit.text_processing import TextToArcs

In [26]:
text_to_arc = TextToArcs('arcs')
corpus = text_to_arc.transform(corpus)

this stores a list of sentences, where each sentence is a list of arcs.

In [27]:
corpus.get_processed_text(test_utt_id, 'arcs')

[['friends_*',
  'friends_guy',
  'friends_many',
  'guy_*',
  'guy_japanese',
  'japanese_*',
  'many_*',
  'me_*',
  'went_*',
  'went_friends',
  'went_with',
  'went_yeah',
  'with_*',
  'with_me',
  'yeah_*'],
 ['i_*', 'so>i', 'so_*', 'was_*', 'was_i', 'was_so'],
 ['homesick_*', 'i_*', 'like_*', 'was_*', 'was_homesick', 'was_i', 'was_like'],
 ['but>now',
  'get_*',
  'get_homesick',
  'get_i',
  'get_now',
  'get_sometimes',
  'homesick_*',
  'i_*',
  'now_*',
  'sometimes_*']]

`TextToArcs` comes with other options: we may only take arcs that come out of the root (`root_only`) or omit nouns (`censor_nouns`):

In [28]:
text_to_arc_mini = TextToArcs('arcs_mini', censor_nouns=True, root_only=True)
corpus = text_to_arc_mini.transform(corpus)

In [29]:
corpus.get_processed_text(test_utt_id, 'arcs_mini')

[['went_*', 'went_with', 'went_yeah'],
 ['was_*', 'was_so'],
 ['was_*', 'was_like'],
 ['but>now', 'get_*', 'get_now', 'get_sometimes']]
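
To give a feel for where these strings come from, here is a simplified sketch that derives `parent_child` and `token_*` strings from the `tok` and `up` entries of a parsed sentence. It is not the `TextToArcs` implementation: the function name `parse_to_arcs` is made up, the sketch keeps `but` (which the outputs above drop), and it does not produce the `>`-style arcs or support `censor_nouns`. Only a `root_only` switch is imitated.

```python
def parse_to_arcs(sentence, root_only=False):
    """Return sorted `parent_child` and `token_*` strings for one parsed sentence."""
    toks, root = sentence['toks'], sentence['rt']
    arcs = set()
    for idx, tok in enumerate(toks):
        word = tok['tok'].lower()
        if not word.isalpha():
            continue                       # skip punctuation and pieces like "n't"
        if not root_only or idx == root:
            arcs.add(word + '_*')          # the token on its own
        parent_idx = tok.get('up')         # the root token has no `up` entry
        if parent_idx is None:
            continue
        parent = toks[parent_idx]['tok'].lower()
        if parent.isalpha() and (not root_only or parent_idx == root):
            arcs.add(parent + '_' + word)  # edge from parent to child
    return sorted(arcs)

# First sentence of the example parse, reduced to the entries this sketch reads.
words = ['Yeah', ',', 'but', 'many', 'friends', 'went',
         'with', 'me', ',', 'Japanese', 'guy', '.']
parents = [5, 5, 5, 4, 5, None, 5, 6, 5, 10, 4, 5]
sentence = {'rt': 5, 'toks': [
    {'tok': w} if p is None else {'tok': w, 'up': p}
    for w, p in zip(words, parents)
]}

print(parse_to_arcs(sentence))
print(parse_to_arcs(sentence, root_only=True))
```

Apart from the two extra `but` entries, the full output for this sentence contains the same strings as the first list in the `arcs` field above.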

since storing lists of lists might be clunky, we might instead want to serialize these things as strings -- where, as above, sentences are separated by newlines and arcs are separated by spaces. Also, I will use this opportunity to demonstrate a more involved `TextProcessor`. 

this function will serialize lists of lists of strings, and optionally takes in arguments for how to delimit sentences and tokens (i.e., elements of the list). note that these arguments are passed in by way of the argument `aux_input`.

In [30]:
def join_tokens_and_sentences(sents, aux_input=None):
    # aux_input may override the sentence ('sent_sep') and token ('tok_sep') separators
    aux_input = aux_input or {}
    return aux_input.get('sent_sep', '\n').join(
        aux_input.get('tok_sep', ' ').join(sent) for sent in sents)

for fun, suppose we wanted to separate arcs by commas instead of spaces. This is what the corresponding `TextProcessor` call, and output, look like:

In [31]:
arc_to_string = TextProcessor(proc_fn=join_tokens_and_sentences,
                              output_field='arc_string', input_field='arcs',
                              aux_input={'sent_sep': '\n', 'tok_sep': ', '})
corpus = arc_to_string.transform(corpus)

In [32]:
print(corpus.get_processed_text(test_utt_id, 'arc_string'))

friends_*, friends_guy, friends_many, guy_*, guy_japanese, japanese_*, many_*, me_*, went_*, went_friends, went_with, went_yeah, with_*, with_me, yeah_*
i_*, so>i, so_*, was_*, was_i, was_so
homesick_*, i_*, like_*, was_*, was_homesick, was_i, was_like
but>now, get_*, get_homesick, get_i, get_now, get_sometimes, homesick_*, i_*, now_*, sometimes_*


we've now accumulated a bunch of different preprocessed versions of the text:

In [35]:
corpus.processed_text.keys()

dict_keys(['text', 'parsed', 'tagged', 'tok_str', 'tok_tag', 'arcs', 'arcs_mini', 'arc_string'])

we can save all of them to disk as follows:

In [36]:
corpus.dump_processed_text()

This stores each field of `processed_text` as a separate jsonlist file. (Note that I should correct the file extension.)

In [37]:
ls '/kitchen/convokit_corpora/tennis-mini/'

conversations.json              processed_text.tagged.json
corpus.json                     processed_text.text.json
index.json                      processed_text.tok_str.json
processed_text.arcs.json        processed_text.tok_tag.json
processed_text.arcs_mini.json   users.json
processed_text.arc_string.json  utterances.jsonl
processed_text.parsed.json


if we'd only wanted to dump a subset, we could specify a list of fields to dump:

`corpus.dump_processed_text(['arcs', 'tok_str'])`

finally, here's a demo of how this expanded corpus might be loaded. 

In [38]:
new_corpus = convokit.Corpus('/kitchen/convokit_corpora/tennis-mini/')

by default, _none_ of the processed fields are loaded:

In [39]:
new_corpus.processed_text

{}

We can specify a subset of fields to load. (If no argument is provided, the expected behaviour is that it will load all the fields it can find in the directory -- note that this isn't yet implemented.)

In [44]:
new_corpus.load_processed_text(['arcs', 'tok_str'])

In [47]:
new_corpus.processed_text.keys()

dict_keys(['arcs', 'tok_str'])
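
As a rough sketch of how per-field dumping and loading could work, including the not-yet-implemented "load everything in the directory" case, here is a standalone round trip on plain dicts. The helper names `dump_fields` and `load_fields` and the one-record-per-line layout (`{"id": ..., "value": ...}`) are assumptions made for this example; only the `processed_text.<field>.json` file naming is taken from the directory listing above.

```python
import json
import os
import tempfile

def dump_fields(processed_text, dirname, fields=None):
    """Write each requested field to processed_text.<field>.json, one record per line."""
    for field in (fields if fields is not None else processed_text):
        path = os.path.join(dirname, 'processed_text.%s.json' % field)
        with open(path, 'w') as f:
            for utt_id, value in processed_text[field].items():
                f.write(json.dumps({'id': utt_id, 'value': value}) + '\n')

def load_fields(dirname, fields=None):
    """Load the requested fields, or every processed_text.*.json file if fields is None."""
    prefix, suffix = 'processed_text.', '.json'
    if fields is None:
        # Discover the available fields from the file names.
        fields = sorted(name[len(prefix):-len(suffix)]
                        for name in os.listdir(dirname)
                        if name.startswith(prefix) and name.endswith(suffix))
    loaded = {}
    for field in fields:
        path = os.path.join(dirname, prefix + field + suffix)
        with open(path) as f:
            records = (json.loads(line) for line in f)
            loaded[field] = {rec['id']: rec['value'] for rec in records}
    return loaded

store = {'tok_str': {'u1': 'But now sometimes I get homesick .'},
         'arcs': {'u1': [['get_*', 'get_now']]}}

with tempfile.TemporaryDirectory() as dirname:
    dump_fields(store, dirname)
    subset = load_fields(dirname, ['arcs'])   # load one field only
    everything = load_fields(dirname)         # discover and load all fields

print(sorted(subset))
print(everything == store)
```

Keeping one file per field means that loading a subset never has to read, or even parse, the fields that were not requested.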

In [45]:
new_corpus.get_processed_text(test_utt_id, 'arcs')

[['friends_*',
  'friends_guy',
  'friends_many',
  'guy_*',
  'guy_japanese',
  'japanese_*',
  'many_*',
  'me_*',
  'went_*',
  'went_friends',
  'went_with',
  'went_yeah',
  'with_*',
  'with_me',
  'yeah_*'],
 ['i_*', 'so>i', 'so_*', 'was_*', 'was_i', 'was_so'],
 ['homesick_*', 'i_*', 'like_*', 'was_*', 'was_homesick', 'was_i', 'was_like'],
 ['but>now',
  'get_*',
  'get_homesick',
  'get_i',
  'get_now',
  'get_sometimes',
  'homesick_*',
  'i_*',
  'now_*',
  'sometimes_*']]

In [46]:
print(new_corpus.get_processed_text(test_utt_id, 'tok_str'))

Yeah , but many friends went with me , Japanese guy .
So I was n't
I was n't like homesick .
But now sometimes I get homesick .
