_2019-10-30_

This notebook demos a few proposed additions/changes to ConvoKit:

* A `TextProcessor` base class that maps per-utterance attributes to per-utterance outputs;
* A `TextParser` class that does dependency parsing;
* Selective and decoupled data storage and loading;
* Per-utterance calls to a transformer;
* Pipelining transformers. 

I've also left a bunch of questions and uncertainties throughout (ctrl-f `Question:`).

## Preliminaries: loading an existing corpus.

To start, we load a clean version of a corpus. For speed I'm using a 200-utterance subset of the tennis corpus (in a release we'd go through the extra steps of subsetting tennis).

In [1]:
import convokit

In [2]:
ROOT_DIR = '/kitchen/convokit_corpora/tennis-mini'

Here is the original (and familiar) directory structure.

In [3]:
ls $ROOT_DIR

conversations.json  corpus.json  index.json  users.json  utterances.jsonl


In [4]:
corpus = convokit.Corpus(ROOT_DIR)
corpus.print_summary_stats()

Number of Users: 9
Number of Utterances: 200
Number of Conversations: 100


Selecting the following example utterance:

In [5]:
test_utt_id = '1681_14.a'
utt = corpus.get_utterance(test_utt_id)

In [6]:
utt.text

"Yeah, but many friends went with me, Japanese guy. So I wasn't -- I wasn't like homesick. But now sometimes I get homesick."

Right now, `utt.meta` contains only the fields that we released with the corpus:

In [7]:
utt.meta

{'is_answer': True, 'is_question': False, 'pair_idx': '1681_14'}

and nothing else.

The following call is equivalent to `utt.meta.get('parsed', None)`. In other words, `get_info` is a wrapper on top of directly accessing the `meta` dictionary, and its default behaviour returns `None` when the particular field doesn't exist.

Having this wrapper and its counterpart, `utt.set_info(key, value)`, should hopefully let us change how data is organized in the future without breaking calling code.

In [8]:
utt.get_info('parsed')
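As a rough illustration of the wrapper pair described above, here is a minimal sketch using a stand-in class. `SketchUtterance` is hypothetical and is not ConvoKit's `Utterance`; it only mirrors the stated behaviour (a missing field yields `None`, and `set_info` writes into `meta`).

```python
class SketchUtterance:
    """Hypothetical stand-in for an utterance; not ConvoKit's actual class."""

    def __init__(self, text, meta=None):
        self.text = text
        self.meta = dict(meta or {})

    def get_info(self, key):
        # Same as self.meta.get(key, None): a missing field yields None.
        return self.meta.get(key, None)

    def set_info(self, key, value):
        # For now a thin wrapper over the meta dict.
        self.meta[key] = value


sketch_utt = SketchUtterance("I played -- a tennis match.", {"is_question": False})
missing = sketch_utt.get_info("parsed")  # this field has not been computed
sketch_utt.set_info("clean_text", "I played a tennis match.")
```

Because callers only go through `get_info` and `set_info`, the storage behind them could later move out of `meta` without changing calling code.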

## The TextProcessor class

Many of our transformers are per-utterance mappings of one attribute of an utterance to another. To facilitate this, I implemented a `TextProcessor` class that inherits from `Transformer`. 

`TextProcessor` is initialized with the following arguments:

* `proc_fn`: the mapping function. Supports one of two function signatures: `proc_fn(input)` and `proc_fn(input, auxiliary_info)`. 
* `input_field`: the attribute of the utterance that `proc_fn` will take as input. If set to `None`, will default to reading `utt.text`, as seems to be presently done.
* `output_field`: the name of the attribute that the output of `proc_fn` will be written to. 
* `aux_input`: any auxiliary input that `proc_fn` needs (e.g., a pre-loaded model); passed in as a dict.
* `input_filter`: a boolean function of signature `input_filter(utterance, aux_input)`, where `aux_input` is again passed as a dict. If this returns `False` then the particular utterance will be skipped; by default it will always return `True`.

Both `input_field` and `output_field` support multiple items -- that is, `proc_fn` could take in multiple attributes of an utterance and output multiple attributes. I'll show how this works below.
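To make the roles of these arguments concrete, here is a simplified sketch of the single-field case. `SketchProcessor` is a hypothetical stand-in, utterances are plain dicts rather than ConvoKit objects, and the rule for choosing between the two `proc_fn` signatures (pass `aux_input` only when it is non-empty) is an assumption made for the sketch.

```python
class SketchProcessor:
    """Hypothetical, simplified stand-in for the proposed TextProcessor."""

    def __init__(self, proc_fn, output_field, input_field=None,
                 aux_input=None, input_filter=None):
        self.proc_fn = proc_fn
        self.output_field = output_field
        self.input_field = input_field
        self.aux_input = aux_input or {}
        # By default every utterance passes the filter.
        self.input_filter = input_filter or (lambda utt, aux: True)

    def transform(self, utterances):
        for utt in utterances:
            if not self.input_filter(utt, self.aux_input):
                continue  # filtered out: leave the utterance untouched
            if self.input_field is None:
                value = utt["text"]  # default input is the raw text
            elif self.input_field in utt["meta"]:
                value = utt["meta"][self.input_field]
            else:
                continue  # input field missing: skip
            # Assumption: use the two-argument signature only when aux is given.
            if self.aux_input:
                result = self.proc_fn(value, self.aux_input)
            else:
                result = self.proc_fn(value)
            utt["meta"][self.output_field] = result
        return utterances


utts = [{"text": "Was it a big step?", "meta": {"is_question": True}},
        {"text": "Yeah, it was.", "meta": {"is_question": False}}]
counter = SketchProcessor(lambda text: len(text.split()), "wc",
                          input_filter=lambda utt, aux: utt["meta"]["is_question"])
utts = counter.transform(utts)
```

Here only the first utterance receives a `wc` field, since the second one does not pass the filter.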

"Attribute" is a deliberately generic term. `TextProcessor` could produce "features" as we may conventionally think of them (e.g., wordcount, politeness strategies). It can also be used to pre-process text, i.e., generate alternate representations of the text. The line between what's counted as feature and representation is blurry.

### Question: design of function calls

Note that one option is for the function signatures of `proc_fn` and `input_filter` to match -- i.e., both take in entire utterances, rather than `proc_fn` taking in only select attributes. The perhaps hacky rationale for why this isn't presently the case is as follows:

* `proc_fn` could in principle be called on stand-alone input that has nothing to do with a Corpus: as long as you have _any_ text -- albeit formatted correctly (e.g., you'd still have to know when to pass in a dependency parse, versus a raw string) -- you can make a call to `proc_fn`, without going through `TextProcessor`. Consequently, you should also be able to write a `proc_fn` without knowing what ConvoKit is.
* `input_filter` is a decision you make that's contingent on the Corpus: for instance, you may not want to parse things with some corpus-specific metadata. As such, we'd actually want to see the Utterance object in this case. 
* `input_filter` is an advanced use case; most people will just interact with `proc_fn`. 

### simple example -- cleaning the text

In [9]:
from convokit.text_processing import TextProcessor

As a simple example, suppose we want to remove hyphens `--` from the text as a preprocessing step. To use `TextProcessor` to do this for us, we'd define the following as a `proc_fn`:

In [10]:
def preprocess_text(text):
    text = text.replace(' -- ', ' ')
    return text

Below, we initialize `prep`, a `TextProcessor` object that will run `preprocess_text` on each utterance.

When we call `prep.transform()`, the following will occur:

* Because we didn't specify an input field, `prep` will pass `utterance.text` into `preprocess_text`
* It will write the output -- the text minus the hyphens -- to a field called `clean_text`. Under the surface, we are calling `utt.set_info('clean_text', <output>)`. Currently this `set_info` wrapper is equivalent to `utt.meta['clean_text'] = <output>`.

In [11]:
prep = TextProcessor(proc_fn=preprocess_text, output_field='clean_text')
corpus = prep.transform(corpus)

And as desired, we now have a new field in `utt` -- presently stored as an entry in `utt.meta`.

In [12]:
utt.get_info('clean_text')

"Yeah, but many friends went with me, Japanese guy. So I wasn't I wasn't like homesick. But now sometimes I get homesick."

### Some advanced usage: playing around with parameters

The point of the following is to demonstrate more elaborate calls to `TextProcessor`, and also to show that `TextProcessor` is agnostic to whether we are producing a representation or a feature. I'll demo these points by way of wordcount.

First, we'll initialize a `TextProcessor` that does wordcounts (i.e., `len(x.split())`) on just the raw text (`utt.text`), writing output to field `wc_raw`.

In [13]:
wc_raw = TextProcessor(proc_fn=lambda x: len(x.split()), output_field='wc_raw')
corpus = wc_raw.transform(corpus)

In [14]:
utt.get_info('wc_raw')

23

If we instead wanted to wordcount our preprocessed text, with the hyphens removed, we can specify `input_field='clean_text'` -- as such, the `TextProcessor` will read from `utt.get_info('clean_text')` instead. 

In [15]:
wc = TextProcessor(proc_fn=lambda x: len(x.split()), output_field='wc', input_field='clean_text')
corpus = wc.transform(corpus)

Here we see that we are no longer counting the extra hyphen.

In [16]:
utt.get_info('wc')

22

Likewise, we can count characters:

In [17]:
chars = TextProcessor(proc_fn=lambda x: len(x), output_field='ch', input_field='clean_text')
corpus = chars.transform(corpus)

In [18]:
utt.get_info('ch')

120

Suppose that for some reason we now wanted to calculate:

* characters per word
* words per character (the reciprocal)

This requires:

* a `TextProcessor` that takes in multiple input fields, `'ch'` and `'wc'`;
* and that writes to multiple output fields, `'char_per_word'` and `'word_per_char'`.

Here's how the resultant object, `char_per_word`, handles this:

* in `transform()`, we pass `proc_fn` a dict mapping input field name to value, e.g., `{'wc': 22, 'ch': 120}`
* `proc_fn` will be written to return a tuple, where each element of that tuple corresponds to each element of the list we've passed to `output_field`, e.g., 

```
out0, out1 = proc_fn(input)
utt.set_info('char_per_word', out0)
utt.set_info('word_per_char', out1)
```

In [19]:
char_per_word = TextProcessor(proc_fn=lambda x: (x['ch']/x['wc'], x['wc']/x['ch']), 
                              output_field=['char_per_word', 'word_per_char'], input_field=['ch','wc'])
corpus = char_per_word.transform(corpus)

In [20]:
utt.get_info('char_per_word')

5.454545454545454

In [21]:
utt.get_info('word_per_char')

0.18333333333333332
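The multi-field dispatch used here can be sketched independently of ConvoKit. `apply_multi_field` is a hypothetical helper operating on a plain `meta` dict; it packs the input fields into a dict, calls `proc_fn` once, and unpacks the returned tuple into the output fields in order.

```python
def apply_multi_field(meta, proc_fn, input_fields, output_fields):
    """Hypothetical helper: run proc_fn over several fields of one meta dict."""
    # Every input field must be present; otherwise do nothing.
    if not all(field in meta for field in input_fields):
        return meta
    inputs = {field: meta[field] for field in input_fields}
    outputs = proc_fn(inputs)  # expected: one value per output field
    for field, value in zip(output_fields, outputs):
        meta[field] = value
    return meta


meta = {"wc": 22, "ch": 120}
apply_multi_field(meta,
                  lambda x: (x["ch"] / x["wc"], x["wc"] / x["ch"]),
                  ["ch", "wc"], ["char_per_word", "word_per_char"])
```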

We now have a bunch of new fields pertaining to the attributes we've computed, presently stored as entries in `utt.meta`:

In [22]:
utt.meta.keys()

dict_keys(['is_answer', 'is_question', 'pair_idx', 'clean_text', 'wc_raw', 'wc', 'ch', 'char_per_word', 'word_per_char'])

### Question: default behavior

At present, `TextProcessor` basically does no error handling. Here are some cases that come to mind:

* If an utterance does not contain `input_field`, then we silently skip over the utterance. As such, utterances that are missing this attribute will not contain the resultant `output_field` (this is _not the same_ as the `output_field` existing and being set to `None`)
* An utterance must contain _all_ of the `input_field`s if we've passed in multiple. Otherwise, the silent skipping-over behavior occurs. 

One nice behavior might be for `transform(corpus)` to throw an error on corpora where we know that `input_field` doesn't exist for _any_ utterance. This would require the corpus to maintain a registry of metadata and fields which have been computed before (something we're talking about already).
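A minimal sketch of such a check, assuming no registry exists yet: the hypothetical `check_input_fields` below rebuilds the set of known fields by scanning every utterance's `meta` dict, which a real registry would make unnecessary.

```python
def check_input_fields(utterance_metas, input_fields):
    """Hypothetical pre-flight check: raise if a field is absent everywhere."""
    # Stand-in for the proposed registry: the union of all fields seen.
    known_fields = set()
    for meta in utterance_metas:
        known_fields.update(meta)
    missing = [f for f in input_fields if f not in known_fields]
    if missing:
        raise KeyError(f"no utterance has field(s): {missing}")


metas = [{"clean_text": "hi there"}, {}]
check_input_fields(metas, ["clean_text"])  # fine: one utterance has the field
try:
    check_input_fields(metas, ["parsed"])
    raised = False
except KeyError:
    raised = True
```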

## Parsing text with the TextParser class

One common utterance-level thing we want to do is parse the text. In practice, in increasing order of (computational) difficulty, this typically entails:

* proper tokenizing of words and sentences;
* POS-tagging;
* dependency-parsing. 

As such, we provide a `TextParser` class that inherits from `TextProcessor` to do all of this, taking in the following arguments:

* `output_field`: defaults to `'parsed'`
* `input_field`
* `mode`: whether we want to go through all of the above steps (which may be expensive) or stop mid-way through. Supports the following options: `'tokenize'`, `'tag'`, `'parse'` (the default).

Under the surface, `TextParser` actually uses two separate models: a `spacy` object that does word tokenization, tagging and parsing _per sentence_, and `nltk`'s sentence tokenizer. The rationale is:

* `spacy` doesn't support sentence tokenization without dependency-parsing, and we often want sentence tokenization without having to go through the effort of parsing.
* We want to be consistent (as much as possible, given changes to spacy and nltk) in the tokenizations we produce, between runs where we don't want parsing and runs where we do.

If we've pre-loaded these models, we can pass them into the constructor too, as:

* `spacy_nlp`
* `sent_tokenizer`

In [23]:
from convokit.text_processing import TextParser

In [24]:
parser = TextParser(input_field='clean_text', verbosity=50)

In [25]:
corpus = parser.transform(corpus)

050/200 utterances processed
100/200 utterances processed
150/200 utterances processed


In [26]:
test_parse = utt.get_info('parsed')

### parse output

A parse produced by `TextParser` is serialized in text form (to avoid several memory and ease-of-use difficulties with spacy binaries). It is a list consisting of sentences, where each sentence is a dict with

* `toks`: a list of tokens (i.e., words) in the sentence;
* `rt`: the index of the root of the dependency tree (i.e., `sentence['toks'][sentence['rt']]` gives the root)

Each token, in turn, contains the following:

* `tok`: the text of the token;
* `tag`: the tag;
* `up`: the index of the parent of the token in the dependency tree (no entry for the root);
* `dn`: the indices of the children of the token;
* `dep`: the dependency of the edge between the token and its parent.

In [27]:
test_parse[0]

{'rt': 5,
 'toks': [{'dep': 'intj', 'dn': [], 'tag': 'UH', 'tok': 'Yeah', 'up': 5},
  {'dep': 'punct', 'dn': [], 'tag': ',', 'tok': ',', 'up': 5},
  {'dep': 'cc', 'dn': [], 'tag': 'CC', 'tok': 'but', 'up': 5},
  {'dep': 'amod', 'dn': [], 'tag': 'JJ', 'tok': 'many', 'up': 4},
  {'dep': 'nsubj', 'dn': [3, 10], 'tag': 'NNS', 'tok': 'friends', 'up': 5},
  {'dep': 'ROOT', 'dn': [0, 1, 2, 4, 6, 8, 11], 'tag': 'VBD', 'tok': 'went'},
  {'dep': 'prep', 'dn': [7], 'tag': 'IN', 'tok': 'with', 'up': 5},
  {'dep': 'pobj', 'dn': [], 'tag': 'PRP', 'tok': 'me', 'up': 6},
  {'dep': 'punct', 'dn': [], 'tag': ',', 'tok': ',', 'up': 5},
  {'dep': 'amod', 'dn': [], 'tag': 'JJ', 'tok': 'Japanese', 'up': 10},
  {'dep': 'appos', 'dn': [9], 'tag': 'NN', 'tok': 'guy', 'up': 4},
  {'dep': 'punct', 'dn': [], 'tag': '.', 'tok': '.', 'up': 5}]}
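As a quick illustration of how this structure can be traversed, the sketch below looks up the root and lists the parent-child edges of one sentence. The helper names are hypothetical; the sample sentence is the parse of "I played a tennis match." that appears in the pipeline output later in this notebook.

```python
def root_token(sentence):
    # 'rt' indexes into 'toks'.
    return sentence["toks"][sentence["rt"]]


def dependency_edges(sentence):
    """Return (parent text, dependency label, child text) per non-root token."""
    toks = sentence["toks"]
    # The root has no 'up' entry, so it is skipped.
    return [(toks[t["up"]]["tok"], t["dep"], t["tok"]) for t in toks if "up" in t]


sentence = {"rt": 1, "toks": [
    {"tok": "I", "tag": "PRP", "dep": "nsubj", "up": 1, "dn": []},
    {"tok": "played", "tag": "VBD", "dep": "ROOT", "dn": [0, 4, 5]},
    {"tok": "a", "tag": "DT", "dep": "det", "up": 4, "dn": []},
    {"tok": "tennis", "tag": "NN", "dep": "compound", "up": 4, "dn": []},
    {"tok": "match", "tag": "NN", "dep": "dobj", "up": 1, "dn": [2, 3]},
    {"tok": ".", "tag": ".", "dep": "punct", "up": 1, "dn": []}]}

root = root_token(sentence)["tok"]
edges = dependency_edges(sentence)
```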

If we didn't want to go through the trouble of dependency-parsing (which could be expensive) we could initialize `TextParser` with `mode='tag'`, which only POS-tags tokens:

In [28]:
texttagger = TextParser(output_field='tagged', input_field='clean_text', mode='tag')
corpus = texttagger.transform(corpus)

No matter the mode, the parse produced will follow the same structure as if we had run the entire pipeline (i.e., a list of sentences which are dicts); it will just be missing a few fields (like `rt`). 

In principle, this is to maintain consistency between different runs of the parser.

In [29]:
utt.get_info('tagged')[0]

{'toks': [{'tag': 'UH', 'tok': 'Yeah'},
  {'tag': ',', 'tok': ','},
  {'tag': 'CC', 'tok': 'but'},
  {'tag': 'JJ', 'tok': 'many'},
  {'tag': 'NNS', 'tok': 'friends'},
  {'tag': 'VBD', 'tok': 'went'},
  {'tag': 'IN', 'tok': 'with'},
  {'tag': 'PRP', 'tok': 'me'},
  {'tag': ',', 'tok': ','},
  {'tag': 'JJ', 'tok': 'Japanese'},
  {'tag': 'NN', 'tok': 'guy'},
  {'tag': '.', 'tok': '.'}]}

### Some advanced usage: input filters

Just for the sake of demonstration, suppose we wished to save some computation time and only parse the questions in a corpus. We can do this by specifying `input_filter`, which (as discussed above) takes an `Utterance` object as its argument. Note that, especially if the corpus comes from an institutional setting, there may be official definitions of what a question is, beyond the presence or absence of a question mark in the text.

In [30]:
def is_question(utt, aux={}):
    return utt.meta['is_question']

In [31]:
qparser = TextParser(output_field='qparsed', input_field='clean_text', input_filter=is_question, verbosity=50)

In [32]:
corpus = qparser.transform(corpus)

050/200 utterances processed
100/200 utterances processed
150/200 utterances processed


Since our test utterance is not a question, `qparser.transform()` will skip over it:

In [33]:
utt.get_info('qparsed')

However, if we take the question that triggered the answer, we see that it is indeed parsed:

In [34]:
q_utt_id = '1681_14.q'
q_utt = corpus.get_utterance(q_utt_id)

In [69]:
q_utt.text

'How hard was it for you when, 13 years, left your parents, left Japan to go to the States. Was it a big step for you?'

In [35]:
q_utt.get_info('qparsed')

[{'rt': 2,
  'toks': [{'dep': 'advmod', 'dn': [], 'tag': 'WRB', 'tok': 'How', 'up': 1},
   {'dep': 'acomp', 'dn': [0], 'tag': 'JJ', 'tok': 'hard', 'up': 2},
   {'dep': 'ROOT', 'dn': [1, 3, 4, 11, 22], 'tag': 'VBD', 'tok': 'was'},
   {'dep': 'nsubj', 'dn': [], 'tag': 'PRP', 'tok': 'it', 'up': 2},
   {'dep': 'prep', 'dn': [5], 'tag': 'IN', 'tok': 'for', 'up': 2},
   {'dep': 'pobj', 'dn': [], 'tag': 'PRP', 'tok': 'you', 'up': 4},
   {'dep': 'advmod', 'dn': [], 'tag': 'WRB', 'tok': 'when', 'up': 11},
   {'dep': 'punct', 'dn': [], 'tag': ',', 'tok': ',', 'up': 11},
   {'dep': 'nummod', 'dn': [], 'tag': 'CD', 'tok': '13', 'up': 9},
   {'dep': 'npadvmod', 'dn': [8], 'tag': 'NNS', 'tok': 'years', 'up': 11},
   {'dep': 'punct', 'dn': [], 'tag': ',', 'tok': ',', 'up': 11},
   {'dep': 'advcl',
    'dn': [6, 7, 9, 10, 13, 14, 15],
    'tag': 'VBD',
    'tok': 'left',
    'up': 2},
   {'dep': 'poss', 'dn': [], 'tag': 'PRP$', 'tok': 'your', 'up': 13},
   {'dep': 'dobj', 'dn': [12], 'tag': 'NNS', 'to

### Downstream application of parses: getting arcs

Once we have fully-parsed utterances, one thing we can do is extract all of the dependency-tree arcs -- effectively, fancy bigrams. To facilitate this, we include the following class `TextToArcs`, which also inherits from `TextProcessor`. By default, `TextToArcs` will use the parse as its input field (and hence requires the parse to exist). 

In [36]:
from convokit.text_processing import TextToArcs

In [37]:
get_arcs = TextToArcs('arcs')
corpus = get_arcs.transform(corpus)

`TextToArcs` returns a list of sentences, where each sentence is represented as a space-separated string of arcs. The main takeaway is that sometimes we might want utterances to be represented in a segmented way -- e.g., as a list of sentences. 

In [38]:
utt.get_info('arcs')

['friends_* friends_guy friends_many guy_* guy_japanese japanese_* many_* me_* went_* went_friends went_with went_yeah with_* with_me yeah>* yeah_*',
 'i_* so>* so>i so_* was_* was_i was_so',
 'homesick_* i>* i_* like_* was_* was_homesick was_i was_like',
 'but>* but>now get_* get_homesick get_i get_now get_sometimes homesick_* i_* now_* sometimes_*']
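To give a feel for where strings like these come from, here is a deliberately simplified sketch. It emits only `<tok>_*` and `<parent>_<child>` arcs over lowercased alphabetic tokens; it does not produce the `>` arcs and does not replicate whatever other filtering `TextToArcs` applies, so it will not reproduce the output above exactly. `simple_arcs` is a hypothetical name.

```python
def simple_arcs(sentence):
    """Simplified sketch: '<tok>_*' and '<parent>_<child>' arcs for a sentence."""
    toks = sentence["toks"]
    arcs = set()
    for tok in toks:
        word = tok["tok"].lower()
        if not word.isalpha():
            continue  # assumption: drop punctuation and non-alphabetic tokens
        arcs.add(f"{word}_*")
        if "up" in tok:  # the root has no parent
            parent = toks[tok["up"]]["tok"].lower()
            if parent.isalpha():
                arcs.add(f"{parent}_{word}")
    return " ".join(sorted(arcs))


sentence = {"rt": 1, "toks": [
    {"tok": "I", "tag": "PRP", "dep": "nsubj", "up": 1, "dn": []},
    {"tok": "played", "tag": "VBD", "dep": "ROOT", "dn": [0, 4, 5]},
    {"tok": "a", "tag": "DT", "dep": "det", "up": 4, "dn": []},
    {"tok": "tennis", "tag": "NN", "dep": "compound", "up": 4, "dn": []},
    {"tok": "match", "tag": "NN", "dep": "dobj", "up": 1, "dn": [2, 3]},
    {"tok": ".", "tag": ".", "dep": "punct", "up": 1, "dn": []}]}

arcs = simple_arcs(sentence)
```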

### Question: segmenting utterances

I can see a lot of use cases where we actually compute things over _segments_ of utterances -- sentences, but also paragraphs, subsections, etc. Down the line, this could get a bit hard to manage, if different `TextProcessor` objects are expecting to find segmented versus unsegmented utterances; also note that calling `TextParser` is effectively a precondition to having sentences (that could be used in future transformers).

Right now, many of the things I implement that inherit from `TextProcessor` down the line do this:

* If the output is per-sentence, it will return a list where each entry of that list is a sentence. In principle, this makes it easy to line up a sentence-level attribute with the corresponding sentence of the original parse.
* If the input requires per-sentence segmentation, the transformer will just not work if you give it a string instead. If the input doesn't care, the transformer will check if you've given it a string or a list of strings, and in the latter case call `'\n'.join(input)`. This motivates the next question:

### Question: what do TextProcessors expect as input?

Currently the specification of the input (i.e., whether a `TextProcessor` expects a list of dicts, a list of strings, a flat string, etc.) is up to whoever writes that particular `TextProcessor`, and it lives only in their head. Maybe there's no way around this, or maybe there are conventions that one could set?

## Storing and loading corpora

We've now computed a bunch of utterance-level attributes. 

In [39]:
utt.meta.keys()

dict_keys(['is_answer', 'is_question', 'pair_idx', 'clean_text', 'wc_raw', 'wc', 'ch', 'char_per_word', 'word_per_char', 'parsed', 'tagged', 'arcs'])

By default, calling `corpus.dump` will write all of these attributes to disk, within the file that stores utterances; later calling `corpus.load` will load all of these attributes back into a new corpus. This is messy for a few reasons:

* For big objects like parses, this incurs a high computational burden (especially if in a later use case you might not even need to look at parses)
* There are lots of nonsense attributes we created along the way -- maybe as intermediate output, or as experiments we aren't committed to keeping. We don't necessarily want to enshrine these as a part of the corpus in the future.
* It's just really messy to cram everything into `utterance.meta`. In particular, maybe there ought to be a clear distinction between fields that we believe are "core" to the corpus (like an institutional annotation for what constitutes a question), and things that we compute on top of the corpus, that perhaps generalize across different corpora (like parses). 

Note that these types of considerations generalize beyond Utterances to the other types of Corpus objects, Users and Conversations.

I'll leave the last point open for now, since actually trying to taxonomize things we can compute on top of an utterance is hard. 

On the first two points, I've made a few tweaks to `Corpus`.

First, `corpus.dump` now takes an optional argument `fields_to_skip`, which is a dict of object type (`'utterance'`, `'conversation'`, `'user'`, `'corpus'`) to a list of fields that we do not want to write to disk. As is presently implemented, these fields to skip will also not be logged to `corpus.meta_index`.

The following call will write _only_ the following new fields: `clean_text`, `wc` and `arcs` (`wc` is not listed in `fields_to_skip`, so it is written too).

In [40]:
corpus.dump('./', save_to_existing_path=True, 
            fields_to_skip={'utterance': ['parsed','tagged','wc_raw','ch',
                                         'char_per_word','word_per_char','qparsed']})
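The effect of `fields_to_skip` on the utterance file can be sketched as a filter applied to each `meta` dict before serialization. `serialize_utterances` is a hypothetical function and the record layout is an assumption made for illustration, not the actual format of `utterances.jsonl`.

```python
import json


def serialize_utterances(utterances, fields_to_skip=()):
    """Hypothetical sketch: one JSON line per utterance, minus skipped fields."""
    skip = set(fields_to_skip)
    lines = []
    for utt in utterances:
        kept = {k: v for k, v in utt["meta"].items() if k not in skip}
        lines.append(json.dumps({"id": utt["id"], "text": utt["text"], "meta": kept}))
    return lines


utts = [{"id": "u1", "text": "I played -- a tennis match.",
         "meta": {"is_question": False, "clean_text": "I played a tennis match.",
                  "wc_raw": 6}}]
lines = serialize_utterances(utts, fields_to_skip=["wc_raw"])
```

The in-memory utterance keeps all of its fields; only the serialized copy omits the skipped ones.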

Second, to deal with fields that we'd like to keep around, but that we don't want to read and write to disk in a big batch with all the other corpus data, `corpus.dump_info` will dump fields of a Corpus object into separate files. This takes the following arguments as input:

* `obj_type`: which type of Corpus object you're dealing with.
* `fields`: a list of the fields to write. 
* `dir_name`: which directory to write to; by default will write to the directory you read the corpus from.

This function will write each field in `fields` to a separate file called `info.<field>.jsonl` where each line of the file is a json-serialized dict: `{"id": <ID of object>, "value": <object.get_info(field)>}`. 

In [41]:
corpus.dump_info('utterance',['parsed','tagged'])
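The file format described above is simple enough to sketch directly. `dump_field` is a hypothetical helper working on plain dicts; it writes one `{"id": ..., "value": ...}` line per utterance to `info.<field>.jsonl`, with `None` for utterances that lack the field.

```python
import json
import os
import tempfile


def dump_field(utterances, field, dir_name):
    """Hypothetical sketch: write one field to info.<field>.jsonl."""
    path = os.path.join(dir_name, f"info.{field}.jsonl")
    with open(path, "w") as f:
        for utt in utterances:
            record = {"id": utt["id"], "value": utt["meta"].get(field)}
            f.write(json.dumps(record) + "\n")
    return path


utts = [{"id": "a", "meta": {"wc": 22}}, {"id": "b", "meta": {}}]
with tempfile.TemporaryDirectory() as tmp:
    path = dump_field(utts, "wc", tmp)
    filename = os.path.basename(path)
    with open(path) as f:
        written = [json.loads(line) for line in f]
```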

As expected, we now have the following files in our directory:

In [42]:
ls $ROOT_DIR

conversations.json  index.json         info.tagged.jsonl  utterances.jsonl
corpus.json         info.parsed.jsonl  users.json


If we now initialize a new corpus by reading from this directory:

In [43]:
new_corpus = convokit.Corpus(ROOT_DIR)

In [44]:
new_utt = new_corpus.get_utterance(test_utt_id)

We see that things that we've omitted in the `corpus.dump` call will not be read.

In [45]:
new_utt.meta.keys()

dict_keys(['is_answer', 'is_question', 'pair_idx', 'clean_text', 'wc', 'arcs'])

In [46]:
new_utt.get_info('arcs')

['friends_* friends_guy friends_many guy_* guy_japanese japanese_* many_* me_* went_* went_friends went_with went_yeah with_* with_me yeah>* yeah_*',
 'i_* so>* so>i so_* was_* was_i was_so',
 'homesick_* i>* i_* like_* was_* was_homesick was_i was_like',
 'but>* but>now get_* get_homesick get_i get_now get_sometimes homesick_* i_* now_* sometimes_*']

As a counterpart to `corpus.dump_info` we can also load auxiliary information on-demand. Here, this call will look for `info.<field>.jsonl` in the directory of `new_corpus` (or an optionally-specified `dir_name`) and attach the value specified in each line of the file to the utterance with the associated id:

In [47]:
new_corpus.load_info('utterance',['parsed'])
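The loading side can be sketched in the same spirit. `load_field` is a hypothetical helper that reads `info.<field>.jsonl` line by line and attaches each value to the utterance with the matching id; silently ignoring ids that are not in the corpus is an assumption of this sketch.

```python
import json
import os
import tempfile


def load_field(utterances_by_id, field, dir_name):
    """Hypothetical sketch: attach values from info.<field>.jsonl by id."""
    path = os.path.join(dir_name, f"info.{field}.jsonl")
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            utt = utterances_by_id.get(record["id"])
            if utt is not None:  # assumption: skip ids we have not loaded
                utt["meta"][field] = record["value"]


utts_by_id = {"a": {"meta": {}}, "b": {"meta": {}}}
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "info.wc.jsonl"), "w") as f:
        f.write(json.dumps({"id": "a", "value": 22}) + "\n")
        f.write(json.dumps({"id": "zzz", "value": 1}) + "\n")  # unknown id
    load_field(utts_by_id, "wc", tmp)
```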

In [48]:
new_utt.get_info('parsed')

[{'rt': 5,
  'toks': [{'dep': 'intj', 'dn': [], 'tag': 'UH', 'tok': 'Yeah', 'up': 5},
   {'dep': 'punct', 'dn': [], 'tag': ',', 'tok': ',', 'up': 5},
   {'dep': 'cc', 'dn': [], 'tag': 'CC', 'tok': 'but', 'up': 5},
   {'dep': 'amod', 'dn': [], 'tag': 'JJ', 'tok': 'many', 'up': 4},
   {'dep': 'nsubj', 'dn': [3, 10], 'tag': 'NNS', 'tok': 'friends', 'up': 5},
   {'dep': 'ROOT', 'dn': [0, 1, 2, 4, 6, 8, 11], 'tag': 'VBD', 'tok': 'went'},
   {'dep': 'prep', 'dn': [7], 'tag': 'IN', 'tok': 'with', 'up': 5},
   {'dep': 'pobj', 'dn': [], 'tag': 'PRP', 'tok': 'me', 'up': 6},
   {'dep': 'punct', 'dn': [], 'tag': ',', 'tok': ',', 'up': 5},
   {'dep': 'amod', 'dn': [], 'tag': 'JJ', 'tok': 'Japanese', 'up': 10},
   {'dep': 'appos', 'dn': [9], 'tag': 'NN', 'tok': 'guy', 'up': 4},
   {'dep': 'punct', 'dn': [], 'tag': '.', 'tok': '.', 'up': 5}]},
 {'rt': 2,
  'toks': [{'dep': 'advmod', 'dn': [], 'tag': 'RB', 'tok': 'So', 'up': 2},
   {'dep': 'nsubj', 'dn': [], 'tag': 'PRP', 'tok': 'I', 'up': 2},
   {'de

### Question: Error handling

If `dump_info` is called on a non-existent field, then it will write an entire file consisting of `{'id': <ID>, 'value': None}` lines, one per object in the Corpus. (It could in principle skip objects which don't have that field and hence write an empty file, though see next question.) Here again, we could facilitate error-checking with a Corpus-level metadata registry.

### Question: Partial loading

Ideally, we'd like `load_info` to behave exactly like the corpus constructor when we are only reading a subset of utterances -- that is, passing the same `utterance_start_index` and `utterance_end_index` to `load_info` as to the corpus constructor should return auxiliary information corresponding to the subset of utterances that were loaded. 

The issue here is that I don't think a corpus currently has a "canonical" order in which utterances are listed (I see calls to `list(corpus.get_utterance_ids())` which aren't guaranteed to be deterministic). And of course, this partial loading would have to be implemented in `load_info`.
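One way to get a deterministic subset, sketched under assumptions: fix a canonical order (here, simply sorting ids) and have both the constructor and `load_info` slice the same ordered list. `select_ids` is hypothetical, and treating the end index as exclusive is an assumption; the existing constructor arguments may behave differently.

```python
def select_ids(utterance_ids, start_index=None, end_index=None):
    """Hypothetical sketch: impose a canonical order, then slice it."""
    ordered = sorted(utterance_ids)  # canonical order: sorted ids
    return ordered[start_index:end_index]  # assumption: end is exclusive


ids = {"1681_14.q", "1681_14.a", "1700_02.a"}
subset_for_corpus = select_ids(ids, 0, 2)
subset_for_info = select_ids(list(ids), 0, 2)  # same ids, different container
```

Since the order no longer depends on how the ids happen to be stored, both calls pick out the same utterances.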

### Question: Default behavior of corpus.dump

This function call looks terrible, especially once you realize that after the dust has settled we've only written three additional fields (`clean_text`, `wc` and `arcs`) to disk.

```
corpus.dump('./', save_to_existing_path=True, 
            fields_to_skip={'utterance': ['parsed','tagged','wc_raw','ch',
                                         'char_per_word','word_per_char','qparsed']})
```


Another consideration is that when we dump corpora, we want to make sure that the metadata we shipped the corpus with get dumped as well (e.g., the `is_question` field should always be dumped). So one alternative, making use of the hypothetical metadata registry, could be as follows:

* The registry keeps track of the "canonical" fields to dump by default;
* When we create a new transformer where we can specify the output field, we should also have an option to specify whether this new field should be registered in the registry as a "dump-by-default" field, an "ad-hoc" field, or a "probably dump me separately" field. I don't know what the default behaviour of this should be, but clearly we'd need to set a default.
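A rough sketch of what such a registry might look like. Everything here is hypothetical: the class, the policy names, and in particular the choice of "ad-hoc" as the default policy, which is arbitrary and only there because some default has to be picked.

```python
DUMP_BY_DEFAULT = "dump-by-default"
AD_HOC = "ad-hoc"
DUMP_SEPARATELY = "dump-separately"


class FieldRegistry:
    """Hypothetical registry mapping each field name to a dump policy."""

    def __init__(self):
        self.policies = {}

    def register(self, field, policy=AD_HOC):  # default chosen arbitrarily
        self.policies[field] = policy

    def fields_with(self, policy):
        return sorted(f for f, p in self.policies.items() if p == policy)


registry = FieldRegistry()
registry.register("is_question", DUMP_BY_DEFAULT)  # shipped with the corpus
registry.register("clean_text", DUMP_BY_DEFAULT)
registry.register("parsed", DUMP_SEPARATELY)
registry.register("wc_raw")  # falls back to the ad-hoc default
```

With this in place, a dump call could write `registry.fields_with(DUMP_BY_DEFAULT)` without needing a long `fields_to_skip` list.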

### Question: interoperability with merging corpora

I have no idea whether any of this currently interoperates with merging corpora.

## Per-utterance calls

`TextProcessor` objects also support calls per-utterance via `TextProcessor.transform_utterance()` (in principle, it could also take a list of utterances, but that's not implemented yet). However, in general, these calls require some assumptions about the nature of the utterance you pass in (see pipelining below).

First, if the `TextProcessor` in question has the input field defaulting to `utt.text` (as is the case with our preprocessor that removes hyphens), then it can either take in a string or an utterance. In either case, it will return an utterance with the new output field:

In [49]:
test_str = "I played -- a tennis match."

In [50]:
prep.transform_utterance(test_str)

Utterance({'id': None, 'user': None, 'root': None, 'reply_to': None, 'timestamp': None, 'text': 'I played -- a tennis match.', 'meta': {'clean_text': 'I played a tennis match.'}})

In [51]:
from convokit.model import Utterance

In [52]:
adhoc_utt = Utterance(text=test_str)

In [53]:
prep.transform_utterance(adhoc_utt)

Utterance({'id': None, 'user': None, 'root': None, 'reply_to': None, 'timestamp': None, 'text': 'I played -- a tennis match.', 'meta': {'clean_text': 'I played a tennis match.'}})

Likewise for `wc_raw`, which wordcounts the raw utterance text:

In [54]:
wc_raw.transform_utterance(test_str)

Utterance({'id': None, 'user': None, 'root': None, 'reply_to': None, 'timestamp': None, 'text': 'I played -- a tennis match.', 'meta': {'wc_raw': 6}})

However, if the `TextProcessor` is expecting a particular input field, then it will not take strings as input. This is the case for `wc`, which wordcounts de-hyphenated text, and is hence expecting a field called `clean_text`.

In [55]:
wc.transform_utterance(test_str)

ValueError: expecting utterance, not string

An alternate behavior would be to wrap the string in an utterance but not subsequently annotate the utterance with the output field. This would be more consistent with the behavior for utterances: if the utterance does not have the desired input field, it is returned with no additional fields annotated. So, uh, TODO.
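The current string-versus-utterance behavior can be summarized in a short sketch. `SketchUtterance` and `transform_one` are hypothetical stand-ins, not the actual implementation; they mirror what is described above: strings are wrapped only when the processor reads the raw text, and an utterance lacking the input field comes back unchanged.

```python
class SketchUtterance:
    """Hypothetical stand-in for an utterance."""

    def __init__(self, text, meta=None):
        self.text = text
        self.meta = dict(meta or {})


def transform_one(utt, proc_fn, output_field, input_field=None):
    """Hypothetical sketch of a per-utterance call."""
    if isinstance(utt, str):
        if input_field is not None:
            raise ValueError("expecting utterance, not string")
        utt = SketchUtterance(utt)  # wrap the raw string
    if input_field is None:
        value = utt.text
    elif input_field in utt.meta:
        value = utt.meta[input_field]
    else:
        return utt  # missing input field: no-op
    utt.meta[output_field] = proc_fn(value)
    return utt


def count_words(text):
    return len(text.split())


test_str = "I played -- a tennis match."
wrapped = transform_one(test_str, count_words, "wc_raw")
try:
    transform_one(test_str, count_words, "wc", input_field="clean_text")
    raised = False
except ValueError:
    raised = True
```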

### Question: default behaviors

First, recall that `TextProcessor` is initialized with an optional `input_filter` argument. Right now, if a single utterance passed to `transform_utterance` doesn't satisfy the input filter, then the call is a no-op, unless optional argument `override_input_filter` is set to `True`.

Second, there's the question of whether erroneous calls should be no-ops or should raise errors (in the latter case, it's then on the end-user to do the error handling).
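The filter-then-override behavior from the first point can be sketched as follows. `filtered_transform` is a hypothetical function over plain dicts; only the `override_input_filter` flag and the `input_filter(utterance, aux_input)` signature are taken from the description above.

```python
def filtered_transform(utt, proc_fn, output_field, input_filter,
                       aux_input=None, override_input_filter=False):
    """Hypothetical sketch: a per-utterance call that respects an input filter."""
    aux_input = aux_input or {}
    if not override_input_filter and not input_filter(utt, aux_input):
        return utt  # filter rejects the utterance: no-op
    utt["meta"][output_field] = proc_fn(utt["text"])
    return utt


def is_question(utt, aux):
    return utt["meta"]["is_question"]


answer = {"text": "Yeah, it was.", "meta": {"is_question": False}}
filtered_transform(answer, str.upper, "shout", is_question)
skipped = "shout" not in answer["meta"]  # the filter blocked the call
filtered_transform(answer, str.upper, "shout", is_question,
                   override_input_filter=True)
```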

## Pipelines

Finally, we can string together multiple transformers, and hence `TextProcessors`, into a pipeline. This actually works right out of the box if we use scikit-learn's `Pipeline` class; I've wrapped this up in a `ConvokitPipeline` class which also supports the `transform_utterance` call. 

In [57]:
from convokit.convokitPipeline import ConvokitPipeline

As an example, suppose we want to go from raw text to outputting dependency parse arcs. We can chain the required steps to get there by initializing `ConvokitPipeline` with a list of steps, each represented as a tuple of `(<step name>, <initialized transformer-like object>)`:

* `'prep'`, our de-hyphenator
* `'parse'`, our parser
* `'arcs'`, our arc extractor.

Step `k` expects that step `k-1` will hand it a corpus (or utterance) with the requisite input fields computed. Note that there's some redundancy here with specifying which input and output fields to look for.

In [58]:
arc_pipe = ConvokitPipeline([('prep', TextProcessor(preprocess_text, 'clean_text_pipe')),
                ('parse', TextParser('parsed_pipe', input_field='clean_text_pipe',
                                    verbosity=50)),
                ('arcs', TextToArcs('arcs_pipe', input_field='parsed_pipe'))])
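The chaining itself, where step `k` reads what step `k-1` wrote, can be illustrated without scikit-learn. Both classes below are hypothetical stand-ins over plain dicts (they are not `ConvokitPipeline` or `TextProcessor`), and the second step is a word counter rather than a parser to keep the sketch dependency-free.

```python
class FieldStep:
    """Hypothetical step: reads one field (or the raw text), writes another."""

    def __init__(self, proc_fn, output_field, input_field=None):
        self.proc_fn = proc_fn
        self.output_field = output_field
        self.input_field = input_field

    def transform_utterance(self, utt):
        if self.input_field is None:
            value = utt["text"]
        elif self.input_field in utt["meta"]:
            value = utt["meta"][self.input_field]
        else:
            return utt  # requisite input field missing: no-op
        utt["meta"][self.output_field] = self.proc_fn(value)
        return utt

    def transform(self, utterances):
        return [self.transform_utterance(u) for u in utterances]


class SketchPipeline:
    """Hypothetical pipeline: applies (name, step) pairs in order."""

    def __init__(self, steps):
        self.steps = steps

    def transform(self, utterances):
        for _name, step in self.steps:
            utterances = step.transform(utterances)
        return utterances

    def transform_utterance(self, utt):
        for _name, step in self.steps:
            utt = step.transform_utterance(utt)
        return utt


pipe = SketchPipeline([
    ("prep", FieldStep(lambda t: t.replace(" -- ", " "), "clean_text_pipe")),
    ("wc", FieldStep(lambda t: len(t.split()), "wc_pipe",
                     input_field="clean_text_pipe")),
])
result = pipe.transform_utterance({"text": "I played -- a tennis match.", "meta": {}})
```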

In [59]:
corpus = arc_pipe.transform(corpus)

050/200 utterances processed
100/200 utterances processed
150/200 utterances processed


In [60]:
utt.get_info('arcs_pipe')

['friends_* friends_guy friends_many guy_* guy_japanese japanese_* many_* me_* went_* went_friends went_with went_yeah with_* with_me yeah>* yeah_*',
 'i_* so>* so>i so_* was_* was_i was_so',
 'homesick_* i>* i_* like_* was_* was_homesick was_i was_like',
 'but>* but>now get_* get_homesick get_i get_now get_sometimes homesick_* i_* now_* sometimes_*']

As promised, the pipeline also works to transform utterances, assuming that the utterance you've passed in agrees with the input that the very first transformer in the pipeline is expecting.

In [61]:
arc_pipe.transform_utterance(test_str)

Utterance({'id': None, 'user': None, 'root': None, 'reply_to': None, 'timestamp': None, 'text': 'I played -- a tennis match.', 'meta': {'clean_text_pipe': 'I played a tennis match.', 'parsed_pipe': [{'rt': 1, 'toks': [{'tok': 'I', 'tag': 'PRP', 'dep': 'nsubj', 'up': 1, 'dn': []}, {'tok': 'played', 'tag': 'VBD', 'dep': 'ROOT', 'dn': [0, 4, 5]}, {'tok': 'a', 'tag': 'DT', 'dep': 'det', 'up': 4, 'dn': []}, {'tok': 'tennis', 'tag': 'NN', 'dep': 'compound', 'up': 4, 'dn': []}, {'tok': 'match', 'tag': 'NN', 'dep': 'dobj', 'up': 1, 'dn': [2, 3]}, {'tok': '.', 'tag': '.', 'dep': 'punct', 'up': 1, 'dn': []}]}], 'arcs_pipe': ['a_* i>* i_* match_* match_a match_tennis played_* played_i played_match tennis_*']}})

### Question: robustness

Note that I haven't fully thought through the cases where this fairly direct application of `sklearn.pipeline` might not work.

### Question: Pipelining and per-utterance calls

Per the implementation here, per-utterance calls on a transformer won't work (either they'll crash if you've passed in a string and we were expecting an input field, or they will silently do nothing) unless we pass in an utterance where all the requisite input fields are already there. My proposed way to handle this is:

* For individual transformers, it is indeed up to the end-user to provide a well-formatted utterance.
* If we actually want to implement transformers where the user can start at the very beginning, by passing in a raw string, we would actually need to use a pipeline to get the raw string to the well-formed utterance.

Note this isn't quite what Cristian was proposing:

* For individual transformers, check if well-formatted. If not, somehow have the transformer keep track of the full antecedent list of transformers, and run the input string through all of them -- in other words, include pipelining with all transformers by default.

I disfavor this approach since I feel that it makes actually writing custom transformers really annoying, and undermines the flexibility of having transformers potentially take in different input fields (e.g., being able to specify which variant of a pre-processed text you want to further transform). 