# DSCI 691: Natural Language Processing with Deep Learning <br> Chapter 2: Tasks, objectives, and optimization


## 2.0 Motivation

While DL has certainly revolutionized NLP, at its core NLP with DL is still an area of work that must resolve specific, human-like tasks as shallow ML experiments. Abstracting NLP tasks as optimizable objectives is no small effort, and it ultimately has an exceptional impact on how useful systems are in their downstream function. With this in mind, the key takeaways from this section should be pipelined as understanding:

1. a range of common and typical NLP problems (by no means all), their relationships, and their representations in data, sufficient to
2. abstract NLP tasks from problems into optimizable objectives; and
3. apply the patterns of mathematical formulation necessary to optimize subsequent DL systems.

## 2.1 Tasks

In this section we'll review some NLP tasks and their data representations. One thing we'll want to keep in mind is that, by nature, tasks in NLP are generally discrete, i.e., categorical prediction tasks, whether focused on prediction at coarse-grained (e.g., on documents) or fine-grained levels (e.g., on words).

### 2.1.1 Named Entity Recognition (NER)
Oftentimes, in NLP, we are concerned with identifying the "things" of interest. Perhaps we're interested in constructing a model that can identify the who, what, where, and when that is contained in a text for further processing. 

The task of identifying these "things" is typically referred to as Named Entity Recognition or NER. 

Consider the following sentence and annotation: 

```
Last night , Paris Hilton wowed in a sequin gown .
              PER   PER
```

Here, we can see that the annotation is focused on the "who" aspect. Both `Paris` and `Hilton` are annotated as `PER`, or the label for a Person entity.

Furthermore, even if we annotate both `Paris` and `Hilton` with the `PER` class, a proper system may need to then link these two words into a unifying entity of `Paris Hilton` the celebrity (as well as any other reference to such person). This extension takes a task beyond NER into the realm of named-entity linking (NEL).

Another example could look like: 

```
Samuel Quinn was arrested in the Hilton Hotel in Paris in April 1989 .
 PER    PER                       LOC    LOC      LOC      DATE DATE
```

Here, it is important to be aware of the fact that `Paris` is now part of the `LOC` class. This is important information, especially if one is intending to do entity linking downstream. Paris the city is obviously not Paris Hilton. A good system will be able to discriminate between these types.

Already, one can begin to see why this task is difficult. A good NER system should be able to identify which tokens are of interest and assign them an appropriate label. Many tokens are not assigned an entity class, increasing the sparsity of the signal. 

Furthermore, many named entities are an open class. If I say that I went down to Fred's Pizza Shack, it is easy enough for a human to understand that I went to a location (`LOC`). But can an NER system detect that, even if it's never seen such an entity before?

#### 2.1.1.1 NER data

For easy access to data, we can use `pip` to install the `datasets` package:

In [None]:
!pip install datasets


The `'conll2003'` data set has a long history in the field and will suffice for our introduction:

In [None]:
import datasets

ds = datasets.load_dataset('conll2003')
ds


DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

This output tells us that our dataset is pre-split into train, validation, and test splits (each with a number of samples denoted as the `num_rows` output). We can also see the names of the features included with the dataset.

Let's take a look at a sample from the training set:

In [None]:
ds['train'][0]

{'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'id': '0',
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.']}

Here, the NER tags take one of 9 different categories that we'd like to predict for each of the sentence's tokens, but they are stored as integer indices, i.e., ready for a vector representation. To see them in the more intuitive format we introduced above, i.e., as token-level string labels, we'll need to pull the label names (`feature_names`) out of the data set's feature metadata, map each index to its name (`token_labels`), and zip the result with the `'tokens'`. The example below uses the sample at index 3, a longer sentence than the one just shown:

In [None]:
feature_names = ds['train'].features['ner_tags'].feature.names
token_labels = [feature_names[i] for i in ds['train'][3]['ner_tags']]
list(zip(ds['train'][3]['tokens'], token_labels))

[('The', 'O'),
 ('European', 'B-ORG'),
 ('Commission', 'I-ORG'),
 ('said', 'O'),
 ('on', 'O'),
 ('Thursday', 'O'),
 ('it', 'O'),
 ('disagreed', 'O'),
 ('with', 'O'),
 ('German', 'B-MISC'),
 ('advice', 'O'),
 ('to', 'O'),
 ('consumers', 'O'),
 ('to', 'O'),
 ('shun', 'O'),
 ('British', 'B-MISC'),
 ('lamb', 'O'),
 ('until', 'O'),
 ('scientists', 'O'),
 ('determine', 'O'),
 ('whether', 'O'),
 ('mad', 'O'),
 ('cow', 'O'),
 ('disease', 'O'),
 ('can', 'O'),
 ('be', 'O'),
 ('transmitted', 'O'),
 ('to', 'O'),
 ('sheep', 'O'),
 ('.', 'O')]

Here, we can now see that it is necessary to distinguish labels for tokens that begin an entity (`B`-type) from those that continue one (`I`-type), with `O` marking tokens outside of any entity, and likewise to distinguish the semantic category of the entities, e.g., organization (`ORG`) vs. person (`PER`).
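
To make the `B`/`I`/`O` scheme concrete, here is a minimal sketch of how token-level labels can be grouped back into entity spans. The function name `bio_to_spans` and its lenient handling of a stray `I-` tag are our own illustrative choices, not part of the `datasets` library or the CoNLL tooling:

```python
def bio_to_spans(tokens, labels):
    """Group BIO labels into (entity_type, start, end, text) spans; end is exclusive."""
    spans = []
    start, ent_type = None, None
    for i, label in enumerate(labels):
        prefix, _, tag = label.partition("-")
        # An open span continues only on an I- tag of the same entity type
        continues = prefix == "I" and ent_type == tag
        if start is not None and not continues:
            spans.append((ent_type, start, i, " ".join(tokens[start:i])))
            start, ent_type = None, None
        if prefix == "B" or (prefix == "I" and start is None):
            # Assumption: a stray I- tag with no open span starts a new entity
            start, ent_type = i, tag
    if start is not None:
        # Close a span that runs to the end of the sentence
        spans.append((ent_type, start, len(tokens), " ".join(tokens[start:])))
    return spans


tokens = ["The", "European", "Commission", "said", "on", "Thursday"]
labels = ["O", "B-ORG", "I-ORG", "O", "O", "O"]
spans = bio_to_spans(tokens, labels)
print(spans)
```

Grouping labels this way is also what makes span-level evaluation possible, since an entity only counts as found when all of its tokens are labeled correctly.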

There are other features we might consider and which are explained to some extent in the [CoNLL2003 Documentation page](https://huggingface.co/datasets/conll2003) on the HuggingFace Dataset Hub. These are: 

* `id`: a string feature.
* `tokens`: a list of string features.
* `pos_tags`: a list of classification labels, with possible values including " (0), '' (1), # (2), $ (3), ( (4).
* `chunk_tags`: a list of classification labels, with possible values including O (0), B-ADJP (1), I-ADJP (2), B-ADVP (3), I-ADVP (4).
* `ner_tags`: a list of classification labels, with possible values including O (0), B-PER (1), I-PER (2), B-ORG (3), I-ORG (4).

But let's instead check out our full class distribution of the training portion to get a sense of the nature of this prediction problem as a multi-class classification problem:

In [None]:
from collections import Counter 

Counter([cls for sample in ds['train'] for cls in sample['ner_tags']])

Counter({0: 169578,
         1: 6600,
         2: 4528,
         3: 6321,
         4: 3704,
         5: 7140,
         6: 1157,
         7: 3438,
         8: 1155})

Here, we can see that while this is not a binary classification (2-class) task, there is an extreme class imbalance, with the `0`-class (the `O` label, i.e., not part of an entity) dominating. Hence, when it comes to evaluation, positive prediction metrics like precision, recall, and $F_1$ will be essential to interpreting the capacities of our algorithms. However, macro- or micro-averaged positive prediction metrics will not be the objectives we optimize during training, as they are not smoothly learnable for our categorical prediction tasks.
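
To see why accuracy alone is misleading here: from the counts above, a system that always predicts class `0` would be right on 169578 of the 203621 training tokens, roughly 83%, while finding no entities at all. The sketch below illustrates the same effect on a small toy label sequence (the `gold` and `pred` lists are made up for illustration, and `per_class_prf` is our own helper, not a library function):

```python
from collections import Counter


def per_class_prf(gold, pred):
    """Return {class: (precision, recall, f1)} for every class seen in gold or pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted p, but it was wrong
            fn[g] += 1  # missed the true class g
    scores = {}
    for c in sorted(set(gold) | set(pred)):
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[c] = (precision, recall, f1)
    return scores


# Toy labels (illustrative only): class 0 plays the role of 'O'
gold = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]
pred = [0] * len(gold)  # a baseline that always predicts 'O'

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
scores = per_class_prf(gold, pred)
# Macro-average F1 over the entity classes only (everything except class 0)
entity_f1 = [f1 for c, (_, _, f1) in scores.items() if c != 0]
macro_f1 = sum(entity_f1) / len(entity_f1)
print("accuracy:", accuracy, "macro F1 over entity classes:", macro_f1)
```

The baseline scores 0.8 accuracy but 0.0 macro-averaged $F_1$ on the entity classes, which is the behavior the positive prediction metrics are meant to expose.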

### 2.1.2 Sequence labeling, in general
While NER might be an extremely common form of _sequence labeling_ task, it is in fact just one of many from which we might abstract common architectural elements for DL applications. For example, part-of-speech (POS) tags are not known a priori, and therefore must be predicted if they are to be used by downstream tasks:

In [None]:
feature_names = ds['train'].features['pos_tags'].feature.names
token_labels = [feature_names[i] for i in ds['train'][3]['pos_tags']]
list(zip(ds['train'][3]['tokens'], token_labels))

[('The', 'DT'),
 ('European', 'NNP'),
 ('Commission', 'NNP'),
 ('said', 'VBD'),
 ('on', 'IN'),
 ('Thursday', 'NNP'),
 ('it', 'PRP'),
 ('disagreed', 'VBD'),
 ('with', 'IN'),
 ('German', 'JJ'),
 ('advice', 'NN'),
 ('to', 'TO'),
 ('consumers', 'NNS'),
 ('to', 'TO'),
 ('shun', 'VB'),
 ('British', 'JJ'),
 ('lamb', 'NN'),
 ('until', 'IN'),
 ('scientists', 'NNS'),
 ('determine', 'VBP'),
 ('whether', 'IN'),
 ('mad', 'JJ'),
 ('cow', 'NN'),
 ('disease', 'NN'),
 ('can', 'MD'),
 ('be', 'VB'),
 ('transmitted', 'VBN'),
 ('to', 'TO'),
 ('sheep', 'NN'),
 ('.', '.')]

In this case, POS tagging entails choice from many more classes than our 9-class tags for entities did:

In [None]:
Counter([cls for sample in ds['train'] for cls in sample['pos_tags']])

Counter({0: 2178,
         1: 35,
         3: 427,
         4: 2866,
         5: 2866,
         6: 7291,
         7: 7389,
         8: 2386,
         10: 3653,
         11: 19704,
         12: 13453,
         13: 136,
         14: 166,
         15: 19064,
         16: 11831,
         17: 382,
         18: 254,
         19: 13,
         20: 1199,
         21: 23899,
         22: 34392,
         23: 684,
         24: 9903,
         25: 4,
         26: 33,
         27: 1553,
         28: 3163,
         29: 1520,
         30: 3975,
         31: 163,
         32: 35,
         33: 528,
         34: 439,
         35: 3469,
         36: 30,
         37: 4252,
         38: 8293,
         39: 2585,
         40: 4105,
         41: 1436,
         42: 2426,
         43: 506,
         44: 528,
         45: 23,
         46: 384})

However, as it turns out, POS tagging is in general a more resolved task (statistically, perhaps less challenging) than NER, exhibiting higher performance overall. In this respect, POS tagging is not unlike other grammatical sequence-labeling tasks. In general, to get a sense of how resolved a task is, and in which _domains_ (media contexts of text data), we can consult the current state-of-the-art (SOTA) systems and rankings. One such ranking is provided by Sebastian Ruder's NLP Progress repository on GitHub:
- https://github.com/sebastianruder/NLP-progress

Here, it's worth noting that a few other tasks on the list tend to generalize or specify NER-like sequence labeling tasks in different contexts, for example semantic role labeling.



#### 2.1.2.1 Semantic Role Labeling (SRL)
There are in general too many different types of specific sequence labeling tasks to explore in any one course, and to some degree it would miss the point to expect to learn about all of their idiosyncrasies. Instead, our goal should be to understand the complexities that can come along with approaching sequence-labeling tasks, and where exactly sequence-labeling tasks _can_ fit in with other, more coarse-grained NLP tasks. In this effort, our next exploration will be of _semantic role labeling_, where the goal is to identify token sequences (like named entities) that _answer_ specific, role-based questions about _entities_ in the text, like who, what, when, where, and why. If it's not yet clear, this task connects to NER by answering summary-level questions about sentences with specific points of justification for those answers.
```
|     UCD     |finished|the| 2006  |championship|as|Dublin|champions|,|
|B-who1/B-who2|   Q1   | 0 |B-what1|  I-what1   |0 |B-as1 |  I-as1  |0|
 ---------------------------------------------------------------------
|  by  | beating |       St       | Vincents  |in|  the  | final |.| 
|B-how1|Q2/I-how1|    B-whom2     |  I-whom2  |0 |B-when2|I-when2|0|
```

The excerpt is taken from the [dataset we'll explore](https://huggingface.co/datasets/qa_srl), which is once again available from the `datasets` module; the basis for its formation can be found in the [authors' paper](https://homes.cs.washington.edu/~lsz/papers/hlz-emnlp15.pdf):


In [None]:
ds = datasets.load_dataset('qa_srl')
ds


DatasetDict({
    train: Dataset({
        features: ['sentence', 'sent_id', 'predicate_idx', 'predicate', 'question', 'answers'],
        num_rows: 6414
    })
    validation: Dataset({
        features: ['sentence', 'sent_id', 'predicate_idx', 'predicate', 'question', 'answers'],
        num_rows: 2183
    })
    test: Dataset({
        features: ['sentence', 'sent_id', 'predicate_idx', 'predicate', 'question', 'answers'],
        num_rows: 2201
    })
})

Looking again at a single instance:

In [None]:
ds['train'][25]

{'answers': ['Larger geographical areas , called Territories',
  'Larger geographical areas',
  'Territories'],
 'predicate': 'led',
 'predicate_idx': 8,
 'question': ['what', 'is', '_', 'led', '_', '_', '_', '?'],
 'sent_id': 'WIKI1_7',
 'sentence': 'Larger geographical areas , called Territories , are led by a Territorial Commander , who is the highest-ranking officer in that Territory .'}

it should become clear that we've presented the concept in a slightly different way than we'll encounter it here in data. This is not uncommon, as different data sets are produced for somewhat different reasons, i.e., even specifying NER as SRL doesn't determine the exact nature of the task, so one must always review the data.

So, going deeper, each question is predicated on a verb in its instance. Our goal, then, is to find potentially many overlapping answers for one single question at a given time. Likewise, the answers to different questions can overlap, meaning tokens might be extracted as sequences to be labeled multiple times. Note as well that each question is organized as a fixed-length sequence of slots in annotation (seven slots plus the closing `?` in the example above), something we'd have to expect to produce to inform our pipeline, along with the detection of predicates.

As for processing, we'll have to enforce a space tokenization on the sentences to recover sequences. To put a single instance into a BIO-like format akin to NER, we can do the following:

In [None]:
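def annotate_instance(instance):
    # Space tokenization recovers the token sequence
    tokens = instance['sentence'].split(' ')
    for answer in instance['answers']:
        # One label sequence per answer: everything 'O' except the predicate ('Q')
        token_labels = ["O"]*len(tokens)
        token_labels[instance['predicate_idx']] = "Q"
        answer_tokens = answer.split(' ')
        # Label the span by the question's wh-word, e.g., B-what, I-what, ...
        answer_labels = ["I-"+instance['question'][0]
                        for i in range(len(answer_tokens))]
        answer_labels[0] = "B"+answer_labels[0][1:]
        # Note: tokens.index returns the FIRST occurrence of each answer token
        idx = [tokens.index(token) if token in tokens else -1 for token in answer_tokens]
        if -1 in idx:
            continue  # skip answers with tokens not found in the sentence
        prev_idx = 0
        for i, token in enumerate(answer_tokens):
            # Intended contiguity check; prev_idx is never updated, so it never breaks
            if idx[i] - prev_idx > 1 and prev_idx:
                break
        else:
            for i, ix in enumerate(idx):
                token_labels[ix] = answer_labels[i]

            yield tokens, token_labels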

In [None]:
instance = ds['train'][25]
tokens, token_labels = list(annotate_instance(instance))[0]
list(zip(tokens, token_labels))

[('Larger', 'B-what'),
 ('geographical', 'I-what'),
 ('areas', 'I-what'),
 (',', 'I-what'),
 ('called', 'I-what'),
 ('Territories', 'I-what'),
 (',', 'O'),
 ('are', 'O'),
 ('led', 'Q'),
 ('by', 'O'),
 ('a', 'O'),
 ('Territorial', 'O'),
 ('Commander', 'O'),
 (',', 'O'),
 ('who', 'O'),
 ('is', 'O'),
 ('the', 'O'),
 ('highest-ranking', 'O'),
 ('officer', 'O'),
 ('in', 'O'),
 ('that', 'O'),
 ('Territory', 'O'),
 ('.', 'O')]

This then allows us (roughly) to see the class distribution, i.e., the number of points of positive prediction:

In [None]:
Counter([label for instance in ds['train'] 
         for annotation in annotate_instance(instance) 
         for label in annotation[1]])

Counter({'B-how': 290,
         'B-how much': 61,
         'B-what': 3600,
         'B-when': 686,
         'B-where': 527,
         'B-who': 1842,
         'B-why': 208,
         'I-how': 1352,
         'I-how much': 117,
         'I-what': 17281,
         'I-when': 2089,
         'I-where': 2147,
         'I-who': 2747,
         'I-why': 1545,
         'O': 165844,
         'Q': 7483})

Note that having a complex _data loader_ like this is often necessary to complete a given application, particularly if the goal is to re-apply an existing system architecture. For example, in this case we might like to re-purpose an NER system to recognize NEs in the context of `'Q'` predicates. One would need a way of converting training instances like these into some standard input format, e.g., BIO.
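
One caveat with the loader above is that it locates each answer token with `tokens.index`, which returns the first occurrence of a token, so repeated tokens (like `,`) can be mapped to the wrong position. As a hedged alternative, the sketch below searches for the answer as one contiguous sub-sequence. The helpers `find_span` and `bio_labels` are hypothetical names of our own, and the sentence is a shortened version of the instance shown earlier:

```python
def find_span(tokens, answer_tokens):
    """Return the start index of the first contiguous match of answer_tokens, or -1."""
    m = len(answer_tokens)
    if m == 0:
        return -1
    for start in range(len(tokens) - m + 1):
        if tokens[start:start + m] == answer_tokens:
            return start
    return -1


def bio_labels(tokens, answer_tokens, role, predicate_idx):
    """Return BIO labels for one answer span, or None if the span is not found."""
    start = find_span(tokens, answer_tokens)
    if start == -1:
        return None
    labels = ["O"] * len(tokens)
    labels[predicate_idx] = "Q"          # mark the predicate
    labels[start] = "B-" + role          # first token of the answer
    for i in range(start + 1, start + len(answer_tokens)):
        labels[i] = "I-" + role          # remaining tokens of the answer
    return labels


# A shortened, space-tokenized version of the example sentence
tokens = "Larger geographical areas , called Territories , are led by a Territorial Commander .".split(" ")

print(find_span(tokens, [",", "are"]))   # start of the span beginning at the second comma
labels = bio_labels(tokens, ["Territories"], "what", 8)
print(list(zip(tokens, labels)))
```

Swapping this in for the original lookup would change which answers get annotated, so the class counts reported above should not be expected to carry over unchanged.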

### 2.1.3 Co-reference resolution
Moving away from sequence labeling and NER: while a semantic role labeling system that can predict the above might be very useful, one limitation a downstream application might face is that multiple answers might exist for a question, and they might not _all_ be the same answer in meaning. In this limited scenario and the above data, we might ask whether `'Larger geographical areas'` and `'Territories'` are entities which reference the same concept, i.e., are _co-references_. For a more detailed view, the example provided by Ruder is helpful:

```
               +-----------+
               |           |
"I voted for Obama because he was most aligned with my values", she said.
 |                                                  |            |
 +--------------------------------------------------+------------+
```

Here, models are most often trained using the data from the CoNLL-2012 shared task, which uses OntoNotes coreference annotations distributed by the Linguistic Data Consortium (LDC). You can access trial data freely via the [CoNLL page](https://cemantix.org/conll/2012/data.html), which provides pointers for retrieving the full data (via license) from the LDC; the [HuggingFace `neuralcoref` repository on GitHub](https://github.com/huggingface/neuralcoref/blob/master/neuralcoref/train/training.md) provides some more helpful information on how to preprocess the data set. If we wish to think about the number of positive predictions which exist in a data set, this depends on how many references (entities) are annotated in a document, since we then evaluate the sameness of pairs of them. This makes for a number of challenges in combinatorics and enumeration, but reduces the problem to the evaluation of binary (two-class) prediction.
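
As a small illustration of this pairwise framing, the sketch below enumerates all mention pairs for the example sentence and labels each pair as co-referent or not. The mention list and cluster ids are our own reading of the diagram above, not OntoNotes annotations:

```python
from itertools import combinations

# (mention, cluster id) pairs; the cluster assignment is an illustrative assumption
mentions = [("I", 0), ("Obama", 1), ("he", 1), ("my", 0), ("she", 0)]

# Every unordered pair of mentions becomes one binary prediction problem
pairs = [(m1, m2, int(c1 == c2))
         for (m1, c1), (m2, c2) in combinations(mentions, 2)]

n = len(mentions)
positives = [p for p in pairs if p[2] == 1]
print(len(pairs), "pairs from", n, "mentions;", len(positives), "are co-referent")
```

With $n$ mentions there are $n(n-1)/2$ pairs, so the number of candidate decisions grows quadratically with document length while the positives stay comparatively rare.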

However, to discuss the task and its use more generally, co-references are objects whose resolution is often sought at larger scales, e.g., between sentences in a given document, or across posts on a social media network. Some readily available data from within the `datasets` module could clearly utilize co-reference resolution as a part of its task, which we discuss more closely next.

#### 2.1.3.1 Example Use in Other Tasks: Conversational Disentanglement
Another complex, downstream task that could utilize more basic tasks (such as NER/SRL), and which we'll review next, is referred to as conversational disentanglement, where the goal is to figure out which messages are in reference to which others. The data can be [found here](https://huggingface.co/datasets/irc_disentangle) and is [documented by its authors here](https://www.aclweb.org/anthology/P19-1374.pdf).

In [None]:
ds = datasets.load_dataset('irc_disentangle')
ds



DatasetDict({
    train: Dataset({
        features: ['id', 'raw', 'ascii', 'tokenized', 'date', 'connections'],
        num_rows: 220616
    })
    test: Dataset({
        features: ['id', 'raw', 'ascii', 'tokenized', 'date', 'connections'],
        num_rows: 15010
    })
    validation: Dataset({
        features: ['id', 'raw', 'ascii', 'tokenized', 'date', 'connections'],
        num_rows: 12510
    })
})

Here's an example within the data where a connection must be drawn between two _posts_:

In [None]:
ds['train'][1050], ds['train'][1055]

({'ascii': "[03:57] <Xophe> (also, I'm guessing that this isn't a good place to report minor but annoying bugs... what is?)",
  'connections': [1048, 1054, 1055, 1072, 1073],
  'date': '2004-12-25',
  'id': 1050,
  'raw': "[03:57] <Xophe> (also, I'm guessing that this isn't a good place to report minor but annoying bugs... what is?)",
  'tokenized': "<s> ( also , i 'm guessing that this is n't a good place to report minor but annoying bugs ... what is ?) </s>"},
 {'ascii': '[03:59] <superted> Xophe: allthough the bug might be minor, it can bring down the user experience',
  'connections': [1050, 1059, 1060],
  'date': '2004-12-25',
  'id': 1055,
  'raw': '[03:59] <superted> Xophe: allthough the bug might be minor, it can bring down the user experience',
  'tokenized': '<s> <user> : allthough the bug might be minor , it can bring down the user experience </s>'})

The essential component we must predict is within the `'connections'` field, where we are required to identify which other messages (by index) are referred to by a given post. So perhaps in the example we find the entities:
```
[03:57] <Xophe> (also, I'm guessing that this isn't a good place to report minor but annoying bugs... what is?)
           |                              |                                                    |
           +                              +                                                    +
          PER                            ORG                                                  MISC
```

if we can _link_ the `'MISC'` and `'PER'` entities well to those in another:
```
[03:59] <superted> Xophe: allthough the bug might be minor, it can bring down the user experience
            |        |                   | 
            +        +                   + 
           PER      PER                 MISC
```

Then we will have formed a connection between the two posts. Once again, our evaluation can be framed as a binary classification problem (with respect to disentanglement) on pairs of the unit of analysis, this time _posts_. However, the named entities and co-references we resolve (at some previously-determined accuracy) will ultimately be what we use as intermediate features for the disentanglement task. At this point, we should reflect that this scenario is essentially no different than using _predictions_ of POS tags as intermediate features used to perform NER.

#### 2.1.3.2 Task data, metadata, features?
What should become clear at this point is that there's an especially fine line between task data (dependent) and the 'text' source of our predictions. 

You might notice in the disentanglement data that a lot of the 'text' is actually metadata (time stamps, and perhaps reply pointers) that's been machine-serialized into the text representation. Upon inspection, you might then wonder if these two non-text features are extractable via a rule-based algorithm, i.e., regular expressions. If so, our systems could benefit greatly from, e.g., never predicting a 'reply' towards a post in the future, or to the wrong individual. So here, it might be essential to build out some intensive preprocessing for the features.

To start on the routing information, it's essential to know who exists in the conversation, so first we might collect all identifiable user `sources` of posts. Note how we can use the timestamp as an anchor for user names in targeted posts:

In [None]:
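import re
from collections import Counter

# Count posts per user; the leading "[HH:MM] <user>" prefix anchors the user name
sources = Counter()
for ky in ['test', 'validation', 'train']:
    for post in ds[ky]:
        # Raw string avoids invalid-escape warnings for \[ and \d
        time_source = re.search(r"\[(\d\d:\d\d)\] <([^>]+)>", post['raw'])
        if time_source:
            time, source = time_source.groups()
            sources[source] += 1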
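
To check what the pattern captures, here it is applied to one of the raw lines shown earlier, written as a raw string. The second input, a join notice, is a hypothetical example of a line without the `[HH:MM] <user>` prefix:

```python
import re

# Same pattern as in the cell above: a timestamp group and a user-name group
pattern = re.compile(r"\[(\d\d:\d\d)\] <([^>]+)>")

raw = "[03:59] <superted> Xophe: allthough the bug might be minor, it can bring down the user experience"
time, source = pattern.search(raw).groups()
print(time, source)

# Hypothetical system message: no timestamp/user prefix, so there is no match
no_match = pattern.search("=== someuser has joined #ubuntu")
print(no_match)
```

Posts of the second kind are the ones that will need a timestamp imputed from their neighbors.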

Now that we know who's posting, we stand a chance at matching their names with who they're replying to (`targets`). Likewise, since this is a dataset of post-post relational (network) connections, it will be fast and convenient to store the posts by id (`post_data`, below). While we're at it, we can match the regular timestamp structure and parse it as a date in conjunction with the pre-existing `'date'` field. Note here the lack of precision in completing this task via a rule-based approach: inspecting samples of the data, it becomes clear that some `targets` are referenced differently, even though the source names are machine produced. This means the `target` field will have non-trivial veracity as a metadata feature and might ultimately be best left as a component of the unstructured `'text'` feature.

On the other side of this problem, note that the pre-processing code below relies on the order in which the data is presented to infer timestamps. Some posts clearly don't have them (like user sign ins/outs) and there may be other 'types' of posts&mdash;like announcements from moderators&mdash;which don't conform to the basic format given to conversational posts. Hence, to ensure a _reasonable_ timestamp exists (technically, this is imputation) for _every_ post, we can accept the rolling order of the data set and assign the most-recently observed time stamp to any post without one.

In [None]:
from datetime import datetime as dt

post_data = {}
for ky in ['test', 'validation', 'train']:
    post_data[ky] = dict()
    time = "00:00"  # carried forward to impute a time for posts without a timestamp
    for i, post in enumerate(ds[ky]):
        # "[HH:MM] <source> first-word": the first word is a candidate reply target
        source_target = re.search(r"\[(\d\d:\d\d)\] <([^>]+)> ([^ :,]+)", post['raw'])
        source_only = re.search(r"\[(\d\d:\d\d)\] <([^>]+)>", post['raw'])
        target = ""
        source = ""
        if source_target:
            time, source, target = source_target.groups()
            # Keep the target only if it matches a known posting user
            if target not in sources:
                target = ""
        elif source_only:
            time, source = source_only.groups()

        # Store each post by id, augmented with the extracted time and routing fields
        record = post
        record['time'] = time
        record['source'] = source
        record['target'] = target
        record['dateparsed'] = dt.strptime(record['date'] + " " + record['time'],
                                           "%Y-%m-%d %H:%M")
        post_data[ky][post['id']] = record
        if not i % 10000:
            print("finished pre-processing ", 100*i/len(ds[ky]),
                  "% of the time and routing information")

finished pre-processing  0.0 % of the time and routing information
finished pre-processing  66.62225183211193 % of the time and routing information
finished pre-processing  0.0 % of the time and routing information
finished pre-processing  79.93605115907275 % of the time and routing information
finished pre-processing  0.0 % of the time and routing information
finished pre-processing  4.5327628095877 % of the time and routing information
finished pre-processing  9.0655256191754 % of the time and routing information
finished pre-processing  13.5982884287631 % of the time and routing information
finished pre-processing  18.1310512383508 % of the time and routing information
finished pre-processing  22.663814047938498 % of the time and routing information
finished pre-processing  27.1965768575262 % of the time and routing information
finished pre-processing  31.729339667113898 % of the time and routing information
finished pre-processing  36.2621024767016 % of the time and routing informa
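
To make the parsing step concrete, here is a minimal sketch that applies the same two patterns to a few hypothetical IRC-style lines. The example lines, the `parse_routing` helper, and the `known_sources` set are assumptions for illustration, not part of the data set. It shows why the check against known sources matters: the first word of any message will match the 'target' group, whether or not it is a user handle.

```python
import re

# Same patterns as above, written as raw strings.
SOURCE_TARGET = re.compile(r"\[(\d\d:\d\d)\] <([^>]+)> ([^ :,]+)")
SOURCE_ONLY = re.compile(r"\[(\d\d:\d\d)\] <([^>]+)>")

def parse_routing(raw, known_sources):
    """Return (time, source, target); empty strings mark missing fields."""
    match = SOURCE_TARGET.search(raw)
    if match:
        time, source, target = match.groups()
        # Keep the candidate target only if it is a known user handle.
        if target not in known_sources:
            target = ""
        return time, source, target
    match = SOURCE_ONLY.search(raw)
    if match:
        time, source = match.groups()
        return time, source, ""
    return "", "", ""

# Hypothetical lines: an addressed post, an unaddressed post, a system message.
toy_lines = ["[03:59] <superted> Xophe: the bug might be minor",
             "[04:01] <Xophe> hello everyone",
             "=== newuser has joined"]
known_sources = {"superted", "Xophe"}
parsed = [parse_routing(line, known_sources) for line in toy_lines]
```

Note that the sketch returns empty strings when nothing matches, which is a design choice of this example only.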

#### 2.1.3.3 Building implied task data
So far, we've discussed how the disentanglement task here is in fact a binary prediction task on pairs of post ids. Considering this, we should ask:
> how do we store/sample _pairs_ of posts to binary predict on them for a connection?

Unfortunately, there are problems with treating the 'all pairs' data set as our data, in particular because of the combinatorality of the space: the number of pairs grows quadratically with the number of posts. Instead, what we should ask is:
> when a human interprets a conversational thread, which pairs of posts do they compare for connection?

For example, we might assume that a clever human reasonably decides it's not 'worth it' to compare posts made several years apart; this is just one reason why the parsed time stamps are so important! Likewise, we might assume that a human annotator will only consider target posts that are authored by a source's mentioned user handle. What we're _doing_ is defining a range of reasonable posts to _contrast_ our known 'true' connections against. This is the essence of what's referred to as _noise contrastive estimation_, which we'll study in more depth when we get to __Chapter 3__. To start along these lines, let's collect another object which tracks all positive 'samples' of pairs:

In [None]:
y = {}
for ky in ['test', 'validation', 'train']:
    y[ky] = dict()
    for post in ds[ky]:
        for connection in post['connections']:
            p1 = post_data[ky][post['id']]
            p2 = post_data[ky][connection]
            if ((p1['dateparsed'] - p2['dateparsed']).total_seconds() > 0):
                target = p2
                source = p1
            else:
                source = p2
                target = p1
            if post['id'] != connection:
                pair = tuple([source['id'], target['id']])
                y[ky][pair] = 1
print("there were ", sum(map(len, [y[ky] for ky in ['test', 'validation', 'train']])), 
      " positive connections across all folds")

there were  88900  positive connections across all folds


Note that, in the above, we were able to use the imputed timestamps to work around potentially malformed data. Looking closely, one can see that the data set (inconsistently) mixes which post, i.e., the source vs. target, receives the 'gold' connection index in the original format. Here, we're able to assert which _must_ be the source vs. target since we know _when_ they occurred.

Now that we know what the positives look like in the data set, we perhaps have enough information to construct the negatives. As it turns out, there are _a lot_ of viable negatives that exist under a simple assumption like: `targets` must be from the past of their `sources`. So a simple exploratory question worth asking might be:
> how many positive annotations exist outside of a given time horizon?

Provided most connections exist over a short range of time, e.g., within 10 minutes, one might reasonably constrain the size of the negative sample drastically while losing only marginal capacity for recall. Along these lines, let's see what percentage of connections fall within 10 minutes of each other (again, this requires having munged the time stamps):

In [None]:
## let's see what the largest time delta is amongst the 1's 
largest_delta = 0
threshold = 600
above_threshold = 0
total_pairs = 0
for ky in ['test', 'validation', 'train']:
    for pair in y[ky]:
        source = post_data[ky][pair[0]]
        target = post_data[ky][pair[1]]
        delta = (source['dateparsed'] - target['dateparsed']).total_seconds()
        total_pairs += 1
        if delta > threshold:
            above_threshold += 1
        if delta > largest_delta:
            largest_delta = delta
        if delta < 0:
            print(ky, delta, pair[0], pair[1])

print("there were ", above_threshold, " positive connections above the threshold of ", 
      threshold, " seconds apart in id/index")
print("the largest 'true' delta was: ", largest_delta)
print("this threshold allows us to recover: ", 
      100*(1 - above_threshold/sum(map(len, [y[ky] for ky in ['test', 'validation', 'train']]))), 
      " of the positives")

there were  834  positive connections above the threshold of  600  seconds apart in id/index
the largest 'true' delta was:  86340.0
this threshold allows us to recover:  99.06186726659168  of the positives
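
The same recall calculation can be swept over several candidate horizons to see the trade-off. The following is a minimal sketch on hypothetical time gaps; `recall_by_horizon` and the `toy_deltas` values are illustrative assumptions, not results from the data set.

```python
def recall_by_horizon(deltas, horizons):
    """Fraction of positive pairs whose time gap (seconds) is within each horizon."""
    total = len(deltas)
    return {h: sum(d <= h for d in deltas) / total for h in horizons}

# Hypothetical gaps (in seconds) between connected posts.
toy_deltas = [0, 60, 60, 120, 300, 540, 900, 86340]
recalls = recall_by_horizon(toy_deltas, horizons=[60, 600, 3600])
```

A larger horizon recovers more positives, but it also admits more candidate negatives, so the choice is a balance between recall and class imbalance.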


Considering that a 10-minute horizon would allow us to recover 99% of the true positives (at best), this seems like a pretty nice approach! So, let's see if we can construct the set of all reasonable negative samples within ten minutes of each other.

With this `threshold` in place, we can work through the data set, time-sorted, in a nested (source > target) loop to draw negative connections between all viable pairs of posts. The key to doing this efficiently is to traverse the inner (targets) loop only over those posts which are viable according to the 10-minute sampling window and, critically, to keep track of the point past which _all_ posts are already guaranteed to be beyond the 10-minute threshold:

In [None]:
for ky in ['test', 'validation', 'train']:
    pids = sorted(list(post_data[ky].keys()), 
                  key = lambda x: post_data[ky][x]['dateparsed'])
    j_low = 0
    for i, pid in list(enumerate(pids)):
        if not i: 
            continue
        target_pids = list(enumerate(pids[:i]))[j_low:] 
        target_pids.reverse()
        for j, target_pid in target_pids:
            pair = tuple([pid, target_pid])
            if pid != target_pid and (pair not in y[ky]):
                source = post_data[ky][pair[0]]
                target = post_data[ky][pair[1]]
                delta = (source['dateparsed'] - target['dateparsed']).total_seconds()
                if delta > threshold:
                    j_low = j
                    break
                if ((source['id'] != target['id']) and 
                    ((source['target'] and source['target'] == target['source']) or 
                     (not source['target']))):
                    y[ky][pair] = 0
                    
                    if not len(y[ky]) % 100000:
                        print('percent of targets processed: ', 100*i/len(pids), 
                              'current negative to positive rate (class imbalance): ', 
                              len(y[ky])/sum(y[ky].values()))


percent of targets processed:  13.810792804796803 current negative to positive rate (class imbalance):  15.913430935709739
percent of targets processed:  24.716855429713526 current negative to positive rate (class imbalance):  31.826861871419478
percent of targets processed:  39.57361758827448 current negative to positive rate (class imbalance):  47.740292807129215
percent of targets processed:  67.51499000666223 current negative to positive rate (class imbalance):  63.653723742838956
percent of targets processed:  18.089528377298162 current negative to positive rate (class imbalance):  31.55569580309246
percent of targets processed:  30.99920063948841 current negative to positive rate (class imbalance):  63.11139160618492
percent of targets processed:  54.96402877697842 current negative to positive rate (class imbalance):  94.66708740927737
percent of targets processed:  82.82973621103118 current negative to positive rate (class imbalance):  126.22278321236983
percent of targets proce
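
The windowing idea on its own can be isolated as a two-pointer pass over time-sorted posts. This is a minimal sketch on hypothetical timestamps: `window_pairs` is an illustrative helper rather than the code used above, and it deliberately omits the user-handle filter and the exclusion of known positives.

```python
from datetime import datetime, timedelta

def window_pairs(times, threshold=600):
    """Return (source, target) index pairs whose gap is at most `threshold` seconds.

    Sources always come later than their targets in time-sorted order.
    """
    order = sorted(range(len(times)), key=lambda k: times[k])
    pairs = []
    low = 0  # earliest sorted position still inside the current source's window
    for i, src in enumerate(order):
        # Advance the lower pointer past posts that are now too old; because
        # posts are time-sorted, they are too old for every later source too.
        while (times[src] - times[order[low]]).total_seconds() > threshold:
            low += 1
        pairs.extend((src, order[j]) for j in range(low, i))
    return pairs

# Hypothetical posts at 0, 5, 9, and 20 minutes past an arbitrary start time.
start = datetime(2021, 1, 1, 12, 0)
toy_times = [start + timedelta(minutes=m) for m in (0, 5, 9, 20)]
toy_pairs = window_pairs(toy_times)
```

Because the lower pointer only ever moves forward, each post is dropped from consideration exactly once, so the work scales with the number of viable pairs rather than with all pairs.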

Now that we've got all of this high-quality (hopefully) contrastive information on good vs. bad connections, let's save the data in a convenient format for future predictive experiments.

In [None]:
import json
connections = {'positive': {ky: [list(map(str, connection)) 
                                 for connection in y[ky] if y[ky][connection]]
                        for ky in y},
               'negative': {ky: [list(map(str, connection)) 
                                 for connection in y[ky] if not y[ky][connection]]
                        for ky in y}}
with open('./data/connections.json', "w") as f:
    f.write(json.dumps(connections))
    
thread =  {ky: {pid: dict(post_data[ky][pid])
                for pid in post_data[ky]}
           for ky in post_data}
for ky in thread:
    for pid in thread[ky]:
        if 'counts' in thread[ky][pid]:
            del(thread[ky][pid]['counts'])
        thread[ky][pid]['dateparsed'] = thread[ky][pid]['dateparsed'].strftime("%Y-%m-%d %H:%M")
with open('./data/thread.json', "w") as f:
    f.write(json.dumps(thread))

As we can see, building the data representation for learning a task at scale can be a lot of work. Often, we'll encounter different and interesting challenges just in the pre-processing work for an NLP task. As an aside, we should note that this, in and of itself, is _not_ an NLP task, but rather a conversational analysis meta-linguistic task, more focused on the connections between messages. As we get further along in NLP tasks we'll see this is more and more the case, where NLP is ultimately a means to an end for elucidating a broader, social phenomenon related to text-based communication.

However, before proceeding we should ask one question:
> should we expect that _all_ connections implied by the data are _actually_ annotated?

As it turns out, even if the annotators did a _really_ good job and found highly accurate connections in the data, some are impossible to find! This is because some `targets` were posted before the first post in the data set, e.g., if the oldest post in the data set replied to one from earlier, it's un-annotatable! You might then say "well, not many fall into this scenario", but annotation is a _challenging_ task, and even good annotations often aren't complete. 
> So what's the point? 

Anecdotally, we should juxtapose this with, say, the NER task, which we might assume is guaranteed to be more completely and 'correctly' annotated. We'll often see these higher-order, social NLP tasks challenged by sampling and completeness and/or somewhat ambiguous/subjective notions of what a 'correct' answer is. For example, here a post could potentially be labeled as a reply to one or many others depending on how the annotator sees the information attaching, but ultimately, the truth as to who the target 'was' is up to the sender of the message. These issues exist even for the relatively straightforward objective of disentanglement, and become profound for tasks that engage diverse social groups and viewpoints, such as fact checking, etc.

So at a high level, this should be taken as cautionary from a few different angles: one should always assess early just how precisely predictable an NLP task is and/or how flexible or rigid (biased) a 'good' system should be, given the community it is intended to serve. On this note of viewpoints, a natural turn is to discuss how NLP approaches 'what people think' about things, i.e., their sentiments and stances.

### 2.1.4 Sentiment and Stance
A common entry point into NLP is through sentiment analysis, which generally might be described as anything to do with the measurement of the underlying emotion or affect of text, either at a token level (semantic norms) or with respect to a larger sentence or document. 

#### 2.1.4.1 Semantic norms
For example, the [LIWC](http://liwc.wpengine.com/) lexicon is one of many fine-grained data sources that contain sentiment, or emotional-affect, norms, $h(t)$, for tokens, $t$. These data sets generally involve human annotation with numerical scales on tokens, taken either in or out of context (depending on the model). 

For example, let's assume $W$ is our vocabulary of tokens and that $h:W\rightarrow[-1,1]$. The usual 'sentiment' norm might have `t = 'love'` with `h(t) = 0.87`, i.e., be super 'happy', and have `s = 'hate'` with `h(s) = -0.92`, i.e., be super 'sad' (regardless of context). 
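
One simple (and naive) use of such norms is to average them over the tokens of a text. The sketch below assumes a toy lexicon: the values for `love` and `hate` are the ones quoted above, while the other entries and the `average_norm` helper are hypothetical.

```python
# Toy norms: 'love' and 'hate' use the values from the text; the rest are made up.
h = {"love": 0.87, "hate": -0.92, "rain": -0.10, "sunshine": 0.60}

def average_norm(tokens, norms):
    """Mean norm over the tokens found in the lexicon; None if none are found."""
    scored = [norms[t] for t in tokens if t in norms]
    return sum(scored) / len(scored) if scored else None

score = average_norm("i love sunshine but hate rain".split(), h)
```

Tokens missing from the lexicon are skipped, so the average reflects only the covered vocabulary.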

Note that happy/sad is just one semantic norm dimension, and that [others develop lexica](https://saifmohammad.com/WebPages/lexicons.html) spanning a range of emotions, such as the [NRC Emotion Lexicon](https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm), which covers fear, anger, anticipation, trust, surprise, positive, negative, sadness, disgust, and joy. While the larger lexicon is available for research [by request](http://sentiment.nrc.ca/lexicons-for-research/), its downstream token-affect identification is readily available in Python via pip:

In [None]:
!pip install nrclex

Collecting nrclex
  Downloading https://files.pythonhosted.org/packages/41/1c/0097ee39d456c8a92b2eb5dfd59f581a09a6bafede184a058fb0f19bb6ea/NRCLex-3.0.0.tar.gz (396kB)
Building wheels for collected packages: nrclex
  Building wheel for nrclex (setup.py) ... done
  Created wheel for nrclex: filename=NRCLex-3.0.0-cp37-none-any.whl size=43310 sha256=d9c9d64606ea9f1ee9f94043f55573ccab0cc3aa85f2061a4509ce62b474566c
  Stored in directory: /root/.cache/pip/wheels/17/31/64/035a8d245b4c217aeb8e8a2702d05dc91544b9c2334db72414
Successfully built nrclex
Installing collected packages: nrclex
Successfully installed nrclex-3.0.0


Here, we can see that in applying their affect-detection tool we can extract norm-specific information about a sentence, per their algorithm:

In [None]:
from nrclex import NRCLex
import nltk
nltk.download('punkt')

text_object = NRCLex(ds['train'][1055]['raw'])
text_object.affect_dict, text_object.words

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


({'bug': ['disgust', 'fear', 'negative']},
 ['03:59',
  'superted',
  'Xophe',
  'allthough',
  'the',
  'bug',
  'might',
  'be',
  'minor',
  'it',
  'can',
  'bring',
  'down',
  'the',
  'user',
  'experience'])

#### 2.1.4.2 Structured Sentiment Analysis and Stance Detection
These first looks into sentiment analysis and semantic norms approach the subject at a naïve, token-based level, or perhaps average across a document for an overall 'temperature' of the sentiment. However, many of the applications which involve sentiment analysis have structured objectives, such as sentiment with respect to a particular entity or subject. In other words, this might be characterized as the _stance_ of the sentiment towards a subject in question.

With this general space (structured sentiment analysis) of development in mind, we'll now look at another [readily available data set](https://huggingface.co/datasets/per_sent), here focused on sentiment of an author towards the main entity in a news article. The authors describe this data set in [this document](https://arxiv.org/pdf/2011.06128.pdf).

In [None]:
ds = datasets.load_dataset('per_sent')
ds

Using custom data configuration default


Downloading and preparing dataset per_sent/default (download: 22.05 MiB, generated: 22.34 MiB, post-processed: Unknown size, total: 44.39 MiB) to /root/.cache/huggingface/datasets/per_sent/default/1.1.0/f2c9b7bd72dc22e2a740b022df17e9178ff921d9144a431652ef2e7a80e8973f...


Dataset per_sent downloaded and prepared to /root/.cache/huggingface/datasets/per_sent/default/1.1.0/f2c9b7bd72dc22e2a740b022df17e9178ff921d9144a431652ef2e7a80e8973f. Subsequent calls will reuse this data.


DatasetDict({
    train: Dataset({
        features: ['DOCUMENT_INDEX', 'TITLE', 'TARGET_ENTITY', 'DOCUMENT', 'MASKED_DOCUMENT', 'TRUE_SENTIMENT', 'Paragraph0', 'Paragraph1', 'Paragraph2', 'Paragraph3', 'Paragraph4', 'Paragraph5', 'Paragraph6', 'Paragraph7', 'Paragraph8', 'Paragraph9', 'Paragraph10', 'Paragraph11', 'Paragraph12', 'Paragraph13', 'Paragraph14', 'Paragraph15'],
        num_rows: 3355
    })
    test_random: Dataset({
        features: ['DOCUMENT_INDEX', 'TITLE', 'TARGET_ENTITY', 'DOCUMENT', 'MASKED_DOCUMENT', 'TRUE_SENTIMENT', 'Paragraph0', 'Paragraph1', 'Paragraph2', 'Paragraph3', 'Paragraph4', 'Paragraph5', 'Paragraph6', 'Paragraph7', 'Paragraph8', 'Paragraph9', 'Paragraph10', 'Paragraph11', 'Paragraph12', 'Paragraph13', 'Paragraph14', 'Paragraph15'],
        num_rows: 579
    })
    test_fixed: Dataset({
        features: ['DOCUMENT_INDEX', 'TITLE', 'TARGET_ENTITY', 'DOCUMENT', 'MASKED_DOCUMENT', 'TRUE_SENTIMENT', 'Paragraph0', 'Paragraph1', 'Paragraph2', 'Paragraph3',

Looking again at a single instance, as the authors write:

> Each document consists of multiple paragraphs. Each paragraph is labeled separately (Positive, Neutral, Negative) and the author’s sentiment towards the whole document is included as a document-level label. 

Since the authors only annotated affect for documents of 3&ndash;16 paragraphs, we should only pay attention to the paragraph-sentiment labels corresponding to the newline-delimited 'paragraphs' of those indices, i.e., shorter documents have been `-1`-padded with dummy annotations for up to 16 paragraphs.
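
As a minimal sketch of how the padding might be handled, the helper below collects only the annotated paragraph labels from a record. The `paragraph_labels` function and the `toy_record` values (including the label ids) are hypothetical, and the `-1` padding value is assumed from the description above.

```python
def paragraph_labels(record, max_paragraphs=16, pad=-1):
    """Collect the annotated paragraph labels, skipping `pad` entries."""
    labels = [record[f"Paragraph{k}"] for k in range(max_paragraphs)]
    return [label for label in labels if label != pad]

# Hypothetical record: three annotated paragraphs, remaining slots padded.
toy_record = {f"Paragraph{k}": -1 for k in range(16)}
toy_record.update({"Paragraph0": 1, "Paragraph1": 0, "Paragraph2": 1})
labels = paragraph_labels(toy_record)
```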

In [None]:
instance = ds['train'][6]
instance

{'DOCUMENT': 'Lubna Hussein was among 13 women arrested July 3 in a raid by the public order police in Khartoum. Ten of the women were fined and flogged two days later. But Hussein and two others decided to go to trial.\n Hussein said Friday she would rather go to jail than pay any fine  out of protest of the nation\'s strict laws on women\'s dress.\n The case has made headlines in Sudan and around the world and Hussein used it to rally world opinion against the country\'s morality laws based on a strict interpretation of Islam.\n Galal al-Sayed  Hussein\'s lawyer  said he advised her to pay the fine before appealing the decision. She refused  he said  "She insisted."\n As a U.N. staffer  Hussein should have immunity from prosecution but she has opted to resign so that she could stand trial and draw attention to the case.\n In a column published in the British daily the Guardian Friday  Hussein said her case is not an isolated one  but is a showcase of repressive laws in a country with

Looking at the available paragraphs exhibited in the `'MASKED_DOCUMENT'` field, we can see that co-references to the target of the data set's sentiment analysis are resolved and annotated as `'[TGT]'` in the stream. Here, our goal is to determine the document- (whole list, below) and paragraph-level (elements of the list) sentiment of the author of the text with respect to the target of discussion:

In [None]:
import re
re.split(" *\n+ *", instance['MASKED_DOCUMENT'])

['[TGT] was among 13 women arrested July 3 in a raid by the public order police in Khartoum. Ten of the women were fined and flogged two days later. But [TGT] and two others decided to go to trial.',
 "[TGT] said Friday she would rather go to jail than pay any fine  out of protest of the nation's strict laws on women's dress.",
 "The case has made headlines in Sudan and around the world and [TGT] used it to rally world opinion against the country's morality laws based on a strict interpretation of Islam.",
 'Galal al-Sayed  [TGT]\'s lawyer  said he advised her to pay the fine before appealing the decision. She refused  he said  "She insisted."',
 'As a U.N. staffer  [TGT] should have immunity from prosecution but she has opted to resign so that she could stand trial and draw attention to the case.',
 'In a column published in the British daily the Guardian Friday  [TGT] said her case is not an isolated one  but is a showcase of repressive laws in a country with a long history of civil 

Here, there's a little bit of continuity in the way we could evaluate such a model, but at either the paragraph or document level we could once again treat this as a categorical prediction problem with $5$ sentiment levels, i.e., classes $c\in\{-2, -1, 0, 1, 2\}$.

### 2.1.5 Language Modeling
So far, we've focused on NLP tasks and their data representations whose predictions provide information _about_ language. But how do we approach the modeling of systems whose objective should be to predict language itself? We'll use this conceptual framing and call it _language modeling_ to characterize objectives for downstream applications that are generally required to _generate_ language, rather than predict on it.

Reductively, language models are tasked with predicting the next word, e.g., including the one right here. A model's goal is generally to characterize the probability for any sequence of $m$ tokens: $P(t_1, t_2,\cdots, t_m)$. Supposing we wish to compute the model's probability up to the $m^\text{th}$ token, the chain rule for conditional probability is invaluable:

$$
P(t_1, t_2,\cdots, t_m) = \prod_{i=1}^m P(t_i\mid t_1,\cdots, t_{i-1}) 
$$

Here, we should observe that one may compute predictions from a language model incrementally and accumulate probabilistic information towards a given point of prediction, $t_m$.
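
As a small worked instance of the chain rule, taking $m = 3$ gives:

$$
P(t_1, t_2, t_3) = P(t_1)\,P(t_2\mid t_1)\,P(t_3\mid t_1, t_2)
$$

In practice, such products of many small probabilities are usually accumulated in log space, where the same incremental structure becomes a sum:

$$
\log P(t_1, t_2,\cdots, t_m) = \sum_{i=1}^m \log P(t_i\mid t_1,\cdots, t_{i-1})
$$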

However, documents vary in size and each point of prediction, $t_{m}$, has a different number ($m-1$) of other tokens which precede it. These aspects impose strong computational challenges. Thus, one must generally condition approximately, via predictions at $t_{m}$ that utilize '$n$-gram' windows of $n>0$ previous tokens:

$$
P(t_1, t_2,\cdots, t_m) \approx \prod_{i=1}^m P(t_i\mid t_{i - n},\cdots, t_{i-1}).
$$

These approximations allow language-model learning from corpus-based $n$-gram frequencies, so, e.g., if $f(t_1, t_2,\cdots, t_n)$ represents the number of times the $n$-gram $\{t_1, t_2,\cdots, t_n\}$ appears in a document, we can approximate the individual conditional probabilities as:

$$
\hat{P}(t_n|t_1, t_2, \cdots t_{n-1}) = \frac{f(t_1, t_2,\cdots, t_n)}{f(t_1, t_2,\cdots, t_{n-1})}
$$

So for example, from our co-occurrence matrix that we studied in __Chapter 1__, we can construct a $2$-gram language model! However, to go any further we'll need to discuss some inherent limitations.
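
To illustrate the frequency-based estimate, here is a minimal sketch of a $2$-gram model built from a toy token stream. The `bigram_model` helper and the toy sentence are assumptions for illustration, and context counts are taken over positions that actually have a following token (a small variant that keeps each conditional distribution normalized).

```python
from collections import Counter

def bigram_model(tokens):
    """Estimate P(b | a) as f(a, b) / f(a) from a single token stream."""
    contexts = Counter(tokens[:-1])                 # f(a), for a with a successor
    bigrams = Counter(zip(tokens[:-1], tokens[1:]))  # f(a, b)
    return {(a, b): c / contexts[a] for (a, b), c in bigrams.items()}

toy_tokens = "the cat sat on the mat".split()
p_hat = bigram_model(toy_tokens)
```

Here `the` is followed once each by `cat` and `mat`, so both continuations receive half of the probability mass.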

#### 2.1.5.1 Consequences of Combinatorality
Language modeling is, in general, a task that is plagued by combinatorality. If we wish to think of it as a class-prediction problem like everything else we've studied so far (and quite reasonably), then the number of classes we must weigh for prediction at any given point is $|V|$, i.e., the size of the given vocabulary. This is generally a _huge_ prediction space, and while we must wade into it, building combinatorial (e.g., $n$-gram) models forces us to consider storage constraints amidst sparse data that exists in a vast dimensional space. We've seen the effects of this in __Chapter 1__ with the co-occurrence matrices, which were handled in part by sparse matrix representations, but some other numerical implications must be discussed.

1. In calculating the numerator for any given $\hat{P}(t_n|t_1, t_2, \cdots t_{n-1})$ one often encounters the problem that a given $n$-gram has never before been observed in the corpus. By our formulation, the numerator becomes zero and makes the overall probability of the larger sequence (via the chain rule) $0$. To fix this, _smoothing_ is generally done, where a small positive value, $\varepsilon>0$, is added to the language model's representation for each possible output:
$$
\hat{P}(t_n|t_1, t_2, \cdots t_{n-1}) = \frac{\varepsilon + f(t_1, t_2,\cdots, t_n)}{\varepsilon|V| + f(t_1, t_2,\cdots, t_{n-1})}
$$
2. In considering the denominator, when even an $(n-1)$-gram hasn't been observed in the training corpus the $\hat{P}$ model is reduced to a uniform distribution on the vocabulary size, which is generally less informative than ideal. To combat this issue, _backoff_ is often conducted, where if $f(t_1, t_2,\cdots, t_{n-1}) = 0$, then $\hat{P}$ is approximated instead via the next-smaller-gram model, i.e., based on the previous $n-2$ tokens (and so on), recursively: 
$$
\hat{P}(t_n|t_1, \cdots t_{n-1})\approx\hat{P}(t_n|t_2, \cdots t_{n-1}).
$$
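
The two fixes above can be combined in a short sketch. The helper names (`count_ngrams`, `smoothed_backoff_prob`), the toy corpus, and the value `eps=0.1` are assumptions for illustration; this is one simple way to combine additive smoothing with backoff, not a definitive implementation.

```python
from collections import Counter

def count_ngrams(tokens, max_n):
    """Count all n-grams (as tuples) for n = 1, ..., max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def smoothed_backoff_prob(token, context, counts, vocab, eps=0.1):
    """Additively smoothed estimate of P(token | context), with backoff."""
    context = tuple(context)
    # Backoff: drop the oldest context token until the context has been observed.
    while context and counts[context] == 0:
        context = context[1:]
    if context:
        denom = counts[context]
    else:
        # Empty context: fall back to smoothed unigram frequencies.
        denom = sum(counts[(v,)] for v in vocab)
    return (eps + counts[context + (token,)]) / (eps * len(vocab) + denom)

toy_tokens = "the cat sat on the mat".split()
toy_vocab = set(toy_tokens)
toy_counts = count_ngrams(toy_tokens, max_n=2)

p_seen = smoothed_backoff_prob("cat", ["the"], toy_counts, toy_vocab)
p_unseen = smoothed_backoff_prob("sat", ["the"], toy_counts, toy_vocab)
p_backoff = smoothed_backoff_prob("cat", ["dog"], toy_counts, toy_vocab)
```

Here `p_unseen` is small but nonzero thanks to smoothing, and `p_backoff` falls back to the unigram estimate because the context `dog` never appears in the toy corpus.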

Generally, increasing $n$ for an $n$-gram language model improves its (yet-to-be-quantified) performance. However, this also makes the issues with combinatorality _worse_, including dramatic increases in storage, i.e., the memory that becomes required to operate such models.
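
To get a rough sense of the scale involved, the number of distinct $n$-grams that could appear over a vocabulary is bounded by $|V|^n$. The sketch below evaluates this bound for a hypothetical vocabulary of 10,000 tokens; real corpora realize only a sparse fraction of these, but the bound shows how quickly the space grows.

```python
def max_ngram_types(vocab_size, n):
    """Upper bound on the number of distinct n-grams over a vocabulary: |V|**n."""
    return vocab_size ** n

# Hypothetical vocabulary size of 10,000 tokens.
bounds = {n: max_ngram_types(10_000, n) for n in (1, 2, 3)}
```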

#### 2.1.5.2 Sequence to Sequence Tasks
So what can we do with language modeling? As mentioned when introduced, the general objective with language modeling is for text generation; this includes tasks where the system must:

- translate, or, convert one language into another; 
- converse, or, provide informative responses to prompts; and
- summarize, or, edit content into a more-concise form.

Traditionally, models built to satisfy language modeling tasks would consist of numerous different ML algorithms satisfying incremental sub-tasks. These might be any of those NLP tasks discussed in this chapter prior to language modeling, and their information would be folded together as deemed appropriate at their various stages of prediction, in an ML-patchwork NLP pipeline. Note: this includes pipelines for which any one component might be 'deep', and so is in essence a characterization of the (lack of) _concurrence_ with which these models' parameters were trained, integrated, and utilized for text generation. 

This aspect of language modeling as historically being satisfied by patchwork NLP pipelines speaks to its complexity. Unlike the first tasks we viewed, where the objective of prediction was _about_ language, the language modeling tasks stated above are generally about producing a different variable-length sequence in response to an input sequence. Hence, they are generally called _sequence-to-sequence_, or 'seq2seq', tasks (or models, etc.). Consequently, these should be viewed as some of the most challenging NLP tasks to satisfy. 

The second reason we highlight this history of patchwork NLP pipelines and language generation is to reflect on how DL has critically provided a unifying modeling framework that allows seq2seq tasks to be modeled over their _full_ pipelines, i.e., from input to output sequence. What this means is that our objective while using DL in constructing language models will be to leverage end-to-end designs that train all model parameters concurrently, i.e., that are truly seq2seq.

#### 2.1.5.3 Machine Translation and BLEU
Let's move straight towards the characteristic seq2seq task of machine translation with a [specific dataset](https://huggingface.co/datasets/europa_eac_tm), which conveniently allows us to specify the language pairs for which we'd like to observe translations. Documentation for this data set can be [found here](https://link.springer.com/article/10.1007/s10579-014-9277-0). Here, we load pairs for English and German:

In [None]:
ds = datasets.load_dataset('europa_eac_tm', language_pair=("en", "de"))
ds

Using custom data configuration en2de-3ea1cf9b765fae58



Downloading and preparing dataset europa_eac_tm/en2de (download: 3.36 MiB, generated: 565.98 KiB, post-processed: Unknown size, total: 3.91 MiB) to /root/.cache/huggingface/datasets/europa_eac_tm/en2de-3ea1cf9b765fae58/0.0.0/f3a16cb432bdb8b5a8f765ffd7caf56de34b1f04dcb3a87d29ca75d6091da285...


Dataset europa_eac_tm downloaded and prepared to /root/.cache/huggingface/datasets/europa_eac_tm/en2de-3ea1cf9b765fae58/0.0.0/f3a16cb432bdb8b5a8f765ffd7caf56de34b1f04dcb3a87d29ca75d6091da285. Subsequent calls will reuse this data.


DatasetDict({
    train: Dataset({
        features: ['translation', 'sentence_type'],
        num_rows: 4473
    })
})

Let's look at a single translation pair:

In [None]:
instance = ds['train'][2]
instance['translation']

{'de': 'Der Förderantrag wird elektronisch verarbeitet. Alle personenbezogenen Daten (wie Namen, Adressen, CVs etc.) werden gemäß Verordnung (EG) Nr. 45/2001 des Europäischen Parlaments und des Rates vom 18. Dezember 2000 zum Schutz natürlicher Personen bei der Verarbeitung personenbezogener Daten und zum freien Datenverkehr verarbeitet. Von den Antragstellern angegebene Informationen, die für die Beurteilung ihres Förderantrags notwendig sind, werden von der für das betreffende Programm zuständigen Abteilung ausschließlich für diesen Zweck verarbeitet. Auf Wunsch des Antragstellers/der Antragstellerin können personenbezogene Daten zur Korrektur oder Vervollständigung an den Antragsteller/die Antragstellerin gesendet werden. Fragen betreffend diese Daten sind an die zuständige Agentur zu richten, bei der das Formular eingereicht werden muss. Begünstigte können jederzeit beim Europäischen Datenschutzbeauftragten eine Beschwerde gegen die Verarbeitung ihrer personenbezogenen Daten einleg

An immediate observation should be that this pair involves substantially long streams of text. Some will be (much) shorter, and this property&mdash;variable sequence length&mdash;is pervasive. Likewise, we observe that a 'good' translation here could likely just as well exchange:
> (such as names, addresses, CVs, etc.)...

with:
> (for example names, addresses, CVs, etc.)...

In other words, we shouldn't expect that one or the other is necessarily the _perfect_ translation, if we want to avoid our issues with combinatorality and sparsity. Hence, comparisons of models for machine translation are generally made according to a specially-designed metric called Bilingual Evaluation Understudy (BLEU), which works as follows. 

Borrowing from the $n$-gram language model's statistics, BLEU employs $n$-gram analysis between a prediction and reference translation. Precisely, BLEU assesses the fraction of the prediction's $n$-grams that appear in the reference, with caveats that 1) an $n$-gram in the reference cannot be matched more than once, and 2) a penalty is enforced for extremely brief output, e.g., single-word outputs that are too 'concise'.

To express BLEU, let's first define the numbers of tokens in the prediction and reference as $m_p$ and $m_r$, choose a maximum gram-length, $k$, and, for each $n\leq k$, let $p_n$ be the precision-score fraction of the prediction's $n$-grams that appear in the reference. A brevity penalty is then defined by:
$$
\beta = e^{\min\left(\left\{0,\:1 - \frac{m_r}{m_p}\right\}\right)}
$$

which equals $1$ (no penalty) for all predictions at least as long as the reference, and decays towards $0$ as predictions become shorter than the reference. Uniform weights, $w_n = 1/k$, are then commonly utilized to blend the precision scores in a [weighted geometric mean](https://en.wikipedia.org/wiki/Weighted_geometric_mean), defining BLEU as:
$$
\text{BLEU} = \beta\prod_{n = 1}^kp_n^{w_n}
$$
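
To make the pieces of this definition concrete, here is a minimal single-reference sketch. The `bleu_sketch` helper is hypothetical and is not the `sacrebleu` implementation, which also handles tokenization, smoothing, and multiple references. Uniform weights are assumed by default, and other weights can be passed via `weights`.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """List the n-grams of a token sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu_sketch(prediction, reference, k=4, weights=None):
    """Brevity penalty times the weighted geometric mean of clipped precisions."""
    weights = weights or [1.0 / k] * k
    m_p, m_r = len(prediction), len(reference)
    if m_p == 0:
        return 0.0
    log_score = 0.0
    for n, w in zip(range(1, k + 1), weights):
        pred_counts = Counter(ngrams(prediction, n))
        ref_counts = Counter(ngrams(reference, n))
        total = sum(pred_counts.values())
        # Clipping: a reference n-gram cannot be matched more often than it occurs.
        matched = sum(min(c, ref_counts[g]) for g, c in pred_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, any zero precision zeroes the score
        log_score += w * math.log(matched / total)
    beta = math.exp(min(0.0, 1.0 - m_r / m_p))  # brevity penalty
    return beta * math.exp(log_score)

same = "the cat sat on the mat".split()
exact = bleu_sketch(same, same)                                  # identical texts
clipped = bleu_sketch("the the the".split(), "the cat".split(), k=1)
brief = bleu_sketch("the cat".split(), "the cat sat on".split(), k=1)
```

The `clipped` case shows caveat 1 at work (only one of the three `the` tokens is credited), while `brief` has perfect unigram precision but is scaled down by the brevity penalty.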

For machine translation, BLEU (and variants of it) provides an $F_1$-like score that can be used to benchmark and compare systems. The usage of BLEU is presented for convenience, below. But this point in our discussion provides a useful pivot to begin thinking about how we'll train our DL systems, which, as we'll see, don't rely on the somewhat arbitrary and highly non-smooth decision surfaces which BLEU (and $F_1$) present.

In [None]:
!pip install sacrebleu



In [None]:
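import re
from datasets import load_metric

# Load the sacrebleu implementation of BLEU
metric = load_metric("sacrebleu")

# `instance` is the translation example used above; a prediction may be scored
# against several references, here the original sentence and one paraphrase
test_references = [[instance['translation']['en'], 
                    re.sub("such as names", "for example names", instance['translation']['en'])]]

# The "prediction" is a second paraphrase of the same English sentence
test_predictions = [re.sub("such as names", "including names", instance['translation']['en'])]

metric.add_batch(predictions = test_predictions, references = test_references)
metric.compute()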


{'bp': 0.9940652993697221,
 'counts': [167, 165, 163, 161],
 'precisions': [99.4047619047619,
  98.80239520958084,
  98.19277108433735,
  97.57575757575758],
 'ref_len': 169,
 'score': 97.90704464864113,
 'sys_len': 168,
 'totals': [168, 167, 166, 165]}
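The statistics reported above can be recombined by hand as a sanity check. The sketch below copies the counts, totals, and lengths from that output and assumes that the four $n$-gram precisions are weighted uniformly ($w_n = 1/4$), as in sacrebleu's default configuration, rather than with the $2^{-n}$ weights used in the formula earlier; the variable names are illustrative.

```python
import math

# Values copied from the metric output above
counts = [167, 165, 163, 161]   # matched n-grams, n = 1..4
totals = [168, 167, 166, 165]   # n-grams in the prediction, n = 1..4
sys_len, ref_len = 168, 169

# Brevity penalty: exp(min(0, 1 - m_r / m_p))
bp = math.exp(min(0.0, 1.0 - ref_len / sys_len))

# n-gram precisions, then their uniformly weighted geometric mean
precisions = [c / t for c, t in zip(counts, totals)]
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))

# The metric reports its score and precisions on a 0-100 scale
score = 100 * bp * geo_mean
print(bp, score)
```

If the uniform-weight assumption holds, `bp` and `score` should agree with the reported `'bp'` and `'score'` entries up to floating-point error.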

If we want to know what all of this means, we can view below to see that the $n$-gram precisions and their aggregated score are all present, including the computed brevity penalty that was applied.

In [None]:
print(metric.inputs_description)


Produces BLEU scores along with its sufficient statistics
from a source against one or more references.

Args:
    predictions: The system stream (a sequence of segments)
    references: A list of one or more reference streams (each a sequence of segments)
    smooth: The smoothing method to use
    smooth_value: For 'floor' smoothing, the floor to use
    force: Ignore data that looks already tokenized
    lowercase: Lowercase the data
    tokenize: The tokenizer to use
Returns:
    'score': BLEU score,
    'counts': Counts,
    'totals': Totals,
    'precisions': Precisions,
    'bp': Brevity penalty,
    'sys_len': predictions length,
    'ref_len': reference length,
Examples:

    >>> predictions = ["hello there general kenobi", "foo bar foobar"]
    >>> references = [["hello there general kenobi", "hello there !"], ["foo bar foobar", "foo bar foobar"]]
    >>> sacrebleu = datasets.load_metric("sacrebleu")
    >>> results = sacrebleu.compute(predictions=predictions, references=ref

## 2.2 Objective functions, error signals, and performance

How do we evaluate how 'good' a prediction of, e.g., NER tags is? Downstream, this will tell us how to adjust the weights of a network, i.e., train it. With the machine translation example above, we were introduced to how, for _human_ interpretation and scientific evaluation, we might use BLEU or an $F_1$ metric to benchmark our algorithms on test data and judge whether they are 'good' enough to use. 

However, to teach the _algorithms_ how to be 'better,' we'll need to move through their parameter spaces _smoothly_ over training data, and so will need less arbitrary objectives that can be differentiated (e.g., for gradient descent variants) to optimize our models. It turns out that, since our prediction scenarios are generally categorical, we'll generally conduct learning via the same objective, regardless of the number of prediction classes. So, whether we're predicting between $9$ different named-entity tags, or across $|W|$ different tokens in a language modeling experiment, we'll use the same _cross-entropy_ objective for learning, described below.

### 2.2.1 The categorical likelihood objective
In general, let's consider the case where there is a discrete set of classes $c\in C$ that we'd like to predict for our data. For each $x$ in our data set of $N$ points, $X$, our goal is to determine a boolean output vector, $y\in\mathbb{B}^{|C|}$, of norm 1 ($\|y\| = 1$), i.e., with exactly one non-zero entry, corresponding to $x$'s true class membership. 

Generally, our predictions, $\hat{y}$, won't have exactly one non-zero entry. Instead, for each point, $x\in X$, we compute a _probabilistic_ vector, $\hat{y}(x)\in[0,1]^{|C|}$. Across the classes $c\in C$, $x$'s measurable contribution:
$$
\mathcal{J}\left(y(x), \hat{y}(x)\right) = \prod_{c\in C}\hat{y}(x\mid c)^{y(x\mid c)}
$$

can be factored in with others to form the overall multi-class likelihood:

$$
\mathcal{J}(X) = \prod_{x\in X}\mathcal{J}\left(y(x), \hat{y}(x)\right)= \prod_{x\in X}\prod_{c \in C}  \hat{y}(x\mid c)^{y(x\mid c)}
$$

which in general will increase with 'better' predictions. Essentially, known relationships between $x$ and $c$ are _indicated_ by $y(x\mid c) = 1$, and a model is deemed 'better' when more of the indicated data appears likely under the expectation measured by $\hat{y}(x\mid c)$.
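To make the likelihood concrete, here is a minimal sketch with three data points and three classes. The one-hot targets and predicted probabilities are made-up illustrative values, and the function names are not from any library.

```python
import math

# Made-up example: one-hot targets y(x) and predicted probabilities y_hat(x)
Y = [[1, 0, 0],
     [0, 1, 0],
     [0, 0, 1]]
Y_hat = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1],
         [0.3, 0.3, 0.4]]

def point_likelihood(y_x, y_hat_x):
    """prod_c y_hat(x|c) ** y(x|c): only the true class's probability survives."""
    return math.prod(p ** t for p, t in zip(y_hat_x, y_x))

def likelihood(Y, Y_hat):
    """Product of the per-point contributions over the whole data set."""
    return math.prod(point_likelihood(y_x, y_hat_x) for y_x, y_hat_x in zip(Y, Y_hat))

print(likelihood(Y, Y_hat))  # the product 0.7 * 0.8 * 0.4, up to rounding
```

Raising the probability assigned to any true class (e.g., moving the last point's $0.4$ toward $1$) increases the overall product, which is the sense in which 'better' predictions score higher.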

### 2.2.2 Optimizing through cross-entropy error signals

In practice and like other probabilistic framings, the likelihood $\mathcal{J}(X)$ is computationally unstable as a result of numerous multiplication operations. As a result, the negative logarithm is generally taken and optimized:

$$
\mathcal{L}(X) = -\log\mathcal{J}(X) = \sum_{x\in X}\mathcal{L}\left(y(x), \hat{y}(x)\right)= -\sum_{x\in X}\sum_{c \in C}  y(x\mid c)\log\hat{y}(x\mid c)
$$

and is often referred to as the _cross entropy_ of the prediction. The negative logarithm's monotonicity and smoothness ensure that its optimization locates the same extrema (minimizing $\mathcal{L}$ maximizes $\mathcal{J}$), but moreover, via the product and power rules of logarithms, we are granted access to an objective that is computationally stable through additive aggregation.
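The instability is easy to see numerically. In the sketch below (made-up data, illustrative function names), every one of $2000$ points receives probability $0.5$ on its true class: the raw product underflows to zero in double precision, while the sum of negative logs remains a perfectly ordinary number.

```python
import math

def likelihood(Y, Y_hat):
    """Raw product form: prod_x prod_c y_hat(x|c) ** y(x|c)."""
    return math.prod(p ** t for y_x, y_hat_x in zip(Y, Y_hat)
                     for t, p in zip(y_x, y_hat_x))

def cross_entropy(Y, Y_hat):
    """Negative log-likelihood: -sum_x sum_c y(x|c) * log y_hat(x|c)."""
    # Terms with y(x|c) = 0 contribute nothing, so they are skipped;
    # this also avoids evaluating log(0) for classes given zero probability.
    return -sum(t * math.log(p) for y_x, y_hat_x in zip(Y, Y_hat)
                for t, p in zip(y_x, y_hat_x) if t)

# Made-up data: 2000 two-class points, each with probability 0.5 on the truth
N = 2000
Y = [[1, 0]] * N
Y_hat = [[0.5, 0.5]] * N

print(likelihood(Y, Y_hat))     # underflows: 0.5 ** 2000 is below double precision
print(cross_entropy(Y, Y_hat))  # roughly 2000 * log(2)
```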

#### 2.2.2.1 Learning over Cross-Entropy
While we won't _always_ handle the calculation of gradients directly, the learning information utilized from any objective function $\mathcal{L}(X)$ will generally be found by first calculating the gradient of $\mathcal{L}(X)$ with respect to all model parameters. That is, since $\hat{y}(x;\Phi)$ will generally be a prediction that relies on many model parameters, $\phi_i\in\Phi$, a common step towards training a DL system is to formulate the objective function in terms of the model parameters as $\mathcal{L}(X; \Phi)$ and compute its gradient (writing $\mathcal{L}_{\phi_i}$ for the partial derivative with respect to $\phi_i$):
$$
\nabla\mathcal{L}(X; \Phi) = \left[\mathcal{L}_{\phi_1}(X;\Phi), \mathcal{L}_{\phi_2}(X;\Phi), \cdots, \mathcal{L}_{\phi_{|\Phi|}}(X;\Phi)\right]
$$

as a general strategy. In fact, we'll see this pattern of work in full effect in our first deep/neural model in __Chapter 3__, when we explore how CBOW and language modeling converge as the word2vec algorithm and allow scalable methods for representation learning, which we'll see overcome many of the sparsity- and combinatoriality-related problems incurred via models of co-occurrence, such as those explored in __Chapter 1__.
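As a small preview of this pattern, the sketch below assumes a hypothetical model whose only parameters $\Phi$ are one score per class, turned into probabilities by a softmax (an assumption made here for illustration; this section has not committed to any particular model). It approximates each partial derivative $\mathcal{L}_{\phi_i}$ by central finite differences and compares the result with the known closed form for this particular model, $\hat{y} - y$.

```python
import math

def softmax(phi):
    """Map raw scores to probabilities (max subtracted for numerical stability)."""
    m = max(phi)
    exps = [math.exp(v - m) for v in phi]
    z = sum(exps)
    return [e / z for e in exps]

def loss(phi, y):
    """Cross entropy of one point under the hypothetical softmax model."""
    return -sum(t * math.log(p) for t, p in zip(y, softmax(phi)) if t)

def numerical_gradient(f, phi, eps=1e-6):
    """Central-difference estimate of [L_phi_1, ..., L_phi_|Phi|]."""
    grad = []
    for i in range(len(phi)):
        up = phi[:i] + [phi[i] + eps] + phi[i + 1:]
        down = phi[:i] + [phi[i] - eps] + phi[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

phi = [0.5, -1.0, 2.0]   # made-up parameter values
y = [0, 0, 1]            # the true class is the third one

grad = numerical_gradient(lambda p: loss(p, y), phi)
analytic = [p - t for p, t in zip(softmax(phi), y)]  # closed form: y_hat - y

# One gradient-descent step with a made-up learning rate of 0.1
phi_new = [v - 0.1 * g for v, g in zip(phi, grad)]
print(grad, analytic, loss(phi, y), loss(phi_new, y))
```

In practice, automatic differentiation libraries compute these partial derivatives for us; finite differences are used here only because they need no extra machinery.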

### 2.2.3 Perplexity as a performance measure
Cross validations with BLEU and $F_1$ scores elucidate how well models perform on development-specific data and can be measured over out-of-domain sets to understand how well a deployment will function in a non-development scenario. But how would we understand how well a model is performing during deployment, when perhaps it's not clear what exactly the 'right' output should be? 

For example, if our task is language generation for a chat bot, how could we know when that chat bot is more or less 'certain' of itself? Using BLEU or other BLEU-like positive prediction/confusion matrix strategies will always require building a 'gold standard' collection of reference responses to specific input prompts. Moreover, as we see with BLEU, when the number of classes becomes large (e.g., the size of a vocabulary), macro- or micro-averaging $F_1$-like statistics often provides too coarse and task-specific a statistic to be interpretably useful. After all, our use case in this discussion is to inform the _developer_ of the model's current performance state, at _any_ given time of operation.

All of this should motivate the question:
> can't we just somehow use the probabilities, or, cross-entropy values we compute to score a prediction? 

So for example, even if we don't know what the $y$ value is, we still have a $\hat{y}$ vector&mdash;can't we just use our 'best' choice, e.g., 'argmax' probability/entropy as a measure of certainty? 

Well, if it helps the learning, then certainly: the lower the entropy of the prediction, the more certain a model is of its operation. However, people find it _hard_ to interpret logarithmically-unitized information, i.e., in bits or nats, so the custom is to transform back to the event space of the probability distribution, with a measure called _perplexity_:
$$
\mathcal{T}(X) = 
e^{\mathcal{L}(X)} = 
e^{-\log\mathcal{J}(X)} = 
\mathcal{J}(X)^{-1}
$$


Here, we use $\mathcal{T}$, i.e., "script-T" to build the intuition that perplexity is always in units of _time_ with respect to the event space. In other words, since $\mathcal{T}$ is the reciprocal of a probability, $\mathcal{J}$, it
> $\mathcal{T}$: represents the number of predictions (events) required for the system to have made one correct guess.

Note: the base of exponentiation is left as $e$ in the above. However, in principle, the base should simply be whatever base _of logarithm_ is used in the computation of $\mathcal{L}$ from $\mathcal{J}$. _Provided the bases of the logarithm and exponent match, perplexity is always in the same event-space units_, regardless of choice of base.

This means $\mathcal{T}\geq 1$, with $\mathcal{T} = 1$ indicating a model driven to complete certainty. Note that while perplexity provides a different intuitive viewpoint into model performance, it is a monotone transformation of the cross entropy and so carries exactly the same information. In deployment, note that having $\mathcal{T} = 1$ (model certainty) may not necessarily be a 'good thing', e.g., as it could exhibit an extreme rigidity of operation. Likewise, a large value of perplexity could indicate poorly-concentrated decision making, i.e., with flat model-score distributions resulting in unrealistic predictions (if sampling, e.g., in language generation tasks).
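The relationships above can be checked with a few lines of Python. In this sketch the probabilities are made-up values, `perplexity` follows the definition $\mathcal{T}(X) = e^{\mathcal{L}(X)}$ given here (many toolkits instead report a per-prediction version that exponentiates the _average_ loss, which is a convention not used in this section), and `prediction_entropy` is an illustrative helper for the label-free case where only $\hat{y}$ is available.

```python
import math

def perplexity(true_class_probs, base=math.e):
    """T = base ** L, with L = -sum of log_base(probability of each true class)."""
    loss = -sum(math.log(p, base) for p in true_class_probs)
    return base ** loss

def prediction_entropy(y_hat):
    """Entropy (in nats) of a single predicted distribution; no labels needed."""
    return -sum(p * math.log(p) for p in y_hat if p > 0)

# One prediction that gave probability 1/4 to the true class: T is about 4,
# i.e., roughly four such guesses are needed for one to be correct
single = perplexity([0.25])

# Matching log and exponent bases give the same value in nats or in bits
joint_nats = perplexity([0.5, 0.25])
joint_bits = perplexity([0.5, 0.25], base=2)

# Without labels, exponentiating the entropy of y_hat gives an effective
# number of options the model is spreading its probability across
flat = math.exp(prediction_entropy([0.25, 0.25, 0.25, 0.25]))
certain = math.exp(prediction_entropy([1.0, 0.0, 0.0, 0.0]))

print(single, joint_nats, joint_bits, flat, certain)
```

A flat distribution over four classes behaves like four equally plausible options, while a one-hot prediction collapses to a single option, matching the two extremes discussed above.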
