
# Sequence labeling (POS tagging) with MLP

This notebook builds upon the [classification with MLP notebook](https://github.com/TurkuNLP/intro-to-nlp/blob/master/mlp_imdb_hf_dset_and_trainer.ipynb) and shows how to implement a basic sequence labeling method.

---

# Setup

Install the required Python packages using [pip](https://en.wikipedia.org/wiki/Pip):

* [`transformers`](https://huggingface.co/docs/transformers/index) is a popular deep learning package built primarily on top of PyTorch
* [`datasets`](https://huggingface.co/docs/datasets/) provides support for loading, creating, and manipulating datasets
* [`evaluate`](https://huggingface.co/docs/evaluate/index) is a library of performance metrics (such as accuracy)

In [1]:
!pip install --quiet transformers[torch] datasets evaluate


---

# Get and prepare data

*   Let us work with the venerable, if somewhat dated [CoNLL'03 shared task](https://aclanthology.org/W03-0419.pdf) English data
*   These are English news articles, and have annotation for POS, syntactic chunks, and named entities (in the IOB format)

The data as originally distributed for the 2003 shared task has the following format:

```
Only RB B-NP O
France NNP I-NP B-LOC
and CC I-NP O
Britain NNP I-NP B-LOC
backed VBD B-VP O
Fischler NNP B-NP B-PER
's POS B-NP O
proposal NN I-NP O
. . O O
```

Here, the four space-separated columns are token text, POS tag, chunk tag, and NER tag. The goal of the original task is to predict the NER tags using the other information as features, but the dataset can be used to study predicting the other columns too.
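As a small illustration of this column layout, the following sketch parses lines in the format shown above into four parallel lists. The function name `parse_conll_sentence` is made up for this example; it is not part of the dataset or of any library.

```python
def parse_conll_sentence(lines):
    """Split space-separated CoNLL'03-style lines into four parallel lists."""
    sentence = {"tokens": [], "pos_tags": [], "chunk_tags": [], "ner_tags": []}
    for line in lines:
        line = line.strip()
        if not line:
            continue    # skip blank lines
        # Column order: token text, POS tag, chunk tag, NER tag
        token, pos, chunk, ner = line.split(" ")
        sentence["tokens"].append(token)
        sentence["pos_tags"].append(pos)
        sentence["chunk_tags"].append(chunk)
        sentence["ner_tags"].append(ner)
    return sentence


example = """Only RB B-NP O
France NNP I-NP B-LOC
backed VBD B-VP O
. . O O"""

parsed = parse_conll_sentence(example.splitlines())
print(parsed["tokens"])
print(parsed["pos_tags"])
```

This mirrors the per-sentence structure (`tokens`, `pos_tags`, `chunk_tags`, `ner_tags`) of the Hugging Face version of the dataset, except that the tags here are still strings rather than integer IDs.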

The dataset happens to be in the Hugging Face datasets collection, so we can load it from there:


In [2]:
import torch
import transformers
import datasets

from pprint import pprint    # pretty-print

dataset = datasets.load_dataset("conll2003")

print(dataset)


DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})


In [3]:
pprint(dataset["train"][12])

{'chunk_tags': [11, 12, 12, 12, 21, 11, 11, 12, 0],
 'id': '12',
 'ner_tags': [0, 5, 0, 5, 0, 1, 0, 0, 0],
 'pos_tags': [30, 22, 10, 22, 38, 22, 27, 21, 7],
 'tokens': ['Only',
            'France',
            'and',
            'Britain',
            'backed',
            'Fischler',
            "'s",
            'proposal',
            '.']}


As you can see above, the various labels (POS, NER and chunk tags) are converted into IDs in this dataset. We can access the textual labels of these tags through the dataset `features`:

In [4]:
POS_TAG_NAMES = dataset['train'].features['pos_tags'].feature.names
NER_TAG_NAMES = dataset['train'].features['ner_tags'].feature.names
CHUNK_TAG_NAMES = dataset['train'].features['chunk_tags'].feature.names

We can then create mappings from names to IDs and back as Python dictionaries:

In [5]:
POS2ID = { n: i for i, n in enumerate(POS_TAG_NAMES) }
ID2POS = { i: n for i, n in enumerate(POS_TAG_NAMES) }

NER2ID = { n: i for i, n in enumerate(NER_TAG_NAMES) }
ID2NER = { i: n for i, n in enumerate(NER_TAG_NAMES) }

CHUNK2ID = { n: i for i, n in enumerate(CHUNK_TAG_NAMES) }
ID2CHUNK = { i: n for i, n in enumerate(CHUNK_TAG_NAMES) }

This is what these mappings look like:

In [6]:
print(NER2ID)

{'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}


In [7]:
print(ID2NER)

{0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG', 5: 'B-LOC', 6: 'I-LOC', 7: 'B-MISC', 8: 'I-MISC'}


Let's also add in explanations from Penn Treebank for the POS tags:

In [8]:
# From the documentation page and from here https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

POS2DESCRIPTION = {
    "CC": "Coordinating conjunction",
    "CD": "Cardinal number",
    "DT": "Determiner",
    "EX": "Existential there",
    "FW": "Foreign word",
    "IN": "Preposition or subordinating conjunction",
    "JJ": "Adjective",
    "JJR": "Adjective, comparative",
    "JJS": "Adjective, superlative",
    "LS": "List item marker",
    "MD": "Modal",
    "NN": "Noun, singular or mass",
    "NNS": "Noun, plural",
    "NNP": "Proper noun, singular",
    "NNPS": "Proper noun, plural",
    "PDT": "Predeterminer",
    "POS": "Possessive ending",
    "PRP": "Personal pronoun",
    "PRP$": "Possessive pronoun",
    "RB": "Adverb",
    "RBR": "Adverb, comparative",
    "RBS": "Adverb, superlative",
    "RP": "Particle",
    "SYM": "Symbol",
    "TO": "to",
    "UH": "Interjection",
    "VB": "Verb, base form",
    "VBD": "Verb, past tense",
    "VBG": "Verb, gerund or present participle",
    "VBN": "Verb, past participle",
    "VBP": "Verb, non-3rd person singular present",
    "VBZ": "Verb, 3rd person singular present",
    "WDT": "Wh-determiner",
    "WP": "Wh-pronoun",
    "WP$": "Possessive wh-pronoun",
    "WRB": "Wh-adverb"
}

We can now try to make sense of the tags:

In [9]:
import tabulate

e = dataset["train"][12]    # work on the same example

table = []
for token, pos_id, chunk_id, ner_id in zip(e["tokens"], e["pos_tags"], e["chunk_tags"], e["ner_tags"]):
    ner_tag = ID2NER[ner_id]
    chunk_tag = ID2CHUNK[chunk_id]
    pos_tag = ID2POS[pos_id]
    pos_def = POS2DESCRIPTION.get(pos_tag,pos_tag)
    table.append([token, ner_tag, chunk_tag, pos_tag, pos_def])

print(tabulate.tabulate(table,headers=["Token", "NER", "Chunk", "POS", "POS definition"]))

Token     NER    Chunk    POS    POS definition
--------  -----  -------  -----  ------------------------
Only      O      B-NP     RB     Adverb
France    B-LOC  I-NP     NNP    Proper noun, singular
and       O      I-NP     CC     Coordinating conjunction
Britain   B-LOC  I-NP     NNP    Proper noun, singular
backed    O      B-VP     VBD    Verb, past tense
Fischler  B-PER  B-NP     NNP    Proper noun, singular
's        O      B-NP     POS    Possessive ending
proposal  O      I-NP     NN     Noun, singular or mass
.         O      O        .      .


Note that the data is organized into sentences.

---

# Create features

We'll define a simple function that takes a token sequence, the index of the focus token, and a window size and generates a few basic explicit features relevant to the task.

(Note that as we'll be predicting the POS tag, we won't look at the chunk or NER tags, which would typically only be predicted _after_ predicting POS in a "traditional" NLP pipeline.)

In [10]:
def token_features(tokens, index, window_size):
    # Generate features for token in position `index` in given list of tokens
    features = []

    # Context window start and end
    window_start = max(0, index-window_size)
    window_end = min(index+window_size+1, len(tokens))    # note +1 for range

    for i in range(window_start, window_end):
        offset = i - index    # relative position
        features.append(f"token[{offset}]={tokens[i]}")

    # Example custom feature: does focus token start with an upper-case letter?
    if tokens[index][0].isupper():
        features.append("first-letter-capitalized")

    return features

We can call this function for all tokens in a sentence like so:

In [11]:
def add_features_to_sentence(sentence):
    # Collect lists of features for all tokens here
    all_features = []

    tokens = sentence["tokens"]
    for index in range(len(tokens)):
        all_features.append(token_features(tokens, index, window_size=3))

    return { "features": all_features }

In [12]:
for feats in add_features_to_sentence(dataset["train"][12])["features"]:
    print(feats)

['token[0]=Only', 'token[1]=France', 'token[2]=and', 'token[3]=Britain', 'first-letter-capitalized']
['token[-1]=Only', 'token[0]=France', 'token[1]=and', 'token[2]=Britain', 'token[3]=backed', 'first-letter-capitalized']
['token[-2]=Only', 'token[-1]=France', 'token[0]=and', 'token[1]=Britain', 'token[2]=backed', 'token[3]=Fischler']
['token[-3]=Only', 'token[-2]=France', 'token[-1]=and', 'token[0]=Britain', 'token[1]=backed', 'token[2]=Fischler', "token[3]='s", 'first-letter-capitalized']
['token[-3]=France', 'token[-2]=and', 'token[-1]=Britain', 'token[0]=backed', 'token[1]=Fischler', "token[2]='s", 'token[3]=proposal']
['token[-3]=and', 'token[-2]=Britain', 'token[-1]=backed', 'token[0]=Fischler', "token[1]='s", 'token[2]=proposal', 'token[3]=.', 'first-letter-capitalized']
['token[-3]=Britain', 'token[-2]=backed', 'token[-1]=Fischler', "token[0]='s", 'token[1]=proposal', 'token[2]=.']
['token[-3]=backed', 'token[-2]=Fischler', "token[-1]='s", 'token[0]=proposal', 'token[1]=.']
['t

The dataset is organized into sentences, so we can use the above function to add features to the entire dataset as follows.

**Note**: unlike e.g. the Python `map` function, the [`Dataset.map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function keeps the existing values: it returns a dataset in which the values returned by the mapped function are added to (or replace) the existing columns.
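The difference from the built-in `map` can be mimicked in plain Python. The sketch below is only an analogy using dictionaries, not how the `datasets` library is implemented, and the helper `add_length` is a made-up example function.

```python
def add_length(example):
    # Like add_features_to_sentence, this returns only the *new* values
    return {"length": len(example["tokens"])}


examples = [
    {"id": "0", "tokens": ["EU", "rejects", "German", "call"]},
    {"id": "1", "tokens": ["Peter", "Blackburn"]},
]

# Built-in map: the result contains only what the function returned
plain = list(map(add_length, examples))

# Dataset.map-like behavior, emulated by merging dictionaries:
# existing values are kept and the returned values are added
merged = [{**e, **add_length(e)} for e in examples]

print(plain[0])
print(merged[0])
```

In the emulated version the original `id` and `tokens` values are still present next to the new `length` value, which is the behavior the feature-generation step relies on.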

In [13]:
dataset = dataset.map(add_features_to_sentence)


Let's check that one more time:

In [14]:
pprint(dataset["train"][12])

{'chunk_tags': [11, 12, 12, 12, 21, 11, 11, 12, 0],
 'features': [['token[0]=Only',
               'token[1]=France',
               'token[2]=and',
               'token[3]=Britain',
               'first-letter-capitalized'],
              ['token[-1]=Only',
               'token[0]=France',
               'token[1]=and',
               'token[2]=Britain',
               'token[3]=backed',
               'first-letter-capitalized'],
              ['token[-2]=Only',
               'token[-1]=France',
               'token[0]=and',
               'token[1]=Britain',
               'token[2]=backed',
               'token[3]=Fischler'],
              ['token[-3]=Only',
               'token[-2]=France',
               'token[-1]=and',
               'token[0]=Britain',
               'token[1]=backed',
               'token[2]=Fischler',
               "token[3]='s",
               'first-letter-capitalized'],
              ['token[-3]=France',
               'token[-2]=and',
          

---

# Flatten dataset

The MLP code that we introduced previously expects each of the `train`, `validation` and `test` subsets of the data to consist of simple sequences of examples.

Now that we have run the feature generation, we no longer need the sentence structure and can "flatten" the data into such sequences.

In [15]:
def flatten(subset):
    # Keys for values to flatten
    keys = ["tokens", "pos_tags", "chunk_tags", "ner_tags", "features"]

    # Initialize to empty lists of tokens etc.
    flattened = { k: [] for k in keys }

    # Concatenate per-sentence lists of tokens etc.
    for sentence in subset:
        for key in keys:
            flattened[key].extend(sentence[key])

    # Return as Dataset object
    return datasets.Dataset.from_dict(flattened)
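To see the effect of flattening on a toy input, here is a sketch of the same idea using plain Python lists and `itertools.chain` instead of `Dataset` objects. The two-sentence input and the reduced set of keys are made up for illustration.

```python
from itertools import chain

# Two toy "sentences", each with parallel per-token lists
sentences = [
    {"tokens": ["EU", "rejects"], "pos": ["NNP", "VBZ"]},
    {"tokens": ["Peter", "Blackburn"], "pos": ["NNP", "NNP"]},
]
keys = ["tokens", "pos"]

# Concatenate the per-sentence lists for each key into one long list
flat = {k: list(chain.from_iterable(s[k] for s in sentences)) for k in keys}

print(flat["tokens"])
print(flat["pos"])
```

After flattening, position `i` in every list refers to the same token, and the sentence boundaries are no longer represented.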

Call `flatten` for each of the subsets and make a new `DatasetDict` containing the flattened subsets:

In [16]:
flattened_dict = {
    "train": flatten(dataset["train"]),
    "validation": flatten(dataset["validation"]),
    "test": flatten(dataset["test"]),
}

flat_dataset = datasets.DatasetDict(flattened_dict)

Check that the new dataset looks OK:

In [17]:
flat_dataset

DatasetDict({
    train: Dataset({
        features: ['tokens', 'pos_tags', 'chunk_tags', 'ner_tags', 'features'],
        num_rows: 203621
    })
    validation: Dataset({
        features: ['tokens', 'pos_tags', 'chunk_tags', 'ner_tags', 'features'],
        num_rows: 51362
    })
    test: Dataset({
        features: ['tokens', 'pos_tags', 'chunk_tags', 'ner_tags', 'features'],
        num_rows: 46435
    })
})

In [18]:
for i in range(10):
    token = flat_dataset["train"]["tokens"][i]
    pos_tag = ID2POS[flat_dataset["train"]["pos_tags"][i]]
    description = POS2DESCRIPTION.get(pos_tag, pos_tag)
    features = flat_dataset["train"]["features"][i]
    print(f"{token}\t{pos_tag}\t{description}\t{features}")

EU	NNP	Proper noun, singular	['token[0]=EU', 'token[1]=rejects', 'token[2]=German', 'token[3]=call', 'first-letter-capitalized']
rejects	VBZ	Verb, 3rd person singular present	['token[-1]=EU', 'token[0]=rejects', 'token[1]=German', 'token[2]=call', 'token[3]=to']
German	JJ	Adjective	['token[-2]=EU', 'token[-1]=rejects', 'token[0]=German', 'token[1]=call', 'token[2]=to', 'token[3]=boycott', 'first-letter-capitalized']
call	NN	Noun, singular or mass	['token[-3]=EU', 'token[-2]=rejects', 'token[-1]=German', 'token[0]=call', 'token[1]=to', 'token[2]=boycott', 'token[3]=British']
to	TO	to	['token[-3]=rejects', 'token[-2]=German', 'token[-1]=call', 'token[0]=to', 'token[1]=boycott', 'token[2]=British', 'token[3]=lamb']
boycott	VB	Verb, base form	['token[-3]=German', 'token[-2]=call', 'token[-1]=to', 'token[0]=boycott', 'token[1]=British', 'token[2]=lamb', 'token[3]=.']
British	JJ	Adjective	['token[-3]=call', 'token[-2]=to', 'token[-1]=boycott', 'token[0]=British', 'token[1]=lamb', 'token[2]=.

Note that this is now a single long sequence of tokens without sentence boundaries.

---

# Vectorize data

We'll next follow the steps that you should already be familiar with from the [text classification notebook](https://github.com/TurkuNLP/intro-to-nlp/blob/master/mlp_imdb_hf_dset_and_trainer.ipynb), with a few changes:

* Since the data is already tokenized, we only need to **vectorize** it, i.e. get the non-zero elements of the feature vector
* Unlike in the text classification notebook, here we are **vectorizing token features**
* We'll again use sklearn's feature extraction package, in particular `CountVectorizer`
* Since our features are now lists of strings, we can skip tokenization and use these as-is

In [19]:
import sklearn.feature_extraction


# Dummy function for tokenization and preprocessing
def do_nothing(features):
    return features

vectorizer = sklearn.feature_extraction.text.CountVectorizer(
    binary=True,
    max_features=30000,
    tokenizer=do_nothing,
    preprocessor=do_nothing,
)

# Get a list of all feature strings from the training data
features = [e["features"] for e in flat_dataset["train"]]

# "Train" the vectorizer, i.e. build its vocabulary
vectorizer.fit(features)
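To make "building the vocabulary" more concrete, the following is a simplified pure-Python stand-in for what a binary count vectorizer with a feature limit does conceptually. It is not sklearn's implementation, details such as tie-breaking may differ, and the names `build_vocabulary` and `to_indices` are made up for this sketch.

```python
from collections import Counter


def build_vocabulary(feature_lists, max_features):
    # Count in how many examples each feature string occurs
    counts = Counter()
    for feats in feature_lists:
        counts.update(set(feats))    # binary: count each feature once per example
    # Keep only the most frequent features, then assign indices
    kept = [f for f, _ in counts.most_common(max_features)]
    return {f: i for i, f in enumerate(sorted(kept))}


def to_indices(feats, vocabulary):
    # Indices of the non-zero elements of the binary feature vector;
    # features outside the vocabulary are dropped
    return sorted({vocabulary[f] for f in feats if f in vocabulary})


train = [
    ["token[0]=EU", "token[1]=rejects", "first-letter-capitalized"],
    ["token[-1]=EU", "token[0]=rejects"],
    ["token[0]=Peter", "token[1]=Blackburn", "first-letter-capitalized"],
]

vocab = build_vocabulary(train, max_features=4)
indices = to_indices(["token[0]=unseen", "first-letter-capitalized"], vocab)

print(vocab)
print(indices)
```

Note that a feature string that was never seen in training (here `token[0]=unseen`) simply produces no index, which is also why an example can end up with fewer `input_ids` than it has feature strings.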



As in the text classification notebook, we then invoke the vectorizer, which returns a sparse matrix, and take the column indices of its non-zero elements:

In [20]:
def vectorize_example(e):
    vectorized = vectorizer.transform([e["features"]])

    # nonzero() gives a pair of (rows,columns), we want the columns
    non_zero_features = vectorized.nonzero()[1]

    # Feature index 0 will have a special meaning, so let us not produce
    # it by adding +1 to everything
    non_zero_features += 1

    return {
        "input_ids": non_zero_features,
        "label": e["pos_tags"]
    }

Check one example:

In [21]:
vectorized = vectorize_example(flat_dataset["train"][10])

print(flat_dataset["train"][10])
print(vectorized)

{'tokens': 'Blackburn', 'pos_tags': 22, 'chunk_tags': 12, 'ner_tags': 2, 'features': ['token[-1]=Peter', 'token[0]=Blackburn', 'first-letter-capitalized']}
{'input_ids': array([    1,  1603, 13770], dtype=int32), 'label': 22}


Map `input_ids` back to the original feature names to confirm that everything works:

In [22]:
# Invert the feature dictionary
idx2feat = { i: w for w, i in vectorizer.vocabulary_.items() }

feats = []
for idx in vectorized["input_ids"]:
    feats.append(idx2feat[idx-1])    # It is easy to forget we moved all by +1

# This is now the bag of features representation of the token in context
pprint(", ".join(feats))

'first-letter-capitalized, token[-1]=Peter, token[0]=Blackburn'


---

# Vectorizing the whole dataset

We'll again use [`Dataset.map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) to process the whole dataset:

In [23]:
vectorized_dataset = flat_dataset.map(vectorize_example, num_proc=4)

pprint(vectorized_dataset["train"][0])


{'chunk_tags': 11,
 'features': ['token[0]=EU',
              'token[1]=rejects',
              'token[2]=German',
              'token[3]=call',
              'first-letter-capitalized'],
 'input_ids': [1, 14072, 23002, 27892],
 'label': 22,
 'ner_tags': 3,
 'pos_tags': 22,
 'tokens': 'EU'}


* Our `input_ids` are an array containing the indices of the (non-zero) features
* These indices correspond to rows of the embedding matrix in the model


---

# Batching and padding

As detailed in the [text classification notebook](https://github.com/TurkuNLP/intro-to-nlp/blob/master/mlp_imdb_hf_dset_and_trainer.ipynb), we typically train neural networks on _batches_ of multiple examples rather than a single example at a time, for reasons of both efficiency and regularization.

As examples in a batch need to have identical length, we _pad_ shorter examples to the maximum example length in each batch with the "dummy" feature with index 0.

(This code is basically unchanged from the previous notebook.)

In [24]:
def collator(list_of_examples):
    # Labels are simply converted into a tensor
    batch={
        "labels": torch.tensor([e["label"] for e in list_of_examples])
    }

    # Examples need to be padded
    tensors = []

    # Find length of longest example
    max_len = max(len(e["input_ids"]) for e in list_of_examples)
    max_len = max(1,max_len)

    # Pad everything with zeros to length of longest example
    for example in list_of_examples:
        ids = torch.LongTensor(example["input_ids"])
        # pad(what, (from_left, from_right)) <- this is how we call the stock pad function
        # pad by max - current length, pads with zero by default
        padded = torch.nn.functional.pad(ids, (0, max_len-ids.shape[0]))
        tensors.append(padded)

    # Now that all examples are of the same length, vstack() can be used
    # to vertically stack these into a tensor
    batch["input_ids"]=torch.vstack(tensors)

    return batch

Test that out with a minimal batch of two examples, one requiring padding:

In [25]:
batch=collator([vectorized_dataset["train"][2], vectorized_dataset["train"][7]])

print("Shape of labels:",batch["labels"].shape)
print("Shape of input_ids:",batch["input_ids"].shape)
print("labels:",batch["labels"])
print("input_ids:",batch["input_ids"])

Shape of labels: torch.Size([2])
Shape of input_ids: torch.Size([2, 6])
labels: tensor([16, 21])
input_ids: tensor([[    1,  5427, 14216, 20003, 26070, 27863],
        [  585,  6898, 12964, 17990,     0,     0]])
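The padding logic on its own can also be sketched with plain Python lists, without torch. The helper name `pad_batch` is made up for this illustration; the IDs are the ones from the output above.

```python
def pad_batch(id_lists, pad_id=0):
    # Length of the longest example in the batch (at least 1)
    max_len = max(1, max(len(ids) for ids in id_lists))
    # Append pad_id to the right of each example until it reaches max_len
    return [list(ids) + [pad_id] * (max_len - len(ids)) for ids in id_lists]


batch_ids = pad_batch([
    [1, 5427, 14216, 20003, 26070, 27863],
    [585, 6898, 12964, 17990],
])

for row in batch_ids:
    print(row)
```

Since index 0 is reserved for padding, the shorter example gets zeros on the right until it matches the length of the longest example in the batch.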


---

# MLP model

With the data now ready, we'll build the MLP model. Note that this is _identical_ to the MLP model we used for text classification: the only difference between the two applications is in the data.

The model class in its simplest form has `__init__()`, which instantiates the layers, and `forward()`, which implements the actual computation. For more information on these, please see the [PyTorch tutorial](https://pytorch.org/tutorials/beginner/introyt/modelsyt_tutorial.html).

In [26]:
# A model wants a config, I can simply inherit from the base
# class for pretrained configs
class MLPConfig(transformers.PretrainedConfig):
    pass

# This is the model
class MLP(transformers.PreTrainedModel):

    config_class=MLPConfig

    # In the initialization method, one instantiates the layers;
    # these hold, for the most part, the trained parameters of the model
    def __init__(self, config):
        super().__init__(config)

        self.vocab_size=config.vocab_size    # embedding matrix row count

        # Build and initialize embedding of vocab size +1 x hidden size
        # (+1 because of the padding index 0!)
        self.embedding = torch.nn.Embedding(
            num_embeddings=self.vocab_size+1,
            embedding_dim=config.hidden_size,
            padding_idx=0
        )

        # Initialize the embeddings with small random values
        torch.nn.init.uniform_(self.embedding.weight.data, -0.001, 0.001)
        # Enforce zero values for padding
        torch.nn.init.zeros_(self.embedding.weight.data[0,:])

        # Output layer: hidden size x output size
        self.output=torch.nn.Linear(
            in_features=config.hidden_size,
            out_features=config.nlabels
        )

    # The computation of the model is put into the forward() function
    # it receives a batch of data and optionally the correct `labels`:
    # - if given `labels`, returns (loss, output)
    # - if not, only returns (output,)
    def forward(self, input_ids, labels=None):
        # 1) Look up embeddings of features, sum them up
        embedded = self.embedding(input_ids)    # (batch, ids) -> (batch, ids, embedding_dim)
        embedded_summed = torch.sum(embedded, dim=1)    # (batch, ids, embedding_dim) -> (batch, embedding_dim)

        # NOTE: we're explicitly *not* applying a nonlinearity here to keep
        # things linear for later analysis

        # 2) Apply output layer
        # (batch, embedding_dim) -> (batch, num_classes)
        logits = self.output(embedded_summed)

        if labels is not None:
            # We have labels, so we ought to calculate the loss
            loss_fn = torch.nn.CrossEntropyLoss()    # Classification loss function
            loss = loss_fn(logits, labels)
            return (loss, logits)
        else:
            # No labels, so just return the logits
            return (logits,)
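To look at the computation in `forward()` in isolation, here is a NumPy sketch of the same three steps (embedding lookup, sum, linear output layer) with random toy weights. The sizes and values are arbitrary and chosen only for illustration; this is not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size, num_labels = 5, 3, 4    # toy sizes, not the notebook's

# Embedding matrix with one extra row for the padding index 0
embedding = rng.uniform(-0.001, 0.001, size=(vocab_size + 1, hidden_size))
embedding[0, :] = 0.0                             # padding row is all zeros

W = rng.normal(size=(num_labels, hidden_size))    # output layer weight
b = rng.normal(size=num_labels)                   # output layer bias


def forward(input_ids):
    embedded = embedding[input_ids]    # (batch, ids) -> (batch, ids, hidden)
    summed = embedded.sum(axis=1)      # (batch, ids, hidden) -> (batch, hidden)
    return summed @ W.T + b            # (batch, hidden) -> (batch, num_labels)


padded = np.array([[1, 3, 0, 0]])
unpadded = np.array([[1, 3]])

print(forward(padded).shape)
print(np.allclose(forward(padded), forward(unpadded)))
```

Because the embedding for index 0 is a zero vector and the embeddings are summed, padding does not change the output, which is why the examples in a batch can safely be padded to a common length.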

Configure the model:

In [27]:
num_labels = len(POS2ID)

mlp_config = MLPConfig(
    vocab_size=len(vectorizer.vocabulary_),
    hidden_size=20,
    nlabels=num_labels
)

---

# Train the model

We will use the Hugging Face [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) class for training. It offers:

* Loads of arguments that control the training
* Configurable metrics to evaluate performance
* Data collator builds the batches
* Early stopping callback stops when eval loss no longer improves
* Model load/save
* Good foundation for later deep learning course
  

First, let's create a [`TrainingArguments`](https://huggingface.co/docs/transformers/v4.17.0/en/main_classes/trainer#transformers.TrainingArguments) object to specify hyperparameters and various other settings for training.

Printing this simple dataclass object will show not only the values we set, but also the defaults for all other arguments. Don't worry if you don't understand what all of these do! Many are not relevant to us here, and you can find the details in [`Trainer` documentation](https://huggingface.co/docs/transformers/main_classes/trainer) if you are interested.

In [28]:
trainer_args = transformers.TrainingArguments(
    "mlp_checkpoints", #save checkpoints here
    evaluation_strategy="steps",
    logging_strategy="steps",
    eval_steps=500,
    logging_steps=500,
    learning_rate=1e-4, #learning rate of the gradient descent
    max_steps=20000,
    load_best_model_at_end=True,
    per_device_train_batch_size=128
)

pprint(trainer_args)

TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=500,
evaluation_strategy=steps,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_l

Next, let's create a metric for evaluating performance during and after training. We can use the convenience function `evaluate.load` from the [`evaluate`](https://huggingface.co/docs/evaluate/index) library to load one of many pre-made metrics and wrap this for use by the trainer.

We can use the basic `accuracy` metric, defined as the proportion of correctly predicted labels out of all labels. Note, though, that this time the labels are not evenly distributed across the classes, so accuracy should be interpreted with some care.

In [29]:
import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_accuracy(outputs_and_labels):
    outputs, labels = outputs_and_labels
    predictions = np.argmax(outputs, axis=-1) #pick the index of the "winning" label
    return accuracy.compute(predictions=predictions, references=labels)
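As a rough illustration of what this computes, and of why accuracy needs some care when labels are unevenly distributed, the sketch below calculates accuracy by hand on made-up toy logits and compares it with always predicting the most frequent label. All numbers are invented for the example.

```python
import numpy as np
from collections import Counter

# Toy model outputs for five examples and two classes (made-up values)
logits = np.array([
    [2.0, 0.1],
    [1.5, 0.2],
    [0.9, 0.3],
    [0.1, 1.2],
    [0.2, 1.1],
])
labels = np.array([0, 0, 0, 0, 1])    # skewed: class 0 dominates

# Pick the index of the "winning" label, then compare with the references
predictions = np.argmax(logits, axis=-1)
acc = float(np.mean(predictions == labels))

# Baseline: always predict the most frequent label
majority_label, majority_count = Counter(labels.tolist()).most_common(1)[0]
majority_acc = majority_count / len(labels)

print(acc)
print(majority_acc)
```

In this toy case the model and the trivial majority baseline reach the same accuracy, so with skewed labels it can be worth checking such a baseline before reading too much into a single accuracy number.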


We can then create the `Trainer` and train the model by invoking the [`Trainer.train`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer.train) function.

In addition to the model, the settings passed in through the `TrainingArguments` object created above (`trainer_args`), the data, and the metric defined above, we create and pass the following to the `Trainer`:

* [data collator](https://huggingface.co/docs/transformers/main_classes/data_collator): groups input into batches
* [`EarlyStoppingCallback`](https://huggingface.co/docs/transformers/main_classes/callback#transformers.EarlyStoppingCallback): stops training when performance stops improving
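The idea of early stopping with patience can be sketched in a few lines of plain Python. This is a simplified illustration of the general technique, not the actual `EarlyStoppingCallback` code (which, for example, also supports an improvement threshold), and the loss values below are hypothetical.

```python
def evaluation_at_which_to_stop(eval_losses, patience):
    """Return the index of the evaluation where training would stop, or None."""
    best = float("inf")
    evals_without_improvement = 0
    for i, loss in enumerate(eval_losses):
        if loss < best:
            # New best loss: remember it and reset the counter
            best = loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return i
    return None    # never ran out of patience


# Hypothetical evaluation losses: improving at first, then getting worse
losses = [3.55, 3.07, 2.63, 2.64, 2.65, 2.70]

print(evaluation_at_which_to_stop(losses, patience=3))    # 5
print(evaluation_at_which_to_stop(losses, patience=5))    # None
```

A larger patience tolerates longer stretches without improvement before stopping, at the cost of potentially training for longer than needed.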

In [30]:
# Make a new model
mlp = MLP(mlp_config)


# The argument gives the patience for early stopping, i.e. training is
# stopped when the evaluation loss has failed to improve in this many
# consecutive evaluations
early_stopping = transformers.EarlyStoppingCallback(early_stopping_patience=5)

trainer = transformers.Trainer(
    model=mlp,
    args=trainer_args,
    train_dataset=vectorized_dataset["train"],
    eval_dataset=vectorized_dataset["validation"],
    compute_metrics=compute_accuracy,
    data_collator=collator,
    callbacks=[early_stopping]
)

trainer.train()



| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 500  | 3.7168 | 3.553327 | 0.288326 |
| 1000 | 3.3158 | 3.071561 | 0.5132   |
| 1500 | 2.8339 | 2.628533 | 0.534948 |
| 2000 | 2.4314 | 2.296479 | 0.567073 |
| 2500 | 2.1386 | 2.041013 | 0.601709 |
| 3000 | 1.8965 | 1.832383 | 0.631751 |
| 3500 | 1.7088 | 1.661889 | 0.657159 |
| 4000 | 1.5389 | 1.519137 | 0.677388 |
| 4500 | 1.421  | 1.40157  | 0.693275 |
| 5000 | 1.303  | 1.303989 | 0.70704  |

TrainOutput(global_step=20000, training_loss=1.1197965408325194, metrics={'train_runtime': 503.229, 'train_samples_per_second': 5087.147, 'train_steps_per_second': 39.743, 'total_flos': 120764047446.0, 'train_loss': 1.1197965408325194, 'epoch': 12.57})

We can then evaluate the trained model on a given dataset (here our test subset) by calling [`Trainer.evaluate`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer.evaluate):

In [31]:
eval_results = trainer.evaluate(vectorized_dataset["test"])

print("Accuracy:", eval_results["eval_accuracy"])

Accuracy: 0.8492516420803273


That's pretty poor performance for a task as simple as POS tagging where state-of-the-art accuracies are generally 97-99%. (The approach demonstrated in this notebook should be considered more of a teaching tool than a serious tagger implementation.)

However, the result is certainly much better than random, so we can conclude that the model is learning something about the task.
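To put "better than random" into numbers, the short calculation below compares the reported test accuracy with the expected accuracy of guessing uniformly at random among the labels. Interpreting "random" as uniform guessing is an assumption made here; the label count of 47 is the number of POS labels in this dataset.

```python
num_labels = 47                       # number of POS labels, i.e. len(POS2ID)
test_accuracy = 0.8492516420803273    # test accuracy reported above

# Expected accuracy when guessing uniformly at random among the labels
chance_accuracy = 1 / num_labels

print(f"Chance accuracy: {chance_accuracy:.4f}")
print(f"Ratio to chance: {test_accuracy / chance_accuracy:.1f}")
```

Uniform guessing would be right only about 2% of the time, so the model is far above chance level even though it falls well short of a serious tagger.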

---

# Save model for later use

* You can save it with `trainer.save_model()`
* You can load it with `MLP.from_pretrained()`


In [32]:
trainer.save_model("mlp-postagger")

---

# What has the model learned?

* The embeddings should have some meaning to them
* Similar features should have similar embeddings

In [33]:
# Grab the embedding matrix out of the trained model
# and drop the first row (padding 0)
# then we can treat the embeddings as vectors

weights=mlp.embedding.weight.detach().cpu().numpy()
weights=weights[1:,:]

In [34]:
import sklearn.metrics    # make sure the pairwise distance functions are available

qry_idx=vectorizer.vocabulary_["token[0]=in"]

#calculate the distance of the "in" embedding to all other embeddings
distance_to_qry=sklearn.metrics.pairwise.euclidean_distances(weights[qry_idx:qry_idx+1,:],weights)
nearest_neighbors=np.argsort(distance_to_qry) #indices of features nearest to "in"
for nearest in nearest_neighbors[0,:20]:
    print(idx2feat[nearest])

token[0]=in
token[0]=at
token[0]=with
token[0]=from
token[0]=by
token[0]=for
token[0]=on
token[0]=after
token[0]=of
token[0]=as
token[0]=In
token[0]=into
token[0]=under
token[0]=than
token[0]=between
token[0]=over
token[0]=since
token[0]=against
token[0]=before
token[0]=if


* The embeddings indeed seem to reflect the task and capture aspects of the meaning of words that are relevant to it
* But now we have many classes, so we should take that into account too
* We can take the dot product of each feature embedding with the output layer weights of the class we care about
* Considering how information propagates through the network, this gives us a single number reflecting the importance of each feature with respect to the selected label
* Technically speaking, it is the prediction (up to the bias term of the output layer) for an example that has only that one feature, with respect to that one class
* Here is how we can implement it (we rely on the fact that the model is linear, since we didn't include a nonlinearity in the model's `forward()`):

In [35]:
import numpy

embedding_weights=weights    #shape (features, embedding-dim)
output_weights=mlp.output.weight.detach().cpu().numpy()    #shape (num-labels, embedding-dim)

# We just matrix-multiply these together, since this gives us all the dot-products
weights_by_label=numpy.matmul(embedding_weights, output_weights.T)
weights_by_label.shape

(30000, 47)
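The reasoning behind this matrix product can be checked on toy data. The NumPy sketch below uses small random matrices (not the trained weights) to confirm that, for a linear model of this shape, the per-feature scores add up to the logits once the bias of the output layer is accounted for. All names and sizes are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
num_features, hidden_size, num_labels = 6, 4, 3    # toy sizes

E = rng.normal(size=(num_features, hidden_size))   # feature embeddings (padding row dropped)
W = rng.normal(size=(num_labels, hidden_size))     # output layer weight
b = rng.normal(size=num_labels)                    # output layer bias

# All feature-label dot products at once, analogous to weights_by_label
scores = E @ W.T    # shape (num_features, num_labels)


def logits(feature_ids):
    # Sum the embeddings of the active features, then apply the output layer
    return E[feature_ids].sum(axis=0) @ W.T + b


single = logits([2])
multi = logits([0, 2, 5])

print(np.allclose(single - b, scores[2]))
print(np.allclose(multi - b, scores[[0, 2, 5]].sum(axis=0)))
```

Both checks hold because summing embeddings and applying a linear layer commute; with a nonlinearity between the two steps, this per-feature decomposition would no longer be exact.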

In [36]:
def get_most_important_features_for_and_against(label):
    label_idx = POS2ID[label]
    feature_weights = weights_by_label[:,label_idx] #pick the column that interests us

    #The shape of feature_weights is (feature_vocab_size,) i.e. it is a vector
    features_weight_idx = numpy.argsort(-feature_weights) #sort in descending order, this will be vector of indices
    features_for = [idx2feat[feature_idx] for feature_idx in features_weight_idx[:20]]
    features_against = [idx2feat[feature_idx] for feature_idx in features_weight_idx[-20:][::-1]]
    return features_for, features_against

for label in ("DT", "NN", "VB"):
    features_for, features_against = get_most_important_features_for_and_against(label)
    print(f"{label}: {POS2DESCRIPTION[label]}")
    print(f"Most important features *for* label {label}:")
    pprint("   ".join(features_for))
    print()
    print(f"Most important features *against* label {label}:")
    pprint("   ".join(features_against))
    print("\n------\n")

DT: Determiner
Most important features *for* label DT:
('token[0]=the   token[0]=a   token[0]=The   token[0]=an   token[0]=this   '
 'token[0]=A   token[0]=some   token[0]=no   token[0]=any   token[0]=all   '
 'token[0]=.   token[0]=both   token[0]=another   token[0]=those   '
 'token[0]=This   token[0]=these   token[0]=An   token[0]=No   token[0]=each   '
 'token[0]=is')

Most important features *against* label DT:
('token[0]=and   token[0]=but   token[0]=of   token[0]=for   token[0]=or   '
 'token[0]=at   token[0]=with   token[0]=from   token[0]=in   token[0]=by   '
 'token[0]=But   token[0]=,   token[0]=after   token[0]=on   token[0]=1   '
 'token[0]=0   token[0]=as   token[0]=two   token[0]=3   token[0]=-')

------

NN: Noun, singular or mass
Most important features *for* label NN:
('token[0]=percent   token[0]=government   token[0]=year   token[0]=police   '
 'token[0]=week   token[0]=time   token[0]=state   token[1]=of   token[-1]=a   '
 'token[0]=SOCCER   token[0]=company   toke