# Trying out the allennlp tutorial

Link to the tutorial: https://allennlp.org/tutorials

In this notebook I explore the allennlp library, starting from their tutorial. The goal is to work out what each module does, both to understand the framework better and to prototype faster.

Underlined names are hyperlinked to the docs for easy access. You might also find it easier to simply run `??Module` in a new cell.

### External modules

In [1]:
from typing import Iterator, List, Dict

First up is [`typing`](https://docs.python.org/3/library/typing.html). It provides type hints, which annotate the expected input and output types of a function, much like type declarations in C++. Unlike in C++, Python does not enforce them at runtime. Some interesting ones are `Union` (either of several types) and `Callable` (a function).
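To make the hints concrete, here is a small sketch using only the standard library. The function names (`count_tokens`, `normalize`, `apply_to_tokens`) are made up for illustration and are not part of allennlp.

```python
from typing import Callable, Dict, List, Union


def count_tokens(tokens: List[str]) -> Dict[str, int]:
    # List[str] in, Dict[str, int] out: the hints document the contract.
    counts: Dict[str, int] = {}
    for tok in tokens:
        counts[tok] = counts.get(tok, 0) + 1
    return counts


def normalize(text: Union[str, List[str]]) -> List[str]:
    # Union: accept either a raw string or an already-split list.
    return text.split() if isinstance(text, str) else list(text)


def apply_to_tokens(tokens: List[str], fn: Callable[[str], str]) -> List[str]:
    # Callable[[str], str]: any function taking a str and returning a str.
    return [fn(t) for t in tokens]


words = normalize("We live in a society .")
lowered = apply_to_tokens(words, str.lower)
counts = count_tokens(lowered)
```

Nothing here is checked when the code runs; the hints mainly help readers, editors and static checkers.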

In [2]:
import torch
import torch.optim as optim
import numpy as np

In [3]:
torch.manual_seed(1)

<torch._C.Generator at 0x7fbbfa6b25e8>

Nothing much, just regular PyTorch and NumPy.

### Set up example cases

We will first experiment with the 
- word 'Hello', 
- sentence 'We live in a society.' 
- sentences \['You are a bold one.', 'Perhaps the archives are incomplete.'\]

In [4]:
word = 'Hello'
sent = 'We live in a society.'
sents = ['You are a bold one.', 'Perhaps the archives are incomplete.']

### Start importing allennlp

#### Tokenizer

In [6]:
!pip install allennlp

Collecting allennlp
  Downloading https://files.pythonhosted.org/packages/b5/14/f0f9dd1ce012e7723742821b95b33dd9bdc53befe209600608bc7be1f650/allennlp-1.2.0-py3-none-any.whl (498kB)
Collecting jsonpickle
  Downloading https://files.pythonhosted.org/packages/af/ca/4fee219cc4113a5635e348ad951cf8a2e47fed2e3342312493f5b73d0007/jsonpickle-1.4.1-py2.py3-none-any.whl
Collecting overrides==3.1.0
  Downloading https://files.pythonhosted.org/packages/ff/b1/10f69c00947518e6676bbd43e739733048de64b8dd998e9c2d5a71f44c5d/overrides-3.1.0.tar.gz
Collecting transformers<3.5,>=3.1
  Downloading https://files.pythonhosted.org/packages/2c/4e/4f1ede0fd7a36278844a277f8d53c21f88f37f3754abf76a5d6224f76d4a/transformers-3.4.0-py3-none-any.whl (1.3MB)
Collecting tensorboardX>=1.2
  Downloading https://files.pythonhosted.org/packages/af/0c/4f41bcd45db376e6fe5c619c01100e9b7531c55791b

In [8]:
from allennlp.data.tokenizers import Token
from allennlp.data.tokenizers.token import show_token

[`Token`](https://allenai.github.io/allennlp-docs/api/allennlp.data.tokenizers.html#allennlp.data.tokenizers.token.Token): A wrapper around a word that keeps track of attributes such as its lemma or part-of-speech tag.

[`WordTokenizer`](https://allenai.github.io/allennlp-docs/api/allennlp.data.tokenizers.html#word-tokenizer): Tokenizes a sentence and outputs a list of tokens. By default it uses spaCy to split the sentence into words.

[`show_token`](https://allenai.github.io/allennlp-docs/api/allennlp.data.tokenizers.html#allennlp.data.tokenizers.token.show_token): A convenience function that formats a token and its attributes as a string.

In [9]:
word_token = Token(word)

This is what a single token looks like:

In [10]:
show_token(word_token)

'Hello (idx: None) (idx_end: None) (lemma: None) (pos: None) (tag: None) (dep: None) (ent_type: None) (text_id: None) (type_id: None) '

Note that only the text is filled in; the other attributes are `None` and get filled in by further processing.

We can now tokenize a whole sentence using the WordTokenizer.

In [11]:
from allennlp.data.tokenizers import WordTokenizer  # pre-1.0 name; newer releases appear to call it SpacyTokenizer
sent_toks = WordTokenizer().tokenize(sent)

The tokenized sentence is a list of tokens:

In [None]:
sent_toks

[We, live, in, a, society, .]

These are the printed tokens

In [None]:
[show_token(s) for s in sent_toks]

['We (idx: 0) (lemma: We) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'live (idx: 3) (lemma: live) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'in (idx: 8) (lemma: in) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'a (idx: 11) (lemma: a) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'society (idx: 13) (lemma: society) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 '. (idx: 20) (lemma: .) (pos: ) (tag: ) (dep: ) (ent_type: ) ']

We can also process multiple sentences at once. 

In [None]:
sents_toks = WordTokenizer().batch_tokenize(sents)

In [None]:
sents_toks

[[You, are, a, bold, one, .], [Perhaps, the, archives, are, incomplete, .]]

In [None]:
[show_token(s) for snt in sents_toks for s in snt]

['You (idx: 0) (lemma: You) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'are (idx: 4) (lemma: be) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'a (idx: 8) (lemma: a) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'bold (idx: 10) (lemma: bold) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'one (idx: 15) (lemma: one) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 '. (idx: 18) (lemma: .) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'Perhaps (idx: 0) (lemma: Perhaps) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'the (idx: 8) (lemma: the) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'archives (idx: 12) (lemma: archive) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'are (idx: 21) (lemma: be) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'incomplete (idx: 25) (lemma: incomplete) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 '. (idx: 35) (lemma: .) (pos: ) (tag: ) (dep: ) (ent_type: ) ']

#### TokenIndexer

In [None]:
from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer

[`TokenIndexer`](https://allenai.github.io/allennlp-docs/api/allennlp.data.token_indexers.html#allennlp.data.token_indexers.token_indexer.TokenIndexer): Converts a token or list of tokens to indices. These indices refer to the index of the token in some vocabulary to be used by the model.

[`SingleIdTokenIndexer`](https://allenai.github.io/allennlp-docs/api/allennlp.data.token_indexers.html#single-id-token-indexer): Represents each token as a single integer id, namely its index in the vocabulary.

We note that the token indexer requires a vocabulary, however, we haven't created one yet.
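To see why a vocabulary is needed, here is a plain-Python sketch of what single-id indexing boils down to. The `toy_vocab` dictionary and the `tokens_to_ids` helper are hypothetical stand-ins, not allennlp objects.

```python
from typing import Dict, List

# Hypothetical toy vocabulary: ids 0 and 1 are reserved for padding and
# unknown words, mirroring the special tokens allennlp uses.
toy_vocab: Dict[str, int] = {
    "@@PADDING@@": 0,
    "@@UNKNOWN@@": 1,
    "We": 2,
    "live": 3,
    "in": 4,
}


def tokens_to_ids(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    # Every token becomes exactly one integer; unseen tokens share one id.
    unk = vocab["@@UNKNOWN@@"]
    return [vocab.get(tok, unk) for tok in tokens]


ids = tokens_to_ids(["We", "live", "in", "a", "society", "."], toy_vocab)
print(ids)  # [2, 3, 4, 1, 1, 1]
```

Words missing from the toy vocabulary all collapse to the unknown id, which is why the vocabulary should be built from the data before indexing.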

In [None]:
s = SingleIdTokenIndexer()

#### Fields and Instances

In [None]:
from allennlp.data import Field
from allennlp.data.fields import TextField, SequenceLabelField
from allennlp.data import Instance

[`Field`](https://allenai.github.io/allennlp-docs/api/allennlp.data.fields.html#allennlp.data.fields.field.Field): A piece of data to be fed into your model's pipeline. Some important use cases are:
 - tokenized text: for example, the text "The movie" would be stored in a field as \['The', 'movie'\].
 - numerical ids for the tokenized text: if 'The' maps to 1 and 'movie' maps to 35 in a dictionary of words, the field could contain \[1, 35\].
 - Also note that fields can hold sentences of varying lengths. To pass such fields to your pipeline as a batch, they need to be padded appropriately. `Field` has an `as_tensor` method to convert the data into a (padded) tensor and a `batch_tensors` method to combine such tensors into a batch.
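The padding step mentioned in the last bullet can be sketched in plain Python. `pad_batch` is a made-up helper for illustration; allennlp does this internally on tensors.

```python
from typing import List


def pad_batch(sequences: List[List[int]], pad_id: int = 0) -> List[List[int]]:
    # Pad every sequence on the right up to the longest one in the batch.
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


batch = pad_batch([[1, 35], [4, 7, 9, 2]])
print(batch)  # [[1, 35, 0, 0], [4, 7, 9, 2]]
```

After padding, every row has the same length, so the batch can be stored as one rectangular tensor.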

For most purposes you should be able to use one of the ready-made fields like the `TextField` or `SequenceLabelField`.
 - [`TextField`](https://allenai.github.io/allennlp-docs/api/allennlp.data.fields.html#allennlp.data.fields.text_field.TextField): Contains tokenized strings. Raw strings need to go through a [`tokenizer`](https://allenai.github.io/allennlp-docs/api/allennlp.data.tokenizers.html) before being passed to the field.
 - [`SequenceLabelField`](https://allenai.github.io/allennlp-docs/api/allennlp.data.fields.html#allennlp.data.fields.sequence_label_field.SequenceLabelField): Assigns a label to each element of a sequence field, for example one tag per token of a `TextField`.

[`Instance`](https://allenai.github.io/allennlp-docs/api/allennlp.data.instance.html#allennlp.data.instance.Instance): A mapping from string keys to fields. One data point is one `Instance`.

First let's create a `TextField`.

In [None]:
simple_text_field = TextField(sent_toks, {"tokens": SingleIdTokenIndexer()})  # indexers are passed as a dict

In [None]:
simple_text_field

<allennlp.data.fields.text_field.TextField at 0x7fdf3d9a3860>

In [None]:
def get_instance_from_tokenized_sent(tok_sent: List[Token]) -> Instance:
    "Converts tokenized sentence into Instances. Each instance being TextField"
    sent_tok_text_field = TextField(tok_sent, {"tokens": SingleIdTokenIndexer()})
    fields = {'sentence': sent_tok_text_field}
    return Instance(fields)

In [None]:
def get_instances_from_tokenized_sents(tok_sents: List[List[Token]]) -> List[Instance]:
    "Converts list of sentences to instances."
    return [get_instance_from_tokenized_sent(tok_sent) for tok_sent in tok_sents]

In [None]:
simple_instance = get_instance_from_tokenized_sent(sent_toks)

In [None]:
simple_instance.fields['sentence'].__dict__

{'tokens': [We, live, in, a, society, .],
 '_token_indexers': {'tokens': <allennlp.data.token_indexers.single_id_token_indexer.SingleIdTokenIndexer at 0x7fdf3d9a3320>},
 '_indexed_tokens': None,
 '_indexer_name_to_indexed_token': None}

In [None]:
few_instances = get_instances_from_tokenized_sents(sents_toks)

In [None]:
[f.__dict__ for f in few_instances]

[{'fields': {'sentence': <allennlp.data.fields.text_field.TextField at 0x7fdf3d995470>},
  'indexed': False},
 {'fields': {'sentence': <allennlp.data.fields.text_field.TextField at 0x7fdf3d9906a0>},
  'indexed': False}]

In [None]:
[f.fields['sentence'].__dict__ for f in few_instances]

[{'tokens': [You, are, a, bold, one, .],
  '_token_indexers': {'tokens': <allennlp.data.token_indexers.single_id_token_indexer.SingleIdTokenIndexer at 0x7fdf3d995550>},
  '_indexed_tokens': None,
  '_indexer_name_to_indexed_token': None},
 {'tokens': [Perhaps, the, archives, are, incomplete, .],
  '_token_indexers': {'tokens': <allennlp.data.token_indexers.single_id_token_indexer.SingleIdTokenIndexer at 0x7fdf3d9903c8>},
  '_indexed_tokens': None,
  '_indexer_name_to_indexed_token': None}]

#### Vocabulary

In [None]:
from allennlp.data.vocabulary import Vocabulary

[`Vocabulary`](https://allenai.github.io/allennlp-docs/api/allennlp.data.vocabulary.html): Provides a mapping from strings to integer indices. It can be created with `from_files` or `from_instances`. It is quite useful, since building such a dictionary is fundamental to almost all NLP tasks.

Since we have `few_instances` we are now ready to build a vocabulary
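Before calling the library, here is a rough plain-Python sketch of what building such a mapping involves. It assumes two reserved entries followed by tokens ordered by frequency; `build_vocab` is a hypothetical helper and the real `Vocabulary` does considerably more (namespaces, minimum counts, and so on).

```python
from collections import Counter
from typing import Dict, List


def build_vocab(sentences: List[List[str]]) -> Dict[str, int]:
    counts = Counter(tok for sent in sentences for tok in sent)
    # Reserve the first two ids for padding and unknown tokens.
    vocab = {"@@PADDING@@": 0, "@@UNKNOWN@@": 1}
    # most_common() sorts by frequency; ties keep first-seen order.
    for tok, _ in counts.most_common():
        vocab[tok] = len(vocab)
    return vocab


toy_sents = [
    ["You", "are", "a", "bold", "one", "."],
    ["Perhaps", "the", "archives", "are", "incomplete", "."],
]
toy_vocab = build_vocab(toy_sents)
index_to_token = {i: t for t, i in toy_vocab.items()}
```

The two tokens that occur twice (`are` and `.`) receive the lowest non-reserved ids in this sketch.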

In [None]:
vocab =  Vocabulary.from_instances(few_instances)

12/09/2018 22:07:26 - INFO - allennlp.data.vocabulary -   Fitting token dictionary from dataset.
100%|██████████| 2/2 [00:00<00:00, 17848.10it/s]


Now we can get the vocabulary mapping tokens to indices

In [None]:
vocab.get_token_to_index_vocabulary()

{'@@PADDING@@': 0,
 '@@UNKNOWN@@': 1,
 'are': 2,
 '.': 3,
 'You': 4,
 'a': 5,
 'bold': 6,
 'one': 7,
 'Perhaps': 8,
 'the': 9,
 'archives': 10,
 'incomplete': 11}

We can also get the reverse mapping, from indices to tokens

In [None]:
vocab.get_index_to_token_vocabulary()

{0: '@@PADDING@@',
 1: '@@UNKNOWN@@',
 2: 'are',
 3: '.',
 4: 'You',
 5: 'a',
 6: 'bold',
 7: 'one',
 8: 'Perhaps',
 9: 'the',
 10: 'archives',
 11: 'incomplete'}

To look up just the token for a given index, or just the index for a given token

In [None]:
vocab.get_token_from_index(2)

'are'

In [None]:
vocab.get_token_index('are')

2

We can also get some statistics about the vocabulary created

In [None]:
vocab.print_statistics()

12/09/2018 22:07:26 - INFO - allennlp.data.vocabulary -   Printed vocabulary statistics are only for the part of the vocabulary generated from instances. If vocabulary is constructed by extending saved vocabulary with dataset instances, the directly loaded portion won't be considered here.




----Vocabulary Statistics----


Top 10 most frequent tokens in namespace 'tokens':
	Token: are		Frequency: 2
	Token: .		Frequency: 2
	Token: You		Frequency: 1
	Token: a		Frequency: 1
	Token: bold		Frequency: 1
	Token: one		Frequency: 1
	Token: Perhaps		Frequency: 1
	Token: the		Frequency: 1
	Token: archives		Frequency: 1
	Token: incomplete		Frequency: 1

Top 10 longest tokens in namespace 'tokens':
	Token: incomplete		length: 10	Frequency: 1
	Token: archives		length: 8	Frequency: 1
	Token: Perhaps		length: 7	Frequency: 1
	Token: bold		length: 4	Frequency: 1
	Token: are		length: 3	Frequency: 2
	Token: You		length: 3	Frequency: 1
	Token: one		length: 3	Frequency: 1
	Token: the		length: 3	Frequency: 1
	Token: .		length: 1	Frequency: 2
	Token: a		length: 1	Frequency: 1

Top 10 shortest tokens in namespace 'tokens':
	Token: a		length: 1	Frequency: 1
	Token: .		length: 1	Frequency: 2
	Token: the		length: 3	Frequency: 1
	Token: one		length: 3	Frequency: 1
	Token: You		length: 3	Frequency: 1

#### File Utils

[`cached_path`](https://allenai.github.io/allennlp-docs/api/allennlp.common.file_utils.html#allennlp.common.file_utils.cached_path): A convenience function that takes either a URL or a local path. Given a URL, it downloads the file to a local cache; given a local path, it checks that the file exists. Either way it returns the local path.
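The URL-or-path logic can be sketched with the standard library alone. This is only my guess at the general shape, not allennlp's implementation: `resolve_path`, its `cache_dir` parameter and the hash-based file name are all assumptions made for illustration.

```python
import hashlib
import os
import tempfile
import urllib.request
from urllib.parse import urlparse


def resolve_path(url_or_path: str, cache_dir: str) -> str:
    """Return a local path for a URL (downloading once) or an existing file."""
    if urlparse(url_or_path).scheme in ("http", "https"):
        os.makedirs(cache_dir, exist_ok=True)
        # Assumed naming scheme: hash the URL to get a stable file name.
        name = hashlib.sha256(url_or_path.encode("utf-8")).hexdigest()
        local_path = os.path.join(cache_dir, name)
        if not os.path.exists(local_path):
            urllib.request.urlretrieve(url_or_path, local_path)
        return local_path
    if not os.path.exists(url_or_path):
        raise FileNotFoundError(url_or_path)
    return url_or_path


# A local file is returned unchanged once we know it exists.
with tempfile.TemporaryDirectory() as tmp:
    local_file = os.path.join(tmp, "training.txt")
    with open(local_file, "w") as f:
        f.write("The###DET dog###NN\n")
    resolved = resolve_path(local_file, cache_dir=os.path.join(tmp, "cache"))
    same_path = resolved == local_file
```

The usage at the bottom only exercises the local-path branch, so it runs without network access.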

In [None]:
from allennlp.common.file_utils import cached_path

In [None]:
train_dataset_path = cached_path(
    'https://raw.githubusercontent.com/allenai/allennlp'
    '/master/tutorials/tagger/training.txt')
validation_dataset_path = cached_path(
    'https://raw.githubusercontent.com/allenai/allennlp'
    '/master/tutorials/tagger/validation.txt')

We can open the file and see the format

In [None]:
with open(train_dataset_path, 'r') as f:
    lines = f.readlines()

In [None]:
lines

['The###DET dog###NN ate###V the###DET apple###NN\n',
 'Everybody###NN read###V that###DET book###NN\n']

In [None]:
with open(validation_dataset_path, 'r') as f:
    lines = f.readlines()

In [None]:
lines

['The###DET dog###NN read###V the###DET apple###NN\n',
 'Everybody###NN ate###V that###DET book###NN\n']
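Each line in these files is a sequence of `word###TAG` pairs separated by spaces. Here is a small standalone sketch of how such lines can be split into words and tags with a generator; `read_tagged_lines` is a made-up name and it works on plain strings rather than allennlp objects.

```python
from typing import Iterator, List, Tuple


def read_tagged_lines(lines: List[str]) -> Iterator[Tuple[List[str], List[str]]]:
    for line in lines:
        pairs = line.strip().split()
        if not pairs:
            continue  # skip blank lines
        # "dog###NN" -> ("dog", "NN"); zip(*...) regroups into words and tags.
        words, tags = zip(*(pair.split("###") for pair in pairs))
        yield list(words), list(tags)


example_lines = [
    "The###DET dog###NN ate###V the###DET apple###NN\n",
    "Everybody###NN read###V that###DET book###NN\n",
]
parsed = list(read_tagged_lines(example_lines))
```

Because the function uses `yield`, nothing is parsed until something iterates over it; `list(...)` forces the whole pass here.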

#### DataSet Readers

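`DatasetReader` is the superclass for all dataset readers. It reads a dataset through `read` and converts text to an instance through `text_to_instance`. A custom dataset reader needs to implement both `_read` (which `read` calls) and `text_to_instance`. The `lazy` flag controls whether the whole dataset is loaded at once.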

In [None]:
from allennlp.data.dataset_readers import DatasetReader

[`DatasetReader`](https://allenai.github.io/allennlp-docs/api/allennlp.data.dataset_readers.dataset_reader.html#allennlp.data.dataset_readers.dataset_reader.DatasetReader): Reads a file containing your dataset and returns an iterable of instances. With `lazy=True`, `_read` is called again each time you iterate over the dataset (useful when the data is too large to keep in memory); with `lazy=False`, `read` returns a list.
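The difference between the two modes can be illustrated without allennlp. In this sketch, `LazyInstances` and `make_reader` are hypothetical names of my own; the real lazy wrapper in the library may work differently in its details.

```python
from typing import Callable, Iterator, List


class LazyInstances:
    """Re-runs the read function on every pass instead of storing results."""

    def __init__(self, read_fn: Callable[[], Iterator[str]]) -> None:
        self.read_fn = read_fn

    def __iter__(self) -> Iterator[str]:
        return self.read_fn()


def make_reader(lines: List[str], log: List[str]) -> Callable[[], Iterator[str]]:
    def read() -> Iterator[str]:
        log.append("read")  # record each time the data source is read
        yield from lines
    return read


read_log: List[str] = []
toy_read = make_reader(["first instance", "second instance"], read_log)

eager = list(toy_read())         # like lazy=False: read once, keep a list
lazy = LazyInstances(toy_read)   # like lazy=True: read again on every pass
first_pass = list(lazy)
second_pass = list(lazy)
```

The eager list costs memory but is read once; the lazy wrapper holds nothing and pays for a fresh read on each pass.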

Defining a dataset reader for POS tagging. Note that the `_read` function returns a generator using [`yield`](https://www.pythoncentral.io/python-generators-and-yield-keyword/)

In [None]:
class PosDatasetReader(DatasetReader):
    """
    DatasetReader for PoS tagging data, one sentence per line, like

        The###DET dog###NN ate###V the###DET apple###NN
    """
    def __init__(self, token_indexers: Dict[str, TokenIndexer] = None, lazy: bool = False) -> None:
        super().__init__(lazy=lazy)
        self.token_indexers = token_indexers or {"tokens": SingleIdTokenIndexer()}
    def text_to_instance(self, tokens: List[Token], tags: List[str] = None) -> Instance:
        sentence_field = TextField(tokens, self.token_indexers)
        fields = {"sentence": sentence_field}

        if tags:
            label_field = SequenceLabelField(labels=tags, sequence_field=sentence_field)
            fields["labels"] = label_field

        return Instance(fields)
    def _read(self, file_path: str) -> Iterator[Instance]:
        with open(file_path) as f:
            for line in f:
                pairs = line.strip().split()
                sentence, tags = zip(*(pair.split("###") for pair in pairs))
                yield self.text_to_instance([Token(word) for word in sentence], tags)

Note that in the `__init__` constructor we define the `token_indexers` dictionary with the key `tokens` and the value `SingleIdTokenIndexer()`. The token indexers apply only to the tokens in the `TextField`, not to the `labels`. Nothing is converted to indices yet: that only happens once a vocabulary is applied.

We can now define our datasets

In [None]:
reader = PosDatasetReader()

In [None]:
train_dataset = reader.read(train_dataset_path)
validation_dataset = reader.read(validation_dataset_path)

2it [00:00, 4837.72it/s]
2it [00:00, 5017.11it/s]


We can confirm the type of outputs from the train and validation sets

In [None]:
type(train_dataset), type(validation_dataset)

(list, list)

To confirm if we have everything correct

In [None]:
train_dataset[0].__dict__['fields']['sentence'].__dict__

{'tokens': [The, dog, ate, the, apple],
 '_token_indexers': {'tokens': <allennlp.data.token_indexers.single_id_token_indexer.SingleIdTokenIndexer at 0x7fdf3d9dfb38>},
 '_indexed_tokens': None,
 '_indexer_name_to_indexed_token': None}

Note that we can get `lazy` functionality by passing it as an argument when creating the reader

In [None]:
reader_lazy = PosDatasetReader(lazy=True)

In [None]:
train_dataset_lazy = reader_lazy.read(train_dataset_path)
valid_dataset_lazy = reader_lazy.read(validation_dataset_path)

In [None]:
type(train_dataset_lazy), type(valid_dataset_lazy)

(allennlp.data.dataset_readers.dataset_reader._LazyInstances,
 allennlp.data.dataset_readers.dataset_reader._LazyInstances)

We don't deal with lazy instances in this tutorial, so we leave them for later.

#### Recreating the vocab

We re-create the vocab from the given training and validation data

In [None]:
vocab = Vocabulary.from_instances(train_dataset + validation_dataset)

12/09/2018 22:07:26 - INFO - allennlp.data.vocabulary -   Fitting token dictionary from dataset.
100%|██████████| 4/4 [00:00<00:00, 34379.54it/s]


In [None]:
vocab.__dict__

{'_padding_token': '@@PADDING@@',
 '_oov_token': '@@UNKNOWN@@',
 '_non_padded_namespaces': {'*labels', '*tags'},
 '_token_to_index': _TokenToIndexDefaultDict(None,
                          {'tokens': {'@@PADDING@@': 0,
                            '@@UNKNOWN@@': 1,
                            'The': 2,
                            'dog': 3,
                            'ate': 4,
                            'the': 5,
                            'apple': 6,
                            'Everybody': 7,
                            'read': 8,
                            'that': 9,
                            'book': 10},
                           'labels': {'NN': 0, 'DET': 1, 'V': 2}}),
 '_index_to_token': _IndexToTokenDefaultDict(None,
                          {'tokens': {0: '@@PADDING@@',
                            1: '@@UNKNOWN@@',
                            2: 'The',
                            3: 'dog',
                            4: 'ate',
                            5: 'the',
      

Note that the vocabulary now also has a `labels` namespace, which was not present previously.

#### Iterators

Once the dataset is created, we need a way to iterate over it in batches.

In [None]:
from allennlp.data.iterators import BucketIterator

[`BucketIterator`](https://allenai.github.io/allennlp-docs/api/allennlp.data.iterators.html#allennlp.data.iterators.bucket_iterator.BucketIterator): Groups instances of similar length into batches and, by default, pads all the sequences in a batch to the length of the longest one. It has a helpful `biggest_batch_first` option, which makes an out-of-memory error show up early rather than late in training.

Additionally, you need to attach the `vocab` manually using the `index_with` method.
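The bucketing idea itself is simple enough to sketch in plain Python. `bucket_batches` is a made-up helper; the real iterator also handles shuffling, sorting keys and tensors, none of which appear here.

```python
from typing import List


def bucket_batches(
    sequences: List[List[int]], batch_size: int, pad_id: int = 0
) -> List[List[List[int]]]:
    # Sorting by length puts similar-length sequences next to each other,
    # so each batch needs little padding.
    ordered = sorted(sequences, key=len)
    batches = []
    for start in range(0, len(ordered), batch_size):
        chunk = ordered[start:start + batch_size]
        longest = max(len(seq) for seq in chunk)
        batches.append([seq + [pad_id] * (longest - len(seq)) for seq in chunk])
    return batches


toy_sequences = [[2, 3, 4, 5, 6], [7, 8], [9, 10, 11, 12], [13, 14, 15]]
batches = bucket_batches(toy_sequences, batch_size=2)
```

Each batch is padded only up to its own longest sequence, so the short batch here has width 3 while the long one has width 5.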

In [None]:
iterator = BucketIterator(batch_size=2, sorting_keys=[("sentence", "num_tokens")])
iterator.index_with(vocab)

Now that we have an iterator we can create the training data generator

In [None]:
train_generator = iterator(train_dataset, num_epochs=1, shuffle=True)
valid_generator = iterator(validation_dataset, num_epochs=1, shuffle=False)

In [None]:
type(train_generator), type(valid_generator)

(generator, generator)

We can see one item that will be passed as a batch

In [None]:
train_item = next(train_generator)

In [None]:
train_item

{'sentence': {'tokens': tensor([[ 7,  8,  9, 10,  0],
          [ 2,  3,  4,  5,  6]])}, 'labels': tensor([[0, 2, 1, 0, 0],
         [1, 0, 2, 1, 0]])}

In [None]:
val_item = next(valid_generator)

In [None]:
val_item

{'sentence': {'tokens': tensor([[ 7,  4,  9, 10,  0],
          [ 2,  3,  8,  5,  6]])}, 'labels': tensor([[0, 2, 1, 0, 0],
         [1, 0, 2, 1, 0]])}

Note that `tokens` and `labels` are tensors whose first dimension is the batch dimension. Also note that both have been zero-padded. For `labels`, index 0 is also the id of `NN`, so padding can only be told apart from a real label by using a mask.

#### Models

Fields, instances, the vocabulary, tokenization and dataset readers all deal with data processing. Now that we are more or less done with that, we can focus on creating the model that actually solves the NLP task at hand.

In [None]:
from allennlp.models import Model

[`Model`](https://allenai.github.io/allennlp-docs/api/allennlp.models.model.html#allennlp.models.model.Model): The base class for AllenNLP models. It subclasses PyTorch's `torch.nn.Module`. The output of `forward` is a dictionary, which allows arbitrary tensors as outputs (often great for debugging). During training, the output dictionary should also contain a `loss` entry, which the `trainer` optimizes. The class has a handful of useful methods:

- `decode`: Takes the output dict from the model's `forward` method and does any post-processing required. By default it does nothing.
- `forward_on_instance`: Takes a single instance as input, converts its tokens to indices using the vocabulary and runs `forward`. It removes the batch dimension from the output tensors and converts them to NumPy arrays.
- `forward_on_instances`: Does the same for multiple instances.

#### TokenEmbedding 

Once we have the index of each word (token) in the vocabulary, we still need to map it to a vector. This is done using an embedding matrix.

In [None]:
from allennlp.modules.token_embedders import Embedding

[`Embedding`](https://allenai.github.io/allennlp-docs/api/allennlp.modules.token_embedders.html#allennlp.modules.token_embedders.embedding.Embedding): A slightly extended version of PyTorch's default `Embedding` module. It adds:
- Higher-order inputs: accepts inputs of shape B x d1 x ... x dn x S, not just B x S. Here B is the batch size and S is the sequence length.
- Pre-specified weights for the embedding matrix, such as vectors from fastText or GloVe.
- The option to freeze the embedding matrix.
- A projection matrix: a linear layer applied to the embedding vectors to transform them to another space.
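At its core an embedding is a table lookup: one row of numbers per vocabulary index. The sketch below shows this with nested lists and made-up sizes; the names `toy_matrix` and `embed` are illustrative only and none of the extras listed above are modelled.

```python
import random
from typing import List

random.seed(1)
TOY_VOCAB_SIZE, TOY_DIM = 11, 6

# One row of TOY_DIM random numbers per vocabulary index.
toy_matrix = [
    [random.uniform(-1.0, 1.0) for _ in range(TOY_DIM)]
    for _ in range(TOY_VOCAB_SIZE)
]


def embed(batch_ids: List[List[int]]) -> List[List[List[float]]]:
    # Shape goes from (batch, seq_len) to (batch, seq_len, TOY_DIM).
    return [[toy_matrix[i] for i in seq] for seq in batch_ids]


vectors = embed([[0, 1], [1, 2]])
```

The same index always maps to the same row, so index 1 yields identical vectors in both sequences.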

We define the embedding dimension

In [None]:
EMBEDDING_DIM = 6

Next we define the embedder

In [None]:
token_embedding = Embedding(num_embeddings=vocab.get_vocab_size('tokens'),
                            embedding_dim=EMBEDDING_DIM)

In [None]:
out_toks = token_embedding(torch.tensor([[0, 1], [1, 2]]))
out_toks

tensor([[[ 0.3061, -0.2622, -0.1152,  0.2788, -0.5593,  0.3563],
         [-0.1222,  0.3022,  0.0826, -0.0727,  0.1648,  0.0293]],

        [[-0.1222,  0.3022,  0.0826, -0.0727,  0.1648,  0.0293],
         [ 0.2170, -0.2315, -0.0433, -0.0535,  0.0861, -0.0024]]],
       grad_fn=<EmbeddingBackward>)

#### TextFieldEmbedder

Remember that our input is in the form of instances, each containing a `TextField`. So rather than using the embedding directly, we use a `TextFieldEmbedder` as a wrapper that takes care of the intermediate steps.

[`TextFieldEmbedder`](https://allenai.github.io/allennlp-docs/api/allennlp.modules.text_field_embedders.html#allennlp.modules.text_field_embedders.text_field_embedder.TextFieldEmbedder): Takes the tensors produced by a `TextField` as input and returns their embedded representation.

In [None]:
from allennlp.modules.text_field_embedders import TextFieldEmbedder, BasicTextFieldEmbedder

We can define the embedder as

In [None]:
word_embeddings = BasicTextFieldEmbedder({"tokens": token_embedding})

Note that the dictionary passed in has the key `tokens`, the same key used for the token indexers of the `TextField`.

#### Seq2Seq Modules

In [None]:
from allennlp.modules.seq2seq_encoders import Seq2SeqEncoder, PytorchSeq2SeqWrapper

[`Seq2SeqEncoder`](https://allenai.github.io/allennlp-docs/api/allennlp.modules.seq2seq_encoders.html#allennlp.modules.seq2seq_encoders.seq2seq_encoder.Seq2SeqEncoder): Takes a sequence of vectors as input and outputs a sequence of the same length, possibly with a different output dimension.

[`PytorchSeq2SeqWrapper`](https://allenai.github.io/allennlp-docs/api/allennlp.modules.seq2seq_encoders.html#allennlp.modules.seq2seq_encoders.pytorch_seq2seq_wrapper.PytorchSeq2SeqWrapper): Wraps a PyTorch seq2seq module such as an LSTM, handling the padding and packing that RNNs require. It needs the mask as well as the input.

We first define the hidden dimension for our LSTM.

In [None]:
HIDDEN_DIM = 6

In [None]:
lstm = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))

In [None]:
lstm

PytorchSeq2SeqWrapper(
  (_module): LSTM(6, 6, batch_first=True)
)

The first argument is the embedded tokens (`out_toks`) and the second argument is the mask (all 1s here, since nothing is padded).

In [None]:
lstm_out = lstm(out_toks, torch.tensor([[1, 1], [1, 1]]))

In [None]:
lstm_out

tensor([[[-0.0382,  0.0517,  0.0439, -0.0416,  0.0206,  0.0437],
         [-0.1595,  0.0633,  0.1049, -0.1001,  0.0205,  0.0742]],

        [[-0.1483,  0.0350,  0.0873, -0.0739,  0.0057,  0.0399],
         [-0.1634,  0.0742,  0.0458, -0.1128,  0.0073,  0.0993]]],
       grad_fn=<IndexSelectBackward>)

In [None]:
lstm_out.shape

torch.Size([2, 2, 6])

The output of the LSTM is the hidden state at each time step of each sequence.

#### Getting text field mask

In [None]:
from allennlp.nn.util import get_text_field_mask

For the LSTM we need to supply the mask.

[`get_text_field_mask`](https://allenai.github.io/allennlp-docs/api/allennlp.nn.util.html#allennlp.nn.util.get_text_field_mask): Given text fields, it returns the mask. Useful for passing to an LSTM.
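Conceptually, the mask just marks which positions hold real tokens and which hold padding. Here is a plain-Python sketch, assuming padding uses index 0 as in the batches shown earlier; `padding_mask` is a made-up helper and not the library function.

```python
from typing import List


def padding_mask(token_ids: List[List[int]], pad_id: int = 0) -> List[List[int]]:
    # 1 for a real token, 0 for a padded position.
    return [[0 if i == pad_id else 1 for i in row] for row in token_ids]


mask = padding_mask([[7, 8, 9, 10, 0], [2, 3, 4, 5, 6]])
print(mask)  # [[1, 1, 1, 1, 0], [1, 1, 1, 1, 1]]
```

The first row has one padded position at the end, so its last mask entry is 0.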

#### Loss Function and Metrics

We have more or less set up the model; we just need to add the loss function and the metrics.

In [None]:
from allennlp.nn.util import sequence_cross_entropy_with_logits

[`sequence_cross_entropy_with_logits`](https://allenai.github.io/allennlp-docs/api/allennlp.nn.util.html#allennlp.nn.util.sequence_cross_entropy_with_logits): Computes the cross-entropy loss over a sequence of logits, weighted by a mask so that padded positions contribute nothing.
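To show what the mask does to the loss, here is a plain-Python sketch that averages the cross entropy over unmasked tokens only. This is a simplification of my own: the library function works on tensors and its exact averaging scheme may differ from this one.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    # Subtract the max first for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def masked_cross_entropy(
    logits: List[List[List[float]]],
    labels: List[List[int]],
    mask: List[List[int]],
) -> float:
    total, count = 0.0, 0
    for seq_logits, seq_labels, seq_mask in zip(logits, labels, mask):
        for tok_logits, label, keep in zip(seq_logits, seq_labels, seq_mask):
            if keep:  # padded positions are skipped entirely
                total -= log_softmax(tok_logits)[label]
                count += 1
    return total / max(count, 1)


toy_logits = [[[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]
toy_labels = [[0, 0]]
loss_all = masked_cross_entropy(toy_logits, toy_labels, [[1, 1]])
loss_masked = masked_cross_entropy(toy_logits, toy_labels, [[1, 0]])
```

Masking out the second (uninformative) position lowers the loss, because only the confidently predicted first token is counted.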

In [None]:
from allennlp.training.metrics import CategoricalAccuracy

[`CategoricalAccuracy`](https://allenai.github.io/allennlp-docs/api/allennlp.training.metrics.html#categorical-accuracy): Computes the top-k classification accuracy (top-1 by default).
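A top-1 version of this metric, with masking, can be sketched in a few lines of plain Python. `masked_accuracy` is a hypothetical helper; the real metric accumulates counts across batches and supports other values of k.

```python
from typing import List


def masked_accuracy(
    logits: List[List[List[float]]],
    labels: List[List[int]],
    mask: List[List[int]],
) -> float:
    correct, total = 0, 0
    for seq_logits, seq_labels, seq_mask in zip(logits, labels, mask):
        for tok_logits, label, keep in zip(seq_logits, seq_labels, seq_mask):
            if keep:
                # Top-1 prediction: index of the largest logit.
                predicted = max(range(len(tok_logits)), key=tok_logits.__getitem__)
                correct += int(predicted == label)
                total += 1
    return correct / total if total else 0.0


toy_logits = [[[0.1, 0.9, 0.0], [0.8, 0.1, 0.1], [0.2, 0.2, 0.6]]]
toy_labels = [[1, 0, 0]]
acc_masked = masked_accuracy(toy_logits, toy_labels, [[1, 1, 0]])
acc_unmasked = masked_accuracy(toy_logits, toy_labels, [[1, 1, 1]])
```

With the last position masked out, both counted predictions are correct; without the mask, the third position counts as a mistake.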

#### Defining the LSTMTagger Model

In [None]:
class LstmTagger(Model):
    def __init__(self,
                 word_embeddings: TextFieldEmbedder,
                 encoder: Seq2SeqEncoder,
                 vocab: Vocabulary) -> None:
        super().__init__(vocab)
        self.word_embeddings = word_embeddings
        self.encoder = encoder
        self.hidden2tag = torch.nn.Linear(in_features=encoder.get_output_dim(),
                                          out_features=vocab.get_vocab_size('labels'))
        self.accuracy = CategoricalAccuracy()
    def forward(self,
                sentence: Dict[str, torch.Tensor],
                labels: torch.Tensor = None) -> Dict[str, torch.Tensor]:
        mask = get_text_field_mask(sentence)
        embeddings = self.word_embeddings(sentence)
        encoder_out = self.encoder(embeddings, mask)
        tag_logits = self.hidden2tag(encoder_out)
        output = {"tag_logits": tag_logits}
        if labels is not None:
            self.accuracy(tag_logits, labels, mask)
            output["loss"] = sequence_cross_entropy_with_logits(tag_logits, labels, mask)

        return output
    def get_metrics(self, reset: bool = False) -> Dict[str, float]:
        return {"accuracy": self.accuracy.get_metric(reset)}

In [None]:
model = LstmTagger(word_embeddings, lstm, vocab)

The model takes in text fields, converts them to word embeddings and passes these through the seq2seq encoder (the LSTM). A linear layer then maps each encoder output to a score for every tag.

In [None]:
model

LstmTagger(
  (word_embeddings): BasicTextFieldEmbedder(
    (token_embedder_tokens): Embedding()
  )
  (encoder): PytorchSeq2SeqWrapper(
    (_module): LSTM(6, 6, batch_first=True)
  )
  (hidden2tag): Linear(in_features=6, out_features=3, bias=True)
)

#### Trainer and Predictor

We have the dataset reader and the model definition, including the loss and the evaluation metric. All that remains is to train the model.

In [None]:
from allennlp.training.trainer import Trainer
from allennlp.predictors import SentenceTaggerPredictor

[`Trainer`](https://allenai.github.io/allennlp-docs/api/allennlp.training.trainer.html#allennlp.training.trainer.Trainer): Takes the model, optimizer, iterator, datasets, patience (the number of epochs to wait for the validation metric to improve before stopping early) and number of epochs. The `train` method trains the model for up to the specified number of epochs.
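The role of `patience` can be illustrated with a toy early-stopping loop over made-up validation losses. `epochs_before_stopping` is a hypothetical helper written for this note; the real trainer tracks more state than this.

```python
from typing import List


def epochs_before_stopping(validation_losses: List[float], patience: int) -> int:
    """Return how many epochs run before early stopping kicks in."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch  # give up: no improvement for `patience` epochs
    return len(validation_losses)


simulated_losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]
stopped_at = epochs_before_stopping(simulated_losses, patience=3)
ran_all = epochs_before_stopping(simulated_losses, patience=10)
```

In the simulated run the best loss is reached at epoch 3, so with a patience of 3 the loop stops after epoch 6.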

[`SentenceTaggerPredictor`](https://allenai.github.io/allennlp-docs/api/allennlp.predictors.html#allennlp.predictors.sentence_tagger.SentenceTaggerPredictor): Given a sentence, predicts the labels. Takes the model and the dataset reader as input.

Define a standard Pytorch optimizer

In [None]:
optimizer = optim.SGD(model.parameters(), lr=0.1)

Define the trainer and train

In [None]:
trainer = Trainer(model=model,
                  optimizer=optimizer,
                  iterator=iterator,
                  train_dataset=train_dataset,
                  validation_dataset=validation_dataset,
                  patience=10,
                  num_epochs=1000)

Run the trainer (output removed because too big)

In [None]:
trainer.train()

Define the Sentence Tagger

In [None]:
predictor = SentenceTaggerPredictor(model, dataset_reader=reader)

Predict the tags

In [None]:
tag_logits = predictor.predict("The dog ate the apple")['tag_logits']

tag_ids = np.argmax(tag_logits, axis=-1)

print([model.vocab.get_token_from_index(i, 'labels') for i in tag_ids])

['DET', 'NN', 'V', 'DET', 'NN']
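The decoding step above is an argmax over the logits of each token, followed by a lookup in the `labels` namespace. Here is the same idea in plain Python, using made-up logits and the label mapping printed earlier (`NN`: 0, `DET`: 1, `V`: 2); `decode_tags` is an illustrative helper.

```python
from typing import Dict, List

# Label ids as shown in the vocabulary's 'labels' namespace earlier.
index_to_label: Dict[int, str] = {0: "NN", 1: "DET", 2: "V"}


def decode_tags(tag_logits: List[List[float]]) -> List[str]:
    # For each token, pick the index of the largest logit and look it up.
    return [
        index_to_label[max(range(len(row)), key=row.__getitem__)]
        for row in tag_logits
    ]


made_up_logits = [[0.1, 2.0, 0.3], [1.5, 0.2, 0.1], [0.0, 0.4, 1.9]]
tags = decode_tags(made_up_logits)
print(tags)  # ['DET', 'NN', 'V']
```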


### Complete program

In [None]:
from typing import Iterator, List, Dict
import torch
import torch.optim as optim
import numpy as np
from allennlp.data import Instance
from allennlp.data.fields import TextField, SequenceLabelField
from allennlp.data.dataset_readers import DatasetReader
from allennlp.common.file_utils import cached_path
from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer
from allennlp.data.tokenizers import Token
from allennlp.data.vocabulary import Vocabulary
from allennlp.models import Model
from allennlp.modules.text_field_embedders import TextFieldEmbedder, BasicTextFieldEmbedder
from allennlp.modules.token_embedders import Embedding
from allennlp.modules.seq2seq_encoders import Seq2SeqEncoder, PytorchSeq2SeqWrapper
from allennlp.nn.util import get_text_field_mask, sequence_cross_entropy_with_logits
from allennlp.training.metrics import CategoricalAccuracy
from allennlp.data.iterators import BucketIterator
from allennlp.training.trainer import Trainer
from allennlp.predictors import SentenceTaggerPredictor

torch.manual_seed(1)
class PosDatasetReader(DatasetReader):
    """
    DatasetReader for PoS tagging data, one sentence per line, like

        The###DET dog###NN ate###V the###DET apple###NN
    """
    def __init__(self, token_indexers: Dict[str, TokenIndexer] = None) -> None:
        super().__init__(lazy=False)
        self.token_indexers = token_indexers or {"tokens": SingleIdTokenIndexer()}
    def text_to_instance(self, tokens: List[Token], tags: List[str] = None) -> Instance:
        sentence_field = TextField(tokens, self.token_indexers)
        fields = {"sentence": sentence_field}

        if tags:
            label_field = SequenceLabelField(labels=tags, sequence_field=sentence_field)
            fields["labels"] = label_field

        return Instance(fields)
    def _read(self, file_path: str) -> Iterator[Instance]:
        with open(file_path) as f:
            for line in f:
                pairs = line.strip().split()
                sentence, tags = zip(*(pair.split("###") for pair in pairs))
                yield self.text_to_instance([Token(word) for word in sentence], tags)
class LstmTagger(Model):
    def __init__(self,
                 word_embeddings: TextFieldEmbedder,
                 encoder: Seq2SeqEncoder,
                 vocab: Vocabulary) -> None:
        super().__init__(vocab)
        self.word_embeddings = word_embeddings
        self.encoder = encoder
        self.hidden2tag = torch.nn.Linear(in_features=encoder.get_output_dim(),
                                          out_features=vocab.get_vocab_size('labels'))
        self.accuracy = CategoricalAccuracy()
    def forward(self,
                sentence: Dict[str, torch.Tensor],
                labels: torch.Tensor = None) -> Dict[str, torch.Tensor]:
        mask = get_text_field_mask(sentence)
        embeddings = self.word_embeddings(sentence)
        encoder_out = self.encoder(embeddings, mask)
        tag_logits = self.hidden2tag(encoder_out)
        output = {"tag_logits": tag_logits}
        if labels is not None:
            self.accuracy(tag_logits, labels, mask)
            output["loss"] = sequence_cross_entropy_with_logits(tag_logits, labels, mask)

        return output
    def get_metrics(self, reset: bool = False) -> Dict[str, float]:
        return {"accuracy": self.accuracy.get_metric(reset)}
reader = PosDatasetReader()
train_dataset = reader.read(cached_path(
    'https://raw.githubusercontent.com/allenai/allennlp'
    '/master/tutorials/tagger/training.txt'))
validation_dataset = reader.read(cached_path(
    'https://raw.githubusercontent.com/allenai/allennlp'
    '/master/tutorials/tagger/validation.txt'))
vocab = Vocabulary.from_instances(train_dataset + validation_dataset)
EMBEDDING_DIM = 6
HIDDEN_DIM = 6
token_embedding = Embedding(num_embeddings=vocab.get_vocab_size('tokens'),
                            embedding_dim=EMBEDDING_DIM)
word_embeddings = BasicTextFieldEmbedder({"tokens": token_embedding})
lstm = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))
model = LstmTagger(word_embeddings, lstm, vocab)
optimizer = optim.SGD(model.parameters(), lr=0.1)
iterator = BucketIterator(batch_size=2, sorting_keys=[("sentence", "num_tokens")])
iterator.index_with(vocab)
trainer = Trainer(model=model,
                  optimizer=optimizer,
                  iterator=iterator,
                  train_dataset=train_dataset,
                  validation_dataset=validation_dataset,
                  patience=10,
                  num_epochs=1000)

In [None]:
trainer.train()

predictor = SentenceTaggerPredictor(model, dataset_reader=reader)

tag_logits = predictor.predict("The dog ate the apple")['tag_logits']

tag_ids = np.argmax(tag_logits, axis=-1)

print([model.vocab.get_token_from_index(i, 'labels') for i in tag_ids])