# Data Pipeline

## Hierarchical system of data structures
 * Easy padding
 * Easy batching
 * Easy iteration

## Steps to feed the dataset into a PyTorch model
 * Create a vocabulary from the dataset
     * Use Vocabulary.from_instances
 * Collect the instances into a Batch
     * Batch provides methods for indexing and for converting to PyTorch tensors
 * Index the words and labels in the Fields
     * Indexing replaces them with the integer indices specified by the Vocabulary
 * Pad the instances to the same length
 * Convert to PyTorch tensors
 

### Vocabulary Creation

In [8]:
from allennlp.data import Vocabulary

vocab = Vocabulary.from_instances(instances)

100%|██████████| 2/2 [00:00<00:00, 5226.55it/s]


### id --> token (word) mapping
   * Use get_index_to_token_vocabulary

In [14]:
# token_ids namespace

print('id -> word mapping for the "token_ids" namespace: ')
print(vocab.get_index_to_token_vocabulary("token_ids"),"\n")

 id -> word mapping for the "token_ids" namespace: 
{0: '@@PADDING@@', 1: '@@UNKNOWN@@', 2: 'this', 3: 'movie', 4: 'was', 5: 'awful', 6: '!', 7: 'quite', 8: 'slow', 9: 'but', 10: 'good', 11: '.'} 



In [13]:
# tags namespace

print('id -> word mapping for the "tags" namespace: ')
print(vocab.get_index_to_token_vocabulary('tags'), '\n')

id -> word mapping for the "tags" namespace: 
{0: 'negative', 1: 'positive'} 
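
The two mappings above differ in one respect: `token_ids` starts with `@@PADDING@@` and `@@UNKNOWN@@`, while `tags` does not. By default AllenNLP treats namespaces whose names end in `tags` or `labels` as non-padded. The following is a minimal pure-Python sketch of that behaviour, not AllenNLP's implementation. The helper `build_namespace` is hypothetical, and the two toy reviews are an assumption chosen to be consistent with the mappings printed above.

```python
from collections import Counter

PADDING, UNKNOWN = "@@PADDING@@", "@@UNKNOWN@@"


def build_namespace(items, padded=True):
    """Map ids to items, most frequent first (ties keep first-seen order)."""
    counts = Counter(items)
    # Padded namespaces reserve ids 0 and 1 for the padding and unknown tokens.
    index_to_token = [PADDING, UNKNOWN] if padded else []
    # most_common() sorts by count, descending; equal counts keep insertion order.
    index_to_token += [token for token, _ in counts.most_common()]
    return dict(enumerate(index_to_token))


# Assumed toy data, consistent with the mappings printed above.
reviews = [
    ("this movie was awful !".split(), "negative"),
    ("this movie was quite slow but good .".split(), "positive"),
]

token_ids = build_namespace([tok for tokens, _ in reviews for tok in tokens])
tags = build_namespace([label for _, label in reviews], padded=False)

print(token_ids)
print(tags)
```

Labels never need padding or an out-of-vocabulary fallback, which is why the `tags` namespace can start directly at id 0.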



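### Token (word) --> id mapping

In [16]:
# _token_to_index is a private attribute; get_token_to_index_vocabulary(namespace) is the public accessor
print('Token to Index dictionary: \n', vocab._token_to_index, '\n')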

Token to Index dictionary: 
 defaultdict(None, {'token_ids': {'@@PADDING@@': 0, '@@UNKNOWN@@': 1, 'this': 2, 'movie': 3, 'was': 4, 'awful': 5, '!': 6, 'quite': 7, 'slow': 8, 'but': 9, 'good': 10, '.': 11}, 'tags': {'negative': 0, 'positive': 1}}) 
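
Indexing consults this token-to-id dictionary, and a token that is missing from a padded namespace falls back to the `@@UNKNOWN@@` id. Below is a minimal sketch of that lookup in plain Python. The helper `token_to_id` is hypothetical (in AllenNLP, `Vocabulary.get_token_index` plays this role), and the example sentence is an assumption.

```python
# Copied from the "token_ids" namespace printed above.
token_to_index = {
    "@@PADDING@@": 0, "@@UNKNOWN@@": 1, "this": 2, "movie": 3, "was": 4,
    "awful": 5, "!": 6, "quite": 7, "slow": 8, "but": 9, "good": 10, ".": 11,
}


def token_to_id(token, mapping, oov_token="@@UNKNOWN@@"):
    """Return the id of `token`, falling back to the unknown-token id."""
    return mapping.get(token, mapping[oov_token])


# Assumed example sentence; "great" does not occur in the vocabulary.
sentence = "this movie was great !".split()
ids = [token_to_id(tok, token_to_index) for tok in sentence]
print(ids)
```

Every known token keeps its own id, while the unseen word `great` is mapped to id 1.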



### Collect the instances (dataset) into a Batch and index them
  * This step must be performed before generating tensors

In [17]:
from allennlp.data.dataset import Batch

batch = Batch(instances)
# index batch using vocabulary
batch.index_instances(vocab)

### Pad the instances to the same length

In [18]:
# get the padding lengths

padding_lengths = batch.get_padding_lengths()
print("Lengths used for padding : ", padding_lengths, "\n")

# pad the instances and return PyTorch tensors
tensor_dict = batch.as_tensor_dict(padding_lengths)
print("Look how tensors are padded!!! \n", tensor_dict)

Lengths used for padding :  {'review': {'num_tokens': 8}} 

Look how tensors are padded!!! 
 {'review': {'tokens': tensor([[ 2,  3,  4,  5,  6,  0,  0,  0],
        [ 2,  3,  4,  7,  8,  9, 10, 11]])}, 'label': tensor([0, 1])}
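
The padding step itself is simple: find the longest sequence in the batch and right-pad the others with the padding id 0. Here is a minimal sketch using plain Python lists instead of tensors. The helper `pad_batch` is hypothetical, and the indexed reviews are taken from the tensor printed above with the padding removed.

```python
def pad_batch(sequences, padding_value=0):
    """Right-pad every sequence to the length of the longest one."""
    num_tokens = max(len(seq) for seq in sequences)
    padded = [seq + [padding_value] * (num_tokens - len(seq)) for seq in sequences]
    return num_tokens, padded


# Indexed reviews, taken from the tensor printed above (padding removed).
indexed_reviews = [
    [2, 3, 4, 5, 6],
    [2, 3, 4, 7, 8, 9, 10, 11],
]

num_tokens, padded = pad_batch(indexed_reviews)
print(num_tokens)
for row in padded:
    print(row)
```

Because every row now has the same length, the nested list is rectangular and could be passed to `torch.tensor` to obtain a 2 x 8 tensor like the one shown above.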


# The role of TokenIndexer

* Conventional pre-processing flow
    * token --> indexing --> embedding

* AllenNLP pre-processing flow
    * token --> TokenIndexer --> TokenEmbedder --> TextFieldEmbedder

* What if we want to use multiple indexers?
    * e.g. TokenCharactersIndexer --> generates indices for each character in a token
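
To make the idea of several indexers concrete, here is a minimal pure-Python sketch in which the same token list is indexed twice, once per word and once per character, with the results stored under separate keys. The helper `build_vocab` and the two lambda "indexers" are hypothetical stand-ins, not AllenNLP classes.

```python
from collections import Counter


def build_vocab(items):
    """Ids 0 and 1 are reserved; the remaining ids follow descending frequency."""
    ordered = ["@@PADDING@@", "@@UNKNOWN@@"]
    ordered += [item for item, _ in Counter(items).most_common()]
    return {item: index for index, item in enumerate(ordered)}


tokens = ["here", "are", "some", "longer", "words", "."]

# One vocabulary per representation.
word_vocab = build_vocab(tokens)
char_vocab = build_vocab(ch for token in tokens for ch in token)

# Hypothetical stand-ins for a single-id indexer and a character indexer.
indexers = {
    "tokens": lambda toks: [word_vocab[t] for t in toks],
    "chars": lambda toks: [[char_vocab[ch] for ch in t] for t in toks],
}

# Each indexer sees the same tokens and produces its own index structure.
indexed = {key: index(tokens) for key, index in indexers.items()}
print(indexed["tokens"])
print(indexed["chars"])
```

The word-level result is a flat list with one id per token, while the character-level result is a list of lists, which is why it later needs a second padding dimension.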

In [19]:
# for large batches --> Iterator
# fixed batch size, bucketing, stochastic sorting

# Normal flow : tokenization -> indexing -> embedding
# AllenNLP    : tokenization -> TokenIndexers -> TokenEmbedders -> TextFieldEmbedders

# e.g. TokenCharactersIndexer --> takes the words in a TextField
#                                 and generates indices for the characters in each word

In [21]:
from allennlp.data import Instance
from allennlp.data.fields import TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer, TokenCharactersIndexer
from allennlp.data.tokenizers import Token

tokens = list(map(Token, ['here', 'are', 'some', 'longer', 'words', '.']))
# one indexer for word ids, one for per-token character ids
token_indexers = {'tokens': SingleIdTokenIndexer(namespace='token_ids'),
                  'chars': TokenCharactersIndexer(namespace='token_chars')}

word_and_character_text_field = TextField(tokens, token_indexers)

mini_dataset = Batch([Instance({"sentence": word_and_character_text_field})])

word_and_char_vocab = Vocabulary.from_instances(mini_dataset)

mini_dataset.index_instances(word_and_char_vocab)

print("this is the id -> word mapping for the 'token_ids' namespace: ")
print(word_and_char_vocab.get_index_to_token_vocabulary("token_ids"), "\n")
print("this is the id -> word mapping for the 'token_chars' namespace: ")
print(word_and_char_vocab.get_index_to_token_vocabulary("token_chars"), '\n')

1it [00:00, 6150.01it/s]

this is the id -> word mapping for the 'token_ids' namespace: 
{0: '@@PADDING@@', 1: '@@UNKNOWN@@', 2: 'here', 3: 'are', 4: 'some', 5: 'longer', 6: 'words', 7: '.'} 

this is the id -> word mapping for the 'token_chars' namespace: 
{0: '@@PADDING@@', 1: '@@UNKNOWN@@', 2: 'e', 3: 'r', 4: 'o', 5: 's', 6: 'h', 7: 'a', 8: 'm', 9: 'l', 10: 'n', 11: 'g', 12: 'w', 13: 'd', 14: '.'} 





In [23]:
padding_lengths = mini_dataset.get_padding_lengths()
print("Lengths used for padding (note that we now have a new \n"
      "padding key num_token_characters from the TokenCharactersIndexer):")
print(padding_lengths, "\n")

tensor_dict = mini_dataset.as_tensor_dict(padding_lengths)

print("The resulting PyTorch Tensor is : \n", tensor_dict)

Lengths used for padding (note that we now have a new 
padding key num_token_characters from the TokenCharactersIndexer):
{'sentence': {'num_tokens': 6, 'num_token_characters': 6}} 

The resulting PyTorch Tensor is : 
 {'sentence': {'tokens': tensor([[2, 3, 4, 5, 6, 7]]), 'chars': tensor([[[ 6,  2,  3,  2,  0,  0],
         [ 7,  3,  2,  0,  0,  0],
         [ 5,  4,  8,  2,  0,  0],
         [ 9,  4, 10, 11,  2,  3],
         [12,  4,  3, 13,  5,  0],
         [14,  0,  0,  0,  0,  0]]])}}
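
The two padding keys can be reproduced by hand: `num_tokens` is the number of tokens in the sentence, and `num_token_characters` is the length of its longest token. The sketch below uses plain Python lists. The character ids are copied from the `token_chars` mapping printed earlier, and the helper `index_and_pad` is hypothetical.

```python
# Character ids copied from the "token_chars" mapping printed earlier.
char_to_index = {
    "e": 2, "r": 3, "o": 4, "s": 5, "h": 6, "a": 7, "m": 8,
    "l": 9, "n": 10, "g": 11, "w": 12, "d": 13, ".": 14,
}

tokens = ["here", "are", "some", "longer", "words", "."]

padding_lengths = {
    "num_tokens": len(tokens),
    "num_token_characters": max(len(token) for token in tokens),
}


def index_and_pad(token, width, padding_value=0):
    """Convert a token to character ids and right-pad it to `width`."""
    ids = [char_to_index[ch] for ch in token]
    return ids + [padding_value] * (width - len(ids))


char_ids = [index_and_pad(t, padding_lengths["num_token_characters"]) for t in tokens]

print(padding_lengths)
for row in char_ids:
    print(row)
```

The rows match the `chars` tensor above. The real tensor has one extra leading dimension because the batch contains a single instance.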


In [43]:
# Note that the keys of the token_indexers dictionary passed to the
# TextField differ from the namespaces.
# This is because a namespace can be reused by several TokenIndexers.
token_indexers

{'tokens': <allennlp.data.token_indexers.single_id_token_indexer.SingleIdTokenIndexer at 0x7f04680702b0>,
 'chars': <allennlp.data.token_indexers.token_characters_indexer.TokenCharactersIndexer at 0x7f0468070e48>}