## Learning distributed representations of words and sentences

### Recent techniques
 * pre-trained word vectors
 * character-level CNN encoding
 * sub-word token representations (e.g. byte-pair encodings)
 * higher level linguistic features
    * part of speech (POS) tags
    * named entities
    * dependency paths

### Key abstractions for expressivity
  * TokenIndexers
    * generate indexed tensors for a sentence in different ways
       * e.g. SingleIdTokenIndexer (one id per token) vs TokenCharactersIndexer (a sequence of character ids per token)
  * TokenEmbedders
    * transform indexed tensors into embedding representations
       * simple case: a PyTorch Embedding layer
       * complex case: TokenCharactersEncoder, which embeds the characters of each token and applies a CNN over them
  * TextFieldEmbedders
    * wrap a set of TokenEmbedders
    * apply each TokenEmbedder to its indexed input and concatenate the results (possibly with other operations)
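
To make the difference between the two indexing styles concrete, here is a minimal pure-Python sketch. The helper names (`single_id_index`, `characters_index`) and the toy vocabularies are hypothetical and only illustrate the idea; this is not how AllenNLP implements its indexers.

```python
# Illustrative only: hypothetical helpers, not AllenNLP's implementation.
from typing import Dict, List

PAD, UNK = 0, 1  # assumed reserved ids for padding and unknown entries


def single_id_index(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Word-level view: every token becomes one integer id."""
    return [vocab.get(token, UNK) for token in tokens]


def characters_index(tokens: List[str], char_vocab: Dict[str, int]) -> List[List[int]]:
    """Character-level view: every token becomes a list of character ids."""
    return [[char_vocab.get(ch, UNK) for ch in token] for token in tokens]


tokens = ["cool", "kids"]
word_vocab = {"cool": 2, "kids": 3}
# Toy character vocabulary: ids start at 2 because 0 and 1 are reserved above.
char_vocab = {ch: i + 2 for i, ch in enumerate(sorted(set("".join(tokens))))}

word_ids = single_id_index(tokens, word_vocab)    # one id per token
char_ids = characters_index(tokens, char_vocab)   # one id list per token
```

The same sentence therefore yields a 1-D list of ids under one indexer and a 2-D list under the other, which is why each indexer needs its own embedder later on.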



## Step 1. Prepare Dataset

In [1]:
from allennlp.data.fields import TextField
from allennlp.data import Instance
from allennlp.data.token_indexers \
    import SingleIdTokenIndexer, TokenCharactersIndexer
from allennlp.data.tokenizers import Token

In [2]:
# Tokens
words = ['All','the','cool','kids','use','character','embeddings','.']
words2 = ['I','prefer','word2vec','thouhg','...']

tokens1 = list(map(Token,words))
tokens2 = list(map(Token,words2))

In [3]:
# token_indexers

token_indexers = {"tokens":SingleIdTokenIndexer(namespace='token_ids'),
                  'characters': TokenCharactersIndexer(namespace='token_characters')}
sentence = TextField(tokens1,token_indexers)
sentence2 = TextField(tokens2,token_indexers)

instance = Instance({"sentence":sentence})
instance2 = Instance({"sentence":sentence2})

instances = [instance,instance2]

In [4]:
from allennlp.data import Vocabulary
from allennlp.data.dataset import Batch

# Vocabulary
vocab = Vocabulary.from_instances(instances)

# batch
batch = Batch(instances)

# index every field of every instance against the vocabulary
batch.index_instances(vocab)

100%|██████████| 2/2 [00:00<00:00, 2953.74it/s]
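
As a rough picture of what building a vocabulary namespace involves, the sketch below counts items, reserves ids 0 and 1 for padding and unknown entries, and assigns the remaining ids by descending frequency. The function `build_namespace` is a hypothetical stand-in, and the tie-breaking rule (first appearance) is an assumption rather than a documented guarantee of `Vocabulary.from_instances`.

```python
# Illustrative only: a hypothetical stand-in for one vocabulary namespace.
from collections import Counter
from typing import Dict, Iterable


def build_namespace(items: Iterable[str]) -> Dict[str, int]:
    counts = Counter(items)
    # Ids 0 and 1 are reserved for padding and out-of-vocabulary entries.
    vocab = {"@@PADDING@@": 0, "@@UNKNOWN@@": 1}
    # Most frequent first; sorted() is stable, so ties keep first-seen order.
    for item, _ in sorted(counts.items(), key=lambda kv: -kv[1]):
        vocab[item] = len(vocab)
    return vocab


sentences = [
    ['All', 'the', 'cool', 'kids', 'use', 'character', 'embeddings', '.'],
    ['I', 'prefer', 'word2vec', 'thouhg', '...'],
]

# One namespace for whole tokens, one for individual characters.
token_ids = build_namespace(w for s in sentences for w in s)
char_ids = build_namespace(c for s in sentences for w in s for c in w)
```

For this toy data the sketch gives `All` the id 2 and `e` (the most frequent character) the id 2 in its own namespace, which agrees with the tensors printed later in this walkthrough; on other data the real ordering may differ.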


## Step 2. Define the TokenEmbedder

In [26]:
from allennlp.modules.token_embedders \
    import Embedding, TokenCharactersEncoder

# CnnEncoder is used to build the character-level CNN embedder
from allennlp.modules.seq2vec_encoders import CnnEncoder


In [33]:
# Simple word embedder
word_embedding = Embedding(
    num_embeddings=vocab.get_vocab_size("token_ids"),
    embedding_dim=3)


In [34]:
"""
 character embedding (transform)

  - Input  : tensor (batch_size,
                     max_num_words_in_sentence,
                     max_char_len_in_word)
  - Output : tensor (batch_size,
                     max_num_words_in_sentence,
                     max_char_len_in_word,
                     embedding_dim)
 CNN encoder

  - Output : tensor (batch_size,
                     max_num_words_in_sentence,
                     output_dim)
    (if output_dim is not given, the last dimension is
     num_filters * len(ngram_filter_sizes))

"""

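# Embeds every character id into a 5-dimensional vector
char_embedding = Embedding(
    num_embeddings=vocab.get_vocab_size("token_characters"),
    embedding_dim=5)

# embedding_dim must match the character embedding size above;
# output_dim=4 projects the pooled filter outputs to 4 dimensions
character_cnn = CnnEncoder(
    embedding_dim=5,
    num_filters=2,
    output_dim=4)

# Embeds the characters of each token, then encodes them with the CNN
token_character_encoder = TokenCharactersEncoder(
    embedding=char_embedding,
    encoder=character_cnn)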
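
The following pure-Python sketch shows the core idea behind a character CNN encoder for a single word: slide each n-gram filter along the character axis, apply ReLU, and max-pool over positions. It is a simplification under stated assumptions: there are no bias terms and no final projection, the filter weights are hand-picked, and `conv_max_pool` and `encoder_output_dim` are hypothetical helpers, not part of AllenNLP. The default filter sizes `(2, 3, 4, 5)` used in the example are what I understand `CnnEncoder` to use, so treat that as an assumption too.

```python
# Illustrative only: a simplified character CNN for one word.
from typing import List, Optional, Sequence

Vector = List[float]


def conv_max_pool(char_vectors: List[Vector], filters: List[List[Vector]]) -> Vector:
    """Each filter is a list of n weight vectors (one per n-gram position)."""
    pooled = []
    for filt in filters:
        width = len(filt)
        best = 0.0  # ReLU output is never below zero
        for start in range(len(char_vectors) - width + 1):
            window = char_vectors[start:start + width]
            # Dot product between the filter and the current n-gram window
            score = sum(w * x
                        for wv, xv in zip(filt, window)
                        for w, x in zip(wv, xv))
            best = max(best, score)  # max-pool over positions
        pooled.append(best)
    return pooled


def encoder_output_dim(num_filters: int,
                       ngram_filter_sizes: Sequence[int],
                       output_dim: Optional[int] = None) -> int:
    """Size of the encoded word vector, with or without a projection."""
    pooled_dim = num_filters * len(ngram_filter_sizes)
    return pooled_dim if output_dim is None else output_dim


# A 3-character word with 2-dimensional character embeddings
word = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bigram_filter = [[1.0, 0.0], [0.0, 1.0]]
trigram_filter = [[-1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]

pooled = conv_max_pool(word, [bigram_filter, trigram_filter])
projected_dim = encoder_output_dim(2, (2, 3, 4, 5), output_dim=4)
unprojected_dim = encoder_output_dim(2, (2, 3, 4, 5))
```

Because the pooling step collapses the character axis, every word ends up with a fixed-size vector regardless of its length, which is what lets the result be concatenated with a word embedding.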

## Step 3. Define the TextFieldEmbedder

In [35]:
# BasicTextFieldEmbedder

from allennlp.modules.text_field_embedders \
    import BasicTextFieldEmbedder


text_field_embedder = BasicTextFieldEmbedder(
                        {'tokens': word_embedding, 
                         'characters':token_character_encoder})
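
Conceptually, the combined embedder concatenates the per-indexer outputs along the last dimension. The sketch below does this with nested lists so the resulting shape is easy to check. `concat_last_dim` is a hypothetical helper, and iterating the keys in sorted order is an assumption about the library's behaviour in the release used here, not something confirmed by this walkthrough.

```python
# Illustrative only: concatenating embedder outputs along the last dimension.
from typing import Dict, List

Tensor3D = List[List[List[float]]]


def concat_last_dim(embedded: Dict[str, Tensor3D]) -> Tensor3D:
    keys = sorted(embedded)  # assumed iteration order
    first = embedded[keys[0]]
    return [
        [
            # Join the vectors for token t of sentence b from every embedder
            [value for key in keys for value in embedded[key][b][t]]
            for t in range(len(first[b]))
        ]
        for b in range(len(first))
    ]


batch_size, num_tokens = 2, 8
embedded = {
    # word embedding: 3 dimensions per token
    "tokens": [[[0.0] * 3 for _ in range(num_tokens)] for _ in range(batch_size)],
    # character CNN encoding: 4 dimensions per token
    "characters": [[[1.0] * 4 for _ in range(num_tokens)] for _ in range(batch_size)],
}

combined = concat_last_dim(embedded)
shape = (len(combined), len(combined[0]), len(combined[0][0]))
```

With a 3-dimensional word embedding and a 4-dimensional character encoding, each token ends up with a 7-dimensional vector.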

In [39]:
# Apply text_field_embedder to the data and see what happens

# Convert the indexed dataset into PyTorch tensors
batch = Batch(instances)
tensors = batch.as_tensor_dict(batch.get_padding_lengths())
print("torch tensors for passing to a model: \n\n", tensors)


torch tensors for passing to a model: 

 {'sentence': {'tokens': tensor([[ 2,  3,  4,  5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14,  0,  0,  0]]), 'characters': tensor([[[16,  9,  9,  0,  0,  0,  0,  0,  0,  0],
         [10,  4,  2,  0,  0,  0,  0,  0,  0,  0],
         [ 5,  6,  6,  9,  0,  0,  0,  0,  0,  0],
         [17, 12,  7, 11,  0,  0,  0,  0,  0,  0],
         [13, 11,  2,  0,  0,  0,  0,  0,  0,  0],
         [ 5,  4, 14,  3, 14,  5, 10,  2,  3,  0],
         [ 2, 18, 19,  2,  7,  7, 12, 20, 15, 11],
         [ 8,  0,  0,  0,  0,  0,  0,  0,  0,  0]],

        [[21,  0,  0,  0,  0,  0,  0,  0,  0,  0],
         [22,  3,  2, 23,  2,  3,  0,  0,  0,  0],
         [24,  6,  3,  7, 25, 26,  2,  5,  0,  0],
         [10,  4,  6, 13,  4, 15,  0,  0,  0,  0],
         [ 8,  8,  8,  0,  0,  0,  0,  0,  0,  0],
         [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
         [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
         [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0]]])}}
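
The zeros in the tensors above are padding. A minimal sketch of how such padding could be produced, and how a mask can be derived from it, is shown below. The helpers `pad_token_ids` and `pad_char_ids` are hypothetical, and the ids are placeholders (any non-zero value would do) rather than the vocabulary ids printed above.

```python
# Illustrative only: padding a batch to rectangular shape with id 0.
from typing import List

sentences = [
    ['All', 'the', 'cool', 'kids', 'use', 'character', 'embeddings', '.'],
    ['I', 'prefer', 'word2vec', 'thouhg', '...'],
]

# Placeholder ids: any non-zero id works for a padding demonstration.
token_ids = [[i + 2 for i, _ in enumerate(s)] for s in sentences]
char_ids = [[[ord(c) for c in w] for w in s] for s in sentences]


def pad_token_ids(batch: List[List[int]], pad: int = 0) -> List[List[int]]:
    max_len = max(len(s) for s in batch)
    return [s + [pad] * (max_len - len(s)) for s in batch]


def pad_char_ids(batch: List[List[List[int]]], pad: int = 0) -> List[List[List[int]]]:
    max_words = max(len(s) for s in batch)
    max_chars = max(len(w) for s in batch for w in s)
    padded = []
    for s in batch:
        # Pad every word to max_chars, then add all-padding words
        words = [w + [pad] * (max_chars - len(w)) for w in s]
        words += [[pad] * max_chars for _ in range(max_words - len(s))]
        padded.append(words)
    return padded


padded_tokens = pad_token_ids(token_ids)   # shape (2, 8)
padded_chars = pad_char_ids(char_ids)      # shape (2, 8, 10)
# 1 for real tokens, 0 for padding
mask = [[int(i != 0) for i in row] for row in padded_tokens]
```

The longest sentence has 8 tokens and the longest word (`embeddings`) has 10 characters, which is where the `(2, 8)` and `(2, 8, 10)` shapes come from.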


In [41]:
text_field_variables = tensors['sentence']

embedded_text  = text_field_embedder(text_field_variables)

dimensions = list(embedded_text.size())

print("Post embedding with our TextFieldEmbedder: ")
print("Batch Size : ", dimensions[0])
print("Sentence Length : " , dimensions[1])
print("Embedding size : ",dimensions[2])

print("Embedded Tensor : \n\n",embedded_text)

Post embedding with our TextFieldEmbedder: 
Batch Size :  2
Sentence Length :  8
Embedding size :  7
Embedded Tensor : 

 tensor([[[ 0.1963, -0.3375,  0.1647, -0.1156, -0.0404, -0.5156,  0.3331],
         [ 0.2908, -0.3940,  0.1887, -0.2405,  0.2811,  0.4906, -0.4056],
         [ 0.2478, -0.3705,  0.1714, -0.2411,  0.1916,  0.5432, -0.4662],
         [ 0.2789, -0.2663,  0.1470, -0.1924, -0.3673, -0.1278, -0.1605],
         [ 0.3105, -0.3436,  0.2232, -0.2375, -0.0165,  0.5044,  0.0403],
         [ 0.3064, -0.3487,  0.1801, -0.2848,  0.2411,  0.0355, -0.4371],
         [ 0.3322, -0.3074,  0.2398, -0.2993, -0.0436,  0.0939, -0.0668],
         [ 0.1526, -0.3271,  0.1999, -0.0636, -0.1979,  0.5730, -0.5436]],

        [[ 0.1510, -0.3281,  0.2023, -0.0854,  0.5091, -0.3519,  0.0300],
         [ 0.3827, -0.3614,  0.2198, -0.2577,  0.4955, -0.1582,  0.4972],
         [ 0.2264, -0.2860,  0.2286, -0.2553, -0.5514,  0.1500,  0.0608],
         [ 0.2022, -0.3576,  0.1423, -0.1917, -0.3281, -0.2090