## Learning distributed representations of words and sentences

### Recent techniques
 * pre-trained word vectors
 * character-level CNN encoding
 * sub-word token representations (e.g. byte encodings)
 * higher level linguistic features
    * part of speech (POS) tags
    * named entities
    * dependency paths

### Key abstractions for expressivity
  * TokenIndexers
    * generate indexed tensors for sentences in different ways
       * SingleIdTokenIndexer vs TokenCharactersIndexer
  * TokenEmbedders
    * a transform that maps indexed tensors to an embedding representation
       * simple case : PyTorch Embedding layer
       * complex case : token_character_encoders with applied CNN
  * TextFieldEmbedders
    * a wrapper around a set of TokenEmbedders
    * applies each TokenEmbedder and concatenates (or otherwise combines) their results
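
To make the difference between the two indexers concrete, here is a minimal sketch in plain Python. It is a toy illustration, not AllenNLP's implementation: the helper names `build_vocab`, `single_id_index`, and `character_index` are hypothetical, and the padding/unknown ids (0 and 1) are an assumption made for this example.

```python
# Toy stand-ins for SingleIdTokenIndexer and TokenCharactersIndexer.
# All function names here are hypothetical; this is NOT the library code.

def build_vocab(items):
    """Map each distinct item to an id; 0 is padding, 1 is unknown (assumed)."""
    vocab = {"@@PADDING@@": 0, "@@UNKNOWN@@": 1}
    for item in items:
        vocab.setdefault(item, len(vocab))
    return vocab

def single_id_index(tokens, word_vocab):
    """One integer per token."""
    return [word_vocab.get(token, 1) for token in tokens]

def character_index(tokens, char_vocab):
    """One list of integers per token, one integer per character."""
    return [[char_vocab.get(ch, 1) for ch in token] for token in tokens]

tokens = ["cool", "kids", "."]
word_vocab = build_vocab(tokens)
char_vocab = build_vocab(ch for token in tokens for ch in token)

word_ids = single_id_index(tokens, word_vocab)
char_ids = character_index(tokens, char_vocab)

print(word_ids)  # [2, 3, 4]
print(char_ids)  # [[2, 3, 3, 4], [5, 6, 7, 8], [9]]
```

The same sentence therefore yields a flat list of ids under one indexer and a nested list under the other, which is why each indexer needs its own TokenEmbedder.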



## Step 1. Prepare Dataset

In [None]:
from allennlp.data.fields import TextField
from allennlp.data import Instance
from allennlp.data.token_indexers \
    import SingleIdTokenIndexer, TokenCharactersIndexer
from allennlp.data.tokenizers import Token

In [None]:
# Tokens
words = ['All','the','cool','kids','use','character','embeddings','.']
words2 = ['I','prefer','word2vec','though','...']

tokens1 = list(map(Token,words))
tokens2 = list(map(Token,words2))

In [None]:
# Token indexers: the dictionary keys must match the TokenEmbedder keys used later

token_indexers = {'tokens': SingleIdTokenIndexer(namespace='token_ids'),
                  'characters': TokenCharactersIndexer(namespace='token_characters')}
sentence = TextField(tokens1,token_indexers)
sentence2 = TextField(tokens2,token_indexers)

instance = Instance({"sentence":sentence})
instance2 = Instance({"sentence":sentence2})

instances = [instance,instance2]

In [None]:
from allennlp.data import Vocabulary
from allennlp.data.dataset import Batch

# Vocabulary
vocab = Vocabulary.from_instances(instances)

# Index every field in place using the vocabulary.
# The instances are wrapped in a Batch later, right before conversion to tensors.
for instance in instances:
    instance.index_fields(vocab)
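
Once the two sentences are batched, they are padded to a common rectangular shape. The sketch below only computes the padded shapes one would expect for the two example sentences, assuming padding goes to the longest sentence and the longest word in the batch; the variable names are illustrative and nothing here calls AllenNLP.

```python
# Expected padded shapes for the two example sentences (plain Python, no AllenNLP).
# Assumption: padding extends to the longest sentence and longest word in the batch.
sentence_a = ['All', 'the', 'cool', 'kids', 'use', 'character', 'embeddings', '.']
sentence_b = ['I', 'prefer', 'word2vec', 'though', '...']
batch_words = [sentence_a, sentence_b]

max_words = max(len(sentence) for sentence in batch_words)
max_chars = max(len(word) for sentence in batch_words for word in sentence)

# One id per word vs. one id per character of every word.
token_id_shape = (len(batch_words), max_words)
char_id_shape = (len(batch_words), max_words, max_chars)

print(token_id_shape)  # (2, 8)
print(char_id_shape)   # (2, 8, 10)
```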

## Step 2. Define the TokenEmbedder

In [None]:
from allennlp.modules.token_embedders \
    import Embedding, TokenCharactersEncoder

# CnnEncoder is used to build the character-level CNN embedder
from allennlp.modules.seq2vec_encoders import CnnEncoder 


In [None]:
# Simple word embedder
word_embedding = Embedding(
    num_embeddings=vocab.get_vocab_size("token_ids"),
    embedding_dim =3)


In [None]:
"""
 Character embedding (transform)

  - Input  : tensor (batch_size,
                     max_num_words_in_sentence,
                     max_char_len_in_word)
  - Output : tensor (batch_size,
                     max_num_words_in_sentence,
                     max_char_len_in_word,
                     embedding_dim)

 CNN encoder

  - Output : tensor (batch_size,
                     max_num_words_in_sentence,
                     output_dim)
    (num_filters * len(ngram_filter_sizes) when output_dim is not set)

"""

char_embedding = Embedding(
    num_embeddings=vocab.get_vocab_size("token_characters"),
    embedding_dim = 5)

character_cnn = CnnEncoder(
    embedding_dim = 5 , 
    num_filters=2,
    output_dim=4)

token_character_encoder = TokenCharactersEncoder(
                                embedding = char_embedding,
                                encoder = character_cnn)
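
The output size of the character CNN can be checked with a little arithmetic. The sketch below assumes the encoder runs one group of filters per n-gram size, max-pools each filter over the character positions, and then optionally projects to `output_dim`; the default filter sizes `(2, 3, 4, 5)` are an assumption here, and `cnn_encoder_output_dim` is a hypothetical helper, not a library function.

```python
# Hypothetical helper: size of each word vector produced by a CNN encoder.
# Assumption: default n-gram filter sizes are (2, 3, 4, 5).

def cnn_encoder_output_dim(num_filters, ngram_filter_sizes=(2, 3, 4, 5), output_dim=None):
    # Each filter is max-pooled to a single value, so pooling yields
    # num_filters values per n-gram size.
    pooled_dim = num_filters * len(ngram_filter_sizes)
    # An optional projection layer maps the pooled vector to output_dim.
    return pooled_dim if output_dim is None else output_dim

def conv_positions(num_chars, ngram_filter_sizes=(2, 3, 4, 5)):
    # Number of positions a filter of width k visits on a word of num_chars characters.
    return [num_chars - k + 1 for k in ngram_filter_sizes]

pooled = cnn_encoder_output_dim(num_filters=2)
projected = cnn_encoder_output_dim(num_filters=2, output_dim=4)
positions = conv_positions(10)

print(pooled)     # 8
print(projected)  # 4
print(positions)  # [9, 8, 7, 6]
```

Under these assumptions, the configuration above (`num_filters=2`, `output_dim=4`) pools to 8 values per word and projects them down to 4.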

## Step 3. Define the TextFieldEmbedder

In [None]:
# BasicTextFieldEmbedder

from allennlp.modules.text_field_embedders \
    import BasicTextFieldEmbedder


text_field_embedder = BasicTextFieldEmbedder(
                        {'tokens': word_embedding, 
                         'characters':token_character_encoder})

In [None]:
# Apply the text_field_embedder to the data and inspect the result

# Convert the indexed dataset into PyTorch tensors
batch = Batch(instances)
tensors = batch.as_tensor_dict(batch.get_padding_lengths())
print("torch tensors for passing to a model: \n\n", tensors)


In [None]:
text_field_variables = tensors['sentence']

embedded_text  = text_field_embedder(text_field_variables)

dimensions = list(embedded_text.size())

print("Post embedding with our TextFieldEmbedder: ")
print("Batch Size : ", dimensions[0])
print("Sentence Length : " , dimensions[1])
print("Embedding size : ",dimensions[2])

print("Embedded Tensor : \n\n",embedded_text)
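
As a sanity check on the printed dimensions, the sketch below works out the shape one would expect if the embedder simply concatenates the per-token vectors along the last axis. It uses plain lists with made-up values; the order of concatenation is an assumption and may differ in the library.

```python
# Expected embedding size when word and character vectors are concatenated.
# The numeric values are made up for illustration only.
word_dim = 3   # embedding_dim of the word Embedding above
char_dim = 4   # output_dim of the character CNN above

word_vector = [0.1, 0.2, 0.3]        # one token, word-level embedding
char_vector = [0.5, 0.6, 0.7, 0.8]   # same token, character-CNN encoding

# Concatenation order is an assumption.
combined = word_vector + char_vector

batch_size, sentence_length = 2, 8   # two sentences, longest has 8 tokens
expected_shape = (batch_size, sentence_length, word_dim + char_dim)

print(len(combined))    # 7
print(expected_shape)   # (2, 8, 7)
```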