# Sequence Representation

### Representation 
##### Text can be viewed as any of the following:
* Sequence of characters
* Sequence of words
* Sequence of N-grams

It is most common to work at the level of words.

Using this representation of text, most models learn the statistical structure of text, i.e. they identify features in the text that help solve textual tasks such as document classification, author identification, and sentiment analysis. At a high level, these algorithms treat the elements of a sequence much as pixels are treated in images in computer vision.

#### Numerical representation of text as sequence
All of these algorithms work on numerical tensors, so we need to transform the raw text sequence into a numerical tensor. This is done in 2 steps:

##### Step 1: Tokenization
Text can be broken down into smaller units: a sequence of characters, words, or n-grams. This splitting process is called tokenization, and the individual units (characters, words, or n-grams) are called tokens. Each token can then be converted into a numerical vector:

* Sequence of characters: treat each character as a vector, and stack these vectors into a tensor for the whole character sequence.
* Sequence of words: treat each word as a vector, and stack these vectors into a tensor for the sentence.
* Sequence of n-grams: split the sentence into n-grams (groups of n consecutive characters or words, each treated as a single entity), treat each n-gram as a vector, and build a numerical tensor from them.
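
The three levels of tokenization described above can be sketched in plain Python. This is only an illustrative sketch: the helper names (`tokenize_chars`, `tokenize_words`, `ngrams`) are made up for this example, and the word splitter uses a simple regular expression rather than the rules of any particular library.

```python
import re

def tokenize_chars(text):
    """Split text into a sequence of characters."""
    return list(text)

def tokenize_words(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = "This is a car."
chars = tokenize_chars(text)
words = tokenize_words(text)
word_bigrams = ngrams(words, 2)                          # n-grams over words
char_trigrams = ["".join(g) for g in ngrams(chars, 3)]   # n-grams over characters

print(words)              # ['this', 'is', 'a', 'car']
print(word_bigrams)       # [('this', 'is'), ('is', 'a'), ('a', 'car')]
print(char_trigrams[:3])  # ['Thi', 'his', 'is ']
```

The same `ngrams` helper works for characters and for words because an n-gram is just a window of n consecutive tokens.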


##### Step 2: Vector representation
The tokens generated above can be transformed into numerical vectors using either of the following representations:
* One hot encoding
* Word Embeddings

### One Hot Encoding
* It is the most basic way to transform a token into a vector.
* Total number of dimensions = number of unique tokens in the vocabulary.
* Assign each token a unique dimension, then represent the token by a vector with value 1 at the assigned dimension and 0 in every other dimension.
* The tokens can be characters, words, or n-grams.
* The vectors are binary, sparse (mostly zeros), and high dimensional, e.g. 20,000-dimensional for data with that many unique tokens.
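
Before turning to Keras, the idea can be sketched by hand in plain Python. The helpers `build_vocab` and `one_hot` are hypothetical names for this illustration. Unlike the Keras tokenizer, this sketch does not reserve index 0, so the vector length equals the vocabulary size exactly.

```python
def build_vocab(tokens):
    """Assign each unique token its own dimension, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def one_hot(token, vocab):
    """Return a vector of zeros with a single 1 at the token's dimension."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec

tokens = ["this", "is", "a", "car", "that", "is", "a", "bicycle"]
vocab = build_vocab(tokens)
encoded = [one_hot(t, vocab) for t in tokens]

print(len(vocab))   # 6 unique tokens, so 6 dimensions
print(encoded[0])   # [1, 0, 0, 0, 0, 0]  ('this')
print(encoded[3])   # [0, 0, 0, 1, 0, 0]  ('car')
```

Every vector contains exactly one 1, which is why the representation becomes very sparse as the vocabulary grows.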



##### Sample Keras code for One Hot Encoding

In [1]:
from keras.preprocessing.text import Tokenizer

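max_num_words = 10
# num_words caps the vocabulary: only word indices below max_num_words are kept (index 0 is reserved)
tokenizer = Tokenizer(num_words=max_num_words)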
sample_texts = ['This is a car.', 'That is a bicycle']
tokenizer.fit_on_texts(sample_texts)

Using TensorFlow backend.


In [2]:
tokenizer.word_index

{'is': 1, 'a': 2, 'this': 3, 'car': 4, 'that': 5, 'bicycle': 6}

In [3]:
sequences = tokenizer.texts_to_sequences(sample_texts)
print(sequences)
# each word (token) is replaced by its integer index from word_index

[[3, 1, 2, 4], [5, 1, 2, 6]]


In [4]:
from keras.utils import to_categorical
one_hot_encoded = to_categorical(sequences, num_classes=max_num_words)
print(one_hot_encoded)

[[[0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
  [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]]

 [[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
  [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]]]


#### Bag-of-words encoding

In [5]:
bag_of_word_encoding = tokenizer.texts_to_matrix(sample_texts)
print(bag_of_word_encoding)
# 1 marks each word present in the sentence; word order is lost

[[0. 1. 1. 1. 1. 0. 0. 0. 0. 0.]
 [0. 1. 1. 0. 0. 1. 1. 0. 0. 0.]]


In the above output, the 1st vector is:  
\[0. 1. 1. 1. 1. 0. 0. 0. 0. 0.\]  
The 1st sentence is 'This is a car.'. Its words have indices 3, 1, 2 and 4 in the word index, and the encoding is 1 at exactly those positions.
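
To make the mapping from word indices to the bag-of-words vector explicit, here is a minimal pure-Python sketch that mimics the binary encoding. The function `bag_of_words` is a hypothetical helper written for this note, not part of Keras, and the word index is copied from the output shown earlier.

```python
word_index = {'is': 1, 'a': 2, 'this': 3, 'car': 4, 'that': 5, 'bicycle': 6}
num_words = 10

def bag_of_words(sentence, word_index, num_words):
    """Set position i to 1.0 if the word with index i occurs in the sentence."""
    vec = [0.0] * num_words
    for word in sentence.lower().replace('.', '').split():
        idx = word_index.get(word)
        if idx is not None and idx < num_words:
            vec[idx] = 1.0
    return vec

first = bag_of_words('This is a car.', word_index, num_words)
second = bag_of_words('That is a bicycle', word_index, num_words)
shuffled = bag_of_words('car a is this', word_index, num_words)

print(first)              # [0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(second)             # [0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
print(first == shuffled)  # True: reordering the words gives the same vector
```

The last line shows why this is called a bag of words: the encoding records which words occur, not where they occur.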

### Word Embeddings
* A more powerful way of representing tokens as vectors
* Here the tokens are words
* Word embeddings are learned from data
* Word embeddings are low-dimensional floating point vectors. Commonly used dimensions are 256, 512, or 1024.
* Two ways to obtain them:
    * Learn them with the main task: start with a random vector for each token and learn the embeddings during training
    * Load pre-trained word embeddings that were computed on a different task
* Features:
    * Geometric relationships between vectors should reflect the semantic relationships between the corresponding words
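
One common way to measure the geometric relationship between two word vectors is cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration. They are not real learned embeddings, and the values were chosen by hand so that the two vehicles point in a similar direction.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, hand-picked for this example
toy_embeddings = {
    "car":     [0.9, 0.1, 0.0],
    "bicycle": [0.8, 0.3, 0.1],
    "banana":  [0.0, 0.2, 0.9],
}

sim_vehicles = cosine_similarity(toy_embeddings["car"], toy_embeddings["bicycle"])
sim_unrelated = cosine_similarity(toy_embeddings["car"], toy_embeddings["banana"])

print(sim_vehicles > sim_unrelated)  # True
```

With well-trained embeddings, semantically related words are expected to score higher than unrelated ones in the same way.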
   

![Visual representation of one hot encoding and word embedding](img/word-representation-dlwpbook.png)

Source: Deep Learning with Python by Francois Chollet, Book

The Keras library provides an Embedding layer.  
The Embedding layer maps a word's integer index to a dense vector:  
Word index --> Embedding layer --> Corresponding word vector

In [6]:
from keras.layers import Embedding
embedding = Embedding(11, 2) 
# 11 = input dimension = 10 (max_num_words declared above) + 1
# 2  = output dimension = embedding dimension
# input to the embedding layer is a batch of word-index sequences (like the sequences created above)

Input to the Embedding layer is a 2D integer tensor of shape (samples, sequence length)  
Output of the Embedding layer is a 3D floating point tensor of shape (samples, sequence length, embedding dimension)

The weights of the Embedding layer are randomly initialized and then learned (adjusted) via backpropagation.
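
Conceptually, an embedding layer is a lookup table with one row per word index. The NumPy sketch below imitates that lookup to show how the shapes line up. It is an illustration of the idea under that assumption, not the Keras implementation, and the variable names are chosen for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 11, 2

# Randomly initialised weight matrix: one row (vector) per word index
embedding_weights = rng.normal(size=(vocab_size, embedding_dim))

# Integer word indices with shape (samples, sequence length)
sequences = np.array([[3, 1, 2, 4], [5, 1, 2, 6]])

# Indexing with an integer array fetches one row per word index
vectors = embedding_weights[sequences]

print(sequences.shape)  # (2, 4)
print(vectors.shape)    # (2, 4, 2)
```

The same word index always maps to the same row, so 'is' (index 1) gets an identical vector in both sentences. Training would adjust the rows of the weight matrix.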

Word embeddings trained on one dataset can be reused in other problems.

#### Pre-trained embeddings
When we do not have enough data to learn good embeddings ourselves, we generally use pre-trained word embeddings.  
The following are some of the most widely used pre-computed word embeddings:
* Word2Vec: captures specific semantic properties of words  
https://code.google.com/archive/p/word2vec
* GloVe: captures word co-occurrence statistics, pre-computed for millions of English tokens from Wikipedia and Common Crawl data.  
https://nlp.stanford.edu/projects/glove

##### Code snippet to load word embeddings
We will load the GloVe word embeddings. Download the pre-computed GloVe embeddings trained on Wikipedia data from the link above and extract them into ./data/glove/glove.6B

In [7]:
import numpy as np
import os

glove_dir = 'data/glove/glove.6B'

embeddings_index = {}
# each line holds a word followed by its space-separated vector coefficients
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.


In [8]:
# this is sample code for a specific use case; in practice the tokenizer is fit on the training data
word_index = tokenizer.word_index
max_words = len(word_index)
embedding_dim = 100  # must match the pre-trained vectors (glove.6B.100d has 100 dimensions)
# row 0 stays all zeros because index 0 is never assigned to a word
embedding_matrix = np.zeros((max_words + 1, embedding_dim))
for word, i in word_index.items():
    if i <= max_words:  # word indices run from 1 to max_words inclusive
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # words missing from the GloVe index keep an all-zero row
            embedding_matrix[i] = embedding_vector
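
The matrix-building step can be checked on a toy example without downloading GloVe. The sketch below uses a hypothetical in-memory file with three made-up 3-dimensional vectors in the same text layout (a word followed by its coefficients). All names prefixed with `toy_` are illustrative only.

```python
import io
import numpy as np

# Hypothetical stand-in for a GloVe text file: "word coef1 coef2 coef3"
toy_glove_file = io.StringIO(
    "car 0.1 0.2 0.3\n"
    "is 0.4 0.5 0.6\n"
    "a 0.7 0.8 0.9\n"
)

toy_embeddings_index = {}
for line in toy_glove_file:
    values = line.split()
    toy_embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

# "zeppelin" is deliberately absent from the toy vectors
toy_word_index = {"is": 1, "a": 2, "car": 3, "zeppelin": 4}
toy_dim = 3

# One row per word index, plus row 0, which no word uses
toy_matrix = np.zeros((len(toy_word_index) + 1, toy_dim))
for word, i in toy_word_index.items():
    vector = toy_embeddings_index.get(word)
    if vector is not None:
        toy_matrix[i] = vector

print(toy_matrix.shape)  # (5, 3)
print(toy_matrix[4])     # [0. 0. 0.]  (no pre-trained vector for "zeppelin")
```

Rows for out-of-vocabulary words stay at zero, so it is worth checking how many words of the real vocabulary are actually covered by the pre-trained vectors.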

In [None]:
# code snippet to load the weights
# Suppose a model has an Embedding layer, which can be added only as the 1st layer of the model.
# Then we can load the embedding matrix built above as the weights of that layer
model.layers[0].set_weights([embedding_matrix])
# freeze the layer so that training does not overwrite the pre-trained vectors
model.layers[0].trainable = False

Other reading materials: 
* Lec 2 & 3 from https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1184/syllabus.html
* http://ruder.io/word-embeddings-1/
* http://ruder.io/word-embeddings-softmax/

Course video: 
* [Lecture 2 | Word Vector Representations: word2vec](https://youtu.be/ERibwqs9p38)
* [Lecture 3 | GloVe: Global Vectors for Word Representation](https://youtu.be/ASn7ExxLZws)
