# Sequence Representation

### Text can be viewed as:
* A sequence of characters
* A sequence of words
* A sequence of n-grams

Working at the level of words is the most common choice.

Using one of these representations, most models learn the statistical structure of text, i.e. they identify features in the text that are sufficient to solve textual tasks such as document classification, author identification, and sentiment analysis. At a high level, these algorithms treat the sequence elements much as computer vision treats the pixels of an image.

### Numerical representation of text as sequence
All of these algorithms work on numerical tensors, so the raw text sequence must first be transformed into a numerical tensor.

#### Tokenization
Text can be broken down into a sequence of characters, words, or n-grams. This splitting process is called tokenization, and the individual elements (characters, words, or n-grams) are called tokens. Each token can then be converted into a numerical vector:

* Sequence of characters: treat each character as a vector, then stack those vectors into a tensor for the whole character sequence.
* Sequence of words: treat each word as a vector, then stack those vectors into a tensor for a single sentence.
* Sequence of n-grams: split the sentence into n-grams (n consecutive characters or words taken as a single entity), treat each n-gram as a vector, then build a numerical tensor from those vectors.
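
The three tokenization levels above can be sketched in plain Python. This is a minimal illustration, not how any particular library does it: the helper names `char_tokens`, `word_tokens`, and `ngrams` are made up for this example, and the word splitter assumes that lowercasing and dropping punctuation is acceptable.

```python
def char_tokens(text):
    """Split text into a list of single characters."""
    return list(text)


def word_tokens(text):
    """Lowercase the text, drop punctuation, and split on whitespace."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return cleaned.split()


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


text = "This is a car."
chars = char_tokens(text)
words = word_tokens(text)
word_bigrams = ngrams(words, 2)    # n-grams over words
char_trigrams = ngrams(chars, 3)   # n-grams over characters

print(words)         # ['this', 'is', 'a', 'car']
print(word_bigrams)  # [('this', 'is'), ('is', 'a'), ('a', 'car')]
```

The same `ngrams` helper works on characters or words because it only relies on the tokens being in a list.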


#### Vector representation
* One hot encoding
* Word Embeddings

### One Hot Encoding
* It is the most basic way to transform a token into a vector.
* Total number of dimensions = number of unique tokens in the vocabulary.
* Each token is assigned its own dimension and is represented by a vector with value 1 in that dimension and 0 in every other dimension.
* The tokens can be characters, words, or n-grams.
* The vectors are binary, sparse (mostly zeros), and high dimensional, e.g. 20,000 dimensions for data with that many unique tokens.
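
As a rough sketch of the idea, one-hot encoding can be written by hand in a few lines of plain Python. The helpers `build_vocabulary` and `one_hot` are hypothetical names for this example, and dimensions are assigned in order of first appearance starting at 0, which is an arbitrary choice that differs from the Keras indices shown in the cells that follow.

```python
def build_vocabulary(tokens):
    """Assign each unique token its own dimension, in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def one_hot(token, vocab):
    """Return a vector of zeros with a single 1 at the token's dimension."""
    vector = [0] * len(vocab)
    vector[vocab[token]] = 1
    return vector


corpus = [["this", "is", "a", "car"], ["that", "is", "a", "bicycle"]]
vocab = build_vocabulary(token for sentence in corpus for token in sentence)

# Encode the second sentence: one row per token, one column per vocabulary entry
encoded = [one_hot(token, vocab) for token in corpus[1]]
for token, vector in zip(corpus[1], encoded):
    print(token, vector)
```

Each row keeps its position in the sentence, so word order is preserved; the cost is one vocabulary-sized vector per token.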



In [1]:
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=10)  # only word indices below 10 are kept when encoding
sample_texts = ['This is a car.', 'That is a bicycle']
tokenizer.fit_on_texts(sample_texts)

Using TensorFlow backend.


In [5]:
sequences = tokenizer.texts_to_sequences(sample_texts)
print(sequences)
# each word (token) is replaced by its integer index

[[3, 1, 2, 4], [5, 1, 2, 6]]


In [7]:
one_hot_encoding_result = tokenizer.texts_to_matrix(sample_texts)
print(one_hot_encoding_result)
# one row per text: column i is 1 if the word with index i occurs in that text, so word order is lost

[[0. 1. 1. 1. 1. 0. 0. 0. 0. 0.]
 [0. 1. 1. 0. 0. 1. 1. 0. 0. 0.]]


In [9]:
tokenizer.word_index

{'is': 1, 'a': 2, 'this': 3, 'car': 4, 'that': 5, 'bicycle': 6}

### Word Embeddings
* A more powerful way of representing tokens as vectors.
* Here the tokens are words.
* Embeddings are learned from data.
* Word embeddings are low-dimensional floating-point vectors. Commonly used dimensions are 256, 512, 1024, ...
* Two ways to obtain them:
    * Learn them jointly with the main task
        * Start with a random vector for each token and then learn it
    * Load pretrained word embeddings that were computed on a different task
* Features:
    * The geometric relationships between vectors should reflect the semantic relationships between the corresponding words
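
One common way to measure the geometric relationship between two word vectors is cosine similarity. The sketch below uses hand-picked, purely hypothetical 3-dimensional vectors (not taken from any trained model) to show what "related words end up close together" means in practice.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical vectors chosen by hand for illustration only
toy_embeddings = {
    "car": [0.9, 0.8, 0.1],
    "bicycle": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.2, 0.9],
}

sim_vehicles = cosine_similarity(toy_embeddings["car"], toy_embeddings["bicycle"])
sim_unrelated = cosine_similarity(toy_embeddings["car"], toy_embeddings["banana"])

print(f"car vs bicycle: {sim_vehicles:.3f}")
print(f"car vs banana:  {sim_unrelated:.3f}")
```

With real learned embeddings the same comparison would be run on much longer vectors, but the computation is identical.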
   

![Visual representation of one-hot encoding and word embedding](img/word-representation-dlipbook.png)

Source: *Deep Learning with Python* by François Chollet.

TODO: add a code snippet for loading precomputed word embeddings (page 186 def.).

Word index --> Embedding layer --> Corresponding word vector

In [19]:
from keras.layers import Embedding
# input_dim=11 must exceed the largest word index fed to the layer (the tokenizer above was capped at 10)
# output_dim=2 is the embedding dimension
embedding = Embedding(11, 2)
# the input to the embedding layer is a sequence of word indices (the sequences created above)
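
Conceptually, an embedding layer behaves like a lookup table with one row per word index. The following plain-Python sketch mimics that lookup; it is not the Keras implementation, the helper names `make_embedding_table` and `embed` are made up, and the small random initial values are an assumption made for illustration.

```python
import random


def make_embedding_table(input_dim, output_dim, seed=0):
    """Create input_dim rows, each holding output_dim small random floats."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(output_dim)]
            for _ in range(input_dim)]


def embed(sequence, table):
    """Replace every word index in the sequence with its row of the table."""
    return [table[index] for index in sequence]


table = make_embedding_table(input_dim=11, output_dim=2)
sequence = [3, 1, 2, 4]  # "this is a car" under the word index shown earlier
vectors = embed(sequence, table)

print(len(vectors), len(vectors[0]))  # 4 2
```

During training, the rows of the table are the weights that get updated, which is how the vectors move from random values towards useful ones.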

##### Pretrained embeddings
Pretrained word embeddings are generally used when there is not enough data to learn good embeddings for the task at hand.  
Some of the most widely used precomputed word embeddings are:
* Word2Vec: its dimensions capture specific semantic properties of words.  
https://code.google.com/archive/p/word2vec
* GloVe: built from word co-occurrence statistics, with precomputed vectors for millions of English tokens from Wikipedia and Common Crawl data.  
https://nlp.stanford.edu/projects/glove
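
Loading precomputed embeddings usually comes down to two steps: parse the vector file into a word-to-vector mapping, then copy the vectors into a matrix whose row numbers match the tokenizer's word index. The sketch below assumes the GloVe plain-text layout (one word per line followed by its space-separated values). The in-memory `sample_file` with made-up 3-dimensional values stands in for a real downloaded file, and `load_vectors` and `build_embedding_matrix` are hypothetical helper names.

```python
import io

# Hypothetical stand-in for a GloVe text file; the values are made up
sample_file = io.StringIO(
    "is 0.1 0.2 0.3\n"
    "a 0.4 0.5 0.6\n"
    "car 0.7 0.8 0.9\n"
)


def load_vectors(handle):
    """Parse lines of the form 'word v1 v2 ... vn' into a dict."""
    vectors = {}
    for line in handle:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; unknown words stay all zeros."""
    matrix = [[0.0] * dim for _ in range(max(word_index.values()) + 1)]
    for word, index in word_index.items():
        if word in vectors:
            matrix[index] = vectors[word]
    return matrix


# Same word index as produced by the tokenizer earlier in these notes
word_index = {'is': 1, 'a': 2, 'this': 3, 'car': 4, 'that': 5, 'bicycle': 6}
pretrained = load_vectors(sample_file)
embedding_matrix = build_embedding_matrix(word_index, pretrained, dim=3)

for index, row in enumerate(embedding_matrix):
    print(index, row)
```

Row 0 is left as zeros because the tokenizer's word index starts at 1. A matrix built this way is typically loaded into the embedding layer as its initial weights and often kept frozen while the rest of the model trains.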


Other reading materials:
* Lectures 2 and 3 from http://web.stanford.edu/class/cs224n/syllabus.html
* http://ruder.io/word-embeddings-1/
* http://ruder.io/word-embeddings-softmax/

Course video: 
* [Lecture 2 | Word Vector Representations: word2vec](https://youtu.be/ERibwqs9p38)
* [Lecture 3 | GloVe: Global Vectors for Word Representation](https://youtu.be/ASn7ExxLZws)
