# Word embedding
**In case of any questions email us: deeplearninginsciences@gmail.com**  
**Author: Bálint Ármin Pataki**

In this notebook we will create a model that can find a meaningful vector representation of words based on a large text file (corpus). This is an unsupervised learning task, as we do not have any labels (like we had for the happy-sad pictures). We have only the raw text data.

In [1]:
#load numpy

import numpy as np

The **toy corpus** (copied from https://en.wikipedia.org/wiki/Text_corpus) we will use is the following:

_A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus).
Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora. There are two main types of parallel corpora which contain texts in two languages. In a translation corpus, the texts in one language are translations of texts in the other language. In a comparable corpus, the texts are of the same kind and cover the same content, but they are not translations of each other. To exploit a parallel text, some kind of text alignment identifying equivalent text segments (phrases or sentences) is a prerequisite for analysis. Machine translation algorithms for translating between two languages are often trained using parallel fragments comprising a first language corpus and a second language corpus which is an element-for-element translation of the first language corpus.
In order to make the corpora more useful for doing linguistic research, they are often subjected to a process known as annotation. An example of annotating a corpus is part-of-speech tagging, or POS-tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. Another example is indicating the lemma (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual._

Our **example sentence** will be: _The quick brown fox jumps over the lazy dog_

#### Let's save it to a python variable as a string!

In [2]:
corpus = 'A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus). Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora. There are two main types of parallel corpora which contain texts in two languages. In a translation corpus, the texts in one language are translations of texts in the other language. In a comparable corpus, the texts are of the same kind and cover the same content, but they are not translations of each other. To exploit a parallel text, some kind of text alignment identifying equivalent text segments (phrases or sentences) is a prerequisite for analysis. Machine translation algorithms for translating between two languages are often trained using parallel fragments comprising a first language corpus and a second language corpus which is an element-for-element translation of the first language corpus. In order to make the corpora more useful for doing linguistic research, they are often subjected to a process known as annotation. An example of annotating a corpus is part-of-speech tagging, or POS-tagging, in which information about each word\'s part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. Another example is indicating the lemma (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual.'
example_sentence = 'The quick brown fox jumps over the lazy dog'

# 1. Tokenization

Tokenization is the process of splitting the input text (corpus) into smaller parts called tokens. A token often corresponds to a word, but not necessarily.


#### For example (each pair of parentheses encloses one token):   
 - _The quick brown fox jumps over the lazy dog._  
 
will be converted to:  
 - (The), (quick), (brown), (fox), (jumps), (over), (the), (lazy), (dog)

**But in some cases it is not completely clear:**  
 - _It's not clear._ 
 
can be:
 - (It's), (not), (clear)
 - (It), (s), (not), (clear)
 - (It), ('s), (not), (clear)


We will implement a tokenizer that splits the text on all non-alphanumeric characters!  
So we pick:
 - (It), (s), (not), (clear)
 
## Let's implement a tokenizer!
### Help:
#### get_non_alphanumeric_chars
 1. with the `list()` function it is easy to convert the input to a list of characters
 2. with the `.isalnum()` method you can pick out all the non-alphanumeric characters
 3. `list(set(duplicated_list))` will keep only unique values in the list
 
 
#### tokenizer 
 1. replace all non-alphanumeric characters with whitespace! Use the `.replace()` method.
 2. do the splitting via the `.split()` function

In [3]:
#GRADED function
#Don't change the function name, parameters and return values
def get_non_alphanumeric_chars(input_text):
    """
        Returns the non alphanumeric unique characters from a corpus.
        Input: 
            * input_text:   string
        Output:
            * non_alphanumeric_unique: list of non alphanumeric characters
    """     
    ###Start code here
    
    #STEP1 set character_list variable to a list of characters in the input_text 
    character_list = list(input_text)
    
    #STEP2 store all non_alphanumeric characters from the character_list!
    non_alphanumeric = []
    for c in character_list:
        if not c.isalnum():
            non_alphanumeric.append(c)
    
    #STEP3 keep only the unique characters (hint: unique_values = list(set(duplicated_values))
    non_alphanumeric_unique = list(set(non_alphanumeric))
    
    ###End code here 
    
    return non_alphanumeric_unique

In [4]:
print(get_non_alphanumeric_chars('Hi! How are you? I\'m fine, thanks... Bye'))


['!', ' ', "'", ',', '.', '?']


In [5]:
print(get_non_alphanumeric_chars(corpus))
print(get_non_alphanumeric_chars('bla$$=()/ma z'))

[' ', "'", ')', '(', '-', ',', '.']
[' ', '$', ')', '(', '/', '=']


**Expected output** (you may get a different order, but that is fine):

<pre>['.', ',', '-', "'", '(', ' ', ')']  
['/', '=', ' ', '(', '$', ')']
</pre>
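
If a reproducible order is preferred, the same idea can be written more compactly. The following is only a sketch: the name `get_non_alphanumeric_chars_sorted` is hypothetical and not part of the graded interface. A set comprehension keeps each character once, and `sorted()` removes the dependence on set iteration order.

```python
def get_non_alphanumeric_chars_sorted(input_text):
    # collect every character that is not a letter or digit, once each,
    # and sort them so the output order is reproducible
    return sorted({c for c in input_text if not c.isalnum()})

print(get_non_alphanumeric_chars_sorted('bla$$=()/ma z'))
# [' ', '$', '(', ')', '/', '=']
```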

In [6]:
#GRADED function
#Don't change the function name, parameters and return values
def tokenizer(input_text):
    """
        Transforms strings to tokens. The tokens should appear in the same order as they are in the string.
        Input: 
            * input_text:   string
        Output:
            * tokens: list of tokens
    """    

    
    ###Start code here
    #STEP0 extract non alphanumeric characters! Use your function above!
    non_alphanumeric_unique = get_non_alphanumeric_chars(input_text)
    #STEP1 replace all non alphanumeric characters with ' ' whitespace.
    replaced_input = input_text
    for c in non_alphanumeric_unique:
        replaced_input = replaced_input.replace(c, " ")
    #STEP2 split the replaced text on whitespace and store the result in the tokens variable
    tokens = replaced_input.split()
    
    ###End code here    
    
    
    return tokens

In [7]:
print(tokenizer(corpus[49:153]))
print()
print(tokenizer('Hi! How are you? I\'m fine, thanks... Bye'))

['monolingual', 'corpus', 'or', 'text', 'data', 'in', 'multiple', 'languages', 'multilingual', 'corpus', 'Multilingual', 'corpora', 'that']
()
['Hi', 'How', 'are', 'you', 'I', 'm', 'fine', 'thanks', 'Bye']


**Expected output:**

<pre>
['monolingual', 'corpus', 'or', 'text', 'data', 'in', 'multiple', 'languages', 'multilingual', 'corpus', 'Multilingual', 'corpora', 'that']<br>
['Hi', 'How', 'are', 'you', 'I', 'm', 'fine', 'thanks', 'Bye']
</pre>
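
The same splitting rule can also be expressed without the replace step. The sketch below is an alternative, not the graded solution: the name `tokenizer_groupby` is hypothetical. It groups consecutive characters by whether they are alphanumeric and keeps only the alphanumeric runs.

```python
from itertools import groupby

def tokenizer_groupby(input_text):
    # consecutive characters with the same str.isalnum() value form one group;
    # only the alphanumeric groups are tokens
    return [''.join(group)
            for is_alnum, group in groupby(input_text, key=str.isalnum)
            if is_alnum]

print(tokenizer_groupby("It's not clear."))
# ['It', 's', 'not', 'clear']
```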

# 2. Stemming
Stemming is the process of reducing inflected words to their root form.

For example: goes $\to$ go, stemming $\to$ stem, fishing $\to$ fish

This can help a lot when the corpus is not large enough, because then each word form occurs only a few times. Without stemming, a word with just an _s_ at the end counts as a completely different word, as we will see below.


It is not that easy to write a proper stemmer, so we will **skip** this part in the homework.
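
To illustrate why, here is a deliberately naive suffix stripper. It is only a sketch: `naive_stem` and its suffix list are assumptions made up for this example, not a real stemming algorithm. It handles _goes_ and _fishing_, but fails on _stemming_ and _corpus_.

```python
def naive_stem(word, suffixes=('ing', 'es', 's')):
    # strip the first matching suffix, as long as at least two characters remain
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[:-len(suffix)]
    return word

for w in ['goes', 'fishing', 'stemming', 'corpus']:
    print(w, '->', naive_stem(w))
# goes -> go
# fishing -> fish
# stemming -> stemm
# corpus -> corpu
```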

# 3. Dictionary & one-hot encoding

Once the tokenization is ready, we can build a dictionary of the tokens. This is simply the sorted list of unique tokens.

A dictionary looks like: ['a', 'an', 'apple', ..., 'orange', ..., 'zebra', 'zulu']

From now on we can think of a word as a vector that has a 1 at the position of the word in the dictionary and 0s everywhere else.

$a = \begin{bmatrix}
           1 \\
           0 \\
           0 \\
           \vdots \\
           0 \\
           0
         \end{bmatrix},\:\:$ 
$an = \begin{bmatrix}
           0 \\
           1 \\
           0 \\
           \vdots \\
           0 \\
           0
         \end{bmatrix},\:\:$
$apple = \begin{bmatrix}
           0 \\
           0 \\
           1 \\
           \vdots \\
           0 \\
           0
         \end{bmatrix}\:\: ... \:\:$
$zebra = \begin{bmatrix}
           0 \\
           0 \\
           0 \\
           \vdots \\
           1 \\
           0
         \end{bmatrix},\:\:$
$zulu = \begin{bmatrix}
           0 \\
           0 \\
           0 \\
           \vdots \\
           0 \\
           1
         \end{bmatrix}$
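
As a small illustration of these vectors, the columns of an identity matrix are exactly the one-hot vectors of the dictionary. The sketch below assumes a five-word toy dictionary (not the one built from the corpus).

```python
import numpy as np

toy_dictionary = ['a', 'an', 'apple', 'zebra', 'zulu']  # assumed toy dictionary

# column i of the identity matrix is the one-hot vector of word i
identity = np.eye(len(toy_dictionary), dtype=int)
apple_vec = identity[:, [toy_dictionary.index('apple')]]  # list index keeps shape (V, 1)

print(apple_vec.ravel())
# [0 0 1 0 0]
```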
         
         
### Let's build a dictionary and the one-hot encoding!

In [8]:
#GRADED function
#Don't change the function name, parameters and return values
def get_dictionary(input_text):
    """
        Builds up the dictionary from an input text.
        Input: 
            * input_text:   string
        Output:
            * dictionary: list of unique, sorted tokens
    """    
    tokens = tokenizer(input_text)
    ###Start code here
    
    # keep each token once, then sort alphabetically
    unique_sorted_tokens = sorted(set(tokens))
    
    ###End code here
    dictionary = unique_sorted_tokens

    return dictionary

In [9]:
print(get_dictionary(corpus.lower())[0:10])
print(len(get_dictionary(corpus.lower())))

['a', 'about', 'added', 'adjective', 'algorithms', 'aligned', 'alignment', 'an', 'analysis', 'and']
118


**Expected output:**

<pre>
['a', 'about', 'added', 'adjective', 'algorithms', 'aligned', 'alignment', 'an', 'analysis', 'and']  
118
</pre>

In [10]:
#GRADED function
#Don't change the function name, parameters and return values
def one_hot_encoded(word, dictionary):
    """
        Turn a word into a one-hot encoded vector based on the dictionary.
        Input: 
            * word:   string
            * dictionary: list of words in a dictionary
        Output:
            * oh_vec: one-hot encoded vector
    """    
    #check if word is in dictionary
    if word not in dictionary:
        print('The word _' + word + '_ is not in dictionary!')
        return None
    
    ###Start code here
    # oh_vec should be a numpy array with a shape (len(dictionary), 1)
    # it should contain 0 everywhere except at the position of the word in the dictionary, where it should be 1.
    oh_vec = np.zeros((len(dictionary), 1), dtype=int)
    oh_vec[dictionary.index(word), 0] = 1
    ###End code here

    return oh_vec

In [11]:
dic = get_dictionary(corpus.lower())

print(one_hot_encoded('a', dic)[0:10])
print()
print(one_hot_encoded('aligned', dic)[0:10])
print()
print(one_hot_encoded('ablak', dic))

[[1]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]]
()
[[0]
 [0]
 [0]
 [0]
 [0]
 [1]
 [0]
 [0]
 [0]
 [0]]
()
The word _ablak_ is not in dictionary!
None


**Expected output:**

<pre>
[[1]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]]

[[0]
 [0]
 [0]
 [0]
 [0]
 [1]
 [0]
 [0]
 [0]
 [0]]

The word _ablak_ is not in dictionary!
None
</pre>
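
For larger dictionaries, searching the list with `.index()` on every call becomes slow. A possible refinement, sketched here with an assumed toy dictionary and the hypothetical names `word_to_index` and `decode_one_hot`, is to build the word-to-position mapping once and to decode a one-hot vector back to its word with `np.argmax`.

```python
import numpy as np

toy_dictionary = ['a', 'an', 'apple', 'zebra', 'zulu']  # assumed toy dictionary

# map each word to its position once, so later lookups do not scan the list
word_to_index = {word: i for i, word in enumerate(toy_dictionary)}

def decode_one_hot(oh_vec, dictionary):
    # the position of the single 1 identifies the word
    return dictionary[int(np.argmax(oh_vec))]

oh_vec = np.zeros((len(toy_dictionary), 1), dtype=int)
oh_vec[word_to_index['zebra'], 0] = 1
print(decode_one_hot(oh_vec, toy_dictionary))
# zebra
```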

# 4. Continuous bag of words (CBOW) concept

In the CBOW model we try to fit a model that predicts a word from its neighbors. We expect that if we manage to fit such a model, its internal representation will capture the meaning of the different words.  

The number of neighbors is a parameter; for 4 neighbors it looks like this (the **red word is the target**, the blue words are the input, i.e. the neighbors):  
<font color="blue">The quick</font> <font color="red">brown</font> <font color="blue">fox jumps</font> over the lazy dog  
The <font color="blue">quick brown</font> <font color="red">fox</font> <font color="blue">jumps over</font> the lazy dog  
The quick <font color="blue">brown fox</font> <font color="red">jumps</font> <font color="blue">over the</font> lazy dog  
The quick brown <font color="blue">fox jumps</font> <font color="red">over</font> <font color="blue">the  lazy</font> dog  

As can be seen above, it is easy to create many training examples from a single sentence!

#### Let's make a CBOW example generator!

In [12]:
#GRADED function
#Don't change the function name, parameters and return values
def generate_CBOW_example(start_pos, half_window, tokens):
    """
        Generates one CBOW training example from tokens
        Input: 
            * start_pos:   the position of the first token of the window
            * half_window: number of tokens on one-side
            * tokens: list of tokens
        Output:
            * X: input words (the neighbors)
            * Y: target word
    """     
    
    ###Start code here
    
    X = tokens[start_pos:start_pos+half_window] + tokens[start_pos+half_window+1:start_pos+2*half_window+1]
    Y = tokens[start_pos+half_window]
    
    ###End code here
    
    return X, Y

In [13]:
corpus_tokens = tokenizer(corpus.lower())

print(generate_CBOW_example(0, 2, corpus_tokens))
print(generate_CBOW_example(2, 2, corpus_tokens))
print(generate_CBOW_example(0, 3, corpus_tokens))

(['a', 'corpus', 'contain', 'texts'], 'may')
(['may', 'contain', 'in', 'a'], 'texts')
(['a', 'corpus', 'may', 'texts', 'in', 'a'], 'contain')


**Expected output:**
<pre>
(['a', 'corpus', 'contain', 'texts'], 'may')
(['may', 'contain', 'in', 'a'], 'texts')
(['a', 'corpus', 'may', 'texts', 'in', 'a'], 'contain')
</pre>
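
Sliding the window over a whole text gives every training example in turn. The following sketch applies this to the example sentence; `cbow_examples` is a hypothetical helper written for this illustration, and it assumes windows that do not fit completely inside the text are simply skipped.

```python
def cbow_examples(tokens, half_window):
    # slide a window of 2*half_window + 1 tokens over the text
    examples = []
    for start in range(len(tokens) - 2 * half_window):
        center = start + half_window
        context = tokens[start:center] + tokens[center + 1:center + half_window + 1]
        examples.append((context, tokens[center]))
    return examples

sentence = 'The quick brown fox jumps over the lazy dog'.split()
for context, target in cbow_examples(sentence, 2):
    print(context, '->', target)
```

With 9 tokens and `half_window = 2` this prints 5 examples, starting with `['The', 'quick', 'fox', 'jumps'] -> brown`.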

# 5. Skip-gram concept

Skip-gram is the opposite of CBOW: starting from a single word, we try to fit a model that predicts its neighbors. As before, we expect that if we manage to fit such a model, its internal representation will capture the meaning of the different words.  

The number of neighbors is a parameter; for 4 neighbors it looks like this (the **blue words are the targets**, the red word is the input):  
<font color="blue">The quick</font> <font color="red">brown</font> <font color="blue">fox jumps</font> over the lazy dog  
The <font color="blue">quick brown</font> <font color="red">fox</font> <font color="blue">jumps over</font> the lazy dog  
The quick <font color="blue">brown fox</font> <font color="red">jumps</font> <font color="blue">over the</font> lazy dog  
The quick brown <font color="blue">fox jumps</font> <font color="red">over</font> <font color="blue">the  lazy</font> dog  

As can be seen above, it is easy to create many training examples from a single sentence!

#### Let's make a skip-gram example generator!
#### Luckily it is very easy now. We only need to swap X and Y in the CBOW generator. 

In [14]:
def generate_skip_gram_example(start_pos, half_window, tokens):
    """
        Generates one skip-gram training example from tokens
        Input: 
            * start_pos:   the position of the first token of the window
            * half_window: number of tokens on one-side
            * tokens: list of tokens
        Output:
            * x: input word
            * y: target words
    """     
    
    X, Y = generate_CBOW_example(start_pos, half_window, tokens)
    x = Y
    y = X
    return x, y
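
Since every neighbor is a separate target, one window yields several (input, target) pairs. The sketch below lists them for the example sentence; `skip_gram_pairs` is a hypothetical helper for this illustration, and windows that do not fit inside the text are again skipped.

```python
def skip_gram_pairs(tokens, half_window):
    # emit one (input word, target word) pair per neighbor of each center word
    pairs = []
    for start in range(len(tokens) - 2 * half_window):
        center = start + half_window
        neighbors = tokens[start:center] + tokens[center + 1:center + half_window + 1]
        pairs.extend((tokens[center], neighbor) for neighbor in neighbors)
    return pairs

sentence = 'The quick brown fox jumps over the lazy dog'.split()
pairs = skip_gram_pairs(sentence, 2)
print(pairs[:4])
# [('brown', 'The'), ('brown', 'quick'), ('brown', 'fox'), ('brown', 'jumps')]
print(len(pairs))
# 20
```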

# 6. Keras models

Skip-gram is said to work better with smaller amounts of training data and for rare words, while CBOW is faster to train and is said to be better for frequent words.

If we have a corpus with $N$ tokens then we will have $N$ training examples for CBOW and $\approx window\_size \cdot N$ training examples for skip-gram. 
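
More precisely, assuming a half window of $h$ tokens on each side (so $window\_size = 2h$) and no padding at the edges of the corpus, the counts work out as:

$$\#\text{CBOW} = N - 2h, \qquad \#\text{skip-gram} = 2h \cdot (N - 2h) \approx 2h \cdot N \quad \text{for } N \gg h$$

For the 9-token example sentence with $h = 2$, this gives $9 - 4 = 5$ CBOW examples and $4 \cdot 5 = 20$ skip-gram pairs.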

Training these models can be really slow, so a few other tricks are used in practice. For simplicity we won't implement them now.

## 6.1 CBOW model

In the model we take a training example that contains $N$ input words and a target word (here $N$ is the number of neighbors, not the corpus length). 

<img src='images/cbow.png'></img>
<center>[Mikolov: Efficient Estimation of Word Representations in
Vector Space, 2013]</center>

The model is the following:
 1. convert the $N$ input words to one-hot encoded ($V$ long) vectors and average them! (The training loop below sums them instead, which only rescales the input by a constant factor.)
 2. add a hidden layer of $d$ neurons (no activation, no bias, just the matrix multiplication)
 3. add an output layer with $V$ neurons (no bias) and softmax activation.
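
The three steps above can be sketched as a plain NumPy forward pass. This is an illustration under assumptions, not the Keras model itself: the sizes, the random weights `W_in` and `W_out`, and the context indices are made up. It also shows that multiplying the averaged one-hot vector by `W_in` is the same as averaging the corresponding rows of `W_in`.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 3                        # toy dictionary size and hidden size (assumed)
W_in = rng.normal(size=(V, d))     # hidden layer weights, no bias
W_out = rng.normal(size=(d, V))    # output layer weights, no bias

context_ids = [0, 1, 3, 4]         # dictionary positions of the input words (assumed)

# step 1: average the one-hot vectors of the input words
x = np.zeros(V)
for i in context_ids:
    x[i] += 1.0 / len(context_ids)

# step 2: linear hidden layer (no activation)
h = x @ W_in

# step 3: output layer with softmax over the dictionary
scores = h @ W_out
probs = np.exp(scores - scores.max())
probs /= probs.sum()

print(np.allclose(h, W_in[context_ids].mean(axis=0)))
# True
print(probs.shape)
# (6,)
```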

In [15]:
from keras.layers import Dense
from keras.models import Sequential

Using TensorFlow backend.


In [18]:
#GRADED function
#Don't change the function name, parameters and return values
def get_keras_cbow(hidden_dim, dictionary_size):
    """
        Generate keras model for CBOW model
        Input: 
            * hidden_dim: number of neurons in the hidden layer
            * dictionary_size: length of the dictionary (output/input size)
        Output:
            * model: keras model that implements CBOW 
    """
    cbow = Sequential()
    ###Start code here
    
    # add a Dense layer with hidden_dim neurons. Input_dim is the dictionary_size 
    # we need neither an activation nor a bias!
    cbow.add(Dense(hidden_dim, input_dim=dictionary_size, use_bias=False))
    
    # add a Dense layer with dictionary_size neurons. We don't need bias. Activation is softmax.
    cbow.add(Dense(dictionary_size, activation='softmax', use_bias=False))
    
    ###End code here
    
    return cbow

#### Let's train it on our toy corpus. We should see the loss decreasing, but we do not expect meaningful embeddings from such a small corpus.

In [19]:
dic = get_dictionary(corpus.lower())
cbow_model = get_keras_cbow(30, len(dic))
cbow_model.compile(optimizer='adam', loss='categorical_crossentropy')

In [20]:
epochs = 10
half_window = 2
corpus_tokens = tokenizer(corpus.lower())
max_pos_start = len(corpus_tokens)-2*half_window

for iteration in range(epochs): #iterate the corpus epochs times
    loss = 0.
    
    for pos_start in range(max_pos_start): # iterate on the token positions
        x, y = generate_CBOW_example(pos_start, half_window, corpus_tokens) # generate training examples
        
        x = np.array([one_hot_encoded(i, dic) for i in x]).sum(0).T 
        y = one_hot_encoded(y, dic).T
        
        loss += cbow_model.train_on_batch(x, y) # train the model. only 1 sample/batch now...

    print('Epoch', str(iteration+1).zfill(2), ', Loss:',  loss/(max_pos_start))

('Epoch', '01', ', Loss:', 4.76769242109346)
('Epoch', '02', ', Loss:', 4.547351667703676)
('Epoch', '03', ', Loss:', 4.286773908236795)
('Epoch', '04', ', Loss:', 3.997391689907421)
('Epoch', '05', ', Loss:', 3.7402127306323405)
('Epoch', '06', ', Loss:', 3.4930301293853887)
('Epoch', '07', ', Loss:', 3.2343938774313807)
('Epoch', '08', ', Loss:', 2.965933325369496)
('Epoch', '09', ', Loss:', 2.69564057184645)
('Epoch', '10', ', Loss:', 2.432002490955936)


You should see the loss decreasing.
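
A quick sanity check on the starting value: an untrained model that spreads its probability roughly uniformly over the $V$ dictionary words has a cross-entropy of about $\ln V$. The sketch below assumes the dictionary size of 118 printed earlier; the first-epoch loss above is close to this baseline.

```python
import math

dictionary_size = 118  # size of the toy dictionary printed earlier

# cross-entropy of a uniform guess: -log(1 / V) = log(V)
baseline_loss = math.log(dictionary_size)
print(round(baseline_loss, 2))
# 4.77
```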

## 6.2 skip-gram model

In the model we take a training example that contains an input word and its $N$ target words. We handle the target words separately, so it actually means $N$ different training X-Y pairs. 


<img src='images/skip_gram.png'></img>
<center>[Mikolov: Efficient Estimation of Word Representations in
Vector Space, 2013]</center>

The model is the following:
 1. convert the input word and one of the target words to one-hot encoded ($V$ long) vectors!
 2. add a hidden layer of $d$ neurons (no activation, no bias, just the matrix multiplication)
 3. add an output layer with $V$ neurons (no bias) and softmax activation.
 
Luckily it is exactly the same model as for CBOW. The only difference is how we create the input and the target.

In [21]:
dic = get_dictionary(corpus.lower())
skip_gram_model = get_keras_cbow(30, len(dic))
skip_gram_model.compile(optimizer='adam', loss='categorical_crossentropy')

In [22]:
epochs = 10
half_window = 2
corpus_tokens = tokenizer(corpus.lower())
max_pos_start = len(corpus_tokens)-2*half_window

for iteration in range(epochs): #iterate the corpus epochs times
    loss = 0.
    
    for pos_start in range(max_pos_start): # iterate on the token positions
        x, y = generate_skip_gram_example(pos_start, half_window, corpus_tokens) # generate training examples
        
        x = one_hot_encoded(x, dic).T  
        y = np.array([one_hot_encoded(i, dic) for i in y])
        
        for i in y: # iterate on the target words
            loss += skip_gram_model.train_on_batch(x, i.T) # train the model. only 1 sample/batch now...

    print('Epoch', str(iteration+1).zfill(2), ', Loss:',  loss/(max_pos_start*len(y)))

('Epoch', '01', ', Loss:', 4.752132970439501)
('Epoch', '02', ', Loss:', 4.409144571251121)
('Epoch', '03', ', Loss:', 4.230753688772848)
('Epoch', '04', ', Loss:', 4.116837238723582)
('Epoch', '05', ', Loss:', 3.9940485673502457)
('Epoch', '06', ', Loss:', 3.8593641066354167)
('Epoch', '07', ', Loss:', 3.718078181763326)
('Epoch', '08', ', Loss:', 3.5770511219570458)
('Epoch', '09', ', Loss:', 3.4424605907733774)
('Epoch', '10', ', Loss:', 3.3186080837545315)


You should see the loss decreasing.
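
After training, the rows of the hidden layer's weight matrix can be read as the word vectors; in Keras this matrix should be available as `skip_gram_model.layers[0].get_weights()[0]`, with shape $(V, d)$. The sketch below does not use the trained model: it assumes a random stand-in matrix `W` and a toy dictionary, and the helper `most_similar` is hypothetical. It only shows how the nearest word could be looked up by cosine similarity, so the word it prints carries no meaning.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_dictionary = ['a', 'an', 'apple', 'zebra', 'zulu']  # assumed toy dictionary
W = rng.normal(size=(len(toy_dictionary), 3))           # random stand-in for trained (V, d) weights

def most_similar(word, dictionary, embeddings):
    # normalize each row, then the dot product is the cosine similarity
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    query = dictionary.index(word)
    sims = unit @ unit[query]
    sims[query] = -np.inf  # exclude the query word itself
    return dictionary[int(np.argmax(sims))]

print(most_similar('apple', toy_dictionary, W))
```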

#### This is the end of the homework. The deadline is 2018.04.24.

In the next homework we will continue with our smiley generator!