<a href="https://colab.research.google.com/github/telsayed/IR-in-Arabic/blob/master/Summer2021/labs/day7/IR_in_Arabic_Lab7_TermRepresentations%26Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **IR in Arabic** - Summer 2021 lab day7

This is one of a series of Colab notebooks created for the **IR in Arabic** course. It demonstrates how to obtain word embeddings from a non-contextual model and from a context-aware model, and how the two differ.

The **learning outcomes** of this notebook are:


*   Get word and document embeddings using a non-contextual and a context-aware model.
*   Compare the similarity of the word and document embeddings produced by both models.


### **Setup**

We will use [FLAIR: An Easy-to-Use Framework for State-of-the-Art NLP](https://www.aclweb.org/anthology/N19-4010/). FLAIR makes it easy to get word and document embeddings from a large number of state-of-the-art models.

In [None]:
#install FLAIR
!pip install flair

In [None]:
#we need to install allennlp in order to be able to use elmo model
!pip install allennlp==0.9.0

### **Word embeddings**

We will first use the non-contextual model **GloVe** to get word embeddings as follows:

In [None]:
from termcolor import colored
from flair.data import Sentence
from flair.embeddings import WordEmbeddings

# initialize embedding by specifying which model we want to use
glove_embedding = WordEmbeddings('glove')

In [None]:
# create a sentence. The Sentence class holds a text together with its tokens and related metadata
glove_sentence = Sentence('We are travelling to Italy to watch a famous play')
print(glove_sentence)
print(glove_sentence.tokens)
#Sentence splits our text into tokens. Let's access the first token
print(glove_sentence[0])

In [None]:
#print each token's embedding. We will get empty vectors because the sentence has not been embedded yet
for token in glove_sentence:
    print(colored(token,attrs=['bold']))
    #print the embedding for each token
    print(token.embedding)

In [None]:
# embed a sentence using glove.
glove_embedding.embed(glove_sentence)
# now check out the embedded tokens.
for token in glove_sentence:
    print(colored(token,attrs=['bold']))
    #print the embedding for each token
    print(token.embedding)

In [None]:
#print the embedding for the word "play"
print(colored("The embedding of the word play",attrs=['bold']))
print(glove_sentence[9].embedding)

In [None]:
#print the length of the embedding vector
print(colored("The size of the embedding vector of the word play",attrs=['bold']))
len(glove_sentence[9].embedding)

Let's create another sentence that contains the word **"play"** but with a different meaning.

In [None]:
# create sentence.
glove_sentence2 = Sentence('They play tennis on their break')

# embed a sentence using glove.
glove_embedding.embed(glove_sentence2)

In [None]:
#print the embedding of the word "play" in the first sentence
print(colored("The embedding of the word play in the first sentence",attrs=['bold']))
print(glove_sentence[9].embedding)
#print the embedding of the word "play" in the second sentence; it is identical to the embedding of "play" in the first sentence
print(colored("The embedding of the word play in the second sentence",attrs=['bold']))
print(glove_sentence2[1].embedding)

Check whether the word **"play"** has the same embedding in both sentences when **GloVe** is used.

In [None]:
from scipy import spatial
similarity = 1 - spatial.distance.cosine(glove_sentence[9].embedding, glove_sentence2[1].embedding)
similarity
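
To make explicit what `1 - spatial.distance.cosine(u, v)` computes, here is a minimal pure-Python sketch of cosine similarity. The helper `cosine_similarity` and the toy vectors are illustrative assumptions, not real GloVe values and not part of Flair or SciPy.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (made up for illustration, not real GloVe values)
play_first = [1.0, 2.0, 2.0]
play_second = [1.0, 2.0, 2.0]   # same vector, as a static lookup would return
orthogonal = [2.0, -1.0, 0.0]   # dot product with play_first is zero

sim_identical = cosine_similarity(play_first, play_second)
sim_orthogonal = cosine_similarity(play_first, orthogonal)

print(sim_identical)   # close to 1: the vectors point in the same direction
print(sim_orthogonal)  # close to 0: the vectors are orthogonal
```

A value near 1 therefore means the two vectors point in the same direction, which is what we expect when a static model returns the same vector for both occurrences of a word.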

Now let's try the context-aware model **ELMo**:

In [None]:
from flair.embeddings import ELMoEmbeddings
# initialize embedding
embedding = ELMoEmbeddings()


ELMo word embeddings can be constructed by combining ELMo layers in different ways. The available combination strategies are:

*   **"all"**: Use the concatenation of the three ELMo layers.
*   **"top"**: Use the top ELMo layer.
*   **"average"**: Use the average of the three ELMo layers.

By default, all three layers are concatenated to form the word embedding.
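
The three strategies can be illustrated on toy data. The sketch below is only an illustration under stated assumptions: the function `combine_layers` is hypothetical (it is not Flair's implementation), the layers are ordered bottom to top, and each toy layer has 2 values instead of ELMo's 1,024.

```python
# Toy per-token layer outputs, ordered bottom to top (assumption for this sketch).
# Real ELMo layers have 1,024 values each; we use 2 to keep the output readable.
layers = [
    [1.0, 2.0],  # bottom layer
    [3.0, 4.0],  # middle layer
    [5.0, 6.0],  # top layer
]

def combine_layers(layers, mode):
    """Hypothetical helper that mimics the three combination strategies."""
    if mode == "all":
        # concatenate all layers into one long vector
        return [x for layer in layers for x in layer]
    if mode == "top":
        # keep only the last (top) layer
        return list(layers[-1])
    if mode == "average":
        # element-wise mean over the layers
        return [sum(values) / len(layers) for values in zip(*layers)]
    raise ValueError(f"unknown mode: {mode}")

all_vec = combine_layers(layers, "all")
top_vec = combine_layers(layers, "top")
avg_vec = combine_layers(layers, "average")

print(len(all_vec), all_vec)  # 3x the layer size
print(len(top_vec), top_vec)  # same size as one layer
print(len(avg_vec), avg_vec)  # same size as one layer
```

With real ELMo layers of 1,024 values, the same arithmetic gives 3,072 values for "all" and 1,024 for "top" and "average".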

In [None]:
# create a sentence
elmo_sentence = Sentence('We are travelling to Italy to watch a famous play')

# embed words in sentence
embedding.embed(elmo_sentence)

[Sentence: "We are travelling to Italy to watch a famous play"   [− Tokens: 10]]

In [None]:
# now check out the embedded tokens.
for token in elmo_sentence:
    print(colored(token,attrs=['bold']))
    print(token.embedding)

In [None]:
#print the embedding for the word "play"
print(colored("The embedding of the word play",attrs=['bold']))
elmo_sentence[9].embedding

In [None]:
#print the length of the embedding vector
print(colored("The size of the embedding vector of the word play",attrs=['bold']))
#the length will be 3072 as it is the concatenation of the 3 layers, each with a length of 1,024
len(elmo_sentence[9].embedding)

Let's get the ELMo embeddings of the second sentence we used previously.

In [None]:
# create a sentence
elmo_sentence2 = Sentence('They play tennis on their break')
# embed words in sentence
embedding.embed(elmo_sentence2)

[Sentence: "They play tennis on their break"   [− Tokens: 6]]

In [None]:
#print the embedding of the word "play" in the first sentence
print(colored("The embedding of the word play in the first sentence",attrs=['bold']))
print(elmo_sentence[9].embedding)
#print the embedding of the word "play" in the second sentence; with ELMo it differs from the embedding of "play" in the first sentence
print(colored("The embedding of the word play in the second sentence",attrs=['bold']))
print(elmo_sentence2[1].embedding)

Check whether the word **"play"** has the same embedding in both sentences when **ELMo** is used.

In [None]:
similarity = 1 - spatial.distance.cosine(elmo_sentence[9].embedding, elmo_sentence2[1].embedding)
similarity

**Notice that the similarity between the two embeddings of "play" equals 1 when GloVe is used, which means they are identical, while it is lower when ELMo is used because ELMo is a contextual model.**
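
The difference can be reproduced with a toy example. In the sketch below, `static_table`, `static_embed`, and `toy_contextual_embed` are hypothetical names with made-up values. The "contextual" function simply mixes each word vector with the sentence mean; it is a stand-in to show context dependence and is not how ELMo (a bidirectional language model) actually works.

```python
# Hypothetical static lookup table with made-up 2-dimensional vectors.
static_table = {
    "famous": [1.0, 1.0],
    "play": [1.0, 0.0],
    "tennis": [0.0, 1.0],
}

def static_embed(tokens):
    """Static model: every occurrence of a word gets the same vector."""
    return [static_table[token] for token in tokens]

def toy_contextual_embed(tokens):
    """Toy contextual model: average each word vector with the sentence mean."""
    vectors = static_embed(tokens)
    dim = len(vectors[0])
    sentence_mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return [[(v[i] + sentence_mean[i]) / 2 for i in range(dim)] for v in vectors]

sentence_1 = ["famous", "play"]
sentence_2 = ["play", "tennis"]

# "play" is at index 1 in sentence_1 and index 0 in sentence_2
static_same = static_embed(sentence_1)[1] == static_embed(sentence_2)[0]
contextual_same = toy_contextual_embed(sentence_1)[1] == toy_contextual_embed(sentence_2)[0]

print(static_same)      # the static vectors for "play" match
print(contextual_same)  # the toy contextual vectors for "play" do not
```

Because the surrounding words enter the computation, the toy contextual vector for "play" changes with the sentence, while the static lookup cannot change.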

### **Document embeddings**

To get a document embedding, we pool the embeddings of all its tokens (in our case, by averaging them) as follows:

**Document embeddings using GloVe**

In [None]:
from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings

# initialize the word embeddings
glove_embedding = WordEmbeddings('glove')
# initialize the document embeddings, mode = mean
document_embeddings = DocumentPoolEmbeddings([glove_embedding])

#Now, create two sentences and call the embedding's embed() method.
glove_sentence = Sentence('We are travelling to Italy to watch a famous play')
glove_sentence2 = Sentence('They play tennis on their break')
# embed the sentences with our document embedding
document_embeddings.embed(glove_sentence)
document_embeddings.embed(glove_sentence2)
# now check out the embedded sentences.
print(colored("The embedding of first sentence",attrs=['bold']))
print(glove_sentence.embedding)
print(colored("The embedding of second sentence",attrs=['bold']))
print(glove_sentence2.embedding)

#check the size of the vector. Since it is the average of the 100-dimensional token vectors, it will be 100
print(colored("The size of the embedding vector of second sentence",attrs=['bold']))
len(glove_sentence2.embedding)



Let's check the similarity between these sentence embeddings:

In [None]:
similarity = 1 - spatial.distance.cosine(glove_sentence.embedding, glove_sentence2.embedding)
similarity
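
The mean pooling behind these document embeddings can be sketched without Flair. The helper `mean_pool` and the toy token vectors below are assumptions for illustration only; the point is that the pooled vector keeps the dimensionality of a single token vector.

```python
def mean_pool(token_vectors):
    """Element-wise mean over a list of equally sized token vectors."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]

# Toy token embeddings for a 3-token "document" (made-up values)
token_vectors = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
    [5.0, 6.0, 7.0],
]

doc_vector = mean_pool(token_vectors)

print(doc_vector)       # one vector for the whole document
print(len(doc_vector))  # same length as each token vector
```

This is why the GloVe document vector above has the same length as a single GloVe token vector, regardless of how many tokens the sentence contains.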

**Document embeddings using ELMo**

In [None]:
from flair.embeddings import ELMoEmbeddings, DocumentPoolEmbeddings
# init embedding
elmo_embedding = ELMoEmbeddings()

# initialize the document embeddings, mode = mean
document_embeddings = DocumentPoolEmbeddings([elmo_embedding])

#Now, create two sentences and call the embedding's embed() method.
elmo_sentence = Sentence('We are travelling to Italy to watch a famous play')
elmo_sentence2 = Sentence('They play tennis on their break')
# embed the sentences with our document embedding
document_embeddings.embed(elmo_sentence)
document_embeddings.embed(elmo_sentence2)
# now check out the embedded sentences.
print(colored("The embedding of first sentence",attrs=['bold']))
print(elmo_sentence.embedding)
print(colored("The embedding of second sentence",attrs=['bold']))
print(elmo_sentence2.embedding)
print(colored("The size of the embedding vector of second sentence",attrs=['bold']))
len(elmo_sentence2.embedding)

In [None]:
similarity = 1 - spatial.distance.cosine(elmo_sentence.embedding, elmo_sentence2.embedding)
similarity

## **Combining non-contextual and contextual embeddings**

In some cases we may need to combine both kinds of embeddings. Non-contextual and contextual embeddings can be combined easily as follows:

In [None]:
from flair.embeddings import StackedEmbeddings

stacked_embeddings = StackedEmbeddings([
                                        glove_embedding,
                                        elmo_embedding
                                       ])
sentence = Sentence('They play tennis on their break')

# just embed a sentence using the StackedEmbedding as you would with any single embedding.
stacked_embeddings.embed(sentence)

# now check out the embedded tokens.
for token in sentence:
    print(colored(token,attrs=['bold']))
    print(token.embedding)

Words are now embedded using a concatenation of two different embeddings. This means that the resulting embedding vector is still a single PyTorch vector.

In [None]:
print(colored("The size of the embedding vector for the word play",attrs=['bold']))
print(len(sentence[1].embedding))

In [None]:
print(colored("The embedding vector for the word play",attrs=['bold']))
print(sentence[1].embedding)

In [None]:
#check the glove embeddings which are the first 100 elements
print(sentence[1].embedding[0:100])

In [None]:
#check the elmo embeddings which are the vector elements from 100 until the end of the vector
print(sentence[1].embedding[100:])
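
The slicing in the two cells above relies on the stacked vector being a plain concatenation. Here is a minimal sketch with toy sizes, assuming (as the notebook does) that the GloVe part comes first; the variable names and values are made up for illustration.

```python
# Toy stand-ins: 2 values instead of GloVe's 100, 3 values instead of ELMo's 3,072
glove_part = [0.1, 0.2]
elmo_part = [0.3, 0.4, 0.5]

# Stacking is concatenation: the result is one longer vector
stacked = glove_part + elmo_part

# Slice at the boundary to recover each part
boundary = len(glove_part)
recovered_glove = stacked[:boundary]
recovered_elmo = stacked[boundary:]

print(len(stacked))      # sum of the two sizes
print(recovered_glove)   # the first `boundary` values
print(recovered_elmo)    # everything after the boundary
```

With the real models the boundary is 100, so the stacked vector for "play" should have 100 + 3,072 = 3,172 values when ELMo uses its default concatenation of three layers.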

### **Exercise 1**
Choose either the word "rose" or "tie" and create two sentences that share this word but use it with different meanings. Use both ELMo and GloVe to get the word embeddings. Then check the similarity between the embeddings of the shared word in the two sentences, once with GloVe and once with ELMo.

In [None]:
#add your solution here

### **Exercise 2**

Get the document embeddings of the sentences you created using both ELMo and GloVe. Compute the similarity between the GloVe sentence embeddings and compare it to the similarity between the ELMo sentence embeddings.

In [None]:
#add your solution here

### **References**


*   [Flair word embeddings tutorial.](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_3_WORD_EMBEDDING.md)
*   [Flair ELMo embeddings tutorial.](https://github.com/flairNLP/flair/blob/master/resources/docs/embeddings/ELMO_EMBEDDINGS.md)
*   [Flair document embeddings tutorial.](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_5_DOCUMENT_EMBEDDINGS.md)

