#  NLP-lab :  Word embeddings

In this series of exercises, we will explore three word embeddings:

* [Collobert & Weston](http://www.jmlr.org/papers/volume12/collobert11a/collobert11a.pdf) ([SENNA project page](https://ronan.collobert.com/senna/))
* [Word2Vec](https://arxiv.org/abs/1301.3781)
* [BERT](https://huggingface.co/bert-base-uncased) 


In the code provided, add your own code at the places marked `YOUR CODE HERE`.

**Important**: do NOT commit the data and embedding files to your Git repository. It wastes resources and makes the repository slower to clone.
> See https://docs.github.com/en/get-started/getting-started-with-git/ignoring-files


In [1]:
# basic imports
import os
import matplotlib.pyplot as plt
# display matplotlib graphics in notebook
%matplotlib inline 
import seaborn as sns

# disable warnings for libraries
import warnings
warnings.filterwarnings("ignore")

# configure logger
import logging
logging.basicConfig(format='%(asctime)s %(levelname)s:%(message)s', level=logging.INFO, datefmt='%I:%M:%S')
logger = logging.getLogger(__name__)


###  Embeddings exploration with Collobert's embeddings

Upload the files containing the embeddings to `data`:
* Collobert (size 50): [collobert_embeddings.txt.zip](https://storage.teklia.com/shared/deepnlp-labs/collobert_embeddings.txt.zip) which contains the embedding vectors and [collobert_words.lst](https://storage.teklia.com/shared/deepnlp-labs/collobert_words.lst) which contains the associated words;

You need to unzip the files to load them.

Feel free to open the files to see what they contain (it's sometimes surprising).

#### Question: 
>* Add the files to your .gitignore
>* Give the size in Mb of the embeddings files before unzipping.
>* By exploring the content of the embedding files, give the number of words for which these files provide embeddings.
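The last two questions can also be answered programmatically. The sketch below is only one possible approach; the helper names `file_size_mb` and `count_lines` are illustrative and not part of the lab code, and the paths in the usage comments assume the files sit in the working directory.

```python
import os


def file_size_mb(path):
    """Return the size of a file in megabytes (1 MB = 1024 * 1024 bytes)."""
    return os.path.getsize(path) / (1024 * 1024)


def count_lines(path, encoding="utf-8"):
    """Count the lines of a text file without loading it fully into memory."""
    with open(path, "r", encoding=encoding) as f:
        return sum(1 for _ in f)


# Example usage (adjust the paths to your own setup):
# print(f"{file_size_mb('collobert_embeddings.txt.zip'):.2f} MB")
# print(count_lines('collobert_words.lst'))
```

Counting lines with a generator avoids keeping every line in memory, which matters for the larger embedding files.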



In [3]:
# Paths to the embedding files (one word vector per line)
collobert_embeddings_path = 'collobert_embeddings.txt'
glove_embeddings_path = 'glove.6B.50d.txt'

# Read the Collobert embeddings file: each line holds the vector of one word
with open(collobert_embeddings_path, 'r') as file:
    collobert_lines = file.readlines()

# Read the GloVe embeddings file
with open(glove_embeddings_path, 'r', encoding='utf-8') as file:
    glove_embeddings = file.readlines()

# Number of words in each file (one line per word)
num_collobert_words = len(collobert_lines)
num_glove_words = len(glove_embeddings)

# Log the results
logger.info(f'Number of words in Collobert embeddings: {num_collobert_words}')
logger.info(f'Number of words in GloVe embeddings: {num_glove_words}')

09:13:29 INFO:Number of words in Collobert embeddings: 130000
09:13:29 INFO:Number of words in GloVe embeddings: 400000


### List of closest words

The aim of this exercise is to list the closest words to a given word for the Collobert embedding. First, we'll load the vectors of the Collobert embedding into a numpy array and the associated words into a python list. Then we'll use the [scipy KDTree](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.html) data structure to quickly search for the vectors closest to a series of words.


#### Question: 
>* load embedding vectors from the file `data/collobert_embeddings.txt` using the numpy function [genfromtxt](https://numpy.org/doc/stable/reference/generated/numpy.genfromtxt.html)
>* load the words associated with the vectors from the `data/collobert_words.lst` file into a python list (using `open()` and `readlines()`)
>* check that the sizes are correct
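One detail worth knowing before loading the word list: `readlines()` keeps the trailing newline of every line. The sketch below, run on a tiny made-up word list written to a temporary file, shows the effect and one possible way to handle it (stripping the words once and building a word-to-index dictionary). The names `words` and `word_to_index` are illustrative; the notebook cells keep the raw lines instead.

```python
import os
import tempfile

# Write a tiny stand-in word list (made-up content) to a temporary file
with tempfile.NamedTemporaryFile("w", suffix=".lst", delete=False, encoding="utf-8") as tmp:
    tmp.write("mother\nfather\ncomputer\n")
    words_path = tmp.name

with open(words_path, "r", encoding="utf-8") as f:
    raw_words = f.readlines()  # each entry keeps its trailing "\n"

# Strip the newlines once, then index the words for constant-time lookup
words = [w.strip() for w in raw_words]
word_to_index = {w: i for i, w in enumerate(words)}

os.remove(words_path)

print(repr(raw_words[0]))
print(word_to_index["computer"])
```

A dictionary lookup is constant time, whereas `list.index` scans the whole list on every call.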


In [5]:
import numpy as np

# Load the embeddings
collobert_embeddings = np.genfromtxt('collobert_embeddings.txt')

# Load the words
with open('collobert_words.lst', 'r',encoding='utf-8') as file:
    collobert_words = file.readlines()

# Verify the sizes
assert collobert_embeddings.shape[0] == len(collobert_words), "Mismatch in number of embeddings and words"

print(f"Number of embeddings: {collobert_embeddings.shape[0]}")
print(f"Number of words: {len(collobert_words)}")

Number of embeddings: 130000
Number of words: 130000


KD trees are an efficient data structure for storing large sets of points in a multi-dimensional space and running fast nearest-neighbour searches.
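To make clear what a nearest-neighbour query returns, here is a brute-force version written with NumPy on made-up 2-D points. It is only a reference sketch: it uses the Euclidean distance, which is also the default metric of `KDTree.query`, but it scans every point instead of using a tree. The function name `brute_force_neighbors` is hypothetical.

```python
import numpy as np


def brute_force_neighbors(vectors, query, k):
    """Return (distances, indices) of the k vectors closest to `query` (Euclidean)."""
    distances = np.linalg.norm(vectors - query, axis=1)
    indices = np.argsort(distances)[:k]
    return distances[indices], indices


# Toy 2-D "embeddings" (made-up values)
toy_vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])

# Query with a point that belongs to the set: it is its own nearest neighbour
dists, idx = brute_force_neighbors(toy_vectors, toy_vectors[0], k=3)
print(idx)
print(dists)
```

Because the query vector is itself stored in the set, it comes back first with distance 0. The same happens when querying the tree with the embedding of a known word.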

#### Question 
> * Initialise the [KDTree](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.html) structure with Collobert's embedding vectors.
> * Using the [tree.query](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.query.html#scipy.spatial.KDTree.query) function, display the 5 nearest words for the following words: ‘mother’, ‘computer’, ‘dentist’, ‘war’, ‘president’, ‘secretary’, ‘nurse’.  *Hint: you can use the function `collobert_words.index(w)` to obtain the index of a word in the list of words*.
> * Create a `words_plus_neighbors` list containing the words and all their neighbours (for the next question)

In [6]:
from scipy.spatial import KDTree

# Initialize KDTree with Collobert embeddings
tree = KDTree(collobert_embeddings)

# List of words to find neighbors for
words_to_query = ['mother', 'computer', 'dentist', 'war', 'president', 'secretary', 'nurse']

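# Query the 5 nearest vectors for each word (the first result is the query word itself)
words_plus_neighbors = []
for word in words_to_query:
    # Entries of collobert_words keep the trailing newline from readlines()
    word_index = collobert_words.index(word + '\n')
    distances, indices = tree.query(collobert_embeddings[word_index], k=5)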
    neighbors = [collobert_words[i] for i in indices]
    print(f"Word: {word}, Neighbors: {neighbors}")
    words_plus_neighbors.extend(neighbors)

# Remove duplicates
words_plus_neighbors = list(set(words_plus_neighbors))

Word: mother, Neighbors: ['mother\n', 'daughter\n', 'wife\n', 'father\n', 'husband\n']
Word: computer, Neighbors: ['computer\n', 'laptop\n', 'multimedia\n', 'desktop\n', 'software\n']
Word: dentist, Neighbors: ['dentist\n', 'pharmacist\n', 'midwife\n', 'physician\n', 'housekeeper\n']
Word: war, Neighbors: ['war\n', 'revolution\n', 'death\n', 'court\n', 'independence\n']
Word: president, Neighbors: ['president\n', 'governor\n', 'chairman\n', 'mayor\n', 'secretary\n']
Word: secretary, Neighbors: ['secretary\n', 'minister\n', 'treasurer\n', 'chairman\n', 'commissioner\n']
Word: nurse, Neighbors: ['nurse\n', 'physician\n', 'veterinarian\n', 'dentist\n', 'surgeon\n']


### Visualisation with T-SNE

Embeddings are vectors with tens to hundreds of dimensions (50 for the Collobert embeddings used here), so they cannot be displayed in their original space. Dimension reduction algorithms can, however, project them into 2 or 3 dimensions. One such algorithm, well suited to 2D visualisation, is [tSNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding).

#### Question
> * Create a `word_vectors` object of type `np.array` from a list containing the embeddings of all the words in the `words_plus_neighbors` list.
> * Create a tSNE object (`from sklearn.manifold import TSNE`) with the parameters `random_state=0`, `n_iter=2000` and `perplexity=15.0` for a 2-dimensional view.
> * Compute *T*, the tSNE transformation of `word_vectors`, by calling `.fit_transform(word_vectors)` on the tSNE object. This function estimates the parameters of the tSNE transformation and returns the reduced-dimension representation of the vectors used for the estimation.
> * Use the `scatterplot` function from [seaborn](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) to plot the points in 2 dimensions and add word labels using the `plt.annotate` function.

In [8]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE

# Create an object `word_vectors` of type `np.array` from a list containing all the embeddings of the words in the list `words_plus_neighbors`

word_vectors = np.array([collobert_embeddings[collobert_words.index(word)] for word in words_plus_neighbors])

# Create the tSNE object for a 2D projection (n_components defaults to 2).
# Note: recent scikit-learn releases rename `n_iter` to `max_iter`.
tsne = TSNE(random_state=0, n_iter=2000, perplexity=15.0)

# Compute the tSNE transformation of the word_vectors by applying the .fit_transform(word_vectors) function to the tSNE object
T = tsne.fit_transform(word_vectors)

# Plot the results
# Set the figure size before creating the figure so that it takes effect
sns.set(rc={'figure.figsize': (14, 8)}, font_scale=1)

fig = plt.figure()
fig.patch.set_facecolor('#f9f9f9')

sns.scatterplot(x=T[:, 0], y=T[:, 1])

# Label each point with its word (strip the newline kept by readlines())
for label, x, y in zip(words_plus_neighbors, T[:, 0], T[:, 1]):
    plt.annotate(label.strip(), xy=(x+1, y+1), xytext=(0, 0), textcoords='offset points')

### Semantic arithmetic with Word2Vec

One of the most original properties of Word2Vec embeddings is that semantic relationships between words can be modelled by arithmetic operations on their vectors. Given the vectors representing the words `king`, `man` and `woman`, we can compute the vector `v` as:

`v = vector(king) - vector(man) + vector(woman)`

This operation corresponds to the following semantic relationship: *king is to man what queen is to woman*, which translates into the following arithmetic: *the concept of king, minus the concept of man, plus the concept of woman gives the concept of queen*.

Indeed, if we search the embedding for the word whose vector is closest to `v`, we find `reine` (queen) when using the French words `roi`, `homme` and `femme`.
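The arithmetic can be illustrated on hand-made vectors. The sketch below uses made-up 2-D vectors (one axis loosely standing for "royalty", the other for "gender") and a hypothetical `analogy` helper. It follows the spirit of Gensim's `most_similar` by excluding the input words from the candidates, but it is not Gensim's implementation.

```python
import numpy as np

# Hand-made 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender" (illustrative values only)
toy = {
    "king": np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man": np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "prince": np.array([0.9, 0.9]),
    "apple": np.array([0.2, 0.1]),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), input words excluded."""
    v = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], v))


result = analogy(["king", "woman"], ["man"], toy)
print(result)
```

With these toy values, `king - man + woman` equals the vector stored for `queen`, so `queen` is the closest candidate.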


We will use a Word2Vec model pre-trained on the French WaC (frWaC) corpus, a corpus of 1 billion French words.

This embedding is available in 2 formats:
- a text format for easy exploration of the model:
    - [frWac_non_lem_no_postag_no_phrase_200_cbow_cut100.txt](https://storage.teklia.com/shared/deepnlp-labs/frWac_non_lem_no_postag_no_phrase_200_cbow_cut100.txt)
- a binary format that can be loaded using the Gensim library:
    - [frWac_non_lem_no_postag_no_phrase_200_cbow_cut100.bin](https://storage.teklia.com/shared/deepnlp-labs/frWac_non_lem_no_postag_no_phrase_200_cbow_cut100.bin)

Download the text file onto your machine to analyse it.

#### Question: 
>* Add the file to your .gitignore
>* Give the size in Mb of the embedding files
>* By exploring the contents of the embedding file in text format, give the number of words for which this model provides embeddings and the size of the embedding for each word.
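If the text file follows the usual word2vec text format (an assumption to verify by opening the file), its first line is a header giving the vocabulary size and the vector size. The sketch below reads such a header from a tiny made-up file; `read_word2vec_header` is a hypothetical helper name.

```python
import os
import tempfile


def read_word2vec_header(path, encoding="utf-8"):
    """Read the first line of a word2vec text file and return (n_words, dim)."""
    with open(path, "r", encoding=encoding, errors="ignore") as f:
        n_words, dim = f.readline().split()
    return int(n_words), int(dim)


# Demonstration on a tiny stand-in file (made-up content)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("2 3\nroi 0.1 0.2 0.3\nreine 0.4 0.5 0.6\n")
    demo_path = tmp.name

header = read_word2vec_header(demo_path)
os.remove(demo_path)
print(header)
```

Only the first line is read, so this stays fast even on a large embedding file.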



In [None]:
import os

embedding_bin_path = "frWac_non_lem_no_postag_no_phrase_200_cbow_cut100.bin"
# Get the size of the binary embedding file in Mb
embedding_bin_size_mb = os.path.getsize(embedding_bin_path) / (1024 * 1024)

print(f"Size of Word2Vec embedding file (binary): {embedding_bin_size_mb:.2f} Mb")

Size of Word2Vec embedding file (binary): 120.21 Mb


#### Word similarity

We are now going to use the [Gensim](https://radimrehurek.com/gensim/) library to load the Word2Vec model and use it.

#### Question: 
>* Modify the following code to load the Word2Vec model file in binary format using [load_word2vec_format](https://radimrehurek.com/gensim/models/keyedvectors.html#how-to-obtain-word-vectors)
>* Choose a couple of words and find the closest words according to the model using [most_similar](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.most_similar)
>* To guess the meaning of the words ‘yokohama’, ‘kanto’ and ‘shamisen’, look for their nearest neighbours. Explain the results.


In [11]:
from gensim.models import KeyedVectors

## YOUR CODE HERE

embedding_file ="frWac_non_lem_no_postag_no_phrase_200_cbow_cut100.bin"
model = KeyedVectors.load_word2vec_format(embedding_file, binary=True, unicode_errors="ignore")
## YOUR CODE HERE
model.most_similar("chevalier")

09:21:31 INFO:loading projection weights from frWac_non_lem_no_postag_no_phrase_200_cbow_cut100.bin
09:21:31 INFO:KeyedVectors lifecycle event {'msg': 'loaded (155562, 200) matrix of type float32 from frWac_non_lem_no_postag_no_phrase_200_cbow_cut100.bin', 'binary': True, 'encoding': 'utf8', 'datetime': '2025-02-15T21:21:31.764679', 'gensim': '4.3.3', 'python': '3.12.7 | packaged by Anaconda, Inc. | (main, Oct  4 2024, 13:27:36) [GCC 11.2.0]', 'platform': 'Linux-5.14.0-503.23.2.el9_5.x86_64-x86_64-with-glibc2.34', 'event': 'load_word2vec_format'}


[('commandeur', 0.6844523549079895),
 ('chevaliers', 0.6799763441085815),
 ('écuyer', 0.6333731412887573),
 ('grand-croix', 0.621898353099823),
 ('preux', 0.6011075377464294),
 ('chevalerie', 0.5404021143913269),
 ('légion', 0.5335969924926758),
 ('honneur', 0.4953608810901642),
 ('yvain', 0.4855087101459503),
 ('insignes', 0.4742659330368042)]

In [16]:
model.most_similar("yokohama")

[('tokyo', 0.7117858529090881),
 ('tôkyô', 0.6314416527748108),
 ('japon', 0.6215220093727112),
 ('nagoya', 0.61984783411026),
 ('kyushu', 0.6141085028648376),
 ('osaka', 0.6123895049095154),
 ('fukuoka', 0.5612888336181641),
 ('japonaise', 0.5507327318191528),
 ('sendai', 0.5496150255203247),
 ('japonais', 0.5391373038291931)]

Yokohama is a Japanese city, so its nearest neighbours are other Japanese cities (`tokyo`, `nagoya`, `osaka`, ...) and words related to Japan (`japon`, `japonais`). The model is trained on French text, so it only knows the word through the French contexts in which it appears, and those contexts are about Japan.

In [18]:
# model.most_similar("shamisen")

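The word `shamisen` is not in the model's vocabulary, so `most_similar` raises a `KeyError`; this is why the call above is commented out.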
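Rather than commenting the call out, the vocabulary can be checked first. The helper below is a minimal sketch assuming a Gensim 4.x `KeyedVectors` object, which exposes the `key_to_index` dictionary; the name `safe_most_similar` is hypothetical.

```python
def safe_most_similar(kv, word, topn=10):
    """Return kv.most_similar(word), or an empty list if the word is out of vocabulary.

    `kv` is assumed to be a gensim 4.x KeyedVectors instance, which exposes
    the `key_to_index` dictionary mapping each known word to its row index.
    """
    if word not in kv.key_to_index:
        print(f"'{word}' is not in the vocabulary")
        return []
    return kv.most_similar(word, topn=topn)


# Example usage with the model loaded above:
# safe_most_similar(model, "shamisen")
```

An explicit membership test keeps the notebook running when a query word is missing from the model.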

In [19]:
model.most_similar("kanto")

[('pokémon', 0.5426284670829773),
 ('mewtwo', 0.5076008439064026),
 ('pokémons', 0.4970633387565613),
 ('saito', 0.4549728333950043),
 ('pokédex', 0.448673278093338),
 ('yusuke', 0.4416310787200928),
 ('osaka', 0.4372846484184265),
 ('shôgun', 0.4324426054954529),
 ('jin', 0.42604967951774597),
 ('honshu', 0.42374101281166077)]

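Kantō is a region of Japan on the island of Honshu, but Kanto is also the name of a region in the Pokémon games. In the French corpus the word appears mostly in Pokémon contexts, hence neighbours such as `pokémon`, `mewtwo` and `pokédex`, alongside a few words related to Japan (`osaka`, `shôgun`, `honshu`). As before, the model only reflects the French contexts in which the word is used.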

#### Semantic arithmetic

#### Question: 
>* Using the function [most_similar](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.most_similar) with the arguments `positive` for the vectors to be added and `negative` for the vectors to be subtracted, check the relationship *the concept of king, minus the concept of man plus the concept of woman gives the concept of queen*.
>* Using the same method, find XXX in the following semantic relations
>   * Paris is to France what XXX is to Japan.
>   * Chevalier is to France what XXX is to Japan.

In [None]:
# Check the relationship: king - man + woman = queen
# Note: the model was trained on French text, so English query words such as
# 'king', 'man', 'woman' or 'japan' are probably poorly represented; their French
# equivalents ('roi', 'homme', 'femme', 'japon') are likely better queries.
result_king = model.most_similar(positive=['king', 'woman'], negative=['man'])
print(f"king - man + woman = {result_king[0][0]}")

# Paris is to France what ... is to Japan
result_paris = model.most_similar(positive=['paris', 'japan'], negative=['france'])
print(f"Paris is to France what {result_paris[0][0]} is to Japan")

# Chevalier is to France what ... is to Japan
result_chevalier = model.most_similar(positive=['chevalier', 'japan'], negative=['france'])
print(f"Chevalier is to France what {result_chevalier[0][0]} is to Japan")

king - man + woman = jessica
Paris is to France what tokyo is to Japan
Chevalier is to France what écuyer is to Japan


## Contextual embeddings with BERT 

BERT was one of the first freely available Transformer language models, trained on large corpora. Many other models are available on HuggingFace.

As BERT is a contextual model, it must be run on whole sentences in order to study the word embeddings it produces. In this section, we will compare the embeddings obtained for polysemous words according to the sentence in which they are used.

In English, *plant* has two meanings: a factory and a living vegetal organism. With a non-contextual embedding, such as GloVe or Collobert, these two meanings of the word are associated with a single embedding. With BERT, we'll see that the same word can have several embeddings depending on the context.

First, load the BERT model and tokenizer from HuggingFace : 

In [23]:
import torch
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Load pre-trained model 
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # to access the hidden states
                                  )
# set the model to "evaluation" mode
model.eval()

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

### Tokenizer

Language models are trained with a specific breakdown of sentences into tokens. These tokens can be words or parts of words. It is necessary to use the tokenizer corresponding to each model.

`tokenizer.vocab.keys()` gives the list of all the tokens known to the language model.

#### Question
>* How many different tokens are known to the BERT tokenizer?
>* Display a hundred tokens at random. What do you find?

In [26]:
import random
# number of tokens in the tokenizer vocabulary
num_tokens = len(tokenizer)
print(f"Number of tokens in the tokenizer: {num_tokens}")

# sample of 100 tokens
sample_tokens = random.sample(list(tokenizer.vocab.keys()), 100)    
print(f"Sample of 100 tokens: {sample_tokens}")
# YOUR CODE HERE


Number of tokens in the tokenizer: 30522
Sample of 100 tokens: ['##ᄆ', 'transvaal', '##ua', 'treacherous', '##ʔ', 'noodles', 'greyish', '##ης', 'quo', 'stanley', '##cky', 'rebuilding', 'holds', 'kris', 'pushing', 'collaborating', 'reckless', '##bor', 'spreads', '≡', 'mosque', '##dner', 'waltz', 'friendly', 'horribly', 'judaism', 'engine', 'formed', 'walkover', 'gasp', 'antoinette', 'malls', 'cher', 'poem', 'saxophonist', 'bus', '##yin', 'landing', '##rid', '[unused431]', 'soaked', 'knelt', 'alarmed', 'audible', 'unnamed', 'plot', 'rowan', 'dissipated', 'reprint', 'winged', 'semitic', '⁻', 'pitted', 'morale', 'figuring', 'entities', '288', 'rip', 'disbelief', '##ier', 'kenyan', 'roberts', 'localized', '##lla', 'igor', 'outspoken', 'jewelry', 'liberties', '##shin', 'blankets', 'evolved', 'taxonomic', 'mitochondrial', '[unused902]', 'colliery', 'associate', '##relli', 'pantry', '##bi', 'background', '##pipe', 'caracas', 'craft', 'trier', 'doubles', '##olt', 'inverness', '##ת', 'stroll', '

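* Answer:
Besides whole words, the vocabulary contains word pieces prefixed with `##` (such as `##rid` or `##ier`), which continue a word started by a previous token, placeholder tokens such as `[unused431]`, numbers, symbols, and characters from non-Latin scripts (such as `##ᄆ` or `##ת`).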

The tokenizer splits sentences and transforms the elements (words or sub-words) into indices.
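BERT's tokenizer relies on WordPiece, which splits each word greedily into the longest pieces found in the vocabulary and marks continuation pieces with `##`. The sketch below reproduces that greedy longest-match-first idea on a made-up vocabulary. It is a simplified illustration (no lower-casing, punctuation handling or length limit), not the Hugging Face implementation, and the names `toy_vocab` and `wordpiece_tokenize` are hypothetical.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split a single word into sub-word pieces, greedy longest match first."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the current word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Made-up vocabulary mapping tokens to indices
toy_vocab = {"[UNK]": 0, "play": 1, "##ing": 2, "##ed": 3, "sun": 4, "##light": 5}

for w in ["playing", "sunlight", "xyz"]:
    pieces = wordpiece_tokenize(w, toy_vocab)
    print(w, pieces, [toy_vocab[p] for p in pieces])
```

This explains the `##` tokens seen in the random sample: they only occur after the first piece of a word.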

BERT can process several sentences, but you need to tell it how the sentences (segments) have been split, with an index: 0 for the first sentence, 1 for the second. 

Two specific tokens must also be added: 
* `[CLS]`, a specific token used for sentence classification
* `[SEP]`, the end of sentence token.
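For a pair of sentences, the special tokens and segment indices described above fit together as in the following sketch. It works on already tokenized lists and uses a hypothetical helper `build_pair_inputs`; following the usual BERT convention, `[CLS]` and the first `[SEP]` belong to segment 0 and the final `[SEP]` to segment 1.

```python
def build_pair_inputs(tokens_a, tokens_b):
    """Assemble tokens and segment indices for a sentence pair, BERT style."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


tokens, segment_ids = build_pair_inputs(["the", "plant"], ["a", "factory"])
for tok, seg in zip(tokens, segment_ids):
    print(f"{tok:10s} {seg}")
```

With a single sentence, every token shares the same segment index.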

#### Question
>* Apply the `bert_tokenize` function to the 3 sentences and keep the 3 lists (index, token, segment).
>* Display this information for each of the sentences and check that the word *plant* has the same token index in the two sentences in which it appears.

In [28]:
snt1 = "The plant has reached its maximal level of production."
snt2 = "The cars are assembled inside the factory."
snt3 = "A plant needs sunlight and water to grow well."


def bert_tokenize(snt):
    """Apply the BERT tokenizer to a sentence (a string)
        and return 3 lists:
        - list of token indices
        - list of tokens (for debugging, not used by the BERT model)
        - list of segment indices
        """
    # Add the special tokens.
    tagged_snt = "[CLS] " + snt + " [SEP]"
    # Tokenize
    tokenized_snt = tokenizer.tokenize(tagged_snt)
    # Convert tokens to indices
    indexed_snt = tokenizer.convert_tokens_to_ids(tokenized_snt)
    # Single sentence: every token gets the same segment index
    segments_ids = [1] * len(tokenized_snt)

    return (indexed_snt, tokenized_snt, segments_ids)

# YOUR CODE HERE

indexed_snt1, tokenized_snt1, segments_ids1=bert_tokenize(snt1)
indexed_snt2, tokenized_snt2, segments_ids2=bert_tokenize(snt2)
indexed_snt3, tokenized_snt3, segments_ids3=bert_tokenize(snt3)

print(f"Sentence 1: {indexed_snt1, tokenized_snt1, segments_ids1}")
print(f"Sentence 2: {indexed_snt2, tokenized_snt2, segments_ids2}")
print(f"Sentence 3: {indexed_snt3, tokenized_snt3, segments_ids3}")



Sentence 1: ([101, 1996, 3269, 2038, 2584, 2049, 29160, 2504, 1997, 2537, 1012, 102], ['[CLS]', 'the', 'plant', 'has', 'reached', 'its', 'maximal', 'level', 'of', 'production', '.', '[SEP]'], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
Sentence 2: ([101, 1996, 3765, 2024, 9240, 2503, 1996, 4713, 1012, 102], ['[CLS]', 'the', 'cars', 'are', 'assembled', 'inside', 'the', 'factory', '.', '[SEP]'], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
Sentence 3: ([101, 1037, 3269, 3791, 9325, 1998, 2300, 2000, 4982, 2092, 1012, 102], ['[CLS]', 'a', 'plant', 'needs', 'sunlight', 'and', 'water', 'to', 'grow', 'well', '.', '[SEP]'], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])


The word *plant* has the same token index (3269) in the two sentences in which it appears.

## Inference

To compute embeddings, we need to run the BERT model on a complete sentence. The *predict_hidden* function converts the token and segment index lists into PyTorch tensors and applies the model.

The model used is a 12-layer model. We will use the last hidden layer of the model as an embedding to represent the words. Other solutions are possible, such as concatenation or averaging of several layers.
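The alternatives mentioned above (averaging or concatenating several layers) can be sketched on a stand-in array. The example below uses random NumPy values with the shapes expected here, namely 13 arrays (the embedding output plus 12 layers) of shape (number of tokens, 768). It only illustrates the shape manipulation and is not applied to real BERT outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the hidden states of one sentence: embedding output + 12 layers,
# each of shape (n_tokens, hidden_size). Random values, for shape illustration only.
n_tokens, hidden_size = 12, 768
hidden_states = rng.normal(size=(13, n_tokens, hidden_size))

# Option 1: last layer only (what the lab uses)
last_layer = hidden_states[-1]

# Option 2: average of the last 4 layers
mean_last4 = hidden_states[-4:].mean(axis=0)

# Option 3: concatenation of the last 4 layers along the feature axis
concat_last4 = np.concatenate(list(hidden_states[-4:]), axis=-1)

print(last_layer.shape, mean_last4.shape, concat_last4.shape)
```

Averaging keeps 768 dimensions per token, while concatenating four layers gives 4 × 768 = 3072 dimensions per token.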


#### Question
>* Apply the model to each of the 3 sentences and store the resulting embeddings (tensors).
>* Display the dimension of the resulting tensors. What is the dimension of the embedding vector for each word?

In [None]:

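def predict_hidden(indexed_snt, segments_ids):
    """Apply the BERT model to the input token indices and segment indices
        and return the last hidden layer
    """
    with torch.no_grad():
        # Convert inputs to PyTorch tensors (batch of size 1)
        tokens_tensor = torch.tensor([indexed_snt])
        segments_tensors = torch.tensor([segments_ids])
        # Note: in the transformers API the second positional argument of
        # BertModel.forward is `attention_mask`, so the all-ones list acts as an
        # attention mask here and `token_type_ids` keeps its default (all zeros).
        outputs = model(tokens_tensor, segments_tensors)
        # outputs[2] holds the hidden states: embedding output + 12 encoder layers
        hidden_states = outputs[2]
        # Last layer (index 12), first (and only) sentence of the batch
        one_hidden_layer = hidden_states[12][0]
        
    return one_hidden_layer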

# YOUR CODE HERE
hidden_states_snt1 = predict_hidden(indexed_snt1, segments_ids1)
hidden_states_snt2 = predict_hidden(indexed_snt2, segments_ids2)
hidden_states_snt3 = predict_hidden(indexed_snt3, segments_ids3)
print("dimensions of hidden_states_snt1",hidden_states_snt1.shape)
print("dimensions of hidden_states_snt2",hidden_states_snt2.shape)
 

dimensions of hidden_states_snt1 torch.Size([12, 768])
dimensions of hidden_states_snt2 torch.Size([10, 768])


* Answer: each token of the sentence is represented by an embedding vector of dimension 768; the tensor for a sentence has shape (number of tokens, 768).


The hidden layer returned by the *predict_hidden* function is a tensor containing a context vector representing each token in the input sentence. We can use this vector to represent the meaning of this word as a function of its context. We're going to compare the representation of the polysemous word *plant* as a function of its context.

#### Question
>* Using the [cosine distance](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html), calculate the following distances:
> * distance between *plant* in sentence 1 (plant-factory) and *plant* in sentence 3 (plant-vegetal)
> * distance between *plant* in sentence 1 (plant-factory) and *factory* in sentence 2
> * distance between *plant* in sentence 1 (plant-factory) and *production* in sentence 1
> * distance between *plant* in sentence 3 (plant-vegetal) and *production* in sentence 1
>
> How can we interpret these distances?
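As a reminder of what is being measured, the cosine distance is 1 minus the cosine similarity, which is the convention followed by `scipy.spatial.distance.cosine`. The NumPy sketch below shows both quantities on made-up vectors; the function names are illustrative.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1 = same direction, 0 = orthogonal)."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def cosine_distance(u, v):
    """Cosine distance, defined as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


parallel = cosine_distance([1, 2, 3], [2, 4, 6])  # same direction: distance close to 0
orthogonal = cosine_distance([1, 0], [0, 1])      # unrelated directions: distance 1
opposite = cosine_distance([1, 0], [-1, 0])       # opposite directions: distance 2

print(parallel, orthogonal, opposite)
```

A small distance (high similarity) means two vectors point in nearly the same direction, regardless of their norms.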

In [None]:
print(snt1)
print(snt2)
print(snt3)

The plant has reached its maximal level of production.
The cars are assembled inside the factory.
A plant needs sunlight and water to grow well.


In [32]:
from scipy.spatial.distance import cosine

# YOUR CODE HERE
# scipy's `cosine` returns a distance (1 - similarity),
# so `1 - cosine(u, v)` computed below is the cosine similarity.

# similarity between *plant* in sentence 1 (plant-factory) and *plant* in sentence 3 (plant-vegetal)
cosine_similarity1 = 1 - cosine(hidden_states_snt1[2], hidden_states_snt3[2])
print(f"Cosine similarity between *plant* in sentence 1 and *plant* in sentence 3: {cosine_similarity1:.2f}")


# similarity between *plant* in sentence 1 (plant-factory) and *factory* in sentence 2
cosine_similarity2 = 1 - cosine(hidden_states_snt1[2], hidden_states_snt2[7])
print(f"Cosine similarity between *plant* in sentence 1 and *factory* in sentence 2: {cosine_similarity2:.2f}")

# similarity between *production* in sentence 1 and *plant* in sentence 3 (plant-vegetal)
cosine_similarity3 = 1 - cosine(hidden_states_snt1[9], hidden_states_snt3[2])
print(f"Cosine similarity between *production* in sentence 1 and *plant* in sentence 3: {cosine_similarity3:.2f}")


Cosine similarity between *plant* in sentence 1 and *plant* in sentence 3: 0.50
Cosine similarity between *plant* in sentence 1 and *factory* in sentence 2: 0.69
Cosine similarity between *production* in sentence 1 and *plant* in sentence 3: 0.38


The values above are cosine similarities (1 minus the cosine distance). The similarity between *plant* in sentence 1 and *plant* in sentence 3 is only 0.50, lower than the similarity between *plant* in sentence 1 and *factory* in sentence 2 (0.69). BERT therefore gives the two occurrences of *plant* different representations despite their identical spelling, and places the industrial sense of *plant* closer to *factory*. *production* in sentence 1 and *plant* (vegetal) in sentence 3 form the least similar pair (0.38).