<a href="https://colab.research.google.com/github/ua-deti-information-retrieval/Neural-IR-hands-on/blob/main/RI_practical_tutorial_2_Embeddings_solucoesV2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RI practical tutorial #2

## Embeddings

An important component of natural language processing (NLP) is the ability to translate words, phrases, or larger bodies of text into continuous numerical vectors.



## Dependencies

In [None]:
!pip install torch matplotlib
!git clone https://github.com/ua-deti-information-retrieval/Neural-IR-hands-on.git

fatal: destination path 'Neural-IR-hands-on' already exists and is not an empty directory.


In [None]:
import torch
from tqdm import tqdm

## Recap

Embeddings convert words, sentences, or even entire documents into vectors of real numbers. Unlike traditional methods such as one-hot encoding, which represent words as isolated, high-dimensional points, embeddings are dense, low-dimensional vectors in which related words can end up close to each other.
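To make the contrast concrete, here is a minimal sketch in plain Python (no torch). The three-word vocabulary and the 2-dimensional dense vectors are hand-picked assumptions for illustration, not values from any trained model.

```python
def one_hot(index, size):
    """Return a one-hot list of length `size` with a 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def dot(a, b):
    """Dot product of two equally sized lists."""
    return sum(x * y for x, y in zip(a, b))

vocab = ["oak", "pine", "war"]  # hypothetical toy vocabulary
oak, pine, war = (one_hot(i, len(vocab)) for i in range(len(vocab)))

# Distinct one-hot vectors are always orthogonal: no notion of similarity.
print(dot(oak, pine), dot(oak, war))

# Hand-picked dense vectors (an assumption): related words can be closer.
dense = {"oak": [0.9, 0.1], "pine": [0.8, 0.2], "war": [-0.7, 0.6]}
print(dot(dense["oak"], dense["pine"]) > dot(dense["oak"], dense["war"]))
```

With one-hot vectors every pair of different words looks equally unrelated, whereas dense vectors leave room for graded similarity.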

In [None]:
# the -> 0
# supreme -> 1
# art -> 2
# 015


toy_vocab = ['the','supreme','art','of','war','is','to','subdue','the','enemy','without','fighting'] # 123
torch.eye(len(toy_vocab))

tensor([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]])

In [None]:
embedding_layer = torch.nn.Embedding(len(toy_vocab), 4) # lookup table
print(embedding_layer.weight)
print("embeddings norm", torch.linalg.norm(embedding_layer.weight, ord=2, dim=-1))

Parameter containing:
tensor([[-0.9652, -0.0183, -0.7884,  0.4117],
        [ 0.2808,  1.2251,  1.5657, -0.1117],
        [-0.5183,  0.1446,  1.4250,  0.3445],
        [ 0.3087, -1.5000, -0.7432,  1.4683],
        [ 0.3633, -0.9857, -1.6775,  0.0584],
        [ 0.0558,  1.3720,  0.5216, -1.8075],
        [ 0.2945, -0.4124,  0.7072, -0.9646],
        [-1.5088,  0.8306,  0.2767,  2.8368],
        [-0.4752, -0.1008, -0.6639, -0.1164],
        [-0.2077, -1.0223, -0.0475,  0.2471],
        [-0.5356, -0.0821, -2.3560,  1.4714],
        [ 1.7639, -0.9144, -1.1769, -0.1801]], requires_grad=True)
embeddings norm tensor([1.3127, 2.0109, 1.5617, 2.2480, 1.9802, 2.3291, 1.2990, 3.3303, 0.8308,
        1.0731, 2.8300, 2.3163], grad_fn=<LinalgVectorNormBackward0>)


## Hands on

To get started with practical exercises in embeddings, it's beneficial to use pre-trained models. This allows us to explore and understand the power of embeddings without the need for extensive computational resources and time to train our models.

For our exercise, we will use the DESM (Dual Embedding Space Model) from Microsoft (the same introduced in class). DESM is a unique model that leverages two types of embeddings.

In [None]:
# run to download the desm embeddings
!wget https://download.microsoft.com/download/A/7/C/A7C7F0A6-B925-4C07-A14B-04ACF8A8E030/desm.zip
!unzip desm.zip

--2023-11-30 13:58:25--  https://download.microsoft.com/download/A/7/C/A7C7F0A6-B925-4C07-A14B-04ACF8A8E030/desm.zip
Resolving download.microsoft.com (download.microsoft.com)... 23.199.49.187, 2600:1408:c400:e8e::317f, 2600:1408:c400:e8c::317f
Connecting to download.microsoft.com (download.microsoft.com)|23.199.49.187|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3757822502 (3.5G) [application/octet-stream]
Saving to: ‘desm.zip’


2023-11-30 13:59:01 (102 MB/s) - ‘desm.zip’ saved [3757822502/3757822502]

Archive:  desm.zip
  inflating: in.txt                  
  inflating: out.txt                 
  inflating: README.txt              


In [None]:
!wget https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt

--2023-11-30 14:53:02--  https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4234917 (4.0M) [text/plain]
Saving to: ‘words_alpha.txt’


2023-11-30 14:53:03 (205 MB/s) - ‘words_alpha.txt’ saved [4234917/4234917]



In [None]:
# use a simple vocab, because loading the full IN and OUT matrices would exhaust the resources
with open("words_alpha.txt") as f:
#with open("simple_vocab_example.txt") as f:
  vocab_set = {token.rstrip() for token in f}

In [None]:

def load_embeddings_from_txt(path, vocab):
  emb = {}

  with open(path) as f:
    for line in tqdm(f):
      token, *values = line.split("\t")
      if token in vocab:
        emb[token] = list(map(float, values))

  # separating the vocab from the embeddings
  vocab, embedding = list(zip(*emb.items()))
  token_to_id = {token:i for i,token in enumerate(vocab)}
  id_to_token = {v:k for k,v in token_to_id.items()}

  return token_to_id, id_to_token, torch.tensor(embedding) # 173593x200

in_token_to_id, in_id_to_token, in_embeddings = load_embeddings_from_txt("in.txt", vocab_set)

2748230it [01:17, 35470.37it/s]


Let's explore the loaded embeddings.



In [None]:
print("shape", in_embeddings.shape)
nurse_id = in_token_to_id["nurse"]
print("Token: nurse | id:", nurse_id)
print("embeddings norm", torch.linalg.norm(in_embeddings[nurse_id], ord=2)) #
print("nurse embedding:",in_embeddings[nurse_id])

shape torch.Size([173593, 200])
Token: nurse | id: 103266
embeddings norm tensor(1.0000)
nurse embedding: tensor([ 1.3924e-01, -9.9500e-04,  8.8700e-04,  6.8999e-02,  7.3316e-02,
        -9.7289e-02, -5.1672e-02, -8.9405e-02,  8.8826e-02, -5.9428e-02,
         1.3653e-02, -5.3968e-02,  5.9562e-02, -2.9867e-02,  1.0009e-01,
        -1.9665e-02, -6.0743e-02, -9.9873e-02,  7.3166e-02,  1.6776e-01,
        -8.6471e-02,  8.0610e-02,  1.5516e-02,  8.0300e-03,  1.6674e-01,
        -1.2330e-03,  2.6245e-02,  4.9310e-03,  4.1719e-02, -3.9982e-02,
         3.8725e-02, -1.3561e-01,  1.5320e-03, -1.0055e-02, -4.1976e-02,
        -3.7175e-02,  6.5462e-02,  2.9214e-02,  4.2903e-02,  1.0753e-01,
        -6.8870e-03, -4.3129e-02, -7.8007e-02, -8.1616e-02,  1.6278e-02,
        -2.0872e-02,  1.3440e-01, -5.4099e-02, -3.3082e-02, -6.3098e-02,
         6.8556e-02, -2.9421e-02, -1.0856e-01, -2.4650e-02,  1.0879e-02,
        -3.0170e-03,  1.9260e-03,  8.4693e-02,  2.7450e-03, -9.0751e-02,
        -2.6489e-0

## How to find similar tokens with embeddings?

The same way you find similar vectors with TF-IDF: using cosine similarity!

More precisely, given two vectors $\vec{a}$ and $\vec{b}$:

$\cos(\vec{a},\vec{b}) = \frac{\vec{a}\cdot\vec{b}}{\|\vec{a}\|\times\|\vec{b}\|}$

Then, we just need to compute the cosine similarity between $\vec{a}$ and all of the vectors in our matrix $C$ (collection).

As an example, complete the following function. It should calculate the cosine similarity between a given vector and all the collection vectors and return the most similar tokens and scores.
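Before filling in the torch version, the idea can be sketched in plain Python. The names `cosine`, `topk_similar`, and `toy_collection` are hypothetical helpers for this illustration only, and the 2-dimensional vectors are made up.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equally sized lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def topk_similar(query, collection, k=2):
    """Return the k (name, score) pairs with the highest cosine to `query`."""
    scored = ((cosine(query, vec), name) for name, vec in collection.items())
    # heapq.nlargest keeps only k candidates instead of sorting everything
    return [(name, score) for score, name in heapq.nlargest(k, scored)]

# Made-up 2-d collection (an assumption, not real embeddings)
toy_collection = {
    "a": [1.0, 0.0],
    "b": [1.0, 1.0],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
result = topk_similar([2.0, 0.0], toy_collection)
print(result)
```

Note that the query `[2.0, 0.0]` is not unit-length, yet `"a"` still scores 1: cosine similarity only depends on direction. When every vector is already normalized, the denominator is 1 and a plain dot product is enough.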

In [None]:
def find_topk_similar_to(token, embeddings, token_to_id, id_to_token, topk=10):
  """
  Given the token return topk similar tokens according to the cos sim between the
  token vector and all of the embeddings vectors
  """

  token_embedding = embeddings[token_to_id[token]]
  return find_topk_similar_to_vec(token_embedding, embeddings, token_to_id, id_to_token, topk)

def find_topk_similar_to_vec(token_embedding, embeddings, token_to_id, id_to_token, topk=10):
  """
  Given the token embedding return topk similar tokens according to the cos sim between the
  token vector and all of the embeddings vectors


  [('mercedes', 0.9999992251396179),
  ('cabriolet', 0.6590193510055542),
  ('sprinter', 0.6370120048522949),
  ('volkswagen', 0.6347604393959045),
  ('fiat', 0.6245887875556946),
  ('jaguar', 0.6102705001831055),
  ('toyota', 0.5901010632514954),
  ('honda', 0.5850051641464233),
  ('rover', 0.5818690061569214),
  ('freightliner', 0.5783664584159851)]

  """
  ## complete

  # The loaded DESM vectors appear to be unit-norm (see the norm printed above),
  # so the dot product below equals the cosine similarity.
  # For non-normalized inputs, normalize first:
  #token_embedding = token_embedding / torch.linalg.norm(token_embedding, ord=2)
  #embeddings = embeddings / torch.linalg.norm(embeddings, ord=2, dim=1, keepdim=True)

  scores = embeddings @ token_embedding # (173593 x 200) @ (200) -> 173593

  # a full sort would cost O(n log(n))
  # selecting only the top k costs O(n log(k)), with k << n
  ordered_scores, ordered_ids = torch.topk(scores, k=topk)

  return [(id_to_token[i.item()], s.item()) for i, s in zip(ordered_ids, ordered_scores)]




In [None]:
find_topk_similar_to("yale", in_embeddings, in_token_to_id, in_id_to_token)


[('yale', 1.0),
 ('harvard', 0.6379895210266113),
 ('cornell', 0.6104707717895508),
 ('quinnipiac', 0.6071730256080627),
 ('tufts', 0.5811516642570496),
 ('emory', 0.5635881423950195),
 ('hamline', 0.5511536002159119),
 ('stanford', 0.54926997423172),
 ('villanova', 0.5311557650566101),
 ('northwestern', 0.5205579400062561)]

In [None]:
find_topk_similar_to("apple", in_embeddings, in_token_to_id, in_id_to_token)

[('apple', 1.0000005960464478),
 ('blackberry', 0.6527820825576782),
 ('apples', 0.611760139465332),
 ('mac', 0.5416128039360046),
 ('raspberry', 0.5347691178321838),
 ('cider', 0.52553391456604),
 ('chokecherry', 0.5230116844177246),
 ('blueberry', 0.49647897481918335),
 ('crouton', 0.4963668882846832),
 ('pumpkin', 0.494684100151062)]

In [None]:
find_topk_similar_to("oak", in_embeddings, in_token_to_id, in_id_to_token)

[('oak', 1.0000003576278687),
 ('pine', 0.7786054015159607),
 ('walnut', 0.7469823360443115),
 ('maple', 0.7312123775482178),
 ('cedar', 0.7202832102775574),
 ('willow', 0.7179901003837585),
 ('birch', 0.7081882357597351),
 ('sycamore', 0.687503457069397),
 ('dogwood', 0.6859285235404968),
 ('hickory', 0.6843091249465942)]

In [None]:
# Why does it work so badly for "covid"? Any guess?
find_topk_similar_to("covid", in_embeddings, in_token_to_id, in_id_to_token)

[('covid', 1.000001072883606),
 ('tensometer', 0.5409350395202637),
 ('shakeproof', 0.5386678576469421),
 ('hariana', 0.5289627313613892),
 ('outmatch', 0.5276938676834106),
 ('abattis', 0.5260857939720154),
 ('stobbing', 0.525831401348114),
 ('lashins', 0.5253284573554993),
 ('graywacke', 0.5226282477378845),
 ('genophobia', 0.5201941728591919)]

## Word analogies

Another interesting property of word embeddings is their ability to capture word analogies through geometric relationships in the vector space. This phenomenon is often illustrated by the famous example: "king" - "man" + "woman" ≈ "queen". In this case, the embeddings capture the relationship between gender roles and royal titles.

With the help of the previous function, create the vector for "queen" by applying the relation ("king" - "man") to "woman".
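As a warm-up, the arithmetic can be checked on a tiny hand-built space. In this sketch the two dimensions are assumed to mean (royalty, gender) and every vector is invented; real embeddings are not this clean.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equally sized lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Invented 2-d vectors: first axis ~ royalty, second axis ~ gender (assumption)
toy = {
    "king": [1.0, 1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.0],
}

def analogy(a, b, c, vectors):
    """Return the token closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {t: v for t, v in vectors.items() if t not in {a, b, c}}
    return max(candidates, key=lambda t: cosine(target, candidates[t]))

answer = analogy("king", "man", "woman", toy)
print(answer)
```

Excluding the three input tokens matters: with real embeddings the nearest neighbour of the combined vector is often one of the inputs themselves.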



In [None]:


def word_analogy(token_a, token_b, token_c):
  """
  Performs vec_token_a - vec_token_b + vec_token_c

  and returns a list with the closest tokens

  Note: token_a, token_b and token_c should be removed from the list

  Example:
  word_analogy("king", "man", "woman")
  [('queen', 0.6244865655899048),
 ('kings', 0.4600622057914734),
 ('prince', 0.42849528789520264),
 ('princess', 0.42579346895217896),
 ('royal', 0.41185224056243896),
 ('crown', 0.4051671624183655),
 ('princes', 0.40045303106307983),
 ('lamb', 0.3960754871368408),
 ('hamilton', 0.39465370774269104)]
  """
  ## Complete

  # get emb for token_a token_b token_c

  # approx_vec = emb_a - emb_b + emb_c

  # return find(approx_vec)



  approx_vec =  in_embeddings[in_token_to_id[token_a]] - in_embeddings[in_token_to_id[token_b]] + in_embeddings[in_token_to_id[token_c]]
  # approx_vec is not normalized
  approx_vec = approx_vec/torch.linalg.norm(approx_vec, ord=2)

  top_results = find_topk_similar_to_vec(approx_vec, in_embeddings, in_token_to_id, in_id_to_token)

  # remove token_a, token_b and token_c from the list
  exclude = {token_a, token_b, token_c}
  return [(tk,score) for tk, score in top_results if tk not in exclude]

word_analogy("king", "man", "woman") # expected queen

[('queen', 0.6244865655899048),
 ('kings', 0.4600622057914734),
 ('prince', 0.42849528789520264),
 ('princess', 0.42579346895217896),
 ('royal', 0.41185224056243896),
 ('crown', 0.4051671624183655),
 ('princes', 0.40045303106307983),
 ('lamb', 0.3960754871368408),
 ('hamilton', 0.39465370774269104)]

In [None]:
word_analogy("paris", "france", "portugal") # expected lisbon

[('lisbon', 0.5874060988426208),
 ('barcelona', 0.5461090207099915),
 ('porto', 0.512915849685669),
 ('malaga', 0.5048376321792603),
 ('rambla', 0.49051910638809204),
 ('vila', 0.4781612753868103),
 ('quito', 0.4775787889957428),
 ('oporto', 0.47215747833251953)]

In [None]:
word_analogy("france", "paris", "lisbon") # expected portugal

[('portugal', 0.6427342295646667),
 ('poland', 0.5291943550109863),
 ('austria', 0.5101706385612488),
 ('germany', 0.4979320466518402),
 ('netherlands', 0.49458247423171997),
 ('azores', 0.4933825731277466),
 ('lithuania', 0.4896232485771179),
 ('spain', 0.4859029948711395)]

In [None]:
word_analogy("teacher", "school", "hospital") # expected ? (maybe doctor?)

[('nurse', 0.6128870248794556),
 ('physician', 0.5838133096694946),
 ('therapist', 0.5681063532829285),
 ('hospita', 0.5670239329338074),
 ('pharmacist', 0.5632860660552979),
 ('midwives', 0.5558406710624695),
 ('nurses', 0.5456812381744385),
 ('psychiatrist', 0.5454483032226562),
 ('hospitals', 0.5374904274940491)]

## OK, but what if I want to use sentences or documents?

In such scenarios, a straightforward approach is to average the embeddings of all tokens within a sentence. This method offers a means to condense the rich information of a sentence into a single vector.

By averaging the embeddings of each word in a sentence, we create a composite representation that captures the essence of the sentence as a whole. This can then be used to compare and measure the similarity between different sentences or documents. It's a practical method, especially when dealing with small texts. Let's proceed to implement this and see how well it performs in identifying sentence similarities.
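The averaging step can be sketched in plain Python before writing the torch version. `toy_emb` and `embed_sentence` are hypothetical names, and the two 2-dimensional vectors are made up for the example.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Made-up token embeddings (an assumption, not DESM values)
toy_emb = {"red": [1.0, 0.0], "fox": [0.0, 1.0]}

def embed_sentence(text, emb):
    """Average the vectors of the known tokens, then normalize the result."""
    vecs = [emb[token] for token in text.lower().split() if token in emb]
    return l2_normalize(mean_vector(vecs))

sent = embed_sentence("Red fox jumps", toy_emb)
print(sent)
```

Tokens missing from the vocabulary ("jumps" here) are silently skipped. A text with no known token at all would make this sketch divide by zero, so a production version would need to handle that case explicitly.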

In [None]:
sentences_corpus = [
    "A nimble red fox leaped over a sleeping canine.",
    "New York is known for its bustling city life.",
    "The city of Tokyo is lively and vibrant at night.",
    "The development of AI has significant implications for society.",
    "Fresh vegetables and fruits are essential for a healthy diet.",
    "Eating a variety of greens and fruits contributes to good health.",
    "The book on the shelf is old and worn.",
    "An ancient, tattered tome sits in the library."
]

sentence_to_id = {s:i for i,s in enumerate(sentences_corpus)}
id_to_sentence = sentences_corpus
#id_to_sentence = {v:k for k,v in sentence_to_id.items()}

In [None]:


def text_to_vec(text, embeddings, in_token_to_id):
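  # simple tokenizer: lowercase + whitespace split. Punctuation stays attached
  # (e.g. "canine."), so such tokens are not found in the vocabulary and are skipped.
  tokens = text.lower().split() # [list of tokens]

  # list of the embeddings of the tokens that are in the vocabulary V
  return [embeddings[in_token_to_id[token]] for token in tokens if token in in_token_to_id]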

def sentence_embedding(text, embeddings):
  """
  Given a sequence of text, compute the embedding of the sentence by averaging its token embeddings

  use the function text_to_vec to convert text to vectors: text_to_vec(text, embeddings, in_token_to_id)

  Out: sentence embeddings
  """
  ## Complete

  tokens_emb = torch.stack(text_to_vec(text, embeddings, in_token_to_id)) # [vec_a, vec_b, vec_c, ...]
  # stacked row-wise into a T x 200 matrix:
  # [vec_a,
  #  vec_b,
  #  vec_c]

  sent_emb = tokens_emb.mean(axis=0)# 200

  return sent_emb / torch.linalg.norm(sent_emb, ord=2)

# [sentence_embedding(sent, in_embeddings) for sent in sentences_corpus] -> list of the embeddings of each sentence
sentences_corpus_embeddings = torch.stack([sentence_embedding(sent, in_embeddings) for sent in sentences_corpus])
# S x 200


In [None]:
sentences_corpus_embeddings.shape

torch.Size([8, 200])

In [None]:
sent_embedding = sentence_embedding("Artificial Intelligence will shape the future of humanity.", in_embeddings)
find_topk_similar_to_vec(sent_embedding, sentences_corpus_embeddings, sentence_to_id, id_to_sentence, topk=5)



[('The development of AI has significant implications for society.',
  0.8642897605895996),
 ('Eating a variety of greens and fruits contributes to good health.',
  0.8398697972297668),
 ('The book on the shelf is old and worn.', 0.8192300796508789),
 ('The city of Tokyo is lively and vibrant at night.', 0.77782142162323),
 ('Fresh vegetables and fruits are essential for a healthy diet.',
  0.7680155038833618)]

In [None]:
sent_embedding = sentence_embedding("The quick brown fox jumps over the lazy dog.", in_embeddings)
find_topk_similar_to_vec(sent_embedding, sentences_corpus_embeddings, sentence_to_id, id_to_sentence, topk=5)

[('A nimble red fox leaped over a sleeping canine.', 0.8499789237976074),
 ('The book on the shelf is old and worn.', 0.8179973363876343),
 ('The city of Tokyo is lively and vibrant at night.', 0.7329216003417969),
 ('An ancient, tattered tome sits in the library.', 0.7195178866386414),
 ('Eating a variety of greens and fruits contributes to good health.',
  0.6892918348312378)]

## Well if it works for sentence similarity, maybe it works for retrieval?

Let's apply the same approach to this toy collection of documents:

In [None]:
documents = [
    "Apples are rich in antioxidants, which help in fighting free radicals.",
    "The water cycle consists of evaporation, condensation, and precipitation.",
    "Recent trends in AI include advancements in deep learning and neural networks.",
    "Good mental health can be maintained by regular exercise and proper sleep.",
    "The Olympic Games originated in ancient Greece and have evolved over centuries.",
    "Eating fruits and vegetables is essential for physical well-being.",
    "Cloud formation is a key aspect of the earth's hydrological process.",
    "Machine learning and AI are becoming integral in various industries.",
    "Mindfulness and meditation are effective for stress management.",
    "The modern Olympics include a variety of sports from track to swimming."
]

doc_to_id = {s:i for i,s in enumerate(documents)}
id_to_doc = documents

doc_embeddings = torch.stack([sentence_embedding(sent, in_embeddings) for sent in documents])
doc_embeddings.shape

torch.Size([10, 200])

In [None]:
sent_embedding = sentence_embedding("How does the water cycle work?", in_embeddings)
find_topk_similar_to_vec(sent_embedding, doc_embeddings, doc_to_id, id_to_doc, topk=3)

[('The water cycle consists of evaporation, condensation, and precipitation.',
  0.7694262266159058),
 ("Cloud formation is a key aspect of the earth's hydrological process.",
  0.5853888392448425),
 ('Eating fruits and vegetables is essential for physical well-being.',
  0.5655031204223633)]

In [None]:
sent_embedding = sentence_embedding("What is the history of the Olympic Games?", in_embeddings)
find_topk_similar_to_vec(sent_embedding, doc_embeddings, doc_to_id, id_to_doc, topk=3)

[('The modern Olympics include a variety of sports from track to swimming.',
  0.9072824716567993),
 ('The Olympic Games originated in ancient Greece and have evolved over centuries.',
  0.886823296546936),
 ("Cloud formation is a key aspect of the earth's hydrological process.",
  0.8800275921821594)]

## DESM model

Up to this point, we have only used the 'IN' embeddings of DESM (Dual Embedding Space Model). Let's delve deeper into understanding and exploring this model:

The DESM model is unique in its dual-embedding approach. It leverages both 'IN' and 'OUT' embeddings to enhance the representation of words and phrases.

First, let's load the OUT embeddings.

In [None]:
# note that out_token_to_id and out_id_to_token should be exactly the same as in_token_to_id and in_id_to_token
out_token_to_id, out_id_to_token, out_embeddings = load_embeddings_from_txt("out.txt", vocab_set)


2748230it [01:19, 34395.88it/s]


In continuation of what we've learned in class, we'll now calculate similarities using different combinations of embeddings from the DESM model. Namely, IN-IN, IN-OUT and OUT-OUT.

In [None]:
def in_out_comparison_for_token(token, topk=10):

  in_in_results = find_topk_similar_to(token, in_embeddings, in_token_to_id, in_id_to_token, topk=topk)
  out_out_results = find_topk_similar_to(token, out_embeddings, out_token_to_id, out_id_to_token, topk=topk)
  in_out_results = find_topk_similar_to_vec(in_embeddings[in_token_to_id[token]], out_embeddings, in_token_to_id, in_id_to_token, topk=topk)
  print(f'|{"IN-IN":^25}|{"OUT-OUT":^25}|{"IN-OUT":^25}|')
  for i in range(topk):
    in_in_str = f'{in_in_results[i][0]} ({in_in_results[i][1]:.3f})'
    out_out_str = f"{out_out_results[i][0]} ({out_out_results[i][1]:.3f})"
    in_out_str = f"{in_out_results[i][0]} ({in_out_results[i][1]:.3f})"
    print(f'|{in_in_str:^25}|{out_out_str:^25}|{in_out_str:^25}|')



In [None]:
in_out_comparison_for_token("yale")


|          IN-IN          |         OUT-OUT         |         IN-OUT          |
|      yale (1.000)       |      yale (1.000)       |      yale (0.279)       |
|     harvard (0.638)     |     harvard (0.751)     |     faculty (0.187)     |
|     cornell (0.610)     |      tufts (0.742)      |     alumni (0.170)      |
|   quinnipiac (0.607)    |     cornell (0.738)     | preregistration (0.164) |
|      tufts (0.581)      |  northwestern (0.718)   |   orientation (0.164)   |
|      emory (0.564)      |   quinnipiac (0.716)    |      haven (0.162)      |
|     hamline (0.551)     |    villanova (0.715)    |    graduate (0.156)     |
|    stanford (0.549)     |      emory (0.712)      |   admissions (0.156)    |
|    villanova (0.531)    |     vassar (0.711)      |    academic (0.155)     |
|  northwestern (0.521)   |       uva (0.705)       |      dorms (0.150)      |


In [None]:
in_out_comparison_for_token("apple")

|          IN-IN          |         OUT-OUT         |         IN-OUT          |
|      apple (1.000)      |      apple (1.000)      |      apple (0.246)      |
|   blackberry (0.653)    |  misapprehend (0.817)   |     apples (0.167)      |
|     apples (0.612)      |     echards (0.817)     |   blackberry (0.161)    |
|       mac (0.542)       |     apples (0.814)      |     orchard (0.141)     |
|    raspberry (0.535)    |     lattins (0.811)     |      cider (0.140)      |
|      cider (0.526)      |     appale (0.809)      |    orchards (0.137)     |
|   chokecherry (0.523)   |    cankered (0.807)     |      crisp (0.134)      |
|    blueberry (0.496)    |     cobnuts (0.807)     |   pollination (0.130)   |
|     crouton (0.496)     |    mesropian (0.805)    |     picking (0.129)     |
|     pumpkin (0.495)     |     thistly (0.805)     |    jailbreak (0.113)    |


## DESM Retrieval

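Following the slides, let's implement the DESM retrieval model:

$DESM(Q, D) = \frac{1}{|Q|}\sum_{q_i \in Q}\cos(q_i, D)$

Here each query token $q_i$ is represented by its IN embedding, and the document $D$ by the normalized average of the OUT embeddings of its tokens.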
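The scoring formula can be sketched in plain Python on made-up vectors. `desm_score` and `centroid` are hypothetical helper names, and the numbers are invented rather than taken from the DESM files; the sketch only assumes that query tokens and document tokens come from two separate vector sets.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equally sized lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def centroid(vectors):
    """Element-wise mean of the document's token vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def desm_score(query_in_vecs, doc_out_vecs):
    """Mean cosine between each query token vector and the document centroid."""
    doc_vec = centroid(doc_out_vecs)
    return sum(cosine(q, doc_vec) for q in query_in_vecs) / len(query_in_vecs)

# Invented vectors: two query tokens (IN space) and two documents (OUT space)
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 1.0], [2.0, 2.0]]
doc_b = [[-1.0, 0.0], [-1.0, 0.0]]

score_a = desm_score(query, doc_a)
score_b = desm_score(query, doc_b)
print(score_a, score_b)
```

Because the cosine divides by both norms, the centroid does not need to be normalized beforehand in this sketch. A matrix implementation that uses a plain dot product, on the other hand, relies on both sides being unit-length.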

In [None]:
documents = [
    "Apples are rich in antioxidants, which help in fighting free radicals.",
    "The water cycle consists of evaporation, condensation, and precipitation.",
    "Recent trends in AI include advancements in deep learning and neural networks.",
    "Good mental health can be maintained by regular exercise and proper sleep.",
    "The Olympic Games originated in ancient Greece and have evolved over centuries.",
    "Eating fruits and vegetables is essential for physical well-being.",
    "Cloud formation is a key aspect of the earth's hydrological process.",
    "Machine learning and AI are becoming integral in various industries.",
    "Mindfulness and meditation are effective for stress management.",
    "The modern Olympics include a variety of sports from track to swimming."
]



In [None]:
def desm(query, documents, topk=3):
  """
  Implement the desm algorithm
  query: text of a question (use emb IN)
  documents: list of documents text that make the collection (use emb OUT)
  topk: maximum number of documents that we want to return

  desm("How does the water cycle work?", documents)
  [('The water cycle consists of evaporation, condensation, and precipitation.',
  -0.0023205685429275036),
 ("Cloud formation is a key aspect of the earth's hydrological process.",
  -0.028624113649129868),
 ('Good mental health can be maintained by regular exercise and proper sleep.',
  -0.031198585405945778)]
  """
  ## COMPLETE


  # average embeddings for the doc
  doc_to_id = {s:i for i,s in enumerate(documents)}
  id_to_doc = documents

  # vec doc in OUT projection
  doc_embeddings = torch.stack([sentence_embedding(sent, out_embeddings) for sent in documents])
  # 10 x 200

  query_token_vecs = torch.stack(text_to_vec(query, in_embeddings, in_token_to_id))
  # T x 200

  scores_per_token = query_token_vecs @ doc_embeddings.T # Tx10
  # 10 -> cos(q, D)

  scores = scores_per_token.mean(axis=0)

  ordered_scores, ordered_ids = torch.topk(scores, k= topk)

  return list(map(lambda x: (id_to_doc[x[0].item()], x[1].item()), zip(ordered_ids, ordered_scores)))



In [None]:
desm("How does the water cycle work?", documents) # does it help?


[('The water cycle consists of evaporation, condensation, and precipitation.',
  -0.0023205664474517107),
 ("Cloud formation is a key aspect of the earth's hydrological process.",
  -0.02862410619854927),
 ('Good mental health can be maintained by regular exercise and proper sleep.',
  -0.03119858168065548)]

In [None]:
desm("What is the history of the Olympic Games?", documents)

[('The Olympic Games originated in ancient Greece and have evolved over centuries.',
  -0.014345861971378326),
 ('The water cycle consists of evaporation, condensation, and precipitation.',
  -0.023791415616869926),
 ("Cloud formation is a key aspect of the earth's hydrological process.",
  -0.023795852437615395)]