<a href="https://colab.research.google.com/github/ua-deti-information-retrieval/Neural-IR-hands-on/blob/main/RI_practical_tutorial_3_word2vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RI practical tutorial #3

## Word2Vec

Now that we have explored the usefulness of embeddings, the next question is: how can we learn such useful representations?

Mikolov et al. proposed two algorithms, CBOW (Continuous Bag of Words) and Skip-Gram. Both rest on the same idea: words that occur in similar contexts should end up close together in the embedding space.

CBOW trains a model to predict a word from its surrounding context, while Skip-Gram predicts the surrounding context from the word. In both cases the training signal comes only from word co-occurrence in a corpus, which is enough to capture many semantic relationships between words.
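
To make "surrounding context" concrete, the sketch below slides a fixed-size window over a tokenized sentence and builds the training examples each algorithm would see. The function name `context_pairs`, the window size of 2, and the toy sentence are illustrative assumptions, not part of the tutorial code.

```python
def context_pairs(tokens, window=2):
    """Yield (context, center) pairs for every position in `tokens`.

    `context` holds up to `window` tokens on each side of the center word.
    """
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, center


sentence = "the protein binds to dna".split()

# CBOW: predict the center word from its context.
cbow_examples = list(context_pairs(sentence, window=2))

# Skip-Gram: predict each context word from the center word.
skipgram_examples = [
    (center, ctx_word)
    for context, center in cbow_examples
    for ctx_word in context
]

print(cbow_examples[2])       # (['the', 'protein', 'to', 'dna'], 'binds')
print(skipgram_examples[:2])  # [('the', 'protein'), ('the', 'binds')]
```

The same window produces one CBOW example per position but several Skip-Gram examples, one for each (center, context word) combination.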


## Dependencies

In [1]:
!pip install torch
!git clone https://github.com/ua-deti-information-retrieval/Neural-IR-hands-on.git

Cloning into 'Neural-IR-hands-on'...
remote: Enumerating objects: 25, done.
remote: Counting objects: 100% (25/25), done.
remote: Compressing objects: 100% (23/23), done.
remote: Total 25 (delta 9), reused 8 (delta 1), pack-reused 0
Receiving objects: 100% (25/25), 3.08 MiB | 11.11 MiB/s, done.
Resolving deltas: 100% (9/9), done.


In [3]:
!wget "https://uapt33090-my.sharepoint.com/:u:/g/personal/aleixomatos_ua_pt/EWmWvPAkGq9Eq1RbL4C9uiYBDLEZqg6LRflxkMx52zvX1g?e=kBFkOY&download=1" -O pubmed_2022_tiny.jsonl.gz
!gzip -d pubmed_2022_tiny.jsonl.gz

--2023-11-29 00:31:36--  https://uapt33090-my.sharepoint.com/:u:/g/personal/aleixomatos_ua_pt/EWmWvPAkGq9Eq1RbL4C9uiYBDLEZqg6LRflxkMx52zvX1g?e=kBFkOY&download=1
Resolving uapt33090-my.sharepoint.com (uapt33090-my.sharepoint.com)... 13.107.136.10, 13.107.138.10, 2620:1ec:8f8::10, ...
Connecting to uapt33090-my.sharepoint.com (uapt33090-my.sharepoint.com)|13.107.136.10|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /personal/aleixomatos_ua_pt/Documents/@UA/Aulas/RI/2023/Dossier/Data/pubmed_2022_tiny.jsonl.gz?ga=1 [following]
--2023-11-29 00:31:37--  https://uapt33090-my.sharepoint.com/personal/aleixomatos_ua_pt/Documents/@UA/Aulas/RI/2023/Dossier/Data/pubmed_2022_tiny.jsonl.gz?ga=1
Reusing existing connection to uapt33090-my.sharepoint.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 134376760 (128M) [application/x-gzip]
Saving to: ‘pubmed_2022_tiny.jsonl.gz’


2023-11-29 00:31:41 (35.0 MB/s) - ‘pubmed_2022_tiny.jsonl.gz’ saved [134376760/13

In [2]:
import json
import math
import random
import re
from collections import defaultdict
from itertools import chain

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset, IterableDataset
from tqdm import tqdm


## Hands on!


### Model

![Img cbow vs skip-gram](https://s3-eu-west-1.amazonaws.com/ppreviews-plos-725668748/22001760/preview.jpg)
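
As a starting point, here is a minimal sketch of what a CBOW model could look like in PyTorch: context word ids are embedded, averaged, and projected to vocabulary logits. The class name `CBOWSketch` is hypothetical and its internals are assumptions. Only the constructor arguments (vocabulary size, embedding dimension) and the `linear` attribute name mirror how `CBOW` is used in the Test section of this notebook.

```python
import torch
import torch.nn as nn


class CBOWSketch(nn.Module):
    """Hypothetical CBOW: average the context embeddings, then classify."""

    def __init__(self, vocab_size, emb_dim):
        super().__init__()
        # Input-side lookup table: one vector per vocabulary entry.
        self.embeddings = nn.Embedding(vocab_size, emb_dim)
        # Output projection; its weight has shape (vocab_size, emb_dim).
        self.linear = nn.Linear(emb_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, context_len) integer token ids
        hidden = self.embeddings(context_ids).mean(dim=1)  # (batch, emb_dim)
        return self.linear(hidden)                         # (batch, vocab_size)


model = CBOWSketch(vocab_size=10, emb_dim=4)
context = torch.tensor([[1, 2, 4, 5], [0, 3, 6, 7]])  # two toy contexts
logits = model(context)
print(logits.shape)               # torch.Size([2, 10])
print(model.linear.weight.shape)  # torch.Size([10, 4])
```

In this sketch both `embeddings.weight` and `linear.weight` hold one row per vocabulary entry, so either could in principle serve as the word vectors.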


### Tokenizer
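
One possible shape for a word-level tokenizer is sketched below: it lowercases text, splits on non-alphanumeric characters, keeps words above a frequency threshold, and can be serialized to JSON. The class `ToyTokenizer` and all of its methods are hypothetical. The notebook's own `SimpleTokenizer` may work differently; the only things taken from it are the existence of a `vocab_size` attribute and a JSON-based save format.

```python
import json
import re
from collections import Counter


class ToyTokenizer:
    """Hypothetical word-level tokenizer with a frequency-filtered vocabulary."""

    def __init__(self, vocab=None):
        # Id 0 is reserved for out-of-vocabulary words.
        self.token_to_id = vocab if vocab is not None else {"<unk>": 0}

    @property
    def vocab_size(self):
        return len(self.token_to_id)

    @staticmethod
    def split(text):
        # Lowercase and keep runs of letters and digits.
        return re.findall(r"[a-z0-9]+", text.lower())

    def fit(self, texts, min_count=1):
        counts = Counter(tok for text in texts for tok in self.split(text))
        for tok, n in counts.most_common():
            if n >= min_count and tok not in self.token_to_id:
                self.token_to_id[tok] = len(self.token_to_id)
        return self

    def encode(self, text):
        return [self.token_to_id.get(tok, 0) for tok in self.split(text)]

    def to_json(self):
        return json.dumps(self.token_to_id)

    @classmethod
    def from_json(cls, payload):
        return cls(json.loads(payload))


tok = ToyTokenizer().fit(["DNA encodes protein.", "Protein binds DNA."], min_count=2)
print(tok.vocab_size)                   # 3
print(tok.encode("protein binds rna"))  # [2, 0, 0]

# Round-trip through JSON to check that the vocabulary survives serialization.
restored = ToyTokenizer.from_json(tok.to_json())
print(restored.encode("DNA"))           # [1]
```

Filtering by `min_count` keeps the vocabulary, and therefore the model's output layer, from growing with every rare token in the corpus.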


### DataLoader

### Training loop
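
The sketch below shows the general shape of a CBOW training loop on a toy batch: forward pass, cross-entropy against the center word, backward pass, optimizer step. The model here is a compact stand-in built from `nn.EmbeddingBag` and `nn.Linear`, and the data, learning rate, and step count are arbitrary assumptions chosen only so the example runs in a fraction of a second.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, emb_dim = 6, 8
# Toy batch: each row holds 4 context ids, each target is the center-word id.
contexts = torch.tensor([[0, 1, 3, 4], [1, 2, 4, 5], [0, 2, 3, 5]])
targets = torch.tensor([2, 3, 1])

# EmbeddingBag(mode="mean") averages the context vectors of each row.
model = nn.Sequential(
    nn.EmbeddingBag(vocab_size, emb_dim, mode="mean"),
    nn.Linear(emb_dim, vocab_size),
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)

losses = []
for step in range(50):
    optimizer.zero_grad()
    logits = model(contexts)         # (batch, vocab_size)
    loss = loss_fn(logits, targets)  # softmax cross-entropy over the vocabulary
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

A full softmax over the vocabulary is used here for simplicity. With a realistic vocabulary it becomes expensive, which is why the word2vec papers also describe hierarchical softmax and negative sampling.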

### Test

In [4]:
!wget "https://uapt33090-my.sharepoint.com/:u:/g/personal/tiagomeloalmeida_ua_pt/EXMxihcfxhdJhlx8f8uQew8BxyGcKkbL6qDPRlUZBv4MJQ?e=zeCL5A&download=1" -O cbow_model_new_tok_e2.pt

--2023-11-29 00:58:55--  https://uapt33090-my.sharepoint.com/:u:/g/personal/tiagomeloalmeida_ua_pt/EXMxihcfxhdJhlx8f8uQew8BxyGcKkbL6qDPRlUZBv4MJQ?e=zeCL5A&download=1
Resolving uapt33090-my.sharepoint.com (uapt33090-my.sharepoint.com)... 13.107.136.10, 13.107.138.10, 2620:1ec:8f8::10, ...
Connecting to uapt33090-my.sharepoint.com (uapt33090-my.sharepoint.com)|13.107.136.10|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /personal/tiagomeloalmeida_ua_pt/Documents/cbow_model_new_tok_e2.pt?ga=1 [following]
--2023-11-29 00:58:56--  https://uapt33090-my.sharepoint.com/personal/tiagomeloalmeida_ua_pt/Documents/cbow_model_new_tok_e2.pt?ga=1
Reusing existing connection to uapt33090-my.sharepoint.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 239665833 (229M) [application/octet-stream]
Saving to: ‘cbow_model_new_tok_e2.pt’


2023-11-29 00:59:01 (47.9 MB/s) - ‘cbow_model_new_tok_e2.pt’ saved [239665833/239665833]



In [None]:
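# Load the saved tokenizer; its vocabulary size determines the model's dimensions.
tokenizer = SimpleTokenizer.load("simple_tokenzer_updated.json")

# Rebuild the model with 300-dimensional embeddings and load the trained weights on CPU.
model = CBOW(tokenizer.vocab_size, 300)
model.load_state_dict(torch.load("cbow_model_new_tok_e2.pt", map_location="cpu"))
model.eval()

# Use the weights of the `linear` layer as word vectors and L2-normalize each row,
# so that a dot product between two rows equals their cosine similarity.
emb = model.linear.weight
norm_emb = emb / torch.linalg.norm(emb, ord=2, dim=-1, keepdim=True)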
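
Once the rows are unit-length, the nearest neighbours of a word can be found with a single matrix-vector product followed by a top-k selection. The sketch below illustrates this on random stand-in vectors, so the neighbours it prints are meaningless. The `vocab` list and the `nearest` helper are hypothetical, and mapping real words to row indices would go through the tokenizer, whose lookup API is not shown in this notebook.

```python
import torch

torch.manual_seed(0)

# Toy stand-ins: a 5-word vocabulary with random 8-d vectors.
vocab = ["protein", "dna", "covid", "sars", "cell"]
emb = torch.randn(len(vocab), 8)
norm_emb = emb / torch.linalg.norm(emb, ord=2, dim=-1, keepdim=True)


def nearest(word, k=2):
    """Return the k words most cosine-similar to `word`, excluding itself."""
    idx = vocab.index(word)
    # Rows are unit-length, so a dot product equals cosine similarity.
    sims = norm_emb @ norm_emb[idx]
    sims[idx] = float("-inf")  # mask the query word
    top = torch.topk(sims, k)
    return [
        (vocab[i], round(s, 3))
        for i, s in zip(top.indices.tolist(), top.values.tolist())
    ]


neighbours = nearest("covid")
print(neighbours)
```

With trained embeddings in place of the random matrix, the same lookup is a quick sanity check: biomedical terms should return related biomedical terms.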

In [3]:
"protein", "dna", "covid", "sars"