<a href="https://colab.research.google.com/github/tutsilianna/Automatic_Text_Processing_and_Image_Processing/blob/main/Vector%20Semantics/Task_5_%7C_Vector_Semantics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Basics of word2vec


## Download the model
Download the <code>google-news-vectors</code> model and open it using the <code>gensim</code> library.

In [1]:
! pip install -q -U gensim
! pip install -q opendatasets
import opendatasets as od
import pandas

od.download("https://www.kaggle.com/datasets/leadbest/googlenewsvectorsnegative300")

# For a fast download, enter your Kaggle username and API key when prompted;
# alternatively, remove the od.download(...) line and upload the file to Colab manually (slow)

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: davydovakristina
Your Kaggle Key: ··········
Dataset URL: https://www.kaggle.com/datasets/leadbest/googlenewsvectorsnegative300
Downloading googlenewsvectorsnegative300.zip to ./googlenewsvectorsnegative300


100%|██████████| 3.17G/3.17G [01:47<00:00, 31.8MB/s]





In [2]:
import warnings
warnings.filterwarnings('ignore')

import gensim
from gensim.models import KeyedVectors

w = KeyedVectors.load_word2vec_format("/content/googlenewsvectorsnegative300/GoogleNews-vectors-negative300.bin",
                                      binary=True)

The structure is called <code>KeyedVectors</code>; in essence, it is a mapping between keys and vectors. Each vector is identified by its lookup key, most often a short string token, so it is normally a correspondence between

<center><code>{str => 1D numpy array}</code></center><br/>



For example, let's display the first 10 coordinates of the vector corresponding to the word <code>sunrise</code>:

In [3]:
print(gensim.__version__)
print("Vector size: ", w["sunrise"].shape)
print("The first 10 coordinates of a vector: \n", w["sunrise"][:10])

4.3.2
Vector size:  (300,)
The first 10 coordinates of a vector: 
 [-0.22558594 -0.03540039 -0.21679688  0.03613281 -0.2265625  -0.09814453
  0.109375   -0.34570312  0.18652344  0.01806641]


## Task 1. Similarity.

Build vectors for the words <code>London</code>, <code>England</code>, <code>Moscow</code>. Compute the cosine distance between the words <code>London</code> and <code>England</code> and between the words <code>Moscow</code> and <code>England</code>. In which pair are the words more similar to each other?

Hint: to compute cosine distance use the <code>distance()</code> method. The correct answer is presented in the outputs.
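As a rough sketch of the quantity being computed, the cosine distance is one minus the cosine of the angle between two vectors, so a smaller value means the words are more similar. The helper name `cosine_distance` and the 2-D vectors below are assumptions made up for illustration; they are not part of gensim or of the model.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Made-up 2-D vectors, for illustration only.
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors: far apart
print(cosine_distance([1.0, 2.0], [2.0, 4.0]))  # parallel vectors: close together
```

Because the distance depends only on direction, rescaling a vector does not change the result.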

In [6]:
from scipy.spatial import distance as ds

print(ds.cosine(w['London'], w['England']), ds.cosine(w['Moscow'], w['England']))
print(0.5600714385509491, 0.8476868271827698)

0.5600714087486267 0.8476868271827698
0.5600714385509491 0.8476868271827698


In [7]:
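import math
# Exact float equality is brittle across library versions, so compare with a tolerance.
assert math.isclose(ds.cosine(w['Moscow'], w['England']), 0.8476868271827698, rel_tol=1e-6)
assert math.isclose(ds.cosine(w['London'], w['England']), 0.5600714385509491, rel_tol=1e-6)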

In [8]:
# A smaller cosine distance means the words are more similar.
answer = "Moscow & England" if ds.cosine(w['Moscow'], w['England']) < ds.cosine(w['London'], w['England']) else "London & England"
print("In which pair are the words more similar to each other?", answer)

In which pair are the words more similar to each other? London & England


## Task 2. Analogies.
Using the <code>most_similar</code> method, solve the analogy
```London : England = Moscow : X```

The correct answer is in the outputs.

(Hint: use the following arguments: positive and negative)
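The idea behind the positive and negative arguments can be sketched with vector arithmetic: add the "positive" vectors, subtract the "negative" ones, and pick the remaining word closest to the result. This is a simplified sketch, not gensim's actual implementation; the function `solve_analogy` and the 4-D vectors are hypothetical and hand-made so that the arithmetic is easy to follow.

```python
import math

# Made-up 4-D vectors: [UK, Russia, is_capital, is_country]. Illustration only.
toy = {
    "London":  [1.0, 0.0, 1.0, 0.0],
    "England": [1.0, 0.0, 0.0, 1.0],
    "Moscow":  [0.0, 1.0, 1.0, 0.0],
    "Russia":  [0.0, 1.0, 0.0, 1.0],
    "France":  [0.0, 0.0, 0.0, 1.0],
}

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def solve_analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # The query words themselves are excluded from the candidates.
    candidates = [c for c in vectors if c not in positive and c not in negative]
    return max(candidates, key=lambda c: cosine_similarity(vectors[c], target))

print(solve_analogy(["Moscow", "England"], ["London"], toy))
```

Here `England - London + Moscow` lands exactly on the toy `Russia` vector; with real embeddings the match is only approximate, which is why a ranked list of neighbours is returned.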

In [9]:
result = w.most_similar(positive=['Moscow', 'England'], negative=['London'])

print(f"London : England = Moscow : {result[0][0]}")

London : England = Moscow : Russia


In [10]:
result

[('Russia', 0.6502718329429626),
 ('Ukraine', 0.5879061818122864),
 ('Belarus', 0.5666376352310181),
 ('Azerbaijan', 0.5418694615364075),
 ('Armenia', 0.5300518870353699),
 ('Poland', 0.5253247618675232),
 ('coach_Georgy_Yartsev', 0.5220180749893188),
 ('Russian', 0.5214669108390808),
 ('Croatia', 0.5166040658950806),
 ('Moldova', 0.5125792026519775)]

In [11]:
[('Russia', 0.6502717733383179),
 ('Ukraine', 0.5879061818122864),
 ('Belarus', 0.5666375756263733),
 ('Azerbaijan', 0.5418694019317627),
 ('Armenia', 0.5300518870353699),
 ('Poland', 0.525324821472168),
 ('coach_Georgy_Yartsev', 0.5220180749893188),
 ('Russian', 0.5214669108390808),
 ('Croatia', 0.5166041851043701),
 ('Moldova', 0.5125792026519775)]

[('Russia', 0.6502717733383179),
 ('Ukraine', 0.5879061818122864),
 ('Belarus', 0.5666375756263733),
 ('Azerbaijan', 0.5418694019317627),
 ('Armenia', 0.5300518870353699),
 ('Poland', 0.525324821472168),
 ('coach_Georgy_Yartsev', 0.5220180749893188),
 ('Russian', 0.5214669108390808),
 ('Croatia', 0.5166041851043701),
 ('Moldova', 0.5125792026519775)]

In [None]:
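expected = [('Russia', 0.6502717733383179), ('Ukraine', 0.5879061818122864), ('Belarus', 0.5666375756263733), ('Azerbaijan', 0.5418694019317627), ('Armenia', 0.5300518870353699), ('Poland', 0.525324821472168), ('coach_Georgy_Yartsev', 0.5220180749893188), ('Russian', 0.5214669108390808), ('Croatia', 0.5166041851043701), ('Moldova', 0.5125792026519775)]
# Words must match exactly; scores are compared with a tolerance because floats differ slightly across library versions.
assert [k for k, _ in result] == [k for k, _ in expected] and all(abs(s - e) < 1e-6 for (_, s), (_, e) in zip(result, expected))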

## Task 3. Similarity: find the odd-one-out word.
Using the <code>doesnt_match</code> method, find the odd-one-out word in the string <code>breakfast cereal dinner lunch</code>.

The correct answer is in the outputs.
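One way to think about the odd-one-out search is: normalize each word vector to unit length, average them, and return the word least aligned with that average. The sketch below follows this idea under stated assumptions; `odd_one_out` is a hypothetical helper, the 2-D vectors are invented, and the library's own method may differ in details.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least aligned with the group mean."""
    units = {word: unit(vectors[word]) for word in words}
    dim = len(next(iter(units.values())))
    mean = [sum(u[i] for u in units.values()) / len(units) for i in range(dim)]
    return min(words, key=lambda word: sum(a * b for a, b in zip(units[word], mean)))

# Made-up 2-D vectors: three "meal" words point one way, "cereal" another.
toy = {
    "breakfast": [1.0, 0.1],
    "lunch": [0.9, 0.2],
    "dinner": [0.95, 0.15],
    "cereal": [0.2, 1.0],
}

print(odd_one_out("breakfast cereal dinner lunch".split(), toy))
```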

In [13]:
assert "odd-one-out word:  " + w.doesnt_match("breakfast cereal dinner lunch".split(' ')) == "odd-one-out word:  cereal"

## Task 4. Sentence vector representation


Given the sentence <code>the quick brown fox jumps over the lazy dog</code>, represent it as a single vector: look up the vector for each word in the model, then average the vectors component-wise.
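Component-wise averaging can be sketched without any library: the i-th coordinate of the sentence vector is the mean of the i-th coordinates of the word vectors, which is what `numpy.mean(vectors, axis=0)` computes for a list of equal-length vectors. The helper `mean_vector` and the 2-D toy vectors are assumptions for illustration.

```python
def mean_vector(vectors):
    """Average equal-length vectors component-wise."""
    count = len(vectors)
    return [sum(components) / count for components in zip(*vectors)]

# Made-up 2-D word vectors, for illustration only.
toy = {"quick": [1.0, 0.0], "brown": [0.0, 1.0], "fox": [2.0, 2.0]}
sentence = "quick brown fox".split()

sentence_vector = mean_vector([toy[word] for word in sentence])
print(sentence_vector)
```

The result has the same dimensionality as a single word vector, regardless of sentence length.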


In [14]:
import numpy as np

text = "the quick brown fox jumps over the lazy dog".split()

vectors = [w[word] for word in text]

assert f"First 5 coordinates of a sentence-vector: {np.mean(vectors, axis=0)[:5]}" == "First 5 coordinates of a sentence-vector: [ 0.09055582  0.05434163 -0.06713867  0.10968696 -0.01060655]"

# Two models comparison

## Download one more model


Let's load both the google-news-vectors model and a model trained on the British National Corpus (http://vectors.nlpl.eu/repository/20/0.zip) using gensim.


In [15]:
! wget -c http://vectors.nlpl.eu/repository/20/0.zip
! unzip 0.zip
! head -3 model.txt

--2024-05-05 07:16:40--  http://vectors.nlpl.eu/repository/20/0.zip
Resolving vectors.nlpl.eu (vectors.nlpl.eu)... 129.240.189.181
Connecting to vectors.nlpl.eu (vectors.nlpl.eu)|129.240.189.181|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 344050746 (328M) [application/zip]
Saving to: ‘0.zip’


2024-05-05 07:17:14 (10.0 MB/s) - ‘0.zip’ saved [344050746/344050746]

Archive:  0.zip
  inflating: meta.json               
  inflating: model.bin               
  inflating: model.txt               
  inflating: README                  
163473 300
say_VERB -0.008861 0.097097 0.100236 0.070044 -0.079279 0.000923 -0.012829 0.064301 -0.029405 -0.009858 -0.017753 0.063115 0.033623 0.019805 0.052704 -0.100458 0.089387 -0.040792 -0.088936 0.110212 -0.044749 0.077675 -0.017062 -0.063745 -0.009502 -0.079371 0.066952 -0.070209 0.063761 -0.038194 -0.046252 0.049983 -0.094985 -0.086341 0.024665 -0.112857 -0.038358 -0.007008 -0.010063 -0.000183 0.068841 0.024942 -0.042561 -0.04

Let's load the model trained on the British National Corpus.

In [16]:
w_british = KeyedVectors.load_word2vec_format("model.bin", binary=True)

Note that the vector size is also 300 in this case. Specify the part of speech of the word of interest by appending it after an underscore, as in <code>say_VERB</code>. All words should be lowercased.
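A small helper can make the key convention explicit. The function name `bnc_key` is hypothetical; it only builds the lookup string and does not touch the model.

```python
def bnc_key(word, pos="NOUN"):
    """Build a lookup key of the form '<lowercased word>_<POS tag>'."""
    return f"{word.lower()}_{pos}"

print(bnc_key("London"))
print(bnc_key("say", pos="VERB"))
```

Lowercasing inside the helper keeps capitalized input such as `London` from producing a key that is missing from the model.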

In [17]:
try:
    print(w_british["London_NOUN"].shape)
    print('upper is ok')
except KeyError:
    print(w_british["london_NOUN"].shape)
    print('lower is ok')

(300,)
lower is ok


## The dataset for the quality evaluation
Let's download the wordsim353 dataset.



In [18]:
! wget -c http://alfonseca.org/pubs/ws353simrel.tar.gz
! tar -xvf ws353simrel.tar.gz
! head -5 wordsim353_sim_rel/wordsim_similarity_goldstandard.txt

--2024-05-05 07:17:41--  http://alfonseca.org/pubs/ws353simrel.tar.gz
Resolving alfonseca.org (alfonseca.org)... 162.215.249.67
Connecting to alfonseca.org (alfonseca.org)|162.215.249.67|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5460 (5.3K) [application/x-gzip]
Saving to: ‘ws353simrel.tar.gz’


2024-05-05 07:17:42 (280 MB/s) - ‘ws353simrel.tar.gz’ saved [5460/5460]

wordsim353_sim_rel/wordsim353_agreed.txt
wordsim353_sim_rel/wordsim353_annotator1.txt
wordsim353_sim_rel/wordsim353_annotator2.txt
wordsim353_sim_rel/wordsim_relatedness_goldstandard.txt
wordsim353_sim_rel/wordsim_similarity_goldstandard.txt
tiger	cat	7.35
tiger	tiger	10.00
plane	car	5.77
train	car	6.31
television	radio	6.77


## Testing dataset preparation


Let's extract word pairs from the file `wordsim_similarity_goldstandard.txt` and compute the cosine similarity of their vectors in each model. Compute the correlation between the similarity estimates of the google-news-vectors model and the human ratings of the wordsim dataset, and then the correlation between the similarity estimates of the model based on the British National Corpus and the human ratings. Which model is closer to the human ratings?

(Use only those words from the wordsim dataset that have corresponding vectors labeled as NOUNs in the British National Corpus model!)

In [19]:
import pandas as pd

df = pd.read_csv("wordsim353_sim_rel/wordsim_similarity_goldstandard.txt",
                 sep="\t", header=None)
df.columns = ["first", "second", "score"]
df.head(3)

   first second  score
0  tiger    cat   7.35
1  tiger  tiger  10.00
2  plane    car   5.77


## Model similarity evaluation
We use only those words from the wordsim dataset that have corresponding vectors labeled as nouns in the British National Corpus model, and build 3 sets of similarity measures:

1. Measures (cosine between vectors), obtained for the google-news-vectors model

2. Measures (cosine between vectors), obtained for the model based on the British national corpus

3. Human ratings from word_sim for the words, having the corresponding vectors in the British national corpus

The words skipped from word_sim are listed in the outputs.
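The filtering rule can also be expressed as an explicit membership check before any similarity is computed. In the sketch below, plain sets stand in for the two models' vocabularies, and `keep_pair`, the vocabularies, and the rows are all made up for illustration; with real models, a `key in model` test on a `KeyedVectors` object is assumed to play the same role.

```python
# Hypothetical vocabularies standing in for the two models' key sets.
br_vocab = {"tiger_NOUN", "cat_NOUN", "plane_NOUN", "car_NOUN"}
gn_vocab = {"tiger", "cat", "plane", "car", "mexico"}

# Toy rows in the (first, second, score) layout; the last score is invented.
pairs = [("tiger", "cat", 7.35), ("plane", "car", 5.77), ("Mexico", "car", 1.0)]

def keep_pair(first, second, br_keys, gn_keys):
    """True if both words exist in the GN keys and, as nouns, in the BNC keys."""
    words = (first.lower(), second.lower())
    return all(f"{word}_NOUN" in br_keys and word in gn_keys for word in words)

kept = [pair for pair in pairs if keep_pair(pair[0], pair[1], br_vocab, gn_vocab)]
skipped = len(pairs) - len(kept)
print(kept)
print("skipped:", skipped)
```

Checking both vocabularies up front guarantees that the three result lists end up with the same length.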

In [42]:
from scipy.spatial.distance import cosine

gn_dist, br_dist, scores = [], [], []

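for _, row in df.iterrows():
    w1, w2 = row["first"].lower(), row["second"].lower()
    try:
        # Compute both similarities first so the lists stay aligned if a key is missing.
        # distance() returns the cosine distance, so 1 - distance is the cosine similarity.
        br_sim = 1 - w_british.distance(w1 + '_NOUN', w2 + '_NOUN')
        gn_sim = 1 - w.distance(w1, w2)
    except KeyError as e:
        print(e, "Skipping this word.")
        continue

    br_dist.append(br_sim)
    gn_dist.append(gn_sim)
    scores.append(row["score"])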

"Key 'stupid_NOUN' not present" Skipping this word.
"Key 'arafat_NOUN' not present" Skipping this word.
"Key 'harvard_NOUN' not present" Skipping this word.
"Key 'mexico_NOUN' not present" Skipping this word.
"Key 'live_NOUN' not present" Skipping this word.
"Key 'seven_NOUN' not present" Skipping this word.
"Key 'five_NOUN' not present" Skipping this word.
"Key 'mars_NOUN' not present" Skipping this word.


## Model selection: correlation with human ratings

Compute Spearman's correlation between each model and human ratings from word_sim.

The results are in the outputs.
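Spearman's correlation is the Pearson correlation of the ranks, so it rewards a model for ordering the pairs like the annotators did, regardless of the scale of its scores. The following is a minimal pure-Python sketch of that definition; the function names and the four toy scores are assumptions for illustration, and for real data `scipy.stats.spearmanr` is the practical choice.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up model similarities and human ratings for four word pairs.
model_scores = [0.9, 0.2, 0.5, 0.7]
human_scores = [9.0, 1.0, 6.0, 5.0]
print(spearman(model_scores, human_scores))
```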

In [44]:
from scipy.stats import spearmanr

coef_gn, p = spearmanr(gn_dist, scores)
print("gn  Spearman R: ", coef_gn)

coef_br, p = spearmanr(br_dist, scores)
print("br Spearman R: ", coef_br)

gn  Spearman R:  0.7834069205380487
br Spearman R:  0.762755193448961


In [46]:
assert  round(coef_gn,3) == round(0.7817164245392593,3)
assert  round(coef_br,3) == round(0.7627551934489611,3)

AssertionError: 

Note that the google-news-vectors model is slightly better in this case.

# Individual task

1. Compute the cosine distance between the word vectors for `student` and `smart`
    * Enter the result obtained for the GN model
    * Enter the result obtained for the BR model

In [23]:
print(f'GN model: {round(cosine(w["student"], w["smart"]), 3)}')
print(f'BR model: {round(cosine(w_british["student_NOUN"], w_british["smart_NOUN"]), 3)}')

GN model: 0.934
BR model: 0.72


2. For the given set of words `student smart wood money`, find the odd-one-out word.
    * Enter the result obtained for the GN model:
    * Enter the result obtained for the BR model

In [24]:
print(f"GN model odd-one-out word: {w.doesnt_match('student smart wood money'.split(' '))}")
print(f"BR model odd-one-out word: {w_british.doesnt_match('student_NOUN smart_NOUN wood_NOUN money_NOUN'.split(' '))}")

GN model odd-one-out word: wood
BR model odd-one-out word: student_NOUN


3. Find the cosine distance between sentence vectors:

*Disclaimer: the words missing in the GN model are deleted from the original proverbs.*

`journey thousand miles begins with single step`

&

`leopard can not change its spots`

To build the sentence vector, look up the vector for each word in the sentence, and then average these vectors component-wise (we recommend using `numpy.mean()` with the correct `axis` parameter).
    
* Enter the result obtained for the GN model:

In [25]:
text_1 = "journey thousand miles begins with single step".split()
text_2 = "leopard can not change its spots".split()

v1 = np.mean([w[word] for word in text_1], axis=0)
v2 = np.mean([w[word] for word in text_2], axis=0)

round(cosine(v1, v2), 3)

0.672

4. Select the word pairs with indices `19:119` from the word_sim dataset (numbering starts from 0; the right boundary is excluded).

    Use only pairs whose words both have vectors labeled as nouns in the British National Corpus model; otherwise, drop the pair from the subset.

    Compute Spearman's correlation between the similarity measures that each model produces for the selected word pairs and the human ratings in the word_sim dataset.


* Enter the Spearman's correlation coefficient obtained for the GN model
* Enter the Spearman's correlation coefficient obtained for the BR model
* Enter the number of word pairs skipped from the subset
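The `19:119` range follows Python's half-open slicing convention, which positional slicing with `DataFrame.iloc` shares. A quick sketch on a stand-in list of row positions (the list length of 200 is an arbitrary assumption):

```python
rows = list(range(200))   # stand-in for row positions 0..199
subset = rows[19:119]     # half-open slice: 19 is included, 119 is not

print(len(subset), subset[0], subset[-1])
```

The subset therefore contains exactly 100 candidate pairs before any are skipped.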

In [47]:
df1 = df.iloc[19:119]
gn_dist, br_dist, scores = [], [], []
num_skip = 0

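for _, row in df1.iterrows():
    w1, w2 = row["first"].lower(), row["second"].lower()
    try:
        # Compute both similarities first so the lists stay aligned if a key is missing.
        # distance() returns the cosine distance, so 1 - distance is the cosine similarity.
        br_sim = 1 - w_british.distance(w1 + '_NOUN', w2 + '_NOUN')
        gn_sim = 1 - w.distance(w1, w2)
    except KeyError as e:
        num_skip += 1
        print(e, "Skipping this word.")
        continue

    br_dist.append(br_sim)
    gn_dist.append(gn_sim)
    scores.append(row["score"])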

"Key 'arafat_NOUN' not present" Skipping this word.
"Key 'harvard_NOUN' not present" Skipping this word.
"Key 'mexico_NOUN' not present" Skipping this word.


In [48]:
coef_gn, p = spearmanr(gn_dist, scores)
print("GN Spearman R: ", round(coef_gn, 3))

coef_br, p = spearmanr(br_dist, scores)
print("BR Spearman R: ", round(coef_br, 3))

print("Number of the skipped = ", num_skip)

GN Spearman R:  0.693
BR Spearman R:  0.655
Number of the skipped =  3
