![MSE Logo](https://moodle.msengineering.ch/pluginfile.php/1/core_admin/logo/0x150/1643104191/logo-mse.png)

# AdvNLP Lab 2: Testing a pretrained word2vec model on analogy tasks

**Objectives:**  experiment with *word vectors* from word2vec: test them on analogy tasks; use *accuracy and MRR* scores.

**Useful documentation:** the [section on KeyedVectors in Gensim](https://radimrehurek.com/gensim/models/keyedvectors.html) and possibly the [section on word2vec](https://radimrehurek.com/gensim/models/word2vec.html).

## 1. Word2vec model trained on Google News
**1a.** Please install the latest version of Gensim, preferably in a Conda environment. 

In [None]:
# !pip install --upgrade gensim
# You can run the following verification:
!pip show gensim

In [None]:
import gensim, os, random
from gensim import downloader
from gensim.test.utils import datapath
from gensim.models import KeyedVectors
from gensim import utils
# help(gensim.models.word2vec) # take a look if needed
import time
import itertools

In [63]:
help(gensim.models.word2vec)

Help on module gensim.models.word2vec in gensim.models:

NAME
    gensim.models.word2vec

DESCRIPTION
    Introduction

    This module implements the word2vec family of algorithms, using highly optimized C routines,
    data streaming and Pythonic interfaces.

    The word2vec algorithms include skip-gram and CBOW models, using either
    hierarchical softmax or negative sampling: `Tomas Mikolov et al: Efficient Estimation of Word Representations
    in Vector Space <https://arxiv.org/pdf/1301.3781.pdf>`_, `Tomas Mikolov et al: Distributed Representations of Words
    and Phrases and their Compositionality <https://arxiv.org/abs/1310.4546>`_.

    Other embeddings

    There are more ways to train word vectors in Gensim than just Word2Vec.
    See also :class:`~gensim.models.doc2vec.Doc2Vec`, :class:`~gensim.models.fasttext.FastText`.

    The training algorithms were originally ported from the C package https://code.google.com/p/word2vec/
    and extended with additional functionalit

**1b.** Please download from Gensim the `word2vec-google-news-300` model, upon your first use.  Then, please write code to answer the following questions:
* Where is the model stored on your computer and what is the file name?  You can store the absolute path in a variable called `path_to_model_file`.
* What is the size of the corresponding file?  Please display the size in gigabytes with two decimals.

In [64]:
# Download the model from Gensim (needed only the first time)
gensim.downloader.load("word2vec-google-news-300")
# No need to store the returned value (uses a lot of memory).

KeyboardInterrupt: 

In [21]:
# Please write your Python code below and execute it.
path_to_model_file="C:/Users/quent/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz"
print(path_to_model_file)
print(round(os.path.getsize(path_to_model_file)/(1024**3),2))


C:/Users/quent/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz
1.62


**1c.** Please load the word2vec model as an instance of the class `KeyedVectors`, and store it in a variable called `wv_model`. 
What is, at this point, the memory size of the process corresponding to this notebook?  Simply write the value you obtain from any OS-specific utility that you like.

In [67]:
# Please write your Python code below and execute it.  Write the memory size on a commented line.
wv_model = KeyedVectors.load_word2vec_format(path_to_model_file, binary=True)
# process memory usage : 3.5GB

**1d.** Please write the instructions that generate the answers to the following questions.
* What is the size of the vocabulary of the `wv_model` model?  
* What is the dimensionality of each word vector?  
* What is the word corresponding to the vector in position 1234?  
* What are the first 10 coefficients of the word vector for the word *pyramid*?  

In [68]:
# Please write your Python code below and execute it.
# Please write your Python code below and execute it.
print("vocab size",wv_model.vectors.shape[0])
print("size of word vector",wv_model.vectors.shape[1])
print(wv_model.index_to_key[1234])
print(wv_model.most_similar("pyramid"))

vocab size 3000000
size of word vector 300
learn


MemoryError: Unable to allocate 3.35 GiB for an array with shape (3000000, 300) and data type float32

## 2. Solving analogies using word2vec trained on Google News
In this section, you are going to use word vectors to solve analogy tasks provided with Gensim, such as "What is to France what Rome is to Italy?".  The predefined function in Gensim that evaluates a model on this task does not provide enough details, so you will need to make modifications to it.

**2a.** The analogy tasks are stored in a text file called `questions-words.txt` which is typically found in `C:\Users\YourNameHere\.conda\envs\YourEnvNameHere\Lib\site-packages\gensim\test\test_data`.  You can access it from here with Gensim as `datapath('questions-words.txt')`.  

Please create a file called `questions-words-100.txt` with the first 100 lines from the original file.  Please run the evaluation task on this file, using the [documentation of the KeyedVectors class](https://radimrehurek.com/gensim/models/keyedvectors.html), then answer the following questions:
* How many analogy tasks are there in your `questions-words-100.txt` file?
* How many analogies were solved correctly and how many incorrectly?
* What is the accuracy returned by `evaluate_word_analogies`?
* How much time did it take to solve the analogies?

In [50]:
# Please write your Python code below and execute it.
directory ="C:/Users/quent/Documents/master/1er/nlp/pythonProject/.venv/Lib/site-packages/gensim/test/test_data"
output = directory+"/questions-words-100.txt"
input = directory+"/questions-words.txt"
if not os.path.exists(directory+"/"+"questions-words-100.txt""questions-words-100.txt"):
    with open(input, "r") as infile, open(output, "w") as outfile:
       outfile.writelines(itertools.islice(infile, 100))


KeyError: "Key 'Patrick_Nyarko' not present"

In [62]:
analogy_scores = (wv_model.evaluate_word_analogies(datapath('questions-words.txt'),dummy4unknown=True))


KeyError: "Key 'Patrick_Nyarko' not present"

**2b.** Please answer in writing the following questions:
* What is the meaning of the first line of `questions-words-100.txt`?
* How many analogies are there in the original `questions-words.txt`?
* How much time would it take to solve the original set of analogies?

In [None]:
# Please write your answers here.


**2c.** The built-in function from Gensim has several weaknesses, which you will address here.  Please copy the source code of the function `evaluate_word_analogies` from the file `gensim\models\keyedvectors.py` and create here a new function which will improve the built-in one as follows.  The function will be called `my_evaluate_word_analogies` and you will also pass it the model as the first argument.  Overall, please proceed gradually and only make minimal modifications, to ensure you don't break the function.  It is important to first understand the structure of the result, `analogies_scores` and `sections`. 

* Modify the line where `section[incorrect]` is assembled in order to also add to each analogy the *incorrect guess* (i.e. what the model thought was the good answer, but got it wrong).

* Modify the code so that when `section[incorrect]` is assembled, you also add the *rank of the correct answer* among the candidates returned by the system (after the incorrect guess).  If the correct answer is not present at all, then code the rank as 0.

In [None]:
def my_evaluate_word_analogies(model, analogies, restrict_vocab=300000, case_insensitive=True):


**2d.** Please run the `my_evaluate_word_analogies` function on `questions-words-100.txt` and then write instructions to display, from the results stored in `analogy_scores`:
* one incorrectly-solved analogy (selected at random), including also the error made by the model and the rank of the correct answer, thus adding:
  - a fifth word, which is the incorrect one found by the model
  - a sixth term, which is the integer indicating the rank (or 0)
* one correctly-solved analogy selected at random (in principle, four terms).

In [None]:
# Please write your Python code below and execute it.


**2e.** Please write a function to compute the MRR score given a structure with correctly and incorrectly solved analogies, such as the one that is found in the results from `evaluate_word_analogies`.  The structure is not divided into categories.

The Mean Reciprocal Rank (please use the [formula here](https://en.wikipedia.org/wiki/Mean_reciprocal_rank)) gives some credit for incorrectly solved analogies, in inverse proportion to the rank of the correct answer among the candidates.  This rank is 1 for correctly solved analogies (full credit), and 1/k (or 0) for incorrectly solved ones.

In [None]:
# Please define here the function that computes MRR from the information stored in analogy_scores
def myMRR(analogies):


In [None]:
# Please test your MRR function by running the following code, which  displays the total number of analogy tasks, 
# the number of different categories (sections), the accuracy of the results (total number of correctly 
# solved analogies), and the MRR score of the results:
print("Total number of analogies:",  # The last dictionary is the total
      len(analogy_scores[1][-1]['correct']) + 
      len(analogy_scores[1][-1]['incorrect']))
print("Total number of categories:", len(analogy_scores[1]) - 1) # the "total" is excluded 
print(f"Overall accuracy: {analogy_scores[0]:.2f} and MRR: {myMRR(analogy_scores[1][-1]):.2f}")

**2f.** When you have some time, please compute the accuracy and MRR and the total time for the entire `questions-words.txt` file.  Is the timing compatible with your estimate from (2b)?  What do you think about the difference between accuracy and MRR? 

In [None]:
# Please write your Python code below and execute it.


In [None]:
# Please write you answer here.

## End of AdvNLP Lab 2
Please make sure all cells have been executed, save this completed notebook, and upload it to Moodle.