# Task 1: Evaluation of the word2vec-google-news-300 Pre-trained Model

In this first experiment, you will use the pre-trained Word2Vec model word2vec-google-news-300 to
find the closest synonym for each question-word in the dataset. First, use gensim.downloader.load to load
the word2vec-google-news-300 pre-trained embedding model. Then use Gensim's similarity method to
compute the cosine similarity between two embeddings (two vectors), and choose as the closest synonym
the guess-word whose embedding is most similar to that of the question-word.
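
To make the quantity being compared concrete, here is a minimal sketch of cosine similarity in plain Python. The function name `cosine_similarity` and the three-dimensional `toy_vectors` are illustrative assumptions with made-up values; the real embeddings have 300 dimensions and come from the model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy "embeddings" with made-up values (assumption: not taken from the model)
toy_vectors = {
    "big": [1.0, 0.0, 0.0],
    "large": [0.9, 0.1, 0.0],
    "tiny": [-1.0, 0.0, 0.0],
}

# Score each candidate against the question-word "big"
for candidate in ("large", "tiny"):
    score = cosine_similarity(toy_vectors["big"], toy_vectors[candidate])
    print(candidate, round(score, 3))
```

The score lies between -1 and 1, so a candidate can have a negative similarity to the question-word and still be the best available guess.
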
The output of this task should be stored in 2 files:
1. In a file called <model name>-details.csv, for each question in the Synonym Test dataset, in a single line:
    * (a) the question-word, a comma,
    * (b) the correct answer-word, a comma,
    * (c) your system’s guess-word, a comma,
    * (d) one of 3 possible labels:
        * the label guess, if either the question-word or all four guess-words (or all 5 words) were not found in the embedding model (so if the question-word was present in the model, and at least 1 guess-word was present also, you should not use this label).
        * the label correct, if the question-word and at least 1 guess-word were present in the model, and the guess-word was correct.
        * the label wrong, if the question-word and at least 1 guess-word were present in the model, and the guess-word was not correct.
    
For example, the file word2vec-google-news-300-details.csv could contain:
    
enormously,tremendously,uniquely,wrong  
provisions,stipulations,stipulations,correct  
...
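
The three labels described above amount to a small decision rule. The following is a minimal sketch of that rule, assuming a hypothetical helper `label_for` and a plain Python set (`toy_vocabulary`) standing in for the embedding model's vocabulary; neither name comes from the assignment.

```python
def label_for(question, answer, guess, candidates, vocabulary):
    """Return 'guess', 'correct' or 'wrong' following the rule in the task text."""
    question_found = question in vocabulary
    any_candidate_found = any(c in vocabulary for c in candidates)
    # No real comparison was possible, so the system could only guess
    if not question_found or not any_candidate_found:
        return "guess"
    return "correct" if guess == answer else "wrong"


# Toy vocabulary (assumption: a stand-in for the words known to the model)
toy_vocabulary = {"provisions", "stipulations", "interrelations"}
candidates = ["stipulations", "interrelations", "jurisdictions", "interpretations"]

print(label_for("provisions", "stipulations", "stipulations", candidates, toy_vocabulary))
print(label_for("zenith", "pinnacle", "", ["completion", "pinnacle"], toy_vocabulary))
```
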

In [1]:
from gensim import downloader
import pandas as pd

In [2]:
# Download the Word2Vec model
model = downloader.load("word2vec-google-news-300")

In [3]:
# Load the file
df = pd.read_csv('synonyms.csv')
print(df)

       question        answer              0               1              2  \
0    enormously  tremendously  appropriately        uniquely   tremendously   
1    provisions  stipulations   stipulations  interrelations  jurisdictions   
2   haphazardly      randomly    dangerously         densely       randomly   
3     prominent   conspicuous       battered         ancient     mysterious   
4        zenith      pinnacle     completion        pinnacle         outset   
..          ...           ...            ...             ...            ...   
75      fashion        manner         ration          fathom          craze   
76     marketed          sold         frozen            sold      sweetened   
77       bigger        larger       steadier          closer         larger   
78        roots       origins        origins         rituals           cure   
79     normally    ordinarily      haltingly      ordinarily    permanently   

                  3  

In [6]:
# Create the output file
fs = open('word2vec-google-news-300-details.csv', 'w')
fs.write('question,answer,guess,label\n')
fs.close()

In [7]:
fs = open('word2vec-google-news-300-details.csv', 'a')

CORRECT_LABEL = 0
ANSWERED_QUESTIONS = 0

for _, row in df.iterrows():
    question = row['question']
    answer = row['answer']
    guesses = [row['0'], row['1'], row['2'], row['3']]
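    # Start below any possible cosine similarity (which ranges from -1 to 1),
    # so that a candidate with a negative score can still be selected
    best_guess = (float('-inf'), '')
    for guess in guesses:
        try:
            sim = model.similarity(question, guess)
            # Keep the candidate with the highest similarity seen so far
            if sim > best_guess[0]:
                best_guess = (sim, guess)
        except KeyError:
            # The question-word or this guess-word is not in the model's vocabulary
            pass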
        
    if best_guess[1] != '':
        ANSWERED_QUESTIONS += 1
        # If the guess is correct
        if best_guess[1] == answer:
            label = 'correct'
            CORRECT_LABEL += 1
        # If the guess is wrong
        else:
            label = 'wrong'
    else:
        label = 'guess'
            
    
    fs.write(','.join([question, answer, best_guess[1], label]) + '\n')
                    
fs.close()
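
The selection step in the loop above can also be written with explicit vocabulary checks instead of exception handling. This is only a sketch: `pick_best_guess`, `toy_similarity`, `toy_scores` and `toy_vocab` are hypothetical names, and the scores are made up rather than produced by any model.

```python
def pick_best_guess(question, candidates, similarity, vocabulary):
    """Return the in-vocabulary candidate most similar to `question`, or None.

    `similarity` is any callable (word, word) -> float and `vocabulary` is any
    container supporting `in`; both are stand-ins for the real model.
    """
    if question not in vocabulary:
        return None
    known = [c for c in candidates if c in vocabulary]
    if not known:
        return None
    # max() compares raw scores, so a candidate wins even if every score is negative
    return max(known, key=lambda c: similarity(question, c))


# Toy similarity table with made-up scores (assumption: not real model output)
toy_scores = {("roots", "origins"): -0.1, ("roots", "rituals"): -0.4}
toy_vocab = {"roots", "origins", "rituals"}


def toy_similarity(a, b):
    return toy_scores[(a, b)]


print(pick_best_guess("roots", ["origins", "rituals", "cure"], toy_similarity, toy_vocab))
```

A return value of None corresponds to the guess label. With the loaded embeddings, one would presumably pass `model.similarity` and `model` itself, assuming a Gensim version in which `word in model` is supported.
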

In [8]:
fs = open('analysis.csv', 'w')

model_name = 'word2vec-google-news-300'
size_of_vocabulary = len(model)
correct_label = CORRECT_LABEL
answered_questions = ANSWERED_QUESTIONS
# Avoid a ZeroDivisionError if no question could be answered
accuracy = correct_label / answered_questions if answered_questions else 0.0

# Write the analysis to file: model_name, size of voc, correct label, answered questions, accuracy
fs.write(','.join([model_name, str(size_of_vocabulary), str(correct_label), str(answered_questions), str(accuracy)]))

fs.close()
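
As an alternative to joining fields by hand, the same row could be produced with the standard-library `csv` module, which also quotes any field that happens to contain a comma. This is a sketch under assumptions: the helper `analysis_row` is hypothetical, the values are toy numbers, and reporting an accuracy of 0.0 when no question was answered is a choice made here, not something the task specifies.

```python
import csv
import io


def analysis_row(model_name, vocab_size, correct, answered):
    """Build one analysis row: name, vocabulary size, C, V and accuracy C/V."""
    # Assumption: report 0.0 rather than failing when nothing was answered
    accuracy = correct / answered if answered else 0.0
    return [model_name, vocab_size, correct, answered, accuracy]


# Write to an in-memory buffer here; a real run would open 'analysis.csv' instead
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(analysis_row("toy-model", 3, 1, 2))
print(buffer.getvalue(), end="")
```
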