# Task 2: Comparison with Other Pre-trained Models

Now that you have obtained results with the word2vec-google-news-300 pre-trained model, you will experiment
with 4 other pre-trained English word2vec models and compare the results. You can use any pre-trained
embeddings that you want, but you must have:

1. 2 new models from different corpora (e.g. Twitter, English Wikipedia dump, ...) but with the same embedding size
(e.g. 25, 100, 300)

2. 2 new models from the same corpus but different embedding sizes

Many pre-trained embeddings are available online (including in Gensim or at http://vectors.nlpl.eu/repository).
For each model that you use, create a new <model name>-details.csv output file and append the results
to the file analysis.csv (see Section 2.2). For example, the file analysis.csv could now contain:

word2vec-google-news-300,3000000,44,78,0.5641025641025641 // from Task 1  
C1-E1,...,...,...,...  
C2-E2,...,...,...,...  
C3-E3,...,...,...,...  
C4-E4,...,...,...,...  
where C1 to C4 refer to the corpora and E1 to E4 refer to their embedding sizes, with C1 ≠ C2 and E1 = E2,
and C3 = C4 and E3 ≠ E4.
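
The constraints above can be checked mechanically before any model is downloaded. The following is a minimal sketch, assuming that model names follow the Gensim pattern `<corpus>-<size>` (for example `glove-twitter-100`) and that everything before the last hyphen can stand in for the corpus. The helper names `split_name` and `satisfies_constraints` are hypothetical and not part of the assignment.

```python
def split_name(model_name):
    """Split a name such as 'glove-twitter-100' into ('glove-twitter', 100)."""
    corpus, _, size = model_name.rpartition('-')
    return corpus, int(size)


def satisfies_constraints(names):
    """Check C1 != C2, E1 == E2, C3 == C4 and E3 != E4 for four model names."""
    (c1, e1), (c2, e2), (c3, e3), (c4, e4) = [split_name(n) for n in names]
    return c1 != c2 and e1 == e2 and c3 == c4 and e3 != e4


# Assumed selection: two corpora at size 100, then one corpus at sizes 25 and 50
names = ['glove-twitter-100', 'glove-wiki-gigaword-100',
         'glove-twitter-25', 'glove-twitter-50']
print(satisfies_constraints(names))
```

Note that the "corpus" string here also contains the algorithm prefix (`glove`), so the check would need adjusting if models from different algorithms were mixed.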

In [1]:
from gensim import downloader
import pandas as pd

In [2]:
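# Names of the 4 models to download
model1 = 'glove-twitter-100'        # different corpus from model2, same size (100)
model2 = 'glove-wiki-gigaword-100'
model3 = 'glove-twitter-25'         # same corpus as model4, different size
model4 = 'glove-twitter-50'
model_names = [model1, model2, model3, model4]
models = {}

# Download each model (or load it from the local Gensim cache)
for model in model_names:
    models[model] = downloader.load(model)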

In [4]:
# Load the file
df = pd.read_csv('synonyms.csv')
print(df.head())

      question        answer              0               1              2  \
0   enormously  tremendously  appropriately        uniquely   tremendously   
1   provisions  stipulations   stipulations  interrelations  jurisdictions   
2  haphazardly      randomly    dangerously         densely       randomly   
3    prominent   conspicuous       battered         ancient     mysterious   
4       zenith      pinnacle     completion        pinnacle         outset   

                  3  

In [11]:
# Create (or overwrite) each model's details file with a header row
for model in model_names:
    with open('{}-details.csv'.format(model), 'w') as fs:
        fs.write('question,answer,guess,label\n')
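
Joining fields with `','` works for the single-word entries in this dataset, but it would produce a malformed row if a field ever contained a comma. As a hedged alternative, the standard-library `csv` module quotes such fields automatically. The sketch below writes to an in-memory buffer rather than a file; the helper name `write_details` and the sample rows are illustrative only.

```python
import csv
import io


def write_details(stream, rows):
    """Write the details header followed by (question, answer, guess, label) rows."""
    writer = csv.writer(stream, lineterminator='\n')
    writer.writerow(['question', 'answer', 'guess', 'label'])
    writer.writerows(rows)


buffer = io.StringIO()
write_details(buffer, [
    ('enormously', 'tremendously', 'tremendously', 'correct'),
    ('a, b', 'c', '', 'guess'),  # made-up row with a comma, to show quoting
])
print(buffer.getvalue())
```

The same function could be passed a file opened with `open(path, 'w', newline='')` in place of the buffer.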

In [12]:
for model in model_names:
    fs = open('{}-details.csv'.format(model), 'a')

    CORRECT_LABEL = 0
    ANSWERED_QUESTIONS = 0

    for _, row in df.iterrows():
        question = row['question']
        answer = row['answer']
        guesses = [row['0'], row['1'], row['2'], row['3']]
        # Start below the minimum cosine similarity (-1) so negative scores can still win
        best_guess = (float('-inf'), '')
        for guess in guesses:
            try:
                sim = models[model].similarity(question, guess)
            except KeyError:
                # The question or this guess is not in the model's vocabulary
                continue
            # Check if the similarity is greater than the current best guess
            if sim > best_guess[0]:
                best_guess = (sim, guess)

        if best_guess[1] != '':
            ANSWERED_QUESTIONS += 1
            # If the guess is correct
            if best_guess[1] == answer:
                label = 'correct'
                CORRECT_LABEL += 1
            # If the guess is wrong
            else:
                label = 'wrong'
        else:
            label = 'guess'


        fs.write(','.join([question, answer, best_guess[1], label]) + '\n')

    fs.close()
    
    # Analysis part
    size_of_vocabulary = len(models[model])
    # Guard against division by zero when the model answered no questions
    accuracy = CORRECT_LABEL / ANSWERED_QUESTIONS if ANSWERED_QUESTIONS else 0.0

    # Append to analysis.csv: model name, vocabulary size, correct labels, answered questions, accuracy
    with open('analysis.csv', 'a') as fs:
        fs.write('\n' + ','.join([model, str(size_of_vocabulary), str(CORRECT_LABEL), str(ANSWERED_QUESTIONS), str(accuracy)]))
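
The per-question selection logic can be isolated into a small function and exercised without downloading a model. The sketch below is one possible variant, not the notebook's exact code: it assumes a `similarity` callable that raises `KeyError` for out-of-vocabulary words (as Gensim's `KeyedVectors.similarity` does), and it starts the search from negative infinity so that a negative similarity can still be selected. The names `pick_best_guess`, `label_for` and `toy_similarity`, and the toy similarity values, are hypothetical.

```python
def pick_best_guess(similarity, question, guesses):
    """Return the guess most similar to `question`, or '' if none can be scored."""
    best_sim, best_word = float('-inf'), ''
    for guess in guesses:
        try:
            sim = similarity(question, guess)
        except KeyError:
            # Question or guess is out of vocabulary: skip this candidate
            continue
        if sim > best_sim:
            best_sim, best_word = sim, guess
    return best_word


def label_for(best_word, answer):
    """Map a selected word to the 'guess', 'correct' or 'wrong' label."""
    if best_word == '':
        return 'guess'
    return 'correct' if best_word == answer else 'wrong'


# Toy similarity table standing in for a real model (values are made up)
toy = {('big', 'large'): 0.9, ('big', 'small'): 0.4, ('cold', 'hot'): -0.2}


def toy_similarity(a, b):
    return toy[(a, b)]  # raises KeyError for unknown pairs


best = pick_best_guess(toy_similarity, 'big', ['small', 'large', 'huge'])
print(best, label_for(best, 'large'))
```

With a real model, `models[model].similarity` could be passed as the first argument in place of `toy_similarity`.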