# Text-based Information Retrieval
## Assignment Part I - Warmup

*Assignment part 1 (10 points out of 100 total)*

>Your task will be to run several analogy solving models with several
different representations on the benchmarking analogy dataset and report your findings. Focus on the
following questions:
1. Is the choice of the analogy model important? Which representations work better with which analogy
models?
2. Is dimensionality of the representation important when using GloVe vectors?
3. What is the computational complexity of the analogy models given the pre-trained vectors?
4. What are the typical errors?


### Practical Info

Background on linguistic regularities in word representations:
http://www.marekrei.com/blog/linguistic-regularities-word-representations/

List of analogy questions (the benchmark dataset):
http://word2vec.googlecode.com/svn/trunk/questions-words.txt

Pretrained vector sets:
* Word2Vec: https://code.google.com/archive/p/word2vec/
* GloVe: http://nlp.stanford.edu/projects/glove/

## Code

Note: This is written for Python 3; there may be small differences when using another version.

Load the required libraries and set up logging (the Word2Vec model itself is loaded further down).

In [1]:
# Import modules
from gensim import models
import logging

# Set up a logger that logs to the console (works in Jupyter 3!) and writes to a file
logger = logging.getLogger()
fhandler = logging.FileHandler(filename='mylog.log', mode='a')
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
fhandler.setFormatter(formatter)
logger.addHandler(fhandler)
logger.setLevel(logging.DEBUG)

### Define functions

Because we need to run different analogy models and compute the recall for each of them, it is useful to wrap them in functions. Both models below rely on
[gensim's implementation](https://github.com/piskvorky/gensim/blob/develop/gensim/models/word2vec.py).

The different analogy models are explained here: http://www.marekrei.com/blog/linguistic-regularities-word-representations/


**Model 1** (addition model)

a : b is c : ?  (or, a is to b as c is to [...], where a, b and c are word vectors)
>1. Compute the vector c - a + b
>2. Find the word vector closest to it (by cosine similarity)

In [2]:
def analogy_model1(a, b, c, model): 
    result = model.most_similar(positive=[c, b], negative=[a], topn=1)
    return result[0]
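
To make the addition model concrete, here is a minimal NumPy sketch of the two steps above on a hand-made toy vocabulary. The vectors, `toy_vectors`, `unit` and `analogy_add` are all hypothetical and only for illustration. It assumes, as gensim's `most_similar` is understood to do, that the vectors are length-normalized first and that the three query words are excluded from the candidates.

```python
import numpy as np

# Toy, hand-made 2-d embeddings (hypothetical values, for illustration only)
toy_vectors = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([3.0, 0.2]),
    "queen": np.array([3.0, 3.0]),
    "apple": np.array([-1.0, 0.5]),
}

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def analogy_add(a, b, c, vectors):
    """Return the word whose vector is closest (by cosine) to c - a + b."""
    # Step 1: compute c - a + b on the normalized vectors
    target = unit(unit(vectors[c]) - unit(vectors[a]) + unit(vectors[b]))
    # Step 2: rank all other words by cosine similarity to the target
    scores = {w: float(np.dot(unit(v), target))
              for w, v in vectors.items() if w not in (a, b, c)}
    best = max(scores, key=scores.get)
    return best, scores[best]

word, score = analogy_add("man", "woman", "king", toy_vectors)
print(word)
```

Excluding the query words matters: without it, the closest vector to c - a + b is often c itself.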

**Model 2** (multiplication model)

a : b is c : d  (or, a is to b as c is to d, where a, b, c and d are word vectors)
>$d = \arg\max_{d'} \dfrac{\cos(d', c) \cdot \cos(d', b)}{\cos(d', a) + \varepsilon}$
>
>$\varepsilon = 0.001$ to avoid division by zero

In [3]:
def analogy_model2(a, b, c, model):
    result = model.most_similar_cosmul(positive=[c, b], negative=[a], topn=1)
    return result[0]
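
The multiplication formula can be sketched directly in NumPy. This is an illustration on hypothetical toy vectors, not gensim's code: `analogy_mul`, `shifted_cos` and the vector values are made up here. Shifting each cosine from [-1, 1] into [0, 1] is an assumption of this sketch, made so that the product and the quotient stay well behaved when a cosine is negative.

```python
import numpy as np

# Toy, hand-made 2-d embeddings (hypothetical values, for illustration only)
toy_vectors = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([3.0, 0.2]),
    "queen": np.array([3.0, 3.0]),
    "apple": np.array([-1.0, 0.5]),
}

def shifted_cos(u, v):
    """Cosine similarity mapped from [-1, 1] to [0, 1] (assumption of this sketch)."""
    cos = float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    return (1.0 + cos) / 2.0

def analogy_mul(a, b, c, vectors, eps=0.001):
    """Return argmax over d' of cos(d', c) * cos(d', b) / (cos(d', a) + eps)."""
    scores = {}
    for w, v in vectors.items():
        if w in (a, b, c):
            continue  # never return one of the query words
        scores[w] = (shifted_cos(v, vectors[c]) * shifted_cos(v, vectors[b])
                     / (shifted_cos(v, vectors[a]) + eps))
    best = max(scores, key=scores.get)
    return best, scores[best]

word, score = analogy_mul("man", "woman", "king", toy_vectors)
print(word)
```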

**Recall@1**

Each analogy model that we test should report its performance as a *Recall@1* metric: the fraction of analogy questions for which the top-ranked answer is the expected word.
>[Recall](https://en.wikipedia.org/wiki/Precision_and_recall#Recall) in information retrieval is the fraction of the documents that are relevant to the query that are successfully retrieved.

>![alt text](recall_formula.png "Recall@1")

In [4]:
def recall_analogy_model(questions, analogy_model, model):
    right_count = 0 
    total_count = 0
    skipped_count = 0

    with open(questions, 'r') as file:
        for line in file:
            if line[0] != ':' :   # Ignore the lines that start with a ':', they indicate semantic/syntactic relation categories
                total_count += 1
                words = line.split() # Split the different words
                try:
                    result_text = analogy_model(words[0], words[1], words[2], model)                 
                    if result_text[0] == words[3]:
                        right_count += 1
                except KeyError: # If a KeyError occurs, skip line
                    skipped_count += 1
                
    # Return the recall, the recall if we ignore the skipped questions,
    # the number of correct answers, the total count and the skipped count
    recall = float(right_count) / float(total_count)
    recall_ignore_skipped = float(right_count) / float(total_count - skipped_count)
    return float('%.5f'% recall), float('%.5f'% recall_ignore_skipped), float(right_count), float(total_count), float(skipped_count)
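
The difference between the two recall values is easiest to see on a tiny example. The sketch below is a simplified, hypothetical helper (`recall_at_1` is not part of the notebook) that takes already computed predictions, with `None` standing for a question that was skipped because a word was missing from the vocabulary.

```python
def recall_at_1(predictions, gold):
    """Return (recall over all questions, recall over answered questions)."""
    total = len(gold)
    skipped = sum(1 for p in predictions if p is None)
    right = sum(1 for p, g in zip(predictions, gold) if p == g)
    answered = total - skipped
    # Guard against empty inputs so we never divide by zero
    recall = right / total if total else 0.0
    recall_answered = right / answered if answered else 0.0
    return recall, recall_answered

gold = ["queen", "Paris", "bigger", "walked"]
predictions = ["queen", None, "bigger", "walking"]  # None = skipped question
recall, recall_answered = recall_at_1(predictions, gold)
print(recall, recall_answered)
```

Counting skipped questions as wrong penalizes a small vocabulary; ignoring them measures only the quality of the answers that could be computed.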

### Word2Vec


In [5]:
# Load Google's pre-trained Word2Vec vector set
# Note: This will take a lot of memory and can take a while.
# Note II: Depending on your RAM, do not load all models at the same time

# Note: newer gensim releases expose this loader as models.KeyedVectors.load_word2vec_format
w2v_model = models.Word2Vec.load_word2vec_format('data/GoogleNews-vectors-negative300.bin.gz', binary=True)

w2v_model.init_sims(replace=True) # Normalize; Trims unneeded model memory = use (much) less RAM.

In [6]:
# Run for analogy models
print("Word2Vec - addition model: ", recall_analogy_model('data/questions-words-test.txt', analogy_model1, w2v_model))

Word2Vec - addition model: 0.78142
Word2Vec - addition model: 0.73588 (when run from a terminal instead of the notebook)


In [6]:
print ("Word2Vec - multiplication model: ", recall_analogy_model('data/questions-words.txt', analogy_model2, w2v_model))


Word2Vec - multiplication model: 0.75148 (run from a terminal)


### GloVe

GloVe vectors are trained differently from Word2Vec vectors, but the resulting text file is very similar to the Word2Vec text format. The main difference is that the Word2Vec format starts with a header line giving the number of vectors and their dimensionality, which the GloVe files lack. How to adapt GloVe files for gensim is discussed [here](https://groups.google.com/forum/#!topic/gensim/0_SeYGVAL78), with code [here](https://github.com/manasRK/glove-gensim).

In [5]:
import smart_open
import os.path

def glove2word2vec(glove_filename):
    def get_info(glove_filename): 
        num_lines = sum(1 for line in smart_open.smart_open(glove_filename))
        dims = glove_filename.split('.')[2].split('d')[0] # file name contains the number of dimensions
        return num_lines, dims
    
    def prepend_info(infile, outfile, line): # Write a copy of infile with `line` prepended (reads the whole file into memory)
        with open(infile, 'r', encoding="utf8") as original: data = original.read()
        with open(outfile, 'w', encoding="utf8") as modified: modified.write(line + '\n' + data)
        return outfile
    
    word2vec_filename = glove_filename[:-3] + "word2vec.txt"
    if os.path.isfile(word2vec_filename):
        model = models.Word2Vec.load_word2vec_format(word2vec_filename)
    else:
        num_lines, dims = get_info(glove_filename)
        gensim_first_line = "{} {}".format(num_lines, dims)
        model_file = prepend_info(glove_filename, word2vec_filename, gensim_first_line)
        model = models.Word2Vec.load_word2vec_format(model_file)
    
    model.init_sims(replace = True)  # normalize all word vectors
    return model
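
The conversion above reads the whole GloVe file into memory and takes the dimensionality from the file name. As an alternative, here is a hedged, standard-library-only sketch that streams the file and infers the dimensionality from the first line instead. The function name `glove_to_word2vec_text` and the toy file contents are hypothetical; loading the result into gensim is left out on purpose.

```python
import os
import tempfile

def glove_to_word2vec_text(glove_path, out_path):
    """Write '<rows> <dims>' followed by the unchanged GloVe lines to out_path."""
    rows = 0
    dims = 0
    # First pass: count the vectors and infer the dimensionality
    with open(glove_path, "r", encoding="utf8") as src:
        for line in src:
            if rows == 0:
                dims = len(line.rstrip().split(" ")) - 1  # first token is the word
            rows += 1
    # Second pass: write the header, then copy the vectors line by line
    with open(glove_path, "r", encoding="utf8") as src, \
         open(out_path, "w", encoding="utf8") as dst:
        dst.write("{} {}\n".format(rows, dims))
        for line in src:
            dst.write(line)
    return rows, dims

# Tiny demonstration on a made-up two-word GloVe file
tmp_dir = tempfile.mkdtemp()
glove_path = os.path.join(tmp_dir, "toy.glove.txt")
w2v_path = os.path.join(tmp_dir, "toy.word2vec.txt")
with open(glove_path, "w", encoding="utf8") as f:
    f.write("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")

shape = glove_to_word2vec_text(glove_path, w2v_path)
with open(w2v_path, encoding="utf8") as f:
    header = f.readline().strip()
print(shape, header)
```

Reading the file twice trades a little time for constant memory use, which matters for the larger vector files.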

In [7]:
# Load GloVe's pre-trained model
# The vectors are stored as plain text, with dimensionality 50, 100, 200 or 300.
# We only use the vectors pre-trained on Wikipedia.
glove50d_model = glove2word2vec('data/glove.6B.50d.txt')

INFO:gensim.models.word2vec:loading projection weights from data/glove.6B.50d.word2vec.txt
INFO:gensim.models.word2vec:loaded (400000, 50) matrix from data/glove.6B.50d.word2vec.txt
INFO:gensim.models.word2vec:precomputing L2-norms of word weight vectors


In [8]:
# Prints the recall, the recall if we ignore the skipped questions,
# the number of correct answers, the total count and the skipped count
print ("GloVe50d - addition model: ", recall_analogy_model('data/questions-words.txt', analogy_model1, glove50d_model))
print ("GloVe50d - multiplication model: ", recall_analogy_model('data/questions-words.txt', analogy_model2, glove50d_model))

GloVe50d - addition model:  (0.18978, 0.38708, 3709.0, 19544.0, 9962.0)
GloVe50d - multiplication model:  (0.14506, 0.29587, 2835.0, 19544.0, 9962.0)


In [9]:
glove100d_model = glove2word2vec('data/glove.6B.100d.txt')
glove200d_model = glove2word2vec('data/glove.6B.200d.txt')

INFO:gensim.models.word2vec:loading projection weights from data/glove.6B.100d.word2vec.txt
INFO:gensim.models.word2vec:loaded (400000, 100) matrix from data/glove.6B.100d.word2vec.txt
INFO:gensim.models.word2vec:precomputing L2-norms of word weight vectors
INFO:gensim.models.word2vec:loading projection weights from data/glove.6B.200d.word2vec.txt
INFO:gensim.models.word2vec:loaded (400000, 200) matrix from data/glove.6B.200d.word2vec.txt
INFO:gensim.models.word2vec:precomputing L2-norms of word weight vectors


In [10]:
# Prints the recall, the recall if we ignore the skipped questions,
# the number of correct answers, the total count and the skipped count
print ("GloVe100d - addition model: ", recall_analogy_model('data/questions-words.txt', analogy_model1, glove100d_model))
print ("GloVe100d - multiplication model: ", recall_analogy_model('data/questions-words.txt', analogy_model2, glove100d_model))
print ("GloVe200d - addition model: ", recall_analogy_model('data/questions-words.txt', analogy_model1, glove200d_model))
print ("GloVe200d - multiplication model: ", recall_analogy_model('data/questions-words.txt', analogy_model2, glove200d_model))


GloVe100d - addition model:  (0.28382, 0.5789, 5547.0, 19544.0, 9962.0)
GloVe100d - multiplication model:  (0.26668, 0.54394, 5212.0, 19544.0, 9962.0)
GloVe200d - addition model:  (0.30777, 0.62774, 6015.0, 19544.0, 9962.0)
GloVe200d - multiplication model:  (0.30531, 0.62273, 5967.0, 19544.0, 9962.0)


In [6]:
# This model takes the most memory, so it is loaded in a separate cell
glove300d_model = glove2word2vec('data/glove.6B.300d.txt')

# Prints the recall, the recall if we ignore the skipped questions,
# the number of correct answers, the total count and the skipped count
print ("GloVe300d - addition model: ", recall_analogy_model('data/questions-words.txt', analogy_model1, glove300d_model))
print ("GloVe300d - multiplication model: ", recall_analogy_model('data/questions-words.txt', analogy_model2, glove300d_model))

INFO:gensim.models.word2vec:loading projection weights from data/glove.6B.300d.word2vec.txt
INFO:gensim.models.word2vec:loaded (400000, 300) matrix from data/glove.6B.300d.word2vec.txt
INFO:gensim.models.word2vec:precomputing L2-norms of word weight vectors


GloVe300d - addition model:  (0.31304, 0.63849, 6118.0, 19544.0, 9962.0)
GloVe300d - multiplication model:  (0.32214, 0.65707, 6296.0, 19544.0, 9962.0)
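
Every GloVe run above skips the same 9962 of the 19544 questions, which points to a vocabulary mismatch rather than to the analogy model. One plausible cause, stated here as an assumption and not verified in this notebook, is casing: the `glove.6B` vectors are distributed lowercased, while the question file contains capitalized words such as country and city names. A minimal sketch of a workaround is to lowercase the question file before evaluating; the helper name `lowercase_questions` and the file names are hypothetical.

```python
import os
import tempfile

def lowercase_questions(in_path, out_path):
    """Copy an analogy question file, lowercasing every question line."""
    with open(in_path, "r", encoding="utf8") as src, \
         open(out_path, "w", encoding="utf8") as dst:
        for line in src:
            # Section headers start with ':' and are copied unchanged
            dst.write(line if line.startswith(":") else line.lower())

# Tiny demonstration on a made-up question file
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "questions.txt")
dst_path = os.path.join(tmp_dir, "questions-lower.txt")
with open(src_path, "w", encoding="utf8") as f:
    f.write(": capital-common-countries\nAthens Greece Baghdad Iraq\n")

lowercase_questions(src_path, dst_path)
with open(dst_path, encoding="utf8") as f:
    lowered = f.read().splitlines()
print(lowered)
```

The lowercased copy could then be passed to `recall_analogy_model` in place of the original file. This only makes sense for lowercased vector sets; for the cased Word2Vec vectors it would likely hurt.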



## More info on Word2Vec

More information on how to use gensim can be found in [this tutorial](http://rare-technologies.com/word2vec-tutorial/).

Gensim accepts the binary format, but if you want the text format, gensim can convert it:

>model = gensim.models.Word2Vec.load_word2vec_format('path/to/GoogleNews-vectors-negative300.bin', binary=True)
>model.save_word2vec_format('path/to/GoogleNews-vectors-negative300.txt', binary=False)

Once the model is loaded, Gensim supports a lot of out-of-the-box functionality.

In [11]:
# Get top-most similar word
w2v_model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)

[('queen', 0.7118192315101624)]

In [12]:
# Find the word that doesn't fit in the row
w2v_model.doesnt_match("breakfast cereal dinner lunch".split())

'cereal'
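
The idea behind picking the odd one out can be sketched as follows: average the normalized word vectors and return the word that is least similar to that average. This is a rough illustration on hypothetical toy vectors; `odd_one_out` is a made-up name and gensim's actual implementation may differ in detail.

```python
import numpy as np

# Toy, hand-made 2-d embeddings (hypothetical values, for illustration only)
toy_vectors = {
    "breakfast": np.array([1.0, 0.1]),
    "lunch":     np.array([1.0, 0.0]),
    "dinner":    np.array([1.0, -0.1]),
    "cereal":    np.array([0.2, 1.0]),
}

def odd_one_out(words, vectors):
    """Return the word whose vector is least similar to the group's mean direction."""
    unit_vecs = np.vstack([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    mean = unit_vecs.mean(axis=0)
    mean = mean / np.linalg.norm(mean)
    similarities = unit_vecs @ mean  # cosine of each word with the mean direction
    return words[int(np.argmin(similarities))]

odd = odd_one_out("breakfast cereal dinner lunch".split(), toy_vectors)
print(odd)
```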

In [13]:
# Get the similarity between two words
w2v_model.similarity('woman', 'man')

0.76640122344103145
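
The similarity reported here is a cosine similarity. A minimal NumPy sketch on two made-up vectors (the values and the name `cosine_similarity` are only for illustration):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0])
similarity = cosine_similarity(u, v)  # the vectors are 45 degrees apart
print(similarity)
```

When the vectors have already been normalized to unit length, as `init_sims(replace=True)` does above, the cosine reduces to a plain dot product.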

In [14]:
# Get the raw numpy vector of a certain word
w2v_model['computer']

array([  4.08137441e-02,  -7.64330178e-02,   4.67502922e-02,
         8.05143863e-02,  -3.46916839e-02,   8.23695585e-02,
        -5.00895977e-02,   3.15378942e-02,   7.68040493e-02,
         1.81806684e-02,   1.39137767e-02,  -9.32223070e-03,
         9.09033418e-03,  -6.08495846e-02,  -9.92516056e-03,
         3.69178876e-02,  -2.41172127e-02,   7.01254383e-02,
         6.49309605e-02,  -6.19626865e-02,  -4.15558144e-02,
         5.67682087e-02,  -1.76820919e-04,   3.65468524e-02,
         6.41888902e-02,   9.91356559e-04,   3.39496173e-02,
         2.46737637e-02,   1.35427425e-02,  -2.63434183e-02,
        -5.56551069e-02,  -4.60082218e-02,  -8.64509344e-02,
         9.32223070e-03,  -4.73068431e-02,  -1.20957099e-01,
        -8.38536993e-02,   4.97185625e-02,   1.39137767e-02,
        -1.38210189e-02,  -4.30399515e-02,   7.42068142e-02,
         3.71034071e-02,   4.82344255e-02,   2.50447989e-02,
         2.63434183e-02,   3.89585760e-03,   6.67861328e-02,
        -6.41888902e-02,

Gensim can also evaluate a model directly on a file in the same format as Google's question words.

In [39]:
# Gensim supports the same evaluation set as Google does
w2v_model.accuracy('data/questions-words-test.txt')

[{'correct': [], 'incorrect': [], 'section': u'capital-common-countries'},
 {'correct': [], 'incorrect': [], 'section': u'capital-world'},
 {'correct': [(u'boy', u'girl', u'brother', u'sister'),
   (u'boy', u'girl', u'brothers', u'sisters'),
   (u'boy', u'girl', u'dad', u'mom'),
   (u'boy', u'girl', u'father', u'mother'),
   (u'boy', u'girl', u'grandfather', u'grandmother'),
   (u'boy', u'girl', u'grandson', u'granddaughter'),
   (u'boy', u'girl', u'groom', u'bride'),
   (u'boy', u'girl', u'he', u'she'),
   (u'boy', u'girl', u'his', u'her'),
   (u'boy', u'girl', u'husband', u'wife'),
   (u'boy', u'girl', u'king', u'queen'),
   (u'boy', u'girl', u'man', u'woman'),
   (u'boy', u'girl', u'nephew', u'niece'),
   (u'boy', u'girl', u'prince', u'princess'),
   (u'boy', u'girl', u'son', u'daughter'),
   (u'boy', u'girl', u'sons', u'daughters'),
   (u'boy', u'girl', u'uncle', u'aunt'),
   (u'brother', u'sister', u'brothers', u'sisters'),
   (u'brother', u'sister', u'dad', u'mom'),
   (u'brother


