# Text-based Information Retrieval

## Assignment PART II
### Using word embeddings
We can use the semantic similarity captured by word embeddings, such as GloVe and Word2Vec, to obtain better results.
In this part of the exercise, we will use the addition analogy (similar to Part I of this assignment) to rank the given documents.


In [16]:
# Loading modules
import os, re
import pandas as pd
from numpy import dot, array
from gensim import matutils, models

# Set up a logger that logs to the console (works in Jupyter 3!) and also writes to a file
import logging
logger = logging.getLogger()
fhandler = logging.FileHandler(filename='part_II_logs.log', mode='a')
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
fhandler.setFormatter(formatter)
logger.addHandler(fhandler)
logger.setLevel(logging.DEBUG)


#### Load in word model

In [2]:
# Load Google's pre-trained Word2Vec vector set
# Note: This will take a lot of memory and can take a while.
# Note II: Depending on your RAM, do not load all models at the same time
# Note III: In gensim >= 1.0 this loader is available as models.KeyedVectors.load_word2vec_format
w2v_model = models.Word2Vec.load_word2vec_format('data/GoogleNews-vectors-negative300.bin.gz', binary=True)
#w2v_model.init_sims(replace=True) # Normalize in place; trims unneeded model memory, so it uses (much) less RAM.


INFO:gensim.models.word2vec:loading projection weights from data/GoogleNews-vectors-negative300.bin.gz
INFO:gensim.models.word2vec:loaded (3000000L, 300L) matrix from data/GoogleNews-vectors-negative300.bin.gz
INFO:gensim.models.word2vec:precomputing L2-norms of word weight vectors
INFO:gensim.models.word2vec:precomputing L2-norms of word weight vectors


In [None]:
import io
import os.path
import smart_open

def glove2word2vec(glove_filename):
    def get_info(glove_filename):
        # Count the vectors (one per line) and read the dimensionality from the file name
        num_lines = sum(1 for line in smart_open.smart_open(glove_filename))
        dims = glove_filename.split('.')[2].split('d')[0] # e.g. 'data/glove.6B.50d.txt' -> '50'
        return num_lines, dims
    
    def prepend_info(infile, outfile, line): # Write a copy of infile with the given line prepended
        # io.open accepts `encoding` on both Python 2 and Python 3
        with io.open(infile, 'r', encoding="utf8") as original: data = original.read()
        with io.open(outfile, 'w', encoding="utf8") as modified: modified.write(line + u'\n' + data)
        return outfile
    
    word2vec_filename = glove_filename[:-3] + "word2vec.txt"
    if os.path.isfile(word2vec_filename):
        model = models.Word2Vec.load_word2vec_format(word2vec_filename)
    else:
        num_lines, dims = get_info(glove_filename)
        gensim_first_line = "{} {}".format(num_lines, dims)
        model_file = prepend_info(glove_filename, word2vec_filename, gensim_first_line)
        model = models.Word2Vec.load_word2vec_format(model_file)
    
    model.init_sims(replace = True)  # normalize all word vectors
    return model

# Load GloVe's pre-trained model
# These vectors are stored as plain text, with vector dimensionality 50, 100, 200 or 300.
# We only use the vectors pre-trained on Wikipedia.
glove50d_model = glove2word2vec('data/glove.6B.50d.txt')

#### Images to wordvectors

We will use the similarity of word models such as Word2Vec and GloVe to make a vector for each image. These vectors look like
>s = w1 + w2 + ... + wn

> with s the image vector and {w1 .. wn} the vectors of the words describing that image
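As a small illustration of this formula, the sketch below builds a caption vector from a toy embedding table. The names `toy_embeddings` and `caption_vector` and the 3-dimensional vectors are made up for the example; a real model would supply 50 to 300 dimensions. Because the result is scaled to unit length, summing and averaging the word vectors give the same direction.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, for illustration only.
toy_embeddings = {
    "white": np.array([1.0, 0.0, 0.0]),
    "train": np.array([0.0, 2.0, 0.0]),
    "track": np.array([0.0, 2.0, 1.0]),
}

def caption_vector(words, embeddings):
    """Sum the vectors of all known words and scale the sum to unit length."""
    known = [embeddings[w] for w in words if w in embeddings]
    if not known:
        return None  # none of the words is in the vocabulary
    s = np.sum(known, axis=0)       # s = w1 + w2 + ... + wn
    return s / np.linalg.norm(s)    # unit length, so dot products are cosines

# Out-of-vocabulary words are skipped.
vec = caption_vector(["white", "train", "track", "unknownword"], toy_embeddings)
```

Returning `None` for captions without any known word is one possible design choice; it makes the empty case explicit instead of producing a vector of NaN values.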


In [3]:
# Load in a stopword list from
# http://www.lextek.com/manuals/onix/stopwords2.html
stopwords = []
with open('data/stopwordlist.txt', 'r') as f:
    lines = ''.join(f.readlines())
    stopwords = [x for x in lines.split('\n')[2:]]


In [79]:
# Translate text to a unit-length average vector
def sentence_to_vector(model, sentence):
    v1 = []
    for word in sentence.split(' '):
        try:
            v1.append(model[word])
        except KeyError:
            if "-" in word: # try the word without dashes, then its first two dash-separated parts
                try:
                    v1.append(model[word.replace("-", "")])
                except KeyError:
                    try:
                        # Look up both parts first, so a failure adds neither of them
                        parts = [model[part] for part in word.split("-")[:2]]
                        v1.extend(parts)
                    except KeyError:
                        print 'word not in model:', word
                        continue
            else:
                print 'word not in model:', word
                continue
    return matutils.unitvec(array(v1).mean(axis=0))


In [94]:
# Clean the input, because the word models cannot contain every possible combination of words and symbols
# TODO: [Maybe?] Lemmatize words for even better results
def clean_input(text, stopwords):
    # Lowercase and strip trailing whitespace (including the line break)
    text = text.lower().rstrip()
    # Remove punctuation
    text = re.sub(r'[!@#$:;%&?,_.\'"\\/()\[\]]', '', text)
    # Collapse runs of dashes into a single dash
    text = re.sub(r'-{2,}', '-', text)
    # Replace standalone dashes with a space and remove all digits
    text = re.sub(r'\s-+\s', ' ', text)
    text = re.sub(r'[0-9]+', '', text)
    # Remove stopwords
    text = ' '.join([word for word in text.split() if word not in stopwords])
    return text


In [88]:
# Text file parser
# Returns a DataFrame with one row per document: index, image id, cleaned caption
# (lowercase, without stopwords or punctuation) and the caption vector
def text_file_parser(filename, stopwords, model):
    corpus = dict()
    #corpus = pd.DataFrame(columns=('id', 'imageid', 'vec'))
    with open(filename) as f:
        next(f) # skip first line
        for doc in f:
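            # Alternative for the unparsed, tab-separated file:
            #   doc_parts = doc.split('\t')
            #   doc_parts[2] = clean_input(doc_parts[2], stopwords)
            #   doc_parts.append(sentence_to_vector(model, doc_parts[2]))
            #   corpus[doc_parts[0]] = doc_parts
            # For the parsed files: split off the image id; a first field shorter
            # than 6 characters is treated as a row index and dropped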
            doc_parts = doc.split(" ", 1)
            if(len(doc_parts[0]) < 6):
                doc_parts = doc.split(" ", 2)
                doc_parts.pop(0)
            doc_parts[1] = clean_input(doc_parts[1], stopwords)
            doc_parts.append(sentence_to_vector(model, doc_parts[1]))
            corpus[len(corpus) + 1] = doc_parts

    # Transform to dataframe
    df = pd.DataFrame.from_dict(corpus, orient='index')
    df = df.reset_index()
    df.columns = ['index', 'img_id', 'caption', 'vec']
    return df

In [95]:
# Parse the image captions and the queries into DataFrames
print 'Parsing documents'
training_docs = text_file_parser('data/target_collection_parsed.txt', stopwords, w2v_model)
print 'Parsing queries'
queries = text_file_parser('data/queries_val_parsed.txt', stopwords, w2v_model)

# Preview
queries

Parsing documents
word not in model: grey
word not in model: harbour
word not in model: colour
word not in model: grey
word not in model: seethrough
word not in model: aeroplane
word not in model: covington
word not in model: aeroplane
word not in model: labcoat
word not in model: grey
word not in model: plough
word not in model: colour
word not in model: bluegreen
word not in model: figur
word not in model: solvaten
word not in model: colour
word not in model: coloured
word not in model: grey
word not in model: lightyear
word not in model: theatre
word not in model: grey
word not in model: black-and-white
word not in model: colourful
word not in model: streetperson
word not in model: grey
word not in model: colour
word not in model: out-of-focus
word not in model: adana
word not in model: soundmixer
word not in model: colourful
word not in model: tirol
word not in model: colours
word not in model: hyperthermal
word not in model: theatre
word not in model: colour
word not in model: gre

| | index | img_id | caption | vec |
|---|---|---|---|---|
| 0 | 1 | XBZPztvt67qkMUdI | man white shirt sit table cut meat plate front... | [0.0370230601868, 0.0189000485364, 0.020742983... |
| 1 | 2 | PaqtOaYmQmXkqW2i | woman red dress posing axe | [0.0637089120528, 0.0104547512146, 0.012100161... |
| 2 | 3 | IPcFtNL-7EQ6Z0yu | soccer play stand soccer ball front | [0.00913737563304, 0.0457835417231, 0.07424217... |
| 3 | 4 | IMAD0sq2Fz7HpSgX | white yellow train track | [-0.0223426765997, 0.0784277416676, 0.04782966... |
| 4 | 5 | -gqRDDfPZTGlCfJa | view tall building city | [-0.0118529039629, 0.0563774739386, 0.01534611... |
| 5 | 6 | xsrYb57vl4qiMLDG | hand pick flower vine | [0.0556772009794, 0.0733344995993, -0.07167356... |
| 6 | 7 | BCjxgJlQ3TD5T8ST | picture army ready sail | [0.0596367224357, 0.176939013212, -0.033955283... |
| 7 | 8 | LGxwsl9CtRQ8wW3Y | brick roof house picture | [0.0435841148884, 0.0197629548524, -0.01897175... |
| 8 | 9 | 9LtOvyiygFYoxU8S | man clean mess street | [0.0964928812486, 0.101068245848, 0.0509890990... |
| 9 | 10 | 8usTLD-Wg5EHCShk | group kid play playground accompany adult | [0.0110731808015, 0.0021735181312, 0.022698451... |


In [81]:
w2v_model["axe"]

KeyError: 'axe'

#### Check similarity

In [None]:
# Function to calculate the similarity between two document vectors
def similarity(v1, v2):
    """
    Compute the cosine similarity between two document vectors.
    Example:
      >>> similarity(doc1_vec, doc2_vec)
      0.73723527
      >>> similarity(doc2_vec, doc2_vec)
      1.0
    """
    return dot(matutils.unitvec(v1), matutils.unitvec(v2))
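The behaviour expected from a cosine similarity can be checked on small hand-made vectors. The sketch below uses only NumPy instead of `matutils.unitvec`; the function name `cosine_similarity` and the test vectors are chosen for the example and are not part of the assignment code.

```python
import numpy as np

def cosine_similarity(v1, v2):
    """Cosine of the angle between two non-zero vectors."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# Parallel vectors point in the same direction, whatever their length.
same = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors share no direction at all.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
```

Parallel vectors should score (numerically close to) 1 and perpendicular vectors 0, which makes this a quick sanity check before ranking the whole collection.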

In [None]:
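# TODO: loop over every query and every document and rank them by cosine similarity
#for q in queries:

# In the meantime: time three ways of iterating over DataFrame rows
import time

t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})
B = []  # elapsed seconds per method: iterrows, itertuples, zip
C = []
A = time.time()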
for i,r in t.iterrows():
    C.append((r['a'], r['b']))
B.append(time.time()-A)

C = []
A = time.time()
for ir in t.itertuples():
    C.append((ir[1], ir[2]))    
B.append(time.time()-A)

C = []
A = time.time()
for r in zip(t['a'], t['b']):
    C.append((r[0], r[1]))
B.append(time.time()-A)

print B
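The ranking step itself is still a TODO in this notebook. One possible way to do it without an explicit Python loop is sketched below, assuming all vectors are already unit length, so that a single matrix product yields every query/document cosine similarity. The function `rank_documents` and the toy arrays are hypothetical and only illustrate the idea.

```python
import numpy as np

def rank_documents(query_vecs, doc_vecs):
    """Return, per query, the document indices sorted from most to least similar.

    Both inputs are 2-D arrays whose rows are unit-length vectors, so the
    matrix product holds the cosine similarity of every query/document pair.
    """
    scores = np.dot(query_vecs, doc_vecs.T)    # shape: (n_queries, n_docs)
    ranking = np.argsort(-scores, axis=1)      # negate to sort in descending order
    return ranking, scores

# Toy data (hypothetical): 3 documents and 2 queries in 2 dimensions.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
query_vectors = np.array([[1.0, 0.0], [0.0, 1.0]])
ranking, scores = rank_documents(query_vectors, docs)
```

With DataFrames like the ones built earlier, the input matrices could presumably be obtained by stacking the `vec` column, for example with `np.vstack`, but that depends on every row having a valid vector.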

#### Results

In [101]:
# Recall
queries

{'18497': ['ensWX7N4skJUw2wA', 0],
 '18496': ['BW35OZ250H-7gspc', 0],
 '18495': ['n77DHADXySILcBcK', 0],
 '18494': ['ijH5-TgdqBPW7TsT', 0],
 '18493': ['1yMxqhIUfWmm8pkA', 0],
 '18492': ['Zibjo7D7jJkC1Skm', 0],
 '18491': ['mXZqizlXAuoJP0KG', 0],
 '18490': ['-Ko3qo3EJ08HiuNZ', 0],
 '18556': ['Sy-U0-9WYtWjDubI', 0],
 '18557': ['AhuDv6YVgn_LFDVZ', 0],
 '18554': ['uphNmNieuVdrllnU', 0],
 '18555': ['KK9SmYr9C2ThVqYp', 0],
 '18552': ['A_9nB9l5xSA8sYAO', 0],
 '18553': ['-qIw0-8p3LPVt0nM', 0],
 '18499': ['r2wmr6u6i3OGvxlD', 0],
 '18498': ['0_QSjCiUSy_Pn1N3', 0],
 '18888': ['ULzNIjHi_uL4m65X', 0],
 '18881': ['iYKRu1ElhEWMG4xV', 0],
 '18970': ['-q-lecPJgKXcZnOI', 0],
 '18889': ['JisaOaMGQjZaUKWA', 0],
 '18952': ['D5aCgXl4AmDVNdCk', 0],
 '18953': ['ViVtvxPYGZc-wbcr', 0],
 '18950': ['5_V5TahqSjDrHQuc', 0],
 '18951': ['dDdbGACKVNbIQSov', 0],
 '18956': ['JKPdw6y2FTJxsh1K', 0],
 '18558': ['8SH9ZQ347TvFCHFn', 0],
 '18954': ['Z1TKwXyWzPHIFyqo', 0],
 '18955': ['WsHd9WhpLTVCLUlq', 0],
 '18711': ['aQS9gNA5

In [115]:
# MAP
#df = pd.DataFrame([], columns=list('AB'))
df = pd.DataFrame.from_dict(queries, orient='index')
#pd.DataFrame(columns=('id', 'imageid', 'text'))
#df = df.append(queries, ignore_index=True)
#df = df.append(pd.DataFrame([['joske', 'marieke', 'smth']], columns=('id', 'imageid', 'text')), ignore_index=True)

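df = df.reset_index()
# `columns` is an attribute, not a method: assign the names directly.
# The number of names must match the number of columns; 'score' is a placeholder name.
df.columns = ['index', 'img_id', 'caption', 'score']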
df

| | index | 0 | 1 | 2 |
|---|---|---|---|---|
| 0 | 18497 | ensWX7N4skJUw2wA | A boat approaches a man.\n | 0 |
| 1 | 18496 | BW35OZ250H-7gspc | Two backpackers hike in the woods.\n | 0 |
| 2 | 18495 | n77DHADXySILcBcK | A man using a tool on the roof of a car.\n | 0 |
| 3 | 18494 | ijH5-TgdqBPW7TsT | Men in orange jumpsuits looking at something.\n | 0 |
| 4 | 18493 | 1yMxqhIUfWmm8pkA | A young female stands wearing a heavily embroi... | 0 |
| 5 | 18492 | Zibjo7D7jJkC1Skm | A photo of a small kitchen with a little table... | 0 |
| 6 | 18491 | mXZqizlXAuoJP0KG | Two men wearing uniforms stand behind a table ... | 0 |
| 7 | 18490 | -Ko3qo3EJ08HiuNZ | A house with a large yard.\n | 0 |
| 8 | 18556 | Sy-U0-9WYtWjDubI | A view of a ship at sea with steam coming off ... | 0 |
| 9 | 18557 | AhuDv6YVgn_LFDVZ | A smiling man near a window wresting with laun... | 0 |
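The Recall and MAP cells above do not yet compute any metric. As a rough sketch of what they could compute, the functions below implement recall at rank k and mean average precision in their usual textbook form. All names and the toy rankings are hypothetical; they are not taken from the assignment.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / float(len(relevant_ids))

def average_precision(ranked_ids, relevant_ids):
    """Sum of the precision at each relevant hit, divided by the number of relevant documents."""
    relevant = set(relevant_ids)
    hits, precisions = 0, []
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / float(rank))
    return sum(precisions) / float(len(relevant)) if relevant else 0.0

def mean_average_precision(rankings, relevants):
    """Average of the per-query average precision."""
    aps = [average_precision(r, rel) for r, rel in zip(rankings, relevants)]
    return sum(aps) / float(len(aps))

# Hypothetical example: two queries, each with exactly one relevant image "a".
rankings = [["a", "b", "c"], ["b", "c", "a"]]
relevants = [["a"], ["a"]]
map_score = mean_average_precision(rankings, relevants)
```

When every query has a single relevant image, the average precision of a query reduces to the reciprocal of the rank at which that image is retrieved.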


In [15]:
v1 = [len(word) for word in "A boat approaches a man".split(' ')]
v1

[1, 4, 10, 1, 3]