# Information Retrieval Assignment  
## Part 1B: Retrieval and Ranking

---

This notebook loads the document vectors that were generated in Part 1A. The user can then enter a query or choose one of the sample queries. The `lnc.ltc` score of each document is calculated with respect to the query, and the 10 highest-scoring documents are returned.

In [1]:
import numpy as np
import pandas as pd
import pickle
import shelve
import time

Here we load the document vectors that were generated by the Part 1A code. This might take a minute or two.

In [2]:
try:
    document_vectors = pd.read_csv('vector_space_model.csv', index_col=0, keep_default_na=False)
except FileNotFoundError:
    # Catch only the missing-file case so that other errors are not hidden
    print('File not found. Please run the Vector Space Model file first or move vector_space_model.csv to the same directory as this notebook')

In [3]:
document_vectors

Unnamed: 0,tsalta baptiste,karl richter (tennis),matrix ab,sergey golubitskiy,west side story (earl hines album),2016 tipperary senior hurling championship,mario mosböck,faith baptist college,volodymyr dykyi,félix baumaine,...,pinkwash (band),promo azteca,strange creek (west virginia),strange creek,collective sigh,dileep agrawal,strouds creek,mystic marathon,the daddy issues,vicky astori
,4,0,0,16,0,0,0,0,0,3,...,0,0,0,0,0,0,0,1,0,2
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
00,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
000,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
𐩦𐩧𐩢,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
𐩦𐩲𐩧𐩣,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
𐩱𐩡,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
𐩱𐩥𐩩𐩧𐩣,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Extract the document list and the vocabulary dictionary from the pandas DataFrame.

In [4]:
docs = document_vectors.columns
vocab_dict = {term: termID for termID, term in enumerate(document_vectors.index)}

### Computing the lnc for documents:

$$ Logarithm\ term\ frequency \left( l \right): 1 + \log_{2}\left ( tf_{t,d} \right ) $$  

$$ No\ document\ frequency \left( n \right): 1 $$  

$$ Cosine\ normalisation \left( c \right): \sqrt{ \sum\limits_{i=1}^{T} {w^{2}_{i}}} $$  

This block computes the lnc for the documents in the corpus. 

To store the fractional weights, the matrix is cast to float, which requires a lot of memory. This may cause the program to run out of memory and crash, especially when dealing with a large number of documents. We have addressed this problem in our first optimisation in Part 2 of the assignment.
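As a rough illustration of the memory cost, the sketch below estimates the size of a dense term-document matrix for two float widths. The helper `dense_matrix_bytes` and the corpus dimensions are hypothetical and chosen only for illustration; this is not the optimisation used in Part 2.

```python
import numpy as np

def dense_matrix_bytes(n_terms, n_docs, dtype):
    """Bytes needed for a dense n_terms x n_docs matrix of the given dtype."""
    return n_terms * n_docs * np.dtype(dtype).itemsize

# Hypothetical corpus dimensions, not taken from the assignment data.
n_terms, n_docs = 100_000, 5_000

size_f64 = dense_matrix_bytes(n_terms, n_docs, np.float64)
size_f32 = dense_matrix_bytes(n_terms, n_docs, np.float32)
print(f"float64: {size_f64 / 1e9:.1f} GB, float32: {size_f32 / 1e9:.1f} GB")
```

Each intermediate array created during weighting has the same footprint, so peak usage can be several times the size of one matrix.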

In [5]:
#(l)nc for documents

w_d = document_vectors
w_d = np.array(w_d, dtype=np.float64)  # the np.float alias was removed in NumPy 1.24
w_d = np.log2(w_d, out=np.zeros_like(w_d), where=(document_vectors!=0))
w_d = np.add(w_d, 1, out=np.zeros_like(w_d), where=(document_vectors!=0))

#ln(c) for documents

w_d = w_d/(np.linalg.norm(w_d, axis=0).reshape(1,-1))
w_d[np.isnan(w_d)] = 0
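To check the weighting on a small case, here is a minimal sketch of the same lnc steps applied to a toy count matrix. The function name `lnc_weights` and the toy data are assumptions made for this example; the sketch guards against empty documents by replacing zero norms with 1 rather than clearing NaNs afterwards.

```python
import numpy as np

def lnc_weights(tf):
    """Return lnc weights for a terms x documents matrix of raw counts."""
    tf = np.asarray(tf, dtype=np.float64)
    nonzero = tf != 0
    # l: 1 + log2(tf) where tf > 0, and 0 elsewhere
    w = np.zeros_like(tf)
    np.log2(tf, out=w, where=nonzero)
    w[nonzero] += 1
    # n: no document-frequency factor (multiply by 1)
    # c: divide every column by its Euclidean norm, skipping empty documents
    norms = np.linalg.norm(w, axis=0)
    norms[norms == 0] = 1.0
    return w / norms

# Toy corpus: 3 terms (rows) x 3 documents (columns); the last document is empty.
toy_tf = np.array([[1, 0, 0],
                   [2, 4, 0],
                   [0, 1, 0]])
toy_w = lnc_weights(toy_tf)
print(np.linalg.norm(toy_w, axis=0))  # non-empty documents have unit length
```

In the first document the counts 1 and 2 become log weights 1 and 2, so after normalisation the weights are 1/√5 and 2/√5.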


---

## Test Queries

We have provided some sample queries to try out. To use one, uncomment the relevant line and comment out `query = input()`, which would otherwise overwrite it. To enter your own query, leave the cell as it is and type the query when prompted.

In [22]:
# query = "football footballer midfielder"
# query = "north american beaver"
# query = "car ride"
# query = "falling off the edge of the world"
# query = "guitar musician"
# query = "jordan peterson"

query = input()

start_time = time.time()

query_vector = np.zeros(len(vocab_dict), dtype=np.float64)  # the np.float alias was removed in NumPy 1.24
for word in query.strip().split(' '):
    if word in vocab_dict:
        query_vector[vocab_dict[word]] += 1

football footballer midfielder
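The query vector is a term-frequency vector over the vocabulary: each query word that exists in the vocabulary increments its slot, and unknown words are ignored. A minimal sketch on a toy vocabulary follows; the helper `build_query_vector` and `toy_vocab` are hypothetical, and lowercasing the query is an assumption that the vocabulary from Part 1A is lowercase.

```python
import numpy as np

def build_query_vector(query, vocab):
    """Count how often each vocabulary term occurs in the query."""
    vec = np.zeros(len(vocab), dtype=np.float64)
    # split() without arguments also collapses repeated whitespace
    for word in query.lower().split():
        term_id = vocab.get(word)
        if term_id is not None:  # skip out-of-vocabulary words
            vec[term_id] += 1
    return vec

toy_vocab = {"beaver": 0, "car": 1, "ride": 2}
toy_query_vector = build_query_vector("car ride car unknown", toy_vocab)
print(toy_query_vector)
```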


### Computing the ltc for queries:

$$ Logarithm\ term\ frequency \left( l \right): 1 + \log_{2}\left ( tf_{t,q} \right ) $$ 

$$ Inverse\ document\ frequency \left( t \right): \log_{2}\left(\frac{N+1}{df_{t}} \right) $$   

$$ Cosine\ normalisation \left( c \right): \sqrt{ \sum\limits_{i=1}^{T} {w^{2}_{i}}} $$  

In [23]:
#(l)tc for query vector

w_q = query_vector
w_q = np.log2(w_q, out=np.zeros_like(w_q), where=(query_vector!=0))
w_q = np.add(w_q, 1, out=np.zeros_like(w_q), where=(query_vector!=0))

#l(t)c for query vector

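# document frequency of each term: the number of documents that contain it
idf = np.sum(np.array(document_vectors!=0), axis=1, dtype=np.float64)
# idf = log2((N+1)/df), left at 0 for terms that occur in no document
idf = np.log2((len(docs)+1)/idf, out=np.zeros_like(idf), where=(idf!=0))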

#lt(c) for query vector

w_q = np.multiply(w_q, idf)
w_q = np.divide(w_q, np.linalg.norm(w_q))
w_q[np.isnan(w_q)] = 0
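The three ltc steps can be traced by hand on a toy corpus. The sketch below assumes a hypothetical helper `ltc_weights` and made-up counts; it follows the same formulas, with N taken as the number of document columns.

```python
import numpy as np

def ltc_weights(query_tf, doc_tf):
    """Return ltc weights for a query, given a terms x documents count matrix."""
    query_tf = np.asarray(query_tf, dtype=np.float64)
    doc_tf = np.asarray(doc_tf)
    n_docs = doc_tf.shape[1]
    # l: 1 + log2(tf) for terms present in the query
    l = np.zeros_like(query_tf)
    present = query_tf != 0
    l[present] = 1 + np.log2(query_tf[present])
    # t: log2((N + 1) / df), 0 for terms that occur in no document
    df = np.count_nonzero(doc_tf, axis=1).astype(np.float64)
    idf = np.zeros_like(df)
    seen = df != 0
    idf[seen] = np.log2((n_docs + 1) / df[seen])
    # c: cosine normalisation, leaving an all-zero query untouched
    w = l * idf
    norm = np.linalg.norm(w)
    return w / norm if norm > 0 else w

toy_doc_tf = np.array([[1, 0, 0],
                       [2, 4, 0],
                       [0, 1, 0]])
toy_w_q = ltc_weights([1, 1, 0], toy_doc_tf)
print(toy_w_q)
```

Here N = 3 and the document frequencies are 1, 2 and 1, so the idf values are log2(4/1) = 2, log2(4/2) = 1 and 2. The query weights before normalisation are therefore (2, 1, 0), and dividing by √5 gives the final vector.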

### Calculating Cosine Similarity

$$ \cos \left( d_{j}, q \right) = \frac {\sum\limits_{i=1}^{T}{w_{i,j}w_{i,q}}}{\sqrt{\sum\limits_{i=1}^{T}w_{i,j}^{2}} \sqrt{\sum\limits_{i=1}^{T}w_{i,q}^{2}}} $$     

Because all term weights here are non-negative, the cosine similarity lies in the range [0,1], with 0 meaning no similarity and 1 meaning complete similarity. 

Since we have already normalised the vectors, we only need their dot product to obtain the cosine similarity. To optimise performance, we compute the scores for all documents at once with vectorised NumPy operations instead of running a Python loop.

In [24]:
scores = np.multiply(w_q.reshape(-1,1), w_d)
scores = np.sum(scores, axis=0)

scores[np.isnan(scores)] = 0
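The scoring cell above multiplies the query weights into every document column and sums over the terms, which is the same as a single vector-matrix product. The sketch below checks this equivalence on random toy data; the `toy_` arrays are made up for the example. The `@` form may use less memory because it avoids materialising the intermediate terms x documents product.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_w_d = rng.random((5, 3))  # 5 terms x 3 documents
toy_w_q = rng.random(5)       # one weight per term

# Broadcast multiply followed by a sum over the term axis
scores_broadcast = np.sum(np.multiply(toy_w_q.reshape(-1, 1), toy_w_d), axis=0)

# Equivalent vector-matrix product: one score per document
scores_matmul = toy_w_q @ toy_w_d

print(np.allclose(scores_broadcast, scores_matmul))
```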

---

## Result

Here are the top 10 documents and their scores, as calculated by the IR system.

In [25]:
rank = np.argsort(scores)[-10:][::-1]

print("--- %s seconds ---" % (time.time() - start_time))

print("   score  | name of document")
print("-------------------------------")
for i in rank:
    print("%.8f" % scores[i], docs[i])

--- 6.664822101593018 seconds ---
   score  | name of document
-------------------------------
0.33215049 burak yilmaz (footballer born 1995)
0.29260527 robert petre (footballer, born 1997)
0.27089984 fabian miesenböck
0.24865531 michael steiner (footballer)
0.24648572 bernd kager
0.24648572 stefan schwendinger
0.24169913 mihai bordeianu
0.23718099 cosmin ciocoteală
0.23718099 romario moise
0.23718099 szilard vereș
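Sorting every score just to keep ten of them is simple but does more work than necessary. As an alternative sketch (not the method used in this notebook), `np.argpartition` can select the k largest scores first, after which only those k are sorted. The helper `top_k` and the toy scores are hypothetical.

```python
import numpy as np

def top_k(scores, k=10):
    """Return the indices of the k highest scores, best first."""
    scores = np.asarray(scores)
    k = min(k, scores.size)
    if k <= 0:
        return np.array([], dtype=int)
    # Partial selection: the last k positions hold the k largest scores, unordered
    candidates = np.argpartition(scores, -k)[-k:]
    # Sort only the k candidates in descending order of score
    return candidates[np.argsort(scores[candidates])[::-1]]

toy_scores = np.array([0.1, 0.9, 0.3, 0.7, 0.0])
toy_rank = top_k(toy_scores, k=3)
print(toy_rank)
```

The order among tied scores is not guaranteed to match the full `np.argsort` ranking, so documents with equal scores may appear in a different order.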
