## Part 2 A.2: Inverted Index Optimization: Search

We first import the required modules and the Inverted_Index class defined in the previous notebook.



In [1]:
import math
import numpy as np
import pandas as pd
import pickle
import sys
import string
import time
from inverted_index import Inverted_Index



For the search class, we use the Inverted_Index object created in the previous notebook. 


### Query Vector
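We take the input query and build a query vector using the ltc scheme described in Part 1: a logarithmic term frequency, multiplied by the inverse document frequency, followed by cosine normalization.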
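
As a rough illustration of the ltc weighting, the sketch below follows the same formula as the Search class in this notebook: (1 + log2(tf)) * log2((N + 1) / df), divided by the vector's Euclidean norm. The function name `ltc_query_vector` and the toy counts are assumptions made for this example; they are not part of the Inverted_Index class.

```python
import numpy as np

def ltc_query_vector(term_counts, doc_freqs, n_docs):
    """Return cosine-normalised ltc weights, one per term in term_counts.

    term_counts: term -> frequency of the term in the query
    doc_freqs:   term -> number of documents containing the term
    n_docs:      total number of documents in the collection
    """
    terms = list(term_counts)
    tf = np.array([term_counts[t] for t in terms], dtype=np.float64)
    df = np.array([doc_freqs[t] for t in terms], dtype=np.float64)
    # l: logarithmic tf, t: idf (with the N + 1 smoothing used in this notebook)
    weights = (1.0 + np.log2(tf)) * np.log2((n_docs + 1) / df)
    # c: cosine normalisation; guard against an all-zero vector
    norm = np.linalg.norm(weights)
    return weights / norm if norm > 0 else weights

# Hypothetical toy numbers, chosen only for illustration
vec = ltc_query_vector({"tax": 1, "planning": 1},
                       {"tax": 3, "planning": 1},
                       n_docs=7)
```

With these toy values, the rarer term ("planning") receives the larger weight, and the resulting vector has unit length.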

### Document Vector
For every term present in the query, we walk the posting list of that term and, for each document in the list, store the term's lnc weight (described in Part 1) in the document matrix. Documents that contain none of the query terms never get a row.
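
The traversal described above can be sketched with plain dictionaries. This is a simplified stand-in, not the notebook's implementation: here posting lists are Python lists of (docID, tf) pairs instead of linked lists, and `build_doc_weights`, `postings`, and `magnitude` are hypothetical names with made-up values. The weight formula, (1 + log2(tf)) divided by the precomputed document norm, is the one used in the Search class.

```python
import math

def build_doc_weights(query_terms, postings, magnitude):
    """Return docID -> {term: lnc weight}, only for documents that
    appear in the posting list of at least one query term."""
    rows = {}
    for term in query_terms:
        for doc_id, tf in postings.get(term, []):
            # create an all-zero row the first time a document is seen
            row = rows.setdefault(doc_id, {t: 0.0 for t in query_terms})
            # l: logarithmic tf, n: no idf, c: divide by the document norm
            row[term] = (1.0 + math.log2(tf)) / magnitude[doc_id]
    return rows

# Hypothetical toy index: term -> [(docID, tf), ...]
postings = {"tax": [(0, 2), (2, 1)], "planning": [(2, 4)]}
# Hypothetical precomputed document norms
magnitude = {0: 2.0, 2: 4.0}

rows = build_doc_weights(["tax", "planning"], postings, magnitude)
```

Only documents 0 and 2 get a row here, which is the point of driving the computation from the posting lists: the matrix stays as small as the set of candidate documents.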

### Scores
We simply take the dot product of the query vector with each row of the document matrix to obtain that document's score. The top K documents with the highest scores are then returned.
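
A minimal sketch of the scoring step, assuming the document matrix is a NumPy array with one row per candidate document and one column per query term. The helper name `top_k` and the numbers are illustrative assumptions; the notebook itself uses a pandas DataFrame for the same computation.

```python
import numpy as np

def top_k(doc_ids, doc_matrix, query_vector, k=10):
    """Score each row against the query and return the k best
    (docID, score) pairs, highest score first."""
    scores = doc_matrix @ query_vector            # one dot product per row
    order = np.argsort(-scores, kind="stable")[:k]  # slicing caps k at the row count
    return [(doc_ids[i], float(scores[i])) for i in order]

# Hypothetical toy values: three candidate documents, two query terms
doc_ids = [0, 2, 5]
doc_matrix = np.array([[1.00, 0.00],
                       [0.25, 0.75],
                       [0.00, 0.50]])
query_vector = np.array([0.6, 0.8])

ranked = top_k(doc_ids, doc_matrix, query_vector, k=2)
```

Here document 2 ranks first (score 0.75) and document 0 second (score 0.6); document 5 is cut off by k = 2.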


In [2]:
class Search:
    def __init__(self, index):
        self.index = index
    
    # k = number of results to be returned
    def search(self, query, k = 10):
        words = query.lower().strip().split()
        terms = {}
        for word in words:
            # skip query words that are not in the index vocabulary
            if word not in self.index.InvertedIndex:
                continue
            terms[word] = words.count(word)


        # finding ltc for query: (1 + log2(tf)) * log2((N + 1) / df)
        query_vector = np.add(np.log2(np.array(list(terms.values()), dtype=np.float64)),1)
        query_vector = np.multiply(query_vector, np.log2((self.index.no_documents+1) /np.array(list(map(lambda x: self.index.InvertedIndex[x][1], terms.keys())))))
        query_vector = query_vector/np.linalg.norm(query_vector)


        docs_vector = pd.DataFrame(columns = list(terms.keys()), dtype = np.float64)
        docs_encountered = dict()

        # finding lnc for document
        for word in list(terms.keys()):
            postinglist , df = self.index.InvertedIndex[word]
            head = postinglist.head

            # if the posting list is empty, no document contains the word
            if not head:
                continue

            # lnc applies no idf on the document side, so the idf factor is fixed at 1
            # (with idf it would be: math.log2((no_documents+1)/df))
            idf = 1
            while head:
                # initializing the row if it does not exist 
                if not docs_encountered.get(head.docID):
                    docs_vector.loc[head.docID] = 0
                    docs_encountered[head.docID] = True
                
                # docs_vector[docID][word] = (1 + log2(tf)) / norm of the document
                docs_vector.loc[head.docID, word] = ((1 + math.log2(head.tf))) / self.index.magnitude[head.docID]
                head = head.next
        
        
        # calculating lnc.ltc
        docs_vector['score$'] = docs_vector.dot(query_vector)

        # returning the top K searches
        docs_vector = docs_vector.sort_values(by='score$', ascending = False)
        if k > docs_vector.shape[0]:
            k = docs_vector.shape[0]
            
        # copy the slice so that adding a column does not modify a view of docs_vector
        results = docs_vector[['score$']].iloc[:k].copy()
        # formatting the results
        results['Title'] = list(map(lambda x : self.index.docs[x], results.index))
        results.reset_index(drop=True, inplace=True)
        return results[['Title', 'score$']]
        

We load the Inverted_Index object that was saved in Part 2 A.1.

In [3]:
# the with-block closes the file even if unpickling fails
with open("inverted_index_obj.pickle", "rb") as file:
    index = pickle.load(file)

We create an object of the Search class defined above, which gives us a working IR system.

In [4]:
system = Search(index)

---

## Test Queries

We have provided some sample queries to try out. Please uncomment the relevant line before running the query. There is also an option to enter your own query.

To measure the execution time of the query, we record the starting time in the start_time variable.
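
As an aside, the same measurement can be wrapped in a small helper. The sketch below uses `time.perf_counter()`, which is a monotonic, high-resolution clock intended for timing short intervals; the helper name `timed` is an assumption made for this example, and the cell below keeps the original `time.time()` approach.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; with the objects in this notebook one could
# call timed(system.search, query) instead
result, elapsed = timed(sum, range(1000))
print("--- Time taken: %s seconds ---" % elapsed)
```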


In [6]:
start_time = time.time()
query = "tax planning strategies"
# query = "north american beaver"
# query = "car ride"
# query = "falling off the edge of the world"
# query = "guitar musician"

# query = input()

results = system.search(query)
print(results)

print("--- Time taken: %s seconds ---" % (time.time() - start_time))


                                     Title    score$
0                 tax attractiveness index  0.146361
1                                   axykno  0.107383
2                clarehall shopping centre  0.090940
3                       flow-through share  0.086226
4              modified endowment contract  0.082823
5                          thirtieth (tax)  0.077450
6                             ottoman ayan  0.074638
7                              allelengyon  0.068978
8                        jorge roca suarez  0.053362
9  m/s r.m.d.c (mysore) v. state of mysore  0.052988
--- Time taken: 0.2368786334991455 seconds ---


We find that the results are exactly the same as those found in Part A. This is because the optimization does not change the score or the ranking of any document; it only affects the space used by the system.

In our testing, we got a running time of 0.66 seconds; the run shown above took about 0.24 seconds.