## Part 2 A.1: Inverted Index Optimization: Index Creation

As evaluated in the vector space model, 99.84% of the document matrix was empty and hence, a lot of space was being wasted. This could be prevented by using the inverted index representation. 


For every term present in the vocabolary, there exists a posting list which contains all the documents and their term frequency as nodes.

In [1]:
import string
import re
import math
import numpy as np
import pandas as pd
import pickle
import sys 

To view the memory usage by the program, tracemalloc module is used. The values of current and peak memory usage are printed at the end of the program.



In [2]:
import tracemalloc

tracemalloc.start()

The follow class contains:
+ Dictionary from term to posting list
+ Dictionary that stores the total magnitude (c-normalization) for every document
+ Function to initialize the above mentioned dictionaries
+ Class definitions for posting list and its nodes


Steps similar to vector space generation are followed, but instead of two passes, only one is required as we process a whole document at once and the following algorithm is used:
+ if term does not exist in dictionary, create a posting list for the term and initialize document frequency to 1
+ if term exists in the dictionary, and the document is not in posting list, add the node to the posting list initializing the term frequency to 1 and updating the document frequency of the term.
+ if term exists in the dictionary, and the document exists in the posting list, update its term frequency.
+ While processing a single document, seperately store all the term frequencies in a vector and find its magnitude for c-normalization

All the frequencies are stored as integers and not in its logarithmic float variants to conserve space 




In [3]:
class Inverted_Index:
    def __init__(self, FILENAME):
        self.FILENAME = FILENAME
        # dictionary from docID to docTitle
        self.docs = {}

        # magnitude of tf vector of each document 
        self.magnitude = dict()
        
        # term -> posting list dictionary  
        self.InvertedIndex = {}

        self.initialize()
    


    class PostingList:
        def __init__(self, docID):
            self.head = self.Node(docID)
            self.tail = self.Node(docID)

        # adding a word to the posting list 
        def add(self, docID):
            if self.tail.docID == docID:
                if self.head.docID == self.tail.docID:
                    self.head.tf += 1
                self.tail.tf += 1
            else:
                temp = self.Node(docID)
                if self.head.docID == self.tail.docID:
                    self.head.next = temp
                self.tail.next = temp
                self.tail = temp

        # nodes in the posting list 
        class Node:
            def __init__(self, docID):
                self.docID = docID
                self.tf = 1
                self.next = None


    def initialize(self):
        with open(self.FILENAME) as file:
            seen = {}
            for line in file:
                line = line.lower()
                check = re.search('<doc id="(.*?)" url="(.*?)" title="(.*?)">', line)
                if check:
                    # 0 -> entire string, 1 -> id, 2 -> url, 3 -> title
                    seen = {}
                    self.docs[int(check[1])] = check[3]
                    current_doc = int(check[1])
                    continue
                    
                if line == '\n' or line == '</doc>\n':
                    continue
                
                # cleaning the line 
                line = re.sub('<a.*?>|</a>', '', line)
                line = re.sub('-|–|—', ' ', line)
                line = line.translate(str.maketrans("","", string.punctuation))

                # for every word encountered in document, add it to posting list 
                for term in line.strip().split(' '):
                    if term in self.InvertedIndex.keys():
                        self.InvertedIndex[term][0].add(current_doc)
                    else:
                        self.InvertedIndex[term] = [self.PostingList(current_doc), 0]

                    if not term in seen.keys():
                        seen[term] = 1
                        self.InvertedIndex[term][1] += 1
                    else:
                        seen[term] += 1

                # magnitude of (1+log(tf)) vector for each document 
                self.magnitude[current_doc] = np.linalg.norm(np.array(list(map(lambda x : (1+math.log2(x)) ,seen.values()))))

        self.no_documents = len(self.docs)

### Creating and saving the model
We create a new model for the wiki_93 dataset. The object created is saved using the pickle module

In [4]:
# Creating a new system for the wiki_93 file
index = Inverted_Index('./wiki_93')



In [5]:
sys.setrecursionlimit(10**6)
file = open("inverted_index_obj.pickle", "wb")
pickle.dump(index, file)
file.close()

In [6]:
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage is {current / 10**6}MB; Peak was {peak / 10**6}MB")
tracemalloc.stop()

Current memory usage is 126.894685MB; Peak was 227.810828MB


In our testing, this index creation program had a peak memory usage of 247.7MB. Also the final pickle object has the size 26.7 MB Which is a huge improvement from only vector space model