# Information Retrieval Assignment  
## Part 1A: Vector Space Model

---

In this model, each document is represented as a vector of $\left| V \right|$ dimensions, where $V$ is the vocabulary, i.e. the set of all terms present in the corpus. The document vectors are stored together in a matrix whose rows are the terms and whose columns are the documents.
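As a rough illustration of this layout, here is a minimal sketch on a toy corpus. The names `toy_docs`, `toy_vocab` and `toy_matrix` and the two sentences are made up for the example and are not part of the assignment data.

```python
# Toy corpus: hypothetical documents, not taken from the wiki file.
toy_docs = {
    "doc_a": "the cat sat on the mat",
    "doc_b": "the dog sat",
}

# Vocabulary V: every distinct term in the corpus, sorted for a stable row order.
toy_vocab = sorted({term for text in toy_docs.values() for term in text.split()})

# toy_matrix[i][j] = how often term i occurs in document j
# (rows are terms, columns are documents).
toy_matrix = [
    [text.split().count(term) for text in toy_docs.values()]
    for term in toy_vocab
]

for term, row in zip(toy_vocab, toy_matrix):
    print(f"{term:>4}", row)
```

Each column read top to bottom is one document vector with $\left| V \right|$ entries.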

In [1]:
import string
import re
import numpy as np
import pandas as pd
import shelve
import pickle

To measure the program's memory usage, the `tracemalloc` module is used. The current and peak memory usage are printed at the end of the program.

In [2]:
import tracemalloc

tracemalloc.start()
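For readers unfamiliar with `tracemalloc`, the following standalone sketch shows the start / measure / stop pattern on its own. The `payload` list is only an illustrative allocation and is unrelated to the index. Running it inside this notebook would stop the tracing started above, so it is meant to be tried separately.

```python
import tracemalloc

tracemalloc.start()

# Allocate something measurable (an illustrative list, unrelated to the index).
payload = [str(i) for i in range(100_000)]

# Both values are reported in bytes and only cover allocations made since start().
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current = {current / 10**6:.2f} MB, peak = {peak / 10**6:.2f} MB")
```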

In [4]:
FILENAME = './wiki_93'
# FILENAME = input('Please enter path to a wiki file: ')

### First Pass

To represent the documents as vectors, we first need a vocabulary set and a list of the documents.
In the first pass, we therefore parse the given file to generate the following:
- vocabulary set
- a list of all documents in the file

Each line in the file is:
1. Converted to lower case to remove case sensitivity
2. Matched against a regular expression to check whether it is a document opening tag `<doc ...>`
    - If it is, the document title is appended to the `docs` list
3. If not, all HTML anchor tags and punctuation are removed, and hyphens and dashes are replaced with spaces
4. The line is split into words and the `vocab` set is updated with the words

In [5]:
vocab = set()
docs = []

with open(FILENAME) as file:
    for line in file:
        line = line.lower()
        check = re.search('<doc id="(.*?)" url="(.*?)" title="(.*?)">', line)
        if check:
            # 0 -> entire string, 1 -> id, 2 -> url, 3 -> title
            docs.append(check[3])
            continue

        if line == '\n' or line == '</doc>\n':
            continue

        line = re.sub('<a.*?>|</a>', '', line)
        line = re.sub('-|–|—', ' ', line)
        line = line.translate(str.maketrans("","", string.punctuation))
        vocab.update(set(line.strip().split(' ')))
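To see what the cleaning steps do to a single line, here is a small sketch that factors them into a helper. The function name `clean_line` and the `sample` string are assumptions made for this example. It also shows that `split(' ')` produces an empty-string token wherever two spaces are adjacent, which is one likely reason an empty term appears as the first row of the matrix displayed later.

```python
import re
import string

def clean_line(line):
    """Apply the same cleaning steps as the first pass to one line of text."""
    line = line.lower()
    line = re.sub(r'<a.*?>|</a>', '', line)   # drop anchor tags, keep their text
    line = re.sub(r'-|–|—', ' ', line)        # hyphens and dashes become spaces
    return line.translate(str.maketrans('', '', string.punctuation))

# Hypothetical input line with an anchor tag, a hyphen and a double space.
sample = 'The <a href="x">Well-Known</a> band,  formed in 2015.\n'
cleaned = clean_line(sample).strip()

tokens_single_space = cleaned.split(' ')  # what the cell above does
tokens_whitespace = cleaned.split()       # splits on any run of whitespace

print(tokens_single_space)
print(tokens_whitespace)
```

Switching the notebook to `split()` would drop the empty token, but it would also change the vocabulary size reported below, so it is shown here only as an option.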

In [6]:
print('The size of the vocabulary is:', len(vocab))
print('The number of documents is:', len(docs))

The size of the vocabulary is: 75614
The number of documents is: 5784


Each term in the `vocab` set is given an ID and stored in a term -> term ID dictionary.

In [7]:
vocab_dict = {word: index for index, word in enumerate(sorted(vocab))}
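A tiny sketch of the same comprehension on a made-up vocabulary (`toy_vocab` is hypothetical) shows how the IDs are assigned. Because the terms are sorted as strings, the empty string comes first and digit strings come before letters, which matches the row order seen in the matrix later on.

```python
# Hypothetical mini vocabulary, including an empty string and numeric terms.
toy_vocab = {'banana', '2015', 'apple', '10', ''}

# Sorting makes the term -> ID assignment deterministic across runs.
toy_vocab_dict = {word: index for index, word in enumerate(sorted(toy_vocab))}

print(toy_vocab_dict)
```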

### Second Pass

With the vocabulary set and document list generated, we can initialise and populate the document vectors.

Here, the file is parsed again, and for each occurrence of a term in a document, the relevant cell is incremented, so that it ends up holding the term frequency of that term in that document.

`document_vectors` is a table with the terms as rows and the documents as columns.

---
Warning: This block takes a couple of minutes to run.

In [8]:
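# np.byte is a signed 8-bit integer (max 127), chosen to keep the table small;
# a term occurring more than 127 times in one document does not fit in this dtype.
document_vectors = pd.DataFrame(0, index=list(vocab_dict), columns=docs, dtype=np.byte)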

with open(FILENAME) as file:
    for line in file:
        line = line.lower()
        check = re.search('<doc id="(.*?)" url="(.*?)" title="(.*?)">', line)
        if check:
            # 0 -> entire string, 1 -> id, 2 -> url, 3 -> title
            current_doc = check[3]
            continue

        if line == '\n' or line == '</doc>\n':
            continue

        line = re.sub('<a.*?>|</a>', '', line)
        line = re.sub('-|–|—', ' ', line)
        line = line.translate(str.maketrans("","", string.punctuation))
        for word in line.strip().split(' '):
            document_vectors.at[word, current_doc] += 1
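Updating one DataFrame cell at a time is probably a large part of why this cell is slow. One possible alternative, sketched below under the assumption that the file has already been parsed into `(title, tokens)` pairs, is to count term frequencies in plain Python first and only store the non-zero counts. The `parsed` list and the `term_counts` name are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (title, tokens) pairs standing in for the parsed wiki file.
parsed = [
    ("doc_a", ["the", "cat", "sat", "on", "the", "mat"]),
    ("doc_b", ["the", "dog", "sat"]),
]

# term_counts[title][term] -> term frequency; only non-zero counts are stored.
term_counts = defaultdict(Counter)
for title, tokens in parsed:
    term_counts[title].update(tokens)

print(term_counts["doc_a"]["the"])  # term that occurs twice in doc_a
print(term_counts["doc_b"]["cat"])  # absent term: Counter reports 0 without storing it
```

The counts could then be written into the table one document at a time, or kept in this form if the dense matrix is not needed.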

In [9]:
document_vectors

Unnamed: 0,tsalta baptiste,karl richter (tennis),matrix ab,sergey golubitskiy,west side story (earl hines album),2016 tipperary senior hurling championship,mario mosböck,faith baptist college,volodymyr dykyi,félix baumaine,...,pinkwash (band),promo azteca,strange creek (west virginia),strange creek,collective sigh,dileep agrawal,strouds creek,mystic marathon,the daddy issues,vicky astori
,4,0,0,16,0,0,0,0,0,3,...,0,0,0,0,0,0,0,1,0,2
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
00,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
000,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
𐩦𐩧𐩢,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
𐩦𐩲𐩧𐩣,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
𐩱𐩡,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
𐩱𐩥𐩩𐩧𐩣,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


A fair number of these vocabulary terms are numbers. We did not remove them, since they might be relevant to some queries, e.g. "Football match in 2015".

As you can already see, most of these values are zero. This is because even a term that occurs in only one of 5000 documents is assigned a term ID and given a full row, which leaves 4999 empty cells in that row. In this case, over 99% of the cells were empty.

In [10]:
numberOfZeros = (document_vectors == 0).sum().sum()
totalCells = document_vectors.shape[0]*document_vectors.shape[1]
print("Number of cells in the Inverted Index Matrix:", totalCells)
print("Number of cells which are empty:", numberOfZeros)
print("Percentage of empty cells:", round(numberOfZeros/totalCells * 100, 4), "%")

Number of cells in the Inverted Index Matrix: 437351376
Number of cells which are empty: 436736378
Percentage of empty cells: 99.8594 %
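The figures printed above can be cross-checked with a little arithmetic. The sketch below uses only the numbers reported in this notebook; the variable names are chosen for the example.

```python
# Figures reported in the outputs above.
n_terms, n_docs = 75614, 5784
total_cells = 437351376
empty_cells = 436736378

# The matrix has one cell per (term, document) pair.
assert n_terms * n_docs == total_cells

non_zero = total_cells - empty_cells    # cells holding an actual count
density = non_zero / total_cells        # fraction of cells that are non-zero
avg_terms_per_doc = non_zero / n_docs   # distinct terms per document, on average

print(non_zero)                         # 614998
print(f"{density * 100:.4f} %")         # 0.1406 %
print(round(avg_terms_per_doc, 1))      # 106.3
```

In other words, only about 615 thousand of the roughly 437 million cells carry information, which is why a dense table is an expensive way to store this matrix.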


---

Finally, we write the `document_vectors` table to a file so that it can be used to retrieve and rank documents for user queries. This may take up to a few minutes, depending on the size of the corpus.

In [11]:
document_vectors.to_csv('vector_space_model.csv')

current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage is {current / 10**6}MB; Peak was {peak / 10**6}MB")
tracemalloc.stop()

Current memory usage is 482.666055MB; Peak was 917.649134MB


In our testing, this index creation program had a peak memory usage of 917.65 MB.
The resulting CSV file takes up roughly 900 MB of disk space.