# Assignment 2A, Part 1: Indexer

Index the document collection and save the index to disk.  

**IMPORTANT**: The collection and index take up several hundred megabytes. Do NOT push them to GitHub!

It is recommended that you work on a small sample of documents while developing your solution. You only need to build the full index once you reach Part 2 of the assignment, since you may discover later that certain refinements are needed.
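
One way to get such a sample is to collect only the first few `.gz` files under the collection directory. The sketch below is a hypothetical helper (`iter_gz_files` and `sample_files` are not part of the assignment code) and assumes the collection is stored as `.gz` files in nested directories.

```python
import os
from itertools import islice


def iter_gz_files(path):
    """Yield the paths of all .gz files under `path` in a reproducible order."""
    for root, dirs, files in os.walk(path):
        dirs.sort()  # sorting in place makes os.walk visit subdirectories in a fixed order
        for name in sorted(files):
            if name.endswith(".gz"):
                yield os.path.join(root, name)


def sample_files(path, limit=5):
    """Return at most `limit` .gz files to use as a development sample."""
    return list(islice(iter_gz_files(path), limit))
```

Each returned path could then be passed to `index_file()` one by one instead of walking the whole collection.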

You have two main options to implement the inverted index: (1) all by yourself from scratch or (2) using the [HashedIndex](https://pypi.org/project/hashedindex/) Python library. There is no third option.

You are required to adhere to the structure provided below.

The code for parsing the gzip files in the collection is already given.

You may decide to build two separate indices for the two document fields (title and content) or to keep them together in the same structure.
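
For reference, option (1) can be quite small. The following is a minimal sketch of a from-scratch inverted index with one instance per field; the class `SimpleInvertedIndex`, the toy documents, and the whitespace tokenization are assumptions made for illustration and are not what the rest of this notebook uses (it uses HashedIndex).

```python
from collections import defaultdict


class SimpleInvertedIndex:
    """Toy inverted index: maps each term to {doc_id: term frequency}."""

    def __init__(self):
        self.postings = defaultdict(dict)    # term -> {doc_id: tf}
        self.doc_lengths = defaultdict(int)  # doc_id -> number of indexed tokens

    def add_term_occurrence(self, term, doc_id):
        """Record one occurrence of `term` in document `doc_id`."""
        self.postings[term][doc_id] = self.postings[term].get(doc_id, 0) + 1
        self.doc_lengths[doc_id] += 1

    def get_documents(self, term):
        """Return {doc_id: tf} for `term`, or an empty dict if the term is unseen."""
        return self.postings.get(term, {})


# Two separate indices, one per document field.
toy_title_index = SimpleInvertedIndex()
toy_content_index = SimpleInvertedIndex()

toy_docs = {
    "d1": {"title": "index basics", "content": "an index maps terms to documents"},
    "d2": {"title": "search", "content": "search uses the index"},
}

for doc_id, doc in toy_docs.items():
    for token in doc["title"].lower().split():
        toy_title_index.add_term_occurrence(token, doc_id)
    for token in doc["content"].lower().split():
        toy_content_index.add_term_occurrence(token, doc_id)
```

Keeping the fields in separate structures makes it easy to compute field-specific statistics (document lengths, document frequencies) later on.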

In [2]:
# put jupyter display in fullscreen
from IPython.display import display, HTML  # IPython.core.display is deprecated for these imports
display(HTML("<style>.container { width:100% !important; }</style>"))

In [16]:
import re
import gzip
import pickle
import os
import math

from bs4 import BeautifulSoup
from hashedindex import HashedIndex
from hashedindex import textparser
from statistics import mean
from tqdm import tqdm
from nltk.corpus import stopwords
stopwords = stopwords.words('english')

from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

N_GRAMS = 1

In [17]:
# stemming does not appear to be useful
def stem_doc(doc):
    doc['title'] = ' '.join([stemmer.stem(word) for word in word_tokenize(doc['title'])])
    doc['content'] = ' '.join([stemmer.stem(word) for word in word_tokenize(doc['content'])])
    return doc

def tokenize_doc(doc):
    doc['title'] = ' '.join(word_tokenize(doc['title']))
    doc['content'] = ' '.join(word_tokenize(doc['content']))
    return doc

def add_docs_bulk(docs, title_index, content_index):
    
    for doc_id, doc in docs.items():
        #print("Indexing document {}".format(doc_id))
        #doc = stem_doc(doc)
        
        for token in textparser.word_tokenize(doc['title'], stopwords=stopwords, ngrams=N_GRAMS):
            title_index.add_term_occurrence(token, doc_id)
        for token in textparser.word_tokenize(doc['content'], stopwords=stopwords, ngrams=N_GRAMS):
            content_index.add_term_occurrence(token, doc_id)

## Indexing a given data file

**NOTE**: Each source gzip file contains several documents. The method below parses a source file and then calls `add_docs_bulk()` to bulk-index all documents in that file.

In [4]:
def index_file(file_name, title_index, content_index):
    #print("Processing", file_name)
    
    with gzip.open(file_name, "rt") as fin:
        is_body = False
        docs = {}
        doc_id, body = None, None
        for line in fin:
            line = line.strip()
            if line.startswith("<DOCNO>"):  # get doc id
                doc_id = re.sub(r"</?DOCNO>", "", line).strip()  # drop the tags and surrounding whitespace
            elif line.startswith("<BODY>"):  # start to parse body
                is_body = True
                body = []
            elif line.startswith("</BODY>"):  # finished reading body
                soup = BeautifulSoup("\n".join(body), "lxml")
                headline = soup.find("headline")
                text = soup.find("text")
                docs[doc_id] = {
                    "title": headline.text if headline is not None else "",  # use an empty string if no <HEADLINE> found
                    "content": text.text if text is not None else ""  # everything inside <TEXT> is indexed as content
                }
                # get ready for next document
                doc_id = None
                is_body = False
            elif is_body:  # accumulate body content
                body.append(line)
            
        # bulk index the collected documents
        #print("Bulk indexing", len(docs), "documents")
        add_docs_bulk(docs, title_index, content_index)

## Indexing the entire collection

**TODO**: Complete (currently, indexing only a single gzip file for testing purposes)
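
The statistics precomputed in `index_collection()` follow two simple closed forms: the BM25-style IDF, `log((N - df + 0.5) / (df + 0.5))`, and the collection term probability, `sum_tf / sum_length`. The sketch below isolates them as standalone functions so they can be checked on small numbers; the function names `bm25_idf` and `collection_probability` are introduced here for illustration only.

```python
import math


def bm25_idf(collection_size, doc_freq):
    """IDF as used in index_collection(): log((N - df + 0.5) / (df + 0.5))."""
    return math.log((collection_size - doc_freq + 0.5) / (doc_freq + 0.5))


def collection_probability(total_term_freq, total_length):
    """Probability of a term in the whole collection: sum of tf / sum of document lengths."""
    return total_term_freq / total_length


# Rare terms get a higher IDF than frequent ones.
rare = bm25_idf(collection_size=1000, doc_freq=10)
common = bm25_idf(collection_size=1000, doc_freq=100)

# A term that occurs in more than half of the documents gets a negative IDF.
very_common = bm25_idf(collection_size=1000, doc_freq=600)
```

Note that this IDF variant becomes negative once a term appears in more than half of the documents, which may matter when the values are used for scoring in Part 2.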

In [5]:
def index_collection(path):
    
    title_index = HashedIndex()
    content_index = HashedIndex()
    
    for root, _, files in os.walk(path):
        for file in tqdm(files, desc='processing files'):
            if os.path.splitext(file)[1] == ".gz":
                filename = os.path.join(root, file)
                index_file(filename, title_index, content_index)
    
    # precompute IDF, document lengths, average doc length -> BM25
    content_collection_size = len(content_index.documents())
    content_doc_length = {doc:content_index.get_document_length(doc) for doc in content_index.documents()}
    average_content_length = mean([*content_doc_length.values()])
    content_df = {term: len(content_index.get_documents(term)) for term in content_index.terms()}  # document frequency per term
    content_idf = {term: math.log((content_collection_size - df + 0.5) / (df + 0.5)) for term, df in content_df.items()}
    
    title_collection_size = len(title_index.documents())
    title_doc_length = {doc:title_index.get_document_length(doc) for doc in title_index.documents()}
    average_title_length = mean([*title_doc_length.values()])
    title_df = {term: len(title_index.get_documents(term)) for term in title_index.terms()}  # document frequency per term
    title_idf = {term: math.log((title_collection_size - df + 0.5) / (df + 0.5)) for term, df in title_df.items()}
    
    # precompute tf_sum, document_length sum, collection term probability
    content_sum_tf = {term:content_index.get_total_term_frequency(term) for term in content_index.terms()}
    content_sum_length = sum(content_doc_length.values())
    content_collection_probability = {term:content_sum_tf[term]/content_sum_length for term in content_index.terms()}
    
    title_sum_tf = {term:title_index.get_total_term_frequency(term) for term in title_index.terms()}
    title_sum_length = sum(title_doc_length.values())
    title_collection_probability = {term:title_sum_tf[term]/title_sum_length for term in title_index.terms()}
    
    
    collection_index = dict(
        content_index=content_index,
        title_index=title_index,
        
        content_doc_length=content_doc_length,
        title_doc_length=title_doc_length,
        
        average_content_length=average_content_length,
        average_title_length=average_title_length,
        
        content_idf=content_idf,
        title_idf=title_idf,
        
        content_sum_tf=content_sum_tf,
        title_sum_tf=title_sum_tf,
        
        content_sum_length=content_sum_length,
        title_sum_length=title_sum_length,
        
        content_collection_probability=content_collection_probability,
        title_collection_probability=title_collection_probability
    )
        
    return collection_index

In [20]:
reverted_index = index_collection('data/aquaint')

# during dev, index a subset
#reverted_index = index_collection('data/aquaint/xie/2000/')

processing files: 100%|██████████| 4/4 [00:00<00:00, 30840.47it/s]
processing files: 100%|██████████| 1/1 [00:00<00:00, 12192.74it/s]
processing files: 100%|██████████| 356/356 [06:40<00:00,  1.13s/it]
processing files: 100%|██████████| 249/249 [04:31<00:00,  1.09s/it]
processing files: 100%|██████████| 213/213 [04:09<00:00,  1.17s/it]
processing files: 100%|██████████| 1/1 [00:00<00:00, 9642.08it/s]
processing files: 100%|██████████| 304/304 [02:24<00:00,  2.10it/s]
processing files: 100%|██████████| 274/274 [01:41<00:00,  2.70it/s]
processing files: 100%|██████████| 215/215 [02:24<00:00,  1.49it/s]
processing files: 100%|██████████| 1/1 [00:00<00:00, 4755.45it/s]
processing files: 100%|██████████| 365/365 [01:31<00:00,  3.98it/s]
processing files: 100%|██████████| 365/365 [01:27<00:00,  4.18it/s]
processing files: 100%|██████████| 365/365 [01:38<00:00,  3.70it/s]
processing files: 100%|██████████| 274/274 [01:16<00:00,  3.58it/s]
processing files: 100%|██████████| 365/365 [01:38<00:0

In [21]:
reverted_index['content_idf']

{('los',): 2.7876916382219017,
 ('angeles',): 2.837143987096802,
 ('every',): 2.0580238954153556,
 ('year',): 0.2700673564555914,
 ('millions',): 3.554803248894958,
 ('americans',): 2.9701663492949715,
 ('pledge',): 4.949992938568511,
 ('put',): 1.9527947085646737,
 ('financial',): 2.285743320520581,
 ('house',): 2.0200679611389583,
 ('order',): 2.65511770849221,
 ('say',): 1.664836001633086,
 ('different',): 2.5588510213056668,
 ('theyll',): 4.423547561889047,
 ('save',): 3.595884358018536,
 ('invest',): 4.255595733863187,
 ('even',): 1.4119520962910492,
 ('stick',): 4.402549012166761,
 ('budget',): 3.0586032003984904,
 ('indeed',): 3.7995104582559454,
 ('second',): 1.5599641922100542,
 ('popular',): 3.1641292829929486,
 ('new',): 0.18594400037986678,
 ('resolution',): 4.0213223696093445,
 ('achieve',): 4.0793923242876495,
 ('goals',): 3.768391671090999,
 ('according',): 1.5267505931326868,
 ('citibank',): 7.097775156488056,
 ('first',): 0.5516078040296771,
 ('lose',): 3.5294671462742

**TODO**: Save the index to disk (make sure that the index directory is added to `.gitignore`)

In [22]:
def save_index(index):
    with open('data/basic_index_new_idf.dat', 'wb') as f:
        pickle.dump(index, f)

In [23]:
save_index(reverted_index)
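
To use the saved index later, it has to be read back with `pickle`. The following is a minimal sketch of a loading counterpart to `save_index()`; the function `load_index` and the round trip through a temporary directory are assumptions for illustration, not part of the original notebook.

```python
import os
import pickle
import tempfile


def load_index(path):
    """Load a pickled index dictionary from `path`."""
    with open(path, 'rb') as f:
        return pickle.load(f)


# Round-trip check on a tiny stand-in dictionary (not the real index).
toy_index = {'content_idf': {('year',): 0.27}, 'average_content_length': 3.5}
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy_index.dat')
    with open(toy_path, 'wb') as f:
        pickle.dump(toy_index, f)
    restored = load_index(toy_path)
```

Unpickling the real index requires the `hashedindex` package to be importable, because the pickled dictionary contains `HashedIndex` objects. Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code.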