# Assignment 2A, Part 1: Indexer

Index the document collection and save the index to disk.  

**IMPORTANT**: The collection and index take up several hundred megabytes. Do NOT push them to GitHub!

It is recommended that you work on a small sample of documents while developing your solution. It is enough to build the full index once you get to Part 2 of the assignment, as you may realize later that certain refinements are needed.
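One way to work on a small sample is to pick only the first few gzip files under the data folder. The sketch below is a minimal illustration; `sample_gz_files` and its `limit` parameter are hypothetical names and not part of the provided code.

```python
from itertools import islice
from pathlib import Path


def sample_gz_files(data_folder, limit=5):
    """Return up to `limit` .gz files found under data_folder, in sorted order.

    Hypothetical helper for development; not part of the assignment skeleton.
    """
    paths = sorted(Path(data_folder).rglob("*.gz"))
    return list(islice(paths, limit))
```

Each returned path could then be passed, as a string, to the file-level indexing function while you are still developing; the full collection can be indexed later with the limit removed.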

You have two main options to implement the inverted index: (1) all by yourself from scratch or (2) using the [HashedIndex](https://pypi.org/project/hashedindex/) Python library. There is no third option.

You are required to adhere to the structure provided below.

The code for parsing the gzip files in the collection is already given.

You may decide to build two separate indices for the two document fields (title and content) or to keep them together in the same structure.
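As an illustration of the "separate indices per field" choice, here is a minimal from-scratch sketch that keeps one postings table per field. The class name `FieldedIndex`, its methods, and the toy documents are assumptions made for this example only; tokenization is plain whitespace splitting, so a real solution would need proper preprocessing.

```python
from collections import Counter, defaultdict


class FieldedIndex:
    """Toy inverted index with one postings table per document field."""

    def __init__(self, fields=("title", "content")):
        # field -> term -> Counter({doc_id: term frequency})
        self.postings = {f: defaultdict(Counter) for f in fields}

    def add(self, doc_id, doc):
        for field, table in self.postings.items():
            # Whitespace tokenization only; real preprocessing is out of scope here.
            for term in doc.get(field, "").lower().split():
                table[term][doc_id] += 1

    def get_documents(self, field, term):
        # .get avoids creating an empty entry for unseen terms
        return self.postings[field].get(term, Counter())


# Toy documents, made up for illustration
index = FieldedIndex()
index.add("d1", {"title": "Midnight train", "content": "the train left at midnight"})
index.add("d2", {"title": "Morning news", "content": "midnight midnight"})
print(index.get_documents("content", "midnight"))
```

Keeping the fields apart makes it possible to weight title and content matches differently later, at the cost of storing two postings tables.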

In [1]:
import re
import gzip
from bs4 import BeautifulSoup
import numpy as np

In [2]:
#def add_docs_bulk(docs):
#    for doc_id, doc in docs.items():
#        # TODO: complete
#        print("Indexing document {}".format(doc_id))
def add_docs_bulk(indexer, docs):
    for doc_id, doc in docs.items():
        indexer.add(doc["content"], doc_id)

## Indexing a given data file

**NOTE**: Each source gzip file contains several documents. The method below parses a source file and then calls `add_docs_bulk()` to bulk-index all of its documents.

In [3]:
def index_file(file_name,indexer):
    print("Processing {0}".format(file_name))
    
    with gzip.open(file_name, "rt") as fin:
        is_body = False
        docs = {}
        doc_id, body = None, None
        for line in fin:
            line = line.strip()
            if line.startswith("<DOCNO>"):  # get doc id
                doc_id = re.sub("<DOCNO> | </DOCNO>", "", line)
            elif line.startswith("<BODY>"):  # start to parse body
                is_body = True
                body = []
            elif line.startswith("</BODY>"):  # finished reading body
                soup = BeautifulSoup("\n".join(body), "lxml")
                headline = soup.find("headline")
                text = soup.find("text")
                docs[doc_id] = {
                    "title": headline.text if headline is not None else "",  # use an empty string if no <HEADLINE> found
                    "content": text.text if text is not None else ""  # everything inside <TEXT> is indexed as content
                }
                # get ready for next document
                doc_id = None
                is_body = False
            elif is_body:  # accumulate body content
                body.append(line)

        # bulk index the collected documents
        print("Bulk indexing", len(docs), "documents")  
        add_docs_bulk(indexer,docs)

## Indexing the entire collection

**TODO**: Complete (currently, indexing only a single gzip file for testing purposes)

In [4]:
#index_file("data/aquaint/nyt/2000/20000101_NYT.gz")

**TODO**: Save the index to disk (make sure that the index directory is added to `.gitignore`)

#### Note: install the hashedindex package in conda using the following steps:
- `conda skeleton pypi hashedindex`
- `conda build hashedindex`

In [5]:
import csv
import os
import pickle
import string
import unicodedata

import hashedindex
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

In [9]:
class Indexer():
    def __init__(self):
        self.stemmer = PorterStemmer()
        self.stopwords = set(stopwords.words('english'))
        self.index = hashedindex.HashedIndex()
        self.document_length = {}

    def text_preprocessor(self, document):
        """Preprocess the text and return a list of word tokens."""
        clean_tokens = []
        
        document = unicodedata.normalize('NFKD', document).encode('ascii', 'ignore').decode('utf-8', 'ignore') # remove non ascii
        document = document.lower() #convert to lowercase
        document = document.translate(str.maketrans('', '', string.punctuation)) #remove punctuations
            
        #break text into tokens
        for token in nltk.word_tokenize(document):
             
            # skip stop words
            if token in self.stopwords:
                continue
            
            token = self.stemmer.stem(token) # stem    
            clean_tokens.append(token)
            
        return clean_tokens
    
    def add(self, document, docID):
        """Add the document's text to the hashed index."""
        
        words = self.text_preprocessor(document)
        for word in words:
            self.index.add_term_occurrence(word, docID)
            
        # record the document length (number of characters in the raw text)
        self.document_length[docID] = len(document)
        
    
    def save_to_file(self, filename="data/index/Indices.csv"):
        """
        Save the index to a CSV file.
        Format: term, Counter({'NYT20000101.0001': 1}) <-- probably will have to change this
        """
        # newline='' is what the csv module expects for files passed to csv.writer
        with open(filename, mode='w', newline='') as csv_file:
            index_writer = csv.writer(csv_file, delimiter=',', quotechar='"')

            for term in self.index.terms():
                index_writer.writerow([term, self.index.get_documents(term)])  # save as Counter object
                #index_writer.writerow([term,list(self.index.get_documents(term).items())]) #save as list
    
    def save_pickle(self):
        
        #term_doc_dict = {}
        #for term in self.index.terms():
        #    term_doc_dict[term] = list(self.index.get_documents(term).items())
                
        #with open("data/doc_len.p", 'wb') as handle:
        #    pickle.dump(self.document_length, handle, protocol=pickle.HIGHEST_PROTOCOL)
        
        #with open("data/inverse_index.p", 'wb') as handle:
        #    pickle.dump(term_doc_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)
        
        with open("data/t--indexer.p", 'wb') as handle:
            pickle.dump(self.index, handle, protocol=pickle.HIGHEST_PROTOCOL)
        
        print("saved to file")

In [10]:
class FileReader():
    
    def __init__(self,data_folder):
        self.data_folder = data_folder
        self.news_sources = ["apw","nyt","xie"]
        self.no_of_files = 0
    
    def indexTestFile(self,indexer):
        test_file = "data/aquaint/nyt/2000/20000101_NYT.gz"
        index_file(test_file,indexer)
        
    def indexAllFiles(self,indexer):
        #i = 0
        for news_source in self.news_sources:
            news_dir = os.path.join(self.data_folder, news_source)
            for subdir, dirs, files in os.walk(news_dir):
                for file in files:
                    if file.endswith(".gz"):  # only index gzip files
                        gz_file_path = os.path.join(subdir, file)
                        index_file(gz_file_path,indexer)

                    self.no_of_files+=1
                    #if self.no_of_files>5:
                    #    return

indexer = Indexer()
data_folder = "data/aquaint/"
fr = FileReader(data_folder)
fr.indexTestFile(indexer)
# fr.indexAllFiles(indexer)

Processing data/aquaint/nyt/2000/20000101_NYT.gz
Bulk indexing 243 documents


In [11]:
indexer.save_pickle()

saved to file
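Reading a pickled index back in a later session mirrors the save step. The sketch below shows the round trip on a plain dictionary standing in for the real index; `save_index`, `load_index`, the temporary path, and the toy frequency value are assumptions for illustration only.

```python
import os
import pickle
import tempfile


def save_index(index, path):
    """Pickle any picklable index object to `path`."""
    with open(path, "wb") as handle:
        pickle.dump(index, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_index(path):
    """Load a previously pickled index object from `path`."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-in for the real index: term -> {doc_id: frequency} (toy values)
toy_index = {"midnight": {"NYT20000101.0003": 2}}
path = os.path.join(tempfile.mkdtemp(), "toy_index.p")
save_index(toy_index, path)
restored = load_index(path)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.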


In [57]:
indexer.index.terms()[:10]

['friday',
 'april',
 '28',
 '2000',
 'five',
 'dead',
 'suburban',
 'pa',
 'shoot',
 'mckee']

The pickled hashed index can be queried as follows:
- get the documents for a word: `indexer.index.get_documents('midnight')`
- get a document's length: `indexer.index.get_document_length("NYT20000101.0045")`
- get a term's frequency in a document: `indexer.index.get_term_frequency("midnight","NYT20000101.0003")`

In [None]:
indexer.index.documents()[:10]

In [None]:
indexer.index.get_document_length("NYT20000101.0045")

In [None]:
indexer.index.get_term_frequency("midnight","NYT20000101.0003")