# IR for Government-related Documents

## Data
* Government documents and TREC eval files are present in zip format.

In [1]:
# imports
# Put all your imports here
from whoosh import index, writing
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import *
from whoosh.qparser import QueryParser
import os.path
from pathlib import Path
import tempfile
import subprocess
import nltk
from nltk.stem import *
from whoosh import qparser
from whoosh import scoring 

In [2]:
DATA_DIR = "government"

# Put other path constants here

DOCUMENTS_DIR = os.path.join(DATA_DIR, "documents")
TOPIC_FILE = os.path.join(DATA_DIR, "gov.topics")
QRELS_FILE = os.path.join(DATA_DIR, "gov.qrels")


TREC_EVAL = os.path.join("trec_eval", "trec_eval.exe")


### Choosing the Best Evaluative Measure for this task

There is no single best evaluation measure for information retrieval (search engine) tasks. The choice of metric should be based on the information need behind the task, in particular whether it is recall-oriented or precision-oriented.

The measure chosen for the given government data collection is P5 (precision in the first five documents returned).

### Reasons for choosing P5:

The topic descriptions make it clear that users are looking for general information about different topics, which implies that this is not a recall-oriented task. What matters is that relevant information is retrieved in the top results, so I selected "P5 - Precision in the first five documents returned" as the appropriate measure for this case.
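
As a rough illustration of what P5 measures, the sketch below computes precision at k for a single ranked list. The function name `precision_at_k` and the document IDs are hypothetical and exist only for this example; in this notebook the actual figures come from trec_eval.

```python
def precision_at_k(ranked_doc_ids, relevant_doc_ids, k=5):
    """Fraction of the top-k slots filled by relevant documents.

    The denominator is always k, so a run that returns fewer than
    k documents is penalised for the empty slots.
    """
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_doc_ids)
    return hits / k


# Hypothetical ranking for one topic: one relevant document in the top five
ranking = ["d7", "d2", "d9", "d4", "d1", "d3"]
relevant = {"d2", "d3"}
print(precision_at_k(ranking, relevant))  # 0.2
```

Because the second relevant document ("d3") sits at rank six, it does not count towards P5, which is exactly the behaviour wanted for a task where only the first few results matter.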

## Document Preprocessing & Indexing

**Whoosh library:**
Whoosh is a fast, pure-Python search engine library. It lets us index free-form or structured text and then quickly find matching documents based on simple or complex search criteria. It also has pluggable scoring algorithms (including BM25F), text analysis, storage, posting formats, etc.
https://whoosh.readthedocs.io/en/latest/intro.html

### Schema:

In Whoosh, we structure the retrieval system by defining a storage schema. Each document can have multiple fields, such as title, content, url, date, etc. First, we need to create a schema for our corpus to specify which fields of the documents go into the index.

The schema is the set of all possible fields in a document. Each individual document might only use a subset of the available fields in the schema.

Note: without a schema, the query parser in Whoosh will not process the text of the user query (for example, it cannot do phrase searching).

In [3]:
# defining Schema for the index -  Vanilla Whoosh (Base case Schema)
Base_Schema = Schema(file_path = ID(stored=True),
                  file_content = TEXT(analyzer = RegexTokenizer()))

### Index documents:
After creating our schema, we index each document in the corpus. Indexed fields must be passed a unicode value.
Opening a writer locks the index for writing. In a multi-threaded or multi-process environment, opening a writer may raise an error if a writer is already open. The advanced writer classes "whoosh.writing.AsyncWriter" and "whoosh.writing.BufferedWriter" can solve this problem.

In [4]:
def createIndex(schema):
    # Generate a temporary directory for the index
    indexDir = tempfile.mkdtemp()
    # creating and returning the index
    return index.create_in(indexDir, schema)

In [5]:
# Creating the index at the path INDEX_DIR based on the Base_Schema
Base_Index = createIndex(Base_Schema)

In [6]:
#function for indexing the files
def addFilesToIndex(indexObj, fileList):
    # open writer
    writer = writing.BufferedWriter(indexObj, period=None, limit=4000)
    
    ## Used BufferedWriter because it solves the problem of multiple openings of a writer(problem of locking the index for writing)
    ## Also, using a BufferedWriter is much faster than opening and committing a writer for each document when adding multiple docs
    try:
        # Opening and reading each file/documents
        for docNum, filePath in enumerate(fileList):
            with open(filePath, "r", encoding="utf-8") as f:
                fileContent = f.read()
                # Adding documents to the index
                writer.add_document(file_path = filePath,
                                    file_content = fileContent)
    
                if (docNum+1) % 1000 == 0:
                    print("already indexed:", docNum+1)
        print("done indexing.")

    finally:
        # close the writer to save the uncommitted changes
        writer.close()

In [7]:
# Building the list of files for indexing
filesToIndex = [str(filePath) for filePath in Path(DOCUMENTS_DIR).glob("**/*") if filePath.is_file()]

In [8]:
#calling addFilesToIndex function for indexing the files
addFilesToIndex(Base_Index, filesToIndex)

already indexed: 1000
already indexed: 2000
already indexed: 3000
already indexed: 4000
done indexing.


### Query parser: 

After indexing the documents, we can write a query string and use the query parser to convert it into a query object.

Create a whoosh.qparser.QueryParser object, passing it the name of the default field to search and the schema of the index you'll be searching.

The query parser is built from modular plug-ins. For example, qparser.WildcardPlugin, which is already in the parser's default plug-in list, gives the parser the ability to search with wildcards.

You can use the plugins argument when creating the object to override the default list of plug-ins, or use add_plugin() and/or remove_plugin_class() to change the plug-ins included in the parser.

Here is the list of available plug-ins: http://whoosh.readthedocs.io/en/latest/api/qparser.html#plug-ins.

In [9]:
# defining a Base_QueryParser for the field "file_content" in the index
Base_QueryParser = QueryParser("file_content", schema=Base_Index.schema)
## creating a Base_Searcher for searching queries in file_content
Base_Searcher = Base_Index.searcher()

In [10]:
#printing all 15 topics from the topic files
with open(TOPIC_FILE, "r") as tf:
    topics = tf.read().splitlines()
for topic in topics:
    topic_id, topic_phrase = tuple(topic.split(" ", 1))
    print(topic_id, topic_phrase)

1 mining gold silver coal
2 juvenile delinquency
4 wireless communications
6 physical therapists
7 cotton industry
9 genealogy searches
10 Physical Fitness
14 Agricultural biotechnology
16 Emergency and disaster preparedness assistance
18 Shipwrecks
19 Cybercrime, internet fraud, and cyber fraud
22 Veteran's Benefits
24 Air Bag Safety
26 Nuclear power plants
28 Early Childhood Education


In [11]:
INDEX_Q2 = Base_Index # Replace None with your index for Q2
QP_Q2 = Base_QueryParser # Replace None with your query parser for Q2
SEARCHER_Q2 = Base_Searcher # Replace None with your searcher for Q2

In [12]:
def trecEval(topicFile, qrelsFile, queryParser, searcher):
    # Load topic file - a list of topics (search phrases) used for evaluation
    with open(topicFile, "r") as tf:
        topics = tf.read().splitlines()

    # create a temporary output file to write the results
    # (mkstemp also returns an open OS-level handle; close it so it is not leaked)
    fd, tempOutputFile = tempfile.mkstemp()
    os.close(fd)
    with open(tempOutputFile, "w") as outputTRECFile:
        # for each evaluated topic:
        # parsing the query and recording the results in the tempOutputFile in TREC_EVAL format
        for topic in topics:
            topic_id, topic_phrase = tuple(topic.split(" ", 1))
            topicQuery = queryParser.parse(topic_phrase)
            topicResults = searcher.search(topicQuery, limit=None)
            for (docnum, result) in enumerate(topicResults):
                score = topicResults.score(docnum)
                outputTRECFile.write("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))
#                 print(("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score)))
    
    result = subprocess.run([TREC_EVAL, '-q', qrelsFile, tempOutputFile], stdout=subprocess.PIPE)
    print(result.stdout.decode())

### Evaluating the Retrieval System:

In [13]:
#evaluation of the returned results using TREC_eval
trecEval(TOPIC_FILE, QRELS_FILE, QP_Q2, SEARCHER_Q2) 

num_ret               	1	1
num_rel               	1	5
num_rel_ret           	1	0
map                   	1	0.0000
Rprec                 	1	0.0000
bpref                 	1	0.0000
recip_rank            	1	0.0000
iprec_at_recall_0.00  	1	0.0000
iprec_at_recall_0.10  	1	0.0000
iprec_at_recall_0.20  	1	0.0000
iprec_at_recall_0.30  	1	0.0000
iprec_at_recall_0.40  	1	0.0000
iprec_at_recall_0.50  	1	0.0000
iprec_at_recall_0.60  	1	0.0000
iprec_at_recall_0.70  	1	0.0000
iprec_at_recall_0.80  	1	0.0000
iprec_at_recall_0.90  	1	0.0000
iprec_at_recall_1.00  	1	0.0000
P_5                   	1	0.0000
P_10                  	1	0.0000
P_15                  	1	0.0000
P_20                  	1	0.0000
P_30                  	1	0.0000
P_100                 	1	0.0000
P_200                 	1	0.0000
P_500                 	1	0.0000
P_1000                	1	0.0000
num_ret               	10	16
num_rel               	10	1
num_rel_ret           	10	1
map                   	10	0.1667
Rprec                 	10	0.0000


### Results for Base system:

P5 (Precision after 5 docs retrieved) for the baseline Whoosh Query system is: 0.0714

#### Performed Well:  
For topic IDs [14, 18, 22, 24, 26] it performed well because it retrieved relevant documents in the first 5 results even with the baseline system.

#### Performed Poorly:  
For topic ID [19] it performed very badly because it did not retrieve any documents at all. For the remaining topic IDs [1, 10, 16, 2, 28, 4, 6, 7, 9] it also performed badly because there were no relevant docs in the first five results.

#### Overall Performance:  
Overall, only 5 out of 15 queries returned the required relevant information in the first five retrieved documents, which shows that there is a lot of scope for improvement.
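
A per-topic breakdown like this can be extracted from the `trec_eval -q` output rather than read off by eye. The sketch below assumes the three-column layout seen in the output above (measure, topic id, value); the helper name `per_topic_measure` and the sample values are hypothetical.

```python
def per_topic_measure(trec_eval_output, measure="P_5"):
    """Return {topic_id: value} for one measure from `trec_eval -q` output."""
    per_topic = {}
    for line in trec_eval_output.splitlines():
        parts = line.split()
        # Each line holds: measure name, topic id (or "all"), value
        if len(parts) == 3 and parts[0] == measure and parts[1] != "all":
            per_topic[parts[1]] = float(parts[2])
    return per_topic


# Hypothetical excerpt in the same layout as the real output
sample = "P_5\t1\t0.0000\nP_10\t1\t0.0000\nP_5\t14\t0.2000\nP_5\tall\t0.0714\n"
scores = per_topic_measure(sample)
print(sorted(topic for topic, value in scores.items() if value > 0))  # ['14']
```

To use it with `trecEval`, the function would need to return `result.stdout.decode()` instead of only printing it.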

## How to Improve the Search Performance:

To improve the search, first identify cases where the system performs well and where it does not. By identifying particular documents like these, we can decide which changes and improvements need to be made.

For query number 2 ("juvenile delinquency") in the given queries list, the system retrieved 6 documents, but all of them were false positives. The collection contains 2 relevant docs for this query, and both were false negatives in the base Whoosh retrieval system.

#### False Positive:

Document no: G00-22-3396139 is retrieved and highly ranked because both "juvenile" AND "delinquency" occur once in the document. But if we check the relevant documents for this query, one of them (G00-08-1145623) contains the term "delinquency" multiple times and never contains "juvenile". Hence using "OR" instead of "AND" in the query parser might improve the results.
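
The difference between the two groupings can be sketched with plain set operations on a toy inverted index. The postings and document names below are invented for illustration and do not come from the collection.

```python
# Hypothetical postings: term -> set of document ids containing it
postings = {
    "juvenile": {"docA", "docC"},
    "delinquency": {"docA", "docB"},
}

query_terms = ["juvenile", "delinquency"]
term_sets = [postings.get(term, set()) for term in query_terms]

and_matches = set.intersection(*term_sets)  # every term must occur
or_matches = set.union(*term_sets)          # any one term is enough

print(sorted(and_matches))  # ['docA']
print(sorted(or_matches))   # ['docA', 'docB', 'docC']
```

Here "docB" plays the role of the relevant document that mentions only "delinquency": it is invisible to the AND query and becomes retrievable under OR, at the cost of a larger result set that the ranking function then has to order well.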

#### False negative:

Document no: G00-37-1427392 is not retrieved (false negative) because the word "Juvenile" is capitalized in it and "delinquency" appears as "delinquent". So by converting everything to lower case and by using a stemmer to reduce words to their stems, we can turn documents of this kind into true positives.
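
A small sketch of why lowercasing and stemming help in this case. The `crude_stem` function is a deliberately naive, hypothetical stand-in that strips only two suffixes; it is not how a real stemmer such as NLTK's LancasterStemmer works.

```python
import re


def crude_stem(token):
    # Hypothetical stand-in for a real stemmer: strips two suffixes only
    for suffix in ("ency", "ent"):
        if token.endswith(suffix):
            return token[: -len(suffix)]
    return token


def normalise(text):
    """Tokenise on word characters, lowercase, then stem each token."""
    return [crude_stem(tok.lower()) for tok in re.findall(r"\w+", text)]


print(normalise("Juvenile Delinquent"))   # ['juvenile', 'delinqu']
print(normalise("juvenile delinquency"))  # ['juvenile', 'delinqu']
```

Once the document text and the query pass through the same normalisation, both reduce to identical terms, so the document can match.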

In [14]:
# defining a reader object on the index
myReader = Base_Index.reader()
print("# docs with 'juvenile'", myReader.doc_frequency("file_content", "juvenile"))
print("# docs with 'Juvenile'", myReader.doc_frequency("file_content", "Juvenile"))
print("# docs with 'delinquency'", myReader.doc_frequency("file_content", "delinquency"))
print("# docs with 'Delinquency'", myReader.doc_frequency("file_content", "Delinquency"))
print("# docs with 'delinquent'", myReader.doc_frequency("file_content", "delinquent"))

# docs with 'juvenile' 19
# docs with 'Juvenile' 42
# docs with 'delinquency' 6
# docs with 'Delinquency' 12
# docs with 'delinquent' 7


In [15]:
## choosing from the different types of stemmers and lemmatizers available:
## we can choose any stemmer or lemmatizer depending on the system we are working on
listWords = ["delinquency", "Delinquent", "safety", "safe"]
for word in listWords:
    print("%15s %15s %15s %15s" % (LancasterStemmer().stem(word),
                                   SnowballStemmer("english").stem(word),
                                   WordNetLemmatizer().lemmatize(word),
                                   WordNetLemmatizer().lemmatize(word, 'v')))

        delinqu         delinqu     delinquency     delinquency
        delinqu         delinqu      Delinquent      Delinquent
            saf          safeti          safety          safety
            saf            safe            safe            safe


Suggested Improvements: lower case, Stemming, "OR" query parser

### Using NLTK in Whoosh:

In [17]:
# This filter will run for both the index and the query
from whoosh.analysis import Filter
class CustomFilter(Filter):
    is_morph = True
    def __init__(self, filterFunc, *args, **kwargs):
        self.customFilter = filterFunc
        self.args = args
        self.kwargs = kwargs
    def __eq__(self, other):
        # `other` was missing from the signature, so any comparison raised a TypeError
        return (other
                and self.__class__ is other.__class__)
    def __call__(self, tokens):
        for t in tokens:
            if t.mode == 'query': # if called by query parser
                t.text = self.customFilter(t.text, *self.args, **self.kwargs)
                yield t
            else: # == 'index' if called by indexer
                t.text = self.customFilter(t.text, *self.args, **self.kwargs)
                yield t

In [18]:
Improved_analyzer = RegexTokenizer() | LowercaseFilter()  | StopFilter() | CustomFilter(LancasterStemmer().stem)
# [token.text for token in Improved_analyzer("We are going to do Text Analysis with whoosh.analysis")]

In [19]:
# defining improved Schema for the index after converting to lowercase and stemming
Improved_Schema = Schema(file_path = ID(stored=True),
                  file_content = TEXT(analyzer=Improved_analyzer))

In [20]:
Improved_Index = createIndex(Improved_Schema)

In [21]:
addFilesToIndex(Improved_Index, filesToIndex)

already indexed: 1000
already indexed: 2000
already indexed: 3000
already indexed: 4000
done indexing.


In [22]:
Improved_OR_QueryParser = QueryParser("file_content", schema=Improved_Index.schema, group=qparser.OrGroup)
## searching queries in file_content
Improved_Searcher = Improved_Index.searcher()

In [23]:
INDEX_Q3 = Improved_Index # Replace None with your index for Q3
QP_Q3 = Improved_OR_QueryParser # Replace None with your query parser for Q3
SEARCHER_Q3 = Improved_Searcher # Replace None with your searcher for Q3

In [24]:
trecEval(TOPIC_FILE, QRELS_FILE, QP_Q3, SEARCHER_Q3)

num_ret               	1	471
num_rel               	1	5
num_rel_ret           	1	5
map                   	1	0.0625
Rprec                 	1	0.0000
bpref                 	1	0.0000
recip_rank            	1	0.0500
iprec_at_recall_0.00  	1	0.0968
iprec_at_recall_0.10  	1	0.0968
iprec_at_recall_0.20  	1	0.0968
iprec_at_recall_0.30  	1	0.0968
iprec_at_recall_0.40  	1	0.0968
iprec_at_recall_0.50  	1	0.0968
iprec_at_recall_0.60  	1	0.0968
iprec_at_recall_0.70  	1	0.0430
iprec_at_recall_0.80  	1	0.0430
iprec_at_recall_0.90  	1	0.0427
iprec_at_recall_1.00  	1	0.0427
P_5                   	1	0.0000
P_10                  	1	0.0000
P_15                  	1	0.0000
P_20                  	1	0.0500
P_30                  	1	0.0667
P_100                 	1	0.0400
P_200                 	1	0.0250
P_500                 	1	0.0100
P_1000                	1	0.0050
num_ret               	10	498
num_rel               	10	1
num_rel_ret           	10	1
map                   	10	0.2500
Rprec                 	10	0.00

### Performance Evaluation after the changes:

#### Improvements made: 
Converted everything to lower case, removed stopwords (they are not relevant in our case, and removing them makes indexing and searching more efficient), and used LancasterStemmer to stem the words. 

#### Performance:
Performance improved overall: for 8 out of 15 queries we now get the required relevant information in the first five retrieved documents. However, queries 22 ("Veteran's Benefits") and 26 ("Nuclear power plants"), which had relevant documents in the first five results with the base Whoosh system, no longer do. Their query terms are very general, and because we now use the "OR" query parser together with stemming, documents with higher term frequencies for those words are ranked first.
For query 2, we now get the relevant documents in the first five results (the false negative became a true positive, and the false positive is ranked lower).

(Overall, the P5 measure improved from 0.0714 to 0.12)
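
The two averages can be sanity-checked with a little arithmetic. The decomposition below is an assumption inferred from the reported numbers, not something trec_eval printed: 0.0714 is consistent with 5 relevant documents in the top five averaged over the 14 topics that returned any results (topic 19 returned nothing in the baseline run), and 0.12 is consistent with 9 such documents averaged over all 15 topics.

```python
K = 5  # cutoff used by P5


def mean_p_at_k(total_relevant_in_top_k, num_topics, k=K):
    """Mean of per-topic P@k, given the hit count summed across topics."""
    return total_relevant_in_top_k / (k * num_topics)


# Assumed decomposition: 5 hits over the 14 topics that returned documents
baseline = mean_p_at_k(5, 14)
# Assumed decomposition: 9 hits over all 15 topics
improved = mean_p_at_k(9, 15)

print(round(baseline, 4), round(improved, 4))  # 0.0714 0.12
```

If this reading is right, the two figures are averaged over different numbers of topics, which is worth keeping in mind when comparing them directly.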

Overall, after the modifications, 8 out of 15 queries return the required relevant information in the first five retrieved documents, compared with only 5 before. This suggests that the modifications are a good idea for this case, but there is still scope for improvement.

## Scoring Techniques:

### How will different scoring techniques affect the performance

For query number 28 ("Early Childhood Education") in the given queries list, the system retrieved both of the 2 relevant documents, but neither was ranked in the first five results. 

#### False Positive:

Document no: G00-77-3295130 is retrieved and highly ranked because it is a relatively short document (with a similar or greater number of matching words) compared with the two relevant documents. In general, for a similar number of matching words, the score decreases as document length increases, and the effect is stronger at higher values of b. Since the default value of b in the BM25 scoring system is 0.75, lowering b might reduce the impact of document length on the score.


#### False negative:

Document no: G00-54-2576117 is retrieved, but it is not highly ranked because the document is relatively long compared with the others. In the BM25 scoring equation, at higher values of b and slightly lower values of K1, document length greatly affects the score. Hence reducing the impact of document length might increase the score of this document and move it into the top results, making it a true positive.
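
To make the effect of b and K1 concrete, the sketch below scores one query term with the textbook BM25 formula. This is an assumption-laden illustration: the function is not Whoosh's own BM25F code, and the term frequencies and document lengths are invented.

```python
def bm25_term_score(tf, doc_len, avg_doc_len, idf=1.0, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one query term to one document."""
    length_norm = (1 - b) + b * (doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)


# Hypothetical documents with the same term frequency but different lengths
short_doc = dict(tf=3, doc_len=500, avg_doc_len=1000)
long_doc = dict(tf=3, doc_len=3000, avg_doc_len=1000)

for k1, b in [(1.2, 0.75), (3.0, 0.3)]:
    short = bm25_term_score(k1=k1, b=b, **short_doc)
    long_ = bm25_term_score(k1=k1, b=b, **long_doc)
    # A ratio closer to 1 means document length matters less
    print(f"k1={k1} b={b} long/short ratio = {long_ / short:.3f}")
```

With these made-up numbers the long document scores about 0.63 of the short one at the defaults and about 0.71 at b=0.3, K1=3, so the length penalty shrinks. Note that raising K1 on its own pulls in the opposite direction, so the net effect depends on both values together.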


In [26]:
# first, define a Schema for the index
Improved_Schema = Schema(file_path = ID(stored=True),
                  file_content = TEXT(analyzer=Improved_analyzer))

In [27]:
Improved_Index = createIndex(Improved_Schema)

In [28]:
addFilesToIndex(Improved_Index, filesToIndex)

already indexed: 1000
already indexed: 2000
already indexed: 3000
already indexed: 4000
done indexing.


In [29]:
Improved_OR_QueryParser = QueryParser("file_content", schema=Improved_Index.schema, group=qparser.OrGroup)
## Tuning the BM25 parameters as discussed above
Tuned_Searcher = Improved_Index.searcher(weighting=scoring.BM25F(B=0.3, K1=3))

In [30]:
INDEX_Q4 = Improved_Index # Replace None with your index for Q4
QP_Q4 = Improved_OR_QueryParser # Replace None with your query parser for Q4
SEARCHER_Q4 = Tuned_Searcher # Replace None with your searcher for Q4

In [31]:
trecEval(TOPIC_FILE, QRELS_FILE, QP_Q4, SEARCHER_Q4)

num_ret               	1	471
num_rel               	1	5
num_rel_ret           	1	5
map                   	1	0.0611
Rprec                 	1	0.0000
bpref                 	1	0.0000
recip_rank            	1	0.0556
iprec_at_recall_0.00  	1	0.0909
iprec_at_recall_0.10  	1	0.0909
iprec_at_recall_0.20  	1	0.0909
iprec_at_recall_0.30  	1	0.0909
iprec_at_recall_0.40  	1	0.0909
iprec_at_recall_0.50  	1	0.0909
iprec_at_recall_0.60  	1	0.0909
iprec_at_recall_0.70  	1	0.0494
iprec_at_recall_0.80  	1	0.0494
iprec_at_recall_0.90  	1	0.0385
iprec_at_recall_1.00  	1	0.0385
P_5                   	1	0.0000
P_10                  	1	0.0000
P_15                  	1	0.0000
P_20                  	1	0.0500
P_30                  	1	0.0667
P_100                 	1	0.0400
P_200                 	1	0.0250
P_500                 	1	0.0100
P_1000                	1	0.0050
num_ret               	10	498
num_rel               	10	1
num_rel_ret           	10	1
map                   	10	0.2500
Rprec                 	10	0.00

### Evaluation after Scoring

#### Improvements

Reduced the value of b from the default 0.75 to 0.3 and increased the value of K1 from the default 1.2 to 3. The reason for doing this is to reduce the impact of document length on the overall score. 
Heuristics: the ranges of values for b and K1 were selected using experimental results and heuristics from research papers. Source: elastic.co blog (article: considerations for picking b and k values)


#### Performance:
Performance improved overall: for 11 out of 15 queries we now get the required relevant information in the first five retrieved documents, whereas previously only 8 out of 15 queries had relevant docs in the top 5 results. (The P5 measure improved from 0.12 to 0.1733.) 

For query 28, we now get the relevant document in the first five results (the false negative became a true positive, and the false positive is ranked lower, as expected).


Overall, after making the modifications and tuning the BM25 parameters, 11 out of 15 queries return the required relevant information in the first five retrieved documents, and the overall P5 measure improved from 0.12 to 0.1733 (a 1.44 times improvement). This suggests that tuning the BM25 scoring parameters is a good idea for this case.