# Information Retrieval

## Preparations
* Put all your imports and path constants in the next cells.
* Make sure ***MATERIALS_DIR*** points to the directory where you extracted the Zip file.
* Make sure all your paths are **relative to** ***MATERIALS_DIR*** and **NOT hard-coded** in your code.

In [1]:
# imports
from whoosh import index, writing, qparser, scoring
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import *
from whoosh.qparser import QueryParser
#from whoosh import qparser
import nltk
from nltk.stem import *
import os, os.path
import shutil

In [2]:
MATERIALS_DIR = r"C:\DSS_Fall2017_Assign2"


# Path constants 
DOCUMENTS_DIR = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\documents")
INDEX_DIR = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\index1")
QUER_FILE = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\topics\gov.topics")
QRELS_FILE = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\qrels\gov.qrels")
OUTPUT_FILE = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\myres")
TREC_EVAL = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\trec_eval\trec_eval.exe")
INDEX_DIR4 = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\index4")
OUTPUT_FILE5 = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\myres5")
INDEX_DIR7 = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\index7")
OUTPUT_FILE7 = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\myres7")

## Question 1
Provide your text answers in the following two markdown cells

### Q1 (a):
MAP



### Q1 (b):

Recall is generally considered more important than precision for government website queries, but at the same time you want to be mindful of the user's time and money. As a result, MAP is a good metric for this purpose: it rewards retrieving the relevant documents and ranking them early, i.e., the order in which relevant documents are returned (more relevant first) matters.
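
To make the metric concrete, here is a minimal sketch of how average precision and MAP can be computed from a ranked list and a set of relevance judgments. The function names and the toy document IDs are hypothetical and are not part of the assignment materials; trec_eval remains the reference implementation.

```python
def average_precision(ranked_docs, relevant_docs):
    """Average precision for one query.

    Precision is sampled at every rank where a relevant document appears,
    and the sum is divided by the total number of relevant documents, so
    relevant documents that are never retrieved lower the score.
    """
    relevant = set(relevant_docs)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_docs, relevant_docs) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical toy data: two queries with made-up document IDs.
query_a = (["d1", "d2", "d3", "d4"], {"d1", "d3"})  # AP = (1/1 + 2/3) / 2
query_b = (["d5", "d6", "d7"], {"d7", "d9"})        # AP = (1/3) / 2, d9 never retrieved

print(average_precision(*query_a))
print(mean_average_precision([query_a, query_b]))
```

Moving a relevant document up the ranking raises the score, and a relevant document that is never returned lowers it, which is why MAP reflects both ranking quality and recall.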

## Question 2

### Q2 (a): Write your code below

In [3]:
# Define a Schema for the index
mySchema = Schema(file_path = ID(stored=True),
                  file_content = TEXT(analyzer = RegexTokenizer()))

# if index exists - remove it
if os.path.isdir(INDEX_DIR):
    shutil.rmtree(INDEX_DIR)

# create the directory for the index
os.makedirs(INDEX_DIR)

# create index
myIndex = index.create_in(INDEX_DIR, mySchema)

In [4]:
# Build a list of all the full paths of the files in DOCUMENTS_DIR
filesToIndex = []
for root, dirs, files in os.walk(DOCUMENTS_DIR):
    filePaths = [os.path.join(root, fileName) for fileName in files if not fileName.startswith('.')]
    filesToIndex.extend(filePaths)
    
# Open writer
myWriter = writing.BufferedWriter(myIndex, period=20, limit=1000)

try:
    # write each file to index
    for docNum, filePath in enumerate(filesToIndex):
        with open(filePath, "r", encoding="utf-8") as f:
            fileContent = f.read()
            myWriter.add_document(file_path = filePath,
                                  file_content = fileContent)
            
            if (docNum % 1000 == 0):
                print("already indexed:", docNum+1)
    print("done indexing.")

finally:
    # save the index
    myWriter.close()


already indexed: 1
already indexed: 1001
already indexed: 2001
already indexed: 3001
already indexed: 4001
done indexing.


In [5]:
# define a query parser for the field "file_content" in the index
myQueryParser = QueryParser("file_content", schema=myIndex.schema)
mySearcher = myIndex.searcher()

In [6]:
# Load topic file - a list of topics (search phrases) used for evaluation
topicsFile = open(QUER_FILE,"r")
topics = topicsFile.read().splitlines()

# create an output file to which we'll write our results
outputTRECFile = open(OUTPUT_FILE, "w")

# for each evaluated topic:
# build a query and record the results in the file in TREC_EVAL format
for topic in topics:
    topic_id, topic_phrase = tuple(topic.split(" ", 1))
    topicQuery = myQueryParser.parse(topic_phrase)
    topicResults = mySearcher.search(topicQuery, limit=None)
    for (docnum, result) in enumerate(topicResults):
        score = topicResults.score(docnum)
        outputTRECFile.write("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))

# close the topic and results file
outputTRECFile.close()
topicsFile.close()

In [7]:
!$TREC_EVAL -q $QRELS_FILE $OUTPUT_FILE

num_ret               	1	1
num_rel               	1	5
num_rel_ret           	1	0
map                   	1	0.0000
Rprec                 	1	0.0000
bpref                 	1	0.0000
recip_rank            	1	0.0000
iprec_at_recall_0.00  	1	0.0000
iprec_at_recall_0.10  	1	0.0000
iprec_at_recall_0.20  	1	0.0000
iprec_at_recall_0.30  	1	0.0000
iprec_at_recall_0.40  	1	0.0000
iprec_at_recall_0.50  	1	0.0000
iprec_at_recall_0.60  	1	0.0000
iprec_at_recall_0.70  	1	0.0000
iprec_at_recall_0.80  	1	0.0000
iprec_at_recall_0.90  	1	0.0000
iprec_at_recall_1.00  	1	0.0000
P_5                   	1	0.0000
P_10                  	1	0.0000
P_15                  	1	0.0000
P_20                  	1	0.0000
P_30                  	1	0.0000
P_100                 	1	0.0000
P_200                 	1	0.0000
P_500                 	1	0.0000
P_1000                	1	0.0000
num_ret               	10	16
num_rel               	10	1
num_rel_ret           	10	1
map                   	10	0.1667
Rprec                 	10	0.0000


In [8]:
INDEX_Q2 = myIndex # Replace None with your index for Q2
QP_Q2 = myQueryParser # Replace None with your query parser for Q2
SEARCHER_Q2 = mySearcher # Replace None with your searcher for Q2

In [9]:
#!cat $QUER_FILE

In [10]:
#!head -n 10 $QRELS_FILE

### Q2 (b):
MAP: 0.1971

### Q2 (c):
Topics that performed poorly: 1, 16, 6, 7, 9, 19 <br>
Topics that performed well: 18, 24

## Question 3

### Q3 (a): Provide answer to Q3 (a) here [markdown cell]

Baseline Whoosh uses a RegexTokenizer that simply extracts words and ignores whitespace and punctuation. Since no stemming or lemmatization takes place for either the query or the index, Whoosh misses other forms of the query words during its search. <br>
For example, consider the relevant file "G00-02-0901987" for query 1, "mining gold silver coal": baseline Whoosh missed this file (a false negative) because it contains the word 'mine' instead of 'mining'. This can be fixed by using a lemmatizer.
Similarly, for the same query, baseline Whoosh returns "G00-90-0342721" as a match because it contains all of the words (mining AND gold AND silver AND coal), although the subject matter of the file is not relevant to the query. This is a false positive. <br>
Lastly, the ratio of total relevant files to relevant files returned was poor (33:7) even though 151 files were returned. On further investigation, I noted that not all documents judged relevant in the qrels file contain ALL of the query terms. As a result, it may be useful to apply an OR operator when parsing user queries.
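
The difference between AND and OR matching, and the effect of an unnormalized word form, can be illustrated with a tiny inverted index. This is a sketch on made-up documents using plain Python sets; it does not use Whoosh, and the document names and contents are hypothetical.

```python
# Hypothetical toy collection; the names and contents are made up.
docs = {
    "doc_a": "gold and silver mining in the coal belt",
    "doc_b": "a history of the gold mine",
    "doc_c": "coal exports rose last year",
    "doc_d": "wireless communications policy",
}


def build_inverted_index(documents):
    """Map each lowercased, whitespace-separated token to the docs containing it."""
    inverted = {}
    for name, text in documents.items():
        for token in text.lower().split():
            inverted.setdefault(token, set()).add(name)
    return inverted


def match_and(inverted, terms):
    """Documents that contain every query term."""
    postings = [inverted.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def match_or(inverted, terms):
    """Documents that contain at least one query term."""
    postings = [inverted.get(t, set()) for t in terms]
    return set.union(*postings) if postings else set()


inverted = build_inverted_index(docs)
query = ["mining", "gold", "silver", "coal"]

print(sorted(match_and(inverted, query)))  # only doc_a contains all four terms
print(sorted(match_or(inverted, query)))   # doc_b and doc_c match on a single term
```

Note that doc_b contains "mine" rather than "mining", so without lemmatization it only matches through "gold"; under AND it is missed entirely.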


### Q3 (b): Write your code below

In [11]:
# Dont change this! Use it as-is in your code
# This filter will run for both the index and the query
from whoosh.analysis import Filter
class CustomFilter(Filter):
    is_morph = True
    def __init__(self, filterFunc, *args, **kwargs):
        self.customFilter = filterFunc
        self.args = args
        self.kwargs = kwargs
    def __eq__(self, other):
        return (other
                and self.__class__ is other.__class__)
    def __call__(self, tokens):
        for t in tokens:
            if t.mode == 'query': # if called by query parser
                t.text = self.customFilter(t.text, *self.args, **self.kwargs)
                yield t
            else: # == 'index' if called by indexer
                t.text = self.customFilter(t.text, *self.args, **self.kwargs)
                yield t

In [12]:
#lemmatizer
wnLemm = WordNetLemmatizer()

#new text analyzer
myFilter1 = RegexTokenizer() | LowercaseFilter() | IntraWordFilter() | StopFilter() | CustomFilter(wnLemm.lemmatize, 'v')

In [13]:
mySchema4 = Schema(file_path = ID(stored=True),
                   file_content = TEXT(analyzer = myFilter1))

# if index exists - remove it
if os.path.isdir(INDEX_DIR4):
    shutil.rmtree(INDEX_DIR4)

# create the directory for the index
os.makedirs(INDEX_DIR4)

# create index
myIndex4 = index.create_in(INDEX_DIR4, mySchema4)

filesToIndex = []
for root, dirs, files in os.walk(DOCUMENTS_DIR):
    filePaths = [os.path.join(root, fileName) for fileName in files if not fileName.startswith('.')]
    filesToIndex.extend(filePaths)

In [14]:
# open writer
myWriter4 = writing.BufferedWriter(myIndex4, period=20, limit=1000)

try:
    # write each file to index
    for docNum, filePath in enumerate(filesToIndex):
        with open(filePath, "r", encoding="utf-8") as f:
            fileContent = f.read()
            myWriter4.add_document(file_path = filePath,
                                  file_content = fileContent)
            
            if (docNum % 1000 == 0):
                print("already indexed:", docNum+1)
    print("done indexing.")

finally:
    # save the index
    myWriter4.close()

already indexed: 1
already indexed: 1001
already indexed: 2001
already indexed: 3001
already indexed: 4001
done indexing.


In [15]:
# define a query parser for the field "file_content" in the index
og = qparser.OrGroup.factory(0.5)
myQueryParser5 = qparser.QueryParser("file_content", schema=myIndex4.schema, group=og)
mySearcher5 = myIndex4.searcher()

In [16]:
# Load topic file - a list of topics (search phrases) used for evaluation
topicsFile = open(QUER_FILE,"r")
topics = topicsFile.read().splitlines()


# create an output file to which we'll write our results
outputTRECFile5 = open(OUTPUT_FILE5, "w")

# for each evaluated topic:
# build a query and record the results in the file in TREC_EVAL format
for topic in topics:
    topic_id, topic_phrase = tuple(topic.split(" ", 1))
    topicQuery = myQueryParser5.parse(topic_phrase)
    print (topicQuery)
    topicResults = mySearcher5.search(topicQuery, limit=None)
    for (docnum, result) in enumerate(topicResults):
        score = topicResults.score(docnum)
        outputTRECFile5.write("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))

# close the topic and results file
outputTRECFile5.close()
topicsFile.close()

(file_content:mine OR file_content:gold OR file_content:silver OR file_content:coal)
(file_content:juvenile OR file_content:delinquency)
(file_content:wireless OR file_content:communications)
(file_content:physical OR file_content:therapists)
(file_content:cotton OR file_content:industry)
(file_content:genealogy OR file_content:search)
(file_content:physical OR file_content:fitness)
(file_content:agricultural OR file_content:biotechnology)
(file_content:emergency OR file_content:disaster OR file_content:preparedness OR file_content:assistance)
file_content:shipwreck
(file_content:cybercrime OR file_content:internet OR file_content:fraud OR file_content:cyber)
(file_content:veteran OR file_content:benefit)
(file_content:air OR file_content:bag OR file_content:safety)
(file_content:nuclear OR file_content:power OR file_content:plant)
(file_content:early OR file_content:childhood OR file_content:education)


In [17]:


!$TREC_EVAL -q $QRELS_FILE $OUTPUT_FILE5

num_ret               	1	223
num_rel               	1	5
num_rel_ret           	1	4
map                   	1	0.0441
Rprec                 	1	0.0000
bpref                 	1	0.0000
recip_rank            	1	0.0526
iprec_at_recall_0.00  	1	0.0870
iprec_at_recall_0.10  	1	0.0870
iprec_at_recall_0.20  	1	0.0870
iprec_at_recall_0.30  	1	0.0870
iprec_at_recall_0.40  	1	0.0870
iprec_at_recall_0.50  	1	0.0484
iprec_at_recall_0.60  	1	0.0484
iprec_at_recall_0.70  	1	0.0325
iprec_at_recall_0.80  	1	0.0325
iprec_at_recall_0.90  	1	0.0000
iprec_at_recall_1.00  	1	0.0000
P_5                   	1	0.0000
P_10                  	1	0.0000
P_15                  	1	0.0000
P_20                  	1	0.0500
P_30                  	1	0.0667
P_100                 	1	0.0300
P_200                 	1	0.0200
P_500                 	1	0.0080
P_1000                	1	0.0040
num_ret               	10	275
num_rel               	10	1
num_rel_ret           	10	1
map                   	10	0.3333
Rprec                 	10	0.00

In [18]:
INDEX_Q3 = myIndex4  # Replace None with your index for Q3
QP_Q3 = myQueryParser5   # Replace None with your query parser for Q3
SEARCHER_Q3 = mySearcher5  # Replace None with your searcher for Q3

### Q3 (c): 
The following text analyzer was implemented for both the index and the query: a lowercase filter, a stop-word filter (ignores words like "we", "are", "with"), an intra-word filter (splits compound tokens such as hyphenated words into subwords), and finally a verb lemmatizer. Furthermore, the query parser was changed to use OR instead of AND, so that any of the terms may be present for a document to match. A scaled bonus score was given to documents that contain more of the query terms (i.e., a document with "gold" and "mine" would score higher than a document containing only "mine"). <br>

Overall, the MAP score (0.3756) as well as the total number of relevant documents returned (32 out of 35) improved. As a result, the number of false negatives for query 1 decreased drastically (4 of the 5 relevant documents for the query were returned). However, because "OR" was used to parse the queries, the number of false positives increased, although the bonus weight can be adjusted to mimic "AND".
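
The idea of a scaled bonus for matching more query terms can be sketched as follows. This is a simplified, hypothetical formula written for illustration only; it is not the formula Whoosh uses internally for `OrGroup.factory`, and the per-term scores are made up.

```python
def coordination_score(term_scores, query_terms, scale=0.5):
    """Sum the scores of the matched terms, then boost by query coverage.

    term_scores: dict mapping each matched term to its score in the document.
    scale:       how strongly coverage is rewarded (0 disables the bonus).
    NOTE: hypothetical illustration, not Whoosh's internal formula.
    """
    matched = [t for t in query_terms if t in term_scores]
    if not matched:
        return 0.0
    base = sum(term_scores[t] for t in matched)
    coverage = len(matched) / len(query_terms)
    return base * (1.0 + scale * coverage)


query = ["mine", "gold", "silver", "coal"]
only_mine = {"mine": 2.0}                    # one term, high term score
mine_and_gold = {"mine": 1.0, "gold": 1.0}   # two terms, same total score

print(coordination_score(only_mine, query))      # 2.0 * (1 + 0.5 * 0.25)
print(coordination_score(mine_and_gold, query))  # 2.0 * (1 + 0.5 * 0.50)
```

With equal base scores, the document that covers more of the query ranks higher, and a larger `scale` pushes the ranking closer to AND-like behaviour without excluding partial matches.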

### Q3 (d): 
yes

### Q3 (e): Provide answer to Q3 (e) here [markdown cell]
yes

### Q3 (f): 

Initially, I tried using the text analyzer described above without changing the query parser to "OR". This improved the MAP score (0.3401); however, the number of relevant documents returned (num_rel_ret) remained low (~14). Using OR to parse queries increased the number of relevant documents returned. If the goal is to obtain all possible documents, as is often the case for government queries, then this was an effective method. It came with the disadvantage of significantly increasing the number of false positives, although this can be mitigated by adjusting the weighted score so that documents which contain more search terms are scored higher.

## Question 4 (Graduate Students)

In [19]:
GRAD_STUDENT = True # change to True if you are a grad student

### Q4 (a): 

Since the query parser uses "OR" to match documents, scoring the relevant documents highly is important.


The tf-idf (term frequency–inverse document frequency) weighting scheme is used to evaluate how important a word is to a document in a collection. The importance of a word increases proportionally to the number of times it appears in the document but is offset by the frequency of the word in the collection. Therefore, it may be possible to use it to improve the score, and thereby the ranking, of queries such as query 9, "genealogy searches".
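
A minimal sketch of the weighting described above, using raw term counts for tf and log(N / df) for idf. This is one common textbook variant; the exact formula used by `scoring.TF_IDF()` in Whoosh may differ, and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_scores(documents, query_terms):
    """Score each document as the sum over query terms of tf * idf.

    tf  = raw count of the term in the document
    idf = log(N / df), where df is the number of documents containing the term
    """
    tokenized = {name: text.lower().split() for name, text in documents.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    scores = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if counts[term] and doc_freq[term]:
                score += counts[term] * math.log(n_docs / doc_freq[term])
        scores[name] = score
    return scores


# Hypothetical toy collection: "search" occurs everywhere, "genealogy" is rare.
docs = {
    "d1": "genealogy search records search",
    "d2": "search the site",
    "d3": "search help",
}
scores = tf_idf_scores(docs, ["genealogy", "search"])
print(scores)
```

Because "search" appears in every document, its idf is log(3/3) = 0 and it contributes nothing; only the rare term "genealogy" separates d1 from the rest.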



### Q4 (b): Write your code below

In [39]:
# Put your code for creating the index here (you can add more cells).
# Make sure you save the final index in the variable INDEX_Q4, your query parser in QP_Q4, and your searcher in SEARCHER_Q4

from whoosh import scoring


In [21]:
mySchema6 = Schema(file_path = ID(stored=True),
                   file_content = TEXT(analyzer = myFilter1))

# if index exists - remove it
if os.path.isdir(INDEX_DIR7):
    shutil.rmtree(INDEX_DIR7)

# create the directory for the index
os.makedirs(INDEX_DIR7)

# create index
myIndex6 = index.create_in(INDEX_DIR7, mySchema6)   

filesToIndex = []
for root, dirs, files in os.walk(DOCUMENTS_DIR):
    filePaths = [os.path.join(root, fileName) for fileName in files if not fileName.startswith('.')]
    filesToIndex.extend(filePaths)

In [22]:
# open writer
myWriter6 = writing.BufferedWriter(myIndex6, period=20, limit=1000)

try:
    # write each file to index
    for docNum, filePath in enumerate(filesToIndex):
        with open(filePath, "r", encoding="utf-8") as f:
            fileContent = f.read()
            myWriter6.add_document(file_path = filePath,
                                  file_content = fileContent)
            
            if (docNum % 1000 == 0):
                print("already indexed:", docNum+1)
    print("done indexing.")

finally:
    # save the index
    myWriter6.close()

already indexed: 1
already indexed: 1001
already indexed: 2001
already indexed: 3001
already indexed: 4001
done indexing.


In [33]:
# define a query parser for the field "file_content" in the index

#og = qparser.OrGroup.factory(0.5)
myQueryParser6 = qparser.QueryParser("file_content", schema=myIndex6.schema) #, group=og)
mySearcher6 = myIndex6.searcher(weighting=scoring.TF_IDF())
#myQueryParser6 = QueryParser("file_content", schema=myIndex6.schema)
#mySearcher6 = myIndex6.searcher(weighting=scoring.TF_IDF())

In [34]:
# Load topic file - a list of topics (search phrases) used for evaluation
topicsFile = open(QUER_FILE,"r")
topics = topicsFile.read().splitlines()


# create an output file to which we'll write our results
outputTRECFile3 = open(OUTPUT_FILE7, "w")

# for each evaluated topic:
# build a query and record the results in the file in TREC_EVAL format
for topic in topics:
    topic_id, topic_phrase = tuple(topic.split(" ", 1))
    topicQuery = myQueryParser6.parse(topic_phrase)
    print (topicQuery)
    topicResults = mySearcher6.search(topicQuery, limit=None)
    for (docnum, result) in enumerate(topicResults):
        score = topicResults.score(docnum)
        outputTRECFile3.write("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))

# close the topic and results file
outputTRECFile3.close()
topicsFile.close()

(file_content:mine AND file_content:gold AND file_content:silver AND file_content:coal)
(file_content:juvenile AND file_content:delinquency)
(file_content:wireless AND file_content:communications)
(file_content:physical AND file_content:therapists)
(file_content:cotton AND file_content:industry)
(file_content:genealogy AND file_content:search)
(file_content:physical AND file_content:fitness)
(file_content:agricultural AND file_content:biotechnology)
(file_content:emergency AND file_content:disaster AND file_content:preparedness AND file_content:assistance)
file_content:shipwreck
(file_content:cybercrime AND file_content:internet AND file_content:fraud AND file_content:cyber)
(file_content:veteran AND file_content:benefit)
(file_content:air AND file_content:bag AND file_content:safety)
(file_content:nuclear AND file_content:power AND file_content:plant)
(file_content:early AND file_content:childhood AND file_content:education)


In [35]:
INDEX_Q4 = myIndex6 # Replace None with your index for Q4
QP_Q4 = myQueryParser6 # Replace None with your query parser for Q4
SEARCHER_Q4 = mySearcher6 # Replace None with your searcher for Q4

In [36]:
!$TREC_EVAL -q $QRELS_FILE $OUTPUT_FILE7

num_ret               	1	3
num_rel               	1	5
num_rel_ret           	1	0
map                   	1	0.0000
Rprec                 	1	0.0000
bpref                 	1	0.0000
recip_rank            	1	0.0000
iprec_at_recall_0.00  	1	0.0000
iprec_at_recall_0.10  	1	0.0000
iprec_at_recall_0.20  	1	0.0000
iprec_at_recall_0.30  	1	0.0000
iprec_at_recall_0.40  	1	0.0000
iprec_at_recall_0.50  	1	0.0000
iprec_at_recall_0.60  	1	0.0000
iprec_at_recall_0.70  	1	0.0000
iprec_at_recall_0.80  	1	0.0000
iprec_at_recall_0.90  	1	0.0000
iprec_at_recall_1.00  	1	0.0000
P_5                   	1	0.0000
P_10                  	1	0.0000
P_15                  	1	0.0000
P_20                  	1	0.0000
P_30                  	1	0.0000
P_100                 	1	0.0000
P_200                 	1	0.0000
P_500                 	1	0.0000
P_1000                	1	0.0000
num_ret               	10	24
num_rel               	10	1
num_rel_ret           	10	1
map                   	10	0.2000
Rprec                 	10	0.0000


### Q4 (c):

A tf-idf weighting model was implemented. With the "OR" query parser from Q3, MAP (0.1368) and recip_rank (0.1558) over all queries did not improve as a result of the changed scoring when compared to the baseline (MAP 0.1971) or Q3 (MAP 0.3756). I therefore also applied tf-idf scoring with the default query parser. Although this performed slightly better (MAP 0.1525, recip_rank 0.1836) than with the "OR" query parser, it also did not improve the MAP and recip_rank scores compared to the baseline or Q3. However, for query 9 there was a slight improvement in both MAP and recip_rank, which makes sense given that "genealogy" is a relatively rare word in the collection.
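
For reference, recip_rank for a single topic is the reciprocal of the rank of the first relevant document, or 0 if none is returned. The sketch below uses a hypothetical helper name and made-up document IDs.

```python
def reciprocal_rank(ranked_docs, relevant_docs):
    """Return 1 / rank of the first relevant document, or 0.0 if none is found."""
    relevant = set(relevant_docs)
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical ranking: 18 non-relevant documents, then the first relevant one.
ranking = ["nonrel_%d" % i for i in range(18)] + ["rel_doc"]
rr = reciprocal_rank(ranking, {"rel_doc"})
print(round(rr, 4))
```

A per-topic value such as 0.0526 is therefore consistent with the first relevant document appearing at rank 19.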

### Q4 (d): 
No

### Q4 (e):
No

### Q4 (f):
Overall, no measure improved with the newly implemented scoring. Therefore, although tf-idf can be a good way to emphasize documents containing rare terms or high term frequency, it does not necessarily work for broad searches (i.e., queries with common terms).

## Validation

In [27]:
# Run the following cells to make sure your code returns the correct value types

In [28]:
from whoosh.index import FileIndex
from whoosh.qparser import QueryParser
from whoosh.searching import Searcher
import os.path

### Path Validation

In [29]:
assert "MATERIALS_DIR" in globals(), "variable MATERIALS_DIR does not exist"
assert(os.path.isdir(os.path.join(MATERIALS_DIR))), "MATERIALS_DIR folder does not exist"
assert(os.path.isdir(os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2"))), "invalid folder structure"
assert(os.path.isdir(os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\documents"))), "invalid folder structure"
print("Paths validated")

Paths validated


### Q2 Validation

In [30]:
assert(isinstance(INDEX_Q2, FileIndex)), "Index Type"
assert(isinstance(QP_Q2, QueryParser)), "Query Parser Type"
assert(isinstance(SEARCHER_Q2, Searcher)), "Searcher Type"
print("Q2 Types Validated")

Q2 Types Validated


### Q3 Validation

In [31]:
assert(isinstance(INDEX_Q3, FileIndex)), "Index Type"
assert(isinstance(QP_Q3, QueryParser)), "Query Parser Type"
assert(isinstance(SEARCHER_Q3, Searcher)), "Searcher Type"
print("Q3 Types Validated")

Q3 Types Validated


### Q4 Validation (Graduate Students)

In [32]:
assert((not GRAD_STUDENT) or isinstance(INDEX_Q4, FileIndex)), "Index Type"
assert((not GRAD_STUDENT) or isinstance(QP_Q4, QueryParser)), "Query Parser Type"
assert((not GRAD_STUDENT) or isinstance(SEARCHER_Q4, Searcher)), "Searcher Type"
print("Q4 Types Validated")

Q4 Types Validated
