# Information Retrieval Project

Three main software packages are used:
- Whoosh, a pure-Python search engine library,
- NLTK, a natural language processing toolkit, and
- pytrec_eval, an Information Retrieval evaluation tool for Python, based on the popular trec_eval, the standard software for evaluating search engines with test collections.

## Preparations
* Put all your imports, and path constants in the next cells
* Make sure all your path constants are **relative to** ***DATA_DIR*** and **NOT hard-coded** in your code.

In [0]:
!pip install whoosh
!pip install pytrec_eval
!pip install wget


In [0]:
import wget
wget.download("https://github.com/MIE451-1513-2019/course-datasets/raw/master/government.zip", "government.zip")

'government (1).zip'

In [0]:
!unzip government.zip

Archive:  government.zip
replace government/topics-with-full-descriptions.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

In [0]:
# imports
from whoosh import index, writing, scoring, qparser
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import *
from whoosh.qparser import QueryParser, OrGroup, AndGroup
from whoosh.scoring import BM25F, TF_IDF, Frequency
import os.path
from pathlib import Path
import tempfile
import subprocess
import pytrec_eval
import wget
import numpy as np
import matplotlib.pyplot as plt
import nltk
from nltk.stem import *
# download required resources
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [0]:
DATA_DIR = "government"
#
# Put other path constants here
#
DOCUMENTS_DIR = os.path.join(DATA_DIR, "documents")
TOPIC_FILE = os.path.join(DATA_DIR, "gov.topics")
QRELS_FILE = os.path.join(DATA_DIR, "gov.qrels")

## Final Summary:

- The most useful trec_eval measure is MAP (Mean Average Precision). MAP takes the rank of every relevant document into account (unlike P@K) and empirically correlates with human evaluation of retrieval systems
- The MAP for the baseline Whoosh system is 0.1971
- MAP increased to 0.3372 just by adding a lowercase filter and a stem filter
- The final MAP is 0.4098 after a series of modifications, including more filters, different stemmers/lemmatizers, scoring methods, query parser customizations, and parameter tuning for BM25F and PL2
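
To make the measure concrete, here is a minimal sketch of how average precision and MAP can be computed by hand. The function names and the toy document ids are illustrative assumptions, not part of pytrec_eval. The sketch counts a query with no returned documents as 0, which may differ from how trec_eval averages by default.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average precision for one query: sum of precision@k at each relevant hit,
    divided by the total number of relevant documents."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """Mean of the per-query average precision over all queries in qrels
    (assumption: a query missing from runs scores 0)."""
    scores = [average_precision(runs.get(qid, []), rel) for qid, rel in qrels.items()]
    return sum(scores) / len(scores)


# Toy data (hypothetical document ids, not from the government collection)
qrels = {"q1": {"d1", "d4"}, "q2": {"d7"}}
runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5", "d7"]}

ap_q1 = average_precision(runs["q1"], qrels["q1"])  # (1/1 + 2/4) / 2
ap_q2 = average_precision(runs["q2"], qrels["q2"])  # (1/2) / 1
print(ap_q1, ap_q2, mean_average_precision(runs, qrels))
```

Unlike P@K, moving a relevant document from rank 4 to rank 2 changes the score here, which is why MAP is sensitive to the full ranking.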

## Indexing and Querying

In [0]:
#Creating Index
def createIndex(schema):
    # Generate a temporary directory for the index
    indexDir = tempfile.mkdtemp()

    # create and return the index
    return index.create_in(indexDir, schema)    

In [0]:
# first, define a Schema for the index
mySchema = Schema(file_path = ID(stored=True),
                  file_content = TEXT(analyzer = RegexTokenizer()))

# now, create the index at the path INDEX_DIR based on the new schema
myIndex = createIndex(mySchema)

In [0]:
def addFilesToIndex(indexObj, fileList):
    # open writer
    writer = writing.BufferedWriter(indexObj, period=None, limit=1000)

    try:
        # write each file to index
        for docNum, filePath in enumerate(fileList):
            with open(filePath, "r", encoding="utf-8") as f:
                fileContent = f.read()
                writer.add_document(file_path = filePath,
                                    file_content = fileContent)

                # print status every 1000 documents
                if (docNum+1) % 1000 == 0:
                    print("already indexed:", docNum+1)
        print("done indexing.")

    finally:
        # close the index
        writer.close()

In [0]:
# Build a list of files to index
filesToIndex = [str(filePath) for filePath in Path(DOCUMENTS_DIR).glob("**/*") if filePath.is_file()]

In [0]:
addFilesToIndex(myIndex, filesToIndex)

already indexed: 1000
already indexed: 2000
already indexed: 3000
already indexed: 4000
done indexing.


In [0]:
INDEX_Q2 = myIndex # Replace None with your index for Q2
QP_Q2 = QueryParser("file_content", schema=INDEX_Q2.schema) # Replace None with your query parser for Q2
SEARCHER_Q2 = INDEX_Q2.searcher() # Replace None with your searcher for Q2

In [0]:
# print the topic file
with open(TOPIC_FILE, "r") as f:
    print(f.read())

1 mining gold silver coal
2 juvenile delinquency
4 wireless communications
6 physical therapists
7 cotton industry
9 genealogy searches
10 Physical Fitness
14 Agricultural biotechnology
16 Emergency and disaster preparedness assistance
18 Shipwrecks
19 Cybercrime, internet fraud, and cyber fraud
22 Veteran's Benefits
24 Air Bag Safety
26 Nuclear power plants
28 Early Childhood Education



In [0]:
# print the first 10 lines in the qrels file
with open(QRELS_FILE, "r") as f:
    qrels10 = f.readlines()[:10]
    print("".join(qrels10))

1 0 G00-00-0681214 0
1 0 G00-00-0945765 0
1 0 G00-00-1006224 1
1 0 G00-00-1591495 0
1 0 G00-00-2764912 0
1 0 G00-00-3253540 0
1 0 G00-00-3717374 0
1 0 G00-01-0270065 0
1 0 G00-01-0400712 0
1 0 G00-01-0682299 0
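
Each qrels line printed above has four whitespace-separated fields: topic id, an iteration field (always 0 here), document id, and a relevance judgment. As a rough sketch, the relevant documents per topic could be collected like this; `parse_qrels` is a hypothetical helper, not a pytrec_eval function.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Map topic id -> set of document ids judged relevant (relevance > 0)."""
    relevant = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic_id, _iteration, doc_id, relevance = parts
        if int(relevance) > 0:
            relevant[topic_id].add(doc_id)
    return dict(relevant)


# First three lines of the qrels output shown above
sample = [
    "1 0 G00-00-0681214 0",
    "1 0 G00-00-0945765 0",
    "1 0 G00-00-1006224 1",
]
print(parse_qrels(sample))
```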



In [0]:
def pyTrecEval(topicFile, qrelsFile, queryParser, searcher):
    # Load topic file - a list of topics (search phrases) used for evaluation
    with open(topicFile, "r") as tf:
        topics = tf.read().splitlines()

    # create an output file to which we'll write our results
    tempOutputFile = tempfile.mkstemp()[1]
    with open(tempOutputFile, "w") as outputTRECFile:
        # for each evaluated topic:
        # build a query and record the results in the file in TREC_EVAL format
        for topic in topics:
            topic_id, topic_phrase = tuple(topic.split(" ", 1))
            #print(topic_id, topic_phrase)
            topicQuery = queryParser.parse(topic_phrase)
            topicResults = searcher.search(topicQuery, limit=None)
            for (docnum, result) in enumerate(topicResults):
                score = topicResults.score(docnum)
                #print("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))
                outputTRECFile.write("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))
    with open(qrelsFile, 'r') as f_qrel:
        qrel = pytrec_eval.parse_qrel(f_qrel)

    with open(tempOutputFile, 'r') as f_run:
        run = pytrec_eval.parse_run(f_run)

    evaluator = pytrec_eval.RelevanceEvaluator(
        qrel, pytrec_eval.supported_measures)

    results = evaluator.evaluate(run)
    def print_line(measure, scope, value):
        print('{:25s}{:8s}{:.4f}'.format(measure, scope, value))

    for query_id, query_measures in results.items():
        for measure, value in query_measures.items():
            if measure == "runid":
              continue
            print_line(measure, query_id, value)
    for measure in query_measures.keys():
        if measure == "runid":
              continue
        print_line(
            measure,
            'all',
            pytrec_eval.compute_aggregated_measure(
                measure,
                [query_measures[measure]
                 for query_measures in results.values()]))

In [0]:
pyTrecEval(TOPIC_FILE, QRELS_FILE, QP_Q2, SEARCHER_Q2) 

The chosen measure (MAP) is 0.1971

## Baseline Results

Topics that did well:
- 18 Shipwrecks (MAP= 1)
- 24 Air Bag Safety (MAP= 1)
- 14 Agricultural biotechnology (MAP= 0.25)

Topics that did badly:
- 19 Cybercrime, internet fraud, and cyber fraud (did not even return a result)

Topics with 0 MAP:
- 1 mining gold silver coal
- 2 juvenile delinquency
- 7 cotton industry

## Improving on the baseline and investigation (why is the MAP score so low?)

In [0]:
def printRelName(topicFile, qrelsFile, queryParser, searcher, id):
  with open(topicFile, "r") as tf:
        topics = tf.read().splitlines()
  for topic in topics:
        topic_id, topic_phrase = tuple(topic.split(" ", 1))
        if topic_id == id:
          print("---------------------------Topic_id and Topic_phrase----------------------------------")
          print(topic_id, topic_phrase)
          topicQuery = queryParser.parse(topic_phrase)
          topicResults = searcher.search(topicQuery, limit=None)
          print("---------------------------Return documents----------------------------------")
          for (docnum, result) in enumerate(topicResults):
              score = topicResults.score(docnum)
              print("%s Q0 %s %d %lf test" % (topic_id, os.path.basename(result["file_path"]), docnum, score))
          print("---------------------------Relevant documents----------------------------------")
          with open(qrelsFile, 'r') as f_qrel:
            qrels = f_qrel.readlines()
            for i in qrels:
              qid, _, doc, rel = i.rstrip().split(" ")
              if qid == id and rel == "1":
                print(i.rstrip())

In [0]:
printRelName(TOPIC_FILE, QRELS_FILE, QP_Q2, SEARCHER_Q2, "19")

---------------------------Topic_id and Topic_phrase----------------------------------
19 Cybercrime, internet fraud, and cyber fraud
---------------------------Return documents----------------------------------
---------------------------Relevant documents----------------------------------
19 0 G00-02-3479535 1
19 0 G00-10-2344253 1


The query chosen for investigation is topic 4 (wireless communications), since it had the lowest non-zero MAP score (0.0312).

1) The documents that were highly ranked are:
- G00-99-2247765 (Whoosh Score= 16.449155)
- G00-85-1525415 (Whoosh Score= 13.364613)

2) The documents that should have been highly ranked are:
- G00-03-2855342 
- G00-36-1275993 
- G00-47-2117970 
- G00-65-0162935 

3) False positives (irrelevant documents highly ranked):
- G00-99-2247765 (Whoosh Score= 16.449155)
- G00-85-1525415 (Whoosh Score= 13.364613)

False negatives (relevant documents not ranked highly):
- G00-47-2117970 (Ranking= 7) (Whoosh Score= 10.213356)


In [0]:
#Quick check to see if index is empty or not
# Is it empty?
print("Index is empty?", INDEX_Q2.is_empty())

# How many files indexed?
print("Number of indexed files:", INDEX_Q2.doc_count())

Index is empty? False
Number of indexed files: 4078


The vanilla Whoosh system was run with only the tokenizer, which separates the text into tokens. However, the same word can appear in various other forms.

For example:
- The term "wireless communications" could appear capitalized in a document and would then not match the query
- A document could contain "communication" (singular), which would not match "communications" either.

Applying a simple lowercase filter and a stem filter should improve the MAP score of this topic.
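
The mismatch can be illustrated without Whoosh. The sketch below uses a crude suffix-stripping function as a toy stand-in for a real stemmer (it is not the Porter or Lancaster algorithm), and the sample document text is made up for illustration.

```python
import re


def tokenize(text):
    """Split on runs of word characters, roughly like a basic regex tokenizer."""
    return re.findall(r"\w+", text)


def crude_normalize(token):
    """Lowercase and strip a few English suffixes. A toy stand-in for a stemmer."""
    token = token.lower()
    for suffix in ("ations", "ation", "ate", "s"):
        if suffix == "s" and token.endswith("ss"):
            continue  # keep words like "wireless" intact
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


doc = "Wireless Communication standards"  # hypothetical document text
query = "wireless communications"

raw_doc, raw_query = set(tokenize(doc)), set(tokenize(query))
norm_doc = {crude_normalize(t) for t in tokenize(doc)}
norm_query = {crude_normalize(t) for t in tokenize(query)}

print(raw_query & raw_doc)     # exact-match tokens differ in case and suffix
print(norm_query <= norm_doc)  # every query term is found after normalization
```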

In [0]:
# define a reader object on the index
myReader = INDEX_Q2.reader()

Let's investigate the need for the two filters discussed above.
The word "communication" can appear in many different forms across the documents.

In [0]:
# the different variations of the same word
print("# docs with 'communications'", myReader.doc_frequency("file_content", "communications"))
print("# docs with 'communicate'", myReader.doc_frequency("file_content", "communicate"))
print("# docs with 'communication'", myReader.doc_frequency("file_content", "communication"))
print("# docs with 'Communication'", myReader.doc_frequency("file_content", "Communication"))

print("\n# docs with 'Wireless'", myReader.doc_frequency("file_content", "Wireless"))
print("# docs with 'wireless'", myReader.doc_frequency("file_content", "wireless"))

# docs with 'communications' 102
# docs with 'communicate' 45
# docs with 'communication' 107
# docs with 'Communication' 63

# docs with 'Wireless' 51
# docs with 'wireless' 33


As the counts show, both words appear in several surface forms. Applying a lowercase filter and a stem filter (Whoosh's `StemFilter`, which applies a Porter stemmer by default) should substantially improve the performance of the IR system.
As seen below, the filters reduce the various forms of a word to the same root.

In [0]:
# Check how the analyzer normalizes the different word forms.
# StemFilter() with no arguments stems English text; this is the same
# configuration used for the index built below.
stmLwrAnalyzer = RegexTokenizer() | LowercaseFilter() | StemFilter()
print([token.text for token in stmLwrAnalyzer("communications")])
print([token.text for token in stmLwrAnalyzer("communicate")])
print([token.text for token in stmLwrAnalyzer("communication")])
print([token.text for token in stmLwrAnalyzer("Communication")])

print([token.text for token in stmLwrAnalyzer("wireless")])
print([token.text for token in stmLwrAnalyzer("Wireless")])

['commun']
['commun']
['commun']
['commun']
['wireless']
['wireless']


The different forms now reduce to the same token, which makes it easier for the system to find the relevant documents. The same holds for the other topics as well.

In [0]:
# define a Schema with the new analyzer

stmLwrAnalyzer = RegexTokenizer() | LowercaseFilter() | StemFilter()
mySchema2 = Schema(file_path = ID(stored=True),
                   file_content = TEXT(analyzer = stmLwrAnalyzer))

# create the index based on the new schema
myIndex2 = createIndex(mySchema2)

In [0]:
addFilesToIndex(myIndex2, filesToIndex)

already indexed: 1000
already indexed: 2000
already indexed: 3000
already indexed: 4000
done indexing.


In [0]:
INDEX_Q3 = myIndex2 # Replace None with your index for Q3
QP_Q3 = QueryParser("file_content", schema=INDEX_Q3.schema) # Replace None with your query parser for Q3
SEARCHER_Q3 = INDEX_Q3.searcher() # Replace None with your searcher for Q3

In [0]:
pyTrecEval(TOPIC_FILE, QRELS_FILE, QP_Q3, SEARCHER_Q3) 

In [0]:
printRelName(TOPIC_FILE, QRELS_FILE, QP_Q3, SEARCHER_Q3, "19")

---------------------------Topic_id and Topic_phrase----------------------------------
19 Cybercrime, internet fraud, and cyber fraud
---------------------------Return documents----------------------------------
19 Q0 G00-02-3479535 0 31.173572 test
19 Q0 G00-27-1492903 1 28.493387 test
19 Q0 G00-15-1460278 2 25.139250 test
19 Q0 G00-73-0028862 3 19.084104 test
19 Q0 G00-15-3429810 4 14.596646 test
---------------------------Relevant documents----------------------------------
19 0 G00-02-3479535 1
19 0 G00-10-2344253 1


Two improvements were made:
- Lower Case: all the words in the documents and the query were lower-cased
- Stemming: this reduces each word to its base form by removing suffixes. Other methods such as lemmatization could be used instead, but stemming was chosen for this step.

The MAP for topic 4 (wireless communications) increased from 0.0312 to 0.5375 (roughly a 17-fold increase).
Of the four relevant documents:
- The first two ranked documents are relevant documents. 
- One of the relevant documents is ranked 19 (G00-65-0162935) 
- The last one is not in the returned documents list (G00-03-2855342).
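
The reported value can be checked by hand. This assumes the rank printed by printRelName is zero-based (it prints the `enumerate` index, which starts at 0), so "ranked 19" is the 20th position, and that average precision divides by all four relevant documents, with the unretrieved one contributing 0:

$$
\mathrm{AP}_{\text{topic 4}} = \frac{1}{4}\left(\frac{1}{1} + \frac{2}{2} + \frac{3}{20} + 0\right) = \frac{2.15}{4} = 0.5375
$$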

The false negative identified earlier (G00-47-2117970) is now ranked second, and the false positive documents have moved down the rankings.


The overall MAP increased from 0.1971 to 0.3372.

Queries that got better (as per MAP):
- Topic 4 increased from 0.0312 to 0.5375
- Topic 19 increased from no results to 0.5 
- Topic 2 increased from 0 to 0.5


Queries that got worse (as per MAP):
- Topic 22 decreased from 0.2 to 0.056
- Topic 26 decreased from 0.11 to 0.0778

Most of the others remained the same.

## Search Engine Optimization

The modifications are made in four main ways.

#### Way 1- Adding filters and trying different analyzers:
- Four text analyzers are compared: Whoosh's default `StemFilter` and three NLTK options (Lancaster Stemmer, WordNet Lemmatizer, WordNet Lemmatizer with verbs).

Note that an IntraWordFilter and a StopFilter were added to all four beforehand. The IntraWordFilter breaks tokens like "whoosh.analysis" into "whoosh" and "analysis", making it easier to match individual terms. The StopFilter removes words such as "are" and "and", which add no value to the search.
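
As a rough illustration of what these two filters contribute, the sketch below splits tokens on non-alphanumeric characters and drops stop words. It is a simplified approximation: the stop list is a tiny made-up sample, and the splitting rule covers only punctuation, whereas Whoosh's IntraWordFilter handles more cases.

```python
import re

# Tiny illustrative stop list (assumption; not the list Whoosh ships with)
STOP_WORDS = {"are", "and", "the", "of", "a", "in"}


def split_intraword(token):
    """Break a token on any non-alphanumeric character, e.g. '.', '-' or '_'."""
    return [part for part in re.split(r"[^A-Za-z0-9]+", token) if part]


def analyze(text):
    """Lowercase, split on whitespace, split intra-word, then drop stop words."""
    tokens = []
    for raw in text.lower().split():
        for part in split_intraword(raw):
            if part not in STOP_WORDS:
                tokens.append(part)
    return tokens


print(analyze("Filters and tokenizers are in whoosh.analysis"))
```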

Results:
- Using the Lancaster Stemmer increased MAP from 0.3372 to 0.3457
- Using Wordnet Lemmatizer increased MAP from 0.3372 to 0.3401
- Using WordNet Lemmatizer for verbs increased MAP from 0.3372 to 0.3401
- Using Whoosh's default `StemFilter` (same as Q3 but with the addition of the two new filters) reduced the MAP from 0.3372 to 0.3366.

Lancaster Stemmer gave the best MAP score of 0.3457 and will be used further.

The Lancaster Stemmer is a more aggressive alternative to the stemmer used in Q3, but the idea is the same: it reduces each word to its root. Lemmatizers serve the same purpose but are slower, and their output is an actual dictionary word, unlike the output of stemmers. In general, these analyzers help match the terms of the query against the terms in the documents.


In [0]:
# This filter will run for both the index and the query
class CustomFilter(Filter):
    is_morph = True
    def __init__(self, filterFunc, *args, **kwargs):
        self.customFilter = filterFunc
        self.args = args
        self.kwargs = kwargs
    def __eq__(self, other):
        return (other
                and self.__class__ is other.__class__)
    def __call__(self, tokens):
        for t in tokens:
            if t.mode == 'query': # if called by query parser
                t.text = self.customFilter(t.text, *self.args, **self.kwargs)
                yield t
            else: # == 'index' if called by indexer
                t.text = self.customFilter(t.text, *self.args, **self.kwargs)
                yield t

In [0]:
myFilter1 = RegexTokenizer()| LowercaseFilter() | StopFilter()| IntraWordFilter()| CustomFilter(LancasterStemmer().stem)
#myFilter2 = RegexTokenizer()| LowercaseFilter() | StopFilter()| IntraWordFilter()| CustomFilter(WordNetLemmatizer().lemmatize)
#myFilter3 = RegexTokenizer()| LowercaseFilter() | StopFilter()| IntraWordFilter()| CustomFilter(WordNetLemmatizer().lemmatize, 'v')
#myFilter4 = RegexTokenizer()| LowercaseFilter() | StopFilter()| IntraWordFilter()| StemFilter () 

# define a Schema with the new analyzer

mySchema3 = Schema(file_path = ID(stored=True),
                   file_content = TEXT(analyzer = myFilter1))

# create the index based on the new schema
myIndex3 = createIndex(mySchema3)

In [0]:
addFilesToIndex(myIndex3, filesToIndex)

already indexed: 1000
already indexed: 2000
already indexed: 3000
already indexed: 4000
done indexing.


In [0]:
INDEX_Way1 = myIndex3 
QP_Way1 = QueryParser("file_content", schema=INDEX_Way1.schema) 
SEARCHER_Way1 = INDEX_Way1.searcher() 

In [0]:
pyTrecEval(TOPIC_FILE, QRELS_FILE, QP_Way1, SEARCHER_Way1) 

#### Way 2- Using different types of scoring methods
The three scoring methods used are: TF_IDF, BM25F and PL2
(Reference: https://whoosh.readthedocs.io/en/latest/api/scoring.html, http://ir.dcs.gla.ac.uk/smooth/he-ecir05.pdf and https://en.wikipedia.org/wiki/Okapi_BM25)

- TF_IDF weights a term by its frequency within a document and by its rarity across the whole collection, so rare terms count for more. It gives a MAP of 0.1612
- BM25F is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of their proximity within the document. It is a family of scoring functions with slightly different components and parameters. It gives a MAP of 0.3457, the same as in Way 1, since it is the default scorer, but it has parameters that can be tuned.
- PL2 is one of the divergence from randomness (DFR) document weighting models. It gave a MAP of 0.3245

BM25F and PL2 both have parameters that can be tuned and hence they will both be used in Way 4.

Scoring matters because relevant documents need to appear at the top ranks: users will not go through all the results to find the relevant ones.
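
To show why rarity matters, here is a minimal sketch of a textbook TF-IDF weight. It is not the exact formula used by Whoosh's `TF_IDF` class, and the document frequencies are chosen purely for illustration.

```python
import math


def tf_idf(term_freq, doc_freq, num_docs):
    """Textbook TF-IDF: raw term frequency times log inverse document frequency."""
    return term_freq * math.log(num_docs / doc_freq)


NUM_DOCS = 4078  # number of indexed files reported earlier in this notebook

# Same frequency within the document, different rarity across the collection
common = tf_idf(term_freq=3, doc_freq=2000, num_docs=NUM_DOCS)
rare = tf_idf(term_freq=3, doc_freq=33, num_docs=NUM_DOCS)
print(round(common, 3), round(rare, 3))
```

A term that occurs in few documents receives a much larger weight than one that occurs in about half of them, even though both appear three times in the document being scored.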

In [0]:
INDEX_Way2 = myIndex3 #Using the Index from Way 1
QP_Way2 = QueryParser("file_content", schema=INDEX_Way2.schema)

#weighting= scoring.TF_IDF()
#weighting= scoring.PL2()
SEARCHER_Way2 = INDEX_Way2.searcher(weighting= scoring.BM25F())

In [0]:
pyTrecEval(TOPIC_FILE, QRELS_FILE, QP_Way2, SEARCHER_Way2) 

#### Way 3- Using AND/OR grouping in the query parser

Several customizations can be made to the query parser. (Reference: https://whoosh.readthedocs.io/en/latest/parsing.html)
- AND grouping (`AndGroup`): all the terms must be present for a document to match. This is the default setting, which gives a MAP of 0.3446
- OR grouping (`OrGroup`): any of the terms may be present for a document to match. One drawback is that, if the user searches for foo bar, a document with four occurrences of foo would normally outscore a document that contains one occurrence each of foo and bar. The MAP increases to 0.38.
- The factory() class method of OrGroup: users usually expect documents that contain more of the words they searched for to score higher. The factory() class method of OrGroup configures the parser to produce OR groups with this behavior. The MAP score is 0.3752.

Plain OR grouping is used from here on.
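
The difference between the two groupings can be sketched with plain sets. This only models which documents match, not how they are scored; the documents and the `matches` helper are hypothetical.

```python
def matches(query_terms, doc_terms, mode):
    """Boolean matching: 'and' needs every query term, 'or' needs at least one."""
    query, doc = set(query_terms), set(doc_terms)
    if mode == "and":
        return query <= doc
    if mode == "or":
        return bool(query & doc)
    raise ValueError("mode must be 'and' or 'or'")


# Hypothetical, already-normalized documents
docs = {
    "d1": ["cybercrime", "report"],
    "d2": ["internet", "fraud", "cyber", "cybercrime"],
    "d3": ["cotton", "industry"],
}
query = ["cybercrime", "internet", "fraud", "cyber"]

and_hits = [d for d, terms in docs.items() if matches(query, terms, "and")]
or_hits = [d for d, terms in docs.items() if matches(query, terms, "or")]
print(and_hits)  # only documents containing all four terms
print(or_hits)   # any document sharing at least one term
```

OR grouping enlarges the candidate set, so relevant documents that miss a single query term can still be retrieved and ranked.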

In [0]:
INDEX_Way3 = myIndex3 #Using the Index from Way 1
#group= AndGroup
#group= OrGroup.factory(0.9)
QP_Way3 = QueryParser("file_content", schema=INDEX_Way3.schema, group= OrGroup) 
SEARCHER_Way3 = INDEX_Way3.searcher(weighting= scoring.BM25F())

In [0]:
pyTrecEval(TOPIC_FILE, QRELS_FILE, QP_Way3, SEARCHER_Way3) 

#### Way 4- Parameter Tuning of BM25F and PL2
BM25F has two main parameters, K1 and B (Reference: https://km.aifb.kit.edu/ws/semsearch10/Files/bm25f.pdf and http://www.minerazzi.com/tutorials/bm25f-model-tutorial.pdf)

- K1 is a constant that controls the non-linear growth of the term-frequency component, i.e. how quickly additional occurrences of a term stop adding to the score (term-frequency saturation).

- B controls document length normalization: B=1 applies full normalization, B=0 applies none, and values in between apply partial ("soft") normalization.

Best MAP score- 0.4022 (B= 0.57, K1=6.0)

Main parameter of PL2: 
- c is the free parameter of the normalisation method

Best MAP score= 0.4098 (c=2.67)


The pyTrecEval function is modified so that it can be called in a loop over different parameter values.
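
The roles of K1 and B can be seen in a minimal sketch of a single-term BM25 score. This is one common formulation with commonly used default values (k1=1.2, b=0.75), not necessarily the exact expression inside Whoosh's BM25F, and the collection statistics are illustrative.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, doc_freq, num_docs, k1=1.2, b=0.75):
    """Score contribution of one query term under a common BM25 formulation."""
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = (1 - b) + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


# Illustrative statistics for an average-length document
common = dict(doc_len=100, avg_doc_len=100, doc_freq=50, num_docs=4078)

# Saturation: each extra occurrence adds less; a larger k1 delays the plateau
for k1 in (1.2, 6.0):
    scores = [bm25_term_score(tf, k1=k1, **common) for tf in (1, 2, 4, 8)]
    print(k1, [round(s, 3) for s in scores])

# Length normalization: with b=0 the document length is ignored entirely
short = bm25_term_score(2, 50, 100, 50, 4078, b=0.0)
long_ = bm25_term_score(2, 500, 100, 50, 4078, b=0.0)
print(short == long_)
```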

In [0]:
def pyTrecEval_modified(topicFile, qrelsFile, queryParser, searcher):
    # Load topic file - a list of topics (search phrases) used for evaluation
    with open(topicFile, "r") as tf:
        topics = tf.read().splitlines()

    # create an output file to which we'll write our results
    tempOutputFile = tempfile.mkstemp()[1]
    with open(tempOutputFile, "w") as outputTRECFile:
        # for each evaluated topic:
        # build a query and record the results in the file in TREC_EVAL format
        for topic in topics:
            topic_id, topic_phrase = tuple(topic.split(" ", 1))
            #print(topic_id, topic_phrase)
            topicQuery = queryParser.parse(topic_phrase)
            topicResults = searcher.search(topicQuery, limit=None)
            for (docnum, result) in enumerate(topicResults):
                score = topicResults.score(docnum)
                #print("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))
                outputTRECFile.write("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))
    with open(qrelsFile, 'r') as f_qrel:
        qrel = pytrec_eval.parse_qrel(f_qrel)

    with open(tempOutputFile, 'r') as f_run:
        run = pytrec_eval.parse_run(f_run)

    evaluator = pytrec_eval.RelevanceEvaluator(
        qrel, pytrec_eval.supported_measures)

    results = evaluator.evaluate(run)
    def print_line(measure, scope, value):
      if measure == "map" and scope== "all":
        global saved_value
        saved_value= value
        #print('{:25s}{:8s}{:.4f}'.format(measure, scope, value))

    for query_id, query_measures in results.items():
        for measure, value in query_measures.items():
            if measure == "runid":
              continue
            print_line(measure, query_id, value)
    for measure in query_measures.keys():
        if measure == "runid":
              continue
        print_line(
            measure,
            'all',
            pytrec_eval.compute_aggregated_measure(
                measure,
                [query_measures[measure]
                 for query_measures in results.values()]))
       

In [0]:
INDEX_Way4 = myIndex3 #Using the Index from Way 1
QP_Way4 = QueryParser("file_content", schema=INDEX_Way4.schema, group= OrGroup)

The next two code blocks were run repeatedly to find the best values of the parameters B and K1.

In [0]:
MAP_K1= []
K1= []
for i in np.arange(0,7, 0.1):
  SEARCHER_Way4 = INDEX_Way4.searcher(weighting= scoring.BM25F(B= 0.57, K1=i))
  pyTrecEval_modified(TOPIC_FILE, QRELS_FILE, QP_Way4, SEARCHER_Way4)
  MAP_K1.append(saved_value)
  K1.append(i)

In [0]:
#Effect of changing K1 parameter on MAP value with a constant B parameter
plt.plot(K1, MAP_K1)
plt.title("Overall MAP vs K1")
plt.xlabel('K1')
plt.ylabel("Overall MAP")
plt.grid()
plt.show()

In [0]:
MAP_B= []
B= []
for i in np.arange(0, 1, 0.01):
  SEARCHER_Way4 = INDEX_Way4.searcher(weighting=scoring.BM25F(B= i, K1=6))
  pyTrecEval_modified(TOPIC_FILE, QRELS_FILE, QP_Way4, SEARCHER_Way4)
  MAP_B.append(saved_value)
  B.append(i)

In [0]:
#Effect of changing B parameter on MAP value with a constant K1 parameter
plt.plot(B, MAP_B)
plt.title("Overall MAP vs B")
plt.xlabel('B')
plt.ylabel("Overall MAP")
plt.grid()
plt.show()

Using the above tuning method, the best parameters were obtained (B= 0.57, K1=6).
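
The tuning procedure amounts to a one-dimensional sweep: evaluate each candidate value and keep the one with the highest MAP. The sketch below shows that pattern with a hypothetical stand-in objective, `fake_map`, in place of a real search-and-evaluate run. It assumes the evaluation function returns the MAP directly instead of storing it in a global variable.

```python
def sweep(evaluate, values):
    """Evaluate each candidate and return (best_value, best_score, all_scores)."""
    scores = [(value, evaluate(value)) for value in values]
    best_value, best_score = max(scores, key=lambda pair: pair[1])
    return best_value, best_score, scores


def fake_map(k1):
    """Hypothetical stand-in for a MAP evaluation; it peaks at k1 = 6."""
    return 0.40 - 0.001 * (k1 - 6.0) ** 2


candidates = [round(0.5 * step, 1) for step in range(0, 15)]  # 0.0, 0.5, ..., 7.0
best_k1, best_map, _ = sweep(fake_map, candidates)
print(best_k1, round(best_map, 4))
```

Alternating such sweeps between two parameters, holding one fixed while varying the other, is a simple coordinate-wise search; it is not guaranteed to find the joint optimum.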

In [0]:
#Using the best parameters from the above tuning
SEARCHER_Way4 = INDEX_Way4.searcher(weighting=scoring.BM25F(B= 0.57, K1=6))

In [0]:
pyTrecEval(TOPIC_FILE, QRELS_FILE, QP_Way4, SEARCHER_Way4) 

Now to tune the c parameter in PL2.

In [0]:
MAP_c= []
c= []
for i in np.arange(0.1, 3, 0.1):
  SEARCHER_Way4 = INDEX_Way4.searcher(weighting=scoring.PL2(c=i))
  pyTrecEval_modified(TOPIC_FILE, QRELS_FILE, QP_Way4, SEARCHER_Way4)
  MAP_c.append(saved_value)
  c.append(i)

In [0]:
#Effect of changing c parameter on MAP value
plt.plot(c, MAP_c)
plt.title("Overall MAP vs c (PL2)")
plt.xlabel('c')
plt.ylabel("Overall MAP")
plt.grid()
plt.show()

In [0]:
#Using the best parameters from the above tuning
SEARCHER_Way4 = INDEX_Way4.searcher(weighting=scoring.PL2(c=2.67))

In [0]:
pyTrecEval(TOPIC_FILE, QRELS_FILE, QP_Way4, SEARCHER_Way4) 

#### Other modifications tried:
- Bi-word filter
- Charset Filter
- Different tokenizers
- Different scoring methods such as Frequency and Function Weighting

They all resulted in worse MAP scores and hence are not included within this notebook.

### Final Result

In [0]:
INDEX_Q4 = myIndex3 # Replace None with your index for Q4
QP_Q4 = QueryParser("file_content", schema=INDEX_Q4.schema, group= OrGroup) # Replace None with your query parser for Q4
SEARCHER_Q4 = INDEX_Q4.searcher(weighting=scoring.PL2(c=2.67)) # Replace None with your searcher for Q4

In [0]:
pyTrecEval(TOPIC_FILE, QRELS_FILE, QP_Q4, SEARCHER_Q4) 

Final MAP= 0.41
