# Assignment: Information Retrieval (IR)

## Preparations
* Put all your imports and path constants in the next cells

In [1]:
!pip install whoosh
!pip install pytrec_eval
!pip install wget

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
import wget
wget.download("https://github.com/MIE1513HS-2022/course-datasets/raw/main/government.zip", "government.zip")

'government (1).zip'

In [3]:
!unzip government.zip

Archive:  government.zip
replace government/topics-with-full-descriptions.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: N


In [4]:
# imports
# Put all your imports here
from whoosh import index, writing, qparser, scoring
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import *
from whoosh.qparser import QueryParser
import os.path
from pathlib import Path
import tempfile
import subprocess
import pytrec_eval
import wget
import abc
import nltk
from abc import abstractmethod
from whoosh.analysis import Filter
from nltk.stem import *

In [5]:
class IRSystem(metaclass=abc.ABCMeta):
    """
    Abstract base class inherited by the other IR systems
    """

    def __init__(self, data_dir):
        # DON'T change the following names: topic_file, qrels_file, document_dir, file_list
        self.topic_file = os.path.join(data_dir, "gov.topics")
        self.qrels_file = os.path.join(data_dir, "gov.qrels")
        self.document_dir = os.path.join(data_dir, "documents") 
        self.file_list = [str(filePath) for filePath in Path(self.document_dir).glob("**/*") if filePath.is_file()]

        self.create_index()
        self.add_files()
        self.create_parser_searcher()

    @abstractmethod
    def create_index(self):
        pass

    @abstractmethod
    def add_files(self):
        pass

    @abstractmethod
    def create_parser_searcher(self):
        pass

    @abstractmethod
    def perform_search(self, topic_phrase):
        pass

    @staticmethod
    def post_process_score(score):
        return score

    @staticmethod
    def print_trec_eval_result(results):
        if not results:
            print('empty results')
            return

        def print_line(name, scope, num):
            print('{:25s}{:8s}{:.4f}'.format(name, scope, num))

        for query_id, query_measures in results.items():
            for measure, value in query_measures.items():
                if measure == "runid":
                    continue
                print_line(measure, query_id, value)

        for measure in query_measures.keys():
            if measure == "runid":
                continue
            print_line(
                measure,
                'all',
                pytrec_eval.compute_aggregated_measure(
                    measure,
                    [query_measures[measure]
                     for query_measures in results.values()]))
            
    def print_rel_name(self, q_id):
        with open(self.topic_file, "r") as tf:
            topics = tf.read().splitlines()
        for topic in topics:
            topic_id, topic_phrase = tuple(topic.split(" ", 1))
            if topic_id == q_id:
                print("---------------------------Topic_id and Topic_phrase----------------------------------")
                print(topic_id, topic_phrase)
                 # get search result
                topic_results = self.perform_search(topic_phrase)
                print("---------------------------Return documents----------------------------------")
                for (docnum, result) in enumerate(topic_results):
                    score = topic_results.score(docnum)
                    score = self.post_process_score(score)
                    print("%s Q0 %s %d %lf test" % (topic_id, os.path.basename(result["file_path"]), docnum, score))
                print("---------------------------Relevant documents----------------------------------")
                with open(self.qrels_file, 'r') as f_qrel:
                    qrels = f_qrel.readlines()
                    for i in qrels:
                        qid, _, doc, rel = i.rstrip().split(" ")
                        if qid == q_id and rel == "1":
                            print(i.rstrip())

    def py_trec_eval(self):
        # Load topic file - a list of topics (search phrases) used for evaluation
        with open(self.topic_file, "r") as tf:
            topics = tf.read().splitlines()

            # create an output file to which we'll write our results
        temp_output_file = tempfile.mkstemp()[1]
        with open(temp_output_file, "w") as outputTRECFile:
            # for each evaluated topic:
            # build a query and record the results in the file in TREC_EVAL format
            for topic in topics:
                topic_id, topic_phrase = tuple(topic.split(" ", 1))
                # get search result
                topic_results = self.perform_search(topic_phrase)
                # format the result
                for (docnum, result) in enumerate(topic_results):
                    score = topic_results.score(docnum)
                    outputTRECFile.write(
                        "%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))

        with open(self.qrels_file, 'r') as f_qrel:
            qrel = pytrec_eval.parse_qrel(f_qrel)

        with open(temp_output_file, 'r') as f_run:
            run = pytrec_eval.parse_run(f_run)

        evaluator = pytrec_eval.RelevanceEvaluator(
            qrel, pytrec_eval.supported_measures)

        results = evaluator.evaluate(run)

        self.print_trec_eval_result(results)


In [6]:
# Don't change this! Use it as-is in your code
# This filter will run for both the index and the query
class CustomFilter(Filter):
    is_morph = True
    def __init__(self, filterFunc, *args, **kwargs):
        self.customFilter = filterFunc
        self.args = args
        self.kwargs = kwargs
    def __eq__(self, other):
        # 'other' must be a parameter; the original signature left it undefined
        return (other
                and self.__class__ is other.__class__)
    def __call__(self, tokens):
        for t in tokens:
            if t.mode == 'query': # if called by query parser
                t.text = self.customFilter(t.text, *self.args, **self.kwargs)
                yield t
            else: # == 'index' if called by indexer
                t.text = self.customFilter(t.text, *self.args, **self.kwargs)
                yield t

## Question 1
Provide your text answers in the following two markdown cells

### Q1 (a): Provide answer to Q1 (a) here [markdown cell]

Mean average precision (map)

### Q1 (b): Provide answer to Q1 (b) here [markdown cell]

MAP is a suitable measure for evaluating search engines for government websites. It gives equal weight to every query, and it rewards both retrieving the relevant documents and ranking them near the top of the result list. In my opinion, MAP is appropriate for measuring the performance of a government website search system, both overall and on individual topics.
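
To make the "equal weight to every query" point concrete, here is a minimal sketch of how average precision and MAP can be computed by hand. The function names and the toy document IDs are my own assumptions for illustration; this is not the pytrec_eval implementation.

```python
from typing import Dict, List, Set


def average_precision(ranked_docs: List[str], relevant: Set[str]) -> float:
    """Mean of precision@k over the ranks k where a relevant document appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_docs, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that are never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """Unweighted mean of per-query AP, so every query counts equally."""
    scores = [average_precision(runs.get(qid, []), rel)
              for qid, rel in qrels.items()]
    return sum(scores) / len(scores)


# Toy data (hypothetical IDs, not taken from the government collection).
toy_runs = {"q1": ["d1", "d2", "d3"], "q2": ["d9", "d4"]}
toy_qrels = {"q1": {"d1", "d3"}, "q2": {"d4"}}

ap_q1 = average_precision(toy_runs["q1"], toy_qrels["q1"])  # (1/1 + 2/3) / 2
ap_q2 = average_precision(toy_runs["q2"], toy_qrels["q2"])  # (1/2) / 1
toy_map = mean_average_precision(toy_runs, toy_qrels)
print(ap_q1, ap_q2, toy_map)
```

A query with many relevant documents and a query with a single relevant document each contribute exactly one AP value to the mean, which is the property argued for above.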

## Question 2

### Q2 (a): Write your code below

**1. The auto-grader will extract and use the following variables, DON'T change their names:**

      self.topic_file  
      self.qrels_file  
      self.document_dir   
      self.file_list  
      self.index_sys  
      self.query_parser  
      self.searcher   



**2. DON'T change the names of the already defined functions**  
**3. DON'T change the py_trec_eval function**  
**4. DON'T change the class names including CustomFilter, IRSystem, IRQ2, IRQ3, IRQ4**  
**5. DON'T change the CustomFilter class and DON'T create any new custom filter class that is used to define Whoosh schema**

In [7]:
class IRQ2(IRSystem):
    def create_index(self):
        """
        INPUT:
            None
        OUTPUT:
            None
        
        NOTE: Please update self.index_sys which should have type whoosh.index.FileIndex
        """        
        # DON't change the name of 'index_sys'
        
        mySchema = Schema(file_path = ID(stored=True),
                          file_content = TEXT(analyzer = RegexTokenizer()))

         # Generate a temporary directory for the index
        indexDir = tempfile.mkdtemp()

        # create and return the index
        self.index_sys = index.create_in(indexDir, mySchema)


    def add_files(self):
        """
        INPUT:
            None
        OUTPUT:
            None
        
        NOTE: Add buffer to self.index_sys
        """
        # Build a list of files to index
        filesToIndex = [str(filePath) for filePath in Path(self.document_dir).glob("**/*") if filePath.is_file()]
        # open writer
        writer = writing.BufferedWriter(self.index_sys, period=None, limit=1000)
        try:
          # write each file to index
          for docNum, filePath in enumerate(filesToIndex):
              with open(filePath, "r", encoding="utf-8") as f:
                  fileContent = f.read()
                  writer.add_document(file_path = filePath,
                                      file_content = fileContent)

                  # print status every 1000 documents
                  if (docNum+1) % 1000 == 0:
                      print("already indexed:", docNum+1)
          print("done indexing.")

        finally:
            # close the index
            writer.close()

    def create_parser_searcher(self):
        """
        INPUT:
            None
        OUTPUT:
            None
        
        NOTE: Please update self.query_parser and self.searcher which should have type whoosh.qparser.default.QueryParser and whoosh.searching.Searcher respectively 
        """
        # define a query parser for the field "file_content" in the index
         # DON't change the names of 'query_parser' and 'searcher'
        self.query_parser = QueryParser("file_content", schema=self.index_sys.schema)
        self.searcher = self.index_sys.searcher()

    def perform_search(self, topic_phrase):
        """
        INPUT:
            topic_phrase: string
        OUTPUT:
            topic_results: whoosh.searching.Results
        
        NOTE: Utilize self.query_parser and self.searcher to calculate the result for topic_phrase
        """
        sampleQuery = self.query_parser.parse(topic_phrase)
        topic_results = self.searcher.search(sampleQuery, limit=None)
        return topic_results

In [8]:
q2 = IRQ2("government")

already indexed: 1000
already indexed: 2000
already indexed: 3000
already indexed: 4000
done indexing.


In [9]:
q2.py_trec_eval()

num_q                    1       1.0000
num_ret                  1       1.0000
num_rel                  1       5.0000
num_rel_ret              1       0.0000
map                      1       0.0000
gm_map                   1       -11.5129
Rprec                    1       0.0000
bpref                    1       0.0000
recip_rank               1       0.0000
iprec_at_recall_0.00     1       0.0000
iprec_at_recall_0.10     1       0.0000
iprec_at_recall_0.20     1       0.0000
iprec_at_recall_0.30     1       0.0000
iprec_at_recall_0.40     1       0.0000
iprec_at_recall_0.50     1       0.0000
iprec_at_recall_0.60     1       0.0000
iprec_at_recall_0.70     1       0.0000
iprec_at_recall_0.80     1       0.0000
iprec_at_recall_0.90     1       0.0000
iprec_at_recall_1.00     1       0.0000
P_5                      1       0.0000
P_10                     1       0.0000
P_15                     1       0.0000
P_20                     1       0.0000
P_30                     1       0.000

In [10]:
q2.print_rel_name('1')

---------------------------Topic_id and Topic_phrase----------------------------------
1 mining gold silver coal
---------------------------Return documents----------------------------------
1 Q0 G00-90-0342721 0 26.645398 test
---------------------------Relevant documents----------------------------------
1 0 G00-00-1006224 1
1 0 G00-02-0901987 1
1 0 G00-03-1898526 1
1 0 G00-10-3730888 1
1 0 G00-10-3849661 1


### Q2 (b): Provide answer to Q2 (b) here [markdown cell]

The MAP and set_recall results of the baseline Whoosh system are both very low: MAP is 0.1971 and set_recall is 0.3988, as shown in the output data.
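
For reference, I read set_recall as the fraction of judged-relevant documents that were returned at all, regardless of rank. The sketch below recomputes it (and the matching set precision) from the counts printed by py_trec_eval for topic 1; the helper names are my own and the formulas are my understanding of the measures, not code taken from trec_eval.

```python
def set_recall(num_rel_ret: int, num_rel: int) -> float:
    """Share of relevant documents that appear anywhere in the result list."""
    return num_rel_ret / num_rel if num_rel else 0.0


def set_precision(num_rel_ret: int, num_ret: int) -> float:
    """Share of returned documents that are relevant."""
    return num_rel_ret / num_ret if num_ret else 0.0


# Topic 1 in the Q2 output: num_ret=1, num_rel=5, num_rel_ret=0.
q2_topic1_recall = set_recall(num_rel_ret=0, num_rel=5)
q2_topic1_precision = set_precision(num_rel_ret=0, num_ret=1)

# Topic 1 in the Q4 output: num_ret=470, num_rel=5, num_rel_ret=5.
q4_topic1_recall = set_recall(num_rel_ret=5, num_rel=5)
q4_topic1_precision = set_precision(num_rel_ret=5, num_ret=470)

print(q2_topic1_recall, q2_topic1_precision)
print(q4_topic1_recall, q4_topic1_precision)
```

The contrast between the two rows shows why recall alone is not enough: returning 470 documents finds every relevant one, but only a small fraction of what is returned is relevant.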

### Q2 (c): Provide answer to Q2(c) here [markdown cell]

Topics 18 and 24 performed exceptionally well: all relevant documents were retrieved (the number of relevant documents retrieved matches the number of relevant documents) and their MAP scores are 1. For topics 1, 2, 6, 7, 9, 16, and 28, however, the MAP scores are all zero, indicating that no relevant documents were retrieved. In some cases no score was returned at all, which also indicates poor performance.
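
A small sketch of how the best and worst topics can be picked out of a per-topic results dictionary shaped like the one print_trec_eval_result iterates over (query id mapped to a dictionary of measures). The helper name split_by_map is hypothetical, and the dictionary below only repeats a few of the per-topic MAP values quoted above rather than the full evaluator output.

```python
from typing import Dict, List, Tuple


def split_by_map(results: Dict[str, Dict[str, float]],
                 measure: str = "map") -> Tuple[List[str], List[str]]:
    """Return (topics scoring 1.0, topics scoring 0.0) for the given measure."""
    # key=int assumes topic ids are numeric strings, as in gov.topics.
    perfect = sorted((q for q, m in results.items() if m[measure] == 1.0), key=int)
    zero = sorted((q for q, m in results.items() if m[measure] == 0.0), key=int)
    return perfect, zero


# Subset of per-topic MAP values mentioned in this answer.
per_topic = {
    "18": {"map": 1.0},
    "24": {"map": 1.0},
    "1": {"map": 0.0},
    "9": {"map": 0.0},
    "28": {"map": 0.0},
}

best_topics, worst_topics = split_by_map(per_topic)
print("perfect:", best_topics)
print("zero:", worst_topics)
```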

## Question 3

In [11]:
q2.print_rel_name('9')

---------------------------Topic_id and Topic_phrase----------------------------------
9 genealogy searches
---------------------------Return documents----------------------------------
9 Q0 G00-26-1048210 0 12.268873 test
9 Q0 G00-59-3622783 1 5.132722 test
---------------------------Relevant documents----------------------------------
9 0 G00-91-3181951 1


### Q3 (a): Provide answer to Q3 (a) here [markdown cell]

The Whoosh system needs a better analyzer in order to improve its performance. The current analyzer has limitations: it cannot match words that appear in different forms (for example, with different suffixes), and it does not effectively remove stop words and punctuation.

Using query 9, "genealogy searches", as an example, two results are listed in the sample output: G00-26-1048210 and G00-59-3622783. On further investigation, both are false positives, as neither is present in the qrels file. The document "G00-26-1048210" is a FAQ for the public and does not provide much useful information for "genealogy searches". It appears in the results because it contains the phrase "Genealogical Research".

On the other hand, the document "G00-91-3181951" is relevant according to the qrels file, but does not appear in the sample result list, making it a false negative. This document is actually about "genealogy" topics, but was not extracted properly, possibly due to its key words being in different forms or variations.
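
The false positive and false negative sets for topic 9 can be derived mechanically from the two lists printed by q2.print_rel_name('9'). This is a minimal sketch using plain Python sets; the variable names are my own.

```python
# Documents returned by the Q2 system for topic 9 (from the output above).
returned = {"G00-26-1048210", "G00-59-3622783"}

# Documents marked relevant for topic 9 in the qrels file.
relevant = {"G00-91-3181951"}

true_positives = returned & relevant    # returned and relevant
false_positives = returned - relevant   # returned but not relevant
false_negatives = relevant - returned   # relevant but never returned

print("TP:", sorted(true_positives))
print("FP:", sorted(false_positives))
print("FN:", sorted(false_negatives))
```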

The results indicate that several factors can hurt Whoosh's performance, such as the presence of stop words, differences in upper/lower case, and different forms of words with the same root (such as "genealogical" and "genealogy"). Improving the performance may involve normalizing the words in both the documents and the queries before the search, using a stemmer or lemmatizer from NLTK (such as the LancasterStemmer) together with Whoosh filters such as LowercaseFilter and StopFilter.
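
The effect of such a normalization chain can be illustrated without Whoosh or NLTK. The sketch below is a toy stand-in: the regular expression is a simplification of a real tokenizer, the stop list is a tiny hypothetical one, and toy_stem is a crude suffix stripper that I made up for illustration. It is not the Lancaster algorithm.

```python
import re
from typing import List

# Tiny hypothetical stop list; real stop filters use a much longer one.
STOP_WORDS = frozenset({"a", "and", "for", "of", "the"})

# Suffixes tried in order; chosen only so the example words collapse together.
TOY_SUFFIXES = ("ical", "es", "y", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text: str) -> List[str]:
    """Tokenize, lowercase, drop stop words, then stem: the same order of steps."""
    tokens = re.findall(r"\w+", text)
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [toy_stem(t) for t in tokens]


query_terms = analyze("genealogy searches")
doc_terms = analyze("Guide for Genealogical Research")
shared_terms = set(query_terms) & set(doc_terms)
print(query_terms, doc_terms, shared_terms)
```

Without the lowercase and stemming steps the query and the document text share no token at all, which is the kind of mismatch described for topic 9.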

### Q3 (b): Write your code below

**1. The auto-grader will extract and use the following variables, DON'T change their names:**

      self.topic_file  
      self.qrels_file  
      self.document_dir   
      self.file_list  
      self.index_sys  
      self.query_parser  
      self.searcher   



**2. DON'T change the names of the already defined functions**  
**3. DON'T change the py_trec_eval function**  
**4. DON'T change the class names including CustomFilter, IRSystem, IRQ2, IRQ3, IRQ4**  
**5. DON'T change the CustomFilter class and DON'T create any new custom filter class that is used to define Whoosh schema**

In [12]:
# download required resources
nltk.download("wordnet")  

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [13]:
class IRQ3(IRSystem):
    def create_index(self):
        """
        INPUT:
            None
        OUTPUT:
            None
        
        NOTE: Please update self.index_sys which should have type whoosh.index.FileIndex
        """        
        # DON't change the name of 'index_sys'
        myFilter = RegexTokenizer() | LowercaseFilter() | IntraWordFilter() | StopFilter() | CustomFilter(LancasterStemmer().stem)
        mySchema = Schema(file_path = ID(stored=True),
                          file_content = TEXT(analyzer = myFilter))

         # Generate a temporary directory for the index
        indexDir = tempfile.mkdtemp()

        # create and return the index
        self.index_sys = index.create_in(indexDir, mySchema)


    def add_files(self):
        """
        INPUT:
            None
        OUTPUT:
            None
        
        NOTE: Add buffer to self.index_sys
        """
        # Build a list of files to index
        filesToIndex = [str(filePath) for filePath in Path(self.document_dir).glob("**/*") if filePath.is_file()]
        # open writer
        writer = writing.BufferedWriter(self.index_sys, period=None, limit=1000)
        try:
          # write each file to index
          for docNum, filePath in enumerate(filesToIndex):
              with open(filePath, "r", encoding="utf-8") as f:
                  fileContent = f.read()
                  writer.add_document(file_path = filePath,
                                      file_content = fileContent)

                  # print status every 1000 documents
                  if (docNum+1) % 1000 == 0:
                      print("already indexed:", docNum+1)
          print("done indexing.")

        finally:
            # close the index
            writer.close()

    def create_parser_searcher(self):
        """
        INPUT:
            None
        OUTPUT:
            None
        
        NOTE: Please update self.query_parser and self.searcher which should have type whoosh.qparser.default.QueryParser and whoosh.searching.Searcher respectively 
        """
        # define a query parser for the field "file_content" in the index
         # DON't change the names of 'query_parser' and 'searcher'
        self.query_parser = QueryParser("file_content", schema=self.index_sys.schema)
        self.searcher = self.index_sys.searcher()

    def perform_search(self, topic_phrase):
        """
        INPUT:
            topic_phrase: string
        OUTPUT:
            topic_results: whoosh.searching.Results
        
        NOTE: Utilize self.query_parser and self.searcher to calculate the result for topic_phrase
        """
        sampleQuery = self.query_parser.parse(topic_phrase)
        topic_results = self.searcher.search(sampleQuery, limit=None)
        return topic_results

In [14]:
q3 = IRQ3("government")

already indexed: 1000
already indexed: 2000
already indexed: 3000
already indexed: 4000
done indexing.


In [15]:
q3.py_trec_eval()

num_q                    1       1.0000
num_ret                  1       3.0000
num_rel                  1       5.0000
num_rel_ret              1       0.0000
map                      1       0.0000
gm_map                   1       -11.5129
Rprec                    1       0.0000
bpref                    1       0.0000
recip_rank               1       0.0000
iprec_at_recall_0.00     1       0.0000
iprec_at_recall_0.10     1       0.0000
iprec_at_recall_0.20     1       0.0000
iprec_at_recall_0.30     1       0.0000
iprec_at_recall_0.40     1       0.0000
iprec_at_recall_0.50     1       0.0000
iprec_at_recall_0.60     1       0.0000
iprec_at_recall_0.70     1       0.0000
iprec_at_recall_0.80     1       0.0000
iprec_at_recall_0.90     1       0.0000
iprec_at_recall_1.00     1       0.0000
P_5                      1       0.0000
P_10                     1       0.0000
P_15                     1       0.0000
P_20                     1       0.0000
P_30                     1       0.000

In [16]:
q3.print_rel_name('9')

---------------------------Topic_id and Topic_phrase----------------------------------
9 genealogy searches
---------------------------Return documents----------------------------------
9 Q0 G00-30-0221651 0 14.375038 test
9 Q0 G00-79-2892445 1 13.424951 test
9 Q0 G00-26-1048210 2 12.392029 test
9 Q0 G00-55-0643570 3 11.497054 test
9 Q0 G00-08-1314254 4 10.777542 test
9 Q0 G00-08-0900666 5 10.777542 test
9 Q0 G00-02-1372443 6 10.777542 test
9 Q0 G00-88-2629440 7 10.405809 test
9 Q0 G00-95-3755341 8 10.405809 test
9 Q0 G00-06-1975174 9 10.405809 test
9 Q0 G00-59-0523165 10 10.405809 test
9 Q0 G00-24-0016657 11 10.353919 test
9 Q0 G00-95-3337324 12 10.353919 test
9 Q0 G00-01-2134408 13 10.181420 test
9 Q0 G00-33-1729611 14 10.070069 test
9 Q0 G00-01-2898660 15 9.879991 test
9 Q0 G00-91-3181951 16 9.815247 test
9 Q0 G00-43-3812747 17 9.743916 test
9 Q0 G00-21-1529615 18 9.435812 test
9 Q0 G00-67-1176122 19 9.147730 test
9 Q0 G00-08-3780534 20 9.093147 test
9 Q0 G00-49-2630728 21 8.601573 

### Q3 (c): Provide answer to Q3 (c) here [markdown cell]

I applied stemming with NLTK's LancasterStemmer (wrapped in CustomFilter), after a RegexTokenizer that separates the text into words. I also used LowercaseFilter to convert the words to lower case, IntraWordFilter to split compound tokens, and StopFilter to remove common stop words that are not relevant.

This change produced a higher MAP score of 0.3456, compared to the previous score of 0.1971. Additionally, the gm_map score increased from 0.0015 to 0.0187, indicating that these changes improved the performance of Whoosh.

The false negative example (G00-91-3181951) has been resolved and is now included in the sample results. However, false positives persist and have increased: 27 of the returned files are not in the qrels file, compared to only 2 previously.
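
The geometric-mean variant of MAP is more sensitive to badly performing topics than the arithmetic mean. The sketch below shows my understanding of how it is formed: each topic contributes the logarithm of its AP, floored at a small constant, and the aggregate is the exponential of the mean of those logs. The floor of 0.00001 is an assumption on my part, chosen because its logarithm matches the -11.5129 printed for a topic with AP 0.

```python
import math
from typing import List

MIN_AP = 1e-5  # assumed floor; log(1e-5) is about -11.5129


def log_ap(ap: float) -> float:
    """Per-topic contribution: natural log of AP, floored to avoid log(0)."""
    return math.log(max(ap, MIN_AP))


def gm_map(aps: List[float]) -> float:
    """Geometric mean of (floored) per-topic AP values."""
    return math.exp(sum(log_ap(ap) for ap in aps) / len(aps))


def arithmetic_map(aps: List[float]) -> float:
    return sum(aps) / len(aps)


# Toy values: one perfect topic and one complete failure.
toy_aps = [1.0, 0.0]
print(round(log_ap(0.0), 4))
print(arithmetic_map(toy_aps), gm_map(toy_aps))
```

With one perfect topic and one failed topic the arithmetic mean is 0.5, while the geometric mean collapses to roughly 0.003, which is why a rise in gm_map suggests that the weakest topics improved.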

### Q3 (d): Provide answer to Q3 (d) here [markdown cell]

Yes.

### Q3 (e): Provide answer to Q3 (e) here [markdown cell]

Yes. While the performance of topic 28 has improved (its MAP score is now 0.2262, compared to 0 previously), other topics such as topic 26 have not improved. The MAP score for topic 26 dropped from 0.1111 to 0.0771, meaning its performance has worsened. Additionally, false positives remain, and there are more of them than before.

### Q3 (f): Provide answer to Q3 (f) here [markdown cell]

The overall performance of the system has improved, as evidenced by the higher MAP and gm_map scores. Furthermore, the added filters allow more relevant documents to be retrieved than before, which also fixed the previous false negative issue. This has led to an increase in false positive results, but the overall performance improvement still stands.

## Question 4


### Q4 (a): Provide answer to Q4 (a) here [markdown cell]

Lack of relevance: If the search engine doesn't return results that are relevant to the user's query, performance will be low.

Insufficient data preprocessing: If the data is not cleaned, filtered, and preprocessed properly before indexing, it can lead to poor results.

Poor tokenization: If the tokenization process is not effective, it can result in poor indexing and retrieval of documents.

Lack of stemming/lemmatization: If the search engine is unable to match words in different forms (e.g. plurals, conjugations), it can negatively impact performance.

Unoptimized scoring functions: If the scoring function used by the search engine is not optimal, it can result in poor relevance ranking of the results.
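
To show what tuning a scoring function means in practice, here is a sketch of the textbook BM25 score for a single term. The function name and the example numbers are my own, and Whoosh's BM25F may differ in details; the parameters b and k1 play the role of the B and K1 arguments of scoring.BM25F.

```python
def bm25_term_score(tf: float, doc_len: float, avg_doc_len: float,
                    idf: float, k1: float = 1.2, b: float = 0.75) -> float:
    """Textbook BM25 contribution of one query term to one document."""
    # b controls how strongly long documents are penalized (0 = not at all).
    length_norm = (1 - b) + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences of the term stop helping.
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)


# Same term frequency in a short and a long document (hypothetical numbers).
short_doc = bm25_term_score(tf=3, doc_len=100, avg_doc_len=200, idf=2.0)
long_doc = bm25_term_score(tf=3, doc_len=800, avg_doc_len=200, idf=2.0)

# With b=0 the document length no longer matters.
short_doc_b0 = bm25_term_score(tf=3, doc_len=100, avg_doc_len=200, idf=2.0, b=0.0)
long_doc_b0 = bm25_term_score(tf=3, doc_len=800, avg_doc_len=200, idf=2.0, b=0.0)

print(short_doc, long_doc)
print(short_doc_b0, long_doc_b0)
```

Lowering b reduces the penalty on long documents, and raising k1 lets repeated occurrences of a term keep adding to the score for longer, so the best pair depends on how long and how repetitive the documents in the collection are.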

### Q4 (b): Write your code below

**1. The auto-grader will extract and use the following variables, DON'T change their names:**

      self.topic_file  
      self.qrels_file  
      self.document_dir   
      self.file_list  
      self.index_sys  
      self.query_parser  
      self.searcher   



**2. DON'T change the names of the already defined functions**  
**3. DON'T change the py_trec_eval function**  
**4. DON'T change the class names including CustomFilter, IRSystem, IRQ2, IRQ3, IRQ4**  
**5. DON'T change the CustomFilter class and DON'T create any new custom filter class that is used to define Whoosh schema**

In [17]:
import nltk
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [18]:
class IRQ4(IRSystem):
    def create_index(self):
        """
        INPUT:
            None
        OUTPUT:
            None
        
        NOTE: Please update self.index_sys which should have type whoosh.index.FileIndex
        """        
        # DON't change the name of 'index_sys'
        myFilter = RegexTokenizer() | LowercaseFilter() | IntraWordFilter() | StopFilter() | StemFilter()|CustomFilter(LancasterStemmer().stem)|CustomFilter(WordNetLemmatizer().lemmatize)
        mySchema = Schema(file_path = ID(stored=True),
                          file_content = TEXT(analyzer = myFilter))

         # Generate a temporary directory for the index
        indexDir = tempfile.mkdtemp()

        # create and return the index
        self.index_sys = index.create_in(indexDir, mySchema)


    def add_files(self):
        """
        INPUT:
            None
        OUTPUT:
            None
        
        NOTE: Add buffer to self.index_sys
        """
        # Build a list of files to index
        filesToIndex = [str(filePath) for filePath in Path(self.document_dir).glob("**/*") if filePath.is_file()]
        # open writer
        writer = writing.BufferedWriter(self.index_sys, period=None, limit=1000)
        try:
          # write each file to index
          for docNum, filePath in enumerate(filesToIndex):
              with open(filePath, "r", encoding="utf-8") as f:
                  fileContent = f.read()
                  writer.add_document(file_path = filePath,
                                      file_content = fileContent)

                  # print status every 1000 documents
                  if (docNum+1) % 1000 == 0:
                      print("already indexed:", docNum+1)
          print("done indexing.")

        finally:
            # close the index
            writer.close()

    def create_parser_searcher(self):
        """
        INPUT:
            None
        OUTPUT:
            None
        
        NOTE: Please update self.query_parser and self.searcher which should have type whoosh.qparser.default.QueryParser and whoosh.searching.Searcher respectively 
        """
        # define a query parser for the field "file_content" in the index
         # DON't change the names of 'query_parser' and 'searcher'
        self.query_parser = QueryParser("file_content", schema=self.index_sys.schema, group=qparser.OrGroup.factory(0.8))
        self.searcher = self.index_sys.searcher(weighting=scoring.BM25F(B=0.5,K1=5))

    def perform_search(self, topic_phrase):
        """
        INPUT:
            topic_phrase: string
        OUTPUT:
            topic_results: whoosh.searching.Results
        
        NOTE: Utilize self.query_parser and self.searcher to calculate the result for topic_phrase
        """
        sampleQuery = self.query_parser.parse(topic_phrase)
        topic_results = self.searcher.search(sampleQuery, limit=None)
        return topic_results

In [19]:
q4 = IRQ4("government")

already indexed: 1000
already indexed: 2000
already indexed: 3000
already indexed: 4000
done indexing.


In [20]:
q4.py_trec_eval()

num_q                    1       1.0000
num_ret                  1       470.0000
num_rel                  1       5.0000
num_rel_ret              1       5.0000
map                      1       0.0674
gm_map                   1       -2.6978
Rprec                    1       0.0000
bpref                    1       0.0000
recip_rank               1       0.0556
iprec_at_recall_0.00     1       0.0938
iprec_at_recall_0.10     1       0.0938
iprec_at_recall_0.20     1       0.0938
iprec_at_recall_0.30     1       0.0938
iprec_at_recall_0.40     1       0.0938
iprec_at_recall_0.50     1       0.0938
iprec_at_recall_0.60     1       0.0938
iprec_at_recall_0.70     1       0.0519
iprec_at_recall_0.80     1       0.0519
iprec_at_recall_0.90     1       0.0485
iprec_at_recall_1.00     1       0.0485
P_5                      1       0.0000
P_10                     1       0.0000
P_15                     1       0.0000
P_20                     1       0.0500
P_30                     1       0.06

In [21]:
q4.print_rel_name('9')

---------------------------Topic_id and Topic_phrase----------------------------------
9 genealogy searches
---------------------------Return documents----------------------------------
9 Q0 G00-30-0221651 0 13.704827 test
9 Q0 G00-79-2892445 1 13.065959 test
9 Q0 G00-26-1048210 2 9.465239 test
9 Q0 G00-91-3181951 3 8.414106 test
9 Q0 G00-55-0643570 4 7.484404 test
9 Q0 G00-08-1314254 5 7.098417 test
9 Q0 G00-08-0900666 6 7.098417 test
9 Q0 G00-02-1372443 7 7.098417 test
9 Q0 G00-43-3812747 8 6.449216 test
9 Q0 G00-24-0016657 9 6.324979 test
9 Q0 G00-95-3337324 10 6.324979 test
9 Q0 G00-01-2134408 11 6.217790 test
9 Q0 G00-48-2464830 12 5.933189 test
9 Q0 G00-28-3598417 13 5.728867 test
9 Q0 G00-01-2898660 14 5.706292 test
9 Q0 G00-88-2629440 15 5.584208 test
9 Q0 G00-95-3755341 16 5.584208 test
9 Q0 G00-06-1975174 17 5.584208 test
9 Q0 G00-59-0523165 18 5.584208 test
9 Q0 G00-59-3622783 19 5.522089 test
9 Q0 G00-33-1729611 20 5.411123 test
9 Q0 G00-08-3780534 21 5.250793 test
9 Q0 G00

### Q4 (b): Provide answer to Q4 (b) here [markdown cell]

Final modifications made:

* Added filters to the analyzer chain: StemFilter() | CustomFilter(LancasterStemmer().stem) | CustomFilter(WordNetLemmatizer().lemmatize)
* Used the factory() class method of OrGroup: self.query_parser = QueryParser("file_content", schema=self.index_sys.schema, group=qparser.OrGroup.factory(0.8))
* Used the Whoosh BM25F scoring function: self.searcher = self.index_sys.searcher(weighting=scoring.BM25F(B=0.5,K1=5))

In addition to the extra filters, an 'OR' query was used so that documents matching any of the query terms are returned, while documents that contain a higher number of query terms are prioritized. This resulted in the retrieval of more relevant documents, thereby increasing the MAP score. By trial and error, the best factory value was found to be 0.8. The Whoosh scoring function, BM25F, was then tuned, and the best combination found was B = 0.5 and K1 = 5, resulting in a MAP score of 0.4105. It is difficult to determine the optimal values of B and K1, and a better combination probably exists that would improve the results further.
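
The difference between requiring all query terms and accepting any of them can be sketched with a toy inverted index. As far as I understand, the query parser groups terms with AND unless another group is supplied, which is why the earlier systems returned so few documents. The index contents and helper names below are hypothetical, and ranking by the number of matched terms is only a rough stand-in for what OrGroup.factory does to the scores.

```python
from typing import Dict, List, Set

# Toy inverted index: stemmed term -> set of document ids (hypothetical).
toy_index: Dict[str, Set[str]] = {
    "genealog": {"d1", "d2"},
    "search": {"d2", "d3", "d4"},
}


def match_all(terms: List[str], index: Dict[str, Set[str]]) -> Set[str]:
    """AND semantics: a document must contain every query term."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def match_any(terms: List[str], index: Dict[str, Set[str]]) -> Set[str]:
    """OR semantics: a document needs at least one query term."""
    hits: Set[str] = set()
    for t in terms:
        hits |= index.get(t, set())
    return hits


def rank_by_coverage(terms: List[str], index: Dict[str, Set[str]]) -> List[str]:
    """Order OR matches so documents covering more query terms come first."""
    counts = {doc: sum(doc in index.get(t, set()) for t in terms)
              for doc in match_any(terms, index)}
    return sorted(counts, key=lambda doc: (-counts[doc], doc))


query = ["genealog", "search"]
and_hits = match_all(query, toy_index)
or_hits = match_any(query, toy_index)
ranked = rank_by_coverage(query, toy_index)
print(sorted(and_hits), sorted(or_hits), ranked)
```

OR grouping raises recall because partial matches are no longer discarded, and the coverage bonus keeps full matches at the top so precision near the head of the list does not suffer as much.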

### Q4 (c): Provide answer to Q4 (c) here [markdown cell]

Yes. The MAP score improved from 0.3456 in Q3 to 0.4105 in Q4.

## Validation

#### Run the following cells to make sure your code returns the correct value types

In [22]:
from whoosh.index import FileIndex
from whoosh.qparser import QueryParser
from whoosh.searching import Searcher
import os.path

### Q2 Validation

In [23]:
assert(isinstance(q2.index_sys, FileIndex)), "Index Type"
assert(isinstance(q2.query_parser, QueryParser)), "Query Parser Type"
assert(isinstance(q2.searcher, Searcher)), "Searcher Type"
print("Q2 Types Validated")

Q2 Types Validated


### Q3 Validation

In [24]:
assert(isinstance(q3.index_sys, FileIndex)), "Index Type"
assert(isinstance(q3.query_parser, QueryParser)), "Query Parser Type"
assert(isinstance(q3.searcher, Searcher)), "Searcher Type"
print("Q3 Types Validated")

Q3 Types Validated


### Q4 Validation

In [25]:
assert(isinstance(q4.index_sys, FileIndex)), "Index Type"
assert(isinstance(q4.query_parser, QueryParser)), "Query Parser Type"
assert(isinstance(q4.searcher, Searcher)), "Searcher Type"
print("Q4 Types Validated")

Q4 Types Validated
