# Homework 2
Please see the write-up on Canvas for full details.

### - **Name**: **Yashaswini**
### - **Kaggle Name**: **Yashaswini Joshi**
### - **Unique Name**: **yjoshi**

# Install metapy (this may take several minutes)

In [1]:
!pip install metapy
import metapy

Collecting metapy
  Downloading https://files.pythonhosted.org/packages/81/a4/92dae084446597d6bbf355e7eaff3e83dcb51e33d434f43ecdea4c0c4b0a/metapy-0.2.13-cp36-cp36m-manylinux1_x86_64.whl (14.3MB)
     |████████████████████████████████| 14.3MB 297kB/s 
Installing collected packages: metapy
Successfully installed metapy-0.2.13


# Download the dataset files for this assignment. 

These files are also on Canvas. The `wget` and `tar` commands may not work on Windows, so if they fail, download the files from Canvas instead. The commands should work if you are doing the assignment on Google Colab.

In [2]:
!wget -nc https://raw.githubusercontent.com/meta-toolkit/meta/master/data/lemur-stopwords.txt
!wget -N http://www-personal.umich.edu/~shiyansi/covid_ir.tar.gz
!tar xf covid_ir.tar.gz

--2020-10-14 03:23:10--  https://raw.githubusercontent.com/meta-toolkit/meta/master/data/lemur-stopwords.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2747 (2.7K) [text/plain]
Saving to: ‘lemur-stopwords.txt’


2020-10-14 03:23:11 (45.9 MB/s) - ‘lemur-stopwords.txt’ saved [2747/2747]

--2020-10-14 03:23:11--  http://www-personal.umich.edu/~shiyansi/covid_ir.tar.gz
Resolving www-personal.umich.edu (www-personal.umich.edu)... 141.211.243.103
Connecting to www-personal.umich.edu (www-personal.umich.edu)|141.211.243.103|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 69098957 (66M) [application/x-gzip]
Saving to: ‘covid_ir.tar.gz’


2020-10-14 03:24:08 (1.15 MB/s) - ‘covid_ir.tar.gz’ saved [69098957/69098957]



# Generate the metapy configuration
Metapy is a powerful IR library. To lower the barrier to entry, we generate the configuration that tells metapy how the task is set up and what it needs. Keep this file unchanged when verifying your BM25 implementation in Part 1 of the assignment. However, you may generate a different version of it when trying to outperform BM25 in Part 2.

In [3]:
with open('covid_ir/tutorial.toml', 'w') as f:
    f.write('type = "line-corpus"\n')
    f.write('store-full-text = true\n')

config = """prefix = "." # tells MeTA where to search for datasets

dataset = "covid_ir" # a subfolder under the prefix directory
corpus = "tutorial.toml" # a configuration file for the corpus specifying its format & additional args

index = "covid_ir-idx" # subfolder of the current working directory to place index files

query-judgements = "covid_ir/covid_ir-qrels.txt" # file containing the relevance judgments for this dataset

stop-words = "lemur-stopwords.txt"

[[analyzers]]
method = "ngram-word"
ngram = 1
filter = "default-unigram-chain"
"""
with open('covid_ir-config.toml', 'w') as f:
    f.write(config)
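Before building the index, it can help to confirm that the files named in the configuration above are actually on disk, especially if the `wget`/`tar` step failed. The following is a minimal sketch that assumes the paths used in this notebook; `missing_inputs` is a hypothetical helper and not part of metapy.

```python
import os

def missing_inputs(paths):
    """Return the subset of `paths` that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]

# Paths taken from the cells above; adjust if your layout differs.
required = [
    "lemur-stopwords.txt",
    "covid_ir/tutorial.toml",
    "covid_ir/covid_ir-qrels.txt",
    "covid_ir-config.toml",
]
missing = missing_inputs(required)
if missing:
    print("Missing files:", missing)
else:
    print("All required files are present.")
```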

### Build the inverted index with metapy

In [4]:
inv_idx = metapy.index.make_inverted_index('covid_ir-config.toml') 

## Problem 1: Re-implement BM25 (25 points) and Pivoted Normalization (25 points)
We've provided a skeleton of a ranking function below with examples of commonly used parameters. Re-implement each ranking function using the formulas defined in the lecture slides. You are welcome to use any hyperparameter choices you want (NOTE: as mentioned in the homework, changing these does not count as a new method for Problem 2). 

To test for correctness, you can compare your method against metapy's implementations. We've included one below to get you started. Your solution should return exactly the same ranking as metapy's implementation when using identical hyperparameters.
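To get a feel for the BM25 formula before wiring it into metapy, the per-term score can be computed from plain numbers. This is only a sketch: `bm25_term_score` is a hypothetical standalone function using the same three factors (IDF, normalized document TF, query TF) as the skeleton solution in this notebook, and the statistics passed in are made up for illustration.

```python
import math

def bm25_term_score(tf, doc_len, avg_dl, num_docs, doc_count, qtf=1.0,
                    k1=1.2, b=0.75, k3=500):
    """Per-term BM25 score from plain numbers (no metapy needed)."""
    # Inverse document frequency with 0.5 smoothing.
    idf = math.log((num_docs - doc_count + 0.5) / (doc_count + 0.5))
    # Document term frequency with saturation (k1) and length normalization (b).
    norm_tf = (k1 + 1) * tf / (k1 * (1 - b + b * doc_len / avg_dl) + tf)
    # Query term frequency with saturation (k3).
    query_tf = (k3 + 1) * qtf / (k3 + qtf)
    return idf * norm_tf * query_tf

# Hypothetical statistics, chosen only for illustration.
average = bm25_term_score(tf=3, doc_len=100, avg_dl=100, num_docs=1000, doc_count=10)
longer = bm25_term_score(tf=3, doc_len=400, avg_dl=100, num_docs=1000, doc_count=10)
print(average, longer)  # the longer document scores lower for the same tf
```

With `b = 0` the length term disappears and both documents would score the same, which is a quick way to check that length normalization is wired up correctly.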

In [5]:
# You can define your own retrieval function 
import math 
class MyBM25Reimplementation(metapy.index.RankingFunction):                                                                                                                    
    def __init__(self,  k1 = 1.2, b = 0.75, k3 = 500):                                             
        self.k1 = k1
        self.b = b
        self.k3 = k3
        # You *must* invoke the base class __init__() here!
        super(MyBM25Reimplementation, self).__init__()                                        
                                                                                 
    def score_one(self, sd):
        """
        You need to override this function to return a score for a single term.
        
        You may want to call some of the following variables when implementing your retrieval function:
        
        sd.avg_dl: average document length of the collection
        sd.num_docs: total number of documents in the index
        sd.total_terms: total number of terms in the index
        sd.query_length: the total length of the current query (sum of all term weights)
        sd.query_term_weight: query term count (or weight in case of feedback)
        sd.doc_count: number of documents that a term t_id appears in
        sd.corpus_term_count: number of times a term t_id appears in the collection
        sd.doc_term_count: number of times the term appears in the current document
        sd.doc_size: total number of terms in the current document
        sd.doc_unique_terms: number of unique terms in the current document
        
        """
        
        k1 = self.k1
        b = self.b
        k3 = self.k3      
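        # IDF with 0.5 smoothing. Note that it goes negative when a term
        # appears in more than half of the documents.
        V_IDF = math.log((sd.num_docs - sd.doc_count + 0.5) / (sd.doc_count + 0.5))
        # Document TF with saturation (k1) and length normalization (b).
        N_TF = (k1 + 1) * sd.doc_term_count / (k1 * (1 - b + b * sd.doc_size / sd.avg_dl) + sd.doc_term_count)
        # Query TF with saturation (k3).
        QTF = (k3 + 1) * sd.query_term_weight / (k3 + sd.query_term_weight)
        return V_IDF * N_TF * QTF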

In [6]:
# Baseline for comparison; the hyperparameters match the defaults of MyBM25Reimplementation.
# ranker = metapy.index.OkapiBM25(k1=1.2, b=0.75, k3=500)
ranker = MyBM25Reimplementation()

In [7]:
num_results = 10
retrieval_results = []
with open('covid_ir/covid_ir-queries.txt') as query_file:
    for query_num, line in enumerate(query_file):
        query = metapy.index.Document()
        query.content(line.strip())
        results = ranker.score(inv_idx, query, num_results)  
        res_list = [(query_num + 1, x[0]) for x in results]
        retrieval_results += res_list

        
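        # Print the top results for the first query only, then stop.
        # Remove the break to collect results for every query.
        print("Query: ", query.content())
        print("Retrieved Results")
        for num, (d_id, _) in enumerate(results):
            content = inv_idx.metadata(d_id).get('content')
            print(str(num + 1), content)
        break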

Query:  coronavirus origin
Retrieved Results
1 Bat-Origin Coronaviruses Expand Their Host Range to Pigs Infections with bat-origin coronaviruses have caused severe illness in humans by ‘host jump’. Recently, novel bat-origin coronaviruses were found in pigs. The large number of mutations on the receptor-binding domain allowed the viruses to infect the new host, posing a potential threat to both agriculture and public health.
2 Zoonotic origins of human coronavirus 2019 (HCoV-19 / SARS-CoV-2): why is this work important? The ongoing pandemic of coronavirus disease 2019 (COVID-19), caused by infection with human coronavirus 2019 (HCoV-19 / SARS-CoV-2 / 2019-nCoV), is a global threat to the human population. Here, we briefly summarize the available data for the zoonotic origins of HCoV-19, with reference to the other two epidemics of highly virulent coronaviruses, SARS-CoV and MERS-CoV, which cause severe pneumonia in humans. We propose to intensify future efforts for tracing the origins 

In [8]:
class Pivoted(metapy.index.RankingFunction):                                                                                                                    
    def __init__(self, s = 0.1):                                             
        self.s = s
        # You *must* invoke the base class __init__() here!
        super(Pivoted, self).__init__()                                        
                                                                                 
    def score_one(self, sd):
        """
        You need to override this function to return a score for a single term.
        
        You may want to call some of the following variables when implementing your retrieval function:
        
        
        sd.avg_dl: average document length of the collection
        sd.num_docs: total number of documents in the index
        sd.total_terms: total number of terms in the index
        sd.query_length: the total length of the current query (sum of all term weights)
        sd.query_term_weight: query term count (or weight in case of feedback)
        sd.doc_count: number of documents that a term t_id appears in
        sd.corpus_term_count: number of times a term t_id appears in the collection
        sd.doc_term_count: number of times the term appears in the current document
        sd.doc_size: total number of terms in the current document
        sd.doc_unique_terms: number of unique terms in the current document
        
        """
        
        s = self.s    
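        # IDF component.
        IDF = math.log((sd.num_docs + 1) / sd.doc_count)
        # Double-log TF divided by the pivoted length normalizer.
        # Assumes sd.doc_term_count >= 1, so the inner log is defined.
        Nor_TF = (1 + math.log(1 + math.log(sd.doc_term_count))) / (1 - s + s * sd.doc_size / sd.avg_dl)
        # Query term weight (count of the term in the query).
        QTF = sd.query_term_weight

        return IDF * Nor_TF * QTF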
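The effect of the slope parameter `s` is easiest to see with plain numbers. The sketch below is a hypothetical standalone version of the same pivoted formula (`pivoted_term_score` is not a metapy function), applied to made-up statistics for a short and a long document with the same term count.

```python
import math

def pivoted_term_score(tf, doc_len, avg_dl, num_docs, doc_count, qtf=1.0, s=0.1):
    """Per-term pivoted length normalization score; requires tf >= 1."""
    idf = math.log((num_docs + 1) / doc_count)
    norm_tf = (1 + math.log(1 + math.log(tf))) / (1 - s + s * doc_len / avg_dl)
    return idf * norm_tf * qtf

# Hypothetical statistics: the same term count in a short and a long document.
for s in (0.0, 0.1, 0.5):
    short = pivoted_term_score(3, 50, 100, 1000, 10, s=s)
    long_ = pivoted_term_score(3, 400, 100, 1000, 10, s=s)
    print(f"s={s}: short={short:.3f} long={long_:.3f}")
```

With `s = 0` document length is ignored and both scores match; as `s` grows, the long document is penalized more and the short one is boosted.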

In [9]:
ranker = Pivoted()

In [10]:
num_results = 10
retrieval_results = []
with open('covid_ir/covid_ir-queries.txt') as query_file:
    for query_num, line in enumerate(query_file):
        query = metapy.index.Document()
        query.content(line.strip())
        results = ranker.score(inv_idx, query, num_results)  
        res_list = [(query_num + 1, x[0]) for x in results]
        retrieval_results += res_list

        
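        # Print the top results for the first query only, then stop.
        # Remove the break to collect results for every query.
        print("Query: ", query.content())
        print("Retrieved Results")
        for num, (d_id, _) in enumerate(results):
            content = inv_idx.metadata(d_id).get('content')
            print(str(num + 1), content)
        break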


Query:  coronavirus origin
Retrieved Results
1 Origin and evolution of pathogenic coronaviruses Severe acute respiratory syndrome coronavirus (SARS-CoV) and Middle East respiratory syndrome coronavirus (MERS-CoV) are two highly transmissible and pathogenic viruses that emerged in humans at the beginning of the 21st century. Both viruses likely originated in bats, and genetically diverse coronaviruses that are related to SARS-CoV and MERS-CoV were discovered in bats worldwide. In this Review, we summarize the current knowledge on the origin and evolution of these two pathogenic coronaviruses and discuss their receptor usage; we also highlight the diversity and potential of spillover of bat-borne coronaviruses, as evidenced by the recent spillover of swine acute diarrhoea syndrome coronavirus (SADS-CoV) to pigs.
2 Bat origin of human coronaviruses Bats have been recognized as the natural reservoirs of a large variety of viruses. Special attention has been paid to bat coronaviruses as the

In [11]:
# You can check your results by comparing the two rankers here
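One way to do this comparison is to look only at the ranked document ids and report the first rank where two rankers disagree. The sketch below assumes each result list has the `(doc_id, score)` shape used in the loops above; `doc_ids` and `first_mismatch` are hypothetical helpers, and the sample data is made up.

```python
def doc_ids(results):
    """Extract the ranked document ids from a list of (doc_id, score) pairs."""
    return [doc_id for doc_id, _ in results]

def first_mismatch(results_a, results_b):
    """Return the first rank (1-based) where two rankings differ, or None."""
    ids_a, ids_b = doc_ids(results_a), doc_ids(results_b)
    for rank, (a, b) in enumerate(zip(ids_a, ids_b), start=1):
        if a != b:
            return rank
    # One list is a strict prefix of the other.
    if len(ids_a) != len(ids_b):
        return min(len(ids_a), len(ids_b)) + 1
    return None

# Made-up results standing in for two rankers' output on one query.
mine = [(12, 7.1), (40, 6.3), (7, 5.9)]
reference = [(12, 7.0), (7, 6.4), (40, 5.8)]
print(first_mismatch(mine, reference))  # prints 2
```

In the notebook, the two lists would come from calling `ranker.score(inv_idx, query, num_results)` once with your ranker and once with a metapy baseline such as `metapy.index.OkapiBM25`. Only ids are compared because scores may differ slightly in floating point even when the ranking is identical.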

## Part 2: Define your own ranking function! (50 points)
Implement at least one retrieval function *different* from BM25, Dirichlet Prior, and Pivoted Normalization. You will be graded based on your best-performing function. You’ll get full credit if your retrieval function can beat the provided baseline on the dataset. By "beat," we mean that your implemented function and your choice of parameters should reach a higher NDCG@10 than the baseline on Kaggle for our dataset, which you can check at any time. Report this information in your submission: the code that implements the retrieval function, the parameters that achieved the best performance, and the best performance itself. In addition, _explain_ what you explored and why you decided to try it. You will lose points if you cannot explain why your function reaches a higher performance. You can include your explanations at the end of the submitted notebook.
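For reference, NDCG@10 can be approximated locally before uploading to Kaggle. The sketch below is one common formulation, assuming linear gain and a log2 rank discount; the exact variant used by the Kaggle leaderboard is not stated here and may differ (for example, an exponential gain). `dcg` and `ndcg_at_k` are hypothetical helpers, and the judgments are made up.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with a log2 rank discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg_at_k(ranked_doc_ids, relevance, k=10):
    """NDCG@k for one query.

    ranked_doc_ids: doc ids in ranked order.
    relevance: dict mapping doc id -> graded relevance (unjudged docs count as 0).
    """
    gains = [relevance.get(d, 0) for d in ranked_doc_ids[:k]]
    # The ideal ranking places the most relevant judged documents first.
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Made-up judgments for illustration only.
judgments = {"d1": 2, "d2": 1, "d3": 0}
print(ndcg_at_k(["d1", "d2", "d3"], judgments))  # ideal order
print(ndcg_at_k(["d3", "d2", "d1"], judgments))  # reversed order scores lower
```

Averaging this value over all queries gives a single number that can be tracked while tuning parameters.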

*Note:* Simply varying the value of parameters in Okapi/BM25, Dirichlet Prior or Pivoted Normalization does not count as a new retrieval function.

In [13]:
# You can define your own retrieval function 
import math 
class MyCustomRanker(metapy.index.RankingFunction):                                                                                                                    
    def __init__(self, k1 = 1.2, b = 0.75, k3 = 500):                                             
        self.k1 = k1
        self.b = b
        self.k3 = k3
        # You *must* invoke the base class __init__() here!
        super(MyCustomRanker, self).__init__()                                       
                                                                                 
    def score_one(self, sd):
        """
        You need to override this function to return a score for a single term.
        
        You may want to call some of the following variables when implementing your retrieval function:
        
        sd.avg_dl: average document length of the collection
        sd.num_docs: total number of documents in the index
        sd.total_terms: total number of terms in the index
        sd.query_length: the total length of the current query (sum of all term weights)
        sd.query_term_weight: query term count (or weight in case of feedback)
        sd.doc_count: number of documents that a term t_id appears in
        sd.corpus_term_count: number of times a term t_id appears in the collection
        sd.doc_term_count: number of times the term appears in the current document
        sd.doc_size: total number of terms in the current document
        sd.doc_unique_terms: number of unique terms in the current document
        
        """
        
        k1 = self.k1
        b = self.b
        k3 = self.k3
        
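        #Fill your answer here
        # modified ES 
        # TF: saturating term frequency, damped by the square root of the relative document length.
        TF = sd.doc_term_count / (sd.doc_term_count + b * math.sqrt(sd.doc_size / sd.avg_dl))
        # IDF-like weight built from collection frequency, collection size, and document frequency.
        IDF = ((sd.corpus_term_count ** 3) * sd.num_docs / (sd.doc_count ** 4)) ** k1
        # Query TF with saturation (k3), as in BM25.
        QTF = (k3 + 1) * sd.query_term_weight / (k3 + sd.query_term_weight)
        return IDF * TF * QTF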

In [19]:
ranker = MyCustomRanker()

num_results = 10
custom_ranking_retrieval_results = []
query_id_list = []
with open('covid_ir/covid_ir-queries.txt') as query_file:
    for query_num, line in enumerate(query_file):
        # print(type(query_num))
        # query_id_list.append(query_num)
        query = metapy.index.Document()
        query.content(line.strip())
        results = ranker.score(inv_idx, query, num_results)  
        res_list = [(query_num + 1, x[0]) for x in results]
        custom_ranking_retrieval_results += res_list

        
        # print("Query: ", query.content())
        # print("Retrieved Results")
        # for num, (d_id, _) in enumerate(results):
        #   content = inv_idx.metadata(d_id).get('content')
        #   print(str(num + 1), content)
        # break     

### Write your ranking to a file and upload it to the [Kaggle competition](https://www.kaggle.com/t/a8345852fdab42da9e210f833b9f50b1) for the class

In [20]:
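import csv

# newline="" lets the csv module control line endings (avoids blank rows on Windows).
with open("my_kaggle_submission.csv", "w", newline="") as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow(["queryid", "docid"])
    for x in custom_ranking_retrieval_results:
        csv_writer.writerow(list(x))
    # No explicit f.close() is needed: the with block closes the file.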
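Before uploading, a quick read-back of the file can catch a wrong header or a query with an unexpected number of rows. This is a minimal sketch: `check_submission` is a hypothetical helper, and the expectation of 10 rows per query is an assumption based on `num_results = 10` above (a query that matches fewer documents would be flagged even though nothing is wrong).

```python
import csv
from collections import Counter

def check_submission(path, expected_per_query=10):
    """Return a list of problems found in a submission CSV (empty if none)."""
    problems = []
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header != ["queryid", "docid"]:
            problems.append(f"unexpected header: {header}")
        # Count how many rows each query id contributes, skipping blank rows.
        counts = Counter(row[0] for row in reader if row)
    for query_id, n in sorted(counts.items()):
        if n != expected_per_query:
            problems.append(f"query {query_id} has {n} rows")
    return problems
```

Calling `check_submission("my_kaggle_submission.csv")` after the cell above should return an empty list if every query contributed exactly 10 results.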