# COMPSCI 546: Applied Information Retrieval ([website](https://groups.cs.umass.edu/zamani/compsci-546-applied-information-retrieval-spring-2022/))
## Assignment 1: Information Retrieval Metrics (Total: 100 points)

**Description**

This is a coding assignment where you will write and execute code to evaluate ranked outputs generated by a retrieval model or a recommender system. Basic proficiency in Python is recommended.  

**Instructions**

* To start working on the assignment, first save the notebook to your own Google Drive by clicking the *Copy to Drive* button. Alternatively, click the *Share* button at the top right corner, then *Copy Link* under *Get Link*, and use that link to copy this notebook to your Google Drive.  

*   For questions with descriptive answers, please replace the text in the cell which states "Enter your answer here!" with your answer. If you are using mathematical notation in your answers, please define the variables.
*   You should implement all the functions yourself and should not use a library or tool for computing the metrics.
*   For coding questions, you can add code where it says "enter code here" and execute the cell to print the output.
* To create the final PDF submission file, execute *Runtime->RunAll* from the menu to re-execute all the cells and then generate a PDF using *File->Print->Save as PDF*. Make sure that the generated PDF contains all the code and printed outputs before submission. You are responsible for uploading the correct PDF with all the information required for grading.
* To create the final Python submission file, click on *File->Download .py*.


**Submission Details**

* Due date: Feb. 10, 2022 at 11:59 PM (EST).
* The final PDF and Python file must be uploaded on Moodle.
* After copying this notebook to your Google Drive, please paste a link to it below. Use the same process given above to generate a link. ***You will not receive any credit if you don't paste the link!*** Make sure we can access the file.
***LINK: ***

**Academic Honesty**

Please follow the guidelines under the *Collaboration and Help* section of the course website.     

# Download input files

Please execute the cell below to download the input files. 

In [None]:
import os
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)


import zipfile

download = drive.CreateFile({'id': '1myaSouVnJygjLlQI54_S0vRXLNYekEC8'})
download.GetContentFile('HW01.zip')

with zipfile.ZipFile('HW01.zip', 'r') as zip_file:
    zip_file.extractall('./')
os.remove('HW01.zip')
# We will use the extracted HW01 directory as our working directory
os.chdir('HW01')

#Setting the input files
qrel_file = "antique-train-final.qrel"
rank_file = "ranking_file"

# 1: Initial Data Setup (10 Points) 

We use the files from the ANTIQUE dataset [https://arxiv.org/pdf/1905.08957.pdf] for this assignment. This is a passage retrieval dataset for non-factoid questions created by the Center for Intelligent Information Retrieval (CIIR) at UMass Amherst.

The description of the input files provided for this assignment is given below. 

**1) Query Relevance (qrel) file**

The qrel file contains the relevance judgements (ground truth) for query-passage pairs. The file consists of 4 columns, in the order given below.

*\[queryid]  [topicid]  [passageid]  [relevancejudgment]*

Entries in each row are space separated. The second column (topicid) can be ignored. 

Given below are a couple of rows of a sample qrel file.

*2146313 U0 2146313_0 4*

*2146313 Q0 2146313_23 2*

The relevance judgements range from 1 to 4. 
The labels are described below:

Label 1: Non-Relevant

Label 2: Slightly Relevant

Label 3: Relevant

Label 4: Highly Relevant

**Note**: for metrics with binary relevance assumptions, Labels 1 and 2 are considered non-relevant and Labels 3 and 4 are considered relevant.

**Note**: if a query-document pair is not listed in the qrels file, we assume that the document is not relevant to the query.
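The format and the two notes above can be illustrated with a minimal sketch. The helper names `parse_qrel_line` and `binarize` are hypothetical and are not part of the provided notebook; the sample rows are the two rows shown above.

```python
def parse_qrel_line(line):
    """Split one qrel row into (queryid, passageid, label); topicid is ignored."""
    query_id, _topic_id, passage_id, label = line.split()
    return query_id, passage_id, int(label)


def binarize(label):
    """Labels 3 and 4 count as relevant (1); labels 1 and 2 as non-relevant (0)."""
    return 1 if label >= 3 else 0


sample_rows = ["2146313 U0 2146313_0 4", "2146313 Q0 2146313_23 2"]

binary_qrels = {}
for row in sample_rows:
    qid, pid, label = parse_qrel_line(row)
    binary_qrels[(qid, pid)] = binarize(label)

# Pairs missing from the qrel file fall back to 0 (non-relevant).
print(binary_qrels.get(("2146313", "2146313_0"), 0))    # 1
print(binary_qrels.get(("2146313", "2146313_23"), 0))   # 0
print(binary_qrels.get(("2146313", "not_judged_0"), 0)) # 0
```

Keying the dictionary on the (queryid, passageid) pair makes the "not listed means non-relevant" rule a single `get` call with a default.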

**2) Ranking file**

The evaluation metric value has to be calculated for the input ranking file. The file was generated using a standard search engine by executing a ranking model, retrieving the top 100 passages for each of the train queries of the ANTIQUE dataset. The format of this file is given below. 

*\[queryid]  [topicid]  [passageid]  [rank] [relevance_score]  [indri]*

Similar to the qrel file, the entries in each row are space delimited.

Given below are some sample rows of the ranking file.

*3097310 Q0 2367043_3 1 -6.01785 indri*

*3097310 Q0 3007432_0 2 -6.22531 indri*

*3097310 Q0 674672_2 3 -6.28514 indri*
 
**Note**: For this assignment, you only need the information from Column 1 (queryid) and Column 3 (passageid). The passages corresponding to each query are already ordered by relevance score (highest to lowest), so you do not need to use Column 4 (rank) explicitly. 
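As a rough sketch of how the ranking rows could be grouped per query, the snippet below keeps only Column 1 and Column 3 and relies on the file order for the rank. The function name `group_ranking` is an assumption made for illustration; the sample rows are the ones shown above.

```python
from collections import defaultdict

sample_rows = [
    "3097310 Q0 2367043_3 1 -6.01785 indri",
    "3097310 Q0 3007432_0 2 -6.22531 indri",
    "3097310 Q0 674672_2 3 -6.28514 indri",
]


def group_ranking(rows):
    """Map each queryid to its passageids, preserving the order of the rows."""
    ranked = defaultdict(list)
    for row in rows:
        fields = row.split()
        ranked[fields[0]].append(fields[2])  # Column 1 and Column 3 only
    return dict(ranked)


ranking = group_ranking(sample_rows)
print(ranking["3097310"][:2])  # ['2367043_3', '3007432_0']
```

With this layout, the top *k* passages of a query are simply the first *k* items of its list.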




In order to make it easier to access this information in subsequent cells, please store them in appropriate data structures in the cell below. 

In [None]:
import pandas as pd
''' 
In this function, load the qrel file into qrel datastructure 
Return Variables: 
num_queries_1 - Number of unique queries in the qrel file
num_rel - Number of total relevant passages in the dataset across all queries
qrels - datastructure with the query passage relevance information
'''
def loadQrels(qrel_file):
     #enter your code here
     qid = []
     pid =[]
     tup = []
     reljudg = []
     with open(qrel_file) as f:   
       for line in f:
         temp = line.split(" ")
         qid.append(temp[0])
         pid.append(temp[2])
         reljudg.append(temp[3].strip('\n'))
         tup.append((temp[0], temp[2]))

     qrels = pd.DataFrame({'key_val': tup, 'query_id':qid, 'passage_id':pid, 'relevance_judge':reljudg})
     #binarize the labels: 3 and 4 become '1' (relevant), 1 and 2 become '0' (non-relevant)
     qrels['normal_relevance_judge'] = qrels['relevance_judge'].apply(lambda x: '0' if (x == '1' or x == '2') else '1')    

     num_queries_1 = len(pd.unique(qrels['query_id']))
     num_rel = len(qrels[((qrels['normal_relevance_judge'] == "1"))])
     

     return num_queries_1, num_rel, qrels



''' 
In this function, load the ranking files into rank_in datastructure 
Return Variables: 
num_queries_2 - Number of unique queries in the ranking file
rank_in - datastructure with stored ranking information
'''     
def  loadRankfile(rank_file):
      #enter your code here
      qid = []
      pid = []
      rank = []
      rank_in = {}
      with open(rank_file) as f:   
       for line in f:
         temp = line.split(" ")
         qid.append(temp[0])
         pid.append(temp[2])
         rank.append(temp[3])

      rank_in = pd.DataFrame({"query_id":qid, "passage_id":pid, "rank":rank})  

      num_queries_2 = len(pd.unique(rank_in['query_id']))

      return num_queries_2, rank_in
      


''' You can return single/multiple variables to store data if that makes it convenient 
for data processing. 
This has been given as an example. However, you would still need to correctly print the 
queries in both files and total relevant passages.'''
num_queries_1, num_rel, qrels  = loadQrels(qrel_file)
num_queries_2, rank_in = loadRankfile(rank_file)

# print to ensure the file has been read correctly
print ('Total Num of queries in the qrel file  : {0}'.format(num_queries_1))
print ('Total Num of queries in the rank file  : {0}'.format(num_queries_2))
print ('Total number of relevant passages in the dataset : {0}'.format(num_rel))

Total Num of queries in the qrel file  : 2426
Total Num of queries in the rank file  : 2426
Total number of relevant passages in the dataset : 19813


In [None]:
print(qrels)

                    key_val query_id  ... relevance_judge normal_relevance_judge
0      (2531329, 2531329_0)  2531329  ...               4                      1
1      (2531329, 2531329_5)  2531329  ...               4                      1
2      (2531329, 2531329_4)  2531329  ...               3                      1
3      (2531329, 2531329_7)  2531329  ...               3                      1
4      (2531329, 2531329_6)  2531329  ...               3                      1
...                     ...      ...  ...             ...                    ...
27417    (884731, 884731_0)   884731  ...               4                      1
27418    (884731, 884731_4)   884731  ...               4                      1
27419    (884731, 884731_2)   884731  ...               4                      1
27420    (884731, 884731_3)   884731  ...               4                      1
27421    (884731, 884731_1)   884731  ...               3                      1

[27422 rows x 5 columns]


In [None]:
print(rank_in)

       query_id  passage_id rank
0       3097310   2367043_3    1
1       3097310   3007432_0    2
2       3097310    674672_2    3
3       3097310   2311700_0    4
4       3097310   2606613_4    5
...         ...         ...  ...
242595  4086230  2633809_23   96
242596  4086230   3391788_0   97
242597  4086230   597728_11   98
242598  4086230   4356539_1   99
242599  4086230  2261458_20  100

[242600 rows x 3 columns]


In [None]:
qrel_dict = qrels.set_index('key_val').to_dict()['normal_relevance_judge'] #creating hashmap with (qid,pid) as key and relevance_judge as value



# 2 : Precision (15 Points)


Question 2.1 (5 points)

Give the definition of Precision corresponding to the top *k* retrieved results for *n* queries (P@k). Please note that you have to use averaging to aggregate the results from all queries.   

***Answer*** 
Precision at k (P@k) for a single query is the fraction of the top k retrieved documents that are relevant, i.e. the number of relevant documents among the top k results divided by k. To aggregate over n queries, we compute P@k for each query and take the arithmetic mean of the n values.
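The definition can be checked on a toy example. The sketch below is independent of the notebook's data structures: `precision_at_k`, `mean_precision_at_k`, and the toy rankings are all made up for illustration.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked ids that are relevant (the denominator is always k)."""
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant_ids)
    return hits / k


def mean_precision_at_k(rankings, relevant, k):
    """Average the per-query P@k values over all queries in `rankings`."""
    scores = [precision_at_k(ranked, relevant.get(qid, set()), k)
              for qid, ranked in rankings.items()]
    return sum(scores) / len(scores)


toy_rankings = {"q1": ["a", "b", "c", "d"], "q2": ["e", "f", "g", "h"]}
toy_relevant = {"q1": {"a", "c"}, "q2": {"h"}}

# q1: 2 relevant in the top 4 -> 0.5; q2: 1 relevant in the top 4 -> 0.25
print(mean_precision_at_k(toy_rankings, toy_relevant, 4))  # 0.375
```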



Question 2.2 (10 points) 

In the cell below, please enter the code to print the P@k where k={5,10} for the input ranking file.  As mentioned above, please note that the final value is the average of metric values across all queries. 


In [None]:
''' 
In this function, calculate Precision@k, given the input ranking information (rank_in)  
and the query passage relevance information (qrels).
Return Value: 
precision - Precision@k
'''

import time
def precCal(qid, k, qrels, rank_in):
  #top-k ranked passages for this query
  data = rank_in.loc[(rank_in['query_id'] == qid)].head(k).reset_index()
  ctr = 0
  for index, row in data.iterrows():
    #passages missing from qrel_dict are unjudged and count as non-relevant
    if qrel_dict.get((row['query_id'], row['passage_id'])) == '1':
      ctr += 1
  #precision value for qid
  prec = ctr/k

  return prec


def calcPrecision(k, qrels, rank_in):
    #unique queries
    que = pd.unique(rank_in['query_id'])
    qdf = pd.DataFrame(que, columns=['qid'])
    qdf['precision'] = qdf['qid'].apply(lambda row: precCal(row, k, qrels, rank_in))
    #final aggr precision value
    precision = qdf['precision'].sum()/len(que)
    return precision

start_time = time.time()

print ('Precision at top 5 : {0}'.format(calcPrecision(5, qrels, rank_in)))
print ('Precision at top 10 : {0}'.format(calcPrecision(10, qrels, rank_in)))

print("--- %s seconds ---" % (time.time() - start_time))


Precision at top 5 : 0.15539983511953834
Precision at top 10 : 0.1072959604286892
--- 107.90563535690308 seconds ---


# 3 : Recall (15 points)

Question 3.1 (5 points) 

Give the definition of Recall corresponding to the top *k* retrieved results for *n* queries (R@k). Please note that you have to use averaging to aggregate the results from all queries. 

***Answer*** 
Recall at k is the ratio of the total number of relevant retrieved documents from the top k ranks, to the total number of relevant documents for a given query. Then, the average recall value is found across n queries.
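A toy example of this definition is sketched below. The function name `recall_at_k` and the data are hypothetical; returning 0 for a query with no relevant documents is an assumption made here to avoid a division by zero.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of all relevant ids that appear in the top k of the ranking."""
    if not relevant_ids:
        return 0.0  # assumption: a query without relevant documents scores 0
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant_ids)
    return hits / len(relevant_ids)


toy_rankings = {"q1": ["a", "b", "c", "d"], "q2": ["e", "f", "g", "h"]}
toy_relevant = {"q1": {"a", "c", "x", "y"}, "q2": {"h"}}

scores = {qid: recall_at_k(ranked, toy_relevant[qid], 4)
          for qid, ranked in toy_rankings.items()}
print(scores)                             # {'q1': 0.5, 'q2': 1.0}
print(sum(scores.values()) / len(scores)) # 0.75
```

Unlike P@k, the denominator here is the number of relevant documents for the query, so it does not depend on k.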

Question 3.2 (10 points) 

In the cell below, please enter the code to print Recall (R@k) where k={5,10} for the input ranking file. As mentioned above, please note that the final value is the average of metric values across all queries.

In [None]:
''' 
In this function, calculate Recall@k, given the input ranking information (rank_in)  
and the query passage relevance information (qrels).
Return Value: 
recall - Recall@k
'''

def recCal(qid, k, qrels, rank_in):
  #total number of relevant passages for this query
  recos = len(qrels[(qrels['query_id'] == qid) & ((qrels['normal_relevance_judge'] == "1"))])
  if recos == 0:
    return 0  #no relevant passages for this query, avoid dividing by zero
  data = rank_in.loc[(rank_in['query_id'] == qid)].head(k).reset_index()
  ctr = 0
  for index, row in data.iterrows():
    #passages missing from qrel_dict are unjudged and count as non-relevant
    if qrel_dict.get((row['query_id'], row['passage_id'])) == '1':
      ctr += 1
  #recall value for qid
  rec = ctr/recos
  return rec


def calcRecall(k, qrels, rank_in):
    #unique queries
    que = pd.unique(rank_in['query_id'])
    qdf = pd.DataFrame(que, columns=['qid'])
    qdf['recall'] = qdf['qid'].apply(lambda row: recCal(row, k, qrels, rank_in))
    #final aggr recall value
    recall = qdf['recall'].sum()/len(que)
    return recall


start_time = time.time()

print ('Recall at top 5 : {0}'.format(calcRecall(5, qrels, rank_in)))
print ('Recall at top 10 : {0}'.format(calcRecall(10, qrels, rank_in)))

print("--- %s seconds ---" % (time.time() - start_time))

Recall at top 5 : 0.1293441649380108
Recall at top 10 : 0.16743513404735452
--- 131.22059774398804 seconds ---


# 4 : F1 Measure (15 Points)

Question 4.1 (5 points) 

Give the definition of F1 measure corresponding to the top *k* retrieved results for *n* queries (F1@k). Please note that you have to use averaging to aggregate the results from all queries.

***Answer*** 
The F1 measure is the harmonic mean of precision and recall. For a given query, F1@k = 2 · P@k · R@k / (P@k + R@k), where P@k and R@k are the precision and recall over the top k ranks; when both are 0, F1@k is taken as 0. The final value is the sum of the per-query F1@k scores divided by the number of unique queries (n).
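The sketch below works through the formula on made-up (precision, recall) pairs and also shows why the order of operations matters: averaging per-query F1 scores is generally not the same as computing F1 from the averaged precision and recall. The function name `f1_from` and all numbers are illustrative.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical (P@k, R@k) pairs for three queries.
per_query = [(0.5, 0.5), (0.25, 1.0), (0.0, 0.0)]

f1_scores = [f1_from(p, r) for p, r in per_query]
mean_f1 = sum(f1_scores) / len(f1_scores)

mean_p = sum(p for p, _ in per_query) / len(per_query)
mean_r = sum(r for _, r in per_query) / len(per_query)
f1_of_means = f1_from(mean_p, mean_r)

print(round(mean_f1, 4))      # 0.3
print(round(f1_of_means, 4))  # 0.3333
```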

Question 4.2 (10 points) 

In the cell below, please enter the code to print the F1@k where k={5,10} for the input ranking file.  Please note that you have to calculate F1 score for each query and compute the final score by averaging the metric values across all queries. 

In [None]:
''' 
In this function, calculate F1@k, given the input ranking information (rank_in)  
and the query passage relevance information (qrels).
Return Value: 
f1 - F1@k
''' 

def f1Cal(qid, k, qrels, rank_in):
  #total number of relevant passages for this query
  recos = len(qrels[(qrels['query_id'] == qid) & ((qrels['normal_relevance_judge'] == "1"))])
  data = rank_in.loc[(rank_in['query_id'] == qid)].head(k).reset_index()
  ctr = 0
  for index, row in data.iterrows():
    #passages missing from qrel_dict are unjudged and count as non-relevant
    if qrel_dict.get((row['query_id'], row['passage_id'])) == '1':
      ctr += 1

  #precision value for qid
  precision = ctr/k

  #recall value for qid (0 if the query has no relevant passages)
  recall = ctr/recos if recos != 0 else 0

  if precision + recall != 0:
    return f1score(precision, recall)
  else:
    return 0


def f1score(precision, recall):
  return ((2*precision*recall)/(precision + recall))

def calcFScore(k, qrels, rank_in):
    #unique queries
    que = pd.unique(rank_in['query_id'])
    qdf = pd.DataFrame(que, columns=['qid'])
    qdf['f1'] = qdf['qid'].apply(lambda row: f1Cal(row, k, qrels, rank_in))
    #final aggr f1 value
    f1 = qdf['f1'].sum()/len(que)
    return f1


start_time = time.time()

print ('F1 score at top 5 : {0}'.format(calcFScore(5, qrels, rank_in)))
print ('F1 score at top 10 : {0}'.format(calcFScore(10, qrels, rank_in))) 

print("--- %s seconds ---" % (time.time() - start_time))

F1 score at top 5 : 0.12587171180786297
F1 score at top 10 : 0.11669190136577237
--- 120.64473867416382 seconds ---


# 5 : Mean Reciprocal Rank (MRR) (15 Points)

Question 5.1 (5 points)

Give the definition of MRR@k corresponding to the top *k* retrieved results for *n* queries (MRR@k). Please note that you have to use averaging to aggregate the results from all queries.

***Answer*** 
The Reciprocal Rank for a given query is the inverse of the rank of the first relevant document within the top k retrieved results. If no relevant document is retrieved in the top k, its value is taken as 0. The Mean Reciprocal Rank (MRR@k) is the Reciprocal Rank averaged across all n queries.
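A minimal sketch of this definition on toy data follows. The function name `reciprocal_rank_at_k` and the rankings are hypothetical.

```python
def reciprocal_rank_at_k(ranked_ids, relevant_ids, k):
    """1 / rank of the first relevant id in the top k, or 0 if there is none."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0


toy_rankings = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"], "q3": ["g", "h", "i"]}
toy_relevant = {"q1": {"b", "c"}, "q2": {"d"}, "q3": {"z"}}

rr = [reciprocal_rank_at_k(ranked, toy_relevant[qid], 3)
      for qid, ranked in toy_rankings.items()]
print(rr)                 # [0.5, 1.0, 0.0]
print(sum(rr) / len(rr))  # 0.5
```

Only the first relevant hit matters: in `q1` the second relevant passage at rank 3 does not change the score.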

Question 5.2 (10 points)

In the cell below, please enter the code to print the MRR@k where k={5,10} for the input ranking file. As mentioned above, please note that the final value is the average of metric values across all queries.

In [None]:
''' 
In this function, calculate MRR@k, given the input ranking information (rank_in)  
and the query passage relevance information (qrels).
Return Value: 
mrr - MRR@k
''' 

def mrrCal(qid, k, qrels, rank_in):
  data = rank_in.loc[(rank_in['query_id'] == qid)].head(k).reset_index()
  for index, row in data.iterrows():
    if ((row['query_id'], row['passage_id'])) in qrel_dict:
      if qrel_dict[(row['query_id'], row['passage_id'])] == '1':
        rec_rank = 1/(index+1) #here index starts from 0, hence adding 1 for normal ranking
        return rec_rank
  return 0

def calcMRR(k, qrels, rank_in):
  #enter your code here
  #unique queries
  que = pd.unique(rank_in['query_id'])
  qdf = pd.DataFrame(que, columns=['qid'])
  qdf['mrr'] = qdf['qid'].apply(lambda row: mrrCal(row, k, qrels, rank_in))
  #final aggr mrr value
  mrr = qdf['mrr'].sum()/len(que)
  return mrr

start_time = time.time()

print ('MRR at top 5 : {0}'.format(calcMRR(5, qrels, rank_in)))
print ('MRR at top 10 : {0}'.format(calcMRR(10, qrels, rank_in)))   

print("--- %s seconds ---" % (time.time() - start_time))

MRR at top 5 : 0.3375996152789228
MRR at top 10 : 0.34748982582865523
--- 97.61747598648071 seconds ---


# 6 : Mean Average Precision (MAP) (15 points)

Question 6.1 (5 points)

Give the definition of MAP@k corresponding to the top *k* retrieved results for *n* queries (MAP@k). Please note that you have to use averaging to aggregate the results from all queries.

***Answer*** 
Average precision (AP) is a widely used metric in TREC evaluations. Precision is computed at every rank where a relevant document is found, and these values are averaged over the relevant documents of the query. 
For a given query, average precision at cutoff k is

$AP@k = \frac{1}{|R|} \sum_{i=1}^k Precision_i \cdot Relevance_i$

where

* $|R|$ is the total number of relevant documents for the query
* $k$ is the cutoff, i.e. the length of the ranked list being evaluated
* $Precision_i$ is the precision of the top $i$ documents
* $Relevance_i$ is the binary relevance of the $i$-th document (1 if relevant, 0 otherwise)

Then, the mean average precision is calculated by taking the average of average precision across n queries. 
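The formula can be traced on a toy ranking with a single pass over the list. This is only a sketch under the definition above; `average_precision_at_k` and the data are hypothetical, and a query without relevant documents is assumed to score 0.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k):
    """Sum precision at each relevant rank in the top k, divided by |R|."""
    if not relevant_ids:
        return 0.0  # assumption: a query without relevant documents scores 0
    hits = 0
    precision_sum = 0.0
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # Precision_i at a relevant rank
    return precision_sum / len(relevant_ids)


ranked = ["a", "b", "c", "d"]
relevant = {"a", "c", "z"}  # |R| = 3, "z" is never retrieved

# Relevant documents at ranks 1 and 3: (1/1 + 2/3) / 3
print(round(average_precision_at_k(ranked, relevant, 4), 4))  # 0.5556
```

Keeping a running count of hits means each rank is visited once, so the cost per query is linear in k.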



Question 6.2 (10 points)

In the cell below, please enter the code to print the MAP@k where k={50, 100} for the input ranking file. As mentioned above, please note that the final value is the average of metric values across all queries.


In [None]:
''' 
In this function, calculate MAP@k, given the input ranking information (rank_in)  
and the query passage relevance information (qrels).
Return Value: 
map - MAP@k
'''

def mapCal(qid, k, qrels, rank_in):
  #total number of relevant passages for this query (|R|)
  recos = len(qrels[(qrels['query_id'] == qid) & ((qrels['normal_relevance_judge'] == "1"))])
  if recos == 0:
    return 0  #no relevant passages for this query, avoid dividing by zero
  data = rank_in.loc[(rank_in['query_id'] == qid)].head(k).reset_index()
  ctr = 0   #relevant passages seen so far
  sump = 0  #running sum of precision values at relevant ranks
  for index, row in data.iterrows():
    if qrel_dict.get((row['query_id'], row['passage_id'])) == '1':
      ctr += 1
      #index starts from 0, so the rank is index+1; ranks 1..k are all included
      sump += ctr/(index+1)

  #averaging over all the relevant docs for a single qid
  ap = sump/recos
  return ap


def calcMAP(k, qrels, rank_in):
  #enter your code here
  #unique queries
  que = pd.unique(rank_in['query_id'])
  qdf = pd.DataFrame(que, columns=['qid'])
  qdf['map'] = qdf['qid'].apply(lambda row: mapCal(row, k, qrels, rank_in))
  #final aggr map value (named mean_ap to avoid shadowing the built-in map)
  mean_ap = qdf['map'].sum()/len(que)
  return mean_ap
   

start_time = time.time()

print ('MAP at top 50 : {0}'.format(calcMAP(50, qrels, rank_in)))
print ('MAP at top 100 : {0}'.format(calcMAP(100, qrels, rank_in)))   

print("--- %s seconds ---" % (time.time() - start_time))

MAP at top 50 : 0.12129151696866383
MAP at top 100 : 0.12326331463633428
--- 7993.947522163391 seconds ---


# 7 : Normalized Discounted Cumulative Gain (NDCG) (15 Points)

Question 7.1 (5 points)

Give the definition of NDCG@k corresponding to the top *k* retrieved results for *n* queries (NDCG@k). Use the definition discussed in the lectures. Note that this metric considers graded relevance judgments and you should not binarize the labels. To assign zero gain to non-relevant documents, decrease all relevance labels in the ANTIQUE qrels by 1 point i.e. map relevance judgements 1-4 to 0-3. Please note that you have to use averaging to aggregate the results from all queries.

***Answer*** 
Normalized DCG for a query is the ratio of its Discounted Cumulative Gain (DCG) to the Ideal DCG. 
Discounted Cumulative Gain at rank k is given by

$DCG@k = \frac{r_1}{\log_{2}2} + \frac{r_2}{\log_{2}3} + \frac{r_3}{\log_{2}4} +  ...  + \frac{r_k}{\log_{2}(k+1)}$

where $r_i$ is the graded relevance (after mapping the labels 1-4 to 0-3) of the document at rank $i$; documents without a judgement have $r_i = 0$.

Ideal DCG at rank k is the DCG of the top k results of the ideal ranking, in which the judged documents for the query are sorted by relevance label in descending order.

Normalized Discounted Cumulative Gain takes values in [0,1].

The average NDCG value across n queries is calculated by taking the sum of NDCG@k across all n queries and dividing it by n.
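A small worked example may make the normalization concrete. The sketch below assumes the label shift from 1-4 to 0-3 described in the question and treats unjudged passages as gain 0; the names `dcg` and `ndcg_at_k` and the toy labels are hypothetical.

```python
import math


def dcg(gains):
    """Sum of gain / log2(rank + 1) over ranks 1..len(gains)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_ids, gain_by_id, k):
    """DCG of the top-k ranking divided by the DCG of the ideal top-k ordering."""
    ranked_gains = [gain_by_id.get(pid, 0) for pid in ranked_ids[:k]]  # unjudged -> 0
    ideal_gains = sorted(gain_by_id.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0


labels = {"a": 4, "b": 3, "c": 2, "d": 1}                 # original 1-4 labels
gains = {pid: label - 1 for pid, label in labels.items()}  # shifted to 0-3

ranked = ["a", "x", "b"]  # "x" is not judged
# DCG  = 3/log2(2) + 0/log2(3) + 2/log2(4) = 4.0
# IDCG = 3/log2(2) + 2/log2(3) + 1/log2(4)
print(round(ndcg_at_k(ranked, gains, 3), 4))  # 0.84
```

The ideal ordering is built from all judged passages of the query, not only from those that were retrieved, so a ranking that misses a highly relevant passage is still penalized.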


Question 7.2 (10 points)

In the cell below, please enter the code to print the NDCG@k where k={5, 10} for the input ranking file. As mentioned above, please note that the final value is the average of metric values across all queries. 

Use log base 2 for the calculations. 


In [None]:
#scaling down the relevance labels
qrels['relevance_judge'] = qrels['relevance_judge'].apply(lambda x: str(int(x) - 1))
rel_dict_dcg = qrels.set_index('key_val').to_dict()['relevance_judge']

In [None]:
''' 
In this function, calculate NDCG@k, given the input ranking information (rank_in)  
and the query passage relevance information (qrels).
Return Value: 
ndcg - NDCG@k
'''

import math
def dcgCal(qid, k, qrels, rank_in):
  #ideal ranking: all judged gains (already shifted to 0-3) for this query, largest first
  judged = qrels.loc[(qrels['query_id'] == qid), 'relevance_judge']
  ideal = sorted((int(r) for r in judged), reverse=True)[:k]
  data = rank_in.loc[(rank_in['query_id'] == qid)].head(k).reset_index()
  dcg = 0
  for index, row in data.iterrows():
    #unjudged passages get zero gain
    r_at_index = int(rel_dict_dcg.get((row['query_id'], row['passage_id']), 0))
    #index starts from 0, so rank = index+1 and the discount is log2(rank+1)
    dcg += (r_at_index/math.log(index + 2, 2))

  #to find ideal dcg
  idcg = 0
  for pos, rval in enumerate(ideal):
    idcg += rval/math.log(pos + 2, 2)

  if idcg == 0:
    return 0  #no passage with positive gain for this query

  ndcg = dcg/idcg

  return ndcg


def calcNDCG(k, qrels, rank_in):
    #unique queries
    que = pd.unique(rank_in['query_id'])
    qdf = pd.DataFrame(que, columns=['qid'])
    qdf['ndcg'] = qdf['qid'].apply(lambda row: dcgCal(row, k, qrels, rank_in))
    #final aggr ndcg value

    ndcg = qdf['ndcg'].sum()/len(que)
    return ndcg


start_time = time.time()

print ('NDCG at top 5 : {0}'.format(calcNDCG(5, qrels, rank_in)))
print ('NDCG at top 10 : {0}'.format(calcNDCG(10, qrels, rank_in)))  

print("--- %s seconds ---" % (time.time() - start_time))

NDCG at top 5 : 0.1646480959858136
NDCG at top 10 : 0.12391912424164493
--- 138.52916717529297 seconds ---
