# Anserini for ARQMath


This notebook shows how to retrieve and rerank candidate answers for ARQMath Task 1 topics.
It first searches an Anserini index through `pyserini`, then reranks the retrieved posts with a RoBERTa-based model (ARQ-BERT), and finally combines the two runs with reciprocal rank fusion.

In [2]:
from IPython.core.display import display, HTML

First, install the Python dependencies and import the libraries used below.

In [1]:
# %%capture
# !pip install pyserini==0.8.1.0
# !pip install transformers

# Standard library
import gc
import glob
import json
import math
import os
import re
import warnings
import xml.etree.ElementTree as ET
from collections import defaultdict

# Third-party packages
import jsonlines
import numpy as np
import pandas as pd
import torch
from elasticsearch import Elasticsearch
from scipy import spatial
from tqdm.notebook import tqdm
from transformers import *

warnings.filterwarnings('ignore')
warnings.simplefilter(action='ignore', category=FutureWarning)

ARQ_INDEX = '/data/szr207/dataset/ArqMath/anserini_index'
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-11.0.7.10-4.el7_8.x86_64"
topic_file_path = "/data/szr207/dataset/ArqMath/Task1/Topics/Topics_V2.0.xml"
# topic_file_path = "/data/szr207/dataset/ArqMath/Task1/Sample Topics/Task1_Samples_V2.0.xml"
model_path = "/data/szr207/github/transformers/examples/language-modeling/output/checkpoint-4000000"

from pyserini.search import pysearch


Let's load ARQ-BERT, a RoBERTa-based language model checkpoint:

In [4]:
tokenizer =  RobertaTokenizer.from_pretrained('roberta-base', do_lower_case=False)
model = AutoModelWithLMHead.from_pretrained(model_path)

# from transformers import AutoTokenizer, AutoModelWithLMHead

# tokenizer = AutoTokenizer.from_pretrained("shauryr/arqmath-roberta-base-2M")

# model = AutoModelWithLMHead.from_pretrained("shauryr/arqmath-roberta-base-2M")

In [5]:
class Topic:
    """
    A topic for Task 1. Each topic has a topic_id (str), a title, a question
    (the question body), and a list of tags.
    """

    def __init__(self, topic_id, title, question, tags):
        self.topic_id = topic_id
        self.title = title
        self.question = question
        self.lst_tags = tags


class TopicReader:
    """
    Reads all topics from the given topic file into a map. Each key is a topic id
    and each value is a Topic with 4 attributes: topic_id, title, question, and lst_tags.

    To look up a topic, call get_topic with the topic id. It returns the Topic object,
    or None if the id is unknown.
    """

    def __init__(self, topic_file_path):
        self.__map_topics = self.__read_topics(topic_file_path)

    def __read_topics(self, topic_file_path):
        map_topics = {}
        tree = ET.parse(topic_file_path)
        root = tree.getroot()
        for child in root:
            topic_id = child.attrib['number']
            title = child[0].text
            question = child[1].text
            lst_tag = child[2].text.split(",")
            map_topics[topic_id] = Topic(topic_id, title, question, lst_tag)
        return map_topics

    def get_topic(self, topic_id):
        if topic_id in self.__map_topics:
            return self.__map_topics[topic_id]
        return None


queries = []
# Read all topics; the hits retrieved for each topic are collected in dict_q_a.
topic_reader = TopicReader(topic_file_path)
dict_q_a = defaultdict(list)

searcher = pysearch.SimpleSearcher(ARQ_INDEX)


# Collect the ids of the topics that appear in the qrel file (first tab-separated column).
topic_list = set()
with open('../runs/qrel_task1', 'r') as eval_file:
    for line in eval_file:
        topic_list.add(line.split('\t')[0])

topic_list = list(topic_list)

# list_p = []
# with open('../runs/qrel_task1', 'r') as eval_file:
#     for _,line in enumerate(eval_file):
#         list_p.append(line.split('\t')[2])

# list_p = list(set(list_p))
# len(list_p)

# with open('tf.psu-task1-prim.anserini-auto-both-A.tsv', 'w') as eval_file:
for topic_id in tqdm(topic_reader._TopicReader__map_topics):
    if topic_id in topic_list:
        # topic_id = 'A.4'
        # Strip HTML tags from the title and body, then join them into one query.
        title = re.sub('<[^<]+?>', '', topic_reader.get_topic(topic_id).title)
        body = topic_reader.get_topic(topic_id).question
        body_pro = re.sub('<[^<]+?>', '', body)
        query = title + '. ' + body_pro
        queries.append(query)
        print(topic_id, query[:80], '...')

        # Retrieve up to 1000 hits per topic; they are reranked later.
        hits = searcher.search(query, 1000)

        count = 1
        for hit in hits:
            # if hit.docid in list_p:
            #     eval_file.write(topic_id+'\t'+ '1\t' +str(hit.docid)+'\t'+str(count)+'\t'+ str(hit.score)+'\t'+ 'anserini'+'\n')
            count += 1
            dict_q_a[topic_id].append(hit)

A.1 Finding value of $c$ such that the range of the rational function $f(x) = \frac{ ...
A.3 Approximation to $\sqrt{5}$ correct to an exactitude of $10^{-10}$. I am attempt ...
A.4 How to compute this combinatoric sum?. I have the sum  $$\sum_{k=0}^{n} \binom{n ...
A.5 A family has two children. Given that one of the children is a boy, what is the  ...
A.7 Finding out the remainder of $\frac{11^\text{10}-1}{100}$ using modulus.    If $ ...
A.8 finding value of $\lim_{n\rightarrow \infty}\sqrt[n]{\frac{(27)^n(n!)^3}{(3n)!}} ...
A.9 Simplifying this series. I need to write the series   $$\sum_{n=0}^N nx^n$$   in ...
A.10 Find the values of a>0 for which the improper integral $\int_{0}^{\infty}\frac{\ ...
A.11 What's the cross product in 2 dimensions?. The math book i'm using states that t ...
A.12 Finding the roots of a complex number. I was solving practice problems for my up ...
A.13 How to simplify expression $\int_a^b f(x)dx+\int_{f(a)}^{f(b)} f^{-1}(x)dx \ ?$. ...
A.14 Help solving
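
The topic-loading step above can be tried without the real topic file. The following is a minimal sketch, assuming a tiny in-memory XML document whose element names (`Topics`, `Topic`, `Title`, `Question`, `Tags`) are hypothetical; `TopicReader` reads the children by position, so only their order matters. The helper names `strip_tags` and `read_topics` are also illustrative and not part of the notebook.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical one-topic file; element names are illustrative only.
SAMPLE_XML = """<Topics>
  <Topic number="A.1">
    <Title>Range of a &lt;i&gt;rational&lt;/i&gt; function</Title>
    <Question>&lt;p&gt;Find $c$ such that ...&lt;/p&gt;</Question>
    <Tags>functions,inequality</Tags>
  </Topic>
</Topics>"""

TAG_RE = re.compile(r"<[^<]+?>")  # same pattern the notebook passes to re.sub


def strip_tags(text):
    """Remove anything that looks like an HTML tag."""
    return TAG_RE.sub("", text)


def read_topics(xml_text):
    """Map topic id -> dict with cleaned title, cleaned question, and tags."""
    topics = {}
    for child in ET.fromstring(xml_text):
        topics[child.attrib["number"]] = {
            "title": strip_tags(child[0].text),      # first child: title
            "question": strip_tags(child[1].text),   # second child: question body
            "tags": child[2].text.split(","),        # third child: comma-separated tags
        }
    return topics


topics = read_topics(SAMPLE_XML)
query = topics["A.1"]["title"] + ". " + topics["A.1"]["question"]
print(query)
```

One thing to watch for: the tag pattern also removes text between a literal `<` and a later `>` when they appear unescaped in a formula, so `strip_tags("$a<b$ and $c>d$")` returns `"$ad$"`.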

You can use `pysearch` to search over an index. The helpers below run a query and display the top results as HTML:

In [None]:
def show_query(query):
    """HTML print format for the searched query"""
    return HTML('<br/><div style="font-family: Times New Roman; font-size: 20px;'
                'padding-bottom:12px"><b>Query</b>: '+query[:100]+'</div>')

def show_document(idx, doc):
    """HTML print format for document fields"""
    return HTML('<div style="font-family: Times New Roman; font-size: 18px; padding-bottom:10px">' + 
               f'<b>Document {idx}:</b> {doc.docid} ({doc.score:1.2f}) -- ' +
                f'${doc.raw[:100]}$</div>')

def show_query_results(query, searcher, top_k):
    """Search for the query, display the top_k hits as HTML, and return them"""
    hits = searcher.search(query,1000)
    display(show_query(query))
    for i, hit in enumerate(hits[:top_k]):
        display(show_document(i+1, hit))
    return hits[:top_k]
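
The display helpers above insert the query and `doc.raw` into HTML as-is, so characters such as `<` or `&` in a post can disturb the rendering. A minimal sketch of one way to guard against that, assuming escaping with the standard `html` module is acceptable for the display; the function name `document_html` and its parameters are hypothetical and it returns a plain string rather than an `HTML` object:

```python
import html


def document_html(idx, docid, score, raw, max_chars=100):
    """Build the snippet markup for one hit, escaping the raw text first."""
    snippet = html.escape(raw[:max_chars])  # '<' -> '&lt;', '&' -> '&amp;', ...
    return (
        '<div style="font-family: Times New Roman; font-size: 18px; '
        'padding-bottom:10px">'
        f"<b>Document {idx}:</b> {html.escape(str(docid))} ({score:1.2f}) -- "
        f"${snippet}$</div>"
    )


markup = document_html(1, "12345", 7.4321, "<p>Let $x < 1$ & y</p>")
print(markup)
```

The returned string could be wrapped in `HTML(...)` and passed to `display` in the same way as the original helpers.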

In [7]:
from transformers import pipeline
from transformers import *

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

# Load the encoder with hidden states exposed so that token embeddings can be pooled.
model = RobertaModel.from_pretrained(model_path, output_hidden_states=True)
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

model = model.to(device)

ques_emb = []
SEQ_LENGTH = 512
for query in tqdm(queries):
    token_ids = tokenizer.encode(query)[:SEQ_LENGTH]
    token_ids = torch.tensor(token_ids).unsqueeze(0)
    token_ids = token_ids.to(device)
    with torch.no_grad():
        out = model(input_ids=token_ids)
    hidden_states = out[2]
    del out
    torch.cuda.empty_cache()
    sentence_embedding = torch.mean(hidden_states[-1], dim=1).squeeze()
    ques_emb.append(sentence_embedding)
    

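# Map each judged topic id to the position of its mean-pooled embedding in ques_emb.
# The queries were appended in this same iteration order, so a running counter is enough.
dict_q_idx = {}
count = 0
for topic_id in tqdm(topic_reader._TopicReader__map_topics):
    if topic_id in topic_list:
        dict_q_idx[topic_id] = count
        count += 1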



Token indices sequence length is longer than the specified maximum sequence length for this model (527 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (544 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (673 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1472 > 512). Running this sequence through the model will result in indexing errors
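
The call `torch.mean(hidden_states[-1], dim=1)` in the cell above averages the last layer's token vectors over the token axis, which turns a `(1, tokens, hidden)` tensor into one vector per input. The following library-free sketch shows the same arithmetic on a toy input; the function name `mean_pool` and the numbers are made up for illustration.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into a single vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    n_tokens = len(token_vectors)
    dim = len(token_vectors[0])
    # For each hidden dimension j, average that coordinate over all tokens.
    return [sum(vec[j] for vec in token_vectors) / n_tokens for j in range(dim)]


# Three hypothetical tokens with a hidden size of 4.
hidden = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [2.0, 2.0, 2.0, 2.0],
]
sentence_embedding = mean_pool(hidden)
print(sentence_embedding)  # [2.0, 2.0, 2.0, 2.0]
```

Every token contributes equally to the result, including the special start and end tokens that the tokenizer adds.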





In [8]:
def Sort_Tuple(tup):
    # Sort the list of (qid, docid, score) tuples in place, in ascending
    # order of the third element (the similarity score).
    tup.sort(key=lambda x: x[2])
    return tup
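
Later cells call `Sort_Tuple(...)[::-1][:1000]` to get the highest-scoring documents first. A minimal sketch of an equivalent one-step ranking, assuming the same `(qid, docid, score)` tuple layout; the helper name `top_k_by_score` and the sample scores are hypothetical:

```python
from operator import itemgetter


def top_k_by_score(triples, k=1000):
    """Return the k highest-scoring (qid, docid, score) tuples, best first."""
    # sorted() builds a new list, so the input list is left unchanged.
    return sorted(triples, key=itemgetter(2), reverse=True)[:k]


scored = [("A.1", "d1", 0.71), ("A.1", "d2", 0.93), ("A.1", "d3", 0.71)]
best = top_k_by_score(scored, k=2)
print(best)  # [('A.1', 'd2', 0.93), ('A.1', 'd1', 0.71)]
```

The two approaches differ only for tied scores: `reverse=True` keeps tied items in their input order, whereas sorting ascending and reversing with `[::-1]` flips them.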

In [9]:
def reduce_str(tokenizer, body):
    encoded = tokenizer.encode(body)[:510]
    return tokenizer.decode(encoded)
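
Slicing the encoded ids, as `reduce_str` and the embedding loops do, keeps the leading `<s>` token but drops the closing `</s>` whenever the input is too long. The sketch below shows one way to truncate while keeping the end marker. It assumes the ids already include the special tokens (which `tokenizer.encode` adds by default) and uses the `roberta-base` ids 0 for `<s>` and 2 for `</s>`; the helper name `truncate_ids` is hypothetical.

```python
BOS_ID, EOS_ID = 0, 2  # ids of <s> and </s> in the roberta-base vocabulary


def truncate_ids(token_ids, max_len=512):
    """Cut a token id sequence to max_len while keeping the final </s>."""
    if len(token_ids) <= max_len:
        return list(token_ids)
    # Keep the first max_len - 1 ids and re-append the end-of-sequence id.
    return list(token_ids[: max_len - 1]) + [EOS_ID]


# A hypothetical over-long input: <s>, 600 made-up word ids, </s>.
ids = [BOS_ID] + list(range(10, 610)) + [EOS_ID]
short = truncate_ids(ids)
print(len(ids), len(short), short[0], short[-1])  # 602 512 0 2
```

The "Token indices sequence length is longer than the specified maximum" warnings in the cell outputs are emitted at encoding time, before any slicing, which is why they appear even though the ids are truncated afterwards.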

In [None]:
# import logging
# logging.getLogger("pytorch_pretrained_bert.tokenization").setLevel(logging.ERROR) 
# result_list = []
# count = 0
# for qid in tqdm(list(topic_reader._TopicReader__map_topics.keys())):
#     tup_postid_sim = []
#     for doc in tqdm(dict_q_a[qid]):
#         try:
#             feat = feat_ext(reduce_str(tokenizer, doc.raw))[0]
#             ans_emb = np.mean(feat, axis=0)
#         except:
#             print('CUDA error: an illegal memory access was encountered', post_id)
#             break
#         result = 1 - spatial.distance.cosine(ques_emb[dict_q_idx[qid]], ans_emb)

#         if math.isnan(result):
#             count+=1
#             print(qid, doc.docid, count)
#             break
#         tup_postid_sim.append((qid,doc.docid,result))

#     result_list.append(Sort_Tuple(tup_postid_sim)[::-1][:1000])
import time
    
result_list = []
count = 0

bert_time_list = []

for qid in tqdm(list(topic_reader._TopicReader__map_topics.keys())):
    tup_postid_sim = []
    if qid in topic_list:
        tic = time.perf_counter()
        for doc in dict_q_a[qid]:
            try:
                token_ids = tokenizer.encode(doc.raw)[:SEQ_LENGTH]
                token_ids = torch.tensor(token_ids).unsqueeze(0)
                token_ids = token_ids.to(device)
                
                with torch.no_grad():
                    out = model(input_ids=token_ids)
                hidden_states = out[2]
                del out
                torch.cuda.empty_cache()
                ans_emb = torch.mean(hidden_states[-1], dim=1).squeeze().cpu()
            except Exception as err:
                # Report the failing document and stop processing this topic.
                print('Failed to embed document', doc.docid, err)
                break
            result = 1 - spatial.distance.cosine(ques_emb[dict_q_idx[qid]].cpu(), ans_emb)

            if math.isnan(result):
                count+=1
                print(qid, doc.docid, count)
                break
            tup_postid_sim.append((qid,doc.docid,result))
        toc = time.perf_counter()
        bert_time_list.append((qid,toc - tic))
        result_list.append(Sort_Tuple(tup_postid_sim)[::-1][:1000])

Token indices sequence length is longer than the specified maximum sequence length for this model (878 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1850 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (849 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1389 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (512 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for t
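
The reranking score above is `1 - spatial.distance.cosine(q, a)`, that is, the cosine similarity between the question and answer embeddings. A library-free sketch of that quantity follows; the function name `cosine_similarity` is illustrative, and returning NaN for a zero-length vector is an assumption made here to mirror the `math.isnan` check in the loop.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # The ratio is undefined for a zero vector; signal that with NaN.
        return float("nan")
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
print(math.isnan(cosine_similarity([0.0, 0.0], [1.0, 0.0])))  # True
```

Because the score depends only on direction, scaling an embedding by a positive constant leaves its ranking position unchanged.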

In [None]:
ans_times = []
for idx, t in bert_time_list:
    ans_times.append(t)
    
import matplotlib.pyplot as plt
x = ans_times
plt.boxplot(x)
plt.show()
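
The box plot summarises the per-topic reranking times by their median and quartiles. If the numbers themselves are needed, a sketch along these lines could be used. It assumes the `(qid, seconds)` layout of `bert_time_list`, the helper name `summarize_times` and the sample timings are made up, and quartile conventions differ slightly between libraries, so the values need not match the plot exactly.

```python
import statistics


def summarize_times(timings):
    """Five-number summary of (topic_id, seconds) pairs."""
    seconds = [t for _, t in timings]
    q1, median, q3 = statistics.quantiles(seconds, n=4)  # three quartile cut points
    return {
        "min": min(seconds),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(seconds),
    }


# Hypothetical timings in seconds, one entry per topic.
timings = [("A.1", 10.0), ("A.3", 12.0), ("A.4", 11.0), ("A.5", 30.0), ("A.7", 13.0)]
summary = summarize_times(timings)
print(summary)
```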

In [None]:
with open('tf.psu-task1-prim.anserini.bert-auto-both-A.tsv', 'w') as eval_file:
    for res in result_list:
        count = 1
        for tuples in res:
            eval_file.write(tuples[0]+'\t'+ '1\t' + str(tuples[1]) +'\t'+str(count)+'\t'+ str(tuples[2])+'\t'+ 'mlt_bert'+'\n')
#             eval_file.write(tuples[0]+'\t' + str(tuples[1]) +'\t'+str(count)+'\t'+ str(tuples[2])+'\t'+ 'mlt_bert'+'\n')
            count+=1
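
Each line written above has six tab-separated fields: topic id, a constant `1`, document id, rank, score, and a run tag. The sketch below builds the same lines in memory so the layout can be checked before writing a file; the helper name `run_lines` and the sample ranking are hypothetical.

```python
def run_lines(ranked, tag, second_column="1"):
    """Format (topic_id, doc_id, score) tuples, best first, as run-file lines."""
    lines = []
    # Ranks start at 1 and follow the order of the input list.
    for rank, (topic_id, doc_id, score) in enumerate(ranked, start=1):
        fields = [topic_id, second_column, str(doc_id), str(rank), str(score), tag]
        lines.append("\t".join(fields))
    return lines


ranked = [("A.1", "2653", 0.93), ("A.1", "77", 0.71)]
lines = run_lines(ranked, tag="mlt_bert")
for line in lines:
    print(line)
```

Using `enumerate(..., start=1)` replaces the manual `count` variable and keeps the rank tied to the position in the list.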

In [None]:
from trectools import TrecRun, TrecEval, fusion

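runs_path = '/data/szr207/projects/ArqMath/notebooks'
# Both run files are expected in runs_path. The reranked run written in the
# previous cell is named 'tf.psu-task1-prim.anserini.bert-auto-both-A.tsv'
# (with 'prim.'), so adjust the names here if the file was not renamed.
r1 = TrecRun(os.path.join(runs_path, "tf.psu-task1-anserini.bert-auto-both-A.tsv"))
r2 = TrecRun(os.path.join(runs_path, "tf.psu-task1-anserini-auto-both-A.tsv"))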

# Easy way to create new baselines by fusing existing runs:
fused_run = fusion.reciprocal_rank_fusion([r1,r2])

# Save run to disk with all its topics
fused_run.print_subset("tf.psu-task1-rrf.anserini.bert-auto-both-P.tsv", topics=fused_run.topics())
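
Reciprocal rank fusion scores each document by summing `1 / (k + rank)` over the runs in which it appears, so documents ranked highly by both runs rise to the top. The following is a minimal sketch for a single topic, not the `trectools` implementation; the function name `rrf`, the sample rankings, and the constant `k=60` (a value commonly used for this method) are assumptions.

```python
from collections import defaultdict


def rrf(rankings, k=60):
    """Fuse several ranked lists of doc ids with reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)  # contribution from this run
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


anserini_ranking = ["d1", "d2", "d3"]  # hypothetical first-stage ranking
bert_ranking = ["d3", "d1", "d4"]      # hypothetical reranked list
fused = rrf([anserini_ranking, bert_ranking])
print([doc_id for doc_id, _ in fused])  # ['d1', 'd3', 'd2', 'd4']
```

Only ranks enter the formula, so the two runs can be fused even though their raw scores (retrieval scores and cosine similarities) are on different scales.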

In [None]:
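# Re-read the fused run (space-separated), drop its second column,
# and write the result back as a tab-separated file.
df = pd.read_csv('tf.psu-task1-rrf.anserini.bert-auto-both-P.tsv', header=None, sep=" ")
df = df.drop(columns=1)
df.to_csv('psu-task1-rrf.anserini.bert-auto-both-P.tsv', sep='\t', index=False, header=False)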