# How to evaluate Vespa ranking functions from Python

> Using [pyvespa](https://pyvespa.readthedocs.io/en/latest/index.html) to evaluate [cord19 search application](https://cord19.vespa.ai/) ranking functions currently in production.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vespa-engine/pyvespa/blob/master/docs/sphinx/source/use_cases/cord19/cord19_download_parse_trec_covid.ipynb)

# Evaluate query model baselines

> Download and explore TREC-COVID. Split data into training and test sets. Evaluate existing query models.

The team behind [vespa.ai](https://vespa.ai/) has built and open-sourced a [CORD-19 search engine](https://cord19.vespa.ai/). Thanks to advanced Vespa features such as [Approximate Nearest Neighbors Search](https://blog.vespa.ai/approximate-nearest-neighbor-search-in-vespa-part-1/) and [Transformers support via ONNX](https://blog.vespa.ai/introducing-nlp-with-transformers-on-vespa/), it comes with the most advanced NLP methodology applied to search that is currently available.

In the following sections we will:
* Download, parse and explore the TREC-COVID Complete topics and relevance judgements.
* Split the data into training and test sets.
* Evaluate some query models that are already deployed in the [CORD-19 search engine](https://cord19.vespa.ai/).

## Install pyvespa

`pyvespa` provides a Python API to Vespa. It allows us to create, modify, deploy and interact with running Vespa instances. The main goal of the library is to allow for faster prototyping and to facilitate Machine Learning experiments for Vespa applications.

In [None]:
!pip install pyvespa

## Download and parse TREC-COVID Complete topics and relevance judgements

The files used in this section can be found at https://ir.nist.gov/covidSubmit/data.html. We will download both the topics and the relevance judgements data. Do not worry about what they are just yet; we will explore them soon.

In [None]:
!wget https://ir.nist.gov/covidSubmit/data/topics-rnd5.xml
!wget https://ir.nist.gov/covidSubmit/data/qrels-covid_d5_j0.5-5.txt

### Topics

The topics file is in XML format. We can parse it and store the result in a dictionary called `topics`. We want to extract a `query`, a `question` and a `narrative` for each topic.

In [2]:
import xml.etree.ElementTree as ET

topics = {}
root = ET.parse("topics-rnd5.xml").getroot()
for topic in root.findall("topic"):
    topic_number = topic.attrib["number"]
    topics[topic_number] = {}
    for query in topic.findall("query"):
        topics[topic_number]["query"] = query.text
    for question in topic.findall("question"):
        topics[topic_number]["question"] = question.text        
    for narrative in topic.findall("narrative"):
        topics[topic_number]["narrative"] = narrative.text        
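
To make the structure that the loop relies on concrete, here is a minimal sketch that parses an inline XML string shaped like the topics file. The sample text is made up for illustration and is not copied from `topics-rnd5.xml`; only the element and attribute names (`topic`, `number`, `query`, `question`, `narrative`) come from the code above. `parse_topics` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mimicking the layout assumed by the parsing loop above.
SAMPLE_XML = """
<topics>
  <topic number="1">
    <query>example query</query>
    <question>What is an example question?</question>
    <narrative>An example narrative describing what is relevant.</narrative>
  </topic>
  <topic number="2">
    <query>another query</query>
    <question>What is another question?</question>
    <narrative>Another narrative.</narrative>
  </topic>
</topics>
"""


def parse_topics(xml_text):
    """Return {topic_number: {"query": ..., "question": ..., "narrative": ...}}."""
    parsed = {}
    root = ET.fromstring(xml_text.strip())
    for topic in root.findall("topic"):
        fields = {}
        for tag in ("query", "question", "narrative"):
            # findtext returns None when the child element is missing.
            text = topic.findtext(tag)
            if text is not None:
                fields[tag] = text.strip()
        parsed[topic.attrib["number"]] = fields
    return parsed


sample_topics = parse_topics(SAMPLE_XML)
print(sample_topics["1"]["query"])
```

Note that the topic numbers are read from an XML attribute, so the dictionary keys are strings (`"1"`, not `1`).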

In [None]:
import json

with open("/Users/tmartins/projects/sw/thigm85.github.io/data/cord19/topics.json", "w") as f:
    f.write(json.dumps(topics))

There are 50 topics in total. For example, we can see the first topic below:

In [None]:
topics["1"]

### Relevance judgements

We can load the relevance judgement data directly into a pandas `DataFrame`.

In [3]:
import pandas as pd

relevance_data = pd.read_csv("qrels-covid_d5_j0.5-5.txt", sep=" ", header=None)
relevance_data.columns = ["topic_id", "round_id", "cord_uid", "relevancy"]

The relevance data contain all the relevance judgements made throughout the 5 rounds of the competition.

In [4]:
relevance_data.head()

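|   | topic_id | round_id | cord_uid | relevancy |
|---|----------|----------|----------|-----------|
| 0 | 1 | 4.5 | 005b2j4b | 2 |
| 1 | 1 | 4.0 | 00fmeepz | 1 |
| 2 | 1 | 0.5 | 010vptx3 | 2 |
| 3 | 1 | 2.5 | 0194oljo | 1 |
| 4 | 1 | 4.0 | 021q9884 | 1 |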


We are going to remove the two rows that have relevancy equal to -1, which I am assuming are errors.

In [5]:
relevance_data[relevance_data.relevancy == -1]

|       | topic_id | round_id | cord_uid | relevancy |
|-------|----------|----------|----------|-----------|
| 55873 | 38 | 5.0 | 9hbib8b3 | -1 |
| 69173 | 50 | 5.0 | ucipq8uk | -1 |


In [6]:
relevance_data = relevance_data[relevance_data.relevancy >= 0]
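
The same loading and filtering can be sketched without pandas, which may help clarify what each column holds. The records below are hypothetical lines in the four-column, space-separated layout assumed by the `read_csv` call above; they are not taken from the real file, and `parse_qrels` is an illustrative helper name.

```python
from collections import Counter

# Hypothetical records: topic_id, round_id, cord_uid, relevancy.
SAMPLE_QRELS = """\
1 4.5 doc_a 2
1 4 doc_b 1
1 0.5 doc_c 0
2 5 doc_d -1
2 5 doc_e 2
"""


def parse_qrels(text):
    """Parse space-separated judgement lines, dropping negative relevancy values."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        topic_id, round_id, cord_uid, relevancy = line.split()
        if int(relevancy) < 0:
            continue  # treated as an error, as in the notebook
        rows.append(
            {
                "topic_id": int(topic_id),
                "round_id": float(round_id),
                "cord_uid": cord_uid,
                "relevancy": int(relevancy),
            }
        )
    return rows


sample_rows = parse_qrels(SAMPLE_QRELS)
# Count judgements per (topic, relevancy grade) pair.
judgement_counts = Counter((r["topic_id"], r["relevancy"]) for r in sample_rows)
print(judgement_counts)
```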

The plot below shows that some topics have a higher number of identified relevant documents than others.

In [7]:
relevance_data.to_csv("/Users/tmartins/projects/sw/thigm85.github.io/data/cord19/relevance_data.csv", index = False)

In [None]:
import plotly.express as px

fig = px.histogram(relevance_data, x="topic_id", color = "relevancy")
fig.show()

In [None]:
import requests
import json

topics = json.loads(
    requests.get("https://thigm85.github.io/data/cord19/topics.json").text
)

In [None]:
topics

In [16]:
from pandas import read_csv

relevance_data = read_csv("https://thigm85.github.io/data/cord19/relevance_data.csv")

In [17]:
relevance_data.head()

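|   | topic_id | round_id | cord_uid | relevancy |
|---|----------|----------|----------|-----------|
| 0 | 1 | 4.5 | 005b2j4b | 2 |
| 1 | 1 | 4.0 | 00fmeepz | 1 |
| 2 | 1 | 0.5 | 010vptx3 | 2 |
| 3 | 1 | 2.5 | 0194oljo | 1 |
| 4 | 1 | 4.0 | 021q9884 | 1 |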


## Create and split labelled data 

### Include all judgments, including 0

In [None]:
labelled_data = [
    {
        "query_id": int(topic_id), 
        "query": topics[topic_id]["query"], 
        "relevant_docs": [
            {
                "id": row["cord_uid"], 
                "score": row["relevancy"]
            } for idx, row in relevance_data[relevance_data.topic_id == int(topic_id)].iterrows() if row["relevancy"] >= 0
        ]
    } for topic_id in topics.keys()]

In [None]:
import json

with open("labelled_data_all.json", "w") as f:
    f.write(json.dumps(labelled_data))

### Format the labelled data into pyvespa friendly format

`pyvespa` expects labelled data to follow the format illustrated below. It is a list of dicts, where each dict represents a query and contains a `query_id`, a `query` and a list of `relevant_docs`. Each relevant document contains a required `id` key and an optional `score` key.
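
As a concrete illustration of that layout, here is a tiny hand-written example together with a sanity check. The ids and query text are made up, and `check_labelled_data` is a hypothetical helper written for this sketch, not a pyvespa function.

```python
# Hypothetical example; ids and query text are made up.
example_labelled_data = [
    {
        "query_id": 1,
        "query": "example query",
        "relevant_docs": [
            {"id": "doc_a", "score": 2},
            {"id": "doc_b"},  # "score" is optional
        ],
    }
]


def check_labelled_data(data):
    """Raise ValueError if an entry does not follow the layout described above."""
    for entry in data:
        missing = {"query_id", "query", "relevant_docs"} - set(entry)
        if missing:
            raise ValueError(f"missing keys: {sorted(missing)}")
        for doc in entry["relevant_docs"]:
            if "id" not in doc:
                raise ValueError("each relevant doc needs an 'id' key")
    return True


print(check_labelled_data(example_labelled_data))
```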

In [None]:
labelled_data = [
    {
        "query_id": int(topic_id), 
        "query": topics[topic_id]["query"], 
        "relevant_docs": [
            {
                "id": row["cord_uid"], 
                "score": row["relevancy"]
            } for idx, row in relevance_data[relevance_data.topic_id == int(topic_id)].iterrows() if row["relevancy"] > 0
        ]
    } for topic_id in topics.keys()]

We can see what this looks like for the first query topic below:

In [None]:
labelled_data[0]

In [None]:
import json

with open("labelled_data.json", "w") as f:
    f.write(json.dumps(labelled_data))

We can see that each query topic has many relevant documents associated with it. We kept only the relevant documents (scores > 0) because non-relevant documents will be collected later, in a way that depends on how we want to use the data to train models that improve the application's relevance.

### Split the labelled data into train and test sets

**TODO**: Consider adding the split data functionality below to pyvespa

In [None]:
import random
import math

random.seed(87345634876)

# Inputs
query_prob = 0.2  # Fraction of queries to move to the test set
relevant_docs_prob = 0.2  # Fraction of relevant docs to move to the test set

# First, move some query topics to the test set.
number_queries = len(labelled_data)

# Draw the sample once; calling random.sample inside the comprehension
# would draw a fresh sample for every candidate index.
test_query_idx = set(
    random.sample(
        population=range(number_queries),
        k=math.floor(number_queries * query_prob),
    )
)
test_unobserved = [labelled_data[i] for i in range(number_queries) if i in test_query_idx]
train_set = [labelled_data[i] for i in range(number_queries) if i not in test_query_idx]

test_partially_observed = []
for data in train_set:
    number_relevant_docs = len(data["relevant_docs"])
    # Draw the sample once per query rather than once per candidate index.
    test_relevant_docs_idx = set(
        random.sample(
            population=range(number_relevant_docs),
            k=math.floor(number_relevant_docs * relevant_docs_prob),
        )
    )
    test_data = {k:data[k] for k in data.keys() if k != "relevant_docs"}
    test_data["relevant_docs"] = [
        data["relevant_docs"][i] for i in range(number_relevant_docs) 
        if i in test_relevant_docs_idx
    ]
    test_partially_observed.append(test_data)
    data["relevant_docs"] = [
        data["relevant_docs"][i] for i in range(number_relevant_docs) 
        if i not in test_relevant_docs_idx
    ]

test_sets = {
    "partially_observed": test_partially_observed,
    "unobserved": test_unobserved
}
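
The split logic above could be wrapped in a reusable function, along the lines of the TODO. The following is a sketch on toy data; `split_labelled_data` and its parameters are hypothetical names, not part of pyvespa. Unlike the cell above, it returns new lists rather than modifying the input in place.

```python
import math
import random


def split_labelled_data(labelled_data, query_prob=0.2, relevant_docs_prob=0.2, seed=None):
    """Split labelled data into a train set and two test sets.

    Hypothetical helper (not a pyvespa function).
    """
    rng = random.Random(seed)
    n_queries = len(labelled_data)
    # Queries held out entirely: they never appear in the train set.
    test_query_idx = set(
        rng.sample(range(n_queries), math.floor(n_queries * query_prob))
    )

    train_set, partially_observed, unobserved = [], [], []
    for i, entry in enumerate(labelled_data):
        if i in test_query_idx:
            unobserved.append(entry)
            continue
        docs = entry["relevant_docs"]
        # Relevant docs held out from queries that remain in the train set.
        test_doc_idx = set(
            rng.sample(range(len(docs)), math.floor(len(docs) * relevant_docs_prob))
        )
        base = {k: v for k, v in entry.items() if k != "relevant_docs"}
        train_set.append(
            {**base, "relevant_docs": [d for j, d in enumerate(docs) if j not in test_doc_idx]}
        )
        partially_observed.append(
            {**base, "relevant_docs": [d for j, d in enumerate(docs) if j in test_doc_idx]}
        )
    return train_set, {"partially_observed": partially_observed, "unobserved": unobserved}


# Toy data: 10 queries with 5 relevant docs each (made-up ids).
toy = [
    {
        "query_id": q,
        "query": f"query {q}",
        "relevant_docs": [{"id": f"d{q}_{j}", "score": 1} for j in range(5)],
    }
    for q in range(10)
]
toy_train, toy_test = split_labelled_data(toy, seed=0)
print(len(toy_train), len(toy_test["unobserved"]), len(toy_test["partially_observed"]))
```

The "unobserved" test set measures how a model generalizes to queries it has never seen, while the "partially observed" set measures how well it recovers held-out relevant documents for queries that were part of training.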

## Evaluate existing query models

### Define query models that we want to evaluate

In [None]:
from vespa.query import Query, RankProfile, OR

query_models = {
    "or_bm25": Query(
        match_phase = OR(),
        rank_profile = RankProfile(name="bm25")
    ),
    "or_bm25t5": Query(
        match_phase = OR(),
        rank_profile = RankProfile(name="bm25t5")
    ),
    "or_bm25t5-gbdt-1000": Query(
        match_phase = OR(),
        rank_profile = RankProfile(name="bm25t5-gbdt-1000")
    )
}
        

In [None]:
from vespa.evaluation import MatchRatio, Recall, ReciprocalRank, NormalizedDiscountedCumulativeGain

eval_metrics = [MatchRatio(), Recall(at=10), ReciprocalRank(at=10), NormalizedDiscountedCumulativeGain(at=10)]
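
For intuition about what the ranking metrics measure, here is a plain-Python sketch of recall, reciprocal rank and NDCG at a cutoff `k`. These are textbook formulations written for illustration and may differ in detail from how pyvespa computes the corresponding metrics (for example, in how the ideal ranking for NDCG is chosen). The function names and toy ids are hypothetical, and the match ratio is not covered here.

```python
import math


def recall_at(k, ranked_ids, relevant):
    """Fraction of the relevant docs that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)


def reciprocal_rank_at(k, ranked_ids, relevant):
    """1 / position of the first relevant doc in the top-k, or 0 if there is none."""
    for position, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def dcg(gains):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))


def ndcg_at(k, ranked_ids, relevant):
    """DCG of the top-k divided by the DCG of the best ordering of the judged docs."""
    gains = [relevant.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal_gains = sorted(relevant.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal_gains)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy judgements (doc id -> relevancy score) and a toy result list.
toy_relevant = {"doc_a": 2, "doc_b": 1}
toy_ranked = ["doc_x", "doc_a", "doc_y", "doc_b"]

print(recall_at(10, toy_ranked, toy_relevant))
print(reciprocal_rank_at(10, toy_ranked, toy_relevant))
print(ndcg_at(10, toy_ranked, toy_relevant))
```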

In [None]:
from vespa.application import Vespa

app = Vespa(url = "https://api.cord19.vespa.ai")

In [None]:
evaluations = {}
for test_set in test_sets:
    evaluations[test_set] = {}
    for query_model in query_models:
        evaluations[test_set][query_model] = app.evaluate(
            labelled_data = test_sets[test_set],
            eval_metrics = eval_metrics,
            query_model = query_models[query_model],
            id_field = "cord_uid",
            hits = 10
        )

In [None]:
import pandas as pd

metric_values = []
for test_set in test_sets:
    for query_model in query_models:
        for metric in eval_metrics:
            metric_values.append(
                pd.DataFrame(
                    data={
                        "test_set": test_set, 
                        "query_model": query_model, 
                        "metric": metric.name, 
                        "value": evaluations[test_set][query_model][metric.name + "_value"].to_list()
                    }
                )
            )
metric_values = pd.concat(metric_values, ignore_index=True)

In [None]:
metric_values.head()

In [None]:
metric_values.metric.unique()

In [None]:
import plotly.express as px


fig = px.box(metric_values[metric_values.metric == "ndcg_10"], x="query_model", y="value", title="NDCG @ 10")
fig.show()

In [None]:
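# Select the numeric column explicitly; recent pandas versions raise an error
# when median() is applied to non-numeric columns such as test_set.
metric_values.groupby(['query_model', 'metric'])[["value"]].median()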