## 2WikiMultihopQA and Fusion-in-Decoder

In [1]:
import pyterrier as pt
import pyterrier_rag

## Fusion in Decoder (FiD)

FiD is a reader model that takes a question and a set of passages, encodes each passage separately together with the question (e.g. using a T5 encoder), and then concatenates the resulting representations so that the decoder can attend over all of them when generating the answer.
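To make the "encode separately, then concatenate" idea concrete, here is a minimal sketch in plain Python. It is not the `pyterrier_rag` implementation: `fid_encoder_inputs`, `toy_encode` and `fuse` are hypothetical helpers, the `question: ... title: ... context: ...` template follows the format commonly used with FiD and is only assumed to apply to this checkpoint, and the toy encoder merely stands in for the T5 encoder.

```python
from typing import Dict, List


def fid_encoder_inputs(question: str, passages: List[Dict[str, str]]) -> List[str]:
    """Build one encoder input per passage, each paired with the same question.

    The prompt template is an assumption, not taken from this notebook.
    """
    return [
        f"question: {question} title: {p['title']} context: {p['text']}"
        for p in passages
    ]


def toy_encode(text: str) -> List[List[float]]:
    """Stand-in for the T5 encoder: one small vector per whitespace token."""
    return [[float(len(tok)), float(i)] for i, tok in enumerate(text.split())]


def fuse(encoded: List[List[List[float]]]) -> List[List[float]]:
    """Concatenate the per-passage token representations along the sequence axis."""
    fused: List[List[float]] = []
    for passage_repr in encoded:
        fused.extend(passage_repr)
    return fused


# Placeholder data, purely for illustration.
question = "Who wrote the song?"
passages = [
    {"title": "Song A", "text": "Song A was written by X."},
    {"title": "Song B", "text": "Song B was written by Y."},
]

inputs = fid_encoder_inputs(question, passages)
encoded = [toy_encode(s) for s in inputs]  # each passage is encoded independently
fused = fuse(encoded)                      # a decoder would cross-attend over this
```

Because each passage is encoded on its own, encoder cost grows linearly with the number of passages, while the decoder still sees evidence from all of them in a single fused sequence.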

Here we use the [`terrierteam/t5fid_base_2wiki`](https://huggingface.co/terrierteam/t5fid_base_2wiki) checkpoint from the Hugging Face model hub, which we have fine-tuned for this dataset.

In [2]:
import pyterrier_rag.readers
fid = pyterrier_rag.readers.T5FiD("terrierteam/t5fid_base_2wiki")

## 2WikiMultihopQA Dataset

We run our experiments on 2WikiMultihopQA. This dataset already provides passages for each question, so `dataset.get_topics()` returns one row per passage, including its title and text.

In [3]:
dataset = pt.get_dataset('rag:2wikimultihopqa')
dev_answers = dataset.get_answers('dev')
dev_docs = dataset.get_topics('dev')
dev_docs.head(2)

Reading 2WikiMultihopQA dev.json: 100%|██████████| 12576/12576 [00:05<00:00, 2265.92it/s]


| | qid | query | docno | title | text |
|---|---|---|---|---|---|
| 0 | 0008d48808a011ebbd78ac1f6bf848b6 | Did Frederick Mulder and Earl Mindell have the... | 0008d48808a011ebbd78ac1f6bf848b6_00 | Mulder and Scully | Mulder and Scully may refer to: |
| 1 | 0008d48808a011ebbd78ac1f6bf848b6 | Did Frederick Mulder and Earl Mindell have the... | 0008d48808a011ebbd78ac1f6bf848b6_01 | Mulder and Scully (song) | " Mulder and Scully" is a song by Catatonia, r... |


In [4]:
print("Average nbr of passages per query:", dev_docs.groupby('qid').count().mean()['text'])
print("Average passage length (chars):", dev_docs['text'].str.len().mean())

Average nbr of passages per query: 10.0
Average passage length (chars): 355.1587070610687


We can try out the FiD model on one of the dev queries, for instance qid e2a3bf2a0bdd11eba7f7acde48001122, simply by passing the dataframe rows for that query into the `fid` transformer.

In [5]:
fid(dev_docs.query('qid == "e2a3bf2a0bdd11eba7f7acde48001122"'))

| | qid | query | qanswer |
|---|---|---|---|
| 0 | e2a3bf2a0bdd11eba7f7acde48001122 | When did John V, Prince Of Anhalt-Zerbst's fat... | 12 June 1516 |


Let's see how we did: the generated answer matches the gold answer exactly.

In [6]:
dev_answers.query('qid == "e2a3bf2a0bdd11eba7f7acde48001122"')

| | qid | type | gold_answer |
|---|---|---|---|
| 2 | e2a3bf2a0bdd11eba7f7acde48001122 | compositional | 12 June 1516 |


## pt.Experiment

Finally, let's evaluate FiD in terms of F1 and EM%. We provide `pt.Experiment` with:
1. The system(s) to evaluate
2. The input to FiD, i.e. the questions and passages dataframe
3. The gold answers dataframe
4. The measures we'd like to calculate
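As a rough illustration of what answer-level EM and F1 capture, the sketch below uses the widely used SQuAD-style definitions. The functions `normalize`, `exact_match` and `f1_score` are hypothetical helpers, and it is an assumption that `pyterrier_rag.measures.EM` and `pyterrier_rag.measures.F1` normalise and score answers in the same way.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalised strings are identical, otherwise 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall after normalisation."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


em_full = exact_match("12 June 1516", "12 June 1516")
em_partial = exact_match("June 1516", "12 June 1516")
f1_partial = f1_score("June 1516", "12 June 1516")
```

Under these definitions, a partially correct answer such as "June 1516" scores zero on EM but still earns partial credit on F1, which is why F1 is typically at least as high as EM when averaged over a set of questions.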

First, though, let's cut the dataset down a little to speed up the experiment, keeping only the first 100 questions with gold answers.

In [7]:
dev_answers = dev_answers.head(100)
dev_docs = dev_docs.merge(dev_answers[['qid']])

In [8]:
pt.Experiment(
    [fid],
    dev_docs,
    dev_answers,
    [pyterrier_rag.measures.F1, pyterrier_rag.measures.EM], 
    names=['fid'], verbose=True
)

pt.Experiment: 100%|██████████| 1/1 [00:11<00:00, 11.00s/system]


| | name | F1 | EM% |
|---|---|---|---|
| 0 | fid | 0.746075 | 0.66 |
