# PyTerrier Starter: Re-Rank with PyTerrier for Task 2 in Touché 2023

This is the [PyTerrier](https://github.com/terrier-org/pyterrier) baseline for [task 2 on Evidence Retrieval for Causal Questions](https://touche.webis.de/clef23/touche23-web/evidence-retrieval-for-causal-questions.html) in [Touché 2023](https://touche.webis.de/clef23/touche23-web/).

This notebook implements a simple BM25 re-ranker.

### Adapt the notebook locally

You can adapt and run this baseline locally with Docker, and you can deploy and run it directly in [TIRA.io](https://www.tira.io/task/touche-2023-task-2).

With Docker installed, you can start this notebook with the command:

```
docker run --rm -ti \
    -p 8888:8888 \
    -v ${PWD}:/workspace \
    webis/tira-touche23-task-2-pyterrier-baseline:0.0.1 \
    jupyter-lab --allow-root --ip 0.0.0.0
```

### Deployment in TIRA

To deploy an approach in TIRA, you upload the Docker image and specify the command to execute inside it. TIRA provides personalized documentation on how to upload the image. To run this notebook in TIRA, specify the following command:

```
/workspace/run-notebook.py --notebook /workspace/re-rank-pipeline.ipynb --input $inputDataset --output $outputDir
```

You can dry-run this on your machine by executing the command:

```
./run-notebook.py \
    --input ${PWD}/sample-input/re-rank-default-text \
    --output ${PWD}/sample-output \
    --notebook /workspace/re-rank-pipeline.ipynb \
    --local-dry-run True
```



### Additional Resources

- The [PyTerrier tutorial](https://github.com/terrier-org/ecir2021tutorial)
- The [PyTerrier documentation](https://pyterrier.readthedocs.io/en/latest/)
- The [TIRA quickstart](https://touche.webis.de/clef23/touche23-web/evidence-retrieval-for-causal-questions.html#tira-quickstart)



### Step 1: Import everything and load variables

In [1]:
from tira_utils import get_preconfigured_chatnoir_client, get_input_directory_and_output_directory, normalize_run
import pyterrier as pt
import pandas as pd
import os
import json
from tqdm import tqdm
from pathlib import Path

SYSTEM_NAME = os.environ.get('TIRA_SYSTEM_NAME', 'my-retrieval-system')

if not pt.started():
    # tira_utils (imported above) should already have started PyTerrier with this configuration, so that no internet connection is required (for reproducibility)
    pt.init(version=os.environ['PYTERRIER_VERSION'], helper_version=os.environ['PYTERRIER_HELPER_VERSION'], no_download=True)

input_directory, output_directory = get_input_directory_and_output_directory(default_input='/workspace/sample-input/re-rank-default-text')


Start PyTerrier with version=5.7, helper_version=0.0.7, no_download=True


PyTerrier 0.9.1 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


I will use a small hardcoded example located in /workspace/sample-input/re-rank-default-text.
The output directory is /tmp/


### Step 2: Load the data

In [11]:
print(f'Read input data from {input_directory}.')
df = pd.read_json(input_directory + '/rerank.jsonl', lines=True)
df['query'] = df['query'].apply(lambda i: "".join([x if x.isalnum() else " " for x in i]))
df['qid'] = df['qid'].astype('str')
df['text'] = df['text'].apply(lambda i: i.lower())
print('Done...')

df.head(3)

Read input data from /workspace/sample-input/re-rank-default-text.
Done...


| | qid | query | docno | text |
|---|---|---|---|---|
| 0 | 1111 | does computer work increase eye pressure | clueweb12-1106wb-16-17437 | eyes hurt looking computer screen\n\n\n\neyes ... |
| 1 | 1111 | does computer work increase eye pressure | clueweb12-0302wb-19-28258 | how the eye works\n\n\n\nhow the eye works\n\n... |
| 2 | 1111 | does computer work increase eye pressure | clueweb12-1212wb-00-02238 | how the eye works\n\n\n\nhow the eye works\n\n... |
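
The query cleaning in the cell above replaces every non-alphanumeric character with a space, presumably so that punctuation is not interpreted by Terrier's query parser. The sketch below isolates that one step in plain Python so it can be tried without the notebook. The function name `sanitize_query` and the first example query are hypothetical and only serve as illustration.

```python
def sanitize_query(query: str) -> str:
    """Replace every non-alphanumeric character with a single space."""
    return "".join(ch if ch.isalnum() else " " for ch in query)


# Hypothetical query with punctuation, plus the query from the sample input
examples = [
    "What's the cause of eye-strain?",
    "does computer work increase eye pressure",
]

for query in examples:
    print(repr(sanitize_query(query)))
# 'What s the cause of eye strain '
# 'does computer work increase eye pressure'
```

Note that the replacement keeps the string length unchanged, so a trailing question mark becomes a trailing space rather than disappearing.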


### Step 3: Define the actual retrieval approach

In [6]:
bm25_scorer = pt.text.scorer(body_attr="text", wmodel='BM25', verbose=True)
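
To give an intuition for what a BM25 text scorer computes, here is a minimal pure-Python sketch of the textbook Okapi BM25 formula over a handful of documents. It is not Terrier's implementation: the whitespace tokenizer, the parameters `k1=1.2` and `b=0.75`, the smoothed IDF, and the three toy documents are all assumptions made for illustration. Terrier applies its own tokenization, stemming, and IDF variant, so its scores will differ from the ones computed here.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with textbook BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: the number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF that stays positive even for very frequent terms
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturation with document length normalization
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy documents (hypothetical, not taken from the task data)
docs = [
    "eyes hurt looking computer screen",
    "how the eye works",
    "computer work and eye pressure",
]
scores = bm25_scores("computer eye pressure", docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

The third document matches all three query terms and therefore receives the highest score. The first two each match one equally rare term, so the shorter one scores higher because of length normalization.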


### Step 4: Run the pipeline

In [8]:
run = bm25_scorer(df)
run.head(3)

31documents [00:00, 35.13documents/s]
BR(BM25): 100%|██████████| 3/3 [00:00<00:00, 83.52q/s]


Unnamed: 0,qid,docno,text,rank,score,query
0,1111,clueweb12-1106wb-16-17437,eyes hurt looking computer screen\n\n\n\neyes ...,0,-1.692322,does computer work increase eye pressure
1,1111,clueweb12-0302wb-19-28258,how the eye works\n\n\n\nhow the eye works\n\n...,1,-2.76882,does computer work increase eye pressure
2,1111,clueweb12-1212wb-00-02238,how the eye works\n\n\n\nhow the eye works\n\n...,2,-2.76882,does computer work increase eye pressure
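
The `rank` column in the output above starts at 0 and follows descending score within each query. The following sketch reproduces that ordering in plain Python. It is only an illustration: the helper name `assign_ranks` is hypothetical, and the tie-breaking rule (tied documents keep their input order) is an assumption, since the output above does not reveal how PyTerrier orders documents with equal scores.

```python
from collections import defaultdict


def assign_ranks(rows):
    """Sort rows per query by descending score and add 0-based ranks."""
    by_query = defaultdict(list)
    for row in rows:
        by_query[row["qid"]].append(row)

    ranked = []
    for group in by_query.values():
        # sorted() is stable, so documents with tied scores keep input order
        ordered = sorted(group, key=lambda r: -r["score"])
        for rank, row in enumerate(ordered):
            ranked.append({**row, "rank": rank})
    return ranked


# Scores copied from the output above, deliberately given out of order
rows = [
    {"qid": "1111", "docno": "clueweb12-0302wb-19-28258", "score": -2.76882},
    {"qid": "1111", "docno": "clueweb12-1106wb-16-17437", "score": -1.692322},
    {"qid": "1111", "docno": "clueweb12-1212wb-00-02238", "score": -2.76882},
]
ranked = assign_ranks(rows)
for row in ranked:
    print(row["rank"], row["docno"], row["score"])
```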


### Step 5: Stance Detection

In [9]:
print('Step 5: Define stance detection')

def detect_stance(query_document_pair):
    # As a baseline, we always return neutral
    return 'NEU'

# Store the stance label in the Q0 column of the run
run['Q0'] = run.apply(detect_stance, axis=1)


Step 5: Define stance detection


### Step 6: Persist results

In [12]:
print('Step 6: Persist Run.')

Path(output_directory).mkdir(parents=True, exist_ok=True)
normalize_run(run, SYSTEM_NAME).to_csv(output_directory + '/run.txt', sep=' ', header=False, index=False)

print('Done...')

Step 6: Persist Run.
Done...
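
The run is written as space-separated text without a header. The exact column order produced by `normalize_run` is not shown in this notebook, so the sketch below assumes the common six-column TREC-style layout (query id, `Q0` field holding the stance label, document id, rank, score, system name). The helpers `format_run_line` and `parse_run_line` are hypothetical and can be used to sanity-check a written `run.txt` under that assumption.

```python
def format_run_line(qid, stance, docno, rank, score, system):
    """Build one space-separated run line (assumed six-column layout)."""
    return f"{qid} {stance} {docno} {rank} {score} {system}"


def parse_run_line(line):
    """Split one run line back into typed fields (assumed six-column layout)."""
    qid, stance, docno, rank, score, system = line.split()
    return {
        "qid": qid,
        "stance": stance,
        "docno": docno,
        "rank": int(rank),
        "score": float(score),
        "system": system,
    }


# Values taken from the first row of the Step 4 output and the defaults above
line = format_run_line(
    "1111", "NEU", "clueweb12-1106wb-16-17437", 0, -1.692322, "my-retrieval-system"
)
print(line)
# 1111 NEU clueweb12-1106wb-16-17437 0 -1.692322 my-retrieval-system
print(parse_run_line(line))
```

Because `line.split()` expects exactly six fields, the parser raises a `ValueError` on malformed lines, which makes it a cheap check before submitting a run.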
