# PyTerrier Starter: Full-Rank with ChatNoir for Task 2 in Touché 2023

This is the [PyTerrier](https://github.com/terrier-org/pyterrier) baseline for [task 2 on Evidence Retrieval for Causal Questions](https://touche.webis.de/clef23/touche23-web/evidence-retrieval-for-causal-questions.html) in [Touché 2023](https://touche.webis.de/clef23/touche23-web/).

This notebook implements "full-rank" retrieval with ChatNoir so that the results retrieved by ChatNoir can be easily re-ranked.

### Adapt the notebook locally

You can adapt and run this baseline locally with Docker, and you can deploy and run it directly in [TIRA.io](https://www.tira.io/task/touche-2023-task-2).

With Docker installed, you can start this notebook with the command:

```
docker run --rm -ti \
    -p 8888:8888 \
    -v ${PWD}:/workspace \
    webis/tira-touche23-task-2-pyterrier-baseline:0.0.1 \
    jupyter-lab --allow-root --ip 0.0.0.0
```

### Deployment in TIRA

To deploy an approach in TIRA, you upload the image and specify the command to execute inside it. TIRA provides personalized documentation on how to upload the image. To run this notebook in TIRA, specify the following command:

```
/workspace/run-notebook.py --notebook /workspace/full-rank-pipeline.ipynb --input $inputDataset --output $outputDir
```

You can dry-run this on your machine by executing the command:

```
./run-notebook.py \
    --input ${PWD}/sample-input/full-rank \
    --output ${PWD}/sample-output \
    --notebook /workspace/full-rank-pipeline.ipynb \
    --enable-network-in-dry-run True --local-dry-run True
```



### Additional Resources

- The [PyTerrier tutorial](https://github.com/terrier-org/ecir2021tutorial)
- The [PyTerrier documentation](https://pyterrier.readthedocs.io/en/latest/)
- The [TIRA quickstart](https://touche.webis.de/clef23/touche23-web/evidence-retrieval-for-causal-questions.html#tira-quickstart)



### Step 1: Import everything and load variables

In [2]:
from tira_utils import get_preconfigured_chatnoir_client, get_input_directory_and_output_directory, normalize_run
import pyterrier as pt
import pandas as pd
import os
import json
from tqdm import tqdm
from pathlib import Path

SYSTEM_NAME = os.environ.get('TIRA_SYSTEM_NAME', 'my-retrieval-system')

if not pt.started():
    # tira_utils (imported above) should already have started PyTerrier with this configuration, so that no internet connection is required (for reproducibility)
    pt.init(version=os.environ['PYTERRIER_VERSION'], helper_version=os.environ['PYTERRIER_HELPER_VERSION'], no_download=True)

input_directory, output_directory = get_input_directory_and_output_directory(default_input='/workspace/sample-input/full-rank')


I will use a small hardcoded example located in /workspace/sample-input/full-rank.
The output directory is /tmp/


### Step 2: Load the Data

In [3]:
print('Step 2: Load the data.')

queries = pd.read_json(input_directory + '/queries.jsonl', lines=True)


Step 2: Load the data.
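
To make the loading step concrete, here is a minimal sketch of how `pd.read_json(..., lines=True)` turns a JSON Lines file into a DataFrame with one row per topic. The two sample topics and the name `sample_queries` are made up for illustration; the field names `qid` and `query` are taken from the run output shown in this notebook, and the real `queries.jsonl` may contain further fields.

```python
import io

import pandas as pd

# Hypothetical sample in JSON Lines form: one JSON object per line.
sample_jsonl = io.StringIO(
    '{"qid": 1111, "query": "does computer work increase eye pressure?"}\n'
    '{"qid": 2222, "query": "can stress cause hair loss?"}\n'
)

# lines=True tells pandas to parse each line as a separate record.
sample_queries = pd.read_json(sample_jsonl, lines=True)

print(sample_queries.columns.tolist())
print(len(sample_queries))
```

Each JSON key becomes a column, so the retrieval pipeline can read the query text from the `query` column.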


### Step 3: Define the Pipeline

In [4]:
print('Step 3: Define the retrieval pipeline.')

retrieval_pipeline = get_preconfigured_chatnoir_client(
    config_directory = input_directory,
    features = ['TARGET_URI', 'TITLE_TEXT', 'HTML_PLAIN', 'HTML_PLAIN'],
    verbose = True
)

Step 3: Define the retrieval pipeline.
ChatNoir Client will retrieve from index ClueWeb12


### Step 4: Create Run

In [5]:
print('Step 4: Create Run.')
run = retrieval_pipeline(queries)

Step 4: Create Run.


Searching with ChatNoir: 100%|██████████| 3/3 [00:12<00:00,  4.31s/query]
ChatNoir API internal server error. Retrying in 1 seconds.
Searching with ChatNoir: 100%|██████████| 3/3 [00:37<00:00, 12.52s/query]


In [6]:
run.head(3)

| | qid | query | docno | score | target_uri | title_text | html_plain | rank |
|---|---|---|---|---|---|---|---|---|
| 0 | 1111 | does computer work increase eye pressure? | clueweb12-1106wb-16-17437 | 1182.2903 | http://www.pingueculae.com/eye-strain-informat... | Eyes hurt looking computer screen | `<!doctype html>\n<meta charset="utf-8">\n<titl...` | 0 |
| 1 | 1111 | does computer work increase eye pressure? | clueweb12-0302wb-19-28258 | 1168.2012 | https://www.vsp.com/cms/edc/topics/how-the-eye... | How the Eye Works | `<!doctype html>\n<meta charset="utf-8">\n<titl...` | 1 |
| 2 | 1111 | does computer work increase eye pressure? | clueweb12-1212wb-00-02238 | 1167.9583 | https://vsp.com/cms/edc/topics/how-the-eye-wor... | How the Eye Works | `<!doctype html>\n<meta charset="utf-8">\n<titl...` | 2 |
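
Because the run is a plain DataFrame, a re-ranker only needs to replace the `score` column and recompute `rank` per query. The following is a minimal sketch under stated assumptions: `rerank`, `title_mentions_computer`, and the toy data are hypothetical and not part of `tira_utils` or PyTerrier, and the scorer is a deliberately naive placeholder for a real model.

```python
import pandas as pd

# Toy stand-in for the first-stage run; the real one comes from ChatNoir.
toy_run = pd.DataFrame({
    "qid": [1111, 1111, 1111],
    "docno": ["doc-a", "doc-b", "doc-c"],
    "score": [1182.29, 1168.20, 1167.96],
    "title_text": [
        "How the Eye Works",
        "How the Eye Works",
        "Eyes hurt looking computer screen",
    ],
    "rank": [0, 1, 2],
})


def rerank(run, scorer):
    """Re-score every row with `scorer` and recompute 0-based ranks per query."""
    reranked = run.copy()
    reranked["score"] = reranked.apply(scorer, axis=1)
    # Higher score first; ties keep their first-stage order via the old rank.
    reranked = reranked.sort_values(
        ["qid", "score", "rank"], ascending=[True, False, True]
    )
    reranked["rank"] = reranked.groupby("qid").cumcount()
    return reranked.reset_index(drop=True)


def title_mentions_computer(row):
    """Hypothetical scorer: prefer documents whose title mentions 'computer'."""
    return 1.0 if "computer" in row["title_text"].lower() else 0.0


reranked = rerank(toy_run, title_mentions_computer)
print(reranked[["qid", "docno", "score", "rank"]])
```

In this toy example, `doc-c` moves to rank 0 because it is the only title that mentions "computer"; the other two documents keep their relative first-stage order.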


### Step 5: Stance Detection

In [7]:
print('Step 5: Run stance detection.')

def detect_stance(query_document_pair):
    # As a baseline, we always return the neutral stance
    return 'NEU'

# Apply the stance detector to every row (query-document pair) of the run
run['Q0'] = run.apply(detect_stance, axis=1)


Step 5: Run stance detection.
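
The baseline labels every document as neutral. As an illustration of where a real stance detector would plug in, here is a rough sketch of a cue-phrase heuristic. Everything in it is an assumption: the function name `detect_stance_by_cues`, the cue lists, and the label names `SUP` and `REF` (only `NEU` appears in this notebook, so check the task page for the labels the evaluation expects). A competitive system would use a trained classifier instead.

```python
import re

# Hypothetical cue phrases, chosen only for illustration.
SUPPORT_CUES = ("causes", "leads to", "increases", "results in")
REFUTE_CUES = ("does not cause", "no evidence", "no link", "unrelated")


def detect_stance_by_cues(text):
    """Return 'SUP', 'REF', or 'NEU' based on simple cue-phrase counts."""
    # Lowercase and collapse whitespace so multi-word cues match across line breaks.
    lowered = re.sub(r"\s+", " ", text.lower())
    support = sum(lowered.count(cue) for cue in SUPPORT_CUES)
    refute = sum(lowered.count(cue) for cue in REFUTE_CUES)
    if support > refute:
        return "SUP"
    if refute > support:
        return "REF"
    return "NEU"


print(detect_stance_by_cues("Screen time increases eye strain."))
print(detect_stance_by_cues("There is no evidence that screens damage eyes."))
print(detect_stance_by_cues("The eye has a lens."))
```

Such a function could be applied to the document text of each row, for example to the `html_plain` column of the run, in place of the constant baseline.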


### Step 6: Persist Run

In [9]:
print('Step 6: Persist Run.')

Path(output_directory).mkdir(parents=True, exist_ok=True)
normalize_run(run, SYSTEM_NAME).to_csv(output_directory + '/run.txt', sep=' ', header=False, index=False)

print('Done...')

Step 6: Persist Run.
Done...
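
To show what a space-separated run file written with `to_csv(..., sep=' ', header=False, index=False)` looks like, here is a small sketch on toy data. The column order (query ID, stance in the `Q0` slot, document ID, rank, score, system name) follows the usual TREC run layout and is an assumption: `normalize_run` is not shown in this notebook, so this stand-in is not its implementation.

```python
import io

import pandas as pd

# Toy run with the stance label stored in the Q0 column, as in Step 5.
toy_run = pd.DataFrame({
    "qid": [1111, 1111],
    "Q0": ["NEU", "NEU"],
    "docno": ["clueweb12-1106wb-16-17437", "clueweb12-0302wb-19-28258"],
    "rank": [0, 1],
    "score": [1182.2903, 1168.2012],
})
toy_run["system"] = "my-retrieval-system"

# Assumed TREC-style column order; write to an in-memory buffer instead of a file.
buffer = io.StringIO()
toy_run[["qid", "Q0", "docno", "rank", "score", "system"]].to_csv(
    buffer, sep=" ", header=False, index=False
)

run_lines = buffer.getvalue().splitlines()
print(run_lines[0])
```

Each line then holds six whitespace-separated fields, which makes the file easy to inspect by hand before submitting it.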
