Install PyTerrier using pip (`pip install python-terrier`).

### Step 1: Initialize PyTerrier

First, you need to initialize PyTerrier. This is typically done once per notebook or script.

```python
import pyterrier as pt
if not pt.started():
    pt.init()
```

### Step 2: Dataset Loading

Load your dataset. You can use one of the datasets available in PyTerrier, or load your own dataset.

```python
# Example of loading a dataset from PyTerrier
dataset = pt.get_dataset('irds:your-dataset-name')
# For custom datasets, you'd typically load them into a dataframe or similar structure
```
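
For a custom collection, one option is a generator that yields one dictionary per document, using the `docno`/`text` record layout that PyTerrier corpus iterators produce. The sketch below assumes a two-column, tab-separated file (document id, then text); the file layout and the helper name `iter_tsv_corpus` are illustrative assumptions, not part of PyTerrier.

```python
import csv
from typing import Dict, Iterator

def iter_tsv_corpus(path: str) -> Iterator[Dict[str, str]]:
    """Yield {'docno': ..., 'text': ...} records from a two-column TSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            yield {"docno": row[0], "text": row[1]}
```

A generator like this could be passed to an indexer in place of `dataset.get_corpus_iter()`.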

### Step 3: Preprocessing and Query Rewriting

Implement the NLP-based query understanding and rewriting. You might use existing NLP libraries such as spaCy or Hugging Face's Transformers for this purpose.

```python
from your_query_rewriting_module import rewrite_query

# This is a placeholder function. You will implement your query understanding and rewriting logic here.
def preprocess_query(query):
    # Use NLP techniques to understand and rewrite the query
    rewritten_query = rewrite_query(query)
    return rewritten_query
```
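
Before wiring in spaCy or Transformers, `rewrite_query` could start as a simple rule-based normalizer. The sketch below is one possible stand-in, not the module referenced above: the short stopword list and the function body are assumptions chosen for illustration.

```python
import string

# Illustrative stopword list (an assumption); a real system would use a fuller list
STOPWORDS = {"a", "an", "the", "is", "of", "who", "what", "where", "why", "do", "in"}

def rewrite_query(query: str) -> str:
    """Lowercase, strip punctuation, and drop stopwords from a query string."""
    cleaned = query.lower().translate(str.maketrans("", "", string.punctuation))
    kept = [tok for tok in cleaned.split() if tok not in STOPWORDS]
    # Fall back to the cleaned query if every token was a stopword
    return " ".join(kept) if kept else cleaned

print(rewrite_query("Who is Aziz Hashim?"))
```

The fallback keeps the query non-empty, since an empty query string would retrieve nothing.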

### Step 4: Developing the IR System

Develop your prototype IR system. This involves setting up the indexing and retrieval process.

```python
# Indexing
# IterDictIndexer consumes an iterator of dicts, which is what get_corpus_iter() yields
indexer = pt.IterDictIndexer("./index_path", overwrite=True)
index_ref = indexer.index(dataset.get_corpus_iter())

# Retrieval
# You can use one of PyTerrier's built-in retrieval models or develop your own.
retriever = pt.BatchRetrieve(index_ref, wmodel="BM25")
```

### Step 5: Experimentation

Run experiments using your IR system. This includes query processing, retrieval, and evaluation against your benchmarks.

```python
from pyterrier.measures import RR, nDCG

# Example query processing and retrieval
def run_query_and_evaluate(query):
    rewritten_query = preprocess_query(query)
    # search() accepts a single query string; transform() expects a dataframe of queries
    result = retriever.search(rewritten_query)
    # Evaluation - assuming `qrels` holds relevance judgments whose qid matches the qid in `result`
    eval_result = pt.Utils.evaluate(result, qrels, metrics=[RR, nDCG])
    return eval_result

# Example usage
query = "Your test query"
evaluation_metrics = run_query_and_evaluate(query)
print(evaluation_metrics)
```

### Step 6: Analysis

Analyze the results to evaluate the effectiveness of your advanced query understanding and rewriting techniques.
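
One simple form this analysis could take is a per-query comparison between a baseline run and a run with rewritten queries. The sketch below assumes each run has already been reduced to a `qid -> score` mapping for a single metric; the helper name `compare_runs` and the scores are placeholders, not results from this project.

```python
from typing import Dict, Tuple

def compare_runs(baseline: Dict[str, float],
                 rewritten: Dict[str, float]) -> Tuple[float, int, int]:
    """Return (mean delta, wins, losses) over the queries both runs share."""
    shared = sorted(set(baseline) & set(rewritten))
    if not shared:
        raise ValueError("the two runs share no queries")
    deltas = [rewritten[q] - baseline[q] for q in shared]
    wins = sum(1 for d in deltas if d > 0)
    losses = sum(1 for d in deltas if d < 0)
    return sum(deltas) / len(deltas), wins, losses

# Placeholder per-query scores (hypothetical)
baseline_scores = {"q1": 0.50, "q2": 0.25, "q3": 1.00}
rewritten_scores = {"q1": 0.75, "q2": 0.25, "q3": 0.50}
mean_delta, wins, losses = compare_runs(baseline_scores, rewritten_scores)
print(mean_delta, wins, losses)
```

Looking at wins and losses alongside the mean helps show whether a rewriting step helps most queries a little or a few queries a lot.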

# Project Setup and Data Preparation

## Install PyTerrier

Installs PyTerrier, then imports and initializes it. Initialization is required before any other PyTerrier functionality can be used.



In [7]:
!pip install python-terrier

import pyterrier as pt
if not pt.started():
    pt.init()



## Load Dataset

Loads the MS MARCO Passage Ranking dataset for the TREC 2020 Deep Learning Track, which will be used for the information retrieval tasks.

In [18]:
# Example of loading a dataset from PyTerrier
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2020')
# For custom datasets, you'd typically load them into a dataframe or similar structure

## Load Topics

Retrieves and prints the first few search queries (topics) from the dataset. These are the queries that will be used to evaluate the retrieval system.

In [3]:
topics = dataset.get_topics()
topics.head()

| | qid | query |
|---|---|---|
| 0 | 1030303 | who is aziz hashim |
| 1 | 1037496 | who is rep scalise |
| 2 | 1043135 | who killed nicholas ii of russia |
| 3 | 1045109 | who owns barnhart crane |
| 4 | 1049519 | who said no one can make you feel inferior |


## Load Qrels
Loads the relevance judgments (qrels) for the dataset, which indicate which documents are relevant to each query. These are used for evaluation.

In [4]:
qrels = dataset.get_qrels()
qrels.head()

| | qid | docno | label | iteration |
|---|---|---|---|---|
| 0 | 23849 | 1020327 | 2 | 0 |
| 1 | 23849 | 1034183 | 3 | 0 |
| 2 | 23849 | 1120730 | 0 | 0 |
| 3 | 23849 | 1139571 | 1 | 0 |
| 4 | 23849 | 1143724 | 0 | 0 |
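
To make the role of the qrels concrete, the sketch below computes a reciprocal rank for one query from a ranked list of `docno` values and a `docno -> label` mapping. The labels are copied from the qrels rows above; the ranking, the helper name, and the `min_rel` threshold are assumptions for illustration only.

```python
from typing import Dict, List

def reciprocal_rank(ranking: List[str], labels: Dict[str, int],
                    min_rel: int = 1, cutoff: int = 10) -> float:
    """Return 1/rank of the first document with label >= min_rel in the top `cutoff`."""
    for rank, docno in enumerate(ranking[:cutoff], start=1):
        if labels.get(docno, 0) >= min_rel:
            return 1.0 / rank
    return 0.0

# Labels for qid 23849, copied from the qrels rows above
labels = {"1020327": 2, "1034183": 3, "1120730": 0, "1139571": 1, "1143724": 0}
# Hypothetical system ranking (not real retrieval output)
ranking = ["1120730", "1143724", "1139571", "1020327"]

rr_any = reciprocal_rank(ranking, labels)                # first label >= 1 is at rank 3
rr_strict = reciprocal_rank(ranking, labels, min_rel=2)  # first label >= 2 is at rank 4
print(rr_any, rr_strict)
```

Because the labels are graded, the choice of threshold changes the score, which is worth keeping in mind when comparing numbers across evaluations.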


## Access the Corpus

Retrieves the first document from the corpus using an iterator. This demonstrates how to access documents in the dataset.

In [15]:
corpus_iter = dataset.get_corpus_iter()

# Convert to an iterator
corpus_iterator = iter(corpus_iter)
first_doc = next(corpus_iterator)
print(first_doc)


# Building and Evaluating the Retrieval System

## Index the Corpus

Indexes the documents in the corpus, preparing them for retrieval. The index is written to the specified directory (`./index_path`), with Porter stemming and Terrier's stopword list applied.

In [24]:
indexer = pt.index.IterDictIndexer("./index_path", stemmer="porter", stopwords="terrier")
# Index the dataset and keep the returned IndexRef for the retrieval step
index_ref = indexer.index(dataset.get_corpus_iter())
index_ref

msmarco-passage/trec-dl-2020 documents: 100%|██████████| 8841823/8841823 [20:32<00:00, 7176.07it/s]


10:40:53.033 [ForkJoinPool-1-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 5 empty documents


<org.terrier.querying.IndexRef at 0x14c88cb90 jclass=org/terrier/querying/IndexRef jself=<LocalRef obj=0x7f9b06e68428 at 0x15e300fb0>>

In [28]:
corpus_iter = dataset.get_corpus_iter()

# Convert to an iterator
corpus_iterator = iter(corpus_iter)

first_doc = next(corpus_iterator)
print(first_doc)


{'text': 'The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.', 'docno': '0'}


In [29]:
pip install TextBlob

Collecting TextBlob
  Downloading textblob-0.18.0.post0-py3-none-any.whl (626 kB)
Installing collected packages: TextBlob
Successfully installed TextBlob-0.18.0.post0
Note: you may need to restart the kernel to use updated packages.


In [30]:
from pyterrier.transformer import TransformerBase
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string
from textblob import TextBlob

# Requires the NLTK 'stopwords' and 'wordnet' corpora
# (nltk.download('stopwords'), nltk.download('wordnet'))
class TransformQuery(TransformerBase):
    def __init__(self, index_path):
        super().__init__()
        self.stopwords_set = set(stopwords.words('english'))
        self.stemmer = PorterStemmer()
        self.index_path = index_path  # stored for later use; not needed by the steps below

    def _preprocess(self, query):
        # Lowercase the query
        query = query.lower()
        # Remove punctuation
        query = query.translate(str.maketrans('', '', string.punctuation))
        # Tokenize the query
        tokens = query.split()
        # Remove stopwords and apply stemming
        filtered_tokens = [self.stemmer.stem(token) for token in tokens if token not in self.stopwords_set]
        # Join the filtered tokens back into a query
        return ' '.join(filtered_tokens)

    def _expand_query(self, query):
        # Lemmatize each word with TextBlob; this normalizes terms but does not add synonyms
        blob = TextBlob(query)
        lemmatized_words = blob.words.lemmatize()
        return " ".join(lemmatized_words)

    def transform(self, query):
        # Note: takes a single query string, not the topics dataframe a PyTerrier pipeline passes
        # Preprocess the query
        preprocessed_query = self._preprocess(query)
        # Lemmatize the preprocessed query
        expanded_query = self._expand_query(preprocessed_query)
        return expanded_query


In [32]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /Users/zoe/nltk_data...


True

In [39]:
queries = dataset.get_topics()

print(queries)

         qid                                       query
0    1030303                          who is aziz hashim
1    1037496                          who is rep scalise
2    1043135            who killed nicholas ii of russia
3    1045109                     who owns barnhart crane
4    1049519  who said no one can make you feel inferior
..       ...                                         ...
195   985594                          where is kampuchea
196    99005                 convert sq meter to sq inch
197   997622          where is the show shameless filmed
198   999466                            where is velbert
199   132622               definition of attempted arson

[200 rows x 2 columns]


In [46]:
index_path = "./index_path"
queries = dataset.get_topics()

# Instantiate TransformQuery with the index path
transformer = TransformQuery(index_path=index_path)

# Apply transformation to the first 10 queries
for row_idx, query_row in queries.head(10).iterrows():
    # iterrows() yields the dataframe row index (0-9 here), not the topic's qid
    query_text = query_row['query']

    # Apply transformation
    transformed_query = transformer.transform(query_text)
    print(f"Original Query \t\t {row_idx}: {query_text}")
    print(f"Transformed Query \t {row_idx}: {transformed_query}")
    print("")


Original Query 		 0: who is aziz hashim
Transformed Query 	 0: aziz hashim

Original Query 		 1: who is rep scalise
Transformed Query 	 1: rep scalis

Original Query 		 2: who killed nicholas ii of russia
Transformed Query 	 2: kill nichola ii russia

Original Query 		 3: who owns barnhart crane
Transformed Query 	 3: own barnhart crane

Original Query 		 4: who said no one can make you feel inferior
Transformed Query 	 4: said one make feel inferior

Original Query 		 5: who sings monk theme song
Transformed Query 	 5: sing monk theme song

Original Query 		 6: who was the highest career passer rating in the nfl
Transformed Query 	 6: highest career passer rate nfl

Original Query 		 7: why do hunters pattern their shotguns
Transformed Query 	 7: hunter pattern shotgun

Original Query 		 8: why do some places on my scalp feel sore
Transformed Query 	 8: place scalp feel sore

Original Query 		 9: why is pete rose banned from hall of fame
Transformed Query 	 9: pete rose ban hall fame



In [62]:
pip install python-terrier transformers

Note: you may need to restart the kernel to use updated packages.


In [98]:
import torch

MODEL_PATH = "./models/doc_baseline.ckpt"
state_dict = torch.load(MODEL_PATH, map_location=torch.device('cpu'))

print(state_dict.keys())


dict_keys(['epoch', 'global_step', 'pytorch-lightning_version', 'state_dict', 'loops', 'callbacks', 'optimizer_states', 'lr_schedulers', 'hparams_name', 'hyper_parameters'])


In [101]:
from transformers import BertModel, BertConfig
import torch

MODEL_PATH = "./models/doc_baseline.ckpt"
state_dict = torch.load(MODEL_PATH, map_location=torch.device('cpu'))["state_dict"]

# Create a default configuration for the BERT model
config = BertConfig()

# Initialize the BERT model
model = BertModel(config)

# Load the state dictionary into the model
model.load_state_dict(state_dict)

# Ensure token embedding weights are still tied if needed
model.tie_weights()

# Set model in evaluation mode to deactivate Dropout modules by default
model.eval()


RuntimeError: Error(s) in loading state_dict for BertModel:
	Missing key(s) in state_dict: "embeddings.word_embeddings.weight", "embeddings.position_embeddings.weight", "embeddings.token_type_embeddings.weight", "embeddings.LayerNorm.weight", "embeddings.LayerNorm.bias", "encoder.layer.0.attention.self.query.weight", "encoder.layer.0.attention.self.query.bias", "encoder.layer.0.attention.self.key.weight", "encoder.layer.0.attention.self.key.bias", "encoder.layer.0.attention.self.value.weight", "encoder.layer.0.attention.self.value.bias", "encoder.layer.0.attention.output.dense.weight", "encoder.layer.0.attention.output.dense.bias", "encoder.layer.0.attention.output.LayerNorm.weight", "encoder.layer.0.attention.output.LayerNorm.bias", "encoder.layer.0.intermediate.dense.weight", "encoder.layer.0.intermediate.dense.bias", "encoder.layer.0.output.dense.weight", "encoder.layer.0.output.dense.bias", "encoder.layer.0.output.LayerNorm.weight", "encoder.layer.0.output.LayerNorm.bias", "encoder.layer.1.attention.self.query.weight", "encoder.layer.1.attention.self.query.bias", "encoder.layer.1.attention.self.key.weight", "encoder.layer.1.attention.self.key.bias", "encoder.layer.1.attention.self.value.weight", "encoder.layer.1.attention.self.value.bias", "encoder.layer.1.attention.output.dense.weight", "encoder.layer.1.attention.output.dense.bias", "encoder.layer.1.attention.output.LayerNorm.weight", "encoder.layer.1.attention.output.LayerNorm.bias", "encoder.layer.1.intermediate.dense.weight", "encoder.layer.1.intermediate.dense.bias", "encoder.layer.1.output.dense.weight", "encoder.layer.1.output.dense.bias", "encoder.layer.1.output.LayerNorm.weight", "encoder.layer.1.output.LayerNorm.bias", "encoder.layer.2.attention.self.query.weight", "encoder.layer.2.attention.self.query.bias", "encoder.layer.2.attention.self.key.weight", "encoder.layer.2.attention.self.key.bias", "encoder.layer.2.attention.self.value.weight", "encoder.layer.2.attention.self.value.bias", "encoder.layer.2.attention.output.dense.weight", 
"encoder.layer.2.attention.output.dense.bias", "encoder.layer.2.attention.output.LayerNorm.weight", "encoder.layer.2.attention.output.LayerNorm.bias", "encoder.layer.2.intermediate.dense.weight", "encoder.layer.2.intermediate.dense.bias", "encoder.layer.2.output.dense.weight", "encoder.layer.2.output.dense.bias", "encoder.layer.2.output.LayerNorm.weight", "encoder.layer.2.output.LayerNorm.bias", "encoder.layer.3.attention.self.query.weight", "encoder.layer.3.attention.self.query.bias", "encoder.layer.3.attention.self.key.weight", "encoder.layer.3.attention.self.key.bias", "encoder.layer.3.attention.self.value.weight", "encoder.layer.3.attention.self.value.bias", "encoder.layer.3.attention.output.dense.weight", "encoder.layer.3.attention.output.dense.bias", "encoder.layer.3.attention.output.LayerNorm.weight", "encoder.layer.3.attention.output.LayerNorm.bias", "encoder.layer.3.intermediate.dense.weight", "encoder.layer.3.intermediate.dense.bias", "encoder.layer.3.output.dense.weight", "encoder.layer.3.output.dense.bias", "encoder.layer.3.output.LayerNorm.weight", "encoder.layer.3.output.LayerNorm.bias", "encoder.layer.4.attention.self.query.weight", "encoder.layer.4.attention.self.query.bias", "encoder.layer.4.attention.self.key.weight", "encoder.layer.4.attention.self.key.bias", "encoder.layer.4.attention.self.value.weight", "encoder.layer.4.attention.self.value.bias", "encoder.layer.4.attention.output.dense.weight", "encoder.layer.4.attention.output.dense.bias", "encoder.layer.4.attention.output.LayerNorm.weight", "encoder.layer.4.attention.output.LayerNorm.bias", "encoder.layer.4.intermediate.dense.weight", "encoder.layer.4.intermediate.dense.bias", "encoder.layer.4.output.dense.weight", "encoder.layer.4.output.dense.bias", "encoder.layer.4.output.LayerNorm.weight", "encoder.layer.4.output.LayerNorm.bias", "encoder.layer.5.attention.self.query.weight", "encoder.layer.5.attention.self.query.bias", "encoder.layer.5.attention.self.key.weight", 
"encoder.layer.5.attention.self.key.bias", "encoder.layer.5.attention.self.value.weight", "encoder.layer.5.attention.self.value.bias", "encoder.layer.5.attention.output.dense.weight", "encoder.layer.5.attention.output.dense.bias", "encoder.layer.5.attention.output.LayerNorm.weight", "encoder.layer.5.attention.output.LayerNorm.bias", "encoder.layer.5.intermediate.dense.weight", "encoder.layer.5.intermediate.dense.bias", "encoder.layer.5.output.dense.weight", "encoder.layer.5.output.dense.bias", "encoder.layer.5.output.LayerNorm.weight", "encoder.layer.5.output.LayerNorm.bias", "encoder.layer.6.attention.self.query.weight", "encoder.layer.6.attention.self.query.bias", "encoder.layer.6.attention.self.key.weight", "encoder.layer.6.attention.self.key.bias", "encoder.layer.6.attention.self.value.weight", "encoder.layer.6.attention.self.value.bias", "encoder.layer.6.attention.output.dense.weight", "encoder.layer.6.attention.output.dense.bias", "encoder.layer.6.attention.output.LayerNorm.weight", "encoder.layer.6.attention.output.LayerNorm.bias", "encoder.layer.6.intermediate.dense.weight", "encoder.layer.6.intermediate.dense.bias", "encoder.layer.6.output.dense.weight", "encoder.layer.6.output.dense.bias", "encoder.layer.6.output.LayerNorm.weight", "encoder.layer.6.output.LayerNorm.bias", "encoder.layer.7.attention.self.query.weight", "encoder.layer.7.attention.self.query.bias", "encoder.layer.7.attention.self.key.weight", "encoder.layer.7.attention.self.key.bias", "encoder.layer.7.attention.self.value.weight", "encoder.layer.7.attention.self.value.bias", "encoder.layer.7.attention.output.dense.weight", "encoder.layer.7.attention.output.dense.bias", "encoder.layer.7.attention.output.LayerNorm.weight", "encoder.layer.7.attention.output.LayerNorm.bias", "encoder.layer.7.intermediate.dense.weight", "encoder.layer.7.intermediate.dense.bias", "encoder.layer.7.output.dense.weight", "encoder.layer.7.output.dense.bias", "encoder.layer.7.output.LayerNorm.weight", 
"encoder.layer.7.output.LayerNorm.bias", "encoder.layer.8.attention.self.query.weight", "encoder.layer.8.attention.self.query.bias", "encoder.layer.8.attention.self.key.weight", "encoder.layer.8.attention.self.key.bias", "encoder.layer.8.attention.self.value.weight", "encoder.layer.8.attention.self.value.bias", "encoder.layer.8.attention.output.dense.weight", "encoder.layer.8.attention.output.dense.bias", "encoder.layer.8.attention.output.LayerNorm.weight", "encoder.layer.8.attention.output.LayerNorm.bias", "encoder.layer.8.intermediate.dense.weight", "encoder.layer.8.intermediate.dense.bias", "encoder.layer.8.output.dense.weight", "encoder.layer.8.output.dense.bias", "encoder.layer.8.output.LayerNorm.weight", "encoder.layer.8.output.LayerNorm.bias", "encoder.layer.9.attention.self.query.weight", "encoder.layer.9.attention.self.query.bias", "encoder.layer.9.attention.self.key.weight", "encoder.layer.9.attention.self.key.bias", "encoder.layer.9.attention.self.value.weight", "encoder.layer.9.attention.self.value.bias", "encoder.layer.9.attention.output.dense.weight", "encoder.layer.9.attention.output.dense.bias", "encoder.layer.9.attention.output.LayerNorm.weight", "encoder.layer.9.attention.output.LayerNorm.bias", "encoder.layer.9.intermediate.dense.weight", "encoder.layer.9.intermediate.dense.bias", "encoder.layer.9.output.dense.weight", "encoder.layer.9.output.dense.bias", "encoder.layer.9.output.LayerNorm.weight", "encoder.layer.9.output.LayerNorm.bias", "encoder.layer.10.attention.self.query.weight", "encoder.layer.10.attention.self.query.bias", "encoder.layer.10.attention.self.key.weight", "encoder.layer.10.attention.self.key.bias", "encoder.layer.10.attention.self.value.weight", "encoder.layer.10.attention.self.value.bias", "encoder.layer.10.attention.output.dense.weight", "encoder.layer.10.attention.output.dense.bias", "encoder.layer.10.attention.output.LayerNorm.weight", "encoder.layer.10.attention.output.LayerNorm.bias", 
"encoder.layer.10.intermediate.dense.weight", "encoder.layer.10.intermediate.dense.bias", "encoder.layer.10.output.dense.weight", "encoder.layer.10.output.dense.bias", "encoder.layer.10.output.LayerNorm.weight", "encoder.layer.10.output.LayerNorm.bias", "encoder.layer.11.attention.self.query.weight", "encoder.layer.11.attention.self.query.bias", "encoder.layer.11.attention.self.key.weight", "encoder.layer.11.attention.self.key.bias", "encoder.layer.11.attention.self.value.weight", "encoder.layer.11.attention.self.value.bias", "encoder.layer.11.attention.output.dense.weight", "encoder.layer.11.attention.output.dense.bias", "encoder.layer.11.attention.output.LayerNorm.weight", "encoder.layer.11.attention.output.LayerNorm.bias", "encoder.layer.11.intermediate.dense.weight", "encoder.layer.11.intermediate.dense.bias", "encoder.layer.11.output.dense.weight", "encoder.layer.11.output.dense.bias", "encoder.layer.11.output.LayerNorm.weight", "encoder.layer.11.output.LayerNorm.bias", "pooler.dense.weight", "pooler.dense.bias". 
	Unexpected key(s) in state_dict: "bert.embeddings.position_ids", "bert.embeddings.word_embeddings.weight", "bert.embeddings.position_embeddings.weight", "bert.embeddings.token_type_embeddings.weight", "bert.embeddings.LayerNorm.weight", "bert.embeddings.LayerNorm.bias", "bert.encoder.layer.0.attention.self.query.weight", "bert.encoder.layer.0.attention.self.query.bias", "bert.encoder.layer.0.attention.self.key.weight", "bert.encoder.layer.0.attention.self.key.bias", "bert.encoder.layer.0.attention.self.value.weight", "bert.encoder.layer.0.attention.self.value.bias", "bert.encoder.layer.0.attention.output.dense.weight", "bert.encoder.layer.0.attention.output.dense.bias", "bert.encoder.layer.0.attention.output.LayerNorm.weight", "bert.encoder.layer.0.attention.output.LayerNorm.bias", "bert.encoder.layer.0.intermediate.dense.weight", "bert.encoder.layer.0.intermediate.dense.bias", "bert.encoder.layer.0.output.dense.weight", "bert.encoder.layer.0.output.dense.bias", "bert.encoder.layer.0.output.LayerNorm.weight", "bert.encoder.layer.0.output.LayerNorm.bias", "bert.encoder.layer.1.attention.self.query.weight", "bert.encoder.layer.1.attention.self.query.bias", "bert.encoder.layer.1.attention.self.key.weight", "bert.encoder.layer.1.attention.self.key.bias", "bert.encoder.layer.1.attention.self.value.weight", "bert.encoder.layer.1.attention.self.value.bias", "bert.encoder.layer.1.attention.output.dense.weight", "bert.encoder.layer.1.attention.output.dense.bias", "bert.encoder.layer.1.attention.output.LayerNorm.weight", "bert.encoder.layer.1.attention.output.LayerNorm.bias", "bert.encoder.layer.1.intermediate.dense.weight", "bert.encoder.layer.1.intermediate.dense.bias", "bert.encoder.layer.1.output.dense.weight", "bert.encoder.layer.1.output.dense.bias", "bert.encoder.layer.1.output.LayerNorm.weight", "bert.encoder.layer.1.output.LayerNorm.bias", "bert.encoder.layer.2.attention.self.query.weight", "bert.encoder.layer.2.attention.self.query.bias", 
"bert.encoder.layer.2.attention.self.key.weight", "bert.encoder.layer.2.attention.self.key.bias", "bert.encoder.layer.2.attention.self.value.weight", "bert.encoder.layer.2.attention.self.value.bias", "bert.encoder.layer.2.attention.output.dense.weight", "bert.encoder.layer.2.attention.output.dense.bias", "bert.encoder.layer.2.attention.output.LayerNorm.weight", "bert.encoder.layer.2.attention.output.LayerNorm.bias", "bert.encoder.layer.2.intermediate.dense.weight", "bert.encoder.layer.2.intermediate.dense.bias", "bert.encoder.layer.2.output.dense.weight", "bert.encoder.layer.2.output.dense.bias", "bert.encoder.layer.2.output.LayerNorm.weight", "bert.encoder.layer.2.output.LayerNorm.bias", "bert.encoder.layer.3.attention.self.query.weight", "bert.encoder.layer.3.attention.self.query.bias", "bert.encoder.layer.3.attention.self.key.weight", "bert.encoder.layer.3.attention.self.key.bias", "bert.encoder.layer.3.attention.self.value.weight", "bert.encoder.layer.3.attention.self.value.bias", "bert.encoder.layer.3.attention.output.dense.weight", "bert.encoder.layer.3.attention.output.dense.bias", "bert.encoder.layer.3.attention.output.LayerNorm.weight", "bert.encoder.layer.3.attention.output.LayerNorm.bias", "bert.encoder.layer.3.intermediate.dense.weight", "bert.encoder.layer.3.intermediate.dense.bias", "bert.encoder.layer.3.output.dense.weight", "bert.encoder.layer.3.output.dense.bias", "bert.encoder.layer.3.output.LayerNorm.weight", "bert.encoder.layer.3.output.LayerNorm.bias", "bert.encoder.layer.4.attention.self.query.weight", "bert.encoder.layer.4.attention.self.query.bias", "bert.encoder.layer.4.attention.self.key.weight", "bert.encoder.layer.4.attention.self.key.bias", "bert.encoder.layer.4.attention.self.value.weight", "bert.encoder.layer.4.attention.self.value.bias", "bert.encoder.layer.4.attention.output.dense.weight", "bert.encoder.layer.4.attention.output.dense.bias", "bert.encoder.layer.4.attention.output.LayerNorm.weight", 
"bert.encoder.layer.4.attention.output.LayerNorm.bias", "bert.encoder.layer.4.intermediate.dense.weight", "bert.encoder.layer.4.intermediate.dense.bias", "bert.encoder.layer.4.output.dense.weight", "bert.encoder.layer.4.output.dense.bias", "bert.encoder.layer.4.output.LayerNorm.weight", "bert.encoder.layer.4.output.LayerNorm.bias", "bert.encoder.layer.5.attention.self.query.weight", "bert.encoder.layer.5.attention.self.query.bias", "bert.encoder.layer.5.attention.self.key.weight", "bert.encoder.layer.5.attention.self.key.bias", "bert.encoder.layer.5.attention.self.value.weight", "bert.encoder.layer.5.attention.self.value.bias", "bert.encoder.layer.5.attention.output.dense.weight", "bert.encoder.layer.5.attention.output.dense.bias", "bert.encoder.layer.5.attention.output.LayerNorm.weight", "bert.encoder.layer.5.attention.output.LayerNorm.bias", "bert.encoder.layer.5.intermediate.dense.weight", "bert.encoder.layer.5.intermediate.dense.bias", "bert.encoder.layer.5.output.dense.weight", "bert.encoder.layer.5.output.dense.bias", "bert.encoder.layer.5.output.LayerNorm.weight", "bert.encoder.layer.5.output.LayerNorm.bias", "bert.encoder.layer.6.attention.self.query.weight", "bert.encoder.layer.6.attention.self.query.bias", "bert.encoder.layer.6.attention.self.key.weight", "bert.encoder.layer.6.attention.self.key.bias", "bert.encoder.layer.6.attention.self.value.weight", "bert.encoder.layer.6.attention.self.value.bias", "bert.encoder.layer.6.attention.output.dense.weight", "bert.encoder.layer.6.attention.output.dense.bias", "bert.encoder.layer.6.attention.output.LayerNorm.weight", "bert.encoder.layer.6.attention.output.LayerNorm.bias", "bert.encoder.layer.6.intermediate.dense.weight", "bert.encoder.layer.6.intermediate.dense.bias", "bert.encoder.layer.6.output.dense.weight", "bert.encoder.layer.6.output.dense.bias", "bert.encoder.layer.6.output.LayerNorm.weight", "bert.encoder.layer.6.output.LayerNorm.bias", "bert.encoder.layer.7.attention.self.query.weight", 
"bert.encoder.layer.7.attention.self.query.bias", "bert.encoder.layer.7.attention.self.key.weight", "bert.encoder.layer.7.attention.self.key.bias", "bert.encoder.layer.7.attention.self.value.weight", "bert.encoder.layer.7.attention.self.value.bias", "bert.encoder.layer.7.attention.output.dense.weight", "bert.encoder.layer.7.attention.output.dense.bias", "bert.encoder.layer.7.attention.output.LayerNorm.weight", "bert.encoder.layer.7.attention.output.LayerNorm.bias", "bert.encoder.layer.7.intermediate.dense.weight", "bert.encoder.layer.7.intermediate.dense.bias", "bert.encoder.layer.7.output.dense.weight", "bert.encoder.layer.7.output.dense.bias", "bert.encoder.layer.7.output.LayerNorm.weight", "bert.encoder.layer.7.output.LayerNorm.bias", "bert.encoder.layer.8.attention.self.query.weight", "bert.encoder.layer.8.attention.self.query.bias", "bert.encoder.layer.8.attention.self.key.weight", "bert.encoder.layer.8.attention.self.key.bias", "bert.encoder.layer.8.attention.self.value.weight", "bert.encoder.layer.8.attention.self.value.bias", "bert.encoder.layer.8.attention.output.dense.weight", "bert.encoder.layer.8.attention.output.dense.bias", "bert.encoder.layer.8.attention.output.LayerNorm.weight", "bert.encoder.layer.8.attention.output.LayerNorm.bias", "bert.encoder.layer.8.intermediate.dense.weight", "bert.encoder.layer.8.intermediate.dense.bias", "bert.encoder.layer.8.output.dense.weight", "bert.encoder.layer.8.output.dense.bias", "bert.encoder.layer.8.output.LayerNorm.weight", "bert.encoder.layer.8.output.LayerNorm.bias", "bert.encoder.layer.9.attention.self.query.weight", "bert.encoder.layer.9.attention.self.query.bias", "bert.encoder.layer.9.attention.self.key.weight", "bert.encoder.layer.9.attention.self.key.bias", "bert.encoder.layer.9.attention.self.value.weight", "bert.encoder.layer.9.attention.self.value.bias", "bert.encoder.layer.9.attention.output.dense.weight", "bert.encoder.layer.9.attention.output.dense.bias", 
"bert.encoder.layer.9.attention.output.LayerNorm.weight", "bert.encoder.layer.9.attention.output.LayerNorm.bias", "bert.encoder.layer.9.intermediate.dense.weight", "bert.encoder.layer.9.intermediate.dense.bias", "bert.encoder.layer.9.output.dense.weight", "bert.encoder.layer.9.output.dense.bias", "bert.encoder.layer.9.output.LayerNorm.weight", "bert.encoder.layer.9.output.LayerNorm.bias", "bert.encoder.layer.10.attention.self.query.weight", "bert.encoder.layer.10.attention.self.query.bias", "bert.encoder.layer.10.attention.self.key.weight", "bert.encoder.layer.10.attention.self.key.bias", "bert.encoder.layer.10.attention.self.value.weight", "bert.encoder.layer.10.attention.self.value.bias", "bert.encoder.layer.10.attention.output.dense.weight", "bert.encoder.layer.10.attention.output.dense.bias", "bert.encoder.layer.10.attention.output.LayerNorm.weight", "bert.encoder.layer.10.attention.output.LayerNorm.bias", "bert.encoder.layer.10.intermediate.dense.weight", "bert.encoder.layer.10.intermediate.dense.bias", "bert.encoder.layer.10.output.dense.weight", "bert.encoder.layer.10.output.dense.bias", "bert.encoder.layer.10.output.LayerNorm.weight", "bert.encoder.layer.10.output.LayerNorm.bias", "bert.encoder.layer.11.attention.self.query.weight", "bert.encoder.layer.11.attention.self.query.bias", "bert.encoder.layer.11.attention.self.key.weight", "bert.encoder.layer.11.attention.self.key.bias", "bert.encoder.layer.11.attention.self.value.weight", "bert.encoder.layer.11.attention.self.value.bias", "bert.encoder.layer.11.attention.output.dense.weight", "bert.encoder.layer.11.attention.output.dense.bias", "bert.encoder.layer.11.attention.output.LayerNorm.weight", "bert.encoder.layer.11.attention.output.LayerNorm.bias", "bert.encoder.layer.11.intermediate.dense.weight", "bert.encoder.layer.11.intermediate.dense.bias", "bert.encoder.layer.11.output.dense.weight", "bert.encoder.layer.11.output.dense.bias", "bert.encoder.layer.11.output.LayerNorm.weight", 
"bert.encoder.layer.11.output.LayerNorm.bias", "bert.pooler.dense.weight", "bert.pooler.dense.bias", "classification.weight", "classification.bias". 

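The error above suggests that every encoder weight in the checkpoint is stored under a `bert.` prefix, and that the checkpoint also carries a classification head (`classification.weight`, `classification.bias`) that a bare `BertModel` does not have. One possible fix, sketched below with a hypothetical helper `strip_prefix`, is to rename the keys and drop the extras before calling `load_state_dict`. The helper works on any mapping, so it is shown with placeholder strings rather than real tensors.

```python
from typing import Any, Dict

def strip_prefix(state_dict: Dict[str, Any], prefix: str = "bert.") -> Dict[str, Any]:
    """Keep only keys that start with `prefix` and remove the prefix from them."""
    return {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }

# Placeholder values stand in for tensors
checkpoint_weights = {
    "bert.embeddings.word_embeddings.weight": "tensor-a",
    "bert.pooler.dense.bias": "tensor-b",
    "classification.weight": "tensor-c",
}
remapped = strip_prefix(checkpoint_weights)
print(sorted(remapped))

# With the real checkpoint, a call along the lines of
#   model.load_state_dict(strip_prefix(state_dict), strict=False)
# could then load the encoder weights. strict=False is an assumption here,
# used to tolerate leftover buffers such as embeddings.position_ids.
```

Dropping the classification head means the loaded model can produce encoder outputs only, not the relevance scores the original checkpoint was trained to predict.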
In [61]:

import torch
from transformers import BertModel, BertTokenizer
from pyterrier.transformer import TransformerBase

class FineTunedBERTQueryExpander(TransformerBase):
    def __init__(self, model_path):
        super().__init__()
        # from_pretrained() expects a Hugging Face model directory or hub ID;
        # pointing it at a Lightning .ckpt file raises the UnicodeDecodeError shown below
        self.tokenizer = BertTokenizer.from_pretrained(model_path)
        self.model = BertModel.from_pretrained(model_path)
        self.model.eval()  # Set the model to evaluation mode

    def transform(self, query):
        # Takes a single query string
        inputs = self.tokenizer(query, return_tensors="pt", max_length=512, truncation=True)
        with torch.no_grad():
            outputs = self.model(**inputs)
        # Process the outputs to obtain the expanded query
        expanded_query = ...  # Implement your query expansion logic here
        return expanded_query

# Load the fine-tuned BERT model
model_path = "/Users/zoe/Downloads/doc_query2doc.ckpt"
query_expander = FineTunedBERTQueryExpander(model_path)

# Expand each topic's query text; transform() works on one query string at a time
expanded_queries = [query_expander.transform(q) for q in topics["query"]]



UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte

## Configure Retrieval Model
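Configures two retrieval models over the same index: one using the BM25 weighting model and one using TF-IDF.

In [26]:
# BatchRetrieve needs the index itself (a path, IndexRef, or Index), not the indexer object,
# so load the index that was written to ./index_path
index = pt.IndexFactory.of("./index_path/data.properties")
bm25 = pt.BatchRetrieve(index, wmodel="BM25")

In [27]:
tf_idf = pt.BatchRetrieve(index, wmodel="TF_IDF")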
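
To give a feel for what a BM25 weighting model computes, here is a small pure-Python sketch of the textbook formula with a Lucene-style smoothed idf. It is not Terrier's exact implementation (which differs in details), and the toy corpus, the parameter defaults `k1=1.2` and `b=0.75`, and the function name are assumptions for illustration.

```python
import math
from collections import Counter
from typing import List

def bm25_score(query: List[str], doc: List[str], corpus: List[List[str]],
               k1: float = 1.2, b: float = 0.75) -> float:
    """Score one tokenized document against a tokenized query with textbook BM25."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc)
    score = 0.0
    for term in set(query):
        df = sum(1 for d in corpus if term in d)
        if df == 0 or tf[term] == 0:
            continue  # term absent from the corpus or from this document
        # Smoothed inverse document frequency: rarer terms weigh more
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Term frequency saturates with k1 and is normalized by document length via b
        norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
        score += idf * tf[term] * (k1 + 1) / norm
    return score

# Toy corpus of already-tokenized documents (hypothetical)
corpus = [
    "pete rose banned from hall of fame".split(),
    "hall of fame inductees".split(),
    "monk theme song".split(),
]
query = "pete rose hall fame".split()
scores = [bm25_score(query, doc, corpus) for doc in corpus]
print(scores)
```

The first document matches all four query terms and scores highest, the second matches two, and the third matches none and scores zero.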

# Run Queries and Evaluate the Retrieval System

We can now run an experiment that evaluates both models on the same topics and qrels. For readability, we assign custom names to both approaches.
Comparing the approaches in the same experiment also makes statistical significance testing available: setting `baseline=0` tells `pt.Experiment` to compute $p$-values with respect to the first approach (TF-IDF), and the `correction` parameter selects one of the supported multiple-testing corrections, such as Bonferroni. The call below sets neither option, so its output reports only the metric values.



In [21]:
from pyterrier.measures import RR, nDCG, MAP
from pathlib import Path

results_dir = Path("results")
results_dir.mkdir(exist_ok=True)

pt.Experiment(
    [tf_idf, bm25],
    dataset.get_topics(),
    dataset.get_qrels(),
    names=["TF-IDF", "BM25"],
    eval_metrics=[RR @ 10, nDCG @ 20, MAP],
    save_dir=str(results_dir),
)

| | name | RR@10 | nDCG@20 | AP |
|---|---|---|---|---|
| 0 | TF-IDF | 0.802102 | 0.479753 | 0.358072 |
| 1 | BM25 | 0.802102 | 0.479866 | 0.358724 |
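
The Bonferroni correction mentioned above is simple to state: with m comparisons, each $p$-value is multiplied by m and capped at 1. The sketch below shows that arithmetic in isolation; the helper name and the $p$-values are made-up placeholders, not results from this experiment.

```python
from typing import List

def bonferroni(p_values: List[float]) -> List[float]:
    """Multiply each p-value by the number of tests and cap the result at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Placeholder p-values for three hypothetical comparisons
adjusted = bonferroni([0.01, 0.04, 0.5])
print(adjusted)
```

With three comparisons, a raw $p$-value of 0.04 no longer falls below a 0.05 threshold after correction, which is the conservative behavior the correction is meant to provide.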
