# IR Lab Tutorial: Query Expansion with Large Language Models

This tutorial shows how to re-use query expansions created with large language models on all datasets available in [TIREx](https://www.tira.io/tirex). It covers three large language models (ChatGPT, Llama-2, and Flan-UL2) with three prompting techniques (chain-of-thought, similar-queries-zero-shot, and similar-queries-few-shot). The query expansions are loaded from [TIRA](https://www.tira.io) to ensure replicability and are available for all public and non-public datasets (i.e., you can re-use these expansions as a prior stage on hidden test datasets).

# Import All Libraries

In [None]:
# This is only needed in Google Colab, not in the Codespace / Dev-Container
!pip3 install python-terrier ir-datasets git+https://github.com/tira-io/tira.git@development#\&subdirectory=python-client


In [4]:
from tira.third_party_integrations import ensure_pyterrier_is_loaded
from tira.rest_api_client import Client
ensure_pyterrier_is_loaded()
import pandas as pd
import pyterrier as pt
from tqdm import tqdm

tira = Client()

# Prepare Retrieval Environment

Load the index, configure BM25 retrieval, and view the topics.

In [5]:
dataset = 'antique-test-20230107-training'
pt_dataset = pt.get_dataset('irds:ir-benchmarks/' + dataset)

In [6]:
index = tira.pt.index('ir-benchmarks/tira-ir-starter/Index (tira-ir-starter-pyterrier)', dataset)

Download from Zenodo: https://zenodo.org/records/10743990/files/2023-01-07-13-40-04.zip?download=1


Download: 100%|██████████| 31.1M/31.1M [00:02<00:00, 12.8MiB/s]


Download finished. Extract...
Extraction finished:  /root/.tira/extracted_runs/ir-benchmarks/antique-test-20230107-training/tira-ir-starter


In [7]:
# Have a look into the topics

topics = pt_dataset.get_topics('query')

topics.head(3)

Download: 234kiB [00:00, 497kiB/s]


Download finished. Extract...
Extraction finished:  /root/.tira/extracted_datasets/ir-benchmarks/antique-test-20230107-training/


Unnamed: 0,qid,query
0,3990512,how can we get concentration onsomething
1,714612,why doesn t the water fall off earth if it s r...
2,2528767,how do i determine the charge of the iron ion ...


In [8]:
# Configure baseline retrieval systems

bm25 = pt.BatchRetrieve(index, wmodel="BM25")

# We use two traditional query expansion baselines: RM3 and Kullback-Leibler divergence
# See:
# - https://pyterrier.readthedocs.io/en/latest/rewrite.html#rm3
# - https://pyterrier.readthedocs.io/en/latest/rewrite.html#klqueryexpansion

bm25_rm3 = bm25 >> pt.rewrite.RM3(index) >> bm25
bm25_kl = bm25 >> pt.rewrite.KLQueryExpansion(index) >> bm25
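
The `>>` operator in the cell above chains pipeline stages: the output of the left stage becomes the input of the right one, so `bm25 >> pt.rewrite.RM3(index) >> bm25` retrieves, rewrites the query, and retrieves again. Below is a minimal sketch of that composition idea in plain Python. The `Stage` class and both toy stages are hypothetical stand-ins, not PyTerrier's actual implementation:

```python
class Stage:
    """Toy stand-in for a pipeline stage (hypothetical, for illustration only)."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, data):
        return self.fn(data)

    def __rshift__(self, other):
        # `a >> b` yields a new stage that runs `a` first and feeds its output to `b`
        return Stage(lambda data: other(self(data)))


# Hypothetical stages that only record their name, to make the execution order visible
retrieve = Stage(lambda trace: trace + ["retrieve"])
rewrite = Stage(lambda trace: trace + ["rewrite query from top-ranked documents"])

pipeline = retrieve >> rewrite >> retrieve
execution_order = pipeline([])
print(execution_order)
```

The second retrieval step is what actually uses the rewritten query; without it, the pipeline would end with a reformulated query but no new ranking.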

In [9]:
# Look into some retrieval results

bm25(topics.head(3))

Unnamed: 0,qid,docid,docno,rank,score,query
0,3990512,102622,3077638_1,0,15.887435,how can we get concentration onsomething
1,3990512,30676,3931664_0,1,15.621619,how can we get concentration onsomething
2,3990512,173781,4366141_0,2,15.395085,how can we get concentration onsomething
3,3990512,179429,1011598_10,3,15.134176,how can we get concentration onsomething
4,3990512,194913,4222212_0,4,15.134176,how can we get concentration onsomething
...,...,...,...,...,...,...
2995,2528767,51358,1199456_4,995,9.904100,how do i determine the charge of the iron ion ...
2996,2528767,53692,2478219_0,996,9.904100,how do i determine the charge of the iron ion ...
2997,2528767,65487,261752_1,997,9.904100,how do i determine the charge of the iron ion ...
2998,2528767,71945,3851713_1,998,9.904100,how do i determine the charge of the iron ion ...


# Query Expansion with Large Language Models

In the following, we use three large language models with three prompting techniques yielding 9 strategies.

The three large language models are:

 - ChatGPT
 - Llama-2
 - Flan-UL2


The three prompting techniques are:

 - chain-of-thought
 - similar-queries-zero-shot
 - similar-queries-few-shot

In [16]:
# Load the expansions

# llm expansions with gpt
gpt_cot = tira.pt.transform_queries('workshop-on-open-web-search/tu-dresden-03/qe-gpt3.5-cot', dataset, prefix='llm_expansion_')
gpt_sq_fs = tira.pt.transform_queries('workshop-on-open-web-search/tu-dresden-03/qe-gpt3.5-sq-fs', dataset, prefix='llm_expansion_')
gpt_sq_zs = tira.pt.transform_queries('ir-benchmarks/tu-dresden-03/qe-gpt3.5-sq-zs', dataset, prefix='llm_expansion_')

# llm expansions with llama
llama_cot = tira.pt.transform_queries('ir-benchmarks/tu-dresden-03/qe-llama-cot', dataset, prefix='llm_expansion_')
llama_sq_fs = tira.pt.transform_queries('ir-benchmarks/tu-dresden-03/qe-llama-sq-fs', dataset, prefix='llm_expansion_')
llama_sq_zs = tira.pt.transform_queries('ir-benchmarks/tu-dresden-03/qe-llama-sq-zs', dataset, prefix='llm_expansion_')

# llm expansions with flan-ul2
flan_cot = tira.pt.transform_queries('ir-benchmarks/tu-dresden-03/qe-flan-ul2-cot', dataset, prefix='llm_expansion_')
flan_sq_fs = tira.pt.transform_queries('ir-benchmarks/tu-dresden-03/qe-flan-ul2-sq-fs', dataset, prefix='llm_expansion_')
flan_sq_zs = tira.pt.transform_queries('ir-benchmarks/tu-dresden-03/qe-flan-ul2-sq-zs', dataset, prefix='llm_expansion_')
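
The nine approach identifiers in the cell above follow a regular naming pattern, with the exception that two of the GPT approaches are loaded from a different TIRA task. As a sketch, the identifiers could be generated in a loop. The variable names `models`, `prompts`, `task_overrides`, and `approaches` are introduced here for illustration only:

```python
from itertools import product

models = ["gpt3.5", "llama", "flan-ul2"]
prompts = ["cot", "sq-fs", "sq-zs"]

# In the cell above, two GPT approaches are loaded from a different TIRA task than the rest
task_overrides = {
    ("gpt3.5", "cot"): "workshop-on-open-web-search",
    ("gpt3.5", "sq-fs"): "workshop-on-open-web-search",
}

approaches = {}
for model, prompt in product(models, prompts):
    task = task_overrides.get((model, prompt), "ir-benchmarks")
    approaches[(model, prompt)] = f"{task}/tu-dresden-03/qe-{model}-{prompt}"

for key, approach in approaches.items():
    print(key, approach)
```

Each generated identifier could then be passed to `tira.pt.transform_queries(approach, dataset, prefix='llm_expansion_')`, exactly as done explicitly in the cell above.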

In [11]:
# look into some examples

gpt_cot(topics.head(3))

Unnamed: 0,qid,query,llm_expansion_query
0,3990512,how can we get concentration onsomething,"To get concentration on something, you can try..."
1,714612,why doesn t the water fall off earth if it s r...,The water on Earth does not fall off despite t...
2,2528767,how do i determine the charge of the iron ion ...,To determine the charge of the iron ion in FeC...


In [12]:
gpt_cot(topics.head(3)).iloc[0].to_dict()

{'qid': '3990512',
 'query': 'how can we get concentration onsomething',
 'llm_expansion_query': 'To get concentration on something, you can try the following strategies:\n\n1. Minimize distractions: Find a quiet and clutter-free environment where you can focus without interruptions.\n\n2. Set specific goals: Clearly define what you want to achieve and break it down into smaller tasks to stay focused.\n\n3. Use time management techniques: Prioritize your tasks, set deadlines, and allocate specific time blocks for focused work.\n\n4. Take breaks: Allow yourself short breaks between focused work sessions to refresh your mind and maintain concentration.\n\n5. Practice mindfulness: Stay present in the moment and bring your attention back to the task whenever your mind starts to wander.\n\nRationale: Concentration is essential for efficient and effective work. By minimizing distractions, setting goals, managing time effectively, taking breaks, and practicing mindfulness, you can enhance you

That is, the query expansion strategies configured above add a new field `llm_expansion_query` to the topics. One strategy proposed in the paper is to concatenate the original query `q` five times with the `llm_expansion_query`, so that the new query is `q + q + q + q + q + llm_expansion_query`, where `+` is string concatenation. We can implement this in a function `expand_query` that also takes care of tokenization:

In [13]:
tokeniser = pt.autoclass("org.terrier.indexing.tokenisation.Tokeniser").getTokeniser()

def pt_tokenize(text):
    return ' '.join(tokeniser.getTokens(text))

def expand_query(topic):
    # repeat the original query five times, then append the LLM expansion
    ret = ' '.join([topic['query']] * 5 + [topic['llm_expansion_query']])

    # apply the tokenization
    return pt_tokenize(ret)

# we wrap this into a PyTerrier transformer
# Documentation: https://pyterrier.readthedocs.io/en/latest/apply.html
pt_expand_query = pt.apply.query(expand_query)

In [14]:
# Now we can look into some expansions

(gpt_cot >> pt_expand_query)(topics.head(3))

Unnamed: 0,qid,query_0,llm_expansion_query,query
0,3990512,how can we get concentration onsomething,"To get concentration on something, you can try...",how can we get concentration onsomething how c...
1,714612,why doesn t the water fall off earth if it s r...,The water on Earth does not fall off despite t...,why doesn t the water fall off earth if it s r...
2,2528767,how do i determine the charge of the iron ion ...,To determine the charge of the iron ion in FeC...,how do i determine the charge of the iron ion ...


In [15]:
(gpt_cot >> pt_expand_query)(topics.head(3)).iloc[0].to_dict()

{'qid': '3990512',
 'query_0': 'how can we get concentration onsomething',
 'llm_expansion_query': 'To get concentration on something, you can try the following strategies:\n\n1. Minimize distractions: Find a quiet and clutter-free environment where you can focus without interruptions.\n\n2. Set specific goals: Clearly define what you want to achieve and break it down into smaller tasks to stay focused.\n\n3. Use time management techniques: Prioritize your tasks, set deadlines, and allocate specific time blocks for focused work.\n\n4. Take breaks: Allow yourself short breaks between focused work sessions to refresh your mind and maintain concentration.\n\n5. Practice mindfulness: Stay present in the moment and bring your attention back to the task whenever your mind starts to wander.\n\nRationale: Concentration is essential for efficient and effective work. By minimizing distractions, setting goals, managing time effectively, taking breaks, and practicing mindfulness, you can enhance y

We see that the above strategy replaces the original query `q` with `q + q + q + q + q + llm_expansion_query`, where `+` is string concatenation and `llm_expansion_query` is the expansion generated by the large language model for a given prompt. The original query is kept in the `query_0` column.
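
One way to read the five-fold repetition is that it raises the term frequency of the original query terms relative to the much longer LLM expansion. The sketch below illustrates this in plain Python. The regular-expression tokenizer is an assumed simplification (Terrier's tokeniser may behave differently), and the query, the expansion text, and the names `simple_tokenize`, `expand`, and `n_repeats` are made up for illustration:

```python
import re
from collections import Counter


def simple_tokenize(text):
    # Assumption: lowercase alphanumeric tokens; Terrier's tokeniser may differ
    return re.findall(r"[a-z0-9]+", text.lower())


def expand(query, expansion, n_repeats=5):
    # Repeat the original query n_repeats times, then append the expansion
    return " ".join(simple_tokenize(" ".join([query] * n_repeats + [expansion])))


query = "iron ion charge"
# Made-up expansion text, not an actual LLM output
expansion = "The charge of the iron ion follows from the charge of the chloride ions."

counts = Counter(expand(query, expansion).split())
print(counts.most_common(5))
```

With `n_repeats=1`, a term that occurs only in the expansion would count about as much as an original query term. Repeating the query keeps the original information need dominant in the expanded query.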

# Retrieval Effectiveness Experiments


In [18]:
pipeline_gpt_cot = (gpt_cot >> pt_expand_query) >> bm25
pipeline_gpt_sq_fs = (gpt_sq_fs >> pt_expand_query) >> bm25
pipeline_gpt_sq_zs = (gpt_sq_zs >> pt_expand_query) >> bm25

pipeline_llama_cot = (llama_cot >> pt_expand_query) >> bm25
pipeline_llama_sq_fs = (llama_sq_fs >> pt_expand_query) >> bm25
pipeline_llama_sq_zs = (llama_sq_zs >> pt_expand_query) >> bm25

pipeline_flan_cot = (flan_cot >> pt_expand_query) >> bm25
pipeline_flan_sq_fs = (flan_sq_fs >> pt_expand_query) >> bm25
pipeline_flan_sq_zs = (flan_sq_zs >> pt_expand_query) >> bm25


In [19]:
pt.Experiment(
    [bm25, bm25_rm3, bm25_kl, pipeline_gpt_cot, pipeline_gpt_sq_fs, pipeline_gpt_sq_zs, pipeline_llama_cot, pipeline_llama_sq_fs, pipeline_llama_sq_zs, pipeline_flan_cot, pipeline_flan_sq_fs, pipeline_flan_sq_zs, ],
    names=['BM25', 'BM25+RM3', 'BM25+KL', 'BM25+GPT-COT', 'BM25+GPT-SQ-FS', 'BM25+GPT-SQ-ZS', 'BM25+Llama-COT', 'BM25+Llama-SQ-FS', 'BM25+Llama-SQ-ZS', 'BM25+Flan-COT', 'BM25+Flan-SQ-FS', 'BM25+Flan-SQ-ZS'],
    topics=topics,
    qrels=pt_dataset.get_qrels(),
    eval_metrics=['recall_1000'],
    verbose=True
)

pt.Experiment:   0%|          | 0/12 [00:00<?, ?system/s]

Unnamed: 0,name,recall_1000
0,BM25,0.788732
1,BM25+RM3,0.780057
2,BM25+KL,0.787861
3,BM25+GPT-COT,0.806138
4,BM25+GPT-SQ-FS,0.797366
5,BM25+GPT-SQ-ZS,0.79043
6,BM25+Llama-COT,0.793485
7,BM25+Llama-SQ-FS,0.797566
8,BM25+Llama-SQ-ZS,0.803869
9,BM25+Flan-COT,0.796139


Because query expansion aims at increasing recall, we evaluate recall@1000.
Overall, query expansion is not very effective on the Antique dataset: both traditional expansion approaches, `BM25+RM3` and `BM25+KL`, decrease recall compared to BM25, whereas the LLM expansions shown above increase it only slightly (e.g., from 0.789 for `BM25` to 0.806 for `BM25+GPT-COT`).
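
For readers unfamiliar with the measure, recall@k is the fraction of a query's relevant documents that appear among the top-k results, averaged over all queries. Below is a minimal sketch on toy data. The function names and document ids are hypothetical, and the evaluation inside `pt.Experiment` may handle edge cases (such as queries without relevant documents) differently:

```python
def recall_at_k(ranking, relevant, k=1000):
    """Fraction of the relevant documents that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    retrieved = set(ranking[:k])
    return len(retrieved & set(relevant)) / len(relevant)


def mean_recall_at_k(run, qrels, k=1000):
    # run: qid -> ranked list of docnos; qrels: qid -> set of relevant docnos
    # Assumption: only queries with at least one relevant document are averaged
    scores = [recall_at_k(run.get(qid, []), rel, k) for qid, rel in qrels.items() if rel]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data (hypothetical document ids)
run = {"q1": ["d1", "d7", "d3"], "q2": ["d9", "d2"]}
qrels = {"q1": {"d1", "d3", "d5", "d8"}, "q2": {"d2"}}

mean_recall = mean_recall_at_k(run, qrels, k=1000)
print(mean_recall)
```

Here `q1` retrieves two of its four relevant documents and `q2` retrieves its single relevant document, so the mean is the average of 0.5 and 1.0.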

# Experiments on a Second Corpus

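Given that query expansion was not very effective on the Antique dataset, we repeat the experiments on the MS MARCO passage corpus with the TREC DL 2019 and TREC DL 2020 topics.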

In [21]:
index = tira.pt.index('ir-benchmarks/tira-ir-starter/Index (tira-ir-starter-pyterrier)', 'msmarco-passage-trec-dl-2019-judged-20230107-training')

Download from Zenodo: https://zenodo.org/records/10743990/files/2023-01-07-22-09-56.zip?download=1


Download: 100%|██████████| 892M/892M [00:40<00:00, 22.8MiB/s]


Download finished. Extract...
Extraction finished:  /root/.tira/extracted_runs/ir-benchmarks/msmarco-passage-trec-dl-2019-judged-20230107-training/tira-ir-starter


### Experiments on TREC DL 2019

In [24]:
# Data
dataset = 'msmarco-passage-trec-dl-2019-judged-20230107-training'
pt_dataset = pt.get_dataset('irds:ir-benchmarks/' + dataset)

# Baselines
bm25 = pt.BatchRetrieve(index, wmodel="BM25")
bm25_rm3 = bm25 >> pt.rewrite.RM3(index) >> bm25
bm25_kl = bm25 >> pt.rewrite.KLQueryExpansion(index) >> bm25

# llm expansions with gpt
gpt_cot = tira.pt.transform_queries('workshop-on-open-web-search/tu-dresden-03/qe-gpt3.5-cot', dataset, prefix='llm_expansion_')
gpt_sq_fs = tira.pt.transform_queries('workshop-on-open-web-search/tu-dresden-03/qe-gpt3.5-sq-fs', dataset, prefix='llm_expansion_')
gpt_sq_zs = tira.pt.transform_queries('ir-benchmarks/tu-dresden-03/qe-gpt3.5-sq-zs', dataset, prefix='llm_expansion_')

# llm expansions with llama
llama_cot = tira.pt.transform_queries('ir-benchmarks/tu-dresden-03/qe-llama-cot', dataset, prefix='llm_expansion_')
llama_sq_fs = tira.pt.transform_queries('ir-benchmarks/tu-dresden-03/qe-llama-sq-fs', dataset, prefix='llm_expansion_')
llama_sq_zs = tira.pt.transform_queries('ir-benchmarks/tu-dresden-03/qe-llama-sq-zs', dataset, prefix='llm_expansion_')


# llm expansions with flan
flan_cot = tira.pt.transform_queries('ir-benchmarks/tu-dresden-03/qe-flan-ul2-cot', dataset, prefix='llm_expansion_')
flan_sq_fs = tira.pt.transform_queries('ir-benchmarks/tu-dresden-03/qe-flan-ul2-sq-fs', dataset, prefix='llm_expansion_')
flan_sq_zs = tira.pt.transform_queries('ir-benchmarks/tu-dresden-03/qe-flan-ul2-sq-zs', dataset, prefix='llm_expansion_')

pipeline_gpt_cot = (gpt_cot >> pt_expand_query) >> bm25
pipeline_gpt_sq_fs = (gpt_sq_fs >> pt_expand_query) >> bm25
pipeline_gpt_sq_zs = (gpt_sq_zs >> pt_expand_query) >> bm25

pipeline_llama_cot = (llama_cot >> pt_expand_query) >> bm25
pipeline_llama_sq_fs = (llama_sq_fs >> pt_expand_query) >> bm25
pipeline_llama_sq_zs = (llama_sq_zs >> pt_expand_query) >> bm25

pipeline_flan_cot = (flan_cot >> pt_expand_query) >> bm25
pipeline_flan_sq_fs = (flan_sq_fs >> pt_expand_query) >> bm25
pipeline_flan_sq_zs = (flan_sq_zs >> pt_expand_query) >> bm25

pt.Experiment(
    [bm25, bm25_rm3, bm25_kl, pipeline_gpt_cot, pipeline_gpt_sq_fs, pipeline_gpt_sq_zs, pipeline_llama_cot, pipeline_llama_sq_fs, pipeline_llama_sq_zs, pipeline_flan_cot, pipeline_flan_sq_fs, pipeline_flan_sq_zs, ],
    names=['BM25', 'BM25+RM3', 'BM25+KL', 'BM25+GPT-COT', 'BM25+GPT-SQ-FS', 'BM25+GPT-SQ-ZS', 'BM25+Llama-COT', 'BM25+Llama-SQ-FS', 'BM25+Llama-SQ-ZS', 'BM25+Flan-COT', 'BM25+Flan-SQ-FS', 'BM25+Flan-SQ-ZS'],
    topics=pt_dataset.get_topics('query'),
    qrels=pt_dataset.get_qrels(),
    eval_metrics=['recall_1000'],
    verbose=True
)

pt.Experiment:   0%|          | 0/12 [00:00<?, ?system/s]

Unnamed: 0,name,recall_1000
0,BM25,0.73619
1,BM25+RM3,0.754277
2,BM25+KL,0.746565
3,BM25+GPT-COT,0.856772
4,BM25+GPT-SQ-FS,0.743227
5,BM25+GPT-SQ-ZS,0.781107
6,BM25+Llama-COT,0.815424
7,BM25+Llama-SQ-FS,0.757778
8,BM25+Llama-SQ-ZS,0.814025
9,BM25+Flan-COT,0.789587


### Experiments on TREC DL 2020

In [25]:
# Data
dataset = 'msmarco-passage-trec-dl-2020-judged-20230107-training'
pt_dataset = pt.get_dataset('irds:ir-benchmarks/' + dataset)

# Baselines
bm25 = pt.BatchRetrieve(index, wmodel="BM25")
bm25_rm3 = bm25 >> pt.rewrite.RM3(index) >> bm25
bm25_kl = bm25 >> pt.rewrite.KLQueryExpansion(index) >> bm25

# llm expansions with gpt
gpt_cot = tira.pt.transform_queries('workshop-on-open-web-search/tu-dresden-03/qe-gpt3.5-cot', dataset, prefix='llm_expansion_')
gpt_sq_fs = tira.pt.transform_queries('workshop-on-open-web-search/tu-dresden-03/qe-gpt3.5-sq-fs', dataset, prefix='llm_expansion_')
gpt_sq_zs = tira.pt.transform_queries('ir-benchmarks/tu-dresden-03/qe-gpt3.5-sq-zs', dataset, prefix='llm_expansion_')

# llm expansions with llama
llama_cot = tira.pt.transform_queries('ir-benchmarks/tu-dresden-03/qe-llama-cot', dataset, prefix='llm_expansion_')
llama_sq_fs = tira.pt.transform_queries('ir-benchmarks/tu-dresden-03/qe-llama-sq-fs', dataset, prefix='llm_expansion_')
llama_sq_zs = tira.pt.transform_queries('ir-benchmarks/tu-dresden-03/qe-llama-sq-zs', dataset, prefix='llm_expansion_')


# llm expansions with flan
flan_cot = tira.pt.transform_queries('ir-benchmarks/tu-dresden-03/qe-flan-ul2-cot', dataset, prefix='llm_expansion_')
flan_sq_fs = tira.pt.transform_queries('ir-benchmarks/tu-dresden-03/qe-flan-ul2-sq-fs', dataset, prefix='llm_expansion_')
flan_sq_zs = tira.pt.transform_queries('ir-benchmarks/tu-dresden-03/qe-flan-ul2-sq-zs', dataset, prefix='llm_expansion_')

pipeline_gpt_cot = (gpt_cot >> pt_expand_query) >> bm25
pipeline_gpt_sq_fs = (gpt_sq_fs >> pt_expand_query) >> bm25
pipeline_gpt_sq_zs = (gpt_sq_zs >> pt_expand_query) >> bm25

pipeline_llama_cot = (llama_cot >> pt_expand_query) >> bm25
pipeline_llama_sq_fs = (llama_sq_fs >> pt_expand_query) >> bm25
pipeline_llama_sq_zs = (llama_sq_zs >> pt_expand_query) >> bm25

pipeline_flan_cot = (flan_cot >> pt_expand_query) >> bm25
pipeline_flan_sq_fs = (flan_sq_fs >> pt_expand_query) >> bm25
pipeline_flan_sq_zs = (flan_sq_zs >> pt_expand_query) >> bm25

pt.Experiment(
    [bm25, bm25_rm3, bm25_kl, pipeline_gpt_cot, pipeline_gpt_sq_fs, pipeline_gpt_sq_zs, pipeline_llama_cot, pipeline_llama_sq_fs, pipeline_llama_sq_zs, pipeline_flan_cot, pipeline_flan_sq_fs, pipeline_flan_sq_zs, ],
    names=['BM25', 'BM25+RM3', 'BM25+KL', 'BM25+GPT-COT', 'BM25+GPT-SQ-FS', 'BM25+GPT-SQ-ZS', 'BM25+Llama-COT', 'BM25+Llama-SQ-FS', 'BM25+Llama-SQ-ZS', 'BM25+Flan-COT', 'BM25+Flan-SQ-FS', 'BM25+Flan-SQ-ZS'],
    topics=pt_dataset.get_topics('query'),
    qrels=pt_dataset.get_qrels(),
    eval_metrics=['recall_1000'],
    verbose=True
)

pt.Experiment:   0%|          | 0/12 [00:00<?, ?system/s]

Unnamed: 0,name,recall_1000
0,BM25,0.751156
1,BM25+RM3,0.799385
2,BM25+KL,0.793911
3,BM25+GPT-COT,0.846802
4,BM25+GPT-SQ-FS,0.759494
5,BM25+GPT-SQ-ZS,0.770243
6,BM25+Llama-COT,0.810467
7,BM25+Llama-SQ-FS,0.761412
8,BM25+Llama-SQ-ZS,0.778425
9,BM25+Flan-COT,0.813146
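
To put the three result tables side by side, the sketch below computes the absolute change in recall@1000 relative to `BM25` for the two traditional baselines and for `BM25+GPT-COT`. The values are copied from the tables above; the selection of systems and the variable names are choices made for this illustration only:

```python
# recall@1000 values copied from the result tables above (subset of systems)
reported = {
    "Antique": {"BM25": 0.788732, "BM25+RM3": 0.780057, "BM25+KL": 0.787861, "BM25+GPT-COT": 0.806138},
    "TREC DL 2019": {"BM25": 0.73619, "BM25+RM3": 0.754277, "BM25+KL": 0.746565, "BM25+GPT-COT": 0.856772},
    "TREC DL 2020": {"BM25": 0.751156, "BM25+RM3": 0.799385, "BM25+KL": 0.793911, "BM25+GPT-COT": 0.846802},
}

# Absolute difference to the BM25 baseline, per corpus and system
deltas = {
    name: {
        system: round(score - scores["BM25"], 6)
        for system, score in scores.items()
        if system != "BM25"
    }
    for name, scores in reported.items()
}

for name, per_system in deltas.items():
    for system, delta in per_system.items():
        print(f"{name:13s} {system:13s} {delta:+.6f}")
```

In this subset, the chain-of-thought expansions with GPT yield the largest gain over `BM25` on all three datasets, and the gain is considerably larger on the two TREC DL datasets than on Antique.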
