# IR Lab WiSe 2023: Query Expansion

This tutorial shows how to configure and use query expansion with Bo1 in PyTerrier.

**Attention:** The scenario below is cherry-picked to explain the concept with a minimal example. PyTerrier offers more query expansion approaches; please do not hesitate to look into them.


## Preparation: Install dependencies

In [2]:
# This is only needed in Google Colab; in a dev container, everything should already be installed
!pip3 install python-terrier

## Our Scenario

We want to build a search engine for pets.

Our search engine has the following five documents:


In [2]:
documents = [
    {'docno': 'd1', 'text': 'The Golden Retriever is a Scottish breed of medium size.'},
    {'docno': 'd2', 'text': 'Intelligent types of dogs are: (1) Border Collies, (2) Poodles, and (3) German Shepherds.'},
    {'docno': 'd3', 'text': 'Poodles are a highly intelligent, energetic, and sociable.'},
    {'docno': 'd4', 'text': 'The European Shorthair is medium-sized to large cat with a well-muscled chest.'},
    {'docno': 'd5', 'text': 'The domestic canary is a small songbird.'}
]

We create an index containing our five documents and use BM25 as retrieval model:

In [42]:
import pyterrier as pt
import pandas as pd
pd.set_option('display.max_colwidth', 0)

if not pt.started():
    pt.init(boot_packages=["com.github.terrierteam:terrier-prf:-SNAPSHOT"])

indexer = pt.IterDictIndexer("/tmp/index", overwrite=True, blocks=True, meta={'docno': 100, 'text': 20480}, )
index_ref = indexer.index(documents)
index = pt.IndexFactory.of(index_ref)


In [43]:
bm25 = pt.BatchRetrieve(index, wmodel="BM25")

## The Problem

During the first tests of our search engine, we observed a vocabulary mismatch problem: for the query `dog`, only the document `d2` is retrieved, although the documents `d1` and `d3` are also about dogs (Golden Retrievers and Poodles are dog breeds).

Let's look into the problem:

In [44]:
# Searching for dog returns only document d2, as d1 and d3 contain no occurrence of the term dog
bm25.search("dog")

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,1,d2,0,1.284654,dog


## The Solution

One potential solution for our vocabulary mismatch problem is [query expansion](https://pyterrier.readthedocs.io/en/latest/rewrite.html).

Before we start to implement our query expansion approach, we create a small Cranfield-style collection to measure whether our new approach improves retrieval effectiveness:

In [45]:
# The information needs that we want to test
import pandas as pd

topics = pd.DataFrame([
    {'qid': '1', 'query': 'dog'},
])

qrels = pd.DataFrame([
    {'qid': '1', 'docno': 'd1', 'relevance': 1}, # d1 is about a specific dog breed
    {'qid': '1', 'docno': 'd2', 'relevance': 1}, # d2 is about multiple types of dogs
    {'qid': '1', 'docno': 'd3', 'relevance': 1}, # d3 is about a specific dog breed
])

In [46]:
pt.Experiment([bm25], topics, qrels, eval_metrics=['ndcg_cut_5'])

Unnamed: 0,name,ndcg_cut_5
0,BR(BM25),0.469279
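
To see where this number comes from, the score can be recomputed by hand. The following is a minimal sketch, assuming binary gains and the usual `log2(rank + 1)` discount; the helper `ndcg_at_k` is a hypothetical name introduced here for illustration and is not part of PyTerrier.

```python
import math

def ndcg_at_k(ranking, relevance, k=5):
    """nDCG@k with gain = relevance label and a log2(rank + 1) discount (assumed)."""
    dcg = sum(
        relevance.get(docno, 0) / math.log2(rank + 1)
        for rank, docno in enumerate(ranking[:k], start=1)
    )
    # The ideal ranking lists all relevant documents first.
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(
        gain / math.log2(rank + 1)
        for rank, gain in enumerate(ideal_gains, start=1)
    )
    return dcg / idcg if idcg > 0 else 0.0

# Relevance judgments for the query "dog" (same as the qrels above)
relevance = {"d1": 1, "d2": 1, "d3": 1}

# BM25 retrieved only d2 for the query "dog"
bm25_ranking = ["d2"]

score = ndcg_at_k(bm25_ranking, relevance)
print(round(score, 6))
```

Only one of the three relevant documents is retrieved, so the DCG of 1 is divided by an ideal DCG of roughly 2.13.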


Now that we can measure the effectiveness, let's try to improve it with query expansion.


Query expansion algorithms like Bo1 use relevance feedback to expand the query with terms that are prominent in the feedback documents. In most cases, the relevance feedback is implicit, e.g., we assume that the top results of BM25 are pseudo-relevant.

Let us implement the following pipeline:

- We use BM25 to retrieve the documents that serve as pseudo-relevance feedback
- We use the top-ranked documents of BM25 to expand the query with Bo1 (for our `dog` query, we only have one document for relevance feedback as seen above)
- We retrieve the final results using the expanded query against BM25
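
The three steps above can be illustrated without PyTerrier. The following toy sketch makes several simplifying assumptions: it scores documents by term overlap instead of BM25, uses a hand-written stopword list and a crude plural stripper instead of Terrier's preprocessing, and adds all terms of the feedback documents instead of selecting them with Bo1. All function names are hypothetical.

```python
import re

# Small hand-written stopword list (assumption, not Terrier's list)
STOPWORDS = {"a", "the", "is", "are", "of", "and", "to", "with"}

def terms(text):
    """Lowercase, split on non-alphanumerics, drop stopwords, strip a plural 's'."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {
        t[:-1] if len(t) > 3 and t.endswith("s") else t
        for t in tokens
        if t not in STOPWORDS
    }

docs = {
    "d1": "The Golden Retriever is a Scottish breed of medium size.",
    "d2": "Intelligent types of dogs are: (1) Border Collies, (2) Poodles, and (3) German Shepherds.",
    "d3": "Poodles are a highly intelligent, energetic, and sociable.",
    "d4": "The European Shorthair is medium-sized to large cat with a well-muscled chest.",
    "d5": "The domestic canary is a small songbird.",
}
doc_terms = {docno: terms(text) for docno, text in docs.items()}

def retrieve(query_terms):
    """Rank documents by the number of query terms they contain."""
    scores = {docno: len(query_terms & ts) for docno, ts in doc_terms.items()}
    hits = [docno for docno, s in scores.items() if s > 0]
    return sorted(hits, key=lambda docno: (-scores[docno], docno))

# Step 1: first-pass retrieval yields the pseudo-relevant documents
query = terms("dog")
first_pass = retrieve(query)

# Step 2: expand the query with the terms of the top-ranked documents
feedback_terms = set().union(*(doc_terms[d] for d in first_pass[:3]))
expanded_query = query | feedback_terms

# Step 3: retrieve again with the expanded query
second_pass = retrieve(expanded_query)

print(first_pass)
print(second_pass)
```

In this toy setting, the first pass finds only `d2`, while the second pass additionally finds `d3` because it shares the terms `poodle` and `intelligent` with the feedback document.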

In [47]:
bo1_expansion = bm25 >> pt.rewrite.Bo1QueryExpansion(index)

In [48]:
bo1_expansion(topics)

Unnamed: 0,qid,query_0,query
0,1,dog,applypipeline:off dog^2.000000000 colli^1.000000000 3^1.000000000 border^1.000000000 shepherd^1.000000000 1^1.000000000 type^1.000000000 german^1.000000000 2^1.000000000 poodl^0.805050646


Our Bo1 query expansion adds additional terms (already stemmed) like `colli`, `shepherd`, etc. to the query, but still puts the highest weight on the term `dog`.
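
The weights of the expansion terms can be approximated by hand. The sketch below assumes the commonly stated Bo1 weighting `w(t) = tf_x * log2((1 + P_n) / P_n) + log2(1 + P_n)` with `P_n = F / N`, where `tf_x` is the term frequency in the feedback documents, `F` the term frequency in the collection, and `N` the number of documents. It further assumes that the weights are normalised by the largest weight. The term statistics are counted by hand from our five documents, and `bo1_weight` is a hypothetical helper, not a PyTerrier function.

```python
import math

def bo1_weight(tf_feedback, coll_freq, num_docs):
    """Bo1 term weight under the formula assumed in the text above."""
    p_n = coll_freq / num_docs  # P_n = F / N
    return tf_feedback * math.log2((1 + p_n) / p_n) + math.log2(1 + p_n)

NUM_DOCS = 5

# Hand-counted statistics (assumption):
# term -> (frequency in the feedback document d2, frequency in the collection)
term_stats = {
    "dog": (1, 1),
    "colli": (1, 1),
    "shepherd": (1, 1),
    "poodl": (1, 2),  # "Poodles" also occurs in d3
}

raw = {
    term: bo1_weight(tf, cf, NUM_DOCS)
    for term, (tf, cf) in term_stats.items()
}

# Normalise by the largest weight (assumed normalisation)
max_weight = max(raw.values())
normalised = {term: weight / max_weight for term, weight in raw.items()}

for term, weight in normalised.items():
    print(f"{term}^{weight:.9f}")
```

Under these assumptions, terms that occur only in `d2` receive a normalised weight of 1, whereas `poodl` is less specific to the feedback document and ends up at roughly 0.805. The weight of 2 for `dog` in the output above is consistent with adding this normalised weight to the original query term weight of 1.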

We now can build our final pipeline and use this expanded query for retrieval against BM25.

In [49]:
bm25_bo1 = bo1_expansion >> bm25

In [50]:
bm25_bo1.search('dog')

Unnamed: 0,qid,docid,docno,rank,score,query_0,query
0,1,1,d2,0,6.895176,dog,applypipeline:off dog^2.000000000 colli^1.000000000 3^1.000000000 border^1.000000000 shepherd^1.000000000 1^1.000000000 type^1.000000000 german^1.000000000 2^1.000000000 poodl^0.805050646
1,1,2,d3,1,0.236991,dog,applypipeline:off dog^2.000000000 colli^1.000000000 3^1.000000000 border^1.000000000 shepherd^1.000000000 1^1.000000000 type^1.000000000 german^1.000000000 2^1.000000000 poodl^0.805050646


In [51]:
pt.Experiment([bm25_bo1], topics, qrels, eval_metrics=['ndcg_cut_5'], names=['BM25 >> Bo1 >> BM25'])

Unnamed: 0,name,ndcg_cut_5
0,BM25 >> Bo1 >> BM25,0.765361
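
As a plausibility check, this value can be derived by hand, assuming binary gains and a $\log_2(\text{rank} + 1)$ discount. The expanded query retrieves `d2` at rank 1 and `d3` at rank 2, while all three documents `d1`, `d2`, and `d3` are relevant:

$$
\mathrm{DCG@5} = \frac{1}{\log_2 2} + \frac{1}{\log_2 3} \approx 1.6309,
\qquad
\mathrm{IDCG@5} = \frac{1}{\log_2 2} + \frac{1}{\log_2 3} + \frac{1}{\log_2 4} \approx 2.1309
$$

$$
\mathrm{nDCG@5} = \frac{\mathrm{DCG@5}}{\mathrm{IDCG@5}} \approx \frac{1.6309}{2.1309} \approx 0.7654
$$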


## Summary

Our query expansion improved the nDCG@5 quite substantially from 0.47 to 0.77, because the expanded query now also retrieves `d3`. The document `d1` is still not retrieved.

To summarize everything, please answer the following questions.


### Question 1:

Is query expansion a precision-oriented or a recall-oriented technique?


### TODO: Add your Solution

### Question 2:

Please describe a potential problem that can be caused by query expansion. How would this problem influence precision and recall, respectively?

### TODO: Add your Solution

### Question 3:

Our query expansion approach above was corpus-dependent. Do you think that corpus-independent approaches (e.g., using ChatGPT without context, using WordNet, etc.) would amplify or reduce the potential problem that you pointed out in question 2? How would they compare in terms of precision and recall, respectively?

### TODO: Add your Solution