Query Rewriting & Expansion

Query rewriting refers to changing the formulation of the query in order to improve the effectiveness of the search ranking. PyTerrier supplies a number of query rewriting transformers designed to work with Retriever.

We differentiate between two forms of query rewriting:

  • Q -> Q: this rewrites the query, for instance by adding or removing query terms. Examples might be a WordNet- or Word2Vec-based QE. The input dataframes contain only ["qid", "query"] columns. The output dataframes contain ["qid", "query", "query_0"] columns, where "query" contains the reformulated query, and "query_0" contains the previous formulation of the query.
  • R -> Q: this class of transformers rewrites a query by making use of an associated set of documents. This is typically exemplified by pseudo-relevance feedback. Similarly, the output dataframes contain ["qid", "query", "query_0"] columns.

The previous formulation of the query can be restored using pt.rewrite.reset(), discussed below.


SequentialDependence

This class implements Metzler and Croft's sequential dependence model, designed to boost the scores of documents where the query terms occur in close proximity. Application of this transformer rewrites each input query such that:

  • pairs of adjacent query terms are added as #1 and #uw8 complex query terms, with a low weight.
  • the full query is added as a #uw12 complex query term, with a low weight.
  • all terms are weighted by a proximity model, either Dirichlet LM or pBiL2.

For example, the query pyterrier IR platform would become pyterrier IR platform #1(pyterrier IR) #1(IR platform) #uw8(pyterrier IR) #uw8(IR platform) #uw12(pyterrier IR platform). NB: Actually, we have simplified the rewritten query. In practice, we also (a) set the weight of the proximity terms to be low using a #combine() operator and (b) set a proximity term weighting model.
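
As a rough illustration of the simplified rewriting above, the following sketch builds the expanded query string in plain Python. The function name simplified_sdm is hypothetical and not part of PyTerrier. Like the simplified example, it omits the #combine() weighting and the proximity term weighting model.

```python
def simplified_sdm(query: str) -> str:
    """Build the simplified (unweighted) SDM formulation of a query."""
    terms = query.split()
    # Adjacent term pairs, e.g. "pyterrier IR" and "IR platform".
    pairs = [f"{a} {b}" for a, b in zip(terms, terms[1:])]
    parts = [" ".join(terms)]                       # the original query terms
    parts += [f"#1({p})" for p in pairs]            # adjacent pairs as exact phrases
    parts += [f"#uw8({p})" for p in pairs]          # adjacent pairs, unordered window of 8
    if len(terms) > 1:
        parts.append(f"#uw12({' '.join(terms)})")   # full query, unordered window of 12
    return " ".join(parts)


rewritten = simplified_sdm("pyterrier IR platform")
print(rewritten)
```

A single-term query has no adjacent pairs, so this sketch returns it unchanged.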

This transformer is only compatible with Retriever, as Terrier supports the #1 and #uwN complex query term operators. The Terrier index must have blocks (positional information) recorded.

.. autoclass:: pyterrier.rewrite.SequentialDependence
    :members: transform


.. code-block:: python

    import pyterrier as pt

    # index: a Terrier index that records blocks (positions)
    sdm = pt.rewrite.SequentialDependence()
    dph = pt.terrier.Retriever(index, wmodel="DPH")
    pipeline = sdm >> dph

References:

  • A Markov Random Field Model for Term Dependencies. Donald Metzler and W. Bruce Croft. In Proceedings of SIGIR 2005.
  • Incorporating Term Dependency in the DFR Framework. Jie Peng, Craig Macdonald, Ben He, Vassilis Plachouras, Iadh Ounis. In Proceedings of SIGIR 2007, Amsterdam, the Netherlands, July 2007.


Bo1QueryExpansion

This class applies the Bo1 Divergence from Randomness query expansion model to rewrite the query based on the occurrences of terms in the feedback documents provided for each query. It takes in a dataframe with columns ["qid", "query", "docno", "score", "rank"] and returns a dataframe with ["qid", "query", "query_0"].

.. autoclass:: pyterrier.rewrite.Bo1QueryExpansion
    :members: transform


.. code-block:: python

    import pyterrier as pt

    # index: an existing Terrier index
    bo1 = pt.rewrite.Bo1QueryExpansion(index)
    dph = pt.terrier.Retriever(index, wmodel="DPH")
    pipelineQE = dph >> bo1 >> dph

View the expansion terms:

.. code-block:: python

    pipelineDisplay = dph >> bo1
    pipelineDisplay.search("chemical reactions")
    # will return a dataframe with ['qid', 'query', 'query_0'] columns
    # the reformulated query can be found in the 'query' column,
    # while the original query is in the 'query_0' column
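
To give a feel for how Bo1 selects expansion terms, the sketch below scores each term in the feedback documents as w(t) = tf_x * log2((1 + P_n) / P_n) + log2(1 + P_n), where tf_x is the term's frequency in the feedback documents and P_n = F / N is its collection frequency F divided by the number of documents N. All function names and statistics are made up for illustration. This is not PyTerrier's implementation, and it leaves out details that a full implementation would typically handle, such as normalising the weights and reweighting the original query terms.

```python
import math
from collections import Counter


def bo1_weights(feedback_docs, collection_freq, num_docs):
    """Score every term found in the feedback documents with the Bo1 formula."""
    # tf_x: how often each term occurs across the feedback documents.
    tf_x = Counter(term for doc in feedback_docs for term in doc)
    weights = {}
    for term, tf in tf_x.items():
        p_n = collection_freq[term] / num_docs
        weights[term] = tf * math.log2((1 + p_n) / p_n) + math.log2(1 + p_n)
    return weights


def expand(query, weights, k=2):
    """Append the k highest-weighted terms that are not already in the query."""
    original = query.split()
    ranked = sorted(weights, key=weights.get, reverse=True)
    extra = [t for t in ranked if t not in original][:k]
    return " ".join(original + extra)


# Toy statistics (hypothetical): two feedback documents, 1000 documents overall.
feedback_docs = [
    ["chemical", "reactions", "catalyst", "the"],
    ["chemical", "catalyst", "enzyme", "the"],
]
collection_freq = {"chemical": 50, "reactions": 40, "catalyst": 5,
                   "enzyme": 4, "the": 5000}

weights = bo1_weights(feedback_docs, collection_freq, num_docs=1000)
expanded = expand("chemical reactions", weights)
print(expanded)
```

Terms that are frequent in the feedback documents but rare in the collection receive the largest weights, which is why a common word such as "the" scores low here despite appearing in both feedback documents.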

Alternative Formulations

Note that it is also possible to configure Retriever to perform QE directly using controls, which will result in identical retrieval effectiveness:

.. code-block:: python

    pipelineQE = pt.terrier.Retriever(index, wmodel="DPH", controls={"qemodel" : "Bo1", "qe" : "on"})

However, using pt.rewrite.Bo1QueryExpansion is preferable as:

  • the semantics of retrieve >> rewrite >> retrieve are clearly visible.
  • the complex control configuration of Terrier need not be learned.
  • the rewritten query is visible outside, and not hidden inside Terrier.

References:

  • Amati, Giambattista (2003) Probability models for information retrieval based on divergence from randomness. PhD thesis, University of Glasgow.


KLQueryExpansion

Similar to Bo1, this class deploys a Divergence from Randomness query expansion model based on Kullback-Leibler divergence.

.. autoclass:: pyterrier.rewrite.KLQueryExpansion
    :members: transform

References:

  • Amati, Giambattista (2003) Probability models for information retrieval based on divergence from randomness. PhD thesis, University of Glasgow.


RM3

.. autoclass:: pyterrier.rewrite.RM3
    :members: transform

References:

  • Nasreen Abdul-Jaleel, James Allan, W. Bruce Croft, Fernando Diaz, Leah Larkey, Xiaoyan Li, Mark D. Smucker, and Courtney Wade. UMass at TREC 2004: Novelty and HARD. In Proceedings of TREC 2004.

Combining Query Formulations

.. autofunction:: pyterrier.rewrite.linear

Resetting the Query Formulation

The application of any query rewriting operation, including the apply transformer, pt.apply.query(), will return a dataframe that includes the input formulation of the query in the query_0 column, and the new reformulation in the query column. The previous query formulation can be restored by including a reset transformer in the pipeline.

.. autofunction:: pyterrier.rewrite.reset()
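
The relationship between the query and query_0 columns can be pictured with plain dictionaries standing in for dataframe rows. This is a simplified model that assumes a single earlier formulation. The helpers rewrite_query and reset_query are hypothetical, not PyTerrier functions.

```python
def rewrite_query(row: dict, new_query: str) -> dict:
    """Return a copy of the row with a new 'query'; the old one moves to 'query_0'."""
    out = dict(row)
    out["query_0"] = row["query"]
    out["query"] = new_query
    return out


def reset_query(row: dict) -> dict:
    """Return a copy of the row with the previous formulation restored."""
    out = dict(row)
    out["query"] = out.pop("query_0")
    return out


row = {"qid": "1", "query": "chemical reactions"}
rewritten = rewrite_query(row, "chemical reactions catalyst")
restored = reset_query(rewritten)
print(rewritten)
print(restored)
```

After the reset, the row carries the original query again and no longer has a query_0 entry.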

Tokenising the Query

Sometimes your query can include symbols that aren't compatible with how your retriever parses the query. In this case, a custom tokeniser can be applied as part of the retrieval pipeline.

.. autofunction:: pyterrier.rewrite.tokenise()
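
As an illustration of what a custom tokeniser might do, the sketch below lowercases a query and keeps only runs of letters and digits, so that punctuation that may carry special meaning for a query parser is dropped. The function simple_tokenise is a hypothetical example. Whether a function of this shape can be supplied directly to pt.rewrite.tokenise() should be checked against its documented arguments.

```python
import re


def simple_tokenise(query: str) -> list:
    """Lowercase the query and keep only runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", query.lower())


tokens = simple_tokenise("What's the U.S. budget (2020)?")
cleaned_query = " ".join(tokens)
print(cleaned_query)
```

This deliberately crude rule splits "U.S." into two single-letter tokens. A production tokeniser would likely treat abbreviations and non-ASCII text more carefully.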

Stashing the Documents

Sometimes you want to apply a query rewriting function as a re-ranker, but your rewriting function uses a different document ranking. In this case, you can use pt.rewrite.stash_results() to stash the retrieved documents for each query, so they can be recovered and re-ranked later using your rewritten query formulation.

.. autofunction:: pyterrier.rewrite.stash_results()

.. autofunction:: pyterrier.rewrite.reset_results()
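
The idea of stashing and restoring can be sketched with lists of dictionaries standing in for dataframes. The two functions below are simplified stand-ins that borrow the names and the clear argument from the functions above. They are not PyTerrier's implementation, and the way the real stashed_results_0 column stores the ranking may differ.

```python
def stash_results(rows, clear=True):
    """Store each query's current ranking in a 'stashed_results_0' field."""
    ranking = {}
    for r in rows:
        ranking.setdefault(r["qid"], []).append(
            {"docno": r["docno"], "score": r["score"]})
    if not clear:
        # Keep the ranking columns as well (type R plus the stash).
        return [dict(r, stashed_results_0=ranking[r["qid"]]) for r in rows]
    # Drop the ranking columns: one row per query (type Q plus the stash).
    queries = {r["qid"]: r["query"] for r in rows}
    return [{"qid": q, "query": queries[q], "stashed_results_0": ranking[q]}
            for q in ranking]


def reset_results(rows):
    """Restore the stashed ranking while keeping the current query columns."""
    out, seen = [], set()
    for r in rows:
        if r["qid"] in seen:
            continue
        seen.add(r["qid"])
        keep = {k: v for k, v in r.items()
                if k not in ("docno", "score", "stashed_results_0")}
        out.extend({**keep, **doc} for doc in r["stashed_results_0"])
    return out


first_pass = [
    {"qid": "1", "query": "chemical reactions", "docno": "d1", "score": 2.0},
    {"qid": "1", "query": "chemical reactions", "docno": "d2", "score": 1.0},
]
stashed = stash_results(first_pass)  # clear=True: ranking columns removed
# Stand-in for a query rewriter: new 'query', old one kept in 'query_0'.
rewritten = [dict(r, query_0=r["query"], query=r["query"] + " catalyst")
             for r in stashed]
restored = reset_results(rewritten)
print(restored)
```

The restored rows contain the first-pass documents together with the rewritten query, ready to be re-scored by a final retriever.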

Example: Query Expansion as a re-ranker

Some papers advocate for the use of query expansion (PRF) as a re-ranker. This can be achieved in PyTerrier through the use of stash_results() and reset_results():

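.. code-block:: python

    import pyterrier as pt

    # index: the corpus you are ranking
    dph = pt.terrier.Retriever(index)
    # parentheses let the pipeline span several lines
    pipe = (
        dph
        >> pt.rewrite.stash_results(clear=False)
        >> pt.rewrite.RM3(index)
        >> pt.rewrite.reset_results()
        >> dph
    )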

Summary of dataframe types:

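=============  =======================  ===========================================
output of      dataframe contents       actual columns
=============  =======================  ===========================================
dph            R                        qid, query, docno, score
stash_results  R + "stashed_results_0"  qid, query, docno, score, stashed_results_0
RM3            Q + "stashed_results_0"  qid, query, query_0, stashed_results_0
reset_results  R                        qid, query, docno, score, query_0
dph            R                        qid, query, docno, score, query_0
=============  =======================  ===========================================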

Because RM3 needs the initial ranking of documents as input, we pass clear=False as the kwarg to stash_results().

Example: Collection Enrichment as a re-ranker

.. code-block:: python

    import pyterrier as pt

    # index: the corpus you are ranking
    # wiki_index: index of Wikipedia, used for enrichment
    dph = pt.terrier.Retriever(index)
    # parentheses let the pipeline span several lines
    pipe = (
        dph
        >> pt.rewrite.stash_results()
        >> pt.terrier.Retriever(wiki_index)
        >> pt.rewrite.RM3(wiki_index)
        >> pt.rewrite.reset_results()
        >> dph
    )

In general, collection enrichment describes conducting a PRF query expansion process on an external corpus (often Wikipedia), before applying the reformulated query to the main corpus. Collection enrichment can be used for improving a first-pass retrieval (pt.terrier.Retriever(wiki_index) >> pt.rewrite.RM3(wiki_index) >> pt.terrier.Retriever(main_index)). By contrast, the example above applies collection enrichment as a re-ranker.

Summary of dataframe types:

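=============  =======================  ===========================================
output of      dataframe contents       actual columns
=============  =======================  ===========================================
dph            R                        qid, query, docno, score
stash_results  Q + "stashed_results_0"  qid, query, stashed_results_0
Retriever      R + "stashed_results_0"  qid, query, docno, score, stashed_results_0
RM3            Q + "stashed_results_0"  qid, query, query_0, stashed_results_0
reset_results  R                        qid, query, docno, score, query_0
dph            R                        qid, query, docno, score, query_0
=============  =======================  ===========================================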

In this example, we have a Retriever instance executed on the wiki_index before RM3, so we clear the document ranking columns when using stash_results().