# IR Lab Tutorial: Stemming

This tutorial shows how to use a stemmer in PyTerrier.

**Attention:** The scenario below is cherry-picked to explain the concept of stopword lists with a minimal example.


## Preparation: Install dependencies

In [None]:
# This is only needed in Google Colab, in a dev container, everything should be installed already
!pip3 install python-terrier

## Our Scenario

We want to build a search engine that can generalize word forms of the same word stem (e.g. "producer" and "produces").

In [3]:
import pyterrier as pt
import pandas as pd
pd.set_option('display.max_colwidth', 0)

if not pt.started():
    pt.init(boot_packages=['mam10eks:custom-terrier-token-processing:0.0.1'])
    from jnius import autoclass

Try stemming some words with the Porter stemmer.

In [None]:
def stem_porter(t):
    stemmer = autoclass("org.terrier.terms.PorterStemmer")()
    return stemmer.stem(t)

print('are =>', stem_porter('are'))
print('producer =>', stem_porter('producer'))
print('produces =>', stem_porter('produces'))
print('corpora =>', stem_porter('corpus'))
# Feel free to try out other words, too.

are => ar
producer => produc
produces => produc
corpora => corpu


Now try to stem the same words with the Krovetz stemmer. Do you notice differences?

In [None]:
def stem_krovetz(t):
    stemmer = autoclass("org.terrier.terms.LemurKrovetzStemmer")()
    return stemmer.stem(t)

print('are =>', stem_krovetz('are'))
print('producer =>', stem_krovetz('producer'))
print('produces =>', stem_krovetz('produces'))
print('corpora =>', stem_krovetz('corpus'))
# Feel free to try out other words, too.

are => are
producer => producer
produces => produce
corpora => corpus


We now again build a little cherry-picked test collection to see which stemmer works best (or if stemming is needed at all).

In [8]:
documents = [
    {'docno': 'd1', 'text': 'producer'},
    {'docno': 'd2', 'text': 'produce'},
    {'docno': 'd2', 'text': 'produces'},
    {'docno': 'd4', 'text': 'tbd'},
]

topics = pd.DataFrame([
    {'qid': '1', 'query': 'produces'},
])

qrels = pd.DataFrame([
    {'qid': '1', 'docno': 'd1', 'relevance': 1},
    {'qid': '1', 'docno': 'd2', 'relevance': 1},
    {'qid': '1', 'docno': 'd3', 'relevance': 1},
])

Notice how the query uses a different word form than used in the documents. Still, we would like to find the same documents.

Create an index and corresponding BM25 retrieval model that uses no stemmer.

In [None]:
indexer_no_stemming = pt.IterDictIndexer("/tmp/index-no-stemming", overwrite=True, stemmer=None)
index_ref_no_stemming = indexer_no_stemming.index(documents)
index_no_stemming = pt.IndexFactory.of(index_ref_no_stemming)

bm25_no_stemming = pt.BatchRetrieve(index_no_stemming, wmodel="BM25")

pt.Experiment([bm25_no_stemming], topics, qrels, eval_metrics=['ndcg_cut_5'])

Unnamed: 0,name,ndcg_cut_5
0,BR(BM25),0.469279


Create an index and BM25 model that uses the Porter stemmer.

In [None]:
indexer_porter = pt.IterDictIndexer("/tmp/index-porter", overwrite=True, stemmer='PorterStemmer')
index_ref_porter = indexer_porter.index(documents)
index_porter = pt.IndexFactory.of(index_ref_porter)

bm25_porter = pt.BatchRetrieve(index_porter, wmodel="BM25")

pt.Experiment([bm25_porter], topics, qrels, eval_metrics=['ndcg_cut_5'])

Unnamed: 0,name,ndcg_cut_5
0,BR(BM25),0.765361


Create an index and BM25 model that uses the Krovetz stemmer.

In [None]:
indexer_krovetz = pt.IterDictIndexer("/tmp/index-krovetz", overwrite=True, stemmer='LemurKrovetzStemmer')
index_ref_krovetz = indexer_krovetz.index(documents)
index_krovetz = pt.IndexFactory.of(index_ref_krovetz)

bm25 = pt.BatchRetrieve(index_krovetz, wmodel="BM25")

pt.Experiment([bm25_krovetz], topics, qrels, eval_metrics=['ndcg_cut_5'])

Unnamed: 0,name,ndcg_cut_5
0,BR(BM25),0.469279


With PyTerrier, we can also directly compare the three options (no stemmer, Porter stemmer, and Krovetz stemmer).

In [None]:
pt.Experiment([bm25_no_stemming, bm25_porter, bm25_krovetz], topics, qrels, eval_metrics=['ndcg_cut_5'], names=["no stemming", "Porter", "Krovetz"])

### Question 1

With the observed effectiveness results, which stemmer would you choose and why?

### TODO: Add your solution.

### Question 2

Can you think of examples of two words with different meanings where the rule-based Porter stemmer could falsely reduce them to the same stem?

### TODO: Add your solution.

### Question 3

Update the experiment above to evaluate this word pair too. What do you observe? Is Porter still more effective?

### TODO: Add your solution.