# Lab 6 - Retrieval Models: BM25
CS 437  
Fall 2025  
Dr. Henderson  
_v1_

---

BM25 is one of the most effective ranking models that assume binary relevance. It is based on the _binary independence model_ (BIM), which makes the Naive Bayes assumption of _term independence_ to simplify the calculations.

Although the _binary independence model_ is known to perform poorly, we'll begin by implementing it so we can extend it to BM25.

The score for a document in the _binary independence model_ is given by
$$ bim\_score(d) = \sum_{i:d_i=1} \log \frac{p_i(1-s_i)}{s_i(1-p_i)} $$

where $d_i=1$ means term $i$ is present in the document, $p_i$ is the probability that term $i$ occurs in the relevant set, and $s_i$ is the probability that term $i$ occurs in the _non-relevant_ set.

Since we don't know the relevant set beforehand, we cannot calculate $p_i$ and $s_i$ directly, so we'll use the estimates described in the book. We'll choose $p_i = 0.5$ (a 50/50 chance that the term occurs in the relevant set) and $s_i = \frac{n_i}{N}$, where $n_i$ is the number of documents containing term $i$ and $N$ is the number of documents in the whole collection. Substituting these into the formula above, the factors of 0.5 cancel, giving us:
$$ bim\_score(d) = \sum_{i:d_i=1} \log \frac{N - n_i}{n_i} $$
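
The simplified formula can be tried out on a small made-up example before working with the real collection. The sketch below is not the lab solution: `bim_score_toy`, `toy_doc_freq`, `toy_doc`, and the counts in them are hypothetical, the document is represented as a set of terms rather than a file, and the natural logarithm is assumed. Summing only over the query terms and skipping terms with $n_i = 0$ or $n_i = N$ (where the logarithm is undefined) are also assumptions.

```python
import math

def bim_score_toy(doc_terms, query_terms, doc_freq, n_docs):
    """Sum log((N - n_i) / n_i) over query terms that are present in the document."""
    score = 0.0
    # The model is binary, so a repeated query term is only counted once.
    for term in set(query_terms):
        n_i = doc_freq.get(term, 0)
        # Only terms present in the document contribute (d_i = 1).
        # Assumption: skip n_i = 0 and n_i = N, where the log is undefined.
        if term in doc_terms and 0 < n_i < n_docs:
            score += math.log((n_docs - n_i) / n_i)
    return score

# Hypothetical collection statistics: 100 documents in total.
toy_n_docs = 100
toy_doc_freq = {"university": 10, "Australia": 4, "the": 95}
toy_doc = {"the", "university", "of", "Australia"}

# log(90/10) + log(96/4) = log(216), roughly 5.375
print(bim_score_toy(toy_doc, ["university", "Australia"], toy_doc_freq, toy_n_docs))
```

Note that a term occurring in more than half of the documents (such as `"the"` above) gets a negative weight, because $N - n_i < n_i$.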

_Note: To keep things simple, we won't preprocess the query or the corpus for this lab._

### Part I: Binary Independence Model

1. Create a function called `term_count()` which takes a string and returns the number of documents in the collection (found in the `docs` subdirectory) that contain it.  
   _Hint: In homework 1 you used a system command to quickly search a set of documents for a term._

2. Set a variable named `N` to the number of documents in the `docs` subdirectory, and state `N`.

3. Create a function named `bim_score()` which takes a document filename and a list of query terms and returns the BIM score for the document.

```python
_ut_3 = bim_score('docs/grauniad_news_001.txt', ['university', 'Australia'])
assert math.isclose(_ut_3, 4.434597, rel_tol=1e-6), f"Incorrect score: {_ut_3}"
```

4. Create a function named `bim_rank()` that takes a list of query terms, a count `k` (default 10), and returns a ranked list of the top `k` relevant documents from the `docs` collection.

5. Call your `bim_rank()` function with the terms "university" and "Australia". Assign the results to a variable named `bim_test` and state the value.

6. Review the top 2 relevant documents. Do you think they are _topically_ relevant and _user_ relevant?
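
The ranking in step 4 boils down to scoring every document and keeping the `k` best. One common pattern is sketched below on made-up data. The names `top_k` and `toy_scores`, the filenames, and the scores are all hypothetical, and returning a list of `(filename, score)` tuples is an assumption about the output format.

```python
import heapq

def top_k(scores, k=10):
    """Return the k highest-scoring (name, score) pairs, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda pair: pair[1])

# Hypothetical scores for four documents.
toy_scores = {"a.txt": 1.2, "b.txt": 4.4, "c.txt": 0.0, "d.txt": 3.1}

print(top_k(toy_scores, k=2))  # [('b.txt', 4.4), ('d.txt', 3.1)]
```

Sorting the full list with `sorted(..., reverse=True)` and slicing the first `k` entries would work equally well for a collection of this size.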

### Part II: BM25

The BM25 algorithm modifies the _binary independence model_ by adding what amounts to term frequency information. Again assuming that information about the relevant set is unavailable, it reduces to:

$$ BM25(Q,d) = \sum_{i \in Q} \log \frac{N - n_i + 0.5}{n_i + 0.5} \cdot \frac{(k_1 + 1)f_i}{K + f_i} \cdot \frac{(k_2 + 1) qf_i}{k_2 + qf_i} $$

where $f_i$ is the frequency of term $i$ in the document and $qf_i$ is the frequency of term $i$ in the query (usually just 1). $K$, $k_1$, and $k_2$ are parameters set empirically and

$$ K = k_1((1 - b) + b \frac{dl}{avdl}) $$

where $b$ is a parameter, $dl$ is the length of the document, and $avdl$ is the average length of a document.
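
To see how the pieces of the formula fit together, here is a minimal sketch that evaluates $K$ and a single term's contribution for made-up numbers. It is not the lab solution: the function names `bm25_K` and `bm25_term_weight`, their signatures (every quantity is passed in explicitly), and all of the values are hypothetical, and the natural logarithm is assumed.

```python
import math

def bm25_K(dl, avdl, k1, b):
    """Length normalisation: K = k1 * ((1 - b) + b * dl / avdl)."""
    return k1 * ((1 - b) + b * dl / avdl)

def bm25_term_weight(n_i, n_docs, f_i, K, k1, k2, qf_i=1):
    """Contribution of one query term to the BM25 sum."""
    idf_part = math.log((n_docs - n_i + 0.5) / (n_i + 0.5))
    tf_part = ((k1 + 1) * f_i) / (K + f_i)       # document term frequency
    qf_part = ((k2 + 1) * qf_i) / (k2 + qf_i)    # query term frequency
    return idf_part * tf_part * qf_part

# Hypothetical document that is 90% of the average length.
toy_K = bm25_K(dl=90, avdl=100, k1=1.2, b=0.75)
print(round(toy_K, 2))  # 1.11

# Hypothetical term: in 300 of 500,000 documents, 25 times in this document.
print(bm25_term_weight(n_i=300, n_docs=500_000, f_i=25, K=toy_K, k1=1.2, k2=100))

# A query term that does not occur in the document (f_i = 0) contributes 0.
print(bm25_term_weight(n_i=300, n_docs=500_000, f_i=0, K=toy_K, k1=1.2, k2=100))
```

Two things are worth noticing. When $qf_i = 1$, the last factor is $(k_2 + 1)/(k_2 + 1) = 1$, so $k_2$ has no effect on the score. And because $K$ grows with $dl$, the same term frequency counts for less in a longer document.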

Since we already have $N$ and $n_i$ from Part I, we'll start with $K$.

7. Create a variable called `avdl` which is the average document length (in words) of the documents in the collection (the `docs` directory). State the value of `avdl`.

8. Create a function named `get_K()` which takes the document length (in words) and values for $k_1$ and $b$ as parameters, and returns the value of $K$ in the BM25 equation.

```python
_ut_8 = get_K(1000, 1.2, 0.75)
assert math.isclose(_ut_8, 1.149478, rel_tol=1e-6), f"Incorrect K: {_ut_8}"
```

9. Create a function named `term_weight()` with a string parameter `term`, an integer parameter `f`, and float parameters `K`, `k1`, and `k2`. The function should calculate the weight for a single term in the BM25 equation. Assume that $qf_i = 1$.

10. Create a function named `bm25_score()` which takes as parameters a document filename, a list of query terms, and values for $b$, $k_1$, and $k_2$. The function should return the BM25 score for the document given the terms.

11. Create a function named `bm25_rank()` that takes a list of query terms, values for $b$, $k_1$, $k_2$, and a count `k` (default 10), and returns a ranked list of the top `k` relevant documents from the `docs` collection.

12. Call your `bm25_rank()` function with the terms "university" and "Australia" using `b=0.75`, `k1=1.2`, and `k2=100`. Assign the result to a variable named `bm25_test` and state the value.

13. Iterate over the ranks in `bm25_test` and, for each rank, print the corresponding tuples from `bim_test` and `bm25_test` side by side on a single line.

14. Compare and analyze the rankings from the two algorithms. 

---

### Submission Instructions

Be sure to ***SAVE YOUR WORK***!  

Next, select _Kernel > Restart Kernel and Run All Cells..._

Make sure there are no errors.

Use _File > Save and Export Notebook As > HTML_ then submit your HTML file to Canvas.