## Introduction - BM25 POC

BM25, which stands for "Best Matching 25," is a ranking function used in search engines and text retrieval systems. It refines the TF-IDF (Term Frequency-Inverse Document Frequency) weighting scheme to address some of its limitations.

Its main characteristics are:

1. **Relevance Scoring**: BM25 computes a relevance score for each document (or web page) with respect to a specific query. Search engines use this score to rank documents in order of relevance when a user performs a search.

2. **Term Frequency and Inverse Document Frequency**: Like TF-IDF, BM25 takes into account the frequency of terms (words) in a document (Term Frequency) and the rarity of those terms across all documents (Inverse Document Frequency). However, BM25 uses a different formula for calculating these values.

3. **Tuning Parameters**: BM25 introduces two tuning parameters, `k1` and `b`. `k1` controls how quickly the contribution of term frequency saturates, and `b` (between 0 and 1) controls how strongly scores are normalized by document length. This makes BM25 adaptable to different types of documents and search scenarios.

4. **Non-linear Relationship**: Unlike raw TF-IDF, where the score grows linearly with term frequency, BM25 applies a non-linear, saturating function to term frequency. As a term appears more often in a document, its contribution to the relevance score levels off, so documents with excessive keyword repetition do not receive disproportionately high scores.

5. **Improved Retrieval Performance**: BM25 has been found to perform well in various information retrieval tasks, such as document retrieval, web search, and text classification. It often ranks results more accurately than simple TF-IDF-based approaches.
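
For reference, one common form of the BM25 score of a document $D$ for a query $Q = (q_1, \dots, q_n)$ can be written as follows. The notation is introduced here and is not taken from the notebook: $f(q_i, D)$ is the number of times $q_i$ occurs in $D$, $|D|$ is the document length in tokens, $\mathrm{avgdl}$ is the average document length in the collection, $N$ is the number of documents, and $n(q_i)$ is the number of documents containing $q_i$.

$$
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \, \frac{|D|}{\mathrm{avgdl}}\right)}
$$

$$
\mathrm{IDF}(q_i) = \ln\!\left(1 + \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}\right)
$$

As $f(q_i, D)$ grows, the fraction approaches $k_1 + 1$, which is the saturation described in point 4. Implementations differ in how they smooth the IDF term (the original Okapi formulation omits the "$1 +$" inside the logarithm), so exact scores from a given library may not match this form.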

In summary, BM25 is a relevance scoring algorithm used to rank documents in search engines and information retrieval systems. It overcomes some of the limitations of TF-IDF by introducing tuning parameters and a non-linear relationship between term frequency and relevance, resulting in improved retrieval performance and more accurate search results.
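
To make the role of `k1` and `b` concrete, here is a minimal from-scratch sketch of BM25 scoring. It is not the implementation used by `rank_bm25`; the function name `bm25_scores`, the toy corpus, and the "+ 1" IDF smoothing are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in corpus_tokens against query_tokens."""
    n_docs = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        # Length normalization: b=0 disables it, b=1 applies it fully.
        length_norm = 1 - b + b * len(doc) / avgdl
        score = 0.0
        for term in query_tokens:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Smoothed IDF (assumed "+ 1" variant, which stays positive).
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Saturating term-frequency component, bounded above by k1 + 1.
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Toy corpus (made up for this sketch): the second document repeats "toy" four times.
corpus = [
    ["toy", "story"],
    ["toy", "toy", "toy", "toy"],
    ["jumanji"],
]
print(bm25_scores(["toy"], corpus))
```

The second document contains the query term four times as often as the first, but its score is far less than four times higher, which is the saturation effect. The third document does not contain the term and scores zero.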

# POC

In [1]:
import pickle

import numpy as np
from rank_bm25 import BM25Okapi

from movielens_ai_playground.io.read_data import read_movies_data

In [2]:
MOVIES_PATH = "../data/movielens-100k/u.item"
movies_df = read_movies_data(path=MOVIES_PATH)

## 1. Build Index

In [3]:
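print("Start building bm25 with titles")
# NOTE: BM25Okapi expects a tokenized corpus (a list of token lists).
# Passing raw title strings, as here, makes it treat each character as a token.
bm25_title = BM25Okapi(movies_df.title)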

Start building bm25 with titles
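
`BM25Okapi` iterates over each document to collect its tokens, so word-level matching requires splitting titles into words first. A minimal sketch of such a step, assuming a simple lowercase alphanumeric tokenizer (the helper name `tokenize` and the sample titles are illustrative and are not read from the dataset):

```python
import re


def tokenize(text):
    """Lowercase text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Illustrative titles, hard-coded for this sketch.
titles = ["Toy Story (1995)", "GoldenEye (1995)", "Babes in Toyland (1961)"]
tokenized_titles = [tokenize(title) for title in titles]
print(tokenized_titles)
# [['toy', 'story', '1995'], ['goldeneye', '1995'], ['babes', 'in', 'toyland', '1961']]
```

With `rank_bm25`, the tokenized list would be passed as `BM25Okapi(tokenized_titles)` and the query as `get_scores(tokenize(query))`. The outputs recorded in this notebook were produced without this step, so rankings would likely differ.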


## 2. Ask query

In [4]:
query = "Toy"

In [5]:
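# NOTE: get_scores expects a list of query tokens; a raw string is scored
# character by character, consistent with how the index was built above.
doc_scores = bm25_title.get_scores(query)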

## 3. Get Indexes to be shown

In [6]:
# Sort document positions by descending score (0-based row positions, not movie IDs)
sorted_indices = np.argsort(doc_scores)[::-1]
print(sorted_indices)

[   0  789 1182 ...  516  525  472]


## 4. Retrieve movie_id

In [7]:
bm25_df = movies_df.iloc[sorted_indices]

In [8]:
# Take the movie IDs of the top `hits` results
hits = 5
bm25_top = bm25_df.movieId[:hits].values
print(bm25_top)

[   1  790 1183  328 1484]
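
The cells above always return `hits` rows, even when fewer documents actually match the query. A small sketch of a helper that ranks and drops zero-score documents in one step; the name `top_k_ids` and the sample values are hypothetical and not part of the notebook:

```python
def top_k_ids(scores, ids, k=5):
    """Return up to k ids ordered by descending score, skipping zero-score items."""
    ranked = sorted(zip(scores, ids), key=lambda pair: pair[0], reverse=True)
    return [item_id for score, item_id in ranked[:k] if score > 0]


scores = [0.0, 1.2, 0.0, 0.7]  # made-up scores
movie_ids = [10, 20, 30, 40]   # made-up ids
print(top_k_ids(scores, movie_ids, k=3))  # [20, 40]
```

In this notebook, `scores` and `ids` would correspond to `doc_scores` and `movies_df.movieId`.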
