# Okapi BM25

This notebook implements [Okapi BM25](https://en.wikipedia.org/wiki/Okapi_BM25), a TF-IDF-based ranking function for search. We implement it mainly as a baseline to contrast with Embed-and-Rerank: we want to see whether the queries this algorithm gets wrong are answered correctly by Embed-and-Rerank, or vice versa.
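To make the scoring idea concrete, here is a minimal pure-Python sketch of a textbook BM25 formula. It is not the code used in this notebook (which relies on the `rank_bm25` package), and the library's IDF smoothing may differ from the variant shown. The function name `bm25_scores`, the toy corpus, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in corpus_tokens against query_tokens."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Smoothed IDF; the "+ 1" inside the log keeps it positive.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Term-frequency saturation with document-length normalisation.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

# Made-up toy corpus, already tokenized.
toy_corpus = [
    ["couple", "walks", "paris", "night"],
    ["detective", "hides", "amish", "farm"],
    ["writer", "visits", "paris"],
]
toy_scores = bm25_scores(["paris", "night"], toy_corpus)
print(toy_scores)
```

In this toy corpus the first document matches both query terms, the third matches only one, and the second matches none, so it scores zero. A document that shares no tokens with the query can never be ranked by a purely lexical method, which is the weakness the comparison with Embed-and-Rerank is meant to probe.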

## Downloading and importing packages

In [None]:
!pip install rank_bm25

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank-bm25
Successfully installed rank-bm25-0.2.2


In [None]:
# All the necessary imports

from rank_bm25 import BM25Okapi
from sklearn.feature_extraction import _stop_words
import string
from tqdm.autonotebook import tqdm
import numpy as np
import pandas as pd
import ast

  


## Setting hyperparameters and data preprocessing

In [None]:
# Some hyperparameters

pre_prune_results = 100
results_to_show = 10

In [None]:
# Mount Google Drive and load the plots dataset

from google.colab import drive
drive.mount("/content/gdrive")

plots = pd.read_csv("/content/gdrive/MyDrive/imdb_plots.csv", compression="zip", converters={'to_embed': ast.literal_eval})

plots['MovieId'] = plots.index
plots = plots.drop(['Unnamed: 0', 'Unnamed: 0.1'], axis=1)

Mounted at /content/gdrive


In [None]:
# Flatten the corpus: one row per (MovieId, plot fragment) pair
movie_ids = []
to_embed = []
for row in plots.iterrows():
  movie_id = row[1]['MovieId']
  for frag in row[1]['to_embed']:
    movie_ids.append(movie_id)
    to_embed.append(frag)

id_and_summary = pd.DataFrame({'MovieId': movie_ids, 'to_embed': to_embed})

In [None]:
# Build the test queries: up to two IMDB summaries per movie, skipping missing ones
movie_ids = []
queries = []
for row in plots.iterrows():
  movie_id = row[1]['MovieId']
  summ1 = row[1]['imdb_1']
  summ2 = row[1]['imdb_2']
  if not pd.isna(summ1):
    movie_ids.append(movie_id)
    queries.append(summ1)
  if not pd.isna(summ2):
    movie_ids.append(movie_id)
    queries.append(summ2)

test_queries = pd.DataFrame({'MovieId': movie_ids, 'summary': queries})

In [None]:
# Lowercase the text, strip surrounding punctuation, and drop stop words before indexing
def bm25_tokenizer(text):
    tokenized_doc = []
    for token in text.lower().split():
        token = token.strip(string.punctuation)

        if len(token) > 0 and token not in _stop_words.ENGLISH_STOP_WORDS:
            tokenized_doc.append(token)
    return tokenized_doc
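The tokenizer above can be tried on a single sentence to see what survives. The sketch below mirrors its logic but swaps scikit-learn's stop-word list for a tiny hypothetical set, `STOP_WORDS`, so that it runs without extra dependencies. The function name `toy_tokenizer` and the sample sentence are likewise made up for illustration.

```python
import string

# Hypothetical stand-in for sklearn's ENGLISH_STOP_WORDS, kept tiny on purpose.
STOP_WORDS = frozenset({"a", "the", "in", "all", "through"})

def toy_tokenizer(text):
    tokens = []
    for token in text.lower().split():
        # strip() only removes punctuation at the ends of a token,
        # so inner apostrophes and hyphens are preserved.
        token = token.strip(string.punctuation)
        if token and token not in STOP_WORDS:
            tokens.append(token)
    return tokens

sample_tokens = toy_tokenizer("The couple's walk through Paris, all night -- (well-lit)")
print(sample_tokens)
# ["couple's", 'walk', 'paris', 'night', 'well-lit']
```

Note that no stemming is applied, so "walk" and "walks" remain different tokens and will not match each other at query time.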

In [None]:
tokenized_corpus = []
for passage in id_and_summary['to_embed']:
  tokenized_corpus.append(bm25_tokenizer(passage))

bm25 = BM25Okapi(tokenized_corpus)

In [None]:
# Query the index and return up to `results_to_show` unique titles with their scores

def lexical_query(query_string, bm25_corpus, id_and_summary, wiki_dataset):
  # Score every fragment using the index passed in (not the global `bm25`)
  bm25_scores = bm25_corpus.get_scores(bm25_tokenizer(query_string))
  # Keep the `pre_prune_results` best-scoring fragments (unordered), then sort them
  top_n = np.argpartition(bm25_scores, -pre_prune_results)[-pre_prune_results:]
  bm25_hits = [{'corpus_id': idx, 'score': bm25_scores[idx]} for idx in top_n]
  bm25_hits = sorted(bm25_hits, key=lambda x: x['score'], reverse=True)

  results = []
  for raw_res in bm25_hits:
    if len(results) >= results_to_show:
      break

    corpus_id = raw_res['corpus_id']
    score = raw_res['score']
    movie_id = id_and_summary['MovieId'][corpus_id]
    movie_title = wiki_dataset['Title'][movie_id]
    movie_year = wiki_dataset['Release Year'][movie_id]
    # Several fragments can map to the same title; keep only the best-scoring one
    if movie_title.strip() not in map(lambda x: x[0][0].strip(), results):
      results.append(((movie_title, movie_year), score))
  return results

# Fraction of queries whose target movie title appears among the returned results
def measure_accuracy(query_dataset, bm25_corpus, id_and_summary, wiki_dataset):
  total = 0
  correct = 0

  for row in tqdm(query_dataset.iterrows()):
    query_string = row[1]['summary']
    movie_id = row[1]['MovieId']

    hits = lexical_query(query_string, bm25_corpus, id_and_summary, wiki_dataset)
    movie_title = wiki_dataset['Title'][movie_id]
    if movie_title.strip() in map(lambda x: x[0][0].strip(), hits):
      correct += 1
    total += 1

  return correct/total
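Because `measure_accuracy` counts a query as correct whenever the target title appears anywhere in the returned list, the number it reports is a top-k hit rate and not a top-1 accuracy. The sketch below isolates that metric on toy data. The helper name `hit_rate_at_k` and the rankings are made up for illustration and are not results from this notebook.

```python
def hit_rate_at_k(ranked_titles_per_query, expected_titles, k=10):
    """Fraction of queries whose expected title is among the first k ranked titles."""
    hits = 0
    for ranked, expected in zip(ranked_titles_per_query, expected_titles):
        top_k = [title.strip() for title in ranked[:k]]
        if expected.strip() in top_k:
            hits += 1
    return hits / len(expected_titles)

# Toy rankings for two queries that both target the same movie.
toy_rankings = [
    ["Witness", "Target", "Paris, Texas"],           # target missing: a miss
    ["An Education", "Midnight in Paris", "Target"],  # target at rank 2: a hit for k >= 2
]
toy_expected = ["Midnight in Paris", "Midnight in Paris"]

print(hit_rate_at_k(toy_rankings, toy_expected, k=10))  # 0.5
print(hit_rate_at_k(toy_rankings, toy_expected, k=1))   # 0.0
```

Reporting the value of k next to the score makes comparisons with other retrieval methods easier to interpret, since the same system can look very different at k=1 and k=10.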

## An example

This query is an example that Okapi BM25 gets wrong but Embed-and-Rerank gets right. The movie being referenced is "Midnight in Paris", which does not appear anywhere in the BM25 results for the query.

In [None]:
query = "couple walks through paris all night"

lexical_query(query, bm25, id_and_summary, plots)

[(('Witness', 1985), 12.047900192294906),
 (('Target', 1985), 10.716933606280001),
 (('Picture Perfect', 1997), 10.131693096010025),
 (('An American Werewolf in Paris', 1997), 10.066780063295393),
 (('De-Lovely', 2004), 9.704919676517996),
 (('Paris, Texas', 1984), 9.53951184051008),
 (('An Education', 2009), 9.289532110099987),
 (('Pet Sematary', 1989), 9.063614686950345),
 (('Revolutionary Road', 2008), 8.845658806373812),
 (('Unlawful Entry', 1992), 8.737652543255958)]

## Testing performance on IMDB query set

In [None]:
test_queries

| | MovieId | summary |
|---|---|---|
| 0 | 0 | A chivalrous British officer takes the blame f... |
| 1 | 0 | Captain Wynnegate leaves England, accepting th... |
| 2 | 1 | A naive country girl is tricked into a sham ma... |
| 3 | 1 | The callous rich, portrayed by Lennox, think o... |
| 4 | 2 | An extended family split up in France and Germ... |
| ... | ... | ... |
| 9982 | 5268 | Four girls travel to a party in an isolated ho... |
| 9983 | 5269 | Jae-hyuk is an ordinary man in his 40s. He wor... |
| 9984 | 5270 | Esra working for a logistics firm lives with h... |
| 9985 | 5271 | Recep Ivedik has been depressed since the deat... |


In [None]:
measure_accuracy(test_queries, bm25, id_and_summary, plots)

0.8405927706017823