
In [None]:
a = "purple is the best city in the forest".split()
b = "there is an art to getting your way and throwing bananas on to the street is not it".split()
c = "it is not often you find soggy bananas on the street".split()
d = "green should have smelled more tranquil but somehow it just tasted rotten".split()
e = "joyce enjoyed eating pancakes with ketchup".split()
f = "as the asteroid hurtled toward earth becky was upset her dentist appointment ha".split()

In [None]:
docs = [a, b, c, d, e, f]

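Okapi BM25 is the successor to TF-IDF. It refines TF-IDF mainly by normalizing scores for document length.

TF-IDF works well, but it can return questionable results when we compare documents that mention a term a different number of times. With raw term counts, a document that mentions a term twice as often scores twice as high, which rarely means it is twice as relevant.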
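
To make the contrast concrete, here is a minimal sketch of the BM25 term-frequency component on its own. It uses the same expression as the `bm25` function later in this notebook; the helper name `bm25_tf` and the document lengths (250, 500, 1000 tokens) are assumptions chosen purely for illustration.

```python
# Sketch: how the BM25 term-frequency component behaves as the count grows.
# bm25_tf is a hypothetical helper; k=1.2 and b=0.75 are the usual defaults.
def bm25_tf(freq, doc_len, avg_len, k=1.2, b=0.75):
    return (freq * (k + 1)) / (freq + k * (1 - b + b * doc_len / avg_len))

# With doc_len == avg_len the length normalization drops out, so only
# saturation is visible: the value approaches k + 1 = 2.2 but never reaches it.
for freq in (1, 2, 6, 12):
    print(freq, round(bm25_tf(freq, doc_len=500, avg_len=500), 3))
# 1 -> 1.0, 2 -> 1.375, 6 -> 1.833, 12 -> 2.0

# Length normalization: the same single mention counts for more in a
# shorter document and for less in a longer one.
print(bm25_tf(1, doc_len=250, avg_len=500) > bm25_tf(1, doc_len=1000, avg_len=500))
```

Doubling the mentions from six to twelve raises the component only from about 1.83 to 2.0, rather than doubling it.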

In [None]:
import numpy as np

avgdl = sum(len(sentence) for sentence in docs) / len(docs)  # average document length in tokens
N = len(docs)

def bm25(word, sentence, k=1.2, b=0.75):
    # f(q, D): how often the query word occurs in this document
    freq = sentence.count(word)
    # term-frequency component, normalized by document length relative to avgdl
    tf = (freq * (k + 1)) / (freq + k * (1 - b + b * (len(sentence) / avgdl)))
    # total occurrences of the word across all docs (standard BM25 counts docs instead)
    N_q = sum(doc.count(word) for doc in docs)
    idf = np.log(((N - N_q + 0.5) / (N_q + 0.5)) + 1)
    return round(tf * idf, 4)
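
One detail worth checking in the function above: `N_q` sums `doc.count(word)`, so a word that repeats inside a single document is counted once per occurrence. The usual BM25 IDF uses the number of documents that contain the word. The sketch below shows that variant under the hypothetical name `bm25_df`; it also swaps `np.log` for `math.log` so that it runs without NumPy. It is an illustration, not a replacement for the notebook's function.

```python
import math

# The six toy sentences from this notebook, tokenized on whitespace.
docs = [s.split() for s in [
    "purple is the best city in the forest",
    "there is an art to getting your way and throwing bananas on to the street is not it",
    "it is not often you find soggy bananas on the street",
    "green should have smelled more tranquil but somehow it just tasted rotten",
    "joyce enjoyed eating pancakes with ketchup",
    "as the asteroid hurtled toward earth becky was upset her dentist appointment ha",
]]
N = len(docs)
avgdl = sum(len(doc) for doc in docs) / N

def bm25_df(word, sentence, k=1.2, b=0.75):
    """Hypothetical variant of bm25 that uses document frequency in the IDF."""
    freq = sentence.count(word)
    tf = (freq * (k + 1)) / (freq + k * (1 - b + b * len(sentence) / avgdl))
    n_q = sum(1 for doc in docs if word in doc)  # number of docs containing the word
    idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)
    return round(tf * idf, 4)

print(bm25_df("purple", docs[0]))  # same as bm25: 'purple' occurs once, in one doc
print(bm25_df("is", docs[0]))      # differs from bm25: 'is' occurs twice in the second sentence
```

The two versions agree whenever a word appears at most once per document and diverge only for repeated words such as 'is' and 'the'.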

In [None]:
bm25('purple', a)

1.7511
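
For reference, the per-term score that `bm25` computes can be written as follows, where $f(q, D)$ is `freq`, $|D|$ is `len(sentence)`, and $n(q)$ is `N_q` (as coded, the summed occurrence count of the word across all documents):

$$
\text{score}(q, D) = \frac{f(q, D)\,(k + 1)}{f(q, D) + k\left(1 - b + b\,\dfrac{|D|}{\text{avgdl}}\right)} \cdot \ln\!\left(\frac{N - n(q) + 0.5}{n(q) + 0.5} + 1\right)
$$

Working through the call above by hand: sentence `a` has $|D| = 8$ tokens, the six sentences hold 68 tokens in total so $\text{avgdl} = 68/6 \approx 11.33$, and 'purple' occurs once, so $f(q, D) = 1$ and $n(q) = 1$ with $N = 6$:

$$
\frac{1 \cdot 2.2}{1 + 1.2\,(0.25 + 0.75 \cdot 8/11.33)} \approx 1.1368, \qquad \ln\!\left(\frac{5.5}{1.5} + 1\right) \approx 1.5404, \qquad 1.1368 \times 1.5404 \approx 1.7511
$$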

We’ve used the default parameters for k and b, and the outputs look promising. The query 'purple' matches only sentence a, and 'bananas' scores reasonably for both b and c, but slightly higher in c thanks to its smaller word count.
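
Only the 'purple' call is shown above, so here is a small self-contained sketch that checks both claims. It repeats the notebook's sentences and scoring logic, with `math.log` standing in for `np.log` to avoid the NumPy dependency; the `names` string and the two result variables are illustrative additions.

```python
import math

docs = [s.split() for s in [
    "purple is the best city in the forest",
    "there is an art to getting your way and throwing bananas on to the street is not it",
    "it is not often you find soggy bananas on the street",
    "green should have smelled more tranquil but somehow it just tasted rotten",
    "joyce enjoyed eating pancakes with ketchup",
    "as the asteroid hurtled toward earth becky was upset her dentist appointment ha",
]]
names = "abcdef"  # labels matching the notebook's variable names
N = len(docs)
avgdl = sum(len(doc) for doc in docs) / N

def bm25(word, sentence, k=1.2, b=0.75):
    freq = sentence.count(word)
    tf = (freq * (k + 1)) / (freq + k * (1 - b + b * len(sentence) / avgdl))
    n_q = sum(doc.count(word) for doc in docs)  # same counting as the notebook's bm25
    idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)
    return round(tf * idf, 4)

# Which sentences does 'purple' match at all?
purple_hits = [name for name, doc in zip(names, docs) if bm25("purple", doc) > 0]
print(purple_hits)

# Score 'bananas' against every sentence; b has 18 tokens, c has 11.
banana_scores = {name: bm25("bananas", doc) for name, doc in zip(names, docs)}
print(banana_scores)
```

Sentences without the query word score exactly zero because `freq` is zero, so the comparison between b and c comes down to length normalization alone.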

In [None]:
vocab = set(a+b+c+d+e+f)
print(vocab)

{'appointment', 'street', 'becky', 'earth', 'her', 'somehow', 'is', 'there', 'purple', 'have', 'was', 'more', 'art', 'but', 'ha', 'just', 'pancakes', 'should', 'your', 'hurtled', 'soggy', 'often', 'green', 'tranquil', 'best', 'ketchup', 'throwing', 'bananas', 'city', 'to', 'joyce', 'smelled', 'getting', 'as', 'upset', 'you', 'with', 'on', 'forest', 'dentist', 'find', 'and', 'in', 'not', 'it', 'rotten', 'enjoyed', 'eating', 'tasted', 'an', 'asteroid', 'way', 'the', 'toward'}


In [None]:
vec = []
# we will create the BM25 vector for sentence 'a'
for word in vocab:
    vec.append(bm25(word, a))
print(vec)

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5023, 0.0, 1.7511, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.7511, 0.0, 0.0, 0.0, 1.7511, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.7511, 0.0, 0.0, 0.0, 1.7511, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3615, 0.0]
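
The same loop extends to every sentence. The sketch below builds one BM25 vector per sentence and then ranks the sentences for a two-word query by summing per-term scores, which is how BM25 is commonly applied to multi-term queries. Sorting the vocabulary is an assumption added here so that the column order is stable between runs (a plain `set` has no guaranteed order); `score`, `matrix`, `ranking`, and the example query are illustrative, and `math.log` again replaces `np.log`.

```python
import math

docs = [s.split() for s in [
    "purple is the best city in the forest",
    "there is an art to getting your way and throwing bananas on to the street is not it",
    "it is not often you find soggy bananas on the street",
    "green should have smelled more tranquil but somehow it just tasted rotten",
    "joyce enjoyed eating pancakes with ketchup",
    "as the asteroid hurtled toward earth becky was upset her dentist appointment ha",
]]
N = len(docs)
avgdl = sum(len(doc) for doc in docs) / N

def bm25(word, sentence, k=1.2, b=0.75):
    freq = sentence.count(word)
    tf = (freq * (k + 1)) / (freq + k * (1 - b + b * len(sentence) / avgdl))
    n_q = sum(doc.count(word) for doc in docs)  # same counting as the notebook's bm25
    idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)
    return round(tf * idf, 4)

# Sorted vocabulary gives every vector the same, reproducible column order.
vocab = sorted({word for doc in docs for word in doc})

# One row per sentence, one column per vocabulary word.
matrix = [[bm25(word, doc) for word in vocab] for doc in docs]
print(len(matrix), len(vocab))

def score(query, doc):
    """Sum the per-term BM25 scores of a whitespace-separated query."""
    return sum(bm25(word, doc) for word in query.split())

# Rank sentence indices for an example query, best match first.
query = "soggy bananas"
ranking = sorted(range(N), key=lambda i: score(query, docs[i]), reverse=True)
print(ranking)
```

Row 0 of `matrix` holds the same values as the vector printed above, only in alphabetical column order.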
