--------------------------
#### ChromaDB

- Use of BM25 in IR 
----------------------------

In [1]:
import pandas as pd

Using the **books CSV** dataset.

In [2]:
# Load the books csv
# https://www.kaggle.com/datasets/saurabhbagchi/books-dataset/data
file_path = r'D:\AI-DATASETS\02-MISC-large\books.csv'
df = pd.read_csv(file_path, encoding='ISO-8859-1', sep=';', on_bad_lines='skip', low_memory=False)

In [3]:
df.shape

(271360, 8)

In [4]:
df.drop(['Image-URL-S', 'Image-URL-M', 'Image-URL-L'], axis=1, inplace=True)

In [7]:
df.sample(10)

| | ISBN | Book-Title | Book-Author | Year-Of-Publication | Publisher |
|---|---|---|---|---|---|
| 138990 | 0385310250 | Sudden Exposure: A Jill Smith Mystery (Jill Sm... | Susan Dunlap | 1996 | Bantam Dell Pub Group |
| 180016 | 0373075170 | One Last Chance (American Heroes) (Silhouette ... | Justine Davis | 1993 | Silhouette |
| 214639 | 003016656X | Prisoners of the Scrambling Dragon | F. N. Monjo | 1980 | Henry Holt &amp; Co |
| 144511 | 0195072790 | Understanding Depression: A Complete Guide to ... | Donald F., M.D. Klein | 1992 | Oxford University Press |
| 194728 | 0471197335 | Corporate Information Factory | William H. Inmon | 1997 | John Wiley &amp; Sons |
| 256269 | 3806819580 | Japanische KÃ?Â¼che. Einfach gut. | Marianne Kaltenbach | 1998 | Falken |
| 213169 | 0451521811 | A Girl of the Limberlost | Gene S. Porter | 1991 | New Amer Library Classics |
| 218249 | 0684807572 | BRAINSTYLES : Change Your Life Without Changin... | Marlene Miller | 1997 | Simon &amp; Schuster |
| 12290 | 0743203232 | The Nature of Water and Air | Regina McBride | 2001 | Touchstone |
| 171737 | 0297794868 | Boadicea's chariot: The warrior queens | Antonia Fraser | 1988 | Weidenfeld and Nicolson |


In [8]:
import re
import string
#import spacy
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

In [10]:
def preprocess_text(text):
    """Lowercase, normalize ampersands, strip punctuation, collapse whitespace."""
    text = text.lower()  # Lowercasing

    # Normalize the HTML entity '&amp;' and any bare '&' to 'and'.
    # This has to happen before punctuation is stripped: otherwise the ';'
    # is removed first and '&amp;' ends up as 'andamp'.
    text = text.replace('&amp;', 'and').replace('&', 'and')

    # Remove all punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Collapse runs of whitespace and trim leading/trailing whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text

In [11]:
# Example usage
sample_text = "This is an example   with   excessive  & whitespace!"
cleaned_text = preprocess_text(sample_text)
print(cleaned_text)

this is an example with excessive and whitespace


In [12]:
# Drop rows with any null values; .copy() makes df_cleaned an independent
# DataFrame, so adding columns below does not trigger SettingWithCopyWarning
df_cleaned = df.dropna().copy()

In [13]:
%%time
# Build a single searchable text field and apply the preprocessing
df_cleaned['text'] = df_cleaned['Book-Title'] + ' ' + df_cleaned['Book-Author'] + ' ' + df_cleaned['Publisher']
df_cleaned['text'] = df_cleaned['text'].apply(preprocess_text)

CPU times: total: 1.69 s
Wall time: 4.3 s


In [14]:
df_cleaned.shape

(271356, 6)

In [15]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 271356 entries, 0 to 271359
Data columns (total 6 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   ISBN                 271356 non-null  object
 1   Book-Title           271356 non-null  object
 2   Book-Author          271356 non-null  object
 3   Year-Of-Publication  271356 non-null  object
 4   Publisher            271356 non-null  object
 5   text                 271356 non-null  object
dtypes: object(6)
memory usage: 14.5+ MB


In [16]:
df_cleaned_samples = df_cleaned.sample(2500)

In [17]:
# Initialize the TF-IDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=10000)

In [18]:
# Generate TF-IDF vectors for the combined book text
tfidf_matrix = vectorizer.fit_transform(df_cleaned_samples.text)

#### BM25

BM25 (Best Matching 25) is a ranking function used in information retrieval systems to evaluate the relevance of documents in relation to a query. It builds upon the probabilistic information retrieval model and is particularly effective for scoring and ranking documents based on the frequency of query terms within them. Here's a breakdown of BM25 and its differences from TF-IDF:

**Relevance Scoring:** BM25 calculates a score for each document based on the presence and frequency of the terms in the query. The score reflects how well the document matches the query.

**Components:**

- **Term Frequency (TF):** Similar to TF-IDF, BM25 considers how often a term appears in a document. However, it uses a saturation function to diminish the effect of term frequency as it increases.
  
- **Inverse Document Frequency (IDF):** BM25 employs IDF to account for the rarity of terms. Rare terms contribute more to the score than common terms.
  
- **Document Length Normalization:** BM25 normalizes for document length, ensuring that longer documents do not have an unfair advantage simply because they contain more terms.
  
- **Parameters:** BM25 has parameters like `k1` (controls the impact of term frequency) and `b` (adjusts the normalization based on document length). This flexibility allows for tuning based on specific datasets.


##### Why We Need to Saturate the Term Frequency (TF)

##### 1. Diminishing Returns on Term Relevance
- As the frequency of a term in a document increases, its contribution to relevance does not increase linearly. In fact, the impact of additional occurrences of the term diminishes.
- For instance, if a document mentions a keyword 1 time, it may be relevant; if it mentions it 10 times, it doesn't necessarily mean it is 10 times more relevant. Saturation accounts for this diminishing returns effect.

##### 2. Avoiding Overemphasis on Frequent Terms
- Without saturation, documents with very high term frequencies might be unfairly prioritized, even if the actual content and relevance to the query are low.
- This is particularly important for documents that may repeat a term excessively, as they could skew the results and lead to poor search quality.

##### 3. Enhancing Precision in Ranking
- Saturation helps create a more balanced scoring system. It allows for a better distinction between documents with varying term frequencies.
- For example, a document with a term frequency of 5 might be seen as more relevant than one with a frequency of 1, but not overwhelmingly so. The use of saturation can help refine the score, ensuring the ranking is more precise and meaningful.

##### 4. Consistency Across Document Lengths
- Documents vary significantly in length, which leads to large differences in raw term frequencies. Saturation does not normalize for length by itself, but it caps how much a high raw count can add to the score.
- Long documents, where a high term frequency may result from sheer length rather than relevance, are handled by BM25's separate length normalization (the `b` parameter), which works together with saturation.

##### 5. Parameter Control
- Saturation provides a means to control the behavior of the scoring function through parameters like `k1`. By adjusting this parameter, users can fine-tune how quickly the effects of term frequency diminish, allowing for flexibility based on specific datasets and requirements.
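
The role of `k1` can be made concrete with the term-frequency factor of the BM25 formula given later in this notebook. As a simplification, assume no length normalization ($b = 0$). The factor then has a finite upper bound:

$$
\lim_{\operatorname{TF}\to\infty}\frac{\operatorname{TF}\cdot\left(k_1+1\right)}{\operatorname{TF}+k_1}=k_1+1
$$

Under the same assumption, the factor equals $1$ at $\operatorname{TF}=1$ and reaches half of its maximum, $\frac{k_1+1}{2}$, at $\operatorname{TF}=k_1$. A smaller $k_1$ therefore saturates faster, while a larger $k_1$ lets repeated occurrences keep contributing for longer.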

##### Example of Term Frequency Saturation
Consider a term "machine learning" in two documents:

- **Document A:** "Machine learning is a fascinating field. Machine learning can change the world."
- **Document B:** "Machine learning is machine learning machine learning machine learning machine learning machine learning."

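- **Without Saturation:** Document B scores three times higher on raw count alone (6 occurrences versus 2 in Document A).
- **With Saturation:** Document B still scores higher, but its advantage shrinks to roughly 1.4 times (with `k1 = 1.5` and no length normalization), because each additional occurrence adds less than the previous one. BM25 does not model context or meaning; the benefit comes purely from bounding the effect of repetition.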
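
The effect on the two example documents can be sketched numerically. This is a minimal illustration, not the library's code: `saturated_tf` is a hypothetical helper, `k1 = 1.5` is an assumed value, length normalization is ignored, and "machine learning" is counted as a phrase with a simple regular expression.

```python
import re

def saturated_tf(tf, k1=1.5):
    """BM25 term-frequency factor without length normalization (b = 0)."""
    return tf * (k1 + 1) / (tf + k1)

doc_a = "Machine learning is a fascinating field. Machine learning can change the world."
doc_b = ("Machine learning is machine learning machine learning "
         "machine learning machine learning machine learning.")

for name, doc in [("A", doc_a), ("B", doc_b)]:
    # Count case-insensitive occurrences of the phrase
    tf = len(re.findall(r"machine learning", doc.lower()))
    print(f"Document {name}: raw TF = {tf}, saturated TF = {saturated_tf(tf):.2f}")
```

The raw counts differ by a factor of three, while the saturated values differ by a much smaller margin.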


#### 2. Inverse Document Frequency (IDF)

Inverse Document Frequency (IDF) in BM25 is conceptually similar to IDF in TF-IDF, but there are some differences in how they are calculated and used in their respective formulas. Here’s a breakdown of the similarities and differences:

`Similarities`

`Purpose`: Both IDF measures are designed to reflect the importance of a term across a collection of documents. The primary goal is to reduce the weight of common terms and increase the weight of rare terms in the scoring process.

`Concept`: In both cases, IDF is based on the idea that terms that appear in many documents are less informative than terms that appear in fewer documents. Therefore, IDF contributes to emphasizing the significance of rarer terms.

`Differences`

1. Mathematical Formulation:
   
- `TF-IDF IDF`:

$$
\operatorname{IDF}(t)=\log \left(\frac{N}{\operatorname{df}(t)}\right)
$$


Where $N$ is the total number of documents and $\operatorname{df}(t)$ is the number of documents containing the term $t$.

- `BM25 IDF`:

$$
\operatorname{IDF}(t)=\log \left(\frac{N-\operatorname{df}(t)+0.5}{\operatorname{df}(t)+0.5}\right)
$$


In BM25, a smoothing factor (0.5) is added to both the numerator and denominator to prevent division by zero and to smooth the effect of terms that appear in very few documents.

2. Normalization:
   
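- The 0.5 constants keep the BM25 IDF finite even when $\operatorname{df}(t)=0$ or $\operatorname{df}(t)=N$, which makes it more stable for very rare and very common terms. Unlike the TF-IDF form, the BM25 IDF becomes negative for terms that occur in more than half of the documents, so implementations often clamp or otherwise adjust such values.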
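
To compare the two formulas side by side, the following sketch evaluates both for a few document frequencies. The helper names `idf_tfidf` and `idf_bm25` are hypothetical, and `n_docs = 2500` is chosen only to match the sample size used in this notebook. Library implementations may use modified variants of either formula.

```python
import math

def idf_tfidf(n_docs, df):
    """Textbook TF-IDF inverse document frequency: log(N / df)."""
    return math.log(n_docs / df)

def idf_bm25(n_docs, df):
    """BM25 inverse document frequency with 0.5 smoothing."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))

n_docs = 2500  # assumed corpus size, matching the 2500-row sample
for df in (1, 10, 100, 1250, 2000):
    print(f"df={df:>4}  tfidf_idf={idf_tfidf(n_docs, df):7.3f}  "
          f"bm25_idf={idf_bm25(n_docs, df):7.3f}")
```

For rare terms the two values are close. Once a term appears in more than half of the documents, the BM25 value drops below zero while the TF-IDF value stays positive.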

#### 3. Document Length Normalization in BM25

BM25 incorporates document length normalization to ensure that longer documents do not have an unfair advantage in the scoring process simply because they contain more terms. 

`How Normalization is Achieved:`
1. **Length Parameters**: BM25 uses a parameter $b$ (typically set between 0 and 1) to control the degree of normalization. A value of $b = 1$ applies full normalization, while $b = 0$ means no normalization.
2. **Length Calculation**: The document length is measured in terms of the total number of terms. BM25 compares this length against the average document length in the collection.
3. **Score Adjustment**: The normalization is applied during the score calculation. It adjusts the term frequency based on the length of the document relative to the average length, reducing the score for longer documents while increasing it for shorter ones.

#### Example:
- **Document A** (100 words) contains the term "AI" 10 times.
- **Document B** (200 words) contains the term "AI" 20 times.

Without normalization, Document B would score higher because of its higher raw term frequency. Both documents, however, use the term at the same rate (10% of their words). BM25's length normalization accounts for this: with $b = 1$ the two term-frequency contributions become equal, and with intermediate values of $b$ they end up close to each other.
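
The example can be checked with a short sketch of the length-normalized term-frequency factor. The helper `tf_component` is hypothetical, `k1 = 1.5` is an assumed value, and `avgdl = 150` assumes the corpus consists of only these two documents.

```python
def tf_component(tf, doc_len, avgdl, k1=1.5, b=0.75):
    """BM25 term-frequency factor including document length normalization."""
    norm = 1 - b + b * doc_len / avgdl
    return tf * (k1 + 1) / (tf + k1 * norm)

avgdl = 150  # assumption: average of Document A (100 words) and Document B (200 words)
for b in (0.0, 0.75, 1.0):
    score_a = tf_component(10, 100, avgdl, b=b)
    score_b = tf_component(20, 200, avgdl, b=b)
    print(f"b={b:<4}  Document A: {score_a:.4f}  Document B: {score_b:.4f}")
```

With `b = 0` the longer document wins on raw count; as `b` approaches 1 the gap closes, since both documents have the same term density.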


#### BM25 Formula
The BM25 scoring function can be represented as follows:

$$
\operatorname{BM25}(d, q)=\sum_{i=1}^{|q|} \operatorname{IDF}\left(t_i\right) \cdot \frac{\operatorname{TF}\left(t_i, d\right) \cdot\left(k_1+1\right)}{\operatorname{TF}\left(t_i, d\right)+k_1 \cdot\left(1-b+b \cdot \frac{|d|}{\operatorname{avgdl}}\right)}
$$


Where:
- $d$ = document
- $q$ = query
- $t_i$ = term in the query
- $\operatorname{TF}\left(t_i, d\right)$ = term frequency of $t_i$ in document $d$
- $|d|$ = length of the document (number of terms)
- $\operatorname{avgdl}$ = average document length across the corpus
- $\operatorname{IDF}\left(t_i\right)$ = inverse document frequency of term $t_i$
- $k_1$ and $b$ = tuning parameters
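
As an illustration of how the pieces fit together, here is a minimal from-scratch sketch of the formula above. It is not the implementation used by any particular library: `bm25_scores` and the toy corpus are hypothetical, `k1 = 1.5` and `b = 0.75` are assumed values, and the plain BM25 IDF is used without any handling of negative values.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in corpus_tokens against query_tokens."""
    n_docs = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        norm = 1 - b + b * len(doc) / avgdl  # length normalization
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * norm)
        scores.append(score)
    return scores

# Toy corpus built from a few already-preprocessed titles
toy_corpus = [
    "kundalini yoga the flow of eternal power".split(),
    "the yoga manual a step by step guide".split(),
    "cloak of deception star wars".split(),
    "a girl of the limberlost".split(),
    "the nature of water and air".split(),
]
scores = bm25_scores(["kundalini", "yoga"], toy_corpus)
for doc, score in zip(toy_corpus, scores):
    print(f"{score:6.3f}  {' '.join(doc)}")
```

The document containing both query terms ranks first, the one containing only "yoga" ranks second, and documents with no query terms score zero.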

#### Implementing BM25

In [19]:
from rank_bm25 import BM25Okapi
import nltk

In [20]:
# Tokenize the 'text' column for BM25
tokenized_corpus = [nltk.word_tokenize(text.lower()) for text in df_cleaned_samples['text']]

In [21]:
print(tokenized_corpus[:3])

[['thats', 'why', 'theyre', 'in', 'cages', 'people', 'joel', 'perry', 'alyson', 'pubns'], ['frauen', 'die', 'geschichte', 'schrieben', '30', 'portrãâ¤ts', 'von', 'maria', 'sibylla', 'merian', 'bis', 'sophie', 'scholl', 'irma', 'hildebrandt', 'diederichs'], ['mr', 'potters', 'pet', 'dick', 'kingsmith', 'hyperion', 'books', 'for', 'children']]


In [22]:
# Initialize BM25 with the tokenized corpus
bm25 = BM25Okapi(tokenized_corpus)

In [23]:
df_cleaned_samples.text.sample(5)

26938                 beaches iris rainer dart bantam books
124158    the eighth day of the week a novel alfred copp...
83207     kundalini yoga the flow of eternal power shakt...
47985     loveland pinnacle historical romance jane ande...
196071    eric liddell pure gold a new biography of the ...
Name: text, dtype: object

In [24]:
# Define a query
query = "kundalini yoga"

In [25]:
# Tokenize the query
tokenized_query = nltk.word_tokenize(query.lower())

In [26]:
tokenized_query

['kundalini', 'yoga']

In [27]:
# Get BM25 scores for the query
scores = bm25.get_scores(tokenized_query)

In [28]:
len(scores)

2500

In [29]:
# Sort the results by score in descending order and retrieve the top matches
df_cleaned_samples['BM25_Score'] = scores
results = df_cleaned_samples.sort_values(by='BM25_Score', ascending=False)

print(results[['Book-Title', 'BM25_Score']])

                                               Book-Title  BM25_Score
83207           Kundalini Yoga: The Flow of Eternal Power   12.691831
67648                              Yoga fÃ?Â¼r jeden Tag.    7.754130
67554   The Yoga Manual: A Step-By-Step Guide to Gentl...    5.164993
78239                      Cloak of Deception (Star Wars)    0.000000
240450  The TROUBLE WITH TESTOSTERONE : And Other Essa...    0.000000
...                                                   ...         ...
211400  Exploring Missouri Wine Country (Show Me Misso...    0.000000
189881  Herman Melville : Redburn, White-Jacket, Moby-...    0.000000
65753                                  Reach The Splendor    0.000000
114957  The Power of Ethical Persuasion: From Conflict...    0.000000
143470  Wit'ch Storm (The Banned and the Banished, Boo...    0.000000

[2500 rows x 2 columns]
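
Most rows above have a score of zero because they contain none of the query terms. A small sketch of keeping only the best non-zero matches follows; `top_k_nonzero` is a hypothetical helper that works on a plain sequence of scores and returns positional indices, not DataFrame labels, and the example scores are rounded values from the output above.

```python
import heapq

def top_k_nonzero(scores, k=5):
    """Return (position, score) pairs for the k highest strictly positive scores."""
    positive = [(i, s) for i, s in enumerate(scores) if s > 0]
    return heapq.nlargest(k, positive, key=lambda pair: pair[1])

example_scores = [0.0, 12.69, 0.0, 7.75, 5.16, 0.0]
print(top_k_nonzero(example_scores, k=2))  # [(1, 12.69), (3, 7.75)]
```

In the notebook itself, the same effect should be achievable with a boolean filter on `BM25_Score` followed by `head(k)` on the sorted `results` DataFrame.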
