# Retrieval Strategies

Retrieval strategies are the techniques used to fetch the chunks of a document that are most relevant to a query.

1. Vector Similarity Search: As discussed previously, we compute a metric such as cosine similarity or Euclidean distance between the query and each chunk, then retrieve the most similar chunks.
2. Keyword (BM25) Search: The BM25 algorithm ranks chunks by how well their keywords match the query. This is a very common technique in search engines.
3. Hybrid Search: A combination of the above two techniques.
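
One simple way to picture a hybrid ranking is a weighted sum of normalized keyword and vector scores. The sketch below is only an illustration under stated assumptions: the helper names `min_max` and `hybrid_scores`, the `alpha` weight (1.0 means vector only, 0.0 means keyword only), and the toy scores are all made up here, and real search engines may fuse results differently.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to the range [0, 1] so the two rankings are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_scores(keyword: dict[str, float], vector: dict[str, float],
                  alpha: float = 0.5) -> dict[str, float]:
    """Blend the two score sets; alpha is the (hypothetical) weight on vector scores."""
    kw, vec = min_max(keyword), min_max(vector)
    doc_ids = kw.keys() | vec.keys()
    return {d: alpha * vec.get(d, 0.0) + (1 - alpha) * kw.get(d, 0.0) for d in doc_ids}


# Toy scores, invented for illustration only
keyword = {"doc1": 2.9, "doc2": 0.7, "doc3": 0.2}     # e.g. BM25 scores
vector = {"doc1": 0.62, "doc2": 0.81, "doc3": 0.35}   # e.g. cosine similarities

fused = hybrid_scores(keyword, vector, alpha=0.5)
ranking = sorted(fused, key=fused.get, reverse=True)
print(ranking)
```

Moving `alpha` toward 1.0 lets the vector ranking dominate, while moving it toward 0.0 favors the keyword ranking.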

## About BM25 Search

### Intro to BM25

BM25 is a popular algorithm used for information retrieval, especially in search engines and document ranking systems. It's part of a family of algorithms called **"probabilistic information retrieval models"**, designed to rank documents based on their relevance to a user's query. When you enter a query in a search engine, BM25 helps determine which documents (e.g., web pages, articles) are most likely to contain the information you’re looking for. It’s particularly useful when dealing with unstructured data like text, where documents aren’t tagged or labeled, and relevance must be inferred from the content itself.

BM25 builds on two signals: **term frequency** (how often a word appears in a document) and **inverse document frequency** (how rare or common that word is across all documents). By balancing these factors, BM25 favors documents in which the query terms are genuinely informative, not documents that are merely packed with them.

### How BM25 Works

At its core, BM25 scores documents based on how well they match a query. It does this by analyzing the terms in both the query and the documents, assessing the importance of each term, and then assigning a relevance score to each document. To understand BM25 better, let's break down the process step by step:

#### 1. Term Frequency (TF):
BM25 looks at how often each term in the query appears in a document. This is called **term frequency**. The idea is simple: the more frequently a term appears in a document, the more relevant that document might be for the query. However, BM25 doesn’t just count the raw number of occurrences—it uses a formula that gives diminishing returns to higher frequencies. In other words, if a word appears once or twice, it might significantly boost the document’s relevance, but if it appears 100 times, it’s not going to make the document 100 times more relevant. This prevents documents that repeat a keyword excessively from dominating the results.

Example: 
Let's say the query is `machine learning`, and Document A mentions "machine" and "learning" three times each, while Document B mentions "machine" once and "learning" twice. If the two documents are of similar length, BM25 will score Document A higher based on term frequency alone because it contains both terms more often. But term frequency is just one part of the formula.
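
The diminishing returns can be seen with a small sketch of the term-frequency component on its own. It assumes the commonly used saturation form `tf * (k1 + 1) / (tf + k1)` and a `k1` of 1.2, which is a typical default and not a value given in this text; length normalization is left out here.

```python
def saturated_tf(tf: float, k1: float = 1.2) -> float:
    """BM25-style term-frequency component, ignoring document length."""
    return tf * (k1 + 1) / (tf + k1)


# The contribution grows quickly at first, then flattens out below k1 + 1
for count in (1, 2, 3, 10, 100):
    print(f"tf={count:>3} -> {saturated_tf(count):.3f}")
```

With this form, a term that appears 100 times contributes only a little more than one that appears 10 times.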

#### 2. Inverse Document Frequency (IDF):
BM25 also considers how common or rare a term is across the entire set of documents. This is called **inverse document frequency**. If a word appears in almost every document (e.g., "the", "is"), it’s not very useful for distinguishing relevant documents from irrelevant ones. On the other hand, if a word is rare (like "neural networks"), it’s likely to be more informative and relevant when it does appear.

BM25 assigns higher importance to terms that are rare across the document collection. This helps ensure that the algorithm doesn’t just return documents filled with common terms, but rather those that include more unique and relevant terms.

Example:
If "machine" appears in 90% of documents and "learning" appears in only 10%, BM25 will assign a higher weight to "learning" because it’s less common and more likely to help identify relevant documents.
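
As a rough illustration of the example above, the sketch below computes an IDF weight for both terms. It assumes a collection of 1,000 documents (a made-up number) and one widely used IDF variant that never goes negative; other implementations use slightly different expressions.

```python
import math


def idf(num_docs: int, docs_with_term: int) -> float:
    """One common BM25 IDF variant; rarer terms get larger weights."""
    return math.log((num_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1)


num_docs = 1000                      # hypothetical collection size
idf_machine = idf(num_docs, 900)     # "machine" appears in 90% of documents
idf_learning = idf(num_docs, 100)    # "learning" appears in 10% of documents

print(f"machine:  {idf_machine:.3f}")
print(f"learning: {idf_learning:.3f}")
```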

#### 3. Document Length Normalization:
Longer documents are more likely to contain any given term simply because they have more content. To avoid bias toward longer documents, BM25 normalizes term frequency by the document's length relative to the average document length in the collection. This ensures that shorter documents with concentrated, relevant information aren’t penalized.

Example:
If Document A has 200 words and Document B has 1000 words, but both mention "machine learning" 5 times, BM25 will score Document A higher because its shorter length suggests that the term "machine learning" is more central to its content.
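
The same example can be sketched numerically. The snippet assumes the usual BM25 parameters `k1 = 1.2` and `b = 0.75` (common defaults, not values stated here), an average document length of 600 words (simply the mean of the two documents), and it treats "machine learning" as a single term for simplicity.

```python
def tf_component(tf: float, doc_len: int, avg_doc_len: float,
                 k1: float = 1.2, b: float = 0.75) -> float:
    """Term-frequency component of BM25 with document length normalization."""
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return tf * (k1 + 1) / (tf + k1 * length_norm)


avg_doc_len = (200 + 1000) / 2   # assumed average length for this toy example

score_a = tf_component(tf=5, doc_len=200, avg_doc_len=avg_doc_len)
score_b = tf_component(tf=5, doc_len=1000, avg_doc_len=avg_doc_len)

print(f"Document A (200 words):  {score_a:.3f}")
print(f"Document B (1000 words): {score_b:.3f}")
```

Setting `b` to 0 switches length normalization off, in which case both documents would receive the same value.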

#### Example Query Walkthrough:

Let’s say the query is `"deep learning for image classification"`, and you have three documents:

- **Document 1**: A short blog post about image classification with neural networks that mentions deep learning twice.
- **Document 2**: A lengthy research paper on deep learning that mentions deep learning multiple times but doesn’t focus on image classification specifically.
- **Document 3**: A general overview of machine learning, which mentions image classification briefly but without any focus on deep learning.

BM25 will first check how many times each query term appears in the documents. (BM25 scores individual terms, so "deep learning" here is shorthand for the terms "deep" and "learning".) It will find that Document 2 mentions "deep learning" many times, so it will score well based on term frequency. However, Document 1 will also score highly because although it mentions "deep learning" fewer times, it’s a shorter document where the term is more central. Meanwhile, Document 3 will score lower because, even though it mentions image classification, it rarely or never mentions deep learning.

Next, BM25 will apply inverse document frequency. If "deep learning" is common across the collection but "image classification" is rare, matches on "image classification" carry more weight. That favors Document 1, which discusses image classification throughout, and gives Document 3 a small lift for its brief mention, while Document 2’s many "deep learning" matches count for less.

Lastly, BM25 adjusts for document length. Document 1 is shorter and to the point, so it gets an extra boost, while Document 2’s length dampens the benefit of repeating the terms more often.

In the end, BM25 will likely rank Document 1 as the most relevant, followed by Document 2, with Document 3 trailing behind.

#### Final BM25 Score Calculation:
BM25 combines all the factors—term frequency, inverse document frequency, and document length normalization—into a final score. The higher the BM25 score, the more relevant the document is to the query.
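
For reference, these factors are commonly combined in the following form. The text here does not spell out a formula, so treat this as the textbook version; implementations differ in details such as the exact IDF expression. Here $f(q_i, D)$ is the number of times query term $q_i$ appears in document $D$, $|D|$ is the document length, $\text{avgdl}$ is the average document length in the collection, and $k_1$ and $b$ are tunable parameters (values around 1.2 and 0.75 are typical defaults).

$$
\text{score}(D, Q) = \sum_{q_i \in Q} \text{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b\,\frac{|D|}{\text{avgdl}}\right)}
$$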

What’s great about BM25 is that it’s both simple and highly effective. It doesn’t just focus on raw counts of query terms; it carefully weighs how often terms appear, how common they are across all documents, and whether a term’s appearance is significant in the context of the document’s length.
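
To make the combination concrete, here is a minimal from-scratch sketch of a BM25 scorer. It is not how any particular search engine implements BM25: the function name `bm25_scores`, the whitespace tokenizer, the parameter defaults (`k1 = 1.2`, `b = 0.75`), and the three toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str],
                k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score every document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]   # naive whitespace tokenizer
    num_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / num_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        length_norm = 1 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in query.lower().split():
            tf = term_counts[term]
            if tf == 0:
                continue  # a term missing from the document contributes nothing
            idf = math.log((num_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Toy documents, invented for illustration
docs = [
    "deep learning for image classification",
    "deep learning deep learning optimization and training tricks for large models",
    "an overview of machine learning",
]
scores = bm25_scores("deep learning image classification", docs)
for doc, score in sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True):
    print(f"{score:.3f}  {doc}")
```

On this toy corpus the short document that matches all four query terms ranks first, the longer document that only repeats "deep learning" ranks second, and the document that matches only the very common term "learning" ranks last.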

### Comparing BM25 with Vector Similarity Search

While BM25 focuses on keyword matching (how relevant a document is based on exact words), **vector similarity search** looks at the semantic meaning behind the text. In vector search, documents and queries are represented as vectors in a continuous space, and similarity is measured based on the distance between them, usually using **cosine similarity** or **dot product**. This allows vector search to find documents with similar meanings, even if the words don’t exactly match.
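
As a small illustration of the similarity measure, the sketch below computes cosine similarity between a query vector and two document vectors. The three-dimensional vectors are made-up stand-ins; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings", invented for illustration
query_vec = [0.9, 0.1, 0.0]
doc_close = [0.8, 0.2, 0.1]   # points in nearly the same direction as the query
doc_far = [0.0, 0.1, 0.9]     # points in a very different direction

print(f"close: {cosine_similarity(query_vec, doc_close):.3f}")
print(f"far:   {cosine_similarity(query_vec, doc_far):.3f}")
```

No words are compared at any point, which is why vector search can match a query to a document that uses different vocabulary.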

**BM25 Advantages**: 
- Works well with small datasets.
- Doesn’t require a complex model.
- Easier to explain and debug.

**Vector Search Advantages**:
- Captures semantic meaning, not just exact word matches.
- Works better for complex queries where words might not exactly match the document terms.

Both methods have their place, but BM25 is particularly useful when you want precise keyword matching and have limited computational resources.

### Setup Weaviate Client

In [2]:
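import os

import weaviate
from dotenv import load_dotenv

# Load OPENAI_API_KEY from the project's .env file
load_dotenv("./../.env")

# Start an embedded Weaviate instance; the OpenAI key is passed as a header
# so the OpenAI vectorizer configured below can call the embeddings API
client = weaviate.connect_to_embedded(
    headers={
        "X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")
    }
)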

### Create Collection

In [3]:
from weaviate.classes.config import Property, DataType, Configure

# Drop any existing collection so the notebook can be re-run from scratch
if client.collections.exists("Article"):
    client.collections.delete("Article")

# Create the collection; objects are embedded with OpenAI's text-embedding-3-small
client.collections.create(
    "Article",
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="body", data_type=DataType.TEXT, vectorize_property_name=True),
        Property(name="date", data_type=DataType.DATE),
        Property(name="category", data_type=DataType.TEXT),
    ],
    vectorizer_config=Configure.Vectorizer.text2vec_openai(
        model="text-embedding-3-small"
    )
)

<weaviate.collections.collection.sync.Collection at 0x10b0aba50>

### Insert Documents

In [5]:
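import json

# Load the sample articles (a list of objects matching the collection's properties)
with open("./articles.json", "r") as f:
    articles_json = json.load(f)

article = client.collections.get("Article")

# Insert the objects in batches; dynamic() lets the client pick the batch size
with article.batch.dynamic() as batch:
    for art in articles_json:
        batch.add_object(properties=art)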


In [6]:
item_count = 0
for item in article.iterator():
    item_count += 1
item_count

13