# Retrieval Strategies

Retrieval strategies are the techniques used to retrieve the chunks of a document that are relevant to a query.

1. **Vector Similarity Search**: As discussed previously, we compute a metric such as cosine similarity or Euclidean distance between the query and each chunk, and retrieve the most similar chunks.
2. **Keyword (BM25) Search**: Using the BM25 algorithm, we retrieve the chunks that best match the query's keywords. This technique is very common in search engines.
3. **Hybrid Search**: A combination of the above two techniques.

## About BM25 Search

### Intro to BM25

BM25 is a popular algorithm used for information retrieval, especially in search engines and document ranking systems. It's part of a family of algorithms called **"probabilistic information retrieval models"**, designed to rank documents based on their relevance to a user's query. When you enter a query in a search engine, BM25 helps determine which documents (e.g., web pages, articles) are most likely to contain the information you’re looking for. It’s particularly useful when dealing with unstructured data like text, where documents aren’t tagged or labeled, and relevance must be inferred from the content itself.

BM25 stands out because it considers **term frequency** (how often a word appears in a document) and **inverse document frequency** (how rare or common that word is across all documents). By balancing these factors, BM25 effectively finds documents that are not just packed with the query terms but are also meaningful and relevant in context.

### How BM25 Works

At its core, BM25 scores documents based on how well they match a query. It does this by analyzing the terms in both the query and the documents, assessing the importance of each term, and then assigning a relevance score to each document. To understand BM25 better, let's break down the process step by step:

#### 1. Term Frequency (TF):
BM25 looks at how often each term in the query appears in a document. This is called **term frequency**. The idea is simple: the more frequently a term appears in a document, the more relevant that document might be for the query. However, BM25 doesn’t just count the raw number of occurrences—it uses a formula that gives diminishing returns to higher frequencies. In other words, if a word appears once or twice, it might significantly boost the document’s relevance, but if it appears 100 times, it’s not going to make the document 100 times more relevant. This prevents documents that repeat a keyword excessively from dominating the results.

Example: 
Let's say the query is `machine learning`, and Document A mentions "machine" and "learning" three times each, while Document B mentions "machine" once and "learning" twice. BM25 will score Document A higher based on term frequency alone because it contains both terms more often. But term frequency is just one part of the formula.
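
The diminishing returns described above can be seen in a minimal sketch of the term-frequency part of BM25. The function name `tf_component` is made up for illustration, `k1 = 1.2` is a commonly used default rather than a value taken from this tutorial, and document length normalization is deliberately left out at this stage.

```python
def tf_component(freq: float, k1: float = 1.2) -> float:
    """Saturating term-frequency factor of BM25, ignoring document length.

    k1 controls how quickly the factor saturates (1.2 is an assumed default).
    """
    return freq * (k1 + 1) / (freq + k1)


# The factor grows quickly for the first few occurrences, then flattens out.
for freq in (1, 2, 10, 100):
    print(f"frequency={freq:>3}  tf factor={tf_component(freq):.3f}")
```

With these settings the factor is 1.0 for a single occurrence and can never exceed `k1 + 1` (2.2 here), so 100 occurrences are worth only slightly more than 10.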

#### 2. Inverse Document Frequency (IDF):
BM25 also considers how common or rare a term is across the entire set of documents. This is called **inverse document frequency**. If a word appears in almost every document (e.g., "the", "is"), it’s not very useful for distinguishing relevant documents from irrelevant ones. On the other hand, if a term is rare (like "neural networks"), it’s likely to be more informative and relevant when it does appear.

BM25 assigns higher importance to terms that are rare across the document collection. This helps ensure that the algorithm doesn’t just return documents filled with common terms, but rather those that include more unique and relevant terms.

Example:
If "machine" appears in 90% of documents and "learning" appears in only 10%, BM25 will assign a higher weight to "learning" because it’s less common and more likely to help identify relevant documents.
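
As a rough illustration of the example above, the sketch below computes IDF with one common smoothed variant of the formula. The function name `idf` and the collection size of 100 documents are assumptions made for this example; search engines may use slightly different variants.

```python
import math


def idf(num_docs: int, docs_with_term: int) -> float:
    """Smoothed inverse document frequency (one common BM25 variant)."""
    return math.log((num_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1)


num_docs = 100  # assumed collection size
idf_machine = idf(num_docs, 90)   # "machine" appears in 90% of documents
idf_learning = idf(num_docs, 10)  # "learning" appears in 10% of documents

print(f"idf('machine')  = {idf_machine:.3f}")
print(f"idf('learning') = {idf_learning:.3f}")
```

The rarer term, "learning", receives a much larger weight than "machine", while both weights stay positive.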

#### 3. Document Length Normalization:
Longer documents are more likely to contain any given term simply because they have more content. To avoid bias toward longer documents, BM25 normalizes the term frequency by the document length. This ensures that shorter documents with concentrated, relevant information aren’t penalized.

Example:
If Document A has 200 words and Document B has 1000 words, but both mention "machine learning" 5 times, BM25 will score Document A higher because its shorter length suggests that the term "machine learning" is more central to its content.
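
The effect of length normalization in this example can be sketched as follows. The function name `bm25_tf`, the average document length of 600 words, and the defaults `k1 = 1.2` and `b = 0.75` are assumptions chosen for illustration.

```python
def bm25_tf(freq: float, doc_len: int, avg_doc_len: float,
            k1: float = 1.2, b: float = 0.75) -> float:
    """Term-frequency factor of BM25 with document length normalization.

    b sets how strongly length is taken into account (0 = not at all, 1 = fully).
    """
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return freq * (k1 + 1) / (freq + k1 * length_norm)


avg_doc_len = 600  # assumed average length of documents in the collection
factor_a = bm25_tf(freq=5, doc_len=200, avg_doc_len=avg_doc_len)   # Document A
factor_b = bm25_tf(freq=5, doc_len=1000, avg_doc_len=avg_doc_len)  # Document B

print(f"Document A: {factor_a:.3f}")
print(f"Document B: {factor_b:.3f}")
```

Both documents contain the term 5 times, but the shorter Document A ends up with the larger factor.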

#### Example Query Walkthrough:

Let’s say the query is `"deep learning for image classification"`, and you have three documents:

- **Document 1**: A short blog post discussing image classification with neural networks and deep learning mentioned twice.
- **Document 2**: A lengthy research paper on deep learning that mentions deep learning multiple times but doesn’t focus on image classification specifically.
- **Document 3**: A general overview of machine learning, which mentions image classification briefly but without any focus on deep learning.

BM25 will first check how many times each query term appears in the documents. It will find that Document 2 mentions "deep learning" many times, so it will score well based on term frequency. However, Document 1 will also score highly because although it mentions "deep learning" fewer times, it’s a shorter document where the term is more central. Meanwhile, Document 3 will score lower because, even though it might mention image classification, it doesn’t cover deep learning well enough to be relevant.

Next, BM25 applies inverse document frequency. If "deep learning" is common across the collection but "image classification" is rare, a match on "image classification" carries more weight than a match on "deep learning". This favors Document 1, which covers both topics, and narrows the advantage Document 2 gains from repeating "deep learning". Document 3 benefits from its brief mention of image classification, but with little to match on "deep learning" it is still likely to trail.

Lastly, BM25 adjusts for document length. Document 1 is shorter and to the point, so it gets an extra boost, while Document 2, despite being lengthy, will only get marginally higher scores for repeating terms more often.

In the end, BM25 will likely rank Document 1 as the most relevant, followed by Document 2, with Document 3 trailing behind.

#### Final BM25 Score Calculation:
BM25 combines all the factors—term frequency, inverse document frequency, and document length normalization—into a final score. The higher the BM25 score, the more relevant the document is to the query.

What’s great about BM25 is that it’s both simple and highly effective. It doesn’t just focus on raw counts of query terms; it carefully weighs how often terms appear, how common they are across all documents, and whether a term’s appearance is significant in the context of the document’s length.
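
To see how the three factors fit together, here is a minimal, self-contained sketch of a textbook BM25 scorer. It is an illustration under stated assumptions, not the implementation used by any particular search engine: the whitespace tokenizer, the toy documents, the function name `bm25_scores`, and the defaults `k1 = 1.2` and `b = 0.75` are all choices made for this example.

```python
import math
from collections import Counter


def bm25_scores(query: str, documents: list[str],
                k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Return one BM25 score per document for the given query."""
    tokenized = [doc.lower().split() for doc in documents]  # naive tokenizer
    num_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / num_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length_norm = 1 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in query.lower().split():
            freq = counts[term]
            if freq == 0:
                continue  # a term that is absent contributes nothing
            idf = math.log((num_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            score += idf * freq * (k1 + 1) / (freq + k1 * length_norm)
        scores.append(score)
    return scores


# Toy documents, invented for this example.
docs = [
    "deep learning for image classification with neural networks",
    "deep learning deep learning theory and optimisation of very large models",
    "an overview of machine learning",
]
scores = bm25_scores("deep learning image classification", docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

On this toy collection the first document ranks highest because it matches the rare terms "image" and "classification", the second benefits only from repeating "deep learning", and the third matches just the very common term "learning".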

### Comparing BM25 with Vector Similarity Search

While BM25 focuses on keyword matching (how relevant a document is based on exact words), **vector similarity search** looks at the semantic meaning behind the text. In vector search, documents and queries are represented as vectors in a continuous space, and similarity is measured based on the distance between them, usually using **cosine similarity** or **dot product**. This allows vector search to find documents with similar meanings, even if the words don’t exactly match.
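
As a small illustration of the similarity measure, the sketch below computes cosine similarity in plain Python. The three-dimensional vectors are made-up toy values; real embeddings come from an embedding model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, chosen by hand for this example.
query_vec = [0.9, 0.1, 0.0]
doc_close = [0.8, 0.2, 0.1]  # points in nearly the same direction as the query
doc_far = [0.0, 0.1, 0.9]    # points in a very different direction

sim_close = cosine_similarity(query_vec, doc_close)
sim_far = cosine_similarity(query_vec, doc_far)
print(f"close: {sim_close:.3f}, far: {sim_far:.3f}")
```

A document whose vector points in almost the same direction as the query scores close to 1, regardless of whether the two texts share any exact words.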

**BM25 Advantages**: 
- Works well with small datasets.
- Doesn’t require a complex model.
- Easier to explain and debug.

**Vector Search Advantages**:
- Captures semantic meaning, not just exact word matches.
- Works better for complex queries where words might not exactly match the document terms.

Both methods have their place, but BM25 is particularly useful when you want precise keyword matching and have limited computational resources.

### Setup Weaviate Client

In [62]:
import weaviate
from dotenv import load_dotenv
import os

load_dotenv("./../.env")

client = weaviate.connect_to_embedded(
    headers={
        "X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")
    }
)


### Create Collection

In [63]:
from weaviate.classes.config import Property, DataType, Configure

if client.collections.exists("Article"):
    client.collections.delete("Article")

client.collections.create(
    "Article",
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="body", data_type=DataType.TEXT, vectorize_property_name=True),
        Property(name="date", data_type=DataType.DATE),
        Property(name="category", data_type=DataType.TEXT),
    ],
    vectorizer_config=Configure.Vectorizer.text2vec_openai(
        model="text-embedding-3-small"
    )
)

<weaviate.collections.collection.sync.Collection at 0x10f55b050>

### Insert Documents

In [64]:
import json

with open("./articles.json", "r") as f:
    articles_json = json.load(f)

article = client.collections.get('Article')


with article.batch.dynamic() as batch:  # inserting objects to collection in batch
    for art in articles_json:
        batch.add_object(art)



In [65]:
item_count = 0
for item in article.iterator():
    item_count += 1
item_count

13

### Vector Similarity Search

We've covered this previously as well, but let's repeat the steps for a quick refresher.

In [75]:
import textwrap

def print_objects(objects):
    """Print the ID, metadata, and properties of each retrieved object."""
    for obj in objects:
        print(f"ID: {obj.uuid.int}")
        metadata = [{k: round(v, 2) if isinstance(v, float) else v} for k, v in obj.metadata.__dict__.items() if v is not None]
        print(f"Metadata: {metadata}")
        print(f"Title: {obj.properties['title']}")
        print(f"Date: {obj.properties['date']}")
        print(f"Category: {obj.properties['category']}")
        print(f"Author: {obj.properties['author']}")
        print(f"Body: {textwrap.shorten(obj.properties['body'], width=100)}")
        print()

In [76]:
from weaviate.classes.query import MetadataQuery

response = article.query.near_text(
    query="What is machine learning?",
    limit=3,    # maximum number of chunks to return
    return_metadata=MetadataQuery(distance=True, certainty=True),
    include_vector=True
)

print_objects(response.objects)

ID: 290510953615380066757077280396556319824
Metadata: [{'distance': 0.39}, {'certainty': 0.81}]
Title: Machine Learning
Date: 2021-10-01 00:00:00+00:00
Category: ML
Author: Towards Data Science
Body: Machine learning is a subset of artificial intelligence (AI) that provides systems the ability [...]

ID: 151753684610208536323796683740283397757
Metadata: [{'distance': 0.4}, {'certainty': 0.8}]
Title: Machine Learning Basics
Date: 2023-03-05 00:00:00+00:00
Category: ML
Author: Analytics Vidhya
Body: Machine learning is a branch of artificial intelligence that focuses on building systems that [...]

ID: 157521492403312658662470011970601017814
Metadata: [{'distance': 0.5}, {'certainty': 0.75}]
Title: Introduction to Deep Learning
Date: 2022-07-10 00:00:00+00:00
Category: ML
Author: Towards Data Science
Body: Deep learning is a subset of machine learning that uses neural networks with many layers to [...]



### Keyword (BM25) Search

In [77]:
response = article.query.bm25(
    query="What is machine learning?",
)

We can restrict the BM25 search to specific properties by listing them in `query_properties`. We can also boost the importance of a property by a factor using the `^` syntax.

In the example below, BM25 search is applied to three properties, with the `title` property boosted by a factor of 2.

In [78]:
response = article.query.bm25(
    query="what is machine learning?",
    query_properties=['body', 'title^2', 'category'],
    return_metadata=MetadataQuery(score=True, explain_score=True),  # metadata specific to BM25 search
    limit=3
)

print_objects(response.objects)

ID: 290510953615380066757077280396556319824
Metadata: [{'score': 1.27}, {'explain_score': ', BM25F_machine_propLength:39, BM25F_learning_frequency:4, BM25F_learning_propLength:39, BM25F_machine_frequency:4'}]
Title: Machine Learning
Date: 2021-10-01 00:00:00+00:00
Category: ML
Author: Towards Data Science
Body: Machine learning is a subset of artificial intelligence (AI) that provides systems the ability [...]

ID: 151753684610208536323796683740283397757
Metadata: [{'score': 1.15}, {'explain_score': ', BM25F_machine_frequency:4, BM25F_machine_propLength:51, BM25F_learning_frequency:4, BM25F_learning_propLength:51'}]
Title: Machine Learning Basics
Date: 2023-03-05 00:00:00+00:00
Category: ML
Author: Analytics Vidhya
Body: Machine learning is a branch of artificial intelligence that focuses on building systems that [...]

ID: 138909255830429965420632981935136292337
Metadata: [{'score': 0.6}, {'explain_score': ', BM25F_learning_propLength:38, BM25F_machine_frequency:1, BM25F_machine_propL

Let's break down the `explain_score` metadata for the first chunk:
* `BM25F_machine_frequency:4`: the word "machine" appears 4 times in the chunk.
* `BM25F_machine_propLength:39`: the property (e.g., "body" or "title") in which "machine" appears is 39 words long.
* `BM25F_learning_frequency:4`: the word "learning" also appears 4 times in the chunk.
* `BM25F_learning_propLength:39`: as with "machine", the property in which "learning" appears is 39 words long.

### Hybrid Search

Hybrid search combines vector search and keyword search. The results of the two searches are merged according to a configurable weight.

In [79]:
response = article.query.hybrid(
    query="what is machine learning?",
    limit=3
)

print_objects(response.objects)

ID: 290510953615380066757077280396556319824
Metadata: []
Title: Machine Learning
Date: 2021-10-01 00:00:00+00:00
Category: ML
Author: Towards Data Science
Body: Machine learning is a subset of artificial intelligence (AI) that provides systems the ability [...]

ID: 151753684610208536323796683740283397757
Metadata: []
Title: Machine Learning Basics
Date: 2023-03-05 00:00:00+00:00
Category: ML
Author: Analytics Vidhya
Body: Machine learning is a branch of artificial intelligence that focuses on building systems that [...]

ID: 157521492403312658662470011970601017814
Metadata: []
Title: Introduction to Deep Learning
Date: 2022-07-10 00:00:00+00:00
Category: ML
Author: Towards Data Science
Body: Deep learning is a subset of machine learning that uses neural networks with many layers to [...]



We can use the `alpha` argument to set the weight given to each search method.
* An alpha of 1 is a pure vector search.
* An alpha of 0 is a pure keyword search.

In [80]:
response = article.query.hybrid(
    query="what is machine learning?",
    alpha=0.25,  # keyword search is given more weight
    return_metadata=MetadataQuery(score=True, explain_score=True),
    limit=3
)

print_objects(response.objects)

ID: 290510953615380066757077280396556319824
Metadata: [{'score': 1.0}, {'explain_score': '\nHybrid (Result Set keyword,bm25) Document da8e5fdd-87c6-4b1e-9391-1a7e6ccb3c50: original score 1.009832, normalized score: 0.75 - \nHybrid (Result Set vector,hybridVector) Document da8e5fdd-87c6-4b1e-9391-1a7e6ccb3c50: original score 0.6449458, normalized score: 0.25'}]
Title: Machine Learning
Date: 2021-10-01 00:00:00+00:00
Category: ML
Author: Towards Data Science
Body: Machine learning is a subset of artificial intelligence (AI) that provides systems the ability [...]

ID: 151753684610208536323796683740283397757
Metadata: [{'score': 0.88}, {'explain_score': '\nHybrid (Result Set keyword,bm25) Document 722ab250-321b-4410-8574-198afc17e67d: original score 0.88386375, normalized score: 0.6337025 - \nHybrid (Result Set vector,hybridVector) Document 722ab250-321b-4410-8574-198afc17e67d: original score 0.6328484, normalized score: 0.24408486'}]
Title: Machine Learning Basics
Date: 2023-03-05 00:00:

```python
{'explain_score': '\nHybrid (Result Set keyword,bm25) Document fe776c9c-a970-4337-b2f1-732d778b997d: original score 1.0120157, normalized score: 0.75 - \nHybrid (Result Set vector,hybridVector) Document fe776c9c-a970-4337-b2f1-732d778b997d: original score 0.6449458, normalized score: 0.25'}
```

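As the `explain_score` shows, the top result's keyword (BM25) score has a normalized value of 0.75 and its vector score a normalized value of 0.25. These match the weights implied by `alpha=0.25`: 1 - alpha = 0.75 for keyword search and alpha = 0.25 for vector search, so keyword search carries more weight when the two result sets are combined.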
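
The weighting can be sketched in a few lines of plain Python. This is a simplified illustration of weighted score fusion, assuming min-max normalization of each result set; it is not Weaviate's actual implementation, and the function names, document IDs, and scores below are hypothetical.

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (score - low) / (high - low) for doc_id, score in scores.items()}


def hybrid_fusion(keyword_scores: dict[str, float],
                  vector_scores: dict[str, float],
                  alpha: float) -> dict[str, float]:
    """Combine two result sets: alpha weights vector search, 1 - alpha keyword search."""
    keyword_norm = min_max_normalize(keyword_scores)
    vector_norm = min_max_normalize(vector_scores)
    fused = {}
    for doc_id in set(keyword_norm) | set(vector_norm):
        # A document missing from one result set contributes 0 for that search.
        fused[doc_id] = ((1 - alpha) * keyword_norm.get(doc_id, 0.0)
                         + alpha * vector_norm.get(doc_id, 0.0))
    return dict(sorted(fused.items(), key=lambda item: item[1], reverse=True))


# Hypothetical raw scores from each search.
keyword = {"doc_a": 1.01, "doc_b": 0.88, "doc_c": 0.20}
vector = {"doc_a": 0.64, "doc_b": 0.63, "doc_d": 0.50}

fused = hybrid_fusion(keyword, vector, alpha=0.25)
for doc_id, score in fused.items():
    print(f"{doc_id}: {score:.2f}")
```

In this sketch a document that ranks first in both result sets receives 0.75 from keyword search and 0.25 from vector search, for a combined score of 1.0.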

### Search By Filtering

Objects/chunks can also be retrieved by just filtering on the properties. This is useful when you want to retrieve chunks that contain specific values in the properties.

In [81]:
from weaviate.classes.query import Filter

response = article.query.fetch_objects(
    filters=Filter.by_property("category").equal("Programming"),
    limit=2
)

print_objects(response.objects)

ID: 138909255830429965420632981935136292337
Metadata: []
Title: Algorithms
Date: 2021-12-05 00:00:00+00:00
Category: Programming
Author: Towards Data Science
Body: Algorithms are step-by-step instructions or rules designed to perform a task or solve a [...]

ID: 179488132978240488857476808553647115166
Metadata: []
Title: Object-Oriented Programming
Date: 2023-01-15 00:00:00+00:00
Category: Programming
Author: Codecademy
Body: Object-oriented programming (OOP) is a paradigm based on the concept of 'objects', which can [...]



#### Filter on multiple properties

In [82]:
response = article.query.fetch_objects(
    filters=(
        Filter.by_property("category").equal("Programming") &   # use & for AND, | for OR
        Filter.by_property("author").equal("GeeksForGeeks")
    ),
    limit=2
)

print_objects(response.objects)

ID: 9052491185643298918940238487750688866
Metadata: []
Title: Data Structures
Date: 2022-03-12 00:00:00+00:00
Category: Programming
Author: GeeksForGeeks
Body: Data structures are ways of organizing and storing data so that they can be accessed and [...]

ID: 282176851638362230292167119753462938456
Metadata: []
Title: Understanding Hash Tables
Date: 2022-10-15 00:00:00+00:00
Category: Programming
Author: GeeksForGeeks
Body: Hash tables are a data structure that implements an associative array, a structure that can [...]



#### Filter on properties with search

In [83]:
response = article.query.near_text(
    query="What is machine learning?",
    filters=(
        Filter.by_property("category").equal("ML")
    ),
    limit=2
)

print_objects(response.objects)

ID: 290510953615380066757077280396556319824
Metadata: []
Title: Machine Learning
Date: 2021-10-01 00:00:00+00:00
Category: ML
Author: Towards Data Science
Body: Machine learning is a subset of artificial intelligence (AI) that provides systems the ability [...]

ID: 151753684610208536323796683740283397757
Metadata: []
Title: Machine Learning Basics
Date: 2023-03-05 00:00:00+00:00
Category: ML
Author: Analytics Vidhya
Body: Machine learning is a branch of artificial intelligence that focuses on building systems that [...]



#### Contains Any Filter

In [85]:
response = article.query.fetch_objects(
    filters=(
        Filter.by_property("body").contains_any(["cybersecurity", "security"])
    ),
    limit=2
)

print_objects(response.objects)

ID: 76158799381221900616382927789714661060
Metadata: []
Title: Cryptography
Date: 2022-08-25 00:00:00+00:00
Category: Cybersecurity
Author: TechCrunch
Body: Cryptography is the study and practice of securing communication and data from unauthorized [...]

ID: 327501226512119996986741418061563631305
Metadata: []
Title: The Importance of Cybersecurity
Date: 2022-12-12 00:00:00+00:00
Category: Cybersecurity
Author: Krebs on Security
Body: Cybersecurity involves protecting computer systems and networks from information disclosure, [...]



#### Contains All Filter

In [86]:
response = article.query.fetch_objects(
    filters=(
        Filter.by_property("body").contains_all(["cybersecurity", "security"])
    ),
    limit=2
)

print_objects(response.objects)

ID: 327501226512119996986741418061563631305
Metadata: []
Title: The Importance of Cybersecurity
Date: 2022-12-12 00:00:00+00:00
Category: Cybersecurity
Author: Krebs on Security
Body: Cybersecurity involves protecting computer systems and networks from information disclosure, [...]



#### Filter by Date

In [88]:
from datetime import datetime, timezone

# filter for articles published after 1 March 2023 (UTC)
filter_time = datetime(2023, 3, 1, tzinfo=timezone.utc)

response = article.query.fetch_objects(
    limit=3,
    filters=(
        Filter.by_property("date").greater_than(filter_time) &
        Filter.by_property("category").equal("Web Development")
    )
)

print_objects(response.objects)

ID: 160388162426943058060510827950556935940
Metadata: []
Title: RESTful API Design
Date: 2023-04-10 00:00:00+00:00
Category: Web Development
Author: Smashing Magazine
Body: RESTful APIs are an architectural style for designing networked applications. They rely on [...]

