# Lesson 4: Combining Lexical and Embedding-Based Retrieval in RAG Systems


Welcome to the fourth and final lesson of **Beyond Basic RAG: Improving Our Pipeline**! So far, we have explored ways to enhance Retrieval-Augmented Generation (RAG) systems by refining chunking strategies and using more advanced retrieval methods. In this lesson, you will learn how to merge lexical retrieval (using Okapi BM25) with your existing embedding-based retrieval, creating a **hybrid retrieval** pipeline.

By the end of this lesson, you should be able to:

- Grasp the intuition behind Okapi BM25 for lexical retrieval.  
- Construct a BM25 index on your corpus.  
- Combine BM25 scores with embedding-based retrieval scores using a configurable weight parameter, **alpha**.  

---

## Understanding the Okapi BM25 Algorithm  
Within the category of lexical-based search methods, **Okapi BM25** is a popular choice. It focuses on the presence of specific keywords, rewarding relevant chunks that contain more occurrences of the query terms. At the same time, it avoids overemphasizing repeated words by incorporating a saturation effect.

**A few core ideas behind BM25:**
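- **Term Frequency (TF):** More keyword matches in a chunk can signal higher relevance, but each additional match adds less than the one before it (saturation).
- **Inverse Document Frequency (IDF):** Query terms that appear in few chunks count for more than terms that appear almost everywhere.
- **Document Length Normalization:** BM25 scales term frequency by the chunk's length relative to the average chunk length, so a long chunk does not outrank a short one merely because it has more room for repeated words.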
- Although the underlying formula has several parameters and normalizations, the general purpose is straightforward: **favor chunks containing the search terms, but don’t let them dominate purely by repeating keywords.**  
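
To make the saturation and length-normalization ideas concrete, here is a minimal pure-Python sketch of one common textbook form of BM25. The function name `bm25_scores`, the toy corpus, and the values `k1=1.5` and `b=0.75` are illustrative assumptions; `rank_bm25` handles IDF in its own way, so its numbers may differ from these.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query (assumes a non-empty corpus)."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_counts = Counter(doc)
        # Longer-than-average documents get a larger factor, which lowers their score.
        length_factor = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_tokens:
            tf = term_counts[term]
            if tf == 0:
                continue
            # A non-negative IDF variant; rarer terms receive a larger weight.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # The tf / (tf + constant) shape is what produces saturation.
            score += idf * tf * (k1 + 1) / (tf + k1 * length_factor)
        scores.append(score)
    return scores


corpus = [
    "refund policy for returns".split(),
    "refund refund refund refund policy".split(),
    "shipping times and delivery options for orders".split(),
]
scores = bm25_scores(["refund"], corpus)
for doc, score in zip(corpus, scores):
    print(f"{score:.3f}  {' '.join(doc)}")
```

The second chunk repeats "refund" four times, yet its score is well under four times that of the first chunk, and the chunk without the term scores zero.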

---

## Building a BM25 Index  
Here is a simple function that builds a BM25 index from your chunked corpus. We assume you already have a collection of text chunks ready.

```python
from rank_bm25 import BM25Okapi

def build_bm25_index(chunks):
    """
    Build a BM25Okapi index from the chunk texts for lexical-based retrieval.
    BM25 scores are not confined to a fixed range; their magnitude depends
    on the corpus and the query.
    """
    # Convert each chunk's text into a list of lowercased tokens
    corpus = [c["text"].lower().split() for c in chunks]
    return BM25Okapi(corpus)
```

In this snippet:

1. We lowercase and split each chunk’s text into tokens.
2. We pass the tokenized corpus into `BM25Okapi` to create our lexical index.

Later, we’ll score new queries on this index to get relevance.
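
One detail worth checking on your own data: `.lower().split()` splits on whitespace only, so punctuation stays attached to tokens and `policy?` will not match `policy`. The sketch below contrasts it with a hypothetical regex-based helper, `simple_tokenize`, which is not part of `rank_bm25`. Whichever tokenizer you pick, apply the same one to both the corpus and the query.

```python
import re


def simple_tokenize(text):
    """Lowercase the text and keep runs of ASCII letters and digits as tokens."""
    # Limitation of this sketch: non-ASCII letters are dropped.
    return re.findall(r"[a-z0-9]+", text.lower())


query = "What's our refund policy?"
whitespace_tokens = query.lower().split()
regex_tokens = simple_tokenize(query)

print(whitespace_tokens)
print(regex_tokens)
```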

---

## Merging BM25 and Embedding-Based Retrieval

### BM25 Scoring & Similarity Calculation

```python
import numpy as np

def hybrid_retrieval(query, chunks, bm25, collection, top_k=3, alpha=0.5):
    """
    Merge BM25 and embedding-based results:
      1) Compute BM25 scores for each chunk.
      2) Get embedding distances and convert to similarities.
      3) Min-max normalize BM25 to [0,1]; the similarities from step 2
         already lie in (0,1] for non-negative distances.
      4) Combine scores using final_score = alpha * bm25_norm + (1-alpha) * embed_sim.
      5) Sort by final score descending, and return the top_k results.
    """
    # Tokenize the query for BM25
    tokenized_query = query.lower().split()
    bm25_scores = bm25.get_scores(tokenized_query)

    # Query the embedding-based store for candidate chunks
    embed_results = collection.query(
        query_texts=[query],
        n_results=min(top_k * 5, len(chunks))
    )

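    # Convert distances to similarities.
    # NOTE: the merge step below looks scores up by integer chunk index, so the
    # ids stored in the collection must be those same indices. If your ids are
    # strings (e.g. "0", "1"), convert them before using them as keys here.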
    embed_scores_dict = {}
    for i in range(len(embed_results['documents'][0])):
        idx = embed_results['ids'][0][i]
        distance = embed_results['distances'][0][i]
        similarity = 1 / (1 + distance)
        embed_scores_dict[idx] = similarity
```

* **Tokenization & BM25:** We lowercase and split the query, then call `bm25.get_scores()`.
* **Embedding Retrieval:** We fetch the top candidates via our embedding store.
* **Distance → Similarity:** We use `1 / (1 + distance)` to turn distances into similarity scores.
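
As a quick illustration of the conversion, the sketch below applies `1 / (1 + distance)` to a few sample values. It assumes distances are non-negative; the helper name `distance_to_similarity` is introduced here only for illustration.

```python
def distance_to_similarity(distance):
    """Map a non-negative distance to a similarity in (0, 1]."""
    return 1 / (1 + distance)


for d in [0.0, 0.5, 1.0, 4.0]:
    print(f"distance={d:.1f} -> similarity={distance_to_similarity(d):.3f}")
```

A distance of 0 maps to a similarity of 1, and the similarity shrinks toward 0 as the distance grows, so closer chunks always receive higher scores.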

### Normalizing Scores & Final Ranking

```python
    # Merge scores
    merged = []
    bm25_min, bm25_max = (
        (min(bm25_scores), max(bm25_scores))
        if bm25_scores.size > 0 else (0, 1)
    )
    for i, chunk in enumerate(chunks):
        bm25_raw = bm25_scores[i]
        bm25_norm = (
            (bm25_raw - bm25_min) / (bm25_max - bm25_min)
            if bm25_max != bm25_min else 0.0
        )
        embed_sim = embed_scores_dict.get(i, 0.0)
        final_score = alpha * bm25_norm + (1 - alpha) * embed_sim
        merged.append((i, final_score))

    # Sort by combined score, highest first
    merged.sort(key=lambda x: x[1], reverse=True)
    top_results = merged[:top_k]
    return [(idx, chunks[idx], score) for (idx, score) in top_results]
```

Here’s what happens:

1. **Find min/max BM25 scores** for normalization.
2. **Normalize** each BM25 score to the `[0,1]` range.
3. **Combine** normalized BM25 (`bm25_norm`) and embedding similarity (`embed_sim`) via the weight `alpha`.
4. **Sort** by the final combined score and return the top k chunks.
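
The steps above can be traced on toy numbers. This is a minimal standalone sketch, not the lesson's function: the helper names `min_max_normalize` and `combine` and all scores are made up for illustration, and the input lists are assumed to be non-empty.

```python
def min_max_normalize(scores):
    """Rescale scores to [0, 1]; return all zeros when every score is identical."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def combine(bm25_scores, embed_sims, alpha=0.5):
    """Weighted sum of normalized BM25 scores and embedding similarities."""
    bm25_norm = min_max_normalize(bm25_scores)
    return [alpha * b + (1 - alpha) * e for b, e in zip(bm25_norm, embed_sims)]


bm25_raw = [2.0, 6.0, 4.0]    # hypothetical raw BM25 scores for three chunks
embed_sim = [0.8, 0.4, 0.5]   # hypothetical embedding similarities

final = combine(bm25_raw, embed_sim, alpha=0.5)
ranking = sorted(range(len(final)), key=lambda i: final[i], reverse=True)
print(final)
print(ranking)
```

Chunk 1 has the best BM25 score and chunk 0 the best embedding similarity; with equal weights, chunk 1 ends up first because its normalized BM25 advantage is larger than chunk 0's similarity advantage.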

---

## Putting It All Together

```python
# Build corpus chunks, BM25 index, and embedding-based store
chunked_docs = load_and_chunk_corpus(..., 40)
bm25_index   = build_bm25_index(chunked_docs)
collection   = build_chroma_collection(chunked_docs)

# Perform hybrid retrieval
query   = "What do our internal company policies state?"
results = hybrid_retrieval(
    query, chunked_docs, bm25_index, collection,
    top_k=3, alpha=0.6
)

# Inspect the results
if not results:
    print("No chunks found. You may want to provide a generic response.")
else:
    for chunk_idx, chunk_data, final_score in results:
        print(f"Chunk {chunk_idx} | Score: {final_score:.4f}")
        print("Text:", chunk_data['text'])
        print("-----")
```

In this example we:

* **Load** and **chunk** our document set.
* **Build** both the BM25 index and the embedding-based collection.
* **Run** `hybrid_retrieval` with a chosen `alpha`.
* **Display** the top results by their combined scores.

---

## Choosing the Alpha Parameter

The **alpha** parameter determines the balance between lexical and semantic retrieval:

* **Higher Alpha** (e.g., 0.7 or 0.8):
  Prioritize exact keyword matches. Ideal for legal or technical documents where precise terminology matters.

* **Lower Alpha** (e.g., 0.3 or 0.2):
  Emphasize semantic understanding. Best for creative writing or conversational queries where context is key.

* **Balanced Alpha** (e.g., 0.5):
  Give equal weight to lexical precision and semantic context—a good starting point.

> *Tip:* Experiment with different alpha values to see how they affect retrieval quality in your specific use case.
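
One simple way to run that experiment is to sweep alpha and watch the ranking change. The sketch below uses made-up, already-normalized scores for three hypothetical chunks; in practice you would compare rankings against queries whose relevant chunks you already know.

```python
# Hypothetical, already-normalized scores for three chunks.
bm25_norm = {"chunk_a": 1.0, "chunk_b": 0.2, "chunk_c": 0.0}
embed_sim = {"chunk_a": 0.3, "chunk_b": 0.9, "chunk_c": 0.6}


def rank(alpha):
    """Return chunk names ordered by combined score, highest first."""
    final = {
        name: alpha * bm25_norm[name] + (1 - alpha) * embed_sim[name]
        for name in bm25_norm
    }
    return sorted(final, key=final.get, reverse=True)


for alpha in (0.2, 0.5, 0.8):
    print(alpha, rank(alpha))
```

With a low alpha the semantically closest chunk (`chunk_b`) ranks first; as alpha rises, the chunk with the strongest keyword match (`chunk_a`) takes over.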

---

## Conclusion and Next Steps

In this lesson, you explored how to improve retrieval accuracy by combining Okapi BM25 with embedding-based methods. This hybrid approach helps you capture both exact keyword matches and semantic relevance, reducing the chance of missing important chunks because of vocabulary differences.

**Next steps:**

* Test your hybrid pipeline with various queries.
* Adjust chunk sizes, scoring thresholds, and the alpha parameter.
* Observe and measure retrieval improvements in your downstream RAG tasks.

Happy experimenting!



## Enhance Hybrid Retrieval Function


You have seen how lexical and embedding-based retrieval can be combined. Now complete the `hybrid_retrieval` function by implementing the step that converts embedding distances into similarity scores.

## Your Objective

1. **Locate the placeholder** in the `hybrid_retrieval` function where the conversion should take place.
2. **Apply the formula**

   ```python
   similarity = 1 / (1 + distance)
   ```

   to calculate the similarity score from each embedding distance.
3. **Store the result** correctly for each chunk in the `embed_scores_dict`.

By completing this, you’ll deepen your understanding of hybrid retrieval systems. Remember, the goal is to balance the precision of lexical matches with the contextual understanding of embeddings.

Enjoy the process, and happy coding!

```python
import numpy as np
from rank_bm25 import BM25Okapi
from data import load_and_chunk_corpus
from vector_db import build_chroma_collection


def build_bm25_index(chunks):
    """
    Build a BM25Okapi index from the chunk texts for lexical-based retrieval.
    Note: BM25 scores are not confined to a fixed range; their magnitude depends on the corpus and query.
    """
    corpus = [c["text"].lower().split() for c in chunks]
    return BM25Okapi(corpus)


def hybrid_retrieval(query, chunks, bm25, collection, top_k=3, alpha=0.5):
    """
    Merge BM25 and embedding-based results.
    Steps:
      1) Compute BM25 scores for each chunk. (Higher = better)
      2) Get embedding distances from ChromaDB for a candidate set.
      3) Convert distances to similarity (e.g., similarity ~ 1/(1+distance)).
      4) Min-max normalize BM25 to [0,1] (the similarity from step 3 already lies
         in (0,1] for non-negative distances) and combine with weighting:
         final_score = alpha * BM25_normalized + (1-alpha) * embedding_similarity
      5) Sort by final score in descending order.

    'alpha' controls how much weight lexical vs. embedding-based similarity gets.
    In practice, you might do cross-validation or user acceptance testing to find a good alpha.
    """
    tokenized_query = query.lower().split()
    bm25_scores = bm25.get_scores(tokenized_query)
    bm25_min, bm25_max = (min(bm25_scores), max(bm25_scores)) if bm25_scores.size > 0 else (0, 1)

    embed_results = collection.query(query_texts=[query], n_results=min(top_k*5, len(chunks)))
    embed_scores_dict = {}
    for i in range(len(embed_results['documents'][0])):
        idx = embed_results['ids'][0][i]
        distance = embed_results['distances'][0][i]
        # TODO: Convert distance to similarity
        similarity = _____
        embed_scores_dict[idx] = similarity

    merged = []
    for i, chunk in enumerate(chunks):
        bm25_raw = bm25_scores[i]
        if bm25_max != bm25_min:
            bm25_norm = (bm25_raw - bm25_min) / (bm25_max - bm25_min)
        else:
            bm25_norm = 0.0

        embed_sim = embed_scores_dict.get(i, 0.0)
        final_score = alpha * bm25_norm + (1 - alpha) * embed_sim
        merged.append((i, final_score))

    merged.sort(key=lambda x: x[1], reverse=True)
    top_results = merged[:top_k]

    print(f"Top results by combined BM25 + embeddings for query: '{query}'")
    for idx, score in top_results:
        print(f"Chunk: '{chunks[idx]['text'][:50]}...' | Score: {score:.4f}")
    return [(idx, chunks[idx], score) for (idx, score) in top_results]


if __name__ == "__main__":
    chunked_docs = load_and_chunk_corpus("data/corpus.json", 40)
    bm25_index = build_bm25_index(chunked_docs)
    collection = build_chroma_collection(chunked_docs, collection_name="hybrid_collection")

    query = "What do our internal company policies state?"
    results = hybrid_retrieval(query, chunked_docs, bm25_index, collection, top_k=3, alpha=0.6)
    if not results:
        print("No chunks found. Fallback to a naive or apology answer.")
    else:
        for r in results:
            chunk_idx, chunk_data, final_score = r
            print(f"Chunk ID: {chunk_idx}, Score: {final_score:.4f}, Text: {chunk_data['text']}")

```

## Fix BM25 Normalization Bug



## Refine Your Retrieval System

Congratulations on successfully integrating lexical and embedding-based retrieval methods in the previous exercise! Now, let's enhance the accuracy of our hybrid retrieval system by focusing on the BM25 normalization process.

The `hybrid_retrieval` function below contains a bug in how it normalizes BM25 scores. Find it and fix it to make your retrieval system more robust!

```python
import numpy as np
from rank_bm25 import BM25Okapi
from data import load_and_chunk_corpus
from vector_db import build_chroma_collection


def build_bm25_index(chunks):
    """
    Build a BM25Okapi index from the chunk texts for lexical-based retrieval.
    Note: BM25 scores are not confined to a fixed range; their magnitude depends on the corpus and query.
    """
    corpus = [c["text"].lower().split() for c in chunks]
    return BM25Okapi(corpus)


def hybrid_retrieval(query, chunks, bm25, collection, top_k=3, alpha=0.5):
    """
    Merge BM25 and embedding-based results.
    Steps:
      1) Compute BM25 scores for each chunk. (Higher = better)
      2) Get embedding distances from ChromaDB for a candidate set.
      3) Convert distances to similarity (e.g., similarity ~ 1/(1+distance)).
      4) Min-max normalize BM25 to [0,1] (the similarity from step 3 already lies
         in (0,1] for non-negative distances) and combine with weighting:
         final_score = alpha * BM25_normalized + (1-alpha) * embedding_similarity
      5) Sort by final score in descending order.

    'alpha' controls how much weight lexical vs. embedding-based similarity gets.
    In practice, you might do cross-validation or user acceptance testing to find a good alpha.
    """
    tokenized_query = query.lower().split()
    bm25_scores = bm25.get_scores(tokenized_query)
    bm25_min, bm25_max = (min(bm25_scores), max(bm25_scores)) if bm25_scores.size > 0 else (0, 1)

    embed_results = collection.query(query_texts=[query], n_results=min(top_k * 5, len(chunks)))
    embed_scores_dict = {}
    for i in range(len(embed_results['documents'][0])):
        idx = embed_results['ids'][0][i]
        distance = embed_results['distances'][0][i]
        similarity = 1 / (1 + distance)
        embed_scores_dict[idx] = similarity

    merged = []
    for i, chunk in enumerate(chunks):
        bm25_raw = bm25_scores[i]
        if bm25_max != bm25_min:
            bm25_norm = (bm25_raw - bm25_min) / (bm25_max - bm25_min)
        else:
            bm25_norm = 1.0 

        embed_sim = embed_scores_dict.get(i, 0.0)
        final_score = alpha * bm25_norm + (1 - alpha) * embed_sim
        merged.append((i, final_score))

    merged.sort(key=lambda x: x[1], reverse=True)
    top_results = merged[:top_k]

    print(f"Top results by combined BM25 + embeddings for query: '{query}'")
    for idx, score in top_results:
        print(f"Chunk: '{chunks[idx]['text'][:50]}...' | Score: {score:.4f}")
    return [(idx, chunks[idx], score) for (idx, score) in top_results]


if __name__ == "__main__":
    chunked_docs = load_and_chunk_corpus("data/corpus.json", 40)
    bm25_index = build_bm25_index(chunked_docs)
    collection = build_chroma_collection(chunked_docs, collection_name="hybrid_collection")

    query = "What do our internal company policies state?"
    results = hybrid_retrieval(query, chunked_docs, bm25_index, collection, top_k=3, alpha=0.6)
    if not results:
        print("No chunks found. Fallback to a naive or apology answer.")
    else:
        for r in results:
            chunk_idx, chunk_data, final_score = r
            print(f"Chunk ID: {chunk_idx}, Score: {final_score:.4f}, Text: {chunk_data['text']}")
            
```

## Fixing the BM25 Normalization Bug

In the degenerate case where all BM25 scores are identical (so `bm25_max == bm25_min`), BM25 gives no basis for preferring one chunk over another. Setting the normalized score to `1.0` adds the full `alpha` to every chunk's final score: the ranking is unchanged, but the absolute scores are inflated, which matters as soon as scores are compared against a threshold. Falling back to `0.0` leaves the final score determined by the embedding-based similarity alone.

---

### Change this:

```python
    if bm25_max != bm25_min:
        bm25_norm = (bm25_raw - bm25_min) / (bm25_max - bm25_min)
    else:
        bm25_norm = 1.0
```

### To this:

```python
    if bm25_max != bm25_min:
        bm25_norm = (bm25_raw - bm25_min) / (bm25_max - bm25_min)
    else:
        bm25_norm = 0.0
```

---

### Full Updated `hybrid_retrieval` Snippet

```python
def hybrid_retrieval(query, chunks, bm25, collection, top_k=3, alpha=0.5):
    tokenized_query = query.lower().split()
    bm25_scores = bm25.get_scores(tokenized_query)
    bm25_min, bm25_max = (min(bm25_scores), max(bm25_scores)) if bm25_scores.size > 0 else (0, 1)

    embed_results = collection.query(
        query_texts=[query],
        n_results=min(top_k * 5, len(chunks))
    )
    embed_scores_dict = {}
    for i in range(len(embed_results['documents'][0])):
        idx = embed_results['ids'][0][i]
        distance = embed_results['distances'][0][i]
        similarity = 1 / (1 + distance)
        embed_scores_dict[idx] = similarity

    merged = []
    for i, chunk in enumerate(chunks):
        bm25_raw = bm25_scores[i]
        if bm25_max != bm25_min:
            bm25_norm = (bm25_raw - bm25_min) / (bm25_max - bm25_min)
        else:
            bm25_norm = 0.0  # <-- fixed here

        embed_sim = embed_scores_dict.get(i, 0.0)
        final_score = alpha * bm25_norm + (1 - alpha) * embed_sim
        merged.append((i, final_score))

    merged.sort(key=lambda x: x[1], reverse=True)
    return [(idx, chunks[idx], score) for idx, score in merged[:top_k]]
```

With this change, when BM25 provides no variation, the system relies fully on embedding similarities rather than inflating every chunk’s lexical score.
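
The effect of the fallback value can be checked on toy numbers. The sketch below is a simplified standalone version of the merge step, with a hypothetical `degenerate_value` parameter and made-up similarities; the 0.2 cutoff is likewise just an example threshold.

```python
def combined_scores(bm25_scores, embed_sims, alpha, degenerate_value):
    """Combine scores, using degenerate_value when all BM25 scores are equal."""
    lo, hi = min(bm25_scores), max(bm25_scores)
    finals = []
    for raw, sim in zip(bm25_scores, embed_sims):
        norm = (raw - lo) / (hi - lo) if hi != lo else degenerate_value
        finals.append(alpha * norm + (1 - alpha) * sim)
    return finals


bm25_all_zero = [0.0, 0.0, 0.0]   # no query term occurs in any chunk
embed_sims = [0.10, 0.30, 0.05]   # hypothetical, weak embedding matches

with_one = combined_scores(bm25_all_zero, embed_sims, alpha=0.6, degenerate_value=1.0)
with_zero = combined_scores(bm25_all_zero, embed_sims, alpha=0.6, degenerate_value=0.0)


def order(scores):
    """Indices sorted by score, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


print(with_one, order(with_one))
print(with_zero, order(with_zero))
```

Both variants rank the chunks identically, but with `1.0` every chunk clears a 0.2 cutoff even though nothing matched lexically and the embedding matches are weak, while with `0.0` none of them does.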


Congratulations on fixing the BM25 normalization bug in the previous exercise! Now, let's elevate your hybrid retrieval system by refining the scoring process to ensure that only the most relevant chunks are selected.

Your objective is to enhance the `hybrid_retrieval` function by implementing a filtering mechanism. This involves discarding chunks whose combined score falls below 0.2 before proceeding to the top k selection. By doing so, you ensure that only the most pertinent segments are considered, improving the precision of your retrieval system.

Here's a concise breakdown of what you need to achieve:

- Integrate logic within the `hybrid_retrieval` function to filter out chunks with a final score of less than 0.2.
- Ensure this filtering step occurs before sorting and selecting the top k results.

By implementing this improvement, you'll enhance the effectiveness of your retrieval system, providing users with the most relevant information. Dive in and make your system even more precise!


```python
from rank_bm25 import BM25Okapi
from data import load_and_chunk_corpus
from vector_db import build_chroma_collection


def build_bm25_index(chunks):
    """
    Build a BM25Okapi index from the chunk texts for lexical-based retrieval.
    Note: BM25 scores are not confined to a fixed range; their magnitude depends on the corpus and query.
    """
    corpus = [c["text"].lower().split() for c in chunks]
    return BM25Okapi(corpus)


def hybrid_retrieval(query, chunks, bm25, collection, top_k=3, alpha=0.5):
    """
    Merge BM25 and embedding-based results.
    Steps:
      1) Compute BM25 scores for each chunk. (Higher = better)
      2) Get embedding distances from ChromaDB for a candidate set.
      3) Convert distances to similarity (e.g., similarity ~ 1/(1+distance)).
      4) Min-max normalize BM25 to [0,1] (the similarity from step 3 already lies
         in (0,1] for non-negative distances) and combine with weighting:
         final_score = alpha * BM25_normalized + (1 - alpha) * embedding_similarity
      5) Discard any chunk with final_score < 0.2
      6) Sort remaining chunks by final score in descending order.
      7) Return the top_k results.

    'alpha' controls how much weight lexical vs. embedding-based similarity gets.
    """
    tokenized_query = query.lower().split()
    bm25_scores = bm25.get_scores(tokenized_query)
    bm25_min, bm25_max = (min(bm25_scores), max(bm25_scores)) if bm25_scores.size > 0 else (0, 1)

    embed_results = collection.query(query_texts=[query], n_results=min(top_k * 5, len(chunks)))
    embed_scores_dict = {}
    for i in range(len(embed_results['documents'][0])):
        idx = embed_results['ids'][0][i]
        distance = embed_results['distances'][0][i]
        similarity = 1 / (1 + distance)
        embed_scores_dict[idx] = similarity

    merged = []
    for i, chunk in enumerate(chunks):
        bm25_raw = bm25_scores[i]
        if bm25_max != bm25_min:
            bm25_norm = (bm25_raw - bm25_min) / (bm25_max - bm25_min)
        else:
            bm25_norm = 0.0

        embed_sim = embed_scores_dict.get(i, 0.0)
        final_score = alpha * bm25_norm + (1 - alpha) * embed_sim
        merged.append((i, final_score))

    # TODO: Filter out chunks with scores below 0.2 before sorting and selecting top_k results
    merged.sort(key=lambda x: x[1], reverse=True)
    top_results = merged[:top_k]

    print(f"Top results by combined BM25 + embeddings for query: '{query}'")
    for idx, score in top_results:
        print(f"Chunk: '{chunks[idx]['text'][:50]}...' | Score: {score:.4f}")
    return [(idx, chunks[idx], score) for (idx, score) in top_results]


if __name__ == "__main__":
    chunked_docs = load_and_chunk_corpus("data/corpus.json", 40)
    bm25_index = build_bm25_index(chunked_docs)
    collection = build_chroma_collection(chunked_docs, collection_name="hybrid_collection")

    query = "What do our internal company policies state?"
    results = hybrid_retrieval(query, chunked_docs, bm25_index, collection, top_k=3, alpha=0.6)
    if not results:
        print("No chunks found. Fallback to a naive or apology answer.")
    else:
        for r in results:
            chunk_idx, chunk_data, final_score = r
            print(f"Chunk ID: {chunk_idx}, Score: {final_score:.4f}, Text: {chunk_data['text']}")
```

You're almost there! To implement the filtering step correctly in your `hybrid_retrieval` function, you'll need to discard chunks with a `final_score < 0.2` **before** sorting and slicing the top results.

Here’s the updated section of the function with the filter applied properly:

---

### ✅ Apply the Filter (Update Your Function)

Replace this block:

```python
    # TODO: Filter out chunks with scores below 0.2 before sorting and selecting top_k results
    merged.sort(key=lambda x: x[1], reverse=True)
    top_results = merged[:top_k]
```

With this updated version:

```python
    # Filter out chunks with score < 0.2
    filtered = [(i, score) for (i, score) in merged if score >= 0.2]

    # Sort and select top_k from filtered results
    filtered.sort(key=lambda x: x[1], reverse=True)
    top_results = filtered[:top_k]
```

---

### 🔍 Why This Works:

* `filtered` keeps only chunks whose combined score is at least 0.2.
* Sorting happens only after the low-scoring chunks are discarded.
* Every returned result clears the threshold, so the function may return fewer than `top_k` results, or none at all.

---

A natural extension is a fallback (for example, returning a summary or a “no good matches” message) for the case where every score falls below 0.2.
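
The filter and the empty-result case can be sketched in isolation as follows. The helper `filter_and_select`, its `min_score` parameter, and the sample scores are hypothetical and only illustrate the idea; they are not part of the exercise's starter code.

```python
def filter_and_select(merged, top_k=3, min_score=0.2):
    """Drop (index, score) pairs below min_score, then return the top_k by score."""
    filtered = [(i, score) for i, score in merged if score >= min_score]
    filtered.sort(key=lambda pair: pair[1], reverse=True)
    return filtered[:top_k]


# Two of the four chunks clear the threshold, so only two are returned.
merged = [(0, 0.05), (1, 0.45), (2, 0.19), (3, 0.31)]
top = filter_and_select(merged, top_k=3)
print(top)

# When nothing clears the threshold, the caller decides how to respond.
weak = filter_and_select([(0, 0.05), (1, 0.10)])
message = "No sufficiently relevant chunks found." if not weak else ""
print(weak, message)
```

Returning an empty list and letting the caller choose the fallback keeps the retrieval function free of presentation logic.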


Welcome to the next step in mastering hybrid retrieval systems! You've done an excellent job exploring the combination of lexical and embedding-based retrieval methods. Now, it's time to deepen your understanding by reimplementing the `build_bm25_index` function from scratch, focusing on lexical-based retrieval using the Okapi BM25 algorithm.

Here is your objective:

- Begin by tokenizing the text chunks. Convert them to lowercase and split them into words to ensure your index is case-insensitive and optimized for lexical search.
- Utilize the `BM25Okapi` class to construct the index from these tokenized chunks.
- Return the resulting BM25 object, which will be essential for scoring queries against your corpus.

By completing this exercise, you'll gain a deeper appreciation for how lexical retrieval works and how it can be integrated into a hybrid system. Remember, the goal is to create an index that efficiently retrieves relevant chunks based on keyword matches. Enjoy the challenge, and happy coding!

```python
import numpy as np
from rank_bm25 import BM25Okapi
from data import load_and_chunk_corpus
from vector_db import build_chroma_collection

def build_bm25_index(chunks):
    """
    Build a BM25Okapi index from the chunk texts for lexical-based retrieval.
    Reimplemented from scratch by lowercasing and splitting each chunk into tokens.
    """
    # TODO: Initialize an empty list to store tokenized chunks
    corpus = []

    # TODO: Tokenize each chunk's text by converting to lowercase and splitting into words

    # TODO: Create and return the BM25Okapi index using the tokenized corpus

def hybrid_retrieval(query, chunks, bm25, collection, top_k=3, alpha=0.5):
    """
    Merge BM25 and embedding-based results.
    Steps:
      1) Compute BM25 scores for each chunk. (Higher = better)
      2) Get embedding distances from ChromaDB for a candidate set.
      3) Convert distances to similarity (e.g., similarity ~ 1/(1+distance)).
      4) Min-max normalize BM25 to [0,1] (the similarity from step 3 already lies
         in (0,1] for non-negative distances) and combine with weighting:
         final_score = alpha * BM25_normalized + (1-alpha) * embedding_similarity
      5) Sort by final score in descending order.

    'alpha' controls how much weight lexical vs. embedding-based similarity gets.
    In practice, you might do cross-validation or user acceptance testing to find a good alpha.
    """
    tokenized_query = query.lower().split()
    bm25_scores = bm25.get_scores(tokenized_query)
    bm25_min, bm25_max = (min(bm25_scores), max(bm25_scores)) if bm25_scores.size > 0 else (0, 1)

    embed_results = collection.query(query_texts=[query], n_results=min(top_k*5, len(chunks)))
    embed_scores_dict = {}
    for i in range(len(embed_results['documents'][0])):
        idx = embed_results['ids'][0][i]
        distance = embed_results['distances'][0][i]
        similarity = 1 / (1 + distance)
        embed_scores_dict[idx] = similarity

    merged = []
    for i, chunk in enumerate(chunks):
        bm25_raw = bm25_scores[i]
        if bm25_max != bm25_min:
            bm25_norm = (bm25_raw - bm25_min) / (bm25_max - bm25_min)
        else:
            bm25_norm = 0.0

        embed_sim = embed_scores_dict.get(i, 0.0)
        final_score = alpha * bm25_norm + (1 - alpha) * embed_sim
        merged.append((i, final_score))

    merged.sort(key=lambda x: x[1], reverse=True)
    top_results = merged[:top_k]

    print(f"Top results by combined BM25 + embeddings for query: '{query}'")
    for idx, score in top_results:
        print(f"Chunk: '{chunks[idx]['text'][:50]}...' | Score: {score:.4f}")
    return [(idx, chunks[idx], score) for (idx, score) in top_results]


if __name__ == "__main__":
    chunked_docs = load_and_chunk_corpus("data/corpus.json", 40)
    bm25_index = build_bm25_index(chunked_docs)
    collection = build_chroma_collection(chunked_docs, collection_name="hybrid_collection")

    query = "What do our internal company policies state?"
    results = hybrid_retrieval(query, chunked_docs, bm25_index, collection, top_k=3, alpha=0.6)
    if not results:
        print("No chunks found. Fallback to a naive or apology answer.")
    else:
        for chunk_idx, chunk_data, final_score in results:
            print(f"Chunk ID: {chunk_idx}, Score: {final_score:.4f}, Text: {chunk_data['text']}")


```