# Lesson 4: Metadata-Based Filtering in RAG Systems

Welcome to the final lesson in our **Scaling Up RAG with Vector Databases** course! Previously, we explored how to chunk large documents for efficient retrieval, store these chunks in a vector database (such as ChromaDB), and then retrieve them to build prompts for Large Language Models (LLMs). Remember, chunking and storing these text fragments provided the basic scaffolding for a Retrieval-Augmented Generation (RAG) pipeline.

In this lesson, we will expand on that foundation by introducing **metadata-based filtering**, which allows you to target specific attributes—like category or date—and make your content searches significantly more precise. By the end, you will be able to create queries that focus only on the metadata you care about, such as retrieving documents from specific categories.

---

# Understanding Metadata in RAG Systems

## What is Metadata, and Why Does It Matter?

Metadata includes any labeled information that describes your text chunks. Common examples are **category**, **date**, or **title**. When you have a large collection of documents, a normal text-based similarity search might return results you don’t actually want. But by selectively filtering on metadata, you can drastically reduce irrelevant results and ensure only the most pertinent information is retrieved.

## Real-World Example

Imagine a large enterprise knowledge base spanning different departments (e.g., Human Resources, Technology, Finance). If you only want to see technology-related documents, applying a simple metadata filter on the `category` field ensures that your search never strays into HR or Finance content. This becomes particularly useful when you have specialized queries that are domain-specific and need accurate, fast retrieval.

---

# Building the Filter Logic

Let’s move to coding, first focusing on the **metadata filter**:

```python
def metadata_enhanced_search(query, collection, categories=None, top_k=3):
    # If categories are provided, build the filter
    where = {"category": {"$in": categories}} if categories else None
```

Here, we take an optional list of categories and build a where clause:

* A **where** query in ChromaDB acts like a targeted filter on the collection (similar to a `WHERE` clause in SQL).
* By specifying `{"category": {"$in": categories}}`, only documents with a matching category will be returned.
* The `$` symbol in `$in` denotes a special operator in ChromaDB’s query language, similar to MongoDB’s syntax.
* If no categories are passed, the filter is set to `None`, which tells ChromaDB to run a broader, unfiltered search.
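
As a minimal sketch of the same logic in isolation, the helper below returns the dictionary that would be passed as `where`. The name `build_category_filter` is hypothetical and not part of ChromaDB. It needs no database, so you can inspect the filter before running a query.

```python
def build_category_filter(categories=None):
    """Return a ChromaDB-style `where` dict, or None for an unfiltered search."""
    # An empty list is falsy, so it is treated the same as None (no filter).
    if not categories:
        return None
    # Copy into a list so a tuple of categories is accepted as well.
    return {"category": {"$in": list(categories)}}


print(build_category_filter(["Education", "Technology"]))
print(build_category_filter([]))
print(build_category_filter(None))
```

Keeping the filter construction separate from the query call makes it easy to unit-test the filter on its own.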

---

# Executing the Query and Structuring Results

```python
    # Perform the query with an optional metadata filter
    results = collection.query(
        query_texts=[query],
        n_results=top_k,
        where=where
    )
    
    # Compile the results
    retrieved_chunks = []
    for i in range(len(results['documents'][0])):
        retrieved_chunks.append({
            "chunk":    results['documents'][0][i],
            "doc_id":   results['metadatas'][0][i]['doc_id'],
            "category": results['metadatas'][0][i].get('category'),
            "distance": results['distances'][0][i]
        })
    
    return retrieved_chunks
```

* The `collection.query` method retrieves the most similar documents to `query_texts`.
* `n_results` controls how many matching chunks to retrieve per query.
* The `where` parameter defines our metadata filter.
* After running the query, we gather each chunk’s text, document ID, category, and distance score into a structured list.
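* The `distance` value measures how far the retrieved chunk's embedding is from the query's embedding, so smaller values mean closer semantic matches. The metric depends on how the collection is configured: ChromaDB defaults to **squared L2 distance**, and **cosine distance** (`1 - cosine_similarity`) applies only when the collection is created with that setting. In both cases, larger values indicate less similarity.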
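
To build intuition for the distance values, the sketch below compares squared L2 distance with cosine distance on two hand-written unit vectors. It assumes unit-length embeddings; the vectors and helper names are illustrative only and do not come from ChromaDB. Which metric a given collection reports depends on its configuration.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Two illustrative unit vectors (not real embeddings).
query_vec = [1.0, 0.0]
chunk_vec = [0.6, 0.8]

print(f"squared L2: {squared_l2(query_vec, chunk_vec):.4f}")
print(f"cosine distance: {cosine_distance(query_vec, chunk_vec):.4f}")
```

For unit vectors, squared L2 distance is exactly twice the cosine distance (here 0.8 versus 0.4), so both metrics rank results in the same order even though the raw numbers differ.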

---

# Practical Example

Next, integrate this metadata-based search into your workflow. Let’s run a sample query with and without metadata filtering:

```python
# Example query
query = "Recent advancements in AI and their impact on teaching"

print("======== WITHOUT CATEGORY FILTER ========")
no_filter_results = metadata_enhanced_search(query, collection, categories=None, top_k=3)
for res in no_filter_results:
    print(f"Doc ID: {res['doc_id']}, Category: {res['category']}, Distance: {res['distance']:.4f}")
    print(f"Chunk: {res['chunk']}\n")

print("======== WITH CATEGORY FILTER (Education) ========")
edu_filter_results = metadata_enhanced_search(query, collection, categories=["Education"], top_k=3)
for res in edu_filter_results:
    print(f"Doc ID: {res['doc_id']}, Category: {res['category']}, Distance: {res['distance']:.4f}")
    print(f"Chunk: {res['chunk']}\n")
```

When you run this example against your `data/corpus.json`, you might see:

```
======== WITHOUT CATEGORY FILTER ========
Doc ID: 64, Category: Education, Distance: 1.0530
Chunk: The integration of technology in education is revolutionizing traditional teaching methods...

Doc ID: 24, Category: Education, Distance: 1.1431
Chunk: Universities are rethinking traditional education models...

Doc ID: 1, Category: Technology, Distance: 1.1532
Chunk: Artificial intelligence is transforming the way we approach complex computing problems...

======== WITH CATEGORY FILTER (Education) ========
Doc ID: 64, Category: Education, Distance: 1.0530
Chunk: The integration of technology in education is revolutionizing traditional teaching methods...

Doc ID: 24, Category: Education, Distance: 1.1431
Chunk: Universities are rethinking traditional education models...

Doc ID: 63, Category: Education, Distance: 1.2630
Chunk: Modern classrooms are benefiting from innovative pedagogical approaches...
```

Notice how, without a filter, you get a mix of **Education** and **Technology** results. With the filter set to `["Education"]`, you only see documents in that category—demonstrating how **metadata-based filtering** sharpens your results.

---

# Conclusion and Next Steps

In this lesson, you learned how to harness **metadata-based filtering** to refine search results in your RAG pipeline. By storing category (or any descriptor) alongside your text chunks, you can pinpoint the data most relevant to your query. This makes your system more robust and efficient—especially as your document collection grows.

**Next**, you will practice implementing these ideas in hands-on exercises. Good luck, and keep exploring the power of metadata in RAG!


## Where Clauses

You've just explored the fundamentals of metadata-based filtering, and now it's time to focus on a crucial aspect of the process! In this exercise, you'll implement the metadata filtering condition for the `metadata_enhanced_search` function, allowing your RAG system to narrow down results to specific document categories.

Here's your mission:

* Implement the missing `where_clause` logic in the `metadata_enhanced_search` function.
* Create a filter that ensures only documents with matching categories are retrieved when categories are provided.
* Ensure that when no categories are specified, the search is performed without any filtering.

By focusing on this key component, you'll understand how metadata filters are constructed to enhance the precision of your vector database searches!

```python
import json
from chromadb import Client
from chromadb.config import Settings
from chromadb.utils import embedding_functions


def metadata_enhanced_search(query, collection, categories=None, top_k=3):
    """
    Takes a query, a ChromaDB collection, an optional list of categories, and
    returns the top_k most relevant documents. If categories is specified, only
    documents matching any of those categories are retrieved.
    """
    # TODO: Create a where_clause dictionary for category filtering.
    # If categories is provided, filter documents to match any of those categories using the $in operator.
    # If no categories provided, the where_clause should be None.
    where_clause = None
    
    # Run the collection's query method, providing optional filtering and the number of results
    results = collection.query(
        query_texts=[query],
        n_results=top_k,
        where=where_clause
    )

    # Compile the retrieval results into a structured list
    retrieved_chunks = []
    for i in range(len(results["documents"][0])):
        retrieved_chunks.append({
            "chunk": results["documents"][0][i],
            "doc_id": results["metadatas"][0][i]["doc_id"],
            "category": results["metadatas"][0][i].get("category"),
            "distance": results["distances"][0][i]
        })

    return retrieved_chunks


if __name__ == "__main__":
    # Load sample data from JSON file
    with open("data/corpus.json", "r") as f:
        sample_chunks = json.load(f)

    # Create a ChromaDB client and set up the embedding function
    model_name = "sentence-transformers/all-MiniLM-L6-v2"
    embed_func = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)
    client = Client(Settings())

    # Create or get a collection for our metadata demo
    collection = client.get_or_create_collection(
        "metadata_demo_collection",
        embedding_function=embed_func
    )

    # Remove existing data (if any) and add fresh documents
    existing_ids = collection.get().get("ids", [])
    if existing_ids:
        collection.delete(ids=existing_ids)

    texts = [doc["content"] for doc in sample_chunks]
    doc_ids = [f"doc_{doc['id']}" for doc in sample_chunks]
    metadatas = [{
        "doc_id": doc["id"],
        "category": doc.get("category", "General"),
        "title": doc["title"],
        "date": doc["date"]
    } for doc in sample_chunks]

    # Add documents with their metadata to the ChromaDB collection
    collection.add(documents=texts, metadatas=metadatas, ids=doc_ids)

    # Define a query to demonstrate searching with and without metadata filtering
    query_input = "Recent advancements in AI and their impact on teaching"

    # Search WITHOUT category filtering
    print("======== WITHOUT CATEGORY FILTER ========")
    no_filter_results = metadata_enhanced_search(query_input, collection, categories=None, top_k=3)
    for res in no_filter_results:
        print(f"Doc ID: {res['doc_id']}, Category: {res['category']}, Distance: {res['distance']:.4f}")
        print(f"Chunk: {res['chunk']}\n")

    # Search WITH category filtering
    filter_category = "Education"
    print(f"======== WITH CATEGORY FILTER ({filter_category}) ========")
    filter_results = metadata_enhanced_search(query_input, collection, categories=[filter_category], top_k=3)
    for res in filter_results:
        print(f"Doc ID: {res['doc_id']}, Category: {res['category']}, Distance: {res['distance']:.4f}")
        print(f"Chunk: {res['chunk']}\n")
```


Below is the updated `metadata_enhanced_search` function with the missing `where_clause` logic implemented. Now, whenever you pass a non-empty list of `categories`, only documents whose `category` metadata matches one of those values will be returned. If you pass `categories=None` (or an empty list), the search runs unfiltered.

```python
import json
from chromadb import Client
from chromadb.config import Settings
from chromadb.utils import embedding_functions


def metadata_enhanced_search(query, collection, categories=None, top_k=3):
    """
    Takes a query, a ChromaDB collection, an optional list of categories, and
    returns the top_k most relevant documents. If categories is specified, only
    documents matching any of those categories are retrieved.
    """
    # Build the metadata filter clause if categories were provided
    if categories:
        # Only documents whose 'category' is in the provided list will match
        where_clause = {"category": {"$in": categories}}
    else:
        # No filter at all
        where_clause = None

    # Run the collection's query method, providing optional filtering
    results = collection.query(
        query_texts=[query],
        n_results=top_k,
        where=where_clause
    )

    # Compile the retrieval results into a structured list
    retrieved_chunks = []
    for i in range(len(results["documents"][0])):
        retrieved_chunks.append({
            "chunk":    results["documents"][0][i],
            "doc_id":   results["metadatas"][0][i]["doc_id"],
            "category": results["metadatas"][0][i].get("category"),
            "distance": results["distances"][0][i]
        })

    return retrieved_chunks


if __name__ == "__main__":
    # Load sample data from JSON file
    with open("data/corpus.json", "r") as f:
        sample_chunks = json.load(f)

    # Create a ChromaDB client and set up the embedding function
    model_name = "sentence-transformers/all-MiniLM-L6-v2"
    embed_func = embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name=model_name
    )
    client = Client(Settings())

    # Create or get a collection for our metadata demo
    collection = client.get_or_create_collection(
        "metadata_demo_collection",
        embedding_function=embed_func
    )

    # Remove existing data (if any) and add fresh documents
    existing_ids = collection.get().get("ids", [])
    if existing_ids:
        collection.delete(ids=existing_ids)

    texts = [doc["content"] for doc in sample_chunks]
    doc_ids = [f"doc_{doc['id']}" for doc in sample_chunks]
    metadatas = [{
        "doc_id":   doc["id"],
        "category": doc.get("category", "General"),
        "title":    doc["title"],
        "date":     doc["date"]
    } for doc in sample_chunks]

    collection.add(documents=texts, metadatas=metadatas, ids=doc_ids)

    # Define a query to demonstrate searching with and without metadata filtering
    query_input = "Recent advancements in AI and their impact on teaching"

    # Search WITHOUT category filtering
    print("======== WITHOUT CATEGORY FILTER ========")
    no_filter_results = metadata_enhanced_search(
        query_input, collection, categories=None, top_k=3
    )
    for res in no_filter_results:
        print(f"Doc ID: {res['doc_id']}, "
              f"Category: {res['category']}, "
              f"Distance: {res['distance']:.4f}")
        print(f"Chunk: {res['chunk']}\n")

    # Search WITH category filtering
    filter_category = "Education"
    print(f"======== WITH CATEGORY FILTER ({filter_category}) ========")
    filter_results = metadata_enhanced_search(
        query_input, collection, categories=[filter_category], top_k=3
    )
    for res in filter_results:
        print(f"Doc ID: {res['doc_id']}, "
              f"Category: {res['category']}, "
              f"Distance: {res['distance']:.4f}")
        print(f"Chunk: {res['chunk']}\n")
```

**Explanation of the key change**:

* We test `if categories:`

  * If true, we set

    ```python
    where_clause = {"category": {"$in": categories}}
    ```

    which instructs ChromaDB to only return items whose `category` metadata is in that list.
* Otherwise, we leave `where_clause = None`, meaning “no metadata filter—return all matches.”

With this in place, your RAG system can now precisely narrow down retrievals by category (or any other metadata field), or bypass filtering entirely when you need a broader search.
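
To make the `$in` semantics concrete, here is a small pure-Python sketch that mimics the filter on plain dictionaries. The `matches_category_filter` helper and the sample metadata are hypothetical. This is not how ChromaDB evaluates filters internally; it only illustrates which records a given `where_clause` would keep.

```python
def matches_category_filter(metadata, where_clause):
    """Return True if `metadata` passes a {"field": {"$in": [...]}} filter."""
    # No filter means every record is allowed through.
    if where_clause is None:
        return True
    for field, condition in where_clause.items():
        allowed_values = condition["$in"]
        # A record without the field, or with a value outside the list, fails.
        if metadata.get(field) not in allowed_values:
            return False
    return True


# Hypothetical metadata records, shaped like the ones stored in the lesson.
sample_metadatas = [
    {"doc_id": 1, "category": "Technology"},
    {"doc_id": 24, "category": "Education"},
    {"doc_id": 64, "category": "Education"},
]

where_clause = {"category": {"$in": ["Education"]}}
kept = [m["doc_id"] for m in sample_metadatas if matches_category_filter(m, where_clause)]
print(kept)  # [24, 64]
```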



## Crafting Metadata Enhanced Search

You've just explored the fundamentals of metadata-based filtering, and now it's time to put that knowledge into practice! In this exercise, you'll complete your own `metadata_enhanced_search` function, a powerful tool for filtering documents based on specific metadata attributes like category.

Here's your mission:

* Finish implementing the `metadata_enhanced_search` function that accepts a query, a collection, an optional list of categories, and a `top_k` value.
* Use the `collection.query` method to execute the search, passing the `query_texts`, `n_results`, and the `where` filter.
* Process the results into a structured list, including the document chunk, its ID, category, and distance score.

Dive in and enhance your RAG pipeline with this precise filtering technique!

```python
import json
from chromadb import Client
from chromadb.config import Settings
from chromadb.utils import embedding_functions


def metadata_enhanced_search(query, collection, categories=None, top_k=3):
    """
    Takes a query, a ChromaDB collection, an optional list of categories, and
    returns the top_k most relevant documents. If categories is specified, only
    documents matching any of those categories are retrieved.
    """
    # Build the metadata filter if categories were specified
    where_clause = {"category": {"$in": categories}} if categories else None

    # TODO: Use collection.query to search for relevant documents.
    # Remember to pass the query, number of results, and the where clause.

    # TODO: Process the results and create a list of dictionaries.
    # Each dictionary should contain the chunk content, document ID, category, and distance.
    retrieved_chunks = []

    return retrieved_chunks


if __name__ == "__main__":
    # Load sample data from JSON file
    with open("data/corpus.json", "r") as f:
        sample_chunks = json.load(f)

    # Create a ChromaDB client and set up the embedding function
    model_name = "sentence-transformers/all-MiniLM-L6-v2"
    embed_func = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)
    client = Client(Settings())

    # Create or get a collection for our metadata demo
    collection = client.get_or_create_collection(
        "metadata_demo_collection",
        embedding_function=embed_func
    )

    # Remove existing data (if any) and add fresh documents
    existing_ids = collection.get().get("ids", [])
    if existing_ids:
        collection.delete(ids=existing_ids)

    texts = [doc["content"] for doc in sample_chunks]
    doc_ids = [f"doc_{doc['id']}" for doc in sample_chunks]
    metadatas = [{
        "doc_id": doc["id"],
        "category": doc.get("category", "General"),
        "title": doc["title"],
        "date": doc["date"]
    } for doc in sample_chunks]

    # Add documents with their metadata to the ChromaDB collection
    collection.add(documents=texts, metadatas=metadatas, ids=doc_ids)

    # Define a query to demonstrate searching with and without metadata filtering
    query_input = "Recent advancements in AI and their impact on teaching"

    # Search WITHOUT category filtering
    print("======== WITHOUT CATEGORY FILTER ========")
    no_filter_results = metadata_enhanced_search(query_input, collection, categories=None, top_k=3)
    for res in no_filter_results:
        print(f"Doc ID: {res['doc_id']}, Category: {res['category']}, Distance: {res['distance']:.4f}")
        print(f"Chunk: {res['chunk']}\n")

    # Search WITH category filtering
    filter_category = "Education"
    print(f"======== WITH CATEGORY FILTER ({filter_category}) ========")
    filter_results = metadata_enhanced_search(query_input, collection, categories=[filter_category], top_k=3)
    for res in filter_results:
        print(f"Doc ID: {res['doc_id']}, Category: {res['category']}, Distance: {res['distance']:.4f}")
        print(f"Chunk: {res['chunk']}\n")


```


Here’s the complete, working implementation of `metadata_enhanced_search`—ready to drop into your RAG pipeline. It builds the optional `where` filter, calls `collection.query(...)`, then structures the results into a list of dicts:

```python
import json
from chromadb import Client
from chromadb.config import Settings
from chromadb.utils import embedding_functions


def metadata_enhanced_search(query, collection, categories=None, top_k=3):
    """
    Takes a query, a ChromaDB collection, an optional list of categories, and
    returns the top_k most relevant documents. If categories is specified, only
    documents matching any of those categories are retrieved.
    """
    # Build the metadata filter if categories were specified
    where_clause = {"category": {"$in": categories}} if categories else None

    # Run the vector search with optional filtering
    results = collection.query(
        query_texts=[query],
        n_results=top_k,
        where=where_clause
    )

    # Process and return a list of structured chunks
    retrieved_chunks = []
    docs      = results["documents"][0]
    metas     = results["metadatas"][0]
    distances = results["distances"][0]

    for chunk, meta, dist in zip(docs, metas, distances):
        retrieved_chunks.append({
            "chunk":    chunk,
            "doc_id":   meta["doc_id"],
            "category": meta.get("category"),
            "distance": dist
        })

    return retrieved_chunks


if __name__ == "__main__":
    # Load sample data
    with open("data/corpus.json", "r") as f:
        sample_chunks = json.load(f)

    # Initialize ChromaDB client & embedding
    model_name = "sentence-transformers/all-MiniLM-L6-v2"
    embed_func = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)
    client     = Client(Settings())

    # Prepare collection
    collection = client.get_or_create_collection("metadata_demo_collection", embedding_function=embed_func)
    existing   = collection.get().get("ids", [])
    if existing:
        collection.delete(ids=existing)

    # Add documents + metadata
    texts     = [doc["content"] for doc in sample_chunks]
    doc_ids   = [f"doc_{doc['id']}" for doc in sample_chunks]
    metadatas = [
        {"doc_id": doc["id"], "category": doc.get("category", "General"),
         "title": doc["title"], "date": doc["date"]}
        for doc in sample_chunks
    ]
    collection.add(documents=texts, metadatas=metadatas, ids=doc_ids)

    # Example query
    query_input = "Recent advancements in AI and their impact on teaching"

    # Without filter
    print("======== WITHOUT CATEGORY FILTER ========")
    for res in metadata_enhanced_search(query_input, collection, None, 3):
        print(f"Doc ID: {res['doc_id']}, Category: {res['category']}, Distance: {res['distance']:.4f}")
        print(f"Chunk: {res['chunk']}\n")

    # With filter
    print("======== WITH CATEGORY FILTER (Education) ========")
    for res in metadata_enhanced_search(query_input, collection, ["Education"], 3):
        print(f"Doc ID: {res['doc_id']}, Category: {res['category']}, Distance: {res['distance']:.4f}")
        print(f"Chunk: {res['chunk']}\n")
```

### How it works

1. **Build `where_clause`**

   * If you pass `categories=['A','B']`, it becomes `{"category": {"$in": ['A','B']}}`.
   * If you pass `None` (or an empty list), it stays `None`, meaning “no filter.”

2. **Query the collection**

   ```python
   results = collection.query(
       query_texts=[query],
       n_results=top_k,
       where=where_clause
   )
   ```

   * `query_texts` accepts a list of queries (we use one).
   * `n_results` caps the number of chunks returned.
   * `where` applies your metadata filter if present.

3. **Structure the results**

   * We read the first (and only) query's `documents`, `metadatas`, and `distances` lists and iterate over them in parallel with `zip`.
   * For each match, we pack:

     * `"chunk"` → the text snippet
     * `"doc_id"` → your original document identifier
     * `"category"` → the metadata field you filtered on
     * `"distance"` → the distance score returned by ChromaDB (lower means more similar)

That’s it! You now have a `metadata_enhanced_search` function that seamlessly blends vector similarity with precise metadata filtering.
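
The nested indexing (`results["documents"][0]`) is easier to follow with a concrete shape in front of you. The dictionary below is hand-written to mimic what `collection.query` returns for a single query. The texts and distances are made up for illustration, and `flatten_first_query` is a hypothetical helper.

```python
def flatten_first_query(results):
    """Turn the per-query nested lists into one list of dicts for query 0."""
    docs = results["documents"][0]
    metas = results["metadatas"][0]
    distances = results["distances"][0]
    return [
        {
            "chunk": chunk,
            "doc_id": meta["doc_id"],
            "category": meta.get("category"),  # None if the field is missing
            "distance": dist,
        }
        for chunk, meta, dist in zip(docs, metas, distances)
    ]


# Hand-written stand-in for a query response: one inner list per query text.
fake_results = {
    "documents": [["First chunk text", "Second chunk text"]],
    "metadatas": [[{"doc_id": 64, "category": "Education"}, {"doc_id": 1}]],
    "distances": [[0.41, 0.57]],
}

for row in flatten_first_query(fake_results):
    print(row)
```

The outer list has one entry per query text, which is why every field is indexed with `[0]` when only one query is sent.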


## Enhance Metadata Filtering Skills

In the previous exercise, you successfully filtered documents by category. Now, let's enhance your skills by adding another layer of filtering to your searches.

Your objective is to refine the `metadata_enhanced_search` function to filter documents not only by category but also by a minimum publication date. Here's how you can achieve this:

* **Convert the date:** If a `min_date` is provided, convert it from an ISO format string to a timestamp using `datetime.fromisoformat()` and `datetime.timestamp()`.
* **Build the filter:** Expand the `where` clause to check for both category inclusion and a date threshold. Use the `$and` operator to combine these conditions, ensuring that documents meet both criteria. The syntax for using `$and` is: `{"$and": [condition_1, condition_2, ...]}`.
* **Compile results:** When compiling the retrieved documents, convert the stored timestamp back to an ISO format date string using `datetime.fromtimestamp()` and `datetime.isoformat()`.

Dive in and see how this additional filtering can refine your search results even further!
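
The date round trip can be sketched on its own before wiring it into the search. This example pins the conversion to UTC, which is an assumption added here so the timestamp does not depend on the machine's local time zone; the exercise itself works with naive `datetime` objects. The helper names are hypothetical.

```python
from datetime import datetime, timezone


def iso_to_timestamp(iso_date):
    """Convert an ISO date string such as '2022-01-01' to a UTC Unix timestamp."""
    parsed = datetime.fromisoformat(iso_date)
    # Treat naive values as UTC so the result is the same on every machine.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.timestamp()


def timestamp_to_iso(timestamp):
    """Convert a Unix timestamp back to an ISO date string (UTC)."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).date().isoformat()


min_ts = iso_to_timestamp("2022-01-01")
print(min_ts)                    # 1640995200.0
print(timestamp_to_iso(min_ts))  # 2022-01-01
```

Storing dates as numbers is what makes range operators such as `$gte` usable in the filter.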


```python
import json
from chromadb import Client
from chromadb.config import Settings
from chromadb.utils import embedding_functions
from datetime import datetime


def metadata_enhanced_search(query, collection, categories=None, min_date=None, top_k=3):
    """
    Filters documents by category and a minimum publication date.
    If a list of categories is provided, only documents in those categories are returned.
    If a min_date is provided, only documents with date >= min_date are returned.
    Both filters are combined such that documents must satisfy all provided conditions.
    """

    # TODO: Convert min_date to a timestamp if provided

    # TODO: Build a compound where clause that combines category and date filtering.
    # If both filters are provided, documents must match both conditions.
    # If only one filter is provided, use that one.
    # If no filters are provided, where_clause should be None.
    where_clause = None

    # Execute the query using the ChromaDB collection
    results = collection.query(
        query_texts=[query],
        n_results=top_k,
        where=where_clause
    )

    # Compile the retrieved documents into a list.
    retrieved_chunks = []
    for i in range(len(results['documents'][0])):
        retrieved_chunks.append({
            "chunk": results['documents'][0][i],
            "doc_id": results['metadatas'][0][i]['doc_id'],
            "category": results['metadatas'][0][i].get('category', "General"),
            "distance": results['distances'][0][i],
            "date": datetime.fromtimestamp(results['metadatas'][0][i].get('date', 0)).isoformat()
        })

    return retrieved_chunks


if __name__ == "__main__":
    # Load sample data from JSON file
    with open("data/corpus.json", "r") as f:
        sample_chunks = json.load(f)

    # Create a ChromaDB client and embedder
    model_name = "sentence-transformers/all-MiniLM-L6-v2"
    embed_func = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)
    client = Client(Settings())

    # Create or retrieve the collection
    collection = client.get_or_create_collection(
        "metadata_demo_collection",
        embedding_function=embed_func
    )

    # Clear any existing documents in the collection
    existing_ids = collection.get().get("ids", [])
    if existing_ids:
        collection.delete(ids=existing_ids)

    # Prepare data and add to the collection
    texts = [doc["content"] for doc in sample_chunks]
    doc_ids = [f"doc_{doc['id']}" for doc in sample_chunks]
    metadatas = []
    for doc in sample_chunks:
        # TODO: Convert date to timestamp and store it
        metadatas.append({
            "doc_id": doc["id"],
            "category": doc.get("category", "General"),
            "title": doc["title"],
            "date": doc["date"]  # Store date as a timestamp
        })

    collection.add(documents=texts, metadatas=metadatas, ids=doc_ids)

    # Demonstrate searches
    query_input = "Recent advancements in AI and their impact on teaching"

    print("======== WITHOUT ANY FILTER ========")
    no_filter_results = metadata_enhanced_search(query_input, collection, categories=None, min_date=None, top_k=3)
    for res in no_filter_results:
        print(f"Doc ID: {res['doc_id']} | Category: {res['category']} | Date: {res['date']} | Distance: {res['distance']:.4f}")
        print(f"Chunk: {res['chunk']}\n")

    print("======== WITH CATEGORY FILTER (Education) ONLY ========")
    cat_only_results = metadata_enhanced_search(query_input, collection, categories=["Education"], min_date=None, top_k=3)
    for res in cat_only_results:
        print(f"Doc ID: {res['doc_id']} | Category: {res['category']} | Date: {res['date']} | Distance: {res['distance']:.4f}")
        print(f"Chunk: {res['chunk']}\n")

    print("======== WITH CATEGORY FILTER (Education) AND DATE FILTER (>= 2022-01-01) ========")
    cat_and_date_results = metadata_enhanced_search(query_input, collection, categories=["Education"], min_date="2022-01-01", top_k=3)
    for res in cat_and_date_results:
        print(f"Doc ID: {res['doc_id']} | Category: {res['category']} | Date: {res['date']} | Distance: {res['distance']:.4f}")
        print(f"Chunk: {res['chunk']}\n")


```


```python
import json
from chromadb import Client
from chromadb.config import Settings
from chromadb.utils import embedding_functions
from datetime import datetime


def metadata_enhanced_search(query, collection, categories=None, min_date=None, top_k=3):
    """
    Filters documents by category and a minimum publication date.
    If a list of categories is provided, only documents in those categories are returned.
    If a min_date (ISO string) is provided, only documents with date >= min_date are returned.
    Both filters are combined so documents must satisfy all provided conditions.
    """
    # Convert min_date ISO string to timestamp if provided
    if min_date:
        min_ts = datetime.fromisoformat(min_date).timestamp()
    else:
        min_ts = None

    # Build filter clauses
    clauses = []
    if categories:
        clauses.append({"category": {"$in": categories}})
    if min_ts is not None:
        clauses.append({"date": {"$gte": min_ts}})

    # Combine clauses with $and if needed
    if clauses:
        if len(clauses) == 1:
            where_clause = clauses[0]
        else:
            where_clause = {"$and": clauses}
    else:
        where_clause = None

    # Execute the query with metadata filtering
    results = collection.query(
        query_texts=[query],
        n_results=top_k,
        where=where_clause
    )

    # Compile results, converting timestamp back to ISO date
    retrieved_chunks = []
    for chunk, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0]
    ):
        ts = meta.get("date", 0)
        date_iso = datetime.fromtimestamp(ts).isoformat() if ts else None
        retrieved_chunks.append({
            "chunk":    chunk,
            "doc_id":   meta["doc_id"],
            "category": meta.get("category", "General"),
            "distance": dist,
            "date":     date_iso
        })

    return retrieved_chunks


if __name__ == "__main__":
    # Load sample data
    with open("data/corpus.json", "r") as f:
        sample_chunks = json.load(f)

    # Initialize ChromaDB client & embedder
    model_name = "sentence-transformers/all-MiniLM-L6-v2"
    embed_func = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)
    client = Client(Settings())

    # Create or get the collection
    collection = client.get_or_create_collection(
        "metadata_demo_collection",
        embedding_function=embed_func
    )

    # Clear existing documents
    existing_ids = collection.get().get("ids", [])
    if existing_ids:
        collection.delete(ids=existing_ids)

    # Prepare and add data with timestamps
    texts = [doc["content"] for doc in sample_chunks]
    doc_ids = [f"doc_{doc['id']}" for doc in sample_chunks]
    metadatas = []
    for doc in sample_chunks:
        # Convert ISO date to timestamp
        dt = datetime.fromisoformat(doc["date"])
        ts = dt.timestamp()
        metadatas.append({
            "doc_id":   doc["id"],
            "category": doc.get("category", "General"),
            "title":    doc["title"],
            "date":     ts
        })

    collection.add(documents=texts, metadatas=metadatas, ids=doc_ids)

    # Example query
    query_input = "Recent advancements in AI and their impact on teaching"

    print("======== WITHOUT ANY FILTER ========")
    for res in metadata_enhanced_search(query_input, collection, categories=None, min_date=None, top_k=3):
        print(f"Doc ID: {res['doc_id']} | Category: {res['category']} | Date: {res['date']} | Distance: {res['distance']:.4f}")
        print(f"Chunk: {res['chunk']}\n")

    print("======== WITH CATEGORY FILTER (Education) ONLY ========")
    for res in metadata_enhanced_search(query_input, collection, categories=["Education"], min_date=None, top_k=3):
        print(f"Doc ID: {res['doc_id']} | Category: {res['category']} | Date: {res['date']} | Distance: {res['distance']:.4f}")
        print(f"Chunk: {res['chunk']}\n")

    print("======== WITH CATEGORY FILTER (Education) AND DATE FILTER (>= 2022-01-01) ========")
    for res in metadata_enhanced_search(query_input, collection, categories=["Education"], min_date="2022-01-01", top_k=3):
        print(f"Doc ID: {res['doc_id']} | Category: {res['category']} | Date: {res['date']} | Distance: {res['distance']:.4f}")
        print(f"Chunk: {res['chunk']}\n")
```
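
The date handling and filter construction in the listing above can be isolated into small helpers that are easy to test without a running database. The following is a minimal sketch under stated assumptions, not the lesson's actual implementation: the helper names `iso_to_timestamp`, `timestamp_to_iso`, and `build_where_clause` are hypothetical, and the sketch pins naive dates to UTC so the stored numbers do not depend on the machine's local time zone (the listing above uses local time instead). It also assumes a single condition should be passed on its own rather than wrapped in `$and`.

```python
from datetime import datetime, timezone


def iso_to_timestamp(date_str):
    """Convert an ISO date string to a Unix timestamp, treating naive dates as UTC."""
    dt = datetime.fromisoformat(date_str)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.timestamp()


def timestamp_to_iso(ts):
    """Convert a Unix timestamp back to an ISO calendar date in UTC."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()


def build_where_clause(categories=None, min_date=None):
    """Build a metadata filter dict, or return None when no filter is requested."""
    clauses = []
    if categories:
        clauses.append({"category": {"$in": list(categories)}})
    if min_date:
        clauses.append({"date": {"$gte": iso_to_timestamp(min_date)}})

    if not clauses:
        return None
    if len(clauses) == 1:
        # Assumption: a lone condition is passed as-is, not wrapped in "$and"
        return clauses[0]
    return {"$and": clauses}


print(build_where_clause(categories=["Education"], min_date="2022-01-01"))
```

Storing dates as numbers is what makes range operators such as `$gte` usable here; converting back to ISO format is only needed when displaying results.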


Congratulations on mastering the basics of metadata-based filtering! You have already seen how to filter documents by category. Now, let's build on that by implementing a fallback mechanism in your `metadata_enhanced_search` function, so that your search remains effective even when no documents match a strict filter.

Here's your mission: Modify the function so that if an initial search with a strict metadata filter returns zero results, it automatically reruns the query with a relaxed or no filter. This way, you can still retrieve relevant documents even when the initial criteria are too restrictive.

1. Begin by checking whether the initial search results are empty.
2. If they are, rerun the query without the filter or with a less strict version.
3. Verify your implementation by setting an intentionally restrictive filter and confirming that the function correctly falls back to a broader search.

For instance, if you search for documents in a very niche category and find none, your function should seamlessly switch to a broader search, ensuring you still get useful results. This enhancement will make your RAG pipeline more resilient and adaptable to varying search conditions. Dive in and make your search logic smarter and more flexible!

```python
import json
from chromadb import Client
from chromadb.config import Settings
from chromadb.utils import embedding_functions


def metadata_enhanced_search(query, collection, categories=None, top_k=3):
    """
    Demonstrates how to filter results by metadata. If 'categories' is specified,
    we only retrieve documents whose category is in the provided list.
    If the initial search yields no results, automatically rerun the
    query without any category filter as a fallback.
    """
    # Build the initial where clause for category filtering
    where_clause = {"category": {"$in": categories}} if categories else None

    # Perform the initial search with category filter (if provided)
    results = collection.query(
        query_texts=[query],
        n_results=top_k,
        where=where_clause
    )

    # TODO: Check if the initial search returned any documents
    # If no documents were found, perform a fallback search without any filter

    # Build a result list
    retrieved_chunks = []
    for i in range(len(results['documents'][0])):
        retrieved_chunks.append({
            "chunk": results['documents'][0][i],
            "doc_id": results['metadatas'][0][i]['doc_id'],
            "category": results['metadatas'][0][i].get('category', "General"),
            "distance": results['distances'][0][i]
        })
    return retrieved_chunks


if __name__ == "__main__":
    # Load sample data from JSON file
    with open('data/corpus.json', 'r') as f:
        sample_chunks = json.load(f)

    # Build a ChromaDB collection
    model_name = 'sentence-transformers/all-MiniLM-L6-v2'
    embed_func = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)
    client = Client(Settings())
    collection = client.get_or_create_collection("metadata_demo_collection", embedding_function=embed_func)

    # Clear any existing data and add fresh documents
    existing_ids = collection.get().get('ids', [])
    if existing_ids:
        collection.delete(ids=existing_ids)

    texts = [c["content"] for c in sample_chunks]
    ids = [f"doc_{c['id']}" for c in sample_chunks]
    metadatas = [{
        "doc_id": c["id"],
        "category": c["category"],
        "title": c["title"],
        "date": c["date"]
    } for c in sample_chunks]

    collection.add(documents=texts, metadatas=metadatas, ids=ids)

    # Single query to compare results with and without metadata filtering
    query_input = "Recent advancements in AI and their impact on teaching"

    # Search WITHOUT category filtering
    print("======== WITHOUT CATEGORY FILTER ========")
    no_filter_results = metadata_enhanced_search(query_input, collection, categories=None, top_k=3)
    for res in no_filter_results:
        print(f"Doc ID: {res['doc_id']}, Category: {res['category']}, Distance: {res['distance']:.4f}")
        print(f"Chunk: {res['chunk']}\n")

    # Search WITH a strict category filter, potentially yielding zero results
    very_strict_category = "NonExistentCategory"
    print(f"======== WITH VERY STRICT CATEGORY FILTER ({very_strict_category}) ========")
    strict_filter_results = metadata_enhanced_search(query_input, collection, categories=[very_strict_category], top_k=3)
    for res in strict_filter_results:
        print(f"Doc ID: {res['doc_id']}, Category: {res['category']}, Distance: {res['distance']:.4f}")
        print(f"Chunk: {res['chunk']}\n")


```

Below is a version of `metadata_enhanced_search` that adds a fallback mechanism: if the initial search with the category filter returns nothing, the function automatically reruns the query without the filter.

```python
def metadata_enhanced_search(query, collection, categories=None, top_k=3):
    """
    Demonstrates how to filter results by metadata. If 'categories' is specified,
    we only retrieve documents whose category is in the provided list.
    If the initial search yields no results, automatically rerun the
    query without any category filter as a fallback.
    """
    # Build the initial where clause for category filtering
    where_clause = {"category": {"$in": categories}} if categories else None

    # Perform the initial search with the category filter (if provided)
    results = collection.query(
        query_texts=[query],
        n_results=top_k,
        where=where_clause
    )

    # If there are no results (empty document list), fall back to an unfiltered query
    if not results['documents'][0]:
        results = collection.query(
            query_texts=[query],
            n_results=top_k,
            where=None
        )

    # Collect the final results into a list
    retrieved_chunks = []
    for i in range(len(results['documents'][0])):
        retrieved_chunks.append({
            "chunk": results['documents'][0][i],
            "doc_id": results['metadatas'][0][i]['doc_id'],
            "category": results['metadatas'][0][i].get('category', "General"),
            "distance": results['distances'][0][i]
        })
    return retrieved_chunks
```

**Brief explanation:**

1. **Initial query**: Searches with `where_clause` if `categories` is provided.
2. **Check the results**: `if not results['documents'][0]` is true when no documents were found.
3. **Fallback**: If the list is empty, repeat `collection.query` with no filter at all (`where=None`).
4. **Collect and return** the results (whether from the initial query or the fallback).

With this change, when you test with an overly strict category (for example `"NonExistentCategory"`), the function automatically falls back to a broader search and still returns the nearest documents, up to `top_k` of them.
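
The fallback behavior can be checked without ChromaDB or an embedding model by using a stand-in object. The sketch below is illustrative only: `FakeCollection` is a hypothetical stub that mimics just the result shape the function reads (`documents`, `metadatas`, `distances`), and `search_with_fallback_flag` is a hypothetical variant that also reports whether the fallback fired, so a caller can tell filtered results from unfiltered ones. It skips the second query when no filter was supplied in the first place.

```python
class FakeCollection:
    """Hypothetical stand-in that imitates the shape of a ChromaDB query result."""

    def __init__(self, docs):
        self.docs = docs

    def query(self, query_texts, n_results, where=None):
        # Only the {"category": {"$in": [...]}} filter form is supported in this stub
        allowed = where["category"]["$in"] if where else None
        hits = [d for d in self.docs if allowed is None or d["category"] in allowed]
        hits = hits[:n_results]
        return {
            "documents": [[d["text"] for d in hits]],
            "metadatas": [[{"doc_id": d["doc_id"], "category": d["category"]} for d in hits]],
            "distances": [[0.0 for _ in hits]],  # placeholder distances
        }


def search_with_fallback_flag(query, collection, categories=None, top_k=3):
    """Run a filtered search; if it is empty, retry unfiltered and report that."""
    where_clause = {"category": {"$in": categories}} if categories else None
    results = collection.query(query_texts=[query], n_results=top_k, where=where_clause)

    used_fallback = False
    if where_clause is not None and not results["documents"][0]:
        results = collection.query(query_texts=[query], n_results=top_k, where=None)
        used_fallback = True

    chunks = [
        {
            "chunk": doc,
            "doc_id": meta["doc_id"],
            "category": meta.get("category", "General"),
            "distance": dist,
        }
        for doc, meta, dist in zip(
            results["documents"][0], results["metadatas"][0], results["distances"][0]
        )
    ]
    return chunks, used_fallback


collection = FakeCollection([
    {"doc_id": 1, "category": "Education", "text": "AI tutors in classrooms"},
    {"doc_id": 2, "category": "Technology", "text": "New GPU architectures"},
])
strict, fell_back = search_with_fallback_flag(
    "AI in teaching", collection, categories=["NonExistentCategory"]
)
print(fell_back, [c["category"] for c in strict])
```

Returning the flag is a design choice rather than a requirement of the exercise: silently broadening a search can surprise downstream code that assumes every result matches the requested category.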

The `KeyError: 'category'` error occurs because one or more entries in `sample_chunks` lack a `"category"` field. To fix it, use `.get()` with a default value (for example `"General"`) when building `metadatas`:

```python
# Replace the code that builds metadatas with:

metadatas = [{
    "doc_id": c.get("id"),
    "category": c.get("category", "General"),   # use the default "General" if the key is missing
    "title":    c.get("title", "Untitled"),      # optional: default title if needed
    "date":     c.get("date", "")                # optional: default date if needed
} for c in sample_chunks]
```

Or, with an explicit loop:

```python
metadatas = []
for c in sample_chunks:
    md = {
        "doc_id": c.get("id"),
        "category": c.get("category", "General")
    }
    # if you need title and date, check that they are present first:
    if "title" in c:
        md["title"] = c["title"]
    if "date" in c:
        md["date"] = c["date"]
    metadatas.append(md)
```

With this change, a missing `category` field no longer raises a `KeyError` and is filled in with `"General"` automatically. The script should then run to completion:

```bash
python src/main.py
```
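
Defaults hide incomplete records rather than surfacing them, so it can help to report which entries are missing fields before loading them. The following is a small sketch of such a check. The function name `find_incomplete_entries` and the `REQUIRED_FIELDS` tuple are hypothetical, with the field names assumed from the corpus layout used in the scripts above.

```python
# Assumed set of keys the loading code reads from each corpus entry
REQUIRED_FIELDS = ("id", "content", "title", "date")


def find_incomplete_entries(chunks, required=REQUIRED_FIELDS):
    """Return (index, missing_fields) for every entry lacking a required key."""
    problems = []
    for index, chunk in enumerate(chunks):
        missing = [field for field in required if field not in chunk]
        if missing:
            problems.append((index, missing))
    return problems


# Illustrative data, not taken from the real corpus file
sample = [
    {"id": 1, "content": "a", "title": "T", "date": "2023-01-01", "category": "Education"},
    {"id": 2, "content": "b", "title": "U"},
]
for index, missing in find_incomplete_entries(sample):
    print(f"Entry {index} is missing: {', '.join(missing)}")
```

Here `category` is left out of the required list on purpose, since the fix above already supplies `"General"` when it is absent.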
