# Lesson 2: Storing and Managing Text Chunks in Vector Databases


## Introduction

Hello there, welcome to the second lesson of our **Scaling Up RAG with Vector Databases** course! In the previous unit, you explored how to break large documents into smaller chunks and attach useful metadata (like `doc_id`, `chunk_id`, and labels such as category). These chunks are essential for structuring data in a way that makes retrieval easier. In this lesson, we’ll build on that groundwork by showing you how to store them in a vector database. One popular choice is **ChromaDB**—an open-source database designed for high-speed, semantic querying of vectors. By switching from keyword-based searches to semantic searches, your RAG system will retrieve relevant information more efficiently. Let’s dive in!

## Understanding Vector Databases

A vector database stores data as numerical vectors (embeddings) that capture the **semantic essence** of texts (or other data). Instead of matching literal words, the database compares vectors with a similarity metric such as cosine similarity or Euclidean distance, so conceptually similar items, which lie close together in the vector space, are returned together. This means searches on vector databases can retrieve contextually relevant results even when the query and the stored text share no keywords.

By leveraging approximate or exact nearest-neighbor strategies for similarity, vector databases can scale to handle millions or billions of vectors while still providing quick query responses. This makes them especially suitable for RAG systems, which rely on fast semantic lookups across large collections of text.
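To make the idea of "closeness" concrete, here is a minimal sketch of exact (brute-force) nearest-neighbor search using cosine similarity. The three-dimensional vectors and the names `cosine_similarity`, `nearest`, and `toy_vectors` are invented for illustration; real embedding models produce hundreds of dimensions, and a vector database would typically use an index rather than scoring every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, vectors, k=1):
    """Exact nearest-neighbor search: score every stored vector, keep the top k."""
    scored = sorted(
        vectors.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]


# Toy "embeddings" (made up for this example)
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}

query = [0.85, 0.15, 0.05]  # pretend this is the embedding of "feline"
print(nearest(query, toy_vectors, k=2))
```

The two animal-related entries score far higher than `"invoice"`, even though the query shares no keyword with any of them.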

## Why We Need Vector Databases for RAG

Before we explore how to set up a vector database, let’s look at why it’s a crucial component of a RAG pipeline:

1. **Semantic Retrieval**
   By embedding text into vectors, queries can match documents based on meaning rather than strict keyword matches. This yields more accurate and context-sensitive search results.

2. **Scalability**
   Specialized vector databases handle large datasets efficiently, allowing you to store and query vast libraries of text chunks without sacrificing performance.

3. **Richer Context**
   Embeddings capture nuanced relationships among chunks, ensuring that related information is surfaced even when it doesn’t use the exact same terms.

4. **Easy Updates**
   Vector databases (like ChromaDB) often allow you to add and remove chunks on the fly, so your collection stays in sync with new or evolving information.

## Setting Up ChromaDB and Basic Configuration

Now, let’s jump into coding with ChromaDB, our chosen vector database. Here’s how to set up a ChromaDB client:

```python
from chromadb import Client
from chromadb.config import Settings
from chromadb.utils import embedding_functions

def build_chroma_collection(chunks):
    # Use a Sentence Transformer model for embeddings
    model_name = 'sentence-transformers/all-MiniLM-L6-v2'
    embed_func = embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name=model_name
    )
    
    # Create a ChromaDB client with default settings
    client = Client(Settings())

    # Either get an existing collection or create a new one
    collection = client.get_or_create_collection(
        name="rag_collection",
        embedding_function=embed_func
    )
    # ... continues
```

### How It Works

* **Embedding Setup**
  We define a `SentenceTransformerEmbeddingFunction` to generate vectors for the text chunks. The model `all-MiniLM-L6-v2` is a lightweight yet powerful transformer that maps sentences to a 384-dimensional dense vector space. It balances efficiency (small size, fast inference) with strong semantic understanding—ideal for RAG applications.

* **Client Configuration**
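  `Client(Settings())` creates a ChromaDB client with default settings, which keeps data in memory for quick experimentation. You can change this behavior, for example to persist data to disk, by passing additional parameters to `Settings()`; recent ChromaDB releases also provide `chromadb.PersistentClient(path=...)` for on-disk storage.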

* **Collection Management**
  `get_or_create_collection` checks if a collection named `"rag_collection"` exists; if not, it creates one. A collection in ChromaDB is like a table in a traditional database but optimized for vector similarity operations, letting you group related documents and embeddings under a single namespace.

## Preparing Data and Adding Chunks to ChromaDB

After setting up your client, embedding function, and collection, prepare your chunks for insertion:

```python
    # ... continues in build_chroma_collection (indented as part of the function body)

    # Prepare the data: texts, IDs, and metadata
    texts = [c["content"] for c in chunks]
    ids = [f"chunk_{c['doc_id']}_{c['chunk_id']}" for c in chunks]
    metadatas = [
        {
            "doc_id": chunk["doc_id"],
            "chunk_id": chunk["chunk_id"],
            "category": chunk["category"]
        }
        for chunk in chunks
    ]

    # Add the documents (chunks) to the collection
    collection.add(documents=texts, metadatas=metadatas, ids=ids)
    return collection
```

**Key Points:**

* **Data Grouping**: Each chunk is mapped to its text, a unique ID, and metadata.
* **Seamless Insertion**: `collection.add()` handles embedding and storage in one call, making it easy to batch up and ingest your chunks.
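For larger corpora you may prefer to ingest in slices rather than in one huge call. The sketch below shows one way to do that; `add_in_batches`, the default `batch_size` of 100, and the `RecordingCollection` stand-in are all assumptions made for this example (the stand-in only records calls so the snippet runs without ChromaDB installed). The only collection method it relies on is `add(documents=..., metadatas=..., ids=...)`, as used above.

```python
def add_in_batches(collection, texts, metadatas, ids, batch_size=100):
    """Call collection.add() once per slice of at most batch_size items."""
    if not (len(texts) == len(metadatas) == len(ids)):
        raise ValueError("texts, metadatas, and ids must have the same length")
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        collection.add(
            documents=texts[start:end],
            metadatas=metadatas[start:end],
            ids=ids[start:end],
        )


class RecordingCollection:
    """Hypothetical stand-in for a ChromaDB collection; it only records calls."""

    def __init__(self):
        self.batches = []

    def add(self, documents, metadatas, ids):
        self.batches.append(list(ids))


demo = RecordingCollection()
demo_ids = [f"chunk_0_{i}" for i in range(5)]
add_in_batches(demo, ["text"] * 5, [{"doc_id": 0}] * 5, demo_ids, batch_size=2)
print([len(batch) for batch in demo.batches])  # [2, 2, 1]
```

With a real collection you would pass the object returned by `get_or_create_collection` instead of `demo`.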

## Updating and Managing Documents

ChromaDB lets you keep your collection current. Here’s an example of adding and then deleting a chunk:

```python
# Example initial chunks
example_chunks = [
    {"doc_id": 0, "chunk_id": 0, "category": "ai",      "content": "RAG stands for Retrieval-Augmented Generation."},
    {"doc_id": 0, "chunk_id": 1, "category": "ai",      "content": "A crucial component of a RAG pipeline is the Vector Database."},
    {"doc_id": 1, "chunk_id": 0, "category": "finance", "content": "Accurate data is essential in finance."},
]
collection = build_chroma_collection(example_chunks)

# Prepare a new chunk to add
new_document = {
    "doc_id": 2,
    "chunk_id": 0,
    "category": "food",
    "content": "Bananas are yellow fruits rich in potassium."
}
doc_id = f"chunk_{new_document['doc_id']}_{new_document['chunk_id']}"

# Add the new chunk
collection.add(
    documents=[new_document["content"]],
    metadatas=[{
        "doc_id": new_document["doc_id"],
        "chunk_id": new_document["chunk_id"],
        "category": new_document["category"]
    }],
    ids=[doc_id]
)

# Remove it later if it becomes outdated
collection.delete(ids=[doc_id])
```

**Why Unique IDs Matter:**
By combining `doc_id` and `chunk_id` into a string (e.g., `"chunk_2_0"`), you ensure each chunk has a distinct identifier, making updates, deletions, and lookups straightforward.
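Because everything hinges on those IDs being distinct, it can be worth checking for accidental repeats before inserting. The following is a small sketch; `make_chunk_id` and `find_duplicate_ids` are hypothetical helper names, and the ID format simply mirrors the `chunk_{doc_id}_{chunk_id}` pattern used in this lesson.

```python
from collections import Counter


def make_chunk_id(doc_id, chunk_id):
    """Build the ID string in the lesson's 'chunk_<doc_id>_<chunk_id>' format."""
    return f"chunk_{doc_id}_{chunk_id}"


def find_duplicate_ids(chunks):
    """Return the sorted list of IDs that appear more than once."""
    counts = Counter(make_chunk_id(c["doc_id"], c["chunk_id"]) for c in chunks)
    return sorted(chunk_id for chunk_id, n in counts.items() if n > 1)


sample = [
    {"doc_id": 0, "chunk_id": 0},
    {"doc_id": 0, "chunk_id": 1},
    {"doc_id": 0, "chunk_id": 1},  # accidental repeat
]
print(find_duplicate_ids(sample))  # ['chunk_0_1']
```

An empty result means every chunk maps to its own identifier.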

## Conclusion and Next Steps

By storing text chunks in a vector database, you’ve laid the foundation for faster, semantically aware retrieval. You now know how to create, update, and manage a ChromaDB collection—crucial skills for any large-scale RAG system.

In the next lesson, you’ll learn how to **query** the vector database to fetch the most relevant chunks and feed them into a language model. That’s where the real magic of context-rich, accurate responses shines! For now, feel free to experiment with different embedding models, or try adding and deleting chunks of various sizes. When you’re ready, move on to the practice exercises to cement these concepts and refine your RAG workflow.


## Loading and chunking text

Congratulations on reaching this point in the lesson! Now, let's put your newfound knowledge into practice by exploring the fascinating world of vector databases with ChromaDB.

Your mission is to load a small JSON file containing a few short documents, split each document into smaller, manageable chunks, and then build a ChromaDB collection from these chunks. Here's what you need to do in the `load_and_chunk_dataset` function:

* Load the dataset from the JSON file located at `data/corpus.json`. Each document in this file has the keys `id`, `title`, `content`, and `category`.
* Extract the `category` and `id` from each document.
* Use the provided `chunk_text` function to split the content of each document into smaller segments.
* Create a dictionary for each chunk that includes metadata such as `doc_id`, `chunk_id`, `category`, and the chunk text.

By the end of this exercise, you'll have a solid understanding of how to prepare and store text chunks in a vector database, setting the stage for efficient semantic retrieval. Enjoy the process and happy coding!

```python
import os
import json
from chromadb import Client
from chromadb.config import Settings
from chromadb.utils import embedding_functions


def chunk_text(text, chunk_size=50):
    """
    Splits the given text into chunks of size 'chunk_size'.
    Returns a list of chunk strings.
    """
    words = text.split()
    return [" ".join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]


def load_and_chunk_dataset(file_path, chunk_size=50):
    """
    Loads a dataset from JSON 'file_path', then splits each document into smaller chunks.
    Metadata such as 'doc_id' and 'category' is included with each chunk.
    """
    # TODO: Open and load the JSON file
    
    all_chunks = []
    for doc in data:
        doc_text = doc["content"]
        # TODO: Extract category and id from the document
        
        doc_chunks = chunk_text(doc_text, chunk_size)
        for chunk_index, chunk_str in enumerate(doc_chunks):
            # TODO: Create a dictionary for each chunk with doc_id, chunk_id, category and text
            
    return all_chunks


def build_chroma_collection(chunks, collection_name="rag_collection"):
    """
    Builds or retrieves a ChromaDB collection, embedding each chunk using a SentenceTransformer.
    Adds all chunks in the 'chunks' list to the collection for fast retrieval.
    """
    model_name = 'sentence-transformers/all-MiniLM-L6-v2'
    embed_func = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)

    client = Client(Settings())
    collection = client.get_or_create_collection(
        name=collection_name,
        embedding_function=embed_func
    )

    texts = [c["text"] for c in chunks]
    ids = [f"chunk_{c['doc_id']}_{c['chunk_id']}" for c in chunks]
    metadatas = [
        {"doc_id": c["doc_id"], "chunk_id": c["chunk_id"], "category": c["category"]}
        for c in chunks
    ]

    collection.add(documents=texts, metadatas=metadatas, ids=ids)
    return collection


if __name__ == "__main__":
    current_dir = os.path.dirname(__file__)
    dataset_file = os.path.join(current_dir, "data", "corpus.json")

    chunked_docs = load_and_chunk_dataset(dataset_file)
    collection = build_chroma_collection(chunked_docs)

    total_docs = collection.count()
    print("ChromaDB collection created with", total_docs, "documents.")

```

---



It looks like some of your documents don’t actually have a `"category"` field, so doing `doc["category"]` blows up. You have two options:

1. **Inspect the JSON** to see what the field is actually called (or if it’s missing), e.g.:

   ```python
   with open(dataset_file) as f:
       data = json.load(f)
   for i, doc in enumerate(data[:5]):
       print(i, doc.keys())
   ```

2. **Defensive code:** fall back to a default (or skip) when `"category"` isn’t present.

Here’s an updated version of `load_and_chunk_dataset` that:

* Uses `doc.get("category", "uncategorized")` so you never hit a KeyError.
* Prints a warning when a document is missing that field (so you can audit your data).
* Optionally, you could change `"uncategorized"` to `None` or even `raise` if that makes more sense for your pipeline.

```python
def load_and_chunk_dataset(file_path, chunk_size=50):
    """
    Loads a dataset from JSON 'file_path', then splits each document into smaller chunks.
    If a document is missing 'category', it gets tagged as "uncategorized" (and a warning is printed).
    """
    # 1. Open and load the JSON file
    with open(file_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    all_chunks = []
    for doc in data:
        # 2. Extract category and id from the document (with a default for missing categories)
        doc_id = doc.get("id", None)
        if doc_id is None:
            raise KeyError(f"Document is missing an 'id' field: {doc}")

        if "category" not in doc:
            print(f"⚠️ Warning: doc id={doc_id} has no 'category'; tagging as 'uncategorized'")
        category = doc.get("category", "uncategorized")

        content = doc.get("content", "")
        if not content:
            print(f"⚠️ Warning: doc id={doc_id} has empty or missing 'content'")

        # 3. Split the content into chunks
        doc_chunks = chunk_text(content, chunk_size)

        # 4. Build a dict for each chunk and collect
        for chunk_index, chunk_str in enumerate(doc_chunks):
            chunk = {
                "doc_id": doc_id,
                "chunk_id": chunk_index,
                "category": category,
                "text": chunk_str
            }
            all_chunks.append(chunk)

    return all_chunks
```

With this change you won’t get a `KeyError` on missing categories, and you’ll get console warnings so you know which documents need attention or where your schema may differ.
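To see the fallback in action without touching your real data, you can run the same logic against a throwaway file. This is a condensed sketch: `load_and_chunk` is a shortened, warning-free variant of the function above, and `sample_docs` is a made-up corpus in which the second document deliberately lacks a `category`.

```python
import json
import os
import tempfile


def chunk_text(text, chunk_size=50):
    """Split text into chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def load_and_chunk(file_path, chunk_size=50):
    """Condensed variant of the loader: default category, no warnings."""
    with open(file_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    chunks = []
    for doc in data:
        category = doc.get("category", "uncategorized")
        for index, piece in enumerate(chunk_text(doc.get("content", ""), chunk_size)):
            chunks.append(
                {"doc_id": doc["id"], "chunk_id": index, "category": category, "text": piece}
            )
    return chunks


# Made-up corpus: the second document has no 'category' key
sample_docs = [
    {"id": 1, "title": "A", "content": "one two three four five", "category": "demo"},
    {"id": 2, "title": "B", "content": "six seven eight"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "corpus.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample_docs, f)
    chunks = load_and_chunk(path, chunk_size=2)

for chunk in chunks:
    print(chunk)
```

The chunks from the second document carry the category `"uncategorized"` instead of raising a `KeyError`.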


## Building a ChromaDB Collection

Congratulations on reaching this point in the lesson! Now, let's put your newfound knowledge into practice by exploring the fascinating world of vector databases with ChromaDB.

Your mission is to complete the `build_chroma_collection` function so that it creates a ChromaDB collection from pre-chunked documents. To do so:

* Create or retrieve a ChromaDB collection using the client's `get_or_create_collection` method.
* Prepare the `texts`, `ids`, and `metadatas` lists from the provided chunks.
* Add the documents to the collection using the `collection.add` method.

By the end of this exercise, you'll have a solid understanding of how to store text chunks in a vector database, setting the stage for efficient semantic retrieval. Verify your implementation by checking that the collection is successfully created and the correct number of chunks was added. Enjoy the process and happy coding!

```python
import os
import json
from chromadb import Client
from chromadb.config import Settings
from chromadb.utils import embedding_functions


def chunk_text(text, chunk_size=50):
    """
    Splits the given text into chunks of size 'chunk_size'.
    Returns a list of chunk strings.
    """
    words = text.split()
    return [" ".join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]


def load_and_chunk_dataset(file_path, chunk_size=50):
    """
    Loads a dataset from JSON 'file_path', then splits each document into smaller chunks.
    Metadata such as 'doc_id' and 'category' is included with each chunk.
    """
    with open(file_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    all_chunks = []
    for doc in data:
        doc_text = doc["content"]
        doc_category = doc.get("category", "general")
        doc_id = doc["id"]

        doc_chunks = chunk_text(doc_text, chunk_size)
        for chunk_index, chunk_str in enumerate(doc_chunks):
            all_chunks.append({
                "doc_id": doc_id,
                "chunk_id": chunk_index,
                "category": doc_category,
                "text": chunk_str
            })
    return all_chunks


def build_chroma_collection(chunks, collection_name="rag_collection"):
    """
    Builds or retrieves a ChromaDB collection, embedding each chunk using a SentenceTransformer.
    Adds all chunks in the 'chunks' list to the collection for fast retrieval.
    """
    model_name = 'sentence-transformers/all-MiniLM-L6-v2'
    embed_func = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)

    client = Client(Settings())
    # TODO: Create or get the collection using client.get_or_create_collection

    # TODO: Prepare the texts, ids and metadatas lists from chunks

    # TODO: Add the documents to the collection using collection.add
    
    return collection


if __name__ == "__main__":
    current_dir = os.path.dirname(__file__)
    dataset_file = os.path.join(current_dir, "data", "corpus.json")

    chunked_docs = load_and_chunk_dataset(dataset_file)
    collection = build_chroma_collection(chunked_docs)

    total_docs = collection.count()
    print("ChromaDB collection created with", total_docs, "documents.")


```



Here’s a filled-out `build_chroma_collection` that:

1. Retrieves or creates a ChromaDB collection via `get_or_create_collection`
2. Extracts `texts`, `ids`, and `metadatas` from your `chunks` list
3. Adds everything into the collection with `collection.add(...)`

```python
def build_chroma_collection(chunks, collection_name="rag_collection"):
    """
    Builds or retrieves a ChromaDB collection, embedding each chunk using a SentenceTransformer.
    Adds all chunks in the 'chunks' list to the collection for fast retrieval.
    """
    # 1. Setup the embedding function
    model_name = 'sentence-transformers/all-MiniLM-L6-v2'
    embed_func = embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name=model_name
    )

    # 2. Connect to Chroma and get-or-create the collection
    client = Client(Settings())
    collection = client.get_or_create_collection(
        name=collection_name,
        embedding_function=embed_func
    )

    # 3. Prepare lists for texts, ids, and metadata
    texts = [chunk["text"] for chunk in chunks]
    ids = [
        f"chunk_{chunk['doc_id']}_{chunk['chunk_id']}"
        for chunk in chunks
    ]
    metadatas = [
        {
            "doc_id": chunk["doc_id"],
            "chunk_id": chunk["chunk_id"],
            "category": chunk["category"]
        }
        for chunk in chunks
    ]

    # 4. Add everything into the collection
    collection.add(
        documents=texts,
        metadatas=metadatas,
        ids=ids
    )

    return collection
```

And to verify you’ve ingested all your chunks:

```python
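if __name__ == "__main__":
    # Locate the dataset relative to this script, as in the starter code
    current_dir = os.path.dirname(__file__)
    dataset_file = os.path.join(current_dir, "data", "corpus.json")

    chunked_docs = load_and_chunk_dataset(dataset_file)
    collection = build_chroma_collection(chunked_docs)

    total_chunks = collection.count()
    print(f"✅ ChromaDB collection '{collection.name}' created with {total_chunks} chunks.")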
```

When you run this, you should see:

```
✅ ChromaDB collection 'rag_collection' created with 123 chunks.
```

(where `123` matches the length of your `chunked_docs` list).
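If you want to predict that number before running the pipeline, the arithmetic follows from how `chunk_text` slices words: a document with `n` words yields `ceil(n / chunk_size)` chunks. Here is a small sketch of that calculation; `expected_chunk_count` is a hypothetical helper and the word counts are invented. It assumes the collection starts empty and every chunk ID is unique.

```python
import math


def expected_chunk_count(word_counts, chunk_size=50):
    """Total chunks produced when each document is split every chunk_size words."""
    return sum(math.ceil(n / chunk_size) for n in word_counts)


# Three made-up documents with 120, 50, and 7 words
print(expected_chunk_count([120, 50, 7], chunk_size=50))  # 3 + 1 + 1 = 5
```

A document with no words contributes zero chunks, which matches `chunk_text` returning an empty list for empty content.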


## Dynamic Chunk Management in ChromaDB

Now, let's take your skills to the next level by learning how to dynamically manage these chunks.

In this exercise, you'll enhance your ability to handle ChromaDB collections by adding and removing chunks. Here's what you need to focus on:

* **Add a New Chunk**: Create a new document with a unique `doc_id`, `chunk_id`, and `category`. Use the text "Bananas are yellow fruits rich in potassium." for this chunk.
* **Verify the Update**: After adding the chunk, check that the total document count in the collection has increased by one.
* **Remove the Chunk**: Delete the newly added chunk using its unique ID and confirm that the document count returns to its previous value.

By the end of this exercise, you'll have a deeper understanding of how to dynamically manage text chunks in a vector database, setting the stage for efficient semantic retrieval. Enjoy the challenge and happy coding!

```python
import os
import json
from chromadb import Client
from chromadb.config import Settings
from chromadb.utils import embedding_functions


def chunk_text(text, chunk_size=30):
    """
    Splits the given text into chunks of size 'chunk_size'.
    Returns a list of chunk strings.
    """
    words = text.split()
    return [
        " ".join(words[i:i+chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def load_and_chunk_dataset(file_path, chunk_size=30):
    """
    Loads a dataset from JSON 'file_path', then splits each
    document into smaller chunks. Metadata such as 'doc_id'
    and 'category' is included with each chunk.
    """
    with open(file_path, "r") as f:
        data = json.load(f)

    all_chunks = []
    for doc_id, doc in enumerate(data):
        doc_text = doc["content"]
        doc_category = doc.get("category", "general")
        doc_chunks = chunk_text(doc_text, chunk_size)

        for chunk_id, chunk_str in enumerate(doc_chunks):
            all_chunks.append({
                "doc_id": doc_id,
                "chunk_id": chunk_id,
                "category": doc_category,
                "text": chunk_str
            })

    return all_chunks


def build_chroma_collection(chunks, collection_name="rag_collection"):
    """
    Builds or retrieves a ChromaDB collection, embedding each chunk
    using a SentenceTransformer. Adds all chunks in the 'chunks' list
    to the collection for fast retrieval.
    """
    model_name = 'sentence-transformers/all-MiniLM-L6-v2'
    embed_func = embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name=model_name
    )
    client = Client(Settings())
    collection = client.get_or_create_collection(
        name=collection_name,
        embedding_function=embed_func
    )

    texts = [c["text"] for c in chunks]
    ids = [f"chunk_{c['doc_id']}_{c['chunk_id']}" for c in chunks]
    metadatas = [
        {
            "doc_id": c["doc_id"],
            "chunk_id": c["chunk_id"],
            "category": c["category"]
        }
        for c in chunks
    ]

    collection.add(documents=texts, metadatas=metadatas, ids=ids)

    return collection


if __name__ == "__main__":
    current_dir = os.path.dirname(__file__)
    dataset_file = os.path.join(current_dir, "data", "corpus.json")

    # Build the initial collection from chunked documents
    chunked_docs = load_and_chunk_dataset(dataset_file, chunk_size=30)
    collection = build_chroma_collection(chunked_docs)
    total_docs = collection.count()
    print("ChromaDB collection created with", total_docs, "documents.")

    # TODO: Create a new document dictionary with doc_id, chunk_id, category, and text
    # The text should be "Bananas are yellow fruits rich in potassium."

    # TODO: Generate a unique ID string for the new document

    # TODO: Add the new document to the collection using collection.add()
    # Don't forget to include the document text, metadata, and ID

    # TODO: Print the updated document count

    # TODO: Remove the newly added document using collection.delete()

    # TODO: Print the final document count


```



Here’s how you can complete the `__main__` section to add and then remove a new “banana” chunk:

```python
if __name__ == "__main__":
    current_dir = os.path.dirname(__file__)
    dataset_file = os.path.join(current_dir, "data", "corpus.json")

    # Build the initial collection from chunked documents
    chunked_docs = load_and_chunk_dataset(dataset_file, chunk_size=30)
    collection = build_chroma_collection(chunked_docs)
    initial_count = collection.count()
    print("ChromaDB collection created with", initial_count, "documents.")

    # ─── TODO: Create a new chunk dict ──────────────────────────────────────────
    # Use a doc_id that doesn't collide with existing ones; here we pick len(chunked_docs)
    new_chunk = {
        "doc_id": len(chunked_docs),
        "chunk_id": 0,
        "category": "fruits",
        "text": "Bananas are yellow fruits rich in potassium."
    }

    # ─── TODO: Generate a unique ID string for the new chunk ───────────────────
    new_id = f"chunk_{new_chunk['doc_id']}_{new_chunk['chunk_id']}"

    # ─── TODO: Add the new chunk to the collection ─────────────────────────────
    collection.add(
        documents=[ new_chunk["text"] ],
        metadatas=[ {
            "doc_id": new_chunk["doc_id"],
            "chunk_id": new_chunk["chunk_id"],
            "category": new_chunk["category"]
        } ],
        ids=[ new_id ]
    )

    # ─── TODO: Print the updated document count ────────────────────────────────
    after_add_count = collection.count()
    print("After adding, collection has", after_add_count, "documents.")

    # ─── TODO: Remove the newly added chunk ──────────────────────────────────
    collection.delete(ids=[ new_id ])

    # ─── TODO: Print the final document count ─────────────────────────────────
    final_count = collection.count()
    print("After deleting, collection has", final_count, "documents.")
```

**What happens when you run this:**

1. **Initial build** prints something like

   ```
   ChromaDB collection created with 200 documents.
   ```
2. **After adding** prints

   ```
   After adding, collection has 201 documents.
   ```
3. **After deleting** prints

   ```
   After deleting, collection has 200 documents.
   ```

And now you’ve dynamically added and removed a chunk in your ChromaDB collection!
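One detail worth a second look is the choice of `doc_id`. Using `len(chunked_docs)` avoids collisions as long as every document produced at least one chunk and the existing `doc_id` values are consecutive integers starting at 0. If you would rather not depend on that, one alternative is to take the largest `doc_id` in use and add one. The helper name `next_free_doc_id` is hypothetical, and the sketch assumes integer `doc_id` values.

```python
def next_free_doc_id(chunks):
    """Return one more than the largest doc_id in use (0 for an empty list)."""
    return max((c["doc_id"] for c in chunks), default=-1) + 1


# Made-up chunk list with a gap in the doc_id sequence
existing = [
    {"doc_id": 0, "chunk_id": 0},
    {"doc_id": 0, "chunk_id": 1},
    {"doc_id": 3, "chunk_id": 0},
]

new_doc_id = next_free_doc_id(existing)
print(new_doc_id)               # 4
print(f"chunk_{new_doc_id}_0")  # chunk_4_0
```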


## Dynamic Document Management in ChromaDB

Congratulations on mastering the basics of adding and removing individual chunks from a ChromaDB collection! Now, let's elevate your skills by implementing a function that dynamically manages your collection based on content.

Your objective is to create a function that efficiently deletes all chunks from a ChromaDB collection containing a specific keyword. This will help you maintain a clean and relevant dataset, ensuring that only the most pertinent information is stored.

Dive in and enjoy the process of refining your ChromaDB collection!

```python
import os
import json
from chromadb import Client
from chromadb.config import Settings
from chromadb.utils import embedding_functions


def chunk_text(text, chunk_size=50):
    """
    Splits the given text into chunks of size 'chunk_size'.
    Returns a list of chunk strings.
    """
    words = text.split()
    return [" ".join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]


def load_and_chunk_dataset(file_path, chunk_size=50):
    """
    Loads a dataset from JSON 'file_path', then splits each document into smaller chunks.
    Metadata such as 'doc_id' and 'category' is included with each chunk.
    """
    with open(file_path, "r") as f:
        data = json.load(f)

    all_chunks = []
    for doc_id, doc in enumerate(data):
        doc_text = doc["content"]
        doc_category = doc.get("category", "general")  # Default to "general" if no category.
        doc_chunks = chunk_text(doc_text, chunk_size)

        for chunk_id, chunk_str in enumerate(doc_chunks):
            all_chunks.append({
                "doc_id": doc_id,
                "chunk_id": chunk_id,
                "category": doc_category,
                "text": chunk_str
            })
    return all_chunks


def build_chroma_collection(chunks, collection_name="rag_collection"):
    """
    Builds or retrieves a ChromaDB collection, embedding each chunk using a SentenceTransformer.
    Adds all chunks in the 'chunks' list to the collection for fast retrieval.
    """
    model_name = 'sentence-transformers/all-MiniLM-L6-v2'
    embed_func = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)
    client = Client(Settings())
    collection = client.get_or_create_collection(name=collection_name, embedding_function=embed_func)

    texts = [c["text"] for c in chunks]
    ids = [f"chunk_{c['doc_id']}_{c['chunk_id']}" for c in chunks]
    metadatas = [
        {"doc_id": c["doc_id"], "chunk_id": c["chunk_id"], "category": c["category"]}
        for c in chunks
    ]

    collection.add(documents=texts, metadatas=metadatas, ids=ids)
    return collection


def delete_documents_with_keyword(collection, keyword):
    """
    Deletes all documents from the given ChromaDB 'collection' whose text contains 'keyword'.
    """
    # TODO: Get all documents and their IDs from the collection

    # TODO: Create a list to store IDs of documents containing the keyword

    # TODO: Iterate through documents and their IDs, adding matching document IDs to the list

    # TODO: If there are documents to delete, remove them from the collection


if __name__ == "__main__":
    current_dir = os.path.dirname(__file__)
    dataset_file = os.path.join(current_dir, "data", "corpus.json")

    # Load and chunk the dataset, then build the initial collection.
    chunked_docs = load_and_chunk_dataset(dataset_file)
    collection = build_chroma_collection(chunked_docs)
    initial_count = collection.count()
    print("ChromaDB collection created with", initial_count, "documents.")

    # Add a new document containing "Bananas".
    new_document = {
        "doc_id": initial_count + 1,
        "chunk_id": 0,
        "category": "food",
        "text": "Bananas are yellow fruits rich in potassium."
    }
    doc_id_str = f"chunk_{new_document['doc_id']}_{new_document['chunk_id']}"

    collection.add(
        documents=[new_document["text"]],
        metadatas=[{
            "doc_id": new_document["doc_id"],
            "chunk_id": new_document["chunk_id"],
            "category": new_document["category"]
        }],
        ids=[doc_id_str]
    )

    updated_count = collection.count()
    print("After adding keyword document, collection has", updated_count, "documents.")

    # Now delete all documents containing the keyword "Bananas".
    delete_documents_with_keyword(collection, "Bananas")

    final_count = collection.count()
    print("After deleting documents with 'Bananas', collection has", final_count, "documents.")


```

**Implementing the `delete_documents_with_keyword` function**

The function works in three steps. First, fetch everything in the collection by calling `collection.get()` with no arguments; the result contains the parallel lists `ids`, `documents`, and `metadatas`. Next, walk through the documents and collect the IDs of those whose text contains the keyword. Finally, if any IDs matched, remove them with `collection.delete(ids=ids_to_remove)`.
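The matching step can be tried on plain data before wiring it to a real collection. In this sketch, `ids_containing_keyword` is a hypothetical helper and `fetched` is a hand-written dictionary standing in for the parallel `ids` and `documents` lists that the fetch step returns. The match is case-sensitive, so `"bananas"` would not match `"Bananas"`.

```python
def ids_containing_keyword(ids, documents, keyword):
    """Return the IDs whose document text contains keyword (case-sensitive)."""
    return [
        doc_id
        for doc_id, text in zip(ids, documents)
        if text and keyword in text
    ]


# Hand-written stand-in for the fetched ids and documents
fetched = {
    "ids": ["chunk_0_0", "chunk_0_1", "chunk_7_0"],
    "documents": [
        "RAG stands for Retrieval-Augmented Generation.",
        "Accurate data is essential in finance.",
        "Bananas are yellow fruits rich in potassium.",
    ],
}

ids_to_remove = ids_containing_keyword(fetched["ids"], fetched["documents"], "Bananas")
print(ids_to_remove)  # ['chunk_7_0']
```

The listing that follows is meant to apply the same idea inside `delete_documents_with_keyword`, passing the matching IDs to `collection.delete`.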


```python
import os
import json
from chromadb import Client
from chromadb.config import Settings
from chromadb.utils import embedding_functions


def chunk_text(text, chunk_size=50):
    """
    Splits the given text into chunks of size 'chunk_size'.
    Returns a list of chunk strings.
    """
    words = text.split()
    return [" ".join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]


def load_and_chunk_dataset(file_path, chunk_size=50):
    """
    Loads a dataset from JSON 'file_path', then splits each document into smaller chunks.
    Metadata such as 'doc_id' and 'category' is included with each chunk.
    """
    with open(file_path, "r") as f:
        data = json.load(f)

    all_chunks = []
    for doc_id, doc in enumerate(data):
        doc_text = doc.get("content", "")
        doc_category = doc.get("category", "general")  # Default to "general"
        doc_chunks = chunk_text(doc_text, chunk_size)

        for chunk_id, chunk_str in enumerate(doc_chunks):
            all_chunks.append({
                "doc_id": doc_id,
                "chunk_id": chunk_id,
                "category": doc_category,
                "text": chunk_str
            })
    return all_chunks


def build_chroma_collection(chunks, collection_name="rag_collection"):
    """
    Builds or retrieves a ChromaDB collection, embedding each chunk using a SentenceTransformer.
    Adds all chunks in the 'chunks' list to the collection for fast retrieval.
    """
    model_name = 'sentence-transformers/all-MiniLM-L6-v2'
    embed_func = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)
    client = Client(Settings())
    collection = client.get_or_create_collection(
        name=collection_name,
        embedding_function=embed_func
    )

    texts = [c["text"] for c in chunks]
    ids = [f"chunk_{c['doc_id']}_{c['chunk_id']}" for c in chunks]
    metadatas = [
        {"doc_id": c["doc_id"], "chunk_id": c["chunk_id"], "category": c["category"]}
        for c in chunks
    ]

    collection.add(documents=texts, metadatas=metadatas, ids=ids)
    return collection


def delete_documents_with_keyword(collection, keyword):
    """
    Deletes all documents from the given ChromaDB 'collection' whose text contains 'keyword'.
    """
    # Retrieve all documents. IDs are always returned by get(), and "ids" is not
    # a valid value for 'include', so only "documents" is requested here.
    results = collection.get(include=["documents"])
    ids = results.get('ids', [])
    docs = results.get('documents', [])

    # Identify document IDs that contain the keyword
    ids_to_delete = [doc_id for doc_id, text in zip(ids, docs) if keyword in text]

    # Delete matching documents
    if ids_to_delete:
        collection.delete(ids=ids_to_delete)
        print(f"Deleted {len(ids_to_delete)} documents containing keyword '{keyword}'.")
    else:
        print(f"No documents found containing keyword '{keyword}'.")


if __name__ == "__main__":
    current_dir = os.path.dirname(__file__)
    dataset_file = os.path.join(current_dir, "data", "corpus.json")

    # Load and chunk the dataset, then build the initial collection.
    chunked_docs = load_and_chunk_dataset(dataset_file)
    collection = build_chroma_collection(chunked_docs)
    initial_count = collection.count()
    print("ChromaDB collection created with", initial_count, "documents.")

    # Add a new document containing "Bananas".
    new_document = {
        "doc_id": initial_count + 1,
        "chunk_id": 0,
        "category": "food",
        "text": "Bananas are yellow fruits rich in potassium."
    }
    doc_id_str = f"chunk_{new_document['doc_id']}_{new_document['chunk_id']}"

    collection.add(
        documents=[new_document["text"]],
        metadatas=[{
            "doc_id": new_document["doc_id"],
            "chunk_id": new_document["chunk_id"],
            "category": new_document["category"]
        }],
        ids=[doc_id_str]
    )

    updated_count = collection.count()
    print("After adding keyword document, collection has", updated_count, "documents.")

    # Now delete all documents containing the keyword "Bananas".
    delete_documents_with_keyword(collection, "Bananas")

    final_count = collection.count()
    print("After deleting documents with 'Bananas', collection has", final_count, "documents.")



```


I have added a complete implementation of the `delete_documents_with_keyword` function, which:

* Retrieves all documents (IDs and texts) using `collection.get()`.
* Filters the IDs of the documents that contain the keyword.
* Deletes the matching documents and prints a summary of the deletion.

See the file `dynamic_chroma_management.py` for the implementation details.
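
The filtering step in the script matches the keyword case-sensitively, so `"bananas"` would not match `"Bananas"`. As a minimal sketch, the hypothetical helper `find_ids_with_keyword` below (not part of ChromaDB) isolates that matching logic and adds an optional case-insensitive mode. It assumes its input is shaped like the dictionary returned by `collection.get()`, with parallel lists under `ids` and `documents`, and it uses a plain dictionary as a stand-in so it runs without a database:

```python
def find_ids_with_keyword(results, keyword, case_sensitive=True):
    """Return the IDs whose document text contains 'keyword'.

    'results' is assumed to be shaped like the dictionary returned by
    collection.get(): parallel lists under the "ids" and "documents" keys.
    This helper is illustrative and not part of ChromaDB.
    """
    ids = results.get("ids") or []
    docs = results.get("documents") or []
    needle = keyword if case_sensitive else keyword.lower()

    matches = []
    for doc_id, text in zip(ids, docs):
        if text is None:  # skip entries that were stored without text
            continue
        haystack = text if case_sensitive else text.lower()
        if needle in haystack:
            matches.append(doc_id)
    return matches


# Stand-in for the result of collection.get(include=["documents"])
sample_results = {
    "ids": ["chunk_0_0", "chunk_1_0", "chunk_2_0"],
    "documents": [
        "Bananas are yellow fruits rich in potassium.",
        "Vector databases store embeddings.",
        "Ripe bananas are sweet.",
    ],
}

# Case-sensitive search finds only the capitalized occurrence.
print(find_ids_with_keyword(sample_results, "Bananas"))
# Case-insensitive search also finds the lowercase occurrence.
print(find_ids_with_keyword(sample_results, "Bananas", case_sensitive=False))
```

When the returned list is non-empty, it can be passed to `collection.delete(ids=...)` exactly as `ids_to_delete` is in the script.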
