# Chunking Knowledge Base Data

## Lesson Introduction

Welcome! In this lesson, we’ll cover **chunking** a dataset for a knowledge base, especially for **Retrieval-Augmented Generation** (RAG) systems. Imagine you have a large set of notes or articles. To help an AI agent answer questions using this information, you need to break it into smaller, manageable pieces — this is `chunking`. Our goal: understand what chunking is, why it matters, and how to implement it in `Python`. By the end, you’ll know how to split documents into chunks, making them easier for AI systems to process and retrieve relevant information.

## What is Chunking?

Why not give the whole document to the AI agent? Most AI models, including those in RAG, have limits on how much text they can process at once. Feeding them a long article or book quickly hits these limits. `Chunking` means dividing large text into smaller segments, or "chunks." Each chunk should be small enough for the AI to handle, but large enough to keep useful information.

For example, if you have a 1,000-word document and your AI can only process 100 words at a time, you need at least 10 chunks. This lets the system search through smaller pieces to find relevant information. Chunking isn’t just about size — it’s about structure. Good chunking keeps the meaning and context, so the AI can retrieve accurate answers.
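The arithmetic above can be checked with a short sketch. The helper name `min_chunks` is hypothetical and not part of the lesson code; it simply rounds up the ratio of document length to the per-chunk limit.

```python
import math

def min_chunks(total_words, max_words_per_chunk):
    """Return the smallest number of chunks needed to cover the document."""
    return math.ceil(total_words / max_words_per_chunk)

print(min_chunks(1000, 100))  # 10
print(min_chunks(1050, 100))  # 11: the leftover 50 words need their own chunk
```

Rounding up matters: any remainder, however small, still needs a chunk of its own.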

## Chunking Strategy and Implementation: Part 1

Let’s see how to chunk text in practice. The simplest way is to split text into fixed-size pieces, like every 30 characters or 100 words. The size depends on your use case and your AI’s limits.

Here’s a basic `Python` function to chunk text by character count:
```python
def chunk_text(text, chunk_size=30):
    """Split text into fixed-size chunks."""
    chunks = []
    for i in range(0, len(text), chunk_size):
        chunks.append(text[i:i + chunk_size])
    return chunks

# Example usage:
text = "This is a sample document that we want to split into smaller chunks for easier processing."
chunks = chunk_text(text, chunk_size=20)
print(chunks)
```

Output:
```
['This is a sample doc', 'ument that we want t', 'o split into smaller', ' chunks for easier p', 'rocessing.']
```

Each chunk is 20 characters long, except the last, which holds the remaining 10. Words such as "document" and "processing" end up split across chunks. This method is simple, but in real use, you might chunk by words or sentences to avoid splitting in the middle of ideas.
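Chunking by sentences, as mentioned above, might look like the following minimal sketch. The function name `chunk_by_sentences` and the regular expression are assumptions made for illustration; the pattern is a naive splitter that will mis-handle abbreviations such as "e.g." and is not a substitute for a proper sentence tokenizer.

```python
import re

def chunk_by_sentences(text, sentences_per_chunk=2):
    """Group whole sentences into chunks so no sentence is cut in half."""
    # Naive split: break after '.', '!' or '?' when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks = []
    for i in range(0, len(sentences), sentences_per_chunk):
        chunks.append(" ".join(sentences[i:i + sentences_per_chunk]))
    return chunks

text = "Chunking splits text. It helps retrieval. Keep sentences whole."
print(chunk_by_sentences(text, sentences_per_chunk=2))
# ['Chunking splits text. It helps retrieval.', 'Keep sentences whole.']
```

Unlike the character-based version, chunk lengths now vary, so you may still need a size check if individual sentences are very long.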

## Chunking Strategy and Implementation: Part 2

Often, you have a dataset with multiple documents, not just one string. Let’s apply chunking to a whole dataset.

Suppose you have a list of documents, each with an `id` and `content`. You want to chunk each document and track which chunk belongs to which document:

```python
def chunk_dataset(data, chunk_size=30):
    all_chunks = []
    for doc_id, doc in enumerate(data):
        doc_text = doc["content"]
        doc_chunks = chunk_text(doc_text, chunk_size)
        for chunk_id, chunk_str in enumerate(doc_chunks):
            all_chunks.append({
                "id": doc_id,
                "chunk_id": chunk_id,
                "text": chunk_str,
            })
    return all_chunks

# Example dataset
data = [
    {"id": 1, "content": "Read Chapter 7 of 'Clean Code' — focus on writing small, single-purpose functions."},
    {"id": 2, "content": "Experiment with React's useContext by creating a theme toggler component."}
]

chunked_data = chunk_dataset(data, chunk_size=30)
for chunk in chunked_data:
    print(chunk)
```

Sample output:
```
{'id': 0, 'chunk_id': 0, 'text': "Read Chapter 7 of 'Clean Code'"}
{'id': 0, 'chunk_id': 1, 'text': ' — focus on writing small, sin'}
{'id': 0, 'chunk_id': 2, 'text': 'gle-purpose functions.'}
{'id': 1, 'chunk_id': 0, 'text': "Experiment with React's useCon"}
{'id': 1, 'chunk_id': 1, 'text': 'text by creating a theme toggl'}
{'id': 1, 'chunk_id': 2, 'text': 'er component.'}
```

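Each chunk has a document ID and chunk ID, so you can trace it back to the original document and its position. Note that `id` here is the document’s zero-based index in the list (from `enumerate`), not the `id` field stored in each document. Use `doc["id"]` instead if you need the original identifier.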
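To illustrate why this metadata is useful, here is a hedged sketch that rebuilds each document from its chunks. The helper `rebuild_documents` is hypothetical. It assumes character-based chunks, where plain concatenation restores the text exactly; word-based chunks would need to be joined with spaces instead.

```python
from collections import defaultdict

def rebuild_documents(chunks):
    """Reassemble each document's text from its character-based chunks."""
    grouped = defaultdict(list)
    for chunk in chunks:
        grouped[chunk["id"]].append(chunk)

    documents = {}
    for doc_id, doc_chunks in grouped.items():
        # Sort by position so chunks are joined in their original order.
        ordered = sorted(doc_chunks, key=lambda c: c["chunk_id"])
        documents[doc_id] = "".join(c["text"] for c in ordered)
    return documents

chunks = [
    {"id": 0, "chunk_id": 1, "text": "world"},
    {"id": 0, "chunk_id": 0, "text": "Hello, "},
    {"id": 1, "chunk_id": 0, "text": "Bye"},
]
print(rebuild_documents(chunks))  # {0: 'Hello, world', 1: 'Bye'}
```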

## Practical Considerations

How do you pick the right chunk size? Too small, and you lose context. Too large, and you might exceed the AI’s limits or make retrieval less precise.

### Tips:

- **Chunk size**: Pick a size that fits within your AI model’s input limit, which is usually measured in tokens rather than words. Chunks of 200–500 words are a common starting point, but check your model’s documentation.
- **Chunk boundaries**: Split at natural points, like sentences or paragraphs, to keep meaning.
- **Overlapping chunks**: Sometimes, let chunks overlap a bit so important information at the edge of one chunk is also in the next. This helps preserve context.
- **Metadata**: Always track which chunk came from which document and its position. This is key for reconstructing answers or providing references.
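The overlapping-chunks tip can be sketched as a small variation on the character-based splitter. The function name `chunk_with_overlap` and its `overlap` parameter are assumptions for illustration, not part of the lesson code.

```python
def chunk_with_overlap(text, chunk_size=30, overlap=10):
    """Split text into fixed-size chunks that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks

print(chunk_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=4))
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk repeats the last four characters of the previous one, so text near a boundary appears in both. The trade-off is more chunks to store and search.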

In RAG pipelines, chunked data lets the retrieval system quickly find and return the most relevant pieces, making responses more accurate and efficient.
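To make the retrieval step concrete, here is a toy sketch that ranks chunks by word overlap with a query. This scoring is a stand-in chosen for simplicity; real RAG systems typically use other relevance measures, such as embedding similarity. The function name `retrieve` and the `top_k` parameter are hypothetical.

```python
def retrieve(query, chunks, top_k=2):
    """Rank chunks by how many distinct query words they contain."""
    query_words = set(query.lower().split())
    scored = []
    for chunk in chunks:
        chunk_words = set(chunk["text"].lower().split())
        score = len(query_words & chunk_words)
        if score > 0:
            scored.append((score, chunk))
    # Highest score first; the sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]

chunks = [
    {"id": 0, "chunk_id": 0, "text": "Review different types of SQL joins"},
    {"id": 1, "chunk_id": 0, "text": "Experiment with React's useContext"},
    {"id": 2, "chunk_id": 0, "text": "Summarize Paxos in your own words"},
]
for hit in retrieve("sql joins", chunks):
    print(hit)
# {'id': 0, 'chunk_id': 0, 'text': 'Review different types of SQL joins'}
```

Because each result carries its `id` and `chunk_id`, the caller can cite the source document for every retrieved piece.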

## Lesson Summary and Practice Introduction

You learned why chunking is essential for building knowledge bases for AI agents, especially with RAG. We covered what chunking is, why it matters, and how to implement it in `Python`. You saw how to chunk an entire dataset and got tips for choosing chunk sizes and managing metadata.

Now it’s your turn! Next, you’ll practice chunking your own dataset. You’ll use these techniques to split documents into chunks and inspect the results. This hands-on work will help you master chunking for knowledge bases.

# Exercise 1
Great job on understanding the basics of chunking! Now, let's modify the existing code to change how the text is chunked. Currently, the code chunks the text by a fixed number of characters. Your task is to change the `chunk_text` function to chunk the text by a fixed number of words instead.

This will help you see how different chunking strategies can affect text processing.

The code below shows the word-based version:

```python
import os
import json

def load_data(file_name):
    """Load sample knowledge base content from JSON file."""
    current_dir = os.path.dirname(__file__)
    dataset_file = os.path.join(current_dir, "data", file_name)
    # Read as UTF-8 explicitly so non-ASCII characters load on every platform.
    with open(dataset_file, 'r', encoding='utf-8') as file:
        return json.load(file)

def chunk_text(text, chunk_size=30):
    """Chunk text into pieces of at most chunk_size words."""
    chunks = []
    # split() with no argument also handles repeated spaces and newlines.
    words = text.split()
    for i in range(0, len(words), chunk_size):
        chunks.append(' '.join(words[i:i + chunk_size]))
    return chunks

def chunk_dataset(data, chunk_size=30):
    """Chunk every document and record where each chunk came from."""
    all_chunks = []
    for doc_id, doc in enumerate(data):
        doc_text = doc["content"]
        doc_chunks = chunk_text(doc_text, chunk_size)
        for chunk_id, chunk_str in enumerate(doc_chunks):
            all_chunks.append({
                "id": doc_id,
                "chunk_id": chunk_id,
                "text": chunk_str,
            })
    return all_chunks

def main():
    data = load_data("data.json")
    chunked_data = chunk_dataset(data)

    print("Loaded and chunked", len(chunked_data), "chunks from dataset.")
    for c in chunked_data:
        print(c)

if __name__ == "__main__":
    main()
```
Output:
```
Loaded and chunked 7 chunks from dataset.
{'id': 0, 'chunk_id': 0, 'text': "Read Chapter 7 of 'Clean Code' — focus on writing small, single-purpose functions."}
{'id': 1, 'chunk_id': 0, 'text': "Experiment with React's useContext by creating a theme toggler component."}
{'id': 2, 'chunk_id': 0, 'text': 'Review different types of SQL joins — especially LEFT and FULL OUTER joins.'}
{'id': 3, 'chunk_id': 0, 'text': 'Watch lecture on consensus algorithms — try to summarize Paxos in your own words.'}
{'id': 4, 'chunk_id': 0, 'text': 'Figure out the difference between React Query and Redux for async data handling.'}
{'id': 5, 'chunk_id': 0, 'text': "Draft blog post: '3 Things I Learned About Writing Better Functions.'"}
{'id': 6, 'chunk_id': 0, 'text': 'Ask Aram about good beginner-friendly open-source projects to contribute to.'}
```

# Exercise 2
You've done well in modifying the chunking strategy to use word count. Now, let's enhance the `chunk_dataset` function by completing the missing parts in the starter code.

```python
import os
import json

def load_data(file_name):
    """Load sample knowledge base content from JSON file."""
    current_dir = os.path.dirname(__file__)
    dataset_file = os.path.join(current_dir, "data", file_name)
    # Read as UTF-8 explicitly so non-ASCII characters load on every platform.
    with open(dataset_file, 'r', encoding='utf-8') as file:
        return json.load(file)

def chunk_text(text, chunk_size=5):
    """Chunk text into smaller pieces by word count for better processing."""
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size):
        chunks.append(' '.join(words[i:i + chunk_size]))
    return chunks

def chunk_dataset(data, chunk_size=5):
    """Chunk every document and attach metadata to each chunk."""
    all_chunks = []
    for doc_id, doc in enumerate(data):
        # Retrieve the content of the document
        doc_text = doc["content"]
        # Chunk the document text into smaller pieces with the specified chunk size
        doc_chunks = chunk_text(doc_text, chunk_size)
        for chunk_id, chunk_str in enumerate(doc_chunks):
            # Append each chunk with its metadata to the all_chunks list
            all_chunks.append({
                "id": doc_id,
                "chunk_id": chunk_id,
                "text": chunk_str,
                "length": len(chunk_str)  # chunk length in characters
            })
    return all_chunks

def main():
    """Load the dataset, chunk it, and print the resulting chunks."""
    data = load_data("data.json")
    chunked_data = chunk_dataset(data)

    print("Loaded and chunked", len(chunked_data), "chunks from dataset.")
    for c in chunked_data:
        print(c)

if __name__ == "__main__":
    main()
```


# Exercise 3

You've come a long way in understanding chunking for knowledge bases. Now, let's put your skills to the test.

Your task is to write a Python script from scratch that loads a dataset from a JSON file, chunks the text content of each document into smaller pieces, and prints the chunked data.

The dataset contains information about various learning tasks, and your script should handle the chunking process efficiently.