## Chunking

### What is chunking?

![Chunking overview](assets/chunking_overview.jpg)

### Why is chunking useful?

![Chunking is useful for retrieval](assets/chunking_retrieval.jpg)

## Chunking - basic concepts

Here is a **long** piece of text:

In [3]:
import requests

url = "https://raw.githubusercontent.com/progit/progit2/main/book/01-introduction/sections/what-is-git.asc"
source_text = requests.get(url).text

print(source_text)

[[what_is_git_section]]
=== What is Git?

So, what is Git in a nutshell?
This is an important section to absorb, because if you understand what Git is and the fundamentals of how it works, then using Git effectively will probably be much easier for you.
As you learn Git, try to clear your mind of the things you may know about other VCSs, such as CVS, Subversion or Perforce -- doing so will help you avoid subtle confusion when using the tool.
Even though Git's user interface is fairly similar to these other VCSs, Git stores and thinks about information in a very different way, and understanding these differences will help you avoid becoming confused while using it.(((Subversion)))(((Perforce)))

==== Snapshots, Not Differences

The major difference between Git and any other VCS (Subversion and friends included) is the way Git thinks about its data.
Conceptually, most other systems store information as a list of file-based changes.
These other systems (CVS, Subversion, Perforce, and so o

### Chunking by size

In [1]:
from typing import List

def word_splitter(source_text: str) -> List[str]:
    """
    Split the text into a list of words
    Replace multiple whitespaces with a single whitespace, then split by whitespace
    """
    import re
    source_text = re.sub("\s+", " ", source_text)  # Replace multiple whitespces
    return re.split("\s", source_text)  # Return a list of words

def get_chunks_fixed_size(text: str, chunk_size: int) -> List[str]:
    """
    Split the text into chunks of fixed size
    Use word_splitter to split the text into groups of `chunk_size` words
    """
    text_words = word_splitter(text)
    chunks = []
    for i in range(0, len(text_words), chunk_size):
        chunk_words = text_words[i: i + chunk_size]
        chunk = " ".join(chunk_words)
        chunks.append(chunk)
    return chunks

#### Try it out

Let's use multiple chunk sizes:

In [4]:
for chosen_size in [5, 25, 100]:
    chunks = get_chunks_fixed_size(source_text, chosen_size)
    # Print outputs to screen
    print(f"\nSize {chosen_size} - {len(chunks)} chunks returned.")
    for i in range(3):
        print(f"Chunk {i+1}: {chunks[i]}")


Size 5 - 281 chunks returned.
Chunk 1: [[what_is_git_section]] === What is Git?
Chunk 2: So, what is Git in
Chunk 3: a nutshell? This is an

Size 25 - 57 chunks returned.
Chunk 1: [[what_is_git_section]] === What is Git? So, what is Git in a nutshell? This is an important section to absorb, because if you understand what Git
Chunk 2: is and the fundamentals of how it works, then using Git effectively will probably be much easier for you. As you learn Git, try to
Chunk 3: clear your mind of the things you may know about other VCSs, such as CVS, Subversion or Perforce -- doing so will help you avoid

Size 100 - 15 chunks returned.
Chunk 1: [[what_is_git_section]] === What is Git? So, what is Git in a nutshell? This is an important section to absorb, because if you understand what Git is and the fundamentals of how it works, then using Git effectively will probably be much easier for you. As you learn Git, try to clear your mind of the things you may know about other VCSs, such as CVS, S

### Chunking by text structure

This is another good option, as it preserves natural groups of text, such as paragraphs, or sections of text. 

If the text includes headings with a specific format, you can use those to chunk the text as well. 

In [7]:
# Chunk text by particular marker
for marker in ["\n\n", "\n=="]:  # Different markers to try (newline, heading marker in AsciiDoc)
    chunks = source_text.split(marker)
    # Print outputs to screen
    print(f"\nUsing the marker: {repr(marker)} - {len(chunks)} chunks returned.")
    for i in range(3):
        print(f"Chunk {i+1}: {repr(chunks[i])}")


Using the marker: '\n\n' - 31 chunks returned.
Chunk 1: '[[what_is_git_section]]\n=== What is Git?'
Chunk 2: "So, what is Git in a nutshell?\nThis is an important section to absorb, because if you understand what Git is and the fundamentals of how it works, then using Git effectively will probably be much easier for you.\nAs you learn Git, try to clear your mind of the things you may know about other VCSs, such as CVS, Subversion or Perforce -- doing so will help you avoid subtle confusion when using the tool.\nEven though Git's user interface is fairly similar to these other VCSs, Git stores and thinks about information in a very different way, and understanding these differences will help you avoid becoming confused while using it.(((Subversion)))(((Perforce)))"
Chunk 3: '==== Snapshots, Not Differences'
Chunk 4: 'The major difference between Git and any other VCS (Subversion and friends included) is the way Git thinks about its data.\nConceptually, most other systems store informat

IndexError: list index out of range

### Reflections / Discussions

Which of these would be best?

It depends, but paragraph-based chunking can be a good choice, and if a word count is used, 100-150 words is a good starting point.

## Add data to Weaviate

In [5]:
import weaviate
import weaviate.classes.config as wc
import os
import requests

client = weaviate.connect_to_local(
    headers={
        "X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY"),
        "X-Cohere-Api-Key": os.getenv("COHERE_API_KEY"),
    }
)


# Create a new collection
client.collections.delete(["BookChunk"])  # Delete the collection if it already exists (for the sake of re-runs)

client.collections.create(
    name="BookChunk",
    properties=[
        wc.Property(name="title", data_type=wc.DataType.TEXT),
        wc.Property(name="text", data_type=wc.DataType.TEXT),
        wc.Property(name="chunk_no", data_type=wc.DataType.INT),
    ],
    vectorizer_config=wc.Configure.Vectorizer.text2vec_cohere(),
    generative_config=wc.Configure.Generative.openai(),
)


# Get chunk data
url = "https://raw.githubusercontent.com/progit/progit2/main/book/01-introduction/sections/what-is-git.asc"
source_text = requests.get(url).text

CHUNK_SIZE = 100
chunks = get_chunks_fixed_size(source_text, CHUNK_SIZE)


# Add the chunks to the collection
chunks_collection = client.collections.get("BookChunk")

with chunks_collection.batch.rate_limit(2400) as batch:
    for i, chunk in enumerate(chunks):
        batch.add_object(
            properties={
                "title": "Pro Git",
                "text": chunk,
                "chunk_no": i+1,  # Start from 1
            }
        )

In [6]:
response = chunks_collection.query.near_text("how git works", limit=2)
for o in response.objects:
    print(o.properties["title"], "text:", o.properties["chunk_no"])
    print(o.properties["text"])
    print()

Pro Git text: 1
[[what_is_git_section]] === What is Git? So, what is Git in a nutshell? This is an important section to absorb, because if you understand what Git is and the fundamentals of how it works, then using Git effectively will probably be much easier for you. As you learn Git, try to clear your mind of the things you may know about other VCSs, such as CVS, Subversion or Perforce -- doing so will help you avoid subtle confusion when using the tool. Even though Git's user interface is fairly similar to these other VCSs, Git stores and thinks about information in

Pro Git text: 13
that stores information about what will go into your next commit. Its technical name in Git parlance is the "`index`", but the phrase "`staging area`" works just as well. The Git directory is where Git stores the metadata and object database for your project. This is the most important part of Git, and it is what is copied when you _clone_ a repository from another computer. The basic Git workflow goes 

## Chunking in RAG

Chunking is very useful, if not essential in RAG

Lets you pass on just relevant parts to the generative AI as context. 

![Chunking is useful for rag](assets/chunking_rag_one_doc.jpg)

In [7]:
response = chunks_collection.generate.near_text(
    "how git works",
    limit=2,
    grouped_task="Summarize the key points here like I am five."
)
print(response.generated)

Git is a tool that helps you keep track of changes you make to your files. It has a staging area where you can choose which changes to save. The Git directory stores important information about your project. The basic workflow involves modifying files, staging changes, and then committing them.


In [8]:
response = chunks_collection.generate.near_text(
    "how git works",
    limit=2,
    grouped_task="Summarize the key points in bullet points. Use many emojis to make it interesting and fun like a social post."
)
print(response.generated)

🔑 What is Git:
- Understand the fundamentals of Git to use it effectively
- Clear your mind of knowledge from other VCSs
- Git stores information in a staging area and Git directory

🔑 Basic Git workflow:
- Modify files in working tree
- Stage changes for next commit
- Git directory stores metadata and object database

🔑 Git terminology:
- Staging area = index
- Git directory is crucial for storing project data

🔑 Remember:
- Git's user interface is similar to other VCSs
- Clear understanding of Git basics is key for effective usage


For comparison, try performing RAG with the full document:

In [9]:
# Add the full text to the collection
client.collections.delete(["Book"])  # Delete the collection if it already exists (for the sake of re-runs)
book_collection = client.collections.get("Book")

client.collections.create(
    name="Book",
    properties=[
        wc.Property(name="title", data_type=wc.DataType.TEXT),
        wc.Property(name="text", data_type=wc.DataType.TEXT),
    ],
    vectorizer_config=wc.Configure.Vectorizer.text2vec_cohere(),
    generative_config=wc.Configure.Generative.openai(),
)

book_collection.data.insert(
    properties={
        "title": "Pro Git",
        "text": source_text,
    }
)

response = book_collection.generate.near_text(
    "how git works",
    limit=2,
    grouped_task="Summarize the key points in bullet points. Use many emojis to make it interesting and fun like a social post."
)
print(response.generated)

- Git stores data as snapshots 📸
- Git is a mini filesystem with powerful tools 🛠️
- Most operations in Git are local 🏠
- Git has integrity with checksums 🔒
- Git generally only adds data, hard to lose 📈
- Three main states in Git: modified, staged, committed 🔄
- Git workflow: modify files, stage changes, commit snapshot 🔄
- Working tree, staging area, Git directory 🌳
- Git uses SHA-1 hash for checksumming 🔢


⬆️ This is thorough, but it's just summarized the entire chapter, not the specific section we're interested in.

**Unless we've chunked the text, there's no way to retrieve just the parts we're interested in.**

### This is why chunking is so important in RAG

This is even more important when you have many documents, or the document is very long. 

If we don't filter out information, it might not fit into a context window, or the model might get confused by irrelevant information.

![Chunking is useful for rag](assets/chunking_rag_many_docs.jpg)

Also, many LLMs work better when the input includes less low-relevance information.