# Helix Chunking

Helix chunking uses Chonkie's chunkers under the hood. To decide which Chonkie chunker best fits your use case, see https://docs.chonkie.ai/python-sdk/chunkers/overview

Each `helix.Chunk` method returns a list of chunked text from Chonkie, so you can use it like this:

`chunks = helix.Chunk.token_chunk(massive_text_blob, chunk_size=100)` --> this gives a list of chunked strings

`db.query("create_chunk", {"contents": chunks})` --> this assumes your QUERY looks something like this (the endpoint name and the payload key match the query name and its parameter):


```C++
QUERY create_chunk(contents: [String]) =>
    FOR content in contents {
        AddN<Chunk>({chunk: content})
    }
    RETURN "success"
```
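
When the list of chunks is large, it may be more practical to send it in several requests rather than one. The following is a minimal sketch of that idea. `batched_payloads` and the batch size are hypothetical and not part of helix, and the payload key `contents` is assumed to mirror the `contents: [String]` parameter of the query above.

```python
from typing import Dict, Iterator, List


def batched_payloads(chunks: List[str], batch_size: int = 50) -> Iterator[Dict[str, List[str]]]:
    """Yield query payloads holding at most `batch_size` chunks each.

    Hypothetical helper: the key "contents" is assumed to match the
    parameter name of the create_chunk query shown above.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(chunks), batch_size):
        yield {"contents": chunks[start:start + batch_size]}


example_chunks = ["first chunk", "second chunk", "third chunk"]
payloads = list(batched_payloads(example_chunks, batch_size=2))

# With a connected client, each payload could then be sent like this
# (assumes `db` is an already-configured helix client):
# for payload in payloads:
#     db.query("create_chunk", payload)
```

Each payload is an ordinary dict, so it can be inspected or logged before anything is written to the database.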


In [None]:
import helix

### Sample Text for Single Text Chunking

In [2]:
massive_text_blob = """
This is a massive text blob that we want to chunk into smaller pieces for processing. It contains multiple sentences and paragraphs that need to be divided appropriately to maintain context while fitting within token limits. When working with large documents, it is important to ensure that each chunk maintains enough context for downstream tasks, such as retrieval or summarization. Chunking strategies can vary depending on the use case, but the goal is always to balance context preservation with processing efficiency.

The chunker should handle overlaps properly to ensure no important information is lost at chunk boundaries. For example, if a sentence is split between two chunks, the overlap ensures that both chunks retain the full meaning of the text. This is especially important in applications like document question answering, where missing a single sentence could lead to incorrect answers. Additionally, chunkers may need to account for different languages, code blocks, or special formatting, which can add complexity to the chunking process.

This example demonstrates how the token chunker works with a realistic text sample that would be common in document processing and RAG (Retrieval-Augmented Generation) applications. The chunks will be created with specified token limits and overlap settings to optimize for both comprehension and processing efficiency. Each chunk will contain metadata about its position in the original text and token count for further processing. By using a robust chunking strategy, we can ensure that downstream models receive high-quality, context-rich input, improving the overall performance of NLP pipelines and applications.
"""

### Sample Text List for Batch Chunking

In [3]:
texts = [
    "First document to chunk with some content for testing.",
    "Second document with different content for batch processing."
]

### Sample Text For Code Chunker

In [4]:
code_sample = """
def hello_world():
    print("Hello, Chonkie!")

class MyClass:
    def __init__(self):
        self.value = 42
"""

code_samples = [
    "def func1():\n    pass",
    "const x = 10;\nfunction add(a, b) { return a + b; }"
]

### Token Chunker

In [5]:
chunks = helix.Chunk.token_chunk(massive_text_blob, chunk_size=100)
chunks

['\nThis is a massive text blob that we want to chunk into smaller pieces for processing. It contains m',
 't contains multiple sentences and paragraphs that need to be divided appropriately to maintain conte',
 'intain context while fitting within token limits. When working with large documents, it is important',
 'is important to ensure that each chunk maintains enough context for downstream tasks, such as retrie',
 'ch as retrieval or summarization. Chunking strategies can vary depending on the use case, but the go',
 ', but the goal is always to balance context preservation with processing efficiency.\n\nThe chunker sh',
 'e chunker should handle overlaps properly to ensure no important information is lost at chunk bounda',
 'chunk boundaries. For example, if a sentence is split between two chunks, the overlap ensures that b',
 'sures that both chunks retain the full meaning of the text. This is especially important in applicat',
 ' in applications like document question answering,

In [6]:
batch_chunks = helix.Chunk.token_chunk(texts)
batch_chunks

🦛 choooooooooooooooooooonk 100% • 2/2 batches chunked [00:00<00:00, 11618.57batch/s] 🌱


['First document to chunk with some content for testing.',
 'Second document with different content for batch processing.']
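
The single-text output above looks like 100-character windows in which each chunk repeats the last few characters of the previous one. The sketch below reproduces that sliding-window pattern in plain Python to make the overlap explicit. It is an illustration only, not Chonkie's implementation, and the `overlap=12` default is read off the output above rather than taken from any documented setting.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 100, overlap: int = 12) -> List[str]:
    """Split `text` into fixed-size character windows that share `overlap` characters."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


demo = sliding_window_chunks("abcdefghij", chunk_size=4, overlap=1)
# demo == ["abcd", "defg", "ghij"]
```

A larger overlap keeps more shared context between neighbouring chunks at the cost of storing more duplicated text.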

### Sentence Chunker

In [7]:
chunks = helix.Chunk.sentence_chunk(massive_text_blob)
chunks

['\nThis is a massive text blob that we want to chunk into smaller pieces for processing. It contains multiple sentences and paragraphs that need to be divided appropriately to maintain context while fitting within token limits. When working with large documents, it is important to ensure that each chunk maintains enough context for downstream tasks, such as retrieval or summarization. Chunking strategies can vary depending on the use case, but the goal is always to balance context preservation with processing efficiency.\n\nThe chunker should handle overlaps properly to ensure no important information is lost at chunk boundaries. For example, if a sentence is split between two chunks, the overlap ensures that both chunks retain the full meaning of the text. This is especially important in applications like document question answering, where missing a single sentence could lead to incorrect answers. Additionally, chunkers may need to account for different languages, code blocks, or spe

In [8]:
batch_chunks = helix.Chunk.sentence_chunk(texts)
batch_chunks

🦛 choooooooooooooooooooonk 100% • 2/2 docs chunked [00:00<00:00, 19.72doc/s] 🌱


['First document to chunk with some content for testing.',
 'Second document with different content for batch processing.']

### Recursive Chunker

In [9]:
chunks = helix.Chunk.recursive_chunk(massive_text_blob)
chunks

['\nThis is a massive text blob that we want to chunk into smaller pieces for processing. It contains multiple sentences and paragraphs that need to be divided appropriately to maintain context while fitting within token limits. When working with large documents, it is important to ensure that each chunk maintains enough context for downstream tasks, such as retrieval or summarization. Chunking strategies can vary depending on the use case, but the goal is always to balance context preservation with processing efficiency.\n\nThe chunker should handle overlaps properly to ensure no important information is lost at chunk boundaries. For example, if a sentence is split between two chunks, the overlap ensures that both chunks retain the full meaning of the text. This is especially important in applications like document question answering, where missing a single sentence could lead to incorrect answers. Additionally, chunkers may need to account for different languages, code blocks, or spe

In [10]:
batch_chunks = helix.Chunk.recursive_chunk(texts)
batch_chunks

🦛 choooooooooooooooooooonk 100% • 2/2 docs chunked [00:00<00:00, 21.89doc/s] 🌱


['First document to chunk with some content for testing.',
 'Second document with different content for batch processing.']

### Code Chunker

In [11]:
chunks = helix.Chunk.code_chunk(code_sample, language="python")
chunks

['\ndef hello_world():\n    print("Hello, Chonkie!")\n\nclass MyClass:\n    def __init__(self):\n        self.value = 42\n']

In [12]:
batch_chunks = helix.Chunk.code_chunk(code_samples, language="python")
batch_chunks

🦛 choooooooooooooooooooonk 100% • 2/2 docs chunked [00:00<00:00, 7212.90doc/s] 🌱


['def func1():\n    pass',
 'const x = 10;\nfunction add(a, b) { return a + b; }']
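
In the batch call above, the second sample is JavaScript but is passed with `language="python"`. One way to keep the language hint accurate is to group samples by language and chunk each group separately. This is a minimal sketch that assumes the caller already knows each sample's language. `group_by_language` is a hypothetical helper, and whether `code_chunk` accepts `"javascript"` as a `language` value is an assumption to check against the Chonkie documentation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_by_language(samples: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (language, source) pairs into {language: [source, ...]}."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for language, source in samples:
        grouped[language].append(source)
    return dict(grouped)


labelled_samples = [
    ("python", "def func1():\n    pass"),
    ("javascript", "const x = 10;\nfunction add(a, b) { return a + b; }"),
]
groups = group_by_language(labelled_samples)

# Each group could then be chunked with its own language hint
# (the "javascript" value is an assumption, see the note above):
# for language, sources in groups.items():
#     batch_chunks = helix.Chunk.code_chunk(sources, language=language)
```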

### Semantic Chunker

In [13]:
chunks = helix.Chunk.semantic_chunk(massive_text_blob)
chunks


['\nThis is a massive text blob that we want to chunk into smaller pieces for processing. It contains multiple sentences and paragraphs that need to be divided appropriately to maintain context while fitting within token limits. When working with large documents, it is important to ensure that each chunk maintains enough context for downstream tasks, such as retrieval or summarization.',
 ' Chunking strategies can vary depending on the use case, but the goal is always to balance context preservation with processing efficiency.\n\nThe chunker should handle overlaps properly to ensure no important information is lost at chunk boundaries. For example, if a sentence is split between two chunks, the overlap ensures that both chunks retain the full meaning of the text.',
 ' This is especially important in applications like document question answering, where missing a single sentence could lead to incorrect answers.',
 ' Additionally, chunkers may need to account for different languages, code

In [14]:
batch_chunks = helix.Chunk.semantic_chunk(texts)
batch_chunks

🦛 choooooooooooooooooooonk 100% • 2/2 docs chunked [00:00<00:00, 10994.24doc/s] 🌱


['First document to chunk with some content for testing.',
 'Second document with different content for batch processing.']

### Late Chunker

In [15]:
chunks = helix.Chunk.late_chunk(massive_text_blob)
chunks

Token indices sequence length is longer than the specified maximum sequence length for this model (300 > 256). Running this sequence through the model will result in indexing errors


['\nThis is a massive text blob that we want to chunk into smaller pieces for processing. It contains multiple sentences and paragraphs that need to be divided appropriately to maintain context while fitting within token limits. When working with large documents, it is important to ensure that each chunk maintains enough context for downstream tasks, such as retrieval or summarization. Chunking strategies can vary depending on the use case, but the goal is always to balance context preservation with processing efficiency.\n\nThe chunker should handle overlaps properly to ensure no important information is lost at chunk boundaries. For example, if a sentence is split between two chunks, the overlap ensures that both chunks retain the full meaning of the text. This is especially important in applications like document question answering, where missing a single sentence could lead to incorrect answers. Additionally, chunkers may need to account for different languages, code blocks, or spe

In [16]:
batch_chunks = helix.Chunk.late_chunk(texts)
batch_chunks

🦛 choooooooooooooooooooonk 100% • 2/2 docs chunked [00:00<00:00, 17.94doc/s] 🌱


['First document to chunk with some content for testing.',
 'Second document with different content for batch processing.']

### Neural Chunker

In [17]:
chunks = helix.Chunk.neural_chunk(massive_text_blob)
chunks

Device set to use cpu


['\n',
 'This is a massive text blob that we want to chunk into smaller pieces for processing. It contains multiple sentences and paragraphs that need to be divided appropriately to maintain context while fitting within token limits.',
 ' When working with large documents, it is important to ensure that each chunk maintains enough context for downstream tasks, such as retrieval or summarization. Chunking strategies can vary depending on the use case, but the goal is always to balance context preservation with processing efficiency.\n',
 '\nThe chunker should handle overlaps properly to ensure no important information is lost at chunk boundaries. For example, if a sentence is split between two chunks, the overlap ensures that both chunks retain the full meaning of the text.',
 ' This is especially important in applications like document question answering, where missing a single sentence could lead to incorrect answers. Additionally, chunkers may need to account for different languages,

In [18]:
batch_chunks = helix.Chunk.neural_chunk(texts)
batch_chunks

Device set to use cpu
🦛 choooooooooooooooooooonk 100% • 2/2 docs chunked [00:00<00:00, 15.75doc/s] 🌱


['First document to chunk with some content for testing.',
 'Second document with different content for batch processing.']
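
The single-text output above begins with a chunk that contains only `'\n'`. If such chunks are not useful downstream, they can be filtered out before storing. A minimal sketch follows. `drop_blank_chunks` is a hypothetical helper, not part of helix or Chonkie.

```python
from typing import List


def drop_blank_chunks(chunks: List[str]) -> List[str]:
    """Return only the chunks that contain at least one non-whitespace character."""
    return [chunk for chunk in chunks if chunk.strip()]


raw_chunks = [
    "\n",
    "This is a massive text blob.",
    "   ",
    " When working with large documents.",
]
clean_chunks = drop_blank_chunks(raw_chunks)
```

The surviving chunks are returned unchanged, so leading spaces and newlines inside them are preserved.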

### Slumber Chunker

You need to set a Gemini API key in your environment to run this.

In [19]:
import dotenv
dotenv.load_dotenv()

True
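
`load_dotenv()` returning `True` only means a `.env` file was found and loaded, not that it contains the key. A small check like the one below can fail early with a clearer message. This is a sketch: `require_env` is a hypothetical helper, and `GEMINI_API_KEY` is an assumed variable name that should be confirmed against the Chonkie documentation.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is missing or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your environment or .env file")
    return value


# GEMINI_API_KEY is an assumed variable name; check which name the
# Gemini integration actually reads before relying on it.
# api_key = require_env("GEMINI_API_KEY")
```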

In [20]:
chunks = helix.Chunk.slumber_chunk(massive_text_blob)
chunks

🦛 choooooooooooooooooooonk 100% • 36/36 splits processed [00:24<00:00,  1.48split/s] 🌱


['\nThis is a massive text blob that we want to chunk into smaller pieces for processing. Itcontainsmultiplesentencesandparagraphsthatneedtobedividedappropriatelytomaintaincontextwhilefittingwithintokenlimits.When working with large documents, it is important to ensure that each chunk maintains enough context for downstream tasks, such as retrieval or summarization. Chunking strategies can vary depending on the use case, but the goal is always to balance context preservation with processing efficiency.\n\nThe chunker should handle overlaps properly to ensure no important information is lost at chunk boundaries. For example, if a sentence is split between two chunks, the overlap ensures that both chunks retain the full meaning of the text. This is especially important in applications like document question answering, where missing a single sentence could lead to incorrect answers. ',
 'Additionally, chunkers may need to account for different languages, code blocks, or special formatting

In [21]:
batch_chunks = helix.Chunk.slumber_chunk(texts)
batch_chunks

🦛 choooooooooooooooooooonk 100% • 1/1 splits processed [00:06<00:00,  6.04s/split] 🌱
🦛 choooooooooooooooooooonk 100% • 1/1 splits processed [00:05<00:00,  5.64s/split] 🌱
🦛 choooooooooooooooooooonk 100% • 2/2 docs chunked [00:11<00:00,  5.84s/doc] 🌱


['First document to chunk with some content for testing.',
 'Second document with different content for batch processing.']

### PDF to Markdown Converter

In [22]:
pdf_path = "sample.pdf"
markdown_text = helix.Chunk.pdf_markdown(pdf_path)
markdown_text[:102]

'Sample Document\n\nThis is a massive text blob that we want to chunk into smaller pieces for processing.'

In [23]:
chunks = helix.Chunk.recursive_chunk(markdown_text, recipe="markdown")
chunks

['Sample Document\n\nThis is a massive text blob that we want to chunk into smaller pieces for processing. It contains\n\nmultiple  sentences  and  paragraphs  that need to be divided appropriately to maintain context\n\nwhile ﬁtting within token limits. When working with large documents, it is important to ensure\n\nthat  each  chunk  maintains  enough  context  for  downstream  tasks,  such  as  retrieval  or\n\nsummarization. Chunking strategies can vary depending on the use case, but the goal is always\n\nto balance context preservation with processing eﬃciency.\n\nThe  chunker  should  handle  overlaps  properly  to  ensure  no  important  information  is  lost  at\n\nchunk boundaries. For example, if a sentence is split between two chunks, the overlap ensures\n\nthat both chunks retain the full meaning of the text. This is especially important in applications\n\nlike  document  question  answering,  where  missing  a  single  sentence  could  lead  to  incorrect\n\nanswers.  Addi
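
The converted text above still carries PDF extraction artifacts: typographic ligatures such as `ﬁ` and `ﬃ`, and runs of double spaces from justified lines. If that matters for retrieval, the text could be normalized before chunking. The following is a minimal sketch using only the Python standard library. `normalize_pdf_text` is a hypothetical helper, not part of helix.

```python
import re
import unicodedata


def normalize_pdf_text(text: str) -> str:
    """Fold compatibility characters (e.g. ligatures) and collapse repeated spaces."""
    # NFKC replaces compatibility characters such as the "fi" ligature
    # (U+FB01) with their plain-letter equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of spaces or tabs into one space; newlines are left alone
    # so paragraph breaks stay visible to the chunker.
    return re.sub(r"[ \t]{2,}", " ", text)


sample = "while \ufb01tting  within token limits\n\nprocessing e\ufb03ciency."
cleaned = normalize_pdf_text(sample)
# cleaned == "while fitting within token limits\n\nprocessing efficiency."
```

NFKC also rewrites other compatibility characters (for example non-breaking spaces and superscript digits), so it is worth checking the result on a real document before applying it everywhere.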