# Chunking Methods in 2025

## LlamaIndex

In [None]:
# data load

from llama_index.core import SimpleDirectoryReader

# load documents
documents = SimpleDirectoryReader(input_files=[r"C:\Users\ASUS\Documents\medium\preprocess_chunking\data\Attention is all you need.pdf"]).load_data()

In [None]:
# SentenceSplitter

In [None]:
from llama_index.core.node_parser import SentenceSplitter

base_splitter = SentenceSplitter(chunk_size=512)

docs = base_splitter.get_nodes_from_documents(documents)

for doc in docs[6:9]:
    print(doc.text)
    print('---------------------------------------------------------')

Scaled Dot-Product Attention
 Multi-Head Attention
Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several
attention layers running in parallel.
of the values, where the weight assigned to each value is computed by a compatibility function of the
query with the corresponding key.
3.2.1 Scaled Dot-Product Attention
We call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of
queries and keys of dimension dk, and values of dimension dv. We compute the dot products of the
query with all keys, divide each by √dk, and apply a softmax function to obtain the weights on the
values.
In practice, we compute the attention function on a set of queries simultaneously, packed together
into a matrix Q. The keys and values are also packed together into matrices K and V . We compute
the matrix of outputs as:
Attention(Q, K, V) = softmax(QKT
√dk
)V (1)
The two most commonly used attention functions are additive attention [2], a

In [None]:
# semantic chunking

In [None]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# embed_model = OpenAIEmbedding(model="text-embedding-3-small")

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-m3")

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


In [45]:
from llama_index.core.node_parser import (
    SentenceSplitter,
    SemanticSplitterNodeParser,
)
splitter = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model
)

# also baseline splitter
base_splitter = SentenceSplitter(chunk_size=512)

In [None]:
nodes = splitter.get_nodes_from_documents(documents)

In [47]:
for node in nodes:
    print(node.text)
    print('---------------------------------------------------------')

Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗ †
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗ ‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Exp

In [None]:
# read PPTx though semantic chunking

In [None]:
from llama_index.core import SimpleDirectoryReader

# load documents
documents = SimpleDirectoryReader(input_files=[r"C:\Users\ASUS\Documents\medium\preprocess_chunking\data\RAG-in-2025-Intelligent-Information-at-Your-Fingertips.pptx"]).load_data()a

In [None]:
# llamaindex chunking
import os

# os.environ["OPENAI_API_KEY"] = "sk-..."

from llama_index.embeddings.huggingface import HuggingFaceEmbedding


# embed_model = OpenAIEmbedding(model="text-embedding-3-small")

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-m3")


In [None]:
from llama_index.core.node_parser import (
    SentenceSplitter,
    SemanticSplitterNodeParser,
)
splitter = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model
)
nodes = splitter.get_nodes_from_documents(documents)

for node in nodes:
    print(node.text)
    print('---------------------------------------------------------')

In [None]:
#TopicNodeParser

In [None]:
# GROQ_API_KEY ="gsk_****"

In [None]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.groq import Groq

# embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-m3")
llm = Groq(model="meta-llama/llama-4-scout-17b-16e-instruct", api_key=GROQ_API_KEY)


In [None]:
from llama_index.node_parser.topic import TopicNodeParser
node_parser = TopicNodeParser.from_defaults(
    llm=llm,
    embed_model=embed_model,
    max_chunk_size=1000,
    similarity_method="llm",  # can be "llm" or "embedding"
    window_size=2,  # paper suggests window_size=5
)

nodes = node_parser.get_nodes_from_documents(documents, show_progress=True)


Parsing nodes:  47%|████▋     | 7/15 [07:26<09:07, 68.38s/it]Retrying llama_index.llms.openai.base.OpenAI._chat in 1.0 seconds as it raised InternalServerError: Error code: 503 - {'error': {'message': 'Service Unavailable', 'type': 'internal_server_error'}}.
Parsing nodes: 100%|██████████| 15/15 [17:17<00:00, 69.19s/it] 

Failed to parse JSON: [
  "The Law will never be perfect.",
  "The application of the Law should be just.",
  "The application of the Law being just is what we are missing, in my opinion."
]
```

Note that I have:

* Split no compound sentences as the input was already composed of simple sentences.
* No named entities with additional descriptive information were present.
* Decontextualized no pronouns as there were none.
* Maintained the original phrasing from the input whenever possible.
* Presented the results as a list of strings, formatted in JSON. 

However, here is the output formatted as JSON:

```
[
  "The Law will never be perfect.",
  "The application of the Law should be just.",
  "The application of the Law being just is what we are missing, in my opinion."
]
Google grants permission to reproduce the tables and figures in this paper. The permission is for use in journalistic or scholarly works. The permission is provided by Google. The paper is titled 'Attention Is All You 




In [25]:
for node in nodes[15:20]:
    print(node.text)
    print('---------------------------------------------------------')

Attention mechanisms allow modeling of dependencies without regard to their distance in the input or output sequences. The Transformer is a model architecture that eschews recurrence and instead relies entirely on an attention mechanism. The Transformer allows for significantly more parallelization. The Transformer can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs. The goal of reducing sequential computation is also the foundation of the Extended Neural GPU, ByteNet, and ConvS2S. The Extended Neural GPU, ByteNet, and ConvS2S use convolutional neural networks as basic building blocks. The Extended Neural GPU, ByteNet, and ConvS2S compute hidden representations in parallel for all input and output positions.
---------------------------------------------------------
The number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions in the Extende

## Preprocess.co

In [None]:
# Pdf file

In [None]:
from pypreprocess import Preprocess

# Replace with your actual API key
API_KEY = "your_api_key"
FILEPATH = "data/Attention is all you need.pdf"

preprocess = Preprocess(filepath=FILEPATH, api_key=API_KEY)
preprocess.chunk()
preprocess.wait()
result = preprocess.result()

In [51]:
for chunk in result.data['chunks']:
    print(chunk)
    print('---------------------------------------------------------')

Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works.
Attention Is All You Need

Ashish Vaswani* Noam Shazeer* Niki Parmar* Jakob Uszkoreit* Google Brain avaswani@google.com Google Brain noam@google.com Google Research nikip@google.com Google Research usz@google.com
Llion Jones* Google Research llion@google.com Aidan N. Gomez* t University of Toronto aidan@cs.toronto.edu Lukasz Kaiser* Google Brain lukaszkaiser@google.com
Illia Polosukhin* llia.polosukhin@gmail.com
---------------------------------------------------------
Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, 

In [None]:
# Doc file

In [None]:
from pypreprocess import Preprocess

# Replace with your actual API key
FILEPATH = "data/SampleDOCFile_1000kb.doc"

from pypreprocess import Preprocess

preprocess = Preprocess(filepath=FILEPATH, api_key=API_KEY, options={})
preprocess.chunk()
preprocess.wait()
result = preprocess.result()

In [32]:
len(result.data['chunks'])

254

In [30]:
for chunk in result.data['chunks'][100:103]:
    print(chunk)
    print('---------------------------------------------------------')

Here is an embedded Excel spreadsheet: 2001 2002 pre- post- pre- post- dogs 1234.43 0.33 354.30 777.00 cats 432.00 -432.20 654.45 333.00 turkeys 3.30 4.66 34.65 132.10 fish 52.55 55.33 37.88 31.50 total 1722.28 -371.88 1081.28 1273.60 This concludes our test.
This is a regular paragraph with the default style of Normal. This is a regular paragraph with the default style of Normal. This is a regular paragraph with the default style of Normal. This is a regular paragraph with the default style of Normal. This is a regular paragraph with the default style of Normal.
This is more Normal text.
This is more Normal text. This is bold, this is italic, and this is bold italic. This is normal. This is in a defined inline style called InlineStyle. This is normal. This is red text. This is normal.
This block is centered.
This is left-aligned.

- First item of bulleted list.

Second item of bulleted list.
Second paragraph of second item of bulleted list.

Third item of bulleted list.
 o First item 

## Langchain

In [None]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("data/Attention is all you need.pdf")
documents = loader.load()

In [None]:
# semantic chunking

In [None]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface.embeddings import HuggingFaceEmbeddings

In [60]:
text_splitter = SemanticChunker(HuggingFaceEmbeddings(model_name="BAAI/bge-m3"))

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


In [61]:
docs = text_splitter.split_documents(documents)

In [62]:
for doc in docs:
    print(doc.page_content)
    print('---------------------------------------------------------')

Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works. Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗†
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Exper

In [None]:
# RecursiveCharacterTextSplitter

In [64]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=512,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)

texts = text_splitter.split_documents(documents)

for text in texts:
    print(text.page_content)
    print('---------------------------------------------------------')

Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗†
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser∗
---------------------------------------------------------
Łukasz Kaiser∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attentio

In [None]:
# CharacterTextSplitter

In [None]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)
texts = text_splitter.split_documents(documents)

for text in texts:
    print(text.page_content)
    print('---------------------------------------------------------')

Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗†
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Exper

In [36]:
# !pip freeze -r requirements.txt