__Tokens, Context Windows and Chunking__

The goal of this notebook is to explain the basics of tokens, context windows, and chunking.

__Tokens__

Tokens are the small pieces of text that an AI model, such as an LLM, processes. Each token is represented by an integer ID that indexes an entry in a predefined vocabulary. Many modern LLMs use subword tokenization such as Byte-Pair Encoding (BPE) or WordPiece. Building tokens from subwords instead of whole words keeps the vocabulary bounded and lets the model handle words it has never seen by composing them from known pieces. Tokenization strongly affects performance and cost, because both context limits and API pricing are measured in tokens. In this example we use tiktoken, OpenAI's tokenizer library, with the encoding for the gpt-4o model.
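
To make the subword idea concrete, here is a minimal sketch of the merge step at the heart of Byte-Pair Encoding. It is a toy illustration, not tiktoken's implementation: the corpus, the function names, and the choice of two merge rounds are all assumptions made for this example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

# Hypothetical toy corpus: every word starts as a list of single characters.
corpus = [list(word) for word in ["lower", "lowest", "low", "slow"]]
for _ in range(2):  # two merge rounds, chosen only for illustration
    corpus = merge_pair(corpus, most_frequent_pair(corpus))
print(corpus)
```

After two rounds the frequent sequence "low" has become a single symbol, while rarer endings such as "e", "r", "s", and "t" remain separate pieces. A production tokenizer repeats this process many thousands of times over a large corpus.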

__Context Window__

Context windows define the maximum number of tokens a model can process in a single forward pass. The self-attention mechanism used in LLMs scales roughly quadratically in computational and memory cost as context grows, which makes very large contexts expensive. Increasing a model's context window involves trade-offs: higher latency and greater compute and memory usage. A common issue is the "lost in the middle" problem, where models that ingest long documents struggle to recall or use information from the middle of the context. In this notebook we use gpt-3.5-turbo-0125 (context window: 16,385 tokens) to demonstrate exceeding the context limit and how chunking can help.
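
The quadratic growth can be checked with simple arithmetic. The sketch below counts the pairwise attention scores for a single head in a single layer, assuming plain dense self-attention; it ignores sparse-attention variants and memory-saving kernels. The 4,096 and 128,000 context sizes are illustrative assumptions, not values from this notebook.

```python
def attention_score_entries(context_tokens: int) -> int:
    """Pairwise attention scores for one head in one layer: n * n."""
    return context_tokens * context_tokens

base = attention_score_entries(4_096)
for n in (4_096, 16_385, 128_000):
    entries = attention_score_entries(n)
    ratio = entries / base
    print(f"{n:>7} tokens -> {entries:>17,} scores ({ratio:.1f}x the 4,096-token case)")
```

Doubling the context quadruples the number of scores, which is why long contexts cost disproportionately more compute and memory.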

__Chunking__

Chunking is the process of splitting text into smaller, more manageable pieces for a RAG (Retrieval-Augmented Generation) system. Chunking affects a RAG system's ability to efficiently and accurately synthesize context and impacts downstream performance. In this example we implement fixed-size chunking, the simplest approach. Other approaches include Recursive Character Splitting and Semantic Chunking. Recursive Character Splitting looks for natural boundaries such as newlines and paragraph breaks. Semantic chunking leverages embeddings to split text based on semantic similarity.
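
As a rough illustration of how recursive character splitting prefers natural boundaries, here is a minimal sketch. It measures size in characters rather than tokens, and the function name, separator order, and sample text are assumptions for this example; it does not reproduce any particular library's splitter.

```python
def recursive_split(text, max_chars, separators=("\n\n", "\n", " ")):
    """Split `text` into pieces of at most `max_chars`, trying coarse separators first."""
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No natural boundary left: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_chars:
            current = candidate  # keep packing parts into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= max_chars:
            current = part
        else:
            # This part alone is too long: retry it with finer separators.
            chunks.extend(recursive_split(part, max_chars, finer))
    if current:
        chunks.append(current)
    return chunks

doc = (
    "Tokens are small pieces of text.\n\n"
    "Context windows cap the input size.\n"
    "Chunking splits long documents."
)
pieces = recursive_split(doc, max_chars=40)
for piece in pieces:
    print(repr(piece))
```

With a 40-character limit, the paragraph break is used first and the single newline second, so each sentence ends up in its own chunk and no sentence is cut in the middle.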


<span style="color:darkblue">El objetivo de este cuaderno es explicar los conceptos básicos de tokens, ventanas de contexto y chunking.</span>

<span style="color:darkblue">__Tokens:__ Los tokens son pequeños fragmentos de texto que un modelo de IA, como un LLM, procesa. Los tokens se representan mediante una serie de enteros que corresponden a entradas en un vocabulario predefinido. Muchos LLM modernos usan tokenización por subpalabras, por ejemplo Byte-Pair Encoding o WordPiece. Al crear tokens a partir de subpalabras en lugar de palabras completas, el vocabulario no necesita crecer indefinidamente. Esto también permite que los modelos manejen texto fuera del vocabulario predefinido. La tokenización afecta en gran medida el rendimiento y el costo. En este ejemplo usamos tiktoken, un tokenizador compatible con el modelo gpt-4o.</span>

<span style="color:darkblue">__Ventana de contexto:__ Las ventanas de contexto definen el número máximo de tokens que un modelo puede procesar en una sola pasada hacia adelante. El mecanismo de self-attention que usan los LLM escala aproximadamente de forma cuadrática en coste computacional y de memoria al aumentar el contexto, lo que hace que los contextos muy grandes sean costosos. Aumentar la ventana de contexto de un modelo implica compensaciones: mayor latencia y mayor uso de cómputo y memoria. Un problema habitual es el "perderse en el medio", donde los modelos que ingieren documentos largos tienen dificultades para recordar o usar la información que aparece en la parte central del contexto. En este cuaderno usamos gpt-3.5-turbo-0125 (ventana de contexto: 16,385 tokens) para demostrar exceder el límite de contexto y cómo el chunking puede ayudar.</span>

<span style="color:darkblue">__Chunking:__ El chunking es el proceso de dividir el texto en piezas más pequeñas y manejables para un sistema RAG (Generación Aumentada por Recuperación). El chunking afecta la capacidad del sistema RAG para sintetizar el contexto de forma eficiente y precisa e impacta el rendimiento posterior. En este ejemplo implementamos chunking de tamaño fijo, que es el enfoque más sencillo. Otros enfoques incluyen Recursive Character Splitting y Semantic Chunking. Recursive Character Splitting busca límites naturales como saltos de línea y párrafos. El chunking semántico utiliza embeddings para dividir el texto según la similitud semántica.</span>

In [1]:
# Import necessary libraries
import openai
import tiktoken
import os

For this example we use the tokenizer for OpenAI's gpt-4o model. The following cell encodes two text inputs and prints their token IDs and token counts.

<span style="color:darkblue">Para este ejemplo usaremos el tokenizador del modelo gpt-4o de OpenAI. La celda siguiente codifica dos entradas de texto e imprime sus identificadores de token y el número de tokens.</span>

In [4]:
# tokenizer for the gpt-4o model
encoding = tiktoken.encoding_for_model("gpt-4o")

# A simple sentence
text1 = "Hello, moto !"
tokens1 = encoding.encode(text1)
print(f"Text: '{text1}'")
print(f"Tokens: {tokens1}")
print(f"Number of tokens: {len(tokens1)}\n")

# A more complex sentence
text2 = "Mexico will win the World Cup in 2026."
tokens2 = encoding.encode(text2)
print(f"Text: '{text2}'")
print(f"Tokens: {tokens2}")
print(f"Number of tokens: {len(tokens2)}")

Text: 'Hello, moto !'
Tokens: [13225, 11, 37906, 1073]
Number of tokens: 4

Text: 'Mexico will win the World Cup in 2026.'
Tokens: [134721, 738, 4449, 290, 5922, 17257, 306, 220, 1323, 21, 13]
Number of tokens: 11


In the next cell we generate a long fake document by repeating a sample sentence 20,000 times. We then pass this content to OpenAI's gpt-3.5-turbo-0125 model, which has a maximum context length of 16,385 tokens. The goal is to have the model summarize the long fake document (≈160,000 tokens); we expect an error because the context length will be exceeded.

<span style="color:darkblue">En la siguiente celda generaremos un documento falso largo repitiendo una frase de ejemplo 20,000 veces. Luego pasaremos este contenido al modelo gpt-3.5-turbo-0125 de OpenAI, que tiene una longitud máxima de contexto de 16,385 tokens. El objetivo es que el modelo resuma el documento falso largo (≈160,000 tokens); esperamos un error porque se excederá la longitud de contexto.</span>

In [5]:
# Create the API client (assumes OPENAI_API_KEY is set in the environment)
client = openai.OpenAI()

# Let's create a very long string of text, well over the context limit.
long_text = "Harry Potter, you are a wizard. " * 20000 # ~640k characters, ~160k tokens
# Counted with the gpt-4o encoding; gpt-3.5-turbo uses a different encoding,
# so the count reported by the API can differ.
num_tokens = len(encoding.encode(long_text))

print(f"Our sample text has approximately {num_tokens} tokens.")

try:
    response = client.chat.completions.create(
      model="gpt-3.5-turbo-0125",
      messages=[
        {"role": "user", "content": f"Summarize this text: {long_text}"}
      ]
    )
except openai.BadRequestError as e:
    print("\n💥 We got an error, as expected!")
    print(f"Error Message: {e.message}")

Our sample text has approximately 160001 tokens.

💥 We got an error, as expected!
Error Message: Error code: 400 - {'error': {'message': "This model's maximum context length is 16385 tokens. However, your messages resulted in 160014 tokens. Please reduce the length of the messages.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}
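
The error above can be anticipated before calling the API by comparing the token count with the context window. The sketch below is plain arithmetic under stated assumptions: the helper names are hypothetical, the 500-token output reserve is an arbitrary choice, and the per-message overhead the API adds is ignored.

```python
def fits_in_context(prompt_tokens, context_window=16_385, reserved_output_tokens=500):
    """Return True if the prompt plus a reserved output budget fits in the window."""
    return prompt_tokens + reserved_output_tokens <= context_window

def min_chunks_needed(total_tokens, context_window=16_385, reserved_output_tokens=500):
    """Smallest number of chunks such that each chunk fits the input budget."""
    budget = context_window - reserved_output_tokens
    return -(-total_tokens // budget)  # ceiling division

print(fits_in_context(160_014))    # token count reported in the error message
print(min_chunks_needed(160_001))  # token count of the sample text
```

A check like this makes the need for chunking explicit: the sample text cannot fit in one request, so it has to be split into at least this many pieces.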


In the next cell we will implement fixed-size chunking to split the content so it can be provided to the model. Chunking allows long documents to be processed by LLMs in smaller pieces.

<span style="color:darkblue">En la siguiente celda implementaremos chunking de tamaño fijo para dividir el contenido y poder proporcionarlo al modelo. El chunking permite que los LLM procesen documentos largos en piezas más pequeñas.</span>

In [6]:
def chunk_text(text, chunk_size_tokens):
    """Splits a text into chunks of a specified token size."""
    tokens = encoding.encode(text)
    chunks = []
    for i in range(0, len(tokens), chunk_size_tokens):
        chunk = tokens[i:i + chunk_size_tokens]
        chunks.append(encoding.decode(chunk))
    return chunks

# Let's set a chunk size that is safely within the context window
CHUNK_SIZE = 1000

# Chunk our long text
text_chunks = chunk_text(long_text, CHUNK_SIZE)

print(f"The long text was split into {len(text_chunks)} chunks.")
print(f"The first chunk has {len(encoding.encode(text_chunks[0]))} tokens.")
print("\nHere's the first chunk:\n---")
print(text_chunks[0])
print("---")

The long text was split into 161 chunks.
The first chunk has 1000 tokens.

Here's the first chunk:
---
Harry Potter, you are a wizard. Harry Potter, you are a wizard. Harry Potter, you are a wizard. Harry Potter, you are a wizard. Harry Potter, you are a wizard. Harry Potter, you are a wizard. Harry Potter, you are a wizard. Harry Potter, you are a wizard. Harry Potter, you are a wizard. Harry Potter, you are a wizard. Harry Potter, you are a wizard. Harry Potter, you are a wizard. Harry Potter, you are a wizard. Harry Potter, you are a wizard. Harry Potter, you are a wizard. Harry Potter, you are a wizard. Harry Potter, you are a wizard. Harry Potter, you are a wizard. Harry Potter, you are a wizard. Harry Potter, you are a wizard. Harry Potter, you are a wizard. Harry Potter, you are a wizard. Harry Potter, you are a wizard. Harry Potter, you are a wizard. Harry Potter, you are a wizard. Harry Potter, you are a wizard. Harry Potter, you are a wizard. Harry Potter, you are a wizard. H

The following cell shows a breakdown of costs for the gpt-3.5-turbo-0125 model based on the estimated tokens per chunk and the number of chunks. This cost analysis is important when designing RAG systems: the value the RAG system provides should exceed the operational cost.

<span style="color:darkblue">La celda siguiente muestra un desglose de costes para el modelo gpt-3.5-turbo-0125 basado en los tokens estimados por chunk y el número de chunks. Este análisis de costes es importante al diseñar sistemas RAG: el valor que proporciona el sistema RAG debe superar el coste operativo.</span>

In [None]:
# Pricing for gpt-3.5-turbo-0125 as an example
INPUT_PRICE_PER_1M_TOKENS = 0.50  # $0.50
OUTPUT_PRICE_PER_1M_TOKENS = 1.50 # $1.50

# 1. Calculate total input tokens
total_input_tokens = sum(len(encoding.encode(chunk)) for chunk in text_chunks)

# 2. Estimate total output tokens (let's assume each summary is ~50 tokens)
estimated_output_tokens_per_chunk = 50
total_output_tokens = len(text_chunks) * estimated_output_tokens_per_chunk

# 3. Calculate cost
input_cost = (total_input_tokens / 1_000_000) * INPUT_PRICE_PER_1M_TOKENS
output_cost = (total_output_tokens / 1_000_000) * OUTPUT_PRICE_PER_1M_TOKENS
total_cost = input_cost + output_cost

print(f"Total Input Tokens: {total_input_tokens}")
print(f"Estimated Total Output Tokens: {total_output_tokens}")
print("---")
print(f"Estimated Input Cost: ${input_cost:.4f}")
print(f"Estimated Output Cost: ${output_cost:.4f}")
print(f"ESTIMATED TOTAL COST: ${total_cost:.4f}")

Total Input Tokens: 160001
Estimated Total Output Tokens: 8050
---
Estimated Input Cost: $0.0800
Estimated Output Cost: $0.0121
ESTIMATED TOTAL COST: $0.0921


The total estimated cost for this simple example is about $0.09. While this seems inexpensive for a single run, costs can increase significantly depending on how often documents are referenced or reprocessed.

<span style="color:darkblue">El coste estimado total para este ejemplo simple es de aproximadamente $0.09. Aunque parece económico para una sola ejecución, los costes pueden aumentar significativamente según la frecuencia con la que se consulten o reprocesen los documentos.</span>
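
To see how the per-run estimate scales with usage, the sketch below multiplies it by a run frequency. The daily run counts and the 30-day period are hypothetical values chosen for illustration; only the per-run cost comes from the output above.

```python
def projected_cost(cost_per_run, runs_per_day, days=30):
    """Total cost of repeating the same summarization job every day."""
    return cost_per_run * runs_per_day * days

COST_PER_RUN = 0.0921  # estimated total cost from the cost cell above

for runs_per_day in (1, 100, 10_000):
    total = projected_cost(COST_PER_RUN, runs_per_day)
    print(f"{runs_per_day:>6} runs/day -> ${total:,.2f} over 30 days")
```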

__Additional Considerations__


__Tokenization__ impacts performance and costs. In practice, the language of the text and any industry-specific vocabulary influence which tokenizer suits a use case. For example, a single Japanese character may be split into several tokens, inflating token counts and costs while potentially reducing model performance. Similar token inflation can occur with specialized vocabulary in domains such as law or medicine. Some organizations build custom tokenizers tailored to their domain to reduce cost and improve model outputs.
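
One reason for this inflation is that byte-level tokenizers start from UTF-8 bytes, and non-Latin characters occupy more bytes each. The sketch below only counts characters and bytes; it does not run a tokenizer. For a byte-level tokenizer the byte count is an upper bound on the token count, and the actual count depends on which byte sequences the tokenizer learned to merge. The helper name and sample strings are assumptions for this example.

```python
def utf8_profile(text):
    """Return (number of characters, number of UTF-8 bytes) for `text`."""
    return len(text), len(text.encode("utf-8"))

samples = {"English": "Hello", "Spanish": "Años", "Japanese": "日本語"}
for label, text in samples.items():
    chars, n_bytes = utf8_profile(text)
    print(f"{label:<9} {text!r:<8} chars={chars} utf8_bytes={n_bytes}")
```

The Japanese sample uses three bytes per character, so a tokenizer with few merges for that script has more raw material to split into separate tokens.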

__Context Windows__ are subject to the "lost in the middle" problem: a RAG system may be less effective at using information that appears in the middle of a long context. To mitigate this, systems commonly add a re-ranking step in which a smaller model scores the retrieved chunks for relevance; the final prompt then places the top-ranked chunks first, which improves accuracy.
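
The shape of a re-ranking step can be sketched in a few lines. In a real system the score would come from a smaller model; here a simple keyword-overlap score stands in for it, which is an assumption made only so the example runs without external dependencies. The function names, query, and chunks are hypothetical.

```python
def keyword_overlap_score(query, chunk):
    """Stand-in relevance score: fraction of query words that appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & chunk_words) / len(query_words)

def rerank(query, chunks, top_k=2, score_fn=keyword_overlap_score):
    """Return the top_k chunks, highest score first (ties keep their original order)."""
    ranked = sorted(chunks, key=lambda chunk: score_fn(query, chunk), reverse=True)
    return ranked[:top_k]

query = "context window limit"
chunks = [
    "Tokens are small pieces of text.",
    "The context window sets a limit on input tokens.",
    "Chunking splits long text.",
    "A larger context costs more compute.",
]
top_chunks = rerank(query, chunks)
for chunk in top_chunks:
    print(chunk)
```

Swapping `score_fn` for a model-based scorer keeps the rest of the pipeline unchanged: the prompt is still assembled from the top-ranked chunks, most relevant first.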

__Chunking__ is essential to provide effective context to RAG systems. There are several chunking approaches, such as Recursive Character Splitting, Semantic Chunking, and Layout-Aware Chunking. Layout-Aware Chunking preserves elements of complex documents (tables, headings, paragraphs), producing chunks that better reflect how humans interpret the document. By feeding these structured chunks into a RAG system, retrieval quality and downstream performance can improve significantly.

<span style="color:darkblue">__Consideraciones adicionales:__</span>

<span style="color:darkblue">__Tokenización:__ La tokenización impacta el rendimiento y los costes. En la práctica, las diferencias entre idiomas y el vocabulario específico de cada sector influyen en qué tokenizador es más adecuado para un caso de uso. Por ejemplo, un único carácter japonés puede dividirse en muchos tokens, inflando el recuento de tokens y aumentando los costes, además de poder reducir el rendimiento del modelo. Un inflado similar puede suceder con vocabulario especializado en dominios como el jurídico o el médico. Algunas organizaciones optan por crear tokenizadores personalizados adaptados a su dominio para reducir costes y mejorar los resultados del modelo.</span>

<span style="color:darkblue">__Ventanas de contexto:__ A menudo se encuentra el problema de "perderse en el medio", donde un sistema RAG puede ser menos efectivo al utilizar información que aparece en la parte central de un contexto largo. Para mitigarlo, los sistemas suelen usar algoritmos de reordenamiento (re-ranking) que seleccionan y ordenan los chunks más relevantes mediante un modelo más pequeño; la consulta final incluye primero los chunks mejor clasificados, lo que mejora la precisión.</span>

<span style="color:darkblue">__Chunking:__ El chunking es esencial para proporcionar contexto efectivo a los sistemas RAG. Existen varios enfoques, como Recursive Character Splitting, Semantic Chunking y Layout-Aware Chunking. El Layout-Aware Chunking preserva elementos de documentos complejos (tablas, títulos, párrafos), generando chunks que reflejan mejor cómo los humanos interpretan el documento. Al alimentar estos chunks estructurados en un sistema RAG, la calidad de la recuperación y el rendimiento posterior pueden mejorar significativamente.</span>