# Optimal Chunk-Size for Large Document Summarization 

### Introduction
Large language models (LLMs) are good at text summarization. However, their limited context window poses a challenge when summarizing large documents. We discuss a limitation of the commonly used summarization strategy and introduce a straightforward improvement.

*Source*: [Optimal Chunk-Size for Large Document Summarization](https://vectify.ai/blog/LargeDocumentSummarization)

In [None]:
import os
import openai
from dotenv import load_dotenv

load_dotenv(override=True)


OPENAI_API_TYPE = "YOUR_API_TYPE"
OPENAI_API_VERSION = "YOUR_API_VERSION"
AZURE_OPENAI_ENDPOINT = "YOUR_ENDPOINT"
AZURE_OPENAI_LLM_DEPLOYMENT_NAME = "YOUR_LLM_DEPLOYMENT_NAME"
AZURE_OPENAI_LLM_MODEL_NAME = "YOUR_LLM_MODEL_NAME"
AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME = "YOUR_EMBEDDING_DEPLOYMENT_NAME"
AZURE_OPENAI_EMBEDDING_MODEL_NAME = "YOUR_EMBEDDING_MODEL_NAME"
OPENAI_API_KEY = "YOUR_API_KEY"

CONFLUENCE_TOKEN = "YOUR_CONFLUENCE_TOKEN"

COSMOSDB_VCORE_CONNECTION_STRING = "YOUR_COSMOSDB_VCORE_CONNECTION_STRING"
COSMOSDB_NAMESPACE = "YOUR_COSMOSDB_NAMESPACE"

In [None]:
from langchain.document_loaders import ConfluenceLoader
import pytesseract

# This section uses internal Confluence documents and cannot be shared.
# Replace this loader with your own Confluence space or another document source.

loader = ConfluenceLoader(
    url="https://your-confluence-domain.com",
    token=CONFLUENCE_TOKEN,
)

confluence_documents = loader.load(
    space_key="EXAMPLE_SPACE", 
    include_attachments=False, 
    limit=20,
    max_pages=1000,
)

print(confluence_documents[1])
print(f'{len(confluence_documents)} documents read from Confluence.')

*Many changes were needed here because function names and behavior changed in the openai package ([Migration Guide](https://github.com/openai/openai-python/discussions/742)). I didn't use the suggested summary function, but I fixed it too, so it's available for testing.*

In [None]:
import tiktoken
import math
import textwrap

def get_token_size(document, model):
    tokenizer=tiktoken.encoding_for_model(model)
    return len(tokenizer.encode(document))

def naive_chunker(document, chunk_size, model):
    tokenizer=tiktoken.encoding_for_model(model)
    document_tokens=tokenizer.encode(document)
    document_size = len(document_tokens)
    
    chunks = []
    for i in range(0, document_size, chunk_size):
        chunk = document_tokens[i:i + chunk_size]
        chunks.append(tokenizer.decode(chunk))

    return chunks

def auto_chunker(document, max_chunk_size, model):
    tokenizer = tiktoken.encoding_for_model(model)
    document_tokens = tokenizer.encode(document)
    document_size = len(document_tokens)
    if document_size == 0:
        return []  # nothing to split; also avoids dividing by K == 0 below
    # total number of chunks
    K = math.ceil(document_size / max_chunk_size)
    # average integer chunk size
    average_chunk_size = math.ceil(document_size / K)
    # number of chunks with average_chunk_size - 1 
    shorter_chunk_number = K * average_chunk_size - document_size
    # number of chunks with average_chunk_size
    standard_chunk_number = K - shorter_chunk_number

    chunks = []
    chunk_start = 0
    for i in range(0, K):
        if i < standard_chunk_number:
            chunk_end = chunk_start + average_chunk_size
        else:
            chunk_end = chunk_start + average_chunk_size - 1
        chunk = document_tokens[chunk_start:chunk_end]
        chunks.append(tokenizer.decode(chunk))
        chunk_start = chunk_end

    assert chunk_start == document_size
    return chunks
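
The size arithmetic in `auto_chunker` can be checked without a tokenizer. The following is a minimal sketch, not part of the original notebook: `auto_chunk_sizes` and `naive_chunk_sizes` are hypothetical helpers that reproduce only the length calculations, and the token count passed to them is a made-up illustration.

```python
import math

def auto_chunk_sizes(document_size, max_chunk_size):
    """Return the chunk lengths auto_chunker would produce (hypothetical helper)."""
    if document_size == 0:
        return []
    k = math.ceil(document_size / max_chunk_size)  # total number of chunks
    average = math.ceil(document_size / k)         # size of the longer chunks
    shorter = k * average - document_size          # chunks that get average - 1
    return [average] * (k - shorter) + [average - 1] * shorter

def naive_chunk_sizes(document_size, chunk_size):
    """Return the chunk lengths naive_chunker would produce (hypothetical helper)."""
    return [min(chunk_size, document_size - start)
            for start in range(0, document_size, chunk_size)]

# Illustrative token count, not taken from the notebook's data.
print(naive_chunk_sizes(2186, 1024))  # [1024, 1024, 138]
print(auto_chunk_sizes(2186, 1024))   # [729, 729, 728]
```

Both strategies produce the same number of chunks here; only the distribution of tokens across them differs.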

In [None]:
import openai
import time
from openai import OpenAI
from openai import AzureOpenAI

delimiter = "####"

MAX_ATTEMPTS = 3
MODEL='gpt-4'


def ChatGPT_API(messages, openai_key, model):
    # Build the Azure client from the settings defined in the first cell.
    client = AzureOpenAI(
        api_key=openai_key,
        api_version=OPENAI_API_VERSION,
        azure_endpoint=AZURE_OPENAI_ENDPOINT,
        azure_deployment=AZURE_OPENAI_LLM_DEPLOYMENT_NAME,
    )
    for attempt in range(MAX_ATTEMPTS):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                temperature=0,
            )
            return response.choices[0].message.content
        except Exception as error:
            print(error)
            if attempt < MAX_ATTEMPTS - 1:
                time.sleep(1)  # brief pause before retrying
    # All attempts failed; callers receive this sentinel instead of a summary.
    return "Server Error"


def get_chunk_summary(content, openai_key, model):
    system_msg = f"""
            Summarize this document chunk.
            reply format：{delimiter}<summary>"""
    user_msg = 'here is the document chunk:\n' + content
    messages = [
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_msg}
    ]
    result = ChatGPT_API( messages, openai_key, model)
    # print(result.split(delimiter)[-1].strip())
    return result.split(delimiter)[-1].strip()

def get_global_summary(list_of_summaries, openai_key, model):
    
    system_msg = f"""
            You are given a list of summaries; each summary summarizes a chunk of a document, in sequence.
            Combine a list of summaries into one global summary of the document.
            reply format：{delimiter}<global summary>"""
    user_msg = 'here is the list of the summaries:\n' + str(list_of_summaries)
    messages = [
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_msg}
    ]
        
    result = ChatGPT_API(messages, openai_key, model)
    # print(result)
    return result.split(delimiter)[-1].strip()
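
Both summary helpers rely on `result.split(delimiter)[-1].strip()` to pull the summary out of the reply. Here is a small sketch of how that expression behaves; the reply strings and the `extract_summary` name are illustrative assumptions, not actual model output.

```python
delimiter = "####"

def extract_summary(reply):
    """Keep the text after the last delimiter, as the notebook's helpers do."""
    return reply.split(delimiter)[-1].strip()

# Made-up replies for illustration only.
print(extract_summary("####The chunk describes the release process."))
# The chunk describes the release process.
print(extract_summary("Sure! #### First draft #### Final summary"))
# Final summary
print(extract_summary("No delimiter in this reply"))
# No delimiter in this reply
print(extract_summary("Server Error"))
# Server Error
```

Because a reply without the delimiter passes through unchanged, the `"Server Error"` string returned by `ChatGPT_API` after repeated failures would be treated as a summary. Checking for it before building the global summary may be worthwhile.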

In [None]:
CHUNK_SIZE=1024
test_document=' '.join([str(item) for item in confluence_documents])
naive_chunks=naive_chunker(test_document, CHUNK_SIZE, AZURE_OPENAI_LLM_MODEL_NAME)

In [None]:
naive_chunk_summaries=[get_chunk_summary(chunk,  OPENAI_API_KEY, AZURE_OPENAI_LLM_MODEL_NAME) for chunk in naive_chunks]
naive_global_summary=get_global_summary(naive_chunk_summaries,OPENAI_API_KEY, AZURE_OPENAI_LLM_MODEL_NAME)
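
The two calls above form a two-stage pipeline: summarize each chunk, then combine the chunk summaries into one. The sketch below shows the same flow with an injected `summarize` callable, so it can be dry-run without an API key. `two_stage_summary` and `fake_summarize` are hypothetical names, and the stub simply keeps the first few words of its input rather than calling a model.

```python
def two_stage_summary(chunks, summarize):
    """Summarize each chunk, then summarize the list of chunk summaries."""
    chunk_summaries = [summarize(chunk) for chunk in chunks]
    # Mirrors get_global_summary, which passes str(list_of_summaries) to the model.
    global_summary = summarize(str(chunk_summaries))
    return chunk_summaries, global_summary

def fake_summarize(text, max_words=3):
    """Stand-in for an LLM call: keep only the first few words."""
    return " ".join(text.split()[:max_words])

chunks = ["alpha beta gamma delta", "one two three four five"]
summaries, overall = two_stage_summary(chunks, fake_summarize)
print(summaries)  # ['alpha beta gamma', 'one two three']
print(overall)
```

In the notebook itself the two stages use different prompts (`get_chunk_summary` and `get_global_summary`); the single callable here is a simplification.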

In [None]:
# Print chunk sizes
print('Chunk sizes:', [get_token_size(chunk, AZURE_OPENAI_LLM_MODEL_NAME) for chunk in naive_chunks])
print('First chunk size:', get_token_size(naive_chunks[0], AZURE_OPENAI_LLM_MODEL_NAME))
print('Last chunk size:', get_token_size(naive_chunks[-1], AZURE_OPENAI_LLM_MODEL_NAME))

# Print number of chunks
print(len(naive_chunks), 'chunks generated.')
    
# Print last chunk text
print('Last chunk text:', naive_chunks[-1])

In [None]:

MAX_CHUNK_SIZE=1024
auto_chunks=auto_chunker(test_document, MAX_CHUNK_SIZE, AZURE_OPENAI_LLM_MODEL_NAME)

In [None]:
auto_chunk_summaries=[get_chunk_summary(chunk, OPENAI_API_KEY, AZURE_OPENAI_LLM_MODEL_NAME) for chunk in auto_chunks]
auto_global_summary=get_global_summary(auto_chunk_summaries, OPENAI_API_KEY, AZURE_OPENAI_LLM_MODEL_NAME)

In [None]:
# Print chunk sizes
print('Chunk sizes:', [get_token_size(chunk, AZURE_OPENAI_LLM_MODEL_NAME) for chunk in auto_chunks])
print('First chunk size:', get_token_size(auto_chunks[0], AZURE_OPENAI_LLM_MODEL_NAME))
print('Last chunk size:', get_token_size(auto_chunks[-1], AZURE_OPENAI_LLM_MODEL_NAME))

# Print number of chunks
print(len(auto_chunks), 'chunks generated.')
    
print('Last chunk text:', auto_chunks[-1])

In [None]:
# Don't use for now
from langchain.text_splitter import RecursiveCharacterTextSplitter

# https://js.langchain.com/docs/modules/data_connection/document_transformers/#get-started-with-text-splitters
# How to chunk: https://www.pinecone.io/learn/chunking-strategies/
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1024, 
    chunk_overlap = 200,
    length_function = len)

splitted_documents = text_splitter.split_documents(confluence_documents)

print(f'{len(splitted_documents)} chunks generated.')

In [None]:
# Embeddings by OpenAI
from langchain_openai import AzureOpenAIEmbeddings

# https://api.python.langchain.com/en/latest/embeddings/langchain_openai.embeddings.azure.AzureOpenAIEmbeddings.html
openai_embeddings = AzureOpenAIEmbeddings(
    azure_deployment=AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME,
    openai_api_version=OPENAI_API_VERSION,
    # https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models#gpt-4-and-gpt-4-turbo-preview
    model=AZURE_OPENAI_EMBEDDING_MODEL_NAME,
    embedding_ctx_length=8191, # default
    chunk_size=1024  # number of texts sent per embedding request (batch size), not the token length of a chunk
)

# Test naive_chunks

In [None]:
from langchain_community.vectorstores.azure_cosmos_db import (
    AzureCosmosDBVectorSearch,
    CosmosDBSimilarityType,
)
from pymongo import MongoClient
from langchain.docstore.document import Document

# LEARNINGS
# 1) CosmosDB RU connection string needs to be pulled from Instance Connection String directly (not from the DB)
# 2) Before connecting from local machine towards the DB make sure to open the firewall from Azure Portal (Networking)

_indexName = 'km-index'
_dbName, _collectionName = COSMOSDB_NAMESPACE.split(".")

client: MongoClient = MongoClient(COSMOSDB_VCORE_CONNECTION_STRING)
collection = client[_dbName][_collectionName]

# Clean MongoDB collection before inserting new data
collection.database.drop_collection(_collectionName)

# Make Documents from naive_chunks, add "source" as a number of chunk for now
naive_documents = [Document(page_content=chunk, metadata={"source": f"chunk_{i}"}) for i, chunk in enumerate(naive_chunks)]

vectorstore = AzureCosmosDBVectorSearch.from_documents(
    naive_documents,
    openai_embeddings,
    collection=collection,
    index_name=_indexName,
)

num_lists = 100
dimensions = 1536
similarity_algorithm = CosmosDBSimilarityType.COS

vectorstore.create_index(num_lists, dimensions, similarity_algorithm)
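
`CosmosDBSimilarityType.COS` selects cosine similarity for ranking stored vectors against the query vector. As a rough illustration of the metric itself, here is a plain-Python sketch. It says nothing about how Cosmos DB implements its index, and the three-dimensional vectors are toy values rather than real 1536-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [1.0, 0.0, 0.0]
same_direction = [2.0, 0.0, 0.0]
orthogonal = [0.0, 3.0, 0.0]

print(cosine_similarity(query, same_direction))  # 1.0
print(cosine_similarity(query, orthogonal))      # 0.0
```

The metric depends only on direction, not magnitude, which is why the vector of length 2 still scores 1.0 against the unit-length query.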

In [None]:
from langchain.chat_models import AzureChatOpenAI
from langchain.chains.qa_with_sources import load_qa_with_sources_chain

# Setup of the LLM and the chain
llm = AzureChatOpenAI(
    azure_deployment=AZURE_OPENAI_LLM_DEPLOYMENT_NAME, 
    model=AZURE_OPENAI_LLM_MODEL_NAME,
    temperature=0.8
)

chain = load_qa_with_sources_chain(
    llm, 
    chain_type="stuff")

query = "YOUR_EXAMPLE_QUESTION"
query1 = "YOUR_EXAMPLE_QUESTION"
query2 = "YOUR_EXAMPLE_QUESTION"
query3 = "YOUR_EXAMPLE_QUESTION"

matching_docs = vectorstore.similarity_search(query)
matching_docs1 = vectorstore.similarity_search(query1)
matching_docs2 = vectorstore.similarity_search(query2)
matching_docs3 = vectorstore.similarity_search(query3)

answer = chain.run(input_documents=matching_docs, question=query)
answer1 = chain.run(input_documents=matching_docs1, question=query1)
answer2 = chain.run(input_documents=matching_docs2, question=query2)
answer3 = chain.run(input_documents=matching_docs3, question=query3)

print(query, "\n", answer)
print(query1, "\n", answer1)
print(query2, "\n", answer2)
print(query3, "\n", answer3)

# Test auto_chunks

In [None]:
from langchain_community.vectorstores.azure_cosmos_db import (
    AzureCosmosDBVectorSearch,
    CosmosDBSimilarityType,
)
from pymongo import MongoClient
from langchain.docstore.document import Document

# LEARNINGS
# 1) CosmosDB RU connection string needs to be pulled from Instance Connection String directly (not from the DB)
# 2) Before connecting from local machine towards the DB make sure to open the firewall from Azure Portal (Networking)

_indexName = 'km-index'
_dbName, _collectionName = COSMOSDB_NAMESPACE.split(".")

client: MongoClient = MongoClient(COSMOSDB_VCORE_CONNECTION_STRING)
collection = client[_dbName][_collectionName]

# Clean MongoDB collection before inserting new data
collection.database.drop_collection(_collectionName)

# Make Documents from auto_chunks, add "source" as a number of chunk for now
auto_documents = [Document(page_content=chunk, metadata={"source": f"chunk_{i}"}) for i, chunk in enumerate(auto_chunks)]

vectorstore = AzureCosmosDBVectorSearch.from_documents(
    auto_documents,
    openai_embeddings,
    collection=collection,
    index_name=_indexName,
)

num_lists = 100
dimensions = 1536
similarity_algorithm = CosmosDBSimilarityType.COS

vectorstore.create_index(num_lists, dimensions, similarity_algorithm)

In [None]:
from langchain.chat_models import AzureChatOpenAI
from langchain.chains.qa_with_sources import load_qa_with_sources_chain

# Setup of the LLM and the chain
llm = AzureChatOpenAI(
    azure_deployment=AZURE_OPENAI_LLM_DEPLOYMENT_NAME, 
    model=AZURE_OPENAI_LLM_MODEL_NAME,
    temperature=0.8
)

chain = load_qa_with_sources_chain(
    llm, 
    chain_type="stuff")

query = "YOUR_EXAMPLE_QUESTION"
query1 = "YOUR_EXAMPLE_QUESTION"
query2 = "YOUR_EXAMPLE_QUESTION"
query3 = "YOUR_EXAMPLE_QUESTION"

matching_docs = vectorstore.similarity_search(query)
matching_docs1 = vectorstore.similarity_search(query1)
matching_docs2 = vectorstore.similarity_search(query2)
matching_docs3 = vectorstore.similarity_search(query3)

answer = chain.run(input_documents=matching_docs, question=query)
answer1 = chain.run(input_documents=matching_docs1, question=query1)
answer2 = chain.run(input_documents=matching_docs2, question=query2)
answer3 = chain.run(input_documents=matching_docs3, question=query3)

print(query, "\n", answer)
print(query1, "\n", answer1)
print(query2, "\n", answer2)
print(query3, "\n", answer3)

# Evaluation

In this experiment, we investigated the impact of different chunking strategies on large document summarization using OpenAI's language models. We explored two approaches:

- *Naive Chunking:* A simple method that splits the document into fixed-size chunks. Our document resulted in 665 chunks of 1024 tokens each, with a final chunk of 138 tokens.
- *Auto Chunking:* A more sophisticated approach that aims to divide the document into chunks of approximately equal size, up to a maximum chunk size (1024 tokens in our case). This resulted in all chunks being very close to the maximum size.
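
As a worked check of the auto-chunking arithmetic, assume the figures above mean 665 full chunks plus one final 138-token chunk. This is an interpretation, since the sentence could also be read as 665 chunks in total. With $N$ the document size in tokens and $K$ the number of chunks, the formulas in `auto_chunker` give:

$$
\begin{aligned}
N &= 665 \times 1024 + 138 = 681098 \\
K &= \lceil N / 1024 \rceil = 666 \\
\text{average chunk size} &= \lceil N / K \rceil = \lceil 1022.67 \rceil = 1023 \\
\text{shorter chunks} &= K \times 1023 - N = 681318 - 681098 = 220 \\
\text{standard chunks} &= K - 220 = 446
\end{aligned}
$$

Under that assumption, auto chunking would produce 446 chunks of 1023 tokens and 220 chunks of 1022 tokens, rather than a trailing 138-token chunk.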

### Key Observations:
- *Unequal vs. Equal Chunk Sizes:* The naive approach led to a much smaller final chunk, potentially containing less information. Auto chunking, in contrast, ensured more balanced chunk sizes.
- *Similar Summarization Performance:* Despite this difference, both methods yielded very similar answers to most of our questions, suggesting the model's ability to synthesize information from across the document.

### Specific Example: "How to Present Decision Outcome?" from the last chunk

We tested the models' understanding of the last section of the document by asking: "How to present a decision outcome?" Interestingly, both chunking strategies produced comprehensive responses:

##### Naive Chunking Response:

*How to present decision outcome?*

To present a decision outcome, the chosen option should be clearly stated along with the justification for the choice. The decision-making process involves considering various options along with their pros and cons. The chosen option should address the problem statement effectively and meet the key decision drivers. It is important to outline both the positive and negative consequences of the decision to provide a comprehensive overview. Additionally, links to any related decisions or further information should be included to offer a complete picture and support transparency.

##### Auto Chunking Response:

*How to present decision outcome?*

To present a decision outcome, the following structure should be used:

1. **Status**: Indicate whether the decision is proposed, rejected, accepted, or deprecated.
2. **Deciders**: List everyone involved in making the decision.
3. **Date**: Provide the date when the decision was last updated.
4. **Technical Story**: Give a brief description or link to a ticket/issue URL that necessitated the decision.
5. **Context and Problem Statement**: Describe the context and problem statement. This can be done in two to three sentences or in the form of a question.
6. **Decision Drivers**: List the factors or concerns that influenced the decision.
7. **Considered Options**: Enumerate the options that were considered.
8. **Pros and Cons of the Options**: For each considered option, list the advantages and disadvantages.
9. **Decision Outcome**: State the chosen option and provide justification for the selection, based on how it meets critical criteria or resolves key issues identified in the decision drivers.
10. **Positive Consequences**: Highlight the expected benefits or positive outcomes of the chosen decision.
11. **Negative Consequences**: Mention any potential downsides or negative outcomes that may arise.
12. **Links**: Include any relevant links for further reference.

This format ensures clarity, accountability, and traceability in decision-making processes by systematically outlining the problem, considered options, and the rationale behind the final decision.

### Conclusion

While auto-chunking offers a more theoretically sound approach to dividing documents, our results indicate that naive chunking can still be effective for question-answering tasks, especially when the model can draw from a broader context. However, if the content within individual chunks is critical and cannot be inferred from elsewhere, the more consistent chunk sizes of auto-chunking might provide an advantage.