# KARA Basic Usage

## Setup and Imports

In [2]:
from kara import KARAUpdater, RecursiveCharacterChunker

In [3]:
def pprint_chunks(chunks_text):
    """Print the first 90 characters of each chunk in a list of chunk texts."""
    for i, chunk in enumerate(chunks_text):
        print(f"Chunk {i + 1}: `{chunk[:90]}...`\n")

## Document Preparation

In [4]:
# Original document
original_doc = (
    "LangChain is an open-source framework built in Python that helps developers create "
    "applications powered by large language models (LLMs). It allows seamless integration "
    "between LLMs and external data sources like APIs, files, and databases. With LangChain, "
    "developers can build dynamic workflows where a language model not only generates text but "
    "also interacts with tools and environments. This makes it ideal for creating advanced "
    "chatbots, agents, and AI systems that go beyond static prompting. LangChain provides both "
    "low-level components for custom logic and high-level abstractions for rapid prototyping, "
    "making it a versatile toolkit for AI application development.\n\n"
    "Python is the primary language used with LangChain due to its rich ecosystem and "
    "simplicity. Python's popularity in AI and data science makes it a natural fit for "
    "building with LangChain. Libraries like pydantic, asyncio, and openai integrate smoothly "
    "with LangChain, enabling developers to quickly build robust, scalable applications. "
    "Because LangChain supports modularity, developers can extend it using Python's vast "
    "collection of libraries. Whether you're building an autonomous agent or a document QA "
    "tool, Python and LangChain together offer a powerful combination that lowers the barrier "
    "for building intelligent, interactive systems."
)

print(f"Original document length: {len(original_doc)} characters")

Original document length: 1315 characters


In [5]:
# Updated document (the last sentences of the first paragraph are replaced with new content)
updated_doc = (
    "LangChain is an open-source framework built in Python that helps developers create "
    "applications powered by large language models (LLMs). It allows seamless integration "
    "between LLMs and external data sources like APIs, files, and databases. With LangChain, "
    "developers can build dynamic workflows where a language model not only generates text but "
    "also interacts with tools and environments. Developers can define step-by-step workflows "
    "in which an LLM can retrieve data, call APIs, and act based on context. This flexibility "
    "allows LangChain to support everything from basic assistants to complex, multi-step "
    "agents capable of reasoning and memory retention.\n\n"
    "Python is the primary language used with LangChain due to its rich ecosystem and "
    "simplicity. Python's popularity in AI and data science makes it a natural fit for "
    "building with LangChain. Libraries like pydantic, asyncio, and openai integrate smoothly "
    "with LangChain, enabling developers to quickly build robust, scalable applications. "
    "Because LangChain supports modularity, developers can extend it using Python's vast "
    "collection of libraries. Whether you're building an autonomous agent or a document QA "
    "tool, Python and LangChain together offer a powerful combination that lowers the barrier "
    "for building intelligent, interactive systems."
)

print(f"Updated document length: {len(updated_doc)} characters")

Updated document length: 1300 characters


## Initialize KARA Components

In [6]:
# Initialize KARA with character-based chunking
chunker = RecursiveCharacterChunker(chunk_size=200, separators=[". ", "\n\n"])
updater = KARAUpdater(chunker=chunker, imperfect_chunk_tolerance=10)
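
To make the `chunk_size` and `separators` parameters more concrete, here is a minimal, self-contained sketch of separator-based chunking. It is not KARA's implementation: `split_on_separators` and `greedy_pack` are hypothetical helpers written for illustration, and the real `RecursiveCharacterChunker` may split and merge differently.

```python
import re


def split_on_separators(text, separators):
    """Cut text after every separator, keeping the separator with the piece before it."""
    pattern = "|".join(re.escape(sep) for sep in separators)
    pieces = []
    start = 0
    for match in re.finditer(pattern, text):
        pieces.append(text[start:match.end()])
        start = match.end()
    if start < len(text):
        pieces.append(text[start:])  # trailing text with no separator after it
    return pieces


def greedy_pack(pieces, chunk_size):
    """Merge consecutive pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept as one oversized chunk.
    """
    chunks = []
    current = ""
    for piece in pieces:
        if current and len(current) + len(piece) > chunk_size:
            chunks.append(current)
            current = piece
        else:
            current += piece
    if current:
        chunks.append(current)
    return chunks


sample = (
    "One short sentence. Another short sentence.\n\n"
    "A new paragraph starts here. It ends here."
)
pieces = split_on_separators(sample, [". ", "\n\n"])
chunks = greedy_pack(pieces, chunk_size=50)

for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i} ({len(chunk)} chars): {chunk!r}")
```

Because pieces always end on a separator, small edits inside one sentence leave the boundaries of the surrounding pieces unchanged, which is what makes chunk reuse possible at all.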

## Step 1: Process Original Document

In [7]:
initial_result = updater.create_knowledge_base([original_doc])

assert initial_result.new_chunked_doc is not None
original_chunks = [chunk.content for chunk in initial_result.new_chunked_doc.chunks]

print(f"Created {len(original_chunks)} chunks:")
print()
for i, chunk in enumerate(original_chunks, 1):
    print(f"Chunk {i}: `{chunk[:65].strip()}...`")
    print("-" * 80)

Created 9 chunks:

Chunk 1: `LangChain is an open-source framework built in Python that helps...`
--------------------------------------------------------------------------------
Chunk 2: `It allows seamless integration between LLMs and external data sou...`
--------------------------------------------------------------------------------
Chunk 3: `With LangChain, developers can build dynamic workflows where a la...`
--------------------------------------------------------------------------------
Chunk 4: `This makes it ideal for creating advanced chatbots, agents, and A...`
--------------------------------------------------------------------------------
Chunk 5: `LangChain provides both low-level components for custom logic and...`
--------------------------------------------------------------------------------
Chunk 6: `Python is the primary language used with LangChain due to its ric...`
--------------------------------------------------------------------------------
Chunk 7: `Librar

## Step 2: Process Updated Document

Now let's update the knowledge base with the modified document and see how KARA reuses existing chunks:

In [8]:
result = updater.update_knowledge_base(initial_result.new_chunked_doc, [updated_doc])

assert result.new_chunked_doc is not None
updated_chunks = [chunk.content for chunk in result.new_chunked_doc.chunks]

print(f"Result: {len(updated_chunks)} chunks")
print()
for i, chunk in enumerate(updated_chunks, 1):
    # Check if this chunk existed before
    is_reused = chunk in original_chunks
    status = "REUSED" if is_reused else "NEW"
    print(f"Chunk {i} [{status}]: `{chunk[:55].strip()}...`")
    print("-" * 80)

Result: 9 chunks

Chunk 1 [REUSED]: `LangChain is an open-source framework built in Python t...`
--------------------------------------------------------------------------------
Chunk 2 [REUSED]: `It allows seamless integration between LLMs and externa...`
--------------------------------------------------------------------------------
Chunk 3 [REUSED]: `With LangChain, developers can build dynamic workflows...`
--------------------------------------------------------------------------------
Chunk 4 [NEW]: `Developers can define step-by-step workflows in which a...`
--------------------------------------------------------------------------------
Chunk 5 [NEW]: `This flexibility allows LangChain to support everything...`
--------------------------------------------------------------------------------
Chunk 6 [REUSED]: `Python is the primary language used with LangChain due...`
--------------------------------------------------------------------------------
Chunk 7 [REUSED]: `Libraries l
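
The REUSED/NEW labels above come from a plain membership test on chunk text. The sketch below shows the same idea with hashed content keys and a set, which keeps lookups cheap when there are many chunks. `content_key` and `label_chunks` are hypothetical names for this illustration and are not part of the KARA API.

```python
import hashlib


def content_key(text):
    """Stable fingerprint of a chunk's exact text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def label_chunks(old_chunks, new_chunks):
    """Label each new chunk REUSED if identical text existed before, else NEW."""
    old_keys = {content_key(chunk) for chunk in old_chunks}
    return [
        ("REUSED" if content_key(chunk) in old_keys else "NEW", chunk)
        for chunk in new_chunks
    ]


old = ["alpha. ", "beta. ", "gamma. "]
new = ["alpha. ", "delta. ", "gamma. "]
labels = [status for status, _ in label_chunks(old, new)]
print(labels)  # ['REUSED', 'NEW', 'REUSED']
```

Note that this only detects exact matches: a chunk that differs by a single character counts as NEW.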

## Step 3: Analyze Efficiency

Let's examine the efficiency gains from using KARA:

In [11]:
# Show efficiency metrics
reused_count = result.num_reused
total_chunks = len(updated_chunks)
efficiency_ratio = result.efficiency_ratio  # a fraction; formatted as a percentage below

print("KARA Efficiency Analysis")
print("=" * 40)
print(f"Total chunks in updated document: {total_chunks}")
print(f"Chunks reused from original: {reused_count}")
print(f"Chunks added: {result.num_added}")
print(f"Chunks deleted: {result.num_deleted}")
print()
print(f"Overall efficiency: {reused_count}/{total_chunks} = {efficiency_ratio:.1%}")

KARA Efficiency Analysis
========================================
Total chunks in updated document: 9
Chunks reused from original: 7
Chunks added: 2
Chunks deleted: 2

Overall efficiency: 7/9 = 77.8%
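
The printed figure can be reproduced by hand. The sketch below assumes that `result.efficiency_ratio` is the number of reused chunks divided by the number of chunks in the updated document, which is consistent with the output above but is not confirmed by the library's documentation here. `reuse_ratio` is a hypothetical helper.

```python
def reuse_ratio(num_reused, num_added):
    """Fraction of the updated document's chunks that were carried over unchanged."""
    total = num_reused + num_added
    return num_reused / total if total else 0.0


ratio = reuse_ratio(num_reused=7, num_added=2)
print(f"{ratio:.1%}")  # 77.8%
```

Under this reading, only the 2 added chunks out of 9 would need to be processed again after the update.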


## Step 4: Multi-Document Support Demo

KARA also supports efficient updates across multiple documents. Let's demonstrate this:

In [12]:
# Multiple documents with some overlap
doc1 = original_doc

doc2 = (
    "Python's rich ecosystem makes it ideal for AI development. Libraries like numpy, "
    "pandas, and scikit-learn integrate seamlessly with LangChain components."
)

doc3 = (
    "Vector databases enable semantic search in RAG applications. They store "
    "embeddings and allow for efficient similarity-based retrieval of context."
)

print("📚 Processing multiple documents:")
print(f"Document 1: {len(doc1)} chars (LangChain overview)")
print(f"Document 2: {len(doc2)} chars (Python ecosystem)")
print(f"Document 3: {len(doc3)} chars (Vector databases)")

📚 Processing multiple documents:
Document 1: 1315 chars (LangChain overview)
Document 2: 153 chars (Python ecosystem)
Document 3: 145 chars (Vector databases)


In [13]:
# Process multiple documents
multi_result = updater.create_knowledge_base([doc1, doc2, doc3])
assert multi_result.new_chunked_doc is not None
multi_chunks = [chunk.content for chunk in multi_result.new_chunked_doc.chunks]

print(f"Created {len(multi_chunks)} chunks from 3 documents:")
print()
for i, chunk in enumerate(multi_chunks, 1):
    print(f"Chunk {i}: `{chunk[:65].strip()}...`")
    print("-" * 80)

Created 11 chunks from 3 documents:

Chunk 1: `LangChain is an open-source framework built in Python that helps...`
--------------------------------------------------------------------------------
Chunk 2: `It allows seamless integration between LLMs and external data sou...`
--------------------------------------------------------------------------------
Chunk 3: `With LangChain, developers can build dynamic workflows where a la...`
--------------------------------------------------------------------------------
Chunk 4: `This makes it ideal for creating advanced chatbots, agents, and A...`
--------------------------------------------------------------------------------
Chunk 5: `LangChain provides both low-level components for custom logic and...`
--------------------------------------------------------------------------------
Chunk 6: `Python is the primary language used with LangChain due to its ric...`
-------------------------------------------------------------------------------

In [14]:
# Update with modified documents
doc1_updated = updated_doc  # Modified version
doc2_updated = doc2  # Unchanged
doc4_new = (
    "Retrieval-Augmented Generation (RAG) combines the power of large language "
    "models with external knowledge retrieval to provide more accurate and "
    "contextual responses."
)

print("Updating with modified and new documents:")
print("   - Document 1: Modified (content added)")
print("   - Document 2: Unchanged")
print("   - Document 3: Removed")
print("   - Document 4: New (RAG explanation)")

multi_update_result = updater.update_knowledge_base(
    multi_result.new_chunked_doc, [doc1_updated, doc2_updated, doc4_new]
)
assert multi_update_result.new_chunked_doc is not None

Updating with modified and new documents:
   - Document 1: Modified (content added)
   - Document 2: Unchanged
   - Document 3: Removed
   - Document 4: New (RAG explanation)


In [15]:
# Analyze multi-document efficiency
print("Multi-Document Update Results:")
print("=" * 40)
print("Original documents: 3")
print("Updated documents: 3 (1 modified, 1 unchanged, 1 new)")
print("")
print(f"Chunks reused: {multi_update_result.num_reused}")
print(f"Chunks added: {multi_update_result.num_added}")
print(f"Chunks deleted: {multi_update_result.num_deleted}")
print("")
print(f"Multi-doc efficiency: {multi_update_result.efficiency_ratio:.1%}")

Multi-Document Update Results:
========================================
Original documents: 3
Updated documents: 3 (1 modified, 1 unchanged, 1 new)

Chunks reused: 8
Chunks added: 3
Chunks deleted: 3

Multi-doc efficiency: 72.7%
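
The three counters can be cross-checked against the chunk totals. The sketch below assumes that every chunk of the old knowledge base is either reused or deleted, and that every chunk of the new one is either reused or added. `UpdateCounts` is a hypothetical container for illustration, not a KARA class.

```python
from dataclasses import dataclass


@dataclass
class UpdateCounts:
    """Hypothetical holder for the counters printed above."""

    num_reused: int
    num_added: int
    num_deleted: int

    @property
    def old_total(self):
        # Chunks in the knowledge base before the update.
        return self.num_reused + self.num_deleted

    @property
    def new_total(self):
        # Chunks in the knowledge base after the update.
        return self.num_reused + self.num_added


counts = UpdateCounts(num_reused=8, num_added=3, num_deleted=3)
print(f"Before update: {counts.old_total} chunks")  # 11, matching "Created 11 chunks"
print(f"After update:  {counts.new_total} chunks")  # 11
print(f"Reused share:  {counts.num_reused / counts.new_total:.1%}")  # 72.7%
```

The old total of 11 matches the chunk count reported when the three documents were first processed, and 8/11 matches the reported multi-document efficiency.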
