# KARA LangChain Integration


## Setup and Imports


In [9]:
try:
    from langchain_core.documents.base import Document

    from kara.integrations.langchain import KARATextSplitter
except ImportError:
    print("LangChain is not installed. This notebook requires LangChain.")
    print("Please install it with: pip install kara-toolkit[langchain]")
    raise  # re-raise the original ImportError with its traceback intact

## Document Preparation


In [10]:
# Original document
original_doc = (
    "LangChain is an open-source framework built in Python that helps developers create "
    "applications powered by large language models (LLMs). It allows seamless integration "
    "between LLMs and external data sources like APIs, files, and databases. With LangChain, "
    "developers can build dynamic workflows where a language model not only generates text but "
    "also interacts with tools and environments. This makes it ideal for creating advanced "
    "chatbots, agents, and AI systems that go beyond static prompting. LangChain provides both "
    "low-level components for custom logic and high-level abstractions for rapid prototyping, "
    "making it a versatile toolkit for AI application development.\n\n"
    "Python is the primary language used with LangChain due to its rich ecosystem and "
    "simplicity. Python's popularity in AI and data science makes it a natural fit for "
    "building with LangChain. Libraries like pydantic, asyncio, and openai integrate smoothly "
    "with LangChain, enabling developers to quickly build robust, scalable applications. "
    "Because LangChain supports modularity, developers can extend it using Python's vast "
    "collection of libraries. Whether you're building an autonomous agent or a document QA "
    "tool, Python and LangChain together offer a powerful combination that lowers the barrier "
    "for building intelligent, interactive systems."
)

print(f"Original document length: {len(original_doc)} characters")

Original document length: 1315 characters


In [11]:
# Updated document (the last two sentences of the first paragraph are replaced)
updated_doc = (
    "LangChain is an open-source framework built in Python that helps developers create "
    "applications powered by large language models (LLMs). It allows seamless integration "
    "between LLMs and external data sources like APIs, files, and databases. With LangChain, "
    "developers can build dynamic workflows where a language model not only generates text but "
    "also interacts with tools and environments. Developers can define step-by-step workflows "
    "in which an LLM can retrieve data, call APIs, and act based on context. This flexibility "
    "allows LangChain to support everything from basic assistants to complex, multi-step "
    "agents capable of reasoning and memory retention.\n\n"
    "Python is the primary language used with LangChain due to its rich ecosystem and "
    "simplicity. Python's popularity in AI and data science makes it a natural fit for "
    "building with LangChain. Libraries like pydantic, asyncio, and openai integrate smoothly "
    "with LangChain, enabling developers to quickly build robust, scalable applications. "
    "Because LangChain supports modularity, developers can extend it using Python's vast "
    "collection of libraries. Whether you're building an autonomous agent or a document QA "
    "tool, Python and LangChain together offer a powerful combination that lowers the barrier "
    "for building intelligent, interactive systems."
)

print(f"Updated document length: {len(updated_doc)} characters")

Updated document length: 1300 characters


## Initialize KARA Text Splitter


In [None]:
# Initialize KARA splitter with LangChain-compatible interface
splitter = KARATextSplitter(
    chunk_size=200,
    imperfect_chunk_tolerance=10,
    separators=[". ", "\n\n"],
)
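
To make the `separators` argument more concrete, here is a minimal sketch of cutting text at any of several separators while keeping each separator attached to the piece before it. It illustrates the general idea only and is not KARA's implementation: `split_on_separators` is a hypothetical helper, and how KARA then packs such pieces into chunks using `chunk_size` and `imperfect_chunk_tolerance` is not shown here.

```python
import re


def split_on_separators(text: str, separators: list[str]) -> list[str]:
    """Cut text after each separator, keeping the separator with the piece before it."""
    # Try longer separators first so overlapping separators match greedily.
    ordered = sorted(separators, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(sep) for sep in ordered) + ")"
    # With one capturing group, re.split alternates: piece, separator, piece, ...
    parts = re.split(pattern, text)
    pieces = []
    for i in range(0, len(parts), 2):
        sep = parts[i + 1] if i + 1 < len(parts) else ""
        piece = parts[i] + sep
        if piece:
            pieces.append(piece)
    return pieces


sample = "First sentence. Second sentence.\n\nNew paragraph. Done."
pieces = split_on_separators(sample, [". ", "\n\n"])
for piece in pieces:
    print(repr(piece))
```

Because every separator stays attached to a piece, joining the pieces reproduces the input exactly, which is a handy property when comparing chunks across document versions.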

## Step 1: Process Original Document


In [14]:
original_chunks = splitter.split_text(original_doc)

print(f"Created {len(original_chunks)} chunks:")
print()
for i, chunk in enumerate(original_chunks, 1):
    print(f"Chunk {i}: `{chunk[:65].strip()}...`")
    print("-" * 80)

Created 9 chunks:

Chunk 1: `LangChain is an open-source framework built in Python that helps...`
--------------------------------------------------------------------------------
Chunk 2: `It allows seamless integration between LLMs and external data sou...`
--------------------------------------------------------------------------------
Chunk 3: `With LangChain, developers can build dynamic workflows where a la...`
--------------------------------------------------------------------------------
Chunk 4: `This makes it ideal for creating advanced chatbots, agents, and A...`
--------------------------------------------------------------------------------
Chunk 5: `LangChain provides both low-level components for custom logic and...`
--------------------------------------------------------------------------------
Chunk 6: `Python is the primary language used with LangChain due to its ric...`
--------------------------------------------------------------------------------
Chunk 7: `Librar

## Step 2: Process Updated Document

Now let's process the updated document and see how many existing chunks KARA reuses. A chunk is marked REUSED when its text exactly matches one of the chunks from Step 1:

In [15]:
updated_chunks = splitter.split_text(updated_doc)

print(f"Result: {len(updated_chunks)} chunks")
print()
for i, chunk in enumerate(updated_chunks, 1):
    # Check if this chunk existed before
    is_reused = chunk in original_chunks
    status = "REUSED" if is_reused else "NEW"
    print(f"Chunk {i} [{status}]: `{chunk[:55].strip()}...`")
    print("-" * 80)

Result: 9 chunks

Chunk 1 [REUSED]: `LangChain is an open-source framework built in Python t...`
--------------------------------------------------------------------------------
Chunk 2 [REUSED]: `It allows seamless integration between LLMs and externa...`
--------------------------------------------------------------------------------
Chunk 3 [REUSED]: `With LangChain, developers can build dynamic workflows...`
--------------------------------------------------------------------------------
Chunk 4 [NEW]: `Developers can define step-by-step workflows in which a...`
--------------------------------------------------------------------------------
Chunk 5 [NEW]: `This flexibility allows LangChain to support everything...`
--------------------------------------------------------------------------------
Chunk 6 [REUSED]: `Python is the primary language used with LangChain due...`
--------------------------------------------------------------------------------
Chunk 7 [REUSED]: `Libraries l
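
One way to see where reused and new chunks sit relative to each other is a sequence diff over the two chunk lists. The sketch below uses `difflib.SequenceMatcher` from the standard library on made-up chunk lists (`old_chunks` and `new_chunks` are illustrative, not the notebook's data). It is a way to inspect results after the fact and makes no claim about how KARA decides which chunks to keep.

```python
from difflib import SequenceMatcher

# Illustrative chunk lists: the middle of the document was rewritten.
old_chunks = ["A. ", "B. ", "C. ", "D. "]
new_chunks = ["A. ", "B. ", "X. ", "Y. ", "D. "]

# autojunk=False disables the popularity heuristic, which is irrelevant for short lists.
matcher = SequenceMatcher(None, old_chunks, new_chunks, autojunk=False)
ops = matcher.get_opcodes()

for tag, i1, i2, j1, j2 in ops:
    print(f"{tag:8} old[{i1}:{i2}] -> new[{j1}:{j2}]")
```

Each opcode is tagged `equal`, `replace`, `insert`, or `delete`; here the edited region shows up as a single `replace` spanning one old chunk and two new ones, with `equal` runs on both sides.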

## Step 3: Efficiency Analysis

Let's calculate the efficiency gain, defined here as the fraction of chunks in the updated document that already existed in the original:

In [41]:
# A chunk counts as reused when its exact text also appears in the original chunks
reused_count = sum(1 for chunk in updated_chunks if chunk in original_chunks)
total_chunks = len(updated_chunks)
efficiency_ratio = reused_count / total_chunks  # fraction in [0, 1]; formatted as a percentage below
new_chunks = total_chunks - reused_count

print("KARA Efficiency Analysis")
print("=" * 40)
print(f"Total chunks in updated document: {total_chunks}")
print(f"Chunks reused from original: {reused_count}")
print()
print(f"Overall efficiency: {reused_count}/{total_chunks} = {efficiency_ratio:.1%}")

KARA Efficiency Analysis
========================================
Total chunks in updated document: 9
Chunks reused from original: 7

Overall efficiency: 7/9 = 77.8%
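
The reuse ratio matters because work attached to a chunk, such as computing an embedding, only has to be repeated for chunks whose text changed. The following is a minimal sketch of that idea using a content-hash cache. Everything in it is an assumption for illustration: `chunk_key`, `fake_embed`, and `embed_with_cache` are hypothetical names, and `fake_embed` stands in for a real embedding call.

```python
import hashlib


def chunk_key(chunk: str) -> str:
    """Stable content hash used as the cache key for a chunk."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()


def fake_embed(chunk: str) -> list[float]:
    """Placeholder for a real (expensive) embedding call."""
    return [float(len(chunk))]


def embed_with_cache(chunks: list[str], cache: dict[str, list[float]]) -> int:
    """Embed only chunks whose content is not cached; return how many were computed."""
    computed = 0
    for chunk in chunks:
        key = chunk_key(chunk)
        if key not in cache:
            cache[key] = fake_embed(chunk)
            computed += 1
    return computed


cache: dict[str, list[float]] = {}
first_pass = embed_with_cache(["A. ", "B. ", "C. "], cache)
second_pass = embed_with_cache(["A. ", "B2. ", "C. ", "D. "], cache)
print(first_pass, second_pass)
```

In this toy run the first pass computes three embeddings and the second only two, since two of its four chunks are unchanged. By the same reasoning, the 7/9 result above would leave two chunks to embed after the update.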


## Step 4: LangChain Document Integration


In [42]:
# Create a LangChain Document with metadata
doc_with_metadata = Document(
    page_content=updated_doc,
    metadata={
        "source": "langchain_guide.txt",
        "version": "2.0",
        "author": "Documentation Team",
        "last_updated": "2024-01-15",
    },
)

print(f"Document metadata: {doc_with_metadata.metadata}")
print(f"Content length: {len(doc_with_metadata.page_content)} chars")

Document metadata: {'source': 'langchain_guide.txt', 'version': '2.0', 'author': 'Documentation Team', 'last_updated': '2024-01-15'}
Content length: 1300 chars


In [43]:
# Split the document using KARA
chunked_docs = splitter.split_documents([doc_with_metadata])

for i, doc in enumerate(chunked_docs[:3], 1):
    print(f"Document Chunk {i}:")
    print(f"Content: '{doc.page_content.strip()[:110]}...'")
    print(f"Metadata: {doc.metadata}")
    print("-" * 125)

Document Chunk 1:
Content: 'LangChain is an open-source framework built in Python that helps developers create applications powered by lar...'
Metadata: {'source': 'langchain_guide.txt', 'version': '2.0', 'author': 'Documentation Team', 'last_updated': '2024-01-15'}
-----------------------------------------------------------------------------------------------------------------------------
Document Chunk 2:
Content: 'It allows seamless integration between LLMs and external data sources like APIs, files, and databases....'
Metadata: {'source': 'langchain_guide.txt', 'version': '2.0', 'author': 'Documentation Team', 'last_updated': '2024-01-15'}
-----------------------------------------------------------------------------------------------------------------------------
Document Chunk 3:
Content: 'With LangChain, developers can build dynamic workflows where a language model not only generates text but also...'
Metadata: {'source': 'langchain_guide.txt', 'version': '2.0', 'author': 'Docume
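
The output above shows each chunk carrying the same metadata as its parent document. A minimal sketch of that propagation pattern follows, assuming each chunk receives its own copy of the parent's metadata. The `Doc` dataclass and `attach_metadata` helper are hypothetical stand-ins so the example runs without LangChain; whether KARA copies metadata in exactly this way is not stated in this notebook.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def attach_metadata(parent: Doc, chunk_texts: list[str]) -> list[Doc]:
    """Wrap each chunk text in a Doc carrying its own copy of the parent's metadata."""
    return [Doc(page_content=text, metadata=dict(parent.metadata)) for text in chunk_texts]


parent = Doc("First. Second.", {"source": "langchain_guide.txt", "version": "2.0"})
chunks = attach_metadata(parent, ["First. ", "Second."])

# Editing one chunk's metadata leaves the parent and sibling chunks untouched.
chunks[0].metadata["chunk_index"] = 0
print(chunks[0].metadata)
print(chunks[1].metadata)
```

Copying the dictionary per chunk avoids the subtle bug where annotating one chunk (for example with an index or a score) silently changes every other chunk from the same document. Note that `dict(...)` is a shallow copy, so nested values would still be shared.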

## Step 5: Multi-Document Processing

Let's demonstrate KARA's capabilities with multiple documents:

In [50]:
# Create multiple documents with different topics
docs_multi = [
    Document(
        page_content=original_doc,
        metadata={"source": "doc1.txt", "topic": "framework", "priority": "high"},
    ),
    Document(
        page_content=(
            "Python's rich ecosystem makes it ideal for AI development. Libraries like numpy, "
            "pandas, and scikit-learn integrate seamlessly with LangChain components."
        ),
        metadata={"source": "doc2.txt", "topic": "python", "priority": "medium"},
    ),
    Document(
        page_content=(
            "Vector databases enable semantic search in RAG applications. They store "
            "embeddings and allow for efficient similarity-based retrieval of context."
        ),
        metadata={"source": "doc3.txt", "topic": "vectors", "priority": "high"},
    ),
]

for i, doc in enumerate(docs_multi, 1):
    topic = doc.metadata.get("topic")
    priority = doc.metadata.get("priority")
    length = len(doc.page_content)
    print(f"   Doc {i}: {topic} ({priority} priority, {length} chars)")

   Doc 1: framework (high priority, 1315 chars)
   Doc 2: python (medium priority, 153 chars)
   Doc 3: vectors (high priority, 145 chars)


In [51]:
# Process multiple documents
multi_chunked_docs = splitter.split_documents(docs_multi)

print(f"Split {len(docs_multi)} documents into {len(multi_chunked_docs)} chunks")
print("\nChunks with their sources:")
print()

for i, doc in enumerate(multi_chunked_docs, 1):
    source = doc.metadata.get("source", "unknown")
    topic = doc.metadata.get("topic", "unknown")
    priority = doc.metadata.get("priority", "unknown")
    content_preview = doc.page_content.strip()[:60]

    print(f"Chunk {i} [{source}/{topic}/{priority}]:")
    print(f"'{content_preview}...'")
    print("-" * 70)

Split 3 documents into 11 chunks

Chunks with their sources:

Chunk 1 [doc1.txt/framework/high]:
'LangChain is an open-source framework built in Python that h...'
----------------------------------------------------------------------
Chunk 2 [doc1.txt/framework/high]:
'It allows seamless integration between LLMs and external dat...'
----------------------------------------------------------------------
Chunk 3 [doc1.txt/framework/high]:
'With LangChain, developers can build dynamic workflows where...'
----------------------------------------------------------------------
Chunk 4 [doc1.txt/framework/high]:
'This makes it ideal for creating advanced chatbots, agents, ...'
----------------------------------------------------------------------
Chunk 5 [doc1.txt/framework/high]:
'LangChain provides both low-level components for custom logi...'
----------------------------------------------------------------------
Chunk 6 [doc1.txt/framework/high]:
'Python is the primary language used with L

## Step 6: Document Updates with Multi-Document Efficiency

Now let's update our document collection and see how KARA handles the changes:

In [55]:
# Update the collection: modify doc1, keep doc2, drop doc3, add doc4
updated_docs_multi = [
    Document(
        page_content=updated_doc,  # Modified content
        metadata={"source": "doc1.txt", "topic": "framework", "priority": "high", "version": "2.0"},
    ),
    docs_multi[1],  # Unchanged Python doc
    Document(
        page_content=(
            "Retrieval-Augmented Generation (RAG) combines the power of large language "
            "models with external knowledge retrieval to provide more accurate and "
            "contextual responses."
        ),
        metadata={"source": "doc4.txt", "topic": "rag", "priority": "high"},
    ),
]

print("Updating with modified and new documents:")
print("   - Document 1: Modified (part of the content replaced)")
print("   - Document 2: Unchanged")
print("   - Document 3: Removed")
print("   - Document 4: New (RAG explanation)")

Updating with modified and new documents:
   - Document 1: Modified (part of the content replaced)
   - Document 2: Unchanged
   - Document 3: Removed
   - Document 4: New (RAG explanation)


In [53]:
# Process updated documents and calculate efficiency
original_texts = [doc.page_content for doc in multi_chunked_docs]
new_chunked_docs = splitter.split_documents(updated_docs_multi)
new_texts = [doc.page_content for doc in new_chunked_docs]

# Calculate reuse statistics
reused_chunks = sum(1 for text in new_texts if text in original_texts)
total_new_chunks = len(new_texts)
multi_efficiency = (reused_chunks / total_new_chunks) * 100

print("Multi-Document Update Results:")
print("=" * 45)
print(f"Original chunks: {len(original_texts)}")
print(f"Updated chunks: {total_new_chunks}")
print(f"Chunks reused: {reused_chunks}")
print(f"New chunks: {total_new_chunks - reused_chunks}")
print("")
print(f"Multi-doc efficiency: {reused_chunks}/{total_new_chunks} = {multi_efficiency:.1f}%")

Multi-Document Update Results:
=============================================
Original chunks: 11
Updated chunks: 11
Chunks reused: 8
New chunks: 3

Multi-doc efficiency: 8/11 = 72.7%


In [54]:
# Show detailed breakdown by document
print("Detailed breakdown:")
print()

# Group chunk texts by their source document
chunks_by_source = {}
for doc in new_chunked_docs:
    source = doc.metadata.get("source", "unknown")
    chunks_by_source.setdefault(source, []).append(doc.page_content)

for source, chunks in chunks_by_source.items():
    reused_in_source = sum(1 for chunk in chunks if chunk in original_texts)
    total_in_source = len(chunks)
    source_efficiency = (reused_in_source / total_in_source) * 100 if total_in_source > 0 else 0

    print(f"{source}:")
    print(f"   Chunks: {total_in_source}")
    print(f"   Reused: {reused_in_source} ({source_efficiency:.1f}%)")
    print()

Detailed breakdown:

doc1.txt:
   Chunks: 9
   Reused: 7 (77.8%)

doc2.txt:
   Chunks: 1
   Reused: 1 (100.0%)

doc4.txt:
   Chunks: 1
   Reused: 0 (0.0%)
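
The statistics above count reused and new chunks, but an update also leaves behind chunks that no longer exist, such as those from the removed Document 3. A downstream index would typically need all three groups. The sketch below shows one way to derive them, assuming chunks are identified by a `(source, text)` pair; `plan_index_update` is a hypothetical helper and the data is made up. This differs from the notebook's comparison, which matches on text alone.

```python
Chunk = tuple[str, str]  # (source, text)


def plan_index_update(old_chunks: list[Chunk], new_chunks: list[Chunk]) -> dict[str, list[Chunk]]:
    """Classify chunks into those to keep, add, and delete, preserving input order."""
    old_set, new_set = set(old_chunks), set(new_chunks)
    return {
        "keep": [c for c in new_chunks if c in old_set],
        "add": [c for c in new_chunks if c not in old_set],
        "delete": [c for c in old_chunks if c not in new_set],
    }


old = [("doc1.txt", "A. "), ("doc1.txt", "B. "), ("doc2.txt", "P. "), ("doc3.txt", "V. ")]
new = [("doc1.txt", "A. "), ("doc1.txt", "B2. "), ("doc2.txt", "P. "), ("doc4.txt", "R. ")]

plan = plan_index_update(old, new)
for action, items in plan.items():
    print(f"{action}: {items}")
```

Keying on `(source, text)` means identical text appearing in two different files is treated as two distinct chunks; keying on text alone, as the cells above do, would count it as reused across files. Which behavior is preferable depends on how the index stores metadata.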

