# Part 2 — Text Splitting
### Techniques for creating processable text chunks for LLM pipelines

This notebook explains why text splitting is essential for LLM applications and demonstrates how to split documents into meaningful segments using LangChain utilities.

## Learning Guide

**What you will learn**
- Why text splitting is essential for LLM pipelines and RAG systems.
- How different LangChain text splitters work and when to choose them.
- Hands-on examples splitting a sample contract-like text and inspecting chunk sizes and overlaps.

**Why it matters**
- Proper chunking prevents token overflow, preserves clause boundaries in legal documents, and improves retrieval relevance.

**How it fits into an AI/LLM course**
- Foundational step in ingestion pipelines used before embedding, indexing, or LLM prompting.

**Hands-on steps you'll perform**
1. Install dependencies (LangChain).
2. Load a sample contract text and run four splitters.
3. Inspect chunks and evaluate which splitter best preserves legal clauses.
4. Save recommended configs for production ingestion.


In [1]:
# API key input (kept intentionally simple)
from secrete_key import my_gemini_api_key
API_KEY = my_gemini_api_key()

# NOTE: Replace `secrete_key` / function as appropriate in your environment.
print("API_KEY loaded (hidden).")

API_KEY loaded (hidden).



This section demonstrates LangChain splitters and why they matter for contract documents where clause integrity is important.


In [2]:
!pip install langchain-text-splitters tiktoken langchain_google_genai





In [3]:
# Install or upgrade LangChain in Colab / a local environment if needed
# (comment out the following line once LangChain is installed)
!pip install --upgrade langchain
print('Ensure langchain is installed in your environment.')

Ensure langchain is installed in your environment.


In [16]:
# Sample contract-like text for demonstration
sample_contract = """
SERVICE AGREEMENT

This Service Agreement ("Agreement") is made as of January 1, 2024, by and between Alpha Corp ("Provider") and Beta LLC ("Client").
1. Services. Provider shall provide software development services (the "Services") described in Schedule A.
2. Term. The term of this Agreement begins on the Effective Date and continues for twelve (12) months, unless earlier terminated.
3. Payment. Client will pay Provider the fees set forth in Schedule B within thirty (30) days of invoice receipt.
4. Confidentiality. Each party shall maintain confidential information in strict confidence and not disclose it to third parties.
5. Liability. Neither party shall be liable for indirect or consequential damages.
6. Termination. Either party may terminate upon thirty (30) days notice for material breach.
7. Governing Law. This Agreement shall be governed by the laws of the State of Delaware.
8. Miscellaneous. This Agreement constitutes the entire agreement between the parties and supersedes prior discussions.
"""
print('Loaded sample contract ({} chars)'.format(len(sample_contract)))


Loaded sample contract (1019 chars)


## RecursiveCharacterTextSplitter

In [5]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""]
)
chunks = splitter.split_text(sample_contract)
print("RecursiveCharacterTextSplitter -> num chunks:", len(chunks))
for i,c in enumerate(chunks):
    print(f"--- chunk {i+1} (len={len(c)}) ---\n{c[:1000]}\n")

RecursiveCharacterTextSplitter -> num chunks: 4
--- chunk 1 (len=17) ---
SERVICE AGREEMENT

--- chunk 2 (len=369) ---
This Service Agreement ("Agreement") is made as of January 1, 2024, by and between Alpha Corp ("Provider") and Beta LLC ("Client").
1. Services. Provider shall provide software development services (the "Services") described in Schedule A.
2. Term. The term of this Agreement begins on the Effective Date and continues for twelve (12) months, unless earlier terminated.

--- chunk 3 (len=326) ---
3. Payment. Client will pay Provider the fees set forth in Schedule B within thirty (30) days of invoice receipt.
4. Confidentiality. Each party shall maintain confidential information in strict confidence and not disclose it to third parties.
5. Liability. Neither party shall be liable for indirect or consequential damages.

--- chunk 4 (len=301) ---
6. Termination. Either party may terminate upon thirty (30) days notice for material breach.
7. Governing Law. This Agreement shall

## CharacterTextSplitter

In [6]:
from langchain_text_splitters import CharacterTextSplitter

char_splitter = CharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=40
)

char_chunks = char_splitter.split_text(sample_contract)

print("CharacterTextSplitter -> num chunks:", len(char_chunks))
for i,c in enumerate(char_chunks):
    print(f"--- chunk {i+1} (len={len(c)}) ---\n{c[:1000]}\n")

CharacterTextSplitter -> num chunks: 2
--- chunk 1 (len=17) ---
SERVICE AGREEMENT

--- chunk 2 (len=998) ---
This Service Agreement ("Agreement") is made as of January 1, 2024, by and between Alpha Corp ("Provider") and Beta LLC ("Client").
1. Services. Provider shall provide software development services (the "Services") described in Schedule A.
2. Term. The term of this Agreement begins on the Effective Date and continues for twelve (12) months, unless earlier terminated.
3. Payment. Client will pay Provider the fees set forth in Schedule B within thirty (30) days of invoice receipt.
4. Confidentiality. Each party shall maintain confidential information in strict confidence and not disclose it to third parties.
5. Liability. Neither party shall be liable for indirect or consequential damages.
6. Termination. Either party may terminate upon thirty (30) days notice for material breach.
7. Governing Law. This Agreement shall be governed by the laws of the State of Delaware.
8. Miscellan
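
The second chunk above is 998 characters long even though `chunk_size=300`. `CharacterTextSplitter` splits on a single separator, which defaults to `"\n\n"`, and the sample contract contains only one blank line (after the title), so everything below it arrives as one indivisible piece. The following is a minimal stdlib-only sketch of that greedy merge behaviour, not LangChain's actual implementation; `split_on_separator` and the `demo` text are hypothetical names introduced for illustration.

```python
def split_on_separator(text, separator, chunk_size, chunk_overlap=0):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters."""
    pieces = [p for p in text.strip().split(separator) if p]
    chunks, current = [], []
    for piece in pieces:
        if current and len(separator.join(current + [piece])) > chunk_size:
            chunks.append(separator.join(current))
            # Keep trailing pieces as overlap only while they fit the overlap budget.
            while current and len(separator.join(current)) > chunk_overlap:
                current.pop(0)
            # Drop more pieces if the carried overlap would still overflow the next chunk.
            while current and len(separator.join(current + [piece])) > chunk_size:
                current.pop(0)
        # A piece longer than chunk_size is kept whole: no smaller separator to fall back on.
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


# Hypothetical stand-in for the contract: a title, one blank line, then eight 98-character clauses.
demo = "TITLE\n\n" + "\n".join(f"{n}. " + "x" * 95 for n in range(1, 9))

blank_line_chunks = split_on_separator(demo, "\n\n", chunk_size=300, chunk_overlap=40)
newline_chunks = split_on_separator(demo, "\n", chunk_size=300, chunk_overlap=40)

print([len(c) for c in blank_line_chunks])  # [5, 791]  (second chunk exceeds chunk_size)
print([len(c) for c in newline_chunks])     # [203, 296, 296]
```

With the real class, passing `separator="\n"` should give a comparable clause-level split. The sketch also suggests why no overlap appears when every piece is longer than `chunk_overlap`: a whole clause never fits into the overlap budget.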

In [7]:
pip install --upgrade langchain-text-splitters

Note: you may need to restart the kernel to use updated packages.


## TokenTextSplitter

In [17]:
from langchain_text_splitters import TokenTextSplitter

token_splitter = TokenTextSplitter(
    chunk_size=200,
    chunk_overlap=30
)

token_chunks = token_splitter.split_text(sample_contract)

print("TokenTextSplitter -> num chunks:", len(token_chunks))
for i, c in enumerate(token_chunks):
    print(f"--- chunk {i+1} (len={len(c)}) ---\n{c[:200]}\n")


TokenTextSplitter -> num chunks: 2
--- chunk 1 (len=915) ---

SERVICE AGREEMENT

This Service Agreement ("Agreement") is made as of January 1, 2024, by and between Alpha Corp ("Provider") and Beta LLC ("Client").
1. Services. Provider shall provide software dev

--- chunk 2 (len=238) ---
 notice for material breach.
7. Governing Law. This Agreement shall be governed by the laws of the State of Delaware.
8. Miscellaneous. This Agreement constitutes the entire agreement between the part
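
Chunk 2 above starts mid-sentence (` notice for material breach.`) because token-based splitting slides a fixed-size window over the token sequence and ignores clause boundaries. Below is a minimal sketch of that windowing, assuming whitespace-separated words as a stand-in for real tokenizer tokens; `sliding_token_windows` is a hypothetical helper, not a LangChain function.

```python
def sliding_token_windows(tokens, chunk_size, chunk_overlap):
    """Return consecutive windows of chunk_size tokens that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the sequence
        start += step
    return windows


# Assumption: words stand in for tokens; a real tokenizer would produce sub-word units.
words = "Either party may terminate upon thirty days notice for material breach".split()
windows = sliding_token_windows(words, chunk_size=5, chunk_overlap=2)

for window in windows:
    print(" ".join(window))
# Either party may terminate upon
# terminate upon thirty days notice
# days notice for material breach
```

Each window repeats the last two tokens of the previous one, which is the role `chunk_overlap=30` plays in the cell above, just at a much smaller scale.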



## SemanticChunker

In [12]:
pip install langchain-community langchain-experimental

Collecting langchain-experimental
  Downloading langchain_experimental-0.4.0-py3-none-any.whl.metadata (1.3 kB)
Downloading langchain_experimental-0.4.0-py3-none-any.whl (209 kB)
Installing collected packages: langchain-experimental
Successfully installed langchain-experimental-0.4.0
Note: you may need to restart the kernel to use updated packages.


### Semantic chunking: `from langchain_experimental.text_splitter import SemanticChunker`

In [None]:
pip install sentence-transformers langchain-huggingface

Collecting sentence-transformers
  Using cached sentence_transformers-5.1.2-py3-none-any.whl.metadata (16 kB)
Collecting langchain-huggingface
  Downloading langchain_huggingface-1.1.0-py3-none-any.whl.metadata (2.8 kB)
Collecting transformers<5.0.0,>=4.41.0 (from sentence-transformers)
  Using cached transformers-4.57.3-py3-none-any.whl.metadata (43 kB)
Collecting tqdm (from sentence-transformers)
  Using cached tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting torch>=1.11.0 (from sentence-transformers)
  Downloading torch-2.9.1-cp312-cp312-win_amd64.whl.metadata (30 kB)
Collecting scikit-learn (from sentence-transformers)
  Using cached scikit_learn-1.7.2-cp312-cp312-win_amd64.whl.metadata (11 kB)
Collecting scipy (from sentence-transformers)
  Using cached scipy-1.16.3-cp312-cp312-win_amd64.whl.metadata (60 kB)
Collecting huggingface-hub>=0.20.0 (from sentence-transformers)
  Downloading huggingface_hub-1.2.1-py3-none-any.whl.metadata (13 kB)
Collecting Pillow (from sentence-

In [32]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface import HuggingFaceEmbeddings

# ✅ Load a lightweight, FREE, open-source embedding model (runs on CPU)
embeddings = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2",  # or "BAAI/bge-small-en-v1.5"
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True}
)

# Create semantic splitter
semantic_splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=85  # adjust: higher = fewer chunks
)

# Your contract text
sample_contract = """
SERVICE AGREEMENT

This Service Agreement ("Agreement") is made as of January 1, 2024, by and between Alpha Corp ("Provider") and Beta LLC ("Client").
1. Services. Provider shall provide software development services (the "Services") described in Schedule A.
2. Term. The term of this Agreement begins on the Effective Date and continues for twelve (12) months, unless earlier terminated.
3. Payment. Client will pay Provider the fees set forth in Schedule B within thirty (30) days of invoice receipt.
4. Confidentiality. Each party shall maintain confidential information in strict confidence and not disclose it to third parties.
5. Liability. Neither party shall be liable for indirect or consequential damages.
6. Termination. Either party may terminate upon thirty (30) days notice for material breach.
7. Governing Law. This Agreement shall be governed by the laws of the State of Delaware.
8. Miscellaneous. This Agreement constitutes the entire agreement between the parties and supersedes prior discussions.

"""

# ✅ Semantic chunking runs locally once the embedding model has been downloaded: no API key, no per-call cost
chunks = semantic_splitter.split_text(sample_contract)

print("\n=== Semantic Chunks (100% Local) ===")
for i, chunk in enumerate(chunks, 1):
    print(f"\n--- Chunk {i} ---\n{chunk.strip()}")

OSError: [WinError 1114] A dynamic link library (DLL) initialization routine failed. Error loading "c:\Users\Selvam Sabarish\Desktop\sabs\AI\venv\Lib\site-packages\torch\lib\c10.dll" or one of its dependencies.
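
The comment on `breakpoint_threshold_amount=85` says that a higher value yields fewer chunks. The sketch below illustrates why under the `"percentile"` setting: a break is placed only where the distance between neighbouring sentences exceeds the chosen percentile of all distances. It is a simplified stdlib illustration, not the library's code; the distances are made-up numbers rather than real embedding distances, and `percentile` and `group_sentences` are hypothetical helpers.

```python
def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * q / 100
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def group_sentences(sentences, distances, threshold_percentile):
    """Group sentences, breaking where distances[i] (between sentence i and i+1) is above the threshold."""
    threshold = percentile(distances, threshold_percentile)
    groups, current = [], [sentences[0]]
    for sentence, gap in zip(sentences[1:], distances):
        if gap > threshold:
            groups.append(current)
            current = []
        current.append(sentence)
    groups.append(current)
    return groups


# Made-up distances between six consecutive sentences (illustrative values only).
sentences = ["s1", "s2", "s3", "s4", "s5", "s6"]
distances = [0.10, 0.12, 0.80, 0.15, 0.60]

groups_85 = group_sentences(sentences, distances, 85)
groups_50 = group_sentences(sentences, distances, 50)

print(groups_85)  # [['s1', 's2', 's3'], ['s4', 's5', 's6']]
print(groups_50)  # [['s1', 's2', 's3'], ['s4', 's5'], ['s6']]
```

Raising the percentile from 50 to 85 leaves only the largest gap above the threshold, so the six sentences fall into two groups instead of three.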

## Top 3 Practical Splitters for Real-World Use (Contracts focus)

1. **RecursiveCharacterTextSplitter** — *Top recommendation for contracts and PDFs.*
   - Tries natural separators first (paragraphs, newlines) so clauses remain intact.
   - Configurable separators and overlap to avoid cutting mid-clause.
   - Works well when source has structured text and paragraphs.

2. **TokenTextSplitter** — *Critical when strict token budgets or exact token counts matter.*
   - Splits by tokens (not characters), avoiding LLM input overflow.
   - Recommended when you must precisely control tokens sent to the model (e.g., costly APIs).

3. **MarkdownHeaderTextSplitter** — *Excellent for docs with clear header structure.*
   - Preserves header sections and is ideal for README, Notion-style docs, or structured legal notes with headings.
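
The sample contract is not Markdown, so header-based splitting is not demonstrated elsewhere in this notebook. As a rough illustration of the idea (one chunk per header section, with the header path kept as metadata), here is a stdlib-only sketch. It is not `MarkdownHeaderTextSplitter` itself; `split_by_markdown_headers` and the `notes` text are hypothetical.

```python
import re


def split_by_markdown_headers(markdown_text, max_level=3):
    """Split text into sections, recording the active header at each level as metadata."""
    header_re = re.compile(r"^(#{1,%d})\s+(.*)$" % max_level)
    sections = []
    path = {}   # header level -> current title at that level
    body = []   # lines collected since the last header

    def flush():
        content = "\n".join(body).strip()
        if content:
            sections.append({"headers": dict(path), "content": content})
        body.clear()

    for line in markdown_text.splitlines():
        match = header_re.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header closes any header at the same or a deeper level.
            for stale in [k for k in path if k >= level]:
                del path[stale]
            path[level] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return sections


# Hypothetical Markdown-formatted legal notes.
notes = """# Service Agreement
Intro text.
## Payment
Fees are due within thirty days.
## Termination
Either party may terminate for material breach.
"""

sections = split_by_markdown_headers(notes)
for section in sections:
    print(section["headers"], "->", section["content"])
```

Each section carries its full header path, so a retrieved chunk can still be traced back to the heading it came from.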

**When CharacterTextSplitter is useful**: it is a simple, fast baseline, but it can split legal clauses in awkward places.


## Practical tips to avoid broken clauses/context in contracts

- Use **RecursiveCharacterTextSplitter** with separators prioritizing paragraph/newline boundaries.
- Choose `chunk_size` large enough to hold full clauses (e.g., 400-800 characters) and set `chunk_overlap` to 50-150 characters to retain context across clause boundaries.
- When in doubt, inspect the first 10 chunks to confirm that no clause was cut mid-sentence.
- If your LLM requires precise token control, post-process chunks with TokenTextSplitter to enforce token limits.
- Consider a hybrid approach: split recursively first (structure-aware), then run a token-safety pass.
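
The manual inspection step above can be partly automated. The sketch below flags chunks among the first few that do not end in sentence-final punctuation; it is a heuristic under simple assumptions (English punctuation, optional closing quotes or brackets), and `find_cut_chunks` is a hypothetical helper rather than part of LangChain.

```python
import re

# A chunk "looks complete" if it ends with . ! or ?, optionally followed by closing quotes/brackets.
SENTENCE_END = re.compile(r"[.!?][\"')\]]*$")


def find_cut_chunks(chunks, limit=10):
    """Return (index, tail) for each of the first `limit` chunks that seems to end mid-sentence."""
    suspects = []
    for index, chunk in enumerate(chunks[:limit]):
        text = chunk.rstrip()
        if text and not SENTENCE_END.search(text):
            suspects.append((index, text[-40:]))
    return suspects


# Example chunks modelled on the contract text used in this notebook.
example_chunks = [
    "5. Liability. Neither party shall be liable for indirect or consequential damages.",
    "7. Governing Law. This Agreement shall",
    'made by and between Alpha Corp ("Provider") and Beta LLC ("Client").',
]

suspects = find_cut_chunks(example_chunks)
print(suspects)  # [(1, '7. Governing Law. This Agreement shall')]
```

Headings such as `SERVICE AGREEMENT` would also be flagged because they carry no final punctuation, so treat the result as a list of chunks to review by eye, not as a pass/fail check.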

In [None]:
# Appendix: quick reference functions to get splitter objects
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    MarkdownHeaderTextSplitter,
    TokenTextSplitter
)

def make_recursive_splitter(chunk_size=400, chunk_overlap=50):
    return RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap,
                                         separators=["\n\n", "\n", " ", ""])

def make_token_splitter(chunk_size=500, chunk_overlap=50):
    return TokenTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

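def make_markdown_header_splitter():
    # MarkdownHeaderTextSplitter splits on headers and takes no chunk_size/chunk_overlap;
    # each tuple is (header marker, metadata key).
    return MarkdownHeaderTextSplitter(
        headers_to_split_on=[("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
    )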

print("Appendix splitters ready.")

---
**Saved notebook:** This notebook contains runnable code and explanatory markdown. Use it as a starting point for creating ingestion pipelines.
