# 02_chunk.ipynb  
### Text Chunking for Patent Documents

This notebook loads cleaned patent text files, splits them into overlapping word-based chunks, and writes the chunks to a JSONL file for downstream embedding and retrieval.


## Setup and Imports
Define project paths and import libraries for loading text files and writing chunk metadata.


In [1]:
from pathlib import Path
import json

PROJECT_ROOT = Path("..").resolve()
TXT_DIR = PROJECT_ROOT / "data" / "processed" / "txt"
CHUNK_DIR = PROJECT_ROOT / "data" / "processed" / "chunks"

CHUNK_DIR.mkdir(parents=True, exist_ok=True)

TXT_DIR, CHUNK_DIR

(WindowsPath('C:/Users/sully/RAGPROJ/data/processed/txt'),
 WindowsPath('C:/Users/sully/RAGPROJ/data/processed/chunks'))

## Load Cleaned Patent Texts
Load all `.txt` files from the processed text directory into a dictionary keyed by patent ID. Empty files are skipped with a warning.


In [2]:
def load_all_patent_texts(txt_dir: Path = TXT_DIR):
    txt_files = sorted(txt_dir.glob("*.txt"))
    print(f"Found {len(txt_files)} text files.")
    
    docs = {}
    for txt_file in txt_files:
        patent_id = txt_file.stem  # e.g. "US11562147"
        text = txt_file.read_text(encoding="utf-8", errors="ignore").strip()
        if not text:
            print(f"WARNING: {txt_file.name} is empty")
            continue
        docs[patent_id] = text
    print(f"Loaded {len(docs)} non-empty documents.")
    return docs

docs = load_all_patent_texts()
list(docs.keys())[:5]

Found 33 text files.
Loaded 33 non-empty documents.


['US10452978', 'US10740433', 'US10902563', 'US11003865', 'US11023715']
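
Because the loader reads files with `errors="ignore"`, any bytes that are not valid UTF-8 are dropped silently. The sketch below is one way to measure how much would be lost in a given file; `count_undecodable_bytes` is a hypothetical helper that is not part of the original notebook, and the temporary sample file exists only for illustration.

```python
from pathlib import Path
import tempfile


def count_undecodable_bytes(path: Path) -> int:
    """Count the bytes that decoding with errors="ignore" would drop (hypothetical helper)."""
    raw = path.read_bytes()
    # Valid UTF-8 round-trips exactly, so the length difference is the dropped bytes.
    kept = raw.decode("utf-8", errors="ignore").encode("utf-8")
    return len(raw) - len(kept)


# Illustration only: a throwaway file containing two invalid bytes.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_bytes(b"claim 1 \xff\xfe text")
    dropped = count_undecodable_bytes(sample)

print(f"Bytes that would be dropped: {dropped}")
```

Running the helper over each file in `TXT_DIR` would show whether any patent text is being truncated by the lenient decoding.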

## Chunking Function
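Split each document into overlapping word-based chunks using a fixed window size and overlap. Words are whitespace-separated tokens as produced by `str.split()`. The overlap must be smaller than the chunk size; otherwise the window would never advance.  
Default: 300-word chunks with 50-word overlap, so each chunk starts 250 words after the previous one.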


In [3]:
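def chunk_text(
    text: str,
    chunk_size_words: int = 300,
    overlap_words: int = 50,
):
    """
    Split text into overlapping word-based chunks.
    Returns a list of chunk strings.
    """
    # Guard against a window that never advances (this would loop forever).
    if chunk_size_words <= 0 or not 0 <= overlap_words < chunk_size_words:
        raise ValueError(
            "Require chunk_size_words > 0 and 0 <= overlap_words < chunk_size_words"
        )

    words = text.split()  # whitespace-separated tokens
    chunks = []

    if not words:
        return chunks

    start = 0
    n = len(words)

    while start < n:
        end = start + chunk_size_words
        # Use a name that does not shadow the function itself.
        chunk = " ".join(words[start:end]).strip()
        if chunk:
            chunks.append(chunk)
        if end >= n:
            break  # the last window reached the end of the document
        start = end - overlap_words  # step back for overlap

    return chunks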
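
As a quick sanity check on the sliding window, the number of chunks can be predicted from the word count alone. The sketch below is not part of the original notebook: `expected_chunk_count` and `sliding_window` are hypothetical helper names, and `sliding_window` is a compact stand-in that mirrors the loop in `chunk_text` so the example runs on its own.

```python
import math


def expected_chunk_count(n_words: int, chunk_size_words: int = 300, overlap_words: int = 50) -> int:
    """Predict how many chunks the sliding window produces (hypothetical helper)."""
    if n_words <= 0:
        return 0
    stride = chunk_size_words - overlap_words  # words advanced per chunk
    return max(1, math.ceil((n_words - overlap_words) / stride))


def sliding_window(words, chunk_size_words=300, overlap_words=50):
    """Compact stand-in for the chunk_text loop, operating on a list of words."""
    chunks, start = [], 0
    while start < len(words):
        end = start + chunk_size_words
        chunks.append(words[start:end])
        if end >= len(words):
            break
        start = end - overlap_words
    return chunks


# Toy example: 10 words, window of 4, overlap of 1 (stride of 3).
toy = [f"w{i}" for i in range(10)]
toy_chunks = sliding_window(toy, chunk_size_words=4, overlap_words=1)
for chunk in toy_chunks:
    print(chunk)
print("Matches prediction:", len(toy_chunks) == expected_chunk_count(10, 4, 1))
```

The prediction follows from the stride: the k-th chunk ends at word `(k - 1) * stride + chunk_size_words`, and the loop stops at the first k for which that position reaches the end of the document.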

## Sample Chunk Preview
Take the first loaded patent and check how it is split: the cell returns the number of chunks and the first 300 characters of the first chunk.


In [4]:
sample_id = next(iter(docs.keys()))
sample_chunks = chunk_text(docs[sample_id])
len(sample_chunks), sample_chunks[0][:300]

(60,
 'US010452978B2 ( 12 ) United States Patent Shazeer et al . ( 10 ) Patent No . : US 10 , 452 , 978 B2 ( 45 ) Date of Patent : Oct . 22 , 2019 ( 54 ) ATTENTION - BASED SEQUENCE TRANSDUCTION NEURAL NETWORKS ) U . S . Ci . ( 71 ) Applicant : Google LLC , Mountain View , CA ( US ) ( 58 ) Field of Classifi')
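
One property worth checking in the preview is that consecutive chunks really share the configured overlap. The following is a minimal sketch, assuming chunks are single-space-joined words as produced by the chunking function; `overlaps_match` is a hypothetical helper that is not part of the original notebook, and the toy chunks are illustrative.

```python
def overlaps_match(chunks, overlap_words=50):
    """Return True if every chunk begins with the last overlap_words words of the previous one."""
    if overlap_words <= 0:
        return True  # nothing to compare when no overlap is configured
    for prev, curr in zip(chunks, chunks[1:]):
        if prev.split()[-overlap_words:] != curr.split()[:overlap_words]:
            return False
    return True


# Toy chunks built from the words a..h with a window of 4 and an overlap of 1.
toy_chunks = ["a b c d", "d e f g", "g h"]
print(overlaps_match(toy_chunks, overlap_words=1))

# A sequence with a broken overlap is reported as a mismatch.
print(overlaps_match(["a b c d", "e f g h"], overlap_words=1))
```

In the notebook, `overlaps_match(sample_chunks)` would be expected to return `True` with the default 50-word overlap.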

## Create Chunks for All Documents
Apply the chunking function to every loaded patent and write the results to a JSONL file.  
Each record includes an ID, patent ID, chunk index, and chunk text.


In [5]:
def create_all_chunks(
    docs: dict,
    out_dir: Path = CHUNK_DIR,
    chunk_size_words: int = 300,
    overlap_words: int = 50,
    out_filename: str = "patent_chunks.jsonl",
):
    out_path = out_dir / out_filename
    
    total_chunks = 0
    with out_path.open("w", encoding="utf-8") as f:
        for patent_id, text in docs.items():
            chunks = chunk_text(
                text,
                chunk_size_words=chunk_size_words,
                overlap_words=overlap_words,
            )
            print(f"{patent_id}: {len(chunks)} chunks")
            
            for i, chunk in enumerate(chunks):
                rec = {
                    "id": f"{patent_id}_{i}",
                    "patent_id": patent_id,
                    "chunk_index": i,
                    "text": chunk,
                }
                f.write(json.dumps(rec) + "\n")
                total_chunks += 1
    
    print(f"\n✅ Wrote {total_chunks} chunks to {out_path}")
    return out_path

chunks_path = create_all_chunks(docs)
chunks_path

US10452978: 60 chunks
US10740433: 46 chunks
US10902563: 67 chunks
US11003865: 76 chunks
US11023715: 164 chunks
US11238332: 61 chunks
US11295552: 117 chunks
US11328398: 19 chunks
US11562147: 44 chunks
US11636570: 103 chunks
US11749857: 2 chunks
US11900261: 53 chunks
US11921824: 51 chunks
US11961514: 129 chunks
US11989527: 233 chunks
US11991338: 30 chunks
US12148421: 125 chunks
US12182506: 73 chunks
US12217382: 41 chunks
US12271791B2: 54 chunks
US12282696B2: 301 chunks
US20210183484A1: 70 chunks
US20220101113A1: 256 chunks
US20230252224A1: 165 chunks
US20240185001A1: 84 chunks
US20240256792A1: 91 chunks
US20240346254A1: 48 chunks
US8332207: 44 chunks
US8812291: 44 chunks
US9037464: 36 chunks
US9740680: 47 chunks
US_12380282_B2: 113 chunks
US_12417081_B2: 58 chunks

✅ Wrote 2905 chunks to C:\Users\sully\RAGPROJ\data\processed\chunks\patent_chunks.jsonl


WindowsPath('C:/Users/sully/RAGPROJ/data/processed/chunks/patent_chunks.jsonl')

## Inspect Sample Chunk Records
Read the first five records from the JSONL file to confirm that the chunk structure and metadata look correct.


In [6]:
sample_lines = []

with chunks_path.open("r", encoding="utf-8") as f:
    for _ in range(5):
        line = f.readline()
        if not line:
            break
        sample_lines.append(json.loads(line))

sample_lines

[{'id': 'US10452978_0',
  'patent_id': 'US10452978',
  'chunk_index': 0,
  'text': 'US010452978B2 ( 12 ) United States Patent Shazeer et al . ( 10 ) Patent No . : US 10 , 452 , 978 B2 ( 45 ) Date of Patent : Oct . 22 , 2019 ( 54 ) ATTENTION - BASED SEQUENCE TRANSDUCTION NEURAL NETWORKS ) U . S . Ci . ( 71 ) Applicant : Google LLC , Mountain View , CA ( US ) ( 58 ) Field of Classification Search CPC . . . . . . . . . . . . . . . . . GOON 3 / 08 ( 2013 . 01 ) ; G06N 3 / 04 ( 2013 . 01 ) ; G06N 3 / 0454 ( 2013 . 01 ) CPC USPC . . . . . . . . . . . . . . . . . . . . . . . . . GOOF 3 / 015 . . . . . . 706 / 15 , 45 See application file for complete search history . ( 72 ) Inventors : Noam M . Shazeer , Palo Alto , CA ( US ) ; Aidan Nicholas Gomez , Toronto ( CA ) ; Lukasz Mieczyslaw Kaiser , Mountain View , CA ( US ) ; Jakob D . Uszkoreit , Portola Valley , CA ( US ) ; Llion Owen Jones , San Francisco , CA ( US ) ; Niki J . Parmar , Sunnyvale , CA ( US ) ; Illia Polosukhin , Mountain View ,
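
Beyond eyeballing a few records, the same structure can be checked programmatically. The sketch below assumes the four fields written by `create_all_chunks` (`id`, `patent_id`, `chunk_index`, `text`); `REQUIRED_KEYS` and `validate_chunk_record` are hypothetical names that are not part of the original notebook, and the two sample lines are illustrative stand-ins with shortened text.

```python
import json

REQUIRED_KEYS = {"id", "patent_id", "chunk_index", "text"}


def validate_chunk_record(rec: dict) -> list:
    """Return a list of problems found in one chunk record (empty if none)."""
    missing = REQUIRED_KEYS - rec.keys()
    if missing:
        return [f"missing keys: {sorted(missing)}"]
    problems = []
    # The id is expected to be "<patent_id>_<chunk_index>".
    if rec["id"] != f"{rec['patent_id']}_{rec['chunk_index']}":
        problems.append("id does not match patent_id and chunk_index")
    if not rec["text"].strip():
        problems.append("empty text")
    return problems


# Illustrative lines in the same shape as patent_chunks.jsonl (text shortened).
lines = [
    '{"id": "US10452978_0", "patent_id": "US10452978", "chunk_index": 0, "text": "example text"}',
    '{"id": "US10452978_1", "patent_id": "US10452978", "chunk_index": 2, "text": "example text"}',
]
for line in lines:
    print(validate_chunk_record(json.loads(line)))
```

Applied to every line of the real file, an empty list for each record would indicate that the metadata is internally consistent.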

## Summary Statistics
Compute the number of documents and total number of chunks created.  
These values are useful for documentation and the model card.


In [7]:
n_docs = len(docs)
print("Documents:", n_docs)

# Count chunks by re-reading the JSONL file written above.
with chunks_path.open("r", encoding="utf-8") as f:
    all_chunks = [json.loads(line) for line in f]
print("Total chunks:", len(all_chunks))

Documents: 33
Total chunks: 2905
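
For the run above, 2905 chunks over 33 documents works out to roughly 88 chunks per document. If a per-document breakdown is also wanted for the model card, it could be computed along these lines. This is a minimal sketch, not part of the original notebook: `chunk_count_summary` is a hypothetical helper, and the toy records stand in for the parsed JSONL records.

```python
from collections import Counter
from statistics import median


def chunk_count_summary(records):
    """Summarize how many chunks each patent contributed (hypothetical helper)."""
    counts = Counter(rec["patent_id"] for rec in records)
    values = list(counts.values())
    if not values:
        return {"documents": 0, "total_chunks": 0}
    return {
        "documents": len(values),
        "total_chunks": sum(values),
        "min_chunks": min(values),
        "max_chunks": max(values),
        "mean_chunks": round(sum(values) / len(values), 1),
        "median_chunks": median(values),
    }


# Toy records: patent "A" has three chunks, patent "B" has one.
records = [
    {"patent_id": "A"},
    {"patent_id": "A"},
    {"patent_id": "A"},
    {"patent_id": "B"},
]
summary = chunk_count_summary(records)
print(summary)
```

In the notebook, passing `all_chunks` to the helper would give the same totals as the cell above plus the spread across patents.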
