# 03_embeddings.ipynb  
### Embedding Patent Chunks and Building a Vector Index

This notebook loads precomputed text chunks, encodes them into dense vector embeddings, saves the embeddings and metadata, and builds a nearest neighbors index for similarity search.


## Setup and Imports
Import the required libraries and define the paths for the chunk input file and the embedding output directory.


In [1]:
from pathlib import Path
import json
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

PROJECT_ROOT = Path("..").resolve()
CHUNK_DIR = PROJECT_ROOT / "data" / "processed" / "chunks"
EMB_DIR = PROJECT_ROOT / "embeddings"

EMB_DIR.mkdir(parents=True, exist_ok=True)

CHUNK_FILE = CHUNK_DIR / "patent_chunks.jsonl"
CHUNK_FILE, EMB_DIR


(WindowsPath('C:/Users/sully/RAGPROJ/data/processed/chunks/patent_chunks.jsonl'),
 WindowsPath('C:/Users/sully/RAGPROJ/embeddings'))

## Load Chunk Metadata
Load all chunk records from the JSONL file.  
Each record includes an ID, patent ID, chunk index, and text.


In [2]:
def load_chunks(jsonl_path: Path):
    chunks = []
    with jsonl_path.open("r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            rec = json.loads(line)
            chunks.append(rec)
    print(f"Loaded {len(chunks)} chunks from {jsonl_path.name}")
    return chunks

chunks = load_chunks(CHUNK_FILE)
chunks[:2]

Loaded 2905 chunks from patent_chunks.jsonl


[{'id': 'US10452978_0',
  'patent_id': 'US10452978',
  'chunk_index': 0,
  'text': 'US010452978B2 ( 12 ) United States Patent Shazeer et al . ( 10 ) Patent No . : US 10 , 452 , 978 B2 ( 45 ) Date of Patent : Oct . 22 , 2019 ( 54 ) ATTENTION - BASED SEQUENCE TRANSDUCTION NEURAL NETWORKS ) U . S . Ci . ( 71 ) Applicant : Google LLC , Mountain View , CA ( US ) ( 58 ) Field of Classification Search CPC . . . . . . . . . . . . . . . . . GOON 3 / 08 ( 2013 . 01 ) ; G06N 3 / 04 ( 2013 . 01 ) ; G06N 3 / 0454 ( 2013 . 01 ) CPC USPC . . . . . . . . . . . . . . . . . . . . . . . . . GOOF 3 / 015 . . . . . . 706 / 15 , 45 See application file for complete search history . ( 72 ) Inventors : Noam M . Shazeer , Palo Alto , CA ( US ) ; Aidan Nicholas Gomez , Toronto ( CA ) ; Lukasz Mieczyslaw Kaiser , Mountain View , CA ( US ) ; Jakob D . Uszkoreit , Portola Valley , CA ( US ) ; Llion Owen Jones , San Francisco , CA ( US ) ; Niki J . Parmar , Sunnyvale , CA ( US ) ; Illia Polosukhin , Mountain View ,
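
Later cells read the `id`, `patent_id`, `chunk_index`, and `text` fields of every record, so a quick schema check can catch malformed lines before encoding. The sketch below is not part of the original notebook: `validate_chunks` and `REQUIRED_KEYS` are hypothetical names, and the sample records are toy data.

```python
# Hypothetical helper (not in the original notebook): sanity-check chunk records.
REQUIRED_KEYS = ("id", "patent_id", "chunk_index", "text")

def validate_chunks(records):
    """Return a list of (position, problem) tuples; an empty list means all records passed."""
    problems = []
    seen_ids = set()
    for pos, rec in enumerate(records):
        # A record missing any required key cannot be checked further
        missing = [k for k in REQUIRED_KEYS if k not in rec]
        if missing:
            problems.append((pos, f"missing keys: {missing}"))
            continue
        # IDs are used to label search results, so they should be unique
        if rec["id"] in seen_ids:
            problems.append((pos, f"duplicate id: {rec['id']}"))
        seen_ids.add(rec["id"])
        # Empty text would still be encoded, but the vector would carry no content
        if not isinstance(rec["text"], str) or not rec["text"].strip():
            problems.append((pos, "empty text"))
    return problems

# Toy records for illustration only
sample = [
    {"id": "A_0", "patent_id": "A", "chunk_index": 0, "text": "first chunk"},
    {"id": "A_0", "patent_id": "A", "chunk_index": 1, "text": "second chunk"},
    {"id": "A_2", "patent_id": "A", "chunk_index": 2},
]
print(validate_chunks(sample))
```

In the notebook this could be called as `validate_chunks(chunks)` right after loading; an empty result means every record has the expected fields.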

## Initialize Embedding Model
Load the `all-MiniLM-L6-v2` sentence-transformer model, which maps each text chunk to a 384-dimensional dense vector.


In [3]:
model_name = "all-MiniLM-L6-v2"
embed_model = SentenceTransformer(model_name)

print("Embedding dim:", embed_model.get_sentence_embedding_dimension())

Embedding dim: 384


## Build Embeddings for All Chunks
Encode all chunk texts in batches and stack the results into a single NumPy array.


In [4]:
def build_embeddings(chunks, model, batch_size: int = 64):
    """Encode every chunk's text and return an array of shape (n_chunks, dim)."""
    texts = [c["text"] for c in chunks]

    all_embs = []
    n = len(texts)
    for start in range(0, n, batch_size):
        end = min(start + batch_size, n)
        batch_texts = texts[start:end]
        # Each call encodes exactly one batch, so the progress bar shows 1/1
        embs = model.encode(
            batch_texts,
            batch_size=len(batch_texts),
            show_progress_bar=True,
        )
        all_embs.append(embs)
        print(f"Encoded {end}/{n}")

    # Row i of the stacked array corresponds to chunks[i]
    embeddings = np.vstack(all_embs)
    return embeddings

embeddings = build_embeddings(chunks, embed_model)
embeddings.shape

Batches: 100%|██████████| 1/1 [00:02<00:00,  2.43s/it]
Encoded 64/2905
...
Batches: 100%|██████████| 1/1 [00:01<00:00,  1.61s/it]
Encoded 2905/2905

(2905, 384)
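
The batch boundaries visited by the encoding loop can be checked without loading the model. The sketch below reproduces only the index arithmetic; `batch_bounds` is a hypothetical helper name, and the values 2905 and 64 come from the chunk count and default batch size in this notebook.

```python
# Sketch of the (start, end) pairs that a range(0, n, batch_size) loop visits.
def batch_bounds(n: int, batch_size: int):
    """Yield (start, end) index pairs that cover range(n) in order."""
    for start in range(0, n, batch_size):
        # The last batch is clipped to n, so it may be shorter than batch_size
        yield start, min(start + batch_size, n)

bounds = list(batch_bounds(2905, 64))
print(len(bounds))   # 46 batches
print(bounds[0])     # (0, 64)
print(bounds[-1])    # (2880, 2905): the final batch holds 25 chunks
```

As a design note, `SentenceTransformer.encode` also accepts the full list of texts together with a `batch_size` argument and batches internally, which would give a single progress bar instead of one per batch.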

## Save Embeddings and Metadata
Write the embeddings to a `.npy` file and save the corresponding chunk metadata to a JSONL file.


In [5]:
emb_path = EMB_DIR / "embeddings.npy"
meta_path = EMB_DIR / "chunk_metadata.jsonl"

# Save embeddings
np.save(emb_path, embeddings)
print(f"Saved embeddings to {emb_path}")

# Save metadata (id, patent_id, chunk_index, text)
with meta_path.open("w", encoding="utf-8") as f:
    for rec in chunks:
        f.write(json.dumps(rec) + "\n")
print(f"Saved metadata to {meta_path}")

Saved embeddings to C:\Users\sully\RAGPROJ\embeddings\embeddings.npy
Saved metadata to C:\Users\sully\RAGPROJ\embeddings\chunk_metadata.jsonl
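
Search results are looked up by row position, so the saved array and the metadata file must stay aligned: row `i` of `embeddings.npy` has to correspond to line `i` of `chunk_metadata.jsonl`. The sketch below shows one way a later notebook might reload both files and verify the row counts match. `load_index_inputs` is a hypothetical helper, and the round trip uses toy data in a temporary directory rather than the project files.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

def load_index_inputs(emb_path: Path, meta_path: Path):
    """Load embeddings and metadata, and check that they have the same number of rows."""
    embeddings = np.load(emb_path)
    with meta_path.open("r", encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    if embeddings.shape[0] != len(records):
        raise ValueError(
            f"{embeddings.shape[0]} embedding rows vs {len(records)} metadata records"
        )
    return embeddings, records

# Round trip with toy data (illustration only)
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    toy_emb = np.arange(6, dtype=np.float32).reshape(3, 2)
    toy_meta = [{"id": f"X_{i}", "text": f"chunk {i}"} for i in range(3)]

    # Write the files the same way the cell above does
    np.save(tmp_dir / "embeddings.npy", toy_emb)
    with (tmp_dir / "chunk_metadata.jsonl").open("w", encoding="utf-8") as f:
        for rec in toy_meta:
            f.write(json.dumps(rec) + "\n")

    loaded_emb, loaded_meta = load_index_inputs(
        tmp_dir / "embeddings.npy", tmp_dir / "chunk_metadata.jsonl"
    )

print(loaded_emb.shape, len(loaded_meta))
```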


## Build Nearest Neighbors Index
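Fit a k-nearest neighbors (kNN) index over the embeddings using cosine distance for similarity search. With `metric="cosine"`, scikit-learn's `NearestNeighbors` uses brute-force search, so each query is compared against every stored embedding.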


In [6]:
def build_nn_index(embeddings, n_neighbors: int = 5):
    nn = NearestNeighbors(
        n_neighbors=n_neighbors,
        metric="cosine",
    )
    nn.fit(embeddings)
    return nn

nn_index = build_nn_index(embeddings)
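
To make the cosine metric concrete, the sketch below computes the same kind of ranking by hand with NumPy: normalize the vectors, take dot products, and sort by descending similarity. It is an illustration on toy 2-D vectors, not the notebook's index. `cosine_top_k` is a hypothetical helper and assumes no row has zero norm.

```python
import numpy as np

def cosine_top_k(query_vec, matrix, k=5):
    """Return (indices, similarities) of the k rows of `matrix` most similar to `query_vec`."""
    # After L2 normalization, a dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = m @ q
    # Descending similarity is the same order as ascending cosine distance (1 - sims)
    order = np.argsort(-sims)[:k]
    return order, sims[order]

# Toy vectors (illustration only)
toy = np.array([
    [1.0, 0.0],   # same direction as the query
    [0.0, 1.0],   # orthogonal to the query
    [1.0, 1.0],   # 45 degrees from the query
])
idx, sims = cosine_top_k(np.array([2.0, 0.0]), toy, k=2)
print(idx, sims)
```

The query's length does not affect the ranking: `[2.0, 0.0]` and `[1.0, 0.0]` point in the same direction and so have cosine similarity 1 with the first row.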

## Test Similarity Search
Embed a sample query and retrieve its nearest chunks from the index.  
Print each hit's zero-based rank, patent ID, chunk index, and cosine similarity, followed by the first 400 characters of its text.


In [7]:
def search(query: str, model, nn_index, embeddings, chunks, top_k: int = 5):
    # Note: `embeddings` is not used here; the fitted nn_index already holds the vectors
    # 1. Embed query
    q_emb = model.encode([query])
    
    # 2. Nearest neighbors
    distances, indices = nn_index.kneighbors(q_emb, n_neighbors=top_k)
    
    results = []
    for rank, (idx, dist) in enumerate(zip(indices[0], distances[0])):
        rec = chunks[idx]
        results.append({
            "rank": rank,
            "score": 1 - dist,  # cosine similarity = 1 - cosine distance
            "id": rec["id"],
            "patent_id": rec["patent_id"],
            "chunk_index": rec["chunk_index"],
            "text": rec["text"][:400] + ("..." if len(rec["text"]) > 400 else "")
        })
    return results

query = "transformer-based language model for dialogue generation"
results = search(query, embed_model, nn_index, embeddings, chunks, top_k=3)

for r in results:
    print(f"[{r['rank']}] patent={r['patent_id']} chunk={r['chunk_index']} sim={r['score']:.3f}")
    print(r["text"])
    print("-" * 80)

[0] patent=US20240346254A1 chunk=12 sim=0.577
In this way, the natural language generation system can produce high quality natural language outputs while retaining adaptability and resource efficiency of a small language model. [0026] Various examples, scenarios, and aspects that enable natural language training and/or augmentation with large language models are described below with respect to FIGS. 1-8. [0027] language model 102 can be confi...
--------------------------------------------------------------------------------
[1] patent=US11562147 chunk=3 sim=0.576
into a unified transformer encoder together with a corresponding image caption and multi-turn dialogue history input. The subject technology can initialize the unified transformer encoder with BERT for increased leveraging of the pre-trained language representation. To deeply fuse features from the two modalities, the subject technology make use of two visually-grounded pretraining objectives, suc...
-------------------------