# 04_rag_qa.ipynb  
### Retrieval-Augmented Question Answering Over Patent Corpus

This notebook loads the embeddings and metadata created earlier, rebuilds a nearest neighbor index, and performs Retrieval-Augmented Generation (RAG) using an OpenAI model. The notebook concludes with a manual evaluation of eight test questions.


## Load Embeddings and Metadata

We load two files created in notebook 03:

- `embeddings.npy`: dense vector representations of all chunks  
- `chunk_metadata.jsonl`: patent IDs, chunk indices, and the text of each chunk  

These are used to rebuild the retrieval system.


In [1]:
from pathlib import Path
import json
import numpy as np
from sklearn.neighbors import NearestNeighbors

from openai import OpenAI  # OpenAI Python SDK

PROJECT_ROOT = Path("..").resolve()
EMB_DIR = PROJECT_ROOT / "embeddings"

EMB_PATH = EMB_DIR / "embeddings.npy"
META_PATH = EMB_DIR / "chunk_metadata.jsonl"

EMB_PATH, META_PATH

(WindowsPath('C:/Users/sully/RAGPROJ/embeddings/embeddings.npy'),
 WindowsPath('C:/Users/sully/RAGPROJ/embeddings/chunk_metadata.jsonl'))

## Load Embeddings and Metadata

We load the embedding matrix (`embeddings.npy`) and the chunk metadata (`chunk_metadata.jsonl`) created in Notebook 03.  
These files contain:

- Vector embeddings for each chunk  
- Patent ID and chunk text for each embedding  

This step prepares the retrieval index.


In [2]:
def load_embeddings_and_metadata(emb_path: Path, meta_path: Path):
    embeddings = np.load(emb_path)
    chunks = []
    with meta_path.open("r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            chunks.append(json.loads(line))
    print(f"Loaded embeddings: {embeddings.shape}")
    print(f"Loaded metadata records: {len(chunks)}")
    return embeddings, chunks

embeddings, chunks = load_embeddings_and_metadata(EMB_PATH, META_PATH)
chunks[:2]

Loaded embeddings: (2905, 384)
Loaded metadata records: 2905


[{'id': 'US10452978_0',
  'patent_id': 'US10452978',
  'chunk_index': 0,
  'text': 'US010452978B2 ( 12 ) United States Patent Shazeer et al . ( 10 ) Patent No . : US 10 , 452 , 978 B2 ( 45 ) Date of Patent : Oct . 22 , 2019 ( 54 ) ATTENTION - BASED SEQUENCE TRANSDUCTION NEURAL NETWORKS ) U . S . Ci . ( 71 ) Applicant : Google LLC , Mountain View , CA ( US ) ( 58 ) Field of Classification Search CPC . . . . . . . . . . . . . . . . . GOON 3 / 08 ( 2013 . 01 ) ; G06N 3 / 04 ( 2013 . 01 ) ; G06N 3 / 0454 ( 2013 . 01 ) CPC USPC . . . . . . . . . . . . . . . . . . . . . . . . . GOOF 3 / 015 . . . . . . 706 / 15 , 45 See application file for complete search history . ( 72 ) Inventors : Noam M . Shazeer , Palo Alto , CA ( US ) ; Aidan Nicholas Gomez , Toronto ( CA ) ; Lukasz Mieczyslaw Kaiser , Mountain View , CA ( US ) ; Jakob D . Uszkoreit , Portola Valley , CA ( US ) ; Llion Owen Jones , San Francisco , CA ( US ) ; Niki J . Parmar , Sunnyvale , CA ( US ) ; Illia Polosukhin , Mountain View ,
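
Retrieval assumes that row `i` of the embedding matrix corresponds to record `i` of the metadata file. The sketch below is not part of the original notebook: `check_alignment` is a hypothetical helper, and the toy arrays and records stand in for the real files. It shows one way such a consistency check could look before the index is built.

```python
import numpy as np

def check_alignment(embeddings, chunks):
    """Hypothetical helper: verify there is one metadata record per embedding row."""
    if embeddings.ndim != 2:
        raise ValueError(f"expected a 2-D matrix, got shape {embeddings.shape}")
    if embeddings.shape[0] != len(chunks):
        raise ValueError(
            f"{embeddings.shape[0]} embedding rows vs {len(chunks)} metadata records"
        )
    # Keys that the search function reads from each record
    required = {"id", "patent_id", "chunk_index", "text"}
    for i, rec in enumerate(chunks):
        missing = required - set(rec)
        if missing:
            raise ValueError(f"record {i} is missing keys: {sorted(missing)}")
    return True

# Toy stand-ins for the real files (made-up IDs and text)
toy_embeddings = np.zeros((2, 384), dtype=np.float32)
toy_chunks = [
    {"id": "P1_0", "patent_id": "P1", "chunk_index": 0, "text": "first chunk"},
    {"id": "P1_1", "patent_id": "P1", "chunk_index": 1, "text": "second chunk"},
]
aligned = check_alignment(toy_embeddings, toy_chunks)
```

In the notebook itself the call would presumably be `check_alignment(embeddings, chunks)` right after loading.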

## Build Nearest Neighbor Index

Using the embeddings, we build a `sklearn.neighbors.NearestNeighbors` index with cosine distance.  
This index allows fast retrieval of the most relevant chunks for any query.


In [3]:
def build_nn_index(embeddings: np.ndarray, n_neighbors: int = 5):
    nn = NearestNeighbors(
        n_neighbors=n_neighbors,
        metric="cosine"
    )
    nn.fit(embeddings)
    return nn

nn_index = build_nn_index(embeddings, n_neighbors=5)

## Search Function

The `search_chunks` function embeds a query, retrieves the nearest chunks from the index, and returns:

- Rank  
- Similarity score  
- Patent ID  
- Chunk index  
- Full text of the chunk  

This is the retrieval component of the RAG system.


In [4]:
def search_chunks(
    query: str,
    embed_model,          # SentenceTransformer model from 03, or reload it
    nn_index,
    embeddings,
    chunks,
    top_k: int = 5,
):
    # embed query
    q_emb = embed_model.encode([query])
    
    # retrieve
    distances, indices = nn_index.kneighbors(q_emb, n_neighbors=top_k)
    
    results = []
    for rank, (idx, dist) in enumerate(zip(indices[0], distances[0])):
        rec = chunks[idx]
        results.append({
            "rank": rank,  # 0-based: rank 0 is the closest chunk
            "score": 1 - float(dist),  # cosine similarity = 1 - cosine distance
            "id": rec["id"],
            "patent_id": rec["patent_id"],
            "chunk_index": rec["chunk_index"],
            "text": rec["text"],
        })
    return results
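
The `score` field relies on cosine distance being defined as 1 minus cosine similarity. The NumPy sketch below uses made-up 3-dimensional vectors in place of real embeddings (the helper name `cosine_similarity_matrix` is hypothetical) to show the conversion and a brute-force ranking comparable to what the index returns.

```python
import numpy as np

def cosine_similarity_matrix(query, matrix):
    """Cosine similarity between one query vector and every row of `matrix`."""
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return m @ q

# Toy vectors standing in for chunk embeddings
toy_vectors = np.array([
    [1.0, 0.0, 0.0],   # same direction as the query
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [1.0, 1.0, 0.0],   # 45 degrees from the query
])
toy_query = np.array([1.0, 0.0, 0.0])

sims = cosine_similarity_matrix(toy_query, toy_vectors)
dists = 1.0 - sims                 # cosine distance
top2 = np.argsort(dists)[:2]       # indices of the two nearest rows

print(top2.tolist())  # [0, 2]
```

Smaller distance means higher similarity, so sorting by ascending distance yields the same order as sorting by descending score.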

## Load Embedding Model and Test Retrieval

We load the same sentence transformer used earlier (`all-MiniLM-L6-v2`)  
and perform a test query to confirm that chunk retrieval works correctly.


In [5]:
from sentence_transformers import SentenceTransformer

model_name = "all-MiniLM-L6-v2"
embed_model = SentenceTransformer(model_name)



In [6]:
test_query = "How does this invention handle language model dialogue?"
hits = search_chunks(test_query, embed_model, nn_index, embeddings, chunks, top_k=3)

for h in hits:
    print(f"[{h['rank']}] {h['patent_id']} (sim={h['score']:.3f})")
    print(h["text"][:300], "...\n")

[0] US12148421 (sim=0.580)
part of a dialog session between a user of a client device and an automated assistant implemented by the client device: receiving a stream of audio data that captures a spoken utterance ofthe user, the stream of audio data being generated by one or more microphones of the client device, and the spok ...

[1] US11562147 (sim=0.577)
an utterance of the human user in the dialogue history or a language model response. 14. The system of claim 11, wherein a position level encoding layer from the plurality of text encoding layers generates the position level encoding, wherein the position level encoding identifies a token ordering i ...

[2] US11562147 (sim=0.575)
visual dialogue model 55 receives the image 110, the dialogue history 120 and the question 130 as input and generates the answer 150 base on the received input. Prior approaches have attempted to implement visual dialogue, where a dialogue machine agent is tasked to answer a series of questions grou ...



## Initialize OpenAI Client

We verify that the `OPENAI_API_KEY` is set in the environment and instantiate the OpenAI SDK client.  
This will be used for the generation step of the RAG pipeline.


In [None]:
import os

# Report only whether the key is set; never print the secret key itself.
print("OPENAI_API_KEY is set:", os.getenv("OPENAI_API_KEY") is not None)
# Output is intentionally not included.


In [8]:
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY env var
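
A minimal sketch of how the key check could fail early without exposing the secret. Both `require_env` and `mask_secret` are hypothetical helpers introduced here for illustration; they are not part of the OpenAI SDK or the original notebook.

```python
import os

def require_env(name: str) -> str:
    """Hypothetical helper: return the variable's value or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; export it before creating the client")
    return value

def mask_secret(value: str, visible: int = 4) -> str:
    """Show only the last few characters so logs never contain the full key."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]

# Example usage (commented out because it needs a real key in the environment):
# print(mask_secret(require_env("OPENAI_API_KEY")))
```

Raising before `OpenAI()` is constructed gives a clearer error than a failed request later in the notebook.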

## Build Context String

The retrieved chunks are formatted into a structured text block  
that becomes the context for the RAG answer generation.  
Each chunk includes:

- Patent ID  
- Chunk index  
- Similarity score  
- Chunk text  


In [9]:
def build_context_string(retrieved_chunks):
    pieces = []
    for r in retrieved_chunks:
        header = f"[{r['patent_id']} | chunk {r['chunk_index']} | score={r['score']:.3f}]"
        pieces.append(header + "\n" + r["text"])
    return "\n\n---\n\n".join(pieces)
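
To make the context layout concrete, the sketch below repeats the function so it runs on its own and applies it to two toy records. The patent IDs, scores, and text are made-up placeholders, not results from the corpus.

```python
def build_context_string(retrieved_chunks):
    # Same logic as the notebook cell, repeated so this snippet is self-contained
    pieces = []
    for r in retrieved_chunks:
        header = f"[{r['patent_id']} | chunk {r['chunk_index']} | score={r['score']:.3f}]"
        pieces.append(header + "\n" + r["text"])
    return "\n\n---\n\n".join(pieces)

# Placeholder records with the same keys that search_chunks returns
toy_hits = [
    {"patent_id": "US0000001", "chunk_index": 0, "score": 0.58, "text": "First toy chunk."},
    {"patent_id": "US0000002", "chunk_index": 3, "score": 0.5, "text": "Second toy chunk."},
]

context = build_context_string(toy_hits)
print(context)
```

Each chunk gets a bracketed header line followed by its text, and chunks are separated by a `---` line, which lets the model tell sources apart.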

## RAG Answer Function

This function performs the full RAG workflow:

1. Retrieve top-k relevant chunks  
2. Construct a combined context string  
3. Send a prompt to the OpenAI model with a strict instruction  
4. Return:
   - The generated answer  
   - Retrieved chunks (for evaluation)  


In [10]:
def rag_answer(
    question: str,
    embed_model,
    nn_index,
    embeddings,
    chunks,
    client,
    model: str = "gpt-4.1-mini",
    top_k: int = 5,
):
    # 1. Retrieve
    retrieved = search_chunks(
        question,
        embed_model,
        nn_index,
        embeddings,
        chunks,
        top_k=top_k,
    )
    context = build_context_string(retrieved)
    
    system_prompt = (
        "You are a helpful assistant answering questions about a small set of US patents. "
        "Answer the user's question using ONLY the information in the provided context. "
        "If the answer is not in the context, say you don't know based on these documents."
    )
    
    user_prompt = (
        f"Question:\n{question}\n\n"
        f"Context (patent chunks):\n{context}"
    )
    
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.2,
    )
    
    answer = response.choices[0].message.content
    return answer, retrieved
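
To inspect the prompt without spending API calls, the message assembly can be separated from the request. This is a sketch under that assumption: `build_messages` is a hypothetical helper that mirrors the prompt layout in `rag_answer`, and the question and context below are placeholders.

```python
def build_messages(question: str, context: str) -> list:
    """Hypothetical helper: assemble the system and user messages used for generation."""
    system_prompt = (
        "You are a helpful assistant answering questions about a small set of US patents. "
        "Answer the user's question using ONLY the information in the provided context. "
        "If the answer is not in the context, say you don't know based on these documents."
    )
    user_prompt = (
        f"Question:\n{question}\n\n"
        f"Context (patent chunks):\n{context}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

# Placeholder inputs; no request is sent
messages = build_messages(
    "What does the patent claim?",
    "[US0000001 | chunk 0 | score=0.580]\nToy text.",
)
for m in messages:
    print(m["role"], len(m["content"]))
```

The resulting list has the same shape as the `messages` argument passed to `client.chat.completions.create` in the cell above.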

## Example RAG Call

We run the RAG pipeline on a sample question to verify that the  
end-to-end system (retrieval + generation) works correctly.


In [11]:
question = "What is the main novelty of these inventions related to language model training or dialogue systems?"

answer, retrieved = rag_answer(
    question,
    embed_model,
    nn_index,
    embeddings,
    chunks,
    client,
    model="gpt-4.1-mini",
    top_k=5,
)

print("QUESTION:")
print(question)
print("\nANSWER:")
print(answer)

print("\n\nRETRIEVED CHUNKS (for debugging):")
for r in retrieved:
    print(f"- {r['patent_id']} chunk {r['chunk_index']} (sim={r['score']:.3f})")

QUESTION:
What is the main novelty of these inventions related to language model training or dialogue systems?

ANSWER:
The main novelties of these inventions related to language model training or dialogue systems are:

1. **Dataset Generation Using Large Language Models (US20240185001A1)**: This invention introduces a system and technique for generating training datasets for task-oriented dialogue systems by combining template queries with domain-specific tokens sampled from a data store. A large language model then generates natural language queries based on these prompts, which are used to train conversational machine-learning models specialized for domain-specific tasks. This approach automates and enhances the creation of relevant training data tailored to specific conversational domains, reducing reliance on human-generated data.

2. **Natural Language Training and Augmentation with Large Language Models (US20240346254A1)**: This invention describes methods where a large language

In [12]:
print("\n\nRETRIEVED CHUNKS:")
for r in retrieved:
    print(f"- {r['patent_id']} | chunk {r['chunk_index']} | sim={r['score']:.3f}")
    print(r['text'][:300], "...\n")



RETRIEVED CHUNKS:
- US20240185001A1 | chunk 0 | sim=0.575
US 20240185001A1 (19) United States (12) Patent Application Publication Nagaraju et al. (54) DATASET GENERATION USING LARGE LANGUAGE MODELS (71) Applicant: NVIDIA Corporation, Santa Clara, CА (US) (72) Inventors: Divija Nagaraju, Mountain View, СА (US); Christopher Parisien, Toronto (CA) (21) Appl.  ...

- US20240185001A1 | chunk 5 | sim=0.537
that were previously performed by humans. In addition to designing efficient and effective machine-learning model (MLM) architectures, the successful deployment or application of the MLMs also depends heavily on the training techniques employed. For example, training an MLM to perform a specific tas ...

- US20240346254A1 | chunk 0 | sim=0.530
(19) United States (12) Patent Application Publication LIU et al. (54) NATURAL LANGUAGE TRAINING AND/OR AUGMENTATION WITH LARGE LANGUAGE MODELS (71) Applicant: MICROSOFT TECHNOLOGY LICENSING, LLC, Redmond, WA (US) (72) Inventors: Yang LIU, Bellevue

## Evaluation Questions

We define a set of eight evaluation questions designed to test retrieval grounding  
across themes such as model training, multimodal processing, dialogue systems,  
and computational efficiency.


In [13]:
test_questions = [
    "What is the main innovation described in these patents for improving natural language model training or generation?",
    "How do the patents describe combining large language models with smaller models for improved efficiency?",
    "What role does a transformer encoder play in the systems described?",
    "How is dialogue history or multi-turn conversation handled within the patented architectures?",
    "Do the patents describe any specialized pretraining objectives or methods for tuning language models?",
    "What mechanisms do the patents propose for integrating multimodal inputs such as text and images?",
    "How do the patents describe improving resource efficiency or reducing computational cost in language model processing?",
    "What system components or modules are mentioned as contributing to natural language generation quality?",
]

## Run Evaluation Questions

This helper function runs RAG for each evaluation question  
and prints detailed retrieval outputs.  
Manual scoring will occur afterward.


In [14]:
def run_eval_questions(
    questions,
    embed_model,
    nn_index,
    embeddings,
    chunks,
    client,
    model="gpt-4.1-mini",
    top_k=5,
):
    """
    Runs RAG on each question and prints out:
      - question
      - answer
      - top-k retrieved chunks
    You will manually judge success from this output.
    """
    results = []
    
    for i, q in enumerate(questions, start=1):
        print("=" * 80)
        print(f"QUESTION {i}: {q}")
        print("=" * 80)
        
        answer, retrieved = rag_answer(
            q,
            embed_model,
            nn_index,
            embeddings,
            chunks,
            client,
            model=model,
            top_k=top_k,
        )
        
        print("\nANSWER:\n")
        print(answer)
        
        print("\nRETRIEVED CHUNKS:\n")
        for r in retrieved:
            print(f"[{r['rank']}] patent={r['patent_id']} chunk={r['chunk_index']} "
                  f"sim={r['score']:.3f}")
            print(r["text"][:400] + ("..." if len(r["text"]) > 400 else ""))
            print("-" * 80)
        
        # placeholder dict; these fields are filled in manually after inspection
        results.append({
            "question": q,
            "answer": answer,
            "retrieved": retrieved,
            "retrieval_success": None,   # True/False after you inspect
            "relevant_rank": None,       # 0-based rank of the first relevant chunk, or None
            "support_level": None,       # e.g., "fully_supported", "mostly_supported", "weakly_supported"
            "hallucination": None,       # "none", "mild", "moderate", or "severe"
        })
    
    return results

eval_raw = run_eval_questions(
    test_questions,
    embed_model,
    nn_index,
    embeddings,
    chunks,
    client,
    model="gpt-4.1-mini",
    top_k=5,
)


QUESTION 1: What is the main innovation described in these patents for improving natural language model training or generation?

ANSWER:

The main innovation described in these patents for improving natural language model training or generation involves leveraging large language models (LLMs) to enhance both training and augmentation of smaller natural language generation systems. Specifically, as detailed in US20240346254A1, the approach includes:

1. Using a large language model to process a training dataset and produce natural language outputs.
2. Having the natural language generation system analyze both the training data and the LLM's output to generate its own outputs that mimic the LLM.
3. Employing the large language model to evaluate the generation system's outputs and iteratively adjust and improve their quality through a feedback loop.
4. Augmenting smaller language models by retrieving external information via the large language model to provide additional context and a lan

## Manual Evaluation Scoring

Each question is scored on:

- Retrieval success  
- Relevant rank  
- Support level  
- Hallucination severity  

These scores populate the evaluation DataFrame used in the analysis.


In [15]:
eval_scored = [
    {
        "question": 1,
        "retrieval_success": True,
        "relevant_rank": 0,
        "support_level": "mostly_supported",
        "hallucination": "mild",
    },
    {
        "question": 2,
        "retrieval_success": True,
        "relevant_rank": 0,
        "support_level": "fully_supported",
        "hallucination": "none",
    },
    {
        "question": 3,
        "retrieval_success": True,
        "relevant_rank": 0,
        "support_level": "mostly_supported",
        "hallucination": "mild",
    },
    {
        "question": 4,
        "retrieval_success": True,
        "relevant_rank": 0,
        "support_level": "moderately_supported",
        "hallucination": "mild",
    },
    {
        "question": 5,
        "retrieval_success": True,
        "relevant_rank": 0,
        "support_level": "partially_supported",
        "hallucination": "mild",
    },
    {
        "question": 6,
        "retrieval_success": True,
        "relevant_rank": 0,
        "support_level": "weakly_supported",
        "hallucination": "moderate",
    },
    {
        "question": 7,
        "retrieval_success": True,
        "relevant_rank": 0,
        "support_level": "mostly_supported",
        "hallucination": "mild",
    },
    {
        "question": 8,
        "retrieval_success": True,
        "relevant_rank": 0,
        "support_level": "mostly_supported",
        "hallucination": "mild",
    },
]
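
Because the scores are typed by hand, a misspelled label would silently create an extra category in the later summaries. The sketch below is a hypothetical validation step, not part of the original notebook. It assumes the label sets are exactly the ones that appear in this notebook (plus `"severe"`, which the severity mapping lists).

```python
# Assumed label vocabularies, taken from the labels used in this notebook
ALLOWED_SUPPORT = {
    "fully_supported",
    "mostly_supported",
    "moderately_supported",
    "partially_supported",
    "weakly_supported",
}
ALLOWED_HALLUCINATION = {"none", "mild", "moderate", "severe"}

def validate_scores(rows):
    """Hypothetical helper: return a list of human-readable problems (empty if none)."""
    problems = []
    for row in rows:
        q = row.get("question")
        if not isinstance(row.get("retrieval_success"), bool):
            problems.append(f"question {q}: retrieval_success must be True or False")
        if row.get("support_level") not in ALLOWED_SUPPORT:
            problems.append(f"question {q}: unknown support_level {row.get('support_level')!r}")
        if row.get("hallucination") not in ALLOWED_HALLUCINATION:
            problems.append(f"question {q}: unknown hallucination {row.get('hallucination')!r}")
    return problems

# Toy rows: the second one contains a deliberate typo
sample_rows = [
    {"question": 1, "retrieval_success": True, "support_level": "mostly_supported", "hallucination": "mild"},
    {"question": 2, "retrieval_success": True, "support_level": "fully_suported", "hallucination": "none"},
]
problems = validate_scores(sample_rows)
print(problems)
```

Running `validate_scores(eval_scored)` before building the DataFrame would be one way to apply the check.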

In [20]:
import pandas as pd

df_eval = pd.DataFrame(eval_scored)
df_eval

|   | question | retrieval_success | relevant_rank | support_level | hallucination |
|---|---|---|---|---|---|
| 0 | 1 | True | 0 | mostly_supported | mild |
| 1 | 2 | True | 0 | fully_supported | none |
| 2 | 3 | True | 0 | mostly_supported | mild |
| 3 | 4 | True | 0 | moderately_supported | mild |
| 4 | 5 | True | 0 | partially_supported | mild |
| 5 | 6 | True | 0 | weakly_supported | moderate |
| 6 | 7 | True | 0 | mostly_supported | mild |
| 7 | 8 | True | 0 | mostly_supported | mild |


## Manual Evaluation Summary

From the manual scores above, we compute the retrieval success rate, the distributions of support levels and hallucination labels, and an average hallucination severity.


In [26]:
import pandas as pd

# df_eval already created from eval_scored

# 1. Basic counts
n = len(df_eval)

# 2. Retrieval success rate (booleans → mean works)
retrieval_rate = df_eval["retrieval_success"].mean()

# 3. Support level distribution (proportions)
support_dist = df_eval["support_level"].value_counts(normalize=True)

# 4. Hallucination label distribution (proportions)
hallucination_dist = df_eval["hallucination"].value_counts(normalize=True)

# 5. Optional: map hallucination severity to numbers for an average score
hallucination_map = {
    "none": 0,
    "mild": 1,
    "moderate": 2,
    "severe": 3,   # not used here, but included for completeness
}
df_eval["hallucination_severity"] = df_eval["hallucination"].map(hallucination_map)
avg_hallucination_severity = df_eval["hallucination_severity"].mean()

print("Number of questions:", n)
print("Retrieval success rate:", retrieval_rate)

print("\nSupport level distribution:")
print(support_dist)

print("\nHallucination distribution:")
print(hallucination_dist)

print("\nAverage hallucination severity (0=none, 1=mild, 2=moderate, 3=severe):")
print(avg_hallucination_severity)

Number of questions: 8
Retrieval success rate: 1.0

Support level distribution:
support_level
mostly_supported        0.500
fully_supported         0.125
moderately_supported    0.125
partially_supported     0.125
weakly_supported        0.125
Name: proportion, dtype: float64

Hallucination distribution:
hallucination
mild        0.750
none        0.125
moderate    0.125
Name: proportion, dtype: float64

Average hallucination severity (0=none, 1=mild, 2=moderate, 3=severe):
1.0
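
As a cross-check on the average severity, the same number can be recomputed by hand from the eight hallucination labels scored earlier. This sketch uses only the standard library; the labels are copied from `eval_scored` and the numeric mapping is the one defined in the cell above.

```python
from collections import Counter

# Hallucination labels for questions 1-8, copied from eval_scored
hallucination_labels = [
    "mild", "none", "mild", "mild", "mild", "moderate", "mild", "mild",
]
severity = {"none": 0, "mild": 1, "moderate": 2, "severe": 3}

counts = Counter(hallucination_labels)
# Weighted sum: 6 mild * 1 + 1 none * 0 + 1 moderate * 2 = 8
total = sum(severity[label] * n for label, n in counts.items())
average = total / len(hallucination_labels)

print(counts["mild"], counts["none"], counts["moderate"])  # 6 1 1
print(average)  # 1.0
```

With only eight questions, a single relabeled answer shifts this average by 0.125 or more, so it is best read as a rough indicator.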
