## 1) Install required packages

Run this cell to install the Python packages used in the notebook: PyMuPDF, SentenceTransformers, FAISS (CPU build), the OpenAI client library, and nbformat.


In [2]:
# Install dependencies
!pip install -q pymupdf sentence-transformers faiss-cpu openai nbformat

## 2) Upload the sample service manual (PDF)

Use the file upload widget to upload `sample-service-manual 1.pdf`. The notebook uses the uploaded file as `PDF_PATH`. If you previously placed the file in `/mnt/data/`, you can set the path manually instead.


In [6]:
from google.colab import files
uploaded = files.upload()
# Determine the PDF path
if uploaded:
    PDF_PATH = list(uploaded.keys())[0]
else:
    # Fallback if the file already exists in the environment
    PDF_PATH = "/mnt/data/sample-service-manual 1.pdf"
print("Using PDF:", PDF_PATH)

Saving sample-service-manual 1.pdf to sample-service-manual 1.pdf
Using PDF: sample-service-manual 1.pdf


## 3) Pipeline overview (what each section does)

- **Text extraction:** Read PDF pages using PyMuPDF (fitz) and extract textual content.
- **Chunking:** Split text into overlapping chunks to preserve local context for retrieval.
- **Embeddings:** Convert chunks to dense vectors using SentenceTransformers (all-MiniLM-L6-v2).
- **Vector DB (FAISS):** Store embeddings and perform fast nearest-neighbor search.
- **Retrieval:** Given a user query, retrieve the top-k most relevant chunks.
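- **Extraction:** If `OPENAI_API_KEY` is set, send the retrieved chunks to an LLM (`gpt-4o-mini`, called through the OpenRouter chat completions endpoint) to extract structured JSON; otherwise fall back to regex heuristics.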

This notebook is intentionally verbose and educational, with clear explanations and example outputs at each step.
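
The chunking step can be illustrated in isolation. The sketch below is a minimal, hypothetical helper (`chunk_spans` is not part of the notebook) that computes the character offsets produced by a sliding window, assuming the notebook's defaults of 800-character chunks with a 200-character overlap, which gives a stride of 600 characters.

```python
def chunk_spans(text_len, chunk_size=800, overlap=200):
    """Return (start, end) character offsets for overlapping chunks."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    spans = []
    start = 0
    while start < text_len:
        end = min(start + chunk_size, text_len)
        spans.append((start, end))
        if end == text_len:
            break
        # Step back by `overlap` so consecutive chunks share some text
        start = end - overlap
    return spans

print(chunk_spans(2000))
```

For a 2000-character page this yields three spans, and each consecutive pair shares 200 characters, so a phrase of up to 200 characters that straddles a chunk boundary still appears whole in one chunk.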


In [9]:

# 4) Pipeline code (detailed and commented)
import os, re, json, time
from typing import List, Tuple
from collections import namedtuple
import fitz  # PyMuPDF
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

# Optional OpenAI
import openai

# Configuration (you can edit these)
PDF_PATH = globals().get("PDF_PATH", "/mnt/data/sample-service-manual 1.pdf")
EMBED_MODEL = "all-MiniLM-L6-v2"   # small, fast, and good for retrieval
CHUNK_SIZE = 800
CHUNK_OVERLAP = 200
TOP_K = 5
FAISS_DIR = "/content/faiss_index_optionB"
OPENAI_MODEL = "gpt-4o-mini"
Chunk = namedtuple("Chunk", ["id", "text", "page_no"])

# Text extraction
# Returns a list of (page_no, text) tuples extracted from the PDF.
# Uses PyMuPDF (fitz) because it is fast and preserves textual order reasonably well.
def extract_text_from_pdf(pdf_path: str) -> List[Tuple[int, str]]:
    if not os.path.exists(pdf_path):
        raise FileNotFoundError(f"PDF not found at {pdf_path}. Upload it first.")
    pages = []
    # Open with a context manager so the document is closed after reading
    with fitz.open(pdf_path) as doc:
        for i, page in enumerate(doc):
            txt = page.get_text("text")
            pages.append((i + 1, txt if txt else ""))  # page numbers are 1-based
    return pages

def clean_text(s: str) -> str:
    # Normalize whitespace and remove stray control characters
    s = s.replace("\x0c", " ")
    s = re.sub(r"\s+", " ", s)
    return s.strip()

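# Chunking strategy: fixed-size character windows per page, with overlap
def chunk_pages(pages, chunk_size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    # overlap must be smaller than chunk_size, otherwise the window never advances
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    chunks = []
    cid = 0
    for page_no, text in pages:
        text = clean_text(text)
        if not text:
            continue
        start = 0
        while start < len(text):
            end = min(start + chunk_size, len(text))
            piece = text[start:end].strip()
            chunks.append(Chunk(cid, piece, page_no))
            cid += 1
            if end == len(text):
                break
            start = end - overlap
    return chunks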

# Build embeddings and FAISS index
def build_embeddings_and_index(chunks, model_name=EMBED_MODEL):
    print("Loading embedding model:", model_name)
    model = SentenceTransformer(model_name)
    texts = [c.text for c in chunks]
    print(f"Encoding {len(texts)} chunks...")
    emb = model.encode(texts, convert_to_numpy=True, normalize_embeddings=True, show_progress_bar=True)
    print("Embeddings shape:", emb.shape)
    index = faiss.IndexFlatIP(emb.shape[1])
    index.add(emb)
    os.makedirs(FAISS_DIR, exist_ok=True)
    faiss.write_index(index, os.path.join(FAISS_DIR, "index.faiss"))

    # Save metadata and embeddings
    meta = [{"id": c.id, "page_no": c.page_no, "text": c.text} for c in chunks]
    with open(os.path.join(FAISS_DIR, "meta.json"), "w", encoding="utf8") as f:
        json.dump(meta, f, ensure_ascii=False, indent=2)
    np.save(os.path.join(FAISS_DIR, "emb.npy"), emb)
    print("Saved FAISS index and metadata to", FAISS_DIR)
    return model, index, meta

def load_index_and_meta():
    idx = faiss.read_index(os.path.join(FAISS_DIR, "index.faiss"))
    with open(os.path.join(FAISS_DIR, "meta.json"), "r", encoding="utf8") as f:
        meta = json.load(f)
    emb = np.load(os.path.join(FAISS_DIR, "emb.npy"))
    model = SentenceTransformer(EMBED_MODEL)
    return model, idx, meta, emb

# Retrieval
def retrieve(query: str, model, index, meta, top_k=TOP_K):
    q_emb = model.encode([query], convert_to_numpy=True, normalize_embeddings=True)
    D, I = index.search(q_emb, top_k)
    results = []
    for score, idx in zip(D[0], I[0]):
        if idx < 0 or idx >= len(meta):
            continue
        m = meta[idx]
        results.append({"score": float(score), "page_no": m["page_no"], "text": m["text"]})
    return results

# Heuristic extraction (fallback)
def heuristic_extract(retrieved_chunks):
    results = []
    # Find torque values, fluid capacities, and part numbers
    torque_re = re.compile(r"(\d{1,4}(?:\.\d+)?(?:\s*-\s*\d{1,4}(?:\.\d+)?)?)\s*(Nm|N·m|N m|lb-ft|lb·ft|ft-lb|in-lb)", re.I)
    # Longer unit names come first and a trailing \b is required, so "24 lb-ft"
    # is not misread as a capacity of "24 l" and "litres" is not cut down to "L"
    capacity_re = re.compile(r"([0-9]+(?:\.[0-9]+)?)\s*(litres?|liters?|ml|cc|L)\b", re.I)
    part_re = re.compile(r"(?:Part|P/N|PN|Part Number)[:\s]*([A-Za-z0-9\-_/]+)", re.I)

    for c in retrieved_chunks:
        t = c["text"]
        p = c["page_no"]
        # torque matches
        for m in torque_re.finditer(t):
            val = m.group(1)
            unit = m.group(2)
            snippet = t[max(0,m.start()-100):m.end()+50].strip()
            comp = snippet.split("\n")[-1][:120]
            results.append({
                "component": comp.strip() or "UNKNOWN",
                "spec_type": "Torque",
                "value": val.strip(),
                "unit": unit.strip(),
                "context": snippet,
                "source_page": p
            })
        # capacity matches
        for m in capacity_re.finditer(t):
            val = m.group(1)
            unit = m.group(2)
            snippet = t[max(0,m.start()-60):m.end()+30].strip()
            comp = snippet.split("\n")[-1][:120]
            results.append({
                "component": comp.strip() or "UNKNOWN",
                "spec_type": "Capacity",
                "value": val.strip(),
                "unit": unit.strip(),
                "context": snippet,
                "source_page": p
            })
        # part numbers
        for m in part_re.finditer(t):
            pn = m.group(1)
            snippet = t[max(0,m.start()-60):m.end()+30].strip()
            results.append({
                "component": snippet.split("\n")[-1][:120].strip() or "UNKNOWN",
                "spec_type": "Part Number",
                "value": pn.strip(),
                "unit": "",
                "context": snippet,
                "source_page": p
            })
    # dedupe
    seen = set()
    dedup = []
    for r in results:
        key = (r["component"], r["spec_type"], r["value"], r["unit"])
        if key in seen:
            continue
        seen.add(key)
        dedup.append(r)
    return dedup

# OpenAI LLM extraction
def build_prompt(query, retrieved_chunks):
    header = ("You are an assistant that extracts vehicle specifications from manual text.\n"
              "Given the query and the supplied text chunks, extract relevant specs in JSON.\n"
              "Return an array of objects with fields: component, spec_type, value, unit, context, source_page.\n\n")
    qtxt = f"Query: {query}\n\n"
    chunk_texts = ""
    for i, c in enumerate(retrieved_chunks):
        chunk_texts += f"--- CHUNK {i+1} | PAGE {c['page_no']} ---\n{c['text']}\n\n"
    rules = ("Rules:\n1) Return ONLY a JSON array (no explanation).\n"
             "2) Use empty string for unknown fields.\n3) Normalize units when obvious (e.g., 'Nm').\n")
    prompt = header + qtxt + rules + chunk_texts + "\nReturn the JSON array now."
    return prompt

def llm_extract_openrouter(prompt):
    import os, json, re
    import requests

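    # The rest of the notebook sets and checks OPENAI_API_KEY, so read the same
    # variable here; OPEN_API_KEY is still accepted for backward compatibility.
    api_key = os.environ.get("OPENAI_API_KEY") or os.environ.get("OPEN_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY not set in environment.")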

    url = "https://openrouter.ai/api/v1/chat/completions"

    payload = {
        "model": OPENAI_MODEL,
        "messages": [
            {"role": "user", "content": prompt}
        ],
        "temperature": 0.0,
        "max_tokens": 800
    }

    headers = {
        "Authorization": f"Bearer {api_key}",
        "HTTP-Referer": "http://localhost",
        "X-Title": "Spec-Extractor"
    }

    response = requests.post(url, json=payload, headers=headers, timeout=60)
    response.raise_for_status()  # surface HTTP errors (bad key, rate limit) early
    data = response.json()

    # Extract model text output
    content = data["choices"][0]["message"]["content"]

    # try to parse JSON in response
    m = re.search(r"(\[.*\])", content, flags=re.S)
    if m:
        return json.loads(m.group(1))
    try:
        return json.loads(content)
    except Exception as e:
        raise ValueError("OpenRouter returned non-JSON content.") from e

# Build and query top-level functions
def build_pipeline(pdf_path=PDF_PATH):
    pages = extract_text_from_pdf(pdf_path)
    print(f"Extracted {len(pages)} pages.")
    chunks = chunk_pages(pages)
    print(f"Created {len(chunks)} chunks.")
    model, index, meta = build_embeddings_and_index(chunks)
    return model, index, meta

def query_pipeline(query, use_openai=True, top_k=TOP_K):
    model, index, meta, emb = load_index_and_meta()
    retrieved = retrieve(query, model, index, meta, top_k=top_k)
    # When using OpenAI and a key is present, call LLM and return parsed JSON
    if use_openai and os.environ.get("OPENAI_API_KEY"):
        prompt = build_prompt(query, retrieved)
        try:
            extracted = llm_extract_openrouter(prompt)
            return extracted
        except Exception as e:
            print("OpenAI extraction failed; falling back to heuristic. Error:", e)
            return heuristic_extract(retrieved)
    else:
        return heuristic_extract(retrieved)
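
To see what the heuristic fallback picks up, the torque pattern from `heuristic_extract` can be tried on its own. This is a small sketch that copies the pattern as written in the cell above; the sample sentence is made up for illustration and is not taken from the manual.

```python
import re

# Same torque pattern as in heuristic_extract
torque_re = re.compile(
    r"(\d{1,4}(?:\.\d+)?(?:\s*-\s*\d{1,4}(?:\.\d+)?)?)\s*"
    r"(Nm|N·m|N m|lb-ft|lb·ft|ft-lb|in-lb)",
    re.I,
)

# Made-up sentence for illustration; not taken from the manual
sample = "Tighten the guide pin bolts to 30-35 Nm (22 lb-ft)."

# Each match yields (value or value range, unit)
matches = [(m.group(1), m.group(2)) for m in torque_re.finditer(sample)]
print(matches)
```

The pattern captures both a range (`30-35`) and a single value (`22`) together with their units, which is how the heuristic produces one record per torque figure.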


## 5) Build the FAISS index (run once)

This step extracts text from the uploaded PDF, chunks it, creates embeddings, and builds a FAISS index. It may take a few minutes depending on the document length and runtime.


In [14]:
# Build the pipeline: extract -> chunk -> embed -> FAISS
model, index, meta = build_pipeline(PDF_PATH)
print("Index built. Number of saved chunks (meta):", len(meta))

Extracted 852 pages.
Created 1604 chunks.
Loading embedding model: all-MiniLM-L6-v2
Encoding 1604 chunks...
Embeddings shape: (1604, 384)
Saved FAISS index and metadata to /content/faiss_index_optionB
Index built. Number of saved chunks (meta): 1604
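
Because the embeddings are L2-normalized before being added to `IndexFlatIP`, the inner product the index computes equals cosine similarity. The following is a pure-Python sketch of the same exhaustive search on toy 2-D vectors. `normalize` and `top_k_inner_product` are hypothetical names used only for illustration, the vectors are assumed to be non-zero, and a real run would use the 384-dimensional embeddings and FAISS as above.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_k_inner_product(query, vectors, k):
    """Score every vector against the query and return the k best (score, index) pairs."""
    q = normalize(query)
    scored = []
    for i, v in enumerate(vectors):
        score = sum(a * b for a, b in zip(q, normalize(v)))
        scored.append((score, i))
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:k]

toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for score, idx in top_k_inner_product([2.0, 0.1], toy_vectors, k=2):
    print(idx, round(score, 3))
```

The query points almost along the first axis, so vector 0 ranks first and the diagonal vector 2 ranks second, regardless of the query's length.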


## 6) Query the pipeline

Provide sample queries below. To use the LLM extractor for better results, set your API key in the environment first. The request is sent to the OpenRouter endpoint, so the key must be an OpenRouter key:

```python
import os
os.environ['OPENAI_API_KEY'] = '<your-openrouter-api-key>'
```

If the key is missing, the notebook falls back to the heuristic extractor.
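
The fallback rule in `query_pipeline` can be summarized as a small decision function. This is a sketch only: `choose_extractor` is a hypothetical name, and it takes the environment as a plain mapping so it can be tried without touching real credentials.

```python
def choose_extractor(env, use_openai=True):
    """Mirror query_pipeline's branch: LLM only if requested and a key is set."""
    if use_openai and env.get("OPENAI_API_KEY"):
        return "llm"
    return "heuristic"

print(choose_extractor({"OPENAI_API_KEY": "placeholder"}))
print(choose_extractor({}))
print(choose_extractor({"OPENAI_API_KEY": ""}))
```

An empty string counts as missing, so only the first call selects the LLM path. In the notebook itself, `query_pipeline` also falls back to the heuristic extractor if the LLM call raises an exception.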


In [None]:
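import os
os.environ['OPENAI_API_KEY'] = 'XYZ'  # placeholder; checked by query_pipeline
os.environ['OPEN_API_KEY'] = os.environ['OPENAI_API_KEY']  # read by llm_extract_openrouter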

In [13]:
# Example queries
q = "Torque for brake caliper bolts"
print('Query:', q)
res = query_pipeline(q, use_openai=True)
import json
print(json.dumps(res, indent=2))

Query: Torque for brake caliper bolts
[
  {
    "component": "Brake caliper guide pin bolts",
    "spec_type": "Torque",
    "value": "33",
    "unit": "Nm",
    "context": "",
    "source_page": "652"
  },
  {
    "component": "Brake caliper support bracket bolts",
    "spec_type": "Torque",
    "value": "150",
    "unit": "Nm",
    "context": "",
    "source_page": "652"
  },
  {
    "component": "Brake caliper anchor plate bolts",
    "spec_type": "Torque",
    "value": "250",
    "unit": "Nm",
    "context": "",
    "source_page": "636"
  },
  {
    "component": "Brake caliper flow bolt",
    "spec_type": "Torque",
    "value": "35",
    "unit": "Nm",
    "context": "",
    "source_page": "636"
  }
]
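
In the output above the LLM returned `source_page` as strings, whereas `heuristic_extract` returns integer page numbers. If downstream code needs a consistent type, a post-processing step along these lines could be applied. `normalize_specs` is a hypothetical helper and not part of the notebook, and the second sample record is made up to show the unknown-page case.

```python
def normalize_specs(records):
    """Return copies of the records with source_page coerced to int when possible."""
    cleaned = []
    for rec in records:
        rec = dict(rec)  # shallow copy so the input is not mutated
        page = rec.get("source_page", "")
        try:
            rec["source_page"] = int(page)
        except (TypeError, ValueError):
            rec["source_page"] = None  # unknown or non-numeric page
        cleaned.append(rec)
    return cleaned

sample = [
    {"component": "Brake caliper guide pin bolts", "spec_type": "Torque",
     "value": "33", "unit": "Nm", "context": "", "source_page": "652"},
    # Made-up record with an empty page field
    {"component": "Example with no page", "spec_type": "Torque",
     "value": "10", "unit": "Nm", "context": "", "source_page": ""},
]
print([r["source_page"] for r in normalize_specs(sample)])
```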


## 7) Save and download results

Save the last query results to a JSON file and download it to your machine.


In [None]:
# Save last result (res) to JSON and provide download link
from google.colab import files
with open('specs_output_optionB.json', 'w') as f:
    json.dump(res, f, indent=2)
files.download('specs_output_optionB.json')
