# Datasheet Clustering Starter Notebook
This notebook guides you through building a prototype that ingests component datasheet PDFs, extracts text & numeric features, and clusters similar products.

> **Setup**: Create a virtual environment and install required packages first:
> ```bash
> pip install pymupdf pdfplumber sentence-transformers faiss-cpu hdbscan umap-learn plotly tqdm pandas scikit-learn
> ```

In [None]:
import os, re, glob, pathlib
import fitz  # PyMuPDF
import pdfplumber
import numpy as np
import pandas as pd
from tqdm.auto import tqdm
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import StandardScaler
import hdbscan, umap
import plotly.express as px

## 1. Helper: PDF extraction

In [None]:
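def extract_pdf(path):
    """Return raw text and a list of DataFrames extracted from tables."""
    text_chunks, tables = [], []
    with fitz.open(path) as doc:  # closes the file handle when done
        for page in doc:
            text_chunks.append(page.get_text("text"))
            try:
                # find_tables() returns a TableFinder; its .tables list is empty
                # when the page has no tables (needs PyMuPDF >= 1.23)
                for tbl in page.find_tables().tables:
                    tables.append(tbl.to_pandas())
            except Exception:
                pass  # table detection failed on this page; keep the text
    return "\n".join(text_chunks), tables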

## 2. Parse numeric specs (placeholder)
Edit `SPEC_REGEXES` to match the parameters you care about.

In [None]:
SPEC_REGEXES = {
    "vdd_max": r"V[Dd][Dd].{0,20}?([0-9]+\.?[0-9]*)\s*[Vv]",
    "gain_db": r"[Gg]ain.*?([0-9]+\.?[0-9]*)\s*dB",
}

def parse_specs(text):
    specs = {}
    for key, pattern in SPEC_REGEXES.items():
        m = re.search(pattern, text)
        if m:
            specs[key] = float(m.group(1))
    return specs
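
Before pointing the placeholder patterns at real PDFs, it can help to check how they behave on a short string. The sketch below copies the two patterns from `SPEC_REGEXES` so it runs on its own; the sample sentence is invented for illustration and does not come from a real datasheet.

```python
import re

# Patterns copied from SPEC_REGEXES above so this block is self-contained.
SPEC_REGEXES = {
    "vdd_max": r"V[Dd][Dd].{0,20}?([0-9]+\.?[0-9]*)\s*[Vv]",
    "gain_db": r"[Gg]ain.*?([0-9]+\.?[0-9]*)\s*dB",
}

def parse_specs(text):
    """Return {spec_name: float} for every pattern that matches."""
    specs = {}
    for key, pattern in SPEC_REGEXES.items():
        m = re.search(pattern, text)
        if m:
            specs[key] = float(m.group(1))
    return specs

# Hypothetical datasheet snippet (not from a real part).
sample = "Supply voltage VDD max 5.5 V. Typical gain of 20 dB at 2.4 GHz."
demo_specs = parse_specs(sample)
print(demo_specs)
```

Because `re.search` returns only the first match, a datasheet that states a minimum supply voltage before the maximum would yield the minimum here, so the patterns usually need tightening per part family.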

## 3. Build vector: text embedding + numeric vector

In [None]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

NUM_FEATURES = list(SPEC_REGEXES.keys())

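def build_vector(text, numeric_dict):
    # Note: the model truncates long inputs (256 word pieces for all-MiniLM-L6-v2),
    # so only the start of each datasheet contributes to the embedding.
    text_vec = model.encode(text, normalize_embeddings=True)
    numeric = np.array([numeric_dict.get(k, np.nan) for k in NUM_FEATURES])
    numeric = np.nan_to_num(numeric)  # simple imputation: missing specs become 0.0
    # Numeric values are appended unscaled, so the result is no longer unit-length.
    return np.hstack([text_vec, numeric])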
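
One consequence of appending raw spec values to a unit-length embedding is that a feature measured in tens (gain in dB, for instance) can outweigh the entire text vector in distance calculations. The sketch below shows one possible remedy, assuming you z-score each numeric column across the corpus and impute missing values with the column mean. `scale_numeric` and `numeric_weight` are hypothetical names, not part of the notebook.

```python
import numpy as np

def scale_numeric(numeric, numeric_weight=1.0):
    """Z-score each column of an (n_samples, n_features) array, ignoring NaNs.

    Missing values are imputed with the column mean, which becomes 0 after
    scaling. Constant or all-NaN columns are mapped to 0.
    """
    numeric = np.asarray(numeric, dtype=float)
    present = ~np.isnan(numeric)
    counts = np.maximum(present.sum(axis=0), 1)  # avoid dividing by zero
    mean = np.where(present, numeric, 0.0).sum(axis=0) / counts
    var = (np.where(present, numeric - mean, 0.0) ** 2).sum(axis=0) / counts
    std = np.sqrt(var)
    std[std == 0] = 1.0  # constant columns: leave the centred zeros as they are
    scaled = np.where(present, (numeric - mean) / std, 0.0)
    return numeric_weight * scaled

# Hypothetical specs for three parts; columns are vdd_max and gain_db.
raw = np.array([[3.3, 20.0], [5.0, np.nan], [1.8, 30.0]])
scaled = scale_numeric(raw)
print(scaled)
```

The scaled block could then be stacked next to the text embeddings with `np.hstack`, with `numeric_weight` controlling how much the specs count relative to the text. scikit-learn's `StandardScaler`, already imported at the top of the notebook, offers the same z-scoring.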

## 4. Clustering utility

In [None]:
def cluster_vectors(X, min_cluster_size=5):
    """Return HDBSCAN labels (-1 marks noise) and the UMAP-reduced coordinates."""
    # Dimensionality reduction (optional, but speeds up HDBSCAN on high-dimensional vectors)
    red = umap.UMAP(n_components=10, random_state=42).fit_transform(X)
    clust = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit(red)
    return clust.labels_, red
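
With only a handful of PDFs, the settings above can be larger than the dataset itself: UMAP's default `n_neighbors` is 15, the reduction targets 10 components, and `min_cluster_size=5` needs at least five similar parts to form a cluster. The sketch below is a small helper that clamps these settings to the sample count. `pick_cluster_params` is a hypothetical name, and the specific bounds are assumptions to tune rather than library requirements.

```python
def pick_cluster_params(n_samples, n_components=10, n_neighbors=15, min_cluster_size=5):
    """Clamp UMAP/HDBSCAN settings so none exceeds what n_samples can support."""
    if n_samples < 4:
        raise ValueError("need at least 4 samples to attempt clustering")
    return {
        # keep the neighbourhood strictly smaller than the dataset
        "n_neighbors": max(2, min(n_neighbors, n_samples - 1)),
        # keep the embedding dimension below the sample count
        "n_components": max(2, min(n_components, n_samples - 2)),
        # a cluster of one point is not meaningful, so never go below 2
        "min_cluster_size": max(2, min(min_cluster_size, n_samples // 2)),
    }

small = pick_cluster_params(8)
large = pick_cluster_params(500)
print(small, large)
```

The resulting values could be passed on as `umap.UMAP(n_components=..., n_neighbors=...)` and `hdbscan.HDBSCAN(min_cluster_size=...)`.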

## 5. End‑to‑end example
Drop a few sample PDFs inside a folder (e.g., `./sample_datasheets`) and run the block below.

In [None]:
PDF_DIR = './sample_datasheets'  # change to your folder
rows = []
for pdf_path in tqdm(sorted(glob.glob(os.path.join(PDF_DIR, '*.pdf')))):  # sorted for a reproducible order
    text, tables = extract_pdf(pdf_path)
    specs = parse_specs(text)
    vec = build_vector(text, specs)
    rows.append({'file': os.path.basename(pdf_path), 'vector': vec, 'specs': specs})

# Build feature matrix
assert rows, f"No PDFs found in {PDF_DIR}"
X = np.vstack([r['vector'] for r in rows])

# Cluster
labels, red = cluster_vectors(X)
for r, lab in zip(rows, labels):
    r['cluster'] = int(lab)

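# Visualise (x and y are the first two of the 10 UMAP components)
vis_df = pd.DataFrame({
    'x': red[:,0],
    'y': red[:,1],
    'cluster': labels.astype(str),  # strings give discrete colours; '-1' is HDBSCAN noise
    'file': [r['file'] for r in rows]
})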
fig = px.scatter(vis_df, x='x', y='y', color='cluster', hover_name='file', title='Datasheet Clusters')
fig.show()
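
Alongside the scatter plot, a plain listing of which files landed in which cluster is often the quickest sanity check. The sketch below assumes rows shaped like the `rows` list built above (each with a `file` and a `cluster` key); `files_by_cluster` is a hypothetical helper and the demo rows are invented.

```python
from collections import defaultdict

def files_by_cluster(rows):
    """Group file names by cluster label; label -1 is HDBSCAN's noise bucket."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["cluster"]].append(r["file"])
    return dict(sorted(groups.items()))

# Hypothetical rows (vectors and specs omitted for brevity).
demo_rows = [
    {"file": "lna_a.pdf", "cluster": 0},
    {"file": "ldo_x.pdf", "cluster": 1},
    {"file": "lna_b.pdf", "cluster": 0},
    {"file": "misc.pdf", "cluster": -1},
]
for label, files in files_by_cluster(demo_rows).items():
    print(label, files)
```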

---
### Next steps
- **Vector store**: Persist embeddings in FAISS/Qdrant for similarity search (Section 7 builds an in‑memory FAISS index).
- **Better spec parsing**: Use `pdfplumber` tables or NLP entity extraction for robust numeric features (see Section 6).
- **Web service**: Wrap the logic in a FastAPI service with a React or Streamlit front‑end (see Section 8).
- **Evaluation**: Compute the silhouette score and inspect clusters manually, for example in a dashboard.
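
For the evaluation step, the silhouette score can be computed on the clustered points only, since HDBSCAN's noise label (-1) is not a real cluster. scikit-learn's `silhouette_score` computes the same quantity. The version below is a dependency-light NumPy sketch so the arithmetic is visible, with `silhouette_without_noise` as a hypothetical name and toy points invented for illustration.

```python
import numpy as np

def silhouette_without_noise(X, labels):
    """Mean silhouette over clustered points, skipping noise (label -1).

    Returns NaN when fewer than two clusters remain.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    keep = labels != -1
    X, labels = X[keep], labels[keep]
    clusters = np.unique(labels)
    if len(clusters) < 2:
        return float("nan")
    # Pairwise Euclidean distances between all kept points.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton clusters score 0 by convention
        a = dist[i, same].sum() / (n_same - 1)  # mean intra-cluster distance, self excluded
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return float(scores.mean())

# Two tight, well-separated toy clusters plus one noise point.
pts = [[0, 0], [0, 1], [10, 0], [10, 1], [50, 50]]
score = silhouette_without_noise(pts, [0, 0, 1, 1, -1])
print(round(score, 3))
```

Values near 1 indicate compact, well-separated clusters; values near 0 or below suggest overlapping clusters or a poor choice of `min_cluster_size`.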

## 6. Advanced spec parsing
This cell shows **two complementary techniques** to pull numeric specs more robustly:
1. **Table extraction** with `pdfplumber` & heuristics for unit conversion.
2. **Regex on sentences** for specs hidden in prose.

> **Tip**: Extend `TABLE_HEADERS` and `SENTENCE_PATTERNS` dictionaries with the parameters important to your parts.

In [None]:
import pdfplumber  # re and np come from the imports cell at the top
# Unit strings are lower-cased before lookup, so 'mv' means millivolt and 'mhz' megahertz.
UNIT_MAP = {
    'v': 1.0, 'mv': 1e-3, 'kv': 1e3,
    'a': 1.0, 'ma': 1e-3, 'ua': 1e-6,
    'hz': 1.0, 'khz': 1e3, 'mhz': 1e6, 'ghz': 1e9,
    'db': 1.0
}

TABLE_HEADERS = {
    'vdd_max': ['vdd', 'supply voltage', 'vcc'],
    'gain_db': ['gain', 'power gain'],
    'freq_max_hz': ['frequency range', 'f_max'],
}

SENTENCE_PATTERNS = {
    'noise_figure_db': r"noise\s*figure[^0-9]*([0-9]+\.?[0-9]*)\s*dB",
}

# number, optional whitespace, optional unit that is not followed by another letter
UNIT_RE = re.compile(r"([0-9]+\.?[0-9]*)\s*((?:[kmg]?hz|[kmu]?[va]|db)(?![a-z]))?")

def _to_float(val):
    try:
        return float(str(val).strip())
    except ValueError:
        return np.nan

def parse_tables_plumber(path):
    out = {}
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                for row in table:
                    if not row:
                        continue
                    label = str(row[0]).lower()
                    for k, aliases in TABLE_HEADERS.items():
                        if not any(a.lower() in label for a in aliases):
                            continue
                        # Take the first cell containing a digit. In min/typ/max tables
                        # this may be the minimum column, so adjust for your layout.
                        num_cells = [c for c in row[1:] if re.search(r"[0-9]", str(c))]
                        if not num_cells:
                            continue
                        m = UNIT_RE.search(str(num_cells[0]).lower())
                        if m:
                            mag, unit = float(m.group(1)), m.group(2) or ''
                            out[k] = mag * UNIT_MAP.get(unit, 1.0)
    return out

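def advanced_parse_specs(text, path):
    """Combine sentence-regex and table specs; table values win on key clashes.

    Note: build_vector() only reads the keys listed in NUM_FEATURES, so keys such as
    'noise_figure_db' or 'freq_max_hz' must be added there to affect similarity.
    """
    specs = {}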
    # sentence regex pass
    for k, pat in SENTENCE_PATTERNS.items():
        m = re.search(pat, text, re.I)
        if m:
            specs[k] = _to_float(m.group(1))
    # table pass
    table_specs = parse_tables_plumber(path)
    specs.update(table_specs)
    return specs
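
The unit-conversion heuristic is easiest to reason about in isolation: find the first number followed by a known unit, then multiply by that unit's factor to reach base units. The sketch below runs on plain strings rather than PDF tables. `UNIT_FACTORS`, `QUANTITY_RE`, and `parse_quantity` are hypothetical names, and the cell strings are invented examples.

```python
import re

# Hypothetical multiplier table; keys are lower-cased unit strings.
# Lower-casing means 'mV' and 'MV' cannot be told apart, a known limitation.
UNIT_FACTORS = {
    "v": 1.0, "mv": 1e-3, "kv": 1e3,
    "a": 1.0, "ma": 1e-3, "ua": 1e-6,
    "hz": 1.0, "khz": 1e3, "mhz": 1e6, "ghz": 1e9,
    "db": 1.0,
}
# Longest units first so 'mhz' is tried before 'hz'.
_UNIT_ALT = "|".join(sorted(UNIT_FACTORS, key=len, reverse=True))
# The negative lookahead stops '5 max' from being read as 5 mA.
QUANTITY_RE = re.compile(rf"([0-9]+\.?[0-9]*)\s*({_UNIT_ALT})(?![a-z])")

def parse_quantity(cell):
    """Return the first '<number> <unit>' in a cell converted to base units, or None."""
    m = QUANTITY_RE.search(str(cell).lower())
    if not m:
        return None
    return float(m.group(1)) * UNIT_FACTORS[m.group(2)]

for cell in ["3.3V", "500 mV", "2.4 GHz", "typ 20 dB", "5 max", "n/a"]:
    print(repr(cell), "->", parse_quantity(cell))
```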

## 7. FAISS similarity search
After clustering, you may want to retrieve the most similar products to a query part or to a free‑text description. This cell builds an **in‑memory FAISS index** over your vectors and shows how to query it.

In [None]:
import faiss

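# Build index (run after you've created X)
d = X.shape[1]
X_norm = X.astype('float32')  # astype returns a copy, so X itself is left untouched
faiss.normalize_L2(X_norm)  # in place; the appended numeric features break unit length
index = faiss.IndexFlatIP(d)  # inner product equals cosine on L2‑normalised vectors
index.add(X_norm)

def search(query_text, k=5):
    q_vec = model.encode(query_text, normalize_embeddings=True)
    q_vec = np.hstack([q_vec, np.zeros(len(NUM_FEATURES))])  # pad numeric zeros; still unit length
    D, I = index.search(np.expand_dims(q_vec.astype('float32'), 0), k)
    # FAISS pads results with -1 when fewer than k vectors are indexed
    return [(rows[i]['file'], float(D[0][j])) for j, i in enumerate(I[0]) if i != -1]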

print(search("high gain 5 GHz amplifier", k=3))
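
To make the cosine claim concrete, the sketch below is a brute-force NumPy stand-in for what a flat inner-product index computes: normalise every vector to unit length, take dot products with the query, and keep the top k. `top_k_cosine` is a hypothetical helper, the 2-D vectors are toy data, and it assumes no row is all zeros.

```python
import numpy as np

def top_k_cosine(matrix, query, k=3):
    """Return (row_index, cosine) pairs for the k rows most similar to `query`."""
    matrix = np.asarray(matrix, dtype=float)
    query = np.asarray(query, dtype=float)
    m_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    q_unit = query / np.linalg.norm(query)
    scores = m_unit @ q_unit  # inner product of unit vectors equals cosine
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]

# Toy 2-D "embeddings", invented for illustration.
vectors = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
hits = top_k_cosine(vectors, [1.0, 0.1], k=2)
print(hits)
```

Without the normalisation step, rows with large magnitudes (here the third vector) would be favoured regardless of direction, which is why the inner-product index expects unit-length input.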

## 8. FastAPI micro‑service
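The following cell defines a **FastAPI** app that lets you `POST` a PDF and receive the top‑N similar products. To serve it, export the notebook code (including the helpers, `rows`, and `index` it relies on) to `main.py` and run `uvicorn main:app --reload`, or run `python main.py` to use the `__main__` block. Launching `uvicorn.run` from inside a notebook cell generally fails because the kernel already runs an event loop.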

> **Install**:
> ```bash
> pip install fastapi uvicorn python-multipart
> ```

In [None]:
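from fastapi import FastAPI, File, UploadFile
import os, tempfile, shutil
import faiss

app = FastAPI(title='Datasheet similarity API')

@app.post('/search')
def search_pdf(file: UploadFile = File(...), top_k: int = 5):
    # Plain `def`: FastAPI runs it in a worker thread, so the blocking PDF
    # parsing below does not stall the event loop.
    # Save the upload to a temp file because the extractors expect a path.
    with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as tmp:
        shutil.copyfileobj(file.file, tmp)
        tmp_path = tmp.name
    try:
        text, tables = extract_pdf(tmp_path)
        specs = advanced_parse_specs(text, tmp_path)
    finally:
        os.unlink(tmp_path)  # delete=False means we must clean up ourselves
    vec = build_vector(text, specs)
    q = np.expand_dims(vec.astype('float32'), 0)
    faiss.normalize_L2(q)  # unit-length query keeps scores comparable across uploads
    D, I = index.search(q, top_k)
    # FAISS pads results with -1 when fewer than top_k vectors are indexed
    hits = [{'file': rows[i]['file'], 'score': float(D[0][j])} for j, i in enumerate(I[0]) if i != -1]
    return {'hits': hits, 'specs': specs}

if __name__ == '__main__':
    import uvicorn
    # 0.0.0.0 listens on all interfaces; use 127.0.0.1 for local-only testing
    uvicorn.run(app, host='0.0.0.0', port=8000)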