Web search today is query-bound: each search lives in its own bubble, whether you use Google, Bing, or anything else.
This research project asks whether that boundary can be removed, letting results from different queries relate to each other by meaning. The idea: collate multi-query Google searches into a single connected dataset and see which insights emerge from relationships across searches, not within them. In other words, treat search results not as isolated lists but as a connected semantic space, and explore the relationships inside it.
General-purpose UI in use:
Search engines rank results within a query; this approach lets you explore relationships across queries. That difference enables a different kind of discovery: it surfaces what you didn't know to ask.
This system:
- collects SERP results across many queries
- merges them into a single corpus
- embeds each result (title + snippet + context)
- runs k-nearest neighbors over the entire dataset
The result is a semantic space over search results, where any result can surface related results from completely different queries.
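The pipeline above can be sketched with toy vectors and stdlib Python only. In the real system the embeddings come from `nomic-embed-text` via Ollama and live in Chroma; the hand-made 3-d vectors, queries, and titles below are purely illustrative:

```python
import math

# Toy stand-ins for embedded SERP rows: (query, title, vector).
# Real vectors come from nomic-embed-text via Ollama; these are hand-made
# so the example runs without any services.
corpus = [
    ("npu benchmarks",      "Edge NPU inference benchmarks", [0.9, 0.1, 0.0]),
    ("onnx runtime mobile", "ONNX Runtime on Android",       [0.8, 0.2, 0.1]),
    ("browser llm",         "Running LLMs with WebGPU",      [0.1, 0.9, 0.2]),
    ("quantization",        "8-bit quantization guide",      [0.2, 0.8, 0.3]),
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def neighbors(i, k=2):
    """k nearest rows to corpus[i], ranked by cosine similarity."""
    scored = [
        (cosine(corpus[i][2], row[2]), j)
        for j, row in enumerate(corpus) if j != i
    ]
    return [corpus[j] for _, j in sorted(scored, reverse=True)[:k]]

# The top neighbors of the NPU-benchmark row come from *different* queries.
for query, title, _ in neighbors(0, k=2):
    print(f"{query!r}: {title}")
```

Because all rows sit in one merged corpus, the k-NN lookup is free to cross query boundaries; nothing in the distance computation knows which query a row came from.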
This surfaces connections that a single Google search never will: a user can uncover cross-query relationships without tab hopping and with far less mental load.
See `./metrics.json` for the numbers from a typical run.
Cross-query filter enabled:
Search filter applied:
Instead of isolated results, you can read off chains like:
NPU → ONNX Runtime → mobile inference → browser LLM → quantization
That ties together hardware capability, runtime systems, deployment environments, and optimization techniques. No single Google query returns the whole chain. A person would need many searches, many tabs, and the luck to know what to ask next. This pipeline collapses that into one corpus and one k-NN view over it.
Neighbors are not siloed by format or intent. You get links across research (e.g. arXiv), product (vendor blogs, announcements), implementation (GitHub, tutorials), and constraints (benchmarks, limitations, discussions). An academic benchmark shows up right next to a deployment guide; a model comparison next to a hardware choice; runtime docs next to real-world usage. This helps a lot with real research work.
The corpus naturally mixes forum threads, vendor blogs and docs, long-form tutorials, repos, and papers. Those usually live in separate searches and only come together after manual synthesis. Here, k-NN surfaces them next to each other from shared embedding geometry—not because they shared a SERP, but because the text reads similarly in model space.
Anchor — Edge Devices Inference Performance Comparison (arXiv).
For this document, the computed k=8 neighborhood includes:
- hardware comparisons (e.g. Jetson vs Coral)
- browser inference (WebAssembly / WebGPU)
- implementation guides (TensorFlow Lite / ONNX Runtime)
So one click lets you see the hops from:
model efficiency → hardware choice → runtime → deployment strategy
Those pieces are related, but they usually sit under different queries, domains, and mental models. Ordinary search does not assemble them for you. Google ranks within a query; a k-NN based approach explores across the merged corpus. This gives you a locally coherent semantic neighborhood you can explore at leisure and uncover hidden connections.
| File / dir | Role |
|---|---|
| `queries.json` | 20 Google query strings (edit or replace) |
| `ingest.py` | Bright Data → `data/serp.duckdb` (merge by seed; `--refresh` truncates all) |
| `embed.py` | DuckDB rows → Ollama embed → Chroma collection |
| `neighbors.py` | Chroma k-NN + DuckDB hydration (`compute_neighbors`; used by `serve.py`) |
| `serve.py` | FastAPI + Uvicorn at `http://127.0.0.1:8766/` — rows + neighbors API + static UI |
| `static/` | UI |
| `internal/` | article draft and helper scripts |
| `docker-compose.yml` | Optional Chroma on port 8000 |
- Python 3.10+ and `uv`
- `.env` in this folder (or cwd when you run the scripts) with at least Bright Data (see below). Ollama and Chroma use defaults in code; add env vars only if you override hosts/ports.
- Ollama with `nomic-embed-text:latest` available (e.g. `ollama list` / `ollama pull nomic-embed-text` — tags as `latest` in your local index)
- Chroma over HTTP, e.g. `docker compose up -d` in this folder (defaults: `CHROMA_HOST=localhost`, `CHROMA_PORT=8000`)
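For a standard local setup, the minimal `.env` is just the Bright Data pair. A sketch (placeholders, not real values; optional overrides are listed in the table below):

```
# scripts/knn/.env
BRIGHT_DATA_API_KEY=...   # Request API bearer token
BRIGHT_DATA_ZONE=...      # SERP zone name
# Optional, only if you override defaults:
# OLLAMA_HOST=http://127.0.0.1:11434
# CHROMA_HOST=localhost
# CHROMA_PORT=8000
```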
| Variable | Required? | Purpose |
|---|---|---|
| `BRIGHT_DATA_API_KEY` | Yes (for `ingest.py`) | Bright Data Request API bearer token |
| `BRIGHT_DATA_ZONE` | Yes | SERP zone name |
| `BRIGHT_DATA_COUNTRY` | No | Optional country hint for routing |
| `DUCKDB_PATH` | No | Default `./data/serp.duckdb` (relative to cwd) |
| `OLLAMA_HOST` | No | Default `http://127.0.0.1:11434` |
| `EMBEDDING_MODEL` | No | Default `nomic-embed-text:latest` |
| `CHROMA_HOST` / `CHROMA_PORT` / `CHROMA_SSL` | No | Default `localhost` / `8000` / off |
| `CHROMA_COLLECTION` | No | Default `serp_knn` |
| `SERVE_PORT` | No | Default `8766` |
| `QUERIES_JSON` | No | Default `./queries.json` |
Nothing else is required in .env for a standard local Ollama + Docker Chroma setup.
| Variable | Default |
|---|---|
| `DUCKDB_PATH` | `./data/serp.duckdb` (relative to `scripts/knn` cwd) |
| `OLLAMA_HOST` | `http://127.0.0.1:11434` |
| `EMBEDDING_MODEL` | `nomic-embed-text:latest` |
| `CHROMA_HOST` / `CHROMA_PORT` | `localhost` / `8000` |
| `CHROMA_COLLECTION` | `serp_knn` |
| `SERVE_PORT` | `8766` |
Working directory: cd scripts/knn for all commands.
```
cd scripts/knn
uv venv
uv pip install -r requirements.txt
# Windows: .venv\Scripts\activate | macOS/Linux: source .venv/bin/activate
docker compose up -d
python ingest.py   # skips seeds already in DuckDB; add `--refresh` to wipe + refetch all
python embed.py
python serve.py
```

Equivalent: `uvicorn serve:app --host 127.0.0.1 --port 8766` (from `scripts/knn` with the same venv).

Open `http://127.0.0.1:8766/`: filter the table (plain text), then click a row (not the link) to load its k=8 nearest neighbors.
```
uv pip install -r requirements-dev.txt
pytest
```

By default this runs pure logic + FastAPI stubs (no Chroma or Bright Data). If Chroma is up on `CHROMA_HOST` / `CHROMA_PORT` (same defaults as `docker-compose.yml`: `localhost:8000`), an extra `@pytest.mark.integration` check runs `compute_neighbors` against real Chroma; if nothing is listening, it is skipped.
Bright Data SERP smoke (optional) — hits the paid Request API once. Not recommended for routine CI: cost, network, and upstream flakiness. Opt in:
```
# Windows PowerShell
$env:BRIGHT_DATA_LIVE_TEST="1"; pytest tests/test_bright_data_live.py -v

# macOS / Linux
BRIGHT_DATA_LIVE_TEST=1 pytest tests/test_bright_data_live.py -v
```

Requires valid `BRIGHT_DATA_API_KEY` / `BRIGHT_DATA_ZONE` (e.g. in `scripts/knn/.env`). You can add `BRIGHT_DATA_LIVE_TEST=1` there as well; the live test loads `scripts/knn/.env` explicitly, so it still runs when pytest is started from another working directory.


