
Turning Google into an Explorable Knowledge Graph Using Pure k-NN

Goal

Web search today is query-bound. Each search lives in its own bubble, whether you use Google, Bing, or anything else.

This is a research project that removes that boundary and lets results from different queries relate to each other by meaning: collate multi-query Google searches into a single connected dataset, then look for insights that emerge from relationships across searches rather than within them. In other words, treat search results not as isolated lists but as a connected semantic space, and explore the relationships inside it.

General-purpose UI in use:


Search engines rank results within a query; this approach lets you explore relationships across queries. That difference enables a new kind of discovery: it surfaces what you didn't know to ask.

What this does

This system:

  • collects SERP results across many queries
  • merges them into a single corpus
  • embeds each result (title + snippet + context)
  • runs k-nearest neighbors over the entire dataset

The result is a semantic space over search results, where any result can surface related results from completely different queries.
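The k-NN step can be sketched with toy vectors. This is a minimal, self-contained illustration using cosine similarity over NumPy arrays; the actual pipeline delegates this to Chroma over Ollama embeddings, and the vectors below are made up for demonstration:

```python
import numpy as np

def knn(vectors: np.ndarray, query_idx: int, k: int) -> list[int]:
    """Return indices of the k nearest rows to vectors[query_idx] by cosine similarity."""
    # Normalize rows so the dot product equals cosine similarity
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[query_idx]
    # Highest similarity first, excluding the query row itself
    order = np.argsort(-sims)
    return [int(i) for i in order if i != query_idx][:k]

# Toy "embeddings": rows 0 and 1 point the same way, row 2 is orthogonal
corpus = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
```

In the real system, each row would be the embedding of title + snippet + context for one SERP result, regardless of which query produced it.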

This surfaces connections that a single Google search will never return: no tab hopping, and far less mental load.

See ./metrics.json for metrics from a typical run.

More Screenshots

Cross-query filter enabled:


Search filter applied:


Why this is cool

1) Semantic stepping stones (multi-hop insight, instantly)

Instead of isolated results, you can read off chains like:

NPU → ONNX Runtime → mobile inference → browser LLM → quantization

That ties together hardware capability, runtime systems, deployment environments, and optimization techniques. No single Google query returns the whole chain. A person would need many searches, many tabs, and the luck to know what to ask next. This pipeline collapses that into one corpus and one k-NN view over it.
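The multi-hop idea can be sketched as a greedy walk over a neighbor function. The neighbor lists below are made-up stand-ins for real k-NN output, not data from the pipeline:

```python
def hop_chain(start, neighbors_fn, hops):
    """Follow the top unseen neighbor repeatedly to build a chain of related results."""
    chain = [start]
    for _ in range(hops):
        # Drop anything already visited so the walk keeps moving outward
        fresh = [n for n in neighbors_fn(chain[-1]) if n not in chain]
        if not fresh:
            break
        chain.append(fresh[0])
    return chain

# Hypothetical neighbor lists mimicking the example chain
NEIGHBORS = {
    "NPU": ["ONNX Runtime"],
    "ONNX Runtime": ["mobile inference", "NPU"],
    "mobile inference": ["browser LLM"],
    "browser LLM": ["quantization"],
}
```

In the UI this walk is manual (click a result, click one of its neighbors, repeat), but the mechanics are the same.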

2) Cross-register connections

Neighbors are not siloed by format or intent. You get links across research (e.g. arXiv), product (vendor blogs, announcements), implementation (GitHub, tutorials), and constraints (benchmarks, limitations, discussions). An academic benchmark shows up right next to a deployment guide; a model comparison next to a hardware choice; runtime docs next to real-world usage. This is a major help in real research work.
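One way to make this mix visible is to bucket a neighbor list by a coarse source category. The record schema here (`register` and `title` keys) is hypothetical, not the project's actual row format:

```python
def group_by_register(neighbors):
    """Bucket neighbor records by source register (research, product, implementation, ...)."""
    buckets = {}
    for n in neighbors:
        buckets.setdefault(n["register"], []).append(n["title"])
    return buckets

# Hypothetical neighbor records
sample = [
    {"register": "research", "title": "Edge inference benchmark (arXiv)"},
    {"register": "implementation", "title": "ONNX Runtime mobile tutorial"},
    {"register": "research", "title": "Quantization ablation study"},
]
```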

3) Cross-domain overlap (without manual synthesis)

The corpus naturally mixes forum threads, vendor blogs and docs, long-form tutorials, repos, and papers. Those usually live in separate searches and only come together after manual synthesis. Here, k-NN surfaces them next to each other from shared embedding geometry: not because they shared a SERP, but because the text reads similarly in model space.

Example

Anchor: Edge Devices Inference Performance Comparison (arXiv).

For this document, other results in the same computed neighborhood (k=8) include:

  • hardware comparisons (e.g. Jetson vs Coral)
  • browser inference (WebAssembly / WebGPU)
  • implementation guides (TensorFlow Lite / ONNX Runtime)

So one click lets you see the hops from:

model efficiency → hardware choice → runtime → deployment strategy

Those pieces are related, but they usually sit under different queries, domains, and mental models. Ordinary search does not assemble them for you: Google ranks within a query, while a k-NN based approach explores across the merged corpus. This gives you a locally coherent semantic neighborhood you can explore at leisure to uncover hidden connections.

Layout

| File / dir | Role |
| --- | --- |
| `queries.json` | 20 Google query strings (edit or replace) |
| `ingest.py` | Bright Data → `data/serp.duckdb` (merge by seed; `--refresh` truncates all) |
| `embed.py` | DuckDB rows → Ollama embed → Chroma collection |
| `neighbors.py` | Chroma k-NN + DuckDB hydration (`compute_neighbors`; used by `serve.py`) |
| `serve.py` | FastAPI + Uvicorn at http://127.0.0.1:8766/ (rows + neighbors API + static UI) |
| `static/` | UI |
| `internal/` | article draft and helper scripts |
| `docker-compose.yml` | Optional Chroma on port 8000 |

Prereqs

  • Python 3.10+ and uv
  • .env in this folder (or cwd when you run the scripts) with at least Bright Data (see below). Ollama and Chroma use defaults in code; add env vars only if you override hosts/ports.
  • Ollama with `nomic-embed-text:latest` available (check with `ollama list`; `ollama pull nomic-embed-text` tags it as `latest` in your local index)
  • Chroma over HTTP, e.g. docker compose up -d in this folder (defaults: CHROMA_HOST=localhost CHROMA_PORT=8000)

.env — required vs optional

| Variable | Required? | Purpose |
| --- | --- | --- |
| `BRIGHT_DATA_API_KEY` | Yes (for `ingest.py`) | Bright Data Request API bearer token |
| `BRIGHT_DATA_ZONE` | Yes | SERP zone name |
| `BRIGHT_DATA_COUNTRY` | No | Optional country hint for routing |
| `DUCKDB_PATH` | No | Default `./data/serp.duckdb` (relative to cwd) |
| `OLLAMA_HOST` | No | Default `http://127.0.0.1:11434` |
| `EMBEDDING_MODEL` | No | Default `nomic-embed-text:latest` |
| `CHROMA_HOST` / `CHROMA_PORT` / `CHROMA_SSL` | No | Default `localhost` / `8000` / off |
| `CHROMA_COLLECTION` | No | Default `serp_knn` |
| `SERVE_PORT` | No | Default `8766` |
| `QUERIES_JSON` | No | Default `./queries.json` |

Nothing else is required in .env for a standard local Ollama + Docker Chroma setup.
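A minimal `.env` for that standard setup might look like this (the values are placeholders, not real credentials; the commented lines just restate the code defaults):

```shell
# Required for ingest.py (Bright Data Request API)
BRIGHT_DATA_API_KEY=your-bearer-token
BRIGHT_DATA_ZONE=your-serp-zone

# Optional overrides (defaults shown)
# OLLAMA_HOST=http://127.0.0.1:11434
# CHROMA_HOST=localhost
# CHROMA_PORT=8000
```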

Env defaults (reference)

| Variable | Default |
| --- | --- |
| `DUCKDB_PATH` | `./data/serp.duckdb` (relative to `scripts/knn` cwd) |
| `OLLAMA_HOST` | `http://127.0.0.1:11434` |
| `EMBEDDING_MODEL` | `nomic-embed-text:latest` |
| `CHROMA_HOST` / `CHROMA_PORT` | `localhost` / `8000` |
| `CHROMA_COLLECTION` | `serp_knn` |
| `SERVE_PORT` | `8766` |

Working directory: cd scripts/knn for all commands.

Run

```shell
cd scripts/knn
uv venv
uv pip install -r requirements.txt
# Windows: .venv\Scripts\activate  |  macOS/Linux: source .venv/bin/activate
docker compose up -d
python ingest.py          # skips seeds already in DuckDB; add --refresh to wipe + refetch all
python embed.py
python serve.py
```

Equivalent: uvicorn serve:app --host 127.0.0.1 --port 8766 (from scripts/knn with the same venv).

Open http://127.0.0.1:8766/: filter the table (plain text), click a row (not the link) to load k=8 nearest neighbors.

Tests

```shell
uv pip install -r requirements-dev.txt
pytest
```

By default this runs pure logic + FastAPI stubs (no Chroma or Bright Data). If Chroma is up on CHROMA_HOST / CHROMA_PORT (same defaults as docker-compose.yml: localhost:8000), an extra @pytest.mark.integration check runs compute_neighbors against real Chroma; if nothing is listening, it is skipped.

Bright Data SERP smoke (optional) — hits the paid Request API once. Not recommended for routine CI: cost, network, and upstream flakiness. Opt in:

```shell
# Windows PowerShell
$env:BRIGHT_DATA_LIVE_TEST="1"; pytest tests/test_bright_data_live.py -v

# macOS / Linux
BRIGHT_DATA_LIVE_TEST=1 pytest tests/test_bright_data_live.py -v
```

Requires valid BRIGHT_DATA_API_KEY / BRIGHT_DATA_ZONE (e.g. in scripts/knn/.env). You can add BRIGHT_DATA_LIVE_TEST=1 there as well; the live test loads scripts/knn/.env explicitly so it still runs when pytest is started from another working directory.
