
Turning Google into an Explorable Knowledge Graph Using Pure k-NN

Goal

Web search today is query-bound. Each search lives in its own bubble, whether you use Google, Bing, or anything else.

This is a research project that removes that boundary and lets results from different queries relate to each other by meaning: collate multi-query Google searches into a single connected dataset, then look for insights that emerge from relationships across searches rather than within them. In other words, treat search results not as isolated lists but as a connected semantic space, and explore the relationships inside it.

General-purpose UI in use:


Search engines rank results within a query; this approach lets you explore relationships across queries. That difference enables a new kind of discovery: it surfaces what you didn't know to ask.

What this does

This system:

  • collects SERP results across many queries
  • merges them into a single corpus
  • embeds each result (title + snippet + context)
  • runs k-nearest neighbors over the entire dataset

The result is a semantic space over search results, where any result can surface related results from completely different queries.
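The k-NN step can be sketched with toy vectors. This is a minimal, self-contained illustration using cosine similarity over NumPy arrays; the actual pipeline delegates this to Chroma over Ollama embeddings, and the vectors below are made up for demonstration:

```python
import numpy as np

def knn(vectors: np.ndarray, query_idx: int, k: int) -> list[int]:
    """Return indices of the k nearest rows to vectors[query_idx] by cosine similarity."""
    # Normalize rows so the dot product equals cosine similarity
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[query_idx]
    # Highest similarity first, excluding the query row itself
    order = np.argsort(-sims)
    return [int(i) for i in order if i != query_idx][:k]

# Toy "embeddings": rows 0 and 1 point the same way, row 2 is orthogonal
corpus = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
```

In the real system, each row would be the embedding of title + snippet + context for one SERP result, regardless of which query produced it.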

This surfaces connections that a single Google search will never return: no tab hopping, and far less mental load.

See ./metrics.json for metrics from a typical run.

More Screenshots

Cross-query filter enabled:


Search filter applied:


Why this is cool

1) Semantic stepping stones (multi-hop insight, instantly)

Instead of isolated results, you can read off chains like:

NPU → ONNX Runtime → mobile inference → browser LLM → quantization

That ties together hardware capability, runtime systems, deployment environments, and optimization techniques. No single Google query returns the whole chain. A person would need many searches, many tabs, and the luck to know what to ask next. This pipeline collapses that into one corpus and one k-NN view over it.
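The multi-hop idea can be sketched as a greedy walk over a neighbor function. The neighbor lists below are made-up stand-ins for real k-NN output, not data from the pipeline:

```python
def hop_chain(start, neighbors_fn, hops):
    """Follow the top unseen neighbor repeatedly to build a chain of related results."""
    chain = [start]
    for _ in range(hops):
        # Drop anything already visited so the walk keeps moving outward
        fresh = [n for n in neighbors_fn(chain[-1]) if n not in chain]
        if not fresh:
            break
        chain.append(fresh[0])
    return chain

# Hypothetical neighbor lists mimicking the example chain
NEIGHBORS = {
    "NPU": ["ONNX Runtime"],
    "ONNX Runtime": ["mobile inference", "NPU"],
    "mobile inference": ["browser LLM"],
    "browser LLM": ["quantization"],
}
```

In the UI this walk is manual (click a result, click one of its neighbors, repeat), but the mechanics are the same.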

2) Cross-register connections

Neighbors are not siloed by format or intent. You get links across research (e.g. arXiv), product (vendor blogs, announcements), implementation (GitHub, tutorials), and constraints (benchmarks, limitations, discussions). An academic benchmark shows up right next to a deployment guide; a model comparison next to a hardware choice; runtime docs next to real-world usage. This is a major help in real research work.
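One way to make this mix visible is to bucket a neighbor list by a coarse source category. The record schema here (`register` and `title` keys) is hypothetical, not the project's actual row format:

```python
def group_by_register(neighbors):
    """Bucket neighbor records by source register (research, product, implementation, ...)."""
    buckets = {}
    for n in neighbors:
        buckets.setdefault(n["register"], []).append(n["title"])
    return buckets

# Hypothetical neighbor records
sample = [
    {"register": "research", "title": "Edge inference benchmark (arXiv)"},
    {"register": "implementation", "title": "ONNX Runtime mobile tutorial"},
    {"register": "research", "title": "Quantization ablation study"},
]
```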

3) Cross-domain overlap (without manual synthesis)

The corpus naturally mixes forum threads, vendor blogs and docs, long-form tutorials, repos, and papers. Those usually live in separate searches and only come together after manual synthesis. Here, k-NN surfaces them next to each other from shared embedding geometry: not because they shared a SERP, but because the text reads similarly in model space.

Example

Anchor: Edge Devices Inference Performance Comparison (arXiv).

For this document, other results in the same computed neighborhood (k=8) include:

  • hardware comparisons (e.g. Jetson vs Coral)
  • browser inference (WebAssembly / WebGPU)
  • implementation guides (TensorFlow Lite / ONNX Runtime)

So one click lets you see the hops from:

model efficiency → hardware choice → runtime → deployment strategy

Those pieces are related, but they usually sit under different queries, domains, and mental models. Ordinary search does not assemble them for you: Google ranks within a query, while a k-NN based approach explores across the merged corpus. This gives you a locally coherent semantic neighborhood you can explore at leisure to uncover hidden connections.

Layout

| File / dir | Role |
| --- | --- |
| `queries.json` | 20 Google query strings (edit or replace) |
| `ingest.py` | Bright Data → `data/serp.duckdb` (merge by seed; `--refresh` truncates all) |
| `embed.py` | DuckDB rows → Ollama embed → Chroma collection |
| `neighbors.py` | Chroma k-NN + DuckDB hydration (`compute_neighbors`; used by `serve.py`) |
| `serve.py` | FastAPI + Uvicorn at http://127.0.0.1:8766/ (rows + neighbors API + static UI) |
| `static/` | UI |
| `internal/` | article draft and helper scripts |
| `docker-compose.yml` | Optional Chroma on port 8000 |

Prereqs

  • Python 3.10+ and uv
  • .env in this folder (or cwd when you run the scripts) with at least Bright Data (see below). Ollama and Chroma use defaults in code; add env vars only if you override hosts/ports.
  • Ollama with `nomic-embed-text:latest` available (check with `ollama list`; `ollama pull nomic-embed-text` tags it as `latest` in your local index)
  • Chroma over HTTP, e.g. docker compose up -d in this folder (defaults: CHROMA_HOST=localhost CHROMA_PORT=8000)

.env — required vs optional

| Variable | Required? | Purpose |
| --- | --- | --- |
| `BRIGHT_DATA_API_KEY` | Yes (for `ingest.py`) | Bright Data Request API bearer token |
| `BRIGHT_DATA_ZONE` | Yes | SERP zone name |
| `BRIGHT_DATA_COUNTRY` | No | Optional country hint for routing |
| `DUCKDB_PATH` | No | Default `./data/serp.duckdb` (relative to cwd) |
| `OLLAMA_HOST` | No | Default `http://127.0.0.1:11434` |
| `EMBEDDING_MODEL` | No | Default `nomic-embed-text:latest` |
| `CHROMA_HOST` / `CHROMA_PORT` / `CHROMA_SSL` | No | Default `localhost` / `8000` / off |
| `CHROMA_COLLECTION` | No | Default `serp_knn` |
| `SERVE_PORT` | No | Default `8766` |
| `QUERIES_JSON` | No | Default `./queries.json` |

Nothing else is required in .env for a standard local Ollama + Docker Chroma setup.
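A minimal `.env` for that standard setup might look like this (the values are placeholders, not real credentials; the commented lines just restate the code defaults):

```shell
# Required for ingest.py (Bright Data Request API)
BRIGHT_DATA_API_KEY=your-bearer-token
BRIGHT_DATA_ZONE=your-serp-zone

# Optional overrides (defaults shown)
# OLLAMA_HOST=http://127.0.0.1:11434
# CHROMA_HOST=localhost
# CHROMA_PORT=8000
```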

Env defaults (reference)

| Variable | Default |
| --- | --- |
| `DUCKDB_PATH` | `./data/serp.duckdb` (relative to `scripts/knn` cwd) |
| `OLLAMA_HOST` | `http://127.0.0.1:11434` |
| `EMBEDDING_MODEL` | `nomic-embed-text:latest` |
| `CHROMA_HOST` / `CHROMA_PORT` | `localhost` / `8000` |
| `CHROMA_COLLECTION` | `serp_knn` |
| `SERVE_PORT` | `8766` |

Working directory: cd scripts/knn for all commands.

Run

```shell
cd scripts/knn
uv venv
uv pip install -r requirements.txt
# Windows: .venv\Scripts\activate  |  macOS/Linux: source .venv/bin/activate
docker compose up -d
python ingest.py          # skips seeds already in DuckDB; add --refresh to wipe + refetch all
python embed.py
python serve.py
```

Equivalent: uvicorn serve:app --host 127.0.0.1 --port 8766 (from scripts/knn with the same venv).

Open http://127.0.0.1:8766/: filter the table (plain text), click a row (not the link) to load k=8 nearest neighbors.

Tests

```shell
uv pip install -r requirements-dev.txt
pytest
```

By default this runs pure logic + FastAPI stubs (no Chroma or Bright Data). If Chroma is up on CHROMA_HOST / CHROMA_PORT (same defaults as docker-compose.yml: localhost:8000), an extra @pytest.mark.integration check runs compute_neighbors against real Chroma; if nothing is listening, it is skipped.

Bright Data SERP smoke (optional) — hits the paid Request API once. Not recommended for routine CI: cost, network, and upstream flakiness. Opt in:

```shell
# Windows PowerShell
$env:BRIGHT_DATA_LIVE_TEST="1"; pytest tests/test_bright_data_live.py -v

# macOS / Linux
BRIGHT_DATA_LIVE_TEST=1 pytest tests/test_bright_data_live.py -v
```

Requires valid BRIGHT_DATA_API_KEY / BRIGHT_DATA_ZONE (e.g. in scripts/knn/.env). You can add BRIGHT_DATA_LIVE_TEST=1 there as well; the live test loads scripts/knn/.env explicitly so it still runs when pytest is started from another working directory.
