GraphRAG on TuringDB — Tutorial

A step-by-step guide to building a graph-augmented retrieval (graphRAG) pipeline on top of TuringDB, a columnar graph database with git-like versioning.

By the end of this tutorial you will have:

  • A knowledge graph of AI researchers, companies, technologies, and papers loaded into TuringDB
  • A vector index over node descriptions for semantic search
  • An interactive Q&A pipeline that combines vector search, graph traversal, and optionally an LLM to answer natural-language questions

What is graphRAG?

Standard RAG retrieves text chunks by semantic similarity alone. GraphRAG adds a second step — once similar nodes are found, it traverses the graph to pull in structurally related context that pure vector search would miss.

Question: "Who developed the Transformer?"
              |
              v
   [1. Vector search]
       Embed the question, find the top-k most similar nodes
       → Transformer, Attention Mechanism, Attention Is All You Need, ...
              |
              v
   [2. Graph expansion]
       Follow edges from those seed nodes (1 hop, both directions)
       → Large Language Model  (via BUILDS_ON → Transformer)
       → Yoshua Bengio         (via RESEARCHES → Attention Mechanism)
       → Google DeepMind       (via WORKS_AT ← Geoffrey Hinton)
              |
              v
   [3. LLM generation]
       Feed the question + all retrieved context to an LLM
       → Synthesized answer grounded in graph-derived knowledge

The graph structure surfaces connections that embedding similarity alone tends to miss: who works where, what technology a paper introduced, which company develops which model.
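The two retrieval steps can be sketched in plain Python without a database. This is a toy illustration only: the 3-dimensional vectors are made-up values rather than real model output, and the helper names (vector_search, expand_one_hop) are hypothetical, not part of this project or of TuringDB.

```python
import math

# Toy 3-dimensional "embeddings" (hypothetical values, not real model output).
EMBEDDINGS = {
    "Transformer": [0.9, 0.1, 0.0],
    "Attention Mechanism": [0.8, 0.2, 0.1],
    "Large Language Model": [0.3, 0.9, 0.1],
    "Yoshua Bengio": [0.1, 0.2, 0.9],
}

# Directed edges as (source, edge_type, target), taken from the sample graph.
EDGES = [
    ("Large Language Model", "BUILDS_ON", "Transformer"),
    ("Transformer", "BUILDS_ON", "Attention Mechanism"),
    ("Yoshua Bengio", "RESEARCHES", "Attention Mechanism"),
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def vector_search(query_vec, top_k):
    """Step 1: return the top_k node names most similar to the query."""
    ranked = sorted(
        EMBEDDINGS,
        key=lambda name: cosine(query_vec, EMBEDDINGS[name]),
        reverse=True,
    )
    return ranked[:top_k]


def expand_one_hop(seeds):
    """Step 2: follow edges in both directions, skipping nodes already seen."""
    seed_set = set(seeds)
    related = []
    for src, _edge_type, dst in EDGES:
        for here, there in ((src, dst), (dst, src)):
            if here in seed_set and there not in seed_set and there not in related:
                related.append(there)
    return related


seeds = vector_search([1.0, 0.0, 0.0], top_k=2)
related = expand_one_hop(seeds)
print("seeds:", seeds)
print("related:", related)
```

With these toy vectors, neither Large Language Model nor Yoshua Bengio is close to the query, yet both are reached through a single edge from a seed.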


Prerequisites

Requirement           Check              Install
Python 3.11+          python3 --version  python.org
uv (package manager)  uv --version       curl -LsSf https://astral.sh/uv/install.sh | sh
TuringDB              see Step 1 below   uv add turingdb (included in this project)

Optional (for Step 5):

Requirement                   Check             Install
Ollama                        ollama --version  ollama.com
or any OpenAI-compatible API  -                 Set LLM_BASE_URL and LLM_API_KEY env vars

Step 1 — Install dependencies and start TuringDB

# Clone this repo (or cd into it if you already have it)
git clone <repo-url>
cd graphrag_example

# Install all Python dependencies
uv sync

Start the TuringDB server if it is not already running:

uv run turingdb start -turing-dir ~/.turing -demon

Verify the server is up:

uv run python -c "
from turingdb import TuringDB
client = TuringDB(host='http://localhost:6666')
print('TuringDB is running')
print('Loaded graphs:', client.list_loaded_graphs())
"

Expected output:

TuringDB is running
Loaded graphs: ['default']

Tip: If you get a connection error, make sure ~/.turing exists (mkdir -p ~/.turing) and re-run the start command.


Step 2 — Build the knowledge graph

This step creates a graph with 20 nodes and 31 edges, generates vector embeddings for every node, and loads them into a TuringDB vector index.

uv run python main.py --setup

Expected output:

Connecting to TuringDB at http://localhost:6666 ...
Graph 'graphrag_demo' ready.
Created 20 nodes and 31 edges.
Built vector index 'graphrag_demo_vectors' with 20 vectors (384d).
Setup complete.

What just happened?

1. Graph creation — The script opened a TuringDB change (like a git branch), created nodes with labels Person, Company, Technology, and Paper, committed them, then wired up edges (WORKS_AT, FOUNDED, AUTHORED, DEVELOPS, INTRODUCES, BUILDS_ON, RESEARCHES) and submitted the change to main.

2. Embedding — Every node has a text property containing a short description. The script embedded all 20 texts with all-MiniLM-L6-v2 (a 384-dimensional sentence-transformer), wrote the vectors to a CSV, and loaded them into a TuringDB vector index.
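As a rough sketch of the CSV step, the snippet below writes one row per node in the id,v1,...,vn layout. The function name write_vectors_csv is hypothetical, the 3-dimensional vectors stand in for the real 384-dimensional embeddings, and the absence of a header row is an assumption; check ingest.py for what the project actually does.

```python
import csv
import io


def write_vectors_csv(rows, out):
    """Write (node_id, vector) pairs as CSV rows of the form id,v1,...,vn.

    Returns the vector dimension, or None if rows is empty.
    Assumption: no header row is written.
    """
    writer = csv.writer(out, lineterminator="\n")
    dim = None
    for node_id, vector in rows:
        if dim is None:
            dim = len(vector)
        elif len(vector) != dim:
            raise ValueError(
                f"node {node_id}: expected {dim} values, got {len(vector)}"
            )
        writer.writerow([node_id, *vector])
    return dim


# Stand-in vectors; the real pipeline would pass the embedder's output here.
buffer = io.StringIO()
dim = write_vectors_csv([(1, [0.1, 0.2, 0.3]), (2, [0.4, 0.5, 0.6])], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Checking that every vector has the same length before loading catches a mismatch with the index dimension early, rather than at load time.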

The sample knowledge graph

Label       Count  Examples
Person      6      Geoffrey Hinton, Yann LeCun, Yoshua Bengio, Demis Hassabis, Sam Altman, Ilya Sutskever
Company     4      Google DeepMind, OpenAI, Meta AI, Anthropic
Technology  6      Transformer, Attention Mechanism, CNN, Reinforcement Learning, LLM, Diffusion Model
Paper       4      Attention Is All You Need, ImageNet/AlexNet, Playing Atari with Deep RL, BERT

Edge type   Example
WORKS_AT    Geoffrey Hinton → Google DeepMind
FOUNDED     Demis Hassabis → Google DeepMind
AUTHORED    Hinton, Sutskever → ImageNet paper
DEVELOPS    OpenAI → Large Language Model
INTRODUCES  Attention Is All You Need → Transformer
BUILDS_ON   Large Language Model → Transformer → Attention Mechanism
RESEARCHES  Yoshua Bengio → Attention Mechanism

Step 3 — Explore the graph with Cypher

Before running queries, take a look at what was created. Open a Python shell (uv run python) or run these one-liners:

# Node labels and edge types
uv run python -c "
from turingdb import TuringDB
client = TuringDB(host='http://localhost:6666')
client.set_graph('graphrag_demo')
print(client.query('CALL db.labels()'))
print(client.query('CALL db.edgeTypes()'))
"
   id      label
0   0    Company
1   1      Paper
2   2     Person
3   3 Technology

   id   edgeType
0   0   WORKS_AT
1   1    FOUNDED
2   2   AUTHORED
3   3   DEVELOPS
4   4 INTRODUCES
5   5  BUILDS_ON
6   6 RESEARCHES

Try some graph traversals:

# Who works where?
uv run python -c "
from turingdb import TuringDB
client = TuringDB(host='http://localhost:6666')
client.set_graph('graphrag_demo')
print(client.query(\"\"\"
    MATCH (p:Person)-[:WORKS_AT]->(c:Company)
    RETURN p.name, c.name
\"\"\"))
"
          p.name          c.name
0  Ilya Sutskever          OpenAI
1      Sam Altman          OpenAI
2  Demis Hassabis Google DeepMind
3      Yann LeCun         Meta AI
4 Geoffrey Hinton Google DeepMind

# What do companies develop?
uv run python -c "
from turingdb import TuringDB
client = TuringDB(host='http://localhost:6666')
client.set_graph('graphrag_demo')
print(client.query(\"\"\"
    MATCH (c:Company)-[:DEVELOPS]->(t:Technology)
    RETURN c.name, t.name
\"\"\"))
"
          c.name                       t.name
0 Google DeepMind       Reinforcement Learning
1 Google DeepMind         Large Language Model
2          OpenAI              Diffusion Model
3          OpenAI         Large Language Model
4         Meta AI Convolutional Neural Network
5         Meta AI         Large Language Model
6       Anthropic         Large Language Model

# Technology lineage: what builds on what?
uv run python -c "
from turingdb import TuringDB
client = TuringDB(host='http://localhost:6666')
client.set_graph('graphrag_demo')
print(client.query(\"\"\"
    MATCH (a:Technology)-[:BUILDS_ON]->(b:Technology)
    RETURN a.name, b.name
\"\"\"))
"
               a.name              b.name
0          Transformer Attention Mechanism
1 Large Language Model         Transformer

Step 4 — Ask questions (retrieval only, no LLM needed)

Run the interactive Q&A in retrieval-only mode — this shows the raw graphRAG retrieval pipeline without needing an LLM:

uv run python main.py --no-llm

Expected output:

Connecting to TuringDB ...

--- GraphRAG Demo ---
Ask questions about AI researchers, companies, and technologies.
Type 'quit' to exit.

Example: "Who developed the Transformer architecture?"

> Who developed the Transformer architecture?

  Found 5 seed nodes, 5 related nodes via graph traversal.

## Directly relevant nodes (from vector search)

**Transformer** (Technology)
The Transformer is a deep learning architecture based on self-attention
mechanisms. Introduced in the 2017 paper Attention Is All You Need, it has
become the foundation for modern large language models ...

**Attention Mechanism** (Technology)
The attention mechanism allows neural networks to focus on relevant parts
of the input when producing output ...

**Attention Is All You Need** (Paper)
Attention Is All You Need is a seminal 2017 paper by Vaswani et al. that
introduced the Transformer architecture ...

**BERT Pre-training** (Paper)
BERT is a 2018 paper by Google that introduced pre-training deep
bidirectional language representations ...

**Geoffrey Hinton** (Person)
Geoffrey Hinton is a British-Canadian cognitive psychologist and computer
scientist, widely regarded as one of the godfathers of deep learning ...

## Related nodes (from graph traversal)

**Large Language Model** (Technology) ...
**Convolutional Neural Network** (Technology) ...
**ImageNet Classification with Deep CNNs** (Paper) ...
**Google DeepMind** (Company) ...
**Yoshua Bengio** (Person) ...

Sources: Transformer, Attention Mechanism, Attention Is All You Need,
         BERT Pre-training, Geoffrey Hinton

What happened: Vector search found 5 seed nodes (Transformer, Attention Mechanism, the Attention paper, BERT, and Hinton). Graph expansion then traversed edges from those seeds and discovered 5 more nodes: LLM (via BUILDS_ON), CNN, the ImageNet paper, DeepMind, and Bengio (via RESEARCHES). None of these ranked in the vector search top 5, so they would have been missed without the graph step.

Example: "What does OpenAI work on?"

> What does OpenAI work on?

  Found 5 seed nodes, 5 related nodes via graph traversal.

## Directly relevant nodes (from vector search)

**OpenAI** (Company) ...
**Sam Altman** (Person) ...
**Ilya Sutskever** (Person) ...
**Meta AI** (Company) ...
**Anthropic** (Company) ...

## Related nodes (from graph traversal)

**Diffusion Model** (Technology) ...     ← via DEVELOPS
**Large Language Model** (Technology) ... ← via DEVELOPS
**Convolutional Neural Network** (Technology) ...
**ImageNet Classification with Deep CNNs** (Paper) ...
**Yann LeCun** (Person) ...

Sources: OpenAI, Sam Altman, Ilya Sutskever, Meta AI, Anthropic

What happened: Vector search found OpenAI and its people. Graph expansion followed DEVELOPS edges to discover that OpenAI works on Large Language Models and Diffusion Models — the actual answer to the question came from the graph, not the embeddings.

Example: "How are CNNs related to deep learning pioneers?"

> How are CNNs related to deep learning pioneers?

  Found 5 seed nodes, 6 related nodes via graph traversal.

## Directly relevant nodes (from vector search)

**Geoffrey Hinton** (Person) ...
**Convolutional Neural Network** (Technology) ...
**ImageNet Classification with Deep CNNs** (Paper) ...
**Ilya Sutskever** (Person) ...
**Google DeepMind** (Company) ...

## Related nodes (from graph traversal)

**Reinforcement Learning** (Technology) ...
**Large Language Model** (Technology) ...
**OpenAI** (Company) ...
**Demis Hassabis** (Person) ...
**Yann LeCun** (Person) ...          ← via RESEARCHES → CNN
**Meta AI** (Company) ...

Sources: Geoffrey Hinton, Convolutional Neural Network,
         ImageNet Classification with Deep CNNs, Ilya Sutskever,
         Google DeepMind

What happened: Vector search surfaced Hinton, the CNN node, and the AlexNet paper. Graph expansion then pulled in Yann LeCun (via RESEARCHES → CNN) and Meta AI (via DEVELOPS → CNN), both one hop from the CNN seed, filling in the CNN-pioneer connection that the vector search top 5 left out.
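Every transcript above has the same two-section layout: seeds first, then neighbors that are not already seeds. A minimal sketch of how that could be assembled is shown here. The function names are hypothetical, and the rows are assumed to be plain dicts with nodeId, name, label, and text keys; the actual retriever.py may structure its results differently.

```python
def merge_neighbors(seed_ids, outgoing, incoming):
    """Combine both traversal directions, dropping seeds and duplicates."""
    seen = set(seed_ids)
    merged = []
    for row in [*outgoing, *incoming]:
        if row["nodeId"] not in seen:
            seen.add(row["nodeId"])
            merged.append(row)
    return merged


def format_context(seeds, neighbors):
    """Render seeds and neighbors as two Markdown sections."""
    lines = ["## Directly relevant nodes (from vector search)", ""]
    for node in seeds:
        lines += [f"**{node['name']}** ({node['label']})", node["text"], ""]
    lines += ["## Related nodes (from graph traversal)", ""]
    for node in neighbors:
        lines += [f"**{node['name']}** ({node['label']})", node["text"], ""]
    return "\n".join(lines).rstrip("\n") + "\n"


# Hypothetical rows with shortened text, for illustration only.
seed_rows = [
    {"nodeId": 1, "name": "Transformer", "label": "Technology", "text": "T"},
]
outgoing_rows = [
    {"nodeId": 2, "name": "Attention Mechanism", "label": "Technology", "text": "A"},
]
incoming_rows = [
    {"nodeId": 2, "name": "Attention Mechanism", "label": "Technology", "text": "A"},
    {"nodeId": 3, "name": "Large Language Model", "label": "Technology", "text": "L"},
]

neighbors = merge_neighbors([1], outgoing_rows, incoming_rows)
context = format_context(seed_rows, neighbors)
print(context)
```

Deduplicating by nodeId matters because a neighbor can be reachable from several seeds, or in both directions, and would otherwise appear more than once in the context.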


Step 5 — Ask questions (with LLM)

To get synthesized natural-language answers, connect an LLM. The easiest option is Ollama running locally:

# Pull a model (one time)
ollama pull qwen2.5:7b

# Run with LLM
uv run python main.py

Or use any OpenAI-compatible API:

# OpenAI
export LLM_BASE_URL="https://api.openai.com/v1"
export LLM_API_KEY="sk-..."
export LLM_MODEL="gpt-4o-mini"
uv run python main.py

# Mistral
export LLM_BASE_URL="https://api.mistral.ai/v1"
export LLM_API_KEY="..."
export LLM_MODEL="mistral-small-latest"
uv run python main.py

With an LLM connected, the same queries produce synthesized answers instead of raw context — the LLM reads the retrieved graph context and generates a concise, grounded response.


How it works — under the hood

Project structure

graphrag_example/
├── config.py       # Configuration from environment variables
├── data.py         # 20 nodes + 31 edges defining the sample graph
├── ingest.py       # Loads graph into TuringDB, builds vector index
├── retriever.py    # Vector search → graph expansion → formatted context
└── generator.py    # Sends context + question to LLM
main.py             # CLI entry point (--setup / --no-llm)
pyproject.toml      # Dependencies: turingdb, sentence-transformers, openai

1. Ingestion (ingest.py)

# TuringDB writes require the change workflow (like git branches)
change = client.new_change()
client.checkout(change=change)

# Create nodes
client.query("CREATE (:Person {nodeId: 1, name: 'Geoffrey Hinton', text: '...'}), ...")
client.query("COMMIT")  # make nodes visible for edge queries

# Create edges by matching existing nodes
client.query("""
    MATCH (a:Person {nodeId: 1}), (b:Company {nodeId: 7})
    CREATE (a)-[:WORKS_AT]->(b)
""")
client.query("CHANGE SUBMIT")  # merge into main
client.checkout()               # return to main

# Build vector index (independent of the graph versioning)
client.query("CREATE VECTOR INDEX my_idx WITH DIMENSION 384 METRIC COSINE")
client.query('LOAD VECTOR FROM "vectors.csv" IN my_idx')

2. Retrieval (retriever.py)

Vector search — embed the question, find the top-k most similar nodes:

query_vec = embedder.encode([question])[0]
vec_str = ",".join(str(v) for v in query_vec.tolist())

seeds = client.query(f"""
    VECTOR SEARCH IN {index_name} FOR {top_k} [{vec_str}] YIELD ids
    MATCH (n) WHERE n.nodeId = ids
    RETURN n.nodeId as nodeId, n.name as name, n.text as text, labels(n) as label
""")

Graph expansion — traverse 1-hop edges from the seed nodes:

where = " OR ".join(f"n.nodeId = {sid}" for sid in seed_ids)

# Outgoing edges
outgoing = client.query(f"""
    MATCH (n)-[e]->(m) WHERE {where}
    RETURN labels(m) as label, m.nodeId as nodeId, m.name as name, m.text as text
""")

# Incoming edges
incoming = client.query(f"""
    MATCH (n)<-[e]-(m) WHERE {where}
    RETURN labels(m) as label, m.nodeId as nodeId, m.name as name, m.text as text
""")

The combined seed + neighbor context is formatted as structured Markdown and passed to the LLM.
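Both retrieval queries are built by interpolating values into a string, so it can help to validate the inputs first. The helpers below are a hypothetical sketch, not the project's code. They reuse the query text shown above, format each vector component in fixed-point notation (a choice made here to avoid exponent notation, which may or may not be accepted by the parser), and refuse an empty seed list, which would otherwise produce an empty WHERE clause.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_vector_search_query(index_name, top_k, query_vec):
    """Build the VECTOR SEARCH query string for a given query vector."""
    if not _IDENTIFIER.fullmatch(index_name):
        raise ValueError(f"unexpected index name: {index_name!r}")
    if not isinstance(top_k, int) or top_k < 1:
        raise ValueError("top_k must be a positive integer")
    # Fixed-point formatting; float() also accepts NumPy scalar types.
    vec_str = ",".join(f"{float(v):.8f}" for v in query_vec)
    return (
        f"VECTOR SEARCH IN {index_name} FOR {top_k} [{vec_str}] YIELD ids\n"
        "MATCH (n) WHERE n.nodeId = ids\n"
        "RETURN n.nodeId as nodeId, n.name as name, n.text as text, labels(n) as label"
    )


def build_seed_filter(seed_ids, var="n"):
    """Build the OR-joined nodeId filter used by the expansion queries."""
    if not seed_ids:
        raise ValueError("at least one seed id is required")
    return " OR ".join(f"{var}.nodeId = {int(sid)}" for sid in seed_ids)


search_query = build_vector_search_query("graphrag_demo_vectors", 2, [0.5, -0.25])
seed_filter = build_seed_filter([3, 5])
print(search_query)
print(seed_filter)
```

Converting each seed id with int() ensures that only numbers reach the query text, even if the ids arrive as strings or NumPy integers.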

3. Generation (generator.py)

response = llm.chat.completions.create(
    model=model,
    messages=[
        {"role": "system", "content": "Answer using the provided knowledge graph context..."},
        {"role": "user", "content": f"# Context\n{context}\n\n# Question\n{question}"},
    ],
)
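The call above can be wrapped in a small function that also handles the case where retrieval returned nothing. This is a sketch under assumptions: the system prompt wording, the fallback message, and the function names are hypothetical, and llm is expected to be an OpenAI-compatible client such as the one created by the openai package.

```python
# Hypothetical wording; the project's actual system prompt is not shown in full.
SYSTEM_PROMPT = (
    "Answer the question using only the knowledge graph context provided. "
    "If the context does not contain the answer, say so."
)


def build_messages(context, question, system_prompt=SYSTEM_PROMPT):
    """Assemble the chat messages in the same layout as the call above."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"# Context\n{context}\n\n# Question\n{question}"},
    ]


def generate_answer(llm, model, context, question):
    """Return the model's answer, or a fallback when there is no context."""
    if not context.strip():
        return "No relevant context was retrieved."
    response = llm.chat.completions.create(
        model=model,
        messages=build_messages(context, question),
    )
    return response.choices[0].message.content


messages = build_messages("**Transformer** (Technology)", "Who developed it?")
print(messages[1]["content"])
```

Skipping the LLM call when the context is empty avoids an answer that is not grounded in the graph.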

Configuration reference

All settings can be overridden via environment variables:

Variable           Default                    Description
TURINGDB_HOST      http://localhost:6666      TuringDB server URL
TURINGDB_GRAPH     graphrag_demo              Graph name
TURINGDB_DATA_DIR  ~/.turing/data             Where TuringDB reads vector CSVs
EMBEDDING_MODEL    all-MiniLM-L6-v2           Sentence-transformers model for embeddings
LLM_BASE_URL       http://localhost:11434/v1  OpenAI-compatible LLM API URL
LLM_API_KEY        ollama                     API key for the LLM provider
LLM_MODEL          qwen2.5:7b                 Model name to use for generation
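A configuration module along these lines would read each variable with the default from the table. The Settings class and load_settings function are hypothetical names used for illustration; the real config.py may be organized differently.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    turingdb_host: str
    turingdb_graph: str
    turingdb_data_dir: str
    embedding_model: str
    llm_base_url: str
    llm_api_key: str
    llm_model: str


def load_settings(env=None):
    """Read settings from a mapping (os.environ by default), applying defaults."""
    env = os.environ if env is None else env
    return Settings(
        turingdb_host=env.get("TURINGDB_HOST", "http://localhost:6666"),
        turingdb_graph=env.get("TURINGDB_GRAPH", "graphrag_demo"),
        # Expand "~" so the path can be used directly for file writes.
        turingdb_data_dir=os.path.expanduser(
            env.get("TURINGDB_DATA_DIR", "~/.turing/data")
        ),
        embedding_model=env.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
        llm_base_url=env.get("LLM_BASE_URL", "http://localhost:11434/v1"),
        llm_api_key=env.get("LLM_API_KEY", "ollama"),
        llm_model=env.get("LLM_MODEL", "qwen2.5:7b"),
    )


defaults = load_settings({})
overridden = load_settings({"LLM_MODEL": "gpt-4o-mini"})
print(defaults.llm_model, overridden.llm_model)
```

Passing the environment as a parameter keeps the function easy to test without modifying the real process environment.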

Key TuringDB concepts used

Concept                      How it is used here
Cypher queries               MATCH, WHERE, CREATE, RETURN: the query language for reading and writing graph data
Change workflow              All writes happen inside an isolated change, then CHANGE SUBMIT merges to main (like a git merge)
COMMIT                       Makes intermediate writes visible within a change; needed between creating nodes and creating edges
Vector index                 CREATE VECTOR INDEX ... WITH DIMENSION 384 METRIC COSINE: a standalone index for k-NN search
LOAD VECTOR FROM             Bulk-loads embeddings from a CSV file (id,v1,v2,...,vn) into the vector index
VECTOR SEARCH ... YIELD ids  Finds the k nearest vectors and returns their IDs, which are joined back to graph nodes via MATCH
Schema introspection         CALL db.labels(), CALL db.edgeTypes(), CALL db.propertyTypes(): discover graph structure
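The change workflow lends itself to a context manager, so that the client always returns to main even when a write fails. The sketch below uses only the client calls shown in the ingestion section (new_change, checkout, and query); the submitted_change helper itself is hypothetical, and what happens to a change that is never submitted is not covered here.

```python
from contextlib import contextmanager


@contextmanager
def submitted_change(client):
    """Open a change, submit it on success, and always return to main."""
    change = client.new_change()
    client.checkout(change=change)
    try:
        yield change
        # Only reached if the body raised no exception.
        client.query("CHANGE SUBMIT")
    finally:
        client.checkout()


def load_graph(client, node_statements, edge_statements):
    """Create nodes, COMMIT so they are visible, then create edges."""
    with submitted_change(client):
        for statement in node_statements:
            client.query(statement)
        client.query("COMMIT")
        for statement in edge_statements:
            client.query(statement)
```

Here load_graph takes lists of already-built query strings, and the COMMIT between the two loops mirrors the ordering described in the table above.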

For the full TuringDB reference, see docs.turingdb.ai.


Adapting this to your own data

To use your own knowledge graph:

  1. Edit data.py — define your nodes (each needs a nodeId, label, name, and text) and edges (triples of from_id, to_id, edge_type)
  2. Run --setup — this creates the graph, computes embeddings, and builds the vector index
  3. Query — the retrieval pipeline works on any graph shape; it finds seed nodes by text similarity and expands them via whatever edges exist
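Before running --setup on your own data, it may be worth checking that the definitions are consistent. The sketch below assumes nodes are dicts and edges are (from_id, to_id, edge_type) tuples, as described in step 1. The variable names, the sample entries, and the validate_graph helper are hypothetical, and the actual layout of data.py may differ.

```python
# Hypothetical sample entries; replace with your own nodes and edges.
NODES = [
    {"nodeId": 1, "label": "Person", "name": "Ada Example",
     "text": "Ada Example is a fictional researcher used for illustration."},
    {"nodeId": 2, "label": "Company", "name": "Example Labs",
     "text": "Example Labs is a fictional company used for illustration."},
]
EDGES = [
    (1, 2, "WORKS_AT"),
]

REQUIRED_KEYS = {"nodeId", "label", "name", "text"}


def validate_graph(nodes, edges):
    """Check required keys, unique ids, and that edges reference known nodes."""
    ids = set()
    for node in nodes:
        missing = REQUIRED_KEYS - node.keys()
        if missing:
            raise ValueError(f"node {node.get('nodeId')}: missing {sorted(missing)}")
        if node["nodeId"] in ids:
            raise ValueError(f"duplicate nodeId {node['nodeId']}")
        ids.add(node["nodeId"])
    for from_id, to_id, edge_type in edges:
        if from_id not in ids or to_id not in ids:
            raise ValueError(
                f"edge {edge_type}: unknown endpoint in ({from_id}, {to_id})"
            )
    return len(nodes), len(edges)


counts = validate_graph(NODES, EDGES)
print(f"{counts[0]} nodes, {counts[1]} edges")
```

An edge that points at a missing nodeId would otherwise match nothing during ingestion, so catching it up front makes the failure visible.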

For larger graphs, look at askalan — a production graphRAG agent built on TuringDB with batched embeddings, an agentic LLM loop, and a web UI.
