Knowledge Graph Project

This repository implements a decoupled, agentic Knowledge Graph system designed to transform unstructured documents into a queryable semantic memory.

It consists of two primary microservices:

GraphGen (Generation/ETL): A pipeline that parses documents, extracts entities using GLiNER + LLMs, resolves coreferences, detects communities, and stores everything in Neo4j.
GraphRAG (Retrieval/API): An agentic retrieval system powered by LlamaIndex ReActAgent that navigates the graph to answer complex user queries.

System Architecture

The system follows a strict producer-consumer model, decoupled by Neo4j as the shared storage layer.

graph TD
    Raw[Documents .txt/.pdf/.docx/...] -->|Input| Gen[GraphGen Service :8020]

    subgraph "GraphGen Pipeline"
        Gen --> Parse[Parser Registry]
        Parse --> Upload[Neo4j Upload]
        Upload --> Ext[Entity Extraction GLiNER + LLM]
        Ext --> Enrich[Embeddings + Resolution]
        Enrich --> Comm[Community Detection]
    end

    Upload -->|Document + Chunks| Neo4j[(Neo4j)]
    Ext -->|Entity + Relations| Neo4j

    User -->|Query| RAG[GraphRAG Service :8010]

    subgraph "GraphRAG Agent"
        RAG -->|Vector + Graph Search| Neo4j
        RAG -->|Synthesis| LLM[LLM Response]
    end

Shared Infrastructure

Neo4j: Stores graph topology, properties, and 384-dim vector embeddings (Community Edition 5.11+).

Service Details

1. GraphGen (The Builder)

Located in services/graphgen.

Key Pipeline Stages:

Document Parsing: Auto-discovers files in input/ using a parser registry (TXT, MD, PDF, DOCX, PPTX, XLSX, HTML, Images via pytesseract).
Neo4j Upload: Stores Document → [:CONTAINS] → Chunk nodes immediately.
Entity & Relation Extraction: Uses GLiNER (NER) + LLMs (relation extraction).
Semantic Enrichment: Generates embeddings (BAAI/bge-small-en-v1.5), resolves duplicate entities.
Community Detection: Applies the Leiden Algorithm to cluster entities.
Summarization: Generates LLM-based summaries for every community.
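
The duplicate-resolution part of the Semantic Enrichment stage can be pictured with a small sketch. It assumes duplicates are detected by cosine similarity between entity embeddings. The function names, the greedy first-match strategy, and the 0.9 threshold are illustrative assumptions, not the service's actual implementation, and the toy 3-dim vectors stand in for real 384-dim embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def resolve_duplicates(entities, threshold=0.9):
    """Map each entity name to a canonical name.

    entities: list of (name, embedding) pairs.
    An entity is merged into the first earlier canonical entity whose
    embedding is at least `threshold` similar; otherwise it becomes
    canonical itself. (Hypothetical strategy, chosen for brevity.)
    """
    canonical = []  # (name, embedding) pairs kept as representatives
    mapping = {}
    for name, vec in entities:
        for canon_name, canon_vec in canonical:
            if cosine(vec, canon_vec) >= threshold:
                mapping[name] = canon_name
                break
        else:
            canonical.append((name, vec))
            mapping[name] = name
    return mapping


# Toy vectors for illustration only.
example = [
    ("Harry Potter", [1.0, 0.0, 0.1]),
    ("H. Potter", [0.99, 0.01, 0.12]),
    ("Hogwarts", [0.0, 1.0, 0.0]),
]
mapping = resolve_duplicates(example)
```

A greedy pass like this is order-dependent; a production pipeline would more likely cluster candidates or use an approximate nearest-neighbour index.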

Key Tech: LangChain, GLiNER, NetworkX, leidenalg, SentenceTransformers, Neo4j.

2. GraphRAG (The Agent)

Located in services/graphrag.

Key Tech: LlamaIndex ReActAgent, FastAPI, Langfuse (Tracing), Pydantic, Neo4j.

Data Schema

(:Document {doc_id, title, source_path, created_at})
    -[:CONTAINS]->
(:Chunk {chunk_id, text, position, doc_id, embedding[384]})
    -[:MENTIONS]->
(:Entity {entity_id, name, type, embedding[384]})
    -[:RELATED_TO {relation}]->
(:Entity)
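
As a rough illustration of how the Document → Chunk part of this schema could be written, the sketch below builds a parameterized Cypher statement and its parameters. The helper name `build_document_upsert` and the `doc_id:position` chunk ID format are assumptions made for the example; embeddings are left out, and actually executing the query would need a Neo4j driver session, which is not shown.

```python
def build_document_upsert(doc_id, title, source_path, chunk_texts):
    """Return (query, params) that upsert one Document and its Chunks.

    Hypothetical helper: the property names follow the schema above,
    but the chunk_id format is invented for this example.
    """
    query = (
        "MERGE (d:Document {doc_id: $doc_id})\n"
        "ON CREATE SET d.created_at = datetime()\n"
        "SET d.title = $title, d.source_path = $source_path\n"
        "WITH d\n"
        "UNWIND $chunks AS c\n"
        "MERGE (ch:Chunk {chunk_id: c.chunk_id})\n"
        "SET ch.text = c.text, ch.position = c.position, ch.doc_id = $doc_id\n"
        "MERGE (d)-[:CONTAINS]->(ch)"
    )
    params = {
        "doc_id": doc_id,
        "title": title,
        "source_path": source_path,
        "chunks": [
            {"chunk_id": f"{doc_id}:{position}", "text": text, "position": position}
            for position, text in enumerate(chunk_texts)
        ],
    }
    return query, params


query, params = build_document_upsert(
    "doc-1", "Example", "input/example.txt", ["first chunk", "second chunk"]
)
```

Using `MERGE` on the ID properties keeps the write idempotent, so re-running ingestion on the same file does not create duplicate nodes.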

Getting Started

1. Prerequisites

Docker and Docker Compose.
API Keys for Groq (or OpenAI).

2. Configuration

cp .env.example .env
# Fill in: NEO4J_PASSWORD, GROQ_API_KEY

3. Run the Stack

docker-compose up --build -d

This starts:

GraphGen (Port 8020)
GraphRAG (Port 8010)
Neo4j (Port 7474 / 7687)

4. Ingest Data

# Place files in input/
curl -X POST http://localhost:8020/run \
  -H "Content-Type: application/json" \
  -d '{"clean_database": true}'

Or upload individual documents:

curl -X POST http://localhost:8020/documents -F "file=@myfile.pdf"
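
For scripted ingestion, the `/run` call can also be issued from Python. This is a minimal sketch using only the standard library. The helper name `build_run_request` is hypothetical, and the shape of the response is not documented here, so the example only constructs the request.

```python
import json
import urllib.request


def build_run_request(base_url="http://localhost:8020", clean_database=True):
    """Build the POST /run request equivalent to the curl call above."""
    body = json.dumps({"clean_database": clean_database}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/run",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_run_request()
# To send it (requires the stack to be running):
#     with urllib.request.urlopen(req) as resp:
#         print(resp.status)
```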

5. Chat

Web UI: http://localhost:3333
API Docs: http://localhost:8010/docs

Development

docker-compose -f docker-compose.yaml -f docker-compose.dev.yaml up

🏗️ System Architecture

The system follows a strict producer-consumer model, decoupled by the storage layer.

graph TD
    Raw[Raw Data .txt/.csv] -->|Input| Gen[GraphGen Service]
    
    subgraph "GraphGen Pipeline"
        Gen --> Lex[Lexical Graph]
        Lex --> Ext[Entity Extraction]
        Ext --> Enrich[Semantic Enrichment]
        Enrich --> Comm[Community Detection]
    end
    
    Comm -->|Write Topology| Falkor[(FalkorDB Graph)]
    Comm -->|Write Vectors| PG[(Postgres pgvector)]
    
    User -->|Query| RAG[GraphRAG Service]
    
    subgraph "GraphRAG Agent"
        RAG -->|Vector Search| PG
        RAG -->|Graph Traversal| Falkor
        RAG -->|Synthesis| LLM[LLM Response]
    end

Shared Infrastructure

FalkorDB: Stores the graph topology (Nodes, Edges, Properties).
PostgreSQL (pgvector): Stores 384-dimensional vector embeddings for hybrid retrieval.

🛠️ Service Details

1. GraphGen (The Builder)

Located in services/graphgen.

This service runs a multi-stage ETL pipeline to convert unstructured text into a structured knowledge graph.

Key Pipeline Stages:

Lexical Construction: Parses documents into a temporal hierarchy (DAY → SEGMENT → EPISODE → CHUNK).
Entity & Relation Extraction: Uses a hybrid of GLiNER (for high-precision NER) and LLMs (for semantic relation extraction).
Semantic Enrichment: Generates embeddings and resolves duplicate entities using vector similarity.
Community Detection: Applies the Leiden Algorithm to cluster entities into high-level TOPIC and SUBTOPIC nodes.
Summarization: Generates LLM-based summaries for every community and topic.
Centrality: Calculates Degree Centrality and Z-Scores for influential entities.
Hybrid Upload: Persists topology to FalkorDB and vectors to Postgres.
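
The Centrality stage can be illustrated with a dependency-free sketch. It assumes an undirected entity graph given as an edge list, normalizes degree by `n - 1` (the same convention NetworkX uses for degree centrality), and uses the population standard deviation for Z-Scores. The function names and the sample edges are made up for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def degree_centrality(edges):
    """Degree centrality for an undirected graph given as (u, v) pairs.

    Self-loops are ignored and parallel edges are counted once.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:
            neighbors[u].add(v)
            neighbors[v].add(u)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(nbrs) / (n - 1) for node, nbrs in neighbors.items()}


def z_scores(values):
    """Standardize a {node: value} mapping; all zeros if there is no spread."""
    data = list(values.values())
    mu = mean(data)
    sigma = pstdev(data)
    if sigma == 0:
        return {node: 0.0 for node in values}
    return {node: (value - mu) / sigma for node, value in values.items()}


# Toy star-shaped graph: one hub connected to three leaves.
edges = [("Harry", "Ron"), ("Harry", "Hermione"), ("Harry", "Hagrid")]
centrality = degree_centrality(edges)
scores = z_scores(centrality)
```

Entities with a high positive Z-Score are far more connected than average and are natural candidates for "influential" nodes.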

Key Tech: LangChain, spaCy, GLiNER, NetworkX, leidenalg, SentenceTransformers.

2. GraphRAG (The Agent)

Located in services/graphrag.

This service exposes a FastAPI interface for querying the graph. Rather than retrieving context in a single pass, it uses an agentic workflow that "walks" the graph step by step.

Retrieval Workflow:

Keyword Extraction: Analyzes the user query to find key entities.
Seed Identification: Uses Hybrid Search (Vector Similarity + Exact Match) to find entry points in the graph.
Subgraph Expansion: Expands from seeds to find relevant neighbors (Chunks, Topics, Related Entities) using Cypher queries.
Context Building: Formats the subgraph into a structured XML context for the LLM.
Synthesis: Generates a personalized answer based on the traversed path.
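
The Context Building step can be sketched with the standard library's XML support. The tag and attribute names below (`context`, `entities`, `relations`, `chunks`) and the input shapes are assumptions for illustration; the service's real context format may differ.

```python
import xml.etree.ElementTree as ET


def build_context(entities, relations, chunks):
    """Serialize a retrieved subgraph into an XML string for the LLM prompt.

    entities:  list of {"name": ..., "type": ...}
    relations: list of (source, relation, target) triples
    chunks:    list of {"chunk_id": ..., "text": ...}
    """
    root = ET.Element("context")

    entities_el = ET.SubElement(root, "entities")
    for entity in entities:
        ET.SubElement(entities_el, "entity", name=entity["name"], type=entity["type"])

    relations_el = ET.SubElement(root, "relations")
    for source, relation, target in relations:
        ET.SubElement(
            relations_el, "relation", source=source, type=relation, target=target
        )

    chunks_el = ET.SubElement(root, "chunks")
    for chunk in chunks:
        chunk_el = ET.SubElement(chunks_el, "chunk", id=chunk["chunk_id"])
        chunk_el.text = chunk["text"]  # ElementTree escapes <, >, & on output

    return ET.tostring(root, encoding="unicode")


xml_context = build_context(
    entities=[{"name": "Harry", "type": "Person"}],
    relations=[("Harry", "ATTENDS", "Hogwarts")],
    chunks=[{"chunk_id": "c1", "text": "Harry & Ron arrive at Hogwarts."}],
)
```

Building the tree with `ElementTree` instead of string concatenation means chunk text containing `<` or `&` cannot break the structure of the prompt.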

Key Tech: LlamaIndex, FastAPI, Langfuse (Tracing), Pydantic.

📊 Data Schema

The graph uses a specific schema to represent both time and knowledge:

Temporal Nodes: DAY, SEGMENT (Morning/Afternoon), EPISODE (Events).
Content Nodes: CHUNK (Raw text).
Semantic Nodes: ENTITY_CONCEPT (People, Places, Concepts).
Organizational Nodes: TOPIC, SUBTOPIC (Leiden communities).

🚀 Getting Started

1. Prerequisites

Docker and Docker Compose.
API Keys for Groq and OpenAI.

2. Configuration

Copy the example environment file:
```
cp .env.example .env
```

Fill in your credentials in .env:

GROQ_API_KEY=gsk_...
OPENAI_API_KEY=sk-...

3. Run the Stack

Build and start all services:

docker-compose up --build -d

This starts:

GraphGen (Port 8020)
GraphRAG (Port 8010)
FalkorDB & Postgres

4. Ingest Data

Place text files (.txt, .csv) in the input/ directory.

Trigger the pipeline via the API:

curl -X POST http://localhost:8020/run \
  -H "Content-Type: application/json" \
  -d '{"clean_database": true}'

(Use clean_database: false for incremental updates)

5. Chat

Access the retrieval interface:

Web UI: http://localhost:8010
API Docs: http://localhost:8010/docs

Development

To run with hot-reloading for local development:

docker-compose -f docker-compose.yaml -f docker-compose.dev.yaml up

