# Graph-ND Quickstart Example
__Knowledge in Graphs not Documents!__

1. GraphRAG in 4 lines of code in <3 minutes. No graph expertise necessary.
2. Designed to extend to production - not just a demo.
3. Easily merges mixed structured & unstructured data.

This introductory example is meant to get you started fast. For an example that covers more options for control and precision, see the `retail/` example.


To run this notebook:
1. **Set up Neo4j (Aura):**
    - Start a free Neo4j instance at [console.neo4j.io](https://console.neo4j.io/) and save the credentials file.

2. **Clone and navigate to the repo:**
    - `git clone https://github.com/zach-blumenfeld/graph-nd.git`
    - `cd graph-nd`

3. **Prepare your environment:**
    - Create a Python virtual environment and install dependencies: `pip install -r requirements.txt`
    - Configure your `.env` file in `graph-nd/examples/components/` with your Neo4j credentials (`NEO4J_URI`, `NEO4J_USERNAME`, `NEO4J_PASSWORD`) and your `OPENAI_API_KEY`. The first code cell below loads these values.

4. **Run the notebook:**
    - Navigate to the appropriate folder: `graph-nd/examples/components/`
    - Open and run `quickstart-example.ipynb`.

In [5]:
from dotenv import load_dotenv
import os

load_dotenv('.env', override=True)

uri = os.getenv('NEO4J_URI')
username = os.getenv('NEO4J_USERNAME')
password = os.getenv('NEO4J_PASSWORD')
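
A missing or misnamed entry in `.env` tends to surface later as a confusing connection or authentication error. The sketch below is one way to check for that right after `load_dotenv`. The helper `missing_env_vars` is hypothetical (it is not part of Graph-ND); the variable names are the ones this notebook reads.

```python
import os
from typing import Iterable, List, Mapping

# Settings this notebook expects to find after load_dotenv('.env')
REQUIRED_VARS = ("NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD", "OPENAI_API_KEY")


def missing_env_vars(required: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names in `required` that are unset or blank in `env`."""
    return [name for name in required if not env.get(name, "").strip()]


# Report anything that still needs to be added to the .env file
missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing settings in .env:", ", ".join(missing))
else:
    print("All required settings are present.")
```

Passing a plain dictionary as `env` makes the helper easy to try without touching the real environment.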

In [6]:
import os

from graph_nd import GraphRAG
from neo4j import GraphDatabase
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

db_client = GraphDatabase.driver(uri, auth=(username, password))
embedding_model = OpenAIEmbeddings(model='text-embedding-ada-002')
llm = ChatOpenAI(model="gpt-4o", temperature=0.0)


# Instantiate graph
graphrag = GraphRAG(db_client, llm, embedding_model)

# 1) Get the graph schema. Can also define exactly via json/pydantic spec
graphrag.schema.infer("a simple graph of hardware components "
                      "where components (with id, name, and description properties)  "
                      "can be types of or inputs to other components.")

# 2) Merge data. Can also directly merge node & rel records extracted elsewhere
graphrag.data.merge_csvs(['component-types.csv', 'component-input-output.csv']) # structured data
graphrag.data.merge_pdf('component-catalog.pdf') # unstructured data

# 3) GraphRAG agent for better answers.
graphrag.agent("what sequence of components depend on silicon wafers?")

[Schema] Generated schema:
 {
    "description": "A simple graph schema for hardware components, capturing their relationships and properties.",
    "nodes": [
        {
            "description": "Represents a hardware component with unique identifier, name, and description.",
            "id": {
                "description": "Unique identifier for the hardware component.",
                "name": "id",
                "type": "STRING"
            },
            "label": "Component",
            "properties": [
                {
                    "description": "Name of the hardware component.",
                    "name": "name",
                    "type": "STRING"
                },
                {
                    "description": "Description of the hardware component.",
                    "name": "description",
                    "type": "STRING"
                }
            ],
            "searchFields": [
                {
                    "description": "Semantic 

Extracting entities from text: 100%|██████████| 8/8 [00:27<00:00,  3.45s/it]


Consolidating results...


Merging Nodes by Label: 100%|██████████| 1/1 [00:05<00:00,  5.57s/node]
Merging Relationships by Type & Pattern: 0rel [00:00, ?rel/s]



what sequence of components depend on silicon wafers?
Tool Calls:
  node_search (call_yPwnP5j16jddiLOOObMyDs4N)
 Call ID: call_yPwnP5j16jddiLOOObMyDs4N
  Args:
    search_query: silicon wafers
    top_k: 5
    search_config: {'search_type': 'SEMANTIC', 'node_label': 'Component', 'search_prop': 'name'}
Name: node_search

[
    {
        "id": "N26",
        "name": "Wafer",
        "description": "Silicon wafers are the basic building block for chip production. To produce them, a furnace forms a cylinder of silicon (or other semiconducting materials), which is then cut into disc-shaped wafers. These wafers are then processed, split and packaged into individual chips. Most wafers are made purely of silicon or another material, but others have more complex structures. Dopants, such as boron, aluminum, phosphorous, platinum or other elements, may be added to alter the level of semiconductivity. 300 mm wafers, produced by Japanese, Taiwanese, German, and Korean firms, are used to produce n

In [3]:
graphrag.agent("which components have the most inputs? top 5")


which components have the most inputs? top 5
Tool Calls:
  aggregate (call_PNTxCumpQXLcVQssZaF2u9DY)
 Call ID: call_PNTxCumpQXLcVQssZaF2u9DY
  Args:
    agg_instructions: Find the top 5 components with the most INPUT_TO relationships where they are the target component.
Running Query:
MATCH (c:Component)<-[:INPUT_TO]-(:Component)
RETURN c.name, COUNT(*) AS inputCount
ORDER BY inputCount DESC
LIMIT 5
Name: aggregate

[
    {
        "c.name": "Photolithography",
        "inputCount": 9
    },
    {
        "c.name": "Assembly and packaging",
        "inputCount": 8
    },
    {
        "c.name": "Deposition",
        "inputCount": 7
    },
    {
        "c.name": "Chemical mechanical planarization",
        "inputCount": 7
    },
    {
        "c.name": "Etch and clean",
        "inputCount": 6
    }
]

The top 5 components with the most inputs are:

1. **Photolithography** with 9 inputs
2. **Assembly and packaging** with 8 inputs
3. **Deposition** with 7 inputs
4. **Chemical mechanica

In [1]:
graphrag.agent("can you describe what gpus do?")


## Creating Agents & Adding More Tools
You can create a LangGraph agent with GraphRAG prebuilt. Think of it as an agent with "knowledge": one that has an embedded "left brain" knowledge graph.

In [4]:
from langchain_core.messages import HumanMessage

# create a LangGraph agent
agent = graphrag.create_react_agent()

# use just like any other langgraph agent
config = {"configurable": {"thread_id": "thread-1"}}

for step in agent.stream(
    {"messages": [HumanMessage(content="what sequence of components depend on silicon wafers?")]},
    stream_mode="values", config=config
):
    step["messages"][-1].pretty_print()


what sequence of components depend on silicon wafers?
Tool Calls:
  node_search (call_0G8Y87loxiuKT3z8rnFdBv7p)
 Call ID: call_0G8Y87loxiuKT3z8rnFdBv7p
  Args:
    search_query: silicon wafers
    top_k: 5
Name: node_search

Error: 1 validation error for node_search
search_config
  Field required [type=missing, input_value={'search_query': 'silicon wafers', 'top_k': 5}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.10/v/missing
 Please fix your mistakes.
Tool Calls:
  node_search (call_27rdFIFpJbiCOQ7zw7DI064d)
 Call ID: call_27rdFIFpJbiCOQ7zw7DI064d
  Args:
    search_query: silicon wafers
    top_k: 5
    search_config: {'search_type': 'SEMANTIC', 'node_label': 'Component', 'search_prop': 'name'}
Name: node_search

[
    {
        "id": "N26",
        "name": "Wafer",
        "description": "Silicon wafers are the basic building block for chip production. To produce them, a furnace forms a cylinder of silicon (or other semiconducting materials), wh

### Add Additional Tools
...as many as you like

In [5]:
import getpass
from langchain_community.tools.tavily_search import TavilySearchResults

if not os.environ.get("TAVILY_API_KEY"):
    os.environ["TAVILY_API_KEY"] = getpass.getpass("Tavily API key:\n")

web_search = TavilySearchResults(max_results=3)
# create a LangGraph agent with the extra web search tool
agent = graphrag.create_react_agent(tools=[web_search])

# use just like any other langgraph agent
config = {"configurable": {"thread_id": "thread-1"}}

for step in agent.stream(
    {"messages": [HumanMessage(content="what sequence of components depend on silicon wafers? and what companies may be involved?")]},
    stream_mode="values", config=config
):
    step["messages"][-1].pretty_print()


what sequence of components depend on silicon wafers? and what companies may be involved?
Tool Calls:
  node_search (call_IHNoHurHpY1nxZam43p7tqo8)
 Call ID: call_IHNoHurHpY1nxZam43p7tqo8
  Args:
    search_query: silicon wafers
    top_k: 5
Name: node_search

Error: 1 validation error for node_search
search_config
  Field required [type=missing, input_value={'search_query': 'silicon wafers', 'top_k': 5}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.10/v/missing
 Please fix your mistakes.
Tool Calls:
  node_search (call_lfmf60XTKEPMDQoCleypNLcB)
 Call ID: call_lfmf60XTKEPMDQoCleypNLcB
  Args:
    search_query: silicon wafers
    top_k: 5
    search_config: {'search_type': 'SEMANTIC', 'node_label': 'Component', 'search_prop': 'name'}
Name: node_search

[
    {
        "id": "N26",
        "name": "Wafer",
        "description": "Silicon wafers are the basic building block for chip production. To produce them, a furnace forms a cylinder of silicon (or

In [None]:
#TODO: Show Example with MCP. This is easy, see https://github.com/langchain-ai/langchain-mcp-adapters

## The GraphSchema Is Kinda The Secret Sauce
The **GraphSchema** plays a key role in data mapping and is also provided to built-in agent tools during query execution. Its description fields and structured information significantly enhance query quality.
The `prompt_str()` function injects the GraphSchema into LLM prompts. It uses a special serialization format to describe "queryPatterns" in a concise, Cypher-like notation, further improving query generation quality.
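
To make the idea of schema injection concrete, here is a minimal sketch of embedding a serialized schema string in a query-generation prompt. The function `build_query_prompt` and the prompt wording are assumptions made for illustration. They are not Graph-ND's internal template, and the stand-in schema text is far shorter than what `prompt_str()` returns.

```python
def build_query_prompt(schema_str: str, question: str) -> str:
    """Embed a serialized graph schema and a user question in one prompt."""
    return (
        "You translate questions into Cypher for the graph described below.\n\n"
        "Graph schema:\n"
        f"{schema_str}\n\n"
        f"Question: {question}\n"
        "Cypher:"
    )


# Stand-in schema text; in this notebook it would come from
# graphrag.schema.schema.prompt_str()
example_schema = '{"nodes": [{"label": "Component"}]}'
prompt = build_query_prompt(example_schema, "which components have the most inputs?")
print(prompt)
```

Because the model sees labels, property names, and descriptions verbatim, richer descriptions in the schema should translate into better-grounded queries.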


In [6]:
print(graphrag.schema.schema.prompt_str())

{
    "description": "A simple graph schema for hardware components and their relationships.",
    "nodes": [
        {
            "description": "Represents a hardware component with an id, name, and description.",
            "id": {
                "description": "",
                "name": "id",
                "type": "STRING"
            },
            "label": "Component",
            "properties": [
                {
                    "description": "",
                    "name": "name",
                    "type": "STRING"
                },
                {
                    "description": "",
                    "name": "description",
                    "type": "STRING"
                }
            ],
            "searchFields": [
                {
                    "description": "Semantic search field for the component's name.",
                    "name": "name_textembedding",
                    "type": "TEXT_EMBEDDING",
                    "calculatedFrom": "

## Saving & Reloading GraphSchema
You can also `.export` and `.load` the schema to and from JSON files, which makes it easy to save, reload, iterate on, and version control the schema. It also lets you make custom edits by hand.

In [7]:
# export and look at graph schema
graphrag.schema.export("graphrag-schema.json")

[Schema] Schema successfully exported to graphrag-schema.json
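
As one example of a custom edit, the sketch below fills in empty property descriptions in an exported schema file before it is reloaded. It assumes the exported JSON has the same layout as the schema printed earlier in this notebook (`nodes`, each with a `label` and a `properties` list of `name`/`description` entries). Both helper functions are hypothetical and not part of Graph-ND.

```python
import json
from pathlib import Path


def fill_property_descriptions(schema: dict, label: str, descriptions: dict) -> int:
    """Fill in empty property descriptions for one node label, in place.

    Returns the number of descriptions that were filled in.
    """
    updated = 0
    for node in schema.get("nodes", []):
        if node.get("label") != label:
            continue
        for prop in node.get("properties", []):
            new_text = descriptions.get(prop.get("name"))
            # Only touch properties whose description is currently empty
            if new_text and not prop.get("description"):
                prop["description"] = new_text
                updated += 1
    return updated


def edit_schema_file(path: str, label: str, descriptions: dict) -> int:
    """Load a schema JSON file, fill in descriptions, and write it back."""
    file = Path(path)
    schema = json.loads(file.read_text(encoding="utf-8"))
    updated = fill_property_descriptions(schema, label, descriptions)
    file.write_text(json.dumps(schema, indent=4), encoding="utf-8")
    return updated
```

A call such as `edit_schema_file("graphrag-schema.json", "Component", {"name": "Name of the hardware component."})` would update the file in place, after which it can be reloaded with `.load`.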


In [13]:
# reload and pick up where you left off
new_graphrag = GraphRAG(db_client, llm, embedding_model)
new_graphrag.schema.load("graphrag-schema.json")
new_graphrag.agent("can you describe what gpus do?")

[Schema] Schema successfully loaded from graphrag-schema.json

can you describe what gpus do?
Tool Calls:
  node_search (call_bnPWu8zuZRd0BlN9P8HRikcg)
 Call ID: call_bnPWu8zuZRd0BlN9P8HRikcg
  Args:
    search_config: {'search_type': 'SEMANTIC', 'node_label': 'Component', 'search_prop': 'description'}
    search_query: GPU
Name: node_search

[
    {
        "id": "N2",
        "name": "Logic chip design: Discrete GPUs",
        "description": "Discrete graphics processing units (\"GPUs\") have long been used for graphics processing (for example, in video game consoles) and in the last decade have become the most used chip for training artificial intelligence algorithms. The United States monopolizes the design market for GPUs, including standalone \"discrete GPUs,\" the most powerful GPUs.",
        "search_score": 0.914886474609375
    },
    {
        "id": "N4",
        "name": "Logic chip design: AI ASICs",
        "description": "Application-specific integrated circuits for artif

## A Note on How Tracking Sources Works
Tracking where data comes from is important for RAG traceability.  Graph-ND enables tracking the source(s) of each individual node and relationship.

By default, every node and relationship will have a `__source_id` property containing a list of ids.  For each id there exists at least one `__Source__` node containing source metadata. By default, this node will have the following fields:

- `id`: the id of the source (same value as in `__source_id`)
- `file`: the file path
- `name`: the name of the source
- `sourceType`: the type of source, e.g. "UNSTRUCTURED_TEXT_PDF_FILE", "STRUCTURED_CSV_TABLE", etc.
- `transformType`: how the data was transformed from the source and whether an LLM was involved, e.g. "LLM_TEXT_EXTRACTION_TO_NODES" or "TABLE_MAPPING_TO_NODE" (the presence of "LLM" implies an LLM was used)
- `loadType`: the type of loading used, e.g. "MERGE_NODES", "MERGE_NODES_AND_RELATIONSHIPS", etc.
- `createdAt`: timestamp at the time of write, e.g. 2025-04-13T04:22:36.246425779Z

GraphRAG generates this metadata automatically, but you can customize it.
Every graphrag `merge` method has an optional `source_metadata` argument.

- `source_metadata` (`Union[bool, Dict[str, Any]]`, optional): Metadata for the source being merged. Default is `True`.
    - If `True`, default source metadata is prepared and added to a `__Source__` node in the graph. A `__source_id` property is added to each node (or appended to, if it already exists) and maps to the `id` property of the `__Source__` node.
    - If `False`, no source metadata is added to the graph.
    - If a custom dictionary is provided, source metadata is added as in the `True` case, and the dictionary's properties are added to or override the default ones.
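
The three cases can be illustrated with a small stand-alone sketch. The helper `resolve_source_metadata` is hypothetical and only mirrors the behavior described above; it is not Graph-ND's implementation. The default values and the `owner` key are made up for the example.

```python
from typing import Any, Dict, Optional, Union


def resolve_source_metadata(
    source_metadata: Union[bool, Dict[str, Any]],
    defaults: Dict[str, Any],
) -> Optional[Dict[str, Any]]:
    """Illustrate the three documented cases for `source_metadata`."""
    if source_metadata is False:
        return None  # no source metadata is written
    if source_metadata is True:
        return dict(defaults)  # default metadata only
    # Custom dict: start from the defaults, then add or override keys
    return {**defaults, **source_metadata}


# Illustrative defaults; the real ones are generated by the library
defaults = {"name": "component-catalog.pdf", "sourceType": "UNSTRUCTURED_TEXT_PDF_FILE"}
# "owner" is a made-up custom field; "name" overrides the default
custom = {"name": "Component catalog", "owner": "hardware-team"}
print(resolve_source_metadata(custom, defaults))
```

With a live `graphrag` instance, the corresponding call would presumably look like `graphrag.data.merge_pdf('component-catalog.pdf', source_metadata=custom)`.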



`__Source__` nodes have no relationships to other nodes. You can only match them through the shared id in the `__source_id` property. This decision was deliberate, though it is open to change in later revisions. For one, relationships may have a smaller set of sources than the nodes they connect, and there isn't a simple way to connect relationships to other relationships. The other, bigger reason was to avoid bad text2Cypher queries that may unknowingly and erroneously traverse `__Source__` nodes, exploding context windows or retrieving bad information. Finally, and this is more of a personal opinion, the UX feels much cleaner in Neo4j tools like Query and Explore: you avoid supernode hairballs, the schema visualizations are easier to understand, etc.
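
Since the only link is the shared id, looking up the sources of a node means unwinding its `__source_id` list and matching `__Source__` nodes by `id`. The sketch below shows one way to do that with the Neo4j Python driver. The query and the helper `sources_for_component` are assumptions based on the property names described above, not a Graph-ND API, and `execute_query` requires a recent 5.x driver.

```python
from typing import Any, Dict, List

# Expand a Component's __source_id list and look up each __Source__ node by id.
# Backticks keep the double-underscore names unambiguous in Cypher.
SOURCES_FOR_COMPONENT = """
MATCH (c:Component {id: $component_id})
UNWIND c.`__source_id` AS source_id
MATCH (s:`__Source__` {id: source_id})
RETURN s.id AS id, s.file AS file, s.sourceType AS sourceType,
       s.transformType AS transformType
"""


def sources_for_component(driver: Any, component_id: str) -> List[Dict[str, Any]]:
    """Return source metadata rows for one Component node.

    `driver` is expected to be a neo4j.Driver, such as `db_client` in this notebook.
    """
    records, _summary, _keys = driver.execute_query(
        SOURCES_FOR_COMPONENT, component_id=component_id
    )
    return [record.data() for record in records]
```

For example, `sources_for_component(db_client, "N26")` would list the sources recorded for the "Wafer" component, assuming the data has been merged with source tracking enabled.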



## Clean up

In [6]:
# drop all the data in the graph (nodes, rels, indexes,...everything)
graphrag.data.nuke()