# Using Graph-ND with Pre-Existing Graphs

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/zach-blumenfeld/graph-nd/blob/main/examples/companies/companies.ipynb)

graph-nd is designed for end-to-end GraphRAG workflows that include building graphs from scratch: empty graph -> mapping source data -> agentic GraphRAG.

However, if you already have an existing graph database—with data loaded through external workflows—you can still leverage graph-nd's powerful GraphRAG capabilities on it, even with read-only access.

To get started, call the `GraphRAG.schema.from_existing_graph()` method. You can then access pre-built GraphRAG agents and optionally build on them with expert tools as needed.

Example below.


## Example Graph DB
We will demonstrate on the Neo4j Labs companies DB, a graph created without graph-nd that contains companies, associated industries, people who work at or invested in those companies, and articles that report on them. The data is sourced from a small subset (250k entities) of [Diffbot's](https://diffbot.com/) global Knowledge Graph (50bn entities).

The database is publicly available with a read-only user. You can explore the data at [https://demo.neo4jlabs.com:7473/browser/](https://demo.neo4jlabs.com:7473/browser/).

![Companies Graph](https://github.com/zach-blumenfeld/graph-nd/blob/main/examples/companies/img/companies-schema.png?raw=1)


## Setup

In [None]:
%%capture
%pip install graph-nd

In [None]:
from neo4j import GraphDatabase

#connection details
uri = "neo4j+s://demo.neo4jlabs.com"
username = "companies"
password = "companies"

db_client = GraphDatabase.driver(uri, auth=(username, password))

#test connection
db_client.execute_query('RETURN 1')

## Create GraphRAG Object

In [None]:
from dotenv import load_dotenv
import os
from graph_nd import GraphRAG
from getpass import getpass
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

load_dotenv('nb.env', override=True) # for OPENAI_API_KEY
if not os.getenv("OPENAI_API_KEY"):
    os.environ['OPENAI_API_KEY'] = getpass("Please enter your OpenAI API key: ")

llm = ChatOpenAI(model="gpt-4o", temperature=0.0)
embedding_model = OpenAIEmbeddings(model='text-embedding-ada-002')

#instantiate graphrag
graphrag = GraphRAG(db_client, llm, embedding_model)


## Get Schema and Have Agentic GraphRAG (2 lines of code :) )

Note that you may get some warnings here. Graphs created externally will not perfectly conform to graph-nd `GraphSchema` modeling assumptions. This method constructs a valid graph schema and produces warnings where things are not well optimized or need to be excluded due to modeling assumptions or limitations.

In [None]:
graphrag.schema.from_existing_graph()
print(graphrag.schema.prompt_str())

In [None]:
graphrag.agent("what articles mention GPUs?")

In [None]:
graphrag.agent("How many people exist who have been both a board member and  CEOs at some point? even of separate orgs?")

## Additional Arguments & Usage Details
`schema.from_existing_graph()` has various optional arguments for customizing the schema:

- `exclude_prefixes`: A tuple of strings containing prefixes. Node labels, relationship types, or properties
    starting with any of these prefixes are excluded, defaults to ("_", " ").
- `exclude_exact_matches`: An optional set of exact node labels, relationship types, or property names to
    exclude from the schema, defaults to None if not provided.
- `text_embed_index_map`: An optional dictionary mapping {text_embedding_index_name: text_property}
    where text_property is a node property that is used to calculate the embedding. This is required to use
    text embedding search fields for nodes. If not provided, no text embedding search fields will be included in the schema.
    Defaults to None.
- `parallel_rel_ids`: An optional dictionary mapping relationship
    types to their parallel relationship ID property names: `{rel_type: property_name}`. This is only required if the
    user wishes to ingest more data while maintaining parallel relationships for specific node types
    (more than one instance of a relationship type existing between the same start and end nodes). Defaults to None.
- `description`: Optional description of the generated graph schema. Exposed to the LLM when accessing the graph through GraphRAG.
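
As a rough sketch of how these arguments combine, the snippet below collects several of them into one set of keyword arguments. The argument names come from the list above; the values (the extra `tmp_` prefix, the `Internal` label, and the description text) are hypothetical placeholders rather than names known to exist in the companies database. The actual call is left commented out because it needs the live `graphrag` object created earlier.

```python
# Hypothetical values for illustration only; adjust to your own graph.
schema_kwargs = {
    "exclude_prefixes": ("_", " ", "tmp_"),      # defaults plus one assumed extra prefix
    "exclude_exact_matches": {"Internal"},       # assumed label to skip
    "text_embed_index_map": {"news": "text"},    # {vector_index_name: text_property}
    "description": "Companies, people, industries, and news articles.",
}

# With the `graphrag` object from the earlier cells:
# graphrag.schema.from_existing_graph(**schema_kwargs)
# print(graphrag.schema.prompt_str())
```

`parallel_rel_ids` is omitted here since it only matters when you plan to ingest more data, which a read-only user cannot do.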

## Get Schema With Text Embedding Index for Chunk
Below is an example of including text embedding indexes to inform GraphRAG node search.

In [None]:
graphrag.schema.from_existing_graph(text_embed_index_map={'news':'text'})

In [None]:
graphrag.agent("What chunks mention high tech stuff? use semantic search")

## Further Customizations & Tools
See the [retail example](../retail/retail-example.ipynb) for how to add expert tools and create customizable LangGraph agents from this point.

## More Details on `GraphSchema`
Below are more details on how the graph-nd internal `GraphSchema` works and the modeling assumptions it makes.  It has specific opinions and limitations to help with automated retrieval tool design and data loading - though these are subject to change in the future as needed.

### `GraphSchema` Assumptions & Limitations

1. A `GraphSchema` is composed of three elements:
    1. an optional description
    2. a list of `NodeSchema`s
    3. a list of `RelationshipSchema`s
2. There can be only one node schema per node label and one relationship schema per relationship type.
3. Both nodes and relationships can have any number of properties.
4. Only properties with types in `ALLOWED_PROPERTY_TYPES = {"STRING", "INTEGER", "FLOAT", "BOOLEAN", "DATE", "DATE_TIME"}` are considered.
5. No methods in `graphrag.schema.*` assume any write permissions to the database (technically only [`reader`](https://neo4j.com/docs/operations-manual/current/authentication-authorization/built-in-roles/#access-control-built-in-roles-reader) permissions are assumed), meaning that no indexes or constraints can be set while creating schemas. They are instead checked and created if needed when writing data via `graphrag.data.*` methods. In general ONLY `graphrag.data.*` methods attempt writes and index/constraint setting.
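
To make points 1 and 2 concrete, here is a minimal sketch of the three-part structure with a check for the one-schema-per-label and one-schema-per-type rule. The classes `NodeSketch`, `RelSketch`, and `SchemaSketch`, along with their field names and the example labels, are hypothetical stand-ins written for this illustration; they are not graph-nd's actual `GraphSchema` classes.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class NodeSketch:  # hypothetical stand-in for a node schema
    label: str
    id_property: str
    properties: dict = field(default_factory=dict)  # {property_name: type}

@dataclass
class RelSketch:  # hypothetical stand-in for a relationship schema
    rel_type: str
    query_patterns: list  # [(start_label, end_label), ...]
    properties: dict = field(default_factory=dict)

@dataclass
class SchemaSketch:  # hypothetical stand-in for the overall schema
    nodes: list
    relationships: list
    description: Optional[str] = None

    def __post_init__(self):
        # Enforce one schema per node label and per relationship type
        checks = (
            ("node label", [n.label for n in self.nodes]),
            ("relationship type", [r.rel_type for r in self.relationships]),
        )
        for kind, names in checks:
            dupes = [name for name, count in Counter(names).items() if count > 1]
            if dupes:
                raise ValueError(f"duplicate {kind}(s): {dupes}")

schema = SchemaSketch(
    nodes=[NodeSketch("Person", "id"), NodeSketch("Organization", "id")],
    relationships=[RelSketch("HAS_CEO", [("Organization", "Person")])],
)
```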

#### Node Schema Assumptions and Limitations
1. Every node label must have one id property (non-composite) which is assumed to uniquely identify nodes of that label.
    - This id property is required whenever loading nodes and relationships, regardless of source (structured or unstructured).
    - In general, elements (nodes and rels) are merged on their unique id property(ies).
2. Vector and full text indexes are supported per node label through `searchFields` owned by the individual `NodeSchema`
3. Multi-label and multi-property vector/fulltext indexes are not currently supported. You can only use vector and full text indexes that are set on one node label and one node property.

#### Relationship Schema Assumptions and Limitations
1. Vector and full text indexes are not currently supported for relationship properties.
2. Relationships have a list of one or more `queryPatterns`. A `queryPattern` is composed of a start and end node label. The `queryPatterns` in a `RelationshipSchema` dictate which node labels a relationship can exist between.
3. Relationships have an *optional* id property which, if provided, is used to identify relationships for the purpose of maintaining parallel relationships (more than one relationship of the same type existing between the same start and end nodes). If this id isn't provided, repeat instances will be merged into a single relationship when loading data.
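
The effect of the optional relationship id can be sketched with a plain dictionary merge. This is only an illustration of the merging behavior described in point 3, not graph-nd's loading code; the function `merge_relationships`, the `INVESTED_IN` type, and the `round_id` property are assumptions made up for the example.

```python
def merge_relationships(rows, rel_id_property=None):
    """Collapse rows into relationships the way a merge on a key would."""
    merged = {}
    for row in rows:
        key = (row["start"], row["type"], row["end"])
        if rel_id_property is not None:
            # The id property becomes part of the merge key, so parallel rels survive
            key += (row["properties"][rel_id_property],)
        # Rows sharing a key update the same relationship's properties
        merged.setdefault(key, {}).update(row["properties"])
    return merged

rows = [
    {"start": "p1", "type": "INVESTED_IN", "end": "c1",
     "properties": {"round_id": "A", "amount": 1}},
    {"start": "p1", "type": "INVESTED_IN", "end": "c1",
     "properties": {"round_id": "B", "amount": 5}},
]

print(len(merge_relationships(rows)))              # 1: merged into a single relationship
print(len(merge_relationships(rows, "round_id")))  # 2: parallel relationships kept
```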


### `GraphSchema` Opinions When Using `graphrag.schema.from_existing_graph()`


#### Naming Conventions & Exclusions
1. By default, properties, labels, and relationship types that start with an underscore `_` or a space ` ` are ignored.
    - Users can customize the prefix and exact-match exclusion criteria across properties, labels, and relationship types.
    - An `INFO` message is emitted on exclusion.
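
The exclusion rule above can be sketched in a few lines. The helper `is_excluded` and the sample names are hypothetical; only the default prefixes and the idea of prefix and exact-match exclusion come from the text.

```python
def is_excluded(name, exclude_prefixes=("_", " "), exclude_exact_matches=None):
    """Return True if a label, relationship type, or property name should be skipped."""
    exact = exclude_exact_matches or set()
    # str.startswith accepts a tuple and matches any of its prefixes
    return name.startswith(tuple(exclude_prefixes)) or name in exact

names = ["Organization", "_Bloom_Perspective_", " scratch", "Article", "Debug"]
kept = [n for n in names if not is_excluded(n, exclude_exact_matches={"Debug"})]
print(kept)  # ['Organization', 'Article']
```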

#### Node Label & Id
1. Multi-labels are not considered. If multi-labels are encountered, they are treated as separate nodes.
2. Node ids should have a non-composite uniqueness constraint on a property of `ALLOWED_PROPERTY_TYPES`. If multiple are found, `tie-break` logic is used. If none are found, the following fallback methods are tried in order.
     - A `WARNING` is emitted explaining fallback choices, properties not falling in `ALLOWED_PROPERTY_TYPES`, and tie-breaking if it happens.
     - Fallback methods:
       1. Look for properties with range indexes. If found, choose the one with the highest unique count. If there are ties, use `tie-break` logic.
       2. Else, choose the property with the highest unique count. If there are ties, use `tie-break` logic.
       3. Else, if the label has no properties, the node label is ignored.
     - `tie-break` logic: choose the property with the shortest name (e.g. between properties "id" and "name", "id" would be chosen since it has only 2 characters vs. 4). If the names are the same length, choose the first name in ascending sort order.
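
The id selection order and `tie-break` rule above can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's implementation: the function names and input shapes are hypothetical, and the inputs are assumed to already be limited to properties of `ALLOWED_PROPERTY_TYPES`.

```python
def tie_break(names):
    """Shortest name wins; equal lengths fall back to ascending sort order."""
    return min(names, key=lambda n: (len(n), n))

def top_by_unique_count(unique_counts):
    """Return the property with the highest unique count, tie-breaking if needed."""
    best = max(unique_counts.values())
    return tie_break([p for p, c in unique_counts.items() if c == best])

def choose_node_id(constrained, range_indexed, unique_counts):
    """
    constrained:   property names with a non-composite uniqueness constraint
    range_indexed: property names with a range index
    unique_counts: {property_name: distinct value count}
    Returns (property_name or None, how it was chosen).
    """
    if constrained:
        return tie_break(constrained), "uniqueness constraint"
    indexed = {p: c for p, c in unique_counts.items() if p in range_indexed}
    if indexed:
        return top_by_unique_count(indexed), "fallback: range index"
    if unique_counts:
        return top_by_unique_count(unique_counts), "fallback: unique count"
    return None, "ignored: no usable properties"

print(tie_break(["name", "id"]))                                     # id
print(choose_node_id([], ["title"], {"title": 90, "summary": 100}))  # ('title', 'fallback: range index')
print(choose_node_id([], [], {"name": 10, "code": 10}))              # ('code', 'fallback: unique count')
```

Returning the reason alongside the chosen property mirrors the requirement that fallback choices be explained in a warning.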

#### Node Properties
1. Properties that have a vector index are excluded from property lists. This avoids returning them to LLMs during retrieval and blowing up context windows. They can be taken into account later through search fields, a separate part of the graph schema that informs vector search and retrieval.
    - These are silently excluded
2. Only properties with types `ALLOWED_PROPERTY_TYPES` can be considered
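
A minimal sketch of these two property rules is shown below. The helper `included_properties` and the sample property names are hypothetical, and the type strings `"LIST"` and `"POINT"` are illustrative labels for types outside the allowed set.

```python
ALLOWED_PROPERTY_TYPES = {"STRING", "INTEGER", "FLOAT", "BOOLEAN", "DATE", "DATE_TIME"}

def included_properties(property_types, vector_indexed):
    """
    property_types: {property_name: type string}
    vector_indexed: property names that have a vector index
    """
    return {
        name: ptype
        for name, ptype in property_types.items()
        if ptype in ALLOWED_PROPERTY_TYPES and name not in vector_indexed
    }

props = {"id": "STRING", "text": "STRING", "embedding": "LIST", "location": "POINT"}
print(included_properties(props, vector_indexed={"embedding"}))
# {'id': 'STRING', 'text': 'STRING'}
```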

#### Node Search Fields
1. Only FULLTEXT and TEXTEMBEDDING search fields are considered.
2. For full text indexes, the property must be included (per the rules above) and must have a full text index on it.
3. TEXTEMBEDDING is more complicated because Neo4j does not associate the text embedding property with the field it was calculated from.
    1. The user must specify which indexes and properties to include as text embedding search fields through the `text_embed_index_map` argument.
    2. If the corresponding vector index isn't found, an `ERROR` is thrown.
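
The text embedding check in point 3 can be sketched as a lookup against the vector indexes found in the database. The function `resolve_text_embed_fields`, its input shapes, and the returned dictionary keys are hypothetical. The `news` index and `text` property come from the earlier example, and pairing that index with a `Chunk` label is an assumption based on that example's heading.

```python
def resolve_text_embed_fields(text_embed_index_map, vector_indexes):
    """
    text_embed_index_map: {index_name: text_property} supplied by the user
    vector_indexes:       {index_name: node_label} for vector indexes in the database
    """
    fields = []
    for index_name, text_property in (text_embed_index_map or {}).items():
        if index_name not in vector_indexes:
            # Mirrors the documented behavior: a missing index is an error
            raise ValueError(f"Vector index '{index_name}' not found in the database")
        fields.append({
            "label": vector_indexes[index_name],
            "index": index_name,
            "calculated_from": text_property,
        })
    return fields

vector_indexes = {"news": "Chunk"}  # assumed result of inspecting the database
fields = resolve_text_embed_fields({"news": "text"}, vector_indexes)
print(fields)
```

Passing no map yields an empty list, which matches the rule that no text embedding search fields are included unless the user supplies them.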

#### Relationship TYPE and Id
1. Only Relationship Types between included nodes (above) will be included
2. Relationship Ids (for parallel relationships) cannot be inferred and will be assumed `None` unless users provide them through the `parallel_rel_ids` argument.

#### Relationship Properties
1. Properties that have a vector index are excluded from property lists. This avoids returning them to LLMs during retrieval and blowing up context windows.
     - These are silently excluded
2. Only properties with types `ALLOWED_PROPERTY_TYPES` can be considered


