# Convert JSON to GraphDocument and insert into Neo4j -- Custom System Prompt

- The conversion is done with help from an LLM, so we limit it to a small amount of data.
- The LLM creates the graph schema from a `custom system prompt`, *tailored to the AML JSON data*.
- We only show an example of creating the schema for `EntityType="Individual"`; the reader is encouraged to repeat the example for `EntityType="Entity"` as an exercise.

- **Why the Default `system_prompt` Doesn't Work for Our Data:**
  - The default system prompt in `LLMGraphTransformer` instructs the LLM to use **names or human-readable identifiers** found in the text (never integers) as node `id` values.
  - **This is not suitable for our data** since we prefer using `EntityID` as the unique identifier.
  - Using names as `id` can cause **inconsistencies** and **duplicate nodes**, especially when dealing with aliases or name variations.
  - To ensure a **structured and unique graph**, we override this behavior by explicitly setting `EntityID` as the primary identifier.
  - You can find the full default system prompt [here](https://python.langchain.com/api_reference/_modules/langchain_experimental/graph_transformers/llm.html#LLMGraphTransformer).
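
The duplication risk described above can be illustrated with a minimal, library-free sketch. The records, the names in them, and the `merge_nodes` helper are hypothetical: they only mimic how a graph store merges nodes that share an `id`, and are not taken from the OFAC data or from LangChain.

```python
# Hypothetical records: one individual listed under two spellings of the name.
records = [
    {"EntityID": 101, "name": "Juan Perez"},
    {"EntityID": 101, "name": "PEREZ, Juan"},
    {"EntityID": 202, "name": "Maria Gomez"},
]


def merge_nodes(records, key):
    """Group records into nodes keyed by `key`, similar to merging on a node id."""
    nodes = {}
    for record in records:
        nodes.setdefault(record[key], []).append(record)
    return nodes


nodes_by_name = merge_nodes(records, "name")
nodes_by_entity_id = merge_nodes(records, "EntityID")

print(len(nodes_by_name))       # one node per distinct spelling
print(len(nodes_by_entity_id))  # one node per individual
```

Keying on the name splits the first individual into two nodes, while keying on `EntityID` keeps one node per individual and leaves the alternative spelling to be stored as an alias.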



In [1]:
import sys
import os

from dotenv import load_dotenv
sys.path.append(os.path.abspath('..'))
load_dotenv('../.env',override=True)

True
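
The system prompt in the next cell is passed to `ChatPromptTemplate` with `template_format="f-string"`, which is why the literal braces in its relationship pattern are doubled. The sketch below uses Python's built-in `str.format`, on the assumption that the f-string template format follows the same escaping convention; the template text and the `REL` pattern are hypothetical stand-ins, not the actual prompt.

```python
# Doubled braces are emitted as literal braces; {input} is the only variable.
template = 'Pattern: (A) -[:REL {{key: "value"}}]-> (B)\nData: {input}'

rendered = template.format(input='{"EntityID": 101}')
print(rendered)

# With single braces, the pattern is parsed as a template variable named `key`.
unescaped_failed = False
try:
    'Pattern: [:REL {key: "value"}] Data: {input}'.format(input="x")
except KeyError:
    unescaped_failed = True

print(unescaped_failed)
```

Braces inside the substituted value (the JSON record) are not parsed again, so only the template text itself needs escaping.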

In [2]:
import os
import json
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI
from langchain_core.documents import Document
from langchain_neo4j import Neo4jGraph
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm
from langchain_core.prompts import ChatPromptTemplate

# Load environment variables
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
NEO4J_URI = os.getenv("NEO4J_URI", "bolt://localhost:7687")
NEO4J_USER = os.getenv("NEO4J_USER", "neo4j")
NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD", "password")

# Initialize Neo4jGraph (LangChain handles the connection)
graph = Neo4jGraph(
    url=NEO4J_URI,
    username=NEO4J_USER,
    password=NEO4J_PASSWORD,
    enhanced_schema=True,
)


# **Step 1: Delete old graph before inserting new data**
# WARNING: this removes every node and relationship in the connected database.
graph.query("MATCH (n) DETACH DELETE n")
print("Old graph deleted.")

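# Define allowed nodes and relationships.
# Note: the prompt below also describes a `Sanction` node type, but it is not
# listed here, so `Sanction` nodes should not be expected in the extracted graph.
allowed_nodes = ["Person", "Alias", "Address", "Program", "IdentityDocument"]
allowed_relationships = ["HAS_ALIAS", "HAS_ADDRESS", "SANCTIONED_BY", "HAS_DOCUMENT"]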

# Define system prompt
system_prompt = """


# Knowledge Graph Extraction Instructions for GPT-4
## 1. Overview
You are an advanced AI system specializing in transforming structured data into a **graph-based knowledge representation**. 
Your goal is to extract **entities (nodes)** and **relationships (edges)** from the provided data.

### 2. Allowed **Node Types**
Use only these entity types as **nodes**:
- **Person** → Represents an individual identified by `EntityID`.
- **Alias** → Represents alternative names of a **Person**.
- **Address** → Represents geographical locations (e.g., country, city).
- **Sanction** → Represents a sanction action against a **Person**.
- **Program** → Represents the sanction program associated with the **Sanction**.
- **IdentityDocument** → Represents any official documents or listings associated with the person.


### 3. Allowed **Relationships**
Use only these **relationships** to connect nodes:
- **HAS_ALIAS** → `(Person) -[:HAS_ALIAS]-> (Alias)`
- **HAS_ADDRESS** → `(Person) -[:HAS_ADDRESS]-> (Address)`
- **SANCTIONED_BY** → `(Person) -[:SANCTIONED_BY {{sanction_type: "Block"}}]-> (Program)`
- **HAS_DOCUMENT** → `(Person) -[:HAS_DOCUMENT]-> (IdentityDocument)`


### 4. **How to Extract Nodes and Relationships**
#### 🟢 **Person Nodes**
- Extract a **Person** node using:
  - **EntityID** (Unique identifier)
  - **Full Name (Primary Name)**
  - **First Name** (if available)
  - **Last Name** (if available)
  - **Gender** (if available)
  - **Birthdate** (if available)
  - **Place of Birth** (if available)
  - **Nationality Country** (if available)
  - **EntityType** (e.g., `"Individual"`, `"Entity"`, etc.)
  - **Title** (if available, e.g., `"Director of Organization XYZ"`)
  - ❌ DO NOT use a generic `Entity` label.

#### 🟢 **Alias Nodes**
- If a **Person** has multiple names (`IsPrimary=false`), store **non-primary names** as **Alias** nodes.
- Ensure all **Alias** nodes use:
  - **Full Name** (`fullName` as the attribute key, instead of `aliasName`)
  - **First Name** (if available)
  - **Last Name** (if available)
- Link them with **HAS_ALIAS**: `(Person) -[:HAS_ALIAS]-> (Alias)`

#### 🟢 **Address Nodes**
- If an **Address** exists, create an **Address** node with attributes:
  - **Country** (e.g., `"Colombia"`)
  - **City** (if available, e.g., `"Cartago"`)
  - **Address** (if available, e.g., `"Carrera 4 No. 16-04 apt. 303"`)
  - ❌ DO NOT use `location` as an attribute.
  - ❌ DO NOT use a generic `Entity` label.
- Link it to **Person** with **HAS_ADDRESS**: `(Person) -[:HAS_ADDRESS]-> (Address)`

#### 🟢 **Sanction Nodes**
- Extract a **Sanction** node for every **sanction program** under `Sanctions.Programs`.
- Link it to **Person** using **SANCTIONED_BY**: (Person) -[:SANCTIONED_BY]-> (Sanction)

#### 🟢 **Program Nodes**
- Ensure that **all Program nodes** use only the attribute `name` for consistency.
- If `programName` or `value` exists, rename it to `name`.

### 5. **Coreference Resolution (Avoid Duplication)**
- **Ensure each person is uniquely identified** by their `EntityID`—never create duplicate persons.
- **Reuse existing Alias and Address nodes** when appropriate.


Strictly adhere to these guidelines.


"""


# LLM setup
llm = ChatOpenAI(temperature=0, model_name="gpt-4o")

# Define ChatPromptTemplate
chat_prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "Extract entities and relationships from the following data: {input}")
], template_format="f-string")

# LLMGraphTransformer
llm_transformer = LLMGraphTransformer(
    llm=llm,
    prompt=chat_prompt,
    allowed_nodes=allowed_nodes,
    allowed_relationships=allowed_relationships,
    node_properties=True,
    relationship_properties=True,
)

# Load JSON data
with open("ofac_data_small.json", "r", encoding="utf-8") as f:
    data = json.load(f)["individuals"]

# Convert one JSON record (serialized as text) into a list of graph documents
def process_text(text: str):
    doc = Document(page_content=text)
    return llm_transformer.convert_to_graph_documents([doc])

# Transform data using LLMGraphTransformer with parallelization
graph_documents = []
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(process_text, json.dumps(entity)) for entity in data]
    for future in tqdm(as_completed(futures), total=len(futures), desc="Processing documents"):
        graph_documents.extend(future.result())

# **Step 2: Insert new graph data**
graph.add_graph_documents(graph_documents, baseEntityLabel=False, include_source=False)

print("New graph data successfully added to Neo4j!")

Old graph deleted.


Processing documents: 100%|██████████| 50/50 [02:19<00:00,  2.79s/it]


New graph data successfully added to Neo4j!
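
In the cell above, `future.result()` re-raises any exception from a worker, so a single failed LLM call would stop the loop before anything is inserted, after the old graph has already been deleted. One possible way to make the loop more tolerant is sketched below. `fake_process` is a hypothetical stand-in for `process_text` so the example runs without API access, and the inputs are made up.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_process(text):
    """Hypothetical stand-in for process_text; fails on one input on purpose."""
    if text == "bad":
        raise ValueError("simulated LLM failure")
    return [f"graph-doc:{text}"]


inputs = ["a", "bad", "b"]
results, failures = [], []

with ThreadPoolExecutor(max_workers=3) as executor:
    # Map each future back to its input so failed records can be retried later.
    future_to_input = {executor.submit(fake_process, text): text for text in inputs}
    for future in as_completed(future_to_input):
        try:
            results.extend(future.result())
        except Exception as exc:
            failures.append((future_to_input[future], repr(exc)))

print(sorted(results))
print([text for text, _ in failures])
```

With this pattern the successful records can still be inserted, and the `failures` list shows which inputs to inspect or resubmit.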
