## Setup and Test Neo4j Connection

In [6]:
from py2neo import Graph

# Connect to the local Neo4j database
graph = Graph("neo4j://127.0.0.1:7687", auth=("neo4j", "testpassword"))

# Run a test query
result = graph.run("RETURN 'Hello from Neo4j!' AS message").data()
print(result)

[{'message': 'Hello from Neo4j!'}]


Connected!

## Quick Check on the Dataset

In [7]:
import pandas as pd

# Load the procurement dataset
df = pd.read_csv("Procurement KPI Analysis Dataset.csv")

# View the first few rows
print(df.head())

      PO_ID         Supplier  Order_Date Delivery_Date    Item_Category  \
0  PO-00001        Alpha_Inc  2023-10-17    2023-10-25  Office Supplies   
1  PO-00002  Delta_Logistics  2022-04-25    2022-05-05  Office Supplies   
2  PO-00003         Gamma_Co  2022-01-26    2022-02-15              MRO   
3  PO-00004    Beta_Supplies  2022-10-09    2022-10-28        Packaging   
4  PO-00005  Delta_Logistics  2022-09-08    2022-09-20    Raw Materials   

  Order_Status  Quantity  Unit_Price  Negotiated_Price  Defective_Units  \
0    Cancelled      1176       20.13             17.81              NaN   
1    Delivered      1509       39.32             37.34            235.0   
2    Delivered       910       95.51             92.26             41.0   
3    Delivered      1344       99.85             95.52            112.0   
4    Delivered      1180       64.07             60.53            171.0   

  Compliance  
0        Yes  
1        Yes  
2        Yes  
3        Yes  
4         No  


In [12]:
print(df.columns.tolist())

['PO_ID', 'Supplier', 'Order_Date', 'Delivery_Date', 'Item_Category', 'Order_Status', 'Quantity', 'Unit_Price', 'Negotiated_Price', 'Defective_Units', 'Compliance']


## Import Required Libraries and Connect to Neo4j

Before we build the knowledge graph, we need to import the necessary libraries and connect to a local Neo4j database. Make sure Neo4j is running and your credentials are correct.


In [None]:
import ast  # For safely evaluating stringified lists of tuples
from py2neo import Graph, Node, Relationship  # To interact with Neo4j graph

# Connect to your running Neo4j database instance
graph = Graph("neo4j://localhost:7687", auth=("neo4j", "testpassword"))

## Define the Graph Construction Function
This function takes GPT's response (a string of triples), parses it, and constructs nodes and relationships inside the Neo4j graph. Each triple is expected to follow the (subject, predicate, object) structure.

In [None]:
def build_graph_from_gpt_response(gpt_response):
    try:
        # Safely parse the GPT response (which is a string of list of tuples)
        triples = ast.literal_eval(gpt_response)
    except Exception as e:
        print("Parsing Error:", e)
        return  # Exit if the format is incorrect

    for triple in triples:
        # Ensure the triple is valid and contains exactly 3 items
        if len(triple) != 3:
            continue

        # Strip and convert all parts to string (to avoid TypeErrors)
        subject, predicate, obj = [str(x).strip() for x in triple]

        # Create graph nodes for subject and object
        subj_node = Node("Entity", name=subject)
        obj_node = Node("Entity", name=obj)

        # Create the relationship between the nodes
        relationship = Relationship(subj_node, predicate, obj_node)

        # Merge ensures no duplicates; updates if nodes/edges exist
        graph.merge(subj_node, "Entity", "name")
        graph.merge(obj_node, "Entity", "name")
        graph.merge(relationship)
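
To make the expected input concrete, here is a minimal sketch of the parsing step on its own, without touching Neo4j. The sample response string and the predicate names in it are invented for illustration; they are not actual GPT output.

```python
import ast

# Hypothetical response string, shaped like the list of tuples the function expects
sample_response = "[('Alpha_Inc', 'SUPPLIES', 'Office Supplies'), ('Alpha_Inc', 'HAS_COMPLIANCE', 'Yes')]"

# literal_eval only accepts Python literals, so it cannot execute arbitrary code
triples = ast.literal_eval(sample_response)

for subject, predicate, obj in triples:
    print(f"({subject}) -[{predicate}]-> ({obj})")
```

Each tuple becomes two `Entity` nodes joined by one relationship whose type is the predicate string.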

## Import OpenAI and Load Dataset
We'll use the OpenAI API to extract structured knowledge from rows in the procurement CSV dataset. We’re using a Procurement KPI Analysis Dataset because it includes product, supplier, pricing, and compliance information. 

In [None]:
import pandas as pd
from openai import OpenAI  # Ensure you've installed the OpenAI package

# Initialize OpenAI client with your API key
client = OpenAI(api_key="OPENAI_API_KEY")  # 🔐 Replace with your actual API key

# Load the Procurement KPI Analysis Dataset
df = pd.read_csv("Procurement KPI Analysis Dataset.csv")
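
Rather than pasting the key into the notebook, one option is to read it from an environment variable. The sketch below assumes the key is stored in a variable named `OPENAI_API_KEY`; the helper `load_api_key` is hypothetical and not part of the OpenAI package.

```python
import os

def load_api_key(var_name="OPENAI_API_KEY", env=None):
    """Hypothetical helper: read the API key from an environment variable."""
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        # Fail early with a clear message instead of a confusing API error later
        raise RuntimeError(f"Set the {var_name} environment variable before running this cell.")
    return key
```

The result could then be passed to the client, for example `client = OpenAI(api_key=load_api_key())`.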

## Loop Over a Few Rows and Send Them to GPT
We'll use GPT to extract subject-predicate-object triples from 3 sample rows in the dataset. Once you confirm that it's working, you can increase the number of rows.

In [None]:
# Loop through the first 3 rows of the dataset
for i in range(3):
    # Extract a single row of structured purchase order data
    row = df.iloc[i]

    # Format a prompt to send to OpenAI for triple extraction
    prompt = f"""
    Extract subject-predicate-object triples from the following purchase order:

    Supplier: {row['Supplier']}
    Item Category: {row['Item_Category']}
    Quantity: {row['Quantity']}
    Unit Price: {row['Unit_Price']}
    Order Status: {row['Order_Status']}
    Compliance: {row['Compliance']}

    Return the triples as a Python list of tuples.
    """

    # Call the OpenAI GPT API to get structured triples
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # You can switch to GPT-4 for higher quality
        messages=[{"role": "user", "content": prompt}],
        temperature=0  # Temperature 0 makes output more repeatable, though not strictly deterministic
    )

    # Extract and clean up GPT's output
    content = response.choices[0].message.content.strip()
    print(f"Triples from row {i}:\n", content)

    # Pass the GPT output to our graph builder function
    build_graph_from_gpt_response(content)
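
Chat models sometimes wrap their answer in a Markdown code fence or add a sentence before the list, which makes `ast.literal_eval` fail on the raw text. The sketch below shows one way to guard against that. The helper `extract_triples` is hypothetical, and the sample reply is invented for illustration.

```python
import ast
import re

FENCE = "`" * 3  # three backticks, built this way so the sample reply stays readable

def extract_triples(raw_text):
    """Hypothetical helper: pull the [...] literal out of a model reply and parse it."""
    # Take everything from the first '[' to the last ']' so fences and chatter are ignored
    match = re.search(r"\[.*\]", raw_text, flags=re.DOTALL)
    if match is None:
        return []
    try:
        parsed = ast.literal_eval(match.group(0))
    except (ValueError, SyntaxError, TypeError):
        return []
    # Keep only well-formed 3-item tuples or lists
    return [tuple(t) for t in parsed if isinstance(t, (tuple, list)) and len(t) == 3]

reply = f"Here are the triples:\n{FENCE}python\n[('Gamma_Co', 'SUPPLIES', 'MRO')]\n{FENCE}"
print(extract_triples(reply))
```

This is a simple heuristic: if the surrounding prose itself contains square brackets, the match will be wrong and the helper returns an empty list. Since `build_graph_from_gpt_response` expects a string, the cleaned list could be converted back with `repr(...)` before calling it, or the function could be adapted to accept a list directly.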

## Visualizing the results in Neo4j Desktop
Open Neo4j Desktop and click Query to confirm that the nodes and relationships were created.
Start by counting the nodes with this Cypher query:


`MATCH (n) RETURN COUNT(n);`

You should see a number greater than 0. In this case, we have 7.

To inspect a sample of the relationships, run this Cypher query:

`MATCH (a)-[r]->(b) RETURN a.name, type(r), b.name LIMIT 10;`

To see the graph visualized, with entities as circles and relationships as arrows, run this Cypher query (limited to 100 relationships):

`MATCH (n)-[r]->(m) RETURN n, r, m LIMIT 100`

## Measuring Knowledge Graph Quality
Building a knowledge graph is great. But how do you know if it’s actually any good?
This is where a lot of beginner AI engineers hit a wall. The graph looks fine, maybe even visualizes nicely in Neo4j, but under the hood, it could be missing key connections, introducing wrong relationships, or lacking coverage. And if you're plugging that graph into a retrieval-augmented generation (RAG) system, a weak graph means weak answers.


## Coverage
Coverage refers to how much of the actual data your graph managed to capture in the form of nodes and edges. If your dataset contains 100 suppliers but only 40 show up as nodes in the graph, you're leaving insight on the table.
You can calculate this manually by comparing the number of entities/relations in your source data versus the ones that made it into the graph.


In [None]:
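supplier_names = df['Supplier'].unique().tolist()
num_suppliers_csv = len(supplier_names)
# Count only Entity nodes whose name matches a supplier in the CSV
num_suppliers_kg = graph.evaluate(
    "MATCH (n:Entity) WHERE n.name IN $names RETURN count(DISTINCT n.name)", names=supplier_names)
print(f"CSV Suppliers: {num_suppliers_csv}, Supplier Nodes in Graph: {num_suppliers_kg}")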
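
The same comparison can be sketched without a database, which makes the arithmetic easy to check. The function `coverage` and the two sample lists are made up for illustration.

```python
def coverage(source_names, graph_names):
    """Return the fraction of source names that also appear in the graph."""
    source = set(source_names)
    if not source:
        return 0.0
    return len(source & set(graph_names)) / len(source)

# Illustrative data: 4 suppliers in the CSV, 2 of them present as graph nodes
csv_suppliers = ["Alpha_Inc", "Delta_Logistics", "Gamma_Co", "Beta_Supplies"]
graph_nodes = ["Alpha_Inc", "Gamma_Co", "Office Supplies", "MRO"]

print(f"Supplier coverage: {coverage(csv_suppliers, graph_nodes):.0%}")
```

Note that this relies on exact name matches, so a supplier stored as "Alpha Inc" in the graph would not count toward "Alpha_Inc" in the CSV.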

## Accuracy
Accuracy means the relationships in the graph are actually correct. For instance, if “Delta Logistics” is shown supplying “Office Supplies” when they really supply “IT Equipment,” that’s a faulty edge.
To check for accuracy, spot-check your triples. A good practice is to manually review a subset of extracted triples and compare them with the original row in the dataset. If you’ve extracted 10 triples from GPT, verify if each subject-predicate-object makes sense contextually.


In [None]:
sample_check = graph.run("""
MATCH (a:Entity)-[r]->(b:Entity)
RETURN a.name AS subject, type(r) AS relation, b.name AS object
LIMIT 10
""").to_data_frame()

print(sample_check)
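
Part of the spot check can be automated. The sketch below flags triples whose subject or object does not appear in the source row. The helper `is_grounded`, the sample row, and the predicate names are assumptions for illustration; a flagged triple still needs a human look, since the check says nothing about whether the predicate is right.

```python
def is_grounded(triple, row):
    """Hypothetical check: subject and object must both match a value in the source row."""
    subject, _predicate, obj = triple
    values = {str(v).strip().lower() for v in row.values()}
    return subject.strip().lower() in values and obj.strip().lower() in values

row = {"Supplier": "Delta_Logistics", "Item_Category": "Office Supplies", "Compliance": "Yes"}
triples = [
    ("Delta_Logistics", "SUPPLIES", "Office Supplies"),  # both ends match the row
    ("Delta_Logistics", "SUPPLIES", "IT Equipment"),     # object is not in the row
]

for t in triples:
    print(t, "->", "ok" if is_grounded(t, row) else "review")
```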



## Completeness
A graph can be accurate and still incomplete. Completeness checks whether all relevant connections were extracted and represented.
Checking this is trickier because it often involves domain knowledge. For example, if your business rule says every supplier must have a “compliance” rating, but your graph has some suppliers without any compliance edges, you’ve got gaps.


In [None]:
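# Find suppliers without an outgoing Compliance relationship.
# Adjust the relationship type to whatever predicate GPT actually emitted.
query = """
MATCH (s:Entity)
WHERE s.name IN $names AND NOT (s)-[:Compliance]->()
RETURN s.name
"""
supplier_names = df['Supplier'].unique().tolist()
missing_compliance = graph.run(query, names=supplier_names).to_data_frame()
print(missing_compliance)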
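
Because GPT chooses the predicate names, the relationship type in the graph may not be exactly `Compliance`. A rough way to check completeness on the raw triples, before they reach Neo4j, is to look for any predicate that mentions the keyword. The function name and sample triples below are hypothetical.

```python
def suppliers_missing_predicate(triples, suppliers, keyword="compliance"):
    """Return suppliers that never appear as the subject of a predicate containing keyword."""
    covered = {s for s, p, _o in triples if keyword in p.lower()}
    return sorted(set(suppliers) - covered)

# Illustrative triples: only Alpha_Inc has a compliance-related predicate
triples = [
    ("Alpha_Inc", "HAS_COMPLIANCE", "Yes"),
    ("Alpha_Inc", "SUPPLIES", "Office Supplies"),
    ("Gamma_Co", "SUPPLIES", "MRO"),
]

print(suppliers_missing_predicate(triples, ["Alpha_Inc", "Gamma_Co"]))
```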

## Evaluation
### How Good Is the Knowledge Graph We Built?
Now that we've built our graph using Neo4j and GPT-generated triples, we can test it. The code below measures what share of the suppliers in the CSV have at least one outgoing relationship in the graph:


In [None]:
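# What % of suppliers in the CSV have at least one outgoing edge in the graph?
supplier_names = df['Supplier'].unique().tolist()
total_suppliers = len(supplier_names)

query = """
MATCH (s:Entity)
WHERE s.name IN $names AND EXISTS {
 MATCH (s)-[]->()
}
RETURN count(DISTINCT s.name) AS linked_suppliers
"""

# Restrict the match to supplier names so the ratio cannot exceed 100%
linked_suppliers = graph.run(query, names=supplier_names).data()[0]['linked_suppliers']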
coverage_percent = (linked_suppliers / total_suppliers) * 100

print(f"Supplier Node Coverage: {coverage_percent:.2f}%")