### MSDS Knowledge Graph Extraction with Claude-4 Sonnet (Orchestration Model) - Final Version

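This notebook demonstrates how to extract knowledge graph triples from MSDS (Material Safety Data Sheet) documents using the **Orchestration Model with Claude-4 Sonnet**. It uses **enhanced parsing**, a parser that tries several response formats in turn, to recover `(Subject, Predicate, Object)` triples from the model's output.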

### Key Benefits of Claude-4 Sonnet:
- Better understanding of complex MSDS terminology
- More accurate Subject-Predicate-Object extraction
- Enhanced content filtering and safety
- Improved handling of technical documents

## 1. Environment Setup

Load environment variables and import required libraries.

In [1]:
%pip install python-dotenv

import os
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader
from prompts import KNOWLEDGE_TRIPLE_EXTRACTION_PROMPT
from llm_client import CL_Orchestration_Service
import networkx as nx
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import re

# Load environment variables
load_dotenv()

print("✅ Libraries imported successfully")


Note: you may need to restart the kernel to use updated packages.
✅ Libraries imported successfully


## 2. Setup Orchestration Service with Claude-4 Sonnet

In [2]:
# AI Core configuration
aic_config = {
    "aic_auth_url": os.getenv("AI_AUTH_URL"),
    "aic_client_id": os.getenv("AI_CLIENT_ID"),
    "aic_client_secret": os.getenv("AI_CLIENT_SECRET"),
    "aic_resource_group": os.getenv("AI_RESOURCE_GROUP", "default")
}

# Orchestration parameters with Claude-4 Sonnet
orch_model_params = {
    "orch_url": os.getenv("AIC_ORCH_URL"),
    "orch_model": os.getenv("ORCH_MODEL"),
    "parameters": {
        "temperature": 0.3,  # Lower temperature for more consistent extraction
        "max_tokens": 20000,
        "top_p": 0.9
    }
}

# Initialize Orchestration Service
orch_service = CL_Orchestration_Service(aic_config, orch_model_params)
print(f"✅ Orchestration Service URL: {orch_model_params['orch_url']}")
print("✅ Orchestration Service initialized with Claude-4 Sonnet")
print(f"🔧 Model: {orch_model_params['orch_model']}")
print(f"🌡️ Temperature: {orch_model_params['parameters']['temperature']}")

✅ Orchestration Service URL: https://api.ai.prod.us-east-1.aws.ml.hana.ondemand.com/v2/inference/deployments/ddaae0b631e78184
✅ Orchestration Service initialized with Claude-4 Sonnet
🔧 Model: None
🌡️ Temperature: 0.3
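
The `Model: None` line in the output above suggests that `ORCH_MODEL` was not set when the cell ran. A minimal sketch of a pre-flight check, assuming the five variables that this notebook reads via `os.getenv` without a default are all needed; `missing_env_vars` is a hypothetical helper, not part of `llm_client`:

```python
import os

# Names taken from the os.getenv calls in this notebook (those without a default).
REQUIRED_VARS = [
    "AI_AUTH_URL", "AI_CLIENT_ID", "AI_CLIENT_SECRET",
    "AIC_ORCH_URL", "ORCH_MODEL",
]

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"⚠️ Missing environment variables: {', '.join(missing)}")
else:
    print("✅ All required environment variables are set")
```

Running a check like this right after `load_dotenv()` would surface a missing value before any request is sent.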


## 3. Load MSDS Document

In [3]:
# Load MSDS document
pdf_path = "/Users/I310202/Library/CloudStorage/OneDrive-SAPSE/SR@Work/81.Innovations/98.AI_Developments/33.AI_MSDS/Build_MSDS_SAPKGE/Documents/WD-40.pdf"

print(f"📄 Loading MSDS document: {pdf_path}")

try:
    loader = PyPDFLoader(pdf_path)
    docs = loader.load()
    
    # Prefix a short label, then join the text of all pages with single spaces
    doc_content = " ".join(["MSDS Document", *(doc.page_content for doc in docs)])
    
    print("✅ Document loaded successfully")
    print(f"📊 Content length: {len(doc_content)} characters")
    print(f"📄 Number of pages: {len(docs)}")
    
    # Display first 500 characters as preview
    print("\n📖 Document preview (first 500 chars):")
    print("-" * 60)
    print(doc_content[:500] + "...")
    
except Exception as e:
    print(f"❌ Error loading document: {e}")
    doc_content = None

# --- New cell: Section parsing utility ---
import re
def split_msds_sections(doc_content):
    """
    Split MSDS document into sections using standard section headers.
    Returns: dict of {section_title: section_text}
    """
    # Standard MSDS section headers (1-16) look like "1 - Identification":
    # a 1-2 digit number at the start of a line, a dash, then a title that
    # begins with a letter. Anchoring to the line start and requiring a letter
    # keeps phone numbers such as "1-888-324-7596" from being read as headers.
    header_re = re.compile(r"^[ \t]*(\d{1,2}\s*[-–]\s*[A-Za-z][^\n]*)", re.MULTILINE)
    # Find all section headers and their positions
    matches = list(header_re.finditer(doc_content))
    sections = {}
    for i, match in enumerate(matches):
        start = match.start()
        end = matches[i+1].start() if i+1 < len(matches) else len(doc_content)
        title = match.group(1).strip()
        text = doc_content[start:end].strip()
        sections[title] = text
    return sections
print("✅ Section parsing utility defined")

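# --- New cell: Section-wise triple extraction and display ---
# Note: this cell calls parse_triples_from_response, which is defined in section 4;
# run that cell first, otherwise every section falls into the except branch below.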
if doc_content:
    print("\n🔍 SECTION-WISE EXTRACTION WITH CLAUDE-4 SONNET")
    print("=" * 60)
    msds_sections = split_msds_sections(doc_content)
    section_triples = {}
    for section_title, section_text in msds_sections.items():
        print(f"\n📑 Extracting triples for section: {section_title}")
        try:
            formatted_prompt = KNOWLEDGE_TRIPLE_EXTRACTION_PROMPT.template.format(text=section_text)
            response = orch_service.invoke_llm(
                prompt=formatted_prompt,
                model_name="anthropic--claude-4-sonnet",
                temperature=0.3,
                max_tokens=3000
            )
            triples = parse_triples_from_response(response, debug=False)
            section_triples[section_title] = triples
            print(f"   • Extracted {len(triples)} triples")
        except Exception as e:
            print(f"   ❌ Error extracting triples for section: {e}")
            section_triples[section_title] = []
    print("\n================ SECTION-WISE TRIPLES ================")
    for section_title, triples in section_triples.items():
        print(f"\nSection: {section_title}")
        if not triples:
            print("   No triples extracted.")
        else:
            for i, (subj, pred, obj) in enumerate(triples, 1):
                print(f"  {i:2d}. ({subj}, {pred}, {obj})")
else:
    print("❌ No document content available for section-wise extraction.")

📄 Loading MSDS document: /Users/I310202/Library/CloudStorage/OneDrive-SAPSE/SR@Work/81.Innovations/98.AI_Developments/33.AI_MSDS/Build_MSDS_SAPKGE/Documents/WD-40.pdf
✅ Document loaded successfully
📊 Content length: 13576 characters
📄 Number of pages: 5

📖 Document preview (first 500 chars):
------------------------------------------------------------
MSDS Document Page 1 of 5 
 
Safety Data Sheet 
California CARB Compliant 
1 - Identification 
 
Product Name: WD-40 Multi-Use Product Aerosol  
 
Product Use: Lubricant, Penetrant, Drives Out 
Moisture and Protects Surfaces from Corrosion 
 
Restrictions on Use: None identified 
 
SDS Date of Preparation: November 13, 2024 
Manufacturer: WD-40 Company 
Address: 9715 Businesspark Avenue 
   San Diego, California, USA 
  92131 
Telephone:  
Emergency:      1-888-324-7596  
Information:  1-888-324...
✅ Section parsing utility defined

🔍 SECTION-WISE EXTRACTION WITH CLAUDE-4 SONNET

📑 Extracting triples for section: 1 - Identification
   ❌ E
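
To illustrate how a header-based split behaves, here is a small self-contained sketch run on a made-up excerpt. It assumes headers of the form `N - Title` at the start of a line, as in the document preview above; `split_numbered_sections` and the sample text are illustrative and are not the notebook's own function.

```python
import re

# Assumed header shape: 1-2 digits at line start, a dash, then a title starting with a letter.
HEADER_RE = re.compile(r"^[ \t]*(\d{1,2}\s*[-–]\s*[A-Za-z][^\n]*)", re.MULTILINE)

def split_numbered_sections(text):
    """Map each 'N - Title' header to the text from that header up to the next one."""
    matches = list(HEADER_RE.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[match.group(1).strip()] = text[match.start():end].strip()
    return sections

# Made-up sample; the phone number sits mid-line and has digits after the dash,
# so it is not treated as a section header.
sample = (
    "Safety Data Sheet\n"
    "1 - Identification\n"
    "Product Name: Example Product\n"
    "Emergency: 1-888-555-0100\n"
    "2 - Hazards Identification\n"
    "Flammable aerosol.\n"
)

for title, body in split_numbered_sections(sample).items():
    print(f"{title!r}: {len(body)} chars")
```

Text before the first header (here, `Safety Data Sheet`) is not assigned to any section, and two headers with identical titles would overwrite each other in the dict.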

## 4. Enhanced Triple Parsing Function

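This enhanced parsing function tries four methods on each line of the Claude-4 Sonnet response, in order, and moves to the next line as soon as one of them yields a triple: (1) a bare `(Subject, Predicate, Object)` line, (2) a labelled `Subject: X, Predicate: Y, Object: Z` line, (3) a parenthesised triple followed by a delimiter, and (4) a fallback that takes the first three comma-separated fields between the outermost parentheses.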

In [4]:
def parse_triples_from_response(response, debug=True):
    """Parse triples from the LLM response with enhanced parsing and debugging."""
    triples = []
    lines = response.split('\n')
    
    if debug:
        print(f"🔍 DEBUG: Parsing response with {len(lines)} lines")
        print("📝 DEBUG: First 10 lines of response:")
        for i, line in enumerate(lines[:10]):
            print(f"   {i+1}: {repr(line)}")
    
    for line_num, line in enumerate(lines, 1):
        line = line.strip()
        
        # Method 1: Standard parentheses format (Subject, Predicate, Object)
        if line.startswith('(') and line.endswith(')'):
            try:
                content = line[1:-1]
                parts = [part.strip() for part in content.split(',')]
                
                if len(parts) == 3:
                    subject, predicate, obj = parts
                    # Clean up quotes if present
                    subject = subject.strip('"\'')
                    predicate = predicate.strip('"\'')
                    obj = obj.strip('"\'')
                    triples.append((subject, predicate, obj))
                    if debug:
                        print(f"✅ Method 1 - Line {line_num}: ({subject}, {predicate}, {obj})")
                    continue
                    
            except Exception as e:
                if debug:
                    print(f"⚠️ Method 1 failed for line {line_num}: {line}")
        
        # Method 2: Text format "Subject: X, Predicate: Y, Object: Z"
        if 'subject:' in line.lower() and 'predicate:' in line.lower() and 'object:' in line.lower():
            try:
                pattern = r'subject:\s*([^,]+),\s*predicate:\s*([^,]+),\s*object:\s*(.+)'
                match = re.search(pattern, line, re.IGNORECASE)
                if match:
                    subject, predicate, obj = match.groups()
                    subject = subject.strip().strip('"\'')
                    predicate = predicate.strip().strip('"\'')
                    obj = obj.strip().strip('"\'')
                    triples.append((subject, predicate, obj))
                    if debug:
                        print(f"✅ Method 2 - Line {line_num}: ({subject}, {predicate}, {obj})")
                    continue
            except Exception as e:
                if debug:
                    print(f"⚠️ Method 2 failed for line {line_num}: {line}")
        
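        # Method 3: Delimiter-based parsing.
        # The response terminates each triple with the literal delimiter "<|>"
        # (see the debug output in section 5), so match that string rather than
        # the unformatted placeholder name.
        if '<|>' in line:
            try:
                parts = line.split('<|>')[0].strip()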
                if parts.startswith('(') and parts.endswith(')'):
                    content = parts[1:-1]
                    triple_parts = [part.strip().strip('"\'') for part in content.split(',')]
                    if len(triple_parts) == 3:
                        triples.append(tuple(triple_parts))
                        if debug:
                            print(f"✅ Method 3 - Line {line_num}: {tuple(triple_parts)}")
                        continue
            except Exception as e:
                if debug:
                    print(f"⚠️ Method 3 failed for line {line_num}: {line}")
        
        # Method 4: Flexible comma-separated parsing
        if ',' in line and line.count(',') >= 2:
            try:
                if '(' in line and ')' in line:
                    start = line.find('(')
                    end = line.rfind(')')
                    if start != -1 and end != -1 and end > start:
                        content = line[start+1:end]
                        parts = [part.strip().strip('"\'') for part in content.split(',')]
                        if len(parts) >= 3:
                            # Take first 3 parts
                            subject, predicate, obj = parts[0], parts[1], parts[2]
                            if subject and predicate and obj:  # Make sure none are empty
                                triples.append((subject, predicate, obj))
                                if debug:
                                    print(f"✅ Method 4 - Line {line_num}: ({subject}, {predicate}, {obj})")
                                continue
            except Exception as e:
                if debug:
                    print(f"⚠️ Method 4 failed for line {line_num}: {line}")
    
    if debug:
        print(f"📊 DEBUG: Successfully parsed {len(triples)} triples")
    return triples

print("✅ Enhanced parsing function defined")

✅ Enhanced parsing function defined
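
The response lines shown in the debug output of section 5 have the shape `(Subject, Predicate, Object)<|>`. As a compact sketch for that one format, the code below splits on the first two commas only, so an object that itself contains commas is kept whole. The `<|>` delimiter is taken from that debug output; `parse_delimited_triples` is a hypothetical helper, not the notebook's parser.

```python
def parse_delimited_triples(response, delimiter="<|>"):
    """Parse lines shaped like '(Subject, Predicate, Object)<|>' into 3-tuples."""
    triples = []
    for raw_line in response.splitlines():
        line = raw_line.strip()
        if line.endswith(delimiter):
            line = line[: -len(delimiter)].rstrip()
        if not (line.startswith("(") and line.endswith(")")):
            continue  # skip anything that is not a parenthesised triple
        # maxsplit=2 keeps commas inside the object intact
        parts = [p.strip().strip("\"'") for p in line[1:-1].split(",", 2)]
        if len(parts) == 3 and all(parts):
            triples.append(tuple(parts))
    return triples

sample = (
    "(WD-40 Multi-Use Product Aerosol, is a, Chemical)<|>\n"
    "(WD-40 Multi-Use Product Aerosol, is manufactured by, WD-40 Company)<|>\n"
    "not a triple\n"
)
for triple in parse_delimited_triples(sample):
    print(triple)
```

The trade-off is that a comma inside the subject or predicate would still be split incorrectly; this sketch only protects the object field.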


## 5. Extract Triples using Claude-4 Sonnet

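This is the main extraction step: the full document text is inserted into the prompt from the `prompts` module, sent to Claude-4 Sonnet in a single request, and the response is parsed with `parse_triples_from_response`.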

In [5]:
# Extract triples using Claude-4 Sonnet
if doc_content:
    print("🔍 EXTRACTING TRIPLES WITH CLAUDE-4 SONNET")
    print("=" * 60)
    
    try:
        # Format the prompt with the document content
        formatted_prompt = KNOWLEDGE_TRIPLE_EXTRACTION_PROMPT.template.format(text=doc_content)
        
        print("📝 Sending extraction request to Claude-4 Sonnet...")
        print(f"📏 Prompt length: {len(formatted_prompt)} characters")
        
        # Use orchestration service for extraction
        response = orch_service.invoke_llm(
            prompt=formatted_prompt,
            model_name="anthropic--claude-4-sonnet",
            temperature=0.3,
            max_tokens=15000
        )
        
        print("✅ Response received from Claude-4 Sonnet")
        print(f"📏 Response length: {len(response)} characters")
        
        # Parse the response to extract triples using enhanced parsing
        triples = parse_triples_from_response(response, debug=True)
        
        print(f"\n📊 Final result: Extracted {len(triples)} triples")
        
        # Store for later use
        raw_response = response
        
    except Exception as e:
        print(f"❌ Error during triple extraction: {e}")
        triples = []
        raw_response = ""
else:
    print("❌ No document content available for extraction")
    triples = []
    raw_response = ""

🔍 EXTRACTING TRIPLES WITH CLAUDE-4 SONNET
📝 Sending extraction request to Claude-4 Sonnet...
📏 Prompt length: 15487 characters
✅ Response received from Claude-4 Sonnet
📏 Response length: 3022 characters
🔍 DEBUG: Parsing response with 46 lines
📝 DEBUG: First 10 lines of response:
   1: '(WD-40 Multi-Use Product Aerosol, is a, Chemical)<|>'
   2: '(WD-40 Multi-Use Product Aerosol, is manufactured by, WD-40 Company)<|>'
   3: '(WD-40 Multi-Use Product Aerosol, is located at, 9715 Businesspark Avenue San Diego California USA 92131)<|>'
   4: '(WD-40 Multi-Use Product Aerosol, has, Lubricant)<|>'
   5: '(WD-40 Multi-Use Product Aerosol, has, Penetrant)<|>'
   6: '(WD-40 Multi-Use Product Aerosol, is classified as, Aerosol Category 1)<|>'
   7: '(WD-40 Multi-Use Product Aerosol, is classified as, Aspiration Toxicity Category 1)<|>'
   8: '(WD-40 Multi-Use Product Aerosol, is classified as, Specific Target Organ Toxicity Single Exposure Category 3)<|>'
   9: '(WD-40 Multi-Use Product Aerosol,

## 6. Display Extracted Triples

Print the first 10 extracted triples for a quick visual check.

In [6]:
# Display sample triples
print("📋 EXTRACTED TRIPLES (CLAUDE-4 SONNET)")
print("=" * 60)

if not triples:
    print("❌ No triples found")
    print("\n🔍 Debugging suggestions:")
    print("1. Check the raw response format")
    print("2. Run debug_claude_response.py to see exact output")
    print("3. Adjust parsing methods if needed")
else:
    print(f"📊 Total triples extracted: {len(triples)}")
    print("\n🔍 First 10 triples:")
    print("-" * 60)
    
    for i, (subj, pred, obj) in enumerate(triples[:10], 1):
        print(f"{i:2d}. Subject: {subj}")
        print(f"    Predicate: {pred}")
        print(f"    Object: {obj}")
        print()
    
    if len(triples) > 10:
        print(f"... and {len(triples) - 10} more triples")

📋 EXTRACTED TRIPLES (CLAUDE-4 SONNET)
📊 Total triples extracted: 46

🔍 First 10 triples:
------------------------------------------------------------
 1. Subject: WD-40 Multi-Use Product Aerosol
    Predicate: is a
    Object: Chemical

 2. Subject: WD-40 Multi-Use Product Aerosol
    Predicate: is manufactured by
    Object: WD-40 Company

 3. Subject: WD-40 Multi-Use Product Aerosol
    Predicate: is located at
    Object: 9715 Businesspark Avenue San Diego California USA 92131

 4. Subject: WD-40 Multi-Use Product Aerosol
    Predicate: has
    Object: Lubricant

 5. Subject: WD-40 Multi-Use Product Aerosol
    Predicate: has
    Object: Penetrant

 6. Subject: WD-40 Multi-Use Product Aerosol
    Predicate: is classified as
    Object: Aerosol Category 1

 7. Subject: WD-40 Multi-Use Product Aerosol
    Predicate: is classified as
    Object: Aspiration Toxicity Category 1

 8. Subject: WD-40 Multi-Use Product Aerosol
    Predicate: is classified as
    Object: Specific Target Organ

## 7. Quality Analysis

Analyze the quality of the extraction and check for correct Subject-Predicate-Object format.

In [9]:
# Analyze extraction quality
print("📊 EXTRACTION QUALITY ANALYSIS")
print("=" * 60)

if triples:
    # Analyze subjects, predicates, objects
    subjects = {s for s, _, _ in triples}
    predicates = {p for _, p, _ in triples}
    objects = {o for _, _, o in triples}
    
    print(f"🎯 Unique subjects: {len(subjects)}")
    print(f"🔗 Unique predicates: {len(predicates)}")
    print(f"📦 Unique objects: {len(objects)}")
    
    print(f"\n🔍 Most common predicates:")
    predicate_counts = {}
    for _, pred, _ in triples:
        predicate_counts[pred] = predicate_counts.get(pred, 0) + 1
    
    sorted_predicates = sorted(predicate_counts.items(), key=lambda x: x[1], reverse=True)
    for pred, count in sorted_predicates[:10]:
        print(f"   • {pred}: {count} times")
    
    # Check for correct Subject-Predicate-Object format.
    # Predicates are compared by exact match (case-insensitive) so that a
    # short entry such as 'has' does not match inside an unrelated word.
    relationship_words = {'is a', 'has', 'includes', 'is classified as', 'is manufactured by',
                          'is located at', 'is regulated by', 'is exposed to', 'is protected by',
                          'is described by', 'is recommended for'}
    
    correct_examples = [(s, p, o) for s, p, o in triples if p.strip().lower() in relationship_words]
    correct_format_count = len(correct_examples)
    
    print(f"\n✅ Triples with correct predicate format: {correct_format_count}/{len(triples)} ({correct_format_count/len(triples)*100:.1f}%)")
    
    # Show some examples of correct format
    print("\n🎯 Examples of correctly formatted triples:")
    for i, (s, p, o) in enumerate(correct_examples[:5], 1):
        print(f"   {i}. ({s}, {p}, {o})")

else:
    print("❌ No triples to analyze")

print(f"\n📝 Raw response length: {len(raw_response)} characters")

📊 EXTRACTION QUALITY ANALYSIS
🎯 Unique subjects: 5
🔗 Unique predicates: 9
📦 Unique objects: 42

🔍 Most common predicates:
   • has: 19 times
   • is classified as: 8 times
   • is regulated by: 5 times
   • includes: 4 times
   • is exposed to: 4 times
   • is protected by: 3 times
   • is a: 1 times
   • is manufactured by: 1 times
   • is located at: 1 times

✅ Triples with correct predicate format: 46/46 (100.0%)

🎯 Examples of correctly formatted triples:
   1. (WD-40 Multi-Use Product Aerosol, is a, Chemical)
   2. (WD-40 Multi-Use Product Aerosol, is manufactured by, WD-40 Company)
   3. (WD-40 Multi-Use Product Aerosol, is located at, 9715 Businesspark Avenue San Diego California USA 92131)
   4. (WD-40 Multi-Use Product Aerosol, has, Lubricant)
   5. (WD-40 Multi-Use Product Aerosol, has, Penetrant)

📝 Raw response length: 3022 characters
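
The predicate tally above can also be written with `collections.Counter`, together with a format check that compares whole predicates against an allowed list (exact match, so `has` does not match `purchased from`). A minimal sketch, using a few triples copied from the output above plus one made-up row; `ALLOWED_PREDICATES` and `predicate_report` are illustrative names.

```python
from collections import Counter

# A subset of the relationship phrases used in this notebook.
ALLOWED_PREDICATES = {"is a", "has", "is classified as", "is manufactured by"}

def predicate_report(triples, allowed=ALLOWED_PREDICATES):
    """Count predicates and list triples whose predicate is not in `allowed`."""
    counts = Counter(pred for _, pred, _ in triples)
    invalid = [t for t in triples if t[1].strip().lower() not in allowed]
    return counts, invalid

sample_triples = [
    ("WD-40 Multi-Use Product Aerosol", "is a", "Chemical"),
    ("WD-40 Multi-Use Product Aerosol", "has", "Lubricant"),
    ("WD-40 Multi-Use Product Aerosol", "has", "Penetrant"),
    ("WD-40 Multi-Use Product Aerosol", "purchased from", "Retailer"),  # made-up row
]

counts, invalid = predicate_report(sample_triples)
for pred, n in counts.most_common():
    print(f"{pred}: {n}")
print(f"Invalid predicates: {len(invalid)}")
```

Listing the rejected triples, rather than only counting the accepted ones, makes it easier to see which predicates the prompt failed to constrain.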


## 8. Summary

Summary of the extraction results and benefits of using Claude-4 Sonnet with enhanced parsing.

In [8]:
print("🎉 EXTRACTION COMPLETED!")
print("=" * 80)

print("📊 FINAL RESULTS SUMMARY:")
print(f"   • Total triples extracted: {len(triples)}")
print("   • Document processed: WD-40 MSDS")
print("   • Model used: Claude-4 Sonnet (Orchestration)")
print("   • Temperature: 0.3 (for consistency)")
print("   • Enhanced parsing: ✅ ENABLED")

if triples:
    subjects = {s for s, _, _ in triples}
    predicates = {p for _, p, _ in triples}
    objects = {o for _, _, o in triples}
    
    print(f"   • Unique subjects: {len(subjects)}")
    print(f"   • Unique predicates: {len(predicates)}")
    print(f"   • Unique objects: {len(objects)}")

print("\n🚀 KEY BENEFITS OF USING CLAUDE-4 SONNET WITH ENHANCED PARSING:")
print("   • Better understanding of complex MSDS terminology")
print("   • More accurate Subject-Predicate-Object extraction")
print("   • Enhanced content filtering and safety")
print("   • Improved handling of technical documents")
print("   • Better consistency in predicate formatting")
print("   • Advanced reasoning capabilities for chemical data")
print("   • Robust parsing that handles multiple response formats")

print("\n💡 SUCCESS FACTORS:")
print("   ✅ Enhanced parsing function with multiple methods")
print("   ✅ Debug output for troubleshooting")
print("   ✅ Flexible pattern matching")
print("   ✅ Proper handling of Claude-4 Sonnet response format")

print("\n" + "=" * 80)

🎉 EXTRACTION COMPLETED!
📊 FINAL RESULTS SUMMARY:
   • Total triples extracted: 46
   • Document processed: WD-40 MSDS
   • Model used: Claude-4 Sonnet (Orchestration)
   • Temperature: 0.3 (for consistency)
   • Enhanced parsing: ✅ ENABLED
   • Unique subjects: 5
   • Unique predicates: 9
   • Unique objects: 42

🚀 KEY BENEFITS OF USING CLAUDE-4 SONNET WITH ENHANCED PARSING:
   • Better understanding of complex MSDS terminology
   • More accurate Subject-Predicate-Object extraction
   • Enhanced content filtering and safety
   • Improved handling of technical documents
   • Better consistency in predicate formatting
   • Advanced reasoning capabilities for chemical data
   • Robust parsing that handles multiple response formats

💡 SUCCESS FACTORS:
   ✅ Enhanced parsing function with multiple methods
   ✅ Debug output for troubleshooting
   ✅ Flexible pattern matching
   ✅ Proper handling of Claude-4 Sonnet response format

