# Enhanced Knowledge Graph Extraction for Antiquities Trafficking

This notebook implements domain-specific extraction using:
- **Structured entity schema** (PERSON, ORGANIZATION, ARTIFACT, TRANSACTION, etc.)
- **Coreference resolution** via a shared canonical_id
- **Event linking** via a shared event_id
- **Order-preserving extraction** from text
- **Rich attributes** for domain context

## Architecture
1. Stage 1: LLM-based structured extraction with domain schema
2. Stage 2: Knowledge Graph construction
3. Stage 3: Visualization and export

## 📦 Installation

In [None]:
# Install required packages
!pip install -q google-generativeai networkx matplotlib pyvis anthropic

## 🔑 Configuration

In [None]:
import json
import networkx as nx
import matplotlib.pyplot as plt
from collections import defaultdict
import re

# Choose your LLM provider
USE_ANTHROPIC = True  # Set to False to use Gemini instead

if USE_ANTHROPIC:
    from anthropic import Anthropic
    ANTHROPIC_API_KEY = "YOUR_API_KEY_HERE"  # Replace with your key
    client = Anthropic(api_key=ANTHROPIC_API_KEY)
    print("✅ Using Claude (Anthropic)")
else:
    import google.generativeai as genai
    GEMINI_API_KEY = "YOUR_API_KEY_HERE"  # Replace with your key
    genai.configure(api_key=GEMINI_API_KEY)
    model = genai.GenerativeModel("gemini-2.0-flash-exp")
    print("✅ Using Gemini")
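
Hard-coded keys are easy to leak when a notebook is shared. One common alternative is to read them from the environment. The sketch below is a suggestion rather than part of the original setup: `load_api_key` is a hypothetical helper, and the variable name you pass to it is an assumption about how your environment is configured.

```python
import os


def load_api_key(var_name, env=None):
    """Return the key stored in environment variable `var_name`, or raise.

    Hypothetical helper: `env` defaults to os.environ and exists mainly
    so the function can be exercised without touching the real environment.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key
```

With this in place, `ANTHROPIC_API_KEY = load_api_key("ANTHROPIC_API_KEY")` could stand in for the placeholder string in the cell above.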

## 📖 Extraction Schema and Examples

In [None]:
EXTRACTION_SCHEMA = """
## Entity Classes for Antiquities Trafficking

Extract the following entity types IN ORDER OF APPEARANCE:

1. **PERSON**: Individuals (dealers, collectors, officials, looters)
   - Use canonical_id: firstname_lastname (lowercase, underscores)
   - Add full_name when known
   - Include role attribute (dealer, collector, official, looter, buyer, seller, suspect)
   - Pronouns inherit canonical_id from context

2. **ORGANIZATION**: Institutions (museums, galleries, auction houses, law enforcement)
   - Use canonical_id (e.g., j_paul_getty_museum)
   - Add entity_type (museum, gallery, auction_house, law_enforcement)
   - Include entity_role when in a transaction (buyer, seller)

3. **ARTIFACT**: Cultural objects being trafficked
   - Use canonical_id (e.g., euphronios_sarpedon_krater)
   - Add object_type (Greek pottery, sculpture, etc.)
   - Include condition and legal_status when mentioned

4. **CRIMINAL_ACTIVITY**: Illicit actions
   - Use activity_type attribute: illicit excavation, illegal export, forgery, smuggling
   - Link to artifact_id and perpetrator_id

5. **TRANSACTION**: Action verbs (sold, bought, consigned, acquired)
   - Extract the VERB ONLY as extraction_text
   - Use transaction_type: sale, acquisition, consignment
   - Include seller_id, buyer_id, artifact_id, amount, date

6. **LEGAL_ACTION**: Prosecutions, raids, convictions
   - Use action_type: conviction, raid, prosecution, arrest
   - Include date, executing_authority, target_id

7. **LOCATION**: Places relevant to trafficking network
   - Use canonical_id (e.g., geneva_freeport)
   - Add location_type and significance

8. **PROVENANCE_CLAIM**: False or fabricated ownership histories
   - Use claim_status: fabricated, disputed, verified
   - Include claimed_source and artifact_id

## Critical Rules:

1. **Order of appearance**: Extract entities in the sequence they appear in the text
2. **Exact text spans**: Copy the exact text from the document; spans must not overlap
3. **Event linking**: Use a shared event_id to connect related extractions
4. **Coreference**: Different mentions of the same entity get the same canonical_id
5. **Pronouns**: "he", "she", "it", "they" inherit canonical_id from context

## Example:

Text: "In 1985, the Hydra Gallery sold fragments of the Onesimos kylix to the J. Paul Getty Museum for $100,000."

Extractions:
[
  {
    "extraction_class": "ORGANIZATION",
    "extraction_text": "Hydra Gallery",
    "attributes": {
      "canonical_id": "hydra_gallery",
      "entity_type": "gallery",
      "entity_role": "seller",
      "event_id": "kylix_transaction_1985"
    }
  },
  {
    "extraction_class": "TRANSACTION",
    "extraction_text": "sold",
    "attributes": {
      "transaction_type": "sale",
      "date": "1985",
      "seller_id": "hydra_gallery",
      "buyer_id": "j_paul_getty_museum",
      "artifact_id": "onesimos_kylix",
      "amount": "$100,000",
      "event_id": "kylix_transaction_1985"
    }
  },
  {
    "extraction_class": "ARTIFACT",
    "extraction_text": "fragments of the Onesimos kylix",
    "attributes": {
      "canonical_id": "onesimos_kylix",
      "object_type": "Greek pottery",
      "condition": "fragmentary",
      "event_id": "kylix_transaction_1985"
    }
  },
  {
    "extraction_class": "ORGANIZATION",
    "extraction_text": "J. Paul Getty Museum",
    "attributes": {
      "canonical_id": "j_paul_getty_museum",
      "entity_type": "museum",
      "entity_role": "buyer",
      "event_id": "kylix_transaction_1985"
    }
  }
]
"""

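The schema is only enforced by the prompt, so a returned record may omit fields or use an unknown class. A lightweight check before graph construction can catch that. This is a minimal sketch, not part of the original pipeline: `REQUIRED_ATTRS` and `validate_extraction` are hypothetical names, and which attributes count as required is an assumption drawn from the schema text above.

```python
# Hypothetical mapping: the "required" attributes per class are an
# assumption based on the schema text, not something the notebook enforces.
REQUIRED_ATTRS = {
    "PERSON": {"canonical_id"},
    "ORGANIZATION": {"canonical_id"},
    "ARTIFACT": {"canonical_id"},
    "LOCATION": {"canonical_id"},
    "TRANSACTION": {"transaction_type"},
    "LEGAL_ACTION": {"action_type"},
    "CRIMINAL_ACTIVITY": {"activity_type"},
    "PROVENANCE_CLAIM": {"claim_status"},
}


def validate_extraction(ext):
    """Return a list of problems found in one extraction record (empty if none)."""
    if not isinstance(ext, dict):
        return ["record is not a JSON object"]
    problems = []
    cls = ext.get("extraction_class")
    required = REQUIRED_ATTRS.get(cls) if isinstance(cls, str) else None
    if required is None:
        problems.append(f"unknown extraction_class: {cls!r}")
        required = set()
    if not ext.get("extraction_text"):
        problems.append("missing extraction_text")
    attrs = ext.get("attributes")
    if not isinstance(attrs, dict):
        problems.append("attributes is not an object")
        attrs = {}
    # Report missing attributes in a stable (alphabetical) order
    for key in sorted(required - set(attrs)):
        problems.append(f"missing attribute: {key}")
    return problems
```

Records that return a non-empty list could be logged and skipped rather than passed on to graph construction.
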
## 🤖 Structured Extraction Function

In [None]:
def extract_entities_structured(text, use_anthropic=True):
    """Extract entities using domain-specific structured schema"""
    
    prompt = f"""{EXTRACTION_SCHEMA}

## Document to Extract:

{text}

## Instructions:

Extract all entities from the document above following the schema.
Return ONLY a JSON array of extractions, no other text.

Format:
[
  {{
    "extraction_class": "<CLASS>",
    "extraction_text": "<exact text from document>",
    "attributes": {{ ... }}
  }},
  ...
]
"""
    
    try:
        if use_anthropic:
            response = client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=16000,
                temperature=0.1,
                messages=[
                    {"role": "user", "content": prompt}
                ]
            )
            result_text = response.content[0].text
        else:
            response = model.generate_content(
                prompt,
                generation_config=genai.types.GenerationConfig(
                    temperature=0.1,
                    max_output_tokens=8192,
                )
            )
            result_text = response.text
        
        # Extract JSON from response (handle markdown code blocks)
        result_text = result_text.strip()
        if result_text.startswith('```'):
            # Remove markdown code blocks
            result_text = re.sub(r'^```json\s*', '', result_text)
            result_text = re.sub(r'^```\s*', '', result_text)
            result_text = re.sub(r'```\s*$', '', result_text)
        
        extractions = json.loads(result_text)
        
        print(f"✅ Extracted {len(extractions)} entities")
        
        # Count by class
        class_counts = defaultdict(int)
        for ext in extractions:
            class_counts[ext['extraction_class']] += 1
        
        print("\nEntity breakdown:")
        for cls, count in sorted(class_counts.items()):
            print(f"  {cls}: {count}")
        
        return extractions
        
    except Exception as e:
        print(f"❌ Error: {e}")
        print(f"Response text: {result_text[:500] if 'result_text' in locals() else 'N/A'}")
        return []
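
The fence-stripping step above assumes the reply begins with a code fence. Models sometimes put a sentence before or after the JSON, in which case `json.loads` fails. A more tolerant parser is sketched below as one possible approach. `parse_json_array` is a hypothetical helper, not part of the notebook. It uses the standard library's `json.JSONDecoder.raw_decode`, which parses one JSON value from a given offset and ignores trailing text.

```python
import json


def parse_json_array(reply):
    """Return the first JSON array found in `reply`, or raise ValueError."""
    decoder = json.JSONDecoder()
    start = reply.find("[")
    while start != -1:
        try:
            # Parse a single JSON value beginning at `start`; any text
            # after that value is left alone.
            value, _ = decoder.raw_decode(reply, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, list):
            return value
        # Not a valid array here, so try the next opening bracket
        start = reply.find("[", start + 1)
    raise ValueError("no JSON array found in model reply")
```

Inside `extract_entities_structured`, a call such as `extractions = parse_json_array(result_text)` could stand in for the fence-stripping and `json.loads` steps.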

## 🏗️ Knowledge Graph Construction

In [None]:
def build_kg_from_extractions(extractions):
    """Build knowledge graph from structured extractions"""
    
    nodes = {}
    edges = []
    
    # First pass: Create nodes from entities with canonical IDs
    entity_classes = ['PERSON', 'ORGANIZATION', 'ARTIFACT', 'LOCATION']
    
    for ext in extractions:
        if ext['extraction_class'] in entity_classes:
            attrs = ext.get('attributes', {})
            canonical_id = attrs.get('canonical_id')
            
            if canonical_id:
                # Create or update node
                if canonical_id not in nodes:
                    nodes[canonical_id] = {
                        'id': canonical_id,
                        'type': ext['extraction_class'],
                        'label': attrs.get('full_name', ext['extraction_text']),
                        'attributes': attrs,
                        'mentions': [ext['extraction_text']]
                    }
                else:
                    # Add new mention if not duplicate
                    if ext['extraction_text'] not in nodes[canonical_id]['mentions']:
                        nodes[canonical_id]['mentions'].append(ext['extraction_text'])
    
    # Second pass: Create relationships from events
    # Group extractions by event_id
    events = defaultdict(list)
    for ext in extractions:
        event_id = ext.get('attributes', {}).get('event_id')
        if event_id:
            events[event_id].append(ext)
    
    # Create edges based on events
    for event_id, event_extractions in events.items():
        # Use the first action in this event; any later actions sharing the event_id are ignored
        action = None
        for ext in event_extractions:
            if ext['extraction_class'] in ['TRANSACTION', 'LEGAL_ACTION', 'CRIMINAL_ACTIVITY']:
                action = ext
                break
        
        if not action:
            continue
        
        attrs = action.get('attributes', {})
        action_type = attrs.get('transaction_type') or attrs.get('action_type') or attrs.get('activity_type')
        
        # Create edges based on action type
        if action['extraction_class'] == 'TRANSACTION':
            seller_id = attrs.get('seller_id')
            buyer_id = attrs.get('buyer_id')
            artifact_id = attrs.get('artifact_id')
            
            if seller_id and artifact_id:
                edges.append({
                    'source': seller_id,
                    'target': artifact_id,
                    'relation': f'sold ({attrs.get("date", "?")})',
                    'event_id': event_id,
                    'attributes': attrs
                })
            
            if buyer_id and artifact_id:
                edges.append({
                    'source': buyer_id,
                    'target': artifact_id,
                    'relation': f'acquired ({attrs.get("date", "?")})',
                    'event_id': event_id,
                    'attributes': attrs
                })
            
            if seller_id and buyer_id:
                edges.append({
                    'source': seller_id,
                    'target': buyer_id,
                    'relation': f'transaction ({attrs.get("date", "?")})',
                    'event_id': event_id,
                    'attributes': attrs
                })
        
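        elif action['extraction_class'] == 'LEGAL_ACTION':
            target_id = attrs.get('target_id')
            # Use the extracted authority as the edge source; fall back to a
            # generic placeholder ID when the model did not provide one
            authority = attrs.get('executing_authority') or 'law_enforcement'
            
            if target_id:
                edges.append({
                    'source': authority,
                    'target': target_id,
                    'relation': f'{action_type or "legal_action"} ({attrs.get("date", "?")})',
                    'event_id': event_id,
                    'attributes': attrs
                })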
        
        elif action['extraction_class'] == 'CRIMINAL_ACTIVITY':
            artifact_id = attrs.get('artifact_id')
            perpetrator_id = attrs.get('perpetrator_id')
            
            if perpetrator_id and artifact_id:
                edges.append({
                    'source': perpetrator_id,
                    'target': artifact_id,
                    'relation': action_type or 'criminal_activity',
                    'event_id': event_id,
                    'attributes': attrs
                })
    
    print(f"\n📊 Knowledge Graph Stats:")
    print(f"  Nodes: {len(nodes)}")
    print(f"  Edges: {len(edges)}")
    print(f"  Events: {len(events)}")
    
    return {
        'nodes': nodes,
        'edges': edges,
        'events': dict(events)
    }
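
Edges are built from IDs named in action attributes (`seller_id`, `target_id`, and so on), and nothing guarantees that a matching node was extracted. A small check like the following can surface those gaps. `find_dangling_edges` is a hypothetical helper; it assumes only the `kg` dictionary shape returned above.

```python
def find_dangling_edges(kg):
    """Return edges whose source or target has no entry in kg['nodes']."""
    known = set(kg["nodes"])
    return [
        edge for edge in kg["edges"]
        if edge["source"] not in known or edge["target"] not in known
    ]


# Toy graph: the buyer was referenced by a transaction but never extracted
toy_kg = {
    "nodes": {"hydra_gallery": {}, "onesimos_kylix": {}},
    "edges": [
        {"source": "hydra_gallery", "target": "onesimos_kylix", "relation": "sold (1985)"},
        {"source": "j_paul_getty_museum", "target": "onesimos_kylix", "relation": "acquired (1985)"},
    ],
}
dangling = find_dangling_edges(toy_kg)
```

Here `dangling` holds only the second edge, because `j_paul_getty_museum` has no node entry.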

## 📊 Visualization

In [None]:
def visualize_kg(kg, output_file="kg_visualization.png", figsize=(16, 12)):
    """Create a visualization of the knowledge graph"""
    
    G = nx.DiGraph()
    
    # Add nodes
    for node_id, node_data in kg['nodes'].items():
        G.add_node(node_id, **node_data)
    
    # Add edges
    for edge in kg['edges']:
        G.add_edge(edge['source'], edge['target'], label=edge['relation'])
    
    # Create layout
    plt.figure(figsize=figsize)
    pos = nx.spring_layout(G, k=2, iterations=50)
    
    # Color nodes by type
    color_map = {
        'PERSON': '#FF6B6B',
        'ORGANIZATION': '#4ECDC4',
        'ARTIFACT': '#FFE66D',
        'LOCATION': '#95E1D3'
    }
    
    node_colors = [color_map.get(G.nodes[node].get('type', 'UNKNOWN'), '#CCCCCC') 
                   for node in G.nodes()]
    
    # Draw
    nx.draw_networkx_nodes(G, pos, node_color=node_colors, 
                           node_size=2000, alpha=0.9)
    nx.draw_networkx_labels(G, pos, 
                            labels={n: G.nodes[n].get('label', n) for n in G.nodes()},
                            font_size=8)
    nx.draw_networkx_edges(G, pos, edge_color='gray', 
                           arrows=True, arrowsize=15, alpha=0.6)
    
    # Add edge labels
    edge_labels = {(e['source'], e['target']): e['relation'] 
                   for e in kg['edges']}
    nx.draw_networkx_edge_labels(G, pos, edge_labels, font_size=6)
    
    plt.axis('off')
    plt.tight_layout()
    plt.savefig(output_file, dpi=300, bbox_inches='tight')
    print(f"✅ Saved visualization to {output_file}")
    plt.show()

def display_kg_summary(kg):
    """Display a text summary of the knowledge graph"""
    
    print("\n" + "="*60)
    print("KNOWLEDGE GRAPH SUMMARY")
    print("="*60)
    
    # Group nodes by type
    by_type = defaultdict(list)
    for node_id, node in kg['nodes'].items():
        by_type[node['type']].append((node_id, node))
    
    for entity_type, nodes_list in sorted(by_type.items()):
        print(f"\n{entity_type} ({len(nodes_list)}):")
        for node_id, node in sorted(nodes_list)[:10]:  # Show first 10
            label = node['label']
            mentions = len(node.get('mentions', []))
            print(f"  • {label} (id: {node_id}, {mentions} mentions)")
    
    print(f"\nRELATIONSHIPS ({len(kg['edges'])}):")
    for edge in kg['edges'][:15]:  # Show first 15
        source_label = kg['nodes'].get(edge['source'], {}).get('label', edge['source'])
        target_label = kg['nodes'].get(edge['target'], {}).get('label', edge['target'])
        print(f"  • {source_label} --[{edge['relation']}]--> {target_label}")
    
    if len(kg['edges']) > 15:
        print(f"  ... and {len(kg['edges']) - 15} more")
    
    print("\n" + "="*60)
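
Because `visualize_kg` uses a `DiGraph`, two relations between the same pair of nodes collapse into one edge, and the `edge_labels` dict keeps only the last label. One possible workaround is to join the labels for each node pair before drawing. This sketch uses a hypothetical `merge_edge_labels` helper and assumes only the edge dictionaries produced by `build_kg_from_extractions`.

```python
from collections import defaultdict


def merge_edge_labels(edges, sep="; "):
    """Join the relation labels of all edges that share (source, target)."""
    grouped = defaultdict(list)
    for edge in edges:
        label = str(edge["relation"])
        pair = (edge["source"], edge["target"])
        # Keep each distinct label once, in first-seen order
        if label not in grouped[pair]:
            grouped[pair].append(label)
    return {pair: sep.join(labels) for pair, labels in grouped.items()}
```

The returned dict has the same shape as `edge_labels` in `visualize_kg`, so it could be passed to `nx.draw_networkx_edge_labels` in its place.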

## 💾 Export Functions

In [None]:
def save_kg_json(kg, filename="kg_export.json"):
    """Save knowledge graph as JSON"""
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(kg, f, indent=2, ensure_ascii=False)
    print(f"✅ Saved knowledge graph to {filename}")

def export_to_neo4j_cypher(kg, filename="neo4j_import.cypher"):
    """Export as Neo4j Cypher statements"""
    
    cypher = []
    
    # Create nodes
    for node_id, node in kg['nodes'].items():
        label = node['type']
        props = {
            'id': node_id,
            'label': node['label'],
            'mentions': node.get('mentions', [])
        }
        props.update(node.get('attributes', {}))
        
        props_str = ', '.join([f"{k}: {json.dumps(v)}" for k, v in props.items()])
        cypher.append(f"CREATE (:{label} {{{props_str}}})")
    
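    # Create relationships
    for edge in kg['edges']:
        # Relationship types must be plain identifiers: collapse runs of
        # non-word characters (spaces, parentheses, '?', '-') into '_'.
        # `re` is imported in the configuration cell.
        rel_type = re.sub(r'\W+', '_', str(edge['relation'])).strip('_').upper() or 'RELATED_TO'
        props_str = ', '.join([f"{k}: {json.dumps(v)}" 
                               for k, v in edge.get('attributes', {}).items()])
        
        cypher.append(
            f"MATCH (a {{id: {json.dumps(edge['source'])}}}), "
            f"(b {{id: {json.dumps(edge['target'])}}}) "
            f"CREATE (a)-[:{rel_type} {{{props_str}}}]->(b)"
        )
    
    with open(filename, 'w', encoding='utf-8') as f:
        # End every statement with ';' so the file runs as a multi-statement script
        f.write("".join(f"{stmt};\n\n" for stmt in cypher))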
    
    print(f"✅ Saved Neo4j Cypher to {filename}")

## 📝 Example Usage

In [None]:
# Sample document
document_text = """Giacomo Medici is an Italian antiquities dealer who was convicted in 2005 of receiving stolen goods, illegal export of goods, and conspiracy to traffic.

Medici started dealing in antiquities in Rome during the 1960s. In July 1967, he was convicted in Italy of receiving looted artefacts, though in the same year he met and became an important supplier of antiquities to US dealer Robert Hecht. In 1968, Medici opened the gallery Antiquaria Romana in Rome and began to explore business opportunities in Switzerland.

In 1978, he closed his Rome gallery, and entered into partnership with Geneva resident Christian Boursaud, who started consigning material supplied by Medici for sale at Sotheby's London. Together, they opened Hydra Gallery in Geneva in 1983.

In October 1985, the Hydra Gallery sold fragments of the Onesimos kylix to the J. Paul Getty Museum for $100,000, providing a false provenance by way of the fictitious Zbinden collection. The Getty returned the kylix to Italy in 1999.

It is widely believed that in December 1971 he bought the illegally-excavated Euphronios (Sarpedon) krater from tombaroli before transporting it to Switzerland and selling it to Hecht.

On 13 September 1995, in concert with Swiss police, they raided Medici's storage space in the Geneva Freeport, which comprised five rooms with a combined area of about 200 sq metres.

In January 1997, Medici was arrested in Rome. Medici was charged with receiving stolen goods, illegal export of goods, and conspiracy to traffic. On 12 May 2005, he was found guilty of all charges. He was sentenced to ten years in prison and received a €10 million fine."""

# Extract entities
print("Starting extraction...\n")
extractions = extract_entities_structured(document_text, use_anthropic=USE_ANTHROPIC)

In [None]:
# Build knowledge graph
kg = build_kg_from_extractions(extractions)

In [None]:
# Display summary
display_kg_summary(kg)

In [None]:
# Visualize
visualize_kg(kg, "antiquities_kg_enhanced.png")

In [None]:
# Save outputs
save_kg_json(kg, "antiquities_kg_enhanced.json")
export_to_neo4j_cypher(kg, "neo4j_import.cypher")

## 🔍 Query Examples

In [None]:
# Find all transactions
print("\n📦 TRANSACTIONS:")
for ext in extractions:
    if ext['extraction_class'] == 'TRANSACTION':
        attrs = ext['attributes']
        print(f"  • {attrs.get('seller_id', '?')} --[{ext['extraction_text']}]--> "
              f"{attrs.get('buyer_id', '?')} [{attrs.get('date', '?')}]")

# Find all legal actions
print("\n⚖️ LEGAL ACTIONS:")
for ext in extractions:
    if ext['extraction_class'] == 'LEGAL_ACTION':
        attrs = ext['attributes']
        print(f"  • {attrs.get('action_type', '?')}: {attrs.get('target_id', '?')} "
              f"[{attrs.get('date', '?')}]")

# Find all artifacts
print("\n🏺 ARTIFACTS:")
for node_id, node in kg['nodes'].items():
    if node['type'] == 'ARTIFACT':
        print(f"  • {node['label']} (id: {node_id})")
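
The same pattern extends to time-based questions. The `date` attribute is free text ("October 1985", "13 September 1995"), so the sketch below simply pulls out the first four-digit year and sorts on it. `extraction_year` and `build_timeline` are hypothetical helpers, and year-level resolution is an assumption that may be too coarse for some corpora.

```python
import re


def extraction_year(ext):
    """Return the first four-digit year in the record's date attribute, or None."""
    attrs = ext.get("attributes") or {}
    match = re.search(r"\b(1[0-9]{3}|20[0-9]{2})\b", str(attrs.get("date", "")))
    return int(match.group(1)) if match else None


def build_timeline(extractions):
    """Return (year, class, text) tuples for dated records, oldest first."""
    dated = [
        (extraction_year(ext), ext["extraction_class"], ext["extraction_text"])
        for ext in extractions
        if extraction_year(ext) is not None
    ]
    return sorted(dated)
```

Calling `build_timeline(extractions)` on the sample document would list its dated transactions and legal actions in chronological order, subject to whatever dates the model actually returned.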

## 📤 Process Your Own Documents

In [None]:
# Load your document
# Option 1: From file
# with open('your_document.txt', 'r', encoding='utf-8') as f:
#     your_document = f.read()

# Option 2: Paste directly
your_document = """Paste your antiquities trafficking document here..."""

# Extract
your_extractions = extract_entities_structured(your_document, use_anthropic=USE_ANTHROPIC)

# Build graph
your_kg = build_kg_from_extractions(your_extractions)

# Display and save
display_kg_summary(your_kg)
visualize_kg(your_kg, "my_kg.png")
save_kg_json(your_kg, "my_kg.json")

## 🔄 Batch Processing Multiple Documents

In [None]:
def process_multiple_documents(document_list, use_anthropic=True):
    """Process multiple documents and merge into single knowledge graph"""
    
    all_extractions = []
    
    for i, doc in enumerate(document_list, 1):
        print(f"\n{'='*60}")
        print(f"Processing document {i}/{len(document_list)}")
        print(f"{'='*60}")
        
        extractions = extract_entities_structured(doc, use_anthropic=use_anthropic)
        all_extractions.extend(extractions)
    
    print(f"\n\n{'='*60}")
    print(f"MERGING {len(all_extractions)} TOTAL EXTRACTIONS")
    print(f"{'='*60}")
    
    # Build unified knowledge graph
    merged_kg = build_kg_from_extractions(all_extractions)
    
    return merged_kg, all_extractions

# Example usage:
# documents = [
#     "Document 1 text...",
#     "Document 2 text...",
#     "Document 3 text..."
# ]
# merged_kg, all_exts = process_multiple_documents(documents)
# display_kg_summary(merged_kg)
# visualize_kg(merged_kg, "merged_kg.png")
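
One caveat when merging: the model invents `event_id` values per document, so two documents can reuse the same ID for unrelated events, and `build_kg_from_extractions` would then group them together. A possible guard is to prefix each ID with a document index before extending `all_extractions`. `namespace_event_ids` is a hypothetical helper, sketched here. Canonical IDs are deliberately left untouched, on the assumption that the same entity should merge across documents.

```python
import copy


def namespace_event_ids(extractions, doc_index):
    """Return a copy of the records with event_id prefixed by the document index."""
    result = copy.deepcopy(extractions)  # leave the caller's records unchanged
    for ext in result:
        attrs = ext.get("attributes")
        if isinstance(attrs, dict) and attrs.get("event_id"):
            attrs["event_id"] = f"doc{doc_index}:{attrs['event_id']}"
    return result
```

In the loop above, this would amount to calling `all_extractions.extend(namespace_event_ids(extractions, i))`.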

## 📋 Summary

This enhanced notebook provides:

### Key Features:
1. **Domain-Specific Schema**: Eight entity classes tailored to antiquities trafficking
2. **Coreference Resolution**: Canonical IDs link mentions of the same entity
3. **Event Linking**: A shared event_id connects related extractions
4. **Order Preservation**: Entities are extracted in order of appearance
5. **Rich Attributes**: Contextual metadata for each extraction

### Entity Classes:
- PERSON (dealers, collectors, officials)
- ORGANIZATION (museums, galleries, law enforcement)
- ARTIFACT (trafficked objects)
- CRIMINAL_ACTIVITY (illicit actions)
- TRANSACTION (sales, acquisitions)
- LEGAL_ACTION (raids, convictions)
- LOCATION (trafficking network locations)
- PROVENANCE_CLAIM (false histories)

### Next Steps:
- Fine-tune extraction prompts for your specific corpus
- Add temporal analysis and timeline visualization
- Implement network analysis (centrality, communities)
- Connect to Neo4j for advanced graph queries
- Build a web interface for interactive exploration
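
As a starting point for the network-analysis item, a simple degree count needs nothing beyond the `edges` list. The sketch below is illustrative only: `degree_counts` is a hypothetical helper, and a fuller analysis would more likely use NetworkX's centrality functions on the graph built in `visualize_kg`.

```python
from collections import Counter


def degree_counts(kg):
    """Count how many edges touch each node ID (in-degree plus out-degree)."""
    counts = Counter()
    for edge in kg["edges"]:
        counts[edge["source"]] += 1
        counts[edge["target"]] += 1
    return counts
```

`degree_counts(kg).most_common(5)` would then list the five most connected IDs, which in a trafficking network are often the intermediaries worth examining first.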