[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/introduction/06_Relation_Extraction.ipynb)

#  Relation Extraction - Comprehensive Guide

##  Overview

This notebook provides a **comprehensive guide** to extracting relationships between entities and building RDF triplets using Semantica's relation extraction modules. You'll learn to identify connections, extract structured triplets, and prepare data for knowledge graphs.

**Documentation**: [API Reference](https://semantica.readthedocs.io/reference/semantic_extract/)

###  Learning Objectives

By the end of this notebook, you will be able to:

- Extract relationships using `RelationExtractor`
- Understand different extraction methods (pattern, dependency, co-occurrence, HuggingFace, LLM)
- Configure extraction parameters for optimal results
- Extract RDF triplets with `TripletExtractor`
- Validate triplets using `TripletValidator`
- Serialize triplets to RDF formats with `RDFSerializer`
- Assess triplet quality with `TripletQualityChecker`
- Build complete entity → relation → triplet pipelines

###  What You'll Learn

| Component | Purpose | When to Use |
|-----------|---------|-------------|
| `RelationExtractor` | Extract entity relationships | Finding connections |
| `TripletExtractor` | Extract RDF triplets | Building knowledge graphs |
| `TripletValidator` | Validate triplet quality | Quality assurance |
| `RDFSerializer` | Serialize to RDF formats | Data export |
| `TripletQualityChecker` | Assess triplet quality | Quality metrics |

---

##  Installation

Install Semantica from PyPI:

```bash
pip install semantica
# Or with all optional dependencies:
pip install semantica[all]
```

---

In [None]:
!pip install semantica


##  Step 1: Basic Relation Extraction

Let's start by extracting relationships between entities using `RelationExtractor`.

### What is RelationExtractor?

`RelationExtractor` identifies relationships between entities:
- Finds connections like "founded_by", "located_in", "works_for"
- Returns Relation objects with subject, predicate, object
- Supports multiple extraction methods
- Provides confidence scores for each relation

### Understanding Relations

A relation has three parts:
- **Subject**: The source entity (e.g., "Apple Inc.")
- **Predicate**: The relationship type (e.g., "founded_by")
- **Object**: The target entity (e.g., "Steve Jobs")

In [None]:
from semantica.semantic_extract import RelationExtractor, NERExtractor

# Initialize extractors
ner_extractor = NERExtractor()
relation_extractor = RelationExtractor()

# Sample text with clear relationships
text = """
Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in 1976.
The company is headquartered in Cupertino, California. Tim Cook is the current CEO
of Apple Inc. and took over from Steve Jobs in August 2011.
"""

# First, extract entities
entities = ner_extractor.extract(text)
print(f" Extracted {len(entities)} entities\n")

# Then, extract relationships
relationships = relation_extractor.extract(text, entities)

print(f" Extracted {len(relationships)} relationships:\n")
print("=" * 80)

for i, rel in enumerate(relationships, 1):
    # Handle both dict and object formats
    source = rel.get('source', rel.get('subject', '')) if isinstance(rel, dict) else getattr(rel, 'subject', '')
    target = rel.get('target', rel.get('object', '')) if isinstance(rel, dict) else getattr(rel, 'object', '')
    rel_type = rel.get('type', rel.get('predicate', 'related_to')) if isinstance(rel, dict) else getattr(rel, 'predicate', 'related_to')
    confidence = rel.get('confidence', 1.0) if isinstance(rel, dict) else getattr(rel, 'confidence', 1.0)
    
    # Get source and target text
    if isinstance(source, dict):
        source_text = source.get('text', source.get('entity', str(source)))
    else:
        source_text = getattr(source, 'text', str(source))
    
    if isinstance(target, dict):
        target_text = target.get('text', target.get('entity', str(target)))
    else:
        target_text = getattr(target, 'text', str(target))
    
    print(f"{i:2d}. {source_text:20s} --[{rel_type:15s}]--> {target_text:20s} (conf: {confidence:.2f})")

print("=" * 80)

###  Understanding Relation Objects

Each extracted relation contains:

| Attribute | Description | Example |
|-----------|-------------|----------|
| `subject` | Source entity | Entity("Apple Inc.") |
| `predicate` | Relationship type | "founded_by" |
| `object` | Target entity | Entity("Steve Jobs") |
| `confidence` | Extraction confidence (0-1) | 0.85 |
| `context` | Surrounding text | "Apple Inc. was founded by Steve Jobs" |
| `metadata` | Additional info | {"method": "pattern"} |
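
To make the attribute table concrete, here is a minimal sketch of a relation record built with plain dataclasses. `SketchEntity` and `SketchRelation` are hypothetical stand-ins invented for illustration; they are not Semantica's own classes, whose fields and defaults may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SketchEntity:
    """Hypothetical stand-in for an extracted entity."""
    text: str
    label: str = "UNKNOWN"


@dataclass
class SketchRelation:
    """Hypothetical stand-in mirroring the attributes listed in the table."""
    subject: SketchEntity
    predicate: str
    object: SketchEntity
    confidence: float = 1.0
    context: str = ""
    metadata: Dict[str, Any] = field(default_factory=dict)


relation = SketchRelation(
    subject=SketchEntity("Apple Inc.", "ORG"),
    predicate="founded_by",
    object=SketchEntity("Steve Jobs", "PERSON"),
    confidence=0.85,
    context="Apple Inc. was founded by Steve Jobs",
    metadata={"method": "pattern"},
)

# Read the record the same way the extraction loop reads real relations
print(f"{relation.subject.text} --[{relation.predicate}]--> {relation.object.text}")
```

Keeping `confidence`, `context`, and `metadata` on the record itself is what makes later filtering and provenance tracking possible without going back to the source text.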

###  Common Relation Types

- **founded_by**: Organization founded by person
- **located_in**: Entity located in place
- **works_for**: Person works for organization
- **born_in**: Person born in location
- **part_of**: Entity is part of another
- **related_to**: Generic relationship

## Step 2: Different Extraction Methods

Semantica supports multiple relation extraction methods:

### Method Comparison

| Method | Speed | Accuracy | Use Case | Requires |
|--------|-------|----------|----------|----------|
| **pattern** |  | ⭐⭐⭐ | Common relations | Nothing |
| **dependency** |  | ⭐⭐⭐⭐ | Grammatical relations | spaCy |
| **cooccurrence** |  | ⭐⭐ | Proximity-based | Nothing |
| **huggingface** |  | ⭐⭐⭐⭐⭐ | Domain-specific | HF model |
| **llm** |  | ⭐⭐⭐⭐⭐ | Complex, custom | API key |
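
To give a feel for why the pattern and co-occurrence rows differ in accuracy, the following toy sketch implements both ideas in plain Python. The regular expressions and the helper names (`pattern_relations`, `cooccurrence_relations`) are assumptions made up for this illustration and do not reflect how Semantica implements either method.

```python
import re
from itertools import combinations

# Deliberately naive matchers for capitalized names such as "Apple Inc."
SUBJECT = r"[A-Z][\w.]*(?: [A-Z][\w.]*)*"
OBJECT = r"[A-Z]\w*(?: [A-Z]\w*)*"

# Hypothetical surface patterns, one per relation type
PATTERNS = {
    "founded_by": re.compile(rf"(?P<subject>{SUBJECT}) was founded by (?P<object>{OBJECT})"),
    "located_in": re.compile(rf"(?P<subject>{SUBJECT}) is located in (?P<object>{OBJECT})"),
}


def pattern_relations(text):
    """Return (subject, predicate, object) tuples matched by the patterns."""
    found = []
    for predicate, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.group("subject"), predicate, match.group("object")))
    return found


def cooccurrence_relations(sentence, entity_texts):
    """Link every pair of entities mentioned in the same sentence."""
    present = [name for name in entity_texts if name in sentence]
    return [(a, "related_to", b) for a, b in combinations(present, 2)]


sentence = "Apple Inc. was founded by Steve Jobs in Cupertino, California."
entity_names = ["Apple Inc.", "Steve Jobs", "Cupertino"]

print(pattern_relations(sentence))
print(cooccurrence_relations(sentence, entity_names))
```

The pattern approach yields a typed relation only when the wording matches, while co-occurrence links every pair with a generic `related_to` label. That is the trade-off between precision and coverage that the table summarizes.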

In [None]:
from semantica.semantic_extract import RelationExtractor

sample_text = "Apple Inc. was founded by Steve Jobs in Cupertino, California."
sample_entities = ner_extractor.extract(sample_text)

print(" Comparing Relation Extraction Methods:\n")
print("=" * 80)

# Try different methods
methods_to_try = ["pattern", "dependency", "cooccurrence"]

for method_name in methods_to_try:
    try:
        print(f"\n Method: {method_name.upper()}")
        print("-" * 40)
        
        extractor = RelationExtractor(method=method_name)
        relations = extractor.extract(sample_text, sample_entities)
        
        print(f"Found {len(relations)} relations:")
        for rel in relations[:3]:  # Show first 3
            source = rel.get('source', rel.get('subject', '')) if isinstance(rel, dict) else getattr(rel, 'subject', '')
            target = rel.get('target', rel.get('object', '')) if isinstance(rel, dict) else getattr(rel, 'object', '')
            rel_type = rel.get('type', rel.get('predicate', 'related_to')) if isinstance(rel, dict) else getattr(rel, 'predicate', 'related_to')
            
            # Get text representations
            source_text = source.get('text', str(source)) if isinstance(source, dict) else getattr(source, 'text', str(source))
            target_text = target.get('text', str(target)) if isinstance(target, dict) else getattr(target, 'text', str(target))
            
            print(f"  • {source_text} --[{rel_type}]--> {target_text}")
            
    except Exception as e:
        print(f"  Method '{method_name}' not available: {str(e)[:50]}")

print("\n" + "=" * 80)

##  Step 3: Advanced Relation Extraction with Configuration

`RelationExtractor` accepts several configuration options that narrow what is extracted:

### Key Parameters:

- **`relation_types`**: Specific relation types to extract (e.g., `["founded_by", "works_for"]`)
- **`bidirectional`**: Extract bidirectional relations (default: False)
- **`confidence_threshold`**: Minimum confidence score (0.0-1.0, default: 0.6)
- **`max_distance`**: Maximum token distance between entities (default: 50)
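
As a rough mental model of what `confidence_threshold` and `max_distance` do, the sketch below filters hand-written candidate relations in plain Python. The helpers `token_distance` and `filter_candidates`, the whitespace tokenization, and the candidate scores are all assumptions for illustration; Semantica's internal filtering may work differently.

```python
def token_distance(text, first, second):
    """Count whitespace-separated tokens between two mentions (naive)."""
    start_a, start_b = text.find(first), text.find(second)
    if start_a == -1 or start_b == -1:
        return None  # one of the mentions is not in the text
    left, right = sorted([(start_a, first), (start_b, second)])
    between = text[left[0] + len(left[1]):right[0]]
    return len(between.split())


def filter_candidates(candidates, text, confidence_threshold=0.6, max_distance=50):
    """Keep candidates that are confident enough and close enough together."""
    kept = []
    for cand in candidates:
        if cand["confidence"] < confidence_threshold:
            continue
        distance = token_distance(text, cand["subject"], cand["object"])
        if distance is None or distance > max_distance:
            continue
        kept.append(cand)
    return kept


text = "Microsoft was founded by Bill Gates and Paul Allen in Albuquerque, New Mexico."
# Made-up candidates and scores, for illustration only
candidates = [
    {"subject": "Microsoft", "predicate": "founded_by", "object": "Bill Gates", "confidence": 0.9},
    {"subject": "Microsoft", "predicate": "located_in", "object": "Albuquerque", "confidence": 0.55},
    {"subject": "Microsoft", "predicate": "founded_by", "object": "Paul Allen", "confidence": 0.8},
]

strict = filter_candidates(candidates, text, confidence_threshold=0.7, max_distance=5)
for cand in strict:
    print(cand["subject"], cand["predicate"], cand["object"])
```

Raising the threshold or lowering the distance trades recall for precision: with the strict settings above, the second founder is dropped only because more than five tokens separate the two mentions.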

In [None]:
# Create extractor with custom configuration
advanced_extractor = RelationExtractor(
    relation_types=["founded_by", "located_in", "works_for"],  # Only extract these types
    confidence_threshold=0.7,  # Higher threshold for quality
    bidirectional=False,  # One-way relations only
    max_distance=50  # Max 50 tokens between entities
)

# Sample texts
texts = [
    "Microsoft was founded by Bill Gates and Paul Allen in Albuquerque, New Mexico.",
    "Satya Nadella works for Microsoft as the CEO.",
    "Google is located in Mountain View, California.",
    "Amazon was founded by Jeff Bezos in Seattle, Washington."
]

print(" Advanced Relation Extraction:\n")
print("=" * 80)

for i, text in enumerate(texts, 1):
    entities = ner_extractor.extract(text)
    relations = advanced_extractor.extract(text, entities)
    
    print(f"\n Text {i}: {text}")
    print(f"   Relations found: {len(relations)}")
    
    for rel in relations:
        source = rel.get('source', rel.get('subject', '')) if isinstance(rel, dict) else getattr(rel, 'subject', '')
        target = rel.get('target', rel.get('object', '')) if isinstance(rel, dict) else getattr(rel, 'object', '')
        rel_type = rel.get('type', rel.get('predicate', 'related_to')) if isinstance(rel, dict) else getattr(rel, 'predicate', 'related_to')
        confidence = rel.get('confidence', 1.0) if isinstance(rel, dict) else getattr(rel, 'confidence', 1.0)
        
        source_text = source.get('text', str(source)) if isinstance(source, dict) else getattr(source, 'text', str(source))
        target_text = target.get('text', str(target)) if isinstance(target, dict) else getattr(target, 'text', str(target))
        
        print(f"     • {source_text} --[{rel_type}]--> {target_text} (conf: {confidence:.2f})")

print("\n" + "=" * 80)

## Step 4: Relation Classification

Group and classify extracted relations by their predicate type.
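
Conceptually, grouping by predicate is a dictionary of lists keyed on the relation type. The sketch below shows that idea on plain dicts; `group_by_predicate` is a hypothetical helper written for this illustration, not the library's `classify_relations` implementation.

```python
from collections import defaultdict


def group_by_predicate(relations):
    """Group relation dicts by predicate, defaulting to 'related_to'."""
    grouped = defaultdict(list)
    for rel in relations:
        grouped[rel.get("predicate", "related_to")].append(rel)
    return dict(grouped)


# Hand-written relations, for illustration only
sample_relations = [
    {"subject": "Microsoft", "predicate": "founded_by", "object": "Bill Gates"},
    {"subject": "Amazon", "predicate": "founded_by", "object": "Jeff Bezos"},
    {"subject": "Google", "predicate": "located_in", "object": "Mountain View"},
    {"subject": "Satya Nadella", "object": "Microsoft"},  # no predicate given
]

grouped = group_by_predicate(sample_relations)
for predicate, rels in sorted(grouped.items()):
    print(f"{predicate}: {len(rels)}")
```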

In [None]:
# Extract relations from all texts
all_relations = []
for text in texts:
    entities = ner_extractor.extract(text)
    relations = advanced_extractor.extract(text, entities)
    all_relations.extend(relations)

# Classify relations
classified_relations = advanced_extractor.classify_relations(all_relations)

print("Relation Classification:\n")
print("=" * 80)

for rel_type, rel_list in sorted(classified_relations.items()):
    print(f"\n{rel_type.upper()} ({len(rel_list)} relations):")
    print("-" * 40)
    
    for rel in rel_list:
        source = rel.get('source', rel.get('subject', '')) if isinstance(rel, dict) else getattr(rel, 'subject', '')
        target = rel.get('target', rel.get('object', '')) if isinstance(rel, dict) else getattr(rel, 'object', '')
        
        source_text = source.get('text', str(source)) if isinstance(source, dict) else getattr(source, 'text', str(source))
        target_text = target.get('text', str(target)) if isinstance(target, dict) else getattr(target, 'text', str(target))
        
        print(f"  • {source_text} → {target_text}")

print("\n" + "=" * 80)

##  Step 5: Triplet Extraction

Extract RDF triplets using `TripletExtractor`. Triplets are the foundation of knowledge graphs.

### What are RDF Triplets?

RDF (Resource Description Framework) triplets, usually called *triples* in the W3C specifications, are statements with three parts:
- **Subject**: What we're talking about
- **Predicate**: The property or relationship
- **Object**: The value or target

Example: `(Apple Inc., founded_by, Steve Jobs)`

### Why Use Triplets?

- **Standardized format** for knowledge representation
- **Compatible** with RDF databases and semantic web
- **Queryable** using SPARQL
- **Interoperable** across systems

In [None]:
from semantica.semantic_extract import TripletExtractor

# Initialize triplet extractor
triplet_extractor = TripletExtractor(
    include_temporal=True,  # Include temporal information
    include_provenance=True  # Track source sentences
)

# Sample text
triplet_text = """
Apple Inc. was founded by Steve Jobs in 1976. The company is based in Cupertino, California.
Tim Cook became CEO in 2011. Apple develops the iPhone and MacBook products.
"""

# Extract triplets
triplets = triplet_extractor.extract_triplets(triplet_text)

print(f" Extracted {len(triplets)} RDF Triplets:\n")
print("=" * 80)

for i, triplet in enumerate(triplets, 1):
    subject = triplet.get('subject', '') if isinstance(triplet, dict) else triplet.subject
    predicate = triplet.get('predicate', '') if isinstance(triplet, dict) else triplet.predicate
    obj = triplet.get('object', '') if isinstance(triplet, dict) else triplet.object
    confidence = triplet.get('confidence', 1.0) if isinstance(triplet, dict) else getattr(triplet, 'confidence', 1.0)
    
    print(f"{i:2d}. ({subject}, {predicate}, {obj})")
    print(f"    Confidence: {confidence:.2f}")
    
    # Show temporal info if available (metadata may be missing or None)
    metadata = (triplet.get('metadata') if isinstance(triplet, dict) else getattr(triplet, 'metadata', None)) or {}
    if metadata.get('temporal'):
        print(f"    Temporal: {metadata['temporal']}")
    print()

print("=" * 80)

##  Step 6: Triplet Validation

Validate extracted triplets using `TripletValidator` and assess quality with `TripletQualityChecker`.

### Why Validate Triplets?

- **Ensure completeness**: All parts (subject, predicate, object) present
- **Check confidence**: Filter low-quality extractions
- **Verify consistency**: No contradictory statements
- **Assess quality**: Overall extraction quality metrics
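
The first three checks in this list can be sketched in a few lines of plain Python. The helper names and the notion of "single-valued predicates" used for the consistency check are assumptions chosen for this illustration; `TripletValidator` may apply different rules.

```python
def is_complete(triplet):
    """True when subject, predicate, and object are all non-empty."""
    return all(str(triplet.get(key, "")).strip() for key in ("subject", "predicate", "object"))


def validate(triplets, min_confidence=0.5):
    """Keep complete triplets whose confidence meets the minimum."""
    return [t for t in triplets if is_complete(t) and t.get("confidence", 1.0) >= min_confidence]


def find_conflicts(triplets, single_valued=("headquartered_in",)):
    """Flag subjects that have two different objects for a single-valued predicate."""
    seen, issues = {}, []
    for t in triplets:
        if t["predicate"] not in single_valued:
            continue
        key = (t["subject"], t["predicate"])
        if key in seen and seen[key] != t["object"]:
            issues.append((key, seen[key], t["object"]))
        seen.setdefault(key, t["object"])
    return issues


# Hand-written triplets with made-up confidences, for illustration only
raw = [
    {"subject": "Apple Inc.", "predicate": "founded_by", "object": "Steve Jobs", "confidence": 0.9},
    {"subject": "Apple Inc.", "predicate": "headquartered_in", "object": "Cupertino", "confidence": 0.8},
    {"subject": "Apple Inc.", "predicate": "headquartered_in", "object": "", "confidence": 0.9},
    {"subject": "Tim Cook", "predicate": "works_for", "object": "Apple Inc.", "confidence": 0.3},
    {"subject": "Apple Inc.", "predicate": "headquartered_in", "object": "Seattle", "confidence": 0.6},
]

checked = validate(raw)
conflicts = find_conflicts(checked)
print(f"{len(checked)} of {len(raw)} triplets kept, {len(conflicts)} conflict(s)")
```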

In [None]:
from semantica.semantic_extract import TripletValidator, TripletQualityChecker

# Initialize validator and quality checker
validator = TripletValidator()
quality_checker = TripletQualityChecker()

print(" Triplet Validation:\n")
print("=" * 80)

# Validate triplets
valid_triplets = validator.validate_triplets(triplets, min_confidence=0.5)

print(f"\n Validation Results:")
print(f"   Total triplets: {len(triplets)}")
print(f"   Valid triplets: {len(valid_triplets)}")
print(f"   Filtered out: {len(triplets) - len(valid_triplets)}")

# Check quality
quality_scores = quality_checker.calculate_quality_scores(valid_triplets)

print(f"\n Quality Metrics:")
print("-" * 40)
for metric, value in quality_scores.items():
    if isinstance(value, float):
        print(f"   {metric}: {value:.2f}")
    else:
        print(f"   {metric}: {value}")

# Check consistency
consistency_report = validator.check_triplet_consistency(valid_triplets)

print(f"\n Consistency Check:")
print(f"   Consistent: {consistency_report.get('consistent', True)}")
print(f"   Issues found: {len(consistency_report.get('issues', []))}")

print("\n" + "=" * 80)

##  Step 7: RDF Serialization

Serialize triplets to various RDF formats using `RDFSerializer`.

### Supported Formats:

| Format | Extension | Use Case |
|--------|-----------|----------|
| **Turtle** | .ttl | Human-readable, compact |
| **N-Triples** | .nt | Simple, line-based |
| **JSON-LD** | .jsonld | Web-friendly, JSON-based |
| **RDF/XML** | .rdf | XML-based, verbose |
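
To show what the simplest of these formats looks like, here is a hand-rolled N-Triples writer using only the standard library. The `http://example.org/` namespace and the helpers `to_iri` and `to_ntriples` are assumptions for illustration, and every term is emitted as an IRI; real serializers (for example `RDFSerializer`, or a library such as rdflib) also handle literals, datatypes, and escaping.

```python
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace for illustration


def to_iri(name):
    """Turn a label into an IRI by replacing spaces and percent-encoding."""
    return f"<{BASE}{quote(name.replace(' ', '_'))}>"


def to_ntriples(triplets):
    """Serialize (subject, predicate, object) tuples, one statement per line."""
    lines = [f"{to_iri(s)} {to_iri(p)} {to_iri(o)} ." for s, p, o in triplets]
    return "\n".join(lines)


sample_triplets = [
    ("Apple Inc.", "founded_by", "Steve Jobs"),
    ("Apple Inc.", "based_in", "Cupertino"),
]

print(to_ntriples(sample_triplets))
```

Each N-Triples line is a complete statement terminated by ` .`, which is why the format is easy to stream and diff but more verbose than Turtle.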

In [None]:
from semantica.semantic_extract import RDFSerializer

# Initialize serializer
serializer = RDFSerializer()

print(" RDF Serialization Examples:\n")
print("=" * 80)

# Serialize to different formats
formats = ["turtle", "ntriples", "jsonld"]

for fmt in formats:
    print(f"\n {fmt.upper()} Format:")
    print("-" * 40)
    
    try:
        serialized = serializer.serialize_to_rdf(valid_triplets[:3], format=fmt)  # Show first 3
        
        # Show preview (first 300 chars)
        preview = serialized[:300] + "..." if len(serialized) > 300 else serialized
        print(preview)
        
    except Exception as e:
        print(f"Error: {str(e)[:50]}")

print("\n" + "=" * 80)

##  Step 8: Complete Extraction Pipeline

Let's build a complete pipeline: **Entities → Relations → Triplets**

This demonstrates the full workflow for knowledge graph construction.

In [None]:
def extract_knowledge(text):
    """
    Complete knowledge extraction pipeline.
    
    Args:
        text: Input text
        
    Returns:
        dict: Extracted entities, relations, and triplets
    """
    # Step 1: Extract entities
    entities = ner_extractor.extract(text)
    
    # Step 2: Extract relations
    relations = relation_extractor.extract(text, entities)
    
    # Step 3: Extract triplets
    triplets = triplet_extractor.extract_triplets(text, entities=entities, relationships=relations)
    
    # Step 4: Validate triplets
    valid_triplets = validator.validate_triplets(triplets)
    
    return {
        'entities': entities,
        'relations': relations,
        'triplets': valid_triplets
    }

# Sample knowledge-rich text
knowledge_text = """
Tesla Inc. was founded by Elon Musk, JB Straubel, Martin Eberhard, Marc Tarpenning, 
and Ian Wright in 2003. The company is headquartered in Austin, Texas. Tesla produces 
electric vehicles including the Model S, Model 3, Model X, and Model Y. Elon Musk serves 
as CEO and has been instrumental in the company's growth.
"""

print(" Complete Extraction Pipeline:\n")
print("=" * 80)

# Run pipeline
result = extract_knowledge(knowledge_text)

print(f"\n Extraction Results:")
print("-" * 40)
print(f"Entities extracted: {len(result['entities'])}")
print(f"Relations extracted: {len(result['relations'])}")
print(f"Triplets extracted: {len(result['triplets'])}")

print(f"\n Sample Triplets:")
for i, triplet in enumerate(result['triplets'][:5], 1):
    subject = triplet.get('subject', '') if isinstance(triplet, dict) else triplet.subject
    predicate = triplet.get('predicate', '') if isinstance(triplet, dict) else triplet.predicate
    obj = triplet.get('object', '') if isinstance(triplet, dict) else triplet.object
    print(f"  {i}. ({subject}, {predicate}, {obj})")

print("\n" + "=" * 80)

##  Step 9: Best Practices & Tips

###  Choosing the Right Method

1. **Start with pattern-based** for common relations
2. **Use dependency parsing** for grammatical accuracy
3. **Try co-occurrence** for exploratory analysis
4. **Consider LLM** for complex, domain-specific relations

### Optimizing Extraction

- **Set confidence thresholds** (0.6-0.7 for production)
- **Specify relation_types** to focus extraction
- **Adjust max_distance** based on text structure
- **Validate triplets** before using in knowledge graphs

###  Common Pitfalls to Avoid

- **Don't** skip entity extraction; relations are extracted between entities
- **Don't** set very low confidence thresholds; they let noisy relations through
- **Don't** skip validation before loading triplets into a knowledge graph
- **Don't** forget to serialize triplets for storage

###  When to Use Each Component

| Use Case | Recommended Component |
|----------|----------------------|
| Find entity connections | `RelationExtractor` |
| Build knowledge graphs | `TripletExtractor` |
| Quality assurance | `TripletValidator` |
| Export to RDF | `RDFSerializer` |
| Assess extraction quality | `TripletQualityChecker` |

###  Performance Tips

1. **Extract entities once**, reuse for relations and triplets
2. **Batch process** multiple documents together
3. **Cache extractors** instead of recreating
4. **Filter early** with confidence thresholds
5. **Validate incrementally** rather than all at once
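
Tips 2 and 3 can be combined in a small pattern: build each extractor once, cache it, and reuse it across a batch. In the sketch below, `SlowExtractor` is a hypothetical stand-in for any extractor that is expensive to construct, and `get_extractor` and `process_batch` are illustrative helpers rather than Semantica APIs.

```python
from functools import lru_cache


class SlowExtractor:
    """Hypothetical stand-in for an extractor that is costly to build."""
    instances = 0

    def __init__(self, method):
        SlowExtractor.instances += 1  # count constructions to show the caching
        self.method = method

    def extract(self, text):
        # Toy behavior: treat capitalized words as extracted items
        return [word for word in text.split() if word[:1].isupper()]


@lru_cache(maxsize=None)
def get_extractor(method):
    """Build an extractor once per method and reuse it afterwards."""
    return SlowExtractor(method)


def process_batch(documents, method="pattern"):
    """Run one cached extractor over a dict of {doc_id: text}."""
    extractor = get_extractor(method)
    return {doc_id: extractor.extract(text) for doc_id, text in documents.items()}


docs = {"a": "Tesla is based in Austin", "b": "Elon Musk serves as CEO"}
first = process_batch(docs)
second = process_batch(docs)  # reuses the cached extractor
print(f"Extractors built: {SlowExtractor.instances}")
```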

##  Summary

### What You've Learned

In this notebook, you've learned how to:

- **Extract relationships** using `RelationExtractor`
- **Compare extraction methods** (pattern, dependency, co-occurrence, HuggingFace, LLM)
- **Configure extraction parameters** for optimal results
- **Extract RDF triplets** with `TripletExtractor`
- **Validate triplets** using `TripletValidator`
- **Serialize to RDF formats** with `RDFSerializer`
- **Assess quality** with `TripletQualityChecker`
- **Build complete pipelines** from entities to triplets

### Key Takeaways

1. **Relations connect entities**: They form the backbone of knowledge graphs
2. **Multiple methods available**: Choose based on accuracy vs speed needs
3. **Configuration is powerful**: Tune parameters for your domain
4. **Triplets are standardized**: Use RDF for interoperability
5. **Validation is essential**: Ensure quality before using triplets
6. **Pipelines are efficient**: Extract entities → relations → triplets in sequence

### Next Steps

**Next Notebook**: [07_Building_Knowledge_Graphs.ipynb](./07_Building_Knowledge_Graphs.ipynb)  
Learn how to build complete knowledge graphs from your extracted triplets.

**Further Reading**:
- [Semantic Extract API Reference](https://semantica.readthedocs.io/reference/semantic_extract/)
- [Knowledge Graph Module](https://semantica.readthedocs.io/reference/kg/)
- [Advanced Graph Analytics](../advanced/02_Advanced_Graph_Analytics.ipynb)

---

**Questions or Issues?** Check out our [GitHub repository](https://github.com/Hawksight-AI/semantica) or [documentation](https://semantica.readthedocs.io).