[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/introduction/05_Entity_Extraction.ipynb)

# Entity Extraction - Comprehensive Guide

## Overview

This notebook shows how to extract named entities from text with Semantica's NER (Named Entity Recognition) modules. It covers several extractors and extraction methods, entity classification, confidence scoring, custom patterns, and batch processing.

**Documentation**: [API Reference](https://semantica.readthedocs.io/reference/semantic_extract/)

###  Learning Objectives

By the end of this notebook, you will be able to:

-  Extract entities using `NERExtractor` and `NamedEntityRecognizer`
-  Understand different extraction methods (pattern, regex, ML, HuggingFace, LLM)
-  Use `EntityClassifier` to classify and group entities
-  Apply `EntityConfidenceScorer` to assess extraction quality
-  Create custom entity patterns with `CustomEntityDetector`
-  Configure extraction parameters for optimal results
-  Process multiple documents efficiently
-  Handle edge cases and errors gracefully

###  What You'll Learn

| Component | Purpose | When to Use |
|-----------|---------|-------------|
| `NERExtractor` | Core entity extraction | Quick, simple extraction |
| `NamedEntityRecognizer` | Advanced NER with configuration | Fine-tuned control needed |
| `EntityClassifier` | Classify and group entities | Organizing extracted entities |
| `EntityConfidenceScorer` | Score entity confidence | Quality assessment |
| `CustomEntityDetector` | Domain-specific entities | Custom patterns needed |

---

##  Installation

Install Semantica from PyPI:

```bash
pip install semantica
# Or with all optional dependencies:
pip install semantica[all]
```

---

In [None]:
!pip install semantica


##  Step 1: Basic Entity Extraction with NERExtractor

Let's start with the simplest approach using `NERExtractor`. This class provides a straightforward interface for extracting named entities from text.

### What is NERExtractor?

`NERExtractor` is the core entity extraction class that:
- Identifies named entities (people, organizations, locations, dates, etc.)
- Returns entity objects with text, type, position, and confidence
- Supports multiple extraction methods
- Works out-of-the-box with sensible defaults

In [None]:
from semantica.semantic_extract import NERExtractor

# Initialize the extractor
ner_extractor = NERExtractor()

# Sample text with various entity types
text = """
Apple Inc. is a technology company founded by Steve Jobs, Steve Wozniak, and Ronald Wayne 
in Cupertino, California on April 1, 1976. The company's current CEO is Tim Cook, who took 
over from Steve Jobs in August 2011. Apple is headquartered at One Apple Park Way in Cupertino.
"""

# Extract entities
entities = ner_extractor.extract(text)

print(f" Extracted {len(entities)} entities:\n")
print("-" * 80)

for i, entity in enumerate(entities, 1):
    # Handle both dict and object formats
    entity_text = entity.get('text', entity.get('entity', '')) if isinstance(entity, dict) else entity.text
    entity_type = entity.get('type', entity.get('label', 'Unknown')) if isinstance(entity, dict) else entity.label
    confidence = entity.get('confidence', 1.0) if isinstance(entity, dict) else getattr(entity, 'confidence', 1.0)
    
    print(f"{i:2d}. {entity_text:30s} | Type: {entity_type:12s} | Confidence: {confidence:.2f}")

print("-" * 80)

###  Understanding Entity Objects

Each extracted entity contains:

| Attribute | Description | Example |
|-----------|-------------|----------|
| `text` | The entity text | "Apple Inc." |
| `label/type` | Entity category | "ORG" (Organization) |
| `start_char` | Starting position | 0 |
| `end_char` | Ending position | 10 |
| `confidence` | Extraction confidence (0-1) | 0.95 |
| `metadata` | Additional information | {"method": "ml"} |
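
The attributes in the table can be pictured as a small record type. The sketch below uses a hypothetical `Entity` dataclass (an illustration, not Semantica's own class) and assumes that `start_char` and `end_char` are character offsets with an exclusive end, as in Python slicing.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    """Hypothetical stand-in mirroring the attributes in the table above."""
    text: str
    label: str
    start_char: int
    end_char: int          # assumed exclusive, like a Python slice end
    confidence: float = 1.0
    metadata: dict = field(default_factory=dict)

source = "Apple Inc. was founded by Steve Jobs."
org = Entity("Apple Inc.", "ORG", 0, 10, 0.95, {"method": "ml"})

# If the offsets are correct, slicing the source recovers the entity text.
recovered = source[org.start_char:org.end_char]
print(recovered == org.text)
```

Checking that the offsets slice back to `text` is a cheap sanity test when combining output from several extractors.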

###  Common Entity Types

- **PERSON**: People, including fictional characters
- **ORG**: Companies, agencies, institutions
- **GPE**: Countries, cities, states (Geo-Political Entities)
- **LOC**: Non-GPE locations, mountain ranges, bodies of water
- **DATE**: Absolute or relative dates or periods
- **TIME**: Times smaller than a day
- **MONEY**: Monetary values
- **PERCENT**: Percentage values

##  Step 2: Visualizing Entities in Context

Let's create a simple visualization that lists the extracted entities grouped by type.

In [None]:
def highlight_entities(text, entities):
    """
    Create a simple text visualization with entity markers.
    """
    # Group entities by type
    entity_types = {}
    for entity in entities:
        entity_text = entity.get('text', entity.get('entity', '')) if isinstance(entity, dict) else entity.text
        entity_type = entity.get('type', entity.get('label', 'Unknown')) if isinstance(entity, dict) else entity.label
        
        if entity_type not in entity_types:
            entity_types[entity_type] = []
        entity_types[entity_type].append(entity_text)
    
    print("\n Entity Visualization:\n")
    print("=" * 80)
    
    for entity_type, entity_list in sorted(entity_types.items()):
        unique_entities = sorted(set(entity_list))  # sorted for stable output
        print(f"\n{entity_type}:")
        for ent in unique_entities:
            print(f"  • {ent}")
    
    print("\n" + "=" * 80)

# Visualize the extracted entities
highlight_entities(text, entities)
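
To see entities marked inside the original sentence rather than listed by type, character offsets can be used to wrap each span. This is a minimal sketch that assumes each entity is available as a `(start, end, label)` tuple with an exclusive end offset; the `annotate` helper and the hand-written offsets are illustrative, not part of Semantica.

```python
def annotate(text, spans):
    """Wrap each (start, end, label) span as [text](LABEL).

    Spans are applied right to left so earlier offsets stay valid
    while the string grows. Overlapping spans are not handled.
    """
    out = text
    for start, end, label in sorted(spans, reverse=True):
        out = out[:start] + f"[{out[start:end]}]({label})" + out[end:]
    return out

demo_text = "Tim Cook leads Apple."
demo_spans = [(0, 8, "PERSON"), (15, 20, "ORG")]  # hand-written offsets
annotated = annotate(demo_text, demo_spans)
print(annotated)
```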

## Step 3: Different Extraction Methods

Semantica supports multiple extraction methods, each with different strengths:

### Method Comparison

| Method | Speed | Accuracy | Use Case | Requires |
|--------|-------|----------|----------|----------|
| **pattern** |  | ⭐⭐ | Simple, predictable patterns | Nothing |
| **regex** |  | ⭐⭐⭐ | Custom patterns, IDs, codes | Regex knowledge |
| **ml** (spaCy) |  | ⭐⭐⭐⭐ | General text, multiple languages | spaCy model |
| **huggingface** |  | ⭐⭐⭐⭐⭐ | Domain-specific, fine-tuned | HF model |
| **llm** |  | ⭐⭐⭐⭐⭐ | Complex, custom types | API key |

Let's try different methods:

In [None]:
from semantica.semantic_extract import NERExtractor

sample_text = "Apple Inc. was founded by Steve Jobs in Cupertino, California in 1976."

print(" Comparing Extraction Methods:\n")
print("=" * 80)

# Try different methods
methods_to_try = ["pattern", "regex", "ml"]

for method_name in methods_to_try:
    try:
        print(f"\n Method: {method_name.upper()}")
        print("-" * 40)
        
        extractor = NERExtractor(method=method_name)
        entities = extractor.extract(sample_text)
        
        print(f"Found {len(entities)} entities:")
        for entity in entities[:5]:  # Show first 5
            entity_text = entity.get('text', entity.get('entity', '')) if isinstance(entity, dict) else entity.text
            entity_type = entity.get('type', entity.get('label', 'Unknown')) if isinstance(entity, dict) else entity.label
            print(f"  • {entity_text} ({entity_type})")
            
    except Exception as e:
        print(f"  Method '{method_name}' not available: {str(e)[:50]}")

print("\n" + "=" * 80)
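
When comparing methods, it can help to see which entities every method agrees on and which only one method finds. The helper below is a sketch using plain sets; `compare_methods` is a hypothetical name, and the example results are hand-written rather than real extractor output.

```python
def compare_methods(results):
    """Summarize agreement between extraction methods.

    results maps a method name to a set of (text, label) pairs.
    Returns the pairs every method found and the pairs only one method found.
    """
    sets = list(results.values())
    shared = set.intersection(*sets) if sets else set()
    unique = {}
    for name, found in results.items():
        others = set().union(*(s for n, s in results.items() if n != name))
        unique[name] = found - others
    return shared, unique

# Hand-written example output, not real extractor results.
example_results = {
    "regex": {("Apple Inc.", "ORG"), ("1976", "DATE")},
    "ml": {("Apple Inc.", "ORG"), ("Steve Jobs", "PERSON"), ("1976", "DATE")},
}
shared, unique = compare_methods(example_results)
print(sorted(shared))
print(unique)
```

Entities found by only one method are good candidates for manual review before you settle on a method.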

##  Step 4: Advanced Entity Recognition with NamedEntityRecognizer

`NamedEntityRecognizer` provides more control over the extraction process through configuration parameters.

### Key Parameters:

- **`methods`**: List of extraction methods to use (e.g., `["spacy", "rule-based"]`)
- **`confidence_threshold`**: Minimum confidence score (0.0-1.0, default: 0.5)
- **`merge_overlapping`**: Whether to merge overlapping entities (default: True)
- **`include_standard_types`**: Include standard entity types (PERSON, ORG, LOC, etc.)
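
To make `confidence_threshold` and `merge_overlapping` concrete, here is a sketch of one possible policy: drop spans below the threshold, then keep the higher-confidence span wherever two overlap. The `filter_and_merge` function and its overlap rule are assumptions for illustration; Semantica may resolve overlaps differently.

```python
def filter_and_merge(spans, confidence_threshold=0.5):
    """Drop low-confidence spans, then resolve overlaps.

    Each span is a dict with start, end, label, confidence.
    When two spans overlap, the higher-confidence one is kept
    (an assumed policy, chosen for illustration).
    """
    kept = [s for s in spans if s["confidence"] >= confidence_threshold]
    kept.sort(key=lambda s: s["confidence"], reverse=True)
    result = []
    for span in kept:
        overlaps = any(
            span["start"] < other["end"] and other["start"] < span["end"]
            for other in result
        )
        if not overlaps:
            result.append(span)
    return sorted(result, key=lambda s: s["start"])

# Hand-written candidates for "Apple Inc. was founded by Steve Jobs in Cupertino, ..."
candidates = [
    {"start": 0, "end": 5, "label": "ORG", "confidence": 0.72},    # "Apple"
    {"start": 0, "end": 10, "label": "ORG", "confidence": 0.95},   # "Apple Inc."
    {"start": 26, "end": 36, "label": "PERSON", "confidence": 0.90},
    {"start": 40, "end": 49, "label": "GPE", "confidence": 0.40},  # below threshold
]
merged = filter_and_merge(candidates, confidence_threshold=0.7)
print([(s["start"], s["end"], s["label"]) for s in merged])
```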

In [None]:
from semantica.semantic_extract import NamedEntityRecognizer

# Create recognizer with custom configuration
ner = NamedEntityRecognizer(
    methods=["spacy"],  # Use spaCy for ML-based extraction
    confidence_threshold=0.7,  # Only keep high-confidence entities
    merge_overlapping=True,  # Merge overlapping entity mentions
    include_standard_types=True  # Include standard entity types
)

# Sample texts for batch processing
texts = [
    "Tim Cook is the CEO of Apple Inc., based in Cupertino.",
    "Microsoft Corporation, founded by Bill Gates, is headquartered in Redmond, Washington.",
    "Amazon was founded by Jeff Bezos in Seattle in 1994.",
    "Google was started by Larry Page and Sergey Brin at Stanford University."
]

print(" Advanced Entity Recognition Results:\n")
print("=" * 80)

all_entities = []
for i, text in enumerate(texts, 1):
    entities = ner.extract_entities(text)
    all_entities.extend(entities)
    
    print(f"\n Text {i}: {text[:60]}...")
    print(f"   Found {len(entities)} entities:")
    
    for entity in entities:
        entity_text = entity.get('text', entity.get('entity', '')) if isinstance(entity, dict) else entity.text
        entity_type = entity.get('type', entity.get('label', 'Unknown')) if isinstance(entity, dict) else entity.label
        confidence = entity.get('confidence', 1.0) if isinstance(entity, dict) else getattr(entity, 'confidence', 1.0)
        print(f"     • {entity_text:25s} | {entity_type:10s} | Confidence: {confidence:.2f}")

print(f"\n Total entities extracted: {len(all_entities)}")
print("=" * 80)

## Step 5: Entity Classification

Use `EntityClassifier` to classify and group entities by type, and disambiguate similar entities.

### What is EntityClassifier?

The `EntityClassifier` helps you:
- **Classify entities** by their type (normalize variations like "ORG" vs "ORGANIZATION")
- **Group entities** by category for analysis
- **Disambiguate entities** when multiple candidates exist
- **Standardize entity types** across different extraction methods
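
The normalization and grouping ideas above can be sketched without the library. The alias table and both helper functions below are hypothetical and only illustrate the concept; they are not `EntityClassifier`'s implementation.

```python
from collections import defaultdict

# Hypothetical alias table; extend it to match the labels your extractors emit.
LABEL_ALIASES = {
    "ORGANIZATION": "ORG",
    "COMPANY": "ORG",
    "PER": "PERSON",
    "LOCATION": "LOC",
}

def normalize_label(label):
    """Map a label variant to its canonical form (unknown labels pass through)."""
    upper = label.strip().upper()
    return LABEL_ALIASES.get(upper, upper)

def group_by_label(pairs):
    """Group (text, label) pairs under canonical labels, dropping duplicates."""
    groups = defaultdict(list)
    for text, label in pairs:
        canonical = normalize_label(label)
        if text not in groups[canonical]:
            groups[canonical].append(text)
    return dict(groups)

grouped = group_by_label([
    ("Apple Inc.", "ORG"),
    ("Microsoft Corporation", "organization"),
    ("Tim Cook", "PER"),
    ("Apple Inc.", "ORGANIZATION"),
])
print(grouped)
```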

In [None]:
from semantica.semantic_extract import EntityClassifier

# Initialize classifier
classifier = EntityClassifier()

# Classify the entities we extracted earlier
classified = classifier.classify_entities(all_entities)

print("Entity Classification Results:\n")
print("=" * 80)

for entity_type, entity_list in sorted(classified.items()):
    print(f"\n{entity_type} ({len(entity_list)} entities):")
    print("-" * 40)
    
    # Get unique entity texts
    unique_entities = set()
    for entity in entity_list:
        entity_text = entity.get('text', entity.get('entity', '')) if isinstance(entity, dict) else entity.text
        unique_entities.add(entity_text)
    
    for entity_text in sorted(unique_entities):
        print(f"  • {entity_text}")

print("\n" + "=" * 80)

##  Step 6: Confidence Scoring

Use `EntityConfidenceScorer` to assess and improve the confidence scores of extracted entities.

### Why Confidence Scoring?

Confidence scores help you:
- **Filter low-quality extractions** (e.g., only keep entities with confidence > 0.8)
- **Prioritize entities** for manual review or validation
- **Assess extraction quality** across different methods or texts
- **Make informed decisions** about which entities to use
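
Filtering and prioritizing can be combined into a simple triage step. The sketch below works on plain `(text, confidence)` pairs with made-up scores; the `triage` function and its default cut-offs (0.8 and 0.5, matching the buckets used in this notebook) are illustrative assumptions.

```python
def triage(scored, accept_at=0.8, reject_below=0.5):
    """Split (text, confidence) pairs into accept / review / reject lists.

    The review list is sorted lowest confidence first so the most
    doubtful extractions are checked first.
    """
    accept = [e for e in scored if e[1] >= accept_at]
    reject = [e for e in scored if e[1] < reject_below]
    review = sorted(
        (e for e in scored if reject_below <= e[1] < accept_at),
        key=lambda e: e[1],
    )
    return accept, review, reject

# Made-up scores for illustration only.
accept, review, reject = triage([
    ("Tim Cook", 0.97),
    ("Cupertino", 0.74),
    ("Build", 0.31),
    ("Azure AI", 0.58),
])
print(review)
```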

In [None]:
from semantica.semantic_extract import EntityConfidenceScorer

# Initialize confidence scorer
scorer = EntityConfidenceScorer()

# Score the entities
scored_entities = scorer.score_entities(all_entities)

print(" Entity Confidence Scoring:\n")
print("=" * 80)

# Group by confidence levels
high_confidence = []
medium_confidence = []
low_confidence = []

for entity in scored_entities:
    confidence = entity.get('confidence', 1.0) if isinstance(entity, dict) else getattr(entity, 'confidence', 1.0)
    
    if confidence >= 0.8:
        high_confidence.append(entity)
    elif confidence >= 0.5:
        medium_confidence.append(entity)
    else:
        low_confidence.append(entity)

print(f"High Confidence (≥0.8): {len(high_confidence)} entities")
print(f"Medium Confidence (0.5-0.8): {len(medium_confidence)} entities")
print(f"Low Confidence (<0.5): {len(low_confidence)} entities")

print("\n Confidence Distribution:")
print("-" * 40)

# Show some examples from each category
if high_confidence:
    print("\nHigh Confidence Examples:")
    for entity in high_confidence[:3]:
        entity_text = entity.get('text', entity.get('entity', '')) if isinstance(entity, dict) else entity.text
        entity_type = entity.get('type', entity.get('label', 'Unknown')) if isinstance(entity, dict) else entity.label
        confidence = entity.get('confidence', 1.0) if isinstance(entity, dict) else getattr(entity, 'confidence', 1.0)
        print(f"  • {entity_text} ({entity_type}) - {confidence:.2f}")

print("\n" + "=" * 80)

##  Step 7: Custom Entity Detection

Use `CustomEntityDetector` to define domain-specific entity patterns.

### When to Use Custom Patterns?

Custom patterns are useful for:
- **Domain-specific entities** (e.g., product codes, invoice numbers)
- **Structured identifiers** (e.g., email addresses, phone numbers)
- **Industry-specific terms** (e.g., medical codes, legal citations)
- **Custom formats** not recognized by standard NER
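
Pattern-based detection of this kind can be prototyped with Python's standard `re` module before wiring patterns into a detector. The sketch below is independent of Semantica; `find_custom` and the two patterns are illustrative, and it returns match offsets alongside the text.

```python
import re

# Illustrative patterns; adjust them to your own identifier formats.
PATTERNS = {
    "EMAIL": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
    "PRODUCT_CODE": r"\b[A-Z]{2,3}-\d{4,6}\b",
}

def find_custom(text, patterns):
    """Return {label: [(match_text, start, end), ...]} using re.finditer."""
    return {
        label: [(m.group(), m.start(), m.end()) for m in re.finditer(rx, text)]
        for label, rx in patterns.items()
    }

sample = "Contact support@apple.com about order SKU-12345."
custom_hits = find_custom(sample, PATTERNS)
print(custom_hits)
```

Testing patterns this way on a few representative strings makes false matches visible early.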

In [None]:
from semantica.semantic_extract import CustomEntityDetector
import re

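# Define custom patterns
custom_patterns = {
    # [A-Za-z] for the TLD: a '|' inside a character class would match a literal pipe
    "EMAIL": r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
    # Optional leading "1-" so toll-free numbers such as 1-800-692-7753 match in full
    "PHONE": r'\b(?:1[-.]?)?\d{3}[-.]?\d{3}[-.]?\d{4}\b',
    "PRODUCT_CODE": r'\b[A-Z]{2,3}-\d{4,6}\b',
    # Final character class keeps sentence punctuation out of the match
    "URL": r'https?://[^\s]*[^\s.,;:!?]'
}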

# Initialize custom detector
custom_detector = CustomEntityDetector(patterns=custom_patterns)

# Sample text with custom entities
custom_text = """
For support, contact support@apple.com or call 1-800-692-7753.
Order product SKU-12345 from https://store.apple.com.
Technical inquiries: tech@apple.com or visit our website.
"""

print(" Custom Entity Detection:\n")
print("=" * 80)

for entity_type in custom_patterns.keys():
    entities = custom_detector.detect_custom_entities(custom_text, entity_type)
    
    if entities:
        print(f"\n{entity_type}:")
        for entity in entities:
            entity_text = entity.get('text', entity.get('entity', '')) if isinstance(entity, dict) else entity.text
            print(f"  • {entity_text}")

print("\n" + "=" * 80)

##  Step 8: Batch Processing

Process multiple documents efficiently using batch processing capabilities.

### Benefits of Batch Processing:

- **Performance**: Process multiple documents in one call
- **Consistency**: Same configuration applied to all documents
- **Efficiency**: Reduced overhead from initialization
- **Scalability**: Handle large document collections
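
For very large collections, a batch call can itself be fed in fixed-size chunks so memory use stays bounded. This is a sketch under the assumption that you have some function mapping a list of documents to a list of results; `run_in_batches` and `fake_batch_fn` are hypothetical stand-ins.

```python
from itertools import islice

def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def run_in_batches(docs, batch_fn, size=2):
    """Apply batch_fn (list of docs -> list of results) chunk by chunk."""
    results = []
    for chunk in chunked(docs, size):
        results.extend(batch_fn(chunk))
    return results

# Stand-in for a real batch extractor: counts capitalized words per document.
def fake_batch_fn(chunk):
    return [sum(w[:1].isupper() for w in doc.split()) for doc in chunk]

counts = run_in_batches(
    ["Apple hired Tim Cook", "no names here", "Visit Seattle"],
    fake_batch_fn,
)
print(counts)
```

In a real pipeline, `batch_fn` would be the batch method of whichever extractor you use.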

In [None]:
# Sample document collection
documents = [
    "Apple Inc. released the iPhone 15 in September 2023.",
    "Microsoft announced Azure AI updates at Build 2023 in Seattle.",
    "Google's Sundar Pichai spoke at I/O 2023 in Mountain View, California.",
    "Tesla's Elon Musk unveiled the Cybertruck in Austin, Texas.",
    "Amazon Web Services launched new features in Northern Virginia."
]

print(" Batch Processing Results:\n")
print("=" * 80)

# Process all documents
batch_results = ner.process_batch(documents)

# Analyze results
total_entities = 0
entity_type_counts = {}

for i, (doc, entities) in enumerate(zip(documents, batch_results), 1):
    total_entities += len(entities)
    
    print(f"\n Document {i}:")
    print(f"   Text: {doc[:50]}...")
    print(f"   Entities: {len(entities)}")
    
    for entity in entities:
        entity_type = entity.get('type', entity.get('label', 'Unknown')) if isinstance(entity, dict) else entity.label
        entity_type_counts[entity_type] = entity_type_counts.get(entity_type, 0) + 1

print(f"\n Batch Processing Summary:")
print("-" * 40)
print(f"Documents processed: {len(documents)}")
print(f"Total entities: {total_entities}")
print(f"Average per document: {total_entities/len(documents):.1f}")

print("\n Entity Type Distribution:")
for entity_type, count in sorted(entity_type_counts.items(), key=lambda x: x[1], reverse=True):
    print(f"  {entity_type}: {count}")

print("\n" + "=" * 80)

##  Step 9: Best Practices & Tips

###  Choosing the Right Method

1. **Start with ML (spaCy)** for general text
2. **Use patterns/regex** for structured data (IDs, codes)
3. **Try HuggingFace** for domain-specific needs
4. **Consider LLM** for complex, custom entity types

### Optimizing Performance

- **Set appropriate confidence thresholds**: 0.7-0.8 is a reasonable starting range for production; tune it against a labeled sample
- **Use batch processing** for multiple documents
- **Enable `merge_overlapping`** to reduce duplicates
- **Cache extractors** instead of recreating them
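
One way to cache extractors is to build each configuration once behind `functools.lru_cache`. The sketch below uses a hypothetical `DummyExtractor` in place of a real, expensive-to-construct extractor so the caching behavior is visible.

```python
from functools import lru_cache

class DummyExtractor:
    """Placeholder for an expensive-to-build extractor (hypothetical)."""
    instances = 0

    def __init__(self, method):
        DummyExtractor.instances += 1
        self.method = method

@lru_cache(maxsize=None)
def get_extractor(method="ml"):
    """Build the extractor for `method` once and reuse it afterwards."""
    return DummyExtractor(method)

first = get_extractor("ml")
second = get_extractor("ml")      # served from the cache
other = get_extractor("regex")    # different argument, new instance
print(first is second, DummyExtractor.instances)
```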

###  Common Pitfalls to Avoid

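- **Don't** use very low confidence thresholds (< 0.5) without reviewing the extra entities they let through
- **Don't** call the extractor once per document in a loop when a batch call such as `process_batch` is available
- **Don't** ignore entity metadata; it records details such as the extraction method
- **Don't** forget to handle extraction errors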
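
Error handling can be centralized in a small wrapper so one failing document does not stop a whole run. The sketch below assumes only that extraction is a callable taking a string; `safe_extract` and the `flaky` test function are hypothetical.

```python
import logging

logger = logging.getLogger("entity_extraction")

def safe_extract(extract_fn, text, default=None):
    """Call extract_fn(text); on failure, log and return a fallback list."""
    if not isinstance(text, str) or not text.strip():
        return [] if default is None else default
    try:
        return extract_fn(text)
    except Exception as exc:  # broad on purpose: backends raise varied errors
        logger.warning("Extraction failed for %r: %s", text[:40], exc)
        return [] if default is None else default

def flaky(text):
    """Hypothetical extractor that fails on one input."""
    if "boom" in text:
        raise RuntimeError("backend unavailable")
    return [text.split()[0]]

outputs = [safe_extract(flaky, t) for t in ["Apple rises", "", "boom today"]]
print(outputs)
```

Empty or non-string inputs are short-circuited before the extractor is called, which covers the most common edge case.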

###  When to Use Each Class

| Use Case | Recommended Class |
|----------|-------------------|
| Quick extraction | `NERExtractor` |
| Fine-tuned control | `NamedEntityRecognizer` |
| Grouping entities | `EntityClassifier` |
| Quality assessment | `EntityConfidenceScorer` |
| Domain-specific | `CustomEntityDetector` |

##  Summary

### What You've Learned

In this notebook, you've learned how to:

- **Extract entities** using `NERExtractor` and `NamedEntityRecognizer`
- **Compare different extraction methods** (pattern, regex, ML, HuggingFace, LLM)
- **Classify and group entities** with `EntityClassifier`
- **Score entity confidence** using `EntityConfidenceScorer`
- **Create custom patterns** with `CustomEntityDetector`
- **Process documents in batch** for efficiency
- **Apply best practices** for production use

### Key Takeaways

1. **Multiple methods available**: Choose based on your needs (speed vs accuracy)
2. **Configuration matters**: Tune parameters for optimal results
3. **Confidence is key**: Use thresholds to filter low-quality extractions
4. **Custom patterns work**: For domain-specific entities
5. **Batch processing scales**: Process multiple documents efficiently

### Next Steps

 **Next Notebook**: [06_Relation_Extraction.ipynb](./06_Relation_Extraction.ipynb)  
Learn how to extract relationships between the entities you've identified!

 **Further Reading**:
- [Semantic Extract API Reference](https://semantica.readthedocs.io/reference/semantic_extract/)
- [Advanced Extraction Techniques](../advanced/01_Advanced_Extraction.ipynb)
- [Building Knowledge Graphs](./07_Building_Knowledge_Graphs.ipynb)

---

**Questions or Issues?** Check out our [GitHub repository](https://github.com/Hawksight-AI/semantica) or [documentation](https://semantica.readthedocs.io).