# Entity Extraction with DSPy

This notebook demonstrates how to build named entity recognition (NER) systems using DSPy:
- Extracting entities from unstructured text
- Different entity types (person, organization, location, etc.)
- Structured output formatting
- Entity linking and relationship extraction

Entity extraction is fundamental for information extraction, knowledge graph construction, and text analysis pipelines.

## Setup and Imports

In [None]:
import os
import sys
sys.path.append('../../')

import dspy
import json
from typing import List, Dict, Any, Tuple
from utils import setup_default_lm, print_step, print_result, print_error
from utils.datasets import get_sample_entity_data
from dotenv import load_dotenv

# Load environment variables
load_dotenv('../../.env')

## Configure Language Model

In [None]:
print_step("Configuring Language Model", "Setting up DSPy with OpenAI")

try:
    lm = setup_default_lm(provider="openai", model="gpt-3.5-turbo", max_tokens=1000)
    dspy.configure(lm=lm)
    print_result("Language model configured successfully!")
except Exception as e:
    print_error(f"Failed to configure language model: {e}")
    print("Make sure you have set your OPENAI_API_KEY in the .env file")

## Basic Named Entity Recognition

Let's start with a simple entity extraction signature and module.

In [None]:
print_step("Basic Entity Extraction", "Creating simple NER signature")

class BasicEntityExtraction(dspy.Signature):
    """Extract named entities from the given text."""
    text = dspy.InputField(desc="The input text to analyze")
    entities = dspy.OutputField(desc="List of entities found, each with type and text (format: 'entity_text (TYPE)')")

# Create basic entity extractor
basic_extractor = dspy.Predict(BasicEntityExtraction)

# Test with sample text
sample_text = "Apple Inc. was founded by Steve Jobs in Cupertino, California. The company is now led by Tim Cook."

result = basic_extractor(text=sample_text)

print_result(
    f"Text: {sample_text}\n\n"
    f"Entities: {result.entities}",
    "Basic Entity Extraction"
)
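
The `entities` field above comes back as free text in the `entity_text (TYPE)` format. Here is a minimal sketch of turning that string into pairs, assuming the model actually follows the requested format. `parse_typed_entities` is a hypothetical helper and not part of DSPy; it will mis-handle entity names that themselves contain parentheses.

```python
import re
from typing import List, Tuple

# Matches "some text (TYPE)"; TYPE is assumed to contain only letters/underscores.
_ENTITY_PATTERN = re.compile(r"\s*(.+?)\s*\(([A-Za-z_]+)\)")

def parse_typed_entities(raw: str) -> List[Tuple[str, str]]:
    """Parse 'entity_text (TYPE)' entries into (text, TYPE) pairs."""
    pairs = []
    for match in _ENTITY_PATTERN.finditer(raw):
        # Drop separators (commas, bullets, newlines) left around the entity text.
        text = match.group(1).strip(" ,;\n-")
        if text:
            pairs.append((text, match.group(2).upper()))
    return pairs

# Hand-written example string, not real model output.
example_output = "Apple Inc. (ORG), Steve Jobs (PERSON), Cupertino, California (LOC)"
parsed = parse_typed_entities(example_output)
for text, entity_type in parsed:
    print(f"{entity_type:<8}{text}")
```

Because each entry is anchored on the trailing `(TYPE)` tag rather than on commas, a value such as "Cupertino, California" stays in one piece.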

## Structured Entity Extraction

Let's create a more structured approach that separates different entity types.

In [None]:
print_step("Structured Entity Extraction", "Separating entities by type")

class StructuredEntityExtraction(dspy.Signature):
    """Extract and categorize named entities from text into specific types."""
    text = dspy.InputField(desc="The input text to analyze")
    persons = dspy.OutputField(desc="List of person names found in the text")
    organizations = dspy.OutputField(desc="List of organization names found in the text")
    locations = dspy.OutputField(desc="List of location names found in the text")
    dates = dspy.OutputField(desc="List of dates found in the text")
    other_entities = dspy.OutputField(desc="Other notable entities not fitting above categories")

# Create structured entity extractor
structured_extractor = dspy.ChainOfThought(StructuredEntityExtraction)

# Test with more complex text
complex_text = """
On January 15, 2024, Microsoft CEO Satya Nadella announced a partnership with OpenAI 
during a conference in Seattle, Washington. The deal, worth $10 billion, will help 
accelerate AI development. Google's Sundar Pichai responded from Mountain View, California, 
expressing concerns about market competition.
"""

result = structured_extractor(text=complex_text)

print_result(
    f"Text: {complex_text}\n\n"
    f"Reasoning: {result.reasoning}\n\n"
    f"Persons: {result.persons}\n"
    f"Organizations: {result.organizations}\n"
    f"Locations: {result.locations}\n"
    f"Dates: {result.dates}\n"
    f"Other Entities: {result.other_entities}",
    "Structured Entity Extraction"
)
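
The output fields in this signature are untyped, so each one comes back as text whose exact layout can vary between runs (comma-separated, bulleted, a JSON array, or the word "None"). Here is a sketch of a normalizer that coerces any of these into a Python list. `to_entity_list` is a hypothetical helper, and the set of "empty" markers is an assumption about what the model might return.

```python
import json
from typing import Any, List

_EMPTY_MARKERS = {"", "none", "n/a", "null", "[]"}

def to_entity_list(value: Any) -> List[str]:
    """Coerce a model output field into a clean list of entity strings."""
    if value is None:
        return []
    if isinstance(value, (list, tuple, set)):
        items = [str(v) for v in value]
    else:
        text = str(value).strip()
        if text.lower() in _EMPTY_MARKERS:
            return []
        items = None
        if text.startswith("["):
            # The model may have answered with a JSON array.
            try:
                loaded = json.loads(text)
                if isinstance(loaded, list):
                    items = [str(v) for v in loaded]
            except json.JSONDecodeError:
                items = None
        if items is None:
            # One entity per line if the answer is multi-line, otherwise split on commas.
            separator = "\n" if "\n" in text else ","
            items = text.split(separator)
    cleaned = [item.strip().lstrip("-*• ").strip() for item in items]
    return [item for item in cleaned if item and item.lower() not in _EMPTY_MARKERS]

print(to_entity_list("Satya Nadella, Sundar Pichai"))
print(to_entity_list("- Seattle, Washington\n- Mountain View, California"))
```

Note the limitation of the comma fallback: a single-line answer such as "Seattle, Washington" is split into two items, which is one reason to prefer line-separated or JSON output when locations are involved.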

## Entity Relationship Extraction

Let's extract not just entities but also relationships between them.

In [None]:
print_step("Entity Relationship Extraction", "Finding connections between entities")

class EntityRelationshipExtraction(dspy.Signature):
    """Extract entities and their relationships from text."""
    text = dspy.InputField(desc="The input text to analyze")
    entities = dspy.OutputField(desc="List of all entities with their types")
    relationships = dspy.OutputField(desc="List of relationships between entities (format: 'entity1 RELATION entity2')")
    key_facts = dspy.OutputField(desc="Key factual statements extracted from the text")

# Create relationship extractor
relationship_extractor = dspy.ChainOfThought(EntityRelationshipExtraction)

# Test with relationship-rich text
relationship_text = """
Tesla, founded by Elon Musk, is headquartered in Austin, Texas. The company went public in 2010 
and is listed on NASDAQ. Musk also serves as CEO of SpaceX, which is based in Hawthorne, California. 
SpaceX was established in 2002 and has contracts with NASA for space missions.
"""

result = relationship_extractor(text=relationship_text)

print_result(
    f"Text: {relationship_text}\n\n"
    f"Reasoning: {result.reasoning}\n\n"
    f"Entities: {result.entities}\n\n"
    f"Relationships: {result.relationships}\n\n"
    f"Key Facts: {result.key_facts}",
    "Entity Relationship Extraction"
)
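
The `relationships` field asks for lines shaped like `entity1 RELATION entity2`. As a sketch of how such lines could feed a small knowledge graph, the code below parses them into triples and groups them by subject. It assumes the relation is a single upper-case token (for example `FOUNDED` or `HEADQUARTERED_IN`); `parse_triples` and `build_graph` are hypothetical helpers, and lines that do not fit the pattern are skipped.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Assumption: RELATION is one upper-case token, optionally with underscores.
_TRIPLE_PATTERN = re.compile(r"^(?P<subj>.+?)\s+(?P<rel>[A-Z][A-Z_]+)\s+(?P<obj>.+)$")

def parse_triples(lines: List[str]) -> List[Triple]:
    """Turn 'entity1 RELATION entity2' lines into (subject, relation, object) triples."""
    triples = []
    for line in lines:
        match = _TRIPLE_PATTERN.match(line.strip().lstrip("-*• ").strip())
        if match:
            triples.append((match["subj"], match["rel"], match["obj"]))
    return triples

def build_graph(triples: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Group outgoing (relation, object) edges by subject."""
    graph = defaultdict(list)
    for subj, rel, obj in triples:
        graph[subj].append((rel, obj))
    return dict(graph)

# Hand-written example lines, not real model output.
sample_relationships = [
    "Elon Musk FOUNDED Tesla",
    "Tesla HEADQUARTERED_IN Austin, Texas",
    "Elon Musk CEO_OF SpaceX",
    "not a relationship",
]
triples = parse_triples(sample_relationships)
graph = build_graph(triples)
for subject, edges in graph.items():
    print(subject, "->", edges)
```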

## Domain-Specific Entity Extraction

Let's create extractors for two specific domains: medical and financial texts.

In [None]:
print_step("Domain-Specific Entity Extraction", "Medical and financial entity extraction")

class MedicalEntityExtraction(dspy.Signature):
    """Extract medical entities from clinical or health-related text."""
    text = dspy.InputField(desc="Medical or health-related text")
    conditions = dspy.OutputField(desc="Medical conditions, diseases, or symptoms mentioned")
    medications = dspy.OutputField(desc="Medications, drugs, or treatments mentioned")
    procedures = dspy.OutputField(desc="Medical procedures or tests mentioned")
    anatomy = dspy.OutputField(desc="Body parts, organs, or anatomical references")
    dosages = dspy.OutputField(desc="Dosage information or medical measurements")

class FinancialEntityExtraction(dspy.Signature):
    """Extract financial entities from business or financial text."""
    text = dspy.InputField(desc="Financial or business text")
    companies = dspy.OutputField(desc="Company names and stock symbols")
    currencies = dspy.OutputField(desc="Currency amounts and types")
    financial_instruments = dspy.OutputField(desc="Stocks, bonds, derivatives, etc.")
    financial_metrics = dspy.OutputField(desc="Revenue, profit, ratios, percentages")
    market_terms = dspy.OutputField(desc="Financial terms and jargon")

# Create domain-specific extractors
medical_extractor = dspy.ChainOfThought(MedicalEntityExtraction)
financial_extractor = dspy.ChainOfThought(FinancialEntityExtraction)

# Test medical extraction
medical_text = """
Patient presents with acute myocardial infarction. Administered 100mg aspirin and 5mg 
metoprolol. EKG shows ST elevation in leads II, III, aVF. Cardiac catheterization 
scheduled. Troponin levels elevated at 15.2 ng/mL. Blood pressure 140/90 mmHg.
"""

medical_result = medical_extractor(text=medical_text)

print_result(
    f"Medical Text: {medical_text}\n\n"
    f"Conditions: {medical_result.conditions}\n"
    f"Medications: {medical_result.medications}\n"
    f"Procedures: {medical_result.procedures}\n"
    f"Anatomy: {medical_result.anatomy}\n"
    f"Dosages: {medical_result.dosages}",
    "Medical Entity Extraction"
)

# Test financial extraction
financial_text = """
Apple Inc. (AAPL) reported Q4 revenue of $119.58 billion, up 8% year-over-year. 
The company's gross margin improved to 45.96%. Tesla (TSLA) stock fell 3.2% after 
missing delivery targets. Bitcoin dropped below $45,000 amid regulatory concerns. 
The S&P 500 index closed at 4,750 points.
"""

financial_result = financial_extractor(text=financial_text)

print_result(
    f"Financial Text: {financial_text}\n\n"
    f"Companies: {financial_result.companies}\n"
    f"Currencies: {financial_result.currencies}\n"
    f"Financial Instruments: {financial_result.financial_instruments}\n"
    f"Financial Metrics: {financial_result.financial_metrics}\n"
    f"Market Terms: {financial_result.market_terms}",
    "Financial Entity Extraction"
)
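
In domains like medicine, numeric values are easy to cross-check with plain rules. The sketch below finds measurements in the source text with a regular expression and reports any that the model's `dosages` output does not mention. The unit list is a small illustrative assumption rather than a clinical vocabulary, and `find_measurements` and `missed_measurements` are hypothetical helpers.

```python
import re
from typing import List, Tuple

# Illustrative units only; longer units are listed first.
_UNITS = ["ng/mL", "mmHg", "mcg", "mg", "mL"]
_UNIT_PATTERN = "|".join(re.escape(unit) for unit in _UNITS)
_MEASUREMENT = re.compile(r"(\d+(?:\.\d+)?(?:/\d+)?)\s*(" + _UNIT_PATTERN + r")\b")

def find_measurements(text: str) -> List[Tuple[str, str]]:
    """Return (value, unit) pairs found in the text, in order of appearance."""
    return _MEASUREMENT.findall(text)

def _mentions(value: str, output: str) -> bool:
    # Match the value as a whole number, not as part of a longer number.
    pattern = r"(?<![\d./])" + re.escape(value) + r"(?!\.?\d)"
    return re.search(pattern, output) is not None

def missed_measurements(text: str, dosages_output: str) -> List[Tuple[str, str]]:
    """Measurements present in the text but absent from the model's dosages field."""
    return [(v, u) for v, u in find_measurements(text) if not _mentions(v, dosages_output)]

source_text = (
    "Administered 100mg aspirin and 5mg metoprolol. "
    "Troponin levels elevated at 15.2 ng/mL. Blood pressure 140/90 mmHg."
)
# Hand-written stand-in for a model's `dosages` field.
model_dosages = "100mg aspirin, 15.2 ng/mL troponin"

found = find_measurements(source_text)
missed = missed_measurements(source_text, model_dosages)
print("Found:", found)
print("Missed by model:", missed)
```

A check like this does not replace the extractor; it only flags values worth a second look.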

## Entity Validation and Linking

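Let's add validation, confidence scoring, and entity linking to our entity extraction. Note that both the confidence scores and the linked information are generated by the language model itself; no external knowledge base is queried in this notebook.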

In [None]:
print_step("Entity Validation and Linking", "Adding confidence scores and validation")

class EntityValidation(dspy.Signature):
    """Validate and score extracted entities for accuracy and confidence."""
    text = dspy.InputField(desc="Original text")
    extracted_entities = dspy.InputField(desc="List of extracted entities")
    validated_entities = dspy.OutputField(desc="Validated entities with confidence scores (format: 'entity (TYPE) - confidence%')")
    potential_errors = dspy.OutputField(desc="Entities that might be incorrectly identified")
    missing_entities = dspy.OutputField(desc="Important entities that might have been missed")

class EntityLinking(dspy.Signature):
    """Link entities to external knowledge bases or provide additional context."""
    entity = dspy.InputField(desc="Entity to link")
    entity_type = dspy.InputField(desc="Type of entity (person, organization, etc.)")
    context = dspy.InputField(desc="Context from original text")
    linked_info = dspy.OutputField(desc="Additional information about the entity")
    disambiguation = dspy.OutputField(desc="Clarification if entity could refer to multiple things")
    confidence = dspy.OutputField(desc="Confidence in the linking (high/medium/low)")

# Create validation and linking modules
entity_validator = dspy.ChainOfThought(EntityValidation)
entity_linker = dspy.Predict(EntityLinking)

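# Entity extraction pipeline: extract, then validate.
# The linker is attached as a submodule but is not called in forward(); it is used separately below.
class CompleteEntityExtractor(dspy.Module):
    """Entity extraction pipeline with validation; linking is available via self.linker."""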
    
    def __init__(self):
        super().__init__()
        self.extractor = dspy.ChainOfThought(StructuredEntityExtraction)
        self.validator = dspy.ChainOfThought(EntityValidation)
        self.linker = dspy.Predict(EntityLinking)
    
    def forward(self, text):
        # Step 1: Extract entities
        extraction_result = self.extractor(text=text)
        
        # Step 2: Combine person, organization, and location entities with type tags.
        # Note: splitting on commas also splits values such as "Seattle, Washington".
        all_entities = []
        typed_fields = [
            (extraction_result.persons, "PERSON"),
            (extraction_result.organizations, "ORG"),
            (extraction_result.locations, "LOC"),
        ]
        for value, label in typed_fields:
            # Skip empty fields and a literal "None" answer (case-insensitive).
            if value and value.strip().lower() != "none":
                all_entities.extend(f"{item.strip()} ({label})" for item in value.split(",") if item.strip())
        
        # Step 3: Validate entities
        validation_result = self.validator(
            text=text,
            extracted_entities=", ".join(all_entities)
        )
        
        return dspy.Prediction(
            text=text,
            extraction_reasoning=extraction_result.reasoning,
            persons=extraction_result.persons,
            organizations=extraction_result.organizations,
            locations=extraction_result.locations,
            dates=extraction_result.dates,
            all_entities=all_entities,
            validation_reasoning=validation_result.reasoning,
            validated_entities=validation_result.validated_entities,
            potential_errors=validation_result.potential_errors,
            missing_entities=validation_result.missing_entities
        )

# Test complete pipeline
complete_extractor = CompleteEntityExtractor()

test_text = """
Barack Obama, the 44th President of the United States, was born in Hawaii. He served 
from 2009 to 2017 and was succeeded by Donald Trump. Obama graduated from Harvard Law School 
and taught at the University of Chicago. His wife Michelle Obama is also a Harvard graduate.
"""

complete_result = complete_extractor(text=test_text)

print_result(
    f"Text: {test_text}\n\n"
    f"Extraction Reasoning: {complete_result.extraction_reasoning}\n\n"
    f"Extracted Entities:\n"
    f"- Persons: {complete_result.persons}\n"
    f"- Organizations: {complete_result.organizations}\n"
    f"- Locations: {complete_result.locations}\n"
    f"- Dates: {complete_result.dates}\n\n"
    f"Validation Reasoning: {complete_result.validation_reasoning}\n\n"
    f"Validated Entities: {complete_result.validated_entities}\n"
    f"Potential Errors: {complete_result.potential_errors}\n"
    f"Missing Entities: {complete_result.missing_entities}",
    "Complete Entity Extraction Pipeline"
)

# Test entity linking for specific entities
print("\n🔗 Entity Linking Examples:")
print("-" * 50)

entities_to_link = [
    ("Barack Obama", "PERSON", "44th President of the United States"),
    ("Harvard Law School", "ORGANIZATION", "prestigious law school"),
    ("Hawaii", "LOCATION", "birth place of Barack Obama")
]

for entity, entity_type, context in entities_to_link:
    link_result = entity_linker(
        entity=entity,
        entity_type=entity_type,
        context=context
    )
    
    print(f"\nEntity: {entity} ({entity_type})")
    print(f"Linked Info: {link_result.linked_info}")
    print(f"Disambiguation: {link_result.disambiguation}")
    print(f"Confidence: {link_result.confidence}")
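
The `validated_entities` field is requested in the form `entity (TYPE) - confidence%`. Here is a sketch of parsing that text and keeping only entries above a threshold, assuming the model follows the format. `ScoredEntity`, `parse_scored_entities`, and `keep_confident` are hypothetical names, and the 0.8 threshold is arbitrary. The scores are self-reported by the model, so they are best treated as a rough ranking signal rather than calibrated probabilities.

```python
import re
from typing import List, NamedTuple

class ScoredEntity(NamedTuple):
    text: str
    entity_type: str
    confidence: float  # 0.0 to 1.0, as self-reported by the model

# Matches "entity (TYPE) - 95%".
_SCORED_PATTERN = re.compile(r"(.+?)\s*\(([A-Za-z_]+)\)\s*-\s*(\d+(?:\.\d+)?)\s*%")

def parse_scored_entities(raw: str) -> List[ScoredEntity]:
    """Parse 'entity (TYPE) - confidence%' entries into ScoredEntity records."""
    results = []
    for match in _SCORED_PATTERN.finditer(raw):
        text = match.group(1).strip(" ,;\n-")
        if text:
            confidence = float(match.group(3)) / 100
            results.append(ScoredEntity(text, match.group(2).upper(), confidence))
    return results

def keep_confident(entities: List[ScoredEntity], threshold: float = 0.8) -> List[ScoredEntity]:
    """Keep entities whose self-reported confidence meets the threshold."""
    return [entity for entity in entities if entity.confidence >= threshold]

# Hand-written example string, not real model output.
sample_validated = "Barack Obama (PERSON) - 98%, Harvard Law School (ORG) - 95%, Hawaii (LOC) - 60%"
scored = parse_scored_entities(sample_validated)
confident = keep_confident(scored)
print([entity.text for entity in confident])
```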

## Entity Extraction Evaluation

Let's evaluate our entity extraction system against ground truth data.

In [None]:
print_step("Entity Extraction Evaluation", "Comparing against ground truth data")

def evaluate_entity_extraction(extractor, test_examples):
    """Evaluate entity extraction performance."""
    
    def normalize_entity_list(entities_str):
        """Normalize entity string to list."""
        if not entities_str or entities_str.lower() in ['none', 'n/a', 'null']:
            return set()
        return set([e.strip().lower() for e in entities_str.split(',') if e.strip()])
    
    total_precision = 0
    total_recall = 0
    total_f1 = 0
    num_examples = len(test_examples)
    
    print("Evaluation Results:")
    print("=" * 60)
    
    for i, example in enumerate(test_examples, 1):
        # Extract entities
        result = extractor(text=example.text)
        
        # Get predicted entities (combine all types)
        predicted = set()
        for field in ['persons', 'organizations', 'locations']:
            if hasattr(result, field):
                predicted.update(normalize_entity_list(getattr(result, field)))
        
        # Get ground truth entities
        expected_entities = [e.split(' (')[0].lower() for e in example.entities]
        expected = set(expected_entities)
        
        # Calculate metrics
        true_positives = len(predicted.intersection(expected))
        false_positives = len(predicted - expected)
        false_negatives = len(expected - predicted)
        
        precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
        recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
        
        total_precision += precision
        total_recall += recall
        total_f1 += f1
        
        print(f"\nExample {i}:")
        print(f"Text: {example.text[:100]}...")
        print(f"Expected: {expected}")
        print(f"Predicted: {predicted}")
        print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}")
    
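    # Calculate averages (guard against an empty test set)
    avg_precision = total_precision / num_examples if num_examples else 0.0
    avg_recall = total_recall / num_examples if num_examples else 0.0
    avg_f1 = total_f1 / num_examples if num_examples else 0.0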
    
    print("\n" + "=" * 60)
    print(f"Average Precision: {avg_precision:.3f}")
    print(f"Average Recall: {avg_recall:.3f}")
    print(f"Average F1 Score: {avg_f1:.3f}")
    
    return avg_precision, avg_recall, avg_f1

# Get test data
test_examples = get_sample_entity_data()

print(f"Testing on {len(test_examples)} examples...")

# Evaluate our structured extractor
precision, recall, f1 = evaluate_entity_extraction(structured_extractor, test_examples)
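
The evaluation above averages per-example scores (a macro average), so a short document with one entity counts as much as a long one with twenty. A complementary view is the micro average, which pools true positives, false positives, and false negatives across all examples before computing the metrics. Here is a sketch; `micro_prf` is a hypothetical helper, and the sample sets are made up for illustration.

```python
from typing import Iterable, Set, Tuple

def micro_prf(pairs: Iterable[Tuple[Set[str], Set[str]]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over (predicted, expected) set pairs."""
    tp = fp = fn = 0
    for predicted, expected in pairs:
        tp += len(predicted & expected)
        fp += len(predicted - expected)
        fn += len(expected - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Illustrative data: (predicted, expected) per document.
sample_pairs = [
    ({"apple", "steve jobs"}, {"apple", "steve jobs", "cupertino"}),  # tp=2, fp=0, fn=1
    ({"microsoft", "seattle", "ai"}, {"microsoft"}),                   # tp=1, fp=2, fn=0
]
micro_precision, micro_recall, micro_f1 = micro_prf(sample_pairs)
print(f"Micro P: {micro_precision:.3f}, R: {micro_recall:.3f}, F1: {micro_f1:.3f}")
```

Reporting both averages makes it easier to see whether a few entity-heavy documents dominate the results.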

## Batch Processing and Output Formatting

Let's create utilities for processing multiple documents and formatting outputs.

In [None]:
print_step("Batch Processing", "Processing multiple documents efficiently")

def batch_entity_extraction(extractor, texts: List[str], output_format="json"):
    """Process multiple texts and return formatted results."""
    results = []
    
    for i, text in enumerate(texts):
        print(f"Processing document {i+1}/{len(texts)}...")
        
        try:
            result = extractor(text=text)
            
            if output_format == "json":
                formatted_result = {
                    "document_id": i+1,
                    "text": text[:200] + "..." if len(text) > 200 else text,
                    "entities": {
                        "persons": result.persons.split(", ") if result.persons and result.persons != "None" else [],
                        "organizations": result.organizations.split(", ") if result.organizations and result.organizations != "None" else [],
                        "locations": result.locations.split(", ") if result.locations and result.locations != "None" else [],
                        "dates": result.dates.split(", ") if result.dates and result.dates != "None" else []
                    }
                }
            else:  # Flat rows suitable for CSV export
                formatted_result = {
                    "document_id": i+1,
                    "persons": result.persons or "",
                    "organizations": result.organizations or "",
                    "locations": result.locations or "",
                    "dates": result.dates or ""
                }
            
            results.append(formatted_result)
            
        except Exception as e:
            print(f"Error processing document {i+1}: {e}")
            results.append({"document_id": i+1, "error": str(e)})
    
    return results

# Test batch processing
batch_texts = [
    "Microsoft announced that CEO Satya Nadella will speak at the Tech Conference in San Francisco next month.",
    "The World Health Organization (WHO) reported new cases in Geneva, Switzerland, according to Dr. Maria Santos.",
    "Amazon's Jeff Bezos stepped down as CEO in July 2021, passing leadership to Andy Jassy in Seattle.",
    "The United Nations Security Council met in New York to discuss climate change initiatives with Secretary-General António Guterres."
]

batch_results = batch_entity_extraction(structured_extractor, batch_texts, output_format="json")

print_result(f"Processed {len(batch_texts)} documents", "Batch Processing Complete")

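# Display results (documents that failed carry an "error" key instead of entities)
for result in batch_results:
    print(f"\nDocument {result['document_id']}:")
    if "error" in result:
        print(f"Error: {result['error']}")
    else:
        print(f"Text: {result['text']}")
        print(f"Entities: {json.dumps(result['entities'], indent=2)}")
    print("-" * 50)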

# Export to JSON file (json and os are already imported in the setup cell)
output_file = "../../data/entity_extraction_results.json"
os.makedirs(os.path.dirname(output_file), exist_ok=True)

with open(output_file, 'w') as f:
    json.dump(batch_results, f, indent=2)

print(f"\nResults exported to {output_file}")
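
The cell above only writes JSON. As a sketch of the CSV side, the code below flattens JSON-style batch results into CSV text with the standard `csv` module. It assumes each result has the `document_id` and `entities` keys produced by `batch_entity_extraction` with `output_format="json"`. `results_to_csv` is a hypothetical helper, and joining multiple values with "; " is an arbitrary choice.

```python
import csv
import io
from typing import Dict, List

FIELDS = ["document_id", "persons", "organizations", "locations", "dates"]

def results_to_csv(results: List[Dict]) -> str:
    """Flatten JSON-style batch results into CSV text, skipping failed documents."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for result in results:
        if "error" in result:
            continue  # failed documents have no entities to export
        entities = result["entities"]
        row = {"document_id": result["document_id"]}
        for field in FIELDS[1:]:
            row[field] = "; ".join(entities.get(field, []))
        writer.writerow(row)
    return buffer.getvalue()

# Hand-written stand-in for batch results.
sample_results = [
    {
        "document_id": 1,
        "text": "Microsoft announced that CEO Satya Nadella will speak...",
        "entities": {
            "persons": ["Satya Nadella"],
            "organizations": ["Microsoft"],
            "locations": ["San Francisco"],
            "dates": [],
        },
    },
    {"document_id": 2, "error": "timeout"},
]
csv_text = results_to_csv(sample_results)
print(csv_text)
```

Writing to an in-memory buffer keeps the example free of file paths; passing an open file handle to `csv.DictWriter` instead would write the same rows to disk.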

## Interactive Entity Extraction Demo

Let's create an interactive demo where you can input text and see entity extraction results.

In [None]:
print_step("Interactive Demo", "Try entity extraction with your own text!")

def interactive_entity_extraction(text: str):
    """Interactive entity extraction with detailed output."""
    print(f"\n🔍 Analyzing: {text}")
    print("=" * 80)
    
    # Basic extraction
    basic_result = basic_extractor(text=text)
    print(f"\n📝 Basic Extraction:")
    print(f"Entities: {basic_result.entities}")
    
    # Structured extraction
    structured_result = structured_extractor(text=text)
    print(f"\n🏗️ Structured Extraction:")
    print(f"Persons: {structured_result.persons}")
    print(f"Organizations: {structured_result.organizations}")
    print(f"Locations: {structured_result.locations}")
    print(f"Dates: {structured_result.dates}")
    print(f"Other: {structured_result.other_entities}")
    
    # Relationship extraction
    relationship_result = relationship_extractor(text=text)
    print(f"\n🔗 Relationships:")
    print(f"Relationships: {relationship_result.relationships}")
    print(f"Key Facts: {relationship_result.key_facts}")
    
    print("\n" + "=" * 80)

# Demo texts
demo_texts = [
    "Apple Inc. announced that Tim Cook will meet with President Biden in Washington D.C. next Tuesday to discuss trade policies.",
    "The research paper by Dr. Jane Smith from Stanford University was published in Nature magazine on March 15, 2024.",
    "Netflix signed a $500 million deal with Sony Pictures Entertainment to stream movies exclusively starting January 2025."
]

print("Demo Entity Extraction Results:")
for i, demo_text in enumerate(demo_texts, 1):
    print(f"\n🎯 Demo {i}:")
    interactive_entity_extraction(demo_text)

# You can test with your own text by uncommenting and modifying the line below:
# interactive_entity_extraction("Your text here...")

## Summary

In this notebook, we explored comprehensive entity extraction techniques using DSPy:

### Key Components Covered:

1. **Basic Entity Extraction** - Simple NER with entity identification
2. **Structured Extraction** - Categorizing entities by type (person, organization, location, date)
3. **Relationship Extraction** - Finding connections and relationships between entities
4. **Domain-Specific Extraction** - Specialized extractors for medical and financial texts
5. **Entity Validation** - LM-generated confidence scores and error detection
6. **Entity Linking** - LM-generated context and disambiguation for individual entities
7. **Evaluation Framework** - Measuring extraction performance against ground truth
8. **Batch Processing** - Sequential processing of multiple documents with per-document error handling
9. **Output Formatting** - Structured JSON output and flat rows suitable for CSV

### Advanced Features:

- **Chain of Thought Reasoning** - Explicit reasoning steps for complex extractions
- **Multi-Step Pipelines** - Combining extraction, validation, and linking
- **Domain Adaptation** - Specialized signatures for different text types
- **Quality Metrics** - Precision, recall, and F1 scoring
- **Error Handling** - Per-document exception handling so one failure does not stop a batch

### Practical Applications:

- **Information Extraction** - Extracting structured data from unstructured text
- **Knowledge Graph Construction** - Building entity-relationship networks
- **Document Processing** - Automated analysis of large document collections
- **Content Analysis** - Understanding key entities in text content
- **Data Enrichment** - Adding structured metadata to text data

### Next Steps:

- **Optimization** - Use DSPy optimizers to tune prompts and few-shot examples for better extraction accuracy
- **Custom Domains** - Create specialized extractors for specific industries
- **Real-time Processing** - Implement streaming entity extraction
- **Integration** - Connect with databases and knowledge graphs
- **Visualization** - Create entity relationship visualizations
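
As a starting point for the optimizer step above, DSPy optimizers are driven by a metric callable that takes `(example, pred, trace=None)` and returns a score. Here is a sketch of an F1-based metric for the `persons` field. It assumes each training example carries a gold `persons` attribute, which is not something this notebook's dataset is known to provide, and it uses `SimpleNamespace` objects as stand-ins so the function can be tried without a language model.

```python
from types import SimpleNamespace

def _as_set(value):
    """Normalize a list or comma-separated string into a lowercase set."""
    if not value:
        return set()
    items = value if isinstance(value, (list, tuple, set)) else str(value).split(",")
    return {str(item).strip().lower() for item in items if str(item).strip()}

def entity_f1_metric(example, pred, trace=None):
    """F1 between gold and predicted person names, in the DSPy metric call shape."""
    expected = _as_set(getattr(example, "persons", None))
    predicted = _as_set(getattr(pred, "persons", None))
    if not expected and not predicted:
        return 1.0  # nothing to find and nothing predicted
    true_positives = len(expected & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)

# Stand-ins for a gold example and a model prediction (hypothetical data).
gold = SimpleNamespace(persons=["Barack Obama", "Michelle Obama"])
guess = SimpleNamespace(persons="Barack Obama, Donald Trump")
score = entity_f1_metric(gold, guess)
print(f"F1: {score:.2f}")
```

A function like this could then be passed as the `metric` argument to an optimizer such as `dspy.BootstrapFewShot`, together with a set of labeled examples.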

Together, these pieces show how DSPy signatures and modules can be composed into an entity extraction pipeline, and how the same pipeline can be adapted to a new domain by swapping the signature rather than rewriting the surrounding code.