# Document Metadata Extraction with Fenic

This notebook demonstrates how to extract structured metadata from unstructured document text using fenic's semantic operations. We'll explore two different approaches for metadata extraction:

1. **ExtractSchema** - Fenic's native schema definition approach
2. **Pydantic Models** - Using familiar Python class syntax

Both methods use a large language model to parse document text and extract structured information from diverse document types, including research papers, product announcements, meeting notes, news articles, and technical documentation.

## What You'll Learn

- Setting up fenic sessions with semantic capabilities
- Creating DataFrames from document data
- Extracting structured metadata using AI-powered operations
- Comparing different extraction approaches
- Understanding the trade-offs between schema types

Let's dive in!

## Setup and Configuration

First, we need to import the necessary libraries and configure our fenic session. We'll set up:

- **Type hints** from `typing` for better code documentation
- **Pydantic** for our second extraction approach 
- **Fenic** as our main DataFrame library

For the session configuration, we're setting up semantic capabilities using OpenAI's GPT-4o-mini model with specific rate limits:
- **RPM (Requests Per Minute)**: 500 requests
- **TPM (Tokens Per Minute)**: 200,000 tokens

These limits cap how fast the session sends requests, which keeps our document extraction tasks within the API quota.
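To get a feel for what these two limits allow, here is a rough back-of-the-envelope sketch in plain Python. It assumes one request per document, and the per-request token counts are illustrative guesses rather than measurements; `max_requests_per_minute` is a hypothetical helper, not part of fenic.

```python
# Rough throughput estimate for the rate limits configured in this notebook.
RPM = 500        # requests per minute, from the session config
TPM = 200_000    # tokens per minute, from the session config


def max_requests_per_minute(tokens_per_request: int, rpm: int = RPM, tpm: int = TPM) -> int:
    """Return how many requests per minute fit under both limits."""
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    # Whichever limit is reached first determines the throughput.
    return min(rpm, tpm // tokens_per_request)


# Assumed token cost per request (prompt + completion); hypothetical figures.
for tokens in (200, 400, 1_000):
    print(tokens, max_requests_per_minute(tokens))
```

With these settings the two limits meet at 400 tokens per request (200,000 / 500). Smaller requests are bounded by RPM, larger ones by TPM.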

In [None]:
from typing import Literal
from pydantic import BaseModel, Field
import fenic as fc

# Configure session with semantic capabilities
config = fc.SessionConfig(
    app_name="document_extraction",
    semantic=fc.SemanticConfig(
        language_models={
            "mini": fc.OpenAIModelConfig(
                model_name="gpt-4o-mini",
                rpm=500,
                tpm=200_000,
            )
        }
    ),
)

# Create session
session = fc.Session.get_or_create(config)

## Sample Document Data

Now let's create our test dataset. The five documents below each represent a different document type, so we can see how extraction behaves across varied content:

1. **Research Paper** (`doc_001`) - Academic study on neural networks and climate prediction
2. **Product Announcement** (`doc_002`) - CloudSync Pro file synchronization software launch
3. **Meeting Notes** (`doc_003`) - Engineering team standup with decisions and action items
4. **News Article** (`doc_004`) - Breaking news about a data breach incident
5. **Technical Documentation** (`doc_005`) - API reference for an authentication service

Each document contains different types of metadata (titles, dates, keywords, etc.) that we'll extract automatically. After creating the DataFrame, we'll inspect the basic properties including document IDs and text lengths to understand our data better.

In [None]:
documents_data = [
    {
        "id": "doc_001",
        "text": "Neural Networks for Climate Prediction: A Comprehensive Study. Published March 15, 2024. This research presents a novel deep learning approach for predicting climate patterns using multi-layered neural networks. Our methodology combines satellite imagery data with ground-based sensor readings to achieve 94% accuracy in temperature forecasting. The study was conducted over 18 months across 12 research stations. Keywords: machine learning, climate modeling, neural networks, environmental science."
    },
    {
        "id": "doc_002",
        "text": "Introducing CloudSync Pro - Next-Generation File Synchronization. Release Date: January 8, 2024. CloudSync Pro revolutionizes how teams collaborate with real-time file synchronization across unlimited devices. Features include end-to-end encryption, automatic conflict resolution, and integration with over 50 productivity tools. Pricing starts at $12/month per user with enterprise discounts available. Contact our sales team for a personalized demo."
    },
    {
        "id": "doc_003",
        "text": "Weekly Engineering Standup - December 4, 2023. Attendees: Sarah Chen (Lead), Marcus Rodriguez (Backend), Lisa Park (Frontend), James Wilson (DevOps). Key decisions: Migration to Kubernetes approved for Q1 2024, new CI/CD pipeline reduces deployment time by 60%, API rate limiting implementation scheduled for next sprint. Action items: Sarah to finalize container specifications, Marcus to document database migration plan."
    },
    {
        "id": "doc_004",
        "text": "Breaking: Major Data Breach Affects 2.3 Million Users. December 12, 2023 - TechCorp announced today that unauthorized access to customer databases occurred between November 28-30, 2023. Compromised data includes email addresses, encrypted passwords, and partial payment information. The company has implemented additional security measures and is offering free credit monitoring to affected users. Stock prices dropped 8% in after-hours trading."
    },
    {
        "id": "doc_005",
        "text": "API Reference: Authentication Service v2.1. Last updated: February 20, 2024. The Authentication Service provides secure user login and session management for distributed applications. Supports OAuth 2.0, SAML, and multi-factor authentication. Rate limits: 1000 requests per hour for standard accounts, 10000 for premium. Available endpoints include /auth/login, /auth/refresh, /auth/logout. Response format: JSON with standardized error codes."
    }
]

# Create DataFrame
docs_df = session.create_dataframe(documents_data)

# Inspect each document's ID and text length
docs_df.select("id", fc.text.length("text").alias("text_length")).show()


## Method 1: ExtractSchema Approach

Our first approach uses **fenic's native ExtractSchema system**. This method provides several advantages:

✅ **Complex Data Types**: Supports lists, nested structures, and rich type definitions  
✅ **Type Safety**: Built-in type checking and validation  
✅ **Native Integration**: Seamlessly works with fenic's DataFrame operations  

### Schema Definition

We'll define a schema with 5 key metadata fields:
- **title**: The main subject or title of the document
- **document_type**: Category classification (research paper, news, etc.)
- **date**: Any temporal information mentioned
- **keywords**: List of important terms and topics *(note: this will be a proper list)*
- **summary**: Brief one-sentence overview

### Extraction Process

The extraction happens in three steps:
1. **Apply semantic extraction** using `fc.semantic.extract()` with our schema
2. **Flatten the results** by accessing nested metadata fields  
3. **Display results** to see the structured output

Notice how the `keywords` field will return an actual Python list - this is a key advantage of ExtractSchema!
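To make the "flatten" step concrete, here is a plain-Python sketch of the same idea on a single row, independent of fenic. The `flatten_struct` helper and the row values are hypothetical, hand-written placeholders, not real model output.

```python
from typing import Any


def flatten_struct(row: dict[str, Any], struct_key: str) -> dict[str, Any]:
    """Lift the fields of the nested dict at `struct_key` to the top level."""
    flat = {key: value for key, value in row.items() if key != struct_key}
    flat.update(row[struct_key])
    return flat


# Hand-written placeholder row shaped like an extraction result.
row = {
    "id": "doc_000",
    "metadata": {
        "title": "Example title",
        "keywords": ["alpha", "beta"],
    },
}

flat = flatten_struct(row, "metadata")
print(flat)
```

After flattening, `title` and `keywords` sit next to `id` as top-level fields, and `keywords` is still a list.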

In [None]:
# Define schema for document metadata extraction
doc_metadata_schema = fc.ExtractSchema([
    fc.ExtractSchemaField(
        name="title",
        data_type=fc.StringType,
        description="The main title or subject of the document"
    ),
    fc.ExtractSchemaField(
        name="document_type",
        data_type=fc.StringType,
        description="Type of document (e.g., research paper, product announcement, meeting notes, news article, technical documentation)"
    ),
    fc.ExtractSchemaField(
        name="date",
        data_type=fc.StringType,
        description="Any date mentioned in the document (publication date, meeting date, etc.)"
    ),
    fc.ExtractSchemaField(
        name="keywords",
        data_type=fc.ExtractSchemaList(element_type=fc.StringType),
        description="List of key topics, technologies, or important terms mentioned in the document"
    ),
    fc.ExtractSchemaField(
        name="summary",
        data_type=fc.StringType,
        description="Brief one-sentence summary of the document's main purpose or content"
    )
])

# Apply extraction using ExtractSchema
extracted_df = docs_df.select(
    "id",
    fc.semantic.extract("text", doc_metadata_schema).alias("metadata")
)

# Flatten the extracted metadata into separate columns
extract_schema_results = extracted_df.select(
    "id",
    extracted_df.metadata.title.alias("title"),
    extracted_df.metadata.document_type.alias("document_type"),
    extracted_df.metadata.date.alias("date"),
    extracted_df.metadata.keywords.alias("keywords"),
    extracted_df.metadata.summary.alias("summary")
)

extract_schema_results.show()


## Method 2: Pydantic Model Approach

Now let's explore the **Pydantic model approach**, which offers a different set of advantages:

✅ **Familiar Syntax**: Uses standard Python class definitions that most developers know  
✅ **Rich Validation**: Leverages Pydantic's powerful validation system  
✅ **Literal Types**: Can constrain values to specific options (great for categories)  
⚠️ **Simple Types Only**: Limited to basic types (str, int, float, bool, Literal)  

### Key Differences from ExtractSchema

The most important limitation to understand: **complex types like lists must be represented as strings**. This means:
- ✅ ExtractSchema: `keywords: List[str]` → `["AI", "machine learning", "climate"]`
- ⚠️ Pydantic: `keywords: str` → `"AI, machine learning, climate"`

### Model Definition

Our Pydantic model includes:
- **Literal constraints** for `document_type` to ensure valid categories
- **Field descriptions** for better LLM understanding
- **String representation** for the keywords list (comma-separated)
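As a small illustration of what a `Literal` constraint encodes, the sketch below uses only the standard library to list the allowed categories and check a value against them. Pydantic performs this kind of check itself during validation; `is_valid_document_type` is a hypothetical helper shown only to make the constraint visible.

```python
from typing import Literal, get_args

# The same categories used for `document_type` in this notebook.
DocumentType = Literal[
    "research paper",
    "product announcement",
    "meeting notes",
    "news article",
    "technical documentation",
    "other",
]

# get_args returns the literal values as a tuple.
ALLOWED_TYPES = get_args(DocumentType)


def is_valid_document_type(value: str) -> bool:
    """Return True if `value` is one of the allowed categories."""
    return value in ALLOWED_TYPES


print(is_valid_document_type("news article"))
print(is_valid_document_type("blog post"))
```

The `"other"` option gives the model a valid fallback when a document fits none of the named categories.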

### Why Use Pydantic?

Despite the limitations, Pydantic models are excellent when you need:
- **Strict validation** with constrained choices
- **Familiar Python patterns** for your team
- **Simple data structures** without nested complexity

Let's see how it performs on the same documents!

In [None]:
# Define Pydantic model for document metadata
# Note: Pydantic models for extraction support simple data types (str, int, float, bool, Literal)
# Complex types like lists must be represented as strings (e.g., comma-separated values)
class DocumentMetadata(BaseModel):
    """Pydantic model for document metadata extraction."""
    title: str = Field(..., description="The main title or subject of the document")
    document_type: Literal["research paper", "product announcement", "meeting notes", "news article", "technical documentation", "other"] = Field(..., description="Type of document")
    date: str = Field(..., description="Any date mentioned in the document (publication date, meeting date, etc.)")
    keywords: str = Field(..., description="Comma-separated list of key topics, technologies, or important terms mentioned in the document")
    summary: str = Field(..., description="Brief one-sentence summary of the document's main purpose or content")

# Apply extraction using Pydantic model
pydantic_extracted_df = docs_df.select(
    "id",
    fc.semantic.extract("text", DocumentMetadata).alias("metadata")
)

# Flatten the extracted metadata into separate columns
pydantic_results = pydantic_extracted_df.select(
    "id",
    pydantic_extracted_df.metadata.title.alias("title"),
    pydantic_extracted_df.metadata.document_type.alias("document_type"),
    pydantic_extracted_df.metadata.date.alias("date"),
    pydantic_extracted_df.metadata.keywords.alias("keywords"),
    pydantic_extracted_df.metadata.summary.alias("summary")
)

pydantic_results.show()


## Cleanup and Conclusion

Finally, the last cell stops our fenic session to free up resources.

### Key Takeaways

After running both approaches, you should notice several important differences:

🎯 **When to use ExtractSchema:**
- Need complex data structures (lists, nested objects)
- Want type-safe operations with native list handling
- Prefer fenic's native schema system
- Working with hierarchical or deeply structured data

🐍 **When to use Pydantic Models:**
- Need strict validation with constrained choices (Literal types)
- Team is familiar with Pydantic patterns
- Working with simple, flat data structures
- Want familiar Python class syntax

### The Keywords Difference

Pay special attention to how each method handles the `keywords` field:
- **ExtractSchema**: Returns `["keyword1", "keyword2", "keyword3"]` (actual list)
- **Pydantic**: Returns `"keyword1, keyword2, keyword3"` (comma-separated string)

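This difference matters for downstream processing: the list can be used as-is, while the string has to be split before individual keywords can be filtered or counted. Keep it in mind when choosing an approach for your use case.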
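As a minimal sketch of that downstream handling, the hypothetical helper below turns a comma-separated keyword string into a list. It assumes no keyword contains a comma itself; such a keyword would be split incorrectly.

```python
def split_keywords(raw: str) -> list[str]:
    """Split a comma-separated keyword string into a list of trimmed keywords."""
    return [part.strip() for part in raw.split(",") if part.strip()]


keywords = split_keywords("keyword1, keyword2, keyword3")
print(keywords)
```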

Both methods successfully extract structured metadata from unstructured text, but they excel in different scenarios. Choose based on your data complexity needs and team preferences!

In [None]:
# Clean up
session.stop()