# Information Extraction with PyDI

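This example demonstrates how to extract structured information from unstructured text with PyDI: `RegexExtractor` with built-in patterns, `CodeExtractor` with custom functions, `ExtractorPipeline` for chaining extractors, and `RuleDiscovery` for automatic field discovery.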

In [1]:
import pandas as pd
from PyDI.informationextraction import RegexExtractor, CodeExtractor, ExtractorPipeline, built_in_rules

NLTK not available. Advanced tokenization features will be limited.


## Part 1: RegexExtractor with Built-in Rules

Let's start with RegexExtractor using PyDI's built-in regex patterns for common extraction tasks.

### Sample Product Data

First, let's create sample product data with mixed text content containing various extractable patterns.

In [2]:
# Sample product data with mixed text content
product_data = {
    'product_description': [
        "MacBook Pro 16-inch for $2,499.99 - contact sales@apple.com for deals",
        "iPhone 13 Pro 128GB - $999.00, visit https://apple.com/iphone",
        "Samsung TV 55-inch 4K UHD €679.99 - warranty: 2 years, weight: 15.5kg",
        "Dell XPS Laptop - Call us at (555) 123-4567 or email support@dell.com",
        "Gaming PC with RGB lighting - 50% off this week! Was $1,200 now $600",
        "Sony Headphones WH-1000XM4 - Premium noise cancelling for €299.95"
    ]
}

product_df = pd.DataFrame(product_data)
print("Product Data:")
print(product_df.to_string(index=False))
print("\n" + "="*80 + "\n")

Product Data:
                                                  product_description
MacBook Pro 16-inch for $2,499.99 - contact sales@apple.com for deals
        iPhone 13 Pro 128GB - $999.00, visit https://apple.com/iphone
Samsung TV 55-inch 4K UHD €679.99 - warranty: 2 years, weight: 15.5kg
Dell XPS Laptop - Call us at (555) 123-4567 or email support@dell.com
 Gaming PC with RGB lighting - 50% off this week! Was $1,200 now $600
    Sony Headphones WH-1000XM4 - Premium noise cancelling for €299.95




### Explore Built-in Rules

In [3]:
# Explore relevant built-in rules categories
print("Available rule categories:")
for category in built_in_rules.keys():
    print(f"  - {category}: {list(built_in_rules[category].keys())[:3]}...")  # Show first 3 for brevity

print("\n" + "="*60 + "\n")

# Show specific patterns we'll use for product extraction
product_patterns = {
    'price_symbol': built_in_rules['money']['price_symbol'],
    'email': built_in_rules['contact']['email'],
    'url': built_in_rules['contact']['url'],
    'phone_us': built_in_rules['contact']['phone_us'],
    'percent': built_in_rules['money']['percent']
}

print("Product extraction patterns:")
for name, pattern_info in product_patterns.items():
    print(f"  {name}: {pattern_info['pattern']}")
    if 'postprocess' in pattern_info:
        print(f"    → postprocess: {pattern_info['postprocess']}")

Available rule categories:
  - identifiers: ['uuid4', 'imdb_id', 'isbn_13']...
  - contact: ['email', 'url', 'domain']...
  - money: ['price_symbol', 'price_iso', 'percent']...
  - dates: ['iso_date', 'us_date', 'year']...
  - geo: ['postal_us', 'postal_uk', 'postal_de']...
  - measurements: ['length_metric', 'length_imperial', 'weight_metric']...
  - product: ['model_number', 'color', 'size_clothing']...
  - media: ['runtime', 'age_rating', 'language']...
  - company: ['legal_suffix', 'employee_count', 'registration_number_generic']...
  - key_value: ['colon_separator', 'equals_separator', 'dash_separator']...


Product extraction patterns:
  price_symbol: [\$£€¥₹][\d,]+(?:\.\d{2})?
    → postprocess: parse_money
  email: \b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b
  url: https?://[^\s<>\"]+
  phone_us: \b(?:\+?1[-.\s]?)?\(?([0-9]{3})\)?[-.\s]?([0-9]{3})[-.\s]?([0-9]{4})\b
  percent: \b\d+(?:\.\d+)?%
    → postprocess: parse_number
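
One detail of the `phone_us` pattern is worth checking before relying on it: it begins with `\b`, and a word boundary cannot match between a space and `(`. The sketch below uses only the standard `re` module with the pattern copied from the listing above; the `normalized` form built from the capture groups is an illustrative choice, not something PyDI is known to do.

```python
import re

# Pattern copied verbatim from the phone_us entry printed above
PHONE_US = r"\b(?:\+?1[-.\s]?)?\(?([0-9]{3})\)?[-.\s]?([0-9]{3})[-.\s]?([0-9]{4})\b"

text = "Dell XPS Laptop - Call us at (555) 123-4567 or email support@dell.com"
match = re.search(PHONE_US, text)

# The leading \b cannot match between " " and "(", so the match starts at "5"
# and the opening parenthesis is left out of the full match.
raw = match.group(0)

# The three capture groups hold only digits, so joining them gives a clean value
# (this normalization is an assumption made for illustration).
normalized = "-".join(match.groups())

print(raw)
print(normalized)
```

Reading the capture groups rather than the full match avoids carrying the unbalanced parenthesis into downstream data.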


### Apply RegexExtractor to Product Data

In [4]:
# Define extraction rules using built-in patterns
product_rules = {
    'price': {
        'source_column': 'product_description',
        'pattern': [
            built_in_rules['money']['price_symbol']['pattern'],
            built_in_rules['money']['price_iso']['pattern']
        ],
        'postprocess': 'parse_money'
    },
    'email': {
        'source_column': 'product_description',
        'pattern': built_in_rules['contact']['email']['pattern']
    },
    'url': {
        'source_column': 'product_description', 
        'pattern': built_in_rules['contact']['url']['pattern']
    },
    'phone': {
        'source_column': 'product_description',
        'pattern': built_in_rules['contact']['phone_us']['pattern']
    },
    'weight': {
        'source_column': 'product_description',
        'pattern': built_in_rules['measurements']['weight_metric']['pattern']
    },
    'percent_discount': {
        'source_column': 'product_description',
        'pattern': built_in_rules['money']['percent']['pattern'], 
        'postprocess': 'parse_number'
    }
}

# Create and apply RegexExtractor
product_extractor = RegexExtractor(product_rules, debug=True, default_source="product_description")
product_result = product_extractor.extract(product_df)

print("Product Extraction Results:")
product_result

Product Extraction Results:


| | product_description | price | email | url | phone | weight | percent_discount |
|---|---|---|---|---|---|---|---|
| 0 | MacBook Pro 16-inch for $2,499.99 - contact sa... | 2499.99 | sales@apple.com | | | | |
| 1 | iPhone 13 Pro 128GB - $999.00, visit https://a... | 999.0 | | https://apple.com/iphone | | | |
| 2 | Samsung TV 55-inch 4K UHD €679.99 - warranty: ... | 679.99 | | | | 15.5kg | |
| 3 | Dell XPS Laptop - Call us at (555) 123-4567 or... | | support@dell.com | | 555) 123-4567 | | |
| 4 | Gaming PC with RGB lighting - 50% off this wee... | 1.2 | | | | | 0.5 |
| 5 | Sony Headphones WH-1000XM4 - Premium noise can... | 299.95 | | | | | |
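
Row 4 of the results above reports a price of 1.2 for a description that reads `$1,200`, which suggests the thousands separator was read as a decimal point somewhere along the way. How `parse_money` works internally is not shown in this example, so the following is only a sketch of a US-style normalizer that could be applied to the raw matches; `parse_us_money` is a hypothetical helper and not part of PyDI.

```python
import re

def parse_us_money(text):
    """Convert a matched price such as '$1,200' or '€679.99' to a float.

    Assumes US-style formatting: ',' groups thousands and '.' marks decimals.
    Returns None when the text contains no digits.
    """
    # Drop the currency symbol, commas, and anything else that is not a digit or '.'
    digits = re.sub(r"[^\d.]", "", text)
    return float(digits) if digits else None

print(parse_us_money("$1,200"))
print(parse_us_money("$2,499.99"))
print(parse_us_money("€679.99"))
```

This helper would be wrong for European-style amounts such as `1.200,50`, so the locale assumption needs to match the data.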


In [5]:
# Show product extraction statistics
print(f"Extracted {len(product_rules)} fields from {len(product_df)} product descriptions\n")

print("Product Extraction Summary:")
for field in product_rules.keys():
    if field in product_result.columns:
        non_null = product_result[field].notna().sum()
        rate = non_null / len(product_result) * 100
        print(f"  {field}: {non_null}/{len(product_result)} ({rate:.1f}%)")

Extracted 6 fields from 6 product descriptions

Product Extraction Summary:
  price: 5/6 (83.3%)
  email: 2/6 (33.3%)
  url: 1/6 (16.7%)
  phone: 1/6 (16.7%)
  weight: 1/6 (16.7%)
  percent_discount: 1/6 (16.7%)


## Part 2: CodeExtractor with Custom Functions

Now let's demonstrate CodeExtractor with custom functions for more complex extraction logic.

### Sample Movie Data

Let's create movie data that requires both regex patterns and custom logic.

In [6]:
# Sample movie data
movie_data = {
    'title': [
        'The Shawshank Redemption',
        'The Godfather', 
        'The Dark Knight',
        'Pulp Fiction',
        '12 Angry Men'
    ],
    'info': [
        'Drama • 1994 • 142 min • R • IMDB: tt0111161',
        'Crime/Drama • 1972 • 175 min • R • IMDB: tt0068646', 
        'Action/Crime • 2008 • 152 min • PG-13 • IMDB: tt0468569',
        'Crime/Drama • 1994 • 154 min • R • IMDB: tt0110912',
        'Drama • 1957 • 96 min • Approved • IMDB: tt0050083'
    ],
    'description': [
        'Two imprisoned men bond over a number of years.',
        'The aging patriarch of an organized crime dynasty.',
        'When the menace known as the Joker emerges from his mysterious past.',
        'The lives of two mob hitmen, a boxer, a gangster and his wife.',
        'A jury holdout attempts to prevent a miscarriage of justice.'
    ]
}

movie_df = pd.DataFrame(movie_data)
print("Movie Data:")
movie_df

Movie Data:


| | title | info | description |
|---|---|---|---|
| 0 | The Shawshank Redemption | Drama • 1994 • 142 min • R • IMDB: tt0111161 | Two imprisoned men bond over a number of years. |
| 1 | The Godfather | Crime/Drama • 1972 • 175 min • R • IMDB: tt006... | The aging patriarch of an organized crime dyna... |
| 2 | The Dark Knight | Action/Crime • 2008 • 152 min • PG-13 • IMDB: ... | When the menace known as the Joker emerges fro... |
| 3 | Pulp Fiction | Crime/Drama • 1994 • 154 min • R • IMDB: tt011... | The lives of two mob hitmen, a boxer, a gangst... |
| 4 | 12 Angry Men | Drama • 1957 • 96 min • Approved • IMDB: tt005... | A jury holdout attempts to prevent a miscarria... |


### Built-in Patterns for Movies

First, let's use RegexExtractor for structured patterns in movie data.

In [7]:
# Show built-in patterns relevant to movies
movie_patterns = {
    'year': built_in_rules['dates']['year'],
    'imdb_id': built_in_rules['identifiers']['imdb_id'], 
    'runtime': built_in_rules['media']['runtime'],
    'age_rating': built_in_rules['media']['age_rating']
}

print("Movie extraction patterns:")
for name, pattern_info in movie_patterns.items():
    print(f"  {name}: {pattern_info['pattern']}")

print("\n" + "="*60)

Movie extraction patterns:
  year: \b(19|20)\d{2}\b
  imdb_id: \btt\d{7,8}\b
  runtime: \b(\d{1,3})\s*(?:min|minutes?|hrs?|hours?)\b
  age_rating: \b(?:G|PG|PG-13|R|NC-17|NR|Not Rated)\b



In [8]:
# Create RegexExtractor for movie structured patterns
movie_regex_rules = {
    'year': built_in_rules['dates']['year'],
    'imdb_id': built_in_rules['identifiers']['imdb_id'],
    'runtime_minutes': built_in_rules['media']['runtime'],
    'mpaa_rating': built_in_rules['media']['age_rating']
}

movie_regex_extractor = RegexExtractor(movie_regex_rules, default_source='info', debug=True)
movie_regex_result = movie_regex_extractor.extract(movie_df)

print("Movie Regex Extraction Results:")
movie_regex_result

Movie Regex Extraction Results:


| | title | info | description | year | imdb_id | runtime_minutes | mpaa_rating |
|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | Drama • 1994 • 142 min • R • IMDB: tt0111161 | Two imprisoned men bond over a number of years. | 1994.0 | tt0111161 | 142.0 | R |
| 1 | The Godfather | Crime/Drama • 1972 • 175 min • R • IMDB: tt006... | The aging patriarch of an organized crime dyna... | 1972.0 | tt0068646 | 175.0 | R |
| 2 | The Dark Knight | Action/Crime • 2008 • 152 min • PG-13 • IMDB: ... | When the menace known as the Joker emerges fro... | 2008.0 | tt0468569 | 152.0 | PG |
| 3 | Pulp Fiction | Crime/Drama • 1994 • 154 min • R • IMDB: tt011... | The lives of two mob hitmen, a boxer, a gangst... | 1994.0 | tt0110912 | 154.0 | R |
| 4 | 12 Angry Men | Drama • 1957 • 96 min • Approved • IMDB: tt005... | A jury holdout attempts to prevent a miscarria... | 1957.0 | tt0050083 | 96.0 | |
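
In the results above, The Dark Knight is rated `PG` although its `info` text says `PG-13`. Regex alternation tries branches from left to right, and `PG` succeeds first because a word boundary sits between `G` and `-`. The sketch below reproduces this with the standard `re` module; `REORDERED` is a hypothetical variant that lists longer alternatives first and is not a PyDI built-in.

```python
import re

# Pattern as printed for age_rating above
ORIGINAL = r"\b(?:G|PG|PG-13|R|NC-17|NR|Not Rated)\b"

# Hypothetical variant: same alternatives, longest first
REORDERED = r"\b(?:PG-13|NC-17|Not Rated|PG|NR|G|R)\b"

info = "Action/Crime • 2008 • 152 min • PG-13 • IMDB: tt0468569"

first = re.search(ORIGINAL, info).group(0)
second = re.search(REORDERED, info).group(0)

print(first)   # shorter branch wins with the original ordering
print(second)  # longer branch wins once it is listed first
```

Neither variant matches `Approved`, which is why 12 Angry Men has no rating in the table.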


### Custom Functions for Complex Logic

Now let's define custom functions for extraction logic that can't be handled by regex alone.

In [9]:
# Custom functions for complex logic that requires row-level processing
def classify_genre(row):
    """Classify primary genre from info."""
    info = str(row['info']).lower()
    if 'action' in info:
        return 'Action'
    elif 'crime' in info:
        return 'Crime'
    elif 'drama' in info:
        return 'Drama'
    else:
        return 'Other'

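def is_classic(row):
    """Determine if movie is a classic (pre-1980)."""
    # Use the year already extracted by the regex extractor.
    # pd.notna() guards against missing values: NaN is truthy, so a bare
    # `if row['year']` check would not filter it out.
    if 'year' in row and pd.notna(row['year']):
        return row['year'] < 1980
    return False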

def create_slug(row):
    """Create URL slug from title."""
    import re
    title = str(row['title']).lower()
    # Remove special characters and replace spaces with hyphens
    slug = re.sub(r'[^\w\s-]', '', title)
    slug = re.sub(r'[\s_-]+', '-', slug)
    return slug.strip('-')

def count_words(description):
    """Count words in description."""
    return len(str(description).split()) if description else 0

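def calculate_runtime_category(row):
    """Categorize runtime length."""
    # pd.notna() skips missing runtimes; a truthiness check would let NaN
    # through, and NaN fails both comparisons below and would end up as 'Long'.
    if 'runtime_minutes' in row and pd.notna(row['runtime_minutes']):
        runtime = row['runtime_minutes']
        if runtime < 90:
            return 'Short'
        elif runtime < 150:
            return 'Standard'
        else:
            return 'Long'
    return None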

print("Defined 5 custom extraction functions:")
custom_functions = {
    'primary_genre': classify_genre,
    'is_classic_film': is_classic,
    'url_slug': create_slug,
    'word_count': count_words,
    'runtime_category': calculate_runtime_category
}

for name, func in custom_functions.items():
    print(f"  - {name}: {func.__doc__.strip()}")

Defined 5 custom extraction functions:
  - primary_genre: Classify primary genre from info.
  - is_classic_film: Determine if movie is a classic (pre-1980).
  - url_slug: Create URL slug from title.
  - word_count: Count words in description.
  - runtime_category: Categorize runtime length.
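
Functions that read columns produced by an earlier extractor have to cope with missing values. In pandas, a failed extraction typically shows up as NaN, and NaN is truthy in Python, so a truthiness check alone does not catch it. The sketch below contrasts the two guards on a bare value; `naive_category` and `runtime_category` are hypothetical stand-ins that reuse the same 90 and 150 minute cut-offs.

```python
import pandas as pd

def naive_category(runtime):
    """Truthiness guard: NaN slips through and falls into the last branch."""
    if runtime:
        if runtime < 90:
            return "Short"
        elif runtime < 150:
            return "Standard"
        else:
            return "Long"
    return None

def runtime_category(runtime):
    """NaN-safe guard using pd.isna()."""
    if pd.isna(runtime):
        return None
    if runtime < 90:
        return "Short"
    if runtime < 150:
        return "Standard"
    return "Long"

missing = float("nan")
print(bool(missing))              # NaN counts as truthy
print(naive_category(missing))    # comparisons with NaN are False, so 'Long'
print(runtime_category(missing))  # None
```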


In [10]:
# Create CodeExtractor for custom functions
movie_code_extractor = CodeExtractor(custom_functions, debug=True, default_source="description")

# Apply CodeExtractor to the result from RegexExtractor
print("Applying custom functions...")
movie_final_result = movie_code_extractor.extract(movie_regex_result, source_column='description')

print("\nFinal Movie Extraction Results:")
movie_final_result

Applying custom functions...

Final Movie Extraction Results:


| | title | info | description | year | imdb_id | runtime_minutes | mpaa_rating | primary_genre | is_classic_film | url_slug | word_count | runtime_category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | Drama • 1994 • 142 min • R • IMDB: tt0111161 | Two imprisoned men bond over a number of years. | 1994.0 | tt0111161 | 142.0 | R | Drama | False | the-shawshank-redemption | 9 | Standard |
| 1 | The Godfather | Crime/Drama • 1972 • 175 min • R • IMDB: tt006... | The aging patriarch of an organized crime dyna... | 1972.0 | tt0068646 | 175.0 | R | Crime | True | the-godfather | 8 | Long |
| 2 | The Dark Knight | Action/Crime • 2008 • 152 min • PG-13 • IMDB: ... | When the menace known as the Joker emerges fro... | 2008.0 | tt0468569 | 152.0 | PG | Action | False | the-dark-knight | 12 | Long |
| 3 | Pulp Fiction | Crime/Drama • 1994 • 154 min • R • IMDB: tt011... | The lives of two mob hitmen, a boxer, a gangst... | 1994.0 | tt0110912 | 154.0 | R | Crime | False | pulp-fiction | 13 | Long |
| 4 | 12 Angry Men | Drama • 1957 • 96 min • Approved • IMDB: tt005... | A jury holdout attempts to prevent a miscarria... | 1957.0 | tt0050083 | 96.0 | | Drama | True | 12-angry-men | 10 | Standard |


## Part 3: Combined Pipeline Approach

Let's demonstrate using ExtractorPipeline to chain multiple extractors cleanly.

In [11]:
# Create a comprehensive extraction pipeline
print("Creating extraction pipeline...")

# Pipeline for movies
movie_pipeline = ExtractorPipeline([movie_regex_extractor, movie_code_extractor])
movie_pipeline_result = movie_pipeline.run(movie_df, debug=True)

print("\nMovie Pipeline Results:")
movie_pipeline_result

Creating extraction pipeline...

Movie Pipeline Results:


| | title | info | description | year | imdb_id | runtime_minutes | mpaa_rating | primary_genre | is_classic_film | url_slug | word_count | runtime_category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | Drama • 1994 • 142 min • R • IMDB: tt0111161 | Two imprisoned men bond over a number of years. | 1994.0 | tt0111161 | 142.0 | R | Drama | False | the-shawshank-redemption | 9 | Standard |
| 1 | The Godfather | Crime/Drama • 1972 • 175 min • R • IMDB: tt006... | The aging patriarch of an organized crime dyna... | 1972.0 | tt0068646 | 175.0 | R | Crime | True | the-godfather | 8 | Long |
| 2 | The Dark Knight | Action/Crime • 2008 • 152 min • PG-13 • IMDB: ... | When the menace known as the Joker emerges fro... | 2008.0 | tt0468569 | 152.0 | PG | Action | False | the-dark-knight | 12 | Long |
| 3 | Pulp Fiction | Crime/Drama • 1994 • 154 min • R • IMDB: tt011... | The lives of two mob hitmen, a boxer, a gangst... | 1994.0 | tt0110912 | 154.0 | R | Crime | False | pulp-fiction | 13 | Long |
| 4 | 12 Angry Men | Drama • 1957 • 96 min • Approved • IMDB: tt005... | A jury holdout attempts to prevent a miscarria... | 1957.0 | tt0050083 | 96.0 | | Drama | True | 12-angry-men | 10 | Standard |
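
Conceptually, a pipeline threads one DataFrame through a sequence of steps, so later steps can read columns that earlier steps created. That is why the regex extractor comes first here: `is_classic` depends on the `year` column. The sketch below illustrates the idea with plain pandas; `MiniPipeline`, `add_year`, and `add_is_classic` are hypothetical names and say nothing about how `ExtractorPipeline` is implemented.

```python
import pandas as pd

def add_year(df):
    """Step 1: pull a four-digit year out of the 'info' column."""
    out = df.copy()
    out["year"] = out["info"].str.extract(r"\b((?:19|20)\d{2})\b", expand=False).astype(float)
    return out

def add_is_classic(df):
    """Step 2: derive a flag from the column created by step 1."""
    out = df.copy()
    out["is_classic_film"] = out["year"] < 1980
    return out

class MiniPipeline:
    """Apply each step to the output of the previous one."""

    def __init__(self, steps):
        self.steps = list(steps)

    def run(self, df):
        for step in self.steps:
            df = step(df)
        return df

movies = pd.DataFrame({
    "title": ["The Godfather", "The Dark Knight"],
    "info": ["Crime/Drama • 1972 • 175 min • R", "Action/Crime • 2008 • 152 min • PG-13"],
})

result = MiniPipeline([add_year, add_is_classic]).run(movies)
print(result[["title", "year", "is_classic_film"]])
```

Swapping the two steps in this sketch raises a `KeyError`, because `year` does not exist yet when the flag is computed.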


## Part 4: Results Analysis and Comparison

Let's analyze and compare the extraction results from both datasets.

In [12]:
# Movie extraction summary
print("=== MOVIE EXTRACTION SUMMARY ===")
print(f"Total movies: {len(movie_pipeline_result)}")
print(f"Average runtime: {movie_pipeline_result['runtime_minutes'].mean():.1f} minutes")
print(f"Classic films: {movie_pipeline_result['is_classic_film'].sum()}/{len(movie_pipeline_result)}")
print(f"Most common rating: {movie_pipeline_result['mpaa_rating'].mode().iloc[0] if not movie_pipeline_result['mpaa_rating'].mode().empty else 'N/A'}")
print(f"Average description length: {movie_pipeline_result['word_count'].mean():.1f} words")

print("\nRuntime categories:")
runtime_counts = movie_pipeline_result['runtime_category'].value_counts()
for category, count in runtime_counts.items():
    print(f"  {category}: {count}")

print("\n" + "="*60 + "\n")

# Product extraction summary  
print("=== PRODUCT EXTRACTION SUMMARY ===")
print(f"Total products: {len(product_result)}")
non_null_prices = product_result['price'].notna().sum()
if non_null_prices > 0:
    print(f"Average price: ${product_result['price'].mean():.2f}")
    print(f"Price range: ${product_result['price'].min():.2f} - ${product_result['price'].max():.2f}")
print(f"Products with contact info: {(product_result['email'].notna() | product_result['phone'].notna()).sum()}")
print(f"Products with URLs: {product_result['url'].notna().sum()}")
discount_products = product_result['percent_discount'].notna().sum()
print(f"Products with discounts: {discount_products}")

=== MOVIE EXTRACTION SUMMARY ===
Total movies: 5
Average runtime: 143.8 minutes
Classic films: 2/5
Most common rating: R
Average description length: 10.4 words

Runtime categories:
  Long: 3
  Standard: 2


=== PRODUCT EXTRACTION SUMMARY ===
Total products: 6
Average price: $896.03
Price range: $1.20 - $2499.99
Products with contact info: 2
Products with URLs: 1
Products with discounts: 1


## Part 5: Auto-Discovery with RuleDiscovery

PyDI's RuleDiscovery automatically discovers useful extraction fields by applying all built-in rules and keeping those whose coverage meets a threshold. This makes it well suited to exploring unfamiliar datasets.

### When to Use Auto-Discovery

**RuleDiscovery** is useful for:

1. **Exploratory Data Analysis**: When you don't know what structured information might be in your text
2. **Data Profiling**: Understanding the "extractability" of a dataset
3. **Schema Discovery**: Finding common patterns across diverse text data
4. **Quick Prototyping**: Rapidly identifying extraction opportunities

**Best Practices**:
- Start with `coverage_threshold=0.1` to see all possibilities
- Use specific `categories` to focus on relevant patterns  
- Set `top_k` to limit results to most valuable fields
- Enable `debug=True` to save coverage statistics and samples
- Combine with manual extraction for complex cases

The discovered fields can then be refined using `RegexExtractor` and `CodeExtractor` for production pipelines.
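
To make the roles of `coverage_threshold` and `top_k` concrete, here is a minimal sketch of coverage-based selection using only `re` and pandas. It assumes coverage means the fraction of rows with at least one match; the functions `coverage` and `select_fields` are hypothetical and are not PyDI's implementation. The patterns are the ones printed in Part 1.

```python
import re
import pandas as pd

PATTERNS = {
    "price_symbol": r"[\$£€¥₹][\d,]+(?:\.\d{2})?",
    "email": r"\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b",
    "url": r'https?://[^\s<>"]+',
    "percent": r"\b\d+(?:\.\d+)?%",
}

def coverage(series, patterns):
    """Fraction of rows in which each pattern matches at least once."""
    result = {}
    for name, pattern in patterns.items():
        rx = re.compile(pattern)
        result[name] = float(series.map(lambda s: bool(rx.search(s))).mean())
    return result

def select_fields(cov, threshold, top_k=None):
    """Keep fields at or above the threshold, highest coverage first."""
    ranked = sorted(cov.items(), key=lambda kv: kv[1], reverse=True)
    kept = [name for name, value in ranked if value >= threshold]
    return kept if top_k is None else kept[:top_k]

texts = pd.Series([
    "iPhone 14 Pro - $999.99 - Contact support@apple.com for technical issues",
    "Visit https://samsung.com or call (555) 123-4567 for Galaxy S23 deals",
    "MacBook Air M2 - €1299.99 - 13-inch display, weighs 1.24kg - 2 year warranty",
    "Gaming PC with RGB - 50% discount! Original price $1200, now just $600",
])

cov = coverage(texts, PATTERNS)
print(cov)
print(select_fields(cov, threshold=0.5))
print(select_fields(cov, threshold=0.25, top_k=2))
```

Coverage only measures how often a pattern fires, not whether the captured values are correct, so selected fields still deserve a manual look.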

In [13]:
from PyDI.informationextraction import RuleDiscovery, discover_fields

### Create Mixed Data for Discovery

Let's create a dataset with diverse text content to see what RuleDiscovery can find automatically.

In [14]:
# Create diverse data with various extractable patterns
discovery_data = {
    'mixed_text': [
        'iPhone 14 Pro - $999.99 - Contact support@apple.com for technical issues',
        'Visit https://samsung.com or call (555) 123-4567 for Galaxy S23 deals',
        'MacBook Air M2 - €1299.99 - 13-inch display, weighs 1.24kg - 2 year warranty',
        'Dell XPS 13 laptop - Call 1-800-DELL or email sales@dell.com for quotes',
        'Product SKU: XYZ-2024-RED available in red, blue, black colors',
        'Nintendo Switch OLED - $349.99 - Available at store locations since 2021',
        'Sony WH-1000XM4 headphones - Premium quality - €299.95 with 30-hour battery',
        'Gaming PC with RGB - 50% discount! Original price $1200, now just $600',
        'Free shipping on orders over £75.00 - UK customers only',
        'Tesla Model 3 - Starting at $35,000 - Visit tesla.com or call (650) 681-5000'
    ]
}

discovery_df = pd.DataFrame(discovery_data)
print("Mixed text data for discovery:")
print(discovery_df.to_string(index=False, max_colwidth=80))

Mixed text data for discovery:
                                                                  mixed_text
    iPhone 14 Pro - $999.99 - Contact support@apple.com for technical issues
       Visit https://samsung.com or call (555) 123-4567 for Galaxy S23 deals
MacBook Air M2 - €1299.99 - 13-inch display, weighs 1.24kg - 2 year warranty
     Dell XPS 13 laptop - Call 1-800-DELL or email sales@dell.com for quotes
              Product SKU: XYZ-2024-RED available in red, blue, black colors
    Nintendo Switch OLED - $349.99 - Available at store locations since 2021
 Sony WH-1000XM4 headphones - Premium quality - €299.95 with 30-hour battery
      Gaming PC with RGB - 50% discount! Original price $1200, now just $600
                     Free shipping on orders over £75.00 - UK customers only
Tesla Model 3 - Starting at $35,000 - Visit tesla.com or call (650) 681-5000


### Method 1: Using discover_fields() Function

The convenience function `discover_fields()` is the easiest way to get started with auto-discovery.

In [15]:
# Use discover_fields for automatic field discovery
discovered_result = discover_fields(
    discovery_df,
    source_column='mixed_text',
    categories=['money', 'contact', 'product', 'dates'],  # Focus on relevant categories
    coverage_threshold=0.25,  # Fields must appear in 25%+ of rows
    top_k=8,  # Limit to top 8 fields by coverage
    include_original=True,
    debug=True
)

print(f"Discovered {len(discovered_result.columns)-1} fields from mixed text:")
print("\nField coverage:")
for col in discovered_result.columns:
    if col != 'mixed_text':  # Skip original column
        non_null = discovered_result[col].notna().sum()
        coverage = non_null / len(discovered_result) * 100
        print(f"  {col}: {non_null}/{len(discovered_result)} ({coverage:.1f}%)")

print("\nDiscovered results:")
discovered_result

Discovered 3 fields from mixed text:

Field coverage:
  product__model_number: 10/10 (100.0%)
  money__price_symbol: 7/10 (70.0%)
  contact__domain: 4/10 (40.0%)

Discovered results:


| | mixed_text | product__model_number | money__price_symbol | contact__domain |
|---|---|---|---|---|
| 0 | iPhone 14 Pro - $999.99 - Contact support@appl... | Pro | 999.99 | apple.com |
| 1 | Visit https://samsung.com or call (555) 123-45... | Visit | | samsung.com |
| 2 | MacBook Air M2 - €1299.99 - 13-inch display, w... | MacBook | 1299.99 | |
| 3 | Dell XPS 13 laptop - Call 1-800-DELL or email ... | Dell | | dell.com |
| 4 | Product SKU: XYZ-2024-RED available in red, bl... | Product | | |
| 5 | Nintendo Switch OLED - $349.99 - Available at ... | Nintendo | 349.99 | |
| 6 | Sony WH-1000XM4 headphones - Premium quality -... | Sony | 299.95 | |
| 7 | Gaming PC with RGB - 50% discount! Original pr... | Gaming | 1200.0 | |
| 8 | Free shipping on orders over £75.00 - UK custo... | Free | 75.0 | |
| 9 | Tesla Model 3 - Starting at $35,000 - Visit te... | Tesla | 35.0 | tesla.com |
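
The `product__model_number` field reaches 100% coverage, yet its values (`Pro`, `Visit`, `Free`, ...) are ordinary words rather than model numbers. A cheap sanity check on the extracted values can flag this before a field is adopted. The sketch below assumes, purely as a heuristic, that a genuine model number contains at least one digit; the values are copied from the table above.

```python
import pandas as pd

# Values of product__model_number as shown in the discovered results
model_values = pd.Series([
    "Pro", "Visit", "MacBook", "Dell", "Product",
    "Nintendo", "Sony", "Gaming", "Free", "Tesla",
])

# Coverage: share of rows with any extracted value
coverage = float(model_values.notna().mean())

# Heuristic plausibility: share of extracted values containing a digit
digit_share = float(model_values.str.contains(r"\d").mean())

print(f"coverage={coverage:.0%}, values with a digit={digit_share:.0%}")
```

A field with high coverage and low plausibility is a candidate for a stricter hand-written rule rather than direct use.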


### Method 2: Using RuleDiscovery Class with Metadata

For more control and detailed insights, use the `RuleDiscovery` class directly.

In [16]:
# Create RuleDiscovery instance for advanced analysis
discovery = RuleDiscovery(
    debug=True, 
    out_dir="output/informationextraction/quickstart_detailed"
)

# Run discovery with broader categories and return metadata
quickstart_result2, metadata = discovery.extract_and_select(
    discovery_df,
    source_column='mixed_text',
    categories=['money', 'contact', 'product', 'dates', 'measurements'],
    coverage_threshold=0.20,  # Slightly lower threshold
    top_k=12,
    include_original=False,  # Just extracted fields
    return_meta=True
)

print("🔬 Advanced Discovery Analysis:")
print("="*50)
print(f"📊 Total patterns evaluated: {metadata['total_fields_evaluated']}")
print(f"✅ Fields selected: {len(metadata['selected_fields'])}")
print(f"📂 Categories analyzed: {metadata['categories_used']}")

print(f"\n🏆 Top Selected Fields by Coverage:")
for i, field in enumerate(metadata['selected_fields'][:8], 1):
    coverage = metadata['coverage'][field]
    print(f"  {i:2d}. {field}: {coverage:.3f}")

print(f"\n📈 Coverage Distribution:")
all_coverages = list(metadata['coverage'].values())
print(f"  • High coverage (>0.5): {sum(1 for c in all_coverages if c > 0.5)} fields")
print(f"  • Medium coverage (0.2-0.5): {sum(1 for c in all_coverages if 0.2 <= c <= 0.5)} fields") 
print(f"  • Low coverage (<0.2): {sum(1 for c in all_coverages if c < 0.2)} fields")

print(f"\n📋 Extracted Fields Preview:")
quickstart_result2.head(3)

🔬 Advanced Discovery Analysis:
📊 Total patterns evaluated: 29
✅ Fields selected: 7
📂 Categories analyzed: ['money', 'contact', 'product', 'dates', 'measurements']

🏆 Top Selected Fields by Coverage:
   1. product__model_number: 1.000
   2. money__price_symbol: 0.700
   3. contact__domain: 0.400
   4. contact__email: 0.200
   5. contact__phone_us: 0.200
   6. contact__twitter_handle: 0.200
   7. dates__year: 0.200

📈 Coverage Distribution:
  • High coverage (>0.5): 2 fields
  • Medium coverage (0.2-0.5): 5 fields
  • Low coverage (<0.2): 22 fields

📋 Extracted Fields Preview:


| | product__model_number | money__price_symbol | contact__domain | contact__email | contact__phone_us | contact__twitter_handle | dates__year |
|---|---|---|---|---|---|---|---|
| 0 | Pro | 999.99 | apple.com | support@apple.com | | apple | |
| 1 | Visit | | samsung.com | | 555) 123-4567 | | |
| 2 | MacBook | 1299.99 | | | | | |


### Method 3: Progressive Threshold Analysis

Let's explore how different coverage thresholds affect field selection. This helps you understand the trade-off between field quantity and quality.

In [17]:
# Progressive threshold analysis
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
threshold_analysis = {}

print("🎯 Progressive Threshold Analysis:")
print("="*60)
print("Threshold | Fields | Sample Fields")
print("-"*60)

for threshold in thresholds:
    result = discover_fields(
        discovery_df,
        source_column='mixed_text',
        categories=['money', 'contact', 'product'],  # Core business categories
        coverage_threshold=threshold,
        include_original=False,
        debug=False  # Disable debug for cleaner output
    )
    
    threshold_analysis[threshold] = len(result.columns)
    
    # Show sample field names for lower thresholds
    sample_fields = list(result.columns)[:3]
    sample_str = ', '.join(sample_fields)
    if len(result.columns) > 3:
        sample_str += f" (+{len(result.columns)-3} more)"
    
    print(f"   {threshold:0.1f}   |   {len(result.columns):2d}   | {sample_str}")

print(f"\n📊 Threshold Impact Summary:")
for t, count in threshold_analysis.items():
    bar = "█" * (count // 2) if count > 0 else ""
    print(f"  {t:0.1f}: {count:2d} fields {bar}")

print(f"\n💡 Recommendation: Use threshold 0.2-0.3 for balanced discovery")

🎯 Progressive Threshold Analysis:
Threshold | Fields | Sample Fields
------------------------------------------------------------
   0.1   |   10   | product__model_number, money__price_symbol, contact__domain (+7 more)
   0.2   |    6   | product__model_number, money__price_symbol, contact__domain (+3 more)
   0.3   |    3   | product__model_number, money__price_symbol, contact__domain
   0.4   |    3   | product__model_number, money__price_symbol, contact__domain
   0.5   |    2   | product__model_number, money__price_symbol
   0.6   |    2   | product__model_number, money__price_symbol

📊 Threshold Impact Summary:
  0.1: 10 fields █████
  0.2:  6 fields ███
  0.3:  3 fields █
  0.4:  3 fields █
  0.5:  2 fields █
  0.6:  2 fields █

💡 Recommendation: Use threshold 0.2-0.3 for balanced discovery


### Method 4: Category-Specific Discovery

Sometimes you want to focus on specific types of information. Let's explore what each category can extract from our data.

In [18]:
# Category-specific analysis
categories_to_analyze = ['contact', 'money', 'product', 'dates', 'measurements']

print("🏷️ Category-Specific Discovery Analysis:")
print("="*70)

category_results = {}

for category in categories_to_analyze:
    result = discover_fields(
        discovery_df,
        source_column='mixed_text',
        categories=[category],  # Single category
        coverage_threshold=0.1,  # Lower threshold to see all possibilities
        include_original=False,
        debug=False
    )
    
    category_results[category] = result
    
    print(f"\n📂 {category.upper()} Category:")
    print(f"   🔢 Found {len(result.columns)} extractable fields")
    
    if len(result.columns) > 0:
        # Show coverage for each field in this category
        print(f"   📊 Field coverage:")
        for col in result.columns:
            non_null = result[col].notna().sum()
            coverage = non_null / len(result) * 100
            print(f"      • {col.replace(f'{category}__', '')}: {coverage:.0f}%")
        
        # Show sample extracted values for the first field
        first_field = result.columns[0]
        sample_values = result[first_field].dropna().head(3).tolist()
        print(f"   💡 Sample values ({first_field.replace(f'{category}__', '')}): {sample_values}")
    else:
        print(f"   ⚠️ No fields found with sufficient coverage")

# Summary visualization
print(f"\n🌟 Category Summary:")
print("-"*40)
for category, result in category_results.items():
    field_count = len(result.columns)
    bar = "█" * (field_count * 2) if field_count > 0 else "░"
    print(f"{category:12s}: {field_count:2d} fields {bar}")

🏷️ Category-Specific Discovery Analysis:

📂 CONTACT Category:
   🔢 Found 5 extractable fields
   📊 Field coverage:
      • domain: 40%
      • email: 20%
      • phone_us: 20%
      • twitter_handle: 20%
      • url: 10%
   💡 Sample values (domain): ['apple.com', 'samsung.com', 'dell.com']

📂 MONEY Category:
   🔢 Found 2 extractable fields
   📊 Field coverage:
      • price_symbol: 70%
      • percent: 10%
   💡 Sample values (price_symbol): [999.99, 1299.99, 349.99]

📂 PRODUCT Category:
   🔢 Found 3 extractable fields
   📊 Field coverage:
      • model_number: 100%
      • color: 10%
      • warranty: 10%
   💡 Sample values (model_number): ['Pro', 'Visit', 'MacBook']

📂 DATES Category:
   🔢 Found 1 extractable fields
   📊 Field coverage:
      • year: 20%
   💡 Sample values (year): [2024.0, 2021.0]

📂 MEASUREMENTS Category:
   🔢 Found 1 extractable fields
   📊 Field coverage:
      • weight_metric: 10%
   💡 Sample values (weight_metric): ['1.24kg']

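🌟 Category Summary:
----------------------------------------
contact     :  5 fields ██████████
money       :  2 fields ████
product     :  3 fields ██████
dates       :  1 fields ██
measurements:  1 fields ██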

### Method 5: Complete Discovery Pipeline

Finally, let's run a comprehensive discovery that combines the best of all approaches: broad category coverage, balanced thresholds, and save everything for further analysis.

In [19]:
# Comprehensive discovery pipeline - Method 5
print("🚀 Running Comprehensive Discovery Pipeline")
print("="*80)

# First, we need to make sure we have quickstart_df defined
# (in case it wasn't defined earlier in the reorganization)
if 'quickstart_df' not in locals():
    quickstart_data = [
        "iPhone 14 Pro - Space Black - $999.99 - Available at apple.com",
        "Samsung Galaxy S23 Ultra for €1199.99 - Contact support@samsung.com",
        "MacBook Air M2 - Silver - Starting at $1199 (13-inch model)",
        "Dell XPS 13 laptop - Visit dell.com or call 1-800-DELL-CARE",
        "Nintendo Switch OLED - $349.99 - Red/Blue Joy-Con controllers",
        "Contact sales@bestbuy.com for bulk orders over $500.00",
        "Sony WH-1000XM4 headphones - Black - €299.99 - 30-hour battery",
        "Visit https://example.com for special deals until 2024-12-31",
        "Free shipping on orders over $50.00 - Call (555) 123-4567",
        "Product model XYZ-2024 in blue, white, black colors - 2 year warranty"
    ]
    quickstart_df = pd.DataFrame({'product_description': quickstart_data})
    print("📋 Created quickstart dataset with sample product descriptions")

# Step 1: Broad discovery for exploration
exploration_result = discover_fields(
    quickstart_df,
    source_column='product_description',
    coverage_threshold=0.15,  # Lower threshold for exploration
    top_k=20,  # More fields for analysis
    include_original=True,
    debug=True,
    out_dir="output/informationextraction/comprehensive_discovery"
)

print(f"🔍 Step 1 - Exploration: Discovered {len(exploration_result.columns)-1} fields")

# Step 2: Production-ready selection
production_result = discover_fields(
    quickstart_df, 
    source_column='product_description',
    categories=['money', 'contact', 'product', 'dates'],  # Business-relevant categories
    coverage_threshold=0.25,  # Higher threshold for production
    top_k=10,  # Manageable number of fields
    include_original=True,
    debug=False
)

print(f"✅ Step 2 - Production: Selected {len(production_result.columns)-1} high-quality fields")

# Step 3: Analysis and insights
print(f"\n📊 Final Analysis:")
print("-"*50)

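# Price analysis
# Note: this assumes money__price_symbol holds numeric values. The sample mixes
# USD and EUR amounts, so the "$" figures printed below pool both currencies.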
if 'money__price_symbol' in production_result.columns:
    prices = production_result['money__price_symbol'].dropna()
    if len(prices) > 0:
        print(f"💰 Price Insights:")
        print(f"   • Products with prices: {len(prices)}/{len(production_result)} ({len(prices)/len(production_result)*100:.0f}%)")
        print(f"   • Average price: ${prices.mean():.2f}")
        print(f"   • Price range: ${prices.min():.2f} - ${prices.max():.2f}")

# Contact analysis  
contact_fields = [col for col in production_result.columns if 'contact__' in col]
if contact_fields:
    print(f"📧 Contact Information:")
    for field in contact_fields:
        count = production_result[field].notna().sum()
        if count > 0:
            field_name = field.replace('contact__', '').replace('_', ' ').title()
            print(f"   • {field_name}: {count} entries")

# Product features
product_fields = [col for col in production_result.columns if 'product__' in col]
if product_fields:
    print(f"🏷️ Product Features:")
    for field in product_fields:
        count = production_result[field].notna().sum()
        if count > 0:
            field_name = field.replace('product__', '').replace('_', ' ').title()
            print(f"   • {field_name}: {count} entries")

# Save comprehensive results
import os
os.makedirs("output/informationextraction/comprehensive_discovery", exist_ok=True)
output_path = "output/informationextraction/comprehensive_discovery/final_results.csv"
production_result.to_csv(output_path, index=False)

print(f"\n💾 Results saved to: {output_path}")
print(f"📋 Final dataset shape: {production_result.shape}")
print(f"🎯 Ready for downstream analysis and modeling!")

# Show a sample of the final results
print(f"\n📋 Sample of Final Results:")
sample_cols = ['product_description'] + [col for col in production_result.columns if col != 'product_description'][:4]
production_result[sample_cols].head()

🚀 Running Comprehensive Discovery Pipeline
📋 Created quickstart dataset with sample product descriptions
🔍 Step 1 - Exploration: Discovered 8 fields
✅ Step 2 - Production: Selected 4 high-quality fields

📊 Final Analysis:
--------------------------------------------------
💰 Price Insights:
   • Products with prices: 7/10 (70%)
   • Average price: $656.99
   • Price range: $50.00 - $1199.99
📧 Contact Information:
   • Domain: 5 entries
🏷️ Product Features:
   • Model Number: 10 entries
   • Color: 5 entries

💾 Results saved to: output/informationextraction/comprehensive_discovery/final_results.csv
📋 Final dataset shape: (10, 5)
🎯 Ready for downstream analysis and modeling!

📋 Sample of Final Results:


| | product_description | product__model_number | money__price_symbol | contact__domain | product__color |
|---|---|---|---|---|---|
| 0 | iPhone 14 Pro - Space Black - $999.99 - Availa... | Pro | 999.99 | apple.com | black |
| 1 | Samsung Galaxy S23 Ultra for €1199.99 - Contac... | Samsung | 1199.99 | samsung.com | |
| 2 | MacBook Air M2 - Silver - Starting at $1199 (1... | MacBook | 1199.0 | | silver |
| 3 | Dell XPS 13 laptop - Visit dell.com or call 1-... | Dell | | dell.com | |
| 4 | Nintendo Switch OLED - $349.99 - Red/Blue Joy-... | Nintendo | 349.99 | | red |
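
The price insights above report a single average under a `$` label, although the sample descriptions contain both dollar and euro amounts. The following is a minimal sketch of grouping prices by currency symbol before aggregating. It uses only the standard library, and `PRICE_RE` and `prices_by_currency` are hypothetical names introduced for illustration; they are not part of PyDI and may not match how its built-in price rule behaves.

```python
import re
from collections import defaultdict
from statistics import mean

# Hypothetical pattern for illustration only; it is not PyDI's built-in price rule.
PRICE_RE = re.compile(r"([$€])\s?(\d[\d,]*(?:\.\d+)?)")

# A subset of the quickstart descriptions used in the cell above.
descriptions = [
    "iPhone 14 Pro - Space Black - $999.99 - Available at apple.com",
    "Samsung Galaxy S23 Ultra for €1199.99 - Contact support@samsung.com",
    "MacBook Air M2 - Silver - Starting at $1199 (13-inch model)",
    "Dell XPS 13 laptop - Visit dell.com or call 1-800-DELL-CARE",
    "Sony WH-1000XM4 headphones - Black - €299.99 - 30-hour battery",
    "Free shipping on orders over $50.00 - Call (555) 123-4567",
]

def prices_by_currency(texts):
    """Group the first price found in each text by its currency symbol."""
    grouped = defaultdict(list)
    for text in texts:
        match = PRICE_RE.search(text)
        if match:
            symbol, amount = match.groups()
            # Drop thousands separators before converting to float.
            grouped[symbol].append(float(amount.replace(",", "")))
    return dict(grouped)

by_currency = prices_by_currency(descriptions)
for symbol, amounts in by_currency.items():
    print(f"{symbol}: n={len(amounts)}, mean={mean(amounts):.2f}, "
          f"range={min(amounts):.2f}-{max(amounts):.2f}")
```

Reporting one summary per currency avoids averaging dollar and euro amounts together. Converting to a common currency would be an alternative, but it needs an exchange rate that the data does not provide.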


## Part 6: Complete AutoRules Quickstart

This part works through auto-discovery on more diverse product data and shows several ways to use RuleDiscovery for data exploration and analysis. We start by checking how the number of selected fields changes as the coverage threshold increases.

In [20]:
# Analyze field selection at different coverage thresholds
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5]
threshold_results = {}

print("Field selection by coverage threshold:")
print("=" * 50)

for threshold in thresholds:
    result = discover_fields(
        discovery_df,
        source_column='mixed_text',
        categories=['money', 'contact', 'product'],  # Core categories
        coverage_threshold=threshold,
        include_original=False,
        debug=False
    )
    
    threshold_results[threshold] = len(result.columns)
    print(f"Threshold {threshold:.1f}: {len(result.columns)} fields selected")
    
    # Show sample field names for lower thresholds
    if threshold <= 0.3 and len(result.columns) > 0:
        sample_fields = list(result.columns)[:4]
        print(f"  Sample fields: {', '.join(sample_fields)}")
        if len(result.columns) > 4:
            print(f"  ... and {len(result.columns) - 4} more")

print(f"\nThreshold analysis summary:")
for t, count in threshold_results.items():
    print(f"  {t:.1f}: {count} fields")

Field selection by coverage threshold:
Threshold 0.1: 10 fields selected
  Sample fields: product__model_number, money__price_symbol, contact__domain, contact__email
  ... and 6 more
Threshold 0.2: 6 fields selected
  Sample fields: product__model_number, money__price_symbol, contact__domain, contact__email
  ... and 2 more
Threshold 0.3: 3 fields selected
  Sample fields: product__model_number, money__price_symbol, contact__domain
Threshold 0.4: 3 fields selected
Threshold 0.5: 2 fields selected

Threshold analysis summary:
  0.1: 10 fields
  0.2: 6 fields
  0.3: 3 fields
  0.4: 3 fields
  0.5: 2 fields
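
The field counts above shrink as the threshold rises. One way to read this is sketched below, assuming that coverage means the fraction of rows in which a rule matches at least once and that a field is kept when its coverage meets the threshold. This is an assumption for illustration: PyDI's actual scoring may differ, and `RULES`, `coverage`, and `select_fields` are hypothetical names, not PyDI APIs.

```python
import re

# Hypothetical rules for illustration; these are not PyDI's built-in patterns.
RULES = {
    "price": re.compile(r"[$€]\s?\d[\d,]*(?:\.\d+)?"),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "url": re.compile(r"https?://\S+"),
}

texts = [
    "iPhone 14 Pro - Space Black - $999.99 - Available at apple.com",
    "Samsung Galaxy S23 Ultra for €1199.99 - Contact support@samsung.com",
    "Dell XPS 13 laptop - Visit dell.com or call 1-800-DELL-CARE",
    "Visit https://example.com for special deals until 2024-12-31",
    "Free shipping on orders over $50.00 - Call (555) 123-4567",
]

def coverage(pattern, rows):
    """Fraction of rows in which the pattern matches at least once."""
    return sum(1 for row in rows if pattern.search(row)) / len(rows)

def select_fields(rules, rows, threshold):
    """Keep the rule names whose coverage is at least the threshold."""
    return [name for name, pattern in rules.items()
            if coverage(pattern, rows) >= threshold]

for threshold in (0.1, 0.3, 0.7):
    selected = select_fields(RULES, texts, threshold)
    print(f"Threshold {threshold:.1f}: {len(selected)} fields selected {selected}")
```

Under this reading, a low threshold keeps rare fields such as emails and URLs for exploration, while a higher threshold keeps only the fields present in most rows.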
