# Basic Code-Based Information Extraction Example

This example demonstrates how to extract structured information from DataFrame rows with PyDI: built-in regex patterns via RegexExtractor, custom functions via CodeExtractor, and both chained in an ExtractorPipeline.

In [20]:
import pandas as pd
from PyDI.informationextraction import CodeExtractor, RegexExtractor, built_in_rules

## Sample Data

Let's create a sample movie dataset with information embedded in text fields.

In [21]:
# Sample movie data
data = {
    'title': [
        'The Shawshank Redemption',
        'The Godfather', 
        'The Dark Knight',
        'Pulp Fiction',
        '12 Angry Men'
    ],
    'info': [
        'Drama • 1994 • 142 min • R • IMDB: tt0111161',
        'Crime/Drama • 1972 • 175 min • R • IMDB: tt0068646', 
        'Action/Crime • 2008 • 152 min • PG-13 • IMDB: tt0468569',
        'Crime/Drama • 1994 • 154 min • R • IMDB: tt0110912',
        'Drama • 1957 • 96 min • Approved • IMDB: tt0050083'
    ],
    'description': [
        'Two imprisoned men bond over a number of years.',
        'The aging patriarch of an organized crime dynasty.',
        'When the menace known as the Joker emerges from his mysterious past.',
        'The lives of two mob hitmen, a boxer, a gangster and his wife.',
        'A jury holdout attempts to prevent a miscarriage of justice.'
    ]
}

df = pd.DataFrame(data)
df

| | title | info | description |
|---|---|---|---|
| 0 | The Shawshank Redemption | Drama • 1994 • 142 min • R • IMDB: tt0111161 | Two imprisoned men bond over a number of years. |
| 1 | The Godfather | Crime/Drama • 1972 • 175 min • R • IMDB: tt006... | The aging patriarch of an organized crime dyna... |
| 2 | The Dark Knight | Action/Crime • 2008 • 152 min • PG-13 • IMDB: ... | When the menace known as the Joker emerges fro... |
| 3 | Pulp Fiction | Crime/Drama • 1994 • 154 min • R • IMDB: tt011... | The lives of two mob hitmen, a boxer, a gangst... |
| 4 | 12 Angry Men | Drama • 1957 • 96 min • Approved • IMDB: tt005... | A jury holdout attempts to prevent a miscarria... |

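Each `info` value packs five fields separated by `•`: genre, year, runtime, rating, and IMDB ID. As a quick check on that layout, the sketch below splits one value by position using only the standard library. The helper name `split_info` is made up for illustration and is not part of PyDI.

```python
def split_info(info: str) -> dict:
    """Split a '•'-delimited info string into named parts (hypothetical helper)."""
    # Assumes exactly five parts in a fixed order, as in the sample data.
    genre, year, runtime, rating, imdb = [part.strip() for part in info.split('•')]
    return {
        'genre': genre,
        'year': int(year),
        'runtime_minutes': int(runtime.split()[0]),  # '142 min' -> 142
        'rating': rating,
        'imdb_id': imdb.split(':', 1)[1].strip(),    # 'IMDB: tt0111161' -> 'tt0111161'
    }

parts = split_info('Drama • 1994 • 142 min • R • IMDB: tt0111161')
print(parts)
```

Positional splitting fails as soon as a field is missing or reordered, which is one reason to prefer pattern-based extraction for this kind of text.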

## Using Built-in Extractors

Instead of writing custom regex functions, we can leverage PyDI's built-in patterns and RegexExtractor. Let's first explore what built-in rules are available for our use case.

In [22]:
# Explore built-in rules that match our data
print("Built-in rules categories:", list(built_in_rules.keys()))
print("\nRelevant built-in patterns:")

# Show patterns we can use for our movie data
relevant_patterns = {
    'year': built_in_rules['dates']['year'],
    'imdb_id': built_in_rules['identifiers']['imdb_id'], 
    'runtime': built_in_rules['media']['runtime'],
    'age_rating': built_in_rules['media']['age_rating']
}

for name, pattern_info in relevant_patterns.items():
    print(f"  {name}: {pattern_info['pattern']}")

print("\n" + "="*60)

Built-in rules categories: ['identifiers', 'contact', 'money', 'dates', 'geo', 'measurements', 'product', 'media', 'company', 'key_value']

Relevant built-in patterns:
  year: \b(19|20)\d{2}\b
  imdb_id: \btt\d{7,8}\b
  runtime: \b(\d{1,3})\s*(?:min|minutes?|hrs?|hours?)\b
  age_rating: \b(?:G|PG|PG-13|R|NC-17|NR|Not Rated)\b

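To see what these patterns match, they can be tried directly with Python's standard `re` module. The sketch below copies the four pattern strings from the output above and runs `re.search` on one `info` value. It does not call PyDI, so how `RegexExtractor` chooses between the full match and a capture group is not shown here.

```python
import re

# Pattern strings copied from the printed built-in rules above.
patterns = {
    'year': r'\b(19|20)\d{2}\b',
    'imdb_id': r'\btt\d{7,8}\b',
    'runtime': r'\b(\d{1,3})\s*(?:min|minutes?|hrs?|hours?)\b',
    'age_rating': r'\b(?:G|PG|PG-13|R|NC-17|NR|Not Rated)\b',
}

info = 'Drama • 1994 • 142 min • R • IMDB: tt0111161'

# First match of each pattern in the info string.
matches = {name: re.search(pattern, info) for name, pattern in patterns.items()}

# group(0) is the full matched text, regardless of capture groups.
full = {name: m.group(0) for name, m in matches.items()}
for name, text in full.items():
    print(f"{name}: {text!r}")

# Capture groups only cover part of the match.
year_group = matches['year'].group(1)        # the century prefix only
runtime_group = matches['runtime'].group(1)  # the digits without the unit
print(year_group, runtime_group)
```

The group in the `year` pattern captures only the century prefix, so the full match is the useful value there, while the group in the `runtime` pattern isolates the number from its unit.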


In [23]:
# Using RegexExtractor with built-in patterns
regex_rules = {
    'year': built_in_rules['dates']['year'],
    'imdb_id': built_in_rules['identifiers']['imdb_id'],
    'runtime_minutes': built_in_rules['media']['runtime'],
    'mpaa_rating': built_in_rules['media']['age_rating']
}

# Create RegexExtractor for structured patterns
regex_extractor = RegexExtractor(regex_rules, default_source='info', debug=True)

print("Using RegexExtractor with built-in patterns:")

Using RegexExtractor with built-in patterns:


## Define Custom Functions for Complex Logic

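Some fields cannot be expressed as a single pattern match, such as a category derived from keywords, a flag computed from a previously extracted column, or a value built from another column. These still need custom functions.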

In [28]:
# Custom functions for complex logic that requires row-level processing
def classify_genre(row):
    """Classify primary genre from info."""
    info = str(row['info']).lower()
    if 'action' in info:
        return 'Action'
    elif 'crime' in info:
        return 'Crime'
    elif 'drama' in info:
        return 'Drama'
    else:
        return 'Other'

def is_classic(row):
    """Determine if movie is a classic (pre-1980)."""
    # Use the year already extracted by regex extractor
    if 'year' in row and row['year']:
        return row['year'] < 1980
    return False

def create_slug(row):
    """Create URL slug from title."""
    import re
    title = str(row['title']).lower()
    # Remove special characters and replace spaces with hyphens
    slug = re.sub(r'[^\w\s-]', '', title)
    slug = re.sub(r'[\s_-]+', '-', slug)
    return slug.strip('-')

def count_words(description):
    """Count words in description."""
    return len(str(description).split()) if description else 0

# Define custom functions for CodeExtractor
custom_functions = {
    'primary_genre': classify_genre,
    'is_classic_film': is_classic,
    'url_slug': create_slug,
    'word_count': count_words
}

# Create CodeExtractor for custom functions
code_extractor = CodeExtractor(custom_functions, debug=True, default_source="description")

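Before wiring such functions into an extractor, it can help to exercise them on a few edge cases. The sketch below repeats three of the functions and calls them on plain dicts standing in for DataFrame rows. That stand-in is an assumption made to keep the snippet free of dependencies, and the input values are invented for the test.

```python
import re

def classify_genre(row):
    """Classify primary genre from info (checks keywords in a fixed order)."""
    info = str(row['info']).lower()
    if 'action' in info:
        return 'Action'
    elif 'crime' in info:
        return 'Crime'
    elif 'drama' in info:
        return 'Drama'
    return 'Other'

def is_classic(row):
    """Determine if movie is a classic (pre-1980)."""
    if 'year' in row and row['year']:
        return row['year'] < 1980
    return False

def create_slug(row):
    """Create URL slug from title."""
    title = str(row['title']).lower()
    slug = re.sub(r'[^\w\s-]', '', title)   # drop punctuation
    slug = re.sub(r'[\s_-]+', '-', slug)    # collapse separators to one hyphen
    return slug.strip('-')

# Keyword priority wins over position: 'crime' is checked before 'drama'.
genre_by_priority = classify_genre({'info': 'Drama/Crime'})
genre_unknown = classify_genre({'info': 'Comedy'})

# NaN is truthy, but any comparison with NaN is False.
classic_nan = is_classic({'year': float('nan')})
classic_1972 = is_classic({'year': 1972.0})
classic_missing = is_classic({})

slug = create_slug({'title': 'Spider-Man: No Way Home'})
print(genre_by_priority, genre_unknown, classic_nan, classic_1972, classic_missing, slug)
```

Note that `classify_genre` returns the first keyword in its own check order, not the first genre listed in the text. For the sample data both readings agree.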
## Apply Extractors

Now let's apply both extractors: first the RegexExtractor for structured patterns, then the CodeExtractor for custom logic.

In [25]:
# Step 1: Apply RegexExtractor for structured patterns
print("Step 1: Extracting structured patterns...")
result_df = regex_extractor.extract(df)
print(f"After regex extraction: {list(result_df.columns)}")

# Step 2: Apply CodeExtractor for custom functions  
print("\nStep 2: Applying custom functions...")
# Set source columns for text-based function
result_df = code_extractor.extract(result_df, source_column='description')  # for word_count

print(f"Final columns: {list(result_df.columns)}")
print("\nExtracted Results:")
result_df

Step 1: Extracting structured patterns...
After regex extraction: ['title', 'info', 'description', 'year', 'imdb_id', 'runtime_minutes', 'mpaa_rating']

Step 2: Applying custom functions...
Final columns: ['title', 'info', 'description', 'year', 'imdb_id', 'runtime_minutes', 'mpaa_rating', 'primary_genre', 'is_classic_film', 'url_slug', 'word_count']

Extracted Results:


| | title | info | description | year | imdb_id | runtime_minutes | mpaa_rating | primary_genre | is_classic_film | url_slug | word_count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | Drama • 1994 • 142 min • R • IMDB: tt0111161 | Two imprisoned men bond over a number of years. | 1994.0 | tt0111161 | 142.0 | R | Drama | False | the-shawshank-redemption | 9 |
| 1 | The Godfather | Crime/Drama • 1972 • 175 min • R • IMDB: tt006... | The aging patriarch of an organized crime dyna... | 1972.0 | tt0068646 | 175.0 | R | Crime | True | the-godfather | 8 |
| 2 | The Dark Knight | Action/Crime • 2008 • 152 min • PG-13 • IMDB: ... | When the menace known as the Joker emerges fro... | 2008.0 | tt0468569 | 152.0 | PG | Action | False | the-dark-knight | 12 |
| 3 | Pulp Fiction | Crime/Drama • 1994 • 154 min • R • IMDB: tt011... | The lives of two mob hitmen, a boxer, a gangst... | 1994.0 | tt0110912 | 154.0 | R | Crime | False | pulp-fiction | 13 |
| 4 | 12 Angry Men | Drama • 1957 • 96 min • Approved • IMDB: tt005... | A jury holdout attempts to prevent a miscarria... | 1957.0 | tt0050083 | 96.0 | | Drama | True | 12-angry-men | 10 |

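Two values in the `mpaa_rating` column deserve a second look: The Dark Knight shows `PG` although its `info` says `PG-13`, and 12 Angry Men has no rating because `Approved` is not among the alternatives in the pattern. The first effect can be reproduced with the printed `age_rating` pattern and plain `re`: alternation tries branches left to right, and `PG` succeeds before `PG-13` is attempted. The reordered pattern below is a suggested variant for illustration, not something PyDI is known to ship.

```python
import re

# Pattern string as printed for built_in_rules['media']['age_rating'].
builtin = r'\b(?:G|PG|PG-13|R|NC-17|NR|Not Rated)\b'

# Hypothetical variant: longer alternatives first, plus 'Approved'.
reordered = r'\b(?:PG-13|NC-17|Not Rated|Approved|PG|NR|G|R)\b'

def first_match(pattern, text):
    """Return the text of the first match, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

dark_knight = 'Action/Crime • 2008 • 152 min • PG-13 • IMDB: tt0468569'
angry_men = 'Drama • 1957 • 96 min • Approved • IMDB: tt0050083'

for label, pattern in [('builtin', builtin), ('reordered', reordered)]:
    print(label, first_match(pattern, dark_knight), first_match(pattern, angry_men))
```

With the printed pattern, `PG` matches because the hyphen after it counts as a word boundary. Listing `PG-13` before `PG` lets the longer branch win.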

## Using ExtractorPipeline

PyDI also provides an ExtractorPipeline for chaining multiple extractors cleanly.

In [29]:
from PyDI.informationextraction import ExtractorPipeline

# Create a clean pipeline
pipeline = ExtractorPipeline([regex_extractor, code_extractor])

print("Running extraction pipeline...")
pipeline_result = pipeline.run(df, debug=True)

print("\nPipeline Results:")
pipeline_result

Running extraction pipeline...

Pipeline Results:


| | title | info | description | year | imdb_id | runtime_minutes | mpaa_rating | primary_genre | is_classic_film | url_slug | word_count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | Drama • 1994 • 142 min • R • IMDB: tt0111161 | Two imprisoned men bond over a number of years. | 1994.0 | tt0111161 | 142.0 | R | Drama | False | the-shawshank-redemption | 9 |
| 1 | The Godfather | Crime/Drama • 1972 • 175 min • R • IMDB: tt006... | The aging patriarch of an organized crime dyna... | 1972.0 | tt0068646 | 175.0 | R | Crime | True | the-godfather | 8 |
| 2 | The Dark Knight | Action/Crime • 2008 • 152 min • PG-13 • IMDB: ... | When the menace known as the Joker emerges fro... | 2008.0 | tt0468569 | 152.0 | PG | Action | False | the-dark-knight | 12 |
| 3 | Pulp Fiction | Crime/Drama • 1994 • 154 min • R • IMDB: tt011... | The lives of two mob hitmen, a boxer, a gangst... | 1994.0 | tt0110912 | 154.0 | R | Crime | False | pulp-fiction | 13 |
| 4 | 12 Angry Men | Drama • 1957 • 96 min • Approved • IMDB: tt005... | A jury holdout attempts to prevent a miscarria... | 1957.0 | tt0050083 | 96.0 | | Drama | True | 12-angry-men | 10 |

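Order matters in this chain because `is_classic` reads the `year` column that the regex step produces, so the regex extractor has to run first. The idea can be sketched without PyDI as a list of steps applied in sequence, each receiving the output of the previous one. `run_chain`, `add_year`, and `add_is_classic` are hypothetical names operating on plain dicts. This is not how `ExtractorPipeline` is implemented.

```python
import re

def add_year(rows):
    """Step 1: add a 'year' key taken from the first 19xx/20xx match in 'info'."""
    out = []
    for row in rows:
        m = re.search(r'\b(19|20)\d{2}\b', row['info'])
        out.append({**row, 'year': int(m.group(0)) if m else None})
    return out

def add_is_classic(rows):
    """Step 2: flag pre-1980 films; depends on 'year' already being present."""
    return [
        {**row, 'is_classic_film': bool(row.get('year')) and row['year'] < 1980}
        for row in rows
    ]

def run_chain(rows, steps):
    """Apply each step to the result of the previous one."""
    for step in steps:
        rows = step(rows)
    return rows

rows = [
    {'title': 'The Godfather', 'info': 'Crime/Drama • 1972 • 175 min • R • IMDB: tt0068646'},
    {'title': 'Pulp Fiction', 'info': 'Crime/Drama • 1994 • 154 min • R • IMDB: tt0110912'},
]

result = run_chain(rows, [add_year, add_is_classic])
swapped = run_chain(rows, [add_is_classic, add_year])  # wrong order on purpose

print([r['is_classic_film'] for r in result])
print([r['is_classic_film'] for r in swapped])
```

In the swapped order the flag is computed before any year exists, so every film comes out as not classic even though the year is filled in afterwards.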

In [30]:
# Summary statistics using the final result
print("Final Extraction Summary:")
print(f"  Total movies: {len(pipeline_result)}")
print(f"  Average runtime: {pipeline_result['runtime_minutes'].mean():.1f} minutes")
print(f"  Classic films: {pipeline_result['is_classic_film'].sum()}/{len(pipeline_result)}")
print(f"  Most common rating: {pipeline_result['mpaa_rating'].mode().iloc[0]}")
print(f"  Average description length: {pipeline_result['word_count'].mean():.1f} words")

print("\nExtracted fields demonstration:")
for i, row in pipeline_result.iterrows():
    print(f"\n{row['title']}:")
    print(f"  Year: {row['year']} ({'Classic' if row['is_classic_film'] else 'Modern'})")
    print(f"  Runtime: {row['runtime_minutes']} minutes")  
    print(f"  Rating: {row['mpaa_rating']}")
    print(f"  Genre: {row['primary_genre']}")
    print(f"  IMDB: {row['imdb_id']}")
    print(f"  URL: {row['url_slug']}")
    print(f"  Description: {row['word_count']} words")

Final Extraction Summary:
  Total movies: 5
  Average runtime: 143.8 minutes
  Classic films: 2/5
  Most common rating: R
  Average description length: 10.4 words

Extracted fields demonstration:

The Shawshank Redemption:
  Year: 1994.0 (Modern)
  Runtime: 142.0 minutes
  Rating: R
  Genre: Drama
  IMDB: tt0111161
  URL: the-shawshank-redemption
  Description: 9 words

The Godfather:
  Year: 1972.0 (Classic)
  Runtime: 175.0 minutes
  Rating: R
  Genre: Crime
  IMDB: tt0068646
  URL: the-godfather
  Description: 8 words

The Dark Knight:
  Year: 2008.0 (Modern)
  Runtime: 152.0 minutes
  Rating: PG
  Genre: Action
  IMDB: tt0468569
  URL: the-dark-knight
  Description: 12 words

Pulp Fiction:
  Year: 1994.0 (Modern)
  Runtime: 154.0 minutes
  Rating: R
  Genre: Crime
  IMDB: tt0110912
  URL: pulp-fiction
  Description: 13 words

12 Angry Men:
  Year: 1957.0 (Classic)
  Runtime: 96.0 minutes
  Rating: None
  Genre: Drama
  IMDB: tt0050083
  URL: 12-angry-men
  Description: 10 words
