# Exercise 3 - Information Extraction with PyDI

In this exercise, you will use PyDI's information extraction module to extract structured attributes from unstructured text. We'll work with product descriptions, apply regex patterns and custom code functions, combine them in a pipeline, and measure extraction quality with evaluation metrics.

This exercise uses evaluation methods adapted from the [SelfRefinement4ExtractGPT](https://github.com/wbsg-uni-mannheim/SelfRefinement4ExtractGPT) repository.

## Setup and Data Loading

First, let's import the necessary libraries and load our dataset.

In [None]:
import pandas as pd
import numpy as np
import sys
import os

# Add PyDI to path
sys.path.append('../../../')

# Import PyDI information extraction modules
from PyDI.informationextraction import RegexExtractor, CodeExtractor, ExtractorPipeline
from PyDI.informationextraction.rules import built_in_rules

# Import evaluation utilities
from evaluation import load_jsonl_targets, evaluate_predictions, print_evaluation_results

### Task 3.1: Load and Explore the Dataset

Load the OA-Mine dataset which contains product descriptions with target attribute values. This dataset is commonly used for evaluating information extraction systems.

The data format is JSONL where each line contains:
- `input`: Product description text
- `category`: Product category 
- `target_scores`: Dictionary of target attributes and their values
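
To make the format concrete, here is a minimal sketch that writes and re-reads one JSONL line using only the standard library. The record is made up and simplified for illustration: the real file's `target_scores` entries may nest more detail per attribute, and `load_jsonl_targets` may reshape the data differently.

```python
import json

# Hypothetical, simplified record (not taken from the real dataset).
record = {
    "input": "Nike Men's Air Zoom Running Shoe 10 D US",
    "category": "Shoes",
    "target_scores": {"Brand": "Nike", "Gender": "Men's"},
}

# JSONL stores one JSON object per line.
jsonl_text = json.dumps(record) + "\n"

# Parse every non-empty line into a dictionary.
rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

first = rows[0]
attributes = sorted(first["target_scores"])
print(first["category"], attributes)
```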

In [None]:
# Load the target data from JSONL format
targets_df = load_jsonl_targets('input/oa-mine_test.jsonl')

# Display basic information about the dataset
print(f"Dataset shape: {targets_df.shape}")
print(f"Columns: {list(targets_df.columns)}")
print("\nFirst few examples:")
targets_df.head()

In [None]:
# Examine the distribution of categories
print("Category distribution:")
print(targets_df['category'].value_counts())

# Look at some example product descriptions
print("\nSample product descriptions:")
for i in range(3):
    print(f"{i+1}. {targets_df['input'].iloc[i]}")

In [None]:
# Now let's create our working DataFrame with just the input column
# (.copy() avoids a pandas SettingWithCopyWarning if columns are added later)
working_df = targets_df[['input']].copy()
working_df.head()

### Task 3.2: Basic Regex-Based Extraction

Create a RegexExtractor to extract product attributes using regular expression patterns. Start with simple patterns for common attributes like brand, gender, and size.
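
Before writing more rules, it can help to test a pattern on a few strings with plain `re`. The sketch below does this for the Brand pattern from the starter cell. It only mimics what the `group` and `postprocess` fields presumably do (select capture group 1, then strip whitespace); it is not PyDI's implementation, and the sample titles are invented.

```python
import re

# Same pattern as the "Brand" rule in the starter cell.
BRAND_PATTERN = re.compile(
    r"^([A-Za-z][A-Za-z\s&\.]+?)"
    r"(?:\s+(?:Men's|Women's|Boys'|Girls'|Mens|Womens|for|\d))"
)

def extract_brand(text):
    """Return capture group 1, stripped, or None if the pattern does not match."""
    match = BRAND_PATTERN.search(text)
    return match.group(1).strip() if match else None

# Invented product titles for a quick sanity check.
samples = [
    "Nike Men's Air Max 270 Running Shoe",
    "New Balance Women's 574 Sneaker",
    "Dr. Martens 1460 Boot",
    "adidas Originals Superstar",  # no gender word or digit follows the brand
]

for title in samples:
    print(f"{title!r} -> {extract_brand(title)!r}")
```

The last title yields `None` because the pattern needs a gender word, `for`, or a digit after the brand, which is one source of missed extractions to look for later.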

In [None]:
# TODO: Define regex rules for extracting product attributes
# Hint: Create rules dictionary with pattern definitions for:
# - Brand: Look for common brand patterns at the beginning
# - Gender: Men's, Women's, Boys', Girls' patterns  
# - Size: Numeric patterns with units
# - Color: Common color names

regex_rules = {
    # Add your regex rules here
    "Brand": {
        "source_column": "input",
        "pattern": r"^([A-Za-z][A-Za-z\s&\.]+?)(?:\s+(?:Men's|Women's|Boys'|Girls'|Mens|Womens|for|\d))",
        "group": 1,
        "postprocess": "strip"
    },
    # TODO: Add more rules
}

# Create the RegexExtractor
# TODO: Initialize the extractor with your rules

In [None]:
# Apply the regex extractor to the dataset
# TODO: Use the extract method to process the working_df

# Display some results
print("Regex extraction results (first 10 rows):")
# TODO: Show relevant columns from the results

### Evaluate Regex Extraction Results

Now let's evaluate how well our regex extraction performed. This will help us understand which patterns work well and which need improvement.
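
The evaluation output uses category labels such as VN and VW. The sketch below shows one way such categories can be turned into precision, recall, and F1. It assumes a five-category scheme (NN, NV, VN, VC, VW) and the formulas in the comments. The functions `categorize` and `score` are hypothetical helpers, not part of `evaluation.py`, whose exact logic may differ.

```python
from collections import Counter

def categorize(target, prediction):
    """Assign one (target, prediction) pair to an assumed category."""
    if target is None:
        return "NN" if prediction is None else "NV"  # nothing to find
    if prediction is None:
        return "VN"  # a value exists but none was extracted
    return "VC" if prediction == target else "VW"  # correct vs. wrong value

def score(counts):
    """Compute precision, recall and F1 from category counts (assumed formulas)."""
    predicted = counts["NV"] + counts["VC"] + counts["VW"]  # all extracted values
    expected = counts["VN"] + counts["VC"] + counts["VW"]   # all target values
    precision = counts["VC"] / predicted if predicted else 0.0
    recall = counts["VC"] / expected if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Invented (target, prediction) pairs covering each category once.
pairs = [("Nike", "Nike"), ("Nike", "Adidas"), ("Nike", None), (None, "Puma"), (None, None)]
counts = Counter(categorize(t, p) for t, p in pairs)
precision, recall, f1 = score(counts)
print(dict(counts), round(precision, 3), round(recall, 3), round(f1, 3))
```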

In [None]:
# TODO: Evaluate your regex extraction results
# 1. Get the list of attributes you extracted (columns from your regex results)
# extracted_attributes = [...]  # List the attributes your regex extractor created

# 2. Use evaluate_predictions to compare your results with targets
# Compare against targets_df: working_df only holds the 'input' column
evaluation_results = evaluate_predictions(regex_results_df, targets_df, extracted_attributes)

# 3. Print the evaluation results
print_evaluation_results(evaluation_results)

# 4. Analyze the results:
# - Which attributes had the best F1 scores?
# - Which attributes had the worst recall (lots of VN - Valid but Not extracted)?
# - Look at some VW (Valid but Wrong) examples to understand why extraction failed

### Task 3.3: Custom Code-Based Extraction

For more complex extraction logic that can't be easily handled with regex, use the CodeExtractor with custom Python functions.
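
As a rough illustration of the kind of function a code rule can point to, here is a sketch for a made-up attribute, "closure", that is not part of the exercise. The keyword table and the function name `extract_closure` are assumptions for illustration only. The function takes the text of one row and returns a value or `None`, which is the shape the TODO functions below are expected to follow.

```python
import re

# Hypothetical keyword table: lowercase keyword -> normalized label.
CLOSURE_KEYWORDS = {
    "lace-up": "Lace-Up",
    "slip-on": "Slip-On",
    "velcro": "Velcro",
}

def extract_closure(text):
    """Return the first closure keyword found in the text, or None."""
    text_lower = text.lower()
    for keyword, label in CLOSURE_KEYWORDS.items():
        # Word boundaries keep the keyword from matching inside longer words.
        if re.search(rf"\b{re.escape(keyword)}\b", text_lower):
            return label
    return None

# Invented product titles.
titles = ["Vans Classic Slip-On Sneaker", "Timberland Lace-Up Boot", "Crocs Clog"]
closures = [extract_closure(title) for title in titles]
print(closures)
```

Normalizing to a fixed label set makes the output easier to compare with target values than returning the raw matched text.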

In [None]:
# Define custom extraction functions
def extract_gender(text):
    """Extract gender information from product text."""
    text_lower = text.lower()
    # TODO: Implement gender extraction logic
    # Look for patterns like "men's", "women's", "boys'", "girls'"
    # Return the extracted gender or None
    pass

def extract_shoe_type(text):
    """Extract shoe type from product description."""
    text_lower = text.lower()
    # TODO: Implement shoe type extraction
    # Look for patterns like "sneaker", "boot", "sandal", "loafer", etc.
    # Return the extracted shoe type or None
    pass

def extract_size(text):
    """Extract size information from product text."""
    # TODO: Implement size extraction logic
    # Look for numeric patterns, possibly with letters (like "10 D US")
    # Return the extracted size or None
    pass

In [None]:
# Define code extraction rules
code_rules = {
    # TODO: Define rules that use your custom functions
    # Format: "field_name": {"source_column": "input", "function": function_name}
}

# Create and apply CodeExtractor
# TODO: Initialize CodeExtractor with your rules and apply it

### Evaluate Code-Based Extraction

Let's evaluate the code-based extraction.

In [None]:
# TODO: Evaluate your code extraction results
# 1. Get the attributes extracted by your code extractor
# code_extracted_attributes = [...]  

# 2. Evaluate code extraction performance
# Compare against targets_df: working_df only holds the 'input' column
code_evaluation = evaluate_predictions(code_results_df, targets_df, code_extracted_attributes)
print("=== CODE EXTRACTION EVALUATION ===")
print_evaluation_results(code_evaluation)

### Task 3.4: Combining Extractors with Pipeline

Use ExtractorPipeline to combine multiple extractors for more comprehensive attribute extraction.
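
The idea of chaining extractors can be sketched in plain Python. This is not how `ExtractorPipeline` is implemented. In particular, the "first non-empty value wins" merge rule is an assumption chosen for this sketch, and the attributes (`Width`, `Waterproof`) and step functions are invented.

```python
import re

# Hypothetical width codes used by the code-based step.
WIDTH_CODES = {"2e": "Wide", "4e": "Extra Wide"}

def regex_step(text):
    """Regex-style step: look for an explicit width word."""
    match = re.search(r"\b(wide|narrow)\b", text, flags=re.IGNORECASE)
    return {"Width": match.group(1).title() if match else None}

def code_step(text):
    """Code-style step: decode width codes and flag waterproof products."""
    tokens = text.lower().split()
    width = next((label for code, label in WIDTH_CODES.items() if code in tokens), None)
    waterproof = "Yes" if "waterproof" in tokens else None
    return {"Width": width, "Waterproof": waterproof}

def run_pipeline(steps, text):
    """Run steps in order; a later step only fills fields that are still empty."""
    merged = {}
    for step in steps:
        for field, value in step(text).items():
            if merged.get(field) is None:
                merged[field] = value
    return merged

steps = [regex_step, code_step]
first = run_pipeline(steps, "Rockport Waterproof Walking Shoe Wide 4E")
second = run_pipeline(steps, "Keen Waterproof Hiking Boot 2E")
print(first)
print(second)
```

With this merge rule, the order of the steps decides which extractor is trusted when both produce a value, so it is worth checking how the real pipeline resolves such conflicts.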

In [None]:
# Create an ExtractorPipeline combining regex and code extractors
pipeline = ExtractorPipeline([regex_extractor, code_extractor])

# TODO: Apply the pipeline


### Evaluate Pipeline-Based Extraction

Let's evaluate the pipeline-based extraction.

In [None]:
# TODO: Evaluate the pipeline results with evaluate_predictions
# and compare them with the regex-only and code-only results

### Task 3.5: Analysis and Improvement
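
One way to inspect failures is to put predictions and targets side by side and filter. The sketch below uses small invented DataFrames and assumes the target value is available as a flat `Brand` column aligned by index. The real `targets_df` may store targets differently, so adapt the column access accordingly.

```python
import pandas as pd

# Invented predictions and targets, aligned by row index.
predictions = pd.DataFrame({
    "input": [
        "Nike Men's Air Max 270",
        "Skechers Go Walk Slip-On",
        "adidas Originals Men's Superstar",
    ],
    "Brand": ["Nike", None, "adidas Originals"],
})
targets = pd.DataFrame({"Brand": ["Nike", "Skechers", "adidas"]})

comparison = predictions[["input"]].assign(
    predicted=predictions["Brand"],
    target=targets["Brand"],
)

has_target = comparison["target"].notna()
has_prediction = comparison["predicted"].notna()

# Target exists but nothing was extracted.
missed = comparison[has_target & ~has_prediction]
# Target exists and a different value was extracted.
wrong = comparison[has_target & has_prediction & (comparison["predicted"] != comparison["target"])]

print("Missed:\n", missed)
print("Wrong:\n", wrong)
```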

In [None]:
# TODO: Analyze the results
# 1. Which attributes had the best/worst performance?
# 2. Look at some examples where extraction failed
# 3. What patterns could you add to improve performance?

# Example: Find cases where brand extraction failed
# TODO: Filter and examine failed extractions to understand patterns

### Bonus Task 3.6: LLM-Based Extraction (Optional)

If you have access to an API key for OpenAI or another LLM provider, try using the LLMExtractor for more sophisticated extraction.

In [None]:
# Optional: LLM-based extraction
# This requires API keys and is optional

try:
    from PyDI.informationextraction import LLMExtractor
    from langchain_openai import ChatOpenAI
    from pydantic import BaseModel
    from typing import Optional
    
    class Product(BaseModel):
        brand: Optional[str] = None
        gender: Optional[str] = None
        model_name: Optional[str] = None
        shoe_type: Optional[str] = None
        color: Optional[str] = None
        size: Optional[str] = None
    
    print("LLM extraction would require API keys - skipping for now")
    
except ImportError:
    print("LLM extraction dependencies not available - install with: pip install langchain-openai")