# LLM-based Information Extraction with LangChain and OpenAI

This notebook demonstrates how to use PyDI's `LLMExtractor` with an OpenAI GPT model (accessed through LangChain) to extract structured product information from unstructured text descriptions.

## Setup and Dependencies

First, let's check if the required dependencies are available and set up our environment.

In [21]:
import os
from typing import Optional
import pandas as pd
from pydantic import BaseModel
from dotenv import load_dotenv
import json

load_dotenv()

# Check for LangChain OpenAI integration
try:
    from langchain_openai import ChatOpenAI
    OPENAI_AVAILABLE = True
    print("✅ LangChain OpenAI integration available")
except ImportError:
    OPENAI_AVAILABLE = False
    print("❌ LangChain OpenAI not available. Install with: pip install langchain-openai")

# Check for PyDI LLMExtractor
try:
    from PyDI.informationextraction import LLMExtractor
    print("✅ PyDI LLMExtractor available")
except ImportError:
    print("❌ PyDI LLMExtractor not available. Make sure PyDI is properly installed.")

✅ LangChain OpenAI integration available
✅ PyDI LLMExtractor available


In [22]:
# Check for OpenAI API key
api_key = os.getenv('OPENAI_API_KEY')
if api_key:
    print("✅ OPENAI_API_KEY found in environment")
    print(f"   Key starts with: {api_key[:10]}...")
else:
    print("❌ OPENAI_API_KEY not found in environment")
    print("   Set it with: os.environ['OPENAI_API_KEY'] = 'your-api-key'")
    print("   Or export OPENAI_API_KEY='your-api-key' in your shell")

✅ OPENAI_API_KEY found in environment
   Key starts with: sk-proj-qH...


## Define Product Schema

We'll use Pydantic to define the structured schema that we want to extract from unstructured text.

In [23]:
class Product(BaseModel):
    """Product schema for extraction."""
    brand: Optional[str] = None
    model: Optional[str] = None  
    price: Optional[float] = None
    category: Optional[str] = None
    color: Optional[str] = None

print("Product schema fields:")
for field_name, field_info in Product.__annotations__.items():
    print(f"  - {field_name}: {field_info}")

Product schema fields:
  - brand: typing.Optional[str]
  - model: typing.Optional[str]
  - price: typing.Optional[float]
  - category: typing.Optional[str]
  - color: typing.Optional[str]
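
Because every field is `Optional` with a default of `None`, a partial JSON reply still validates, while a value of the wrong type is rejected. The sketch below illustrates this behaviour, assuming Pydantic v2 (`model_validate_json`); the JSON strings are made-up inputs for illustration, not model output.

```python
from typing import Optional
from pydantic import BaseModel, ValidationError

class Product(BaseModel):
    """Same fields as the schema above, repeated so the sketch is self-contained."""
    brand: Optional[str] = None
    model: Optional[str] = None
    price: Optional[float] = None
    category: Optional[str] = None
    color: Optional[str] = None

# Missing fields fall back to None instead of raising.
good = Product.model_validate_json('{"brand": "Dell", "model": "XPS 13", "price": 1299.0}')
print(good.brand, good.price, good.color)

# A price that still contains a currency symbol cannot be parsed as a float.
try:
    Product.model_validate_json('{"brand": "Dell", "price": "$1,299.00"}')
    price_rejected = False
except ValidationError:
    price_rejected = True
print("price rejected:", price_rejected)
```

This is one reason the system prompt later asks for the price "as a number (without currency symbols)".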


## Sample Data

Let's create sample product descriptions that contain various types of structured information embedded in natural language text.

In [24]:
# Sample product data with rich, unstructured descriptions
sample_data = [
    "Apple iPhone 14 Pro - Space Black - $999.99 - Premium smartphone with advanced camera system",
    "Samsung Galaxy S23 Ultra 256GB Phantom Black smartphone for $1199 with S Pen included", 
    "Sony WH-1000XM4 wireless noise-canceling headphones in silver color priced at $349.99",
    "Dell XPS 13 laptop - Intel i7, 16GB RAM, 512GB SSD - $1,299.00 - Ultrabook category",
    "Nintendo Switch OLED White Console with Joy-Con controllers - $349.99 gaming system",
    "Dyson V15 Detect cordless vacuum cleaner in yellow - $749.99 - Home appliances",
]

# Create DataFrame
df = pd.DataFrame({'description': sample_data})

print(f"Created dataset with {len(df)} product descriptions:\n")
for i, desc in enumerate(df['description'], 1):
    print(f"{i}. {desc}")

Created dataset with 6 product descriptions:

1. Apple iPhone 14 Pro - Space Black - $999.99 - Premium smartphone with advanced camera system
2. Samsung Galaxy S23 Ultra 256GB Phantom Black smartphone for $1199 with S Pen included
3. Sony WH-1000XM4 wireless noise-canceling headphones in silver color priced at $349.99
4. Dell XPS 13 laptop - Intel i7, 16GB RAM, 512GB SSD - $1,299.00 - Ultrabook category
5. Nintendo Switch OLED White Console with Joy-Con controllers - $349.99 gaming system
6. Dyson V15 Detect cordless vacuum cleaner in yellow - $749.99 - Home appliances


## Configure LLM and Prompts

Now let's set up the OpenAI chat model and define our extraction prompts.

In [25]:
if not OPENAI_AVAILABLE or not os.getenv('OPENAI_API_KEY'):
    print("⚠️ Skipping LLM setup - missing dependencies or API key")
else:
    # Initialize OpenAI chat model
    chat_model = ChatOpenAI(
        model="gpt-5-nano",  # Small, inexpensive model; sufficient for this demo
        max_tokens=500,  # Upper bound on generated tokens; ample for a small JSON object
        temperature=0.0,  # Request deterministic output (reported as None below for this model)
        reasoning_effort="minimal",  # Keep reasoning tokens to a minimum
    )
    
    print(f"✅ Configured {chat_model.model_name} with temperature={chat_model.temperature}")

✅ Configured gpt-5-nano with temperature=None


In [26]:
# Test the chat model with a simple prompt
response = chat_model.invoke("What is the capital of the moon?")
print(response)

content='There is no capital of the Moon. The Moon is a natural satellite without a government or inhabited population, so it has no political capitals or cities. If you’re asking in a fictional or hypothetical sense, you could imagine any place as a “capital” in that context.' additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 64, 'prompt_tokens': 14, 'total_tokens': 78, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-5-nano-2025-08-07', 'system_fingerprint': None, 'id': 'chatcmpl-C9RnF0nGJIHWus1AUFCXTvKN9WGw6', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None} id='run--5ea9e04c-5f0f-450d-991a-332e4655aa9f-0' usage_metadata={'input_tokens': 14, 'output_tokens': 64, 'total_tokens': 78, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token

In [27]:
# System prompt for product extraction
system_prompt = """Extract product information from the given text and return it as JSON.

Extract these fields if present:
- brand: The manufacturer or brand name
- model: The specific product model or name
- price: The price as a number (without currency symbols)  
- category: The product category (e.g., "smartphone", "laptop", "headphones")
- color: The product color if mentioned

Return only valid JSON with the extracted fields. If a field is not found, omit it from the JSON.

Example:
{"brand": "Apple", "model": "iPhone 14", "price": 999.99, "category": "smartphone", "color": "black"}"""

print("System prompt:")
print(system_prompt)

System prompt:
Extract product information from the given text and return it as JSON.

Extract these fields if present:
- brand: The manufacturer or brand name
- model: The specific product model or name
- price: The price as a number (without currency symbols)  
- category: The product category (e.g., "smartphone", "laptop", "headphones")
- color: The product color if mentioned

Return only valid JSON with the extracted fields. If a field is not found, omit it from the JSON.

Example:
{"brand": "Apple", "model": "iPhone 14", "price": 999.99, "category": "smartphone", "color": "black"}
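
The prompt asks for JSON only, but chat models sometimes wrap the object in a Markdown code fence or add surrounding text. A minimal sketch of a tolerant parser follows; `parse_json_reply` is a hypothetical helper for illustration and is not part of PyDI or LangChain.

```python
import json
import re
from typing import Optional

def parse_json_reply(reply: str) -> Optional[dict]:
    """Return the JSON object embedded in a model reply, or None if there is none."""
    # Take everything from the first '{' to the last '}', skipping fences and chatter.
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None

fence = "`" * 3  # built at runtime to avoid nesting a literal fence in this example
fenced_reply = f'{fence}json\n{{"brand": "Sony", "price": 349.99}}\n{fence}'
print(parse_json_reply(fenced_reply))  # {'brand': 'Sony', 'price': 349.99}
print(parse_json_reply("Sorry, I could not find any product."))  # None
```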


In [28]:
# Few-shot examples to improve extraction quality
few_shots = [
    (
        "Microsoft Surface Pro 9 13-inch tablet in platinum - $1099.99 - 2-in-1 device",
        '{"brand": "Microsoft", "model": "Surface Pro 9", "price": 1099.99, "category": "tablet", "color": "platinum"}'
    ),
    (
        "AirPods Pro (2nd generation) wireless earbuds for $249 with active noise cancellation",
        '{"brand": "Apple", "model": "AirPods Pro", "price": 249.0, "category": "earbuds"}'
    )
]

print(f"Few-shot examples ({len(few_shots)} examples):")
for i, (input_text, expected_output) in enumerate(few_shots, 1):
    print(f"\nExample {i}:")
    print(f"  Input: {input_text}")
    print(f"  Output: {expected_output}")

Few-shot examples (2 examples):

Example 1:
  Input: Microsoft Surface Pro 9 13-inch tablet in platinum - $1099.99 - 2-in-1 device
  Output: {"brand": "Microsoft", "model": "Surface Pro 9", "price": 1099.99, "category": "tablet", "color": "platinum"}

Example 2:
  Input: AirPods Pro (2nd generation) wireless earbuds for $249 with active noise cancellation
  Output: {"brand": "Apple", "model": "AirPods Pro", "price": 249.0, "category": "earbuds"}
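
Few-shot outputs are written by hand, so it can be worth checking that each one is valid JSON and uses only schema fields before sending them to the model. The following is a sketch under that assumption; `check_few_shots` and `SCHEMA_FIELDS` are hypothetical names introduced here.

```python
import json

SCHEMA_FIELDS = {"brand", "model", "price", "category", "color"}

few_shots = [
    ("Microsoft Surface Pro 9 13-inch tablet in platinum - $1099.99 - 2-in-1 device",
     '{"brand": "Microsoft", "model": "Surface Pro 9", "price": 1099.99, "category": "tablet", "color": "platinum"}'),
    ("AirPods Pro (2nd generation) wireless earbuds for $249 with active noise cancellation",
     '{"brand": "Apple", "model": "AirPods Pro", "price": 249.0, "category": "earbuds"}'),
]

def check_few_shots(examples, allowed_fields):
    """Return a list of problems found in the (input, expected JSON) pairs."""
    problems = []
    for i, (_, expected) in enumerate(examples, 1):
        try:
            parsed = json.loads(expected)
        except json.JSONDecodeError as exc:
            problems.append(f"example {i}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(parsed, dict):
            problems.append(f"example {i}: expected a JSON object")
            continue
        extra = set(parsed) - allowed_fields
        if extra:
            problems.append(f"example {i}: unknown fields {sorted(extra)}")
    return problems

print(check_few_shots(few_shots, SCHEMA_FIELDS))  # [] when every example is consistent
```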


## Create and Configure LLMExtractor

Now let's create the LLMExtractor with our configuration.

In [29]:
if not OPENAI_AVAILABLE or not os.getenv('OPENAI_API_KEY'):
    print("⚠️ Cannot create LLMExtractor - missing dependencies or API key")
    print("   This cell will be skipped.")
else:
    # Create LLM extractor
    extractor = LLMExtractor(
        chat_model=chat_model,
        schema=Product,
        source_column="description", 
        system_prompt=system_prompt,
        few_shots=few_shots,
        debug=True,  # Enable debug artifacts
        retries=2,   # Retry failed extractions
        out_dir="../../output/informationextraction/llm_example"
    )
    
    print("✅ LLMExtractor configured with:")
    print(f"  - Model: {chat_model.model_name}")
    print(f"  - Schema: {Product.__name__} ({len(Product.__annotations__)} fields)")
    print(f"  - Few-shot examples: {len(few_shots)}")
    print(f"  - Debug mode: {extractor.debug}")
    print(f"  - Max retries: {extractor.retries}")
    print(f"  - Output directory: {extractor.out_dir}")

✅ LLMExtractor configured with:
  - Model: gpt-5-nano
  - Schema: Product (5 fields)
  - Few-shot examples: 2
  - Debug mode: True
  - Max retries: 2
  - Output directory: ../../output/informationextraction/llm_example
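
`retries=2` asks the extractor to try again when an extraction fails. PyDI's internal retry logic is not shown in this notebook, so the following is only a generic sketch of the idea: call the model, try to parse the reply, and repeat up to a fixed number of extra attempts. `extract_with_retries` and `fake_model` are hypothetical names used for illustration.

```python
import json
from typing import Callable, Optional

def extract_with_retries(call_model: Callable[[str], str], text: str, retries: int = 2) -> Optional[dict]:
    """Call the model up to 1 + retries times; return the first reply that parses as a JSON object."""
    for _ in range(1 + retries):
        reply = call_model(text)
        try:
            parsed = json.loads(reply)
        except json.JSONDecodeError:
            continue  # malformed reply: try again
        if isinstance(parsed, dict):
            return parsed
    return None  # every attempt failed

# Stand-in for a chat model: fails once, then returns valid JSON.
replies = iter(["not json", '{"brand": "Dyson", "model": "V15 Detect"}'])

def fake_model(text: str) -> str:
    return next(replies)

print(extract_with_retries(fake_model, "Dyson V15 Detect cordless vacuum cleaner"))
```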


## Run Extraction

Now let's run the LLM-based extraction on our product descriptions.

In [30]:
# Pass the system prompt, few-shot examples, and a user message to the chat_model
from langchain_core.messages import SystemMessage, HumanMessage, AIMessage

messages = [
    SystemMessage(content=system_prompt),
]
for input_text, expected_output in few_shots:
    messages.append(HumanMessage(content=input_text))
    messages.append(AIMessage(content=expected_output))  # already a JSON string
# Add a user message to extract info from a sample description
sample_description = df['description'].iloc[0]
messages.append(HumanMessage(content=sample_description))

response = chat_model.invoke(messages)
print(response)

content='{"brand": "Apple", "model": "iPhone 14 Pro"}' additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 25, 'prompt_tokens': 301, 'total_tokens': 326, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-5-nano-2025-08-07', 'system_fingerprint': None, 'id': 'chatcmpl-C9RnHXPN0nb7pgWzkg5E78e4DvJAt', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None} id='run--43942186-02bf-4f6d-8ac1-d7d60716bd8a-0' usage_metadata={'input_tokens': 301, 'output_tokens': 25, 'total_tokens': 326, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}}


In [31]:
# Extract information using LLM
print("🤖 Extracting product information with LLM...")
print("   This may take a few moments as we call the OpenAI API...")

result_df = extractor.extract(df, source_column="description")

print("✅ Extraction completed!")

🤖 Extracting product information with LLM...
   This may take a few moments as we call the OpenAI API...
✅ Extraction completed!


## Analyze Results

Let's examine the extraction results and analyze the quality of the LLM-based extraction.

In [32]:
# Display extraction results
print(f"Added {len(Product.__annotations__)} new columns to the dataset")
if OPENAI_AVAILABLE and os.getenv('OPENAI_API_KEY'):
    print(f"Debug artifacts saved to: {extractor.out_dir}/")

# Show sample results
print("\nExtraction Results:")
display_columns = ['description', 'brand', 'model', 'price', 'category', 'color']
available_columns = [col for col in display_columns if col in result_df.columns]

# Configure pandas display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None) 
pd.set_option('display.max_colwidth', 50)

result_df[available_columns]

Added 5 new columns to the dataset
Debug artifacts saved to: ../../output/informationextraction/llm_example/

Extraction Results:


| | description | brand | model | price | category | color |
|---|---|---|---|---|---|---|
| 0 | Apple iPhone 14 Pro - Space Black - $999.99 - ... | Apple | iPhone 14 Pro | 999.99 | smartphone | Space Black |
| 1 | Samsung Galaxy S23 Ultra 256GB Phantom Black s... | Samsung | Galaxy S23 Ultra | 1199.0 | smartphone | Phantom Black |
| 2 | Sony WH-1000XM4 wireless noise-canceling headp... | Sony | WH-1000XM4 | 349.99 | headphones | silver |
| 3 | Dell XPS 13 laptop - Intel i7, 16GB RAM, 512GB... | Dell | XPS 13 | 1299.0 | laptop | |
| 4 | Nintendo Switch OLED White Console with Joy-Co... | Nintendo | Switch OLED White Console with Joy-Con control... | 349.99 | gaming system | |
| 5 | Dyson V15 Detect cordless vacuum cleaner in ye... | Dyson | V15 Detect | 749.99 | home appliance | yellow |


In [33]:
# Show extraction statistics
print("Extraction Statistics:")
for field in Product.__annotations__.keys():
    if field in result_df.columns:
        non_null = result_df[field].notna().sum()
        total = len(result_df)
        rate = non_null / total * 100
        print(f"  {field}: {non_null}/{total} ({rate:.1f}%)")
        
        # Show unique values for categorical fields
        if field in ['brand', 'category', 'color'] and non_null > 0:
            unique_vals = result_df[field].dropna().unique()
            print(f"    → Unique values: {list(unique_vals)}")

Extraction Statistics:
  brand: 6/6 (100.0%)
    → Unique values: ['Apple', 'Samsung', 'Sony', 'Dell', 'Nintendo', 'Dyson']
  model: 6/6 (100.0%)
  price: 6/6 (100.0%)
  category: 6/6 (100.0%)
    → Unique values: ['smartphone', 'headphones', 'laptop', 'gaming system', 'home appliance']
  color: 4/6 (66.7%)
    → Unique values: ['Space Black', 'Phantom Black', 'silver', 'yellow']
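
Fill rates show how often a field was populated, not whether the value is right. For `price`, one cheap cross-check is to pull the dollar amount out of the description with a regular expression and compare it with the extracted number. The sketch below uses two descriptions and extracted prices copied from the results above; `price_from_text` is a hypothetical helper.

```python
import re
from typing import Optional

def price_from_text(text: str) -> Optional[float]:
    """Return the first dollar amount in the text as a float, or None if there is none."""
    match = re.search(r"\$(\d[\d,]*(?:\.\d+)?)", text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))  # drop thousands separators

rows = [
    ("Dell XPS 13 laptop - Intel i7, 16GB RAM, 512GB SSD - $1,299.00 - Ultrabook category", 1299.0),
    ("Samsung Galaxy S23 Ultra 256GB Phantom Black smartphone for $1199 with S Pen included", 1199.0),
]
for description, extracted_price in rows:
    expected = price_from_text(description)
    status = "ok" if expected == extracted_price else "MISMATCH"
    print(f"{status}: text={expected} extracted={extracted_price}")
```

A check like this only works for fields that appear verbatim in the text; inferred fields such as `category` need manual review or a labelled sample.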


In [34]:
# Detailed analysis of each extraction
print("Detailed Extraction Analysis:")
print("=" * 60)

for i, row in result_df.iterrows():
    print(f"\nProduct {i+1}:")
    print(f"  Description: {row['description'][:80]}..." if len(row['description']) > 80 else f"  Description: {row['description']}")
    
    extracted_info = []
    for field in ['brand', 'model', 'price', 'category', 'color']:
        if field in row and pd.notna(row[field]):
            if field == 'price':
                extracted_info.append(f"{field}: ${row[field]:.2f}")
            else:
                extracted_info.append(f"{field}: {row[field]}")
    
    if extracted_info:
        print(f"  Extracted: {' | '.join(extracted_info)}")
    else:
        print("  Extracted: No structured data found")

Detailed Extraction Analysis:

Product 1:
  Description: Apple iPhone 14 Pro - Space Black - $999.99 - Premium smartphone with advanced c...
  Extracted: brand: Apple | model: iPhone 14 Pro | price: $999.99 | category: smartphone | color: Space Black

Product 2:
  Description: Samsung Galaxy S23 Ultra 256GB Phantom Black smartphone for $1199 with S Pen inc...
  Extracted: brand: Samsung | model: Galaxy S23 Ultra | price: $1199.00 | category: smartphone | color: Phantom Black

Product 3:
  Description: Sony WH-1000XM4 wireless noise-canceling headphones in silver color priced at $3...
  Extracted: brand: Sony | model: WH-1000XM4 | price: $349.99 | category: headphones | color: silver

Product 4:
  Description: Dell XPS 13 laptop - Intel i7, 16GB RAM, 512GB SSD - $1,299.00 - Ultrabook categ...
  Extracted: brand: Dell | model: XPS 13 | price: $1299.00 | category: laptop

Product 5:
  Description: Nintendo Switch OLED White Console with Joy-Con controllers - $349.99 gaming sys...
  Extrac

## Save Results

Finally, let's save our extraction results to a CSV file.

In [35]:
# Save results (os is already imported in the setup cell)
output_dir = "output/informationextraction/llm_example"
os.makedirs(output_dir, exist_ok=True)
output_path = f"{output_dir}/extracted_products.csv"
result_df.to_csv(output_path, index=False)
print(f"✅ Results saved to: {output_path}")

# Show what was saved
print(f"\nSaved {len(result_df)} rows with {len(result_df.columns)} columns:")
print(f"Columns: {list(result_df.columns)}")

✅ Results saved to: output/informationextraction/llm_example/extracted_products.csv

Saved 6 rows with 6 columns:
Columns: ['description', 'brand', 'model', 'price', 'category', 'color']


## Open-schema Extraction (no predefined schema)

Sometimes you don’t know the fields up front. In open-schema mode, the extractor sends your prompt and examples to the LLM, then stores whatever JSON object the model returns in a single column named `extracted` (as a JSON string).

What to expect:
- The output DataFrame will not contain fixed columns like `brand`, `model`, etc. Instead it has one column: `extracted`.
- Each `extracted` cell contains a JSON string that you can parse/expand later.
- All usual debug artifacts (prompts, responses, errors, config, stats) are still saved under the timestamped run directory.

How to view the results inline:
- Quick view:
  - `open_result_df[['description', 'extracted']].head(10)`
- Expand to columns (optional post-processing):
  - Use `pd.json_normalize(open_result_df['extracted'].dropna().apply(json.loads))` to turn the JSON into columns, then concat with the original DataFrame (see the code cell below for a working example).


In [36]:
open_extractor = LLMExtractor(
    chat_model=chat_model,
    source_column="description",
    system_prompt="""Extract any meaningful product information from the text as a flat JSON object. 
Return only valid JSON. """,
    schema=None,
    few_shots=[
        (
            "Bose QuietComfort 45 headphones in black - $329 - wireless over-ear",
            '{"brand": "Bose", "model": "QuietComfort 45", "price": 329.0, "category": "headphones", "color": "black", "wireless": true}'
        )
    ],
    debug=True,
    retries=2,
    out_dir="../../output/informationextraction/llm_example"
)
print("✅ Open-schema LLMExtractor configured.")


✅ Open-schema LLMExtractor configured.


In [37]:
print("Extracting with open schema...")
open_result_df = open_extractor.extract(df, source_column="description")
print("Open-schema extraction completed!")


Extracting with open schema...
Open-schema extraction completed!


In [38]:
open_result_df

| | description | extracted |
|---|---|---|
| 0 | Apple iPhone 14 Pro - Space Black - $999.99 - ... | {"brand": "Apple", "model": "iPhone 14 Pro", "... |
| 1 | Samsung Galaxy S23 Ultra 256GB Phantom Black s... | {"brand": "Samsung", "model": "Galaxy S23 Ultr... |
| 2 | Sony WH-1000XM4 wireless noise-canceling headp... | {"brand": "Sony", "model": "WH-1000XM4", "pric... |
| 3 | Dell XPS 13 laptop - Intel i7, 16GB RAM, 512GB... | {"brand": "Dell", "model": "XPS 13", "processo... |
| 4 | Nintendo Switch OLED White Console with Joy-Co... | {"brand": "Nintendo", "model": "Switch OLED Wh... |
| 5 | Dyson V15 Detect cordless vacuum cleaner in ye... | {"brand": "Dyson", "model": "V15 Detect", "pri... |


In [39]:
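# Parse the JSON strings, keeping the original index so rows stay aligned if any extraction is null
parsed = open_result_df['extracted'].dropna().apply(json.loads)
expanded = pd.json_normalize(parsed.tolist()).add_prefix('llm_')
expanded.index = parsed.index
open_result_df_expanded = pd.concat([open_result_df.drop(columns=['extracted']), expanded], axis=1)
open_result_df_expanded.head(10)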

| | description | llm_brand | llm_model | llm_color | llm_price | llm_category | llm_features | llm_storage | llm_includes | llm_wireless | llm_feature | llm_processor | llm_ram | llm_controllers | llm_sub_category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Apple iPhone 14 Pro - Space Black - $999.99 - ... | Apple | iPhone 14 Pro | Space Black | 999.99 | smartphone | [Premium smartphone with advanced camera system] | | | | | | | | |
| 1 | Samsung Galaxy S23 Ultra 256GB Phantom Black s... | Samsung | Galaxy S23 Ultra | Phantom Black | 1199.0 | smartphone | | 256GB | [S Pen] | | | | | | |
| 2 | Sony WH-1000XM4 wireless noise-canceling headp... | Sony | WH-1000XM4 | silver | 349.99 | headphones | | | | True | noise-canceling | | | | |
| 3 | Dell XPS 13 laptop - Intel i7, 16GB RAM, 512GB... | Dell | XPS 13 | | 1299.0 | Ultrabook | | 512GB SSD | | | | Intel i7 | 16.0 | | |
| 4 | Nintendo Switch OLED White Console with Joy-Co... | Nintendo | Switch OLED White Console with Joy-Con control... | white | 349.99 | gaming system | | | | True | | | | Joy-Con | |
| 5 | Dyson V15 Detect cordless vacuum cleaner in ye... | Dyson | V15 Detect | yellow | 749.99 | home appliances | | | | True | | | | | cordless vacuum cleaner |
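
In open-schema mode the model chooses its own keys, so related information can land under different names (the expanded result above has both `llm_features` and `llm_feature`). Counting how often each key occurs helps spot such near-duplicates before merging columns. A minimal sketch, using shortened JSON strings modeled on the rows above in place of the real `extracted` column:

```python
import json
from collections import Counter

# Illustrative stand-ins for open_result_df['extracted']; None marks a failed extraction.
extracted = [
    '{"brand": "Apple", "model": "iPhone 14 Pro", "features": ["Premium smartphone with advanced camera system"]}',
    '{"brand": "Sony", "model": "WH-1000XM4", "wireless": true, "feature": "noise-canceling"}',
    None,
]

# Count every top-level key across all successfully extracted rows.
key_counts = Counter(
    key
    for cell in extracted
    if cell is not None
    for key in json.loads(cell)
)
for key, count in key_counts.most_common():
    print(f"{key}: {count}")
```

Keys that appear only once or twice are candidates for renaming or merging, for example folding `feature` into `features`.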
