# Exercise 3 - Information Extraction with PyDI

In this exercise, you will learn how to use PyDI's information extraction module to extract structured information from unstructured text data. We'll work with product descriptions and demonstrate different extraction techniques including regex patterns, custom code functions, and evaluation metrics.


## Setup and Data Loading

First, let's import the necessary libraries and load our dataset.

In [1]:
import pandas as pd
import numpy as np
import sys
import os

# Add PyDI to path
sys.path.append('../../../')

# Import PyDI information extraction modules
from PyDI.informationextraction import RegexExtractor, CodeExtractor, ExtractorPipeline
from PyDI.informationextraction.rules import built_in_rules
from PyDI.io.loaders import load_json
import re

NLTK not available. Advanced tokenization features will be limited.


## Task: Information Extraction with PyDI (GPU dataset)

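In this task, you will set up extractors and evaluators for a small GPU product dataset: a regex extractor over the product titles, an evaluator that compares its output with the gold `brand` and `model` columns, and an LLM extractor over the product descriptions.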


In [None]:
# 1) Load the GPU dataset and preview
gpu_df = load_json('input/gpu_products.json', add_index=True)
print(f'Dataset shape: {gpu_df.shape}')
gpu_df.head()


Dataset shape: (10, 12)


| | id | name | brand | model | chipset | memory_gb | memory_type | clock_speed_mhz | tdp_w | launch_date | price_usd | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | gpu-001 | NVIDIA GeForce RTX 4070 Ti | NVIDIA | RTX 4070 Ti | AD104 | 12 | GDDR6X | 2310 | 285 | 2023-01-05 | 799 | GeForce RTX 4070 Ti with 12GB GDDR6X, boost up... |
| 1 | gpu-002 | AMD Radeon RX 7800 XT | AMD | RX 7800 XT | Navi 32 | 16 | GDDR6 | 2124 | 263 | 2023-09-06 | 499 | Radeon RX 7800 XT with 16 GB GDDR6, boost ~2.4... |
| 2 | gpu-003 | NVIDIA GeForce RTX 4090 | NVIDIA | RTX 4090 | AD102 | 24 | GDDR6X | 2235 | 450 | 2022-10-12 | 1599 | Flagship RTX 4090 (24GB GDDR6X). Boost ~2.5 GH... |
| 3 | gpu-004 | AMD Radeon RX 7600 | AMD | RX 7600 | Navi 33 | 8 | GDDR6 | 2250 | 165 | 2023-05-25 | 269 | RX 7600 with 8 GB GDDR6. Boost ~2.6 GHz. TBP ~... |
| 4 | gpu-005 | NVIDIA GeForce RTX 4060 | NVIDIA | RTX 4060 | AD107 | 8 | GDDR6 | 2460 | 115 | 2023-06-29 | 299 | RTX 4060 8GB. Boost around 2.5 GHz. 115W TGP. ... |


In [None]:
# 2) Regex extraction from product title (name)
# Provided regex patterns 
regex_rules = {
    'brand_from_title': {
        'source_column': 'name',
        'pattern': r'(NVIDIA|AMD|Intel)',
        'flags': re.IGNORECASE,
        'group': 1,
    },
    'model_from_title': {
        'source_column': 'name',
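        # Try the longer suffix (XTX) before XT so that "RX 7900 XTX" is not cut to "RX 7900 XT";
        # keep the whitespace inside the optional group so no trailing space is captured.
        'pattern': r'(RTX\s?\d{3,4}(?:\s?(?:Super|Ti))?|RX\s?\d{3,4}(?:\s?(?:XTX|XT))?)',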
        'flags': re.IGNORECASE,
        'group': 1,
    },
}

# TODO: Instantiate the RegexExtractor with the provided rules
# from PyDI.informationextraction import RegexExtractor
regex_extractor = None  

# TODO: Run the extractor on gpu_df
regex_gpu_df = None 

# Inspect the extracted fields
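
To make clear what a rule dictionary of this shape describes, here is a minimal sketch that applies such rules with plain `re` and pandas. It is not PyDI's `RegexExtractor` and makes no claim about its constructor or method names; `apply_regex_rules` is a hypothetical helper written for illustration, and the model pattern is a variant that tries `XTX` before `XT`. The two titles are copied from the dataset preview.

```python
import re
import pandas as pd

# Two product titles copied from the dataset preview.
sample_df = pd.DataFrame({
    "name": ["NVIDIA GeForce RTX 4070 Ti", "AMD Radeon RX 7800 XT"],
})

# Rule layout mirrors the exercise: target column -> source column, pattern, flags, group.
rules = {
    "brand_from_title": {
        "source_column": "name",
        "pattern": r"(NVIDIA|AMD|Intel)",
        "flags": re.IGNORECASE,
        "group": 1,
    },
    "model_from_title": {
        "source_column": "name",
        "pattern": r"(RTX\s?\d{3,4}(?:\s?(?:Super|Ti))?|RX\s?\d{3,4}(?:\s?(?:XTX|XT))?)",
        "flags": re.IGNORECASE,
        "group": 1,
    },
}


def apply_regex_rules(df, rules):
    """Hypothetical helper: add one column per rule holding the first match (or None)."""
    out = df.copy()
    for target, rule in rules.items():
        compiled = re.compile(rule["pattern"], rule.get("flags", 0))
        group = rule.get("group", 0)

        def first_match(text, compiled=compiled, group=group):
            match = compiled.search(str(text))
            return match.group(group) if match else None

        out[target] = out[rule["source_column"]].map(first_match)
    return out


extracted = apply_regex_rules(sample_df, rules)
print(extracted[["name", "brand_from_title", "model_from_title"]])
```

Each rule yields one new column named after its key, which is why the evaluation step renames `brand_from_title` and `model_from_title` to the gold column names.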


In [None]:
# 3) Evaluate regex extraction vs. gold (brand, model)
from PyDI.informationextraction import InformationExtractionEvaluator

# Prepare predictions by renaming the extracted columns to the gold column names
pred_eval_df = regex_gpu_df.rename(columns={
    'brand_from_title': 'brand',
    'model_from_title': 'model',
})
attributes = ['brand', 'model']

# TODO: Instantiate the evaluator and run evaluation
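
As a point of comparison for whatever the evaluator reports, the sketch below computes a simple per-attribute exact-match accuracy with pandas alone. It is not the `InformationExtractionEvaluator` API, and the metric (case-insensitive exact match) is an assumption about what a reasonable baseline looks like. `attribute_accuracy` and `normalize` are hypothetical helpers; the gold values come from the dataset preview, while the predictions are made up to show a casing difference and a missing value.

```python
import pandas as pd

# Gold values copied from the preview; predictions are invented for illustration.
gold = pd.DataFrame({
    "brand": ["NVIDIA", "AMD", "NVIDIA"],
    "model": ["RTX 4070 Ti", "RX 7800 XT", "RTX 4090"],
})
pred = pd.DataFrame({
    "brand": ["NVIDIA", "amd", "NVIDIA"],       # casing differs on row 1
    "model": ["RTX 4070 Ti", "RX 7800 XT", None],  # extractor found nothing on row 2
})


def normalize(series):
    """Lower-case and strip values so that 'amd' and 'AMD ' compare equal."""
    return series.astype("string").str.strip().str.lower()


def attribute_accuracy(pred_df, gold_df, attributes):
    """Hypothetical helper: share of rows whose normalized prediction equals the gold value."""
    scores = {}
    for attr in attributes:
        matches = normalize(pred_df[attr]) == normalize(gold_df[attr])
        # A missing prediction compares as <NA>; count it as incorrect.
        scores[attr] = float(matches.fillna(False).mean())
    return scores


scores = attribute_accuracy(pred, gold, ["brand", "model"])
print(scores)
```

Both frames must share the same row order and index for the comparison to be meaningful, which is the reason for renaming the predicted columns rather than reshaping the data.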


### LLM Extraction 

Use the JSON Schema in `input/gpu_product_schema.json` to guide LLM extraction from the `description` column. This step requires the `langchain-openai` package and an API key.


### Groq API Keys

Next, we need an API key from [groq.com](https://groq.com/) to use powerful open-source LLMs for free. Groq offers a free tier that allows API access, subject to rate limits.

After registering, you can create your key [here](https://console.groq.com/keys).

![image.png](./groc_limits.png)


In [None]:
# 4) Define chat model (requires dependencies)
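
One possible way to define the chat model is to point `ChatOpenAI` from `langchain-openai` at Groq's OpenAI-compatible endpoint. The sketch below rests on several assumptions that are not stated in this exercise: the key is stored in an environment variable named `GROQ_API_KEY`, the model name `llama-3.3-70b-versatile` is still offered on your account, and `groq_chat_kwargs` and `build_chat_model` are hypothetical helper names. Check the Groq console for the models currently available.

```python
import os

# Groq's OpenAI-compatible endpoint.
GROQ_BASE_URL = "https://api.groq.com/openai/v1"


def groq_chat_kwargs(model_name="llama-3.3-70b-versatile", api_key=None):
    """Hypothetical helper: collect the keyword arguments for ChatOpenAI.

    The model name is an assumption; the key falls back to the
    GROQ_API_KEY environment variable (also an assumption).
    """
    return {
        "model": model_name,
        "api_key": api_key or os.environ.get("GROQ_API_KEY"),
        "base_url": GROQ_BASE_URL,
        "temperature": 0,  # deterministic output is easier to evaluate
    }


def build_chat_model(**overrides):
    """Create the chat model; the import is deferred so this cell loads without the package."""
    from langchain_openai import ChatOpenAI

    return ChatOpenAI(**{**groq_chat_kwargs(), **overrides})


# Usage (requires langchain-openai and a valid key):
# chat = build_chat_model()
# print(chat.invoke("Reply with the single word: ready").content)
```

Keeping the key in an environment variable avoids committing it to the notebook.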



In [4]:
# 5) Set up the LLM extractor with the JSON Schema and run extraction
import json
from PyDI.informationextraction import LLMExtractor
with open('input/gpu_product_schema.json','r', encoding='utf-8') as f:
    schema_dict = json.load(f)

# TODO: Instantiate the LLMExtractor using the chat model and schema


# Evaluate LLM results on a small set of attributes
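
Before evaluating LLM output, it can help to see how a JSON Schema constrains a raw model reply. The sketch below is independent of `LLMExtractor` and does not describe how PyDI handles responses. `example_schema` is a made-up stand-in (the real contents of `gpu_product_schema.json` are not shown here), `parse_llm_json` is a hypothetical helper, and the raw reply is invented. The helper only checks property names and required keys; it does not validate value types.

```python
import json

# Made-up schema for illustration; the real schema file may differ.
example_schema = {
    "type": "object",
    "properties": {
        "brand": {"type": "string"},
        "memory_gb": {"type": "integer"},
        "memory_type": {"type": "string"},
    },
    "required": ["brand"],
}


def parse_llm_json(raw_text, schema):
    """Hypothetical helper: pull the JSON object out of a reply and align it with the schema.

    Returns (record, missing): the keys declared in the schema's properties,
    and the required keys that the reply did not provide.
    """
    required = list(schema.get("required", []))
    start, end = raw_text.find("{"), raw_text.rfind("}")
    if start == -1 or end <= start:
        return {}, required
    try:
        data = json.loads(raw_text[start:end + 1])
    except json.JSONDecodeError:
        return {}, required
    allowed = schema.get("properties", {})
    record = {key: value for key, value in data.items() if key in allowed}
    missing = [key for key in required if key not in record]
    return record, missing


# Invented reply with surrounding chatter and one undeclared key.
raw = 'Here is the result: {"brand": "NVIDIA", "memory_gb": 12, "color": "black"}'
record, missing = parse_llm_json(raw, example_schema)
print(record, missing)
```

Dropping undeclared keys keeps the extracted columns aligned with the gold columns, so the same per-attribute comparison used for the regex results applies.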
