# AI_EXTRACT Demo: Safety Data Sheet Extraction with Fine-Tuning

This demo shows how to use **AI_EXTRACT** to extract structured data from PDF documents, and how to **fine-tune** the arctic-extract model to improve accuracy.

## What You'll Learn
1. **Basic AI_EXTRACT** - Extract scalar values and tables from PDFs
2. **Fine-Tuning** - Create training data and train a custom model
3. **Using Fine-Tuned Models** - Improved extraction accuracy

## Demo Flow
- Part 1: Extract data WITHOUT fine-tuning
- Part 2: Create training dataset (scalar + table examples)
- Part 3: Fine-tune the model
- Part 4: Extract data WITH fine-tuned model

---
# Setup: Create Stage and Upload Documents

In [None]:
-- Create a stage with Snowflake server-side encryption (required for AI_EXTRACT)
CREATE STAGE IF NOT EXISTS SDS_DOCS 
    DIRECTORY = (ENABLE = TRUE AUTO_REFRESH = TRUE) 
    ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');
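
The stage starts out empty, so the PDFs need to be uploaded before the next cell can list them. One possible way, sketched here under assumptions, is the `PUT` command run from a client such as SnowSQL. The local folder `/tmp/sds_pdfs` is a hypothetical placeholder; the `sds/` prefix matches the file paths used in the rest of this demo.

```sql
-- Run from SnowSQL or another client that supports PUT.
-- /tmp/sds_pdfs is a hypothetical local folder; replace it with your own path.
-- AUTO_COMPRESS = FALSE stores the files as plain PDFs instead of gzipping them.
PUT file:///tmp/sds_pdfs/*.pdf @SDS_DOCS/sds/ AUTO_COMPRESS = FALSE;
```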

In [None]:
-- Refresh directory and list uploaded files
ALTER STAGE SDS_DOCS REFRESH;

SELECT relative_path, size, last_modified 
FROM DIRECTORY(@SDS_DOCS)
WHERE relative_path ILIKE '%.pdf';

---
# Part 1: Basic AI_EXTRACT (No Fine-Tuning)

AI_EXTRACT uses Snowflake's **arctic-extract** vision model, which "sees" each document directly and extracts structured data from it, so no separate OCR step is needed.

We'll extract both:
- **Scalar values**: product name, signal word, pH, etc.
- **Tables**: hazardous ingredients, physical properties

In [None]:
-- Extract BOTH scalar values AND tables from an SDS document
SELECT AI_EXTRACT(
    file => TO_FILE('@SDS_DOCS', 'sds/US001066-Clorox-Regular-Bleach1_3.pdf'),
    responseFormat => {
        'schema': {
            'type': 'object',
            'properties': {
                -- Scalar fields
                'product_name': {
                    'type': 'string',
                    '$comment': 'Product name from Section 1 Identification'
                },
                'manufacturer': {
                    'type': 'string',
                    '$comment': 'Manufacturer or supplier name'
                },
                'signal_word': {
                    'type': 'string',
                    '$comment': 'GHS signal word: Danger, Warning, or None'
                },
                'ph': {
                    'type': 'string',
                    '$comment': 'pH value from Section 9'
                },
                -- Table: Hazardous Ingredients
                'hazardous_ingredients': {
                    '$comment': 'Extract hazardous ingredients table from Section 3',
                    'type': 'object',
                    'properties': {
                        'ingredient_name': { 'type': 'array', 'items': { 'type': 'string' } },
                        'cas_number': { 'type': 'array', 'items': { 'type': 'string' } },
                        'percent_range': { 'type': 'array', 'items': { 'type': 'string' } }
                    }
                },
                -- Table: Physical Properties
                'physical_properties': {
                    '$comment': 'Extract physical/chemical properties from Section 9',
                    'type': 'object',
                    'properties': {
                        'property_name': { 'type': 'array', 'items': { 'type': 'string' } },
                        'value': { 'type': 'array', 'items': { 'type': 'string' } }
                    }
                }
            }
        }
    }
) AS extraction_result;

In [None]:
-- Extract from a different SDS document
SELECT AI_EXTRACT(
    file => TO_FILE('@SDS_DOCS', 'sds/Clorox-Commercial-Solutions®-Clorox®-Disinfecting-Wipes-Fresh-Scent.pdf'),
    responseFormat => {
        'schema': {
            'type': 'object',
            'properties': {
                'product_name': { 'type': 'string', '$comment': 'Product name from Section 1' },
                'manufacturer': { 'type': 'string', '$comment': 'Manufacturer name' },
                'signal_word': { 'type': 'string', '$comment': 'GHS signal word' },
                'ph': { 'type': 'string', '$comment': 'pH value' },
                'hazardous_ingredients': {
                    '$comment': 'Hazardous ingredients from Section 3',
                    'type': 'object',
                    'properties': {
                        'ingredient_name': { 'type': 'array', 'items': { 'type': 'string' } },
                        'cas_number': { 'type': 'array', 'items': { 'type': 'string' } },
                        'percent_range': { 'type': 'array', 'items': { 'type': 'string' } }
                    }
                },
                'physical_properties': {
                    '$comment': 'Physical properties from Section 9',
                    'type': 'object',
                    'properties': {
                        'property_name': { 'type': 'array', 'items': { 'type': 'string' } },
                        'value': { 'type': 'array', 'items': { 'type': 'string' } }
                    }
                }
            }
        }
    }
) AS extraction_result;

---
# Part 2: Create Training Dataset for Fine-Tuning

To improve extraction accuracy, we fine-tune the model on ground-truth examples: each example pairs a PDF with the schema to extract and the exact values we expect back.

## Training Data Format

**Prompt** - JSON Schema defining what to extract:
```json
{ "schema": { "type": "object", "properties": { ... } } }
```

**Response** - Ground truth values:
- Scalar: `{ "field": "value" }`
- Table (column-based): `{ "table": { "col1": ["v1","v2"], "col2": ["v1","v2"] } }`
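
Ground truth is often recorded one row at a time, while the response format above stores a table as one array per column. The sketch below shows one way to build the column-based JSON from row-based labels. The `labels` CTE and its `row_num` column are hypothetical; the values are taken from Training Example 1 in this demo.

```sql
-- Hypothetical row-based labels: one row per ingredient, in table order.
WITH labels AS (
    SELECT *
    FROM (VALUES
        (1, 'Sodium hypochlorite', '7681-52-9', '5 - 10'),
        (2, 'Sodium hydroxide',    '1310-73-2', '0.5 - 1.5')
    ) AS t (row_num, ingredient_name, cas_number, percent_range)
)
-- Aggregate each column into its own array, keeping the same row order in all three.
SELECT TO_JSON(
    OBJECT_CONSTRUCT(
        'hazardous_ingredients', OBJECT_CONSTRUCT(
            'ingredient_name', ARRAY_AGG(ingredient_name) WITHIN GROUP (ORDER BY row_num),
            'cas_number',      ARRAY_AGG(cas_number)      WITHIN GROUP (ORDER BY row_num),
            'percent_range',   ARRAY_AGG(percent_range)   WITHIN GROUP (ORDER BY row_num)
        )
    )
) AS response_json
FROM labels;
```

`ARRAY_AGG` skips NULL values, so a missing cell would shift the rest of that column out of alignment. Replacing NULLs with an empty string before aggregating keeps the rows lined up.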

In [None]:
-- Create training table
CREATE OR REPLACE TABLE SDS_TRAINING (
    f FILE,           -- Reference to PDF file
    p VARCHAR,        -- Prompt (JSON Schema)
    r VARCHAR         -- Response (ground truth)
);

In [None]:
-- Training Example 1: Clorox Bleach (scalar + tables)
INSERT INTO SDS_TRAINING (f, p, r)
SELECT 
    TO_FILE('@SDS_DOCS', 'sds/US001066-Clorox-Regular-Bleach1_3.pdf'),
    '{
        "schema": {
            "type": "object",
            "properties": {
                "product_name": { "type": "string", "$comment": "Product name from Section 1" },
                "manufacturer": { "type": "string", "$comment": "Manufacturer name" },
                "signal_word": { "type": "string", "$comment": "GHS signal word" },
                "ph": { "type": "string", "$comment": "pH value from Section 9" },
                "hazardous_ingredients": {
                    "$comment": "Hazardous ingredients from Section 3",
                    "type": "object",
                    "properties": {
                        "ingredient_name": { "type": "array", "items": { "type": "string" } },
                        "cas_number": { "type": "array", "items": { "type": "string" } },
                        "percent_range": { "type": "array", "items": { "type": "string" } }
                    }
                },
                "physical_properties": {
                    "$comment": "Physical properties from Section 9",
                    "type": "object",
                    "properties": {
                        "property_name": { "type": "array", "items": { "type": "string" } },
                        "value": { "type": "array", "items": { "type": "string" } }
                    }
                }
            }
        }
    }',
    '{
        "product_name": "Clorox Regular Bleach1",
        "manufacturer": "The Clorox Company",
        "signal_word": "Danger",
        "ph": "11.9",
        "hazardous_ingredients": {
            "ingredient_name": ["Sodium hypochlorite", "Sodium hydroxide"],
            "cas_number": ["7681-52-9", "1310-73-2"],
            "percent_range": ["5 - 10", "0.5 - 1.5"]
        },
        "physical_properties": {
            "property_name": ["pH", "Physical state", "Color", "Odor", "Relative density"],
            "value": ["11.9", "Liquid", "Clear, light yellow", "Chlorine", "1.085"]
        }
    }';

In [None]:
-- Training Example 2: Disinfecting Wipes (scalar + tables)
INSERT INTO SDS_TRAINING (f, p, r)
SELECT 
    TO_FILE('@SDS_DOCS', 'sds/Clorox-Commercial-Solutions®-Clorox®-Disinfecting-Wipes-Fresh-Scent.pdf'),
    '{
        "schema": {
            "type": "object",
            "properties": {
                "product_name": { "type": "string", "$comment": "Product name from Section 1" },
                "manufacturer": { "type": "string", "$comment": "Manufacturer name" },
                "signal_word": { "type": "string", "$comment": "GHS signal word" },
                "ph": { "type": "string", "$comment": "pH value from Section 9" },
                "hazardous_ingredients": {
                    "$comment": "Hazardous ingredients from Section 3",
                    "type": "object",
                    "properties": {
                        "ingredient_name": { "type": "array", "items": { "type": "string" } },
                        "cas_number": { "type": "array", "items": { "type": "string" } },
                        "percent_range": { "type": "array", "items": { "type": "string" } }
                    }
                },
                "physical_properties": {
                    "$comment": "Physical properties from Section 9",
                    "type": "object",
                    "properties": {
                        "property_name": { "type": "array", "items": { "type": "string" } },
                        "value": { "type": "array", "items": { "type": "string" } }
                    }
                }
            }
        }
    }',
    '{
        "product_name": "Clorox Commercial Solutions Clorox Disinfecting Wipes Fresh Scent",
        "manufacturer": "The Clorox Company",
        "signal_word": "Warning",
        "ph": "6 - 7.5",
        "hazardous_ingredients": {
            "ingredient_name": ["Alkyl dimethyl benzyl ammonium chloride", "Alkyl dimethyl ethylbenzyl ammonium chloride"],
            "cas_number": ["68424-85-1", "68956-79-6"],
            "percent_range": ["0.1 - 1", "0.1 - 1"]
        },
        "physical_properties": {
            "property_name": ["pH", "Relative density", "Water Solubility", "Color", "Odor", "Physical state"],
            "value": ["6 - 7.5 (liquid)", "~1.0 (liquid)", "Completely soluble", "Clear White", "Fruity Apple Floral", "Pre-Moistened Towelette"]
        }
    }';

In [None]:
-- Training Example 3: Formula 409 (scalar + tables)
INSERT INTO SDS_TRAINING (f, p, r)
SELECT 
    TO_FILE('@SDS_DOCS', 'sds/SDS-US-Formula-409®-Multi-Surface-Cleaner-English-2022.pdf'),
    '{
        "schema": {
            "type": "object",
            "properties": {
                "product_name": { "type": "string", "$comment": "Product name from Section 1" },
                "manufacturer": { "type": "string", "$comment": "Manufacturer name" },
                "signal_word": { "type": "string", "$comment": "GHS signal word" },
                "ph": { "type": "string", "$comment": "pH value from Section 9" },
                "hazardous_ingredients": {
                    "$comment": "Hazardous ingredients from Section 3",
                    "type": "object",
                    "properties": {
                        "ingredient_name": { "type": "array", "items": { "type": "string" } },
                        "cas_number": { "type": "array", "items": { "type": "string" } },
                        "percent_range": { "type": "array", "items": { "type": "string" } }
                    }
                },
                "physical_properties": {
                    "$comment": "Physical properties from Section 9",
                    "type": "object",
                    "properties": {
                        "property_name": { "type": "array", "items": { "type": "string" } },
                        "value": { "type": "array", "items": { "type": "string" } }
                    }
                }
            }
        }
    }',
    '{
        "product_name": "Formula 409 Multi-Surface Cleaner",
        "manufacturer": "The Clorox Company",
        "signal_word": "None",
        "ph": "9 - 11.5",
        "hazardous_ingredients": {
            "ingredient_name": ["2-Butoxyethanol", "Ethanolamine"],
            "cas_number": ["111-76-2", "141-43-5"],
            "percent_range": ["1 - 5", "0.5 - 1.5"]
        },
        "physical_properties": {
            "property_name": ["pH", "Physical state", "Color", "Odor", "Relative density"],
            "value": ["9 - 11.5", "Liquid", "Green", "Floral Citrus", "1.00 - 1.02"]
        }
    }';

In [None]:
-- View training data
SELECT 
    FL_GET_RELATIVE_PATH(f) AS file_path,
    PARSE_JSON(r):product_name::STRING AS product,
    PARSE_JSON(r):signal_word::STRING AS signal_word,
    PARSE_JSON(r):ph::STRING AS ph,
    ARRAY_SIZE(PARSE_JSON(r):hazardous_ingredients:ingredient_name) AS num_ingredients,
    ARRAY_SIZE(PARSE_JSON(r):physical_properties:property_name) AS num_properties
FROM SDS_TRAINING;
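
Before building the dataset, it can help to check the training rows for formatting mistakes. The query below is a minimal sketch, assuming that a well-formed table response has the same number of entries in every column. It lists rows whose prompt or response is not valid JSON, or whose column arrays differ in length.

```sql
-- Parse both JSON columns; TRY_PARSE_JSON returns NULL instead of failing on bad input.
WITH parsed AS (
    SELECT
        FL_GET_RELATIVE_PATH(f) AS file_path,
        TRY_PARSE_JSON(p) AS prompt_json,
        TRY_PARSE_JSON(r) AS response_json
    FROM SDS_TRAINING
),
sizes AS (
    SELECT
        file_path,
        prompt_json IS NULL   AS bad_prompt,
        response_json IS NULL AS bad_response,
        ARRAY_SIZE(response_json:hazardous_ingredients:ingredient_name) AS n_ingredient_name,
        ARRAY_SIZE(response_json:hazardous_ingredients:cas_number)      AS n_cas_number,
        ARRAY_SIZE(response_json:hazardous_ingredients:percent_range)   AS n_percent_range,
        ARRAY_SIZE(response_json:physical_properties:property_name)     AS n_property_name,
        ARRAY_SIZE(response_json:physical_properties:value)             AS n_value
    FROM parsed
)
-- IS DISTINCT FROM also flags the case where one column is missing (NULL size).
SELECT *
FROM sizes
WHERE bad_prompt
   OR bad_response
   OR n_ingredient_name IS DISTINCT FROM n_cas_number
   OR n_ingredient_name IS DISTINCT FROM n_percent_range
   OR n_property_name   IS DISTINCT FROM n_value;
```

For the three examples inserted above, this query should return no rows.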

---
# Part 3: Fine-Tune the Model

We'll create a Dataset from our training data, then fine-tune arctic-extract.

**Note:** If the model already exists, skip the FINETUNE cell or use a different model name.

In [None]:
-- Create dataset for fine-tuning
CREATE OR REPLACE DATASET SDS_DATASET;

ALTER DATASET SDS_DATASET
ADD VERSION 'v1' FROM (
    SELECT 
        FL_GET_STAGE(f) || '/' || FL_GET_RELATIVE_PATH(f) AS "file",
        p AS "prompt",
        r AS "response"
    FROM SDS_TRAINING
);

In [None]:
-- Start the fine-tuning job
-- NOTE: Change the model name if 'sds_extract_demo' already exists
SELECT SNOWFLAKE.CORTEX.FINETUNE(
    'CREATE',
    'sds_extract_demo',                        -- Name for the fine-tuned model
    'arctic-extract',                          -- Base model to fine-tune
    'snow://dataset/SDS_DATASET/versions/v1'   -- Training dataset version created above
);

In [None]:
-- Check fine-tuning status (run periodically until status = SUCCESS)
SELECT SNOWFLAKE.CORTEX.FINETUNE('SHOW');
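
`SHOW` lists every fine-tuning job you have access to. To look at a single job, the `DESCRIBE` operation can be used instead, as sketched below. The job ID shown is a placeholder, not a real value.

```sql
-- '<finetune_job_id>' is a placeholder: use the job ID returned by the
-- FINETUNE('CREATE', ...) call or listed by FINETUNE('SHOW').
SELECT SNOWFLAKE.CORTEX.FINETUNE('DESCRIBE', '<finetune_job_id>');
```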

---
# Part 4: Use the Fine-Tuned Model

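Once fine-tuning completes (status = SUCCESS), pass the fine-tuned model's name to AI_EXTRACT through the `model` parameter.

**Using existing model:** The cells below use a previously trained model, `DEMODB.PUBLIC.SDS_TABLE_EXTRACT`. To use the model trained in Part 3 instead, replace it with that model's name (`sds_extract_demo`, qualified with your database and schema if needed).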

In [None]:
-- Extract using fine-tuned model (scalar + tables)
SELECT AI_EXTRACT(
    model => 'DEMODB.PUBLIC.SDS_TABLE_EXTRACT',
    file => TO_FILE('@SDS_DOCS', 'sds/US001066-Clorox-Regular-Bleach1_3.pdf'),
    responseFormat => {
        'schema': {
            'type': 'object',
            'properties': {
                'product_name': { 'type': 'string', '$comment': 'Product name from Section 1' },
                'manufacturer': { 'type': 'string', '$comment': 'Manufacturer name' },
                'signal_word': { 'type': 'string', '$comment': 'GHS signal word' },
                'ph': { 'type': 'string', '$comment': 'pH value from Section 9' },
                'hazardous_ingredients': {
                    '$comment': 'Hazardous ingredients from Section 3',
                    'type': 'object',
                    'properties': {
                        'ingredient_name': { 'type': 'array', 'items': { 'type': 'string' } },
                        'cas_number': { 'type': 'array', 'items': { 'type': 'string' } },
                        'percent_range': { 'type': 'array', 'items': { 'type': 'string' } }
                    }
                },
                'physical_properties': {
                    '$comment': 'Physical properties from Section 9',
                    'type': 'object',
                    'properties': {
                        'property_name': { 'type': 'array', 'items': { 'type': 'string' } },
                        'value': { 'type': 'array', 'items': { 'type': 'string' } }
                    }
                }
            }
        }
    }
) AS finetuned_result;

In [None]:
-- Process all SDS documents with fine-tuned model
SELECT 
    relative_path AS source_file,
    AI_EXTRACT(
        model => 'DEMODB.PUBLIC.SDS_TABLE_EXTRACT',
        file => TO_FILE('@SDS_DOCS', relative_path),
        responseFormat => {
            'schema': {
                'type': 'object',
                'properties': {
                    'product_name': { 'type': 'string', '$comment': 'Product name from Section 1' },
                    'manufacturer': { 'type': 'string', '$comment': 'Manufacturer name' },
                    'signal_word': { 'type': 'string', '$comment': 'GHS signal word' },
                    'ph': { 'type': 'string', '$comment': 'pH value' },
                    'hazardous_ingredients': {
                        '$comment': 'Hazardous ingredients from Section 3',
                        'type': 'object',
                        'properties': {
                            'ingredient_name': { 'type': 'array', 'items': { 'type': 'string' } },
                            'cas_number': { 'type': 'array', 'items': { 'type': 'string' } },
                            'percent_range': { 'type': 'array', 'items': { 'type': 'string' } }
                        }
                    },
                    'physical_properties': {
                        '$comment': 'Physical properties from Section 9',
                        'type': 'object',
                        'properties': {
                            'property_name': { 'type': 'array', 'items': { 'type': 'string' } },
                            'value': { 'type': 'array', 'items': { 'type': 'string' } }
                        }
                    }
                }
            }
        }
    ) AS extraction_result
FROM DIRECTORY(@SDS_DOCS)
WHERE relative_path ILIKE 'sds/%.pdf';

In [None]:
-- Parse results into readable format
SELECT 
    relative_path AS source_file,
    result:response:product_name::STRING AS product_name,
    result:response:manufacturer::STRING AS manufacturer,
    result:response:signal_word::STRING AS signal_word,
    result:response:ph::STRING AS ph,
    result:response:hazardous_ingredients AS ingredients_table,
    result:response:physical_properties AS properties_table
FROM (
    SELECT 
        relative_path,
        AI_EXTRACT(
            model => 'DEMODB.PUBLIC.SDS_TABLE_EXTRACT',
            file => TO_FILE('@SDS_DOCS', relative_path),
            responseFormat => {
                'schema': {
                    'type': 'object',
                    'properties': {
                        'product_name': { 'type': 'string', '$comment': 'Product name' },
                        'manufacturer': { 'type': 'string', '$comment': 'Manufacturer' },
                        'signal_word': { 'type': 'string', '$comment': 'Signal word' },
                        'ph': { 'type': 'string', '$comment': 'pH value' },
                        'hazardous_ingredients': {
                            'type': 'object',
                            'properties': {
                                'ingredient_name': { 'type': 'array', 'items': { 'type': 'string' } },
                                'cas_number': { 'type': 'array', 'items': { 'type': 'string' } },
                                'percent_range': { 'type': 'array', 'items': { 'type': 'string' } }
                            }
                        },
                        'physical_properties': {
                            'type': 'object',
                            'properties': {
                                'property_name': { 'type': 'array', 'items': { 'type': 'string' } },
                                'value': { 'type': 'array', 'items': { 'type': 'string' } }
                            }
                        }
                    }
                }
            }
        ) AS result
    FROM DIRECTORY(@SDS_DOCS)
    WHERE relative_path ILIKE 'sds/%.pdf'
);
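
The table fields come back in the same column-based layout used for training, which is awkward to join or filter. The sketch below turns the hazardous ingredients into one row per ingredient. It assumes the extraction output has been saved to a hypothetical table `SDS_EXTRACTED` with columns `source_file` and `result`, so AI_EXTRACT does not have to run again for each downstream query.

```sql
-- SDS_EXTRACTED is a hypothetical table with columns (source_file, result),
-- where result holds the AI_EXTRACT output saved from the query above.
SELECT
    e.source_file,
    e.result:response:product_name::STRING AS product_name,
    i.index AS row_num,
    i.value::STRING AS ingredient_name,
    -- Look up the matching position in the other column arrays.
    GET(e.result:response:hazardous_ingredients:cas_number, i.index)::STRING    AS cas_number,
    GET(e.result:response:hazardous_ingredients:percent_range, i.index)::STRING AS percent_range
FROM SDS_EXTRACTED e,
     LATERAL FLATTEN(input => e.result:response:hazardous_ingredients:ingredient_name) i
ORDER BY e.source_file, row_num;
```

Documents with an empty or missing ingredients table produce no rows here, because `FLATTEN` drops inputs that have nothing to expand unless `OUTER => TRUE` is set.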

---
# Summary

## What We Covered

1. **AI_EXTRACT** extracts structured data from PDFs using vision (no OCR)
2. **JSON Schema format** supports both scalar fields and tables
3. **Fine-tuning** improves accuracy for your specific document types
4. **Training data** must follow a fixed format:
   - Prompt: JSON Schema defining fields to extract
   - Response: Ground truth with column-based arrays for tables

## Key Functions

| Function | Purpose |
|----------|----------|
| `AI_EXTRACT()` | Extract structured data from documents |
| `TO_FILE()` | Reference a file in a stage |
| `SNOWFLAKE.CORTEX.FINETUNE()` | Create/manage fine-tuned models |
| `FL_GET_STAGE()` / `FL_GET_RELATIVE_PATH()` | Build dataset paths |