# File Validation with AI Agents

This notebook demonstrates TWO approaches to file validation:
1. **Traditional:** Rule-based magic byte checking
2. **Agent-Based:** LLM reasoning about file quality

## Why Agent-Based Validation?

Traditional validation: fast but rigid
- ✓ Checks magic bytes in well under a millisecond
- ✗ Can't reason about ambiguous cases
- ✗ Rejects unusual but valid formats

Agent validation: slower but more flexible
- ✓ Reasons through ambiguous cases
- ✓ Combines evidence from multiple tools
- ✓ Explains each decision
- ✓ Suits institutional curation, where a documented rationale matters more than latency
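
As a rough illustration of what rule-based magic byte checking involves, the sketch below compares a file's leading bytes against a few known signatures. The function name `detect_type` and the signature table are assumptions for illustration; this is not the `FileValidator` implementation used in this notebook.

```python
from typing import Optional

# Leading-byte signatures (illustrative subset, not FileValidator's table).
SIGNATURES = [
    (b"CDF\x01", "netcdf"),          # NetCDF classic format
    (b"CDF\x02", "netcdf"),          # NetCDF 64-bit offset format
    (b"\x89HDF\r\n\x1a\n", "hdf5"),  # HDF5 (also the container for NetCDF-4)
]


def detect_type(header: bytes) -> Optional[str]:
    """Return a format label for the leading bytes, or None if unknown."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    # HTML has no fixed magic number, so fall back to a simple text heuristic.
    text = header.lstrip().lower()
    if text.startswith(b"<!doctype html") or text.startswith(b"<html"):
        return "html"
    return None
```

In practice the header would come from something like `open(path, "rb").read(16)`. A check of this kind is cheap, but it can only answer "known" or "unknown", which is the rigidity described above.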

In [1]:
# Setup
import sys
from pathlib import Path
sys.path.insert(0, str(Path.cwd().parent / 'lib'))

from file_validator import FileValidator, quick_validate
from ollama_client import OllamaClient
from quality_agent import QualityAssessmentAgent
import config

## Part 1: Traditional Validation (Baseline)

In [2]:
# Create sample files
sample_dir = Path("sample_data")
sample_dir.mkdir(exist_ok=True)

# Valid NetCDF file (from setup)
valid_file = sample_dir / "ocean_temperature.nc"

# Fake file (HTML error page disguised as .nc)
fake_file = sample_dir / "fake_data.nc"
with open(fake_file, 'w') as f:
    f.write("""<!DOCTYPE html>
<html>
<head><title>404 Not Found</title></head>
<body>
<h1>Error: File Not Found</h1>
<p>The requested file could not be found.</p>
</body>
</html>""")

In [3]:
# Traditional validation
validator = FileValidator()

print("Traditional Validation Results:")
print("=" * 60)

for filepath in [valid_file, fake_file]:
    result = validator.check_file_signature(filepath)
    print(f"\nFile: {filepath.name}")
    print(f"  Expected: {result['expected_type']}")
    print(f"  Detected: {result['detected_type']}")
    print(f"  Valid: {'✓' if result['is_valid'] else '✗'}")
    if result['issues']:
        print(f"  Issues: {result['issues']}")

Traditional Validation Results:

File: ocean_temperature.nc
  Expected: netcdf
  Detected: netcdf
  Valid: ✓

File: fake_data.nc
  Expected: netcdf
  Detected: html
  Valid: ✗
  Issues: ['File is HTML (likely download error page)']


## Part 2: Agent-Based Validation

Watch the agent reason through file quality step-by-step!

In [4]:
# Initialize Ollama client
print("Initializing AI Agent...")
ollama = OllamaClient()

# Test connection
ollama.test_model()

Initializing AI Agent...
✓ Connected to Ollama at http://localhost:11434
  Available models: llama3.2:3b

Testing model: llama3.2:3b
Test prompt: What is 2+2? Answer with just the number.
Response: 4
✓ Model is working!


True

In [5]:
# Create quality assessment agent
print("\nCreating Quality Assessment Agent...")
quality_agent = QualityAssessmentAgent(ollama)
print("✓ Agent ready with tools:\n")


Creating Quality Assessment Agent...
  [QualityAgent] Registered tool: check_signature
  [QualityAgent] Registered tool: get_file_info
  [QualityAgent] Registered tool: inspect_content
✓ Agent ready with tools:
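
The registration messages above suggest the agent keeps a table mapping tool names to functions. A minimal sketch of that pattern follows, assuming a hypothetical `ToolRegistry` class; the real `QualityAssessmentAgent` internals are not shown in this notebook and may differ.

```python
from typing import Any, Callable, Dict, List


class ToolRegistry:
    """Hypothetical name -> callable table; not the QualityAssessmentAgent API."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str, func: Callable[..., Any]) -> None:
        """Make `func` callable under `name`."""
        self._tools[name] = func

    def call(self, name: str, **params: Any) -> Any:
        """Invoke a registered tool with keyword parameters."""
        if name not in self._tools:
            # Return an error payload rather than raising, so an LLM-driven
            # loop can see the mistake and pick a different tool.
            return {"error": f"unknown tool: {name}"}
        return self._tools[name](**params)

    def names(self) -> List[str]:
        """List registered tool names in sorted order."""
        return sorted(self._tools)
```

Keeping tools behind a single `call` method means the model only ever has to emit a name and a parameter dictionary.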



### Test 1: Valid NetCDF File

Watch how the agent uses multiple tools to assess quality:

In [6]:
print("\n" + "=" * 70)
print("AGENT ASSESSMENT: Valid NetCDF File")
print("=" * 70)

decision = quality_agent.assess_file(str(valid_file))

print("\n" + "=" * 70)
print("FINAL DECISION")
print("=" * 70)
print(f"Decision: {decision.decision}")
print(f"Confidence: {decision.confidence:.2f}")
print(f"Reasoning: {decision.reasoning}")
print(f"Processing time: {decision.processing_time:.1f}s")
print(f"Steps taken: {len(decision.thoughts)}")


AGENT ASSESSMENT: Valid NetCDF File

[QualityAgent] Starting analysis...

[QualityAgent] Step 1: Thinking...
[QualityAgent] Using tool: get_file_info
  Parameters: {'filepath': 'sample_data/ocean_temperature.nc'}
  Result: {'filename': 'ocean_temperature.nc', 'extension': '.nc', 'size_bytes': 34266, 'size_mb': 0.03}

[QualityAgent] Step 2: Thinking...
[QualityAgent] Note: Already called get_file_info, using cached result
  Result: {'filename': 'ocean_temperature.nc', 'extension': '.nc', 'size_bytes': 34266, 'size_mb': 0.03}

[QualityAgent] Step 3: Thinking...
[QualityAgent] Using tool: check_signature
  Parameters: {'filepath': 'sample_data/ocean_temperature.nc'}
  Result: {'expected_type': 'netcdf', 'detected_type': 'netcdf', 'is_valid': True, 'issues': [], 'size': '33.46 KB'}

[QualityAgent] Step 4: Thinking...

[QualityAgent] Decision reached!
  Decision: ACCEPT
  Confidence: 1.00

FINAL DECISION
Decision: ACCEPT
Confidence: 1.00
Reasoning: The file matches its expected type and ha
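
Step 2 of the trace above shows a repeated `get_file_info` call being answered from a cache. One way such caching could work is to key results on the tool name plus its serialized parameters. `CachedToolRunner` below is a hypothetical helper, not the agent's actual code.

```python
import json
from typing import Any, Callable, Dict, Tuple


class CachedToolRunner:
    """Hypothetical wrapper that runs each (tool, params) pair at most once."""

    def __init__(self, tools: Dict[str, Callable[..., Any]]) -> None:
        self._tools = tools
        self._cache: Dict[Tuple[str, str], Any] = {}

    def run(self, name: str, params: Dict[str, Any]) -> Tuple[Any, bool]:
        """Return (result, was_cached) for the requested tool call."""
        # sort_keys makes the cache key independent of parameter order.
        key = (name, json.dumps(params, sort_keys=True))
        if key in self._cache:
            return self._cache[key], True
        result = self._tools[name](**params)
        self._cache[key] = result
        return result, False
```

Small models tend to repeat a tool call they have already made, so returning the cached result keeps the loop moving without redoing the work.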

### Test 2: HTML Error Page (Disguised as .nc)

Watch the agent detect the deception:

In [7]:
print("\n" + "=" * 70)
print("AGENT ASSESSMENT: Fake File (HTML disguised as .nc)")
print("=" * 70)

decision = quality_agent.assess_file(str(fake_file))

print("\n" + "=" * 70)
print("FINAL DECISION")
print("=" * 70)
print(f"Decision: {decision.decision}")
print(f"Confidence: {decision.confidence:.2f}")
print(f"Reasoning: {decision.reasoning}")
print(f"Processing time: {decision.processing_time:.1f}s")
print(f"Steps taken: {len(decision.thoughts)}")


AGENT ASSESSMENT: Fake File (HTML disguised as .nc)

[QualityAgent] Starting analysis...

[QualityAgent] Step 1: Thinking...
[QualityAgent] Using tool: get_file_info
  Parameters: {'filepath': 'sample_data/fake_data.nc'}
  Result: {'filename': 'fake_data.nc', 'extension': '.nc', 'size_bytes': 164, 'size_mb': 0.0}

[QualityAgent] Step 2: Thinking...
[QualityAgent] Using tool: check_signature
  Parameters: {'filepath': 'sample_data/fake_data.nc'}
  Result: {'expected_type': 'netcdf', 'detected_type': 'html', 'is_valid': False, 'issues': ['File is HTML (likely download error page)'], 'size': '164.00 B'}

[QualityAgent] Step 3: Thinking...
[QualityAgent] Using tool: inspect_content
  Parameters: {'filepath': 'sample_data/fake_data.nc'}
  Result: {'appears_text': True, 'appears_html': True, 'sample_text': '<!DOCTYPE html>\n<html>\n<head><title>404 Not Found</title></head>\n<body>\n<h1>Error: File Not Found</h1>\n<p>The requested file could not

[QualityAgent] Step 4: Thinking...

[QualityA

### Test 3: Detailed Reasoning Trace

Inspect the agent's thought process:

In [8]:
print("\nAgent Reasoning Trace:")
print("=" * 70)

for i, thought in enumerate(decision.thoughts, 1):
    print(f"\nStep {i}:")
    print(f"  Action: {thought.action}")
    if thought.tool_name:
        print(f"  Tool: {thought.tool_name}")
        print(f"  Parameters: {thought.tool_params}")
        print(f"  Result: {str(thought.result)[:200]}...")
    print(f"  Reasoning: {thought.reasoning[:200]}...")


Agent Reasoning Trace:

Step 1:
  Action: tool_call
  Tool: get_file_info
  Parameters: {'filepath': 'sample_data/fake_data.nc'}
  Result: {'filename': 'fake_data.nc', 'extension': '.nc', 'size_bytes': 164, 'size_mb': 0.0}...
  Reasoning: To start, I will use the 'get_file_info' tool to check the size and extension of the file.

USE_TOOL: get_file_info
PARAMS: {"filepath": "sample_data/fake_data.nc"}

This will provide me with informat...

Step 2:
  Action: tool_call
  Tool: check_signature
  Parameters: {'filepath': 'sample_data/fake_data.nc'}
  Result: {'expected_type': 'netcdf', 'detected_type': 'html', 'is_valid': False, 'issues': ['File is HTML (likely download error page)'], 'size': '164.00 B'}...
  Reasoning: Given the context and the results gathered so far, I will use the "check_signature" tool to verify the file type.

USE_TOOL: check_signature
PARAMS: {"filepath": "sample_data/fake_data.nc"}

This tool...

Step 3:
  Action: tool_call
  Tool: inspect_content
  Parameters: {'
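
The reasoning text above embeds tool requests as `USE_TOOL:` and `PARAMS:` lines. A minimal sketch of extracting them follows, assuming that two-line format with single-line JSON parameters. The function `parse_tool_call` is hypothetical, and the agent's real parser may differ.

```python
import json
import re
from typing import Any, Dict, Optional, Tuple

# Each directive is expected on its own line, as in the trace.
TOOL_RE = re.compile(r"^USE_TOOL:[ \t]*(\w+)[ \t]*$", re.MULTILINE)
PARAMS_RE = re.compile(r"^PARAMS:[ \t]*(\{.*\})[ \t]*$", re.MULTILINE)


def parse_tool_call(text: str) -> Optional[Tuple[str, Dict[str, Any]]]:
    """Return (tool_name, params) from model output, or None if absent or malformed."""
    tool = TOOL_RE.search(text)
    if tool is None:
        return None
    params: Dict[str, Any] = {}
    # Only look for PARAMS after the USE_TOOL line.
    match = PARAMS_RE.search(text, tool.end())
    if match is not None:
        try:
            params = json.loads(match.group(1))
        except json.JSONDecodeError:
            return None  # Treat unparseable parameters as no tool call.
    return tool.group(1), params
```

Returning `None` on malformed input lets the calling loop re-prompt the model instead of running a tool with wrong arguments.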

## Part 3: Comparison - Traditional vs Agent

### Speed Comparison

In [9]:
import time

print("Speed Comparison:")
print("=" * 60)

# Traditional: average over 10 runs with a monotonic, high-resolution clock
start = time.perf_counter()
for _ in range(10):
    validator.check_file_signature(valid_file)
traditional_time = (time.perf_counter() - start) / 10

# Agent time comes from the most recent assessment (the fake file in Test 2)
print(f"Traditional validation: {traditional_time*1000:.2f}ms per file")
print(f"Agent validation: ~{decision.processing_time:.1f}s per file")
print(f"\nTradeoff: Agent is ~{decision.processing_time/traditional_time:.0f}x slower")
print("But provides reasoning and handles ambiguous cases!")

Speed Comparison:
Traditional validation: 0.04ms per file
Agent validation: ~27.4s per file

Tradeoff: Agent is ~613628x slower
But provides reasoning and handles ambiguous cases!
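
The averaging loop above can be wrapped in a small helper so that any validator is timed the same way. This is a minimal sketch, assuming a hypothetical `mean_seconds` function. It uses `time.perf_counter()`, a monotonic high-resolution clock suited to short intervals, and the lambda is only a stand-in for `validator.check_file_signature(valid_file)`.

```python
import time
from typing import Callable


def mean_seconds(func: Callable[[], object], repeats: int = 10) -> float:
    """Average wall-clock seconds per call over `repeats` runs."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    start = time.perf_counter()
    for _ in range(repeats):
        func()
    return (time.perf_counter() - start) / repeats


# Stand-in workload; in the notebook this would be the validator call.
per_call = mean_seconds(lambda: sum(range(1000)), repeats=10)
print(f"{per_call * 1000:.3f} ms per call")
```

With sub-millisecond calls, ten repeats give a noisy average, so a larger `repeats` value would make the slowdown ratio more stable.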


### Accuracy Comparison

Both catch the fake file. The traditional validator returns a fixed issue string, while the agent also records how it reached its verdict:

In [10]:
print("\nAccuracy Comparison:")
print("=" * 60)
print("\nTraditional:")
print("  ✓ Detects HTML")
print("  ✗ Reports only a fixed issue string")
print("  ✗ No reasoning trace")
print("\nAgent:")
print("  ✓ Detects HTML")
print("  ✓ Explains why the file is invalid")
print("  ✓ Shows reasoning steps")
print("  ✓ Can handle ambiguous cases")


Accuracy Comparison:

Traditional:
  ✓ Detects HTML
  ✗ Reports only a fixed issue string
  ✗ No reasoning trace

Agent:
  ✓ Detects HTML
  ✓ Explains why the file is invalid
  ✓ Shows reasoning steps
  ✓ Can handle ambiguous cases


## Part 4: Demo - Ambiguous Case

Create a file that's unusual but might be valid:

In [11]:
# Create unusual file - custom binary format
import struct

unusual_file = sample_dir / "custom_format.dat"
with open(unusual_file, 'wb') as f:
    # Write an 11-byte custom header (9-byte tag plus two trailing bytes)
    f.write(b'CUSTOMFMT\x01\x00')
    # Write 1000 little-endian float32 values (4000 bytes)
    for i in range(1000):
        f.write(struct.pack('<f', i * 0.1))

print("Created unusual custom format file")
print(f"Size: {unusual_file.stat().st_size} bytes")

Created unusual custom format file
Size: 4011 bytes


In [12]:
print("\nTraditional validator:")
result = validator.check_file_signature(unusual_file)
print(f"  Decision: {'ACCEPT' if result['is_valid'] else 'REJECT'}")
print(f"  Detected type: {result['detected_type']}")
print(f"  Issues: {result['issues']}")


Traditional validator:
  Decision: REJECT
  Detected type: None
  Issues: ['Unknown file type']


In [13]:
print("\n" + "=" * 70)
print("AGENT ASSESSMENT: Unusual Format")
print("=" * 70)

decision = quality_agent.assess_file(str(unusual_file))

print("\n" + "=" * 70)
print("FINAL DECISION")
print("=" * 70)
print(f"Decision: {decision.decision}")
print(f"Confidence: {decision.confidence:.2f}")
print(f"Reasoning: {decision.reasoning}")
print("\nNotice: Agent may suggest MANUAL_REVIEW for unknown formats,")
print("rather than auto-rejecting potentially valid data!")


AGENT ASSESSMENT: Unusual Format

[QualityAgent] Starting analysis...

[QualityAgent] Step 1: Thinking...
[QualityAgent] Using tool: get_file_info
  Parameters: {'filepath': 'sample_data/custom_format.dat'}
  Result: {'filename': 'custom_format.dat', 'extension': '.dat', 'size_bytes': 4011, 'size_mb': 0.0}

[QualityAgent] Step 2: Thinking...
[QualityAgent] Using tool: check_signature
  Parameters: {'filepath': 'sample_data/custom_format.dat'}
  Result: {'expected_type': None, 'detected_type': None, 'is_valid': False, 'issues': ['Unknown file type'], 'size': '3.92 KB'}

[QualityAgent] Step 3: Thinking...
[QualityAgent] Using tool: inspect_content
  Parameters: {'filepath': 'sample_data/custom_format.dat'}
  Result: {'appears_text': True, 'appears_html': False, 'sample_text': 'CUSTOMFMT\x01\x00\x00\x00\x00\x00���=��L>���>���>\x00\x00\x00?��\x19?333?��L?fff?\x00\x00�?�̌?���?ff�?33�?\x00\x00�?���?���?ff�?33�?\x00\x

[QualityAgent] Step 4: Thinking...

[QualityAgent] Decision reached!
  De
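
In the trace above, `inspect_content` reports `appears_text: True` even though most of the file is packed floats. How that flag is computed is not shown here. One stricter heuristic, sketched below with a hypothetical `looks_like_text` function and an assumed 0.9 threshold, is to measure the fraction of printable bytes in the sampled block.

```python
def looks_like_text(sample: bytes, threshold: float = 0.9) -> bool:
    """Heuristic: True if at least `threshold` of the bytes are printable.

    "Printable" here means ASCII 32-126 plus tab, newline and carriage return.
    """
    if not sample:
        return False
    printable = sum(1 for b in sample if 32 <= b <= 126 or b in (9, 10, 13))
    return printable / len(sample) >= threshold
```

A ratio like this gives the agent one more piece of evidence to weigh. It is not a verdict on its own, since some binary payloads happen to contain many printable bytes.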

## Key Insights

### When to Use Traditional Validation:
- Real-time user uploads (need <1s response)
- High-confidence common formats (NetCDF, HDF5)
- Batch processing thousands of files

### When to Use Agent Validation:
- **Institutional curation** (overnight processing is fine)
- **Unusual formats** (need reasoning, not rules)
- **Ambiguous cases** (better than auto-reject)
- **Explanation required** (for research data repositories)

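### For HPC Data Centers:
**Use Hybrid Approach:**
1. Fast validation for obvious cases (assumed here to be roughly 90% of files)
2. Agent validation for the ambiguous remainder (roughly 10%)
3. Result: sub-millisecond checks for most files, with reasoned decisions where rules fall short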
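
The hybrid approach above can be sketched as a two-stage router: run the fast signature check first and escalate to the agent only when the result is inconclusive. The function `hybrid_validate`, the escalation rule, and the returned labels are assumptions for illustration. The two callables stand in for `validator.check_file_signature` and a wrapper around `quality_agent.assess_file`.

```python
from typing import Any, Callable, Dict


def hybrid_validate(
    filepath: str,
    fast_check: Callable[[str], Dict[str, Any]],
    agent_check: Callable[[str], str],
) -> Dict[str, str]:
    """Route a file through the fast check, escalating only unknown types."""
    result = fast_check(filepath)
    if result["is_valid"]:
        return {"decision": "ACCEPT", "route": "fast"}
    if result["detected_type"] is not None:
        # A recognized but wrong type (e.g. HTML in a .nc file) is a clear reject.
        return {"decision": "REJECT", "route": "fast"}
    # Unknown signature: ambiguous, so pay for the slower agent assessment.
    return {"decision": agent_check(filepath), "route": "agent"}
```

Under this rule, the three sample files in this notebook would take different routes: the valid NetCDF file and the HTML page are settled by the fast path, and only `custom_format.dat` reaches the agent.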

## Next Steps

- **Notebook 02**: Metadata Enrichment Agent
- **Notebook 04**: Discovery Agent
- **Notebook 07**: Full Multi-Agent Workflow

In [14]:
# Cleanup
fake_file.unlink()
unusual_file.unlink()
print("\n✓ Cleanup complete")


✓ Cleanup complete
