# AI Agent-Based File Validation

## The Challenge: Files That Lie

Real-world scientific data repositories face recurring problems:
- Download URLs return **HTML error pages** disguised as `.nc` files
- Corrupted transfers create **invalid files** that pass basic checks
- Custom formats are **ambiguous** - valid data or garbage?

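Traditional validation (checking magic bytes) is fast but rigid:
```python
def check_netcdf(header: bytes) -> str:
    # Classic NetCDF starts with b'CDF'; NetCDF-4 files carry the HDF5 signature.
    if header.startswith((b'CDF', b'\x89HDF\r\n\x1a\n')):
        return "valid NetCDF"
    return "invalid"  # But WHY? What should the user do?
```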

## Enter: AI Agents 🤖

What if validation could **reason** through ambiguous cases?
- Use multiple tools to gather evidence
- Explain its decision process
- Suggest manual review for edge cases
- Handle unusual formats intelligently

**This notebook demonstrates a quality assessment agent that makes autonomous decisions about data files.**

In [1]:
# Setup
import sys
from pathlib import Path
sys.path.insert(0, str(Path.cwd().parent / 'lib'))

from file_validator import FileValidator
from ollama_client import OllamaClient
from quality_agent import QualityAssessmentAgent
import config

## Quick Context: Traditional vs Agent Validation

| Approach | Speed | Capability | Decision Quality |
|----------|-------|------------|------------------|
| **Traditional (magic bytes)** | ~0.03 ms per file | Binary pass/fail | No explanation |
| **AI Agent** | ~25 s per file | Reasoned decision | Full explanation |

**Trade-off**: Agents are slower but handle edge cases traditional rules miss.

**Production strategy**: Run fast validation on every file and send only the ambiguous remainder (assumed here to be roughly 10%) to the agent.

## Create Test Files

We'll create three challenging cases:
1. **Valid NetCDF** - Should accept
2. **HTML error page** disguised as `.nc` - Should reject with explanation
3. **Custom binary format** - Should suggest manual review

In [2]:
import netCDF4
import numpy as np

# Setup directory
sample_dir = Path("sample_data")
sample_dir.mkdir(exist_ok=True)

# 1. Valid NetCDF file
valid_file = sample_dir / "ocean_temperature.nc"
if not valid_file.exists():
    with netCDF4.Dataset(valid_file, 'w') as ds:
        ds.title = "Sample Ocean Temperature Data"
        ds.institution = "Demo University"
        ds.createDimension('time', 10)
        ds.createDimension('lat', 20)
        ds.createDimension('lon', 30)
        
        time = ds.createVariable('time', 'f8', ('time',))
        time.units = 'days since 2020-01-01'
        time[:] = np.arange(10)
        
        lat = ds.createVariable('lat', 'f4', ('lat',))
        lat.units = 'degrees_north'
        lat[:] = np.linspace(-90, 90, 20)
        
        lon = ds.createVariable('lon', 'f4', ('lon',))
        lon.units = 'degrees_east'
        lon[:] = np.linspace(-180, 180, 30)
        
        temp = ds.createVariable('sea_surface_temperature', 'f4', ('time', 'lat', 'lon'))
        temp.units = 'celsius'
        temp.long_name = 'Sea Surface Temperature'
        temp[:] = np.random.randn(10, 20, 30) * 5 + 15

# 2. HTML error page disguised as .nc
fake_file = sample_dir / "fake_data.nc"
with open(fake_file, 'w') as f:
    f.write("""<!DOCTYPE html>
<html>
<head><title>404 Not Found</title></head>
<body>
<h1>Error: File Not Found</h1>
<p>The requested file could not be found.</p>
</body>
</html>""")

# 3. Custom binary format (ambiguous case)
import struct

unusual_file = sample_dir / "custom_format.dat"
with open(unusual_file, 'wb') as f:
    f.write(b'CUSTOMFMT\x01\x00')  # Custom 11-byte header
    for i in range(1000):
        f.write(struct.pack('f', i * 0.1))  # 1000 native float32 values

print("✓ Test files created")
print(f"  - {valid_file.name} (valid NetCDF)")
print(f"  - {fake_file.name} (HTML error page)")
print(f"  - {unusual_file.name} (custom format)")

✓ Test files created
  - ocean_temperature.nc (valid NetCDF)
  - fake_data.nc (HTML error page)
  - custom_format.dat (custom format)


## Initialize AI Agent

The agent has access to three tools:
1. **get_file_info** - Check filename, size, extension
2. **check_signature** - Verify magic bytes
3. **inspect_content** - Sample file contents

It will use these tools strategically to make informed decisions.
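
To make the three tools concrete, here is a minimal sketch of what they might look like as plain functions. It is not the code in `quality_agent`; the return keys are modeled on the tool results logged in this notebook, while the helper `detect_type`, the 64-byte and 200-byte read sizes, and the 90% printable-byte threshold are assumptions made for illustration. The text heuristic here is deliberately stricter than the logged output, which reports `appears_text: True` even for binary samples.

```python
from pathlib import Path

# The two signatures are standard magic bytes; everything else in this
# sketch is illustrative and not the notebook's actual implementation.
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"   # NetCDF-4 files are HDF5 containers
CLASSIC_NETCDF_MAGIC = b"CDF"       # NetCDF classic / 64-bit offset


def get_file_info(filepath: str) -> dict:
    """Return basic metadata without reading the file body."""
    path = Path(filepath)
    size = path.stat().st_size
    return {
        "filename": path.name,
        "extension": path.suffix,
        "size_bytes": size,
        "size_mb": round(size / (1024 * 1024), 2),
    }


def detect_type(header: bytes):
    """Map leading bytes to a coarse type name, or None if unrecognized."""
    # Simplification: any HDF5 container is reported as "netcdf" here.
    if header.startswith((HDF5_MAGIC, CLASSIC_NETCDF_MAGIC)):
        return "netcdf"
    if header.lstrip().lower().startswith((b"<!doctype html", b"<html")):
        return "html"
    return None


def check_signature(filepath: str) -> dict:
    """Compare the type implied by the extension with the detected type."""
    path = Path(filepath)
    expected = "netcdf" if path.suffix == ".nc" else None
    with open(path, "rb") as f:
        detected = detect_type(f.read(64))
    issues = []
    if detected == "html":
        issues.append("File is HTML (likely download error page)")
    elif detected is None:
        issues.append("Unknown file type")
    is_valid = detected is not None and detected == expected
    return {"expected_type": expected, "detected_type": detected,
            "is_valid": is_valid, "issues": issues}


def inspect_content(filepath: str, sample_size: int = 200) -> dict:
    """Sample the first bytes and classify them as text, HTML, or binary."""
    with open(filepath, "rb") as f:
        sample = f.read(sample_size)
    # Count printable ASCII plus tab, newline, and carriage return.
    printable = sum(32 <= b < 127 or b in (9, 10, 13) for b in sample)
    appears_text = bool(sample) and printable / len(sample) > 0.9
    return {
        "appears_text": appears_text,
        "appears_html": detect_type(sample) == "html",
        "sample_text": sample.decode("utf-8", errors="replace"),
    }
```

Each function returns a plain dictionary, so the results can be handed to a language model as text or checked by ordinary rules.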

In [3]:
# Initialize Ollama client
print("Connecting to local Ollama...")
ollama = OllamaClient()

# Quick test
if ollama.test_model():
    print("\n✓ AI agent ready")
else:
    print("\n⚠️  Ollama may not be working correctly")
    print("Ensure Ollama is running: ollama serve")

Connecting to local Ollama...
✓ Connected to Ollama at http://localhost:11434
  Available models: llama3.2:3b

Testing model: llama3.2:3b
Test prompt: What is 2+2? Answer with just the number.
Response: 4
✓ Model is working!

✓ AI agent ready


In [4]:
# Create quality assessment agent
print("Creating Quality Assessment Agent...")
quality_agent = QualityAssessmentAgent(ollama)
print("\n✓ Agent initialized with tools:")
print("  • get_file_info - Basic file metadata")
print("  • check_signature - Magic byte verification")
print("  • inspect_content - Content sampling")

Creating Quality Assessment Agent...
  [QualityAgent] Registered tool: check_signature
  [QualityAgent] Registered tool: get_file_info
  [QualityAgent] Registered tool: inspect_content

✓ Agent initialized with tools:
  • get_file_info - Basic file metadata
  • check_signature - Magic byte verification
  • inspect_content - Content sampling


## Demo 1: Valid NetCDF File

Watch the agent:
1. Gather information about the file
2. Verify its signature
3. Make a confident decision
4. Explain its reasoning
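
The four steps above can be sketched as a small tool-calling loop. This is a hedged illustration, not the `QualityAssessmentAgent` implementation: `scripted_policy` is a hypothetical, deterministic stand-in for the language model, `run_agent` and the `max_steps` limit are invented for the example, and the confidence values are arbitrary placeholders.

```python
def scripted_policy(evidence: dict) -> dict:
    """Stand-in for the LLM: request the next tool, or decide once all ran."""
    for tool in ("get_file_info", "check_signature", "inspect_content"):
        if tool not in evidence:
            return {"action": "use_tool", "tool": tool}
    if evidence["check_signature"]["is_valid"]:
        return {"action": "decide", "decision": "ACCEPT", "confidence": 0.95,
                "reasoning": "Signature matches the expected type."}
    if evidence["inspect_content"]["appears_html"]:
        return {"action": "decide", "decision": "REJECT", "confidence": 0.95,
                "reasoning": "Content is HTML, likely a download error page."}
    return {"action": "decide", "decision": "MANUAL_REVIEW", "confidence": 0.5,
            "reasoning": "Unknown format; cannot confirm or rule out valid data."}


def run_agent(filepath: str, tools: dict, policy=scripted_policy, max_steps: int = 6) -> dict:
    """Alternate between asking the policy and running the tool it requests."""
    evidence, steps = {}, []
    for _ in range(max_steps):
        move = policy(evidence)
        if move["action"] == "decide":
            return {"decision": move["decision"], "confidence": move["confidence"],
                    "reasoning": move["reasoning"], "steps": steps}
        name = move["tool"]
        evidence[name] = tools[name](filepath)   # gather one more piece of evidence
        steps.append(name)
    # Fall back to a human when the loop runs out of steps.
    return {"decision": "MANUAL_REVIEW", "confidence": 0.0,
            "reasoning": "Step limit reached without a decision.", "steps": steps}


# Stub tools that mimic the HTML-error-page case.
stub_tools = {
    "get_file_info": lambda path: {"extension": ".nc"},
    "check_signature": lambda path: {"is_valid": False, "detected_type": "html"},
    "inspect_content": lambda path: {"appears_html": True},
}
result = run_agent("fake_data.nc", stub_tools)
print(result["decision"], result["steps"])
# prints: REJECT ['get_file_info', 'check_signature', 'inspect_content']
```

Replacing `scripted_policy` with a call to a model is what turns this fixed sequence into an agent: the model chooses which tool to run next and when it has seen enough.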

In [5]:
print("\n" + "=" * 70)
print("TEST 1: Valid NetCDF File")
print("=" * 70)
print(f"\nFile: {valid_file.name}")
print("Expected: Agent should ACCEPT with high confidence\n")

decision = quality_agent.assess_file(str(valid_file))

print("\n" + "=" * 70)
print("AGENT DECISION")
print("=" * 70)
print(f"Decision: {decision.decision}")
print(f"Confidence: {decision.confidence:.2f}")
print(f"Reasoning: {decision.reasoning}")
print(f"\nProcessing time: {decision.processing_time:.1f}s")
print(f"Reasoning steps: {len(decision.thoughts)}")


TEST 1: Valid NetCDF File

File: ocean_temperature.nc
Expected: Agent should ACCEPT with high confidence


[QualityAgent] Starting analysis...

[QualityAgent] Step 1: Thinking...
[QualityAgent] Using tool: get_file_info
  Parameters: {'filepath': 'sample_data/ocean_temperature.nc'}
  Result: {'filename': 'ocean_temperature.nc', 'extension': '.nc', 'size_bytes': 34266, 'size_mb': 0.03}

[QualityAgent] Step 2: Thinking...
[QualityAgent] Using tool: check_signature
  Parameters: {'filepath': 'sample_data/ocean_temperature.nc'}
  Result: {'expected_type': 'netcdf', 'detected_type': 'netcdf', 'is_valid': True, 'issues': [], 'size': '33.46 KB'}

[QualityAgent] Step 3: Thinking...
[QualityAgent] Using tool: inspect_content
  Parameters: {'filepath': 'sample_data/ocean_temperature.nc'}
  Result: {'appears_text': True, 'appears_html': False, 'sample_text': '�HDF\r\n\x1a\n\x02\x08\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00��������څ\x00\x00\x00\x00\x00\x000\x00\x00\x00\x00\x00\x00\x00\x03\\�\x12OHD

## Demo 2: HTML Error Page (The Deception)

This is the case that breaks simple validation:
- File extension: `.nc` ✓
- But content: HTML error page ✗

**Watch the agent detect and explain the problem.**

In [6]:
print("\n" + "=" * 70)
print("TEST 2: HTML Error Page Disguised as .nc")
print("=" * 70)
print(f"\nFile: {fake_file.name}")
print("Expected: Agent should REJECT and explain it's HTML\n")

decision = quality_agent.assess_file(str(fake_file))

print("\n" + "=" * 70)
print("AGENT DECISION")
print("=" * 70)
print(f"Decision: {decision.decision}")
print(f"Confidence: {decision.confidence:.2f}")
print(f"Reasoning: {decision.reasoning}")
print(f"\nProcessing time: {decision.processing_time:.1f}s")

print("\n💡 Key Insight: Agent not only rejected the file, but explained WHY.")
print("   User knows it's an HTML error page and can fix the download.")


TEST 2: HTML Error Page Disguised as .nc

File: fake_data.nc
Expected: Agent should REJECT and explain it's HTML


[QualityAgent] Starting analysis...

[QualityAgent] Step 1: Thinking...
[QualityAgent] Using tool: get_file_info
  Parameters: {'filepath': 'sample_data/fake_data.nc'}
  Result: {'filename': 'fake_data.nc', 'extension': '.nc', 'size_bytes': 164, 'size_mb': 0.0}

[QualityAgent] Step 2: Thinking...
[QualityAgent] Using tool: check_signature
  Parameters: {'filepath': 'sample_data/fake_data.nc'}
  Result: {'expected_type': 'netcdf', 'detected_type': 'html', 'is_valid': False, 'issues': ['File is HTML (likely download error page)'], 'size': '164.00 B'}

[QualityAgent] Step 3: Thinking...
[QualityAgent] Using tool: inspect_content
  Parameters: {'filepath': 'sample_data/fake_data.nc'}
  Result: {'appears_text': True, 'appears_html': True, 'sample_text': '<!DOCTYPE html>\n<html>\n<head><title>404 Not Found</title></head>\n<body>\n<h1>Error: File Not Found</h1>\n<p>The requested

## Demo 3: Ambiguous Custom Format

The hardest case: **a file with an unknown format**
- Not standard NetCDF/HDF5
- But might be valid research data
- Simple rules would auto-reject

**Watch the agent reason through uncertainty.**

In [7]:
print("\n" + "=" * 70)
print("TEST 3: Custom Binary Format (Ambiguous)")
print("=" * 70)
print(f"\nFile: {unusual_file.name}")
print("Expected: Agent should suggest MANUAL_REVIEW (uncertain but cautious)\n")

decision = quality_agent.assess_file(str(unusual_file))

print("\n" + "=" * 70)
print("AGENT DECISION")
print("=" * 70)
print(f"Decision: {decision.decision}")
print(f"Confidence: {decision.confidence:.2f}")
print(f"Reasoning: {decision.reasoning}")
print(f"\nProcessing time: {decision.processing_time:.1f}s")

print("\n💡 Key Insight: Instead of auto-rejecting unknown formats,")
print("   the agent flags it for human review. Preserves potential valid data!")


TEST 3: Custom Binary Format (Ambiguous)

File: custom_format.dat
Expected: Agent should suggest MANUAL_REVIEW (uncertain but cautious)


[QualityAgent] Starting analysis...

[QualityAgent] Step 1: Thinking...
[QualityAgent] Using tool: get_file_info
  Parameters: {'filepath': 'sample_data/custom_format.dat'}
  Result: {'filename': 'custom_format.dat', 'extension': '.dat', 'size_bytes': 4011, 'size_mb': 0.0}

[QualityAgent] Step 2: Thinking...
[QualityAgent] Using tool: check_signature
  Parameters: {'filepath': 'sample_data/custom_format.dat'}
  Result: {'expected_type': None, 'detected_type': None, 'is_valid': False, 'issues': ['Unknown file type'], 'size': '3.92 KB'}

[QualityAgent] Step 3: Thinking...
[QualityAgent] Using tool: inspect_content
  Parameters: {'filepath': 'sample_data/custom_format.dat'}
  Result: {'appears_text': True, 'appears_html': False, 'sample_text': 'CUSTOMFMT\x01\x00\x00\x00\x00\x00���=��L>���>���>\x00\x00\x00?��\x19?333?��L?fff?\x00\x00�?�̌?���?ff�?33�?\x0
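
For context, this is roughly what a manual review could look like once a person knows the layout. The sketch assumes only what the test-file cell wrote: an 11-byte `CUSTOMFMT\x01\x00` header followed by 1000 native `float32` values, which accounts for the 4011 bytes reported by `get_file_info`. The function name `read_custom_format` is hypothetical, and the file is recreated in a temporary directory so the example runs on its own.

```python
import struct
import tempfile
from pathlib import Path

HEADER = b"CUSTOMFMT\x01\x00"  # 11-byte header written by the test-file cell


def read_custom_format(path):
    """Parse the demo layout: fixed header followed by native float32 values."""
    data = Path(path).read_bytes()
    if not data.startswith(HEADER):
        raise ValueError("header does not match the demo layout")
    payload = data[len(HEADER):]
    if len(payload) % 4:
        raise ValueError("payload is not a whole number of float32 values")
    count = len(payload) // 4
    return list(struct.unpack(f"{count}f", payload))


# Recreate the demo file in a temporary directory, then decode it.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "custom_format.dat"
    with open(demo, "wb") as f:
        f.write(HEADER)
        for i in range(1000):
            f.write(struct.pack("f", i * 0.1))
    values = read_custom_format(demo)

print(len(values), round(values[10], 3))  # prints: 1000 1.0
```

A signature check cannot know this layout in advance, which is why routing the file to a person instead of rejecting it outright preserves data that may be valid.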

## Agent Reasoning Trace

Let's inspect how the agent made its decision on the fake file:

In [8]:
# Re-run fake file assessment to get fresh trace
decision = quality_agent.assess_file(str(fake_file))

print("\nAgent's Step-by-Step Reasoning:")
print("=" * 70)

for i, thought in enumerate(decision.thoughts, 1):
    print(f"\nStep {i}: {thought.action.upper()}")
    
    if thought.tool_name:
        print(f"  Tool: {thought.tool_name}")
        print(f"  Parameters: {thought.tool_params}")
        result_str = str(thought.result)[:100]
        print(f"  Result: {result_str}...")
    
    # Show reasoning snippet
    reasoning = thought.reasoning[:150]
    print(f"  Reasoning: {reasoning}...")

print("\n" + "=" * 70)
print(f"Final Decision: {decision.decision} ({decision.confidence:.2f} confidence)")
print("=" * 70)


[QualityAgent] Starting analysis...

[QualityAgent] Step 1: Thinking...
[QualityAgent] Using tool: get_file_info
  Parameters: {'filepath': 'sample_data/fake_data.nc'}
  Result: {'filename': 'fake_data.nc', 'extension': '.nc', 'size_bytes': 164, 'size_mb': 0.0}

[QualityAgent] Step 2: Thinking...
[QualityAgent] Using tool: check_signature
  Parameters: {'filepath': 'sample_data/fake_data.nc'}
  Result: {'expected_type': 'netcdf', 'detected_type': 'html', 'is_valid': False, 'issues': ['File is HTML (likely download error page)'], 'size': '164.00 B'}

[QualityAgent] Step 3: Thinking...
[QualityAgent] Using tool: inspect_content
  Parameters: {'filepath': 'sample_data/fake_data.nc'}
  Result: {'appears_text': True, 'appears_html': True, 'sample_text': '<!DOCTYPE html>\n<html>\n<head><title>404 Not Found</title></head>\n<body>\n<h1>Error: File Not Found</h1>\n<p>The requested file could not

[QualityAgent] Step 4: Thinking...

[QualityAgent] Decision reached!
  Decision: REJECT
  Confiden
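
The trace cell reads `decision.thoughts` and, for each thought, `action`, `tool_name`, `tool_params`, `result`, and `reasoning`. A minimal sketch of records with those fields, plus a helper that serializes a trace for an audit log, might look like the following. The class names `Thought` and `AgentDecision`, the helper `to_audit_json`, the action labels, and all example values are assumptions; the real classes in `quality_agent` may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class Thought:
    action: str                       # assumed labels, e.g. "use_tool" or "decide"
    reasoning: str
    tool_name: Optional[str] = None   # None for steps that call no tool
    tool_params: dict = field(default_factory=dict)
    result: Any = None


@dataclass
class AgentDecision:
    decision: str
    confidence: float
    reasoning: str
    processing_time: float
    thoughts: list = field(default_factory=list)


def to_audit_json(filepath: str, decision: AgentDecision) -> str:
    """Serialize the full trace; default=str covers non-JSON tool results."""
    record = {"file": filepath, **asdict(decision)}
    return json.dumps(record, default=str, indent=2)


# Placeholder values for illustration only, not output from the real agent.
example = AgentDecision(
    decision="REJECT",
    confidence=0.9,
    reasoning="Content is HTML rather than NetCDF.",
    processing_time=1.0,
    thoughts=[
        Thought("use_tool", "Check the magic bytes.", "check_signature",
                {"filepath": "sample_data/fake_data.nc"},
                {"detected_type": "html", "is_valid": False}),
        Thought("decide", "Signature and content both indicate HTML."),
    ],
)
print(to_audit_json("sample_data/fake_data.nc", example))
```

Storing one such JSON record per assessed file keeps the explanation alongside the verdict, which is what makes a decision reviewable later.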

## Performance & Trade-offs

### Speed Comparison

In [9]:
import time

# Traditional validation speed (perf_counter is a monotonic, high-resolution timer)
validator = FileValidator()
start = time.perf_counter()
for _ in range(100):
    validator.check_file_signature(valid_file)
traditional_time = (time.perf_counter() - start) / 100

# Agent speed: a single measurement taken from the most recent assess_file() call
agent_time = decision.processing_time

print("Performance Comparison:")
print("=" * 60)
print(f"Traditional validation: {traditional_time*1000:.2f}ms per file")
print(f"Agent validation: {agent_time:.1f}s per file")
print(f"\nSpeed difference: {agent_time/traditional_time:.0f}x slower")
print("\nBUT: Agent provides reasoning, handles edge cases, and prevents")
print("     false rejections that would cost researcher time.")

Performance Comparison:
Traditional validation: 0.03ms per file
Agent validation: 25.1s per file

Speed difference: 896180x slower

BUT: Agent provides reasoning, handles edge cases, and prevents
     false rejections that would cost researcher time.


### Production Strategy: Hybrid Approach

```python
def smart_validation(filepath):
    # Fast check first. `is_clearly_valid` / `is_clearly_invalid` are
    # illustrative flags, not attributes the validator is known to expose.
    result = validator.check_file_signature(filepath)

    if result.is_clearly_valid:
        return "ACCEPT"  # 90% of files - instant

    if result.is_clearly_invalid:
        return "REJECT"  # Obviously broken

    # Ambiguous case - use agent (10% of files)
    return quality_agent.assess_file(filepath).decision
```

**Result**: Fast for common cases, intelligent for edge cases.
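
A quick back-of-the-envelope calculation shows what the hybrid approach buys. The sketch below uses the timings printed by the benchmark cell (0.03 ms and 25.1 s) and treats the 10% ambiguous share as a working assumption, not a measurement. It also assumes files are processed one at a time; the function name `expected_seconds_per_file` is made up for the example.

```python
def expected_seconds_per_file(fast_s: float, agent_s: float, ambiguous_fraction: float) -> float:
    """Every file pays for the fast check; only the ambiguous share pays for the agent."""
    return fast_s + ambiguous_fraction * agent_s


FAST_S = 0.03e-3    # 0.03 ms, from the benchmark output
AGENT_S = 25.1      # seconds, from the benchmark output
AMBIGUOUS = 0.10    # assumed share of files that need the agent

hybrid = expected_seconds_per_file(FAST_S, AGENT_S, AMBIGUOUS)
n_files = 1000
print(f"hybrid:     {hybrid:.2f} s/file, {hybrid * n_files / 60:.0f} min per {n_files} files")
print(f"agent only: {AGENT_S:.2f} s/file, {AGENT_S * n_files / 3600:.1f} h per {n_files} files")
# prints:
# hybrid:     2.51 s/file, 42 min per 1000 files
# agent only: 25.10 s/file, 7.0 h per 1000 files
```

The agent term dominates the average, so total run time scales almost linearly with the ambiguous fraction; halving that fraction roughly halves the batch time.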

## When to Use Agent Validation

### ✅ Good Use Cases
- **Institutional data curation** (overnight processing acceptable)
- **Unusual file formats** (need reasoning, not rules)
- **Ambiguous cases** (better than auto-reject)
- **Quality assurance** (explanation required for auditing)
- **Research data repositories** (need to justify decisions)

### ⚠️ Consider Skipping
- Real-time user uploads (need <1s response)
- High-confidence standard formats (NetCDF, HDF5 with valid headers)
- Batch processing 1000s of files (unless filtering ambiguous subset)

## Key Takeaways for HPC/Data Centers

1. **Multi-tool reasoning** beats single-check validation
2. **Explainable decisions** help users fix problems
3. **Hybrid approach** combines speed + intelligence
4. **Viable in production** when the agent is limited to the ambiguous subset of files
5. **Reduces false rejections** that waste researcher time

## Next Steps

- **Notebook 02**: Metadata Enrichment Agent
- **Notebook 04**: Discovery Agent
- **Notebook 07**: Full Multi-Agent Workflow

In [10]:
# Cleanup (ocean_temperature.nc is left in place)
fake_file.unlink(missing_ok=True)
unusual_file.unlink(missing_ok=True)
print("✓ Test files cleaned up")

✓ Test files cleaned up
