# File Validation - Detecting Corrupt or Invalid Files

A common problem when downloading scientific data is receiving an HTML error page saved under the name of the data file you requested. This notebook demonstrates how to validate files before processing them.

## Why Validation Matters

- **Download Errors**: URLs may return error pages (HTML) instead of data
- **Corrupted Files**: Incomplete downloads or transmission errors can leave a file truncated or damaged
- **Wrong Format**: A file's extension may not match its actual contents

## Magic Bytes

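Many binary formats begin with a fixed "magic number" - a short byte sequence at the start of the file that identifies the format:
- NetCDF: `CDF\x01` or `CDF\x02` (classic and 64-bit offset formats); NetCDF-4 files are HDF5 files and carry the HDF5 signature
- HDF5: `\x89HDF\r\n\x1a\n`
- GRIB: `GRIB`
- HTML: not a binary signature, but the text usually starts with `<!DOCTYPE` or `<html`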
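
As a rough illustration of how the signatures listed above can be checked, here is a minimal standard-library sketch. It is not the `FileValidator` implementation, which this notebook does not show; the names `SIGNATURES`, `detect_type_from_bytes`, and `detect_type` are hypothetical, and the choice to read 64 bytes is an assumption.

```python
from pathlib import Path

# Binary signatures from the list above, mapped to illustrative type labels.
# NetCDF-4 files carry the HDF5 signature, so this sketch reports them as "hdf5".
SIGNATURES = [
    (b"\x89HDF\r\n\x1a\n", "hdf5"),
    (b"CDF\x01", "netcdf"),
    (b"CDF\x02", "netcdf"),
    (b"GRIB", "grib"),
]


def detect_type_from_bytes(head: bytes) -> str:
    """Guess a file type from the leading bytes of a file."""
    for signature, label in SIGNATURES:
        if head.startswith(signature):
            return label
    # HTML is text, so tolerate leading whitespace and mixed case.
    text = head.lstrip().lower()
    if text.startswith(b"<!doctype") or text.startswith(b"<html"):
        return "html"
    return "unknown"


def detect_type(path: Path, num_bytes: int = 64) -> str:
    """Read the first num_bytes of a file and guess its type."""
    with open(path, "rb") as f:
        return detect_type_from_bytes(f.read(num_bytes))
```

Keeping the byte-level check separate from the file read makes the logic easy to test without touching the disk.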

In [4]:
# Setup
import sys
from pathlib import Path
sys.path.insert(0, str(Path.cwd().parent))

from file_validator import FileValidator, quick_validate
import config

## Example 1: Valid NetCDF File

In [5]:
# Validate our sample file from setup
sample_file = Path("sample_data/ocean_temperature.nc")

if sample_file.exists():
    validator = FileValidator()
    result = validator.check_file_signature(sample_file)
    
    print("Validation Result:")
    print(f"  File: {result['filepath']}")
    print(f"  Size: {result['size_formatted']}")
    print(f"  Expected Type: {result['expected_type']}")
    print(f"  Detected Type: {result['detected_type']}")
    print(f"  Valid: {result['is_valid']}")
    
    if result['issues']:
        print(f"  Issues: {result['issues']}")
else:
    print("Sample file not found. Run notebook 00 first.")

Validation Result:
  File: sample_data/ocean_temperature.nc
  Size: 33.46 KB
  Expected Type: netcdf
  Detected Type: netcdf
  Valid: True


## Example 2: Simulating an HTML Error Page

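When a server responds with an error page, the download may still be saved under the requested filename, leaving a `.nc` file that actually contains HTML. The cell below simulates that situation.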

In [6]:
# Create a fake .nc file that's actually HTML
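fake_file = Path("sample_data/fake_data.nc")
fake_file.parent.mkdir(parents=True, exist_ok=True)  # make sure the sample directory exists

with open(fake_file, 'w', encoding='utf-8') as f: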
    f.write("""<!DOCTYPE html>
<html>
<head><title>404 Not Found</title></head>
<body>
<h1>Error: File Not Found</h1>
<p>The requested file could not be found.</p>
</body>
</html>""")

# Validate it
validator = FileValidator()
result = validator.check_file_signature(fake_file)

print("\nValidation Result for HTML Error Page:")
print(f"  Expected: {result['expected_type']}")
print(f"  Detected: {result['detected_type']}")
print(f"  Valid: {result['is_valid']}")
print(f"  Issues: {result['issues']}")

# Get suggestions
suggestions = validator.suggest_fixes(result)
if suggestions:
    print("\n  Suggestions:")
    for suggestion in suggestions:
        print(f"    - {suggestion}")


Validation Result for HTML Error Page:
  Expected: netcdf
  Detected: html
  Valid: False
  Issues: ['File is HTML (likely download error page)']

  Suggestions:
    - Re-download the file. The URL may have returned an error page.


## Example 3: Inspect File Contents

When validation fails, you can inspect the file's leading bytes to see what was actually downloaded.

In [7]:
# Inspect the fake file
print("Content Inspection:")
content = validator.inspect_file_contents(fake_file, num_bytes=256)
print(content)

Content Inspection:
Hex (first 64 bytes):
3c 21 44 4f 43 54 59 50 45 20 68 74 6d 6c 3e 0a 3c 68 74 6d 6c 3e 0a 3c 68 65 61 64 3e 3c 74 69 74 6c 65 3e 34 30 34 20 4e 6f 74 20 46 6f 75 6e 64 3c 2f 74 69 74 6c 65 3e 3c 2f 68 65 61 64 3e

Text preview:
<!DOCTYPE html>
<html>
<head><title>404 Not Found</title></head>
<body>
<h1>Error: File Not Found</h1>
<p>The requested file could not be found.</p>
</body>
</html>
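
The implementation of `inspect_file_contents` is not shown in this notebook. A similar hex and text preview can be produced with the standard library alone, as in this minimal sketch; `preview_bytes` is a hypothetical helper, and the output layout only imitates the one above.

```python
def preview_bytes(data: bytes, hex_bytes: int = 64) -> str:
    """Return a hex dump of the first hex_bytes bytes plus a text preview."""
    # bytes.hex(sep) is available from Python 3.8 onwards.
    hex_part = data[:hex_bytes].hex(" ")
    # Replace undecodable bytes so binary files do not raise an error.
    text_part = data.decode("utf-8", errors="replace")
    return (
        f"Hex (first {hex_bytes} bytes):\n{hex_part}\n\n"
        f"Text preview:\n{text_part}"
    )
```

To use it on a file, read a bounded number of bytes first, for example `preview_bytes(open(path, "rb").read(256))`.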


## Example 4: Batch Validation

Validate all files in a directory at once.

In [8]:
# Validate all .nc files in sample directory
results = validator.validate_directory(Path("sample_data"))

print("\nDirectory Validation Results:")
print(f"  Total files: {results['total_files']}")
print(f"  Valid: {len(results['valid'])}")
print(f"  Invalid: {len(results['invalid'])}")

if results['invalid']:
    print("\n  Invalid Files:")
    for invalid in results['invalid']:
        print(f"    - {Path(invalid['filepath']).name}: {invalid['issues']}")

if results['issues_summary']:
    print("\n  Issue Summary:")
    for issue, count in results['issues_summary'].items():
        print(f"    - {issue}: {count} file(s)")


Directory Validation Results:
  Total files: 3
  Valid: 2
  Invalid: 1

  Invalid Files:
    - fake_data.nc: ['File is HTML (likely download error page)']

  Issue Summary:
    - File is HTML (likely download error page): 1 file(s)
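
The summary above can be built from per-file results with a few lines of Python. The sketch below assumes each result is a dict with the `is_valid` and `issues` keys seen earlier and returns the same top-level keys as the output above; `summarize_results` is a hypothetical name, and the real `validate_directory` may work differently.

```python
from collections import Counter


def summarize_results(results):
    """Group per-file validation results and count each distinct issue."""
    valid = [r for r in results if r["is_valid"]]
    invalid = [r for r in results if not r["is_valid"]]
    # Count how many files report each issue message.
    issue_counts = Counter(issue for r in invalid for issue in r["issues"])
    return {
        "total_files": len(results),
        "valid": valid,
        "invalid": invalid,
        "issues_summary": dict(issue_counts),
    }
```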


## Example 5: Quick Validation Function

For a simple pass/fail check, `quick_validate` returns an `(is_valid, message)` tuple.

In [9]:
# Quick validation
files_to_check = [
    "sample_data/ocean_temperature.nc",
    "sample_data/fake_data.nc"
]

for filepath in files_to_check:
    is_valid, message = quick_validate(Path(filepath))
    status = "✓" if is_valid else "✗"
    print(f"{status} {Path(filepath).name}: {message}")

✓ ocean_temperature.nc: Valid netcdf file
✗ fake_data.nc: Invalid: File is HTML (likely download error page)


## Best Practices

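1. **Always validate before processing** - Catching a bad file early avoids confusing failures later in the pipeline
2. **Check file size** - Very small files are often error pages or truncated downloads
3. **Inspect magic bytes** - A fast, reliable check of the format, although it cannot tell whether the rest of the file is complete
4. **Handle errors gracefully** - Log issues for later review instead of stopping the whole run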
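
The size check and the logging advice can be combined in a small helper. This is only a sketch: `passes_size_check` and the 1024-byte threshold are assumptions chosen for illustration, and a sensible minimum depends on the dataset.

```python
import logging
from pathlib import Path

logger = logging.getLogger("file_validation")

MIN_SIZE_BYTES = 1024  # assumed threshold; tune it for your data


def passes_size_check(path: Path, min_size: int = MIN_SIZE_BYTES) -> bool:
    """Return False (and log a warning) for missing or suspiciously small files."""
    try:
        size = path.stat().st_size
    except OSError as exc:
        # Missing or unreadable file: record it and keep going.
        logger.warning("Cannot read %s: %s", path, exc)
        return False
    if size < min_size:
        logger.warning("%s is only %d bytes; possible error page", path, size)
        return False
    return True
```

Returning a boolean and logging the reason lets a batch job skip bad files and continue, leaving a record to review afterwards.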

## Cleanup

In [10]:
# Remove fake file
if fake_file.exists():
    fake_file.unlink()
    print("Cleaned up fake file")

Cleaned up fake file


## Next Steps

- **Notebook 02**: Learn about metadata extraction from valid files
- **Notebook 05**: See validation integrated into batch indexing