# Tellus Archive System - Getting Started

This notebook walks through the basic functionality of Tellus's new archive system: local caching of archives and files, tag-based file organization, and access to archives held in different storage locations.

## Overview

The archive system allows you to:
- **Cache archives locally** for fast repeated access
- **Tag files automatically** based on path patterns
- **Extract files selectively** by tags or patterns
- **Work with multiple archives** through a unified interface
- **Support different storage locations** (local, FTP, tape systems, etc.)

## Setup

First, let's import the necessary modules and create a sample archive for demonstration:

In [None]:
import tempfile
import tarfile
from pathlib import Path
import json

# Import Tellus archive system components
from tellus.simulation.simulation import (
    CacheManager, CacheConfig, TagSystem, PathMapper, PathMapping,
    ArchiveManifest, CompressedArchive, ArchiveRegistry,
    CLIProgressCallback
)

print("✓ All imports successful")

## Creating a Sample Archive

Let's create a sample simulation archive to work with:

In [None]:
def create_sample_archive(archive_path: Path) -> Path:
    """Create a sample simulation archive for demonstration"""
    
    with tempfile.TemporaryDirectory() as temp_dir:
        temp_path = Path(temp_dir)
        
        # Create realistic simulation file structure
        files_to_create = {
            "input/forcing_data.nc": b"Sample atmospheric forcing data",
            "input/initial_conditions.nc": b"Sample initial conditions",
            "input/boundary_conditions.nc": b"Sample boundary conditions",
            "scripts/run_model.sh": b"#!/bin/bash\necho 'Running climate model'\n",
            "scripts/postprocess.py": b"#!/usr/bin/env python3\nprint('Post-processing results')\n",
            "output/temperature_2023.nc": b"Sample temperature output data",
            "output/precipitation_2023.nc": b"Sample precipitation output",
            "output/diagnostics.nc": b"Model diagnostic information",
            "namelists/model.nml": b"&model_params\n  dt = 3600\n  output_freq = 24\n/\n",
            "namelists/postproc.cfg": b"[settings]\noutput_format = netcdf4\n",
            "docs/README.md": b"# Climate Model Simulation\n\nThis archive contains...",
        }
        
        # Create files
        for file_path, content in files_to_create.items():
            full_path = temp_path / file_path
            full_path.parent.mkdir(parents=True, exist_ok=True)
            full_path.write_bytes(content)
        
        # Create metadata
        metadata = {
            "simulation_id": "climate_model_v1",
            "model": "ECMWF-IFS",
            "resolution": "T127",
            "period": "2023-01-01 to 2023-12-31",
            "created": "2024-01-15",
            "description": "Sample climate model simulation for archive system demo"
        }
        
        metadata_file = temp_path / "simulation_metadata.json"
        metadata_file.write_text(json.dumps(metadata, indent=2))
        
        # Create the archive
        archive_path.parent.mkdir(parents=True, exist_ok=True)
        with tarfile.open(archive_path, "w:gz") as tar:
            # Add files individually to maintain clean paths
            for file_path in files_to_create:
                full_path = temp_path / file_path
                tar.add(full_path, arcname=file_path)
            tar.add(metadata_file, arcname="simulation_metadata.json")
    
    print(f"✓ Created sample archive: {archive_path}")
    print(f"  Size: {archive_path.stat().st_size} bytes")
    print(f"  Files: {len(files_to_create) + 1}")
    return archive_path

# Create a sample archive in a temporary location
temp_dir = Path(tempfile.mkdtemp())
sample_archive = create_sample_archive(temp_dir / "climate_simulation.tar.gz")
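
Before handing an archive to Tellus, it can help to confirm what actually went into it. The sketch below uses only the standard-library `tarfile` module; `list_archive_members` is a hypothetical helper written for this illustration, not part of Tellus, and the demo builds its own one-file archive so that it runs on its own.

```python
import tarfile
import tempfile
from pathlib import Path


def list_archive_members(archive_path: Path) -> list[str]:
    """Return the sorted names of regular files stored in a tar archive."""
    # "r:*" lets tarfile detect the compression (gzip, bz2, xz, or none).
    with tarfile.open(archive_path, "r:*") as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())


# Self-contained demo: build a tiny archive, then list its contents.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_root = Path(demo_dir)
    src = demo_root / "src" / "input" / "forcing_data.nc"
    src.parent.mkdir(parents=True)
    src.write_bytes(b"Sample atmospheric forcing data")

    demo_archive = demo_root / "demo.tar.gz"
    with tarfile.open(demo_archive, "w:gz") as tar:
        tar.add(src, arcname="input/forcing_data.nc")

    demo_members = list_archive_members(demo_archive)
    print(demo_members)
```

In this notebook, `list_archive_members(sample_archive)` should return 12 names: the 11 simulation files plus `simulation_metadata.json`.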

## Basic Archive Operations

### 1. Creating an Archive Instance

Let's create a `CompressedArchive` instance and explore its basic functionality:

In [None]:
# Set up caching
cache_config = CacheConfig(
    cache_dir=temp_dir / "cache",
    archive_cache_size_limit=100 * 1024**2,  # 100MB
    file_cache_size_limit=50 * 1024**2       # 50MB
)
cache_manager = CacheManager(cache_config)

# Create archive instance
archive = CompressedArchive(
    archive_id="climate_demo",
    archive_location=str(sample_archive),
    cache_manager=cache_manager
)

# Add progress tracking
progress = CLIProgressCallback(verbose=True)
archive.add_progress_callback(progress)

print("✓ Archive instance created")

### 2. Exploring Archive Status

Let's check the archive status and see what information is available:

In [None]:
# Get archive status
status = archive.status()

print("Archive Status:")
print("=" * 50)
for key, value in status.items():
    if key == 'size' and isinstance(value, int):
        # Format size in human-readable format
        if value < 1024:
            size_str = f"{value} B"
        elif value < 1024**2:
            size_str = f"{value/1024:.1f} KB"
        else:
            size_str = f"{value/1024**2:.1f} MB"
        print(f"  {key}: {size_str}")
    else:
        print(f"  {key}: {value}")
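
The size formatting in the cell above is needed again for the cache statistics later on, so it may be worth pulling it into a small function. This is a sketch; `format_size` is a name chosen here and not a Tellus function. It keeps the same 1024-based thresholds and labels as the cell above.

```python
def format_size(num_bytes: int) -> str:
    """Format a byte count using 1024-based thresholds (B, KB, MB)."""
    if num_bytes < 1024:
        return f"{num_bytes} B"
    if num_bytes < 1024**2:
        return f"{num_bytes / 1024:.1f} KB"
    return f"{num_bytes / 1024**2:.1f} MB"


print(format_size(512))          # 512 B
print(format_size(2048))         # 2.0 KB
print(format_size(5 * 1024**2))  # 5.0 MB
```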

### 3. Listing Files with Automatic Tagging

The archive system automatically tags files based on their paths. Let's see what files are in our archive and how they've been tagged:

In [None]:
# List all files in the archive
files = archive.list_files()

print("Files in Archive (with automatic tags):")
print("=" * 60)

# Group files by tag (a file with several tags appears in every matching group)
by_tag = {}
for file_path, tagged_file in files.items():
    for tag in tagged_file.tags:
        if tag not in by_tag:
            by_tag[tag] = []
        by_tag[tag].append((file_path, tagged_file))

# Display files grouped by tag
for tag in sorted(by_tag.keys()):
    print(f"\n📁 {tag.upper()} ({len(by_tag[tag])} files):")
    for file_path, tagged_file in by_tag[tag]:
        tags_str = ", ".join(sorted(tagged_file.tags))
        # Show whole bytes below 1 KB, otherwise one decimal place in KB
        if tagged_file.size >= 1024:
            size_str = f"{tagged_file.size / 1024:.1f} KB"
        else:
            size_str = f"{tagged_file.size} B"
        print(f"  • {file_path} ({size_str}) [{tags_str}]")
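
To make the idea of path-based tagging concrete, here is a minimal sketch of how glob patterns could be mapped to tags. It is not Tellus's implementation: the rule table `EXAMPLE_TAG_RULES` and the function `tags_for_path` are invented for illustration and rely only on the standard-library `fnmatch` module.

```python
from fnmatch import fnmatch

# Hypothetical pattern-to-tag rules, invented for illustration only.
EXAMPLE_TAG_RULES = {
    "input/*": "input",
    "output/*": "output",
    "scripts/*": "scripts",
    "*.nc": "netcdf",
}


def tags_for_path(path: str, rules: dict[str, str] = EXAMPLE_TAG_RULES) -> set[str]:
    """Return every tag whose glob pattern matches the given archive path."""
    # fnmatch's "*" also matches "/", so "*.nc" applies at any directory depth.
    return {tag for pattern, tag in rules.items() if fnmatch(path, pattern)}


print(sorted(tags_for_path("input/forcing_data.nc")))  # ['input', 'netcdf']
print(sorted(tags_for_path("scripts/run_model.sh")))   # ['scripts']
print(sorted(tags_for_path("docs/README.md")))         # []
```

A file can match several rules at once, which is why the listing above can show the same file under more than one tag.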

### 4. Extracting Files by Tags

One of the most powerful features is the ability to extract files by their tags:

In [None]:
# Find all files tagged "input" (nothing is extracted yet)
input_files = archive.get_files_by_tags("input")
print(f"Input files found: {len(input_files)}")
for file_path in input_files:
    print(f"  • {file_path}")

# Extract them to a directory
extract_dir = temp_dir / "extracted" / "input_only"
extracted_paths = archive.extract_files_by_tags(extract_dir, "input")

print(f"\n✓ Extracted {len(extracted_paths)} input files to {extract_dir}")
for path in extracted_paths:
    print(f"  → {path}")
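
For comparison, selective extraction can be sketched with the standard library alone. The example below assumes files are selected by path prefix rather than by Tellus tags; `extract_by_prefix` is a hypothetical helper, and the demo archive is created on the spot so that the block runs on its own.

```python
import tarfile
import tempfile
from pathlib import Path


def extract_by_prefix(archive_path: Path, dest: Path, prefix: str) -> list[Path]:
    """Extract only regular files whose archive name starts with `prefix`."""
    written = []
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar.getmembers():
            if not (member.isfile() and member.name.startswith(prefix)):
                continue
            # Member names are trusted here; do not do this with untrusted archives.
            target = dest / member.name
            target.parent.mkdir(parents=True, exist_ok=True)
            with tar.extractfile(member) as src:
                target.write_bytes(src.read())
            written.append(target)
    return written


# Self-contained demo with one input file and one output file.
with tempfile.TemporaryDirectory() as demo_dir:
    root = Path(demo_dir)
    for name, data in {"input/a.nc": b"in", "output/b.nc": b"out"}.items():
        path = root / "src" / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    demo_archive = root / "demo.tar.gz"
    with tarfile.open(demo_archive, "w:gz") as tar:
        for name in ("input/a.nc", "output/b.nc"):
            tar.add(root / "src" / name, arcname=name)

    out_dir = root / "extracted"
    extracted = extract_by_prefix(demo_archive, out_dir, "input/")
    extracted_names = [p.relative_to(out_dir).as_posix() for p in extracted]
    extracted_content = extracted[0].read_bytes()
    output_was_skipped = not (out_dir / "output" / "b.nc").exists()
    print(extracted_names)
```

Only `input/a.nc` is written to disk; the output file stays inside the archive.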

### 5. Extracting Individual Files

You can also extract specific files by name:

In [None]:
# Extract a specific script file
script_path = archive.extract_file(
    "scripts/run_model.sh", 
    temp_dir / "extracted" / "scripts"
)

print(f"✓ Extracted script to: {script_path}")
print("\nScript content:")
print("-" * 30)
print(script_path.read_text())

### 6. Cache Performance

The archive system caches both archives and individual files for performance. Let's see the cache statistics:

In [None]:
# Get cache statistics
cache_stats = cache_manager.get_cache_stats()

print("Cache Statistics:")
print("=" * 40)
for key, value in cache_stats.items():
    if 'size' in key and isinstance(value, int):
        if value < 1024:
            size_str = f"{value} B"
        elif value < 1024**2:
            size_str = f"{value/1024:.1f} KB"
        else:
            size_str = f"{value/1024**2:.1f} MB"
        print(f"  {key}: {size_str}")
    else:
        print(f"  {key}: {value}")

# Query the same tag again; repeated access can reuse cached data
print("\nAccessing the same files again (expected to be faster):")
input_files_cached = archive.get_files_by_tags("input")
print(f"✓ Retrieved {len(input_files_cached)} input files on repeated access")
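
Whether repeated access is actually faster is easy to check by timing the calls. The helper below is a small sketch built on `time.perf_counter`; `time_call` is a name chosen here, and the demo times a stand-in workload so that the block runs without Tellus.

```python
import time
from typing import Any, Callable


def time_call(func: Callable[..., Any], *args: Any, **kwargs: Any) -> tuple[Any, float]:
    """Call func and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload; in the notebook you would pass archive.get_files_by_tags.
total, seconds = time_call(sum, range(100_000))
print(f"sum={total}, took {seconds * 1000:.3f} ms")
```

In this notebook you could call `time_call(archive.get_files_by_tags, "input")` twice and compare the two elapsed times. With an archive this small the difference may be within measurement noise.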

## Working with Multiple Archives

The `ArchiveRegistry` allows you to work with multiple archives as a unified collection:

In [None]:
# Create a second archive for demonstration
def create_second_archive(archive_path: Path) -> Path:
    """Create a second archive with different content"""
    
    with tempfile.TemporaryDirectory() as temp_dir:
        temp_path = Path(temp_dir)
        
        files_to_create = {
            "analysis/statistics.nc": b"Statistical analysis results",
            "analysis/trends.nc": b"Climate trend analysis",
            "plots/temperature_map.png": b"Fake PNG data",
            "plots/precipitation_timeseries.png": b"Fake PNG data",
            "reports/summary.pdf": b"Fake PDF report data",
            "scripts/analyze.py": b"#!/usr/bin/env python3\nprint('Analyzing data')\n",
        }
        
        for file_path, content in files_to_create.items():
            full_path = temp_path / file_path
            full_path.parent.mkdir(parents=True, exist_ok=True)
            full_path.write_bytes(content)
        
        archive_path.parent.mkdir(parents=True, exist_ok=True)
        with tarfile.open(archive_path, "w:gz") as tar:
            for file_path in files_to_create:
                full_path = temp_path / file_path
                tar.add(full_path, arcname=file_path)
    
    return archive_path

# Create second archive
second_archive = create_second_archive(temp_dir / "analysis_results.tar.gz")
print(f"✓ Created second archive: {second_archive}")

# Create archive registry
registry = ArchiveRegistry(
    simulation_id="climate_demo",
    cache_manager=cache_manager
)

# Add both archives
archive2 = CompressedArchive(
    archive_id="analysis_demo",
    archive_location=str(second_archive)
)

registry.add_archive(archive, "simulation_data")
registry.add_archive(archive2, "analysis_results")

print(f"\n✓ Registry contains {len(registry.list_archives())} archives:")
for name in registry.list_archives():
    print(f"  • {name}")

### Smart File Resolution

The registry can find files across all archives and choose the best source:

In [None]:
# Find all script files across archives
all_scripts = []
for archive_name in registry.list_archives():
    archive_obj = registry.get_archive(archive_name)
    script_files = archive_obj.get_files_by_tags("scripts")
    for script_file in script_files:
        all_scripts.append((archive_name, script_file))

print("Script files across all archives:")
print("=" * 45)
for archive_name, script_file in all_scripts:
    print(f"  📜 {script_file} (from {archive_name})")

# Extract files by tags from all archives
extract_dir_all = temp_dir / "extracted" / "all_scripts"
results = registry.extract_files_by_tags(extract_dir_all, "scripts")

print("\n✓ Extracted script files from all archives:")
for archive_name, extracted_files in results.items():
    print(f"  From {archive_name}: {len(extracted_files)} files")
    for file_path in extracted_files:
        print(f"    → {file_path}")
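
When several archives are extracted into one directory, two archives may contain a file at the same relative path. How the registry resolves such a case is not shown in this notebook, so the sketch below only detects collisions up front. `index_by_path` and the example listings are hypothetical; the shared `docs/README.md` entry is invented to show a collision.

```python
from collections import defaultdict


def index_by_path(listings: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each file path to the names of the archives that contain it."""
    index: dict[str, list[str]] = defaultdict(list)
    for archive_name, paths in listings.items():
        for path in paths:
            index[path].append(archive_name)
    return dict(index)


# Hypothetical listings; the shared README path is invented for this example.
example_listings = {
    "simulation_data": ["scripts/run_model.sh", "docs/README.md"],
    "analysis_results": ["scripts/analyze.py", "docs/README.md"],
}
example_index = index_by_path(example_listings)
duplicates = {path: names for path, names in example_index.items() if len(names) > 1}
print(duplicates)  # {'docs/README.md': ['simulation_data', 'analysis_results']}
```

With the two archives built in this notebook there are no shared paths, so the same check would report no duplicates.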

## Summary

In this notebook, we've demonstrated the core functionality of the Tellus archive system:

✅ **Automatic file tagging** based on directory structure  
✅ **Selective file extraction** by tags or file names  
✅ **Intelligent caching** for performance optimization  
✅ **Multi-archive management** through registries  
✅ **Progress tracking** for long-running operations  

## Next Steps

- **Advanced Features**: Check out the advanced notebook for location integration, custom tagging, and path mapping
- **CLI Usage**: See the CLI examples notebook for command-line operations
- **Production Use**: The system is designed to work with tape systems, FTP servers, and other remote storage

## Cleanup

Let's clean up the temporary files:

In [None]:
import shutil

# Clean up temporary directory
shutil.rmtree(temp_dir)
print(f"✓ Cleaned up temporary files from {temp_dir}")
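
As an alternative to calling `shutil.rmtree` by hand, `tempfile.TemporaryDirectory` removes the directory automatically when its `with` block exits, even if an exception is raised inside it. A minimal sketch (the file name `note.txt` is arbitrary):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as scratch:
    scratch_path = Path(scratch)
    (scratch_path / "note.txt").write_text("temporary data")
    existed_inside = scratch_path.exists()

# The directory and its contents are removed when the block exits.
removed_after = not scratch_path.exists()
print(existed_inside, removed_after)  # True True
```

This pattern suits scripts better than notebooks, where the working directory usually has to survive across several cells.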