# Tutorial 2: Smart File Classification and Selective Archiving

**Learning Goals:** Understand how Tellus automatically classifies Earth Science files, and create targeted archives for specific use cases.

**Time Estimate:** 25 minutes

**Prerequisites:** Tutorial 1 completed

## The File Classification Challenge

Imagine you're managing a complex CESM simulation with hundreds of files. Without classification, finding what you need is like searching through a messy closet:

```
❌ Unorganized:
output/
├── cam.h0.2024-01.nc       # Atmosphere output - IMPORTANT  
├── cam.log                 # Log file - OPTIONAL
├── cam.r.2024-04-01.nc     # Restart file - CRITICAL
├── user_nl_cam             # Configuration - CRITICAL
├── temp_processing.nc      # Temporary file - TEMPORARY
└── analysis_script.py      # Analysis code - IMPORTANT
```

**The Solution**: Tellus automatically sorts files into meaningful categories, so you can archive exactly what you need:

```
✅ Organized by Content Type:
INPUT:       user_nl_cam, initial_conditions.nc
OUTPUT:      cam.h0.*.nc, clm.h0.*.nc  
RESTART:     *.r.*.nc
LOG:         *.log, *.out
SCRIPT:      *.py, *.sh
TEMPORARY:   temp_*, scratch_*
```

This tutorial will show you how to leverage this classification for smarter archiving strategies.
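
The pattern-to-category mapping above can be sketched with glob matching from the standard library. This is a minimal illustration, not Tellus's implementation: the `RULES` table and the `classify` function are hypothetical names, and the patterns are copied from the listing above. Rules are tried in order and the first match wins.

```python
from fnmatch import fnmatch

# Hypothetical rule table mirroring the listing above; first match wins.
# These patterns are illustrative, not Tellus's actual rules.
RULES = [
    ("TEMPORARY", ["temp_*", "scratch_*"]),
    ("RESTART", ["*.r.*.nc"]),
    ("INPUT", ["user_nl_*", "initial_conditions.nc"]),
    ("OUTPUT", ["cam.h0.*.nc", "clm.h0.*.nc"]),
    ("LOG", ["*.log", "*.out"]),
    ("SCRIPT", ["*.py", "*.sh"]),
]

def classify(filename):
    """Return the first category whose glob pattern matches, else 'UNKNOWN'."""
    for category, patterns in RULES:
        if any(fnmatch(filename, pattern) for pattern in patterns):
            return category
    return "UNKNOWN"

for name in ["cam.h0.2024-01.nc", "cam.r.2024-04-01.nc", "user_nl_cam",
             "cam.log", "temp_processing.nc", "analysis_script.py"]:
    print(f"{name:24s} -> {classify(name)}")
```

Rule order matters in this sketch: because TEMPORARY is checked first, a file such as `temp_run.log` is treated as temporary rather than as a log.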

## Setup: Creating a Complex Earth Science Workflow

Let's create a more complex simulation that represents a realistic Earth Science workflow with multiple analysis stages:

In [None]:
import tempfile
from pathlib import Path
import json
import numpy as np
import xarray as xr
from datetime import datetime, timedelta
from tellus.core.cli import console
from rich.table import Table
from rich.panel import Panel

# Create tutorial workspace
tutorial_dir = Path(tempfile.mkdtemp())
console.print(f"[blue]Tutorial workspace: {tutorial_dir}[/blue]")

def create_complex_earth_science_workflow():
    """
    Creates a complex Earth Science workflow directory that includes:
    - Multiple model components (atmosphere, ocean, land, ice)
    - Different file types (input, output, restart, diagnostics)
    - Analysis products and temporary files
    - Multiple time periods and resolutions
    """
    
    workflow_dir = tutorial_dir / "coupled_climate_system"
    workflow_dir.mkdir(parents=True, exist_ok=True)
    
    console.print("[blue]Creating complex Earth Science workflow...[/blue]")
    
    # ========================================
    # 1. INPUT FILES - Model Configuration
    # ========================================
    input_dir = workflow_dir / "input"
    input_dir.mkdir(exist_ok=True)
    
    console.print("  📝 Creating configuration files...")
    
    # CESM component namelists (CRITICAL)
    (input_dir / "user_nl_cam").write_text(
        "! CAM atmospheric component\n"
        "nhtfrq = -24, -6\n"          # Daily and 6-hourly output
        "mfilt = 30, 120\n"          # Files per stream
        "fincl1 = 'T','Q','U','V','PRECC','PRECL'\n"
        "fincl2 = 'TS','PSL','Z500'\n"  # High-frequency surface vars
    )
    
    (input_dir / "user_nl_pop").write_text(
        "! POP ocean component\n"
        "tavg_nfile = 1\n"
        "tavg_freq_opt = 'nmonth'\n"
        "tavg_freq = 1\n"
        "tavg_contents = 'TEMP','SALT','UVEL','VVEL','SSH'\n"
    )
    
    (input_dir / "user_nl_clm").write_text(
        "! CLM land component\n"
        "hist_nhtfrq = -24\n"
        "hist_mfilt = 30\n"
        "hist_fincl1 = 'TSA','GPP','NPP','SOILWATER_10CM'\n"
    )
    
    (input_dir / "user_nl_cice").write_text(
        "! CICE sea ice component\n"
        "histfreq = 'm','d'\n"       # Monthly and daily
        "histfreq_n = 1,1\n"
        "f_aice = 'm','d'\n"
        "f_hi = 'm','d'\n"
    )
    
    # Boundary conditions and forcing data (CRITICAL)
    create_sample_netcdf(input_dir / "sst_forcing.nc", "forcing")
    create_sample_netcdf(input_dir / "topography.nc", "static")
    create_sample_netcdf(input_dir / "land_surface_data.nc", "static")
    
    # ========================================
    # 2. OUTPUT FILES - Multiple Components and Frequencies
    # ========================================
    output_dir = workflow_dir / "output"
    output_dir.mkdir(exist_ok=True)
    
    console.print("  🌍 Creating model output files...")
    
    # Atmosphere (CAM) - multiple streams
    for month in ["2024-01", "2024-02", "2024-03"]:
        # Daily averages (primary stream)
        create_sample_netcdf(output_dir / f"cam.h0.{month}.nc", "atmosphere_daily")
        # 6-hourly data (secondary stream)
        create_sample_netcdf(output_dir / f"cam.h1.{month}.nc", "atmosphere_6hourly")
    
    # Ocean (POP) - monthly averages
    for month in ["2024-01", "2024-02", "2024-03"]:
        create_sample_netcdf(output_dir / f"pop.h.{month}.nc", "ocean")
    
    # Land (CLM) - daily averages
    for month in ["2024-01", "2024-02", "2024-03"]:
        create_sample_netcdf(output_dir / f"clm.h0.{month}.nc", "land")
    
    # Sea Ice (CICE) - monthly and daily
    for month in ["2024-01", "2024-02", "2024-03"]:
        create_sample_netcdf(output_dir / f"cice.h.{month}.nc", "seaice_monthly")
        create_sample_netcdf(output_dir / f"cice.hd.{month}.nc", "seaice_daily")
    
    # ========================================
    # 3. RESTART FILES - For Continuation
    # ========================================
    restart_dir = workflow_dir / "restart"
    restart_dir.mkdir(exist_ok=True)
    
    console.print("  🔄 Creating restart files...")
    
    # Each component needs restart files
    restart_date = "2024-04-01"
    create_sample_netcdf(restart_dir / f"cam.r.{restart_date}-00000.nc", "restart")
    create_sample_netcdf(restart_dir / f"pop.r.{restart_date}-00000.nc", "restart")
    create_sample_netcdf(restart_dir / f"clm.r.{restart_date}-00000.nc", "restart")
    create_sample_netcdf(restart_dir / f"cice.r.{restart_date}-00000.nc", "restart")
    
    # Restart pointer files (small text files)
    (restart_dir / "rpointer.atm").write_text(f"cam.r.{restart_date}-00000.nc")
    (restart_dir / "rpointer.ocn").write_text(f"pop.r.{restart_date}-00000.nc")
    (restart_dir / "rpointer.lnd").write_text(f"clm.r.{restart_date}-00000.nc")
    (restart_dir / "rpointer.ice").write_text(f"cice.r.{restart_date}-00000.nc")
    
    # ========================================
    # 4. DIAGNOSTIC FILES - Derived Analysis
    # ========================================
    diagnostics_dir = workflow_dir / "diagnostics"
    diagnostics_dir.mkdir(exist_ok=True)
    
    console.print("  📊 Creating diagnostic files...")
    
    # Climate indices and derived quantities
    create_sample_netcdf(diagnostics_dir / "enso_index_2024.nc", "timeseries")
    create_sample_netcdf(diagnostics_dir / "global_temperature_2024.nc", "timeseries")
    create_sample_netcdf(diagnostics_dir / "seasonal_means_2024.nc", "climatology")
    create_sample_netcdf(diagnostics_dir / "annual_cycle_2024.nc", "climatology")
    
    # Component-specific diagnostics
    create_sample_netcdf(diagnostics_dir / "atmosphere_budgets_2024.nc", "budget")
    create_sample_netcdf(diagnostics_dir / "ocean_transports_2024.nc", "transport")
    create_sample_netcdf(diagnostics_dir / "carbon_cycle_2024.nc", "biogeochemistry")
    
    # ========================================
    # 5. LOG FILES - Model Run Information
    # ========================================
    logs_dir = workflow_dir / "logs"
    logs_dir.mkdir(exist_ok=True)
    
    console.print("  📝 Creating log files...")
    
    # Component log files
    (logs_dir / "atm.log.240101-000000").write_text(
        "CAM ATMOSPHERE MODEL LOG\n"
        "========================\n"
        "Model: CAM6\n"
        "Resolution: f19_g16 (1.9x2.5 deg)\n"
        "Timestep: 1800 seconds\n"
        "Physics: CAM6 physics package\n"
        "Start: 2024-01-01 00:00:00\n"
        "End: 2024-03-31 23:59:59\n"
        "Status: COMPLETED SUCCESSFULLY\n"
    )
    
    (logs_dir / "ocn.log.240101-000000").write_text(
        "POP OCEAN MODEL LOG\n"
        "==================\n"
        "Model: POP2\n"
        "Resolution: gx1v7 (1 degree)\n"
        "Timestep: 3600 seconds\n"
        "Vertical levels: 60\n"
        "Start: 2024-01-01 00:00:00\n"
        "End: 2024-03-31 23:59:59\n"
        "Status: COMPLETED SUCCESSFULLY\n"
    )
    
    (logs_dir / "cesm.log").write_text(
        "CESM COUPLED MODEL LOG\n"
        "=====================\n"
        "Case: coupled_climate_system\n"
        "Compset: F2000climo\n"
        "Resolution: f19_g16\n"
        "Components: CAM6, POP2, CLM5, CICE5\n"
        "Runtime: 6.2 hours\n"
        "Throughput: 1.0 simulated years per wall day\n"  # 3 months in 6.2 hours
        "Memory usage: 12.4 GB peak\n"
        "Status: COMPLETED SUCCESSFULLY\n"
    )
    
    # Performance and timing logs
    (logs_dir / "timing.log").write_text(
        "CESM TIMING SUMMARY\n"
        "==================\n"
        "Total runtime: 22,320 seconds\n"
        "Init time: 45 seconds\n"
        "Run time: 22,200 seconds\n"
        "Finalize time: 75 seconds\n"
        "\n"
        "Component timings (share of total runtime):\n"
        "CAM: 12,480 seconds (55.9%)\n"
        "POP: 8,640 seconds (38.7%)\n"
        "CLM: 720 seconds (3.2%)\n"
        "CICE: 360 seconds (1.6%)\n"
        "Coupler: 120 seconds (0.5%)\n"
    )
    
    # ========================================
    # 6. ANALYSIS SCRIPTS AND WORKFLOWS
    # ========================================
    scripts_dir = workflow_dir / "scripts"
    scripts_dir.mkdir(exist_ok=True)
    
    console.print("  🔬 Creating analysis scripts...")
    
    # Analysis and post-processing scripts
    (scripts_dir / "compute_climatology.py").write_text(
        "#!/usr/bin/env python3\n"
        "\"\"\"Compute climatological means from CESM output\"\"\"\n"
        "import xarray as xr\n"
        "import numpy as np\n"
        "\n"
        "def compute_climatology(input_files, output_file):\n"
        "    \"\"\"Compute long-term climatological means\"\"\"\n"
        "    ds = xr.open_mfdataset(input_files)\n"
        "    climatology = ds.groupby('time.month').mean('time')\n"
        "    climatology.to_netcdf(output_file)\n"
        "    print(f'Climatology saved to {output_file}')\n"
    )
    
    (scripts_dir / "analyze_enso.py").write_text(
        "#!/usr/bin/env python3\n"
        "\"\"\"Compute ENSO indices from CESM ocean output\"\"\"\n"
        "import xarray as xr\n"
        "import numpy as np\n"
        "\n"
        "def compute_nino34_index(sst_files):\n"
        "    \"\"\"Compute Nino 3.4 index from SST data\"\"\"\n"
        "    ds = xr.open_mfdataset(sst_files)\n"
        "    # Select Nino 3.4 region (5N-5S, 120W-170W)\n"
        "    nino34 = ds.sel(lat=slice(-5, 5), lon=slice(190, 240))\n"
        "    # Compute area-weighted mean\n"
        "    index = nino34.weighted(np.cos(np.deg2rad(nino34.lat))).mean(['lat', 'lon'])\n"
        "    return index\n"
    )
    
    (scripts_dir / "postprocess_batch.sh").write_text(
        "#!/bin/bash\n"
        "# Batch post-processing script for CESM output\n"
        "\n"
        "set -e  # Exit on any error\n"
        "\n"
        "echo 'Starting CESM post-processing pipeline...'\n"
        "\n"
        "# Step 1: Compute climatologies\n"
        "python compute_climatology.py\n"
        "\n"
        "# Step 2: Compute climate indices\n"
        "python analyze_enso.py\n"
        "\n"
        "# Step 3: Generate summary plots\n"
        "python create_summary_plots.py\n"
        "\n"
        "echo 'Post-processing complete!'\n"
    )
    
    # Visualization script
    (scripts_dir / "create_summary_plots.py").write_text(
        "#!/usr/bin/env python3\n"
        "\"\"\"Create summary plots from CESM analysis\"\"\"\n"
        "import matplotlib.pyplot as plt\n"
        "import xarray as xr\n"
        "import cartopy.crs as ccrs\n"
        "\n"
        "def create_temperature_map(data_file, output_file):\n"
        "    \"\"\"Create global temperature map\"\"\"\n"
        "    ds = xr.open_dataset(data_file)\n"
        "    fig = plt.figure(figsize=(12, 6))\n"
        "    ax = plt.axes(projection=ccrs.PlateCarree())\n"
        "    ds.T.isel(time=0).plot(ax=ax, transform=ccrs.PlateCarree())\n"
        "    ax.coastlines()\n"
        "    plt.savefig(output_file, dpi=150)\n"
    )
    
    # ========================================
    # 7. TEMPORARY AND INTERMEDIATE FILES
    # ========================================
    temp_dir = workflow_dir / "temp"
    temp_dir.mkdir(exist_ok=True)
    
    console.print("  🗂️ Creating temporary files...")
    
    # Temporary processing files (should not be archived)
    (temp_dir / "temp_processing_cam.nc").write_bytes(b"Temporary CAM processing data" * 1000)
    (temp_dir / "scratch_regridding.nc").write_bytes(b"Scratch regridding workspace" * 1000)
    (temp_dir / "intermediate_calculation.dat").write_bytes(b"Intermediate calculation" * 500)
    
    # Work in progress files
    (temp_dir / "work_in_progress.nc").write_bytes(b"Work in progress analysis" * 800)
    (temp_dir / "debug_output.txt").write_text(
        "DEBUG OUTPUT\n"
        "===========\n"
        "Testing regridding function...\n"
        "Input shape: (180, 360)\n"
        "Output shape: (90, 180)\n"
        "Status: In progress...\n"
    )
    
    # ========================================
    # 8. DOCUMENTATION AND METADATA
    # ========================================
    docs_dir = workflow_dir / "docs"
    docs_dir.mkdir(exist_ok=True)
    
    console.print("  📖 Creating documentation...")
    
    (docs_dir / "README.md").write_text(
        "# Coupled Climate System Simulation\n"
        "\n"
        "## Overview\n"
        "This simulation uses CESM2 to study coupled Earth system dynamics\n"
        "under present-day climate forcing conditions.\n"
        "\n"
        "## Model Configuration\n"
        "- **Case**: coupled_climate_system\n"
        "- **Compset**: F2000climo (fixed SSTs, present-day forcing)\n"
        "- **Resolution**: f19_g16 (atmosphere 1.9x2.5°, ocean ~1°)\n"
        "- **Duration**: 3 months (2024-01-01 to 2024-03-31)\n"
        "\n"
        "## Output Files\n"
        "- `output/`: Primary model output from all components\n"
        "- `diagnostics/`: Derived analysis products\n"
        "- `restart/`: Files needed to continue simulation\n"
        "\n"
        "## Analysis Scripts\n"
        "- `scripts/compute_climatology.py`: Compute seasonal means\n"
        "- `scripts/analyze_enso.py`: ENSO index calculations\n"
        "- `scripts/postprocess_batch.sh`: Automated processing pipeline\n"
    )
    
    (docs_dir / "simulation_log.json").write_text(json.dumps({
        "simulation_id": "coupled_climate_system",
        "model": "CESM2",
        "compset": "F2000climo",
        "resolution": "f19_g16",
        "start_date": "2024-01-01",
        "end_date": "2024-03-31",
        "runtime_hours": 6.2,
        "components": ["CAM6", "POP2", "CLM5", "CICE5"],
        "created_by": "tutorial_user",
        "purpose": "Tutorial demonstration of file classification"
    }, indent=2))
    
    return workflow_dir

def create_sample_netcdf(filepath, data_type):
    """
    Creates sample NetCDF files for different Earth Science data types.
    Each type has realistic attributes and structure.
    """
    
    # Standard coordinates
    lat = np.linspace(-90, 90, 96)
    # Exclude the endpoint: 0 and 360 degrees are the same meridian
    lon = np.linspace(0, 360, 144, endpoint=False)
    time = [datetime(2024, 1, 15)]
    
    if data_type == "atmosphere_daily":
        # Broadcast each latitude profile across longitude so every array
        # has the full (time, lat, lon) shape that xarray expects
        grid = np.ones((1, 96, 144))
        ds = xr.Dataset({
            'T': (['time', 'lat', 'lon'], 288 + 30 * np.cos(np.radians(lat))[None, :, None] * grid),
            'Q': (['time', 'lat', 'lon'], 0.01 * grid),
            'U': (['time', 'lat', 'lon'], 10 * np.sin(2 * np.radians(lat))[None, :, None] * grid),
            'PRECC': (['time', 'lat', 'lon'], 0.001 * np.abs(np.cos(np.radians(lat)))[None, :, None] * grid)
        }, coords={'time': time, 'lat': lat, 'lon': lon})
        ds.attrs = {'title': 'CAM Daily Atmospheric Output', 'model': 'CAM6', 'frequency': 'daily'}
    
    elif data_type == "atmosphere_6hourly":
        time_6h = [datetime(2024, 1, 15) + timedelta(hours=6*i) for i in range(4)]
        # Four 6-hourly steps: broadcast the latitude profile to (4, lat, lon)
        grid_6h = np.ones((4, 96, 144))
        ds = xr.Dataset({
            'TS': (['time', 'lat', 'lon'], 285 + 25 * np.cos(np.radians(lat))[None, :, None] * grid_6h),
            'PSL': (['time', 'lat', 'lon'], 101325 * grid_6h)
        }, coords={'time': time_6h, 'lat': lat, 'lon': lon})
        ds.attrs = {'title': 'CAM 6-Hourly Surface Output', 'model': 'CAM6', 'frequency': '6hourly'}
    
    elif data_type == "ocean":
        depth = np.array([5, 15, 25, 45, 75])
        # Broadcast the depth profile to the full (time, z_t, lat, lon) shape
        ds = xr.Dataset({
            'TEMP': (['time', 'z_t', 'lat', 'lon'],
                     290 - 40 * depth[None, :, None, None] / 1000 * np.ones((1, 5, 96, 144))),
            'SALT': (['time', 'z_t', 'lat', 'lon'], 35 * np.ones((1, 5, 96, 144))),
            'SSH': (['time', 'lat', 'lon'],
                    0.1 * np.sin(2 * np.radians(lat))[None, :, None] * np.ones((1, 96, 144)))
        }, coords={'time': time, 'z_t': depth, 'lat': lat, 'lon': lon})
        ds.attrs = {'title': 'POP Ocean Model Output', 'model': 'POP2', 'frequency': 'monthly'}
        
    elif data_type == "land":
        # Broadcast each latitude profile across longitude to (time, lat, lon)
        grid = np.ones((1, 96, 144))
        ds = xr.Dataset({
            'TSA': (['time', 'lat', 'lon'], 285 + 25 * np.cos(np.radians(lat))[None, :, None] * grid),
            'GPP': (['time', 'lat', 'lon'], 0.01 * np.abs(np.cos(np.radians(lat)))[None, :, None] * grid),
            'SOILWATER_10CM': (['time', 'lat', 'lon'], 0.3 * grid)
        }, coords={'time': time, 'lat': lat, 'lon': lon})
        ds.attrs = {'title': 'CLM Land Model Output', 'model': 'CLM5', 'frequency': 'daily'}
    
    elif data_type.startswith("seaice"):
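        # Broadcast the polar masks across longitude to (time, lat, lon)
        grid = np.ones((1, 96, 144))
        ds = xr.Dataset({
            'aice': (['time', 'lat', 'lon'], 0.8 * (np.abs(lat) > 60)[None, :, None] * grid),
            'hi': (['time', 'lat', 'lon'], 2.0 * (np.abs(lat) > 70)[None, :, None] * grid)
        }, coords={'time': time, 'lat': lat, 'lon': lon})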
        freq = 'monthly' if 'monthly' in data_type else 'daily'
        ds.attrs = {'title': 'CICE Sea Ice Output', 'model': 'CICE5', 'frequency': freq}
    
    elif data_type in ["restart", "forcing", "static"]:
        # Simplified restart/forcing files
        ds = xr.Dataset({
            'DATA': (['lat', 'lon'], 300 * np.ones((96, 144))),
        }, coords={'lat': lat, 'lon': lon})
        ds.attrs = {'title': f'{data_type.title()} Data', 'purpose': data_type}
    
    elif data_type == "timeseries":
        time_monthly = [datetime(2024, m, 15) for m in range(1, 4)]
        ds = xr.Dataset({
            'index': (['time'], np.random.randn(3)),
        }, coords={'time': time_monthly})
        ds.attrs = {'title': 'Climate Index Time Series'}
    
    elif data_type == "climatology":
        months = np.arange(1, 13)
        ds = xr.Dataset({
            'climatology': (['month', 'lat', 'lon'], 
                          288 + 30 * np.cos(np.radians(lat))[None, :, None] * np.ones((12, 96, 144)))
        }, coords={'month': months, 'lat': lat, 'lon': lon})
        ds.attrs = {'title': 'Climatological Means'}
    
    else:
        # Generic data
        ds = xr.Dataset({
            'data': (['lat', 'lon'], np.random.randn(96, 144))
        }, coords={'lat': lat, 'lon': lon})
        ds.attrs = {'title': f'{data_type.title()} Data'}
    
    # Save file
    ds.to_netcdf(filepath, format='NETCDF4_CLASSIC')

# Create the complex workflow
workflow_dir = create_complex_earth_science_workflow()
console.print(f"\n[green]✅ Complex Earth Science workflow created: {workflow_dir.name}[/green]")
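
Before classifying anything, it can help to confirm what the setup cell wrote to disk. The sketch below counts files and bytes per top-level directory using only the standard library. `summarize_tree` is a hypothetical helper, not a Tellus function, and the small stand-in tree exists only so the example runs on its own; in the notebook you would pass `workflow_dir` instead.

```python
import tempfile
from collections import Counter
from pathlib import Path

def summarize_tree(root):
    """Return {top-level entry: (file count, total bytes)} for files under root."""
    counts, sizes = Counter(), Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            top = path.relative_to(root).parts[0]
            counts[top] += 1
            sizes[top] += path.stat().st_size
    return {top: (counts[top], sizes[top]) for top in sorted(counts)}

# Stand-in tree so the sketch is self-contained.
demo = Path(tempfile.mkdtemp())
(demo / "input").mkdir()
(demo / "input" / "user_nl_cam").write_text("nhtfrq = -24\n")
(demo / "logs").mkdir()
(demo / "logs" / "cesm.log").write_text("Status: COMPLETED\n")
(demo / "logs" / "timing.log").write_text("Total runtime\n")

summary = summarize_tree(demo)
for top, (n_files, n_bytes) in summary.items():
    print(f"{top:8s} {n_files:3d} files {n_bytes:6d} bytes")
```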

## Understanding File Classification

Now let's examine what we created and see how Tellus would automatically classify these files:

In [None]:
# Let's analyze our workflow directory and classify files
def analyze_earth_science_files(directory):
    """
    Analyze files in Earth Science workflow and classify them.
    This simulates what Tellus does automatically.
    """
    
    classified_files = {
        'INPUT': [],
        'OUTPUT': [],
        'RESTART': [],
        'DIAGNOSTIC': [],
        'LOG': [],
        'SCRIPT': [],
        'TEMPORARY': [],
        'METADATA': [],
        'INTERMEDIATE': []
    }
    
    importance_levels = {
        'CRITICAL': [],
        'IMPORTANT': [],
        'OPTIONAL': [],
        'TEMPORARY': []
    }
    
    total_size = 0
    
    for file_path in directory.rglob('*'):
        if not file_path.is_file():
            continue
            
        rel_path = file_path.relative_to(directory)
        size = file_path.stat().st_size
        total_size += size
        
        # Classify by content type (what Tellus does automatically)
        content_type, importance = classify_earth_science_file(rel_path)
        
        file_info = {
            'path': str(rel_path),
            'size': size,
            'content_type': content_type,
            'importance': importance
        }
        
        classified_files[content_type].append(file_info)
        importance_levels[importance].append(file_info)
    
    return classified_files, importance_levels, total_size

def classify_earth_science_file(file_path):
    """
    Classify a single Earth Science file.
    This is a simplified version of Tellus's classification logic.
    """
    
    path_str = str(file_path).lower()
    name = file_path.name.lower()
    
    # INPUT FILES - Model configuration and boundary conditions
    if (name.startswith('user_nl_') or 
        'forcing' in name or 
        'topography' in name or
        'land_surface' in name or
        'initial' in name):
        return 'INPUT', 'CRITICAL'
    
    # RESTART FILES - For continuing simulations
    if (('.r.' in name and '.nc' in name) or 
        name.startswith('rpointer')):
        return 'RESTART', 'CRITICAL'
    
    # OUTPUT FILES - Primary model results
    if ('output' in path_str and '.nc' in name and 
        any(model in name for model in ['cam.h', 'pop.h', 'clm.h', 'cice.h'])):
        return 'OUTPUT', 'IMPORTANT'
    
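    # DIAGNOSTIC FILES - Derived analysis products
    # Name-based terms apply only to NetCDF files, so a script such as
    # compute_climatology.py falls through to the SCRIPT rule below
    if ('diagnostic' in path_str or
        (name.endswith('.nc') and
         any(term in name for term in ['index', 'climatology', 'budget', 'transport']))):
        return 'DIAGNOSTIC', 'IMPORTANT'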
    
    # LOG FILES - Model run information
    if (name.endswith('.log') or 
        'timing' in name or
        path_str.startswith('logs/')):
        return 'LOG', 'OPTIONAL'
    
    # SCRIPT FILES - Analysis and processing code
    if (name.endswith(('.py', '.sh', '.ncl', '.m')) or
        'script' in path_str):
        return 'SCRIPT', 'IMPORTANT'
    
    # TEMPORARY FILES - Should not be archived
    if ('temp' in path_str or 
        name.startswith(('temp_', 'scratch_', 'work_in_progress')) or
        'debug' in name):
        return 'TEMPORARY', 'TEMPORARY'
    
    # METADATA FILES - Documentation and catalogs
    if (name.endswith(('.md', '.txt', '.json', '.yaml', '.yml')) or
        'docs' in path_str):
        return 'METADATA', 'OPTIONAL'
    
    # INTERMEDIATE FILES - Processing intermediates
    if name.endswith(('.dat', '.tmp', '.cache')):
        return 'INTERMEDIATE', 'OPTIONAL'
    
    # Default classification
    return 'METADATA', 'OPTIONAL'

# Analyze our workflow
classified_files, importance_levels, total_size = analyze_earth_science_files(workflow_dir)

# Display classification results
console.print("\n[bold blue]🔍 Automatic File Classification Results[/bold blue]")
console.print("=" * 60)

# Create summary table
table = Table(title="File Classification Summary")
table.add_column("Content Type", style="cyan")
table.add_column("Files", justify="right", style="green")
table.add_column("Size (MB)", justify="right", style="yellow")
table.add_column("Purpose", style="dim")

# Add rows for each content type
content_descriptions = {
    'INPUT': 'Model configuration and forcing data',
    'OUTPUT': 'Primary scientific results',
    'RESTART': 'Files for continuing simulations',
    'DIAGNOSTIC': 'Derived analysis products',
    'LOG': 'Model run information and diagnostics',
    'SCRIPT': 'Analysis and processing workflows',
    'TEMPORARY': 'Temporary files (exclude from archive)',
    'METADATA': 'Documentation and catalogs',
    'INTERMEDIATE': 'Processing intermediate files'
}

for content_type, files in classified_files.items():
    if files:  # Only show types with files
        file_count = len(files)
        type_size = sum(f['size'] for f in files) / (1024 * 1024)
        description = content_descriptions[content_type]
        
        table.add_row(content_type, str(file_count), f"{type_size:.1f}", description)

console.print(table)

# Show importance breakdown
console.print("\n[bold blue]⚖️ File Importance Levels[/bold blue]")
importance_table = Table()
importance_table.add_column("Importance", style="cyan")
importance_table.add_column("Files", justify="right", style="green")
importance_table.add_column("Size (MB)", justify="right", style="yellow")
importance_table.add_column("Archive Strategy", style="dim")

importance_strategies = {
    'CRITICAL': 'Always archive (needed for reproduction)',
    'IMPORTANT': 'Usually archive (valuable for analysis)',
    'OPTIONAL': 'Archive selectively (useful but not essential)',
    'TEMPORARY': 'Never archive (can be regenerated)'
}

for importance, files in importance_levels.items():
    if files:
        file_count = len(files)
        imp_size = sum(f['size'] for f in files) / (1024 * 1024)
        strategy = importance_strategies[importance]
        
        importance_table.add_row(importance, str(file_count), f"{imp_size:.1f}", strategy)

console.print(importance_table)

console.print(f"\n[green]Total files analyzed: {sum(len(files) for files in classified_files.values())}[/green]")
console.print(f"[green]Total size: {total_size / (1024 * 1024):.1f} MB[/green]")
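
The classifier above relies on substring checks such as `'temp' in path_str`, and its rules are evaluated in order, so the first matching rule decides the category. Substring checks can match more than intended. The sketch below compares a plain substring test with a stricter test against the start of each path component. Both helper names are hypothetical and the paths are taken from the workflow created earlier.

```python
from pathlib import PurePosixPath

def matches_substring(rel_path, pattern):
    """True if the pattern occurs anywhere in the lowercased path."""
    return pattern in str(rel_path).lower()

def matches_component(rel_path, pattern):
    """True if any directory or file name in the path starts with the pattern."""
    parts = PurePosixPath(rel_path).parts
    return any(part.lower().startswith(pattern) for part in parts)

paths = [
    "temp/scratch_regridding.nc",
    "diagnostics/global_temperature_2024.nc",
    "scripts/compute_climatology.py",
    "logs/cesm.log",
]

for pattern in ("temp", "log"):
    for path in paths:
        print(f"{pattern:5s} {path:40s} "
              f"substring={matches_substring(path, pattern)!s:5s} "
              f"component={matches_component(path, pattern)}")
```

With the substring test, `temp` also hits `global_temperature_2024.nc` and `log` also hits `compute_climatology.py`; the component test matches only the `temp/` and `logs/` entries. Keep this in mind when choosing rule order or exclusion patterns.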

## Step 1: Creating Selective Archives

Now let's create different types of archives based on our file classification. Filtering by content type and importance level lets each archive contain only the files its purpose requires:

In [None]:
import tarfile
import json
from datetime import datetime

def create_selective_archive(source_dir, archive_path, content_types=None, importance_levels=None, exclude_patterns=None):
    """
    Create a selective archive based on file classification.
    This demonstrates the core concept behind Tellus selective archiving.
    """
    
    files_to_archive = []
    metadata_files = []
    excluded_files = []
    
    for file_path in source_dir.rglob('*'):
        if not file_path.is_file():
            continue
            
        rel_path = file_path.relative_to(source_dir)
        content_type, importance = classify_earth_science_file(rel_path)
        
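        # Check exclusion patterns against each path component rather than the
        # whole path string: a plain substring test would let 'temp' exclude
        # global_temperature_2024.nc and 'log' exclude compute_climatology.py
        if exclude_patterns:
            parts = [part.lower() for part in rel_path.parts]
            if any(part.startswith(pattern) or part.endswith('.' + pattern)
                   for part in parts for pattern in exclude_patterns):
                excluded_files.append({
                    'path': str(rel_path),
                    'reason': 'excluded_pattern',
                    'content_type': content_type,
                    'importance': importance
                })
                continue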
        
        # Check content type filter
        include_file = True
        if content_types and content_type not in content_types:
            include_file = False
        
        # Check importance level filter
        if importance_levels and importance not in importance_levels:
            include_file = False
        
        if include_file:
            file_info = {
                'path': str(rel_path),
                'size': file_path.stat().st_size,
                'content_type': content_type,
                'importance': importance,
                'modified': file_path.stat().st_mtime
            }
            files_to_archive.append((file_path, file_info))
            metadata_files.append(file_info)
        else:
            excluded_files.append({
                'path': str(rel_path),
                'reason': 'filtered_out',
                'content_type': content_type,
                'importance': importance
            })
    
    # Create the archive
    console.print(f"[blue]Creating selective archive with {len(files_to_archive)} files...[/blue]")
    
    with tarfile.open(archive_path, "w:gz") as tar:
        for file_path, file_info in files_to_archive:
            rel_path = Path(file_info['path'])
            tar.add(file_path, arcname=rel_path)
            console.print(f"  Added: {rel_path} [{file_info['content_type']}, {file_info['importance']}]")
    
    # Create metadata
    archive_metadata = {
        'metadata_version': '1.0',
        'created_at': datetime.now().isoformat(),
        'selection_criteria': {
            'content_types': content_types,
            'importance_levels': importance_levels,
            'exclude_patterns': exclude_patterns
        },
        'archive_stats': {
            'files_included': len(files_to_archive),
            'files_excluded': len(excluded_files),
            'total_size': sum(f[1]['size'] for f in files_to_archive),
            'compression_type': 'gzip'
        },
        'included_files': metadata_files,
        'excluded_files': excluded_files
    }
    
    metadata_path = archive_path.with_suffix('.metadata.json')
    metadata_path.write_text(json.dumps(archive_metadata, indent=2))
    
    return archive_path, metadata_path, len(files_to_archive), len(excluded_files)

# Create different types of selective archives
archive_dir = tutorial_dir / "selective_archives"
archive_dir.mkdir(exist_ok=True)

console.print("\n[bold blue]🎯 Creating Selective Archives[/bold blue]")
console.print("=" * 50)

# Archive 1: Critical files only (for backup/restart)
console.print("\n[cyan]1. Creating 'Critical Files Only' archive...[/cyan]")
console.print("[dim]Purpose: Back up essential files needed to restart the simulation[/dim]")

critical_archive, critical_metadata, included_critical, excluded_critical = create_selective_archive(
    workflow_dir,
    archive_dir / "critical_files_only.tar.gz",
    importance_levels=['CRITICAL']
)

console.print(f"  ✅ Created: {critical_archive.name}")
console.print(f"  📊 Files included: {included_critical}, excluded: {excluded_critical}")
console.print(f"  💾 Size: {critical_archive.stat().st_size / (1024*1024):.1f} MB")


In [None]:
# Archive 2: Scientific output only (for analysis/sharing)
console.print("\n[cyan]2. Creating 'Scientific Output Only' archive...[/cyan]")
console.print("[dim]Purpose: Share primary results with collaborators[/dim]")

output_archive, output_metadata, included_output, excluded_output = create_selective_archive(
    workflow_dir,
    archive_dir / "scientific_output.tar.gz",
    content_types=['OUTPUT', 'DIAGNOSTIC'],
    importance_levels=['IMPORTANT', 'CRITICAL']
)

console.print(f"  ✅ Created: {output_archive.name}")
console.print(f"  📊 Files included: {included_output}, excluded: {excluded_output}")
console.print(f"  💾 Size: {output_archive.stat().st_size / (1024*1024):.1f} MB")

# Archive 3: Analysis workflow (code + documentation)
console.print("\n[cyan]3. Creating 'Analysis Workflow' archive...[/cyan]")
console.print("[dim]Purpose: Preserve analysis methods and reproducibility[/dim]")

workflow_archive, workflow_metadata, included_workflow, excluded_workflow = create_selective_archive(
    workflow_dir,
    archive_dir / "analysis_workflow.tar.gz",
    content_types=['SCRIPT', 'METADATA'],
    exclude_patterns=['temp', 'debug', 'log']
)

console.print(f"  ✅ Created: {workflow_archive.name}")
console.print(f"  📊 Files included: {included_workflow}, excluded: {excluded_workflow}")
console.print(f"  💾 Size: {workflow_archive.stat().st_size / (1024*1024):.1f} MB")

# Archive 4: Clean production archive (exclude temporary and logs)
console.print("\n[cyan]4. Creating 'Clean Production' archive...[/cyan]")
console.print("[dim]Purpose: Production-ready archive without clutter[/dim]")

production_archive, production_metadata, included_production, excluded_production = create_selective_archive(
    workflow_dir,
    archive_dir / "production_clean.tar.gz",
    importance_levels=['CRITICAL', 'IMPORTANT'],
    exclude_patterns=['temp', 'debug', 'log', 'timing']
)

console.print(f"  ✅ Created: {production_archive.name}")
console.print(f"  📊 Files included: {included_production}, excluded: {excluded_production}")
console.print(f"  💾 Size: {production_archive.stat().st_size / (1024*1024):.1f} MB")

## Step 2: Comparing Archive Strategies

Let's compare our different archiving strategies to understand when to use each approach:

In [None]:
# Compare all archives
def compare_archives(archive_info_list):
    """
    Create a comparison table for different archive strategies.
    """
    
    comparison_table = Table(title="Archive Strategy Comparison")
    comparison_table.add_column("Archive Name", style="cyan")
    comparison_table.add_column("Purpose", style="green")
    comparison_table.add_column("Files", justify="right", style="yellow")
    comparison_table.add_column("Size (MB)", justify="right", style="magenta")
    comparison_table.add_column("Best For", style="dim")
    
    for info in archive_info_list:
        comparison_table.add_row(
            info['name'],
            info['purpose'],
            str(info['files']),
            f"{info['size']:.1f}",
            info['best_for']
        )
    
    return comparison_table

# Gather archive information
archives_info = [
    {
        'name': 'Critical Files Only',
        'purpose': 'Essential restart files',
        'files': included_critical,
        'size': critical_archive.stat().st_size / (1024*1024),
        'best_for': 'Quick backup, minimal storage'
    },
    {
        'name': 'Scientific Output',
        'purpose': 'Analysis data sharing',
        'files': included_output,
        'size': output_archive.stat().st_size / (1024*1024),
        'best_for': 'Collaborator sharing, analysis'
    },
    {
        'name': 'Analysis Workflow',
        'purpose': 'Reproducible methods',
        'files': included_workflow,
        'size': workflow_archive.stat().st_size / (1024*1024),
        'best_for': 'Code preservation, methods'
    },
    {
        'name': 'Production Clean',
        'purpose': 'Professional archiving',
        'files': included_production,
        'size': production_archive.stat().st_size / (1024*1024),
        'best_for': 'Long-term storage, publications'
    }
]

# Display comparison
console.print("\n[bold blue]📊 Archive Strategy Comparison[/bold blue]")
comparison_table = compare_archives(archives_info)
console.print(comparison_table)

# Calculate space savings. Archive sizes are compressed (.tar.gz), so the
# savings reflect both file selection and gzip compression.
original_size = sum(f.stat().st_size for f in workflow_dir.rglob('*') if f.is_file()) / (1024*1024)
total_selective_size = sum(info['size'] for info in archives_info)
smallest_size = min(info['size'] for info in archives_info)
# Guard against an empty workflow directory to avoid dividing by zero
space_efficiency = (1 - smallest_size / original_size) * 100 if original_size else 0.0

console.print("\n[bold green]💾 Storage Efficiency[/bold green]")
console.print(f"Original workflow: {original_size:.1f} MB")
console.print(f"Smallest selective archive: {smallest_size:.1f} MB")
console.print(f"Space savings: {space_efficiency:.1f}%")
console.print(f"All selective archives combined: {total_selective_size:.1f} MB")

savings_explanation = Panel(
    "[green]Key Benefits of Selective Archiving:[/green]\n\n"
    "✅ [cyan]Targeted Storage:[/cyan] Only archive what you need\n"
    "✅ [cyan]Faster Transfers:[/cyan] Smaller files move quicker\n"
    "✅ [cyan]Organized Access:[/cyan] Find files by purpose\n"
    "✅ [cyan]Cost Efficient:[/cyan] Pay less for cloud storage\n"
    "✅ [cyan]Clear Intent:[/cyan] Archive purpose is obvious",
    title="🎯 Why Use Selective Archiving?",
    border_style="green"
)

# Pass the Panel object directly; formatting it into an f-string would print its repr
console.print()
console.print(savings_explanation)
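
The savings figure above is one minus the ratio of archive size to original size, expressed as a percentage. A minimal sketch of that arithmetic as a reusable helper follows; `space_savings_percent` is a hypothetical name for illustration and not part of Tellus, and the zero-size guard is an added assumption for empty workflows.

```python
def space_savings_percent(original_bytes: int, archive_bytes: int) -> float:
    """Percent of storage saved by keeping the archive instead of the original.

    Hypothetical helper for illustration; not part of the Tellus API.
    """
    if original_bytes <= 0:
        # Nothing to compare against, so report no savings rather than divide by zero.
        return 0.0
    return (1 - archive_bytes / original_bytes) * 100


# A 500 MB workflow reduced to a 50 MB archive saves 90% of the space.
print(f"{space_savings_percent(500 * 1024**2, 50 * 1024**2):.1f}%")
```

Because the archives in this tutorial are gzip-compressed, part of any measured saving comes from compression rather than from file selection. The result is negative if an archive ends up larger than the original.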

## Step 3: Examining Archive Contents

Let's look inside our selective archives to understand what was included and excluded:

In [None]:
def examine_archive_contents(metadata_path):
    """
    Examine what's inside a selective archive based on its metadata.
    """
    
    metadata = json.loads(metadata_path.read_text())
    
    console.print(f"\n[bold cyan]📋 Archive: {metadata_path.stem.replace('.metadata', '')}[/bold cyan]")
    console.print("=" * 50)
    
    # Selection criteria
    criteria = metadata['selection_criteria']
    console.print(f"[blue]Content Types:[/blue] {criteria.get('content_types', 'All')}")
    console.print(f"[blue]Importance Levels:[/blue] {criteria.get('importance_levels', 'All')}")
    if criteria.get('exclude_patterns'):
        console.print(f"[blue]Excluded Patterns:[/blue] {criteria['exclude_patterns']}")
    
    # Stats
    stats = metadata['archive_stats']
    console.print(f"[green]Files Included:[/green] {stats['files_included']}")
    console.print(f"[yellow]Files Excluded:[/yellow] {stats['files_excluded']}")
    console.print(f"[magenta]Total Size:[/magenta] {stats['total_size'] / (1024*1024):.1f} MB")
    
    # Show some included files by type
    included_by_type = {}
    for file_info in metadata['included_files']:
        content_type = file_info['content_type']
        if content_type not in included_by_type:
            included_by_type[content_type] = []
        included_by_type[content_type].append(file_info)
    
    if included_by_type:
        console.print("\n[bold]📄 Included Files by Type:[/bold]")
        for content_type, files in included_by_type.items():
            console.print(f"  [cyan]{content_type}[/cyan] ({len(files)} files):")
            for file_info in files[:3]:  # Show first 3 files
                size_str = f"{file_info['size'] / 1024:.1f} KB" if file_info['size'] >= 1024 else f"{file_info['size']} B"
                console.print(f"    • {file_info['path']} ({size_str})")
            if len(files) > 3:
                console.print(f"    [dim]... and {len(files) - 3} more[/dim]")
    
    # Show some excluded files
    excluded_files = metadata.get('excluded_files', [])
    if excluded_files:
        console.print("\n[bold]🚫 Sample Excluded Files:[/bold] (showing up to 5)")
        for excluded in excluded_files[:5]:
            console.print(f"  • {excluded['path']} ({excluded['content_type']}, reason: {excluded['reason']})")

# Examine each archive
console.print("\n[bold blue]🔍 Detailed Archive Contents[/bold blue]")

examine_archive_contents(critical_metadata)
examine_archive_contents(output_metadata)
examine_archive_contents(workflow_metadata)
examine_archive_contents(production_metadata)
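
The metadata file describes what the archive should contain. Reading the tar file itself gives an independent check. The sketch below uses only the standard-library `tarfile` module and assumes the archives are ordinary gzip-compressed tar files, as their `.tar.gz` names suggest. `list_archive_members` is a hypothetical helper, and the demo builds its own tiny archive so that it runs on its own.

```python
import tarfile
import tempfile
from pathlib import Path


def list_archive_members(archive_path: Path) -> list[str]:
    """Return the names of regular files stored in a tar archive, sorted."""
    # "r:*" lets tarfile detect the compression (gzip, bz2, xz, or none).
    with tarfile.open(archive_path, "r:*") as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())


# Self-contained demo: build a tiny archive, then list what is inside it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "cam.r.2024-04-01.nc").write_bytes(b"restart")
    (root / "user_nl_cam").write_text("nhtfrq = 0\n")
    demo_archive = root / "demo.tar.gz"
    with tarfile.open(demo_archive, "w:gz") as tar:
        for name in ("cam.r.2024-04-01.nc", "user_nl_cam"):
            tar.add(root / name, arcname=name)
    members = list_archive_members(demo_archive)
    print(members)
```

Comparing this list with the `path` entries under `included_files` in the metadata is one way to confirm that the two agree. Whether the stored names match the metadata paths exactly depends on how the archive was written, so treat any mismatch as a prompt to look closer rather than as proof of an error.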

## Real-World Decision Making: Which Archive Strategy to Use?

Let's create a decision framework for choosing the right archive strategy based on your specific needs:

In [None]:
from rich.panel import Panel
from rich.columns import Columns

# Create decision scenarios
decision_scenarios = [
    {
        'title': '🚀 Scenario 1: HPC System Migration',
        'situation': 'Need to move simulation to new HPC system and continue run',
        'best_choice': 'Critical Files Only',
        'reasoning': 'Minimal data transfer, contains everything needed to restart',
        'tellus_command': 'tellus archive create migration_package /sim/dir --importance critical'
    },
    {
        'title': '🤝 Scenario 2: Collaborator Sharing',
        'situation': 'Colleague wants to analyze your CESM output for their paper',
        'best_choice': 'Scientific Output',
        'reasoning': 'Contains analysis data without unnecessary config/restart files',
        'tellus_command': 'tellus archive create shared_results /sim/dir --content-types output,diagnostic'
    },
    {
        'title': '📚 Scenario 3: Paper Submission',
        'situation': 'Need to archive methods and results for journal publication',
        'best_choice': 'Production Clean',
        'reasoning': 'Professional, complete, excludes temporary/debug files',
        'tellus_command': 'tellus archive create paper_submission /sim/dir --importance critical,important --exclude-patterns temp,debug,log,timing'
    },
    {
        'title': '🔬 Scenario 4: Methods Preservation',
        'situation': 'Want to preserve analysis workflow for future students',
        'best_choice': 'Analysis Workflow',
        'reasoning': 'Focus on scripts, documentation, reproducibility',
        'tellus_command': 'tellus archive create methods_archive /sim/dir --content-types script,metadata'
    }
]

console.print("\n[bold blue]🎯 Decision Framework: Which Archive Strategy?[/bold blue]")
console.print("=" * 70)

# Create panels for each scenario
panels = []
for scenario in decision_scenarios:
    panel_content = (
        f"[bold green]Situation:[/bold green]\n{scenario['situation']}\n\n"
        f"[bold cyan]Best Choice:[/bold cyan] {scenario['best_choice']}\n\n"
        f"[bold yellow]Why:[/bold yellow]\n{scenario['reasoning']}\n\n"
        f"[bold blue]Command:[/bold blue]\n[dim]{scenario['tellus_command']}[/dim]"
    )
    
    panel = Panel(
        panel_content,
        title=scenario['title'],
        border_style="blue",
        padding=(1, 1)
    )
    panels.append(panel)

# Display scenarios in columns
console.print(Columns(panels[:2], equal=True))
console.print(Columns(panels[2:], equal=True))
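
Each scenario above comes down to a combination of three filters: content types, importance levels, and exclude patterns. The sketch below shows one way such a filter could work on plain file records. It is an illustration under stated assumptions, not the implementation behind `create_selective_archive` or the Tellus CLI. `select_files` and the record layout are hypothetical, a filter left as `None` accepts everything, and exclude patterns are assumed to be case-insensitive substrings of the path.

```python
def select_files(files, content_types=None, importance_levels=None, exclude_patterns=None):
    """Split file records into (included, excluded) using the three filters.

    Illustrative only: a filter left as None accepts everything, and
    exclude patterns are treated as case-insensitive substrings of the path.
    """
    included, excluded = [], []
    for record in files:
        path = record["path"].lower()
        hit = next((p for p in exclude_patterns or [] if p.lower() in path), None)
        if hit is not None:
            excluded.append({**record, "reason": f"matches exclude pattern '{hit}'"})
        elif content_types and record["content_type"] not in content_types:
            excluded.append({**record, "reason": "content type not selected"})
        elif importance_levels and record["importance"] not in importance_levels:
            excluded.append({**record, "reason": "importance level not selected"})
        else:
            included.append(record)
    return included, excluded


records = [
    {"path": "run/cam.r.2024-04-01.nc", "content_type": "RESTART", "importance": "CRITICAL"},
    {"path": "run/user_nl_cam", "content_type": "INPUT", "importance": "CRITICAL"},
    {"path": "output/cam.h0.2024-01.nc", "content_type": "OUTPUT", "importance": "IMPORTANT"},
    {"path": "logs/cam.log", "content_type": "LOG", "importance": "OPTIONAL"},
    {"path": "scripts/catalog.py", "content_type": "SCRIPT", "importance": "IMPORTANT"},
]

# Scenario 1 (HPC migration): critical files only.
migration, _ = select_files(records, importance_levels=["CRITICAL"])
print([r["path"] for r in migration])

# Scenario 3 (paper submission): the substring "log" also drops catalog.py.
paper, dropped = select_files(
    records,
    importance_levels=["CRITICAL", "IMPORTANT"],
    exclude_patterns=["temp", "debug", "log"],
)
print([r["path"] for r in paper])
print([(r["path"], r["reason"]) for r in dropped])
```

The second call shows a pitfall of substring-style exclusion: a short pattern such as `log` can remove files like `catalog.py` that you meant to keep. Reviewing the excluded list before committing an archive to long-term storage catches this kind of surprise.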

## Common Earth Science File Patterns

Let's explore how Tellus recognizes different Earth Science model patterns automatically:

In [None]:
# Create examples of different Earth Science model file patterns
earth_science_patterns = {
    'CESM': {
        'patterns': {
            'cam.h0.*.nc': 'CAM atmospheric monthly output',
            'cam.h1.*.nc': 'CAM atmospheric daily/hourly output', 
            'pop.h.*.nc': 'POP ocean monthly output',
            'clm.h0.*.nc': 'CLM land monthly output',
            'cice.h.*.nc': 'CICE sea ice monthly output',
            'cam.r.*.nc': 'CAM restart files',
            'pop.r.*.nc': 'POP restart files',
            'user_nl_*': 'CESM component namelists',
            'rpointer.*': 'Restart pointer files'
        },
        'description': 'Community Earth System Model - comprehensive climate modeling'
    },
    'WRF': {
        'patterns': {
            'wrfout_d*': 'WRF atmospheric output',
            'wrfrst_d*': 'WRF restart files',
            'wrfbdy_d*': 'WRF boundary condition files',
            'namelist.input': 'WRF main namelist',
            'namelist.wps': 'WRF preprocessing namelist'
        },
        'description': 'Weather Research and Forecasting - mesoscale atmospheric modeling'
    },
    'ICON': {
        'patterns': {
            '*atm_*': 'ICON atmospheric output',
            '*oce_*': 'ICON ocean output', 
            '*lnd_*': 'ICON land output',
            'icon_master.namelist': 'ICON master configuration',
            'NAMELIST_*': 'ICON component namelists'
        },
        'description': 'Icosahedral Nonhydrostatic - next-generation global modeling'
    },
    'ECHAM': {
        'patterns': {
            '*BOT*': 'ECHAM surface output',
            '*ATM*': 'ECHAM atmospheric output',
            'namelist.echam': 'ECHAM namelist',
            'rerun_*': 'ECHAM restart files'
        },
        'description': 'ECHAM - atmospheric general circulation model'
    },
    'FESOM': {
        'patterns': {
            '*.fesom.*': 'FESOM ocean output',
            'namelist.config': 'FESOM configuration',
            'forcing/*': 'FESOM forcing data'
        },
        'description': 'Finite Element Sea Ice-Ocean Model'
    }
}

console.print("\n[bold blue]🌍 Earth Science Model Pattern Recognition[/bold blue]")
console.print("=" * 60)

for model, info in earth_science_patterns.items():
    console.print(f"\n[bold cyan]{model}:[/bold cyan] {info['description']}")
    
    patterns_table = Table(show_header=True, header_style="bold magenta")
    patterns_table.add_column("Pattern", style="yellow")
    patterns_table.add_column("Description", style="green")
    
    for pattern, description in info['patterns'].items():
        patterns_table.add_row(pattern, description)
    
    console.print(patterns_table)

# Show how patterns help with classification
classification_help = Panel(
    "[green]How Pattern Recognition Helps:[/green]\n\n"
    "✅ [cyan]Automatic Classification:[/cyan] Files sorted by purpose without manual work\n"
    "✅ [cyan]Model-Aware Decisions:[/cyan] Knows that cam.r.*.nc files are critical restarts\n"
    "✅ [cyan]Smart Defaults:[/cyan] Suggests appropriate archive strategies per model\n"
    "✅ [cyan]Validation:[/cyan] Warns if expected files are missing\n"
    "✅ [cyan]Documentation:[/cyan] Automatically documents what each file type contains",
    title="🧠 Smart Pattern Recognition",
    border_style="green"
)

# Pass the Panel object directly; formatting it into an f-string would print its repr
console.print()
console.print(classification_help)
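
Glob patterns like the ones in the tables above can be turned into a simple classifier. The sketch below uses the standard-library `fnmatch` module with an ordered rule list in which the first match wins. The rules are a small subset drawn from the patterns shown in this tutorial, and `RULES` and `classify` are hypothetical names. Tellus's own classifier may work differently.

```python
from fnmatch import fnmatchcase

# Ordered (pattern, content type) rules; the first matching rule wins.
# Illustrative subset only, not Tellus's actual rule set.
RULES = [
    ("temp_*", "TEMPORARY"),      # checked first so temp_*.nc is not treated as output
    ("scratch_*", "TEMPORARY"),
    ("*.r.*.nc", "RESTART"),
    ("wrfrst_d*", "RESTART"),
    ("user_nl_*", "INPUT"),
    ("*.h[0-9].*.nc", "OUTPUT"),  # e.g. cam.h0.2024-01.nc, clm.h0.2024-01.nc
    ("wrfout_d*", "OUTPUT"),
    ("*.log", "LOG"),
    ("*.py", "SCRIPT"),
    ("*.sh", "SCRIPT"),
]


def classify(filename: str, default: str = "UNKNOWN") -> str:
    """Return the content type of the first rule whose pattern matches."""
    for pattern, content_type in RULES:
        # fnmatchcase is case-sensitive on every platform, unlike fnmatch.
        if fnmatchcase(filename, pattern):
            return content_type
    return default


for name in [
    "cam.h0.2024-01.nc",
    "cam.r.2024-04-01.nc",
    "user_nl_cam",
    "temp_processing.nc",
    "wrfout_d01_2024-01-01_00:00:00",
    "README",
]:
    print(f"{name:35s} {classify(name)}")
```

Rule order matters: placing the temporary-file patterns first keeps `temp_processing.nc` out of the scientific output, and anything that matches no rule falls through to `UNKNOWN` so it can be reviewed by hand.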

## Hands-On Exercise: Creating Your Archive Strategy

Let's practice with a scenario-based exercise to reinforce the concepts:

In [None]:
# Interactive exercise setup
exercise_scenarios = [
    {
        'scenario': 'You have a 2-year CESM simulation (500GB) that needs to go to tape storage. You want to continue the run later but tape access is slow and expensive.',
        'challenge': 'Balance storage cost with restart capability',
        'solution': 'Create two archives: 1) Critical files for restart (small), 2) Complete archive for long-term storage',
        'commands': [
            'tellus archive create restart_2024 /cesm/run --importance critical --location local_disk',
            'tellus archive create complete_2024 /cesm/run --location tape_storage'
        ]
    },
    {
        'scenario': 'Your WRF hurricane simulation produced 1000+ files. Your collaborator only needs the precipitation and wind fields for their flooding study.',
        'challenge': 'Share only relevant data, not everything',
        'solution': 'Create selective archive with specific variables/patterns',
        'commands': [
            'tellus archive create hurricane_winds_precip /wrf/output --patterns "*RAINNC*,*U10*,*V10*" --content-types output'
        ]
    },
    {
        'scenario': 'You developed a new analysis workflow for ICON output. The code worked perfectly and you want to preserve it for future projects.',
        'challenge': 'Archive methods for reproducibility',
        'solution': 'Focus on scripts, configuration, and documentation',
        'commands': [
            'tellus archive create icon_analysis_methods /project/dir --content-types script,metadata,input --exclude-patterns output,temp,log'
        ]
    }
]

console.print("\n[bold blue]🎓 Hands-On Exercise: Archive Strategy Planning[/bold blue]")
console.print("=" * 65)

for i, exercise in enumerate(exercise_scenarios, 1):
    console.print(f"\n[bold yellow]Exercise {i}:[/bold yellow]")
    console.print(f"[blue]Scenario:[/blue] {exercise['scenario']}")
    console.print(f"[red]Challenge:[/red] {exercise['challenge']}")
    console.print(f"[green]Solution Approach:[/green] {exercise['solution']}")
    console.print(f"[cyan]Recommended Commands:[/cyan]")
    for cmd in exercise['commands']:
        console.print(f"  [dim]{cmd}[/dim]")
    console.print("")

# Practice exercise
practice_exercise = Panel(
    "[bold green]Your Turn![/bold green]\n\n"
    "[blue]Scenario:[/blue] You have a CESM paleoclimate simulation (Last Glacial Maximum) with:\n"
    "• 50 years of monthly output (200GB)\n"
    "• Restart files for year 50 (5GB)\n"
    "• Analysis scripts and plots (100MB)\n"
    "• Log files and diagnostics (2GB)\n\n"
    "[red]Challenge:[/red] Create THREE different archives for different purposes:\n"
    "1. Quick sharing with paleoclimate community\n"
    "2. Continuation of simulation to 100 years\n"
    "3. Publication-ready archive for journal\n\n"
    "[yellow]Think about:[/yellow]\n"
    "• What content types would you include for each?\n"
    "• What importance levels are relevant?\n"
    "• Any patterns to exclude?\n"
    "• Which storage location for each archive?",
    title="🧠 Practice Exercise",
    border_style="green"
)

console.print(practice_exercise)

# Show solution after thinking time
solution_panel = Panel(
    "[bold cyan]Suggested Solutions:[/bold cyan]\n\n"
    "[green]1. Community Sharing:[/green]\n"
    "tellus archive create lgm_community_data /simulation \\\n"
    "  --content-types output,diagnostic --location fast_cloud\n\n"
    "[green]2. Simulation Continuation:[/green]\n"
    "tellus archive create lgm_restart_y50 /simulation \\\n"
    "  --importance critical --location local_backup\n\n"
    "[green]3. Publication Archive:[/green]\n"
    "tellus archive create lgm_publication /simulation \\\n"
    "  --importance critical,important \\\n"
    "  --exclude-patterns log,temp,debug \\\n"
    "  --location long_term_storage",
    title="💡 Solution",
    border_style="cyan"
)

# Pass the Panel object directly; formatting it into an f-string would print its repr
console.print()
console.print(solution_panel)

## Common Mistakes and How to Avoid Them

Let's look at typical beginner mistakes in selective archiving:

In [None]:
# Common mistakes and solutions
common_mistakes = [
    {
        'mistake': '❌ Over-filtering',
        'example': 'Creating "output only" archive but excluding diagnostic files',
        'problem': 'Diagnostic files often contain derived quantities needed for analysis',
        'solution': 'Include both OUTPUT and DIAGNOSTIC content types for analysis archives',
        'better_command': 'tellus archive create analysis_data /sim --content-types output,diagnostic'
    },
    {
        'mistake': '❌ Including temporary files',
        'example': 'Not excluding temp/debug files from production archives',
        'problem': 'Archives become bloated with unnecessary temporary data',
        'solution': 'Always exclude temporary patterns in production archives',
        'better_command': 'tellus archive create production /sim --exclude-patterns temp,debug,scratch,work_in_progress'
    },
    {
        'mistake': '❌ Wrong restart strategy',
        'example': 'Including only .r. files without pointer files or namelists',
        'problem': 'Cannot actually restart simulation without complete configuration',
        'solution': 'Select by importance level (--importance critical), which includes all restart-essential files',
        'better_command': 'tellus archive create restart_package /sim --importance critical'
    },
    {
        'mistake': '❌ Ignoring file relationships',
        'example': 'Archiving analysis scripts without the input data they process',
        'problem': 'Scripts become useless without their input data context',
        'solution': 'Consider workflow dependencies when selecting content types',
        'better_command': 'tellus archive create complete_workflow /sim --content-types script,output,metadata'
    },
    {
        'mistake': '❌ Poor naming conventions',
        'example': 'Using generic names like "archive1", "data_backup"',
        'problem': 'Cannot identify purpose or contents later',
        'solution': 'Use descriptive names that indicate content and purpose',
        'better_command': 'tellus archive create cesm_lgm_outputs_2024q1 /sim --content-types output'
    }
]

console.print("\n[bold blue]⚠️  Common Mistakes in Selective Archiving[/bold blue]")
console.print("=" * 60)

for i, mistake in enumerate(common_mistakes, 1):
    console.print(f"\n[bold red]{i}. {mistake['mistake']}[/bold red]")
    console.print(f"[yellow]Example:[/yellow] {mistake['example']}")
    console.print(f"[red]Problem:[/red] {mistake['problem']}")
    console.print(f"[green]Solution:[/green] {mistake['solution']}")
    console.print(f"[cyan]Better Command:[/cyan] [dim]{mistake['better_command']}[/dim]")

# Best practices summary
best_practices = Panel(
    "[bold green]Best Practices for Selective Archiving:[/bold green]\n\n"
    "✅ [cyan]Test First:[/cyan] Create small test archives to verify selection criteria\n"
    "✅ [cyan]Document Purpose:[/cyan] Use descriptive archive names and metadata\n"
    "✅ [cyan]Consider Relationships:[/cyan] Think about file dependencies in your workflow\n"
    "✅ [cyan]Validate Contents:[/cyan] Always check archive metadata before long-term storage\n"
    "✅ [cyan]Plan for Future:[/cyan] Consider how you'll use the archive later\n"
    "✅ [cyan]Use Patterns Wisely:[/cyan] Leverage model-specific patterns for better classification\n"
    "✅ [cyan]Multiple Strategies:[/cyan] Create different archives for different purposes",
    title="📋 Best Practices",
    border_style="green"
)

# Pass the Panel object directly; formatting it into an f-string would print its repr
console.print()
console.print(best_practices)
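
The "wrong restart strategy" mistake and the "Validate Contents" practice can be combined into a quick pre-flight check. The sketch below tests a list of file names against the three kinds of files the tutorial names as restart essentials for CESM: restart files, pointer files, and namelists. `RESTART_REQUIREMENTS` and `missing_restart_pieces` are hypothetical, and the requirement list is an assumption to adapt to your own model and components.

```python
from fnmatch import fnmatchcase

# Hypothetical requirement list for a CESM-style restart package, built from
# patterns named in this tutorial; adjust it for your model and components.
RESTART_REQUIREMENTS = {
    "restart files": "*.r.*.nc",
    "restart pointers": "rpointer.*",
    "component namelists": "user_nl_*",
}


def missing_restart_pieces(filenames):
    """Return the requirement labels that no file name satisfies."""
    return [
        label
        for label, pattern in RESTART_REQUIREMENTS.items()
        if not any(fnmatchcase(name, pattern) for name in filenames)
    ]


incomplete = ["cam.r.2024-04-01.nc", "pop.r.2024-04-01.nc"]
complete = incomplete + ["rpointer.atm", "rpointer.ocn", "user_nl_cam"]

print(missing_restart_pieces(incomplete))  # pointers and namelists are missing
print(missing_restart_pieces(complete))    # nothing missing
```

The patterns match bare file names, so pass `Path(p).name` for each entry if your list holds full paths. An empty result only means that every category is represented. It does not prove that every component has its own restart file.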

## Cleanup and Summary

In [None]:
# Cleanup tutorial files
import shutil

console.print("\n[bold blue]🧹 Cleaning up tutorial files...[/bold blue]")
shutil.rmtree(tutorial_dir)
console.print(f"[green]✅ Cleaned up: {tutorial_dir}[/green]")

# Tutorial summary
summary = Panel(
    "[bold green]🎓 Tutorial 2 Complete - You've Mastered Selective Archiving![/bold green]\n\n"
    "[cyan]Key Skills Learned:[/cyan]\n"
    "✅ Understanding automatic file classification\n"
    "✅ Creating targeted archives for specific purposes\n"
    "✅ Recognizing Earth Science model patterns\n"
    "✅ Making strategic decisions about what to archive\n"
    "✅ Avoiding common archiving mistakes\n\n"
    "[yellow]Real-World Impact:[/yellow]\n"
    "• Lower storage costs by archiving only what each purpose requires\n"
    "• Faster data transfers and access\n"
    "• Better organized, purposeful archives\n"
    "• Improved collaboration and sharing\n\n"
    "[blue]Next: Tutorial 3 - DateTime-Based Extraction[/blue]",
    title="🎉 Tutorial Summary",
    border_style="green"
)

console.print(summary)

console.print("\n[bold blue]📚 Ready for Next Tutorial?[/bold blue]")
console.print("Tutorial 3 will teach you how to extract specific time periods from your archives - perfect for seasonal analysis, event studies, and temporal workflows.")
console.print("\n[dim]Continue to: archive-tutorial-03-datetime-filtering.ipynb[/dim]")