# Tutorial 2: Content Classification and Selective Archiving

In this tutorial, you'll learn how Tellus automatically understands your Earth System Model files and how to create smart, selective archives.

## What You'll Learn
- How Tellus classifies Earth science file types
- Content types vs. importance levels
- Model-specific pattern recognition
- Creating targeted archives for specific workflows

## The Problem: Not All Files Are Equal

Your CESM simulation contains hundreds of files, but they serve very different purposes:

```
🔥 CRITICAL: user_nl_cam (reproduce your exact run)
📊 IMPORTANT: *.cam.h0.*.nc (your main output data)
📝 USEFUL: cesm.log (debugging information)
🗑️ TEMPORARY: *.tmp, *.lock (can be deleted)
```

Tellus helps you make intelligent decisions about what to archive by automatically classifying files.

## Understanding Content Types

Tellus classifies files into these **content types** based on their role in the simulation:

In [None]:
# Let's see how Tellus would classify files in a CESM run
print("🎯 CONTENT TYPES in Earth System Modeling:")
print()
print("📤 INPUT:        Configuration files, parameters, initial conditions")
print("   Examples:     user_nl_*, cesm_in, ssp585_data.nc")
print()
print("📊 OUTPUT:       Primary model results (what you analyze)")
print("   Examples:     *.cam.h0.*.nc, *.pop.h.*.nc, *.clm2.h0.*.nc")
print()
print("⚙️ CONFIG:       Runtime configuration and namelists")
print("   Examples:     namelist.input, icon_master.namelist")
print()
print("📝 LOG:          Log files and diagnostic output")
print("   Examples:     cesm.log, *.stdout, model.log")
print()
print("🔄 INTERMEDIATE: Restart files, checkpoints, temporary data")
print("   Examples:     *.r.*, *.rs.*, restart_files/*")
print()
print("📈 DIAGNOSTIC:   Analysis output, derived quantities")
print("   Examples:     *.cam.h1.*.nc (high-frequency), *_analysis.nc")
print()
print("📚 METADATA:     Documentation, catalogs, index files")
print("   Examples:     README, *.xml, file_lists.txt")
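
To make pattern-based classification concrete, here is a minimal sketch. It is not Tellus's implementation: `CONTENT_RULES` and `classify_content` are hypothetical names, and the patterns are simply the examples listed above. It assumes first-match-wins ordering, which is one reasonable way to keep `h1` diagnostics from being caught by broader rules.

```python
from fnmatch import fnmatchcase

# Hypothetical rule table built from the example patterns in this tutorial.
# Rules are checked in order and the first match wins.
CONTENT_RULES = [
    ("diagnostic", ["*.cam.h1.*.nc", "*_analysis.nc"]),
    ("output", ["*.cam.h0.*.nc", "*.pop.h.*.nc", "*.clm2.h0.*.nc"]),
    ("intermediate", ["*.r.*", "*.rs.*"]),
    ("config", ["namelist.input", "icon_master.namelist"]),
    ("input", ["user_nl_*", "cesm_in"]),
    ("log", ["cesm.log", "*.stdout", "model.log"]),
    ("metadata", ["README", "*.xml"]),
]


def classify_content(filename):
    """Return the first content type whose patterns match, else 'unknown'."""
    for content_type, patterns in CONTENT_RULES:
        if any(fnmatchcase(filename, pattern) for pattern in patterns):
            return content_type
    return "unknown"


for name in ["case.cam.h0.2005-01.nc", "user_nl_cam", "cesm.log", "notes.txt"]:
    print(f"{name:28s} -> {classify_content(name)}")
```

`fnmatchcase` is used instead of `fnmatch` so that matching is case-sensitive on every platform.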

## Understanding Importance Levels

Beyond content type, Tellus assigns **importance levels** to help you make storage decisions:

In [None]:
print("🎯 IMPORTANCE LEVELS for Archive Decisions:")
print()
print("🔥 CRITICAL:     Essential for simulation reproducibility")
print("   Examples:     namelists, input datasets, boundary conditions")
print("   Strategy:     Always archive, store on most reliable media")
print()
print("⭐ IMPORTANT:    Valuable for science, hard to regenerate")
print("   Examples:     Primary output files, monthly means")
print("   Strategy:     Archive for long-term science use")
print()
print("📋 OPTIONAL:     Nice to have, but can be recreated")
print("   Examples:     High-frequency output, diagnostic plots")
print("   Strategy:     Archive selectively based on storage budget")
print()
print("🗑️ TEMPORARY:     Can be safely discarded")
print("   Examples:     .tmp files, .lock files, build artifacts")
print("   Strategy:     Don't archive (Tellus excludes by default)")
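
One way to reason about these levels is as an ordered scale with a cutoff. The sketch below is illustrative only: the `Importance` enum and `select_for_archive` helper are hypothetical, and treating a level as a minimum threshold is an assumption, not a statement about how Tellus interprets `--importance`.

```python
from enum import IntEnum


class Importance(IntEnum):
    """Hypothetical ordering of the four levels, lowest to highest."""
    TEMPORARY = 0
    OPTIONAL = 1
    IMPORTANT = 2
    CRITICAL = 3


def select_for_archive(files, minimum=Importance.OPTIONAL):
    """Keep the names of (name, importance) pairs at or above `minimum`."""
    return [name for name, level in files if level >= minimum]


# Made-up file list for illustration.
files = [
    ("user_nl_cam", Importance.CRITICAL),
    ("case.cam.h0.2005-01.nc", Importance.IMPORTANT),
    ("zonal_mean_plot.png", Importance.OPTIONAL),
    ("run.lock", Importance.TEMPORARY),
]

print(select_for_archive(files))                       # drops run.lock
print(select_for_archive(files, Importance.CRITICAL))  # only user_nl_cam
```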

## Hands-On: See Classification in Action

Let's create an archive and examine how Tellus classified your files:

In [None]:
# Create a sample archive to see classification
simulation_path = "/data/cesm_runs/b.e20.BHIST.f19_g17.20thC.2005_2015"

!tellus archive create cesm_classification_demo {simulation_path} \
    --simulation cesm_demo \
    --description "Demo archive to show file classification"

In [None]:
# Now examine the detailed classification
!tellus archive show cesm_classification_demo

### What to Look For

In the output above, notice:

1. **Content Summary**: Shows how many files of each type were found
2. **Size Distribution**: Reveals which content types use the most space
3. **File Patterns**: Shows the patterns Tellus used for classification

This information helps you make informed decisions about selective archiving!
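
If you want to reproduce a content summary and size distribution yourself, the aggregation is straightforward. This is a sketch under stated assumptions: `summarize` is a hypothetical helper, and the records are made-up values, not output from Tellus.

```python
from collections import defaultdict


def summarize(records):
    """Aggregate (content_type, size_bytes) pairs into per-type totals."""
    summary = defaultdict(lambda: {"files": 0, "bytes": 0})
    for content_type, size_bytes in records:
        summary[content_type]["files"] += 1
        summary[content_type]["bytes"] += size_bytes
    return dict(summary)


# Made-up records for illustration only.
records = [
    ("output", 4_000_000_000),
    ("output", 3_500_000_000),
    ("log", 20_000),
    ("input", 1_500),
]

summary = summarize(records)
largest = max(summary, key=lambda t: summary[t]["bytes"])
for content_type, stats in summary.items():
    print(f"{content_type:8s} {stats['files']:3d} files {stats['bytes']:>14,d} bytes")
print("Largest content type:", largest)
```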

## Model-Specific Intelligence

Tellus includes specialized patterns for major Earth System Models:

In [None]:
print("🌍 MODEL-SPECIFIC FILE PATTERNS:")
print()
print("🌊 CESM (Community Earth System Model):")
print("   • Atmosphere: *.cam.h0.*.nc, *.cam.h1.*.nc")
print("   • Ocean:      *.pop.h.*.nc, *.pop.h.ecosys.*.nc")
print("   • Land:       *.clm2.h0.*.nc, *.clm2.h1.*.nc")
print("   • Sea Ice:    *.cice.h.*.nc")
print("   • Namelists:  user_nl_*, *_in")
print()
print("🌀 ICON (ICOsahedral Nonhydrostatic):")
print("   • Atmosphere: *_atm_*.nc, *_atmo_*.nc")
print("   • Ocean:      *_oce_*.nc, *_ocean_*.nc")
print("   • Land:       *_lnd_*.nc, *_land_*.nc")
print("   • Config:     icon_master.namelist, NAMELIST_*")
print()
print("⛈️ WRF (Weather Research and Forecasting):")
print("   • Output:     wrfout_*, wrfxtrm_*")
print("   • Restart:    wrfrst_*")
print("   • Boundary:   wrfbdy_*, wrflowinp_*")
print("   • Config:     namelist.input, namelist.wps")
print()
print("🏔️ ECHAM (European Centre HAMburg):")
print("   • Output:     *_ATM_*, *_BOT_*, *_SFC_*")
print("   • Config:     namelist.echam, run.def")
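
Pattern tables like these can also be used to guess which model produced a directory. The following is a rough sketch, not Tellus's detection logic: `MODEL_PATTERNS` and `guess_model` are hypothetical, and only a subset of the patterns above is included.

```python
from fnmatch import fnmatchcase

# Hypothetical subset of the model-specific patterns listed above.
MODEL_PATTERNS = {
    "CESM": ["*.cam.h0.*.nc", "*.cam.h1.*.nc", "*.pop.h.*.nc",
             "*.clm2.h0.*.nc", "*.cice.h.*.nc", "user_nl_*"],
    "ICON": ["*_atm_*.nc", "*_oce_*.nc", "*_lnd_*.nc",
             "icon_master.namelist", "NAMELIST_*"],
    "WRF": ["wrfout_*", "wrfrst_*", "wrfbdy_*", "namelist.wps"],
    "ECHAM": ["*_ATM_*", "*_BOT_*", "namelist.echam"],
}


def guess_model(filenames):
    """Return the model whose patterns match the most files, or None."""
    counts = {
        model: sum(
            any(fnmatchcase(name, pattern) for pattern in patterns)
            for name in filenames
        )
        for model, patterns in MODEL_PATTERNS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


print(guess_model(["case.cam.h0.2005-01.nc", "user_nl_cam"]))
print(guess_model(["wrfout_d01_2005-01-01_00:00:00", "namelist.wps"]))
```

Counting matches rather than stopping at the first hit makes the guess more robust when a directory contains a few stray files from another tool.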

## Selective Archiving by Content Type

Now let's create targeted archives for different use cases:

In [None]:
# Archive 1: Critical files only (for reproducibility)
!tellus archive create cesm_critical_only {simulation_path} \
    --simulation cesm_demo \
    --importance critical \
    --description "Only critical files for reproduction"

print("✅ Created critical-files-only archive")

In [None]:
# Archive 2: Output data only (for analysis)
!tellus archive create cesm_analysis_data {simulation_path} \
    --simulation cesm_demo \
    --content-types output,diagnostic \
    --description "Output and diagnostic data for analysis"

print("✅ Created analysis-ready archive")

In [None]:
# Archive 3: Restart capability (for continuing simulation)
!tellus archive create cesm_restart_ready {simulation_path} \
    --simulation cesm_demo \
    --content-types input,config,intermediate \
    --description "Everything needed to restart simulation"

print("✅ Created restart-ready archive")

## Advanced Pattern Matching

For more precise control, combine content types with file patterns:

In [None]:
# Archive just atmospheric monthly output
!tellus archive create cesm_atm_monthly {simulation_path} \
    --simulation cesm_demo \
    --patterns "*.cam.h0.*.nc" \
    --content-types output \
    --description "Atmospheric monthly output only"

# Archive just ocean and sea ice for marine analysis
!tellus archive create cesm_marine_components {simulation_path} \
    --simulation cesm_demo \
    --patterns "*.pop.h.*.nc,*.cice.h.*.nc" \
    --content-types output \
    --description "Ocean and sea ice output"

print("✅ Created component-specific archives")
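
When patterns and content types are given together, a natural reading is that a file must satisfy both. The sketch below illustrates that reading only. It assumes the two filters are combined with AND while the patterns inside one list are combined with OR; check the Tellus documentation for the actual semantics. `select_files` and the file list are hypothetical.

```python
from fnmatch import fnmatchcase


def select_files(files, patterns=None, content_types=None):
    """Keep names of (name, content_type) pairs that pass every given filter."""
    selected = []
    for name, content_type in files:
        if patterns and not any(fnmatchcase(name, p) for p in patterns):
            continue  # name matches none of the requested patterns
        if content_types and content_type not in content_types:
            continue  # content type was not requested
        selected.append(name)
    return selected


# Made-up classified file list for illustration.
files = [
    ("case.cam.h0.2005-01.nc", "output"),
    ("case.pop.h.2005-01.nc", "output"),
    ("case.cice.h.2005-01.nc", "output"),
    ("case.cam.h1.2005-01-01-00000.nc", "diagnostic"),
    ("user_nl_cam", "input"),
]

marine = select_files(
    files,
    patterns=["*.pop.h.*.nc", "*.cice.h.*.nc"],
    content_types={"output"},
)
print(marine)
```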

## Compare Your Archives

Let's see how different selection strategies affect archive size:

In [None]:
# List all archives to compare sizes
!tellus archive list

### Archive Size Analysis

Notice how dramatically different the archive sizes are:

- **Critical files**: Usually smallest (often a few MB when limited to configs and namelists, larger if input datasets are included)
- **Analysis data**: Largest (GB to TB) - contains all your scientific output
- **Restart ready**: Medium size - configs plus restart files
- **Component-specific**: Varies - depends on which model components you selected

This size difference is why selective archiving is so important!

## Real-World Workflow Examples

Here are common archiving strategies used by climate modeling groups:

In [None]:
print("🌍 REAL-WORLD ARCHIVING WORKFLOWS:")
print()
print("📚 PUBLICATION WORKFLOW:")
print("   1. Archive critical files → Long-term repository")
print("   2. Archive analysis data → Working storage")
print("   3. After publication → Archive complete dataset")
print()
print("🔬 COLLABORATIVE RESEARCH:")
print("   1. Share configs with team → Critical files archive")
print("   2. Share results with collaborators → Output data archive")
print("   3. Keep diagnostics locally → Diagnostic archive")
print()
print("💾 STORAGE MANAGEMENT:")
print("   1. Expensive fast storage → Critical + recent output")
print("   2. Cheap bulk storage → Complete historical data")
print("   3. Temporary storage → Logs and diagnostics")
print()
print("🚀 OPERATIONAL MODELING:")
print("   1. Daily: Archive restart-ready (continue operations)")
print("   2. Weekly: Archive output data (for analysis)")
print("   3. Monthly: Archive complete runs (for archival)")

## Decision Tree: Choosing Your Archive Strategy

```
❓ What's your primary goal?
├─ 🔬 Scientific Analysis
│  ├─ Need all components? → Full output archive
│  └─ Specific component? → Component-specific archive
│
├─ 📝 Reproducibility
│  ├─ Just reproduce setup? → Critical files only
│  └─ Restart simulation? → Restart-ready archive
│
├─ 🤝 Collaboration
│  ├─ Share configs? → Critical files archive
│  ├─ Share results? → Output data archive
│  └─ Share everything? → Complete archive
│
└─ 💾 Storage Management
   ├─ Limited space? → Selective by importance
   ├─ Different storage tiers? → Multiple targeted archives
   └─ Unlimited space? → Complete archive
```
```
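
The same decision tree can be written as a small lookup table, which is handy if you want to script the choice. This is only a sketch: `DECISION_TREE` and `recommend_strategy` are hypothetical names, and the keys are shortened versions of the questions above.

```python
# Hypothetical encoding of the decision tree: goal -> situation -> strategy.
DECISION_TREE = {
    "analysis": {
        "all components": "full output archive",
        "specific component": "component-specific archive",
    },
    "reproducibility": {
        "reproduce setup": "critical files only",
        "restart simulation": "restart-ready archive",
    },
    "collaboration": {
        "share configs": "critical files archive",
        "share results": "output data archive",
        "share everything": "complete archive",
    },
    "storage": {
        "limited space": "selective by importance",
        "storage tiers": "multiple targeted archives",
        "unlimited space": "complete archive",
    },
}


def recommend_strategy(goal, situation):
    """Look up the archive strategy for a goal and situation."""
    try:
        return DECISION_TREE[goal][situation]
    except KeyError:
        raise ValueError(f"No strategy for {goal!r} / {situation!r}") from None


print(recommend_strategy("reproducibility", "restart simulation"))
```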

## Troubleshooting Classification Issues

### Problem: Files classified incorrectly
**Solution**: Use explicit patterns to override automatic classification:
```bash
tellus archive create my_archive /path/to/sim \
    --patterns "specific_file.nc" \
    --content-types output
```

### Problem: Missing files in selective archive
**Solution**: Check what was actually included:
```bash
tellus archive show my_archive
```
Look at the "File Patterns" section to see what patterns were used.

### Problem: Archive too large/small
**Solution**: Estimate the archive size before creating it:
```bash
# Dry run to see what would be included
tellus archive scan /path/to/sim --content-types output
```
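
For a quick cross-check that does not depend on Tellus, you can total the sizes of matching files directly from Python. This is a minimal sketch using only the standard library. `estimate_size` is a hypothetical helper, and it matches on file names with glob patterns, which may differ from how Tellus selects files.

```python
from fnmatch import fnmatchcase
from pathlib import Path


def estimate_size(root, patterns):
    """Return (file_count, total_bytes) for files under `root` whose
    names match at least one glob pattern."""
    file_count = 0
    total_bytes = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and any(fnmatchcase(path.name, p) for p in patterns):
            file_count += 1
            total_bytes += path.stat().st_size
    return file_count, total_bytes
```

For example, `estimate_size("/path/to/sim", ["*.cam.h0.*.nc"])` returns the number of monthly atmosphere history files and their combined size in bytes.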

### Problem: Unknown file types not classified
**Solution**: Add custom patterns:
```bash
tellus archive create my_archive /path/to/sim \
    --patterns "*.custom_ext" \
    --content-types output
```

## Practice Exercises

Try these exercises with your own simulation data:

**Exercise 1**: Create a "minimal reproducibility" archive
- Only critical files
- Should be < 100 MB
- Must include all namelists and input data

**Exercise 2**: Create an "analysis-ready" archive
- Primary output data only
- Exclude logs and temporary files
- Include both monthly and daily output if available

**Exercise 3**: Create component-specific archives
- Separate archives for atmosphere, ocean, land
- Compare file counts and sizes
- Which component produces the most data?

**Exercise 4**: Design a storage strategy
- You have 3 storage tiers: fast (expensive), medium, slow (cheap)
- Assign different content types to different tiers
- Create archives appropriate for each tier

## Summary

🎉 **You've mastered intelligent archiving!**

**Key concepts learned:**
- Content types reflect file roles in simulations
- Importance levels guide storage decisions
- Model-specific patterns ensure accurate classification
- Selective archiving dramatically reduces storage needs
- Different workflows require different archive strategies

**Next up**: You'll learn advanced extraction techniques, including how to get just the data you need from your archives using date ranges and patterns.

---
*Next: [Tutorial 3: DateTime-Based Extraction and Filtering](tutorial-3-datetime-extraction.ipynb)*