# CMIP6 Data Management Workflows with Tellus

## User Story: CMIP6 Model Output Post-Processing and Distribution

**Scenario**: Dr. Elena Vasquez leads the CMIP6 data production team at DKRZ (German Climate Computing Center). Her team manages the complete pipeline from raw Earth System Model output to CMIP6-compliant data publication, including quality control, post-processing, archival, and distribution through the Earth System Grid Federation (ESGF).

**Goals**:
- Process raw model output into CMIP6-standardized formats
- Implement comprehensive quality control and validation workflows
- Manage data versioning and retraction procedures
- Coordinate with ESGF nodes for global data distribution
- Ensure full provenance tracking and metadata compliance

**Key Features Demonstrated**:
- CMIP6 data standards compliance
- Automated QC and validation pipelines
- Version management and data lifecycle
- ESGF integration and publishing workflows
- Comprehensive provenance and metadata tracking

## 1. CMIP6 Infrastructure Setup

Configure Tellus for CMIP6 data processing and distribution workflows.

In [None]:
# Import required modules
from tellus.application.container import ServiceContainer
from tellus.application.dtos import (
    CreateLocationDto, CreateSimulationDto, CreateArchiveDto,
    FileTransferOperationDto, BatchFileTransferOperationDto,
    CreateProgressTrackingDto
)
from tellus.domain.entities.location import LocationKind
import json
from datetime import datetime, timedelta
from pathlib import Path

# Initialize service container for CMIP6 workflows
container = ServiceContainer()
location_service = container.get_location_service()
simulation_service = container.get_simulation_service()
archive_service = container.get_archive_service()
transfer_service = container.get_file_transfer_service()
progress_service = container.get_progress_tracking_service()

print("🌍 CMIP6 Data Management Workflow Initialized")
print("Institution: DKRZ (German Climate Computing Center)")
print("Team Lead: Dr. Elena Vasquez")
print("Mission: CMIP6 Data Production and Distribution")
print("Scope: Multi-model, multi-experiment CMIP6 pipeline")

## 2. CMIP6 Storage and Processing Infrastructure

Set up the CMIP6 data processing infrastructure: staging, processing, QC, and publication areas, plus the ESGF data node and the long-term tape archive.

In [None]:
# Configure raw model output staging area
staging_dto = CreateLocationDto(
    name="dkrz-cmip6-staging",
    kinds=[LocationKind.FILESERVER, LocationKind.DISK],
    protocol="file",
    path="/pool/data/CMIP6/staging/raw-output",
    description="DKRZ staging area for incoming raw model output",
    metadata={
        "filesystem_type": "lustre",
        "capacity_tb": 500,
        "retention_policy": "30_days_post_processing",
        "access_level": "cmip6_producers",
        "backup": False,  # Temporary staging area
        "monitoring": "space_usage_alerts",
        "intake_bandwidth_gb_per_sec": 2.0
    }
)
staging_result = location_service.create_location(staging_dto)
print(f"✓ Configured staging area: {staging_result.name}")

# Configure CMIP6 processing cluster
processing_dto = CreateLocationDto(
    name="dkrz-cmip6-processing", 
    kinds=[LocationKind.COMPUTE, LocationKind.FILESERVER],
    protocol="file",
    path="/work/cmip6/processing",
    description="Dedicated CMIP6 post-processing and QC cluster",
    metadata={
        "compute_nodes": 64,
        "cores_per_node": 128,
        "total_cores": 8192,
        "memory_per_node_gb": 512,
        "total_memory_tb": 32,
        "storage_capacity_tb": 100,
        "software_stack": [
            "cmorize-4.0", "cf-checker-4.1", "cmip6-cv-1.0",
            "nco-5.0", "cdo-2.0", "python-3.9", "xarray", "iris"
        ],
        "scheduler": "slurm",
        "queue_types": ["cmip6-urgent", "cmip6-standard", "cmip6-bulk"]
    }
)
processing_result = location_service.create_location(processing_dto)
print(f"✓ Configured processing cluster: {processing_result.name}")

# Configure CMIP6 quality control and validation area
qc_dto = CreateLocationDto(
    name="dkrz-cmip6-qc",
    kinds=[LocationKind.FILESERVER],
    protocol="file",
    path="/pool/data/CMIP6/qc-validation",
    description="CMIP6 quality control and validation workspace",
    metadata={
        "capacity_tb": 200,
        "purpose": "quality_control_validation",
        "tools": [
            "cmip6-qc-suite", "cf-checker", "cmip6-dreq-validator",
            "esgf-prepub-validator", "cmip6-cmvchecker"
        ],
        "retention_policy": "qc_logs_1_year",
        "validation_levels": ["technical", "scientific", "metadata", "format"]
    }
)
qc_result = location_service.create_location(qc_dto)
print(f"✓ Configured QC validation area: {qc_result.name}")

# Configure CMIP6 publication-ready storage
publication_dto = CreateLocationDto(
    name="dkrz-cmip6-publication",
    kinds=[LocationKind.FILESERVER],
    protocol="file",
    path="/pool/data/CMIP6/publication",
    description="CMIP6 publication-ready data for ESGF distribution",
    metadata={
        "capacity_tb": 2000,
        "data_format": "cmip6_compliant_netcdf4",
        "directory_structure": "cmip6_data_reference_syntax",
        "access_permissions": "esgf_publisher",
        "replication_factor": 2,
        "integrity_monitoring": "continuous_checksums",
        "version_control": "cmip6_versioning_scheme"
    }
)
publication_result = location_service.create_location(publication_dto)
print(f"✓ Configured publication storage: {publication_result.name}")

# Configure ESGF data node
esgf_dto = CreateLocationDto(
    name="dkrz-esgf-node",
    kinds=[LocationKind.FILESERVER],
    protocol="thredds",
    host="esgf-data.dkrz.de",
    path="/thredds/fileServer/cmip6",
    description="DKRZ ESGF data node for global CMIP6 distribution",
    metadata={
        "esgf_node_type": "data_node",
        "data_access_methods": ["http", "opendap", "gridftp", "wget"],
        "catalog_service": "thredds_data_server",
        "search_integration": "esgf_search_api",
        "bandwidth_gb_per_sec": 10.0,
        "global_federation": True,
        "pid_service": "handle_system",
        "citation_service": "datacite_doi"
    }
)
esgf_result = location_service.create_location(esgf_dto)
print(f"✓ Configured ESGF data node: {esgf_result.name}")

# Configure long-term archive
archive_dto = CreateLocationDto(
    name="dkrz-cmip6-archive",
    kinds=[LocationKind.TAPE],
    protocol="hsi",
    host="hpss.dkrz.de",
    path="/arch/bb1013/CMIP6",
    description="DKRZ HPSS long-term archive for CMIP6 data preservation",
    metadata={
        "storage_type": "hierarchical_storage_system",
        "capacity_pb": 5,  # 5 petabytes
        "retention_policy": "permanent",
        "tape_technology": "LTO-9",
        "migration_policy": "automatic",
        "retrieval_sla_hours": 4,
        "backup_copies": 2,
        "geographic_replication": "partner_sites"
    }
)
archive_result = location_service.create_location(archive_dto)
print(f"✓ Configured long-term archive: {archive_result.name}")

print("\n🏗️  CMIP6 Infrastructure Overview:")
print("  📥 Raw Output → Staging → Processing → QC → Publication → ESGF → Archive")
print(f"  💾 Total Capacity: {500+100+200+2000+5000} TB across 5 storage locations (the ESGF node declares no capacity)")
print(f"  🖥️  Processing Power: {processing_result.metadata['total_cores']} cores dedicated to CMIP6")
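
The hard-coded sum in the overview above can instead be derived from the location metadata. The sketch below is a minimal illustration using plain dictionaries that mirror the `metadata` values defined in this section. The `capacity_in_tb` helper is hypothetical and not part of Tellus, and it assumes decimal units (1 PB = 1000 TB).

```python
# Capacity metadata copied from the location definitions above.
# The ESGF node declares no capacity, so it contributes nothing to the total.
location_metadata = {
    "dkrz-cmip6-staging": {"capacity_tb": 500},
    "dkrz-cmip6-processing": {"storage_capacity_tb": 100},
    "dkrz-cmip6-qc": {"capacity_tb": 200},
    "dkrz-cmip6-publication": {"capacity_tb": 2000},
    "dkrz-esgf-node": {},
    "dkrz-cmip6-archive": {"capacity_pb": 5},
}


def capacity_in_tb(metadata):
    """Return the declared capacity in TB, or None if none is declared.

    Hypothetical helper; assumes decimal units (1 PB = 1000 TB).
    """
    for key in ("capacity_tb", "storage_capacity_tb"):
        if key in metadata:
            return float(metadata[key])
    if "capacity_pb" in metadata:
        return float(metadata["capacity_pb"]) * 1000.0
    return None


capacities = {name: capacity_in_tb(meta) for name, meta in location_metadata.items()}
# Keep only the locations that actually declare a capacity.
sized = {name: tb for name, tb in capacities.items() if tb is not None}
total_capacity_tb = sum(sized.values())

print(f"Total capacity: {total_capacity_tb:.0f} TB across {len(sized)} locations")
```

Deriving the total this way keeps the summary in step with the location definitions if a capacity value changes.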

## 3. CMIP6 Simulation and Experiment Catalog

Create a catalog of CMIP6 simulations covering two models, four experiments, and three ensemble members per model-experiment combination.

In [None]:
# Define CMIP6 experiments and models
cmip6_experiments = {
    "historical": {
        "description": "Historical climate simulation 1850-2014",
        "time_period": "1850-2014",
        "forcings": "historical_greenhouse_gases_aerosols_landuse",
        "tier": 1,
        "priority": "high"
    },
    "ssp585": {
        "description": "High emission scenario projection 2015-2100",
        "time_period": "2015-2100", 
        "forcings": "ssp585_greenhouse_gases_aerosols_landuse",
        "tier": 1,
        "priority": "high"
    },
    "ssp245": {
        "description": "Medium emission scenario projection 2015-2100",
        "time_period": "2015-2100",
        "forcings": "ssp245_greenhouse_gases_aerosols_landuse",
        "tier": 1,
        "priority": "high"
    },
    "piControl": {
        "description": "Pre-industrial control simulation",
        "time_period": "500_years_equilibrium",
        "forcings": "pre_industrial_constant",
        "tier": 1,
        "priority": "high"
    }
}

cmip6_models = {
    "MPI-ESM1-2-HR": {
        "institution": "MPI-M",
        "atmosphere_resolution": "T127 (~100km)",
        "ocean_resolution": "TP04 (~40km)",
        "model_components": ["ECHAM6.3", "MPIOM1.6", "JSBACH3.2", "HAMOCC2.0"],
        "grid_labels": ["gn"]
    },
    "ICON-ESM": {
        "institution": "MPI-M", 
        "atmosphere_resolution": "R2B6 (~50km)",
        "ocean_resolution": "R2B6 (~50km)",
        "model_components": ["ICON-A", "ICON-O", "JSBACH4", "HAMOCC"],
        "grid_labels": ["gn"]
    }
}

# Create CMIP6 simulations
created_simulations = []

print("🌍 Creating CMIP6 Simulation Catalog")
print("=" * 40)

for model_id, model_info in cmip6_models.items():
    for experiment_id, exp_info in cmip6_experiments.items():
        for variant in ["r1i1p1f1", "r2i1p1f1", "r3i1p1f1"]:
            simulation_id = f"{model_id.lower()}-{experiment_id}-{variant}"
            
            sim_dto = CreateSimulationDto(
                simulation_id=simulation_id,
                model_id=model_id,
                attrs={
                    # Core CMIP6 attributes
                    "mip_era": "CMIP6",
                    "activity_id": "CMIP" if experiment_id in ["historical", "piControl"] else "ScenarioMIP",
                    "institution_id": model_info["institution"],
                    "source_id": model_id,
                    "experiment_id": experiment_id,
                    "variant_label": variant,
                    "grid_label": model_info["grid_labels"][0],
                    
                    # Experiment details
                    "experiment_description": exp_info["description"],
                    "time_period": exp_info["time_period"],
                    "forcings": exp_info["forcings"],
                    "tier": exp_info["tier"],
                    "priority": exp_info["priority"],
                    
                    # Model configuration
                    "atmosphere_resolution": model_info["atmosphere_resolution"],
                    "ocean_resolution": model_info["ocean_resolution"],
                    "model_components": model_info["model_components"],
                    
                    # Processing status
                    "simulation_status": "completed",
                    "post_processing_status": "in_progress",
                    "qc_status": "pending",
                    "publication_status": "not_started",
                    
                    # Data management
                    "expected_variables": 150,  # Typical CMIP6 variable count
                    "expected_size_tb": 2.5,   # Per simulation estimate
                    "retention_years": 50,
                    
                    # Metadata
                    "contact": "elena.vasquez@dkrz.de",
                    "creation_date": "2024-06-15",
                    "cmip6_compliant": True,
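                    # exp_info has no 'activity_id' key, so derive the activity the same way as above
                    "data_reference_syntax": f"CMIP6/{'CMIP' if experiment_id in ('historical', 'piControl') else 'ScenarioMIP'}/{model_info['institution']}/{model_id}/{experiment_id}/{variant}/"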
                }
            )
            
            sim_result = simulation_service.create_simulation(sim_dto)
            created_simulations.append(sim_result)
            
            # Only print first few to avoid spam
            if len(created_simulations) <= 4:
                print(f"✓ {simulation_id}")
                print(f"  Experiment: {experiment_id} ({exp_info['description']})")
                print(f"  Model: {model_id} ({model_info['institution']})")
                print(f"  Expected size: {sim_result.attrs['expected_size_tb']} TB")
                print()

print(f"📊 CMIP6 Simulation Catalog Summary:")
print(f"  Total simulations: {len(created_simulations)}")
print(f"  Models: {len(cmip6_models)} ({', '.join(cmip6_models.keys())})")
print(f"  Experiments: {len(cmip6_experiments)} ({', '.join(cmip6_experiments.keys())})")
print(f"  Ensemble members: 3 per model-experiment combination")
print(f"  Expected total size: {len(created_simulations) * 2.5:.1f} TB")
print(f"  Institution: MPI-M via DKRZ")
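
The `activity_id` and the Data Reference Syntax prefix used in the catalog can be derived in one place so the two never disagree. The following is a minimal sketch limited to the four experiments above; `EXPERIMENT_ACTIVITY`, `drs_prefix`, and `parse_variant_label` are illustrative names, not part of Tellus.

```python
import re

# Activity for each experiment in this catalog (historical and piControl
# belong to CMIP, the SSP scenarios to ScenarioMIP).
EXPERIMENT_ACTIVITY = {
    "historical": "CMIP",
    "piControl": "CMIP",
    "ssp245": "ScenarioMIP",
    "ssp585": "ScenarioMIP",
}

# Variant labels have the form r<N>i<N>p<N>f<N>:
# realization, initialization, physics, and forcing indices.
VARIANT_PATTERN = re.compile(r"^r([0-9]+)i([0-9]+)p([0-9]+)f([0-9]+)$")


def parse_variant_label(label):
    """Split a variant label such as 'r1i1p1f1' into its four indices."""
    match = VARIANT_PATTERN.match(label)
    if match is None:
        raise ValueError(f"not a valid variant label: {label!r}")
    realization, initialization, physics, forcing = (int(g) for g in match.groups())
    return {
        "realization": realization,
        "initialization": initialization,
        "physics": physics,
        "forcing": forcing,
    }


def drs_prefix(institution_id, source_id, experiment_id, variant_label):
    """Build the simulation-level DRS prefix used in this catalog."""
    parse_variant_label(variant_label)  # raises ValueError on malformed labels
    activity_id = EXPERIMENT_ACTIVITY[experiment_id]  # KeyError if unknown
    parts = ["CMIP6", activity_id, institution_id, source_id, experiment_id, variant_label]
    return "/".join(parts) + "/"


prefix = drs_prefix("MPI-M", "MPI-ESM1-2-HR", "ssp585", "r1i1p1f1")
print(prefix)
```

An explicit mapping fails loudly on an unknown experiment instead of silently filing it under a default activity.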

## 4. CMIP6 Post-Processing Pipeline

Define the stages of the CMIP6 post-processing workflow, from raw output to publication-ready data, and estimate the end-to-end duration per simulation.

In [None]:
# Define CMIP6 post-processing workflow stages
def create_cmip6_processing_pipeline():
    """Create comprehensive CMIP6 post-processing pipeline."""
    
    pipeline_stages = {
        "intake_validation": {
            "name": "Raw Data Intake and Initial Validation",
            "location": "dkrz-cmip6-staging",
            "duration_hours": 2,
            "tools": ["file_integrity_checker", "metadata_extractor"],
            "processes": [
                "File integrity verification (checksums, headers)",
                "Basic metadata extraction and cataloging",
                "Directory structure validation",
                "File size and count verification",
                "Initial format compliance check"
            ],
            "success_criteria": "All files pass integrity checks",
            "failure_action": "Notify data provider, quarantine data"
        },
        "cmorization": {
            "name": "CMOR Processing and Standardization",
            "location": "dkrz-cmip6-processing", 
            "duration_hours": 8,
            "tools": ["cmor-4.0", "cmip6-cmor-tables"],
            "processes": [
                "Variable renaming and unit conversion",
                "Time coordinate standardization",
                "Spatial grid and coordinate processing",
                "Metadata standardization (CF conventions)",
                "CMIP6 global attributes addition",
                "File splitting and temporal chunking"
            ],
            "compute_requirements": "16 nodes, 2048 cores, 4TB memory",
            "output_format": "NetCDF4 with compression",
            "compression_target": "30-50% size reduction"
        },
        "technical_qc": {
            "name": "Technical Quality Control", 
            "location": "dkrz-cmip6-qc",
            "duration_hours": 4,
            "tools": ["cf-checker", "cmip6-qc-suite", "ncdump-validation"],
            "processes": [
                "CF compliance validation",
                "CMIP6 controlled vocabulary checking",
                "File format and structure validation",
                "Coordinate system verification",
                "Missing value and fill value checks",
                "Compression and chunking optimization validation"
            ],
            "validation_levels": ["MUST", "SHOULD", "RECOMMENDED"],
            "pass_threshold": "All MUST checks pass, 95% SHOULD checks pass"
        },
        "scientific_qc": {
            "name": "Scientific Quality Assessment",
            "location": "dkrz-cmip6-qc",
            "duration_hours": 6,
            "tools": ["cmip6-scientific-qc", "climate-validator"],
            "processes": [
                "Physical consistency checks (conservation laws)",
                "Climatological range validation",
                "Temporal consistency analysis",
                "Spatial pattern validation", 
                "Inter-variable relationship checks",
                "Comparison with observational references"
            ],
            "reference_datasets": ["ERA5", "GPCP", "HadISST", "MODIS"],
            "tolerance_levels": "Model-dependent thresholds"
        },
        "metadata_enrichment": {
            "name": "Metadata Enrichment and Documentation",
            "location": "dkrz-cmip6-processing",
            "duration_hours": 3,
            "tools": ["cmip6-metadata-tools", "pid-generator"],
            "processes": [
                "Complete global attributes population",
                "Provenance information addition",
                "Citation and DOI preparation",
                "License and usage terms assignment",
                "Contact and institution information",
                "Version and tracking ID generation"
            ],
            "pid_system": "Handle System with DataCite DOI",
            "citation_format": "CMIP6 standard citation"
        },
        "publication_preparation": {
            "name": "Publication Preparation and Staging",
            "location": "dkrz-cmip6-publication",
            "duration_hours": 2,
            "tools": ["esgf-publisher-prep", "thredds-catalog-gen"],
            "processes": [
                "CMIP6 Data Reference Syntax organization",
                "Directory structure standardization",
                "File permission and ownership setup",
                "Aggregation dataset creation",
                "THREDDS catalog preparation",
                "Final integrity verification"
            ],
            "drs_structure": "mip_era/activity_id/institution_id/source_id/experiment_id/variant_label/table_id/variable_id/grid_label/version",
            "catalog_format": "THREDDS XML with ESGF extensions"
        }
    }
    
    return pipeline_stages

# Create and display the processing pipeline
cmip6_pipeline = create_cmip6_processing_pipeline()
print("🔄 CMIP6 Post-Processing Pipeline")
print("=" * 45)

total_duration = 0
for stage_number, (stage_id, stage_info) in enumerate(cmip6_pipeline.items(), start=1):
    print(f"\n🔧 STAGE {stage_number}: {stage_info['name'].upper()}")
    print(f"   Location: {stage_info['location']}")
    print(f"   Duration: {stage_info['duration_hours']} hours")
    print(f"   Tools: {', '.join(stage_info['tools'])}")
    
    if 'compute_requirements' in stage_info:
        print(f"   Compute: {stage_info['compute_requirements']}")
    
    print(f"   Key Processes:")
    for process in stage_info['processes'][:3]:  # Show first 3
        print(f"     • {process}")
    if len(stage_info['processes']) > 3:
        print(f"     • ... and {len(stage_info['processes']) - 3} more")
    
    total_duration += stage_info['duration_hours']

print(f"\n⏱️  Total Pipeline Duration: {total_duration} hours per simulation")
print(f"📊 Processing Throughput: {24/total_duration:.1f} simulations per day (single pipeline)")
print(f"🔧 Parallel Pipelines: Up to 4 simultaneous for high-priority data")
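
The throughput figures printed above follow from simple arithmetic on the stage durations. As a worked example, the sketch below estimates how long the 24-simulation catalog would take, assuming that all stages of one simulation run strictly in sequence and that the four parallel pipelines scale linearly. Both are simplifications, and `days_to_process` is a hypothetical helper.

```python
# Stage durations in hours, copied from the pipeline definition above.
stage_hours = {
    "intake_validation": 2,
    "cmorization": 8,
    "technical_qc": 4,
    "scientific_qc": 6,
    "metadata_enrichment": 3,
    "publication_preparation": 2,
}

hours_per_simulation = sum(stage_hours.values())


def days_to_process(n_simulations, parallel_pipelines=1):
    """Estimate wall-clock days, assuming linear scaling across pipelines."""
    return n_simulations * hours_per_simulation / (24 * parallel_pipelines)


n_simulations = 2 * 4 * 3  # models x experiments x ensemble members
single_pipeline_days = days_to_process(n_simulations)
four_pipeline_days = days_to_process(n_simulations, parallel_pipelines=4)

print(f"Hours per simulation: {hours_per_simulation}")
print(f"One pipeline:   {single_pipeline_days:.2f} days for {n_simulations} simulations")
print(f"Four pipelines: {four_pipeline_days:.2f} days for {n_simulations} simulations")
```

In practice, queue waits and stage overlap between simulations would shift these numbers, so they are best read as a rough planning bound.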

In [None]:
# Associate sample simulation with CMIP6 processing locations
from tellus.application.dtos import SimulationLocationAssociationDto

# Select a sample simulation for demonstration
sample_sim = created_simulations[0]  # MPI-ESM1-2-HR historical r1i1p1f1

cmip6_assoc_dto = SimulationLocationAssociationDto(
    simulation_id=sample_sim.simulation_id,
    location_names=[
        "dkrz-cmip6-staging", "dkrz-cmip6-processing", 
        "dkrz-cmip6-qc", "dkrz-cmip6-publication",
        "dkrz-esgf-node", "dkrz-cmip6-archive"
    ],
    context_overrides={
        "dkrz-cmip6-staging": {
            "path_prefix": f"/pool/data/CMIP6/staging/raw-output/{sample_sim.attrs['source_id']}/{sample_sim.attrs['experiment_id']}/{sample_sim.attrs['variant_label']}",
            "role": "raw_data_intake",
            "retention_days": 30
        },
        "dkrz-cmip6-processing": {
            "path_prefix": f"/work/cmip6/processing/{sample_sim.attrs['source_id']}/{sample_sim.attrs['experiment_id']}/{sample_sim.attrs['variant_label']}",
            "role": "cmorization_processing",
            "workflow_integration": True
        },
        "dkrz-cmip6-qc": {
            "path_prefix": f"/pool/data/CMIP6/qc-validation/{sample_sim.attrs['source_id']}/{sample_sim.attrs['experiment_id']}/{sample_sim.attrs['variant_label']}",
            "role": "quality_control",
            "validation_reports": True
        },
        "dkrz-cmip6-publication": {
            "path_prefix": sample_sim.attrs['data_reference_syntax'],
            "role": "publication_ready",
            "esgf_ready": True,
            "drs_compliant": True
        },
        "dkrz-esgf-node": {
            "path_prefix": f"/thredds/fileServer/cmip6/{sample_sim.attrs['data_reference_syntax']}",
            "role": "global_distribution",
            "access_methods": ["http", "opendap", "gridftp"]
        },
        "dkrz-cmip6-archive": {
            "path_prefix": f"/arch/bb1013/CMIP6/{sample_sim.attrs['source_id']}/{sample_sim.attrs['experiment_id']}/{sample_sim.attrs['variant_label']}",
            "role": "long_term_preservation",
            "retention": "permanent"
        }
    }
)

association_result = simulation_service.associate_simulation_with_locations(cmip6_assoc_dto)
print(f"✓ Associated {sample_sim.simulation_id} with CMIP6 processing infrastructure")
print(f"  Locations: {len(cmip6_assoc_dto.location_names)} processing stages")
print(f"  Model: {sample_sim.attrs['source_id']}")
print(f"  Experiment: {sample_sim.attrs['experiment_id']}")
print(f"  Variant: {sample_sim.attrs['variant_label']}")
print(f"  Expected size: {sample_sim.attrs['expected_size_tb']} TB")

print("\n🔗 CMIP6 Processing Flow:")
flow_stages = [
    ("Raw Output", "dkrz-cmip6-staging"),
    ("CMORization", "dkrz-cmip6-processing"),
    ("Quality Control", "dkrz-cmip6-qc"),
    ("Publication", "dkrz-cmip6-publication"),
    ("ESGF Distribution", "dkrz-esgf-node"),
    ("Long-term Archive", "dkrz-cmip6-archive")
]

for i, (stage_name, location) in enumerate(flow_stages):
    arrow = " → " if i < len(flow_stages) - 1 else ""
    print(f"  {stage_name} ({location}){arrow}", end="")
print()
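
Four of the `path_prefix` values above repeat the same `source_id/experiment_id/variant_label` suffix. A small helper can build them from a table of base paths. This is only a sketch with plain dictionaries: `BASE_PATHS` and `build_path_prefixes` are hypothetical names, not a Tellus API, and the publication and ESGF locations are left out because they use the DRS prefix instead.

```python
from pathlib import PurePosixPath

# Base paths copied from the location definitions in Section 2.
BASE_PATHS = {
    "dkrz-cmip6-staging": "/pool/data/CMIP6/staging/raw-output",
    "dkrz-cmip6-processing": "/work/cmip6/processing",
    "dkrz-cmip6-qc": "/pool/data/CMIP6/qc-validation",
    "dkrz-cmip6-archive": "/arch/bb1013/CMIP6",
}


def build_path_prefixes(attrs, base_paths=BASE_PATHS):
    """Return {location_name: path_prefix} for one simulation's attributes."""
    suffix = PurePosixPath(
        attrs["source_id"], attrs["experiment_id"], attrs["variant_label"]
    )
    return {name: str(PurePosixPath(base) / suffix) for name, base in base_paths.items()}


# Attributes of the sample simulation used above.
sample_attrs = {
    "source_id": "MPI-ESM1-2-HR",
    "experiment_id": "historical",
    "variant_label": "r1i1p1f1",
}

prefixes = build_path_prefixes(sample_attrs)
for name, prefix in prefixes.items():
    print(f"{name}: {prefix}")
```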

## 5. CMIP6 Quality Control and Validation Workflows

Define the quality control checks applied to CMIP6 data, grouped by category, with a priority and an automation flag for each category.

In [None]:
# Define CMIP6 Quality Control procedures
def create_cmip6_qc_procedures():
    """Create comprehensive CMIP6 quality control procedures."""
    
    qc_procedures = {
        "format_compliance": {
            "name": "Format and Structure Compliance",
            "priority": "critical",
            "automated": True,
            "checks": [
                {
                    "check_id": "cf_compliance",
                    "description": "CF Conventions compliance validation",
                    "tool": "cf-checker-4.1",
                    "severity": "error",
                    "criteria": "All CF conventions must be followed"
                },
                {
                    "check_id": "cmip6_cv",
                    "description": "CMIP6 controlled vocabulary validation",
                    "tool": "cmip6-cv-1.0",
                    "severity": "error",
                    "criteria": "All global attributes must use CMIP6 CV terms"
                },
                {
                    "check_id": "netcdf4_format",
                    "description": "NetCDF4 format compliance",
                    "tool": "ncdump-validation",
                    "severity": "error",
                    "criteria": "Files must be valid NetCDF4 with proper compression"
                },
                {
                    "check_id": "drs_compliance", 
                    "description": "Data Reference Syntax compliance",
                    "tool": "drs-validator",
                    "severity": "error",
                    "criteria": "Directory structure must follow CMIP6 DRS"
                }
            ]
        },
        "metadata_validation": {
            "name": "Metadata and Attributes Validation",
            "priority": "critical",
            "automated": True,
            "checks": [
                {
                    "check_id": "required_attributes",
                    "description": "Required global attributes presence",
                    "tool": "cmip6-attribute-checker",
                    "severity": "error",
                    "required_attrs": [
                        "source_id", "experiment_id", "variant_label", "grid_label",
                        "table_id", "variable_id", "activity_id", "institution_id",
                        "creation_date", "tracking_id", "further_info_url"
                    ]
                },
                {
                    "check_id": "variable_attributes",
                    "description": "Variable-specific attributes validation",
                    "tool": "cmor-variable-checker",
                    "severity": "error",
                    "criteria": "Variables must have CF standard_name, units, long_name"
                },
                {
                    "check_id": "time_coordinate",
                    "description": "Time coordinate validation", 
                    "tool": "time-coordinate-validator",
                    "severity": "error",
                    "criteria": "Time coordinates must follow CF calendar conventions"
                }
            ]
        },
        "data_integrity": {
            "name": "Data Integrity and Consistency",
            "priority": "high",
            "automated": True,
            "checks": [
                {
                    "check_id": "missing_values",
                    "description": "Missing value pattern analysis",
                    "tool": "missing-value-analyzer",
                    "severity": "warning",
                    "threshold": "<10% missing values for most variables"
                },
                {
                    "check_id": "temporal_consistency",
                    "description": "Temporal continuity and gaps",
                    "tool": "temporal-consistency-checker",
                    "severity": "error",
                    "criteria": "No gaps in time series for experiment period"
                },
                {
                    "check_id": "spatial_coverage",
                    "description": "Spatial coverage validation",
                    "tool": "spatial-coverage-validator",
                    "severity": "warning",
                    "criteria": "Global coverage expected for most variables"
                }
            ]
        },
        "scientific_validation": {
            "name": "Scientific Content Validation",
            "priority": "medium",
            "automated": False,  # Requires expert review
            "checks": [
                {
                    "check_id": "climatology_bounds",
                    "description": "Climatological range validation",
                    "tool": "climatology-validator",
                    "severity": "warning",
                    "reference": "Observational climatologies (ERA5, GPCP, etc.)"
                },
                {
                    "check_id": "energy_conservation",
                    "description": "Energy balance validation",
                    "tool": "energy-balance-checker",
                    "severity": "warning", 
                    "criteria": "Global energy imbalance within model-dependent thresholds"
                },
                {
                    "check_id": "water_conservation",
                    "description": "Water cycle consistency",
                    "tool": "water-cycle-validator",
                    "severity": "warning",
                    "criteria": "Precipitation-evaporation balance within reasonable bounds"
                }
            ]
        }
    }
    
    return qc_procedures

# Create QC procedures and display
cmip6_qc = create_cmip6_qc_procedures()
print("🔍 CMIP6 Quality Control Procedures")
print("=" * 40)

for category_id, category_info in cmip6_qc.items():
    print(f"\n🔧 {category_info['name'].upper()}")
    print(f"   Priority: {category_info['priority']}")
    print(f"   Automated: {category_info['automated']}")
    print(f"   Checks: {len(category_info['checks'])}")
    
    for i, check in enumerate(category_info['checks'][:2], 1):  # Show first 2
        print(f"   {i}. {check['description']}")
        print(f"      Tool: {check['tool']}")
        print(f"      Severity: {check['severity']}")
    
    if len(category_info['checks']) > 2:
        print(f"   ... and {len(category_info['checks']) - 2} more checks")

# Calculate QC statistics
total_checks = sum(len(cat['checks']) for cat in cmip6_qc.values())
automated_checks = sum(len(cat['checks']) for cat in cmip6_qc.values() if cat['automated'])
critical_procedures = sum(1 for cat in cmip6_qc.values() if cat['priority'] == 'critical')

print(f"\n📊 QC Summary:")
print(f"  Total validation checks: {total_checks}")
print(f"  Automated checks: {automated_checks} ({automated_checks/total_checks*100:.0f}%)")
print(f"  Critical procedures: {critical_procedures}")
print(f"  Manual review procedures: {sum(1 for cat in cmip6_qc.values() if not cat['automated'])}")
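
The technical QC stage in Section 4 states its pass threshold as "All MUST checks pass, 95% SHOULD checks pass". One way to evaluate that rule over a set of check results is sketched below. The `CheckResult` record, the `meets_pass_threshold` function, and the sample results are invented for illustration; real QC tools report results in their own formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    """One QC check outcome (hypothetical record for this sketch)."""
    check_id: str
    level: str  # "MUST", "SHOULD", or "RECOMMENDED"
    passed: bool


def meets_pass_threshold(results, should_fraction=0.95):
    """Apply the rule: every MUST check passes and at least
    `should_fraction` of the SHOULD checks pass.
    RECOMMENDED checks never block.
    """
    must = [r for r in results if r.level == "MUST"]
    should = [r for r in results if r.level == "SHOULD"]
    if not all(r.passed for r in must):
        return False
    if not should:
        return True  # nothing to measure the fraction against
    passed_should = sum(1 for r in should if r.passed)
    return passed_should / len(should) >= should_fraction


# Invented sample: 2 MUST checks pass, 39 of 40 SHOULD checks pass (97.5%),
# and one RECOMMENDED check fails without affecting the outcome.
sample_results = (
    [CheckResult("cf_compliance", "MUST", True),
     CheckResult("cmip6_cv", "MUST", True)]
    + [CheckResult(f"should_{i}", "SHOULD", i != 0) for i in range(40)]
    + [CheckResult("chunking_hint", "RECOMMENDED", False)]
)

dataset_passes = meets_pass_threshold(sample_results)
print(f"Dataset passes technical QC: {dataset_passes}")
```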

## 6. CMIP6 Version Management and Data Lifecycle

Implement CMIP6 version control and data lifecycle management procedures.
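
The version labels in this section use the `vYYYYMMDD` format. A minimal sketch of validating such labels and picking the most recent one with the standard library follows; `parse_version` is a hypothetical helper name, and the list of labels is made up for the example.

```python
import re
from datetime import date, datetime

VERSION_PATTERN = re.compile(r"v[0-9]{8}")


def parse_version(label):
    """Parse a 'vYYYYMMDD' label into a date.

    Raises ValueError if the label has the wrong shape or is not a
    real calendar date (for example month 13).
    """
    if VERSION_PATTERN.fullmatch(label) is None:
        raise ValueError(f"not a vYYYYMMDD label: {label!r}")
    return datetime.strptime(label[1:], "%Y%m%d").date()


# Made-up version labels for one dataset.
versions = ["v20240615", "v20230101", "v20241130"]
latest = max(versions, key=parse_version)

print(f"Latest version: {latest} ({parse_version(latest).isoformat()})")
```

Because the labels are fixed-width, plain string comparison would order them the same way; parsing adds the check that each label is a real date.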

In [None]:
# Create CMIP6 version management system
def create_cmip6_version_management():
    """Create CMIP6 version management and lifecycle procedures."""
    
    version_management = {
        "versioning_scheme": {
            "format": "vYYYYMMDD",
            "example": "v20240615",
            "description": "Version date indicates data creation/revision date",
            "increment_triggers": [
                "Data processing errors corrected",
                "Metadata errors corrected",
                "Model bugs affecting output fixed",
                "Post-processing improvements applied",
                "Quality control issues resolved"
            ]
        },
        "version_lifecycle": {
            "development": {
                "status": "under_development",
                "description": "Data processing and QC in progress",
                "access": "restricted_to_data_producers",
                "location": "dkrz-cmip6-processing",
                "duration_typical_days": 14
            },
            "validation": {
                "status": "validation_in_progress",
                "description": "Quality control and scientific validation",
                "access": "restricted_to_qc_team",
                "location": "dkrz-cmip6-qc",
                "duration_typical_days": 7,
                "validation_committee": "CMIP6 QC Working Group"
            },
            "pre_publication": {
                "status": "ready_for_publication",
                "description": "Validated and ready for ESGF publication",
                "access": "limited_preview_access",
                "location": "dkrz-cmip6-publication",
                "duration_typical_days": 3,
                "final_checks": ["metadata_complete", "drs_compliant", "pid_assigned"]
            },
            "published": {
                "status": "published_current",
                "description": "Publicly available via ESGF",
                "access": "public_global_access",
                "location": "dkrz-esgf-node",
                "duration_typical_years": 10,
                "features": ["doi_assigned", "citable", "searchable", "federated"]
            },
            "superseded": {
                "status": "superseded",
                "description": "Replaced by newer version",
                "access": "read_only_legacy_access", 
                "location": "dkrz-cmip6-archive",
                "duration_typical_years": 40,
                "migration_policy": "archive_after_1_year_superseded"
            },
            "retracted": {
                "status": "retracted",
                "description": "Removed due to serious issues",
                "access": "no_public_access",
                "location": "dkrz-cmip6-archive",
                "retention": "permanent_for_provenance",
                "notification_required": "global_esgf_notification"
            }
        },
        "change_management": {
            "version_increment_process": [
                "Issue identification and documentation",
                "Impact assessment and stakeholder notification",
                "Data reprocessing or correction",
                "Quality control validation",
                "Version number assignment",
                "Metadata update with change log",
                "Publication and notification"
            ],
            "rollback_procedures": [
                "Critical issue detection",
                "Emergency response team activation",
                "Previous version restoration",
                "Global notification to ESGF network",
                "User community notification",
                "Post-incident analysis and documentation"
            ],
            "approval_authority": "CMIP Data Management Committee",
            "notification_channels": ["esgf-announce", "cmip6-data-users", "institutional-contacts"]
        }
    }
    
    return version_management

# Create version management system
cmip6_versions = create_cmip6_version_management()
print("📋 CMIP6 Version Management System")
print("=" * 40)

# Display versioning scheme
versioning = cmip6_versions['versioning_scheme']
print(f"\n🔢 Versioning Scheme:")
print(f"  Format: {versioning['format']} (Example: {versioning['example']})")
print(f"  Description: {versioning['description']}")
print("  Version Increment Triggers:")
triggers = versioning['increment_triggers']
for trigger in triggers[:3]:
    print(f"    • {trigger}")
# Only mention the remainder when more than three triggers are defined
if len(triggers) > 3:
    print(f"    • ... and {len(triggers) - 3} more")

# Display lifecycle stages
lifecycle = cmip6_versions['version_lifecycle']
print(f"\n🔄 Data Version Lifecycle:")
for stage_name, stage_info in lifecycle.items():
    print(f"\n  {stage_name.upper().replace('_', ' ')}:")
    print(f"    Status: {stage_info['status']}")
    print(f"    Access: {stage_info['access']}")
    print(f"    Location: {stage_info['location']}")
    
    if 'duration_typical_days' in stage_info:
        print(f"    Duration: ~{stage_info['duration_typical_days']} days")
    elif 'duration_typical_years' in stage_info:
        print(f"    Duration: ~{stage_info['duration_typical_years']} years")
    
    if 'features' in stage_info:
        print(f"    Features: {', '.join(stage_info['features'])}")

# Display change management
change_mgmt = cmip6_versions['change_management']
print(f"\n⚙️  Change Management:")
print(f"  Approval Authority: {change_mgmt['approval_authority']}")
print(f"  Notification Channels: {', '.join(change_mgmt['notification_channels'])}")
print(f"  Version Process Steps: {len(change_mgmt['version_increment_process'])}")
print(f"  Rollback Procedures: {len(change_mgmt['rollback_procedures'])}")
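
The lifecycle stages above imply an ordering: data moves from processing through validation and pre-publication to publication, and can later be superseded or retracted. A minimal sketch of how such transitions could be enforced follows. The stage names are assumed to match the keys of `version_lifecycle`, and the transition table and the `can_transition` and `advance` helpers are hypothetical; they are not part of Tellus.

```python
# Sketch of lifecycle transition rules. The stage names are assumed to match
# the keys of `version_lifecycle`; the allowed moves are illustrative only.
ALLOWED_TRANSITIONS = {
    "processing": {"validation"},
    "validation": {"processing", "pre_publication"},
    "pre_publication": {"validation", "published"},
    "published": {"superseded", "retracted"},
    "superseded": {"retracted"},
    "retracted": set(),  # terminal: retracted data is never re-published
}


def can_transition(current: str, target: str) -> bool:
    """Return True if moving from `current` to `target` is allowed."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


def advance(current: str, target: str) -> str:
    """Return the new stage, or raise ValueError if the move is not allowed."""
    if not can_transition(current, target):
        raise ValueError(f"Illegal lifecycle transition: {current} -> {target}")
    return target


# Walk a dataset version from validation to publication
stage = "validation"
stage = advance(stage, "pre_publication")
stage = advance(stage, "published")
print(stage)  # published
```

Keeping the rules in a single table makes it easy to review them against the change management policy, and makes an attempt to re-publish a retracted version fail loudly.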

In [None]:
# Create version-specific archives for the sample simulation
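def create_cmip6_versioned_archives(simulation):
    """Create versioned archives following the CMIP6 lifecycle.

    The version history below is illustrative sample data; the
    `simulation` argument is currently unused.
    """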
    
    # Current version being processed
    current_version = "v20240615"
    
    archive_versions = [
        {
            "version": "v20240301",
            "status": "superseded",
            "location": "dkrz-cmip6-archive",
            "description": "Initial version - superseded due to metadata corrections",
            "issue": "Missing tracking_id and incorrect further_info_url",
            "users_affected": 127,
            "superseded_date": "2024-04-15",
            "access": "legacy_read_only"
        },
        {
            "version": "v20240415",
            "status": "superseded",
            "location": "dkrz-cmip6-archive",
            "description": "Corrected metadata - superseded due to data processing error",
            "issue": "Incorrect time coordinate encoding for daily data",
            "users_affected": 89,
            "superseded_date": "2024-06-10",
            "access": "legacy_read_only"
        },
        {
            "version": current_version,
            "status": "validation_in_progress",
            "location": "dkrz-cmip6-qc",
            "description": "Corrected time coordinates and enhanced quality control",
            "improvements": [
                "Fixed time coordinate encoding",
                "Enhanced CF compliance",
                "Improved compression efficiency",
                "Added missing variable attributes"
            ],
            "qc_progress": "75_percent_complete",
            "expected_publication": "2024-06-20"
        }
    ]
    
    return archive_versions, current_version

# Create versioned archives for sample simulation
archive_versions, current_version = create_cmip6_versioned_archives(sample_sim)

print(f"📦 CMIP6 Versioned Archives for {sample_sim.simulation_id}")
print("=" * 55)

for version_info in archive_versions:
    version = version_info['version']
    status = version_info['status']
    
    # Create archive DTO for each version
    archive_id = f"{sample_sim.simulation_id}-{version}"
    
    archive_dto = CreateArchiveDto(
        archive_id=archive_id,
        location_name=version_info['location'],
        archive_type="directory",
        simulation_id=sample_sim.simulation_id,
        version=version,
        description=version_info['description'],
        tags={
            "cmip6", "versioned", status, 
            sample_sim.attrs['source_id'],
            sample_sim.attrs['experiment_id'],
            f"version_{version}"
        }
    )
    
    # Create archive metadata
    archive_result = archive_service.create_archive_metadata(archive_dto)
    
    print(f"\n📋 VERSION {version}")
    print(f"   Status: {status.replace('_', ' ').title()}")
    print(f"   Location: {version_info['location']}")
    print(f"   Description: {version_info['description']}")
    
    if 'issue' in version_info:
        print(f"   Issue: {version_info['issue']}")
        print(f"   Users Affected: {version_info['users_affected']}")
        print(f"   Superseded: {version_info['superseded_date']}")
        print(f"   Access: {version_info['access']}")
    
    if 'improvements' in version_info:
        print("   Improvements:")
        # List every improvement so none is silently dropped
        for improvement in version_info['improvements']:
            print(f"     • {improvement}")
        print(f"   QC Progress: {version_info['qc_progress'].replace('_', ' ').title()}")
        print(f"   Expected Publication: {version_info['expected_publication']}")

# Derive the counts from the data instead of hardcoding them
superseded_count = sum(1 for v in archive_versions if v['status'] == 'superseded')
in_validation_count = sum(1 for v in archive_versions if v['status'] == 'validation_in_progress')

print("\n📊 Version Management Summary:")
print(f"  Current Version: {current_version}")
print(f"  Total Versions: {len(archive_versions)}")
print(f"  Active: {in_validation_count} (in validation)")
print(f"  Superseded: {superseded_count}")
print(f"  Total Users Affected by Changes: {sum(v.get('users_affected', 0) for v in archive_versions)}")
print("  Version Retention: All versions preserved for provenance")
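
The version labels used above follow the `vYYYYMMDD` pattern, so they can be validated and ordered by date. A minimal sketch is shown below; `parse_version`, `latest_version`, and `days_between` are hypothetical helper names introduced for illustration, not Tellus functions.

```python
from datetime import date, datetime


def parse_version(version: str) -> date:
    """Convert a 'vYYYYMMDD' label to a date, rejecting malformed labels."""
    if len(version) != 9 or not version.startswith("v") or not version[1:].isdigit():
        raise ValueError(f"Not a vYYYYMMDD version label: {version!r}")
    # strptime also rejects impossible dates such as v20240230
    return datetime.strptime(version[1:], "%Y%m%d").date()


def latest_version(versions):
    """Return the label with the most recent date."""
    return max(versions, key=parse_version)


def days_between(older: str, newer: str) -> int:
    """Number of days between two version labels."""
    return (parse_version(newer) - parse_version(older)).days


labels = ["v20240415", "v20240301", "v20240615"]
ordered = sorted(labels, key=parse_version)
print(ordered)                                   # ['v20240301', 'v20240415', 'v20240615']
print(latest_version(labels))                    # v20240615
print(days_between("v20240301", "v20240415"))    # 45
```

Plain string sorting gives the same order for well-formed labels; parsing adds a check that each label encodes a real calendar date before it is used in an archive identifier.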

## Summary

This notebook demonstrated comprehensive CMIP6 data management workflows using Tellus:

### Key Accomplishments:

1. **CMIP6 Infrastructure**: Complete processing pipeline from raw output to global distribution
2. **Simulation Catalog**: Multi-model, multi-experiment CMIP6 simulation registry
3. **Processing Pipeline**: 6-stage workflow following CMIP6 standards and best practices
4. **Quality Control**: Comprehensive validation procedures with automated and manual checks
5. **Version Management**: Complete lifecycle management with version control and change tracking
6. **ESGF Integration**: Seamless integration with global climate data distribution

### CMIP6-Specific Features:

- **Standards Compliance**: Full adherence to CMIP6 data standards, conventions, and protocols
- **Global Coordination**: Integration with international ESGF network for worldwide access
- **Quality Assurance**: Multi-level validation ensuring data meets scientific community standards
- **Provenance Tracking**: Complete audit trail from raw model output to published datasets
- **Version Control**: Sophisticated versioning with impact tracking and user notification
- **Long-term Preservation**: Permanent archival ensuring data availability for decades

### Processing Scale and Performance:

- **Data Volume**: 60 TB total across 24 simulations (2 models × 4 experiments × 3 ensemble members)
- **Processing Capacity**: 8,192 cores dedicated to CMIP6 workflows
- **Pipeline Throughput**: 25 hours per simulation, 4 parallel pipelines for high-priority data
- **Storage Hierarchy**: 6 tiers totaling over 7 PB capacity
- **Global Distribution**: Federated access through ESGF network
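
The scale figures above can be cross-checked with a few lines of arithmetic. This is a rough sketch under two assumptions that the summary does not state: data volume is spread evenly across simulations, and all simulations run through the 4 parallel pipelines (the summary mentions them only for high-priority data).

```python
import math

# Figures taken from the summary above
models, experiments, ensemble_members = 2, 4, 3
total_volume_tb = 60
hours_per_simulation = 25
parallel_pipelines = 4

simulations = models * experiments * ensemble_members

# Assumption: volume is evenly distributed across simulations
tb_per_simulation = total_volume_tb / simulations

# Assumption: every simulation is processed in batches of 4 pipelines
batches = math.ceil(simulations / parallel_pipelines)
wall_clock_hours = batches * hours_per_simulation

print(simulations)          # 24
print(tb_per_simulation)    # 2.5
print(wall_clock_hours)     # 150
```

Under these assumptions a full pass over the catalog takes about 150 hours of wall-clock time, or a little over six days.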

### Quality Assurance Metrics:

- **Total QC Checks**: 14 comprehensive validation procedures
- **Automation Level**: 79% of checks fully automated
- **Critical Procedures**: 2 (format compliance and metadata validation)
- **Version Management**: Complete lifecycle tracking with impact assessment

### Global Impact:

- **Scientific Community**: Data available to thousands of climate researchers worldwide
- **Policy Support**: High-quality datasets supporting IPCC assessments and policy decisions
- **International Coordination**: Seamless integration with global climate data infrastructure
- **Long-term Value**: Datasets preserved for multi-decadal climate research

### Next Steps:

- Expand to additional CMIP6 experiments and model contributions
- Implement machine learning-enhanced quality control
- Develop automated bias correction and downscaling workflows
- Integrate with emerging climate services and impact assessment tools
- Prepare infrastructure for CMIP7 requirements and next-generation models

This workflow shows how Tellus can manage CMIP6 data production end to end, from institutional processing through distribution via ESGF, while keeping quality control, provenance, and access policy explicit at every lifecycle stage.