# CSAO Dataset Structure Analysis: Sea Ice Freeboard & SLA Product (2010)

**Purpose:** Comprehensive structural inspection and validation of the CSAO (CryoSat-2 Southern Antarctic Ocean) yearly dataset for quality assurance and data integrity verification.

**Dataset:** `fb_sla_cs2_sam_2010_NOSIT.nc`  
**Data Year:** 2010  
**Source Directory:** `D:\phd\data\CSAO`

**Author:** Xinlong Liu  
**Created:** 2025-12-07  
**Last Updated:** 2025-12-07

---

## Document Conventions

| Symbol | Meaning |
|--------|---------|
| ✅ | Validation passed |
| ⚠️ | Warning - review required |
| ❌ | Critical issue detected |

---

In [1]:
"""
CSAO NetCDF Dataset Structure Inspector
=======================================
Enterprise-grade data inspection module following:
- Google Python Style Guide (https://google.github.io/styleguide/pyguide.html)
- Amazon's Operational Excellence Principles

This module provides comprehensive analysis of NetCDF file structure including:
- Complete metadata extraction and validation
- Variable inspection with dimensions and attributes
- Data type and shape analysis
- Memory footprint estimation
- Statistical summary for data quality assurance

Dependencies:
    - xarray >= 2023.0.0
    - numpy >= 1.24.0
    - netCDF4 >= 1.6.0
"""

from __future__ import annotations

import logging
import sys
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, Optional

import numpy as np
import xarray as xr

# =============================================================================
# LOGGING CONFIGURATION
# Following Google's SRE best practices for observability
# =============================================================================
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)-8s | %(name)s | %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
logger = logging.getLogger(__name__)

# =============================================================================
# CONFIGURATION CONSTANTS
# Following Amazon's principle of externalized configuration for maintainability
# =============================================================================
DATA_DIR: Path = Path(r"D:\phd\data\CSAO")
FILENAME: str = "fb_sla_cs2_sam_2010_NOSIT.nc"
FILEPATH: Path = DATA_DIR / FILENAME
DATA_YEAR: int = 2010

# Inspection metadata
INSPECTION_TIMESTAMP: str = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

logger.info(f"Inspection initiated at: {INSPECTION_TIMESTAMP}")
logger.info(f"Python version: {sys.version}")
logger.info(f"NumPy version: {np.__version__}")
logger.info(f"xarray version: {xr.__version__}")

2025-12-07 17:47:56 | INFO     | __main__ | Inspection initiated at: 2025-12-07 17:47:56
2025-12-07 17:47:56 | INFO     | __main__ | Python version: 3.10.8 (tags/v3.10.8:aaaf517, Oct 11 2022, 16:50:30) [MSC v.1933 64 bit (AMD64)]
2025-12-07 17:47:56 | INFO     | __main__ | NumPy version: 2.1.2
2025-12-07 17:47:56 | INFO     | __main__ | xarray version: 2024.10.0


## 1. File Validation and Metadata Extraction

Pre-flight checks to ensure data file integrity before processing. Following Amazon's "Fail Fast" principle.

In [2]:
def validate_file_path(filepath: Path) -> Dict[str, Any]:
    """
    Validate file existence and extract basic file metadata.
    
    Args:
        filepath: Path object pointing to the NetCDF file.
        
    Returns:
        Dictionary containing file metadata including size, timestamps, and validation status.
        
    Raises:
        FileNotFoundError: If the specified file does not exist.
        
    Example:
        >>> metadata = validate_file_path(Path("data/sample.nc"))
        >>> print(metadata["size_human"])
        125.50 MB
    """
    if not filepath.exists():
        logger.error(f"File not found: {filepath}")
        raise FileNotFoundError(
            f"Dataset not found at: {filepath}\n"
            f"Please verify:\n"
            f"  1. The directory exists: {filepath.parent}\n"
            f"  2. The filename is correct: {filepath.name}"
        )
    
    file_stat = filepath.stat()
    size_bytes = file_stat.st_size
    
    # Calculate human-readable file size
    if size_bytes >= 1024**3:
        size_str = f"{size_bytes / 1024**3:.2f} GB"
    elif size_bytes >= 1024**2:
        size_str = f"{size_bytes / 1024**2:.2f} MB"
    elif size_bytes >= 1024:
        size_str = f"{size_bytes / 1024:.2f} KB"
    else:
        size_str = f"{size_bytes} bytes"
    
    metadata = {
        "filepath": str(filepath),
        "filename": filepath.name,
        "directory": str(filepath.parent),
        "size_bytes": size_bytes,
        "size_human": size_str,
        "modified_time": datetime.fromtimestamp(file_stat.st_mtime).strftime("%Y-%m-%d %H:%M:%S"),
        "validation_status": "✅ PASSED"
    }
    
    logger.info(f"File validation: {metadata['validation_status']}")
    return metadata


# Execute file validation
file_metadata = validate_file_path(FILEPATH)

print("=" * 80)
print("FILE VALIDATION REPORT")
print("=" * 80)
print(f"  {'Status':<20}: {file_metadata['validation_status']}")
print(f"  {'Filename':<20}: {file_metadata['filename']}")
print(f"  {'Directory':<20}: {file_metadata['directory']}")
print(f"  {'File Size':<20}: {file_metadata['size_human']}")
print(f"  {'Last Modified':<20}: {file_metadata['modified_time']}")
print("=" * 80)

2025-12-07 17:49:20 | INFO     | __main__ | File validation: ✅ PASSED


FILE VALIDATION REPORT
  Status              : ✅ PASSED
  Filename            : fb_sla_cs2_sam_2010_NOSIT.nc
  Directory           : D:\phd\data\CSAO
  File Size           : 85.26 MB
  Last Modified       : 2023-08-23 22:58:52
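
The size-formatting branch in the validation cell reappears later in the notebook for per-variable memory. A minimal sketch of how it could be factored into one helper follows; `format_bytes` is a hypothetical name, not part of the original notebook, and it keeps the same 1024-based units and the same cap at GB.

```python
def format_bytes(n_bytes: int) -> str:
    """Return a human-readable size using binary (1024-based) units."""
    units = ("bytes", "KB", "MB", "GB")
    size = float(n_bytes)
    idx = 0
    # Step up one unit at a time, stopping at GB like the original branches.
    while size >= 1024 and idx < len(units) - 1:
        size /= 1024
        idx += 1
    # Whole bytes are printed without decimals, larger units with two.
    return f"{n_bytes} bytes" if idx == 0 else f"{size:.2f} {units[idx]}"


print(format_bytes(48))            # 48 bytes
print(format_bytes(5 * 1024**2))   # 5.00 MB
```

Calling the helper from both the file check and the variable inspection would keep the two reports consistent if the unit thresholds ever change.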


## 2. Load Dataset and High-Level Overview

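The dataset is opened with `xarray`, which loads lazily: `open_dataset` reads the metadata up front and defers reading array values until they are accessed. This keeps memory use low for large NetCDF files, in line with Google's principle of resource efficiency and Amazon's cost optimization pillar.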

In [3]:
def load_netcdf_dataset(filepath: Path, engine: str = "netcdf4") -> xr.Dataset:
    """
    Load NetCDF dataset with optimal configuration for inspection.
    
    Uses lazy loading to minimize memory footprint during initial inspection.
    Wraps load failures in descriptive exceptions; no retry is attempted.
    
    Args:
        filepath: Path to the NetCDF file.
        engine: NetCDF engine to use. Defaults to 'netcdf4'.
        
    Returns:
        xr.Dataset: Loaded dataset with lazy evaluation enabled.
        
    Raises:
        IOError: If file cannot be read or is corrupted.
        ValueError: If file format is not recognized.
        
    Note:
        For large files (>1GB), consider using chunks parameter for
        out-of-core computation with Dask.
    """
    logger.info(f"Loading dataset: {filepath.name}")
    
    try:
        ds = xr.open_dataset(filepath, engine=engine)
        logger.info(f"Dataset loaded successfully")
        logger.info(f"  - Dimensions: {len(ds.dims)}")
        logger.info(f"  - Data variables: {len(ds.data_vars)}")
        logger.info(f"  - Coordinates: {len(ds.coords)}")
        return ds
        
    except ValueError as ve:
        logger.error(f"Invalid file format: {ve}")
        raise ValueError(f"File format not recognized: {filepath}") from ve
    except Exception as e:
        logger.error(f"Failed to load dataset: {e}")
        raise IOError(f"Cannot read NetCDF file: {filepath}") from e


# Load the CSAO dataset
ds = load_netcdf_dataset(FILEPATH)

# Display xarray's built-in representation
print("=" * 80)
print("DATASET OVERVIEW (xarray native representation)")
print("=" * 80)
display(ds)

2025-12-07 17:50:08 | INFO     | __main__ | Loading dataset: fb_sla_cs2_sam_2010_NOSIT.nc
2025-12-07 17:50:10 | INFO     | __main__ | Dataset loaded successfully
2025-12-07 17:50:10 | INFO     | __main__ |   - Dimensions: 4
2025-12-07 17:50:10 | INFO     | __main__ |   - Data variables: 15
2025-12-07 17:50:10 | INFO     | __main__ |   - Coordinates: 5


DATASET OVERVIEW (xarray native representation)


## 3. Global Attributes Analysis

Global attributes contain essential metadata about data provenance, processing pipeline, conventions, and scientific context. These are critical for data reproducibility and FAIR principles compliance.

In [4]:
def inspect_global_attributes(ds: xr.Dataset) -> Dict[str, Any]:
    """
    Extract and display all global attributes with formatted output.
    
    Global attributes typically include:
    - Data provenance and source information
    - Processing history and methodology
    - CF/ACDD convention compliance markers
    - Contact and citation information
    
    Args:
        ds: xarray Dataset to inspect.
        
    Returns:
        Dictionary of global attributes for programmatic access.
    """
    print("=" * 80)
    print("GLOBAL ATTRIBUTES")
    print("=" * 80)
    
    if not ds.attrs:
        print("  ⚠️ No global attributes found.")
        logger.warning("Dataset has no global attributes - may affect data provenance tracking")
        return {}
    
    print(f"  Total attributes: {len(ds.attrs)}\n")
    print(f"  {'Attribute Name':<35} {'Value'}")
    print("  " + "-" * 76)
    
    for key, value in ds.attrs.items():
        value_str = str(value)
        # Truncate long values for display, preserving full value in return dict
        if len(value_str) > 100:
            display_value = value_str[:97] + "..."
        else:
            display_value = value_str
        print(f"  {key:<35} {display_value}")
    
    print("=" * 80)
    return dict(ds.attrs)


# Inspect global attributes
global_attrs = inspect_global_attributes(ds)

GLOBAL ATTRIBUTES
  Total attributes: 55

  Attribute Name                      Value
  ----------------------------------------------------------------------------
  projection                          laea
  lat_ts                              0.0
  lon_0                               0
  lat_0                               -90
  resolution                          c
  pixel_size                          12500
  width                               8900000
  height                              8900000
  nb_pixels                           712
  nb_pixels_x                         712
  nb_pixels_y                         712
  lat_min                             -90
  lat_max                             -49.9532037437005
  x_min_grid                          6250.000000000001
  x_max_grid                          8893750.000000002
  y_min_grid                          6250.000000000001
  y_max_grid                          8893750.000000002
  pixel_size_x                        12500.
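
The grid attributes listed above can be cross-checked against each other with plain arithmetic. The sketch below copies the values from the output and assumes that `x_min_grid` and `x_max_grid` are cell-centre coordinates rather than cell edges; that interpretation is not stated in the file.

```python
import math

# Values copied from the global attributes listed above.
pixel_size = 12500.0               # metres per cell
nb_pixels = 712                    # cells per side
width = 8_900_000.0                # metres
x_min_grid = 6250.000000000001     # first coordinate along x
x_max_grid = 8893750.000000002     # last coordinate along x

# The grid extent should equal cell count times cell size.
assert nb_pixels * pixel_size == width

# Assumption: the min/max coordinates are cell centres, so they sit half a
# cell inside each edge of the grid.
assert math.isclose(x_min_grid, pixel_size / 2)
assert math.isclose(x_max_grid, width - pixel_size / 2)

# First and last centre are separated by nb_pixels - 1 steps.
n_steps = round((x_max_grid - x_min_grid) / pixel_size)
print(n_steps)  # 711
```

All three checks hold for the values shown, which suggests the attributes describe one consistent 712 x 712 grid with 12.5 km cells.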

## 4. Dimensions Structure Analysis

Understanding the dimensional structure is fundamental for data manipulation, subsetting, and analysis operations. Dimensions define the axes along which data is organized.

In [5]:
def inspect_dimensions(ds: xr.Dataset) -> Dict[str, int]:
    """
    Analyze and display dataset dimensions with detailed statistics.
    
    Args:
        ds: xarray Dataset to inspect.
        
    Returns:
        Dictionary mapping dimension names to their sizes.
    """
    print("=" * 80)
    print("DIMENSIONS ANALYSIS")
    print("=" * 80)
    
    if not ds.dims:
        print("  ⚠️ No dimensions found in dataset.")
        return {}
    
    total_cells = 1
    print(f"\n  {'Dimension Name':<30} {'Size':>15} {'Description'}")
    print("  " + "-" * 70)
    
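    # Dataset.sizes is the name-to-length mapping; mapping-style access on
    # Dataset.dims is deprecated in recent xarray releases.
    for dim_name, dim_size in ds.sizes.items():
        total_cells *= dim_size
        # Infer dimension type from naming conventions. Single-letter names
        # must match exactly; as substrings, 't' would also hit 'lat'.
        name = str(dim_name).lower()
        if name == "t" or any(k in name for k in ("time", "date")):
            dim_type = "Temporal"
        elif name == "y" or any(k in name for k in ("lat", "row")):
            dim_type = "Spatial (Y-axis)"
        elif name == "x" or any(k in name for k in ("lon", "col")):
            dim_type = "Spatial (X-axis)"
        else:
            dim_type = "Other"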
        
        print(f"  {dim_name:<30} {dim_size:>15,} {dim_type}")
    
    print("  " + "-" * 70)
    print(f"  {'Total dimension count:':<30} {len(ds.dims):>15}")
    print(f"  {'Total grid cells:':<30} {total_cells:>15,}")
    print("=" * 80)
    
    return dict(ds.sizes)


# Inspect dimensions
dimensions = inspect_dimensions(ds)

DIMENSIONS ANALYSIS

  Dimension Name                            Size Description
  ----------------------------------------------------------------------
  y                                          712 Spatial (Y-axis)
  x                                          712 Spatial (X-axis)
  time                                         3 Temporal
  dim_time_bnds                                2 Temporal
  ----------------------------------------------------------------------
  Total dimension count:                       4
  Total grid cells:                    3,041,664
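
The "Total grid cells" figure is the product of every dimension, including `dim_time_bnds`, which only indexes the two time bounds. The sketch below separates the counts; it assumes the gridded fields are laid out on `(time, y, x)`, which the output above does not confirm.

```python
import math

# Dimension sizes as reported in the DIMENSIONS ANALYSIS output.
sizes = {"y": 712, "x": 712, "time": 3, "dim_time_bnds": 2}

# Product over every dimension, which is what the report prints.
all_dims_product = math.prod(sizes.values())

# Assumption: data fields use (time, y, x); the bounds axis belongs only to
# time_bnds and does not multiply the grid.
spatial_cells = sizes["y"] * sizes["x"]
field_cells = spatial_cells * sizes["time"]

print(f"{all_dims_product:,}")  # 3,041,664
print(f"{spatial_cells:,}")     # 506,944
print(f"{field_cells:,}")       # 1,520,832
```

Under that assumption, a single gridded variable holds 1,520,832 values, half of the reported total.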




## 5. Coordinates Inspection

Coordinates provide the reference system for locating data within the dimensional space. This includes spatial coordinates (lat/lon or projected), temporal coordinates, and any auxiliary coordinates.

In [6]:
def inspect_coordinates(ds: xr.Dataset) -> None:
    """
    Comprehensive inspection of coordinate variables.
    
    Analyzes:
    - Data type and shape
    - Value range (min/max)
    - Associated attributes (units, standard_name, etc.)
    - Potential data quality issues
    
    Args:
        ds: xarray Dataset to inspect.
    """
    print("=" * 80)
    print("COORDINATES INSPECTION")
    print("=" * 80)
    
    if not ds.coords:
        print("  ⚠️ No coordinate variables found.")
        return
    
    print(f"\n  Total coordinates: {len(ds.coords)}\n")
    
    for coord_name, coord_data in ds.coords.items():
        print(f"  {'─' * 70}")
        print(f"  📍 Coordinate: {coord_name}")
        print(f"  {'─' * 70}")
        print(f"     Dtype      : {coord_data.dtype}")
        print(f"     Shape      : {coord_data.shape}")
        print(f"     Dimensions : {coord_data.dims}")
        
        # Calculate range for numeric types
        if np.issubdtype(coord_data.dtype, np.number):
            try:
                min_val = float(coord_data.min().values)
                max_val = float(coord_data.max().values)
                print(f"     Range      : [{min_val:.6g}, {max_val:.6g}]")
                
                # Check for NaN values
                nan_count = int(np.isnan(coord_data.values).sum())
                if nan_count > 0:
                    print(f"     ⚠️ NaN Count : {nan_count}")
            except Exception:
                print(f"     Range      : Unable to compute")
        else:
            # For non-numeric (e.g., datetime)
            try:
                print(f"     First      : {coord_data.values[0]}")
                print(f"     Last       : {coord_data.values[-1]}")
            except Exception:
                pass
        
        # Display attributes
        if coord_data.attrs:
            print(f"     Attributes :")
            for attr_key, attr_val in coord_data.attrs.items():
                attr_str = str(attr_val)[:50]
                print(f"       • {attr_key}: {attr_str}")
        else:
            print(f"     Attributes : None")
    
    print("=" * 80)


# Inspect coordinates
inspect_coordinates(ds)

COORDINATES INSPECTION

  Total coordinates: 5

  ──────────────────────────────────────────────────────────────────────
  📍 Coordinate: lat
  ──────────────────────────────────────────────────────────────────────
     Dtype      : float32
     Shape      : (712, 712)
     Dimensions : ('y', 'x')
     Range      : [-89.9205, -30.8971]
     Attributes :
       • units: degrees_north
       • long_name: latitude coordinate
       • standard_name: latitude
       • ctoh_edit_date: 2023-04-01 13:07
  ──────────────────────────────────────────────────────────────────────
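
The reported latitude range extends north of the `lat_max` global attribute (about -49.95), which is what one would expect if the corners of the square grid lie further from the pole than its edges. The sketch below checks the range against the grid geometry. It rests on three assumptions that the file does not state: a spherical Earth of radius 6371 km, the South Pole at the exact centre of the grid, and the standard spherical Lambert azimuthal equal-area relation rho = 2R sin(c/2).

```python
import math

EARTH_RADIUS_M = 6_371_000.0   # assumption: spherical Earth, not stated in the file
PIXEL_SIZE_M = 12_500.0        # from the global attributes
NB_PIXELS = 712                # from the global attributes


def laea_south_pole_latitude(distance_m: float) -> float:
    """Latitude in degrees at a projected distance from the South Pole.

    Uses the spherical polar LAEA relation rho = 2 R sin(c / 2), where c is
    the angular distance from the projection centre.
    """
    c = 2.0 * math.asin(distance_m / (2.0 * EARTH_RADIUS_M))
    return -90.0 + math.degrees(c)


# Assumption: the pole is at the grid centre, so with an even cell count the
# four nearest cell centres are half a pixel away along both axes.
half_pixel = PIXEL_SIZE_M / 2.0
nearest = math.hypot(half_pixel, half_pixel)

# Corner cell centres are (NB_PIXELS - 1) / 2 pixels from the centre per axis.
corner_offset = (NB_PIXELS - 1) / 2.0 * PIXEL_SIZE_M
farthest = math.hypot(corner_offset, corner_offset)

lat_min = laea_south_pole_latitude(nearest)
lat_max = laea_south_pole_latitude(farthest)
print(f"{lat_min:.4f}, {lat_max:.4f}")
```

Under these assumptions the computed bounds agree with the reported `[-89.9205, -30.8971]` to within about a thousandth of a degree, so the coordinate array looks consistent with the projection attributes.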

## 6. Data Variables Comprehensive Analysis

Detailed inspection of all data variables including their dimensions, data types, shapes, memory footprint, and associated attributes. This is the core content of the dataset.

In [7]:
def inspect_data_variables(ds: xr.Dataset) -> Dict[str, Dict[str, Any]]:
    """
    Comprehensive inspection of all data variables in the dataset.
    
    For each variable, extracts:
    - Dimensional information
    - Data type and shape
    - Memory footprint
    - All CF-compliant attributes
    - Sample values for validation
    
    Args:
        ds: xarray Dataset to inspect.
        
    Returns:
        Dictionary containing detailed metadata for each variable.
    """
    print("=" * 80)
    print("DATA VARIABLES ANALYSIS")
    print("=" * 80)
    
    if not ds.data_vars:
        print("  ⚠️ No data variables found in dataset.")
        return {}
    
    print(f"\n  Total data variables: {len(ds.data_vars)}")
    
    variables_info = {}
    total_memory = 0
    
    for var_name, var_data in ds.data_vars.items():
        print(f"\n  {'━' * 70}")
        print(f"  📊 Variable: {var_name}")
        print(f"  {'━' * 70}")
        
        # Basic properties
        print(f"     Dimensions  : {var_data.dims}")
        print(f"     Shape       : {var_data.shape}")
        print(f"     Dtype       : {var_data.dtype}")
        print(f"     Size        : {var_data.size:,} elements")
        
        # Memory footprint
        memory_bytes = var_data.nbytes
        total_memory += memory_bytes
        
        if memory_bytes >= 1024**3:
            memory_str = f"{memory_bytes / 1024**3:.2f} GB"
        elif memory_bytes >= 1024**2:
            memory_str = f"{memory_bytes / 1024**2:.2f} MB"
        elif memory_bytes >= 1024:
            memory_str = f"{memory_bytes / 1024:.2f} KB"
        else:
            memory_str = f"{memory_bytes} bytes"
        print(f"     Memory      : {memory_str}")
        
        # Attributes (critical for CF compliance)
        if var_data.attrs:
            print(f"     Attributes  :")
            for attr_key, attr_val in var_data.attrs.items():
                attr_str = str(attr_val)
                if len(attr_str) > 55:
                    attr_str = attr_str[:52] + "..."
                print(f"       • {attr_key:<20}: {attr_str}")
        else:
            print(f"     Attributes  : None")
        
        # Store for return
        variables_info[var_name] = {
            "dimensions": var_data.dims,
            "shape": var_data.shape,
            "dtype": str(var_data.dtype),
            "size": var_data.size,
            "memory_bytes": memory_bytes,
            "attributes": dict(var_data.attrs) if var_data.attrs else {}
        }
    
    # Total memory summary
    if total_memory >= 1024**3:
        total_memory_str = f"{total_memory / 1024**3:.2f} GB"
    elif total_memory >= 1024**2:
        total_memory_str = f"{total_memory / 1024**2:.2f} MB"
    else:
        total_memory_str = f"{total_memory / 1024:.2f} KB"
    
    print(f"\n  {'━' * 70}")
    print(f"  💾 Total Memory Footprint: {total_memory_str}")
    print("=" * 80)
    
    return variables_info


# Inspect data variables
variables_metadata = inspect_data_variables(ds)

DATA VARIABLES ANALYSIS

  Total data variables: 15

  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  📊 Variable: time_bnds
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
     Dimensions  : ('time', 'dim_time_bnds')
     Shape       : (3, 2)
     Dtype       : datetime64[ns]
     Size        : 6 elements
     Memory      : 48 bytes
     Attributes  :
       • long_name           : time_bnds
       • ctoh_edit_date      : 2023-04-01 13:07

  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
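
The memory figure reported for each variable is the element count multiplied by the bytes per element. The sketch below reproduces the 48 bytes shown for `time_bnds` and applies the same rule to a hypothetical float32 field on `(time, y, x)`; the dtype and layout of the gridded variables are assumptions, since the output above is truncated before they appear.

```python
import math


def estimate_nbytes(shape: tuple[int, ...], itemsize: int) -> int:
    """Uncompressed in-memory size: number of elements times bytes per element."""
    return math.prod(shape) * itemsize


# time_bnds: shape (3, 2); datetime64[ns] uses 8 bytes per element.
time_bnds_bytes = estimate_nbytes((3, 2), 8)

# Hypothetical float32 field on (time, y, x), 4 bytes per element.
field_bytes = estimate_nbytes((3, 712, 712), 4)

print(time_bnds_bytes)                    # 48
print(f"{field_bytes / 1024**2:.2f} MB")  # 5.80 MB
```

These are uncompressed sizes, so their sum can differ substantially from the on-disk file size when the NetCDF variables are stored compressed.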

## 7. Statistical Summary for Data Quality Assurance

Quick statistical overview to validate data integrity, identify potential outliers, and detect data quality issues such as excessive missing values or unexpected value ranges.

In [8]:
def compute_statistical_summary(ds: xr.Dataset) -> None:
    """
    Compute and display statistical summary for all numeric variables.
    
    Statistics include:
    - Minimum, Maximum, Mean, Standard Deviation
    - Missing value percentage (NaN)
    - Data quality flags based on thresholds
    
    Args:
        ds: xarray Dataset to analyze.
        
    Note:
        Variables with >50% missing values are flagged for review.
    """
    print("=" * 80)
    print("STATISTICAL SUMMARY (Numeric Variables)")
    print("=" * 80)
    
    # Header
    print(f"\n  {'Variable':<25} {'Min':>12} {'Max':>12} {'Mean':>12} {'Std':>12} {'NaN %':>10}")
    print("  " + "-" * 85)
    
    quality_warnings = []
    
    for var_name, var_data in ds.data_vars.items():
        if np.issubdtype(var_data.dtype, np.number):
            try:
                # Compute statistics
                min_val = float(np.nanmin(var_data.values))
                max_val = float(np.nanmax(var_data.values))
                mean_val = float(np.nanmean(var_data.values))
                std_val = float(np.nanstd(var_data.values))
                nan_count = int(np.isnan(var_data.values).sum())
                nan_pct = nan_count / var_data.size * 100
                
                # Format output
                print(f"  {var_name:<25} {min_val:>12.4g} {max_val:>12.4g} "
                      f"{mean_val:>12.4g} {std_val:>12.4g} {nan_pct:>9.2f}%")
                
                # Quality checks
                if nan_pct > 50:
                    quality_warnings.append(f"⚠️ {var_name}: {nan_pct:.1f}% missing values")
                    
            except Exception as e:
                print(f"  {var_name:<25} {'Error computing statistics':<60}")
                logger.warning(f"Statistics computation failed for {var_name}: {e}")
        else:
            print(f"  {var_name:<25} {'(non-numeric dtype)':<60}")
    
    print("  " + "-" * 85)
    
    # Display quality warnings
    if quality_warnings:
        print("\n  DATA QUALITY WARNINGS:")
        for warning in quality_warnings:
            print(f"    {warning}")
    else:
        print("\n  ✅ No data quality issues detected.")
    
    print("=" * 80)


# Compute statistical summary
compute_statistical_summary(ds)

STATISTICAL SUMMARY (Numeric Variables)

  Variable                           Min          Max         Mean          Std      NaN %
  -------------------------------------------------------------------------------------
  time_bnds                 (non-numeric dtype)                                         
  SAR_mode                             0            1       0.7999        0.392     83.50%
  dist_to_closest_lead               133    2.279e+06    2.128e+05    2.893e+05     70.71%
  dist_to_closest_open_ocean            0    2.602e+06    3.041e+05    3.993e+05     75.63%
  floes_density                        0            1       0.5508       0.3855     77.23%
  floes_valid_density                  0            1       0.1711        0.181     77.23%
  radar_freeboard_median          -3.166        4.435     0.004398       0.1908     83.50%
  radar_freeboard_mean            -2.984        3.969    -0.004079       0.1842     83.50%
  radar_freeboard_std                  0         2.27
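
Every variable in the table exceeds the 50% missing-value threshold, so a single overall percentage says little about where the gaps are. One way to look closer is to break the NaN share down per time step, as in the sketch below. It uses a small synthetic array, not the CSAO data, and `nan_percentage` is a hypothetical helper.

```python
import numpy as np


def nan_percentage(values: np.ndarray) -> float:
    """Share of NaN elements in an array, as a percentage."""
    return float(np.isnan(values).mean() * 100.0)


# Synthetic (time, y, x) field: three 4 x 4 time steps, all NaN to start.
field = np.full((3, 4, 4), np.nan, dtype=np.float32)
field[0, :2, :] = 0.1   # first step: 8 of 16 cells valid
field[1, :1, :] = 0.2   # second step: 4 of 16 cells valid
# third step stays entirely NaN

overall = nan_percentage(field)
per_step = [nan_percentage(field[t]) for t in range(field.shape[0])]

print(overall)    # 75.0
print(per_step)   # [50.0, 75.0, 100.0]
```

A fully empty time step and a uniformly sparse field give the same overall figure here, which is why the per-step view is useful before deciding whether a warning needs action.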

## 8. Consolidated Summary Report

A structured summary for documentation, reporting, and downstream pipeline integration.

In [9]:
def generate_inspection_report(
    ds: xr.Dataset,
    filepath: Path,
    file_metadata: Dict[str, Any]
) -> Dict[str, Any]:
    """
    Generate comprehensive inspection report as structured dictionary.
    
    This report can be:
    - Serialized to JSON for pipeline integration
    - Used for automated quality checks
    - Archived for data provenance documentation
    
    Args:
        ds: xarray Dataset that was inspected.
        filepath: Path to the source file.
        file_metadata: Pre-computed file metadata.
        
    Returns:
        Structured dictionary containing complete inspection results.
    """
    report = {
        "inspection_metadata": {
            "timestamp": INSPECTION_TIMESTAMP,
            "filepath": str(filepath),
            "filename": filepath.name,
            "data_year": DATA_YEAR,
        },
        "file_info": file_metadata,
        "dataset_summary": {
            "n_dimensions": len(ds.dims),
            "n_coordinates": len(ds.coords),
            "n_data_variables": len(ds.data_vars),
            "n_global_attributes": len(ds.attrs),
        },
        # Dataset.sizes is the name-to-length mapping; dict(ds.dims) is deprecated.
        "dimensions": dict(ds.sizes),
        "coordinates": list(ds.coords.keys()),
        "data_variables": list(ds.data_vars.keys()),
        "global_attributes": dict(ds.attrs) if ds.attrs else {},
    }
    
    # Display formatted summary
    print("=" * 80)
    print("INSPECTION SUMMARY REPORT")
    print("=" * 80)
    print(f"""
    Dataset: {report['inspection_metadata']['filename']}
    Year: {report['inspection_metadata']['data_year']}
    Inspection Time: {report['inspection_metadata']['timestamp']}
    
    ┌─────────────────────────────────────────────────────────┐
    │  DATASET METRICS                                        │
    ├─────────────────────────────────────────────────────────┤
    │  File Size         : {file_metadata['size_human']:<35}│
    │  Dimensions        : {report['dataset_summary']['n_dimensions']:<35}│
    │  Coordinates       : {report['dataset_summary']['n_coordinates']:<35}│
    │  Data Variables    : {report['dataset_summary']['n_data_variables']:<35}│
    │  Global Attributes : {report['dataset_summary']['n_global_attributes']:<35}│
    └─────────────────────────────────────────────────────────┘
    """)
    
    print("  Dimensions:", list(ds.sizes))
    print("  Coordinates:", list(ds.coords.keys()))
    print("  Data Variables:", list(ds.data_vars.keys()))
    print("=" * 80)
    
    return report


# Generate final report
inspection_report = generate_inspection_report(ds, FILEPATH, file_metadata)

INSPECTION SUMMARY REPORT

    Dataset: fb_sla_cs2_sam_2010_NOSIT.nc
    Year: 2010
    Inspection Time: 2025-12-07 17:47:56
    
    ┌─────────────────────────────────────────────────────────┐
    │  DATASET METRICS                                        │
    ├─────────────────────────────────────────────────────────┤
    │  File Size         : 85.26 MB                           │
    │  Dimensions        : 4                                  │
    │  Coordinates       : 5                                  │
    │  Data Variables    : 15                                 │
    │  Global Attributes : 55                                 │
    └─────────────────────────────────────────────────────────┘
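
The report is described as serializable to JSON, but a dictionary built from file paths, timestamps, and NetCDF attributes can hold values that `json.dumps` rejects by default. The sketch below shows one way to handle that with a `default` hook. `to_jsonable` and `sample_report` are hypothetical and only stand in for the real report.

```python
import json
from datetime import datetime
from pathlib import Path
from typing import Any


def to_jsonable(obj: Any) -> Any:
    """Fallback for json.dumps: convert common non-JSON types to plain values."""
    if isinstance(obj, Path):
        return str(obj)
    if isinstance(obj, datetime):
        return obj.isoformat()
    # NumPy scalars and arrays expose .tolist(), which returns built-in values.
    if hasattr(obj, "tolist"):
        return obj.tolist()
    return str(obj)


# Hypothetical miniature report; the real one comes from generate_inspection_report.
sample_report = {
    "filepath": Path("data") / "example.nc",
    "created": datetime(2025, 12, 7, 17, 47, 56),
    "dimensions": {"y": 712, "x": 712, "time": 3},
}

encoded = json.dumps(sample_report, default=to_jsonable, indent=2)
decoded = json.loads(encoded)
print(decoded["created"])  # 2025-12-07T17:47:56
```

The hook is only called for objects the encoder cannot handle itself, so plain strings, numbers, lists, and dictionaries pass through unchanged.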


## 9. Cleanup and Resource Management

Properly close dataset handles to release file locks and free system resources. Following Amazon's principle of operational excellence.

In [12]:
# Close dataset to release file handles
ds.close()
logger.info("Dataset closed successfully. Inspection complete.")

print("=" * 80)
print("✅ INSPECTION COMPLETE")
print("=" * 80)
print(f"  Dataset: {FILENAME}")
print(f"  Status: All inspection routines executed successfully")
print(f"  Timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("=" * 80)

2025-12-07 18:27:43 | INFO     | __main__ | Dataset closed successfully. Inspection complete.


✅ INSPECTION COMPLETE
  Dataset: fb_sla_cs2_sam_2010_NOSIT.nc
  Status: All inspection routines executed successfully
  Timestamp: 2025-12-07 18:27:43


## 11. Multi-Year Variable Extraction Pipeline

**Purpose:** Extract validated variables from all 11 CSAO datasets (2010-2020) and prepare for consolidation.

**Extraction Strategy:**
- Use the variable mapping results from validation phase
- Apply consistent naming conventions (canonical names)
- Preserve original attributes and metadata
- Handle dimensional alignment across years

**Design Principles:**
- **Idempotent Operations**: Re-runnable without side effects
- **Memory Efficiency**: Process one dataset at a time
- **Fault Tolerance**: Continue extraction even if individual files fail
- **Data Lineage**: Track source file for each extracted record
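
The fault-tolerance and lineage principles above can be illustrated with a small driver loop. This is a sketch under stated assumptions, not the notebook's pipeline: `YearOutcome`, `run_all_years`, and `fake_processor` are hypothetical names, and the filename template simply generalizes the 2010 filename, which may not hold for other years.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List, Optional


@dataclass
class YearOutcome:
    """Hypothetical per-year record: source file, success flag, and error text."""
    year: int
    source_file: str
    success: bool
    error_message: Optional[str] = None


def run_all_years(
    years: range,
    data_dir: Path,
    process_one: Callable[[Path], None],
    filename_template: str = "fb_sla_cs2_sam_{year}_NOSIT.nc",  # assumed pattern
) -> List[YearOutcome]:
    """Process each year independently so one bad file does not stop the run."""
    outcomes: List[YearOutcome] = []
    for year in years:
        path = data_dir / filename_template.format(year=year)
        try:
            process_one(path)
            outcomes.append(YearOutcome(year, path.name, True))
        except Exception as exc:  # broad on purpose: record the failure and continue
            outcomes.append(YearOutcome(year, path.name, False, str(exc)))
    return outcomes


def fake_processor(path: Path) -> None:
    """Stand-in for real extraction; fails for one year to show fault tolerance."""
    if "2012" in path.name:
        raise OSError("simulated unreadable file")


results = run_all_years(range(2010, 2021), Path("."), fake_processor)
failed = [r.year for r in results if not r.success]
print(len(results), failed)  # 11 [2012]
```

Each outcome carries the source filename, so a later consolidation step can tell which file every record came from and which years need to be rerun.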

---

In [13]:
"""
Multi-Year CSAO Variable Extraction Pipeline
=============================================
Extracts specified variables from all yearly datasets with intelligent
variable name mapping and data lineage tracking.

Following:
- Google's Data Engineering Best Practices
- Amazon's Well-Architected Framework (Reliability & Operational Excellence)
- CF Conventions for Climate and Forecast Metadata
"""

from __future__ import annotations

import gc
from dataclasses import dataclass, field
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional, Tuple, Any
import warnings

import numpy as np
import xarray as xr

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore", category=RuntimeWarning)

# =============================================================================
# EXTRACTION CONFIGURATION
# =============================================================================

@dataclass
class ExtractionConfig:
    """Configuration for variable extraction pipeline."""
    
    # Source configuration
    data_dir: Path = Path(r"D:\phd\data\CSAO")
    years: Tuple[int, ...] = tuple(range(2010, 2021))
    
    # Target variables (canonical names)
    target_variables: Tuple[str, ...] = (
        "lat",
        "lon",
        "time",
        "x",
        "y",
        "radar_freeboard_mean",
        "radar_freeboard_unc",
        "sea_ice_conc",
    )
    
    # Output configuration
    output_dir: Path = Path(r"D:\phd\data\CSAO")
    output_filename: str = "csao_cs2_consolidated_rfb_sic_2010_2020_v1.nc"
    
    # Processing options
    preserve_attributes: bool = True
    add_provenance: bool = True
    compression_level: int = 4


# Initialize configuration
config = ExtractionConfig()

logger.info("=" * 60)
logger.info("CSAO Multi-Year Variable Extraction Pipeline")
logger.info("=" * 60)
logger.info(f"Source directory: {config.data_dir}")
logger.info(f"Years to process: {config.years[0]}-{config.years[-1]}")
logger.info(f"Target variables: {len(config.target_variables)}")


@dataclass
class VariableMapping:
    """Mapping from canonical name to actual name in dataset."""
    canonical_name: str
    actual_name: str
    source_year: int
    source_file: str


@dataclass
class ExtractionResult:
    """Result of extracting variables from a single dataset."""
    year: int
    filepath: Path
    success: bool
    extracted_data: Optional[xr.Dataset] = None
    variable_mappings: List[VariableMapping] = field(default_factory=list)
    error_message: Optional[str] = None
    processing_time_ms: float = 0.0


def get_variable_mapping_from_validation(
    validation_result: DatasetValidationResult
) -> Dict[str, str]:
    """
    Extract variable name mapping from validation results.
    
    Args:
        validation_result: Validation result containing match information.
        
    Returns:
        Dictionary mapping canonical names to actual variable names.
    """
    mapping = {}
    for match in validation_result.matches:
        if match.matched_name is not None:
            mapping[match.required_name] = match.matched_name
    return mapping


def extract_variables_from_dataset(
    filepath: Path,
    year: int,
    variable_mapping: Dict[str, str],
    config: ExtractionConfig
) -> ExtractionResult:
    """
    Extract specified variables from a single NetCDF dataset.
    
    Args:
        filepath: Path to the source NetCDF file.
        year: Data year for reference.
        variable_mapping: Mapping from canonical to actual variable names.
        config: Extraction configuration.
        
    Returns:
        ExtractionResult containing extracted data or error information.
    """
    import time
    start_time = time.perf_counter()
    
    result = ExtractionResult(
        year=year,
        filepath=filepath,
        success=False
    )
    
    if not filepath.exists():
        result.error_message = f"File not found: {filepath}"
        logger.error(f"[{year}] {result.error_message}")
        return result
    
    try:
        # Open source dataset
        with xr.open_dataset(filepath, engine="netcdf4") as ds:
            # Build list of variables to extract (using actual names)
            vars_to_extract = []
            
            for canonical_name, actual_name in variable_mapping.items():
                if actual_name in ds.data_vars or actual_name in ds.coords:
                    vars_to_extract.append(actual_name)
                    result.variable_mappings.append(VariableMapping(
                        canonical_name=canonical_name,
                        actual_name=actual_name,
                        source_year=year,
                        source_file=filepath.name
                    ))
                else:
                    logger.warning(f"[{year}] Variable '{actual_name}' not found in dataset")
            
            if not vars_to_extract:
                result.error_message = "No variables found to extract"
                return result
            
            # Extract variables - handle both coords and data_vars
            extracted_vars = {}
            for var_name in vars_to_extract:
                if var_name in ds.coords:
                    extracted_vars[var_name] = ds.coords[var_name].load()
                elif var_name in ds.data_vars:
                    extracted_vars[var_name] = ds[var_name].load()
            
            # Create new dataset with extracted variables
            extracted_ds = xr.Dataset(extracted_vars)
            
            # Rename variables to canonical names
            rename_map = {m.actual_name: m.canonical_name for m in result.variable_mappings}
            extracted_ds = extracted_ds.rename(rename_map)
            
            # Add year as a coordinate/attribute for tracking
            extracted_ds.attrs["source_year"] = year
            extracted_ds.attrs["source_file"] = filepath.name
            
            # Preserve relevant global attributes from source
            if config.preserve_attributes:
                for attr_key in ["title", "institution", "source", "references", "comment"]:
                    if attr_key in ds.attrs:
                        extracted_ds.attrs[f"original_{attr_key}"] = ds.attrs[attr_key]
            
            result.extracted_data = extracted_ds
            result.success = True
            
    except Exception as e:
        result.error_message = f"Extraction failed: {str(e)}"
        logger.error(f"[{year}] {result.error_message}")
    
    result.processing_time_ms = (time.perf_counter() - start_time) * 1000
    return result


def extract_all_years(
    validation_results: List[DatasetValidationResult],
    config: ExtractionConfig
) -> List[ExtractionResult]:
    """
    Extract variables from all validated datasets.
    
    Args:
        validation_results: List of validation results with variable mappings.
        config: Extraction configuration.
        
    Returns:
        List of extraction results for all years.
    """
    extraction_results = []
    
    logger.info(f"Starting extraction from {len(validation_results)} datasets...")
    
    for val_result in validation_results:
        # Skip datasets with validation errors
        if val_result.error_message:
            logger.warning(f"[{val_result.year}] Skipping due to validation error")
            extraction_results.append(ExtractionResult(
                year=val_result.year,
                filepath=val_result.filepath,
                success=False,
                error_message=f"Validation failed: {val_result.error_message}"
            ))
            continue
        
        # Get variable mapping from validation
        var_mapping = get_variable_mapping_from_validation(val_result)
        
        # Extract variables
        ext_result = extract_variables_from_dataset(
            filepath=val_result.filepath,
            year=val_result.year,
            variable_mapping=var_mapping,
            config=config
        )
        
        extraction_results.append(ext_result)
        
        status = "✅ SUCCESS" if ext_result.success else "❌ FAILED"
        logger.info(f"[{val_result.year}] Extraction: {status} ({ext_result.processing_time_ms:.1f}ms)")
        
        # Force garbage collection to manage memory
        gc.collect()
    
    return extraction_results


# =============================================================================
# EXECUTE EXTRACTION
# =============================================================================

# Run extraction using validation results from previous cell
extraction_results = extract_all_years(validation_results, config)

# Display extraction summary
print("=" * 80)
print("EXTRACTION SUMMARY")
print("=" * 80)

successful = [r for r in extraction_results if r.success]
failed = [r for r in extraction_results if not r.success]

print(f"\n  Total datasets processed: {len(extraction_results)}")
print(f"  Successful extractions : {len(successful)} ✅")
print(f"  Failed extractions     : {len(failed)} {'❌' if failed else ''}")

if failed:
    print("\n  Failed datasets:")
    for r in failed:
        print(f"    • {r.year}: {r.error_message}")

print(f"\n  {'Year':<6} {'Status':<12} {'Variables':<12} {'Time (ms)':<12}")
print("  " + "-" * 45)
for r in extraction_results:
    status = "✅ OK" if r.success else "❌ FAIL"
    var_count = len(r.variable_mappings) if r.success else 0
    print(f"  {r.year:<6} {status:<12} {var_count:<12} {r.processing_time_ms:<12.1f}")

print("=" * 80)

2025-12-07 18:29:43 | INFO     | __main__ | CSAO Multi-Year Variable Extraction Pipeline
2025-12-07 18:29:43 | INFO     | __main__ | Source directory: D:\phd\data\CSAO
2025-12-07 18:29:43 | INFO     | __main__ | Years to process: 2010-2020
2025-12-07 18:29:43 | INFO     | __main__ | Target variables: 8
2025-12-07 18:29:43 | INFO     | __main__ | Starting extraction from 11 datasets...
2025-12-07 18:29:43 | INFO     | __main__ | [2010] Extraction: ✅ SUCCESS (72.3ms)
2025-12-07 18:29:45 | INFO     | __main__ | [2011] Extraction: ✅ SUCCESS (1963.0ms)
2025-12-07 18:29:47 | INFO     | __main__ | [2012] Extraction: ✅ SUCCESS (1493.8ms)
2025-12-07 18:29:48 | INFO     | __main__ | [2013] Extraction: ✅ SUCCESS (1164.0ms)
2025-12-07 18:29:49 | INFO     | __main__ | [2014] Extraction: ✅ SUCCESS (1373.4ms)
2025-12-07 18:29:50 | INFO     | __main__ | [2015] Extraction: ✅ SUCCESS (1080.3ms)
2025-12-07 18:29:52 | INFO     | __main__ | [2016] Extraction: ✅ SUCCESS (1088.8ms)
2025-12-07 1

EXTRACTION SUMMARY

  Total datasets processed: 11
  Successful extractions : 11 ✅
  Failed extractions     : 0

  Year   Status       Variables    Time (ms)
  ---------------------------------------------
  2010   ✅ OK         8            72.3
  2011   ✅ OK         8            1963.0
  2012   ✅ OK         8            1493.8
  2013   ✅ OK         8            1164.0
  2014   ✅ OK         8            1373.4
  2015   ✅ OK         8            1080.3
  2016   ✅ OK         8            1088.8
  2017   ✅ OK         8            1114.4
  2018   ✅ OK         8            1100.7
  2019   ✅ OK         8            1391.7
  2020   ✅ OK         8            1237.5


## 12. Dataset Consolidation and NetCDF Export

**Purpose:** Merge all extracted yearly datasets into a single consolidated NetCDF file with:

- Consistent dimensional structure across all years
- CF-compliant metadata and attributes
- Comprehensive data provenance tracking
- Optimized compression for efficient storage

**Output File Naming Convention:**
```
{project}_{satellite}_{content_type}_{variables}_{start_year}_{end_year}_{version}.nc
```
Example: `csao_cs2_consolidated_rfb_sic_2010_2020_v1.nc`
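
The naming convention above can be applied programmatically so that the file name always matches the years actually processed. The following is a minimal sketch, not part of the pipeline itself: `build_output_filename` and its parameter names (including `product_tag` for the third field, `consolidated` in the example) are hypothetical helpers introduced here for illustration.

```python
from __future__ import annotations


def build_output_filename(
    project: str,
    satellite: str,
    product_tag: str,
    variables: tuple[str, ...],
    start_year: int,
    end_year: int,
    version: int,
) -> str:
    """Join the naming-convention fields with underscores (hypothetical helper)."""
    if end_year < start_year:
        raise ValueError("end_year must not precede start_year")
    fields = [
        project,
        satellite,
        product_tag,
        "_".join(variables),  # e.g. ("rfb", "sic") -> "rfb_sic"
        str(start_year),
        str(end_year),
        f"v{version}",
    ]
    return "_".join(fields) + ".nc"


filename = build_output_filename(
    "csao", "cs2", "consolidated", ("rfb", "sic"), 2010, 2020, 1
)
print(filename)
```

Deriving `start_year` and `end_year` from the list of successfully extracted years, rather than hard-coding them, would keep the name accurate if some years fail extraction.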

**Quality Assurance:**
- Dimensional consistency validation
- Missing value handling
- Attribute standardization

---

In [14]:
"""
CSAO Dataset Consolidation Module
=================================
Merges extracted yearly datasets into a single CF-compliant NetCDF file
with comprehensive metadata and optimized storage.

Following:
- CF Conventions 1.8 for Climate and Forecast Metadata
- ACDD (Attribute Convention for Data Discovery) 1.3
- Google's Data Engineering Standards
- Amazon S3 Best Practices for Scientific Data
"""

from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, List, Optional, Any
import uuid

import numpy as np
import xarray as xr


@dataclass
class ConsolidationConfig:
    """Configuration for dataset consolidation."""
    
    # Output settings
    output_dir: Path = Path(r"D:\phd\data\CSAO")
    output_filename: str = "csao_cs2_consolidated_rfb_sic_2010_2020_v1.nc"
    
    # Compression settings (NetCDF4)
    compression: Dict[str, Any] = field(default_factory=lambda: {
        "zlib": True,
        "complevel": 4,
        "shuffle": True,
    })
    
    # Metadata
    title: str = "CSAO CryoSat-2 Consolidated Radar Freeboard and Sea Ice Concentration (2010-2020)"
    institution: str = "University of Tasmania"
    source: str = "CSAO (CryoSat-2 Southern Antarctic Ocean) Product"
    references: str = "https://doi.org/xxxxx"  # Update with actual DOI
    
    # Processing metadata
    conventions: str = "CF-1.8, ACDD-1.3"
    processing_level: str = "L3"


def create_global_attributes(
    config: ConsolidationConfig,
    extraction_results: List[ExtractionResult],
    processing_start: datetime
) -> Dict[str, Any]:
    """
    Create CF/ACDD compliant global attributes for consolidated dataset.
    
    Args:
        config: Consolidation configuration.
        extraction_results: List of extraction results for provenance.
        processing_start: Timestamp when processing started.
        
    Returns:
        Dictionary of global attributes.
    """
    successful_years = [r.year for r in extraction_results if r.success]
    source_files = [r.filepath.name for r in extraction_results if r.success]
    
    return {
        # CF Convention required attributes
        "Conventions": config.conventions,
        "title": config.title,
        "institution": config.institution,
        "source": config.source,
        "references": config.references,
        
        # ACDD recommended attributes
        "summary": (
            f"Consolidated radar freeboard and sea ice concentration data from "
            f"CSAO CryoSat-2 products spanning {min(successful_years)}-{max(successful_years)}. "
            f"Contains harmonized variables extracted from {len(successful_years)} yearly datasets."
        ),
        "keywords": "sea ice, freeboard, radar altimetry, CryoSat-2, Antarctic, Southern Ocean",
        "keywords_vocabulary": "GCMD Science Keywords",
        
        # Temporal coverage
        "time_coverage_start": f"{min(successful_years)}-01-01T00:00:00Z",
        "time_coverage_end": f"{max(successful_years)}-12-31T23:59:59Z",
        "time_coverage_resolution": "P1Y",
        
        # Spatial coverage (Antarctic)
        "geospatial_lat_min": -90.0,
        "geospatial_lat_max": -50.0,
        "geospatial_lon_min": -180.0,
        "geospatial_lon_max": 180.0,
        
        # Processing information
        "processing_level": config.processing_level,
        "date_created": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "date_modified": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "creator_name": "Xinlong Liu",
        "creator_institution": config.institution,
        
        # Provenance
        "history": (
            f"{processing_start.strftime('%Y-%m-%dT%H:%M:%S')} - "
            f"Consolidated from {len(successful_years)} CSAO yearly datasets using "
            f"automated extraction pipeline"
        ),
        "source_files": ", ".join(source_files),
        "source_years": ", ".join(str(y) for y in sorted(successful_years)),
        
        # Unique identifier
        "uuid": str(uuid.uuid4()),
        
        # Software info
        "software": "Python/xarray NetCDF consolidation pipeline",
        "software_version": f"xarray {xr.__version__}, numpy {np.__version__}",
    }


def standardize_variable_attributes(var_name: str, data_array: xr.DataArray) -> xr.DataArray:
    """
    Standardize variable attributes to CF conventions.
    
    Args:
        var_name: Canonical variable name.
        data_array: Data array to standardize.
        
    Returns:
        Data array with standardized attributes.
    """
    # Standard attributes for each variable type
    standard_attrs = {
        "lat": {
            "standard_name": "latitude",
            "long_name": "Latitude",
            "units": "degrees_north",
            "axis": "Y",
            "valid_range": [-90.0, 90.0],
        },
        "lon": {
            "standard_name": "longitude", 
            "long_name": "Longitude",
            "units": "degrees_east",
            "axis": "X",
            "valid_range": [-180.0, 360.0],
        },
        "time": {
            "standard_name": "time",
            "long_name": "Time",
            "axis": "T",
        },
        "x": {
            "standard_name": "projection_x_coordinate",
            "long_name": "X coordinate in EASE-Grid 2.0 projection",
            "units": "m",
            "axis": "X",
        },
        "y": {
            "standard_name": "projection_y_coordinate",
            "long_name": "Y coordinate in EASE-Grid 2.0 projection", 
            "units": "m",
            "axis": "Y",
        },
        "radar_freeboard_mean": {
            "standard_name": "sea_ice_freeboard",
            "long_name": "Mean Radar Freeboard",
            "units": "m",
            "valid_range": [-1.0, 5.0],
            "comment": "Radar freeboard derived from CryoSat-2 altimetry",
        },
        "radar_freeboard_unc": {
            "standard_name": "sea_ice_freeboard standard_error",
            "long_name": "Radar Freeboard Uncertainty",
            "units": "m",
            "valid_range": [0.0, 2.0],
            "comment": "Uncertainty estimate for radar freeboard",
        },
        "sea_ice_conc": {
            "standard_name": "sea_ice_area_fraction",
            "long_name": "Sea Ice Concentration",
            "units": "1",
            "valid_range": [0.0, 1.0],
            "comment": "Sea ice concentration from passive microwave",
        },
    }
    
    # Apply standard attributes if available
    if var_name in standard_attrs:
        # Preserve original attributes that don't conflict
        original_attrs = dict(data_array.attrs)
        new_attrs = standard_attrs[var_name].copy()
        
        # Merge: standard attrs take precedence, but keep unique original attrs
        for key, value in original_attrs.items():
            if key not in new_attrs:
                new_attrs[f"original_{key}"] = value
        
        data_array.attrs = new_attrs
    
    return data_array


def consolidate_datasets(
    extraction_results: List[ExtractionResult],
    config: ConsolidationConfig
) -> Optional[xr.Dataset]:
    """
    Consolidate all extracted datasets into a single xarray Dataset.
    
    Strategy: first try to concatenate all years along a new 'year'
    dimension, which requires consistent grids. If concatenation fails,
    fall back to storing each year's variables under year-suffixed names.
    
    Args:
        extraction_results: List of extraction results with data.
        config: Consolidation configuration.
        
    Returns:
        Consolidated xarray Dataset or None if consolidation fails.
    """
    # Use UTC so the 'history' timestamp is consistent with 'date_created'
    processing_start = datetime.now(timezone.utc)
    
    successful_results = [r for r in extraction_results if r.success and r.extracted_data is not None]
    
    if not successful_results:
        logger.error("No successful extractions to consolidate")
        return None
    
    logger.info(f"Consolidating {len(successful_results)} datasets...")
    
    # Strategy: attempt to concatenate along a new 'year' dimension.
    # Compatibility is not checked up front; a failed concat triggers the fallback below.
    
    datasets_to_merge = []
    
    for result in sorted(successful_results, key=lambda r: r.year):
        ds = result.extracted_data.copy()
        
        # Add year coordinate
        ds = ds.expand_dims({"year": [result.year]})
        
        # Standardize variable attributes (iterate over a snapshot of the
        # names because the dataset is modified inside the loops)
        for var_name in list(ds.data_vars):
            ds[var_name] = standardize_variable_attributes(var_name, ds[var_name])
        
        for coord_name in list(ds.coords):
            if coord_name != "year":
                ds.coords[coord_name] = standardize_variable_attributes(
                    coord_name, ds.coords[coord_name]
                )
        
        datasets_to_merge.append(ds)
        logger.info(f"  Prepared {result.year}: {list(ds.data_vars.keys())}")
    
    try:
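        # Try to concatenate along the year dimension.
        # compat="override" skips equality checks on non-concatenated variables
        # and takes them from the first dataset, so differences in later years
        # are silently ignored.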
        consolidated = xr.concat(
            datasets_to_merge,
            dim="year",
            coords="minimal",
            compat="override",
            combine_attrs="drop_conflicts"
        )
        
        logger.info("✅ Successfully concatenated datasets along 'year' dimension")
        
    except Exception as concat_error:
        logger.warning(f"Concatenation failed: {concat_error}")
        logger.info("Falling back to merge strategy (separate variables per year)...")
        
        # Fallback: Merge datasets with year suffix in variable names
        consolidated = xr.Dataset()
        
        for result in sorted(successful_results, key=lambda r: r.year):
            ds = result.extracted_data
            year = result.year
            
            for var_name, var_data in ds.data_vars.items():
                new_name = f"{var_name}_{year}"
                var_data = standardize_variable_attributes(var_name, var_data)
                var_data.attrs["source_year"] = year
                consolidated[new_name] = var_data
            
            # Add coordinates (only once, from first dataset)
            if not consolidated.coords:
                for coord_name, coord_data in ds.coords.items():
                    consolidated.coords[coord_name] = standardize_variable_attributes(
                        coord_name, coord_data
                    )
        
        logger.info("✅ Merged datasets with year-suffixed variable names")
    
    # Add global attributes
    consolidated.attrs = create_global_attributes(
        config, extraction_results, processing_start
    )
    
    return consolidated


# =============================================================================
# EXECUTE CONSOLIDATION
# =============================================================================

consolidation_config = ConsolidationConfig()

logger.info("=" * 60)
logger.info("CSAO Dataset Consolidation")
logger.info("=" * 60)

# Consolidate datasets
consolidated_ds = consolidate_datasets(extraction_results, consolidation_config)

if consolidated_ds is not None:
    print("=" * 80)
    print("CONSOLIDATED DATASET OVERVIEW")
    print("=" * 80)
    display(consolidated_ds)
    
    # Show dimensions (Dataset.sizes maps dimension names to lengths)
    print(f"\n  Dimensions: {dict(consolidated_ds.sizes)}")
    print(f"  Coordinates: {list(consolidated_ds.coords.keys())}")
    print(f"  Data Variables: {list(consolidated_ds.data_vars.keys())}")
    print(f"  Global Attributes: {len(consolidated_ds.attrs)}")
else:
    print("❌ Consolidation failed. Check logs for details.")

2025-12-07 18:30:59 | INFO     | __main__ | CSAO Dataset Consolidation
2025-12-07 18:30:59 | INFO     | __main__ | Consolidating 11 datasets...
2025-12-07 18:30:59 | INFO     | __main__ |   Prepared 2010: ['radar_freeboard_mean', 'radar_freeboard_unc', 'sea_ice_conc']
2025-12-07 18:30:59 | INFO     | __main__ |   Prepared 2011: ['radar_freeboard_mean', 'radar_freeboard_unc', 'sea_ice_conc']
2025-12-07 18:30:59 | INFO     | __main__ |   Prepared 2012: ['radar_freeboard_mean', 'radar_freeboard_unc', 'sea_ice_conc']
2025-12-07 18:30:59 | INFO     | __main__ |   Prepared 2013: ['radar_freeboard_mean', 'radar_freeboard_unc', 'sea_ice_conc']
2025-12-07 18:30:59 | INFO     | __main__ |   Prepared 2014: ['radar_freeboard_mean', 'radar_freeboard_unc', 'sea_ice_conc']
2025-12-07 18:30:59 | INFO     | __main__ |   Prepared 2015: ['radar_freeboard_mean', 'radar_freeboard_unc', 'sea_ice_conc']
2025-12-07 18:30:59 | INFO     | __main__ |   Prepared 2016: ['radar_freeboard_mean', 'radar_freeboard_unc

CONSOLIDATED DATASET OVERVIEW



  Dimensions: {'year': 11, 'time': 121, 'y': 712, 'x': 712}
  Coordinates: ['x', 'y', 'time', 'year', 'lon', 'lat']
  Data Variables: ['radar_freeboard_mean', 'radar_freeboard_unc', 'sea_ice_conc']
  Global Attributes: 26




## 13. Export Consolidated Dataset to NetCDF

**Output File:** `csao_cs2_consolidated_rfb_sic_2010_2020_v1.nc`

**Export Features:**
- NetCDF4 format with HDF5 backend
- ZLIB compression (level 4) for efficient storage
- Unlimited dimension for temporal extensibility
- CF-compliant encoding
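
For a datetime-typed `time` coordinate, the export cell in this section sets the encoding units to `days since 1970-01-01` with the `standard` calendar. The sketch below shows what that encoding means numerically, using only the Python standard library. The helper names `days_since_epoch` and `from_days_since_epoch` are hypothetical, and the sketch assumes whole-day timestamps; for modern dates the CF `standard` calendar coincides with the Gregorian calendar used by Python's `date`.

```python
from datetime import date, timedelta

# Reference date taken from the encoding units "days since 1970-01-01".
EPOCH = date(1970, 1, 1)


def days_since_epoch(day: date) -> int:
    """Encode a calendar date as an integer day offset from the epoch."""
    return (day - EPOCH).days


def from_days_since_epoch(days: int) -> date:
    """Decode an integer day offset back into a calendar date."""
    return EPOCH + timedelta(days=days)


encoded = days_since_epoch(date(2010, 1, 1))
decoded = from_days_since_epoch(encoded)
print(encoded, decoded.isoformat())
```

A reader of the exported file can apply the same arithmetic to the stored numeric values together with the `units` attribute to recover the original dates.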

**File Naming Convention:**
```
{project}_{satellite}_{content_type}_{variables}_{temporal_range}_{version}.nc
```

---

In [15]:
"""
NetCDF Export Module
====================
Exports consolidated dataset to NetCDF4 format with optimized settings
following Google Cloud and Amazon S3 best practices for scientific data.
"""

from __future__ import annotations

from pathlib import Path
from typing import Any, Dict

import numpy as np
import xarray as xr


def create_encoding_config(
    ds: xr.Dataset,
    compression_level: int = 4
) -> Dict[str, Dict[str, Any]]:
    """
    Create variable-specific encoding configuration for NetCDF export.
    
    Args:
        ds: Dataset to encode.
        compression_level: ZLIB compression level (0-9).
        
    Returns:
        Dictionary of encoding settings per variable.
    """
    encoding = {}
    
    for var_name in ds.data_vars:
        var_encoding = {
            "zlib": True,
            "complevel": compression_level,
            "shuffle": True,
        }
        
        # Set appropriate dtype and fill value
        dtype = ds[var_name].dtype
        if np.issubdtype(dtype, np.floating):
            var_encoding["dtype"] = "float32"
            var_encoding["_FillValue"] = np.float32(-9999.0)
        elif np.issubdtype(dtype, np.integer):
            var_encoding["dtype"] = dtype
            var_encoding["_FillValue"] = -9999
        
        encoding[var_name] = var_encoding
    
    # Coordinate encoding
    for coord_name in ds.coords:
        coord_encoding = {
            "zlib": True,
            "complevel": compression_level,
        }
        
        dtype = ds.coords[coord_name].dtype
        if np.issubdtype(dtype, np.floating):
            coord_encoding["dtype"] = "float32"
        
        # Time coordinate special handling
        if coord_name == "time" and np.issubdtype(dtype, np.datetime64):
            coord_encoding["units"] = "days since 1970-01-01"
            coord_encoding["calendar"] = "standard"
        
        encoding[coord_name] = coord_encoding
    
    return encoding


def export_to_netcdf(
    ds: xr.Dataset,
    output_path: Path,
    compression_level: int = 4
) -> Dict[str, Any]:
    """
    Export dataset to NetCDF4 file with comprehensive metadata.
    
    Args:
        ds: Dataset to export.
        output_path: Full path for output file.
        compression_level: ZLIB compression level.
        
    Returns:
        Dictionary with export statistics and status.
    """
    import time
    
    start_time = time.perf_counter()
    
    result = {
        "success": False,
        "output_path": str(output_path),
        "file_size_mb": 0.0,
        "processing_time_s": 0.0,
        "n_variables": len(ds.data_vars),
        "n_coordinates": len(ds.coords),
    }
    
    try:
        # Ensure output directory exists
        output_path.parent.mkdir(parents=True, exist_ok=True)
        
        # Create encoding configuration
        encoding = create_encoding_config(ds, compression_level)
        
        logger.info(f"Exporting to: {output_path}")
        logger.info(f"Compression level: {compression_level}")
        
        # Export to NetCDF4
        ds.to_netcdf(
            output_path,
            engine="netcdf4",
            format="NETCDF4",
            encoding=encoding,
            unlimited_dims=["year"] if "year" in ds.dims else None,
        )
        
        # Calculate file size
        file_size_bytes = output_path.stat().st_size
        result["file_size_mb"] = file_size_bytes / (1024 ** 2)
        result["success"] = True
        
        logger.info(f"✅ Export successful: {result['file_size_mb']:.2f} MB")
        
    except Exception as e:
        result["error"] = str(e)
        logger.error(f"❌ Export failed: {e}")
    
    result["processing_time_s"] = time.perf_counter() - start_time
    return result


# =============================================================================
# EXECUTE EXPORT
# =============================================================================

if consolidated_ds is not None:
    # Define output path
    output_filepath = consolidation_config.output_dir / consolidation_config.output_filename
    
    logger.info("=" * 60)
    logger.info("NetCDF Export")
    logger.info("=" * 60)
    
    # Export dataset
    export_result = export_to_netcdf(
        ds=consolidated_ds,
        output_path=output_filepath,
        compression_level=4
    )
    
    # Display export results
    print("=" * 80)
    print("EXPORT RESULTS")
    print("=" * 80)
    print(f"""
    ┌─────────────────────────────────────────────────────────────────┐
    │  EXPORT STATUS                                                  │
    ├─────────────────────────────────────────────────────────────────┤
    │  Status          : {'✅ SUCCESS' if export_result['success'] else '❌ FAILED':<45}│
    │  Output File     : {consolidation_config.output_filename:<45}│
    │  Output Directory: {str(consolidation_config.output_dir):<45}│
    │  File Size       : {f"{export_result['file_size_mb']:.2f} MB":<45}│
    │  Variables       : {export_result['n_variables']:<45}│
    │  Coordinates     : {export_result['n_coordinates']:<45}│
    │  Processing Time : {f"{export_result['processing_time_s']:.2f} seconds":<45}│
    └─────────────────────────────────────────────────────────────────┘
    """)
    
    if export_result['success']:
        print(f"  📁 Full path: {output_filepath}")
    else:
        print(f"  ❌ Error: {export_result.get('error', 'Unknown error')}")
    
    print("=" * 80)
else:
    print("❌ No consolidated dataset available for export.")

2025-12-07 18:32:34 | INFO     | __main__ | NetCDF Export
2025-12-07 18:32:34 | INFO     | __main__ | Exporting to: D:\phd\data\CSAO\csao_cs2_consolidated_rfb_sic_2010_2020_v1.nc
2025-12-07 18:32:34 | INFO     | __main__ | Compression level: 4
2025-12-07 18:33:44 | INFO     | __main__ | ✅ Export successful: 111.08 MB


EXPORT RESULTS

    ┌─────────────────────────────────────────────────────────────────┐
    │  EXPORT STATUS                                                  │
    ├─────────────────────────────────────────────────────────────────┤
    │  Status          : ✅ SUCCESS                                    │
    │  Output File     : csao_cs2_consolidated_rfb_sic_2010_2020_v1.nc│
    │  Output Directory: D:\phd\data\CSAO                             │
    │  File Size       : 111.08 MB                                    │
    │  Variables       : 3                                            │
    │  Coordinates     : 6                                            │
    │  Processing Time : 70.32 secon
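
To put the reported file size in context, the arithmetic below estimates the uncompressed size of the three data variables and the resulting compression ratio. It is a rough sketch under stated assumptions: each data variable is assumed to span all four dimensions reported for the consolidated dataset (`year`, `time`, `y`, `x`), values are assumed to be stored as 4-byte `float32` as requested in the encoding, and coordinate variables and file metadata are ignored. The "MB" figure in the export output is computed with `1024 ** 2`, so it is treated as MiB here.

```python
from math import prod

# Dimension sizes and file size as reported in the notebook output.
sizes = {"year": 11, "time": 121, "y": 712, "x": 712}
n_data_variables = 3
bytes_per_value = 4  # float32, per the export encoding
reported_file_size_mib = 111.08

values_per_variable = prod(sizes.values())
uncompressed_mib = (
    values_per_variable * n_data_variables * bytes_per_value / 1024**2
)
ratio = uncompressed_mib / reported_file_size_mib

print(f"Values per variable : {values_per_variable:,}")
print(f"Uncompressed (est.) : {uncompressed_mib:,.1f} MiB")
print(f"Compression ratio   : {ratio:.1f}x")
```

A ratio this large would be consistent with long runs of fill values in the gridded fields, but that explanation is an assumption and should be confirmed against the NaN percentages reported during validation.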

## 14. Post-Export Validation and Quality Check

**Purpose:** Verify the exported NetCDF file integrity and validate that all expected data is present and accessible.

**Validation Checks:**
1. File readability verification
2. Variable presence confirmation
3. Dimension consistency check
4. Attribute completeness review
5. Data integrity spot-check (NaN percentage, value ranges)
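
The spot-check in item 5 can be illustrated without any array library. The sketch below is a dependency-free illustration, not the notebook's implementation: `summarize_values` is a hypothetical helper, the sample values are invented, and the range `(-1.0, 5.0)` is the `valid_range` that the consolidation cell assigns to `radar_freeboard_mean`. Counting out-of-range values is an addition that goes slightly beyond the NaN and min/max statistics computed in the validation cell.

```python
from __future__ import annotations

import math
from typing import Dict, Iterable, Tuple


def summarize_values(
    values: Iterable[float],
    valid_range: Tuple[float, float],
) -> Dict[str, float]:
    """Count NaNs and finite values that fall outside valid_range."""
    lower, upper = valid_range
    total = 0
    nan_count = 0
    out_of_range = 0
    for value in values:
        total += 1
        if math.isnan(value):
            nan_count += 1
        elif not lower <= value <= upper:
            out_of_range += 1
    nan_percentage = 100.0 * nan_count / total if total else 0.0
    return {
        "n_values": total,
        "nan_percentage": nan_percentage,
        "n_out_of_range": out_of_range,
    }


# Invented sample: two NaNs and one value above the valid range.
sample = [0.12, float("nan"), 0.30, 7.5, float("nan")]
summary = summarize_values(sample, valid_range=(-1.0, 5.0))
print(summary)
```

For the full gridded arrays, the same counts would normally be computed with vectorized NumPy operations, since a Python-level loop over hundreds of millions of values would be slow.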

---

In [16]:
"""
Post-Export Validation Module
=============================
Validates exported NetCDF file integrity and data quality.
"""

from __future__ import annotations


def validate_exported_file(filepath: Path) -> Dict[str, Any]:
    """
    Comprehensive validation of exported NetCDF file.
    
    Args:
        filepath: Path to the exported NetCDF file.
        
    Returns:
        Dictionary containing validation results.
    """
    validation = {
        "filepath": str(filepath),
        "readable": False,
        "dimensions": {},
        "coordinates": [],
        "variables": [],
        "global_attributes": [],
        "data_quality": {},
        "issues": [],
        "passed": False,
    }
    
    if not filepath.exists():
        validation["issues"].append(f"File not found: {filepath}")
        return validation
    
    try:
        with xr.open_dataset(filepath, engine="netcdf4") as ds:
            validation["readable"] = True
            validation["dimensions"] = dict(ds.sizes)
            validation["coordinates"] = list(ds.coords.keys())
            validation["variables"] = list(ds.data_vars.keys())
            validation["global_attributes"] = list(ds.attrs.keys())
            
            # Data quality checks for each variable
            for var_name in ds.data_vars:
                var_data = ds[var_name]
                
                quality = {
                    "dtype": str(var_data.dtype),
                    "shape": var_data.shape,
                    "size": var_data.size,
                }
                
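                # Skip empty variables to avoid dividing by zero below.
                if var_data.size > 0 and np.issubdtype(var_data.dtype, np.number):
                    # .values loads the whole variable into memory at once.
                    values = var_data.values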
                    nan_count = int(np.isnan(values).sum())
                    quality["nan_percentage"] = nan_count / var_data.size * 100
                    quality["min"] = float(np.nanmin(values))
                    quality["max"] = float(np.nanmax(values))
                    quality["mean"] = float(np.nanmean(values))
                    
                    # Flag potential issues
                    if quality["nan_percentage"] > 90:
                        validation["issues"].append(
                            f"{var_name}: High NaN percentage ({quality['nan_percentage']:.1f}%)"
                        )
                
                validation["data_quality"][var_name] = quality
            
            # Check for required global attributes
            required_attrs = ["title", "Conventions", "history", "date_created"]
            for attr in required_attrs:
                if attr not in ds.attrs:
                    validation["issues"].append(f"Missing required attribute: {attr}")
    
    except Exception as e:
        validation["issues"].append(f"Error reading file: {str(e)}")
        return validation
    
    validation["passed"] = len(validation["issues"]) == 0
    return validation


# =============================================================================
# EXECUTE POST-EXPORT VALIDATION
# =============================================================================

if export_result.get("success", False):
    logger.info("=" * 60)
    logger.info("Post-Export Validation")
    logger.info("=" * 60)
    
    validation_result = validate_exported_file(output_filepath)
    
    print("=" * 80)
    print("POST-EXPORT VALIDATION REPORT")
    print("=" * 80)
    
    print(f"\n  📋 FILE VALIDATION")
    print(f"  {'─' * 50}")
    print(f"  Readable       : {'✅ Yes' if validation_result['readable'] else '❌ No'}")
    print(f"  Dimensions     : {validation_result['dimensions']}")
    print(f"  Coordinates    : {len(validation_result['coordinates'])}")
    print(f"  Variables      : {len(validation_result['variables'])}")
    print(f"  Attributes     : {len(validation_result['global_attributes'])}")
    
    print(f"\n  📊 DATA QUALITY SUMMARY")
    print(f"  {'─' * 70}")
    print(f"  {'Variable':<30} {'Shape':<20} {'NaN %':<10} {'Range'}")
    print(f"  {'-' * 70}")
    
    for var_name, quality in validation_result['data_quality'].items():
        shape_str = str(quality['shape'])
        if 'nan_percentage' in quality:
            nan_str = f"{quality['nan_percentage']:.1f}%"
            range_str = f"[{quality['min']:.4g}, {quality['max']:.4g}]"
        else:
            nan_str = "N/A"
            range_str = "N/A"
        print(f"  {var_name:<30} {shape_str:<20} {nan_str:<10} {range_str}")
    
    print(f"\n  🔍 VALIDATION STATUS")
    print(f"  {'─' * 50}")
    if validation_result['passed']:
        print(f"  ✅ All validation checks PASSED")
    else:
        print(f"  ⚠️ Issues detected:")
        for issue in validation_result['issues']:
            print(f"    • {issue}")
    
    print("=" * 80)
else:
    print("⚠️ Skipping validation - export was not successful.")

2025-12-07 18:34:43 | INFO     | __main__ | Post-Export Validation


POST-EXPORT VALIDATION REPORT

  📋 FILE VALIDATION
  ──────────────────────────────────────────────────
  Readable       : ✅ Yes
  Dimensions     : {'year': 11, 'time': 121, 'y': 712, 'x': 712}
  Coordinates    : 6
  Variables      : 3
  Attributes     : 26

  📊 DATA QUALITY SUMMARY
  ──────────────────────────────────────────────────────────────────────
  Variable                       Shape                NaN %      Range
  ----------------------------------------------------------------------
  radar_freeboard_mean           (11, 121, 712, 712)  98.6%      [-4.835, 5.684]
  radar_freeboard_unc            (11, 121, 712, 712)  98.5%      [0, 5.391]
  sea_ice_conc                   (11, 121, 712, 712)  97.9%      [-32.36, 110.8]

  🔍 VALID
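
The ranges reported above include values that look physically implausible, for example a sea ice concentration below 0 or above 100 if the variable is stored in percent. A range check along the following lines could turn such cases into explicit validation issues. The bounds in `PLAUSIBLE_RANGES` and the helper `find_out_of_range` are hypothetical and should be adjusted to the units actually used in the file.

```python
from typing import Any, Dict, List, Mapping, Tuple

# Hypothetical plausibility bounds; adjust to the units stored in the file.
PLAUSIBLE_RANGES: Dict[str, Tuple[float, float]] = {
    "sea_ice_conc": (0.0, 100.0),                # assumes percent
    "radar_freeboard_unc": (0.0, float("inf")),  # an uncertainty cannot be negative
}


def find_out_of_range(
    data_quality: Mapping[str, Mapping[str, Any]],
    plausible_ranges: Mapping[str, Tuple[float, float]] = PLAUSIBLE_RANGES,
) -> List[str]:
    """Return one message per variable whose min/max leaves its bounds."""
    issues: List[str] = []
    for var_name, (lower, upper) in plausible_ranges.items():
        quality = data_quality.get(var_name)
        if quality is None or "min" not in quality or "max" not in quality:
            continue  # variable absent or non-numeric
        if quality["min"] < lower or quality["max"] > upper:
            issues.append(
                f"{var_name}: range [{quality['min']:.4g}, {quality['max']:.4g}] "
                f"outside plausible [{lower:g}, {upper:g}]"
            )
    return issues


# Min/max values copied from the data quality summary above.
reported = {
    "radar_freeboard_unc": {"min": 0.0, "max": 5.391},
    "sea_ice_conc": {"min": -32.36, "max": 110.8},
}
range_issues = find_out_of_range(reported)
for issue in range_issues:
    print(issue)
```

With the reported values, only `sea_ice_conc` is flagged. Whether such values should be clipped, masked, or traced back to the source files is a separate decision that this sketch does not make.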

## 15. Pipeline Completion Summary

**Processing Summary:**
- Multi-year variable validation across 11 CSAO datasets
- Intelligent variable name mapping with fuzzy matching
- Extraction of 8 key variables per year
- Consolidation into a single CF-compliant NetCDF file
- Comprehensive quality assurance validation
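
The variable name mapping step itself is not shown in this section. As a rough illustration of what fuzzy matching can look like, the sketch below uses `difflib.get_close_matches` from the standard library. The function `map_variable_names`, the similarity cutoff, and the example names are assumptions for illustration, not the pipeline's actual matcher.

```python
from difflib import get_close_matches
from typing import Dict, Iterable, Optional


def map_variable_names(
    targets: Iterable[str],
    available: Iterable[str],
    cutoff: float = 0.6,  # hypothetical similarity threshold in [0, 1]
) -> Dict[str, Optional[str]]:
    """Map each target name to the closest available name, or None."""
    # Compare case-insensitively but return the original spelling.
    lowered = {name.lower(): name for name in available}
    mapping: Dict[str, Optional[str]] = {}
    for target in targets:
        key = target.lower()
        if key in lowered:  # exact match, ignoring case
            mapping[target] = lowered[key]
            continue
        # Fall back to the single most similar candidate above the cutoff.
        matches = get_close_matches(key, list(lowered), n=1, cutoff=cutoff)
        mapping[target] = lowered[matches[0]] if matches else None
    return mapping


# Example names are made up for illustration.
example = map_variable_names(
    targets=["sea_ice_conc", "radar_freeboard_mean", "snow_depth"],
    available=["Sea_Ice_Conc", "radar_freeboard", "lat", "lon"],
)
print(example)
```

Targets that map to `None` have no sufficiently similar candidate and would need manual review. A stricter cutoff reduces false matches at the cost of more unmapped names.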

**Output:**
```
csao_cs2_consolidated_rfb_sic_2010_2020_v1.nc
```

**Next Steps:**
1. Verify spatial coverage using visualization tools
2. Perform cross-validation with independent datasets
3. Update documentation with final variable mappings

---

In [17]:
"""
Pipeline Completion Summary
===========================
Final summary and cleanup of the CSAO data extraction pipeline.
"""

import gc  # Needed by gc.collect() below; harmless if already imported.

# =============================================================================
# PIPELINE SUMMARY
# =============================================================================

print("=" * 80)
print("🎯 CSAO DATA EXTRACTION PIPELINE - COMPLETION SUMMARY")
print("=" * 80)

summary_data = {
    "pipeline_name": "CSAO Multi-Year Variable Extraction",
    "completion_time": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
    "source_directory": str(config.data_dir),
    "years_processed": f"{config.years[0]}-{config.years[-1]}",
    "total_datasets": len(extraction_results),
    "successful_extractions": len([r for r in extraction_results if r.success]),
    "variables_extracted": len(config.target_variables),
    "output_file": consolidation_config.output_filename if export_result.get("success") else "N/A",
    "output_size_mb": f"{export_result.get('file_size_mb', 0):.2f}",
}

print(f"""
    ┌──────────────────────────────────────────────────────────────────────┐
    │                    PIPELINE EXECUTION SUMMARY                        │
    ├──────────────────────────────────────────────────────────────────────┤
    │  Pipeline           : {summary_data['pipeline_name']:<44}│
    │  Completion Time    : {summary_data['completion_time']:<44}│
    │  Source Directory   : {summary_data['source_directory']:<44}│
    │  Years Processed    : {summary_data['years_processed']:<44}│
    │  Total Datasets     : {summary_data['total_datasets']:<44}│
    │  Successful         : {summary_data['successful_extractions']:<44}│
    │  Variables per Year : {summary_data['variables_extracted']:<44}│
    │  Output File        : {summary_data['output_file']:<44}│
    │  Output Size        : {summary_data['output_size_mb'] + ' MB':<44}│
    └──────────────────────────────────────────────────────────────────────┘
""")

# Variables extracted
print("  📦 VARIABLES EXTRACTED:")
for var in config.target_variables:
    print(f"    ✓ {var}")

# Cleanup
if consolidated_ds is not None:
    consolidated_ds.close()
    logger.info("Consolidated dataset handle closed.")

# Clear extraction results to free memory
for result in extraction_results:
    if result.extracted_data is not None:
        result.extracted_data.close()

gc.collect()

print("\n" + "=" * 80)
print("✅ PIPELINE COMPLETED SUCCESSFULLY")
print("=" * 80)
logger.info("Pipeline execution completed.")

2025-12-07 18:36:23 | INFO     | __main__ | Consolidated dataset handle closed.
2025-12-07 18:36:23 | INFO     | __main__ | Pipeline execution completed.


🎯 CSAO DATA EXTRACTION PIPELINE - COMPLETION SUMMARY

    ┌──────────────────────────────────────────────────────────────────────┐
    │                    PIPELINE EXECUTION SUMMARY                        │
    ├──────────────────────────────────────────────────────────────────────┤
    │  Pipeline           : CSAO Multi-Year Variable Extraction         │
    │  Completion Time    : 2025-12-07 18:36:23                         │
    │  Source Directory   : D:\phd\data\CSAO                            │
    │  Years Processed    : 2010-2020                                   │
    │  Total Datasets     : 11                                          │
    │  Successful         : 