# FITS to Zarr Conversion

This notebook demonstrates how to convert FITS image files to Zarr format for efficient storage and processing. The conversion groups FITS files by timestamp, stacks the files in each group along the frequency dimension, and appends each resulting time slice to a single Zarr store.

## Overview

- **Input**: FITS image files with timestamp and frequency information in filenames
- **Output**: A single Zarr store containing all data organized by time and frequency
- **Key Features**: 
  - Efficient chunking for large datasets
  - Memory-optimized processing
  - Sky coordinate transformations
  - Data validation and verification

## File Naming Convention

Expected format: `YYYYMMDD_HHMMSS_<freq>MHz_<other>_parts-I-image.fits`
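
As a minimal sketch of how this convention can be read, the snippet below parses one filename into a `datetime` and a nominal frequency in MHz. The helper name `parse_name` is hypothetical and is not part of the notebook's own functions; it assumes the first three underscore-separated parts are date, time, and frequency, as described above.

```python
from datetime import datetime
from pathlib import Path


def parse_name(path: str) -> tuple[datetime, int]:
    """Return (observation time, nominal frequency in MHz) parsed from a filename."""
    parts = Path(path).stem.split("_")
    # First two parts form the timestamp: YYYYMMDD_HHMMSS
    when = datetime.strptime(f"{parts[0]}_{parts[1]}", "%Y%m%d_%H%M%S")
    # Third part is the frequency label, e.g. "41MHz"
    freq_mhz = int(parts[2].removesuffix("MHz"))
    return when, freq_mhz


when, freq_mhz = parse_name("20240524_050009_41MHz_averaged_20000_iterations-I-image.fits")
print(when.isoformat(), freq_mhz)
```

The frequency in the filename is only a label; the frequency coordinate stored in the dataset comes from the FITS header and need not equal it exactly.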

In [1]:
# Import required libraries
from pathlib import Path
import glob
import logging
from typing import List, Dict, Optional, Tuple
import warnings

import numpy as np
import xarray as xr
from image_plane_correction.xds_from_fits import _fits_image_to_xds

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

## Configuration and Constants

In [2]:
# Configuration parameters
class Config:
    """Configuration class for FITS to Zarr conversion."""
    
    # Input and output paths
    FITS_PATTERN = "test_fits_files/*.fits"
    OUTPUT_ZARR = "all_times_freqs.zarr"
    
    # Chunking strategy for optimal performance
    CHUNKS = {"l": 1024, "m": 1024}
    
    # Processing options
    DO_SKY_COORDS = True
    VERBOSE = False
    
    # File naming pattern parts
    TIMESTAMP_PARTS = 2  # YYYYMMDD_HHMMSS
    FREQ_PART_INDEX = 2  # Position of frequency in filename parts

config = Config()
print(f"Configuration loaded:")
print(f"  Input pattern: {config.FITS_PATTERN}")
print(f"  Output path: {config.OUTPUT_ZARR}")
print(f"  Chunks: {config.CHUNKS}")
print(f"  Sky coordinates: {config.DO_SKY_COORDS}")

Configuration loaded:
  Input pattern: test_fits_files/*.fits
  Output path: all_times_freqs.zarr
  Chunks: {'l': 1024, 'm': 1024}
  Sky coordinates: True
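
To get a feel for what the `CHUNKS` setting implies, here is a rough back-of-the-envelope sketch. It assumes 4096 × 4096 images stored as float32, which is what this notebook's data turns out to be; other inputs would need different numbers. Sizes are uncompressed.

```python
import math

chunk_shape = {"l": 1024, "m": 1024}   # same values as Config.CHUNKS
image_shape = {"l": 4096, "m": 4096}   # assumed image size
itemsize = 4                            # assumed float32 pixels (4 bytes each)

# Uncompressed bytes held by one chunk of a single image plane
chunk_bytes = itemsize * math.prod(chunk_shape.values())

# Number of chunks needed to tile one image plane
chunks_per_plane = math.prod(
    math.ceil(image_shape[d] / chunk_shape[d]) for d in chunk_shape
)

print(f"{chunk_bytes / 2**20:.0f} MiB per chunk, {chunks_per_plane} chunks per image plane")
```

Under these assumptions each chunk is 4 MiB and each image plane splits into 16 chunks. How the `time`, `frequency`, and `polarization` dimensions are chunked is left to the loader and is not covered by this estimate.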


## Utility Functions

These functions provide the core functionality for file discovery, grouping, and conversion.

In [3]:
def discover_fits_files(pattern: str) -> List[str]:
    """
    Discover FITS files matching the specified pattern.
    
    Args:
        pattern: Glob pattern for finding FITS files
        
    Returns:
        Sorted list of FITS file paths
        
    Raises:
        FileNotFoundError: If no FITS files are found
    """
    fits_files = sorted(glob.glob(pattern))
    
    if not fits_files:
        raise FileNotFoundError(f"No FITS files found matching pattern: {pattern}")
    
    logger.info(f"Discovered {len(fits_files)} FITS files")
    return fits_files


def parse_filename_metadata(filename: str) -> Tuple[str, str]:
    """
    Extract timestamp and frequency from FITS filename.
    
    Args:
        filename: Path to FITS file
        
    Returns:
        Tuple of (timestamp, frequency) strings
        
    Raises:
        ValueError: If filename doesn't match expected format
    """
    try:
        parts = Path(filename).stem.split("_")
        
        if len(parts) < config.FREQ_PART_INDEX + 1:
            raise ValueError(f"Insufficient parts in filename: {filename}")
        
        # Extract timestamp (first two parts: YYYYMMDD_HHMMSS)
        timestamp = f"{parts[0]}_{parts[1]}"
        
        # Extract frequency
        frequency = parts[config.FREQ_PART_INDEX]
        
        return timestamp, frequency
        
    except Exception as e:
        raise ValueError(f"Failed to parse filename {filename}: {e}") from e


def group_files_by_timestamp(fits_files: List[str]) -> Dict[str, List[str]]:
    """
    Group FITS files by their timestamps.
    
    Args:
        fits_files: List of FITS file paths
        
    Returns:
        Dictionary mapping timestamps to lists of file paths
    """
    groups = {}
    
    for filename in fits_files:
        try:
            timestamp, frequency = parse_filename_metadata(filename)
            groups.setdefault(timestamp, []).append(filename)
        except ValueError as e:
            logger.warning(f"Skipping file due to parsing error: {e}")
            continue
    
    logger.info(f"Grouped files into {len(groups)} timestamps")
    
    # Log group details
    for timestamp, files in groups.items():
        logger.debug(f"Timestamp {timestamp}: {len(files)} files")
    
    return groups

In [4]:
def sort_files_by_frequency(fits_files: List[str]) -> List[str]:
    """
    Sort FITS files by their frequency values.
    
    Args:
        fits_files: List of FITS file paths for a single timestamp
        
    Returns:
        Files sorted by frequency (ascending)
    """
    def extract_freq_value(filename: str) -> int:
        """Extract numeric frequency value from filename."""
        try:
            _, frequency = parse_filename_metadata(filename)
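            # Remove the literal 'MHz' suffix and convert to int.
            # (rstrip("MHz") strips any trailing M/H/z characters, not the suffix;
            # str.removesuffix requires Python 3.9+)
            return int(frequency.removesuffix("MHz"))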
        except Exception:
            logger.warning(f"Could not extract frequency from {filename}, using 0")
            return 0
    
    return sorted(fits_files, key=extract_freq_value)


def build_time_slice(fits_list: List[str], 
                    chunks: Dict[str, int], 
                    do_sky_coords: bool = True) -> xr.Dataset:
    """
    Build a single time slice dataset from multiple frequency FITS files.
    
    Args:
        fits_list: List of FITS files for a single timestamp
        chunks: Chunking strategy for the dataset
        do_sky_coords: Whether to compute sky coordinates
        
    Returns:
        xarray Dataset containing all frequencies for this time slice
        
    Raises:
        Exception: If none of the FITS files could be loaded
    """
    logger.info(f"Processing {len(fits_list)} files for time slice")
    
    datasets = []
    failed_files = []
    
    for fits_file in fits_list:
        try:
            logger.debug(f"Loading {fits_file}")
            ds = _fits_image_to_xds(
                fits_file,
                chunks=chunks,
                verbose=config.VERBOSE,
                do_sky_coords=do_sky_coords
            )
            datasets.append(ds)
            
        except Exception as e:
            logger.error(f"Failed to load {fits_file}: {e}")
            failed_files.append(fits_file)
            continue
    
    if not datasets:
        raise Exception(f"No files could be loaded from: {fits_list}")
    
    if failed_files:
        logger.warning(f"Failed to load {len(failed_files)} files: {failed_files}")
    
    # Concatenate along frequency dimension
    logger.info(f"Concatenating {len(datasets)} datasets along frequency dimension")
    ds_combined = xr.concat(datasets, dim="frequency")
    
    return ds_combined

## Main Conversion Function

In [5]:
def convert_fits_to_zarr(fits_pattern: str,
                        output_path: str,
                        chunks: Optional[Dict[str, int]] = None,
                        do_sky_coords: bool = True,
                        overwrite: bool = False) -> xr.Dataset:
    """
    Convert FITS files to Zarr format with comprehensive error handling.
    
    Args:
        fits_pattern: Glob pattern for finding FITS files
        output_path: Path for output Zarr store
        chunks: Chunking strategy (uses config default if None)
        do_sky_coords: Whether to compute sky coordinates
        overwrite: Whether to overwrite existing Zarr store
        
    Returns:
        Final combined dataset
        
    Raises:
        FileExistsError: If output exists and overwrite=False
        Exception: For various processing errors
    """
    if chunks is None:
        chunks = config.CHUNKS
    
    output_path = Path(output_path)
    
    # Check if output already exists
    if output_path.exists() and not overwrite:
        raise FileExistsError(f"Output path {output_path} already exists. "
                             f"Set overwrite=True to replace it.")
    
    # Step 1: Discover and group files
    logger.info("Starting FITS to Zarr conversion")
    fits_files = discover_fits_files(fits_pattern)
    groups = group_files_by_timestamp(fits_files)
    
    if not groups:
        raise Exception("No valid file groups found")
    
    # Step 2: Process each timestamp
    time_keys = sorted(groups.keys())
    logger.info(f"Processing {len(time_keys)} timestamps: {time_keys}")
    
    first_write = True
    processed_timestamps = []
    
    for i, timestamp in enumerate(time_keys, 1):
        logger.info(f"Processing timestamp {i}/{len(time_keys)}: {timestamp}")
        
        try:
            # Sort files by frequency for this timestamp
            fits_at_timestamp = sort_files_by_frequency(groups[timestamp])
            
            # Build time slice dataset
            ds_time_slice = build_time_slice(
                fits_at_timestamp, 
                chunks, 
                do_sky_coords
            )
            
            # Log dataset information
            logger.info(f"Time slice shape: {dict(ds_time_slice.sizes)}")
            logger.info(f"Frequencies: {ds_time_slice.frequency.values}")
            logger.info(f"Time: {ds_time_slice.time.values}")
            
            # Write to Zarr
            if first_write:
                logger.info(f"Creating new Zarr store: {output_path}")
                ds_time_slice.to_zarr(output_path, mode="w")
                first_write = False
            else:
                logger.info(f"Appending to Zarr store: {output_path}")
                ds_time_slice.to_zarr(output_path, mode="a", append_dim="time")
            
            processed_timestamps.append(timestamp)
            logger.info(f"Successfully processed timestamp {timestamp}")
            
        except Exception as e:
            logger.error(f"Failed to process timestamp {timestamp}: {e}")
            # Continue with next timestamp rather than failing completely
            continue
    
    if not processed_timestamps:
        raise Exception("No timestamps were successfully processed")
    
    logger.info(f"Successfully processed {len(processed_timestamps)}/{len(time_keys)} timestamps")
    
    # Step 3: Load and return final dataset
    logger.info("Loading final dataset for verification")
    final_dataset = xr.open_zarr(output_path)
    
    return final_dataset

## File Discovery and Analysis

Let's start by discovering the available FITS files and analyzing their structure.

In [6]:
# Discover FITS files
try:
    fits_files = discover_fits_files(config.FITS_PATTERN)
    print(f"Found {len(fits_files)} FITS files")
    
    # Show first few files as examples
    print("\nFirst 5 files:")
    for i, file in enumerate(fits_files[:5], 1):
        timestamp, frequency = parse_filename_metadata(file)
        print(f"  {i}. {Path(file).name}")
        print(f"     Timestamp: {timestamp}, Frequency: {frequency}")
    
    if len(fits_files) > 5:
        print(f"  ... and {len(fits_files) - 5} more files")
        
except Exception as e:
    print(f"Error discovering files: {e}")
    # This will allow the notebook to continue even if files aren't found

2025-10-06 15:40:37,642 - INFO - Discovered 50 FITS files


Found 50 FITS files

First 5 files:
  1. 20240524_050009_41MHz_averaged_20000_iterations-I-image.fits
     Timestamp: 20240524_050009, Frequency: 41MHz
  2. 20240524_050009_46MHz_averaged_20000_iterations-I-image.fits
     Timestamp: 20240524_050009, Frequency: 46MHz
  3. 20240524_050009_50MHz_averaged_20000_iterations-I-image.fits
     Timestamp: 20240524_050009, Frequency: 50MHz
  4. 20240524_050009_55MHz_averaged_20000_iterations-I-image.fits
     Timestamp: 20240524_050009, Frequency: 55MHz
  5. 20240524_050009_59MHz_averaged_20000_iterations-I-image.fits
     Timestamp: 20240524_050009, Frequency: 59MHz
  ... and 45 more files


In [7]:
# Group files by timestamp and analyze the structure
try:
    groups = group_files_by_timestamp(fits_files)
    
    print(f"\nFile grouping analysis:")
    print(f"Total timestamps: {len(groups)}")
    
    # Analyze frequencies per timestamp
    freq_counts = {}
    all_frequencies = set()
    
    for timestamp, files in groups.items():
        freq_count = len(files)
        freq_counts[timestamp] = freq_count
        
        # Extract frequencies for this timestamp
        freqs = []
        for file in files:
            _, freq = parse_filename_metadata(file)
            freqs.append(freq)
            all_frequencies.add(freq)
        
        print(f"  {timestamp}: {freq_count} frequencies - {sorted(freqs)}")
    
    print(f"\nOverall statistics:")
    print(f"  Unique frequencies: {len(all_frequencies)}")
    print(f"  Frequency range: {sorted(all_frequencies)}")
    print(f"  Files per timestamp: {list(freq_counts.values())}")
    
    # Check for consistency
    freq_counts_values = list(freq_counts.values())
    if len(set(freq_counts_values)) == 1:
        print(f"  ✓ Consistent: All timestamps have {freq_counts_values[0]} frequencies")
    else:
        print(f"  ⚠ Inconsistent: Timestamps have varying numbers of frequencies")
        
except Exception as e:
    print(f"Error analyzing file structure: {e}")

2025-10-06 15:40:38,747 - INFO - Grouped files into 5 timestamps



File grouping analysis:
Total timestamps: 5
  20240524_050009: 10 frequencies - ['41MHz', '46MHz', '50MHz', '55MHz', '59MHz', '64MHz', '69MHz', '73MHz', '78MHz', '82MHz']
  20240524_050019: 10 frequencies - ['41MHz', '46MHz', '50MHz', '55MHz', '59MHz', '64MHz', '69MHz', '73MHz', '78MHz', '82MHz']
  20240524_050029: 10 frequencies - ['41MHz', '46MHz', '50MHz', '55MHz', '59MHz', '64MHz', '69MHz', '73MHz', '78MHz', '82MHz']
  20240524_050039: 10 frequencies - ['41MHz', '46MHz', '50MHz', '55MHz', '59MHz', '64MHz', '69MHz', '73MHz', '78MHz', '82MHz']
  20240524_050049: 10 frequencies - ['41MHz', '46MHz', '50MHz', '55MHz', '59MHz', '64MHz', '69MHz', '73MHz', '78MHz', '82MHz']

Overall statistics:
  Unique frequencies: 10
  Frequency range: ['41MHz', '46MHz', '50MHz', '55MHz', '59MHz', '64MHz', '69MHz', '73MHz', '78MHz', '82MHz']
  Files per timestamp: [10, 10, 10, 10, 10]
  ✓ Consistent: All timestamps have 10 frequencies
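
The consistency check above compares only the number of files per timestamp, so two timestamps with the same count but different frequencies would still pass. A possible stricter check compares the frequency labels themselves. This is a sketch with a hypothetical helper (`missing_frequencies`) and a made-up file list, not part of the conversion code:

```python
from collections import defaultdict
from pathlib import Path


def missing_frequencies(paths):
    """Map each timestamp to the frequency labels it lacks relative to the union."""
    by_time = defaultdict(set)
    for p in paths:
        parts = Path(p).stem.split("_")
        by_time[f"{parts[0]}_{parts[1]}"].add(parts[2])
    # Union of every frequency label seen at any timestamp
    all_freqs = set().union(*by_time.values())
    return {t: sorted(all_freqs - f) for t, f in by_time.items() if all_freqs - f}


# Hypothetical file list: both timestamps have two files, but different frequencies.
demo = [
    "20240524_050009_41MHz_x-I-image.fits",
    "20240524_050009_46MHz_x-I-image.fits",
    "20240524_050019_41MHz_x-I-image.fits",
    "20240524_050019_50MHz_x-I-image.fits",
]
print(missing_frequencies(demo))
```

An empty dictionary means every timestamp carries the full set of frequency labels.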


## Conversion Execution

Now let's perform the actual conversion from FITS to Zarr format.

In [8]:
# Perform the conversion
import time

try:
    print("Starting FITS to Zarr conversion...")
    start_time = time.time()
    
    # Run the conversion
    final_dataset = convert_fits_to_zarr(
        fits_pattern=config.FITS_PATTERN,
        output_path=config.OUTPUT_ZARR,
        chunks=config.CHUNKS,
        do_sky_coords=config.DO_SKY_COORDS,
        overwrite=True  # Allow overwriting for demonstration
    )
    
    end_time = time.time()
    conversion_time = end_time - start_time
    
    print(f"\n✓ Conversion completed successfully in {conversion_time:.2f} seconds")
    print(f"Output saved to: {config.OUTPUT_ZARR}")
    
except Exception as e:
    print(f"❌ Conversion failed: {e}")
    print("Check the logs above for detailed error information")

2025-10-06 15:40:53,905 - INFO - Starting FITS to Zarr conversion
2025-10-06 15:40:53,906 - INFO - Discovered 50 FITS files
2025-10-06 15:40:53,907 - INFO - Grouped files into 5 timestamps
2025-10-06 15:40:53,908 - INFO - Processing 5 timestamps: ['20240524_050009', '20240524_050019', '20240524_050029', '20240524_050039', '20240524_050049']
2025-10-06 15:40:53,908 - INFO - Processing timestamp 1/5: 20240524_050009
2025-10-06 15:40:53,908 - INFO - Processing 10 files for time slice


Starting FITS to Zarr conversion...
not every header key found
not every header key found
not every header key found
not every header key found
not every header key found
not every header key found
not every header key found
not every header key found
not every header key found
not every header key found


2025-10-06 15:41:09,016 - INFO - Concatenating 10 datasets along frequency dimension
2025-10-06 15:41:09,739 - INFO - Time slice shape: {'time': 1, 'polarization': 1, 'frequency': 10, 'l': 4096, 'm': 4096}
2025-10-06 15:41:09,741 - INFO - Frequencies: [43245849.609375 47839599.609375 52433349.609375 57027099.609375
 61620849.609375 66214599.609375 70808349.609375 75402099.609375
 79995849.609375 84589599.609375]
2025-10-06 15:41:09,742 - INFO - Time: [60454.20843981]
2025-10-06 15:41:09,742 - INFO - Creating new Zarr store: all_times_freqs.zarr
2025-10-06 15:41:16,504 - INFO - Successfully processed timestamp 20240524_050009
2025-10-06 15:41:16,504 - INFO - Processing timestamp 2/5: 20240524_050019
2025-10-06 15:41:16,504 - INFO - Processing 10 files for time slice


not every header key found
not every header key found
not every header key found
not every header key found
not every header key found
not every header key found
not every header key found
not every header key found
not every header key found
not every header key found


2025-10-06 15:41:30,913 - INFO - Concatenating 10 datasets along frequency dimension
2025-10-06 15:41:31,487 - INFO - Time slice shape: {'time': 1, 'polarization': 1, 'frequency': 10, 'l': 4096, 'm': 4096}
2025-10-06 15:41:31,488 - INFO - Frequencies: [43245849.609375 47839599.609375 52433349.609375 57027099.609375
 61620849.609375 66214599.609375 70808349.609375 75402099.609375
 79995849.609375 84589599.609375]
2025-10-06 15:41:31,488 - INFO - Time: [60454.20855556]
2025-10-06 15:41:31,488 - INFO - Appending to Zarr store: all_times_freqs.zarr
2025-10-06 15:41:35,115 - INFO - Successfully processed timestamp 20240524_050019
2025-10-06 15:41:35,116 - INFO - Processing timestamp 3/5: 20240524_050029
2025-10-06 15:41:35,116 - INFO - Processing 10 files for time slice


not every header key found
not every header key found
not every header key found
not every header key found
not every header key found
not every header key found
not every header key found
not every header key found
not every header key found
not every header key found


2025-10-06 15:41:49,882 - INFO - Concatenating 10 datasets along frequency dimension
2025-10-06 15:41:50,759 - INFO - Time slice shape: {'time': 1, 'polarization': 1, 'frequency': 10, 'l': 4096, 'm': 4096}
2025-10-06 15:41:50,760 - INFO - Frequencies: [43245849.609375 47839599.609375 52433349.609375 57027099.609375
 61620849.609375 66214599.609375 70808349.609375 75402099.609375
 79995849.609375 84589599.609375]
2025-10-06 15:41:50,760 - INFO - Time: [60454.2086713]
2025-10-06 15:41:50,760 - INFO - Appending to Zarr store: all_times_freqs.zarr
2025-10-06 15:41:54,814 - INFO - Successfully processed timestamp 20240524_050029
2025-10-06 15:41:54,815 - INFO - Processing timestamp 4/5: 20240524_050039
2025-10-06 15:41:54,815 - INFO - Processing 10 files for time slice


not every header key found
not every header key found
not every header key found
not every header key found
not every header key found
not every header key found
not every header key found
not every header key found
not every header key found
not every header key found


2025-10-06 15:42:08,763 - INFO - Concatenating 10 datasets along frequency dimension
2025-10-06 15:42:09,285 - INFO - Time slice shape: {'time': 1, 'polarization': 1, 'frequency': 10, 'l': 4096, 'm': 4096}
2025-10-06 15:42:09,286 - INFO - Frequencies: [43245849.609375 47839599.609375 52433349.609375 57027099.609375
 61620849.609375 66214599.609375 70808349.609375 75402099.609375
 79995849.609375 84589599.609375]
2025-10-06 15:42:09,286 - INFO - Time: [60454.20878819]
2025-10-06 15:42:09,286 - INFO - Appending to Zarr store: all_times_freqs.zarr
2025-10-06 15:42:13,068 - INFO - Successfully processed timestamp 20240524_050039
2025-10-06 15:42:13,069 - INFO - Processing timestamp 5/5: 20240524_050049
2025-10-06 15:42:13,069 - INFO - Processing 10 files for time slice


not every header key found
not every header key found
not every header key found
not every header key found
not every header key found
not every header key found
not every header key found
not every header key found
not every header key found
not every header key found


2025-10-06 15:42:27,575 - INFO - Concatenating 10 datasets along frequency dimension
2025-10-06 15:42:28,392 - INFO - Time slice shape: {'time': 1, 'polarization': 1, 'frequency': 10, 'l': 4096, 'm': 4096}
2025-10-06 15:42:28,392 - INFO - Frequencies: [43245849.609375 47839599.609375 52433349.609375 57027099.609375
 61620849.609375 66214599.609375 70808349.609375 75402099.609375
 79995849.609375 84589599.609375]
2025-10-06 15:42:28,393 - INFO - Time: [60454.20890394]
2025-10-06 15:42:28,393 - INFO - Appending to Zarr store: all_times_freqs.zarr
2025-10-06 15:42:32,630 - INFO - Successfully processed timestamp 20240524_050049
2025-10-06 15:42:32,630 - INFO - Successfully processed 5/5 timestamps
2025-10-06 15:42:32,630 - INFO - Loading final dataset for verification



✓ Conversion completed successfully in 98.80 seconds
Output saved to: all_times_freqs.zarr


## Dataset Verification and Analysis

Let's examine the final Zarr dataset to verify the conversion was successful.

In [9]:
# Verify and analyze the final dataset
try:
    # Reload the dataset to ensure it's properly stored
    ds_final = xr.open_zarr(config.OUTPUT_ZARR)
    
    print("=== FINAL DATASET SUMMARY ===")
    print(f"Dataset dimensions: {dict(ds_final.sizes)}")
    print(f"Data variables: {list(ds_final.data_vars.keys())}")
    print(f"Coordinates: {list(ds_final.coords.keys())}")
    
    print(f"\n=== COORDINATE DETAILS ===")
    print(f"Time range: {ds_final.time.values[0]} to {ds_final.time.values[-1]}")
    print(f"Number of time steps: {len(ds_final.time)}")
    print(f"Time values: {ds_final.time.values}")
    
    print(f"\nFrequency range: {ds_final.frequency.values[0]:.1f} to {ds_final.frequency.values[-1]:.1f}")
    print(f"Number of frequencies: {len(ds_final.frequency)}")
    print(f"Frequency values: {ds_final.frequency.values}")
    
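    print(f"\nSpatial dimensions:")
    # Note: l and m are image-plane direction-cosine axes; the longitude/latitude labels are approximate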
    print(f"  l (longitude): {ds_final.sizes['l']} pixels")
    print(f"  m (latitude): {ds_final.sizes['m']} pixels")
    
    if 'right_ascension' in ds_final.coords:
        print(f"  Right ascension range: {ds_final.right_ascension.values.min():.3f} to {ds_final.right_ascension.values.max():.3f}")
    if 'declination' in ds_final.coords:
        print(f"  Declination range: {ds_final.declination.values.min():.3f} to {ds_final.declination.values.max():.3f}")
    
    print(f"\n=== STORAGE INFORMATION ===")
    zarr_path = Path(config.OUTPUT_ZARR)
    if zarr_path.exists():
        # Calculate storage size
        total_size = sum(f.stat().st_size for f in zarr_path.rglob('*') if f.is_file())
        size_mb = total_size / (1024 * 1024)
        print(f"Zarr store size: {size_mb:.2f} MB")
        
    print(f"Chunking strategy: {config.CHUNKS}")
    
    # Display the dataset structure
    print(f"\n=== DATASET STRUCTURE ===")
    print(ds_final)
    
except Exception as e:
    print(f"Error loading final dataset: {e}")
    print("The conversion may have failed or the output file may be corrupted")

=== FINAL DATASET SUMMARY ===
Dataset dimensions: {'time': 5, 'polarization': 1, 'frequency': 10, 'l': 4096, 'm': 4096}
Data variables: ['SKY']
Coordinates: ['declination', 'frequency', 'l', 'm', 'polarization', 'right_ascension', 'time', 'velocity']

=== COORDINATE DETAILS ===
Time range: 60454.208439814814 to 60454.208903935185
Number of time steps: 5
Time values: [60454.20843981 60454.20855556 60454.2086713  60454.20878819
 60454.20890394]

Frequency range: 43245849.6 to 84589599.6
Number of frequencies: 10
Frequency values: [43245849.609375 47839599.609375 52433349.609375 57027099.609375
 61620849.609375 66214599.609375 70808349.609375 75402099.609375
 79995849.609375 84589599.609375]

Spatial dimensions:
  l (longitude): 4096 pixels
  m (latitude): 4096 pixels
  Right ascension range: nan to nan
  Declination range: nan to nan

=== STORAGE INFORMATION ===
Zarr store size: 4439.04 MB
Chunking strategy: {'l': 1024, 'm': 1024}

=== DATASET STRUCTURE ===
<xarray.Dataset> Size: 6GB
Dim
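
The `nan to nan` ranges reported above do not necessarily mean the sky coordinates are missing. NumPy's `.min()` and `.max()` return NaN as soon as any element is NaN. One plausible cause, which is an assumption and not verified here, is that some image pixels do not map to a valid sky position. The sketch below uses a small toy array in place of the real `right_ascension` grid to show NaN-aware reductions:

```python
import numpy as np

# Toy stand-in for a 2-D right-ascension grid; NaN marks pixels without a sky coordinate.
ra = np.array([[np.nan, 1.20, 1.25],
               [1.10, 1.15, np.nan]])

# Plain reductions propagate NaN
plain_min, plain_max = ra.min(), ra.max()

# NaN-aware reductions ignore the invalid pixels
ra_min, ra_max = np.nanmin(ra), np.nanmax(ra)
n_valid = int(np.isfinite(ra).sum())

print(f"plain: {plain_min} to {plain_max}")
print(f"nan-aware: {ra_min:.3f} to {ra_max:.3f} ({n_valid} of {ra.size} pixels valid)")
```

Reporting the count of valid pixels alongside the range makes it easier to tell a partially covered grid from one that is entirely NaN.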

## Data Quality Checks

Let's perform some basic data quality checks on the converted dataset.

In [10]:
# Perform data quality checks
try:
    ds = xr.open_zarr(config.OUTPUT_ZARR)
    
    print("=== DATA QUALITY CHECKS ===")
    
    # Check for missing data
    if 'SKY' in ds.data_vars:
        sky_data = ds['SKY']
        
        # Basic statistics
        print(f"\nSKY data statistics:")
        print(f"  Shape: {sky_data.shape}")
        print(f"  Data type: {sky_data.dtype}")
        print(f"  Min value: {sky_data.min().values:.6f}")
        print(f"  Max value: {sky_data.max().values:.6f}")
        print(f"  Mean value: {sky_data.mean().values:.6f}")
        print(f"  Standard deviation: {sky_data.std().values:.6f}")
        
        # Check for NaN values
        nan_count = np.isnan(sky_data).sum().values
        total_count = sky_data.size
        print(f"  NaN values: {nan_count} / {total_count} ({100*nan_count/total_count:.2f}%)")
        
        # Check for infinite values
        inf_count = np.isinf(sky_data).sum().values
        print(f"  Infinite values: {inf_count} / {total_count} ({100*inf_count/total_count:.2f}%)")
        
        # Sample a small subset for detailed inspection
        print(f"\nSample data (first time, first frequency, center 5x5 pixels):")
        center_l = sky_data.sizes['l'] // 2
        center_m = sky_data.sizes['m'] // 2
        sample = sky_data.isel(
            time=0, 
            frequency=0,
            l=slice(center_l-2, center_l+3),
            m=slice(center_m-2, center_m+3)
        )
        print(sample.values)
    
    # Check coordinate consistency
    print(f"\n=== COORDINATE CONSISTENCY ===")
    
    # Time coordinate checks
    time_diffs = np.diff(ds.time.values)
    if len(time_diffs) > 0:
        time_diffs_seconds = time_diffs.astype('timedelta64[s]').astype(float)
        print(f"Time intervals (seconds): {time_diffs_seconds}")
        if len(set(time_diffs_seconds)) == 1:
            print(f"  ✓ Regular time intervals: {time_diffs_seconds[0]} seconds")
        else:
            print(f"  ⚠ Irregular time intervals")
    
    # Frequency coordinate checks
    freq_diffs = np.diff(ds.frequency.values)
    if len(freq_diffs) > 0:
        print(f"Frequency intervals (Hz): {freq_diffs}")
        if np.allclose(freq_diffs, freq_diffs[0]):
            print(f"  ✓ Regular frequency intervals: {freq_diffs[0]:.0f} Hz")
        else:
            print(f"  ⚠ Irregular frequency intervals")
    
    print(f"\n✓ Data quality checks completed")
    
except Exception as e:
    print(f"Error during data quality checks: {e}")

=== DATA QUALITY CHECKS ===

SKY data statistics:
  Shape: (5, 1, 10, 4096, 4096)
  Data type: float32
  Min value: -150.958191
  Max value: 571.997986
  Mean value: 0.007101
  Standard deviation: 2.189538
  NaN values: 0 / 838860800 (0.00%)
  Infinite values: 0 / 838860800 (0.00%)

Sample data (first time, first frequency, center 5x5 pixels):
[[[-1.6273046  -1.3366616  -0.8624843  -0.24447401  0.42338887]
  [-1.2882841  -1.1843743  -0.9738831  -0.63234526 -0.1995452 ]
  [-0.6248585  -0.63417983 -0.66482204 -0.6446685  -0.54869765]
  [ 0.22863458  0.20751776  0.00333057 -0.28743058 -0.5715446 ]
  [ 1.0695328   1.124087    0.83592325  0.3007557  -0.3295514 ]]]

=== COORDINATE CONSISTENCY ===
Time intervals (seconds): [0. 0. 0. 0.]
  ✓ Regular time intervals: 0.0 seconds
Frequency intervals (Hz): [4593750. 4593750. 4593750. 4593750. 4593750. 4593750. 4593750. 4593750.
 4593750.]
  ✓ Regular frequency intervals: 4593750 Hz

✓ Data quality checks completed
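
The `0.0 seconds` intervals reported above are probably an artifact of the check, not a property of the data. The time values are floats near 60454.2, which matches Modified Julian Date in days for 2024-05-24 at about 05:00, in agreement with the filename timestamps. Casting differences of about 0.0001 day to `timedelta64[s]` would truncate them to zero. A minimal sketch of the interval check, assuming `time` is MJD in days and using the values printed in the dataset summary:

```python
import numpy as np

# Time values copied from the dataset summary printed earlier in this notebook.
mjd_days = np.array([
    60454.20843981, 60454.20855556, 60454.2086713,
    60454.20878819, 60454.20890394,
])

SECONDS_PER_DAY = 86400.0

# Assumption: `time` is MJD in days, so differences are fractions of a day.
intervals_s = np.diff(mjd_days) * SECONDS_PER_DAY
print(np.round(intervals_s, 2))

# Compare against the median with a tolerance instead of exact float equality.
is_regular = bool(np.allclose(intervals_s, np.median(intervals_s), atol=0.5))
print("regular" if is_regular else "irregular")
```

With these values the intervals come out at roughly 10 seconds each, which is consistent with the 10-second spacing of the filename timestamps.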


## Performance and Usage Examples

Some examples of how to efficiently work with the Zarr dataset.

In [11]:
# Demonstrate efficient data access patterns
try:
    ds = xr.open_zarr(config.OUTPUT_ZARR)
    
    print("=== EFFICIENT DATA ACCESS EXAMPLES ===")
    
    # Example 1: Select specific time and frequency
    if ds.sizes['time'] > 0 and ds.sizes['frequency'] > 0:
        print("\n1. Selecting specific time and frequency:")
        subset = ds.isel(time=0, frequency=0)
        print(f"   Selected shape: {dict(subset.sizes)}")
        print(f"   Memory usage: ~{subset.nbytes / 1024**2:.2f} MB")
    
    # Example 2: Select frequency range
    if ds.sizes['frequency'] > 2:
        print("\n2. Selecting frequency range:")
        freq_subset = ds.isel(frequency=slice(0, 3))
        print(f"   Selected frequencies: {freq_subset.frequency.values}")
        print(f"   Selected shape: {dict(freq_subset.sizes)}")
    
    # Example 3: Spatial subset
    if ds.sizes['l'] > 100 and ds.sizes['m'] > 100:
        print("\n3. Selecting spatial subset (center 100x100 pixels):")
        center_l = ds.sizes['l'] // 2
        center_m = ds.sizes['m'] // 2
        spatial_subset = ds.isel(
            l=slice(center_l-50, center_l+50),
            m=slice(center_m-50, center_m+50)
        )
        print(f"   Selected shape: {dict(spatial_subset.sizes)}")
        print(f"   Memory usage: ~{spatial_subset.nbytes / 1024**2:.2f} MB")
    
    # Example 4: Compute statistics along dimensions
    if 'SKY' in ds.data_vars:
        print("\n4. Computing statistics along dimensions:")
        
        # Time average
        print("   Computing time average...")
        time_avg = ds['SKY'].mean(dim='time')
        print(f"     Result shape: {dict(time_avg.sizes)}")
        
        # Frequency average
        print("   Computing frequency average...")
        freq_avg = ds['SKY'].mean(dim='frequency')
        print(f"     Result shape: {dict(freq_avg.sizes)}")
    
    # Example 5: Memory-efficient processing with dask
    print("\n5. Memory-efficient processing tips:")
    print("   - Use .isel() for integer indexing")
    print("   - Use .sel() for label-based indexing")
    print("   - Chain operations before calling .compute() or .load()")
    print("   - Use .chunk() to rechunk data if needed")
    print(f"   - Current chunk sizes: {dict(ds.chunks) if hasattr(ds, 'chunks') else 'No chunking info'}")
    
    print(f"\n✓ Performance examples completed")
    
except Exception as e:
    print(f"Error in performance examples: {e}")

=== EFFICIENT DATA ACCESS EXAMPLES ===

1. Selecting specific time and frequency:
   Selected shape: {'polarization': 1, 'l': 4096, 'm': 4096}
   Memory usage: ~320.06 MB

2. Selecting frequency range:
   Selected frequencies: [43245849.609375 47839599.609375 52433349.609375]
   Selected shape: {'time': 5, 'polarization': 1, 'frequency': 3, 'l': 4096, 'm': 4096}

3. Selecting spatial subset (center 100x100 pixels):
   Selected shape: {'time': 5, 'polarization': 1, 'frequency': 10, 'l': 100, 'm': 100}
   Memory usage: ~3.43 MB

4. Computing statistics along dimensions:
   Computing time average...
     Result shape: {'polarization': 1, 'frequency': 10, 'l': 4096, 'm': 4096}
   Computing frequency average...
     Result shape: {'time': 5, 'polarization': 1, 'l': 4096, 'm': 4096}

5. Memory-efficient processing tips:
   - Use .isel() for integer indexing
   - Use .sel() for label-based indexing
   - Chain operations before calling .compute() or .load()
   - Use .chunk() to rechunk data if
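
One practical detail for label-based indexing: the `frequency` coordinate holds header values in Hz (for example 47839599.609375), not the rounded MHz labels from the filenames, so an exact `.sel()` on a round number will not match. A sketch using a small in-memory array with the first three frequency values from this dataset; the data values are placeholders:

```python
import numpy as np
import xarray as xr

# Frequency labels copied from the dataset summary above (Hz); pixel values are placeholders.
freqs_hz = np.array([43245849.609375, 47839599.609375, 52433349.609375])
toy = xr.DataArray(
    np.zeros((3, 4, 4), dtype="float32"),
    dims=("frequency", "l", "m"),
    coords={"frequency": freqs_hz},
    name="SKY",
)

# toy.sel(frequency=46e6) would raise KeyError: 46e6 is not a coordinate value.
# method="nearest" picks the closest available label instead.
plane = toy.sel(frequency=46e6, method="nearest")
print(float(plane.frequency), plane.shape)
```

The same call pattern should apply to the Zarr-backed dataset, where the selection stays lazy until `.compute()` or `.load()` is called.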

## Summary and Next Steps

### What we accomplished:

1. **Improved Code Structure**: Organized the conversion logic into well-documented functions with proper error handling
2. **Enhanced Error Handling**: Added `try`/`except` blocks and logging around each processing step for easier debugging
3. **Type Hints**: Added proper type annotations for better code clarity and IDE support
4. **Configuration Management**: Centralized configuration in a Config class for easy modification
5. **Data Validation**: Added quality checks and verification steps
6. **Performance Optimization**: Demonstrated efficient data access patterns for large datasets

### Key Improvements over original script:

- **Modularity**: Functions are reusable and testable
- **Robustness**: Better error handling and recovery
- **Documentation**: Comprehensive docstrings and comments
- **Logging**: Proper logging for monitoring progress
- **Validation**: Data quality checks and verification
- **Flexibility**: Configurable parameters and options

### Next Steps:

1. **Add visualization capabilities** for data inspection
2. **Implement parallel processing** for larger datasets
3. **Add metadata preservation** from original FITS headers
4. **Create automated tests** for the conversion functions
5. **Add compression options** for Zarr storage optimization
6. **Implement incremental updates** for new data files

### Usage in production:

```python
# Simple usage
dataset = convert_fits_to_zarr(
    fits_pattern="path/to/fits/*.fits",
    output_path="output.zarr",
    overwrite=True
)

# Advanced usage with custom configuration
config = Config()
config.CHUNKS = {"l": 2048, "m": 2048}  # Larger chunks: 16 MiB each for float32 data
config.DO_SKY_COORDS = False  # Skip sky coordinates for speed

# overwrite defaults to False, so this raises FileExistsError if the store already exists
dataset = convert_fits_to_zarr(
    fits_pattern=config.FITS_PATTERN,
    output_path=config.OUTPUT_ZARR,
    chunks=config.CHUNKS,
    do_sky_coords=config.DO_SKY_COORDS
)
```