# SciTeX IO Module Tutorial

This notebook demonstrates the file I/O capabilities of the `scitex.io` module.

## Key Features

- **Universal Load/Save**: Automatic format detection for 20+ file types
- **Scientific Data Support**: NumPy, MATLAB, HDF5, EEG formats
- **Configuration Management**: YAML config loading with debug modes
- **Smart File Patterns**: Enhanced glob with parsing
- **Caching System**: Speed up expensive computations
- **matplotlib Integration**: Plots with automatic data export

Let's explore these capabilities with practical examples!

In [None]:
# Import the scitex io module
import scitex as stx
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
import tempfile
import os

# Create a temporary directory for examples
temp_dir = Path(tempfile.mkdtemp())
print(f"Working in temporary directory: {temp_dir}")

## 1. Universal Load/Save Operations

The central feature of scitex.io is its pair of universal functions, `load()` and `save()`, which detect the file format automatically:

In [None]:
# Create sample data
data_dict = {
    'numbers': np.random.randn(100),
    'dataframe': pd.DataFrame({
        'A': np.random.randn(50),
        'B': np.random.randint(0, 10, 50),
        'C': np.random.choice(['cat', 'dog', 'bird'], 50)
    }),
    'metadata': {'experiment': 'tutorial', 'version': 1.0}
}

print("Created sample data:")
print(f"- Array shape: {data_dict['numbers'].shape}")
print(f"- DataFrame shape: {data_dict['dataframe'].shape}")
print(f"- Metadata: {data_dict['metadata']}")

In [None]:
# Save to different formats - scitex automatically detects the format!
formats_to_demo = {
    'pickle': temp_dir / 'data.pkl',
    'json': temp_dir / 'metadata.json', 
    'csv': temp_dir / 'dataframe.csv',
    'numpy': temp_dir / 'numbers.npy',
    'yaml': temp_dir / 'config.yaml'
}

# Save complete data as pickle
stx.io.save(data_dict, formats_to_demo['pickle'])
print(f"✅ Saved complete data to: {formats_to_demo['pickle'].name}")

# Save just metadata as JSON
stx.io.save(data_dict['metadata'], formats_to_demo['json'])
print(f"✅ Saved metadata to: {formats_to_demo['json'].name}")

# Save DataFrame as CSV
stx.io.save(data_dict['dataframe'], formats_to_demo['csv'])
print(f"✅ Saved DataFrame to: {formats_to_demo['csv'].name}")

# Save array as NumPy
stx.io.save(data_dict['numbers'], formats_to_demo['numpy'])
print(f"✅ Saved array to: {formats_to_demo['numpy'].name}")

# Save config as YAML
config = {'data_path': './data/', 'batch_size': 32, 'learning_rate': 0.001}
stx.io.save(config, formats_to_demo['yaml'])
print(f"✅ Saved config to: {formats_to_demo['yaml'].name}")

In [None]:
# Load back the data - scitex automatically detects formats!
print("Loading data back:")

loaded_data = stx.io.load(formats_to_demo['pickle'])
print(f"✅ Loaded complete data: {type(loaded_data)}")

loaded_metadata = stx.io.load(formats_to_demo['json'])
print(f"✅ Loaded metadata: {loaded_metadata}")

loaded_df = stx.io.load(formats_to_demo['csv'])
print(f"✅ Loaded DataFrame: {loaded_df.shape}")

loaded_array = stx.io.load(formats_to_demo['numpy'])
print(f"✅ Loaded array: {loaded_array.shape}")

loaded_config = stx.io.load(formats_to_demo['yaml'])
print(f"✅ Loaded config: {loaded_config}")

# Verify data integrity
print("\nData integrity check:")
print(f"Arrays equal: {np.allclose(data_dict['numbers'], loaded_array)}")
print(f"DataFrames equal: {data_dict['dataframe'].equals(loaded_df)}")
print(f"Metadata equal: {data_dict['metadata'] == loaded_metadata}")
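
To make the idea of format detection concrete, here is a minimal sketch of suffix-based dispatch using only the standard library. It is an illustration under assumptions, not scitex's implementation: the `HANDLERS` table and the local `save`/`load` functions are hypothetical names introduced for this example.

```python
import json
import pickle
import tempfile
from pathlib import Path


def _save_json(obj, path):
    path.write_text(json.dumps(obj))


def _load_json(path):
    return json.loads(path.read_text())


def _save_pickle(obj, path):
    path.write_bytes(pickle.dumps(obj))


def _load_pickle(path):
    return pickle.loads(path.read_bytes())


# Hypothetical registry: file suffix -> (writer, reader).
HANDLERS = {
    ".json": (_save_json, _load_json),
    ".pkl": (_save_pickle, _load_pickle),
}


def _handler_for(path):
    """Look up the (writer, reader) pair from the path's suffix."""
    suffix = Path(path).suffix.lower()
    if suffix not in HANDLERS:
        raise ValueError(f"Unsupported file extension: {suffix!r}")
    return HANDLERS[suffix]


def save(obj, path):
    """Write obj with the writer registered for the path's suffix."""
    writer, _ = _handler_for(path)
    writer(obj, Path(path))


def load(path):
    """Read a file with the reader registered for the path's suffix."""
    _, reader = _handler_for(path)
    return reader(Path(path))


# Round-trip the same object through two formats.
with tempfile.TemporaryDirectory() as tmp:
    meta = {"experiment": "tutorial", "version": 1.0}
    save(meta, Path(tmp) / "metadata.json")
    save(meta, Path(tmp) / "metadata.pkl")
    json_roundtrip = load(Path(tmp) / "metadata.json")
    pickle_roundtrip = load(Path(tmp) / "metadata.pkl")

print(json_roundtrip == meta, pickle_roundtrip == meta)
```

With this design, supporting a new format means adding one entry to the registry; callers keep using the same two functions.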

## 2. Enhanced File Pattern Matching

`stx.io.glob` extends standard globbing with natural sorting, curly-brace expansion, and placeholder parsing that extracts named fields from the matched paths:

In [None]:
# Create sample file structure for demonstration
sample_files = [
    'subject_001/session_01.csv',
    'subject_001/session_02.csv', 
    'subject_002/session_01.csv',
    'subject_002/session_02.csv',
    'subject_003/session_01.csv',
    'train/data_001.txt',
    'train/data_002.txt',
    'test/data_001.txt',
    'validation/data_001.txt'
]

# Create the directory structure
for file_path in sample_files:
    full_path = temp_dir / file_path
    full_path.parent.mkdir(parents=True, exist_ok=True)
    # Create dummy data
    dummy_data = pd.DataFrame({
        'x': np.random.randn(10), 
        'y': np.random.randn(10)
    })
    full_path.write_text(dummy_data.to_csv(index=False))

print(f"Created {len(sample_files)} sample files")

In [None]:
# Change to temp directory for glob examples
original_cwd = os.getcwd()
os.chdir(temp_dir)

try:
    # Basic glob with natural sorting
    all_csv = stx.io.glob('subject_*/session_*.csv')
    print(f"Found {len(all_csv)} CSV files:")
    for f in all_csv:
        print(f"  {f}")
        
    # Curly brace expansion
    train_test_files = stx.io.glob('{train,test}/data_*.txt')
    print(f"\nFound {len(train_test_files)} train/test files:")
    for f in train_test_files:
        print(f"  {f}")
        
    # Glob with parsing - extract parameters from filenames
    paths, parsed = stx.io.glob('subject_{subject_id}/session_{session_num}.csv', parse=True)
    print(f"\nParsed {len(paths)} files with parameters:")
    for path, params in zip(paths, parsed):
        print(f"  {path} → Subject: {params['subject_id']}, Session: {params['session_num']}")
        
finally:
    os.chdir(original_cwd)
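
One way to picture placeholder parsing is to turn each `{name}` into a `*` for globbing and into a named regex group for extraction. The sketch below does that with the standard library only. `parse_glob` and `natural_key` are hypothetical helpers written for this illustration; how `stx.io.glob` works internally is not shown in this notebook.

```python
import re
import tempfile
from pathlib import Path

PLACEHOLDER = re.compile(r"\{(\w+)\}")


def natural_key(text):
    """Sort key that compares digit runs as integers ('f2' before 'f10')."""
    # re.split with a capturing group puts the digit runs at odd indices.
    return [int(tok) if i % 2 else tok
            for i, tok in enumerate(re.split(r"(\d+)", text))]


def parse_glob(root, pattern):
    """Return (relative_paths, fields) for files under root matching pattern."""
    root = Path(root)
    # '{name}' becomes '*' for the filesystem search ...
    glob_pattern = PLACEHOLDER.sub("*", pattern)
    # ... and a named group for extraction; literal text is escaped.
    pieces = PLACEHOLDER.split(pattern)
    regex = "".join(
        f"(?P<{piece}>[^/]+)" if i % 2 else re.escape(piece)
        for i, piece in enumerate(pieces)
    )
    matcher = re.compile(regex)

    rel_paths = sorted(
        (p.relative_to(root).as_posix() for p in root.glob(glob_pattern)),
        key=natural_key,
    )
    paths, fields = [], []
    for rel in rel_paths:
        match = matcher.fullmatch(rel)
        if match:
            paths.append(rel)
            fields.append(match.groupdict())
    return paths, fields


with tempfile.TemporaryDirectory() as tmp:
    for rel in ["subject_002/session_01.csv",
                "subject_001/session_02.csv",
                "subject_001/session_01.csv"]:
        target = Path(tmp) / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("x,y\n")
    paths, fields = parse_glob(tmp, "subject_{subject_id}/session_{session_num}.csv")

for path, params in zip(paths, fields):
    print(path, params)
```

Extracted values are strings (for example `'001'`), so convert them with `int()` if numeric comparison is needed.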

## 3. Configuration Management

`stx.io.load_configs()` reads the YAML files under `./config/` into a single object, with one top-level entry per file:

In [None]:
# Create a config directory structure
config_dir = temp_dir / 'config'
config_dir.mkdir(exist_ok=True)

# Create different config files
configs = {
    'PATH.yaml': {
        'data_dir': './data',
        'output_dir': './results',
        'model_dir': './models',
        'log_dir': './logs'
    },
    'PARAMS.yaml': {
        'model': {
            'learning_rate': 0.001,
            'batch_size': 32,
            'epochs': 100
        },
        'data': {
            'train_split': 0.8,
            'validation_split': 0.1,
            'test_split': 0.1
        },
        'experiment': {
            'name': 'baseline_model',
            'seed': 42,
            'device': 'auto'
        }
    }
}

# Save configuration files
for filename, config_data in configs.items():
    config_path = config_dir / filename
    stx.io.save(config_data, config_path)
    print(f"✅ Created config: {filename}")

In [None]:
# Change to temp directory and load configurations
os.chdir(temp_dir)

try:
    # Load all configurations at once
    CONFIG = stx.io.load_configs()
    
    print("Loaded configurations:")
    print(f"Data directory: {CONFIG.PATH.data_dir}")
    print(f"Learning rate: {CONFIG.PARAMS.model.learning_rate}")
    print(f"Batch size: {CONFIG.PARAMS.model.batch_size}")
    print(f"Train split: {CONFIG.PARAMS.data.train_split}")
    print(f"Experiment name: {CONFIG.PARAMS.experiment.name}")
    
    # The CONFIG object supports both dict and dot notation
    print("\nAccess methods:")
    print(f"Dict style: {CONFIG['PARAMS']['model']['learning_rate']}")
    print(f"Dot style: {CONFIG.PARAMS.model.learning_rate}")
    
finally:
    os.chdir(original_cwd)
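
Combined dict and dot access can be obtained from a small `dict` subclass. The following is a sketch of the general technique; `DotDict` is a hypothetical class for illustration and is not the type that `load_configs()` returns.

```python
class DotDict(dict):
    """Dict whose keys can also be read as attributes (config.a.b)."""

    def __init__(self, mapping=None):
        super().__init__()
        for key, value in (mapping or {}).items():
            # Wrap nested dicts so dot access works at every level.
            self[key] = DotDict(value) if isinstance(value, dict) else value

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


CONFIG = DotDict({
    "PATH": {"data_dir": "./data"},
    "PARAMS": {"model": {"learning_rate": 0.001, "batch_size": 32}},
})

print(CONFIG.PARAMS.model.learning_rate)
print(CONFIG["PARAMS"]["model"]["batch_size"])
```

A limitation of this approach: keys that collide with `dict` method names, such as `items` or `keys`, are only reachable with bracket syntax.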

## 4. Caching for Expensive Computations

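`stx.io.cache` is used below as a key-value store: called with only a key, it is expected to return the cached value (or `None` on a miss); called with a key and a value, it stores the value: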

In [None]:
import time

def expensive_computation(n=1000000):
    """Simulate an expensive computation."""
    print(f"Running expensive computation with n={n}...")
    time.sleep(2)  # Simulate computation time
    result = np.sum(np.random.randn(n) ** 2)
    return result

# First run - not cached
print("First run (no cache):")
start_time = time.time()

# Check if result is cached
cache_key = "expensive_result_1M"
result = stx.io.cache(cache_key)

if result is None:
    # Not cached, compute and cache
    result = expensive_computation(1000000)
    stx.io.cache(cache_key, result)
    print(f"Computed and cached result: {result:.4f}")
else:
    print(f"Loaded from cache: {result:.4f}")
    
first_time = time.time() - start_time
print(f"Time taken: {first_time:.2f} seconds")

In [None]:
# Second run - should be cached
print("\nSecond run (from cache):")
start_time = time.time()

result = stx.io.cache(cache_key)
if result is None:
    result = expensive_computation(1000000)
    stx.io.cache(cache_key, result)
    print(f"Computed and cached result: {result:.4f}")
else:
    print(f"Loaded from cache: {result:.4f}")
    
second_time = time.time() - start_time
print(f"Time taken: {second_time:.2f} seconds")
print(f"Speedup: {first_time/second_time:.1f}x faster!")
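
The check-then-store pattern above can be reproduced with a pickle file per key. This is a standard-library sketch with the same calling convention as the cells above; the local `cache` function and `CACHE_DIR` are assumptions made for illustration and say nothing about where or how scitex stores its cache.

```python
import pickle
import tempfile
from pathlib import Path

_tmp = tempfile.TemporaryDirectory()  # hypothetical cache location for this sketch
CACHE_DIR = Path(_tmp.name)
_MISSING = object()  # sentinel: distinguishes "no value given" from None


def cache(key, value=_MISSING):
    """With one argument, return the stored value or None; with two, store value."""
    path = CACHE_DIR / f"{key}.pkl"
    if value is _MISSING:
        if not path.exists():
            return None
        return pickle.loads(path.read_bytes())
    path.write_bytes(pickle.dumps(value))
    return value


calls = []  # records every real computation


def expensive_computation(n):
    calls.append(n)
    return sum(i * i for i in range(n))


def cached_sum_of_squares(n):
    key = f"sum_of_squares_{n}"
    result = cache(key)
    if result is None:
        result = expensive_computation(n)
        cache(key, result)
    return result


first = cached_sum_of_squares(1000)   # computed and stored
second = cached_sum_of_squares(1000)  # read back from disk
print(first == second, len(calls))

_tmp.cleanup()  # remove the cache directory
```

Because a miss is signalled by `None`, a computation whose legitimate result is `None` would be recomputed on every call; a dedicated sentinel or a separate existence check avoids that.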

## 5. HDF5 Interactive Exploration

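For large hierarchical datasets, SciTeX can write nested dictionaries to HDF5, print the file structure with `H5Explorer`, load individual datasets by key, and check whether a key exists: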

In [None]:
# Create a sample HDF5 file with hierarchical data
h5_file = temp_dir / 'experiment_data.h5'

# Create sample scientific data
experiment_data = {
    'experiment_1': {
        'raw_data': np.random.randn(1000, 64),  # 1000 samples, 64 channels
        'metadata': {
            'sampling_rate': 1000,
            'channels': [f'Ch{i:02d}' for i in range(64)],
            'experiment_date': '2025-07-03'
        },
        'processed': {
            'filtered': np.random.randn(1000, 64) * 0.8,
            'features': np.random.randn(100, 10)
        }
    },
    'experiment_2': {
        'raw_data': np.random.randn(1200, 64),
        'metadata': {
            'sampling_rate': 1000,
            'channels': [f'Ch{i:02d}' for i in range(64)],
            'experiment_date': '2025-07-04'
        }
    }
}

# Save as HDF5
stx.io.save(experiment_data, h5_file)
print(f"Created HDF5 file: {h5_file}")
print(f"File size: {h5_file.stat().st_size / 1024:.1f} KB")

In [None]:
# Explore the HDF5 file structure
print("HDF5 File Structure:")
explorer = stx.io.H5Explorer(h5_file)
explorer.explore()

# Load specific parts of the data
print("\nLoading specific datasets:")
exp1_raw = stx.io.load(h5_file, key='experiment_1/raw_data')
print(f"Experiment 1 raw data shape: {exp1_raw.shape}")

exp1_metadata = stx.io.load(h5_file, key='experiment_1/metadata')
print(f"Experiment 1 metadata: {exp1_metadata}")

# Check if specific keys exist
keys_to_check = [
    'experiment_1/raw_data',
    'experiment_1/processed/features', 
    'experiment_3/data',  # This doesn't exist
]

print("\nKey existence check:")
for key in keys_to_check:
    exists = stx.io.has_h5_key(h5_file, key)
    print(f"  {key}: {'✅ exists' if exists else '❌ not found'}")
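
Slash-separated keys such as `experiment_1/processed/features` address nodes in a hierarchy. The sketch below mimics that addressing on a plain nested dict, as an analogy only: it does not read HDF5 files, and `get_by_key`, `has_key`, and `list_keys` are hypothetical helpers.

```python
def get_by_key(tree, key):
    """Follow a slash-separated key through nested dicts; raise KeyError if absent."""
    node = tree
    for part in key.strip("/").split("/"):
        if not isinstance(node, dict) or part not in node:
            raise KeyError(key)
        node = node[part]
    return node


def has_key(tree, key):
    """Return True if the slash-separated key resolves to a node."""
    try:
        get_by_key(tree, key)
        return True
    except KeyError:
        return False


def list_keys(tree, prefix=""):
    """Return the paths of all leaves in depth-first order."""
    keys = []
    for name, value in tree.items():
        path = f"{prefix}/{name}" if prefix else name
        if isinstance(value, dict):
            keys.extend(list_keys(value, path))
        else:
            keys.append(path)
    return keys


tree = {
    "experiment_1": {
        "raw_data": [1.0, 2.0, 3.0],
        "processed": {"features": [0.5, 0.25]},
    },
}

print(list_keys(tree))
print(has_key(tree, "experiment_1/processed/features"))
print(has_key(tree, "experiment_3/data"))
```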

## 6. matplotlib Integration with Data Export

SciTeX automatically exports plot data for reproducibility:

In [None]:
# Create sample data for plotting
np.random.seed(42)
x = np.linspace(0, 10, 100)
y1 = np.sin(x) + 0.1 * np.random.randn(100)
y2 = np.cos(x) + 0.1 * np.random.randn(100)
y3 = np.sin(2*x) * np.exp(-x/5) + 0.05 * np.random.randn(100)

# Create a publication-ready plot
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 8))

# Top subplot
ax1.plot(x, y1, 'b-', label='sin(x) + noise', linewidth=2)
ax1.plot(x, y2, 'r--', label='cos(x) + noise', linewidth=2)
ax1.set_xlabel('Time (s)')
ax1.set_ylabel('Amplitude')
ax1.set_title('Trigonometric Functions with Noise')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Bottom subplot
ax2.plot(x, y3, 'g-', label='Damped oscillation', linewidth=2)
ax2.fill_between(x, y3-0.1, y3+0.1, alpha=0.3, color='green')
ax2.set_xlabel('Time (s)')
ax2.set_ylabel('Amplitude')
ax2.set_title('Damped Oscillation')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Save the figure - SciTeX will automatically export the data too!
plot_file = temp_dir / 'scientific_plot.png'
stx.io.save(fig, plot_file)
print(f"Saved plot to: {plot_file}")

# Check what files were created
plot_files = list(temp_dir.glob('scientific_plot*'))
print("\nGenerated files:")
for f in sorted(plot_files):
    print(f"  {f.name} ({f.stat().st_size} bytes)")

## 7. Machine Learning Model Persistence

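The same `save()` and `load()` calls cover common model artifacts: pickled objects (`.pkl`), PyTorch-style weight files (`.pth`), and JSON training histories: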

In [None]:
# Simulate different types of model artifacts
models_data = {
    # Scikit-learn style model
    'sklearn_model': {
        'model_type': 'RandomForestClassifier',
        'parameters': {'n_estimators': 100, 'max_depth': 10},
        'feature_names': ['feature_1', 'feature_2', 'feature_3'],
        'training_score': 0.95,
        'validation_score': 0.87
    },
    
    # PyTorch-style model state
    'pytorch_weights': {
        # Shapes follow the (out_features, in_features) convention: 64 -> 128 -> 64 -> 10
        'layer1.weight': np.random.randn(128, 64),
        'layer1.bias': np.random.randn(128),
        'layer2.weight': np.random.randn(64, 128),
        'layer2.bias': np.random.randn(64),
        'output.weight': np.random.randn(10, 64),
        'output.bias': np.random.randn(10)
    },
    
    # Training history
    'training_history': {
        'epoch': list(range(1, 51)),
        # .tolist() turns each array into a plain list so the history is JSON-serializable
        'train_loss': (np.exp(-np.linspace(0, 3, 50)) + 0.1 * np.random.randn(50)).tolist(),
        'val_loss': (np.exp(-np.linspace(0, 2.5, 50)) + 0.15 * np.random.randn(50)).tolist(),
        'train_acc': (1 - np.exp(-np.linspace(0, 3, 50)) + 0.05 * np.random.randn(50)).tolist(),
        'val_acc': (1 - np.exp(-np.linspace(0, 2.5, 50)) + 0.1 * np.random.randn(50)).tolist()
    }
}

# Save different model components in appropriate formats
model_files = {
    'sklearn_model.pkl': models_data['sklearn_model'],
    'pytorch_weights.pth': models_data['pytorch_weights'], 
    'training_history.json': models_data['training_history']
}

print("Saving model artifacts:")
for filename, data in model_files.items():
    filepath = temp_dir / filename
    stx.io.save(data, filepath)
    print(f"✅ Saved {filename} ({filepath.stat().st_size} bytes)")

In [None]:
# Load and verify model artifacts
print("Loading model artifacts:")

loaded_sklearn = stx.io.load(temp_dir / 'sklearn_model.pkl')
print(f"✅ Loaded sklearn model: {loaded_sklearn['model_type']}")
print(f"   Training score: {loaded_sklearn['training_score']:.3f}")

loaded_weights = stx.io.load(temp_dir / 'pytorch_weights.pth')
print(f"✅ Loaded PyTorch weights: {len(loaded_weights)} layers")
for layer_name, weights in loaded_weights.items():
    if hasattr(weights, 'shape'):
        print(f"   {layer_name}: {weights.shape}")

loaded_history = stx.io.load(temp_dir / 'training_history.json')
print(f"✅ Loaded training history: {len(loaded_history['epoch'])} epochs")
print(f"   Final train loss: {loaded_history['train_loss'][-1]:.4f}")
print(f"   Final val loss: {loaded_history['val_loss'][-1]:.4f}")
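
Training histories often hold NumPy arrays, which the standard `json` module cannot serialize on its own. A common workaround is a `default` hook that converts arrays and NumPy scalars to built-in types. The sketch below shows that technique with plain `json`; `to_jsonable` is a hypothetical helper, and whether scitex performs a similar conversion internally is not assumed here.

```python
import json

import numpy as np


def to_jsonable(obj):
    """Fallback for json.dumps: convert NumPy arrays and scalars to built-ins."""
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    if isinstance(obj, np.generic):
        return obj.item()
    raise TypeError(f"Not JSON serializable: {type(obj).__name__}")


history = {
    "epoch": list(range(1, 4)),
    "train_loss": np.array([0.9, 0.5, 0.25]),
    "best_epoch": np.int64(3),
}

# json.dumps calls to_jsonable only for objects it cannot encode itself.
text = json.dumps(history, default=to_jsonable)
restored = json.loads(text)
print(restored)
```

After the round trip the arrays come back as plain lists, so wrap them in `np.array(...)` again if array operations are needed.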

## 8. Batch Processing with File Patterns

The following example combines `glob` parsing with `load` and `save` to process several data files in one pass:

In [None]:
# Create a more realistic batch processing scenario
os.chdir(temp_dir)

try:
    # Find all subject data files
    subject_files, parsed_params = stx.io.glob('subject_{subject_id}/session_{session_num}.csv', parse=True)
    
    print(f"Processing {len(subject_files)} data files:")
    
    results = []
    for filepath, params in zip(subject_files, parsed_params):
        # Load the data
        data = stx.io.load(filepath)
        
        # Process the data (example: compute statistics)
        stats = {
            'subject_id': params['subject_id'],
            'session_num': params['session_num'],
            'n_samples': len(data),
            'mean_x': data['x'].mean(),
            'std_x': data['x'].std(),
            'mean_y': data['y'].mean(),
            'std_y': data['y'].std(),
            'correlation': data['x'].corr(data['y'])
        }
        
        results.append(stats)
        print(f"  ✅ Processed {filepath} → corr = {stats['correlation']:.3f}")
    
    # Combine results into a summary DataFrame
    summary_df = pd.DataFrame(results)
    
    # Save the summary
    summary_file = 'batch_processing_summary.csv'
    stx.io.save(summary_df, summary_file)
    print(f"\n✅ Saved summary to: {summary_file}")
    
    # Display the summary
    print("\nBatch Processing Summary:")
    print(summary_df.to_string(index=False))
    
finally:
    os.chdir(original_cwd)

## 9. Advanced Features Demo

Let's explore some advanced features:

In [None]:
# Multi-format data pipeline example
print("Multi-format data pipeline:")

# Start with raw data
raw_data = {
    'experiment_id': 'EXP_001',
    'timestamp': '2025-07-03T10:00:00',
    'measurements': np.random.randn(1000),
    'metadata': {
        'device': 'sensor_v2',
        'calibration': 1.05,
        'units': 'mV'
    }
}

# Save in different formats for different purposes
formats = {
    'archive.pkl': raw_data,  # Complete data for archival
    'measurements.npy': raw_data['measurements'],  # Just data for analysis
    'metadata.json': raw_data['metadata'],  # Metadata for documentation
    'config.yaml': {'experiment_id': raw_data['experiment_id'], 
                   'timestamp': raw_data['timestamp']}  # Config for reproducibility
}

pipeline_dir = temp_dir / 'pipeline'
pipeline_dir.mkdir(exist_ok=True)

for filename, data in formats.items():
    filepath = pipeline_dir / filename
    stx.io.save(data, filepath)
    print(f"  ✅ {filename} → {filepath.stat().st_size} bytes")

print("\nPipeline files created - each optimized for its purpose!")

In [None]:
# Demonstrate robust error handling
print("Error handling demonstrations:")

# Try to load non-existent file
try:
    result = stx.io.load(temp_dir / 'nonexistent_file.pkl')
except FileNotFoundError as e:
    print(f"  ✅ Handled missing file gracefully: {type(e).__name__}")

# Wrap a save in try/except (this save is expected to succeed;
# the except branch shows where a failure would be reported)
try:
    test_data = {'key': 'value'}
    stx.io.save(test_data, temp_dir / 'test_error_handling.json')
    print("  ✅ Save operation successful")
except Exception as e:
    print(f"  ⚠️ Save error handled: {type(e).__name__}")

# Save and load a path with no extension
# (assumes scitex picks a default format for such paths)
ambiguous_file = temp_dir / 'data_without_extension'
test_data = {'auto_detected': True, 'format': 'pickle'}
stx.io.save(test_data, ambiguous_file)
loaded_data = stx.io.load(ambiguous_file)
print(f"  ✅ Auto-detection worked: {loaded_data['format']}")
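
The save example above succeeds, so its `except` branch never runs. To see a failure being caught, the sketch below uses plain file I/O from the standard library: writing into a directory that does not exist raises `FileNotFoundError`, a subclass of `OSError`. The helper `try_save_json` is hypothetical, and the exceptions scitex itself raises in this situation are not assumed.

```python
import json
import tempfile
from pathlib import Path


def try_save_json(obj, path):
    """Return None on success, or the exception class name if the write fails."""
    try:
        with open(path, "w") as fh:
            json.dump(obj, fh)
    except OSError as exc:
        return type(exc).__name__
    return None


with tempfile.TemporaryDirectory() as tmp:
    # The target directory exists, so this write succeeds.
    ok = try_save_json({"key": "value"}, Path(tmp) / "ok.json")
    # 'missing_dir' was never created, so open() fails.
    failed = try_save_json({"key": "value"}, Path(tmp) / "missing_dir" / "out.json")

print(ok, failed)
```

Catching `OSError` rather than a bare `Exception` keeps programming errors, such as a non-serializable object, visible instead of silently swallowed.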

## 10. Cleanup and Summary

Let's clean up and summarize what we've learned:

In [None]:
# Summary of files created during this tutorial
all_files = list(temp_dir.rglob('*'))
file_types = {}

for file_path in all_files:
    if file_path.is_file():
        suffix = file_path.suffix or 'no_extension'
        if suffix not in file_types:
            file_types[suffix] = []
        file_types[suffix].append(file_path)

print("📊 SciTeX IO Tutorial Summary")
print("=" * 50)
print(f"Total files created: {len([f for f in all_files if f.is_file()])}")
print(f"Total directories: {len([f for f in all_files if f.is_dir()])}")
print("\nFile types handled:")

for suffix, files in sorted(file_types.items()):
    total_size = sum(f.stat().st_size for f in files)
    print(f"  {suffix:15} {len(files):3d} files ({total_size:7,d} bytes)")

print("\n✅ Key Features Demonstrated:")
features = [
    "Universal load/save with automatic format detection",
    "Enhanced file pattern matching with parsing", 
    "YAML configuration management with dot notation",
    "Caching system for expensive computations",
    "HDF5 interactive exploration for large datasets",
    "matplotlib integration with automatic data export",
    "Machine learning model persistence", 
    "Batch processing with parameter extraction",
    "Multi-format data pipelines",
    "Robust error handling and format auto-detection"
]

for i, feature in enumerate(features, 1):
    print(f"  {i:2d}. {feature}")

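print(f"\n🗂️ Temporary files location: {temp_dir}")

# tempfile.mkdtemp() does not delete the directory automatically, so remove it explicitly
import shutil
shutil.rmtree(temp_dir, ignore_errors=True)
print("   (Temporary directory removed)")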

## 🎯 Next Steps

Now that you've seen the power of SciTeX IO, here are some ways to use it in your projects:

### **Quick Start Templates**

```python
# Universal data loading
import scitex as stx
data = stx.io.load('your_file.any_format')

# Configuration management
CONFIG = stx.io.load_configs()  # Loads all YAML files from ./config/

# Batch processing
files, params = stx.io.glob('data/subject_{id}_session_{num}.csv', parse=True)
for file, param in zip(files, params):
    data = stx.io.load(file)
    # Process data using param['id'] and param['num']

# Caching expensive computations
result = stx.io.cache('computation_key')
if result is None:
    result = expensive_function()
    stx.io.cache('computation_key', result)
```

### **Advanced Patterns**

- **Scientific Workflows**: Load instrument data, process, cache results, export plots
- **ML Pipelines**: Load configs, batch process datasets, save models and histories
- **Data Analysis**: Explore HDF5 datasets, extract features, generate reports
- **Reproducible Research**: Version configurations, cache computations, export plot data

The SciTeX IO module handles the complexity of file formats so you can focus on your research and analysis!