# FileUtils Tutorial

This notebook demonstrates how to use FileUtils for data management in Python data science projects. We'll cover:

1. Installation and Setup
2. Basic File Operations
3. Working with Different File Formats
4. Metadata Management
5. Azure Storage Integration
6. Advanced Configuration

## 1. Installation and Setup

First, let's install FileUtils and set up our environment:

In [None]:
# Install FileUtils with all features
!pip install "git+https://github.com/topij/FileUtils.git#egg=FileUtils[all]"
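
If the import in the next cell fails, it can help to confirm that the package is visible to the running kernel. The following is a minimal standard-library sketch; the helper name `is_installed` is hypothetical and not part of FileUtils.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


# "FileUtils" is the import name used in the rest of this notebook
if is_installed("FileUtils"):
    print("FileUtils is available")
else:
    print("FileUtils not found; re-run the install cell and restart the kernel")
```

A kernel restart is often needed after installing a package from inside a notebook, because the running interpreter may have cached its earlier view of the import path.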

In [None]:
# Import required packages
import pandas as pd
import numpy as np
from pathlib import Path
from FileUtils import FileUtils, OutputFileType

# Initialize FileUtils
file_utils = FileUtils()

# Create some sample data
np.random.seed(42)
df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=10),
    'value': np.random.randn(10),
    'category': np.random.choice(['A', 'B', 'C'], 10)
})

print("Sample data:")
df.head()

## 2. Basic File Operations

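`save_with_metadata` writes the data files together with a metadata file that records where they were saved. `load_from_metadata` then reads the data back using only that metadata file: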

In [None]:
# Save data with metadata
saved_files, metadata = file_utils.save_with_metadata(
    data={'sample': df},
    output_filetype=OutputFileType.CSV,
    output_type="processed",
    file_name="sample_data"
)

print("Saved files:", saved_files)
print("Metadata location:", metadata)

# Load using metadata
loaded_data = file_utils.load_from_metadata(metadata)
print("\nLoaded data:")
print(loaded_data['sample'].head())
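
To make the idea of metadata tracking concrete, here is a minimal standard-library sketch of the general pattern: write a data file, record its path in a JSON sidecar, and later load the data knowing only the sidecar path. This is not FileUtils' implementation; the function names and the sidecar layout (`{"files": {...}}`) are assumptions made for illustration.

```python
import csv
import json
import tempfile
from pathlib import Path


def save_with_sidecar(rows, fieldnames, directory, name):
    """Write rows to <name>.csv and record the path in <name>_metadata.json."""
    directory = Path(directory)
    data_path = directory / f"{name}.csv"
    with open(data_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    # Hypothetical sidecar layout: logical name -> file path
    meta_path = directory / f"{name}_metadata.json"
    meta_path.write_text(
        json.dumps({"files": {name: str(data_path)}}), encoding="utf-8"
    )
    return meta_path


def load_from_sidecar(meta_path):
    """Load every file listed in the sidecar, keyed by logical name."""
    meta = json.loads(Path(meta_path).read_text(encoding="utf-8"))
    loaded = {}
    for key, path in meta["files"].items():
        with open(path, newline="", encoding="utf-8") as f:
            loaded[key] = list(csv.DictReader(f))
    return loaded


with tempfile.TemporaryDirectory() as tmp:
    rows = [{"category": "A", "value": "1.5"}, {"category": "B", "value": "-0.3"}]
    meta_path = save_with_sidecar(rows, ["category", "value"], tmp, "sample")
    restored = load_from_sidecar(meta_path)

print(restored["sample"])
```

The benefit of the pattern is that downstream code only needs to keep track of one path, even when a single save produces several files.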

### Working with Multiple DataFrames

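Passing a dictionary of DataFrames saves them together. With the Excel format, the entries are written to one workbook and can be read back sheet by sheet: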

In [None]:
# Create multiple views of the data
df_dict = {
    'all_data': df,
    'filtered': df[df['value'] > 0],
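    # Select the column before aggregating so the result has flat
    # (single-level) column names, which are safer to write to Excel
    'summary': (
        df.groupby('category')['value']
        .agg(['mean', 'std', 'count'])
        .reset_index()
    )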
}

# Save to Excel with metadata
saved_files, metadata = file_utils.save_with_metadata(
    data=df_dict,
    output_filetype=OutputFileType.XLSX,
    output_type="processed",
    file_name="multi_sheet_data"
)

# Load all sheets
loaded_sheets = file_utils.load_excel_sheets(saved_files['multi_sheet_data'])

print("Excel sheets loaded:")
for name, sheet_df in loaded_sheets.items():
    print(f"\n{name}:")
    print(sheet_df.head())

## 3. Working with Different File Formats

The same save and load calls work for CSV, Excel, and Parquet; only the `output_filetype` argument changes:

In [None]:
# Test different formats
for format_type in [OutputFileType.CSV, OutputFileType.XLSX, OutputFileType.PARQUET]:
    # Save data
    saved_files, metadata = file_utils.save_with_metadata(
        data={'data': df},
        output_filetype=format_type,
        output_type="processed",
        file_name=f"format_test_{format_type.value}"
    )
    
    # Load and verify
    loaded_data = file_utils.load_from_metadata(metadata)
    print(f"\n{format_type.value.upper()} format test:")
    print(f"Original shape: {df.shape}")
    print(f"Loaded shape: {loaded_data['data'].shape}")
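    # DataFrame.equals also compares dtypes, so a text format such as CSV
    # may report False here if the date column is read back as strings
    print("Data preserved:", df.equals(loaded_data['data']))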
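
A strict equality check after a round trip can fail for text formats even when no values were lost, because CSV stores no type information. The sketch below illustrates this with only the standard library (no FileUtils or pandas calls); the sample records are made up for the example.

```python
import csv
import io
from datetime import date

original = [
    {"date": date(2024, 1, 1), "value": 0.5},
    {"date": date(2024, 1, 2), "value": -1.25},
]

# Write to an in-memory CSV, then read it back
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["date", "value"])
writer.writeheader()
writer.writerows(original)

buffer.seek(0)
raw = [dict(row) for row in csv.DictReader(buffer)]

# Every field comes back as a string, so a strict comparison fails
print(raw == original)  # False

# Converting each column back to its intended type restores equality
restored = [
    {"date": date.fromisoformat(r["date"]), "value": float(r["value"])}
    for r in raw
]
print(restored == original)  # True
```

With pandas, the usual counterpart is to pass `parse_dates=['date']` to `pd.read_csv`. Parquet stores column types in the file itself, so it typically does not need this step.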

## 4. Metadata Management

The metadata file is plain JSON, so it can be opened and inspected directly:

In [None]:
import json

# Save data with metadata
saved_files, metadata = file_utils.save_with_metadata(
    data={
        'raw': df,
        'processed': df.copy().assign(value=lambda x: x['value'] * 2)
    },
    output_filetype=OutputFileType.PARQUET,
    output_type="processed",
    file_name="metadata_test"
)

# Examine metadata contents
with open(metadata, 'r') as f:
    metadata_content = json.load(f)

print("Metadata contents:")
print(json.dumps(metadata_content, indent=2))
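
For larger metadata files, a compact list of key paths can be easier to scan than the full JSON dump. The helper below is a generic sketch that works on any JSON-like object and assumes nothing about the metadata schema; the `example` content is hypothetical and the real keys may differ.

```python
import json


def summarize(obj, prefix=""):
    """Yield (key_path, type_name) pairs for every leaf in a JSON-like object."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from summarize(value, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from summarize(value, f"{prefix}[{index}]")
    else:
        yield prefix, type(obj).__name__


# Hypothetical content, for illustration only
example = json.loads(
    '{"timestamp": "2024-01-01T00:00:00", "files": {"raw": {"path": "raw.parquet"}}}'
)

for path, type_name in summarize(example):
    print(f"{path}: {type_name}")
```

In the notebook, calling `summarize(metadata_content)` would list the key paths of the metadata loaded in the previous cell.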

## 5. Azure Storage Integration

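To use Azure Storage, you need a storage connection string. The example below reads it from the `AZURE_STORAGE_CONNECTION_STRING` environment variable (loaded from a `.env` file if one is present) and prints a message instead of failing when the connection is unavailable: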

In [None]:
import os
from dotenv import load_dotenv
from FileUtils.core.base import StorageConnectionError

# Load credentials
load_dotenv()

# Try Azure connection
try:
    connection_string = os.getenv("AZURE_STORAGE_CONNECTION_STRING")
    if not connection_string:
        raise ValueError("Azure connection string not found")
        
    azure_utils = FileUtils(
        storage_type="azure",
        connection_string=connection_string
    )
    
    # Save to Azure
    saved_files, metadata = azure_utils.save_with_metadata(
        data={'test': df},
        output_filetype=OutputFileType.PARQUET,
        output_type="processed",
        file_name="azure_test"
    )
    
    print("Successfully saved to Azure:")
    print(saved_files)
    
    # Load from Azure
    loaded_data = azure_utils.load_from_metadata(metadata)
    print("\nSuccessfully loaded from Azure")
    
except (ValueError, StorageConnectionError) as e:
    print(f"Azure setup not available: {e}")
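
A connection string that is present but malformed can produce confusing errors later on. The following sketch checks the string's shape before it is used. It assumes the account-key style of connection string (`Key=Value` pairs separated by `;`); the helper names are hypothetical, and strings based on a shared access signature would need a different set of required keys.

```python
def parse_connection_string(connection_string: str) -> dict:
    """Split 'Key=Value;Key=Value' into a dict, keeping '=' inside values."""
    settings = {}
    for part in connection_string.split(";"):
        if not part.strip():
            continue  # tolerate a trailing ';'
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"Malformed segment: {part!r}")
        settings[key.strip()] = value
    return settings


def missing_keys(connection_string: str, required=("AccountName", "AccountKey")) -> list:
    """Return the required keys that are absent or empty."""
    settings = parse_connection_string(connection_string)
    return [key for key in required if not settings.get(key)]


# Placeholder value, not a real credential
example = (
    "DefaultEndpointsProtocol=https;AccountName=myaccount;"
    "AccountKey=abc123==;EndpointSuffix=core.windows.net"
)
print(missing_keys(example))
print(missing_keys("DefaultEndpointsProtocol=https;AccountName=myaccount"))
```

Splitting each segment on the first `=` only matters because account keys are base64 encoded and often end in `=` padding.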

## 6. Advanced Configuration

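FileUtils accepts a YAML configuration file. The example below writes one that changes the CSV delimiter to `|`, then saves a file to confirm that the setting takes effect: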

In [None]:
import yaml

# Create custom config
config = {
    'csv_delimiter': '|',
    'encoding': 'utf-8',
    'include_timestamp': True,
    'logging_level': 'DEBUG',
    'directory_structure': {
        'data': ['raw', 'interim', 'processed', 'external'],
        'reports': ['figures', 'tables'],
        'models': ['trained', 'evaluations']
    }
}

config_path = Path('custom_config.yaml')
with open(config_path, 'w') as f:
    yaml.dump(config, f)

# Initialize with custom config
custom_utils = FileUtils(config_file=config_path)

# Test custom configuration
saved_files, metadata = custom_utils.save_with_metadata(
    data={'test': df},
    output_filetype=OutputFileType.CSV,
    output_type="processed",
    file_name="custom_config_test"
)

# Show custom delimiter in action
with open(list(saved_files.values())[0], 'r') as f:
    print("CSV with custom delimiter:")
    print(f.read())

# Clean up
config_path.unlink()
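
A file written with a non-default delimiter must also be read with that delimiter. The standard-library sketch below shows what goes wrong otherwise; `pipe_text` is a made-up stand-in for the file saved above.

```python
import csv
import io

# Stand-in for a file written with csv_delimiter='|'
pipe_text = "date|value|category\n2024-01-01|0.5|A\n2024-01-02|-1.25|B\n"

# Default delimiter (','): each row collapses into a single field
wrong = list(csv.reader(io.StringIO(pipe_text)))
print(wrong[0])  # ['date|value|category']

# Matching delimiter: columns are split as intended
right = list(csv.reader(io.StringIO(pipe_text), delimiter="|"))
print(right[0])  # ['date', 'value', 'category']
```

When reading such a file directly with pandas, the counterpart is `pd.read_csv(path, sep='|')`.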

## Further Resources

- Check the [Installation Guide](docs/INSTALLATION.md) for detailed setup instructions
- See the [Usage Guide](docs/USAGE.md) for more examples and best practices
- Refer to [Azure Setup](docs/AZURE_SETUP.md) for cloud storage configuration

For issues or suggestions, please visit the [GitHub repository](https://github.com/topij/FileUtils).