# FileUtils Tutorial

This notebook demonstrates how to use FileUtils for data management in Python data science projects. We'll cover:

1. Installation and Setup
2. Basic File Operations
3. Working with Different File Formats
4. Document Handling (NEW!)
5. Metadata Management
6. Azure Storage Integration
7. Advanced Configuration

## 1. Installation and Setup

First, let's install FileUtils and set up our environment:

In [1]:
# Install FileUtils with all features (including document support)
#%pip install "git+https://github.com/topij/FileUtils.git#egg=FileUtils[all]"

In [2]:
# Import required packages
import pandas as pd
import numpy as np
from pathlib import Path
import sys

# Add project src directory to path (for local development)
# This allows importing FileUtils when running the notebook from the examples directory
project_root = Path().resolve().parent
src_path = str(project_root / "src")
if src_path not in sys.path:
    sys.path.append(src_path)

from FileUtils import FileUtils, OutputFileType

# Initialize FileUtils
file_utils = FileUtils()

# Create some sample data
np.random.seed(42)
df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=10),
    'value': np.random.randn(10),
    'category': np.random.choice(['A', 'B', 'C'], 10)
})

print("Sample data:")
df.head()

2025-10-18 08:15:26,900 - FileUtils.core.file_utils - INFO - Project root: /Users/topi/data-science/FileUtils
2025-10-18 08:15:26,900 - FileUtils.core.file_utils - INFO - FileUtils initialized with local storage
Sample data:


        date     value category
0 2024-01-01  0.496714        A
1 2024-01-02 -0.138264        A
2 2024-01-03  0.647689        B
3 2024-01-04  1.523030        B
4 2024-01-05 -0.234153        A


## 2. Basic File Operations

`save_with_metadata` writes each DataFrame to disk together with a JSON metadata file, and `load_from_metadata` reads the data back using that file:

In [3]:
# Save data with metadata
saved_files, metadata = file_utils.save_with_metadata(
    data={'sample': df},
    output_filetype=OutputFileType.CSV,
    output_type="processed",
    file_name="sample_data"
)

print("Saved files:", saved_files)
print("Metadata location:", metadata)

# Load using metadata
loaded_data = file_utils.load_from_metadata(metadata)
print("\nLoaded data:")
print(loaded_data['sample'].head())

Saved files: {'sample': '/Users/topi/data-science/FileUtils/data/processed/sample_data_20251018_081526_sample.csv'}
Metadata location: /Users/topi/data-science/FileUtils/data/processed/sample_data_20251018_081526_metadata.json

Loaded data:
         date     value category
0  2024-01-01  0.496714        A
1  2024-01-02 -0.138264        A
2  2024-01-03  0.647689        B
3  2024-01-04  1.523030        B
4  2024-01-05 -0.234153        A
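
The saved paths above follow the pattern `<file_name>_<YYYYMMDD_HHMMSS>_<key>.<ext>`. As a minimal sketch of how such a name can be composed (this is not FileUtils' actual implementation, and `timestamped_name` is a hypothetical helper introduced only for illustration):

```python
from datetime import datetime


def timestamped_name(file_name: str, key: str, ext: str, when: datetime) -> str:
    """Compose '<file_name>_<YYYYMMDD_HHMMSS>_<key>.<ext>' (hypothetical helper)."""
    stamp = when.strftime("%Y%m%d_%H%M%S")
    return f"{file_name}_{stamp}_{key}.{ext}"


# Reproduce the file name shown in the output above from its parts.
name = timestamped_name("sample_data", "sample", "csv", datetime(2025, 10, 18, 8, 15, 26))
print(name)  # sample_data_20251018_081526_sample.csv
```

Because the timestamp is zero-padded and ordered from year to second, sorting such names alphabetically also sorts them chronologically.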


### Working with Multiple DataFrames

FileUtils can save several DataFrames in one call. With Excel output, each dictionary key becomes a sheet in a single workbook:

In [4]:
# Create multiple views of the data
df_dict = {
    'all_data': df,
    'filtered': df[df['value'] > 0],
    'summary': df.groupby('category').agg({
        'value': ['mean', 'std', 'count']
    }).reset_index()
}

# Save to Excel with metadata
saved_files, metadata = file_utils.save_with_metadata(
    data=df_dict,
    output_filetype=OutputFileType.XLSX,
    output_type="processed",
    file_name="multi_sheet_data"
)

# Get the Excel file path (all sheet names point to the same file)
excel_file_path = list(saved_files.values())[0]
loaded_sheets = file_utils.load_excel_sheets(excel_file_path)

print("Excel sheets loaded:")
for name, sheet_df in loaded_sheets.items():
    print(f"\n{name}:")
    print(sheet_df.head())


Excel sheets loaded:

all_data:
        date     value category
0 2024-01-01  0.496714        A
1 2024-01-02 -0.138264        A
2 2024-01-03  0.647689        B
3 2024-01-04  1.523030        B
4 2024-01-05 -0.234153        A

filtered:
        date     value category
0 2024-01-01  0.496714        A
1 2024-01-03  0.647689        B
2 2024-01-04  1.523030        B
3 2024-01-07  1.579213        A
4 2024-01-08  0.767435        C

summary:
   Unnamed: 0 category_  value_mean  value_std  value_count
0           0         A    0.293874   0.780639            5
1           1         B    1.085359   0.618960            2
2           2         C    0.280173   0.658879            3


## 3. Working with Different File Formats

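FileUtils supports multiple tabular file formats, including CSV, Excel, and Parquet. The round-trip test in cell In [9], after the document-handling examples, saves and reloads the same DataFrame in each format.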

## 4. Document Handling (NEW!)

FileUtils now supports rich document formats suited to AI/agentic workflows:

- **Markdown (.md)**: Text-based documents with YAML frontmatter support
- **Microsoft Word (.docx)**: Structured documents with headings, text, and tables
- **PDF (.pdf)**: Text documents with basic formatting; loading extracts the text content only

Let's explore these new capabilities:


In [5]:
# Markdown Document Example
markdown_content = """# AI Analysis Report

## Executive Summary
This report analyzes the performance of our AI models using FileUtils.

## Key Findings
- Model accuracy: 95.2%
- Processing time: 2.3 seconds
- User satisfaction: 4.8/5

## Recommendations
1. Implement additional training data
2. Optimize inference pipeline
3. Add real-time monitoring
"""

# Save simple markdown
saved_path, _ = file_utils.save_document_to_storage(
    content=markdown_content,
    output_filetype=OutputFileType.MARKDOWN,
    output_type="processed",
    file_name="ai_analysis_report"
)

print("Markdown document saved to:", saved_path)

# Load markdown
loaded_content = file_utils.load_document_from_storage(
    file_path="ai_analysis_report.md",
    input_type="processed"
)

print("\nLoaded markdown content:")
print(loaded_content[:200] + "...")


2025-10-18 08:15:27,072 - FileUtils.core.file_utils - INFO - Document saved successfully: /Users/topi/data-science/FileUtils/data/processed/ai_analysis_report_20251018_081527.md
Markdown document saved to: /Users/topi/data-science/FileUtils/data/processed/ai_analysis_report_20251018_081527.md

Loaded markdown content:
# AI Analysis Report

## Executive Summary
This report analyzes the performance of our AI models using FileUtils.

## Key Findings
- Model accuracy: 95.2%
- Processing time: 2.3 seconds
- User satisfa...
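
The document was saved with a timestamp in its name but loaded as `ai_analysis_report.md`. How FileUtils resolves that name is not shown in this notebook. If you need to locate the newest timestamped file yourself, the following sketch shows one approach using only the standard library. `latest_timestamped` is a hypothetical helper, and the files are placeholders created in a temporary directory:

```python
import tempfile
from pathlib import Path


def latest_timestamped(directory: Path, stem: str, suffix: str) -> Path:
    """Return the newest '<stem>_<YYYYMMDD_HHMMSS><suffix>' file (hypothetical helper)."""
    candidates = sorted(directory.glob(f"{stem}_*{suffix}"))
    if not candidates:
        raise FileNotFoundError(f"No files matching {stem}_*{suffix} in {directory}")
    # The timestamp format sorts lexicographically, so the last name is the newest.
    return candidates[-1]


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for stamp in ("20251018_081527", "20251017_235959", "20251018_090000"):
        placeholder = folder / f"ai_analysis_report_{stamp}.md"
        placeholder.write_text("# placeholder\n", encoding="utf-8")
    newest_name = latest_timestamped(folder, "ai_analysis_report", ".md").name

print(newest_name)
```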


In [6]:
# Markdown with YAML Frontmatter Example
structured_content = {
    "frontmatter": {
        "title": "AI Model Performance Report",
        "author": "AI Team",
        "date": "2024-01-15",
        "version": "1.0",
        "tags": ["AI", "Performance", "Analysis"],
        "confidence": 0.95,
        "model": "GPT-4"
    },
    "body": """# AI Model Performance Report

## Model Metrics

| Model | Accuracy | Precision | Recall | F1-Score |
|-------|----------|-----------|--------|----------|
| Model A | 94.2% | 93.8% | 94.5% | 94.1% |
| Model B | 95.7% | 95.2% | 96.1% | 95.6% |
| Model C | 96.1% | 95.8% | 96.4% | 96.1% |

## Analysis
Model C shows the best overall performance across all metrics.

## Recommendations
1. Deploy Model C to production
2. Monitor performance metrics
3. Schedule retraining cycle
"""
}

# Save structured markdown
saved_path, _ = file_utils.save_document_to_storage(
    content=structured_content,
    output_filetype=OutputFileType.MARKDOWN,
    output_type="processed",
    file_name="model_performance_report",
    sub_path="reports/2024"
)

print("Structured markdown saved to:", saved_path)

# Load structured markdown
loaded_content = file_utils.load_document_from_storage(
    file_path="model_performance_report.md",
    input_type="processed",
    sub_path="reports/2024"
)

# Access frontmatter and body separately
if isinstance(loaded_content, dict):
    metadata = loaded_content["frontmatter"]
    content = loaded_content["body"]
    print(f"\nReport by {metadata['author']} with {metadata['confidence']} confidence")
    print(f"Model: {metadata['model']}")
    print(f"Content preview: {content[:100]}...")


2025-10-18 08:15:27,081 - FileUtils.core.file_utils - INFO - Document saved successfully: /Users/topi/data-science/FileUtils/data/processed/reports/2024/model_performance_report_20251018_081527.md
Structured markdown saved to: /Users/topi/data-science/FileUtils/data/processed/reports/2024/model_performance_report_20251018_081527.md

Report by AI Team with 0.95 confidence
Model: GPT-4
Content preview: # AI Model Performance Report

## Model Metrics

| Model | Accuracy | Precision | Recall | F1-Score ...
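
By convention, a Markdown file with frontmatter starts with a YAML block between two `---` lines, followed by the body. Assuming FileUtils writes that conventional layout (not verified here), the sketch below shows how the two parts can be separated with the standard library. `split_frontmatter` is a hypothetical helper and returns the frontmatter as raw text rather than parsing the YAML:

```python
def split_frontmatter(text: str) -> tuple[str, str]:
    """Split '---'-delimited frontmatter from the body (hypothetical helper).

    Returns (frontmatter_text, body). The frontmatter is raw text; parsing it
    as YAML is left to a YAML library.
    """
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return "", text  # no frontmatter block
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            return "".join(lines[1:index]), "".join(lines[index + 1:])
    return "", text  # opening delimiter without a closing one


sample = (
    "---\n"
    "title: AI Model Performance Report\n"
    "author: AI Team\n"
    "---\n"
    "# AI Model Performance Report\n"
)
front, body = split_frontmatter(sample)
print(front)
print(body)
```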


In [7]:
# DOCX Document Example (requires python-docx)
try:
    # Simple DOCX document
    docx_content = "This is a test document for DOCX format created with FileUtils."
    
    saved_path, _ = file_utils.save_document_to_storage(
        content=docx_content,
        output_filetype=OutputFileType.DOCX,
        output_type="processed",
        file_name="simple_document"
    )
    
    print("DOCX document saved to:", saved_path)
    
    # Load DOCX (extracts text content)
    loaded_content = file_utils.load_document_from_storage(
        file_path="simple_document.docx",
        input_type="processed"
    )
    
    print("\nLoaded DOCX content:")
    print(loaded_content)
    
except Exception as e:
    # Any failure lands here, not only a missing dependency, so report the actual error
    print(f"DOCX example failed (python-docx may not be installed): {e}")
    print("Install with: pip install python-docx")


2025-10-18 08:15:27,141 - FileUtils.core.file_utils - INFO - Document saved successfully: /Users/topi/data-science/FileUtils/data/processed/simple_document_20251018_081527.docx
DOCX document saved to: /Users/topi/data-science/FileUtils/data/processed/simple_document_20251018_081527.docx

Loaded DOCX content:
This is a test document for DOCX format created with FileUtils.


In [8]:
# PDF Document Example (requires PyMuPDF)
try:
    # Simple PDF document
    pdf_content = "This is a test document for PDF format created with FileUtils."
    
    saved_path, _ = file_utils.save_document_to_storage(
        content=pdf_content,
        output_filetype=OutputFileType.PDF,
        output_type="processed",
        file_name="simple_pdf"
    )
    
    print("PDF document saved to:", saved_path)
    
    # Load PDF (extracts text content)
    loaded_content = file_utils.load_document_from_storage(
        file_path="simple_pdf.pdf",
        input_type="processed"
    )
    
    print("\nLoaded PDF content:")
    print(loaded_content)
    
except Exception as e:
    # Any failure lands here, not only a missing dependency, so report the actual error
    print(f"PDF example failed (PyMuPDF may not be installed): {e}")
    print("Install with: pip install PyMuPDF")


2025-10-18 08:15:28,658 - FileUtils.core.file_utils - INFO - Document saved successfully: /Users/topi/data-science/FileUtils/data/processed/simple_pdf_20251018_081527.pdf
PDF document saved to: /Users/topi/data-science/FileUtils/data/processed/simple_pdf_20251018_081527.pdf

Loaded PDF content:
This is a test document for PDF format created with FileUtils.
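
The DOCX and PDF cells depend on optional packages. As an alternative to catching a broad exception, the sketch below checks up front whether a module is importable, using only the standard library. `has_module` is a hypothetical helper, and the import names `docx` (python-docx) and `fitz` (PyMuPDF) are those packages' usual module names, not something this notebook confirms FileUtils uses internally:

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be found (hypothetical helper)."""
    return importlib.util.find_spec(name) is not None


# python-docx is imported as 'docx'; PyMuPDF is imported as 'fitz'.
for package, module in (("python-docx", "docx"), ("PyMuPDF", "fitz")):
    status = "available" if has_module(module) else "missing"
    print(f"{package}: {status}")
```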



In [9]:
# Section 3 example: round-trip the same DataFrame through CSV, XLSX, and Parquet
for format_type in [OutputFileType.CSV, OutputFileType.XLSX, OutputFileType.PARQUET]:
    # Save data
    saved_files, metadata = file_utils.save_with_metadata(
        data={'data': df},
        output_filetype=format_type,
        output_type="processed",
        file_name=f"format_test_{format_type.value}"
    )
    
    # Load and verify
    loaded_data = file_utils.load_from_metadata(metadata)
    print(f"\n{format_type.value.upper()} format test:")
    print(f"Original shape: {df.shape}")
    print(f"Loaded shape: {loaded_data['data'].shape}")
    print("Data preserved:", df.equals(loaded_data['data']))


CSV format test:
Original shape: (10, 3)
Loaded shape: (10, 3)
Data preserved: False

XLSX format test:
Original shape: (10, 3)
Loaded shape: (10, 3)
Data preserved: False

PARQUET format test:
Original shape: (10, 3)
Loaded shape: (10, 3)
Data preserved: True
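
`DataFrame.equals` requires matching dtypes as well as matching values, so `False` for CSV and XLSX does not necessarily mean data was lost. A likely cause, not verified against FileUtils' loaders, is a dtype change on reload, such as dates coming back as strings from CSV. The sketch below reproduces that effect with plain pandas and an in-memory CSV. The small DataFrame is made up for illustration:

```python
import io

import pandas as pd

original = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=3),
    "value": [0.5, -0.25, 1.5],
    "category": ["A", "B", "A"],
})

# Round-trip through CSV text in memory (plain pandas, not FileUtils).
csv_text = original.to_csv(index=False)
plain = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Compare dtypes column by column to see what changed on reload.
for column in original.columns:
    print(column, original[column].dtype, plain[column].dtype, parsed[column].dtype)

# Without parse_dates the date column is text; with it, it is a datetime again.
date_is_datetime_plain = pd.api.types.is_datetime64_any_dtype(plain["date"])
date_is_datetime_parsed = pd.api.types.is_datetime64_any_dtype(parsed["date"])
print(date_is_datetime_plain, date_is_datetime_parsed)
```

When the comparison matters, `pandas.testing.assert_frame_equal` reports which column differs, which makes it easier to diagnose than a bare `equals` result.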


## 5. Metadata Management

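The metadata file is plain JSON. It records a timestamp, the path and format of each saved file, and the configuration in effect at save time: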

In [10]:
import json

# Save data with metadata
saved_files, metadata = file_utils.save_with_metadata(
    data={
        'raw': df,
        'processed': df.copy().assign(value=lambda x: x['value'] * 2)
    },
    output_filetype=OutputFileType.PARQUET,
    output_type="processed",
    file_name="metadata_test"
)

# Examine metadata contents
with open(metadata, 'r') as f:
    metadata_content = json.load(f)

print("Metadata contents:")
print(json.dumps(metadata_content, indent=2))

Metadata contents:
{
  "timestamp": "2025-10-18T08:15:34.321388",
  "files": {
    "raw": {
      "path": "/Users/topi/data-science/FileUtils/data/processed/metadata_test_20251018_081534_raw.parquet",
      "format": "parquet"
    },
    "processed": {
      "path": "/Users/topi/data-science/FileUtils/data/processed/metadata_test_20251018_081534_processed.parquet",
      "format": "parquet"
    }
  },
  "config": {
    "directory_structure": {
      "data": [
        "raw",
        "processed"
      ]
    },
    "csv_delimiter": ";",
    "encoding": "utf-8",
    "quoting": 0,
    "include_timestamp": true
  }
}
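
Since the metadata is ordinary JSON, it can be inspected without FileUtils. The sketch below builds a small inventory of saved files from the `files` section. The embedded JSON is a trimmed copy of the structure shown above, with shortened placeholder paths:

```python
import json
from pathlib import PurePosixPath

# Trimmed example with the same keys as the output above; paths are placeholders.
metadata_text = """
{
  "timestamp": "2025-10-18T08:15:34.321388",
  "files": {
    "raw": {
      "path": "data/processed/metadata_test_20251018_081534_raw.parquet",
      "format": "parquet"
    },
    "processed": {
      "path": "data/processed/metadata_test_20251018_081534_processed.parquet",
      "format": "parquet"
    }
  }
}
"""

metadata_content = json.loads(metadata_text)

# Map each dataset key to (file name, format).
inventory = {
    key: (PurePosixPath(entry["path"]).name, entry["format"])
    for key, entry in metadata_content["files"].items()
}
for key, (name, fmt) in inventory.items():
    print(f"{key}: {name} [{fmt}]")
```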


## 6. Azure Storage Integration

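To use Azure Storage, you need a valid connection string. The cell below reads it from the `AZURE_STORAGE_CONNECTION_STRING` environment variable (loaded from a `.env` file with python-dotenv) and prints a message instead of failing when it is missing: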

In [11]:
import os
from dotenv import load_dotenv
from FileUtils.core.base import StorageConnectionError

# Load credentials
load_dotenv()

# Try Azure connection
try:
    connection_string = os.getenv("AZURE_STORAGE_CONNECTION_STRING")
    if not connection_string:
        raise ValueError("Azure connection string not found")
        
    azure_utils = FileUtils(
        storage_type="azure",
        connection_string=connection_string
    )
    
    # Save to Azure
    saved_files, metadata = azure_utils.save_with_metadata(
        data={'test': df},
        output_filetype=OutputFileType.PARQUET,
        output_type="processed",
        file_name="azure_test"
    )
    
    print("Successfully saved to Azure:")
    print(saved_files)
    
    # Load from Azure
    loaded_data = azure_utils.load_from_metadata(metadata)
    print("\nSuccessfully loaded from Azure")
    
except (ValueError, StorageConnectionError) as e:
    print(f"Azure setup not available: {e}")

Azure setup not available: Azure connection string not found
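
An Azure Storage connection string is a list of `Key=Value` pairs separated by semicolons. Before passing one to FileUtils, it can help to check that the expected keys are present. The sketch below assumes an account-key style string (SAS-based strings use different keys). `parse_connection_string` is a hypothetical helper, and the values are placeholders:

```python
def parse_connection_string(connection_string: str) -> dict[str, str]:
    """Split 'Key=Value;Key=Value' pairs into a dict (hypothetical helper)."""
    parts = {}
    for segment in connection_string.split(";"):
        if not segment.strip():
            continue  # tolerate a trailing semicolon
        key, sep, value = segment.partition("=")
        if not sep:
            raise ValueError(f"Malformed segment: {segment!r}")
        # partition splits on the first '=', so '=' padding in the key survives.
        parts[key.strip()] = value
    return parts


# Placeholder values only; never commit a real key.
example = (
    "DefaultEndpointsProtocol=https;AccountName=myaccount;"
    "AccountKey=abc123==;EndpointSuffix=core.windows.net"
)
parsed = parse_connection_string(example)
missing = {"AccountName", "AccountKey"} - parsed.keys()
print(sorted(parsed), missing)
```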


## 7. Advanced Configuration

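FileUtils can be initialized with a YAML configuration file. The cell below overrides the CSV delimiter, logging level, and directory structure, then saves a CSV to show the new delimiter: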

In [12]:
import yaml

# Create custom config
config = {
    'csv_delimiter': '|',
    'encoding': 'utf-8',
    'include_timestamp': True,
    'logging_level': 'DEBUG',
    'directory_structure': {
        'data': ['raw', 'interim', 'processed', 'external'],
        'reports': ['figures', 'tables'],
        'models': ['trained', 'evaluations']
    }
}

config_path = Path('custom_config.yaml')
with open(config_path, 'w') as f:
    yaml.dump(config, f)

# Initialize with custom config
custom_utils = FileUtils(config_file=config_path)

# Test custom configuration
saved_files, metadata = custom_utils.save_with_metadata(
    data={'test': df},
    output_filetype=OutputFileType.CSV,
    output_type="processed",
    file_name="custom_config_test"
)

# Show custom delimiter in action
with open(list(saved_files.values())[0], 'r') as f:
    print("CSV with custom delimiter:")
    print(f.read())

# Clean up
config_path.unlink()

2025-10-18 08:15:34,351 - FileUtils.core.file_utils - INFO - Project root: /Users/topi/data-science/FileUtils
2025-10-18 08:15:34,351 - FileUtils.core.file_utils - INFO - FileUtils initialized with local storage
CSV with custom delimiter:
date|value|category
2024-01-01|0.4967141530112327|A
2024-01-02|-0.13826430117118466|A
2024-01-03|0.6476885381006925|B
2024-01-04|1.5230298564080254|B
2024-01-05|-0.23415337472333597|A
2024-01-06|-0.23413695694918055|A
2024-01-07|1.5792128155073915|A
2024-01-08|0.7674347291529088|C
2024-01-09|-0.4694743859349521|C
2024-01-10|0.5425600435859647|C
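
A file written with a custom delimiter has to be read with the same delimiter by any tool outside FileUtils. The sketch below parses the first rows of the output above with the standard library's `csv` module. With pandas, the equivalent is `pd.read_csv(path, sep="|")`:

```python
import csv
import io

# First rows of the file printed above.
pipe_text = (
    "date|value|category\n"
    "2024-01-01|0.4967141530112327|A\n"
    "2024-01-02|-0.13826430117118466|A\n"
)

# DictReader takes the header row as field names; delimiter must match the file.
reader = csv.DictReader(io.StringIO(pipe_text), delimiter="|")
rows = [
    {"date": row["date"], "value": float(row["value"]), "category": row["category"]}
    for row in reader
]
print(rows[0])
```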



## Further Resources

- Check the [Installation Guide](docs/INSTALLATION.md) for detailed setup instructions
- See the [Usage Guide](docs/USAGE.md) for more examples and best practices
- Explore the [Document Types Guide](docs/DOCUMENT_TYPES.md) for comprehensive document handling
- Refer to [Azure Setup](docs/AZURE_SETUP.md) for cloud storage configuration

For issues or suggestions, please visit the GitHub repository.