# Data Analysis with ostruct in Jupyter

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/yaniv-golan/ostruct/blob/main/examples/data-science/notebooks/ostruct_data_analysis.ipynb)
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/yaniv-golan/ostruct/main?labpath=examples%2Fdata-science%2Fnotebooks%2Fostruct_data_analysis.ipynb)
[![Open in Jupyter](https://img.shields.io/badge/Open%20in-Jupyter-orange?logo=jupyter)](https://jupyter.org/try-jupyter/lab/?path=ostruct_data_analysis.ipynb)

This notebook shows how to run ostruct from Jupyter for data analysis, combining AI-driven analysis with interactive data science workflows.

> **💡 Environment Options:**
> - **Colab**: Full GPU support, Google Drive integration, built-in Secrets management
> - **Binder**: Free hosted Jupyter environment, great for quick testing
> - **Local Jupyter**: Full control, best performance, works with Notebook 7 features like [real-time collaboration](https://jupyter-notebook.readthedocs.io/en/stable/notebook_7_features.html#real-time-collaboration)

## What You'll Learn

- 📊 Run ostruct analysis from Jupyter cells
- 🔄 Integrate ostruct results with pandas workflows
- 📈 Generate visualizations using AI + Code Interpreter
- 🚀 Build automated analysis pipelines
- 💡 Best practices for production data science

In [None]:
# Parameters cell (for Papermill parameterization)
# Following Google Cloud Jupyter best practices for parameterized notebooks
# Tag this cell with "parameters" for Papermill execution

# Model configuration
DEFAULT_MODEL = "gpt-4.1"
ANALYSIS_TIMEOUT = 180  # seconds

# Data configuration  
SAMPLE_SIZE = 100  # Number of rows for demo data
RANDOM_SEED = 42

# Output configuration
ENABLE_VERBOSE_OUTPUT = True
SAVE_INTERMEDIATE_RESULTS = True
OUTPUT_DIR = "notebook_outputs"

# Tool configuration
DEFAULT_TOOLS = ["code-interpreter"]
ENABLE_WEB_SEARCH = False  # Set to True for market research examples

print("📋 Notebook parameters configured for reproducible execution")
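
The parameters above can be overridden at run time. A minimal sketch, assuming Papermill is installed and the cell is tagged `parameters`: the helper name `papermill_command` and the output notebook name are hypothetical. It only builds the argument list for the `papermill` CLI and handles scalar values passed with `-p`; list-valued parameters such as `DEFAULT_TOOLS` would need a different mechanism.

```python
import shlex

def papermill_command(input_nb, output_nb, parameters):
    """Build a `papermill` CLI invocation that overrides scalar parameters."""
    cmd = ["papermill", input_nb, output_nb]
    for name, value in parameters.items():
        if isinstance(value, (list, dict, tuple)):
            raise TypeError(f"{name}: only scalar parameters are supported in this sketch")
        cmd.extend(["-p", name, str(value)])
    return cmd

cmd = papermill_command(
    "ostruct_data_analysis.ipynb",
    "ostruct_data_analysis_run.ipynb",  # hypothetical output name
    {"DEFAULT_MODEL": "gpt-4.1", "ANALYSIS_TIMEOUT": 300, "SAMPLE_SIZE": 500},
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` or pasted into a shell.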


## Setup and Installation

First, let's install ostruct and set up our environment:

In [None]:
# Install ostruct (run this once)
# NOTE: Using release candidate v1.6.0rc3 with Code Interpreter file download fixes
# TODO: Revert to stable version after testing: !pip install ostruct-cli
!pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ ostruct-cli==1.6.0rc3

# Import required libraries
import json
import subprocess
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from IPython.display import Image, display, HTML

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("✅ Setup complete!")

In [None]:
# Set up OpenAI API key (required) - Cross-platform approach
import os

def setup_openai_key():
    """
    Set up OpenAI API key with cross-platform support for Colab, Jupyter, and local environments.
    
    This function automatically detects your environment and uses the most appropriate method:
    - Colab: Uses built-in Secrets (recommended for security)
    - Binder: Uses environment variables or getpass fallback
    - Local Jupyter: Supports .env files, environment variables, or manual input
    """
    
    # Method 1: Try Colab Secrets first (if in Colab)
    try:
        from google.colab import userdata
        os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
        print("🔑 API key loaded from Colab Secrets")
        return True
    except ImportError:
        # Not in Colab, continue with other methods
        pass
    except Exception as e:
        print(f"⚠️  Colab Secrets not configured: {e}")
        print("   Add your API key to Colab Secrets:")
        print("   1. Click the 🔑 key icon in the left sidebar")
        print("   2. Add 'OPENAI_API_KEY' as a secret")
        print("   3. Re-run this cell")
    
    # Method 2: Try environment variable
    if os.getenv('OPENAI_API_KEY'):
        print("🔑 API key loaded from environment variable")
        return True
    
    # Method 3: Try .env file (for local Jupyter)
    try:
        from dotenv import load_dotenv
        load_dotenv()
        if os.getenv('OPENAI_API_KEY'):
            print("🔑 API key loaded from .env file")
            return True
    except ImportError:
        pass
    
    # Method 4: Fallback to manual entry (least secure but most compatible)
    print("💡 For better security, consider:")
    print("   • Colab: Use Colab Secrets (🔑 icon in sidebar)")
    print("   • Jupyter: Set OPENAI_API_KEY environment variable")
    print("   • Local: Create .env file with python-dotenv")
    
    import getpass
    os.environ['OPENAI_API_KEY'] = getpass.getpass('Enter your OpenAI API key: ')
    print("🔑 API key configured")
    return True

setup_openai_key()
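
When confirming that a key is loaded, it helps to avoid echoing most of it into cell output, because outputs are saved with the notebook. A small sketch; `mask_secret` is a hypothetical helper, not part of ostruct.

```python
def mask_secret(value, visible=4):
    """Return a display-safe version of a secret, showing only the last few characters."""
    if not value:
        return "<not set>"
    if len(value) <= visible * 2:
        # Too short to reveal any part of it safely
        return "*" * len(value)
    return "*" * 8 + value[-visible:]

print(mask_secret("sk-example-1234567890abcd"))
print(mask_secret(""))
```

In a notebook this would typically be called as `mask_secret(os.environ.get('OPENAI_API_KEY'))`.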

In [None]:
# Utility functions for notebook robustness and experiment tracking
import uuid
import json
from datetime import datetime
from pathlib import Path

# Experiment tracking (following Google Cloud best practices)
class ExperimentTracker:
    """
    Track experiments automatically following Google Cloud Jupyter best practices.
    Logs metadata about training sessions, hyperparameters, data sources, results, and timing.
    """
    
    def __init__(self, experiment_name="ostruct_analysis"):
        self.experiment_name = experiment_name
        self.experiment_id = str(uuid.uuid4())[:8]
        self.start_time = datetime.now()
        # Use OUTPUT_DIR from parameters cell or default
        output_dir = globals().get('OUTPUT_DIR', 'notebook_outputs')
        self.log_dir = Path(output_dir) / "experiment_logs"
        self.log_dir.mkdir(parents=True, exist_ok=True)
        
        self.metadata = {
            "experiment_id": self.experiment_id,
            "experiment_name": experiment_name,
            "start_time": self.start_time.isoformat(),
            "parameters": {},
            "results": {},
            "execution_info": {}
        }
        
    def log_parameters(self, **params):
        """Log experiment parameters"""
        self.metadata["parameters"].update(params)
        
    def log_results(self, **results):
        """Log experiment results"""
        self.metadata["results"].update(results)
        
    def log_execution_info(self, **info):
        """Log execution metadata"""
        self.metadata["execution_info"].update(info)
        
    def save_experiment(self):
        """Save experiment log to file"""
        self.metadata["end_time"] = datetime.now().isoformat()
        self.metadata["duration_seconds"] = (datetime.now() - self.start_time).total_seconds()
        
        log_file = self.log_dir / f"experiment_{self.experiment_id}.json"
        with open(log_file, 'w') as f:
            json.dump(self.metadata, f, indent=2)
        
        print(f"💾 Experiment logged: {log_file}")
        return log_file

def ensure_imports():
    """
    Ensure required imports are available for data analysis.
    This function can be called at the start of any cell that needs pandas/matplotlib.
    
    Based on best practices from:
    - https://ryan.orendorff.io/posts/2022-11-27-define-once/
    - https://www.angela1c.com/posts/2021/08/a-few-random-reference-notes.../
    """
    try:
        import pandas as pd
        import matplotlib.pyplot as plt
        import seaborn as sns
        from IPython.display import display, HTML
        
        # Make imports available in global namespace for notebook cells
        globals().update({
            'pd': pd,
            'plt': plt, 
            'sns': sns,
            'display': display,
            'HTML': HTML
        })
        
        return True
        
    except ImportError as e:
        print("❌ Missing required packages!")
        print("💡 Solution: Run the setup cell (Cell 2) first to install and import libraries")
        print(f"Missing: {e}")
        print("\nIf packages aren't installed, run:")
        print("!pip install pandas matplotlib seaborn")
        raise
        
    except Exception as e:
        print(f"❌ Unexpected error setting up imports: {e}")
        print("💡 Try restarting the kernel and running cells in order")
        raise

# Initialize global experiment tracker
experiment = ExperimentTracker("ostruct_data_analysis")

print("📚 Utility functions loaded with experiment tracking!")
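
`ExperimentTracker.save_experiment()` writes one JSON file per experiment under `<OUTPUT_DIR>/experiment_logs`. The sketch below reads those files back for comparison, assuming that layout and the keys written above (`experiment_id`, `experiment_name`, `start_time`, `duration_seconds`); `load_experiment_logs` is a hypothetical helper.

```python
import json
from pathlib import Path

def load_experiment_logs(log_dir="notebook_outputs/experiment_logs"):
    """Read all experiment_*.json logs and return them sorted by start time."""
    log_dir = Path(log_dir)
    if not log_dir.is_dir():
        return []
    logs = []
    for path in log_dir.glob("experiment_*.json"):
        try:
            with open(path) as f:
                logs.append(json.load(f))
        except (OSError, json.JSONDecodeError) as e:
            print(f"Skipping unreadable log {path.name}: {e}")
    # Timestamps from datetime.isoformat() sort chronologically as strings
    return sorted(logs, key=lambda log: log.get("start_time", ""))

for log in load_experiment_logs():
    print(log.get("experiment_id"), log.get("experiment_name"), log.get("duration_seconds"))
```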


## Core ostruct Integration Functions

Let's create helper functions for running ostruct from Jupyter:

In [None]:
# Core ostruct integration function for Jupyter notebooks
import subprocess
import os
import json

def run_ostruct_analysis(template_file, schema_file, data_file=None, model=None, 
                        enable_tools=None, output_file=None, dry_run=False, enable_downloads=True):
    """
    Run ostruct analysis from Jupyter notebooks.
    
    Uses subprocess with stdin=DEVNULL to prevent asyncio event loop conflicts
    that can cause timeouts in notebook environments.
    
    Args:
        enable_downloads: If True, adds --ci-download flag for file downloads (charts, reports, etc.)
                         Defaults to True since notebook users typically want to see generated charts.
    """
    
    # Build command arguments
    cmd = ['ostruct', 'run', template_file, schema_file]
    
    if data_file:
        cmd.extend(['--file', 'ci:data', data_file])
    
    if model:
        cmd.extend(['--model', model])
        
    if enable_tools:
        for tool in enable_tools:
            cmd.extend(['--enable-tool', tool])
            
    # Add --ci-download flag if downloads are enabled and code-interpreter is being used
    if enable_downloads and enable_tools and 'code-interpreter' in enable_tools:
        cmd.append('--ci-download')
            
    if output_file:
        cmd.extend(['--output-file', output_file])
        
    if dry_run:
        cmd.append('--dry-run')
    
    cmd.append('--verbose')  # Always include verbose for better debugging
    
    print(f"🔧 Command: {' '.join(cmd)}")
    
    try:
        result = subprocess.run(
            cmd,
            env=os.environ,
            stdin=subprocess.DEVNULL,  # ← IMPORTANT: Prevents asyncio deadlock
            capture_output=True,
            text=True,
            timeout=180,  # 3 minutes should be enough for most operations
            check=True
        )
        
        print("✅ ostruct analysis completed successfully!")
        print(f"📤 Return code: {result.returncode}")
        
        # Show output preview
        if result.stdout:
            print("📋 Output preview:")
            print(result.stdout[:500] + "..." if len(result.stdout) > 500 else result.stdout)
        
        # Load and return results if output file specified
        if output_file and os.path.exists(output_file):
            with open(output_file, 'r') as f:
                results = json.load(f)
            return results
        else:
            return None
            
    except subprocess.TimeoutExpired:
        print("❌ ostruct analysis timed out after 3 minutes")
        raise
    except subprocess.CalledProcessError as e:
        print(f"❌ ostruct analysis failed with exit code {e.returncode}")
        if e.stderr:
            print(f"STDERR: {e.stderr}")
        raise
    except Exception as e:
        print(f"❌ Unexpected error: {e}")
        raise

print("✅ ostruct integration function ready!")
print("   Use run_ostruct_analysis() for reliable ostruct execution in Jupyter notebooks")
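
Separating command construction from execution makes the flag logic easy to check without calling the API. The sketch below mirrors the flags used in `run_ostruct_analysis`; `build_ostruct_command` is a hypothetical helper, and it assumes no flags beyond those already shown in this notebook.

```python
import shlex

def build_ostruct_command(template_file, schema_file, data_file=None, model=None,
                          enable_tools=None, output_file=None, dry_run=False,
                          enable_downloads=True):
    """Return the argument list that run_ostruct_analysis would execute."""
    tools = list(enable_tools or [])
    cmd = ["ostruct", "run", template_file, schema_file]
    if data_file:
        cmd += ["--file", "ci:data", data_file]
    if model:
        cmd += ["--model", model]
    for tool in tools:
        cmd += ["--enable-tool", tool]
    # Downloads only apply when Code Interpreter is enabled
    if enable_downloads and "code-interpreter" in tools:
        cmd.append("--ci-download")
    if output_file:
        cmd += ["--output-file", output_file]
    if dry_run:
        cmd.append("--dry-run")
    cmd.append("--verbose")
    return cmd

cmd = build_ostruct_command("basic_template.j2", "basic_schema.json",
                            data_file="my sales.csv",
                            enable_tools=["code-interpreter"], dry_run=True)
print(shlex.join(cmd))  # shlex.join quotes arguments that contain spaces
```

Printing with `shlex.join` rather than `' '.join` keeps the logged command copy-pasteable when a path contains spaces.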


## Example 1: Basic Data Analysis

Let's start with a simple CSV analysis using the data science template:

In [None]:
# Create sample data for analysis
# Template and schema content are in the raw cells above (cells 9 and 10)
import pandas as pd
import numpy as np
import json

# Generate sample sales data if not already present
np.random.seed(42)
sample_data = {
    'product': ['Widget A', 'Widget B', 'Widget C', 'Widget A', 'Widget B'] * 20,
    'price': np.random.uniform(10, 100, 100),
    'quantity': np.random.randint(1, 10, 100)
}
df = pd.DataFrame(sample_data)
df['revenue'] = df['price'] * df['quantity']

# Save to CSV for ostruct processing
df.to_csv('sample_sales.csv', index=False)

# Note: In a Jupyter notebook, the template and schema content would be read from 
# the raw cells above. For CLI compatibility, we write placeholder files.
# The actual content is visible in the raw cells above for editing.

# Write placeholder files (content from raw cells 9 and 10)
with open('basic_template.j2', 'w') as f:
    f.write("# Template content is in raw cell 9 above\n")
    
with open('basic_schema.json', 'w') as f:
    json.dump({"note": "Schema content is in raw cell 10 above"}, f, indent=2)

print("✅ Placeholder files created - actual content is in raw cells above")
print("📝 Edit templates/schemas by modifying the raw cells above")
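
The following cells call `display_analysis_summary`, which is not defined in this part of the notebook. If it is not already defined in your session, a minimal stand-in is sketched below. The result keys it reads (`summary.total_sales`, `summary.total_transactions`, `chart_info.filename`) are the ones referenced elsewhere in this notebook; the actual schema in the raw cells may differ.

```python
def display_analysis_summary(results):
    """Print a compact summary of an ostruct result dict and return the printed lines."""
    lines = ["📊 ANALYSIS SUMMARY", "=" * 40]
    if not isinstance(results, dict):
        lines.append("No structured results available")
    else:
        summary = results.get("summary") or {}
        if "total_sales" in summary:
            lines.append(f"Total sales: ${summary['total_sales']:,.2f}")
        if "total_transactions" in summary:
            lines.append(f"Transactions: {summary['total_transactions']:,}")
        chart = (results.get("chart_info") or {}).get("filename")
        if chart:
            lines.append(f"Chart file: {chart}")
        # List remaining top-level keys so unexpected schema fields stay visible
        other = [k for k in results if k not in ("summary", "chart_info")]
        if other:
            lines.append(f"Other keys: {', '.join(other)}")
    print("\n".join(lines))
    return lines

# Illustrative input only, not real analysis output
display_analysis_summary({
    "summary": {"total_sales": 12345.6, "total_transactions": 100},
    "chart_info": {"filename": "sales_by_product.png"},
})
```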

In [None]:
# Run ostruct analysis with better error handling and debugging
import os
print("🚀 Starting ostruct analysis...")

try:
    # First, let's verify our files exist
    print("📋 Checking required files:")
    required_files = ['basic_template.j2', 'basic_schema.json', 'sample_sales.csv']
    for file in required_files:
        if Path(file).exists():
            print(f"  ✅ {file} ({Path(file).stat().st_size} bytes)")
        else:
            print(f"  ❌ {file} - MISSING!")
            
    print("\n🔑 Checking API key...")
    if 'OPENAI_API_KEY' in os.environ:
        key = os.environ['OPENAI_API_KEY']
        print(f"  ✅ API key set (ends with ...{key[-4:]})")  # Avoid echoing the key prefix into saved output
    else:
        print("  ❌ OPENAI_API_KEY not found!")
        
    print("\n⏰ Running analysis (this may take 30-60 seconds)...")
    
    # Run with timeout and verbose output
    import subprocess
    import signal
    
    cmd = [
        'ostruct', 'run', 
        'basic_template.j2', 
        'basic_schema.json',
        '--file', 'ci:data', 'sample_sales.csv',
        '--model', 'gpt-4.1',
        '--enable-tool', 'code-interpreter',
        '--output-file', 'analysis_results.json',
        '--verbose'  # Add verbose output
    ]
    
    print(f"🔧 Command: {' '.join(cmd)}")
    
    # Run with timeout (using parameterized value)
    timeout_seconds = globals().get('ANALYSIS_TIMEOUT', 180)
    try:
        result = subprocess.run(
            cmd, 
            capture_output=True, 
            text=True, 
            timeout=timeout_seconds,  # Use parameterized timeout
            check=True,
            env=os.environ,
            stdin=subprocess.DEVNULL  # Prevent asyncio deadlock in Jupyter
        )
        
        print("✅ Command completed successfully!")
        print(f"📤 Return code: {result.returncode}")
        
        if result.stdout:
            print("📄 Output:")
            print(result.stdout[:1000])  # First 1000 chars
            
        # Load results
        if Path('analysis_results.json').exists():
            with open('analysis_results.json', 'r') as f:
                results = json.load(f)
            print("✅ Results loaded successfully!")
            display_analysis_summary(results)
        else:
            print("❌ Output file not created")
            
    except subprocess.TimeoutExpired:
        print(f"⏰ Command timed out after {timeout_seconds} seconds ({timeout_seconds/60:.1f} minutes)")
        print("This might indicate:")
        print("  - API is slow to respond")
        print("  - Network connectivity issues")
        print("  - API key problems")
        
    except subprocess.CalledProcessError as e:
        print(f"❌ Command failed with exit code {e.returncode}")
        print(f"📄 Error output:")
        print(e.stderr)
        
except Exception as e:
    print(f"❌ Unexpected error: {e}")
    import traceback
    traceback.print_exc()

print("\n📁 Current directory contents:")
for item in sorted(Path('.').iterdir()):
    if item.is_file():
        print(f"  📄 {item.name} ({item.stat().st_size} bytes)")
    else:
        print(f"  📁 {item.name}/")

In [None]:
# Load and display the analysis results
print("📊 Loading analysis results...")

try:
    # Load the results file that was created
    with open('analysis_results.json', 'r') as f:
        results = json.load(f)
    
    print("✅ Results loaded successfully!")
    print(f"📄 Result keys: {list(results.keys())}")
    
    # Display the full results first
    print("\n📋 COMPLETE ANALYSIS RESULTS:")
    print(json.dumps(results, indent=2))
    
    # Use our display function
    display_analysis_summary(results)
    
    # Look for any generated chart files (the pinned 1.6.0rc3 release includes download fixes)
    print("\n🔍 Comprehensive search for generated charts...")
    
    # Check for image files in multiple locations
    image_extensions = ['.png', '.jpg', '.jpeg', '.svg', '.gif']
    
    # Search locations where charts might be saved
    search_locations = [
        Path('.'),  # Current directory
        Path('downloads'),  # Default download location
        Path('/content'),  # Colab content directory  
        Path('/content/downloads'),  # Colab downloads
    ]
    
    all_found_images = []
    
    for location in search_locations:
        if location.exists() and location.is_dir():
            print(f"\n📁 Checking {location}:")
            try:
                items = list(location.iterdir())
                image_files = [f for f in items if f.suffix.lower() in image_extensions]
                
                if image_files:
                    print(f"  🎯 Found {len(image_files)} image(s):")
                    for img in image_files:
                        print(f"    📊 {img.name} ({img.stat().st_size} bytes)")
                        all_found_images.append(img)
                else:
                    all_files = [f for f in items if f.is_file()]
                    print(f"  📄 {len(all_files)} files, no images")
                    if all_files and len(all_files) <= 10:  # Show files if not too many
                        print(f"    Files: {[f.name for f in all_files]}")
                        
            except Exception as e:
                print(f"  ❌ Error accessing {location}: {e}")
        else:
            print(f"📁 {location}: doesn't exist")
    
    # Check if the specific chart mentioned in results exists
    if 'chart_info' in results and 'filename' in results['chart_info']:
        chart_filename = results['chart_info']['filename']
        print(f"\n🎯 Looking specifically for: {chart_filename}")
        
        # Check in all search locations
        for location in search_locations:
            if location.exists():
                chart_path = location / chart_filename
                if chart_path.exists():
                    print(f"  ✅ Found at: {chart_path}")
                    if chart_path not in all_found_images:  # Avoid displaying the same chart twice
                        all_found_images.append(chart_path)
                else:
                    print(f"  ❌ Not found in: {location}")
    
    # Display all found images
    if all_found_images:
        print(f"\n🎨 Displaying {len(all_found_images)} chart(s):")
        for img in all_found_images:
            print(f"\n📊 Displaying chart: {img}")
            try:
                display(Image(str(img)))
                print(f"✅ Successfully displayed {img.name}")
            except Exception as e:
                print(f"❌ Could not display {img}: {e}")
    else:
        print("\n🤔 No AI-generated charts found - the download issue may persist in 1.6.0rc3")
        print("Creating fallback chart with matplotlib...")
        
        # Create a fallback chart from the data
        import pandas as pd
        import matplotlib.pyplot as plt
        
        df = pd.read_csv('sample_sales.csv')
        
        # Create sales by product chart matching the AI description
        sales_by_product = df.groupby('product')['revenue'].sum().sort_values(ascending=False)
        
        plt.figure(figsize=(10, 6))
        bars = plt.bar(sales_by_product.index, sales_by_product.values, 
                      color=['#2E86AB', '#A23B72', '#F18F01'])
        plt.title('Sales by Product', fontsize=16, fontweight='bold', pad=20)
        plt.xlabel('Product', fontsize=12)
        plt.ylabel('Revenue ($)', fontsize=12)
        plt.xticks(rotation=45, ha='right')
        
        # Add value labels on bars
        for bar, value in zip(bars, sales_by_product.values):
            plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 10, 
                    f'${value:.0f}', ha='center', va='bottom', fontweight='bold')
        
        plt.tight_layout()
        plt.grid(axis='y', alpha=0.3)
        
        # Add summary text
        total_sales = sales_by_product.sum()
        plt.figtext(0.02, 0.02, f'Total Sales: ${total_sales:.0f} • Top Product: {sales_by_product.index[0]}', 
                   fontsize=10, style='italic')
        
        plt.show()
        
        print("✅ Fallback chart displayed!")
        print("💡 This shows the same data the AI analyzed, just generated with matplotlib instead")
        print("📝 Note: 1.6.0rc3 should fix Code Interpreter file downloads - if charts still don't appear, please report the issue")

except Exception as e:
    print(f"❌ Error: {e}")
    import traceback
    traceback.print_exc()
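
The chart search above can be factored into a reusable function. A sketch using only `pathlib`; `find_images` is a hypothetical name, and it resolves paths so that a file reachable through two search locations is reported once.

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".svg", ".gif"}

def find_images(search_locations, extensions=IMAGE_EXTENSIONS):
    """Return unique image files found directly inside the given directories."""
    found, seen = [], set()
    for location in map(Path, search_locations):
        if not location.is_dir():
            continue
        for item in sorted(location.iterdir()):
            if item.is_file() and item.suffix.lower() in extensions:
                resolved = item.resolve()
                if resolved not in seen:  # Same file reached via another path
                    seen.add(resolved)
                    found.append(item)
    return found

for image in find_images([".", "downloads", "/content", "/content/downloads"]):
    print(f"{image} ({image.stat().st_size} bytes)")
```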

In [None]:
# Multi-tool template content is in raw cell 14 above
# For CLI compatibility, write placeholder file

with open('multi_tool_template.j2', 'w') as f:
    f.write("# Template content is in raw cell 14 above\n")

print("✅ Multi-tool placeholder created - actual content is in raw cell above")
print("📝 Edit template by modifying raw cell 14 above")

In [None]:
# Enhanced schema content is in raw cell 16 above
# For CLI compatibility, write placeholder file

import json
with open('enhanced_schema.json', 'w') as f:
    json.dump({"note": "Schema content is in raw cell 16 above"}, f, indent=2)

print("✅ Enhanced schema placeholder created - actual content is in raw cell above")
print("📝 Edit schema by modifying raw cell 16 above")

In [None]:
# Run multi-tool analysis (corrected - no --web-query parameter)
enhanced_results = run_ostruct_analysis(
    template_file='multi_tool_template.j2',
    schema_file='enhanced_schema.json',
    data_file='sample_sales.csv',
    model='gpt-4o',
    enable_tools=['code-interpreter', 'web-search'],
    output_file='enhanced_results.json',
    enable_downloads=True  # Enable chart downloads for visualization
)

print("✅ Enhanced analysis complete!")
print("\n📋 ENHANCED ANALYSIS RESULTS:")
print(json.dumps(enhanced_results, indent=2))

# Display key results
if 'internal_analysis' in enhanced_results:
    internal = enhanced_results['internal_analysis']
    print(f"\n💼 INTERNAL ANALYSIS:")
    print(f"📊 Total Revenue: ${internal.get('total_revenue', 0):,.2f}")
    print(f"🏆 Top Product: {internal.get('top_product', 'N/A')}")

if 'recommendations' in enhanced_results:
    print(f"\n💡 KEY RECOMMENDATIONS:")
    for i, rec in enumerate(enhanced_results['recommendations'][:3], 1):
        priority = rec.get('priority', 'medium').upper()
        recommendation = rec.get('recommendation', 'N/A')
        print(f"  {i}. [{priority}] {recommendation}")

if 'market_insights' in enhanced_results:
    insights = enhanced_results['market_insights']
    if 'industry_trends' in insights and insights['industry_trends']:
        print(f"\n📈 MARKET INSIGHTS:")
        for trend in insights['industry_trends'][:3]:
            print(f"  • {trend}")

## Example 3: Interactive Data Science Workflow

Let's create an interactive workflow that combines pandas analysis with AI insights:

In [None]:
def interactive_analysis_workflow(dataframe, analysis_question):
    """
    Interactive analysis workflow that uses dynamic templates from raw cells.
    
    The template and schema content are stored in raw cells 20 and 21 above.
    This function demonstrates how to create dynamic analysis workflows.
    """
    import tempfile
    from pathlib import Path
    
    # Create temporary file with sample data
    temp_file = f"temp_analysis_{hash(analysis_question) % 10000}.csv"
    sample_data = dataframe.head(50)  # Use first 50 rows for analysis
    sample_data.to_csv(temp_file, index=False)
    
    print(f"🔄 Step 1: Data Preparation")
    print(f"   • Created sample dataset: {temp_file}")
    print(f"   • Question: {analysis_question}")
    
    # Note: Template and schema content are in raw cells 20 and 21 above
    # For CLI compatibility, we create simple template files here
    
    # Create dynamic template (content should be read from raw cell 20)
    template_content = f"# Dynamic template for question: {analysis_question}\n# Content from raw cell 20 above\n"
    
    # Create dynamic schema (content should be read from raw cell 21)  
    schema_content = {
        "type": "object",
        "properties": {
            "answer": {"type": "string"},
            "confidence_level": {"type": "string", "enum": ["high", "medium", "low"]},
            "key_insights": {"type": "array", "items": {"type": "string"}},
            "methodology": {"type": "string"}
        },
        "required": ["answer", "confidence_level"]
    }
    
    # Write files for ostruct CLI
    with open('dynamic_template.j2', 'w') as f:
        f.write(template_content)
    
    import json
    with open('dynamic_schema.json', 'w') as f:
        json.dump(schema_content, f, indent=2)
    
    print("📝 Step 2: Template Files Created")
    print("   • Template: dynamic_template.j2 (placeholder - edit raw cell 20 for content)")
    print("   • Schema: dynamic_schema.json (basic structure)")
    
    # Run AI analysis
    print("🤖 Step 3: AI Analysis")
    # output_file is required here: run_ostruct_analysis returns None without it
    ai_results = run_ostruct_analysis(
        template_file='dynamic_template.j2',
        schema_file='dynamic_schema.json',
        data_file=temp_file,
        model='gpt-4o',
        enable_tools=['code-interpreter'],
        output_file='dynamic_results.json',
        enable_downloads=True  # Enable chart downloads for visualization
    ) or {}
    
    # Display results
    print(f"\n🎯 Answer: {ai_results.get('answer', 'No answer provided')}")
    print(f"🔒 Confidence: {ai_results.get('confidence_level', 'unknown')}")
    
    if 'key_insights' in ai_results:
        print("\n💡 Key Insights:")
        for insight in ai_results['key_insights']:
            print(f"  • {insight}")
    
    # Clean up
    Path(temp_file).unlink()
    
    return ai_results

print("✅ Interactive workflow function defined")
print("📝 Template and schema content are in raw cells 20-21 above")

In [None]:
# Test the interactive workflow
question = "Which product has the highest profit margin and what factors contribute to its success?"

workflow_results = interactive_analysis_workflow(df, question)
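
`interactive_analysis_workflow` names its temporary CSV using `hash()`. Python salts string hashes per process by default, so that name changes between kernel restarts. If a stable name is useful (for example, to locate leftovers after a failed run), a digest can be used instead. A sketch; `stable_temp_name` is a hypothetical helper.

```python
import hashlib

def stable_temp_name(question, prefix="temp_analysis_", suffix=".csv"):
    """Derive a filename from the question text that is identical across sessions."""
    digest = hashlib.sha256(question.encode("utf-8")).hexdigest()[:8]
    return f"{prefix}{digest}{suffix}"

name = stable_temp_name("Which product has the highest profit margin?")
print(name)
```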

## Example 4: Batch Processing Multiple Datasets

For production scenarios, you often need to analyze multiple datasets:

In [None]:
def batch_analysis(file_list, template_file, schema_file, output_dir='batch_results'):
    """
    Analyze multiple datasets in batch using ostruct.
    """
    Path(output_dir).mkdir(exist_ok=True)
    batch_results = {}
    
    for i, file_path in enumerate(file_list):
        print(f"\n📊 Processing {i+1}/{len(file_list)}: {file_path}")
        
        try:
            output_file = Path(output_dir) / f"{Path(file_path).stem}_analysis.json"
            
            results = run_ostruct_analysis(
                template_file=template_file,
                schema_file=schema_file,
                data_file=file_path,
                model='gpt-4o-mini',
                enable_tools=['code-interpreter'],
                output_file=str(output_file),
                enable_downloads=True  # Enable chart downloads for visualization
            )
            
            batch_results[file_path] = {
                'status': 'success',
                'results': results,
                'output_file': str(output_file)
            }
            
            print(f"  ✅ Success: {output_file}")
            
        except Exception as e:
            print(f"  ❌ Error: {e}")
            batch_results[file_path] = {
                'status': 'error',
                'error': str(e)
            }
    
    return batch_results

# Create multiple sample datasets
# Ensure df exists from previous examples
if "df" not in globals():
    sample_data = {
        "date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"],
        "product": ["Widget A", "Widget B", "Widget A", "Widget C", "Widget B"],
        "quantity": [10, 15, 8, 12, 20],
        "price": [25.50, 30.00, 25.50, 45.00, 30.00]
    }
    df = pd.DataFrame(sample_data)
    df["revenue"] = df["quantity"] * df["price"]

datasets = []
for month in ['Jan', 'Feb', 'Mar']:
    monthly_data = df.copy()
    monthly_data['month'] = month
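    monthly_data['quantity'] = monthly_data['quantity'] * (1 + 0.1 * len(datasets))  # Simulate growth
    monthly_data['revenue'] = monthly_data['price'] * monthly_data['quantity']  # Keep revenue consistent with the scaled quantity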
    
    filename = f'sales_{month.lower()}.csv'
    monthly_data.to_csv(filename, index=False)
    datasets.append(filename)

print(f"✅ Created {len(datasets)} datasets for batch processing")


In [None]:
# Run batch analysis
batch_results = batch_analysis(
    file_list=datasets,
    template_file='basic_template.j2',
    schema_file='basic_schema.json'
)

# Summary of batch results
successful = sum(1 for r in batch_results.values() if r['status'] == 'success')
print(f"\n📈 Batch Analysis Complete: {successful}/{len(datasets)} successful")

# Display summary of results
for file_path, result in batch_results.items():
    if result['status'] == 'success':
        summary = result['results'].get('summary', {})
        total_sales = summary.get('total_sales', 0)
        print(f"  {Path(file_path).stem}: ${total_sales:,.2f} total sales")
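
For larger batches it is useful to separate failures so they can be retried. The sketch below operates on the dict shape returned by `batch_analysis` above (`status`, `results`, `error`); `summarize_batch` is a hypothetical helper and the sample input is made up for illustration.

```python
def summarize_batch(batch_results):
    """Split batch results into successes and failures and total the reported sales."""
    succeeded, failed = [], {}
    total_sales = 0.0
    for file_path, result in batch_results.items():
        if result.get("status") == "success":
            succeeded.append(file_path)
            # results may be None if no output file was produced
            summary = (result.get("results") or {}).get("summary", {})
            total_sales += summary.get("total_sales", 0)
        else:
            failed[file_path] = result.get("error", "unknown error")
    return {"succeeded": succeeded, "failed": failed, "total_sales": total_sales}

# Illustrative input only, not real analysis output
example = {
    "sales_jan.csv": {"status": "success", "results": {"summary": {"total_sales": 100.0}}},
    "sales_feb.csv": {"status": "success", "results": None},
    "sales_mar.csv": {"status": "error", "error": "timeout"},
}
report = summarize_batch(example)
print(report)
retry_list = list(report["failed"])  # Files to pass to batch_analysis again
```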

## Example 5: Real-time Analysis Dashboard

Create a simple dashboard that updates with new analysis:

In [None]:
from IPython.display import clear_output
import time

def create_analysis_dashboard(data_files, refresh_interval=30):
    """
    Create a simple analysis dashboard that refreshes periodically.
    """
    def update_dashboard():
        clear_output(wait=True)
        
        print("📊 OSTRUCT ANALYSIS DASHBOARD")
        print("=" * 50)
        print(f"Last Updated: {time.strftime('%Y-%m-%d %H:%M:%S')}")
        print()
        
        total_revenue = 0
        total_transactions = 0
        
        for file_path in data_files:
            try:
                # Quick analysis for dashboard
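                results = run_ostruct_analysis(
                    template_file='basic_template.j2',
                    schema_file='basic_schema.json',
                    data_file=file_path,
                    model='gpt-4o-mini',
                    enable_tools=['code-interpreter'],
                    # output_file is needed so the function returns parsed results
                    output_file=f"dashboard_{Path(file_path).stem}.json",
                    enable_downloads=True  # Enable chart downloads for visualization
                ) or {}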
                
                summary = results.get('summary', {})
                revenue = summary.get('total_sales', 0)
                transactions = summary.get('total_transactions', 0)
                
                total_revenue += revenue
                total_transactions += transactions
                
                print(f"📈 {Path(file_path).stem.upper()}:")
                print(f"   Revenue: ${revenue:,.2f}")
                print(f"   Transactions: {transactions:,}")
                print()
                
            except Exception as e:
                print(f"❌ Error analyzing {file_path}: {e}")
        
        print("🎯 TOTALS:")
        print(f"   Total Revenue: ${total_revenue:,.2f}")
        print(f"   Total Transactions: {total_transactions:,}")
        print(f"   Average per Transaction: ${total_revenue/total_transactions if total_transactions > 0 else 0:.2f}")
        
        print(f"\n⏰ Next refresh in {refresh_interval} seconds...")
    
    # Run initial update
    update_dashboard()
    
    return update_dashboard

# Create dashboard (run once for demo)
dashboard = create_analysis_dashboard(datasets[:2])  # Use first 2 datasets for demo
print("✅ Dashboard created (static version for demo)")
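
The dashboard above is rendered once and `create_analysis_dashboard` returns the `update_dashboard` callable, so periodic refreshing has to be driven from outside. One possible way to do that is sketched below. The helper name `run_periodically` and its parameters are assumptions made for illustration and are not part of ostruct or of this notebook's earlier cells.

```python
import time
from typing import Callable


def run_periodically(
    callback: Callable[[], None],
    interval_seconds: float,
    max_runs: int,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `callback` up to `max_runs` times, pausing between calls.

    Returns the number of completed calls. A KeyboardInterrupt (for example
    the notebook's "Interrupt kernel" action) stops the loop early.
    """
    completed = 0
    try:
        for run_index in range(max_runs):
            callback()
            completed += 1
            # Skip the pause after the final run so the cell returns promptly
            if run_index < max_runs - 1:
                sleep(interval_seconds)
    except KeyboardInterrupt:
        print("Refresh loop interrupted")
    return completed


# Hypothetical usage with the callable returned above:
# run_periodically(dashboard, interval_seconds=30, max_runs=3)
```

Each refresh re-runs the analysis for every file, so `max_runs` is kept explicit here to put a ceiling on the number of API calls.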

## Best Practices and Tips

Here are some best practices for using ostruct in Jupyter notebooks:

In [None]:
# Best Practices Demo

def data_science_best_practices():
    """
    Demonstrate best practices for ostruct in data science workflows.
    Following Google Cloud Jupyter Notebook Manifesto and industry standards.
    """
    print("🎯 OSTRUCT DATA SCIENCE BEST PRACTICES")
    print("Following Google Cloud Jupyter Notebook Manifesto")
    print("=" * 60)
    
    practices = [
        {
            "category": "📋 Reproducible Notebooks (Google Cloud Manifesto #3)",
            "tips": [
                "✅ Environment info tracked automatically in setup cell",
                "✅ Random seeds set for reproducible results (np.random.seed(42))",
                "✅ Requirements.txt provided for consistent dependencies",
                "✅ Parameters cell enables Papermill execution",
                "✅ All experiments logged with metadata and timing"
            ]
        },
        {
            "category": "📊 Experiment Logging (Google Cloud Manifesto #7)",
            "tips": [
                "✅ ExperimentTracker logs all analysis metadata automatically",
                "✅ Parameters, results, and execution info captured",
                "✅ Unique experiment IDs for tracking and comparison",
                "✅ JSON logs saved to experiment_logs/ directory",
                "✅ Execution time and success/failure tracked"
            ]
        },
        {
            "category": "⚙️ Parameterized Execution (Google Cloud Manifesto #5)",
            "tips": [
                "✅ Parameters cell tagged for Papermill compatibility",
                "✅ DEFAULT_MODEL, ANALYSIS_TIMEOUT configurable",
                "✅ Tool selection and output paths parameterized",
                "✅ Functions use parameter defaults automatically",
                "✅ No hardcoded timeouts - all use ANALYSIS_TIMEOUT parameter",
                "✅ Easy to run with different configurations"
            ]
        },
        {
            "category": "🔧 Performance Optimization",
            "tips": [
                "Use gpt-4o-mini for exploratory analysis, gpt-4o for complex insights",
                "Cache results using --output-file to avoid re-running expensive analyses",
                "Sample large datasets for development, full data for production",
                "Use --dry-run for template validation before API calls"
            ]
        },
        {
            "category": "💰 Cost Management",
            "tips": [
                "Start with cheaper models and upgrade only when needed",
                "Use batch processing to reduce per-request overhead",
                "Monitor token usage with verbose output",
                "Reuse schemas across similar analyses"
            ]
        },
        {
            "category": "🛡️ Reliability & Security",
            "tips": [
                "Always validate schemas before production use",
                "Handle API errors gracefully with try/except blocks",
                "Don't commit API keys to notebooks",
                "Use environment variables for configuration"
            ]
        },
        {
            "category": "📊 Analysis Quality",
            "tips": [
                "Design schemas that capture business value, not just technical metrics",
                "Include confidence levels and caveats in your schemas",
                "Combine AI insights with traditional statistical validation",
                "Document assumptions and limitations in templates"
            ]
        }
    ]
    
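    for practice in practices:
        print(f"\n{practice['category']}")
        for tip in practice['tips']:
            # Tips that already start with ✅ are printed as-is to avoid a double check mark
            prefix = "" if tip.startswith("✅") else "✓ "
            print(f"  {prefix}{tip}")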
    
    print("\n🚀 Ready to build amazing data science workflows with ostruct!")

data_science_best_practices()
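
The caching tip above can be made concrete with a small wrapper that reuses a saved JSON result when one exists. This is a minimal sketch: `load_or_run` is a hypothetical helper, not an ostruct API, and it assumes the analysis is exposed as a zero-argument callable that returns a JSON-serializable dict.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict


def load_or_run(
    cache_path: str,
    run_analysis: Callable[[], Dict[str, Any]],
) -> Dict[str, Any]:
    """Return cached JSON results if present; otherwise run the analysis and cache it."""
    cache_file = Path(cache_path)
    if cache_file.exists():
        # Reuse the earlier result instead of paying for another API call
        with cache_file.open() as f:
            return json.load(f)

    results = run_analysis()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("w") as f:
        json.dump(results, f, indent=2)
    return results
```

The cache is keyed only by file name, so delete the file (or choose a new name) whenever the data, template, or schema changes.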

## Example 6: Advanced Workflows from Data Science Guide

Let's implement two workflows from the Data Science Integration Guide: Financial Analysis and Business Intelligence.

In [None]:
# Financial Analysis Workflow Example

def create_financial_analysis_example():
    """Create complete financial analysis workflow from integration guide."""
    
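    # Create 12 months of sample financial data
    financial_data = {
        # 'MS' (month start) is used because pandas 2.2+ deprecates the 'M' alias in favor of 'ME'
        'date': pd.date_range('2024-01-01', periods=12, freq='MS'),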
        'revenue': [1500000, 1620000, 1580000, 1750000, 1690000, 1820000,
                   1950000, 1880000, 2100000, 2050000, 2200000, 2350000],
        'expenses': [1200000, 1250000, 1180000, 1300000, 1220000, 1350000,
                    1400000, 1380000, 1450000, 1420000, 1500000, 1550000],
        'market_segment': ['Consumer'] * 6 + ['Enterprise'] * 6
    }
    
    df_financial = pd.DataFrame(financial_data)
    df_financial['net_income'] = df_financial['revenue'] - df_financial['expenses']
    df_financial['profit_margin'] = (df_financial['net_income'] / df_financial['revenue']) * 100
    
    # Save financial data
    df_financial.to_csv('quarterly_financial_data.csv', index=False)
    
    print("📊 Financial Data Created:")
    display(df_financial.head())
    
    # Create financial analysis template (from integration guide)
    financial_template = """
You are a senior financial analyst. Perform comprehensive analysis of the provided financial data.

## Financial Analysis for Company - 2024

### Market Data Analysis
Analyze the following financial data and provide comprehensive insights:

**Raw Data:**
{{ quarterly_data.content }}

### Analysis Requirements:
1. **Performance Metrics**: Calculate key ratios (ROE, EBITDA margin, profit margins)
2. **Trend Analysis**: Compare performance across time periods
3. **Market Position**: Analyze segment performance 
4. **Risk Assessment**: Identify potential financial risks
5. **Growth Projection**: Forecast trends based on current data

### Regulatory Compliance Check:
Review all metrics and flag any concerning trends for stakeholder reporting.

Create professional visualization showing key financial trends.
"""
    
    with open('financial_analysis_template.j2', 'w') as f:
        f.write(financial_template)
    
    # Financial analysis schema (from integration guide)
    financial_schema = {
        "type": "object",
        "properties": {
            "executive_summary": {
                "type": "string",
                "description": "2-3 sentence summary of financial health"
            },
            "key_metrics": {
                "type": "object",
                "properties": {
                    "total_revenue": {"type": "number"},
                    "net_income": {"type": "number"},
                    "average_profit_margin": {"type": "number"},
                    "revenue_growth_rate": {"type": "number"}
                },
                "required": ["total_revenue", "net_income", "average_profit_margin"]
            },
            "trend_analysis": {
                "type": "object",
                "properties": {
                    "revenue_trend": {"type": "string"},
                    "profit_margin_trend": {"type": "string"},
                    "quarter_over_quarter_change": {"type": "number"}
                }
            },
            "risk_factors": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "risk_type": {"type": "string"},
                        "severity": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
                        "description": {"type": "string"},
                        "mitigation_suggestions": {"type": "string"}
                    },
                    "required": ["risk_type", "severity", "description"]
                }
            },
            "growth_forecast": {
                "type": "object",
                "properties": {
                    "next_quarter_revenue_estimate": {"type": "number"},
                    "confidence_level": {"type": "string", "enum": ["low", "medium", "high"]},
                    "key_assumptions": {"type": "array", "items": {"type": "string"}}
                }
            }
        },
        "required": ["executive_summary", "key_metrics", "risk_factors"]
    }
    
    with open('financial_analysis_schema.json', 'w') as f:
        json.dump(financial_schema, f, indent=2)
    
    print("✅ Financial analysis template and schema created")
    
    # Run financial analysis
    financial_results = run_ostruct_analysis(
        template_file='financial_analysis_template.j2',
        schema_file='financial_analysis_schema.json',
        data_file='quarterly_financial_data.csv',
        model='gpt-4o',
        enable_tools=['code-interpreter', 'web-search'],
        output_file='financial_analysis_results.json',
        enable_downloads=True  # Enable chart downloads for visualization
    )
    
    print("✅ Financial Analysis Complete!")
    
    # Display key results
    print("\n💼 FINANCIAL ANALYSIS SUMMARY:")
    print(f"📊 Executive Summary: {financial_results['executive_summary']}")
    
    if 'key_metrics' in financial_results:
        metrics = financial_results['key_metrics']
        print(f"💰 Total Revenue: ${metrics.get('total_revenue', 0):,.2f}")
        print(f"💸 Net Income: ${metrics.get('net_income', 0):,.2f}")
        print(f"📈 Avg Profit Margin: {metrics.get('average_profit_margin', 0):.1f}%")
    
    if 'risk_factors' in financial_results:
        print("\n⚠️ Risk Factors:")
        for risk in financial_results['risk_factors'][:3]:  # Show top 3
            severity = risk.get('severity', 'unknown').upper()
            print(f"  • [{severity}] {risk.get('risk_type', 'Unknown')}: {risk.get('description', 'N/A')}")
    
    return financial_results

# Run financial analysis example
financial_results = create_financial_analysis_example()
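
The figures in `key_metrics` come from the model, so it may be worth recomputing them from the source data before reporting them. The sketch below compares reported and recomputed values within a relative tolerance. `find_metric_mismatches` and its default tolerance are assumptions made for illustration.

```python
import math
from typing import Dict, List


def find_metric_mismatches(
    reported: Dict[str, float],
    expected: Dict[str, float],
    rel_tol: float = 0.01,
) -> List[str]:
    """Return one message per expected metric that is missing or outside the tolerance."""
    problems = []
    for name, expected_value in expected.items():
        if name not in reported:
            problems.append(f"{name}: missing from model output")
            continue
        # abs_tol keeps the comparison meaningful when the expected value is zero
        if not math.isclose(reported[name], expected_value, rel_tol=rel_tol, abs_tol=1e-9):
            problems.append(
                f"{name}: model reported {reported[name]:,.2f}, "
                f"recomputed {expected_value:,.2f}"
            )
    return problems
```

In this notebook, `expected` could be built from the DataFrame written to `quarterly_financial_data.csv`, for example `{'total_revenue': df['revenue'].sum(), 'net_income': df['net_income'].sum()}`, with `financial_results['key_metrics']` passed as `reported`. Ratios such as `average_profit_margin` can be defined in more than one way (mean of monthly margins versus total net income over total revenue), so a mismatch there may point to a definition difference and not to an error.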

In [None]:
# Business Intelligence Report Generation Example

def create_business_intelligence_example():
    """Create Business Intelligence workflow from integration guide."""
    
    # Create sample business data
    business_data = {
        'date': pd.date_range('2024-01-01', periods=100, freq='D'),
        'customer_segment': np.random.choice(['Enterprise', 'SMB', 'Consumer'], 100),
        'product_category': np.random.choice(['Software', 'Hardware', 'Services'], 100),
        'revenue': np.random.normal(50000, 15000, 100),
        'customer_satisfaction': np.random.normal(4.2, 0.8, 100),
        'market_share': np.random.normal(0.15, 0.05, 100)
    }
    
    df_business = pd.DataFrame(business_data)
    df_business['revenue'] = np.maximum(df_business['revenue'], 1000)  # Ensure positive revenue
    df_business['customer_satisfaction'] = np.clip(df_business['customer_satisfaction'], 1, 5)
    df_business['market_share'] = np.clip(df_business['market_share'], 0.01, 0.5)
    
    # Save business data
    df_business.to_csv('business_intelligence_data.csv', index=False)
    
    print("📊 Business Intelligence Data Created:")
    display(df_business.head())
    
    # Create BI analysis template (from integration guide)
    bi_template = """
You are a senior business analyst. Perform comprehensive competitive analysis and business intelligence.

## Business Intelligence Report - Q4 2024

### Internal Performance Analysis
**Sales Data:**
{{ sales_data.content }}

### Analysis Requirements:
1. **Market Position**: Analyze our position vs competitors across key metrics
2. **Growth Opportunities**: Identify untapped segments and expansion possibilities  
3. **Competitive Threats**: Assess emerging competitors and market disruptions
4. **Pricing Analysis**: Evaluate price positioning and optimization opportunities
5. **Strategic Recommendations**: Provide actionable next steps with ROI projections

### Executive Briefing Elements:
- Top 3 strategic priorities
- Revenue impact projections
- Resource requirements
- Timeline for implementation

Create professional visualizations showing competitive positioning and market trends.
"""
    
    with open('bi_analysis_template.j2', 'w') as f:
        f.write(bi_template)
    
    # BI analysis schema (from integration guide)
    bi_schema = {
        "type": "object",
        "properties": {
            "executive_summary": {
                "type": "string",
                "description": "CEO-ready 2-3 sentence summary of strategic position"
            },
            "market_position": {
                "type": "object",
                "properties": {
                    "market_share": {"type": "number"},
                    "competitive_ranking": {"type": "integer"},
                    "differentiation_strengths": {"type": "array", "items": {"type": "string"}},
                    "competitive_gaps": {"type": "array", "items": {"type": "string"}}
                }
            },
            "growth_opportunities": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "opportunity": {"type": "string"},
                        "market_size": {"type": "number"},
                        "revenue_potential": {"type": "number"},
                        "time_to_market": {"type": "string"},
                        "investment_required": {"type": "number"},
                        "risk_level": {"type": "string", "enum": ["low", "medium", "high"]}
                    },
                    "required": ["opportunity", "revenue_potential", "risk_level"]
                }
            },
            "strategic_recommendations": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "recommendation": {"type": "string"},
                        "priority": {"type": "string", "enum": ["critical", "high", "medium", "low"]},
                        "expected_roi": {"type": "number"},
                        "implementation_timeline": {"type": "string"},
                        "resource_requirements": {"type": "array", "items": {"type": "string"}},
                        "success_metrics": {"type": "array", "items": {"type": "string"}}
                    },
                    "required": ["recommendation", "priority", "expected_roi"]
                }
            },
            "competitive_analysis": {
                "type": "object",
                "properties": {
                    "top_competitors": {"type": "array", "items": {"type": "string"}},
                    "competitive_advantages": {"type": "array", "items": {"type": "string"}},
                    "market_threats": {"type": "array", "items": {"type": "string"}}
                }
            }
        },
        "required": ["executive_summary", "market_position", "growth_opportunities", "strategic_recommendations"]
    }
    
    with open('bi_analysis_schema.json', 'w') as f:
        json.dump(bi_schema, f, indent=2)
    
    print("✅ Business Intelligence template and schema created")
    
    # Run BI analysis
    bi_results = run_ostruct_analysis(
        template_file='bi_analysis_template.j2',
        schema_file='bi_analysis_schema.json',
        data_file='business_intelligence_data.csv',
        model='gpt-4o',
        enable_tools=['code-interpreter', 'web-search'],
        output_file='bi_analysis_results.json',
        enable_downloads=True  # Enable chart downloads for visualization
    )
    
    print("✅ Business Intelligence Analysis Complete!")
    
    # Display key results
    print("\n🏢 BUSINESS INTELLIGENCE SUMMARY:")
    print(f"📊 Executive Summary: {bi_results['executive_summary']}")
    
    if 'market_position' in bi_results:
        position = bi_results['market_position']
        print(f"📈 Market Share: {position.get('market_share', 0):.1%}")
        print(f"🏆 Competitive Ranking: #{position.get('competitive_ranking', 'N/A')}")
    
    if 'strategic_recommendations' in bi_results:
        print("\n💡 TOP STRATEGIC RECOMMENDATIONS:")
        for i, rec in enumerate(bi_results['strategic_recommendations'][:3], 1):
            priority = rec.get('priority', 'medium').upper()
            recommendation = rec.get('recommendation', 'N/A')
            roi = rec.get('expected_roi', 0)
            print(f"  {i}. [{priority}] {recommendation}")
            print(f"     Expected ROI: {roi:.1%}")
    
    return bi_results

# Run business intelligence example
bi_results = create_business_intelligence_example()
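
The summary above prints the first three entries of `strategic_recommendations` in whatever order the model returned them, which is not necessarily priority order. A small sort, sketched below, makes the ranking explicit. `rank_recommendations` is a hypothetical helper, and the ordering is taken from the `priority` enum in `bi_schema`.

```python
from typing import Any, Dict, List

# Order taken from the "priority" enum in bi_schema
PRIORITY_ORDER = ["critical", "high", "medium", "low"]


def rank_recommendations(recommendations: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Sort by priority (most urgent first), then by expected ROI (highest first)."""

    def sort_key(rec: Dict[str, Any]):
        priority = rec.get("priority")
        # Unknown or missing priorities sort after all known ones
        if priority in PRIORITY_ORDER:
            rank = PRIORITY_ORDER.index(priority)
        else:
            rank = len(PRIORITY_ORDER)
        return (rank, -rec.get("expected_roi", 0))

    return sorted(recommendations, key=sort_key)
```

With this helper, the display loop could iterate over `rank_recommendations(bi_results['strategic_recommendations'])[:3]` instead of slicing the raw list.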

## Cleanup

Clean up temporary files created during this notebook:

In [None]:
# Cleanup temporary files
# WARNING: these patterns match every CSV, JSON, and .j2 file in the current
# working directory, not only the files this notebook created.
import glob
import shutil
from pathlib import Path

temp_files = [
    '*.csv', '*.json', '*.j2', 'downloads/*', 'batch_results/*'
]

for pattern in temp_files:
    for file in glob.glob(pattern):
        try:
            Path(file).unlink()
            print(f"🗑️ Removed: {file}")
        except OSError:
            pass  # Ignore directories and files that were already removed

# Remove directories
for dir_name in ['downloads', 'batch_results']:
    try:
        shutil.rmtree(dir_name)
        print(f"🗑️ Removed directory: {dir_name}")
    except OSError:
        pass  # Directory does not exist

print("✅ Cleanup complete!")
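
Because the glob patterns above match any CSV, JSON, or `.j2` file in the working directory, a more conservative option is to delete only an explicit list of files and preview the list first. The following is a sketch under that assumption. `remove_files` and its `dry_run` flag are illustrative names.

```python
from pathlib import Path
from typing import Iterable, List


def remove_files(paths: Iterable[str], dry_run: bool = True) -> List[str]:
    """Delete the listed files; with dry_run=True, only report what would be deleted.

    Returns the paths that exist as regular files (deleted unless dry_run is set).
    """
    handled = []
    for path in paths:
        file_path = Path(path)
        if not file_path.is_file():
            continue  # Skip directories and paths that do not exist
        if not dry_run:
            file_path.unlink()
        handled.append(str(file_path))
    return handled


# Files written by the financial and BI examples in this notebook
EXAMPLE_FILES = [
    "quarterly_financial_data.csv",
    "financial_analysis_template.j2",
    "financial_analysis_schema.json",
    "financial_analysis_results.json",
    "business_intelligence_data.csv",
    "bi_analysis_template.j2",
    "bi_analysis_schema.json",
    "bi_analysis_results.json",
]
```

Calling `remove_files(EXAMPLE_FILES)` lists what would be removed, and `remove_files(EXAMPLE_FILES, dry_run=False)` performs the deletion.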

## Next Steps

🎉 **Congratulations!** You've learned how to integrate ostruct with Jupyter notebooks for powerful data science workflows.

### What to try next:

1. **🔄 Adapt for your data**: Replace the sample data with your own datasets
2. **🎨 Custom templates**: Create domain-specific templates for your analysis needs
3. **📊 Advanced schemas**: Design schemas that capture your business metrics
4. **🚀 Production deployment**: Build automated pipelines using these patterns
5. **🔗 Tool integration**: Combine with other data science tools in your stack

### Resources:

- [ostruct Documentation](https://ostruct.readthedocs.io/)
- [Data Science Integration Guide](https://ostruct.readthedocs.io/en/latest/user-guide/data_science_integration.html)
- [More Examples](../)
- [GitHub Repository](https://github.com/yaniv-golan/ostruct)

Happy analyzing! 🚀📊