# 🧠 Data Analysis Agent - Demo

This notebook demonstrates the **Data Analysis Agent** system with:
- **Schema Compression**: Efficient dataset representation
- **History Compression**: Minimal token usage for analysis context
- **Automated EDA**: AI-powered exploratory data analysis

## 📦 Setup and Imports

In [None]:
import sys
sys.path.append('../src')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from schema_compressor import SchemaCompressor
from history_compressor import HistoryCompressor
from eda_agent import EDAAgent
from utils import load_sample_data
from visualizations import setup_plot_style, plot_missing_values, plot_correlation_matrix

# Setup plotting
setup_plot_style()
%matplotlib inline

print("✓ All modules loaded successfully!")

## 📊 Load Sample Dataset

In [None]:
# Load Titanic sample dataset
df = load_sample_data('titanic')

print(f"Dataset loaded: {df.shape[0]} rows × {df.shape[1]} columns")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print("\nFirst few rows:")
df.head()

## 🗜️ Step 1: Schema Compression

Compress the dataset schema to reduce token usage while preserving essential information.
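
The internals of `SchemaCompressor` are not shown in this notebook. As a rough illustration of the idea, the sketch below summarizes each column by dtype, missing ratio, distinct count, and (for numeric columns) range, instead of serializing every row. The function name `sketch_compress_schema` and the output layout are assumptions made for illustration, not the library's actual format.

```python
import pandas as pd


def sketch_compress_schema(df: pd.DataFrame) -> dict:
    """Hypothetical illustration: summarize each column instead of listing rows."""
    schema = {"rows": len(df), "columns": {}}
    for col in df.columns:
        series = df[col]
        entry = {
            "dtype": str(series.dtype),
            "missing_ratio": round(float(series.isna().mean()), 3),
            "n_unique": int(series.nunique()),  # NaN is not counted as a value
        }
        # Numeric columns get a range; everything else keeps counts only.
        if pd.api.types.is_numeric_dtype(series):
            entry["min"] = float(series.min())
            entry["max"] = float(series.max())
        schema["columns"][col] = entry
    return schema


# Tiny made-up frame, just to show the shape of the output.
toy = pd.DataFrame({"age": [22.0, 38.0, None, 35.0], "sex": ["m", "f", "f", "m"]})
toy_schema = sketch_compress_schema(toy)
print(toy_schema["columns"]["age"])
```

The summary grows with the number of columns rather than the number of rows, which is where the token reduction would come from.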

In [None]:
# Initialize schema compressor
schema_compressor = SchemaCompressor()

# Generate compressed schema
compressed_schema = schema_compressor.compress(df)

# Display compressed schema
print(schema_compressor.to_text(compressed_schema))

In [None]:
# Estimate token savings
token_stats = schema_compressor.estimate_token_reduction(df)

print("\n📈 Token Efficiency Metrics:")
print(f"  Schema tokens: {token_stats['schema_tokens']:,}")
print(f"  Full dataset tokens (estimated): {token_stats['estimated_full_tokens']:,}")
print(f"  Reduction ratio: {token_stats['reduction_ratio']:.1f}x")
print(f"  Tokens saved: {token_stats['tokens_saved']:,}")
print(f"\n💰 At $0.002 per 1K tokens, this saves ~${token_stats['tokens_saved'] * 0.002 / 1000:.4f} per query!")

## 🤖 Step 2: Initialize EDA Agent

Create an AI agent that uses the compressed schema and maintains a compressed history.

In [None]:
# Initialize the EDA Agent
agent = EDAAgent(df, name="Titanic Analysis Agent")

print("✓ EDA Agent initialized!")
print("\n🎯 Suggested next steps:")
for i, suggestion in enumerate(agent.suggest_next_steps(), 1):
    print(f"  {i}. {suggestion}")

## 🔍 Step 3: Perform Analysis

Execute various EDA steps while the agent compresses history automatically.

### 3.1 Missing Value Analysis
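
What `analyze_missing_values` computes internally is not shown here. A minimal pandas sketch of a per-column missing-value summary, using the same `count` and `ratio` keys that the cell below reads, might look like this; the helper name `sketch_missing_summary` is hypothetical.

```python
import pandas as pd


def sketch_missing_summary(df: pd.DataFrame) -> dict:
    """Hypothetical helper: missing count and ratio for columns that have gaps."""
    counts = df.isna().sum()
    summary = {}
    for col, count in counts.items():
        if count > 0:
            summary[col] = {"count": int(count), "ratio": float(count) / len(df)}
    return summary


# Made-up values for illustration only.
toy = pd.DataFrame({"age": [22.0, None, None, 35.0], "fare": [7.25, 71.28, 8.05, 53.1]})
missing = sketch_missing_summary(toy)
for col, info in missing.items():
    print(f"  - {col}: {info['count']} missing ({info['ratio']:.1%})")
```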

In [None]:
# Analyze missing values
missing_results = agent.analyze_missing_values()

print("📋 Missing Value Analysis:")
for insight in missing_results['insights']:
    print(f"  • {insight}")

if missing_results['columns_with_missing']:
    print("\nColumns with missing data:")
    for col, info in missing_results['summary'].items():
        print(f"  - {col}: {info['count']} missing ({info['ratio']:.1%})")

In [None]:
# Visualize missing values
plot_missing_values(df)

### 3.2 Distribution Analysis
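
The cell below prints a `distribution_type` and a `skewness` per column. How the agent assigns those labels is not shown; one common rule of thumb, sketched here as an assumption, treats |skew| < 0.5 as roughly symmetric and anything beyond ±1 as highly skewed. `sketch_describe_skew` and its label strings are hypothetical.

```python
import pandas as pd


def sketch_describe_skew(series: pd.Series) -> dict:
    """Hypothetical helper: label a numeric column by its sample skewness."""
    skew = float(series.dropna().skew())
    if abs(skew) < 0.5:
        label = "approximately symmetric"
    elif abs(skew) <= 1.0:
        label = "moderately skewed"
    else:
        label = "highly skewed"
    # A positive value means a longer right tail, a negative value a longer left tail.
    tail = "right" if skew > 0 else "left" if skew < 0 else "none"
    return {"skewness": skew, "distribution_type": label, "tail": tail}


# Made-up series: one balanced, one with a single large value on the right.
symmetric = sketch_describe_skew(pd.Series([1, 2, 3, 4, 5]))
long_right_tail = sketch_describe_skew(pd.Series([1, 1, 1, 2, 2, 3, 50]))
print(symmetric["distribution_type"], long_right_tail["distribution_type"])
```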

In [None]:
# Analyze distributions
dist_results = agent.analyze_distributions()

print("📊 Distribution Analysis:")
for insight in dist_results['insights']:
    print(f"  • {insight}")

print("\nDistribution details:")
for col, info in dist_results['distributions'].items():
    print(f"  {col}: {info['distribution_type']} (skew={info['skewness']:.2f})")

### 3.3 Correlation Analysis
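
`analyze_correlations(threshold=0.5)` presumably reports column pairs whose correlation is at least 0.5 in absolute value; the notebook does not show the implementation. A minimal sketch of that filtering step with pandas, assuming Pearson correlation and a hypothetical helper name `sketch_strong_pairs`:

```python
import pandas as pd


def sketch_strong_pairs(df: pd.DataFrame, threshold: float = 0.5) -> list:
    """Hypothetical helper: numeric column pairs with |Pearson r| >= threshold."""
    corr = df.select_dtypes(include="number").corr()
    cols = list(corr.columns)
    pairs = []
    # Walk the upper triangle only, so each pair appears once and the diagonal is skipped.
    for i, left in enumerate(cols):
        for right in cols[i + 1:]:
            r = corr.loc[left, right]
            if pd.notna(r) and abs(r) >= threshold:
                pairs.append((left, right, round(float(r), 3)))
    # Strongest relationships first.
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# Made-up frame: double_x is an exact multiple of x, noise is only weakly related.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "double_x": [2, 4, 6, 8, 10],
    "noise": [3, 1, 4, 1, 5],
    "label": list("abcde"),
})
strong = sketch_strong_pairs(toy, threshold=0.5)
print(strong)
```

Filtering on the absolute value keeps strong negative relationships as well as positive ones.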

In [None]:
# Analyze correlations
corr_results = agent.analyze_correlations(threshold=0.5)

print("🔗 Correlation Analysis:")
for insight in corr_results['insights']:
    print(f"  • {insight}")

In [None]:
# Visualize correlation matrix
plot_correlation_matrix(df)

### 3.4 Outlier Detection
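
With `method='iqr'`, the conventional rule (Tukey's fences) flags values more than 1.5 × IQR below the first quartile or above the third. Whether `detect_outliers` uses exactly this multiplier is not shown in the notebook, so the sketch below takes it as an assumption; `sketch_iqr_outliers` is a hypothetical name.

```python
import pandas as pd


def sketch_iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Hypothetical helper: values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = series.dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Made-up values: six similar numbers and one far larger.
fares = pd.Series([7.25, 8.05, 7.9, 8.5, 9.0, 8.1, 512.33])
outliers = sketch_iqr_outliers(fares)
print(outliers.tolist())
```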

In [None]:
# Detect outliers
outlier_results = agent.detect_outliers(method='iqr')

print("🎯 Outlier Detection (IQR method):")
for insight in outlier_results['insights']:
    print(f"  • {insight}")

## 📝 Step 4: View Compressed History

See how the analysis history is compressed to minimize token usage.
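
How `HistoryCompressor` condenses each step is not shown in this notebook. One simple approach, sketched here purely as an assumption, keeps a one-line summary per step (step name plus the first few insights) and drops the raw result payload. The token estimate uses the same rough 4-characters-per-token heuristic that appears later in this section; `sketch_compress_history`, `rough_tokens`, and the record layout are hypothetical.

```python
import json


def rough_tokens(text: str) -> int:
    """Very rough token estimate: about 4 characters per token."""
    return len(text) // 4


def sketch_compress_history(steps: list, max_insights: int = 2) -> str:
    """Hypothetical helper: one short line per analysis step."""
    lines = []
    for step in steps:
        kept = "; ".join(step["insights"][:max_insights])
        lines.append(f"{step['name']}: {kept}")
    return "\n".join(lines)


# Made-up history records, standing in for whatever the agent actually stores.
history = [
    {"name": "missing_values",
     "insights": ["2 columns have missing data"],
     "raw": {"col_a": {"count": 3}, "col_b": {"count": 7}}},
    {"name": "outliers",
     "insights": ["col_c has extreme high values", "col_a looks clean", "col_d has a few"],
     "raw": {"col_c": list(range(100))}},
]

full_text = json.dumps(history)
compact_text = sketch_compress_history(history)
print(compact_text)
print(f"~{rough_tokens(full_text)} tokens full vs ~{rough_tokens(compact_text)} compressed")
```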

In [None]:
# Get compressed history
print(agent.history_compressor.to_text())

In [None]:
# Get context for next step (LLM-ready format)
context = agent.get_history_context()

print("🤖 LLM-Ready Context (compressed):")
print(context)
print(f"\nContext length: {len(context)} characters (~{len(context)//4} tokens)")

In [None]:
# Estimate history compression savings
history_stats = agent.history_compressor.estimate_token_savings()

print("📈 History Compression Metrics:")
print(f"  Full history tokens: {history_stats['full_history_tokens']:,}")
print(f"  Compressed tokens: {history_stats['compressed_tokens']:,}")
print(f"  Compression ratio: {history_stats['compression_ratio']:.1f}x")
print(f"  Tokens saved: {history_stats['tokens_saved']:,}")

## 📋 Step 5: Generate Summary Report

In [None]:
# Generate comprehensive summary
report = agent.generate_summary_report()
print(report)

## 🚀 Step 6: Automated EDA (All-in-One)

Run a complete automated EDA in one command!

In [None]:
# Load a fresh dataset for automated analysis
df_iris = load_sample_data('iris')

# Create new agent
iris_agent = EDAAgent(df_iris, name="Iris Analysis Agent")

# Run automated EDA
results = iris_agent.run_automated_eda()

# Display the summary
print(results['summary_report'])

## 💡 Key Takeaways

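1. **Schema Compression**: Reduces the dataset representation by roughly 50-100x; `estimate_token_reduction` reports the exact ratio for a given dataset
2. **History Compression**: Maintains analysis context with roughly 5-10x fewer tokens; `estimate_token_savings` reports the measured ratio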
3. **Automated EDA**: Intelligent agent suggests next steps based on data characteristics
4. **Token Efficiency**: Significant cost savings when using LLM APIs
5. **Scalability**: Can handle large datasets without overwhelming context windows
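
To make the token-efficiency point concrete, the arithmetic used in Step 1 (tokens saved × price per 1K tokens) can be wrapped in a small helper. The $0.002 per 1K tokens rate is the figure used earlier in this notebook, not a current price for any particular API, and the token counts below are made-up inputs; `sketch_cost_saved` is a hypothetical name.

```python
def sketch_cost_saved(full_tokens: int, compressed_tokens: int,
                      usd_per_1k_tokens: float = 0.002) -> dict:
    """Hypothetical helper: reduction ratio and dollar savings per query."""
    tokens_saved = full_tokens - compressed_tokens
    return {
        "tokens_saved": tokens_saved,
        "reduction_ratio": full_tokens / compressed_tokens,
        "usd_saved_per_query": tokens_saved * usd_per_1k_tokens / 1000,
    }


# Illustrative inputs only: 50,000 tokens for a raw dump vs 500 for a schema.
stats = sketch_cost_saved(full_tokens=50_000, compressed_tokens=500)
print(f"{stats['reduction_ratio']:.1f}x smaller, "
      f"~${stats['usd_saved_per_query']:.4f} saved per query")
```

The per-query saving is small in absolute terms; it adds up when the same schema or history is sent with every request.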

## 🔬 Next Steps

Try the agent with your own data:

```python
# Load your data
my_df = pd.read_csv('your_data.csv')

# Initialize agent
my_agent = EDAAgent(my_df, name="My Analysis")

# Run automated analysis
results = my_agent.run_automated_eda()
print(results['summary_report'])
```