# Autolyse Tutorial - Automated EDA in 2 Lines

This notebook demonstrates how to use **Autolyse** for comprehensive exploratory data analysis with minimal code.

## What is Autolyse?

Autolyse is an automated EDA tool that generates:
- **Statistical Analysis**: Mean, median, std, skewness, kurtosis, quartiles
- **Data Quality Assessment**: Missing values, duplicates, data quality score
- **Distribution Analysis**: Normality tests, distribution patterns
- **Correlation Analysis**: Pearson & Spearman correlations, relationship strength
- **Outlier Detection**: IQR method + Isolation Forest
- **Visualizations**: Matplotlib (static) + Plotly (interactive) plots
- **AI Insights**: 2-4 sentence summaries powered by Google Gemini API

All in **just 2 lines of code**!

## Installation & Setup

First, install dependencies and set up your Gemini API key (optional but recommended):

In [None]:
# Install dependencies
# !pip install -r requirements.txt

# Or install the package
# !pip install -e .

# Set up your Gemini API key (optional for AI insights)
import os
# os.environ['GEMINI_KEY'] = 'your-api-key-here'
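
Because the key is optional, it can help to confirm up front whether the environment variable is actually set. The following is a minimal sketch using only the standard library; the helper name `gemini_key_status` is hypothetical and not part of Autolyse.

```python
import os


def gemini_key_status(env_var: str = "GEMINI_KEY") -> str:
    """Return a short message saying whether the API key variable is set."""
    key = os.environ.get(env_var)
    if key:
        # Report only that the key exists; never print the key itself.
        return f"{env_var} is set ({len(key)} characters)"
    return f"{env_var} is not set"


print(gemini_key_status())
```

Running this before the analysis cells makes it obvious whether AI insights can be requested in the current session.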

## Example 1: Basic Usage with Iris Dataset

In [None]:
# Import Autolyse
from autolyse import Autolyse
import pandas as pd
import os

# Load sample dataset
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

print(f"Dataset shape: {df.shape}")
print("First few rows:")
print(df.head())

### The 2-Line Magic ✨

Here's where Autolyse shines - comprehensive analysis in just 2 lines:

In [None]:
# Line 1: Initialize with your preferences
analyser = Autolyse(html=False, api_key=os.environ.get("GEMINI_KEY"))

# Line 2: Run complete analysis
analyser.analyse(df)

### Access the Results Programmatically

In [None]:
# Get analysis results
results = analyser.get_analysis_results()

# Get AI insights
insights = analyser.get_insights()

# Get column information
col_info = analyser.get_dataframe_info()

print("\n=== Insights ===")
for analysis_type, insight in insights.items():
    print(f"\n📌 {analysis_type}:")
    print(insight)
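
Since the insights are iterated with `.items()`, they behave like a dictionary and can be written to disk for later reference. The sketch below assumes the values can be serialized as text; `save_insights` and the stand-in data are hypothetical and not part of Autolyse.

```python
import json
import tempfile
from pathlib import Path


def save_insights(insights: dict, path) -> Path:
    """Write an insights mapping to a JSON file and return its path."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    # default=str converts any non-JSON value to its string form.
    out.write_text(json.dumps(insights, indent=2, default=str), encoding="utf-8")
    return out


# Stand-in data; in the notebook this would be analyser.get_insights().
example_insights = {"statistical": "Example summary text."}
target = Path(tempfile.gettempdir()) / "autolyse_insights_example.json"
saved = save_insights(example_insights, target)
print(saved)
```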

## Example 2: HTML Report Generation

Generate a professional HTML report instead of the inline Jupyter display:

In [None]:
# Load the tips dataset (the name df_wine is kept because later cells reuse it)
df_wine = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv')

# Create analyzer with HTML output
html_analyser = Autolyse(html=True, api_key=os.environ.get("GEMINI_KEY"), 
                         output_dir="./my_reports")

# Run analysis - generates HTML report
html_analyser.analyse(df_wine)

The HTML report is saved and contains:
- ✅ Beautiful gradient-styled layout
- ✅ Summary cards with key statistics
- ✅ Detailed tables for all analyses
- ✅ AI-generated insights
- ✅ Color-coded severity indicators
- ✅ Responsive design for mobile viewing
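
The report's file name is not stated here, so one way to locate it is to look for the newest `.html` file in the output directory. This is a sketch under that assumption; `latest_html_report` is a hypothetical helper, not an Autolyse function.

```python
from pathlib import Path
from typing import Optional


def latest_html_report(output_dir: str) -> Optional[Path]:
    """Return the most recently modified .html file in output_dir, if any."""
    folder = Path(output_dir)
    if not folder.is_dir():
        return None
    # Sort by modification time so the newest report comes last.
    reports = sorted(folder.glob("*.html"), key=lambda p: p.stat().st_mtime)
    return reports[-1] if reports else None


report = latest_html_report("./my_reports")
print(report if report else "No HTML report found yet")
```

If a report is found, `webbrowser.open(report.resolve().as_uri())` from the standard library opens it in the default browser.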

## Example 3: Working with Mixed Data Types

Autolyse intelligently handles numeric, categorical, text, and datetime columns:

In [None]:
import numpy as np
from datetime import datetime, timedelta

# Create a diverse dataset
np.random.seed(42)
n = 500

df_mixed = pd.DataFrame({
    'id': range(n),
    'age': np.random.normal(35, 15, n).astype(int),
    'salary': np.random.exponential(50000, n),
    'department': np.random.choice(['Sales', 'Engineering', 'HR', 'Marketing'], n),
    'performance_score': np.random.uniform(1, 5, n),
    'hire_date': [datetime(2020, 1, 1) + timedelta(days=int(x)) for x in np.random.uniform(0, 1460, n)],
    'is_manager': np.random.choice([True, False], n),
    'bio': ['Employee ' + str(i) for i in range(n)]  # Text column
})

# Add some missing values (replace=False so exactly 30 and 20 distinct rows are blanked)
df_mixed.loc[np.random.choice(df_mixed.index, 30, replace=False), 'salary'] = np.nan
df_mixed.loc[np.random.choice(df_mixed.index, 20, replace=False), 'department'] = np.nan

print(df_mixed.dtypes)
print(f"\nDataset shape: {df_mixed.shape}")

In [None]:
# Autolyse handles all column types automatically!
multi_analyser = Autolyse(html=False, api_key=os.environ.get("GEMINI_KEY"))
multi_analyser.analyse(df_mixed)

Notice how Autolyse:
- ✅ Detected numeric columns (age, salary, performance_score)
- ✅ Identified categorical columns (department, is_manager)
- ✅ Recognized datetime columns (hire_date)
- ✅ Handled text columns (bio)
- ✅ Analyzed missing values
- ✅ Generated appropriate visualizations for each type
- ✅ Created meaningful AI insights
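
To see how columns like these split by dtype, pandas' own `select_dtypes` gives a rough grouping. This sketch is only an illustration with plain pandas; it is not Autolyse's detection logic, and `group_columns_by_kind` is a hypothetical helper.

```python
import pandas as pd


def group_columns_by_kind(df: pd.DataFrame) -> dict:
    """Group column names by broad kind using pandas dtypes only."""
    numeric = df.select_dtypes(include="number").columns.tolist()
    datetimes = df.select_dtypes(include="datetime").columns.tolist()
    booleans = df.select_dtypes(include="bool").columns.tolist()
    taken = set(numeric) | set(datetimes) | set(booleans)
    # Whatever is left (strings, objects, categories) goes under "other".
    other = [c for c in df.columns if c not in taken]
    return {"numeric": numeric, "datetime": datetimes,
            "boolean": booleans, "other": other}


sample = pd.DataFrame({
    "age": [31, 45],
    "department": ["Sales", "HR"],
    "hire_date": pd.to_datetime(["2020-01-01", "2021-06-15"]),
    "is_manager": [True, False],
})
kinds = group_columns_by_kind(sample)
print(kinds)
```

A dtype-based split cannot tell a low-cardinality category such as `department` from free text such as `bio`; that distinction needs an extra rule, for example the number of unique values.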

## Example 4: Advanced Usage - Direct Access to Modules

You can also use individual modules for more control:

In [None]:
from autolyse.analyzers import (
    StatisticalAnalyzer,
    MissingValuesAnalyzer,
    CorrelationAnalyzer
)
from autolyse.utils import DataPreparation
from autolyse.visualizers import MatplotlibVisualizer, PlotlyVisualizer

# Get data insights
data_prep = DataPreparation(df_wine)
print("Column Types:", data_prep.get_column_types())
print("Data Quality Score:", data_prep.validate_data()['data_quality_score'])

# Run specific analyzers
stat_analyzer = StatisticalAnalyzer(df_wine)
stats = stat_analyzer.analyze()
print("\nStatistics for first column:")
print(stats[list(stats.keys())[0]])

## Features Comparison

### Jupyter Display (html=False)
- ✅ Interactive exploration in notebook
- ✅ Plotly visualizations with hover info
- ✅ Live tables with sorting
- ✅ Inline markdown formatting

### HTML Report (html=True)
- ✅ Professional report for sharing
- ✅ Beautiful gradient styling
- ✅ Summary cards & statistics tables
- ✅ Self-contained (single HTML file)
- ✅ Mobile-responsive design
- ✅ Perfect for presentations & documentation

## Configuration Options

```python
# Basic usage with Jupyter display
analyser = Autolyse(html=False)
analyser.analyse(df)

# Generate HTML report
analyser = Autolyse(html=True, output_dir="./reports")
analyser.analyse(df)

# With Gemini API key for AI insights
analyser = Autolyse(html=True, api_key="your-api-key")
analyser.analyse(df)

# Using environment variable for API key
analyser = Autolyse(html=True, api_key=os.environ.get("GEMINI_KEY"))
analyser.analyse(df)
```

## Analysis Coverage

### What Autolyse Analyzes:

1. **Statistical Analysis**
   - Mean, median, std, variance
   - Min, max, range, quartiles, IQR
   - Skewness, kurtosis
   - Missing value analysis

2. **Distribution Analysis**
   - Normality tests (Shapiro-Wilk)
   - Distribution type classification
   - Histogram & KDE plots
   - Categorical distributions

3. **Missing Values**
   - Count & percentage per column
   - Missing value patterns
   - Correlation of missingness
   - Complete column identification

4. **Outlier Detection**
   - IQR method (traditional)
   - Isolation Forest (advanced)
   - Anomaly scores
   - Visual highlighting

5. **Correlation Analysis**
   - Pearson correlations
   - Spearman rank correlations
   - Strength classification
   - Heatmap visualization

6. **Relationship Analysis**
   - Categorical-numeric relationships
   - Categorical-categorical associations
   - Numeric-numeric pair analysis
   - Cramér's V statistics
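
As a concrete illustration of the IQR method listed under outlier detection, the sketch below flags values outside `[Q1 - 1.5*IQR, Q3 + 1.5*IQR]` using only the standard library. It shows the general technique, not Autolyse's internal implementation; `iqr_outliers` and the multiplier `k` are names chosen for this example.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" treats the data as the full population.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


print(iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100]))  # → [100]
```

Different quartile conventions shift the fences slightly, so borderline points may be flagged by one tool and not by another.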

## Tips & Best Practices

1. **Large Datasets**: Autolyse handles large datasets efficiently. For 1M+ rows, consider sampling first.

2. **AI Insights**: Set `api_key` parameter or `GEMINI_KEY` environment variable for AI summaries. Works fine without it too!

3. **Output Format**: 
   - Use `html=True` for reports to share
   - Use `html=False` for interactive exploration

4. **Reproducibility**: Analyses are deterministic for a fixed random seed, so set a seed before running if you need identical results across runs.

5. **Memory**: Autolyse keeps visualizations in memory. For very large analyses, clear figures after saving:
   ```python
   analyser.figures = {}  # Clear figures to free memory
   ```
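
For the large-dataset tip above, sampling can be done with pandas before the frame is handed to Autolyse. This is a minimal sketch; the helper name `sample_for_eda` and the default threshold are assumptions for illustration, not Autolyse settings.

```python
import pandas as pd


def sample_for_eda(df: pd.DataFrame, max_rows: int = 100_000,
                   seed: int = 42) -> pd.DataFrame:
    """Return df unchanged if small enough, else a reproducible random sample."""
    if len(df) <= max_rows:
        return df
    # A fixed random_state makes the same rows come back on every run.
    return df.sample(n=max_rows, random_state=seed)


big = pd.DataFrame({"x": range(1_000)})
small = sample_for_eda(big, max_rows=100)
print(small.shape)  # → (100, 1)
```

The sampled frame can then be passed to `analyser.analyse(...)` in place of the full dataset. Rare categories and extreme outliers may be under-represented in a sample, so it is worth re-checking key findings on the full data.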

## Summary

With just **2 lines of code**:

```python
analyser = Autolyse(html=True, api_key=os.environ.get("GEMINI_KEY"))
analyser.analyse(df)
```

You get:
- 📊 6 comprehensive analyses
- 🎨 20+ visualizations (static + interactive)
- 🤖 AI-powered insights
- 📄 Professional HTML report OR Jupyter display
- ✅ Data quality assessment
- 🎯 Actionable insights

**Happy exploring! 🚀**