# Exploratory Data Analysis: VAE-Based Music Clustering

This notebook performs initial exploration of the project structure, data loading, and feature verification before running the three tasks (Easy, Medium, Hard).

## Contents:
1. Project setup verification
2. Load sample features
3. Feature statistics and distributions
4. Label/genre analysis
5. Data preparation checks

In [None]:
import numpy as np
import pandas as pd
import matplotlib  # needed for matplotlib.__version__ below
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys
from pathlib import Path

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Add src to path for imports
sys.path.insert(0, '../src')

print("Libraries imported successfully!")
print(f"NumPy: {np.__version__}")
print(f"Pandas: {pd.__version__}")
print(f"Matplotlib: {matplotlib.__version__}")
print(f"Seaborn: {sns.__version__}")

## Project Setup Verification

Check that the expected directories, source files, and data are present.

In [None]:
project_root = Path('../').resolve()
print(f"Project root: {project_root}\n")

# Check key directories
dirs_to_check = ['src', 'data', 'results', 'notebooks']
for dir_name in dirs_to_check:
    dir_path = project_root / dir_name
    exists = dir_path.exists()
    status = "✓" if exists else "✗"
    print(f"{status} {dir_name}/: {dir_path}")

print("\n" + "="*60)
print("Key files in src/")
print("="*60)
src_files = ['easy_task.py', 'medium_task.py', 'hard_task.py', 'dataset.py']
for fname in src_files:
    fpath = project_root / 'src' / fname
    exists = fpath.exists()
    status = "✓" if exists else "✗"
    print(f"{status} {fname}")

print("\n" + "="*60)
print("Data availability")
print("="*60)
features_path = project_root / 'data' / 'processed' / 'gtzan_features.csv'
audio_path = project_root / 'data' / 'audio' / 'genres_original'
print(f"Features CSV: {'✓' if features_path.exists() else '✗'} ({features_path})")
print(f"Audio folder: {'✓' if audio_path.exists() else '✗'} ({audio_path})")
print("\nNote: Tasks will use synthetic data if real data is unavailable")

## Load Feature Data

Load the precomputed features if the CSV is available; otherwise, show what the synthetic data will look like.
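The later cells expect a DataFrame named `df` with columns starting with `feature_` plus a `genre` column. A minimal sketch of a loader with a synthetic fallback is shown here. The helper names `load_features` and `make_synthetic_features`, the sample and feature counts, and the `genre_N` labels are all assumptions for illustration; the real CSV schema and the synthetic generator in `src/` may differ.

```python
from pathlib import Path

import numpy as np
import pandas as pd


def make_synthetic_features(n_samples=1000, n_features=20, n_genres=10, seed=42):
    """Build a placeholder DataFrame with feature_* columns and a genre column.

    All sizes and genre names here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    # Assign genres round-robin so every class gets the same number of samples.
    genre_ids = np.arange(n_samples) % n_genres
    # Give each genre its own random centre so the classes are distinguishable.
    centres = rng.normal(0.0, 2.0, size=(n_genres, n_features))
    X = centres[genre_ids] + rng.normal(0.0, 1.0, size=(n_samples, n_features))
    df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(n_features)])
    df["genre"] = [f"genre_{g}" for g in genre_ids]
    return df


def load_features(csv_path):
    """Return (df, source): the CSV contents if the file exists, else synthetic data."""
    csv_path = Path(csv_path)
    if csv_path.exists():
        return pd.read_csv(csv_path), "csv"
    return make_synthetic_features(), "synthetic"


df, source = load_features(Path("..") / "data" / "processed" / "gtzan_features.csv")
print(f"Loaded {len(df)} rows from {source} data")
print(df.head())
```

If the real CSV uses different column names, the `feature_` prefix filter in the statistics cell would need to be adjusted accordingly.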

In [None]:
## Feature Statistics

In [None]:
# Get feature columns (exclude metadata)
feature_cols = [col for col in df.columns if col.startswith('feature_')]
X = df[feature_cols].values

print("="*60)
print("FEATURE STATISTICS")
print("="*60)
print(f"Total features: {X.shape[1]}")
print(f"Total samples: {X.shape[0]}")
# Each line below reports the [smallest, largest] per-feature value of that statistic
print("\nFeature range statistics:")
print(f"  Mean:   [{X.mean(axis=0).min():.4f}, {X.mean(axis=0).max():.4f}]")
print(f"  Std:    [{X.std(axis=0).min():.4f}, {X.std(axis=0).max():.4f}]")
print(f"  Min:    [{X.min(axis=0).min():.4f}, {X.min(axis=0).max():.4f}]")
print(f"  Max:    [{X.max(axis=0).min():.4f}, {X.max(axis=0).max():.4f}]")
print(f"\nMissing values: {df[feature_cols].isna().sum().sum()}")

## Feature Distribution Visualization
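To look at the per-feature distributions, one option is a small grid of histograms. The sketch below is an assumption about what such a cell could look like, not the notebook's original code; `plot_feature_histograms`, `max_features`, and the stand-in `demo_df` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def plot_feature_histograms(df, feature_cols, max_features=6, bins=30):
    """Draw one histogram per feature, for at most `max_features` columns."""
    cols = list(feature_cols)[:max_features]
    n_cols = 3
    # Always allocate at least one row so an empty column list does not fail.
    n_rows = max(1, int(np.ceil(len(cols) / n_cols)))
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(12, 3 * n_rows), squeeze=False)
    flat_axes = axes.ravel()
    for ax, col in zip(flat_axes, cols):
        ax.hist(df[col].dropna(), bins=bins, edgecolor="k", alpha=0.7)
        ax.set_title(col)
    # Hide any unused panels in the grid.
    for ax in flat_axes[len(cols):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig


# Stand-in data so the sketch runs on its own; in the notebook, pass the real df.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(rng.normal(size=(200, 4)),
                       columns=[f"feature_{i}" for i in range(4)])
fig = plot_feature_histograms(demo_df, demo_df.columns)
plt.show()
```

In the notebook itself, the call would be `plot_feature_histograms(df, feature_cols)`; capping the number of panels keeps the figure readable when there are many features.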

## Genre/Label Distribution

In [None]:
unique_genres, counts = np.unique(df['genre'], return_counts=True)

print("="*60)
print("GENRE DISTRIBUTION")
print("="*60)
for genre, count in zip(unique_genres, counts):
    pct = (count / len(df)) * 100
    print(f"  {genre:12s}: {count:4d} samples ({pct:5.1f}%)")

# Visualization
fig, ax = plt.subplots(figsize=(12, 6))
bars = ax.bar(unique_genres, counts, edgecolor='k', alpha=0.7, color='steelblue')
ax.set_xlabel('Genre', fontsize=12, fontweight='bold')
ax.set_ylabel('Number of Samples', fontsize=12, fontweight='bold')
ax.set_title('Genre Distribution', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3, axis='y')

# Add value labels on bars
for bar, count in zip(bars, counts):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'{int(count)}', ha='center', va='bottom', fontsize=10)

plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

print(f"\nBalance: {counts.min()} to {counts.max()} samples per genre")
ratio = counts.max() / counts.min()
print(f"Balanced: {'Yes' if ratio < 1.1 else 'No'} (max/min ratio: {ratio:.2f})")

## Summary & Next Steps

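**Data Status** (expected; confirm each item against the cell outputs above):
- ✓ Features loaded/generated successfully
- ✓ No missing values in the feature columns
- ✓ Genres roughly balanced (max/min count ratio below 1.1)
- ✓ Ready for clustering tasks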
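The status items above can be computed rather than asserted by hand. The following is a minimal sketch, assuming the same `feature_` prefix and `genre` column used earlier and the 1.1 balance threshold from the genre cell; `data_status` and the tiny `demo_df` are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd


def data_status(df, label_col="genre", feature_prefix="feature_", balance_ratio=1.1):
    """Compute basic readiness checks for the feature table."""
    feature_cols = [c for c in df.columns if c.startswith(feature_prefix)]
    X = df[feature_cols].to_numpy(dtype=float)
    counts = df[label_col].value_counts()
    return {
        "n_features": len(feature_cols),
        # NaN entries only.
        "missing_values": int(np.isnan(X).sum()),
        # NaN plus +/-inf entries.
        "non_finite_values": int((~np.isfinite(X)).sum()),
        # Columns with zero variance carry no information for clustering.
        "constant_features": int((X.std(axis=0) == 0).sum()),
        "balanced": bool(counts.max() / counts.min() < balance_ratio),
    }


# Tiny stand-in table with one missing value and one constant column.
demo_df = pd.DataFrame({
    "feature_0": [0.1, 0.2, np.nan, 0.4, 0.5, 0.6],
    "feature_1": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "genre": ["rock", "rock", "rock", "jazz", "jazz", "jazz"],
})
for name, value in data_status(demo_df).items():
    print(f"{name}: {value}")
```

Calling `data_status(df)` on the loaded data would show whether the checklist actually holds for the current run.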

**Recommended Actions:**
1. Run Easy Task: `python ../src/easy_task.py`
2. Run Medium Task: `python ../src/medium_task.py`
3. Run Hard Task: `python ../src/hard_task.py`

**Expected Output:**
- Trained models in `../results/models/`
- Metrics (CSV) and visualizations (PNG) in `../results/easy_task/`, `../results/medium_task/`, and `../results/hard_task/`
- Comparison across tasks

In [None]:
print("\n" + "="*70)
print("QUICK START GUIDE")
print("="*70)
print("""
Three Tasks to Run (in order):

1. EASY TASK - Basic VAE Clustering
   Command: python ../src/easy_task.py
   Features:
   - Basic VAE (input→256→128→latent→128→256→output)
   - K-Means clustering (k=10)
   - PCA baseline comparison
   - Metrics: Silhouette, Calinski-Harabasz, ARI, NMI
   Duration: ~2-5 minutes
   
2. MEDIUM TASK - Enhanced VAE with Hybrid Features
   Command: python ../src/medium_task.py
   Features:
   - Enhanced VAE with batch norm and dropout
   - Hybrid audio+lyrics features (60/40 split)
   - 4 clustering algorithms (KMeans, 2x Agglomerative, DBSCAN)
   - 5 evaluation metrics
   Duration: ~5-10 minutes
   
3. HARD TASK - Advanced VAE Variants with Baselines
   Command: python ../src/hard_task.py
   Features:
   - Beta-VAE for disentangled representations
   - Conditional VAE (CVAE) for genre conditioning
   - Multi-modal fusion (audio + lyrics + genre)
   - Baseline comparisons (PCA, Autoencoder, Spectral)
   - Comprehensive evaluation report
   Duration: ~10-20 minutes

All tasks:
- Use synthetic data by default (no external files needed)
- Create output folders automatically
- Save metrics to CSV and visualizations as PNG
- Can use real GTZAN data if available in data/audio/genres_original/

Output Location: ../results/{easy_task|medium_task|hard_task}/
Models Location: ../results/models/
""")
print("="*70)
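The Easy Task description lists a PCA baseline, K-Means with k=10, and four metrics (Silhouette, Calinski-Harabasz, ARI, NMI). As a rough illustration of how those pieces fit together, here is a sketch using scikit-learn on synthetic blobs. It is not the code in `easy_task.py`; `evaluate_clustering`, the data sizes, and the choice of 8 PCA components are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import (
    adjusted_rand_score,
    calinski_harabasz_score,
    normalized_mutual_info_score,
    silhouette_score,
)


def evaluate_clustering(embedding, true_labels, n_clusters=10, seed=0):
    """Cluster an embedding with K-Means and score it with four metrics."""
    pred = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(embedding)
    return {
        # Internal metrics: use only the embedding and the predicted clusters.
        "silhouette": silhouette_score(embedding, pred),
        "calinski_harabasz": calinski_harabasz_score(embedding, pred),
        # External metrics: compare predicted clusters with the known labels.
        "ari": adjusted_rand_score(true_labels, pred),
        "nmi": normalized_mutual_info_score(true_labels, pred),
    }


# Synthetic stand-in: 10 Gaussian blobs in 20 dimensions (sizes are assumptions).
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 50)
centres = rng.normal(0.0, 5.0, size=(10, 20))
X = centres[labels] + rng.normal(0.0, 1.0, size=(500, 20))

# PCA baseline: project to 8 components, then cluster in the reduced space.
Z_pca = PCA(n_components=8, random_state=0).fit_transform(X)
for name, value in evaluate_clustering(Z_pca, labels).items():
    print(f"{name}: {value:.3f}")
```

To compare against a VAE, the same function could be called on the encoder's latent vectors in place of `Z_pca`, so both representations are scored under identical clustering settings.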