<a href="https://colab.research.google.com/github/verammaz/KMeans-VAE/blob/main/run_in_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# VAE Training Pipeline on Google Colab

This notebook provides a complete workflow for:
1. Setting up the environment
2. Generating synthetic datasets (Gaussian & Bernoulli)
3. Training a Variational Autoencoder (VAE)
4. Analyzing and visualizing results

**Runtime:** GPU recommended for faster training (Runtime → Change runtime type → GPU)

---

## 📦 Setup & Installation

In [1]:
    from google.colab import auth
    auth.authenticate_user()

In [1]:
# Install required packages
! pip install -q torch torchvision tqdm matplotlib wandb

print("Packages installed successfully!")

Packages installed successfully!


In [2]:
# Check available device
import torch
import sys

print(f"Python version: {sys.version}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
    print(f"CUDA version: {torch.version.cuda}")

Python version: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
PyTorch version: 2.8.0+cu126
CUDA available: True
CUDA device: Tesla T4
CUDA version: 12.6


## Clone GitHub Repo


In [4]:
! git clone https://github.com/verammaz/KMeans-VAE.git
%cd KMeans-VAE

Cloning into 'KMeans-VAE'...
remote: Enumerating objects: 52, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 52 (delta 0), reused 2 (delta 0), pack-reused 49 (from 4)
Receiving objects: 100% (52/52), 127.01 MiB | 32.53 MiB/s, done.
Resolving deltas: 100% (7/7), done.
/content/KMeans-VAE


## W&B Setup (Optional)

If you want to log experiments to Weights & Biases:

In [3]:
USE_WANDB = True  # Set to False to skip W&B logging

if USE_WANDB:
    import wandb
    wandb.login()
    print("W&B configured!")
else:
    print("W&B logging disabled. Set USE_WANDB=True to enable.")

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········

wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: vmm2146 (vmm2146-columbia-university) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


W&B configured!


## Step 1: Generate Synthetic Datasets

Generate both Gaussian and Bernoulli mixture datasets.

In [40]:
# Generate datasets
! python data/make_datasets.py \
    --k 5 \
    --dims 64 \
    --target-mb 256 \
    --seed 42 \
    --outroot ./data_set



Per-dataset per-component counts: [102475, 102475, 102475, 102475, 102475] (dims=64)
Wrote ./data_set/gaussian_raw  (~127.05 MB)
Wrote ./data_set/bernoulli_raw   (~127.05 MB)
Total on disk ≈ 254.10 MB (target 256.00 MB)
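
The reported size can be sanity-checked with a little arithmetic. The sketch below assumes each sample is stored as 64 float32 features plus one 4-byte label; the on-disk layout of `make_datasets.py` is not shown here, so treat that as an assumption. Under it, the estimate matches the reported ~127.05 figure when "MB" is read as 2^20 bytes.

```python
# Rough size check for one dataset, under an ASSUMED layout:
# 64 float32 features plus one 4-byte label per sample.
K = 5                  # mixture components (--k)
DIMS = 64              # features per sample (--dims)
N_PER = 102_475        # samples per component, from the output above
BYTES_PER_VALUE = 4    # assumption: float32 features and 4-byte labels

n_samples = K * N_PER
bytes_per_sample = (DIMS + 1) * BYTES_PER_VALUE
total_bytes = n_samples * bytes_per_sample
size_mib = total_bytes / 2**20

print(f"{n_samples:,} samples x {bytes_per_sample} B = {size_mib:.2f} MiB")
```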


In [41]:
# Verify data generation
import os
import json

for dataset in ['gaussian_raw', 'bernoulli_raw']:
    path = f'./data_set/{dataset}'
    if os.path.exists(path):
        with open(os.path.join(path, 'metadata.json')) as f:
            meta = json.load(f)
        print(f"\n{dataset}:")
        print(f"  Type: {meta['type']}")
        print(f"  Classes: {meta['k']}")
        print(f"  Dimensions: {meta['dims']}")
        print(f"  Samples per class: {meta['n_per']}")


gaussian_raw:
  Type: gaussian
  Classes: 5
  Dimensions: 64
  Samples per class: [102475, 102475, 102475, 102475, 102475]

bernoulli_raw:
  Type: bernoulli
  Classes: 5
  Dimensions: 64
  Samples per class: [102475, 102475, 102475, 102475, 102475]
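
For intuition about what the two datasets contain, here is a minimal NumPy sketch of balanced Gaussian and Bernoulli mixtures. It is not the code in `data/make_datasets.py`: the function name `sample_mixtures`, the spread of the component means, and the range of the Bernoulli probabilities are all illustrative assumptions.

```python
import numpy as np

def sample_mixtures(k=5, dims=64, n_per=1000, seed=42):
    """Draw balanced Gaussian and Bernoulli mixture samples (illustrative only)."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(k), n_per)  # component index for every sample

    # Gaussian mixture: one random mean per component plus unit-variance noise.
    # The scale 3.0 for the means is an arbitrary choice for this sketch.
    means = rng.normal(0.0, 3.0, size=(k, dims))
    x_gauss = means[labels] + rng.normal(size=(k * n_per, dims))

    # Bernoulli mixture: one vector of per-dimension success probabilities
    # per component, followed by independent coin flips.
    probs = rng.uniform(0.05, 0.95, size=(k, dims))
    x_bern = (rng.random((k * n_per, dims)) < probs[labels]).astype(np.float32)

    return x_gauss.astype(np.float32), x_bern, labels

x_gauss, x_bern, labels = sample_mixtures(n_per=200)
print(x_gauss.shape, x_bern.shape, np.bincount(labels))
```

Each component contributes the same number of samples, mirroring the equal per-class counts reported in `metadata.json`.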


## Step 2: Train VAE

Choose your configuration and train the model.

### Configuration Options

In [7]:
# Training configuration
CONFIG = {
    # Dataset
    'dataset': 'gaussian',  # 'gaussian' or 'bernoulli'

    # Model architecture
    'latent_dim': 10,
    'hidden_dims': [128, 64],
    'kl_beta': 1.0,  # 1.0 = standard VAE, >1.0 = beta-VAE
    'activation': 'LeakyReLU',

    # Training
    'epochs': 1,
    'batch_size': 128,
    'lr': 3e-4,
    'optimizer': 'adam',

    # System
    'seed': 3407,
    'device': 'auto',  # 'auto', 'cuda', 'cpu'

    # W&B (if enabled)
    'use_wandb': USE_WANDB,
    'wandb_project': 'vae-colab-experiments',
    'wandb_name': None,  # Auto-generated if None
}

print("Configuration:")
for k, v in CONFIG.items():
    print(f"  {k}: {v}")

Configuration:
  dataset: gaussian
  latent_dim: 10
  hidden_dims: [128, 64]
  kl_beta: 1.0
  activation: LeakyReLU
  epochs: 1
  batch_size: 128
  lr: 0.0003
  optimizer: adam
  seed: 3407
  device: auto
  use_wandb: True
  wandb_project: vae-colab-experiments
  wandb_name: None


### Quick Training Presets

Uncomment one to use:

In [None]:
# Preset 1: Quick test (fast, for debugging)
# CONFIG.update({'epochs': 10, 'latent_dim': 5, 'hidden_dims': [64, 32]})

# Preset 2: Standard VAE
# CONFIG.update({'epochs': 50, 'latent_dim': 10, 'kl_beta': 1.0})

# Preset 3: Beta-VAE (disentanglement)
# CONFIG.update({'epochs': 100, 'latent_dim': 20, 'kl_beta': 4.0})

# Preset 4: High capacity
# CONFIG.update({'epochs': 100, 'latent_dim': 20, 'hidden_dims': [256, 128, 64]})

print("Using configuration:", CONFIG['dataset'], "dataset")

### Run Training

In [44]:
# Build command line arguments
data_dir = f"data_set/{CONFIG['dataset']}_raw"
hidden_dims_str = str(CONFIG['hidden_dims']).replace(' ', '')

cmd = f"""
python -m vae.main \
    --data.data_dir={data_dir} \
    --model.latent_dim={CONFIG['latent_dim']} \
    --model.hidden_dims={hidden_dims_str} \
    --model.kl_beta={CONFIG['kl_beta']} \
    --model.activation={CONFIG['activation']} \
    --trainer.epochs={CONFIG['epochs']} \
    --trainer.batch_size={CONFIG['batch_size']} \
    --trainer.lr={CONFIG['lr']} \
    --trainer.optimizer={CONFIG['optimizer']} \
    --trainer.device={CONFIG['device']} \
    --system.seed={CONFIG['seed']} \
"""

MODEL_NAME = f"vae_{CONFIG['dataset']}_z{CONFIG['latent_dim']}_beta{CONFIG['kl_beta']}"

# Add W&B flags if enabled
if CONFIG['use_wandb']:
    cmd += f" \
    --wandb.enabled=True \
    --wandb.project={CONFIG['wandb_project']}"
    if CONFIG['wandb_name']:
        cmd += f" \
    --wandb.name={CONFIG['wandb_name']}"

print("Training command:")
print(cmd)
print("\n" + "="*60)
print("Starting training...")
print("="*60 + "\n")

# Run training
!{cmd}

Training command:

python -m vae.main     --data.data_dir=data_set/gaussian_raw     --model.latent_dim=10     --model.hidden_dims=[128,64]     --model.kl_beta=1.0     --model.activation=LeakyReLU     --trainer.epochs=1     --trainer.batch_size=128     --trainer.lr=0.0003     --trainer.optimizer=adam     --trainer.device=auto     --system.seed=3407      --wandb.enabled=True     --wandb.project=vae-colab-experiments

Starting training...

command line overwriting config attribute data.data_dir with data_set/gaussian_raw
command line overwriting config attribute model.latent_dim with 10
command line overwriting config attribute model.hidden_dims with [128, 64]
command line overwriting config attribute model.kl_beta with 1.0
command line overwriting config attribute model.activation with LeakyReLU
command line overwriting config attribute trainer.epochs with 1
command line overwriting config attribute trainer.batch_size with 128
command line overwriting config attribute trainer.lr with 0.0
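
As an alternative to assembling the command in one f-string, the same flags can be collected in a list and quoted with `shlex.join`. This is a sketch, not part of the repository: `build_train_command` is a hypothetical helper, and it only uses a subset of the flags that appear in the training cell above.

```python
import shlex

def build_train_command(config, data_dir):
    """Assemble the training command as an argument list (hypothetical helper)."""
    hidden_dims = str(config["hidden_dims"]).replace(" ", "")
    args = [
        "python", "-m", "vae.main",
        f"--data.data_dir={data_dir}",
        f"--model.latent_dim={config['latent_dim']}",
        f"--model.hidden_dims={hidden_dims}",
        f"--model.kl_beta={config['kl_beta']}",
        f"--trainer.epochs={config['epochs']}",
        f"--trainer.lr={config['lr']}",
    ]
    if config.get("use_wandb"):
        args += ["--wandb.enabled=True",
                 f"--wandb.project={config['wandb_project']}"]
    return args

example_config = {
    "latent_dim": 10, "hidden_dims": [128, 64], "kl_beta": 1.0,
    "epochs": 1, "lr": 3e-4,
    "use_wandb": True, "wandb_project": "vae-colab-experiments",
}
args = build_train_command(example_config, "data_set/gaussian_raw")
# shlex.join quotes arguments such as [128,64] so the shell cannot glob them.
print(shlex.join(args))
# To run it without a shell: subprocess.run(args, check=True)
```

Keeping the arguments in a list avoids the backslash line continuations and makes it easy to append optional flags.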

## Step 3: Analyze Results

Load the trained model.

In [45]:
import os

import matplotlib.pyplot as plt
import numpy as np
import torch

from data.data_io import load_and_split
from vae.model import VAE

# Load trained model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
checkpoint = torch.load(os.path.join('./out', 'vae_gaus_i64_k5_z10_beta1.0/model.pt'), map_location=device)
config = checkpoint['config']

print("Model configuration:")
print(f"  Latent dim: {config['model']['latent_dim']}")
print(f"  Hidden dims: {config['model']['hidden_dims']}")
print(f"  Beta: {config['model']['kl_beta']}")
print(f"\nTest statistics:")
for k, v in checkpoint['test_stats'].items():
    print(f"  {k}: {v:.4f}")

Model configuration:
  Latent dim: 10
  Hidden dims: [128, 64]
  Beta: 1.0

Test statistics:
  loss: 17.7313
  recon: 13.1203
  kl: 4.6111
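
The statistics above are consistent with a total loss of the form `recon + beta * kl` at `beta = 1.0` (13.1203 + 4.6111 ≈ 17.7313, up to rounding). The sketch below shows that decomposition with the closed-form KL term for a diagonal Gaussian posterior against a standard normal prior. It is written in NumPy for illustration; the function names are hypothetical, and the repository's actual reduction (sum versus mean) is not shown in this notebook.

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over latent dims, averaged over the batch."""
    kl_per_sample = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1)
    return float(kl_per_sample.mean())

def beta_vae_loss(recon, kl, beta=1.0):
    """Total objective: reconstruction term plus beta-weighted KL term."""
    return recon + beta * kl

# A posterior equal to the prior has zero KL divergence.
zeros = np.zeros((4, 10))
kl_zero = gaussian_kl(zeros, zeros)

# Shifting every latent mean to 1 with unit variance costs 0.5 per dimension.
kl_shifted = gaussian_kl(np.ones((4, 10)), zeros)

# Recombine the test statistics reported above.
total = beta_vae_loss(13.1203, 4.6111, beta=1.0)
print(kl_zero, kl_shifted, round(total, 4))
```

With `beta > 1` the same KL value contributes more to the total, which is the trade-off the `kl_beta` option controls.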


In [46]:
# Recreate model
model_config = config['model']
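# Infer the input size from the checkpoint. Caveat: this assumes the second
# dimension of 'mean.weight' equals the data dimension; check it against the
# dataset's `dims` (64 here) if you change hidden_dims.
input_dim = checkpoint['model_state_dict']['mean.weight'].shape[1]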

model = VAE(
    input_dim=input_dim,
    latent_dim=model_config['latent_dim'],
    hidden_dims=model_config['hidden_dims'],
    likelihood=model_config['likelihood'],
    beta=model_config['kl_beta'],
    activation=model_config.get('activation', 'LeakyReLU')
)

model.load_state_dict(checkpoint['model_state_dict'])
model.to(device)
model.eval()

print("Model loaded successfully!")

Model loaded successfully!


In [47]:
# Load test data
data_dir = config['data']['data_dir']
data = load_and_split(data_dir, normalize=True)

X_test = torch.tensor(data['X_test'], dtype=torch.float32).to(device)
y_test = torch.tensor(data['y_test'], dtype=torch.long)

print(f"Test set: {X_test.shape[0]} samples, {X_test.shape[1]} features")

Test set: 76860 samples, 64 features
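
One way to visualize the latent space is to project the latent means to two dimensions with PCA. The sketch below uses a random placeholder array in place of real encoder outputs, because the model's encoding method is not shown in this notebook; `pca_2d` is a hypothetical helper.

```python
import numpy as np

def pca_2d(z):
    """Project the rows of z onto their top two principal components."""
    z_centered = z - z.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(z_centered, full_matrices=False)
    return z_centered @ vt[:2].T

rng = np.random.default_rng(0)
latent_means = rng.normal(size=(500, 10))  # placeholder for encoder means
coords = pca_2d(latent_means)
print(coords.shape)
```

With real latent means, `plt.scatter(coords[:, 0], coords[:, 1], c=y_test, s=2)` would color each point by its mixture component.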


## Download Results

Download the trained model and checkpoints to your local machine.

In [None]:
# Zip the output directory
!zip -r vae_results.zip out/

print("Results zipped!")
print("Download 'vae_results.zip' from the file browser on the left.")

# Or use Google Drive (if mounted)
# from google.colab import drive
# drive.mount('/content/drive')
# !cp -r out/ /content/drive/MyDrive/vae/
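
The archive can also be built from Python with the standard library instead of the `zip` shell command. This is a sketch under the assumption that results live in a single output directory; `zip_results` is a hypothetical helper, and the demo uses a temporary directory with a placeholder file rather than a real checkpoint.

```python
import os
import shutil
import tempfile
import zipfile

def zip_results(out_dir="out", archive_name="vae_results"):
    """Create <archive_name>.zip containing out_dir, similar to `zip -r`."""
    out_dir = os.path.abspath(out_dir)
    parent, base = os.path.dirname(out_dir), os.path.basename(out_dir)
    return shutil.make_archive(archive_name, "zip", root_dir=parent, base_dir=base)

# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_out = os.path.join(tmp, "out")
    os.makedirs(demo_out)
    with open(os.path.join(demo_out, "model.pt"), "wb") as f:
        f.write(b"placeholder")  # stands in for a real checkpoint
    archive = zip_results(demo_out, os.path.join(tmp, "vae_results"))
    archive_basename = os.path.basename(archive)
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()

print(archive_basename, names)
```

In Colab, `from google.colab import files; files.download('vae_results.zip')` triggers a browser download of the resulting file.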

## Experiment: Compare Different Beta Values

Run a quick sweep to see the effect of different beta values.

In [None]:
# Sweep different beta values
import os

beta_values = [0.5, 1.0, 2.0, 4.0]

for beta in beta_values:
    print(f"\n{'='*60}")
    print(f"Training with beta = {beta}")
    print(f"{'='*60}\n")

    # Use the same module entry point as the main training cell
    args = [
        "python -m vae.main",
        "--data.data_dir=./data_set/gaussian_raw",
        f"--model.kl_beta={beta}",
        "--model.latent_dim=10",
        "--trainer.epochs=30",
        "--trainer.batch_size=128",
        f"--system.out_dir=./out/vae_beta_{beta}",
    ]

    if USE_WANDB:
        args += ["--wandb.enabled=True", f"--wandb.name=beta_{beta}"]

    # Execute the command and report failures instead of silently continuing
    status = os.system(" ".join(args))
    if status != 0:
        print(f"Training failed for beta = {beta} (exit status {status})")

print("\nBeta sweep complete! Check out/vae_beta_* directories for results.")
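
Once the sweep finishes, the test statistics of each run can be lined up in a small table. The sketch below assumes every run stores a `test_stats` dictionary with `loss`, `recon`, and `kl` keys, as the checkpoint loaded in Step 3 does; `format_sweep_table` is a hypothetical helper. Only the `beta = 1.0` row is real: it comes from the single-epoch run in Step 3 and serves as sample input.

```python
def format_sweep_table(stats_by_beta):
    """Return a plain-text table of test statistics, one row per beta value."""
    rows = [f"{'beta':>6} {'loss':>10} {'recon':>10} {'kl':>10}"]
    for beta in sorted(stats_by_beta):
        s = stats_by_beta[beta]
        rows.append(
            f"{beta:>6.1f} {s['loss']:>10.4f} {s['recon']:>10.4f} {s['kl']:>10.4f}"
        )
    return "\n".join(rows)

# Sample input: the beta = 1.0 statistics reported in Step 3.
# Other beta values would be filled in from each run's checkpoint['test_stats'].
stats_by_beta = {1.0: {"loss": 17.7313, "recon": 13.1203, "kl": 4.6111}}
table = format_sweep_table(stats_by_beta)
print(table)
```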