# NAOMI-II Training on Google Colab

This notebook trains NAOMI-II embeddings on the full WordNet dataset (a 157K-entry vocabulary and 15.67M edges).

**Requirements:**
- Google Colab Pro (may be free for students, depending on eligibility)
- T4 GPU runtime
- Pre-generated training data uploaded to Google Drive

**Expected Runtime:** 12-24 hours on T4 GPU

## 1. Set Up GPU Runtime

**IMPORTANT:** Before running any cells, change the runtime type:
1. Runtime → Change runtime type
2. Hardware accelerator: **GPU**
3. GPU type: **T4** (or best available)
4. Click **Save**

In [None]:
# Verify GPU is available
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("⚠️ WARNING: No GPU detected! Change runtime type to GPU.")

## 2. Clone Repository

In [None]:
# Clone NAOMI-II repository
!git clone https://github.com/YOUR_USERNAME/NAOMI-II.git
%cd NAOMI-II

# Show current directory
!pwd
!ls -lh

## 3. Install Dependencies

In [None]:
# Install required packages
!pip install -q nltk numpy tqdm

# Download WordNet
import nltk
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)

print("✓ Dependencies installed")

## 4. Mount Google Drive and Load Data

**Before running:** Upload your pre-generated data to Google Drive:
- `full_wordnet/` folder (extracted WordNet synsets)
- `wordnet_training/` folder (15.67M training edges)

Suggested location: `My Drive/NAOMI-II-data/` (this appears as `/content/drive/MyDrive/NAOMI-II-data/` once Drive is mounted)

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# List available data (adjust path to your Drive location)
!ls -lh /content/drive/MyDrive/NAOMI-II-data/

In [None]:
# Copy data from Drive to local Colab storage (faster for training)
import os

# Adjust this path to match your Google Drive structure
DRIVE_DATA_PATH = "/content/drive/MyDrive/NAOMI-II-data"

# Create data directories
!mkdir -p data/full_wordnet
!mkdir -p data/wordnet_training

# Copy data (this may take a few minutes); quotes protect paths with spaces
print("Copying WordNet data...")
!cp -r "{DRIVE_DATA_PATH}"/full_wordnet/* data/full_wordnet/
print("Copying training data...")
!cp -r "{DRIVE_DATA_PATH}"/wordnet_training/* data/wordnet_training/

# List the copied files; both directories should be non-empty
print("\nCopied data:")
!ls -lh data/full_wordnet/
!ls -lh data/wordnet_training/
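
The `ls` listings above only show the top level of each folder. As a rough sanity check before a long training run, the sketch below counts the files and bytes under each data directory. `summarize_dir` and `report` are hypothetical helper names, not part of the NAOMI-II repository, and the check says nothing about whether the file contents are valid.

```python
from pathlib import Path


def summarize_dir(path):
    """Return (file_count, total_bytes) for all regular files under path."""
    root = Path(path)
    if not root.is_dir():
        raise FileNotFoundError(f"Missing directory: {root}")
    files = [p for p in root.rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


def report(paths):
    """Print one status line per directory; EMPTY means nothing was copied."""
    for path in paths:
        count, size = summarize_dir(path)
        status = "OK" if count > 0 else "EMPTY"
        print(f"{status:5} {path}: {count} files, {size / 1e6:.1f} MB")


# Usage in this notebook (paths taken from the copy cell above):
# report(["data/full_wordnet", "data/wordnet_training"])
```

Running the same helper on the Drive folders and comparing the totals is one way to spot an interrupted copy.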

### Alternative: Generate Data (if not pre-uploaded)

**Skip this if you already copied data above.**

If you didn't upload pre-generated data, you can generate it here (~1.5 hours):

In [None]:
# OPTIONAL: Only run if you need to generate data from scratch

# Extract WordNet (~30 minutes)
# !python scripts/extract_full_wordnet.py --output data/full_wordnet

# Generate training edges (~1 hour)
# !python scripts/generate_wordnet_training_data.py \
#     --input data/full_wordnet \
#     --output data/wordnet_training

## 5. Start Training

This will train for 50 epochs on 15.67M edges with dynamic dimension expansion.

**Expected time:** 12-24 hours on T4 GPU

In [None]:
# Start training
!python scripts/train_embeddings.py \
    --training-data data/wordnet_training \
    --unsupervised \
    --dynamic-dims \
    --embedding-dim 128 \
    --max-dims 512 \
    --epochs 50 \
    --lr 0.001 \
    --batch-size 128
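
To put the 12-24 hour estimate in perspective, the following back-of-the-envelope calculation converts the settings above into batch counts and the throughput they imply. It assumes one optimizer step per batch and that a final partial batch is kept; neither is confirmed by the training script, and the edge count is the rounded 15.67M figure.

```python
import math


def batches_per_epoch(num_edges, batch_size):
    """Number of batches in one pass, keeping a final partial batch."""
    return math.ceil(num_edges / batch_size)


NUM_EDGES = 15_670_000  # approximate, from the "15.67M edges" figure
BATCH_SIZE = 128        # matches --batch-size above
EPOCHS = 50             # matches --epochs above

steps_per_epoch = batches_per_epoch(NUM_EDGES, BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

print(f"Batches per epoch: {steps_per_epoch:,}")
print(f"Total steps (assuming one step per batch): {total_steps:,}")
for hours in (12, 24):
    rate = total_steps / (hours * 3600)
    print(f"Finishing in {hours} h needs about {rate:.0f} steps/s")
```

If the observed rate early in training is well below these figures, the run is unlikely to finish within the expected window.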

## 6. Monitor Training Progress

The training output will show:
- Loss trends (distance + sparsity)
- Dimension statistics every 10 epochs
- Dimension expansion events

**Expected behavior:**
- Initial loss: ~0.08
- Final loss: ~0.01-0.02
- Dimensions: May expand from 128 → 192-256 if needed
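
Dimension expansion also changes the size of the embedding table. The estimate below assumes 32-bit floats and the rounded 157K vocabulary size; it covers the embedding matrix only, not optimizer state or batch tensors, so actual GPU memory use will be higher.

```python
def embedding_bytes(vocab_size, dims, bytes_per_value=4):
    """Size of a dense vocab_size x dims matrix (float32 by default)."""
    return vocab_size * dims * bytes_per_value


VOCAB_SIZE = 157_000  # approximate, from the "157K vocabulary" figure

for dims in (128, 192, 256, 512):
    megabytes = embedding_bytes(VOCAB_SIZE, dims) / 1e6
    print(f"{dims:>3} dims: {megabytes:.0f} MB")
```

Under these assumptions, even the `--max-dims 512` ceiling keeps the table well under 1 GB.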

## 7. Download Trained Model

After training completes, copy the checkpoints to Google Drive or download them to your local machine.

In [None]:
# Copy checkpoints to Google Drive for safekeeping
!mkdir -p /content/drive/MyDrive/NAOMI-II-results
!cp -r data/checkpoints /content/drive/MyDrive/NAOMI-II-results/

print("✓ Checkpoints saved to Google Drive")
!ls -lh /content/drive/MyDrive/NAOMI-II-results/checkpoints/
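
The `cp -r` above copies everything each time it runs. For repeated backups, a minimal incremental sketch copies only files that are missing or changed at the destination. `sync_checkpoints` is a hypothetical helper, not part of the repository; it assumes checkpoints are ordinary files under `data/checkpoints`.

```python
import shutil
from pathlib import Path


def sync_checkpoints(src, dst):
    """Copy files from src to dst when missing or changed; return copied names."""
    src, dst = Path(src), Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.rglob("*")):
        if not path.is_file():
            continue
        target = dst / path.relative_to(src)
        stat = path.stat()
        if (target.exists()
                and target.stat().st_size == stat.st_size
                and target.stat().st_mtime >= stat.st_mtime):
            continue  # destination copy is already up to date
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)  # copy2 preserves the modification time
        copied.append(str(path.relative_to(src)))
    return copied


# Usage in this notebook:
# sync_checkpoints("data/checkpoints",
#                  "/content/drive/MyDrive/NAOMI-II-results/checkpoints")
```

Because Colab runs cells one at a time, this cannot simply be run in another cell while the training cell is busy. Calling it periodically from a background thread started before training is one possible approach.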

In [None]:
# Optional: Download directly to your computer
from google.colab import files

# Download final embeddings
files.download('data/checkpoints/best_model.pkl')
files.download('data/checkpoints/embeddings_final.npy')

## 8. Quick Analysis (Optional)

Run basic analysis on the trained embeddings:

In [None]:
# Load trained embeddings
import numpy as np
import json

embeddings = np.load('data/checkpoints/embeddings_final.npy')
with open('data/wordnet_training/vocabulary.json', 'r') as f:
    vocab_data = json.load(f)

print(f"Embeddings shape: {embeddings.shape}")
print(f"Vocabulary size: {len(vocab_data['word_to_id'])}")
print("\nValue statistics (over all entries):")
print(f"  Mean: {embeddings.mean():.4f}")
print(f"  Std: {embeddings.std():.4f}")
print(f"  Min: {embeddings.min():.4f}")
print(f"  Max: {embeddings.max():.4f}")

# Compute sparsity
sparsity = np.mean(np.abs(embeddings) < 0.001)
print(f"\nSparsity: {sparsity:.1%} of values near zero")
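
Beyond summary statistics, a quick way to inspect the embeddings is to look up nearest neighbors by cosine similarity. The sketch below assumes that each value in `word_to_id` is a row index into the embeddings array, which the notebook does not state explicitly. `nearest_neighbors` is a hypothetical helper, not part of the repository.

```python
import numpy as np


def nearest_neighbors(embeddings, word_to_id, query, k=5):
    """Return the k (word, cosine similarity) pairs closest to query."""
    id_to_word = {idx: word for word, idx in word_to_id.items()}
    # L2-normalize rows so a dot product equals cosine similarity;
    # the floor on the norm guards against all-zero rows.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, 1e-12)
    query_id = word_to_id[query]
    sims = unit @ unit[query_id]
    sims[query_id] = -np.inf  # exclude the query word itself
    top = np.argsort(-sims)[:k]
    return [(id_to_word[int(i)], float(sims[i])) for i in top]


# Usage with the variables loaded above (the key must exist in the vocabulary):
# nearest_neighbors(embeddings, vocab_data["word_to_id"], "dog", k=10)
```

The example key `"dog"` is a guess at the vocabulary format; if entries are sense-tagged, use a key exactly as it appears in `vocabulary.json`.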

## 9. Cleanup (Optional)

Free up Colab storage after saving checkpoints:

In [None]:
# Remove large data files to free up space
# !rm -rf data/full_wordnet
# !rm -rf data/wordnet_training

# Check remaining disk space
!df -h /content

## Notes

**Session Management:**
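- Colab Pro sessions can run for up to 24 hours, but may disconnect earlier (for example, when idle)
- Training should complete in 12-24 hours on T4, so a run at the upper end may not fit in a single session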
- Checkpoints are saved every 5 epochs automatically

**If Session Disconnects:**
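1. Reconnect to the same runtime. If the runtime was recycled, local files (including `data/checkpoints`) are gone; rerun sections 2-4 and restore any checkpoints you copied to Google Drive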
2. Re-mount Google Drive
3. Resume from last checkpoint (modify training command)
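
For step 3, the first task is locating the newest checkpoint. The sketch below picks the most recently modified file in a directory without assuming any naming scheme. `latest_checkpoint` is a hypothetical helper; how `scripts/train_embeddings.py` accepts a checkpoint to resume from is not shown in this notebook, so check the script's options before modifying the training command.

```python
from pathlib import Path


def latest_checkpoint(checkpoint_dir):
    """Return the most recently modified file in checkpoint_dir, or None."""
    root = Path(checkpoint_dir)
    if not root.is_dir():
        return None
    files = [p for p in root.iterdir() if p.is_file()]
    return max(files, key=lambda p: p.stat().st_mtime, default=None)


# Usage in this notebook:
# print(latest_checkpoint("data/checkpoints"))
```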

**Troubleshooting:**
- Out of memory: Reduce `--batch-size` to 64 or 32
- Slow training: Verify GPU is enabled (rerun the check in section 1)
- Data not found: Check the Google Drive paths in section 4

**After Training:**
- Save checkpoints to Google Drive (section 7)
- Download them to your local machine for analysis
- Run dimension discovery analysis locally (e.g., on a Surface)