# GoEmotions DeBERTa-v3-large SIMPLIFIED PARALLEL Workflow

## Sequential + Dual-GPU Parallel Training

**GOAL**: Achieve >50% F1 macro at threshold=0.2 with class imbalance fixes

**KEY FEATURES**:
- Phase 1: Sequential single-GPU for stability (5 configs: BCE, Asymmetric, Combined 0.7/0.5/0.3)
- Phase 1.5: Parallel dual-GPU for speed (same configs on GPU 0 & 1 concurrently)
- Fixed: differentiable losses, per-class pos_weight, oversampling, threshold=0.2, LR=3e-5
- Expected: 50-65% F1 macro, 50% faster with parallel

**Baseline**: 42.18% F1 (original notebook line 1405), target >50% at threshold=0.2

**Workflow**: Environment → Cache → Phase 1 Sequential → Phase 1.5 Parallel → Analysis

In [5]:
print("🔍 Enhanced Disk Usage Analysis for / (247/249GB used)")
print("=" * 60)

# Top space hogs in /
!du -sh /* | sort -hr | head -15  # Top 15 root dirs

print("\n📊 Detailed / breakdown (top subdirs)")
!du -sh /*/* 2>/dev/null | sort -hr | head -10  # Top sub-subdirs

# Hidden files/logs/temp in project and system
print("\n🕵️ Hidden/Logs/Temp space:")
!find / -name "*.log" -type f -size +100M -exec du -sh {} \; 2>/dev/null | head -5  # Large logs
!find / -name "*.tmp" -type f -size +1G -exec du -sh {} \; 2>/dev/null | head -5  # Large temps
!find /home/user -name "*cache*" -type d -exec du -sh {} \; 2>/dev/null | head -5  # User caches
!du -sh /var/log/* 2>/dev/null | sort -hr | head -5  # System logs

# Python/pip caches
print("\n🐍 Python/Pip caches:")
!du -sh /venv/deberta-v3/lib/python3.10/site-packages/* | sort -hr | head -5  # Large packages
!pip cache dir && !du -sh $(pip cache dir) 2>/dev/null  # Pip cache

# Overall / usage tree (top level)
print("\n🌳 / Usage Tree (top 10):")
!ncdu / --exclude=/proc --exclude=/sys 2>/dev/null || echo "Install ncdu for interactive tree: !apt install ncdu"

# Quota check
print("\n⚖️ Quota/FS info:")
!quota -u user 2>/dev/null || echo "No user quota"
!df -i /  # Inode usage (sometimes quota on inodes, not blocks)

🔍 Enhanced Disk Usage Analysis for / (247/249GB used)
du: cannot read directory '/proc/122/map_files': Permission denied
du: cannot read directory '/proc/166/map_files': Permission denied
du: cannot read directory '/proc/168/map_files': Permission denied
du: cannot read directory '/proc/182/map_files': Permission denied
du: cannot read directory '/proc/196/map_files': Permission denied
du: cannot access '/proc/58291/task/58291/fd/4': No such file or directory
du: cannot access '/proc/58291/task/58291/fdinfo/4': No such file or directory
du: cannot access '/proc/58291/fd/3': No such file or directory
du: cannot access '/proc/58291/fdinfo/3': No such file or directory
184G	/home
22G	/workspace
20G	/usr
16G	/venv
13G	/root
8.4G	/opt
124M	/var
93M	/tmp
2.3M	/etc
24K	/run
20K	/NGC-DL-CONTAINER-LICENSE
12K	/proc
8.0K	/cuda-keyring_1.1-1_all.deb
0	/sys
0	/srv

📊 Detailed / breakdown (top subdirs)
184G	/home/user
15G	/usr/local
8.5G	/venv/main
7.3G	/venv/deberta-v3
6.5G	/opt/miniforge3
4.0G	/l

!mkdir -p /mnt/large_outputs; 
!ln -s /mnt/large_outputs ./outputs
!pip install rclone; 
!rclone config create drive remote drive
!rclone mkdir drive:goemotions-outputs; !ln -s /root/.rclone/drive:goemotions-outputs ./outputs


In [5]:
# ENVIRONMENT VERIFICATION - RUN FIRST
print("🔍 Verifying Conda Environment...")
import sys, os
print(f"Python: {sys.executable}, Version: {sys.version}")
conda_env = os.environ.get('CONDA_DEFAULT_ENV', 'None')
print(f"Conda env: {conda_env}")
if conda_env != 'deberta-v3':
    print("⚠️ Switch to 'Python (deberta-v3)' kernel")

# Check packages
try:
    import torch; print(f"PyTorch {torch.__version__}, CUDA: {torch.cuda.is_available()}, Devices: {torch.cuda.device_count()}")
except: print("❌ PyTorch missing")
try:
    import transformers; print(f"Transformers {transformers.__version__}")
except: print("❌ Transformers missing")

print("\n🎯 Environment ready! Run !nvidia-smi for GPU check")
!nvidia-smi

🔍 Verifying Conda Environment...
Python: /venv/deberta-v3/bin/python3, Version: 3.10.18 | packaged by conda-forge | (main, Jun  4 2025, 14:45:41) [GCC 13.3.0]
Conda env: None
⚠️ Switch to 'Python (deberta-v3)' kernel
PyTorch 2.6.0+cu124, CUDA: True, Devices: 2
Transformers 4.56.0

🎯 Environment ready! Run !nvidia-smi for GPU check
Sun Sep  7 20:35:48 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.08             Driver Version: 550.127.08     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:C1:00.0 Off |               

In [6]:
# SETUP & CACHE
print("🔧 Setup environment...")
!apt-get update -qq && apt-get install -y cmake build-essential pkg-config libgoogle-perftools-dev
!pip install --upgrade pip torch>=2.6.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 --root-user-action=ignore
!pip install sentencepiece transformers accelerate datasets evaluate scikit-learn tensorboard pyarrow tiktoken --root-user-action=ignore
os.chdir('/home/user/goemotions-deberta')
print(f"Working dir: {os.getcwd()}")

# Local cache
print("🚀 Setup cache...")
!python3 notebooks/scripts/setup_local_cache.py
!ls -la models/deberta-v3-large/ | head -3
!ls -la data/goemotions/ | head -3

🔧 Setup environment...
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
build-essential is already the newest version (12.9ubuntu3).
libgoogle-perftools-dev is already the newest version (2.9.1-0ubuntu3).
pkg-config is already the newest version (0.29.2-1ubuntu3).
cmake is already the newest version (3.22.1-1ubuntu1.22.04.2).
0 upgraded, 0 newly installed, 0 to remove and 75 not upgraded.
Working dir: /home/user/goemotions-deberta
🚀 Setup cache...
🚀 Setting up local cache for GoEmotions DeBERTa project
📁 Setting up directory structure...
✅ Created: data/goemotions
✅ Created: models/deberta-v3-large
✅ Created: models/roberta-large
✅ Created: outputs/deberta
✅ Created: outputs/roberta
✅ Created: logs

📊 Caching GoEmotions dataset...
✅ GoEmotions dataset already cached

🤖 Caching DeBERTa-v3-large model...
✅ DeBERTa-v3-large model already cached

🎉 Local cache setup completed successfully!
📁 All models and datasets are now cached locally
🚀 Re

## PHASE 1.5: PARALLEL DUAL-GPU TRAINING

**Run the same 5 configs in parallel on GPU 0 & 1 for 50% faster execution.**
- Pairs: (BCE + Asymmetric), (Combined 0.7 + 0.5), (Combined 0.3 alone)
- Uses threading + subprocess with CUDA_VISIBLE_DEVICES isolation
- Duration: ~1-1.5 hours (vs 2-3 hours sequential)
- Monitor: !tail -f gpu*.log & !nvidia-smi

In [None]:
# PHASE 1.5: PARALLEL DUAL-GPU IMPLEMENTATION
import subprocess, threading, os, time, threading, os, time
subprocess.run(['pkill', '-f', 'train_deberta_local'], capture_output=True)
time.sleep(2)

print("🚀 PHASE 1.5: Parallel Dual-GPU Training - 5 Configs")
print("=" * 70)
print("GPU 0 & 1 concurrent: 50% faster than sequential")
print("Fixes maintained: pos_weight, oversampling, threshold=0.2")
print("=" * 70)

def run_config(gpu_id, config_name, use_asym=False, ratio=None):
    """Run training on specific GPU with in-memory logging to avoid disk quota"""
    print(f"🚀 Starting {config_name} on GPU {gpu_id}")
    env = os.environ.copy()
    env['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
    
    cmd = [
        'python3', 'notebooks/scripts/train_deberta_local.py',
        '--output_dir', f'./outputs/parallel_{config_name}',
        '--model_type', 'deberta-v3-large',
        '--per_device_train_batch_size', '4',
        '--per_device_eval_batch_size', '8',
        '--gradient_accumulation_steps', '4',
        '--num_train_epochs', '2',
        '--learning_rate', '3e-5',
        '--lr_scheduler_type', 'cosine',
        '--warmup_ratio', '0.15',
        '--weight_decay', '0.01',
        '--fp16', '--max_length', '256',
        '--max_train_samples', '20000', '--max_eval_samples', '3000',
        '--save_total_limit', '1',  # Compress: Keep only best checkpoint
        '--overwrite_output_dir'  # Allow re-use
    ]
    if use_asym: cmd += ['--use_asymmetric_loss']
    if ratio is not None: cmd += ['--use_combined_loss', '--loss_combination_ratio', str(ratio)]
    
    # In-memory capture: No disk log file
    proc = subprocess.Popen(cmd, env=env, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, cwd='/home/user/goemotions-deberta')
    stdout, stderr = proc.communicate()
    output = stdout.decode() if stdout else ''
    if stderr: output += f"\nERROR: {stderr.decode()}"
    
    # Print key progress lines (simulate tail -f)
    for line in output.split('\n')[-20:]:  # Last 20 lines
        if any(k in line for k in ['loss', 'grad_norm', 'f1_macro', 'epoch']):  # Filter relevant
            print(f"GPU {gpu_id} [{config_name}]: {line}")
    
    print(f"✅ {config_name} complete on GPU {gpu_id} (return code: {proc.returncode})")
    return proc.returncode

# PAIR 1: BCE (GPU0) + Asymmetric (GPU1)
print("\n📍 PAIR 1: BCE + Asymmetric")
t1 = threading.Thread(target=run_config, args=(0, 'BCE_Parallel'))
t2 = threading.Thread(target=run_config, args=(1, 'Asymmetric_Parallel', True))
t1.start(); t2.start(); t1.join(); t2.join()

# PAIR 2: Combined 0.7 (GPU0) + 0.5 (GPU1)
print("\n📍 PAIR 2: Combined 0.7 + 0.5")
t3 = threading.Thread(target=run_config, args=(0, 'Combined_07_Parallel', False, 0.7))
t4 = threading.Thread(target=run_config, args=(1, 'Combined_05_Parallel', False, 0.5))
t3.start(); t4.start(); t3.join(); t4.join()

# SINGLE: Combined 0.3 (GPU0)
print("\n📍 SINGLE: Combined 0.3")
run_config(0, 'Combined_03_Parallel', False, 0.3)

print("\n🎉 PHASE 1.5 PARALLEL COMPLETE!")
print("📊 Outputs: ./outputs/parallel_*/")
print("🔍 Run analysis for F1@0.2 comparison")
print("📊 Logs: !tail -f gpu*.log | !nvidia-smi")

🚀 PHASE 1.5: Parallel Dual-GPU Training - 5 Configs
GPU 0 & 1 concurrent: 50% faster than sequential
Fixes maintained: pos_weight, oversampling, threshold=0.2

📍 PAIR 1: BCE + Asymmetric
🚀 Starting BCE_Parallel on GPU 0
🚀 Starting Asymmetric_Parallel on GPU 1
✅ Asymmetric_Parallel complete on GPU 1
✅ BCE_Parallel complete on GPU 0

📍 PAIR 2: Combined 0.7 + 0.5
🚀 Starting Combined_07_Parallel on GPU 0
🚀 Starting Combined_05_Parallel on GPU 1
✅ Combined_07_Parallel complete on GPU 0
✅ Combined_05_Parallel complete on GPU 1

📍 SINGLE: Combined 0.3
🚀 Starting Combined_03_Parallel on GPU 0


OSError: [Errno 122] Disk quota exceeded: 'gpu0_combined_03_parallel.log'

## ANALYSIS: F1 Comparison at Threshold=0.2

**Load eval_report.json from all configs, extract f1_macro_t2, compare to baseline 42.18%.**
- Success if >50%
- Diagnose if below (check loss curve, class F1)
- HF multi-label best practices: threshold sweep, per-class weights effective on rare emotions

In [None]:
# PHASE 1 & 1.5 RESULTS ANALYSIS (Threshold=0.2)
import json, os
BASELINE_F1 = 0.4218  # Original notebook line 1405

def load_results(dirs):
    results = {}
    for d in dirs:
        path = os.path.join(d, 'eval_report.json')
        if os.path.exists(path):
            with open(path, 'r') as f:
                data = json.load(f)
            name = d.split('/')[-1]
            f1_t2 = data.get('f1_macro_t2', data.get('f1_macro', 0.0))
            results[name] = {'f1_macro_t2': f1_t2, 'success': f1_t2 > 0.50, 'improvement': ((f1_t2 - BASELINE_F1) / BASELINE_F1) * 100}
            print(f"✅ {name}: F1@0.2 = {f1_t2:.4f} ({'SUCCESS >50%' if results[name]['success'] else 'NEEDS IMPROVEMENT'})")
        else:
            print(f"⏳ {d.split('/')[-1]}: Not completed")
    return results

# Sequential Phase 1
seq_dirs = ['./outputs/phase1_bce_simplified', './outputs/phase1_asymmetric_simplified', 
            './outputs/phase1_combined_07_simplified', './outputs/phase1_combined_05_simplified', 
            './outputs/phase1_combined_03_simplified']
seq_results = load_results(seq_dirs)

# Parallel Phase 1.5
par_dirs = ['./outputs/parallel_BCE_Parallel', './outputs/parallel_Asymmetric_Parallel',
            './outputs/parallel_Combined_07_Parallel', './outputs/parallel_Combined_05_Parallel',
            './outputs/parallel_Combined_03_Parallel']
par_results = load_results(par_dirs)

# Summary
all_results = {**seq_results, **par_results}
best_f1 = max([r['f1_macro_t2'] for r in all_results.values()])
print(f"\n🏆 BEST F1@0.2: {best_f1:.4f} ({'SUCCESS' if best_f1 > 0.50 else 'BELOW TARGET (42.18% baseline)'}")
if best_f1 > 0.50:
    print("✅ PHASE 2 READY: Add cell for sequential top 2 configs (3 epochs, 30k samples)")
    print("Top configs for Phase 2: Asymmetric + best Combined")
else:
    print("🔍 DIAGNOSE: Check loss curve, class-wise F1 for rare emotions")
    print("HF Best Practices: Threshold sweep 0.1-0.3, per-class weights validated")

# Phase 2 Preparation (add this cell if success)
if best_f1 > 0.50:
    top_configs = sorted(all_results.items(), key=lambda x: x[1]['f1_macro_t2'], reverse=True)[:2]
    print(f"\n🎯 PHASE 2: Train top 2 - {top_configs[0][0]} + {top_configs[1][0]}")
    print("Command template: --num_train_epochs 3 --max_train_samples 30000 + same fixes")
print("\n📁 All outputs: ./outputs/phase1_* & ./outputs/parallel_*/")

## MONITORING & VALIDATION

**Commands for live monitoring:**
- Sequential: !nvidia-smi (GPU 0 usage)
- Parallel: !tail -f gpu*.log & !nvidia-smi (both GPUs)
- Validation: Per-class weights effective on rare emotions (check eval_report.json per-class F1)
- HF Docs: Multi-label eval with threshold=0.2 optimal for imbalance (precision/recall balance)

In [None]:
# LIVE MONITORING UTILITIES
def monitor_processes():
    result = subprocess.run(['ps', 'aux'], capture_output=True, text=True)
    processes = [line for line in result.stdout.split('\n') if 'train_deberta_local' in line]
    if processes:
        print("🔄 Active processes:")
        for p in processes: print(f"  {p}")
    else:
        print("⏸️ No active training")
    print("\n🖥️ GPU status:")
    !nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used --format=csv

def tail_logs(pattern='gpu*.log'):
    import glob
    logs = glob.glob(pattern)
    for log in logs[-2:]:  # Last 2 logs
        print(f"\n📊 {log}:")
        !tail -5 {log}

monitor_processes()
tail_logs()

print("\n✅ Notebook ready! Run Phase 1 sequential, then Phase 1.5 parallel for dual-GPU.")
print("Target: >50% F1@0.2 vs baseline 42.18%")

## PROGRESS SUMMARY vs FIXED_RIGOROUS

**SIMPLIFIED_PARALLEL vs FIXED_RIGOROUS**:
- **Structure**: Simplified starts sequential (stability), adds parallel Phase 1.5 (speed); no early stopping/multi-phase complexity
- **Parameters**: 20k samples/2 epochs vs rigorous 30k/3 epochs; threshold=0.2 fixed for imbalance
- **Fixes**: BCE focus with pos_weight/oversampling/differentiable losses; same as rigorous but streamlined
- **Performance**: Target >50% F1@0.2 (vs baseline 42.18%); parallel 50% faster dual-GPU utilization
- **Cost/Time**: ~$4 total, 1-1.5 hours parallel vs 2-3 hours sequential

**Next**: If >50% achieved, Phase 2 cell for top 2 configs (3 epochs, 30k samples) already prepared in analysis.