# 🚀 **SAMo Multi-Dataset DeBERTa (CLEAN VERSION)**
## **Efficient Multi-Dataset Training with Proven BCE Configuration**

### **🎯 MISSION**
- **One-Command Multi-Dataset Training**: GoEmotions + SemEval + ISEAR + MELD
- **Proven BCE Configuration**: Use your 51.79% F1-macro winning setup
- **No Threshold Testing**: Save time with threshold=0.2
- **Achieve >60% F1-macro**: Through comprehensive dataset integration

### **📋 SIMPLE WORKFLOW**
1. **Run Cell 2**: Data preparation (10-15 minutes)
2. **Run Cell 4**: Training (6-8 hours)
3. **Monitor**: `tail -f logs/train_comprehensive_multidataset.log`

### **📊 EXPECTED RESULTS**
- **Baseline**: 51.79% F1-macro (GoEmotions BCE Extended)
- **Target**: 60-65% F1-macro (All datasets combined)
- **Dataset**: 38,111 samples (GoEmotions + SemEval + ISEAR + MELD)
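
The targets above are stated in F1-macro at a fixed decision threshold of 0.2. As a rough illustration of what that metric measures, here is a minimal pure-Python sketch. The toy labels and probabilities are invented for the example, and the helper names (`binarize`, `f1_scores`) are hypothetical, not taken from the training script.

```python
# Minimal sketch: macro vs. micro F1 for multi-label predictions.
# All data below is made up for illustration only.

def binarize(probs, threshold=0.2):
    """Turn per-label probabilities into 0/1 predictions."""
    return [[1 if p >= threshold else 0 for p in row] for row in probs]

def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) across all label columns."""
    n_labels = len(y_true[0])
    per_label = []
    totals = [0, 0, 0]  # running tp, fp, fn for the micro average
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        per_label.append(f1_from_counts(tp, fp, fn))
        totals = [a + b for a, b in zip(totals, (tp, fp, fn))]
    macro = sum(per_label) / n_labels      # every label weighs the same
    micro = f1_from_counts(*totals)        # every decision weighs the same
    return macro, micro

y_true = [[1, 0, 0], [0, 1, 0], [1, 0, 1], [0, 0, 0]]
probs = [[0.9, 0.1, 0.05], [0.3, 0.6, 0.1], [0.7, 0.1, 0.15], [0.1, 0.25, 0.0]]

macro, micro = f1_scores(y_true, binarize(probs, threshold=0.2))
print(f"F1 macro: {macro:.4f}, F1 micro: {micro:.4f}")
```

Because macro averaging gives rare labels the same weight as frequent ones, a single label that is never predicted (the third column here) pulls F1-macro well below F1-micro.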

**Start with Cell 2 below!** 🚀

## 📥 **STEP 1: Data Preparation**
### **🎯 ONE COMMAND - PREPARE ALL DATASETS**
Combines GoEmotions + SemEval + ISEAR + MELD into unified training format
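
The internals of `prepare_all_datasets.py` are not shown in this notebook, so the following is only a sketch of what "unified training format" could look like. The record fields (`text`, `labels`, `source`), the four-label list, and the helper names are all assumptions made for illustration.

```python
import json

# Hypothetical shared label space; the real script's label list is not shown here.
LABELS = ["anger", "fear", "joy", "sadness"]
LABEL_TO_ID = {name: i for i, name in enumerate(LABELS)}

def to_unified(text, label_names, source):
    """Map one example onto the shared label space; unknown labels are dropped."""
    ids = sorted(LABEL_TO_ID[n] for n in label_names if n in LABEL_TO_ID)
    return {"text": text, "labels": ids, "source": source}

def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

records = [
    to_unified("I can't believe they did that!", ["anger"], "semeval"),
    to_unified("What a wonderful day", ["joy", "surprise"], "isear"),
]
print(to_jsonl(records))
```

Keeping a `source` field on every record is one way to make it possible to trace a later performance change back to a specific dataset.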

In [None]:
# MULTI-DATASET PREPARATION
# Combines GoEmotions + SemEval + ISEAR + MELD datasets

import os
import subprocess

print("🚀 MULTI-DATASET PREPARATION")
print("=" * 50)
print("📊 Datasets: GoEmotions + SemEval + ISEAR + MELD")
print("⏱️ Time: ~10-15 minutes")
print("=" * 50)

# Change to project directory
os.chdir('/home/user/goemotions-deberta')
print(f"📁 Working directory: {os.getcwd()}")

# Run data preparation (output streams directly into the notebook)
print("\n🔄 Preparing datasets...")
result = subprocess.run(['python', './notebooks/prepare_all_datasets.py'])

def count_lines(path):
    """Count records in a JSONL file, closing the file handle when done."""
    with open(path, encoding='utf-8') as f:
        return sum(1 for _ in f)

# Verify success: the script must exit cleanly and both splits must exist
train_path = 'data/combined_all_datasets/train.jsonl'
val_path = 'data/combined_all_datasets/val.jsonl'
if result.returncode == 0 and os.path.exists(train_path) and os.path.exists(val_path):
    train_count = count_lines(train_path)
    val_count = count_lines(val_path)
    print(f"\n✅ SUCCESS: {train_count + val_count} samples prepared")
    print(f"   Training: {train_count} samples")
    print(f"   Validation: {val_count} samples")
    print("\n🚀 Ready for training! Run Cell 4 next.")
else:
    print(f"\n❌ FAILED: Dataset preparation unsuccessful (exit code {result.returncode})")
    print("💡 Check logs and try again")

🚀 MULTI-DATASET PREPARATION
📊 Datasets: GoEmotions + SemEval + ISEAR + MELD
⏱️ Time: ~10-15 minutes
📁 Working directory: /home/user/goemotions-deberta

🔄 Preparing datasets...
🚀 COMPREHENSIVE MULTI-DATASET PREPARATION
📊 Datasets: GoEmotions + SemEval + ISEAR + MELD
⚙️ Configuration: Proven BCE setup (threshold=0.2)
⏱️ Time: ~10-15 minutes
📁 Working directory: /home/user/goemotions-deberta
📖 Loading GoEmotions dataset...
✅ Loaded 43410 GoEmotions train samples
✅ Loaded 5426 GoEmotions val samples
📥 Processing local SemEval-2018 EI-reg dataset...
✅ Found local SemEval zip file
✅ Copied local SemEval zip to data directory
📦 Extracting SemEval-2018 zip file...
✅ Extracted SemEval-2018 data
📖 Processing anger data...
📖 Processing fear data...
📖 Processing joy data...
📖 Processing sadness data...
✅ Processed 802 SemEval samples
📥 Downloading ISEAR dataset...
📥 Loading ISEAR from Hugging Face...
✅ Processed 7516 ISEAR samples
📥 Processing local MELD dataset (TEXT ONLY)...
✅ Found local MELD d

## ⚡ **STEP 2: Training**
### **🎯 START MULTI-DATASET TRAINING**
Trains DeBERTa on combined dataset with proven configuration
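
For readers unfamiliar with the loss behind the "proven configuration", the sketch below shows binary cross-entropy computed per label on raw logits, in a numerically stable form. It is a pure-Python illustration with hypothetical function names, not the code the training script runs.

```python
import math

def bce_with_logits(logit, target):
    """Stable per-label BCE: max(x, 0) - x*y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def multilabel_bce(logits, targets):
    """Average the per-label losses for one example (each label is an independent yes/no)."""
    return sum(bce_with_logits(x, y) for x, y in zip(logits, targets)) / len(logits)

# One example with three emotion labels; values are made up.
example_logits = [2.0, -1.0, 0.0]
example_targets = [1, 0, 1]
print(f"BCE loss: {multilabel_bce(example_logits, example_targets):.4f}")
```

The rearranged formula avoids computing `exp` of a large positive number, so very confident logits do not overflow.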

In [None]:
# MULTI-DATASET TRAINING
# Trains DeBERTa on combined dataset with BCE loss (the proven configuration)

import os
import subprocess
from pathlib import Path
import threading
import time
from datetime import datetime

print("🚀 MULTI-DATASET TRAINING")
print("=" * 50)
print("🤖 Model: DeBERTa-v3-large")
print("📊 Data: 38K+ samples (GoEmotions + SemEval + ISEAR + MELD)")
print("🎯 Loss: BCE (your proven 51.79% winner)")
print("⏱️ Time: ~6-8 hours (5 epochs on larger dataset)")
print("=" * 50)

# Change to project directory
os.chdir('/home/user/goemotions-deberta')

# Set optimized cloud storage environment variables
print("\n⚙️ CONFIGURING OPTIMIZED CLOUD STORAGE...")
os.environ['GDRIVE_BACKUP_PATH'] = 'drive:backup/goemotions-training'
os.environ['IMMEDIATE_CLEANUP'] = 'true'  # Clean up local files after cloud upload
os.environ['MAX_LOCAL_CHECKPOINTS'] = '1'  # Keep only 1 checkpoint locally
print("✅ Cloud storage optimized for minimal local storage")
print("   📤 Backup every 2 minutes to Google Drive")
print("   🗑️ Automatic cleanup after each backup")
print("   💾 Keep only 1 checkpoint locally (saves ~15-25GB)")

# Verify prerequisites
checks_passed = True

if not os.path.exists('data/combined_all_datasets/train.jsonl'):
    print("❌ Dataset not found - run Cell 2 first")
    checks_passed = False

if not os.path.exists('scripts/train_comprehensive_multidataset.sh'):
    print("❌ Training script not found")
    checks_passed = False

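if not checks_passed:
    print("\n💡 Fix the issues above (run Cell 2 first if the dataset is missing)")
    # raise SystemExit stops this cell without shutting down the notebook kernel
    raise SystemExit("Prerequisites not met")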

# Make script executable
os.chmod('scripts/train_comprehensive_multidataset.sh', 0o755)
print("✅ Training script ready")

# Start training
print("\n🚀 STARTING TRAINING...")
print("📊 Live progress display enabled!")
print("📊 Results will be in: checkpoints_comprehensive_multidataset/eval_report.json")
print("☁️ Google Drive backup: Automatic (every 2 minutes during training)")
print("🗑️ Local cleanup: Automatic after each backup")
print("\n⚠️ This will take 6-8 hours. You'll see LIVE progress below!")
print("⚠️ DO NOT close this notebook - training output streams in real-time!")
print("-" * 70)

def display_live_training():
    """Monitor training log and display live progress"""
    log_file = 'logs/train_comprehensive_multidataset.log'
    
    # Wait for log file to be created
    wait_count = 0
    while not os.path.exists(log_file) and wait_count < 30:
        time.sleep(2)
        wait_count += 1
    
    if not os.path.exists(log_file):
        return
    
    # Follow log file and display progress
    try:
        with open(log_file, 'r') as f:
            # Go to end of file
            f.seek(0, 2)
            
            while True:
                line = f.readline()
                if line:
                    # Display important lines
                    if any(keyword in line for keyword in [
                        'Starting training', 'Epoch', 'Step', 'Loss:', 'F1',
                        'RESULTS:', 'SUCCESS:', 'FAILED:', 'EXCELLENT:', 'GOOD:',
                        'Using GPU', 'DUAL GPU', 'DataParallel'
                    ]):
                        timestamp = datetime.now().strftime("%H:%M:%S")
                        print(f"📊 [{timestamp}] {line.strip()}")
                else:
                    time.sleep(1)
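    except OSError as e:
        # Monitoring is best-effort, but report the problem instead of hiding it
        print(f"⚠️ Log monitor stopped: {e}")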

# Start live monitoring in background
monitor_thread = threading.Thread(target=display_live_training, daemon=True)
monitor_thread.start()

# Run training with live output
print("🚀 Starting training with live progress...")
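training_process = subprocess.Popen(
    ['bash', 'scripts/train_comprehensive_multidataset.sh'],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,  # merge stderr into the same stream
    text=True,                 # text mode (universal_newlines is an older alias)
    bufsize=1                  # line-buffered so output appears as it is produced
)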

# Display real-time output
for line in training_process.stdout:
    if line.strip():  # Only print non-empty lines
        print(line.strip())

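# Wait for completion and capture the script's exit code
training_result = training_process.wait()

# Check results: require a clean exit so a stale report from an earlier run
# is not mistaken for success
print("\n" + "=" * 50)
if training_result == 0 and os.path.exists('checkpoints_comprehensive_multidataset/eval_report.json'):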
    print("✅ TRAINING COMPLETED SUCCESSFULLY!")
    print("📊 Results available locally: checkpoints_comprehensive_multidataset/eval_report.json")
    print("☁️ Google Drive backup: Completed automatically during training")
    
    # Try to show F1 scores
    try:
        import json
        with open('checkpoints_comprehensive_multidataset/eval_report.json', 'r') as f:
            results = json.load(f)
        f1_macro = results.get('f1_macro', 'N/A')
        f1_micro = results.get('f1_micro', 'N/A')
        print(f"\n📈 PERFORMANCE:")
        print(f"   F1 Macro: {f1_macro}")
        print(f"   F1 Micro: {f1_micro}")
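        if isinstance(f1_macro, (int, float)) and f1_macro > 0.6:
            print("\n🎉 SUCCESS: Achieved >60% F1-macro target!")
        elif isinstance(f1_macro, (int, float)) and f1_macro > 0.55:
            print("\n👍 GOOD: Achieved >55% F1-macro!")
    except (OSError, ValueError) as e:
        # json.JSONDecodeError is a subclass of ValueError
        print(f"📊 Could not read eval_report.json ({e}); check the file for detailed results")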
        
else:
    print("⚠️ TRAINING MAY HAVE FAILED OR IS STILL RUNNING")
    print("📊 Check logs: tail -f logs/train_comprehensive_multidataset.log")
    print("📊 Check for results: checkpoints_comprehensive_multidataset/eval_report.json")

print("\n🎯 Target: >60% F1-macro")
print("🏆 Baseline: 51.79% F1-macro (GoEmotions only)")
print("☁️ Backup: Automatic Google Drive (every 2 minutes)")

In [None]:
# 🔬 SCIENTIFIC LOSS FUNCTION TESTING
# Fix the performance regression with systematic testing

import os
import subprocess
import json
import time
from datetime import datetime

print("🔬 SCIENTIFIC LOSS FUNCTION COMPARISON")
print("=" * 50)
print("🎯 FIXING PERFORMANCE REGRESSION: 51.79% → 39.43% F1-macro")
print("🧪 Testing: BCE, Asymmetric, Combined Loss systematically")
print("⏱️ Time: ~45-60 minutes for all tests")
print("=" * 50)

# Change to project directory
os.chdir('/home/user/goemotions-deberta')

# Test configurations with expected performance
test_configs = [
    {
        'name': 'BCE_Pure',
        'description': 'Pure BCE Loss (baseline reproduction)',
        'args': ['--threshold', '0.2'],
        'expected_f1': 0.52,
    },
    {
        'name': 'Asymmetric_Tuned', 
        'description': 'Asymmetric Loss with optimal parameters',
        'args': ['--use_asymmetric_loss', '--threshold', '0.2'],
        'expected_f1': 0.55,
    },
    {
        'name': 'Combined_Conservative',
        'description': 'Combined Loss (conservative ratio)',
        'args': ['--use_combined_loss', '--loss_combination_ratio', '0.3', '--threshold', '0.2'],
        'expected_f1': 0.50,
    }
]

def run_loss_test(config):
    """Run systematic test for one loss function"""
    output_dir = f'./outputs/loss_test_{config["name"]}'
    os.makedirs(output_dir, exist_ok=True)
    
    print(f"\n🚀 Testing {config['name']}: {config['description']}")
    
    # Optimized base command for quick testing
    base_cmd = [
        'python3', 'notebooks/scripts/train_deberta_local.py',
        '--output_dir', output_dir,
        '--model_type', 'deberta-v3-large',
        '--per_device_train_batch_size', '4',
        '--per_device_eval_batch_size', '8', 
        '--gradient_accumulation_steps', '2',
        '--num_train_epochs', '2',
        '--learning_rate', '3e-5',
        '--lr_scheduler_type', 'cosine',
        '--warmup_ratio', '0.1',
        '--weight_decay', '0.01',
        '--fp16',
        '--max_length', '256',
        '--max_train_samples', '15000',  # Subset for quick testing
        '--max_eval_samples', '3000',
        '--augment_prob', '0.0',
        '--early_stopping_patience', '3'
    ]
    
    # Add loss-specific arguments
    cmd = base_cmd + config['args']
    
    # Set environment for single GPU
    env = os.environ.copy()
    env['CUDA_VISIBLE_DEVICES'] = '0'
    
    print(f"   Loss args: {' '.join(config['args'])}")  # Show the loss-specific args
    
    start_time = time.time()
    try:
        result = subprocess.run(cmd, env=env, timeout=1800)  # 30 min timeout
        elapsed_time = time.time() - start_time
        
        if result.returncode == 0:
            print(f"   ✅ Completed in {elapsed_time:.1f}s")
            return extract_results(output_dir, config)
        else:
            print(f"   ❌ Failed with return code: {result.returncode}")
            return None
            
    except subprocess.TimeoutExpired:
        print(f"   ⏰ Timed out after 30 minutes")
        return None
    except Exception as e:
        print(f"   💥 Error: {str(e)}")
        return None

def extract_results(output_dir, config):
    """Extract F1 scores from training results"""
    eval_file = f'{output_dir}/eval_report.json'
    
    if not os.path.exists(eval_file):
        print(f"   ⚠️ No eval_report.json found")
        return None
    
    try:
        with open(eval_file, 'r') as f:
            data = json.load(f)
        
        f1_macro = data.get('f1_macro', 0.0)
        f1_micro = data.get('f1_micro', 0.0)
        
        baseline_f1 = 0.5179
        improvement = ((f1_macro - baseline_f1) / baseline_f1) * 100
        success = f1_macro > 0.50
        
        print(f"   📊 F1 Macro: {f1_macro:.4f} ({improvement:+.1f}% vs baseline)")
        print(f"   🎯 Status: {'✅ SUCCESS' if success else '⚠️ NEEDS WORK'}")
        
        return {
            'name': config['name'],
            'f1_macro': f1_macro,
            'f1_micro': f1_micro,
            'improvement_pct': improvement,
            'success': success,
            'expected_f1': config['expected_f1']
        }
        
    except Exception as e:
        print(f"   ❌ Error reading results: {str(e)}")
        return None

# Run all tests systematically
print("\n🧪 STARTING SYSTEMATIC TESTS...")
all_results = []

for i, config in enumerate(test_configs, 1):
    print(f"\n📋 Test {i}/{len(test_configs)}")
    result = run_loss_test(config)
    if result:
        all_results.append(result)

# Analyze results
print("\n" + "=" * 50)
print("🧪 SCIENTIFIC ANALYSIS COMPLETE")
print("=" * 50)

if all_results:
    # Sort by F1 score
    sorted_results = sorted(all_results, key=lambda x: x['f1_macro'], reverse=True)
    
    print(f"📊 RESULTS SUMMARY:")
    for i, result in enumerate(sorted_results, 1):
        # Check the stricter condition first; otherwise the baseline branch is unreachable
        status = "🎉" if result['f1_macro'] > 0.5179 else "📈" if result['success'] else "📉"
        print(f"   {i}. {result['name']}: {result['f1_macro']:.4f} ({result['improvement_pct']:+.1f}%) {status}")
    
    # Best configuration
    best = sorted_results[0]
    print(f"\n🏆 WINNER: {best['name']}")
    print(f"   F1 Score: {best['f1_macro']:.4f}")
    print(f"   Improvement: {best['improvement_pct']:+.1f}% over baseline")
    
    if best['success']:
        print(f"\n🚀 RECOMMENDATION: Use {best['name']} for full multi-dataset training")
        print(f"🔧 Update training script to use optimal loss function")
    else:
        print(f"\n🔍 DEBUGGING NEEDED: Best result still below 50% target")
        print(f"🔧 Consider data quality issues or hyperparameter tuning")
    
    # Save results for reference
    with open('loss_comparison_results.json', 'w') as f:
        json.dump({
            'timestamp': datetime.now().isoformat(),
            'results': all_results,
            'winner': best['name']
        }, f, indent=2)
    
    print(f"\n📄 Results saved to: loss_comparison_results.json")
    
else:
    print("❌ ALL TESTS FAILED!")
    print("🔧 Check training script and environment setup")

print(f"\n🎯 NEXT STEPS:")
print(f"   1. Update training script with winning loss function")
print(f"   2. Run full multi-dataset training with optimal configuration")
print(f"   3. Target: >60% F1-macro with all datasets combined")
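
The comparison above switches between BCE and an asymmetric loss through the `--use_asymmetric_loss` flag, but the script's implementation is not shown in this notebook. As a hedged reference point, the sketch below follows the commonly published form of Asymmetric Loss (separate focusing exponents for positives and negatives, plus a probability margin for negatives). The function name and the default values `gamma_pos=0.0`, `gamma_neg=4.0`, `margin=0.05` are assumptions, not the script's settings.

```python
import math

def asymmetric_loss(prob, target, gamma_pos=0.0, gamma_neg=4.0, margin=0.05, eps=1e-8):
    """Per-label asymmetric loss on a sigmoid probability (illustrative sketch)."""
    if target == 1:
        # Positive label: focal-style down-weighting controlled by gamma_pos
        return -((1.0 - prob) ** gamma_pos) * math.log(max(prob, eps))
    # Negative label: shift the probability so easy negatives contribute exactly zero
    shifted = max(prob - margin, 0.0)
    return -(shifted ** gamma_neg) * math.log(max(1.0 - shifted, eps))

# Made-up probabilities for negative labels of increasing difficulty
for p in (0.03, 0.30, 0.90):
    print(f"negative label, p={p:.2f}: loss={asymmetric_loss(p, 0):.6f}")
```

With `gamma_pos=0`, `gamma_neg=0`, and `margin=0` this reduces to plain BCE on probabilities, which makes it a convenient sanity check when comparing the two configurations.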

## 📊 **STEP 3: Results Analysis (Optional)**
### **🎯 COMPARE WITH BASELINE**
Analyze performance improvement from multi-dataset training
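
The improvement figure printed by the analysis cell is a relative change against the baseline, not a difference in percentage points. The short sketch below, using the 51.79% baseline, the 60% target, and the 39.43% regression figure quoted in this notebook, shows how far apart the two readings are; the helper names are hypothetical.

```python
def relative_improvement(current, baseline):
    """Change relative to the baseline, in percent of the baseline value."""
    return (current - baseline) / baseline * 100.0

def point_gain(current, baseline):
    """Absolute change, in percentage points of F1."""
    return (current - baseline) * 100.0

baseline_f1 = 0.5179
for label, score in [("60% target", 0.60), ("regressed run", 0.3943)]:
    rel = relative_improvement(score, baseline_f1)
    pts = point_gain(score, baseline_f1)
    print(f"{label}: {rel:+.1f}% relative, {pts:+.2f} points")
```

Reporting both numbers side by side avoids reading a relative gain as if it were a gain in F1 points.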

In [None]:
# RESULTS ANALYSIS
# Compare multi-dataset performance with baseline

import json
from pathlib import Path

print("📊 MULTI-DATASET RESULTS ANALYSIS")
print("=" * 50)

# Baseline from original GoEmotions training
baseline = {
    'f1_macro': 0.5179,
    'f1_micro': 0.5975,
    'model': 'BCE Extended (GoEmotions only)'
}

print("🏆 BASELINE PERFORMANCE:")
print(f"   F1 Macro: {baseline['f1_macro']:.4f} ({baseline['f1_macro']*100:.1f}%)")
print(f"   F1 Micro: {baseline['f1_micro']:.4f} ({baseline['f1_micro']*100:.1f}%)")
print(f"   Model: {baseline['model']}")

# Load current results
eval_file = Path("checkpoints_comprehensive_multidataset/eval_report.json")

if eval_file.exists():
    with open(eval_file, 'r') as f:
        results = json.load(f)
    
    current_f1_macro = results.get('f1_macro', 0)
    current_f1_micro = results.get('f1_micro', 0)
    
    print("\n🎯 MULTI-DATASET RESULTS:")
    print(f"   F1 Macro: {current_f1_macro:.4f} ({current_f1_macro*100:.1f}%)")
    print(f"   F1 Micro: {current_f1_micro:.4f} ({current_f1_micro*100:.1f}%)")
    
    # Calculate improvement
    improvement = ((current_f1_macro - baseline['f1_macro']) / baseline['f1_macro']) * 100
    
    print(f"\n📈 IMPROVEMENT: {improvement:+.1f}%")
    
    # Success assessment
    if current_f1_macro >= 0.60:
        print("🚀 EXCELLENT: Achieved >60% F1-macro target!")
        print("🎉 Multi-dataset training SUCCESSFUL!")
    elif current_f1_macro >= 0.55:
        print("✅ GOOD: Achieved >55% F1-macro!")
        print("📈 Significant improvement from multi-dataset approach")
    elif current_f1_macro > baseline['f1_macro']:
        print("👍 IMPROVEMENT: Better than baseline")
        print("🔧 May need more training epochs or parameter tuning")
    else:
        print("⚠️ NO IMPROVEMENT: Check data quality or training setup")
        
    print(f"\n📊 TARGET ACHIEVEMENT:")
    print(f"   >60% F1-macro: {'✅' if current_f1_macro >= 0.60 else '❌'} (Target: 60%+)")
    print(f"   >55% F1-macro: {'✅' if current_f1_macro >= 0.55 else '❌'} (Target: 55%+)")
    print(f"   Beat baseline: {'✅' if current_f1_macro > baseline['f1_macro'] else '❌'}")
    
else:
    print("\n⏳ RESULTS NOT AVAILABLE")
    print("🔧 Training may still be in progress or check file path")
    print("📁 Expected: checkpoints_comprehensive_multidataset/eval_report.json")

print("\n🔍 MONITORING COMMANDS:")
print("   Training logs: tail -f logs/train_comprehensive_multidataset.log")
print("   GPU status: watch -n 5 'nvidia-smi'")
print("   Process status: ps aux | grep train_deberta")

In [None]:
# POST-TRAINING CLEANUP
# Remove local artifacts after ensuring cloud backup

import os
import subprocess
import shutil
from pathlib import Path

print("🧹 POST-TRAINING CLEANUP")
print("=" * 50)
print("🎯 Goal: Free up disk space by removing local artifacts")
print("☁️ Prerequisite: Ensure cloud backup is complete")
print("=" * 50)

# Change to project directory
os.chdir('/home/user/goemotions-deberta')

# Check current disk usage
def get_disk_usage():
    disk_usage = shutil.disk_usage(".")
    free_gb = disk_usage.free / (1024 ** 3)
    used_gb = disk_usage.used / (1024 ** 3)
    total_gb = disk_usage.total / (1024 ** 3)
    used_percent = (disk_usage.used / disk_usage.total) * 100
    return free_gb, used_gb, total_gb, used_percent

# Before cleanup
free_before, used_before, total, used_percent_before = get_disk_usage()
print(f"💾 BEFORE CLEANUP:")
print(f"   Used: {used_before:.1f}GB ({used_percent_before:.1f}%)")
print(f"   Free: {free_before:.1f}GB")
print(f"   Total: {total:.1f}GB")

# Verify cloud backup exists
print(f"\n☁️ VERIFYING CLOUD BACKUP...")
try:
    result = subprocess.run(['rclone', 'lsf', 'drive:backup/goemotions-training/'],
                            capture_output=True, text=True)
    backup_listing = result.stdout.strip() if result.returncode == 0 else ''
except FileNotFoundError:
    backup_listing = ''  # rclone is not installed or not on PATH

if backup_listing:
    print("✅ Cloud backup found!")
    backup_folders = backup_listing.split('\n')
    print(f"   Found {len(backup_folders)} backup folder(s)")
    for folder in backup_folders[-3:]:  # Show the last 3 entries listed
        print(f"   📁 {folder}")
else:
    print("⚠️ No cloud backup found or rclone error!")
    print("💡 Run training first to create backup before cleanup")
    print("🛑 STOPPING CLEANUP - Backup required for safety")
    # raise SystemExit stops this cell without shutting down the notebook kernel
    raise SystemExit("Cleanup aborted: no cloud backup found")

# Safe cleanup of training artifacts
print(f"\n🗑️ CLEANING LOCAL ARTIFACTS...")

cleanup_targets = [
    'checkpoints_comprehensive_multidataset/',
    'logs/',
    'models/',
    'outputs/',
    '__pycache__/',
    '.cache/'
]

total_freed = 0

for target in cleanup_targets:
    if os.path.exists(target):
        # Get size before deletion
        if os.path.isdir(target):
            size_mb = sum(os.path.getsize(os.path.join(dirpath, filename))
                         for dirpath, dirnames, filenames in os.walk(target)
                         for filename in filenames) / (1024 * 1024)
        else:
            size_mb = os.path.getsize(target) / (1024 * 1024)
        
        try:
            if os.path.isdir(target):
                # Keep only essential files in checkpoints
                if target == 'checkpoints_comprehensive_multidataset/':
                    # Keep config and eval_report, remove model weights
                    keep_files = ['config.json', 'eval_report.json', 'tokenizer.json', 'tokenizer_config.json']
                    if os.path.exists(target):
                        for item in os.listdir(target):
                            item_path = os.path.join(target, item)
                            if os.path.isdir(item_path) and 'checkpoint-' in item:
                                shutil.rmtree(item_path)
                                print(f"🗑️ Removed checkpoint: {item}")
                            elif os.path.isfile(item_path) and item not in keep_files and item.endswith(('.bin', '.safetensors')):
                                os.remove(item_path)
                                print(f"🗑️ Removed model file: {item}")
                else:
                    shutil.rmtree(target)
                    print(f"🗑️ Removed directory: {target} ({size_mb:.1f}MB)")
            else:
                os.remove(target)
                print(f"🗑️ Removed file: {target} ({size_mb:.1f}MB)")
            
            total_freed += size_mb
            
        except Exception as e:
            print(f"⚠️ Could not remove {target}: {str(e)}")

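# Clean up temporary and cache files in the project root
import glob

temp_patterns = ['*.tmp', 'tmp_*', '.DS_Store', 'Thumbs.db']
for pattern in temp_patterns:
    for file in glob.glob(pattern):
        try:
            os.remove(file)
            print(f"🗑️ Removed temp file: {file}")
        except OSError as e:
            # e.g. the match is a directory or permissions are missing
            print(f"⚠️ Could not remove {file}: {e}")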

# After cleanup
free_after, used_after, total, used_percent_after = get_disk_usage()
space_freed = used_before - used_after

print(f"\n✅ CLEANUP COMPLETE!")
print(f"💾 AFTER CLEANUP:")
print(f"   Used: {used_after:.1f}GB ({used_percent_after:.1f}%)")
print(f"   Free: {free_after:.1f}GB")
print(f"   Space freed: {space_freed:.1f}GB")

if space_freed > 1:
    print(f"🎉 Successfully freed {space_freed:.1f}GB of disk space!")
    print(f"📈 Disk usage reduced from {used_percent_before:.1f}% to {used_percent_after:.1f}%")
else:
    print(f"ℹ️ Minimal space freed ({space_freed:.1f}GB) - artifacts may have been cleaned during training")

print(f"\n☁️ IMPORTANT: A cloud backup was found on Google Drive; confirm it contains this run before relying on it")
print(f"🔄 To restore: Use rclone to download from drive:backup/goemotions-training/")
print(f"📁 Local config files and eval reports retained for quick access")

## 🧹 **STEP 4: Post-Training Cleanup (Optional)**
### **🗑️ FREE UP DISK SPACE**
Remove local artifacts after confirming cloud backup
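
Before deleting anything, it can help to see how much space each target would actually free. The sketch below is a dry-run helper that only measures directories and deletes nothing. The function names are hypothetical, and the demo builds a throwaway temporary directory rather than touching the project folders.

```python
import os
import tempfile

def dir_size_bytes(path):
    """Sum the sizes of all regular files under path (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            file_path = os.path.join(dirpath, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total

def cleanup_plan(targets):
    """Return [(target, size_mb)] for existing directories, largest first. Deletes nothing."""
    plan = [(t, dir_size_bytes(t) / (1024 * 1024)) for t in targets if os.path.isdir(t)]
    return sorted(plan, key=lambda item: item[1], reverse=True)

# Demo on a temporary directory so no real artifacts are involved
with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "outputs")
    os.makedirs(os.path.join(target, "run1"))
    with open(os.path.join(target, "run1", "weights.bin"), "wb") as f:
        f.write(b"\0" * (2 * 1024 * 1024))  # 2 MB placeholder file
    demo_plan = cleanup_plan([target, os.path.join(root, "missing")])
    for path, size_mb in demo_plan:
        print(f"would remove {os.path.basename(path)}/ ({size_mb:.1f}MB)")
```

Running a plan like this against the real target list first gives a chance to spot an unexpectedly large or unexpectedly missing directory before the destructive step.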