# 🚀 **SAMo Multi-Dataset DeBERTa (CLEAN VERSION)**
## **Efficient Multi-Dataset Training with Proven BCE Configuration**

### **🎯 MISSION**
- **One-Command Multi-Dataset Training**: GoEmotions + SemEval + ISEAR + MELD
- **Proven BCE Configuration**: Use your 51.79% F1-macro winning setup
- **No Threshold Testing**: Save time with threshold=0.2
- **Achieve >60% F1-macro**: Through comprehensive dataset integration
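
To make the fixed `threshold=0.2` concrete, here is a minimal sketch of how a multi-label classifier trained with BCE typically turns logits into label decisions. The function names and the example logits are hypothetical and are not taken from the project's training script; the only value from this notebook is the 0.2 threshold.

```python
import math

def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_labels(logits, threshold=0.2):
    """Return a 0/1 decision per label: 1 where sigmoid(logit) >= threshold."""
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]

# Hypothetical logits for four emotion labels (illustrative values only).
example_logits = [2.0, -1.0, -3.0, 0.0]

print(predict_labels(example_logits))        # decisions at threshold 0.2
print(predict_labels(example_logits, 0.5))   # decisions at the usual 0.5
```

A threshold below 0.5 accepts lower-probability labels, which usually trades precision for recall on rare emotions.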

### **📋 SIMPLE WORKFLOW**
1. **Run Cell 2**: Data preparation (10-15 minutes)
2. **Run Cell 4**: Training (6-8 hours)
3. **Monitor**: `tail -f logs/train_comprehensive_multidataset.log`

### **📊 EXPECTED RESULTS**
- **Baseline**: 51.79% F1-macro (GoEmotions BCE Extended)
- **Target**: 60-65% F1-macro (All datasets combined)
- **Dataset**: 38,111 samples (GoEmotions + SemEval + ISEAR + MELD)
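
Since both the baseline and the target are stated in F1-macro, the sketch below shows how macro and micro F1 differ for multi-label predictions. It is an illustration only: the helper names and the toy labels are assumptions, an undefined per-label F1 is treated as 0, and the project's own evaluation code may compute these metrics differently.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def f1_macro_micro(y_true, y_pred):
    """Return (macro F1, micro F1) for lists of 0/1 label vectors."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    # Macro: average the per-label F1 scores, so every label weighs the same.
    macro = sum(f1_from_counts(*c) for c in counts) / n_labels
    # Micro: pool the counts over all labels, so frequent labels dominate.
    micro = f1_from_counts(
        sum(c[0] for c in counts), sum(c[1] for c in counts), sum(c[2] for c in counts)
    )
    return macro, micro

# Toy data: label 0 is frequent and always right, label 1 is rare and missed.
y_true = [[1, 0], [1, 0], [1, 0], [0, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [0, 0]]
macro, micro = f1_macro_micro(y_true, y_pred)
print(f"macro={macro:.3f} micro={micro:.3f}")
```

In the toy data the missed rare label pulls macro F1 well below micro F1, which is why F1-macro is the harder number to move on imbalanced emotion labels.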

**Start with Cell 2 below!** 🚀

## 📥 **STEP 1: Data Preparation**
### **🎯 ONE COMMAND - PREPARE ALL DATASETS**
Combines GoEmotions + SemEval + ISEAR + MELD into unified training format

In [None]:
# MULTI-DATASET PREPARATION
# Combines GoEmotions + SemEval + ISEAR + MELD datasets

import os
import subprocess

print("🚀 MULTI-DATASET PREPARATION")
print("=" * 50)
print("📊 Datasets: GoEmotions + SemEval + ISEAR + MELD")
print("⏱️ Time: ~10-15 minutes")
print("=" * 50)

# Change to project directory
os.chdir('/home/user/goemotions-deberta')
print(f"📁 Working directory: {os.getcwd()}")

# Run data preparation (output streams directly to the notebook)
print("\n🔄 Preparing datasets...")
result = subprocess.run(['python', './notebooks/prepare_all_datasets.py'])

# Verify success: the script must exit cleanly and both splits must exist
train_path = 'data/combined_all_datasets/train.jsonl'
val_path = 'data/combined_all_datasets/val.jsonl'

def count_lines(path):
    """Count lines in a JSONL file, closing the file afterwards."""
    with open(path, encoding='utf-8') as f:
        return sum(1 for _ in f)

if result.returncode == 0 and os.path.exists(train_path) and os.path.exists(val_path):
    train_count = count_lines(train_path)
    val_count = count_lines(val_path)
    print(f"\n✅ SUCCESS: {train_count + val_count} samples prepared")
    print(f"   Training: {train_count} samples")
    print(f"   Validation: {val_count} samples")
    print("\n🚀 Ready for training! Run Cell 4 next.")
else:
    print(f"\n❌ FAILED: Dataset preparation unsuccessful (exit code {result.returncode})")
    print("💡 Check logs and try again")

🚀 MULTI-DATASET PREPARATION
📊 Datasets: GoEmotions + SemEval + ISEAR + MELD
⏱️ Time: ~10-15 minutes
📁 Working directory: /home/user/goemotions-deberta

🔄 Preparing datasets...
🚀 COMPREHENSIVE MULTI-DATASET PREPARATION
📊 Datasets: GoEmotions + SemEval + ISEAR + MELD
⚙️ Configuration: Proven BCE setup (threshold=0.2)
⏱️ Time: ~10-15 minutes
📁 Working directory: /home/user/goemotions-deberta
📖 Loading GoEmotions dataset...
✅ Loaded 43410 GoEmotions train samples
✅ Loaded 5426 GoEmotions val samples
📥 Processing local SemEval-2018 EI-reg dataset...
✅ Found local SemEval zip file
✅ Copied local SemEval zip to data directory
📦 Extracting SemEval-2018 zip file...
✅ Extracted SemEval-2018 data
📖 Processing anger data...
📖 Processing fear data...
📖 Processing joy data...
📖 Processing sadness data...
✅ Processed 802 SemEval samples
📥 Downloading ISEAR dataset...
📥 Loading ISEAR from Hugging Face...
✅ Processed 7516 ISEAR samples
📥 Processing local MELD dataset (TEXT ONLY)...
✅ Found local MELD d
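
The preparation script itself is not shown in this notebook. As a rough sketch of what "unified training format" could mean, the example below writes records from several sources into one JSONL file. The `text`/`labels`/`source` schema, the function name, and the toy records are all assumptions for illustration; the real `prepare_all_datasets.py` may use a different layout and label mapping.

```python
import json
import tempfile
from pathlib import Path

def write_combined_jsonl(sources, out_path):
    """Write records from several datasets to one JSONL file.

    `sources` maps a dataset name to a list of {"text": str, "labels": [int]}
    records (hypothetical schema). Returns the number of rows per source.
    """
    counts = {}
    with open(out_path, "w", encoding="utf-8") as out:
        for name, records in sources.items():
            for rec in records:
                row = {
                    "text": rec["text"],
                    "labels": sorted(rec["labels"]),
                    "source": name,  # keep provenance for later analysis
                }
                out.write(json.dumps(row, ensure_ascii=False) + "\n")
                counts[name] = counts.get(name, 0) + 1
    return counts

# Toy records with made-up label ids (not the real label mapping).
sources = {
    "goemotions": [{"text": "Thanks so much!", "labels": [0]}],
    "isear": [
        {"text": "I felt afraid walking home.", "labels": [1]},
        {"text": "I was proud of my exam result.", "labels": [2]},
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "train.jsonl"
    counts = write_combined_jsonl(sources, out_file)
    with open(out_file, encoding="utf-8") as f:
        n_lines = sum(1 for _ in f)

print(counts, n_lines)
```

Keeping a `source` field on every row makes it possible to report per-dataset metrics later without re-running preparation.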

## ⚡ **STEP 2: Training**
### **🎯 START MULTI-DATASET TRAINING**
Trains DeBERTa on combined dataset with proven configuration

In [None]:
# MULTI-DATASET TRAINING
# Trains DeBERTa on combined dataset with BCE loss

import os
import subprocess
from pathlib import Path

print("🚀 MULTI-DATASET TRAINING")
print("=" * 50)
print("🤖 Model: DeBERTa-v3-large")
print("📊 Data: 38K+ samples (GoEmotions + SemEval + ISEAR + MELD)")
print("🎯 Loss: BCE (your proven 51.79% winner)")
print("⏱️ Time: ~6-8 hours (5 epochs on larger dataset)")
print("=" * 50)

# Change to project directory
os.chdir('/home/user/goemotions-deberta')

# Verify prerequisites
checks_passed = True

if not os.path.exists('data/combined_all_datasets/train.jsonl'):
    print("❌ Dataset not found - run Cell 2 first")
    checks_passed = False

if not os.path.exists('scripts/train_comprehensive_multidataset.sh'):
    print("❌ Training script not found")
    checks_passed = False

if not checks_passed:
    # Stop this cell without shutting down the notebook kernel
    raise SystemExit("💡 Fix the issues above (run Cell 2 first if the dataset is missing)")

# Make script executable
os.chmod('scripts/train_comprehensive_multidataset.sh', 0o755)
print("✅ Training script ready")

# Start training
print("\n🚀 STARTING TRAINING...")
print("📊 Monitor progress: tail -f logs/train_comprehensive_multidataset.log")
print("📊 Results will be in: checkpoints_comprehensive_multidataset/eval_report.json")
print("☁️ Google Drive backup: Automatic (every 15 minutes during training)")
print("\n⚠️ This will take 6-8 hours. Training runs with VISIBLE progress!")
print("⚠️ DO NOT close this notebook - you'll see live progress bars!")
print("-" * 70)

# Run training (output streams directly to the notebook)
training_result = subprocess.run(['bash', 'scripts/train_comprehensive_multidataset.sh'])

# Check results: require a clean exit so a stale report is not mistaken for success
print("\n" + "=" * 50)
if training_result.returncode == 0 and os.path.exists('checkpoints_comprehensive_multidataset/eval_report.json'):
    print("✅ TRAINING COMPLETED SUCCESSFULLY!")
    print("📊 Results available locally: checkpoints_comprehensive_multidataset/eval_report.json")
    print("☁️ Google Drive backup: Completed automatically during training")
    
    # Try to show F1 scores
    try:
        import json
        with open('checkpoints_comprehensive_multidataset/eval_report.json', 'r') as f:
            results = json.load(f)
        f1_macro = results.get('f1_macro')
        f1_micro = results.get('f1_micro')
        print(f"\n📈 PERFORMANCE:")
        print(f"   F1 Macro: {f1_macro if f1_macro is not None else 'N/A'}")
        print(f"   F1 Micro: {f1_micro if f1_micro is not None else 'N/A'}")
        # Only compare against targets when the score is actually numeric
        if isinstance(f1_macro, (int, float)):
            if f1_macro > 0.6:
                print("\n🎉 SUCCESS: Achieved >60% F1-macro target!")
            elif f1_macro > 0.55:
                print("\n👍 GOOD: Achieved >55% F1-macro!")
    except (OSError, ValueError, AttributeError):
        # Unreadable file, invalid JSON, or an unexpected report structure
        print("📊 Check eval_report.json for detailed results")
        
else:
    print("⚠️ TRAINING MAY HAVE FAILED OR IS STILL RUNNING")
    print("📊 Check logs: tail -f logs/train_comprehensive_multidataset.log")
    print("📊 Check for results: checkpoints_comprehensive_multidataset/eval_report.json")

print("\n🎯 Target: >60% F1-macro")
print("🏆 Baseline: 51.79% F1-macro (GoEmotions only)")
print("☁️ Backup: Automatic Google Drive (timestamped folders)")

🚀 MULTI-DATASET TRAINING
🤖 Model: DeBERTa-v3-large
📊 Data: 38K+ samples (GoEmotions + SemEval + ISEAR + MELD)
🎯 Loss: BCE (your proven 51.79% winner)
⏱️ Time: ~6-8 hours (5 epochs on larger dataset)
✅ Training script ready

🚀 STARTING TRAINING...
📊 Monitor progress: tail -f logs/train_comprehensive_multidataset.log
📊 Results will be in: checkpoints_comprehensive_multidataset/eval_report.json
☁️ Google Drive backup: Automatic (every 15 minutes during training)

⚠️ This will take 6-8 hours. Training runs with VISIBLE progress!
⚠️ DO NOT close this notebook - you'll see live progress bars!
----------------------------------------------------------------------
🚀 COMPREHENSIVE MULTI-DATASET TRAINING
🎯 TARGET: >60% F1-macro with all datasets combined
📊 Datasets: GoEmotions + SemEval + ISEAR + MELD
⚡ Configuration: BCE Extended (your proven 51.79% winner)
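
For reference, the sketch below shows what a plain multi-label BCE loss computes from raw logits. It is a generic illustration written in pure Python, not the project's training code; the function name and inputs are hypothetical, and whatever "Extended" adds in this configuration is not covered here.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits.

    Each label is scored independently, so one example can carry several
    positive labels at once (multi-label classification).
    """
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form of -[y*log(s) + (1-y)*log(1-s)], s = sigmoid(z)
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

# Hypothetical logits for three labels; the first two labels are positive.
targets = [1, 1, 0]
good_loss = bce_with_logits([3.0, 2.0, -3.0], targets)   # confident and correct
bad_loss = bce_with_logits([-3.0, -2.0, 3.0], targets)   # confident and wrong
print(f"good={good_loss:.4f} bad={bad_loss:.4f}")
```

Confident correct logits give a loss near zero, while confident wrong logits are penalized roughly linearly in the logit magnitude.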


## 📊 **STEP 3: Results Analysis (Optional)**
### **🎯 COMPARE WITH BASELINE**
Analyze performance improvement from multi-dataset training

In [None]:
# RESULTS ANALYSIS
# Compare multi-dataset performance with baseline

import json
from pathlib import Path

print("📊 MULTI-DATASET RESULTS ANALYSIS")
print("=" * 50)

# Baseline from original GoEmotions training
baseline = {
    'f1_macro': 0.5179,
    'f1_micro': 0.5975,
    'model': 'BCE Extended (GoEmotions only)'
}

print("🏆 BASELINE PERFORMANCE:")
print(f"   F1 Macro: {baseline['f1_macro']:.4f} ({baseline['f1_macro']*100:.1f}%)")
print(f"   F1 Micro: {baseline['f1_micro']:.4f} ({baseline['f1_micro']*100:.1f}%)")
print(f"   Model: {baseline['model']}")

# Load current results
eval_file = Path("checkpoints_comprehensive_multidataset/eval_report.json")

if eval_file.exists():
    with open(eval_file, 'r') as f:
        results = json.load(f)
    
    current_f1_macro = results.get('f1_macro', 0)
    current_f1_micro = results.get('f1_micro', 0)
    
    print("\n🎯 MULTI-DATASET RESULTS:")
    print(f"   F1 Macro: {current_f1_macro:.4f} ({current_f1_macro*100:.1f}%)")
    print(f"   F1 Micro: {current_f1_micro:.4f} ({current_f1_micro*100:.1f}%)")
    
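    # Calculate improvement: absolute (percentage points) and relative to baseline
    abs_gain = (current_f1_macro - baseline['f1_macro']) * 100
    improvement = ((current_f1_macro - baseline['f1_macro']) / baseline['f1_macro']) * 100
    
    print(f"\n📈 IMPROVEMENT: {abs_gain:+.1f} points F1-macro ({improvement:+.1f}% relative to baseline)")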
    
    # Success assessment
    if current_f1_macro >= 0.60:
        print("🚀 EXCELLENT: Achieved >60% F1-macro target!")
        print("🎉 Multi-dataset training SUCCESSFUL!")
    elif current_f1_macro >= 0.55:
        print("✅ GOOD: Achieved >55% F1-macro!")
        print("📈 Significant improvement from multi-dataset approach")
    elif current_f1_macro > baseline['f1_macro']:
        print("👍 IMPROVEMENT: Better than baseline")
        print("🔧 May need more training epochs or parameter tuning")
    else:
        print("⚠️ NO IMPROVEMENT: Check data quality or training setup")
        
    print(f"\n📊 TARGET ACHIEVEMENT:")
    print(f"   >60% F1-macro: {'✅' if current_f1_macro >= 0.60 else '❌'} (Target: 60%+)")
    print(f"   >55% F1-macro: {'✅' if current_f1_macro >= 0.55 else '❌'} (Target: 55%+)")
    print(f"   Beat baseline: {'✅' if current_f1_macro > baseline['f1_macro'] else '❌'}")
    
else:
    print("\n⏳ RESULTS NOT AVAILABLE")
    print("🔧 Training may still be in progress, or the file path may be wrong")
    print("📁 Expected: checkpoints_comprehensive_multidataset/eval_report.json")

print("\n🔍 MONITORING COMMANDS:")
print("   Training logs: tail -f logs/train_comprehensive_multidataset.log")
print("   GPU status: watch -n 5 'nvidia-smi'")
print("   Process status: ps aux | grep train_deberta")