# 🚀 **SAMo Multi-Dataset DeBERTa (CLEAN VERSION)**
## **Efficient Multi-Dataset Training with Proven BCE Configuration**

### **🎯 MISSION**
- **One-Command Multi-Dataset Training**: GoEmotions + SemEval + ISEAR + MELD
- **Proven BCE Configuration**: Use your 51.79% F1-macro winning setup
- **No Threshold Testing**: Save time with threshold=0.2
- **Achieve >60% F1-macro**: Through comprehensive dataset integration
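
A fixed threshold turns per-label sigmoid probabilities into a multi-label prediction without any threshold sweep. The sketch below is only an illustration of that idea: the helper names `sigmoid` and `predict_labels` and the example logits are assumptions, not taken from the project's training script.

```python
import math
from typing import List, Sequence

THRESHOLD = 0.2  # the fixed decision threshold named in the mission list


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_labels(logits: Sequence[float], threshold: float = THRESHOLD) -> List[int]:
    """Return a 0/1 vector with 1 wherever sigmoid(logit) >= threshold."""
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]


# Hypothetical logits for four labels (illustrative values only).
example_logits = [2.0, -1.0, -3.0, 0.0]
print(predict_labels(example_logits))
```

With a threshold as low as 0.2, a label with probability near 0.27 (logit -1.0) is still predicted, which favors recall on rare emotions at some cost in precision.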

### **📋 SIMPLE WORKFLOW**
1. **Run Cell 2**: Data preparation (10-15 minutes)
2. **Run Cell 4**: Training (about 6-8 hours for 5 epochs, per the training cell's own estimate)
3. **Monitor**: `tail -f logs/train_comprehensive_multidataset.log`
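
`tail -f` blocks until interrupted, so it is best run in a separate terminal. For a quick look from inside a notebook cell, a non-blocking alternative is to print only the last few lines of the log. The helper `tail_lines` below is a hypothetical convenience function, not part of the project.

```python
from collections import deque
from pathlib import Path
from typing import List

# Log path taken from the workflow above.
LOG_PATH = Path("logs/train_comprehensive_multidataset.log")


def tail_lines(path: Path, n: int = 20) -> List[str]:
    """Return the last n lines of a text file, or [] if the file does not exist yet."""
    if not path.exists():
        return []
    with path.open("r", encoding="utf-8", errors="replace") as handle:
        # A deque with maxlen keeps only the newest n lines while streaming the file.
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]


for line in tail_lines(LOG_PATH, n=10):
    print(line)
```

Rerunning the cell refreshes the view; nothing is printed if training has not created the log yet.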

### **📊 EXPECTED RESULTS**
- **Baseline**: 51.79% F1-macro (GoEmotions BCE Extended)
- **Target**: 60-65% F1-macro (All datasets combined)
- **Dataset**: 38,111 samples (GoEmotions + SemEval + ISEAR + MELD)
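
Both the baseline and the target are stated in F1-macro, which averages per-label F1 scores so that rare emotions count as much as frequent ones. The sketch below contrasts it with F1-micro on toy data. It is a plain-Python illustration under the assumption that a label with no true or predicted positives scores 0.0; the project's evaluation script may compute the metrics differently.

```python
from typing import List, Tuple

Matrix = List[List[int]]  # rows = samples, columns = labels, values 0/1


def label_counts(y_true: Matrix, y_pred: Matrix, label: int) -> Tuple[int, int, int]:
    """Count true positives, false positives and false negatives for one label."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        t, p = t_row[label], p_row[label]
        tp += int(t == 1 and p == 1)
        fp += int(t == 0 and p == 1)
        fn += int(t == 1 and p == 0)
    return tp, fp, fn


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_macro(y_true: Matrix, y_pred: Matrix) -> float:
    """Unweighted mean of per-label F1 scores."""
    n_labels = len(y_true[0])
    scores = [f1_from_counts(*label_counts(y_true, y_pred, k)) for k in range(n_labels)]
    return sum(scores) / n_labels


def f1_micro(y_true: Matrix, y_pred: Matrix) -> float:
    """F1 computed from counts pooled over all labels."""
    n_labels = len(y_true[0])
    totals = [0, 0, 0]
    for k in range(n_labels):
        for i, count in enumerate(label_counts(y_true, y_pred, k)):
            totals[i] += count
    return f1_from_counts(*totals)


# Toy data: label 0 is frequent and always right, label 1 is rare and missed.
y_true = [[1, 0], [1, 0], [1, 0], [0, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [0, 0]]
print(f"macro={f1_macro(y_true, y_pred):.3f} micro={f1_micro(y_true, y_pred):.3f}")
```

In this toy case the single missed rare label halves the macro score while the micro score stays high, which is why adding data for under-represented emotions is a plausible route to a higher F1-macro.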

**Start with Cell 2 below!** 🚀

## 📥 **STEP 1: Data Preparation**
### **🎯 ONE COMMAND - PREPARE ALL DATASETS**
Combines GoEmotions + SemEval + ISEAR + MELD into unified training format
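
The contents of `prepare_all_datasets.py` are not shown here, so the following is only a guess at what a "unified training format" could look like: every corpus is mapped onto the GoEmotions label indices and written as one JSON object per line. The record fields `text` and `labels`, the helpers `to_unified` and `write_jsonl`, and the choice to drop unmapped emotions are all assumptions made for illustration.

```python
import json
from typing import Dict, Iterable, List

# Standard GoEmotions label order (27 emotions plus neutral); list position = class index.
GOEMOTIONS_LABELS = [
    "admiration", "amusement", "anger", "annoyance", "approval", "caring",
    "confusion", "curiosity", "desire", "disappointment", "disapproval",
    "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
    "joy", "love", "nervousness", "optimism", "pride", "realization",
    "relief", "remorse", "sadness", "surprise", "neutral",
]


def to_unified(text: str, emotions: Iterable[str]) -> Dict[str, object]:
    """Map emotion names to GoEmotions indices (hypothetical record layout).

    Names outside the GoEmotions label set are dropped in this sketch;
    a real pipeline would more likely map them to the closest label.
    """
    indices = sorted({GOEMOTIONS_LABELS.index(e) for e in emotions if e in GOEMOTIONS_LABELS})
    return {"text": text, "labels": indices}


def write_jsonl(records: List[Dict[str, object]], path: str) -> None:
    """Write one JSON object per line, the usual JSONL convention."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


# Hypothetical rows from a single-label corpus such as the SemEval emotion files.
rows = [("They cancelled the show again.", ["anger"]), ("What a wonderful day!", ["joy"])]
unified = [to_unified(text, emotions) for text, emotions in rows]
print(unified)
```

Keeping a single label space means the classifier head stays at 28 outputs regardless of which corpus a sample came from.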

In [None]:
# MULTI-DATASET PREPARATION
# Combines GoEmotions + SemEval + ISEAR + MELD datasets

import os
import subprocess
import sys

print("🚀 MULTI-DATASET PREPARATION")
print("=" * 50)
print("📊 Datasets: GoEmotions + SemEval + ISEAR + MELD")
print("⏱️ Time: ~10-15 minutes")
print("=" * 50)

# Change to project directory
os.chdir('/home/user/goemotions-deberta')
print(f"📁 Working directory: {os.getcwd()}")

# Run data preparation with the notebook's own interpreter;
# the script's output streams directly into this cell
print("\n🔄 Preparing datasets...")
result = subprocess.run([sys.executable, './notebooks/prepare_all_datasets.py'])

# Verify success: the script must exit cleanly and both splits must exist
train_path = 'data/combined_all_datasets/train.jsonl'
val_path = 'data/combined_all_datasets/val.jsonl'

if result.returncode == 0 and os.path.exists(train_path) and os.path.exists(val_path):
    # Count lines inside context managers so the file handles are closed
    with open(train_path, encoding='utf-8') as f:
        train_count = sum(1 for _ in f)
    with open(val_path, encoding='utf-8') as f:
        val_count = sum(1 for _ in f)
    print(f"\n✅ SUCCESS: {train_count + val_count} samples prepared")
    print(f"   Training: {train_count} samples")
    print(f"   Validation: {val_count} samples")
    print("\n🚀 Ready for training! Run Cell 4 next.")
else:
    print("\n❌ FAILED: Dataset preparation unsuccessful")
    print("💡 Check logs and try again")

🚀 MULTI-DATASET PREPARATION
📊 Datasets: GoEmotions + SemEval + ISEAR + MELD
⏱️ Time: ~10-15 minutes
📁 Working directory: /home/user/goemotions-deberta

🔄 Preparing datasets...
🚀 COMPREHENSIVE MULTI-DATASET PREPARATION
📊 Datasets: GoEmotions + SemEval + ISEAR + MELD
⚙️ Configuration: Proven BCE setup (threshold=0.2)
⏱️ Time: ~10-15 minutes
📁 Working directory: /home/user/goemotions-deberta
📖 Loading GoEmotions dataset...
✅ Loaded 43410 GoEmotions train samples
✅ Loaded 5426 GoEmotions val samples
📥 Processing local SemEval-2018 EI-reg dataset...
✅ Found local SemEval zip file
✅ Copied local SemEval zip to data directory
📦 Extracting SemEval-2018 zip file...
✅ Extracted SemEval-2018 data
📖 Processing anger data...
📖 Processing fear data...
📖 Processing joy data...
📖 Processing sadness data...
✅ Processed 802 SemEval samples
📥 Downloading ISEAR dataset...
📥 Loading ISEAR from Hugging Face...
✅ Processed 7516 ISEAR s

## ⚡ **STEP 2: Training**
### **🎯 START MULTI-DATASET TRAINING**
Trains DeBERTa on combined dataset with proven configuration
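
The "proven configuration" is a binary cross-entropy (BCE) loss applied independently to each emotion label. As a rough sketch of what that loss computes for one sample, the function below evaluates mean BCE directly from raw logits in a numerically stable form. This is the same quantity that PyTorch's `BCEWithLogitsLoss` returns with its default mean reduction; whether the training script uses that class is an assumption, and `bce_with_logits` is a name chosen only for this example.

```python
import math
from typing import Sequence


def bce_with_logits(logits: Sequence[float], targets: Sequence[float]) -> float:
    """Mean binary cross-entropy over the labels of one sample.

    Uses the stable identity
        -[y*log(s(z)) + (1-y)*log(1-s(z))] = max(z, 0) - z*y + log(1 + exp(-|z|))
    where s is the sigmoid, so large logits never overflow.
    """
    if len(logits) != len(targets) or not logits:
        raise ValueError("logits and targets must be non-empty and the same length")
    total = 0.0
    for z, y in zip(logits, targets):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# Hypothetical sample with three labels; only the first one is present.
loss = bce_with_logits([1.5, -2.0, 0.3], [1.0, 0.0, 0.0])
print(f"loss = {loss:.4f}")
```

Because every label contributes its own term, a sample can carry several emotions at once, which is what makes BCE a natural fit for multi-label data like GoEmotions.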

In [None]:
# MULTI-DATASET TRAINING
# Trains DeBERTa on the combined dataset with BCE loss

import os
import subprocess

print("🚀 MULTI-DATASET TRAINING")
print("=" * 50)
print("🤖 Model: DeBERTa-v3-large")
print("📊 Data: 38K+ samples (GoEmotions + SemEval + ISEAR + MELD)")
print("🎯 Loss: BCE (your proven 51.79% winner)")
print("⏱️ Time: ~6-8 hours (5 epochs on larger dataset)")
print("=" * 50)

# Change to project directory
os.chdir('/home/user/goemotions-deberta')

# Verify prerequisites and report each missing item with its own fix
missing = []

if not os.path.exists('data/combined_all_datasets/train.jsonl'):
    missing.append("Dataset not found - run Cell 2 first")

if not os.path.exists('scripts/train_comprehensive_multidataset.sh'):
    missing.append("Training script not found - check scripts/ in the project directory")

if missing:
    for problem in missing:
        print(f"❌ {problem}")
    # exit() asks the Jupyter kernel to shut down; raising stops only this cell
    raise RuntimeError("Prerequisites missing; fix the problems above and rerun this cell")

# Make script executable
os.chmod('scripts/train_comprehensive_multidataset.sh', 0o755)
print("✅ Training script ready")

# Start training
print("\n🚀 STARTING TRAINING...")
print("📊 Monitor progress: tail -f logs/train_comprehensive_multidataset.log")
print("📊 Results will be in: checkpoints_comprehensive_multidataset/eval_report.json")
print("☁️ Google Drive backup: Automatic (every 15 minutes during training)")
print("\n⚠️ This will take 6-8 hours. Training runs with VISIBLE progress!")
print("⚠️ DO NOT close this notebook - you'll see live progress bars!")
print("-" * 70)

# Run training; the script's output streams directly into this cell
training_result = subprocess.run(['bash', 'scripts/train_comprehensive_multidataset.sh'])

# Check results
import json

report_path = 'checkpoints_comprehensive_multidataset/eval_report.json'

print("\n" + "=" * 50)
if os.path.exists(report_path):
    print("✅ TRAINING COMPLETED SUCCESSFULLY!")
    print(f"📊 Results available locally: {report_path}")
    print("☁️ Google Drive backup: Completed automatically during training")

    # Try to show F1 scores
    try:
        with open(report_path, 'r') as f:
            results = json.load(f)
        f1_macro = results.get('f1_macro', 'N/A')
        f1_micro = results.get('f1_micro', 'N/A')
        print(f"\n📈 PERFORMANCE:")
        print(f"   F1 Macro: {f1_macro}")
        print(f"   F1 Micro: {f1_micro}")
        # Compare against the targets only when the metric is numeric
        if isinstance(f1_macro, (int, float)):
            if f1_macro > 0.6:
                print("\n🎉 SUCCESS: Achieved >60% F1-macro target!")
            elif f1_macro > 0.55:
                print("\n👍 GOOD: Achieved >55% F1-macro!")
    except (OSError, json.JSONDecodeError):
        # Unreadable or malformed report: point the user at the file instead
        print("📊 Check eval_report.json for detailed results")

else:
    print("⚠️ TRAINING MAY HAVE FAILED OR IS STILL RUNNING")
    print("📊 Check logs: tail -f logs/train_comprehensive_multidataset.log")
    print(f"📊 Check for results: {report_path}")

print("\n🎯 Target: >60% F1-macro")
print("🏆 Baseline: 51.79% F1-macro (GoEmotions only)")
print("☁️ Backup: Automatic Google Drive (timestamped folders)")

🚀 MULTI-DATASET TRAINING
🤖 Model: DeBERTa-v3-large
📊 Data: 38K+ samples (GoEmotions + SemEval + ISEAR + MELD)
🎯 Loss: BCE (your proven 51.79% winner)
⏱️ Time: ~6-8 hours (5 epochs on larger dataset)
✅ Training script ready

🚀 STARTING TRAINING...
📊 Monitor progress: tail -f logs/train_comprehensive_multidataset.log
📊 Results will be in: checkpoints_comprehensive_multidataset/eval_report.json
☁️ Google Drive backup: Automatic (every 15 minutes during training)

⚠️ This will take 6-8 hours. Training runs with VISIBLE progress!
⚠️ DO NOT close this notebook - you'll see live progress bars!
----------------------------------------------------------------------
🚀 COMPREHENSIVE MULTI-DATASET TRAINING
🎯 TARGET: >60% F1-macro with all datasets combined
📊 Datasets: GoEmotions + SemEval + ISEAR + MELD
⚡ Configuration: BCE Extended (your proven 51.79% winner)


## 📊 **STEP 3: Results Analysis (Optional)**
### **🎯 COMPARE WITH BASELINE**
Analyze performance improvement from multi-dataset training

In [None]:
# RESULTS ANALYSIS
# Compare multi-dataset performance with baseline

import json
from pathlib import Path

print("📊 MULTI-DATASET RESULTS ANALYSIS")
print("=" * 50)

# Baseline from original GoEmotions training
baseline = {
    'f1_macro': 0.5179,
    'f1_micro': 0.5975,
    'model': 'BCE Extended (GoEmotions only)'
}

print("🏆 BASELINE PERFORMANCE:")
print(f"   F1 Macro: {baseline['f1_macro']:.4f} ({baseline['f1_macro']*100:.1f}%)")
print(f"   F1 Micro: {baseline['f1_micro']:.4f} ({baseline['f1_micro']*100:.1f}%)")
print(f"   Model: {baseline['model']}")

# Load current results
eval_file = Path("checkpoints_comprehensive_multidataset/eval_report.json")

if eval_file.exists():
    with open(eval_file, 'r') as f:
        results = json.load(f)
    
    current_f1_macro = results.get('f1_macro', 0)
    current_f1_micro = results.get('f1_micro', 0)
    
    print("\n🎯 MULTI-DATASET RESULTS:")
    print(f"   F1 Macro: {current_f1_macro:.4f} ({current_f1_macro*100:.1f}%)")
    print(f"   F1 Micro: {current_f1_micro:.4f} ({current_f1_micro*100:.1f}%)")

    # Improvement relative to the baseline, as a percentage of the baseline F1-macro
    improvement = ((current_f1_macro - baseline['f1_macro']) / baseline['f1_macro']) * 100

    print(f"\n📈 RELATIVE IMPROVEMENT: {improvement:+.1f}%")

    # Success assessment
    if current_f1_macro >= 0.60:
        print("🚀 EXCELLENT: Achieved the 60% F1-macro target!")
        print("🎉 Multi-dataset training SUCCESSFUL!")
    elif current_f1_macro >= 0.55:
        print("✅ GOOD: Achieved at least 55% F1-macro!")
        print("📈 Significant improvement from multi-dataset approach")
    elif current_f1_macro > baseline['f1_macro']:
        print("👍 IMPROVEMENT: Better than baseline")
        print("🔧 May need more training epochs or parameter tuning")
    else:
        print("⚠️ NO IMPROVEMENT: Check data quality or training setup")

    print(f"\n📊 TARGET ACHIEVEMENT:")
    print(f"   >=60% F1-macro: {'✅' if current_f1_macro >= 0.60 else '❌'} (Target: 60%+)")
    print(f"   >=55% F1-macro: {'✅' if current_f1_macro >= 0.55 else '❌'} (Target: 55%+)")
    print(f"   Beat baseline: {'✅' if current_f1_macro > baseline['f1_macro'] else '❌'}")

else:
    print("\n⏳ RESULTS NOT AVAILABLE")
    print("🔧 Training may still be in progress, or the file path may be wrong")
    print("📁 Expected: checkpoints_comprehensive_multidataset/eval_report.json")

print("\n🔍 MONITORING COMMANDS:")
print("   Training logs: tail -f logs/train_comprehensive_multidataset.log")
print("   GPU status: watch -n 5 'nvidia-smi'")
print("   Process status: ps aux | grep train_deberta")
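
The analysis cell computes improvement as a percentage of the baseline score, which is easy to confuse with a gain in percentage points. The short sketch below reports both for the stated baseline (0.5179) and target (0.60); the function name `improvement_summary` is hypothetical.

```python
from typing import Tuple


def improvement_summary(baseline: float, current: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive to compute a relative gain")
    absolute_points = (current - baseline) * 100
    relative_percent = (current - baseline) / baseline * 100
    return absolute_points, relative_percent


# Baseline and target F1-macro values quoted in this notebook.
points, percent = improvement_summary(0.5179, 0.60)
print(f"{points:+.2f} points absolute, {percent:+.2f}% relative")
```

Reaching the 60% target from the 51.79% baseline is a gain of about 8.2 percentage points, or roughly 15.9% relative to the baseline.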