# 🚀 FIXED BULLETPROOF TRAINING - UNIQUE DATASET

**Target: 75-85% F1 Score**  
**Current: 67% F1 Score**  
**Strategy: Use UNIQUE fallback dataset with NO DUPLICATES**

This notebook uses:
- Original 150 high-quality journal samples
- CMU-MOSEI samples for diversity
- **UNIQUE** fallback dataset (144 samples, no duplicates), used only when fewer than 100 samples load from the two sources above
- Hyperparameters tuned toward the 75-85% F1 target

In [None]:
# Install required packages (the eval_strategy argument used below needs transformers >= 4.41)
!pip install "transformers>=4.41" torch scikit-learn numpy pandas

In [None]:
# Import libraries
import json
import pandas as pd
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import f1_score, accuracy_score, classification_report
import warnings
warnings.filterwarnings('ignore')

print('🚀 FIXED BULLETPROOF TRAINING - UNIQUE DATASET')
print('=' * 60)

In [None]:
# BULLETPROOF: Auto-detect repository path and data files
import os
print('🔍 Auto-detecting repository structure...')

# Find the repository directory
possible_paths = [
    '/content/SAMO--DL',
    '/content/SAMO--DL/SAMO--DL',
    '/content/SAMO--DL-main',
    '/content/SAMO--DL-main/SAMO--DL',
    '/content/SAMO--DL-main/SAMO--DL-main'
]

repo_path = None
for path in possible_paths:
    # Require data/ so a bare parent folder is not mistaken for the repository
    if os.path.isdir(os.path.join(path, 'data')):
        repo_path = path
        print(f'✅ Found repository at: {repo_path}')
        break

if repo_path is None:
    print('❌ Could not find repository! Listing /content:')
    !ls -la /content/
    raise Exception('Repository not found!')

# Verify data directory exists
data_path = os.path.join(repo_path, 'data')
if not os.path.exists(data_path):
    print(f'❌ Data directory not found: {data_path}')
    raise Exception('Data directory not found!')

print(f'✅ Data directory found: {data_path}')
print('📂 Listing data files:')
!ls -la {data_path}/
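
The path detection above can be folded into one helper that accepts a candidate only if it also contains the expected subdirectory. The following is a minimal sketch, not the notebook's own code: `find_repo_root` and its `required_subdir` parameter are hypothetical names, and a temporary directory stands in for `/content` so the example runs anywhere.

```python
import os
import tempfile


def find_repo_root(candidates, required_subdir="data"):
    """Return the first candidate directory that contains `required_subdir`, else None."""
    for path in candidates:
        if os.path.isdir(os.path.join(path, required_subdir)):
            return path
    return None


# Demonstrate with a throwaway directory tree instead of /content.
with tempfile.TemporaryDirectory() as tmp:
    real_repo = os.path.join(tmp, "SAMO--DL-main", "SAMO--DL")
    os.makedirs(os.path.join(real_repo, "data"))
    candidates = [
        os.path.join(tmp, "SAMO--DL"),       # does not exist
        os.path.join(tmp, "SAMO--DL-main"),  # exists, but has no data/ subdirectory
        real_repo,                           # exists and contains data/
    ]
    found = find_repo_root(candidates)
    found_relative = os.path.relpath(found, tmp)
    print(found_relative)
```

Checking for `data/` inside the loop matters when an archive unpacks into a nested folder: the outer folder exists, but only the inner one holds the dataset files.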

In [None]:
# Load combined dataset with UNIQUE fallback
print('📊 Loading combined dataset...')
combined_samples = []

# Load journal data
journal_path = os.path.join(repo_path, 'data', 'journal_test_dataset.json')
try:
    with open(journal_path, 'r') as f:
        journal_data = json.load(f)
    for item in journal_data:
        # CRITICAL FIX: Use 'content' for journal data, 'text' for CMU-MOSEI
        if 'content' in item and 'emotion' in item:
            combined_samples.append({'text': item['content'], 'emotion': item['emotion']})
        elif 'text' in item and 'emotion' in item: # Fallback for other journal formats
            combined_samples.append({'text': item['text'], 'emotion': item['emotion']})
    print(f'✅ Loaded {len(journal_data)} journal samples from {journal_path}')
except FileNotFoundError:
    print(f'⚠️ Could not load journal data: {journal_path} not found.')

# Load CMU-MOSEI data
cmu_path = os.path.join(repo_path, 'data', 'cmu_mosei_balanced_dataset.json')
try:
    with open(cmu_path, 'r') as f:
        cmu_data = json.load(f)
    for item in cmu_data:
        if 'text' in item and 'emotion' in item:
            combined_samples.append({'text': item['text'], 'emotion': item['emotion']})
    print(f'✅ Loaded {len(cmu_data)} CMU-MOSEI samples from {cmu_path}')
except FileNotFoundError:
    print(f'⚠️ Could not load CMU-MOSEI data: {cmu_path} not found.')

print(f'📊 Total combined samples: {len(combined_samples)}')

# BULLETPROOF: Use UNIQUE fallback dataset if needed
if len(combined_samples) < 100:
    print(f'⚠️ Only {len(combined_samples)} samples loaded! Using UNIQUE fallback dataset...')
    
    # Load the unique fallback dataset
    fallback_path = os.path.join(repo_path, 'data', 'unique_fallback_dataset.json')
    try:
        with open(fallback_path, 'r') as f:
            fallback_data = json.load(f)
        combined_samples = fallback_data
        print(f'✅ Loaded {len(combined_samples)} UNIQUE fallback samples')
    except FileNotFoundError:
        print(f'❌ Could not load unique fallback dataset: {fallback_path}')
        print('❌ No data available for training!')
        raise Exception('No training data available!')

print(f'✅ Final dataset size: {len(combined_samples)} samples')

# Verify no duplicates
texts = [sample['text'] for sample in combined_samples]
unique_texts = set(texts)
print(f'🔍 Duplicate check: {len(texts)} total, {len(unique_texts)} unique')
if len(texts) != len(unique_texts):
    print('❌ WARNING: DUPLICATES FOUND! This will cause model collapse!')
else:
    print('✅ All samples are unique - no model collapse risk!')
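
The duplicate check above only reports a mismatch; it does not remove anything. If duplicates do appear, one option is to drop them before splitting. The sketch below assumes each sample is a dict with a `text` key, as in `combined_samples`; the helper name `deduplicate_samples` and the normalization rule (lowercase, collapsed whitespace) are illustrative choices, not part of the original pipeline.

```python
def deduplicate_samples(samples):
    """Keep the first occurrence of each text, ignoring case and extra whitespace."""
    seen = set()
    unique = []
    for sample in samples:
        key = " ".join(sample["text"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique


samples = [
    {"text": "I feel great today", "emotion": "happy"},
    {"text": "i feel  great today ", "emotion": "happy"},  # same text after normalization
    {"text": "This deadline worries me", "emotion": "anxious"},
]
unique_samples = deduplicate_samples(samples)
print(len(samples), "->", len(unique_samples))
```

Normalizing before comparison catches near-identical entries that an exact `set(texts)` comparison would count as distinct.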

In [None]:
# Prepare data for training
print('🔧 Preparing data for training...')

texts = [sample['text'] for sample in combined_samples]
emotions = [sample['emotion'] for sample in combined_samples]

# Encode labels
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(emotions)

print(f'🎯 Number of labels: {len(label_encoder.classes_)}')
print(f'📊 Labels: {list(label_encoder.classes_)}')

# Split data
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(f'📈 Training samples: {len(train_texts)}')
print(f'🧪 Test samples: {len(test_labels)}')

# Show emotion distribution
emotion_counts = {}
for emotion in emotions:
    emotion_counts[emotion] = emotion_counts.get(emotion, 0) + 1

print('\n📊 Emotion Distribution:')
for emotion, count in sorted(emotion_counts.items()):
    print(f'  {emotion}: {count} samples')
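
Because the split uses `stratify=labels`, scikit-learn raises a `ValueError` when any class has only one sample. A quick pre-check on the label counts can surface that before the split runs. This is a small sketch; `rare_classes` and `min_count` are hypothetical names, and the demo labels are made up.

```python
from collections import Counter


def rare_classes(labels, min_count=2):
    """Return the classes with fewer than `min_count` samples, sorted by name."""
    counts = Counter(labels)
    return sorted(label for label, n in counts.items() if n < min_count)


emotions_demo = ["happy", "happy", "sad", "sad", "sad", "proud"]
too_small = rare_classes(emotions_demo)
print(too_small)
```

In the notebook this could be called as `rare_classes(emotions)`; any class it returns would need more samples, merging with a neighbor, or removal before a stratified split can succeed.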

In [None]:
# Create custom dataset
class EmotionDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]
        
        encoding = self.tokenizer(
            text,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )
        
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }
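
To make the effect of `truncation=True` with `padding='max_length'` concrete, here is a simplified pure-Python sketch of what happens to one sequence of token IDs. It is an illustration only: the token IDs are arbitrary, `pad_or_truncate` is a hypothetical helper, and a real tokenizer also keeps the closing special token when it truncates, which this sketch does not.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly `max_length` long."""
    ids = list(token_ids[:max_length])
    mask = [1] * len(ids)          # 1 marks real tokens
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding


input_ids, attention_mask = pad_or_truncate([101, 1045, 2514, 102], max_length=6)
print(input_ids)
print(attention_mask)
```

Every item therefore has the same shape, which is why the default collation can stack them into a batch without a padding-aware collator.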

In [None]:
# Initialize model and tokenizer
print('🔧 Initializing model and tokenizer...')

model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(label_encoder.classes_),
    problem_type='single_label_classification'
)

print(f'✅ Model initialized with {len(label_encoder.classes_)} labels')

# Create datasets
train_dataset = EmotionDataset(train_texts, train_labels, tokenizer)
test_dataset = EmotionDataset(test_texts, test_labels, tokenizer)

print('✅ Datasets created successfully')

In [None]:
# Define metrics function
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    
    f1 = f1_score(labels, predictions, average='weighted')
    accuracy = accuracy_score(labels, predictions)
    
    return {'f1': f1, 'accuracy': accuracy}
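
The metric above uses `average='weighted'`, which weights each class's F1 by its number of true samples. On imbalanced data this can look noticeably better than the unweighted (macro) average. The sketch below recomputes both by hand on a made-up example to show the gap; the function names are illustrative and it is not a replacement for `f1_score`.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label that appears in y_true or y_pred."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_and_weighted_f1(y_true, y_pred):
    """Macro averages classes equally; weighted averages them by true-label count."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    macro = sum(scores.values()) / len(scores)
    weighted = sum(scores[c] * support[c] for c in scores) / len(y_true)
    return macro, weighted


y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]
macro, weighted = macro_and_weighted_f1(y_true, y_pred)
print(f"macro={macro:.4f} weighted={weighted:.4f}")
```

Here the majority class scores 8/9 and the minority class 2/3, so the weighted figure sits above the macro one. Reporting both is one way to see whether rare emotions are being learned.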

In [None]:
# Configure training arguments with OPTIMIZED hyperparameters
print('🚀 Starting FIXED BULLETPROOF training...')
print('🎯 Target F1 Score: 75-85%')
print('📊 Current Best: 67%')
print('📈 Expected Improvement: 8-18 percentage points')

training_args = TrainingArguments(
    output_dir='./emotion_model_fixed_bulletproof',
    num_train_epochs=3,  # Reduced to prevent overfitting
    per_device_train_batch_size=8,  # Smaller batch size
    per_device_eval_batch_size=8,
    warmup_steps=50,  # Reduced warmup
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,  # More frequent logging
    eval_strategy='steps',
    eval_steps=25,  # More frequent evaluation
    save_strategy='steps',
    save_steps=25,
    load_best_model_at_end=True,
    metric_for_best_model='f1',
    greater_is_better=True,
    dataloader_num_workers=2,
    remove_unused_columns=False,
    report_to='none',  # Disable wandb and other integrations (None does not disable them)
    learning_rate=2e-5,  # Standard learning rate
    gradient_accumulation_steps=2,  # Increased for stability
    fp16=torch.cuda.is_available(),  # Mixed precision only when a GPU is present
)

# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)

print(f'📊 Training on {len(train_texts)} samples')
print(f'🧪 Evaluating on {len(test_labels)} samples')

# Start training
trainer.train()
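
It is worth checking how `warmup_steps=50` and `eval_steps=25` compare with the total number of optimizer steps, especially in the 144-sample fallback case. The sketch below is a rough estimate under stated assumptions: a single device, about 115 training samples after the 80/20 split, and floor rounding for gradient accumulation (the exact rounding can differ between library versions). `estimate_optimizer_steps` is a hypothetical helper.

```python
import math


def estimate_optimizer_steps(num_train_samples, batch_size, grad_accum_steps, epochs):
    """Rough count of optimizer updates on one device."""
    batches_per_epoch = math.ceil(num_train_samples / batch_size)
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return updates_per_epoch * epochs


# Fallback case: 144 samples with a 20% hold-out leaves about 115 for training.
total_steps = estimate_optimizer_steps(115, batch_size=8, grad_accum_steps=2, epochs=3)
print(f"approx. total optimizer steps: {total_steps}")
print(f"warmup_steps=50 exceeds total: {50 > total_steps}")
print(f"eval_steps=25 exceeds total:  {25 > total_steps}")
```

If the estimate holds for the data actually loaded, warmup never finishes and no evaluation happens during training, so early stopping and `load_best_model_at_end` have nothing to compare. With larger combined datasets the step count grows and the configured values may be fine.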

In [None]:
# Evaluate final model
print('📊 Evaluating final model...')
results = trainer.evaluate()

print(f'🏆 Final F1 Score: {results["eval_f1"]:.4f} ({results["eval_f1"]*100:.2f}%)')
print(f'🎯 Target achieved: {"✅ YES!" if results["eval_f1"] >= 0.75 else "❌ Not yet"}')

# Save model
trainer.save_model('./emotion_model_fixed_bulletproof_final')
print('💾 Model saved to ./emotion_model_fixed_bulletproof_final')
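
`trainer.save_model` writes the model weights and config, but the fitted `LabelEncoder` lives only in this notebook session. Saving the class order next to the model makes it possible to decode predictions later. The sketch below is one possible approach using plain JSON; the helper names and the file name `label_mapping.json` are assumptions, and the class list is made up.

```python
import json
import os
import tempfile


def save_label_mapping(classes, path):
    """Write {index: class name} as JSON so predictions can be decoded later."""
    mapping = {str(i): str(name) for i, name in enumerate(classes)}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(mapping, f, indent=2)
    return mapping


def load_label_mapping(path):
    """Read the mapping back with integer keys."""
    with open(path, "r", encoding="utf-8") as f:
        return {int(i): name for i, name in json.load(f).items()}


with tempfile.TemporaryDirectory() as tmp:
    mapping_path = os.path.join(tmp, "label_mapping.json")  # hypothetical file name
    save_label_mapping(["anxious", "grateful", "happy"], mapping_path)
    restored = load_label_mapping(mapping_path)

print(restored)
```

In the notebook the first argument would be `label_encoder.classes_`, and the path would point inside the saved model directory.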

In [None]:
# Test on sample texts
print('🧪 Testing on sample texts...')

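# Named sample_texts so the held-out test_texts split is not overwritten
sample_texts = [
    "I'm feeling really happy today!",
    "I'm so frustrated with this project.",
    "I feel anxious about the presentation.",
    "I'm grateful for all the support.",
    "I'm feeling overwhelmed with tasks."
]

model.eval()
device = model.device  # the Trainer may have moved the model to the GPU
with torch.no_grad():
    for i, text in enumerate(sample_texts, 1):
        inputs = tokenizer(
            text,
            truncation=True,
            padding=True,
            return_tensors='pt'
        ).to(device)  # inputs must be on the same device as the model

        outputs = model(**inputs)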
        probabilities = torch.softmax(outputs.logits, dim=1)
        predicted_class = torch.argmax(probabilities, dim=1).item()
        confidence = probabilities[0][predicted_class].item()
        
        predicted_emotion = label_encoder.inverse_transform([predicted_class])[0]
        
        print(f'{i}. Text: {text}')
        print(f'   Predicted: {predicted_emotion} (confidence: {confidence:.3f})\n')
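
The loop above reports only the top class. When confidence is low, the runner-up is often informative too. The following sketch shows the same softmax-then-rank step in plain Python on made-up logits; `softmax` and `top_k` are illustrative helpers, not functions from the notebook or from PyTorch.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, class_names, k=2):
    """Return the k most probable (class name, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(class_names, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


ranking = top_k([2.0, 0.5, 1.5], ["happy", "sad", "excited"])
for name, prob in ranking:
    print(f"{name}: {prob:.3f}")
```

With the real model, the logits would come from `outputs.logits[0].tolist()` and the class names from `label_encoder.classes_`.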

## 🎉 Training Complete!

**Key Improvements:**
- ✅ **UNIQUE** fallback dataset (no duplicates)
- ✅ Proper data loading with field name handling
- ✅ Optimized hyperparameters
- ✅ Early stopping to prevent overfitting
- ✅ Mixed precision training for GPU efficiency

**Expected Results:**
- 🎯 **Target F1 Score: 75-85%**
- 📈 **Improvement from 67% baseline**
- 🔧 **No model collapse** (unique data prevents this)

**Next Steps:**
1. Review the F1 score achieved
2. If below 75%, consider adding more real data
3. Fine-tune hyperparameters if needed