### **Student Information**
Name: 余雅韻

Student ID: 114577002

GitHub ID: yyyynwa

Kaggle name: Yyy (Yyyun2001)

Kaggle private scoreboard snapshot: 

![pic_ranking.png](./pics/pic_ranking.png)

***

# **Project Report**

## 1. Model Development (10 pts Required)

### 1.1 Preprocessing Steps

| **Classical Model:**                       | **RoBERTa Model:**                                  |
|--------------------------------------------|-----------------------------------------------------|
| Aggressive cleaning for feature extraction | Minimal preprocessing to preserve emotional signals |
| Lowercase conversion                       | Only removed URLs and normalized whitespace         |
| Contraction expansion (n't → not)          | Kept capitalization (CAPS indicate emotion)         |
| URL and @mention removal                   | Preserved punctuation (!!!, ??? show intensity)     |
| Hashtag symbol removal (keep words)        | Kept emojis (strong emotion indicators)             |
| Whitespace normalization                   |                                                     |

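To make the contrast in the table concrete, the sketch below runs both cleaning styles on one made-up tweet. The helper names `clean_classical` and `clean_minimal` and the sample text are illustrative assumptions; the regular expressions mirror the ones used in the Competition Code section.

```python
import re

def clean_classical(text):
    """Aggressive cleaning, as described for the classical pipeline."""
    text = str(text).lower()
    text = re.sub(r"n't\b", " not", text)        # expand negation contractions
    text = re.sub(r"http\S+|www\S+", "", text)   # drop URLs
    text = re.sub(r"@\w+", "", text)             # drop @mentions
    text = re.sub(r"#(\w+)", r"\1", text)        # keep the hashtag word, drop '#'
    return re.sub(r"\s+", " ", text).strip()     # normalize whitespace

def clean_minimal(text):
    """Light cleaning, as described for RoBERTa: only URLs and whitespace."""
    text = re.sub(r"http\S+|www\S+", "", str(text))
    return re.sub(r"\s+", " ", text).strip()

# Made-up example text, not taken from the competition data
sample = "I DON'T like this!!! @user check https://t.co/abc #SoGross"
print(clean_classical(sample))  # i do not like this!!! check sogross
print(clean_minimal(sample))    # I DON'T like this!!! @user check #SoGross
```

One side effect worth knowing: the simple `n't` rule only rewrites the suffix, so "can't" becomes "ca not" and "won't" becomes "wo not".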

### 1.2 Feature Engineering Steps

**Classical Model (Manual Features):**

1.  TF-IDF Features

* Word-level unigrams/bigrams
* Character n-grams (2-4 characters)
* Captures misspellings and writing style


2. PMI-Based Emotion Lexicons

* Learned emotion-specific vocabularies from training data
* Top 300 words per emotion with highest PMI scores
* Captures domain-specific emotion expressions


3. Disgust-Specific Features 

* Custom lexicon of 40+ disgust-related words
* Count features: total, binary indicator, intensity (clipped)
* Aimed at improving disgust detection (the hardest class)


4. SVD Dimensionality Reduction

* Applied to TF-IDF for dense semantic representation
* Explains 30-40% of the variance
* Reduces noise and overfitting


5. VADER Sentiment Scores

* Positive, negative, neutral, compound scores
* Pre-built sentiment lexicon
* Standardized using StandardScaler

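As a rough illustration of the PMI scoring behind item 2, the sketch below computes PMI(word, emotion) = log(P(word | emotion) / P(word)) from document frequencies on a four-document toy corpus. The function name `pmi_scores` and the toy data are assumptions made for this example; the competition code additionally requires a minimum count of 5 and adds a small smoothing constant.

```python
import math
from collections import Counter, defaultdict

def pmi_scores(docs, labels, min_count=1):
    """Return {label: {word: PMI}} estimated from document frequencies."""
    class_df = defaultdict(Counter)   # per-class document frequency
    total_df = Counter()              # overall document frequency
    for text, label in zip(docs, labels):
        words = set(text.split())     # a word counts once per document
        class_df[label].update(words)
        total_df.update(words)

    n_docs = len(docs)
    label_counts = Counter(labels)
    scores = {}
    for label, counts in class_df.items():
        scores[label] = {
            word: math.log((c / label_counts[label]) / (total_df[word] / n_docs))
            for word, c in counts.items()
            if c >= min_count
        }
    return scores

# Toy corpus (hypothetical), two documents per emotion
docs = ["so gross ew", "gross smell", "so happy today", "happy happy day"]
labels = ["disgust", "disgust", "joy", "joy"]
scores = pmi_scores(docs, labels)

print(round(scores["disgust"]["gross"], 3))  # 0.693 (appears only in disgust)
print(round(scores["disgust"]["so"], 3))     # 0.0   (equally common in both classes)
```

Words that appear in every class at the same rate score 0, while class-specific words score higher, which is why taking the top-k words per emotion yields a usable lexicon.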

**RoBERTa Model (Learned Features):**

1. No manual feature engineering required
2. Tokenization using subword units (50,265 vocabulary)
3. Captures semantic relationships, context, long-range dependencies

### 1.3 Explanation of Your Model

| **Classical ML Pipeline**                                              | **RoBERTa Transformer**                                  | **Ensemble Strategy**                                         |
|------------------------------------------------------------------------|----------------------------------------------------------|---------------------------------------------------------------|
| **Model: LinearSVC with Probability Calibration**                      | **Architecture:**                                        | **Combination Method: Weighted probability averaging**        |
| Fast and effective for high-dimensional text data                      | Pre-trained RoBERTa-base model                           | RoBERTa: 65% weight (strong at anger, fear, surprise)         |
| Linear decision boundaries                                             | 12 transformer encoder layers                            | Classical: 35% weight (strong at disgust)                     |
| Regularization: C=0.2 (prevent overfitting)                            | 768 hidden dimensions                                    | **Weight Optimization:**                                      |
| dual=False (faster for n_samples > n_features)                         | 12 attention heads per layer                             | Grid search over weights [0.50, 0.55, 0.60, 0.65, 0.70, 0.75] |
| **Class Imbalance Handling:**                                          | Fine-tuned all layers for emotion classification         | Evaluated on out-of-fold predictions                          |
| Custom class weights (balanced + 2.5x boost for disgust)               | **Training Strategy:**                                   | Selected weights maximizing macro F1                          |
| Addresses severe class imbalance in training data                      | 3-fold Stratified K-Fold cross-validation                | **Final Performance:**                                        |
| Critical for minority class performance                                | 3 epochs per fold / Batch size: 16                       | Training F1: 0.6700                                           |
| **Probability Calibration:**                                           | Learning rate: 2e-5 with linear warmup                   | Public Leaderboard F1: 0.6889                                 |
| CalibratedClassifierCV with sigmoid method                             | AdamW optimizer                                          |                                                               |
| 3-fold cross-validation                                                | Gradient clipping (max_norm=1.0)                         |                                                               |
| Converts SVM scores to probabilities for ensemble                      | Class weights: 2.0x boost for disgust                    |                                                               |
| **Performance:**                                                       | **Performance:**                                         |                                                               |
| Training F1: ~0.5800                                                   | Training F1: ~0.5600                                     |                                                               |
| Strong at: disgust detection (21.2% accuracy vs 5-10% in other models) | Strong at: anger (64.4%), fear (58.1%), surprise (54.8%) |                                                               |
|                                                                        | Weak at: disgust (10-15%)                                |                                                               |

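The ensemble column above can be sketched in a few lines: blend two probability matrices with a weight, take the argmax, and keep the weight with the highest macro F1. This is a dependency-free illustration under stated assumptions; the helper names (`macro_f1`, `blend`, `best_blend_weight`) and the toy probabilities are hypothetical, and the real pipeline works on out-of-fold NumPy arrays.

```python
def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores."""
    per_class = []
    for c in range(n_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / n_classes

def predict(probs):
    """Argmax over each row of class probabilities."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]

def blend(probs_a, probs_b, weight_a):
    """Weighted average of two probability matrices."""
    return [[weight_a * a + (1 - weight_a) * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(probs_a, probs_b)]

def best_blend_weight(probs_a, probs_b, y_true, grid):
    """Grid search for the weight on model A that maximizes macro F1."""
    n_classes = len(probs_a[0])
    best_weight, best_score = None, -1.0
    for w in grid:
        score = macro_f1(y_true, predict(blend(probs_a, probs_b, w)), n_classes)
        if score > best_score:
            best_weight, best_score = w, score
    return best_weight, best_score

# Toy data (hypothetical): each model is wrong on a different sample
y_true = [0, 1, 1, 0]
probs_a = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.7, 0.3]]
probs_b = [[0.4, 0.6], [0.1, 0.9], [0.3, 0.7], [0.6, 0.4]]

grid = [0.50, 0.55, 0.60, 0.65, 0.70, 0.75]
weight, score = best_blend_weight(probs_a, probs_b, y_true, grid)
print(weight, score)  # 0.5 1.0
```
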
**Why Ensemble Works:**

1. Complementary strengths: RoBERTa captures context, Classical captures specific patterns
2. Diversity: Different feature representations
3. Reduces overfitting: Averages out individual model biases
4. Boosts weak classes: Classical compensates for RoBERTa's poor disgust performance
---

## 2. Bonus Section (5 pts Optional)

### 2.1 Mention Different Things You Tried

| **Successful Experiments**                                                   | **Unsuccessful Experiments**                                           |
|------------------------------------------------------------------------------|------------------------------------------------------------------------|
| PMI-based emotion lexicons - Added emotion-specific vocabulary (+0.02 F1)    | Threshold optimization - Severe overfitting (train 0.72 → val 0.55)    |
| Custom disgust features - Dramatically improved disgust detection (+0.04 F1) | Excessive feature engineering - More features led to worse performance |
| Character n-grams - Captured misspellings and style (+0.01 F1)               | Single model optimization - Hit ceiling around 0.58-0.60 F1            |
| SVD dimensionality reduction - Reduced noise, improved generalization        | Data augmentation - Didn't help transformers significantly             |
| Class weight tuning - Essential for minority classes                         | Deeper models - RoBERTa-large was too slow, minimal gain               |
| Ensemble combination - Best improvement (+0.09 F1)                           |                                                                        |
| K-fold cross-validation - Robust performance estimates                       |                                                                        |

![Submission records the things I've tried](./pics/Tried2.png)
![Submission records the things I've tried](./pics/Tried1.png)

### 2.2 Mention Insights You Gained

**Challenges in This Task:**
1. Disgust detection: Consistently worst-performing class across all models
2. Class imbalance: Required multiple strategies (weights, features, ensemble)
3. Overfitting: High-dimensional features led to train/val gap
4. Confusion between similar emotions: Joy vs surprise, fear vs sadness

**Key Insights & Learnings**
1. Class Imbalance is Critical

* Disgust was severely underrepresented and hardest to detect
* Required specialized features + class weighting
* A single approach wasn't enough: both targeted features and an ensemble were needed

2. More Features ≠ Better Performance

* Initial attempts with excessive features led to overfitting
* Simpler models may generalize better
* Feature selection and regularization were crucial

3. Ensemble Diversity is Key

* RoBERTa and Classical models made different types of errors
* Combining them captured more patterns than either alone

4. Cross-Validation Prevents Overfitting

* Used 3-fold stratified CV for both models
* Out-of-fold predictions for unbiased ensemble training
* Gap between train/val indicated overfitting

5. Domain Knowledge Matters

* Custom disgust lexicon significantly improved performance
* Understanding social media language patterns helped preprocessing
* PMI-based lexicons captured emotion-specific vocabulary

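The class weighting mentioned under insight 1 follows the usual "balanced" formula, weight = n_samples / (n_classes × class_count), with an extra multiplier for disgust (2.5x in the classical model, 2.0x for RoBERTa). A minimal sketch on made-up label counts, where the function name `boosted_class_weights` is an assumption for illustration:

```python
from collections import Counter

def boosted_class_weights(labels, boost=None):
    """Balanced class weights with optional per-class multipliers."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    weights = {c: n_samples / (n_classes * counts[c]) for c in counts}
    for c, factor in (boost or {}).items():
        weights[c] *= factor          # e.g. extra emphasis on a rare class
    return weights

# Hypothetical imbalanced label distribution (not the competition data)
labels = ["joy"] * 6 + ["anger"] * 3 + ["disgust"] * 1
weights = boosted_class_weights(labels, boost={"disgust": 2.5})

for name, w in sorted(weights.items()):
    print(f"{name:8s} {w:.3f}")
# anger    1.111
# disgust  8.333
# joy      0.556
```
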
---

**`From here on starts the code section for the competition.`**

---

# **Competition Code**

## 1. Preprocessing Steps

In [None]:
### Add the code related to the preprocessing steps in cells inside this section

def clean_text(text):
    """
    Clean and normalize text data.
    
    Preprocessing steps:
    1. Convert to lowercase for consistency
    2. Expand contractions (n't -> not) to preserve negation
    3. Remove URLs (not informative for emotion)
    4. Remove @mentions (privacy and generalization)
    5. Remove hashtag symbols but keep content
    6. Normalize whitespace
    
    Args:
        text: Raw text string
    
    Returns:
        Cleaned text string
    """
    text = str(text).lower()
    
    # Expand contractions to preserve negation
    text = re.sub(r"n't\b", " not", text)
    
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    
    # Remove @mentions
    text = re.sub(r'@\w+', '', text)
    
    # Remove hashtag symbol but keep the word
    text = re.sub(r'#(\w+)', r'\1', text)
    
    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

# Apply preprocessing
train_df['clean_text'] = train_df['text'].apply(clean_text)
test_df['clean_text'] = test_df['text'].apply(clean_text)

print("✓ Applied text cleaning")
print("\nExample transformation:")
print(f"Original: {train_df['text'].iloc[0][:100]}")
print(f"Cleaned:  {train_df['clean_text'].iloc[0][:100]}")

## 2. Feature Engineering Steps

In [None]:
### Add the code related to the feature engineering steps in cells inside this section

## Domain-Specific Features (Disgust Detection) ##

# Define disgust-related vocabulary based on domain knowledge
DISGUST_WORDS = {
    # Core disgust terms
    'disgusting', 'gross', 'nasty', 'revolting', 'repulsive', 'vile', 
    'filthy', 'foul', 'horrible', 'yuck', 'ew', 'eww', 'ugh',
    
    # Bodily reactions
    'sick', 'puke', 'vomit', 'nauseous', 'gag', 'rotten', 'stink',
    'smell', 'smells', 'stinks', 'reeks',
    
    # Visual disgust
    'ugly', 'hideous', 'grotesque', 'repugnant',
    
    # Contamination
    'dirty', 'filth', 'contaminated', 'infection', 'disease',
    'germs', 'bacteria', 'mold', 'decay',
    
    # Emotional responses
    'disgusted', 'appalled', 'repulsed', 'sickening',
    
    # Internet slang
    'cringe', 'cringy', 'cringeworthy'
}


# Extract features specifically designed to detect the disgust emotion
def extract_disgust_features(df):
    feats = pd.DataFrame()
    texts_lower = df['clean_text'].str.lower()
    
    # Count disgust words using word boundary matching
    feats['disgust_word_count'] = texts_lower.apply(
        lambda x: sum(1 for w in DISGUST_WORDS if re.search(rf'\b{w}\b', x))
    )
    
    # Binary indicator
    feats['has_disgust_word'] = (feats['disgust_word_count'] > 0).astype(int)
    
    # Capped intensity to prevent outliers
    feats['disgust_intensity'] = feats['disgust_word_count'].clip(upper=3)
    
    # Bodily reaction subset
    bodily = ['sick', 'vomit', 'puke', 'smell', 'stink', 'gag', 'nauseous']
    feats['bodily_reference'] = texts_lower.apply(
        lambda x: sum(1 for w in bodily if re.search(rf'\b{w}\b', x))
    )
    
    return feats

X_train_disgust = extract_disgust_features(train_df)
X_test_disgust = extract_disgust_features(test_df)

print(f"✓ Extracted disgust features")
print(f"  Training samples with disgust words: {X_train_disgust['has_disgust_word'].sum()}")
print(f"  Average disgust words per text: {X_train_disgust['disgust_word_count'].mean():.2f}")


## PMI-Based Emotion Lexicons ##

def learn_pmi_lexicons(df, labels, top_k=300):
    """
    Learn emotion-specific word lexicons using Pointwise Mutual Information (PMI).

    PMI measures the association between a word and an emotion class:

        PMI(word, emotion) = log(P(word | emotion) / P(word))

    A high PMI indicates the word is strongly associated with that emotion.
    Probabilities are estimated from document frequencies.
    """
    class_counts = defaultdict(Counter)
    total_counts = Counter()
    
    # Count word occurrences per class
    for text, label_idx in zip(df['clean_text'], labels):
        words = set(text.split())  # Use set to count document frequency
        label = target_names[label_idx]
        class_counts[label].update(words)
        total_counts.update(words)
    
    # Calculate PMI for each word-emotion pair
    label_counts = pd.Series(labels).value_counts()
    total_docs = len(labels)
    learned_lexicons = {}
    
    for label in target_names:
        word_scores = {}
        label_idx = le.transform([label])[0]
        
        for word, count in class_counts[label].items():
            # Require minimum frequency to avoid noise
            if count < 5:
                continue
            
            # Calculate PMI
            p_w_given_c = count / label_counts[label_idx]
            p_w = total_counts[word] / total_docs
            pmi = math.log(p_w_given_c / (p_w + 1e-8) + 1e-8)
            
            word_scores[word] = pmi
        
        # Keep top-k words with highest PMI
        top_words = sorted(word_scores.items(), key=lambda x: x[1], reverse=True)[:top_k]
        learned_lexicons[label] = set([w for w, s in top_words])
    
    return learned_lexicons

# Count matches against learned emotion lexicons
def apply_pmi_features(text_series, lexicons):
    feats = pd.DataFrame()
    for label in target_names:
        lex = lexicons[label]
        feats[f'{label}_pmi'] = text_series.apply(
            lambda x: sum(1 for w in x.split() if w in lex)
        )
    return feats

# Learn lexicons from training data
pmi_lexicons = learn_pmi_lexicons(train_df, y_train, top_k=300)

# Apply to both train and test
X_train_pmi = apply_pmi_features(train_df['clean_text'], pmi_lexicons)
X_test_pmi = apply_pmi_features(test_df['clean_text'], pmi_lexicons)

print(f"✓ Learned PMI lexicons")
print(f"  Lexicon sizes: {[len(pmi_lexicons[e]) for e in target_names]}")
print(f"\nExample words per emotion:")
for emotion in target_names:
    sample_words = list(pmi_lexicons[emotion])[:5]
    print(f"  {emotion:12s}: {', '.join(sample_words)}")


## TF-IDF Features ##

# Word-level TF-IDF (1-grams and 2-grams)
tfidf_word = TfidfVectorizer(
    max_features=15000,      # Keep top 15k features
    ngram_range=(1, 2),      # Unigrams and bigrams
    sublinear_tf=True        # Use log scaling for term frequency
)
X_train_word = tfidf_word.fit_transform(train_df['clean_text'])
X_test_word = tfidf_word.transform(test_df['clean_text'])

print(f"✓ Word TF-IDF: {X_train_word.shape[1]} features")

# Character-level TF-IDF (captures misspellings and style)
tfidf_char = TfidfVectorizer(
    max_features=8000,
    ngram_range=(2, 4),      # 2-4 grams
    analyzer='char',
    sublinear_tf=True
)
X_train_char = tfidf_char.fit_transform(train_df['clean_text'])
X_test_char = tfidf_char.transform(test_df['clean_text'])

print(f"✓ Char TF-IDF: {X_train_char.shape[1]} features")


## Dimensionality Reduction (SVD) ##

# Apply Truncated SVD to word TF-IDF for dense semantic features
svd = TruncatedSVD(n_components=120, random_state=SEED)
X_train_svd = svd.fit_transform(X_train_word)
X_test_svd = svd.transform(X_test_word)

print(f"✓ SVD features: {X_train_svd.shape[1]} components")
print(f"  Explained variance: {svd.explained_variance_ratio_.sum():.2%}")


## Sentiment Features (VADER) ##

# VADER: positive, negative, neutral, compound scores
sid = SentimentIntensityAnalyzer()

def get_vader_features(text_series):

    return pd.DataFrame([
        list(sid.polarity_scores(str(t)).values()) 
        for t in text_series
    ])

X_train_vader = get_vader_features(train_df['text'])
X_test_vader = get_vader_features(test_df['text'])

# Standardize sentiment scores
scaler_vader = StandardScaler()
X_train_vader = scaler_vader.fit_transform(X_train_vader)
X_test_vader = scaler_vader.transform(X_test_vader)


## Feature Combination ##

# Combine PMI and disgust features
X_train_mining = pd.concat([X_train_pmi, X_train_disgust], axis=1)
X_test_mining = pd.concat([X_test_pmi, X_test_disgust], axis=1)

# Scale data mining features to [0, 1]
scaler_mining = MinMaxScaler()
X_train_mining_scaled = scaler_mining.fit_transform(X_train_mining)
X_test_mining_scaled = scaler_mining.transform(X_test_mining)

# Stack all feature types for training and testing (sparse + dense)
X_train_classical = hstack([
    X_train_word,           
    X_train_char,           
    X_train_svd,            
    X_train_vader,          
    X_train_mining_scaled   
]).tocsr()

X_test_classical = hstack([
    X_test_word,
    X_test_char,
    X_test_svd,
    X_test_vader,
    X_test_mining_scaled
]).tocsr()

print(f"✓ Combined feature matrix: {X_train_classical.shape}")
print(f"  Total features: {X_train_classical.shape[1]}")

## 3. Model Implementation Steps

In [None]:

## Class Weights (Handle Imbalance) ##

# Calculate custom weights with extra boost for disgust
class_counts = pd.Series(y_train).value_counts().sort_index()
n_samples = len(y_train)
custom_weights = {}

for i in range(NUM_CLASSES):
    # Base balanced weight
    base_weight = n_samples / (NUM_CLASSES * class_counts[i])
    
    # Extra weight for disgust (hardest class to detect)
    if target_names[i] == 'disgust':
        custom_weights[i] = base_weight * 2.5
    else:
        custom_weights[i] = base_weight

print("Class weights:")
for i in range(NUM_CLASSES):
    print(f"  {target_names[i]:12s}: {custom_weights[i]:.2f}")

## Model Training with Calibration ##

# Linear SVM (for high-dimensional text data)
svc = LinearSVC(
    C=0.2,                      # Regularization strength
    class_weight=custom_weights, # Handle class imbalance
    max_iter=3000,              # More iterations for convergence
    dual=False,                 # Primal form faster for n_samples > n_features
    random_state=SEED
)

# Calibrate to get probability estimates 
classical_model = CalibratedClassifierCV(svc, method='sigmoid', cv=3)
classical_model.fit(X_train_classical, y_train)

## Generate Probability Predictions ##

# Cross-validation for train probabilities
classical_train_probs = cross_val_predict(
    classical_model, 
    X_train_classical, 
    y_train, 
    cv=3, 
    method='predict_proba',
    n_jobs=-1
)

# Direct prediction for test
classical_test_probs = classical_model.predict_proba(X_test_classical)

# Save for ensemble use
np.save('./logs/classical_train_probs.npy', classical_train_probs)
np.save('./logs/classical_test_probs.npy', classical_test_probs)

# Evaluate classical model alone
classical_preds = np.argmax(classical_train_probs, axis=1)
classical_f1 = f1_score(y_train, classical_preds, average='macro')
print(f"\nClassical Model Performance:")
print(f"  Macro F1: {classical_f1:.4f}")

In [None]:
## Minimal Cleaning for Transformers ##
# Pre-trained on noisy, real-world text (no need to over-clean)

def clean_text_minimal(text):
    text = str(text)
    # Remove URLs (transformers don't learn from these)
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove excessive whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

train_df['clean_text'] = train_df['text'].apply(clean_text_minimal)
test_df['clean_text'] = test_df['text'].apply(clean_text_minimal)

## Creating Dataset ##

class EmotionDataset(Dataset):
    """
    PyTorch Dataset for emotion classification.

    Handles:
    1. Tokenization with padding and truncation
    2. Attention masks (which tokens are real vs padding)
    3. Conversion to PyTorch tensors
    """
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]
        
        # Tokenize with special tokens
        encoding = self.tokenizer(
            text,
            add_special_tokens=True,    # Add <s> and </s> (RoBERTa's [CLS]/[SEP] equivalents)
            max_length=self.max_length,  # Truncate long sequences 
            padding='max_length',        # Pad short sequences
            truncation=True,             # Enable truncation
            return_attention_mask=True,  # Return attention mask
            return_tensors='pt'          # Return PyTorch tensors
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }


## Load RoBERTa Model and Tokenizer ##

MODEL_NAME = 'roberta-base'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

## Class Weights for Imbalanced Data ##

class_counts = pd.Series(y_train).value_counts().sort_index()
n_samples = len(y_train)

# Balanced weights 
class_weights = []
for i in range(NUM_CLASSES):
    base_weight = n_samples / (NUM_CLASSES * class_counts[i])
    if target_names[i] == 'disgust':
        weight = base_weight * 2.0  # Extra boost for disgust (hardest class)
    else:
        weight = base_weight
    class_weights.append(weight)

class_weights = torch.FloatTensor(class_weights).to(device)

print("Class weights:")
for i, emotion in enumerate(target_names):
    print(f"  {emotion:12s}: {class_weights[i]:.3f}")

## Training Function ##

def train_epoch(model, data_loader, optimizer, scheduler, device, class_weights):
    model.train()   # Set to training mode
    losses = []
    correct_predictions = 0

    progress_bar = tqdm(data_loader, desc='Training')

    for batch in progress_bar:
        # Move batch to device
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        
        # Zero gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )

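        # Apply class weights to loss (outputs.loss is unweighted, so recompute it from the logits)
        loss = torch.nn.functional.cross_entropy(outputs.logits, labels, weight=class_weights)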

        # Get predictions
        preds = torch.argmax(outputs.logits, dim=1)
        correct_predictions += torch.sum(preds == labels)

        losses.append(loss.item())

        # Backward pass
        loss.backward()

        # Gradient clipping (prevent exploding gradients)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        # Update weights
        optimizer.step()
        scheduler.step()

        progress_bar.set_postfix({'loss': np.mean(losses), 'acc': correct_predictions.double().item() / len(data_loader.dataset)})

    return correct_predictions.double() / len(data_loader.dataset), np.mean(losses)

## Evaluation Function ##

def eval_model(model, data_loader, device):
    model.eval()    # Set to evaluation mode
    losses = []
    predictions = []
    true_labels = []

    with torch.no_grad():      # Disable gradient computation
        progress_bar = tqdm(data_loader, desc='Evaluating')
        for batch in progress_bar:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                labels=labels
            )

            loss = outputs.loss
            preds = torch.argmax(outputs.logits, dim=1)

            predictions.extend(preds.cpu().numpy())
            true_labels.extend(labels.cpu().numpy())
            losses.append(loss.item())

    return predictions, true_labels, np.mean(losses)

## Prediction with Probabilities ##

def get_predictions_with_proba(model, data_loader, device):
    """
    Get predictions and probability distributions.

    The probabilities are useful for:
        1. Ensemble methods (combining multiple models)
        2. Confidence analysis
        3. Threshold optimization
    """
    model.eval()
    predictions = []
    probabilities = []
    true_labels = []

    with torch.no_grad():
        for batch in tqdm(data_loader, desc='Predicting'):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask
            )

            # Convert logits to probabilities
            probs = torch.softmax(outputs.logits, dim=1)
            preds = torch.argmax(probs, dim=1)

            probabilities.extend(probs.cpu().numpy())
            predictions.extend(preds.cpu().numpy())
            true_labels.extend(labels.cpu().numpy())

    return np.array(probabilities), np.array(predictions), np.array(true_labels)

## K-Fold Cross-Validation Training ##

# Training configuration
N_FOLDS = 3
EPOCHS = 3  
BATCH_SIZE = 16 
LEARNING_RATE = 2e-5    # Standard for fine-tuning

# Storage for final predictions
all_train_preds = np.zeros((len(train_df), NUM_CLASSES))
all_test_preds = []

# K-Fold splitter
kfold = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)

# Training Loop
for fold, (train_idx, val_idx) in enumerate(kfold.split(train_df, y_train)):
    print(f"\n{'='*80}")
    print(f"FOLD {fold + 1}/{N_FOLDS}")
    print(f"{'='*80}")

    # Prepare datasets for each fold
    train_texts = train_df.iloc[train_idx]['clean_text'].values
    train_labels = y_train[train_idx]
    val_texts = train_df.iloc[val_idx]['clean_text'].values
    val_labels = y_train[val_idx]

    train_dataset = EmotionDataset(train_texts, train_labels, tokenizer)
    val_dataset = EmotionDataset(val_texts, val_labels, tokenizer)

    train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)

    # Load FRESH model for each fold
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME,
        num_labels=NUM_CLASSES
    ).to(device)

    # Optimizer and scheduler
    optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
    total_steps = len(train_loader) * EPOCHS
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=total_steps // 10, # 10% warmup
        num_training_steps=total_steps
    )

    # Training loop for each fold
    best_val_f1 = 0
    for epoch in range(EPOCHS):
        print(f"\nEpoch {epoch + 1}/{EPOCHS}")

        # Training
        train_acc, train_loss = train_epoch(
            model, train_loader, optimizer, scheduler, device, class_weights
        )
        print(f"Train loss: {train_loss:.4f}, Train acc: {train_acc:.4f}")

        # Validation
        val_preds, val_labels_actual, val_loss = eval_model(model, val_loader, device)
        val_f1 = f1_score(val_labels_actual, val_preds, average='macro')
        print(f"Val loss: {val_loss:.4f}, Val F1: {val_f1:.4f}")

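        # Track best validation F1 (reporting only; no checkpoint is saved,
        # so the predictions below come from the final-epoch model)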
        if val_f1 > best_val_f1:
            best_val_f1 = val_f1
            print(f"Best F1: {val_f1:.4f}")

    # Get predictions on validation fold
    val_preds_proba, _, _ = get_predictions_with_proba(model, val_loader, device)
    all_train_preds[val_idx] = val_preds_proba

    # Get predictions on test set
    test_dataset = EmotionDataset(test_df['clean_text'].values, np.zeros(len(test_df)), tokenizer)
    test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)
    test_preds_proba, _, _ = get_predictions_with_proba(model, test_loader, device)
    all_test_preds.append(test_preds_proba)

    print(f"\nFold {fold + 1} complete. Best Val F1: {best_val_f1:.4f}")

# Average test predictions across folds
roberta_test_probs = np.mean(all_test_preds, axis=0)
roberta_train_probs = all_train_preds

# Evaluate RoBERTa alone on its out-of-fold predictions
roberta_preds = np.argmax(roberta_train_probs, axis=1)
roberta_f1 = f1_score(y_train, roberta_preds, average='macro')
print(f"RoBERTa Model Macro F1 (out-of-fold): {roberta_f1:.4f}")

# Save probabilities for potential ensemble
np.save('./logs/roberta_train_probs.npy', roberta_train_probs)
np.save('./logs/roberta_test_probs.npy', roberta_test_probs)

In [None]:
### Ensemble Model ###

## Analyze Model Strengths ##
print(f"{'Emotion':<12} {'RoBERTa':>8} {'Classical':>10} {'Better':>15}")
for i, emotion in enumerate(target_names):
    # Per-class F1 over the full training set, so false positives are counted too
    f1_roberta = f1_score(y_train, roberta_preds, labels=[i], average='macro')
    f1_classical = f1_score(y_train, classical_preds, labels=[i], average='macro')
    better = "RoBERTa" if f1_roberta > f1_classical else "Classical"
    print(f"{emotion:<12} {f1_roberta:8.4f} {f1_classical:10.4f} {better:>15}")

## Optimize Ensemble Weights ##

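# Class indices used for the per-class F1 columns below
disgust_idx = list(target_names).index('disgust')
anger_idx = list(target_names).index('anger')

best_f1 = 0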
best_weights = None
best_preds = None

print(f"\n{'RoBERTa':>8s} {'Classical':>10s} {'Macro F1':>9s} {'Disgust F1':>11s} {'Anger F1':>9s}")
print("-" * 60)

# Grid search over different weight combinations
for roberta_weight in [0.50, 0.55, 0.60, 0.65, 0.70, 0.75]:
    classical_weight = 1.0 - roberta_weight
    
    # Weighted average of probabilities
    ensemble_probs = (roberta_weight * roberta_train_probs) + \
                     (classical_weight * classical_train_probs)
    ensemble_preds = np.argmax(ensemble_probs, axis=1)
    
    # Calculate metrics
    f1_macro = f1_score(y_train, ensemble_preds, average='macro')
    f1_disgust = f1_score(y_train == disgust_idx, ensemble_preds == disgust_idx)
    f1_anger = f1_score(y_train == anger_idx, ensemble_preds == anger_idx)
    
    print(f"{roberta_weight:8.2f} {classical_weight:10.2f} {f1_macro:9.4f} "
          f"{f1_disgust:11.4f} {f1_anger:9.4f}")
    
    if f1_macro > best_f1:
        best_f1 = f1_macro
        best_weights = (roberta_weight, classical_weight)
        best_preds = ensemble_preds

print(f"\nOptimal weights: RoBERTa={best_weights[0]:.2f}, Classical={best_weights[1]:.2f}")
print(f"\nTraining F1: {best_f1:.4f}")


## Final Ensemble Predictions ##

# Apply best weights to create final ensemble
ensemble_train_probs = (best_weights[0] * roberta_train_probs) + \
                       (best_weights[1] * classical_train_probs)
ensemble_test_probs = (best_weights[0] * roberta_test_probs) + \
                      (best_weights[1] * classical_test_probs)

preds_train = np.argmax(ensemble_train_probs, axis=1)
preds_test = np.argmax(ensemble_test_probs, axis=1)

train_f1 = f1_score(y_train, preds_train, average='macro')
print(f"\nEnsemble Training F1: {train_f1:.4f}")