<a href="https://colab.research.google.com/github/yifang-psu/demo/blob/main/Fang_demo_0126.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Author: Yi Fang**

## **Content:**

0. **Deep Learning Frameworks**
1. **Sentence Transformer Model**
  * 1.1 Model script
  * 1.2 Test
  * 1.3 Discussion
2. **Multi-Task Transformer Model**
  * 2.1 Model script
  * 2.2 Test
  * 2.3 Discussion
3. **Training**
  * 3.1 Training script
  * 3.2 Prepare datasets and train the model
  * 3.3 Test
  * 3.4 Discussion

## (0-1) Deep Learning Frameworks

- PyTorch (used for this demo)
- TensorFlow/Keras
- JAX/Flax
- Hugging Face Transformers (used for this demo)

#### Why choose PyTorch with Hugging Face Transformers?
- Extensive pre-trained models
- Active community support
- Clean API for transformer architectures
- Efficient fine-tuning capabilities

In [None]:
!pip install transformers torch  # Comment this out if the packages are already installed
# I developed the following code in Databricks (runtime: ML 15.4 https://docs.databricks.com/en/release-notes/runtime/15.4lts-ml.html)
# The code also runs well in Colab (runtime Python 3, CPU)

Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.


In [1]:
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

In [8]:
print(torch.__version__)

2.5.1+cu121


# **Section 1**

## (1-1) Sentence Transformer Model

In [9]:
class SentenceTransformerModel(nn.Module):
    def __init__(self, model_name='bert-base-uncased'):
        super(SentenceTransformerModel, self).__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.transformer = AutoModel.from_pretrained(model_name)

    def mean_pooling(self, model_output, attention_mask):
        """
        Apply mean pooling on the last hidden state to get sentence embeddings.
        """
        token_embeddings = model_output.last_hidden_state  # (batch_size, seq_len, hidden_size)
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()

        # Sum the embeddings for each token, then divide by the number of tokens
        sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, dim=1)
        sum_mask = torch.clamp(input_mask_expanded.sum(dim=1), min=1e-9)  # count of real (non-padding) tokens; the clamp avoids division by zero
        return sum_embeddings / sum_mask


    def encode(self, sentences):
        """
        Tokenize and encode sentences to fixed-length embeddings.
        """
        # sentences can be a list of strings or a single string
        if isinstance(sentences, str):
            sentences = [sentences]

        encoding = self.tokenizer(
            sentences,
            padding=True,
            truncation=True,
            return_tensors='pt'
        )

        with torch.no_grad():  # turn off gradients for inference
            model_output = self.transformer(
                input_ids=encoding['input_ids'],
                attention_mask=encoding['attention_mask']
            )

        sentence_embeddings = self.mean_pooling(model_output, encoding['attention_mask'])
        return sentence_embeddings

    def forward(self, input_ids, attention_mask):
        """
        Forward pass for training: returns the sentence embeddings.
        """
        model_output = self.transformer(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        return self.mean_pooling(model_output, attention_mask)

## (1-2) Test Sentence Transformer Model

In [10]:
def test_sentence_transformer(model, sentences):
    """
    Demonstrates how to use a SentenceTransformerModel for encoding sentences into fixed-length embeddings.
    """
    # Encode the sentences to get embeddings
    embeddings = model.encode(sentences)

    # Print the shape of the resulting embeddings tensor
    print(f"Embeddings shape: {embeddings.shape}")

    # Show sample embeddings for each sentence
    for i, sentence in enumerate(sentences):
        print(f"\nSentence {i+1}: {sentence}")
        print(f"Embedding (first 5 dims): {embeddings[i, :5]}")


In [11]:
if __name__ == "__main__":
    # Instantiate your SentenceTransformerModel
    model = SentenceTransformerModel(model_name='bert-base-uncased')

    # Define some sample sentences
    sentences = [
        "Hello world!",
        "I love AI",
        "Machine learning is fascinating.",
        "I love programming in Python."
    ]

    # Run the test function
    test_sentence_transformer(model, sentences)



Embeddings shape: torch.Size([4, 768])

Sentence 1: Hello world!
Embedding (first 5 dims): tensor([-0.1373, -0.1593,  0.0821, -0.3459, -0.2501])

Sentence 2: I love AI
Embedding (first 5 dims): tensor([ 0.2247,  0.3316,  0.1592, -0.0284,  0.1185])

Sentence 3: Machine learning is fascinating.
Embedding (first 5 dims): tensor([ 0.1596,  0.0725, -0.1440,  0.0461,  0.4271])

Sentence 4: I love programming in Python.
Embedding (first 5 dims): tensor([ 0.2869,  0.4483, -0.2176, -0.2858,  0.0800])


## (1-3) Discussion

#### What architecture choices are made outside of the transformer backbone and why?

1. **Mean pooling**:
  * Common pooling strategies
    - mean pooling (used in this implementation)
    - [CLS] token pooling: often used as the default representation in BERT
    - max pooling: may help highlight the most salient features
    - concatenation of the last layer or of multiple layers
  * Why mean pooling?
    - The transformer returns `last_hidden_state`, a matrix of token-level embeddings. Averaging it over the non-padding tokens produces a single fixed-length sentence embedding.
    - Mean pooling incorporates every token in the sentence rather than relying on the [CLS] token alone.
    - [CLS] pooling may suit classification better than mean pooling (see the discussion here: https://discuss.huggingface.co/t/common-practice-using-the-hidden-state-associated-with-cls-as-an-input-feature-for-a-classification-task/14003)

2. **No Additional Layers**
  * No fully connected (FC) layer is added here
    - The sentence embeddings are therefore the direct output of the backbone followed by mean pooling.
  * Why keep it simple?
    - The pooled embedding should be a sufficient starting point for downstream tasks such as similarity or sentiment analysis.
    - When a specific task calls for a more complex architecture, a trainable linear or MLP layer can be added on top of the pooled representation to improve performance. This model omits that layer for simplicity; the multi-task model in the next section adds it.
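
The pooling strategies listed above can be compared on a toy tensor. The sketch below is illustrative only: the tensor values and the helper name `masked_mean_pool` are made up for this example, and it mirrors the masking logic of `mean_pooling` without loading any pre-trained model.

```python
import torch

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over non-padding positions (hypothetical helper)."""
    mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)                        # avoid division by zero
    return summed / counts

# Toy batch: 2 sentences, 3 token positions, hidden size 2.
# The last position of the first sentence is padding and holds a large dummy value.
token_embeddings = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
])
attention_mask = torch.tensor([[1, 1, 0], [1, 1, 1]])

# Mean pooling: padding is excluded from both the sum and the token count.
mean_pooled = masked_mean_pool(token_embeddings, attention_mask)

# [CLS]-style pooling: take the first position only.
cls_pooled = token_embeddings[:, 0, :]

# Max pooling: push padded positions to -inf so they can never win the max.
padding_positions = attention_mask.unsqueeze(-1) == 0
max_pooled = token_embeddings.masked_fill(padding_positions, float("-inf")).max(dim=1).values

print("mean:", mean_pooled)
print("cls: ", cls_pooled)
print("max: ", max_pooled)
```

All three strategies return one vector per sentence with the same hidden size, so they can be swapped without changing any layer that consumes the pooled embedding.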

#### Why is `torch.no_grad()` used?

1. **Inference Mode**:
  - In this demo, `torch.no_grad()` in the `encode` method disables gradient tracking, so `encode` is for inference only: no gradients can reach the transformer weights through it, which also saves memory and compute. To fine-tune the backbone for a specific task, train through `forward`, which does not use `torch.no_grad()` and therefore allows backpropagation.
2. **In-Method Convenience**:
  * Placing `torch.no_grad()` inside the `encode()` method guarantees that anyone calling `encode(...)` won’t accidentally compute gradients.
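
The effect of `torch.no_grad()` can be checked directly on a small layer. This is a minimal sketch with a stand-in `nn.Linear` layer (an assumption made to keep the example small; it is not the BERT backbone).

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)   # stand-in for the transformer backbone
inputs = torch.randn(3, 4)

# Inside no_grad, the output is detached from the autograd graph.
with torch.no_grad():
    inference_output = layer(inputs)

# Outside no_grad, the output tracks gradients back to the layer weights.
training_output = layer(inputs)

print(inference_output.requires_grad)
print(training_output.requires_grad)
```

Calling `.backward()` on a loss built only from `inference_output` would raise an error, which is why `encode` cannot be used inside a training loop.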



# **Section 2**

## (2-1) Multi-Task Transformer Model
- Task 1: Classification (e.g., three classes)
  - Product-related
  - Service-related
  - General feedback

- Task 2: Sentiment Analysis (e.g., three categories)
  - Negative (0)
  - Neutral (1)
  - Positive (2)

In [12]:
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class MultiTaskTransformer(nn.Module):
    CLASSIFICATION_LABELS = ['Product', 'Service', 'General']
    SENTIMENT_LABELS = ['Negative', 'Neutral', 'Positive']

    def __init__(self, model_name='bert-base-uncased'):
        super().__init__()

        # Shared Transformer
        self.transformer = AutoModel.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

        # Decide if running on CPU or GPU
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

        # Model hidden size (often 768 for BERT-base)
        hidden_size = self.transformer.config.hidden_size

        # Task-specific heads
        self.classifier = nn.Linear(hidden_size, len(self.CLASSIFICATION_LABELS)).to(self.device)
        self.sentiment_analyzer = nn.Linear(hidden_size, len(self.SENTIMENT_LABELS)).to(self.device)

        # CLS pooler for classification (transform CLS hidden state)
        self.cls_pooler = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh()
        ).to(self.device)

    def mean_pooling(self, model_output, attention_mask):
        """Use mean pooling of the last hidden states for sentence-level representations."""
        token_embeddings = model_output.last_hidden_state  # (batch_size, seq_len, hidden_size)
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()

        sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, dim=1)
        sum_mask = torch.clamp(input_mask_expanded.sum(dim=1), min=1e-9)

        return sum_embeddings / sum_mask

    def cls_pooling(self, model_output):
        """Use [CLS] token and pass it through a small transformation (cls_pooler)."""
        cls_token = model_output.last_hidden_state[:, 0, :]  # (batch_size, hidden_size)
        return self.cls_pooler(cls_token)  # (batch_size, hidden_size)

    def forward(self, input_ids, attention_mask, token_type_ids=None, task=None):
        # No torch.no_grad() here, so gradients can flow to the backbone during training
        outputs = self.transformer(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids
        )

        if task == 'classification':
            pooled = self.cls_pooling(outputs)
            return self.classifier(pooled)
        elif task == 'sentiment':
            pooled = self.mean_pooling(outputs, attention_mask)
            return self.sentiment_analyzer(pooled)

        # Default
        return self.mean_pooling(outputs, attention_mask)


    def get_classification_label(self, idx):
        return self.CLASSIFICATION_LABELS[idx]

    def get_sentiment_label(self, idx):
        return self.SENTIMENT_LABELS[idx]


## (2-2) Test Multi-Task Transformer Model (before training or fine-tuning)


In [13]:
def test_multitask_transformer(model, sentences):
    """
    Demonstrates how to use MultiTaskTransformer for:
      1) Classification
      2) Sentiment Analysis
      3) Default mean-pooled embeddings (no task)
    Now it also prints per-class probabilities for each task.
    """
    # Move model to the correct device
    model.to(model.device)

    # Tokenize inputs
    encoding = model.tokenizer(
        sentences,
        padding=True,
        truncation=True,
        return_tensors='pt'
    )
    # Send inputs to model device
    input_ids = encoding['input_ids'].to(model.device)
    attention_mask = encoding['attention_mask'].to(model.device)

    # 1) Classification logits
    classification_logits = model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        task='classification'
    )
    # Convert logits to probabilities
    classification_probs = torch.softmax(classification_logits, dim=1)
    classification_preds = torch.argmax(classification_probs, dim=1)

    # 2) Sentiment logits
    sentiment_logits = model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        task='sentiment'
    )
    sentiment_probs = torch.softmax(sentiment_logits, dim=1)
    sentiment_preds = torch.argmax(sentiment_probs, dim=1)

    # 3) Default embeddings (no task)
    embeddings = model(
        input_ids=input_ids,
        attention_mask=attention_mask
    )

    # Print out results
    print("=== MultiTaskTransformer Test ===\n")
    for i, sentence in enumerate(sentences):
        print(f"Sentence: {sentence}")

        # --- Classification ---
        print("\nClassification Probabilities:")
        for j, label in enumerate(model.CLASSIFICATION_LABELS):
            prob = classification_probs[i, j].item()
            print(f"{label}: {prob:.3f}")
        class_label = model.get_classification_label(int(classification_preds[i]))
        print(f"Predicted Class: {class_label}")

        # --- Sentiment ---
        print("\nSentiment Probabilities:")
        for j, label in enumerate(model.SENTIMENT_LABELS):
            prob = sentiment_probs[i, j].item()
            print(f"{label}: {prob:.3f}")
        senti_label = model.get_sentiment_label(int(sentiment_preds[i]))
        print(f"Predicted Sentiment: {senti_label}\n")

        print("-" * 50)

    print(f"Embeddings shape (no task): {embeddings.shape}")
    print("Example embedding row:", embeddings[0, :5])


In [14]:
if __name__ == "__main__":
    # Instantiate the multi-task model
    model = MultiTaskTransformer(model_name='bert-base-uncased')

    # Sample sentences
    test_sentences = [
        "The product quality is outstanding!",  # Product, Positive
        "Your customer service team was very helpful",  # Service, Positive
        "I have some general feedback to share",  # General, Neutral
        "This product is completely useless",  # Product, Negative
        "The service was mediocre at best"  # Service, Neutral
    ]

    # Run our test function
    test_multitask_transformer(model, test_sentences)


=== MultiTaskTransformer Test ===

Sentence: The product quality is outstanding!

Classification Probabilities:
Product: 0.424
Service: 0.273
General: 0.303
Predicted Class: Product

Sentiment Probabilities:
Negative: 0.342
Neutral: 0.369
Positive: 0.289
Predicted Sentiment: Neutral

--------------------------------------------------
Sentence: Your customer service team was very helpful

Classification Probabilities:
Product: 0.394
Service: 0.289
General: 0.317
Predicted Class: Product

Sentiment Probabilities:
Negative: 0.363
Neutral: 0.427
Positive: 0.210
Predicted Sentiment: Neutral

--------------------------------------------------
Sentence: I have some general feedback to share

Classification Probabilities:
Product: 0.391
Service: 0.294
General: 0.315
Predicted Class: Product

Sentiment Probabilities:
Negative: 0.463
Neutral: 0.345
Positive: 0.192
Predicted Sentiment: Negative

--------------------------------------------------
Sentence: This product is completely useless

Class


## (2-3) Discussion

#### What architecture choices are made outside of the transformer backbone and why?

1. Task-Specific Heads:

  * The model adds two linear layers:
    * `self.classifier` for the classification task (e.g., Product/Service/General).
    * `self.sentiment_analyzer` for the sentiment task (Negative/Neutral/Positive).
  * These layers map the pooled embedding to the respective label spaces.

2. Pooling Strategy:

  * CLS Pooling (`cls_pooling`) for classification:
    * The [CLS] token is passed through a small learned transformation (`nn.Linear` + `nn.Tanh`) before classification.
    * Why? CLS pooling is a common approach in BERT-like models, where the [CLS] token is often used as a condensed representation of the entire sequence. The additional linear + Tanh adds a small nonlinearity, which can improve classification performance.

  * Mean Pooling (`mean_pooling`) for sentiment:
    * The hidden states are averaged across all tokens to generate a single embedding for sentiment analysis.
    * Why? Mean pooling is used for sentiment to capture an average representation of all tokens. Some tasks benefit from seeing all tokens’ contributions rather than focusing on the [CLS] embedding alone.

3. Shared BERT backbone:

  * Both tasks use the same transformer embeddings but have separate output layers, enabling knowledge sharing while maintaining task-specific predictions.

#### Why are the results so poor?

**(1) Untrained task-specific heads**
* The BERT backbone has learned general language representations, but it knows nothing about the specific tasks (product vs. service vs. general, negative vs. neutral vs. positive).
* The classification and sentiment analysis heads each have randomly initialized weights. Without fine-tuning or training, they are effectively predicting at random (although slightly shaped by the hidden states from BERT).

**(2) No fine-tuning data**
* Pre-trained models (like BERT) learn universal language features (e.g., synonyms, grammar).
* To do well on a downstream task (classification, sentiment analysis), the model needs labeled examples that link BERT’s general features to the custom labels.

**(3) Softmax probabilities on random logits**
* Even though the output shows probabilities like 0.42 or 0.37, these are just the result of applying softmax to random logits. They might look slightly better than uniform by chance, but they are still untrained predictions.




**(4) The training section below uses a combined loss function. Here is why:**
- Joint optimization: the model learns parameters that work for both tasks simultaneously.
- Task balancing: weighting the per-task losses helps prevent one task from dominating the learning process.
- Knowledge sharing: relevant information transfers between tasks through the shared BERT backbone.
- Efficiency: training both tasks together is more computationally efficient than training separate models.
- The combined loss can be weighted if needed:
  ```
  loss = alpha * classification_loss + beta * sentiment_loss
  ```
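
As a concrete sketch of the weighted sum above, the snippet below evaluates it on a dummy batch. The values of `alpha` and `beta` are arbitrary placeholders (assumptions, not tuned settings), and the all-zero logits simply represent a model that predicts every class with equal probability.

```python
import math
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# Dummy batch: 4 examples, 3 classes per task. Zero logits = uniform prediction.
classification_logits = torch.zeros(4, 3)
sentiment_logits = torch.zeros(4, 3)
classification_labels = torch.tensor([0, 1, 2, 0])
sentiment_labels = torch.tensor([2, 1, 0, 2])

alpha, beta = 1.0, 2.0  # hypothetical task weights; beta > alpha emphasizes sentiment

classification_loss = criterion(classification_logits, classification_labels)
sentiment_loss = criterion(sentiment_logits, sentiment_labels)
loss = alpha * classification_loss + beta * sentiment_loss

print(f"classification loss: {classification_loss.item():.4f}")
print(f"sentiment loss:      {sentiment_loss.item():.4f}")
print(f"weighted total:      {loss.item():.4f}")
```

With uniform predictions each per-task loss equals ln 3, so the weighted total here is (alpha + beta) · ln 3. In practice the weights would be chosen by checking per-task validation metrics.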

# **Section 3**


## (3-1) Training



In [15]:
import torch
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
import numpy as np

class MultiTaskDataset(Dataset):
    def __init__(self, texts, classification_labels, sentiment_labels, tokenizer):
        self.encodings = tokenizer(texts, truncation=True, padding=True, return_tensors='pt')
        self.classification_labels = torch.tensor(classification_labels)
        self.sentiment_labels = torch.tensor(sentiment_labels)

    def __getitem__(self, idx):
        item = {
            'input_ids': self.encodings['input_ids'][idx],
            'attention_mask': self.encodings['attention_mask'][idx],
            'token_type_ids': self.encodings['token_type_ids'][idx],
            'classification_labels': self.classification_labels[idx],
            'sentiment_labels': self.sentiment_labels[idx]
        }
        return item

    def __len__(self):
        return len(self.classification_labels)

In [16]:
def train_multitask_model(
    model,
    train_texts,
    classification_labels,
    sentiment_labels,
    epochs=3,
    batch_size=8,
    learning_rate=5e-6,  # alternative: 2e-5
    freeze_backbone=True,
    unfreeze_after=None
):
    """
    Train the MultiTaskTransformer model on CPU with a freeze-then-unfreeze approach.

    Args:
        model (nn.Module): The multi-task model (with classification/sentiment heads).
        train_texts (list): List of training sentences.
        classification_labels (list[int]): Labels for classification task.
        sentiment_labels (list[int]): Labels for sentiment task.
        epochs (int): Number of total epochs.
        batch_size (int): Batch size for DataLoader.
        learning_rate (float): Learning rate for AdamW optimizer.
        freeze_backbone (bool): If True, freeze the transformer backbone at first.
        unfreeze_after (int or None): After this many epochs, unfreeze the backbone for fine-tuning.
                                      If None, keep it frozen throughout training.
    """

    # -------------------------
    # 1) Prepare dataset
    # -------------------------
    train_dataset = MultiTaskDataset(
        train_texts,
        classification_labels,
        sentiment_labels,
        model.tokenizer
    )
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

    # -------------------------
    # 2) Freeze backbone initially
    # -------------------------
    if freeze_backbone:
        for param in model.transformer.parameters():
            param.requires_grad = False # Freeze the backbone parameters (no gradient updates)

    # -------------------------
    # 3) Create optimizer & loss
    # -------------------------
    optimizer = AdamW(filter(lambda p: p.requires_grad, model.parameters()), lr=learning_rate) # Create an optimizer that only includes parameters with requires_grad=True
    criterion = nn.CrossEntropyLoss()

    model.train()  # Enable training mode (for heads)

    # -------------------------
    # 4) Training Loop
    # -------------------------
    for epoch in range(epochs):
        total_loss = 0.0

        # Optionally unfreeze the backbone once `unfreeze_after` epochs have completed
        if unfreeze_after is not None and epoch == unfreeze_after:
            print(f"Unfreezing backbone after epoch {epoch}...")
            for param in model.transformer.parameters():
                param.requires_grad = True
            optimizer = AdamW(model.parameters(), lr=learning_rate)  # Re-init optimizer so it also covers the backbone parameters

        for batch in train_loader:
            # Move inputs to CPU (model is on CPU as well)
            batch = {k: v.to('cpu') for k, v in batch.items()}

            # 1) Zero out old gradients
            optimizer.zero_grad()

            # 2) Forward pass: classification
            classification_logits = model(
                input_ids=batch['input_ids'],
                attention_mask=batch['attention_mask'],
                token_type_ids=batch['token_type_ids'],
                task='classification'
            )
            classification_loss = criterion(classification_logits, batch['classification_labels'])

            # 3) Forward pass: sentiment
            sentiment_logits = model(
                input_ids=batch['input_ids'],
                attention_mask=batch['attention_mask'],
                token_type_ids=batch['token_type_ids'],
                task='sentiment'
            )
            sentiment_loss = criterion(sentiment_logits, batch['sentiment_labels'])

            # 4) Combined loss
            loss = classification_loss + sentiment_loss
            total_loss += loss.item()

            # 5) Backprop & update
            loss.backward()
            optimizer.step()

        avg_loss = total_loss / len(train_loader)
        print(f"[Epoch {epoch+1}/{epochs}] Average Loss: {avg_loss:.4f}")

    print("Training complete!")

## (3-2) Load datasets and train the model
* 200 sample texts with labels are prepared in a text file (`training_datasets.txt`)
  - Text feedback (col 1)
  - Classification labels (0=Product, 1=Service, 2=General) (col 2)
  - Sentiment labels (0=Negative, 1=Neutral, 2=Positive) (col 3)
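
To make the three-column layout concrete, here is a minimal sketch that parses two rows in that format with the standard-library `csv` module. The rows are invented for illustration and are not taken from `training_datasets.txt`; the second row shows the assumption that a text containing a comma must be wrapped in double quotes so it is not split into extra columns.

```python
import csv
import io

# Hypothetical rows in the "<text>,<classification_label>,<sentiment_label>" layout.
sample_file = io.StringIO(
    'The product quality is outstanding!,0,2\n'
    '"The service was slow, but the staff apologized",1,1\n'
)

rows = []
for text, classification_label, sentiment_label in csv.reader(sample_file):
    # Labels arrive as strings and are converted to integers.
    rows.append((text, int(classification_label), int(sentiment_label)))

for row in rows:
    print(row)
```

`pandas.read_csv` treats double quotes the same way by default, so a file quoted like this should load into three columns.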

In [17]:
import pandas as pd

def load_training_data(url):
    """
    Load training data from a given URL, expecting the format:
    "<text>,<classification_label>,<sentiment_label>".

    Args:
        url (str): URL of the training dataset file.

    Returns:
        tuple: A tuple containing three lists:
            - train_texts (list): List of text samples.
            - classification_labels (list): List of classification labels (int).
            - sentiment_labels (list): List of sentiment labels (int).
    """
    # Read the file into a DataFrame, specifying no header since the file does not have one
    df = pd.read_csv(url, header=None, names=['text', 'classification_label', 'sentiment_label'])

    # Extract the data into separate lists
    train_texts = df['text'].tolist()
    classification_labels = df['classification_label'].astype(int).tolist()
    sentiment_labels = df['sentiment_label'].astype(int).tolist()

    return train_texts, classification_labels, sentiment_labels

In [20]:
if __name__ == "__main__":
    url = "https://raw.githubusercontent.com/yifang-psu/demo/refs/heads/main/training_datasets.txt"

    # Call the function to load the data
    train_texts, classification_labels, sentiment_labels = load_training_data(url)

    # Initialize the model
    model = MultiTaskTransformer().to('cpu')

    # Train the model with freeze + unfreeze strategy
    train_multitask_model(
        model=model,
        train_texts=train_texts,
        classification_labels=classification_labels,
        sentiment_labels=sentiment_labels,
        epochs=10,               # e.g. 10 total epochs
        freeze_backbone=True,    # freeze at first
        unfreeze_after=None      # not updating backbone for now
    )

[Epoch 1/10] Average Loss: 2.2130
[Epoch 2/10] Average Loss: 2.1949
[Epoch 3/10] Average Loss: 2.1732
[Epoch 4/10] Average Loss: 2.1649
[Epoch 5/10] Average Loss: 2.1475
[Epoch 6/10] Average Loss: 2.1322
[Epoch 7/10] Average Loss: 2.1134
[Epoch 8/10] Average Loss: 2.0936
[Epoch 9/10] Average Loss: 2.0855
[Epoch 10/10] Average Loss: 2.0644
Training complete!
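
For reference, the loss values above can be compared with chance level. A uniform guess over three classes has a cross-entropy of ln 3 per task, and the training loss is the sum of two such terms. The short calculation below is a sketch of that comparison; the two epoch losses are copied from the log above.

```python
import math

num_classes = 3  # both tasks have three labels
num_tasks = 2    # classification + sentiment, summed without weights

chance_loss_per_task = math.log(num_classes)
chance_combined = num_tasks * chance_loss_per_task

first_epoch_loss = 2.2130  # copied from the training log
last_epoch_loss = 2.0644   # copied from the training log

print(f"Chance-level loss per task: {chance_loss_per_task:.4f}")
print(f"Chance-level combined loss: {chance_combined:.4f}")
print(f"Improvement over chance after 10 epochs: {chance_combined - last_epoch_loss:.4f}")
```

The combined loss after ten epochs is only slightly below the chance-level value of about 2.1972, which suggests the heads have learned little so far with the backbone frozen and a learning rate of 5e-6.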


## (3-3) Test the trained model

In [21]:
def test_trained_model(model, test_texts):
    model.eval()  # Set model to evaluation mode

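    # Tokenize new texts and move them to the device the model weights are actually on
    # (model.device can report 'cuda' even after the model was moved with .to('cpu'))
    device = next(model.parameters()).device
    encoded = model.tokenizer(
        test_texts,
        padding=True,
        truncation=True,
        return_tensors='pt',
        return_token_type_ids=True
    ).to(device)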

    with torch.no_grad():
        # Get predictions
        classification_outputs = model(**encoded, task='classification')
        sentiment_outputs = model(**encoded, task='sentiment')

        # Convert to probabilities
        class_probs = torch.softmax(classification_outputs, dim=1)
        sent_probs = torch.softmax(sentiment_outputs, dim=1)

        # Get predicted labels
        class_predictions = [model.get_classification_label(idx) for idx in class_probs.argmax(dim=1).cpu()]
        sent_predictions = [model.get_sentiment_label(idx) for idx in sent_probs.argmax(dim=1).cpu()]

        return class_predictions, sent_predictions

In [22]:
# Example usage
test_texts = [
    "Your support team was very helpful", # Service, Positive
    "This new product is fantastic", # Product, Positive
    "The product quality is outstanding!",  # Product, Positive
    "Your customer service team was very helpful",  # Service, Positive
    "I have some general feedback to share",  # General, Neutral
    "This product is completely useless",  # Product, Negative
    "The service was mediocre at best",  # Service, Neutral
    "I hate your horrible product" # Product, Negative
]

class_preds, sent_preds = test_trained_model(model, test_texts)

for text, class_pred, sent_pred in zip(test_texts, class_preds, sent_preds):
    print(f"\nText: {text}")
    print(f"Predicted Class: {class_pred}")
    print(f"Predicted Sentiment: {sent_pred}")


Text: Your support team was very helpful
Predicted Class: Service
Predicted Sentiment: Positive

Text: This new product is fantastic
Predicted Class: Product
Predicted Sentiment: Neutral

Text: The product quality is outstanding!
Predicted Class: Product
Predicted Sentiment: Positive

Text: Your customer service team was very helpful
Predicted Class: Service
Predicted Sentiment: Positive

Text: I have some general feedback to share
Predicted Class: General
Predicted Sentiment: Positive

Text: This product is completely useless
Predicted Class: Product
Predicted Sentiment: Neutral

Text: The service was mediocre at best
Predicted Class: Product
Predicted Sentiment: Positive

Text: I hate your horrible product
Predicted Class: Product
Predicted Sentiment: Positive



## (3-4) Discussion

#### **(1) When would it make sense to freeze the transformer backbone and only train the task-specific layers?**
  * Limited Data:
    * With few labeled samples for the downstream tasks, fully fine-tuning all BERT parameters often leads to overfitting. Freezing the backbone and training only the final layers helps avoid destroying the pre-trained representations with noisy gradients.

  * Limited Compute (CPU-bound or small GPU):
    * Fine-tuning the entire transformer can be computationally expensive. Freezing the backbone drastically reduces the number of trainable parameters.

  * Faster Training:
    * Only the small task heads are updated, so training is quicker. This matters most on CPU, which is how the training run above was done.

  Note the trade-off: you may lose some performance potential, because the backbone does not adapt to your specific domain or tasks. But if your domain is close to general language or your dataset is very small, freezing can be effective and more efficient.
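
The reduction in trainable parameters can be measured by counting parameters with `requires_grad=True` before and after freezing. The sketch below uses a tiny stand-in model (`ToyModel` and `count_trainable` are hypothetical names, and the layer sizes are arbitrary) so it runs without downloading BERT.

```python
import torch.nn as nn

class ToyModel(nn.Module):
    """Hypothetical stand-in: a small 'backbone' plus one task head."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
        self.head = nn.Linear(16, 3)

def count_trainable(module):
    """Number of parameters that would receive gradient updates."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

model = ToyModel()
trainable_before = count_trainable(model)

# Freeze the backbone, same pattern as in train_multitask_model.
for param in model.backbone.parameters():
    param.requires_grad = False

trainable_after = count_trainable(model)

print(f"Trainable parameters before freezing: {trainable_before}")
print(f"Trainable parameters after freezing:  {trainable_after}")
```

The same `count_trainable` helper can be applied to the `MultiTaskTransformer` instance to see how much of the model is still being updated.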

#### **(2) What changes are made to the architecture to support multi-task learning?**
  * Multiple Heads:
    * Instead of a single classifier, the model has two separate layers (or “heads”): one for classification (`self.classifier`) and one for sentiment analysis (`self.sentiment_analyzer`).

  * Conditional Forward Logic:
    * In the `forward` method, the `task` argument ('classification' vs. 'sentiment') selects which head to use. The model shares the same transformer backbone but branches into different heads depending on the requested task.

  * Different Pooling for Different Tasks:
    * `cls_pooling` for classification
    * `mean_pooling` for sentiment
    * This is a design choice intended to suit the nature of each task.

#### **(3) When would it make sense to freeze one head while training the other?**
  * Task A Already Converged:
    * If Task A is performing well and you don’t want to risk degrading that performance, you can freeze its head and continue training only Task B’s head (and optionally the backbone) for additional epochs.
  * Task-Specific Priorities:
    * In some scenarios, you might want to preserve a stable classifier in production while experimenting with or improving the sentiment head.
  * Different Data Readiness:
    * If you have new data for Task B but no new data for Task A, you can keep A’s head frozen (so it remains stable) and fine-tune only B’s head.
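
Freezing one head while the other keeps training follows the same `requires_grad` pattern used for the backbone. The sketch below is a toy illustration under stated assumptions: `TwoHeadToy`, `head_a`, and `head_b` are hypothetical names, and fixed dummy features stand in for the pooled transformer output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TwoHeadToy(nn.Module):
    """Hypothetical stand-in: two linear heads on top of fixed features."""
    def __init__(self, hidden_size=4, num_classes=3):
        super().__init__()
        self.head_a = nn.Linear(hidden_size, num_classes)  # Task A (already converged)
        self.head_b = nn.Linear(hidden_size, num_classes)  # Task B (still training)

model = TwoHeadToy()

# Freeze Task A's head so its weights stay fixed.
for param in model.head_a.parameters():
    param.requires_grad = False

# Only parameters that still require gradients go to the optimizer.
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable_params, lr=0.1)
criterion = nn.CrossEntropyLoss()

features = torch.ones(2, 4)      # dummy pooled embeddings
labels = torch.tensor([0, 0])    # dummy labels, reused for both tasks

head_a_before = model.head_a.weight.detach().clone()
head_b_before = model.head_b.weight.detach().clone()

optimizer.zero_grad()
loss = criterion(model.head_a(features), labels) + criterion(model.head_b(features), labels)
loss.backward()
optimizer.step()

print("head_a unchanged:", torch.equal(model.head_a.weight, head_a_before))
print("head_b unchanged:", torch.equal(model.head_b.weight, head_b_before))
```

If the shared backbone is left trainable, Task A's predictions can still drift even with its head frozen, so it is worth monitoring Task A's validation metrics in that setup.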


#### **(4) When would you implement a multi-task model like the one in this assignment, and when would it make more sense to use two completely separate models, one per task?**

  1. Multi-Task Model:
    * Shared Representations:
      * If your tasks are related (e.g., sentence-level classification and sentiment, both about user feedback), the backbone can learn features useful for both tasks. This can improve data efficiency and performance, especially if you have limited data for at least one task.

    * Deployment Efficiency:
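      * A single model can produce multiple predictions from one shared backbone pass (less overhead than hosting two completely separate models). Note that `forward` in this demo returns one task per call, so as written it runs the backbone once per task.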

    * Regularization Benefit:
      * Jointly training on multiple tasks often acts as a form of regularization, preventing overfitting on one single task.

  2. Two Separate Models:
    * Very Different Tasks or Domains:
      * If tasks differ drastically (e.g., sentiment on tweets vs. question-answering on legal documents), the synergy of sharing a backbone might be minimal or even detrimental.

    * Scalability Concerns:
      * If you can afford separate models for each task and one task requires a specialized architecture, a separate model might be simpler to maintain.

    * Different Update Schedules:
      * If the tasks are maintained by different teams or require different release cycles, separate models might be more practical.

    * Inference Speed Requirements:
      * If one task's latency requirement differs significantly from the other's, separate models can be sized and served independently.

    * Performance Requirements per Task:
      * If one task's performance requirement differs significantly from the other's, separate models can be tuned independently.


#### **(5) When training the multi-task model, assume that Task A (classification) has abundant data, while Task B (sentiment analysis) has limited data. How would you handle this imbalance?**

  The following methods can help with this imbalance:

  1. Weighted Loss or Sampling:

    * Weighted Loss:
      * Assign a higher loss weight to Task B so it is not overshadowed by Task A’s abundant data. E.g., loss = alpha * classification_loss + beta * sentiment_loss, with beta > alpha to emphasize sentiment.

    * Weighted Sampler:
      * If using a single training loop that picks samples for both tasks, oversample the task with fewer samples (Task B) or undersample the abundant task (Task A) to balance how often each task is trained.

  2. Curriculum or Two-Phase Training:
    * Phase 1: Train the shared backbone + classification head on abundant classification data (or freeze the backbone if it’s already well-trained).

    * Phase 2: Fine-tune on the limited sentiment data with a relatively higher focus (unfreeze the backbone if needed). This ensures the backbone can also adapt to sentiment signals.

  3. Data Augmentation (for the smaller task):
    * If feasible, artificially augment the sentiment data: paraphrasing negative/positive statements, synonyms, or other text manipulation tools.

  4. Separate or Partial Freezing:
    * Freeze the classification head once it’s performing well and continue training the sentiment head (and maybe the backbone) a bit more to help the sentiment task catch up.

  Note: you don’t want the larger dataset (Task A) to dominate the other during training. Balancing or re-weighting each task’s contribution ensures you don’t neglect the underrepresented task (Task B).
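
The weighted loss from method 1 can be sketched as follows. The function name `weighted_multitask_loss` and the values `alpha=1.0`, `beta=2.0` are illustrative assumptions; in practice the weights would be tuned on a validation set:

```python
import torch
import torch.nn.functional as F


def weighted_multitask_loss(class_logits, class_targets,
                            sent_logits, sent_targets,
                            alpha=1.0, beta=2.0):
    """Combine both task losses; beta > alpha emphasizes the smaller sentiment task."""
    classification_loss = F.cross_entropy(class_logits, class_targets)
    sentiment_loss = F.cross_entropy(sent_logits, sent_targets)
    total = alpha * classification_loss + beta * sentiment_loss
    return total, classification_loss, sentiment_loss


# Toy batch: 4 samples, 3 classes per task (random logits, arbitrary labels).
torch.manual_seed(0)
class_logits = torch.randn(4, 3)
sent_logits = torch.randn(4, 3)
class_targets = torch.tensor([0, 1, 2, 0])
sent_targets = torch.tensor([2, 1, 0, 2])

total, cls_loss, sent_loss = weighted_multitask_loss(
    class_logits, class_targets, sent_logits, sent_targets
)
print(f"classification={cls_loss.item():.4f}  "
      f"sentiment={sent_loss.item():.4f}  total={total.item():.4f}")
```

Logging the two task losses separately, as done here, makes it easier to see whether one task is being neglected.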




#### **(6) Additional self-made question: in early experiments the average loss fluctuated. What can be concluded from that?**

Examples of average loss using only 10 samples to train the task heads:
```
[Epoch 1/5] Average Loss: 2.2200
[Epoch 2/5] Average Loss: 2.2187
[Epoch 3/5] Average Loss: 2.2625
[Epoch 4/5] Average Loss: 2.1628
[Epoch 5/5] Average Loss: 2.1679
Training complete!
```
Based on this observation, the following can be learned:

1. **The Loss is Fluctuating and Remains High**:
  * The values hover around ~2.16–2.26 and do not show a clear, steady downward trend. For a cross-entropy loss across two tasks (classification + sentiment, each with 3 classes), you’d hope to see the loss eventually drop below ~1.8 if the model is learning effectively. The fact that it stays around ~2.2 suggests limited or no meaningful learning on this dataset.

2. **Very Small Dataset**:
  * Using 10 samples, it’s extremely difficult to get stable losses or see a typical downward trend.
  * The model can easily memorize the small training set but still produce high average loss if it cannot fit those few samples well or if tasks conflict.

3. **Possible Over/Underfitting or Randomness**
  * With so few samples, each individual batch heavily influences the average loss.
  * Small changes in model weights can cause large swings in loss, especially with multi-task training.
  * Also, if any parameters are still frozen or incorrectly set (e.g., if the backbone was not actually unfrozen), the model may barely move from random initialization.
  * Taken together, a loss that stays around 2.2 with no clear downward trend likely means the model is not learning effectively on such a small dataset.

4. **Check the Training Configuration**
  * Are you freezing or unfreezing the backbone? If you froze the backbone and only have 10 training samples for the new heads, the heads might not converge well.
  * Learning rate: Is it too high (e.g., 2e-5 for a tiny dataset) or too low?
  * Batch size: With only 10 samples and a batch size of 5, each epoch performs only 2 updates, so any single example can skew the loss substantially.
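
As a reference point for the loss values above, it helps to compute the loss of a model that guesses uniformly. The sketch below assumes the reported loss is the unweighted sum of the two task losses (an assumption; the training script is not repeated here). Under that assumption, chance level for two 3-class tasks is 2·ln 3:

```python
import math

import torch
import torch.nn.functional as F

num_classes = 3

# All-zero logits mean the model assigns equal probability to every class.
uniform_logits = torch.zeros(5, num_classes)
targets = torch.tensor([0, 1, 2, 0, 1])  # arbitrary labels; the result does not depend on them

per_task_chance_loss = F.cross_entropy(uniform_logits, targets).item()  # equals ln(3)

# Assumption: total loss = classification loss + sentiment loss, unweighted.
total_chance_loss = 2 * per_task_chance_loss

print(f"per-task: {per_task_chance_loss:.4f}, total: {total_chance_loss:.4f}")
# prints: per-task: 1.0986, total: 2.1972
```

If that assumption holds, the observed ~2.16–2.26 sits at chance level, which is consistent with the conclusion that the heads have learned little from 10 samples.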

**Improvements made based on the above observations:**

  1. Gather more data: collected a dataset of 200 samples (100 short texts of fewer than 6 words each, and 100 long texts of more than 6 words each).
  2. Lower the learning rate or experiment with different rates (e.g., 5e-6 is used later).

#### What is learned from the new results of the trained model:
```
Text: Your support team was very helpful
Predicted Class: Service
Predicted Sentiment: Positive

Text: This new product is fantastic
Predicted Class: Product
Predicted Sentiment: Neutral

Text: The product quality is outstanding!
Predicted Class: Product
Predicted Sentiment: Positive

Text: Your customer service team was very helpful
Predicted Class: Service
Predicted Sentiment: Positive

Text: I have some general feedback to share
Predicted Class: General
Predicted Sentiment: Positive

Text: This product is completely useless
Predicted Class: Product
Predicted Sentiment: Neutral (**wrong!**)

Text: The service was mediocre at best
Predicted Class: Product (**wrong!**)
Predicted Sentiment: Positive

Text: I hate your horrible product
Predicted Class: Product
Predicted Sentiment: Positive (**wrong!**)
```
* Before the model was trained, every predicted class was Product and every predicted sentiment was Positive.
* After training, the model captured positive sentiment and classified more accurately than before, but it still failed to capture negative sentiment (see the predictions marked **wrong!** above).
* In this case, **class imbalance** or **insufficient negative examples** is a likely main cause.



In [23]:
from collections import Counter

sentiment_counts = Counter(sentiment_labels)
print(sentiment_counts)  # e.g., {0: 30, 1: 100, 2: 70}


Counter({2: 90, 1: 70, 0: 50})


The label distribution `Counter({2: 90, 1: 70, 0: 50})` shows:
```
90 samples of label 2
70 samples of label 1
50 samples of label 0
```
So the positive class has 90 samples, the neutral class has 70, and the negative class has 50. The imbalance is not extremely severe (for example, you don’t have one class with 5 samples and another with 100). However, the data is somewhat imbalanced because the negative class has significantly fewer samples than the positive one (50 vs 90).

#### **(7) Finally, how can data imbalance be handled, given the training and inference results above?**

1. **Data Collection & Balancing**
  * Collect More Negative Samples: If negative examples are underrepresented, try to add more negative data (real user complaints, negative reviews, etc.).

  * Data Augmentation: For underrepresented classes (e.g., negative), we can do text augmentation: paraphrasing, synonyms, changing sentence structure while keeping the negative sentiment. Tools like NLPAug or TextAttack (not used in this demo yet) can help generate variant sentences that maintain the same label.

  * Stratified Sampling / Weighted Sampling: In the `DataLoader`, we can weight classes inversely to their frequency so the model sees more negative samples during training.

2. **Class Weights or Focal Loss**
  * Class Weights in CrossEntropyLoss: Using `torch.nn.CrossEntropyLoss(weight=...)` or a custom weighted loss, we can give higher weight to negative samples. This counters the model’s tendency to ignore rare classes.
  * Focal Loss: Specifically designed to combat class imbalance by down-weighting easy examples and focusing more on hard, misclassified examples.

3. **Model Fine-Tuning Strategy**
  * Ensure Entire Backbone is Fine-Tuned (not implemented yet in this demo): If we only trained the heads, consider unfreezing the backbone once we have more data. This can significantly improve performance.

  * Hyperparameter Tuning: Adjust learning rate, batch size, or number of epochs. With more data, we might do 3–5 epochs or more. Evaluate on a validation set after each epoch, and consider early stopping or model checkpointing.
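
The class-weight and weighted-sampling options above can be sketched with the label distribution reported earlier (`Counter({2: 90, 1: 70, 0: 50})`). The placeholder `features` tensor, the batch size of 16, and the inverse-frequency weighting formula are assumptions for illustration, not part of this demo's training script:

```python
from collections import Counter

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy labels reproducing the reported distribution: 0 = negative, 1 = neutral, 2 = positive.
sentiment_labels = [2] * 90 + [1] * 70 + [0] * 50
counts = Counter(sentiment_labels)
num_classes = len(counts)
total = len(sentiment_labels)

# Option A: inverse-frequency class weights, total / (num_classes * count).
class_weights = torch.tensor(
    [total / (num_classes * counts[c]) for c in range(num_classes)],
    dtype=torch.float,
)
criterion = nn.CrossEntropyLoss(weight=class_weights)

# Option B: draw each sample with probability inversely proportional
# to its class frequency, so batches are roughly class-balanced.
sample_weights = torch.tensor(
    [1.0 / counts[y] for y in sentiment_labels], dtype=torch.double
)
sampler = WeightedRandomSampler(sample_weights, num_samples=total, replacement=True)

features = torch.randn(total, 4)  # placeholder standing in for tokenized text
dataset = TensorDataset(features, torch.tensor(sentiment_labels))
loader = DataLoader(dataset, batch_size=16, sampler=sampler)

print(class_weights)  # the negative class (index 0) gets the largest weight
```

The two options are usually alternatives rather than complements: applying both at once can over-correct toward the minority class.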

