# Homework 10

In this homework, you will train a sentiment classifier on the [SST-2](https://huggingface.co/datasets/sst2) dataset using the pre-trained BERT model. For simplicity, I recommend using the [Hugging Face Transformers library](https://huggingface.co/docs/transformers/index). I've linked to corresponding tutorials below. You're welcome to use a different framework if you prefer.

# Problem 1

1. Fine-tune the pre-trained [DistilBERT](https://huggingface.co/distilbert-base-uncased) model on SST-2 and evaluate the results. You can find a tutorial for loading BERT and fine-tuning [here](https://huggingface.co/docs/transformers/training). In that tutorial, you will need to change the dataset from `"yelp_review_full"` to `"sst2"` and the model from `"bert-base-uncased"` to `"distilbert-base-uncased"`. You'll also need to modify the code since SST-2 is a two-class classification dataset (unlike the Yelp Reviews dataset, which is a five-class classification dataset).

In [1]:
# Install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.11.0-py3-none-any.whl (468 kB)
Collecting dill<0.3.7,>=0.3.0
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
Collecting huggingface-hub<1.0.0,>=0.11.0
  Downloading huggingface_hub-0.13.3-py3-none-any.whl (199 kB)
Collecting xxhash
  Downloading xxhash-3.2.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting aiohttp
  Downloading aioht

In [2]:
!{sys.executable} -m pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.27.4-py3-none-any.whl (6.8 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, transformers
Successfully installed tokenizers-0.13.2 transformers-4.27.4


In [3]:
!{sys.executable} -m pip install evaluate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.0


**Load Dataset**

In [4]:
from datasets import load_dataset
dataset = load_dataset("sst2")

Downloading and preparing dataset sst2/default to /root/.cache/huggingface/datasets/sst2/default/2.0.0/9896208a8d85db057ac50c72282bcb8fe755accc671a57dd8059d4e130961ed5...


Generating train split:   0%|          | 0/67349 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/872 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1821 [00:00<?, ? examples/s]

Dataset sst2 downloaded and prepared to /root/.cache/huggingface/datasets/sst2/default/2.0.0/9896208a8d85db057ac50c72282bcb8fe755accc671a57dd8059d4e130961ed5. Subsequent calls will reuse this data.


In [5]:
dataset["train"][100]

{'idx': 100, 'sentence': 'in memory ', 'label': 1}

**Tokenization**

In [6]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(examples):
    return tokenizer(examples["sentence"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

In [7]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["validation"].shuffle(seed=42)

**Train**

In [8]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'classifi

**Training hyperparameters**

In [9]:
from transformers import TrainingArguments
training_args = TrainingArguments(output_dir="test_trainer")

**Evaluate**

In [10]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

In [11]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
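
To make the metric concrete, here is a minimal sketch of what `compute_metrics` does with its inputs. The logits and labels below are made up for illustration; the real function receives the model's logits for the whole evaluation set, and the `accuracy` metric returns the same fraction wrapped in a dictionary.

```python
import numpy as np

# Made-up logits for four examples and two classes (0 = negative, 1 = positive).
logits = np.array([
    [2.0, -1.0],
    [0.1, 0.3],
    [-0.5, 1.5],
    [1.2, 0.4],
])
labels = np.array([0, 1, 1, 1])

# argmax over the last axis picks the class with the largest logit per example.
predictions = np.argmax(logits, axis=-1)

# Accuracy is the fraction of predictions that match the reference labels.
accuracy = float((predictions == labels).mean())

print(predictions.tolist(), accuracy)  # prints: [0, 1, 1, 0] 0.75
```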

In [12]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

**Trainer**

In [13]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

In [14]:
trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.458254 | 0.787844 |
| 2 | No log | 0.501943 | 0.850917 |
| 3 | No log | 0.594634 | 0.854358 |


TrainOutput(global_step=375, training_loss=0.2733328857421875, metrics={'train_runtime': 182.1282, 'train_samples_per_second': 16.472, 'train_steps_per_second': 2.059, 'total_flos': 397402195968000.0, 'train_loss': 0.2733328857421875, 'epoch': 3.0})
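
The numbers in `TrainOutput` can be reproduced with a little arithmetic, which also suggests why the training-loss column reads "No log". The sketch below assumes the `TrainingArguments` defaults were in effect on a single device (batch size 8, 3 epochs, logging every 500 steps); those values are assumptions, since the notebook never sets them explicitly.

```python
import math

num_train_examples = 1000  # from .select(range(1000)) above
batch_size = 8             # assumed default per-device batch size, one device
num_epochs = 3             # assumed default; matches 'epoch': 3.0 in the output
logging_steps = 500        # assumed default logging interval
train_runtime = 182.1282   # seconds, copied from TrainOutput

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_train_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(total_steps)                   # prints: 375
print(round(samples_per_second, 3))  # prints: 16.472
print(round(steps_per_second, 3))    # prints: 2.059

# Training stops before the first logging step is reached, so no training
# loss is recorded in the per-epoch table.
print(total_steps < logging_steps)   # prints: True
```

Under these assumptions, passing a smaller `logging_steps` to `TrainingArguments` would be one way to get training loss values into the table.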

2. Choose a different pre-trained BERT-style model from the [Hugging Face Model Hub](https://huggingface.co/models) and fine-tune it. There are tons of options - part of the homework is navigating the hub to find different models! I recommend picking a model that is smaller than BERT-Base (as DistilBERT is) just to make things computationally cheaper. Is the final validation accuracy higher or lower with this other model?

In [15]:
import torch

del model
del trainer
torch.cuda.empty_cache()

In [16]:
dataset = load_dataset("sst2")
dataset["train"][100]



{'idx': 100, 'sentence': 'in memory ', 'label': 1}

**Tokenization**

In [17]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def tokenize_function(examples):
    return tokenizer(examples["sentence"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

In [18]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["validation"].shuffle(seed=42)

**Train**

In [19]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'roberta.pooler.dense.weight', 'lm_head.decoder.weight', 'roberta.pooler.dense.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.weight', 'classi

In [20]:
from transformers import TrainingArguments
training_args = TrainingArguments(output_dir="test_trainer")

**Evaluate**

In [21]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

In [22]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

**Trainer**

In [23]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

In [24]:
trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.614589 | 0.711009 |
| 2 | No log | 0.3301 | 0.883028 |
| 3 | No log | 0.424832 | 0.880734 |


TrainOutput(global_step=375, training_loss=0.5162443440755208, metrics={'train_runtime': 357.0714, 'train_samples_per_second': 8.402, 'train_steps_per_second': 1.05, 'total_flos': 789333166080000.0, 'train_loss': 0.5162443440755208, 'epoch': 3.0})
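
Reading the final-epoch rows of the two training logs, `roberta-base` ends at a higher validation accuracy than `distilbert-base-uncased` (0.8807 versus 0.8544), at roughly twice the training time. The sketch below works through that comparison using only numbers copied from the logs; the variable names are illustrative. Both runs used the same 1,000-example training subset, so the gap may differ on the full training set.

```python
# Final-epoch accuracy and total training time (seconds) from the two runs.
runs = {
    "distilbert-base-uncased": {"accuracy": 0.854358, "train_runtime": 182.1282},
    "roberta-base": {"accuracy": 0.880734, "train_runtime": 357.0714},
}
validation_size = 872  # size of the SST-2 validation split

distil = runs["distilbert-base-uncased"]
roberta = runs["roberta-base"]

accuracy_gain = roberta["accuracy"] - distil["accuracy"]
runtime_ratio = roberta["train_runtime"] / distil["train_runtime"]

# Convert accuracies back into counts of correctly classified examples.
distil_correct = round(distil["accuracy"] * validation_size)
roberta_correct = round(roberta["accuracy"] * validation_size)

print(f"Accuracy gain: {accuracy_gain * 100:.2f} points")  # prints: Accuracy gain: 2.64 points
print(f"Extra correct examples: {roberta_correct - distil_correct}")  # prints: Extra correct examples: 23
print(f"Runtime ratio: {runtime_ratio:.2f}x")  # prints: Runtime ratio: 1.96x
```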

# Problem 2

Instead of fine-tuning the full model on a target dataset, it's also possible to use the output representations from a BERT-style model as input to a linear classifier and *only* train the classifier (leaving the rest of the pre-trained parameters fixed). You can do this easily using the [`sentence-transformers`](https://www.sbert.net/) library. Using `sentence-transformers` gives you back a fixed-length representation of a given text sequence. To achieve this, you need to:
1. Pick a pre-trained sentence Transformer.
2. Load the SST-2 dataset and feed the text from each example into the model.
3. Train a linear classifier on the representations.
4. Evaluate performance on the validation set.

For the second step, you can learn more about how to use Hugging Face datasets [here](https://huggingface.co/docs/datasets/index). For the third and fourth steps, you can do this directly in PyTorch, or you can just collect the learned representations and use them as feature vectors to train a linear classifier in any other library (e.g. [scikit-learn](https://scikit-learn.org/stable/modules/linear_model.html)).

After you complete the above steps, report whether the accuracy on the validation set is higher or lower with a fixed sentence Transformer than with the fine-tuned models from Problem 1.
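
As a rough illustration of steps 3 and 4 via the scikit-learn route, the sketch below trains a logistic regression classifier on precomputed feature vectors and scores it on a held-out split. The features here are synthetic Gaussian blobs standing in for sentence embeddings so that the example runs offline; `make_split`, the 16-dimensional feature size, and the split sizes are all hypothetical. With real data, the feature matrices would be the embeddings returned by the sentence Transformer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
dim = 16  # stand-in feature size; real sentence embeddings are larger


def make_split(n_per_class):
    """Return synthetic 'embeddings': one Gaussian blob per class."""
    negative = rng.normal(loc=-1.0, size=(n_per_class, dim))
    positive = rng.normal(loc=1.0, size=(n_per_class, dim))
    features = np.vstack([negative, positive])
    labels = np.array([0] * n_per_class + [1] * n_per_class)
    return features, labels


train_x, train_y = make_split(100)
val_x, val_y = make_split(50)

# The encoder is never updated; only this linear classifier is trained.
classifier = LogisticRegression(max_iter=1000)
classifier.fit(train_x, train_y)

val_accuracy = accuracy_score(val_y, classifier.predict(val_x))
print(f"Validation accuracy: {val_accuracy:.4f}")
```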

In [25]:
!pip install sentence-transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125942 sha256=e861e3847f00d07826ba

In [26]:
import torch
import torch.nn as nn
import torch.optim as optim
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
from sklearn.metrics import accuracy_score

In [33]:
del model
torch.cuda.empty_cache()

**Pre-trained sentence Transformer**

In [34]:
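# Load the pre-trained sentence Transformer.
# The checkpoint name is kept in `model_name` so that `model` is not reused for a string.
model_name = 'sentence-transformers/paraphrase-distilroberta-base-v2'
sentence_transformer = SentenceTransformer(model_name)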

**Load the data**

In [35]:
# Load the SST-2 dataset
dataset = load_dataset('sst2')
train_data = dataset['train']
validation_data = dataset['validation']



**Encode the text**

In [36]:
# Encode the text from each example in the train and validation sets
train_embeddings = sentence_transformer.encode(train_data['sentence'], convert_to_tensor=True)
validation_embeddings = sentence_transformer.encode(validation_data['sentence'], convert_to_tensor=True)

train_labels = torch.tensor(train_data['label'])
validation_labels = torch.tensor(validation_data['label'])

# Move embeddings and labels to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
train_embeddings, train_labels = train_embeddings.to(device), train_labels.to(device)
validation_embeddings, validation_labels = validation_embeddings.to(device), validation_labels.to(device)

**Train and Evaluate**

In [37]:
# Create a simple logistic regression model using PyTorch
input_size = train_embeddings.size(1)
num_classes = 2
logistic_regression = nn.Linear(input_size, num_classes).to(device)

In [38]:
# Set the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(logistic_regression.parameters(), lr=0.01)

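# Train the logistic regression model with full-batch gradient descent:
# each "epoch" is a single update computed on all training embeddings.
# Print train and validation accuracy after every epoch.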
num_epochs = 50
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = logistic_regression(train_embeddings)
    loss = criterion(outputs, train_labels)
    loss.backward()
    optimizer.step()

    # Compute training accuracy (from the outputs computed before this epoch's update)
    with torch.no_grad():
        _, train_predicted = torch.max(outputs, 1)
        train_accuracy = accuracy_score(train_labels.cpu(), train_predicted.cpu())

    # Compute validation accuracy
    with torch.no_grad():
        validation_outputs = logistic_regression(validation_embeddings)
        _, validation_predicted = torch.max(validation_outputs, 1)
        validation_accuracy = accuracy_score(validation_labels.cpu(), validation_predicted.cpu())

    print(f'Epoch [{epoch + 1}/{num_epochs}], '
          f'Train Accuracy: {train_accuracy:.4f}, '
          f'Validation Accuracy: {validation_accuracy:.4f}')

Epoch [1/50], Train Accuracy: 0.5448, Validation Accuracy: 0.5677
Epoch [2/50], Train Accuracy: 0.5958, Validation Accuracy: 0.5952
Epoch [3/50], Train Accuracy: 0.6379, Validation Accuracy: 0.6227
Epoch [4/50], Train Accuracy: 0.6687, Validation Accuracy: 0.6342
Epoch [5/50], Train Accuracy: 0.6918, Validation Accuracy: 0.6411
Epoch [6/50], Train Accuracy: 0.7092, Validation Accuracy: 0.6594
Epoch [7/50], Train Accuracy: 0.7241, Validation Accuracy: 0.6674
Epoch [8/50], Train Accuracy: 0.7337, Validation Accuracy: 0.6812
Epoch [9/50], Train Accuracy: 0.7427, Validation Accuracy: 0.6869
Epoch [10/50], Train Accuracy: 0.7508, Validation Accuracy: 0.6950
Epoch [11/50], Train Accuracy: 0.7585, Validation Accuracy: 0.7064
Epoch [12/50], Train Accuracy: 0.7658, Validation Accuracy: 0.7110
Epoch [13/50], Train Accuracy: 0.7719, Validation Accuracy: 0.7236
Epoch [14/50], Train Accuracy: 0.7773, Validation Accuracy: 0.7294
Epoch [15/50], Train Accuracy: 0.7822, Validation Accuracy: 0.7362
Epoc

As the output shows, validation accuracy is lower with a fixed sentence Transformer than with the fully fine-tuned models from Problem 1, which reached about 0.85 (DistilBERT) and 0.88 (RoBERTa). Fine-tuning updates all of the model's parameters, which lets the model specialize for the target task, so it typically performs better. The trade-off is cost: fine-tuning requires more computing resources than training only a linear classifier on top of fixed embeddings.
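
To give a sense of the difference in trainable parameters behind that resource trade-off, here is a back-of-the-envelope sketch. It assumes the sentence embeddings are 768-dimensional (the notebook reads the size from `train_embeddings.size(1)` rather than stating it) and uses the commonly cited figure of roughly 66 million parameters for DistilBERT; both numbers are assumptions, not values taken from the output above.

```python
embedding_dim = 768  # assumed embedding size of the sentence Transformer
num_classes = 2      # SST-2 is a binary task

# nn.Linear(embedding_dim, num_classes) has one weight per (input, class)
# pair plus one bias per class.
probe_params = embedding_dim * num_classes + num_classes

# Approximate, commonly cited parameter count for DistilBERT (assumption).
distilbert_params = 66_000_000

print(probe_params)  # prints: 1538
print(f"Linear classifier is about {distilbert_params // probe_params:,}x smaller")
```

With the encoder frozen, only these few parameters receive gradients, and the embeddings need to be computed just once rather than on every training step.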