# Lightweight Fine-Tuning Project

TODO: In this cell, describe your choices for each of the following

- PEFT technique: Layer-wise freezing. The pre-trained DistilBERT encoder (`model.base_model`) is frozen, and only the classification head (the `pre_classifier` and `classifier` layers, which are newly initialized) is trained.

- Model: DistilBERT is a smaller, faster, cheaper, and lighter version of BERT. It retains 97% of BERT's language understanding capacity while being 60% faster and 40% smaller. It's well-suited for sequence classification tasks and is efficient for lightweight fine-tuning.

- Evaluation approach: 
    - Accuracy: Measures the proportion of correct predictions.
        
    - Precision, Recall, and F1-Score: These metrics provide detailed insights into the model's performance for each class. Precision is the accuracy of positive predictions, recall is the ability to find all relevant cases, and F1-score balances both.
        
    - Confusion Matrix: Shows the distribution of correct and incorrect predictions across all classes.

- Fine-tuning dataset: The AG News dataset is a popular text classification dataset that is used to categorize news articles into one of four classes: World, Sports, Business, and Sci/Tech. It's a balanced dataset with a manageable size for lightweight fine-tuning.
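
The metric definitions above can be made concrete with a small worked example. The sketch below computes precision, recall, and F1 for a single class from two label lists. The helper name `precision_recall_f1` and the toy labels are hypothetical and are not taken from the AG News runs in this notebook.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall, and F1 for one class treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical toy labels (0=World, 1=Sports, 2=Business, 3=Sci/Tech)
y_true = [0, 0, 1, 1, 2, 3, 3, 3]
y_pred = [0, 3, 1, 1, 2, 3, 3, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive=3)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In practice `sklearn.metrics.classification_report`, used later in this notebook, computes the same quantities for every class at once.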

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [178]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the DistilBERT model and tokenizer
model_name = 'distilbert-base-uncased'
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=4,
    id2label={0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"},
    label2id={"World": 0, "Sports": 1, "Business": 2, "Sci/Tech": 3},
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [172]:
from datasets import load_dataset

# Load the AG News dataset
dataset = load_dataset('ag_news')
dataset['train'] = dataset['train'].shuffle(seed=23).select(range(5000))
dataset['test'] = dataset['test'].shuffle(seed=23).select(range(1000))
print(f"dataset:\n{dataset}")
# Preprocess the dataset
def preprocess_function(examples):
    return tokenizer(examples['text'], truncation=True, padding=True)

encoded_dataset = dataset.map(preprocess_function, batched=True)
print(f"encoded_dataset:\n{encoded_dataset}")

dataset:
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1000
    })
})



encoded_dataset:
DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 1000
    })
})


In [202]:
import torch

def predict(texts):
    # Tokenize the batch, run a forward pass without gradients, and pick the top class
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)
    return predictions

small_test_dataset = dataset['test'].select(range(500))
test_texts = small_test_dataset['text']
true_labels = small_test_dataset['label']

In [203]:
predicted_labels = predict(test_texts).cpu().numpy()

In [204]:
from sklearn.metrics import classification_report, confusion_matrix

report = classification_report(true_labels, predicted_labels, target_names=["World", "Sports", "Business", "Sci/Tech"])
print(report)

              precision    recall  f1-score   support

       World       0.92      0.09      0.16       123
      Sports       0.00      0.00      0.00       122
    Business       0.21      0.26      0.23       136
    Sci/Tech       0.31      0.82      0.45       119

    accuracy                           0.29       500
   macro avg       0.36      0.29      0.21       500
weighted avg       0.36      0.29      0.21       500





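The model's accuracy is 29%, barely above the 25% expected from random guessing among four roughly balanced classes. This is expected, because the classification head is randomly initialized. The World category has a high precision of 92% but a recall of only 9%. The Sports category has zero precision, recall, and F1-score because the model never predicted it. The Sci/Tech category has the highest recall (82%), but its precision is low (31%) because the model assigns this label to most inputs.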

In [205]:
confusion_mx = confusion_matrix(true_labels, predicted_labels)
print(f"Confusion Matrix:\n{confusion_mx}")

Confusion Matrix:
[[11  0 28 84]
 [ 0  0 87 35]
 [ 1  0 36 99]
 [ 0  0 22 97]]


The confusion matrix (rows are true classes, columns are predicted classes) confirms that the model struggled. Only 11 of the 123 World articles were identified correctly; most of the rest (84) were labeled Sci/Tech. No Sports article was classified correctly: 87 were labeled Business and 35 Sci/Tech. Business and Sci/Tech had more correct predictions (36 and 97), but 99 Business articles were labeled Sci/Tech and 22 Sci/Tech articles were labeled Business.
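
One way to read the matrix above is to compare its row sums (how many articles truly belong to each class) with its column sums (how often the model predicted each class). The sketch below does this in plain Python, using the values printed above and assuming scikit-learn's convention that rows are true labels and columns are predictions. The variable names are illustrative only.

```python
CLASS_NAMES = ["World", "Sports", "Business", "Sci/Tech"]

# Baseline confusion matrix copied from the output above
baseline_cm = [
    [11, 0, 28, 84],
    [0, 0, 87, 35],
    [1, 0, 36, 99],
    [0, 0, 22, 97],
]

support = [sum(row) for row in baseline_cm]          # true count per class
predicted = [sum(col) for col in zip(*baseline_cm)]  # predicted count per class
correct = sum(baseline_cm[i][i] for i in range(len(CLASS_NAMES)))  # diagonal
accuracy = correct / sum(support)

for name, n_true, n_pred in zip(CLASS_NAMES, support, predicted):
    print(f"{name:<9} true={n_true:>3} predicted={n_pred:>3}")
print(f"accuracy = {correct}/{sum(support)} = {accuracy:.3f}")
```

The column sum for Sports is zero, which is why its precision is undefined and reported as 0.00, while Sci/Tech is predicted for 315 of the 500 articles.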

## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [125]:
model_name = 'distilbert-base-uncased'
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=4,
    id2label={0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"},
    label2id={"World": 0, "Sports": 1, "Business": 2, "Sci/Tech": 3},
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [126]:
for param in model.base_model.parameters():
    param.requires_grad = False

model.classifier

Linear(in_features=768, out_features=4, bias=True)
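
A quick way to confirm that freezing took effect is to count trainable versus total parameters. The sketch below uses a hypothetical `TinyClassifier` stand-in instead of DistilBERT so that it runs without downloading weights. Its `base_model` and `classifier` attributes are named to mirror the cell above, and `count_parameters` is an illustrative helper, not part of any library.

```python
import torch
from torch import nn


class TinyClassifier(nn.Module):
    """Hypothetical stand-in: a small 'base' encoder plus a classification head."""

    def __init__(self, hidden=16, num_labels=4):
        super().__init__()
        self.base_model = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, x):
        return self.classifier(self.base_model(x))


def count_parameters(model):
    """Return (trainable, total) parameter counts for a torch module."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable, total


toy = TinyClassifier()
for param in toy.base_model.parameters():
    param.requires_grad = False  # freeze the base, as in the cell above

trainable, total = count_parameters(toy)
print(f"trainable: {trainable} / {total} ({100 * trainable / total:.1f}%)")
```

Calling the same helper on the notebook's `model` after the freezing loop should report only the head's parameters as trainable.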

In [127]:
print(model)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [128]:
import numpy as np
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}

In [129]:
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./data/categorize_news",
        learning_rate=2e-3,
        # Reduce the batch size if you don't have enough memory
        per_device_train_batch_size=32,
        per_device_eval_batch_size=32,
        num_train_epochs=1,
        weight_decay=0.01,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
    ),
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,
)

In [130]:
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.371939,0.893


Checkpoint destination directory ./data/categorize_news/checkpoint-157 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=157, training_loss=0.4316483515842705, metrics={'train_runtime': 68.8347, 'train_samples_per_second': 72.638, 'train_steps_per_second': 2.281, 'total_flos': 456666597240000.0, 'train_loss': 0.4316483515842705, 'epoch': 1.0})

In [116]:
trainer.evaluate()

{'eval_loss': 0.370482474565506,
 'eval_accuracy': 0.889,
 'eval_runtime': 5.8344,
 'eval_samples_per_second': 171.396,
 'eval_steps_per_second': 5.485,
 'epoch': 1.0}

In [149]:
model.save_pretrained('./model/fine_tuned_model')
#tokenizer.save_pretrained('./model/fine_tuned_model')

## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [206]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [150]:
fine_tuned_model = AutoModelForSequenceClassification.from_pretrained('./model/fine_tuned_model')
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [157]:
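import torch

# The model and its inputs must be on the same device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
fine_tuned_model.to(device).eval()

def predict(texts):
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = fine_tuned_model(**inputs)  # the reloaded model, not the in-memory `model`
    predictions = torch.argmax(outputs.logits, dim=-1)
    return predictions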

In [158]:
test_texts = dataset['test']['text']
true_labels = dataset['test']['label']

In [159]:
predicted_labels = predict(test_texts).cpu().numpy()

report = classification_report(true_labels, predicted_labels, target_names=["World", "Sports", "Business", "Sci/Tech"])
print(report)

              precision    recall  f1-score   support

       World       0.89      0.87      0.88       241
      Sports       0.94      1.00      0.97       253
    Business       0.88      0.84      0.86       256
    Sci/Tech       0.86      0.86      0.86       250

    accuracy                           0.89      1000
   macro avg       0.89      0.89      0.89      1000
weighted avg       0.89      0.89      0.89      1000



In [160]:
accuracy = accuracy_score(true_labels, predicted_labels)
print(f"Accuracy: {accuracy:.4f}")

Accuracy: 0.8930


In [168]:
confusion_mx = confusion_matrix(true_labels, predicted_labels)
print(f"Confusion Matrix:\n{confusion_mx}")

Confusion Matrix:
[[210  13   9   9]
 [  0 252   1   0]
 [ 14   1 215  26]
 [ 13   2  19 216]]


The confusion matrix shows that the model correctly predicted most instances of each class, with some misclassifications. It performed best on Sports, with only one of 253 articles misclassified. Business had the most misclassified articles (41 of 256), and the most common confusion was between Business and Sci/Tech: 26 Business articles were labeled Sci/Tech and 19 Sci/Tech articles were labeled Business.
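
The most frequent confusions can also be ranked programmatically. The sketch below takes the fine-tuned confusion matrix printed above (assuming rows are true labels and columns are predictions, as scikit-learn returns them), computes per-class recall, and sorts the off-diagonal entries by count. The variable names are illustrative only.

```python
CLASS_NAMES = ["World", "Sports", "Business", "Sci/Tech"]

# Fine-tuned confusion matrix copied from the output above
fine_tuned_cm = [
    [210, 13, 9, 9],
    [0, 252, 1, 0],
    [14, 1, 215, 26],
    [13, 2, 19, 216],
]

# Recall per class: diagonal entry divided by the row total
recall = {
    name: row[i] / sum(row)
    for i, (name, row) in enumerate(zip(CLASS_NAMES, fine_tuned_cm))
}

# Off-diagonal entries as (count, true class, predicted class), largest first
confusions = sorted(
    (
        (count, CLASS_NAMES[i], CLASS_NAMES[j])
        for i, row in enumerate(fine_tuned_cm)
        for j, count in enumerate(row)
        if i != j
    ),
    reverse=True,
)

for name, value in recall.items():
    print(f"{name:<9} recall={value:.3f}")
for count, true_name, pred_name in confusions[:3]:
    print(f"{count:>3} x {true_name} predicted as {pred_name}")
```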

## Conclusion

The initial model, whose classification head was randomly initialized, performed poorly, reaching only 29% accuracy on 500 test examples. After one epoch of training only the classification head on 5,000 AG News examples, accuracy rose to 89% on 1,000 test examples, with precision and recall of at least 0.84 in every category.

This shows that even lightweight fine-tuning, with the pre-trained encoder frozen, can substantially improve the model's performance on this classification task.