# Lightweight Fine-Tuning Project

TODO: In this cell, describe your choices for each of the following

- PEFT technique: In this process, we use the LoRA (Low-Rank Adaptation) technique. This method allows us to fine-tune the model efficiently by making small adjustments to certain layers of the pre-trained model. Specifically, we will focus on the attention layers to enhance the model's performance.

- Model: DistilBERT is a smaller, faster, cheaper, and lighter version of BERT. It retains 97% of BERT's language understanding capacity while being 60% faster and 40% smaller. It's well-suited for sequence classification tasks and is efficient for lightweight fine-tuning.

- Evaluation approach: 
    - Accuracy: Measures the proportion of correct predictions.
        
    - Precision, Recall, and F1-Score: These metrics provide detailed insights into the model's performance for each class. Precision is the accuracy of positive predictions, recall is the ability to find all relevant cases, and F1-score balances both.
        
    - Confusion Matrix: Shows the distribution of correct and incorrect predictions across all classes.

- Fine-tuning dataset: The AG News dataset is a popular text classification dataset that is used to categorize news articles into one of four classes: World, Sports, Business, and Sci/Tech. It's a balanced dataset with a manageable size for lightweight fine-tuning.

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [4]:
!pip install scikit-learn

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Defaulting to user installation because normal site-packages is not writeable
Collecting scikit-learn
  Downloading scikit_learn-1.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.3/13.3 MB[0m [31m61.3 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
Collecting joblib>=1.2.0
  Downloading joblib-1.4.2-py3-none-any.whl (301 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m301.8/301.8 kB[0m [31m14.0 MB/s[0m eta [36m0:00:00[0m
Collecting threadpoolctl>=3.1.0
  Downloading threadpoolctl-3.5.0-py3-none-any.whl (18 kB)
Installing collected packages: threadpoolctl, joblib, scikit-learn
Successfully installed joblib-1.4.2 scikit-learn-1.5.0 threadpoolctl-3.5.0


In [5]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score,precision_recall_fscore_support

In [6]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
# Load the DistilBERT model and tokenizer
model_name = 'distilbert-base-uncased'
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=4,
    id2label={0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"},
    label2id={"World": 0, "Sports": 1, "Business": 2, "Sci/Tech": 3},
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'classifier.weight', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [7]:
from datasets import load_dataset

# Load the AG News dataset
dataset = load_dataset('ag_news')
dataset['train'] = dataset['train'].shuffle(seed=23).select(range(5000))
dataset['test'] = dataset['test'].shuffle(seed=23).select(range(1000))
print(f"dataset:\n{dataset}")
# Preprocess the dataset
def preprocess_function(examples):
    return tokenizer(examples['text'], truncation=True, padding=True)

encoded_dataset = dataset.map(preprocess_function, batched=True)
print(f"encoded_dataset:\n{encoded_dataset}")

dataset:
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1000
    })
})


Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

encoded_dataset:
DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 1000
    })
})


In [None]:
def predict(texts):
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)
    return predictions

small_test_dataset = dataset['test'].select(range(500))
test_texts = small_test_dataset['text']
true_labels = small_test_dataset['label']

In [203]:
predicted_labels = predict(test_texts).cpu().numpy()

In [204]:
report = classification_report(true_labels, predicted_labels, target_names=["World", "Sports", "Business", "Sci/Tech"])
print(report)

              precision    recall  f1-score   support

       World       0.92      0.09      0.16       123
      Sports       0.00      0.00      0.00       122
    Business       0.21      0.26      0.23       136
    Sci/Tech       0.31      0.82      0.45       119

    accuracy                           0.29       500
   macro avg       0.36      0.29      0.21       500
weighted avg       0.36      0.29      0.21       500



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


The model's accuracy is 29%, with poor performance overall. The World category has a high precision of 92% but very low recall, while the Sports category performed the worst with zero precision, recall, and f1-score. The Sci/Tech category did relatively better with a recall of 82%, but its precision is still low.

In [205]:
confusion_mx = confusion_matrix(true_labels, predicted_labels)
print(f"Confusion Matrix:\n{confusion_mx}")

Confusion Matrix:
[[11  0 28 84]
 [ 0  0 87 35]
 [ 1  0 36 99]
 [ 0  0 22 97]]


The confusion matrix shows that the model struggled with accurate predictions. The model correctly identified 11 instances of the first class but misclassified many others, particularly confusing them with the fourth class. The second class had no correct predictions, with most being misclassified as the third or fourth class. The third and fourth classes had better performance but still faced significant misclassifications, mainly within each other.

## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [68]:
from peft import get_peft_model, LoraConfig, AutoPeftModelForSequenceClassification

In [69]:
model_name = 'distilbert-base-uncased'
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=4,
    id2label={0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"},
    label2id={"World": 0, "Sports": 1, "Business": 2, "Sci/Tech": 3},
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'classifier.weight', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [70]:
lora_config = LoraConfig(
    task_type="SEQ_CLS",
    r=4,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules= ["q_lin", "k_lin", "v_lin"]
)

In [71]:
peft_model = get_peft_model(model, lora_config, "default")

In [72]:
peft_model

PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): DistilBertForSequenceClassification(
      (distilbert): DistilBertModel(
        (embeddings): Embeddings(
          (word_embeddings): Embedding(30522, 768, padding_idx=0)
          (position_embeddings): Embedding(512, 768)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (transformer): Transformer(
          (layer): ModuleList(
            (0-5): 6 x TransformerBlock(
              (attention): MultiHeadSelfAttention(
                (dropout): Dropout(p=0.1, inplace=False)
                (q_lin): Linear(
                  in_features=768, out_features=768, bias=True
                  (lora_dropout): ModuleDict(
                    (default): Dropout(p=0.1, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=768, out_features=4, bias=Fals

In [73]:
import numpy as np
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    precision, recall, f1, _ = precision_recall_fscore_support(p.label_ids, preds, average='weighted')
    acc = accuracy_score(p.label_ids, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

In [74]:
training_args = TrainingArguments(
    output_dir="./data/categorize_news",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    evaluation_strategy='epoch',
    weight_decay=0.01,
    save_strategy="epoch",
    num_train_epochs=4,
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=encoded_dataset['train'],
    eval_dataset=encoded_dataset['test'],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

In [75]:
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,0.846975,0.845,0.843264,0.845959,0.845
2,No log,0.441993,0.87,0.869205,0.868875,0.87
3,No log,0.392974,0.871,0.870172,0.869868,0.871
4,0.701100,0.384063,0.876,0.875291,0.875095,0.876


Checkpoint destination directory ./data/categorize_news/checkpoint-157 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./data/categorize_news/checkpoint-314 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=628, training_loss=0.6345494689455458, metrics={'train_runtime': 525.7429, 'train_samples_per_second': 38.041, 'train_steps_per_second': 1.195, 'total_flos': 1856498842560000.0, 'train_loss': 0.6345494689455458, 'epoch': 4.0})

In [76]:
peft_model.save_pretrained("model/lora-tuned-model")

In [77]:
trainer.evaluate()

{'eval_loss': 0.3840634822845459,
 'eval_accuracy': 0.876,
 'eval_f1': 0.8752911996601751,
 'eval_precision': 0.8750954105500008,
 'eval_recall': 0.876,
 'eval_runtime': 6.2762,
 'eval_samples_per_second': 159.332,
 'eval_steps_per_second': 5.099,
 'epoch': 4.0}

## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [78]:
from peft import AutoPeftModelForSequenceClassification, PeftModel
fine_tuned_model = AutoPeftModelForSequenceClassification.from_pretrained('model/lora-tuned-model',num_labels = 4)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'classifier.weight', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [64]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [79]:
def predict(texts):
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)
    return predictions

In [80]:
test_texts = dataset['test']['text']
true_labels = dataset['test']['label']

In [81]:
predicted_labels = predict(test_texts).cpu().numpy()

report = classification_report(true_labels, predicted_labels, target_names=["World", "Sports", "Business", "Sci/Tech"])
print(report)

              precision    recall  f1-score   support

       World       0.88      0.85      0.86       241
      Sports       0.94      0.98      0.96       253
    Business       0.84      0.81      0.83       256
    Sci/Tech       0.85      0.86      0.85       250

    accuracy                           0.88      1000
   macro avg       0.88      0.88      0.88      1000
weighted avg       0.88      0.88      0.88      1000



In [82]:
accuracy = accuracy_score(true_labels, predicted_labels)
print(f"Accuracy: {accuracy:.4f}")

Accuracy: 0.8760


In [83]:
confusion_mx = confusion_matrix(true_labels, predicted_labels)
print(f"Confusion Matrix:\n{confusion_mx}")

Confusion Matrix:
[[205  12  18   6]
 [  3 248   1   1]
 [ 13   3 208  32]
 [ 13   2  20 215]]


The confusion matrix shows that the model correctly predicted most instances for each class but had some misclassifications. Overall, it performed best on the second class with only one misclassification and worst on the fourth class with more errors across the other classes.

## Conclusion

The initial model, without any fine-tuning, generated poor results. However, after fine-tuning the model, its performance significantly improved. The fine-tuned model achieved a much higher accuracy of 87%, with high precision and recall across all categories.

This demonstrates the importance of fine-tuning in enhancing the model's reliability in classifying different categories, proving that fine-tuning can significantly elevate the model's performance.