# Fine-tuning BERT for Sequence Classification

This project fine-tunes a BERT (Bidirectional Encoder Representations from Transformers) model for sequence classification using LoRA (Low-Rank Adaptation). The dataset is AG News, hosted on the Hugging Face Hub as fancyzhx/ag_news.

To run this notebook in Colab, make sure the 🤗 Transformers, 🤗 Datasets, and 🤗 Accelerate libraries are installed. The notebook also uses PEFT, Evaluate, and bitsandbytes. These libraries support efficient experimentation with state-of-the-art natural language processing models.


In [2]:
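import accelerate
import bitsandbytes
import transformers

# Print the installed versions so the run can be reproduced.
print(transformers.__version__)
print(accelerate.__version__)
print(bitsandbytes.__version__)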

4.41.1
0.30.1
0.43.1


Load the dataset

In [3]:
from datasets import load_dataset

dataset = load_dataset("ag_news")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Generating train split: 120000 examples
Generating test split: 7600 examples

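The preprocess_function defines how the text is tokenized: each example is padded to a maximum length and truncated if it is longer. Because no max_length is passed, the tokenizer falls back to the model's maximum (512 tokens for bert-base-uncased). The map function applies this preprocessing to the entire dataset in batches.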

In [5]:
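from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def preprocess_function(examples):
    # Pad every example to the same length and truncate longer texts.
    return tokenizer(examples['text'], padding="max_length", truncation=True)

# batched=True passes many examples per call, which speeds up tokenization.
encoded_dataset = dataset.map(preprocess_function, batched=True)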

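To make the padding and truncation step concrete, here is a minimal pure-Python sketch of what happens to a single sequence of token ids. The helper `pad_or_truncate` and the example ids are hypothetical and are not part of the tokenizer API. The real tokenizer also keeps the closing special token when it truncates, which this toy version does not.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    ids = list(token_ids[:max_length])   # truncation: drop tokens past the limit
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)
    # Padding: fill the rest with pad_id; the mask is 0 for padded positions.
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


# A short sequence is padded, a long one is truncated (arbitrary example ids).
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The default `pad_id=0` mirrors the `pad_token_id` of 0 reported in the BERT configuration.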

This code draws the training and validation sets from the encoded dataset: 8,000 shuffled examples from the train split and 700 shuffled examples from the test split. The fixed seed makes the selection reproducible. You can increase the number of training samples.

In [7]:
train_dataset = encoded_dataset["train"].shuffle(seed=42).select(range(8000))
val_dataset = encoded_dataset["test"].shuffle(seed=42).select(range(700))

PEFT (Parameter-Efficient Fine-Tuning) is a technique that fine-tunes models with fewer trainable parameters, improving efficiency while maintaining performance.

This code initializes a BERT model for sequence classification with 4 labels, configures LoRA (Low-Rank Adaptation) for efficient fine-tuning, and applies the LoRA configuration to the model.
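As a rough illustration of the idea behind LoRA, the NumPy sketch below adds a scaled low-rank update to a frozen linear layer. The shapes, the random values, and the helper `lora_forward` are assumptions chosen for the example; dropout is omitted, and this is not the PEFT implementation itself.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 8, 6, 2, 2   # toy sizes; r and alpha mirror the config values

W = rng.normal(size=(d_out, d_in))   # frozen pre-trained weight
A = rng.normal(size=(r, d_in))       # trainable, randomly initialized
B = np.zeros((d_out, r))             # trainable, initialized to zero


def lora_forward(x, W, A, B, alpha, r):
    """Frozen linear layer plus the scaled low-rank update (alpha / r) * B @ A."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T


x = rng.normal(size=(3, d_in))
base_out = x @ W.T
lora_out = lora_forward(x, W, A, B, alpha, r)

# With B at zero, the adapted layer starts out identical to the base layer.
print(np.allclose(base_out, lora_out))

frozen = W.size
trainable = A.size + B.size
print(f"frozen: {frozen}, trainable: {trainable}")
```

Only `A` and `B` would receive gradients during training, which is why the trainable parameter count grows with `r * (d_in + d_out)` rather than `d_in * d_out`.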

In [102]:
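from peft import LoraConfig, get_peft_model
from transformers import BertForSequenceClassification

# Base classifier: pre-trained BERT encoder plus a new 4-way classification head.
model_no_qlora = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=4,
)

lora_config = LoraConfig(
    r=2,
    lora_alpha=2,
    # Note: BERT names its attention projections "query", "key", and "value",
    # so "q_proj", "k_proj", and "v_proj" match nothing here; only the "dense"
    # layers receive LoRA adapters.
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],
    lora_dropout=0.1,
    bias="none",
    task_type="SEQ_CLS",
)
model = get_peft_model(model_no_qlora, lora_config)

model.config.use_cache = False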

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [9]:
model.print_trainable_parameters()

trainable params: 227,332 || all params: 109,712,648 || trainable%: 0.2072
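The trainable parameter count can be reproduced by hand. In the Hugging Face BERT implementation the attention projections are named query, key, and value, so of the listed target modules only `dense` matches. The sketch below assumes that, and also assumes the classification head is kept trainable for the SEQ_CLS task type; the helper `lora_params` is hypothetical.

```python
hidden, intermediate, num_layers = 768, 3072, 12   # from the BERT config
r, num_labels = 2, 4                               # from the LoRA config and model


def lora_params(in_features, out_features, rank):
    """Parameters in one LoRA pair: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


# Each encoder layer has three modules named "dense".
per_layer = (
    lora_params(hidden, hidden, r)          # attention.output.dense
    + lora_params(hidden, intermediate, r)  # intermediate.dense
    + lora_params(intermediate, hidden, r)  # output.dense
)
pooler = lora_params(hidden, hidden, r)     # pooler.dense
classifier = hidden * num_labels + num_labels  # weight + bias of the head

trainable = num_layers * per_layer + pooler + classifier
all_params = 109_712_648                    # taken from the printed output
percent = 100 * trainable / all_params

print(trainable)
print(round(percent, 4))
```

The result matches the printed 227,332 trainable parameters, which supports the reading that the attention projections were not adapted in this run.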


In [10]:
model.config

BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.41.1",
  "type_vocab_size": 2,
  "use_cache": false,
  "vocab_size": 30522
}

This code loads the accuracy metric and defines a function to compute evaluation metrics by comparing model predictions to true labels.


In [11]:
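import evaluate
import numpy as np

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # eval_pred holds the model logits and the true labels for the eval set.
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row.
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)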

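The metric computation can be checked on a tiny hand-made batch. The sketch below uses plain NumPy as a stand-in for `metric.compute`; the function `accuracy_from_logits` and the logits are made up for illustration.

```python
import numpy as np


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose largest logit sits at the true label index."""
    predictions = np.argmax(logits, axis=-1)
    return float((predictions == labels).mean())


# Four examples, four classes (made-up values).
logits = np.array([
    [2.0, 0.1, 0.3, 0.0],   # predicts class 0
    [0.2, 1.5, 0.1, 0.4],   # predicts class 1
    [0.1, 0.2, 0.3, 3.0],   # predicts class 3
    [1.2, 0.9, 0.8, 0.1],   # predicts class 0
])
labels = np.array([0, 1, 3, 2])   # the last prediction is wrong

acc = accuracy_from_logits(logits, labels)
print(acc)  # 0.75
```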

The cells below set up the training arguments, initialize a Trainer with the model and datasets, and start training with evaluation at the end of each epoch.

In [12]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # run evaluation at the end of every epoch
    num_train_epochs=3,
)



In [13]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

In [14]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.4048        | 0.408162        | 0.861429 |
| 2     | 0.3362        | 0.391328        | 0.882857 |
| 3     | 0.3502        | 0.390637        | 0.881429 |




TrainOutput(global_step=3000, training_loss=0.4755001424153646, metrics={'train_runtime': 1998.7896, 'train_samples_per_second': 12.007, 'train_steps_per_second': 1.501, 'total_flos': 6331539456000000.0, 'train_loss': 0.4755001424153646, 'epoch': 3.0})
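The reported step count and throughput can be cross-checked with a little arithmetic. The sketch assumes the default per-device batch size of 8, a single device, and no gradient accumulation, none of which is set explicitly in the notebook.

```python
import math

num_train_samples = 8000      # size of train_dataset
num_epochs = 3                # num_train_epochs
per_device_batch_size = 8     # assumption: TrainingArguments default
num_devices = 1               # assumption: a single Colab accelerator
grad_accum_steps = 1          # assumption: default, no accumulation
train_runtime = 1998.7896     # seconds, from TrainOutput

effective_batch = per_device_batch_size * num_devices * grad_accum_steps
steps_per_epoch = math.ceil(num_train_samples / effective_batch)
total_steps = steps_per_epoch * num_epochs

steps_per_second = round(total_steps / train_runtime, 3)
samples_per_second = round(num_train_samples * num_epochs / train_runtime, 3)

print(total_steps)
print(steps_per_second, samples_per_second)
```

Under these assumptions the numbers agree with `global_step=3000`, `train_steps_per_second` of 1.501, and `train_samples_per_second` of 12.007.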

Validation loss decreases across all three epochs (0.408 to 0.391), while training loss drops from epoch 1 to epoch 2 and then rises slightly in epoch 3 (0.3362 to 0.3502). Accuracy improves from 0.861 to 0.883 and then plateaus at 0.881, suggesting the model is converging. To achieve better results, consider increasing the training set size and adjusting the LoRA rank (r).

In [18]:
output_dir = './main_model'
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)



('./main_model/tokenizer_config.json',
 './main_model/special_tokens_map.json',
 './main_model/vocab.txt',
 './main_model/added_tokens.json')

This code loads the base BERT model and attaches the saved LoRA adapter to it with PeftModel.from_pretrained. The adapter weights stay separate from the base weights at this point. Combining the two lets inference use both the pre-trained knowledge of the base model and the task-specific adaptations learned during fine-tuning.

In [104]:
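from peft import PeftModel
from transformers import AutoModelForSequenceClassification, AutoTokenizer

base_model_name = "bert-base-uncased"
adapter_model_name = "/content/main_model"

# Load the base model first, then attach the saved adapter weights to it.
model = AutoModelForSequenceClassification.from_pretrained(base_model_name, num_labels=4)
model = PeftModel.from_pretrained(model, adapter_model_name)

tokenizer = AutoTokenizer.from_pretrained(base_model_name)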

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
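Folding a LoRA adapter into the base weights amounts to adding the scaled low-rank product to each adapted weight matrix. PEFT offers `merge_and_unload()` on LoRA models for this purpose; the NumPy sketch below only illustrates the underlying arithmetic, with toy shapes and random values as assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, r, alpha = 8, 6, 2, 2   # toy sizes; r and alpha mirror the config values

W = rng.normal(size=(d_out, d_in))   # base weight
A = rng.normal(size=(r, d_in))       # adapter matrix A
B = rng.normal(size=(d_out, r))      # adapter matrix B (nonzero, as after training)
scale = alpha / r

x = rng.normal(size=(5, d_in))

# Adapter kept separate: base path plus low-rank path.
adapter_out = x @ W.T + scale * (x @ A.T) @ B.T

# Adapter merged: a single weight matrix with the update folded in.
W_merged = W + scale * (B @ A)
merged_out = x @ W_merged.T

print(np.allclose(adapter_out, merged_out))
```

Both paths give the same outputs, so merging changes how the weights are stored and how many matrix products are needed at inference, not the predictions.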


This function predicts the label for a given text by tokenizing it, passing it through the model, and identifying the label with the highest logit score.

In [72]:
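import torch

def predict_label(text):
    # Tokenize a single text into PyTorch tensors, truncating at BERT's 512-token limit.
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
    # Inference only needs a forward pass, so gradient tracking is disabled.
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
    # The predicted class is the index of the largest logit.
    predicted_label = torch.argmax(logits, dim=1).item()
    return predicted_label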
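The integer returned by the function can be turned into a readable class name and a confidence score. The sketch below is pure Python and assumes the label order given on the AG News dataset card (World, Sports, Business, Sci/Tech), since the model config only carries generic LABEL_0 to LABEL_3 names. The helper names and the example logits are hypothetical.

```python
import math

# Assumed label order, taken from the AG News dataset card.
AG_NEWS_LABELS = ["World", "Sports", "Business", "Sci/Tech"]


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)                            # subtract the max for stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def describe_prediction(logits):
    """Return the class name with the highest probability and that probability."""
    probs = softmax(logits)
    label_id = max(range(len(probs)), key=probs.__getitem__)
    return AG_NEWS_LABELS[label_id], probs[label_id]


# Made-up logits for one example.
name, confidence = describe_prediction([0.2, 3.1, -0.5, 0.4])
print(name, round(confidence, 3))
```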

In [98]:
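# No seed is passed, so a different 150-example subset is drawn on each run.
# The examples come from the same "test" split as val_dataset, so the two may overlap.
shuffled_test_dataset = encoded_dataset["test"].shuffle()

test_dataset = shuffled_test_dataset.select(range(150))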

In [99]:
test_dataset

Dataset({
    features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 150
})

This code evaluates the model's performance on the test dataset by counting the number of correct and incorrect predictions, then prints the results.

In [100]:
correct = 0
incorrect = 0
# Iterate over the examples directly instead of re-reading a whole column per index.
for example in test_dataset:
    prediction = predict_label(example['text'])
    if prediction == example['label']:
        correct += 1
    else:
        incorrect += 1

print("Correct guesses:", correct)
print("Incorrect guesses:", incorrect)

Correct guesses: 134
Incorrect guesses: 16
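These counts translate into an accuracy figure, and with only 150 examples it helps to attach a rough uncertainty estimate. The sketch below uses a normal-approximation interval, which is an assumption added here for illustration and not part of the original evaluation.

```python
import math

correct, incorrect = 134, 16          # counts printed above
n = correct + incorrect
accuracy = correct / n

# Normal-approximation standard error for a proportion (a rough estimate).
std_err = math.sqrt(accuracy * (1 - accuracy) / n)
low = accuracy - 1.96 * std_err
high = accuracy + 1.96 * std_err

print(f"accuracy = {accuracy:.4f}")
print(f"approx. 95% interval = [{low:.3f}, {high:.3f}]")
```

An accuracy of about 0.893 on this sample is in line with the validation accuracy of roughly 0.88 seen during training.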
