# Fine-Tuning a Pretrained NLP Model for Text Classification

In this notebook, we'll walk through the process of fine-tuning a pretrained model from the Hugging Face Hub for a text classification task. We'll use the `transformers` library, which provides a high-level API for working with state-of-the-art NLP models.

## 1. Setting Up the Environment

First, we need to install the necessary libraries. We'll be using `transformers` for the model, `datasets` to load our data, `evaluate` to calculate our metrics, and `torch` as the backend.

In [None]:
!pip install transformers datasets evaluate torch

## 2. Loading the Dataset

For this demonstration, we'll use the AG News dataset, which is a collection of news articles categorized into four classes: World, Sports, Business, and Sci/Tech. This is a multi-class classification problem, which makes it a bit more interesting than a simple binary classification task.

In [1]:
from datasets import load_dataset

dataset = load_dataset("ag_news")


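Let's take a look at the structure of our dataset, then draw small shuffled subsets (1,000 training and 500 evaluation examples) so that fine-tuning runs quickly.

In [2]:
# print() is needed here: a notebook only displays a bare expression when it is the last line of the cell
print(dataset)
small_train_dataset = dataset["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = dataset["test"].shuffle(seed=42).select(range(500))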

And inspect the first ten examples.

In [10]:
small_train_dataset[0:10]

{'text': ['Bangladesh paralysed by strikes Opposition activists have brought many towns and cities in Bangladesh to a halt, the day after 18 people died in explosions at a political rally.',
  'Desiring Stability Redskins coach Joe Gibbs expects few major personnel changes in the offseason and wants to instill a culture of stability in Washington.',
  'Will Putin #39;s Power Play Make Russia Safer? Outwardly, Russia has not changed since the barrage of terrorist attacks that culminated in the school massacre in Beslan on Sept.',
  'U2 pitches for Apple New iTunes ads airing during baseball games Tuesday will feature the advertising-shy Irish rockers.',
  'S African TV in beheading blunder Public broadcaster SABC apologises after news bulletin shows footage of American beheaded in Iraq.',
  'A Cosmic Storm: When Galaxy Clusters Collide Astronomers have found what they are calling the perfect cosmic storm, a galaxy cluster pile-up so powerful its energy output is second only to the Big B

## 3. Preprocessing the Data

Before we can feed the text to our model, we need to convert it into a numeric format that the model can understand. This process is called tokenization. We'll use the tokenizer that ships with the pretrained checkpoint so that the text is split into tokens exactly as it was during the model's pretraining.

In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

train_tokenized_dataset = small_train_dataset.map(preprocess_function, batched=True)
test_tokenized_dataset = small_eval_dataset.map(preprocess_function, batched=True)


In [5]:
train_tokenized_dataset

Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 1000
})

The tokenized dataset has two new columns:

`input_ids`: This is the most important part for the model. The original text has been converted into a list of numbers. Each number is an ID that corresponds to a unique word or word piece (a "token") in the tokenizer's vocabulary. The model works with these numbers, not the raw text.

`attention_mask`: This is a list of 1s and 0s with the same length as `input_ids`. It tells the model which tokens to pay attention to. A 1 marks a real token, and a 0 marks padding that was added so that all sequences in a batch have the same length. This way, the model ignores the padding and focuses only on the actual content. Because we tokenized with `truncation=True` only, no padding has been added yet, so every mask value is currently 1; the 0s appear later, when batches are padded during training.
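To make the relationship between `input_ids` and `attention_mask` concrete, here is a minimal pure-Python sketch of padding a batch. It is a toy illustration, not the tokenizer's implementation: the helper `pad_batch`, the default `pad_id=0`, and the token IDs are all made up for this example.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in the batch.

    `pad_batch` and `pad_id` are hypothetical names used only for this sketch.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks a padding position.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two already-tokenized sequences of different lengths (IDs are arbitrary).
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The shorter sequence gains two padding IDs, and its mask gains two matching 0s, while the longer sequence is left unchanged.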

## 4. Loading the Pretrained Model

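Now, we'll load the pretrained model. We'll use `DistilBERT`, a smaller and faster distilled version of BERT, which makes it well suited to quick fine-tuning runs. The `AutoModelForSequenceClassification` class adds a randomly initialized classification head on top of the pretrained model, and `num_labels=4` sizes that head for our four news categories. The warning printed below is expected, since the head's weights are new and have not been trained yet.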

In [6]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=4
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
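The model only ever sees integer labels, so it can help to keep a mapping between those integers and the class names. The sketch below builds one, assuming the integer labels follow the order in which the four classes were listed earlier (World, Sports, Business, Sci/Tech). The variable names are illustrative.

```python
# Class names in the order listed in the dataset description above.
# Assumption: this order matches the integer values in the dataset's `label` column.
LABEL_NAMES = ["World", "Sports", "Business", "Sci/Tech"]

id2label = dict(enumerate(LABEL_NAMES))
label2id = {name: idx for idx, name in id2label.items()}

print(id2label)
print(label2id["Sports"])

# These dictionaries could be passed as `id2label=` and `label2id=` when calling
# `from_pretrained`, so that predictions are reported with readable names.
```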


## 5. Fine-Tuning the Model

We're now ready to fine-tune the model. We'll use the `Trainer` API from the `transformers` library, which simplifies the training process. We'll need to define a data collator for padding, a function to compute the metrics, and the training arguments.

In [7]:
import numpy as np
import evaluate
from transformers import DataCollatorWithPadding

# Use a data collator to handle padding.
# This is more efficient because it pads each batch to the length of its
# longest item, not to the overall maximum length.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

from transformers import TrainingArguments, Trainer

# Define the training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Create the Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized_dataset,
    eval_dataset=test_tokenized_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,  # pads each batch dynamically
    compute_metrics=compute_metrics,
)

trainer.train()

{'eval_loss': 0.5085667371749878, 'eval_accuracy': 0.862, 'eval_runtime': 3.0359, 'eval_samples_per_second': 164.698, 'eval_steps_per_second': 41.174, 'epoch': 1.0}
{'loss': 0.5052, 'grad_norm': 0.2347167432308197, 'learning_rate': 6.666666666666667e-06, 'epoch': 2.0}
{'eval_loss': 0.5042542815208435, 'eval_accuracy': 0.88, 'eval_runtime': 2.4461, 'eval_samples_per_second': 204.411, 'eval_steps_per_second': 51.103, 'epoch': 2.0}
{'eval_loss': 0.5332722067832947, 'eval_accuracy': 0.878, 'eval_runtime': 2.5518, 'eval_samples_per_second': 195.939, 'eval_steps_per_second': 48.985, 'epoch': 3.0}
{'train_runtime': 75.2995, 'train_samples_per_second': 39.841, 'train_steps_per_second': 9.96, 'train_loss': 0.4209159901936849, 'epoch': 3.0}

TrainOutput(global_step=750, training_loss=0.4209159901936849, metrics={'train_runtime': 75.2995, 'train_samples_per_second': 39.841, 'train_steps_per_second': 9.96, 'train_loss': 0.4209159901936849, 'epoch': 3.0})
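
The numbers in this log can be checked by hand. The sketch below reproduces the total step count and the learning rate logged at step 500, assuming a single device, no gradient accumulation, and a linear decay schedule with no warmup (which we believe matches the `Trainer` defaults here). The helper names are made up for illustration.

```python
import math


def total_steps(num_examples, batch_size, epochs):
    # One optimizer step per batch; the last partial batch still counts as a step.
    return math.ceil(num_examples / batch_size) * epochs


def linear_lr(step, total, base_lr):
    # Linear decay from base_lr at step 0 down to 0 at the final step (no warmup).
    return base_lr * max(0.0, (total - step) / total)


steps = total_steps(num_examples=1000, batch_size=4, epochs=3)
lr_at_500 = linear_lr(step=500, total=steps, base_lr=2e-5)
print(steps)      # 750, matching global_step in TrainOutput
print(lr_at_500)  # roughly 6.67e-06, matching the learning rate logged at step 500
```

With 1,000 examples and a batch size of 4, each epoch takes 250 steps, so three epochs give 750 steps. Two thirds of the way through, one third of the initial learning rate remains.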

## 6. Evaluating the Model

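After training, we can evaluate the fine-tuned model on our 500-example evaluation subset of the test split. Because no further training has happened, the loss and accuracy match the epoch 3 evaluation above.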

In [8]:
trainer.evaluate()


{'eval_loss': 0.5332722067832947,
 'eval_accuracy': 0.878,
 'eval_runtime': 3.235,
 'eval_samples_per_second': 154.56,
 'eval_steps_per_second': 38.64,
 'epoch': 3.0}
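
For intuition about what the accuracy metric measures, here is a small pure-Python sketch of the same idea as `compute_metrics`: take the index of the largest logit in each row as the prediction and count how often it equals the label. The logits and labels below are invented for illustration, and the helpers are not part of `evaluate`.

```python
def argmax(row):
    # Index of the largest value in a list of logits.
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, labels):
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Four made-up examples with four class logits each.
logits = [
    [2.0, 0.1, -1.0, 0.3],  # predicts class 0
    [0.2, 0.1, 1.5, 0.0],   # predicts class 2
    [0.0, 3.0, 0.5, 0.1],   # predicts class 1
    [0.1, 0.2, 0.3, 0.9],   # predicts class 3
]
labels = [0, 2, 1, 2]

acc = accuracy(logits, labels)
print(acc)  # three of the four predictions match their labels
```

By the same arithmetic, an `eval_accuracy` of 0.878 on 500 examples corresponds to 439 correct predictions.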

## 7. Logging the Experiment and the Model to JFrogML

In [None]:
import frogml.huggingface

frogml.huggingface.log_model(
    model=trainer.model,
    tokenizer=trainer.tokenizer,
    model_name="text_classification",
    repository="nlp-models",
    version="finetuned-agnews-v1.2",
)

2025-06-11 17:28:01,581 - INFO - frogml_storage._log_config.frog_ml.__upload_model:528 - Model: "text_classification", version: "finetuned-agnews-v1.0" has been uploaded successfully





## 8. Conclusion

In this notebook, we fine-tuned a pretrained NLP model for a text classification task. We started by loading and preprocessing the data, then loaded a pretrained model and fine-tuned it on a 1,000-example subset of AG News. Finally, we evaluated the model, which reached about 88% accuracy on 500 test examples, and logged it to JFrogML. This process, known as transfer learning, lets us reach strong accuracy with little labeled data and, in this run, about 75 seconds of training, without having to train a model from scratch.