Text classification is one of the most practical tasks in NLP. If you’re starting out and want to build your first real-world text classification pipeline, this article will guide you through every step. Here, I’ll show you how to create a complete pipeline using Hugging Face Transformers, from data preparation to final predictions.

**Dataset: 20 Newsgroups**

We’ll use the 20 Newsgroups dataset, a well-known text classification dataset containing ~18,000 newsgroup documents categorized into 20 topics. Let’s get started:

In [2]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split

# load dataset
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

texts = newsgroups.data
labels = newsgroups.target
label_names = newsgroups.target_names

# split into train and test
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42)

**Tokenization with Hugging Face**

Before we can feed our text into a transformer model like DistilBERT, we need to convert the raw text into the numerical format the model expects, a step known as tokenization. Unlike traditional NLP methods that rely on word-level features, transformer models use subword tokenization, which splits rare or unseen words into smaller pieces drawn from a fixed vocabulary. Hugging Face provides pre-trained tokenizers matched to each of its models. In our case, we’ll use the distilbert-base-uncased tokenizer to process the newsgroup posts:

In [3]:
from transformers import AutoTokenizer

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# tokenize text data
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=512)
test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=512)

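This tokenizer lowercases the text, truncates each document to at most 512 tokens, and pads shorter documents to the length of the longest one in the call. Each input text becomes a list of input_ids plus an attention_mask that marks which positions hold real tokens and which hold padding.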
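To make subword tokenization concrete, here is a minimal sketch of greedy longest-match-first splitting, the strategy WordPiece tokenizers such as DistilBERT’s apply to each word. The tiny `toy_vocab` and the `wordpiece` and `encode` helpers are hypothetical and exist only for illustration; the real tokenizer uses a vocabulary of roughly 30,000 entries, maps tokens to integer IDs, and does more text normalization than this.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercased word into subword pieces, longest match first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def encode(text, vocab, max_length=8):
    """Lowercase, split, truncate, and pad to a fixed length (a simplification)."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    # Truncate the content but always keep room for the closing [SEP].
    tokens = tokens[: max_length - 1] + ["[SEP]"]
    n_real = len(tokens)
    attention_mask = [1] * n_real + [0] * (max_length - n_real)
    tokens = tokens + ["[PAD]"] * (max_length - n_real)
    return tokens, attention_mask


# Hypothetical vocabulary, far smaller than the real one.
toy_vocab = {"token", "##ization", "##ize", "news", "##group", "##s"}

tokens, attention_mask = encode("Tokenization newsgroups", toy_vocab)
print(tokens)
# ['[CLS]', 'token', '##ization', 'news', '##group', '##s', '[SEP]', '[PAD]']
print(attention_mask)
# [1, 1, 1, 1, 1, 1, 1, 0]
```

The attention mask is what lets the model ignore the `[PAD]` position at the end.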

**Create a PyTorch Dataset**

Now that we’ve tokenized our text data, the next step is to structure it in a format suitable for training with PyTorch and Hugging Face’s Trainer API. Transformer models expect inputs like input_ids, attention_mask, and labels to be provided as tensors. To streamline this process and allow efficient data loading during training, we’ll create a custom Dataset class by subclassing torch.utils.data.Dataset:

In [4]:
import torch

class NewsGroupDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            key: torch.tensor(val[idx]) for key, val in self.encodings.items()
        } | {"labels": torch.tensor(self.labels[idx])}

train_dataset = NewsGroupDataset(train_encodings, train_labels)
test_dataset = NewsGroupDataset(test_encodings, test_labels)

This class will take the tokenized encodings and corresponding labels and return them in the format the model expects.
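The tokenizer returns its output column by column (one list per field, covering all documents), while training needs one record per document. The sketch below shows that pivot with plain Python lists instead of tensors. The token IDs, the labels, and the `get_item` and `collate` helpers are made up for illustration; `collate` only approximates what the Trainer’s default collator does when it stacks items into a batch.

```python
# Column-oriented encodings, as a tokenizer would return them (values are illustrative).
encodings = {
    "input_ids": [[101, 7592, 102, 0], [101, 2088, 999, 102]],
    "attention_mask": [[1, 1, 1, 0], [1, 1, 1, 1]],
}
labels = [3, 17]


def get_item(encodings, labels, idx):
    """Mirror __getitem__: pick row `idx` from every column and attach its label."""
    item = {key: values[idx] for key, values in encodings.items()}
    item["labels"] = labels[idx]
    return item


def collate(items):
    """Regroup a list of per-document records into one batch keyed by field."""
    return {key: [item[key] for item in items] for key in items[0]}


first = get_item(encodings, labels, 0)
print(first)
# {'input_ids': [101, 7592, 102, 0], 'attention_mask': [1, 1, 1, 0], 'labels': 3}

batch = collate([get_item(encodings, labels, i) for i in range(len(labels))])
print(batch["labels"])
# [3, 17]
```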

**Load Pretrained Transformer Model**

With our dataset now properly tokenized and formatted, it’s time to select and load a pre-trained transformer model for our classification task. Instead of training a model from scratch, which requires massive data and compute, we leverage AutoModelForSequenceClassification, which wraps pre-trained models like BERT, DistilBERT, and RoBERTa specifically for classification problems. In this case, we’ll use distilbert-base-uncased, a lightweight version of BERT that’s faster and still highly effective:

In [5]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=20)

DistilBertForSequenceClassification LOAD REPORT from: distilbert-base-uncased
Key                     | Status     | 
------------------------+------------+-
vocab_layer_norm.bias   | UNEXPECTED | 
vocab_transform.weight  | UNEXPECTED | 
vocab_projector.bias    | UNEXPECTED | 
vocab_transform.bias    | UNEXPECTED | 
vocab_layer_norm.weight | UNEXPECTED | 
classifier.weight       | MISSING    | 
classifier.bias         | MISSING    | 
pre_classifier.bias     | MISSING    | 
pre_classifier.weight   | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


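Since our dataset has 20 unique categories, we set num_labels=20 to size the model’s final classification layer accordingly. The MISSING entries in the load report above are expected: the classification head (pre_classifier and classifier) is not part of the pre-trained checkpoint, so it is newly initialized and learned during fine-tuning.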
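The classification layer produces one raw score (logit) per class, and a prediction is simply the class with the highest score. The sketch below shows that last step in plain Python, assuming made-up logits and a four-class subset of the newsgroup names to keep it short; `softmax` and `predict` are hypothetical helpers, not part of the Transformers API.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits, label_names):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return label_names[best], probs[best]


# Illustrative values only: four classes instead of the full 20.
label_names = ["alt.atheism", "comp.graphics", "rec.autos", "sci.space"]
logits = [0.2, 1.1, -0.5, 2.3]

label, prob = predict(logits, label_names)
print(label, round(prob, 3))
```

With num_labels=20 the only change is that each document gets 20 logits instead of 4.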

**Define Training Configuration**

Before we can train our model, we need to define the training configuration, which includes how the model should learn, how often to evaluate, when to save checkpoints, and other essential hyperparameters.

Hugging Face provides a convenient TrainingArguments class that lets us configure all of this in one place. We’ll specify parameters such as the learning rate, batch sizes, number of training epochs, weight decay for regularization, and logging frequency. Additionally, we will define a compute_metrics function that reports accuracy whenever the model is evaluated:

In [6]:
from transformers import TrainingArguments, Trainer
import numpy as np
from sklearn.metrics import accuracy_score

def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    return {"accuracy": accuracy_score(p.label_ids, preds)}

training_args = TrainingArguments(
    output_dir="./results",
    do_train=True,
    do_eval=True,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_steps=100,
    save_strategy="epoch",
    eval_steps=500
)

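This setup logs the training loss every 100 steps and saves a checkpoint at the end of each epoch. Note that eval_steps only takes effect when eval_strategy is set to "steps"; with the default strategy the Trainer does not evaluate during training, so accuracy is computed only when trainer.evaluate() is called.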
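To see what these settings mean in practice, the number of optimizer steps can be worked out by hand. The sketch below assumes the full dataset holds 18,846 documents, a single device, no gradient accumulation, and that the last partial batch is kept; if any of these differ in your environment, the numbers change accordingly.

```python
import math

n_docs = 18_846        # assumption: size of fetch_20newsgroups(subset="all")
test_size = 0.2
batch_size = 8         # per_device_train_batch_size, single device assumed
epochs = 3             # num_train_epochs
logging_steps = 100

n_test = math.ceil(n_docs * test_size)   # train_test_split rounds the test set up
n_train = n_docs - n_test
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
loss_log_lines = total_steps // logging_steps

print(n_train, n_test)      # 15076 3770
print(steps_per_epoch)      # 1885
print(total_steps)          # 5655
print(loss_log_lines)       # 56
```

Under these assumptions, the first 1,000 logged steps fall a little past the halfway point of the first epoch.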

**Train the Model**

With our model, datasets, training arguments, and evaluation metrics all defined, we’re ready to bring everything together using Hugging Face’s Trainer API.

The Trainer class abstracts away much of the boilerplate involved in training a transformer model, handling batching, optimization, evaluation, logging, and checkpointing behind the scenes. By passing in our model, training configuration, datasets, and metric function, we can initiate the fine-tuning process in a single line of code:

In [7]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics
)

In [8]:
result = trainer.train()

| Step | Training Loss |
|------|---------------|
| 100  | 2.804023 |
| 200  | 2.188186 |
| 300  | 1.809019 |
| 400  | 1.544245 |
| 500  | 1.357874 |
| 600  | 1.28749  |
| 700  | 1.253448 |
| 800  | 1.188241 |
| 900  | 1.18358  |
| 1000 | 1.090337 |
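A useful reference point for reading these numbers is the loss of a model that guesses uniformly among the 20 classes, which is ln(20) ≈ 3.0. The sketch below compares the logged values with that baseline. It assumes the logged loss is close to the mean cross-entropy over each 100-step window, in which case exp(-loss) is roughly the geometric-mean probability the model assigns to the correct class.

```python
import math

num_labels = 20
uniform_loss = math.log(num_labels)  # cross-entropy of a uniform guess, about 2.996

# Training losses copied from the log above.
logged = [
    (100, 2.804023), (200, 2.188186), (300, 1.809019), (400, 1.544245),
    (500, 1.357874), (600, 1.28749), (700, 1.253448), (800, 1.188241),
    (900, 1.18358), (1000, 1.090337),
]

print(f"uniform baseline: {uniform_loss:.3f}")
for step, loss in logged:
    # Rough probability given to the true class, versus 1/20 = 0.05 for guessing.
    p_correct = math.exp(-loss)
    print(f"step {step:>4}: loss {loss:.3f}, p(correct) ~ {p_correct:.2f}")
```

The first logged value sits just below the baseline, which is what we would expect from a freshly initialized classification head, and the steady drop afterwards shows the model is learning.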

