# Training a Text Classification Model with DistilBERT and Hugging Face Transformers

In this notebook, we fine-tune a pre-trained [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased) model for text classification using the Hugging Face `transformers` library.

We use the [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased) checkpoint, a distilled version of BERT, for this task.

The notebook is divided into the following sections:
1. Setup Development Environment
2. Load Dataset and Preprocess
3. Train DistilBERT Model for Text Classification
4. Inference on New Data
5. Push Model to Hugging Face (Optional)

## 1. Setup Development Environment

Install the following libraries:
```bash
pip install torch torchvision torchaudio
pip install transformers==4.38.2
pip install accelerate==0.29.3
pip install datasets==2.19.0
pip install evaluate==0.4.1
pip install scikit-learn==1.2.2
pip install numpy==1.25.2
pip install pandas==2.0.3
```

Authenticate with the Hugging Face Hub so that you can push the model to the Hub later (only needed for the optional last section):
```bash
huggingface-cli login
```

## 2. Load Dataset and Preprocess


In [1]:
from datasets import load_dataset

claire_dataset = load_dataset('shub-kris/claire-dataset')


### Preprocess Data

- Normalize the label names
- Add integer labels
- Split the dataset into train and validation sets

In [2]:
def change_labels(row):
    temp = row['label'].lower().replace("-", "_").split()
    row['label'] = '_'.join(temp)
    return row
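
To make the effect of this normalization concrete, here is a minimal standalone sketch of the same string logic. The label strings are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
def normalize_label(label: str) -> str:
    """Lowercase a label, then turn hyphens and runs of whitespace into underscores."""
    parts = label.lower().replace("-", "_").split()
    return "_".join(parts)

# Hypothetical labels, for illustration only
examples = ["Self-Harm", "Account Settings", "  General   Chat "]
for raw in examples:
    print(repr(raw), "->", normalize_label(raw))
```

Because `split()` with no arguments collapses any run of whitespace, leading, trailing, and repeated spaces all disappear from the normalized label.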

In [3]:
# Lowercase the labels and replace '-' and whitespace with '_'
processed_dataset = claire_dataset.map(change_labels)

In [4]:
# Extract unique labels and map them to integers
# Sort the labels so the mapping is identical on every run (set iteration order is not guaranteed)
unique_labels = sorted(set(processed_dataset['train']['label']))
label2id = {label: idx for idx, label in enumerate(unique_labels)}
id2label = {idx: label for idx, label in enumerate(unique_labels)}

In [5]:
# Replace the string label with its integer id and keep the original string in 'cat_label'
def add_id_mapping(row):
    cat_label = row['label']
    row['label'] = label2id[cat_label]
    row['cat_label'] = cat_label

    return row

In [6]:
# Apply the label-to-id mapping to every row
processed_dataset = processed_dataset.map(add_id_mapping)

Map: 100%|██████████| 756/756 [00:00<00:00, 32044.44 examples/s]


#### Splitting the dataset into train and validation

In [7]:
import numpy as np
from sklearn.model_selection import train_test_split


In [8]:
def split_dataset(dataset, train_size: float = 0.8):
    """Split dataset["train"] into train and validation sets, class by class."""

    if train_size >= 1 or train_size <= 0:
        raise ValueError("train_size must be > 0 and < 1")

    labels = dataset["train"]["cat_label"]

    # Get unique classes and their counts
    unique_classes, _ = np.unique(labels, return_counts=True)

    # Initialize empty lists to hold the split datasets
    train_splits = []
    val_splits = []

    # Split the dataset for each class
    for class_label in unique_classes:
        # Get indices of samples belonging to the current class
        class_indices = [i for i, label in enumerate(labels) if label == class_label]

        # Split the indices into train and validation sets
        # (train_test_split shuffles by default; random_state makes the split reproducible)
        train_indices, val_indices = train_test_split(class_indices, test_size=1-train_size, random_state=42)
        
        # Add the split indices to the lists
        train_splits.extend(train_indices)
        val_splits.extend(val_indices)

    # Use the indices to create train and validation datasets
    train_dataset = dataset["train"].select(train_splits)
    val_dataset = dataset["train"].select(val_splits)

    # Shuffle so that samples are no longer grouped by class
    train_dataset = train_dataset.shuffle(seed=42)
    val_dataset = val_dataset.shuffle(seed=42)

    return train_dataset, val_dataset 

In [9]:
# Split the dataset
train_dataset, val_dataset = split_dataset(processed_dataset)

In [10]:
## Let's verify the split
print("Train dataset size: ", len(train_dataset) / len(processed_dataset["train"])) # Should be close to 0.8
print("Validation dataset size: ", len(val_dataset) / len(processed_dataset["train"])) # Should be close to 0.2

Train dataset size:  0.7976190476190477
Validation dataset size:  0.20238095238095238
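
The overall fractions above do not show whether each class was split in the same ratio. The following sketch checks this per class. The helper name `class_fractions` and the toy label lists are assumptions made for illustration; in the notebook you would pass `train_dataset["cat_label"]` and `val_dataset["cat_label"]` instead.

```python
from collections import Counter

def class_fractions(train_labels, val_labels):
    """Return, per class, the fraction of its samples that ended up in the train split."""
    train_counts = Counter(train_labels)
    val_counts = Counter(val_labels)
    fractions = {}
    for label in sorted(set(train_counts) | set(val_counts)):
        total = train_counts[label] + val_counts[label]
        fractions[label] = train_counts[label] / total
    return fractions

# Toy labels standing in for the real 'cat_label' columns
toy_train = ["a"] * 8 + ["b"] * 4
toy_val = ["a"] * 2 + ["b"] * 1
print(class_fractions(toy_train, toy_val))
```

Each class is split separately and its sample count is rarely an exact multiple of the ratio, so the per-class fractions are only approximately 0.8. This is also why the overall train fraction is 0.7976 rather than exactly 0.8.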


## 3. Train DistilBERT Model for Text Classification

In [11]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

In [12]:
model_id = "distilbert/distilbert-base-uncased"

In [13]:
## In order to train the model, we need to tokenize the text data 
# and convert it to a format that the model can understand
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [14]:
# This function tokenizes the text data
# Tokenization is the process of converting text into the integer token ids the model expects
# max_length is the maximum number of tokens per input sequence; longer inputs are truncated
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

In [15]:
# Remove the columns that are not needed
tokenized_train_dataset = train_dataset.map(preprocess_function, batched=True, remove_columns=["cat_label", "text"])
tokenized_val_dataset = val_dataset.map(preprocess_function, batched=True, remove_columns=["cat_label", "text"])

Map: 100%|██████████| 603/603 [00:00<00:00, 47391.05 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 31065.91 examples/s]


In [16]:
from transformers import DataCollatorWithPadding

# This data collator pads each batch dynamically to the length of its longest sample
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
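
As a rough illustration of what padding a batch means, here is a simplified pure-Python sketch. It is not the library's implementation: the function `pad_batch`, the pad id of 0, and the token ids are assumptions made for the example.

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids, for illustration only
batch = pad_batch([[101, 7592, 102], [101, 2129, 2024, 2017, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a fixed `max_length` keeps batches of short texts small, which saves computation during training.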

In [17]:
## Load the metric for evaluation

import evaluate

accuracy = evaluate.load("accuracy")
recall = evaluate.load("recall")
precision = evaluate.load("precision")

In [18]:
# Define the evaluation function
# During evaluation, we will compute accuracy, precision and recall
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Convert logits to predicted class ids
    predictions = np.argmax(predictions, axis=1)
    # Each compute() call returns a dict, so extract the scalar value
    acc = accuracy.compute(predictions=predictions, references=labels)["accuracy"]
    prec = precision.compute(predictions=predictions, references=labels, average='weighted')["precision"]
    rec = recall.compute(predictions=predictions, references=labels, average='weighted')["recall"]
    return {"accuracy": acc, "precision": prec, "recall": rec}
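
To clarify what `average='weighted'` computes, the sketch below re-implements support-weighted precision and recall in plain Python. It is an illustrative approximation under stated assumptions, not the code the `evaluate` library runs: the function name and the toy labels are invented for the example, and a class that is never predicted is given a precision of 0.

```python
from collections import Counter

def weighted_precision_recall(predictions, references):
    """Support-weighted average of per-class precision and recall."""
    support = Counter(references)      # number of true samples per class
    predicted = Counter(predictions)   # number of predicted samples per class
    correct = Counter(p for p, r in zip(predictions, references) if p == r)
    total = len(references)
    precision = recall = 0.0
    for label, n_true in support.items():
        weight = n_true / total
        # A class that is never predicted contributes a precision of 0
        class_precision = correct[label] / predicted[label] if predicted[label] else 0.0
        class_recall = correct[label] / n_true
        precision += weight * class_precision
        recall += weight * class_recall
    return precision, recall

# Toy example with three classes
refs = [0, 0, 0, 1, 1, 2]
preds = [0, 0, 1, 1, 1, 2]
precision_value, recall_value = weighted_precision_recall(preds, refs)
print(precision_value, recall_value)
```

Note that support-weighted recall is always equal to accuracy, because each class's recall is weighted by exactly the number of samples it is computed over. In this toy example both are 5/6.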

In [19]:
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, 
    num_labels=len(label2id),  # one output per class (5 in this dataset)
    id2label=id2label, 
    label2id=label2id
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [20]:
## Define the training arguments and hyperparameters
training_args = TrainingArguments(
    output_dir="claire_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)
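
For a sense of how long training runs with these settings, the following sketch estimates the number of optimizer steps. It assumes a single device, no gradient accumulation, and that the last partial batch is kept. The training set size of 603 is taken from the `Map` output above.

```python
import math

train_examples = 603   # size of the tokenized training split
batch_size = 16        # per_device_train_batch_size
epochs = 5             # num_train_epochs

# The last batch of each epoch may be smaller, hence the ceiling
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```

Under these assumptions the model is evaluated and a checkpoint is saved every 38 steps, for 190 steps in total.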

In [21]:
## Initialize the Hugging Face Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)


In [None]:
## Train the model
trainer.train()

## 4. Inference on New Data

In [None]:
from transformers import pipeline


In [None]:
texts = ["I feel like dying",
         "What do you know about Football",
         "How can I change my notification settings",
         "I want to quit the subscription",
         "I am not feeling well today"]

## Initialize the text classification pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

In [None]:
classifier(texts)
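
For each input text, the pipeline returns a label and a score. As a simplified sketch of the post-processing involved, assuming single-label classification, the snippet below applies a softmax to raw logits and looks up the top class. The function `classify`, the logits, and the label names are hypothetical.

```python
import math

def classify(logits, id2label):
    """Turn raw logits into the top label and its softmax probability."""
    highest = max(logits)
    # Subtract the maximum before exponentiating for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    denominator = sum(exps)
    probs = [e / denominator for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Hypothetical logits and label names
toy_id2label = {0: "label_a", 1: "label_b", 2: "label_c"}
result = classify([0.2, 2.5, -1.0], toy_id2label)
print(result)
```

The score is the model's probability for the predicted class, so a low score on an out-of-domain text is a hint that the prediction should not be trusted.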

## 5. Push Model to Hugging Face (Optional)

Make sure you have authenticated with the Hugging Face Hub by running:

```bash
huggingface-cli login
```

The model that I trained is available [here](https://huggingface.co/shub-kris/claire-text-classification-model).


In [None]:
## Make sure to change the user-name to your username
model_name = "user-name/distilbert-base-uncased-finetuned-claire"

model.push_to_hub(model_name)
tokenizer.push_to_hub(model_name)