# DistilBERT: text classification for intent detection

This notebook uses the SNIPS dataset for intent classification. The aim is to fine-tune a pre-trained DistilBERT model to classify utterances into the following intents:
- `AddToPlaylist`
- `BookRestaurant`
- `GetWeather`
- `PlayMusic`
- `RateBook`
- `SearchCreativeWork`
- `SearchScreeningEvent`

## Data loading, transformation and inspection

In [1]:
import json

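def load_intents(file_name, intent_name):
    """Return the utterances stored under `intent_name` in a SNIPS JSON file."""
    with open(file_name, 'r') as file:
        raw_data = json.load(file)

    # each example is stored as a list of text chunks; join them back into one utterance
    return [
        ''.join(
            [
                data.get('text', '') for data in intent.get('data', [])
            ]
        ).strip() for intent in raw_data.get(intent_name, [])
    ]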

intents = [
    'AddToPlaylist',
    'BookRestaurant',
    'GetWeather',
    'PlayMusic',
    'RateBook',
    'SearchCreativeWork',
    'SearchScreeningEvent'
]
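
The loader above implies a particular file layout: a top-level key named after the intent, holding a list of examples whose `data` field splits each utterance into text chunks. The following is a minimal sketch of that joining step on a made-up record rather than a real SNIPS file. The sample record, its `entity` field, and the helper name `join_utterances` are illustrative assumptions.

```python
import json

# Hypothetical record shaped the way load_intents expects its input.
sample_json = '''
{
  "GetWeather": [
    {"data": [
      {"text": "What is the weather in "},
      {"text": "Paris", "entity": "city"},
      {"text": " tomorrow "}
    ]}
  ]
}
'''

def join_utterances(raw_data, intent_name):
    """Concatenate the text chunks of every example into a single string."""
    return [
        ''.join(chunk.get('text', '') for chunk in example.get('data', [])).strip()
        for example in raw_data.get(intent_name, [])
    ]

utterances = join_utterances(json.loads(sample_json), 'GetWeather')
print(utterances)  # ['What is the weather in Paris tomorrow']
```

A missing intent key or a missing `data` field simply yields an empty result, which is why the `.get(..., [])` defaults are used.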

In [2]:
import pandas as pd
from random import shuffle

dataset = {}

for intent in intents:
    intents_list = list(set(
        load_intents(f'data/custom-intent-engines/{intent}/train_{intent}_full.json', intent)
    ).union(
        load_intents(f'data/custom-intent-engines/{intent}/train_{intent}.json', intent)
    ).union(
        load_intents(f'data/custom-intent-engines/{intent}/validate_{intent}.json', intent)
    ))
    # shuffle the data of the list
    shuffle(intents_list)
    # calculate 1/3 split
    split = len(intents_list) // 3
    # 1/3 of the data is used for validation
    validation = intents_list[:split]
    # 2/3 of the data is used for training
    training = intents_list[split:]
    
    dataset[intent] = {}
    dataset[intent]['training'] = training
    dataset[intent]['validation'] = validation

dataset = pd.DataFrame(dataset)
dataset

Unnamed: 0,AddToPlaylist,BookRestaurant,GetWeather,PlayMusic,RateBook,SearchCreativeWork,SearchScreeningEvent
training,"[add this track to my Gold School playlist, pu...",[book a party of 9 for The Wieners Circle in P...,[What's the weather in Ecola State Park in thr...,"[Play me a song by Stephen Jones, play somethi...","[Rate The Removers 4 out of 6, Rate the curren...",[Can you get me The Education of Little Tree s...,"[Please tell me movie times, What time is The ..."
validation,"[add a track in Nike Running Tempo Mix, I'd li...",[I need a table for nine at a nice restaurant ...,"[What's the weather in the current place?, Wea...",[Can you play a chant by Butch Trucks on Spoti...,"[I give zero points to this chronicle, The You...","[find Karol: The Pope, Search for The Long Dar...",[Find the movie schedules for animated movies ...
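
Because `shuffle` is unseeded and the iteration order of a `set` of strings can change between Python processes, every run of the cell above produces a different split. A minimal sketch of a reproducible variant follows, assuming a fixed seed is acceptable for the experiment. The helper `split_utterances` and the seed value are illustrative and not part of the notebook.

```python
from random import Random

def split_utterances(utterances, seed=42):
    """Deduplicate, shuffle deterministically, and split 1/3 validation, 2/3 training."""
    unique = sorted(set(utterances))   # sorting removes the run-to-run set ordering
    Random(seed).shuffle(unique)       # private RNG, the global random state is untouched
    split = len(unique) // 3
    return unique[split:], unique[:split]

# Ten made-up utterances, one of them a duplicate.
examples = [f'utterance {i}' for i in range(9)] + ['utterance 0']
training_part, validation_part = split_utterances(examples)
print(len(training_part), len(validation_part))  # 6 3
```

With a fixed seed the accuracy figures of different runs become comparable, at the cost of always evaluating on the same third of the data.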


In [3]:
training = dataset.loc[['training']].T.explode('training').reset_index().rename(columns={
    'index': 'intent',
    'training': 'utterance'
})
training.sample(5)

Unnamed: 0,intent,utterance
8922,SearchCreativeWork,Please search the Abby saga.
465,AddToPlaylist,add Rakim y Ken Y to my Gold Edition playlist
4712,PlayMusic,Play Tomtegubben Som Hade Snuva
2168,BookRestaurant,Find a reservation in Hesston NC at a new rest...
1831,BookRestaurant,Please book seating for one person at an indoo...


In [4]:
validation = dataset.loc[['validation']].T.explode('validation').reset_index().rename(columns={
    'index': 'intent',
    'validation': 'utterance'
})
validation.sample(5)

Unnamed: 0,intent,utterance
3048,RateBook,this essay should get 1 of the points
2935,PlayMusic,I want to hear a top ten soundtrack from 1984 ...
3669,RateBook,Rate the current novel a 1
2830,PlayMusic,Play some ivy anderson from around 1967
3036,PlayMusic,something on Spotify please


In [5]:
len(training) * 100 / (len(training) + len(validation))

66.68578712555768

In [6]:
len(validation) * 100 / (len(training) + len(validation))

33.31421287444232

In [7]:
training.value_counts(subset=['intent'], normalize=True) * 100

intent              
GetWeather              14.823664
PlayMusic               14.775877
BookRestaurant          14.747204
RateBook                14.078180
SearchCreativeWork      14.039950
AddToPlaylist           13.896588
SearchScreeningEvent    13.638536
Name: proportion, dtype: float64

In [8]:
validation.value_counts(subset=['intent'], normalize=True) * 100

intent              
GetWeather              14.826861
PlayMusic               14.769466
BookRestaurant          14.750335
RateBook                14.080735
SearchCreativeWork      14.042472
AddToPlaylist           13.889420
SearchScreeningEvent    13.640712
Name: proportion, dtype: float64
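
The split is made per intent, so the training and validation sets should have almost the same class proportions. As a rough check, the sketch below computes the largest per-intent gap from the percentages printed above (copied by hand, so they are only as precise as the displayed values).

```python
# Percentages copied from the two value_counts outputs above.
training_pct = {
    'GetWeather': 14.823664, 'PlayMusic': 14.775877, 'BookRestaurant': 14.747204,
    'RateBook': 14.078180, 'SearchCreativeWork': 14.039950,
    'AddToPlaylist': 13.896588, 'SearchScreeningEvent': 13.638536,
}
validation_pct = {
    'GetWeather': 14.826861, 'PlayMusic': 14.769466, 'BookRestaurant': 14.750335,
    'RateBook': 14.080735, 'SearchCreativeWork': 14.042472,
    'AddToPlaylist': 13.889420, 'SearchScreeningEvent': 13.640712,
}

# Absolute difference in percentage points for each intent.
gaps = {intent: abs(training_pct[intent] - validation_pct[intent]) for intent in training_pct}
worst_intent = max(gaps, key=gaps.get)
print(worst_intent, gaps[worst_intent])
```

Every gap is below 0.01 percentage points, so accuracy is a reasonable metric here: no intent dominates either set.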

## BERT

Create training and evaluation datasets

In [9]:
from datasets import Dataset

label2id = {v: k for k, v in enumerate(intents)}
id2label = {k: v for k, v in enumerate(intents)}

def encode(row):  
    return label2id[row['intent']]

train_dataset = Dataset.from_dict({
    'text': training.utterance.to_list(),
    'label': training.apply(encode, axis=1).to_list()
})

eval_dataset = Dataset.from_dict({
    'text': validation.utterance.to_list(),
    'label': validation.apply(encode, axis=1).to_list()
})

Load the pretrained DistilBERT tokenizer:

In [10]:
from transformers import DistilBertTokenizerFast

BERT = 'distilbert-base-uncased'

tokenizer = DistilBertTokenizerFast.from_pretrained(BERT)

Tokenize text and truncate sequences to be no longer than DistilBERT's maximum input length:

In [11]:
def preprocess(examples):
    return tokenizer(examples['text'], truncation=True)  # no padding here; the data collator pads each batch

train_dataset = train_dataset.map(preprocess, batched=True)
eval_dataset = eval_dataset.map(preprocess, batched=True)

Map:   0%|          | 0/10463 [00:00<?, ? examples/s]

Map:   0%|          | 0/5227 [00:00<?, ? examples/s]

Create a batch of examples using DataCollatorWithPadding. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [12]:
train_dataset[0]

{'text': 'add this track to my Gold School playlist',
 'label': 0,
 'input_ids': [101, 5587, 2023, 2650, 2000, 2026, 2751, 2082, 2377, 9863, 102],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [13]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
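
To make the saving from dynamic padding concrete, here is a minimal plain-Python sketch of the idea. It does not call the collator itself: `pad_batch` is a hypothetical helper, and the second and third token-id sequences are made up (the first is the `input_ids` of `train_dataset[0]` shown above).

```python
def pad_batch(sequences, pad_id=0, length=None):
    """Right-pad token-id lists to `length` (default: longest sequence in the batch)."""
    target = length if length is not None else max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (target - len(seq)) for seq in sequences]

batch = [
    [101, 5587, 2023, 2650, 2000, 2026, 2751, 2082, 2377, 9863, 102],
    [101, 2377, 2189, 102],          # made-up short sequence
    [101, 3446, 2023, 2338, 102],    # made-up short sequence
]

dynamic = pad_batch(batch)              # pad to the longest sequence in this batch
static = pad_batch(batch, length=512)   # pad to DistilBERT's maximum input length

print(len(dynamic[0]), sum(len(seq) for seq in dynamic))  # 11 33
print(len(static[0]), sum(len(seq) for seq in static))    # 512 1536
```

For short utterances like these, padding to the batch maximum processes 33 token positions instead of 1,536, which is where the efficiency gain comes from.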

Load an evaluation method with the 🤗 Evaluate library. Including a metric during training is often helpful for evaluating your model's performance.

In [14]:
import evaluate

accuracy = evaluate.load('accuracy')

Then create a function that passes the predictions and labels to `accuracy.compute` to calculate the accuracy:

In [15]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
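
The function above turns one row of logits per example into a predicted class with `argmax` and then compares the predictions with the labels. A small worked example of the same computation follows, written in plain Python so it runs without NumPy or Evaluate. The logits and labels are made up, and `accuracy_from_logits` is a hypothetical stand-in for `compute_metrics`.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {'accuracy': correct / len(labels)}

# Made-up logits for four utterances over three classes.
logits = [
    [2.1, 0.3, -1.0],   # predicted class 0
    [0.2, 1.7, 0.1],    # predicted class 1
    [0.5, 0.4, 0.9],    # predicted class 2
    [1.2, 1.5, -0.3],   # predicted class 1, but the label is 0
]
labels = [0, 1, 2, 0]

result = accuracy_from_logits(logits, labels)
print(result)  # {'accuracy': 0.75}
```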

In [16]:
from transformers import DistilBertForSequenceClassification, TrainingArguments, Trainer

model = DistilBertForSequenceClassification.from_pretrained(
    BERT,
    num_labels=len(intents),
    id2label=id2label,
    label2id=label2id
)
model.config

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "AddToPlaylist",
    "1": "BookRestaurant",
    "2": "GetWeather",
    "3": "PlayMusic",
    "4": "RateBook",
    "5": "SearchCreativeWork",
    "6": "SearchScreeningEvent"
  },
  "initializer_range": 0.02,
  "label2id": {
    "AddToPlaylist": 0,
    "BookRestaurant": 1,
    "GetWeather": 2,
    "PlayMusic": 3,
    "RateBook": 4,
    "SearchCreativeWork": 5,
    "SearchScreeningEvent": 6
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.53.3",
  "vocab_size": 30522
}

## References
- [Create Dataset](https://huggingface.co/docs/datasets/v4.0.0/en/create_dataset)
- [NLP quickstart](https://huggingface.co/docs/datasets/en/quickstart#nlp)
- [Trainer API](https://huggingface.co/docs/transformers/en/main_classes/trainer)
- [Evaluate](https://huggingface.co/docs/evaluate/a_quick_tour)
- [Text Classification](https://huggingface.co/docs/transformers/tasks/sequence_classification)

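```
import torch

# Optional: iterate over the tokenized data with a plain PyTorch DataLoader instead of Trainer.
# DistilBERT's tokenizer returns no token_type_ids, and the label column here is named "label".
torch_dataset = train_dataset.select_columns(["input_ids", "attention_mask", "label"])
# data_collator pads each batch to its longest sequence and returns PyTorch tensors.
dataloader = torch.utils.data.DataLoader(torch_dataset, batch_size=32, collate_fn=data_collator)
```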

Define the hyperparameters

In [17]:
from transformers import TrainingArguments, Trainer

epochs = 2

training_args = TrainingArguments(
    output_dir = './models/output',
    num_train_epochs=epochs,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,

    # deep learning parameters
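    # linear warmup steps: 10,463 // 5 = 2,092, which exceeds the 654 total optimization steps,
    # so the learning rate is still ramping up towards 1e-5 when training ends
    warmup_steps=len(training) // 5,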
    weight_decay=0.01,
    learning_rate=1e-5,

    logging_steps=1,
    log_level='info',
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    data_collator=data_collator,
)
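
It is worth checking how the warmup length relates to the length of the run. The sketch below redoes the arithmetic, assuming the Trainer's default linear schedule, in which the learning rate during warmup is the peak rate scaled by `step / warmup_steps`. The function `warmup_lr` is a hypothetical helper and not a library call.

```python
import math

num_examples = 10463   # training examples, as reported by the Map progress output
batch_size = 32
epochs = 2
peak_lr = 1e-5
warmup_steps = num_examples // 5

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs

def warmup_lr(step):
    """Learning rate during linear warmup (valid while step <= warmup_steps)."""
    return peak_lr * step / max(1, warmup_steps)

final_lr = warmup_lr(total_steps)
print(warmup_steps, total_steps, final_lr)
```

Under these assumptions the run has 654 optimization steps but 2,092 warmup steps, so training stops at roughly 31% of the configured learning rate. A shorter warmup, for example a fraction of the total steps, would let the schedule reach its peak.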

In [18]:
trainer.evaluate()

The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 5227
  Batch size = 32


{'eval_loss': 1.9477510452270508,
 'eval_model_preparation_time': 0.0019,
 'eval_accuracy': 0.12511957145590205,
 'eval_runtime': 3.3465,
 'eval_samples_per_second': 1561.911,
 'eval_steps_per_second': 49.006}
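
These pre-training numbers are what a random classifier would produce. With seven balanced classes, a uniform prediction has a cross-entropy loss of ln(7) and an expected accuracy of 1/7, as this short check shows (the observed values are copied from the output above).

```python
import math

num_classes = 7
chance_loss = math.log(num_classes)   # cross-entropy of a uniform prediction
chance_accuracy = 1 / num_classes

observed_loss = 1.9477510452270508        # eval_loss before training
observed_accuracy = 0.12511957145590205   # eval_accuracy before training

print(f'{chance_loss:.4f} {chance_accuracy:.4f}')  # 1.9459 0.1429
print(f'{observed_loss - chance_loss:.4f}')
```

The untrained classification head is therefore a sensible baseline: any useful fine-tuning run has to move clearly away from a loss of about 1.95 and an accuracy of about 14%.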

In [19]:
trainer.train()

The following columns in the Training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 10,463
  Num Epochs = 2
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 654
  Number of trainable parameters = 66,958,855


Epoch,Training Loss,Validation Loss,Model Preparation Time,Accuracy
1,1.6108,1.586888,0.0019,0.840635
2,0.3382,0.346689,0.0019,0.967094


The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 5227
  Batch size = 32
Saving model checkpoint to ./models/output/checkpoint-327
Configuration saved in ./models/output/checkpoint-327/config.json
Model weights saved in ./models/output/checkpoint-327/model.safetensors
Saving Trainer.data_collator.tokenizer by default as Trainer.processing_class is `None`
tokenizer config file saved in ./models/output/checkpoint-327/tokenizer_config.json
Special tokens file saved in ./models/output/checkpoint-327/special_tokens_map.json
The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequen

TrainOutput(global_step=654, training_loss=1.410431821776457, metrics={'train_runtime': 56.9922, 'train_samples_per_second': 367.173, 'train_steps_per_second': 11.475, 'total_flos': 122946141057906.0, 'train_loss': 1.410431821776457, 'epoch': 2.0})

In [20]:
trainer.evaluate()

The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 5227
  Batch size = 32


{'eval_loss': 0.34668880701065063,
 'eval_model_preparation_time': 0.0019,
 'eval_accuracy': 0.9670939353357566,
 'eval_runtime': 2.9087,
 'eval_samples_per_second': 1797.02,
 'eval_steps_per_second': 56.382,
 'epoch': 2.0}

## Inference

In [21]:
from transformers import pipeline

classifier = pipeline('text-classification', model=model, tokenizer=tokenizer)

Device set to use mps:0


In [23]:
classifier('Is it going to be sunny tomorrow?')

[{'label': 'GetWeather', 'score': 0.8123217821121216}]
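
The `score` is a probability rather than a raw logit. As far as I know, the text-classification pipeline applies a softmax over the logits for single-label models and reports the highest-scoring class. The sketch below reproduces that post-processing in plain Python with made-up logits. The logit values are assumptions and were not produced by the model.

```python
import math

def softmax(logits):
    """Convert logits to probabilities that sum to 1."""
    peak = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

intents = [
    'AddToPlaylist', 'BookRestaurant', 'GetWeather', 'PlayMusic',
    'RateBook', 'SearchCreativeWork', 'SearchScreeningEvent',
]

# Hypothetical logits for one utterance, one value per intent.
logits = [-0.4, -0.2, 2.6, -0.5, -0.6, -0.1, 0.0]

probabilities = softmax(logits)
best = max(range(len(probabilities)), key=probabilities.__getitem__)
prediction = {'label': intents[best], 'score': probabilities[best]}
print(prediction)
```

Because the scores of all seven intents sum to 1, a top score close to 1/7 means the model is barely more confident than a uniform guess.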

In [24]:
trainer.save_model()

Saving model checkpoint to ./models/output
Configuration saved in ./models/output/config.json
Model weights saved in ./models/output/model.safetensors
Saving Trainer.data_collator.tokenizer by default as Trainer.processing_class is `None`
tokenizer config file saved in ./models/output/tokenizer_config.json
Special tokens file saved in ./models/output/special_tokens_map.json


In [25]:
classifier = pipeline('text-classification', model='./models/output', tokenizer=tokenizer)

loading configuration file ./models/output/config.json
Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "AddToPlaylist",
    "1": "BookRestaurant",
    "2": "GetWeather",
    "3": "PlayMusic",
    "4": "RateBook",
    "5": "SearchCreativeWork",
    "6": "SearchScreeningEvent"
  },
  "initializer_range": 0.02,
  "label2id": {
    "AddToPlaylist": 0,
    "BookRestaurant": 1,
    "GetWeather": 2,
    "PlayMusic": 3,
    "RateBook": 4,
    "SearchCreativeWork": 5,
    "SearchScreeningEvent": 6
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformer

In [26]:
classifier('Add Stitches by Shawn Mendes to my favourites')

[{'label': 'AddToPlaylist', 'score': 0.8058907389640808}]

In [28]:
classifier('Harry Potter was 10 out of 10')

[{'label': 'RateBook', 'score': 0.8625803589820862}]

In [34]:
classifier('Put aside a crib for 2 at 20:30 at the Sartori inn')

[{'label': 'BookRestaurant', 'score': 0.7580662369728088}]

## Fine-tuning by freezing the DistilBERT parameters

In [35]:
frozen_model = DistilBertForSequenceClassification.from_pretrained(
    BERT,
    num_labels=len(intents),
    id2label=id2label,
    label2id=label2id
)

loading configuration file config.json from cache at /Users/spaccs01/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/12040accade4e8a0f71eabdb258fecc2e7e948be/config.json
Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "AddToPlaylist",
    "1": "BookRestaurant",
    "2": "GetWeather",
    "3": "PlayMusic",
    "4": "RateBook",
    "5": "SearchCreativeWork",
    "6": "SearchScreeningEvent"
  },
  "initializer_range": 0.02,
  "label2id": {
    "AddToPlaylist": 0,
    "BookRestaurant": 1,
    "GetWeather": 2,
    "PlayMusic": 3,
    "RateBook": 4,
    "SearchCreativeWork": 5,
    "SearchScreeningEvent": 6
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": fals

In [36]:
for param in frozen_model.distilbert.parameters():
    param.requires_grad = False # freezing the parameters of the pre-trained model

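We freeze the parameters of the pre-trained DistilBERT encoder, so only the classification head (`pre_classifier` and `classifier`) remains trainable, and then run the same fine-tuning procedure as before.
Training takes less time, but with the same hyperparameters the accuracy is much lower than that of the fully fine-tuned model: 45.8% versus 96.7% after two epochs, in 19.5 s versus 57.0 s.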
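
The number of parameters left to train can be worked out by hand. This sketch assumes the classification head consists of two linear layers, `pre_classifier` (hidden size to hidden size) and `classifier` (hidden size to number of labels), each with a bias. `linear_params` is a hypothetical helper.

```python
dim = 768        # hidden size, "dim" in the model config
num_labels = 7   # one output per intent

def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

pre_classifier = linear_params(dim, dim)
classifier = linear_params(dim, num_labels)
head_total = pre_classifier + classifier

print(pre_classifier, classifier, head_total)  # 590592 5383 595975
```

The total matches the 595,975 trainable parameters reported by the training log of the frozen model, compared with 66,958,855 when the whole network is trained.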

In [37]:
from transformers import TrainingArguments, Trainer

epochs = 2

training_args = TrainingArguments(
    output_dir = './models/frozen/output',
    num_train_epochs=epochs,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,

    # deep learning parameters
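    # linear warmup steps: 10,463 // 5 = 2,092, which exceeds the 654 total optimization steps,
    # so the learning rate is still ramping up towards 1e-5 when training ends
    warmup_steps=len(training) // 5,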
    weight_decay=0.01,
    learning_rate=1e-5,

    logging_steps=1,
    log_level='info',
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True
)

trainer = Trainer(
    model=frozen_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    data_collator=data_collator
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [38]:
trainer.evaluate()

The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 5227
  Batch size = 32


{'eval_loss': 1.9417152404785156,
 'eval_model_preparation_time': 0.0019,
 'eval_accuracy': 0.1811746699827817,
 'eval_runtime': 2.9131,
 'eval_samples_per_second': 1794.338,
 'eval_steps_per_second': 56.298}

In [39]:
trainer.train()

The following columns in the Training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 10,463
  Num Epochs = 2
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 654
  Number of trainable parameters = 595,975


Epoch,Training Loss,Validation Loss,Model Preparation Time,Accuracy
1,1.9551,1.925956,0.0019,0.236656
2,1.8591,1.876852,0.0019,0.458389


The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 5227
  Batch size = 32
Saving model checkpoint to ./models/frozen/output/checkpoint-327
Configuration saved in ./models/frozen/output/checkpoint-327/config.json
Model weights saved in ./models/frozen/output/checkpoint-327/model.safetensors
Saving Trainer.data_collator.tokenizer by default as Trainer.processing_class is `None`
tokenizer config file saved in ./models/frozen/output/checkpoint-327/tokenizer_config.json
Special tokens file saved in ./models/frozen/output/checkpoint-327/special_tokens_map.json
The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are n

TrainOutput(global_step=654, training_loss=1.9238947165121727, metrics={'train_runtime': 19.4954, 'train_samples_per_second': 1073.38, 'train_steps_per_second': 33.546, 'total_flos': 122946141057906.0, 'train_loss': 1.9238947165121727, 'epoch': 2.0})

In [40]:
classifier = pipeline('text-classification', model=frozen_model, tokenizer=tokenizer)

Device set to use mps:0


In [41]:
classifier('Add Stitches by Shawn Mendes to my favourites')

[{'label': 'AddToPlaylist', 'score': 0.161112979054451}]

In [42]:
classifier('Put aside a crib for 2 at 20:30 at the Sartori inn')

[{'label': 'BookRestaurant', 'score': 0.16327163577079773}]
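
To put the two runs side by side, the sketch below compares the figures copied from the training logs above. Nothing here is newly measured; the dictionary layout and variable names are illustrative.

```python
# Figures copied from the logs of the two runs.
full = {'accuracy': 0.967094, 'train_runtime': 56.9922, 'trainable': 66_958_855}
frozen = {'accuracy': 0.458389, 'train_runtime': 19.4954, 'trainable': 595_975}

speedup = full['train_runtime'] / frozen['train_runtime']
trainable_share = frozen['trainable'] / full['trainable']
accuracy_gap = full['accuracy'] - frozen['accuracy']

uniform_score = 1 / 7               # score of a uniform guess over seven intents
frozen_score = 0.161112979054451    # pipeline score for the AddToPlaylist example

print(f'speedup: {speedup:.2f}x')
print(f'trainable share: {trainable_share:.2%}')
print(f'accuracy gap: {accuracy_gap:.3f}')
print(f'score above uniform: {frozen_score - uniform_score:.3f}')
```

Freezing the encoder trains under 1% of the parameters and is roughly three times faster, but after two epochs with these hyperparameters the predictions are correct less than half the time and the scores sit only slightly above a uniform guess.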