# distilBERT: text classification for intent detection

This notebook uses the SNIPS dataset for intent classification. The aim is to fine-tune a pre-trained distilBERT model to classify utterances into the following intents:
- `AddToPlaylist`
- `BookRestaurant`
- `GetWeather`
- `PlayMusic`
- `RateBook`
- `SearchCreativeWork`
- `SearchScreeningEvent`

## Data loading, transformation and exploration

In [1]:
import json

def load_intents(file_name, intent_name):
    with open(file_name, 'r') as file:
        raw_data = json.load(file)
    
    return [
        ''.join(
            [
                data.get('text', '') for data in intent.get('data', [])
            ]
        ).strip() for intent in raw_data.get(intent_name, [])
    ]

intents = [
    'AddToPlaylist',
    'BookRestaurant',
    'GetWeather',
    'PlayMusic',
    'RateBook',
    'SearchCreativeWork',
    'SearchScreeningEvent'
]
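
As a minimal sketch of what `load_intents` expects, the hypothetical `sample` below mimics the JSON layout the function reads: the intent name maps to a list of utterances, and each utterance holds a `data` list of text fragments. The `entity` key is an assumption for illustration and is ignored by the loader logic. Concatenating the fragments rebuilds the full sentence.

```python
# Hypothetical sample mirroring the layout that load_intents reads.
sample = {
    "GetWeather": [
        {
            "data": [
                {"text": "will it rain in "},
                {"text": "Paris", "entity": "city"},  # entity key is illustrative only
                {"text": " tomorrow "},
            ]
        }
    ]
}

def join_fragments(raw_data, intent_name):
    """Concatenate the text fragments of every utterance and trim whitespace."""
    return [
        "".join(chunk.get("text", "") for chunk in utterance.get("data", [])).strip()
        for utterance in raw_data.get(intent_name, [])
    ]

utterances = join_fragments(sample, "GetWeather")
print(utterances)
```

A missing intent name yields an empty list rather than an error, because both lookups use `.get` with a default.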

In [2]:
import pandas as pd
from random import shuffle

dataset = {}

for intent in intents:
    intents_list = list(set(
        load_intents(f'data/custom-intent-engines/{intent}/train_{intent}_full.json', intent)
    ).union(
        load_intents(f'data/custom-intent-engines/{intent}/train_{intent}.json', intent)
    ).union(
        load_intents(f'data/custom-intent-engines/{intent}/validate_{intent}.json', intent)
    ))
    # shuffle the data of the list
    shuffle(intents_list)
    # calculate 1/3 split
    split = len(intents_list) // 3
    # 1/3 of the data is used for validation
    validation = intents_list[:split]
    # 2/3 of the data is used for training
    training = intents_list[split:]
    
    dataset[intent] = {}
    dataset[intent]['training'] = training
    dataset[intent]['validation'] = validation

dataset = pd.DataFrame(dataset)
dataset

Unnamed: 0,AddToPlaylist,BookRestaurant,GetWeather,PlayMusic,RateBook,SearchCreativeWork,SearchScreeningEvent
training,"[insert Rock Me UP song to my list, Put an alb...","[Book me a restaurant for nine in Statham, boo...","[Weather for Burr, what is the weather like no...","[play the top jazz record from 1951, Play top-...","[Give Unleashing Nepal zero stars, I would rat...",[I want to see the television show The Muppet ...,"[Where can I find the movie schedules?, What a..."
validation,[Add kaya newest track to my I Love My '00's R...,[I need a reservation for 8 at a top-rated res...,[What will the weather be doing at midnight in...,"[Ply best 1973 sound track, Play music year 20...","[I give The Spirit of St. Louis a 1, give zero...",[Can you help me search the Two Row Times show...,[When and where can I see A Film with Me in It...
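
`shuffle` is unseeded in the cell above and a `set` has no stable order, so every run produces a different split. A minimal sketch of a repeatable variant, assuming a hypothetical helper `split_utterances` and an arbitrary seed of 42:

```python
from random import Random

def split_utterances(utterances, seed=42):
    """Return (training, validation) with 1/3 of the unique utterances held out."""
    items = sorted(set(utterances))   # deduplicate, then fix the order before shuffling
    Random(seed).shuffle(items)       # a local, seeded generator keeps the split repeatable
    split = len(items) // 3           # same floor division as in the notebook
    return items[split:], items[:split]

examples = [f"utterance {i}" for i in range(10)]  # hypothetical data
training_part, validation_part = split_utterances(examples)
print(len(training_part), len(validation_part))
```

Using a local `Random` instance avoids touching the global random state that other libraries may rely on.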


Extract the training row, which holds the list of utterances per intent, into one row per utterance

In [3]:
training = dataset.loc[['training']].T.explode('training').reset_index().rename(columns={
    'index': 'intent',
    'training': 'utterance'
})
training

Unnamed: 0,intent,utterance
0,AddToPlaylist,insert Rock Me UP song to my list
1,AddToPlaylist,Put an album by max richter into my this is Ro...
2,AddToPlaylist,Please add the song by raphael rabello to the ...
3,AddToPlaylist,add Stephen McNally to Confidence Boost
4,AddToPlaylist,Add Ernie Hawkins to the Dubstep playlist.
...,...,...
10458,SearchScreeningEvent,Fine movie times.
10459,SearchScreeningEvent,Is The Happy Hooker Goes Hollywood at the movi...
10460,SearchScreeningEvent,show me the movie schedule
10461,SearchScreeningEvent,Show me the schedule for ArcLight Hollywood fo...


Extract the validation row, which holds the list of utterances per intent, into one row per utterance

In [4]:
validation = dataset.loc[['validation']].T.explode('validation').reset_index().rename(columns={
    'index': 'intent',
    'validation': 'utterance'
})
validation

Unnamed: 0,intent,utterance
0,AddToPlaylist,Add kaya newest track to my I Love My '00's R&...
1,AddToPlaylist,put K Maro track on my soul lounge list
2,AddToPlaylist,Put jiro in my clásicos del hip hop español pl...
3,AddToPlaylist,add global underground 006 sydney to my Best M...
4,AddToPlaylist,Can you put the artist Giovanni Giacomo Gastol...
...,...,...
5222,SearchScreeningEvent,What time is Beat the Devil coming on at Mann ...
5223,SearchScreeningEvent,Is Strauss Is Playing Today at the Cineplex Od...
5224,SearchScreeningEvent,What time is The Graduates of Malibu High play...
5225,SearchScreeningEvent,movie schedules of movies in the neighbourhood...


Check that roughly $2/3$ of the rows are used for training and roughly $1/3$ for validation.

In [5]:
len(training) * 100 / (len(training) + len(validation))

66.68578712555768

In [6]:
len(validation) * 100 / (len(training) + len(validation))

33.31421287444232
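
The split is not exactly 2/3 to 1/3 because `len(intents_list) // 3` rounds the validation share down for each intent. The arithmetic can be reproduced from the row counts reported in this notebook (10463 training and 5227 validation rows); the 2000-utterance intent is a made-up example:

```python
n_training, n_validation = 10463, 5227   # row counts reported in this notebook
total = n_training + n_validation

training_pct = n_training * 100 / total
validation_pct = n_validation * 100 / total
print(round(training_pct, 2), round(validation_pct, 2))

# Floor division, shown on a hypothetical intent with 2000 unique utterances:
n = 2000
print(n // 3, n - n // 3)   # validation gets 666, training gets 1334
```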

Check that each intent has a similar share of rows in both the training and validation sets. Because the split is done per intent, the two sets are stratified.

In [7]:
training.value_counts(subset=['intent'], normalize=True) * 100

intent              
GetWeather              14.823664
PlayMusic               14.775877
BookRestaurant          14.747204
RateBook                14.078180
SearchCreativeWork      14.039950
AddToPlaylist           13.896588
SearchScreeningEvent    13.638536
Name: proportion, dtype: float64

In [8]:
validation.value_counts(subset=['intent'], normalize=True) * 100

intent              
GetWeather              14.826861
PlayMusic               14.769466
BookRestaurant          14.750335
RateBook                14.080735
SearchCreativeWork      14.042472
AddToPlaylist           13.889420
SearchScreeningEvent    13.640712
Name: proportion, dtype: float64

## distilBERT modelling

Create the training and evaluation datasets for distilBERT

In [9]:
from datasets import Dataset

label2id = {v: k for k, v in enumerate(intents)}
id2label = {k: v for k, v in enumerate(intents)}

def encode(row):  
    return label2id[row['intent']]

train_dataset = Dataset.from_dict({
    'text': training.utterance.to_list(),
    'label': training.apply(encode, axis=1).to_list()
})

eval_dataset = Dataset.from_dict({
    'text': validation.utterance.to_list(),
    'label': validation.apply(encode, axis=1).to_list()
})
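
The label ids are simply the positions in the `intents` list, and `id2label` is the inverse of `label2id`. A small stand-alone check of that round trip (the list is repeated so the snippet runs by itself):

```python
intents = [
    "AddToPlaylist",
    "BookRestaurant",
    "GetWeather",
    "PlayMusic",
    "RateBook",
    "SearchCreativeWork",
    "SearchScreeningEvent",
]

label2id = {name: idx for idx, name in enumerate(intents)}
id2label = {idx: name for idx, name in enumerate(intents)}

# Encode a few intent names to integer labels, then decode them back.
names = ["GetWeather", "AddToPlaylist", "RateBook"]
encoded = [label2id[name] for name in names]
decoded = [id2label[idx] for idx in encoded]
print(encoded, decoded)
```

With pandas, `training['intent'].map(label2id)` would be a vectorized alternative to `training.apply(encode, axis=1)`.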

In [10]:
train_dataset[0]

{'text': 'insert Rock Me UP song to my list', 'label': 0}

In [11]:
eval_dataset[100]

{'text': 'add tune to my playlist ironing', 'label': 0}

Load the pretrained distilBERT tokenizer

In [12]:
from transformers import DistilBertTokenizerFast

BERT = 'distilbert-base-uncased'

tokenizer = DistilBertTokenizerFast.from_pretrained(BERT)

Tokenize the text and truncate the token sequences to be no longer than distilBERT's maximum input length

In [13]:
def preprocess(examples):
    return tokenizer(examples['text'], truncation=True) # padding="max_length"

train_dataset = train_dataset.map(preprocess, batched=True)
eval_dataset = eval_dataset.map(preprocess, batched=True)

Create a batch of examples using `DataCollatorWithPadding`. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [14]:
train_dataset[0]

{'text': 'insert Rock Me UP song to my list',
 'label': 0,
 'input_ids': [101, 19274, 2600, 2033, 2039, 2299, 2000, 2026, 2862, 102],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [15]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
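
To illustrate what dynamic padding does, here is a plain-Python sketch, not the library's implementation. It assumes right padding with `pad_token_id` 0 (the value in the model config); `pad_batch` is a hypothetical helper, and the second sequence uses made-up token ids.

```python
def pad_batch(batch, pad_token_id=0):
    """Pad every sequence to the longest one in this batch only."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_token_id] * (longest - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [
    [101, 19274, 2600, 2033, 2039, 2299, 2000, 2026, 2862, 102],  # train_dataset[0]
    [101, 2377, 2189, 102],                                       # hypothetical shorter sequence
]
padded = pad_batch(batch)
print(padded["input_ids"][1])
print(padded["attention_mask"][1])
```

Here the batch is padded to 10 tokens instead of the 512-token model maximum, which is where the efficiency gain comes from.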

Load an evaluation method with the 🤗 Evaluate library. Including a metric during training is often helpful for evaluating your model's performance.

In [16]:
import evaluate

accuracy = evaluate.load('accuracy')

Then create a function that passes the predictions and labels to `compute` to calculate the accuracy.

In [17]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
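
What `compute_metrics` does can be traced by hand on a tiny example. The sketch below uses plain Python in place of `np.argmax` and the Evaluate metric; the logits and labels are hypothetical and cover three classes only.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)

logits = [
    [0.1, 2.0, -1.0],
    [1.5, 0.2, 0.3],
    [0.0, 0.1, 0.9],
    [2.0, 1.0, 0.0],
]
labels = [1, 0, 1, 0]

# Pick the highest-scoring class per row, then count the matches.
predictions = [argmax(row) for row in logits]
accuracy_value = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(predictions, accuracy_value)
```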

In [18]:
from transformers import DistilBertForSequenceClassification, TrainingArguments, Trainer

model = DistilBertForSequenceClassification.from_pretrained(
    BERT,
    num_labels=len(intents),
    id2label=id2label,
    label2id=label2id
)
model.config

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "AddToPlaylist",
    "1": "BookRestaurant",
    "2": "GetWeather",
    "3": "PlayMusic",
    "4": "RateBook",
    "5": "SearchCreativeWork",
    "6": "SearchScreeningEvent"
  },
  "initializer_range": 0.02,
  "label2id": {
    "AddToPlaylist": 0,
    "BookRestaurant": 1,
    "GetWeather": 2,
    "PlayMusic": 3,
    "RateBook": 4,
    "SearchCreativeWork": 5,
    "SearchScreeningEvent": 6
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.53.3",
  "vocab_size": 30522
}

## References
- [Create Dataset](https://huggingface.co/docs/datasets/v4.0.0/en/create_dataset)
- [NLP quickstart](https://huggingface.co/docs/datasets/en/quickstart#nlp)
- [Trainer API](https://huggingface.co/docs/transformers/en/main_classes/trainer)
- [Evaluate](https://huggingface.co/docs/evaluate/a_quick_tour)
- [Text Classification](https://huggingface.co/docs/transformers/tasks/sequence_classification)

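As an alternative to `Trainer` (not used in this notebook), the tokenized dataset can be fed to a plain PyTorch `DataLoader`. distilBERT's tokenizer produces no `token_type_ids`, so only the columns created above are selected:

```
import torch

# keep only the columns the model consumes
torch_dataset = train_dataset.select_columns(["input_ids", "attention_mask", "label"])
torch_dataset = torch_dataset.with_format(type="torch")
# the collator pads each batch to its longest sequence
dataloader = torch.utils.data.DataLoader(torch_dataset, batch_size=32, collate_fn=data_collator)
```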

Define the hyperparameters

In [19]:
from transformers import TrainingArguments, Trainer

epochs = 2

training_args = TrainingArguments(
    output_dir = './models/output',
    num_train_epochs=epochs,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,

    # deep learning parameters
    # warmup steps for the learning rate scheduler; len(training) // 5 = 2092 here,
    # which exceeds the 654 total optimization steps, so warmup never completes
    warmup_steps=len(training) // 5,
    weight_decay=0.01,
    learning_rate=1e-5,

    logging_steps=1,
    log_level='info',
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    data_collator=data_collator,
)
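
The step counts implied by these settings can be worked out by hand. The sketch below assumes a single device, no gradient accumulation, and the default linear scheduler with warmup, under which the learning rate grows as `learning_rate * step / warmup_steps` until warmup ends.

```python
import math

num_examples = 10463      # training rows in this notebook
batch_size = 32
epochs = 2
base_lr = 1e-5

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = num_examples // 5

# Fraction of the warmup ramp covered by the time training stops.
final_lr = base_lr * min(1.0, total_steps / warmup_steps)
print(steps_per_epoch, total_steps, warmup_steps, final_lr)
```

Under these assumptions the learning rate is still ramping up when training ends and reaches only about a third of `learning_rate`. A `warmup_steps` value smaller than the total number of steps would let the schedule reach its peak.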

In [20]:
trainer.evaluate()

The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 5227
  Batch size = 32


{'eval_loss': 1.955590844154358,
 'eval_model_preparation_time': 0.0005,
 'eval_accuracy': 0.14482494738855942,
 'eval_runtime': 3.3769,
 'eval_samples_per_second': 1547.89,
 'eval_steps_per_second': 48.566}
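
These pre-training numbers are close to what a classifier guessing uniformly over the seven intents would score, which is expected because the classification head is newly initialized. A quick check of that baseline:

```python
import math

num_classes = 7

chance_accuracy = 1 / num_classes      # probability of a correct uniform guess
uniform_loss = math.log(num_classes)   # cross-entropy of a uniform prediction
print(round(chance_accuracy, 4), round(uniform_loss, 4))
```

The reported `eval_accuracy` of about 0.145 and `eval_loss` of about 1.956 sit right next to these values.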

In [21]:
trainer.train()

The following columns in the Training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 10,463
  Num Epochs = 2
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 654
  Number of trainable parameters = 66,958,855


Epoch,Training Loss,Validation Loss,Model Preparation Time,Accuracy
1,1.7289,1.631116,0.0005,0.743256
2,0.2866,0.339452,0.0005,0.96652


The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 5227
  Batch size = 32
Saving model checkpoint to ./models/output/checkpoint-327
Configuration saved in ./models/output/checkpoint-327/config.json
Model weights saved in ./models/output/checkpoint-327/model.safetensors
Saving Trainer.data_collator.tokenizer by default as Trainer.processing_class is `None`
tokenizer config file saved in ./models/output/checkpoint-327/tokenizer_config.json
Special tokens file saved in ./models/output/checkpoint-327/special_tokens_map.json
The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequen

TrainOutput(global_step=654, training_loss=1.4393571984330449, metrics={'train_runtime': 55.7894, 'train_samples_per_second': 375.09, 'train_steps_per_second': 11.723, 'total_flos': 123343578689394.0, 'train_loss': 1.4393571984330449, 'epoch': 2.0})

In [22]:
trainer.evaluate()

The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 5227
  Batch size = 32


{'eval_loss': 0.33945173025131226,
 'eval_model_preparation_time': 0.0005,
 'eval_accuracy': 0.9665199923474268,
 'eval_runtime': 2.9256,
 'eval_samples_per_second': 1786.626,
 'eval_steps_per_second': 56.056,
 'epoch': 2.0}

## Inference

In [23]:
from transformers import pipeline

classifier = pipeline('text-classification', model=model, tokenizer=tokenizer)

Device set to use mps:0


In [24]:
classifier('Is it going to be sunny tomorrow?')

[{'label': 'GetWeather', 'score': 0.7958238124847412}]
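
For a single-label classifier like this one, the `score` returned by the pipeline should be the softmax probability of the top class. A plain-Python sketch with hypothetical logits (not taken from the model):

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    shift = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.1, -0.3, 2.4, 0.0, -0.5, 0.2, 0.1]   # hypothetical logits, one per intent
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(best, round(probs[best], 3))
```

Index 2 corresponds to `GetWeather` in `id2label`, so the pipeline would report that label together with its probability.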

In [25]:
trainer.save_model()

Saving model checkpoint to ./models/output
Configuration saved in ./models/output/config.json
Model weights saved in ./models/output/model.safetensors
Saving Trainer.data_collator.tokenizer by default as Trainer.processing_class is `None`
tokenizer config file saved in ./models/output/tokenizer_config.json
Special tokens file saved in ./models/output/special_tokens_map.json


In [26]:
classifier = pipeline('text-classification', model='./models/output', tokenizer=tokenizer)

loading configuration file ./models/output/config.json
Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "AddToPlaylist",
    "1": "BookRestaurant",
    "2": "GetWeather",
    "3": "PlayMusic",
    "4": "RateBook",
    "5": "SearchCreativeWork",
    "6": "SearchScreeningEvent"
  },
  "initializer_range": 0.02,
  "label2id": {
    "AddToPlaylist": 0,
    "BookRestaurant": 1,
    "GetWeather": 2,
    "PlayMusic": 3,
    "RateBook": 4,
    "SearchCreativeWork": 5,
    "SearchScreeningEvent": 6
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformer

In [27]:
classifier('Add Stitches by Shawn Mendes to my favourites')

[{'label': 'AddToPlaylist', 'score': 0.8306653499603271}]

In [28]:
classifier('Harry Potter was 10 out of 10')

[{'label': 'RateBook', 'score': 0.7446354627609253}]

In [29]:
classifier('Put aside a crib for 2 at 20:30 at the Sartori inn')

[{'label': 'BookRestaurant', 'score': 0.7830557227134705}]

## Fine-tuning by freezing the distilBERT parameters

In [30]:
frozen_model = DistilBertForSequenceClassification.from_pretrained(
    BERT,
    num_labels=len(intents),
    id2label=id2label,
    label2id=label2id
)

loading configuration file config.json from cache at /Users/spaccs01/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/12040accade4e8a0f71eabdb258fecc2e7e948be/config.json
Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "AddToPlaylist",
    "1": "BookRestaurant",
    "2": "GetWeather",
    "3": "PlayMusic",
    "4": "RateBook",
    "5": "SearchCreativeWork",
    "6": "SearchScreeningEvent"
  },
  "initializer_range": 0.02,
  "label2id": {
    "AddToPlaylist": 0,
    "BookRestaurant": 1,
    "GetWeather": 2,
    "PlayMusic": 3,
    "RateBook": 4,
    "SearchCreativeWork": 5,
    "SearchScreeningEvent": 6
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": fals

In [31]:
for param in frozen_model.distilbert.parameters():
    param.requires_grad = False # freezing the parameters of the pre-trained model

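We freeze the parameters of the pre-trained distilBERT encoder, so only the classification head is trained, and then run the same fine-tuning as before.
In the runs recorded in this notebook, training is faster (about 20 s versus 56 s), but after two epochs the validation accuracy is much lower (0.465 versus 0.967 for the fully fine-tuned model).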
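
With the encoder frozen, the only trainable layers left are the two linear layers of the classification head, named `pre_classifier` and `classifier` in the initialization warning earlier. Their parameter count can be derived from the hidden size (`dim` = 768 in the model config) and the number of labels:

```python
dim = 768          # hidden size from the model config
num_labels = 7     # number of intents

pre_classifier = dim * dim + dim              # Linear(dim, dim): weights + bias
classifier = dim * num_labels + num_labels    # Linear(dim, num_labels): weights + bias
trainable = pre_classifier + classifier
print(trainable)
```

That is under 1% of the 66,958,855 parameters trained in the full fine-tuning run.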

In [32]:
from transformers import TrainingArguments, Trainer

epochs = 2

training_args = TrainingArguments(
    output_dir = './models/frozen/output',
    num_train_epochs=epochs,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,

    # deep learning parameters
    warmup_steps=len(training) // 5, # number of warmup steps for the learning rate scheduler
    weight_decay=0.01,
    learning_rate=1e-5,

    logging_steps=1,
    log_level='info',
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True
)

trainer = Trainer(
    model=frozen_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    data_collator=data_collator
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [33]:
trainer.evaluate()

The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 5227
  Batch size = 32


{'eval_loss': 1.9418292045593262,
 'eval_model_preparation_time': 0.0006,
 'eval_accuracy': 0.17543524009948344,
 'eval_runtime': 2.88,
 'eval_samples_per_second': 1814.946,
 'eval_steps_per_second': 56.945}

In [34]:
trainer.train()

The following columns in the Training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 10,463
  Num Epochs = 2
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 654
  Number of trainable parameters = 595,975


Epoch,Training Loss,Validation Loss,Model Preparation Time,Accuracy
1,1.9591,1.925937,0.0006,0.23723
2,1.8599,1.876745,0.0006,0.464703


The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 5227
  Batch size = 32
Saving model checkpoint to ./models/frozen/output/checkpoint-327
Configuration saved in ./models/frozen/output/checkpoint-327/config.json
Model weights saved in ./models/frozen/output/checkpoint-327/model.safetensors
Saving Trainer.data_collator.tokenizer by default as Trainer.processing_class is `None`
tokenizer config file saved in ./models/frozen/output/checkpoint-327/tokenizer_config.json
Special tokens file saved in ./models/frozen/output/checkpoint-327/special_tokens_map.json
The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are n

TrainOutput(global_step=654, training_loss=1.9237030164911113, metrics={'train_runtime': 20.1679, 'train_samples_per_second': 1037.587, 'train_steps_per_second': 32.428, 'total_flos': 123343578689394.0, 'train_loss': 1.9237030164911113, 'epoch': 2.0})

In [35]:
classifier = pipeline('text-classification', model=frozen_model, tokenizer=tokenizer)

Device set to use mps:0


In [36]:
classifier('Add Stitches by Shawn Mendes to my favourites')

[{'label': 'AddToPlaylist', 'score': 0.1608964502811432}]

In [37]:
classifier('Put aside a crib for 2 at 20:30 at the Sartori inn')

[{'label': 'BookRestaurant', 'score': 0.16289660334587097}]