# distilBERT: text classification for intent detection

This notebook uses the SNIPS dataset for intent classification. The aim is to fine-tune a pre-trained DistilBERT model to classify utterances into the following intents:

- `BookRestaurant`
- `GetPlaceDetails`
- `GetWeather`
- `GetDirections`
- `SearchPlace`
- `RequestRide`
- `ShareETA`
- `GetTrafficInformation`
- `ComparePlaces`
- `ShareCurrentLocation`

## Data loading, transformation and exploration

In [1]:
import json
import pandas as pd

In [2]:
with open('data/built-in-intents/benchmark_data.json', 'r') as file:
    builtin_intents = json.load(file)

dataset = []

for domain in builtin_intents.get('domains', []):
    intents = domain.get('intents', [])
    for intent in intents:
        for query in intent.get('queries', []):
            dataset.append({
                'Domain': domain.get('name', ''),
                'Intent': intent.get('name'),
                'Text': query.get('text').strip()
            })

builtin_intents_df = pd.DataFrame(dataset)
builtin_intents_df

Unnamed: 0,Domain,Intent,Text
0,places,ShareCurrentLocation,Share my location with Hillary's sister
1,places,ShareCurrentLocation,Send my current location to my father
2,places,ShareCurrentLocation,Share my current location with Jim
3,places,ShareCurrentLocation,Send my location to my husband
4,places,ShareCurrentLocation,Send my location
...,...,...,...
323,weather,GetWeather,Will it rain tomorrow near my all day event?
324,weather,GetWeather,I need the weather at Jo's place around 8 pm
325,weather,GetWeather,What will the weather be like when I get out o...
326,weather,GetWeather,Show me the forecast for my upcoming weekend
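
The loop above assumes a nested layout of domains, intents and queries. A minimal sketch of that layout follows, using a made-up two-query sample rather than the real `benchmark_data.json`. The sample contents and the `flatten` helper name are illustrative assumptions.

```python
# Hypothetical miniature of the structure the loader expects:
# domains -> intents -> queries -> text.
sample = {
    "domains": [
        {
            "name": "places",
            "intents": [
                {
                    "name": "ShareCurrentLocation",
                    "queries": [
                        {"text": " Send my location "},
                        {"text": "Share my current location with Jim"},
                    ],
                }
            ],
        }
    ]
}


def flatten(benchmark):
    """Turn the nested benchmark dict into one flat row per query."""
    rows = []
    for domain in benchmark.get("domains", []):
        for intent in domain.get("intents", []):
            for query in intent.get("queries", []):
                rows.append({
                    "Domain": domain.get("name", ""),
                    "Intent": intent.get("name", ""),
                    # A missing text falls back to an empty string before stripping.
                    "Text": query.get("text", "").strip(),
                })
    return rows


rows = flatten(sample)
for row in rows:
    print(row)
```

Using list defaults in `.get()` keeps the loops safe when a key is absent: an empty input simply yields no rows.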


In [3]:
builtin_intents_df.value_counts(subset=['Intent'])

Intent               
BookRestaurant           70
GetPlaceDetails          50
GetWeather               42
GetDirections            35
SearchPlace              28
RequestRide              26
ShareETA                 22
GetTrafficInformation    20
ComparePlaces            19
ShareCurrentLocation     16
Name: count, dtype: int64

In [4]:
from random import shuffle
import pandas as pd

dataset = {}
intents = list(builtin_intents_df['Intent'].unique())

for intent in intents:
    text_list = builtin_intents_df.query('Intent == @intent')['Text'].to_list()
    # shuffle the data of the list
    shuffle(text_list)
    # calculate 1/3 split
    split = len(text_list) // 3
    # 1/3 of the data is used for validation
    validation = text_list[:split]
    # 2/3 of the data is used for training
    training = text_list[split:]
    
    dataset[intent] = {}
    dataset[intent]['training'] = training
    dataset[intent]['validation'] = validation

dataset = pd.DataFrame(dataset)
dataset

Unnamed: 0,ShareCurrentLocation,ComparePlaces,GetPlaceDetails,SearchPlace,BookRestaurant,RequestRide,GetDirections,ShareETA,GetTrafficInformation,GetWeather
training,"[Share my current location with Jim, Share my ...",[What's the cheapest place between my favorite...,[How much is it to go to the top of Empire Sta...,[Find me a salad bar I can go to for my lunch ...,[Book a table for 8 at Tavern on the Green for...,"[Book a Lyft car to go to 33 greene street, I ...",[Show me the fastest itinerary to go to Willia...,"[Share with Franz my ETA, Share my estimated t...",[Is there traffic jam from here to Brooklyn br...,"[Will it rain tomorrow near my all day event?,..."
validation,"[Share my current location, Share my location ...","[Is my Airbnb closer than John's hotel?, What ...","[Show me the Butcher's Daughter's menu, How cr...",[I want to eat some fried chicken. Any suggest...,[Make a reservation at Delmonico's on Saturday...,"[Is there any Uber around?, Get an Uber to go ...","[Directions to JFK airport at 7am, Cycling dir...","[Send a message to Michael with my ETA, Send m...",[Shoud I expect traffic between broadway and p...,[What will the weather be like at Jo's place t...
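
Because `shuffle` is called without a seed, every run of the cell above produces a different split. A minimal sketch of a reproducible variant follows, assuming a dedicated `random.Random` instance with a fixed seed. The `split_utterances` helper, the seed value and the placeholder texts are illustrative and not part of the notebook.

```python
import random


def split_utterances(texts, seed=0):
    """Shuffle a copy of `texts` and split it 1/3 validation, 2/3 training."""
    rng = random.Random(seed)   # private generator: same seed, same order
    shuffled = list(texts)      # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = len(shuffled) // 3    # floor division, as in the notebook cell
    return {"training": shuffled[cut:], "validation": shuffled[:cut]}


utterances = [f"utterance {i}" for i in range(16)]  # placeholder texts
first = split_utterances(utterances)
second = split_utterances(utterances)

print(len(first["training"]), len(first["validation"]))  # 11 5
print(first == second)                                   # True
```

A fixed seed makes the reported accuracy comparable between runs, at the cost of always evaluating on the same validation utterances.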


Extract the training row and expand its per-intent lists into one utterance per row.

In [38]:
training = dataset.loc[['training']].T.explode('training').reset_index().rename(columns={
    'index': 'intent',
    'training': 'utterance'
})
training

Unnamed: 0,intent,utterance
0,ShareCurrentLocation,Share my current location with Jim
1,ShareCurrentLocation,Share my location with Robert for the next 10 min
2,ShareCurrentLocation,Send my current location to the friends I'm me...
3,ShareCurrentLocation,Send my location to my husband
4,ShareCurrentLocation,Send my location
...,...,...
218,GetWeather,Show me the forecast for my upcoming weekend
219,GetWeather,What will the weather be like when I get out o...
220,GetWeather,It is a beautiful day for a walk?
221,GetWeather,Will it rain in the next 30 minutes?


Extract the validation row and expand its per-intent lists into one utterance per row.

In [39]:
validation = dataset.loc[['validation']].T.explode('validation').reset_index().rename(columns={
    'index': 'intent',
    'validation': 'utterance'
})
validation

Unnamed: 0,intent,utterance
0,ShareCurrentLocation,Share my current location
1,ShareCurrentLocation,Share my location with Jo until 8pm
2,ShareCurrentLocation,Send my current location to my father
3,ShareCurrentLocation,Share my location with my office manager until...
4,ShareCurrentLocation,Share my location to mum until I get to school
...,...,...
100,GetWeather,Is it cold outside?
101,GetWeather,Is it going to be sunny next week?
102,GetWeather,What will the weather be like from 8am to 2pm ...
103,GetWeather,How windy will it be tomorrow?


Check that roughly $2/3$ of the rows are used for training and $1/3$ for validation.

In [7]:
len(training) * 100 / (len(training) + len(validation))

67.98780487804878

In [8]:
len(validation) * 100 / (len(training) + len(validation))

32.01219512195122
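
The split is not exactly 2/3 to 1/3 because `len(text_list) // 3` rounds the validation share down for every intent. A short check of that arithmetic, using the per-intent counts printed earlier:

```python
# Utterances per intent, copied from the value_counts output earlier in the notebook.
counts = {
    "BookRestaurant": 70,
    "GetPlaceDetails": 50,
    "GetWeather": 42,
    "GetDirections": 35,
    "SearchPlace": 28,
    "RequestRide": 26,
    "ShareETA": 22,
    "GetTrafficInformation": 20,
    "ComparePlaces": 19,
    "ShareCurrentLocation": 16,
}

# Floor division gives the validation size; the remainder goes to training.
validation_sizes = {intent: n // 3 for intent, n in counts.items()}
training_sizes = {intent: n - validation_sizes[intent] for intent, n in counts.items()}

total = sum(counts.values())
n_validation = sum(validation_sizes.values())
n_training = sum(training_sizes.values())

print(total, n_training, n_validation)       # 328 223 105
print(round(n_training * 100 / total, 2))    # 67.99
print(round(n_validation * 100 / total, 2))  # 32.01
```

The 223 and 105 rows match the `Num examples` values reported by the `Trainer` logs later on.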

Check that each intent has a similar share of rows in the training and validation sets (stratification).

In [9]:
training.value_counts(subset=['intent'], normalize=True) * 100

intent               
BookRestaurant           21.076233
GetPlaceDetails          15.246637
GetWeather               12.556054
GetDirections            10.762332
SearchPlace               8.520179
RequestRide               8.071749
ShareETA                  6.726457
GetTrafficInformation     6.278027
ComparePlaces             5.829596
ShareCurrentLocation      4.932735
Name: proportion, dtype: float64

In [10]:
validation.value_counts(subset=['intent'], normalize=True) * 100

intent               
BookRestaurant           21.904762
GetPlaceDetails          15.238095
GetWeather               13.333333
GetDirections            10.476190
SearchPlace               8.571429
RequestRide               7.619048
ShareETA                  6.666667
ComparePlaces             5.714286
GetTrafficInformation     5.714286
ShareCurrentLocation      4.761905
Name: proportion, dtype: float64
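
Splitting each intent separately is what keeps the two percentage columns so close. A sketch that rebuilds both columns from the raw counts and reports the largest gap between them follows. The counts come from the earlier `value_counts` output, and the variable names are illustrative.

```python
counts = {
    "BookRestaurant": 70,
    "GetPlaceDetails": 50,
    "GetWeather": 42,
    "GetDirections": 35,
    "SearchPlace": 28,
    "RequestRide": 26,
    "ShareETA": 22,
    "GetTrafficInformation": 20,
    "ComparePlaces": 19,
    "ShareCurrentLocation": 16,
}

# Same rule as the split cell: one third (rounded down) goes to validation.
valid = {intent: n // 3 for intent, n in counts.items()}
train = {intent: n - valid[intent] for intent, n in counts.items()}
n_train = sum(train.values())
n_valid = sum(valid.values())

gaps = {}
for intent in counts:
    train_pct = 100 * train[intent] / n_train
    valid_pct = 100 * valid[intent] / n_valid
    gaps[intent] = abs(train_pct - valid_pct)
    print(f"{intent:<22}{train_pct:7.2f}{valid_pct:7.2f}")

# Intent whose share differs most between the two sets.
worst = max(gaps, key=gaps.get)
print(worst, round(gaps[worst], 2))  # BookRestaurant 0.83
```

Even the largest gap stays below one percentage point, so both sets see the same class balance.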

## distilBERT modelling

Create the training and evaluation datasets for distilBERT

In [11]:
from datasets import Dataset

label2id = {v: k for k, v in enumerate(intents)}
id2label = {k: v for k, v in enumerate(intents)}

def encode(row):  
    return label2id[row['intent']]

train_dataset = Dataset.from_dict({
    'text': training.utterance.to_list(),
    'label': training.apply(encode, axis=1).to_list()
})

eval_dataset = Dataset.from_dict({
    'text': validation.utterance.to_list(),
    'label': validation.apply(encode, axis=1).to_list()
})

In [12]:
train_dataset[0]

{'text': 'Share my current location with Jim', 'label': 0}

In [13]:
eval_dataset[100]

{'text': 'Is it cold outside?', 'label': 9}
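
The two dictionaries are inverse mappings between intent names and integer class ids. A minimal sketch of the round trip follows, with the intent order copied from the model config printed further down. In the notebook that order comes from `unique()`, so it depends on the order of the rows in the source data.

```python
# Intent order as it appears in the model config's id2label.
intents = [
    "ShareCurrentLocation", "ComparePlaces", "GetPlaceDetails", "SearchPlace",
    "BookRestaurant", "RequestRide", "GetDirections", "ShareETA",
    "GetTrafficInformation", "GetWeather",
]

label2id = {name: idx for idx, name in enumerate(intents)}
id2label = {idx: name for idx, name in enumerate(intents)}

labels = ["ShareCurrentLocation", "GetWeather"]
encoded = [label2id[name] for name in labels]   # names -> integer ids
decoded = [id2label[idx] for idx in encoded]    # integer ids -> names

print(encoded)            # [0, 9]
print(decoded == labels)  # True
```

Passing both mappings to the model is what lets the pipeline report readable intent names instead of `LABEL_0`-style placeholders.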

Load a pretrained distilBERT tokenizer and model

In [27]:
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

BERT = 'distilbert-base-uncased'

tokenizer = DistilBertTokenizerFast.from_pretrained(BERT)
model = DistilBertForSequenceClassification.from_pretrained(
    BERT,
    num_labels=len(intents),
    id2label=id2label,
    label2id=label2id
)
model.config

loading file vocab.txt from cache at /Users/spaccs01/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/12040accade4e8a0f71eabdb258fecc2e7e948be/vocab.txt
loading file tokenizer.json from cache at /Users/spaccs01/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/12040accade4e8a0f71eabdb258fecc2e7e948be/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /Users/spaccs01/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/12040accade4e8a0f71eabdb258fecc2e7e948be/tokenizer_config.json
loading file chat_template.jinja from cache at None
loading configuration file config.json from cache at /Users/spaccs01/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/12040accade4e8a0f71eabdb258fecc2e7e948be/config.json

DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "ShareCurrentLocation",
    "1": "ComparePlaces",
    "2": "GetPlaceDetails",
    "3": "SearchPlace",
    "4": "BookRestaurant",
    "5": "RequestRide",
    "6": "GetDirections",
    "7": "ShareETA",
    "8": "GetTrafficInformation",
    "9": "GetWeather"
  },
  "initializer_range": 0.02,
  "label2id": {
    "BookRestaurant": 4,
    "ComparePlaces": 1,
    "GetDirections": 6,
    "GetPlaceDetails": 2,
    "GetTrafficInformation": 8,
    "GetWeather": 9,
    "RequestRide": 5,
    "SearchPlace": 3,
    "ShareCurrentLocation": 0,
    "ShareETA": 7
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtyp

Tokenize the text and truncate the token sequences so they are no longer than DistilBERT's maximum input length (512 tokens, per `max_position_embeddings` in the config above).

In [28]:
def preprocess(examples):
    return tokenizer(examples['text'], truncation=True) # padding="max_length"

train_dataset = train_dataset.map(preprocess, batched=True)
eval_dataset = eval_dataset.map(preprocess, batched=True)

Map:   0%|          | 0/223 [00:00<?, ? examples/s]

Map:   0%|          | 0/105 [00:00<?, ? examples/s]

In [29]:
train_dataset[0]

{'text': 'Share my current location with Jim',
 'label': 0,
 'input_ids': [101, 3745, 2026, 2783, 3295, 2007, 3958, 102],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

In [30]:
eval_dataset[100]

{'text': 'Is it cold outside?',
 'label': 9,
 'input_ids': [101, 2003, 2009, 3147, 2648, 1029, 102],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

Create a batch of examples using `DataCollatorWithPadding`. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [31]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
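
Conceptually, dynamic padding extends every sequence in a batch to the length of the longest one and marks the padded positions with 0 in the attention mask. The plain-Python sketch below illustrates that idea only; it is not the actual `DataCollatorWithPadding` implementation, which also converts the batch to tensors. `pad_batch` is a hypothetical helper, and the pad id 0 matches `pad_token_id` in the model config.

```python
def pad_batch(batch, pad_token_id=0):
    """Pad every example to the longest sequence in this batch only."""
    longest = max(len(example["input_ids"]) for example in batch)
    padded = {"input_ids": [], "attention_mask": []}
    for example in batch:
        missing = longest - len(example["input_ids"])
        padded["input_ids"].append(example["input_ids"] + [pad_token_id] * missing)
        # Padded positions get mask 0 so attention ignores them.
        padded["attention_mask"].append(example["attention_mask"] + [0] * missing)
    return padded


# The two tokenized examples shown above (8 and 7 tokens).
batch = [
    {"input_ids": [101, 3745, 2026, 2783, 3295, 2007, 3958, 102],
     "attention_mask": [1] * 8},
    {"input_ids": [101, 2003, 2009, 3147, 2648, 1029, 102],
     "attention_mask": [1] * 7},
]

padded = pad_batch(batch)
print(padded["input_ids"][1])       # [101, 2003, 2009, 3147, 2648, 1029, 102, 0]
print(padded["attention_mask"][1])  # [1, 1, 1, 1, 1, 1, 1, 0]
```

Here the shorter example gains a single pad token, whereas padding to the 512-token maximum would add more than 500 to each.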

Load an evaluation method with the 🤗 Evaluate library. Including a metric during training is often helpful for evaluating your model's performance. Then create a function that passes your predictions and labels to compute the accuracy.

In [32]:
import evaluate
import numpy as np

accuracy = evaluate.load('accuracy')

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
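
`compute_metrics` receives one row of logits per example, takes the index of the largest logit as the predicted class, and compares it with the true label. A dependency-free sketch of the same calculation follows, standing in for `np.argmax` and the `accuracy` metric. The logits and labels are invented for illustration.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(pred == label for pred, label in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Invented logits for four examples over three classes.
logits = [
    [2.0, 0.1, -1.0],  # predicts class 0
    [0.3, 0.2, 1.5],   # predicts class 2
    [0.0, 0.9, 0.4],   # predicts class 1 (label is 2, so this one is wrong)
    [1.2, 0.8, 0.1],   # predicts class 0
]
labels = [0, 2, 2, 0]

print(accuracy_from_logits(logits, labels))  # {'accuracy': 0.75}
```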

Define the hyperparameters for the fine-tuning process

In [20]:
from transformers import TrainingArguments, Trainer

epochs = 2

training_args = TrainingArguments(
    output_dir = './models/output',
    num_train_epochs=epochs,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,

    logging_steps=1,
    log_level='info',
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    data_collator=data_collator
)

## Can the pre-trained DistilBERT recognise the labels without fine-tuning?

In [21]:
from transformers import pipeline

classifier = pipeline('text-classification', model=model, tokenizer=tokenizer)

Device set to use mps:0


In [22]:
classifier('Send my current location to Anna')

[{'label': 'RequestRide', 'score': 0.12309124320745468}]

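The answer is no. The classification head is newly initialised, so the score of about 0.12 is close to the 0.10 that a uniform guess over ten intents would give. Let's fine-tune DistilBERT on this small dataset and see whether its performance improves.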

## Fine Tuning

In [23]:
trainer.evaluate()

The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 105
  Batch size = 32


{'eval_loss': 2.292901039123535,
 'eval_model_preparation_time': 0.0016,
 'eval_accuracy': 0.09523809523809523,
 'eval_runtime': 0.3826,
 'eval_samples_per_second': 274.467,
 'eval_steps_per_second': 10.456}
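
These numbers are what a random guess over ten intents would produce. The cross-entropy of a uniform prediction over $N$ classes is $\ln N$, and the expected accuracy is $1/N$. A quick check against the values reported above:

```python
import math

num_labels = 10

chance_accuracy = 1 / num_labels
uniform_loss = math.log(num_labels)  # cross-entropy of a uniform prediction

print(round(chance_accuracy, 3))  # 0.1
print(round(uniform_loss, 4))     # 2.3026

# Values reported by trainer.evaluate() before any fine-tuning.
eval_loss = 2.292901039123535
eval_accuracy = 0.09523809523809523

print(round(abs(eval_loss - uniform_loss), 4))  # 0.0097
print(round(eval_accuracy * 105))               # 10 correct out of 105
```

An evaluation loss within 0.01 of $\ln 10$ indicates that the untrained head carries no information about the intents yet.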

In [24]:
trainer.train()

The following columns in the Training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 223
  Num Epochs = 2
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 14
  Number of trainable parameters = 66,961,162


Epoch,Training Loss,Validation Loss,Model Preparation Time,Accuracy
1,2.1867,2.180512,0.0016,0.380952
2,2.1419,2.086236,0.0016,0.4


The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 105
  Batch size = 32
Saving model checkpoint to ./models/output/checkpoint-7
Configuration saved in ./models/output/checkpoint-7/config.json
Model weights saved in ./models/output/checkpoint-7/model.safetensors
Saving Trainer.data_collator.tokenizer by default as Trainer.processing_class is `None`
tokenizer config file saved in ./models/output/checkpoint-7/tokenizer_config.json
Special tokens file saved in ./models/output/checkpoint-7/special_tokens_map.json
The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassific

TrainOutput(global_step=14, training_loss=2.207785333905901, metrics={'train_runtime': 3.5772, 'train_samples_per_second': 124.677, 'train_steps_per_second': 3.914, 'total_flos': 2374144102500.0, 'train_loss': 2.207785333905901, 'epoch': 2.0})

In [25]:
classifier = pipeline('text-classification', model=model, tokenizer=tokenizer)

Device set to use mps:0


Try the same utterance again with the fine-tuned model:

In [26]:
classifier('Send my current location to Anna')

[{'label': 'BookRestaurant', 'score': 0.12027914077043533}]

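The prediction is still wrong and the score is still near chance. Two epochs at batch size 32 amount to only 14 optimisation steps, and validation accuracy reached just 0.40. Train for longer, with 10 epochs and a batch size of 16. Judging by the execution counts, the model was reloaded (cell 27) before this run, which would explain why the evaluation below starts from chance level again.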

In [33]:
epochs = 10

training_args = TrainingArguments(
    output_dir = './models/output',
    num_train_epochs=epochs,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,

    logging_steps=1,
    log_level='info',
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    data_collator=data_collator
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [34]:
trainer.evaluate()

The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 105
  Batch size = 16


{'eval_loss': 2.3313937187194824,
 'eval_model_preparation_time': 0.0018,
 'eval_accuracy': 0.05714285714285714,
 'eval_runtime': 0.2579,
 'eval_samples_per_second': 407.114,
 'eval_steps_per_second': 27.141}

In [35]:
trainer.train()

The following columns in the Training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 223
  Num Epochs = 10
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 140
  Number of trainable parameters = 66,961,162


Epoch,Training Loss,Validation Loss,Model Preparation Time,Accuracy
1,1.7914,1.811198,0.0018,0.409524
2,1.3235,1.302345,0.0018,0.67619
3,0.8976,0.900782,0.0018,0.857143
4,0.4126,0.61889,0.0018,0.914286
5,0.3507,0.440623,0.0018,0.952381
6,0.2497,0.342227,0.0018,0.952381
7,0.1669,0.267084,0.0018,0.971429
8,0.1231,0.238706,0.0018,0.952381
9,0.0857,0.225424,0.0018,0.961905
10,0.0909,0.221194,0.0018,0.961905
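
With `load_best_model_at_end=True` and no `metric_for_best_model` set, the `Trainer` should rank checkpoints by evaluation loss. The epoch with the lowest validation loss and the epoch with the highest accuracy need not be the same. A small sketch over the table above follows; the tuples are copied from it, and nothing here calls the `Trainer`.

```python
# (epoch, training loss, validation loss, accuracy), copied from the table above.
history = [
    (1, 1.7914, 1.811198, 0.409524),
    (2, 1.3235, 1.302345, 0.676190),
    (3, 0.8976, 0.900782, 0.857143),
    (4, 0.4126, 0.618890, 0.914286),
    (5, 0.3507, 0.440623, 0.952381),
    (6, 0.2497, 0.342227, 0.952381),
    (7, 0.1669, 0.267084, 0.971429),
    (8, 0.1231, 0.238706, 0.952381),
    (9, 0.0857, 0.225424, 0.961905),
    (10, 0.0909, 0.221194, 0.961905),
]

best_by_loss = min(history, key=lambda row: row[2])
best_by_accuracy = max(history, key=lambda row: row[3])

print(best_by_loss[0], best_by_loss[2])          # 10 0.221194
print(best_by_accuracy[0], best_by_accuracy[3])  # 7 0.971429
```

Validation loss is still falling at epoch 10, while accuracy peaked at epoch 7 and has hovered around 0.96 since. On a 105-example validation set, one misclassified utterance shifts accuracy by about 0.0095, so these late differences amount to one or two examples.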


The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 105
  Batch size = 16
Saving model checkpoint to ./models/output/checkpoint-14
Configuration saved in ./models/output/checkpoint-14/config.json
Model weights saved in ./models/output/checkpoint-14/model.safetensors
Saving Trainer.data_collator.tokenizer by default as Trainer.processing_class is `None`
tokenizer config file saved in ./models/output/checkpoint-14/tokenizer_config.json
Special tokens file saved in ./models/output/checkpoint-14/special_tokens_map.json
The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClas

TrainOutput(global_step=140, training_loss=0.6817898352763483, metrics={'train_runtime': 18.0393, 'train_samples_per_second': 123.619, 'train_steps_per_second': 7.761, 'total_flos': 11296009444200.0, 'train_loss': 0.6817898352763483, 'epoch': 10.0})
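
The `Total optimization steps` values in the two training logs follow from the dataset size, the batch size and the number of epochs. A sketch of that arithmetic follows, assuming a single device and no gradient accumulation, as both logs report. `optimisation_steps` is an illustrative helper.

```python
import math


def optimisation_steps(num_examples, batch_size, epochs):
    """Return (steps per epoch, total steps); the last batch may be partial."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


print(optimisation_steps(223, 32, 2))   # (7, 14)   first run
print(optimisation_steps(223, 16, 10))  # (14, 140) second run
```

This also explains the checkpoint names in the logs. With `save_strategy='epoch'`, the first checkpoint is written at the end of epoch 1, which is step 7 in the first run (`checkpoint-7`) and step 14 in the second (`checkpoint-14`).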

In [36]:
classifier = pipeline('text-classification', model=model, tokenizer=tokenizer)

Device set to use mps:0


In [37]:
classifier('Send my current location to Anna')

[{'label': 'ShareCurrentLocation', 'score': 0.8326056003570557}]

The fine-tuned model now classifies the utterance correctly as `ShareCurrentLocation`, with a score of about 0.83.
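
The score can also be used to reject low-confidence predictions instead of always returning the top label. A minimal sketch follows, assuming the list-of-dicts output format shown above. The `top_intent` helper and the 0.5 threshold are hypothetical choices, not values tuned in this notebook.

```python
def top_intent(predictions, threshold=0.5):
    """Return the best label if its score clears `threshold`, else None."""
    best = max(predictions, key=lambda p: p["score"])
    return best["label"] if best["score"] >= threshold else None


# Pipeline outputs copied from the cells above.
before_fine_tuning = [{"label": "RequestRide", "score": 0.12309124320745468}]
after_fine_tuning = [{"label": "ShareCurrentLocation", "score": 0.8326056003570557}]

print(top_intent(before_fine_tuning))  # None
print(top_intent(after_fine_tuning))   # ShareCurrentLocation
```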