# Text classification with DistilBERT

This notebook presents some ways to train text classification models with deep learning architectures. The notebook is organised as follows:
- Installing libraries
- Data formatting
- Setting up evaluation metrics
- Classifying with transformers
- Visualisation of results

In [2]:
from platform import python_version

#check python version
python_version()

'3.8.16'

In [6]:
# only for google colab
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


## Installing libraries

We install and import the libraries for text classification with transformers.

In [None]:
!pip install transformers
!pip install torch
!pip install evaluate
!pip install seqeval
!pip install transformers[tf-cpu]
!pip install transformers[torch]
!pip install transformers[flax]

In [None]:
!pip install tensorflow

## Data formatting

First, we read the data set and take a look at some samples.

Since we will be using transformers, extensive preprocessing of the texts is not necessary. If we used more traditional machine learning approaches, such as Naive Bayes or Random Forest, we would want to convert the texts to lowercase, remove punctuation, remove stop words and, possibly, lemmatise the tokens as part of feature engineering.
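
For comparison, the classical preprocessing steps mentioned above could look roughly like the sketch below. It uses only the standard library; the stop-word list and the example sentence are hypothetical and far too small for real use, and lemmatisation is left out because it needs an external library.

```python
import string

# Hypothetical, deliberately tiny stop-word list for illustration only;
# a real pipeline would use a complete German list from an NLP library.
STOP_WORDS = {"und", "der", "die", "das", "wir", "auf", "zur", "in"}

def preprocess(text):
    """Lowercase the text, strip ASCII punctuation and drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in no_punct.split() if tok not in STOP_WORDS]

# made-up example sentence, not taken from the data set
tokens = preprocess("Gute Kenntnisse in Python und SQL!")
print(tokens)
```

None of this is applied in the rest of the notebook, because the DistilBERT tokenizer works on the raw, cased text.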

In [11]:
import json

# open the file and parse the JSON content
with open('/content/drive/My Drive/tc-dataset.json', encoding='utf-8') as f:
    data = json.load(f)

In [12]:
# check sample of data set
data["data"][0]

{'text': 'Standort Trovarit AG München, Deutschland', 'label': 'none'}

In [13]:
from collections import Counter

# check number of samples in each class
c = Counter(dictionary['label'] for dictionary in data["data"])
print(c)

Counter({'none': 4325, 'soft': 3635, 'tech': 2289})


We create a dataframe for easier processing.

In [15]:
texts = []
labels = []

# extract texts (no further preprocessing is applied)
for dictionary in data["data"]:
    texts.append(dictionary['text'])

# extract labels
# we need numeric labels for transformers
# 0 = none, 1 = tech, 2 = soft
label_to_id = {"none": 0, "tech": 1, "soft": 2}
for dictionary in data["data"]:
    # an unknown label raises a KeyError instead of silently
    # misaligning the texts and labels lists
    labels.append(label_to_id[dictionary['label']])

# check information for first element
print(texts[0])
print(labels[0])

Standort Trovarit AG München, Deutschland
0


In [16]:
import pandas as pd

# create dataframe from two lists: labels and texts
final_df = pd.DataFrame({'labels': labels, 'texts': texts})

display(final_df)

Unnamed: 0,labels,texts
0,0,"Standort Trovarit AG München, Deutschland"
1,0,Wir freuen uns auf Ihre Bewerbung unter Angabe...
2,1,Qualifikation zur Heimleitung gemäß Heimperson...
3,2,Gute organisatorische und konzeptionelle Fähig...
4,2,"Teamfähigkeit, hohe Flexibilität und Einsatzbe..."
...,...,...
10244,0,Zu unserer Unternehmensgruppe gehören u. a. To...
10245,0,Zum Aufbau unserer Abteilung Entwicklung Posit...
10246,0,Zur vereinfachten Lesbarkeit verwenden wir im ...
10247,0,Zur Verstärkung der Marketing Abteilung WPR su...


We will split the data set into train and test sets using a stratified split, so that every class is represented in both sets in the same proportion as in the full data set. The approach is adapted from https://proclusacademy.com/blog/stratified_sampling_pandas/
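
To make the idea concrete, here is a minimal, dependency-free sketch of a stratified split. It is not the pandas code used in this notebook: the function name `stratified_split`, the fixed seed and the toy label list are assumptions for illustration. The toy labels only mirror the class sizes counted earlier.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx) so that each class keeps its share."""
    rng = random.Random(seed)

    # group the sample indices by class
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # draw the same fraction from every class for the test set
    test_idx = []
    for indices in by_class.values():
        n_test = round(len(indices) * test_frac)
        test_idx.extend(rng.sample(indices, n_test))

    # everything that was not drawn goes to the train set (no overlap)
    test_lookup = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in test_lookup]
    return train_idx, sorted(test_idx)

# toy labels with the class sizes of this data set (0 = none, 2 = soft, 1 = tech)
toy_labels = [0] * 4325 + [2] * 3635 + [1] * 2289
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))
print(Counter(toy_labels[i] for i in test_idx))
```

The important property is that the train indices are the complement of the test indices, so no text can appear in both sets.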

In [17]:
# create statistics for each class

# Get ratio instead of raw numbers using normalize=True
expected_ratio = final_df['labels'].value_counts(normalize=True)

# Round and then convert to percentage
expected_ratio = expected_ratio.round(4)*100

# convert to a DataFrame and store in variable 'label_ratios'
# We'll use this variable to compare ratios for samples 
# selected using Stratified Sampling 
label_ratios = pd.DataFrame({'Expected':expected_ratio})
label_ratios

Unnamed: 0,Expected
0,42.2
2,35.47
1,22.33


In [18]:
# Stratified Sampling
# Use groupby and apply to select sample 
# which maintains the population group ratios
test_set = final_df.groupby('labels').apply(
    lambda x: x.sample(frac=0.20))

# get rid of the group key that groupby adds as the first index level
test_set = test_set.droplevel(0)

# take a look at our test set
display(test_set)

Unnamed: 0,labels,texts
7028,0,Mit einem Umsatz von ungefähr 1 Mrd.
8681,0,Fachbereiche
163,0,Leistungen für Bewerber Beratung und Vermittlu...
9848,0,Was wir dir bieten:
2790,0,© 2017 msg systems ag
...,...,...
3998,2,Eigenständige Arbeitsweise.
5599,2,Französischkenntnisse wünschenswert
489,2,"Selbständige, gewissenhafte Arbeitsweise, Vera..."
4515,2,Motivation zur internationalen Teamarbeit


In [19]:
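# create the train set from all rows that are NOT in the test set,
# so that the two sets cannot overlap
train_set = final_df.drop(test_set.index)

# shuffle within each class; the test set is stratified, so the remainder is too
train_set = train_set.groupby('labels').sample(frac=1.0)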

# check our train set
display(train_set)

Unnamed: 0,labels,texts
8710,0,ca. 12.000 Anästhesieleistungen/Jahr
8815,0,PRIVATKUNDEN
8948,0,Direkt zum Inhalt
8867,0,© Copyright 2019.
2781,0,Sie arbeiten in einem erfahrenen Team gemeinsa...
...,...,...
6082,2,"Selbstständiges, strukturiertes und eigenveran..."
7114,2,"Sehr selbstständige, proaktive und strukturier..."
3217,2,"Sie sind flexibel, belastbar und selbstständig..."
3061,2,In einem Team aus Fachkräften sind Sie kompete...


Compare the class ratios in the entire data set, the test set and the train set:

In [25]:
# class ratios in the test and train sets
test_ratio = test_set['labels'].value_counts(normalize=True)
train_ratio = train_set['labels'].value_counts(normalize=True)

# Round and then convert to percentage
test_ratio = test_ratio.round(4)*100
train_ratio = train_ratio.round(4)*100

# name each series so that it gets its own column
test_ratio.name = 'test_set'
train_ratio.name = 'train_set'

# add both to the variable label_ratios, which already holds
# the expected proportions
label_ratios = pd.concat([label_ratios, test_ratio, train_ratio], axis=1)
label_ratios

Unnamed: 0,Expected,test_set,train_set
0,42.2,42.2,42.2
2,35.47,35.46,35.47
1,22.33,22.34,22.33


We preprocess the texts: we tokenise them and truncate them so that they do not exceed the maximum length. We also pad the shorter texts so that all sequences have the same length. Each text gets input ids and an attention mask, which separates real tokens from padding. See https://github.com/huggingface/transformers/issues/11455 or https://huggingface.co/transformers/v3.2.0/custom_datasets.html
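
The sketch below illustrates what truncation, padding and the attention mask mean, using plain Python lists. It is not how the tokenizer is implemented: `pad_and_truncate` is a hypothetical helper, the token ids are made up, and the naive slicing here would cut off a closing special token, which a real tokenizer keeps.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Truncate to max_length, then pad every sequence to the longest one."""
    truncated = [ids[:max_length] for ids in batch_ids]
    target_len = max(len(ids) for ids in truncated)

    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = target_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# hypothetical token ids for three texts of different length
batch = [
    [102, 5374, 1293, 1673, 103],
    [102, 1685, 103],
    [102, 7, 8, 9, 10, 11, 12, 103],
]
encoded = pad_and_truncate(batch, max_length=6)
for ids, mask in zip(encoded["input_ids"], encoded["attention_mask"]):
    print(ids, mask)
```

The model uses the attention mask to ignore the padded positions, so padding does not change the prediction for a text.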

In [26]:
from transformers import DistilBertTokenizerFast

#load tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-german-cased")

In [27]:
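def encode_data(texts):
    # tokenise a list of texts into input ids and attention masks
    return tokenizer(
        texts,
        add_special_tokens=True,
        return_attention_mask=True,
        padding=True,       # pad to the longest text in the list
        truncation=True,    # cut texts that are longer than max_length
        max_length=200,
        return_tensors='pt'
    )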

In [28]:
import torch

class my_Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = torch.tensor(labels)

    def __getitem__(self, idx):
        # return the encoded text together with its label
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)

In [29]:
# get encodings for data
encoded_data_train = encode_data(train_set['texts'].tolist())
encoded_data_test = encode_data(test_set['texts'].tolist())

# add labels to data 
dataset_train = my_Dataset(encoded_data_train, train_set['labels'].tolist())
dataset_test = my_Dataset(encoded_data_test, test_set['labels'].tolist())

In [32]:
# define labels for training

id2label = {0: "none", 1: "tech", 2: "soft"}
label2id = {"none": 0, "tech": 1, "soft": 2}

## Setting up evaluation metrics

We set up the metrics that will be used to evaluate our model. See https://stackoverflow.com/questions/67457480/how-to-get-the-accuracy-per-epoch-or-step-for-the-huggingface-transformers-train

In [40]:
import evaluate

accuracy = evaluate.load("accuracy")

In [46]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
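
As a rough illustration of what `compute_metrics` calculates, the following dependency-free sketch takes the argmax over each row of logits and compares it with the reference label. The helper names and the toy logits are made up for this example; the notebook itself relies on `np.argmax` and the `evaluate` library.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    """Share of rows whose highest-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# toy logits for four texts and three classes (0 = none, 1 = tech, 2 = soft)
toy_logits = [
    [2.1, 0.3, -1.0],
    [0.2, 0.1, 1.7],
    [0.0, 1.2, 0.4],
    [1.5, 1.4, 0.2],
]
toy_labels = [0, 2, 1, 1]
result = accuracy_from_logits(toy_logits, toy_labels)
print(result)
```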

## Classifying with transformers

Each text in the data set has exactly one label out of three classes, so this is a multi-class classification task. If a text could carry several labels at once, multi-label classification would be the appropriate setup. Most of the data is in German, so we will use a language model trained on German data. However, I found that part of the data is in English, so we will experiment with the multilingual model as well.

https://huggingface.co/distilbert-base-multilingual-cased
https://huggingface.co/distilbert-base-german-cased
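
The difference between the multi-class and the multi-label setup can be sketched on the level of the model's output scores. This is a simplified illustration with made-up logits and a hypothetical threshold of 0.5, not code from the training pipeline: multi-class picks the single most probable class after a softmax, while multi-label scores every class independently with a sigmoid.

```python
import math

LABELS = ["none", "tech", "soft"]

def softmax(logits):
    """Turn scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Turn a single score into an independent probability."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_multiclass(logits):
    """Exactly one label: the class with the highest probability."""
    probs = softmax(logits)
    return LABELS[probs.index(max(probs))]

def predict_multilabel(logits, threshold=0.5):
    """Any number of labels: every class whose probability passes the threshold."""
    return [label for label, x in zip(LABELS, logits) if sigmoid(x) >= threshold]

toy_logits = [-1.2, 2.0, 0.3]
print(predict_multiclass(toy_logits))
print(predict_multilabel(toy_logits))
```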

In [49]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
#define model
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-german-cased", num_labels=3, id2label=id2label, label2id=label2id)

#define training arguments
training_args = TrainingArguments(
    output_dir="german_distilbert",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)

# define the trainer with the train and evaluation data
# pass a data collator only if padding has not already been applied during tokenisation
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_train,
    eval_dataset=dataset_test,
    tokenizer=tokenizer,
    #data_collator=data_collator,
    compute_metrics=compute_metrics
)

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--distilbert-base-german-cased/snapshots/06b1dc5ba050ddbf462d060df38f906eedb31b01/config.json
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-german-cased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "none",
    "1": "tech",
    "2": "soft"
  },
  "initializer_range": 0.02,
  "label2id": {
    "none": 0,
    "soft": 2,
    "tech": 1
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": true,
  "tie_weights_": true,
  "transformers_version": "4.25.1",
  "vocab_size": 31102
}

loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--distilbert

In [50]:
trainer.train()

***** Running training *****
  Num examples = 8199
  Num Epochs = 2
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 1026
  Number of trainable parameters = 67008003


Epoch,Training Loss,Validation Loss,Accuracy
1,0.2257,0.091109,0.973171
2,0.0995,0.085327,0.974634

***** Running Evaluation *****
  Num examples = 2050
  Batch size = 16

Saving model checkpoint to german_distilbert/checkpoint-513
Configuration saved in german_distilbert/checkpoint-513/config.json
Model weights saved in german_distilbert/checkpoint-513/pytorch_model.bin
tokenizer config file saved in german_distilbert/checkpoint-513/tokenizer_config.json
Special tokens file saved in german_distilbert/checkpoint-513/special_tokens_map.json

***** Running Evaluation *****
  Num examples = 2050
  Batch size = 16

Saving model checkpoint to german_distilbert/checkpoint-1026
Configuration saved in german_distilbert/checkpoint-1026/config.json
Model weights saved in german_distilbert/checkpoint-1026/pytorch_model.bin
tokenizer config file saved in german_distilbert/checkpoint-1026/tokenizer_config.json
Special tokens file saved in german_distilbert/checkpoint-1026/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from german_distilbert/checkpoint-1026 (score: 0.0853266641497612).


TrainOutput(global_step=1026, training_loss=0.16006818407925016, metrics={'train_runtime': 324.5639, 'train_samples_per_second': 50.523, 'train_steps_per_second': 3.161, 'total_flos': 848530914559200.0, 'train_loss': 0.16006818407925016, 'epoch': 2.0})
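
The step counts and throughput figures in the log can be reproduced from the number of examples, the batch size and the runtime. The short calculation below assumes a single device, no gradient accumulation and that the last, smaller batch of each epoch is kept, which matches the values reported above.

```python
import math

# values taken from the training log
num_train_examples = 8199
batch_size = 16
num_epochs = 2
train_runtime_s = 324.5639

# the last batch of each epoch is smaller, so we round up
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

steps_per_second = total_steps / train_runtime_s
samples_per_second = num_train_examples * num_epochs / train_runtime_s

print(steps_per_epoch, total_steps)
print(round(steps_per_second, 3), round(samples_per_second, 3))
```

This also explains the checkpoint names: `checkpoint-513` is saved after the first epoch and `checkpoint-1026` after the second.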

In [None]:
# get accuracy metrics for each category and whole model
# Flair?
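
One possible way to get metrics for each category is to compute precision, recall and F1 per class from the true and predicted label ids. The sketch below is dependency-free and uses made-up labels; `per_class_report` is a hypothetical helper. In practice the predicted ids could come from the argmax over the logits returned by `trainer.predict(dataset_test)`.

```python
from collections import Counter

def per_class_report(y_true, y_pred, id2label):
    """Precision, recall and F1 for every class, from two lists of label ids."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this class, but it was wrong
            fn[truth] += 1  # missed the true class

    report = {}
    for idx, name in id2label.items():
        predicted = tp[idx] + fp[idx]
        actual = tp[idx] + fn[idx]
        precision = tp[idx] / predicted if predicted else 0.0
        recall = tp[idx] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[name] = {"precision": precision, "recall": recall, "f1": f1}
    return report

id2label = {0: "none", 1: "tech", 2: "soft"}
# toy labels for illustration only, not results of the trained model
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
report = per_class_report(y_true, y_pred, id2label)
for name, scores in report.items():
    print(name, scores)
```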

## Visualisation of results

In [None]:
# visualise results