DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

From the paper: https://arxiv.org/abs/1910.01108

As transfer learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models on the edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performance on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.
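The triple loss named in the abstract can be sketched for a single masked token as follows. This is a minimal pure-Python illustration, not the authors' training code: the temperature `T`, the weights `w_ce`, `w_mlm` and `w_cos`, and the toy logits and hidden vectors are all assumptions chosen for readability.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / T for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy of the student's soft predictions against the teacher's."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))

def mlm_loss(student_logits, true_index):
    """Masked language modeling loss: negative log-likelihood of the true token."""
    return -math.log(softmax(student_logits)[true_index])

def cosine_loss(h_student, h_teacher):
    """1 - cosine similarity between student and teacher hidden vectors."""
    dot = sum(a * b for a, b in zip(h_student, h_teacher))
    norm_s = math.sqrt(sum(a * a for a in h_student))
    norm_t = math.sqrt(sum(b * b for b in h_teacher))
    return 1.0 - dot / (norm_s * norm_t)

def triple_loss(student_logits, teacher_logits, true_index,
                h_student, h_teacher,
                T=2.0, w_ce=1.0, w_mlm=1.0, w_cos=1.0):
    """Weighted sum of the three terms (the weights are assumed, not from the paper)."""
    return (w_ce * distillation_loss(student_logits, teacher_logits, T)
            + w_mlm * mlm_loss(student_logits, true_index)
            + w_cos * cosine_loss(h_student, h_teacher))

# Toy values over a 3-token vocabulary and 3-dimensional hidden states.
student_logits = [2.0, 0.5, -1.0]
teacher_logits = [2.5, 0.3, -1.2]
total = triple_loss(student_logits, teacher_logits, true_index=0,
                    h_student=[0.2, 0.7, -0.1], h_teacher=[0.25, 0.6, -0.05])
print(f"triple loss: {total:.4f}")
```

The distillation term is smallest when the student's soft distribution matches the teacher's, and the cosine term is zero when the two hidden vectors point in the same direction.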

What is DistilBERT?

BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster, while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark.
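The 40% figure can be checked with quick arithmetic. The parameter counts below (about 110M for BERT base and 66M for DistilBERT) are the rounded values reported in the DistilBERT paper, not numbers measured in this notebook.

```python
# Rounded parameter counts from the DistilBERT paper, in millions.
bert_base_params_m = 110
distilbert_params_m = 66

# Fraction of parameters removed by distillation.
reduction = 1 - distilbert_params_m / bert_base_params_m
print(f"Parameter reduction: {reduction:.0%}")  # prints "Parameter reduction: 40%"
```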

What is ktrain?

ktrain is a library that helps build, train, debug and deploy neural networks in the Keras deep learning framework.

ktrain uses tf.keras in TensorFlow instead of standalone Keras. Inspired by the fastai library, with only a few lines of code, ktrain allows you to easily:
* estimate an optimal learning rate for your model given your data using a learning rate finder.
* employ learning rate schedules such as triangular learning rate policy, 1cycle policy, and SGDR to more effectively train your model.
* employ fast and easy-to-use pre-canned models for both text classification (e.g. NBSVM, fastText, GRU with pretrained word embeddings) and image classification (e.g. ResNet, Wide Residual Networks, Inception).
* load and preprocess text and image data from a variety of formats.
* inspect data points that were misclassified to help improve your model.
* leverage a simple prediction API for saving and deploying both models and data pre-processing steps to make predictions on new raw data.
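The idea behind a learning rate finder can be sketched without ktrain: sweep the learning rate exponentially, record the loss at each step, and pick a rate where the loss is falling fastest. In ktrain this is exposed through `learner.lr_find()` and `learner.lr_plot()`. The code below is only a conceptual sketch; the helper names are hypothetical, the loss values are made up for illustration, and the "steepest drop" rule is one common heuristic rather than ktrain's exact logic.

```python
def lr_sweep(start_lr, end_lr, num_steps):
    """Exponentially spaced learning rates from start_lr to end_lr."""
    ratio = (end_lr / start_lr) ** (1 / (num_steps - 1))
    return [start_lr * ratio ** i for i in range(num_steps)]

def steepest_descent_lr(lrs, losses):
    """Learning rate at the start of the largest drop between consecutive losses."""
    drops = [losses[i] - losses[i + 1] for i in range(len(losses) - 1)]
    best = max(range(len(drops)), key=drops.__getitem__)
    return lrs[best]

# Seven rates, roughly 1e-7, 1e-6, ..., 1e-1.
lrs = lr_sweep(1e-7, 1e-1, 7)

# Hypothetical losses: flat at tiny rates, improving, then diverging.
hypothetical_losses = [0.70, 0.69, 0.66, 0.50, 0.35, 0.60, 2.10]

suggested = steepest_descent_lr(lrs, hypothetical_losses)
print(f"suggested learning rate: {suggested:.0e}")
```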

ktrain GitHub: https://github.com/amaiya/ktrain


In [6]:
import tensorflow as tf
import pandas as pd
import numpy as np
import ktrain
from ktrain import text
tf.__version__

'2.12.0'

Dataset: IMDB-Movie-Reviews-Large-Dataset-50k

In [4]:
data_test_dir = r'/content/drive/MyDrive/BERT_Project/data/data_test.csv'
data_train_dir = r'/content/drive/MyDrive/BERT_Project/data/data_train.csv'

In [13]:
data_train_df = pd.read_csv(data_train_dir)
data_test_df = pd.read_csv(data_test_dir)
data_train_df.head()

Unnamed: 0.1,Unnamed: 0,Reviews,Sentiment
0,0,Tim Meadows has to be the most underrated of S...,pos
1,1,It's hard to say anything about a movie like t...,pos
2,2,If you haven't seen Eva Longoria from the TV s...,pos
3,3,The early career of Abe Lincoln is beautifully...,pos
4,4,In Panic In The Streets Richard Widmark plays ...,pos


In [9]:
text.print_text_classifiers()

fasttext: a fastText-like model [http://arxiv.org/pdf/1607.01759.pdf]
logreg: logistic regression using a trainable Embedding layer
nbsvm: NBSVM model [http://www.aclweb.org/anthology/P12-2018]
bigru: Bidirectional GRU with pretrained fasttext word vectors [https://fasttext.cc/docs/en/crawl-vectors.html]
standard_gru: simple 2-layer GRU with randomly initialized embeddings
bert: Bidirectional Encoder Representations from Transformers (BERT) from keras_bert [https://arxiv.org/abs/1810.04805]
distilbert: distilled, smaller, and faster BERT from Hugging Face transformers [https://arxiv.org/abs/1910.01108]


In [15]:
(train, val, preproc) = text.texts_from_df(data_train_df,
                   text_column = 'Reviews',
                   label_columns = 'Sentiment',
                   val_df = data_test_df,
                   maxlen = 512,
                   preprocess_mode = 'distilbert')

['neg', 'pos']
   neg  pos
0  0.0  1.0
1  0.0  1.0
2  0.0  1.0
3  0.0  1.0
4  0.0  1.0
['neg', 'pos']
   neg  pos
0  0.0  1.0
1  0.0  1.0
2  0.0  1.0
3  0.0  1.0
4  0.0  1.0


preprocessing train...
language: en
train sequence lengths:
	mean : 234
	95percentile : 598
	99percentile : 913


Is Multi-Label? False
preprocessing test...
language: en
test sequence lengths:
	mean : 229
	95percentile : 582
	99percentile : 901
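The length statistics above matter because `maxlen = 512` is below the reported 95th percentile (598). If those lengths are measured in the same units as `maxlen`, more than 5% of reviews would be cut off. The sketch below shows how such a report could be computed for a list of sequence lengths. The `token_counts` values are hypothetical, and the nearest-rank percentile is an assumption, not necessarily the method ktrain uses.

```python
import math
import statistics

def nearest_rank_percentile(values, pct):
    """Smallest value such that at least pct% of the data is <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def length_report(lengths, maxlen):
    """Summary statistics plus the share of sequences longer than maxlen."""
    return {
        "mean": round(statistics.mean(lengths)),
        "p95": nearest_rank_percentile(lengths, 95),
        "p99": nearest_rank_percentile(lengths, 99),
        "truncated_fraction": sum(n > maxlen for n in lengths) / len(lengths),
    }

# Hypothetical sequence lengths for ten reviews.
token_counts = [120, 180, 234, 300, 410, 512, 598, 640, 913, 90]
report = length_report(token_counts, maxlen=512)
print(report)
```

If the truncated share is large for a given dataset, it may be worth checking whether the cut-off text carries the sentiment signal before settling on `maxlen`.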


In [18]:
model = text.text_classifier(name = 'distilbert',
                             train_data = train,
                             preproc = preproc)

Is Multi-Label? False
maxlen is 512
done.


In [22]:
learner = ktrain.get_learner(model = model,
                             train_data = train,
                             val_data = val,
                             batch_size = 6)

In [23]:
learner.fit_onecycle(lr = 2e-5,
                     epochs = 2)



begin training using onecycle policy with max lr of 2e-05...
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7b2556564460>
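`fit_onecycle` varies the learning rate during training rather than holding it at `2e-5`. A simplified picture of such a schedule is a linear ramp up to the maximum followed by a linear decay. The function below is an illustrative sketch only: the half-way peak, the `max_lr / 10` floor and the step count are assumptions, and ktrain's actual implementation may differ.

```python
def onecycle_lr(step, total_steps, max_lr, div_factor=10.0):
    """Triangular schedule: ramp from max_lr/div_factor up to max_lr, then back down."""
    min_lr = max_lr / div_factor
    peak = total_steps / 2
    if step <= peak:
        frac = step / peak                  # warm-up half
    else:
        frac = (total_steps - step) / peak  # decay half
    return min_lr + frac * (max_lr - min_lr)

total_steps = 1000  # hypothetical number of optimizer steps
schedule = [onecycle_lr(s, total_steps, max_lr=2e-5) for s in range(total_steps + 1)]

print(f"start: {schedule[0]:.1e}, peak: {max(schedule):.1e}, end: {schedule[-1]:.1e}")
```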

In [27]:
# Save the model
predictor = ktrain.get_predictor(learner.model,
                                 preproc)
predictor.save('/content/drive/MyDrive/BERT DistilBERT Project/distilBERT-model')


In [29]:
!zip -r '/content/drive/MyDrive/BERT DistilBERT Project.zip' '/content/drive/MyDrive/BERT DistilBERT Project'

  adding: content/drive/MyDrive/BERT DistilBERT Project/ (stored 0%)
  adding: content/drive/MyDrive/BERT DistilBERT Project/distilBERT-model/ (stored 0%)
  adding: content/drive/MyDrive/BERT DistilBERT Project/distilBERT-model/config.json (deflated 44%)
  adding: content/drive/MyDrive/BERT DistilBERT Project/distilBERT-model/tf_model.h5 (deflated 8%)
  adding: content/drive/MyDrive/BERT DistilBERT Project/distilBERT-model/tokenizer_config.json (deflated 43%)
  adding: content/drive/MyDrive/BERT DistilBERT Project/distilBERT-model/special_tokens_map.json (deflated 42%)
  adding: content/drive/MyDrive/BERT DistilBERT Project/distilBERT-model/vocab.txt (deflated 53%)
  adding: content/drive/MyDrive/BERT DistilBERT Project/distilBERT-model/tokenizer.json (deflated 71%)
  adding: content/drive/MyDrive/BERT DistilBERT Project/distilBERT-model/tf_model.preproc (deflated 47%)
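Where the `zip` command-line tool is unavailable, a similar archive can be produced from Python's standard library. The sketch below uses a temporary directory with a placeholder file in place of the real model folder; the directory and file names are stand-ins, not the notebook's actual paths.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as workdir:
    # Stand-in for the saved predictor folder.
    model_dir = os.path.join(workdir, "distilBERT-model")
    os.makedirs(model_dir)
    with open(os.path.join(model_dir, "config.json"), "w") as f:
        f.write("{}")

    # Writes <workdir>/project.zip containing the distilBERT-model folder.
    archive_path = shutil.make_archive(
        base_name=os.path.join(workdir, "project"),
        format="zip",
        root_dir=workdir,
        base_dir="distilBERT-model",
    )

    # List the archive contents before the temporary directory is removed.
    with zipfile.ZipFile(archive_path) as zf:
        archived_names = zf.namelist()

print(archived_names)
```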


In [34]:
# Run the predictor on new, unseen reviews
data = ['this movie was really bad. acting was also bad. I will not watch it again',
        'the movie was really great. I will watch it again',
        'the movie was horible and no plot']

predictor.predict(data)

['neg', 'pos', 'neg']

In [31]:
predictor.get_classes()

['neg', 'pos']
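The order returned by `get_classes()` is what ties a row of predicted probabilities to a label. As a sketch, assuming the model emits one probability per class in that order, the label is the class at the arg-max position. The helper name and the probability rows below are invented for illustration; with ktrain, `predictor.predict(data, return_proba=True)` returns probabilities instead of labels.

```python
classes = ['neg', 'pos']

def to_labels(prob_rows, classes):
    """Map each row of class probabilities to the name of its most likely class."""
    labels = []
    for row in prob_rows:
        best = max(range(len(row)), key=row.__getitem__)
        labels.append(classes[best])
    return labels

# Hypothetical probabilities for three reviews, in [neg, pos] order.
hypothetical_probs = [[0.98, 0.02], [0.05, 0.95], [0.90, 0.10]]
print(to_labels(hypothetical_probs, classes))  # ['neg', 'pos', 'neg']
```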