### Check Hardware & RAM availability:
Check whether a GPU is attached to the runtime and how much GPU memory it provides.

In [1]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Select the Runtime > "Change runtime type" menu to enable a GPU accelerator, ')
  print('and then re-execute this cell.')
else:
  print(gpu_info)

Sun Oct 24 12:56:24 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P0    30W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
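The cell above relies on the IPython `!` shell escape and reports only the GPU. A minimal sketch of the same check in plain Python, extended with a RAM reading, could look like this. It assumes a Linux runtime (for `os.sysconf`), and the helper names `total_ram_gb` and `gpu_summary` are illustrative, not part of any library.

```python
import os
import shutil
import subprocess


def total_ram_gb():
    """Return total physical RAM in GB (assumes a Linux/Unix runtime)."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    page_count = os.sysconf("SC_PHYS_PAGES")
    return page_size * page_count / 1e9


def gpu_summary():
    """Return the nvidia-smi report, or None when the tool is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    return result.stdout if result.returncode == 0 else None


print(f"Total RAM: {total_ram_gb():.1f} GB")
print(gpu_summary() or "No GPU detected; enable one under Runtime > Change runtime type.")
```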

### References:
* https://huggingface.co/
* https://arxiv.org/abs/1907.11692

### Install Required Libraries for Transformer Models:
* Pre-trained Transformer models are available through the Hugging Face `transformers` library.
* Similarly, any dataset hosted by Hugging Face can be loaded through the `datasets` library.
* Finally, we use ktrain, a high-level wrapper package, to simplify modelling and prediction.

In [2]:
!pip install ktrain
!pip install transformers



### Import Libraries:

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import ktrain
from ktrain import text
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import timeit
import warnings

pd.set_option('display.max_columns', None)
warnings.simplefilter(action="ignore")

In [4]:
tf.__version__

'2.6.0'
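Since the results depend on the installed library versions, it can help to record all of them in one place. The sketch below uses only the standard library; `installed_version` is a hypothetical helper name, and the package list simply mirrors the libraries used in this notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name in ("ktrain", "transformers", "tensorflow"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```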

### Load Dataset:

In [5]:
dbpedia_14_train = pd.read_csv("/content/dbpedia_14_train.csv")
dbpedia_14_test = pd.read_csv("/content/dbpedia_14_test.csv")

### Dataset Information:

In [6]:
dbpedia_14_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 560000 entries, 0 to 559999
Data columns (total 2 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   Labels   560000 non-null  object
 1   Content  560000 non-null  object
dtypes: object(2)
memory usage: 8.5+ MB


In [7]:
dbpedia_14_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Labels   70000 non-null  object
 1   Content  70000 non-null  object
dtypes: object(2)
memory usage: 1.1+ MB


In [8]:
dbpedia_14_train.head()

    Labels  Content
0  Company  Abbott of Farnham E D Abbott Limited was a Br...
1  Company  Schwan-STABILO is a German maker of pens for ...
2  Company  Q-workshop is a Polish company located in Poz...
3  Company  Marvell Software Solutions Israel known as RA...
4  Company  Bergan Mercy Medical Center is a hospital loc...


In [9]:
dbpedia_14_test.head()

    Labels  Content
0  Company  TY KU /taɪkuː/ is an American alcoholic bever...
1  Company  OddLot Entertainment founded in 2001 by longt...
2  Company  Henkel AG & Company KGaA operates worldwide w...
3  Company  The GOAT Store (Games Of All Type Store) LLC ...
4  Company  RagWing Aircraft Designs (also called the Rag...


### Split Train & Validation data:

In [10]:
# Separate the text and label columns; the provided test set serves as validation data.
X_train = dbpedia_14_train["Content"]
y_train = dbpedia_14_train["Labels"]
X_test = dbpedia_14_test["Content"]
y_test = dbpedia_14_test["Labels"]

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(560000,) (560000,) (70000,) (70000,)
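Before fine-tuning, it is worth confirming that the classes are balanced, since 560,000 training rows over 14 classes would be 40,000 per class if they are evenly split. The sketch below shows one way to check this on a toy list; `class_balance` is a hypothetical helper, and in the notebook it would be called as `class_balance(y_train.to_list())`.

```python
from collections import Counter


def class_balance(labels):
    """Return the fraction of samples belonging to each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy data standing in for y_train.to_list()
toy_labels = ["Company", "Company", "Film", "Film"]
balance = class_balance(toy_labels)
print(balance)
```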


### Instantiating a DistilBERT Instance:
Create a ktrain `Transformer` wrapper for DistilBERT from the model name, the maximum token length, the class labels, and the batch size.

In [11]:
class_names_list = ['Company',
 'EducationalInstitution',
 'Artist',
 'Athlete',
 'OfficeHolder',
 'MeanOfTransportation',
 'Building',
 'NaturalPlace',
 'Village',
 'Animal',
 'Plant',
 'Album',
 'Film',
 'WrittenWork']

In [12]:
distilbert_transformer = text.Transformer('distilbert-base-uncased', maxlen=512, classes=class_names_list, batch_size=16)

### Perform Data Preprocessing:

In [13]:
dbpedia_ont_train = distilbert_transformer.preprocess_train(X_train.to_list(), y_train.to_list())
dbpedia_ont_val = distilbert_transformer.preprocess_test(X_test.to_list(), y_test.to_list())

preprocessing train...
language: en
train sequence lengths:
	mean : 46
	95percentile : 80
	99percentile : 86


Is Multi-Label? False
preprocessing test...
language: en
test sequence lengths:
	mean : 46
	95percentile : 80
	99percentile : 86
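The reported 99th percentile of 86 sits far below the `maxlen=512` set earlier, so a smaller `maxlen` may be worth trying. A rough sketch of how such length statistics can be computed follows. It counts whitespace-separated words, which is an assumption and may not match how ktrain measures sequence length; `percentile` and `length_stats` are hypothetical helper names.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list (pct between 0 and 100)."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def length_stats(texts):
    """Return mean, 95th and 99th percentile of whitespace word counts."""
    lengths = sorted(len(text.split()) for text in texts)
    return {
        "mean": sum(lengths) / len(lengths),
        "p95": percentile(lengths, 95),
        "p99": percentile(lengths, 99),
    }


# Toy data standing in for X_train.to_list()
sample = ["one two three", "one two", "one two three four five"]
stats = length_stats(sample)
print(stats)
```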


### Compile DistilBERT in a K-Train Learner Object:
Since we use ktrain as a high-level wrapper, the model must be wrapped in a ktrain Learner object before training and evaluation.

In [14]:
distilbert_model = distilbert_transformer.get_classifier()

Downloading:   0%|          | 0.00/363M [00:00<?, ?B/s]

In [15]:
distilbert_learner_ins = ktrain.get_learner(model=distilbert_model,
                            train_data=dbpedia_ont_train,
                            val_data=dbpedia_ont_val,
                            batch_size=16)

### DistilBERT Model Summary:

In [16]:
distilbert_learner_ins.model.summary()

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
distilbert (TFDistilBertMain multiple                  66362880  
_________________________________________________________________
pre_classifier (Dense)       multiple                  590592    
_________________________________________________________________
classifier (Dense)           multiple                  10766     
_________________________________________________________________
dropout_19 (Dropout)         multiple                  0         
Total params: 66,964,238
Trainable params: 66,964,238
Non-trainable params: 0
_________________________________________________________________
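The parameter counts of the two dense layers can be checked by hand: a dense layer has `inputs * outputs` weights plus `outputs` biases. The sketch below assumes the hidden width of `distilbert-base-uncased` is 768 and takes the backbone count from the summary above; `dense_params` is a hypothetical helper.

```python
HIDDEN_SIZE = 768      # assumption: hidden width of distilbert-base-uncased
NUM_CLASSES = 14       # number of DBpedia classes used in this notebook
BACKBONE = 66_362_880  # "distilbert" row of the model summary


def dense_params(inputs, outputs):
    """Weights plus biases of a fully connected layer."""
    return inputs * outputs + outputs


pre_classifier = dense_params(HIDDEN_SIZE, HIDDEN_SIZE)
classifier = dense_params(HIDDEN_SIZE, NUM_CLASSES)
total = BACKBONE + pre_classifier + classifier

print(pre_classifier, classifier, total)
```

These values agree with the `pre_classifier`, `classifier`, and total rows of the summary.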


### DistilBERT Optimal Learning Rates:
DistilBERT is derived from BERT through knowledge distillation, so the batch sizes and learning rates established for BERT-style fine-tuning are a reasonable starting point:

* Batch sizes: {16, 32}
* Learning rates: {1e-5, 2e-5, 3e-5}

This notebook fine-tunes with a batch size of 16 and a learning rate of 2e-5.

### Fine Tuning DistilBERT on Dbpedia Ontology Dataset:
We fine-tune the DistilBERT model created above on the DBpedia Ontology dataset, specifying the learning rate and the number of epochs.

In [17]:
distilbert_fine_tuning_start = timeit.default_timer()
distilbert_learner_ins.fit_onecycle(lr=2e-5, epochs=1)
distilbert_fine_tuning_stop = timeit.default_timer()



begin training using onecycle policy with max lr of 2e-05...
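The log line above names the one-cycle policy. As a rough illustration only, the sketch below implements a simplified triangular schedule that warms up linearly to the maximum learning rate and then decays back. ktrain's actual schedule may differ in its starting rate and momentum handling; `onecycle_lr` and `div_factor` are hypothetical names introduced here.

```python
import math


def onecycle_lr(step, total_steps, max_lr, div_factor=10.0):
    """Triangular schedule: max_lr/div_factor -> max_lr -> max_lr/div_factor."""
    base_lr = max_lr / div_factor
    half = total_steps / 2
    distance = abs(step - half) / half  # 1 at both ends, 0 at the midpoint
    return base_lr + (max_lr - base_lr) * (1 - distance)


# One epoch over 560,000 training rows with a batch size of 16
steps_per_epoch = math.ceil(560_000 / 16)

for step in (0, steps_per_epoch // 2, steps_per_epoch):
    print(step, onecycle_lr(step, steps_per_epoch, max_lr=2e-5))
```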


In [18]:
print("\nFine-Tuning time for DistilBERT on Dbpedia Ontology dataset: \n", (distilbert_fine_tuning_stop - distilbert_fine_tuning_start)/60, " min")


Fine-Tuning time for DistilBERT on Dbpedia Ontology dataset: 
 328.34684509633337  min


### Checking DistilBERT performance metrics:

In [19]:
distilbert_validation_start = timeit.default_timer()
distilbert_learner_ins.validate()
distilbert_validation_stop= timeit.default_timer()

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5000
           1       1.00      1.00      1.00      5000
           2       0.99      0.99      0.99      5000
           3       1.00      1.00      1.00      5000
           4       0.99      0.98      0.98      5000
           5       0.97      0.98      0.98      5000
           6       0.99      0.99      0.99      5000
           7       1.00      0.99      1.00      5000
           8       1.00      1.00      1.00      5000
           9       1.00      1.00      1.00      5000
          10       0.99      0.99      0.99      5000
          11       1.00      1.00      1.00      5000
          12       1.00      1.00      1.00      5000
          13       0.99      0.99      0.99      5000

    accuracy                           0.99     70000
   macro avg       0.99      0.99      0.99     70000
weighted avg       0.99      0.99      0.99     70000



In [20]:
print("\nInference time for DistilBERT on Dbpedia Ontology dataset: \n", (distilbert_validation_stop - distilbert_validation_start), " sec")


Inference time for DistilBERT on Dbpedia Ontology dataset: 
 712.3952941440002  sec
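The two timings become easier to compare when converted to throughput. The sketch below does that arithmetic with the figures printed above and the row counts from the dataset information; `throughput` is a hypothetical helper.

```python
TRAIN_SAMPLES = 560_000  # rows in the training set
VAL_SAMPLES = 70_000     # rows in the test set used for validation

fine_tuning_minutes = 328.34684509633337  # printed fine-tuning time
validation_seconds = 712.3952941440002    # printed validation time


def throughput(samples, seconds):
    """Samples processed per second."""
    return samples / seconds


train_rate = throughput(TRAIN_SAMPLES, fine_tuning_minutes * 60)
val_rate = throughput(VAL_SAMPLES, validation_seconds)

print(f"Fine-tuning: {train_rate:.1f} samples/s")
print(f"Validation:  {val_rate:.1f} samples/s")
```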


In [21]:
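# class_names are applied positionally to label indices 0-13. The predictor's
# get_classes() output further down lists the classes alphabetically, so the row
# names printed here follow class_names_list and may not match the true classes.
# Passing distilbert_transformer.get_classes() instead should avoid the mismatch.
distilbert_learner_ins.validate(class_names=class_names_list)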

                        precision    recall  f1-score   support

               Company       1.00      1.00      1.00      5000
EducationalInstitution       1.00      1.00      1.00      5000
                Artist       0.99      0.99      0.99      5000
               Athlete       1.00      1.00      1.00      5000
          OfficeHolder       0.99      0.98      0.98      5000
  MeanOfTransportation       0.97      0.98      0.98      5000
              Building       0.99      0.99      0.99      5000
          NaturalPlace       1.00      0.99      1.00      5000
               Village       1.00      1.00      1.00      5000
                Animal       1.00      1.00      1.00      5000
                 Plant       0.99      0.99      0.99      5000
                 Album       1.00      1.00      1.00      5000
                  Film       1.00      1.00      1.00      5000
           WrittenWork       0.99      0.99      0.99      5000

              accuracy                

array([[4983,    0,    5,    0,    0,    1,    0,    7,    0,    0,    0,
           0,    0,    4],
       [   0, 4991,    0,    0,    0,    1,    0,    0,    0,    0,    0,
           8,    0,    0],
       [   3,    0, 4931,    6,    0,    2,    0,    0,    0,    0,   57,
           0,    0,    1],
       [   1,    0,    3, 4988,    0,    0,    0,    0,    0,    0,    8,
           0,    0,    0],
       [   0,    1,    1,    0, 4919,   42,   15,    0,    9,    7,    0,
           0,    6,    0],
       [   3,    1,    8,    2,   39, 4896,   25,    1,   10,    0,    1,
           0,    0,   14],
       [   0,    0,    0,    0,   15,   38, 4939,    0,    0,    0,    4,
           1,    1,    2],
       [  10,    0,    1,    0,    0,    3,    0, 4973,    1,    0,    0,
           0,    0,   12],
       [   0,    0,    0,    0,    3,   15,    0,    0, 4982,    0,    0,
           0,    0,    0],
       [   0,    0,    1,    0,    8,    2,    0,    0,    0, 4983,    0,
           0,    
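A confusion matrix like the one above can be summarised by per-class recall and by the largest off-diagonal cells. For instance, the first row gives 4983 / 5000 = 0.9966. The sketch below uses a toy 3x3 matrix because the output above is truncated, and it assumes the scikit-learn convention of rows as true labels and columns as predictions. Both helper names are hypothetical.

```python
def per_class_recall(matrix):
    """Diagonal count divided by row total, for each true class."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


def top_confusions(matrix, k=3):
    """Return the k largest off-diagonal cells as (true, predicted, count)."""
    pairs = [
        (count, true_idx, pred_idx)
        for true_idx, row in enumerate(matrix)
        for pred_idx, count in enumerate(row)
        if true_idx != pred_idx and count > 0
    ]
    pairs.sort(reverse=True)
    return [(t, p, count) for count, t, p in pairs[:k]]


toy_matrix = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 3, 7],
]
print(per_class_recall(toy_matrix))
print(top_confusions(toy_matrix, k=2))
```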

In [22]:
distilbert_learner_ins.view_top_losses(preproc=distilbert_transformer)

----------
id:21130 | loss:11.18 | true:OfficeHolder | pred:Building)

----------
id:5447 | loss:10.32 | true:EducationalInstitution | pred:Company)

----------
id:5069 | loss:10.0 | true:EducationalInstitution | pred:Company)

----------
id:56873 | loss:9.72 | true:Album | pred:Artist)
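The loss values above can be read as probabilities. Assuming the reported loss is the per-example categorical cross-entropy with natural logarithm, the probability the model assigned to the true class is `exp(-loss)`. The helper name `prob_from_loss` is hypothetical.

```python
import math


def prob_from_loss(loss):
    """Probability assigned to the true class, given cross-entropy loss -ln(p)."""
    return math.exp(-loss)


for loss in (11.18, 10.32, 10.0, 9.72):
    print(f"loss {loss:>5}: p(true class) = {prob_from_loss(loss):.2e}")
```

Under that assumption, a loss above 9 means the model gave the true class a probability well below 0.1%, so these are confidently wrong predictions and good candidates for checking label quality.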



### Saving DistilBERT Model:

In [23]:
distilbert_predictor = ktrain.get_predictor(distilbert_learner_ins.model, preproc=distilbert_transformer)
distilbert_predictor.get_classes()

['Album',
 'Animal',
 'Artist',
 'Athlete',
 'Building',
 'Company',
 'EducationalInstitution',
 'Film',
 'MeanOfTransportation',
 'NaturalPlace',
 'OfficeHolder',
 'Plant',
 'Village',
 'WrittenWork']
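The list above is in alphabetical order, not in the order of `class_names_list`. Assuming the label indices used by the model follow this alphabetical order (the output suggests it, but the notebook does not state it), the index-to-class mapping can be sketched as follows.

```python
class_names_list = [
    "Company", "EducationalInstitution", "Artist", "Athlete", "OfficeHolder",
    "MeanOfTransportation", "Building", "NaturalPlace", "Village", "Animal",
    "Plant", "Album", "Film", "WrittenWork",
]

# Assumption: string labels are encoded in sorted (alphabetical) order.
index_to_class = dict(enumerate(sorted(class_names_list)))

for index, name in index_to_class.items():
    print(index, name)
```

Under that assumption, index 0 is `Album` rather than `Company`, which matters whenever numeric labels from a report or confusion matrix are translated back to class names.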

In [24]:
distilbert_predictor.save('/content/distilbert-predictor-on-dbpedia')
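A saved predictor can later be reloaded for inference. The sketch below wraps ktrain's `load_predictor` and the predictor's `predict` method in a hypothetical helper, `load_and_predict`. The default path mirrors the one used above, and the example text in the usage note is made up.

```python
def load_and_predict(texts, path="/content/distilbert-predictor-on-dbpedia"):
    """Reload a saved ktrain predictor and classify a list of texts."""
    # Imported lazily so the helper can be defined without ktrain installed.
    import ktrain

    predictor = ktrain.load_predictor(path)
    return predictor.predict(texts)
```

In the notebook this could be called as `load_and_predict(["Q-workshop is a Polish company that makes dice."])`, which should return one predicted class name per input text.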

In [25]:
!zip -r /content/distilbert-predictor-on-dbpedia /content/distilbert-predictor-on-dbpedia

  adding: content/distilbert-predictor-on-dbpedia/ (stored 0%)
  adding: content/distilbert-predictor-on-dbpedia/special_tokens_map.json (deflated 40%)
  adding: content/distilbert-predictor-on-dbpedia/vocab.txt (deflated 53%)
  adding: content/distilbert-predictor-on-dbpedia/config.json (deflated 58%)
  adding: content/distilbert-predictor-on-dbpedia/tokenizer_config.json (deflated 40%)
  adding: content/distilbert-predictor-on-dbpedia/tf_model.h5 (deflated 8%)
  adding: content/distilbert-predictor-on-dbpedia/tf_model.preproc (deflated 56%)
  adding: content/distilbert-predictor-on-dbpedia/tokenizer.json (deflated 59%)
