### Check Hardware & RAM availability:
The command below reports which GPU is attached to the runtime and how much GPU memory is in use.

In [1]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Select the Runtime > "Change runtime type" menu to enable a GPU accelerator, ')
  print('and then re-execute this cell.')
else:
  print(gpu_info)

Thu Oct 28 03:40:04 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   49C    P0    29W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
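The cell above reports only the GPU. As a minimal sketch of the RAM half of the check, the snippet below reads total system memory with the standard library, assuming a Linux runtime (as on Colab) where `os.sysconf` exposes page counts. The helper names `total_ram_gb` and `gpu_summary` are illustrative and not part of the original notebook.

```python
import os
import shutil
import subprocess


def total_ram_gb():
    """Return total physical RAM in GB (assumes a Linux/Unix runtime)."""
    page_size = os.sysconf("SC_PAGE_SIZE")   # bytes per memory page
    n_pages = os.sysconf("SC_PHYS_PAGES")    # number of physical pages
    return page_size * n_pages / 1e9


def gpu_summary():
    """Return nvidia-smi output, or None when the tool is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    return result.stdout if result.returncode == 0 else None


print(f"Total RAM: {total_ram_gb():.1f} GB")
print(gpu_summary() or "No GPU detected; enable one under Runtime > Change runtime type.")
```

Unlike the `!nvidia-smi` cell magic, this version also runs as a plain Python script outside a notebook.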

### References:
* https://huggingface.co/
* https://arxiv.org/abs/1810.04805

### Install Required Libraries for Transformer Models:
* The Hugging Face `transformers` package provides the pre-trained language models.
* `ktrain` is a high-level wrapper that simplifies model building, training, and prediction.

In [2]:
!pip install ktrain
!pip install transformers

Collecting ktrain
  Downloading ktrain-0.28.2.tar.gz (25.3 MB)
Collecting scikit-learn==0.23.2
  Downloading scikit_learn-0.23.2-cp37-cp37m-manylinux1_x86_64.whl (6.8 MB)
Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
Collecting cchardet
  Downloading cchardet-2.1.7-cp37-cp37m-manylinux2010_x86_64.whl (263 kB)
Collecting syntok
  Downloading syntok-1.3.1.tar.gz (23 kB)
Collecting seqeval==0.0.19
  Downloading seqeval-0.0.19.tar.gz (30 kB)
Collecting transformers<=4.10.3,>=4.0.0
  Downloading transformers-4.10.3-py3-none-any.whl (2.8 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl 
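The install log shows that ktrain 0.28.2 pins some of its dependencies (for example `transformers<=4.10.3` and `scikit-learn==0.23.2`), so it can help to confirm which versions actually ended up in the environment. A minimal sketch, assuming Python 3.8 or later where `importlib.metadata` is in the standard library (the log above shows a Python 3.7 runtime, where the `importlib_metadata` backport would be needed instead). The helper name `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("ktrain", "transformers", "tensorflow", "scikit-learn"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```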

### Import Libraries:

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import ktrain
from ktrain import text
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import timeit
import warnings

pd.set_option('display.max_columns', None)
warnings.simplefilter(action="ignore")

In [4]:
tf.__version__

'2.6.0'

### Load Dataset:

In [5]:
dbpedia_14_train = pd.read_csv("/content/dbpedia_14_train.csv")
dbpedia_14_test = pd.read_csv("/content/dbpedia_14_test.csv")

### Dataset Information:

In [6]:
dbpedia_14_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 560000 entries, 0 to 559999
Data columns (total 2 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   Labels   560000 non-null  object
 1   Content  560000 non-null  object
dtypes: object(2)
memory usage: 8.5+ MB


In [7]:
dbpedia_14_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Labels   70000 non-null  object
 1   Content  70000 non-null  object
dtypes: object(2)
memory usage: 1.1+ MB


In [8]:
dbpedia_14_train.head()

|   | Labels | Content |
|---|---|---|
| 0 | Company | Abbott of Farnham E D Abbott Limited was a Br... |
| 1 | Company | Schwan-STABILO is a German maker of pens for ... |
| 2 | Company | Q-workshop is a Polish company located in Poz... |
| 3 | Company | Marvell Software Solutions Israel known as RA... |
| 4 | Company | Bergan Mercy Medical Center is a hospital loc... |


In [9]:
dbpedia_14_test.head()

|   | Labels | Content |
|---|---|---|
| 0 | Company | TY KU /taɪkuː/ is an American alcoholic bever... |
| 1 | Company | OddLot Entertainment founded in 2001 by longt... |
| 2 | Company | Henkel AG & Company KGaA operates worldwide w... |
| 3 | Company | The GOAT Store (Games Of All Type Store) LLC ... |
| 4 | Company | RagWing Aircraft Designs (also called the Rag... |
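With 14 classes, it is worth confirming that the labels are evenly distributed before training. The sketch below counts labels with the standard library on a toy list; on the real data, `dbpedia_14_train["Labels"].value_counts()` gives the same counts in pandas. The helper name `class_balance` is illustrative.

```python
from collections import Counter


def class_balance(labels):
    """Return ({label: count}, ratio of largest to smallest class size)."""
    counts = Counter(labels)
    ratio = max(counts.values()) / min(counts.values())
    return dict(counts), ratio


# Toy labels standing in for the "Labels" column.
toy_labels = ["Company"] * 3 + ["Film"] * 3 + ["Plant"] * 2
counts, ratio = class_balance(toy_labels)
print(counts, ratio)
```

A ratio close to 1.0 indicates a balanced dataset, in which case plain accuracy is a reasonable headline metric.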


### Select Train & Validation data:
The test set serves as the validation set here; no rows are held out from the training data.

In [10]:
X_train = dbpedia_14_train["Content"]
y_train = dbpedia_14_train["Labels"]
X_test = dbpedia_14_test["Content"]
y_test = dbpedia_14_test["Labels"]

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(560000,) (560000,) (70000,) (70000,)
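The cell above uses the test set for validation. If a separate validation set is wanted so that the test set stays untouched until the end, part of the training data can be held out per class. The sketch below does this with the standard library on toy labels; `train_test_split(X, y, test_size=..., stratify=y)` from scikit-learn (already imported in this notebook) is the usual equivalent. The helper name `stratified_split` and the 20% fraction are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.1, seed=42):
    """Return (train_idx, val_idx) with each class represented proportionally."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


toy_labels = ["Company"] * 10 + ["Film"] * 10
train_idx, val_idx = stratified_split(toy_labels, val_fraction=0.2)
print(len(train_idx), len(val_idx))
```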


### Instantiating a BERT Instance:
Create a ktrain `Transformer` wrapper for `bert-base-uncased`, passing the maximum sequence length in tokens, the class labels, and the batch size.

In [11]:
class_names_list = ['Company',
 'EducationalInstitution',
 'Artist',
 'Athlete',
 'OfficeHolder',
 'MeanOfTransportation',
 'Building',
 'NaturalPlace',
 'Village',
 'Animal',
 'Plant',
 'Album',
 'Film',
 'WrittenWork']

In [12]:
bert_transformer = text.Transformer('bert-base-uncased', maxlen=512, classes=class_names_list, batch_size=6)

### Perform Data Preprocessing:

In [13]:
dbpedia_ont_train = bert_transformer.preprocess_train(X_train.to_list(), y_train.to_list())
dbpedia_ont_val = bert_transformer.preprocess_test(X_test.to_list(), y_test.to_list())

preprocessing train...
language: en
train sequence lengths:
	mean : 46
	95percentile : 80
	99percentile : 86


Is Multi-Label? False
preprocessing test...
language: en
test sequence lengths:
	mean : 46
	95percentile : 80
	99percentile : 86
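The reported 99th-percentile length of 86 is far below the configured `maxlen=512`, so a shorter maximum length may reduce training cost without truncating many documents. The sketch below shows one way to derive such a value from a length percentile. It is an illustration under stated assumptions: whitespace splitting stands in for BERT's WordPiece tokenizer, and the helper names and the `margin` parameter are hypothetical.

```python
import math


def nearest_rank_percentile(values, pct):
    """Return the pct-th percentile of `values` using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_maxlen(texts, pct=99, margin=2, cap=512):
    """Suggest a max sequence length from whitespace token counts.

    Whitespace splitting only approximates WordPiece tokenization, which
    usually yields more tokens, so treat the result as a lower bound.
    `margin` leaves room for the [CLS] and [SEP] special tokens.
    """
    lengths = [len(t.split()) for t in texts]
    return min(cap, nearest_rank_percentile(lengths, pct) + margin)


sample_texts = ["a b c", "a b c d e", "a b", "a b c d e f g h"]
print(suggest_maxlen(sample_texts))
```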


### Compile BERT in a K-Train Learner Object:
Since we are using ktrain as a high-level abstraction package, we need to wrap our model in a ktrain Learner object for further computation.

In [14]:
bert_model = bert_transformer.get_classifier()

In [15]:
bert_learner_ins = ktrain.get_learner(model=bert_model,
                            train_data=dbpedia_ont_train,
                            val_data=dbpedia_ont_val,
                            batch_size=6)

### BERT Model Summary:

In [16]:
bert_learner_ins.model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  109482240 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  10766     
Total params: 109,493,006
Trainable params: 109,493,006
Non-trainable params: 0
_________________________________________________________________
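The classifier's parameter count in the summary can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output. The short sketch below reproduces the figures, assuming the 768-dimensional hidden size of `bert-base` models and the 14 DBpedia classes.

```python
def dense_param_count(n_inputs, n_outputs):
    """Weights (n_inputs * n_outputs) plus one bias per output unit."""
    return n_inputs * n_outputs + n_outputs


hidden_size = 768             # hidden width of bert-base models
n_classes = 14                # DBpedia ontology classes
encoder_params = 109_482_240  # "bert (TFBertMainLayer)" row in the summary

classifier_params = dense_param_count(hidden_size, n_classes)
print(classifier_params)                   # 10766
print(encoder_params + classifier_params)  # 109493006
```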


### BERT Optimal Learning Rates:

The following batch sizes and learning rates are commonly recommended in the literature for fine-tuning BERT-style models:
* Batch Sizes => {16, 32}
* Learning Rates => {1e-5, 2e-5, 3e-5}

The run below uses a learning rate of 2e-5 from this set, together with the batch size of 6 configured earlier.

### Fine Tuning BERT on DBpedia Ontology Dataset:
Fine-tune the BERT classifier on the DBpedia training data for one epoch using ktrain's one-cycle learning-rate policy with a maximum learning rate of 2e-5, and time the run.

In [17]:
bert_fine_tuning_start = timeit.default_timer()
bert_learner_ins.fit_onecycle(lr=2e-5, epochs=1)
bert_fine_tuning_stop = timeit.default_timer()



begin training using onecycle policy with max lr of 2e-05...
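To give a feel for what a one-cycle policy does, the sketch below implements a simplified triangular schedule: the learning rate ramps linearly from a fraction of the maximum up to the maximum at the midpoint, then back down. This is an illustration only and not ktrain's exact implementation; the function name `onecycle_lr` and the `div_factor` of 10 are assumptions.

```python
import math


def onecycle_lr(step, total_steps, max_lr, div_factor=10.0):
    """Triangular schedule: base_lr at both ends, max_lr at the midpoint."""
    base_lr = max_lr / div_factor
    half = total_steps / 2
    # Distance from the midpoint scaled to [0, 1]: 1 at the ends, 0 at the peak.
    distance = abs(step - half) / half
    return base_lr + (max_lr - base_lr) * (1.0 - distance)


# One epoch over 560,000 training rows with a batch size of 6.
steps_per_epoch = math.ceil(560_000 / 6)
print(steps_per_epoch)
for step in (0, steps_per_epoch // 2, steps_per_epoch):
    print(step, onecycle_lr(step, steps_per_epoch, max_lr=2e-5))
```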


In [18]:
print("\nFine-Tuning time for BERT on Dbpedia Ontology dataset: \n", (bert_fine_tuning_stop - bert_fine_tuning_start)/60, " min")


Fine-Tuning time for BERT on Dbpedia Ontology dataset: 
 728.0891408850167  min


### Checking BERT performance metrics:

In [19]:
bert_validation_start = timeit.default_timer()
bert_learner_ins.validate()
bert_validation_stop = timeit.default_timer()

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5000
           1       1.00      1.00      1.00      5000
           2       0.99      0.99      0.99      5000
           3       1.00      1.00      1.00      5000
           4       0.98      0.99      0.99      5000
           5       0.98      0.98      0.98      5000
           6       0.99      0.99      0.99      5000
           7       1.00      0.99      1.00      5000
           8       1.00      1.00      1.00      5000
           9       1.00      1.00      1.00      5000
          10       0.99      0.99      0.99      5000
          11       1.00      1.00      1.00      5000
          12       1.00      1.00      1.00      5000
          13       0.99      1.00      0.99      5000

    accuracy                           0.99     70000
   macro avg       0.99      0.99      0.99     70000
weighted avg       0.99      0.99      0.99     70000



In [20]:
print("\nInference time for BERT on Dbpedia Ontology dataset: \n", (bert_validation_stop - bert_validation_start), " sec")


Inference time for BERT on Dbpedia Ontology dataset: 
 1456.0601591899976  sec
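The two timings are easier to compare as throughput. The sketch below converts the reported fine-tuning and validation times into samples per second, using the 560,000 training rows and 70,000 test rows from the dataset summary. Note that the validation timing also covers building the classification report, so it slightly understates pure inference speed.

```python
def samples_per_second(n_samples, seconds):
    """Average throughput over a timed run."""
    return n_samples / seconds


train_minutes = 728.0891408850167  # fine-tuning time reported above
val_seconds = 1456.0601591899976   # validation time reported above

train_rate = samples_per_second(560_000, train_minutes * 60)
val_rate = samples_per_second(70_000, val_seconds)
print(f"training:   {train_rate:.1f} samples/s")
print(f"validation: {val_rate:.1f} samples/s")
```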


In [21]:
# Use the preprocessor's class order (alphabetical), which matches the encoded label indices.
bert_learner_ins.validate(class_names=bert_transformer.get_classes())

                        precision    recall  f1-score   support

                 Album       1.00      1.00      1.00      5000
                Animal       1.00      1.00      1.00      5000
                Artist       0.99      0.99      0.99      5000
               Athlete       1.00      1.00      1.00      5000
              Building       0.98      0.99      0.99      5000
               Company       0.98      0.98      0.98      5000
EducationalInstitution       0.99      0.99      0.99      5000
                  Film       1.00      0.99      1.00      5000
  MeanOfTransportation       1.00      1.00      1.00      5000
          NaturalPlace       1.00      1.00      1.00      5000
          OfficeHolder       0.99      0.99      0.99      5000
                 Plant       1.00      1.00      1.00      5000
               Village       1.00      1.00      1.00      5000
           WrittenWork       0.99      1.00      0.99      5000

              accuracy                

array([[4986,    0,    4,    0,    0,    0,    0,    5,    0,    0,    0,
           0,    0,    5],
       [   0, 4993,    0,    0,    0,    1,    0,    0,    0,    0,    0,
           5,    0,    1],
       [   3,    0, 4937,    5,    0,    4,    0,    0,    0,    0,   50,
           0,    0,    1],
       [   1,    0,    4, 4986,    0,    1,    0,    0,    0,    0,    8,
           0,    0,    0],
       [   0,    1,    0,    0, 4928,   35,   15,    0,    8,   12,    0,
           0,    1,    0],
       [   3,    0,    6,    1,   44, 4888,   27,    1,    8,    1,    1,
           0,    1,   19],
       [   0,    0,    0,    0,   20,   36, 4938,    0,    0,    0,    3,
           0,    2,    1],
       [  13,    0,    1,    0,    0,    3,    0, 4968,    0,    0,    0,
           0,    0,   15],
       [   0,    0,    0,    0,    2,   16,    0,    0, 4982,    0,    0,
           0,    0,    0],
       [   0,    0,    0,    0,    7,    0,    0,    0,    0, 4986,    0,
           0,    
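The precision and recall figures in the report can be derived directly from a confusion matrix whose rows are true labels and whose columns are predictions (each row above sums to the class support of 5000). The sketch below shows the calculation on a small toy matrix, along with a helper that lists the largest off-diagonal cells. Both helper names are illustrative, and the toy numbers are not taken from the run above.

```python
def per_class_metrics(confusion):
    """Return [(precision, recall), ...]; rows are true labels, columns predictions."""
    n = len(confusion)
    metrics = []
    for k in range(n):
        true_positive = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])                          # row sum
        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / actual_k if actual_k else 0.0
        metrics.append((precision, recall))
    return metrics


def top_confusions(confusion, k=3):
    """Return the k largest off-diagonal cells as (count, true_idx, pred_idx)."""
    n = len(confusion)
    cells = [(confusion[i][j], i, j) for i in range(n) for j in range(n) if i != j]
    return sorted(cells, reverse=True)[:k]


toy = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]
print(per_class_metrics(toy))
print(top_confusions(toy, k=2))
```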

In [22]:
bert_learner_ins.view_top_losses(preproc=bert_transformer)

----------
id:21130 | loss:10.3 | true:OfficeHolder | pred:Building)

----------
id:35579 | loss:10.23 | true:NaturalPlace | pred:Village)

----------
id:623 | loss:9.81 | true:Company | pred:MeanOfTransportation)

----------
id:10373 | loss:9.76 | true:Artist | pred:WrittenWork)
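A per-example loss around 10 means the model assigned almost no probability to the true class. Assuming the reported loss is categorical cross-entropy with the natural logarithm, so that loss = -ln(p_true), the probability can be recovered as exp(-loss). The helper name below is illustrative.

```python
import math


def true_class_probability(loss):
    """Invert per-example cross-entropy: loss = -ln(p_true), so p_true = exp(-loss)."""
    return math.exp(-loss)


# (id, loss) pairs from the top-losses output above.
for example_id, loss in [(21130, 10.3), (35579, 10.23), (623, 9.81), (10373, 9.76)]:
    print(f"id {example_id}: p(true class) ~ {true_class_probability(loss):.1e}")
```

Confident errors like these are good candidates for manual inspection, since they can point to ambiguous or mislabeled examples.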



### Saving BERT Model:

In [23]:
bert_predictor = ktrain.get_predictor(bert_learner_ins.model, preproc=bert_transformer)
bert_predictor.get_classes()

['Album',
 'Animal',
 'Artist',
 'Athlete',
 'Building',
 'Company',
 'EducationalInstitution',
 'Film',
 'MeanOfTransportation',
 'NaturalPlace',
 'OfficeHolder',
 'Plant',
 'Village',
 'WrittenWork']
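The order returned by `get_classes()` is alphabetical, not the order of `class_names_list`, so the integer class indices in reports and confusion matrices follow the alphabetical order. A small sketch of building the index-to-label mapping, assuming (as the output above suggests) that the string labels were encoded in sorted order:

```python
class_names_list = [
    'Company', 'EducationalInstitution', 'Artist', 'Athlete', 'OfficeHolder',
    'MeanOfTransportation', 'Building', 'NaturalPlace', 'Village', 'Animal',
    'Plant', 'Album', 'Film', 'WrittenWork',
]

# Sorted order reproduces the list printed by get_classes() above.
encoded_order = sorted(class_names_list)
index_to_label = dict(enumerate(encoded_order))

print(index_to_label[0], index_to_label[10])
```

Passing this sorted list (or `get_classes()` itself) as `class_names` keeps report rows aligned with the encoded indices.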

In [24]:
bert_predictor.save('/content/bert-predictor-on-dbpedia')
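Once saved, the predictor can be reloaded in a later session and applied to raw text. The sketch below wraps ktrain's `load_predictor` and the predictor's `predict` method in a helper. The function name `predict_with_saved_model` and the `loader` parameter are hypothetical; `loader` exists only so the function can be exercised without ktrain or a trained model.

```python
def predict_with_saved_model(model_dir, texts, loader=None):
    """Load a saved ktrain predictor from `model_dir` and classify `texts`.

    `loader` defaults to ktrain.load_predictor and can be replaced with any
    callable that returns an object exposing a predict(texts) method.
    """
    if loader is None:
        import ktrain  # imported lazily because it is a heavy dependency
        loader = ktrain.load_predictor
    predictor = loader(model_dir)
    return predictor.predict(texts)
```

For example, `predict_with_saved_model('/content/bert-predictor-on-dbpedia', ["Henkel AG & Company KGaA operates worldwide"])` would be expected to return one label per input text.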

In [25]:
!zip -r /content/bert-predictor-on-dbpedia /content/bert-predictor-on-dbpedia

  adding: content/bert-predictor-on-dbpedia/ (stored 0%)
  adding: content/bert-predictor-on-dbpedia/config.json (deflated 59%)
  adding: content/bert-predictor-on-dbpedia/special_tokens_map.json (deflated 40%)
  adding: content/bert-predictor-on-dbpedia/vocab.txt (deflated 53%)
  adding: content/bert-predictor-on-dbpedia/tokenizer_config.json (deflated 39%)
  adding: content/bert-predictor-on-dbpedia/tokenizer.json (deflated 59%)
  adding: content/bert-predictor-on-dbpedia/tf_model.preproc (deflated 55%)
  adding: content/bert-predictor-on-dbpedia/tf_model.h5 (deflated 8%)
