**NOTE: To keep the compute cost low, this POC uses DistilBERT instead of BERT.**

### A Few Important Links:
* Transfer Learning: https://www.hackerearth.com/practice/machine-learning/transfer-learning/transfer-learning-intro/tutorial/
* Adding class_weights: https://datascience.stackexchange.com/questions/13490/how-to-set-class-weights-for-imbalanced-classes-in-keras
* Keras Fine-Tuning: https://learnopencv.com/keras-tutorial-fine-tuning-using-pre-trained-models/
* Huggingface Fine-Tuning: https://huggingface.co/transformers/custom_datasets.html#seq-imdb
* Different usages of BERT: https://datascience.stackexchange.com/questions/79772/can-we-use-bert-for-only-word-embedding-and-then-use-svm-rnn-to-do-intent-classi


## Classification Problem: Spam Classification

In [1]:
import pandas as pd
import numpy as np
from transformers import pipeline, DistilBertTokenizerFast, TFDistilBertForSequenceClassification
from sklearn.model_selection import train_test_split
import tensorflow as tf
import warnings

warnings.filterwarnings("ignore")

#### Load the model and its tokenizer

In [2]:
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['activation_13', 'vocab_layer_norm', 'vocab_projector', 'vocab_transform']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier', 'dropout_19', 'classifier']
You should probably TRAIN this model on a down-stream task to be able to use i

In [3]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

### Load and Transform the data

In [4]:
mydata = pd.read_csv('spam.csv')[['v1','v2']]

In [5]:
mydata['target'] = np.where(mydata['v1']=='ham',0,1)

In [6]:
mydata.drop(columns=['v1'],inplace=True)

In [7]:
mydata.head()

| | v2 | target |
|---|---|---|
| 0 | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | Ok lar... Joking wif u oni... | 0 |
| 2 | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | U dun say so early hor... U c already then say... | 0 |
| 4 | Nah I don't think he goes to usf, he lives aro... | 0 |


In [8]:
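# 70% of the rows go to training, 30% are held out; stratify keeps the ham/spam ratio the same in both parts
trainX, testX, trainY, testY = train_test_split(mydata['v2'], mydata['target'], stratify=mydata['target'], test_size=.3)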

In [9]:
trainX.reset_index(inplace=True,drop=True)
testX.reset_index(inplace=True,drop=True)
trainY.reset_index(inplace=True,drop=True)
testY.reset_index(inplace=True,drop=True)

In [10]:
trainX.shape, trainY.shape

((3900,), (3900,))

In [11]:
# Split the 30% hold-out again: 80% of it becomes the validation set, 20% the final test set
validX, testX, validY, testY = train_test_split(testX, testY, stratify=testY, test_size=.2)

In [12]:
validX.reset_index(inplace=True,drop=True)
testX.reset_index(inplace=True,drop=True)
validY.reset_index(inplace=True,drop=True)
testY.reset_index(inplace=True,drop=True)

In [13]:
validX.shape,validY.shape

((1337,), (1337,))

In [14]:
testX.shape, testY.shape

((335,), (335,))
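The three shapes above follow from the two `test_size` fractions. The sketch below reproduces them with plain arithmetic. It assumes that a fractional `test_size` is rounded up to a whole number of rows and that the remainder goes to the other part; `split_sizes` is a hypothetical helper, not a scikit-learn function, and the total row count is simply the sum of the three shapes printed in this notebook.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out) for a fractional test_size.

    Assumption: the held-out part is rounded up and the rest is kept.
    """
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out


# Total number of rows, taken as the sum of the shapes printed above.
n_total = 3900 + 1337 + 335

# First split: 30% held out.
n_train, n_holdout = split_sizes(n_total, 0.3)

# Second split: 20% of the hold-out becomes the final test set.
n_valid, n_test = split_sizes(n_holdout, 0.2)

print(n_train, n_valid, n_test)
```

Under these assumptions the data ends up split roughly 70% / 24% / 6% into training, validation and test sets.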

In [15]:
my_max_length=221

In [16]:
trainX_encoded = tokenizer(trainX.to_list(),padding='max_length',truncation=True,max_length=my_max_length)
validX_encoded = tokenizer(validX.to_list(),padding='max_length',truncation=True,max_length=my_max_length)
testX_encoded = tokenizer(testX.to_list(),padding='max_length',truncation=True,max_length=my_max_length)

In [17]:
trainX_encoded[0]

Encoding(num_tokens=221, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [18]:
validX_encoded[0]

Encoding(num_tokens=221, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [19]:
testX_encoded[0]

Encoding(num_tokens=221, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [20]:
trainY[trainY==1].head(1)

6    1
Name: target, dtype: int64

In [21]:
trainX_encoded['input_ids'][11]

[101, 2008, 2015, 4658, 1012, 2073, 2323, 1045, 13988, 1029, 2006, 2017, 2030, 1999, 2017, 1029, 1024, 1007, 102,
 0, 0, 0, ..., 0]

(19 token ids followed by 202 padding zeros, 221 ids in total)
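The row printed above holds 19 real token ids followed by padding up to `my_max_length`. The tokenizer output also carries an `attention_mask` (listed among the `Encoding` attributes earlier) that marks which positions are real. The sketch below rebuilds such a mask by hand, assuming the padding id is 0 as in the printed row; `PAD_ID` is a name chosen for this illustration.

```python
# Assumption: padding positions use id 0, as in the row printed above.
PAD_ID = 0

# The 19 token ids from the printed row, padded to 221 positions.
token_ids = [101, 2008, 2015, 4658, 1012, 2073, 2323, 1045, 13988, 1029,
             2006, 2017, 2030, 1999, 2017, 1029, 1024, 1007, 102]
input_ids = token_ids + [PAD_ID] * (221 - len(token_ids))

# 1 for a real token, 0 for padding.
attention_mask = [int(token_id != PAD_ID) for token_id in input_ids]

print(len(input_ids), sum(attention_mask))
```

Note that the `fit` and `predict` calls in this notebook pass only `input_ids`, so the mask is not handed to the model there. Passing it as well would be a possible refinement, not something this POC does.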

In [22]:
train_dataset = tf.data.Dataset.from_tensor_slices((dict(trainX_encoded),trainY))
valid_dataset = tf.data.Dataset.from_tensor_slices((dict(validX_encoded),validY))
test_dataset = tf.data.Dataset.from_tensor_slices((dict(testX_encoded),testY))

In [23]:
## as_numpy_iterator => returns an iterator that converts every element of the dataset to NumPy.

list(train_dataset.as_numpy_iterator())[10:11]

[({'input_ids': array([  101,  4658,  1012,  2061,  2129,  2272,  2017,  4033,  2102,
           2042,  4511,  2094,  1998, 11586,  2098,  2077,  1029,   102,
              0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0, 

In [24]:
count = 0
for element in train_dataset.shuffle(1000).batch(10):
    count = count+1

In [25]:
count

390

In [26]:
trainX.shape

(3900,)
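The count of 390 is the number of batches: 3,900 training rows divided into batches of 10. Shuffling changes the order of the rows, not the number of batches. A minimal sketch of that arithmetic follows; `num_batches` is a hypothetical helper, and the `drop_remainder` flag only mirrors the idea of discarding a final partial batch.

```python
import math


def num_batches(n_samples, batch_size, drop_remainder=False):
    """Number of batches produced when n_samples rows are grouped by batch_size."""
    if drop_remainder:
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)


batches_of_10 = num_batches(3900, 10)   # the loop above
batches_of_64 = num_batches(3900, 64)   # the batch size used later in fit()

print(batches_of_10, batches_of_64)
```

With a batch size of 64 the last batch is only partly filled, which is why the rounded-up and rounded-down counts differ by one in that case.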

In [27]:
from sklearn.metrics import roc_auc_score

### Case 1: Fine-tune the DistilBERT model as is on the custom data

In [28]:
model1 = model  # a second name for the same object, not a copy

In [29]:
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model1.compile(optimizer=optimizer, 
              loss=model.compute_loss) # can also use any keras loss function


In [42]:
model1.summary()

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
distilbert (TFDistilBertMain multiple                  66362880  
_________________________________________________________________
pre_classifier (Dense)       multiple                  590592    
_________________________________________________________________
classifier (Dense)           multiple                  1538      
_________________________________________________________________
dropout_19 (Dropout)         multiple                  0         
Total params: 66,955,010
Trainable params: 66,955,010
Non-trainable params: 0
_________________________________________________________________


In [30]:
## batch_size is set in fit() here because the inputs are NumPy arrays.
## With a tf.data.Dataset that is already batched, batch_size must not be passed to fit().
## source: https://stackoverflow.com/questions/62670041/batch-size-in-tf-model-fit-vs-batch-size-in-tf-data-dataset

model1.fit(np.array(trainX_encoded['input_ids']),np.array(trainY), epochs=1, 
          #validation_data=(np.array(validX_encoded['input_ids']),np.array(validY)),
          #validation_split=.2,
          batch_size=64)



<keras.callbacks.History at 0x7fe59ed6ff90>

In [31]:
raw_pred = model1.predict(np.array(testX_encoded['input_ids']))  # raw_pred[0] holds the logits, one column per class

In [32]:
pred_proba = tf.math.softmax(raw_pred[0], axis=-1).numpy()

In [33]:
roc_auc_score(testY,pred_proba[:,1])

0.9966283524904214
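The model returns one raw logit per class, so the cells above apply a softmax and take column 1 as the spam score. The sketch below shows the same transformation in plain Python. The logits are made-up values for illustration, not outputs of the model.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for two messages: [score for class 0, score for class 1].
raw_logits = [[2.0, -1.0],
              [-0.5, 3.0]]

probabilities = [softmax(row) for row in raw_logits]

# Column 1 is the probability of class 1 (spam), which is what the AUC is computed on.
spam_scores = [row[1] for row in probabilities]

print(spam_scores)
```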

### Case 2: Add a new trainable dense layer and train the complete model on the custom data

In [43]:
model2 = model  # still the same object, so it starts from the weights already fine-tuned in Case 1

In [44]:
input_layer = tf.keras.layers.Input(shape=(my_max_length,), dtype='int64')  # one row of token ids per message
distbert = model2(input_layer)
distbert = distbert[0]  # logits of the existing classification head, shape (None, 2)
flat = tf.keras.layers.Flatten()(distbert)
dense = tf.keras.layers.Dense(units=512, activation=tf.keras.activations.relu)(flat)  # additional dense layer (ReLU)
classifier = tf.keras.layers.Dense(units=1, activation=tf.keras.activations.sigmoid)(dense)  # probability of class 1
mymodel = tf.keras.Model(inputs=input_layer, outputs=classifier)
mymodel.summary()

Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         [(None, 221)]             0         
_________________________________________________________________
tf_distil_bert_for_sequence_ TFSequenceClassifierOutpu 66955010  
_________________________________________________________________
flatten_2 (Flatten)          (None, 2)                 0         
_________________________________________________________________
dense_4 (Dense)              (None, 512)               1536      
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 513       
Total params: 66,957,059
Trainable params: 66,957,059
Non-trainable params: 0
_________________________________________________________________


In [45]:
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
mymodel.compile(optimizer=optimizer,
                loss=tf.keras.losses.binary_crossentropy,
                metrics=[tf.keras.metrics.AUC()])

In [46]:
mymodel.fit(np.array(trainX_encoded['input_ids']),
            np.array(trainY), 
            epochs=1, 
            batch_size=64,
            validation_data=(np.array(validX_encoded['input_ids']),np.array(validY)))



<keras.callbacks.History at 0x7fe2af301e10>

In [47]:
pred_proba = mymodel.predict(np.array(testX_encoded['input_ids']))

In [48]:
roc_auc_score(testY,pred_proba)

0.9766283524904215
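The ROC AUC values reported in this notebook can be read as the probability that a randomly chosen spam message receives a higher score than a randomly chosen ham message. The sketch below computes the metric that way on a tiny made-up example. `pairwise_auc` is a hypothetical helper written for illustration, and it is far slower than `roc_auc_score` on real data.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Made-up labels and scores, not taken from the model.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(labels, scores)
print(auc)
```

Because the metric only depends on the ranking of the scores, it is unaffected by the choice of a classification threshold.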

### Case 3: Freeze the pretrained model and train only the newly added dense layers on the custom data

In [49]:
model3 = model  # again the same object, carrying the weights trained in the previous cases

In [50]:
model3.trainable = False  ## freeze all DistilBERT weights so that only the additional layers are trained

In [51]:
input_layer = tf.keras.layers.Input(shape=(my_max_length,), dtype='int64')
distbert = model3(input_layer)  # frozen model, used only as a fixed feature extractor
distbert = distbert[0]  # logits, shape (None, 2)
flat = tf.keras.layers.Flatten()(distbert)
dense = tf.keras.layers.Dense(units=512, activation=tf.keras.activations.relu)(flat)  # additional dense layer (ReLU)
classifier = tf.keras.layers.Dense(units=1, activation=tf.keras.activations.sigmoid)(dense)
mymodel = tf.keras.Model(inputs=input_layer, outputs=classifier)
mymodel.summary()

Model: "model_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_4 (InputLayer)         [(None, 221)]             0         
_________________________________________________________________
tf_distil_bert_for_sequence_ TFSequenceClassifierOutpu 66955010  
_________________________________________________________________
flatten_3 (Flatten)          (None, 2)                 0         
_________________________________________________________________
dense_6 (Dense)              (None, 512)               1536      
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 513       
Total params: 66,957,059
Trainable params: 2,049
Non-trainable params: 66,955,010
_________________________________________________________________


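* The summary shows that only 2,049 parameters are trainable: 1,536 in `dense_6` plus 513 in `dense_7`. The remaining 66,955,010 parameters belong to the frozen DistilBERT model.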
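The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output. A small sketch of that arithmetic, using only the numbers printed in the summaries of this notebook (`dense_params` is a hypothetical helper):

```python
def dense_params(n_inputs, n_outputs):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs


hidden_layer = dense_params(2, 512)   # 2 logits in, 512 units out
output_layer = dense_params(512, 1)   # 512 units in, 1 sigmoid unit out
trainable = hidden_layer + output_layer

frozen = 66_955_010                   # DistilBERT model, from the summary
total = frozen + trainable

print(hidden_layer, output_layer, trainable, total)
```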

In [52]:
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
mymodel.compile(optimizer=optimizer,
                loss=tf.keras.losses.binary_crossentropy,
                metrics=[tf.keras.metrics.AUC()])

In [53]:
mymodel.fit(np.array(trainX_encoded['input_ids']),
            np.array(trainY), epochs=3, 
            batch_size=64,
            validation_data=(np.array(validX_encoded['input_ids']),np.array(validY)))

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7fe285427590>

In [142]:
mymodel.layers

[<keras.engine.input_layer.InputLayer at 0x7fd2c07bea90>,
 <transformers.models.distilbert.modeling_tf_distilbert.TFDistilBertForSequenceClassification at 0x7fd5ae628190>,
 <keras.layers.core.Flatten at 0x7fd2c07beb50>,
 <keras.layers.core.Dense at 0x7fd2c05a6c90>,
 <keras.layers.core.Dense at 0x7fd2a9a06b50>]

In [54]:
pred_proba = mymodel.predict(np.array(testX_encoded['input_ids']))

* `pred_proba` is the sigmoid output, i.e. the predicted probability of class 1 (spam). Values close to 0.0 point to class 0 (ham); values close to 1.0 point to class 1 (spam).
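To turn these probabilities into hard class labels, a threshold has to be chosen. The sketch below uses 0.5, which is an assumption for illustration and not a value tuned in this notebook. The probabilities are made-up and shaped like the `(n, 1)` output of `predict`; `to_labels` is a hypothetical helper.

```python
def to_labels(probabilities, threshold=0.5):
    """Map each probability of class 1 to a 0/1 label."""
    return [int(p >= threshold) for p in probabilities]


# Made-up predictions in the (n, 1) layout returned by predict().
example_pred_proba = [[0.03], [0.91], [0.48], [0.50]]

# Flatten the single-column rows before thresholding.
flat_proba = [row[0] for row in example_pred_proba]
labels = to_labels(flat_proba)

print(labels)
```

With imbalanced classes such as ham and spam, a threshold other than 0.5 may give a better trade-off between missed spam and false alarms.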

In [55]:
roc_auc_score(testY,pred_proba)

0.9781609195402299