### Next Tasks
* class weight
* add a new dense trainable layer after the huggingface model (Done)
* add a final sigmoid dense layer to get the probabilities of the classes (Done)
* change metrics to auc (Done)
* freeze the pretrained model layers, add a new dense layer, and train only that new layer (transfer learning)
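
For the open "class weight" task, one common heuristic is to weight each class inversely to its frequency. The sketch below is only an illustration under that assumption: the helper name `balanced_class_weights` and the toy label list are hypothetical and are not part of this notebook.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)} for each class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy labels (hypothetical): 8 ham messages (0) and 2 spam messages (1)
toy_labels = [0] * 8 + [1] * 2
weights = balanced_class_weights(toy_labels)
print(weights)  # {0: 0.625, 1: 2.5}
```

A dictionary of this shape could then be passed to Keras `fit` through its `class_weight` argument, so that errors on the rarer class count more in the loss.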

## Classification Problem: Spam Classification

In [1]:
import pandas as pd
import numpy as np
from transformers import pipeline, DistilBertTokenizerFast, TFDistilBertForSequenceClassification
from sklearn.model_selection import train_test_split
import tensorflow as tf

In [2]:
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['activation_13', 'vocab_transform', 'vocab_projector', 'vocab_layer_norm']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier', 'classifier', 'dropout_19']
You should probably TRAIN this model on a down-stream task to be able to use i

In [3]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

In [4]:
mydata = pd.read_csv('spam.csv')[['v1','v2']]

In [5]:
mydata['target'] = np.where(mydata['v1']=='ham',0,1)

In [6]:
mydata.drop(columns=['v1'],inplace=True)

In [7]:
mydata.head()

Unnamed: 0,v2,target
0,"Go until jurong point, crazy.. Available only ...",0
1,Ok lar... Joking wif u oni...,0
2,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,U dun say so early hor... U c already then say...,0
4,"Nah I don't think he goes to usf, he lives aro...",0


In [8]:
trainX,testX, trainY, testY = train_test_split(mydata['v2'],mydata['target'],stratify=mydata['target'],test_size=.3)

In [9]:
trainX.reset_index(inplace=True,drop=True)
testX.reset_index(inplace=True,drop=True)
trainY.reset_index(inplace=True,drop=True)
testY.reset_index(inplace=True,drop=True)

In [105]:
my_max_length=221

In [110]:
trainX_encoded = tokenizer(trainX.to_list(),padding='max_length',truncation=True,max_length=my_max_length)
testX_encoded = tokenizer(testX.to_list(),padding='max_length',truncation=True,max_length=my_max_length)
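
To make the effect of padding='max_length' and truncation=True concrete, here is a simplified pure-Python sketch of what happens to one sequence of token ids. It is not the tokenizer's implementation: the helper `pad_or_truncate` and the example ids are hypothetical, and the real tokenizer also keeps the closing special token when it truncates, which this sketch does not.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Force a list of token ids to exactly max_length entries.

    Returns (ids, attention_mask): the mask is 1 for real tokens, 0 for padding.
    """
    ids = list(token_ids[:max_length])   # cut sequences that are too long
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)        # 0 when the sequence was truncated
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


# Hypothetical short sequence padded to length 6
ids, mask = pad_or_truncate([101, 2023, 2003, 102], max_length=6)
print(ids)   # [101, 2023, 2003, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

With my_max_length=221, every encoded message ends up with exactly 221 ids, which is why the encodings below report num_tokens=221.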

In [111]:
trainX_encoded[0]

Encoding(num_tokens=221, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [112]:
testX_encoded[0]

Encoding(num_tokens=221, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [113]:
trainY[trainY==1].head(1)

0    1
Name: target, dtype: int64

In [13]:
trainX_encoded['input_ids'][10]

[101, 1061, 24654, 3013, 2205, 2460, 3393, 2232, 1012, 1057, 24654, 2066,
 6289, 1029, 2016, 3478, 1012, 2016, 1005, 1055, 3243, 6517, 1012, 102,
 0, 0, 0, ..., 0]  # 24 token ids followed by 197 padding zeros (221 ids in total)

In [14]:
train_dataset = tf.data.Dataset.from_tensor_slices((dict(trainX_encoded),trainY))

In [15]:
# as_numpy_iterator() returns an iterator that converts every element of the dataset to NumPy.

list(train_dataset.as_numpy_iterator())[10:11]

[({'input_ids': array([  101,  1061, 24654,  3013,  2205,  2460,  3393,  2232,  1012,
           1057, 24654,  2066,  6289,  1029,  2016,  3478,  1012,  2016,
           1005,  1055,  3243,  6517,  1012,   102,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0, 

In [16]:
test_dataset = tf.data.Dataset.from_tensor_slices((dict(testX_encoded),testY))

In [17]:
val_dataset = tf.data.Dataset.from_tensor_slices((dict(testX_encoded)))

In [18]:
count = 0
for element in train_dataset.shuffle(1000).batch(10):
    count = count+1

In [19]:
count

390

In [20]:
trainX.shape

(3900,)
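
The batch count of 390 follows directly from the 3900 training rows and the batch size of 10. A small arithmetic check, using only the numbers shown in the cells above:

```python
import math

n_samples = 3900   # trainX.shape[0]
batch_size = 10    # value passed to .batch(10)

# The last batch may be smaller, so the count is rounded up
n_batches = math.ceil(n_samples / batch_size)
print(n_batches)  # 390
```

Shuffling does not change this count; it only changes which rows land in which batch.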

In [136]:
# Only the token ids are fed in; the attention mask is not passed to the model here
input_layer = tf.keras.layers.Input(shape=(my_max_length,), dtype='int64')
distbert = model(input_layer)
distbert = distbert[0]  # logits of the classification head, shape (None, 2)
flat = tf.keras.layers.Flatten()(distbert)
dense = tf.keras.layers.Dense(units=256, activation=tf.keras.activations.relu)(flat)  # additional dense layer
classifier = tf.keras.layers.Dense(units=1, activation=tf.keras.activations.sigmoid)(dense)  # probability of class 1
mymodel = tf.keras.Model(inputs=input_layer, outputs=classifier)
mymodel.summary()

Model: "model_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_13 (InputLayer)        [(None, 221)]             0         
_________________________________________________________________
tf_distil_bert_for_sequence_ TFSequenceClassifierOutpu 66955010  
_________________________________________________________________
flatten_10 (Flatten)         (None, 2)                 0         
_________________________________________________________________
dense_10 (Dense)             (None, 256)               768       
_________________________________________________________________
dense_11 (Dense)             (None, 1)                 257       
Total params: 66,956,035
Trainable params: 66,956,035
Non-trainable params: 0
_________________________________________________________________
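
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The sketch below uses only the numbers printed above; the helper name `dense_params` is hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


distilbert_params = 66_955_010       # taken from the summary
dense_10 = dense_params(2, 256)      # Flatten output (None, 2) -> 256 units
dense_11 = dense_params(256, 1)      # 256 units -> 1 sigmoid unit
total = distilbert_params + dense_10 + dense_11

print(dense_10, dense_11, total)  # 768 257 66956035
```

The 768 parameters of dense_10 confirm that the new dense layer receives only the two logits of the classification head, not the full hidden state.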


In [137]:
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
mymodel.compile(optimizer=optimizer,
                loss=tf.keras.losses.binary_crossentropy,
                metrics=[tf.keras.metrics.AUC()])
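
As a rough illustration of what the binary cross-entropy loss measures, the following pure-Python sketch averages the negative log-likelihood over a few samples. It is a simplified stand-in, not the Keras implementation; the clipping constant `eps` and the example values are assumptions made here.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    losses = []
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(losses)


good = binary_crossentropy([1, 0], [0.9, 0.1])   # confident and correct
bad = binary_crossentropy([1, 0], [0.1, 0.9])    # confident and wrong
print(round(good, 3), round(bad, 3))  # 0.105 2.303
```

Confident wrong predictions are penalised much more heavily than confident correct ones, which is what pushes the sigmoid output toward the true label during training.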

In [139]:
mymodel.fit(np.array(trainX_encoded['input_ids']),np.array(trainY), epochs=1, batch_size=64)



<keras.callbacks.History at 0x7fd281b3b3d0>

In [140]:
np.array(trainY).shape

(3900,)

In [141]:
np.array(trainX_encoded['input_ids']).shape

(3900, 221)

In [142]:
mymodel.layers

[<keras.engine.input_layer.InputLayer at 0x7fd2c07bea90>,
 <transformers.models.distilbert.modeling_tf_distilbert.TFDistilBertForSequenceClassification at 0x7fd5ae628190>,
 <keras.layers.core.Flatten at 0x7fd2c07beb50>,
 <keras.layers.core.Dense at 0x7fd2c05a6c90>,
 <keras.layers.core.Dense at 0x7fd2a9a06b50>]

In [143]:
pred_proba = mymodel.predict(np.array(testX_encoded['input_ids']))

* The closer pred_proba is to 0.0, the more likely the message is class 0 (ham); the closer it is to 1.0, the more likely it is class 1 (spam).

In [150]:
pred_proba[0:5]

array([[0.1380687 ],
       [0.13807476],
       [0.13807559],
       [0.8192456 ],
       [0.13805553]], dtype=float32)

In [145]:
testY[0:5]

0    0
1    0
2    0
3    1
4    0
Name: target, dtype: int64
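
To turn the probabilities into class labels, a threshold is needed. The sketch below applies a cut-off of 0.5 to the five predictions shown above and compares them with the five true labels; the value 0.5 is an assumption made for illustration, not something this notebook fixes.

```python
# First five predicted probabilities and true labels, copied from the outputs above
pred_proba = [0.1380687, 0.13807476, 0.13807559, 0.8192456, 0.13805553]
true_labels = [0, 0, 0, 1, 0]

threshold = 0.5  # assumed cut-off
pred_labels = [1 if p >= threshold else 0 for p in pred_proba]

print(pred_labels)                 # [0, 0, 0, 1, 0]
print(pred_labels == true_labels)  # True
```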

In [146]:
from sklearn.metrics import roc_auc_score

In [147]:
roc_auc_score(testY,pred_proba)

0.9955927510852407
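
ROC AUC can be read as the probability that a randomly chosen spam message receives a higher score than a randomly chosen ham message. The pure-Python sketch below computes it by brute force over all positive/negative pairs; it is an illustration of the definition, not how roc_auc_score is implemented, and the helper name `pairwise_auc` is hypothetical.

```python
def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# The five test predictions shown earlier are perfectly ranked
auc_first_five = pairwise_auc(
    [0, 0, 0, 1, 0],
    [0.1380687, 0.13807476, 0.13807559, 0.8192456, 0.13805553],
)
print(auc_first_five)  # 1.0
```

Because the metric depends only on the ranking of the scores, it does not require choosing a classification threshold.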

* The model with the additional dense layer has beaten the previous core model (part 1) on test ROC AUC
* 0.9899 (part 1) vs 0.9955 (this model)