# Compilation of NLP models

## Outline

Explore the following NLP models:
- Simple RNNs
- Word Embeddings
- LSTM
- GRU
- Bi-Directional RNNs
- Encoder-Decoder Models
- Transformer/Attention Models (DistilBERT)

Referenced sources
- Notebook #1: https://www.kaggle.com/code/tanulsingh077/deep-learning-for-nlp-zero-to-transformers-bert/notebook
- Notebook #2: https://www.kaggle.com/code/abhishek/approaching-almost-any-nlp-problem-on-kaggle
- Notebook #3: https://www.kaggle.com/code/pranavmoothedath/real-nlp
- Dataset: https://www.kaggle.com/competitions/jigsaw-multilingual-toxic-comment-classification/overview

In [1]:
# EDA tools
import pandas as pd
import numpy as np
import itertools
import datetime as dt

# display and tracking of iterative processes
from tqdm import tqdm

# plotting tools
import matplotlib.pyplot as plt
import seaborn as sns

# xgb
import xgboost as xgb

# sklearn tools
from sklearn.metrics import mean_squared_error, roc_auc_score
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split

# TensorFlow/ Keras
import tensorflow as tf
from tensorflow.keras.preprocessing import sequence, text
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import LSTM, GRU, SimpleRNN, Dense, Activation, Dropout, Embedding, Bidirectional, Input, Lambda

# transformer
import transformers
from transformers import TFDistilBertModel, DistilBertTokenizer

# BERT Tokenizers
from tokenizers import BertWordPieceTokenizer


### Configuring TPU

In [2]:
# Detect hardware, return appropriate distribution strategy
try:
    # TPU detection. No parameters necessary if TPU_NAME environment variable is
    # set: this is always the case on Kaggle.
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

    
if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.TPUStrategy(tpu)  # non-experimental name (TF 2.3+)
else:
    # Default distribution strategy in Tensorflow. Works on CPU and single GPU.
    strategy = tf.distribute.get_strategy()

print("REPLICAS: ", strategy.num_replicas_in_sync)

REPLICAS:  1


### Import and check data

In [3]:
# import and check data
train_df = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train.csv')
validation_df = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/validation.csv')
test_df = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/test.csv')
sub_df = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/sample_submission.csv')

In [4]:
train_df.info()
validation_df.info()
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 223549 entries, 0 to 223548
Data columns (total 8 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   id             223549 non-null  object
 1   comment_text   223549 non-null  object
 2   toxic          223549 non-null  int64 
 3   severe_toxic   223549 non-null  int64 
 4   obscene        223549 non-null  int64 
 5   threat         223549 non-null  int64 
 6   insult         223549 non-null  int64 
 7   identity_hate  223549 non-null  int64 
dtypes: int64(6), object(2)
memory usage: 13.6+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8000 entries, 0 to 7999
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            8000 non-null   int64 
 1   comment_text  8000 non-null   object
 2   lang          8000 non-null   object
 3   toxic         8000 non-null   int64 
dtypes: int64(2), object(2)
memory usage:

In [5]:
train_df.isna().sum()
# validation_df.isna().sum()
# test_df.isna().sum()

id               0
comment_text     0
toxic            0
severe_toxic     0
obscene          0
threat           0
insult           0
identity_hate    0
dtype: int64

In [6]:
# train_df.head(3)
# validation_df.head(3)
test_df.head(3)

| | id | content | lang |
|---|---|---|---|
| 0 | 0 | Doctor Who adlı viki başlığına 12. doctor olar... | tr |
| 1 | 1 | Вполне возможно, но я пока не вижу необходимо... | ru |
| 2 | 2 | Quindi tu sei uno di quelli conservativi , ... | it |


### Approach: Binary Classification of comment toxicity

In [7]:
# drop all columns except for 'toxic'

train_df.drop(columns = ['severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate'], inplace= True)

In [8]:
train_df.head(3)

| | id | comment_text | toxic |
|---|---|---|---|
| 0 | 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 |
| 1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 |
| 2 | 000113f07ec002fd | Hey man, I'm really not trying to edit war. It... | 0 |


In [9]:
# check class balance of the target
train_df['toxic'].value_counts()

toxic
0    202165
1     21384
Name: count, dtype: int64

### EDA

In [10]:
# max word count in a comment
max_length = train_df['comment_text'].map(lambda x: len(x.split())).max()
max_length

2321

In [11]:
# min word count in a comment
train_df['comment_text'].map(lambda x: len(x.split())).min()

1

In [12]:
# add feature: word count
train_df['word_count'] = train_df['comment_text'].map(lambda x: len(x.split()))

In [13]:
# mean word count for toxic vs non-toxic comments
train_df.groupby(['toxic']).agg({'word_count': 'mean'})

| toxic | word_count |
|---|---|
| 0 | 68.415161 |
| 1 | 48.573466 |


The data suggest that toxic comments are more direct: they average about 49 words, compared with about 68 for non-toxic comments.

In [14]:
train_df['comment_text'].values

array(["Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",
       "D'aww! He matches this background colour I'm seemingly stuck with. Thanks.  (talk) 21:51, January 11, 2016 (UTC)",
       "Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",
       ...,
       '==shame on you all!!!== \n\n You want to speak about gays and not about romanians...',
       'MEL GIBSON IS A NAZI BITCH WHO MAKES SHITTY MOVIES. HE HAS SO MUCH BUTTSEX THAT HIS ASSHOLE IS NOW BIG ENOUGH TO BE CONSIDERED A COUNTRY.',
       '" \n\n == Unicorn lair discovery == \n\n Supposedly, a \'unicorn lair\' has been discovered in

### Train/Test Split

In [15]:
# shuffle default = True
X_train, X_test, y_train, y_test = train_test_split(train_df['comment_text'].values, train_df['toxic'].values, test_size=0.25, stratify=train_df['toxic'], random_state=22)

### Recurrent Neural Network

https://stackoverflow.com/questions/76029717/having-trouble-correctly-importing-tensorflow-tokenizer-and-tensorflow-padded-se

In [16]:
# tokenise data with keras.text.Tokenizer

token = text.Tokenizer()

### Fit the tokenizer to the text data

In [17]:
# fit the tokenizer on the training data; fit_on_texts accepts both arrays and lists
token.fit_on_texts(X_train)

# maps word to corresponding index
word_index = token.word_index

In [18]:
# transform on X_train and X_test
X_train_seq = token.texts_to_sequences(X_train)
X_test_seq = token.texts_to_sequences(X_test)

In [19]:
# pad data so that they are of uniform length, so neural network architectures can process sequential data
X_train_pad = sequence.pad_sequences(X_train_seq, maxlen= max_length)
X_test_pad = sequence.pad_sequences(X_test_seq, maxlen= max_length)
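
To make the tokenise-then-pad steps concrete, here is a minimal pure-Python sketch of what they roughly do. It is a simplification, not the Keras implementation: the helper names (`fit_word_index`, `to_sequences`, `pad_pre`) are hypothetical, punctuation filtering is omitted, and it assumes the Keras defaults of padding and truncating at the start of each sequence.

```python
from collections import Counter

def fit_word_index(texts):
    """Assign indices by descending frequency; index 0 is reserved for padding."""
    counts = Counter(word for t in texts for word in t.lower().split())
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    """Replace each known word with its index; unknown words are dropped."""
    return [[word_index[w] for w in t.lower().split() if w in word_index]
            for t in texts]

def pad_pre(seqs, maxlen):
    """Left-pad with zeros and keep only the last `maxlen` tokens."""
    padded = []
    for s in seqs:
        s = s[-maxlen:]
        padded.append([0] * (maxlen - len(s)) + s)
    return padded

train_texts = ["you are great", "you are not"]
word_index_demo = fit_word_index(train_texts)

# "indeed" was never seen during fitting, so it is dropped.
seqs = to_sequences(["you are great indeed", "not"], word_index_demo)
padded = pad_pre(seqs, maxlen=3)
print(word_index_demo)
print(padded)
```

Every row has the same length after padding, which is what lets the comments be stacked into one array and fed to the network in batches.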

### Simple RNN


### Determine embedding vector space dimension: https://ai.stackexchange.com/questions/28564/how-to-determine-the-embedding-size

In [20]:
# set embedding_dim
# embedding_dim, max_length and the number of RNN units are independent hyperparameters; they do not need to be equal.
# The Embedding layer maps every token index in a padded comment of length max_length to an embedding_dim-dimensional vector,
# so each comment becomes a (max_length, embedding_dim) matrix. These vectors are learnt during training unless the layer is frozen.

embedding_dim = 300
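
The shape question raised in the comments above can be checked with a small NumPy sketch. An embedding layer is essentially a lookup table indexed by token id. The sizes below are toy values chosen for illustration, not the notebook's, and `embedding_table` is a hypothetical stand-in for the layer's weights.

```python
import numpy as np

# Toy sizes, chosen only for illustration.
vocab_size, emb_dim, seq_len = 10, 4, 6

rng = np.random.default_rng(0)
# One row per token index; row 0 corresponds to the padding index.
embedding_table = rng.normal(size=(vocab_size, emb_dim))

# A padded comment: three padding zeros followed by three token indices.
padded_comment = np.array([0, 0, 0, 5, 2, 7])

# The lookup is plain row indexing: one vector per position in the sequence.
embedded = embedding_table[padded_comment]
print(embedded.shape)  # (seq_len, emb_dim)
```

The sequence length and the embedding dimension appear as separate axes of the result, which is why they can be chosen independently. The number of recurrent units is a third, separate choice.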

In [21]:
# instantiate a sequential model
model = Sequential()

# turn token indices into dense vectors

model.add(Embedding(
    input_dim = len(word_index)+1, # vocabulary size, plus 1 because index 0 is reserved for padding
    output_dim = embedding_dim, # dimension of each word vector; task-dependent, commonly between 64 and 512
    input_length = max_length # maximum sequence length in the data
))

# add a Simple RNN layer
model.add(SimpleRNN(embedding_dim)) # 300 units; reuses embedding_dim for convenience, the two need not match

# add the output layer
model.add(Dense(
    1, # single output unit
    activation = 'sigmoid' # sigmoid maps the output to a probability of the toxic class
))

model.compile(
    loss = 'binary_crossentropy',
    optimizer = 'adam',
    metrics = ['accuracy'], # tracked classification metric
)

model.summary()



In [22]:
model.fit(X_train_pad, y_train, epochs=2, batch_size=512)

Epoch 1/2
328/328 ━━━━━━━━━━━━━━━━━━━━ 260s 772ms/step - accuracy: 0.9086 - loss: 0.2784
Epoch 2/2
328/328 ━━━━━━━━━━━━━━━━━━━━ 253s 770ms/step - accuracy: 0.9324 - loss: 0.2075


<keras.src.callbacks.history.History at 0x7a39ab9c7310>

### Predict test result

In [23]:
# predict = model.predict_proba(X_test_pad)[:,0]
predict = model.predict(X_test_pad)[:,0]

1747/1747 ━━━━━━━━━━━━━━━━━━━━ 119s 68ms/step


### AUC Score

In [24]:
# track auc scores
auc_scores = []

In [25]:
# append SimpleRnn auc score
auc_scores.append({'Model':'SimpleRnn', 'AUC_Score':round(roc_auc_score(y_test,predict),4)})
print(f'auc: {roc_auc_score(y_test,predict):.4f}')

auc: 0.8893
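
The AUC can be read as the probability that a randomly chosen toxic comment gets a higher score than a randomly chosen non-toxic one. The sketch below computes it by brute force over all positive/negative pairs. The function name `pairwise_auc` is hypothetical, and this is an illustration of the definition, not how `roc_auc_score` is implemented.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Four comments: two non-toxic (0) and two toxic (1), with model scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Of the four pairs, only (0.35 vs 0.4) is ranked the wrong way round.
auc_demo = pairwise_auc(labels, scores)
print(auc_demo)
```

Because it depends only on ranking, AUC is less sensitive than accuracy to the roughly 9:1 class imbalance seen earlier.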


### Use pre-trained word embeddings instead of training from scratch

1. Visualisation of embedding vectors: https://www.kaggle.com/code/auxeno/word-embedding-visualisations-nlp
2. Load GloVe Embeddings - Stanford GloVe link: https://nlp.stanford.edu/projects/glove/


In [26]:
# load the GloVe vectors in a dictionary:

embeddings_index = {}
with open('/kaggle/input/glove840b300dtxt/glove.840B.300d.txt', 'r', encoding='utf-8') as f:
    for line in tqdm(f):
        values = line.split(' ')
        word = values[0]  # first field is the token
        coefs = np.array([float(val) for val in values[1:]])  # remaining fields are the vector
        embeddings_index[word] = coefs

print(f'Found {len(embeddings_index)} word vectors')

2196018it [02:56, 12423.62it/s]

Found 2196017 word vectors





### LSTM - Advanced RNN with memory cells and gates that addresses the vanishing gradient problem

In [27]:
# create an embedding matrix of zeros with one row per word index (len(word_index) + 1) and one column per vector dimension
# loop through word_index and look up each word in embeddings_index (built from the GloVe vectors)
# if the word is found, assign its GloVe vector to the row for that word index; otherwise the row stays all zeros

embeddings_matrix = np.zeros((len(word_index) + 1, 300))

for word, i in tqdm(word_index.items()):
    embeddings_vector = embeddings_index.get(word)
    if embeddings_vector is not None:
        embeddings_matrix[i] = embeddings_vector


100%|██████████| 248872/248872 [00:00<00:00, 360774.41it/s]


In [28]:
embeddings_matrix

array([[ 0.      ,  0.      ,  0.      , ...,  0.      ,  0.      ,
         0.      ],
       [ 0.27204 , -0.06203 , -0.1884  , ...,  0.13015 , -0.18317 ,
         0.1323  ],
       [ 0.31924 ,  0.06316 , -0.27858 , ...,  0.082745,  0.097801,
         0.25045 ],
       ...,
       [ 0.      ,  0.      ,  0.      , ...,  0.      ,  0.      ,
         0.      ],
       [ 0.56065 ,  0.20414 , -0.076262, ...,  0.061545,  0.81221 ,
        -0.8306  ],
       [ 0.      ,  0.      ,  0.      , ...,  0.      ,  0.      ,
         0.      ]])

### LSTM Model

In [29]:
# initialise a sequential model
model = Sequential()

# Add Embedding as the first layer
# Use the GloVe embedding matrix as fixed weights instead of learning embeddings from scratch
model.add(Embedding(
    len(word_index)+1,
    embedding_dim, # must match the width of embeddings_matrix (300 for these GloVe vectors)
    weights = [embeddings_matrix],
    input_length = max_length,
    trainable = False # freeze the pre-trained embeddings so they are not updated during training
))

# Add LSTM layers
model.add(LSTM(
    embedding_dim, # number of LSTM units; reuses embedding_dim for convenience but need not match it
    dropout = 0.3,
    recurrent_dropout = 0.3
))

# Add Dense layer
model.add(Dense(
    1, 
    activation = 'sigmoid'
))

# Compile model with loss function and optimiser
model.compile(
    loss = 'binary_crossentropy',
    optimizer = 'adam',
    metrics = ['accuracy'] # tracked metrics; the list can hold several metrics
)

model.summary()




In [30]:
model.fit(X_train_pad, y_train, epochs=2, batch_size=256)

Epoch 1/2
655/655 ━━━━━━━━━━━━━━━━━━━━ 1809s 3s/step - accuracy: 0.9329 - loss: 0.1840
Epoch 2/2
655/655 ━━━━━━━━━━━━━━━━━━━━ 1808s 3s/step - accuracy: 0.9536 - loss: 0.1187


<keras.src.callbacks.history.History at 0x7a396c997220>

In [31]:
predict = model.predict(X_test_pad)[:,0]

1747/1747 ━━━━━━━━━━━━━━━━━━━━ 1454s 832ms/step


In [32]:
# append LSTM auc score
auc_scores.append({'Model': 'LSTM', 'AUC_Score': round(roc_auc_score(y_test,predict),4)})
print(f'auc: {roc_auc_score(y_test,predict):.4f}')

auc: 0.9769


### Gated Recurrent Unit - GRU
1. Designed to mitigate the vanishing gradient problem, similar to LSTM.
2. Update gate: how much past information to pass along to the future.
3. Reset gate: how much past information to forget.
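
The two gates can be sketched as a single GRU time step in NumPy. This is a minimal illustration under stated assumptions, not the Keras implementation: the parameter names (`W_z`, `U_z`, `b_z`, and so on), the toy sizes and the random weights are hypothetical. The convention used is the one in which the update gate `z` weights the previous state.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, p):
    """One GRU time step; p holds hypothetical weight matrices and biases."""
    # Update gate: how much of the previous state to carry forward.
    z = sigmoid(p["W_z"] @ x + p["U_z"] @ h_prev + p["b_z"])
    # Reset gate: how much of the previous state feeds the candidate.
    r = sigmoid(p["W_r"] @ x + p["U_r"] @ h_prev + p["b_r"])
    # Candidate state built from the input and the reset-scaled past.
    h_cand = np.tanh(p["W_h"] @ x + p["U_h"] @ (r * h_prev) + p["b_h"])
    # Blend old state and candidate according to the update gate.
    return z * h_prev + (1.0 - z) * h_cand

rng = np.random.default_rng(0)
input_dim, units = 4, 3
params = {}
for gate in "zrh":
    params[f"W_{gate}"] = rng.normal(size=(units, input_dim))
    params[f"U_{gate}"] = rng.normal(size=(units, units))
    params[f"b_{gate}"] = np.zeros(units)

# Run the cell over a toy sequence of five time steps.
h = np.zeros(units)
for x_t in rng.normal(size=(5, input_dim)):
    h = gru_step(x_t, h, params)
print(h)
```

When `z` is close to 1 the previous state passes through almost unchanged, which is how information and gradients can survive across many time steps.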

In [33]:
# with strategy.scope():
    
# initialise a sequential model
model = Sequential()

# Add Embedding as the first layer
# Use embedding matrix as weights instead of training again
model.add(Embedding(
    len(word_index)+1,
    embedding_dim,
    weights = [embeddings_matrix],
    input_length = max_length,
    trainable = False # freeze the pre-trained embeddings so they are not updated during training
))

# Add GRU layers
model.add(GRU(
    embedding_dim,
    dropout = 0.3,
    # recurrent_dropout = 0.3 # disabled: the loss becomes NaN when this is active
))

# Add Dense layer
model.add(Dense(
    1, 
    activation = 'sigmoid'
))

# Compile model with loss function and optimiser
model.compile(
    loss = 'binary_crossentropy',
    optimizer = 'adam',
    metrics = ['accuracy'] # tracked metrics; the list can hold several metrics
)

model.summary()
    



In [34]:
model.fit(X_train_pad, y_train, epochs=2, batch_size=256)

Epoch 1/2
655/655 ━━━━━━━━━━━━━━━━━━━━ 2152s 3s/step - accuracy: 0.9318 - loss: 0.1764
Epoch 2/2
655/655 ━━━━━━━━━━━━━━━━━━━━ 2147s 3s/step - accuracy: 0.9562 - loss: 0.1097


<keras.src.callbacks.history.History at 0x7a3984ad5210>

In [35]:
# predict = model.predict_proba(X_test_pad)[:,0]
predict = model.predict(X_test_pad)[:,0]

1747/1747 ━━━━━━━━━━━━━━━━━━━━ 1604s 918ms/step


In [36]:
# append GRU auc score
auc_scores.append({'Model': 'GRU', 'AUC_Score':round(roc_auc_score(y_test,predict),4)})
print(f'auc: {roc_auc_score(y_test,predict):.4f}')

auc: 0.9786


### Bi-Directional RNN

A bidirectional RNN consists of two independent RNNs: the first reads the input sequence in normal time order, the second reads it in reverse time order.

The outputs of the two networks are concatenated or summed at each time step, depending on the merge option.

This gives the network both forward and backward information about the sequence at each time step.
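
The idea can be sketched in NumPy with two small tanh RNNs, one per direction. This is an illustrative simplification, not the Keras `Bidirectional` implementation: it uses plain RNN cells instead of LSTM cells, hypothetical random weights, and keeps only the final state of each direction. It assumes concatenation as the merge option, which is the Keras default.

```python
import numpy as np

def rnn_last_state(xs, W, U):
    """Run a plain tanh RNN over xs and return its final hidden state."""
    h = np.zeros(U.shape[0])
    for x_t in xs:
        h = np.tanh(W @ x_t + U @ h)
    return h

rng = np.random.default_rng(1)
steps, input_dim, units = 6, 4, 3
xs = rng.normal(size=(steps, input_dim))  # one toy input sequence

# Two independent sets of weights: one per direction.
W_f, U_f = rng.normal(size=(units, input_dim)), rng.normal(size=(units, units))
W_b, U_b = rng.normal(size=(units, input_dim)), rng.normal(size=(units, units))

h_forward = rnn_last_state(xs, W_f, U_f)         # reads t = 0 .. T-1
h_backward = rnn_last_state(xs[::-1], W_b, U_b)  # reads t = T-1 .. 0

# Concatenating doubles the feature size passed to the next layer.
merged = np.concatenate([h_forward, h_backward])
print(merged.shape)  # (2 * units,)
```

With concatenation, the layer that follows sees twice as many features as the number of units per direction.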

In [37]:
# with strategy.scope():
    
# initialise a sequential model
model = Sequential()

# Add Embedding as the first layer
# Use embedding matrix as weights instead of training again
model.add(Embedding(
    len(word_index)+1,
    embedding_dim, # must match the width of embeddings_matrix (300 for these GloVe vectors)
    weights = [embeddings_matrix],
    input_length = max_length,
    trainable = False # freeze the pre-trained embeddings so they are not updated during training
))

# Add Bidirectional RNN layers
model.add(Bidirectional(LSTM(
    embedding_dim, # number of LSTM units per direction; reuses embedding_dim but need not match it
    dropout = 0.3,
    recurrent_dropout = 0.3
)))

# Add Dense layer
model.add(Dense(
    1, 
    activation = 'sigmoid'
))

# Compile model with loss function and optimiser
model.compile(
    loss = 'binary_crossentropy',
    optimizer = 'adam',
    metrics = ['accuracy'] # tracked metrics; the list can hold several metrics
)

model.summary()



In [38]:
model.fit(X_train_pad, y_train, epochs=2, batch_size=128)

Epoch 1/2
1310/1310 ━━━━━━━━━━━━━━━━━━━━ 5221s 4s/step - accuracy: 0.9350 - loss: 0.1767
Epoch 2/2
1310/1310 ━━━━━━━━━━━━━━━━━━━━ 5238s 4s/step - accuracy: 0.9543 - loss: 0.1148


<keras.src.callbacks.history.History at 0x7a3987c64bb0>

In [39]:
predict = model.predict(X_test_pad)[:,0]

1930s 1s/step progress: 1747/1747 ━━━━━━━━━━━━━━━━━━━━ 1930s 1s/step


In [40]:
# append Bi-Directional auc score
auc_scores.append({'Model':'Bi-Directional', 'AUC_Score': round(roc_auc_score(y_test,predict),4)})
print(f'auc: {roc_auc_score(y_test,predict):.4f}')

auc: 0.9773


### Attention Models

### Relationship between BERT Tokens (max 512 tokens) and vector dimensions (768 vector dimensions):

For each input, the model's last hidden state is a tensor of up to 512 token positions, each holding a 768-dimensional contextual embedding. Mean pooling averages these token embeddings into a single 768-dimensional vector, which serves as the 'sentence vector' representing the entire input sequence.
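
Mean pooling and the CLS alternative can be compared on a toy tensor. The sketch below is a NumPy illustration with made-up sizes (4 tokens, 3 dimensions instead of 512 and 768). The function name `mean_pool` is hypothetical. Excluding padding positions from the average is an added assumption; the description above averages all tokens.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring padded positions."""
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Toy "last hidden state": batch of 1, 4 token positions, 3 dimensions.
hidden = np.arange(12, dtype=np.float64).reshape(1, 4, 3)
# The last position is padding and should not affect the average.
mask = np.array([[1, 1, 1, 0]])

sentence_vec = mean_pool(hidden, mask)  # average of the first three tokens
cls_vec = hidden[:, 0, :]               # vector at the first ([CLS]) position
print(sentence_vec)
print(cls_vec)
```

Both approaches reduce a (tokens, dimensions) matrix to one vector per input. The model further down uses the first-position vector.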

### Why only CLS token is required for classification, and not mean of other tokens:

https://datascience.stackexchange.com/questions/77044/bert-transformer-why-bert-transformer-uses-cls-token-for-classification-inst


The CLS token helps with the next sentence prediction (NSP) task on which BERT is trained (in addition to masked language modelling, MLM). The authors found it convenient to create a new hidden state at the start of a sentence, rather than taking the sentence average or other types of pooling.

In essence, the CLS token in the last layer attends to all of the tokens in the previous layer, so its hidden state can summarise the whole sequence.

### DistilBERT layer + Simple NN Implementation (Tensorflow)

1) BERT explained: https://jalammar.github.io/illustrated-bert/ 
2) BERT code: https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/ (PyTorch implementation)

In [41]:
# pretrained BERT models accept at most 512 tokens
max_len = 512

In [42]:
# re-import data
train_df = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train.csv')
validation_df = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/validation.csv')
test_df = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/test.csv')
sub_df = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/sample_submission.csv')

### Tokeniser

In [43]:
# load hugging face pretrained distilbert tokenizer
# Other pretrained models:
# - distilbert-base-uncased
# - distilbert-base-cased

tokenizer = transformers.DistilBertTokenizer.from_pretrained('distilbert-base-multilingual-cased')


In [44]:
# Preprocess data: tokenize and encode a Series of texts into input IDs and attention masks

def preprocess_data(texts, max_len=512):
    encodings = tokenizer(
        texts.astype(str).tolist(),
        truncation=True,
        padding='max_length',
        max_length=max_len,
        return_tensors='tf'
    )
    
    return encodings['input_ids'], encodings['attention_mask']

In [45]:
X_train_id, X_train_att_mask = preprocess_data(train_df['comment_text'], max_len)
X_valid_id, X_valid_att_mask = preprocess_data(validation_df['comment_text'], max_len)
X_test_id, X_test_att_mask = preprocess_data(test_df['content'], max_len)

y_train = train_df['toxic'].values
y_valid = validation_df['toxic'].values

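### Transformer model output

For a Hugging Face BERT model, the output contains:

1. sequence_output ( [0] ): Last hidden state of the sequence.
2. pooler_output ( [1] ): Pooler output (a transformed [CLS] token representation).
3. hidden_states ( [2] ): Intermediate hidden states (only when requested).
4. attentions ( [3] ): Attention weights (only when requested).

DistilBERT has no pooler, so TFDistilBertModel returns the last hidden state at [0] and no pooler_output. The [CLS] vector is taken from the last hidden state instead.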

Discussion on attention mask for BERT models:
https://ai.stackexchange.com/questions/28833/isnt-attention-mask-for-bert-model-useless

BERT was originally implemented as an encoder only. Hugging Face's BERT implementation can be configured to act as either an encoder or a decoder. For a transformer to act as a decoder, it requires:
1. a causal mask that hides future tokens. This is separate from the attention_mask used here, which only marks padding.
2. cross-attention over the supplied encoder representations.
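
The difference between a padding mask and a causal (future-token) mask can be shown on toy attention scores. This NumPy sketch is an illustration under stated assumptions, not the Hugging Face implementation: the helper `masked_attention`, the all-zero scores and the 4-token sequence are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def masked_attention(scores, allowed):
    """Softmax over keys, with disallowed positions pushed to about zero weight."""
    return softmax(np.where(allowed.astype(bool), scores, -1e9))

seq_len = 4
scores = np.zeros((seq_len, seq_len))  # equal raw scores, for readability

# Padding mask: 1 for real tokens, 0 for padding (last position is padded).
padding_mask = np.array([1, 1, 1, 0])
# Causal mask: query i may only attend to keys at positions <= i.
causal_mask = np.tril(np.ones((seq_len, seq_len)))

# Encoder-style attention: every query sees all non-padded keys.
enc_weights = masked_attention(scores, np.broadcast_to(padding_mask, scores.shape))
# Decoder-style attention: padding and future positions are both hidden.
dec_weights = masked_attention(scores, causal_mask * padding_mask)

print(enc_weights.round(3))
print(dec_weights.round(3))
```

In the encoder case every row spreads its weight over the three real tokens. In the decoder case the first row can only attend to itself.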

In [46]:
def transformer_model(transformer, max_len=512):

    # input data feed in to transformer
    input_ids = Input(
        shape=(max_len,),
        dtype=tf.int32,
        name='input_ids' # assign a name to input layer
    )

    # the attention mask marks padding positions so the model does not attend to them
    attention_mask = Input(
        shape=(max_len,),
        dtype=tf.int32,
        name = 'attention_mask'
    )

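    # Lambda layer = custom function layer
    # note: Keras does not track weights used inside a Lambda, so the transformer is likely not fine-tuned here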
    bert_output = Lambda(
        lambda x: transformer(input_ids = x[0], attention_mask= x[1]),
        output_shape = (None, max_len, transformer.config.hidden_size)
    )([input_ids, attention_mask])
    
    
    sequence_output = bert_output[0] # last hidden state

    cls_token = sequence_output[:,0,:] # all rows, first ([CLS]) position only, all hidden dimensions
    
    outputs = Dense(1, activation='sigmoid')(cls_token)

    model = Model(
        inputs=[input_ids, attention_mask], 
        outputs = outputs
    )

    model.compile(
        loss = 'binary_crossentropy',
        optimizer = 'adam',
        metrics = ['accuracy']
    )

    return model

### Instantiate DistilBERT model

In [47]:
transformer_layer = (
    transformers.TFDistilBertModel
    .from_pretrained('distilbert-base-multilingual-cased')
)

model = transformer_model(transformer_layer, max_len)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertModel: ['vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.bias']
- This IS expected if you are initializing TFDistilBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFDistilBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


In [48]:
model.summary()

### Train model

In [49]:
epochs = 2
batch_size = 128

model.fit(
    [X_train_id, X_train_att_mask],
    y_train,
    validation_data = ([X_valid_id, X_valid_att_mask], y_valid), 
    epochs= epochs,
    batch_size = batch_size
)

Epoch 1/2
1747/1747 ━━━━━━━━━━━━━━━━━━━━ 1678s 954ms/step - accuracy: 0.8951 - loss: 0.2652 - val_accuracy: 0.8489 - val_loss: 0.4003
Epoch 2/2
1747/1747 ━━━━━━━━━━━━━━━━━━━━ 1658s 949ms/step - accuracy: 0.9221 - loss: 0.2045 - val_accuracy: 0.8487 - val_loss: 0.4039


<keras.src.callbacks.history.History at 0x7a39a9e699f0>

### Predict presence of toxic comments

In [50]:
predict =  model.predict([X_valid_id, X_valid_att_mask], verbose=1)

W0000 00:00:1732442975.654846      70 assert_op.cc:38] Ignoring Assert operator functional_14_1/lambda_1/tf_distil_bert_model/distilbert/embeddings/assert_less/Assert/Assert


[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m60s[0m 229ms/step


In [51]:
# append DistilBERT auc score
auc_scores.append({'Model':'DistilBert', 'AUC_Score':round(roc_auc_score(y_valid,predict),4)})
print(f'auc: {roc_auc_score(y_valid,predict):.4f}')

auc: 0.6765


In [52]:
results = pd.DataFrame(auc_scores).sort_values(by='AUC_Score',ascending=False)
results

| | Model | AUC_Score |
|---|---|---|
| 2 | GRU | 0.9786 |
| 3 | Bi-Directional | 0.9773 |
| 1 | LSTM | 0.9769 |
| 0 | SimpleRnn | 0.8893 |
| 4 | DistilBert | 0.6765 |

Note: the DistilBERT score is computed on the multilingual validation set, whereas the RNN scores are computed on a 25% hold-out of the English training data, so the figures are not directly comparable.


Further work:
1. Sequence to Sequence models
2. PyTorch implementation of DistilBERT
3. Comparison to other models such as RoBERTa and DeBERTa