# Code Explanation

This notebook imports various libraries and modules that are essential for data manipulation, visualization, machine learning, and deep learning tasks. Below is a brief explanation of each import:

1. **pandas**: A powerful data manipulation and analysis library.
2. **numpy**: A fundamental package for scientific computing with Python.
3. **matplotlib.pyplot**: A plotting library used for creating static, animated, and interactive visualizations.
4. **time**: A module that provides various time-related functions.
5. **imblearn.over_sampling.SMOTE**: A technique for handling imbalanced datasets by oversampling the minority class.
6. **tensorflow**: An open-source platform for machine learning and deep learning.
7. **transformers**: A library by Hugging Face that provides pre-trained models for natural language processing tasks.
   - **BertTokenizer, TFBertModel**: Tokenizer and model for BERT (Bidirectional Encoder Representations from Transformers).
   - **DistilBertTokenizer, TFDistilBertModel**: Tokenizer and model for DistilBERT, a smaller and faster version of BERT.
   - **TFAlbertModel**: Model for ALBERT (A Lite BERT), a lighter version of BERT.
8. **sklearn.model_selection.train_test_split**: A function for splitting data into training and testing sets.
9. **tensorflow.keras.callbacks.EarlyStopping**: A callback to stop training when a monitored metric has stopped improving.
10. **tensorflow.keras.models.Model**: The base class for creating a Keras model.
11. **tensorflow.keras.layers.Input, Dense, Concatenate, Dropout**: Various layers used in building neural networks.
12. **tensorflow.keras.optimizers.Adam**: An optimizer that implements the Adam algorithm.

The following cells will utilize these libraries and modules to perform data preprocessing, model building, training, and evaluation.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
from imblearn.over_sampling import SMOTE
import tensorflow as tf
from transformers import BertTokenizer,TFBertModel,DistilBertTokenizer,TFDistilBertModel,TFAlbertModel
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Concatenate,Dropout
from tensorflow.keras.optimizers import Adam

2024-06-28 15:06:10.535509: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-28 15:06:10.535649: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-28 15:06:10.825348: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


# Data Preprocessing

This section of the code performs data preprocessing on a dataset containing slurs. Below is a step-by-step explanation of the code:

1. **mapping_dict = {1:0, 2:1, 3:2}**: Creates a dictionary to map the values 1, 2, and 3 to 0, 1, and 2, respectively.

2. **slurs_df = pd.DataFrame(pd.read_csv('/kaggle/input/slurs-dataset/slurs3.csv'))**: Reads the CSV file containing the slurs dataset into a pandas DataFrame.

3. **slurs_df.drop(columns=['id', 'country', 'valid', 'subj_anger'], axis=1, inplace=True)**: Drops the columns 'id', 'country', 'valid', and 'subj_anger' from the DataFrame as they are not needed for further analysis.

4. **slurs_df = slurs_df[['text', 'condition', 'recalled', 'slur_source', 'slur_gender', 'f_pain', 'f_fear', 'f_panic', 'f_anger', 'f_guilt', 'f_humiliation']]**: Reorders the columns in the DataFrame to focus on the relevant features.

5. **slurs_df['slur_gender'].replace(mapping_dict, inplace=True)**: Replaces the values in the 'slur_gender' column using the mapping dictionary created earlier.

6. **slurs_df.dropna(inplace=True)**: Removes any rows with missing values from the DataFrame to ensure data quality.

7. **slurs_df.reset_index(inplace=True, drop=True)**: Resets the index of the DataFrame to ensure it is sequential after dropping rows.

The following cells will utilize this preprocessed data for further analysis and modeling.


In [2]:
mapping_dict = {1:0,2:1,3:2}
slurs_df = pd.DataFrame(pd.read_csv('/kaggle/input/slurs-dataset/slurs3.csv'))
slurs_df.drop(columns=['id','country','valid','subj_anger'],axis=1,inplace=True)
slurs_df = slurs_df[['text','condition','recalled','slur_source','slur_gender','f_pain','f_fear','f_panic','f_anger','f_guilt','f_humiliation']]
slurs_df['slur_gender'].replace(mapping_dict,inplace=True)
slurs_df.dropna(inplace=True)
slurs_df.reset_index(inplace=True,drop=True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  slurs_df['slur_gender'].replace(mapping_dict,inplace=True)


In [3]:
slurs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 435 entries, 0 to 434
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   text           435 non-null    object 
 1   condition      435 non-null    int64  
 2   recalled       435 non-null    int64  
 3   slur_source    435 non-null    float64
 4   slur_gender    435 non-null    float64
 5   f_pain         435 non-null    int64  
 6   f_fear         435 non-null    int64  
 7   f_panic        435 non-null    int64  
 8   f_anger        435 non-null    int64  
 9   f_guilt        435 non-null    int64  
 10  f_humiliation  435 non-null    int64  
dtypes: float64(2), int64(8), object(1)
memory usage: 37.5+ KB


In [4]:
slurs_df['slur_gender'].value_counts()

slur_gender
0.0    285
1.0    144
2.0      6
Name: count, dtype: int64

# Tokenization and Encoding

This section of the code defines a function to tokenize and encode sentences using the BERT tokenizer. Below is a step-by-step explanation of the code:

1. **tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')**: Loads the pre-trained BERT tokenizer from the Hugging Face library. The 'bert-base-uncased' model is used, which is a version of BERT that converts all text to lowercase.

2. **max_length = 128**: Sets the maximum length for tokenized sequences to 128 tokens.

3. **def tokenize_and_encode(sentences)**: Defines a function named `tokenize_and_encode` that takes a list of sentences as input.

4. **input_ids = []** and **attention_masks = []**: Initializes two empty lists to store the input IDs and attention masks for each sentence.

5. **for sent in sentences**: Iterates over each sentence in the input list.

6. **encoded_dict = tokenizer.encode_plus(...)**: Tokenizes and encodes each sentence using the `encode_plus` method of the tokenizer. This method:
   - Adds special tokens to the sentence.
   - Truncates the sentence to the maximum length if it exceeds 128 tokens.
   - Pads the sentence to the maximum length if it is shorter than 128 tokens.
   - Returns attention masks to indicate which tokens are actual tokens and which are padding.
   - Returns the results as tensors.

7. **input_ids.append(encoded_dict['input_ids'].numpy().flatten().tolist())**: Converts the input IDs tensor to a list and appends it to the `input_ids` list.

8. **attention_masks.append(encoded_dict['attention_mask'].numpy().flatten().tolist())**: Converts the attention mask tensor to a list and appends it to the `attention_masks` list.

9. **return input_ids, attention_masks**: Returns the lists of input IDs and attention masks for all sentences.

The following cells will use these tokenized and encoded sentences for further processing and model training.


In [5]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
max_length = 128

def tokenize_and_encode(sentences):
    
    input_ids = []
    attention_masks = []
    
    for sent in sentences : 
        
        encoded_dict = tokenizer.encode_plus ( 
        
        sent,
        add_special_tokens = True,
        max_length = max_length,
        padding = 'max_length',
        truncation = True,
        return_attention_mask = True,
        return_tensors = 'tf'
        )
        
        input_ids.append(encoded_dict['input_ids'].numpy().flatten().tolist())
        attention_masks.append(encoded_dict['attention_mask'].numpy().flatten().tolist())
        
    return input_ids,attention_masks    
    
    

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]



config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

# Data Splitting and Tokenization

This section of the code splits the data into training and testing sets, tokenizes the text data, and prepares it for model training. Below is a step-by-step explanation of the code:

1. **X = slurs_df['text'].values** and **Y = slurs_df['slur_gender'].values**: Extracts the 'text' and 'slur_gender' columns from the DataFrame as numpy arrays. `X` contains the text data, and `Y` contains the corresponding labels.

2. **X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)**: Splits the data into training and testing sets. 80% of the data is used for training, and 20% is used for testing. The `random_state` parameter ensures reproducibility.

3. **X_train_ids, X_train_attention = tokenize_and_encode(X_train)**: Tokenizes and encodes the training text data using the `tokenize_and_encode` function defined earlier. This function returns the input IDs and attention masks for the training data.

4. **X_train_ids = tf.concat(X_train_ids, axis=0)** and **X_train_attention = tf.concat(X_train_attention, axis=0)**: Concatenates the lists of input IDs and attention masks into single tensors for the training data.

5. **X_test_ids, X_test_attention = tokenize_and_encode(X_test)**: Tokenizes and encodes the testing text data using the `tokenize_and_encode` function. This function returns the input IDs and attention masks for the testing data.

6. **X_test_ids = tf.concat(X_test_ids, axis=0)** and **X_test_attention = tf.concat(X_test_attention, axis=0)**: Concatenates the lists of input IDs and attention masks into single tensors for the testing data.

7. **Y_train = tf.convert_to_tensor(Y_train, dtype=tf.int32)** and **Y_test = tf.convert_to_tensor(Y_test, dtype=tf.int32)**: Converts the training and testing labels to TensorFlow tensors of type `int32`.

8. **X_train_ids = tf.reshape(X_train_ids, (-1, max_length))** and **X_train_attention = tf.reshape(X_train_attention, (-1, max_length))**: Reshapes the training input IDs and attention masks to ensure they have the correct dimensions.

9. **X_test_ids = tf.reshape(X_test_ids, (-1, max_length))** and **X_test_attention = tf.reshape(X_test_attention, (-1, max_length))**: Reshapes the testing input IDs and attention masks to ensure they have the correct dimensions.

10. **final_X_train_id, final_X_train_attention, final_Y_train, final_Y_test, final_X_test_id, final_X_test_attention**: Assigns the reshaped tensors to final variables for use in model training and evaluation.

The following cells will use these prepared tensors for building and training the model.


In [6]:
X = slurs_df['text'].values
Y = slurs_df['slur_gender'].values


X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)


X_train_ids, X_train_attention = tokenize_and_encode(X_train)
X_train_ids = tf.concat(X_train_ids,axis = 0)
X_train_attention = tf.concat(X_train_attention,axis = 0)


X_test_ids, X_test_attention = tokenize_and_encode(X_test)
X_test_ids = tf.concat(X_test_ids,axis = 0)
X_test_attention = tf.concat(X_test_attention,axis = 0)

Y_train = tf.convert_to_tensor(Y_train, dtype=tf.int32)
Y_test =  tf.convert_to_tensor(Y_test, dtype=tf.int32)


X_train_ids = tf.reshape(X_train_ids, (-1, max_length))
X_train_attention = tf.reshape(X_train_attention, (-1, max_length))


X_test_ids = tf.reshape(X_test_ids, (-1, max_length))
X_test_attention = tf.reshape(X_test_attention, (-1, max_length))

final_X_train_id = X_train_ids
final_X_train_attention = X_train_attention
final_Y_train = Y_train
final_Y_test = Y_test
final_X_test_id = X_test_ids
final_X_test_attention = X_test_attention

# BERT Model and Pooler Output

This section of the code loads a pre-trained BERT model and extracts the pooler output for the training and testing data. Below is a step-by-step explanation of the code:

1. **BERT = TFBertModel.from_pretrained('bert-base-uncased')**: Loads the pre-trained BERT model from the Hugging Face library. The 'bert-base-uncased' model is used, which is a version of BERT that converts all text to lowercase.

2. **pooler_output_Train = BERT(final_X_train_id, attention_mask=final_X_train_attention)[1]**: Passes the training input IDs and attention masks through the BERT model to obtain the pooler output for the training data. The pooler output is the representation of the [CLS] token, which is typically used for classification tasks.

3. **pooler_output_Test = BERT(final_X_test_id, attention_mask=final_X_test_attention)[1]**: Passes the testing input IDs and attention masks through the BERT model to obtain the pooler output for the testing data.

The following cells will use these pooler outputs for further processing and model training.


In [7]:
BERT = TFBertModel.from_pretrained('bert-base-uncased')
pooler_output_Train = BERT(final_X_train_id, attention_mask=final_X_train_attention)[1]
pooler_output_Test = BERT(final_X_test_id, attention_mask=final_X_test_attention)[1]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

# Model Building, Training, and Evaluation

## Purpose
The purpose of this section is to build, train, and evaluate a neural network model using the pooler output from the BERT model. The goal is to classify the slur gender into one of three categories.

## Procedure
1. **Model Definition**: 
   - The model is defined using the Keras functional API. The input layer takes the pooler output from the BERT model, which has a shape of 768.
   - Two hidden layers are added with ReLU activation functions and dropout layers to prevent overfitting.
   - The output layer uses a softmax activation function to classify the input into one of three categories.

2. **Early Stopping**: 
   - Early stopping is implemented to monitor the validation loss. If the validation loss does not improve for three consecutive epochs, training is stopped, and the best weights are restored.

3. **Model Compilation**: 
   - The model is compiled with the Adam optimizer, sparse categorical cross-entropy loss, and accuracy as the evaluation metric.

4. **Model Training**: 
   - The model is trained on the training data for up to 50 epochs with a batch size of 32. The validation data is used to evaluate the model during training, and early stopping is applied to prevent overfitting.

5. **Processing Time**: 
   - The total processing time for model training is measured and printed.


In [8]:
start_time = time.time()
pooled_output = Input(shape=(768,),dtype = 'float32', name = 'pooled_output')

hidden_layer_1 = Dense(1024,activation = 'relu')(pooled_output)
Dropout_1 = Dropout(0.1)(hidden_layer_1)


hidden_layer_2 = Dense(256,activation = 'relu')(Dropout_1)
Dropout_2 = Dropout(0.1)(hidden_layer_2)


output_layer = Dense(3,activation = 'softmax')(Dropout_2)

Pooled_Output_Model = Model(inputs = pooled_output,outputs = output_layer)

early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

Pooled_Output_Model.compile(optimizer = Adam(learning_rate=2e-4),
                 loss = tf.keras.losses.SparseCategoricalCrossentropy(),
                 metrics = ['accuracy'])

Pooled_Output_Model.fit(pooler_output_Train,
             final_Y_train,
             epochs = 50,
             batch_size = 32,
             validation_data=(pooler_output_Test,final_Y_test),
             callbacks=[early_stopping]
             )
end_time = time.time()
processing_time = end_time - start_time 
print(processing_time)   

Epoch 1/50
[1m 1/11[0m [32m━[0m[37m━━━━━━━━━━━━━━━━━━━[0m [1m51s[0m 5s/step - accuracy: 0.4062 - loss: 0.9288

I0000 00:00:1719587215.348061      74 device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
W0000 00:00:1719587215.363044      74 graph_launch.cc:671] Fallback to op-by-op mode because memset node breaks graph update


[1m11/11[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 423ms/step - accuracy: 0.5571 - loss: 0.8667

W0000 00:00:1719587219.596237      74 graph_launch.cc:671] Fallback to op-by-op mode because memset node breaks graph update
W0000 00:00:1719587220.054964      74 graph_launch.cc:671] Fallback to op-by-op mode because memset node breaks graph update


[1m11/11[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m10s[0m 511ms/step - accuracy: 0.5566 - loss: 0.8654 - val_accuracy: 0.6322 - val_loss: 0.7116
Epoch 2/50
[1m11/11[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step - accuracy: 0.6602 - loss: 0.7207 - val_accuracy: 0.6322 - val_loss: 0.6845
Epoch 3/50
[1m11/11[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step - accuracy: 0.6731 - loss: 0.6330 - val_accuracy: 0.6207 - val_loss: 0.6466
Epoch 4/50


W0000 00:00:1719587220.444740      72 graph_launch.cc:671] Fallback to op-by-op mode because memset node breaks graph update


[1m11/11[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step - accuracy: 0.6516 - loss: 0.6485 - val_accuracy: 0.7356 - val_loss: 0.6364
Epoch 5/50
[1m11/11[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step - accuracy: 0.7003 - loss: 0.6307 - val_accuracy: 0.7356 - val_loss: 0.6112
Epoch 6/50
[1m11/11[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step - accuracy: 0.7229 - loss: 0.5960 - val_accuracy: 0.6667 - val_loss: 0.5991
Epoch 7/50
[1m11/11[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step - accuracy: 0.7159 - loss: 0.5826 - val_accuracy: 0.6437 - val_loss: 0.6224
Epoch 8/50
[1m11/11[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 6ms/step - accuracy: 0.7400 - loss: 0.5545 - val_accuracy: 0.8276 - val_loss: 0.5601
Epoch 9/50
[1m11/11[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step - accuracy: 0.7623 - loss: 0.5107 - val_accuracy: 0.7586 - val_loss: 0.5284
Epoch 10/50
[1m11/11[0m [32m━━━━━━━━━━━━━━━━━━━━

# Multi-Input Model Building, Training, and Evaluation

## Purpose
The purpose of this section is to build, train, and evaluate a neural network model that integrates multiple inputs: the pooler output from the BERT model, input IDs, and attention masks. This approach aims to leverage different aspects of the input data to improve classification performance.

## Procedure
1. **Model Definition**: 
   - The model is designed using the Keras functional API to accept three distinct inputs: the pooler output, input IDs, and attention masks.
   - Each input is processed through its own series of hidden layers with ReLU activation functions to capture unique features.
   - The outputs of these hidden layers are concatenated to form a combined feature representation.
   - Additional hidden layers are applied to the concatenated features to further refine the representation.
   - The final output layer uses a softmax activation function to classify the input into one of three categories.

2. **Early Stopping**: 
   - Early stopping is implemented to monitor the validation loss. If the validation loss does not improve for three consecutive epochs, training is halted, and the best weights are restored to prevent overfitting.

3. **Model Compilation**: 
   - The model is compiled with the Adam optimizer, sparse categorical cross-entropy loss, and accuracy as the evaluation metric.

4. **Model Training**: 
   - The model is trained on the training data for up to 50 epochs with a batch size of 32. The validation data is used to evaluate the model during training, and early stopping is applied to prevent overfitting.

5. **Processing Time**: 
   - The total processing time for model training is measured and printed to provide insight into the computational efficiency of the training process.


## Comparison with Previous Approach
- **Complexity**: The multi-input model is more complex as it processes three different inputs separately before combining them. This allows for a more nuanced understanding of the data.
- **Feature Extraction**: By processing each input through its own series of hidden layers, the model can extract unique features from each input type, potentially leading to better performance.
- **Flexibility**: This approach is more flexible as it can be adapted to include additional inputs or different types of data.
- **Training Time**: The increased complexity may lead to longer training times, but the early stopping mechanism helps mitigate this by halting training when no further improvement is observed.

Overall, the multi-input model aims to leverage the strengths of different input types to improve classification performance, while the previous single-input model focused on simplicity and efficiency.


In [9]:
start_time = time.time()
pooled_output = Input(shape=(768,),dtype = 'float32', name = 'pooled_output')
input_ids = Input(shape=(128,),dtype='int32',name='input-ids')
attention_mask = Input(shape=(128,),dtype='int32',name='attention_mask')

hidden_layer_1_pooled = Dense(2048,activation = 'relu')(pooled_output)
hidden_layer_2_pooled = Dense(1024,activation = 'relu')(hidden_layer_1_pooled)
hidden_layer_3_pooled = Dense(512,activation = 'relu')(hidden_layer_2_pooled)


hidden_layer_1_ids = Dense(2048,activation = 'relu')(input_ids)
hidden_layer_2_ids = Dense(1024,activation = 'relu')(hidden_layer_1_ids)
hidden_layer_3_ids = Dense(512,activation = 'relu')(hidden_layer_2_ids)


hidden_layer_1_attention = Dense(2048,activation = 'relu')(attention_mask)
hidden_layer_2_attention = Dense(1024,activation = 'relu')(hidden_layer_1_attention)
hidden_layer_3_attention = Dense(512,activation = 'relu')(hidden_layer_2_attention)

concatenate_layer = tf.keras.layers.concatenate([hidden_layer_3_pooled,hidden_layer_3_ids,hidden_layer_3_attention],axis = -1)


hidden_layer_1_concatenate = Dense(2048,activation = 'relu')(concatenate_layer)
hidden_layer_2_concatenate = Dense(1024,activation = 'relu')(hidden_layer_1_concatenate)
hidden_layer_3_concatenate = Dense(512,activation = 'relu')(hidden_layer_2_attention)


output_layer = Dense(3,activation = 'softmax')(hidden_layer_3_concatenate)

Multiple_Input_Model = Model(inputs = [pooled_output,input_ids,attention_mask],outputs = output_layer)

early_stopping = EarlyStopping(monitor='val_loss', patience=3,restore_best_weights=True)

Multiple_Input_Model.compile(optimizer = Adam(learning_rate=2e-4),
                 loss = tf.keras.losses.SparseCategoricalCrossentropy(),
                 metrics = ['accuracy'])

Multiple_Input_Model.fit([pooler_output_Train,final_X_train_id,final_X_train_attention],
             final_Y_train,
             epochs = 50,
             batch_size = 32,
             validation_data=([pooler_output_Test,final_X_test_id,final_X_test_attention],final_Y_test),
             callbacks=[early_stopping])

end_time = time.time()
processing_time = end_time - start_time
print(processing_time)


Epoch 1/50
[1m 1/11[0m [32m━[0m[37m━━━━━━━━━━━━━━━━━━━[0m [1m1:10[0m 7s/step - accuracy: 0.0938 - loss: 1.1267

W0000 00:00:1719587229.339893      72 graph_launch.cc:671] Fallback to op-by-op mode because memset node breaks graph update


[1m11/11[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 786ms/step - accuracy: 0.5038 - loss: 0.8465

W0000 00:00:1719587237.201512      75 graph_launch.cc:671] Fallback to op-by-op mode because memset node breaks graph update
W0000 00:00:1719587238.199894      75 graph_launch.cc:671] Fallback to op-by-op mode because memset node breaks graph update


[1m11/11[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m19s[0m 1s/step - accuracy: 0.5125 - loss: 0.8410 - val_accuracy: 0.6322 - val_loss: 0.7355
Epoch 2/50
[1m11/11[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step - accuracy: 0.6913 - loss: 0.6406 - val_accuracy: 0.6322 - val_loss: 0.7292
Epoch 3/50
[1m 1/11[0m [32m━[0m[37m━━━━━━━━━━━━━━━━━━━[0m [1m0s[0m 24ms/step - accuracy: 0.6562 - loss: 0.9081

W0000 00:00:1719587240.968589      74 graph_launch.cc:671] Fallback to op-by-op mode because memset node breaks graph update


[1m11/11[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step - accuracy: 0.6467 - loss: 0.7744 - val_accuracy: 0.6322 - val_loss: 0.7322
Epoch 4/50
[1m11/11[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 6ms/step - accuracy: 0.6238 - loss: 0.7518 - val_accuracy: 0.6322 - val_loss: 0.7361
Epoch 5/50
[1m11/11[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 6ms/step - accuracy: 0.6980 - loss: 0.6767 - val_accuracy: 0.5977 - val_loss: 0.7350
19.63499355316162


# BERT-Based Model Building, Training, and Evaluation

## Purpose
The purpose of this section is to build, train, and evaluate a neural network model that directly integrates the BERT model as a custom layer. This approach aims to leverage the powerful contextual embeddings provided by BERT for the classification task.

## Procedure
1. **Custom BERT Layer**: 
   - A custom layer is created to load the pre-trained BERT model and extract the pooler output. This allows the model to directly utilize BERT's contextual embeddings.
   
2. **Model Definition**: 
   - The model is designed using the Keras functional API to accept two inputs: input IDs and attention masks.
   - The custom BERT layer processes these inputs to generate the pooler output.
   - The pooler output is then passed through a dense layer with a softmax activation function to classify the input into one of three categories.

3. **Early Stopping**: 
   - Early stopping is implemented to monitor the validation loss. If the validation loss does not improve for three consecutive epochs, training is halted, and the best weights are restored to prevent overfitting.

4. **Model Compilation**: 
   - The model is compiled with the Adam optimizer, sparse categorical cross-entropy loss, and accuracy as the evaluation metric.

5. **Model Training**: 
   - The model is trained on the training data for up to 50 epochs with a batch size of 32. The validation data is used to evaluate the model during training, and early stopping is applied to prevent overfitting.

6. **Processing Time**: 
   - The total processing time for model training is measured and printed to provide insight into the computational efficiency of the training process.


## Comparison with Previous Approaches
- **Direct BERT Integration**: Unlike the previous approaches that used the pooler output from BERT as a separate input, this approach directly integrates BERT as a custom layer. This allows for a more seamless and efficient use of BERT's contextual embeddings.
- **Model Complexity**: This approach simplifies the model architecture by reducing the need for additional hidden layers to process the pooler output. The custom BERT layer handles the heavy lifting of feature extraction.
- **Performance**: By directly using BERT's contextual embeddings, this approach may achieve better performance in capturing the nuances of the input text. The custom BERT layer ensures that the model fully leverages BERT's capabilities.
- **Flexibility**: This approach is highly flexible as it can be easily adapted to use different pre-trained models from the Hugging Face library. The custom layer can be modified to integrate other models like DistilBERT or ALBERT.
- **Training Time**: The direct integration with BERT may lead to longer training times due to the complexity of the BERT model. However, the early stopping mechanism helps mitigate this by halting training when no further improvement is observed.

Overall, this BERT-based model aims to leverage the strengths of BERT's contextual embeddings to improve classification performance, while the previous approaches focused on processing the pooler output or multiple inputs separately.


In [10]:
start_time = time.time()
input_ids = Input(shape=(128,),dtype='int32',name='input-ids')
attention_mask = Input(shape=(128,),dtype='int32',name='attention-mask')

class bert_layer(tf.keras.layers.Layer):
    
    def __init__(self, **kwargs):
        
        super(bert_layer, self).__init__(**kwargs)
        
        self.bert = TFBertModel.from_pretrained('bert-base-uncased')

    def call(self, inputs):
        
        input_ids, attention_mask = inputs
        
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        
        return outputs[1]
        
BERT_layer = bert_layer()
pooler_output = BERT_layer([input_ids,attention_mask])


output_layer = Dense(3,activation = 'softmax')(pooler_output)

Bert_Input_Model = Model(inputs = [input_ids,attention_mask],outputs = output_layer)

Bert_Input_Model.compile(optimizer = Adam(learning_rate=1e-3),
                   loss = tf.keras.losses.SparseCategoricalCrossentropy(),
                   metrics = ['accuracy'])

early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

Bert_Input_Model.fit([final_X_train_id,final_X_train_attention],
               final_Y_train,
               epochs = 50,
               batch_size = 32,
               validation_data=([final_X_test_id,final_X_test_attention],final_Y_test),
               callbacks = [early_stopping] 
               )
end_time = time.time()
processing_time = end_time - start_time
print(processing_time)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

Epoch 1/50


W0000 00:00:1719587284.580726      73 graph_launch.cc:671] Fallback to op-by-op mode because memset node breaks graph update


[1m10/11[0m [32m━━━━━━━━━━━━━━━━━━[0m[37m━━[0m [1m0s[0m 202ms/step - accuracy: 0.3483 - loss: 1.2562

W0000 00:00:1719587291.412242      74 graph_launch.cc:671] Fallback to op-by-op mode because memset node breaks graph update


[1m11/11[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 683ms/step - accuracy: 0.3639 - loss: 1.2367

W0000 00:00:1719587297.862109      74 graph_launch.cc:671] Fallback to op-by-op mode because memset node breaks graph update


[1m11/11[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m36s[0m 2s/step - accuracy: 0.3769 - loss: 1.2204 - val_accuracy: 0.6322 - val_loss: 0.7898
Epoch 2/50
[1m11/11[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 262ms/step - accuracy: 0.5668 - loss: 0.7863 - val_accuracy: 0.6207 - val_loss: 0.6981
Epoch 3/50
[1m11/11[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 267ms/step - accuracy: 0.6863 - loss: 0.7074 - val_accuracy: 0.6322 - val_loss: 0.6981
Epoch 4/50
[1m11/11[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 268ms/step - accuracy: 0.6759 - loss: 0.7430 - val_accuracy: 0.5977 - val_loss: 0.6787
Epoch 5/50
[1m11/11[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 271ms/step - accuracy: 0.7050 - loss: 0.6844 - val_accuracy: 0.6092 - val_loss: 0.6705
Epoch 6/50
[1m11/11[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 274ms/step - accuracy: 0.6863 - loss: 0.6800 - val_accuracy: 0.6092 - val_loss: 0.6657
Epoch 7/50
[1m11/11[0m [32m━━━━━━━━━━━

# DistilBERT-Based Model Building, Training, and Evaluation


## Comparison with Previous Approaches
- **Efficiency**: DistilBERT is a smaller and faster version of BERT, making this approach more efficient in terms of computational resources and training time. This is particularly beneficial when working with large datasets or limited computational power.
- **Direct Integration**: Similar to the previous BERT-based approach, this method directly integrates DistilBERT as a custom layer, allowing for seamless use of its contextual embeddings.
- **Model Complexity**: This approach simplifies the model architecture by reducing the need for additional hidden layers to process the [CLS] token's output. The custom DistilBERT layer handles the feature extraction.
- **Performance**: While DistilBERT is more efficient, it may slightly compromise on performance compared to the full BERT model. However, it still provides strong contextual embeddings that can significantly improve classification tasks.
- **Flexibility**: This approach is highly flexible as it can be easily adapted to use different pre-trained models from the Hugging Face library. The custom layer can be modified to integrate other models like BERT or ALBERT.
- **Training Time**: The use of DistilBERT reduces training time due to its smaller size and faster processing capabilities. The early stopping mechanism further helps by halting training when no further improvement is observed.

Overall, this DistilBERT-based model aims to balance efficiency and performance by leveraging the strengths of DistilBERT's contextual embeddings, while the previous approaches focused on using the full BERT model or processing multiple inputs separately.


In [11]:
start_time = time.time()
input_ids = Input(shape=(128,),dtype='int32',name='input-ids')
attention_mask = Input(shape=(128,),dtype='int32',name='attention-mask')

class bert_layer(tf.keras.layers.Layer):
    
    def __init__(self, **kwargs):
        
        super(bert_layer, self).__init__(**kwargs)
        
        self.bert = TFDistilBertModel.from_pretrained('distilbert-base-uncased')

    def call(self, inputs):
        
        input_ids, attention_mask = inputs
        
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        
        return outputs.last_hidden_state[:,0]
        
BERT_layer = bert_layer()
CLS_output = BERT_layer([input_ids,attention_mask])

output_layer = Dense(3,activation = 'softmax')(CLS_output)

Bert_Input_Model = Model(inputs = [input_ids,attention_mask],outputs = output_layer)

Bert_Input_Model.compile(optimizer = Adam(learning_rate=1e-3),
                   loss = tf.keras.losses.SparseCategoricalCrossentropy(),
                   metrics = ['accuracy'])

early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

Bert_Input_Model.fit([final_X_train_id,final_X_train_attention],
               final_Y_train,
               epochs = 50,
               batch_size = 8,
               validation_data=([final_X_test_id,final_X_test_attention],final_Y_test),
               callbacks = [early_stopping] 
               )
end_time = time.time()
processing_time = end_time - start_time
print(processing_time)
        

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertModel: ['vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing TFDistilBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFDistilBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


Epoch 1/50
[1m 7/44[0m [32m━━━[0m[37m━━━━━━━━━━━━━━━━━[0m [1m1s[0m 29ms/step - accuracy: 0.1804 - loss: 1.2401

W0000 00:00:1719587438.177906      74 graph_launch.cc:671] Fallback to op-by-op mode because memset node breaks graph update


[1m44/44[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 92ms/step - accuracy: 0.5054 - loss: 0.9231

W0000 00:00:1719587442.142526      74 graph_launch.cc:671] Fallback to op-by-op mode because memset node breaks graph update
W0000 00:00:1719587445.527242      73 graph_launch.cc:671] Fallback to op-by-op mode because memset node breaks graph update


[1m44/44[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m19s[0m 221ms/step - accuracy: 0.5079 - loss: 0.9195 - val_accuracy: 0.6552 - val_loss: 0.6593
Epoch 2/50
[1m 5/44[0m [32m━━[0m[37m━━━━━━━━━━━━━━━━━━[0m [1m1s[0m 28ms/step - accuracy: 0.6833 - loss: 0.5704

W0000 00:00:1719587447.682314      74 graph_launch.cc:671] Fallback to op-by-op mode because memset node breaks graph update


[1m44/44[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 37ms/step - accuracy: 0.6919 - loss: 0.6240 - val_accuracy: 0.6437 - val_loss: 0.6176
Epoch 3/50
[1m44/44[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 37ms/step - accuracy: 0.7537 - loss: 0.5752 - val_accuracy: 0.7356 - val_loss: 0.5729
Epoch 4/50
[1m44/44[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 37ms/step - accuracy: 0.7901 - loss: 0.5074 - val_accuracy: 0.8046 - val_loss: 0.5413
Epoch 5/50
[1m44/44[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 37ms/step - accuracy: 0.8332 - loss: 0.4679 - val_accuracy: 0.8046 - val_loss: 0.5242
Epoch 6/50
[1m44/44[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 37ms/step - accuracy: 0.8674 - loss: 0.4474 - val_accuracy: 0.7931 - val_loss: 0.5021
Epoch 7/50
[1m44/44[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 37ms/step - accuracy: 0.9125 - loss: 0.3637 - val_accuracy: 0.7931 - val_loss: 0.4854
Epoch 8/50
[1m44/44[0m [32m━━━━━━━━━━━━━━━

# ALBERT-Based Model Building, Training, and Evaluation

## Comparison with Previous Approaches
- **Efficiency**: ALBERT is designed to be a lighter and more efficient version of BERT, making this approach more efficient in terms of computational resources and training time. This is particularly beneficial when working with large datasets or limited computational power.
- **Direct Integration**: Similar to the previous BERT and DistilBERT-based approaches, this method directly integrates ALBERT as a custom layer, allowing for seamless use of its contextual embeddings.
- **Model Complexity**: This approach simplifies the model architecture by reducing the need for additional hidden layers to process the pooler output. The custom ALBERT layer handles the feature extraction.
- **Performance**: ALBERT's efficient architecture allows it to maintain strong performance while being more resource-efficient. This can lead to faster training times without significantly compromising on accuracy.
- **Flexibility**: This approach is highly flexible as it can be easily adapted to use different pre-trained models from the Hugging Face library. The custom layer can be modified to integrate other models like BERT or DistilBERT.
- **Training Time**: The use of ALBERT reduces training time due to its smaller size and faster processing capabilities. The early stopping mechanism further helps by halting training when no further improvement is observed.

Overall, this ALBERT-based model aims to balance efficiency and performance by leveraging the strengths of ALBERT's contextual embeddings, while the previous approaches focused on using the full BERT model or processing multiple inputs separately.


In [12]:
input_ids = Input(shape=(128,),dtype='int32',name='input-ids')
attention_mask = Input(shape=(128,),dtype='int32',name='attention-mask')

class bert_layer(tf.keras.layers.Layer):
    
    def __init__(self, **kwargs):
        
        super(bert_layer, self).__init__(**kwargs)
        
        self.bert = TFAlbertModel.from_pretrained('albert-base-v2')

    def call(self, inputs):
        
        input_ids, attention_mask = inputs
        
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        
        return outputs[1]
        
BERT_layer = bert_layer()
pooler_output = BERT_layer([input_ids,attention_mask])

output_layer = Dense(3,activation = 'softmax')(pooler_output)

Bert_Input_Model = Model(inputs = [input_ids,attention_mask],outputs = output_layer)

Bert_Input_Model.compile(optimizer = Adam(learning_rate=1e-3),
                   loss = tf.keras.losses.SparseCategoricalCrossentropy(),
                   metrics = ['accuracy'])

early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

Bert_Input_Model.fit([final_X_train_id,final_X_train_attention],
               final_Y_train,
               epochs = 50,
               batch_size = 16,
               validation_data=([final_X_test_id,final_X_test_attention],final_Y_test),
               callbacks = [early_stopping] 
               )


        

config.json:   0%|          | 0.00/684 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/47.4M [00:00<?, ?B/s]

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFAlbertModel: ['predictions.decoder.bias', 'predictions.dense.weight', 'predictions.bias', 'predictions.LayerNorm.weight', 'predictions.dense.bias', 'predictions.LayerNorm.bias']
- This IS expected if you are initializing TFAlbertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFAlbertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFAlbertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFAlbertModel for predictions without further training.


Epoch 1/50
[1m 1/22[0m [37m━━━━━━━━━━━━━━━━━━━━[0m [1m6:23[0m 18s/step - accuracy: 0.6875 - loss: 1.0773

W0000 00:00:1719587515.995907      74 graph_launch.cc:671] Fallback to op-by-op mode because memset node breaks graph update


[1m22/22[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 325ms/step - accuracy: 0.6460 - loss: 0.8306

W0000 00:00:1719587522.861265      73 graph_launch.cc:671] Fallback to op-by-op mode because memset node breaks graph update
W0000 00:00:1719587528.908115      75 graph_launch.cc:671] Fallback to op-by-op mode because memset node breaks graph update


[1m22/22[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m34s[0m 761ms/step - accuracy: 0.6427 - loss: 0.8298 - val_accuracy: 0.6322 - val_loss: 0.7188
Epoch 2/50


W0000 00:00:1719587532.049540      75 graph_launch.cc:671] Fallback to op-by-op mode because memset node breaks graph update


[1m22/22[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 135ms/step - accuracy: 0.6674 - loss: 0.7337 - val_accuracy: 0.6322 - val_loss: 0.7221
Epoch 3/50
[1m22/22[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 138ms/step - accuracy: 0.6099 - loss: 0.7981 - val_accuracy: 0.6322 - val_loss: 0.7292
Epoch 4/50
[1m22/22[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 139ms/step - accuracy: 0.6652 - loss: 0.7005 - val_accuracy: 0.6322 - val_loss: 0.7190


<keras.src.callbacks.history.History at 0x7b7130322560>