# HuggingFace BERT model finetuning with TensorFlow

Hi, everyone! This notebook is a BERT fine-tuning solution to the [Feedback Prize - English Language Learning competition](https://www.kaggle.com/competitions/feedback-prize-english-language-learning). It covers:

* Tokenizing the dataset for BERT with TensorFlow
* Fine-tuning the BERT model

The inference notebook is [here](https://www.kaggle.com/code/electro/fp3-bert-inference-tensorflow).
I also have a [basic EDA and bag-of-words solution](https://www.kaggle.com/code/electro/fp3-bag-of-words-tensorflow-starter). Please check it out if you are interested. 

If you find this notebook helpful, please upvote it. Thank you.

# Imports

In [1]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import tensorflow as tf
print(f'TF version: {tf.__version__}')
from tensorflow.keras import layers
import transformers

TF version: 2.6.4


In [2]:
def set_seed(seed=42):
    np.random.seed(seed)
    tf.random.set_seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
#     os.environ['TF_DETERMINISTIC_OPS'] = '1'
set_seed(42)

# Load DataFrame

In [3]:
df = pd.read_csv('../input/feedback-prize-english-language-learning/train.csv')
display(df.head())
print('\n---------DataFrame Summary---------')
df.info()

| | text_id | full_text | cohesion | syntax | vocabulary | phraseology | grammar | conventions |
|---|---|---|---|---|---|---|---|---|
| 0 | 0016926B079C | I think that students would benefit from learn... | 3.5 | 3.5 | 3.0 | 3.0 | 4.0 | 3.0 |
| 1 | 0022683E9EA5 | When a problem is a change you have to let it ... | 2.5 | 2.5 | 3.0 | 2.0 | 2.0 | 2.5 |
| 2 | 00299B378633 | Dear, Principal\n\nIf u change the school poli... | 3.0 | 3.5 | 3.0 | 3.0 | 3.0 | 2.5 |
| 3 | 003885A45F42 | The best time in life is when you become yours... | 4.5 | 4.5 | 4.5 | 4.5 | 4.0 | 5.0 |
| 4 | 0049B1DF5CCC | Small act of kindness can impact in other peop... | 2.5 | 3.0 | 3.0 | 3.0 | 2.5 | 2.5 |



---------DataFrame Summary---------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3911 entries, 0 to 3910
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   text_id      3911 non-null   object 
 1   full_text    3911 non-null   object 
 2   cohesion     3911 non-null   float64
 3   syntax       3911 non-null   float64
 4   vocabulary   3911 non-null   float64
 5   phraseology  3911 non-null   float64
 6   grammar      3911 non-null   float64
 7   conventions  3911 non-null   float64
dtypes: float64(6), object(2)
memory usage: 244.6+ KB


# Data Split

In [4]:
train_df, val_df = train_test_split(df, test_size=0.1, random_state=42)
print(f'Training examples: {len(train_df)}, validation examples: {len(val_df)}')

Training examples: 3519, validation examples: 392


# Config

In [5]:
TARGET_COLS = ['cohesion', 'syntax', 'vocabulary', 'phraseology', 'grammar', 'conventions']

MAX_LENGTH = 512
BATCH_SIZE = 8
BERT_MODEL = "bert-base-uncased"

# Define Data Generator

To use the HuggingFace BERT model, we have to tokenize the input texts in the format the pretrained BERT model expects.

In [6]:
#https://keras.io/examples/nlp/semantic_similarity_with_bert/
class BertDataGenerator(tf.keras.utils.Sequence):
    def __init__(
        self,
        full_texts,
        labels,
        batch_size=BATCH_SIZE,
        shuffle=True,
        include_targets=True,
    ):
        self.full_texts = full_texts
        self.labels = labels
        self.shuffle = shuffle
        self.batch_size = batch_size
        self.include_targets = include_targets
        self.tokenizer = transformers.BertTokenizer.from_pretrained(
            BERT_MODEL, do_lower_case=True
        )
        self.indexes = np.arange(len(self.full_texts))
        self.on_epoch_end()

    def __len__(self):
        return len(self.full_texts) // self.batch_size

    def __getitem__(self, idx):
        indexes = self.indexes[idx * self.batch_size : (idx + 1) * self.batch_size]
        batch_texts = self.full_texts[indexes]

        encoded = self.tokenizer.batch_encode_plus(
            batch_texts.tolist(),
            add_special_tokens=True,
            max_length=MAX_LENGTH,
            return_attention_mask=True,
            return_token_type_ids=True,
            return_tensors="tf",
            truncation=True,
            padding='max_length'
        )

        input_ids = np.array(encoded["input_ids"], dtype="int32")
        attention_masks = np.array(encoded["attention_mask"], dtype="int32")
        token_type_ids = np.array(encoded["token_type_ids"], dtype="int32")

        if self.include_targets:
            labels = np.array(self.labels[indexes], dtype="float32")
            return [input_ids, attention_masks, token_type_ids], labels
        else:
            return [input_ids, attention_masks, token_type_ids]

    def on_epoch_end(self):
        if self.shuffle:
            np.random.RandomState(42).shuffle(self.indexes)
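
Note that `__len__` uses floor division, so a final partial batch is left out of each epoch. The sketch below checks what that means for the split sizes printed earlier (3519 training and 392 validation examples) with a batch size of 8. The helper name `steps_per_epoch` is made up for this illustration and is not part of the notebook.

```python
def steps_per_epoch(n_samples, batch_size):
    """Number of full batches, mirroring `len(full_texts) // batch_size`."""
    return n_samples // batch_size


BATCH_SIZE = 8
for name, n_samples in [("train", 3519), ("validation", 392)]:
    steps = steps_per_epoch(n_samples, BATCH_SIZE)
    left_out = n_samples - steps * BATCH_SIZE
    print(f"{name}: {steps} steps per epoch, {left_out} samples left out")
```

The validation set divides evenly by 8, so no validation essays are skipped. On the training side a few samples are left out per epoch, which is usually harmless, but the same floor division would matter for a prediction generator whose size is not a multiple of the batch size.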

In [7]:
train_data = BertDataGenerator(
    train_df["full_text"].values.astype("str"),
    np.array(train_df[TARGET_COLS]),
    batch_size=BATCH_SIZE,
    shuffle=True,
)
valid_data = BertDataGenerator(
    val_df["full_text"].values.astype("str"),
    np.array(val_df[TARGET_COLS]),
    batch_size=BATCH_SIZE,
    shuffle=False,
)

Every input sample should include three tensors: input_ids, attention_mask, token_type_ids.
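
To make the relationship between the three tensors concrete, here is a toy sketch that builds them by hand for a single short sequence. It is not the real tokenizer: `encode_toy` is a hypothetical helper, and the three body ids are arbitrary values used only for illustration. The special ids 101 (`[CLS]`), 102 (`[SEP]`) and 0 (padding) are the ones that show up in the printed `input_ids` of this notebook.

```python
import numpy as np

CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids seen in the printed input_ids


def encode_toy(token_ids, max_length):
    """Wrap ids in [CLS] ... [SEP], then truncate or pad to max_length."""
    body = token_ids[: max_length - 2]           # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    n_real = len(ids)
    n_pad = max_length - n_real
    input_ids = ids + [PAD_ID] * n_pad
    attention_mask = [1] * n_real + [0] * n_pad  # 1 = real token, 0 = padding
    token_type_ids = [0] * max_length            # single-segment input: all zeros
    return (
        np.array(input_ids, dtype="int32"),
        np.array(attention_mask, dtype="int32"),
        np.array(token_type_ids, dtype="int32"),
    )


ids, mask, types = encode_toy([2070, 2111, 2003], max_length=8)
print(ids.tolist())    # [101, 2070, 2111, 2003, 102, 0, 0, 0]
print(mask.tolist())   # [1, 1, 1, 1, 1, 0, 0, 0]
print(types.tolist())  # [0, 0, 0, 0, 0, 0, 0, 0]
```

Essays longer than the limit are truncated, which is why some rows in the real output end in 102 with an attention mask of all ones.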

In [8]:
inputs, labels = next(iter(train_data))
print(f'input_ids:\n{inputs[0]} \n With shape {inputs[0].shape} and dtype {inputs[0].dtype}\n')
print(f'attention_mask:\n{inputs[1]} \n With shape {inputs[1].shape} and dtype {inputs[1].dtype}\n')
print(f'token_type_ids:\n{inputs[2]} \n With shape {inputs[2].shape} and dtype {inputs[2].dtype}\n')
print(f'Labels:\n{labels} \n With shape {labels.shape} and dtype {labels.dtype}')

input_ids:
[[  101 11640  1045 ...     0     0     0]
 [  101  2070  2111 ...  2003  2138   102]
 [  101  6203  1010 ...  2005  2635   102]
 ...
 [  101  4078  8490 ...     0     0     0]
 [  101  2429  2000 ...     0     0     0]
 [  101  2070  2493 ...     0     0     0]] 
 With shape (8, 512) and dtype int32

attention_mask:
[[1 1 1 ... 0 0 0]
 [1 1 1 ... 1 1 1]
 [1 1 1 ... 1 1 1]
 ...
 [1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]] 
 With shape (8, 512) and dtype int32

token_type_ids:
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]] 
 With shape (8, 512) and dtype int32

Labels:
[[2.  2.5 3.  3.  3.  2.5]
 [3.5 4.  3.5 3.  3.  4. ]
 [3.5 4.  4.  4.  3.5 3.5]
 [3.  2.5 3.5 3.5 3.  3.5]
 [3.5 2.  3.  2.  2.  3. ]
 [1.5 1.5 2.  1.5 2.  2. ]
 [2.5 3.  3.5 3.5 3.  3. ]
 [3.5 3.  3.5 3.5 3.5 3. ]] 
 With shape (8, 6) and dtype float32



# Model

Our model is the pretrained BERT model with a dense head connected to the last hidden state of the '[CLS]' token.
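
The head can be pictured with plain NumPy. The sketch below uses random numbers as a stand-in for BERT's output, so only the shapes are meaningful. It assumes a hidden size of 768, which is the value for `bert-base-uncased`, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, hidden_size, n_targets = 8, 512, 768, 6

# Stand-in for bert_output.last_hidden_state: one vector per token.
last_hidden_state = rng.normal(size=(batch_size, seq_len, hidden_size)).astype("float32")

# Position 0 of every sequence holds the [CLS] token.
cls_output = last_hidden_state[:, 0, :]

# A Dense(6) layer without activation is an affine map: x @ W + b.
W = rng.normal(scale=0.02, size=(hidden_size, n_targets)).astype("float32")
b = np.zeros(n_targets, dtype="float32")
scores = cls_output @ W + b

print(cls_output.shape)  # (8, 768)
print(scores.shape)      # (8, 6)
print(W.size + b.size)   # 4614
```

The head itself is tiny (4614 weights under these assumptions) compared with the roughly 109 million parameters of the BERT layer, which is why it can be trained quickly while BERT is frozen.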

In [9]:
def get_model():
    input_ids = tf.keras.layers.Input(
        shape=(MAX_LENGTH,), dtype=tf.int32, name="input_ids"
    )
    
    attention_masks = tf.keras.layers.Input(
        shape=(MAX_LENGTH,), dtype=tf.int32, name="attention_masks"
    )
    
    token_type_ids = tf.keras.layers.Input(
        shape=(MAX_LENGTH,), dtype=tf.int32, name="token_type_ids"
    )
   
    bert_model = transformers.TFBertModel.from_pretrained(BERT_MODEL)
    bert_model.trainable = False

    bert_output = bert_model.bert(
        input_ids, attention_mask=attention_masks, token_type_ids=token_type_ids
    )
    cls_output = bert_output.last_hidden_state[:, 0, :]
    output = layers.Dense(6)(cls_output)
    model = tf.keras.Model(inputs=[input_ids, attention_masks, token_type_ids], outputs=output)
    model.compile(
        optimizer=tf.optimizers.Adam(learning_rate=1e-3),
        loss='huber_loss',
        metrics=[tf.keras.metrics.RootMeanSquaredError()],
    )
    return model
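
The model is compiled with Huber loss but reports RMSE as its metric, so the `loss` and `val_loss` values are on a different scale from the RMSE column. Below is a minimal NumPy sketch of both quantities, assuming `delta=1.0` (to my knowledge the Keras default for Huber). The functions `huber` and `rmse` are written here for illustration and are not the Keras implementations.

```python
import numpy as np


def huber(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic for |error| <= delta, linear beyond it."""
    error = np.asarray(y_pred, dtype="float64") - np.asarray(y_true, dtype="float64")
    abs_error = np.abs(error)
    quadratic = 0.5 * error ** 2
    linear = delta * abs_error - 0.5 * delta ** 2
    return float(np.mean(np.where(abs_error <= delta, quadratic, linear)))


def rmse(y_true, y_pred):
    """Root mean squared error over all elements."""
    error = np.asarray(y_pred, dtype="float64") - np.asarray(y_true, dtype="float64")
    return float(np.sqrt(np.mean(error ** 2)))


y_true = [3.0, 3.5, 2.5, 4.0]
y_pred = [3.5, 3.5, 4.5, 4.0]  # errors: 0.5, 0.0, 2.0, 0.0
print(huber(y_true, y_pred))   # 0.40625
print(rmse(y_true, y_pred))
```

For errors smaller than `delta` the Huber term is half the squared error, while larger errors grow only linearly, which makes the loss less sensitive to a few badly predicted essays.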

In [10]:
tf.keras.backend.clear_session()
model = get_model()
model.summary()

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_ids (InputLayer)          [(None, 512)]        0                                            
__________________________________________________________________________________________________
attention_masks (InputLayer)    [(None, 512)]        0                                            
__________________________________________________________________________________________________
token_type_ids (InputLayer)     [(None, 512)]        0                                            
__________________________________________________________________________________________________
bert (TFBertMainLayer)          TFBaseModelOutputWit 109482240   input_ids[0][0]                  
                                                                 attention_masks[0][0]        

# Fine-tuning

We first freeze the BERT model and train only the dense layer for 1 epoch.

In [11]:
model.fit(train_data, validation_data=valid_data, epochs=1)

<keras.callbacks.History at 0x7f1e61bb1910>

Then we unfreeze the BERT model and train the whole model with a smaller learning rate.

In [12]:
for layer in model.layers:
    layer.trainable = True
    
model.compile(optimizer=tf.optimizers.Adam(learning_rate=1e-5),
              loss='huber_loss',
              metrics=[tf.keras.metrics.RootMeanSquaredError()],)

model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_ids (InputLayer)          [(None, 512)]        0                                            
__________________________________________________________________________________________________
attention_masks (InputLayer)    [(None, 512)]        0                                            
__________________________________________________________________________________________________
token_type_ids (InputLayer)     [(None, 512)]        0                                            
__________________________________________________________________________________________________
bert (TFBertMainLayer)          TFBaseModelOutputWit 109482240   input_ids[0][0]                  
                                                                 attention_masks[0][0]        

In [13]:
callbacks = [
    tf.keras.callbacks.ModelCheckpoint(
        "./bert-finetuning",
        monitor='val_loss',
        save_best_only=True,
        mode='min',
        verbose=1,
    ),
]
model.fit(
    train_data,
    validation_data=valid_data,
    epochs=5,
    callbacks=callbacks,
)

Epoch 1/5

Epoch 00001: val_loss improved from inf to 0.13772, saving model to ./bert-finetuning


Epoch 2/5

Epoch 00002: val_loss improved from 0.13772 to 0.12115, saving model to ./bert-finetuning
Epoch 3/5

Epoch 00003: val_loss improved from 0.12115 to 0.10895, saving model to ./bert-finetuning
Epoch 4/5

Epoch 00004: val_loss did not improve from 0.10895
Epoch 5/5

Epoch 00005: val_loss did not improve from 0.10895
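
The log shows the effect of `save_best_only=True` with `mode='min'`: the checkpoint is only overwritten when `val_loss` drops below the best value so far. A small pure-Python sketch of that rule follows. The function name is made up for this illustration, and the last two loss values are hypothetical placeholders, since the log only says those epochs did not improve.

```python
def best_only_saves(val_losses):
    """Return the 1-based epochs at which a 'min'-mode, best-only checkpoint would save."""
    best = float("inf")
    saved_epochs = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:  # strict improvement required
            best = loss
            saved_epochs.append(epoch)
    return saved_epochs, best


# Epochs 1-3 come from the log above; epochs 4-5 are hypothetical values
# chosen only to be worse than 0.10895.
val_losses = [0.13772, 0.12115, 0.10895, 0.11, 0.12]
print(best_only_saves(val_losses))  # ([1, 2, 3], 0.10895)
```

As a consequence, the files in `./bert-finetuning` should hold the epoch 3 weights, while the in-memory `model` holds the weights from the end of epoch 5.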


<keras.callbacks.History at 0x7f1dc480cb90>