# BERT for classification

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.21.3-py3-none-any.whl (4.7 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.9.1 tokenizers-0.12.1 transformers-4.21.3


## Pre-processing data

In [3]:
import pandas as pd
import re
import numpy as np
from tensorflow.keras.utils import to_categorical
from transformers import BertTokenizer
from sklearn.model_selection import train_test_split
import tensorflow as tf
from transformers import TFBertModel, TFBertForPreTraining

In [4]:
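def clean(data):
  """Remove punctuation from each whitespace-separated token and lowercase the text."""
  tokens = data.split()
  # build a table that deletes the listed punctuation characters
  translation_table = str.maketrans('', '', "\"#$%&'()*+-/:;<=>@[\\]^_`{|}~?!.,")
  tokens = [w.translate(translation_table) for w in tokens]
  tokens = [word.lower() for word in tokens]
  return ' '.join(tokens)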
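
As a quick check of what `clean` does, the sketch below repeats the same logic in compact form and runs it on two messages. Both messages are invented for illustration and are not taken from the dataset. Note that a token made only of punctuation becomes an empty string, which leaves a double space in the result.

```python
def clean(text):
    """Strip punctuation from each whitespace-separated token and lowercase it."""
    table = str.maketrans('', '', "\"#$%&'()*+-/:;<=>@[\\]^_`{|}~?!.,")
    return ' '.join(w.translate(table).lower() for w in text.split())

# Hypothetical messages, invented for illustration only.
print(clean("Free entry!! Call NOW: 0800-123"))  # free entry call now 0800123
print(repr(clean("ok ... see you")))             # 'ok  see you' (note the double space)
```

If the double space matters downstream, filtering out empty tokens before joining would remove it; the BERT tokenizer used later splits on whitespace anyway, so it should be harmless here.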

In [5]:
# read the SMS Spam Collection file (one "label<TAB>message" pair per line)
with open('/content/drive/MyDrive/SMSSpamCollection.txt') as f:
    lines = [line.rstrip() for line in f]

In [6]:
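data = []
labels = []

# each line is "<label>\t<message>", where the label is 'ham' or 'spam'
for line in lines:
  tmp = line.split('\t')
  if tmp[0] == 'ham':
    labels.append(0)
  elif tmp[0] == 'spam':
    labels.append(1)
  else:
    continue  # skip unlabelled lines so data and labels stay aligned
  data.append(clean(tmp[1]))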
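
The parsing step can be sketched on its own as a small function. This is a minimal illustration, not the notebook's code: `parse_lines`, `LABEL_TO_ID`, and the sample lines are all hypothetical names and data introduced here. It shows how skipping malformed rows keeps the message list and the label list the same length.

```python
# Hypothetical helper names; the label values match the notebook (ham=0, spam=1).
LABEL_TO_ID = {'ham': 0, 'spam': 1}

def parse_lines(lines):
    """Split 'label<TAB>message' lines into parallel message and label lists."""
    messages, label_ids = [], []
    for line in lines:
        parts = line.split('\t', 1)
        if len(parts) != 2 or parts[0] not in LABEL_TO_ID:
            continue  # skip malformed rows so the two lists stay aligned
        messages.append(parts[1])
        label_ids.append(LABEL_TO_ID[parts[0]])
    return messages, label_ids

# Invented sample lines, including one malformed row.
sample = ["ham\tsee you at five", "spam\tWIN a prize now", "broken line"]
messages, label_ids = parse_lines(sample)
print(messages)   # ['see you at five', 'WIN a prize now']
print(label_ids)  # [0, 1]
```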

In [7]:
# change labels to one hot representation
labels = to_categorical(labels)
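
To make the one-hot step concrete, here is a plain-Python sketch of the encoding that `to_categorical` produces for integer class labels. The helper `one_hot` is a hypothetical stand-in written for this example; the real function returns a NumPy array rather than nested lists.

```python
def one_hot(label_ids, num_classes):
    """Return one row per label, with a 1.0 in the column of the label's class."""
    return [[1.0 if k == y else 0.0 for k in range(num_classes)]
            for y in label_ids]

# ham=0 and spam=1, as in the notebook
encoded = one_hot([0, 1, 0], num_classes=2)
print(encoded)  # [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
```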

## First BERT model

In [8]:
input_ids=[]
attention_masks=[]
# bert-base has 12 encoder layers; bert-large has 24 (BERT has no decoder)
# tokenizing
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

for sent in data:
  # add_special_tokens adds the [CLS] and [SEP] tokens
  # a pair of sentences can also be passed instead of the single sent
  # pad or truncate every message to exactly 64 tokens
  bert_inp = bert_tokenizer.encode_plus(sent, add_special_tokens = True, max_length =64, padding = 'max_length', truncation = True, return_attention_mask = True)
  input_ids.append(bert_inp['input_ids'])
  attention_masks.append(bert_inp['attention_mask'])

Downloading vocab.txt:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]
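
The shape of `input_ids` and `attention_mask` is easier to see on a toy example. The sketch below is not the tokenizer itself: `pad_and_mask` is a hypothetical helper, and the word-piece ids 2001 to 2006 are arbitrary placeholders. Only the special ids (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) are those of the `bert-base-uncased` vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in bert-base-uncased

def pad_and_mask(token_ids, max_length):
    """Add [CLS]/[SEP], truncate to max_length, then pad and build the mask."""
    body = token_ids[:max_length - 2]        # leave room for the two special tokens
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)                    # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [PAD_ID] * padding, mask + [0] * padding  # 0 marks padding

# Placeholder ids for a short and a long message.
short_ids, short_mask = pad_and_mask([2001, 2002], max_length=6)
long_ids, long_mask = pad_and_mask([2001, 2002, 2003, 2004, 2005, 2006], max_length=6)
print(short_ids, short_mask)  # [101, 2001, 2002, 102, 0, 0] [1, 1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [101, 2001, 2002, 2003, 2004, 102] [1, 1, 1, 1, 1, 1]
```

Every row ends up with the same length, which is what lets the lists be stacked into rectangular arrays in the next cell.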


In [9]:
input_ids = np.asarray(input_ids)
attention_masks = np.array(attention_masks)
labels = np.array(labels)

In [10]:
len(input_ids),len(attention_masks),len(labels)

(5574, 5574, 5574)

In [11]:
# hold out 20% of the data, then split that part in half: 80% train, 10% validation, 10% test
train_inp, val_inp, train_label, val_label, train_mask, val_mask = train_test_split(input_ids, labels, attention_masks, test_size=0.20, random_state=1000)
test_inp, val_inp, test_label, val_label, test_mask, val_mask = train_test_split(val_inp, val_label, val_mask, test_size=0.5, random_state=1000)
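
The resulting set sizes can be worked out by hand. The sketch below assumes that scikit-learn rounds the held-out part up (`ceil(test_size * n_samples)`) and gives the remainder to the other part when only `test_size` is set; `split_sizes` is a hypothetical helper, not a library function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (kept, held_out) sizes, rounding the held-out part up."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out

# First split: 5574 messages into train and a 20% remainder.
n_train, n_rest = split_sizes(5574, 0.20)
# Second split: the notebook assigns the first part to test and the second to validation.
n_test, n_val = split_sizes(n_rest, 0.5)
print(n_train, n_val, n_test)  # 4459 558 557
```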

In [12]:
SEQ_LEN = 64

# bert model
bert = TFBertModel.from_pretrained('bert-base-uncased')
input_ids = tf.keras.layers.Input(shape=(SEQ_LEN,), name='input_ids', dtype='int32')
mask = tf.keras.layers.Input(shape=(SEQ_LEN,), name='attention_mask', dtype='int32')

# bert returns two outputs
# [0] is the last hidden state: one vector per token, shape (batch, SEQ_LEN, hidden size)
# [1] is the pooled output computed from the [CLS] token
embeddings = bert(input_ids, attention_mask=mask)[0]
# average the token vectors from bert[0] over the sequence axis (padding positions included)
X = tf.keras.layers.Lambda(lambda x: tf.keras.backend.mean(x, axis=1))(embeddings)
# feed the averaged vector to a dense softmax classifier
y = tf.keras.layers.Dense(2, activation='softmax', name='outputs')(X)

# the layers above only define a graph of operations
# tf.keras.Model ties the inputs and the output into a trainable model
bert_model = tf.keras.Model(inputs=[input_ids, mask], outputs=y)

optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5,epsilon=1e-08)

# labels are one-hot and the output is a softmax, so use categorical cross-entropy
bert_model.compile(optimizer=optimizer,loss='categorical_crossentropy',metrics=['accuracy'])

bert_model.summary()

Downloading tf_model.h5:   0%|          | 0.00/511M [00:00<?, ?B/s]

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, 64)]         0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, 64)]         0           []                               
                                                                                                  
 tf_bert_model (TFBertModel)    TFBaseModelOutputWi  109482240   ['input_ids[0][0]',              
                                thPoolingAndCrossAt               'attention_mask[0][0]']         
                                tentions(last_hidde                                               
                                n_state=(None, 64,                                            
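
The first model averages all 64 token vectors, so padded positions contribute to the sentence vector. A common alternative, which this notebook does not use, is to weight the average by the attention mask. The plain-Python sketch below contrasts the two on toy 2-dimensional vectors. The helper names and the numbers are invented for illustration; real BERT vectors have 768 dimensions.

```python
def plain_mean(token_vectors):
    """Average every token vector, padding included (what the Lambda layer does)."""
    n = len(token_vectors)
    return [sum(col) / n for col in zip(*token_vectors)]

def masked_mean(token_vectors, attention_mask):
    """Average only the vectors whose mask entry is 1."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    n = max(len(kept), 1)  # avoid dividing by zero on an all-padding row
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / n for i in range(dim)]

# Two real tokens followed by two padding positions (toy values).
vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]]
mask = [1, 1, 0, 0]
print(plain_mean(vectors))         # [5.5, 6.0]
print(masked_mean(vectors, mask))  # [2.0, 3.0]
```

Whether the masked variant helps on this dataset is an open question; the sketch only shows how the two averages differ.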

In [13]:
history = bert_model.fit([train_inp, train_mask], train_label,batch_size = 32, epochs = 4, validation_data = ([val_inp, val_mask], val_label))

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


## Second BERT model

In [14]:
import tensorflow as tf
from transformers import TFBertModel, TFBertForPreTraining

SEQ_LEN = 64

bert = TFBertModel.from_pretrained('bert-base-uncased')

input_ids = tf.keras.layers.Input(shape=(SEQ_LEN,), name='input_ids', dtype='int32')
mask = tf.keras.layers.Input(shape=(SEQ_LEN,), name='attention_mask', dtype='int32')

# use the second bert output [1]: the pooled representation of the [CLS] token

embeddings = bert(input_ids, attention_mask=mask)[1]
y = tf.keras.layers.Dense(2, activation='softmax', name='outputs')(embeddings)

bert_model = tf.keras.Model(inputs=[input_ids, mask], outputs=y)

optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5, epsilon=1e-08)

# labels are one-hot and the output is a softmax, so use categorical cross-entropy
bert_model.compile(optimizer=optimizer,
              loss='categorical_crossentropy',
              metrics=['accuracy'])

bert_model.summary()



Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, 64)]         0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, 64)]         0           []                               
                                                                                                  
 tf_bert_model_1 (TFBertModel)  TFBaseModelOutputWi  109482240   ['input_ids[0][0]',              
                                thPoolingAndCrossAt               'attention_mask[0][0]']         
                                tentions(last_hidde                                               
                                n_state=(None, 64,                                          
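
Both models end in a two-unit softmax trained on one-hot labels. In that setting the two outputs sum to 1, so binary and categorical cross-entropy give the same value for each example. The sketch below checks this numerically in plain Python. It ignores the small clipping that Keras applies to probabilities, and the function names and the example prediction are chosen here for illustration.

```python
import math

def categorical_crossentropy(y_true, y_pred):
    """-sum over classes of y_true * log(y_pred)."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))

def binary_crossentropy(y_true, y_pred):
    """Mean over output units of the per-unit binary cross-entropy."""
    per_unit = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
                for t, p in zip(y_true, y_pred)]
    return sum(per_unit) / len(per_unit)

y_true = [0.0, 1.0]   # one-hot label for 'spam'
y_pred = [0.2, 0.8]   # a softmax output: the two entries sum to 1
cce = categorical_crossentropy(y_true, y_pred)
bce = binary_crossentropy(y_true, y_pred)
print(math.isclose(cce, bce))  # True
```

The equality relies on there being exactly two classes; with three or more, the two losses differ and categorical cross-entropy is the one that matches a softmax output.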

In [15]:
history = bert_model.fit([train_inp, train_mask], train_label,batch_size = 32, epochs = 5,validation_data = ([val_inp, val_mask],val_label))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
