<a href="https://colab.research.google.com/github/sahug/ds-bert/blob/main/BERT%20NLP%20-%20Step%20by%20Step%20BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**BERT NLP - Session 9 - Step by Step BERT**

**What is BERT?**

BERT stands for Bidirectional Encoder Representations from Transformers. It is Google's technique for pre-training language representations for NLP. This means the machine learning community can take BERT models that have already been trained on a large corpus (the BERT paper reports BooksCorpus, about 800 million words, and English Wikipedia, about 2,500 million words) and fine-tune them for a wide variety of tasks such as question answering, named entity recognition (NER), and classification tasks like sentiment analysis.
The BERT paper presents two model sizes: BERT Base and BERT Large. Both stack many encoder layers: 12 for Base and 24 for Large. If you are familiar with the Transformer architecture, you will recognize that BERT uses only the encoder stack of the Transformer, with the same self-attention mechanism. But why is it called bidirectional?

**What does bidirectional mean?**

The Transformer encoder reads the entire sequence of words at once, unlike directional models that read the input sequentially from left to right or from right to left. This lets the model learn the meaning of a word from its context on both sides. Since we will use BERT for toxic comment classification, we explain only the steps relevant to classification tasks.
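
One way to picture the difference is as an attention mask. The sketch below is a toy illustration in plain Python, not BERT's actual implementation: `causal_mask` and `bidirectional_mask` are hypothetical helper names, and the example sentence is made up.

```python
def causal_mask(n):
    """Left-to-right model: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Encoder-style model: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


tokens = ["the", "bank", "of", "the", "river"]
n = len(tokens)
i = tokens.index("bank")  # which words can "bank" look at?

visible_causal = [t for t, m in zip(tokens, causal_mask(n)[i]) if m]
visible_bidir = [t for t, m in zip(tokens, bidirectional_mask(n)[i]) if m]

print(visible_causal)  # ['the', 'bank']
print(visible_bidir)   # ['the', 'bank', 'of', 'the', 'river']
```

With the full mask, the word "bank" can use "river" to its right as context, which a strictly left-to-right reader cannot.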

**What is the input of Bert?**

The input to BERT is a token sequence that starts with the special [CLS] token, which stands for classification. As in the Transformer, BERT takes a sequence of tokens (each mapped to a vector) as input and passes it up from the first encoder layer to the last layer in the stack. Each layer applies self-attention to the sequence and then passes the result through a feed-forward network before handing it to the next encoder layer.
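
To make the input layout concrete, here is a minimal sketch of how a sentence might be turned into a fixed-length ID sequence. It is only an illustration: the vocabulary, the IDs, the whitespace splitting, and the helper `toy_encode` are all assumptions made up for this example, whereas the real BERT tokenizer uses WordPiece sub-words and its own vocabulary.

```python
# Hypothetical toy vocabulary; real BERT IDs differ.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "you": 4, "are": 5, "great": 6}


def toy_encode(text, max_len):
    """Build [CLS] + tokens + [SEP], then truncate/pad to max_len IDs."""
    words = text.lower().split()
    ids = [TOY_VOCAB["[CLS]"]]
    ids += [TOY_VOCAB.get(w, TOY_VOCAB["[UNK]"]) for w in words]
    ids = ids[: max_len - 1]            # truncate, leaving room for [SEP]
    ids.append(TOY_VOCAB["[SEP]"])
    ids += [TOY_VOCAB["[PAD]"]] * (max_len - len(ids))  # pad to max_len
    return ids


encoded = toy_encode("You are great", max_len=8)
print(encoded)  # [2, 4, 5, 6, 3, 0, 0, 0]
```

The [CLS] token always lands at position 0, which is why the classifier later reads the output at that position.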

**What is the output of Bert?**

BERT outputs one vector of size hidden_size (768 for BERT Base) per input position, and the first position corresponds to the [CLS] token. This [CLS] vector can be used as the input to our classifier network, which predicts whether a comment is toxic. In the BERT paper, the authors achieve strong results using only a single-layer neural network as the classifier.
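
A single-layer classifier on top of the [CLS] vector amounts to a weighted sum followed by a sigmoid. The sketch below uses a 4-dimensional vector and hand-picked weights purely for illustration; the values and the helper `classify_cls` are assumptions, not trained parameters.

```python
import math


def sigmoid(z):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def classify_cls(cls_vector, weights, bias):
    """One dense unit: probability = sigmoid(w . h + b)."""
    z = sum(w * h for w, h in zip(weights, cls_vector)) + bias
    return sigmoid(z)


cls_vector = [0.5, -1.0, 2.0, 0.0]   # toy stand-in for the 768-dim [CLS] output
weights = [1.0, 0.5, 0.25, -2.0]     # made-up classifier weights
bias = -0.5

prob = classify_cls(cls_vector, weights, bias)
print(prob)  # 0.5, because the weighted sum plus bias is exactly 0
```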


In [None]:
# Run this cell and select the kaggle.json file downloaded
# from the Kaggle account settings page.
from google.colab import files
files.upload()


In [None]:
# Let's make sure the kaggle.json file is present.
!ls -lha kaggle.json

-rw-r--r-- 1 root root 66 Mar 24 02:07 kaggle.json


In [None]:
# Next, install the Kaggle API client.
%pip install -qq kaggle

In [None]:
# The Kaggle API client expects this file to be in ~/.kaggle,
# so move it there.
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

# This permissions change avoids a warning on Kaggle tool startup.
!chmod 600 ~/.kaggle/kaggle.json
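
The same file setup can also be expressed in Python. The following is a minimal sketch under stated assumptions: `write_kaggle_credentials` is a hypothetical helper (it is not part of the Kaggle client), the username and key are placeholders, and the demo writes into a temporary directory rather than the real `~/.kaggle`.

```python
import json
import os
import stat
import tempfile


def write_kaggle_credentials(config_dir, username, key):
    """Write kaggle.json into config_dir, readable by the owner only."""
    os.makedirs(config_dir, exist_ok=True)
    path = os.path.join(config_dir, "kaggle.json")
    # Create the file with 0o600 from the start so the key is never world-readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        json.dump({"username": username, "key": key}, f)
    os.chmod(path, 0o600)  # enforce the mode even if the file already existed
    return path


# Demo with placeholder values in a throwaway directory.
demo_dir = tempfile.mkdtemp()
demo_path = write_kaggle_credentials(
    os.path.join(demo_dir, ".kaggle"), "example-user", "example-key"
)
demo_mode = stat.S_IMODE(os.stat(demo_path).st_mode)
with open(demo_path) as f:
    demo_contents = json.load(f)
print(oct(demo_mode))
```

Keeping the key in a file with owner-only permissions, and out of printed cell output, avoids exposing it when the notebook is shared.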

In [None]:
# List available datasets.
!kaggle datasets list

ref                                                                       title                                                size  lastUpdated          downloadCount  voteCount  usabilityRating  
------------------------------------------------------------------------  --------------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
datasets/piterfm/2022-ukraine-russian-war                                 2022 Ukraine Russia War                               1KB  2022-03-23 09:28:52           2471        170  1.0              
datasets/prasertk/healthy-lifestyle-cities-report-2021                    Healthy Lifestyle Cities Report 2021                  2KB  2022-03-03 00:26:02           2787         91  1.0              
datasets/prasertk/netflix-daily-top-10-in-us                              Netflix daily top 10                                 70KB  2022-03-12 13:22:19           1237         35  1.0              
datasets/v

In [None]:
!kaggle competitions download -c jigsaw-multilingual-toxic-comment-classification

Downloading jigsaw-multilingual-toxic-comment-classification.zip to /content
 99% 1.07G/1.08G [00:11<00:00, 93.5MB/s]
100% 1.08G/1.08G [00:11<00:00, 99.3MB/s]


**Import Libraries**

In [None]:
%pip install -qq transformers



In [None]:
import os

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
import transformers
from transformers import TFAutoModel, AutoTokenizer
from tqdm.notebook import tqdm
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, processors

In [None]:
import zipfile
with zipfile.ZipFile("/content/jigsaw-multilingual-toxic-comment-classification.zip","r") as zip_ref:
    zip_ref.extractall("jigsaw-multilingual-toxic-comment-classification")

**Function for Encoding the comment**

The encoding step converts each comment into a fixed-length sequence of integer token IDs. The model later maps these IDs to embedding vectors, in which words with similar meanings end up close to each other.

In [None]:
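def regular_encode(texts, tokenizer, maxlen=512):
    """
    Tokenize a list of strings and return their token IDs as a
    NumPy array of shape (len(texts), maxlen).
    """
    # Convert each text to integer token IDs, truncating long texts and
    # padding short ones so that every row has exactly `maxlen` entries.
    enc_di = tokenizer.batch_encode_plus(
        texts,
        return_token_type_ids=False,
        padding='max_length',
        truncation=True,
        max_length=maxlen
    )

    return np.array(enc_di['input_ids'])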
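
Note that `regular_encode` returns only the token IDs, so the model receives no information about which positions are padding. As a hedged sketch, a padding mask could be derived from the IDs as shown below. This assumes the pad token has ID 0, which should be checked against `tokenizer.pad_token_id`; the helper name and the example IDs are made up for illustration.

```python
def attention_mask_from_ids(batch_ids, pad_id=0):
    """Return 1 for real tokens and 0 for padding, row by row."""
    return [[0 if tok == pad_id else 1 for tok in row] for row in batch_ids]


# Placeholder IDs: two already-padded rows of length 5.
batch_ids = [
    [101, 7592, 102, 0, 0],
    [101, 102, 0, 0, 0],
]
mask = attention_mask_from_ids(batch_ids)
print(mask)  # [[1, 1, 1, 0, 0], [1, 1, 0, 0, 0]]
```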

**Function to build the Keras model**

In [None]:
def build_model(transformer, max_len=512):
    """
    Build and compile a Keras binary classifier on top of a transformer encoder.
    """
    # Input layer: one row of `max_len` integer token IDs per example.
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")  # name is optional
    # The first element of the transformer output is the last hidden state,
    # with shape (batch, max_len, hidden_size).
    sequence_output = transformer(input_word_ids)[0]

    # Keep only the vector at position 0, which corresponds to the [CLS] token.
    cls_token = sequence_output[:, 0, :]

    # Output layer: a single sigmoid unit giving the probability of "toxic".
    out = Dense(1, activation='sigmoid')(cls_token)

    # Create the model from its inputs and outputs, then compile it.
    model = Model(inputs=input_word_ids, outputs=out)
    model.compile(Adam(learning_rate=1e-5), loss='binary_crossentropy', metrics=[tf.keras.metrics.AUC()])

    return model
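
The model is compiled with the `binary_crossentropy` loss. As a rough sketch of what that loss measures, the plain-Python version below averages the negative log-likelihood over examples. The clipping constant `eps` and the sample labels and probabilities are illustrative assumptions; Keras computes the same quantity on tensors.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


labels = [1, 0]
loss_confident = binary_crossentropy(labels, [0.9, 0.1])  # good predictions
loss_unsure = binary_crossentropy(labels, [0.5, 0.5])     # coin-flip predictions

print(loss_confident, loss_unsure)
```

Confident correct predictions give a loss near 0, while predicting 0.5 for everything gives log(2), about 0.693.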

**Preprocessing**

**Configuration**

In [None]:
# Default distribution strategy in TensorFlow. Works on CPU and single GPU.
strategy = tf.distribute.get_strategy()

print("REPLICAS: ", strategy.num_replicas_in_sync)

REPLICAS:  1


In [None]:
# tf.data builds the input pipeline. With prefetching, the pipeline prepares
# data for the next step before the current step has finished.
# AUTOTUNE lets TensorFlow choose the prefetch buffer size at runtime.
AUTO = tf.data.experimental.AUTOTUNE

# Configuration
EPOCHS = 2
#BATCH_SIZE = 16 * strategy.num_replicas_in_sync
BATCH_SIZE = 2
MAX_LEN = 192
MODEL = 'bert-base-multilingual-cased'

**Import Dataset**

In [None]:
train = pd.read_csv("/content/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train.csv")

valid = pd.read_csv('/content/jigsaw-multilingual-toxic-comment-classification/validation.csv')
test = pd.read_csv('/content/jigsaw-multilingual-toxic-comment-classification/test.csv')
sub = pd.read_csv('/content/jigsaw-multilingual-toxic-comment-classification/sample_submission.csv')

**Tokenizer**

In [None]:
# Load the tokenizer that matches the pre-trained BERT checkpoint.
# The BERT tokenizer has vocabulary entries for emoji, which is why we don't
# need to remove emoji from the datasets. For more details see the
# (EDA & data cleaning) notebook.

tokenizer = AutoTokenizer.from_pretrained(MODEL)

**Encode Comments**

In [None]:
# Call regular_encode on all three datasets to convert each comment into a
# fixed-length array of token IDs.
# x_train, x_valid, and x_test hold only the comment text column
# (called "content" in the test set).
x_train = regular_encode(train.comment_text.values.tolist(), tokenizer, maxlen=MAX_LEN)
x_valid = regular_encode(valid.comment_text.values.tolist(), tokenizer, maxlen=MAX_LEN)
x_test = regular_encode(test.content.values.tolist(), tokenizer, maxlen=MAX_LEN)

# y_train and y_valid hold the target column "toxic".
y_train = train.toxic.values
y_valid = valid.toxic.values

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


**Prepare TensorFlow dataset for modeling**

In [None]:
# Build tf.data pipelines from the encoded arrays:
# create a source dataset, apply transformations, then iterate over it.
train_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_train, y_train))  # one (token IDs, label) pair per element
    .repeat()                                # loop indefinitely; fit() stops via steps_per_epoch
    .shuffle(2048)                           # shuffle within a 2048-element buffer
    .batch(BATCH_SIZE)                       # combine consecutive elements into batches
    .prefetch(AUTO)                          # prepare later batches while the current one is processed
)

valid_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_valid, y_valid))  # one (token IDs, label) pair per element
    .batch(BATCH_SIZE)                       # combine consecutive elements into batches
    .cache()                                 # keep the batched data in memory after the first pass
    .prefetch(AUTO)                          # prepare later batches while the current one is processed
)

test_dataset = (
    tf.data.Dataset
    .from_tensor_slices(x_test)              # token IDs only; the test set has no labels
    .batch(BATCH_SIZE)
)
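
To illustrate what `.shuffle(2048).batch(BATCH_SIZE)` does conceptually, here is a small pure-Python sketch of buffer-based shuffling followed by batching. It is a simplified model of the idea, not TensorFlow's implementation; `buffered_shuffle` and `batch` are hypothetical helpers, and the buffer size of 4 is chosen only to keep the demo small.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Fill a buffer, then emit a random buffered element as each new one arrives."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) == buffer_size:
            yield buffer.pop(rng.randrange(len(buffer)))
    while buffer:  # drain what is left at the end of the stream
        yield buffer.pop(rng.randrange(len(buffer)))


def batch(items, batch_size):
    """Group consecutive elements into lists of at most batch_size."""
    current = []
    for item in items:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        yield current


batches = list(batch(buffered_shuffle(range(10), buffer_size=4), batch_size=3))
first_emitted = batches[0][0]
unshuffled = list(buffered_shuffle(range(5), buffer_size=1))
print(batches)
```

Because only the buffer is shuffled, an element can never be emitted before it has been read: the first emitted element always comes from the first `buffer_size` inputs, and a buffer of size 1 leaves the order unchanged.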

**Build the model**

In [None]:
%%time
# Build the model inside the distribution strategy scope
# (the default strategy here, so CPU or a single GPU).
with strategy.scope():
    # Load the pre-trained BERT encoder and use its output as the input to the classifier.
    transformer_layer = TFAutoModel.from_pretrained(MODEL)
    model = build_model(transformer_layer, max_len=MAX_LEN)
model.summary()

Some layers from the model checkpoint at bert-base-multilingual-cased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-multilingual-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_word_ids (InputLayer)  [(None, 192)]            0         
                                                                 
 tf_bert_model (TFBertModel)  TFBaseModelOutputWithPoo  177853440
                             lingAndCrossAttentions(l            
                             ast_hidden_state=(None,             
                             192, 768),                          
                              pooler_output=(None, 76            
                             8),                                 
                              past_key_values=None, h            
                             idden_states=None, atten            
                             tions=None, cross_attent            
                             ions=None)                          
                                                             


**Training The Model, Tuning Hyper-Parameters**

In [None]:
# Train the model, monitoring the metrics on the validation dataset after each epoch.
n_steps = x_train.shape[0] // BATCH_SIZE
train_history = model.fit(train_dataset, steps_per_epoch=n_steps, validation_data=valid_dataset,
                epochs=EPOCHS)

Epoch 1/2
   260/111774 [..............................] - ETA: 187:42:00 - loss: 0.3707 - auc: 0.5813
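
The log above shows 111774 steps per epoch at `BATCH_SIZE = 2`. Since `n_steps = x_train.shape[0] // BATCH_SIZE`, that implies at least 223,548 training rows. The short sketch below works through this arithmetic and shows how the step count would shrink with the commented-out batch size of 16 on a single replica. The helper `steps_per_epoch` is only illustrative, and the row count is a lower bound inferred from the log, not a value read from the dataset.

```python
def steps_per_epoch(num_examples, batch_size):
    """Number of full batches in one pass over the data."""
    return num_examples // batch_size


logged_steps = 111774      # from the training log
batch_size_used = 2        # BATCH_SIZE in the configuration cell

# Integer division discards a remainder, so this is a lower bound on the row count.
min_examples = logged_steps * batch_size_used
steps_bs16 = steps_per_epoch(min_examples, 16)

print(min_examples)  # 223548
print(steps_bs16)    # 13971
```

A larger batch size reduces the number of steps per epoch roughly in proportion, provided the batch fits in memory.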

**Training on the Validation Set**

In [None]:
# Continue training on the validation set. Note that this fits the model further; it does not evaluate it.
n_steps = x_valid.shape[0] // BATCH_SIZE
train_history_2 = model.fit(valid_dataset.repeat(), steps_per_epoch=n_steps,epochs=EPOCHS*2)

**Predict and store the result**

In [None]:
sub['toxic'] = model.predict(test_dataset, verbose=1)
sub.to_csv('submission.csv', index=False)