This TF Hub model uses the implementation of BERT from the TensorFlow Models repository on GitHub at tensorflow/models/official/nlp/bert. It uses L=24 hidden layers (i.e., Transformer blocks), a hidden size of H=1024, and A=16 attention heads.
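
To make the three size parameters concrete, here is a rough back-of-the-envelope sketch of what L=24, H=1024, and A=16 imply. The vocabulary size (30522), maximum position count (512), number of segment types (2), and feed-forward width (4·H) are not stated above; they are assumed from the standard BERT configuration, so treat the total as an estimate rather than an exact figure for this module.

```python
# Sizes stated in the model description.
L, H, A = 24, 1024, 16

# Assumed standard BERT settings (not given in the text above).
VOCAB, MAX_POS, TYPES, FFN = 30522, 512, 2, 4 * H

# Each attention head works on an equal slice of the hidden size.
head_dim = H // A

# Token, position, and segment embeddings plus one layer norm (scale + bias).
embeddings = (VOCAB + MAX_POS + TYPES) * H + 2 * H

# Query, key, value, and output projections, each with a bias.
attention = 4 * (H * H + H)

# Two dense layers: H -> FFN and FFN -> H, each with a bias.
feed_forward = (H * FFN + FFN) + (FFN * H + H)

# One Transformer block has two layer norms in addition to the above.
layer = attention + feed_forward + 2 * (2 * H)

# Dense layer applied to the [CLS] position to produce the pooled output.
pooler = H * H + H

total = embeddings + L * layer + pooler

print("dimensions per attention head:", head_dim)
print(f"approximate parameter count: {total / 1e6:.0f}M")
```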

This model has been pre-trained for English on Wikipedia and BooksCorpus using the code published on GitHub. Inputs have been "uncased", meaning that the text has been lower-cased before tokenization into word pieces, and any accent markers have been stripped. For training, random input masking has been applied independently to word pieces (as in the original BERT paper).
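
As an illustration of what "uncased" means, the sketch below lower-cases a string and strips accent marks using only the standard library. The helper name `uncase` is hypothetical; the official tokenizer performs this step internally, so this is meant to show the idea, not to replace it.

```python
import unicodedata


def uncase(text: str) -> str:
    """Lower-case text and strip accent marks (hypothetical helper)."""
    text = text.lower()
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the combining accent characters.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(uncase("Héllo Wörld, Café!"))  # hello world, cafe!
```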

All parameters in the module are trainable, and fine-tuning all parameters is the recommended practice.

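- No pooling. The embedding of the [CLS] token is used directly.
- No hidden dense layer. A single sigmoid output unit is attached directly to the last layer of BERT, not to the intermediate layers.
- Fixed learning rate, batch size, number of epochs, and optimizer. The optimizer is Adam. The fine-tuning values recommended in the original paper include learning rates of 2e-5 and 5e-5, 3 epochs, and a batch size of 32. Note that the code in this notebook trains for 3 epochs but uses a learning rate of 2e-6 and a batch size of 16.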
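
The first two points can be sketched without any deep learning library. The toy example below takes the vector at position 0 of a sequence output (the [CLS] position) and passes a weighted sum of it through a sigmoid. The function name `cls_head`, the weights, and the two-dimensional vectors are made up for illustration; in the real model the weights are learned and the hidden size is 1024.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def cls_head(sequence_output, weights, bias):
    """Probability from the [CLS] vector only (hypothetical toy head)."""
    # Position 0 holds the [CLS] embedding; all other positions are ignored.
    cls_vector = sequence_output[0]
    logit = sum(w * x for w, x in zip(weights, cls_vector)) + bias
    return sigmoid(logit)


# Toy sequence output: 3 positions, hidden size 2.
toy_sequence_output = [[1.0, -1.0], [5.0, 5.0], [9.0, 9.0]]
probability = cls_head(toy_sequence_output, weights=[0.5, 0.5], bias=0.0)
print(probability)  # 0.5, because the logit is 0.5 * 1.0 + 0.5 * -1.0 = 0
```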



References

   - Source for bert_encode function: https://www.kaggle.com/user123454321/bert-starter-inference
   - All pre-trained BERT models from TensorFlow Hub: https://tfhub.dev/s?q=bert


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
# We will use the official tokenization script created by the Google team
!wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow_hub as hub

import tokenization  # the script downloaded above

**1. FUNCTIONS WE WILL USE**

In [None]:
def bert_encode(texts, tokenizer, max_len=512, verbose=False):
    """Convert raw texts into BERT token ids, padding masks, and segment ids."""
    all_tokens = []
    all_masks = []
    all_segments = []

    for text in texts:
        text = tokenizer.tokenize(text)

        # Reserve two positions for the [CLS] and [SEP] special tokens.
        text = text[:max_len - 2]
        input_sequence = ["[CLS]"] + text + ["[SEP]"]
        pad_len = max_len - len(input_sequence)

        # Token ids, right-padded with 0 up to max_len.
        tokens = tokenizer.convert_tokens_to_ids(input_sequence)
        tokens += [0] * pad_len

        # 1 marks a real token, 0 marks padding.
        pad_masks = [1] * len(input_sequence) + [0] * pad_len

        # Single-sentence input: every position belongs to segment 0.
        segment_ids = [0] * max_len

        # Debug output; off by default because it is very long for a full dataset.
        if verbose:
            print("Length of input sequence: ", len(input_sequence))
            print("Length of padding: ", pad_len)
            print("Tokens: ", tokens)
            print("Padding_masks: ", pad_masks)
            print("Ids of segments: ", segment_ids)

        all_tokens.append(tokens)
        all_masks.append(pad_masks)
        all_segments.append(segment_ids)

    return np.array(all_tokens), np.array(all_masks), np.array(all_segments)
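
To see the layout that the encoding function produces, here is a small standalone sketch that applies the same truncate, wrap, and pad steps to one already-tokenized text. The function `encode_one` and the toy vocabulary are hypothetical and the ids are invented for readability; they are not the real BERT vocabulary ids.

```python
def encode_one(tokens, vocab, max_len):
    """Build ids, mask, and segment ids for one tokenized text (toy version)."""
    # Leave room for [CLS] and [SEP].
    tokens = tokens[: max_len - 2]
    sequence = ["[CLS]"] + tokens + ["[SEP]"]
    pad_len = max_len - len(sequence)

    ids = [vocab[token] for token in sequence] + [0] * pad_len
    mask = [1] * len(sequence) + [0] * pad_len
    segments = [0] * max_len
    return ids, mask, segments


# Invented ids for illustration only.
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "forest": 3, "fire": 4}

ids, mask, segments = encode_one(["forest", "fire"], toy_vocab, max_len=6)
print(ids)       # [1, 3, 4, 2, 0, 0]
print(mask)      # [1, 1, 1, 1, 0, 0]
print(segments)  # [0, 0, 0, 0, 0, 0]
```

All three lists always have length `max_len`, which is what lets them be stacked into fixed-shape arrays.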
        

In [None]:
def build_model(bert_layer, max_len=512):
    # Input shapes must be tuples, hence the trailing comma in (max_len,).
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_words_ids")
    input_mask = Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    segment_ids = Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")

    # The layer returns (pooled_output, sequence_output); only the latter is used.
    _, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
    # Take the embedding at position 0, which corresponds to the [CLS] token.
    clf_output = sequence_output[:, 0, :]
    out = Dense(1, activation='sigmoid')(clf_output)

    model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
    # `learning_rate` replaces the deprecated `lr` argument.
    model.compile(Adam(learning_rate=2e-6), loss='binary_crossentropy', metrics=['accuracy'])
    return model


**2. DATA**
- Load BERT from TensorFlow Hub
- Load the tokenizer from the BERT layer
- Encode the text into tokens, masks, and segment flags
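
For intuition about what the tokenizer does to a single word, the sketch below implements greedy longest-match-first word-piece splitting against a tiny made-up vocabulary. The function name `wordpiece` and the vocabulary are assumptions for illustration; the real tokenizer reads its vocabulary from the file shipped with the BERT layer and also handles lower-casing and punctuation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into word pieces by greedy longest match (toy version)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry a "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches, so the whole word becomes the unknown token.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Made-up vocabulary for illustration only.
toy_vocab = {"wild", "fire", "##fire", "##s"}

print(wordpiece("wildfires", toy_vocab))  # ['wild', '##fire', '##s']
print(wordpiece("xyz", toy_vocab))        # ['[UNK]']
```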

In [None]:
%%time
module_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1"
bert_layer = hub.KerasLayer(module_url, trainable = True)

In [None]:
train_data = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")

test_data = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")

submission_data = pd.read_csv("/kaggle/input/nlp-getting-started/sample_submission.csv")

In [None]:
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)


In [None]:
train_input = bert_encode(train_data.text.values, tokenizer, max_len=160)
test_input = bert_encode(test_data.text.values, tokenizer, max_len=160)


In [None]:
train_labels = train_data.target.values
train_labels

In [None]:
model = build_model(bert_layer, max_len=160)
model.summary()

In [None]:
train_history = model.fit(train_input, train_labels, validation_split = 0.2, epochs = 3, batch_size = 16)

In [None]:
model.save('model.h5')

In [None]:
test_pred = model.predict(test_input)

In [None]:
test_data['target'] = test_pred.round().astype(int)
test_data = test_data.drop(columns = ['keyword', 'location', 'text'])
test_data.to_csv('submission.csv', index=False)