We will create a baseline BERT model following the [excellent notebook](https://www.kaggle.com/xhlulu/disaster-nlp-keras-bert-using-tfhub) written by @xhlulu. The same approach was also recently implemented [here](https://www.kaggle.com/jeongyoonlee/tf-keras-bert-baseline-training-inference).

In [None]:
debug = False
!wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py

Let's first import a few libraries.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import os
from sklearn.metrics import mean_squared_error
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow_hub as hub
from tensorflow.keras.metrics import RootMeanSquaredError

import tokenization
from sklearn.manifold import TSNE

seed = 0
random.seed(seed)
np.random.seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)
tf.random.set_seed(seed)
import plotly
import plotly.graph_objs as go
from plotly.graph_objs import FigureWidget

import warnings
warnings.filterwarnings("ignore")

We create the training set by randomly selecting 70% of the records; the remaining 30% form the test/validation set.

In [None]:
data = pd.read_csv("../input/commonlitreadabilityprize/train.csv")

if debug:
    data = data.tail(100)

shuffle_df = data.sample(frac=1)

train_size = int(0.7 * len(data))


train = shuffle_df[:train_size]
test = shuffle_df[train_size:]
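
The split above relies on the global NumPy seed set earlier. A minimal sketch of the same 70/30 split with the seed passed explicitly is shown below; the toy frame and the helper name `split_frame` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv (synthetic rows; only the column names mirror the real file).
toy = pd.DataFrame({
    "excerpt": [f"text {i}" for i in range(10)],
    "target": np.linspace(-3.0, 1.0, 10),
})

def split_frame(df, train_frac=0.7, seed=0):
    """Shuffle df with an explicit seed, then cut it into train/validation parts."""
    shuffled = df.sample(frac=1, random_state=seed)
    cut = int(train_frac * len(shuffled))
    return shuffled.iloc[:cut], shuffled.iloc[cut:]

toy_train, toy_valid = split_frame(toy)
print(len(toy_train), len(toy_valid))
```

Passing `random_state` makes the split independent of whatever other code has consumed random numbers before this cell runs.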

In [None]:
module_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1"
bert_layer = hub.KerasLayer(module_url, trainable=True)

vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

In [None]:
def bert_encode(texts, tokenizer, max_len=205, first=True):
    all_tokens = []
    all_masks = []
    all_segments = []
    
    for text in texts:
        text = tokenizer.tokenize(text)
        if first:
            # keep the beginning of the text, leaving room for [CLS] and [SEP]
            text = text[:max_len-2]
        else:
            # keep the end of the text instead
            text = text[-(max_len-2):]
        input_sequence = ["[CLS]"] + text + ["[SEP]"]
        pad_len = max_len - len(input_sequence)
        
        tokens = tokenizer.convert_tokens_to_ids(input_sequence)
        tokens += [0] * pad_len
        pad_masks = [1] * len(input_sequence) + [0] * pad_len
        segment_ids = [0] * max_len
        
        all_tokens.append(tokens)
        all_masks.append(pad_masks)
        all_segments.append(segment_ids)
    
    return np.array(all_tokens), np.array(all_masks), np.array(all_segments)
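
To make the truncation, padding and mask logic of `bert_encode` easier to follow, here is a small sketch that runs the same steps with a toy tokenizer. `ToyTokenizer` and `toy_encode` are hypothetical stand-ins: the whitespace tokenizer and its token ids are assumptions for illustration only and do not match the real BERT vocabulary.

```python
import numpy as np

class ToyTokenizer:
    """Hypothetical whitespace tokenizer; NOT the real tokenization.FullTokenizer."""
    def __init__(self):
        # Illustrative ids only; the real BERT vocabulary uses different values.
        self.vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2}

    def tokenize(self, text):
        return text.lower().split()

    def convert_tokens_to_ids(self, tokens):
        # Assign the next free id to any token not seen before.
        return [self.vocab.setdefault(t, len(self.vocab)) for t in tokens]

def toy_encode(texts, tokenizer, max_len=6):
    """Same steps as bert_encode (first=True branch), on a tiny max_len."""
    all_ids, all_masks = [], []
    for text in texts:
        tokens = tokenizer.tokenize(text)[:max_len - 2]   # room for [CLS] and [SEP]
        sequence = ["[CLS]"] + tokens + ["[SEP]"]
        pad_len = max_len - len(sequence)
        ids = tokenizer.convert_tokens_to_ids(sequence) + [0] * pad_len
        mask = [1] * len(sequence) + [0] * pad_len        # 1 = real token, 0 = padding
        all_ids.append(ids)
        all_masks.append(mask)
    return np.array(all_ids), np.array(all_masks)

toy_ids, toy_masks = toy_encode(
    ["The cat sat", "One two three four five six seven"], ToyTokenizer()
)
print(toy_ids)
print(toy_masks)
```

The short text is padded with zeros and its mask ends in 0, while the long text is cut to `max_len - 2` tokens so that every row has exactly `max_len` entries.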

In [None]:
def build_model(bert_layer, max_len=205):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    segment_ids = Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")

    _, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
    clf_output = sequence_output[:, 0, :]
    out = Dense(1, activation='linear')(clf_output)
    
    model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
    # `lr` is a deprecated alias in tf.keras optimizers; use `learning_rate`
    model.compile(Adam(learning_rate=1e-5), loss='mean_squared_error', metrics=[RootMeanSquaredError()])
    
    return model

In [None]:
%%time

train_input = bert_encode(train.excerpt.values, tokenizer, first=True)
test_input = bert_encode(test.excerpt.values, tokenizer, first=True)

In [None]:
train_labels = train.target.values
test_labels = test.target.values

In [None]:
model = build_model(bert_layer, max_len=205)
model.summary()

In [None]:
# save the weights of the epoch with the lowest validation loss
checkpoint_first = ModelCheckpoint('model.h5', monitor='val_loss', save_best_only=True)
train_history_first = model.fit(
    train_input, train_labels,
    validation_data=(test_input, test_labels),
    epochs=10,
    callbacks=[checkpoint_first],
    batch_size=8,
)
# predictions come from the model state after the final epoch
test_pred_first = model.predict(test_input)

Model development is now complete. Next, we look at the predictions the model generated on the validation set.

In [None]:
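# np.sqrt of the MSE avoids the `squared=False` argument, which newer scikit-learn releases no longer accept
print("validation rmse is", np.sqrt(mean_squared_error(test.target, test_pred_first)))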
plt.hist(test_pred_first, bins = 50)
plt.title("Distribution of predictions on validation set")
plt.show()
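
For reference, the validation RMSE is the square root of the mean squared difference between targets and predictions. The sketch below computes it by hand with NumPy on made-up numbers; the helper name `rmse` is an assumption for illustration. Note the `ravel()` calls: `model.predict` returns an array of shape `(n, 1)`, and subtracting it from a `(n,)` array without flattening would broadcast to `(n, n)`.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, flattening both inputs to 1-D first."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up values; the predictions mimic the (n, 1) shape returned by model.predict.
toy_true = [0.0, -1.0, -2.0]
toy_pred = [[0.5], [-1.0], [-1.5]]
print(rmse(toy_true, toy_pred))
```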

Now we extract features from the penultimate layer of the model we trained, that is, the layer feeding the final dense layer (here the `[CLS]` token embedding).

In [None]:
extract = Model(model.inputs, model.layers[-2].output)
features_last = extract.predict(test_input)
features_last.shape

Next, we run t-SNE on these features to visualize the embeddings in a lower dimension. We compute two t-SNE components and plot them together with the target variable in a 3D scatter plot.

In [None]:
tsne = TSNE(n_components=2 , random_state=0)
data_tsne = tsne.fit_transform(features_last)

data_tsne

data_tsne = pd.DataFrame(data_tsne , columns=['tsne1' , 'tsne2'])
data_tsne.head()

data_tsne["target"] = test["target"].values

traces = go.Scatter3d(
    x=data_tsne['tsne1'],
    y=data_tsne['tsne2'],
    z=data_tsne['target'],
    mode='markers',
    marker=dict(
        size=4,
        opacity=0.2,
        colorscale='Viridis',
     )
)

layout = go.Layout(
    autosize=True,
    showlegend=True,
    width=800,
    height=1000,
)

FigureWidget(data=[traces], layout=layout)

Now we need to look carefully while rotating the plot: do we see a linear pattern between the two t-SNE components and the target variable from any plane? If the relationship is non-linear in places, it may be worth using a [tree based model](https://www.kaggle.com/maunish/clrp-roberta-svm) on top of these embeddings. We also check whether texts with high and low standard errors are separated in this space. We first create an indicator in the test data that flags whether a text has a high or a low standard error. The threshold is the mean standard error of the training set, which avoids information leakage.

In [None]:
test['std'] = np.where(test['standard_error']>train.standard_error.mean(), 1, 0)
test['std'].value_counts()
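
The linearity question raised above can also be checked numerically rather than only by rotating the plot. The sketch below fits an ordinary least-squares plane to two coordinates and reports R²; the helper `linear_r2` and the synthetic data are assumptions for illustration, and since t-SNE axes have no global meaning, a result like this should be read as a rough hint only.

```python
import numpy as np

def linear_r2(X, y):
    """R^2 of an ordinary least-squares fit of y on the columns of X plus an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([X, np.ones(len(X))])      # add intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

# Synthetic coordinates standing in for the two t-SNE components.
rng = np.random.default_rng(0)
toy_xy = rng.normal(size=(200, 2))
toy_linear = 2.0 * toy_xy[:, 0] - toy_xy[:, 1] + rng.normal(scale=0.1, size=200)
toy_curved = toy_xy[:, 0] ** 2 + rng.normal(scale=0.1, size=200)

print("linear target R^2:", linear_r2(toy_xy, toy_linear))
print("curved target R^2:", linear_r2(toy_xy, toy_curved))
```

With the notebook's own variables, the call would be `linear_r2(data_tsne[['tsne1', 'tsne2']].values, data_tsne['target'].values)`. A low R² would suggest that a non-linear model on top of the embeddings is worth trying.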

In [None]:
traces = go.Scatter3d(
    x=data_tsne['tsne1'],
    y=data_tsne['tsne2'],
    z=data_tsne['target'],
    mode='markers',
    marker=dict(
        size=4,
        opacity=0.2,
        colorscale='Viridis',
        color = test['std'].values
     )
)

layout = go.Layout(
    autosize=True,
    showlegend=True,
    width=800,
    height=1000,
)

FigureWidget(data=[traces], layout=layout)

In the plot above, we simply coloured the points of the first plot. Yellow points correspond to texts with a high standard error. Do the yellow and non-yellow points follow a pattern, or are they scattered randomly with respect to the x and y axes? If they are scattered randomly, that would be good news, because it would mean the BERT embedding space treats high and low standard error texts alike when predicting the target. However, if we look closely, we do see a pattern. If we keep the z axis (the target column) vertical with respect to the screen, the yellow points are more concentrated towards the two tails of the z axis. This is expected: as observed earlier in many EDAs, the target and the standard error in this data have a U-shaped relationship. That is, texts with very high or very low target values tend to have high standard errors.
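
One way to check such a U-shaped relationship without a plot is to bin the target into quantiles and compare the mean standard error per bin. The sketch below does this on synthetic data that is generated to be U-shaped by construction; the numbers are made up for illustration and are not taken from the competition data.

```python
import numpy as np
import pandas as pd

# Synthetic data: the standard error grows with distance from the centre of the target range.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"target": rng.uniform(-3.5, 1.5, size=1000)})
toy["standard_error"] = (
    0.45 + 0.03 * (toy["target"] + 1.0) ** 2 + rng.normal(scale=0.01, size=1000)
)

# Five equal-sized target bins, labelled 0 (lowest targets) to 4 (highest targets).
toy["target_bin"] = pd.qcut(toy["target"], q=5, labels=False)
by_bin = toy.groupby("target_bin")["standard_error"].mean()
print(by_bin)
```

A U shape shows up as higher means in the first and last bins than in the middle one. Replacing `toy` with the training frame would apply the same check to the real `target` and `standard_error` columns.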