# Semantic Similarity with BERT


This notebook demonstrates semantic similarity, the task of determining how similar
two sentences are in terms of their meaning. It uses the SNLI (Stanford Natural Language Inference) Corpus to predict sentence semantic similarity with transformers.


In [None]:
# Install transformers library
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/3a/83/e74092e7f24a08d751aa59b37a9fc572b2e4af3918cb66f7766c3affb1b4/transformers-3.5.1-py3-none-any.whl (1.3MB)
Collecting tokenizers==0.9.3
  Downloading https://files.pythonhosted.org/packages/4c/34/b39eb9994bc3c999270b69c9eea40ecc6f0e97991dba28282b9fd32d44ee/tokenizers-0.9.3-cp36-cp36m-manylinux1_x86_64.whl (2.9MB)
Collecting sentencepiece==0.1.91
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)

In [None]:
# Import required libraries

import numpy as np
import pandas as pd
import tensorflow as tf
import transformers

## Configuration

In [None]:
# Define model parameters
max_length = 128  # Maximum number of tokens per encoded sentence pair
batch_size = 32
epoch = 2
label = ["contradiction", "entailment", "neutral"]
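
`max_length` bounds how many tokens are fed to BERT for each sentence pair: shorter inputs are padded and longer ones are cut. The sketch below illustrates that idea on plain lists. The helper `pad_or_truncate` and the token ids are hypothetical, and the real tokenizer additionally inserts special tokens and truncates pairs more carefully.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])   # cut anything past max_length
    mask = [1] * len(ids)                # 1 marks a real token
    padding = max_length - len(ids)
    # 0 in the mask marks a padding position
    return ids + [pad_id] * padding, mask + [0] * padding


# Hypothetical token ids, for illustration only.
ids, mask = pad_or_truncate([101, 2023, 2003, 102], max_length=6)
print(ids)   # [101, 2023, 2003, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

The attention mask produced this way is what lets the model ignore padded positions.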

## Loading the data

In [None]:
# Load the data from github
!curl -LO https://raw.githubusercontent.com/MohamadMerchant/SNLI/master/data.tar.gz
!tar -xvzf data.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 11.1M  100 11.1M    0     0  15.1M      0 --:--:-- --:--:-- --:--:-- 15.1M
SNLI_Corpus/
SNLI_Corpus/snli_1.0_dev.csv
SNLI_Corpus/snli_1.0_train.csv
SNLI_Corpus/snli_1.0_test.csv


## Dataset Overview

Sentence1: The premise caption.<br>
Sentence2: The hypothesis caption that was written by the author of the pair.<br>
Similarity: The label chosen by the majority of annotators.<br>
The "Similarity" label values in the dataset:<br>
Contradiction: The sentences share no similarity.<br>
Entailment: The sentences have similar meaning.<br>
Neutral: The sentences neither entail nor contradict each other.<br>

In [None]:
# Read data from files.
train_dataset = pd.read_csv("SNLI_Corpus/snli_1.0_train.csv", nrows=100000)
valid_dataset = pd.read_csv("SNLI_Corpus/snli_1.0_dev.csv")
test_dataset = pd.read_csv("SNLI_Corpus/snli_1.0_test.csv")

# Shape of the data
print("Total train samples : {}".format(train_dataset.shape[0]))
print("Total validation samples: {}".format(valid_dataset.shape[0]))
print("Total test samples: {}".format(test_dataset.shape[0]))

Total train samples : 100000
Total validation samples: 10000
Total test samples: 10000


In [None]:
print("Sentence1: {}".format(train_dataset.loc[1, 'sentence1']))
print("Sentence2: {}".format(train_dataset.loc[1, 'sentence2']))
print("Similarity: {}".format(train_dataset.loc[1, 'similarity']))

Sentence1: A person on a horse jumps over a broken down airplane.
Sentence2: A person is at a diner, ordering an omelette.
Similarity: contradiction


## Data Preprocessing

In [None]:
# Drop NaN values present in dataset
print("Number of missing values")
print(train_dataset.isnull().sum())
train_dataset.dropna(axis=0, inplace=True)

Number of missing values
similarity    0
sentence1     0
sentence2     3
dtype: int64


In [None]:
print("Train Target Distribution")
print(train_dataset.similarity.value_counts())

Train Target Distribution
entailment       33384
contradiction    33310
neutral          33193
-                  110
Name: similarity, dtype: int64


In [None]:
print("Validation Target Distribution")
print(valid_dataset.similarity.value_counts())

Validation Target Distribution
entailment       3329
contradiction    3278
neutral          3235
-                 158
Name: similarity, dtype: int64


In [None]:
train_dataset = (
    train_dataset[train_dataset.similarity != "-"]
    .sample(frac=1.0, random_state=42)
    .reset_index(drop=True)
)
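valid_dataset = (
    valid_dataset[valid_dataset.similarity != "-"]
    .sample(frac=1.0, random_state=42)
    .reset_index(drop=True)
)
# Filter the test set as well, so "-" rows are not silently encoded as "neutral"
test_dataset = (
    test_dataset[test_dataset.similarity != "-"]
    .sample(frac=1.0, random_state=42)
    .reset_index(drop=True)
)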

In [None]:
# One hot encoding
train_dataset["label"] = train_dataset["similarity"].apply(lambda x: 0 if x == "contradiction" else 1 if x == "entailment" else 2)
y_train = tf.keras.utils.to_categorical(train_dataset.label, num_classes=3)

valid_dataset["label"] = valid_dataset["similarity"].apply(lambda x: 0 if x == "contradiction" else 1 if x == "entailment" else 2)
y_val = tf.keras.utils.to_categorical(valid_dataset.label, num_classes=3)

test_dataset["label"] = test_dataset["similarity"].apply(lambda x: 0 if x == "contradiction" else 1 if x == "entailment" else 2)
y_test = tf.keras.utils.to_categorical(test_dataset.label, num_classes=3)
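
The label mapping and one-hot step above can be sketched without TensorFlow, which makes the resulting target layout easier to inspect. The names `LABEL_TO_ID` and `one_hot` are illustrative and not part of the notebook; the index order is the one used by the lambda above.

```python
import numpy as np

# Same index order as the lambda mapping: contradiction=0, entailment=1, neutral=2
LABEL_TO_ID = {"contradiction": 0, "entailment": 1, "neutral": 2}


def one_hot(labels, num_classes=3):
    """Map label names to integer ids, then to one-hot rows."""
    ids = np.array([LABEL_TO_ID[name] for name in labels])
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes, dtype="float32")[ids]


encoded = one_hot(["entailment", "contradiction", "neutral"])
print(encoded)
```

Unlike the lambda, a dictionary lookup raises a `KeyError` on an unexpected value such as `"-"` instead of quietly mapping it to class 2.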

## Custom data generator creation

In [None]:
class BertSemanticDataGenerator(tf.keras.utils.Sequence):
    def __init__(
        self,
        sentence_pair,
        label,
        batch_size=batch_size,
        shuffle=True,
        include_target=True,
    ):
        self.sentence_pair = sentence_pair
        self.label = label
        self.shuffle = shuffle
        self.batch_size = batch_size
        self.include_target = include_target
        # Use the bert-base-uncased pretrained vocabulary in the BERT tokenizer to encode the text
        self.tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)
        self.indexes = np.arange(len(self.sentence_pair))
        self.on_epoch_end()

    def __len__(self):
        # Number of batches per epoch
        return len(self.sentence_pair) // self.batch_size

    def __getitem__(self, idx):
        # Retrieve the sample indexes that belong to batch `idx`.
        index = self.indexes[idx * self.batch_size : (idx + 1) * self.batch_size]
        sentence_pair = self.sentence_pair[index]
        # Encode sentences using BERT tokenizer
        encoded = self.tokenizer.batch_encode_plus(
            sentence_pair.tolist(),
            add_special_tokens=True,
            max_length=max_length,
            return_attention_mask=True,
            return_token_type_ids=True,
            pad_to_max_length=True,
            return_tensors="tf",
        )
        # Conversion of encoded features to numpy array
        input_id = np.array(encoded["input_ids"], dtype="int32")
        attention_mask = np.array(encoded["attention_mask"], dtype="int32")
        token_type_id = np.array(encoded["token_type_ids"], dtype="int32")

        # Set to true if data generator is used for training/validation.
        if self.include_target:
            label = np.array(self.label[index], dtype="int32")
            return [input_id, attention_mask, token_type_id], label
        else:
            return [input_id, attention_mask, token_type_id]

    def on_epoch_end(self):
        # If shuffle is set to True, shuffle indexes after each epoch
        if self.shuffle:
            np.random.RandomState(42).shuffle(self.indexes)
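
The batching logic of the generator can be checked in isolation. The following is a minimal sketch with toy sizes (not the notebook's values) showing how the shuffled index array is sliced into batches, and that floor division in `__len__` leaves out any trailing samples that do not fill a whole batch.

```python
import numpy as np

num_samples, batch_size = 10, 4  # toy sizes, for illustration only
indexes = np.arange(num_samples)
np.random.RandomState(42).shuffle(indexes)  # same seeding as on_epoch_end

num_batches = num_samples // batch_size  # floor division, as in __len__
batches = [
    indexes[i * batch_size : (i + 1) * batch_size] for i in range(num_batches)
]

covered = sum(len(b) for b in batches)
print(num_batches, covered)  # 2 8
```

With 10 samples and a batch size of 4, two samples are skipped in that pass over the data.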

## Build the model

In [None]:
# Create the model under a distribution strategy scope
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Input ids are encoded token ids from BERT tokenizer
    input_ids = tf.keras.layers.Input(shape=(max_length,), dtype=tf.int32, name="input_ids")
    # Attention masks indicate which tokens should be attended to
    attention_masks = tf.keras.layers.Input(shape=(max_length,), dtype=tf.int32, name="attention_masks")
    # Token type ids are binary masks that identify which sequence each token belongs to
    token_type_ids = tf.keras.layers.Input(shape=(max_length,), dtype=tf.int32, name="token_type_ids")
    # Load pretrained BERT model
    bert_model = transformers.TFBertModel.from_pretrained("bert-base-uncased")
    bert_model.trainable = False

    sequence_op, pooled_op = bert_model(input_ids, attention_mask=attention_masks, token_type_ids=token_type_ids)
    # Add trainable layers to adapt the pretrained features on the new data
    bi_lstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True))(sequence_op)
    # Apply hybrid pooling approach to bi_lstm sequence output
    avg_pool = tf.keras.layers.GlobalAveragePooling1D()(bi_lstm)
    max_pool = tf.keras.layers.GlobalMaxPooling1D()(bi_lstm)
    concatenate = tf.keras.layers.concatenate([avg_pool, max_pool])
    dropout = tf.keras.layers.Dropout(0.3)(concatenate)
    op = tf.keras.layers.Dense(3, activation="softmax")(dropout)
    model = tf.keras.models.Model(inputs=[input_ids, attention_masks, token_type_ids], outputs=op)
    # Compile the model
    model.compile(
        optimizer=tf.keras.optimizers.Adam(),
        loss="categorical_crossentropy",
        metrics=["acc"],
    )

print("Strategy: {}".format(strategy))
model.summary()

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)


Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


Strategy: <tensorflow.python.distribute.mirrored_strategy.MirroredStrategy object at 0x7fab8cd54668>
Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_ids (InputLayer)          [(None, 128)]        0                                            
__________________________________________________________________________________________________
attention_masks (InputLayer)    [(None, 128)]        0                                            
__________________________________________________________________________________________________
token_type_ids (InputLayer)     [(None, 128)]        0                                            
__________________________________________________________________________________________________
tf_bert_model (TFBertModel)     ((None, 128, 768), ( 109482240   input_ids[0][0]     
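
The hybrid pooling step in the model takes both the average and the maximum over the time axis of the BiLSTM output and concatenates them. A minimal NumPy sketch of the shapes involved follows; the random array merely stands in for the BiLSTM output, and the batch size of 2 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the BiLSTM output: (batch, time steps, 2 * 64 units)
lstm_out = rng.normal(size=(2, 128, 128))

avg_pool = lstm_out.mean(axis=1)  # like GlobalAveragePooling1D -> (2, 128)
max_pool = lstm_out.max(axis=1)   # like GlobalMaxPooling1D -> (2, 128)
features = np.concatenate([avg_pool, max_pool], axis=-1)

print(features.shape)  # (2, 256)
```

The 256-dimensional vector per example is what the dropout and final dense layer operate on.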

## Train the Model

In [None]:
train_data = BertSemanticDataGenerator(
    train_dataset[["sentence1", "sentence2"]].values.astype("str"),
    y_train,
    batch_size=batch_size,
    shuffle=True,
)
valid_data = BertSemanticDataGenerator(
    valid_dataset[["sentence1", "sentence2"]].values.astype("str"),
    y_val,
    batch_size=batch_size,
    shuffle=False,
)

In [None]:
# Train model
history = model.fit(
    train_data,
    validation_data=valid_data,
    epochs=epoch,
    use_multiprocessing=True,
    workers=-1,
)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Epoch 1/2
Instructions for updating:
Use `tf.data.Iterator.get_next_as_optional()` instead.




INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
Epoch 2/2


## Fine-tuning

In [None]:
# Unfreeze the bert_model for finetuning
bert_model.trainable = True
# Recompile the model so that the change takes effect
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-5),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()

Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_ids (InputLayer)          [(None, 128)]        0                                            
__________________________________________________________________________________________________
attention_masks (InputLayer)    [(None, 128)]        0                                            
__________________________________________________________________________________________________
token_type_ids (InputLayer)     [(None, 128)]        0                                            
__________________________________________________________________________________________________
tf_bert_model (TFBertModel)     ((None, 128, 768), ( 109482240   input_ids[0][0]                  
                                                                 attention_masks[0][0] 
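
To get a sense of what unfreezing changes, the number of trainable parameters can be estimated from the layer shapes. This is a back-of-the-envelope sketch that assumes the standard Keras LSTM parameterization (four gates, one bias vector per gate); only the BERT figure is taken from the summary above.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * (input_dim * units + units * units + units)


bert_params = 109_482_240                 # from the model summary
bilstm_params = 2 * lstm_params(768, 64)  # forward + backward LSTM
dense_params = (2 * 2 * 64) * 3 + 3       # avg+max pooled features -> 3 classes
head_params = bilstm_params + dense_params

print(head_params)                # 427267
print(bert_params + head_params)  # 109909507
```

Under these assumptions, only the roughly 0.4M head parameters are updated while BERT is frozen, and all of the roughly 110M parameters once it is unfrozen, which is one reason for the much smaller learning rate of `1e-5`.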

## Train the entire model end-to-end

In [None]:
# Training the entire model
history = model.fit(
    train_data,
    validation_data=valid_data,
    epochs=epoch,
    use_multiprocessing=True,
    workers=-1,
)

Epoch 1/2
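
Once training has finished, the softmax outputs can be mapped back to the names in the `label` list from the configuration. The following is a minimal sketch; the `decode` helper and the probability values are hypothetical and are not results from this notebook.

```python
import numpy as np

label = ["contradiction", "entailment", "neutral"]


def decode(probabilities):
    """Return (label name, probability) for the most likely class of each row."""
    probabilities = np.asarray(probabilities)
    best = probabilities.argmax(axis=-1)
    return [(label[i], float(row[i])) for i, row in zip(best, probabilities)]


# Hypothetical softmax outputs for two sentence pairs
fake_predictions = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]]
decoded = decode(fake_predictions)
print(decoded)
```

In practice the input to `decode` would be the array returned by `model.predict` on a generator built with `include_target=False`.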




