Reference link: https://keras.io/examples/nlp/semantic_similarity_with_keras_nlp/

### TODO Recording:

- Please upload the 3 files for training, test, and validation to Colab

In [None]:
!pip install tensorflow-text

In [None]:
!pip install keras-nlp

### TODO Recording:

- Please restart the kernel

In [None]:
import numpy as np
import pandas as pd

import keras
import keras_nlp
import tensorflow as tf

Loading training, validation, and testing data

- https://www.kaggle.com/datasets/thedevastator/nli-dataset-for-sentence-understanding?select=mrpc_train.csv
- https://www.kaggle.com/datasets/thedevastator/nli-dataset-for-sentence-understanding?select=mrpc_validation.csv
- https://www.kaggle.com/datasets/thedevastator/nli-dataset-for-sentence-understanding?select=mrpc_test.csv

Dataset info:

https://www.tensorflow.org/datasets/catalog/glue#gluemrpc


In [None]:
mrpc_train_data = pd.read_csv("mrpc_train.csv")

mrpc_valid_data = pd.read_csv("mrpc_validation.csv")

mrpc_test_data = pd.read_csv("mrpc_test.csv")

### TODO Recording:

- Click on the table next to the data frame results to see the entire text

In [None]:
mrpc_train_data.sample(5)

Unnamed: 0,sentence1,sentence2,label,idx
1410,"At the very long end , the 30-year bond US30YT...","At the very long end , the 30-year bond US30YT...",1,1571
1939,"Sean Harrigan , president of the California Pu...","Sean Harrigan , president of the California Pu...",0,2152
1149,Intel Corp. narrowed its fourth-quarter sales ...,Intel Corp. narrowed its fourth-quarter sales ...,0,1281
1579,But Mr Kenny said his advice to Mr Hicks - if ...,"But Mr Kenny said his advice to David Hicks , ...",1,1758
2450,This is the first time in the United States th...,"Today , the experts will perform the United St...",1,2724


We inspect the training dataframe with `info()`. It has 3,668 rows, and no column contains null values.

In [None]:
mrpc_train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3668 entries, 0 to 3667
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   sentence1  3668 non-null   object
 1   sentence2  3668 non-null   object
 2   label      3668 non-null   int64 
 3   idx        3668 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 114.8+ KB


In [None]:
mrpc_valid_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 408 entries, 0 to 407
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   sentence1  408 non-null    object
 1   sentence2  408 non-null    object
 2   label      408 non-null    int64 
 3   idx        408 non-null    int64 
dtypes: int64(2), object(2)
memory usage: 12.9+ KB


Next, we check the class balance of the `label` column in each split.

In [None]:
mrpc_train_data["label"].value_counts()

1    2474
0    1194
Name: label, dtype: int64

In [None]:
mrpc_valid_data["label"].value_counts()

1    279
0    129
Name: label, dtype: int64

In [None]:
mrpc_test_data["label"].value_counts()

1    1147
0     578
Name: label, dtype: int64
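The counts above show that roughly two thirds of the pairs in every split carry label 1. As a rough sketch (the helper name `positive_share` is ours, not part of any library), the share of label 1 in each split can be computed directly from those counts. It also gives the accuracy that a trivial classifier would reach by always predicting 1, which is a useful floor when reading the model metrics later.

```python
# Label counts copied from the value_counts() outputs above.
label_counts = {
    "train": {1: 2474, 0: 1194},
    "validation": {1: 279, 0: 129},
    "test": {1: 1147, 0: 578},
}


def positive_share(counts):
    """Fraction of examples in a split that carry label 1."""
    return counts[1] / (counts[0] + counts[1])


# Hypothetical helper output: share of label 1 per split.
shares = {split: positive_share(c) for split, c in label_counts.items()}

for split, share in shares.items():
    print(f"{split}: {share:.3f} of pairs are labelled 1")
```

Because the classes are imbalanced, accuracy alone can look better than it is, which is one reason precision and recall are tracked as well.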

Next, we create the training, validation, and test datasets from the dataframes. By default, `keras_nlp.models.BertClassifier` will tokenize and pack
together raw strings using a `"[SEP]"` token during training.

In [None]:
train_ds = tf.data.Dataset.from_tensor_slices((
    (mrpc_train_data["sentence1"].values, mrpc_train_data["sentence2"].values),
     mrpc_train_data["label"].values
)).batch(batch_size = 32)

val_ds = tf.data.Dataset.from_tensor_slices((
    (mrpc_valid_data["sentence1"].values, mrpc_valid_data["sentence2"].values),
     mrpc_valid_data["label"].values
)).batch(batch_size = 32)

test_ds = tf.data.Dataset.from_tensor_slices((
    (mrpc_test_data["sentence1"].values, mrpc_test_data["sentence2"].values),
     mrpc_test_data["label"].values
)).batch(batch_size = 32)
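With a batch size of 32, each split breaks into a fixed number of batches, and the last batch is usually smaller because `batch()` keeps the remainder by default. The following sketch works this out from the row counts reported earlier (the test size of 1,725 is the sum of its two label counts); the helper name `batch_layout` is illustrative only.

```python
import math

BATCH_SIZE = 32

# Row counts taken from the info() and value_counts() outputs above.
split_sizes = {"train": 3668, "validation": 408, "test": 1725}


def batch_layout(num_examples, batch_size=BATCH_SIZE):
    """Return (number of batches, size of the final batch)."""
    num_batches = math.ceil(num_examples / batch_size)
    last_batch = num_examples - (num_batches - 1) * batch_size
    return num_batches, last_batch


layouts = {split: batch_layout(n) for split, n in split_sizes.items()}

for split, (num_batches, last_batch) in layouts.items():
    print(f"{split}: {num_batches} batches, last batch has {last_batch} rows")
```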

We can view one batch of training data, with both the sentence pairs and their labels. For example, the pair 'Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion ' and "Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 ." is not semantically equivalent. In contrast, the pair
'The stock rose $ 2.11 , or about 11 percent , to close Friday at $ 21.51 on the New York Stock Exchange .' and 'PG & E Corp. shares jumped $ 1.63 or 8 percent to $ 21.03 on the New York Stock Exchange on Friday .' is semantically equivalent.

In [None]:
next(iter(train_ds))

((<tf.Tensor: shape=(32,), dtype=string, numpy=
  array([b'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
         b"Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .",
         b'They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .',
         b'Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 .',
         b'The stock rose $ 2.11 , or about 11 percent , to close Friday at $ 21.51 on the New York Stock Exchange .',
         b'Revenue in the first quarter of the year dropped 15 percent from the same period a year earlier .',
         b'The Nasdaq had a weekly gain of 17.27 , or 1.2 percent , closing at 1,520.15 on Friday .',
         b'The DVD-CCA then appealed to the state Supreme Court .',
         b'That compared with $ 35.18 million , or 24 cents per share , in the y

The labels are converted to one-hot form for this binary classification task. The result for one batch is shown below.

In [None]:
def preprocess_label(text, label):

    one_hot_label = tf.one_hot(label, depth = 2)

    return text, one_hot_label

In [None]:
train_ds = train_ds.map(preprocess_label)

val_ds = val_ds.map(preprocess_label)

test_ds = test_ds.map(preprocess_label)

In [None]:
_, train_batch_labels = next(iter(train_ds))

train_batch_labels

<tf.Tensor: shape=(32, 2), dtype=float32, numpy=
array([[0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.]], dtype=float32)>
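To make the mapping concrete, here is a minimal NumPy sketch of the same one-hot encoding, assuming integer labels 0 and 1. It is only an illustration of what `tf.one_hot(label, depth = 2)` produces, not a replacement for it; the four sample labels correspond to the first four rows of the batch above.

```python
import numpy as np

# First four labels of the batch shown above (1, 0, 1, 0).
labels = np.array([1, 0, 1, 0])

# Row i of the 2x2 identity matrix is the one-hot vector for class i,
# so indexing with the label array encodes the whole batch at once.
one_hot = np.eye(2, dtype=np.float32)[labels]

# argmax along the class axis recovers the original integer labels.
recovered = one_hot.argmax(axis=1)

print(one_hot)
print(recovered)
```

Label 0 becomes `[1., 0.]` and label 1 becomes `[0., 1.]`, so the position of the 1 is the class index.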

We use the BERT model from KerasNLP to establish a baseline for our semantic similarity task. The keras_nlp.models.BertClassifier class attaches a classification head to the BERT backbone, mapping the backbone outputs to logits suitable for a classification task. This significantly reduces the need for custom code.

KerasNLP models have built-in tokenization capabilities that handle tokenization by default based on the selected model. However, users can also use custom preprocessing techniques as per their specific needs. If we pass a tuple as input, the model will tokenize all the strings and concatenate them with a "[SEP]" separator.

We use this model with pretrained weights, loaded through the from_preset() method, which can also be given our own preprocessor. For the MRPC dataset, we set num_classes to 2.
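The tuple packing described above can be pictured with a toy example. This is only a sketch: it splits on whitespace, whereas the real BERT preprocessor uses a WordPiece vocabulary and also pads or truncates to a fixed sequence length. The function name `pack_pair` is hypothetical.

```python
def pack_pair(first, second, cls="[CLS]", sep="[SEP]"):
    """Pack two sentences into one BERT-style token sequence.

    Toy whitespace tokenization, used here only to show the layout:
    [CLS] first-sentence tokens [SEP] second-sentence tokens [SEP]
    """
    first_tokens = first.lower().split()
    second_tokens = second.lower().split()
    tokens = [cls] + first_tokens + [sep] + second_tokens + [sep]
    # Segment ids mark which sentence each position belongs to.
    segment_ids = [0] * (len(first_tokens) + 2) + [1] * (len(second_tokens) + 1)
    return tokens, segment_ids


tokens, segment_ids = pack_pair("The stock rose", "Shares jumped")
print(tokens)
print(segment_ids)
```

The segment ids let the model tell the two sentences apart even though they arrive as a single sequence.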

In [None]:
bert_classifier = keras_nlp.models.BertClassifier.from_preset(
    "bert_tiny_en_uncased", num_classes = 2
)

bert_classifier.summary()

  return id(getattr(self, attr)) not in self._functional_layer_ids
Downloading from https://www.kaggle.com/api/v1/models/keras/bert/keras/bert_tiny_en_uncased/2/download/assets/tokenizer/vocabulary.txt...
100%|██████████| 226k/226k [00:00<00:00, 1.06MB/s]


Because the default compilation parameters do not match our setup (one-hot labels with a binary cross-entropy loss), we compile the model explicitly.

In [None]:
bert_classifier.compile(
    loss = keras.losses.BinaryCrossentropy(from_logits = True),
    optimizer = tf.keras.optimizers.Adam(5e-5),
    metrics = [tf.keras.metrics.BinaryAccuracy(), tf.keras.metrics.Precision(), tf.keras.metrics.Recall()],
)
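For intuition about this loss, the sketch below reimplements binary cross-entropy on raw logits in NumPy, using the usual numerically stable form. It assumes the loss is averaged over all elements, which matches averaging over the class axis and then over the batch when every row has the same width; treat it as an illustration rather than the library's code. The name `bce_from_logits` is ours.

```python
import numpy as np


def bce_from_logits(y_true, logits):
    """Mean binary cross-entropy computed directly from logits.

    Uses max(x, 0) - x * z + log(1 + exp(-|x|)), which equals
    -[z * log(sigmoid(x)) + (1 - z) * log(1 - sigmoid(x))]
    but avoids overflow for large |x|.
    """
    y_true = np.asarray(y_true, dtype=np.float64)
    logits = np.asarray(logits, dtype=np.float64)
    per_element = (
        np.maximum(logits, 0.0)
        - logits * y_true
        + np.log1p(np.exp(-np.abs(logits)))
    )
    return per_element.mean()


# One row with one-hot target [0, 1] and two illustrative logits.
example_loss = bce_from_logits([[0.0, 1.0]], [[-1.4, 1.3]])
print(example_loss)
```

With one-hot targets of depth 2, this loss treats the two outputs as two independent yes/no problems. A softmax with categorical cross-entropy would be the other common choice for the same labels.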

Note that the BERT Tiny model is small, with only about 4.4 million trainable parameters.

KerasNLP task models come with compilation defaults, which we have overridden above. We can now train the model we just instantiated by calling the fit() method.

We train the model for 3 epochs.

In [None]:
bert_classifier.fit(train_ds, validation_data = val_ds, epochs = 3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.src.callbacks.History at 0x7e2c8033c310>

Evaluating the performance of the trained model on test data.

In [None]:
bert_classifier.evaluate(test_ds)



[0.5738809704780579, 0.717391312122345, 0.75, 0.6521739363670349]
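`evaluate()` returns the loss first, followed by the metrics in the order they were passed to `compile()`. A small sketch that labels the values above and derives an F1 score from the reported precision and recall (the dictionary keys are descriptive names chosen here):

```python
# Values copied from the evaluate() output above.
test_results = [0.5738809704780579, 0.717391312122345, 0.75, 0.6521739363670349]

# Loss first, then metrics in compile() order; key names are our own labels.
metric_names = ["loss", "binary_accuracy", "precision", "recall"]
report = dict(zip(metric_names, test_results))


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


report["f1"] = f1_score(report["precision"], report["recall"])

for name, value in report.items():
    print(f"{name}: {value:.4f}")
```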

Saving the model raises a warning because the model was compiled with parameters that differ from the defaults. After loading, we therefore compile the restored model with the same parameters before evaluating it and making predictions.

In [None]:
bert_classifier.save("bert_classifier.keras")

  return id(getattr(self, attr)) not in self._functional_layer_ids


In [None]:
restored_model = keras.models.load_model("bert_classifier.keras")



In [None]:
restored_model.compile(
    loss = keras.losses.BinaryCrossentropy(from_logits = True),
    optimizer = tf.keras.optimizers.Adam(5e-5),
    metrics = [tf.keras.metrics.BinaryAccuracy(), tf.keras.metrics.Precision(), tf.keras.metrics.Recall()],
)

In [None]:
restored_model.evaluate(test_ds)



[0.5738809704780579, 0.717391312122345, 0.75, 0.6521739363670349]

## Performing inference with the model.

Let's see how to perform inference with KerasNLP models.

In [None]:
(test_sample_sen1, test_sample_sen2), test_sample_labels = next(iter(test_ds))

Pack the two sentence tensors into a (sentence 1, sentence 2) tuple for the forward pass through the model.

In [None]:
sample = (test_sample_sen1, test_sample_sen2)

sample

(<tf.Tensor: shape=(32,), dtype=string, numpy=
 array([b"PCCW 's chief operating officer , Mike Butcher , and Alex Arena , the chief financial officer , will report directly to Mr So .",
        b"The world 's two largest automakers said their U.S. sales declined more than predicted last month as a late summer sales frenzy caused more of an industry backlash than expected .",
        b'According to the federal Centers for Disease Control and Prevention ( news - web sites ) , there were 19 reported cases of measles in the United States in 2002 .',
        b'A tropical storm rapidly developed in the Gulf of Mexico Sunday and was expected to hit somewhere along the Texas or Louisiana coasts by Monday night .',
        b"The company didn 't detail the costs of the replacement and repairs .",
        b'The settling companies would also assign their possible claims against the underwriters to the investor plaintiffs , he added .',
        b'Air Commodore Quaife said the Hornets remained on t

The default preprocessor in KerasNLP models handles input tokenization automatically,
so we don't need to perform tokenization explicitly.

In [None]:
predictions = bert_classifier.predict(sample)

predictions



array([[-1.4011934 ,  1.3473518 ],
       [-1.5093755 ,  1.5240192 ],
       [-1.5724782 ,  1.5054085 ],
       [-1.4722987 ,  1.4110419 ],
       [-0.7824017 ,  0.76240706],
       [-1.5758866 ,  1.5936251 ],
       [ 0.33846256, -0.14282188],
       [-0.21681271,  0.34818596],
       [-1.451922  ,  1.4988871 ],
       [ 1.3022469 , -1.369278  ],
       [-1.5541441 ,  1.5100056 ],
       [-0.99803305,  0.8793916 ],
       [-1.4549645 ,  1.3307006 ],
       [ 1.1166571 , -1.0091723 ],
       [-1.5689144 ,  1.592015  ],
       [-0.870552  ,  0.86480445],
       [-1.5461595 ,  1.533225  ],
       [-1.4628304 ,  1.5291264 ],
       [-1.5376155 ,  1.5209433 ],
       [-1.5864745 ,  1.5417418 ],
       [-1.4169122 ,  1.4196694 ],
       [-1.4820219 ,  1.4276307 ],
       [-1.4978031 ,  1.5042706 ],
       [-1.5844882 ,  1.5808759 ],
       [ 0.17856541,  0.02306567],
       [-1.5560848 ,  1.5541008 ],
       [-1.534619  ,  1.5727351 ],
       [-1.4987497 ,  1.5576782 ],
       [-1.3409597 ,

Applying the sigmoid activation on logits

In [None]:
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Squash each logit into the (0, 1) range; the class is picked later with argmax
predictions = sigmoid(predictions)

predictions

array([[0.1976268 , 0.7936963 ],
       [0.18103138, 0.82112956],
       [0.1718634 , 0.81837976],
       [0.18659346, 0.8039302 ],
       [0.31380248, 0.6818761 ],
       [0.17137884, 0.8311255 ],
       [0.583817  , 0.46435508],
       [0.44600815, 0.5861776 ],
       [0.18970595, 0.8174084 ],
       [0.7862129 , 0.20273653],
       [0.17448853, 0.819062  ],
       [0.26932833, 0.7066961 ],
       [0.1892387 , 0.7909565 ],
       [0.75336814, 0.26714185],
       [0.17237121, 0.8308995 ],
       [0.29513943, 0.70366347],
       [0.17564164, 0.8224777 ],
       [0.18803479, 0.8218785 ],
       [0.17688218, 0.8206774 ],
       [0.16988048, 0.82371783],
       [0.19514611, 0.8052865 ],
       [0.1851222 , 0.80653185],
       [0.18275341, 0.81821054],
       [0.1701608 , 0.8293285 ],
       [0.5445231 , 0.50576615],
       [0.17420918, 0.8255052 ],
       [0.17731889, 0.82817316],
       [0.18261208, 0.82602   ],
       [0.2073523 , 0.80490166],
       [0.62905943, 0.399458  ],
       [0.
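Note that the sigmoid is applied to each logit independently, so the two values in a row need not sum to 1. The sketch below contrasts this with a softmax over each row, using three logit rows copied from the predictions above. Both functions are monotonic in the logits, so the predicted class from argmax is the same either way.

```python
import numpy as np

# Three rows copied from the logits printed above.
logits = np.array([
    [-1.4011934, 1.3473518],
    [0.33846256, -0.14282188],
    [1.3022469, -1.369278],
])


def sigmoid(x):
    """Element-wise logistic function."""
    return 1 / (1 + np.exp(-x))


def softmax(x):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = x - x.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


sigmoid_scores = sigmoid(logits)
softmax_probs = softmax(logits)

print(sigmoid_scores.sum(axis=1))  # generally not 1
print(softmax_probs.sum(axis=1))   # 1 for every row
print(sigmoid_scores.argmax(axis=1), softmax_probs.argmax(axis=1))
```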

Comparing actual labels with predictions

In [None]:
y_pred = np.argmax(predictions, axis = 1)

y_pred

array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 1, 1])

In [None]:
test_sample_labels

<tf.Tensor: shape=(32, 2), dtype=float32, numpy=
array([[0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.]], dtype=float32)>

In [None]:
y_test = np.argmax(test_sample_labels, axis = 1)

y_test

array([1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 0])
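With both arrays in hand, the agreement on this single batch can be summarised as an accuracy and a small confusion count. The sketch below uses plain Python on the two arrays printed above. It covers only these 32 examples, so it will differ from the metrics on the full test set.

```python
# Arrays copied from the y_pred and y_test outputs above.
y_pred = [1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
y_test = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1,
          1, 1, 0, 1, 1, 1, 1, 1, 1, 0]

pairs = list(zip(y_test, y_pred))

# Confusion counts, treating label 1 as the positive class.
true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)
true_neg = sum(1 for t, p in pairs if t == 0 and p == 0)

batch_accuracy = (true_pos + true_neg) / len(pairs)

print(f"TP={true_pos} FP={false_pos} FN={false_neg} TN={true_neg}")
print(f"batch accuracy: {batch_accuracy:.3f}")
```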

In [None]:
import pandas as pd

df = pd.DataFrame()

df['Sentence 1'] = test_sample_sen1

df['Sentence 2'] = test_sample_sen2

df.sample(5)

Unnamed: 0,Sentence 1,Sentence 2
16,b'A charter plane crashed in Turkey on Monday ...,"b'A plane carrying 75 people , including 62 Sp..."
0,"b""PCCW 's chief operating officer , Mike Butch...",b'Current Chief Operating Officer Mike Butcher...
11,b'The increase reflects lower credit losses an...,b'The gain came as a result of fewer credit lo...
23,b'Federal agents said yesterday they are inves...,b'Federal investigators are looking for possib...
26,b'The man accused of using fake grenades to co...,b'A Cuban architect was sentenced to 20 years ...
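The `b'...'` prefixes appear because the sentences come out of the TensorFlow tensors as byte strings. If plain text is preferred for display, the bytes can be decoded as UTF-8. Below is a minimal sketch on two sentences copied from the outputs above; for a dataframe column, the pandas equivalent would be `Series.str.decode("utf-8")`.

```python
# Two byte strings copied from the batch outputs above.
raw_sentences = [
    b"The company didn 't detail the costs of the replacement and repairs .",
    b"The DVD-CCA then appealed to the state Supreme Court .",
]

# Decode each byte string into a regular Python str.
decoded = [sentence.decode("utf-8") for sentence in raw_sentences]

for sentence in decoded:
    print(sentence)
```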


### TODO Recording:

- Click on the table next to the dataframe results so we can see the entire sentences

In [None]:
df['y_test'] = y_test

df['y_pred'] = y_pred

df.sample(10)

Unnamed: 0,Sentence 1,Sentence 2,y_test,y_pred
27,"b'Moroccan police have arrested 33 suspects , ...","b'In a series of raids , Moroccan police arres...",1,1
6,b'Air Commodore Quaife said the Hornets remain...,b'Air Commodore John Quaife said the security ...,0,0
3,b'A tropical storm rapidly developed in the Gu...,b'A tropical storm rapidly developed in the Gu...,0,1
0,"b""PCCW 's chief operating officer , Mike Butch...",b'Current Chief Operating Officer Mike Butcher...,1,1
5,b'The settling companies would also assign the...,"b'Under the agreement , the settling companies...",1,1
9,"b""The broader Standard & Poor 's 500 Index < ....",b'The technology-laced Nasdaq Composite Index ...,0,0
22,"b'"" Senator Clinton should be ashamed of herse...","b'"" She should be ashamed of herself for playi...",1,1
24,"b'Last year , Comcast signed 1.5 million new d...",b'Comcast has about 21.3 million cable subscri...,0,0
18,"b'"" I \'m delighted that David Chase has decid...","b'"" I \'m delighted that David Chase has decid...",1,1
30,"b""Dynes will get $ 395,000 a year , up from At...","b""In his new position , Dynes will earn $ 395,...",1,1
