<a href="https://colab.research.google.com/github/shebbar27/bert-test/blob/main/BertTest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install transformers



In [None]:
# Inputs taken from the training dataset.
input_from_train_ds = [['Raise the ruby cup'],
                       ['Fill a little into the cobalt basin'],
                       ['Elevate the indigo grail'],
                       ['Fill a little into the meager rectangular cardinal basin']]

# Unseen inputs that are similar to the training inputs.
input_from_unseen_ds_similar = [['Choose the ruby cup'],
                                ['Fill the cobalt basin slightly'],
                                ['The indigo grail should be lifted'],
                                ['Unload a little into the meager rectangular cardinal basin']]

# Unseen inputs that differ from the training inputs.
input_from_unseen_ds_different = [['Drop the blue cup'],
                                  ['Throw the red basin far away'],
                                  ['The green box should be thrown away'],
                                  ['Fill a lot into the circular bowl']]

# Alternative set of unrelated sentences; uncomment to use it instead.
# input_from_unseen_ds_different = [['Hope this model works fine'],
#                                   ['The distance should be measured correctly'],
#                                   ['How to measure 3D vector distance'],
#                                   ['Are we using the model properly?']]

In [None]:
import numpy as np
import sklearn.metrics.pairwise as pairwise
from tensorflow.python.ops.numpy_ops import np_config
from transformers import BertTokenizer, TFBertModel

# Let TensorFlow tensors support NumPy-style methods such as .transpose().
np_config.enable_numpy_behavior()

# Every input is padded to this many tokens so all embeddings share one shape.
LANGUAGE_TOKEN_MAX_LENGTH = 15

bertModel = TFBertModel.from_pretrained("bert-base-cased",
                                        output_hidden_states=True)
bertModel.summary()

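# Load the tokenizer once instead of on every call.
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

def tokenizeForBert(language):
  # Pad (and truncate) every input to LANGUAGE_TOKEN_MAX_LENGTH tokens.
  # No attention mask is returned, so the model also attends to the
  # padding tokens.
  bertTokens = tokenizer(
      language,
      return_tensors='np',
      add_special_tokens=False,
      return_attention_mask=False,
      return_token_type_ids=False,
      max_length=LANGUAGE_TOKEN_MAX_LENGTH,
      padding='max_length',
      truncation=True)
  return bertTokens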

def getBertOutput(tokens):
  output = bertModel(tokens)
  # Drop the batch dimension: the result has shape (tokens, hidden size).
  return output['last_hidden_state'][0]

def getBertEmbeddings(languageInputs):
  bertEmbeddings = []
  for language in languageInputs:
    tokens = tokenizeForBert(language)
    bertOutput = getBertOutput(tokens)
    bertEmbeddings.append(bertOutput)
  return bertEmbeddings

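def distanceBetweenVectors(vector1, vector2):
  # For 2-D inputs np.linalg.norm returns the Frobenius norm, i.e. the
  # Euclidean distance over all token positions and hidden dimensions.
  return np.linalg.norm(vector1 - vector2)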

def distanceBetweenBertEmbeddings(bertEmbeddings1, bertEmbeddings2):
  sentenceDistances = []
  for (array2d_1, array2d_2) in zip(bertEmbeddings1, bertEmbeddings2):
    distance = distanceBetweenVectors(array2d_1, array2d_2)
    sentenceDistances.append(distance)
  return sentenceDistances

def cosineSimilarityBetweenBertEmbeddings(bertEmbeddings1, bertEmbeddings2):
  sentenceSimilarity = []
  for (array2d_1, array2d_2) in zip(bertEmbeddings1, bertEmbeddings2):
    # Note: after the transpose each row is one hidden dimension across all
    # token positions, so the kernel matrix compares hidden dimensions with
    # each other rather than token vectors.
    kernelMatrix = pairwise.cosine_similarity(array2d_1.transpose(), array2d_2.transpose())
    averageSimilarity = np.mean(kernelMatrix)
    sentenceSimilarity.append(averageSimilarity)
  return sentenceSimilarity

Some layers from the model checkpoint at bert-base-cased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.
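
The tokenizer call above pads every input to `LANGUAGE_TOKEN_MAX_LENGTH` and does not return an attention mask, so padding positions contribute to the embeddings like real tokens. The following is a minimal NumPy sketch of one common alternative, mask-aware mean pooling, shown on toy data. The function name `masked_mean_pool` and the toy arrays are hypothetical and not part of this notebook; in practice the mask would come from the tokenizer with `return_attention_mask=True`.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors, skipping positions where attention_mask is 0.

    hidden_states: array of shape (tokens, hidden size)
    attention_mask: array of shape (tokens,), 1 for real tokens, 0 for padding
    """
    hidden_states = np.asarray(hidden_states, dtype=np.float64)
    mask = np.asarray(attention_mask, dtype=np.float64)[:, None]  # (tokens, 1)
    token_count = max(mask.sum(), 1.0)  # guard against an all-padding input
    return (hidden_states * mask).sum(axis=0) / token_count

# Toy example: 4 token slots, hidden size 3; the last two slots are padding.
hidden = np.array([[1.0, 2.0, 3.0],
                   [3.0, 4.0, 5.0],
                   [9.0, 9.0, 9.0],
                   [9.0, 9.0, 9.0]])
mask = np.array([1, 1, 0, 0])
pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

Only the two unmasked rows enter the average, so the padding rows do not affect the pooled vector.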


Model: "tf_bert_model_21"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  108310272 
                                                                 
Total params: 108,310,272
Trainable params: 108,310,272
Non-trainable params: 0
_________________________________________________________________
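
`cosineSimilarityBetweenBertEmbeddings` above averages a kernel matrix computed on transposed embeddings. As a point of comparison, here is a hedged sketch of a different, commonly used measure: mean-pool each `(tokens, hidden size)` embedding into one sentence vector, then take the cosine similarity of the two vectors. The helper name `sentence_cosine_similarity` is hypothetical, and the random arrays merely stand in for BERT outputs of shape `(15, 768)`.

```python
import numpy as np

def sentence_cosine_similarity(embedding_a, embedding_b):
    """Cosine similarity between the mean-pooled rows of two 2-D arrays."""
    a = np.asarray(embedding_a, dtype=np.float64).mean(axis=0)
    b = np.asarray(embedding_b, dtype=np.float64).mean(axis=0)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0  # similarity is undefined for a zero vector; report 0
    return float(np.dot(a, b) / denom)

# Stand-in data with the same shape as one embedding in this notebook.
rng = np.random.default_rng(0)
emb = rng.normal(size=(15, 768))

same = sentence_cosine_similarity(emb, emb)       # identical inputs
opposite = sentence_cosine_similarity(emb, -emb)  # sign-flipped inputs
print(same, opposite)
```

With this measure, identical embeddings score 1 and sign-flipped embeddings score -1, which makes the values easier to interpret as a similarity.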


In [None]:
output_of_train_ds = getBertEmbeddings(input_from_train_ds)
output_of_unseen_ds_similar = getBertEmbeddings(input_from_unseen_ds_similar)
output_of_unseen_ds_different = getBertEmbeddings(input_from_unseen_ds_different)

embeddingDistancesSimilar = distanceBetweenBertEmbeddings(output_of_train_ds, output_of_unseen_ds_similar)
cosineSimilarity = cosineSimilarityBetweenBertEmbeddings(output_of_train_ds, output_of_unseen_ds_similar)
print("Bert Embedding distances for similar language inputs: ")
for i in range(len(input_from_train_ds)):
  print(f"Trained Input: {input_from_train_ds[i]}")
  print(f"Unseen Input Similar: {input_from_unseen_ds_similar[i]}")
  print(f"Vector distance between Bert Embeddings: {embeddingDistancesSimilar[i]}")
  print(f"Pairwise cosine similarity between Bert Embeddings: {cosineSimilarity[i]}")
  print()

embeddingDistancesDifferent = distanceBetweenBertEmbeddings(output_of_train_ds, output_of_unseen_ds_different)
cosineSimilarity = cosineSimilarityBetweenBertEmbeddings(output_of_train_ds, output_of_unseen_ds_different)
print("Bert Embedding distances for different language inputs: ")
for i in range(len(input_from_train_ds)):
  print(f"Trained Input: {input_from_train_ds[i]}")
  print(f"Unseen Input Different: {input_from_unseen_ds_different[i]}")
  print(f"Vector distance between Bert Embeddings: {embeddingDistancesDifferent[i]}")
  print(f"Pairwise cosine similarity between Bert Embeddings: {cosineSimilarity[i]}")
  print()

Bert Embedding distances for similar language inputs: 
Trained Input: ['Raise the ruby cup']
Unseen Input Similar: ['Choose the ruby cup']
Vector distance between Bert Embeddings: 22.88295555114746
Pairwise cosine similarity between Bert Embeddings: 4.6046403440413997e-05

Trained Input: ['Fill a little into the cobalt basin']
Unseen Input Similar: ['Fill the cobalt basin slightly']
Vector distance between Bert Embeddings: 60.26710891723633
Pairwise cosine similarity between Bert Embeddings: 7.5678457506001e-05

Trained Input: ['Elevate the indigo grail']
Unseen Input Similar: ['The indigo grail should be lifted']
Vector distance between Bert Embeddings: 33.4623908996582
Pairwise cosine similarity between Bert Embeddings: 0.00027569176745601

Trained Input: ['Fill a little into the meager rectangular cardinal basin']
Unseen Input Similar: ['Unload a little into the meager rectangular cardinal basin']
Vector distance between Bert Embeddings: 35.540931701660156
Pairwise cosine similarity