
In [1]:
!pip install transformers



In [2]:
!pip install tflite-runtime

Collecting tflite-runtime
  Downloading tflite_runtime-2.14.0-cp310-cp310-manylinux2014_x86_64.whl.metadata (1.4 kB)
Downloading tflite_runtime-2.14.0-cp310-cp310-manylinux2014_x86_64.whl (2.4 MB)
Installing collected packages: tflite-runtime
Successfully installed tflite-runtime-2.14.0


In [3]:
import re
import string
import numpy as np
import tensorflow as tf
from transformers import BertTokenizerFast, TFAutoModel
from sklearn.metrics.pairwise import cosine_similarity
import tflite_runtime.interpreter as tflite

In [4]:
class TokenSimilarity:
    """Cosine similarity between mean-pooled transformer embeddings of two texts."""

    def load_pretrained(self, from_pretrained:str="indobenchmark/indobert-lite-base-p2"):
        self.tokenizer = BertTokenizerFast.from_pretrained(from_pretrained)
        self.model = TFAutoModel.from_pretrained(from_pretrained)

    def __cleaning(self, text:str):
        # strip punctuation
        text = text.translate(str.maketrans('', '', string.punctuation))

        # collapse runs of whitespace into a single space
        text = re.sub(r'\s+', ' ', text).strip()

        return text

    def __process(self, first_token:str, second_token:str):
        inputs = self.tokenizer([first_token, second_token],
                                max_length=self.max_length,
                                truncation=self.truncation,
                                padding=self.padding,
                                return_tensors='tf')

        attention = inputs["attention_mask"]

        outputs = self.model(**inputs)

        # last hidden state: one embedding vector per token
        embeddings = outputs[0]

        # zero out the embeddings of padded positions
        mask = tf.cast(tf.expand_dims(attention, axis=-1), tf.float32)
        masked_embeddings = embeddings * mask

        # mean pooling over the token axis (axis=1), counting real tokens only
        summed = tf.reduce_sum(masked_embeddings, axis=1)
        counts = tf.reduce_sum(mask, axis=1)
        mean_pooled = summed / tf.maximum(counts, 1e-9)  # guard against division by zero

        return mean_pooled.numpy()

    def predict(self, first_token:str, second_token:str,
                return_as_embeddings:bool=False, max_length:int=16,
                truncation:bool=True, padding:str="max_length"):
        self.max_length = max_length
        self.truncation = truncation
        self.padding = padding

        first_token = self.__cleaning(first_token)
        second_token = self.__cleaning(second_token)

        mean_pooled_arr = self.__process(first_token, second_token)
        if return_as_embeddings:
            return mean_pooled_arr

        # calculate similarity
        similarity = np.squeeze(cosine_similarity([mean_pooled_arr[0]], [mean_pooled_arr[1]]))

        return similarity
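
The pooling and similarity steps inside `TokenSimilarity` can be checked by hand on a tiny array. The following is a minimal NumPy sketch, not the class itself: `masked_mean_pool`, `cosine`, and the toy values are illustrative names and data chosen here, and no model is involved.

```python
import numpy as np

def masked_mean_pool(embeddings, attention_mask):
    """Average token vectors over real tokens only (mask == 1)."""
    mask = attention_mask[..., np.newaxis].astype(np.float32)  # (batch, seq, 1)
    summed = (embeddings * mask).sum(axis=1)                   # (batch, hidden)
    counts = np.maximum(mask.sum(axis=1), 1e-9)                # avoid division by zero
    return summed / counts

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy input: 2 sequences, 3 token positions, hidden size 2.
# The third position of the first sequence is padding and must be ignored.
embeddings = np.array([[[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]],
                       [[0.0, 2.0], [0.0, 4.0], [0.0, 6.0]]], dtype=np.float32)
attention_mask = np.array([[1, 1, 0],
                           [1, 1, 1]])

pooled = masked_mean_pool(embeddings, attention_mask)
print(pooled)                        # rows: [2, 0] and [0, 4]
print(cosine(pooled[0], pooled[1]))  # orthogonal vectors, so 0.0
```

Without the mask, the padded vector `[9, 9]` would be averaged into the first row and shift the similarity, which is why the mask is applied before summing.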

In [5]:
model = TokenSimilarity()
model.load_pretrained()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'AlbertTokenizer'. 
The class this function is called from is 'BertTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'AlbertTokenizer'. 
The class this function is called from is 'BertTokenizerFast'.


Some layers from the model checkpoint at indobenchmark/indobert-lite-base-p2 were not used when initializing TFAlbertModel: ['sop_classifier', 'predictions']
- This IS expected if you are initializing TFAlbertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFAlbertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFAlbertModel were initialized from the model checkpoint at indobenchmark/indobert-lite-base-p2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFAlbertModel for predictions without further training.


In [9]:
first_token = "Kucing"
second_token = "Anjing"

In [10]:
similarity = model.predict(first_token, second_token)
print(similarity)

0.95563614


In [8]:
converter = tf.lite.TFLiteConverter.from_keras_model(model.model)

In [None]:
tflite_model = converter.convert()

In [None]:
with open('text_similarity.tflite', 'wb') as f:
    f.write(tflite_model)

In [None]:
interpreter = tflite.Interpreter(model_path='text_similarity.tflite')
interpreter.allocate_tensors()
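
Once the interpreter is loaded, inference follows the usual set-tensor, invoke, get-tensor cycle. The helper below is a hedged sketch of that cycle: `run_tflite` is a hypothetical name, and matching inputs by a substring of the tensor name is an assumption, because the names and order of the converted model's inputs depend on the conversion. Inspect `interpreter.get_input_details()` before relying on it.

```python
import numpy as np

def run_tflite(interpreter, feeds):
    """Feed named arrays to a TFLite interpreter and return all of its outputs.

    `feeds` maps a substring of an input tensor's name (for example
    "input_ids") to the array for that input. This naming is an assumption;
    check get_input_details() for the names your converted model exposes.
    """
    matched = []
    for detail in interpreter.get_input_details():
        key = next((k for k in feeds if k in detail["name"]), None)
        if key is None:
            raise KeyError(f"no array supplied for input {detail['name']!r}")
        array = np.asarray(feeds[key], dtype=detail["dtype"])
        # Converted models often have dynamic shapes, so resize to the batch at hand.
        interpreter.resize_tensor_input(detail["index"], array.shape)
        matched.append((detail["index"], array))

    # Resizing invalidates earlier allocations, so allocate again before writing.
    interpreter.allocate_tensors()
    for index, array in matched:
        interpreter.set_tensor(index, array)

    interpreter.invoke()
    return [interpreter.get_tensor(d["index"])
            for d in interpreter.get_output_details()]
```

A possible call, assuming the tokenizer is invoked with `return_tensors='np'`, is `run_tflite(interpreter, dict(inputs))`. Which element of the returned list holds the last hidden state is not guaranteed, so check the output shapes before mean pooling.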