### We will use three different models to test our data


##### They are:
- `multi-qa-distilbert-cos-v1` (SentenceTransformer)
- `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` (SentenceTransformer)
- `cross-encoder/ms-marco-MiniLM-L-6-v2` (CrossEncoder)

In [72]:
from sentence_transformers import SentenceTransformer
from sentence_transformers import CrossEncoder
import torch
model = SentenceTransformer("multi-qa-distilbert-cos-v1")




In [None]:
# Load the multilingual sentence-transformers model
model_LM = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")


In [78]:

model_cross = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", default_activation_function=torch.nn.Sigmoid(), max_length=64)
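
The `torch.nn.Sigmoid()` activation passed to the cross-encoder squashes its raw output (a logit) into a value between 0 and 1, which is why its scores can be read like probabilities. A minimal standard-library sketch of that mapping follows; the logits are hypothetical values chosen only for illustration, not outputs of the model.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw score (logit) to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical raw logits, chosen only for illustration (not model outputs)
example_logits = [-4.0, 0.0, 5.0]
example_scores = [sigmoid(x) for x in example_logits]

# Strongly negative logits land near 0, strongly positive ones near 1
print([round(s, 4) for s in example_scores])
```

This helps explain why cross-encoder scores tend to cluster near 0 or near 1: moderately large logits already saturate the sigmoid.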



In [53]:
## First model (multi-qa): it was not trained on multilingual data, so to use it
## with Portuguese phrases we would have to translate them to English first.
## Here we score the untranslated Portuguese phrases as a baseline.
passage_embeddings=model.encode([
    "cancelar consulta",
   " cancelar uma consulta",
    "Quero cancelar a consulta",
   " Como faz para cancelar a consulta",
   " Preciso cancelar a consulta",
    "Gostaria de desmarcar a consulta",
   " Como faço para cancelar a consulta que marquei"
])
query_embedding=model.encode("quero cancelar minha consulta")
similarity=model.similarity(query_embedding,passage_embeddings)
print(similarity)

tensor([[0.8022, 0.7465, 0.8904, 0.8691, 0.8514, 0.5310, 0.7641]])
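
To make these numbers easier to read: assuming `model.similarity` uses cosine similarity (the library default when the model does not configure another function), each score is the cosine of the angle between the query embedding and one passage embedding. A minimal sketch with toy 3-dimensional vectors, which are illustrative stand-ins and not real embeddings:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings (assumption: real ones have hundreds of dimensions)
toy_query = [1.0, 2.0, 0.0]
toy_passages = [
    [2.0, 4.0, 0.0],  # same direction as the query
    [0.0, 0.0, 3.0],  # orthogonal to the query
    [1.0, 0.0, 0.0],  # partially aligned with the query
]

toy_scores = [cosine_similarity(toy_query, p) for p in toy_passages]
print([round(s, 4) for s in toy_scores])
```

Only the direction of the vectors matters, not their length, so a score near 1 means the two embeddings point the same way.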


In [54]:
## As we can see, the scores for the English phrases are much higher
passage_embeddings=model.encode([
    "cancel appointment", 
    "cancel an appointment", 
    "I want to cancel the appointment", 
    "How do I cancel the appointment", 
    "I need to cancel the appointment", 
    "I would like to cancel the appointment", 
    "How do I cancel the appointment I made"
])
query_embedding=model.encode("I want to cancel my appointment")
similarity=model.similarity(query_embedding,passage_embeddings)
print(similarity)

tensor([[0.9259, 0.9335, 0.9778, 0.9416, 0.9662, 0.9486, 0.9275]])
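
To put a number on "much higher", the two printed score lists can be summarized side by side. This sketch only copies the values from the two outputs; note that the phrase lists are not exact translations of each other, so the gap is indicative rather than a strict like-for-like comparison.

```python
from statistics import mean

# Scores copied from the two outputs of the first model
portuguese_scores = [0.8022, 0.7465, 0.8904, 0.8691, 0.8514, 0.5310, 0.7641]
english_scores = [0.9259, 0.9335, 0.9778, 0.9416, 0.9662, 0.9486, 0.9275]

mean_pt = mean(portuguese_scores)
mean_en = mean(english_scores)

print(f"Portuguese: mean={mean_pt:.4f}, min={min(portuguese_scores):.4f}")
print(f"English:    mean={mean_en:.4f}, min={min(english_scores):.4f}")
print(f"Mean gap:   {mean_en - mean_pt:.4f}")
```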


In [55]:
### Using the second model (multilingual)
### Testing with Portuguese phrases
query_embedding=model_LM.encode("Gostaria de marcar uma consulta")
# Example usage with Portuguese phrases
passage_embeddings=model_LM.encode([
    "marcar consulta com o psicologo",
    "marcar uma consulta",
    "como agendo uma consulta",
    "como agendo um horario com voces",
    "quero marcar uma consulta com voces",
    "como faco para marcar uma consulta com voces",
    "Como posso reservar um horário para consulta",
    "É possível marcar um atendimento"
])
similarity=model_LM.similarity(query_embedding,passage_embeddings)

print(similarity)

tensor([[0.6398, 0.8695, 0.7883, 0.3477, 0.7496, 0.6453, 0.6793, 0.7133]])
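
A raw tensor is hard to map back to the phrases by eye. The sketch below pairs each passage with its score from the output above and sorts them, so the weakest match stands out. The scores are copied by hand from the printed tensor rather than recomputed.

```python
# Passages from the cell above, in the same order as the printed scores
passages = [
    "marcar consulta com o psicologo",
    "marcar uma consulta",
    "como agendo uma consulta",
    "como agendo um horario com voces",
    "quero marcar uma consulta com voces",
    "como faco para marcar uma consulta com voces",
    "Como posso reservar um horário para consulta",
    "É possível marcar um atendimento",
]
# Scores copied from the printed tensor
scores = [0.6398, 0.8695, 0.7883, 0.3477, 0.7496, 0.6453, 0.6793, 0.7133]

# Sort (passage, score) pairs from most to least similar to the query
ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)

for passage, score in ranked:
    print(f"{score:.4f}  {passage}")
```

In the notebook itself, the hand-copied list could be replaced with `similarity[0].tolist()`, since the result is a torch tensor with one row per query.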


In [85]:
#### For English phrases, we do not see much difference in scores
query_embedding=model_LM.encode("I want to book an appointment")
passage_embeddings=model_LM.encode([
"book an appointment with a psychologist",
"book an appointment",
"how do I schedule an appointment",
"how do I schedule an appointment with you",
"I want to schedule an appointment with you",
"how do I schedule an appointment with you",
"How can I book an appointment",
"Is it possible to schedule an appointment"
])
similarity=model_LM.similarity(query_embedding,passage_embeddings)

print(similarity)

tensor([[0.5216, 0.8320, 0.7826, 0.7398, 0.7807, 0.7398, 0.9449, 0.7662]])


In [84]:
### Let's compare the scores with the first model
query_embedding=model.encode("I want to book an appointment")
passage_embeddings=model.encode([
"book an appointment with a psychologist",
"book an appointment",
"how do I schedule an appointment",
"how do I schedule an appointment with you",
"I want to schedule an appointment with you",
"how do I schedule an appointment with you",
"How can I book an appointment",
"Is it possible to schedule an appointment"
])
similarity=model.similarity(query_embedding,passage_embeddings)

print(similarity)

tensor([[0.5282, 0.9095, 0.7804, 0.7978, 0.8564, 0.7978, 0.9507, 0.7263]])


##### The first model obtained higher scores for most of the English sentences (6 of 8)
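
The comparison can be checked sentence by sentence. This sketch copies the two printed tensors for the English passages and counts how often the first model scores higher, along with the average difference.

```python
from statistics import mean

# Scores copied from the two outputs for the same English passages
first_model_scores = [0.5282, 0.9095, 0.7804, 0.7978, 0.8564, 0.7978, 0.9507, 0.7263]
multilingual_scores = [0.5216, 0.8320, 0.7826, 0.7398, 0.7807, 0.7398, 0.9449, 0.7662]

# Per-sentence difference: positive means the first model scored higher
differences = [a - b for a, b in zip(first_model_scores, multilingual_scores)]
wins = sum(d > 0 for d in differences)
mean_diff = mean(differences)

print(f"First model higher on {wins} of {len(differences)} sentences")
print(f"Mean difference: {mean_diff:+.4f}")
```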

In [83]:
### Let's test the cross-encoder model
scores = model_cross.predict([
    ("I want to book an appointment", "book an appointment"),
    ("I want to book an appointment", "how do I schedule an appointment"),
    ("I want to book an appointment", "how do I schedule an appointment with you"),
    ("I want to book an appointment", "I want to schedule an appointment with you"),
    ("I want to book an appointment", "How can I book an appointment"),
    ("I want to book an appointment", "Is it possible to schedule an appointment"),
])
print(scores)

[0.9947903  0.01229856 0.01899222 0.991536   0.90968424 0.04460678]


#### This cross-encoder model does not handle synonyms well: three of the four rephrasings above that use "schedule" instead of "book" score below 0.05. It will therefore not be very useful for data augmentation, where the goal is to expand the database with synonyms and similar wording while preserving the meaning of the sentence.

### However, because the cross-encoder gives high scores to very similar sentences, we will use it to detect duplicate or near-duplicate phrases in our dataset and avoid overly redundant data.
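
One possible shape for that duplicate filter is sketched below. It is an illustration under stated assumptions, not the final implementation. The helper `drop_near_duplicates`, the `0.9` threshold, and the stand-in scorer `toy_score` (a simple token-overlap measure) are all hypothetical. The scorer is there only so the example runs without downloading a model.

```python
from typing import Callable, List


def drop_near_duplicates(
    phrases: List[str],
    score_fn: Callable[[str, str], float],
    threshold: float = 0.9,  # assumed cutoff; tune it on real data
) -> List[str]:
    """Greedily keep a phrase only if it scores below `threshold`
    against every phrase that has already been kept."""
    kept: List[str] = []
    for phrase in phrases:
        if all(score_fn(phrase, other) < threshold for other in kept):
            kept.append(phrase)
    return kept


def toy_score(a: str, b: str) -> float:
    """Stand-in scorer (NOT the cross-encoder): Jaccard overlap of lowercase tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


phrases = [
    "book an appointment",
    "Book an appointment",
    "how do I schedule an appointment",
    "book an appointment",
]
unique_phrases = drop_near_duplicates(phrases, toy_score, threshold=0.9)
print(unique_phrases)
```

In the notebook, `toy_score` could be swapped for a small wrapper such as `lambda a, b: float(model_cross.predict([(a, b)])[0])`. The greedy loop scores each phrase against every kept phrase, so on a large dataset it would be worth batching the pairs into a single `predict` call.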