<a href="https://colab.research.google.com/github/sishef/nlpworkshop/blob/main/5_SemanticSimilarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook 5: Measuring semantic similarity
(how similar the meaning is, not the spelling)

### 1. Simple string similarity doesn't understand meaning at all

In [None]:
!pip install levenshtein

In [None]:
import Levenshtein

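def remove_punctuation(msg):
  # Strip a small, fixed set of punctuation marks from the message.
  symbols = ['?','-',',',':',';']
  for symbol in symbols:
    msg = msg.replace(symbol, '')
  return msg

def tokenize(msg):
  # Lowercase the message and split it into words on single spaces.
  msg = msg.lower()
  tokens = msg.split(' ')
  return tokens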

   
sentence1 = "The earth is round"
sentence2 = "Our planet is a sphere"
sentence3 = "Some cheese is orange"

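# Join the tokens back together without spaces, so only the letters are compared.
clean_sentence1 = ''.join(tokenize(remove_punctuation(sentence1)))
clean_sentence2 = ''.join(tokenize(remove_punctuation(sentence2)))
clean_sentence3 = ''.join(tokenize(remove_punctuation(sentence3)))

# Edit distance from sentence 1 to each of the others (lower means more similar spelling).
print(Levenshtein.distance(clean_sentence1, clean_sentence2))
print(Levenshtein.distance(clean_sentence1, clean_sentence3))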


Using our distance measure from the last exercise, we can see that sentence 2 and sentence 3 have very similar distances to sentence 1. In fact, sentence 3 is a bit closer.

Of course, when we consider the meaning of these, sentence 2 is almost identical in meaning to sentence 1, whereas sentence 3 is totally different.
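
To make it concrete what the distance measure is counting, here is a minimal pure-Python sketch of Levenshtein (edit) distance. The function name `edit_distance` is our own and is not part of the `Levenshtein` library; it is meant for illustration, not speed.

```python
def edit_distance(a, b):
  # previous[j] holds the distance between the first i-1 characters of a
  # and the first j characters of b.
  previous = list(range(len(b) + 1))
  for i, char_a in enumerate(a, start=1):
    current = [i]  # turning the first i characters of a into '' takes i deletions
    for j, char_b in enumerate(b, start=1):
      substitution_cost = 0 if char_a == char_b else 1
      current.append(min(
        previous[j] + 1,                      # delete char_a
        current[j - 1] + 1,                   # insert char_b
        previous[j - 1] + substitution_cost,  # substitute (or keep) the character
      ))
    previous = current
  return previous[-1]

print(edit_distance("kitten", "sitting"))  # prints 3
```

Every step works on single characters, so the measure only ever sees spelling. Nothing in it knows what the words mean.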

### 2. The power of machine learning

We are now going to use some state-of-the-art machine learning models to identify sentences that have similar semantics (meaning). First we have to install the libraries and create the model object (which takes a while to download, as it is quite large).

In [None]:
!pip install transformers
!pip install sentence-transformers

In [None]:
from sentence_transformers import SentenceTransformer, util
import numpy as np

model = SentenceTransformer('stsb-roberta-large')

Now let's see what we can do.
With this library we first encode each sentence into an 'embedding'. This is similar to our previous tokenization work, except that an embedding is a fixed-length list of numbers (a vector) that the model derives from the sentence. Then we can use a mathematical measure called 'cosine similarity', which compares the directions of two vectors, to see how similar the sentences are (scores closer to 1 mean closer in meaning):

In [None]:
# Encode each sentence into an embedding (returned as a PyTorch tensor).
embedding1 = model.encode(sentence1, convert_to_tensor=True)
embedding2 = model.encode(sentence2, convert_to_tensor=True)
embedding3 = model.encode(sentence3, convert_to_tensor=True)

# Compare sentence 1 against each of the other two sentences.
cosine_scores1_2 = util.pytorch_cos_sim(embedding1, embedding2)
cosine_scores1_3 = util.pytorch_cos_sim(embedding1, embedding3)

print(f"The similarity between '{sentence1}' and '{sentence2}' is {cosine_scores1_2.item()}")
print(f"The similarity between '{sentence1}' and '{sentence3}' is {cosine_scores1_3.item()}")

Amazing! The machine was able to understand that 'our planet' is similar in meaning to 'the earth', and 'is round' is similar in meaning to 'is a sphere'.
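
To see what cosine similarity is computing, here is a minimal sketch in plain Python. The function name `cosine_similarity` and the tiny three-number vectors are made up for illustration; real embeddings from the model contain many more numbers, but the arithmetic is the same.

```python
import math

def cosine_similarity(vector_a, vector_b):
  # Dot product: multiply the vectors element by element and add up the results.
  dot_product = sum(a * b for a, b in zip(vector_a, vector_b))
  # Length (Euclidean norm) of each vector.
  length_a = math.sqrt(sum(a * a for a in vector_a))
  length_b = math.sqrt(sum(b * b for b in vector_b))
  # Cosine of the angle between the vectors:
  # 1 = same direction, 0 = perpendicular, -1 = opposite direction.
  return dot_product / (length_a * length_b)

# Made-up toy vectors, not real model output.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
perpendicular = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction)  # very close to 1
print(perpendicular)   # 0.0
```

Because the result is divided by both lengths, only the direction of the vectors matters, not how long they are.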

**How does this magic work? --> side presentation**

**TASK:** Use this new semantic similarity measure to identify better chatbot responses. Note: the `model.encode(...)` function call is a bit slow, so you should consider doing this once for each sentence when the training data is loaded into the chatbot, rather than every time a user inputs new data.

**TIP:** Take a look at Example_ChatterbotCorpusWithTransformer.ipynb for some help on making the up-front training fast.
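
As a starting point, here is one possible sketch of the "encode once, up front" idea. Everything in it is hypothetical: the `ResponseIndex` class, the training pairs, and `toy_encode`, which is a letter-counting stand-in so the sketch runs without downloading the model. In your chatbot you would pass `model.encode` in place of `toy_encode`, and you may prefer `util.pytorch_cos_sim` over the hand-written `cosine` helper.

```python
import math

def toy_encode(sentence):
  # Hypothetical stand-in for model.encode: counts how often each letter a-z occurs.
  # It only captures spelling and exists just so this sketch runs without the model.
  sentence = sentence.lower()
  return [sentence.count(letter) for letter in "abcdefghijklmnopqrstuvwxyz"]

def cosine(vector_a, vector_b):
  dot_product = sum(a * b for a, b in zip(vector_a, vector_b))
  length_a = math.sqrt(sum(a * a for a in vector_a))
  length_b = math.sqrt(sum(b * b for b in vector_b))
  if length_a == 0 or length_b == 0:
    return 0.0  # an all-zero vector has no direction to compare
  return dot_product / (length_a * length_b)

class ResponseIndex:
  def __init__(self, prompt_response_pairs, encode):
    self.encode = encode
    self.responses = [response for _, response in prompt_response_pairs]
    # The slow step: encode every known prompt once, when the data is loaded.
    self.prompt_embeddings = [encode(prompt) for prompt, _ in prompt_response_pairs]

  def best_response(self, user_input):
    # Only the new user input is encoded at question time.
    query_embedding = self.encode(user_input)
    scores = [cosine(query_embedding, e) for e in self.prompt_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return self.responses[best], scores[best]

# Made-up training pairs for illustration.
pairs = [
  ("What is your name", "I am a chatbot"),
  ("How is the weather", "I cannot look outside"),
]
index = ResponseIndex(pairs, toy_encode)
response, score = index.best_response("what's the weather like")
print(response, score)
```

The encoder is passed in as a parameter so that the slow model can be swapped in later without changing the lookup logic.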

If you have any extra time you can learn more about 'semantic search' here: https://sbert.net/examples/applications/semantic-search/README.html
