<a href="https://colab.research.google.com/github/vitalivu/short-sentences-similarity/blob/master/semantic_similarity_for_short_sentences.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## SentenceBERT

Install with `pip`

In [None]:
!pip install sentence_transformers

Collecting sentence_transformers
  Downloading https://files.pythonhosted.org/packages/c4/87/49dc49e13ac107ce912c2f3f3fd92252c6d4221e88d1e6c16747044a11d8/sentence-transformers-1.1.0.tar.gz (78kB)
Collecting transformers<5.0.0,>=3.1.0
  Downloading https://files.pythonhosted.org/packages/d8/b2/57495b5309f09fa501866e225c84532d1fd89536ea62406b2181933fb418/transformers-4.5.1-py3-none-any.whl (2.1MB)
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/f5/99/e0808cb947ba10f575839c43e8fafc9cc44e4a7a2c8f79c60db48220a577/sentencepiece-0.1.95-cp37-cp37m-manylinux2014_x86_64.whl (1.2MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/ae/04/5b870f26a858552025a62f1649c20d29d2672c02ff3c3fb4c688ca46467a/tokenizers-0.10.2-cp37

## Data

This notebook uses data from the [Quora Question Pairs](https://www.kaggle.com/c/quora-question-pairs) competition on Kaggle.

### Read data
The data is stored in Google Drive.

#### Connect to Google Drive

In [None]:
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


#### List the files

In [None]:
%ls /gdrive/MyDrive/Colab\ Notebooks/data/quora/input/train.csv.zip

'/gdrive/MyDrive/Colab Notebooks/data/quora/input/train.csv.zip'


In [None]:
import numpy as np
import pandas as pd
import os

for dirname, _, filenames in os.walk('/gdrive/MyDrive/Colab Notebooks/data'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/gdrive/MyDrive/Colab Notebooks/data/quora/input/sample_submission.csv.zip
/gdrive/MyDrive/Colab Notebooks/data/quora/input/test.csv
/gdrive/MyDrive/Colab Notebooks/data/quora/input/test.csv.zip
/gdrive/MyDrive/Colab Notebooks/data/quora/input/train.csv.zip


In [None]:
df = pd.read_csv('/gdrive/MyDrive/Colab Notebooks/data/quora/input/train.csv.zip', compression='zip', sep=',')
df.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


In [None]:
question1 = df['question1'].unique()
question1

array(['What is the step by step guide to invest in share market in india?',
       'What is the story of Kohinoor (Koh-i-Noor) Diamond?',
       'How can I increase the speed of my internet connection while using a VPN?',
       ..., 'What is one coin?',
       'What is the approx annual cost of living while studying in UIC Chicago, for an Indian student?',
       'What is like to have sex with cousin?'], dtype=object)

### Clean data

- Lowercase the original sentences
- Remove stop words, punctuation, and non-ASCII characters
- Replace common paraphrases with a single canonical form (for example, "how do you" becomes "how do i")

In [None]:
stopwords = set(['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'i', 'if', 'in', 'into', 'is', 'isn', "isn't", "it's", 'its', 'itself', 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she', "she's", 'should', "should've", 'shouldn', "shouldn't", 'so', 'some', 'such', 't', 'than', 'that', "that'll", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', 'these', 'they', 'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 've', 'very', 'was', 'wasn', "wasn't", 'we', 'were', 'weren', "weren't", 'which', 'while', 'will', 'with', 'won', "won't", 'wouldn', "wouldn't", 'y', 'you', "you'd", "you'll", "you're", "you've", 'your', 'yours', 'yourself', 'yourselves'])


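import re


def cleantext(sent):
    # Remove non-ASCII chars. str.replace() matches its argument literally,
    # so a regex pattern needs re.sub() to have any effect.
    sent = re.sub(r'[^\x00-\x7f]', ' ', str(sent))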

    # Replace some common paraphrases
    sent_norm = sent.lower()\
        .replace("how do you", "how do i")\
        .replace("how do we", "how do i")\
        .replace("how can we", "how can i")\
        .replace("how can you", "how can i")\
        .replace("how can i", "how do i")\
        .replace("really true", "true")\
        .replace("what are the importance", "what is the importance")\
        .replace("what was", "what is")\
        .replace("so many", "many")\
        .replace("would it take", "will it take")

    # Remove any punctuation characters
    for c in [",", "!", ".", "?", "'", '"', ":", ";", "[", "]", "{", "}", "<", ">"]:
        sent_norm = sent_norm.replace(c, " ")

    # Remove stop words
    tokens = sent_norm.split()
    tokens = [token for token in tokens if token not in stopwords]
    return " ".join(tokens)

cleantext('What is the approx annual cost of living while studying in UIC Chicago, for an Indian student?')

'what approx annual cost living studying uic chicago indian student'

Then replace each question with its cleaned version, `cleantext(question)`:

In [None]:
question1 = df['question1'].unique()
question1 = np.array(list(map(cleantext, question1)))
question1

array(['what step step guide invest share market india',
       'what story kohinoor (koh-i-noor) diamond',
       'how increase speed internet connection using vpn', ...,
       'what one coin',
       'what approx annual cost living studying uic chicago indian student',
       'what like sex cousin'], dtype='<U366')

In [None]:
question2 = df['question2'].unique()
question2 = np.array(list(map(cleantext, question2)))
question2

array(['what step step guide invest share market',
       'what would happen indian government stole kohinoor (koh-i-noor) diamond back',
       'how internet speed increased hacking dns', ..., 'what coin',
       'little hairfall problem want use hair styling product one prefer gel wax clay',
       'what it like sex cousin'], dtype='<U691')

## Models
### Create the embeddings

In [None]:
from sentence_transformers import SentenceTransformer, util
from time import perf_counter


model = SentenceTransformer('paraphrase-distilroberta-base-v1')

startTime = perf_counter()
embeddings1 = model.encode(question1, convert_to_tensor=True)
embeddings2 = model.encode(question2, convert_to_tensor=True)
endTime = perf_counter()
print("Computed sentence embeddings in {:.4f} seconds".format(endTime - startTime))


Computed sentence embeddings in 297.8298 seconds


#### Experiments
Create a simple query and search for the top 5 results:

In [None]:
from time import perf_counter
import torch

queries = ['What is the approx annual cost of living while studying in UIC Chicago, for an Indian student?'] # example from question1

top_5 = min(5, len(embeddings2))

time_t1 = perf_counter()
for query in queries:
    query_embedding = model.encode(cleantext(query), convert_to_tensor=True)
    cos_scores = util.pytorch_cos_sim(query_embedding, embeddings2)[0]
    top_results = torch.topk(cos_scores, k=top_5)
    print("### Query:", query)
    print("Top 5 most similar queries:")
    for score, idx in zip(top_results[0], top_results[1]):
        print("({:.4f})".format(score), question2[idx])

time_t2 = perf_counter()
print("Compute cosine-similarity in","{:.4f}".format(time_t2 - time_t1),"seconds")

### Query: What is the approx annual cost of living while studying in UIC Chicago, for an Indian student?
Top 5 most similar queries:
(0.6720) what cost living (monthly yearly) graduate student studying mit
(0.6302) how much indian student earn studying masters degree uk
(0.6255) what minimum living expenses per month dubai student
(0.6064) how much would masters mis cost indian student nyu living education included
(0.6030) how much earning international student make per hour oslo norway working cafes bars restaurants
Compute cosine-similarity in 0.0395 seconds
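
For reference, the search above amounts to scoring the query against every stored vector by cosine similarity and keeping the highest scores. Here is a minimal plain-Python sketch of that idea. The toy vectors and the helper names `cosine_similarity` and `top_k_similar` are illustrative assumptions, not part of `sentence_transformers`, which performs the equivalent computation on tensors.

```python
import heapq
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def top_k_similar(query_vec, corpus_vecs, k=5):
    """Return (score, index) pairs for the k corpus vectors closest to the query."""
    scored = ((cosine_similarity(query_vec, vec), idx)
              for idx, vec in enumerate(corpus_vecs))
    return heapq.nlargest(k, scored)


# Toy 3-dimensional "embeddings" (made-up numbers, not real model output)
corpus = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]
results = top_k_similar([1.0, 0.0, 0.0], corpus, k=2)
for score, idx in results:
    print("({:.4f}) corpus[{}]".format(score, idx))
```

A score of 1 means the two vectors point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions.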


Take the top 100 results from the bi-encoder and re-rank them with a cross-encoder:

In [None]:
from sentence_transformers.cross_encoder import CrossEncoder
from time import perf_counter
import torch

query = 'What is the approx annual cost of living while studying in UIC Chicago, for an Indian student?' # example from question1

top_100 = min(100, len(embeddings2))

time_t1 = perf_counter()
query_embedding = model.encode(cleantext(query), convert_to_tensor=True)
cos_scores = util.pytorch_cos_sim(query_embedding, embeddings2)[0]
top_results = torch.topk(cos_scores, k=top_100) # select top 100

top_sentences = [question2[idx] for idx in top_results[1].tolist()]  # extract top 100 sentences

time_t2 = perf_counter()
sentence_combinations = [[query, sentence] for sentence in top_sentences]

cross_encoder = CrossEncoder('cross-encoder/stsb-distilroberta-base')
similarity_scores = cross_encoder.predict(sentence_combinations)
sim_scores = np.argsort(similarity_scores)[::-1]  # indices sorted by descending score

print("### Query:", query)
print("Top 5 most similar queries:")
for idx in sim_scores[:5]:
    print("({:.4f}) {}".format(similarity_scores[idx], top_sentences[idx]))

time_t3 = perf_counter()
print("Compute bi-encoder in","{:.4f}".format(time_t2 - time_t1),"seconds")
print("Compute cross-encoder from top 100 in","{:.4f}".format(time_t3 - time_t2),"seconds")
print("Total time: ", "{:.4f}".format(time_t3 - time_t1), "seconds")

### Query: What is the approx annual cost of living while studying in UIC Chicago, for an Indian student?
Top 5 most similar queries:
(0.6741) what cost living (monthly yearly) graduate student studying mit
(0.5357) what approximated cost attending norwegian university indian student opting undergraduate programs
(0.5315) what cost living denver (co) student
(0.5217) what opportunity cost studying university
(0.4945) how much would masters mis cost indian student nyu living education included
Compute bi-encoder in 0.0319 seconds
Compute cross-encoder from top 100 in 2.9709 seconds
Total time:  3.0028 seconds
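
The two-stage pattern used here (a fast scorer builds a shortlist, a slower scorer re-orders it) can be sketched independently of any model. In the sketch below, `retrieve_and_rerank`, `word_overlap`, and `jaccard` are hypothetical names, and the two word-based scorers are simple stand-ins assumed for the demo. They take the place of the bi-encoder and cross-encoder only to make the control flow visible.

```python
def retrieve_and_rerank(query, candidates, cheap_score, expensive_score,
                        shortlist_size=100, top_k=5):
    """Shortlist with a cheap scorer, then re-order the shortlist with a costly one."""
    # Stage 1: rank every candidate with the cheap scorer, keep the best few
    shortlist = sorted(candidates, key=lambda c: cheap_score(query, c),
                       reverse=True)[:shortlist_size]
    # Stage 2: run the expensive scorer only on the shortlist
    rescored = [(expensive_score(query, c), c) for c in shortlist]
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return rescored[:top_k]


# Stand-in scorers (assumptions for the demo, NOT the real models)
def word_overlap(a, b):
    """Cheap scorer: number of shared words."""
    return len(set(a.split()) & set(b.split()))


def jaccard(a, b):
    """'Expensive' scorer: shared words divided by all distinct words."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


candidates = [
    "cost living student studying mit",
    "cost living denver student",
    "best pizza chicago",
    "indian student cost living studying chicago university fees visa housing food",
]
query = "cost living studying chicago indian student"
ranked = retrieve_and_rerank(query, candidates, word_overlap, jaccard,
                             shortlist_size=3, top_k=2)
for score, sentence in ranked:
    print("({:.4f}) {}".format(score, sentence))
```

As in the notebook output, the second stage can change the order of the shortlist: the candidate that shares the most words with the query is not the one with the highest final score.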


#### Note and TODO
We cannot calculate the similarity for all sentence pairs across both sets at once (there is not enough memory for 230TB =)), so:
- we can compare the pairs one by one
- apply a sigmoid function or a threshold to the similarity scores to mark a question pair as similar or not
    - use linear regression to select a suitable threshold
- calculate the accuracy
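
One possible way to approach the threshold and accuracy items is sketched below. Note that it swaps the linear-regression idea for a plain sweep over candidate cut-offs, which is simpler to reason about. The scores and labels are made-up values for illustration. In practice each score would be the similarity of one `question1`/`question2` pair and each label the matching `is_duplicate` value. The helper names `accuracy_at` and `best_threshold` are hypothetical.

```python
def accuracy_at(threshold, scores, labels):
    """Fraction of pairs whose thresholded score matches the 0/1 label."""
    correct = sum(int(score >= threshold) == label
                  for score, label in zip(scores, labels))
    return correct / len(labels)


def best_threshold(scores, labels):
    """Try each observed score as a cut-off and keep the most accurate one."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: accuracy_at(t, scores, labels))


# Made-up similarity scores and is_duplicate labels, for illustration only
scores = [0.91, 0.85, 0.40, 0.62, 0.30, 0.77]
labels = [1, 1, 0, 0, 0, 1]

threshold = best_threshold(scores, labels)
print("threshold = {:.2f}, accuracy = {:.2f}".format(
    threshold, accuracy_at(threshold, scores, labels)))
```

To get an honest accuracy figure, the threshold should be chosen on one part of the labelled pairs and the accuracy reported on a held-out part.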

### Export and import the model

Export the sentences and their embeddings to files. The files can be used to restore them later without re-encoding.

In [None]:
import pickle

#Store sentences & embeddings on disc
with open('question1.pkl', "wb") as fOut:
    pickle.dump({'sentences': question1, 'embeddings': embeddings1}, fOut, protocol=pickle.HIGHEST_PROTOCOL)
with open('question2.pkl', "wb") as fOut:
    pickle.dump({'sentences': question2, 'embeddings': embeddings2}, fOut, protocol=pickle.HIGHEST_PROTOCOL)

Import the sentences and embeddings from the files. In our case, Kaggle generates the files, and we then load the precomputed embeddings to build the search engine.

In [None]:
#Load sentences & embeddings from disc
with open('question1.pkl', "rb") as fIn:
    stored_data = pickle.load(fIn)
    question1 = stored_data['sentences']
    embeddings1 = stored_data['embeddings']
with open('question2.pkl', "rb") as fIn:
    stored_data = pickle.load(fIn)
    question2 = stored_data['sentences']
    embeddings2 = stored_data['embeddings']