<a href="https://colab.research.google.com/github/vitalivu/short-sentences-similarity/blob/master/semantic_similarity_for_short_sentences.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## SentenceBERT

For the original paper, see [arXiv:1908.10084](https://arxiv.org/abs/1908.10084).

To run this notebook, install the dependencies with `pip`:

In [None]:
!pip install sentence_transformers
!pip install pandas
!pip install nltk
# `lzma` is part of the Python standard library, so it is not installed with pip.

## Data
This notebook uses data from [Quora Question Pairs](https://www.kaggle.com/c/quora-question-pairs).



In [None]:
import numpy as np
import pandas as pd
import os

### Running in Kaggle

List the files

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('../input/train.csv.zip', compression='zip', sep=',')
df.head()

### Running in Colab

In Colab, the data is stored in Google Drive. Upload the dataset to your Google Drive manually, then mount the drive from this notebook.

In [None]:
from google.colab import drive
drive.mount('/gdrive')

List the files, e.g. in `data/quora/input`

In [None]:
%ls /gdrive/MyDrive/Colab\ Notebooks/data/quora/input

Get the file path

In [None]:
for dirname, _, filenames in os.walk('/gdrive/MyDrive/Colab Notebooks/data'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Open example data

In [None]:
df = pd.read_csv('/gdrive/MyDrive/Colab Notebooks/data/quora/input/train.csv.zip', compression='zip', sep=',')
df.head()

### Running locally on Ubuntu

In [None]:
!sudo apt-get install liblzma-dev

In [None]:
for dirname, _, filenames in os.walk('../data'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('../data/quora/input/train.csv.zip', compression='zip', sep=',')
df.head()

### Example data

In [None]:
id_2_question_map = {}

def add_to_map(key, val):
    if key not in id_2_question_map:
        id_2_question_map[key] = val
    

def add_row(row):
    add_to_map(row['qid1'], row['question1'])
    add_to_map(row['qid2'], row['question2'])

df.apply(lambda row: add_row(row), axis=1)

len(id_2_question_map)

In [None]:
len(set(id_2_question_map.values()))

### Clean data

- Lowercase original sentences
- Remove stop words, punctuation and non-ASCII characters
- Replace common paraphrases with a single canonical phrase

In [None]:
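import nltk
from nltk.corpus import stopwords as nltk_stopwords

# Download the stop word list once, then keep it as a set for fast membership tests
nltk.download('stopwords', quiet=True)
stopwords = set(nltk_stopwords.words('english'))
# Fallback without nltk: uncomment the hard-coded set below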
#stopwords = set(['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'i', 'if', 'in', 'into', 'is', 'isn', "isn't", "it's", 'its', 'itself', 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she', "she's", 'should', "should've", 'shouldn', "shouldn't", 'so', 'some', 'such', 't', 'than', 'that', "that'll", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', 'these', 'they', 'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 've', 'very', 'was', 'wasn', "wasn't", 'we', 'were', 'weren', "weren't", 'which', 'while', 'will', 'with', 'won', "won't", 'wouldn', "wouldn't", 'y', 'you', "you'd", "you'll", "you're", "you've", 'your', 'yours', 'yourself', 'yourselves'])


def clean_text(sent):
    # Replace non-ASCII chars with a space (str.replace does not accept regex patterns)
    sent = ''.join(ch if ord(ch) < 128 else ' ' for ch in str(sent))

    # Replace some common paraphrases
    sent_norm = sent.lower()\
        .replace("how do you", "how do i")\
        .replace("how do we", "how do i")\
        .replace("how can we", "how can i")\
        .replace("how can you", "how can i")\
        .replace("how can i", "how do i")\
        .replace("really true", "true")\
        .replace("what are the importance", "what is the importance")\
        .replace("what was", "what is")\
        .replace("so many", "many")\
        .replace("would it take", "will it take")

    # Remove any punctuation characters
    for c in [",", "!", ".", "?", "'", '"', ":", ";", "[", "]", "{", "}", "<", ">"]:
        sent_norm = sent_norm.replace(c, " ")

    # Remove stop words
    tokens = sent_norm.split()
    tokens = [token for token in tokens if token not in stopwords]
    return " ".join(tokens)

clean_text('What is the approx annual cost of living while studying in UIC Chicago, for an Indian student?')

Replace the data with cleaned data: each `question` becomes `clean_text(question)`. Empty results and duplicates are dropped.

In [None]:
questions = np.array(list(filter(None, set(map(clean_text, set(id_2_question_map.values()))))))
questions

In [None]:
questions.shape

## Models

In [None]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-distilroberta-base-v1')

### Create the embeddings

In [None]:
from time import perf_counter

time_t1 = perf_counter()
embeddings = model.encode(questions, convert_to_tensor=True)
time_t2 = perf_counter()
print("Computed sentence embeddings in {:.4f} seconds".format(time_t2 - time_t1))

## Experiments
Create a simple query and search for top 5 results


### Bi-Encoder

In [None]:
from time import perf_counter
import torch

queries = ['What is the approx annual cost of living while studying in UIC Chicago, for an Indian student?'] # example from question1

top_5 = min(5, len(embeddings))

time_t1 = perf_counter()
for query in queries:
    query_embedding = model.encode(clean_text(query), convert_to_tensor=True)
    cos_scores = util.pytorch_cos_sim(query_embedding, embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_5)
    print("### Query:", query)
    print("Top 5 most similar queries:")
    for score, idx in zip(top_results[0], top_results[1]):
        print("({:.4f})".format(score), questions[idx])

time_t2 = perf_counter()
print("Compute cosine-similarity in","{:.4f}".format(time_t2 - time_t1),"seconds")

### Cross-Encoder

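The cross-encoder cannot be run over the whole dataset. It scores each (query, question) pair jointly, so nothing can be precomputed, which leads to:
- memory limitations,
- too much computation and running time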
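
To make the cost difference concrete, the sketch below only counts model forward passes for each approach. It is a back-of-the-envelope illustration, not a benchmark: the helper names are made up for this example and `n_questions` is a hypothetical corpus size, not a figure measured in this notebook.

```python
def bi_encoder_calls(n_questions: int, n_queries: int) -> int:
    """Forward passes when each question and each query is encoded once."""
    return n_questions + n_queries


def cross_encoder_calls(n_questions: int, n_queries: int) -> int:
    """Forward passes when every (query, question) pair is scored jointly."""
    return n_questions * n_queries


def all_pairs(n_questions: int) -> int:
    """Number of unordered question pairs inside one corpus."""
    return n_questions * (n_questions - 1) // 2


n_questions = 500_000  # hypothetical corpus size, for illustration only
print("bi-encoder, 1 query:    ", bi_encoder_calls(n_questions, 1))
print("cross-encoder, 1 query: ", cross_encoder_calls(n_questions, 1))
print("cross-encoder, all pairs:", all_pairs(n_questions))
```

With the bi-encoder, the corpus embeddings are a one-off cost and each new query adds a single forward pass, which is why it is used as the first retrieval stage.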



### Combination
Use the bi-encoder to retrieve the top 100 candidates, then re-rank them with the cross-encoder.

In [None]:
from sentence_transformers.cross_encoder import CrossEncoder
from time import perf_counter
import torch

query = 'What is the approx annual cost of living while studying in UIC Chicago, for an Indian student?' # example from question1

top_100 = min(100, len(embeddings))

time_t1 = perf_counter()
query_embedding = model.encode(clean_text(query), convert_to_tensor=True)
cos_scores = util.pytorch_cos_sim(query_embedding, embeddings)[0]
top_results = torch.topk(cos_scores, k=top_100) # select top 100

top_sentences = [questions[idx] for idx in top_results[1].tolist()]  # extract top 100 sentences

time_t2 = perf_counter()
sentence_combinations = [[query, sentence] for sentence in top_sentences]

cross_encoder = CrossEncoder('cross-encoder/stsb-distilroberta-base')
similarity_scores = cross_encoder.predict(sentence_combinations)
top_5_idx = np.argsort(similarity_scores)[::-1][:5]  # indices of the 5 highest scores

print("### Query:", query)
print("Top 5 most similar queries:")
for idx in top_5_idx:
    print("({:.4f}) {}".format(similarity_scores[idx], top_sentences[idx]))

time_t3 = perf_counter()
print("Compute bi-encoder in","{:.4f}".format(time_t2 - time_t1),"seconds")
print("Compute cross-encoder from top 100 in","{:.4f}".format(time_t3 - time_t2),"seconds")
print("Total time: ", "{:.4f}".format(time_t3 - time_t1), "seconds")

## Note and TODO
Similarity cannot be calculated for all sentences in both sets at once (there is not enough memory for 230TB =)), so:
- we can score the pairs one by one
- apply a sigmoid function: a threshold on the similarity score marks a question pair as similar or not
    - linear regression to select the proper threshold
- calculate the accuracy
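
As a starting point for the threshold and accuracy items above, here is a minimal sketch. It swaps the planned sigmoid and regression step for a plain sweep over candidate thresholds, which is simpler but not the same technique. The function names and the toy `scores` and `labels` are hypothetical; in practice the scores would come from the encoders above and the labels from the `is_duplicate` column.

```python
from typing import Sequence, Tuple


def accuracy_at(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """Fraction of pairs classified correctly when score >= threshold means 'duplicate'."""
    predictions = [1 if score >= threshold else 0 for score in scores]
    correct = sum(pred == label for pred, label in zip(predictions, labels))
    return correct / len(labels)


def best_threshold(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Try every observed score as a threshold and keep the most accurate one."""
    candidates = sorted(set(scores))
    best = max(candidates, key=lambda t: accuracy_at(scores, labels, t))
    return best, accuracy_at(scores, labels, best)


# Toy values for illustration only (not results from this notebook)
scores = [0.92, 0.35, 0.80, 0.55, 0.10]
labels = [1, 0, 1, 0, 0]

threshold, accuracy = best_threshold(scores, labels)
print("threshold:", threshold, "accuracy:", accuracy)
```

The threshold should be selected on a held-out part of the labelled pairs, otherwise the reported accuracy is optimistic.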

## Export and import the model

Export the model output (questions and their embeddings) to a file. The file can be used to restore them later without encoding again.

In [None]:
import pickle

#Store sentences & embeddings on disc
with open('/gdrive/MyDrive/Colab Notebooks/data/quora/output/embeddings_500k.pkl', "wb") as fOut:
    pickle.dump({'questions': questions, 
                 'embeddings': embeddings}, 
                fOut, protocol=pickle.HIGHEST_PROTOCOL)

Import from a file. In our case, the file is generated on Kaggle, and we then use the pre-computed result to build the search engine.

In [None]:
# Load sentences & embeddings from disc.
# The dictionary keys must match the keys used when the file was written.
with open('question1.pkl', "rb") as fIn:
    stored_data = pickle.load(fIn)
    question1 = stored_data['sentences']
    embeddings1 = stored_data['embeddings']
with open('question2.pkl', "rb") as fIn:
    stored_data = pickle.load(fIn)
    question2 = stored_data['sentences']
    embeddings2 = stored_data['embeddings']

### Import from gpu model to cpu

Note that there are some limitations:
- an API server cannot be hosted on Kaggle/Colab
- the file exported from Kaggle/Colab cannot be loaded directly on the local machine (not enough GPU memory for the model)

So it is best to [load the model trained with a GPU on a local machine with only a CPU](https://stackoverflow.com/questions/57081727/load-pickle-file-obtained-from-gpu-to-cpu)

In [None]:
# from sentence_transformers import SentenceTransformer, util
from time import perf_counter
import pickle
import torch
import io

# By default, pickle restores tensors to the device they were saved from.
# This unpickler intercepts torch storage loading and maps it to the CPU instead.
class CpuUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if module == 'torch.storage' and name == '_load_from_bytes':
            return lambda b: torch.load(io.BytesIO(b), map_location='cpu')
        return super().find_class(module, name)

        
# model = SentenceTransformer('paraphrase-distilroberta-base-v1')

t1 = perf_counter()
#Load sentences & embeddings from disc
with open('embeddings.pkl', "rb") as fIn:
    stored_data = CpuUnpickler(fIn).load()
    question1 = stored_data['sentences']
    embeddings1 = stored_data['embeddings']
    question2 = stored_data['sentences2']
    embeddings2 = stored_data['embeddings2']
    
t2 = perf_counter()

print("Took {:.2f} seconds to import model".format(t2-t1))

## Evaluating model

### Only bi-encoder => accuracy
### Only cross-encoder => computation issue

### New model
[Formula 1 - Kaggle](https://www.kaggle.com/plarmuseau/semantic-similarity-for-short-sentences)

[Word order similarity - paper](https://arxiv.org/pdf/1802.05667.pdf)
```
P = 0.85
simi = P * semantic_similarity(q1, q2, is_duplicate) + (1-P)*word_order_similarity(q1, q2)
```
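
The weighted sum above can be sketched as follows. This is an illustration under assumptions, not the implementation from the linked notebook or paper: the word order score uses one common formulation, `1 - ||r1 - r2|| / ||r1 + r2||` over word position vectors, simplified here to exact word matches only, and the semantic score is passed in as a plain number (for example a cosine score from the bi-encoder). The helper `order_vector` and the function signatures are hypothetical.

```python
import math
from typing import List

P = 0.85  # weight of the semantic part, as in the formula above


def order_vector(tokens: List[str], joint_words: List[str]) -> List[int]:
    """1-based position of each joint word in the sentence, 0 if it is absent."""
    positions = {}
    for position, token in enumerate(tokens, start=1):
        positions.setdefault(token, position)  # keep the first occurrence
    return [positions.get(word, 0) for word in joint_words]


def word_order_similarity(q1: str, q2: str) -> float:
    """Simplified word order similarity based on exact word matches (assumption)."""
    tokens1, tokens2 = q1.lower().split(), q2.lower().split()
    joint_words = list(dict.fromkeys(tokens1 + tokens2))  # unique words, order kept
    r1 = order_vector(tokens1, joint_words)
    r2 = order_vector(tokens2, joint_words)
    diff_norm = math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
    sum_norm = math.sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
    return 1.0 if sum_norm == 0 else 1.0 - diff_norm / sum_norm


def combined_similarity(semantic_score: float, q1: str, q2: str, p: float = P) -> float:
    """Weighted sum of a precomputed semantic score and the word order score."""
    return p * semantic_score + (1 - p) * word_order_similarity(q1, q2)


print(word_order_similarity("a dog bites a man", "a man bites a dog"))
print(combined_similarity(0.9, "a dog bites a man", "a man bites a dog"))
```

With `P = 0.85`, the word order term can shift the final score by at most 0.15, so it mainly separates sentences that share the same words in a different order.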

- S1: `A gem is a jewel or stone that is used in jewellery.`
- S2: `A jewel is a precious stone used to decorate valuable things that you wear, such as rings or necklaces.`

|Words|Similarity|
|--|--|
|jewel - jewel |0.997421032224|
|jewel - stone| 0.217431543606|
|jewel - used| 0.0|
|jewel - decorate| 0.0|
|jewel - valuable| 0.0|
|jewel - things| 0.406309448212|
|jewel - wear| 0.0|
|jewel - rings| 0.456849659596|
|jewel - necklaces| 0.41718607131|
