<h1 style="text-align:center;font-size:30px;" >Quora Question Pairs : Sentence Transformers and BERT for Semantic Similarity</h1>

This work demonstrates how to measure textual similarity between a pair of documents using Sentence Transformers and a pre-trained BERT model. In this work, I have used the "Quora Question Pairs" dataset; details about it can be found [here](http://www.kaggle.com/c/quora-question-pairs).

## Import Libraries

In [None]:
# For printing all the outputs of a cell in the same output window

# from IPython.core.interactiveshell import InteractiveShell  
# InteractiveShell.ast_node_interactivity = "all"         #for enabling
# InteractiveShell.ast_node_interactivity = "last_expr"   #for disabling


# Basic Libraries

import numpy as np
import pandas as pd
import pandas_profiling
import re
import string
import random
import math
import time
import os
from os import listdir
import itertools
import collections
from collections import Counter, defaultdict
from tqdm import tqdm
from sklearn import utils

# Visualization

import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
plt.style.use('fivethirtyeight')
import seaborn as sns
sns.set_style('darkgrid')

# Vector Similarity
from sklearn.metrics.pairwise import cosine_similarity

# evaluation metrics
from sklearn import metrics


# Deep learning

# import tensorflow
# from tensorflow import keras
# from tensorflow.keras import backend as KB
# # from tensorflow.keras import models, layers, preprocessing as keras_processing


# Bert language model

!pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer
!pip install transformers
from transformers import AutoTokenizer, AutoModel
import torch

# model serialization
import pickle

## Load Dataset

In [None]:
train_df = pd.read_csv('../input/quora-question-pairs/train.csv.zip')
print("Train Dataframe:")
display(train_df.head(3))  # display() is needed because this is not the last expression in the cell
print(f'Train dataframe contains {train_df.shape[0]} samples.')
print('Number of features in train data : ', train_df.shape[1])
print('Train Features : ', train_df.columns.values)

**Dataset contains below data fields:**

- id:  a simple rowID
- qid(1, 2):  unique IDs of each question in the pair
- question(1, 2):  actual text contents of the questions.
- is_duplicate:  the label we are trying to Predict, i.e. whether the two questions are duplicates of each other.

## Dataset Analysis

**Dataset Complete Information at a glance Using Pandas Profiling:**

Pandas Profiling is a Python package for quick exploratory data analysis of a pandas DataFrame. It extends the DataFrame with `df.profile_report()`, which produces an interactive HTML report with a lot of information from a single line of code.

In [None]:
train_df.profile_report()

**Check the basic stats of the data:**

The pandas `df.describe()` and `df.info()` functions give us a basic overview of the entire dataset.

In [None]:
# Null values and Data types
print('Train Set:\n')
print(train_df.info())
print('')

In [None]:
# basic stats
print('Train set basic stats:')
train_df.describe(include='all')

- We can observe that missing values are present in the data, so let's handle them.

### Handling the Missing Values

In [None]:
print('Train data Null values :')
train_df[train_df.isnull().any(axis=1)]  # rows with at least one null value

- There are 3 null values in the question1 and question2 texts, spread across only 3 samples, so we will fill these null values with empty strings.

In [None]:
train_df = train_df.fillna(value="")
train_df[train_df.isnull().any(axis=1)]  # should now be empty
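
To illustrate the fill step above on a small scale, here is a minimal sketch using a made-up three-row frame. The name `toy_df` and its contents are hypothetical and not taken from the dataset; the sketch only assumes, as above, that an empty string is an acceptable stand-in for a missing question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mimicking the two question columns (not real data)
toy_df = pd.DataFrame({
    "question1": ["How do I learn Python?", None, "What is BERT?"],
    "question2": ["How can I learn Python?", "What is NLP?", np.nan],
})

# Rows that contain at least one missing value
rows_with_nulls = toy_df[toy_df.isnull().any(axis=1)]
n_rows_with_nulls = len(rows_with_nulls)

# Replace missing questions with empty strings, as done for train_df
toy_filled = toy_df.fillna(value="")
n_remaining_nulls = int(toy_filled.isnull().sum().sum())

print("Rows with nulls before filling:", n_rows_with_nulls)
print("Null values after filling:", n_remaining_nulls)
```

Note that an empty question still gets an embedding later on, so such rows are kept rather than dropped.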

**Check the distribution of output labels:**

In [None]:
train_df.is_duplicate.value_counts(normalize=True)

In [None]:
plt.figure(figsize=(8,6))
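# sort_index() fixes the bar order to label 0 (non-duplicate) then label 1 (duplicate),
# so the bar colors always match the legend below
train_df.is_duplicate.value_counts().sort_index().plot(kind='bar', color=['g','r'])

D = mpatches.Patch(color='r', label='Duplicate')
ND = mpatches.Patch(color='g', label='Non-Duplicate')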

plt.legend(handles=[D,ND], loc='best')

plt.xlabel('Type of Labels')
plt.ylabel('Count of Data per Label Category')
plt.title('Distribution of labels')
plt.show()

**Check the distribution of sentence lengths (in characters) for Question 1 and Question 2:**

In [None]:
q1_lengths = [len(q1) for q1 in train_df.question1]  # length in characters
print("Mean sentence length for Question1:", np.mean(q1_lengths))

plt.figure(figsize=(8,6))
plt.hist(q1_lengths,bins=50,density=True,color='b')
# sns.distplot(q1_lengths,bins=50,kde=True,color='b')
plt.xlabel('Question1 lengths')
plt.ylabel('Density')  # density=True normalizes the histogram, so the y-axis is not a count
plt.title('Distribution of Question1 sentence lengths')
plt.show()

In [None]:
q2_lengths = [len(q2) for q2 in train_df.question2]  # length in characters
print("Mean sentence length for Question2:", np.mean(q2_lengths))

plt.figure(figsize=(8,6))
plt.hist(q2_lengths,bins=50,density=True,color='r')
#sns.distplot(q2_lengths,bins=50,kde=True,color='r')
plt.xlabel('Question2 lengths')
plt.ylabel('Density')  # density=True normalizes the histogram, so the y-axis is not a count
plt.title('Distribution of Question2 sentence lengths')
plt.show()
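
The lengths above are measured in characters. As a small sketch of an alternative, the pandas string accessor can compute both character and word counts without an explicit loop. The series `toy_questions` is a made-up example, and splitting on whitespace is only a rough assumption about what counts as a word.

```python
import pandas as pd

# Hypothetical example questions (not taken from the dataset)
toy_questions = pd.Series([
    "What is BERT?",
    "How do I learn Python quickly?",
    "",  # an empty string, like the filled-in missing values
])

# Number of characters per question
char_lengths = toy_questions.str.len()

# Number of whitespace-separated tokens per question
word_lengths = toy_questions.str.split().str.len()

print(pd.DataFrame({"chars": char_lengths, "words": word_lengths}))
```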

## Implementing Sentence Transformers

Generate sentences lists for question1 and question2.

In [None]:
sentences_question1 = train_df['question1'].tolist()
sentences_question2 = train_df['question2'].tolist()

In this approach I have defined a model for **Sentence Transformation** using the pre-trained '[bert-base-nli-mean-tokens](http://huggingface.co/sentence-transformers/bert-base-nli-mean-tokens)' model. The sentence-transformers library allows us to train and use Transformer models for generating sentence and text embeddings. The model is described in the paper [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](http://arxiv.org/abs/1908.10084).

In [None]:
st_model = SentenceTransformer('bert-base-nli-mean-tokens')
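
The `mean-tokens` part of the model name refers to mean pooling: the sentence embedding is the average of the token embeddings, ignoring padding positions. The sketch below illustrates that idea with NumPy on made-up numbers. The function name `mean_pool` and the toy arrays are illustrative assumptions, not part of the sentence-transformers API, which performs this step internally.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    token_embeddings: array of shape (batch, seq_len, dim)
    attention_mask:   array of shape (batch, seq_len); 1 = real token, 0 = padding
    """
    # Expand the mask to (batch, seq_len, 1) so it broadcasts over the embedding dim
    mask = attention_mask[..., np.newaxis].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Number of real tokens per sentence; clipped to avoid division by zero
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Two hypothetical "sentences" with 3 token slots and 2-dimensional token vectors
toy_tokens = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last slot is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
])
toy_mask = np.array([[1, 1, 0],
                     [1, 1, 1]])

pooled = mean_pool(toy_tokens, toy_mask)
print(pooled)  # one fixed-size vector per sentence
```

Whatever the sentence length, the result is one vector of fixed size per sentence, which is what makes the later cosine comparison possible.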

Let's define a function that generates an embedding for each sentence using the Sentence Transformers model.

In [None]:
def generate_sent_embeddings(data):
    return st_model.encode(data)

Generating sentence embeddings with a pre-trained sentence transformer is very time-consuming, and the running time grows with the size of the data. So I generated the sentence embeddings once and reuse them here through pickle serialization.
Let's load the saved sentence embeddings for the question1 and question2 texts, generating them only if no saved copy is found.

In [None]:
if os.path.isfile('../input/quora-question-pairs-sentence-transformers/question1_sent_embeddings.pkl'):
    #retrieve the question1_sent_embeddings list for usage.
    with open('../input/quora-question-pairs-sentence-transformers/question1_sent_embeddings.pkl', 'rb') as f: 
        question1_sent_embeddings = pickle.load(f)
else:
    question1_sent_embeddings = generate_sent_embeddings(sentences_question1)
    #save the question1_sent_embeddings list for later usage.
    with open('question1_sent_embeddings.pkl', 'wb') as f: 
        pickle.dump(question1_sent_embeddings, f)

print("shape of question1 sentence embeddings:", question1_sent_embeddings.shape)
train_df['question1_sent_embeddings'] = list(question1_sent_embeddings)  # one embedding vector per row

In [None]:
if os.path.isfile('../input/quora-question-pairs-sentence-transformers/question2_sent_embeddings.pkl'):
    #retrieve the question2_sent_embeddings list for usage.
    with open('../input/quora-question-pairs-sentence-transformers/question2_sent_embeddings.pkl', 'rb') as f: 
        question2_sent_embeddings = pickle.load(f)
else:
    question2_sent_embeddings = generate_sent_embeddings(sentences_question2)
    #save the question2_sent_embeddings list for later usage.
    with open('question2_sent_embeddings.pkl', 'wb') as f: 
        pickle.dump(question2_sent_embeddings, f)

print("shape of question2 sentence embeddings:", question2_sent_embeddings.shape)
train_df['question2_sent_embeddings'] = list(question2_sent_embeddings)  # one embedding vector per row
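
The two cells above follow the same load-or-compute pattern, which could be factored into a small helper. The sketch below is one possible way to do that. The names `load_or_compute` and `fake_embed` are hypothetical, and unlike the cells above (which read from an input dataset and write to the working directory) it uses a single cache path for simplicity.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute_fn):
    """Return the pickled object at cache_path, or compute and cache it."""
    if os.path.isfile(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = compute_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result

# Stand-in for the expensive embedding call; it counts how often it runs
call_count = {"n": 0}

def fake_embed():
    call_count["n"] += 1
    return [[0.1, 0.2], [0.3, 0.4]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_embeddings.pkl")
    first = load_or_compute(path, fake_embed)   # computes and saves
    second = load_or_compute(path, fake_embed)  # loads from disk

print("Times the embedding function ran:", call_count["n"])
```

In the notebook, `compute_fn` would be something like `lambda: generate_sent_embeddings(sentences_question1)`.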

Now let's compute the textual similarity between the question1 and question2 sentence embeddings using cosine similarity.

In [None]:
questions_similarity = []
for index, row in train_df.iterrows():
    questions_similarity.append(cosine_similarity([row['question1_sent_embeddings']],[row['question2_sent_embeddings']]))

#convert the question similarity array into 1d array 
questions_similarity = np.stack(questions_similarity,axis=0)
# questions_similarity = questions_similarity.tolist()
ques_sim = np.array(questions_similarity).ravel()

# store the question similarity scores in our dataframe
train_df['questions_similarity'] = pd.DataFrame({'questions_similarity' : ques_sim})
train_df['questions_similarity']
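
Looping over rows with `iterrows()` and calling `cosine_similarity` once per pair can be slow on a dataset of this size. As a hedged alternative, the same row-wise similarity can be computed in one vectorized step with NumPy. The function `rowwise_cosine_similarity` and the toy arrays below are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def rowwise_cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity between a[i] and b[i] for every row i.

    a, b: arrays of shape (n_samples, dim)
    eps:  small constant that guards against division by zero for zero vectors
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    dot = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dot / np.maximum(norms, eps)

# Hypothetical 2-dimensional "embeddings" for three question pairs
toy_q1 = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 0.0]])
toy_q2 = np.array([[2.0, 0.0], [1.0, -1.0], [-3.0, 0.0]])

toy_sims = rowwise_cosine_similarity(toy_q1, toy_q2)
print(toy_sims)
```

With the real data, the inputs would be the `question1_sent_embeddings` and `question2_sent_embeddings` arrays, and the result is already a 1-D array, so no stacking or `ravel()` is needed.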

Cosine similarity ranges from -1 to 1, with higher values indicating more similar sentences, so we can convert these similarity values into predicted labels (0 and 1) by setting an appropriate similarity threshold.

In [None]:
def similarity_to_predictions(cos_sim, threshold):
    """
    This function converts the predicted similarities to predicted labels based on the threshold value
    """
    if (cos_sim >= threshold):
        return 1
    else:
        return 0
    
train_df['predicted_result'] = train_df['questions_similarity'].apply(similarity_to_predictions, threshold=0.87)
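
The same thresholding can also be written as a single vectorized comparison instead of a per-row `apply`. This is a minimal sketch on made-up similarity values; `toy_similarities` and `toy_threshold` are hypothetical names, and 0.87 is simply the threshold used above.

```python
import numpy as np

# Hypothetical similarity scores (not computed from the dataset)
toy_similarities = np.array([0.95, 0.87, 0.50, 0.869])
toy_threshold = 0.87

# True where the similarity reaches the threshold, then cast to 0/1 labels
toy_predictions = (toy_similarities >= toy_threshold).astype(int)
print(toy_predictions)
```

On the dataframe this would correspond to `(train_df['questions_similarity'] >= 0.87).astype(int)`.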

Now let's check our dataset after processing.

In [None]:
train_df.head(3)

Finally, we can compare the predictions ('predicted_result') with the actual labels ('is_duplicate') to check the accuracy of the Quora question pair similarity predictions.

In [None]:
metrics.accuracy_score(train_df['is_duplicate'], train_df['predicted_result'])
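
Since the accuracy depends on the chosen threshold, one way to pick it is to evaluate several candidate values and compare them, ideally together with precision, recall, F1 and a majority-class baseline. The sketch below does this on made-up scores and labels. All names and numbers in it are hypothetical, and in practice the threshold would be better tuned on a held-out split than on the data used for the final evaluation.

```python
import numpy as np

def evaluate_threshold(similarities, labels, threshold):
    """Compute accuracy, precision, recall and F1 for one similarity threshold."""
    preds = (similarities >= threshold).astype(int)
    tp = int(np.sum((preds == 1) & (labels == 1)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))
    accuracy = float(np.mean(preds == labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"threshold": threshold, "accuracy": accuracy,
            "precision": precision, "recall": recall, "f1": f1}

# Hypothetical similarity scores and true labels (not from the dataset)
toy_sims = np.array([0.95, 0.90, 0.88, 0.80, 0.70, 0.60])
toy_labels = np.array([1, 1, 0, 1, 0, 0])
candidate_thresholds = [0.65, 0.75, 0.85, 0.93]

results = [evaluate_threshold(toy_sims, toy_labels, t) for t in candidate_thresholds]
best = max(results, key=lambda r: r["accuracy"])

# Accuracy obtained by always predicting the most frequent label
positive_rate = float(np.mean(toy_labels))
majority_baseline = max(positive_rate, 1 - positive_rate)

for r in results:
    print(r)
print("Best threshold by accuracy:", best["threshold"])
print("Majority-class baseline accuracy:", majority_baseline)
```

Comparing against the majority-class baseline matters here because the labels are imbalanced, so a seemingly good accuracy may be only slightly better than always predicting the majority label.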

**Concluding sentence transformers approach :**

In this work, I have implemented a Sentence Transformers model based on pre-trained "BERT" embeddings. The accuracy of the predictions depends on the similarity threshold, which can be tuned to an appropriate value. This implementation focuses on a simple use of pre-trained "BERT" embeddings and cosine similarity for evaluating the textual similarity between a pair of sentences.