# BERTSimilarity Library

A minimalistic [BERT embedding](https://github.com/abhilash1910/BERTSimilarity) library built on the BERT (base uncased) model for measuring semantic similarity.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
data=pd.read_csv('../input/quora-question-pairs/train.csv.zip')

In [None]:
data.head()

In [None]:
len(data)

In [None]:
!pip install BERTSimilarity

## BERTSimilarity Library

BERTSimilarity is a PyTorch library built on BERT (bert-base-uncased) and SciPy for measuring semantic similarity between sentences. The library can be installed from [PyPI](https://pypi.org/project/BERTSimilarity/).

The GitHub repository for the library can be found [here](https://github.com/abhilash1910/BERTSimilarity). The library runs a forward pass through BERT (inference only, without backpropagation) to produce embeddings for sentences. These sentence vectors are then compared using cosine distance to measure similarity.
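
The cosine comparison mentioned above can be sketched with SciPy. This is only an illustration of the idea, not the library's actual code: the helper name `cosine_similarity` and the toy 3-dimensional vectors are assumptions made for the example (real bert-base-uncased vectors have 768 dimensions).

```python
import numpy as np
from scipy.spatial.distance import cosine


def cosine_similarity(u, v):
    """Cosine similarity of two vectors (hypothetical helper for this sketch).

    SciPy's `cosine` returns the cosine *distance*, i.e. 1 - similarity,
    so we subtract it from 1 to get a score where higher means more similar.
    """
    return 1.0 - cosine(u, v)


# Toy vectors standing in for sentence embeddings (assumed values).
emb_a = np.array([1.0, 2.0, 3.0])
emb_b = np.array([2.0, 4.0, 6.0])   # points in the same direction as emb_a
emb_c = np.array([3.0, 0.0, -1.0])  # orthogonal to emb_a (dot product is 0)

print(cosine_similarity(emb_a, emb_b))  # close to 1: same direction
print(cosine_similarity(emb_a, emb_c))  # close to 0: unrelated directions
```

Because cosine similarity depends only on the angle between the vectors, two sentences whose embeddings point in nearly the same direction score close to 1 regardless of vector length.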

The BERT feed-forward architecture, as well as the scaled attention layer (used for fine-tuning and pretraining), is illustrated below.

<img src='https://miro.medium.com/max/1115/1*_7-cBzOXfL175oeAgYPeHA.png'>



The Google Research GitHub repository for BERT can be found [here](https://github.com/google-research/bert/) and the associated paper is available [here](https://arxiv.org/abs/1810.04805).

## Transformer Architecture 

The Transformer architecture, which forms the most significant part of BERT, is built from feed-forward networks, multi-head attention and layer normalization. The original Transformer is an encoder-decoder architecture; BERT uses only the encoder stack. Its self-attention mechanism lets every token attend to every other token, which helps retain information across long sequences.
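
To make the attention step concrete, here is a minimal NumPy sketch of scaled dot-product attention as defined in the Transformer paper, softmax(QKᵀ/√d_k)V. It is a simplification under stated assumptions: a single head, no learned projection matrices, and random toy inputs; the function and variable names are made up for this example.

```python
import numpy as np


def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable softmax over the key dimension.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average of the value vectors.
    return weights @ V, weights


# Toy input: 4 tokens with 8-dimensional representations (assumed sizes).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))

# Self-attention: queries, keys and values all come from the same tokens.
# A real Transformer layer would first apply learned linear projections.
output, weights = scaled_dot_product_attention(tokens, tokens, tokens)

print(output.shape)          # one output vector per token
print(weights.sum(axis=-1))  # each row of attention weights sums to 1
```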

The visualization of the Transformer architecture is provided here:

<img src='https://miro.medium.com/max/2880/1*BHzGVskWGS_3jEcYYi6miQ.png'>


The Transformer paper can be found [here](https://arxiv.org/abs/1706.03762). For a more elaborate implementation, including different SOTA algorithms for classification, refer to this [Notebook](https://www.kaggle.com/abhilash1910/nlp-workshop-2-ml-india).

In [None]:
# Example use case
# Import the module under its own alias so the instance created below does not shadow it
import BERTSimilarity.BERTSimilarity as bert_module

if __name__=='__main__':
    f1='The man is playing soccer.'
    f2='The man is playing football.'
    # Create the model once; later cells reuse this `bertsimilarity` instance
    bertsimilarity=bert_module.BERTSimilarity()
    dist=bertsimilarity.calculate_distance(f1,f2)
    print('The distance between sentence1: '+f1+' and sentence2: '+f2+' is '+str(dist))

## Testing the BERTSimilarity library on a small sample of 100 question pairs

This test checks how well the BERTSimilarity model performs at semantic similarity matching. As an example use case, the first 100 question pairs from the dataset are used.

In [None]:
%%time

# Compute the BERTSimilarity score for one pair of sentences/paragraphs
def calculate_similarity(q1,q2,bertsimilarity):
    return bertsimilarity.calculate_distance(q1,q2)

if __name__=='__main__':
    # Score the first 100 question pairs, iterating over both columns in parallel
    sample=data.head(100)
    distances=[]
    for q1,q2 in zip(sample['question1'],sample['question2']):
        distances.append(calculate_similarity(q1,q2,bertsimilarity))
    print(distances)



In [None]:
# Collect the first 100 question pairs together with their scores
result_dataset=data[['question1','question2']].head(100).copy()
result_dataset['similarity_score']=distances

## Get an idea about the dataset

From the similarity scores we see that most semantically similar question pairs score above 0.9, while contextually different pairs score considerably lower. We can use these scores together with a threshold to decide which questions are similar and which are not.
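
A thresholding step along these lines might look like the sketch below. The cut-off of 0.9 is taken from the observation above and is only a starting assumption; the small `scores` table uses made-up placeholder values rather than real model output, and the column name `predicted_duplicate` is hypothetical.

```python
import pandas as pd

# Placeholder scores for illustration only; in this notebook the real values
# would come from result_dataset['similarity_score'].
scores = pd.DataFrame({
    'question1': ['pair 1, question A', 'pair 2, question A', 'pair 3, question A'],
    'question2': ['pair 1, question B', 'pair 2, question B', 'pair 3, question B'],
    'similarity_score': [0.95, 0.62, 0.91],
})

# Assumed cut-off; it should be tuned on labelled data before being trusted.
THRESHOLD = 0.9

# Mark a pair as a predicted duplicate when its score exceeds the threshold.
scores['predicted_duplicate'] = (scores['similarity_score'] > THRESHOLD).astype(int)

print(scores[['similarity_score', 'predicted_duplicate']])
```

The Quora training file also carries an `is_duplicate` label, so the predicted column could be compared against it to choose a better threshold.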

In [None]:
result_dataset.head()

## Further Development

This minimalistic library can be used for further semantic analysis of the dataset, which is left for interested Kagglers to try out! If you found this kernel helpful, an upvote would be appreciated.