# Sentence Embeddings using Siamese BERT-Networks
---
This Google Colab notebook shows how to use the Sentence Transformer Python library to create BERT embeddings for sentences and run fast semantic searches over them.

The Sentence Transformer library is available on [PyPI](https://pypi.org/project/sentence-transformers/) and [GitHub](https://github.com/UKPLab/sentence-transformers). The library implements code from the EMNLP 2019 paper "[Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://www.aclweb.org/anthology/D19-1410.pdf)" by Nils Reimers and Iryna Gurevych.


## Install Sentence Transformer Library

In [6]:
# Install the library using pip
!pip install sentence-transformers

Collecting sentence-transformers
  Downloading sentence-transformers-2.1.0.tar.gz (78 kB)
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.14.0-py3-none-any.whl (3.3 MB)
Collecting tokenizers>=0.10.3
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting huggingface-hub
  Downloading huggingface_hub-0.2.1-py3-none-any.whl (61 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)

## Load the BERT Model

In [7]:
from sentence_transformers import SentenceTransformer

# Load the BERT model. Pretrained models are available for Natural Language Inference (NLI):
# https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/nli-models.md
# and for Semantic Textual Similarity (STS):
# https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/sts-models.md

model = SentenceTransformer('bert-base-nli-mean-tokens')

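The `mean-tokens` suffix in the model name refers to mean pooling: the sentence embedding is the average of the BERT token embeddings, with padding positions left out. The following is a minimal NumPy sketch of that idea on toy data, not the library's actual implementation; the function name `mean_pool` and the toy arrays are made up for illustration.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padding (mask == 0)."""
    # Expand the mask to (batch, tokens, 1) so it broadcasts over the vector dimension.
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # (batch, 1), avoids division by zero
    return summed / counts

# Toy input (hypothetical): 2 sentences, 3 token slots each, 4-dimensional token vectors.
tokens = np.array([
    [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0], [9.0, 9.0, 9.0, 9.0]],  # third slot is padding
    [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0], [3.0, 3.0, 3.0, 3.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

sentence_vectors = mean_pool(tokens, mask)
print(sentence_vectors.shape)  # one fixed-size vector per sentence
```

Whatever the sentence length, the result is one vector of the same size per sentence, which is what makes the later distance comparison between a query and the corpus possible.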

## Set Up a Corpus

In [1]:
# Import the dataset
import pandas as pd
from google.colab import drive

drive.mount('/content/drive')
df = pd.read_csv('/content/drive/MyDrive/colab/reviews.csv', usecols=['text'])  # For encoding reviews
df_id = pd.read_csv('/content/drive/MyDrive/colab/reviews.csv', usecols=['text', 'business_id'])  # For business identification

Mounted at /content/drive


In [9]:
df_list_id = df_id.values.tolist()  # Convert to a list of rows

In [None]:
df_list = df.values.tolist()  # Convert to a list of single-element rows

# Flatten the rows into a plain list of review strings
df_list_transform = [val for sublist in df_list for val in sublist]

In [None]:
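import numpy as np

# Encode the reviews with the BERT model (takes about 40 minutes).
# This line must run at least once; otherwise df_embeddings is undefined below.
df_embeddings = model.encode(df_list_transform)
np.savetxt('test.out', df_embeddings, delimiter=',')  # Save the embeddings to a file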

In [4]:
# Load the encoded reviews (assumes test.out was copied to this Drive folder)
df_embeddings_from_file = np.loadtxt('/content/drive/MyDrive/colab/test.out', delimiter=',')
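`np.savetxt` writes the embeddings as text, which can become large and slow to parse for a big review corpus. One possible alternative, shown here only as a sketch, is NumPy's binary `.npy` format via `np.save` and `np.load`, which preserves the array's shape and dtype. The random array and the file name `embeddings.npy` are stand-ins and not part of the original notebook.

```python
import os
import tempfile

import numpy as np

# Stand-in for df_embeddings: 5 vectors of 768 dimensions (the BERT-base hidden size).
rng = np.random.default_rng(0)
fake_embeddings = rng.standard_normal((5, 768)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "embeddings.npy")  # hypothetical file name
    np.save(path, fake_embeddings)                  # binary write, keeps dtype and shape
    restored = np.load(path)                        # reads the whole array back into memory

print(restored.shape, restored.dtype)
```

In the notebook, the temporary directory would be replaced by a path on the mounted Drive so the file survives the Colab session.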

## Perform Semantic Search
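
The search cell below ranks reviews by cosine distance, which SciPy defines as 1 minus the cosine similarity; that is why the score is printed as `1 - distance`. As a rough illustration of the ranking logic, here is a dependency-free sketch on toy 2-dimensional vectors. The helper names `cosine_similarity` and `top_matches` and the toy data are assumptions made for this example, not part of the notebook or of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_matches(query_vec, corpus_vecs, k):
    """Return the k (index, score) pairs with the highest cosine similarity."""
    scored = [(idx, cosine_similarity(query_vec, vec))
              for idx, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy corpus (hypothetical): each row plays the role of one review embedding.
toy_corpus = [
    [1.0, 0.0],    # same direction as the query
    [0.0, 1.0],    # orthogonal to the query
    [1.0, 1.0],    # 45 degrees from the query
    [-1.0, 0.0],   # opposite direction
]
toy_query = [2.0, 0.0]

ranking = top_matches(toy_query, toy_corpus, k=2)
for idx, score in ranking:
    print(idx, "(Cosine Score: %.4f)" % score)
```

Cosine similarity depends only on direction, so the query `[2.0, 0.0]` matches `[1.0, 0.0]` perfectly even though the two vectors differ in length.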

In [21]:
from scipy.spatial.distance import cdist
#@title Semantic Search Form

# Code adapted from https://github.com/UKPLab/sentence-transformers/blob/master/examples/application_semantic_search.py
query = 'unfriendly staff' #@param {type: 'string'}

queries = [query]
query_embeddings = model.encode(queries)  # Encode our query

# Find the closest number_top_matches sentences of the corpus for each query sentence based on cosine similarity
number_top_matches = 30 #@param {type: "number"}

print("Semantic Search Results")

for query, query_embedding in zip(queries, query_embeddings):
    # cdist returns cosine distances (1 - cosine similarity); smaller means more similar
    distances = cdist([query_embedding], df_embeddings_from_file, "cosine")[0]

    results = zip(range(len(distances)), distances)
    results = sorted(results, key=lambda x: x[1])
    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop %d most similar sentences in corpus:" % number_top_matches)

    for idx, distance in results[0:number_top_matches]:
        print(df_list_id[idx], "(Cosine Score: %.4f)" % (1 - distance))

Semantic Search Results




Query: unfriendly staff

Top 30 most similar sentences in corpus:
['7sxYa0-TwWeWGFr5CvMMkw', 'Crappy signage, rude employes and poor service overall.'] (Cosine Score: 0.8881)
['U9lrX8Nviajz-74dF6zL7g', "Pretty disappointing experience. Employees don't care, and it's extremely inefficiently run."] (Cosine Score: 0.8278)
['sAdkyll7l2eFgb3EGPm58A', 'What a shitty customer experience! Terrible customer service, poor service, and no-trust in the process. Highly recommend looking elsewhere!'] (Cosine Score: 0.8166)
['0BGoel6on7yGvojzOqOEAQ', 'Disappointment all around. Not worth a description and definitely not worth the price.'] (Cosine Score: 0.8161)
['apNQUS92Vhsu8s5QUqdRQA', 'Extremely dirty (see reviews below) and not well kept up at all. Staff is not the friendliest. There are better options out there.'] (Cosine Score: 0.7996)
['_rt-Z934kfFzgG19nTrIcQ', "This place is horrible. Don't go here."] (Cosine Score: 0.7955)
['qbpJFE-XlspCCk3PWhZ0AA', 'The security g