<a href="https://colab.research.google.com/github/versant2612/jnotebooks/blob/main/content/courses/deeplearning/notebooks/SiameseBERT_SemanticSearch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentence Embeddings using Siamese BERT-Networks
---
This Google Colab notebook shows how to use the Sentence Transformer Python library to create BERT embeddings for sentences and run semantic searches over them.

The Sentence Transformer library is available on [PyPI](https://pypi.org/project/sentence-transformers/) and [GitHub](https://github.com/UKPLab/sentence-transformers). The library implements code from the EMNLP 2019 paper "[Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://www.aclweb.org/anthology/D19-1410.pdf)" by Nils Reimers and Iryna Gurevych.


## Install Sentence Transformer Library

In [None]:
# Install the library using pip
!pip install sentence-transformers

Collecting sentence-transformers
  Downloading https://files.pythonhosted.org/packages/ce/4b/0add07b1eebbbe83e77fb5ac4e72e87046c3fc2c9cb16f7d1cd8c6921a1d/sentence-transformers-0.3.7.2.tar.gz (59kB)
     |████████████████████████████████| 61kB 5.2MB/s 
Collecting transformers<3.4.0,>=3.1.0
  Downloading https://files.pythonhosted.org/packages/19/22/aff234f4a841f8999e68a7a94bdd4b60b4cebcfeca5d67d61cd08c9179de/transformers-3.3.1-py3-none-any.whl (1.1MB)
     |████████████████████████████████| 1.1MB 10.5MB/s 
Collecting sentencepiece!=0.1.92
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
     |████████████████████████████████| 1.1MB 53.5MB/s 
Collecting tokenizers==0.8.1.rc2
  Downloading https://files.pythonhosted.org/packages/80/83/8b9fccb9e48eeb575ee19179e2bdde0ee9a1904f97de5f02d19016b8804f/tokenizers-0.8.1rc2-cp36

## Load the BERT Model

In [None]:
from sentence_transformers import SentenceTransformer

# Load the BERT model. Pretrained models are available for:
#   Natural Language Inference (NLI): https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/nli-models.md
#   Semantic Textual Similarity (STS): https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/sts-models.md

model = SentenceTransformer('bert-base-nli-mean-tokens')

100%|██████████| 405M/405M [00:51<00:00, 7.92MB/s]
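
The `mean-tokens` suffix in the model name refers to mean pooling: the sentence embedding is the average of the token embeddings that BERT produces, skipping padding positions. The following is a minimal sketch of that arithmetic in plain Python, not the library's own code; the `mean_pool` helper, the toy token vectors, and the mask are hypothetical and chosen only for illustration.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is 0)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [value / count for value in total]

# Toy input: three real tokens plus one padding position, 2 dimensions each.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

pooled = mean_pool(tokens, mask)
print(pooled)  # [3.0, 4.0]
```

The padding vector `[9.0, 9.0]` does not affect the result because its mask entry is 0. In the real model each token vector has 768 dimensions rather than 2.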


## Set Up a Corpus

In [None]:
# A corpus is a list of documents, here split into individual sentences.

sentences = ['Absence of sanity', 
             'Lack of saneness',
             'A man is eating food.',
             'A man is eating a piece of bread.',
             'The girl is carrying a baby.',
             'A man is riding a horse.',
             'A woman is playing violin.',
             'Two men pushed carts through the woods.',
             'A man is riding a white horse on an enclosed ground.',
             'A monkey is playing drums.',
             'A cheetah is running behind its prey.']

# Each sentence is encoded as a 1-D vector with 768 dimensions
sentence_embeddings = model.encode(sentences)

print('Sample BERT embedding vector - length', len(sentence_embeddings[0]))

print('Sample BERT embedding vector - note includes negative values', sentence_embeddings[0])

Sample BERT embedding vector - length 768
Sample BERT embedding vector - note includes negative values [ 2.95402944e-01  2.91811436e-01  2.16480088e+00  2.20419794e-01
 -1.30862771e-02  1.01950371e+00  1.51298153e+00  2.34132320e-01
  2.73057610e-01  1.35122865e-01 -1.11313331e+00 -1.25885352e-01
  1.45378396e-01  9.77708817e-01  1.39352274e+00  4.57705200e-01
 -5.82130790e-01 -7.24941134e-01 -3.61734450e-01 -2.27515027e-01
  1.66629069e-02  2.04862550e-01  6.55132949e-01 -1.29376423e+00
 -7.26099491e-01 -1.91135973e-01 -3.07211190e-01 -1.30278611e+00
 -1.42963862e+00  5.67500899e-03  3.54811519e-01  4.83713001e-01
  6.65388465e-01  5.33848584e-01  6.40497088e-01  5.90408325e-01
  7.83849061e-02 -1.07759213e+00 -1.24676727e-01 -3.98406595e-01
  7.36314774e-01  5.28292835e-01  5.63290417e-01  4.14545923e-01
  4.49179173e-01 -9.58785564e-02  1.45424581e+00 -2.69144714e-01
 -2.44059891e-01 -1.10387063e+00 -2.00923488e-01 -2.17445171e-03
  1.83387911e+00  1.06518483e+00 -5.11946321e-01 -1.
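
Since the embedding components can be negative and the vectors are not scaled to unit length, embeddings are typically compared by the angle between them rather than by their raw values. Below is a minimal sketch of cosine similarity in plain Python; the `cosine_similarity` helper and the 2-D toy vectors are illustrative assumptions, not part of the notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (both non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

u = [3.0, 4.0]
v = [6.0, 8.0]    # same direction as u, twice the length
w = [-4.0, 3.0]   # perpendicular to u

same = cosine_similarity(u, v)                    # 1.0: length is ignored
ortho = cosine_similarity(u, w)                   # 0.0: unrelated directions
opposite = cosine_similarity(u, [-3.0, -4.0])     # -1.0: opposite directions
print(same, ortho, opposite)
```

The score depends only on direction, so two sentences whose embeddings point the same way score 1.0 regardless of vector length.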

## Perform Semantic Search

In [1]:
#@title Semantic Search Form
import scipy.spatial.distance  # importing scipy alone may not load the spatial submodule

# code adapted from https://github.com/UKPLab/sentence-transformers/blob/master/examples/application_semantic_search.py

query = 'comida' #@param {type: 'string'}

queries = [query]
query_embeddings = model.encode(queries)

# Find the closest sentences in the corpus for each query, ranked by cosine similarity
number_top_matches = 5 #@param {type: "number"}

print("Semantic Search Results")

for query, query_embedding in zip(queries, query_embeddings):
    distances = scipy.spatial.distance.cdist([query_embedding], sentence_embeddings, "cosine")[0]

    results = zip(range(len(distances)), distances)
    results = sorted(results, key=lambda x: x[1])

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop %d most similar sentences in corpus:" % number_top_matches)

    for idx, distance in results[0:number_top_matches]:
        print(sentences[idx].strip(), "(Cosine Score: %.4f)" % (1-distance))

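
The ranking step above sorts every cosine distance and then reports `1 - distance` as the score. As a rough sketch of the same idea, the top matches can also be picked with `heapq.nsmallest` from the standard library, which avoids sorting the whole corpus. The `top_matches` helper and the distance values below are hypothetical and stand in for the output of `cdist`.

```python
import heapq

def top_matches(distances, k):
    """Return (index, score) pairs for the k smallest cosine distances."""
    nearest = heapq.nsmallest(k, enumerate(distances), key=lambda pair: pair[1])
    # Cosine distance is 1 - cosine similarity, so invert it to get the score.
    return [(idx, 1.0 - dist) for idx, dist in nearest]

# Hypothetical cosine distances from one query to five corpus sentences.
distances = [0.75, 0.25, 0.5, 0.125, 1.0]

matches = top_matches(distances, 3)
for idx, score in matches:
    print(idx, "(Cosine Score: %.4f)" % score)
```

For a corpus of eleven sentences a full sort is perfectly adequate; selecting only the top entries matters mainly when the corpus is large and `k` is small.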