# Sentence Embeddings using Siamese BERT-Networks & Aquila
---
This Google Colab notebook shows how to use the Sentence Transformers Python library to quickly create BERT embeddings for sentences and store them in AquilaDB for fast semantic search.

https://github.com/aneesha/SiameseBERT-Notebook/blob/master/SiameseBERT_SemanticSearch.ipynb
The Sentence Transformers library is available on [PyPI](https://pypi.org/project/sentence-transformers/) and [GitHub](https://github.com/UKPLab/sentence-transformers). It implements the EMNLP 2019 paper "[Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://www.aclweb.org/anthology/D19-1410.pdf)" by Nils Reimers and Iryna Gurevych.
The embeddings are stored in AquilaDB: https://aquiladb.xyz/docs/introduction
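
As background, the model name used later in this notebook, `bert-base-nli-mean-tokens`, refers to mean pooling: the sentence vector is the average of BERT's token vectors, with padding positions ignored. The following is a minimal NumPy sketch of that pooling step on toy data. The function name `mean_pool` and the toy arrays are illustrative assumptions, not the library's own code.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padding positions.

    token_embeddings: array of shape (batch, seq_len, hidden)
    attention_mask:   array of shape (batch, seq_len); 1 = real token, 0 = padding
    """
    mask = attention_mask[..., np.newaxis].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # guard against division by zero
    return summed / counts

# Toy input (hypothetical): 2 sentences, 3 token slots each, hidden size 4.
# A BERT-base model would use a hidden size of 768 instead.
tokens = np.arange(24, dtype=np.float64).reshape(2, 3, 4)
mask = np.array([[1, 1, 0],   # the last slot of the first sentence is padding
                 [1, 1, 1]])
pooled = mean_pool(tokens, mask)
print(pooled.shape)
```

Each row of `pooled` is one fixed-length sentence vector, which is what makes the embeddings directly comparable in a vector database.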


## Install the Libraries

In [1]:
# Install both libraries using pip
!pip3 install sentence-transformers
!pip3 install aquiladb




## Load the BERT Model

In [2]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-nli-mean-tokens')

# Start AquilaDB locally before connecting, e.g.:
#   docker run -d -i -p 50051:50051 -v "data:/data" -t ammaorg/aquiladb:latest
from aquiladb import AquilaClient as acl
# Connect to the local AquilaDB instance on port 50051
db = acl('localhost', 50051)

I0310 13:41:22.045379 4592192960 file_utils.py:35] PyTorch version 1.3.0.post2 available.
I0310 13:41:24.625150 4592192960 SentenceTransformer.py:29] Load pretrained SentenceTransformer: bert-base-nli-mean-tokens
I0310 13:41:24.625917 4592192960 SentenceTransformer.py:32] Did not find a / or \ in the name. Assume to download model from server
I0310 13:41:24.627104 4592192960 SentenceTransformer.py:68] Load SentenceTransformer from folder: /Users/randyhoulahan/.cache/torch/sentence_transformers/public.ukp.informatik.tu-darmstadt.de_reimers_sentence-transformers_v0.2_bert-base-nli-mean-tokens.zip
I0310 13:41:24.640283 4592192960 configuration_utils.py:182] loading configuration file /Users/randyhoulahan/.cache/torch/sentence_transformers/public.ukp.informatik.tu-darmstadt.de_reimers_sentence-transformers_v0.2_bert-base-nli-mean-tokens.zip/0_BERT/config.json
I0310 13:41:24.641008 4592192960 configuration_utils.py:199] Model config {
  "attention_probs_dropout_prob": 0.1,
  "finetuning_tas

## Set Up a Corpus

In [3]:
# The corpus is a list of sentences (documents already split into sentences).

sentences = ['Absence of sanity', 
             'Lack of saneness',
             'A man is eating food.',
             'A man is eating a piece of bread.',
             'The girl is carrying a baby.',
             'A man is riding a horse.',
             'A woman is playing violin.',
             'Two men pushed carts through the woods.',
             'A man is riding a white horse on an enclosed ground.',
             'A monkey is playing drums.',
             'A cheetah is running behind its prey.']

# Each sentence is encoded as a 1-D vector with 768 values
sentence_embeddings = model.encode(sentences)

print('Sample BERT embedding vector - length', len(sentence_embeddings[0]))

for embedding in sentence_embeddings:
    # Wrap the vector and its metadata in an AquilaDB document
    sample = db.convertDocument(embedding, {"hello": "world"})
    # Add the document and print the response (status and _id)
    print(db.addDocuments([sample]))

    

Batches: 100%|██████████| 2/2 [00:00<00:00, 12.21it/s]


Sample BERT embedding vector - length 768
status: true
_id: "8978081260d8d62dcadf0bff61351c56"

status: true
_id: "940606563a91786587b0cb04e9b729f6"

status: true
_id: "90d47bb88195e982ee99f6982380f1b4"

status: true
_id: "51e5fb9f875637da1a933660495ad48c"

status: true
_id: "cee5db6383e0a6deb46102b89582cefa"

status: true
_id: "2d552cb4d4a8df629bb83d0651d2cdb8"

status: true
_id: "efd614d03c4922977f0461ec9e63e051"

status: true
_id: "5df9f748c1e6ea266f850708d3d4c9d0"

status: true
_id: "2fd906c0d42506f247ed2efa2fc0181f"

status: true
_id: "6454f7e82b4522b6f9ffe2b488d48981"

status: true
_id: "a208dae0f07cb623ddf6426f7470ff03"
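
The corpus above is written by hand as a list of sentences. For a longer document, the text would first need to be split into sentences. The following is a rough sketch using a naive regular expression. The helper `split_sentences` and the sample text are hypothetical, and a dedicated sentence tokenizer would handle abbreviations and other edge cases better.

```python
import re

def split_sentences(text):
    """Naively split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

# Hypothetical document used only for illustration
document = "A man is eating food. A woman is playing violin. Is a monkey playing drums?"
corpus = split_sentences(document)
print(corpus)
```

The resulting list has the same shape as `sentences` above, so it could be passed to `model.encode` in the same way.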



## Perform Semantic Search

In [4]:
import numpy as np
import json 

query = 'Nobody has sane thoughts' #@param {type: 'string'}

queries = [query]
query_embeddings = model.encode(queries)

# Find the k closest sentences of the corpus for each query sentence based on cosine similarity
k = 3 #@param {type: "number"}

print("Semantic Search Results")

q_matrix = db.convertMatrix(query_embeddings[0])
result = db.getNearest(q_matrix, k)

print("result", json.loads(result.documents))




Batches: 100%|██████████| 1/1 [00:00<00:00, 29.59it/s]

Semantic Search Results
result []
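
To sanity-check what a nearest-neighbour query should return, the same search can be run in memory with a brute-force cosine similarity. The sketch below uses small stand-in vectors rather than real embeddings. The function `top_k_cosine` is an illustrative assumption and is not part of the AquilaDB client.

```python
import numpy as np

def top_k_cosine(query_vec, corpus_vecs, k=3):
    """Return (index, score) pairs for the k corpus vectors most similar to the query."""
    corpus_norm = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    scores = corpus_norm @ query_norm      # cosine similarity for every corpus vector
    order = np.argsort(-scores)[:k]        # indices sorted by descending similarity
    return [(int(i), float(scores[i])) for i in order]

# Stand-in 2-D vectors (hypothetical); real embeddings here have 768 dimensions
corpus_vecs = np.array([[1.0, 0.0],
                        [0.0, 1.0],
                        [1.0, 1.0],
                        [-1.0, 0.0]])
query_vec = np.array([2.0, 0.1])
hits = top_k_cosine(query_vec, corpus_vecs, k=3)
for idx, score in hits:
    print(idx, round(score, 4))
```

In this notebook, a call along the lines of `top_k_cosine(query_embeddings[0], np.asarray(sentence_embeddings), k)` should give indices into `sentences`, which can then be compared with whatever the database returns.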



