## Siamese BERT-networks for semantic searching / information retrieval

In [1]:
import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen
import numpy as np

from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

In [2]:
PERSON = 'Sinan Ozdemir'

# Note: scraping Google like this is NOT an efficient way to search. It is done purely for educational purposes
google_html = BeautifulSoup(requests.get(f'https://www.google.com/search?q={PERSON}').text, 'html.parser').get_text()[:1024]

nlp = pipeline('question-answering', 
               model='deepset/roberta-base-squad2', 
               tokenizer='deepset/roberta-base-squad2', 
               max_length=10)

nlp(f'Who is {PERSON}?', google_html)

{'score': 0.09814409166574478,
 'start': 545,
 'end': 591,
 'answer': 'data scientist, start-up founder, and educator'}

In [3]:
PERSON = 'Barack Obama'

# Note: scraping Google like this is NOT an efficient way to search. It is done purely for educational purposes
google_html = BeautifulSoup(requests.get(f'https://www.google.com/search?q={PERSON}').text, 'html.parser').get_text()[:1024]

nlp = pipeline('question-answering', 
               model='deepset/roberta-base-squad2', 
               tokenizer='deepset/roberta-base-squad2', 
               max_length=10)

nlp(f'Who is {PERSON}?', google_html)

{'score': 0.18686845898628235,
 'start': 368,
 'end': 403,
 'answer': '44th President of the United States'}

In [4]:
# textbook about insects
text = urlopen('https://www.gutenberg.org/cache/epub/10834/pg10834.txt').read().decode()

# Only keep documents of at least 100 characters
documents = list(filter(lambda x: len(x) > 100, text.split('\r\n\r\n')))

documents = np.array(documents)

print(f'There are {len(documents)} documents/paragraphs')

There are 79 documents/paragraphs
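The splitting step can be tried on a small string without downloading anything. This is a minimal sketch; the helper name `split_into_documents` and its `min_chars` and `separator` parameters are made up for illustration and are not part of the notebook.

```python
def split_into_documents(text, min_chars=100, separator='\r\n\r\n'):
    """Split raw text on blank lines and keep only chunks longer than min_chars."""
    return [chunk for chunk in text.split(separator) if len(chunk) > min_chars]

# Toy input: a short heading, one long paragraph, a tiny fragment, another long paragraph
sample_text = 'SHORT HEADING\r\n\r\n' + 'A' * 150 + '\r\n\r\n' + 'tiny\r\n\r\n' + 'B' * 101
sample_documents = split_into_documents(sample_text)

print(f'There are {len(sample_documents)} documents/paragraphs')
```

Headings and other short fragments fall under the length threshold, so only paragraph-sized chunks get embedded.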


In [5]:
# This model was pre-trained on an asymmetric semantic search task
# We use the bi-encoder to encode all the documents so that we can run semantic search over them
bi_encoder = SentenceTransformer('msmarco-distilbert-base-v4')
bi_encoder.max_seq_length = 256     # Truncate long documents to 256 tokens

bi_encoder

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
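The `Pooling` module printed above has `pooling_mode_mean_tokens` set to `True`, which means the sentence embedding is the average of the token embeddings. The following is a rough pure-Python sketch of that idea, assuming padded positions are marked with 0 in an attention mask. The function name `mean_pool` and the toy vectors are illustrative only, and the library itself works on tensors.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padded positions (mask value 0)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vector):
                summed[j] += value
    # Guard against an all-padding input to avoid dividing by zero
    return [value / max(count, 1) for value in summed]

# Three token vectors of dimension 2; the last one is padding
toy_tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
toy_mask = [1, 1, 0]
sentence_vector = mean_pool(toy_tokens, toy_mask)
print(sentence_vector)
```

In the real model each token vector has 768 dimensions, which is why every document ends up as a single 768-dimensional vector.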

In [6]:
# Documents are encoded by calling model.encode(). This takes about 25 seconds on my laptop
document_embeddings = bi_encoder.encode(documents, convert_to_tensor=True, show_progress_bar=True)

document_embeddings.shape

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Batches:   0%|          | 0/3 [00:00<?, ?it/s]

torch.Size([79, 768])

In [7]:
QUESTION = 'How many horns does a flea have?'  # a natural language query

In [8]:
# Encode the query using the bi-encoder and find relevant documents
question_embedding = bi_encoder.encode(QUESTION, convert_to_tensor=True)

print(question_embedding.shape)

# Retrieve the top_k most similar documents with the bi-encoder
hits = util.semantic_search(question_embedding, document_embeddings, top_k=3)[0]

hits

torch.Size([768])


[{'corpus_id': 14, 'score': 0.4899493455886841},
 {'corpus_id': 19, 'score': 0.24793769419193268},
 {'corpus_id': 21, 'score': 0.18478833138942719}]
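By default `util.semantic_search` scores each document by cosine similarity to the query and returns the highest-scoring ones. A minimal pure-Python sketch of that ranking is shown here; the helper names `cosine_similarity` and `top_k_hits` and the 2-dimensional toy vectors are assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_hits(query_vector, corpus_vectors, top_k=3):
    """Score every corpus vector against the query and keep the best top_k."""
    scored = [
        {'corpus_id': i, 'score': cosine_similarity(query_vector, vector)}
        for i, vector in enumerate(corpus_vectors)
    ]
    return sorted(scored, key=lambda hit: hit['score'], reverse=True)[:top_k]

toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_hits = top_k_hits([1.0, 0.0], toy_corpus, top_k=2)
print(toy_hits)
```

The result has the same shape as the `hits` list: a `corpus_id` pointing back into the document array and a `score` between -1 and 1.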

In [9]:
print(f'Question: {QUESTION}\n')

for i, hit in enumerate(hits):
    
    print(f'Document {i + 1} Cos_Sim {hit["score"]:.3f}:\n\n{documents[hit["corpus_id"]]}')
    print('\n')

Question: How many horns does a flea have?

Document 1 Cos_Sim 0.490:

When examined by a microscope, the flea is a pleasant object. The body
is curiously adorned with a suit of polished armour, neatly jointed, and
beset with a great number of sharp pins almost like the quills of a
porcupine: it has a small head, large eyes, two horns, or feelers, which
proceed from the head, and four long legs from the breast; they are very
hairy and long, and have several joints, which fold as it were one
within another.


Document 2 Cos_Sim 0.248:

The Chego is a very small animal, about one fourth the size of a common
flea: it is very troublesome, in warm climates, to the poor blacks, such
as go barefoot, and the slovenly: it penetrates the skin, under which it
lays a bunch of eggs, which swell to the bigness of a small pea.


Document 3 Cos_Sim 0.185:


This is one of the largest of the insect tribe. It is met with in
different countries, and of various sizes, from two or three inches t

In [10]:
str(documents[hits[0]['corpus_id']])

'When examined by a microscope, the flea is a pleasant object. The body\r\nis curiously adorned with a suit of polished armour, neatly jointed, and\r\nbeset with a great number of sharp pins almost like the quills of a\r\nporcupine: it has a small head, large eyes, two horns, or feelers, which\r\nproceed from the head, and four long legs from the breast; they are very\r\nhairy and long, and have several joints, which fold as it were one\r\nwithin another.'

In [11]:
# answer the question from the top document
nlp(QUESTION, str(documents[hits[0]['corpus_id']]))

{'score': 0.8524730801582336, 'start': 259, 'end': 262, 'answer': 'two'}

In [12]:
# This is called an "Open Book Q/A" System

# Bonus - Fine-tuning the siamese architecture on custom data

In [13]:
from datasets import load_dataset


from random import sample, seed, shuffle
from sentence_transformers import InputExample, losses, evaluation
from torch.utils.data import DataLoader

In [14]:
# load up the adversarial_qa dataset from the Q/A use-case
training_qa = load_dataset('adversarial_qa', 'adversarialQA', split='train')

good_training_data = []
bad_training_data = []
    
last_example = None
for example in training_qa:
    if last_example and example['context'] != last_example['context']:
        bad_training_data.append((example['question'], last_example['context'], 0.0))  # mismatched pair, labeled 0.0
    # (question, context, label): the label is 1.0 if the pair should be matched together
    good_training_data.append((example['question'], example['context'], 1.0))
    last_example = example

Found cached dataset adversarial_qa (/Users/sinanozdemir/.cache/huggingface/datasets/adversarial_qa/adversarialQA/1.0.0/92356be07b087c5c6a543138757828b8d61ca34de8a87807d40bbc0e6c68f04b)


In [15]:
len(good_training_data), len(bad_training_data)

(30000, 2647)
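The gap between the two counts follows from the pairing loop: every row yields a positive pair, but a mismatched pair is only created when the context changes from one row to the next. The sketch below replays the same logic on three toy rows. The function name `build_pairs` and the toy data are hypothetical.

```python
def build_pairs(examples):
    """Replicate the pairing loop: one positive per row, one negative per context change."""
    positives, negatives = [], []
    previous = None
    for example in examples:
        if previous and example['context'] != previous['context']:
            # Pair the new question with the previous (unrelated) context
            negatives.append((example['question'], previous['context'], 0.0))
        positives.append((example['question'], example['context'], 1.0))
        previous = example
    return positives, negatives

toy_examples = [
    {'question': 'q1', 'context': 'A'},
    {'question': 'q2', 'context': 'A'},
    {'question': 'q3', 'context': 'B'},
]
toy_positives, toy_negatives = build_pairs(toy_examples)
print(len(toy_positives), len(toy_negatives))
```

Because the dataset groups several questions under each context, context changes are much rarer than rows, which is why sampling 500 of each is used later to balance the classes.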

In [16]:
good_training_data[-1]

('What letter designates what Ektachrome is designed for?',
 'Some high-speed black-and-white films, such as Ilford Delta 3200 and Kodak T-MAX P3200, are marketed with film speeds in excess of their true ISO speed as determined using the ISO testing method. For example, the Ilford product is actually an ISO 1000 film, according to its data sheet. The manufacturers do not indicate that the 3200 number is an ISO rating on their packaging. Kodak and Fuji also marketed E6 films designed for pushing (hence the "P" prefix), such as Ektachrome P800/1600 and Fujichrome P1600, both with a base speed of ISO 400.',
 1.0)

In [17]:
bad_training_data[-1]

('What film beside Ektachrome and Fujichorme is designed for pushing?',
 'The Weston Cadet (model 852 introduced in 1949), Direct Reading (model 853 introduced 1954) and Master III (models 737 and S141.3 introduced in 1956) were the first in their line of exposure meters to switch and utilize the meanwhile established ASA scale instead. Other models used the original Weston scale up until ca. 1955. The company continued to publish Weston film ratings after 1955, but while their recommended values often differed slightly from the ASA film speeds found on film boxes, these newer Weston values were based on the ASA system and had to be converted for use with older Weston meters by subtracting 1/3 exposure stop as per Weston\'s recommendation. Vice versa, "old" Weston film speed ratings could be converted into "new" Westons and the ASA scale by adding the same amount, that is, a film rating of 100 Weston (up to 1955) corresponded with 125 ASA (as per ASA PH2.5-1954 and before). This conver

In [18]:
# https://www.sbert.net/docs/training/overview.html for more information on training

seed(42)  # seed our upcoming sample

sampled_training_data = sample(good_training_data, 500) + sample(bad_training_data, 500)

shuffle(sampled_training_data)  # shuffle our data around

training_index = int(.8 * len(sampled_training_data))  # Get an 80/20 train/test split

In [19]:
# Define the training examples
train_examples = [InputExample(texts=t[:2], label=t[2]) for t in sampled_training_data[:training_index]]

train_examples[0].__dict__

{'guid': '',
 'texts': ('What changed after the eigth century?',
  'There is disagreement about the origin of the term, but general consensus that "cardinalis" from the word cardo (meaning \'pivot\' or \'hinge\') was first used in late antiquity to designate a bishop or priest who was incorporated into a church for which he had not originally been ordained. In Rome the first persons to be called cardinals were the deacons of the seven regions of the city at the beginning of the 6th century, when the word began to mean “principal,” “eminent,” or "superior." The name was also given to the senior priest in each of the "title" churches (the parish churches) of Rome and to the bishops of the seven sees surrounding the city. By the 8th century the Roman cardinals constituted a privileged class among the Roman clergy. They took part in the administration of the church of Rome and in the papal liturgy. By decree of a synod of 769, only a cardinal was eligible to become pope. In 1059, during th

In [20]:
# Define the train dataset, a dataloader and the train loss
# A data loader is the object that specifically shuffles/grabs batches of data from a Dataset
# We don't usually have to explicitly create one using the Trainer because it has a default loader built in
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

train_loss = losses.CosineSimilarityLoss(bi_encoder)
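As far as I understand it, `CosineSimilarityLoss` embeds both texts of each pair, takes their cosine similarity, and by default penalizes the squared difference between that similarity and the label. The following is a pure-Python sketch of that objective on hand-written vectors. The function names and toy embeddings are illustrative, and the real loss operates on batches of tensors produced by the model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def cosine_similarity_loss(pairs, labels):
    """Mean squared error between each pair's cosine similarity and its label."""
    errors = [(cosine_similarity(u, v) - label) ** 2 for (u, v), label in zip(pairs, labels)]
    return sum(errors) / len(errors)

# First pair is identical (similarity 1), second pair is orthogonal (similarity 0)
toy_pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
loss_when_labels_agree = cosine_similarity_loss(toy_pairs, [1.0, 0.0])
loss_when_labels_disagree = cosine_similarity_loss(toy_pairs, [0.0, 1.0])
print(loss_when_labels_agree, loss_when_labels_disagree)
```

Under this objective, pairs labeled 1.0 are pulled toward the same direction in embedding space, while pairs labeled 0.0 are pushed toward orthogonality rather than toward opposite directions.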

In [21]:
# Evaluation data: sentences1 and sentences2 are the questions and contexts respectively, and scores are 0.0 or 1.0
sentences1, sentences2, scores = zip(*sampled_training_data[training_index:])

# evaluator will evaluate embedding closeness
evaluator = evaluation.EmbeddingSimilarityEvaluator(sentences1, sentences2, scores)

In [22]:
bi_encoder.evaluate(evaluator)  # initial evaluation (higher embedding similarity is better)

0.5044913287672261
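To my understanding, the number returned by `EmbeddingSimilarityEvaluator` is a Spearman rank correlation between the predicted similarities and the gold scores, not an average similarity. The sketch below computes such a correlation in pure Python, using average ranks for ties since the labels here are only 0 or 1. All helper names and toy values are made up for illustration, and the evaluator's exact choice of similarity function may differ between library versions.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

toy_similarities = [0.9, 0.2, 0.7, 0.1]  # hypothetical cosine similarities
toy_labels = [1.0, 0.0, 1.0, 0.0]        # gold labels for the same pairs
toy_score = spearman(toy_similarities, toy_labels)
print(round(toy_score, 3))
```

A rank correlation only rewards ordering matched pairs above mismatched ones, so a model can improve on this metric without its raw similarity scores changing much.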

In [23]:
# Fine-tune the model using the fit method
bi_encoder.fit(
    train_objectives=[(train_dataloader, train_loss)], 
    output_path='ir/results',
    epochs=3,
    evaluator=evaluator
)

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Iteration:   0%|          | 0/25 [00:00<?, ?it/s]

Iteration:   0%|          | 0/25 [00:00<?, ?it/s]

Iteration:   0%|          | 0/25 [00:00<?, ?it/s]

In [24]:
bi_encoder.evaluate(evaluator)  # final evaluation (higher embedding similarity is better)
# Not a huge jump in performance after 3 epochs. We could try more data or more epochs

0.5065699196497006

In [25]:
# load fine-tuned IR model
finetuned_bi_encoder = SentenceTransformer('ir/results')

In [26]:
# Slightly more confident results!

document_embeddings = finetuned_bi_encoder.encode(documents, convert_to_tensor=True, show_progress_bar=True)

question_embedding = finetuned_bi_encoder.encode(QUESTION, convert_to_tensor=True)

# Get document hits
hits = util.semantic_search(question_embedding, document_embeddings, top_k=3)[0]

print(f'Question: {QUESTION}\n')

for i, hit in enumerate(hits):
    
    print(f'Document {i + 1} Cos_Sim {hit["score"]:.3f}:\n\n{documents[hit["corpus_id"]]}')
    print('\n')

Batches:   0%|          | 0/3 [00:00<?, ?it/s]

Question: How many horns does a flea have?

Document 1 Cos_Sim 0.494:

When examined by a microscope, the flea is a pleasant object. The body
is curiously adorned with a suit of polished armour, neatly jointed, and
beset with a great number of sharp pins almost like the quills of a
porcupine: it has a small head, large eyes, two horns, or feelers, which
proceed from the head, and four long legs from the breast; they are very
hairy and long, and have several joints, which fold as it were one
within another.


Document 2 Cos_Sim 0.253:

The Chego is a very small animal, about one fourth the size of a common
flea: it is very troublesome, in warm climates, to the poor blacks, such
as go barefoot, and the slovenly: it penetrates the skin, under which it
lays a bunch of eggs, which swell to the bigness of a small pea.


Document 3 Cos_Sim 0.189:


This is one of the largest of the insect tribe. It is met with in
different countries, and of various sizes, from two or three inches t

In [27]:
def gutenberg_to_documents(gutenberg_url, bi_encoder):
    text = urlopen(gutenberg_url).read().decode()
    documents = np.array(list(filter(lambda x: len(x) > 100, text.split('\r\n\r\n'))))
    print(f'There are {len(documents)} documents/paragraphs')
    return documents, bi_encoder.encode(documents, convert_to_tensor=True)

def retrieve_relevant_documents(bi_encoder, query, documents, document_embeddings, top_k=3):
    query_embedding = bi_encoder.encode(query, convert_to_tensor=True)

    # Rank the documents by cosine similarity to the query and keep the best top_k
    hits = util.semantic_search(query_embedding, document_embeddings, top_k=top_k)[0]

    for i, hit in enumerate(hits):
        print(f'Document {i + 1} Cos_Sim {hit["score"]:.3f}:\n\n{documents[hit["corpus_id"]]}')
        print('\n')
    print(f"Answer from Top Document: {nlp(query, str(documents[hits[0]['corpus_id']]))}")

In [28]:
banks_to_bassoon_documents, banks_to_bassoon_embeddings = gutenberg_to_documents(
    'https://www.gutenberg.org/cache/epub/27480/pg27480.txt', finetuned_bi_encoder
)

There are 1402 documents/paragraphs


In [29]:
retrieve_relevant_documents(finetuned_bi_encoder,
    'What is a banshee?', banks_to_bassoon_documents, banks_to_bassoon_embeddings, 2
)

Document 1 Cos_Sim 0.754:

BANSHEE (Irish _bean sidhe_; Gaelic _ban sith_, "woman of the fairies"), a
supernatural being in Irish and general Celtic folklore, whose mournful
screaming, or "keening," at night is held to foretell the death of some
member of the household visited. In Ireland legends of the banshee belong
more particularly to certain families in whose records periodic visits from
the spirit are chronicled. A like ghostly informer figures in Brittany
folklore. The Irish banshee is held to be the distinction only of families
of pure Milesian descent. The Welsh have the banshee under the name _gwrach
y Rhibyn_ (witch of Rhibyn). Sir Walter Scott mentions a belief in the
banshee as existing in the highlands of Scotland (_Demonology and
Witchcraft_, p. 351). A Welsh death-portent often confused with the gwrach
y Rhibyn and banshee is the _cyhyraeth_, the groaning spirit.


Document 2 Cos_Sim 0.325:

BANNU, a town and district of British India, in the Derajat division of the
Nor

In [30]:
retrieve_relevant_documents(finetuned_bi_encoder,
    'When was the Imperial Bank of Germany founded?', banks_to_bassoon_documents, banks_to_bassoon_embeddings, 2
)

Document 1 Cos_Sim 0.799:

[3] The date 1876 is taken as being that when the Imperial Bank of Germany
came into full operation.


Document 2 Cos_Sim 0.577:

Similar banks had been established in Middelburg, (March 28th, 1616), in
Hamburg (1619) and in Rotterdam (February 9th, 1635). Of these the Bank of
Hamburg carried on much the largest business and survived the longest. It
was not till the 15th of February 1873 that its existence was closed by the
act of the German parliament which decreed that Germany should possess a
gold standard, and thus removed those conditions of the local medium of
exchange--silver coins of very different intrinsic values--whose
circulation had provided an ample field for the operations of the bank. The
business of the Bank of Hamburg had been conducted in absolute accordance
with the regulations under which it was founded.


Answer from Top Document: {'score': 0.18934300541877747, 'start': 13, 'end': 17, 'answer': '1876'}
