# Retrieve & Re-Rank Demo over Simple Wikipedia
This example demonstrates the Retrieve & Re-Rank setup and lets you search over Simple English Wikipedia.

You can input a query or a question. The script then uses semantic search to find relevant passages in Simple English Wikipedia (as it is smaller and fits better in RAM).

For semantic search, we use SentenceTransformer('multi-qa-MiniLM-L6-cos-v1') and retrieve the 32 passages that are most likely to answer the input query.

Next, we use a more powerful CrossEncoder (cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')) that scores how relevant each retrieved passage is to the query. The cross-encoder further boosts performance, especially when you search over a corpus on which the bi-encoder was not trained.
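
The two-stage control flow can be illustrated without loading any models. The following is a minimal sketch, not the notebook's implementation: `retrieve_and_rerank`, `word_overlap` and `overlap_ratio` are hypothetical names, the two scorers are simple stand-ins for the bi-encoder similarity and the cross-encoder score, and the toy corpus is made up.

```python
from typing import Callable, List, Tuple


def retrieve_and_rerank(
    query: str,
    corpus: List[str],
    cheap_score: Callable[[str, str], float],
    expensive_score: Callable[[str, str], float],
    top_k: int = 32,
) -> List[Tuple[float, str]]:
    """Keep the top_k passages by cheap_score, then re-order them by expensive_score."""
    # Stage 1 (retrieval): score every passage with the fast scorer.
    candidates = sorted(corpus, key=lambda p: cheap_score(query, p), reverse=True)[:top_k]
    # Stage 2 (re-ranking): run the slow scorer only on the shortlist.
    rescored = [(expensive_score(query, p), p) for p in candidates]
    return sorted(rescored, key=lambda pair: pair[0], reverse=True)


# Hypothetical stand-in scorers, for illustration only.
def word_overlap(query: str, passage: str) -> float:
    """Number of distinct lower-cased words shared by query and passage."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))


def overlap_ratio(query: str, passage: str) -> float:
    """Shared words divided by passage length, so focused passages rank higher."""
    words = passage.lower().split()
    return word_overlap(query, passage) / len(words) if words else 0.0


toy_corpus = [
    "oslo is the capital of norway",
    "norway is a country in the north of europe and oslo is the capital city of norway",
    "bats are mammals",
]
results = retrieve_and_rerank("capital of norway", toy_corpus, word_overlap, overlap_ratio, top_k=2)
for score, passage in results:
    print("{:.3f}\t{}".format(score, passage))
```

The point of the split is cost: the cheap scorer touches every passage, while the expensive scorer only sees the `top_k` candidates.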

In [3]:
import json
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import gzip
import os
import torch

if not torch.cuda.is_available():
    print("Warning: No GPU found. Please add GPU to your notebook")


#We use the Bi-Encoder to encode all passages, so that we can use it with semantic search
bi_encoder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
bi_encoder.max_seq_length = 256     #Truncate long passages to 256 tokens
top_k = 32                          #Number of passages we want to retrieve with the bi-encoder

#The bi-encoder will retrieve top_k (here 32) passages. We use a cross-encoder to re-rank the result list and improve its quality
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# As dataset, we use Simple English Wikipedia. Compared to the full English Wikipedia, it has only
# about 170k articles. We split these articles into paragraphs and encode them with the bi-encoder

wikipedia_filepath = '../Data/simplewiki-2020-11-01.jsonl.gz'

if not os.path.exists(wikipedia_filepath):
    util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz', wikipedia_filepath)

passages = []
with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as fIn:
    for line in fIn:
        data = json.loads(line.strip())

        #Add all paragraphs
        #passages.extend(data['paragraphs'])

        #Only add the first paragraph
        passages.append(data['paragraphs'][0])

print("Passages:", len(passages))

# We encode all passages into our vector space. This takes about 5 minutes (depends on your GPU speed)
corpus_embeddings = bi_encoder.encode(passages, convert_to_tensor=True, show_progress_bar=True, batch_size=512)

Passages: 169597


Batches: 100%|██████████| 332/332 [00:25<00:00, 13.06it/s]
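
Conceptually, searching the embedded corpus means ranking passages by their similarity to the query embedding. The sketch below shows that idea with cosine similarity in plain Python. It is an illustration under stated assumptions: `cosine` and `cosine_top_k` are hypothetical helpers rather than functions from `sentence_transformers`, and the 3-dimensional vectors are made-up toy values, not real model outputs.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_top_k(
    query_vec: Sequence[float],
    corpus_vecs: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (corpus index, similarity) pairs for the k most similar vectors, best first."""
    scored = [(idx, cosine(query_vec, vec)) for idx, vec in enumerate(corpus_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy embeddings (assumed values, for illustration only).
toy_corpus_vecs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
]
toy_query_vec = [2.0, 0.0, 0.0]
top_hits = cosine_top_k(toy_query_vec, toy_corpus_vecs, k=2)
print(top_hits)
```

Because cosine similarity ignores vector length, the query `[2, 0, 0]` matches `[1, 0, 0]` perfectly even though the two vectors differ in magnitude.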


In [4]:
data

{'id': '798870',
 'title': 'Seminole bat',
 'paragraphs': ['The Seminole bat ("Lasiurus seminolus") is a type of bat in the family Vespertilionidae.',
  'The Seminole bat is often confused with the red bat. The Seminole bat has a mahogany color with a frosted look because to white tipped dorsal hairs. They weigh around 12 grams. Females are larger than males.',
  'The Seminole bat is found in the Southeastern United States. This includes Louisiana, Georgia, Alabama, Mississippi, South Carolina and parts of Texas, Tennessee, Arkansas and North Carolina. It has also been seen as far as Mexico. It is a migratory species. In the winter, it lives along the Gulf Coast, North and South Carolina, and southern Arkansas. In the summer, they migrate as far north as Missouri and Kentucky.',
  'It prefers to live in forested areas. In winter, they are found to use leaf litter and Spanish moss as insulation in their roost sites.',
  'Seminole bats are insectivores. They eat large amounts of Hymenopt

In [6]:
data['paragraphs']

['The Seminole bat ("Lasiurus seminolus") is a type of bat in the family Vespertilionidae.',
 'The Seminole bat is often confused with the red bat. The Seminole bat has a mahogany color with a frosted look because to white tipped dorsal hairs. They weigh around 12 grams. Females are larger than males.',
 'The Seminole bat is found in the Southeastern United States. This includes Louisiana, Georgia, Alabama, Mississippi, South Carolina and parts of Texas, Tennessee, Arkansas and North Carolina. It has also been seen as far as Mexico. It is a migratory species. In the winter, it lives along the Gulf Coast, North and South Carolina, and southern Arkansas. In the summer, they migrate as far north as Missouri and Kentucky.',
 'It prefers to live in forested areas. In winter, they are found to use leaf litter and Spanish moss as insulation in their roost sites.',
 'Seminole bats are insectivores. They eat large amounts of Hymenoptera (ants, bees and wasps), Coleoptera (beetles), Lepidoptera 

In [7]:
# We also compare the results to lexical search (keyword search). Here, we use
# the BM25 algorithm which is implemented in the rank_bm25 package.

from rank_bm25 import BM25Okapi
from sklearn.feature_extraction import _stop_words
import string
from tqdm.autonotebook import tqdm
import numpy as np


# We lowercase the text, strip punctuation, and drop stop words before indexing
def bm25_tokenizer(text):
    tokenized_doc = []
    for token in text.lower().split():
        token = token.strip(string.punctuation)

        if len(token) > 0 and token not in _stop_words.ENGLISH_STOP_WORDS:
            tokenized_doc.append(token)
    return tokenized_doc


tokenized_corpus = []
for passage in tqdm(passages):
    tokenized_corpus.append(bm25_tokenizer(passage))

bm25 = BM25Okapi(tokenized_corpus)

100%|██████████| 169597/169597 [00:02<00:00, 60196.93it/s]
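
To make the lexical baseline less of a black box, here is a minimal sketch of BM25 scoring in plain Python. It uses the textbook Okapi formula with a non-negative IDF variant; this is an assumption for illustration, and the `rank_bm25` package may differ in details such as IDF handling and default parameters. The function `bm25_scores`, the values of `k1` and `b`, and the toy documents are all hypothetical.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query_tokens: List[str],
    tokenized_docs: List[List[str]],
    k1: float = 1.5,
    b: float = 0.75,
) -> List[float]:
    """Score every document against the query with a textbook BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        # Longer-than-average documents are penalised through this term.
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


# Toy, already tokenized documents (made up for illustration).
toy_docs = [
    ["seminole", "bat", "type", "bat"],
    ["northern", "bat", "species"],
    ["seminole", "city", "oklahoma"],
    ["oslo", "capital", "norway"],
]
toy_scores = bm25_scores(["seminole", "bat"], toy_docs)
print(toy_scores)
```

Only exact token matches contribute to the score, which is why lexical search cannot connect a query to a passage that uses different wording.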


In [8]:
# This function searches the indexed Wikipedia passages for those that
# answer the query, and prints the top hits of each method
def search(query):
    print("Input question:", query)

    ##### BM25 search (lexical search) #####
    bm25_scores = bm25.get_scores(bm25_tokenizer(query))
    top_n = np.argpartition(bm25_scores, -5)[-5:]
    bm25_hits = [{'corpus_id': idx, 'score': bm25_scores[idx]} for idx in top_n]
    bm25_hits = sorted(bm25_hits, key=lambda x: x['score'], reverse=True)

    print("Top-3 lexical search (BM25) hits")
    for hit in bm25_hits[0:3]:
        print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']].replace("\n", " ")))

    ##### Semantic Search #####
    # Encode the query using the bi-encoder and find potentially relevant passages
    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
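    # Move the query to the same device as the corpus embeddings (works with or without a GPU)
    question_embedding = question_embedding.to(corpus_embeddings.device)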
    hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k)
    hits = hits[0]  # Get the hits for the first query

    ##### Re-Ranking #####
    # Now, score all retrieved passages with the cross_encoder
    cross_inp = [[query, passages[hit['corpus_id']]] for hit in hits]
    cross_scores = cross_encoder.predict(cross_inp)

    # Attach the cross-encoder score to each hit
    for idx in range(len(cross_scores)):
        hits[idx]['cross-score'] = cross_scores[idx]

    # Output of top-3 hits from bi-encoder
    print("\n-------------------------\n")
    print("Top-3 Bi-Encoder Retrieval hits")
    hits = sorted(hits, key=lambda x: x['score'], reverse=True)
    for hit in hits[0:3]:
        print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']].replace("\n", " ")))

    # Output of top-3 hits from re-ranker
    print("\n-------------------------\n")
    print("Top-3 Cross-Encoder Re-ranker hits")
    hits = sorted(hits, key=lambda x: x['cross-score'], reverse=True)
    for hit in hits[0:3]:
        print("\t{:.3f}\t{}".format(hit['cross-score'], passages[hit['corpus_id']].replace("\n", " ")))
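
The BM25 branch of `search` selects its candidates with `np.argpartition` and only then sorts them. The short sketch below, using made-up scores, shows why both steps are needed: `argpartition` finds the indices of the n largest values without fully sorting the array, but returns them in no particular order.

```python
import numpy as np

# Made-up scores for seven documents (illustration only).
scores = np.array([0.2, 3.1, 1.7, 0.0, 2.4, 0.9, 5.0])
n = 3

# The last n positions of the partitioned index array hold the n largest
# scores, in arbitrary order.
top_unsorted = np.argpartition(scores, -n)[-n:]

# Sort just those n indices by score, highest first.
top_sorted = top_unsorted[np.argsort(-scores[top_unsorted])]

print(top_sorted.tolist())
```

On a corpus of roughly 170k passages this avoids sorting the full score array when only a handful of hits are printed.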

In [9]:
search(query = "What is the capital of the Norway?")

Input question: What is the capital of the Norway?
Top-3 lexical search (BM25) hits
	13.337	The University of Oslo (, ) is the oldest and largest university in Norway. It is in the Norwegian capital, Oslo.
	11.149	Møre og Romsdal is a county in Norway. Norway has 19 counties.
	11.149	Norway national football team is the national football team of Norway.

-------------------------

Top-3 Bi-Encoder Retrieval hits
	0.675	Oslo is the capital city of Norway. It is Norway's largest city, with a population of 647,676 people in 2015. The area near the city has a total population of 1,546,706. The city government of Oslo and the county are the same thing.
	0.668	Norway is a country in the north of Europe. It is the western part of the Scandinavian peninsula. The mainland of Norway is surrounded by the North Sea and Atlantic Ocean on the west side, and borders Russia, Finland, and Sweden to the east. The southern coast touches the Oslofjord, Skagerrak, and the North Sea.
	0.665	Norway is a town

In [10]:
search(query = "what is a Seminole bat?")

Input question: what is a Seminole bat?
Top-3 lexical search (BM25) hits
	26.760	The Seminole bat ("Lasiurus seminolus") is a type of bat in the family Vespertilionidae.
	15.023	The northern bat ("Eptesicus nilssonii") is a species of bat in Eurasia. It is related to the serotine bat ("Eptesicus serotinus").
	14.479	The Mexican free-tailed bat ("Tadarida brasiliensis") is a type of bat. It is also called the Brazilian free-tailed bat. It is native to North and South America.

-------------------------

Top-3 Bi-Encoder Retrieval hits
	0.820	The Seminole bat ("Lasiurus seminolus") is a type of bat in the family Vespertilionidae.
	0.539	Seminole is a city in Oklahoma in the United States.
	0.526	The Mexican free-tailed bat ("Tadarida brasiliensis") is a type of bat. It is also called the Brazilian free-tailed bat. It is native to North and South America.

-------------------------

Top-3 Cross-Encoder Re-ranker hits
	10.383	The Seminole bat ("Lasiurus seminolus") is a type of bat in the 

In [12]:
search(query = "what kind of insulation does the seminole bat use in their roost nests?")

Input question: what kind of insulation does the seminole bat use in their roost nests?
Top-3 lexical search (BM25) hits
	26.760	The Seminole bat ("Lasiurus seminolus") is a type of bat in the family Vespertilionidae.
	15.564	Insulation might mean:
	15.118	A partial discharge (PD) is an electric discharge that breaks through a small portion of the insulation between two conductors. A discharge is when electrical charge that has accumulated is suddenly released (like a spark). A partial discharge does not completely cross the insulation between the two conductors, only a small portion of it in one area. It may happen under the stress of high voltage. It is caused by a small opening (void) inside the insulating material. The insulating material can be either solid or fluid. The charge passing through this insulation is called electrical breakdown.

-------------------------

Top-3 Bi-Encoder Retrieval hits
	0.557	The Seminole bat ("Lasiurus seminolus") is a type of bat in the family Vesp