# Retrieve & Re-Rank Demo over Simple Wikipedia

This example demonstrates the Retrieve & Re-Rank setup and lets you search over [Simple Wikipedia](https://simple.wikipedia.org/wiki/Main_Page).

You can input a query or a question. The script then uses semantic search
to find relevant passages in Simple English Wikipedia, which is smaller than the full English Wikipedia and fits more easily in RAM.

For semantic search, we use `SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')` and retrieve
the 32 passages that are most likely to answer the input query.
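
The retrieval step can be pictured as a nearest-neighbour lookup by cosine similarity. The snippet below is a minimal sketch under that assumption, not the library's implementation: `cosine` and `top_k_by_cosine` are hypothetical helper names, and the 2-dimensional vectors are made up to keep the arithmetic readable. The result mimics the list-of-dicts shape (`corpus_id`, `score`) that the notebook later reads from `util.semantic_search`.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_by_cosine(query_vec, corpus_vecs, top_k):
    """Return the top_k corpus entries as dicts, best match first."""
    hits = [
        {"corpus_id": i, "score": cosine(query_vec, vec)}
        for i, vec in enumerate(corpus_vecs)
    ]
    return sorted(hits, key=lambda h: h["score"], reverse=True)[:top_k]

# Made-up toy "embeddings"; real sentence embeddings have far more dimensions.
corpus_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
hits = top_k_by_cosine([1.0, 0.0], corpus_vecs, top_k=2)
print(hits)
```

With `top_k=2`, the passage pointing in the same direction as the query comes first, the partially aligned one second, and the orthogonal one is dropped.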

Next, we use a more powerful CrossEncoder (`cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')`) that
scores the query against each retrieved passage for relevance. The cross-encoder further boosts performance,
especially when you search over a corpus that the bi-encoder was not trained on.
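
The two stages can be sketched without downloading any model. This is an illustration only, not the notebook's implementation: `cheap_score` and `expensive_score` are hypothetical stand-ins for the bi-encoder and the cross-encoder, and the three-passage corpus is made up.

```python
# Toy retrieve & re-rank pipeline. Both scoring functions are hypothetical
# stand-ins: cheap_score plays the bi-encoder, expensive_score the cross-encoder.

def cheap_score(query, passage):
    """Fast, rough relevance: number of distinct query words found in the passage."""
    passage_words = set(passage.lower().split())
    return sum(1 for word in set(query.lower().split()) if word in passage_words)

def expensive_score(query, passage):
    """Slower, finer relevance: word overlap normalised by passage length."""
    return cheap_score(query, passage) / len(passage.lower().split())

def retrieve_and_rerank(query, corpus, top_k):
    # Stage 1: score every passage cheaply and keep only the top_k candidates.
    hits = [{"corpus_id": i, "score": cheap_score(query, p)} for i, p in enumerate(corpus)]
    hits = sorted(hits, key=lambda h: h["score"], reverse=True)[:top_k]
    # Stage 2: re-score just those candidates and re-order them.
    for hit in hits:
        hit["cross-score"] = expensive_score(query, corpus[hit["corpus_id"]])
    return sorted(hits, key=lambda h: h["cross-score"], reverse=True)

corpus = [
    "thermal vias move heat from the package into the board",
    "thermal vias",
    "solder paste printing guidelines",
]
reranked = retrieve_and_rerank("thermal vias", corpus, top_k=2)
print([hit["corpus_id"] for hit in reranked])
```

The first stage ties passages 0 and 1, and the second stage breaks the tie in favour of passage 1. The design point is that the expensive scorer only ever sees `top_k` passages, never the whole corpus.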


In [36]:
# !pip install -U sentence-transformers rank_bm25

In [1]:
import json
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import gzip
import os
import torch
import pandas as pd
import re
from rank_bm25 import BM25Okapi
from sklearn.feature_extraction import _stop_words
import string
from tqdm.autonotebook import tqdm
import numpy as np


if not torch.cuda.is_available():
    print("Warning: No GPU found. Please add GPU to your notebook")

# We use the bi-encoder to encode all passages so that we can use them for semantic search
bi_encoder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
bi_encoder.max_seq_length = 256     # Truncate long passages to 256 tokens
top_k = 32                          # Number of passages we want to retrieve with the bi-encoder

# The bi-encoder retrieves top_k (32) passages. We use a cross-encoder to re-rank that list and improve its quality
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')



In [38]:
sen_t = sentence_transformer_paragraph_embedding("../data/genai_poc/processed/AI_POC_pdf_extracted_sectional_data_oct_dec.pkl")

Batches:   0%|          | 0/16 [00:00<?, ?it/s]

  0%|          | 0/502 [00:00<?, ?it/s]

In [39]:
passages, hits, search_results = sen_t.search(query = "How does the design of thermal vias in QFN packages affect thermal and power dissipation?")

Input question: How does the design of thermal vias in QFN packages affect thermal and power dissipation?


In [40]:
len(search_results)

5

In [3]:
df = pd.read_pickle("../data/genai_poc/processed/AI_POC_pdf_extracted_sectional_data_oct_dec.pkl")

In [4]:
passages = df['context'].values.tolist()

In [5]:
def striphtml(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)

In [6]:
passages = [striphtml(i) for i in passages]
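
To see what the tag-stripping regex does and does not handle, here is a small self-contained check. The sample strings are made up for illustration; the function body is the same as in the cell above.

```python
import re

def striphtml(data):
    # Non-greedy match: each "<...>" span is removed separately.
    p = re.compile(r'<.*?>')
    return p.sub('', data)

sample = "<p>Thermal <b>vias</b> carry heat.</p>"
cleaned = striphtml(sample)
print(cleaned)

# Caveat: any "<" followed later by ">" on the same line is treated as a tag,
# so comparisons written in prose lose the text between the two symbols.
lossy = striphtml("gap < 0.05 mm and pad > 0.18 mm")
print(repr(lossy))
```

Because `.` does not match newlines by default, a tag that spans several lines is left in place. If the passages contain inequalities or multi-line markup, an HTML parser would be a safer choice than this regex.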

In [7]:
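# The original demo uses Simple English Wikipedia as its dataset. Compared to the full English Wikipedia, it has only
# about 170k articles, which are split into paragraphs and encoded with the bi-encoder. That loading code is
# commented out below; this notebook instead encodes the passages loaded from the pickle file above.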

#wikipedia_filepath = 'simplewiki-2020-11-01.jsonl.gz'

# if not os.path.exists(wikipedia_filepath):
#     util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz', wikipedia_filepath)

# passages = []
# with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as fIn:
#     for line in fIn:
#         data = json.loads(line.strip())

#         #Add all paragraphs
#         #passages.extend(data['paragraphs'])

#         #Only add the first paragraph
#         passages.append(data['paragraphs'][0])

print("Passages:", len(passages))

# We encode all passages into our vector space. Encoding time depends on the corpus size and your GPU speed
corpus_embeddings = bi_encoder.encode(passages, convert_to_tensor=True, show_progress_bar=True)

Passages: 502


Batches:   0%|          | 0/16 [00:00<?, ?it/s]

In [8]:
# We also compare the results to lexical search (keyword search). Here, we use
# the BM25 algorithm which is implemented in the rank_bm25 package.

# We lower case our text and remove stop-words from indexing
def bm25_tokenizer(text):
    tokenized_doc = []
    for token in text.lower().split():
        token = token.strip(string.punctuation)
        if len(token) > 0 and token not in _stop_words.ENGLISH_STOP_WORDS:
            tokenized_doc.append(token)
    return tokenized_doc


tokenized_corpus = []
for passage in tqdm(passages):
    tokenized_corpus.append(bm25_tokenizer(passage))

bm25 = BM25Okapi(tokenized_corpus)

  0%|          | 0/502 [00:00<?, ?it/s]
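
The effect of the tokenizer is easiest to see on a single sentence. The sketch below mirrors `bm25_tokenizer` but swaps in a tiny, hypothetical stop-word set (`DEMO_STOP_WORDS`) so it runs without scikit-learn; the real `ENGLISH_STOP_WORDS` list is much larger and would filter more words.

```python
import string

# Hypothetical, tiny stop-word list for illustration only; the notebook uses
# scikit-learn's much larger ENGLISH_STOP_WORDS set instead.
DEMO_STOP_WORDS = {"the", "of", "in", "and", "does", "how"}

def demo_tokenizer(text):
    tokens = []
    for token in text.lower().split():
        # Strip punctuation from both ends only; inner hyphens are kept.
        token = token.strip(string.punctuation)
        if token and token not in DEMO_STOP_WORDS:
            tokens.append(token)
    return tokens

tokens = demo_tokenizer("How does the design of thermal vias in QFN packages affect dissipation?")
print(tokens)
print(demo_tokenizer("Lead-free (SnPb)"))
```

Note that splitting on whitespace keeps hyphenated terms such as `lead-free` as one token, so a query for `lead` alone would not match them lexically.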

In [9]:
# This function searches all passages for those that
# answer the query
def search(query):
    print("Input question:", query)

    ##### BM25 search (lexical search) #####
    bm25_scores = bm25.get_scores(bm25_tokenizer(query))
    top_n = np.argpartition(bm25_scores, -5)[-5:]
    bm25_hits = [{'corpus_id': idx, 'score': bm25_scores[idx]} for idx in top_n]
    bm25_hits = sorted(bm25_hits, key=lambda x: x['score'], reverse=True)

    print("Top-3 lexical search (BM25) hits")
    for hit in bm25_hits[0:3]:
        print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']].replace("\n", " ")))

    ##### Semantic Search #####
    # Encode the query using the bi-encoder and find potentially relevant passages
    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    # question_embedding = question_embedding.cuda()  # Uncomment to move the query embedding to the GPU
    hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k)
    hits = hits[0]  # Get the hits for the first query

    ##### Re-Ranking #####
    # Now, score all retrieved passages with the cross_encoder
    cross_inp = [[query, passages[hit['corpus_id']]] for hit in hits]
    cross_scores = cross_encoder.predict(cross_inp)

    # Attach the cross-encoder score to each hit
    for idx in range(len(cross_scores)):
        hits[idx]['cross-score'] = cross_scores[idx]

    # Output of top-3 hits from bi-encoder
    print("\n-------------------------\n")
    print("Top-3 Bi-Encoder Retrieval hits")
    hits = sorted(hits, key=lambda x: x['score'], reverse=True)
    for hit in hits[0:3]:
        print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']].replace("\n", " ")))

    # Output of top-5 hits from re-ranker
    print("\n-------------------------\n")
    print("Top-5 Cross-Encoder Re-ranker hits")
    hits = sorted(hits, key=lambda x: x['cross-score'], reverse=True)
    for hit in hits[0:5]:
        print("\t{:.3f}\t{}".format(hit['cross-score'], passages[hit['corpus_id']].replace("\n", " ")))
    return hits

In [10]:
x = search(query = "How does the design of thermal vias in QFN packages affect thermal and power dissipation?")

Input question: How does the design of thermal vias in QFN packages affect thermal and power dissipation?
Top-3 lexical search (BM25) hits
	18.076	5. Sample configuration    The component package used in the current study is an 8 x 8 mm, 54-pin, open-cavity plastic QFN package. Two initial plating configurations are considered: 1) electrolytic Au over electrolytic Pd over electrolytic Ni finish stack up with layer thicknesses of 0.08-0.13 µm, 0.13-0.38 µm, and 2.03 µm minimum, respectively; and 2) immersion Au over electrolytic Pd over electroless Ni finish stack up with layer thicknesses of 0.05-0.13 µm, 0.13-0.38 µm, and 2.03 µm minimum, respectively. The PWB configuration used is a 2.6 mm-thick, 15-layer PWB constructed of controlled dielectric laminates. The PWB thermal pad design consists of a solid 0.05 mm thick Cu pad with a 9 x 9 square grid of 0.18 mm diameter vias non-conductively filled and Cu-capped. Dimpling of plating over the filled vias was limited to less than 0.05 mm.


-------------------------

Top-3 Bi-Encoder Retrieval hits
	0.521	11. REFERENCES    [1] R. Ghaffarian, "Body of Knowledge (BOK) for Leadless Quad Flat No-Lead/Bottom Termination Components (QFN/BTC) Package Trends and Reliability," Jet Propulsion Laboratory Publication 14- 17, Pasadena, California, 2014. [2] M. K. Anselm and R. Ghaffarian, "QFN Reliability, Thermal Shock, Lead-free vs. SnPb, Microstructure," in IEEE ITHERM Conference, Orlando, FL, 2017. [3] Mirror Semiconductor, [Online]. Available: https://www.mirrorsemi.com/NiPdAu.html. [Accessed 26 June 2019]. [4] P. Lall, S. Deshpande, N. Kothari, J. Suhling and L. Nguyen, "Effect of Thermal Cycling on Reliability of QFN Packages," in IEEE Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems (ITHERM), San Diego, CA, 2018. [5] H. Gadepalli, R. Dhanasekaran, S. M. Ramkumar, T. Jensen and E. Briggs, "Influence of Reflow Profile and Pb-Free Solder Paste in Minimizing Voids for Quad Flat Pack No-Lead 