<a href="https://colab.research.google.com/github/vaisour19/WIKI_SEARCH_ENGINE/blob/main/Context_BASED_SEARCH_ENGINE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -U sentence-transformers



LOADING WIKIPEDIA DATASET INTO THE SEARCH ENGINE

In [None]:
import json
from sentence_transformers import SentenceTransformer, util
import torch
import gzip
import os

As the dataset, we use Simple English Wikipedia.

In [None]:
wiki_filepath = 'simplewiki-2020-11-01.jsonl.gz'

# Check whether the file is already present locally; if not, download it with util.http_get

if not os.path.exists(wiki_filepath):
    util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz', wiki_filepath)

In [None]:
passages = []
with gzip.open(wiki_filepath, 'rt', encoding='utf8') as fIn:
    for line in fIn:
        data = json.loads(line.strip())

        # Add all paragraphs
        # passages.extend(data['paragraphs'])

        # Only add the first paragraph
        passages.append(data['paragraphs'][0])

print("Passages:", len(passages))

Passages: 169597
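
The loading cell relies on every record having at least one paragraph. As a minimal sketch of the same JSONL-in-gzip reading pattern, the helper below skips blank lines and records with an empty `paragraphs` list. The function name `load_first_paragraphs`, the `title` field, and the tiny demo file are assumptions made for illustration; only the `paragraphs` key comes from the notebook.

```python
import gzip
import json
import os
import tempfile


def load_first_paragraphs(path):
    """Return the first paragraph of every record that has at least one."""
    first = []
    with gzip.open(path, 'rt', encoding='utf8') as f_in:
        for line in f_in:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            paragraphs = record.get('paragraphs') or []
            if paragraphs:
                first.append(paragraphs[0])
    return first


# Build a tiny synthetic file (hypothetical records, not the real dataset)
records = [
    {"title": "A", "paragraphs": ["First A.", "Second A."]},
    {"title": "B", "paragraphs": []},
    {"title": "C", "paragraphs": ["First C."]},
]
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.jsonl.gz')
with gzip.open(demo_path, 'wt', encoding='utf8') as f_out:
    for record in records:
        f_out.write(json.dumps(record) + "\n")

demo_passages = load_first_paragraphs(demo_path)
print(demo_passages)
```

Record "B" has no paragraphs, so it is skipped instead of raising an `IndexError`.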


A sample document

In [None]:
passages[100]

'Hurricane Fabian was a powerful Cape Verde-type hurricane that hit Bermuda in early September during the 2003 Atlantic hurricane season. Fabian was the sixth named storm, fourth hurricane, and first major hurricane of the season.'

Load the transformer models

In [None]:

from sentence_transformers import CrossEncoder

if not torch.cuda.is_available():
    print("Warning: No GPU found. Neural search will be slow.")

# We use a bi-encoder to encode all passages so that we can run semantic search over them
bi_encoder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')

# Then we use a cross-encoder to re-rank the results and improve their quality
cross_encoderf = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')


We embed the documents

In [None]:
passage_embeddings = bi_encoder.encode(passages, convert_to_tensor=True, show_progress_bar=True)


In [None]:
passage_embeddings[100]

tensor([ 6.7026e-03,  7.7242e-02, -4.0781e-02,  3.4449e-02,  1.0008e-01,
        -7.7398e-03, -7.1532e-02,  3.3724e-02, -2.0213e-02,  2.2111e-02,
         1.6617e-02,  1.4158e-02,  1.0117e-01, -5.6038e-02,  6.7868e-02,
         3.1638e-03, -2.7807e-02, -7.2358e-02,  7.1224e-03,  8.2638e-02,
        -5.6906e-02,  2.6739e-02, -1.3058e-02,  1.7890e-02, -1.5990e-02,
         7.8454e-02, -2.9128e-02,  9.5958e-02, -1.8560e-02, -4.8593e-02,
         3.9069e-02,  6.0041e-02,  2.7971e-02,  2.4341e-03,  2.5529e-02,
         3.6655e-03,  3.1266e-02,  6.8533e-03, -5.5100e-02,  7.8149e-02,
        -2.0497e-02,  1.4029e-02,  5.5703e-02, -6.5600e-02,  2.6484e-02,
        -9.3370e-02,  4.4261e-02,  5.6952e-02,  6.4089e-02, -8.6817e-03,
        -1.3594e-02, -2.9124e-02, -2.5316e-02, -1.2725e-01, -4.1205e-03,
         4.3042e-02,  3.1977e-03,  3.2230e-02,  4.2784e-02,  4.5114e-02,
        -1.8544e-02,  5.9338e-02, -8.7942e-02,  1.7911e-02, -8.5897e-03,
        -1.6148e-02, -1.9721e-02,  2.2814e-02, -3.0

Now try with a sample search query

In [None]:
query = 'What is the capital of russia'

In [None]:
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)

In [None]:
cos_score = util.cos_sim(query_embedding, passage_embeddings)[0]
cos_score

tensor([-0.0512,  0.0742,  0.0906,  ...,  0.0141, -0.1154,  0.0708],
       device='cuda:0')
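
To make the scores above easier to read, here is a minimal pure-Python sketch of the cosine similarity between one query vector and a few document vectors. It is a stand-in for illustration, not the library's implementation; the function name `cosine_similarity` and the toy 2-dimensional vectors are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (both non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical): same direction, orthogonal, opposite direction
query_vec = [1.0, 0.0]
doc_vecs = [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]

scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
print(scores)
```

The score depends only on direction, not on vector length: it is highest for the vector pointing the same way as the query, zero for the orthogonal one, and lowest for the opposite one.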

Get the most similar document

In [None]:
top_result = torch.topk(cos_score, k=1)
index = top_result.indices.item()
index

6396
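
Taking only `k=1` returns a single index. As a rough pure-Python sketch of what a top-k selection does for larger `k`, the helper below returns the best scores together with their positions. The name `top_k` and the toy scores are assumptions for illustration; this is a stand-in, not `torch.topk` itself.

```python
import heapq


def top_k(scores, k):
    """Return (score, index) pairs for the k highest scores, best first."""
    return heapq.nlargest(k, ((s, i) for i, s in enumerate(scores)))


# Toy similarity scores (hypothetical values)
toy_scores = [-0.05, 0.07, 0.65, 0.09, 0.63]

best_two = top_k(toy_scores, 2)
print(best_two)
```

Each index can then be used to look up the matching passage, as done with `passages[index]`.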

In [None]:
passages[index]

'Russia (), officially called the Russian Federation () is a country that is in Eastern Europe and in North Asia. It is the largest country in the world by land area. About 146.7 million people live in Russia according to the 2019 census. The capital city of Russia is Moscow, and the official language is Russian.'

So it works for Russia. Let's try with India.

In [None]:
query = 'What is the capital of india'

In [None]:
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)

In [None]:
cos_score = util.cos_sim(query_embedding, passage_embeddings)[0]
cos_score

tensor([-0.1538, -0.0140,  0.1119,  ...,  0.0814, -0.0538,  0.0258],
       device='cuda:0')

In [None]:
top_result = torch.topk(cos_score, k=1)
index = top_result.indices.item()
index

3698

In [None]:
passages[index]

"Mumbai (previously known as Bombay until 1996) is a natural harbor on the west coast of India, and is the capital city of Maharashtra state. It is India's largest city, and one of the world's most populous cities. It is the financial capital of India. The city is the second most-populous in the world. It has approximately 13 million people. Along with the neighboring cities of Navi Mumbai and Thane, it forms the world's 4th largest urban agglomeration. They have around 19.1 million people."

This gives the wrong answer (Mumbai, not the capital of India), so we re-rank the candidates with the cross-encoder to find the most relevant document.

In [None]:
hits = util.semantic_search(query_embedding , passage_embeddings , top_k = 1)
hits[0]

[{'corpus_id': 3698, 'score': 0.6484957337379456}]

So we retrieve the top 30 documents and then re-rank them.

In [None]:
hits = util.semantic_search(query_embedding , passage_embeddings , top_k = 30)
hits

[[{'corpus_id': 3698, 'score': 0.6484957337379456},
  {'corpus_id': 134500, 'score': 0.6339941620826721},
  {'corpus_id': 22288, 'score': 0.6274536848068237},
  {'corpus_id': 41143, 'score': 0.5963889956474304},
  {'corpus_id': 3701, 'score': 0.5953089594841003},
  {'corpus_id': 16458, 'score': 0.5916082859039307},
  {'corpus_id': 7384, 'score': 0.5885171294212341},
  {'corpus_id': 24727, 'score': 0.5787838697433472},
  {'corpus_id': 104601, 'score': 0.5622392892837524},
  {'corpus_id': 94186, 'score': 0.5576584339141846},
  {'corpus_id': 16289, 'score': 0.5556299090385437},
  {'corpus_id': 106507, 'score': 0.5477286577224731},
  {'corpus_id': 41734, 'score': 0.5462822914123535},
  {'corpus_id': 165144, 'score': 0.54361891746521},
  {'corpus_id': 16301, 'score': 0.5432813167572021},
  {'corpus_id': 59414, 'score': 0.5389682650566101},
  {'corpus_id': 16284, 'score': 0.5352279543876648},
  {'corpus_id': 167027, 'score': 0.5308536887168884},
  {'corpus_id': 106505, 'score': 0.53078228235

In [None]:
reranker_inp = [[query, passages[hit['corpus_id']]] for hit in hits[0]]
reranker_inp

[['What is the capital of india',
  "Mumbai (previously known as Bombay until 1996) is a natural harbor on the west coast of India, and is the capital city of Maharashtra state. It is India's largest city, and one of the world's most populous cities. It is the financial capital of India. The city is the second most-populous in the world. It has approximately 13 million people. Along with the neighboring cities of Navi Mumbai and Thane, it forms the world's 4th largest urban agglomeration. They have around 19.1 million people."],
 ['What is the capital of india',
  'Gandhinagar is the capital city of Gujarat state in India. It is 23\xa0km from the city of Ahmedabad and 464\xa0km from Mumbai. In the year 1960, the Bombay state of India was divided into two states - Maharashtra and Gujarat. Bombay (now called Mumbai) became the capital city of Maharashtra. For Gujarat, new capital was needed. Gandhinagar was then made the capital of Gujarat.'],
 ['What is the capital of india',
  "Kolkata

In [None]:
reranker_scores = cross_encoderf.predict(reranker_inp)
reranker_scores

array([ 4.912483  ,  3.748391  ,  5.217479  ,  4.219248  ,  8.127115  ,
        6.4802094 , -7.848918  , -0.30529726,  3.5967212 ,  3.048223  ,
        2.3891191 , -0.38947144,  1.7250693 , -5.7348223 ,  2.3636193 ,
        4.246888  , -0.43687913,  0.23689662,  3.7113407 , -0.5231805 ,
        2.7924807 ,  2.1485996 ,  4.021082  ,  3.4632556 ,  2.6803677 ,
       -0.22952613,  3.053036  ,  2.213655  , -7.492552  ,  3.4043975 ],
      dtype=float32)

In [None]:
# Attach each cross-encoder score to its hit (idx avoids shadowing the built-in id)
for idx, hit in enumerate(hits[0]):
    hit['rerankers_score'] = reranker_scores[idx]

hits[0]

[{'corpus_id': 3698, 'score': 0.6484957337379456, 'rerankers_score': 4.912483},
 {'corpus_id': 134500,
  'score': 0.6339941620826721,
  'rerankers_score': 3.748391},
 {'corpus_id': 22288,
  'score': 0.6274536848068237,
  'rerankers_score': 5.217479},
 {'corpus_id': 41143,
  'score': 0.5963889956474304,
  'rerankers_score': 4.219248},
 {'corpus_id': 3701, 'score': 0.5953089594841003, 'rerankers_score': 8.127115},
 {'corpus_id': 16458,
  'score': 0.5916082859039307,
  'rerankers_score': 6.4802094},
 {'corpus_id': 7384,
  'score': 0.5885171294212341,
  'rerankers_score': -7.848918},
 {'corpus_id': 24727,
  'score': 0.5787838697433472,
  'rerankers_score': -0.30529726},
 {'corpus_id': 104601,
  'score': 0.5622392892837524,
  'rerankers_score': 3.5967212},
 {'corpus_id': 94186,
  'score': 0.5576584339141846,
  'rerankers_score': 3.048223},
 {'corpus_id': 16289,
  'score': 0.5556299090385437,
  'rerankers_score': 2.3891191},
 {'corpus_id': 106507,
  'score': 0.5477286577224731,
  'rerankers_

In [None]:
print("Top Bi-Encoder Retrieval hit:")
hit = sorted(hits[0], key=lambda x: x['score'], reverse=True)[0]
print(passages[hit['corpus_id']])

print("Top Reranker Retrieval hit:")
hit = sorted(hits[0], key=lambda x: x['rerankers_score'], reverse=True)[0]
print(passages[hit['corpus_id']])


Top Bi-Encoder Retrieval hit:
Mumbai (previously known as Bombay until 1996) is a natural harbor on the west coast of India, and is the capital city of Maharashtra state. It is India's largest city, and one of the world's most populous cities. It is the financial capital of India. The city is the second most-populous in the world. It has approximately 13 million people. Along with the neighboring cities of Navi Mumbai and Thane, it forms the world's 4th largest urban agglomeration. They have around 19.1 million people.
Top Reranker Retrieval hit:
New Delhi () is the capital of India and a union territory of the megacity of Delhi. It has a very old history and is home to several monuments where the city is expensive to live in. In traditional Indian geography it falls under the North Indian zone. The city has an area of about 42.7 km. New Delhi has a population of about 9.4 Million people.
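
The two-stage pattern used above (a cheap scorer picks candidates, a more precise scorer re-orders them) can be sketched without any models. Everything in this example is an assumption for illustration: the scorers `word_overlap` and `phrase_match` are toy stand-ins for the bi-encoder and cross-encoder, and the three-sentence corpus is made up.

```python
def retrieve_and_rerank(query, corpus, cheap_score, precise_score, top_k=2):
    """Return (candidate indices from stage 1, the same indices re-ranked)."""
    # Stage 1: score every passage cheaply and keep the top_k candidates
    candidates = sorted(
        range(len(corpus)),
        key=lambda i: cheap_score(query, corpus[i]),
        reverse=True,
    )[:top_k]
    # Stage 2: re-order only those candidates with the more precise scorer
    reranked = sorted(
        candidates,
        key=lambda i: precise_score(query, corpus[i]),
        reverse=True,
    )
    return candidates, reranked


def word_overlap(query, passage):
    """Toy stage-1 scorer: number of shared words."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


def phrase_match(query, passage):
    """Toy stage-2 scorer: 1.0 if the query appears verbatim, else 0.0."""
    return 1.0 if query.lower() in passage.lower() else 0.0


toy_corpus = [
    "mumbai is the capital of maharashtra in india",
    "cats are small animals",
    "new delhi is the capital of india",
]

# Passages 0 and 2 tie in stage 1; sorted() is stable, so passage 0 stays first
candidates, reranked = retrieve_and_rerank(
    "capital of india", toy_corpus, word_overlap, phrase_match
)
print(candidates, reranked)
```

The cheap scorer cannot tell the two "capital" passages apart, while the second stage moves the passage that actually answers the query to the top. The second stage only ever sees `top_k` passages, which is what keeps the more expensive scorer affordable.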


In [None]:
# Wrap the query-handling steps above (encode, retrieve, re-rank) in a function

def search_wikipedia(query):
    # Stage 1: embed the query and retrieve the 30 closest passages with the bi-encoder
    query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, passage_embeddings, top_k=30)

    # Stage 2: score each (query, passage) pair with the cross-encoder
    reranker_inp = [[query, passages[hit['corpus_id']]] for hit in hits[0]]
    reranker_scores = cross_encoderf.predict(reranker_inp)
    for idx, hit in enumerate(hits[0]):
        hit['rerankers_score'] = reranker_scores[idx]

    print("Top Bi-Encoder Retrieval hit:")
    hit = sorted(hits[0], key=lambda x: x['score'], reverse=True)[0]
    print(passages[hit['corpus_id']])

    print("Top Reranker Retrieval hit:")
    hit = sorted(hits[0], key=lambda x: x['rerankers_score'], reverse=True)[0]
    print(passages[hit['corpus_id']])
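
If the ranked hits are needed later instead of only being printed, the attach-and-sort step can be pulled into a small helper that returns them. This is a sketch under assumptions: the name `attach_and_sort` is hypothetical, and the five hits and scores are copied from the outputs shown earlier in this notebook so that the example runs without the models.

```python
def attach_and_sort(hits, reranker_scores):
    """Attach one re-ranker score per hit and return the hits, best first."""
    for hit, score in zip(hits, reranker_scores):
        hit['rerankers_score'] = float(score)
    return sorted(hits, key=lambda h: h['rerankers_score'], reverse=True)


# First five bi-encoder hits and cross-encoder scores from the outputs above
sample_hits = [
    {'corpus_id': 3698, 'score': 0.6484957337379456},
    {'corpus_id': 134500, 'score': 0.6339941620826721},
    {'corpus_id': 22288, 'score': 0.6274536848068237},
    {'corpus_id': 41143, 'score': 0.5963889956474304},
    {'corpus_id': 3701, 'score': 0.5953089594841003},
]
sample_scores = [4.912483, 3.748391, 5.217479, 4.219248, 8.127115]

ranked = attach_and_sort(sample_hits, sample_scores)
ranked_ids = [hit['corpus_id'] for hit in ranked]
print(ranked_ids)
```

Returning the sorted list makes it easy to show the top few passages or to compare the bi-encoder and re-ranker orderings side by side.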

In [None]:
search_wikipedia('What is an animal')

Top Bi-Encoder Retrieval hit:
A terrestrial animal is an animal that lives on land such as dog, cat, an ant or an emu. It can also be used for some species of amphibians that only go back to the water to lay their eggs. It is usually a term to describe the difference between animals that live in water, (such as lobsters and fish), from animals that live on land.
Top Reranker Retrieval hit:
Animals is a concept album by English progressive rock band Pink Floyd, released on 23 January 1977 in the United Kingdom and on 2 February 1977 in the United States. The album proved to be a success in the UK, where it reached #2 in the era of punk music. It was also a success in the U.S., reaching #3 on the Billboard album charts (#1 and #2 were "Hotel California" by the Eagles, and the soundtrack to the Barbra Streisand film "A Star Is Born"). However, it was on the American charts for only six months even though it has continued to sell solidly, to the extent of its having gone quadruple platinum

In [None]:
search_wikipedia('Narendra Modi')

Top Bi-Encoder Retrieval hit:
Narendra Damodardas Modi (born 17 September 1950) is an Indian politician. He is the current Prime Minister of India serving since 2014. He was the 14th Chief Minister of the state of Gujarat. Modi was elected Prime Minister of India in May 2014. He is a member of Bharatiya Janata Party.
Top Reranker Retrieval hit:
Narendra Damodardas Modi (born 17 September 1950) is an Indian politician. He is the current Prime Minister of India serving since 2014. He was the 14th Chief Minister of the state of Gujarat. Modi was elected Prime Minister of India in May 2014. He is a member of Bharatiya Janata Party.


In [None]:
search_wikipedia('What animal can run fast')

Top Bi-Encoder Retrieval hit:
Running is the way in which people or animals travel quickly on their feet. It is a method of travelling on land. It is different to walking in that both feet are regularly off the ground at the same time. Different terms are used to refer to running according to the speed: jogging is slow, and sprinting is running fast.
Top Reranker Retrieval hit:
A cheetah ("Acinonyx jubatus") is a medium large cat which lives in Africa. It is the fastest land animal and can run up to 112 kilometers per hour for a short time. Most cheetahs live in the savannas of Africa. There are a few in Asia. Cheetahs are active during the day, and hunt in the early morning or late evening.


In [None]:
search_wikipedia('su 30mki')

Top Bi-Encoder Retrieval hit:
Sukow is a municipality in the Ludwigslust-Parchim district, in Mecklenburg-Vorpommern, Germany.
Top Reranker Retrieval hit:
Sukhoi Su-30 is a Russian fighter aircraft. It is developed by Sukhoi Aviation Corporation. It is a 4.5-generation jet fighter aircraft. The aircraft is used by these air forces:
