
# Similar Questions Retrieval

This notebook is inspired by the [similar questions retrieval example of Sentence-Transformers](https://www.sbert.net/examples/applications/semantic-search/README.html#similar-questions-retrieval) and adapted to use the approximate nearest neighbor (ANN) algorithms in [RAFT](https://github.com/rapidsai/raft).

The bi-encoder model used here, `nq-distilbert-base-v1`, was pre-trained on the [Natural Questions dataset](https://ai.google.com/research/NaturalQuestions), which consists of about 100k real Google search queries, each paired with an annotated Wikipedia passage that provides the answer. This is an example of an asymmetric search task: short queries are matched against longer passages. As the corpus, we use the smaller [Simple English Wikipedia](http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz) so that it fits easily into memory.

The steps to install the latest stable `pylibraft` package are available in the [documentation](https://docs.rapids.ai/api/raft/stable/build).

In [1]:
!pip install sentence_transformers torch

# Note: if you have a Hopper-based GPU, such as an H100, install with these commands instead:
# pip install torch --index-url https://download.pytorch.org/whl/cu118
# pip install sentence_transformers



In [2]:
!nvidia-smi

Mon Jul 31 14:35:31 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA H100 80G...  On   | 00000000:1B:00.0 Off |                    0 |
| N/A   30C    P0    75W / 700W |      0MiB / 81559MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 80G...  On   | 00000000:43:00.0 Off |                    0 |
| N/A   31C    P0    72W / 700W |      0MiB / 81559MiB |      0%      Default |
|       

In [3]:
import json
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import time
import gzip
import os
import torch
import pylibraft
from pylibraft.neighbors import ivf_flat, ivf_pq
pylibraft.config.set_output_as(lambda device_ndarray: device_ndarray.copy_to_host())

if not torch.cuda.is_available():
  print("Warning: No GPU found. Please add GPU to your notebook")



In [4]:
# We use the Bi-Encoder to encode all passages, so that we can use it with semantic search
model_name = 'nq-distilbert-base-v1'
bi_encoder = SentenceTransformer(model_name)

# As the dataset, we use Simple English Wikipedia. Compared to the full English Wikipedia, it has only
# about 170k articles. We split these articles into paragraphs and encode them with the bi-encoder.

wikipedia_filepath = 'data/simplewiki-2020-11-01.jsonl.gz'

if not os.path.exists(wikipedia_filepath):
    util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz', wikipedia_filepath)

passages = []
with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as fIn:
    for line in fIn:
        data = json.loads(line.strip())
        for paragraph in data['paragraphs']:
            # We encode the passages as [title, text]
            passages.append([data['title'], paragraph])

# If you like, you can also limit the number of passages you want to use
print("Passages:", len(passages))

# To speed things up, pre-computed embeddings are downloaded.
# The provided file encoded the passages with the model 'nq-distilbert-base-v1'
if model_name == 'nq-distilbert-base-v1':
    embeddings_filepath = 'simplewiki-2020-11-01-nq-distilbert-base-v1.pt'
    if not os.path.exists(embeddings_filepath):
        util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01-nq-distilbert-base-v1.pt', embeddings_filepath)

    corpus_embeddings = torch.load(embeddings_filepath)
    corpus_embeddings = corpus_embeddings.float()  # Convert embedding file to float
    if torch.cuda.is_available():
        corpus_embeddings = corpus_embeddings.to('cuda')
else:  # Here, we compute the corpus_embeddings from scratch (which can take a while depending on the GPU)
    corpus_embeddings = bi_encoder.encode(passages, convert_to_tensor=True, show_progress_bar=True)

Passages: 509663


# Vector Search using RAPIDS RAFT
Now that the model is loaded and the embeddings are ready to be indexed, we can use RAPIDS RAFT for vector search.

This is done in two steps: first we build the index, then we search it.
With `pylibraft`, this comes down to four calls (`IndexParams`, `build`, `SearchParams`, and `search`):

In [5]:
%%time
params = ivf_pq.IndexParams(n_lists=150, pq_dim=96)
pq_index = ivf_pq.build(params, corpus_embeddings)
search_params = ivf_pq.SearchParams()

def search_raft_pq(query, top_k = 5):
    # Encode the query using the bi-encoder and find potentially relevant passages
    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)

    hits = ivf_pq.search(search_params, pq_index, question_embedding[None], top_k)

    # Output of top-k hits
    print("Input question:", query)
    for k in range(top_k):
        print("\t{:.3f}\t{}".format(hits[0][0, k], passages[hits[1][0, k]]))

[W] [14:35:48.810785] [raft::ivf_pq::build] the default cuda resource is used for the raft workspace allocations. This may lead to a significant slowdown for this algorithm. Consider using the default pool resource (`raft::resource::set_workspace_to_pool_resource`) or set your own resource explicitly (`raft::resource::set_workspace_resource`).
[W] [14:35:53.831753] [raft::ivf_pq::extend] the default cuda resource is used for the raft workspace allocations. This may lead to a significant slowdown for this algorithm. Consider using the default pool resource (`raft::resource::set_workspace_to_pool_resource`) or set your own resource explicitly (`raft::resource::set_workspace_resource`).
CPU times: user 2.21 s, sys: 2.49 s, total: 4.7 s
Wall time: 5.13 s
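
For reference, the exact search that an ANN index approximates is a brute-force scan over every embedding. The sketch below is a minimal NumPy version, assuming squared Euclidean distance (this would be consistent with the ascending scores printed by the search cells in this notebook, but it is an assumption here). The `exact_search` helper and the toy arrays are hypothetical and are not part of `pylibraft`.

```python
import numpy as np

def exact_search(corpus, queries, top_k=5):
    """Brute-force k-nearest-neighbor search under squared Euclidean distance.

    corpus:  (n, d) float array; queries: (m, d) float array.
    Returns (distances, indices), each of shape (m, top_k), sorted ascending.
    """
    # ||q - c||^2 = ||q||^2 - 2 q.c + ||c||^2, computed for all pairs at once
    q_sq = (queries ** 2).sum(axis=1, keepdims=True)   # (m, 1)
    c_sq = (corpus ** 2).sum(axis=1)[None, :]          # (1, n)
    dists = q_sq - 2.0 * (queries @ corpus.T) + c_sq   # (m, n)
    dists = np.maximum(dists, 0.0)  # clamp tiny negatives caused by round-off

    idx = np.argsort(dists, axis=1)[:, :top_k]
    return np.take_along_axis(dists, idx, axis=1), idx

# Toy data standing in for the passage embeddings (not the real corpus)
rng = np.random.default_rng(0)
toy_corpus = rng.standard_normal((1000, 32)).astype(np.float32)
noise = 0.01 * rng.standard_normal((3, 32)).astype(np.float32)
toy_queries = toy_corpus[:3] + noise  # slightly perturbed copies of rows 0, 1, 2

toy_dists, toy_idx = exact_search(toy_corpus, toy_queries)
print(toy_idx[:, 0])  # nearest corpus row for each query
```

A scan like this costs one distance computation per passage per query, which is the work that an index tries to avoid.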


With IVF-PQ, the goal is to reduce the memory footprint while keeping good accuracy. We can estimate the size of the compressed vectors from the index parameters:

In [6]:
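# Each vector is stored as pq_dim codes of pq_bits bits each; divide by 8 to convert bits to bytes.
# This counts only the encoded vectors, not the codebooks, cluster centers, or stored indices.
pq_index_mem = pq_index.pq_dim * pq_index.size * pq_index.pq_bits / 8
print("IVF-PQ memory footprint: {:.1f} MB".format(pq_index_mem / 2**20))

# The original embeddings are float32, i.e. 4 bytes per value
original_mem = corpus_embeddings.shape[0] * corpus_embeddings.shape[1] * 4
print("Original dataset: {:.1f} MB".format(original_mem / 2**20))

print("Memory saved: {:.1f}%".format(100 * (1 - pq_index_mem / original_mem)))

IVF-PQ memory footprint: 46.7 MB
Original dataset: 1493.2 MB
Memory saved: 96.9%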
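
The savings come from product quantization: each vector is split into `pq_dim` sub-vectors, and each sub-vector is replaced by the index of its nearest centroid in a small codebook. The following is a rough NumPy sketch of that idea on toy data. The helpers `kmeans`, `pq_train`, `pq_encode`, and `pq_decode` are hypothetical names written for this illustration; RAFT's actual implementation differs (for example, it encodes residuals relative to the IVF cluster centers and runs on the GPU).

```python
import numpy as np

def kmeans(x, k, n_iter=10, seed=0):
    """Plain Lloyd's k-means; returns a (k, d) array of centroids."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign every point to its nearest centroid
        d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids

def pq_train(x, pq_dim, pq_bits=8):
    """Train one codebook of 2**pq_bits centroids per sub-vector."""
    return [kmeans(s, 2 ** pq_bits) for s in np.split(x, pq_dim, axis=1)]

def pq_encode(x, codebooks):
    """Replace each sub-vector with the index of its nearest centroid."""
    subs = np.split(x, len(codebooks), axis=1)
    codes = [((s[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
             for s, cb in zip(subs, codebooks)]
    return np.stack(codes, axis=1).astype(np.uint8)  # valid for pq_bits <= 8

def pq_decode(codes, codebooks):
    """Rebuild approximate vectors by looking up each code in its codebook."""
    return np.concatenate([cb[codes[:, i]] for i, cb in enumerate(codebooks)], axis=1)

# Toy data: 2000 vectors of dimension 16, compressed to 4 one-byte codes each
rng = np.random.default_rng(0)
toy_vectors = rng.standard_normal((2000, 16)).astype(np.float32)
codebooks = pq_train(toy_vectors, pq_dim=4, pq_bits=8)
codes = pq_encode(toy_vectors, codebooks)
approx = pq_decode(codes, codebooks)

print("compression ratio:", toy_vectors.nbytes / codes.nbytes)  # 16.0
print("mean squared reconstruction error:", float(((toy_vectors - approx) ** 2).mean()))
```

The reconstruction error is the price paid for the smaller footprint, and it is why IVF-PQ distances and rankings can deviate from those of an uncompressed index.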


In [7]:
%%time
search_raft_pq(query="Who was Grace Hopper?")

[W] [14:36:07.640223] [raft::ivf_pq::search] the default cuda resource is used for the raft workspace allocations. This may lead to a significant slowdown for this algorithm. Consider using the default pool resource (`raft::resource::set_workspace_to_pool_resource`) or set your own resource explicitly (`raft::resource::set_workspace_resource`).
Input question: Who was Grace Hopper?
	190.855	['Leona Helmsley', 'Leona Helmsley (July 4, 1920 – August 20, 2007) was an American businesswoman. She was known for having a flamboyant personality. She had a reputation for tyrannical behavior; she was nicknamed the Queen of Mean.']
	195.364	['Grace Hopper', 'Hopper was born in New York, USA. Hopper graduated from Vassar College in 1928 and Yale University in 1934 with a Ph.D degree in mathematics. She joined the US Navy during the World War II in 1943. She worked on computers in the Navy for 43 years. She then worked in other private industry companies after 1949. She retired from the Navy in 198

In [8]:
%%time
search_raft_pq(query="Who was Alan Turing?")

Input question: Who was Alan Turing?
	139.827	['Alan Turing', 'Alan Mathison Turing OBE FRS (London, 23 June 1912 – Wilmslow, Cheshire, 7 June 1954) was an English mathematician and computer scientist. He was born in Maida Vale, London.']
	169.849	['William Kahan', 'William Morton Kahan (born June 5, 1933) is a Canadian mathematician and computer scientist. He received the Turing Award in 1989 for ""his fundamental contributions to numerical analysis"." He was named an ACM Fellow in 1994, and added to the National Academy of Engineering in 2005.']
	177.520	['Rolf Noskwith', 'Rolf Noskwith (19 June 1919 – 3 January 2017) was a British businessman. During the Second World War, he worked under Alan Turing as a cryptographer at the British military base Bletchley Park in Milton Keynes, Buckinghamshire.']
	179.202	['Marvin Minsky', "Marvin Lee Minsky (August 9, 1927 – January 24, 2016) was an American cognitive scientist in the field of artificial intelligence (AI). He was the co-founder of

In [9]:
%%time
search_raft_pq(query = "What is creating tides?")

Input question: What is creating tides?
	125.037	['Tide', "A tide is the periodic rising and falling of Earth's ocean surface caused mainly by the gravitational pull of the Moon acting on the oceans. Tides cause changes in the depth of marine and estuarine (river mouth) waters. Tides also make oscillating currents known as tidal streams (~'rip tides'). This means that being able to predict the tide is important for coastal navigation. The strip of seashore that is under water at high tide and exposed at low tide, called the intertidal zone, is an important ecological product of ocean tides."]
	163.835	['Tidal energy', "Many things affect tides. The pull of the Moon is the largest effect, and most of the energy comes from the slowing of the Earth's spin."]
	167.368	['Storm surge', 'A storm surge is a sudden rise of water hitting areas close to the coast. Storm surges are usually created by a hurricane or other tropical cyclone. The surge happens because a storm has fast winds and low at

In [10]:
%%time
params = ivf_flat.IndexParams(n_lists=150)
flat_index = ivf_flat.build(params, corpus_embeddings)
search_params = ivf_flat.SearchParams()

def search_raft_flat(query, top_k = 5):
    # Encode the query using the bi-encoder and find potentially relevant passages
    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    
    start_time = time.time()
    hits = ivf_flat.search(search_params, flat_index, question_embedding[None], top_k)
    end_time = time.time()

    # Output of top-k hits
    print("Input question:", query)
    print("Results (after {:.3f} seconds):".format(end_time - start_time))
    for k in range(top_k):
        print("\t{:.3f}\t{}".format(hits[0][0, k], passages[hits[1][0, k]]))

CPU times: user 208 ms, sys: 63.8 ms, total: 271 ms
Wall time: 286 ms
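
IVF-Flat keeps the vectors uncompressed and speeds up search by partitioning them into `n_lists` clusters, then scanning only the clusters closest to the query. The sketch below illustrates that inverted-file idea in NumPy on toy data. `ivf_build`, `ivf_search`, and `sq_dists` are hypothetical helpers for this illustration and do not reflect RAFT's internals; `n_probes` here plays the role of the similarly named search parameter in `pylibraft`.

```python
import numpy as np

def sq_dists(a, b):
    """Pairwise squared Euclidean distances between rows of a and rows of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)

def ivf_build(data, n_lists, n_iter=10, seed=0):
    """Cluster the data with k-means and record which rows fall in each cluster."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=n_lists, replace=False)].copy()
    for _ in range(n_iter):
        labels = sq_dists(data, centers).argmin(axis=1)
        for j in range(n_lists):
            members = data[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    labels = sq_dists(data, centers).argmin(axis=1)
    lists = [np.flatnonzero(labels == j) for j in range(n_lists)]
    return centers, lists

def ivf_search(data, centers, lists, query, top_k=5, n_probes=3):
    """Scan only the n_probes clusters whose centers are closest to the query."""
    probe = sq_dists(query[None, :], centers)[0].argsort()[:n_probes]
    candidates = np.concatenate([lists[j] for j in probe])
    d = sq_dists(query[None, :], data[candidates])[0]
    order = d.argsort()[:top_k]
    return d[order], candidates[order]

# Toy data standing in for the passage embeddings
rng = np.random.default_rng(0)
toy_data = rng.standard_normal((2000, 16)).astype(np.float32)
centers, lists = ivf_build(toy_data, n_lists=20)

query = toy_data[42]
dists, ids = ivf_search(toy_data, centers, lists, query, top_k=5, n_probes=3)
print(ids, dists)
```

Raising `n_probes` scans more clusters, which improves recall at the cost of search time; probing every list reduces to an exact scan.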


In [11]:
%%time
search_raft_flat(query="Who was Grace Hopper?")

Input question: Who was Grace Hopper?
Results (after 0.002 seconds):
	181.650	['Grace Hopper', 'Hopper was born in New York, USA. Hopper graduated from Vassar College in 1928 and Yale University in 1934 with a Ph.D degree in mathematics. She joined the US Navy during the World War II in 1943. She worked on computers in the Navy for 43 years. She then worked in other private industry companies after 1949. She retired from the Navy in 1986 and died on January 1, 1992.']
	192.946	['Leona Helmsley', 'Leona Helmsley (July 4, 1920 – August 20, 2007) was an American businesswoman. She was known for having a flamboyant personality. She had a reputation for tyrannical behavior; she was nicknamed the Queen of Mean.']
	194.951	['Grace Hopper', 'Grace Murray Hopper (December 9 1906 – January 1 1992) was an American computer scientist and United States Navy officer.']
	202.192	['Nellie Bly', 'Elizabeth Cochrane Seaman (born Elizabeth Jane Cochran; May 5, 1864 – January 27, 1922), better known by he

In [12]:
%%time
search_raft_flat(query="Who was Alan Turing?")

Input question: Who was Alan Turing?
Results (after 0.002 seconds):
	106.131	['Alan Turing', 'Alan Mathison Turing OBE FRS (London, 23 June 1912 – Wilmslow, Cheshire, 7 June 1954) was an English mathematician and computer scientist. He was born in Maida Vale, London.']
	158.646	['William Kahan', 'William Morton Kahan (born June 5, 1933) is a Canadian mathematician and computer scientist. He received the Turing Award in 1989 for ""his fundamental contributions to numerical analysis"." He was named an ACM Fellow in 1994, and added to the National Academy of Engineering in 2005.']
	165.094	['Alan Turing', 'A brilliant mathematician and cryptographer Alan was to become the founder of modern-day computer science and artificial intelligence; designing a machine at Bletchley Park to break secret Enigma encrypted messages used by the Nazi German war machine to protect sensitive commercial, diplomatic and military communications during World War 2. Thus, Turing made the single biggest contribut

In [13]:
%%time
search_raft_flat(query = "What is creating tides?")

Input question: What is creating tides?
Results (after 0.002 seconds):
	94.909	['Tide', "A tide is the periodic rising and falling of Earth's ocean surface caused mainly by the gravitational pull of the Moon acting on the oceans. Tides cause changes in the depth of marine and estuarine (river mouth) waters. Tides also make oscillating currents known as tidal streams (~'rip tides'). This means that being able to predict the tide is important for coastal navigation. The strip of seashore that is under water at high tide and exposed at low tide, called the intertidal zone, is an important ecological product of ocean tides."]
	159.539	['Tidal energy', "Many things affect tides. The pull of the Moon is the largest effect, and most of the energy comes from the slowing of the Earth's spin."]
	159.740	['Storm surge', 'A storm surge is a sudden rise of water hitting areas close to the coast. Storm surges are usually created by a hurricane or other tropical cyclone. The surge happens because a s
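
The IVF-PQ and IVF-Flat results above are close but not identical: for "Who was Grace Hopper?", IVF-PQ ranks the Leona Helmsley passage first, while IVF-Flat ranks a Grace Hopper passage first. A common way to quantify such differences is recall@k against a reference result set (ideally from an exact search). The helper below is a small, hypothetical sketch of that metric; the index arrays are made-up toy values, not output from this notebook.

```python
import numpy as np

def recall_at_k(approx_ids, reference_ids):
    """Fraction of reference neighbors that the approximate search also returned.

    Both arrays have shape (n_queries, k) and hold neighbor indices.
    """
    hits = sum(len(set(a) & set(r))
               for a, r in zip(approx_ids.tolist(), reference_ids.tolist()))
    return hits / reference_ids.size

# Toy example: 2 queries, k = 3
reference = np.array([[1, 2, 3], [4, 5, 6]])
approx = np.array([[1, 3, 9], [4, 5, 6]])  # one neighbor missed for the first query

print(round(recall_at_k(approx, reference), 3))  # 0.833
```

Recall ignores the order within the top k, so two searches that return the same passages in a different order still score 1.0.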

## Using CAGRA: GPU graph-based Vector Search

CAGRA is a graph-based nearest neighbors implementation with state-of-the-art query performance for both small- and large-batch vector searches.
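
To give some intuition for graph-based search in general, the sketch below builds a small k-nearest-neighbor graph by brute force and then answers a query by repeatedly expanding the closest unexplored candidate. This is a simplified, CPU-only illustration under assumed names (`build_knn_graph`, `graph_search`, `pool_size`, `n_starts`); it is not CAGRA's algorithm or its GPU implementation.

```python
import numpy as np

def build_knn_graph(data, degree):
    """Exact k-NN graph by brute force (only practical for toy sizes)."""
    d = ((data[:, None, :] - data[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)  # a node is not its own neighbor
    return np.argsort(d, axis=1)[:, :degree]

def graph_search(data, graph, query, top_k=5, pool_size=32, n_starts=8, seed=0):
    """Best-first traversal that keeps a bounded pool of the closest nodes seen."""
    def dist(ids):
        return ((data[ids] - query) ** 2).sum(axis=1)

    rng = np.random.default_rng(seed)
    pool = rng.choice(len(data), size=n_starts, replace=False)  # random entry points
    pool_d = dist(pool)
    expanded = set()
    while True:
        # Keep only the pool_size closest candidates, sorted by distance
        order = np.argsort(pool_d)[:pool_size]
        pool, pool_d = pool[order], pool_d[order]
        todo = [i for i in pool.tolist() if i not in expanded]
        if not todo:
            break  # every node in the pool has been expanded
        node = todo[0]  # closest candidate not yet expanded
        expanded.add(node)
        new = np.setdiff1d(graph[node], pool)  # neighbors not already in the pool
        pool = np.concatenate([pool, new])
        pool_d = np.concatenate([pool_d, dist(new)])
    return pool_d[:top_k], pool[:top_k]

# Toy data standing in for the passage embeddings
rng = np.random.default_rng(0)
toy_data = rng.standard_normal((500, 8)).astype(np.float32)
graph = build_knn_graph(toy_data, degree=16)

query = toy_data[7] + np.float32(0.01)
dists, ids = graph_search(toy_data, graph, query)
print(ids, dists)
```

The search touches only the nodes it walks through rather than the whole dataset, which is what makes graph indexes attractive for low-latency queries.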

CAGRA follows the same two-step API as IVF-Flat and IVF-PQ in RAFT. First we build the index:

In [14]:
from pylibraft.neighbors import cagra

In [15]:
%%time
params = cagra.IndexParams(intermediate_graph_degree=32, graph_degree=16, build_algo="nn_descent")
cagra_index = cagra.build(params, corpus_embeddings)
search_params = cagra.SearchParams(algo="multi_cta")

CPU times: user 35.3 s, sys: 4.5 s, total: 39.8 s
Wall time: 2.16 s


In [16]:
def search_raft_cagra(query, top_k = 5):
    # Encode the query using the bi-encoder and find potentially relevant passages
    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)

    start_time = time.time()
    hits = cagra.search(search_params, cagra_index, question_embedding[None], top_k)
    end_time = time.time()

    # Output of top-k hits
    print("Results (after {:.3f} seconds):".format(end_time - start_time))
    print("Input question:", query)
    for k in range(top_k):
        print("\t{:.3f}\t{}".format(hits[0][0, k], passages[hits[1][0, k]]))

In [17]:
%%time 
search_raft_cagra(query="Who was Grace Hopper?")

Results (after 0.005 seconds):
Input question: Who was Grace Hopper?
	181.649	['Grace Hopper', 'Hopper was born in New York, USA. Hopper graduated from Vassar College in 1928 and Yale University in 1934 with a Ph.D degree in mathematics. She joined the US Navy during the World War II in 1943. She worked on computers in the Navy for 43 years. She then worked in other private industry companies after 1949. She retired from the Navy in 1986 and died on January 1, 1992.']
	192.946	['Leona Helmsley', 'Leona Helmsley (July 4, 1920 – August 20, 2007) was an American businesswoman. She was known for having a flamboyant personality. She had a reputation for tyrannical behavior; she was nicknamed the Queen of Mean.']
	194.951	['Grace Hopper', 'Grace Murray Hopper (December 9 1906 – January 1 1992) was an American computer scientist and United States Navy officer.']
	202.192	['Nellie Bly', 'Elizabeth Cochrane Seaman (born Elizabeth Jane Cochran; May 5, 1864 – January 27, 1922), better known by he