In [1]:
! pip install -q sentence_transformers splade-index

### **Load Dataset**

In [None]:
from datasets import load_dataset

dataset = load_dataset("sentence-transformers/natural-questions", split="train")
dataset

In [3]:
import hashlib

# Use the MD5 hash of each text as its ID, so identical texts collapse to one entry
def md5(text):
    res = hashlib.md5(text.encode())
    return res.hexdigest()

documents = {md5(document): document for document in dataset['answer']}
queries = {md5(query): query for query in dataset['query']}
relevant_docs = {md5(query): [md5(document)] for query, document in zip(dataset['query'], dataset['answer'])}

len(queries), len(documents), len(relevant_docs)

(100231, 75215, 100231)
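
The counts above differ because several queries can point to the same answer passage: keying the dictionary by the MD5 of the text keeps only one copy of each passage. The following is a minimal sketch of that effect on three made-up rows (the rows are hypothetical and not taken from the dataset):

```python
import hashlib

def md5(text):
    return hashlib.md5(text.encode()).hexdigest()

# Hypothetical toy rows: two different questions share one answer passage
rows = [
    ("who wrote hamlet", "Hamlet is a tragedy written by William Shakespeare."),
    ("hamlet author", "Hamlet is a tragedy written by William Shakespeare."),
    ("capital of france", "Paris is the capital of France."),
]

documents = {md5(answer): answer for _, answer in rows}
queries = {md5(query): query for query, _ in rows}
relevant_docs = {md5(query): [md5(answer)] for query, answer in rows}

# Three unique queries, but only two unique documents
print(len(queries), len(documents), len(relevant_docs))
```

Both Hamlet questions map to the same document ID in `relevant_docs`, which is why the number of documents can be smaller than the number of queries.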

### **Index the Documents**

In [None]:
from sentence_transformers import SparseEncoder

# Download a SPLADE model from the 🤗 Hub
splade = SparseEncoder("naver/splade-v3-distilbert", device="cuda")
splade

In [5]:
import numpy as np

# The documents
corpus = list(documents.values())
len(corpus)

75215
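
The `relevant_docs` mapping built earlier is keyed by MD5 IDs, so it can be used to score retrieval quality once results are expressed in the same IDs. The sketch below assumes that the integer `doc_id` values returned at retrieval time are positions in `corpus`; that is an assumption, not something this notebook confirms. Under that assumption, `list(documents.keys())` gives the MD5 ID for each position. The helper `hit_rate_at_k` and all toy data are hypothetical. With exactly one relevant document per query, the hit rate equals recall@k.

```python
def hit_rate_at_k(retrieved, relevant, k):
    """Fraction of queries whose top-k list contains at least one relevant doc."""
    hits = 0
    for query_id, ranked_doc_ids in retrieved.items():
        if set(ranked_doc_ids[:k]) & set(relevant.get(query_id, [])):
            hits += 1
    return hits / len(retrieved) if retrieved else 0.0

# Hypothetical toy data standing in for the notebook's variables
doc_keys = ["d0", "d1", "d2", "d3"]             # e.g. list(documents.keys())
relevant = {"q0": ["d2"], "q1": ["d0"]}         # e.g. relevant_docs
positions = {"q0": [2, 1, 3], "q1": [3, 1, 0]}  # ranked integer positions per query

# Translate integer positions into document IDs before comparing
retrieved = {q: [doc_keys[i] for i in idxs] for q, idxs in positions.items()}

print(hit_rate_at_k(retrieved, relevant, k=1))  # 0.5
print(hit_rate_at_k(retrieved, relevant, k=3))  # 1.0
```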

In [6]:
from splade_index import SPLADE

# Create the SPLADE retriever and index the corpus
retriever = SPLADE()
retriever.index(model=splade, documents=corpus)

Batches: 100%|██████████| 2351/2351 [09:38<00:00,  4.07it/s]


### **Save Index Locally**

In [7]:
# Save the index to a local directory
retriever.save("natural-questions-75k")

### **Load the Index**

In [8]:
from sentence_transformers import SparseEncoder

# Download a SPLADE model from the 🤗 Hub
splade = SparseEncoder("naver/splade-v3-distilbert", device="cuda")
splade

SparseEncoder(
  (0): MLMTransformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'DistilBertForMaskedLM'})
  (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': None})
)

In [9]:
# Set `mmap=True` to memory-map the index and keep memory usage low
reloaded_retriever = SPLADE.load(save_dir="natural-questions-75k", model=splade, mmap=True)

In [11]:
# Take the first 10 queries as a sample
query_list = list(queries.values())[:10]
len(query_list), query_list[:3]

(10,
 ['when did richmond last play in a preliminary final',
  "who sang what in the world's come over you",
  'who produces the most wool in the world'])

In [15]:
# Retrieve the top-k results. `doc_ids`, `documents`, and `scores` are each arrays of shape (n_queries, k).
results = reloaded_retriever.retrieve(query_list, k=5)
doc_ids, result_docs, scores = results.doc_ids, results.documents, results.scores

for i in range(doc_ids.shape[1]):
    doc_id, doc, score = doc_ids[0, i], result_docs[0, i], scores[0, i]
    print(f"Rank {i+1} (score: {score:.2f}) (doc_id: {doc_id}): {doc}")

Batches: 100%|██████████| 1/1 [00:00<00:00, 57.06it/s]


SPLADE Index Retrieve:   0%|          | 0/10 [00:00<?, ?it/s]

Rank 1 (score: 23.77) (doc_id: 11233): Richmond Football Club Richmond began 2017 with 5 straight wins, a feat it had not achieved since 1995. A series of close losses hampered the Tigers throughout the middle of the season, including a 5-point loss to the Western Bulldogs, 2-point loss to Fremantle, and a 3-point loss to the Giants. Richmond ended the season strongly with convincing victories over Fremantle and St Kilda in the final two rounds, elevating the club to 3rd on the ladder. Richmond's first final of the season – their qualifying final against the Cats at the MCG attracted a record qualifying final crowd of 95,028; the Tigers won by 51 points. In their first preliminary final since 2001, Richmond defeated Greater Western Sydney by 36 points in front of a crowd of 94,258 to progress to the Grand Final against Adelaide, their first Grand Final appearance since 1982. The attendance was 100,021, the largest crowd for a Grand Final since 1986. The Crows led at quarter time and le
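
The loop above prints results for the first query only (row `0`), although ten queries were retrieved. A minimal sketch for listing every query follows. It assumes only that the three result arrays can be indexed as `[query][rank]`, which holds for arrays of shape `(n_queries, k)`. The helper `format_results` and the toy data are hypothetical.

```python
def format_results(queries, doc_ids, docs, scores, max_chars=80):
    """Build one printable line per (query, rank) pair."""
    lines = []
    for q, query in enumerate(queries):
        lines.append(f"Query: {query}")
        ranked = zip(doc_ids[q], docs[q], scores[q])
        for rank, (doc_id, doc, score) in enumerate(ranked, start=1):
            snippet = str(doc)[:max_chars]  # truncate long passages for display
            lines.append(f"  Rank {rank} (score: {score:.2f}) (doc_id: {doc_id}): {snippet}")
    return lines

# Hypothetical toy results with shape (n_queries=2, k=2)
toy_queries = ["first question", "second question"]
toy_doc_ids = [[7, 3], [1, 9]]
toy_docs = [["passage seven", "passage three"], ["passage one", "passage nine"]]
toy_scores = [[23.77, 18.5], [21.91, 17.32]]

lines = format_results(toy_queries, toy_doc_ids, toy_docs, toy_scores)
print("\n".join(lines))
```

In this notebook the call would presumably be `format_results(query_list, doc_ids, result_docs, scores)`.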

### **Save Index to Hugging Face Hub**

In [None]:
from google.colab import userdata

# Replace with your own Hugging Face username
username = "yosefw"
repo_id = f"{username}/splade-index-natural-questions-75k"
# `token` must be a Hugging Face access token with write permission
retriever.save_to_hub(repo_id, private=True, token=userdata.get("HF_WRITE_TOKEN"))
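
`google.colab.userdata` is only available inside Colab, so the cell above fails elsewhere. One possible workaround is a small helper that tries the Colab secret first and then falls back to environment variables. This is a sketch: `get_hf_token` is a hypothetical helper, `HF_WRITE_TOKEN` is the secret name used in this notebook, and `HF_TOKEN` is the environment variable that `huggingface_hub` reads by default.

```python
import os

def get_hf_token(secret_name="HF_WRITE_TOKEN"):
    """Return a Hugging Face token from Colab secrets, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
        token = userdata.get(secret_name)
        if token:
            return token
    except Exception:
        # Not running in Colab, or the secret is missing or not shared with this notebook
        pass
    # Fall back to an environment variable with the same name, then to HF_TOKEN
    return os.environ.get(secret_name) or os.environ.get("HF_TOKEN")
```

The result can then be passed as `token=get_hf_token()` to `save_to_hub` or `load_from_hub`. It is `None` when no token is found anywhere.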

### **Load Index from Hub**

In [17]:
from sentence_transformers import SparseEncoder

# Download a SPLADE model from the 🤗 Hub
splade = SparseEncoder("naver/splade-v3-distilbert", device="cuda")
splade

SparseEncoder(
  (0): MLMTransformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'DistilBertForMaskedLM'})
  (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': None})
)

In [None]:
from splade_index import SPLADE
from google.colab import userdata

repo_id = "yosefw/splade-index-natural-questions-75k"
loaded_retriever = SPLADE.load_from_hub(repo_id, model=splade, mmap=True, token=userdata.get("HF_WRITE_TOKEN"))

In [19]:
# Use a new name so the `queries` dict built earlier is not overwritten
hub_queries = ['is natural gas renewable']

# Retrieve the top-k results. `doc_ids`, `documents`, and `scores` are each arrays of shape (n_queries, k).
results = loaded_retriever.retrieve(hub_queries, k=5)
doc_ids, result_docs, scores = results.doc_ids, results.documents, results.scores

for i in range(doc_ids.shape[1]):
    doc_id, doc, score = doc_ids[0, i], result_docs[0, i], scores[0, i]
    print(f"Rank {i+1} (score: {score:.2f}) (doc_id: {doc_id}): {doc}")

Batches: 100%|██████████| 1/1 [00:00<00:00,  7.50it/s]


SPLADE Index Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Rank 1 (score: 21.91) (doc_id: 12107): Natural gas Natural gas is a fossil fuel used as a source of energy for heating, cooking, and electricity generation. It is also used as a fuel for vehicles and as a chemical feedstock in the manufacture of plastics and other commercially important organic chemicals. Fossil fuel based natural gas is a non-renewable resource.[3]
Rank 2 (score: 17.32) (doc_id: 74390): Natural gas Natural gas is found in deep underground rock formations or associated with other hydrocarbon reservoirs in coal beds and as methane clathrates. Petroleum is another resource and fossil fuel found in close proximity to and with natural gas. Most natural gas was created over time by two mechanisms: biogenic and thermogenic. Biogenic gas is created by methanogenic organisms in marshes, bogs, landfills, and shallow sediments. Deeper in the earth, at greater temperature and pressure, thermogenic gas is created from buried organic material.[4][5]
Rank 3 (score: 16.60) (doc_id: 6