This notebook:
- Draws a random sample of 1,000 questions from the HotpotQA dataset, loads their context paragraphs into the vector DB, and saves the sample as a JSON file for future use.
- Loads each paragraph as it is, with its sentences joined into a single string. No further chunking is done, because HotpotQA contexts already come as short paragraphs.


### Sample HotpotQA

In [5]:
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("hotpotqa/hotpot_qa", "distractor")
train_dataset = dataset['train']

# Select a random subset of 1000 samples
shuffled_dataset = train_dataset.shuffle(seed=42)
random_sample = shuffled_dataset.select(range(1000))

# Save to a local file for reference (to_json writes JSON Lines by default: one record per line)
random_sample.to_json("hotpotqa_1000_samples.json")


Dataset({
    features: ['id', 'question', 'answer', 'type', 'level', 'supporting_facts', 'context'],
    num_rows: 1000
})
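
The saved sample can be read back later without downloading the dataset again. The following is a minimal sketch using only the standard library, assuming the file was written in the default JSON Lines layout (one record per line). `read_jsonl` is a hypothetical helper introduced here for illustration; it is not part of `datasets`.

```python
import json


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Example usage (assumes the file saved above exists in the working directory):
# samples = read_jsonl("hotpotqa_1000_samples.json")
# print(len(samples), samples[0]["question"])
```

Each record should carry the same fields as the `Dataset` shown above (`id`, `question`, `answer`, `type`, `level`, `supporting_facts`, `context`).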


In [None]:
# Prepare documents to load
documents_to_load = []
seen_titles = set()

for row in random_sample:
    titles = row['context']['title']
    paragraphs = row['context']['sentences']  # one list of sentences per title

    for title, sentences in zip(titles, paragraphs):
        # Only add if we haven't seen this specific Wikipedia page yet
        if title not in seen_titles:
            # Join the list of sentences into one single string (paragraph)
            full_text = " ".join(sentences)
            documents_to_load.append(full_text)
            seen_titles.add(title)
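
To make the deduplication step easier to follow, here is a small sketch that runs the same logic on two hand-written rows. The toy rows only mimic the `context` layout used above (parallel `title` and `sentences` lists) and are not real HotpotQA data; `collect_unique_paragraphs` is a hypothetical helper name.

```python
def collect_unique_paragraphs(rows):
    """Return one joined paragraph per distinct title, in first-seen order."""
    documents = []
    seen_titles = set()
    for row in rows:
        context = row["context"]
        for title, sentences in zip(context["title"], context["sentences"]):
            if title not in seen_titles:
                seen_titles.add(title)
                documents.append(" ".join(sentences))
    return documents


# Two toy rows that share the title "B"
toy_rows = [
    {"context": {"title": ["A", "B"],
                 "sentences": [["A one.", "A two."], ["B one."]]}},
    {"context": {"title": ["B", "C"],
                 "sentences": [["B one."], ["C one."]]}},
]

docs = collect_unique_paragraphs(toy_rows)
print(docs)  # ['A one. A two.', 'B one.', 'C one.']
```

Because many questions share the same Wikipedia pages in their contexts, deduplicating by title keeps the same paragraph from being embedded more than once.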

### Load to Vector DB

In [None]:
import os
import sys

# to import from parent directory
parent_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
if parent_path not in sys.path:
    sys.path.append(parent_path)

# import vector db client
from vector_db.src.client import VectorDBClient

In [35]:
# instantiate vector db client
vector_db_client = VectorDBClient(persist_directory="../vector_db/chroma_db")

# empty vector db
vector_db_client.delete_collection()

In [37]:
# load documents_to_load into the vector db in batches of 1000
for i in range(0, len(documents_to_load), 1000):
    batch = documents_to_load[i:i+1000]
    vector_db_client.add_documents_no_chunking(batch)

Embedding 1000 whole documents...
Storing in DB...
Embedding 1000 whole documents...
Storing in DB...
Embedding 1000 whole documents...
Storing in DB...
Embedding 1000 whole documents...
Storing in DB...
Embedding 1000 whole documents...
Storing in DB...
Embedding 1000 whole documents...
Storing in DB...
Embedding 1000 whole documents...
Storing in DB...
Embedding 1000 whole documents...
Storing in DB...
Embedding 1000 whole documents...
Storing in DB...
Embedding 811 whole documents...
Storing in DB...
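
The log above shows nine full batches followed by one batch of 811, which corresponds to 9 × 1000 + 811 = 9,811 unique paragraphs. The slicing pattern can be checked in isolation with the sketch below; `iter_batches` is a hypothetical helper, and the list of integers only stands in for `documents_to_load`.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in for documents_to_load: 9,811 placeholder items
placeholder_docs = list(range(9811))
batch_sizes = [len(batch) for batch in iter_batches(placeholder_docs, 1000)]
print(batch_sizes)  # nine batches of 1000, then one of 811
```

Slicing past the end of a list is safe in Python, so the final, shorter batch needs no special handling.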
