# Dense Passage Retrieval for Open-Domain Question Answering
Submitted by: Rishabh Kaushick
<br>Date: March 16, 2025

# 1. Environment Setup

In [1]:
%pip install --quiet torch transformers faiss-cpu

Note: you may need to restart the kernel to use updated packages.


In [2]:
# install the Hugging Face `datasets` library
%pip install --quiet datasets

Note: you may need to restart the kernel to use updated packages.


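Clearing the interactive namespace with `%reset`. Note that `%reset` does not restart the kernel; if the newly installed packages fail to import, restart the kernel from the notebook's Kernel menu.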

In [3]:
%reset

In [4]:
import torch
from datasets import load_dataset, load_from_disk


## CUDA: GPU Availability

In [5]:
 
print(f"PyTorch version: {torch.__version__}")
print(f"Is CUDA available: {torch.cuda.is_available()}")
 
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"cuDNN version: {torch.backends.cudnn.version()}")
    print(f"cuDNN enabled: {torch.backends.cudnn.enabled}")
    print(f"GPU Count: {torch.cuda.device_count()}")
    
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
        print(f"Memory Allocated: {torch.cuda.memory_allocated(i) / 1024**2:.2f} MB")
        print(f"Memory Reserved: {torch.cuda.memory_reserved(i) / 1024**2:.2f} MB")
 
else:
    print("CUDA is not available. Running on CPU.")

PyTorch version: 2.6.0
Is CUDA available: False
CUDA is not available. Running on CPU.


# 2. Dataset Selection and Exploration (15%)

## Loading Dataset (from HuggingFace)

### Download Training Set

In [5]:
dataset = load_dataset("ms_marco", "v2.1", split="train[:15000]")

Generating validation split: 100%|██████████| 101093/101093 [00:00<00:00, 236215.19 examples/s]
Generating train split: 100%|██████████| 808731/808731 [00:02<00:00, 299172.53 examples/s]
Generating test split: 100%|██████████| 101092/101092 [00:00<00:00, 309588.97 examples/s]


### Saving Training Set to Disk

In [None]:
# save the dataset under ./data/train so later runs can load it from disk
dataset.save_to_disk("./data/train/ms_marco_15k")

### Loading Train Set from Disk

In [6]:
# https://huggingface.co/docs/datasets/v3.3.2/en/package_reference/main_classes#datasets.Dataset.save_to_disk

# load the already saved dataset from the disk
dataset = load_from_disk("./data/train/ms_marco_15k")

In [7]:
print(len(dataset))

15000


### Loading & Saving Validation Set

In [8]:
validation_dataset = load_dataset("ms_marco", "v2.1", split="validation[:10000]")

In [9]:
# save validation dataset
validation_dataset.save_to_disk("./data/validation/ms_marco")

Saving the dataset (1/1 shards): 100%|██████████| 10000/10000 [00:00<00:00, 103656.22 examples/s]


In [10]:
validation_dataset = load_from_disk("./data/validation/ms_marco")

In [11]:
len(validation_dataset)

10000

### Loading & Saving Test Set

In [12]:
test_dataset = load_dataset("ms_marco", "v2.1", split="test[:10000]")

In [13]:
# save test dataset to the disk
test_dataset.save_to_disk("./data/test/ms_marco")

Saving the dataset (1/1 shards): 100%|██████████| 10000/10000 [00:00<00:00, 97400.38 examples/s]


In [14]:
test_dataset = load_from_disk("./data/test/ms_marco")
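
The download, save, and reload steps used above for the three splits can be folded into one helper so that a rerun of the notebook only downloads what is missing. The following is a minimal sketch, not part of the original notebook: `load_or_download`, `SPLITS`, and `load_all_splits` are hypothetical names introduced here, and the helper assumes the downloaded object exposes `save_to_disk(path)`, as a Hugging Face `Dataset` does.

```python
from pathlib import Path
from typing import Any, Callable


def load_or_download(path: str, download: Callable[[], Any], load: Callable[[str], Any]) -> Any:
    """Return the copy saved at `path` if it exists; otherwise download, save, and return it."""
    if Path(path).exists():
        # A previous run already saved this split, so skip the download.
        return load(path)
    data = download()
    # Assumption: the object has a save_to_disk(path) method,
    # as Hugging Face `datasets.Dataset` does.
    data.save_to_disk(path)
    return data


# Hypothetical table of the splits used in this notebook: name -> (split slice, save path).
SPLITS = {
    "train": ("train[:15000]", "./data/train/ms_marco_15k"),
    "validation": ("validation[:10000]", "./data/validation/ms_marco"),
    "test": ("test[:10000]", "./data/test/ms_marco"),
}


def load_all_splits() -> dict:
    """Load every split in SPLITS, downloading only the ones not yet on disk."""
    # Imported lazily so the helper above can be used without `datasets` installed.
    from datasets import load_dataset, load_from_disk

    return {
        name: load_or_download(
            path,
            # Bind `split` as a default argument so each lambda keeps its own value.
            lambda split=split: load_dataset("ms_marco", "v2.1", split=split),
            load_from_disk,
        )
        for name, (split, path) in SPLITS.items()
    }
```

Calling `load_all_splits()` would return a dictionary keyed by `"train"`, `"validation"`, and `"test"`. Passing the download and load steps in as callables keeps the caching logic independent of the `datasets` library, which also makes it easy to check with stand-in objects.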