<a href="https://colab.research.google.com/github/sb8vk/ML/blob/master/Accelerating_the_RAG_Retriever.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **The context**
Most RAG discussions focus heavily on the LLM, but the real bottleneck for enterprise applications is often the Retriever (specifically the ingestion, cleaning, and indexing of massive datasets). If a standard CPU-based pipeline takes 12 hours to re-index your proprietary data, your LLM is effectively always 12 hours out of date. With NVIDIA RAPIDS, you can move the entire ETL lifecycle onto the GPU to enable near real-time updates.



In [None]:
# Traditional approach:
embeddings = model.encode(docs)  # GPU → CPU copy
index.add(embeddings)             # CPU → GPU copy
results = index.search(query)     # GPU → CPU copy

In [None]:
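# Zero-copy approach (sketch only; the runnable version is the Vector Search cell in Section 5):
embeddings_gpu = model.encode(docs, convert_to_tensor=True)  # PyTorch tensor, stays on GPU
embeddings_cp = cp.from_dlpack(embeddings_gpu)               # CuPy view of the same memory (DLPack)
index = cagra.build(cagra.IndexParams(), embeddings_cp)      # cuVS builds the index in VRAM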

## **The approach**
We are going to move the "ETL" (Extract, Transform, Load) part of RAG onto the GPU using NVIDIA RAPIDS. By keeping data on the GPU, we avoid the latency penalty of moving data back and forth between system RAM and VRAM.

## **How it works**
We replace the standard CPU tools with a "zero-copy" GPU pipeline: cuDF handles loading, cleaning, and deduplication, and cuVS builds the search index. Data is loaded into VRAM once and stays there from stage to stage, so each library works on the same device memory instead of receiving a fresh copy through system RAM.

# **Hardware Requirement**
Setting up a GPU environment on cloud notebooks can be tricky. Three moving parts must align:
- The NVIDIA Driver (managed by Google)
- The CUDA Runtime
- The Python Version

The install script used below (`pip-install.py` from the `rapidsai-csp-utils` repository) detects your environment configuration at runtime and fetches matching pre-compiled wheels automatically, so you do not have to work through the version matrix by hand.

In [None]:
!nvidia-smi

Fri Jan 30 22:17:28 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   44C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
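The banner line of the output above already exposes two of the three parts that must align. As a minimal sketch, the snippet below pulls the driver and CUDA versions out of that line and adds the running Python version. The helper name `parse_smi_header` is hypothetical, and note that the "CUDA Version" reported by `nvidia-smi` is the highest CUDA version the driver supports, not necessarily the runtime that is installed.

```python
import re
import sys

def parse_smi_header(line: str) -> dict:
    """Extract the driver and CUDA versions from the nvidia-smi banner line."""
    driver = re.search(r"Driver Version:\s*([\d.]+)", line)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", line)
    return {
        "driver": driver.group(1) if driver else None,
        "cuda": cuda.group(1) if cuda else None,
    }

# Banner line copied from the nvidia-smi output shown above.
banner = "| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |"

versions = parse_smi_header(banner)
# The third moving part: the Python version of the current kernel.
versions["python"] = f"{sys.version_info.major}.{sys.version_info.minor}"
print(versions)
```

In a live session the banner would come from running `nvidia-smi` rather than from a pasted string.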
                                                

In [None]:
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/pip-install.py

print("\n" + "="*80)
print("STOP! Please go to 'Runtime > Restart Session' now.")
print("This is required to load the new CUDA libraries correctly.")
print("Then, proceed directly to Cell 2.")
print("="*80)

Cloning into 'rapidsai-csp-utils'...
remote: Enumerating objects: 625, done.
remote: Counting objects: 100% (191/191), done.
remote: Compressing objects: 100% (106/106), done.
fetch-pack: unexpected disconnect while reading sideband packet
^C
python3: can't open file '/content/rapidsai-csp-utils/colab/pip-install.py': [Errno 2] No such file or directory

STOP! Please go to 'Runtime > Restart Session' now.
This is required to load the new CUDA libraries correctly.
Then, proceed directly to Cell 2.


# **2. Restore Dependencies & Import**

Because we installed low-level system libraries in Step 1, we had to restart the Python kernel. Restarting clears all in-memory state, so every import has to be repeated. The cell below also reinstalls sentence-transformers if it is no longer importable, then imports our GPU stack (cudf, cupy, cuvs).
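Before running the imports, it can help to check which packages are actually visible to the restarted kernel. The following is a small sketch using only the standard library; `missing_packages` is a hypothetical helper, and the package list is taken from the imports in the next cell.

```python
import importlib.util

def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Top-level modules that the rest of this notebook imports.
required = ["cudf", "cupy", "cuvs", "sentence_transformers"]

missing = missing_packages(required)
if missing:
    print("Not importable yet:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

`find_spec` only locates a module without importing it, so this check is cheap and does not initialize the GPU.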

In [None]:
import os
import time
import gc

# 1. Restore Sentence-Transformers (often wiped on reset)
try:
    import sentence_transformers
except ImportError:
    print("Installing sentence-transformers...")
    !pip install -q sentence-transformers

# 2. Import the RAPIDS Stack
import cudf
import cupy as cp
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from cuvs.neighbors import cagra

print(f"   Environment Ready!")
print(f"   RAPIDS cuDF Version: {cudf.__version__}")
print(f"   GPU Detected: {cp.cuda.runtime.getDeviceCount()} device(s)")

   Environment Ready!
   RAPIDS cuDF Version: 25.02.01
   GPU Detected: 1 device(s)


# **3. Data Generation: Latency vs. Throughput**
A common misconception is that GPUs are always faster than CPUs for every task. In reality, GPU work carries a fixed startup cost (initializing the CUDA context, launching kernels). For small data (e.g., 100k rows), that overhead can outweigh the compute time, so the CPU may win or come close. For big data (e.g., 5M rows), the GPU's massive parallelism amortizes the startup cost and dominates on throughput. Where the crossover falls depends on the hardware and on whether the CUDA context is already warm.

We generate both "Small" and "Large" datasets to test both sides of this crossover.
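The trade-off can be written as a simple cost model: CPU time grows linearly from zero, while GPU time starts at a fixed overhead and grows more slowly. The sketch below solves for the row count at which the two are equal. All rates and the overhead value are made-up illustrative numbers, not measurements from this notebook.

```python
def cpu_time(rows, cpu_rows_per_s):
    """Time for the CPU to process `rows` at a constant rate, with no startup cost."""
    return rows / cpu_rows_per_s

def gpu_time(rows, gpu_rows_per_s, fixed_overhead_s):
    """Time for the GPU: a fixed startup overhead plus a (faster) linear term."""
    return fixed_overhead_s + rows / gpu_rows_per_s

def crossover_rows(cpu_rows_per_s, gpu_rows_per_s, fixed_overhead_s):
    """Row count where CPU and GPU times are equal; None if the GPU never catches up."""
    if gpu_rows_per_s <= cpu_rows_per_s:
        return None
    return fixed_overhead_s / (1 / cpu_rows_per_s - 1 / gpu_rows_per_s)

# Hypothetical rates (rows per second) and overhead (seconds), for illustration only.
CPU_RATE = 250_000
GPU_RATE = 5_000_000
OVERHEAD = 0.1

n_star = crossover_rows(CPU_RATE, GPU_RATE, OVERHEAD)
print(f"Crossover at roughly {n_star:,.0f} rows")
for rows in (1_000, 5_000_000):
    print(rows, cpu_time(rows, CPU_RATE), gpu_time(rows, GPU_RATE, OVERHEAD))
```

Below the crossover the fixed overhead dominates and the CPU finishes first; above it the difference in throughput takes over.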

In [None]:
# Define sizes
num_rows_small = 100_000
num_rows_big   = 5_000_000

print(f"--- Generating Datasets ---")

# --- 1. Small Dataset (100k rows) ---
print(f"Creating Small Dataset ({num_rows_small:,} rows)...")
# We use a simple dictionary structure for speed
data_small = {
    'id': range(num_rows_small),
    'val': np.random.rand(num_rows_small),
    'text': [f"CONFIDENTIAL: Project Alpha data {i} Contact: user{i}@corp.com" for i in range(num_rows_small)]
}
pd.DataFrame(data_small).to_parquet('small_dataset.parquet')

# --- 2. Large Dataset (5M rows) ---
print(f"Creating Large Dataset ({num_rows_big:,} rows)...")
# This simulates a raw data dump (e.g., from S3)
data_big = {
    'id': range(num_rows_big),
    'val': np.random.rand(num_rows_big),
    'text': [f"CONFIDENTIAL: Project Alpha data {i} Contact: user{i}@corp.com" for i in range(num_rows_big)]
}
pd.DataFrame(data_big).to_parquet('large_dataset.parquet')

# Cleanup memory to be safe on T4
del data_small, data_big
gc.collect()

print("Data generation complete.")

--- Generating Datasets ---
Creating Small Dataset (100,000 rows)...
Creating Large Dataset (5,000,000 rows)...
Data generation complete.


# **4. cuDF & Vectorized Regex: Accelerate ETL at Scale**
In a standard RAG pipeline, copying data back and forth over the PCIe bus is a major bottleneck. Text-cleaning tasks such as PII redaction add to the cost because they often run as slow, single-threaded CPU loops.

In this pipeline:

- Ingest: cuDF reads the Parquet file directly into VRAM.

- Clean: We use vectorized regex (executing on thousands of CUDA cores) to redact PII, here email addresses.

- Dedup: We hash the cleaned text and drop duplicate rows, all in VRAM.
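To make the clean and dedup steps concrete, here is a row-at-a-time sketch of the same logic in plain Python, using the email pattern from the benchmark below. It is not the GPU implementation: the helper names are hypothetical, and SHA-1 from `hashlib` stands in for whatever hash the GPU path uses.

```python
import hashlib
import re

# Same email pattern that the benchmark cell uses for redaction.
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def redact(text: str) -> str:
    """Replace every email address in the text with a placeholder."""
    return EMAIL_PATTERN.sub("<REDACTED>", text)

def dedup_by_hash(texts):
    """Keep the first occurrence of each text, comparing hashes instead of full strings."""
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

sample = [
    "Contact: user1@corp.com",
    "Contact: user2@corp.com",
    "No address here",
]

cleaned = [redact(t) for t in sample]
unique = dedup_by_hash(cleaned)
print(unique)
```

The first two rows differ only in the email address, so they become identical after redaction and one of them is dropped. This is why deduplication runs after cleaning rather than before.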

In [None]:
import time
import pandas as pd
import cudf
import numpy as np

def run_benchmark(file_path, label):
    print(f"\n{'='*40}")
    print(f"BENCHMARK: {label}")
    print(f"{'='*40}")

    # --- 1. CPU (Pandas) Baseline ---
    # We must measure this to prove the "20s" claim is real.
    print(">> Running CPU (Pandas)...")
    start_cpu = time.time()

    # Load
    t0 = time.time()
    pdf = pd.read_parquet(file_path)
    t_load_cpu = time.time() - t0

    # Regex (The Bottleneck)
    t0 = time.time()
    pdf['clean_text'] = pdf['text'].str.replace(
        r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
        '<REDACTED>',
        regex=True
    )
    t_regex_cpu = time.time() - t0

    # Dedup
    t0 = time.time()
    # Pandas doesn't support hash_values() directly; drop_duplicates is the standard equivalent
    pdf = pdf.drop_duplicates(subset=['clean_text'])
    t_dedup_cpu = time.time() - t0

    total_cpu = time.time() - start_cpu
    print(f"   [CPU] Total: {total_cpu:.4f}s | Load: {t_load_cpu:.2f}s | Regex: {t_regex_cpu:.2f}s | Dedup: {t_dedup_cpu:.2f}s")


    # --- 2. GPU (cuDF) Accelerated ---
    print(">> Running GPU (cuDF)...")
    start_gpu = time.time()

    # Load
    t0 = time.time()
    gdf = cudf.read_parquet(file_path)
    t_load_gpu = time.time() - t0

    # Regex (SIMT Execution)
    t0 = time.time()
    gdf['clean_text'] = gdf['text'].str.replace(
        r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
        '<REDACTED>',
        regex=True
    )
    t_regex_gpu = time.time() - t0

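    # Dedup (hash-based). Caveat: distinct strings can share a hash value and would then
    # be dropped as duplicates; drop_duplicates(subset=['clean_text']) avoids that risk.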
    t0 = time.time()
    gdf['hash'] = gdf['clean_text'].hash_values()
    gdf = gdf.drop_duplicates(subset=['hash'])
    t_dedup_gpu = time.time() - t0

    total_gpu = time.time() - start_gpu
    print(f"   [GPU] Total: {total_gpu:.4f}s | Load: {t_load_gpu:.2f}s | Regex: {t_regex_gpu:.2f}s | Dedup: {t_dedup_gpu:.2f}s")

    # --- Summary ---
    speedup = total_cpu / total_gpu
    if total_cpu < total_gpu:
        print(f"\n RESULT: CPU was {total_gpu / total_cpu:.2f}x FASTER (Startup Overhead Dominates)")
    else:
        print(f"\n RESULT: GPU was {speedup:.2f}x FASTER (Parallelism Dominates)")

    return gdf

# --- EXECUTE THE COMPARISON ---
# 1. The Small Data Test (Proving the crossover point)
_ = run_benchmark('small_dataset.parquet', "Small Data (100k Rows)")

# 2. The Big Data Test (Proving the Scale)
final_gdf = run_benchmark('large_dataset.parquet', "Large Data (5M Rows)")


BENCHMARK: Small Data (100k Rows)
>> Running CPU (Pandas)...
   [CPU] Total: 0.2872s | Load: 0.03s | Regex: 0.23s | Dedup: 0.02s
>> Running GPU (cuDF)...
   [GPU] Total: 0.1433s | Load: 0.12s | Regex: 0.02s | Dedup: 0.00s

 RESULT: GPU was 2.00x FASTER (Parallelism Dominates)

BENCHMARK: Large Data (5M Rows)
>> Running CPU (Pandas)...
   [CPU] Total: 20.1099s | Load: 3.14s | Regex: 13.25s | Dedup: 3.72s
>> Running GPU (cuDF)...
   [GPU] Total: 0.8690s | Load: 0.29s | Regex: 0.53s | Dedup: 0.05s

 RESULT: GPU was 23.14x FASTER (Parallelism Dominates)


# **5. Vector Search**
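The final step is indexing. PyTorch hands its embedding tensor to CuPy (the GPU-accelerated NumPy equivalent) through DLPack, so both libraries see the same device memory. cuVS returns its search results as device arrays, and cp.asarray() wraps them as CuPy arrays so they can be read back with .get(). The handoff is therefore: cuDF (DataFrame) → PyTorch (Tensor) → cuVS (Index) → CuPy (Result). The one exception is the raw text, which is copied to the host as a Python list because the tokenizer runs on the CPU.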
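The indexing cell below normalizes the embeddings and then builds the index with the `sqeuclidean` metric. The reason this works as a stand-in for cosine similarity is the identity ||a - b||² = 2 - 2·cos(a, b) for unit vectors. The following sketch checks that identity and the resulting ranking on a few made-up 3-dimensional vectors, in plain Python.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity of two unit vectors (reduces to the dot product)."""
    return sum(x * y for x, y in zip(a, b))

def sq_euclidean(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy vectors, chosen only for illustration.
query = normalize([1.0, 2.0, 3.0])
docs = [normalize(v) for v in ([1.0, 2.0, 2.5], [3.0, -1.0, 0.5], [0.5, 1.0, 1.5])]

# For unit vectors: squared distance == 2 - 2 * cosine similarity.
for d in docs:
    assert math.isclose(sq_euclidean(query, d), 2 - 2 * cosine(query, d), abs_tol=1e-12)

# Smallest distance first versus largest similarity first gives the same order.
by_distance = sorted(range(len(docs)), key=lambda i: sq_euclidean(query, docs[i]))
by_cosine = sorted(range(len(docs)), key=lambda i: -cosine(query, docs[i]))
print(by_distance, by_cosine)
```

The distance values themselves are not cosine similarities, but the nearest-neighbor order is the same, which is all the retriever needs.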

In [None]:
import torch
import cupy as cp
from cupy import from_dlpack
from sentence_transformers import SentenceTransformer

# SAFETY CHECK: Import CAGRA safely
try:
    from cuvs.neighbors import cagra
except ImportError:
    # Fallback for older RAPIDS releases, where CAGRA shipped in pylibraft
    from pylibraft.neighbors import cagra

print(f"\n{'='*40}")
print("PIPELINE: Zero-Copy Vector Indexing")
print(f"{'='*40}")

# 0. Safety Check for Previous Cell
if 'final_gdf' not in locals():
    raise ValueError("🚨 variable 'final_gdf' is missing. Please run the Benchmark Cell (Cell 4) completely first!")

# Setup Model on GPU
model = SentenceTransformer('all-MiniLM-L6-v2', device='cuda')

# Take a subset
subset_size = 50_000
subset_texts = final_gdf['clean_text'].iloc[:subset_size].to_arrow().to_pylist()
print(f"Embedding {subset_size} documents...")

# 1. Generate Embeddings (PyTorch Tensor on VRAM)
# normalize_embeddings=True yields unit vectors, so ranking by squared Euclidean distance
# matches ranking by cosine similarity (||a - b||^2 = 2 - 2*cos(a, b)).
embeddings_torch = model.encode(subset_texts, convert_to_tensor=True, normalize_embeddings=True)

# 2. THE ZERO-COPY HANDOFF (DLPack)
print(">> Handoff: PyTorch Tensor -> CuPy Array (via DLPack)")
embeddings_cupy = from_dlpack(torch.utils.dlpack.to_dlpack(embeddings_torch))

# 3. Build Index (CAGRA)
print(">> Building CAGRA Index...")
# We use sqeuclidean because we normalized the vectors above.
build_params = cagra.IndexParams(metric="sqeuclidean")
index = cagra.build(build_params, embeddings_cupy)

# 4. Search
query = "project alpha confidential"
print(f"\nQuerying: '{query}'")
# Normalize the query too!
query_vec = model.encode([query], convert_to_tensor=True, normalize_embeddings=True)
query_cupy = from_dlpack(torch.utils.dlpack.to_dlpack(query_vec))

search_params = cagra.SearchParams()
distances, neighbors = cagra.search(search_params, index, query_cupy, k=3)

# 5. Extract Results
# Wrap the device_ndarray in cp.asarray to access .get()
final_indices = cp.asarray(neighbors).get().flatten()

print("\n--- Top Matches ---")
for i, idx in enumerate(final_indices):
    if idx < len(subset_texts):
        print(f"[{i+1}] {subset_texts[idx]}")


PIPELINE: Zero-Copy Vector Indexing
Embedding 50000 documents...
>> Handoff: PyTorch Tensor -> CuPy Array (via DLPack)
>> Building CAGRA Index...

Querying: 'project alpha confidential'

--- Top Matches ---
[1] CONFIDENTIAL: Project Alpha data 978 Contact: <REDACTED>
[2] CONFIDENTIAL: Project Alpha data 2025 Contact: <REDACTED>
[3] CONFIDENTIAL: Project Alpha data 2027 Contact: <REDACTED>
