# Sage: Kaggle GPU Pipeline

Runs the full Sage data pipeline on Kaggle: loads 1M Amazon Electronics reviews, filters and chunks them, generates embeddings on a GPU, and uploads the embeddings to Qdrant Cloud.

**Setup:**
1. Enable GPU (Settings -> Accelerator -> GPU T4 x2)
2. Add secrets: `QDRANT_URL`, `QDRANT_API_KEY`
3. Run all cells

## Environment Setup

In [1]:
import os
import sys
import time
from pathlib import Path

IS_KAGGLE = "KAGGLE_KERNEL_RUN_TYPE" in os.environ

if IS_KAGGLE:
    # Add sage package from Kaggle dataset
    sys.path.insert(0, "/kaggle/input/sage-package")

    # Override data directory (Kaggle input is read-only)
    os.environ["SAGE_DATA_DIR"] = "/kaggle/working/data"

    import subprocess

    packages = ["qdrant-client>=1.7.0", "sentence-transformers>=2.2.0"]
    for pkg in packages:
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "-q", pkg],
            stdout=subprocess.DEVNULL,
        )
    print("Packages installed")

    from kaggle_secrets import UserSecretsClient

    secrets = UserSecretsClient()
    os.environ["QDRANT_URL"] = secrets.get_secret("QDRANT_URL")
    os.environ["QDRANT_API_KEY"] = secrets.get_secret("QDRANT_API_KEY")
    print("Secrets loaded")
else:
    from dotenv import load_dotenv

    load_dotenv()
    print("Using local .env")

print(f"QDRANT_URL: {os.environ.get('QDRANT_URL', 'NOT SET')[:40]}...")

Packages installed
Secrets loaded
QDRANT_URL: https://2e48f44e-d660-42d6-b0ca-00be9317...
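
The cell above prints `NOT SET` when `QDRANT_URL` is missing but keeps running, so a misconfigured run would only fail much later, at upload time. A minimal sketch of a fail-fast check, assuming a hypothetical helper `require_env` (it is not part of `sage`):

```python
import os


def require_env(names, env=None):
    """Return {name: value} for each required variable.

    Hypothetical helper: raises RuntimeError if any variable is
    missing or empty, instead of continuing with a placeholder.
    """
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in names}


# Possible usage right after the secrets are loaded:
# require_env(["QDRANT_URL", "QDRANT_API_KEY"])
```

Passing `env` explicitly makes the check easy to exercise without touching the real environment.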


## Check GPU

In [2]:
import torch

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_mem = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU: {gpu_name} ({gpu_mem:.1f} GB)")
else:
    print("WARNING: No GPU detected, embeddings will be slow")

GPU: Tesla T4 (15.6 GB)


## Load and Filter Data

In [3]:
from sage.data import prepare_data, get_review_stats

SUBSET_SIZE = 1_000_000 if IS_KAGGLE else 100_000

print(f"Loading {SUBSET_SIZE:,} reviews...")
start = time.time()
df = prepare_data(subset_size=SUBSET_SIZE, force=True)
print(f"Prepared {len(df):,} reviews in {time.time() - start:.1f}s")

stats = get_review_stats(df)
print(f"  Users: {stats['unique_users']:,}")
print(f"  Items: {stats['unique_items']:,}")
print(f"  Sparsity: {stats['sparsity']:.4f}")

11:56:35 INFO     NumExpr defaulting to 4 threads.
Loading 1,000,000 reviews...
11:56:35 INFO     Preparing data from scratch...
11:56:35 INFO     Streaming from https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023/resolve/main/raw/review_categories/Electronics.jsonl


Loading reviews: 100%|██████████| 1000000/1000000 [00:22<00:00, 43896.55it/s]

11:56:58 INFO     Loaded 1,000,000 reviews
11:57:04 INFO     Cached to /kaggle/working/data/reviews_1000000.parquet
11:57:04 INFO     Cleaning data quality issues...
11:57:07 INFO     Cleaned: removed 34,099 reviews (3.4%)
11:57:07 INFO     Remaining: 965,901 reviews
11:57:07 INFO     Applying 5-core filtering...
11:57:11 INFO     Final prepared dataset: 334,282 reviews
11:57:12 INFO     Cached prepared data to: /kaggle/working/data/reviews_prepared_1000000.parquet
Prepared 334,282 reviews in 37.0s
  Users: 31,455
  Items: 21,827
  Sparsity: 0.9995
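
The log above mentions "5-core filtering" and a sparsity of 0.9995. The implementation inside `prepare_data` is not shown in this notebook, so the following is only a sketch under the usual definitions: k-core filtering repeatedly drops reviews until every remaining user and every remaining item has at least k reviews, and sparsity is the fraction of empty cells in the user-item matrix. The function names are illustrative.

```python
from collections import Counter


def k_core_filter(pairs, k=5):
    """Keep only (user, item) pairs whose user and item each appear >= k times.

    Removing a pair can push another user or item below k, so the
    filter is repeated until nothing more is removed.
    """
    pairs = list(pairs)
    while True:
        user_counts = Counter(user for user, _ in pairs)
        item_counts = Counter(item for _, item in pairs)
        kept = [
            (user, item)
            for user, item in pairs
            if user_counts[user] >= k and item_counts[item] >= k
        ]
        if len(kept) == len(pairs):
            return kept
        pairs = kept


def sparsity(n_reviews, n_users, n_items):
    """Fraction of the user x item matrix that has no review."""
    return 1.0 - n_reviews / (n_users * n_items)


# Plugging in the counts reported above (334,282 reviews, 31,455 users,
# 21,827 items) gives a value that rounds to the logged 0.9995.
print(f"{sparsity(334_282, 31_455, 21_827):.4f}")
```

The iteration explains why the dataset shrinks so sharply (965,901 to 334,282 reviews): each pass can invalidate users or items that passed the previous one.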


## Chunk Reviews

In [4]:
from sage.adapters.embeddings import get_embedder
from sage.core.chunking import chunk_reviews_batch

# Prepare reviews for chunking: assign a sequential review_id and a
# product_id (parent_asin, falling back to asin)
reviews = df.to_dict("records")
for i, review in enumerate(reviews):
    review["review_id"] = f"review_{i}"
    review["product_id"] = review.get("parent_asin", review.get("asin", ""))

print("Loading E5-small embedding model...")
embedder = get_embedder()

print(f"Chunking {len(reviews):,} reviews...")
start = time.time()
chunks = chunk_reviews_batch(reviews, embedder=embedder)
print(f"Created {len(chunks):,} chunks in {time.time() - start:.1f}s")
print(f"Expansion ratio: {len(chunks) / len(reviews):.2f}x")

Loading E5-small embedding model...

11:57:40 INFO     TensorFlow version 2.19.0 available.
11:57:40 INFO     JAX version 0.7.2 available.
11:57:44 INFO     Loading embedding model: intfloat/e5-small-v2

Chunking 334,282 reviews...
Created 423,165 chunks in 627.0s
Expansion ratio: 1.27x
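
The model loaded above, `intfloat/e5-small-v2`, is documented as expecting a `query: ` or `passage: ` prefix on every input text. The `embed_passages` and `embed_single_query` methods used later in this notebook presumably apply these prefixes internally, but that is not shown here. A minimal sketch of the convention, with hypothetical helper names:

```python
def as_passages(texts):
    """Prefix documents the way E5 models expect for indexed text."""
    return [f"passage: {text}" for text in texts]


def as_query(text):
    """Prefix a search query the way E5 models expect."""
    return f"query: {text}"


print(as_passages(["Great battery life."]))
print(as_query("wireless headphones with noise cancellation"))
```

If the prefixes are already added inside the embedder, adding them again in calling code would degrade results, so it is worth checking before using helpers like these.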


## Generate Embeddings (GPU)

In [5]:
import numpy as np

chunk_texts = [c.text for c in chunks]

cache_dir = Path("/kaggle/working") if IS_KAGGLE else Path("data")
cache_dir.mkdir(parents=True, exist_ok=True)
cache_path = cache_dir / f"embeddings_{len(chunks)}.npy"

print(f"Embedding {len(chunks):,} chunks...")
start = time.time()
embeddings = embedder.embed_passages(
    chunk_texts,
    cache_path=cache_path,
    force=True,
    batch_size=64,
)
embed_time = time.time() - start

print(f"Embeddings: {embeddings.shape} in {embed_time:.1f}s")
print(f"Throughput: {len(chunks) / embed_time:.0f} chunks/sec")

# Validate
assert embeddings.shape[1] == 384, f"Wrong dims: {embeddings.shape[1]}"
assert not np.isnan(embeddings).any(), "NaN values"
norms = np.linalg.norm(embeddings, axis=1)
assert np.allclose(norms, 1.0, atol=0.01), "Not normalized"
print("Validation: PASSED")

Embedding 423,165 chunks...


Batches:   0%|          | 0/6612 [00:00<?, ?it/s]

12:20:06 INFO     Embeddings cached to: /kaggle/working/embeddings_423165.npy
Embeddings: (423165, 384) in 711.2s
Throughput: 595 chunks/sec
Validation: PASSED
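
The progress bar and timings above follow from simple arithmetic: 423,165 chunks at `batch_size=64` give 6,612 batches, and 423,165 chunks in 711.2 s is about 595 chunks/sec. A small sketch that reproduces these numbers (the helper names are illustrative, not part of `sage`):

```python
import math


def batch_count(n_items, batch_size):
    """Number of batches needed, counting a final partial batch."""
    return math.ceil(n_items / batch_size)


def throughput(n_items, seconds):
    """Average items processed per second."""
    return n_items / seconds


print(batch_count(423_165, 64))            # embedding batches
print(f"{throughput(423_165, 711.2):.0f}")  # chunks/sec
```

The same calculation is handy for estimating run time before changing `batch_size` or `SUBSET_SIZE`.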


## Upload to Qdrant Cloud

In [6]:
from sage.adapters.vector_store import (
    get_client,
    create_collection,
    upload_chunks,
    get_collection_info,
    create_payload_indexes,
)

qdrant_url = os.environ["QDRANT_URL"]  # raises KeyError if unset, instead of failing on None below
print(f"Uploading to: {qdrant_url[:40]}...")

client = get_client()
create_collection(client)

start = time.time()
upload_chunks(client, chunks, embeddings)
print(f"Upload complete in {time.time() - start:.1f}s")

create_payload_indexes(client)

info = get_collection_info(client)
print("\nCollection info:")
for key, value in info.items():
    print(f"  {key}: {value}")

Uploading to: https://2e48f44e-d660-42d6-b0ca-00be9317...
12:20:08 INFO     Deleting existing collection: sage_reviews
12:20:08 INFO     Creating collection: sage_reviews


Uploading to Qdrant: 100%|██████████| 4232/4232 [06:59<00:00, 10.08it/s]

12:27:25 INFO     Uploaded 423165 points to sage_reviews
Upload complete in 439.0s
12:27:27 INFO     Creating payload indexes...
12:27:38 INFO     Indexes created for: rating, product_id, timestamp

Collection info:
  name: sage_reviews
  points_count: 423165
  status: yellow


## Test Search

In [8]:
from sage.adapters.vector_store import search

query = "wireless headphones with noise cancellation"
query_emb = embedder.embed_single_query(query)
results = search(client, query_emb.tolist(), limit=5)

print(f"Query: '{query}'\n")
for i, r in enumerate(results):
    print(f"{i + 1}. [{r['rating']:.0f}*] {r['text'][:70]}...")

Query: 'wireless headphones with noise cancellation'

1. [5*] These are the best noise cancellation, wireless headphones on the mark...
2. [5*] These seem to be good wireless noise cancelling headphones.  I have be...
3. [4*] Sony Noise Cancelling Headphones WHCH710N: Wireless Bluetooth Over The...
4. [5*] JBL T600BTNC Noise Cancelling, On-Ear, Wireless Bluetooth Headphones....
5. [5*] Best Bluetooth headphones set with noise cancellation. Very comfortabl...
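
Conceptually, the search ranks stored chunk vectors by their similarity to the query vector. Below is a brute-force sketch on toy vectors, assuming cosine similarity (the distance metric configured by `create_collection` is not shown in this notebook). Qdrant itself uses an approximate index rather than a full scan, so this illustrates the ranking only, not the engine.

```python
import math


def normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def top_k(query, vectors, k=5):
    """Return [(index, cosine_similarity)] for the k most similar vectors."""
    q = normalize(query)
    scored = [
        (idx, sum(a * b for a, b in zip(q, normalize(vec))))
        for idx, vec in enumerate(vectors)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for idx, score in top_k([1.0, 0.1], toy_vectors, k=2):
    print(idx, round(score, 3))
```

Because the embeddings in this notebook are validated to have unit norm, cosine similarity on them reduces to a plain dot product.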


In [9]:
client.close()
print(f"\nDone! {info.get('points_count', len(chunks)):,} chunks indexed to Qdrant Cloud")


Done! 423,165 chunks indexed to Qdrant Cloud
