# Large-Scale Text Embeddings with Pinecone and Sentence Transformers

This notebook demonstrates how to build a **large-scale text vector index**
using **Pinecone** and **Sentence Transformers**.

The workflow includes:
- Streaming a large text dataset
- Generating embeddings locally
- Creating a Pinecone index
- Upserting vectors in batches

Only the first 10,000 documents are indexed, which keeps the example practical.


## Setup and Authentication

Before interacting with Pinecone, we must:
- Load API credentials from environment variables
- Initialize the Pinecone client

Embeddings are generated locally with a Sentence Transformers model
and then sent to the vector database.


In [2]:
import os

from datasets import load_dataset
from pinecone import Pinecone, ServerlessSpec
from sentence_transformers import SentenceTransformer


In [3]:
# Read the API key from the environment; the serverless setup used here needs no `environment` argument.
pc = Pinecone(api_key=os.environ.get("PINECONE_API_KEY"))
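
If `PINECONE_API_KEY` is not set, `os.environ.get` returns `None` and the client has no key to authenticate with. A minimal sketch of a fallback that prompts for the key instead; the helper name `resolve_api_key` and its parameters `env` and `prompt` are hypothetical, introduced only for this example:

```python
import getpass
import os


def resolve_api_key(env=os.environ, prompt=getpass.getpass):
    """Return the Pinecone API key from the environment, prompting if it is absent.

    `resolve_api_key`, `env`, and `prompt` are hypothetical names for this sketch.
    """
    key = env.get("PINECONE_API_KEY")
    if not key:
        # getpass does not echo the typed key into the notebook output.
        key = prompt("Pinecone API key: ")
    if not key:
        raise ValueError("A Pinecone API key is required.")
    return key
```

The result could then be passed to the client as `Pinecone(api_key=resolve_api_key())`.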

## Loading a Large Text Dataset

The **FineWeb** dataset is loaded using Hugging Face Datasets.

Streaming mode is enabled to:
- Avoid downloading the full dataset
- Process items incrementally
- Scale to very large corpora


In [4]:
fw = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)

In [5]:
fw

IterableDataset({
    features: ['text', 'id', 'dump', 'url', 'date', 'file_path', 'language', 'language_score', 'token_count'],
    num_shards: 15
})
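
A streamed dataset is consumed by iterating over it, so records are pulled one at a time instead of being loaded up front. A minimal sketch of peeking at the first few records with `itertools.islice`; `fake_stream` is a hypothetical stand-in for `fw` so the block runs without any download, and `peek` is a name chosen for this example:

```python
from itertools import islice


def fake_stream():
    """Hypothetical stand-in for the streamed `fw` dataset: yields dict records lazily."""
    n = 0
    while True:  # endless on purpose, like a corpus too large to materialize
        yield {"id": f"doc-{n}", "text": f"sample text {n}", "language": "en"}
        n += 1


def peek(stream, n=3):
    """Return the first `n` records, reading no more of the stream than needed."""
    return list(islice(stream, n))


first = peek(fake_stream(), n=3)
print([record["id"] for record in first])
```

With the real dataset, `peek(fw, 3)` should behave the same way; `datasets` also offers `IterableDataset.take(n)` for this purpose.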

## Inspecting Dataset Features

Understanding the dataset schema helps identify which fields
should be embedded and which should be stored as metadata.


In [6]:
fw.features

{'text': Value('string'),
 'id': Value('string'),
 'dump': Value('string'),
 'url': Value('string'),
 'date': Value('string'),
 'file_path': Value('string'),
 'language': Value('string'),
 'language_score': Value('float64'),
 'token_count': Value('int64')}
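
Given this schema, one plausible split is to embed `text`, use `id` as the vector ID, and keep a few small scalar fields as metadata. The sketch below assumes that choice: the helper `to_record` and the `METADATA_FIELDS` tuple are hypothetical, the sample row uses made-up values, and the notebook itself stores only `language`.

```python
# Hypothetical selection of metadata fields; the notebook stores only `language`.
METADATA_FIELDS = ("language", "language_score", "token_count", "url")


def to_record(item, metadata_fields=METADATA_FIELDS):
    """Split a FineWeb row into (vector_id, text_to_embed, metadata)."""
    vector_id = str(item["id"])
    text = item["text"]
    # Skip missing values so the metadata holds only concrete strings and numbers.
    metadata = {k: item[k] for k in metadata_fields if item.get(k) is not None}
    return vector_id, text, metadata


# Illustrative row with made-up values that follows the schema shown above.
row = {
    "text": "Hello world",
    "id": "abc-123",
    "dump": "CC-MAIN-2013-20",
    "url": "https://example.com",
    "date": "2013-05-18T05:48:54Z",
    "file_path": None,
    "language": "en",
    "language_score": 0.93,
    "token_count": 2,
}
print(to_record(row))
```

Keeping metadata small matters at scale, since every stored field is repeated for each vector in the index.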

## Embedding Model Selection

A Sentence Transformer model is used to convert text into vectors.

`all-MiniLM-L6-v2` is:
- General-purpose
- Fast
- Low-dimensional (384-dimensional embeddings)
- Suitable for large-scale indexing


In [None]:
model = SentenceTransformer("all-MiniLM-L6-v2")

## Pinecone Index Setup

The Pinecone index is created using:
- Dimensionality derived from the embedding model
- Cosine similarity
- Serverless deployment
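
Cosine similarity compares the direction of two vectors and ignores their length, which is why the index and the query vectors must share the same dimension. A minimal pure-Python sketch of the metric, for illustration only (the function name is chosen for this example; Pinecone computes the score server-side):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same dimension.")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine similarity is undefined for zero vectors.")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction, different length
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # orthogonal vectors
```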


In [8]:
pc.list_indexes()

[
    {
        "name": "my-index",
        "metric": "cosine",
        "host": "my-index-jo7crz3.svc.aped-4627-b74a.pinecone.io",
        "spec": {
            "serverless": {
                "region": "us-east-1",
                "cloud": "aws",
                "read_capacity": {
                    "mode": "OnDemand",
                    "status": {
                        "state": "Ready",
                        "current_shards": null,
                        "current_replicas": null
                    }
                }
            }
        },
        "status": {
            "ready": true,
            "state": "Ready"
        },
        "vector_type": "dense",
        "dimension": 3,
        "deletion_protection": "disabled",
        "tags": null
    },
    {
        "name": "text",
        "metric": "cosine",
        "host": "text-jo7crz3.svc.aped-4627-b74a.pinecone.io",
        "spec": {
            "serverless": {
                "region": "us-east-1",
                "clou

In [9]:
if "text" not in pc.list_indexes().names():  # skip creation when the index already exists (avoids a 409 Conflict)
    pc.create_index(name="text", dimension=model.get_sentence_embedding_dimension(), metric="cosine", spec=ServerlessSpec(cloud="aws", region="us-east-1"))


In [10]:
index = pc.Index(name="text")
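
Because the `text` index already existed before this notebook ran, it is worth confirming that its dimension matches the model before upserting. A hedged sketch, assuming `pc.describe_index(name)` returns an object with a `dimension` attribute, as in recent Pinecone Python clients. `check_index_dimension` is a hypothetical helper, and `_FakeClient` with its values is an offline stand-in, not data read from a real account:

```python
from types import SimpleNamespace


def check_index_dimension(client, name, expected_dimension):
    """Raise if the existing index's dimension differs from the embedding size."""
    # Assumption: describe_index returns an object exposing `.dimension`.
    actual = client.describe_index(name).dimension
    if actual != expected_dimension:
        raise ValueError(
            f"Index '{name}' has dimension {actual}, expected {expected_dimension}."
        )
    return actual


class _FakeClient:
    """Offline stand-in for the Pinecone client, for illustration only."""

    def __init__(self, dimensions):
        self._dimensions = dimensions

    def describe_index(self, name):
        return SimpleNamespace(dimension=self._dimensions[name])


# Illustrative values only.
fake = _FakeClient({"text": 384, "my-index": 3})
print(check_index_dimension(fake, "text", 384))
```

In the notebook, the call would presumably be `check_index_dimension(pc, "text", model.get_sentence_embedding_dimension())`.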

## Preparing and Upserting Text Data

To efficiently index large volumes of text, the data is processed
incrementally and upserted into Pinecone in batches.

This step includes:
- Selecting a manageable subset of the dataset
- Generating embeddings for each text sample
- Attaching metadata (e.g., language)
- Upserting vectors in batches to improve throughput
  and reduce network overhead
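
Batches can also be formed lazily, as items arrive, instead of after the whole subset is in memory. A minimal sketch of a chunking generator that works on any iterable, including a stream (the name `batched` is chosen for this example; Python 3.12 ships a similar `itertools.batched`):

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of up to `batch_size` items from any iterable, including streams."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 2,500 items in batches of 1,000 gives two full batches and one partial batch.
print([len(b) for b in batched(range(2500), 1000)])
```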


In [11]:
# Number of streamed items to process
subset_size = 10000  # process only the first 10,000 items

# Iterate over the dataset and prepare data for upserting
vectors_to_upsert = []
for i, item in enumerate(fw):
    if i >= subset_size:
        break

    text = item['text']
    unique_id = str(item['id'])
    language = item['language']

    # Create an embedding for the text
    embedding = model.encode(text, show_progress_bar=False).tolist()

    # Prepare metadata
    metadata = {'language': language}

    # Append the tuple (id, embedding, metadata) to the list
    vectors_to_upsert.append((unique_id, embedding, metadata))

# Upsert data to Pinecone in batches
batch_size = 1000  # Adjust based on your environment and dataset size
for i in range(0, len(vectors_to_upsert), batch_size):
    batch = vectors_to_upsert[i:i + batch_size]
    index.upsert(vectors=batch)

print("Subset of data upserted to Pinecone index.")


Subset of data upserted to Pinecone index.
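
The cell above encodes one text at a time and keeps every embedding in memory until the loop finishes. A possible variant, sketched under assumptions, encodes each batch in a single call and upserts it immediately, so memory use is bounded by the batch size. `embed_and_upsert`, `encode_fn`, and `upsert_fn` are hypothetical names; in the notebook, `encode_fn` would presumably be `model.encode` (which accepts a list of strings) and `upsert_fn` a wrapper such as `lambda batch: index.upsert(vectors=batch)`. The fakes at the bottom stand in for both so the block runs offline.

```python
from itertools import islice


def embed_and_upsert(items, encode_fn, upsert_fn, subset_size, batch_size):
    """Embed and upsert up to `subset_size` items in batches; return the number upserted."""
    stream = islice(items, subset_size)  # stop after the requested subset
    total = 0
    while True:
        chunk = list(islice(stream, batch_size))
        if not chunk:
            break
        # One encode call per batch instead of one per text.
        embeddings = encode_fn([item["text"] for item in chunk])
        vectors = [
            (str(item["id"]), list(map(float, emb)), {"language": item["language"]})
            for item, emb in zip(chunk, embeddings)
        ]
        upsert_fn(vectors)
        total += len(vectors)
    return total


# Offline stand-ins (assumptions for illustration, not the real model or index).
def fake_encode(texts):
    return [[float(len(t)), 0.0, 1.0] for t in texts]


sent_batches = []
docs = ({"id": i, "text": "x" * (i + 1), "language": "en"} for i in range(25))
count = embed_and_upsert(docs, fake_encode, sent_batches.append, subset_size=20, batch_size=8)
print(count, [len(b) for b in sent_batches])
```

Encoding per batch usually makes better use of the model than per-item calls, though the best batch size depends on the hardware and on text length.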


## Summary

This notebook demonstrated how to build a scalable
text vector index using Pinecone:

- Streaming a large dataset with Hugging Face Datasets
- Generating embeddings using Sentence Transformers
- Creating a Pinecone index with matching dimensionality
- Upserting large volumes of data in batches

This setup forms the foundation for
semantic search and RAG pipelines at scale.
