# Phase 2: Embedding Generation

Reads the cleaned Delta table, calls the **OpenAI `text-embedding-3-small`** API in batches from the driver (1536 dims natively, shortened to 1024 here via the `dimensions` parameter), then writes the result to `../delta_lake/embeddings/restaurants`.

**Tip:** Run on a 50K-row sample first (`SAMPLE = True`) to verify the pipeline and check API costs before committing to the full dataset.

**Cost estimate:** `text-embedding-3-small` is $0.02 / 1M tokens. The Zomato dataset (~500K rows, ~30 tokens each) ≈ 15M tokens ≈ **~$0.30 total**.
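
The arithmetic behind this estimate can be reproduced with a short sketch. The helper name `estimate_cost` is hypothetical, and the 30-tokens-per-row average is the assumption quoted above, not a measured value.

```python
# Back-of-the-envelope cost check using the figures quoted in the estimate above.
PRICE_PER_MILLION_TOKENS = 0.02   # USD, text-embedding-3-small
ROWS = 500_000                    # approximate dataset size
TOKENS_PER_ROW = 30               # assumed average tokens per text


def estimate_cost(rows: int, tokens_per_row: int, price_per_million: float) -> float:
    """Return the estimated embedding cost in USD (hypothetical helper)."""
    total_tokens = rows * tokens_per_row
    return total_tokens / 1_000_000 * price_per_million


full_cost = estimate_cost(ROWS, TOKENS_PER_ROW, PRICE_PER_MILLION_TOKENS)
sample_cost = estimate_cost(50_000, TOKENS_PER_ROW, PRICE_PER_MILLION_TOKENS)
print(f"Full dataset: ${full_cost:.2f}")
print(f"50K sample  : ${sample_cost:.2f}")
```

If the real texts are longer than 30 tokens on average, the cost scales linearly with the token count.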

In [21]:
import os
import time
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, FloatType, StringType, StructField, StructType
from openai import OpenAI

In [22]:
# Set True to run on 50K rows only
SAMPLE       = True
SAMPLE_SIZE  = 50_000
BATCH_SIZE   = 500        # Texts per OpenAI API call (max 2048)

DELTA_IN     = "../delta_lake/raw/restaurants"
DELTA_OUT    = "../delta_lake/embeddings/restaurants"
MODEL_NAME   = "text-embedding-3-small"   # 1536 dims natively
VECTOR_DIM   = 1536                       # Native size; embed_batch requests 1024 via `dimensions`

# Set your key here or export OPENAI_API_KEY in your shell before launching Jupyter
# os.environ["OPENAI_API_KEY"] = "sk-..."

## 1. SparkSession

In [23]:
spark = (
    SparkSession.builder
    .appName("ZomatoSemanticSearch-Embeddings")
    .config("spark.jars.packages", "io.delta:delta-spark_2.12:3.2.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .config("spark.driver.memory", "8g")
    .config("spark.executor.memory", "8g")
    .master("local[*]")
    .getOrCreate()
)
spark.sparkContext.setLogLevel("WARN")
print(f"Spark {spark.version} ready")

Spark 3.5.3 ready


## 2. Load Delta table → Pandas

In [24]:
df_spark = spark.read.format("delta").load(DELTA_IN)

if SAMPLE:
    df_spark = df_spark.limit(SAMPLE_SIZE)
    print(f"Running on sample: {SAMPLE_SIZE:,} rows")
else:
    print(f"Full dataset: {df_spark.count():,} rows")

# Pull only the columns we need to the driver
df_pd = df_spark.select("restaurant_id", "text_for_embedding").toPandas()
print(f"Pulled {len(df_pd):,} rows to driver")

Running on sample: 50,000 rows
Pulled 9,542 rows to driver


## 3. Batch embed via OpenAI API

OpenAI accepts up to **2048 texts per request**; we use `BATCH_SIZE=500` to stay well under that limit.
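
To make the batching concrete, here is a minimal sketch of how a list of texts splits into request-sized slices. The helper `batch_ranges` is hypothetical; the row count 9,542 and the batch size 500 are the values used in this notebook run.

```python
import math


def batch_ranges(n_texts: int, batch_size: int) -> list[tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, that cover n_texts items."""
    return [(i, min(i + batch_size, n_texts)) for i in range(0, n_texts, batch_size)]


ranges = batch_ranges(9_542, 500)
print(len(ranges))   # number of API calls: 20
print(ranges[-1])    # last, partial batch: (9500, 9542)

# The number of batches always equals ceil(n_texts / batch_size).
assert len(ranges) == math.ceil(9_542 / 500)
```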

In [25]:
from dotenv import load_dotenv
load_dotenv()

True

In [26]:

client = OpenAI()   # Reads OPENAI_API_KEY from env automatically

def embed_batch(texts: list[str]) -> list[list[float]]:
    """Call OpenAI embeddings API for a batch of texts."""
    response = client.embeddings.create(
        model=MODEL_NAME,
        input=texts,
        dimensions=1024,   # Shorten vectors from the native 1536 dims to 1024
    )
    # Sort by index to guarantee order matches input
    return [item.embedding for item in sorted(response.data, key=lambda x: x.index)]


def embed_all(texts: list[str], batch_size: int = BATCH_SIZE) -> list[list[float]]:
    all_vectors = []
    t0 = time.time()

    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        vectors = embed_batch(batch)
        all_vectors.extend(vectors)

        done = min(i + batch_size, len(texts))
        elapsed = time.time() - t0
        print(f"  {done:>7,} / {len(texts):,}  |  {elapsed:.1f}s  |  {done/elapsed:.0f} texts/s")

    return all_vectors
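
The loop in `embed_all` stops at the first failed request, which loses all progress on a long run. One common mitigation is retrying with exponential backoff. The sketch below is an assumption, not part of the original pipeline: `call_with_retries` is a hypothetical helper, and it catches every `Exception` for brevity, whereas real code would likely narrow this to transient errors such as `openai.RateLimitError`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on failure with delays of base_delay * 1, 2, 4, ..."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))
    # Only reached if max_attempts < 1, so the loop body never ran
    raise RuntimeError("max_attempts must be at least 1")


# Possible use inside embed_all (assumes embed_batch and batch from the cell above):
#     vectors = call_with_retries(lambda: embed_batch(batch))
```

The `sleep` parameter exists only so the delay can be replaced in tests; the default behaves like a normal blocking wait.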

In [27]:
texts = df_pd["text_for_embedding"].fillna("").tolist()

t_start = time.time()
embeddings = embed_all(texts)
total_time = time.time() - t_start

print(f"\nEmbedded {len(embeddings):,} texts in {total_time:.1f}s")
print(f"Vector dimension: {len(embeddings[0])}   (expected {VECTOR_DIM})")  # Prints 1024, not VECTOR_DIM: vectors were reduced because OpenSearch supports only 1024 dims

      500 / 9,542  |  3.1s  |  164 texts/s
    1,000 / 9,542  |  6.4s  |  156 texts/s
    1,500 / 9,542  |  7.9s  |  190 texts/s
    2,000 / 9,542  |  9.3s  |  216 texts/s
    2,500 / 9,542  |  10.5s  |  238 texts/s
    3,000 / 9,542  |  11.4s  |  264 texts/s
    3,500 / 9,542  |  12.8s  |  274 texts/s
    4,000 / 9,542  |  14.2s  |  282 texts/s
    4,500 / 9,542  |  15.5s  |  289 texts/s
    5,000 / 9,542  |  16.8s  |  297 texts/s
    5,500 / 9,542  |  18.3s  |  301 texts/s
    6,000 / 9,542  |  19.6s  |  306 texts/s
    6,500 / 9,542  |  20.8s  |  313 texts/s
    7,000 / 9,542  |  22.2s  |  315 texts/s
    7,500 / 9,542  |  23.6s  |  318 texts/s
    8,000 / 9,542  |  24.9s  |  321 texts/s
    8,500 / 9,542  |  26.4s  |  322 texts/s
    9,000 / 9,542  |  27.2s  |  330 texts/s
    9,500 / 9,542  |  28.6s  |  332 texts/s
    9,542 / 9,542  |  29.3s  |  326 texts/s

Embedded 9,542 texts in 29.3s
Vector dimension: 1024   (expected 1536)
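
The printed dimension is 1024 because `embed_batch` passes `dimensions=1024`, and the print above only inspects the first vector. A stricter check could validate every vector against the intended size. This is a minimal sketch: `TARGET_DIM` and `check_dims` are hypothetical names that do not appear in the cells above.

```python
TARGET_DIM = 1024  # Hypothetical constant; mirrors dimensions=1024 in embed_batch


def check_dims(vectors: list[list[float]], expected_dim: int) -> None:
    """Raise ValueError if any vector does not have expected_dim entries."""
    bad = [i for i, v in enumerate(vectors) if len(v) != expected_dim]
    if bad:
        raise ValueError(
            f"{len(bad)} vectors have the wrong dimension, first at index {bad[0]}"
        )


# Toy data standing in for the real `embeddings` list
check_dims([[0.0] * TARGET_DIM, [0.1] * TARGET_DIM], TARGET_DIM)
```

In the notebook this would be called as `check_dims(embeddings, TARGET_DIM)` before building the Spark DataFrame.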


## 4. Attach vectors back to the full Spark DataFrame

In [28]:
# Build a small mapping DF: restaurant_id → embedding
embed_schema = StructType([
    StructField("restaurant_id", StringType(), False),
    StructField("embedding", ArrayType(FloatType()), False),
])

embed_rows = list(zip(df_pd["restaurant_id"].tolist(), embeddings))
df_embed = spark.createDataFrame(embed_rows, schema=embed_schema)

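# Join back onto the full cleaned table.
# Note: limit() without an ordering is not guaranteed to return the same rows
# as the earlier read; the inner join below keeps only rows that have an embedding.
df_full = spark.read.format("delta").load(DELTA_IN)
if SAMPLE:
    df_full = df_full.limit(SAMPLE_SIZE)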

df_with_embeddings = df_full.join(df_embed, on="restaurant_id", how="inner")
# print(f"Joined rows: {df_with_embeddings.count():,}")
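
An inner join on `restaurant_id` returns more rows than either input if the key is repeated on both sides, so it can be worth checking key uniqueness before joining. The sketch below uses a hypothetical helper, `find_duplicate_ids`, on a plain Python list; it assumes the IDs have already been pulled to the driver, as they are in `df_pd`.

```python
from collections import Counter


def find_duplicate_ids(ids: list[str]) -> dict[str, int]:
    """Return each ID that occurs more than once, mapped to its occurrence count."""
    counts = Counter(ids)
    return {key: n for key, n in counts.items() if n > 1}


# Toy data; in the notebook: find_duplicate_ids(df_pd["restaurant_id"].tolist())
print(find_duplicate_ids(["r1", "r2", "r2", "r3"]))   # {'r2': 2}
```

An empty result means the key is unique and the joined row count should not exceed the number of embedded rows.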

## 5. Write to Delta Lake

In [30]:
(
    df_with_embeddings
    .write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .save(DELTA_OUT)
)

print(f"Saved to {DELTA_OUT}")

26/02/22 22:02:20 WARN TaskSetManager: Stage 20 contains a task of very large size (8609 KiB). The maximum recommended task size is 1000 KiB.

Saved to ../delta_lake/embeddings/restaurants


## 6. Verify

In [31]:
df_check = spark.read.format("delta").load(DELTA_OUT)
sample_row = df_check.select("name", "embedding").first()

print(f"Total rows : {df_check.count():,}")
print(f"Restaurant : {sample_row['name']}")
print(f"Vector dim : {len(sample_row['embedding'])}")
print(f"First 5 dims: {sample_row['embedding'][:5]}")

Total rows : 9,542
Restaurant : Shenanigan's Irish Pub
Vector dim : 1024
First 5 dims: [-0.019657636061310768, -0.005280804820358753, 0.011428858153522015, -0.022064419463276863, -0.009062412194907665]


In [32]:
spark.stop()