
# Price & Promotion Discrepancy Finder (BigQuery + Gemini Embeddings)

**Goal:** Demonstrate how to detect **price/promotion mismatches** by **combining image embeddings** (from shelf photos) with **structured pricing data** (CSV → BigQuery).  
This notebook is designed to run in **BigQuery Studio → Notebooks** (or locally with copy‑paste).

### What we're solving at scale
Retailers run thousands of in‑store **price & promo campaigns** every week. Execution gaps happen:
- Wrong sign vs POS price (e.g., *2 for 30 SEK* but POS lists 29.95 each)
- Wrong item/brand under the sign
- Old promo left after expiry
- Category displays with mixed promotions

**We use embeddings to:**  
1) Convert shelf photos into vectors using a **multimodal embedding model** (`multimodalembedding@001` in this notebook).  
2) Convert the pricing table rows into **text embeddings**.  
3) With a **single natural-language query** (e.g., *“gurka 2 for 30”*), retrieve relevant **images and products**.  
4) Parse the sign (e.g., “2 for 30”) → compute **per‑unit sign price** (15 SEK) → **flag discrepancies** against structured prices.
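
Step 4 is plain arithmetic once the sign text is parsed. Here is a minimal Python sketch, assuming signs follow an "N for X" wording. The helper name `per_unit_sign_price` and the regex are illustrative and not part of the notebook:

```python
import re
from typing import Optional

# Hypothetical helper: the "N for X" pattern is an assumption about sign wording.
_BUNDLE = re.compile(r"(\d+)\s*for\s*(\d+(?:\.\d+)?)", re.IGNORECASE)

def per_unit_sign_price(sign_text: str) -> Optional[float]:
    """Return the per-unit price implied by an 'N for X' sign, or None if no match."""
    match = _BUNDLE.search(sign_text)
    if match is None:
        return None
    quantity, bundle_price = int(match.group(1)), float(match.group(2))
    if quantity == 0:
        return None  # avoid dividing by zero on a malformed sign
    return bundle_price / quantity

sign_price = per_unit_sign_price("2 for 30 SEK")  # 30 / 2
pos_price = 29.95                                 # POS price from the example above
delta = round(abs(pos_price - sign_price), 2)     # gap between sign and POS
print(sign_price, delta)
```

A large `delta` is what the later SQL surfaces as a discrepancy. What counts as "large" is a business decision that this sketch leaves open.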

You’ll upload a small set of demo images to GCS and a `pricing_promotions.csv` to BigQuery. The same pattern scales to millions of images and product rows with **VECTOR INDEX** and scheduled refresh.



## Prerequisites & data
- **Project:** `sm-gemini-playground`
- **Region:** `US` (for this demo)
- **GCS images:** `gs://sm-gemini-pg-retail-semantic/retail_shelf/*`
- **CSV:** `pricing_promotions.csv` (columns: `product_name, category, listed_price, promo_text, promo_price, discrepancy_flag`)

> Tip: upload the sample images to GCS and the CSV to BigQuery before running queries below.



## 1) Create dataset & Cloud Resource connection
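Run these in a **Terminal/Cloud Shell** (or convert to `%bash` cells if supported). The service account in the last command belongs to this demo's connection; replace it with the `serviceAccountId` that `bq show --connection` prints for yours.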



# Shell commands (uncomment if running in a bash-enabled cell)
# bq --location=US mk --dataset sm-gemini-playground:retail_shelf
#
# bq mk \
#   --project_id=sm-gemini-playground \
#   --connection \
#   --location=US \
#   --connection_type=CLOUD_RESOURCE \
#   retail_shelf_conn_us
#
# bq show --connection --location=US sm-gemini-playground.US.retail_shelf_conn_us
#
# gcloud storage buckets add-iam-policy-binding gs://sm-gemini-pg-retail-semantic \
#   --member="serviceAccount:bqcx-979008310984-rfg8@gcp-sa-bigquery-condel.iam.gserviceaccount.com" \
#   --role="roles/storage.objectViewer"



## 2) Create an **Object Table** over your images
Copy this **SQL** into a BigQuery SQL cell.



```sql
CREATE OR REPLACE EXTERNAL TABLE `sm-gemini-playground.retail_shelf.shelves`
WITH CONNECTION `sm-gemini-playground.US.retail_shelf_conn_us`
OPTIONS (
  object_metadata = 'SIMPLE',
  uris = ['gs://sm-gemini-pg-retail-semantic/retail_shelf/*']
);
```



## 3) Multimodal embedding model + image embeddings
```sql
CREATE OR REPLACE MODEL `sm-gemini-playground.retail_shelf.multimodal_embedding_model`
REMOTE WITH CONNECTION `sm-gemini-playground.US.retail_shelf_conn_us`
OPTIONS (
  endpoint = 'multimodalembedding@001'
);

CREATE OR REPLACE TABLE `sm-gemini-playground.retail_shelf.shelf_embeddings` AS
SELECT
  *
FROM ML.GENERATE_EMBEDDING(
  MODEL `sm-gemini-playground.retail_shelf.multimodal_embedding_model`,
  TABLE `sm-gemini-playground.retail_shelf.shelves`,
  STRUCT(TRUE AS flatten_json_output)
);
```



## 4) Create a **VECTOR INDEX** for image embeddings
```sql
CREATE OR REPLACE VECTOR INDEX `idx_shelf_image_embeddings`
ON `sm-gemini-playground.retail_shelf.shelf_embeddings` (ml_generate_embedding_result)
OPTIONS (
  index_type = 'IVF',
  distance_type = 'COSINE'
);
```



## 5) Example: image search with natural language
```sql
DECLARE q STRING DEFAULT 'lindt chocolate bars';

WITH query_vec AS (
  SELECT ml_generate_embedding_result AS qvec
  FROM ML.GENERATE_EMBEDDING(
    MODEL `sm-gemini-playground.retail_shelf.multimodal_embedding_model`,
    (SELECT q AS content),
    STRUCT(TRUE AS flatten_json_output)
  )
)
SELECT base.* EXCEPT (ml_generate_embedding_result)
FROM VECTOR_SEARCH(
  TABLE `sm-gemini-playground.retail_shelf.shelf_embeddings`, 'ml_generate_embedding_result',
  TABLE query_vec,
  query_column_to_search => 'qvec',
  top_k => 10,
  distance_type => 'COSINE'
);
```



```sql
-- Second example: a query that mixes a product type with a price hint
DECLARE q STRING DEFAULT 'shelf with bread and price above 20 SEK';

WITH query_vec AS (
  SELECT ml_generate_embedding_result AS qvec
  FROM ML.GENERATE_EMBEDDING(
    MODEL `sm-gemini-playground.retail_shelf.multimodal_embedding_model`,
    (SELECT q AS content),
    STRUCT(TRUE AS flatten_json_output)
  )
)
SELECT base.* EXCEPT (ml_generate_embedding_result)
FROM VECTOR_SEARCH(
  TABLE `sm-gemini-playground.retail_shelf.shelf_embeddings`, 'ml_generate_embedding_result',
  TABLE query_vec,
  query_column_to_search => 'qvec',
  top_k => 10,
  distance_type => 'COSINE'
);
```
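
To make the `COSINE` distance ranking concrete, the sketch below reimplements a brute-force nearest-neighbour search in plain Python. It is an illustration of the idea only, not how BigQuery executes `VECTOR_SEARCH`. The function names, the 3-dimensional toy vectors, and the `gs://example-bucket/...` URIs are all made up:

```python
import math
from typing import Dict, List, Sequence, Tuple

def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance = 1 - cosine similarity; smaller means closer."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query: Sequence[float],
          base: Dict[str, Sequence[float]],
          k: int) -> List[Tuple[str, float]]:
    """Brute force: score every row, then keep the k smallest distances."""
    scored = [(uri, cosine_distance(query, vec)) for uri, vec in base.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Toy vectors standing in for real image embeddings (made-up values).
toy_base = {
    "gs://example-bucket/chocolate.jpg": [0.9, 0.1, 0.0],
    "gs://example-bucket/bread.jpg": [0.0, 1.0, 0.0],
    "gs://example-bucket/cucumber.jpg": [0.0, 0.1, 0.9],
}
hits = top_k([1.0, 0.0, 0.0], toy_base, k=2)
print(hits)
```

The closest hit is the vector pointing in nearly the same direction as the query. Vector length does not matter for cosine distance, only the angle.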



## 6) Load `pricing_promotions.csv` into BigQuery
If you haven't yet, load your CSV into `sm-gemini-playground.retail_shelf.pricing_promotions`  
(schema: `product_name STRING, category STRING, listed_price FLOAT64, promo_text STRING, promo_price FLOAT64, discrepancy_flag STRING`).

Example CLI:
```bash
bq load --replace \
  --source_format=CSV \
  --autodetect \
  sm-gemini-playground:retail_shelf.pricing_promotions \
  gs://YOUR_BUCKET/pricing_promotions.csv
```
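
Before loading, it can help to confirm that the CSV header matches the schema listed above. A minimal sketch using only the standard library follows. The helper name `missing_columns` and the sample row are hypothetical:

```python
import csv
import io
from typing import List

# Column names taken from the schema described in this section.
EXPECTED_COLUMNS = [
    "product_name", "category", "listed_price",
    "promo_text", "promo_price", "discrepancy_flag",
]

def missing_columns(csv_text: str) -> List[str]:
    """Return the expected columns that are absent from the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = [name.strip() for name in next(reader, [])]
    return [col for col in EXPECTED_COLUMNS if col not in header]

# Made-up sample content, only to exercise the check.
sample = (
    "product_name,category,listed_price,promo_text,promo_price,discrepancy_flag\n"
    "Gurka,Produce,17.90,2 for 30,15.00,no\n"
)
print(missing_columns(sample))
```

An empty list means every expected column is present. Anything else names the columns to fix before running `bq load`.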



## 7) Text embeddings for structured pricing data
Use the **same endpoint** (here: `gemini-embedding-001`) both to build the embeddings table and to embed queries; vectors from different models are not comparable.

```sql
CREATE OR REPLACE MODEL `sm-gemini-playground.retail_shelf.text_embedding_model`
REMOTE WITH CONNECTION `sm-gemini-playground.US.retail_shelf_conn_us`
OPTIONS(ENDPOINT = 'gemini-embedding-001');

CREATE OR REPLACE TABLE `sm-gemini-playground.retail_shelf.pricing_promotions_emb` AS
SELECT
  * EXCEPT (content)
FROM ML.GENERATE_EMBEDDING(
  MODEL `sm-gemini-playground.retail_shelf.text_embedding_model`,
  (
    SELECT TRIM(CONCAT(
        IFNULL(product_name, ''), ' ',
        IFNULL(category, ''), ' ',
        IFNULL(promo_text, ''), ' ',
        IFNULL(FORMAT('listed %.2f', listed_price), ''), ' ',
        IFNULL(FORMAT('promo %.2f',  promo_price),  '')
      )) AS content,
      p.*
    FROM `sm-gemini-playground.retail_shelf.pricing_promotions` p
  ),
  STRUCT(TRUE AS flatten_json_output, 'RETRIEVAL_DOCUMENT' AS task_type)
);

-- Optional: keep only well-formed vectors (prevents dimension mismatch)
CREATE OR REPLACE VIEW `sm-gemini-playground.retail_shelf.pricing_promotions_vec` AS
SELECT *
FROM `sm-gemini-playground.retail_shelf.pricing_promotions_emb`
WHERE ARRAY_LENGTH(ml_generate_embedding_result) = 3072;
```
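
The `CONCAT` expression decides what text actually gets embedded for each row. The Python sketch below mirrors it so you can preview that text offline. The function name `build_content` and the sample values are hypothetical:

```python
from typing import Optional

def build_content(product_name: Optional[str],
                  category: Optional[str],
                  promo_text: Optional[str],
                  listed_price: Optional[float],
                  promo_price: Optional[float]) -> str:
    """Mirror the SQL: NULLs become '', parts are joined by spaces, ends are trimmed."""
    parts = [
        product_name or "",
        category or "",
        promo_text or "",
        "" if listed_price is None else f"listed {listed_price:.2f}",
        "" if promo_price is None else f"promo {promo_price:.2f}",
    ]
    return " ".join(parts).strip()

# Made-up row values, for illustration only.
print(build_content("Gurka", "Produce", "2 for 30", 17.9, 15.0))
print(build_content("Gurka", "Produce", None, 17.9, None))
```

As in the SQL, a missing middle field leaves a double space inside the string, because only the leading and trailing whitespace is trimmed.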



## 8) **One‑query demo:** find cucumbers with “2 for 30” and flag discrepancies
A single natural-language query drives both searches (images + products). We parse *2 for 30* → 15 SEK per unit, then compute deltas against the structured prices.

```sql
-- One natural query typed by the auditor
DECLARE q STRING DEFAULT 'fresh cucumbers (gurka) display with promo sign "2 for 30 SEK" in produce';

-- Parse "N for X" and compute per‑unit price
DECLARE n INT64        DEFAULT CAST(IFNULL(REGEXP_EXTRACT(q, r'(\d+)\s*for'), '1') AS INT64);
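-- Accept any bundle price, including decimals (e.g. "3 for 9.50"); NULL if nothing matches
DECLARE bundle FLOAT64 DEFAULT SAFE_CAST(REGEXP_EXTRACT(q, r'for\s*(\d+(?:\.\d+)?)') AS FLOAT64);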
DECLARE sign_each_price FLOAT64 DEFAULT SAFE_DIVIDE(bundle, n);

WITH
img_q AS (
  SELECT ml_generate_embedding_result AS qvec
  FROM ML.GENERATE_EMBEDDING(
    MODEL `sm-gemini-playground.retail_shelf.multimodal_embedding_model`,
    (SELECT q AS content),
    STRUCT(TRUE AS flatten_json_output)
  )
),
txt_q AS (
  SELECT ml_generate_embedding_result AS qvec
  FROM ML.GENERATE_EMBEDDING(
    MODEL `sm-gemini-playground.retail_shelf.text_embedding_model`,
    (SELECT q AS content),
    STRUCT(TRUE AS flatten_json_output, 'RETRIEVAL_QUERY' AS task_type)
  )
),
img_hits AS (
  -- VECTOR_SEARCH returns base-table columns inside the `base` struct
  SELECT base.uri AS uri, distance AS img_distance
  FROM VECTOR_SEARCH(
    TABLE `sm-gemini-playground.retail_shelf.shelf_embeddings`,
    'ml_generate_embedding_result',
    TABLE img_q,
    query_column_to_search => 'qvec',
    top_k => 10,
    distance_type => 'COSINE'
  )
),
prod_hits AS (
  SELECT
    base.product_name, base.category, base.listed_price,
    base.promo_text, base.promo_price, base.discrepancy_flag,
    distance AS txt_distance
  FROM VECTOR_SEARCH(
    TABLE `sm-gemini-playground.retail_shelf.pricing_promotions_vec`,  -- view filters out bad vectors
    'ml_generate_embedding_result',
    TABLE txt_q,
    query_column_to_search => 'qvec',
    top_k => 5,
    distance_type => 'COSINE'
  )
)

SELECT
  -- ORDER BY and LIMIT go inside STRING_AGG so the three closest images are kept
  (SELECT STRING_AGG(uri, '\n' ORDER BY img_distance LIMIT 3) FROM img_hits) AS top_image_uris,
  q AS auditor_query,
  sign_each_price,
  product_name,
  category,
  listed_price,
  promo_text,
  promo_price,
  discrepancy_flag,
  ROUND(ABS(listed_price - sign_each_price), 2) AS delta_vs_sign_listed,
  ROUND(ABS(promo_price  - sign_each_price), 2) AS delta_vs_sign_promo,
  txt_distance
FROM prod_hits
ORDER BY delta_vs_sign_listed DESC, txt_distance
LIMIT 5;
```
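
The ranking in the final `SELECT` can be checked offline on a handful of rows. The sketch below reproduces the two delta columns and the sort order in Python. It assumes `listed_price` is always present, and the function name `rank_by_delta` and the sample rows are made up:

```python
from typing import Dict, List

def rank_by_delta(rows: List[Dict], sign_each_price: float) -> List[Dict]:
    """Add deltas vs the per-unit sign price, then sort like the final SELECT:
    largest listed-price delta first, ties broken by smallest text distance."""
    ranked = []
    for row in rows:
        promo = row.get("promo_price")
        ranked.append({
            **row,
            "delta_vs_sign_listed": round(abs(row["listed_price"] - sign_each_price), 2),
            # A missing promo price yields None, as NULL arithmetic does in SQL.
            "delta_vs_sign_promo": None if promo is None
            else round(abs(promo - sign_each_price), 2),
        })
    return sorted(ranked, key=lambda r: (-r["delta_vs_sign_listed"], r["txt_distance"]))

# Hypothetical candidate rows, standing in for prod_hits.
candidates = [
    {"product_name": "Gurka eko", "listed_price": 15.0, "promo_price": None, "txt_distance": 0.25},
    {"product_name": "Gurka", "listed_price": 29.95, "promo_price": 15.0, "txt_distance": 0.21},
]
for row in rank_by_delta(candidates, sign_each_price=15.0):
    print(row["product_name"], row["delta_vs_sign_listed"], row["delta_vs_sign_promo"])
```

Sorting by the listed-price delta puts the rows most at odds with the sign at the top, which is where an auditor would look first.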



## 9) Troubleshooting: vector dimensions
```sql
-- DECLARE must come before any other statement in a BigQuery script
DECLARE q STRING DEFAULT 'test';

-- Stored table dimensions
SELECT ARRAY_LENGTH(ml_generate_embedding_result) AS stored_dim, COUNT(*) AS c
FROM `sm-gemini-playground.retail_shelf.pricing_promotions_emb`
GROUP BY stored_dim;

-- Query vector dimension
WITH txt_q AS (
  SELECT ml_generate_embedding_result AS qvec
  FROM ML.GENERATE_EMBEDDING(
    MODEL `sm-gemini-playground.retail_shelf.text_embedding_model`,
    (SELECT q AS content),
    STRUCT(TRUE AS flatten_json_output, 'RETRIEVAL_QUERY' AS task_type)
  )
)
SELECT ARRAY_LENGTH(qvec) AS query_dim FROM txt_q;
```
If any row shows `stored_dim = 0` (or any value other than `query_dim`), rebuild the embeddings table or query through the filtered **view** from section 7.
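
The same check is easy to run on vectors pulled into Python, for example after exporting a sample of the table. This is a small sketch with hypothetical helper names and toy 3-dimensional vectors:

```python
from collections import Counter
from typing import Dict, List, Sequence

def dimension_counts(vectors: Sequence[Sequence[float]]) -> Dict[int, int]:
    """Count vectors per length, like the GROUP BY stored_dim query."""
    return dict(Counter(len(v) for v in vectors))

def keep_well_formed(vectors: Sequence[Sequence[float]],
                     expected_dim: int) -> List[Sequence[float]]:
    """Keep only vectors of the expected length, like the filtered view."""
    return [v for v in vectors if len(v) == expected_dim]

# Toy data: one empty vector simulates a failed embedding call.
stored = [[0.1, 0.2, 0.3], [], [0.4, 0.5, 0.6]]
print(dimension_counts(stored))
print(len(keep_well_formed(stored, expected_dim=3)))
```

More than one key in the counts is the signal to rebuild or filter before searching.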



## 10) Wrap‑up
- No manual product tagging needed (no `image_product_map`).  
- A single natural-language sentence powers retrieval from **images** and **structured rows**.  
- The **numeric delta** vs parsed sign price makes discrepancies **auditable**.  
- Scale up with: scheduled embedding refresh, **VECTOR INDEX**, filters per category/brand, and Looker Studio to preview top image hits alongside price deltas.
