
# Price & Promotion Discrepancy Finder (BigQuery + Gemini)

This notebook is ready for **BigQuery Studio → Notebooks** (or Colab/Jupyter).  
It uses **environment variables** for all project/dataset/table names so you can reuse it anywhere.

### What we’re solving (clear use case)
Retailers run thousands of **price & promo** campaigns weekly. In-store execution often drifts from plan:
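- Sign price does not match the POS price (e.g., the sign reads *2 för 30*, Swedish for "2 for 30", but POS lists **29.95**).
- Wrong item or brand placed under the sign.
- Promo sign left up after the promotion expired.
- Several different promos mixed within one display.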
  
We **combine**:
1) **Image embeddings** of shelf photos (Gemini multimodal)  
2) **Text embeddings** of structured product pricing/promotions  
3) A **single natural-language query** (e.g., “gurka 2 för 30”, Swedish for “cucumber 2 for 30”) that retrieves the relevant **images + products**. The sign is then parsed (“2 for 30” → **15 SEK** each) and **discrepancies are flagged** by comparing that per-unit price against the structured prices.
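
The sign-parsing step can be sketched in plain Python before any BigQuery setup. This is a minimal illustration, not the notebook's implementation: the helper names `unit_price_from_sign` and `is_discrepancy` and the 0.01 tolerance are assumptions made for the example.

```python
import re
from typing import Optional

# Hypothetical pattern: matches "N for X" and the Swedish "N för X",
# allowing a decimal point or decimal comma in the bundle price.
_BUNDLE_RE = re.compile(r"(\d+)\s*(?:for|för)\s*(\d+(?:[.,]\d+)?)", re.IGNORECASE)


def unit_price_from_sign(sign_text: str) -> Optional[float]:
    """Return the per-unit price implied by a bundle sign, or None if none is found."""
    match = _BUNDLE_RE.search(sign_text)
    if match is None:
        return None
    quantity = int(match.group(1))
    if quantity == 0:
        return None  # "0 for X" has no meaningful unit price
    bundle_price = float(match.group(2).replace(",", "."))
    return bundle_price / quantity


def is_discrepancy(sign_unit_price: float, structured_price: float,
                   tolerance: float = 0.01) -> bool:
    """Flag a gap larger than the tolerance (the tolerance value is an assumption)."""
    return abs(sign_unit_price - structured_price) > tolerance


unit = unit_price_from_sign("gurka 2 för 30")
print(unit)                          # 15.0
print(is_discrepancy(unit, 14.95))   # True
```

A small tolerance keeps floating-point noise from being reported as a pricing error; pick a value that matches your currency's smallest unit.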



## Prerequisites
- Google Cloud project with **BigQuery** and **Vertex AI** APIs enabled.
- `gcloud` and `bq` CLIs authenticated (`gcloud auth login`).
- A small set of sample shelf images in a GCS path (you’ll set the bucket/prefix below).
- A CSV `pricing_promotions.csv` with columns:  
  `product_name, category, listed_price, promo_text, promo_price, discrepancy_flag`
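
Before loading the CSV, it can help to confirm that the header carries the columns listed above. The check below is a small sketch using only the standard library; `missing_columns` is a hypothetical helper and the sample row is made up for illustration.

```python
import csv
import io

EXPECTED_COLUMNS = [
    "product_name", "category", "listed_price",
    "promo_text", "promo_price", "discrepancy_flag",
]


def missing_columns(csv_file) -> list:
    """Return the expected columns that are absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return [column for column in EXPECTED_COLUMNS if column not in present]


# Made-up sample content; in practice pass open("pricing_promotions.csv", newline="").
sample = io.StringIO(
    "product_name,category,listed_price,promo_text,promo_price,discrepancy_flag\n"
    "Gurka,Produce,15.00,2 for 30,15.00,false\n"
)
print(missing_columns(sample))  # []
```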


## 0) Environment variables (edit if needed)

In [None]:

import os
PROJECT_ID = os.environ.get('PROJECT_ID', 'your-project-id')
BQ_DATASET = os.environ.get('BQ_DATASET', 'retail_shelf')
BQ_LOCATION = os.environ.get('BQ_LOCATION', 'US')  # match your dataset/connection region
CONN_ID = os.environ.get('CONN_ID', f'{BQ_LOCATION}.retail_shelf_conn')
GCS_BUCKET = os.environ.get('GCS_BUCKET', 'sm-gemini-pg-retail-semantic')
GCS_PREFIX = os.environ.get('GCS_PREFIX', 'retail_shelf/*')  # path under the bucket

# Table/model names
OBJ_TABLE = os.environ.get('OBJ_TABLE', 'shelves')
IMG_EMB_TABLE = os.environ.get('IMG_EMB_TABLE', 'shelf_embeddings')
IMG_EMB_INDEX = os.environ.get('IMG_EMB_INDEX', 'idx_shelf_image_embeddings')
TXT_MODEL = os.environ.get('TXT_MODEL', 'text_embedding_model')
MM_MODEL = os.environ.get('MM_MODEL', 'multimodal_embedding_model')
PRICING_TABLE = os.environ.get('PRICING_TABLE', 'pricing_promotions')
PRICING_EMB_TABLE = os.environ.get('PRICING_EMB_TABLE', 'pricing_promotions_emb')
PRICING_VEC_VIEW = os.environ.get('PRICING_VEC_VIEW', 'pricing_promotions_vec')

print(PROJECT_ID, BQ_DATASET, BQ_LOCATION, CONN_ID)


In [None]:

%env PROJECT_ID=$PROJECT_ID
%env BQ_DATASET=$BQ_DATASET
%env BQ_LOCATION=$BQ_LOCATION
%env CONN_ID=$CONN_ID
%env GCS_BUCKET=$GCS_BUCKET
%env GCS_PREFIX=$GCS_PREFIX
%env OBJ_TABLE=$OBJ_TABLE
%env IMG_EMB_TABLE=$IMG_EMB_TABLE
%env IMG_EMB_INDEX=$IMG_EMB_INDEX
%env TXT_MODEL=$TXT_MODEL
%env MM_MODEL=$MM_MODEL
%env PRICING_TABLE=$PRICING_TABLE
%env PRICING_EMB_TABLE=$PRICING_EMB_TABLE
%env PRICING_VEC_VIEW=$PRICING_VEC_VIEW


## 1) Create dataset & Cloud Resource connection

In [None]:

%%bash
set -euo pipefail
bq --location=${BQ_LOCATION} --project_id=${PROJECT_ID} mk -d ${BQ_DATASET} || echo "Dataset may already exist"

# Create (or show) a Cloud Resource connection for object tables / remote models
bq --project_id=${PROJECT_ID} mk --connection --location=${BQ_LOCATION} --connection_type=CLOUD_RESOURCE ${CONN_ID} || true

bq show --connection --location=${BQ_LOCATION} ${PROJECT_ID}.${CONN_ID} || true


### (Once) Grant Storage access to the connection service account

In [None]:

%%bash
# Replace the member below with the service account from the 'bq show --connection' output.
# The member shown is a placeholder; use your own connection service account.
# gcloud storage buckets add-iam-policy-binding gs://${GCS_BUCKET} \
#   --member="serviceAccount:bqcx-XXXX@gcp-sa-bigquery-condel.iam.gserviceaccount.com" \
#   --role="roles/storage.objectViewer" || true
echo "Run the add-iam-policy-binding command above once with your connection service account."


## 2) Create **Object Table** over GCS images

In [None]:

%%bash
set -euo pipefail
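# envsubst (from gettext) fills in the ${VAR} placeholders; the quoted heredoc ('SQL') would otherwise pass them through literally.
envsubst <<'SQL' | bq --project_id=${PROJECT_ID} --location=${BQ_LOCATION} query --use_legacy_sql=false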
CREATE OR REPLACE EXTERNAL TABLE `${PROJECT_ID}.${BQ_DATASET}.${OBJ_TABLE}`
WITH CONNECTION `${PROJECT_ID}.${CONN_ID}`
OPTIONS (
  object_metadata = 'SIMPLE',
  uris = ['gs://${GCS_BUCKET}/${GCS_PREFIX}']
);
SQL
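
The SQL in these cells addresses every table through `${VAR}` placeholders. If you would rather submit the same statements from Python, the placeholders need to be filled in first. A minimal sketch with the standard library follows; `render_sql` is a hypothetical helper, and the values passed in the example are simply the defaults from section 0.

```python
import os
from string import Template
from typing import Mapping, Optional


def render_sql(sql_template: str, variables: Optional[Mapping[str, str]] = None) -> str:
    """Fill ${VAR} placeholders from `variables`, or from the environment if omitted.

    Template.substitute raises KeyError for a missing variable, so a forgotten
    setting fails loudly instead of producing a broken table name.
    """
    values = dict(os.environ) if variables is None else dict(variables)
    return Template(sql_template).substitute(values)


template = "SELECT * FROM `${PROJECT_ID}.${BQ_DATASET}.${OBJ_TABLE}`"
print(render_sql(template, {
    "PROJECT_ID": "your-project-id",
    "BQ_DATASET": "retail_shelf",
    "OBJ_TABLE": "shelves",
}))
# SELECT * FROM `your-project-id.retail_shelf.shelves`
```

The rendered string can then be handed to whichever BigQuery client you use.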


## 3) Multimodal embedding model + generate image embeddings

In [None]:

%%bash
set -euo pipefail
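# envsubst (from gettext) fills in the ${VAR} placeholders; the quoted heredoc ('SQL') would otherwise pass them through literally.
envsubst <<'SQL' | bq --project_id=${PROJECT_ID} --location=${BQ_LOCATION} query --use_legacy_sql=false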
CREATE OR REPLACE MODEL `${PROJECT_ID}.${BQ_DATASET}.${MM_MODEL}`
REMOTE WITH CONNECTION `${PROJECT_ID}.${CONN_ID}`
OPTIONS ( endpoint = 'multimodalembedding@001' );

CREATE OR REPLACE TABLE `${PROJECT_ID}.${BQ_DATASET}.${IMG_EMB_TABLE}` AS
SELECT *
FROM ML.GENERATE_EMBEDDING(
  MODEL `${PROJECT_ID}.${BQ_DATASET}.${MM_MODEL}`,
  TABLE `${PROJECT_ID}.${BQ_DATASET}.${OBJ_TABLE}`,
  STRUCT(TRUE AS flatten_json_output)
);
SQL


## 4) Create vector index for image embeddings

In [None]:

%%bash
set -euo pipefail
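# envsubst (from gettext) fills in the ${VAR} placeholders; the quoted heredoc ('SQL') would otherwise pass them through literally.
envsubst <<'SQL' | bq --project_id=${PROJECT_ID} --location=${BQ_LOCATION} query --use_legacy_sql=false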
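-- The index is created in the base table's dataset, so its name needs no project/dataset qualifier.
-- VECTOR_SEARCH also works without an index (brute-force scan), which is usually enough for a small sample set.
CREATE OR REPLACE VECTOR INDEX `${IMG_EMB_INDEX}`
ON `${PROJECT_ID}.${BQ_DATASET}.${IMG_EMB_TABLE}` (ml_generate_embedding_result)
OPTIONS ( index_type = 'IVF', distance_type = 'COSINE' );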
SQL


## 5) Example image searches (natural language)

In [None]:

%%bash
set -euo pipefail
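# envsubst (from gettext) fills in the ${VAR} placeholders; the quoted heredoc ('SQL') would otherwise pass them through literally.
envsubst <<'SQL' | bq --project_id=${PROJECT_ID} --location=${BQ_LOCATION} query --use_legacy_sql=false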
DECLARE q STRING DEFAULT 'lindt chocolate bars';
WITH query_vec AS (
  SELECT ml_generate_embedding_result AS qvec
  FROM ML.GENERATE_EMBEDDING(
    MODEL `${PROJECT_ID}.${BQ_DATASET}.${MM_MODEL}`,
    (SELECT q AS content),
    STRUCT(TRUE AS flatten_json_output)
  )
)
SELECT base.* EXCEPT (ml_generate_embedding_result)
FROM VECTOR_SEARCH(
  TABLE `${PROJECT_ID}.${BQ_DATASET}.${IMG_EMB_TABLE}`, 'ml_generate_embedding_result',
  TABLE query_vec,
  query_column_to_search => 'qvec',
  top_k => 10, distance_type => 'COSINE'
);
SQL


In [None]:

%%bash
set -euo pipefail
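# envsubst (from gettext) fills in the ${VAR} placeholders; the quoted heredoc ('SQL') would otherwise pass them through literally.
envsubst <<'SQL' | bq --project_id=${PROJECT_ID} --location=${BQ_LOCATION} query --use_legacy_sql=false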
DECLARE q STRING DEFAULT 'shelf with bread and price above 20 SEK';
WITH query_vec AS (
  SELECT ml_generate_embedding_result AS qvec
  FROM ML.GENERATE_EMBEDDING(
    MODEL `${PROJECT_ID}.${BQ_DATASET}.${MM_MODEL}`,
    (SELECT q AS content),
    STRUCT(TRUE AS flatten_json_output)
  )
)
SELECT base.* EXCEPT (ml_generate_embedding_result)
FROM VECTOR_SEARCH(
  TABLE `${PROJECT_ID}.${BQ_DATASET}.${IMG_EMB_TABLE}`, 'ml_generate_embedding_result',
  TABLE query_vec,
  query_column_to_search => 'qvec',
  top_k => 10, distance_type => 'COSINE'
);
SQL
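
To make the `COSINE` ranking concrete, here is a brute-force sketch in plain Python of what a top-k search by cosine distance computes (distance = 1 minus cosine similarity). It is an illustration only, not how BigQuery executes `VECTOR_SEARCH`; the two-dimensional vectors and `gs://example-bucket/...` URIs are made up.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query: Sequence[float], rows: Dict[str, Sequence[float]],
          k: int) -> List[Tuple[str, float]]:
    """Score every row against the query and keep the k smallest distances."""
    scored = [(uri, cosine_distance(query, vector)) for uri, vector in rows.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy embeddings standing in for image vectors.
image_vectors = {
    "gs://example-bucket/a.jpg": [1.0, 0.0],
    "gs://example-bucket/b.jpg": [0.0, 1.0],
    "gs://example-bucket/c.jpg": [1.0, 1.0],
}
print([uri for uri, _ in top_k([1.0, 0.0], image_vectors, k=2)])
# ['gs://example-bucket/a.jpg', 'gs://example-bucket/c.jpg']
```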


## 6) Load `pricing_promotions.csv` into BigQuery

In [None]:

%%bash
cat <<'TXT'
Example CLI (run in a shell where the CSV is present):
bq --location=${BQ_LOCATION} --project_id=${PROJECT_ID} load \
  --autodetect --source_format=CSV --skip_leading_rows=1 \
  ${BQ_DATASET}.${PRICING_TABLE} ./pricing_promotions.csv
TXT


## 7) Text embeddings for pricing table (+ clean view to avoid 0-length vectors)

In [None]:

%%bash
set -euo pipefail
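# envsubst (from gettext) fills in the ${VAR} placeholders; the quoted heredoc ('SQL') would otherwise pass them through literally.
envsubst <<'SQL' | bq --project_id=${PROJECT_ID} --location=${BQ_LOCATION} query --use_legacy_sql=false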
CREATE OR REPLACE MODEL `${PROJECT_ID}.${BQ_DATASET}.${TXT_MODEL}`
REMOTE WITH CONNECTION `${PROJECT_ID}.${CONN_ID}`
OPTIONS ( ENDPOINT = 'gemini-embedding-001' );

CREATE OR REPLACE TABLE `${PROJECT_ID}.${BQ_DATASET}.${PRICING_EMB_TABLE}` AS
SELECT * EXCEPT(content)
FROM ML.GENERATE_EMBEDDING(
  MODEL `${PROJECT_ID}.${BQ_DATASET}.${TXT_MODEL}`,
  (
    SELECT TRIM(CONCAT(
      IFNULL(product_name, ''), ' ',
      IFNULL(category, ''), ' ',
      IFNULL(promo_text, ''), ' ',
      IFNULL(FORMAT('listed %.2f', listed_price), ''), ' ',
      IFNULL(FORMAT('promo %.2f', promo_price), '')
    )) AS content,
    p.*
    FROM `${PROJECT_ID}.${BQ_DATASET}.${PRICING_TABLE}` p
  ),
  STRUCT(TRUE AS flatten_json_output, 'RETRIEVAL_DOCUMENT' AS task_type)
);

-- Filtered view: only rows with valid 3072-d vectors (Gemini embedding)
CREATE OR REPLACE VIEW `${PROJECT_ID}.${BQ_DATASET}.${PRICING_VEC_VIEW}` AS
SELECT *
FROM `${PROJECT_ID}.${BQ_DATASET}.${PRICING_EMB_TABLE}`
WHERE ARRAY_LENGTH(ml_generate_embedding_result) = 3072;
SQL
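
The `content` string that gets embedded for each pricing row is easy to preview locally. The sketch below mirrors the `TRIM(CONCAT(...))` expression above in Python so you can see what the text model receives; `build_content` is a hypothetical helper and the row is made up.

```python
from typing import Any, Mapping


def build_content(row: Mapping[str, Any]) -> str:
    """Join name, category, promo text and formatted prices, skipping NULLs."""

    def text(key: str) -> str:
        value = row.get(key)
        return "" if value is None else str(value)

    def price(label: str, key: str) -> str:
        value = row.get(key)
        return "" if value is None else f"{label} {value:.2f}"

    parts = [
        text("product_name"),
        text("category"),
        text("promo_text"),
        price("listed", "listed_price"),
        price("promo", "promo_price"),
    ]
    # Like SQL TRIM, strip() only removes leading and trailing whitespace.
    return " ".join(parts).strip()


row = {
    "product_name": "Gurka",
    "category": "Produce",
    "promo_text": "2 for 30",
    "listed_price": 15.0,
    "promo_price": None,
}
print(build_content(row))  # Gurka Produce 2 for 30 listed 15.00
```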


## 8) One‑query discrepancy demo (no manual product tagging)

In [None]:

%%bash
set -euo pipefail
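# envsubst (from gettext) fills in the ${VAR} placeholders; the quoted heredoc ('SQL') would otherwise pass them through literally.
envsubst <<'SQL' | bq --project_id=${PROJECT_ID} --location=${BQ_LOCATION} query --use_legacy_sql=false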
-- Single natural query used for BOTH image + product retrieval
DECLARE q STRING DEFAULT 'fresh cucumbers (gurka) display with promo sign "2 for 30 SEK" in produce';

-- Parse "N for X" and compute per‑unit sign price (N defaults to 1; X may have any number of digits and an optional decimal part)
DECLARE n INT64        DEFAULT CAST(IFNULL(REGEXP_EXTRACT(q, r'(\d+)\s*for'), '1') AS INT64);
DECLARE bundle FLOAT64 DEFAULT SAFE_CAST(REGEXP_EXTRACT(q, r'for\s*(\d+(?:\.\d+)?)') AS FLOAT64);
DECLARE sign_each_price FLOAT64 DEFAULT SAFE_DIVIDE(bundle, n);

WITH
img_q AS (
  SELECT ml_generate_embedding_result AS qvec
  FROM ML.GENERATE_EMBEDDING(
    MODEL `${PROJECT_ID}.${BQ_DATASET}.${MM_MODEL}`,
    (SELECT q AS content),
    STRUCT(TRUE AS flatten_json_output)
  )
),
txt_q AS (
  SELECT ml_generate_embedding_result AS qvec
  FROM ML.GENERATE_EMBEDDING(
    MODEL `${PROJECT_ID}.${BQ_DATASET}.${TXT_MODEL}`,
    (SELECT q AS content),
    STRUCT(TRUE AS flatten_json_output, 'RETRIEVAL_QUERY' AS task_type)
  )
),
img_hits AS (
  -- VECTOR_SEARCH returns matched rows in the `base` struct
  SELECT base.uri AS uri, distance AS img_distance
  FROM VECTOR_SEARCH(
    TABLE `${PROJECT_ID}.${BQ_DATASET}.${IMG_EMB_TABLE}`,
    'ml_generate_embedding_result',
    TABLE img_q,
    query_column_to_search => 'qvec',
    top_k => 10,
    distance_type => 'COSINE'
  )
),
prod_hits AS (
  -- VECTOR_SEARCH returns matched rows in the `base` struct
  SELECT
    base.product_name, base.category, base.listed_price,
    base.promo_text, base.promo_price, base.discrepancy_flag,
    distance AS txt_distance
  FROM VECTOR_SEARCH(
    TABLE `${PROJECT_ID}.${BQ_DATASET}.${PRICING_VEC_VIEW}`,
    'ml_generate_embedding_result',
    TABLE txt_q,
    query_column_to_search => 'qvec',
    top_k => 5,
    distance_type => 'COSINE'
  )
)
SELECT
  -- ORDER BY / LIMIT go inside STRING_AGG so only the 3 closest images are concatenated
  (SELECT STRING_AGG(uri, '\n' ORDER BY img_distance LIMIT 3) FROM img_hits) AS top_image_uris,
  q AS auditor_query,
  sign_each_price,
  product_name,
  category,
  listed_price,
  promo_text,
  promo_price,
  discrepancy_flag,
  ROUND(ABS(listed_price - sign_each_price), 2) AS delta_vs_sign_listed,
  ROUND(ABS(promo_price  - sign_each_price), 2) AS delta_vs_sign_promo,
  txt_distance
FROM prod_hits
ORDER BY delta_vs_sign_listed DESC, txt_distance
LIMIT 5;
SQL
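
The delta computation and ordering in the final `SELECT` can be checked offline with a few rows. This is a sketch under assumptions: `rank_by_delta` is a hypothetical helper, the product hits are made up, and rows with a missing listed price are treated as a zero delta for sorting purposes.

```python
from typing import Any, Dict, List


def rank_by_delta(hits: List[Dict[str, Any]], sign_each_price: float) -> List[Dict[str, Any]]:
    """Add absolute price deltas to each hit and sort as the query does."""
    ranked = []
    for hit in hits:
        listed = hit.get("listed_price")
        promo = hit.get("promo_price")
        ranked.append({
            **hit,
            "delta_vs_sign_listed": None if listed is None
            else round(abs(listed - sign_each_price), 2),
            "delta_vs_sign_promo": None if promo is None
            else round(abs(promo - sign_each_price), 2),
        })
    # Largest listed-price delta first; ties broken by the closest text match.
    ranked.sort(key=lambda r: (-(r["delta_vs_sign_listed"] or 0.0), r["txt_distance"]))
    return ranked


# Made-up product hits for a sign that works out to 15.00 per unit.
hits = [
    {"product_name": "Gurka", "listed_price": 14.95, "promo_price": 15.0, "txt_distance": 0.21},
    {"product_name": "Tomat", "listed_price": 24.90, "promo_price": None, "txt_distance": 0.35},
]
print([row["product_name"] for row in rank_by_delta(hits, 15.0)])  # ['Tomat', 'Gurka']
```

Because the ordering puts the largest delta first, a loosely related product with a very different price can outrank the intended one; filtering on `txt_distance` before ranking is one possible refinement.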


## 9) Troubleshooting: embedding dimensions

In [None]:

%%bash
set -euo pipefail
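# envsubst (from gettext) fills in the ${VAR} placeholders; the quoted heredoc ('SQL') would otherwise pass them through literally.
envsubst <<'SQL' | bq --project_id=${PROJECT_ID} --location=${BQ_LOCATION} query --use_legacy_sql=false
-- DECLARE statements must come first in a BigQuery script
DECLARE q STRING DEFAULT 'test';

-- Check stored dimensions
SELECT ARRAY_LENGTH(ml_generate_embedding_result) AS stored_dim, COUNT(*) AS c
FROM `${PROJECT_ID}.${BQ_DATASET}.${PRICING_EMB_TABLE}`
GROUP BY stored_dim
ORDER BY stored_dim;

-- Check query dimension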
WITH txt_q AS (
  SELECT ml_generate_embedding_result AS qvec
  FROM ML.GENERATE_EMBEDDING(
    MODEL `${PROJECT_ID}.${BQ_DATASET}.${TXT_MODEL}`,
    (SELECT q AS content),
    STRUCT(TRUE AS flatten_json_output, 'RETRIEVAL_QUERY' AS task_type)
  )
)
SELECT ARRAY_LENGTH(qvec) AS query_dim FROM txt_q;
SQL
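
If you have pulled embeddings into Python (for example as lists of floats), the same dimension check takes a few lines. The helpers `dimension_counts` and `valid_rows` are hypothetical, and the tiny vectors are stand-ins for real 3072-dimensional embeddings.

```python
from collections import Counter
from typing import Dict, List, Sequence


def dimension_counts(embeddings: Sequence[Sequence[float]]) -> Dict[int, int]:
    """Count how many vectors exist per length, like GROUP BY stored_dim."""
    return dict(sorted(Counter(len(vector) for vector in embeddings).items()))


def valid_rows(embeddings: Sequence[Sequence[float]],
               expected_dim: int = 3072) -> List[Sequence[float]]:
    """Keep only vectors with the expected length, like the filtered view."""
    return [vector for vector in embeddings if len(vector) == expected_dim]


# Toy data: two 3-d vectors and one empty vector from a failed embedding call.
sample = [[0.1, 0.2, 0.3], [], [0.4, 0.5, 0.6]]
print(dimension_counts(sample))             # {0: 1, 3: 2}
print(len(valid_rows(sample, expected_dim=3)))  # 2
```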



## 10) Wrap‑up
- No manual tagging (no image→product map).  
- One sentence powers **both** image and structured retrieval.  
- Parse the sign → compute per‑unit price → **numeric deltas** expose discrepancies.  
- Scale: schedule embedding refresh, keep vector indexes, and feed a dashboard that shows top image hits + price deltas for auditor follow‑up.
