# Semantic Hybrid Search on Used Car Listings with BigQuery Vector Search

In this notebook, we explore how to build a semantic search system using BigQuery Vector Search with the optional Vector Index feature, using the `gemini-embedding-001` model for embeddings. We'll use the Used Car Sales Listings dataset from Kaggle, loaded into a BigQuery table.
We will also explore the `task_type` parameter of the Gemini embedding model:

 1. RETRIEVAL_DOCUMENT: Use this when creating embeddings for the text you want to search over (our car listings). It optimizes the vectors to be effectively "found" in a search.
 2. RETRIEVAL_QUERY: Use this when creating an embedding for the user's search query itself. It optimizes the vector for finding matching documents.
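
As a minimal sketch of how the two task types are used later in this notebook, the helper below builds the options `STRUCT` that is passed to `ML.GENERATE_EMBEDDING`. The function name `embedding_options` and the allow-list are illustrative assumptions, not part of any BigQuery API; only the `STRUCT(...)` text itself matches the SQL used in the cells further down.

```python
# Hypothetical helper (not a BigQuery API): builds the options STRUCT
# that this notebook passes as the third argument of ML.GENERATE_EMBEDDING.
ALLOWED_TASK_TYPES = {"RETRIEVAL_DOCUMENT", "RETRIEVAL_QUERY"}


def embedding_options(task_type: str) -> str:
    """Return the STRUCT literal for the given task type."""
    if task_type not in ALLOWED_TASK_TYPES:
        raise ValueError(f"task_type not used in this notebook: {task_type!r}")
    return f"STRUCT(TRUE AS flatten_json_output, '{task_type}' AS task_type)"


# Listings (the searchable corpus) are embedded as documents ...
document_options = embedding_options("RETRIEVAL_DOCUMENT")
# ... while the user's search text is embedded as a query.
query_options = embedding_options("RETRIEVAL_QUERY")

print(document_options)
print(query_options)
```

The point of the split is that the same model is called twice with different options: once in bulk over the listings, and once per search request.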

**What you'll do:**
1. Load the Kaggle `used_car_listings.csv` into BigQuery (CLI cell). The CSV is available in the git repo where this notebook is shared.
2. Build a consolidated search base table (`content` column).
3. Create a BigQuery **connection** to Vertex AI and a **remote model** for embeddings.
4. Generate embeddings with `gemini-embedding-001`.
5. (Optional) Create a Vector Index (IVF) on the embedding column.
6. Run semantic queries using natural language.

**Notes**
- Use a region where BigQuery and Vertex AI are both available (e.g. `europe-north1`, `us-central1` or multiregion like `US` or `EU`).
- Vector index creation requires sufficient data; very small tables (<~5k rows) may not allow an index.

💡 Tip: The simplest way to run this notebook is to upload it directly into BigQuery Studio → Notebooks. That’s where this workflow was developed and tested, so it should run seamlessly there.


## Prerequisites
- Google Cloud project with **BigQuery** and **Vertex AI** APIs enabled.
- You have the CSV locally or in Cloud Storage.
- `gcloud` and `bq` CLIs installed and authenticated (`gcloud auth login`).

In [None]:
import os
PROJECT_ID = os.environ.get('PROJECT_ID', 'your-project-id')
BQ_DATASET = os.environ.get('BQ_DATASET', "used_cars")
BQ_LOCATION = os.environ.get('BQ_LOCATION', 'europe-north1')  # match your dataset region
CONN_ID = os.environ.get('CONN_ID', f'vertex_ai_conn_{BQ_LOCATION}')
TABLE_SRC = 'used_cars_listing'   # raw CSV destination
TABLE_BASE = 'used_cars_search_base'
TABLE_EMB = 'used_cars_search'
REMOTE_MODEL = 'text_embedding_model'
print(PROJECT_ID, BQ_DATASET, BQ_LOCATION, CONN_ID)

%env PROJECT_ID=$PROJECT_ID
%env BQ_DATASET=$BQ_DATASET
%env BQ_LOCATION=$BQ_LOCATION
%env CONN_ID=$CONN_ID
%env TABLE_SRC=$TABLE_SRC
%env TABLE_BASE=$TABLE_BASE
%env TABLE_EMB=$TABLE_EMB
%env REMOTE_MODEL=$REMOTE_MODEL

In [None]:
%%bash
set -euo pipefail
PROJECT_ID=${PROJECT_ID}
BQ_DATASET=${BQ_DATASET}
BQ_LOCATION=${BQ_LOCATION}

bq --location=${BQ_LOCATION} --project_id=${PROJECT_ID} mk -d --default_table_expiration 0 ${BQ_DATASET} || echo "Dataset may already exist"

## 1) Load CSV into BigQuery
If the CSV is in the notebook environment or on your local machine, run the `bq load` command shown in the next cell from a terminal in the directory that contains the file.
Alternatively, upload the file to Cloud Storage first, or load it through the BigQuery web UI.
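
If you prefer to stay in Python, the load can also be sketched with the BigQuery client library. This is a minimal sketch under the assumption that `google-cloud-bigquery` is installed and that the CSV has a single header row; the helper names `full_table_id` and `load_csv_to_bigquery` are illustrative, not part of the original workflow.

```python
def full_table_id(project_id: str, dataset: str, table: str) -> str:
    """Build the fully qualified table ID used by the client library."""
    return f"{project_id}.{dataset}.{table}"


def load_csv_to_bigquery(csv_path, project_id, dataset, table, location):
    """Load a local CSV into BigQuery and return the resulting row count."""
    # Imported lazily so full_table_id() stays usable without the library.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id, location=location)
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # skip the header row
        autodetect=True,      # infer the schema, like --autodetect in bq load
    )
    table_id = full_table_id(project_id, dataset, table)
    with open(csv_path, "rb") as csv_file:
        load_job = client.load_table_from_file(
            csv_file, table_id, job_config=job_config
        )
        load_job.result()  # block until the load job finishes
    return client.get_table(table_id).num_rows
```

A call such as `load_csv_to_bigquery("./used_car_listings.csv", PROJECT_ID, BQ_DATASET, TABLE_SRC, BQ_LOCATION)` would then mirror the `bq load` command, using the variables defined in the first cell.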


In [None]:
%%bash
# Example: load from local file path (run where the CSV is present):
# bq --location=${BQ_LOCATION} --project_id=${PROJECT_ID} load \
#   --autodetect --source_format=CSV --skip_leading_rows=1 \
#   ${BQ_DATASET}.${TABLE_SRC} ./used_car_listings.csv
echo "Use the above command in a terminal with the CSV present."


## 2) Create a consolidated search base table
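We merge key attributes into a single `content` string so that the embedding model sees all relevant fields of a listing together, which should improve the quality of the embeddings. `ML.GENERATE_EMBEDDING` reads its input text from the column named `content`.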
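
To make the concatenation concrete, here is a small Python sketch that mirrors the `ARRAY_TO_STRING([...], ' ', '')` expression in the cell below: fields are joined with a space and `NULL` values become empty strings. The helper `build_content` and the sample row are illustrative assumptions; the row is made up and not taken from the dataset.

```python
FIELD_ORDER = [
    "make", "model", "year", "trim", "body_type", "fuel_type",
    "transmission", "mileage", "price", "condition", "location",
    "seller_type", "features",
]
LABELED_FIELDS = {"mileage", "price"}  # these get a 'mileage ' / 'price ' prefix


def build_content(row: dict) -> str:
    """Mimic the SQL that builds the `content` column for one listing."""
    parts = []
    for field in FIELD_ORDER:
        value = row.get(field)
        if value is None:
            # ARRAY_TO_STRING(..., ' ', '') turns NULL into an empty string,
            # and CONCAT('mileage ', NULL) is itself NULL.
            parts.append("")
        elif field in LABELED_FIELDS:
            parts.append(f"{field} {value}")
        else:
            parts.append(str(value))
    return " ".join(parts)


# Made-up example row; seller_type is missing on purpose.
sample_row = {
    "make": "Toyota", "model": "Corolla", "year": 2019, "trim": "LE",
    "body_type": "Sedan", "fuel_type": "Petrol", "transmission": "Automatic",
    "mileage": 42000, "price": 13500, "condition": "good",
    "location": "Austin, TX, USA", "seller_type": None,
    "features": "Bluetooth, Backup Camera",
}
content = build_content(sample_row)
print(content)
```

Note that a missing field leaves two consecutive spaces in the result, because the separator is still emitted around the empty string. BigQuery's number-to-string formatting may also differ from Python's `str()` for floating-point values, so treat this only as an illustration of the shape of `content`.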

In [None]:
from google.cloud import bigquery
client = bigquery.Client(project=PROJECT_ID, location=BQ_LOCATION)
sql = f'''
CREATE OR REPLACE TABLE `{PROJECT_ID}.{BQ_DATASET}.{TABLE_BASE}` AS
SELECT
  listing_id,
  vin,
  make,
  model,
  year,
  trim,
  body_type,
  fuel_type,
  transmission,
  mileage,
  price,
  condition,
  location,
  seller_type,
  features,
  ARRAY_TO_STRING([
    make,
    model,
    CAST(year AS STRING),
    trim,
    body_type,
    fuel_type,
    transmission,
    CONCAT('mileage ', CAST(mileage AS STRING)),
    CONCAT('price ', CAST(price AS STRING)),
    condition,
    location,
    seller_type,
    features
  ], ' ', '') AS content
FROM `{PROJECT_ID}.{BQ_DATASET}.{TABLE_SRC}`;
'''
job = client.query(sql)
job.result()
print("created:", f"{PROJECT_ID}.{BQ_DATASET}.{TABLE_BASE}")

## 3) Create BigQuery ↔ Vertex AI connection
This is a one-time setup. Replace IDs as needed. If the connection already exists, `bq mk` fails and the cell prints a notice instead of stopping.

In [None]:
%%bash
set -euo pipefail
PROJECT_ID=${PROJECT_ID:-your-project-id}
BQ_LOCATION=${BQ_LOCATION:-europe-north1}
CONN_ID=${CONN_ID:-vertex_ai_conn_europe-north1}

bq mk --connection --location=${BQ_LOCATION} --project_id=${PROJECT_ID} \
  --connection_type=CLOUD_RESOURCE ${CONN_ID} || echo "Connection may already exist"

echo "Service account for the connection:"
bq show --connection ${PROJECT_ID}.${BQ_LOCATION}.${CONN_ID} | sed -n 's/.*serviceAccountId: \(.*\)/\1/p' || true
echo "Grant this SA roles/aiplatform.user with gcloud if not yet granted."
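
The role grant mentioned in the last line of the cell above can be sketched as follows. This is a hedged example: the helper `grant_vertex_ai_user_command` is hypothetical, the service account address is a placeholder that you would replace with the value printed by `bq show --connection`, and the sketch only builds the `gcloud projects add-iam-policy-binding` command rather than running it.

```python
import shlex
from typing import List


def grant_vertex_ai_user_command(project_id: str, service_account: str) -> List[str]:
    """Build the gcloud command that grants roles/aiplatform.user to the connection's SA."""
    return [
        "gcloud", "projects", "add-iam-policy-binding", project_id,
        f"--member=serviceAccount:{service_account}",
        "--role=roles/aiplatform.user",
    ]


# Placeholder values; substitute your project ID and the real service account.
cmd = grant_vertex_ai_user_command(
    "your-project-id",
    "connection-sa@example.iam.gserviceaccount.com",
)
print(shlex.join(cmd))  # copy into a terminal, or pass `cmd` to subprocess.run
```

Building the argument list first makes it easy to review the exact binding before applying it; IAM changes can take a short while to propagate, so the remote model creation in the next step may need a retry.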

## 4) Create a remote embedding model in BigQuery

In [None]:
sql = f'''
CREATE OR REPLACE MODEL `{PROJECT_ID}.{BQ_DATASET}.{REMOTE_MODEL}`
REMOTE WITH CONNECTION `{BQ_LOCATION}.{CONN_ID}`
OPTIONS(ENDPOINT = 'gemini-embedding-001');
'''
job = client.query(sql)
job.result()
print("Remote model created:", f"{PROJECT_ID}.{BQ_DATASET}.{REMOTE_MODEL}")

## 5) Generate embeddings for listings

In [None]:
sql = f'''
CREATE OR REPLACE TABLE `{PROJECT_ID}.{BQ_DATASET}.{TABLE_EMB}` AS
SELECT
  listing_id, make, model, year, price, location, condition, features,
  content,
  ml_generate_embedding_result AS embedding
FROM ML.GENERATE_EMBEDDING(
  MODEL `{PROJECT_ID}.{BQ_DATASET}.{REMOTE_MODEL}`,
  TABLE `{PROJECT_ID}.{BQ_DATASET}.{TABLE_BASE}`,
  STRUCT(TRUE AS flatten_json_output, 'RETRIEVAL_DOCUMENT' AS task_type)
);
'''
job = client.query(sql)
job.result()
print("Embeddings table:", f"{PROJECT_ID}.{BQ_DATASET}.{TABLE_EMB}")

## 6) (Optional) Create a Vector Index (IVF)
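If your table is too small for an index, the `CREATE VECTOR INDEX` statement may fail. The cell below catches the error and prints a message, and the queries in the following sections still work without an index by scanning the embeddings directly.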
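
Instead of relying on the error, you could check the row count first. The sketch below is an assumption-laden illustration: the threshold of 5,000 rows comes from the approximate figure in the notes at the top of this notebook, and the helpers `should_create_vector_index` and `count_rows_sql` are hypothetical names.

```python
# Assumption: taken from the "~5k rows" note at the top of this notebook.
MIN_ROWS_FOR_INDEX = 5000


def should_create_vector_index(row_count: int, min_rows: int = MIN_ROWS_FOR_INDEX) -> bool:
    """Return True when the table looks large enough to attempt an index."""
    return row_count >= min_rows


def count_rows_sql(project_id: str, dataset: str, table: str) -> str:
    """SQL that counts the rows of the embeddings table."""
    return f"SELECT COUNT(*) AS n FROM `{project_id}.{dataset}.{table}`"


# Usage inside the notebook (requires the BigQuery client from earlier cells):
#   rows = client.query(count_rows_sql(PROJECT_ID, BQ_DATASET, TABLE_EMB)).result()
#   n = next(iter(rows)).n
#   if should_create_vector_index(n):
#       ...run the CREATE VECTOR INDEX statement...
print(should_create_vector_index(2000))
print(count_rows_sql("your-project-id", "used_cars", "used_cars_search"))
```

An explicit check keeps the `except` branch for genuinely unexpected failures, such as permission or syntax errors, rather than reporting them all as a small-table skip.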

In [None]:
try:
    sql = f'''
    CREATE OR REPLACE VECTOR INDEX idx_used_cars_embedding
    ON `{PROJECT_ID}.{BQ_DATASET}.{TABLE_EMB}`(embedding)
    STORING (listing_id, make, model, year, price, location, condition)
    OPTIONS(
      index_type = 'IVF',
      distance_type = 'COSINE',
      ivf_options = '{{"num_lists":2000}}'
    );
    '''
    job = client.query(sql)
    job.result()
    print("Vector index created: idx_used_cars_embedding")
except Exception as e:
    print("Skipping index (likely small table):", e)

## 7) Run semantic queries
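
The query in this section ranks listings by cosine distance (`distance_type => 'COSINE'`), which is one minus the cosine similarity of the two vectors, so smaller values mean closer matches. The pure-Python sketch below illustrates the metric on toy two-dimensional vectors; real embeddings have many more dimensions, and the vectors and names here are made up for illustration.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 - (a . b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors standing in for a query embedding and three listing embeddings.
query_vec = [1.0, 0.0]
listings = {
    "same_direction": [2.0, 0.0],  # same direction, different length
    "diagonal": [1.0, 1.0],        # 45 degrees away from the query
    "orthogonal": [0.0, 3.0],      # unrelated direction
}

# Rank listings from closest to farthest, as the vector search does.
ranked = sorted(listings, key=lambda name: cosine_distance(query_vec, listings[name]))
print(ranked)
```

Because only the direction of the vectors matters, a listing whose embedding points the same way as the query has distance 0 regardless of vector length.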

In [19]:
from google.cloud import bigquery
client = bigquery.Client(project=PROJECT_ID, location=BQ_LOCATION)
user_query = 'a sporty german car with good mileage under $20000'
sql = f'''
DECLARE q STRING DEFAULT @q;
WITH query_vec AS (
  SELECT ml_generate_embedding_result AS qvec
  FROM ML.GENERATE_EMBEDDING(
    MODEL `{PROJECT_ID}.{BQ_DATASET}.{REMOTE_MODEL}`,
    (SELECT q AS content),
    STRUCT(TRUE AS flatten_json_output, 'RETRIEVAL_QUERY' AS task_type)
  )
)
SELECT base.* EXCEPT(embedding), distance
FROM VECTOR_SEARCH(
  TABLE `{PROJECT_ID}.{BQ_DATASET}.{TABLE_EMB}`, 'embedding',
  TABLE query_vec, query_column_to_search => 'qvec',
  top_k => 10, distance_type => 'COSINE'
);
'''
job = client.query(sql, job_config=bigquery.QueryJobConfig(query_parameters=[bigquery.ScalarQueryParameter('q', 'STRING', user_query)]))
# Fetch the results as a DataFrame (this blocks until the query job finishes).
search_results_df = job.to_dataframe()

# Display the results.
print(f"Found {len(search_results_df)} matching results.")
print(search_results_df)

Found 10 matching results.
   listing_id           make     model  year    price  \
0         308  Mercedes-Benz   C-Class  2020  20773.0   
1        1161            BMW  3 Series  2013   3131.0   
2        1394            BMW  3 Series  2013   2713.0   
3         394  Mercedes-Benz   E-Class  2017   7905.0   
4         208           Audi        Q5  2020  11476.0   
5        1859            BMW  5 Series  2018  10719.0   
6          26            BMW  3 Series  2016   5229.0   
7        2030            BMW  3 Series  2020  15650.0   
8        1737     Volkswagen    Tiguan  2020  11120.0   
9         717  Mercedes-Benz   C-Class  2008   1289.0   

                       location  condition  \
0  Gräfenhainichen, RP, Germany  excellent   
1           Lübben, TH, Germany       good   
2           Lübeck, TH, Germany       good   
3         Northeim, BE, Germany       good   
4       Rudolstadt, NI, Germany  excellent   
5   Donaueschingen, HH, Germany       good   
6      Oranienburg, BB,

## 7b) Hybrid search: semantic + structured filters
Combine **semantic intent** with classic e-commerce filters (make, price, year, location). In this example we search for the semantic intent "spacious family SUV with good safety features", restricted by structured filters: `min_year = 2024` and a price between 10,000 and 40,000. The filters are applied after the vector search has returned its top 10 matches, so the final result can contain fewer than 10 rows.
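
The effect of filtering after the vector search can be illustrated with a small simulation. This is a sketch on made-up data: the function `hybrid_post_filter` and the candidate listings are hypothetical, and the distances are invented rather than produced by an embedding model.

```python
def hybrid_post_filter(candidates, top_k, min_year=None, min_price=None, max_price=None):
    """Take the top_k closest candidates, then apply structured filters."""
    # Step 1: semantic search keeps only the top_k smallest distances.
    top = sorted(candidates, key=lambda c: c["distance"])[:top_k]

    # Step 2: structured filters; a value of None disables that filter.
    def keep(c):
        return (
            (min_year is None or c["year"] >= min_year)
            and (min_price is None or c["price"] >= min_price)
            and (max_price is None or c["price"] <= max_price)
        )

    return [c for c in top if keep(c)]


# Made-up candidates with invented distances.
candidates = [
    {"listing_id": 1, "distance": 0.10, "year": 2020, "price": 15000},
    {"listing_id": 2, "distance": 0.12, "year": 2025, "price": 33000},
    {"listing_id": 3, "distance": 0.15, "year": 2018, "price": 9000},
    {"listing_id": 4, "distance": 0.20, "year": 2024, "price": 28000},
    {"listing_id": 5, "distance": 0.30, "year": 2024, "price": 55000},
]

narrow = hybrid_post_filter(candidates, top_k=3, min_year=2024, min_price=10000, max_price=40000)
wide = hybrid_post_filter(candidates, top_k=5, min_year=2024, min_price=10000, max_price=40000)
print([c["listing_id"] for c in narrow])
print([c["listing_id"] for c in wide])
```

With `top_k=3`, listing 4 satisfies every filter but never reaches the filter step because it is ranked fourth by distance. Increasing `top_k`, or filtering the base table before the search, are two possible ways to recover such listings, at the cost of more work per query.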


In [None]:
# --- Step 1: Define your search parameters in Python ---
# You can easily change these values before running the cell
from google.cloud import bigquery
import os

PROJECT_ID = os.environ.get('PROJECT_ID', 'your-project-id')
BQ_DATASET = os.environ.get('BQ_DATASET', "used_cars")
BQ_LOCATION = os.environ.get('BQ_LOCATION', 'europe-north1')

client = bigquery.Client(project=PROJECT_ID, location=BQ_LOCATION)

query = 'spacious family SUV with good safety features'
make_filter = None        # Set to a string like 'Toyota' to filter, or None to disable
min_price_filter = 10000  # Set to an integer, or None to disable
max_price_filter = 40000  # Set to an integer, or None to disable
min_year_filter = 2024    # Set to an integer, or None to disable
location_filter = None    # Set to a string like 'California', or None to disable

# Create a dictionary of parameters to pass to BigQuery
params = {
    "q": query,
    "make_filter": make_filter,
    "min_price_filter": min_price_filter,
    "max_price_filter": max_price_filter,
    "min_year_filter": min_year_filter,
    "location_filter": location_filter
}


# --- Step 2: Execute the BigQuery SQL using the 'bigquery' client ---
# The results will be saved to the 'search_results_df' DataFrame.
sql_query = f"""
-- Generate query embedding
WITH query_vec AS (
  SELECT ml_generate_embedding_result AS qvec
  FROM ML.GENERATE_EMBEDDING(
    MODEL `{PROJECT_ID}.{BQ_DATASET}.text_embedding_model`,
    (SELECT @q AS content),
    STRUCT(TRUE AS flatten_json_output, 'RETRIEVAL_QUERY' AS task_type)
  )
),

-- Perform the vector search to get the top 10 semantic matches
semantic_searched_list AS (
  SELECT
    base.listing_id
  FROM VECTOR_SEARCH(
    TABLE `{PROJECT_ID}.{BQ_DATASET}.used_cars_search`, 'embedding',
    TABLE query_vec, query_column_to_search => 'qvec',
    top_k => 10, distance_type => 'COSINE'
  )
),

-- Apply the standard filters to the full table
filtered_cars AS (
  SELECT * EXCEPT (embedding)
  FROM `{PROJECT_ID}.{BQ_DATASET}.used_cars_search`
  WHERE (@make_filter IS NULL OR make = @make_filter)
    AND (@min_price_filter IS NULL OR price >= @min_price_filter)
    AND (@max_price_filter IS NULL OR price <= @max_price_filter)
    AND (@min_year_filter IS NULL OR year >= @min_year_filter)
    AND (@location_filter IS NULL OR location = @location_filter)
)

-- Join the filtered results with the semantic search results
-- This ensures that the final list contains only cars that match BOTH
-- the semantic query AND the structured filters.
SELECT
  filtered_cars.*
FROM filtered_cars
INNER JOIN semantic_searched_list
  ON filtered_cars.listing_id = semantic_searched_list.listing_id;
"""

# Prepare query parameters. A value of None is sent as NULL, which disables that filter.
int_params = {'min_price_filter', 'max_price_filter', 'min_year_filter'}
query_params = [
    bigquery.ScalarQueryParameter(key, 'INT64' if key in int_params else 'STRING', value)
    for key, value in params.items()
]


job_config = bigquery.QueryJobConfig(query_parameters=query_params)
job = client.query(sql_query, job_config=job_config)
search_results_df = job.to_dataframe()  # Get results as DataFrame


# --- Step 3: Display the results ---
# This line runs after the BigQuery job is complete
print(f"Found {len(search_results_df)} matching results.")
print(search_results_df)

Found 1 matching results.
   listing_id   make model  year    price          location condition  \
0         860  Honda  CR-V  2025  33549.0  Avadi, HR, India      good   

                                            features  \
0  Adaptive Cruise Control, Alloy Wheels, Apple C...   

                                             content  
0  Honda CR-V 2025 Trend SUV Petrol Automatic mil...  


## 8) Cleanup (optional)
To avoid charges, delete the dataset when done:
```
bq rm -r -f -d ${PROJECT_ID}:${BQ_DATASET}
```
