# Text Embeddings and Vector Search

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ayoisio/genai-on-google-cloud/blob/main/chapter-2/colabs/03_embeddings_vector_search.ipynb)

**Estimated Time**: 15 minutes

**Prerequisites**: Google Cloud project with billing enabled, Vertex AI and BigQuery APIs enabled

---

## Overview

Embeddings transform text into numerical vectors that capture semantic meaning. This notebook demonstrates:

1. **Generate embeddings** using Vertex AI text-embedding models
2. **Store embeddings** in BigQuery
3. **Perform semantic search** using VECTOR_SEARCH
4. **Create vector indexes** for efficient retrieval

These are the core building blocks for RAG (Retrieval-Augmented Generation) systems.

## 1. Setup & Authentication

In [None]:
# @title Install Dependencies
!pip install --upgrade google-cloud-aiplatform google-cloud-bigquery -q

In [None]:
# @title Authenticate with Google Cloud
from google.colab import auth
auth.authenticate_user()
print("✓ Authentication successful")

In [None]:
# @title Configure Your Project
PROJECT_ID = "your-project-id"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}
DATASET_ID = "chapter2_demo"  # @param {type:"string"}

# Validate project ID
if PROJECT_ID == "your-project-id":
    raise ValueError("Please set your PROJECT_ID above")

print(f"✓ Project: {PROJECT_ID}")
print(f"✓ Location: {LOCATION}")
print(f"✓ Dataset: {DATASET_ID}")

In [None]:
# @title Initialize Clients
import vertexai
from vertexai.language_models import TextEmbeddingModel
from google.cloud import bigquery

vertexai.init(project=PROJECT_ID, location=LOCATION)
bq_client = bigquery.Client(project=PROJECT_ID)

print("✓ Vertex AI initialized")
print("✓ BigQuery client initialized")

## 2. Understanding Embeddings

Text embeddings convert words and sentences into dense vectors where semantically similar texts are close together in the vector space.

```mermaid
flowchart LR
    A[Text] --> B[Embedding Model]
    B --> C[Vector]
    C --> D[Similarity Search]
    D --> E[Results]
```
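
To make "close together in the vector space" concrete before calling the real model, here is a minimal sketch using hand-made 3-dimensional vectors. The vectors and their labels are hypothetical values chosen for illustration; real embeddings from the model have many more dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy "embeddings": the first two point in a similar direction
toy_vectors = {
    "train an ML model": [0.9, 0.1, 0.0],
    "fit a neural network": [0.8, 0.2, 0.1],
    "cookie recipe": [0.0, 0.1, 0.9],
}

anchor = toy_vectors["train an ML model"]
scores = {text: cosine_similarity(anchor, vec) for text, vec in toy_vectors.items()}

# Print texts from most to least similar to the anchor
for text, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:.3f}  {text}")
```

The two machine-learning phrases score far higher against each other than either does against the recipe, which is the property semantic search relies on.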

In [None]:
# @title Load the text embedding model
# Using text-embedding-005; check the Vertex AI documentation for the currently recommended model
embedding_model = TextEmbeddingModel.from_pretrained("text-embedding-005")

print("✓ Loaded text-embedding-005 model")

In [None]:
# @title Generate embeddings for sample texts
sample_texts = [
    "How to train a machine learning model",
    "Best practices for training ML models",
    "The weather forecast for tomorrow",
    "Recipe for chocolate chip cookies",
    "Deep learning neural network architecture"
]

# Generate embeddings
embeddings = embedding_model.get_embeddings(sample_texts)

print(f"Generated {len(embeddings)} embeddings")
print(f"Embedding dimension: {len(embeddings[0].values)}")
print(f"\nFirst embedding (first 10 values): {embeddings[0].values[:10]}")

In [None]:
# @title Compute similarity between texts
import numpy as np

def cosine_similarity(vec1, vec2):
    """Compute cosine similarity between two vectors."""
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Get embedding vectors
vectors = [np.array(e.values) for e in embeddings]

# Compute similarity matrix
print("Similarity Matrix:")
print("(Higher = more similar)\n")

# Print header
print(f"{'':>40}", end="")
for i in range(len(sample_texts)):
    print(f"  [{i}]", end="")
print()

for i, text_i in enumerate(sample_texts):
    # Truncate long texts so every row label has the same width as the header padding
    label = text_i if len(text_i) <= 35 else text_i[:32] + "..."
    print(f"[{i}] {label:>35} ", end="")
    for j in range(len(sample_texts)):
        sim = cosine_similarity(vectors[i], vectors[j])
        print(f" {sim:.2f}", end="")
    print()

In [None]:
# @title Find most similar text to a query
query = "How do I build an AI model?"

# Get query embedding
query_embedding = embedding_model.get_embeddings([query])[0].values
query_vector = np.array(query_embedding)

# Calculate similarities
similarities = []
for i, vec in enumerate(vectors):
    sim = cosine_similarity(query_vector, vec)
    similarities.append((sample_texts[i], sim))

# Sort by similarity
similarities.sort(key=lambda x: x[1], reverse=True)

print(f"Query: '{query}'\n")
print("Most similar texts:")
for text, sim in similarities:
    print(f"  {sim:.4f}: {text}")

## 3. Embeddings in BigQuery

BigQuery provides native support for embeddings with `ML.GENERATE_EMBEDDING` and `VECTOR_SEARCH` functions. Let's set up a table and generate embeddings at scale.
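
When embeddings are generated from Python rather than inside BigQuery, larger document sets usually need to be sent in batches, because embedding APIs tend to cap the number of texts per request. The sketch below shows one way to do that. The helper names (`batched`, `embed_in_batches`, `fake_embed`) and the batch size are assumptions made for illustration, and the stand-in embedder replaces the real model call so the example runs offline.

```python
from typing import Callable, Iterator, List, Sequence

def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])

def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 5,  # assumed value; check the model's documented request limit
) -> List[List[float]]:
    """Call `embed_fn` once per batch and concatenate the results in input order."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors

def fake_embed(batch: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for a real embedding call; returns 2-d dummy vectors."""
    return [[float(len(t)), float(t.count(" "))] for t in batch]

texts = [f"document number {i}" for i in range(12)]
vectors = embed_in_batches(texts, fake_embed, batch_size=5)
print(len(vectors), [len(b) for b in batched(texts, 5)])
```

In this notebook, `embed_fn` could wrap the loaded model, for example `lambda batch: [e.values for e in embedding_model.get_embeddings(batch)]`.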

In [None]:
# @title Create dataset if it doesn't exist
from google.api_core.exceptions import NotFound

dataset_ref = bigquery.DatasetReference(PROJECT_ID, DATASET_ID)

try:
    bq_client.get_dataset(dataset_ref)
    print(f"✓ Dataset {DATASET_ID} already exists")
except NotFound:
    # Create the dataset only when it is missing; other errors (e.g. permissions) should surface
    dataset = bigquery.Dataset(dataset_ref)
    dataset.location = LOCATION
    bq_client.create_dataset(dataset)
    print(f"✓ Created dataset {DATASET_ID}")

In [None]:
# @title Create a remote model for embeddings
MODEL_NAME = "text_embedding_model"

create_model_query = f"""
CREATE OR REPLACE MODEL `{PROJECT_ID}.{DATASET_ID}.{MODEL_NAME}`
REMOTE WITH CONNECTION `{PROJECT_ID}.{LOCATION}.default`
OPTIONS (
    endpoint = 'text-embedding-005'
)
"""

# Note: This requires a BigQuery connection to Vertex AI
# If you don't have one, the cell below provides an alternative
print("Remote Model Creation Query:")
print(create_model_query)
print("\n⚠️ Note: This requires a BigQuery-Vertex AI connection.")
print("See: https://cloud.google.com/bigquery/docs/create-cloud-resource-connection")

In [None]:
# @title Create sample documents table
create_table_query = f"""
CREATE OR REPLACE TABLE `{PROJECT_ID}.{DATASET_ID}.documents` AS
SELECT 
    'doc_1' as doc_id,
    'Machine learning is a subset of artificial intelligence that enables systems to learn from data.' as content
UNION ALL SELECT 'doc_2', 'Neural networks are computing systems inspired by biological neural networks.'
UNION ALL SELECT 'doc_3', 'Deep learning uses multiple layers of neural networks to analyze data.'
UNION ALL SELECT 'doc_4', 'Natural language processing helps computers understand human language.'
UNION ALL SELECT 'doc_5', 'Computer vision enables machines to interpret and process visual information.'
UNION ALL SELECT 'doc_6', 'Reinforcement learning trains agents through rewards and penalties.'
UNION ALL SELECT 'doc_7', 'Transfer learning applies knowledge from one task to another related task.'
UNION ALL SELECT 'doc_8', 'Supervised learning uses labeled data to train predictive models.'
UNION ALL SELECT 'doc_9', 'Unsupervised learning finds patterns in data without labeled examples.'
UNION ALL SELECT 'doc_10', 'Generative AI creates new content like text, images, and code.'
"""

bq_client.query(create_table_query).result()
print("✓ Created documents table with sample data")

# Display the data
display_query = f"SELECT * FROM `{PROJECT_ID}.{DATASET_ID}.documents`"
df = bq_client.query(display_query).to_dataframe()
display(df)

In [None]:
# @title Generate embeddings using Python (alternative to BigQuery ML)
# This approach works without a BigQuery-Vertex AI connection

# Get the documents
documents = df.to_dict('records')

# Generate embeddings for each document
doc_embeddings = []
contents = [doc['content'] for doc in documents]
embeddings = embedding_model.get_embeddings(contents)

for doc, emb in zip(documents, embeddings):
    doc_embeddings.append({
        'doc_id': doc['doc_id'],
        'content': doc['content'],
        'embedding': emb.values
    })

print(f"✓ Generated embeddings for {len(doc_embeddings)} documents")
print(f"  Embedding dimension: {len(doc_embeddings[0]['embedding'])}")

In [None]:
# @title Store embeddings in BigQuery
# Create table with ARRAY<FLOAT64> for embeddings
create_embeddings_table = f"""
CREATE OR REPLACE TABLE `{PROJECT_ID}.{DATASET_ID}.document_embeddings` (
    doc_id STRING,
    content STRING,
    embedding ARRAY<FLOAT64>
)
"""

bq_client.query(create_embeddings_table).result()
print("✓ Created embeddings table")

# Insert embeddings
table_id = f"{PROJECT_ID}.{DATASET_ID}.document_embeddings"

rows_to_insert = [
    {
        'doc_id': doc['doc_id'],
        'content': doc['content'],
        'embedding': doc['embedding']
    }
    for doc in doc_embeddings
]

errors = bq_client.insert_rows_json(table_id, rows_to_insert)

if errors:
    print(f"❌ Errors inserting rows: {errors}")
else:
    print(f"✓ Inserted {len(rows_to_insert)} rows with embeddings")

In [None]:
# @title Verify embeddings in BigQuery
verify_query = f"""
SELECT 
    doc_id,
    SUBSTR(content, 1, 50) as content_preview,
    ARRAY_LENGTH(embedding) as embedding_dim
FROM `{PROJECT_ID}.{DATASET_ID}.document_embeddings`
LIMIT 5
"""

result_df = bq_client.query(verify_query).to_dataframe()
print("Stored embeddings:")
display(result_df)

## 4. Vector Search

Now let's perform semantic search using the embeddings we've created.
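
A per-document Python loop is easy to read but slows down as the collection grows. As a rough sketch of a vectorized alternative, the function below normalizes all vectors once and scores every document with a single matrix product. The name `top_k_cosine` and the toy 2-d vectors are assumptions for illustration, not part of the notebook's API.

```python
import numpy as np

def top_k_cosine(query_vec, doc_matrix, k=3):
    """Return [(row_index, cosine_similarity), ...] for the k most similar rows."""
    q = np.asarray(query_vec, dtype=float)
    docs = np.asarray(doc_matrix, dtype=float)
    # After normalizing to unit length, a dot product equals cosine similarity
    q = q / np.linalg.norm(q)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = docs @ q
    k = min(k, len(scores))
    order = np.argsort(-scores)[:k]  # indices sorted by descending similarity
    return [(int(i), float(scores[i])) for i in order]

# Hypothetical 2-d document vectors
doc_matrix = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
hits = top_k_cosine([1.0, 0.1], doc_matrix, k=2)
for idx, score in hits:
    print(f"doc {idx}: {score:.3f}")
```

With the notebook's data, `doc_matrix` would be built from the stored `embedding` lists, and the returned indices map back to `doc_id` and `content`.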

In [None]:
# @title Semantic search function
def semantic_search(query, top_k=3):
    """
    Perform semantic search over the document embeddings.
    
    Args:
        query: Search query text
        top_k: Number of results to return
    
    Returns:
        List of (doc_id, content, similarity) tuples
    """
    # Generate query embedding
    query_emb = embedding_model.get_embeddings([query])[0].values
    query_vector = np.array(query_emb)
    
    # Calculate similarity with all documents
    results = []
    for doc in doc_embeddings:
        doc_vector = np.array(doc['embedding'])
        sim = cosine_similarity(query_vector, doc_vector)
        results.append((doc['doc_id'], doc['content'], sim))
    
    # Sort by similarity and return top_k
    results.sort(key=lambda x: x[2], reverse=True)
    return results[:top_k]

print("✓ Semantic search function ready")

In [None]:
# @title Test semantic search
QUERY = "How do machines learn from data?"  # @param {type:"string"}
TOP_K = 5  # @param {type:"integer"}

results = semantic_search(QUERY, top_k=TOP_K)

print(f"🔍 Query: '{QUERY}'\n")
print(f"Top {TOP_K} results:")
print("-" * 80)
for doc_id, content, sim in results:
    print(f"[{sim:.4f}] {doc_id}: {content}")

In [None]:
# @title Test with different queries
test_queries = [
    "What is deep learning?",
    "How to understand text with AI?",
    "Creating new content with AI",
    "Learning without labels"
]

for query in test_queries:
    results = semantic_search(query, top_k=2)
    print(f"\n🔍 '{query}'")
    for doc_id, content, sim in results:
        print(f"   [{sim:.3f}] {content[:60]}...")

## 5. BigQuery VECTOR_SEARCH (Reference)

BigQuery provides native VECTOR_SEARCH for efficient similarity search at scale. Here's the pattern:
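
The pattern in this section uses an IVF index with `num_lists` and a search option `fraction_lists_to_search`. To give some intuition for what those knobs trade off, here is a toy NumPy sketch of the inverted-file idea: cluster the vectors into lists, then scan only the lists whose centroids are closest to the query. This is an illustrative approximation under simple assumptions (a basic spherical k-means, random data, hypothetical function names) and is not how BigQuery implements its index.

```python
import numpy as np

def normalize(x):
    """Scale vectors to unit length along the last axis."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def build_ivf(vectors, num_lists, n_iter=10, seed=0):
    """Cluster unit vectors into `num_lists` buckets with a simple spherical k-means."""
    rng = np.random.default_rng(seed)
    data = normalize(vectors)
    centroids = data[rng.choice(len(data), size=num_lists, replace=False)]
    for _ in range(n_iter):
        assign = np.argmax(data @ centroids.T, axis=1)
        for c in range(num_lists):
            members = data[assign == c]
            if len(members):
                centroids[c] = normalize(members.mean(axis=0))
    # Final assignment against the final centroids defines the inverted lists
    assign = np.argmax(data @ centroids.T, axis=1)
    lists = [np.flatnonzero(assign == c) for c in range(num_lists)]
    return data, centroids, lists

def ivf_search(query, index, top_k=5, fraction_lists_to_search=0.1):
    """Scan only the closest lists; return [(row_index, cosine_distance), ...]."""
    data, centroids, lists = index
    q = normalize(query)
    n_probe = max(1, int(round(fraction_lists_to_search * len(lists))))
    probe = np.argsort(-(centroids @ q))[:n_probe]
    candidates = np.concatenate([lists[c] for c in probe])
    scores = data[candidates] @ q
    order = np.argsort(-scores)[:top_k]
    # Cosine distance is taken here as 1 - cosine similarity
    return [(int(candidates[i]), float(1.0 - scores[i])) for i in order]

rng = np.random.default_rng(42)
vectors = rng.normal(size=(1000, 16))  # random stand-ins for real embeddings
index = build_ivf(vectors, num_lists=20)
query = vectors[7]

approx = ivf_search(query, index, top_k=5, fraction_lists_to_search=0.1)
exact = ivf_search(query, index, top_k=5, fraction_lists_to_search=1.0)
print("approx:", [i for i, _ in approx])
print("exact: ", [i for i, _ in exact])
```

Searching a smaller fraction of lists touches fewer vectors and is faster, but it can miss neighbors that fell into an unprobed list; setting the fraction to 1.0 degenerates to an exhaustive scan.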

In [None]:
# @title BigQuery VECTOR_SEARCH Pattern
VECTOR_SEARCH_PATTERN = '''
-- Native BigQuery VECTOR_SEARCH pattern
-- This requires ML.GENERATE_EMBEDDING with a remote model connection

-- Step 1: Create a table with embeddings
CREATE OR REPLACE TABLE `{PROJECT}.{DATASET}.embeddings` AS
SELECT 
    doc_id,
    content,
    ml_generate_embedding_result AS embedding
FROM ML.GENERATE_EMBEDDING(
    MODEL `{PROJECT}.{DATASET}.embedding_model`,
    (SELECT doc_id, content FROM `{PROJECT}.{DATASET}.documents`)
);

-- Step 2: Create a vector index for efficient search
CREATE OR REPLACE VECTOR INDEX my_vector_index
ON `{PROJECT}.{DATASET}.embeddings`(embedding)
OPTIONS (
    index_type = 'IVF',
    distance_type = 'COSINE',
    ivf_options = '{"num_lists": 100}'
);

-- Step 3: Perform vector search
SELECT
    base.doc_id,
    base.content,
    distance
FROM VECTOR_SEARCH(
    TABLE `{PROJECT}.{DATASET}.embeddings`,
    'embedding',
    (
        SELECT ml_generate_embedding_result AS embedding
        FROM ML.GENERATE_EMBEDDING(
            MODEL `{PROJECT}.{DATASET}.embedding_model`,
            (SELECT 'How do machines learn?' AS content)
        )
    ),
    top_k => 5,
    OPTIONS => '{"fraction_lists_to_search": 0.1}'
)
ORDER BY distance;
'''

print("📋 BigQuery VECTOR_SEARCH Pattern:")
print(VECTOR_SEARCH_PATTERN)

In [None]:
# @title RAG with VECTOR_SEARCH + ML.GENERATE_TEXT Pattern
# Raw string so the \n sequences reach BigQuery as SQL escapes instead of literal line breaks
RAG_PATTERN = r'''
-- Complete RAG pattern: Vector Search + Text Generation
-- Combines semantic retrieval with LLM generation
-- Assumes `query_embedding` is a table or CTE holding the question's vector in an `embedding` column

SELECT
    ml_generate_text_llm_result AS answer
FROM ML.GENERATE_TEXT(
    MODEL `{PROJECT}.{DATASET}.gemini_model`,
    (
        SELECT CONCAT(
            'Answer the following question using only the context provided.\n\n',
            'Context:\n',
            STRING_AGG(base.content, '\n'),
            '\n\nQuestion: How do machines learn from data?\n\nAnswer:'
        ) AS prompt
        FROM VECTOR_SEARCH(
            TABLE `{PROJECT}.{DATASET}.embeddings`,
            'embedding',
            (SELECT embedding FROM query_embedding),
            top_k => 5
        )
    ),
    -- flatten_json_output = TRUE is what exposes the ml_generate_text_llm_result column
    STRUCT(0.2 AS temperature, 1024 AS max_output_tokens, TRUE AS flatten_json_output)
);
'''

print("📋 RAG Pattern (Vector Search + Generation):")
print(RAG_PATTERN)

## 6. Try It Yourself

In [None]:
# TODO: Add your own documents and test semantic search

# Add new documents
new_documents = [
    {"doc_id": "custom_1", "content": "Your custom document content here"},
    {"doc_id": "custom_2", "content": "Another document to search"},
]

# Generate embeddings for new documents
new_contents = [doc['content'] for doc in new_documents]
new_embeddings = embedding_model.get_embeddings(new_contents)

# Add to our document store
for doc, emb in zip(new_documents, new_embeddings):
    doc_embeddings.append({
        'doc_id': doc['doc_id'],
        'content': doc['content'],
        'embedding': emb.values
    })

print(f"✓ Added {len(new_documents)} new documents")
print(f"Total documents: {len(doc_embeddings)}")

In [None]:
# TODO: Experiment with different embedding models
# Available models: text-embedding-005, text-multilingual-embedding-002

# Try multilingual embeddings
try:
    multilingual_model = TextEmbeddingModel.from_pretrained("text-multilingual-embedding-002")
    
    multilingual_texts = [
        "How does machine learning work?",  # English
        "¿Cómo funciona el aprendizaje automático?",  # Spanish
        "機械学習はどのように機能しますか？",  # Japanese
    ]
    
    ml_embeddings = multilingual_model.get_embeddings(multilingual_texts)
    ml_vectors = [np.array(e.values) for e in ml_embeddings]
    
    print("Multilingual similarity:")
    for i, text_i in enumerate(multilingual_texts):
        for j, text_j in enumerate(multilingual_texts):
            if i < j:
                sim = cosine_similarity(ml_vectors[i], ml_vectors[j])
                print(f"  {text_i[:30]}... ↔ {text_j[:30]}... = {sim:.4f}")
except Exception as e:
    print(f"Could not load multilingual model: {e}")

## 7. Cleanup

In [None]:
# @title Cleanup resources (optional)
CLEANUP = False  # @param {type:"boolean"}

if CLEANUP:
    # Delete tables
    bq_client.delete_table(f"{PROJECT_ID}.{DATASET_ID}.documents", not_found_ok=True)
    bq_client.delete_table(f"{PROJECT_ID}.{DATASET_ID}.document_embeddings", not_found_ok=True)
    print("✓ Deleted tables")
    
    # Optionally delete dataset
    # bq_client.delete_dataset(f"{PROJECT_ID}.{DATASET_ID}", delete_contents=True)
    # print("✓ Deleted dataset")
else:
    print("Skipping cleanup. Set CLEANUP=True to delete resources.")

## Summary

In this notebook, you learned how to:

1. ✅ **Generate embeddings** using Vertex AI text-embedding models
2. ✅ **Understand similarity** between texts using cosine similarity
3. ✅ **Store embeddings** in BigQuery
4. ✅ **Perform semantic search** to find relevant documents
5. ✅ **Use BigQuery VECTOR_SEARCH** patterns for scale

### Key Takeaways

- **Embeddings** capture semantic meaning in dense vectors
- **Cosine similarity** measures how similar texts are
- **BigQuery VECTOR_SEARCH** enables efficient search at scale
- **Vector indexes** accelerate nearest-neighbor queries

---

## Next Steps

Continue to the next notebook: **[04_rag_context_assembly.ipynb](04_rag_context_assembly.ipynb)** to learn how to build a complete RAG pipeline with context assembly.