## Google Text Embedding models

* Reference: https://docs.cloud.google.com/vertex-ai/generative-ai/docs/model-reference/text-embeddings-api

### Installation and configuration

In [32]:
%pip install --upgrade --quiet google-genai \
                                numpy \
                                scipy \
                                pandas

Note: you may need to restart the kernel to use updated packages.


In [5]:
# Project settings used by the client below (plain Python variables, not OS environment variables)
PROJECT_ID = "ai-hangsik"
REGION = "us-central1"
USE_VERTEX_AI = True


In [3]:
!gcloud auth application-default login
!gcloud auth application-default set-quota-project {PROJECT_ID}

Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=764086051850-6qr4p6gpi6hn506pt8ejuq83di341hur.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A8085%2F&scope=openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fsqlservice.login&state=twiaMtir7JbgoH7n4tQcXePYYmT8v5&access_type=offline&code_challenge=WmJe1z2bmzhxntgaeOAxTeCCs-Rp1-uLxwHtN-9CxFg&code_challenge_method=S256


Credentials saved to file: [/Users/hangsik/.config/gcloud/application_default_credentials.json]

These credentials will be used by any library that requests Application Default Credentials (ADC).

Quota project "ai-hangsik" was added to ADC which can be used by Google client libraries for billing and quota. Note that some services may still bill the project owning the resource.


### Execution

In [6]:
from google import genai
from google.genai.types import EmbedContentConfig

import time

In [7]:
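# Create a Gen AI SDK client that sends requests to Vertex AI
# (authentication comes from the Application Default Credentials set up above)
client = genai.Client(
    vertexai=USE_VERTEX_AI,
    project=PROJECT_ID,
    location=REGION,
)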

In [8]:
# Print the cosine similarity between two embedding vectors
import numpy as np
from scipy.spatial.distance import cosine

def cosine_similarity(embed_1, embed_2):
    embedding_1 = np.array(embed_1)
    embedding_2 = np.array(embed_2)

    # SciPy's cosine() returns a distance, so similarity = 1 - distance.
    # The local name differs from the function name to avoid shadowing it.
    similarity = 1 - cosine(embedding_1, embedding_2)
    print(f"Cosine similarity : {similarity:.4f}")
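The helper above only prints its result. When the score is needed for further computation, a variant that returns it can be handy. The following is a minimal sketch using NumPy only; the name `cosine_similarity_value` is hypothetical and not part of the original notebook.

```python
import numpy as np

def cosine_similarity_value(embed_1, embed_2):
    """Return the cosine similarity of two equal-length vectors as a float."""
    a = np.asarray(embed_1, dtype=float)
    b = np.asarray(embed_2, dtype=float)
    # dot(a, b) / (|a| * |b|), which equals 1 - SciPy's cosine distance
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same = cosine_similarity_value([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
orthogonal = cosine_similarity_value([1.0, 0.0], [0.0, 1.0])      # perpendicular vectors
print(f"{same:.4f} {orthogonal:.4f}")
```

Parallel vectors score close to 1 and perpendicular vectors score 0, regardless of their lengths.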


### Google Text Embedding models
* Documentation: https://docs.cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings#google-models
* Task types notebook: https://github.com/GoogleCloudPlatform/generative-ai/blob/main/embeddings/task-type-embedding.ipynb

In [25]:
# Generate an embedding for one input with the given text embedding model
def gemini_embedding_func(model: str,
                          contents,
                          task_type: str = "SEMANTIC_SIMILARITY",
                          output_dimensionality: int = 768):

    start_time = time.perf_counter_ns()

    # https://googleapis.github.io/python-genai/genai.html#genai.types.EmbedContentConfig
    embed_config = EmbedContentConfig(
        auto_truncate=True,
        # task types ref : https://docs.cloud.google.com/vertex-ai/generative-ai/docs/model-reference/text-embeddings-api#parameter-list
        task_type=task_type,
        mime_type="text/plain",
        output_dimensionality=output_dimensionality,
        # title="title of the text" # when task type is RETRIEVAL_DOCUMENT
    )

    result = client.models.embed_content(
        model=model,
        contents=contents,
        config=embed_config,
    )

    end_time = time.perf_counter_ns()

    # perf_counter_ns() counts nanoseconds; multiplying by 1e-6 gives the
    # milliseconds value that is printed
    latency = end_time - start_time
    print(f"Latency (ns): {latency*1e-6:.2f} ms")

    # Only the first embedding is returned, so pass a single text per call
    return result.embeddings[0].values
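The function above embeds one text per call and returns only `embeddings[0]`. For many texts, a chunking helper along these lines may reduce the number of requests. This is a sketch under assumptions: `embed_in_batches` and `batch_size` are hypothetical names, it assumes `embed_content` accepts a list of strings and returns one embedding per input in the same order, and the number of inputs allowed per request depends on the model, so the default stays at a conservative 1.

```python
from typing import List, Sequence

def embed_in_batches(client, model: str, texts: Sequence[str],
                     batch_size: int = 1, config=None) -> List[List[float]]:
    """Embed texts in chunks of batch_size and return one vector per text."""
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        chunk = list(texts[start:start + batch_size])
        # Assumption: one embedding comes back per input, in input order
        result = client.models.embed_content(model=model, contents=chunk, config=config)
        vectors.extend(list(e.values) for e in result.embeddings)
    return vectors
```

With the notebook's `client`, a call could look like `embed_in_batches(client, MODEL, df["text"].tolist(), batch_size=1, config=embed_config)`, where `embed_config` is an `EmbedContentConfig` built as in the function above.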

#### text-multilingual-embedding-002

In [26]:
MODEL = "text-multilingual-embedding-002"

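# Roughly: "A cat is riding a bicycle"
CONTENT_1 = "고양이가 자전거를 타고 간다"
# Roughly: "A tiger is going along in a car, and a cat is following behind on a bicycle"
CONTENT_2 = "호랑이가 차를 차고 가고 있고 고양이도 자전거를 타고 뒤따르고 있다"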

embed_1 = gemini_embedding_func(model = MODEL, 
                                task_type="SEMANTIC_SIMILARITY", 
                                output_dimensionality=768,  
                                contents = CONTENT_1)

embed_2 = gemini_embedding_func(model = MODEL, 
                                task_type="SEMANTIC_SIMILARITY", 
                                output_dimensionality=768,  
                                contents = CONTENT_2)

cosine_similarity(embed_1, embed_2)

Latency (ns): 1308.24 ms
Latency (ns): 312.68 ms
Cosine similarity : 0.8593
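The latency values printed above are in milliseconds: the timer reads nanoseconds and the code multiplies by `1e-6` before printing. A small context manager can keep that conversion in one place. This is a sketch only; `timed` is a hypothetical helper, not something the notebook defines.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label: str):
    """Measure a block with perf_counter_ns and report the result in milliseconds."""
    record = {"label": label, "ms": None}
    start = time.perf_counter_ns()
    try:
        yield record
    finally:
        record["ms"] = (time.perf_counter_ns() - start) * 1e-6  # ns -> ms
        print(f"{label} latency: {record['ms']:.2f} ms")

with timed("sleep") as t:
    time.sleep(0.01)  # stand-in for an embed_content call
```

After the block exits, `t["ms"]` holds the elapsed time, so it can be collected in a list instead of only being printed.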


#### gemini-embedding-001

* Paper: https://arxiv.org/pdf/2503.07891


In [None]:
MODEL = "gemini-embedding-001"

# Roughly: "Play the last episode of Hierarchy"
CONTENT_1 = "하이라키 마지막 회 틀어줘"
# Roughly: "Play the highlights at the end"
CONTENT_2 = "하이라이트 마지막에 틀어 줘"

embed_1 = gemini_embedding_func(model = MODEL, 
                                task_type="SEMANTIC_SIMILARITY", 
                                output_dimensionality=3072,  
                                contents = CONTENT_1)

embed_2 = gemini_embedding_func(model = MODEL, 
                                task_type="SEMANTIC_SIMILARITY", 
                                output_dimensionality=3072,  
                                contents = CONTENT_2)

cosine_similarity(embed_1, embed_2)

Latency (ns): 1151.87 ms
Latency (ns): 430.28 ms
Cosine similarity : 0.9714
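Cosine similarity ignores vector length, so the comparison above works whether or not the embeddings are unit length. If the vectors are later compared with a plain dot product, they need to be L2-normalized first. Whether a model returns unit-length vectors at a given `output_dimensionality` is not stated in this notebook, so treat it as something to check. A minimal sketch, with `l2_normalize` as a hypothetical helper:

```python
import numpy as np

def l2_normalize(vector):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    v = np.asarray(vector, dtype=float)
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return v
    return v / norm

a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])
dot = float(np.dot(a, b))  # equals the cosine similarity once both are unit length
print(f"{dot:.2f}")
```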


### Find similar texts

In [59]:
import pandas as pd
import numpy as np
from scipy.spatial.distance import cosine

def find_similar_texts(query_text, df, embedding_column='embedding', text_column='text', top_k=5,
                       model="gemini-embedding-001"):
    # Embed the query with the same model and settings used for the rows in df.
    # The model is a parameter so a later reassignment of MODEL cannot change it.
    query_embedding = gemini_embedding_func(
        model=model,
        task_type="SEMANTIC_SIMILARITY",
        output_dimensionality=3072,
        contents=query_text
    )

    # Cosine similarity between the query and every stored embedding
    similarities = []
    for idx, row in df.iterrows():
        similarity = 1 - cosine(query_embedding, row[embedding_column])
        similarities.append({'text': row[text_column], 'similarity': similarity})

    # Sort by similarity (highest first) and keep the top k results
    results = sorted(similarities, key=lambda x: x['similarity'], reverse=True)[:top_k]

    return results
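The row-by-row loop above is easy to read but calls SciPy once per row. For larger tables, the same ranking can be computed in one matrix operation. The following is a sketch with NumPy; `top_k_by_cosine` is a hypothetical name, and the toy vectors stand in for real embeddings.

```python
import numpy as np

def top_k_by_cosine(query_vec, doc_vecs, texts, top_k=5):
    """Rank texts by cosine similarity between query_vec and each row of doc_vecs."""
    q = np.asarray(query_vec, dtype=float)
    m = np.asarray(doc_vecs, dtype=float)  # shape: (n_docs, dim)
    sims = (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)[:top_k]      # indices of the highest scores first
    return [{"text": texts[i], "similarity": float(sims[i])} for i in order]

docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k_by_cosine([1.0, 0.1], docs, ["a", "b", "c"], top_k=2)
print(hits)
```

With a DataFrame like the one used in this notebook, the matrix could be built with `np.vstack(df["embedding"].to_list())` and the texts with `df["text"].tolist()`.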

In [60]:
# Example usage:
# 1. Read CSV file (assuming you have a CSV with a 'text' column)
df = pd.read_csv('data/.audio_truth.csv',skipinitialspace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   num     52 non-null     int64 
 1   text    52 non-null     object
dtypes: int64(1), object(1)
memory usage: 964.0+ bytes


In [61]:
# 2. Generate embeddings for all texts

MODEL = "gemini-embedding-001"

df['embedding'] = df['text'].apply(lambda x: gemini_embedding_func(
    model=MODEL,
    task_type="SEMANTIC_SIMILARITY",
    output_dimensionality=3072,
    contents=x
))

Latency (ns): 1478.35 ms
Latency (ns): 446.76 ms
Latency (ns): 623.64 ms
Latency (ns): 439.90 ms
Latency (ns): 599.35 ms
Latency (ns): 473.45 ms
Latency (ns): 559.29 ms
Latency (ns): 312.47 ms
Latency (ns): 429.65 ms
Latency (ns): 300.36 ms
Latency (ns): 311.90 ms
Latency (ns): 294.79 ms
Latency (ns): 303.31 ms
Latency (ns): 758.33 ms
Latency (ns): 446.24 ms
Latency (ns): 629.06 ms
Latency (ns): 433.51 ms
Latency (ns): 309.69 ms
Latency (ns): 308.92 ms
Latency (ns): 311.24 ms
Latency (ns): 300.33 ms
Latency (ns): 297.63 ms
Latency (ns): 305.06 ms
Latency (ns): 311.14 ms
Latency (ns): 305.23 ms
Latency (ns): 303.46 ms
Latency (ns): 296.58 ms
Latency (ns): 298.66 ms
Latency (ns): 319.66 ms
Latency (ns): 297.23 ms
Latency (ns): 300.66 ms
Latency (ns): 306.18 ms
Latency (ns): 320.47 ms
Latency (ns): 300.44 ms
Latency (ns): 299.69 ms
Latency (ns): 304.78 ms
Latency (ns): 313.16 ms
Latency (ns): 327.11 ms
Latency (ns): 303.81 ms
Latency (ns): 298.96 ms
Latency (ns): 304.21 ms
Latency (ns): 3
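In the output above, the first call is much slower than the ones that follow, so a median is a more representative summary than a mean. A short sketch with the standard library, using only the first five values printed above:

```python
import statistics

# First five latency values (ms) copied from the output above
latencies_ms = [1478.35, 446.76, 623.64, 439.90, 599.35]

summary = {
    "count": len(latencies_ms),
    "mean_ms": statistics.mean(latencies_ms),
    "median_ms": statistics.median(latencies_ms),
    "max_ms": max(latencies_ms),
}
print(summary)
```

To summarize every call, the embedding function would need to return or record its latency instead of only printing it; that change is not part of the original notebook.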

In [72]:
# 3. Find similar texts for a query
query = "오징어 게임 있어?"  # Roughly: "Do you have Squid Game?"
similar_texts = find_similar_texts(query, df)

# 4. Collect the results, formatting each similarity to four decimal places
search_results = []
for result in similar_texts:
    search_results.append({
        "text": result['text'],
        "similarity": f"{result['similarity']:.4f}"
    })

search_results

Latency (ns): 1216.14 ms


[{'text': '오징어 게임 시즌 2 예고편 있어', 'similarity': '0.8878'},
 {'text': '오징어 게임 시즌 2 지금 바로 재생해 줘', 'similarity': '0.8379'},
 {'text': '현빈 이동욱 나오는 하얼빈 지금 볼래', 'similarity': '0.8133'},
 {'text': 'TV야 볼만한 거 추천해 줘', 'similarity': '0.8061'},
 {'text': '오늘 새로 업데이트된 콘텐츠 뭐 있어', 'similarity': '0.8024'}]
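In the results above, only the first two rows mention the queried title, yet the remaining three still score above 0.80. A similarity cutoff can drop such rows before they are passed on. The sketch below uses the rows from the output above; `filter_by_similarity` is a hypothetical helper and the 0.83 cutoff is an illustrative assumption, not a recommended value.

```python
def filter_by_similarity(results, min_similarity):
    """Keep rows whose similarity (float or numeric string) meets the cutoff."""
    return [r for r in results if float(r["similarity"]) >= min_similarity]

# Rows copied from the output above
sample = [
    {"text": "오징어 게임 시즌 2 예고편 있어", "similarity": "0.8878"},
    {"text": "오징어 게임 시즌 2 지금 바로 재생해 줘", "similarity": "0.8379"},
    {"text": "현빈 이동욱 나오는 하얼빈 지금 볼래", "similarity": "0.8133"},
    {"text": "TV야 볼만한 거 추천해 줘", "similarity": "0.8061"},
    {"text": "오늘 새로 업데이트된 콘텐츠 뭐 있어", "similarity": "0.8024"},
]

kept = filter_by_similarity(sample, min_similarity=0.83)
print(len(kept))
```

A suitable cutoff depends on the model, the task type, and the data, so it is worth tuning against labeled examples.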

In [75]:
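# Use a separate name so the embedding MODEL defined earlier is not overwritten
GENERATION_MODEL = "gemini-2.5-flash-lite"

# English gist of the Korean prompt below:
#   You are an AI assistant that understands the user's question and rewrites it
#   based on its precise intent. Using the user's question ({query}) and the
#   retrieved similar questions ({search_results}), rewrite it as a clear question
#   that reflects the original as closely as possible, with minimal changes.
#   Answer example: "Play a compilation of trailers for the latest movie releases"
PROMPT = f"""
    당신은 사용자의 질문을 이해해서 정확한 질문의 의도를 바탕으로 사용자의 질문을 재작성해주는 AI 어시스턴트입니다.
    사용자의 질문 : {query} 과 검색된 유사한 질문들을 참고하여 최대한 사용자의 질문을 반영한 명확한 질문으로 재작성해 주세요.
    유사한 질문들 : {search_results}    

    답변은 아래와 같이 사용자의 질문을 최소화해서 변경 후 재작성 해주세요.
    답변예제 : "최신 개봉 영화 예고편 모음 틀어줘" 
"""
start_time = time.perf_counter_ns()

response = client.models.generate_content(
    model=GENERATION_MODEL,
    contents=PROMPT,
)

end_time = time.perf_counter_ns()

# Elapsed nanoseconds, converted to milliseconds when printed
latency = (end_time - start_time)
print(f"{GENERATION_MODEL} Latency (ns): {latency*1e-6:.2f} ms \n")

print(response.text)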

gemini-2.5-flash-lite Latency (ns): 528.10 ms 

오징어 게임 시즌 2 예고편 틀어줘


## End of Document