# Embeddings

## Semantic Search Using Embeddings from Google Gen AI Models

Semantic search uses the meaning of words and phrases, rather than exact keyword matches, to find relevant results.

In this tutorial, we demonstrate semantic search over news text: we generate embeddings for the text and use [Google ScaNN: Efficient Vector Similarity Search](https://ai.googleblog.com/2020/07/announcing-scann-efficient-vector.html) to retrieve the most semantically relevant news.

The embeddings are generated with the embedding model API provided by Google (based on PaLM 2), and the vector search runs in a local library. This pattern is most appropriate for experimentation.
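
To make the idea concrete, here is a minimal sketch of ranking by meaning: each text is represented as a vector, and the vectors closest in direction to the query vector are returned first. The three-dimensional vectors and the names `cosine_similarity`, `toy_docs` and `toy_query` are hypothetical and chosen by hand for illustration; they are not produced by the embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical hand-made "embeddings"; real ones have hundreds of dimensions.
toy_docs = {
    "online banking": [0.9, 0.1, 0.0],
    "direct debit refund": [0.1, 0.9, 0.1],
    "make a complaint": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]

# Sort document names by similarity to the query, most similar first.
ranked = sorted(
    toy_docs,
    key=lambda name: cosine_similarity(toy_query, toy_docs[name]),
    reverse=True,
)
print(ranked)
```

The rest of the notebook follows the same pattern, with the model supplying the vectors and ScaNN doing the ranking.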

## Prerequisites
- Vertex LLM SDK
- ScaNN ([GitHub](https://github.com/google-research/google-research/tree/master/scann))

## Install Vertex LLM SDK

Install the required libraries and initialise the Vertex AI SDK.

In [3]:
# Install Required Libraries
!pip3 install "google-cloud-aiplatform>=1.25" "shapely<2.0.0"
!pip install scann



In [4]:
# Import Vertex AI SDK
PROJECT_ID = !gcloud config get project
PROJECT_ID = PROJECT_ID.n
LOCATION = "europe-west2" 

import vertexai

vertexai.init(project=PROJECT_ID, location=LOCATION)

## Import TextEmbeddingModel

Available models as of Sep 2023:

| Model | Description |
| :- | :- |
| textembedding-gecko@001 | stable |
| textembedding-gecko@latest | public preview: an embeddings model with enhanced AI quality |
| textembedding-gecko-multilingual@latest | public preview: an embeddings model designed to use a wide range of non-English languages. |


Further documentation on available models can be found here: https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings#generative-ai-get-text-embedding-python

In [5]:
from vertexai.preview.language_models import TextEmbeddingModel

model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")

## Import Required Packages

Any TensorFlow output can be ignored; it appears because this notebook runs on CPU only. A GPU is not required for this demonstration.


In [6]:
import json
import time

import numpy as np
import pandas as pd
import scann

## Create Embedding Dataset

The dataset is solely to demonstrate the use of the Text Embedding API with a vector database. It is not intended to be used for any other purpose, such as evaluating models. The dataset is small and does not represent a comprehensive sample of all possible text.

The following command copies the JSON Lines data file from a Google Cloud Storage bucket and stores it locally for use in the notebook.

In [7]:
!mkdir -p datasets
!gsutil cp gs://gen-ai-{PROJECT_ID}-bucket/embeddings/data/google_embeddings_dataset.jsonl ./datasets

Copying gs://gen-ai-playpen-b57463-bucket/embeddings/data/google_embeddings_dataset.jsonl...
/ [1 files][ 14.1 KiB/ 14.1 KiB]                                                
Operation completed over 1 objects/14.1 KiB.                                     


### Load the JSONL file as a Python list

In [8]:
records = []
with open("./datasets/google_embeddings_dataset.jsonl") as f:
    for line in f:
        record = json.loads(line)
        records.append(record)

### Peek at the data

In [8]:
df = pd.read_csv('text_data.csv')
df.head(5)

| Unnamed: 0 | file_name | text |
| :- | :- | :- |
| 0 | whats-new-in-online-for-business.txt | \nNo\n |
| 1 | activate-cbo.txt | Activate your Commercial Banking Online accoun... |
| 2 | getting-started-with-the-business-mobile-app.txt | Getting started with the Business Mobile Banki... |
| 3 | account-management.txt | Account ManagementAccess & permissionsAdd or r... |
| 4 | confirmation-of-payee.txt | Make payments with confidenceConfirmation of P... |


### Get embeddings from the Google Embedding Model

The following code sends a request to the embedding model API for each entry in the dataset and stores the returned embedding vector in a new column of the pandas DataFrame.

In [None]:
def get_embedding(text):
    """Return the embedding vector for a single text string."""
    get_embedding.counter += 1
    # Pause for 3 seconds after every 100 requests to avoid sending them too quickly.
    if get_embedding.counter % 100 == 0:
        time.sleep(3)
    # Send the request to the embedding model.
    return model.get_embeddings([text])[0].values


get_embedding.counter = 0

# This may take several minutes to complete.
df["embedding"] = df["text"].apply(get_embedding)
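
Since `get_embeddings` takes a list of texts, one possible way to reduce the number of requests is to send several texts per call. The sketch below is an illustration, not part of the original notebook: `embed_in_batches` is a hypothetical helper, and the default `batch_size=5` is an assumption about the per-request limit that should be checked against the documentation for the model version in use.

```python
def embed_in_batches(model, texts, batch_size=5):
    """Embed texts in chunks, returning one vector per input text, in order."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # One request per batch instead of one request per text.
        # batch_size is an assumed limit; adjust it to the model's quota.
        vectors.extend(e.values for e in model.get_embeddings(batch))
    return vectors
```

With the notebook's variables, this could be used as `df["embedding"] = embed_in_batches(model, df["text"].tolist())`.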

In [None]:
df.head()

In [None]:
# Convert the embeddings into a Python list and write one JSON object per line
embeddings_list = df["embedding"].values.tolist()
with open("./datasets/vector_search_dataset.json", "w") as f:
    for i, embedding in enumerate(embeddings_list):
        record = {"id": str(i), "embedding": embedding}
        f.write(json.dumps(record, separators=(",", ":")) + "\n")
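
The saved file holds one JSON object per line, so the embeddings can be reloaded later without calling the API again. A minimal sketch, assuming the `id` and `embedding` keys written above; `load_vectors` is a hypothetical helper name:

```python
import json


def load_vectors(path):
    """Read a JSON Lines file of {"id": ..., "embedding": [...]} records."""
    ids, vectors = [], []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            ids.append(record["id"])
            vectors.append(record["embedding"])
    return ids, vectors
```

For example, `ids, vectors = load_vectors("./datasets/vector_search_dataset.json")` would return the ids and vectors in the order they were written.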

## Create an Index

Further documentation on ScaNN: https://github.com/google-research/google-research/tree/master/scann

In [42]:
record_count = len(df)  # one row per embedded document
dataset = np.empty((record_count, 768))
for i in range(record_count):
    dataset[i] = df.embedding[i]

normalized_dataset = dataset / np.linalg.norm(dataset, axis=1)[:, np.newaxis]
# configure ScaNN as a tree - asymmetric hash hybrid with reordering
# anisotropic quantization as described in the paper; see README

# use scann.scann_ops.build() to instead create a TensorFlow-compatible searcher
searcher = (
    scann.scann_ops_pybind.builder(normalized_dataset, 10, "dot_product")
    .tree(
        num_leaves=record_count,
        num_leaves_to_search=record_count,
        training_sample_size=record_count,
    )
    .score_ah(2, anisotropic_quantization_threshold=0.2)
    .reorder(100)
    .build()
)

2023-11-30 17:19:19.194929: I scann/partitioning/partitioner_factory_base.cc:59] Size of sampled dataset for training partition: 48
2023-11-30 17:19:19.200076: I ./scann/partitioning/kmeans_tree_partitioner_utils.h:88] PartitionerFactory ran in 5.035017ms.
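
On a dataset this small, an approximate index can be sanity-checked against an exact search. The sketch below is an illustration rather than part of the original notebook: it assumes the dataset rows are already unit length, as with `normalized_dataset` above, and `exact_top_k` and the toy array are hypothetical names.

```python
import numpy as np


def exact_top_k(unit_vectors, query, k=3):
    """Return (indices, scores) of the k largest dot products, best first."""
    query = np.asarray(query, dtype=float)
    query = query / np.linalg.norm(query)  # match the unit-length dataset rows
    scores = unit_vectors @ query
    top = np.argsort(-scores)[:k]
    return top, scores[top]


# Hypothetical toy data standing in for normalized_dataset.
toy = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
indices, scores = exact_top_k(toy, [2.0, 0.0], k=2)
print(indices, scores)
```

Because the index above is configured to search every leaf, the ids from `exact_top_k(normalized_dataset, query_vector)` would be expected to match those from `searcher.search` for the same query vector.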


## Query Function

In [45]:
def search(query):
    start = time.time()
    query = model.get_embeddings([query])[0].values
    neighbors, distances = searcher.search(query, final_num_neighbors=3)
    end = time.time()

    # doc_id avoids shadowing the built-in id()
    for doc_id, dist in zip(neighbors, distances):
        print(f"[docid:{doc_id}] [{dist}] -- {df.text[int(doc_id)][:125]}...")
    print("Latency (ms):", 1000 * (end - start))
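
The latency printed by `search` covers both the embedding request and the index lookup. To see how the time splits between the two, each step can be timed separately. A minimal sketch; `timed` is a hypothetical helper, and the commented lines show how it might be applied to the notebook's `model` and `searcher`:

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, 1000 * (time.perf_counter() - start)


# Stand-in call so the example runs without the model or the index.
total, elapsed_ms = timed(sum, [1, 2, 3])

# In the notebook (not executed here):
# vector, embed_ms = timed(lambda: model.get_embeddings([query])[0].values)
# (neighbors, distances), search_ms = timed(
#     searcher.search, vector, final_num_neighbors=3
# )
```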

## Query Index

In [48]:
search("tell me about online banking")

[docid:17] [0.7976396679878235] -- Find out more about online bankingOnline for BusinessOnline for Business is our Internet Banking service that lets you take c...
[docid:19] [0.7541975378990173] -- How to log in and out of Commercial Banking Online.You can only log in to Commercial Banking Online if your account has been ...
[docid:20] [0.7353847026824951] -- Register for Commercial Banking OnlineCommercial Banking Online:helps you manage your business online.gives you access to bot...
Latency (ms): 488.70158195495605


In [47]:
search("tell me about an important moment or event in your life")

[docid:25] [0.6011409759521484] -- Log on to Online for Business – Memorable InformationLog on securely using three characters from your memorable information, ...
[docid:18] [0.5863494873046875] -- Claim a Direct Debit refundYou might be able to claim a refund for a Direct Debit you haven’t authorised.You might be able to...
[docid:16] [0.5744001865386963] -- Make a complaintWe’re committed to providing products and services of the very highest quality. If we haven't lived up to you...
Latency (ms): 560.3361129760742
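
The index always returns its nearest neighbours, even when nothing in the dataset is really relevant: the second query above still yields three results, only with lower scores. One possible safeguard is a minimum score. In the sketch below, `filter_hits` is a hypothetical helper and the cutoff of 0.7 is an assumption picked only to separate the two example outputs; a real threshold would need tuning on representative queries.

```python
def filter_hits(neighbors, distances, min_score=0.7):
    """Keep (doc_id, score) pairs whose score is at least min_score."""
    return [
        (int(doc_id), float(score))
        for doc_id, score in zip(neighbors, distances)
        if score >= min_score
    ]


# Scores copied (rounded) from the two example queries above.
banking = filter_hits([17, 19, 20], [0.798, 0.754, 0.735])
off_topic = filter_hits([25, 18, 16], [0.601, 0.586, 0.574])
print(banking)
print(off_topic)
```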
