# Vertex AI GenAI Embeddings - As Features For Hierarchical Classification

**IN DEVELOPMENT**

Embeddings are vector representations of text, images, or both.  Each embedding is a vector of floating point numbers produced by a model that has been trained to encode content compactly, so that similar content maps to nearby vectors.
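
As a rough illustration of what "vectors of floating point numbers" means in practice, the sketch below compares made-up vectors with cosine similarity. The vectors, their four dimensions, and the product phrases are all invented for illustration; real embeddings come from a model and are much longer.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 4-dimensional vectors, NOT output from any model.
toy_embeddings = {
    "swim shorts": [0.9, 0.1, 0.0, 0.2],
    "swim trunks": [0.8, 0.2, 0.1, 0.2],
    "wool sweater": [0.1, 0.9, 0.3, 0.0],
}

sim_close = cosine_similarity(toy_embeddings["swim shorts"], toy_embeddings["swim trunks"])
sim_far = cosine_similarity(toy_embeddings["swim shorts"], toy_embeddings["wool sweater"])

print(f"shorts vs trunks:  {sim_close:.3f}")
print(f"shorts vs sweater: {sim_far:.3f}")
```

With these toy values the two swimwear phrases score higher than the swimwear and sweater pair, which is the property that makes embeddings usable as features.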

Getting embeddings for text, or for multimodal text and images, using Vertex AI foundation models is demonstrated in notebook []().

This notebook shows a use case for embeddings as features.


Workflow:
- Review product catalog data in the BigQuery public table: `bigquery-public-data.thelook_ecommerce.products`
- Create a table with embeddings for:
    - `name` = A brief description of the product
    - `department` = The first level of the product catalog
    - `category` = The second level of the product catalog
- Set up a BigQuery resource connection
- Use ML.* to generate embeddings using Vertex AI


In [1]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [10]:
REGION = 'us-central1'
SERIES = 'applied-genai'
EXPERIMENT = 'embed-feature-classifier'

In [11]:
# make this the BQ Project / Dataset / Table prefix to store results
BQ_PROJECT = PROJECT_ID
BQ_DATASET = SERIES.replace('-', '_')
BQ_TABLE = EXPERIMENT
BQ_REGION = REGION[0:2] # subset to first two characters for multi-region
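
Slicing the first two characters works for `us-central1` and happens to give `eu` for `europe-*` regions, but a region such as `asia-east1` would give `as`, which is not a BigQuery location. A slightly more defensive mapping might look like the sketch below. The helper name `bq_multi_region` is hypothetical, and the choice to fall back to the single region is an assumption, not something this notebook does.

```python
def bq_multi_region(region: str) -> str:
    """Map a Vertex AI region to a BigQuery location (hypothetical helper).

    Assumption: only the 'us' and 'eu' multi-regions are considered; any
    other region is returned unchanged so a single-region location is used.
    """
    prefix = region.split("-")[0].lower()
    if prefix == "us":
        return "us"
    if prefix == "europe":
        return "eu"
    return region

print(bq_multi_region("us-central1"))
print(bq_multi_region("europe-west4"))
print(bq_multi_region("asia-east1"))
```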

In [47]:
import json

import vertexai.language_models
import bigframes.pandas as bf
import bigframes.ml as bfml
from bigframes.ml import llm
from google.cloud import bigquery_connection_v1 as bq_connection
from google.cloud import bigquery

In [69]:
vertexai.init(project = PROJECT_ID, location = REGION)
bq = bigquery.Client(project = PROJECT_ID)
bf.reset_session()
bf.options.bigquery.project = BQ_PROJECT
bf.options.bigquery.location = BQ_REGION
bf_session = bf.get_global_session()

In [70]:
products = bf.read_gbq('bigquery-public-data.thelook_ecommerce.products')

In [71]:
products.dtypes

id                                  Int64
cost                              Float64
category                  string[pyarrow]
name                      string[pyarrow]
brand                     string[pyarrow]
retail_price                      Float64
department                string[pyarrow]
sku                       string[pyarrow]
distribution_center_id              Int64
dtype: object

In [72]:
products['department'].unique().tolist()

['Men', 'Women']

In [73]:
products['category'].unique().tolist()

['Swim',
 'Jeans',
 'Pants',
 'Socks',
 'Active',
 'Shorts',
 'Sweaters',
 'Underwear',
 'Accessories',
 'Tops & Tees',
 'Sleep & Lounge',
 'Outerwear & Coats',
 'Suits & Sport Coats',
 'Fashion Hoodies & Sweatshirts',
 'Plus',
 'Suits',
 'Skirts',
 'Dresses',
 'Leggings',
 'Intimates',
 'Maternity',
 'Clothing Sets',
 'Pants & Capris',
 'Socks & Hosiery',
 'Blazers & Jackets',
 'Jumpsuits & Rompers']

In [74]:
products['name'].head()

0       2XU Men's Swimmers Compression Long Sleeve Top
1           TYR Sport Men's Square Leg Short Swim Suit
2      TYR Sport Men's Solid Durafast Jammer Swim Suit
3    TYR Sport Men's Swim Short/Resistance Short Sw...
4                      TYR Alliance Team Splice Jammer
Name: name, dtype: string

In [75]:
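# create/link to dataset
# wrap the reference in a Dataset so that location and labels are sent with the request
ds = bigquery.Dataset(bigquery.DatasetReference(BQ_PROJECT, BQ_DATASET))
ds.location = BQ_REGION
ds.labels = {'series': SERIES}
ds = bq.create_dataset(dataset = ds, exists_ok = True)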

---
## BigQuery ML: Connect To Vertex AI LLMs with ML.GENERATE_TEXT

BigQuery ML can create models that are actually connections to remote models. [Reference](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-remote-model)

Using the `REMOTE_SERVICE_TYPE = "CLOUD_AI_LARGE_LANGUAGE_MODEL_V1"` option will link to LLMs in Vertex AI!
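
As a sketch of where that option appears, the snippet below only builds the `CREATE MODEL` statement as a string and does not run it. The helper name, the model name, and the `my-project` identifiers are hypothetical placeholders; check the linked reference for the options currently supported before running anything like it.

```python
def create_remote_model_sql(model: str, connection: str, service_type: str) -> str:
    """Build a CREATE MODEL statement for a BigQuery ML remote model (hypothetical helper)."""
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"REMOTE WITH CONNECTION `{connection}`\n"
        f"OPTIONS (REMOTE_SERVICE_TYPE = '{service_type}')"
    )

sql = create_remote_model_sql(
    model="my-project.applied_genai.remote_llm",  # hypothetical model name
    connection="my-project.us.applied-genai_embed-feature-classifier",  # hypothetical connection
    service_type="CLOUD_AI_LARGE_LANGUAGE_MODEL_V1",
)
print(sql)
```

The resulting string could then be submitted with a BigQuery client, for example `bq.query(sql).result()`, once the connection described in the next section exists.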

### Connection Requirement

To make a remote connection using BigQuery ML, BigQuery uses a CLOUD_RESOURCE connection. [Reference](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-remote-model#connection)

Make sure the [BigQuery Connection API](https://cloud.google.com/bigquery/docs/create-cloud-resource-connection) is enabled:

In [48]:
!gcloud services enable bigqueryconnection.googleapis.com

Create a new connection with type `CLOUD_RESOURCE`:

In [None]:
try:
    response = bq_connection.ConnectionServiceClient().get_connection(
            request = bq_connection.GetConnectionRequest(
                name = f"projects/{BQ_PROJECT}/locations/{BQ_REGION}/connections/{SERIES}_{EXPERIMENT}"
            )
    )
    print(f'Found existing connection with service account: {response.cloud_resource.service_account_id}')
    service_account = response.cloud_resource.service_account_id
except Exception:
    request = bq_connection.CreateConnectionRequest(
        {
            "parent": f"projects/{BQ_PROJECT}/locations/{BQ_REGION}",
            "connection_id": f"{SERIES}_{EXPERIMENT}",
            "connection": bq_connection.types.Connection(
                {
                    "friendly_name": f"{SERIES}_{EXPERIMENT}",
                    "cloud_resource": bq_connection.CloudResourceProperties({})
                }
            )
        }
    )
    response = bq_connection.ConnectionServiceClient().create_connection(request)
    print(f'Created new connection with service account: {response.cloud_resource.service_account_id}')
    service_account = response.cloud_resource.service_account_id
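
The connection is addressed in two forms in this notebook: the Connection API calls use a full resource path, while the `connection_name` argument used later takes a dotted `project.location.connection_id` form. The two hypothetical helpers below just make that difference explicit; `my-project` is a placeholder.

```python
def connection_resource_name(project: str, location: str, connection_id: str) -> str:
    """Full resource path, as passed to the Connection API get/delete calls."""
    return f"projects/{project}/locations/{location}/connections/{connection_id}"

def connection_dotted_name(project: str, location: str, connection_id: str) -> str:
    """Dotted form, as passed to `connection_name` for the remote model."""
    return f"{project}.{location}.{connection_id}"

# Placeholder values mirroring how SERIES and EXPERIMENT are combined above.
args = ("my-project", "us", "applied-genai_embed-feature-classifier")
print(connection_resource_name(*args))
print(connection_dotted_name(*args))
```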

Assign the service account the Vertex AI User role:

In [68]:
!gcloud projects add-iam-policy-binding {BQ_PROJECT} --member=serviceAccount:{service_account} --role=roles/aiplatform.user

Updated IAM policy for project [statmike-mlops-349915].
bindings:
- members:
  - serviceAccount:service-1026793852137@gcp-sa-aiplatform-cc.iam.gserviceaccount.com
  role: roles/aiplatform.customCodeServiceAgent
- members:
  - serviceAccount:service-1026793852137@gcp-sa-aiplatform-vm.iam.gserviceaccount.com
  role: roles/aiplatform.notebookServiceAgent
- members:
  - serviceAccount:service-1026793852137@gcp-sa-aiplatform.iam.gserviceaccount.com
  role: roles/aiplatform.serviceAgent
- members:
  - deleted:serviceAccount:bqcx-1026793852137-79ue@gcp-sa-bigquery-condel.iam.gserviceaccount.com?uid=108216671037418333398
  - deleted:serviceAccount:bqcx-1026793852137-iszu@gcp-sa-bigquery-condel.iam.gserviceaccount.com?uid=106642351460101305872
  - serviceAccount:bqcx-1026793852137-a2ne@gcp-sa-bigquery-condel.iam.gserviceaccount.com
  - serviceAccount:bqcx-1026793852137-dyw1@gcp-sa-bigquery-condel.iam.gserviceaccount.com
  - serviceAccount:bqcx-1026793852137-zfly@gcp-sa-bigquery-condel.iam.gserv

### Create The Remote Model In BigQuery

Create a temporary model that connects to a text embedding model on Vertex AI - [Reference](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.ml.llm.PaLM2TextEmbeddingGenerator)

In [98]:
embed_model = bfml.llm.PaLM2TextEmbeddingGenerator(
    session = bf_session,
    connection_name = f'{BQ_PROJECT}.{BQ_REGION}.{SERIES}_{EXPERIMENT}'
)

In [99]:
category = products['category'].unique().to_frame()

In [100]:
category.head()

|      | category |
|------|----------|
| 0    | Swim     |
| 906  | Jeans    |
| 2023 | Pants    |
| 3064 | Socks    |
| 3969 | Active   |


In [106]:
category = category.join(embed_model.predict(category).rename(columns={'text_embedding':'category_embedding'}))

In [107]:
category.head()

|      | category | category_embedding |
|------|----------|--------------------|
| 0    | Swim     | [0.01972227729856968, -0.01661798171699047, -0... |
| 906  | Jeans    | [-0.016635224223136902, 0.0025853165425360203,... |
| 2023 | Pants    | [-7.701403956161812e-05, 0.015807654708623886,... |
| 3064 | Socks    | [0.05488337576389313, 0.006218045484274626, 0.... |
| 3969 | Active   | [0.028275124728679657, -0.00869790930300951, 0... |


In [108]:
department = products['department'].unique().to_frame()
department = department.join(embed_model.predict(department).rename(columns={'text_embedding':'department_embedding'}))

In [109]:
department.head()

|       | department | department_embedding |
|-------|------------|----------------------|
| 0     | Men        | [-0.04335380345582962, 0.00046764410217292607,... |
| 13131 | Women      | [-0.03191345930099487, -0.006457726005464792, ... |
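
One way embeddings like these might feed a hierarchical classifier is a two-step nearest-label lookup: pick the closest `department` first, then pick the closest `category` among those allowed under that department. This is only a sketch under stated assumptions, not the method this notebook implements. The 3-dimensional vectors, the department-to-category mapping, and all function names are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nearest_label(vector, label_embeddings):
    """Return the label whose embedding is most similar to `vector`."""
    return max(label_embeddings, key=lambda label: cosine(vector, label_embeddings[label]))

def classify(name_vec, department_embeddings, category_embeddings, categories_by_department):
    """Level 1: choose a department. Level 2: choose among that department's categories."""
    department = nearest_label(name_vec, department_embeddings)
    allowed = {c: category_embeddings[c] for c in categories_by_department[department]}
    return department, nearest_label(name_vec, allowed)

# Toy values only: NOT model output and NOT the real catalog hierarchy.
department_embeddings = {"Men": [1.0, 0.0, 0.1], "Women": [0.0, 1.0, 0.1]}
category_embeddings = {
    "Swim": [0.5, 0.5, 0.9],
    "Jeans": [0.9, 0.1, 0.2],
    "Dresses": [0.1, 0.9, 0.2],
}
categories_by_department = {"Men": ["Swim", "Jeans"], "Women": ["Swim", "Dresses"]}

name_vec = [0.7, 0.2, 0.8]  # stands in for the embedding of a product name
result = classify(name_vec, department_embeddings, category_embeddings, categories_by_department)
print(result)
```

Restricting the second step to the categories under the chosen department is what makes the lookup hierarchical. A trained classifier that takes the `name` embedding as input features would be an alternative to this similarity lookup.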


In [None]:
bq_connection.ConnectionServiceClient().delete_connection(name = f"projects/{BQ_PROJECT}/locations/{BQ_REGION}/connections/{SERIES}_{EXPERIMENT}")