# Vertex AI GenAI Embeddings - As Features For Hierarchical Classification

**IN DEVELOPMENT**

Embeddings are vector representations of text, images, or both.  They are vectors of floating-point numbers produced by a model that has been trained to represent content efficiently.
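
As a minimal sketch of what a vector representation allows, the following compares made-up three-dimensional vectors with cosine similarity. The vectors and the `cosine_similarity` helper are illustrative assumptions, not output from a Vertex AI model; the embeddings used later in this notebook have 768 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings (hypothetical values)
swim_trunks = [0.9, 0.1, 0.0]
swim_shorts = [0.8, 0.2, 0.1]
wool_sweater = [0.0, 0.2, 0.9]

similar = cosine_similarity(swim_trunks, swim_shorts)
different = cosine_similarity(swim_trunks, wool_sweater)

# Related items should score higher than unrelated ones
print(similar > different)
```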

Getting embeddings for text, or for multimodal text and images, using Vertex AI foundation models is demonstrated in notebook []().

This notebook shows a use case for embeddings as features.


Workflow:
- Review product catalog data in BigQuery Public table: `bigquery-public-data.thelook_ecommerce.products`
- Create a table with embeddings for:
    - `name` = A brief description of the product
    - `department` = The first level of the product catalog
    - `category` = The second level of the product catalog
- Setup BigQuery Resource Connection
- Use ML.* to generate embeddings using Vertex AI


---
## Setup

In [1]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [2]:
REGION = 'us-central1'
SERIES = 'applied-genai'
EXPERIMENT = 'embed-feature-classifier'

In [3]:
# make this the BQ Project / Dataset / Table prefix to store results
BQ_PROJECT = PROJECT_ID
BQ_DATASET = SERIES.replace('-', '_')
BQ_TABLE = EXPERIMENT
BQ_REGION = REGION[0:2] # subset to first two characters for multi-region

In [4]:
import json
import numpy as np
import vertexai.language_models
import bigframes.pandas as bf
import bigframes.ml as bfml
from bigframes.ml import llm
from bigframes.ml import model_selection
from bigframes.ml import ensemble
from google.cloud import bigquery_connection_v1 as bq_connection
from google.cloud import bigquery

In [5]:
vertexai.init(project = PROJECT_ID, location = REGION)
bq = bigquery.Client(project = PROJECT_ID)
bf.reset_session()
bf.options.bigquery.project = BQ_PROJECT
bf.options.bigquery.location = BQ_REGION
bf_session = bf.get_global_session()

---
## Review Data Source

BigQuery Public table `bigquery-public-data.thelook_ecommerce.products`.

In [6]:
products = bf.read_gbq('bigquery-public-data.thelook_ecommerce.products')

In [7]:
products.dtypes

id                                  Int64
cost                              Float64
category                  string[pyarrow]
name                      string[pyarrow]
brand                     string[pyarrow]
retail_price                      Float64
department                string[pyarrow]
sku                       string[pyarrow]
distribution_center_id              Int64
dtype: object

In [8]:
products['department'].unique().tolist()

['Men', 'Women']

In [9]:
products['category'].unique().tolist()

['Swim',
 'Jeans',
 'Pants',
 'Socks',
 'Active',
 'Shorts',
 'Sweaters',
 'Underwear',
 'Accessories',
 'Tops & Tees',
 'Sleep & Lounge',
 'Outerwear & Coats',
 'Suits & Sport Coats',
 'Fashion Hoodies & Sweatshirts',
 'Plus',
 'Suits',
 'Skirts',
 'Dresses',
 'Leggings',
 'Intimates',
 'Maternity',
 'Clothing Sets',
 'Pants & Capris',
 'Socks & Hosiery',
 'Blazers & Jackets',
 'Jumpsuits & Rompers']

In [10]:
products['name'].head()

0       2XU Men's Swimmers Compression Long Sleeve Top
1           TYR Sport Men's Square Leg Short Swim Suit
2      TYR Sport Men's Solid Durafast Jammer Swim Suit
3    TYR Sport Men's Swim Short/Resistance Short Sw...
4                      TYR Alliance Team Splice Jammer
Name: name, dtype: string

---
## Create BigQuery Dataset

In [11]:
# create/link to dataset
ds = bigquery.DatasetReference(BQ_PROJECT, BQ_DATASET)
ds.location = BQ_REGION
ds.labels = {'series': f'{SERIES}'}
ds = bq.create_dataset(dataset = ds, exists_ok = True) 

---
## BigQuery ML: Connect To Vertex AI LLMs with ML.GENERATE_TEXT

BigQuery ML can create models (`CREATE MODEL`) that are actually connections to remote models. [Reference](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-remote-model)

Using the `REMOTE_SERVICE_TYPE = "CLOUD_AI_LARGE_LANGUAGE_MODEL_V1"` option will link to LLMs in Vertex AI!
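
For orientation, the SQL form of such a remote model might look like the statement built below. This is a sketch only: the project, dataset, model, and connection names are placeholders, and this notebook creates its model through the BigFrames API rather than by running this statement.

```python
# Placeholder names (assumptions for illustration, not resources from this notebook)
project = 'my-project'
dataset = 'my_dataset'
model = 'my_remote_llm'
connection = f'{project}.us.my_connection'

# Build the CREATE MODEL statement as a string; nothing is sent to BigQuery here
create_model_sql = f"""
CREATE OR REPLACE MODEL `{project}.{dataset}.{model}`
REMOTE WITH CONNECTION `{connection}`
OPTIONS (REMOTE_SERVICE_TYPE = 'CLOUD_AI_LARGE_LANGUAGE_MODEL_V1')
"""
print(create_model_sql)
```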

### Connection Requirement

To make a remote connection using BigQuery ML, BigQuery uses a CLOUD_RESOURCE connection. [Reference](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-remote-model#connection)

Make sure the [BigQuery Connection API](https://cloud.google.com/bigquery/docs/create-cloud-resource-connection) is enabled:

In [12]:
!gcloud services enable bigqueryconnection.googleapis.com

Create a connection of type `CLOUD_RESOURCE`. First, check for an existing connection and create a new one only if none is found.

In [13]:
try:
    response = bq_connection.ConnectionServiceClient().get_connection(
            request = bq_connection.GetConnectionRequest(
                name = f"projects/{BQ_PROJECT}/locations/{BQ_REGION}/connections/{SERIES}_{EXPERIMENT}"
            )
    )
    print(f'Found existing connection with service account: {response.cloud_resource.service_account_id}')
    service_account = response.cloud_resource.service_account_id
except Exception:
    request = bq_connection.CreateConnectionRequest(
        {
            "parent": f"projects/{BQ_PROJECT}/locations/{BQ_REGION}",
            "connection_id": f"{SERIES}_{EXPERIMENT}",
            "connection": bq_connection.types.Connection(
                {
                    "friendly_name": f"{SERIES}_{EXPERIMENT}",
                    "cloud_resource": bq_connection.CloudResourceProperties({})
                }
            )
        }
    )
    response = bq_connection.ConnectionServiceClient().create_connection(request)
    print(f'Created new connection with service account: {response.cloud_resource.service_account_id}')
    service_account = response.cloud_resource.service_account_id
    # assign the service account the Vertex AI User Role:
    !gcloud projects add-iam-policy-binding {BQ_PROJECT} --member=serviceAccount:{service_account} --role=roles/aiplatform.user

Found existing connection with service account: bqcx-1026793852137-pdxa@gcp-sa-bigquery-condel.iam.gserviceaccount.com


**NOTE**: If the step above created a new connection, it also created a service account and assigned it the Vertex AI User role.  This may take a moment to be recognized in the steps below.  If one of the cells below returns an error, try rerunning it.
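
Rerunning a cell by hand can also be automated with a small retry wrapper. The sketch below is an assumption-laden illustration: `retry` and `flaky` are hypothetical helpers, and the wait times are arbitrary rather than documented propagation delays.

```python
import time

def retry(fn, attempts=5, wait_seconds=10, sleep=time.sleep):
    """Call fn() until it succeeds or attempts run out, pausing between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as error:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            print(f'attempt {attempt} failed ({error}); retrying in {wait_seconds}s')
            sleep(wait_seconds)

# Stand-in for a call that fails until the role assignment is recognized
calls = {'count': 0}

def flaky():
    calls['count'] += 1
    if calls['count'] < 3:
        raise PermissionError('role not yet recognized')
    return 'ok'

# sleep is replaced with a no-op so the demonstration runs instantly
result = retry(flaky, wait_seconds=0, sleep=lambda seconds: None)
print(result)
```

In the notebook, `fn` could be a lambda wrapping whichever call fails, for example the creation of the remote model.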

### Create The Remote Model In BigQuery

Create a temporary model that connects to a text embedding model on Vertex AI - [Reference](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.ml.llm.PaLM2TextEmbeddingGenerator)

In [14]:
embed_model = bfml.llm.PaLM2TextEmbeddingGenerator(
    session = bf_session,
    connection_name = f'{BQ_PROJECT}.{BQ_REGION}.{SERIES}_{EXPERIMENT}'
)

---
## Create Embeddings

### For Product Descriptions: Name

**NOTE**: The following cell will create embedding requests for all 29k+ values in the `name` column and could take around **10 minutes** to run.

In [15]:
products = products.join(embed_model.predict(products['name']).rename(columns={'text_embedding':'name_embedding'}))

In [16]:
products.head()

Unnamed: 0,id,cost,category,name,brand,retail_price,department,sku,distribution_center_id,name_embedding
0,27569,92.652563,Swim,2XU Men's Swimmers Compression Long Sleeve Top,2XU,150.410004,Men,B23C5765E165D83AA924FA8F13C05F25,1,"[0.021811574697494507, -0.0068705384619534016,..."
1,27445,24.719661,Swim,TYR Sport Men's Square Leg Short Swim Suit,TYR,38.990002,Men,2AB7D3B23574C3DEA2BD278AFD0939AB,1,"[0.04419781640172005, -0.009351101703941822, 0..."
2,27457,15.8976,Swim,TYR Sport Men's Solid Durafast Jammer Swim Suit,TYR,27.6,Men,8F831227B0EB6C6D09A0555531365933,1,"[0.0471641980111599, -0.03273119032382965, 0.0..."
3,27466,17.85,Swim,TYR Sport Men's Swim Short/Resistance Short Sw...,TYR,30.0,Men,67317D6DCC4CB778AEB9219565F5456B,1,"[0.049136240035295486, 0.0037870346568524837, ..."
4,27481,29.408001,Swim,TYR Alliance Team Splice Jammer,TYR,45.950001,Men,213C888198806EF1A0E2BBF2F4855C6C,1,"[0.0008693744894117117, -0.00447087874636054, ..."


### For Level 1 of Product Hierarchy: Department

This step runs quickly because it only creates embedding requests for the unique values of `department`.

In [17]:
department = products['department'].unique().to_frame()
department = department.join(embed_model.predict(department).rename(columns={'text_embedding':'department_embedding'}))

In [18]:
department.head()

Unnamed: 0,department,department_embedding
0,Men,"[-0.04335380345582962, 0.00046764410217292607,..."
13131,Women,"[-0.03191345930099487, -0.006457726005464792, ..."


In [19]:
products = products.merge(department, on = 'department')
products.dtypes

id                                  Int64
cost                              Float64
category                  string[pyarrow]
name                      string[pyarrow]
brand                     string[pyarrow]
retail_price                      Float64
department                string[pyarrow]
sku                       string[pyarrow]
distribution_center_id              Int64
name_embedding                     object
department_embedding               object
dtype: object
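
The pattern used for `department` (embed each distinct value once, then merge the result back onto every row) can be sketched in plain Python. `fake_embed` is a hypothetical stand-in for the embedding model and returns meaningless numbers; the point is only the number of calls.

```python
calls = []

def fake_embed(text):
    """Stand-in for an embedding request; records each call it receives."""
    calls.append(text)
    return [float(len(text)), float(text.count('e'))]

rows = [
    {'id': 1, 'department': 'Men'},
    {'id': 2, 'department': 'Women'},
    {'id': 3, 'department': 'Men'},
]

# One embedding request per distinct value, not per row
unique_values = sorted({row['department'] for row in rows})
lookup = {value: fake_embed(value) for value in unique_values}

# Merge the embeddings back onto every row by department
for row in rows:
    row['department_embedding'] = lookup[row['department']]

print(len(calls), len(rows))
```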

### For Level 2 of Product Hierarchy: Category

This step runs quickly because it only creates embedding requests for the unique values of `category`.

In [20]:
category = products['category'].unique().to_frame()
category = category.join(embed_model.predict(category).rename(columns={'text_embedding':'category_embedding'}))

In [21]:
category.head()

Unnamed: 0,category,category_embedding
0,Swim,"[0.01972227729856968, -0.01661798171699047, -0..."
906,Jeans,"[-0.016635224223136902, 0.0025853165425360203,..."
2023,Pants,"[-7.701403956161812e-05, 0.015807654708623886,..."
3064,Socks,"[0.05488337576389313, 0.006218045484274626, 0...."
3969,Active,"[0.028275124728679657, -0.00869790930300951, 0..."


In [22]:
products = products.merge(category, on = 'category')
products.dtypes

id                                  Int64
cost                              Float64
category                  string[pyarrow]
name                      string[pyarrow]
brand                     string[pyarrow]
retail_price                      Float64
department                string[pyarrow]
sku                       string[pyarrow]
distribution_center_id              Int64
name_embedding                     object
department_embedding               object
category_embedding                 object
dtype: object

### Make BigQuery Tables of Results

The `products`, `department`, and `category` dataframes are currently temporary tables in BigQuery.  To recall these for future use it is best to store them as actual BigQuery tables using the [.to_gbq](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.dataframe.DataFrame#bigframes_dataframe_DataFrame_to_gbq) method as follows.

In [23]:
products.to_gbq(f'{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_products', if_exists = 'replace', index = False)
department.to_gbq(f'{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_department', if_exists = 'replace', index = False)
category.to_gbq(f'{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_category', if_exists = 'replace', index = False)

---
## Prepare Data For ML

### Create product hierarchy set: department

In [24]:
department.dtypes

department              string[pyarrow]
department_embedding             object
dtype: object

In [25]:
department_hierarchy = department[['department', 'department_embedding']].rename(columns = {"department":"hierarchy_node", "department_embedding":"hierarchy_node_embedding"})
department_hierarchy['hierarchy_level'] = 'department'
department_hierarchy.dtypes

hierarchy_node              string[pyarrow]
hierarchy_node_embedding             object
hierarchy_level             string[pyarrow]
dtype: object

In [26]:
department_hierarchy.head()

Unnamed: 0,hierarchy_node,hierarchy_node_embedding,hierarchy_level
0,Men,"[-0.04335380345582962, 0.00046764410217292607,...",department
13131,Women,"[-0.03191345930099487, -0.006457726005464792, ...",department


### Create product hierarchy set: category

In [27]:
category.dtypes

category              string[pyarrow]
category_embedding             object
dtype: object

In [28]:
category_hierarchy = category[['category', 'category_embedding']].rename(columns = {"category":"hierarchy_node", "category_embedding":"hierarchy_node_embedding"})
category_hierarchy['hierarchy_level'] = 'category'
category_hierarchy.dtypes

hierarchy_node              string[pyarrow]
hierarchy_node_embedding             object
hierarchy_level             string[pyarrow]
dtype: object

In [29]:
category_hierarchy.head()

Unnamed: 0,hierarchy_node,hierarchy_node_embedding,hierarchy_level
0,Swim,"[0.01972227729856968, -0.01661798171699047, -0...",category
906,Jeans,"[-0.016635224223136902, 0.0025853165425360203,...",category
2023,Pants,"[-7.701403956161812e-05, 0.015807654708623886,...",category
3064,Socks,"[0.05488337576389313, 0.006218045484274626, 0....",category
3969,Active,"[0.028275124728679657, -0.00869790930300951, 0...",category


### Create product hierarchy: combine department and category

In [30]:
product_hierarchy = bf.concat([department_hierarchy, category_hierarchy])

In [31]:
product_hierarchy.head()

Unnamed: 0,hierarchy_node,hierarchy_node_embedding,hierarchy_level
0,Men,"[-0.04335380345582962, 0.00046764410217292607,...",department
13131,Women,"[-0.03191345930099487, -0.006457726005464792, ...",department
0,Swim,"[0.01972227729856968, -0.01661798171699047, -0...",category
906,Jeans,"[-0.016635224223136902, 0.0025853165425360203,...",category
2023,Pants,"[-7.701403956161812e-05, 0.015807654708623886,...",category


### Create test and train subsets of products

Use the row index of the `products` table to allocate rows to training and test splits:

In [32]:
# retrieve index of all rows
full_index = products.index.to_numpy()
# randomly sort the full index
np.random.shuffle(full_index)
# split the randomly sorted index into 10 sequential parts
split_index = np.split(full_index, 10)
# allocate the first 9 splits (90%) to a training index
train_index = np.concatenate(split_index[0:9])
# allocate the last split (10%) to a test_index
test_index = split_index[9]

# print out the sizes of the indexes:
full_index.shape[0], train_index.shape[0], test_index.shape[0]

(29120, 26208, 2912)
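
The same 90/10 allocation can be sketched with the standard library alone. The fixed seed is an assumption added here for repeatability; the cell above shuffles without one, so its split changes on every run.

```python
import random

n_rows = 29120  # row count reported above
index = list(range(n_rows))

# Seeded generator so the shuffle is repeatable (assumption, not in the notebook)
rng = random.Random(42)
rng.shuffle(index)

# First 10% of the shuffled positions go to test, the rest to train
n_test = n_rows // 10
test_index = index[:n_test]
train_index = index[n_test:]

print(len(index), len(train_index), len(test_index))
```

Slicing also works when the row count is not a multiple of 10, whereas `np.split` with an integer number of sections raises an error if the array cannot be divided evenly.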

In [33]:
train_products = products[['id', 'name', 'name_embedding', 'category', 'department']].iloc[train_index.tolist()]
test_products = products[['id', 'name', 'name_embedding', 'category', 'department']].iloc[test_index.tolist()]

### Create Training And Test Data: Explode product data by crossing with product hierarchy

In [34]:
train_products = train_products.merge(product_hierarchy, how = 'cross')

ValueError: Must specify `on` or `left_on` + `right_on`.

**NOTE**: The cross join does not seem to be available yet in the BigFrames API ([Reference](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.pandas#bigframes_pandas_merge)).  As a workaround, add a dummy column to both tables and do an inner join on it.
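
To see why a constant join key gives the same result as a cross join, here is a plain-Python sketch. The `inner_join` helper and the tiny tables are illustrative assumptions; the notebook itself performs the join with BigFrames.

```python
from itertools import product

products = [{'id': 4348, 'name': "Levi's Women's Petite 515 Boot Cut Jean"}]
hierarchy = [
    {'hierarchy_node': 'Men', 'hierarchy_level': 'department'},
    {'hierarchy_node': 'Women', 'hierarchy_level': 'department'},
    {'hierarchy_node': 'Jeans', 'hierarchy_level': 'category'},
]

def inner_join(left, right, key):
    """Naive inner join: keep every pair of rows whose key values match."""
    return [
        {**left_row, **right_row}
        for left_row in left
        for right_row in right
        if left_row[key] == right_row[key]
    ]

# Add the same constant to both sides, join on it, then drop it
left = [{**row, 'dummy': 1} for row in products]
right = [{**row, 'dummy': 1} for row in hierarchy]
joined = [
    {k: v for k, v in row.items() if k != 'dummy'}
    for row in inner_join(left, right, 'dummy')
]

# A true cross join (Cartesian product) for comparison
crossed = [{**p, **h} for p, h in product(products, hierarchy)]

print(joined == crossed)
```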

In [36]:
train_products['dummy'] = 1
test_products['dummy'] = 1
product_hierarchy['dummy'] = 1

In [37]:
train_products = train_products.merge(product_hierarchy, how = 'inner', on = 'dummy').drop('dummy', axis = 1)
test_products = test_products.merge(product_hierarchy, how = 'inner', on = 'dummy').drop('dummy', axis = 1)
product_hierarchy = product_hierarchy.drop('dummy', axis = 1)

In [38]:
train_products.head()

Unnamed: 0,id,name,name_embedding,category,department,hierarchy_node,hierarchy_node_embedding,hierarchy_level
0,4348,Levi's Women's Petite 515 Boot Cut Jean,"[-0.014130963012576103, -0.0026561019476503134...",Jeans,Women,Men,"[-0.04335380345582962, 0.00046764410217292607,...",department
1,4348,Levi's Women's Petite 515 Boot Cut Jean,"[-0.014130963012576103, -0.0026561019476503134...",Jeans,Women,Women,"[-0.03191345930099487, -0.006457726005464792, ...",department
2,4348,Levi's Women's Petite 515 Boot Cut Jean,"[-0.014130963012576103, -0.0026561019476503134...",Jeans,Women,Swim,"[0.01972227729856968, -0.01661798171699047, -0...",category
3,4348,Levi's Women's Petite 515 Boot Cut Jean,"[-0.014130963012576103, -0.0026561019476503134...",Jeans,Women,Jeans,"[-0.016635224223136902, 0.0025853165425360203,...",category
4,4348,Levi's Women's Petite 515 Boot Cut Jean,"[-0.014130963012576103, -0.0026561019476503134...",Jeans,Women,Pants,"[-7.701403956161812e-05, 0.015807654708623886,...",category


### Add Label To Training And Test Data

In [39]:
train_products_yes = train_products[((train_products['category'] == train_products['hierarchy_node']) & (train_products['hierarchy_level'] == 'category')) | ((train_products['department'] == train_products['hierarchy_node']) & (train_products['hierarchy_level'] == 'department'))]
train_products_yes['label'] = 1
train_products_no = train_products[((train_products['category'] != train_products['hierarchy_node']) & (train_products['hierarchy_level'] == 'category')) | ((train_products['department'] != train_products['hierarchy_node']) & (train_products['hierarchy_level'] == 'department'))]
train_products_no['label'] = 0

train_products = bf.concat([train_products_yes, train_products_no])

In [40]:
test_products_yes = test_products[((test_products['category'] == test_products['hierarchy_node']) & (test_products['hierarchy_level'] == 'category')) | ((test_products['department'] == test_products['hierarchy_node']) & (test_products['hierarchy_level'] == 'department'))]
test_products_yes['label'] = 1
test_products_no = test_products[((test_products['category'] != test_products['hierarchy_node']) & (test_products['hierarchy_level'] == 'category')) | ((test_products['department'] != test_products['hierarchy_node']) & (test_products['hierarchy_level'] == 'department'))]
test_products_no['label'] = 0

test_products = bf.concat([test_products_yes, test_products_no])
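
The filters above encode a single rule: a row is labeled 1 when its `hierarchy_node` equals the product's own value at that `hierarchy_level`, and 0 otherwise. As a compact restatement (a plain-Python sketch, not the BigFrames code the notebook runs), the hypothetical `label_row` helper below relies on the level names matching the column names `department` and `category`. The sample rows mirror the `train_products.head()` output shown earlier.

```python
def label_row(row):
    """Return 1 if the node matches the product's value at that level, else 0."""
    level = row['hierarchy_level']  # 'department' or 'category'
    return int(row[level] == row['hierarchy_node'])

base = {'id': 4348, 'category': 'Jeans', 'department': 'Women'}
rows = [
    {**base, 'hierarchy_node': 'Men', 'hierarchy_level': 'department'},
    {**base, 'hierarchy_node': 'Women', 'hierarchy_level': 'department'},
    {**base, 'hierarchy_node': 'Swim', 'hierarchy_level': 'category'},
    {**base, 'hierarchy_node': 'Jeans', 'hierarchy_level': 'category'},
    {**base, 'hierarchy_node': 'Pants', 'hierarchy_level': 'category'},
]

labels = [label_row(row) for row in rows]
print(labels)
```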

Make BigQuery tables from the temporary tables:

In [41]:
train_products.to_gbq(f'{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_train_products', if_exists = 'replace', index = False)
test_products.to_gbq(f'{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_test_products', if_exists = 'replace', index = False)

---
## Create Models

Build a classifier model from:
- the combination of embedding vectors from the product name (`name`) and the node name from the product hierarchy
- the absolute difference between the two embedding vectors representing the product and the node in the product hierarchy
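
The two feature sets can be pictured with toy three-dimensional vectors (made-up values; the real embeddings have 768 dimensions). This is a sketch of the idea only; the notebook builds the equivalent columns in SQL in the next section.

```python
# Toy embeddings (hypothetical values)
name_embedding = [0.02, -0.01, 0.05]   # product name
node_embedding = [-0.04, 0.00, 0.03]   # hierarchy node

# Feature set 1: both vectors side by side
concatenated = name_embedding + node_embedding

# Feature set 2: element-wise absolute difference
abs_difference = [abs(a - b) for a, b in zip(name_embedding, node_embedding)]

print(len(concatenated), len(abs_difference))
```

With 768-dimensional embeddings the first feature set has 1536 columns and the second has 768.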

### Prepare Train and Test For Model Input

In [43]:
features = ''
for i in range(768):
    features += f""",
    name_embedding[{i}] as name_{i}, hierarchy_node_embedding[{i}] as hier_{i}, ABS(name_embedding[{i}] - hierarchy_node_embedding[{i}]) as adiff_{i} """

test_input_query = f"""
SELECT * {features}
FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_test_products`
WHERE name is not null AND hierarchy_node is not null
"""

train_input_query = f"""
SELECT * {features}
FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_train_products`
WHERE name is not null
    AND ARRAY_LENGTH(name_embedding) > 0
    AND ARRAY_LENGTH(hierarchy_node_embedding) > 0
"""

In [44]:
# print(test_input_query)

In [45]:
test_input = bf.read_gbq(test_input_query)

In [46]:
test_input.shape

(81536, 2313)

In [47]:
train_input = bf.read_gbq(train_input_query)

In [48]:
train_input.shape

(733768, 2313)

In [49]:
train_input.dtypes

id                          Int64
name              string[pyarrow]
name_embedding             object
category          string[pyarrow]
department        string[pyarrow]
                       ...       
hier_766                  Float64
adiff_766                 Float64
name_767                  Float64
hier_767                  Float64
adiff_767                 Float64
Length: 2313, dtype: object
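
The reported shapes can be checked with a little arithmetic. The list of base columns is inferred from the outputs shown earlier plus the added `label` column, so treat it as an assumption.

```python
# Columns carried over from the train/test tables (inferred, see note above)
base_columns = [
    'id', 'name', 'name_embedding', 'category', 'department',
    'hierarchy_node', 'hierarchy_node_embedding', 'hierarchy_level', 'label',
]
dimensions = 768
derived_per_dimension = 3  # name_i, hier_i, adiff_i
n_columns = len(base_columns) + dimensions * derived_per_dimension

# Each product is crossed with every hierarchy node
n_nodes = 2 + 26  # departments + categories listed earlier
test_rows = 2912 * n_nodes
train_rows_before_filter = 26208 * n_nodes
missing_train_rows = train_rows_before_filter - 733768  # 733768 is the reported row count
missing_products = missing_train_rows // n_nodes

print(n_columns, test_rows, train_rows_before_filter, missing_products)
```

The column count and the test row count match the shapes above. The training table is 56 rows short of the full cross product, which is consistent with two products being removed by the `WHERE` filters in `train_input_query`, although the notebook does not confirm the cause.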

### Train Classifier: Concatenate Embeddings as Features

In [50]:
classifier_model = bfml.ensemble.XGBClassifier()

In [51]:
features = [col for col in train_input.columns if (col.startswith('name_') or col.startswith('hier_')) and col != 'name_embedding']

In [52]:
classifier_model.fit(X = train_input[features], y = train_input['label'])

XGBClassifier()

In [53]:
classifier_model.to_gbq(f'{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_classifier', replace = True)

XGBClassifier(booster='GBTREE', tree_method='AUTO')

In [54]:
classifier_model.get_params()

{'booster': 'gbtree',
 'colsample_bylevel': 1.0,
 'colsample_bynode': 1.0,
 'colsample_bytree': 1.0,
 'dart_normalized_type': 'TREE',
 'early_stop': True,
 'enable_global_explain': False,
 'gamma': 0.0,
 'learning_rate': 0.3,
 'max_depth': 6,
 'max_iterations': 20,
 'min_rel_progress': 0.01,
 'min_tree_child_weight': 1,
 'num_parallel_tree': 1,
 'reg_alpha': 0.0,
 'reg_lambda': 1.0,
 'subsample': 1.0,
 'tree_method': 'auto',
 'xgboost_version': '0.9'}

In [55]:
classifier_model.predict(X = test_input.head())

Unnamed: 0,predicted_label
0,0
1,0
2,0
3,0
4,0
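
The five predictions above are all 0. That is consistent with how the data was built: each product is paired with every node, and only its own department and its own category are labeled 1. A rough sketch of the resulting class balance, using the department and category counts listed earlier (an inference from those outputs, not a figure reported by the notebook):

```python
n_departments = 2
n_categories = 26

rows_per_product = n_departments + n_categories  # one row per hierarchy node
positives_per_product = 2                         # own department + own category

positive_rate = positives_per_product / rows_per_product
print(f'{positive_rate:.3f}')
```

With roughly one positive row in fourteen, a handful of rows predicted as 0 says little about model quality on its own.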


### Train Classifier: Absolute Difference Between Embeddings as Features

In [56]:
classifier_model_adiff = bfml.ensemble.XGBClassifier()

In [57]:
features = [col for col in train_input.columns if col.startswith('adiff_')]

In [58]:
classifier_model_adiff.fit(X = train_input[features], y = train_input['label'])

XGBClassifier()

In [59]:
classifier_model_adiff.to_gbq(f'{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_classifier_adiff', replace = True)

XGBClassifier(booster='GBTREE', tree_method='AUTO')

In [60]:
classifier_model_adiff.get_params()

{'booster': 'gbtree',
 'colsample_bylevel': 1.0,
 'colsample_bynode': 1.0,
 'colsample_bytree': 1.0,
 'dart_normalized_type': 'TREE',
 'early_stop': True,
 'enable_global_explain': False,
 'gamma': 0.0,
 'learning_rate': 0.3,
 'max_depth': 6,
 'max_iterations': 20,
 'min_rel_progress': 0.01,
 'min_tree_child_weight': 1,
 'num_parallel_tree': 1,
 'reg_alpha': 0.0,
 'reg_lambda': 1.0,
 'subsample': 1.0,
 'tree_method': 'auto',
 'xgboost_version': '0.9'}

In [61]:
classifier_model_adiff.predict(X = test_input.head())

Unnamed: 0,predicted_label
0,0
1,0
2,0
3,0
4,0


---
---
---
# NOTE: Switching To BigQuery API

Working with the model in BigQuery ML requires some features that are not yet in the BigFrames API.  The following section switches to the BigQuery API and uses SQL to retrieve results to local Pandas dataframes.

**EXAMPLE**

In the section above the `.predict()` method was used to retrieve predictions.  The only column returned is the predicted label.  For this project the probability for each class label is needed to infer the right placement in the product hierarchy.

In [62]:
# save model input to BigQuery Tables
train_input.to_gbq(f'{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_train_input', if_exists = 'replace', index = False)
test_input.to_gbq(f'{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_test_input', if_exists = 'replace', index = False)

---
## Model Inference: Develop Approach With 1 Instance

Request predictions.  Because each product is paired with every node of the hierarchy, the approach is developed iteratively, starting with a single row.

In [63]:
query = f"""
SELECT *
FROM ML.PREDICT (
    MODEL `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_classifier`,
    (
        SELECT *
        FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_test_input`
        LIMIT 1
    )
)
"""

In [64]:
pred = bq.query(query = query).to_dataframe()

In [65]:
pred

Unnamed: 0,predicted_label,predicted_label_probs,id,name,name_embedding,category,department,hierarchy_node,hierarchy_node_embedding,hierarchy_level,...,adiff_764,name_765,hier_765,adiff_765,name_766,hier_766,adiff_766,name_767,hier_767,adiff_767
0,0,"[{'label': 1, 'prob': 0.0018603693461045623}, ...",12163,ICOLLECTION LINGERIE 8066 Underwired sheer mes...,"[0.01873893104493618, 0.00019954226445406675, ...",Intimates,Women,Suits,"[-0.02044973149895668, -0.05227809399366379, 0...",category,...,0.01076,0.008253,0.026246,0.017993,-0.014222,0.007747,0.021968,0.013075,-0.037425,0.0505


**Note**

The full set of columns is returned along with the predicted label and predicted probabilities for each class label. This can be reduced to just the columns related to the label, hierarchy, and the predicted probability of membership.
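
In plain Python terms, this reduction means flattening the `predicted_label_probs` array and keeping only the entry for label 1. The sketch below uses two made-up rows; the label-0 probabilities are filled in as one minus the label-1 value purely for illustration.

```python
rows = [
    {'id': 11634, 'hierarchy_node': 'Tops & Tees',
     'predicted_label_probs': [{'label': 1, 'prob': 0.0986}, {'label': 0, 'prob': 0.9014}]},
    {'id': 11634, 'hierarchy_node': 'Dresses',
     'predicted_label_probs': [{'label': 1, 'prob': 0.0143}, {'label': 0, 'prob': 0.9857}]},
]

# One output row per (input row, array entry), filtered to label 1
flattened = [
    {'id': row['id'], 'hierarchy_node': row['hierarchy_node'], 'prob': entry['prob']}
    for row in rows
    for entry in row['predicted_label_probs']
    if entry['label'] == 1
]

print(flattened)
```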

In [76]:
query = f"""
SELECT *
FROM (
    SELECT id, predicted_label_probs, label, name, category, department, hierarchy_level, hierarchy_node
    FROM ML.PREDICT (
        MODEL `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_classifier`,
        (
            WITH limiter AS (SELECT id FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_test_input` LIMIT 1)
            SELECT *
            FROM limiter
            LEFT OUTER JOIN `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_test_input`
            USING(id)
        )
    )
) p
CROSS JOIN UNNEST(p.predicted_label_probs) as probs
WHERE probs.label = 1
"""

In [77]:
pred = bq.query(query = query).to_dataframe()
pred

| | id | predicted_label_probs | label | name | category | department | hierarchy_level | hierarchy_node | label_1 | prob |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11634 | [{'label': 1, 'prob': 0.09856118261814117}, {'... | 0 | Cinema Etoile Women's Square Neck Cami | Intimates | Women | category | Tops & Tees | 1 | 0.098561 |
| 1 | 11634 | [{'label': 1, 'prob': 0.014292235486209393}, {... | 0 | Cinema Etoile Women's Square Neck Cami | Intimates | Women | category | Dresses | 1 | 0.014292 |
| 2 | 11634 | [{'label': 1, 'prob': 0.042169298976659775}, {... | 0 | Cinema Etoile Women's Square Neck Cami | Intimates | Women | category | Swim | 1 | 0.042169 |
| 3 | 11634 | [{'label': 1, 'prob': 0.02777058258652687}, {'... | 0 | Cinema Etoile Women's Square Neck Cami | Intimates | Women | category | Underwear | 1 | 0.027771 |
| 4 | 11634 | [{'label': 1, 'prob': 0.06264296919107437}, {'... | 0 | Cinema Etoile Women's Square Neck Cami | Intimates | Women | category | Active | 1 | 0.062643 |
| 5 | 11634 | [{'label': 1, 'prob': 0.10970967262983322}, {'... | 0 | Cinema Etoile Women's Square Neck Cami | Intimates | Women | category | Sweaters | 1 | 0.10971 |
| 6 | 11634 | [{'label': 1, 'prob': 0.003362766932696104}, {... | 0 | Cinema Etoile Women's Square Neck Cami | Intimates | Women | category | Leggings | 1 | 0.003363 |
| 7 | 11634 | [{'label': 1, 'prob': 0.020120663568377495}, {... | 0 | Cinema Etoile Women's Square Neck Cami | Intimates | Women | category | Plus | 1 | 0.020121 |
| 8 | 11634 | [{'label': 1, 'prob': 0.044033437967300415}, {... | 0 | Cinema Etoile Women's Square Neck Cami | Intimates | Women | category | Shorts | 1 | 0.044033 |
| 9 | 11634 | [{'label': 1, 'prob': 0.039312850683927536}, {... | 0 | Cinema Etoile Women's Square Neck Cami | Intimates | Women | category | Sleep & Lounge | 1 | 0.039313 |
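
The `CROSS JOIN UNNEST(...) WHERE probs.label = 1` step above can also be reproduced on the client side once a prediction frame is in pandas. The following is a minimal sketch under stated assumptions, not part of the original workflow: the helper name `positive_class_prob`, the renamed column `class_label`, and the tiny input frame with made-up probabilities are all hypothetical, and `predicted_label_probs` is assumed to arrive as a list of `{'label', 'prob'}` dictionaries per row.

```python
import pandas as pd

# Hypothetical miniature of an ML.PREDICT result: one row per (id, hierarchy_node),
# with predicted_label_probs holding one {'label', 'prob'} entry per class.
# The probabilities are made up for illustration.
pred_small = pd.DataFrame({
    "id": [11634, 11634],
    "hierarchy_node": ["Tops & Tees", "Dresses"],
    "predicted_label_probs": [
        [{"label": 1, "prob": 0.10}, {"label": 0, "prob": 0.90}],
        [{"label": 1, "prob": 0.01}, {"label": 0, "prob": 0.99}],
    ],
})


def positive_class_prob(df, positive_label=1):
    """Flatten predicted_label_probs and keep only the positive-class rows."""
    # One output row per (input row, class entry), like CROSS JOIN UNNEST.
    exploded = df.explode("predicted_label_probs", ignore_index=True)
    # Turn each {'label', 'prob'} dictionary into two columns.
    structs = pd.json_normalize(exploded["predicted_label_probs"].tolist())
    # Rename to avoid clashing with a true-label column named 'label'.
    structs = structs.rename(columns={"label": "class_label"})
    flat = pd.concat(
        [exploded.drop(columns="predicted_label_probs"), structs], axis=1
    )
    # Equivalent of WHERE probs.label = 1.
    return flat[flat["class_label"] == positive_label].reset_index(drop=True)


flat = positive_class_prob(pred_small)
print(flat)
```

Doing the flattening in SQL, as the notebook does, keeps the work in BigQuery; the pandas version is mainly useful for inspecting a small sample.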


**Note**

For the selected `id` value, what is the predicted placement in the hierarchy?

This requires making inferences down the structure of the hierarchy:
- What is the highest predicted probability for `department`?
- What is the highest predicted probability for `category` within the selected `department`?
- Is the answer right or wrong?
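
The top-down selection described in the list above can be sketched in pandas. This is an illustrative sketch rather than the notebook's implementation: the function name `top_down_placement`, the input frames, and all probabilities are hypothetical. It assumes a long-format frame of positive-class probabilities with columns `id`, `hierarchy_level`, `hierarchy_node`, and `prob`, plus a catalog of valid `department`/`category` pairs.

```python
import pandas as pd


def top_down_placement(probs, catalog):
    """Pick the best department per id, then the best category inside it."""
    slim = probs[["id", "hierarchy_level", "hierarchy_node", "prob"]]

    # Step 1: highest-probability department for each id.
    dept = slim[slim["hierarchy_level"] == "department"].reset_index(drop=True)
    dept = dept.loc[dept.groupby("id")["prob"].idxmax(),
                    ["id", "hierarchy_node", "prob"]]
    dept = dept.rename(columns={"hierarchy_node": "pred_department",
                                "prob": "pred_department_prob"})

    # Step 2: category rows for the same id, restricted to categories that
    # exist under the predicted department.
    cat = slim[slim["hierarchy_level"] == "category"]
    cat = cat.merge(dept[["id", "pred_department"]], on="id")
    cat = cat.merge(catalog,
                    left_on=["pred_department", "hierarchy_node"],
                    right_on=["department", "category"])
    cat = cat.loc[cat.groupby("id")["prob"].idxmax(),
                  ["id", "hierarchy_node", "prob"]]
    cat = cat.rename(columns={"hierarchy_node": "pred_category",
                              "prob": "pred_category_prob"})

    return dept.merge(cat, on="id", how="left")


# Hypothetical probabilities for two products.
probs_demo = pd.DataFrame({
    "id": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "hierarchy_level": ["department", "department",
                        "category", "category", "category"] * 2,
    "hierarchy_node": ["Women", "Men",
                       "Swim", "Intimates", "Suits & Sport Coats"] * 2,
    "prob": [0.9, 0.1, 0.3, 0.6, 0.8, 0.2, 0.7, 0.5, 0.9, 0.4],
})
catalog_demo = pd.DataFrame({
    "department": ["Women", "Women", "Men", "Men"],
    "category": ["Swim", "Intimates", "Swim", "Suits & Sport Coats"],
})

placed = top_down_placement(probs_demo, catalog_demo)
print(placed)
```

In this sketch, product 1 has its highest category probability on a node outside the predicted department, so that node is excluded and the best category within the department is chosen instead. Matching on both `id` and the candidate category is the design choice that keeps one product's probabilities from leaking into another's.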

In [90]:
query = f"""
WITH
probs AS (
    SELECT *
    FROM (
        SELECT id, predicted_label_probs, label, name, category, department, hierarchy_level, hierarchy_node
        FROM ML.PREDICT (
            MODEL `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_classifier`,
            (
                WITH limiter AS (SELECT id FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_test_input` LIMIT 10)
                SELECT *
                FROM limiter
                LEFT OUTER JOIN `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_test_input`
                USING(id)
            )
        )
    ) p
    CROSS JOIN UNNEST(p.predicted_label_probs) as probs
    WHERE probs.label = 1
),
department AS (
    SELECT id, name, category, department, hierarchy_node as pred_department, prob as pred_department_prob
    FROM probs
    WHERE hierarchy_level = 'department'
    QUALIFY row_number() OVER (PARTITION BY id ORDER by prob DESC) = 1
),
category AS (
    SELECT c.*, probs.prob as pred_category_prob
    FROM (
        SELECT department.*, cl.pred_category
        FROM department
        LEFT JOIN (SELECT DISTINCT department, category as pred_category FROM `bigquery-public-data.thelook_ecommerce.products`) cl
        ON department.pred_department = cl.department
    ) c
    LEFT OUTER JOIN probs
    ON c.pred_department = probs.hierarchy_node
    QUALIFY row_number() OVER (PARTITION BY id ORDER by prob DESC) = 1
)

SELECT *
FROM category
"""
pred = bq.query(query = query).to_dataframe()
pred

| | id | name | category | department | pred_department | pred_department_prob | pred_category | pred_category_prob |
|---|---|---|---|---|---|---|---|---|
| 0 | 2933 | Moving Comfort Women's 7.5-Inch Compression Short | Active | Women | Women | 0.897995 | Plus | 0.985298 |
| 1 | 20360 | FMF Stacked Full Zip Jacket - Black | Suits & Sport Coats | Men | Men | 0.89526 | Swim | 0.968642 |
| 2 | 12489 | Laura High Quality Sexy Red Bra Thong SET #SL1... | Intimates | Women | Women | 0.985298 | Plus | 0.985298 |
| 3 | 3027 | Cuddl Duds Women's Long Sleeve V-Neck Flexifit... | Active | Women | Women | 0.972547 | Plus | 0.985298 |
| 4 | 19900 | GUESS Resort Sateen Blazer | Suits & Sport Coats | Men | Men | 0.679085 | Swim | 0.968642 |
| 5 | 19847 | Men's Single Breasted Two Button Navy Blazer | Suits & Sport Coats | Men | Men | 0.929219 | Swim | 0.968642 |
| 6 | 20632 | Wrangler Men's Cowboy Cut Original Fit Jean | Jeans | Men | Men | 0.954558 | Swim | 0.968642 |
| 7 | 8566 | Ed Hardy Womens Skull Roses Puffer Jacket -Gold | Outerwear & Coats | Women | Women | 0.9459 | Plus | 0.985298 |
| 8 | 12777 | Womens MW Tankini Boyshorts Swimsuit Swimwear ... | Swim | Women | Women | 0.960058 | Plus | 0.985298 |
| 9 | 28659 | Men's Leather Driving Gloves with Velcro Strap... | Accessories | Men | Men | 0.968642 | Swim | 0.968642 |
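
In the output above, `pred_category_prob` takes only two values: 0.985298 for every `Women` row and 0.968642 for every `Men` row, each equal to the largest `pred_department_prob` shown for that department. This pattern suggests that the last join in the `category` step matches `probs` on `hierarchy_node = pred_department` without also matching on `id` and on the candidate category, so category-level probabilities may not be compared yet. The following is a small sketch of how such a result could be checked and scored in pandas. The helper names are hypothetical, and the rows are copied from the output above.

```python
import pandas as pd

# Four rows copied from the result shown above (indexes 0, 1, 2 and 9).
pred_rows = pd.DataFrame({
    "id": [2933, 20360, 12489, 28659],
    "category": ["Active", "Suits & Sport Coats", "Intimates", "Accessories"],
    "department": ["Women", "Men", "Women", "Men"],
    "pred_department": ["Women", "Men", "Women", "Men"],
    "pred_department_prob": [0.897995, 0.89526, 0.985298, 0.968642],
    "pred_category": ["Plus", "Swim", "Plus", "Swim"],
    "pred_category_prob": [0.985298, 0.968642, 0.985298, 0.968642],
})


def score_placements(df):
    """Flag whether each level of the predicted placement matches the truth."""
    out = df.copy()
    out["department_correct"] = out["department"] == out["pred_department"]
    # A category only counts as correct when its department is correct too.
    out["category_correct"] = out["department_correct"] & (
        out["category"] == out["pred_category"]
    )
    return out


def category_prob_constant_by_department(df):
    """Symptom check: one pred_category_prob value per predicted department."""
    counts = df.groupby("pred_department")["pred_category_prob"].nunique()
    return bool((counts <= 1).all())


scored = score_placements(pred_rows)
print(scored[["id", "department_correct", "category_correct"]])
print(category_prob_constant_by_department(pred_rows))
```

For these rows every department is predicted correctly and no category is, which is consistent with the category step returning a department-level probability rather than a per-product category probability.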


In [151]:
# clean up: delete the BigQuery resource connection used by this notebook
bq_connection.ConnectionServiceClient().delete_connection(name = f"projects/{BQ_PROJECT}/locations/{BQ_REGION}/connections/{SERIES}_{EXPERIMENT}")