# Vertex AI GenAI For BigQuery Metadata - Make Better Tables

BigQuery tables are a great source of information for generative AI applications.  Retrieving information is a multi-step process as covered in [Vertex AI GenAI For BigQuery Q&A - Overview](./Vertex%20AI%20GenAI%20For%20BigQuery%20Q&A%20-%20Overview.ipynb).  The ability of a large language model to understand the contents of tables directly relies on the descriptiveness of the metadata: column names, column descriptions, table names, table descriptions.  

This notebook shows one possible workflow for creating better, more descriptive metadata for BigQuery tables by using the existing metadata together with common values from each table's columns.

The notebook uses the BigFrames API for BigQuery, which offers a pandas-like API for local work while keeping execution remote, within BigQuery.  The LLM used here is Vertex AI [text-bison](https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/text) called directly from BigQuery using [ML.GENERATE_TEXT](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-generate-text).



TODO:
- remove unneeded imports and clients

```
SELECT STRING_AGG(value)
FROM UNNEST(
  (SELECT APPROX_TOP_COUNT(openfda_substance_name, 10) as osn
  FROM `bigquery-public-data.fda_drug.drug_label`)
)
```

---
## Colab Setup

To run this notebook in Colab click [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/Vertex%20AI%20GenAI%20For%20BigQuery%20Metadata%20-%20Make%20Better%20Tables.ipynb) and run the cells in this section.  Otherwise, skip this section.

This cell will authenticate to GCP (follow prompts in the popup).

In [1]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [2]:
try:
    import google.colab
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

---
## Installs and API Enablement

The client packages may need to be installed in this environment.  Also, the BigQuery Connection API needs to be enabled (if not already enabled).

### Installs (If Needed)
The list `packages` contains tuples of package import names and install names.  If the import name is not found then the install name is used to install quietly for the current user.

In [3]:
# tuples of (import name, install name)
packages = [
    ('bigframes', 'bigframes'),
    ('google.cloud.aiplatform', 'google-cloud-aiplatform'),
    ('google.cloud.bigquery_connection_v1', 'google-cloud-bigquery-connection')
]

import importlib.util # import the submodule explicitly so importlib.util.find_spec is available
install = False
for package in packages:
    if not importlib.util.find_spec(package[0]):
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user
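
The check above can be sketched as a small helper. This is a minimal sketch, not part of the original notebook: the function name `is_importable` is hypothetical. It also covers a case the loop above does not, where a dotted import name such as `google.cloud.aiplatform` has a missing parent package, and `find_spec` raises `ModuleNotFoundError` rather than returning `None`.

```python
import importlib.util


def is_importable(import_name: str) -> bool:
    """Return True if a module spec can be found for `import_name`."""
    try:
        return importlib.util.find_spec(import_name) is not None
    except ModuleNotFoundError:
        # find_spec imports parent packages first; a missing parent raises
        # instead of returning None, so treat that as "not installed"
        return False


# a standard library module is always found
print(is_importable('json'))
# a dotted name whose parent package does not exist is reported as missing
print(is_importable('no_such_parent_pkg_xyz.child'))
```

With a helper like this, the install loop would test `if not is_importable(package[0])` before calling `pip`.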

### API Enablement

Make sure the [BigQuery Connection API](https://cloud.google.com/bigquery/docs/create-cloud-resource-connection) is enabled:

In [5]:
!gcloud services enable bigqueryconnection.googleapis.com

### Restart Kernel (If Installs Occurred)

After a kernel restart the code submission can start with the next cell after this one.

In [6]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

---
## Setup

In [7]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [8]:
REGION = 'us-central1'
SERIES = 'applied-genai'
EXPERIMENT = 'bq-metadata'

In [9]:
# make this the BQ Project / Dataset / Table prefix to store results
BQ_PROJECT = PROJECT_ID
BQ_DATASET = SERIES.replace('-', '_')
BQ_TABLE = EXPERIMENT
BQ_REGION = REGION[0:2] # subset to first two characters for multi-region
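
Slicing the first two characters works for `us-central1`, but it would turn a region such as `asia-east1` into `as`, which is not a BigQuery location. A slightly more defensive mapping might look like the sketch below. The helper name `bq_location_for` is hypothetical, and the sketch assumes that only the `US` and `EU` multi-regions are of interest; any other region is passed through unchanged.

```python
def bq_location_for(region: str) -> str:
    """Map a Vertex AI region to a BigQuery location (multi-region when one applies)."""
    prefix = region.split('-')[0].lower()
    # assumption: only the US and EU multi-regions are considered here
    multi_regions = {'us': 'us', 'europe': 'eu'}
    # fall back to the single region itself when no multi-region applies
    return multi_regions.get(prefix, region)


for region in ['us-central1', 'europe-west4', 'asia-east1']:
    print(region, '->', bq_location_for(region))
```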

In [10]:
#import json
#import numpy as np
#import pandas as pd
#from sklearn import metrics
#import matplotlib.pyplot as plt
#import vertexai.language_models
import bigframes.pandas as bf
import bigframes.ml as bfml
from bigframes.ml import llm
#from bigframes.ml import model_selection
from bigframes.ml import ensemble
from google.cloud import bigquery_connection_v1 as bq_connection
#from google.cloud import bigquery

In [11]:
#vertexai.init(project = PROJECT_ID, location = REGION)
#bq = bigquery.Client(project = PROJECT_ID)
bf.reset_session()
bf.options.bigquery.project = BQ_PROJECT
bf.options.bigquery.location = BQ_REGION
bf_session = bf.get_global_session()

---
## Review Data Source

The data source here is a product catalog with source:
- BigQuery Public table `bigquery-public-data.thelook_ecommerce.products`.



### Get Table: BigQuery Public Table

In [12]:
products = bf.read_gbq('bigquery-public-data.thelook_ecommerce.products')

In [13]:
products.dtypes

id                                  Int64
cost                              Float64
category                  string[pyarrow]
name                      string[pyarrow]
brand                     string[pyarrow]
retail_price                      Float64
department                string[pyarrow]
sku                       string[pyarrow]
distribution_center_id              Int64
dtype: object

In [14]:
products.describe()

| | id | cost | retail_price | distribution_center_id |
| --- | --- | --- | --- | --- |
| count | 29120.0 | 29120.0 | 29120.0 | 29120.0 |
| mean | 14560.5 | 28.481774 | 59.220164 | 4.982898 |
| std | 8406.364256 | 30.624681 | 65.888927 | 2.901153 |
| min | 1.0 | 0.0083 | 0.02 | 1.0 |
| 25% | 7193.0 | 11.221 | 24.0 | 2.0 |
| 50% | 14540.0 | 19.55775 | 39.990002 | 5.0 |
| 75% | 21695.0 | 34.632 | 69.949997 | 8.0 |
| max | 29120.0 | 557.151002 | 999.0 | 10.0 |


In [17]:
products.head()

| | id | cost | category | name | brand | retail_price | department | sku | distribution_center_id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 27569 | 92.652563 | Swim | 2XU Men's Swimmers Compression Long Sleeve Top | 2XU | 150.410004 | Men | B23C5765E165D83AA924FA8F13C05F25 | 1 |
| 1 | 27445 | 24.719661 | Swim | TYR Sport Men's Square Leg Short Swim Suit | TYR | 38.990002 | Men | 2AB7D3B23574C3DEA2BD278AFD0939AB | 1 |
| 2 | 27457 | 15.8976 | Swim | TYR Sport Men's Solid Durafast Jammer Swim Suit | TYR | 27.6 | Men | 8F831227B0EB6C6D09A0555531365933 | 1 |
| 3 | 27466 | 17.85 | Swim | TYR Sport Men's Swim Short/Resistance Short Sw... | TYR | 30.0 | Men | 67317D6DCC4CB778AEB9219565F5456B | 1 |
| 4 | 27481 | 29.408001 | Swim | TYR Alliance Team Splice Jammer | TYR | 45.950001 | Men | 213C888198806EF1A0E2BBF2F4855C6C | 1 |


### Get Table Info From BigQuery Information Schema: Columns

Retrieve the metadata for the table from Information Schema views like [INFORMATION_SCHEMA.COLUMN_FIELD_PATHS](https://cloud.google.com/bigquery/docs/information-schema-column-field-paths)

**NOTE** When `column_name` is not equal to `field_path` it is because the field is nested within a STRUCT column (called RECORD in table schemas; think dictionary of key:value pairs), which may also be repeated as an ARRAY (think list).  This table has no nested columns, but the workflow could be extended to handle them as well.
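
To illustrate the distinction, here is a small sketch using hypothetical rows shaped like `COLUMN_FIELD_PATHS` output. The `address` column and its fields are invented for illustration and are not part of the `products` table. Filtering on `column_name == field_path` keeps only the top-level entries, which is what the query in this section does.

```python
# hypothetical rows: a plain column plus a STRUCT column with two nested fields
rows = [
    {'column_name': 'id', 'field_path': 'id', 'data_type': 'INT64'},
    {'column_name': 'address', 'field_path': 'address', 'data_type': 'STRUCT<city STRING, zip STRING>'},
    {'column_name': 'address', 'field_path': 'address.city', 'data_type': 'STRING'},
    {'column_name': 'address', 'field_path': 'address.zip', 'data_type': 'STRING'},
]

# top-level columns: the field path is the column itself
top_level = [r['field_path'] for r in rows if r['column_name'] == r['field_path']]

# nested fields: the field path extends past the column name
nested = [r['field_path'] for r in rows if r['column_name'] != r['field_path']]

print(top_level)  # ['id', 'address']
print(nested)     # ['address.city', 'address.zip']
```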

In [64]:
products_columns = bf.read_gbq(f"""
    SELECT *
    FROM `bigquery-public-data.thelook_ecommerce.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS`
    WHERE TABLE_NAME = 'products'
        AND column_name = field_path
""")

In [65]:
products_columns

| | table_catalog | table_schema | table_name | column_name | field_path | data_type | description | collation_name | rounding_mode |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | bigquery-public-data | thelook_ecommerce | products | id | id | INT64 | | | |
| 1 | bigquery-public-data | thelook_ecommerce | products | cost | cost | FLOAT64 | | | |
| 2 | bigquery-public-data | thelook_ecommerce | products | category | category | STRING | | | |
| 3 | bigquery-public-data | thelook_ecommerce | products | name | name | STRING | | | |
| 4 | bigquery-public-data | thelook_ecommerce | products | brand | brand | STRING | | | |
| 5 | bigquery-public-data | thelook_ecommerce | products | retail_price | retail_price | FLOAT64 | | | |
| 6 | bigquery-public-data | thelook_ecommerce | products | department | department | STRING | | | |
| 7 | bigquery-public-data | thelook_ecommerce | products | sku | sku | STRING | | | |
| 8 | bigquery-public-data | thelook_ecommerce | products | distribution_center_id | distribution_center_id | INT64 | | | |


### Get Table Info From BigQuery Information Schema: Table

Retrieve the metadata for the table from Information Schema views like [INFORMATION_SCHEMA.TABLE_OPTIONS](https://cloud.google.com/bigquery/docs/information-schema-table-options)

This view has one row for each option within each table.  Here, only the `OPTION_NAME = 'description'` is needed.

In [28]:
products_table = bf.read_gbq(f"""
    SELECT *
    FROM `bigquery-public-data.thelook_ecommerce.INFORMATION_SCHEMA.TABLE_OPTIONS`
    WHERE TABLE_NAME = 'products'
        AND OPTION_NAME = 'description'
""")

In [29]:
products_table

| | table_catalog | table_schema | table_name | option_name | option_type | option_value |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | bigquery-public-data | thelook_ecommerce | products | description | STRING | "The Look fictitious e-commerce dataset - prod... |
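
The `option_value` shown above begins with a double quote, which suggests the description is stored as a quoted string literal rather than as bare text. Before passing it to an LLM prompt it may help to strip that quoting. The sketch below is one way to do this with the standard library; the helper name and the sample value are hypothetical, and it assumes the literal is simple enough for Python's own literal parser, which does not cover every BigQuery escape sequence.

```python
import ast


def unquote_option_value(option_value: str) -> str:
    """Return the text inside a quoted string literal, or the input unchanged."""
    try:
        value = ast.literal_eval(option_value)
    except (ValueError, SyntaxError):
        # not a parsable literal: leave it as it was
        return option_value
    # only unwrap string literals; other literal types are returned as the original text
    return value if isinstance(value, str) else option_value


# hypothetical description value, quoted the way the output above suggests
print(unquote_option_value('"A hypothetical table description"'))
# values that are not quoted literals pass through untouched
print(unquote_option_value('not a quoted literal'))
```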


### Get Values From Columns: Most common values as examples

Retrieve a sample of common values from each column to use as examples for an LLM to create names and descriptions.

Create syntax for query that will create a row per column with a sample of values from the column.

In [41]:
# build one SELECT per column: the column name plus a comma separated sample of its most common values
selects = [
    f"""SELECT '{col}' AS column_name, STRING_AGG(CAST(value AS STRING)) as column_sample FROM UNNEST((SELECT APPROX_TOP_COUNT({col}, 10) as osn FROM `bigquery-public-data.thelook_ecommerce.products`))"""
    for col in products_columns.column_name.unique().tolist()
]
# stack the per-column rows into a single query
cte = '\nUNION ALL\n'.join(selects)
print(cte)

SELECT 'id' AS column_name, STRING_AGG(CAST(value AS STRING)) as column_sample FROM UNNEST((SELECT APPROX_TOP_COUNT(id, 10) as osn FROM `bigquery-public-data.thelook_ecommerce.products`))
UNION ALL
SELECT 'cost' AS column_name, STRING_AGG(CAST(value AS STRING)) as column_sample FROM UNNEST((SELECT APPROX_TOP_COUNT(cost, 10) as osn FROM `bigquery-public-data.thelook_ecommerce.products`))
UNION ALL
SELECT 'category' AS column_name, STRING_AGG(CAST(value AS STRING)) as column_sample FROM UNNEST((SELECT APPROX_TOP_COUNT(category, 10) as osn FROM `bigquery-public-data.thelook_ecommerce.products`))
UNION ALL
SELECT 'name' AS column_name, STRING_AGG(CAST(value AS STRING)) as column_sample FROM UNNEST((SELECT APPROX_TOP_COUNT(name, 10) as osn FROM `bigquery-public-data.thelook_ecommerce.products`))
UNION ALL
SELECT 'brand' AS column_name, STRING_AGG(CAST(value AS STRING)) as column_sample FROM UNNEST((SELECT APPROX_TOP_COUNT(brand, 10) as osn FROM `bigquery-public-data.thelook_ecommerce.produc
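
The query construction above can also be wrapped in a reusable function. This is a minimal sketch rather than the notebook's own code: the function name, the `top_n` parameter, and the backtick quoting of column names are additions for illustration. `APPROX_TOP_COUNT` returns an array of `value`/`count` structs, so `UNNEST` exposes a `value` field that `STRING_AGG` can concatenate.

```python
from typing import List


def build_sample_query(table: str, columns: List[str], top_n: int = 10) -> str:
    """Build a query with one row per column holding a sample of its most common values."""
    selects = [
        f"SELECT '{col}' AS column_name, "
        f"STRING_AGG(CAST(value AS STRING)) AS column_sample "
        # backticks guard against column names that collide with reserved words
        f"FROM UNNEST((SELECT APPROX_TOP_COUNT(`{col}`, {top_n}) FROM `{table}`))"
        for col in columns
    ]
    return "\nUNION ALL\n".join(selects)


query = build_sample_query(
    'bigquery-public-data.thelook_ecommerce.products',
    ['category', 'brand', 'department'],
)
print(query)
```

The resulting string could be passed to `bf.read_gbq` in the same way as `cte` in this section.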

In [42]:
products_sample = bf.read_gbq(cte)

In [43]:
products_sample

| | column_name | column_sample |
| --- | --- | --- |
| 0 | retail_price | 25,29.989999771118164,19.989999771118164,39.99... |
| 1 | sku | FFFCC1A3964B4AD665FA2F07D7BFD086,FFFB8EF15DE06... |
| 2 | cost | 13.549999985843897,10.750000039115548,12.05000... |
| 3 | name | Wrangler Men's Premium Performance Cowboy Cut ... |
| 4 | department | Women,Men |
| 5 | category | Intimates,Jeans,Tops & Tees,Fashion Hoodies & ... |
| 6 | distribution_center_id | 2,1,3,8,4,9,7,6,5,10 |
| 7 | id | 29120,29119,29118,29117,29116,29115,29114,2911... |
| 8 | brand | Allegra K,Calvin Klein,Carhartt,Hanes,Volcom,N... |


---
## BigQuery ML: Connect To Vertex AI LLMs with ML.GENERATE_TEXT

BigQuery ML's `CREATE MODEL` statement can register a remote model: a model object in BigQuery that forwards requests to a model hosted outside BigQuery. [Reference](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-remote-model)

Using the `REMOTE_SERVICE_TYPE = "CLOUD_AI_LARGE_LANGUAGE_MODEL_V1"` option will link to LLMs in Vertex AI!
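
For orientation, the statement described above might be assembled as in the sketch below. This notebook does not run the DDL directly (BigFrames creates the model later), so treat this as an illustration only: the helper name, model name, and connection name are hypothetical, and the linked reference should be checked for the options currently supported.

```python
def remote_llm_model_ddl(model: str, connection: str) -> str:
    """Assemble a CREATE MODEL statement for a remote LLM (illustrative only)."""
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"REMOTE WITH CONNECTION `{connection}`\n"
        "OPTIONS (REMOTE_SERVICE_TYPE = 'CLOUD_AI_LARGE_LANGUAGE_MODEL_V1')"
    )


# hypothetical names: project.dataset.model and project.location.connection_id
ddl = remote_llm_model_ddl(
    'my-project.applied_genai.bq_metadata_llm',
    'my-project.us.applied-genai_bq-metadata',
)
print(ddl)
```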

### Connection Requirement

To make a remote connection using BigQuery ML, BigQuery uses a CLOUD_RESOURCE connection. [Reference](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-remote-model#connection)

Create a new connection with type `CLOUD_RESOURCE`.  First check for an existing connection, and create one only if none is found.

In [44]:
try:
    response = bq_connection.ConnectionServiceClient().get_connection(
            request = bq_connection.GetConnectionRequest(
                name = f"projects/{BQ_PROJECT}/locations/{BQ_REGION}/connections/{SERIES}_{EXPERIMENT}"
            )
    )
    print(f'Found existing connection with service account: {response.cloud_resource.service_account_id}')
    service_account = response.cloud_resource.service_account_id
except Exception:
    request = bq_connection.CreateConnectionRequest(
        {
            "parent": f"projects/{BQ_PROJECT}/locations/{BQ_REGION}",
            "connection_id": f"{SERIES}_{EXPERIMENT}",
            "connection": bq_connection.types.Connection(
                {
                    "friendly_name": f"{SERIES}_{EXPERIMENT}",
                    "cloud_resource": bq_connection.CloudResourceProperties({})
                }
            )
        }
    )
    response = bq_connection.ConnectionServiceClient().create_connection(request)
    print(f'Created new connection with service account: {response.cloud_resource.service_account_id}')
    service_account = response.cloud_resource.service_account_id
    # assign the service account the Vertex AI User Role:
    !gcloud projects add-iam-policy-binding {BQ_PROJECT} --member=serviceAccount:{service_account} --role=roles/aiplatform.user

Created new connection with service account: bqcx-1026793852137-tqpc@gcp-sa-bigquery-condel.iam.gserviceaccount.com
Updated IAM policy for project [statmike-mlops-349915].
bindings:
- members:
  - serviceAccount:service-1026793852137@gcp-sa-aiplatform-cc.iam.gserviceaccount.com
  role: roles/aiplatform.customCodeServiceAgent
- members:
  - serviceAccount:service-1026793852137@gcp-sa-aiplatform-vm.iam.gserviceaccount.com
  role: roles/aiplatform.notebookServiceAgent
- members:
  - serviceAccount:service-1026793852137@gcp-sa-aiplatform.iam.gserviceaccount.com
  role: roles/aiplatform.serviceAgent
- members:
  - deleted:serviceAccount:bqcx-1026793852137-79ue@gcp-sa-bigquery-condel.iam.gserviceaccount.com?uid=108216671037418333398
  - deleted:serviceAccount:bqcx-1026793852137-a2ne@gcp-sa-bigquery-condel.iam.gserviceaccount.com?uid=113722269076525797130
  - deleted:serviceAccount:bqcx-1026793852137-iszu@gcp-sa-bigquery-condel.iam.gserviceaccount.com?uid=106642351460101305872
  - serviceAcco

**NOTE**: The step above created a service account and assigned it the Vertex AI User Role.  This may take a moment to be recognized in the steps below.  If you get an error in one of the cells below try rerunning it.
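
Instead of rerunning a cell by hand, the wait for the role assignment to take effect could be handled with a small retry helper. This is a generic sketch, not something the original notebook uses. The helper name and its parameters are hypothetical, and the demo uses a stand-in function instead of a real BigQuery call.

```python
import time


def retry(func, attempts=5, initial_delay=1.0, backoff=2.0, exceptions=(Exception,)):
    """Call `func` until it succeeds, sleeping longer after each failure."""
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                # out of attempts: surface the last error to the caller
                raise
            time.sleep(delay)
            delay *= backoff


# stand-in for a call that fails until permissions have propagated
state = {'calls': 0}

def sometimes_ready():
    state['calls'] += 1
    if state['calls'] < 3:
        raise PermissionError('role not recognized yet')
    return 'ready'

# initial_delay=0.0 keeps the demo instant; a real wait would use seconds
print(retry(sometimes_ready, initial_delay=0.0))
```

In the notebook, `func` could be a lambda wrapping the model creation call in the next section.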

### Create The Remote Model In BigQuery

Create a temporary model that connects to a text generation model on Vertex AI - [Reference](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.ml.llm.PaLM2TextGenerator)

In [53]:
textgen_model = bfml.llm.PaLM2TextGenerator(
    session = bf_session,
    connection_name = f'{BQ_PROJECT}.{BQ_REGION}.{SERIES}_{EXPERIMENT}'
)

---
## Generate Table Metadata

In [66]:
products_columns.columns

Index(['table_catalog', 'table_schema', 'table_name', 'column_name',
       'field_path', 'data_type', 'description', 'collation_name',
       'rounding_mode'],
      dtype='object')

In [67]:
products_sample.columns

Index(['column_name', 'column_sample'], dtype='object')

In [68]:
products_columns = products_columns[['column_name', 'description']].merge(products_sample, on = 'column_name')

In [69]:
products_columns

| | column_name | description | column_sample |
| --- | --- | --- | --- |
| 0 | id | | 29120,29119,29118,29117,29116,29115,29114,2911... |
| 1 | cost | | 13.549999985843897,10.750000039115548,12.05000... |
| 2 | category | | Intimates,Jeans,Tops & Tees,Fashion Hoodies & ... |
| 3 | name | | Wrangler Men's Premium Performance Cowboy Cut ... |
| 4 | brand | | Allegra K,Calvin Klein,Carhartt,Hanes,Volcom,N... |
| 5 | retail_price | | 25,29.989999771118164,19.989999771118164,39.99... |
| 6 | department | | Women,Men |
| 7 | sku | | FFFCC1A3964B4AD665FA2F07D7BFD086,FFFB8EF15DE06... |
| 8 | distribution_center_id | | 2,1,3,8,4,9,7,6,5,10 |


In [70]:
products_columns['name_prompt'] = 'Generate a new column name for a BigQuery column with:\ncurrent column name = ' + products_columns['column_name'] + '\ncommon values in the column = ' + products_columns['column_sample']

BadRequest: 400 Syntax error: Unclosed string literal at [35:26]

Location: US
Job ID: 5fec87f4-8a7e-4c75-91b4-bac4337c337c
 Share your usecase with the BigQuery DataFrames team at the https://bit.ly/bigframes-feedback survey.
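
The cause of this error is not established in the notebook. One untested guess is that the newline characters inside the concatenated string literals were not escaped when the expression was translated to SQL. Under that assumption, a prompt template without embedded newlines might avoid the problem. The sketch below shows such a template as a plain Python function; the function name is hypothetical, and it operates on ordinary strings, not on BigFrames columns.

```python
def build_name_prompt(column_name: str, column_sample: str) -> str:
    """Compose a single-line prompt asking for a better column name."""
    # separators are '; ' instead of newlines (assumption: newlines trigger the SQL error)
    return (
        "Generate a new column name for a BigQuery column with: "
        f"current column name = {column_name}; "
        f"common values in the column = {column_sample}"
    )


print(build_name_prompt('department', 'Women,Men'))
```

The same single-line pieces could be concatenated with the `column_name` and `column_sample` columns in the cell above, replacing the `\n` separators with `; `.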

In [63]:
products_columns

BadRequest: 400 Syntax error: Unclosed string literal at [38:26]

Location: US
Job ID: d8fd39fe-43c5-47fb-b582-c5361664bfc7
 Share your usecase with the BigQuery DataFrames team at the https://bit.ly/bigframes-feedback survey.

BadRequest: 400 Syntax error: Unclosed string literal at [38:26]

Location: US
Job ID: 2a2d4b28-33c6-4286-a90e-fd152f4e6044
 Share your usecase with the BigQuery DataFrames team at the https://bit.ly/bigframes-feedback survey.