# Vertex AI GenAI For BigQuery Metadata - Make Better Tables

BigQuery tables are a great source of information for generative AI applications.  Retrieving information is a multi-step process as covered in [Vertex AI GenAI For BigQuery Q&A - Overview](./Vertex%20AI%20GenAI%20For%20BigQuery%20Q&A%20-%20Overview.ipynb).  The ability of a large language model to understand the contents of tables directly relies on the descriptiveness of the metadata: column names, column descriptions, table names, table descriptions.  

This notebook shows a potential workflow for creating better, more descriptive metadata for BigQuery tables by using the existing metadata together with common values from the table's columns.

The notebook uses the BigFrames API for BigQuery, which offers a pandas-like API for local work while keeping the execution remote, within BigQuery.  The LLM used here is Vertex AI [text-bison](https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/text) called directly from BigQuery using [ML.GENERATE_TEXT](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-generate-text).


---
## Colab Setup

To run this notebook in Colab click [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/Vertex%20AI%20GenAI%20For%20BigQuery%20Metadata%20-%20Make%20Better%20Tables.ipynb) and run the cells in this section.  Otherwise, skip this section.

This cell will authenticate to GCP (follow prompts in the popup).

In [1]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [2]:
try:
    import google.colab
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

---
## Installs and API Enablement

The client packages may need to be installed in this environment.  Also, the BigQuery Connection API needs to be enabled (if not already enabled).

### Installs (If Needed)
The list `packages` contains tuples of package import names and install names.  If the import name is not found, the install name is used to install the package quietly for the current user.

In [3]:
# tuples of (import name, install name)
packages = [
    ('bigframes', 'bigframes'),
    ('google.cloud.aiplatform', 'google-cloud-aiplatform'),
    ('google.cloud.bigquery_connection_v1', 'google-cloud-bigquery-connection')
]

import importlib.util
install = False
for package in packages:
    if not importlib.util.find_spec(package[0]):
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user
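
The import check above can be sketched as a small reusable helper. This is a minimal sketch, not the notebook's own code: the function names `is_installed` and `missing_packages` and the package name `hypothetical_pkg_not_installed` are assumptions made for illustration. It also guards against dotted import names such as `google.cloud.aiplatform`, where `find_spec` raises `ModuleNotFoundError` if a parent package is absent.

```python
import importlib.util

def is_installed(import_name: str) -> bool:
    """Return True if the module named `import_name` can be found."""
    try:
        return importlib.util.find_spec(import_name) is not None
    except ModuleNotFoundError:
        # raised for dotted names when a parent package is not installed
        return False

def missing_packages(packages):
    """Return install names for (import name, install name) tuples that cannot be found."""
    return [install_name for import_name, install_name in packages if not is_installed(import_name)]

# 'hypothetical_pkg_not_installed' is a made-up name used only for this example
example = [('json', 'json'), ('hypothetical_pkg_not_installed.sub', 'hypothetical-pkg')]
print(missing_packages(example))
```

The returned list could then drive the `pip install` loop and the `install` flag used for the kernel restart.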

### API Enablement

Make sure the [BigQuery Connection API](https://cloud.google.com/bigquery/docs/create-cloud-resource-connection) is enabled:

In [5]:
!gcloud services enable bigqueryconnection.googleapis.com

### Restart Kernel (If Installs Occurred)

After a kernel restart, code execution can resume with the cell after this one.

In [6]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

---
## Setup

In [1]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [2]:
REGION = 'us-central1'
SERIES = 'applied-genai'
EXPERIMENT = 'bq-metadata'

In [3]:
# make this the BQ Project / Dataset / Table prefix to store results
BQ_PROJECT = PROJECT_ID
BQ_DATASET = SERIES.replace('-', '_')
BQ_TABLE = EXPERIMENT
BQ_REGION = REGION[0:2] # subset to first two characters for multi-region

In [61]:
from IPython.display import Markdown
import bigframes.pandas as bf
import bigframes.ml as bfml
from bigframes.ml import llm
from bigframes.ml import ensemble
from google.cloud import bigquery_connection_v1 as bq_connection

In [62]:
bf.reset_session()
bf.options.bigquery.project = BQ_PROJECT
bf.options.bigquery.location = BQ_REGION
bf_session = bf.get_global_session()

---
## Review Data Source

The data source here is a product catalog with source:
- BigQuery Public table `bigquery-public-data.thelook_ecommerce.products`.



### Get Table: BigQuery Public Table

In [6]:
products = bf.read_gbq('bigquery-public-data.thelook_ecommerce.products')

In [7]:
products.dtypes

id                                  Int64
cost                              Float64
category                  string[pyarrow]
name                      string[pyarrow]
brand                     string[pyarrow]
retail_price                      Float64
department                string[pyarrow]
sku                       string[pyarrow]
distribution_center_id              Int64
dtype: object

In [8]:
products.describe()

|       | id | cost | retail_price | distribution_center_id |
|:------|:---|:-----|:-------------|:-----------------------|
| count | 29120.0 | 29120.0 | 29120.0 | 29120.0 |
| mean  | 14560.5 | 28.481774 | 59.220164 | 4.982898 |
| std   | 8406.364256 | 30.624681 | 65.888927 | 2.901153 |
| min   | 1.0 | 0.0083 | 0.02 | 1.0 |
| 25%   | 7197.0 | 11.2455 | 24.0 | 2.0 |
| 50%   | 14498.0 | 19.5965 | 39.990002 | 5.0 |
| 75%   | 21789.0 | 34.648741 | 69.949997 | 8.0 |
| max   | 29120.0 | 557.151002 | 999.0 | 10.0 |


In [9]:
products.head()

|   | id | cost | category | name | brand | retail_price | department | sku | distribution_center_id |
|:--|:---|:-----|:---------|:-----|:------|:-------------|:-----------|:----|:-----------------------|
| 0 | 27569 | 92.652563 | Swim | 2XU Men's Swimmers Compression Long Sleeve Top | 2XU | 150.410004 | Men | B23C5765E165D83AA924FA8F13C05F25 | 1 |
| 1 | 27445 | 24.719661 | Swim | TYR Sport Men's Square Leg Short Swim Suit | TYR | 38.990002 | Men | 2AB7D3B23574C3DEA2BD278AFD0939AB | 1 |
| 2 | 27457 | 15.8976 | Swim | TYR Sport Men's Solid Durafast Jammer Swim Suit | TYR | 27.6 | Men | 8F831227B0EB6C6D09A0555531365933 | 1 |
| 3 | 27466 | 17.85 | Swim | TYR Sport Men's Swim Short/Resistance Short Sw... | TYR | 30.0 | Men | 67317D6DCC4CB778AEB9219565F5456B | 1 |
| 4 | 27481 | 29.408001 | Swim | TYR Alliance Team Splice Jammer | TYR | 45.950001 | Men | 213C888198806EF1A0E2BBF2F4855C6C | 1 |


### Get Table Info From BigQuery Information Schema: Columns

Retrieve the metadata for the table from Information Schema views like [INFORMATION_SCHEMA.COLUMN_FIELD_PATHS](https://cloud.google.com/bigquery/docs/information-schema-column-field-paths).

**NOTE** When `column_name` is not equal to `field_path`, the column is nested within a STRUCT (also called a RECORD; think dictionary of key:value pairs), which may also be repeated as an ARRAY (think list).  This table has no nested columns, but the workflow could be extended to handle them.
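
To illustrate how field paths relate to column names, here is a minimal sketch that walks a simplified, hypothetical schema (a list of dicts with a `name` key and an optional `fields` list for STRUCT columns; the `address` column does not exist in the products table). It only mimics the shape of the view's rows and does not query BigQuery.

```python
def field_paths(schema, prefix=''):
    """Yield (column_name, field_path) pairs for a possibly nested schema."""
    for field in schema:
        path = f"{prefix}.{field['name']}" if prefix else field['name']
        column = path.split('.')[0]  # the top-level column that owns this field
        yield column, path
        # recurse into nested STRUCT fields, if any
        yield from field_paths(field.get('fields', []), prefix=path)

# hypothetical schema: one flat column and one STRUCT column
schema = [
    {'name': 'id'},
    {'name': 'address', 'fields': [{'name': 'city'}, {'name': 'zip'}]},
]
paths = list(field_paths(schema))
# keep only rows where the column is not nested, like the filter used in the query
top_level = [column for column, path in paths if column == path]
print(paths)
print(top_level)
```

Filtering on `column_name = field_path` keeps one row per top-level column, which is why the query in this section returns exactly the flat columns.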

In [10]:
products_columns = bf.read_gbq(f"""
    SELECT *
    FROM `bigquery-public-data.thelook_ecommerce.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS`
    WHERE TABLE_NAME = 'products'
        AND column_name = field_path
""")

In [11]:
products_columns

|   | table_catalog | table_schema | table_name | column_name | field_path | data_type | description | collation_name | rounding_mode |
|:--|:--------------|:-------------|:-----------|:------------|:-----------|:----------|:------------|:---------------|:--------------|
| 0 | bigquery-public-data | thelook_ecommerce | products | id | id | INT64 | | | |
| 1 | bigquery-public-data | thelook_ecommerce | products | cost | cost | FLOAT64 | | | |
| 2 | bigquery-public-data | thelook_ecommerce | products | category | category | STRING | | | |
| 3 | bigquery-public-data | thelook_ecommerce | products | name | name | STRING | | | |
| 4 | bigquery-public-data | thelook_ecommerce | products | brand | brand | STRING | | | |
| 5 | bigquery-public-data | thelook_ecommerce | products | retail_price | retail_price | FLOAT64 | | | |
| 6 | bigquery-public-data | thelook_ecommerce | products | department | department | STRING | | | |
| 7 | bigquery-public-data | thelook_ecommerce | products | sku | sku | STRING | | | |
| 8 | bigquery-public-data | thelook_ecommerce | products | distribution_center_id | distribution_center_id | INT64 | | | |


### Get Table Info From BigQuery Information Schema: Table

Retrieve the metadata for the table from Information Schema views like [INFORMATION_SCHEMA.TABLE_OPTIONS](https://cloud.google.com/bigquery/docs/information-schema-table-options).

This view has one row for each option within each table.  Here, only the `OPTION_NAME = 'description'` is needed.

In [12]:
products_table = bf.read_gbq(f"""
    SELECT *
    FROM `bigquery-public-data.thelook_ecommerce.INFORMATION_SCHEMA.TABLE_OPTIONS`
    WHERE TABLE_NAME = 'products'
        AND OPTION_NAME = 'description'
""")

In [13]:
products_table

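|   | table_catalog | table_schema | table_name | option_name | option_type | option_value |
|:--|:--------------|:-------------|:-----------|:------------|:------------|:-------------|
| 0 | bigquery-public-data | thelook_ecommerce | products | description | STRING | "The Look fictitious e-commerce dataset - prod... |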
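
The `option_value` shown above begins with a quote character, which suggests the description comes back as a quoted string literal rather than bare text. A minimal sketch for unquoting it before placing it in a prompt follows. It assumes Python's literal parsing is close enough to the quoting used by the view, and the helper name `unquote_option_value` and the sample text are hypothetical.

```python
import ast

def unquote_option_value(option_value: str) -> str:
    """Best-effort removal of the literal quoting around a STRING option value."""
    try:
        parsed = ast.literal_eval(option_value)
    except (ValueError, SyntaxError):
        return option_value  # not a parseable literal: keep the raw text
    # only accept the parsed result when it really is a string
    return parsed if isinstance(parsed, str) else option_value

# made-up description text used only for this example
print(unquote_option_value('"A hypothetical table description"'))
```

Values that do not parse as a literal are returned unchanged, so the helper is safe to apply even if the quoting assumption does not hold.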


### Get Values From Columns: Most common values as examples

Retrieve a sample of the most common values from each column to use as examples for an LLM when it creates names and descriptions.

Build the SQL for a query that returns one row per column, each with a sample of that column's most common values.

In [14]:
for c, col in enumerate(products_columns.column_name.unique().tolist()):
    if c == 0: 
        cte = f"""SELECT '{col}' AS column_name, STRING_AGG(CAST(value AS STRING)) as column_sample FROM UNNEST((SELECT APPROX_TOP_COUNT({col}, 10) as osn FROM `bigquery-public-data.thelook_ecommerce.products`))"""
    else:
        cte += f"""\nUNION ALL\nSELECT '{col}' AS column_name, STRING_AGG(CAST(value AS STRING)) as column_sample FROM UNNEST((SELECT APPROX_TOP_COUNT({col}, 10) as osn FROM `bigquery-public-data.thelook_ecommerce.products`))"""
print(cte)

SELECT 'id' AS column_name, STRING_AGG(CAST(value AS STRING)) as column_sample FROM UNNEST((SELECT APPROX_TOP_COUNT(id, 10) as osn FROM `bigquery-public-data.thelook_ecommerce.products`))
UNION ALL
SELECT 'cost' AS column_name, STRING_AGG(CAST(value AS STRING)) as column_sample FROM UNNEST((SELECT APPROX_TOP_COUNT(cost, 10) as osn FROM `bigquery-public-data.thelook_ecommerce.products`))
UNION ALL
SELECT 'category' AS column_name, STRING_AGG(CAST(value AS STRING)) as column_sample FROM UNNEST((SELECT APPROX_TOP_COUNT(category, 10) as osn FROM `bigquery-public-data.thelook_ecommerce.products`))
UNION ALL
SELECT 'name' AS column_name, STRING_AGG(CAST(value AS STRING)) as column_sample FROM UNNEST((SELECT APPROX_TOP_COUNT(name, 10) as osn FROM `bigquery-public-data.thelook_ecommerce.products`))
UNION ALL
SELECT 'brand' AS column_name, STRING_AGG(CAST(value AS STRING)) as column_sample FROM UNNEST((SELECT APPROX_TOP_COUNT(brand, 10) as osn FROM `bigquery-public-data.thelook_ecommerce.produc
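
The same query text can also be assembled with a list and `join`, which avoids the special case for the first column. This is a sketch of an alternative, not the notebook's own cell: the function name `build_sample_query` and the `top_n` parameter are introduced here for illustration, and the SQL fragment mirrors the one printed above.

```python
def build_sample_query(columns, table, top_n=10):
    """Build one SELECT per column and combine them with UNION ALL."""
    selects = [
        f"SELECT '{col}' AS column_name, STRING_AGG(CAST(value AS STRING)) as column_sample "
        f"FROM UNNEST((SELECT APPROX_TOP_COUNT({col}, {top_n}) as osn FROM `{table}`))"
        for col in columns
    ]
    return "\nUNION ALL\n".join(selects)

query = build_sample_query(['id', 'cost'], 'bigquery-public-data.thelook_ecommerce.products')
print(query)
```

Because column names are interpolated directly into the SQL, this approach suits trusted metadata such as Information Schema results rather than arbitrary user input.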

In [15]:
products_sample = bf.read_gbq(cte)

In [16]:
products_sample

|   | column_name | column_sample |
|:--|:------------|:--------------|
| 0 | department | Women,Men |
| 1 | cost | 13.549999985843897,10.750000039115548,12.05000... |
| 2 | id | 29120,29119,29118,29117,29116,29115,29114,2911... |
| 3 | category | Intimates,Jeans,Tops & Tees,Fashion Hoodies & ... |
| 4 | retail_price | 25,29.989999771118164,19.989999771118164,39.99... |
| 5 | distribution_center_id | 2,1,3,8,4,9,7,6,5,10 |
| 6 | brand | Allegra K,Calvin Klein,Carhartt,Hanes,Volcom,N... |
| 7 | name | Wrangler Men's Premium Performance Cowboy Cut ... |
| 8 | sku | FFFCC1A3964B4AD665FA2F07D7BFD086,FFFB8EF15DE06... |


---
## BigQuery ML: Connect To Vertex AI LLMs with ML.GENERATE_TEXT

BigQuery ML can create models (`CREATE MODEL`) that are actually connections to remote models. [Reference](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-remote-model)

Using the `REMOTE_SERVICE_TYPE = "CLOUD_AI_LARGE_LANGUAGE_MODEL_V1"` option will link to LLMs in Vertex AI!

### Connection Requirement

To make a remote connection using BigQuery ML, BigQuery uses a CLOUD_RESOURCE connection. [Reference](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-remote-model#connection)

Create a new connection of type `CLOUD_RESOURCE`, after first checking for an existing connection.

In [17]:
try:
    response = bq_connection.ConnectionServiceClient().get_connection(
            request = bq_connection.GetConnectionRequest(
                name = f"projects/{BQ_PROJECT}/locations/{BQ_REGION}/connections/{SERIES}_{EXPERIMENT}"
            )
    )
    print(f'Found existing connection with service account: {response.cloud_resource.service_account_id}')
    service_account = response.cloud_resource.service_account_id
except Exception:
    request = bq_connection.CreateConnectionRequest(
        {
            "parent": f"projects/{BQ_PROJECT}/locations/{BQ_REGION}",
            "connection_id": f"{SERIES}_{EXPERIMENT}",
            "connection": bq_connection.types.Connection(
                {
                    "friendly_name": f"{SERIES}_{EXPERIMENT}",
                    "cloud_resource": bq_connection.CloudResourceProperties({})
                }
            )
        }
    )
    response = bq_connection.ConnectionServiceClient().create_connection(request)
    print(f'Created new connection with service account: {response.cloud_resource.service_account_id}')
    service_account = response.cloud_resource.service_account_id
    # assign the service account the Vertex AI User Role:
    !gcloud projects add-iam-policy-binding {BQ_PROJECT} --member=serviceAccount:{service_account} --role=roles/aiplatform.user

Found existing connection with service account: bqcx-1026793852137-tqpc@gcp-sa-bigquery-condel.iam.gserviceaccount.com


**NOTE**: If the step above created a new connection, it also created a service account and assigned it the Vertex AI User role.  The role may take a moment to be recognized in the steps below.  If you get an error in one of the cells below, try rerunning it.
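
Rerunning a cell by hand can also be expressed as a small retry loop. The sketch below is a generic helper, not part of BigFrames or the connection client: the name `retry`, its parameters, and the `flaky_step` stand-in are all assumptions made for illustration.

```python
import time

def retry(fn, attempts=5, initial_delay=1.0, backoff=2.0, sleep=time.sleep):
    """Call `fn()` until it succeeds or `attempts` is exhausted, waiting longer each time."""
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay *= backoff

# stand-in for a step that fails until the role assignment is recognized
calls = {'count': 0}

def flaky_step():
    calls['count'] += 1
    if calls['count'] < 3:
        raise PermissionError('role not yet recognized')
    return 'ok'

result = retry(flaky_step, sleep=lambda seconds: None)  # skip real waiting in this demo
print(result, calls['count'])
```

In the notebook, `fn` could wrap the model creation or prediction call. Catching every `Exception` is deliberately broad for a demo, and a narrower exception type would be preferable in practice.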

### Create The Remote Model In BigQuery

Create a temporary model that connects to a text generation model on Vertex AI - [Reference](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.ml.llm.PaLM2TextGenerator)

In [18]:
textgen_model = bfml.llm.PaLM2TextGenerator(
    session = bf_session,
    connection_name = f'{BQ_PROJECT}.{BQ_REGION}.{SERIES}_{EXPERIMENT}'
)

---
## Generate Table Metadata

### Bring Together Column Information

In [19]:
products_columns.columns

Index(['table_catalog', 'table_schema', 'table_name', 'column_name',
       'field_path', 'data_type', 'description', 'collation_name',
       'rounding_mode'],
      dtype='object')

In [20]:
products_sample.columns

Index(['column_name', 'column_sample'], dtype='object')

In [21]:
products_columns = products_columns[['column_name', 'data_type', 'description']].merge(products_sample, on = 'column_name')

In [22]:
products_columns

|   | column_name | data_type | description | column_sample |
|:--|:------------|:----------|:------------|:--------------|
| 0 | id | INT64 | | 29120,29119,29118,29117,29116,29115,29114,2911... |
| 1 | cost | FLOAT64 | | 13.549999985843897,10.750000039115548,12.05000... |
| 2 | category | STRING | | Intimates,Jeans,Tops & Tees,Fashion Hoodies & ... |
| 3 | name | STRING | | Wrangler Men's Premium Performance Cowboy Cut ... |
| 4 | brand | STRING | | Allegra K,Calvin Klein,Carhartt,Hanes,Volcom,N... |
| 5 | retail_price | FLOAT64 | | 25,29.989999771118164,19.989999771118164,39.99... |
| 6 | department | STRING | | Women,Men |
| 7 | sku | STRING | | FFFCC1A3964B4AD665FA2F07D7BFD086,FFFB8EF15DE06... |
| 8 | distribution_center_id | INT64 | | 2,1,3,8,4,9,7,6,5,10 |


### Add Table Information

In [23]:
products_columns['table_name'] = products_table['table_name'].iloc[0]
products_columns['table_description'] = products_table['option_value'].iloc[0]

In [24]:
products_columns

Unnamed: 0,column_name,data_type,description,column_sample,table_name,table_description
0,id,INT64,,"29120,29119,29118,29117,29116,29115,29114,2911...",products,"""The Look fictitious e-commerce dataset - prod..."
1,cost,FLOAT64,,"13.549999985843897,10.750000039115548,12.05000...",products,"""The Look fictitious e-commerce dataset - prod..."
2,category,STRING,,"Intimates,Jeans,Tops & Tees,Fashion Hoodies & ...",products,"""The Look fictitious e-commerce dataset - prod..."
3,name,STRING,,Wrangler Men's Premium Performance Cowboy Cut ...,products,"""The Look fictitious e-commerce dataset - prod..."
4,brand,STRING,,"Allegra K,Calvin Klein,Carhartt,Hanes,Volcom,N...",products,"""The Look fictitious e-commerce dataset - prod..."
5,retail_price,FLOAT64,,"25,29.989999771118164,19.989999771118164,39.99...",products,"""The Look fictitious e-commerce dataset - prod..."
6,department,STRING,,"Women,Men",products,"""The Look fictitious e-commerce dataset - prod..."
7,sku,STRING,,"FFFCC1A3964B4AD665FA2F07D7BFD086,FFFB8EF15DE06...",products,"""The Look fictitious e-commerce dataset - prod..."
8,distribution_center_id,INT64,,"2,1,3,8,4,9,7,6,5,10",products,"""The Look fictitious e-commerce dataset - prod..."


### Create Column Naming Prompt

In [25]:
products_columns['name_prompt'] = (
    'Generate a new column name for a BigQuery column with the following information. '
    + 'The current column name is ' + products_columns['column_name'] + '. '
    + 'The table has the name ' + products_columns['table_name'] + '. '
    + 'The column has a datatype of ' + products_columns['data_type'] + ' with common values like: ' + products_columns['column_sample'] + '.'
)

In [26]:
products_columns

Unnamed: 0,column_name,data_type,description,column_sample,table_name,table_description,name_prompt
0,id,INT64,,"29120,29119,29118,29117,29116,29115,29114,2911...",products,"""The Look fictitious e-commerce dataset - prod...",Generate a new column name for a BigQuery colu...
1,cost,FLOAT64,,"13.549999985843897,10.750000039115548,12.05000...",products,"""The Look fictitious e-commerce dataset - prod...",Generate a new column name for a BigQuery colu...
2,category,STRING,,"Intimates,Jeans,Tops & Tees,Fashion Hoodies & ...",products,"""The Look fictitious e-commerce dataset - prod...",Generate a new column name for a BigQuery colu...
3,name,STRING,,Wrangler Men's Premium Performance Cowboy Cut ...,products,"""The Look fictitious e-commerce dataset - prod...",Generate a new column name for a BigQuery colu...
4,brand,STRING,,"Allegra K,Calvin Klein,Carhartt,Hanes,Volcom,N...",products,"""The Look fictitious e-commerce dataset - prod...",Generate a new column name for a BigQuery colu...
5,retail_price,FLOAT64,,"25,29.989999771118164,19.989999771118164,39.99...",products,"""The Look fictitious e-commerce dataset - prod...",Generate a new column name for a BigQuery colu...
6,department,STRING,,"Women,Men",products,"""The Look fictitious e-commerce dataset - prod...",Generate a new column name for a BigQuery colu...
7,sku,STRING,,"FFFCC1A3964B4AD665FA2F07D7BFD086,FFFB8EF15DE06...",products,"""The Look fictitious e-commerce dataset - prod...",Generate a new column name for a BigQuery colu...
8,distribution_center_id,INT64,,"2,1,3,8,4,9,7,6,5,10",products,"""The Look fictitious e-commerce dataset - prod...",Generate a new column name for a BigQuery colu...


In [27]:
products_columns['name_prompt'].iloc[2]

'Generate a new column name for a BigQuery column with the following information. The current column name is category. The table has the name products. The column has a datatype of STRING with common values like: Intimates,Jeans,Tops & Tees,Fashion Hoodies & Sweatshirts,Swim,Sleep & Lounge,Shorts,Sweaters,Accessories,Active.'
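
The prompt wording can be tried out locally before it is sent through BigQuery. The following is a minimal pure-Python sketch of the same template: the function name `build_name_prompt` and the placeholder text for a missing sample are assumptions for illustration, and the notebook itself builds the prompt with BigFrames string concatenation.

```python
def build_name_prompt(column_name, table_name, data_type, column_sample):
    """Fill the column-naming prompt template for a single column."""
    # a missing sample is replaced with placeholder text (an assumption of this sketch)
    sample = column_sample if column_sample else 'no sample available'
    return (
        'Generate a new column name for a BigQuery column with the following information. '
        f'The current column name is {column_name}. '
        f'The table has the name {table_name}. '
        f'The column has a datatype of {data_type} with common values like: {sample}.'
    )

prompt = build_name_prompt('department', 'products', 'STRING', 'Women,Men')
print(prompt)
```

A local version like this makes it easy to compare prompt variations, for example adding the table description, before spending model calls on them.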

### Generate New Column Names

In [28]:
products_columns = products_columns.join(textgen_model.predict(products_columns['name_prompt']).rename(columns={'ml_generate_text_llm_result':'new_column_name'}))

In [29]:
products_columns

Unnamed: 0,column_name,data_type,description,column_sample,table_name,table_description,name_prompt,new_column_name
0,id,INT64,,"29120,29119,29118,29117,29116,29115,29114,2911...",products,"""The Look fictitious e-commerce dataset - prod...",Generate a new column name for a BigQuery colu...,product_id
1,cost,FLOAT64,,"13.549999985843897,10.750000039115548,12.05000...",products,"""The Look fictitious e-commerce dataset - prod...",Generate a new column name for a BigQuery colu...,product_cost
2,category,STRING,,"Intimates,Jeans,Tops & Tees,Fashion Hoodies & ...",products,"""The Look fictitious e-commerce dataset - prod...",Generate a new column name for a BigQuery colu...,products_category_v2
3,name,STRING,,Wrangler Men's Premium Performance Cowboy Cut ...,products,"""The Look fictitious e-commerce dataset - prod...",Generate a new column name for a BigQuery colu...,product_name
4,brand,STRING,,"Allegra K,Calvin Klein,Carhartt,Hanes,Volcom,N...",products,"""The Look fictitious e-commerce dataset - prod...",Generate a new column name for a BigQuery colu...,product_brand
5,retail_price,FLOAT64,,"25,29.989999771118164,19.989999771118164,39.99...",products,"""The Look fictitious e-commerce dataset - prod...",Generate a new column name for a BigQuery colu...,products_retail_price_float64
6,department,STRING,,"Women,Men",products,"""The Look fictitious e-commerce dataset - prod...",Generate a new column name for a BigQuery colu...,products_department_gender
7,sku,STRING,,"FFFCC1A3964B4AD665FA2F07D7BFD086,FFFB8EF15DE06...",products,"""The Look fictitious e-commerce dataset - prod...",Generate a new column name for a BigQuery colu...,product_sku
8,distribution_center_id,INT64,,"2,1,3,8,4,9,7,6,5,10",products,"""The Look fictitious e-commerce dataset - prod...",Generate a new column name for a BigQuery colu...,products_distribution_center_id
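
Generated names are free-form text, so it may be worth normalizing them before they are used as column names. The sketch below applies conservative rules (letters, digits, and underscores only, not starting with a digit, at most 300 characters). The helper name `clean_column_name` and the example inputs are hypothetical, and the rules cover only the basic naming constraints rather than every BigQuery restriction, such as reserved prefixes.

```python
import re

MAX_LENGTH = 300  # maximum column-name length assumed here

def clean_column_name(generated: str, fallback: str) -> str:
    """Turn free-form LLM output into a conservative column name, or return `fallback`."""
    name = generated.strip().strip('`"\'')
    # replace runs of disallowed characters with one underscore, then trim underscores
    name = re.sub(r'[^0-9A-Za-z_]+', '_', name).strip('_').lower()
    if not name:
        return fallback
    if name[0].isdigit():
        name = '_' + name  # names may not start with a digit
    return name[:MAX_LENGTH]

print(clean_column_name(' product_id\n', 'id'))
print(clean_column_name('`Retail Price (USD)`', 'retail_price'))
```

Falling back to the existing name keeps the workflow moving when the model returns something unusable. A human review is still worthwhile for names like `products_retail_price_float64`, which are valid but not obviously better than the originals.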


### Generate New Column Description

In [41]:
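# drop the column from a previous run (if present) so this section can be rerun
products_columns = products_columns[[c for c in products_columns.columns if c != 'new_column_description']]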

In [42]:
products_columns = products_columns.join(textgen_model.predict(
    'The context for a BigQuery table column follows. '
    + 'The column name is ' + products_columns['new_column_name'] + '. '
    + 'The table has the name ' + products_columns['table_name'] + '. '
    + 'The column has a datatype of ' + products_columns['data_type'] + ' with common values like: ' + products_columns['column_sample'] + '. '
    + 'Generate a description of the column.'
).rename(columns={'ml_generate_text_llm_result':'new_column_description'}))

In [43]:
products_columns

Unnamed: 0,column_name,data_type,description,column_sample,table_name,table_description,name_prompt,new_column_name,new_column_description
0,id,INT64,,"29120,29119,29118,29117,29116,29115,29114,2911...",products,"""The Look fictitious e-commerce dataset - prod...",Generate a new column name for a BigQuery colu...,product_id,product_id is a column in the products table....
1,cost,FLOAT64,,"13.549999985843897,10.750000039115548,12.05000...",products,"""The Look fictitious e-commerce dataset - prod...",Generate a new column name for a BigQuery colu...,product_cost,product_cost is a column in the products tabl...
2,category,STRING,,"Intimates,Jeans,Tops & Tees,Fashion Hoodies & ...",products,"""The Look fictitious e-commerce dataset - prod...",Generate a new column name for a BigQuery colu...,products_category_v2,products_category_v2 column in the products t...
3,name,STRING,,Wrangler Men's Premium Performance Cowboy Cut ...,products,"""The Look fictitious e-commerce dataset - prod...",Generate a new column name for a BigQuery colu...,product_name,product_name: The name of the product. Exampl...
4,brand,STRING,,"Allegra K,Calvin Klein,Carhartt,Hanes,Volcom,N...",products,"""The Look fictitious e-commerce dataset - prod...",Generate a new column name for a BigQuery colu...,product_brand,product_brand column in the products table co...
5,retail_price,FLOAT64,,"25,29.989999771118164,19.989999771118164,39.99...",products,"""The Look fictitious e-commerce dataset - prod...",Generate a new column name for a BigQuery colu...,products_retail_price_float64,The products_retail_price_float64 column in t...
6,department,STRING,,"Women,Men",products,"""The Look fictitious e-commerce dataset - prod...",Generate a new column name for a BigQuery colu...,products_department_gender,The products_department_gender column in the ...
7,sku,STRING,,"FFFCC1A3964B4AD665FA2F07D7BFD086,FFFB8EF15DE06...",products,"""The Look fictitious e-commerce dataset - prod...",Generate a new column name for a BigQuery colu...,product_sku,product_sku is a column in the products table...
8,distribution_center_id,INT64,,"2,1,3,8,4,9,7,6,5,10",products,"""The Look fictitious e-commerce dataset - prod...",Generate a new column name for a BigQuery colu...,products_distribution_center_id,The products_distribution_center_id column in...
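
One possible way to use the generated names and descriptions is to turn them into DDL for a table you own (the public source table is read-only). The sketch below only builds statement strings and does not execute anything. The function name `metadata_ddl`, the target table `my-project.my_dataset.products_copy`, and the example descriptions are hypothetical, and using `json.dumps` to produce a quoted, escaped string literal is an assumption of this sketch.

```python
import json

def metadata_ddl(table, rows):
    """Return DDL strings that rename columns and set their descriptions.

    `rows` is an iterable of (old_name, new_name, description) tuples.
    """
    statements = []
    for old_name, new_name, description in rows:
        if new_name != old_name:
            statements.append(f'ALTER TABLE `{table}` RENAME COLUMN {old_name} TO {new_name};')
        literal = json.dumps(description, ensure_ascii=False)  # double-quoted, escaped string
        statements.append(
            f'ALTER TABLE `{table}` ALTER COLUMN {new_name} SET OPTIONS (description = {literal});'
        )
    return statements

# hypothetical rows and target table, for illustration only
rows = [
    ('id', 'product_id', 'Unique "id" of a product.'),
    ('sku', 'sku', 'Stock keeping unit.'),
]
statements = metadata_ddl('my-project.my_dataset.products_copy', rows)
print('\n'.join(statements))
```

Reviewing the generated statements before running them leaves room for a human check on the LLM output.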


### Generate New Table Description

Convert the selected schema columns into a Markdown table to include in the prompt:

In [58]:
markdown = products_columns[['new_column_name', 'new_column_description', 'data_type']].rename(columns = {'new_column_name':'column_name', 'new_column_description':'description'}).to_pandas().to_markdown(index = False)
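
The pandas `to_markdown` method relies on the optional `tabulate` package. Where that package is unavailable, a table in the same spirit can be built by hand. This is a minimal sketch: the function name `to_markdown_table` and the sample row (including its description text) are hypothetical, and the column padding that `tabulate` adds is omitted.

```python
def to_markdown_table(rows, columns):
    """Render `rows` (a list of dicts) as a pipe-delimited Markdown table."""
    def cell(value):
        # pipes and newlines would break the table layout
        return str(value).replace('|', '\\|').replace('\n', ' ')

    lines = [
        '| ' + ' | '.join(columns) + ' |',
        '|' + '|'.join(':---' for _ in columns) + '|',
    ]
    for row in rows:
        lines.append('| ' + ' | '.join(cell(row[c]) for c in columns) + ' |')
    return '\n'.join(lines)

# made-up row used only for this example
rows = [{'column_name': 'product_id', 'description': 'Unique identifier for a product.', 'data_type': 'INT64'}]
table = to_markdown_table(rows, ['column_name', 'description', 'data_type'])
print(table)
```

Skipping the padding also keeps the prompt shorter, which can matter when long descriptions are placed in a table.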

Review the markdown table:

In [59]:
Markdown(markdown)

| column_name                     | description                                                                                                                                                                                                                                                                                                                                                                                                                                                              | data_type   |
|:--------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------|
| product_id                      | product_id is a column in the products table. The column is of type INT64 and has values like 29120,29119,29118,29117,29116,29115,29114,29113,29112,29111. The column likely represents the unique identifier for each product in the table.                                                                                                                                                                                                                             | INT64       |
| product_cost                    | product_cost is a column in the products table. The column contains FLOAT64 values representing the cost of each product. Example values include 13.549999985843897, 10.750000039115548, and 12.05000001937151.                                                                                                                                                                                                                                                          | FLOAT64     |
| products_category_v2            | products_category_v2 column in the products table contains product category information. The column is of type STRING and contains values such as Intimates, Jeans, Tops & Tees, Fashion Hoodies & Sweatshirts, Swim, Sleep & Lounge, Shorts, Sweaters, Accessories, and Active.                                                                                                                                                                                         | STRING      |
| product_name                    | product_name: The name of the product. Examples: Wrangler Men's Premium Performance Cowboy Cut Jean,Wrangler Men's Rugged Wear Classic Fit Jean,True Religion Men's Ricky Straight Jean,Thorlo Unisex Experia Running Sock,Puma Men's Socks,Pearl iZUMi Attack Sock 3-Pack,Fruit of the Loom Women's 6-Pack Crew Socks,7 For All Mankind Men's Standard Classic Straight Leg Jean,Wrangler Men's Wrancher Dress Jean,Wrangler Men's Original Cowboy Cut Relaxed Fit Jean | STRING      |
| product_brand                   | product_brand column in the products table contains STRING data and captures the brand name of the product. Example values include: Allegra K, Calvin Klein, Carhartt, Hanes, Volcom, Nautica, Levi's, Quiksilver, Tommy Hilfiger, Columbia.                                                                                                                                                                                                                             | STRING      |
| products_retail_price_float64   | The products_retail_price_float64 column in the products table contains the retail price of each product in US dollars. The column has a datatype of FLOAT64 and contains values like 25, 29.99, 19.99, 39.99, 49.99, 24.99, 34.99, 49.5, 50, and 55.                                                                                                                                                                                                                    | FLOAT64     |
| products_department_gender      | The products_department_gender column in the products table stores the department and gender that a product belongs to. The column has a datatype of STRING and common values include "Women" and "Men". This column can be used to filter products by department or gender, or to create reports on product sales by department or gender.                                                                                                                              | STRING      |
| product_sku                     | product_sku is a column in the products table. It is a string column with values that are product SKUs.                                                                                                                                                                                                                                                                                                                                                                  | STRING      |
| products_distribution_center_id | The products_distribution_center_id column in the products table is an INT64 data type. The column contains the ID of the distribution center where the product is stored. The most common values for this column are 2, 1, 3, 8, 4, 9, 7, 6, 5, and 10.                                                                                                                                                                                                                 | INT64       |

Generate the table description from the column-level descriptions above:

In [60]:
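# Build a one-row prompt DataFrame that embeds the column-level markdown table
table_prompt = bf.DataFrame({"prompt": [
f"""Generate a description for the BigQuery table with schema:
{markdown}
"""
]})
# Run the LLM remotely in BigQuery and render the first (only) response
table_description = textgen_model.predict(table_prompt).ml_generate_text_llm_result.iloc[0]
Markdown(table_description)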

 The products table contains information about products, including product id, cost, category, name, brand, retail price, department and gender, SKU, and distribution center ID. The table can be used to analyze product sales, track inventory, and manage product distribution.
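
The table-level prompt is plain string assembly: a fixed instruction followed by the column metadata rendered as a markdown table. The sketch below reproduces that assembly locally, with no BigQuery or Vertex AI calls, so the prompt text can be inspected before it is sent. The helper names `schema_to_markdown` and `build_table_prompt`, the header names, and the shortened sample descriptions are assumptions for illustration, not part of the notebook.

```python
# Hypothetical helpers: a local sketch of prompt assembly, not the notebook's own code.

HEADERS = ["column_name", "description", "data_type"]  # assumed header names


def schema_to_markdown(columns):
    """Render a list of column-metadata dicts as a markdown table."""
    lines = [
        "| " + " | ".join(HEADERS) + " |",
        "|" + "|".join(":---" for _ in HEADERS) + "|",
    ]
    for col in columns:
        # Escape pipes so a description cannot break the table layout.
        cells = [str(col[h]).replace("|", "\\|") for h in HEADERS]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


def build_table_prompt(columns):
    """Combine the fixed instruction with the rendered schema table."""
    return (
        "Generate a description for the BigQuery table with schema:\n"
        f"{schema_to_markdown(columns)}\n"
    )


# Sample rows, shortened from the generated column descriptions above.
columns = [
    {"column_name": "product_id",
     "description": "Unique identifier for each product.",
     "data_type": "INT64"},
    {"column_name": "product_cost",
     "description": "Cost of each product.",
     "data_type": "FLOAT64"},
]

prompt = build_table_prompt(columns)
print(prompt)
```

Keeping prompt construction in a small function like this makes it easy to review the exact text the model receives, or to trim long example-value lists before they inflate the prompt.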

## Create A New Table With The Updated Metadata