
# Answering Questions Using BigQuery Tables As Context

**WRITEUP IN PROGRESS**

---
## Overview

<p><center>
    <img alt="Overview Chart" src="../architectures/notebooks/applied/genai/bq_qa.png" width="55%">
</center></p>


---
## Setup

Inputs

In [1]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [2]:
REGION = 'us-central1'
EXPERIMENT = 'bq-citibikes'
SERIES = 'applied-genai'

Packages

In [3]:
import vertexai.language_models
from google.cloud import aiplatform
from google.cloud import bigquery

Clients

In [4]:
# vertex ai clients
vertexai.init(project = PROJECT_ID, location = REGION)
aiplatform.init(project = PROJECT_ID, location = REGION)

# bigquery client
bq = bigquery.Client(project = PROJECT_ID)

---
## Goal

New York City has [Citibike](https://citibikenyc.com/) stations where you can rent a bicycle by the ride or by the day, or subscribe monthly or annually.  A sample of Citibike station usage is available in BigQuery public datasets.  The goal is to answer natural language questions using this data as the source.

A possible question is: Which Citibike station had the most rentals during July 2015?

The approach used here is to ask an LLM to answer the question, in multiple steps:
1. Ask a code generation LLM to write a SQL query that retrieves the information relevant to the question from the tables - the context
2. Run the query generated in 1
3. Ask a text generation LLM to answer the question and give it a context to help accurately answer the question - the result of 2
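
The three steps above can be sketched as one function. This is a minimal illustration, not the notebook's implementation: the function name `answer_question` and its three callable parameters are hypothetical stand-ins for the code generation model, the BigQuery client, and the text generation model, and the prompts are shortened.

```python
from typing import Callable


def answer_question(
    question: str,
    schema_context: str,
    generate_sql: Callable[[str], str],
    run_query: Callable[[str], str],
    generate_answer: Callable[[str], str],
) -> str:
    """Run the three steps: generate SQL, execute it, answer from the result."""
    # Step 1: ask the code generation model for a query, given the schema
    sql_prompt = (
        "Write a Google SQL query that answers the following question.\n"
        f"question:\n{question}\n\ncontext:\n{schema_context}\n"
    )
    sql = generate_sql(sql_prompt)

    # Step 2: run the generated query to retrieve the context
    result = run_query(sql)

    # Step 3: ask the text generation model to answer using the result
    answer_prompt = (
        "Answer the following question using the provided context.\n"
        f"question:\n{question}\ncontext:\n{result}\n"
    )
    return generate_answer(answer_prompt)


# Stand-in callables so the flow can be exercised without any cloud calls
demo = answer_question(
    question="Which station had the most rentals?",
    schema_context="| table_name | column_name |",
    generate_sql=lambda prompt: "SELECT 1",
    run_query=lambda sql: f"result of: {sql}",
    generate_answer=lambda prompt: prompt.splitlines()[-1],
)
print(demo)
```

Passing the models and the query runner in as callables keeps each step replaceable, which helps when experimenting with prompts one step at a time.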

In [5]:
question = "Which station had most rentals (longest total duration) during July 2015?"

---
## Vertex LLM Setup

- TextEmbeddingModel [Guide](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings)
    - TextEmbeddingModel [API](https://cloud.google.com/python/docs/reference/aiplatform/latest/vertexai.preview.language_models.TextEmbeddingModel)
- TextGenerationModel [Guide](https://cloud.google.com/vertex-ai/docs/generative-ai/text/test-text-prompts)
    - TextGenerationModel [API](https://cloud.google.com/python/docs/reference/aiplatform/latest/vertexai.preview.language_models.TextGenerationModel)

In [6]:
# load the pretrained models: text generation and code generation
textgen_model = vertexai.language_models.TextGenerationModel.from_pretrained('text-bison')
codegen_model = vertexai.language_models.CodeGenerationModel.from_pretrained('code-bison')

Test the text generation (LLM) model:

In [7]:
textgen_model.predict(f"Write a Google SQL query that answers the following question.\nquestion: {question}")

```
SELECT station_name, SUM(duration) AS total_duration
FROM trips
WHERE start_date BETWEEN '2015-07-01' AND '2015-07-31'
GROUP BY station_name
ORDER BY total_duration DESC
LIMIT 1;
```

In [8]:
codegen_model.predict(f"Write a Google SQL query that answers the following question.\nquestion: {question}")

```
SELECT station_id, SUM(duration) AS total_duration
FROM trip
WHERE start_date BETWEEN '2015-07-01' AND '2015-07-31'
GROUP BY station_id
ORDER BY total_duration DESC
LIMIT 1
```

---
## The Problem
Both LLMs write syntactically valid SQL queries.  However, asking either LLM to write the query this way has several issues:
- The queries do not reference the correct tables
- The column names do not match the columns of the correct tables

At best, the generated SQL is a starting point from which a user could write a query that retrieves valid context for the user's question.

**How to get a fully executable SQL query from the LLM?**

The following approach was created by iteratively refining text prompts and approaches for specific questions.  

---
## Retrieve Table Schemas

The context provided to the LLM to help it write the SQL query is the schemas of the related tables.  The BigQuery tables used for this experiment are BigQuery public dataset tables:
- `bigquery-public-data.new_york.citibike_trips`
- `bigquery-public-data.new_york.citibike_stations`

To retrieve the schemas for the tables, the BigQuery [INFORMATION_SCHEMA](https://cloud.google.com/bigquery/docs/information-schema-intro) is used - specifically the [INFORMATION_SCHEMA.COLUMN_FIELD_PATHS](https://cloud.google.com/bigquery/docs/information-schema-column-field-paths) view.

In [9]:
BQ_PROJECT = 'bigquery-public-data'
BQ_DATASET = 'new_york'
BQ_TABLES = ['citibike_trips', 'citibike_stations']

In [10]:
query = f"""
    SELECT *
    FROM `{BQ_PROJECT}.{BQ_DATASET}.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS`
    WHERE table_name in ({','.join([f'"{table}"' for table in BQ_TABLES])})
"""
print(query)
schema_columns = bq.query(query = query).to_dataframe()


    SELECT *
    FROM `bigquery-public-data.new_york.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS`
    WHERE table_name in ("citibike_trips","citibike_stations")



In [11]:
schema_columns.head()

| | table_catalog | table_schema | table_name | column_name | field_path | data_type | description | collation_name | rounding_mode |
|---|---|---|---|---|---|---|---|---|---|
| 0 | bigquery-public-data | new_york | citibike_stations | station_id | station_id | STRING | Unique identifier of a station. | | |
| 1 | bigquery-public-data | new_york | citibike_stations | name | name | STRING | Public name of the station. | | |
| 2 | bigquery-public-data | new_york | citibike_stations | short_name | short_name | STRING | Short name or other type of identifier, as use... | | |
| 3 | bigquery-public-data | new_york | citibike_stations | latitude | latitude | FLOAT64 | The latitude of station. The field value must ... | | |
| 4 | bigquery-public-data | new_york | citibike_stations | longitude | longitude | FLOAT64 | The longitude of station. The field value must... | | |


In [12]:
#schema_columns.to_markdown(index = False)

---
## Code Generation LLM - With Context

In this attempt, the code generation LLM is asked to write valid SQL and is also given context.  In this case the context is the schema of the tables that are relevant to Citibike rentals.

The table that represents the schema is turned into context here by converting it to markdown.  This is an area where users can experiment with the format: JSON, CSV, and so on.  I like markdown because it includes the column names only once and sets them apart with markdown's header row notation!
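
To make the format trade-off concrete, here is a small sketch that renders the same schema rows as markdown and as JSON. The helper `rows_to_markdown` is hypothetical and written in plain Python for illustration (the notebook itself uses the DataFrame's `to_markdown`); the two sample rows are taken from the schema preview shown earlier.

```python
import json


def rows_to_markdown(rows):
    """Render a list of dicts as a markdown table (header written once)."""
    columns = list(rows[0].keys())
    lines = [
        "| " + " | ".join(columns) + " |",
        "| " + " | ".join("---" for _ in columns) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row[c]) for c in columns) + " |")
    return "\n".join(lines)


schema_rows = [
    {"table_name": "citibike_stations", "column_name": "station_id", "data_type": "STRING"},
    {"table_name": "citibike_stations", "column_name": "name", "data_type": "STRING"},
]

as_markdown = rows_to_markdown(schema_rows)
as_json = json.dumps(schema_rows)

print(as_markdown)
# Column names appear once in markdown but once per row in JSON
print(as_markdown.count("column_name"), as_json.count("column_name"))
```

With JSON the key names repeat on every row, so the context grows faster as the schema gets longer. Whether that affects the quality of the generated SQL is something to test rather than assume.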

In [13]:
context_prompt = f"""
Write a Google SQL query that answers the following question while using the provided context to correct refer to BigQuery tables and the needed column names.  Output column names should include the units when applicable. 
question:
{question}

context:
{schema_columns.to_markdown(index = False)}
"""

context_query = codegen_model.predict(context_prompt, max_output_tokens = 256)

print(context_query.text)

```sql
SELECT start_station_name, SUM(tripduration) AS total_duration_seconds
FROM `bigquery-public-data.new_york.citibike_trips`
WHERE starttime BETWEEN TIMESTAMP '2015-07-01' AND TIMESTAMP '2015-07-31'
GROUP BY start_station_name
ORDER BY total_duration_seconds DESC
LIMIT 1
```
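
One detail worth checking in the generated query is the date filter. `BETWEEN` is inclusive, and `TIMESTAMP '2015-07-31'` denotes midnight at the start of July 31, so trips that begin later that day fall outside the range. The sketch below mirrors the comparison in plain Python; the function names and the sample trip time are illustrative assumptions, not part of the notebook.

```python
from datetime import datetime

JULY_START = datetime(2015, 7, 1)
JULY_31_MIDNIGHT = datetime(2015, 7, 31)  # what TIMESTAMP '2015-07-31' denotes
AUGUST_START = datetime(2015, 8, 1)


def in_inclusive_range(ts):
    """Mirror of: starttime BETWEEN TIMESTAMP '2015-07-01' AND TIMESTAMP '2015-07-31'."""
    return JULY_START <= ts <= JULY_31_MIDNIGHT


def in_half_open_range(ts):
    """Mirror of: starttime >= '2015-07-01' AND starttime < '2015-08-01'."""
    return JULY_START <= ts < AUGUST_START


trip_start = datetime(2015, 7, 31, 12, 0)  # hypothetical trip at noon on July 31
print(in_inclusive_range(trip_start))   # False
print(in_half_open_range(trip_start))   # True
```

A half-open range (`>=` the first day, `<` the first day of the next month) is one way to cover the whole month. Asking for it explicitly in the prompt may help, though that is untested here.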


In [14]:
context_response = bq.query(query = '\n'.join(context_query.text.split('\n')[1:-1])).to_dataframe()
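
The cell above strips the markdown fence by dropping the first and last lines of the response, which assumes the model returns exactly one fenced block with nothing before or after it. A slightly more defensive sketch follows; the helper name `extract_sql` is hypothetical, and the fallback to the raw text for unfenced responses is an assumption about what is desirable.

```python
import re

FENCE = "`" * 3  # three backticks
# Match an opening fence with an optional language tag, then capture up to the closing fence
FENCE_PATTERN = re.compile(FENCE + r"[a-zA-Z]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_sql(model_text: str) -> str:
    """Return the SQL inside the first fenced block, or the whole text if unfenced."""
    match = FENCE_PATTERN.search(model_text)
    sql = match.group(1) if match else model_text
    return sql.strip()


fenced = f"Here is the query:\n{FENCE}sql\nSELECT 1\n{FENCE}\nDone."
print(extract_sql(fenced))      # SELECT 1
print(extract_sql("SELECT 1"))  # SELECT 1
```

With this helper the query call would take `extract_sql(context_query.text)` in place of the split and join expression.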

In [15]:
context_response

| | start_station_name | total_duration_seconds |
|---|---|---|
| 0 | Central Park S & 6 Ave | 17259635 |


---
## Answer The Question

Now that valid context has been retrieved from BigQuery, it can be passed to a text generation LLM to answer the user's question.

In [16]:
question_prompt = f"""
Answer the following question using the provided context.  Note that the context is a tabular result returned from a BigQuery query.  Do not repeat the question or the context when responding.

question:
{question}
context:
{context_response}
"""

question_response = textgen_model.predict(question_prompt)

print(question_response.text)

Central Park S & 6 Ave
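
The prompt above interpolates the DataFrame directly, so the model sees the DataFrame's printed form, which pandas may abbreviate for long or wide results. A sketch of an explicit alternative follows, assuming the result has first been converted to a list of dicts (for example with `context_response.to_dict('records')`). The helpers `rows_to_csv` and `build_answer_prompt` are hypothetical names, and CSV is just one possible format.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of dicts as CSV text with a single header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def build_answer_prompt(question, rows):
    """Assemble the answer prompt with the full query result as CSV context."""
    return (
        "Answer the following question using the provided context.  "
        "Note that the context is a tabular result returned from a BigQuery query.  "
        "Do not repeat the question or the context when responding.\n\n"
        f"question:\n{question}\ncontext:\n{rows_to_csv(rows)}"
    )


# Row copied from the query result shown above
rows = [{"start_station_name": "Central Park S & 6 Ave", "total_duration_seconds": 17259635}]
prompt = build_answer_prompt(
    "Which station had most rentals (longest total duration) during July 2015?", rows
)
print(prompt)
```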


---
## Put It All Together

Ask a new question and try it out:

In [17]:
question = 'What were the top five stations with most unique trips in July 2015?'

In [18]:
context_prompt = f"""
Write a Google SQL query that answers the following question while using the provided context to correct refer to BigQuery tables and the needed column names.  Output column names should include the units when applicable. 
question:
{question}

context:

{schema_columns.to_markdown(index = False)}
"""

context_query = codegen_model.predict(context_prompt, max_output_tokens = 256)

context_response = bq.query(query = '\n'.join(context_query.text.split('\n')[1:-1])).to_dataframe()

In [19]:
context_query

```sql
SELECT
  start_station_name,
  COUNT(*) AS num_trips
FROM
  `bigquery-public-data.new_york.citibike_trips`
WHERE
  starttime BETWEEN TIMESTAMP '2015-07-01' AND TIMESTAMP '2015-07-31'
GROUP BY
  start_station_name
ORDER BY
  num_trips DESC
LIMIT
  5
```

In [20]:
question_prompt = f"""
Answer the following question using the provided context.  Note that the context is a tabular result returned from a BigQuery query.  Do not repeat the question or the context when responding.

question:
{question}
context:
{context_response}
"""

question_response = textgen_model.predict(question_prompt)

print(question_response.text)

8 Ave & W 31 St
Pershing Square North
West St & Chambers St
Lafayette St & E 8 St
E 17 St & Broadway


---
## All Together and More Complex

In [21]:
question = 'What were the top five stations with most unique trips in July 2015 near central park?'

In [22]:
context_prompt = f"""
Write a BigQuery SQL query that answers the following question while using the provided context to correctly refer to BigQuery tables and the needed column names.  Output column names should include the units when applicable. Validate the BigQuery function names and inputs. Validate the referenced columns are in the schema of the referenced tables.
question:
{question}

context:

{schema_columns.to_markdown(index = False)}
"""

context_query = codegen_model.predict(context_prompt, max_output_tokens = 256)

context_response = bq.query(query = '\n'.join(context_query.text.split('\n')[1:-1])).to_dataframe()

In [23]:
context_query

```sql
#standardSQL
SELECT
  start_station_name,
  count(*) AS num_trips
FROM
  `bigquery-public-data.new_york.citibike_trips`
WHERE
  starttime BETWEEN TIMESTAMP '2015-07-01' AND TIMESTAMP '2015-07-31'
  AND start_station_latitude BETWEEN 40.7526 AND 40.7683
  AND start_station_longitude BETWEEN -73.9847 AND -73.9711
GROUP BY
  start_station_name
ORDER BY
  num_trips DESC
LIMIT
  5
```
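
Note that the model translated "near central park" into a fixed latitude and longitude rectangle of its own choosing; nothing in the schema context supplies those bounds. A small sketch like the one below can help check which points such a filter admits. The bounds are copied from the generated query, while the two test coordinates are hypothetical values for illustration and are not taken from the dataset.

```python
# Bounds copied from the generated query above
LAT_MIN, LAT_MAX = 40.7526, 40.7683
LON_MIN, LON_MAX = -73.9847, -73.9711


def in_bounding_box(lat, lon):
    """Mirror of the BETWEEN filters on start_station_latitude / start_station_longitude."""
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX


# Hypothetical coordinates for illustration only
print(in_bounding_box(40.7600, -73.9800))  # True: inside both ranges
print(in_bounding_box(40.7850, -73.9650))  # False: north of the latitude bound
```

Comparing the rectangle against a map, or supplying explicit bounds in the prompt, are two ways to make the meaning of "near" less dependent on what the model happens to pick.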

In [24]:
question_prompt = f"""
Answer the following question using the provided context.  Note that the context is a tabular result returned from a BigQuery query.  Do not repeat the question or the context when responding.

Include popular points of interest near each listed station.

question:
{question}
context:
{context_response}
"""

question_response = textgen_model.predict(question_prompt, max_output_tokens = 500)

print(question_response.text)

The top five stations with most unique trips in July 2015 near central park are:

1. Central Park S & 6 Ave (7237 trips)

Popular points of interest near this station include:

- Central Park
- The Metropolitan Museum of Art
- The American Museum of Natural History

2. Grand Army Plaza & Central Park S (6998 trips)

Popular points of interest near this station include:

- Central Park
- The Plaza Hotel
- The New York Public Library

3. Broadway & W 58 St (5585 trips)

Popular points of interest near this station include:

- The Museum of Modern Art
- The Whitney Museum of American Art
- The Frick Collection

4. E 43 St & Vanderbilt Ave (5369 trips)

Popular points of interest near this station include:

- Carnegie Hall
- The New York Public Library
- The Museum of Modern Art

5. Broadway & W 49 St (4670 trips)

Popular points of interest near this station include:

- The Empire State Building
- The Chrysler Building
- The Rockefeller Center
