
# Answering Questions Using BigQuery Tables As Context

Ask a question of a table in BigQuery.  How?  First, translate the question to SQL and extract results from the BigQuery table.  Then provide the results as context, along with the original question, to an LLM for answering.

The workflow:
- Setup Environment
- Setup access to LLMs: Test and Code
- Retrieve Table Schemas using BigQuery Information Schema
- Translate Question to Code (SQL) with context of table schemas
- Ask an LLM to answer the question with context of results from running the generated SQL
- Ask more questions!

---
## Overview

<p><center>
    <img alt="Overview Chart" src="../architectures/notebooks/applied/genai/bq_qa.png" width="55%">
</center></p>


---
## Colab Setup

To run this notebook in Colab click [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/Vertex%20AI%20GenAI%20For%20BigQuery%20Q&A%20-%20Overview.ipynb) and run the cells in this section.  Otherwise, skip this section.

This cell will authenticate to GCP (follow prompts in the popup).

In [1]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [2]:
try:
    import google.colab
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

---
## Installs and API Enablement

The client packages may need to be installed in this environment.  Also, the APIs for Vertex AI and BigQuery need to be enabled (if not already enabled).

### Installs (If Needed)

In [None]:
install = False
try: import google.cloud.aiplatform
except ImportError:
    print('You need to pip install google-cloud-aiplatform (VERTEX AI), ... commencing')
    !pip install google-cloud-aiplatform -U -q
    install = True

### Restart Kernel (If Installs Occurred)

After a kernel restart the code submission can start with the next cell after this one.

In [None]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

---
## Setup

Inputs

In [1]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [2]:
REGION = 'us-central1'
EXPERIMENT = 'bq-citibikes'
SERIES = 'applied-genai'

Packages

In [3]:
import vertexai.language_models
from google.cloud import aiplatform
from google.cloud import bigquery

import pandas as pd


Clients

In [4]:
# vertex ai clients
vertexai.init(project = PROJECT_ID, location = REGION)
aiplatform.init(project = PROJECT_ID, location = REGION)

# bigquery client
bq = bigquery.Client(project = PROJECT_ID)

---
## Goal

New York City has [Citibike](https://citibikenyc.com/) stations where you can rent a bicycle by the ride, by the day, or with a monthly/annual subscription.  A sample of Citibike station usage is available in the BigQuery public datasets.  We would like to answer questions posed in natural language using this data as the source.

A possible question is: Which Citibike station had the most rentals during July 2015?

The approach used here is to ask an LLM to answer the question.  The approach has multiple steps:
1. Ask a code generation LLM to write a SQL query that retrieves the information relevant to the question from the tables - the context
2. Run the query generated in 1
3. Ask a text generation LLM to answer the question and give it a context to help accurately answer the question - the result of 2
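
The three steps above can be sketched as a single function that chains them together. This is only an illustration of the control flow: the three callables are hypothetical stand-ins (not part of the Vertex AI or BigQuery clients) so the sketch runs without any cloud access, and the values they return are placeholders.

```python
from typing import Callable


def answer_with_table_context(
    question: str,
    generate_sql: Callable[[str], str],
    run_query: Callable[[str], str],
    generate_answer: Callable[[str, str], str],
) -> str:
    """Chain the three steps: question -> SQL -> query result -> answer."""
    sql = generate_sql(question)                # step 1: code generation LLM
    context = run_query(sql)                    # step 2: run the generated query
    return generate_answer(question, context)   # step 3: text generation LLM


# Hypothetical stand-ins for the LLM and BigQuery calls (placeholder values).
def fake_generate_sql(question: str) -> str:
    return "SELECT 'Central Park S & 6 Ave' AS station_name"


def fake_run_query(sql: str) -> str:
    return "| station_name |\n|:--|\n| Central Park S & 6 Ave |"


def fake_generate_answer(question: str, context: str) -> str:
    # Pretend the model reads the last row of the markdown table.
    return context.splitlines()[-1].strip("| ")


demo_answer = answer_with_table_context(
    "Which station had most rentals?",
    fake_generate_sql,
    fake_run_query,
    fake_generate_answer,
)
print(demo_answer)
```

Passing the steps in as callables keeps the flow independent of any particular model or client, which makes it easy to swap models or test the flow offline.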

In [5]:
question = "Which station had most rentals (longest total duration) during July 2015?"

---
## Vertex LLM Setup

- CodeGenerationModel [Guide](https://cloud.google.com/vertex-ai/docs/generative-ai/code/code-generation-prompts)
    - CodeGenerationModel [API](https://cloud.google.com/python/docs/reference/aiplatform/latest/vertexai.language_models.CodeGenerationModel)
- TextGenerationModel [Guide](https://cloud.google.com/vertex-ai/docs/generative-ai/text/test-text-prompts)
    - TextGenerationModel [API](https://cloud.google.com/python/docs/reference/aiplatform/latest/vertexai.preview.language_models.TextGenerationModel)

In [6]:
# create links to models: text generation and code generation
textgen_model = vertexai.language_models.TextGenerationModel.from_pretrained('text-bison')
codegen_model = vertexai.language_models.CodeGenerationModel.from_pretrained('code-bison')

Test the text generation (LLM) model:

In [7]:
textgen_model.predict(f"Write a Google SQL query that answers the following question.\nquestion: {question}")

 ```sql
SELECT start_station_id,
       SUM(duration_minutes) AS total_duration
FROM bike_sharing_trips
WHERE EXTRACT(MONTH FROM start_time) = 7
  AND EXTRACT(YEAR FROM start_time) = 2015
GROUP BY start_station_id
ORDER BY total_duration DESC
LIMIT 1;
```

In [8]:
codegen_model.predict(f"Write a Google SQL query that answers the following question.\nquestion: {question}")

```sql
SELECT start_station_id, SUM(duration_minutes) AS total_duration
FROM bike_sharing_trips
WHERE DATE BETWEEN '2015-07-01' AND '2015-07-31'
GROUP BY start_station_id
ORDER BY total_duration DESC
LIMIT 1;
```

Both models write code, but notice how they have to make assumptions about table names and column names.  The following sections address how to guide the LLM to write runnable SQL with the correct tables and columns.

---
## The Problem
Both LLMs write valid SQL queries.  However, notice that asking either LLM to write the query this way faces several issues:
- The queries do not reference the correct tables
- The column names are not the correct ones from the correct tables

Basically, the generated SQL is a good starting point for a user to write a query that would retrieve valid context for the user's question.

**How to get a fully executable SQL query from the LLM?**

The following approach was created by iteratively refining text prompts and approaches for specific questions.  

---
## Retrieve Table Schemas

The context provided to the LLM to help it write the SQL query will be the schemas of the related tables.  The BigQuery tables used for this experiment are BigQuery public dataset tables:
- `bigquery-public-data.new_york.citibike_trips`
- `bigquery-public-data.new_york.citibike_stations`

To retrieve the schemas for the tables, the BigQuery [INFORMATION_SCHEMA](https://cloud.google.com/bigquery/docs/information-schema-intro) is used - specifically the [INFORMATION_SCHEMA.COLUMN_FIELD_PATHS](https://cloud.google.com/bigquery/docs/information-schema-column-field-paths) view.

In [9]:
BQ_PROJECT = 'bigquery-public-data'
BQ_DATASET = 'new_york'
BQ_TABLES = ['citibike_trips', 'citibike_stations']

In [10]:
query = f"""
    SELECT *
    FROM `{BQ_PROJECT}.{BQ_DATASET}.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS`
    WHERE table_name in ({','.join([f'"{table}"' for table in BQ_TABLES])})
"""
print(query)
schema_columns = bq.query(query = query).to_dataframe()


    SELECT *
    FROM `bigquery-public-data.new_york.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS`
    WHERE table_name in ("citibike_trips","citibike_stations")



In [11]:
schema_columns.head()

|   | table_catalog | table_schema | table_name | column_name | field_path | data_type | description | collation_name | rounding_mode |
|---|---|---|---|---|---|---|---|---|---|
| 0 | bigquery-public-data | new_york | citibike_stations | station_id | station_id | STRING | Unique identifier of a station. | | |
| 1 | bigquery-public-data | new_york | citibike_stations | name | name | STRING | Public name of the station. | | |
| 2 | bigquery-public-data | new_york | citibike_stations | short_name | short_name | STRING | Short name or other type of identifier, as use... | | |
| 3 | bigquery-public-data | new_york | citibike_stations | latitude | latitude | FLOAT64 | The latitude of station. The field value must ... | | |
| 4 | bigquery-public-data | new_york | citibike_stations | longitude | longitude | FLOAT64 | The longitude of station. The field value must... | | |


In [12]:
#schema_columns.to_markdown(index = False)

---
## Notes On Efficient Information Schema Retrieval

In this example the entire schema is used as context.  But what happens when there are many tables and many columns?  Eventually the size will exceed the input limit of the LLM.  It is also possible to misguide the LLM in the creation of code by supplying too much information - especially when the information covers different topics.

Using semantic retrieval with embeddings is a great solution in this situation:
- create embeddings for all table descriptions
- create embeddings for all column descriptions
- create a vector database with indexes for tables and columns
- When a new question comes in find the most applicable table(s) and columns:
    - embed the question
    - do a vector search for matching table(s)
    - do a vector search for matching columns on the already matched table(s)
    - prepare a schema for just the matched tables and columns
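
The retrieval steps listed above can be sketched with plain cosine similarity. This is a minimal illustration, not a vector database: the three-dimensional vectors below are made-up placeholders (real embeddings would come from an embedding model and have far more dimensions), and the helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_matches(query_vector, index, k=1):
    """Return the k names in `index` whose vectors are most similar to the query."""
    ranked = sorted(
        index,
        key=lambda name: cosine_similarity(query_vector, index[name]),
        reverse=True,
    )
    return ranked[:k]


# Placeholder embeddings for table and column descriptions (assumed values).
table_index = {
    "citibike_trips": [0.9, 0.1, 0.0],
    "citibike_stations": [0.1, 0.9, 0.0],
}
column_index = {
    "citibike_trips": {
        "starttime": [0.8, 0.0, 0.2],
        "tripduration": [0.7, 0.1, 0.6],
        "bikeid": [0.0, 0.2, 0.9],
    },
}

# Placeholder embedding of the incoming question.
question_vector = [0.85, 0.05, 0.3]

# First match tables, then match columns only within the matched tables.
matched_tables = top_matches(question_vector, table_index, k=1)
matched_columns = top_matches(question_vector, column_index[matched_tables[0]], k=2)
print(matched_tables, matched_columns)
```

Searching columns only within the already matched tables keeps the final schema small and on topic, which is the point of the two-stage search.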

---
## Code Generation LLM - With Context

In this attempt, ask the code generation LLM to write valid SQL and also provide it context.  In this case the context is the schema of the tables that are relevant to Citibike rentals.

The table that represents the schema is turned into context here by converting it to markdown.  This is an area where users can experiment with the format: JSON, CSV, and so on.  I like markdown because it includes the column names a single time and is delimited by the header row notation of markdown!
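
To make the format comparison concrete, here is a small sketch that renders the same three schema rows as JSON, CSV, and a markdown table using only the standard library. The helper names are hypothetical and the markdown renderer is a simplified stand-in for `DataFrame.to_markdown`; the point is only that JSON repeats every column name on every row while CSV and markdown state them once.

```python
import csv
import io
import json

# Three rows taken from the citibike_stations schema, reduced to three fields.
rows = [
    {"table_name": "citibike_stations", "column_name": "station_id", "data_type": "STRING"},
    {"table_name": "citibike_stations", "column_name": "name", "data_type": "STRING"},
    {"table_name": "citibike_stations", "column_name": "latitude", "data_type": "FLOAT64"},
]


def as_json(rows):
    return json.dumps(rows)


def as_csv(rows):
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def as_markdown(rows):
    columns = list(rows[0])
    lines = [
        "| " + " | ".join(columns) + " |",
        "|" + "|".join(" --- " for _ in columns) + "|",
    ]
    lines += ["| " + " | ".join(str(row[c]) for c in columns) + " |" for row in rows]
    return "\n".join(lines)


for name, render in [("json", as_json), ("csv", as_csv), ("markdown", as_markdown)]:
    text = render(rows)
    print(name, "characters:", len(text), "header repeats:", text.count("column_name"))
```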

In [13]:
print(question)

Which station had most rentals (longest total duration) during July 2015?


In [14]:
context_prompt = f"""
Write a Google SQL query for BigQuery that answers the following question while using the provided context to correctly refer to BigQuery tables and the needed column names.  When joining tables use coersion to ensure all join columns are the same data type. Output column names should include the units when applicable.  Tables should be refered to using a fully qualified name include project and dataset along with table name. 
question:
{question}

context:
{schema_columns.to_markdown(index = False)}
"""

context_query = codegen_model.predict(context_prompt, max_output_tokens = 256)

print(context_query.text)

```sql
SELECT 
  start_station_name AS station_name,
  SUM(tripduration) AS total_duration_minutes
FROM bigquery-public-data.new_york.citibike_trips
WHERE 
  EXTRACT(MONTH FROM starttime) = 7
  AND EXTRACT(YEAR FROM starttime) = 2015
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1;
```


In [15]:
#print(context_prompt)

In [16]:
context_response = bq.query(query = '\n'.join(context_query.text.split('\n')[1:-1])).to_dataframe()
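
The `split('\n')[1:-1]` slice in the cell above assumes the response is exactly one fenced block with nothing before or after it. As a hedged alternative, the sketch below pulls the first fenced block out with a regular expression and falls back to the raw text when no fence is found. `extract_sql` is a hypothetical helper, not part of any library, and it still assumes the fence markers sit on their own lines.

```python
import re

# Matches an opening fence with an optional "sql" tag, captures up to the closing fence.
FENCE_PATTERN = re.compile(r"```(?:sql)?\s*\n(.*?)```", re.DOTALL | re.IGNORECASE)


def extract_sql(response_text: str) -> str:
    """Return the SQL inside the first fenced block, or the stripped text if unfenced."""
    match = FENCE_PATTERN.search(response_text)
    if match:
        return match.group(1).strip()
    return response_text.strip()


print(extract_sql("```sql\nSELECT 1;\n```"))
print(extract_sql("Sure, here is the corrected query:\n\n```\nSELECT 2;\n```\n"))
print(extract_sql("  SELECT 3;  "))
```

With a helper like this the query call would become `bq.query(query = extract_sql(context_query.text))`.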

In [17]:
context_response

|   | station_name | total_duration_minutes |
|---|---|---|
| 0 | Central Park S & 6 Ave | 18055103 |


---
## Answer The Question

Now that a valid context has been retrieved from BigQuery, it can be passed to a text generation LLM to answer the user's question.

In [18]:
question_prompt = f"""
Answer the following question using the provided context.  Note that the context is a tabular result returned from a BigQuery query.  Do not repeat the question or the context when responding.

question:
{question}
context:
{context_response.to_markdown(index = False)}
"""

question_response = textgen_model.predict(question_prompt)

print(question_response.text)

 Central Park S & 6 Ave


In [19]:
#print(question_prompt)

---
## Put It All Together

Ask a new question and try it out:

In [20]:
question = 'What were the top five stations with most unique trips in July 2015?'

In [21]:
context_prompt = f"""
Write a Google SQL query for BigQuery that answers the following question while using the provided context to correctly refer to BigQuery tables and the needed column names.  When joining tables use coersion to ensure all join columns are the same data type. Output column names should include the units when applicable.  Tables should be refered to using a fully qualified name include project and dataset along with table name. 
question:
{question}

context:

{schema_columns.to_markdown(index = False)}
"""

context_query = codegen_model.predict(context_prompt, max_output_tokens = 256)

context_response = bq.query(query = '\n'.join(context_query.text.split('\n')[1:-1])).to_dataframe()

In [22]:
context_query

```sql
SELECT 
  start_station_name AS station_name,
  COUNT(DISTINCT bikeid) AS unique_trips
FROM bigquery-public-data.new_york.citibike_trips
WHERE 
  EXTRACT(MONTH FROM starttime) = 7
  AND EXTRACT(YEAR FROM starttime) = 2015
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5;
```

In [23]:
question_prompt = f"""
Answer the following question using the provided context.  Note that the context is a tabular result returned from a BigQuery query.  Do not repeat the question or the context when responding.

question:
{question}
context:
{context_response.to_markdown(index = False)}
"""

question_response = textgen_model.predict(question_prompt)

print(question_response.text)

 1. 8 Ave & W 31 St (6004)
2. Pershing Square North (5468)
3. West St & Chambers St (5259)
4. Lafayette St & E 8 St (5144)
5. E 17 St & Broadway (5116)


---
## All Together and More Complex

In [24]:
question = 'What were the top five stations with most trips started in July 2015 near central park?'

In [25]:
context_prompt = f"""
Write a Google SQL query for BigQuery that answers the following question while using the provided context to correctly refer to BigQuery tables and the needed column names.  When joining tables use coersion to ensure all join columns are the same data type. Output column names should include the units when applicable.  Tables should be refered to using a fully qualified name include project and dataset along with table name. 
question:
{question}

context:

{schema_columns.to_markdown(index = False)}
"""

context_query = codegen_model.predict(context_prompt, max_output_tokens = 256)

context_response = bq.query(query = '\n'.join(context_query.text.split('\n')[1:-1])).to_dataframe()

In [26]:
context_query

```sql
SELECT 
  start_station_name AS station_name,
  COUNT(*) AS num_trips
FROM bigquery-public-data.new_york.citibike_trips
WHERE 
  EXTRACT(MONTH FROM starttime) = 7
  AND EXTRACT(YEAR FROM starttime) = 2015
  AND start_station_latitude BETWEEN 40.769 AND 40.78 
  AND start_station_longitude BETWEEN -73.985 AND -73.965
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5;
```

In [27]:
question_prompt = f"""
Answer the following question using the provided context.  Note that the context is a tabular result returned from a BigQuery query.  Do not repeat the question or the context when responding.

Include popular points of interest near each listed station.

question:
{question}
context:
{context_response.to_markdown(index = False)}
"""

question_response = textgen_model.predict(question_prompt, max_output_tokens = 500)

print(question_response.text)

 The top five stations with the most trips started in July 2015 near Central Park are:

1. Broadway & W 60 St (8793 trips) - Central Park Zoo, Lincoln Center, Time Warner Center
2. 5 Ave/59 St       (7610 trips) - Central Park Zoo, The Plaza Hotel, FAO Schwarz
3. Columbus Circle    (6843 trips) - Central Park South, Museum of Modern Art, Time Warner Center
4. 8 Ave/W 59 St     (6322 trips) - Central Park South, Lincoln Center, Hearst Tower
5. 7 Ave/W 59 St     (5991 trips) - Central Park South, The Plaza Hotel, FAO Schwarz


---
## When The Generated Query Fails...

The following question leads to an error in the `context_query` execution.  This section covers a method of using a Code Chat LLM to iteratively fix the query based on the error details returned from the job execution in BigQuery.


In [399]:
question = "How many trips were started on a weekend, in the afternoon, during 2015, by a regular rider, who is over the age of 60?"

In [323]:
context_prompt = f"""
Write a Google SQL query for BigQuery that answers the following question while using the provided context to correctly refer to BigQuery tables and the needed column names.  When joining tables use coersion to ensure all join columns are the same data type. Output column names should include the units when applicable.  Tables should be refered to using a fully qualified name include project and dataset along with table name. 
question:
{question}

context:

{schema_columns.to_markdown(index = False)}
"""

context_query = codegen_model.predict(context_prompt, max_output_tokens = 256)

In [324]:
context_query

```sql
SELECT COUNT(*) AS num_trips
FROM bigquery-public-data.new_york.citibike_trips AS trips
JOIN bigquery-public-data.new_york.citibike_stations AS stations
ON CAST(trips.start_station_id AS STRING) = stations.station_id
WHERE 
  EXTRACT(WEEKDAY FROM starttime) IN (0, 6)
  AND EXTRACT(HOUR FROM starttime) BETWEEN 12 AND 17
  AND EXTRACT(YEAR FROM starttime) = 2015
  AND usertype = 'Subscriber'
  AND birth_year < 1955;
```

In [381]:
# uncomment and run this query - it will fail because WEEKDAY is not a valid date part - 10/17/2023
#context_response = bq.query(query = '\n'.join(context_query.text.split('\n')[1:-1])).to_dataframe()

```
---------------------------------------------------------------------------
BadRequest                                Traceback (most recent call last)
Cell In[325], line 1
----> 1 context_response = bq.query(query = '\n'.join(context_query.text.split('\n')[1:-1])).to_dataframe()

File /opt/conda/lib/python3.10/site-packages/google/cloud/bigquery/job/query.py:1799, in QueryJob.to_dataframe(self, bqstorage_client, dtypes, progress_bar_type, create_bqstorage_client, max_results, geography_as_object, bool_dtype, int_dtype, float_dtype, string_dtype, date_dtype, datetime_dtype, time_dtype, timestamp_dtype)
   1633 def to_dataframe(
   1634     self,
   1635     bqstorage_client: Optional["bigquery_storage.BigQueryReadClient"] = None,
   (...)
   1648     timestamp_dtype: Union[Any, None] = None,
   1649 ) -> "pandas.DataFrame":
   1650     """Return a pandas DataFrame from a QueryJob
   1651 
   1652     Args:
   (...)
   1797             :mod:`shapely` library cannot be imported.
   1798     """
-> 1799     query_result = wait_for_query(self, progress_bar_type, max_results=max_results)
   1800     return query_result.to_dataframe(
   1801         bqstorage_client=bqstorage_client,
   1802         dtypes=dtypes,
   (...)
   1813         timestamp_dtype=timestamp_dtype,
   1814     )

File /opt/conda/lib/python3.10/site-packages/google/cloud/bigquery/_tqdm_helpers.py:104, in wait_for_query(query_job, progress_bar_type, max_results)
    100 progress_bar = get_progress_bar(
    101     progress_bar_type, "Query is running", default_total, "query"
    102 )
    103 if progress_bar is None:
--> 104     return query_job.result(max_results=max_results)
    106 i = 0
    107 while True:

File /opt/conda/lib/python3.10/site-packages/google/cloud/bigquery/job/query.py:1520, in QueryJob.result(self, page_size, max_results, retry, timeout, start_index, job_retry)
   1517     if retry_do_query is not None and job_retry is not None:
   1518         do_get_result = job_retry(do_get_result)
-> 1520     do_get_result()
   1522 except exceptions.GoogleAPICallError as exc:
   1523     exc.message = _EXCEPTION_FOOTER_TEMPLATE.format(
   1524         message=exc.message, location=self.location, job_id=self.job_id
   1525     )

File /opt/conda/lib/python3.10/site-packages/google/api_core/retry.py:349, in Retry.__call__.<locals>.retry_wrapped_func(*args, **kwargs)
    345 target = functools.partial(func, *args, **kwargs)
    346 sleep_generator = exponential_sleep_generator(
    347     self._initial, self._maximum, multiplier=self._multiplier
    348 )
--> 349 return retry_target(
    350     target,
    351     self._predicate,
    352     sleep_generator,
    353     self._timeout,
    354     on_error=on_error,
    355 )

File /opt/conda/lib/python3.10/site-packages/google/api_core/retry.py:191, in retry_target(target, predicate, sleep_generator, timeout, on_error, **kwargs)
    189 for sleep in sleep_generator:
    190     try:
--> 191         return target()
    193     # pylint: disable=broad-except
    194     # This function explicitly must deal with broad exceptions.
    195     except Exception as exc:

File /opt/conda/lib/python3.10/site-packages/google/cloud/bigquery/job/query.py:1510, in QueryJob.result.<locals>.do_get_result()
   1507     self._retry_do_query = retry_do_query
   1508     self._job_retry = job_retry
-> 1510 super(QueryJob, self).result(retry=retry, timeout=timeout)
   1512 # Since the job could already be "done" (e.g. got a finished job
   1513 # via client.get_job), the superclass call to done() might not
   1514 # set the self._query_results cache.
   1515 self._reload_query_results(retry=retry, timeout=timeout)

File /opt/conda/lib/python3.10/site-packages/google/cloud/bigquery/job/base.py:922, in _AsyncJob.result(self, retry, timeout)
    919     self._begin(retry=retry, timeout=timeout)
    921 kwargs = {} if retry is DEFAULT_RETRY else {"retry": retry}
--> 922 return super(_AsyncJob, self).result(timeout=timeout, **kwargs)

File /opt/conda/lib/python3.10/site-packages/google/api_core/future/polling.py:261, in PollingFuture.result(self, timeout, retry, polling)
    256 self._blocking_poll(timeout=timeout, retry=retry, polling=polling)
    258 if self._exception is not None:
    259     # pylint: disable=raising-bad-type
    260     # Pylint doesn't recognize that this is valid in this case.
--> 261     raise self._exception
    263 return self._result

BadRequest: 400 A valid date part name is required but found WEEKDAY at [6:11]

Location: US
Job ID: 8bd89480-8117-4c27-aacc-c94affe146ca
```

### Detect Errors

In [326]:
query = '\n'.join(context_query.text.split('\n')[1:-1])

In [327]:
print(query)

SELECT COUNT(*) AS num_trips
FROM bigquery-public-data.new_york.citibike_trips AS trips
JOIN bigquery-public-data.new_york.citibike_stations AS stations
ON CAST(trips.start_station_id AS STRING) = stations.station_id
WHERE 
  EXTRACT(WEEKDAY FROM starttime) IN (0, 6)
  AND EXTRACT(HOUR FROM starttime) BETWEEN 12 AND 17
  AND EXTRACT(YEAR FROM starttime) = 2015
  AND usertype = 'Subscriber'
  AND birth_year < 1955;


In [328]:
query_job = bq.query(query = query)

In [329]:
type(query_job)

google.cloud.bigquery.job.query.QueryJob

In [330]:
query_job.errors

[{'reason': 'invalidQuery',
  'location': 'query',
  'message': 'A valid date part name is required but found WEEKDAY at [6:11]'}]
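
Reading `query_job.errors` right after submission works here because the query is rejected at validation. A more defensive variant, sketched below under that assumption, waits for the job with `result()` and turns any raised exception into an error list, so failures that only surface while the job runs are captured as well. `run_query_capture_errors` is a hypothetical helper, and the fake client and job exist only so the sketch runs without BigQuery; they mimic the `query()` and `result()` calls used in this notebook.

```python
def run_query_capture_errors(client, sql):
    """Run `sql`; return (rows, None) on success or (None, errors) on failure."""
    try:
        job = client.query(sql)
        rows = list(job.result())  # blocks until the job finishes, raises on failure
    except Exception as exc:
        # Google API exceptions expose an `errors` list; fall back to the message text.
        errors = getattr(exc, "errors", None) or [{"message": str(exc)}]
        return None, errors
    return rows, None


class FakeJob:
    """Stand-in for a query job (illustration only)."""

    def __init__(self, rows=None, error=None):
        self._rows = rows
        self._error = error

    def result(self):
        if self._error is not None:
            raise self._error
        return self._rows


class FakeClient:
    """Stand-in for bigquery.Client that rejects the WEEKDAY date part."""

    def query(self, sql):
        if "WEEKDAY" in sql:
            message = "A valid date part name is required but found WEEKDAY"
            return FakeJob(error=ValueError(message))
        return FakeJob(rows=[{"num_trips": 0}])  # placeholder row


client = FakeClient()
print(run_query_capture_errors(client, "SELECT EXTRACT(WEEKDAY FROM starttime)"))
print(run_query_capture_errors(client, "SELECT 1"))
```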

### Use A Chat Model To Iteratively Refine A Query:

Chat models, like `codechat-bison@001`, are interactive in that they keep a history of the chat session as context for future messages.  This is well suited to both generating the initial query and asking for help repairing it when BigQuery returns errors.

#### Start A Chat Session With The Same Context (Schema)

In [337]:
codechat_model = vertexai.language_models.CodeChatModel.from_pretrained('codechat-bison@001')

In [338]:
codechat = codechat_model.start_chat(
    context = schema_columns.to_markdown(index = False)
)

#### Fix Query (Try 1)

In [339]:
response = codechat.send_message(f"""
This query:
{query}

Returns these errors:
{query_job.errors}
""")

In [340]:
response

The query you provided is not valid. The error message indicates that the query is missing a valid date part name.

To fix this error, you need to specify the date part name that you want to extract. For example, if you want to extract the day of the week, you would use the `DAY` date part name.

Here is an example of a query that uses the `DAY` date part name:

```
SELECT COUNT(*) AS num_trips
FROM bigquery-public-data.new_york.citibike_trips AS trips
JOIN bigquery-public-data.new_york.citibike_stations AS stations
ON CAST(trips.start_station_id AS STRING) = stations.station_id
WHERE 
  EXTRACT(DAY FROM starttime) IN (0, 6)
  AND EXTRACT(HOUR FROM starttime) BETWEEN 12 AND 17
  AND EXTRACT(YEAR FROM starttime) = 2015
  AND usertype = 'Subscriber'
  AND birth_year < 1955;
```

This query should fix the error that you are seeing.

In [341]:
response = codechat.send_message(f'Respond with only the corrected query as a markdown code block.')

In [342]:
response

Sure, here is the corrected query as a markdown code block:

```
SELECT COUNT(*) AS num_trips
FROM bigquery-public-data.new_york.citibike_trips AS trips
JOIN bigquery-public-data.new_york.citibike_stations AS stations
ON CAST(trips.start_station_id AS STRING) = stations.station_id
WHERE 
  EXTRACT(DAY FROM starttime) IN (0, 6)
  AND EXTRACT(HOUR FROM starttime) BETWEEN 12 AND 17
  AND EXTRACT(YEAR FROM starttime) = 2015
  AND usertype = 'Subscriber'
  AND birth_year < 1955;
```

In [343]:
if response.text.find("```") >= 0:
    query = response.text.split("```")[1]
    if query.startswith('sql'): query = query[3:]
    print(query)
else:
    print('no query in response')


SELECT COUNT(*) AS num_trips
FROM bigquery-public-data.new_york.citibike_trips AS trips
JOIN bigquery-public-data.new_york.citibike_stations AS stations
ON CAST(trips.start_station_id AS STRING) = stations.station_id
WHERE 
  EXTRACT(DAY FROM starttime) IN (0, 6)
  AND EXTRACT(HOUR FROM starttime) BETWEEN 12 AND 17
  AND EXTRACT(YEAR FROM starttime) = 2015
  AND usertype = 'Subscriber'
  AND birth_year < 1955;



In [344]:
query_job = bq.query(query = query)

In [345]:
query_job.errors

In [346]:
query_job.to_dataframe()

|   | num_trips |
|---|---|
| 0 | 4210 |


#### Generate Response

In [347]:
question_prompt = f"""
Answer the following question using the provided context.  Note that the context is a tabular result returned from a BigQuery query.  Do not repeat the question or the context when responding.

question:
{question}

context:
{query_job.to_dataframe().to_markdown(index = False)}
"""

question_response = textgen_model.predict(question_prompt, max_output_tokens = 500)

print(question_response.text)

 4210


### Put It All Together

In [404]:
question

'How many trips were started on a weekend, in the afternoon, during 2015, by a regular rider, who is over the age of 60?'

#### Function To Answer The Question Using Iteration To Fix Context Query

In [405]:
def BQ_QA(question, max_fixes = 7, schema_columns = schema_columns):
    
    # text generation model
    textgen_model = vertexai.language_models.TextGenerationModel.from_pretrained('text-bison')
    # code generation model
    codegen_model = vertexai.language_models.CodeGenerationModel.from_pretrained('code-bison')
    # code chat model
    codechat_model = vertexai.language_models.CodeChatModel.from_pretrained('codechat-bison@001')

    
    # initial request for query:
    query_response = codegen_model.predict(f"""
Troubleshoot and Fix Google SQL query for BigQuery that answers the following question while using the provided context to correctly refer to BigQuery tables and the needed column names.  When joining tables use coersion to ensure all join columns are the same data type. Output column names should include the units when applicable.  Tables should be refered to using a fully qualified name include project and dataset along with table name. 
Question:
{question}

Context:
{schema_columns.to_markdown(index = False)}
""", max_output_tokens = 256)
    
    # extract query from response
    if query_response.text.find("```") >= 0:
        query = query_response.text.split("```")[1]
        if query.startswith('sql'): query = query[4:]
        print('first try:\n', query)
    else:
        print('no query provided (first try) - unforeseen error, printing out response to help with editing this function:\n', query_response.text)
        return    
    
    # start a code chat session and give the schema for columns as the starting context:
    codechat = codechat_model.start_chat(
        context = f"""This session is trying to troubleshoot a Google BigQuery SQL query that is being writen to answer the question:
Question:
{question}

BigQuery SQL query that need to be fixed:
{query}

The BigQuery Environment has tables defined by the follow schema:
{schema_columns.to_markdown(index = False)}

Instructions:
As the user provided version of the query and the errors returned by BigQuery, offer suggestions that fix the errors but it is important that the query still answer the original question.
"""
    )
    
    # iteratively run query, and fix it using codechat until success (or max_fixes reached):
    fix_tries = 0
    answer = False
    while fix_tries < max_fixes:
        if not query: 
            return
        # run query:
        query_job = bq.query(query = query)
        # if errors, then generate new query:
        if query_job.errors:
            fix_tries += 1
            query_response = codechat.send_message(
f"""
This query:{query}

Returns these errors:
{query_job.errors}
""")
            query_response = codechat.send_message('Respond with only the corrected query that still answers the question as a markdown code block.')
            if query_response.text.find("```") >= 0:
                query = query_response.text.split("```")[1]
                if query.startswith('sql'):
                    query = query[4:]
                print(f'Fix #{fix_tries}:\n', query)
            # response did not have a query????:
            else:
                query = ''
                print('No query in response...')

        # no error, retrieve result and break loop
        else:
            result = query_job.to_dataframe()
            # answer question
            question_prompt = f"""
Answer the following question using the provided context.  Note that the context is a tabular result returned from a BigQuery query.  Do not repeat the question or the context when responding.

question:
{question}
context:
{result.to_markdown(index = False)}
"""
            question_response = textgen_model.predict(question_prompt, max_output_tokens = 500)
            answer = True
            print('The answer is:\n', question_response.text)
            break

    # if the loop breaks without answer then max_fixes reached - return message
    if not answer:
        print(f'No answer generated after {fix_tries} tries.')
    
    return codechat

In [406]:
session = BQ_QA(question)

first try:
 SELECT COUNT(*) AS num_trips
FROM bigquery-public-data.new_york.citibike_trips AS trips
JOIN bigquery-public-data.new_york.citibike_stations AS stations
ON CAST(trips.start_station_id AS STRING) = stations.station_id
WHERE 
    EXTRACT(WEEKDAY FROM starttime) IN (0, 6)
    AND EXTRACT(HOUR FROM starttime) BETWEEN 12 AND 17
    AND EXTRACT(YEAR FROM starttime) = 2015
    AND usertype = "Subscriber"
    AND birth_year < 1955;

Fix #1:
 
SELECT COUNT(*) AS num_trips
FROM bigquery-public-data.new_york.citibike_trips AS trips
JOIN bigquery-public-data.new_york.citibike_stations AS stations
ON CAST(trips.start_station_id AS STRING) = stations.station_id
WHERE 
    EXTRACT(DAYOFWEEK FROM starttime) IN (0, 6)
    AND EXTRACT(HOUR FROM starttime) BETWEEN 12 AND 17
    AND EXTRACT(YEAR FROM starttime) = 2015
    AND usertype = "Subscriber"
    AND birth_year < 1955;

The answer is:
  18968


In [407]:
for message in session.message_history:
    print(message.content)


This query:SELECT COUNT(*) AS num_trips
FROM bigquery-public-data.new_york.citibike_trips AS trips
JOIN bigquery-public-data.new_york.citibike_stations AS stations
ON CAST(trips.start_station_id AS STRING) = stations.station_id
WHERE 
    EXTRACT(WEEKDAY FROM starttime) IN (0, 6)
    AND EXTRACT(HOUR FROM starttime) BETWEEN 12 AND 17
    AND EXTRACT(YEAR FROM starttime) = 2015
    AND usertype = "Subscriber"
    AND birth_year < 1955;


Returns these errors:
[{'reason': 'invalidQuery', 'location': 'query', 'message': 'A valid date part name is required but found WEEKDAY at [6:13]'}]

The query is not valid because the `WEEKDAY` function is not supported in BigQuery. You can use the `EXTRACT` function to get the day of the week as an integer, and then use an `IN` clause to filter on weekends.

Here is a fixed version of the query:

```
SELECT COUNT(*) AS num_trips
FROM bigquery-public-data.new_york.citibike_trips AS trips
JOIN bigquery-public-data.new_york.citibike_stations AS stations

### Try Another Question:

In [408]:
question = 'How many trips during the year 2015: started in the evening, were over an hour long, and were by regular riders, who were over age 60?'

In [409]:
session = BQ_QA(question)

first try:
 SELECT COUNT(*) AS num_trips
FROM bigquery-public-data.new_york.citibike_trips AS trips
JOIN bigquery-public-data.new_york.citibike_stations AS start_stations
ON CAST(trips.start_station_id AS STRING) = start_stations.station_id
JOIN bigquery-public-data.new_york.citibike_stations AS end_stations
ON CAST(trips.end_station_id AS STRING) = end_stations.station_id
WHERE EXTRACT(HOUR FROM trips.starttime) BETWEEN 18 AND 23
AND trips.tripduration > 3600
AND trips.usertype = "Subscriber"
AND EXTRACT(YEAR FROM trips.starttime) = 2015
AND trips.birth_year < 1955;

The answer is:
  180


### Ideas For Improvement

After a few attempts to fix, ask the CodeChat LLM to start fresh and write a new query to answer the question.  Then iterate on the new query.
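
One way the start-fresh idea could look is sketched below. It is a control-flow illustration only: `run`, `fix`, and `rewrite` are hypothetical callables standing in for the BigQuery call, the CodeChat repair message, and a request for a brand new query, and `restart_after` is an assumed parameter name.

```python
def repair_with_restart(first_query, run, fix, rewrite, max_fixes=7, restart_after=3):
    """Run a query, repairing it on errors and starting fresh every `restart_after` failures.

    `run(query)` returns (result, errors); errors is None on success.
    Returns (query, result) on success or (None, None) when max_fixes is exhausted.
    """
    query = first_query
    for attempt in range(max_fixes + 1):
        result, errors = run(query)
        if errors is None:
            return query, result
        if attempt == max_fixes:
            break  # the last allowed fix has already been tried
        if (attempt + 1) % restart_after == 0:
            query = rewrite()           # ask for a brand new query
        else:
            query = fix(query, errors)  # ask for a repair of the current query
    return None, None


# Stand-in callables: only the freshly written query succeeds.
calls = []


def fake_run(query):
    return ("ok", None) if query == "fresh" else (None, ["bad"])


def fake_fix(query, errors):
    calls.append("fix")
    return query + "+fix"


def fake_rewrite():
    calls.append("rewrite")
    return "fresh"


print(repair_with_restart("q0", fake_run, fake_fix, fake_rewrite))
print(calls)
```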

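If the fixes are actually breaking the logic of the query, try asking whether the successful query answers the question before providing the result.  For example, the fix in Try 1 above replaced `WEEKDAY` with `DAY`, which runs without error but filters on the day of the month rather than the day of the week.  Likewise, the `DAYOFWEEK` fix produced inside `BQ_QA` keeps `IN (0, 6)` even though BigQuery's `DAYOFWEEK` returns values from 1 (Sunday) to 7 (Saturday).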
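
A minimal sketch of such a check follows. The prompt wording and the `query_answers_question` helper are assumptions for illustration, and `ask` stands in for any model call that takes a prompt and returns text; a real model's reply would need more careful parsing than a YES prefix.

```python
def query_answers_question(question, query, ask):
    """Return True when the model's reply starts with YES."""
    prompt = (
        "Does the following BigQuery SQL query correctly answer the question? "
        "Reply with YES or NO first, then a one sentence reason.\n"
        f"question:\n{question}\n\nquery:\n{query}\n"
    )
    return ask(prompt).strip().upper().startswith("YES")


def fake_ask(prompt):
    # Stand-in for a model call; flags the day-of-month mistake from the first fix.
    if "EXTRACT(DAY " in prompt:
        return "NO - DAY is the day of the month, not the day of the week."
    return "YES"


weekend_question = "How many trips were started on a weekend?"
print(query_answers_question(weekend_question, "SELECT COUNT(*) FROM t WHERE EXTRACT(DAY FROM starttime) IN (0, 6)", fake_ask))
print(query_answers_question(weekend_question, "SELECT COUNT(*) FROM t WHERE EXTRACT(DAYOFWEEK FROM starttime) IN (1, 7)", fake_ask))
```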