
# Answering Questions Using BigQuery Tables As Context

Ask a question of a table in BigQuery.  How?  First, translate the question to SQL and extract results from the BigQuery table.  Then provide the results as context, along with the original question, to an LLM for answering.

The workflow:
- Setup Environment
- Setup access to LLMs: Test and Code
- Retrieve Table Schemas using BigQuery Information Schema
- Translate Question to Code (SQL) with context of table schemas
- Ask an LLM to answer the question with context of results from running the generated SQL
- Ask more questions!

---
## Overview

<p><center>
    <img alt="Overview Chart" src="../architectures/notebooks/applied/genai/bq_qa.png" width="55%">
</center></p>


---
## Colab Setup

To run this notebook in Colab click [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/Vertex%20AI%20GenAI%20For%20BigQuery%20Q&A%20-%20Overview.ipynb) and run the cells in this section.  Otherwise, skip this section.

This cell will authenticate to GCP (follow prompts in the popup).

In [1]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [2]:
try:
    import google.colab
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

---
## Installs and API Enablement

The client packages may need installing in this environment.  Also, the APIs for Vertex AI and BigQuery need to be enabled (if not already enabled).

### Installs (If Needed)

In [None]:
install = False
try: import google.cloud.aiplatform
except ImportError:
    print('You need to pip install google-cloud-aiplatform (VERTEX AI), ... commencing')
    !pip install google-cloud-aiplatform -U -q
    install = True

### Restart Kernel (If Installs Occurred)

After a kernel restart the code submission can start with the next cell after this one.

In [None]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

---
## Setup

Inputs

In [1]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [2]:
REGION = 'us-central1'
EXPERIMENT = 'bq-citibikes'
SERIES = 'applied-genai'

Packages

In [3]:
import vertexai.language_models
from google.cloud import aiplatform
from google.cloud import bigquery


Clients

In [4]:
# vertex ai clients
vertexai.init(project = PROJECT_ID, location = REGION)
aiplatform.init(project = PROJECT_ID, location = REGION)

# bigquery client
bq = bigquery.Client(project = PROJECT_ID)

---
## Goal

New York City has [Citibike](https://citibikenyc.com/) stations where you can rent a bicycle by the ride, the day, or subscribe monthly/annually.  There is a sample of the usage of Citibike stations in BigQuery public datasets.  We would like to answer possible questions in natural language using this data as the source.

A possible question is: Which Citibike station had the most rentals during July 2015?

The approach used here is to ask an LLM to answer the question.  The approach has multiple steps:
1. Ask a code generation LLM to write a SQL query that retrieves the information relevant to the question from the tables - the context
2. Run the query generated in 1
3. Ask a text generation LLM to answer the question and give it a context to help accurately answer the question - the result of 2
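
The three steps can be sketched as a small pipeline. This is only an illustration of the control flow: the three callables are hypothetical stand-ins (not part of the Vertex AI or BigQuery clients) for the code generation LLM, the query execution, and the text generation LLM, and the fakes below return canned values so the sketch runs without any cloud access.

```python
from typing import Callable, Dict, List

Rows = List[Dict[str, object]]


def answer_with_table_context(
    question: str,
    generate_sql: Callable[[str], str],           # step 1: stand-in for the code generation LLM
    run_query: Callable[[str], Rows],             # step 2: stand-in for running the query
    generate_answer: Callable[[str, Rows], str],  # step 3: stand-in for the text generation LLM
) -> str:
    sql = generate_sql(question)            # 1. question -> SQL
    rows = run_query(sql)                   # 2. SQL -> tabular result (the context)
    return generate_answer(question, rows)  # 3. question + context -> answer


# Canned fakes (assumptions for illustration only).
def fake_generate_sql(question: str) -> str:
    return "SELECT 'Example St' AS station_name"


def fake_run_query(sql: str) -> Rows:
    return [{"station_name": "Example St"}]


def fake_generate_answer(question: str, rows: Rows) -> str:
    return str(rows[0]["station_name"])


result = answer_with_table_context(
    "Which station had the most rentals?",
    fake_generate_sql,
    fake_run_query,
    fake_generate_answer,
)
print(result)
```

Passing the steps in as callables keeps the flow testable: each fake can later be swapped for a real model or client call.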

In [44]:
question = "Which station had most rentals (longest total duration) during July 2015?"

---
## Vertex LLM Setup

- CodeGenerationModel [Guide](https://cloud.google.com/vertex-ai/docs/generative-ai/code/code-generation-prompts)
    - CodeGenerationModel [API](https://cloud.google.com/python/docs/reference/aiplatform/latest/vertexai.language_models.CodeGenerationModel)
- TextGenerationModel [Guide](https://cloud.google.com/vertex-ai/docs/generative-ai/text/test-text-prompts)
    - TextGenerationModel [API](https://cloud.google.com/python/docs/reference/aiplatform/latest/vertexai.preview.language_models.TextGenerationModel)

In [6]:
# create links to models: text generation and code generation
textgen_model = vertexai.language_models.TextGenerationModel.from_pretrained('text-bison')
codegen_model = vertexai.language_models.CodeGenerationModel.from_pretrained('code-bison')

Test the text generation (LLM) model:

In [7]:
textgen_model.predict(f"Write a Google SQL query that answers the following question.\nquestion: {question}")

 ```sql
SELECT start_station_id,
       SUM(duration_minutes) AS total_duration
FROM bike_sharing_trips
WHERE EXTRACT(MONTH FROM start_time) = 7
  AND EXTRACT(YEAR FROM start_time) = 2015
GROUP BY start_station_id
ORDER BY total_duration DESC
LIMIT 1;
```

In [8]:
codegen_model.predict(f"Write a Google SQL query that answers the following question.\nquestion: {question}")

```sql
SELECT start_station_id, SUM(duration_minutes) AS total_duration
FROM bike_sharing_trips
WHERE DATE BETWEEN '2015-07-01' AND '2015-07-31'
GROUP BY start_station_id
ORDER BY total_duration DESC
LIMIT 1;
```

These both write code, but notice how each has to make assumptions about table names and column names.  The following sections address how to guide the LLM to write runnable SQL with correct tables and columns.

---
## The Problem
Both LLMs write valid SQL queries.  However, notice that asking either LLM to write the query this way faces several issues:
- The queries do not reference the correct tables
- The column names are not the correct ones from the correct tables

Basically, the generated SQL is a good starting point for a user to write a query that would retrieve the valid context for the user's question.

**How to get a fully executable SQL query from the LLM?**

The following approach was created by iteratively refining text prompts and approaches for specific questions.  

---
## Retrieve Table Schemas

The context that will be provided to the LLM to help write the SQL query will be the schemas of the related tables.  The BigQuery tables used for this experiment are BigQuery public dataset tables:
- `bigquery-public-data.new_york.citibike_trips`
- `bigquery-public-data.new_york.citibike_stations`

To retrieve the schemas for the tables the BigQuery [INFORMATION_SCHEMA](https://cloud.google.com/bigquery/docs/information-schema-intro) is used - specifically the [INFORMATION_SCHEMA.COLUMN_FIELD_PATHS](https://cloud.google.com/bigquery/docs/information-schema-column-field-paths) view.

In [12]:
BQ_PROJECT = 'bigquery-public-data'
BQ_DATASET = 'new_york'
BQ_TABLES = ['citibike_trips', 'citibike_stations']

In [13]:
query = f"""
    SELECT *
    FROM `{BQ_PROJECT}.{BQ_DATASET}.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS`
    WHERE table_name in ({','.join([f'"{table}"' for table in BQ_TABLES])})
"""
print(query)
schema_columns = bq.query(query = query).to_dataframe()


    SELECT *
    FROM `bigquery-public-data.new_york.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS`
    WHERE table_name in ("citibike_trips","citibike_stations")



In [14]:
schema_columns.head()

| | table_catalog | table_schema | table_name | column_name | field_path | data_type | description | collation_name | rounding_mode |
|---|---|---|---|---|---|---|---|---|---|
| 0 | bigquery-public-data | new_york | citibike_stations | station_id | station_id | STRING | Unique identifier of a station. | | |
| 1 | bigquery-public-data | new_york | citibike_stations | name | name | STRING | Public name of the station. | | |
| 2 | bigquery-public-data | new_york | citibike_stations | short_name | short_name | STRING | Short name or other type of identifier, as use... | | |
| 3 | bigquery-public-data | new_york | citibike_stations | latitude | latitude | FLOAT64 | The latitude of station. The field value must ... | | |
| 4 | bigquery-public-data | new_york | citibike_stations | longitude | longitude | FLOAT64 | The longitude of station. The field value must... | | |


In [15]:
#schema_columns.to_markdown(index = False)

---
## Notes On Efficient Information Schema Retrieval

In this example the entire schema is used as context.  But what happens when there are many tables and many columns?  Eventually the size will exceed the input size of the LLM.  It is also possible to misguide the LLM in the creation of code by supplying too much information - especially when it covers different topics.

Using semantic retrieval with embeddings is a great solution in this situation:
- create embeddings for all table descriptions
- create embeddings for all column descriptions
- create a vector database with indexes for tables and columns
- When a new question comes in find the most applicable table(s) and columns:
    - embed the question
    - do a vector search for matching table(s)
    - do a vector search for matching columns on the already matched table(s)
    - prepare a schema for just the matched tables and columns
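
The retrieval steps above can be sketched with a toy example. Everything here is an assumption made for illustration: `toy_embed` is a word-count stand-in for a real embedding model, a plain dictionary stands in for the vector database, and the table and column descriptions are made up rather than taken from the actual dataset.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical descriptions, written for this sketch only.
TABLES = {
    "citibike_trips": "one row per bike trip with start time, duration and start and end station",
    "citibike_stations": "one row per station with name, location and capacity",
}
COLUMNS = {
    "citibike_trips": {
        "starttime": "start time of the trip",
        "tripduration": "duration of the trip in seconds",
        "start_station_name": "name of the station where the trip started",
    },
    "citibike_stations": {
        "station_id": "unique identifier of a station",
        "name": "public name of the station",
        "latitude": "latitude of the station",
    },
}


def top_matches(question, candidates, k):
    """Return the k candidate names whose descriptions are most similar to the question."""
    q = toy_embed(question)
    ranked = sorted(candidates.items(), key=lambda kv: cosine(q, toy_embed(kv[1])), reverse=True)
    return [name for name, _ in ranked[:k]]


def select_schema(question, n_tables=1, n_columns=2):
    """Match tables first, then match columns only within the matched tables."""
    tables = top_matches(question, TABLES, n_tables)
    return {t: top_matches(question, COLUMNS[t], n_columns) for t in tables}


selected = select_schema("What was the total duration of each trip by start station?")
print(selected)
```

With real embeddings the same two-stage search applies; only `toy_embed` and the storage of the vectors would change.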

---
## Code Generation LLM - With Context

In this attempt, ask the code generation LLM to write valid SQL and also provide it context.  In this case the context is the schema of the tables that are relevant to Citibike rentals.

Turning the table that represents the schema into context is done here by converting it to markdown.  This is an area where users can experiment with the format: JSON, CSV, and so on.  I like markdown because it includes the column names a single time and is delimited by the header row notation of markdown!
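
To make the point about column names concrete, here is a small sketch comparing a markdown rendering with JSON records for three of the schema rows shown earlier. The helper `to_markdown_table` is a hypothetical, dependency-free stand-in for the pandas `to_markdown` call used in this notebook; its exact spacing differs from what pandas produces.

```python
import json


def to_markdown_table(columns, rows):
    """Render rows (a list of dicts) as a simple markdown table."""
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = ["| " + " | ".join(str(row[c]) for c in columns) + " |" for row in rows]
    return "\n".join([header, divider, *body])


columns = ["table_name", "column_name", "data_type"]
rows = [
    {"table_name": "citibike_stations", "column_name": "station_id", "data_type": "STRING"},
    {"table_name": "citibike_stations", "column_name": "name", "data_type": "STRING"},
    {"table_name": "citibike_stations", "column_name": "latitude", "data_type": "FLOAT64"},
]

as_markdown = to_markdown_table(columns, rows)
as_json = json.dumps(rows)

print(as_markdown)
# Column names appear once in markdown but once per row in JSON records.
print(as_markdown.count("column_name"), as_json.count("column_name"))
print(len(as_markdown), len(as_json))
```

The gap grows with the number of rows, which matters once the schema approaches the input size of the model.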

In [45]:
print(question)

Which station had most rentals (longest total duration) during July 2015?


In [46]:
context_prompt = f"""
Write a Google SQL query for BigQuery that answers the following question while using the provided context to correctly refer to BigQuery tables and the needed column names.  When joining tables use coercion to ensure all join columns are the same data type. Output column names should include the units when applicable.  Tables should be referred to using a fully qualified name including project and dataset along with table name.
question:
{question}

context:
{schema_columns.to_markdown(index = False)}
"""

context_query = codegen_model.predict(context_prompt, max_output_tokens = 256)

print(context_query.text)

```sql
SELECT 
  start_station_name AS station_name,
  SUM(tripduration) AS total_duration_minutes
FROM bigquery-public-data.new_york.citibike_trips
WHERE 
  EXTRACT(MONTH FROM starttime) = 7
  AND EXTRACT(YEAR FROM starttime) = 2015
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1;
```


In [47]:
#print(context_prompt)

In [48]:
context_response = bq.query(query = '\n'.join(context_query.text.split('\n')[1:-1])).to_dataframe()

In [49]:
context_response

| | station_name | total_duration_minutes |
|---|---|---|
| 0 | Central Park S & 6 Ave | 18055103 |
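
The query cell above removes the code fence by dropping the first and last line of the response, which assumes the model always returns exactly one fenced block and nothing else. A slightly more defensive extraction could look like the sketch below. `extract_sql` and its `require_fence` parameter are hypothetical names, not part of any library.

```python
import re

FENCE = "`" * 3  # three backticks, built this way so the example is easy to embed
_BLOCK = re.compile(FENCE + r"(?:sql\b)?(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)


def extract_sql(response_text, require_fence=True):
    """Return the contents of the first fenced block, or None if there is none.

    With require_fence=False an unfenced response is treated as the query itself.
    """
    match = _BLOCK.search(response_text)
    if match:
        return match.group(1).strip()
    return None if require_fence else response_text.strip()


chatty = f"The following query will answer the question:\n\n{FENCE}sql\nSELECT 1 AS x\n{FENCE}\n\nThis query returns one row."
print(extract_sql(chatty))
print(extract_sql("no code here"))
```

This handles both a bare fenced block and a block surrounded by explanation, which is the form the chat model responses take later in this notebook.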


---
## Answer The Question

Now that a valid context has been retrieved from BigQuery it can be passed to a text generation LLM to answer the user's question.

In [50]:
question_prompt = f"""
Answer the following question using the provided context.  Note that the context is a tabular result returned from a BigQuery query.  Do not repeat the question or the context when responding.

question:
{question}
context:
{context_response.to_markdown(index = False)}
"""

question_response = textgen_model.predict(question_prompt)

print(question_response.text)

 Central Park S & 6 Ave


In [51]:
#print(question_prompt)

---
## Put It All Together

Ask a new question and try it out:

In [52]:
question = 'What were the top five stations with most unique trips in July 2015?'

In [53]:
context_prompt = f"""
Write a Google SQL query for BigQuery that answers the following question while using the provided context to correctly refer to BigQuery tables and the needed column names.  When joining tables use coercion to ensure all join columns are the same data type. Output column names should include the units when applicable.  Tables should be referred to using a fully qualified name including project and dataset along with table name.
question:
{question}

context:

{schema_columns.to_markdown(index = False)}
"""

context_query = codegen_model.predict(context_prompt, max_output_tokens = 256)

context_response = bq.query(query = '\n'.join(context_query.text.split('\n')[1:-1])).to_dataframe()

In [54]:
context_query

```sql
SELECT 
  start_station_name AS station_name,
  COUNT(DISTINCT bikeid) AS unique_trips
FROM bigquery-public-data.new_york.citibike_trips
WHERE 
  EXTRACT(MONTH FROM starttime) = 7
  AND EXTRACT(YEAR FROM starttime) = 2015
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5;
```

In [55]:
question_prompt = f"""
Answer the following question using the provided context.  Note that the context is a tabular result returned from a BigQuery query.  Do not repeat the question or the context when responding.

question:
{question}
context:
{context_response.to_markdown(index = False)}
"""

question_response = textgen_model.predict(question_prompt)

print(question_response.text)

 1. 8 Ave & W 31 St (6004)
2. Pershing Square North (5468)
3. West St & Chambers St (5259)
4. Lafayette St & E 8 St (5144)
5. E 17 St & Broadway (5116)


---
## All Together and More Complex

In [73]:
question = 'What were the top five stations with most trips started in July 2015 near central park?'

In [74]:
context_prompt = f"""
Write a Google SQL query for BigQuery that answers the following question while using the provided context to correctly refer to BigQuery tables and the needed column names.  When joining tables use coercion to ensure all join columns are the same data type. Output column names should include the units when applicable.  Tables should be referred to using a fully qualified name including project and dataset along with table name.
question:
{question}

context:

{schema_columns.to_markdown(index = False)}
"""

context_query = codegen_model.predict(context_prompt, max_output_tokens = 256)

context_response = bq.query(query = '\n'.join(context_query.text.split('\n')[1:-1])).to_dataframe()

In [75]:
context_query

```sql
SELECT
  start_station_name AS station_name,
  COUNT(*) AS num_trips
FROM
  bigquery-public-data.new_york.citibike_trips
WHERE
  EXTRACT(MONTH FROM starttime) = 7
  AND EXTRACT(YEAR FROM starttime) = 2015
  AND start_station_latitude BETWEEN 40.769 AND 40.784
  AND start_station_longitude BETWEEN -73.984 AND -73.964
GROUP BY
  1
ORDER BY
  2 DESC
LIMIT 5;
```

In [76]:
question_prompt = f"""
Answer the following question using the provided context.  Note that the context is a tabular result returned from a BigQuery query.  Do not repeat the question or the context when responding.

Include popular points of interest near each listed station.

question:
{question}
context:
{context_response.to_markdown(index = False)}
"""

question_response = textgen_model.predict(question_prompt, max_output_tokens = 500)

print(question_response.text)

 The top five stations with the most trips started in July 2015 near Central Park are:

1. Broadway & W 60 St (8793 trips) - Central Park Zoo, Lincoln Center, Time Warner Center
2. 5 Ave/59 St       (7602 trips) - Central Park Zoo, The Plaza Hotel, FAO Schwarz
3. Columbus Circle   (6841 trips) - Central Park South, Museum of Modern Art, Time Warner Center
4. 8 Ave/W 59 St     (6311 trips) - Central Park South, Lincoln Center, Hearst Tower
5. 7 Ave/W 59 St     (5993 trips) - Central Park South, The Plaza Hotel, FAO Schwarz


---
## When The Generated Query Fails...

The following question leads to an error in the `context_query` execution.  This section covers a method of using a Code Chat LLM to iteratively fix the query based on the error details returned from the job execution in BigQuery.


In [218]:
question = "How many trips have started within a few blocks of central park after noon on Saturdays during 2015?"

In [219]:
context_prompt = f"""
Write a Google SQL query for BigQuery that answers the following question while using the provided context to correctly refer to BigQuery tables and the needed column names.  When joining tables use coercion to ensure all join columns are the same data type. Output column names should include the units when applicable.  Tables should be referred to using a fully qualified name including project and dataset along with table name.
question:
{question}

context:

{schema_columns.to_markdown(index = False)}
"""

context_query = codegen_model.predict(context_prompt, max_output_tokens = 256)

In [220]:
context_query

```sql
SELECT COUNT(*) AS num_trips
FROM bigquery-public-data.new_york.citibike_trips AS trips
JOIN bigquery-public-data.new_york.citibike_stations AS stations
ON CAST(trips.start_station_id AS STRING) = stations.station_id
WHERE ST_WITHIN(
  ST_GEOGPOINT(stations.longitude, stations.latitude),
  ST_GEOGPOINT(-73.9772, 40.7711),
  0.001)
AND EXTRACT(WEEKDAY FROM trips.starttime) = 6
AND EXTRACT(HOUR FROM trips.starttime) >= 12
AND EXTRACT(YEAR FROM trips.starttime) = 2015;
```

In [221]:
context_response = bq.query(query = '\n'.join(context_query.text.split('\n')[1:-1])).to_dataframe()

BadRequest: 400 Number of arguments does not match for function ST_WITHIN. Supported signature: ST_WITHIN(GEOGRAPHY, GEOGRAPHY) at [5:7]

Location: US
Job ID: a38015cd-ac17-415d-bb79-cc74fd77237d


### Detect Errors

In [299]:
context_job = bq.query(query = '\n'.join(context_query.text.split('\n')[1:-1]))

In [300]:
type(context_job)

google.cloud.bigquery.job.query.QueryJob

In [301]:
context_job.errors

[{'reason': 'invalidQuery',
  'location': 'query',
  'message': 'Number of arguments does not match for function ST_WITHIN. Supported signature: ST_WITHIN(GEOGRAPHY, GEOGRAPHY) at [5:7]'}]
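
The `errors` attribute shown above is a list of dictionaries with `reason`, `location` and `message` keys. A minimal sketch of turning that list into a compact repair request for an LLM follows. `build_repair_message` is a hypothetical helper, and the closing instruction about a fenced block is an assumption about what makes the reply easier to parse.

```python
def build_repair_message(query, errors):
    """Format a failed query and its BigQuery error list as one repair request."""
    details = "\n".join(
        f"- {err.get('reason', 'unknown')}: {err.get('message', '')}" for err in errors
    )
    return (
        "This query:\n"
        f"{query.strip()}\n\n"
        "Gives these errors:\n"
        f"{details}\n\n"
        "Please return the corrected query in a single fenced sql block."
    )


# The error list is copied from the output above; the query is a placeholder.
errors = [{
    'reason': 'invalidQuery',
    'location': 'query',
    'message': 'Number of arguments does not match for function ST_WITHIN. '
               'Supported signature: ST_WITHIN(GEOGRAPHY, GEOGRAPHY) at [5:7]',
}]
message = build_repair_message("SELECT 1", errors)
print(message)
```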

### Use A Chat Model To Iteratively Refine A Query:

Chat models, like `codechat-bison@001`, are interactive in that they keep a history of the chat session as context for future message interactions.  This is perfect for both generating the initial query and asking for help repairing it when BigQuery returns errors.

#### Start A Chat Session With The Same Context (Schema)

In [316]:
codechat_model = vertexai.language_models.CodeChatModel.from_pretrained('codechat-bison@001')

In [317]:
codechat = codechat_model.start_chat(
    context = schema_columns.to_markdown(index = False)
)

#### Generate Query (Try 1)

In [318]:
response = codechat.send_message(f"""
Write a Google SQL query for BigQuery that answers the following question while correctly referring to BigQuery tables and the needed column names.  When joining tables use coercion to ensure all join columns are the same data type. Output column names should include the units when applicable.  Tables should be referred to using a fully qualified name including project and dataset along with table name.

{question}
""")

In [319]:
response

The following query will answer the question:

```sql
SELECT count(*)
FROM `bigquery-public-data.new_york.citibike_trips` AS trips
JOIN `bigquery-public-data.new_york.citibike_stations` AS stations
ON trips.start_station_id = stations.station_id
WHERE stations.latitude BETWEEN 40.755749 AND 40.760694
AND stations.longitude BETWEEN -73.982222 AND -73.977778
AND trips.starttime > '2015-01-01 12:00:00'
AND trips.starttime < '2016-01-01 00:00:00'
```

This query first joins the `citibike_trips` table to the `citibike_stations` table on the `start_station_id` column. This ensures that we only count trips that started at a station that is within a few blocks of Central Park.

Next, we use a WHERE clause to filter the results to only include trips that started after noon on Saturdays during 2015.

Finally, we use a COUNT(*) aggregate function to count the number of rows that meet the criteria.

In [326]:
if response.text.find("```") >= 0:
    query = response.text.split("```")[1]
    if query.startswith('sql'): query = query[3:]
    print(query)
else:
    print('no query in response')


SELECT count(*)
FROM `bigquery-public-data.new_york.citibike_trips` AS trips
JOIN `bigquery-public-data.new_york.citibike_stations` AS stations
ON trips.start_station_id = stations.station_id
WHERE stations.latitude BETWEEN 40.755749 AND 40.760694
AND stations.longitude BETWEEN -73.982222 AND -73.977778
AND trips.starttime > '2015-01-01 12:00:00'
AND trips.starttime < '2016-01-01 00:00:00'



In [327]:
query_job = bq.query(query = query)

In [328]:
query_job.errors

[{'reason': 'invalidQuery',
  'location': 'query',
  'message': 'No matching signature for operator = for argument types: INT64, STRING. Supported signature: ANY = ANY at [5:4]'}]

#### Generate Query (Try 2)

In [329]:
response = codechat.send_message(f"""
This query:
{query}

Gives these errors:
{query_job.errors}
""")

In [330]:
response

The error is because the `start_station_id` column in the `citibike_trips` table is an `INT64` type, but the `station_id` column in the `citibike_stations` table is a `STRING` type. To fix this, you need to cast the `station_id` column to an `INT64` type. You can do this using the `CAST()` function. For example, you could use the following query:

```sql
SELECT count(*)
FROM `bigquery-public-data.new_york.citibike_trips` AS trips
JOIN `bigquery-public-data.new_york.citibike_stations` AS stations
ON CAST(trips.start_station_id AS STRING) = stations.station_id
WHERE stations.latitude BETWEEN 40.755749 AND 40.760694
AND stations.longitude BETWEEN -73.982222 AND -73.977778
AND trips.starttime > '2015-01-01 12:00:00'
AND trips.starttime < '2016-01-01 00:00:00'
```

This query will first join the `citibike_trips` table to the `citibike_stations` table on the `start_station_id` column. This ensures that we only count trips that started at a station that is within a few blocks of Central Park.

In [331]:
if response.text.find("```") >= 0:
    query = response.text.split("```")[1]
    if query.startswith('sql'): query = query[3:]
    print(query)
else:
    print('no query in response')


SELECT count(*)
FROM `bigquery-public-data.new_york.citibike_trips` AS trips
JOIN `bigquery-public-data.new_york.citibike_stations` AS stations
ON CAST(trips.start_station_id AS STRING) = stations.station_id
WHERE stations.latitude BETWEEN 40.755749 AND 40.760694
AND stations.longitude BETWEEN -73.982222 AND -73.977778
AND trips.starttime > '2015-01-01 12:00:00'
AND trips.starttime < '2016-01-01 00:00:00'



In [332]:
query_job = bq.query(query = query)

In [333]:
query_job.errors

In [334]:
query_job.to_dataframe()

| | f0_ |
|---|---|
| 0 | 39860 |


#### Generate Response

In [335]:
question_prompt = f"""
Answer the following question using the provided context.  Note that the context is a tabular result returned from a BigQuery query.  Do not repeat the question or the context when responding.

Include popular points of interest near each listed station.

question:
{question}
context:
{query_job.to_dataframe().to_markdown(index = False)}
"""

question_response = textgen_model.predict(question_prompt, max_output_tokens = 500)

print(question_response.text)

 39860 trips have started within a few blocks of central park after noon on Saturdays during 2015.

Popular points of interest near central park include:
- The Metropolitan Museum of Art
- The American Museum of Natural History
- The Central Park Zoo
- The Frick Collection
- The Neue Galerie


### Put It All Together

In [338]:
question = "How many trips have started within a few blocks of central park after noon on Saturdays during 2015?"

#### Function To Answer The Question Using Iteration To Fix Context Query

In [354]:
def BQ_QA(question, max_iterations = 7):
    # code chat session, with the table schemas as context
    codechat_model = vertexai.language_models.CodeChatModel.from_pretrained('codechat-bison@001')
    codechat = codechat_model.start_chat(context = schema_columns.to_markdown(index = False))
    
    # initial request for query:
    query_response = codechat.send_message(
f"""
Write a Google SQL query for BigQuery that answers the following question while correctly referring to BigQuery tables and the needed column names.  When joining tables use coercion to ensure all join columns are the same data type. Output column names should include the units when applicable.  Tables should be referred to using a fully qualified name including project and dataset along with table name.
Question: {question}
"""
)

    # extract query from response
    if query_response.text.find("```") >= 0:
        query = query_response.text.split("```")[1]
        if query.startswith('sql'): query = query[3:]
        print('first try:\n', query)
    else:
        print('no query provided (first try) - unforeseen error, printing out response to help with editing this function:\n', query_response.text)
        return
        
    # iteratively run query, and fix it until success:
    fix_tries = 0
    answer = False
    while fix_tries < max_iterations - 1:
        query_job = bq.query(query = query)
        # if errors, then generate new query:
        if query_job.errors:
            query_response = codechat.send_message(
f"""
This query:
{query}

Gives these errors:
{query_job.errors}
""")
            # keep asking until the response contains a fenced query or the tries are used up
            while True:
                fix_tries += 1
                if query_response.text.find("```") >= 0:
                    query = query_response.text.split("```")[1]
                    if query.startswith('sql'):
                        query = query[3:]
                    print(f'Fix #{fix_tries}:\n', query)
                    break
                elif fix_tries >= max_iterations - 1:
                    # out of tries without a new query: the outer loop condition ends the search
                    break
                else:
                    query_response = codechat.send_message('Please provide a rewritten version of the query with these changes.')
        # no error, retrieve result and break loop
        else:
            result = query_job.to_dataframe()
            # answer question
            question_prompt = f"""
Answer the following question using the provided context.  Note that the context is a tabular result returned from a BigQuery query.  Do not repeat the question or the context when responding.

question:
{question}
context:
{result.to_markdown(index = False)}
"""
            question_response = textgen_model.predict(question_prompt, max_output_tokens = 500)
            answer = True
            print('The answer is:\n', question_response.text)
            break

    if not answer:
        print(f'No answer generated after {fix_tries} tries.')
            

In [355]:
BQ_QA(question)

first try:
 
SELECT
  COUNT(*) AS trip_count
FROM
  `bigquery-public-data.new_york.citibike_trips` AS t
  JOIN
  `bigquery-public-data.new_york.citibike_stations` AS s
    ON t.start_station_id = s.station_id
WHERE
  s.latitude BETWEEN 40.755224 AND 40.761424
  AND s.longitude BETWEEN -73.982031 AND -73.977051
  AND t.starttime > '2015-01-01T12:00:00'
  AND t.starttime < '2016-01-01T00:00:00'

Fix #1:
 
SELECT
  COUNT(*) AS trip_count
FROM
  `bigquery-public-data.new_york.citibike_trips` AS t
  JOIN
  `bigquery-public-data.new_york.citibike_stations` AS s
    ON t.start_station_id = CAST(s.station_id AS INT64)
WHERE
  s.latitude BETWEEN 40.755224 AND 40.761424
  AND s.longitude BETWEEN -73.982031 AND -73.977051
  AND t.starttime > '2015-01-01T12:00:00'
  AND t.starttime < '2016-01-01T00:00:00'

The answer is:
  39860
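
The same run-and-repair idea can be sketched so that it returns the final query and the errors seen along the way instead of printing them. This is a simplified variant under stated assumptions, not the function above: `run_query` and `repair` are hypothetical callables standing in for the BigQuery job and the chat model, and the fakes below only simulate the data type mismatch from the first try.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class RepairResult:
    query: Optional[str]   # the query that ran without errors, if any
    rows: Optional[list]   # its result rows
    attempts: int          # number of times a query was run
    errors_seen: List[str] = field(default_factory=list)


def run_with_repair(
    first_query: str,
    run_query: Callable[[str], Tuple[Optional[list], Optional[str]]],  # returns (rows, error)
    repair: Callable[[str, str], str],                                 # (query, error) -> new query
    max_attempts: int = 7,
) -> RepairResult:
    query = first_query
    errors_seen: List[str] = []
    for attempt in range(1, max_attempts + 1):
        rows, error = run_query(query)
        if error is None:
            return RepairResult(query, rows, attempt, errors_seen)
        errors_seen.append(error)
        if attempt < max_attempts:
            query = repair(query, error)
    return RepairResult(None, None, max_attempts, errors_seen)


# Fakes that mimic the INT64 vs STRING join error (illustration only).
def fake_run_query(query):
    if "CAST(" not in query:
        return None, "No matching signature for operator = for argument types: INT64, STRING."
    return [{"trip_count": 3}], None


def fake_repair(query, error):
    return query.replace("= s.station_id", "= CAST(s.station_id AS INT64)")


result = run_with_repair(
    "SELECT COUNT(*) AS trip_count FROM t JOIN s ON t.start_station_id = s.station_id",
    fake_run_query,
    fake_repair,
)
print(result.attempts, result.query)
```

Returning the final query makes it easier to review before trusting the answer. A query that runs without errors is not necessarily correct: the final query in the output above filters only on a date range and contains no day-of-week or time-of-day condition, so the count it returns is not limited to Saturday afternoons.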
