# Querying Data with BqT
We've tried to make this as simple as possible while keeping it flexible enough to cover a variety of use cases here at Spotify.

### Before we begin: Asynchronous vs Synchronous
When an operation can take a long time (e.g. a query), it's common practice not to wait for the call to finish and to return immediately instead (a.k.a. an asynchronous call). The idea is that you do other stuff while the query is running and get the results when they are ready. This is as opposed to blocking (a.k.a. a synchronous call) until the results are ready and getting coffee while you wait.

We've designed our library to support both, to give you flexibility when you need it. We will also call out whenever a call is asynchronous.
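To make the difference concrete, here is a minimal sketch using only Python's standard library. It does not use BqT at all: `slow_query` is a made-up stand-in for any long-running call, and the thread pool is just one way to get asynchronous behavior.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_query(seconds):
    """Hypothetical stand-in for a long-running query; NOT a BqT call."""
    time.sleep(seconds)
    return "rows"


# Synchronous: the call blocks until the result is ready.
sync_result = slow_query(0.1)

# Asynchronous: submit() returns immediately with a Future.
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(slow_query, 0.1)
    other_work = sum(range(1000))   # do other stuff while the query runs
    async_result = future.result()  # block only when the result is needed
```

Both calls end up with the same result; the only difference is when your code has to wait.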

## Let's start: Basic Querying
Whatever you want to do with BqT, the first thing you have to do is import the library: `from bqt import bqt`. This is your window to all the functionality it provides. Under the hood, we create a Python object of the class `BqT`, which exposes all of that functionality.

Each `BqT` object is tied to a GCP project. This project is used to run your jobs and bill you for them, and any queries you run go against that project's resources. 99% of the time, this is your team's project, regardless of what you are actually querying. **You need permission to run jobs on the project you use with `BqT`**.

All the details aside, let's start:

In [1]:
from bqt import bqt

In [3]:
bqt.__version__

'0.3.16'

Now we can use this object to run a basic query:

In [5]:
launch_countries = bqt.query("""
SELECT *
FROM `subs-analytics.launch_countries.launch_countries`
WHERE launch_date > '2016-01-02'
""")
launch_countries

Job started, waiting for it to finish,  elapsed
Query done.
Processed: 5.8 K Billed: 10.5 M


Unnamed: 0,country_code,launch_date,continent_code,population,name,region,region_1,region_2,addressable_population,updated_by_daemon
0,ID,2016-03-30,AS,258553000,Indonesia,APAC,SEA,APAC,102800000.0,2018-06-24 00:13:28.403000+00:00
1,JP,2016-08-08,AS,126919659,Japan,APAC,SEA,APAC,0.0,2018-06-24 00:13:28.409000+00:00
2,TH,2017-08-21,AS,67500000,Thailand,APAC,SEA,APAC,0.0,2018-06-24 00:13:28.414000+00:00
3,VN,2018-03-13,AS,92700000,Vietnam,APAC,SEA,APAC,0.0,2018-06-24 00:13:28.407000+00:00
4,ZA,2018-03-13,AF,55900000,South Africa,SSA,SSA,AF,0.0,2018-06-24 00:13:28.405000+00:00
5,RO,2018-03-13,EU,19700000,Romania,EU 2,EU 2,EU,0.0,2018-06-24 00:13:28.408000+00:00
6,IL,2018-03-13,EU,8500000,Israel,EU 3,EU 3,EU,0.0,2018-06-24 00:13:28.404000+00:00


That's it. You might have noticed that the cell waits until the results are ready and then downloads them into a pandas DataFrame.

If you want to do the same thing, but your query takes a long time to run and you don't want to block your notebook while it runs, you can do this:

In [6]:
launch_countries_job = bqt.query_async("""
SELECT *
FROM `subs-analytics.launch_countries.launch_countries`
WHERE launch_date > '2016-01-02'
""")
launch_countries_job

Found cached data that is 1 minute(s) 36 second(s) old, using it.


<bqt.lib.job.BqJobResult at 0x10aee16d0>

Note that it didn't return a DataFrame this time; it returned a results object. This means that your query was sent to BigQuery and is running. To get the results later on (i.e. to block until they are ready and download them), just do this:

In [7]:
launch_countries_job.results

Unnamed: 0,country_code,launch_date,continent_code,population,name,region,region_1,region_2,addressable_population,updated_by_daemon
0,ID,2016-03-30,AS,258553000,Indonesia,APAC,SEA,APAC,102800000.0,2018-06-24 00:13:28.403000+00:00
1,JP,2016-08-08,AS,126919659,Japan,APAC,SEA,APAC,0.0,2018-06-24 00:13:28.409000+00:00
2,TH,2017-08-21,AS,67500000,Thailand,APAC,SEA,APAC,0.0,2018-06-24 00:13:28.414000+00:00
3,VN,2018-03-13,AS,92700000,Vietnam,APAC,SEA,APAC,0.0,2018-06-24 00:13:28.407000+00:00
4,ZA,2018-03-13,AF,55900000,South Africa,SSA,SSA,AF,0.0,2018-06-24 00:13:28.405000+00:00
5,RO,2018-03-13,EU,19700000,Romania,EU 2,EU 2,EU,0.0,2018-06-24 00:13:28.408000+00:00
6,IL,2018-03-13,EU,8500000,Israel,EU 3,EU 3,EU,0.0,2018-06-24 00:13:28.404000+00:00


This is in fact what `query` does internally: it creates a job and waits for it.
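The relationship between the two calls can be sketched with a toy example. Everything here (`ToyJob`, `toy_query_async`, `toy_query`) is hypothetical and only mirrors the shape described above; it is not BqT's actual implementation of `BqJobResult`.

```python
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)


class ToyJob:
    """Hypothetical job handle; not BqT's BqJobResult."""

    def __init__(self, future):
        self._future = future

    @property
    def results(self):
        # Blocks until the background work finishes, then returns its value.
        return self._future.result()


def toy_query_async(sql):
    # Start the work and return a handle immediately.
    return ToyJob(_pool.submit(lambda: f"rows for: {sql}"))


def toy_query(sql):
    # The synchronous version is the asynchronous one plus an immediate wait.
    return toy_query_async(sql).results
```

Written this way, the blocking call needs no logic of its own, which is one reason to build the synchronous path on top of the asynchronous one.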

### Formatted queries
The `BqJobResult` returned by `query_async(...)` contains properties related to the job.
For example, `.query` returns a formatted version of the original query:

In [8]:
launch_countries_job.query

## Query Parameters
One of the key features of `BqT` is that it lets you easily manage the variables and parameters you use in your queries and in your analysis in general. Using this feature is highly recommended because:
1. you can set and forget about parameters you use in your analysis
2. it will automatically replace them with values whenever you use them
3. you can easily rerun your analysis for a different set of parameters

Using them is very simple:

In [2]:
bqt.set_param('launch_date', '2016-01-02')

From now on, you can use `{launch_date}` in your queries without worrying about replacing it:

In [4]:
launch_countries = bqt.query("""
SELECT *
FROM `subs-analytics.launch_countries.launch_countries`
WHERE launch_date > '{launch_date}'
""")
launch_countries

Found cached data that is 7 hour(s) 40 minute(s) 58 second(s) old, using it.


Unnamed: 0,country_code,launch_date,continent_code,population,name,region,region_1,region_2,addressable_population,updated_by_daemon
0,ID,2016-03-30,AS,258553000,Indonesia,APAC,SEA,APAC,102800000.0,2018-06-24 00:13:28.403000+00:00
1,JP,2016-08-08,AS,126919659,Japan,APAC,SEA,APAC,0.0,2018-06-24 00:13:28.409000+00:00
2,TH,2017-08-21,AS,67500000,Thailand,APAC,SEA,APAC,0.0,2018-06-24 00:13:28.414000+00:00
3,VN,2018-03-13,AS,92700000,Vietnam,APAC,SEA,APAC,0.0,2018-06-24 00:13:28.407000+00:00
4,ZA,2018-03-13,AF,55900000,South Africa,SSA,SSA,AF,0.0,2018-06-24 00:13:28.405000+00:00
5,RO,2018-03-13,EU,19700000,Romania,EU 2,EU 2,EU,0.0,2018-06-24 00:13:28.408000+00:00
6,IL,2018-03-13,EU,8500000,Israel,EU 3,EU 3,EU,0.0,2018-06-24 00:13:28.404000+00:00


The parameter manager also understands most date formats, so you can use them as you wish without doing any conversion.

For example, here we specify the date format as `_YYYYMMDD` and BqT automatically understands that the parameter is a date and converts it to this format:

In [5]:
bqt.query("""
SELECT *
FROM `data-science-golden-path.bqt_playground.launch_party_{launch_date_YYYYMMDD}`
""")

Job started, waiting for it to finish, 1 second(s) elapsed
Query done.
Processed: 87.0 Billed: 10.5 M


Unnamed: 0,sample_column,sample_column_2
0,sample_row_b,sample_row_b2
1,sample_row_a,sample_row_a2
2,sample_row_c,sample_row_c2
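To give a feel for what this kind of substitution involves, here is a rough sketch of resolving `{name}` and `{name_YYYYMMDD}` placeholders. It is an illustration only: `render` and `DATE_SUFFIXES` are hypothetical names, the sketch assumes parameters are stored as `YYYY-MM-DD` strings, and BqT's real parameter manager supports more formats than this.

```python
import re
from datetime import datetime

# Hypothetical mapping from placeholder suffixes to strftime patterns.
DATE_SUFFIXES = {"YYYYMMDD": "%Y%m%d"}


def render(query, params):
    """Replace {name} and {name_SUFFIX} placeholders with parameter values."""

    def replace(match):
        key = match.group(1)
        if key in params:
            return str(params[key])
        for suffix, pattern in DATE_SUFFIXES.items():
            if key.endswith("_" + suffix):
                name = key[: -len(suffix) - 1]
                if name in params:
                    # Assumes the stored value looks like '2016-01-02'.
                    date = datetime.strptime(params[name], "%Y-%m-%d")
                    return date.strftime(pattern)
        raise KeyError(key)

    return re.sub(r"\{(\w+)\}", replace, query)


rendered = render(
    "SELECT * FROM `launch_party_{launch_date_YYYYMMDD}` "
    "WHERE launch_date > '{launch_date}'",
    {"launch_date": "2016-01-02"},
)
```

The same parameter is used twice here, once as-is and once reformatted for the table suffix, which is the behavior the cells above rely on.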


### Using `{LATEST}` for Latest Partition
One cool feature of `BqT` is that you can easily write queries that access the latest partition of a table without having to worry about updating the query or your parameters.

Just add `{LATEST}` (it needs to be all caps) to the end of any table name in your query and `BqT` will take care of the rest.

In [1]:
bqt.query("""
SELECT *
FROM `data-science-golden-path.bqt_playground.launch_party_{LATEST}`
""")

Found cached data that is 14 hour(s) 53 minute(s) 35 second(s) old, using it.


Unnamed: 0,sample_column,sample_column_2
0,sample_row_b,sample_row_b2
1,sample_row_a,sample_row_a2
2,sample_row_c,sample_row_c2
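How BqT finds the latest partition is not covered here, but the idea can be sketched under one assumption: tables are date-sharded with a `YYYYMMDD` suffix, and we already have a list of existing table names. `resolve_latest` is a hypothetical helper, not part of BqT.

```python
import re


def resolve_latest(table_template, existing_tables):
    """Replace a trailing {LATEST} with the newest YYYYMMDD suffix found."""
    marker = "{LATEST}"
    if not table_template.endswith(marker):
        return table_template
    prefix = table_template[: -len(marker)]
    pattern = re.compile(re.escape(prefix) + r"(\d{8})")
    matches = (pattern.fullmatch(name) for name in existing_tables)
    suffixes = [m.group(1) for m in matches if m]
    if not suffixes:
        raise LookupError(f"no partitions found for {prefix!r}")
    # YYYYMMDD strings sort chronologically, so max() is the latest date.
    return prefix + max(suffixes)


tables = [
    "launch_party_20180101",
    "launch_party_20180301",
    "launch_party_20180215",
    "other_table_20190101",
]
latest = resolve_latest("launch_party_{LATEST}", tables)
```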


## Returning a large dataset as a generator of smaller chunks
If the dataset does not fit into memory, or you want to load a small chunk of data at a time, pass `return_generator=True` to `query` to get a generator of chunks instead of a single DataFrame.

The default chunk size is 10k rows, but you can configure it by setting `generator_row_size` in `bqt.config`.
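If generators are new to you, the chunking idea itself can be shown with plain Python. This is a minimal sketch that yields lists of rows rather than DataFrames; `in_chunks` is a hypothetical helper, not BqT code.

```python
from itertools import islice


def in_chunks(rows, chunk_size=10_000):
    """Yield successive lists of at most chunk_size rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


gen = in_chunks(range(12), chunk_size=5)
first = next(gen)     # only the first 5 rows are materialized
remaining = list(gen) # the rest, still 5 at a time
```

Only one chunk is held in memory at a time, which is the point of asking for a generator.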

In [3]:
# let's load 5 countries at a time
bqt.config['generator_row_size'] = 5 

launch_countries_gen = bqt.query("""
SELECT *
FROM `subs-analytics.launch_countries.launch_countries`
WHERE launch_date > '{launch_date}'
""", return_generator=True)

Query done! Processed: 0.0 Billed: 0.0 Cost: $0.00


In [4]:
type(launch_countries_gen)

generator

In [5]:
next(launch_countries_gen)

Unnamed: 0,country_code,launch_date,continent_code,population,name,region,region_1,region_2,addressable_population,updated_by_daemon
0,IN,2019-02-27,,1324000000,India,APAC,,,0.0,2019-11-07 00:07:36.342000+00:00
1,RU,2022-01-01,,144400000,Russia,,,,0.0,2019-11-07 00:07:36.345000+00:00
2,ZA,2018-03-13,AF,55900000,South Africa,SSA,SSA,AF,0.0,2019-11-07 00:07:36.342000+00:00
3,RO,2018-03-13,EU,19700000,Romania,EU 2,EU 2,EU,0.0,2019-11-07 00:07:36.344000+00:00
4,IL,2018-03-13,EU,8500000,Israel,EU 3,EU 3,EU,0.0,2019-11-07 00:07:36.341000+00:00


In [6]:
# load next
next(launch_countries_gen)

Unnamed: 0,country_code,launch_date,continent_code,population,name,region,region_1,region_2,addressable_population,updated_by_daemon
0,ID,2016-03-30,AS,258553000,Indonesia,APAC,SEA,APAC,102800000.0,2019-11-07 00:07:36.340000+00:00
1,VN,2018-03-13,AS,92700000,Vietnam,APAC,SEA,APAC,0.0,2019-11-07 00:07:36.344000+00:00
2,JP,2016-08-08,AS,126919659,Japan,APAC,SEA,APAC,0.0,2019-11-07 00:07:36.345000+00:00
3,TH,2017-08-21,AS,67500000,Thailand,APAC,SEA,APAC,0.0,2019-11-07 00:07:36.350000+00:00
4,PS,2018-11-14,,4500000,Palestine,MENA,MENA,MENA,0.0,2019-11-07 00:07:36.337000+00:00


In [8]:
# process one chunk at a time
launch_countries_gen = bqt.query("""
SELECT *
FROM `subs-analytics.launch_countries.launch_countries`
WHERE launch_date > '{launch_date}'
LIMIT 15
""", return_generator=True)

for country in launch_countries_gen:
    # do something
    print(country[:2])
    # write it back to bq
    bqt.fast_load(country, dataset='mzhu', table='country_output', write_disposition='WRITE_APPEND')

Query done! Processed: 7.0 K Billed: 10.5 M Cost: $0.00
  country_code launch_date continent_code  population    name region region_1  \
0           IN  2019-02-27           None  1324000000   India   APAC     None   
1           RU  2022-01-01           None   144400000  Russia   None     None   

  region_2  addressable_population                updated_by_daemon  
0     None                     0.0 2019-11-07 00:07:36.342000+00:00  
1     None                     0.0 2019-11-07 00:07:36.345000+00:00  
Staging bucket: gs://fastbqt-staging-eu-gro-analytics
Splitting the dataframe into 1 chunks ...
Compressed chunk 1/1 (6.9 K) is being uploaded ...
Chunk 1/1 has been inserted to bq
Time elapsed: 0:00:09.925475
  country_code launch_date continent_code  population       name region  \
0           ID  2016-03-30             AS   258553000  Indonesia   APAC   
1           VN

## Running multiple queries at the same time
Using the above, you can write some simple code to run multiple queries at the same time.

**Just be mindful that resources are limited: running more than a dozen queries at the same time will slow BQ down for your project and for other people.**

The following code runs all the queries in parallel, so it takes far less time than running them one after another.

In [17]:
import pandas

query = "SELECT {number} AS my_lucky_number"  # not the most informative query but you get the point :)
results = []
for i in range(3):
    # first start the query asynchronously; this returns right away
    query_result = bqt.query_async(query.format(number=i))

    # keep the job handles in a list
    results.append(query_result)

# now wait for each job and merge the DataFrames using pandas
pandas.concat([job.results for job in results])

Query done.
Processed: 0.0 Billed: 0.0
Query done.
Processed: 0.0 Billed: 0.0
Query done.
Processed: 0.0 Billed: 0.0


Unnamed: 0,my_lucky_number
0,0
0,1
0,2
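If you have many queries, one way to respect the warning above is to cap how many run at once. The sketch below shows that pattern with the standard library only; `run_one` is a made-up stand-in for a single query, and `MAX_PARALLEL = 4` is an arbitrary assumed limit, not a BqT setting.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 4  # assumed cap; pick what your project can tolerate


def run_one(number):
    """Hypothetical stand-in for one query; NOT a BqT call."""
    return {"my_lucky_number": number}


def run_all(numbers, max_parallel=MAX_PARALLEL):
    # At most max_parallel tasks run at once; the rest wait in a queue.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # map() returns results in the same order as the inputs.
        return list(pool.map(run_one, numbers))


rows = run_all(range(3))
```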


## Caching Results
One big benefit of using `BqT` is that it can transparently cache results for you, so you don't have to run expensive queries multiple times.

Caching in `BqT` is a bit smart: it detects changes to your query text. It caches your query the first time you run it and thereafter returns the cached results _until_ you change your query somehow, in which case it realizes this is a different query and runs it again.

### Caching Locally
The default option, which is already enabled, is to cache results locally in your temporary folder (this gets removed whenever you restart your computer or run out of space; we also clear cached queries after 30 days to comply with GDPR regulations). In fact, if you run the query above again, it'll use the local copy and won't query BigQuery.

### Bypass Caching
If for some reason you don't want to cache your results anywhere, that's easy too. Just do:

In [9]:
launch_countries = bqt.query("""
SELECT *
FROM `subs-analytics.launch_countries.launch_countries`
WHERE launch_date > '2016-01-01'
""", cache=None)  # cache=None means don't cache these results, default is cache='local'

launch_countries

Query done.
Processed: 0.0 Billed: 0.0


Unnamed: 0,country_code,launch_date,continent_code,population,name,region,region_1,region_2,addressable_population,updated_by_daemon
0,ID,2016-03-30,AS,258553000,Indonesia,APAC,SEA,APAC,102800000.0,2018-06-24 00:13:28.403000+00:00
1,JP,2016-08-08,AS,126919659,Japan,APAC,SEA,APAC,0.0,2018-06-24 00:13:28.409000+00:00
2,TH,2017-08-21,AS,67500000,Thailand,APAC,SEA,APAC,0.0,2018-06-24 00:13:28.414000+00:00
3,VN,2018-03-13,AS,92700000,Vietnam,APAC,SEA,APAC,0.0,2018-06-24 00:13:28.407000+00:00
4,ZA,2018-03-13,AF,55900000,South Africa,SSA,SSA,AF,0.0,2018-06-24 00:13:28.405000+00:00
5,RO,2018-03-13,EU,19700000,Romania,EU 2,EU 2,EU,0.0,2018-06-24 00:13:28.408000+00:00
6,IL,2018-03-13,EU,8500000,Israel,EU 3,EU 3,EU,0.0,2018-06-24 00:13:28.404000+00:00


The above code will run the query against BigQuery every time you run it.

### Caching on BigQuery
This is useful if you want to share analysis between multiple people and are running expensive queries. This way, whenever you run a query, a copy of its results is stored in BigQuery, and the next time you run it, it'll use that copy instead of running the query again.

To use it, you have to tell `BqT` where these copies are stored by setting the cache configuration:

In [4]:
# notice how we're configuring BqT here, you can use this to configure different aspects of the tool
bqt.set_config({
    # project to store the copies, if not provided, same project will be used
    'cache.bq.project': 'data-science-golden-path',
    # dataset to use for storing copies
    'cache.bq.dataset': 'bqt_playground',
    # prefix of all cache tables, default is `bqt_cache_`
    'cache.bq.table_prefix': 'analysis_1_'
})

In [3]:
launch_countries = bqt.query("""
SELECT *
FROM `subs-analytics.launch_countries.launch_countries`
WHERE launch_date > '2016-01-01'
""", cache='bq')  # note that we're telling it to cache in BigQuery this time, the default is cache='local'

launch_countries

Unnamed: 0,country_code,launch_date,continent_code,population,name,region,region_1,region_2,addressable_population,updated_by_daemon
0,ZA,2018-03-13,AF,55900000,South Africa,SSA,SSA,AF,0.0,2018-05-16 00:03:51.991000+00:00
1,RO,2018-03-13,EU,19700000,Romania,EU 2,EU 2,EU,0.0,2018-05-16 00:03:51.993000+00:00
2,IL,2018-03-13,EU,8500000,Israel,EU 3,EU 3,EU,0.0,2018-05-16 00:03:51.991000+00:00
3,JP,2016-08-08,AS,126919659,Japan,APAC,SEA,APAC,0.0,2018-05-16 00:03:51.994000+00:00
4,VN,2018-03-13,AS,92700000,Vietnam,APAC,SEA,APAC,0.0,2018-05-16 00:03:51.993000+00:00
5,TH,2017-08-21,AS,67500000,Thailand,APAC,SEA,APAC,0.0,2018-05-16 00:03:51.999000+00:00
6,ID,2016-03-30,AS,258553000,Indonesia,APAC,SEA,APAC,102800000.0,2018-05-16 00:03:51.989000+00:00


As you can see, this is all transparent to you, but the results are in fact stored in BigQuery as well as being downloaded into a DataFrame.

The results are now stored in this table: `data-science-golden-path.bqt_playground.analysis_1_c5906058c8fa28864c9bf6747b5ed7e5`

That weird string at the end is called the Cache Key. It's based on the query, and it is how we know what's cached and what's not.
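As an illustration of the idea, here is a small sketch of a cache keyed by a hash of the query text. How BqT actually derives its key is not described here; using MD5 of the exact query string is an assumption, and `cache_key` and `cached_run` are hypothetical names.

```python
import hashlib

_cache = {}


def cache_key(query_text):
    # Assumption: the key is a hash of the exact query text.
    return hashlib.md5(query_text.encode("utf-8")).hexdigest()


def cached_run(query_text, run):
    """Call run(query_text) only if this exact text has not been seen."""
    key = cache_key(query_text)
    if key not in _cache:
        _cache[key] = run(query_text)
    return _cache[key]


calls = []


def fake_run(text):
    calls.append(text)  # record every real execution
    return f"result of {text}"


cached_run("SELECT 1", fake_run)
cached_run("SELECT 1", fake_run)  # same text: served from the cache
cached_run("SELECT 2", fake_run)  # changed text: runs again
```

Because the key depends on the text, any edit to the query produces a new key and therefore a fresh run.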

### Cache Expiration and TTL (Time To Live)
There are many cases where old data is not useful. For example, if I rerun a notebook a month later, I don't want it to use the same old (stale) data. You can control this with an option called `ttl`, which indicates, in seconds, how long the results are valid for.
This is very useful when you want to work on something over a few days but don't want to rerun your queries every single time.

To set it up, similar to the BigQuery cache, we set a configuration option:

In [2]:
# notice that we're setting a configuration option here as well
# this is used to change the default behavior of BqT
bqt.set_config('cache.ttl', bqt.DAY) # this will rerun queries after a day has passed

Here `bqt.DAY = 86400`, which is the number of seconds in a day. You can also use `MINUTE`, `HOUR`, `WEEK` and `MONTH`, or just pass in any number.
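The expiry check behind a TTL is simple enough to sketch. The class below is a hypothetical illustration, not BqT's cache: the constants are defined locally (only the value of `DAY` is quoted above), and the `clock` argument exists only so the behavior is easy to test.

```python
import time

MINUTE = 60
HOUR = 60 * MINUTE
DAY = 24 * HOUR  # 86400, matching the value quoted above
WEEK = 7 * DAY


class TtlCache:
    """Toy cache whose entries expire ttl seconds after being stored."""

    def __init__(self, ttl, clock=time.time):
        self.ttl = ttl
        self.clock = clock
        self._entries = {}

    def put(self, key, value):
        self._entries[key] = (self.clock(), value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at > self.ttl:
            del self._entries[key]  # stale: forget it so the caller reruns
            return None
        return value


now = [0]  # fake clock so we can "fast-forward" time
cache = TtlCache(ttl=DAY, clock=lambda: now[0])
cache.put("my_query", "cached rows")
now[0] = DAY - 1
fresh = cache.get("my_query")
now[0] = DAY + 1
stale = cache.get("my_query")
```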