# Select, From & Where

Here, I practice writing SQL queries to fetch data from BigQuery. I use the keywords **SELECT**, **FROM**, and **WHERE** to get data from specific columns based on conditions.

In [1]:
# import package
from google.cloud import bigquery

In [2]:
# create a client object
client = bigquery.Client()

In [3]:
# construct a reference to the dataset
dataset_ref = client.dataset("openaq", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

In [4]:
# list all tables in the "openaq" dataset
tables = list(client.list_tables(dataset))

# print names of all tables in the dataset
for table in tables:
    print(table.table_id)

global_air_quality


In [5]:
# construct a reference to the "global_air_quality" table
table_ref = dataset_ref.table("global_air_quality")

# API request - fetch the table
table = client.get_table(table_ref)

# view the first five rows of the "global_air_quality" table
client.list_rows(table, max_results=5).to_dataframe()


Unnamed: 0,location,city,country,pollutant,value,timestamp,unit,source_name,latitude,longitude,averaged_over_in_hours
0,"BTM Layout, Bengaluru - KSPCB",Bengaluru,IN,co,910.0,2018-02-22 03:00:00+00:00,µg/m³,CPCB,12.912811,77.60922,0.25
1,"BTM Layout, Bengaluru - KSPCB",Bengaluru,IN,no2,131.87,2018-02-22 03:00:00+00:00,µg/m³,CPCB,12.912811,77.60922,0.25
2,"BTM Layout, Bengaluru - KSPCB",Bengaluru,IN,o3,15.57,2018-02-22 03:00:00+00:00,µg/m³,CPCB,12.912811,77.60922,0.25
3,"BTM Layout, Bengaluru - KSPCB",Bengaluru,IN,pm25,45.62,2018-02-22 03:00:00+00:00,µg/m³,CPCB,12.912811,77.60922,0.25
4,"BTM Layout, Bengaluru - KSPCB",Bengaluru,IN,so2,4.49,2018-02-22 03:00:00+00:00,µg/m³,CPCB,12.912811,77.60922,0.25


## 1. Submitting the query to the dataset

**Triple Quotation Marks (""")**

These tell Python that everything inside them is a single string even though we have line breaks in it. The line breaks aren't necessary, but they make it easier to read your query. 
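
As a quick check of that claim, here is a minimal sketch comparing a triple-quoted query with a one-line version of the same text. The variable names are only for illustration.

```python
# A triple-quoted string may span several lines; the line breaks become
# part of the string but do not change the SQL once whitespace is collapsed.
multi_line = """
        SELECT city
        FROM `bigquery-public-data.openaq.global_air_quality`
        WHERE country = 'US'
        """

one_line = "SELECT city FROM `bigquery-public-data.openaq.global_air_quality` WHERE country = 'US'"

# str.split() with no arguments splits on any run of whitespace,
# so joining the pieces with single spaces normalizes the multi-line string.
normalized = " ".join(multi_line.split())

print(isinstance(multi_line, str))  # the whole query is one str object
print(normalized == one_line)       # same text once whitespace is collapsed
```

Both checks print `True`: the line breaks and indentation only affect readability.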

**Backtick (option key + ~ on Mac)**

Note that when writing an SQL query, the argument we pass to **FROM** is enclosed in backticks, not in single or double quotation marks.
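
To make the quoting rule concrete, here is a small sketch that builds the backtick-quoted table name from its three parts. The helper `qualified_table` is a hypothetical name of my own, not part of the BigQuery client library.

```python
def qualified_table(project: str, dataset: str, table: str) -> str:
    """Return project.dataset.table wrapped in backticks (hypothetical helper)."""
    return f"`{project}.{dataset}.{table}`"


table_name = qualified_table("bigquery-public-data", "openaq", "global_air_quality")

# The table name goes in backticks; the string value 'US' stays in single quotes.
query = f"""
        SELECT city
        FROM {table_name}
        WHERE country = 'US'
        """

print(table_name)
```

This only assembles the query text; it does not send anything to BigQuery.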

In [6]:
# query to select all the items from the "city" column where the "country" column is 'US'
query = """
        SELECT city
        FROM `bigquery-public-data.openaq.global_air_quality`
        WHERE country = 'US'
        """

In [7]:
# Create a "Client" object
client = bigquery.Client()
# Set up the query
query_job = client.query(query)

In [8]:
# API request - run the query, and return a pandas DataFrame
us_cities = query_job.to_dataframe()
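
To see what the three keywords in this query do without touching BigQuery, here is a minimal sketch that runs the same kind of query against a tiny in-memory SQLite table. The rows are invented for illustration, and SQLite is only a stand-in: it is a different engine from BigQuery, so the table name is not backtick-qualified here.

```python
import sqlite3

# Toy stand-in for the real table; these rows are invented for illustration.
rows = [
    ("Bengaluru", "IN"),
    ("Houston", "US"),
    ("Phoenix-Mesa-Scottsdale", "US"),
    ("Houston", "US"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE global_air_quality (city TEXT, country TEXT)")
conn.executemany("INSERT INTO global_air_quality VALUES (?, ?)", rows)

# SELECT picks the column, FROM names the table, WHERE filters the rows.
# ORDER BY rowid is added only to make the output order deterministic.
cursor = conn.execute(
    """
    SELECT city
    FROM global_air_quality
    WHERE country = 'US'
    ORDER BY rowid
    """
)
us_city_names = [city for (city,) in cursor]
conn.close()

print(us_city_names)
```

Only the rows whose country is 'US' survive the filter, and only their city column is returned; duplicates are kept because each measurement is its own row.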

<br>
Now the data retrieved by the query is in a pandas DataFrame.

In [9]:
# What five cities have the most measurements?
us_cities.city.value_counts().head()

Phoenix-Mesa-Scottsdale                     88
Houston                                     82
Los Angeles-Long Beach-Santa Ana            68
New York-Northern New Jersey-Long Island    60
Riverside-San Bernardino-Ontario            60
Name: city, dtype: int64
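
The same kind of count could also be done in SQL instead of pandas. The sketch below is one way to do it, again using an in-memory SQLite table with invented rows as a stand-in for the BigQuery table; the table name `measurements` and the alias `n` are my own choices.

```python
import sqlite3

# Invented rows standing in for the US measurements.
cities = ["Houston", "Phoenix", "Houston", "Phoenix", "Houston", "Riverside"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (city TEXT)")
conn.executemany("INSERT INTO measurements VALUES (?)", [(c,) for c in cities])

# GROUP BY collapses the rows per city; COUNT(*) counts the rows in each group.
# ORDER BY n DESC puts the most frequent city first, and LIMIT 5 keeps the top five.
top_cities = conn.execute(
    """
    SELECT city, COUNT(*) AS n
    FROM measurements
    GROUP BY city
    ORDER BY n DESC, city
    LIMIT 5
    """
).fetchall()
conn.close()

print(top_cities)
```

Counting in SQL means only the summary rows come back, rather than one row per measurement.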

In [10]:
# Another example: sum the measured values for each pollutant in the US

query = """
        SELECT country, pollutant, sum(value) as Value
        FROM `bigquery-public-data.openaq.global_air_quality`
        WHERE country = 'US'
        GROUP BY country, pollutant
        ORDER BY Value DESC
        """
# Set up the query
query_job = client.query(query)

# convert into pandas dataframe
us_pollutant = query_job.to_dataframe()
us_pollutant

Unnamed: 0,country,pollutant,Value
0,US,pm25,7531.6
1,US,pm10,6203.0
2,US,bc,417.71
3,US,o3,112.618
4,US,co,86.03
5,US,no2,15.5262
6,US,so2,-1110.8207


## 2. Estimating Size of Query

Next, I estimate how much data a query will scan before running it, by creating a QueryJobConfig object with dry_run=True.

In [12]:
query = """
        SELECT score, title
        FROM `bigquery-public-data.hacker_news.full`
        WHERE type = "job" 
        """
# create a QueryJobConfig object to estimate size of query without running it
dry_run_config = bigquery.QueryJobConfig(dry_run=True)

# API request - dry run query to estimate costs
dry_run_query_job = client.query(query, job_config=dry_run_config)

print("This query will process {} bytes.".format(dry_run_query_job.total_bytes_processed))

This query will process 458206012 bytes.
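
A raw byte count is hard to read at a glance, so here is a small sketch that formats it in decimal units (1 MB = 1000 * 1000 bytes). The helper `human_readable_bytes` is a hypothetical name of my own, not something the BigQuery client provides.

```python
def human_readable_bytes(num_bytes: int) -> str:
    """Format a byte count using decimal units (hypothetical helper)."""
    units = ["bytes", "KB", "MB", "GB", "TB"]
    size = float(num_bytes)
    for unit in units:
        # Stop once the value is below 1000 or there are no larger units left.
        if size < 1000 or unit == units[-1]:
            return f"{size:.1f} {unit}"
        size /= 1000


# Estimate reported by the dry run above.
estimated = 458206012
print(human_readable_bytes(estimated))
```

For the estimate above this prints `458.2 MB`.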


You can also specify a parameter when running the query to limit how much data you are willing to scan. Here's an example with a low limit.

In [13]:
# only run the query if it's less than 1MB
ONE_MB = 1000*1000
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=ONE_MB)

# set up the query (will only run if it's less than 1MB)
safe_query_job = client.query(query, job_config=safe_config)

# API request - try to run the query and return a pandas DataFrame
safe_query_job.to_dataframe()

InternalServerError: 500 Query exceeded limit for bytes billed: 1000000. 458227712 or higher required.

(job ID: a142ce94-c755-44d9-8f5f-c9697c71fb95)

             -----Query Job SQL Follows-----             

    |    .    |    .    |    .    |    .    |    .    |
   1:
   2:        SELECT score, title
   3:        FROM `bigquery-public-data.hacker_news.full`
   4:        WHERE type = "job" 
   5:        
    |    .    |    .    |    .    |    .    |    .    |

In this case, the query was cancelled because the limit of 1 MB was exceeded.
Now, let's try with an increased limit.

In [14]:
# only run the query if it's less than 1GB
ONE_GB = 1000*1000*1000
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=ONE_GB)

# set up the query (will only run if it's less than 1GB)
safe_query_job = client.query(query, job_config=safe_config)

# API request - try to run the query, and return a pandas DataFrame
job_post_scores = safe_query_job.to_dataframe()

# print average score for job posts
job_post_scores.score.mean()

1.8365435455850139
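
The dry-run estimate and the byte limit can also be compared locally before submitting anything. The sketch below is a rough pre-check only, and `fits_budget` is a hypothetical helper of my own. Note that the billed amount can differ slightly from the estimate: the error above reports 458227712 bytes against a dry-run estimate of 458206012.

```python
ONE_MB = 1000 * 1000
ONE_GB = 1000 * 1000 * 1000


def fits_budget(estimated_bytes: int, max_bytes_billed: int) -> bool:
    """Return True if the estimated scan size is within the limit (hypothetical helper)."""
    return estimated_bytes <= max_bytes_billed


# Dry-run estimate printed earlier in this notebook.
estimated = 458206012

for label, limit in [("1 MB", ONE_MB), ("1 GB", ONE_GB)]:
    verdict = "run" if fits_budget(estimated, limit) else "skip"
    print(f"limit {label}: {verdict}")
```

This matches what happened above: the query is rejected under the 1 MB limit and goes through under the 1 GB limit.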