# Documentation and resources

**BigQuery**
- Colab includes an example notebook on BigQuery: File > Open notebook > Examples > Getting Started with BigQuery.
- There is also a BigQuery Snippets example notebook.
- [BigQuery Documentation]( https://cloud.google.com/bigquery/docs )
- [Open Data Sets]( https://console.cloud.google.com/marketplace/browse?filter=solution-type:dataset )
- [Reddit - list of data sets available on BQ]( https://www.reddit.com/r/bigquery/wiki/datasets )


**BigQuery Console**
- [Google Cloud Console]( https://console.cloud.google.com )
- Make sure your project is selected
- Scroll down to BigQuery on the left menu
- [Setup and query instructions]( https://cloud.google.com/bigquery/docs/quickstarts/query-public-dataset-console )

**SQL**
- [Kaggle Intro to SQL]( https://www.kaggle.com/learn/intro-to-sql ) uses BigQuery
- [Kaggle Advanced SQL]( https://www.kaggle.com/learn/advanced-sql )

# Linking BigQuery to Colab

## Getting started

**You will only need to do this part once.**

1. Use the [Cloud Resource Manager](https://console.cloud.google.com/cloud-resource-manager) to create a Cloud Platform project if you do not already have one.

    - Create Project
    - Project Name
    - Location

2. [Enable BigQuery APIs](https://console.cloud.google.com/flows/enableapi?apiid=bigquery) for the project

**Note:** BigQuery gives you 1 TB/month of free queries on open datasets.
- Kaggle gives you 5 TB/month free.
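
To put these allowances in perspective, the share of a monthly quota that one query consumes can be estimated from the number of bytes it scans. The sketch below is illustrative only: `quota_share` is a hypothetical helper, not part of any Google library, and it assumes that 1 TB is counted as 2**40 bytes. The byte count itself could come from the `total_bytes_processed` attribute of a dry-run query job in `google-cloud-bigquery`.

```python
# Hypothetical helper (not part of any Google library): estimate what share
# of a monthly free-query allowance a single query would consume.

BYTES_PER_TB = 2 ** 40  # assumption: 1 TB is counted as 2**40 bytes


def quota_share(bytes_processed: int, quota_tb: float = 1.0) -> float:
    """Return bytes_processed as a fraction of a quota given in TB."""
    if bytes_processed < 0 or quota_tb <= 0:
        raise ValueError("bytes_processed must be >= 0 and quota_tb > 0")
    return bytes_processed / (quota_tb * BYTES_PER_TB)


# Example: a query that scans 50 GB (an arbitrary, made-up figure).
scanned = 50 * 2 ** 30
print(f"{quota_share(scanned):.2%} of a 1 TB quota")
print(f"{quota_share(scanned, quota_tb=5.0):.2%} of a 5 TB quota")
```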



## Imports

In [None]:
from google.cloud import bigquery
from google.colab import auth
from google.colab import syntax
import pandas as pd


### Provide your credentials

In [None]:
auth.authenticate_user()
print('Authenticated')

Authenticated


## Optional: Enable data table display

Colab includes the ``google.colab.data_table`` package, which can display large pandas DataFrames as an interactive data table.
Enable it with:

In [None]:
%load_ext google.colab.data_table
# %unload_ext google.colab.data_table

To return to the classic pandas DataFrame display, disable the extension by running:
```python
%unload_ext google.colab.data_table
```

## List projects



To query BigQuery, you need to specify a project ID. To list the project IDs associated with your account, run the following command.

In [None]:
!gcloud projects list --sort-by=projectId

PROJECT_ID                   NAME                  PROJECT_NUMBER
cool-monolith-286222         Data Science          271608828771
data-science-project-321016  Data Science Project  540766130804
foobar-414218                Foobar                514368644280
sample-401719                sample                20261848095
sampleproject-380615         SampleProject         506364497139
top-gantry-321023            My Project 1160       819168952009


## Declare the Cloud project ID which will be used throughout this notebook

In [None]:
project_id = "cool-monolith-286222"


## Samples data set



The [GSOD table](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=samples&t=gsod&page=table) in the Samples data set contains weather information collected by NOAA from late 1929 to early 2010, such as precipitation amounts and wind speeds.


# Use BigQuery via magics



The `google.cloud.bigquery` library also includes a magic command, `%%bigquery`, that runs a query and either displays the result or saves it to a variable as a `DataFrame`.

In [None]:
# Display query output immediately

%%bigquery --project {project_id}
SELECT
  COUNT(1) as total_rows
FROM `bigquery-public-data.samples.gsod`

| | total_rows |
|---|---|
| 0 | 114420316 |


In [None]:
# Save output in a variable `df`

%%bigquery df --project {project_id}
SELECT
  COUNT(1) as total_rows
FROM `bigquery-public-data.samples.gsod`

In [None]:
f'{df.iloc[0,0]:_}'

'114_420_316'

# Use BigQuery through google-cloud-bigquery



See [BigQuery documentation](https://cloud.google.com/bigquery/docs) and [library reference documentation](https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/usage.html).


## Sample approximately 2000 random rows

### Count total number of rows

In [None]:
client = bigquery.Client(project=project_id)

row_count = client.query('''
  SELECT
    COUNT(1) as total
  FROM `bigquery-public-data.samples.gsod`
  '''
).to_dataframe()["total"][0]

print(f'Full dataset has {row_count:_} rows')


Full dataset has 114_420_316 rows


### Describe the sampled data

In [None]:
sample_count = 2000
df = client.query(f'''
  SELECT
    *
  FROM
    `bigquery-public-data.samples.gsod`
  WHERE RAND() < {sample_count}/{row_count}
''').to_dataframe()
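
Because the filter `RAND() < {sample_count}/{row_count}` keeps each row independently with the same probability, the number of rows returned is random and only approximately `sample_count`. As a rough check, and assuming the draws are independent, the row count follows a binomial distribution. The helper below (`expected_sample_size` is a hypothetical name) computes its mean and standard deviation.

```python
import math
from typing import Tuple


def expected_sample_size(total_rows: int, target_rows: int) -> Tuple[float, float]:
    """Mean and standard deviation of the row count returned by a filter
    that keeps each row independently with probability target/total."""
    p = target_rows / total_rows
    mean = total_rows * p
    std = math.sqrt(total_rows * p * (1 - p))
    return mean, std


# Total row count taken from the COUNT query earlier in this notebook.
mean, std = expected_sample_size(114_420_316, 2000)
print(f"expected rows: {mean:.0f} +/- {std:.1f}")
```

The 1996 rows reported by `describe()` in this run sit well within one standard deviation (about 45 rows) of the 2000-row target.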


In [None]:
df.describe().transpose().astype({"count": int})

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| station_number | 1996 | 503783.044589 | 298740.531143 | 10150.0 | 247212.5 | 537525.0 | 724476.0 | 999999.0 |
| wban_number | 1996 | 91138.267535 | 25277.458144 | 6.0 | 99999.0 | 99999.0 | 99999.0 | 99999.0 |
| year | 1996 | 1987.423347 | 15.868843 | 1930.0 | 1977.0 | 1989.5 | 2001.0 | 2010.0 |
| month | 1996 | 6.492485 | 3.448067 | 1.0 | 3.0 | 6.5 | 10.0 | 12.0 |
| day | 1996 | 15.745992 | 8.689065 | 1.0 | 8.0 | 15.0 | 23.0 | 31.0 |
| mean_temp | 1996 | 51.152806 | 23.940017 | -57.200001 | 37.299999 | 53.599998 | 69.300003 | 98.900002 |
| num_mean_temp_samples | 1996 | 12.711423 | 7.801413 | 4.0 | 7.0 | 8.0 | 23.0 | 24.0 |
| mean_dew_point | 1900 | 40.591895 | 22.236467 | -63.400002 | 28.675001 | 42.5 | 56.0 | 80.099998 |
| num_mean_dew_point_samples | 1900 | 12.612105 | 7.755177 | 4.0 | 7.0 | 8.0 | 23.0 | 24.0 |
| mean_sealevel_pressure | 1488 | 1014.971505 | 10.137801 | 901.599976 | 1009.400024 | 1014.700012 | 1020.324997 | 1056.599976 |


### View the first 10 rows

In [None]:
df.head(10)

| | station_number | wban_number | year | month | day | mean_temp | num_mean_temp_samples | mean_dew_point | num_mean_dew_point_samples | mean_sealevel_pressure | ... | min_temperature | min_temperature_explicit | total_precipitation | snow_depth | fog | rain | snow | hail | thunder | tornado |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 376860 | 99999 | 1975 | 7 | 12 | 63.799999 | 6 | 54.799999 | 6 | | ... | | | 0.0 | | False | False | False | False | False | False |
| 1 | 592780 | 99999 | 1975 | 11 | 23 | 52.099998 | 7 | 27.799999 | 6 | 1030.5 | ... | | | 0.0 | | False | False | False | False | False | False |
| 2 | 728030 | 99999 | 1976 | 2 | 27 | 30.200001 | 20 | 27.299999 | 20 | 1014.5 | ... | | | 0.04 | 7.9 | True | True | True | True | True | True |
| 3 | 239550 | 99999 | 1980 | 8 | 28 | 54.400002 | 7 | 46.900002 | 7 | 1015.200012 | ... | | | 0.0 | | False | False | False | False | False | False |
| 4 | 822880 | 99999 | 1983 | 7 | 18 | 82.599998 | 5 | 72.199997 | 5 | 1012.200012 | ... | | | 0.0 | | False | False | False | False | False | False |
| 5 | 474890 | 99999 | 1988 | 3 | 28 | 36.200001 | 9 | 21.4 | 9 | | ... | | | | | False | False | False | False | False | False |
| 6 | 821930 | 99999 | 1994 | 4 | 26 | 80.099998 | 23 | 75.0 | 23 | 1009.900024 | ... | | | 0.0 | | False | False | False | False | False | False |
| 7 | 37150 | 99999 | 1998 | 2 | 22 | 42.299999 | 24 | 38.400002 | 24 | | ... | | | 0.0 | | False | False | False | False | False | False |
| 8 | 33850 | 99999 | 1999 | 5 | 22 | 51.799999 | 23 | 38.099998 | 23 | 1013.5 | ... | | | 0.0 | | False | False | False | False | False | False |
| 9 | 722505 | 12904 | 2008 | 11 | 23 | 65.199997 | 24 | 60.5 | 24 | 1022.0 | ... | | | 0.0 | | True | True | True | True | True | True |


In [None]:
df.isnull().sum()

station_number                           0
wban_number                              0
year                                     0
month                                    0
day                                      0
mean_temp                                0
num_mean_temp_samples                    0
mean_dew_point                          96
num_mean_dew_point_samples              96
mean_sealevel_pressure                 508
num_mean_sealevel_pressure_samples     508
mean_station_pressure                 1307
num_mean_station_pressure_samples     1307
mean_visibility                        216
num_mean_visibility_samples            216
mean_wind_speed                         25
num_mean_wind_speed_samples             25
max_sustained_wind_speed                52
max_gust_wind_speed                   1696
max_temperature                          4
max_temperature_explicit                 4
min_temperature                       1996
min_temperature_explicit              1996
total_preci
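
Raw null counts are easier to interpret relative to the sample size. As a small illustration, the snippet below takes a few of the counts printed above and divides them by the 1996 rows in this particular sample. With the DataFrame at hand, `df.isnull().mean()` gives the same kind of ratio for every column.

```python
# Null counts copied from the output above (a subset, for illustration).
null_counts = {
    "mean_dew_point": 96,
    "mean_sealevel_pressure": 508,
    "mean_station_pressure": 1307,
    "max_gust_wind_speed": 1696,
    "min_temperature": 1996,
}
sampled_rows = 1996  # rows in this particular sample

# Fraction of rows in which each column is missing.
null_fraction = {col: n / sampled_rows for col, n in null_counts.items()}

# Print the columns from most to least incomplete.
for col, frac in sorted(null_fraction.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:<24} {frac:6.1%}")
```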

In [None]:
# 10 highest total_precipitation samples
(
df
  .sort_values('total_precipitation', ascending=False)
  .head(10)
  [['station_number', 'year', 'month', 'day', 'total_precipitation']]
)


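| | station_number | year | month | day | total_precipitation |
|---|---|---|---|---|---|
| 103 | 142160 | 2001 | 9 | 5 | 4.08 |
| 1073 | 479420 | 1973 | 8 | 16 | 2.99 |
| 1403 | 236250 | 1993 | 10 | 27 | 2.99 |
| 914 | 226570 | 1964 | 7 | 4 | 2.95 |
| 1753 | 230220 | 1963 | 9 | 7 | 2.95 |
| 1457 | 627720 | 2004 | 2 | 19 | 2.36 |
| 1614 | 827650 | 1979 | 2 | 4 | 2.13 |
| 1245 | 972400 | 2003 | 11 | 19 | 2.01 |
| 1259 | 590720 | 2004 | 5 | 16 | 1.9 |
| 1047 | 655920 | 2004 | 9 | 2 | 1.73 |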


# Use BigQuery through pandas-gbq



The `pandas-gbq` library is a community-led project maintained by the pandas community. It covers basic functionality, such as writing a DataFrame to BigQuery and running a query, but as a third-party library it may not handle all BigQuery features or use cases.

[Pandas GBQ Documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_gbq.html)

In [None]:
df = pd.io.gbq.read_gbq('''
  SELECT
    name, SUM(number) as count
  FROM
    `bigquery-public-data.usa_names.usa_1910_2013`
  WHERE
    state = 'TX'
  GROUP BY
    name
  ORDER BY
    count DESC
  LIMIT
    100
  ''', project_id=project_id, dialect='standard'
)

df.head()

| | name | count |
|---|---|---|
| 0 | James | 272793 |
| 1 | John | 235139 |
| 2 | Michael | 225320 |
| 3 | Robert | 220399 |
| 4 | David | 219028 |
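
The query above hard-codes the state and the row limit. When those need to vary, one option is to generate the query text with a small function and pass the result to `read_gbq` as before. The sketch below assumes a hypothetical helper named `build_names_query`. It validates its inputs before placing them in the SQL text. For values from untrusted sources, BigQuery query parameters are likely the more robust choice.

```python
import re


def build_names_query(state: str, limit: int = 100) -> str:
    """Build the top-names query for one US state (hypothetical helper).

    The inputs are validated before being placed in the SQL text, so
    unexpected input cannot change the structure of the query.
    """
    if not re.fullmatch(r"[A-Z]{2}", state):
        raise ValueError(f"expected a two-letter state code, got {state!r}")
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    return f"""
  SELECT
    name, SUM(number) AS count
  FROM
    `bigquery-public-data.usa_names.usa_1910_2013`
  WHERE
    state = '{state}'
  GROUP BY
    name
  ORDER BY
    count DESC
  LIMIT
    {limit}
"""


# Reproduce the Texas query used above and inspect the generated SQL.
query = build_names_query("TX")
print(query)
```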


# Syntax highlighting
`google.colab.syntax` can be used to add syntax highlighting to Python string literals that are later used as queries.

In [None]:
query = syntax.sql('''
  SELECT
    COUNT(1) as total_rows
  FROM
    `bigquery-public-data.samples.gsod`
''')

pd.io.gbq.read_gbq(query, project_id=project_id, dialect='standard')

| | total_rows |
|---|---|
| 0 | 114420316 |


In [None]:
type(query)

str
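
Since `type(query)` is a plain `str`, the highlighted query can be used anywhere a string is accepted. One possible consequence is that a notebook meant to run outside Colab as well could fall back to an identity function when `google.colab` is not installed. The sketch below rests on the assumption that `syntax.sql` returns its argument unchanged, which the output above suggests but which has not been verified against the Colab source. The `_Syntax` fallback class is hypothetical.

```python
try:
    # Available only inside Colab.
    from google.colab import syntax
except ImportError:
    class _Syntax:
        """Hypothetical stand-in: returns the SQL text unchanged."""

        @staticmethod
        def sql(text: str) -> str:
            return text

    syntax = _Syntax()

# Same query as above; works with either the real module or the fallback.
query = syntax.sql('''
  SELECT
    COUNT(1) as total_rows
  FROM
    `bigquery-public-data.samples.gsod`
''')

print(type(query).__name__)
```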