# Basics of accessing and examining BigQuery datasets

In [1]:
# import package
from google.cloud import bigquery

## 1. Create Client Object

The client object plays a central role in retrieving information from BigQuery datasets.

In [2]:
# create a client object
client = bigquery.Client()

## 2. Access the Dataset in BigQuery

In BigQuery, each dataset is contained in a project. I'll access the project called "bigquery-public-data" and fetch the dataset "hacker_news", a collection of posts on [Hacker News](https://news.ycombinator.com/), a website focusing on computer science and cybersecurity news.

- <code>dataset()</code> constructs a reference to a dataset
- <code>get_dataset()</code> fetches the dataset (pass the reference as its argument)

In [3]:
# construct a reference to the "hacker_news" dataset
dataset_ref = client.dataset("hacker_news", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

<br>
The dataset comprises multiple tables, each composed of rows and columns.<br>
<code>list_tables()</code> lists the tables in the dataset.

In [4]:
# list all the tables in the "hacker_news" dataset
tables = list(client.list_tables(dataset))

# print names of all tables in the dataset
for table in tables:
    print(table.table_id)

comments
full
full_201510
stories


<br>
As the output above shows, the "hacker_news" dataset is a collection of <font color="green"><b>four tables: comments, full, full_201510, stories.</b></font>

<br>
Now, I'll fetch one table by constructing a reference to it by name and passing that reference to the <code>get_table()</code> method.

In [5]:
# construct a reference to the "full" table
table_ref = dataset_ref.table("full")

# API request - fetch the table
table = client.get_table(table_ref)
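
The reference-then-fetch steps above can also be written with a single fully qualified ID string of the form `project.dataset.table`, which `get_table()` accepts in place of a reference object. The following is a minimal sketch: the helper `qualified_table_id` is a hypothetical name introduced here for illustration, and the BigQuery call is left commented out because it needs credentials.

```python
def qualified_table_id(project: str, dataset: str, table: str) -> str:
    """Join the three levels of the BigQuery hierarchy into one ID string."""
    return f"{project}.{dataset}.{table}"


full_id = qualified_table_id("bigquery-public-data", "hacker_news", "full")
print(full_id)

# With credentials configured, the string can be passed straight to the client:
# from google.cloud import bigquery
# client = bigquery.Client()
# table = client.get_table(full_id)
```

The string form is handy for one-off lookups, while the reference objects are convenient when several tables from the same dataset are needed.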

## Summary

- **Client** holds the connection to BigQuery and gives access to projects.
- **Project** holds a collection of datasets.
- **Dataset** holds a collection of tables.
<br>

!['BigQuery Overview'](https://i.imgur.com/biYqbUB.png)
[image source: Kaggle]

## 3. Checking Table Schema

A table schema is the structure of a table. To pull out data effectively, I'll first check the schema of the <code>full</code> table.

In [6]:
# first, print information on all the columns in the "full" table in the "hacker_news" dataset
table.schema

[SchemaField('title', 'STRING', 'NULLABLE', 'Story title', (), None),
 SchemaField('url', 'STRING', 'NULLABLE', 'Story url', (), None),
 SchemaField('text', 'STRING', 'NULLABLE', 'Story or comment text', (), None),
 SchemaField('dead', 'BOOLEAN', 'NULLABLE', 'Is dead?', (), None),
 SchemaField('by', 'STRING', 'NULLABLE', "The username of the item's author.", (), None),
 SchemaField('score', 'INTEGER', 'NULLABLE', 'Story score', (), None),
 SchemaField('time', 'INTEGER', 'NULLABLE', 'Unix time', (), None),
 SchemaField('timestamp', 'TIMESTAMP', 'NULLABLE', 'Timestamp for the unix time', (), None),
 SchemaField('type', 'STRING', 'NULLABLE', 'Type of details (comment, comment_ranking, poll, story, job, pollopt)', (), None),
 SchemaField('id', 'INTEGER', 'NULLABLE', "The item's unique id.", (), None),
 SchemaField('parent', 'INTEGER', 'NULLABLE', 'Parent comment ID', (), None),
 SchemaField('descendants', 'INTEGER', 'NULLABLE', 'Number of story or poll descendants', (), None),
 SchemaField

Each **SchemaField** describes one column (also often called a "field").
In order, the information is:

- the name of the column
- the field type (or datatype) of the column
- the mode of the column ('NULLABLE' is the default and means that the column allows NULL values)
- a description of the data in that column

The first field has the SchemaField:

<code>SchemaField('title', 'STRING', 'NULLABLE', 'Story title', (), None)</code>

This tells us:

- the field (column) name: <code>title</code>
- the datatype: <code>string</code>
- the mode: NULL values are allowed.
- the description: Story title
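
To read the same four pieces of information for every column at once, one option is to loop over the schema and print the `name`, `field_type`, `mode`, and `description` attributes of each field. The sketch below is an illustration only: `describe_schema` is a hypothetical helper, and `FakeField` is a stand-in for `SchemaField` so that the block runs without BigQuery credentials. With a live table, `describe_schema(table.schema)` should work the same way.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class FakeField:
    """Stand-in (assumption) mirroring the SchemaField attributes used below."""
    name: str
    field_type: str
    mode: str
    description: str


def describe_schema(schema: Iterable) -> List[str]:
    """Return one 'name | type | mode | description' line per field."""
    return [
        f"{f.name} | {f.field_type} | {f.mode} | {f.description}"
        for f in schema
    ]


# Two fields copied from the schema output above
sample_schema = [
    FakeField("title", "STRING", "NULLABLE", "Story title"),
    FakeField("score", "INTEGER", "NULLABLE", "Story score"),
]

lines = describe_schema(sample_schema)
for line in lines:
    print(line)
```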

Now, using <code>list_rows()</code>, let's check the first five rows of the <code>full</code> table in the <code>hacker_news</code> dataset. 

In [14]:
# preview the first five rows of the "full" table
client.list_rows(table, max_results=5).to_dataframe()

Unnamed: 0,title,url,text,dead,by,score,time,timestamp,type,id,parent,descendants,ranking,deleted
0,,,"Yeah, but $750-800 is way different than $1500...",,abalashov,,1299653985,2011-03-09 06:59:45+00:00,comment,2304165,2303720,,,
1,,,Agreed. We had a summer intern who is a CS ma...,,mooreds,,1377637633,2013-08-27 21:07:13+00:00,comment,6286286,6285068,,,
2,,,Oh I&#x27;m not referring to which one is easi...,,edwinnathaniel,,1377637599,2013-08-27 21:06:39+00:00,comment,6286281,6286194,,,
3,,,I'm from NY. I plan on networking as soon as ...,,Jcasc,,1299653909,2011-03-09 06:58:29+00:00,comment,2304163,2304143,,,
4,,,"In this context, it&#x27;s interesting to look...",,thebear,,1409084821,2014-08-26 20:27:01+00:00,comment,8229336,8227937,,,
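
The preview shows the same moment stored twice: `time` holds Unix time (seconds since 1970-01-01 UTC) and `timestamp` holds the corresponding date and time. As a quick check, this sketch converts the `time` value of row 0 using only the standard library; the variable names are chosen here for illustration.

```python
from datetime import datetime, timezone

# 'time' value of row 0 in the preview above
unix_seconds = 1299653985

# Interpret the integer as seconds since the Unix epoch, in UTC
converted = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
print(converted)  # 2011-03-09 06:59:45+00:00
```

The result matches the `timestamp` column for that row, so either column can be used for time-based filtering.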


In [18]:
# view the column, 'title' only
client.list_rows(table, selected_fields=table.schema[:1], max_results=5).to_dataframe()



Unnamed: 0,title
0,
1,
2,
3,
4,
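
Slicing with `table.schema[:1]` selects `title` only because it happens to be the first column. Picking fields by name and passing the result as `selected_fields` may be more robust. In this sketch, `pick_fields` is a hypothetical helper and `SimpleNamespace` objects stand in for `SchemaField` so the block runs offline; only the `name` attribute is assumed.

```python
from types import SimpleNamespace


def pick_fields(schema, names):
    """Return the schema fields whose name is in `names`, in the order requested."""
    by_name = {field.name: field for field in schema}
    missing = [n for n in names if n not in by_name]
    if missing:
        raise KeyError(f"columns not in schema: {missing}")
    return [by_name[n] for n in names]


# Stand-in schema (assumption): real code would use table.schema
schema = [SimpleNamespace(name=n) for n in ("title", "url", "text", "by", "type")]

chosen = pick_fields(schema, ["by", "type"])
print([f.name for f in chosen])

# With a live table:
# client.list_rows(
#     table,
#     selected_fields=pick_fields(table.schema, ["by", "type"]),
#     max_results=5,
# ).to_dataframe()
```

The `title` column is empty in the output above because, per the `type` column of the earlier preview, the first five rows are all comments, which carry no title.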
