# BigQuery Connector - Quick Start
The BigQuery connector lets you read data from and write data to BigQuery, and integrate it with YData's platform.
Reading a dataset from BigQuery directly into a YData `Dataset` makes it available to the Data Quality, Data Synthetisation and Preprocessing blocks.

## Storage and Performance Notes
BigQuery is not intended to serve as a pure storage service for large volumes of data. Its main strength is running SQL-like queries on existing tables, which can efficiently aggregate data into new views. For storage purposes we therefore advise using Google Cloud Storage. The `BigQueryConnector` provides the method `write_query_to_gcs`, which exports the result of a given query to a Google Cloud Storage object.

The BigQuery connector allows the user to perform the following actions:
- **BigQueryConnector.datasets** - Returns a list with the names of the available datasets.
- **BigQueryConnector.list_tables** - Returns a list of the available tables within a chosen dataset.
- **BigQueryConnector.table_schema** - Returns the schema of a selected table.
- **BigQueryConnector.delete_dataset_if_exists** - Deletes a selected dataset. This action is only possible if the provided credentials have delete access.
- **BigQueryConnector.delete_table_if_exists** - Deletes a selected table. This action is only possible if the provided credentials have delete access.
- **BigQueryConnector.query** - Returns the result of a query as a `Dataset` object. The optional `n_sample` argument sets the number of rows to be fetched.

In [20]:
from ydata.connectors import BigQueryConnector
from ydata.utils.formats import read_json

In [21]:
# Load your credentials from a file
#token = read_json('{insert-path-to-credentials}')

token = read_json('gcs_credentials.json')

# Instantiate the Connector
#connector = BigQueryConnector(project_id='{insert-project-id}', keyfile_dict=token)

connector = BigQueryConnector(project_id='ydatasynthetic', keyfile_dict=token)

In [22]:
# Check the available datasets
print(connector.datasets)

# Check the available tables for a given dataset
print(connector.list_tables('{insert-dataset}'))

['cardio_data', 'connectors_dev', 'dataset_test', 'decision_tree']
['cardio_data', 'table_test', 'table_test1', 'table_test2']


In [23]:
# Returns a list of dictionaries with the column names and details (type and mode)
connector.table_schema(dataset='{insert-dataset}', table='{insert-table}')

[{'name': 'age', 'type': 'INTEGER', 'mode': 'NULLABLE'},
 {'name': 'gender', 'type': 'INTEGER', 'mode': 'NULLABLE'},
 {'name': 'height', 'type': 'INTEGER', 'mode': 'NULLABLE'},
 {'name': 'weight', 'type': 'FLOAT', 'mode': 'NULLABLE'},
 {'name': 'ap_hi', 'type': 'INTEGER', 'mode': 'NULLABLE'},
 {'name': 'ap_lo', 'type': 'INTEGER', 'mode': 'NULLABLE'},
 {'name': 'cholesterol', 'type': 'INTEGER', 'mode': 'NULLABLE'},
 {'name': 'gluc', 'type': 'INTEGER', 'mode': 'NULLABLE'},
 {'name': 'smoke', 'type': 'INTEGER', 'mode': 'NULLABLE'},
 {'name': 'alco', 'type': 'INTEGER', 'mode': 'NULLABLE'},
 {'name': 'active', 'type': 'INTEGER', 'mode': 'NULLABLE'},
 {'name': 'cardio', 'type': 'INTEGER', 'mode': 'NULLABLE'}]
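Since the schema comes back as a plain list of dictionaries, it can be post-processed with ordinary Python. The following is a minimal sketch, assuming the structure shown in the output above; the helper name `columns_by_type` is hypothetical and not part of the connector, and the schema is abbreviated to three of the columns listed above.

```python
# Abbreviated copy of the schema structure returned by table_schema (see output above).
schema = [
    {"name": "age", "type": "INTEGER", "mode": "NULLABLE"},
    {"name": "weight", "type": "FLOAT", "mode": "NULLABLE"},
    {"name": "cardio", "type": "INTEGER", "mode": "NULLABLE"},
]


def columns_by_type(schema):
    """Group column names by their BigQuery type (hypothetical helper)."""
    grouped = {}
    for field in schema:
        grouped.setdefault(field["type"], []).append(field["name"])
    return grouped


print(columns_by_type(schema))
# {'INTEGER': ['age', 'cardio'], 'FLOAT': ['weight']}
```

A grouping like this can help decide, for example, which columns to treat as numerical before running further processing.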

In [None]:
# Load a dataset
data = connector.query(
    "SELECT * FROM {insert-dataset}.{insert-table}"
)

Query complete after 1.44s: 100%|██████████| 1/1 [00:01<00:00,  1.44s/query]
Downloading: 100%|██████████| 70000/70000 [00:02<00:00, 29079.85rows/s]


In [27]:
data.head(10)

| idx | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65 | 1 | 55 | 81.0 | 130 | 90 | 1 | 1 | 0 | 0 | 1 | 1 |
| 1 | 52 | 1 | 57 | 61.0 | 130 | 90 | 1 | 1 | 0 | 0 | 1 | 1 |
| 2 | 51 | 1 | 59 | 57.6 | 125 | 67 | 1 | 1 | 0 | 0 | 0 | 0 |
| 3 | 53 | 1 | 60 | 69.0 | 110 | 70 | 1 | 1 | 0 | 0 | 0 | 0 |
| 4 | 58 | 1 | 64 | 61.0 | 130 | 70 | 1 | 1 | 0 | 0 | 1 | 0 |
| 5 | 53 | 1 | 65 | 60.0 | 120 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 6 | 55 | 2 | 65 | 72.0 | 130 | 80 | 1 | 1 | 0 | 0 | 0 | 0 |
| 7 | 59 | 1 | 66 | 63.0 | 12 | 80 | 1 | 1 | 0 | 0 | 0 | 1 |
| 8 | 40 | 2 | 67 | 60.0 | 110 | 80 | 1 | 1 | 1 | 1 | 1 | 0 |
| 9 | 61 | 1 | 67 | 57.0 | 120 | 90 | 1 | 1 | 0 | 0 | 1 | 1 |


In [26]:
# Load a sample of a dataset
small_data = connector.query(
    "SELECT * FROM {insert-dataset}.{insert-table}",
    n_sample=10_000
)
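The placeholders in the queries above need to be replaced with real dataset and table names before the query is sent. One way to do this is sketched below; `build_select_all` is a hypothetical helper, not part of the connector, and its name check is deliberately conservative (it may reject names that BigQuery itself would accept). The dataset and table names are taken from the listing output shown earlier.

```python
import re

# Conservative check: letters, digits and underscores only (an assumption made
# for this sketch, stricter than BigQuery's own naming rules).
_IDENTIFIER = re.compile(r"[A-Za-z0-9_]+")


def build_select_all(dataset: str, table: str) -> str:
    """Return a SELECT * query string for dataset.table after a basic name check."""
    for name in (dataset, table):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"Unexpected characters in identifier: {name!r}")
    return f"SELECT * FROM {dataset}.{table}"


query = build_select_all("cardio_data", "table_test")
print(query)
# SELECT * FROM cardio_data.table_test
```

The resulting string can then be passed as the first argument of `connector.query`, with or without `n_sample`.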



## Advanced
With `BigQueryConnector`, you can access useful properties and methods directly from the main class.

In [15]:
# List the datasets of a given project
connector.datasets

['connectors_dev', 'dataset_test', 'decision_tree']

In [16]:
# Access the BigQuery Client
connector.client

<google.cloud.bigquery.client.Client at 0x7fcebc7fb940>

In [17]:
# Get a dataset, creating it if it does not exist yet
connector.get_or_create_dataset(dataset='{insert-dataset}')

In [None]:
# Delete a table. WARNING: POTENTIAL LOSS OF DATA
connector.delete_table_if_exists(dataset='{insert-dataset}', table='{insert-table}')

In [None]:
# Delete a dataset. WARNING: POTENTIAL LOSS OF DATA 
connector.delete_dataset_if_exists(dataset='{insert-dataset}')

### Example #1 - Execute Pandas transformations and store to BigQuery

In [None]:
# export data to pandas
small_df = small_data.to_pandas()
#
# DO TRANSFORMATIONS
# (...)
# 
# Write results to BigQuery table
connector.write_table_from_data(data=small_df, dataset='{insert-dataset}', table='{insert-table}')

### Example #2 - Write BigQuery query results to Google Cloud Storage

In [None]:
# Run a query in BigQuery and store it in Google Cloud Storage
connector.write_query_to_gcs(query="{insert-query}",
                             path="gs://{insert-bucket}/{insert-filepath}")
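The `path` argument follows the `gs://{bucket}/{filepath}` pattern. As a minimal sketch, the path could be assembled from its two parts as shown below; `gcs_uri` is a hypothetical helper, and the bucket and file names are made-up examples rather than real resources.

```python
def gcs_uri(bucket: str, filepath: str) -> str:
    """Build a gs:// URI from a bucket name and an object path (hypothetical helper)."""
    bucket = bucket.strip("/")
    filepath = filepath.lstrip("/")
    if not bucket or not filepath:
        raise ValueError("Both bucket and filepath must be non-empty")
    return f"gs://{bucket}/{filepath}"


# Example values are placeholders for illustration only.
print(gcs_uri("my-bucket", "/exports/cardio.csv"))
# gs://my-bucket/exports/cardio.csv
```

Building the path in one place avoids doubled or missing slashes when the bucket and file path come from separate configuration values.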