# Google Cloud Storage Connector - Quick Start

The GCS connector lets you read and write data in Google Cloud Storage and integrate it with YData's platform.
Reading a dataset from GCS directly into a YData `Dataset` makes it available to the Data Quality, Data Synthetisation and Preprocessing blocks.

The following tutorial covers:
- How to read data from GCS
- How to read a sample of the data from GCS
- How to write data to GCS

In [1]:
# Import the necessary packages
from ydata.connectors import GCSConnector
from ydata.connectors.filetype import FileType
from ydata.utils.formats import read_json

In [2]:
# Load your credentials from a file
token = read_json('../../.secrets/gcs_write_token.json')
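As a rough illustration of what loading credentials involves, the sketch below reads a JSON key file with the standard library and checks for a few fields that Google service-account key files contain. The helper name `load_keyfile` and the choice of required fields are assumptions made for this example; this is not the implementation of `read_json`.

```python
import json
from pathlib import Path

# Fields assumed here to be mandatory; service-account key files contain more.
REQUIRED_KEYS = ("type", "project_id", "private_key", "client_email")


def load_keyfile(path):
    """Read a JSON key file and check that the expected fields exist.

    Hypothetical helper, shown only to illustrate the idea.
    """
    with Path(path).open(encoding="utf-8") as handle:
        token = json.load(handle)
    missing = [key for key in REQUIRED_KEYS if key not in token]
    if missing:
        raise KeyError(f"Key file is missing fields: {missing}")
    return token
```

Failing early on a malformed key file tends to give a clearer error than a failed authentication later on.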

In [3]:
# Instantiate the Connector
connector = GCSConnector(project_id=token['project_id'], keyfile_dict=token)



In [4]:
# Load a dataset
data = connector.read_file('gs://ydata_testdata/tabular/cardio/data.csv', file_type=FileType.CSV)
print(f'My data is of type {type(data).__name__}.')

My data is of type Dataset.


In [5]:
# The file_type argument is optional. If omitted, the file type is inferred from the path.
parquet_data = connector.read_file('gs://ydata_testdata/tabular/data.parquet')
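One plausible way to infer a file type from a path is to look at the file extension. The sketch below does that with the standard library; the function `infer_format` and the extension mapping are hypothetical and only illustrate the idea, as the connector's actual inference logic is not shown in this tutorial.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Hypothetical mapping from file extension to a format label.
EXTENSION_TO_FORMAT = {".csv": "csv", ".parquet": "parquet"}


def infer_format(uri):
    """Guess the file format from the extension of a gs:// URI."""
    # urlsplit separates the bucket (netloc) from the object path.
    suffix = PurePosixPath(urlsplit(uri).path).suffix.lower()
    try:
        return EXTENSION_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(
            f"Cannot infer a file type from {uri!r}; pass it explicitly"
        ) from None


print(infer_format("gs://ydata_testdata/tabular/data.parquet"))
```

When a path has no recognisable extension, passing `file_type` explicitly, as in the CSV example earlier, avoids the ambiguity.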

In [6]:
# For a quick glimpse, we can load a small subset of the data (e.g. 1%)
small_data = connector.read_sample('gs://ydata_testdata/tabular/data.parquet', sample_size=0.01)

In [7]:
# We could alternatively define a specific number of rows
very_small_data = connector.read_sample('gs://ydata_testdata/tabular/data.parquet', sample_size=67)
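The two calls above suggest that `sample_size` is read as a fraction when it is a float below 1 and as a row count when it is an integer. A minimal sketch of that interpretation follows; `resolve_sample_size` is a hypothetical helper, and the connector may well sample approximately rather than compute an exact row count.

```python
def resolve_sample_size(sample_size, total_rows):
    """Translate a fraction or a row count into a number of rows.

    Assumed convention: a float in (0, 1) is a fraction of the rows,
    anything else is an absolute row count capped at total_rows.
    """
    if isinstance(sample_size, bool) or sample_size <= 0:
        raise ValueError("sample_size must be a positive number")
    if isinstance(sample_size, float) and sample_size < 1:
        # Keep at least one row so tiny fractions still return data.
        return max(1, round(total_rows * sample_size))
    return min(int(sample_size), total_rows)
```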

In [8]:
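# Note: `data` was read from cardio/data.csv, while both samples were drawn from data.parquet,
# so "Original" below is not the row count of the sampled file.
print(f"""Number of rows: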
Original: {data.shape[0]:,}, 
Sampled (%): {small_data.shape[0]:,}
Sampled (n): {very_small_data.shape[0]:,}.""")

Number of rows:
Original: 70,000, 
Sampled (%): 351
Sampled (n): 67.


In [9]:
# Now imagine we want to store the sampled data.
connector.write_file(small_data, 'gs://ydata_development/connectors/write_sample.csv')

  warn("Appending data to a network storage system may not work.")


In [10]:
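# Alternatively, we can write a new DataFrame
# Note: pandas.util.testing is deprecated since pandas 1.0 and was removed in pandas 2.0;
# on recent versions, build a small DataFrame by hand instead.
from pandas.util.testing import makeDataFrame
dummy_df = makeDataFrame()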
connector.write_file(dummy_df, 'gs://ydata_development/connectors/write_sample.parquet', write_index=True)

  


In [11]:
# Now we load the new dataset to ensure it was written correctly
dummy_data = connector.read_file('gs://ydata_development/connectors/write_sample.parquet')

In [12]:
# These are the first rows of the original DataFrame
dummy_df.head()

| | A | B | C | D |
|---|---|---|---|---|
| vWR8EjvCN5 | -1.080968 | 0.131254 | -0.400662 | -1.56647 |
| rgm3NTBIrg | 1.3432 | 1.324647 | 0.064379 | -1.756315 |
| CD9S1WB8sN | -0.904694 | 0.222477 | -0.174191 | -0.815463 |
| zYSkrNk625 | -0.874359 | -0.287793 | -0.876278 | -0.081582 |
| JwxGp6cTn5 | -0.10096 | -0.016859 | -0.799288 | 0.544344 |


In [13]:
# These are the first rows of our "stored-to-parquet-and-loaded" data
# The row order may differ from the original because data is read and written in parallel.
dummy_data.to_pandas().head()

| | A | B | C | D |
|---|---|---|---|---|
| 0G4d1A3oXF | 0.559392 | 0.361695 | -1.288391 | -0.925004 |
| 4uy2ackwkc | -0.868865 | -0.531903 | -0.933147 | -1.740783 |
| 5ZjXiH4Whr | 1.217328 | -2.474033 | 0.653 | 0.492108 |
| 79nPY7QAc4 | 0.642221 | -0.883361 | -0.100726 | 0.429338 |
| 9rT1iYFPo2 | -0.31892 | 0.700615 | 0.130373 | 0.454638 |


In [14]:
# But both datasets do match!
print(f'All rows equal all columns in both datasets: {dummy_data.to_pandas().eq(dummy_df, axis=1).all(axis=None)}.')

All rows equal all columns in both datasets: True.
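The check above succeeds even though the rows come back in a different order, because the comparison matches rows by index label rather than by position. The same idea can be sketched without pandas: sort both tables by their index label before comparing. The function `rows_match` is a hypothetical helper, and the rows are copied from the outputs shown earlier.

```python
def rows_match(left, right):
    """Compare two tables of (index_label, *values) rows, ignoring row order."""
    by_label = lambda row: row[0]
    return sorted(left, key=by_label) == sorted(right, key=by_label)


original = [
    ("vWR8EjvCN5", -1.080968, 0.131254, -0.400662, -1.56647),
    ("0G4d1A3oXF", 0.559392, 0.361695, -1.288391, -0.925004),
]
# Same rows, but returned in a different order after the round trip.
reloaded = list(reversed(original))

print(rows_match(original, reloaded))
```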


## Advanced
Advanced features enable you to manage Google Cloud Storage directly through the connector.

In [15]:
# Delete a specific blob
# connector.delete_blob_if_exists('gs://ydata_development/connectors/write_sample.csv')

In [16]:
# List the contents under a given bucket
connector.ls('gs://ydata_development/')

{'files': [], 'dirs': ['connectors', 'issue#110']}

In [17]:
# List the contents under a given path within a bucket
connector.ls('gs://ydata_development/connectors')

{'files': [('write_sample.csv', 31734)], 'dirs': ['write_sample.parquet']}
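The dictionary returned by `ls` can be post-processed with plain Python. The sketch below summarises a listing of the shape shown above; `summarize_listing` is a hypothetical helper, and it assumes the second element of each file tuple is the size in bytes, which the output does not state explicitly.

```python
def summarize_listing(listing):
    """Count files and directories and add up the reported file sizes.

    Assumes each entry in listing["files"] is a (name, size) tuple.
    """
    files = listing.get("files", [])
    return {
        "n_files": len(files),
        "n_dirs": len(listing.get("dirs", [])),
        "total_size": sum(size for _name, size in files),
    }


listing = {"files": [("write_sample.csv", 31734)], "dirs": ["write_sample.parquet"]}
print(summarize_listing(listing))
```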