# S3Connector - Quick Start

The S3Connector lets you read and write data in Amazon Simple Storage Service (S3) and integrate it with YData's platform.
Reading a dataset from S3 directly into a YData `Dataset` makes it available to the Data Quality, Data Synthetisation and Preprocessing blocks.

The following tutorial covers:
- How to read data from S3
- How to read a sample of the data from S3
- How to write data to S3
- (Advanced) Developer utilities

In [1]:
# Import the necessary packages
from ydata.connectors import S3Connector
from ydata.connectors.filetype import FileType
from ydata.utils.formats import read_json

In [2]:
# Load your credentials from a file
token = read_json('../../.secrets/s3_credentials.json')

In [3]:
# Instantiate the Connector
connector = S3Connector(**token)
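
Because the credentials are unpacked with `**token`, every top-level key in the JSON file becomes a keyword argument of the constructor. The sketch below illustrates that pattern using only the standard library. `FakeConnector`, `load_credentials`, and the key names `access_key_id` and `secret_access_key` are hypothetical stand-ins chosen for illustration; they are not the actual `S3Connector` signature or the `read_json` implementation.

```python
import json
import tempfile
from pathlib import Path


class FakeConnector:
    """Hypothetical stand-in for a connector; the argument names are assumptions."""

    def __init__(self, access_key_id, secret_access_key):
        self.access_key_id = access_key_id
        self.secret_access_key = secret_access_key


def load_credentials(path):
    """Read a JSON file and return its contents as a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Write a throwaway credentials file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    secrets_path = Path(tmp_dir) / "s3_credentials.json"
    secrets_path.write_text(
        json.dumps({"access_key_id": "AKIA-EXAMPLE", "secret_access_key": "not-a-real-secret"}),
        encoding="utf-8",
    )
    token = load_credentials(secrets_path)

# Each top-level key becomes a keyword argument of the constructor.
fake_connector = FakeConnector(**token)
```

Keeping the file outside the repository (or in an ignored `.secrets` folder, as in the path above) avoids committing secrets by accident.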

In [4]:
# Load a dataset
data = connector.read_file('S3://ydata-demos/teste.csv', file_type=FileType.CSV)
print(f'My data is of type {type(data).__name__}.')

My data is of type Dataset.


In [5]:
# The file_type argument is optional. If omitted, it is inferred from the path.
parquet_data = connector.read_file('S3://ydata-demos/teste.parquet')

In [6]:
# For a quick glimpse, we can load a small subset of the data (e.g. 1%)
small_data = connector.read_sample('S3://ydata-demos/teste.csv', sample_size=0.01)

In [7]:
# We could alternatively define a specific number of rows
very_small_data = connector.read_sample('S3://ydata-demos/teste.csv', sample_size=67)

In [8]:
print(f"""Number of rows:
Original: {data.shape[0]:,}, 
Sampled (%): {small_data.shape[0]:,}
Sampled (n): {very_small_data.shape[0]:,}.""")

Number of rows:
Original: 10,000, 
Sampled (%): 100
Sampled (n): 67.
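
The two calls above suggest that `sample_size` is read as a fraction when it is a float below 1 and as a row count when it is an integer. The helper below is a minimal sketch of that convention, not the connector's actual logic; `resolve_sample_size` is a hypothetical name and the validation rules are assumptions.

```python
def resolve_sample_size(total_rows, sample_size):
    """Return the number of rows to sample.

    Assumed convention: a float in (0, 1) is a fraction of the rows,
    while a positive int is an absolute row count.
    """
    if isinstance(sample_size, bool):
        raise TypeError("sample_size must be a float fraction or an int count")
    if isinstance(sample_size, float):
        if not 0 < sample_size < 1:
            raise ValueError("a fractional sample_size must lie strictly between 0 and 1")
        # Keep at least one row so a tiny fraction never yields an empty sample.
        return max(1, round(total_rows * sample_size))
    if isinstance(sample_size, int):
        if sample_size <= 0:
            raise ValueError("an absolute sample_size must be positive")
        # Never request more rows than the dataset holds.
        return min(sample_size, total_rows)
    raise TypeError("sample_size must be a float fraction or an int count")


fraction_rows = resolve_sample_size(10_000, 0.01)
count_rows = resolve_sample_size(10_000, 67)
print(fraction_rows, count_rows)
```

With the 10,000-row dataset used here, the two calls resolve to the same row counts as the output above.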


In [9]:
# Now imagine we want to store the dataset we loaded back to S3.
connector.write_file(data, 's3://ydata-dev-connectors/write_test.csv')

  warn("Appending data to a network storage system may not work.")


In [10]:
# Alternatively, we can write a new pandas DataFrame
# Note: pandas.util.testing is deprecated since pandas 1.0 and was removed in pandas 2.0
from pandas.util.testing import makeDataFrame
dummy_df = makeDataFrame()
connector.write_file(dummy_df, 's3://ydata-dev-connectors/write_dummy.parquet', write_index=True)
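
On pandas versions where `pandas.util.testing` is no longer available, a comparable dummy frame can be built by hand. The snippet below is a sketch under assumptions: the row count, the seed, and the 10-character random labels are arbitrary choices made to resemble the frame used in this tutorial, not a reproduction of `makeDataFrame`.

```python
import string

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n_rows = 30  # arbitrary size chosen for this sketch

# Random 10-character labels, similar in spirit to the index shown in this tutorial.
alphabet = list(string.ascii_letters + string.digits)
index = ["".join(rng.choice(alphabet, size=10)) for _ in range(n_rows)]

# Four columns of standard-normal floats named A to D.
dummy_df = pd.DataFrame(
    rng.standard_normal((n_rows, 4)),
    index=index,
    columns=list("ABCD"),
)
```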

  


In [11]:
# Now we load the new dataset to check that the write worked
dummy_data = connector.read_file('s3://ydata-dev-connectors/write_dummy.parquet')

In [12]:
# These are the first rows of the original DataFrame
dummy_df.head()

| | A | B | C | D |
|---|---|---|---|---|
| 1Y2Jcj2XWT | 0.839621 | -0.808945 | 0.1783 | -1.03718 |
| X3uCdC3SLn | -0.92931 | -0.941703 | 0.295265 | -1.352673 |
| GlbjQ8e0MG | 0.344733 | -0.226315 | -1.437262 | -1.722771 |
| 6tEZjlLKKm | -1.943021 | -0.469168 | -0.462391 | 0.240716 |
| L4bl8COz8z | -1.20175 | -1.032017 | -0.741219 | -0.709047 |


In [13]:
# These are the first rows of our "stored-to-parquet-and-loaded" data
# The row order may differ from the original because data is read and written in parallel.
dummy_data.to_pandas().head()

| | A | B | C | D |
|---|---|---|---|---|
| 1Y2Jcj2XWT | 0.839621 | -0.808945 | 0.1783 | -1.03718 |
| 6tEZjlLKKm | -1.943021 | -0.469168 | -0.462391 | 0.240716 |
| 9COfBctvk6 | 0.744983 | 1.819461 | -0.492183 | 1.054205 |
| Aocpqx7uLD | 1.565838 | 0.278864 | 1.202742 | -0.394791 |
| GlbjQ8e0MG | 0.344733 | -0.226315 | -1.437262 | -1.722771 |


In [14]:
# Both datasets still match: eq() aligns rows by index label before comparing
print(f'All rows equal all columns in both datasets: {dummy_data.to_pandas().eq(dummy_df, axis=1).all(axis=None)}.')

All rows equal all columns in both datasets: True.
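
The comparison works despite the different row order because `DataFrame.eq` aligns both frames on their labels before comparing. The toy example below shows this with a small, made-up frame; the names `original` and `reloaded` are illustrative only and no S3 access is involved.

```python
import pandas as pd

original = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]},
    index=["x", "y", "z"],
)

# Simulate a round trip that returns the same rows in a different order.
reloaded = original.loc[["z", "x", "y"]]

# eq() aligns both frames on their labels, so row order does not matter.
same_values = bool(reloaded.eq(original).all(axis=None))

# A stricter alternative: sort both frames by index, then compare values and dtypes.
pd.testing.assert_frame_equal(reloaded.sort_index(), original.sort_index())
```

Note that `eq` treats `NaN` as unequal to `NaN`, so data with missing values may be better served by `pd.testing.assert_frame_equal` on the sorted frames.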


## Advanced Features
Connectors provide developer utilities that let data scientists navigate S3 storage from code.

* Check if a bucket exists
* List the contents of a bucket

In [15]:
# We can check if a certain bucket exists
connector.check_bucket('ydata-demos'), connector.check_bucket('fake-ydata-bucket')

(True, False)

In [16]:
# We can check the contents of a certain bucket
# It seems we have 3 files (i.e. keys) and 1 folder (i.e. prefix)
connector.list(bucket_name='ydata-demos')

{'keys': [('Synthetic Data_2.png', 15630),
  ('teste.csv', 946384),
  ('teste.parquet', 202335)],
 'prefixes': ['syntheticdata']}
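
The returned dictionary is plain Python, so it can be summarised without further calls. The sketch below copies the listing shown above and assumes the second element of each tuple is the object size in bytes, which the output suggests but does not state; `summarize_listing` is a hypothetical helper.

```python
listing = {
    "keys": [
        ("Synthetic Data_2.png", 15630),
        ("teste.csv", 946384),
        ("teste.parquet", 202335),
    ],
    "prefixes": ["syntheticdata"],
}


def summarize_listing(listing):
    """Summarise a listing of (name, size) tuples; sizes are assumed to be bytes."""
    keys = listing["keys"]
    total_bytes = sum(size for _, size in keys)
    largest_name, _ = max(keys, key=lambda item: item[1])
    return {
        "n_keys": len(keys),
        "n_prefixes": len(listing["prefixes"]),
        "total_bytes": total_bytes,
        "largest_key": largest_name,
    }


summary = summarize_listing(listing)
print(summary)
```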

In [17]:
# We can check the contents of the prefix
# We only have 1 key now, but there are other prefixes we can explore
connector.list('ydata-demos', prefix='syntheticdata')

{'keys': [('index.json', 2025)],
 'prefixes': ['airbnb_newyork',
  'cardiovascular_disease',
  'census',
  'creditcard_fraud',
  'movie_lens']}

In [18]:
# We have 3 files available under /census
connector.list('ydata-demos', prefix='syntheticdata/census')['keys']

[('data.csv', 3811499),
 ('metadata.json', 4757),
 ('synthetic_data.csv', 232524)]
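
Exploring one prefix at a time, as in the cells above, can be automated with a small recursive walk. The sketch below assumes a listing function that returns the same `{'keys': ..., 'prefixes': ...}` structure and that child prefixes are relative names joined with `/`, as the `syntheticdata/census` call suggests. `walk_keys`, `fake_list`, and the in-memory `FAKE_BUCKET` are hypothetical and stand in for a real bucket.

```python
# Hypothetical in-memory bucket: maps a prefix to its listing.
FAKE_BUCKET = {
    "": {"keys": [("readme.txt", 10)], "prefixes": ["reports"]},
    "reports": {"keys": [("index.json", 20)], "prefixes": ["2021"]},
    "reports/2021": {"keys": [("data.csv", 30), ("metadata.json", 5)], "prefixes": []},
}


def fake_list(bucket_name, prefix=""):
    """Stand-in for a listing call; ignores bucket_name and looks up the prefix."""
    return FAKE_BUCKET[prefix]


def walk_keys(list_fn, bucket_name, prefix=""):
    """Yield (full_key, size) for every key under a prefix, depth first."""
    # Only pass the prefix when there is one, mirroring the calls in this tutorial.
    listing = list_fn(bucket_name, prefix=prefix) if prefix else list_fn(bucket_name)
    for name, size in listing["keys"]:
        yield (f"{prefix}/{name}" if prefix else name, size)
    for child in listing["prefixes"]:
        child_prefix = f"{prefix}/{child}" if prefix else child
        yield from walk_keys(list_fn, bucket_name, prefix=child_prefix)


all_keys = list(walk_keys(fake_list, "hypothetical-bucket"))
print(all_keys)
```

With a real connector, `connector.list` could be passed as `list_fn`, provided it accepts the same arguments used in the cells above. On large buckets a full walk issues one call per prefix, so it is worth limiting the starting prefix.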