# Connectors - AWS S3

In this tutorial, you'll learn how to use `ydata-sdk` to connect to AWS S3. 
The `S3Connector` allows you to list buckets, read/write files, and sample datasets stored in S3.

This is useful for integrating synthetic data workflows into cloud-native or hybrid ML pipelines.
Make sure your AWS credentials (for example, an access key and secret key) are available and grant both read and write access to the target bucket.

## Benefits of Integration
Integrating ydata-sdk with AWS S3 offers several key benefits:

- **Seamless Cloud Access:** Easily browse, read, and write data from S3 buckets using a unified SDK interface.
- **Cloud-Native Workflows:** Connect directly to your S3-based data lake to enable profiling, synthesis, and anonymization without local downloads.
- **SDK-Wide Compatibility:** All features of ydata-sdk, from Q&A generation to synthetic tabular data, can operate directly on S3-hosted files.
- **Scalable & Automated:** Ideal for automating recurring workflows or powering large-scale pipelines with S3 as a data backend.

Before running this notebook, make sure that:
1. You have an `aws_credentials.json` file with credentials that grant read and write access to the data in your AWS S3 storage.
2. `ydata-sdk` is installed.
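
Both prerequisites can be confirmed with a quick pre-flight check before the first cell runs. The sketch below is only an illustration: the helper name `check_prerequisites` is hypothetical, and it assumes the package is importable as `ydata` (as in the imports used in this notebook) and that the credentials file sits at the path you pass in.

```python
import importlib.util
from pathlib import Path


def check_prerequisites(credentials_path: str, package: str = "ydata") -> list:
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if not Path(credentials_path).is_file():
        problems.append(f"credentials file not found: {credentials_path}")
    # find_spec returns None when the top-level package cannot be found
    if importlib.util.find_spec(package) is None:
        problems.append(f"package not installed: {package}")
    return problems


# Hypothetical location: adjust to wherever you keep the file
for problem in check_prerequisites("aws_credentials.json"):
    print("Missing prerequisite:", problem)
```

If nothing is printed, both checks passed.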

## Authenticate with your YData account

In [2]:
# Authenticate with your ydata-sdk token - https://dashboard.ydata.ai/
import os

os.environ['YDATA_LICENSE_KEY'] = '{add-your-key}'
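
Hard-coding the key is convenient in a notebook, but the `{add-your-key}` placeholder is easy to leave in by accident. As a minimal sketch (the helper name `require_env` is hypothetical and not part of `ydata-sdk`), you could verify that the variable holds a real value before continuing:

```python
import os


def require_env(name: str, placeholder: str = "{add-your-key}") -> str:
    """Return the value of an environment variable, or raise if it is unset,
    empty, or still equal to the tutorial placeholder."""
    value = os.environ.get(name, "").strip()
    if not value or value == placeholder:
        raise RuntimeError(f"{name} is not set; define it before running the notebook")
    return value


# Example usage (uncomment once the variable is defined):
# license_key = require_env("YDATA_LICENSE_KEY")
```

Failing here gives a clearer message than a licensing error raised later by the first SDK call.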

## Create your AWS S3 connector

In [3]:
from ydata.utils.formats import read_json

def get_token(token_path: str) -> dict:
    """
    Load the AWS credentials from a JSON file (for example, `aws_credentials.json`)
    and return them as a dictionary.
    """
    return read_json(token_path)

token = get_token('insert-credentials-path')
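
If you would rather fail early on a malformed credentials file, the same JSON can be read and validated with the standard library. This is a sketch under assumptions: the field names `access_key_id` and `secret_access_key` are guesses at what `S3Connector` expects and should be checked against your SDK version, and both helper names are hypothetical.

```python
import json

# Assumed field names; adjust to whatever your S3Connector version expects.
REQUIRED_KEYS = ("access_key_id", "secret_access_key")


def load_credentials(path: str, required=REQUIRED_KEYS) -> dict:
    """Read a JSON credentials file and fail early if a field is missing."""
    with open(path, encoding="utf-8") as handle:
        credentials = json.load(handle)
    missing = [key for key in required if key not in credentials]
    if missing:
        raise KeyError(f"missing credential fields: {', '.join(missing)}")
    return credentials


def masked(credentials: dict) -> dict:
    """Return a copy that is safer to print: only the last 4 characters show."""
    return {key: "****" + str(value)[-4:] for key, value in credentials.items()}
```

Printing `masked(token)` instead of `token` keeps secrets out of saved notebook outputs.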

In [4]:
# 🔗 Initialize the AWS S3 connector
from ydata.connectors import S3Connector

connector = S3Connector(**token)
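
In environments where a credentials file is impractical (CI jobs, containers), the dictionary unpacked into `S3Connector` could instead be built from the standard AWS environment variables. The variable names `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are the usual AWS ones; the keyword names on the left are assumptions about the connector's parameters, and `credentials_from_env` is a hypothetical helper.

```python
import os


def credentials_from_env(environ=os.environ) -> dict:
    """Build connector keyword arguments from standard AWS variables.

    The dictionary keys are assumptions about the connector's parameters.
    """
    mapping = {
        "access_key_id": "AWS_ACCESS_KEY_ID",
        "secret_access_key": "AWS_SECRET_ACCESS_KEY",
    }
    missing = [var for var in mapping.values() if not environ.get(var)]
    if missing:
        raise RuntimeError(f"unset environment variables: {', '.join(missing)}")
    return {kwarg: environ[var] for kwarg, var in mapping.items()}


# Example usage (assumes the variables are exported):
# connector = S3Connector(**credentials_from_env())
```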

## Read from your AWS S3

With the AWS S3 connector you can:
- Read a single file or a set of files from a folder
- Get a sample of the full dataset
- Write new data to a defined folder

In [7]:
# 🪣 List the contents of a given S3 bucket
connector.list(bucket_name='insert-bucket-name')

Buckets: {'keys': [], 'prefixes': ['cardio', 'data-profiling', 'db_home_credit_risk', 'duarte', 'portela', 'regular', 'synthea_database', 'synthetic_data', 'syntheticdata', 'test_dbx', 'timeseries']}


In [11]:
# We can also check the contents under a given prefix
# The result lists the file keys found under it and any nested prefixes we can explore
connector.list(bucket_name='insert-bucket-name', 
               prefix='insert-prefix')

{'keys': [],
 'prefixes': ['__unitystorage',
  'cardio',
  'cardio_new',
  'cardio_synthetic',
  'credit_lending',
  'credit_scoring.csv']}
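
The listings above come back as a dictionary with a `keys` list and a `prefixes` list. Following the usual S3 convention, keys are read here as objects (files) and prefixes as folder-like groupings; that reading is an assumption, not something the SDK output states. A small, hypothetical helper can make long listings easier to scan:

```python
def describe_listing(listing: dict) -> str:
    """Format a {'keys': [...], 'prefixes': [...]} listing as readable text."""
    keys = listing.get("keys", [])
    prefixes = listing.get("prefixes", [])
    lines = [f"{len(keys)} file(s), {len(prefixes)} prefix(es)"]
    lines += [f"  [prefix] {name}" for name in sorted(prefixes)]
    lines += [f"  [file]   {name}" for name in sorted(keys)]
    return "\n".join(lines)


# Illustrative input shaped like the output shown above
example = {"keys": [], "prefixes": ["cardio", "cardio_new", "credit_lending"]}
print(describe_listing(example))
```

In the notebook you would pass the result of `connector.list(...)` directly.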

In [None]:
# List available files under a specific folder
connector.list(bucket_name='insert-bucket-name', 
               prefix='syntheticdata/census')['keys']
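
A folder often mixes data files with other objects, so it can help to filter the returned keys by extension before reading anything. The sketch below assumes the `keys` entry is a list of strings; the helper name `filter_keys` and the example keys are hypothetical.

```python
from pathlib import PurePosixPath


def filter_keys(keys, suffixes=(".csv", ".parquet")):
    """Keep only keys whose file extension is in `suffixes` (case-insensitive)."""
    wanted = {suffix.lower() for suffix in suffixes}
    return [key for key in keys if PurePosixPath(key).suffix.lower() in wanted]


# Made-up keys, for illustration only
example_keys = [
    "syntheticdata/census/train.csv",
    "syntheticdata/census/README.md",
    "syntheticdata/census/part-0001.PARQUET",
]
print(filter_keys(example_keys))
```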

In [12]:
# 📥 Read a file from S3
from ydata.connectors.filetype import FileType

file_path = "s3://your-bucket-name/path/to/data.csv"
df = connector.read_file(
    path=file_path,
    file_type=FileType.CSV
)
df.head()

Dask dataframe query planning is disabled because dask-expr is not installed.

You can install it with `pip install dask[dataframe]` or `conda install dask`.
This will raise in a future version.



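| Unnamed: 0 | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18393 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 20228 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 18857 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 17623 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 17474 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |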
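
Note that `list` takes a bucket name and a prefix, while `read_file` takes a full `s3://` path. To move between the two forms, a URI can be split with the standard library. This is a minimal sketch; `split_s3_uri` is a hypothetical helper, not an SDK function.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri("s3://your-bucket-name/path/to/data.csv")
print(bucket, key)
```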


In [13]:
# 🔍 Read a sample from the S3 file

sample_df = connector.read_sample(
    path=file_path,
    file_type=FileType.CSV
)
sample_df.head()


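| Unnamed: 0 | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18393 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 20228 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 18857 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 17623 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 17474 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |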
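
The `Unnamed: 0` column in the output above is typically what pandas-style readers produce when a CSV was saved together with its index and read back without designating an index column. That is a general observation, not something confirmed for this dataset. A sample is a cheap place to spot such columns before processing the full file; the helper below is a hypothetical sketch that works on a plain list of column names.

```python
import re

# Matches the auto-generated names given to header-less columns
UNNAMED = re.compile(r"^Unnamed: \d+$")


def leftover_index_columns(columns):
    """Return the column names that look like auto-generated index columns."""
    return [name for name in columns if UNNAMED.match(str(name))]


columns = ["Unnamed: 0", "age", "gender", "height", "weight", "ap_hi", "ap_lo",
           "cholesterol", "gluc", "smoke", "alco", "active", "cardio"]
print(leftover_index_columns(columns))
```

How you then drop the flagged columns depends on the type of object `read_sample` returns in your SDK version.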


## Write to your AWS S3 Storage

In [None]:
# 📤 Write a file back to S3

output_path = "s3://your-bucket-name/path/to/output.csv"
connector.write_file(df, path=output_path)
print(f"File written to {output_path}")
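
Writing to a key that already exists replaces the object stored there (or adds a new version if bucket versioning is enabled), so it is safer to derive a distinct output path from the input path. The following is a sketch only; `with_suffix_tag` is a hypothetical helper and the tag value is arbitrary.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def with_suffix_tag(uri: str, tag: str) -> str:
    """Insert '_<tag>' before the file extension of an S3 URI."""
    parsed = urlparse(uri)
    key = PurePosixPath(parsed.path)
    new_key = key.with_name(f"{key.stem}_{tag}{key.suffix}")
    return f"{parsed.scheme}://{parsed.netloc}{new_key}"


print(with_suffix_tag("s3://your-bucket-name/path/to/data.csv", "copy"))
```

The result can be passed as `output_path` so that the original file is left untouched.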