
Saving Data

Ray Data lets you save data to files or convert Datasets to other Python objects.

This guide shows you how to:

- Write data to files
- Convert Datasets to other Python libraries

Writing data to files

Ray Data writes to local disk and cloud storage.

Writing data to local disk

To save your Dataset to local disk, call a method like Dataset.write_parquet and specify a local directory with the local:// scheme.

Warning

If your cluster contains multiple nodes and you don't use local://, Ray Data writes different partitions of data to different nodes.

import ray

ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv")

ds.write_parquet("local:///tmp/iris/")

To write data to formats other than Parquet, read the Input/Output reference.

Writing data to cloud storage

To save your Dataset to cloud storage, authenticate all nodes with your cloud service provider. Then, call a method like Dataset.write_parquet and specify a URI with the appropriate scheme. The URI can point to a bucket or a folder.

To write data to formats other than Parquet, read the Input/Output reference.

S3

To save data to Amazon S3, specify a URI with the s3:// scheme.

import ray

ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv")

ds.write_parquet("s3://my-bucket/my-folder")

Ray Data relies on PyArrow for authentication with Amazon S3. For more on how to configure your credentials to be compatible with PyArrow, see their S3 Filesystem docs.
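If your nodes can't rely on ambient credentials such as instance profiles or AWS_* environment variables, you can pass an explicit PyArrow filesystem to the write call. A minimal sketch, with placeholder keys and region:

import pyarrow.fs
import ray

# Placeholder credentials; prefer environment-based authentication where possible.
filesystem = pyarrow.fs.S3FileSystem(
    access_key="YOUR_ACCESS_KEY",
    secret_key="YOUR_SECRET_KEY",
    region="us-west-2",
)

ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv")
ds.write_parquet("s3://my-bucket/my-folder", filesystem=filesystem)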

GCS

To save data to Google Cloud Storage, install the Filesystem interface to Google Cloud Storage:

pip install gcsfs

Then, create a GCSFileSystem and specify a URI with the gcs:// scheme.

import gcsfs
import ray

ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv")

filesystem = gcsfs.GCSFileSystem(project="my-google-project")
ds.write_parquet("gcs://my-bucket/my-folder", filesystem=filesystem)

Ray Data relies on PyArrow for authentication with Google Cloud Storage. For more on how to configure your credentials to be compatible with PyArrow, see their GCS Filesystem docs.
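If the environment doesn't already provide credentials, gcsfs can also load them explicitly. A minimal sketch, assuming a service-account key file at a hypothetical path:

import gcsfs

# "token" accepts a path to a service-account key file. The path below is a
# placeholder, not a real credential.
filesystem = gcsfs.GCSFileSystem(
    project="my-google-project",
    token="/path/to/service-account-key.json",
)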

ABS

To save data to Azure Blob Storage, install the Filesystem interface to Azure-Datalake Gen1 and Gen2 Storage:

pip install adlfs

Then, create an AzureBlobFileSystem and specify a URI with the az:// scheme.

import adlfs
import ray

ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv")

filesystem = adlfs.AzureBlobFileSystem(account_name="azureopendatastorage")
ds.write_parquet("az://my-bucket/my-folder", filesystem=filesystem)

Ray Data relies on PyArrow for authentication with Azure Blob Storage. For more on how to configure your credentials to be compatible with PyArrow, see their fsspec-compatible filesystems docs.
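If the storage account isn't publicly readable, adlfs accepts explicit credentials. A minimal sketch with a placeholder account name and key (adlfs also supports connection strings and SAS tokens):

import adlfs

# Placeholder credentials; don't hard-code real keys in source.
filesystem = adlfs.AzureBlobFileSystem(
    account_name="my-storage-account",
    account_key="MY_ACCOUNT_KEY",
)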

Writing data to NFS

To save your Dataset to NFS file systems, call a method like Dataset.write_parquet and specify a mounted directory.

import ray

ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv")

ds.write_parquet("/mnt/cluster_storage/iris")

To write data to formats other than Parquet, read the Input/Output reference.

Changing the number of output files

When you call a write method, Ray Data writes your data to several files. To control the number of output files, configure num_rows_per_file.

Note

num_rows_per_file is a hint, not a strict limit. Ray Data might write more or fewer rows to each file.

import os
import ray

ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv")
ds.write_csv("/tmp/few_files/", num_rows_per_file=75)

print(os.listdir("/tmp/few_files/"))

['0_000001_000000.csv', '0_000000_000000.csv', '0_000002_000000.csv']
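For contrast, a larger hint produces fewer, larger files. A minimal sketch (the exact file count can still vary, because num_rows_per_file is only a hint):

import os
import ray

ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv")

# With a hint that covers all 150 rows, Ray Data typically writes one file.
ds.write_csv("/tmp/one_file/", num_rows_per_file=150)
print(os.listdir("/tmp/one_file/"))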

Converting Datasets to other Python libraries

Converting Datasets to pandas

To convert a Dataset to a pandas DataFrame, call Dataset.to_pandas(). Your data must fit in memory on the head node.

import ray

ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv")

df = ds.to_pandas()
print(df)

     sepal length (cm)  sepal width (cm)  ...  petal width (cm)  target
0                  5.1               3.5  ...               0.2       0
1                  4.9               3.0  ...               0.2       0
2                  4.7               3.2  ...               0.2       0
3                  4.6               3.1  ...               0.2       0
4                  5.0               3.6  ...               0.2       0
..                 ...               ...  ...               ...     ...
145                6.7               3.0  ...               2.3       2
146                6.3               2.5  ...               1.9       2
147                6.5               3.0  ...               2.0       2
148                6.2               3.4  ...               2.3       2
149                5.9               3.0  ...               1.8       2

[150 rows x 5 columns]

Converting Datasets to distributed DataFrames

Ray Data interoperates with distributed data processing frameworks like Dask, Spark, Modin, and Mars.

Dask

To convert a Dataset to a Dask DataFrame, call Dataset.to_dask().

import ray

ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv")

df = ds.to_dask()
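The returned Dask DataFrame is lazy. A usage sketch, assuming you route Dask computations through the Dask-on-Ray scheduler:

from ray.util.dask import enable_dask_on_ray

# Run Dask computations on the Ray cluster instead of Dask's default scheduler.
enable_dask_on_ray()

# .head() triggers computation and materializes the first rows.
print(df.head())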

Spark

To convert a Dataset to a Spark DataFrame, call Dataset.to_spark().

import ray
import raydp

spark = raydp.init_spark(
    app_name="example",
    num_executors=1,
    executor_cores=4,
    executor_memory="512M",
)

ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv")
df = ds.to_spark(spark)

raydp.stop_spark()

Modin

To convert a Dataset to a Modin DataFrame, call Dataset.to_modin().

import ray

ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv")

mdf = ds.to_modin()

Mars

To convert a Dataset to a Mars DataFrame, call Dataset.to_mars().

import ray

ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv")

mdf = ds.to_mars()