## 101 -- Storage Options in the Anyscale Platform

### Introduction

Most AI workloads require access to large amounts of data, whether to obtain training data or to offload model checkpoints. In this notebook, we will explore the different storage options provided by Anyscale and how to use them effectively for AI/ML workloads.

1. Object storage  
2. Network file system (NFS) shared across nodes  
3. Local storage for a node  
4. Local File Store

To see the storage options, open an existing **Anyscale Workspace** or create a new one.

<img src="https://lz-public-demo.s3.us-east-1.amazonaws.com/anyscale101/workspace2.png"  width="500"/>

Once the Anyscale Workspace is running, go to the **Files Tab** and click the Workspace Working Directory for a UI overview of the storage options. The sections below review each option in greater detail.

<img src="https://lz-public-demo.s3.us-east-1.amazonaws.com/anyscale101/storage3.png"  width="700"/>


### 1. Cloud Object Store

Every Anyscale Cloud includes a default Cloud Storage Bucket. For accessing large volumes of data (such as inputs for model training), each workspace contains environment variables to reference this shared bucket.

Information about the bucket is stored within the following environment variables:

- **ANYSCALE_CLOUD_STORAGE_BUCKET**: Name of the default bucket  
  - The root bucket (**ANYSCALE_CLOUD_STORAGE_BUCKET**) is managed by Anyscale and contains system-critical files.  
  - Do not modify or delete anything in the root bucket outside the **ANYSCALE_ARTIFACT_STORAGE** path, because doing so may impact platform functionality.
- **ANYSCALE_CLOUD_STORAGE_BUCKET_REGION**: Region of the default bucket.  
- **ANYSCALE_ARTIFACT_STORAGE**: Pre-generated URI path for storing user artifacts separately.

<p>&nbsp;</p>

Create a Python notebook file (`.ipynb` extension).

<img src="https://lz-public-demo.s3.us-east-1.amazonaws.com/anyscale101/storage2.png"  width="500"/>


Print the details of the Cloud Storage Bucket by running this code snippet in a new code cell:

In [None]:
import os

print("ANYSCALE_CLOUD_STORAGE_BUCKET:", os.getenv("ANYSCALE_CLOUD_STORAGE_BUCKET"))
print("ANYSCALE_CLOUD_STORAGE_BUCKET_REGION:", os.getenv("ANYSCALE_CLOUD_STORAGE_BUCKET_REGION"))
print("ANYSCALE_ARTIFACT_STORAGE:", os.getenv("ANYSCALE_ARTIFACT_STORAGE"))

# ANYSCALE_CLOUD_STORAGE_BUCKET: anyscale-production-data-cld-g54aiirwj1s8t9ktgzikqur41k
# ANYSCALE_CLOUD_STORAGE_BUCKET_REGION: us-west-2
# ANYSCALE_ARTIFACT_STORAGE: s3://anyscale-production-data-cld-g54aiirwj1s8t9ktgzikqur41k/org_967t9ah1lbk1yqf1zau6a1v247/cld_g54aiirwj1s8t9ktgzikqur41k/artifact_storage
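
The artifact storage value is a URI, so it can be split into a bucket name and a key prefix before it is handed to a storage client. The following is a minimal sketch using only the standard library. The helper name `split_storage_uri` and the fallback URI are illustrative assumptions and are not part of the Anyscale platform.

```python
import os
from urllib.parse import urlsplit


def split_storage_uri(uri: str) -> tuple[str, str, str]:
    """Split a URI such as s3://bucket/a/b into (scheme, bucket, prefix)."""
    parts = urlsplit(uri)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"Not a bucket URI: {uri!r}")
    # Drop leading and trailing slashes so the prefix can be joined with "/"
    return parts.scheme, parts.netloc, parts.path.strip("/")


# Hypothetical fallback value so the snippet also runs outside a Workspace
uri = os.getenv("ANYSCALE_ARTIFACT_STORAGE") or "s3://example-bucket/org_id/cloud_id/artifact_storage"
scheme, bucket, prefix = split_storage_uri(uri)
print(f"scheme={scheme} bucket={bucket} prefix={prefix}")
```

Checking the scheme first makes it easy to choose a client for the cloud provider in use instead of assuming `s3://`.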

**(If Anyscale Cloud is connected to Amazon S3)**

Upload a sample file to the S3 Bucket, then find it by selecting the most recently modified object under the artifact path.

In [None]:
import os
import boto3

s3 = boto3.client("s3")
bucket, prefix = os.environ["ANYSCALE_ARTIFACT_STORAGE"].replace("s3://", "").split("/", 1)

local_file = "hello.txt"
s3_key = f"{prefix.rstrip('/')}/{local_file}"

# Write a simple text file to the Bucket
with open(local_file, "w") as f:
    f.write("Sample File\n")
s3.upload_file(local_file, bucket, s3_key)
print(f"Uploaded {local_file} to s3://{bucket}/{s3_key}")

# List objects under the artifact prefix (a single call returns at most 1,000 keys)
response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
files = response.get("Contents", [])
# Pick the object with the latest modification time
most_recent = max(files, key=lambda obj: obj["LastModified"])
recent_key = most_recent["Key"]
# Strip only the leading prefix so the remainder of the key is left untouched
recent_name = recent_key.removeprefix(prefix).lstrip("/")
print(f"Most recent file: {recent_name} (LastModified: {most_recent['LastModified']})")


# Uploaded hello.txt to s3://anyscale-production-data-cld-g54aiirwj1.../hello.txt
# Most recent file: hello.txt (LastModified: 2025-06-17 20:19:18+00:00)
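
A single `list_objects_v2` call returns at most 1,000 keys, so an artifact path holding more objects needs pagination to find the newest one reliably. The sketch below uses the boto3 paginator for `list_objects_v2`. The function name `most_recent_object` is an illustrative assumption, and the client is passed in as an argument so the function can be used with any boto3 S3 client.

```python
def most_recent_object(s3_client, bucket: str, prefix: str):
    """Return the newest object dict under prefix, or None if nothing is found.

    Walks every page of results, so prefixes with more than 1,000 keys
    are scanned completely.
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    newest = None
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty prefix carry no "Contents" entry
        for obj in page.get("Contents", []):
            if newest is None or obj["LastModified"] > newest["LastModified"]:
                newest = obj
    return newest


# Example call inside a Workspace (requires boto3 and S3 access):
#   newest = most_recent_object(boto3.client("s3"), bucket, prefix)
#   print(newest["Key"] if newest else "no objects found")
```

Returning `None` for an empty prefix avoids the `ValueError` that `max()` raises on an empty list.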

Since the Workspace clusters already have role permissions to access S3, no additional authentication is required when using boto3 or the AWS CLI.

Delete the created file by running this in a new cell.

In [None]:
s3.delete_object(Bucket=bucket, Key=s3_key)
print(f"Deleted s3://{bucket}/{s3_key}")

# Deleted s3://anyscale-production-data-cld.../artifact_storage/hello.txt

### 2. Shared File Storage 

Every Anyscale Cluster starts with the following mount points. This storage option (NFS) is not intended for extremely large datasets; use Cloud Object Storage for those instead.

- **User Storage:** `/mnt/user_storage`  
  - Private to the Anyscale user but accessible from every node of all their workspace, job, and service clusters in the same cloud.  
  - Persisted independently of Anyscale Workspace/Job/Service lifecycle.

- **Shared Storage:** `/mnt/shared_storage`  
  - Accessible to all Anyscale users of the same Anyscale cloud. Anyscale mounts it on every node of all the clusters in the same cloud.  
  - Persisted independently of Anyscale Workspace/Job/Service lifecycle.

Anyscale Cloud has some tutorials installed by default. Let’s check whether they are stored in the **Shared Storage** mount.

You can also inspect the mount in the UI from the **Files Tab**.


In [None]:
import os

shared_path = "/mnt/shared_storage"
all_files = []

for root, dirs, files in os.walk(shared_path):
    for name in files:
        all_files.append(os.path.join(root, name))

# Print first 4 files
for path in all_files[:4]:
    print(path)
    
# /mnt/shared_storage/dummy_data_xxl.csv
# /mnt/shared_storage/dummy_data_1000_500.parquet
# /mnt/shared_storage/dummy_data_1000_500.csv
# /mnt/shared_storage/dummy_data_1000_720.csv
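
Because these mounts behave like ordinary directories, regular file APIs are enough to write data that outlives the cluster. The sketch below writes a small file to `/mnt/user_storage` and reads it back. The helper `pick_storage_dir`, the `storage_demo` folder name, and the temporary-directory fallback (used when the mount is absent, for example outside Anyscale) are illustrative assumptions.

```python
import tempfile
from pathlib import Path


def pick_storage_dir(candidates: list[str]) -> Path:
    """Return the first candidate that is an existing directory, else a new temp dir."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_dir():
            return path
    return Path(tempfile.mkdtemp(prefix="storage_demo_"))


base = pick_storage_dir(["/mnt/user_storage"])
note = base / "storage_demo" / "note.txt"
note.parent.mkdir(parents=True, exist_ok=True)

# Write the file, then read it back through the same path
note.write_text("persisted outside the workspace lifecycle\n")
print(f"Wrote {note}: {note.read_text().strip()}")

# Clean up the demo file and its folder
note.unlink()
note.parent.rmdir()
```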

### 3. Local Cluster Storage 

Anyscale provides each node with its own volume and disk and doesn’t share them with other nodes. This storage option enables higher access speed, lower latency, and scalability.

**Local cluster storage is available at:** `/mnt/cluster_storage`

- Scoped to the nodes in the specific cluster.
- Used for sharing files and pip package installations across cluster nodes during workspace development (e.g. `pip install <package> --user`).

To demonstrate where new libraries are located in the **Local Cluster Storage**, install a library that is not pre-installed by the default Anyscale Container image and print out its path. Run the following in the terminal:


In [None]:
pip install shapely --user
python3 -c "import shapely; print(shapely.__file__)"

# /mnt/cluster_storage/pypi/lib/python3.12/site-packages/shapely/__init__.py
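
The same check can be done from a notebook cell without importing the package. The sketch below looks up where a module would be loaded from and tests whether that location is inside `/mnt/cluster_storage`. The helper names `package_location` and `is_under` are illustrative assumptions.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def package_location(name: str) -> Optional[Path]:
    """Return the file a top-level module would be loaded from, or None if unavailable."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin)


def is_under(path: Path, root: str) -> bool:
    """Return True when path is located somewhere inside root."""
    return Path(root).resolve() in path.resolve().parents


location = package_location("shapely")
if location is None:
    print("shapely is not installed in this environment")
else:
    print(f"{location} | in cluster storage: {is_under(location, '/mnt/cluster_storage')}")
```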

You can also inspect the installed packages in the UI from the **Files Tab**.

<img src="https://lz-public-demo.s3.us-east-1.amazonaws.com/anyscale101/storage4.png"  width="800"/>

### 4. Local File Store

Anyscale Workspaces persist files and folders within your project directory, `/home/ray/default`, across restarts. This capability maintains project continuity and facilitates seamless transitions between workspace sessions.

For performance reasons, Anyscale limits snapshots to 10 GB per workspace. For larger files, save to Shared File Storage or the Cloud Object Store.
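
To see how close a project directory is to the snapshot limit, its total size can be estimated by summing file sizes. This is a rough sketch: the helper name `directory_size_bytes` is an illustrative assumption, the limit is assumed to mean 10 GiB, and Anyscale may measure snapshot size differently.

```python
import os


def directory_size_bytes(root: str) -> int:
    """Sum the sizes of regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            try:
                total += os.path.getsize(path)
            except OSError:
                pass  # file disappeared or is unreadable; ignore it
    return total


SNAPSHOT_LIMIT_BYTES = 10 * 1024**3  # assumption: "10 GB" interpreted as 10 GiB
project_dir = "/home/ray/default"

if os.path.isdir(project_dir):
    used = directory_size_bytes(project_dir)
    print(f"{project_dir}: {used / 1024**3:.2f} GiB used of a 10 GiB limit")
    if used > SNAPSHOT_LIMIT_BYTES:
        print("Consider moving large files to shared storage or the object store.")
else:
    print(f"{project_dir} not found; run this inside an Anyscale Workspace.")
```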

