Abstracts DataStorage and Makes Local DataStorage #56

pickles-bread-and-butter · 2024-05-09T23:09:55Z

Overview

This PR furthers the abstraction work by removing the imposed S3 usages from the repo. It does this by adding a local storage class which, at this point, only loads data from storage and cannot write. That's okay; that part is tackled in #57. The point here is to fully remove the need for S3 from the data storage and to provide a more abstract, class-level interface for coordinating the API between the different class implementations.

Adds

Abstract base dataset for other datasets to build off for one consistent API
API correction to S3Dataset so it's fetch_file_s3 -> fetch_file
Changes across the repo to change the function call to the appropriate
LocalDataSet which reads the files from a local fs or mounted system using shutil
Test for the above dataset

Notes

This is backwards compatible because the user doesn't interact with this function directly; they use the iterator built into the dataset
That iterator is what calls this function, so it's safe to rename it without exposing the user to the function name change

Testing

CI
Upgrade the version in your ML/ETL workflows and test to make sure changes are backwards compatible

pickles-bread-and-butter · 2024-05-09T23:41:23Z

Only read so I can see CI

convexquad · 2024-05-20T17:21:57Z

wicker/core/storage.py

@@ -27,7 +28,61 @@
 logger = logging.getLogger(__name__)


-class S3DataStorage:
+class AbstractDataStorage:


@isaak-willett do you want to declare AbstractDataStorage as an abstract base class (i.e. https://www.geeksforgeeks.org/abstract-classes-in-python/) and declare fetch_file as @abstractmethod?

Yeah, that totally works. I've never really been clear on why the ABC is there, tbh. Do you know if it actually implements something or if it's just convention?

I think it actually does throw an error if you attempt to instantiate an instance of the class! I could be wrong though.
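For reference, a minimal sketch of what `abc` does here, assuming `AbstractDataStorage` only needs to declare `fetch_file`. The signature mirrors the `fetch_file_s3` line in the diff further down; `InMemoryDataStorage` and `can_instantiate` are made-up names for illustration and are not part of this PR. Python does enforce the declaration: instantiating a class that still has unimplemented abstract methods raises `TypeError`.

```python
from abc import ABC, abstractmethod


class AbstractDataStorage(ABC):
    """Common interface that every storage backend is expected to implement."""

    @abstractmethod
    def fetch_file(self, input_path: str, local_prefix: str, timeout_seconds: int = 120) -> str:
        """Make input_path available locally and return the local path."""


class InMemoryDataStorage(AbstractDataStorage):
    """Hypothetical subclass, only here to show that a concrete override can be instantiated."""

    def fetch_file(self, input_path: str, local_prefix: str, timeout_seconds: int = 120) -> str:
        # Placeholder logic: no I/O, just builds the path that would be returned.
        return f"{local_prefix}/{input_path}"


def can_instantiate(cls) -> bool:
    """Return True if cls() succeeds, False if abc blocks it with TypeError."""
    try:
        cls()
    except TypeError:
        return False
    return True
```

So the base class is more than convention: `can_instantiate(AbstractDataStorage)` is false, while a subclass that overrides every abstract method behaves like a normal class.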

convexquad

👍 Thanks, Isaak! Be sure to read the comment that I added.

aalavian

Main questions on:

the need to copy on local file system
testing plan for the core module changes

aalavian · 2024-05-22T06:48:11Z

wicker/core/storage.py

+    )
+    def download_with_retries(self, input_path: str, local_path: str):
+        try:
+            shutil.copyfile(input_path, local_path)


Hmm, why do we need to copy within the local file system?

On GCloud, files are automatically cached when pulled locally, so you can read from /mnt/ and it won't redownload them to the system. The same is true on AWS with the Amazon File Cache CSI Driver, https://docs.aws.amazon.com/eks/latest/userguide/file-cache-csi.html. On other systems, though, this may not hold, for example on an internal NAS. In that case the files need to be copied from the mount to the local instance for later access, as Wicker already does with S3.

Also, the current Wicker pattern is to download the data locally, and this replicates that behavior one-to-one for mounted drives. What I'm going to do in a follow-up PR is add an option to turn this off and read directly from the mounted drive. I just haven't gotten to that piece yet. I'll add it in a PR after #57.

Just an iterative step until we get to the end state you're describing.

@aalavian it's true, for FSx-style systems (including EFS) the remote filesystem appears as a local volume and for users it is not obvious that they are paying a network latency penalty for every access to these files.

In another one of our internal ML frameworks in Japan (that begins "In"), the data access pattern is actually to leave the files in FSx for every epoch of training, so that the network penalty is always paid. For most of the models, this might actually be fine as the data is often accessible by the time the GPU is ready. However, it would be nice to have the option to copy the data from the locally-mounted (remote) volume to a fast local SSD.

For the FUSE filesystems for bucket storage from GCP and AWS, they actually have internal caching options so that this behavior is effectively done by these FUSE drivers, so in that case we will not call this function.

Thanks for the clarification, folks. I think LocalDataStorage, and in particular the docstrings in this class that refer to the local file system, would raise the same question for others.

How about we rename this class and its docstring to be more generic than "local", maybe something like FileSystemDataStorage, and create a GitHub issue to add the option to copy or not so we don't forget it?

Good ask, changed the name. I'll make an issue and do it after #57 is in
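To make the follow-up issue concrete, here is a rough sketch of how a copy/no-copy option could look. Everything in it is an assumption for discussion: the class name `FileSystemStorageSketch`, the `copy_to_local` flag, and the way the local path is derived are all hypothetical and do not describe the code in `wicker/core/storage.py`.

```python
import os
import shutil


class FileSystemStorageSketch:
    """Illustrative only; not the class from wicker/core/storage.py."""

    def __init__(self, copy_to_local: bool = True) -> None:
        # Hypothetical flag: True keeps the current copy behavior,
        # False reads straight from the mounted path.
        self.copy_to_local = copy_to_local

    def fetch_file(self, input_path: str, local_prefix: str) -> str:
        """Return a path the caller can read input_path from."""
        if not self.copy_to_local:
            # Mounts that cache on their own (e.g. the FUSE drivers mentioned
            # above) can be read in place, so no copy is made.
            return input_path

        # Mirror the source path underneath local_prefix.
        local_path = os.path.join(local_prefix, input_path.lstrip(os.sep))
        if not os.path.isfile(local_path):
            os.makedirs(os.path.dirname(local_path), exist_ok=True)
            shutil.copyfile(input_path, local_path)
        return local_path
```

With `copy_to_local=False` the call is a no-op that returns the mounted path, which fits the caching FUSE case; the default keeps the copy so that slower mounts such as a NAS are only read once.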

aalavian · 2024-05-22T06:48:22Z

wicker/core/storage.py

+
+
+class LocalDataStorage(AbstractDataStorage):
+    """Storage routines for reading and writing objects in local fs"""


nit: replace FS with file system

done, replaced everywhere

convexquad · 2024-05-22T22:51:24Z

@isaak-willett when you can, share how you tested (or plan to test) these changes with @aalavian in a Google doc or in our private Slack (since it probably has details about our internal models).

aalavian

Thanks, @isaak-willett. Please address the 2 reviews before merging.

aalavian · 2024-05-23T17:24:36Z

wicker/core/storage.py

@@ -100,7 +157,7 @@ def download_with_retries(self, bucket: str, key: str, local_path: str):
            logging.error(f"Failed to download s3 object in bucket: {bucket}, key: {key}")
            raise e

-    def fetch_file_s3(self, input_path: str, local_prefix: str, timeout_seconds: int = 120) -> str:


I think fetch_file_s3 is used elsewhere in other repos. To keep backward compatibility, let's keep fetch_file_s3 and have it call fetch_file.

Done. It would be nice to add a deprecation decorator to this in the future
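A possible shape for that future deprecation, sketched with the standard `warnings` module rather than a decorator. The class name `S3DataStorageSketch` and the placeholder body of `fetch_file` are assumptions for illustration; only the two method names and the parameter list come from this PR.

```python
import warnings


class S3DataStorageSketch:
    """Illustrative stand-in for the storage class; the real download logic is omitted."""

    def fetch_file(self, input_path: str, local_prefix: str, timeout_seconds: int = 120) -> str:
        # Placeholder body: the real method downloads the object and returns its local path.
        return f"{local_prefix}/{input_path}"

    def fetch_file_s3(self, input_path: str, local_prefix: str, timeout_seconds: int = 120) -> str:
        """Deprecated alias kept so existing callers in other repos keep working."""
        warnings.warn(
            "fetch_file_s3 is deprecated; use fetch_file instead",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller, not to this wrapper
        )
        return self.fetch_file(input_path, local_prefix, timeout_seconds=timeout_seconds)
```

Existing callers get the same return value as before, and the warning gives them a pointer to the new name before the alias is eventually removed.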

Isaak Willett added 12 commits May 7, 2024 16:58

abstract the path factory

227ee07

remove print

90c256f

merge

323ee13

update doc strings

102405d

more doc strings

c17b229

lint

fe5034d

lints

c944461

fix typing

9646656

fix

923352e

add more docs

0cb935b

update path for linting

91ae474

fix ci

721e559

pickles-bread-and-butter marked this pull request as ready for review May 9, 2024 23:41

pickles-bread-and-butter requested review from aalavian, anantsimran and chrisochoatri as code owners May 9, 2024 23:41

pickles-bread-and-butter removed request for anantsimran, aalavian and chrisochoatri May 9, 2024 23:41

pickles-bread-and-butter changed the base branch from feature/isaak/local_fs to main May 9, 2024 23:41

pickles-bread-and-butter changed the base branch from main to feature/isaak/local_fs May 10, 2024 03:37

pickles-bread-and-butter mentioned this pull request May 14, 2024

Abstractify the Path Factory #55

Merged

pickles-bread-and-butter changed the title ~~add abstract storage usage~~ Abstracts DataStorage and Makes Local DataStorage May 20, 2024

convexquad reviewed May 20, 2024

Isaak Willett added 5 commits May 20, 2024 14:50

changes

d1b0203

fix

eb0f404

lint

a8627f4

add abstract storage usage

dc42be8

use abstract class

ebc071d

convexquad approved these changes May 21, 2024

change

7b18832

aalavian requested changes May 22, 2024

pickles-bread-and-butter and others added 13 commits May 22, 2024 07:57

Update wicker/core/storage.py

4ad0367

Co-authored-by: Marc Carré <marccarre@users.noreply.github.com>

Update wicker/core/storage.py

1c107db

Co-authored-by: Marc Carré <marccarre@users.noreply.github.com>

Update wicker/core/storage.py

6e881f2

Co-authored-by: Marc Carré <marccarre@users.noreply.github.com>

Update wicker/core/storage.py

93de8ce

Co-authored-by: Marc Carré <marccarre@users.noreply.github.com>

Update wicker/core/storage.py

46775a4

Co-authored-by: Marc Carré <marccarre@users.noreply.github.com>

Update wicker/core/storage.py

e3bffaf

Co-authored-by: Marc Carré <marccarre@users.noreply.github.com>

Update wicker/core/storage.py

07286be

Co-authored-by: Marc Carré <marccarre@users.noreply.github.com>

Update wicker/core/storage.py

244fd03

Co-authored-by: Marc Carré <marccarre@users.noreply.github.com>

Update wicker/core/storage.py

36b59b6

Co-authored-by: Marc Carré <marccarre@users.noreply.github.com>

Update wicker/core/storage.py

adb019f

Co-authored-by: Marc Carré <marccarre@users.noreply.github.com>

doc string

2f06043

Merge branch 'main' into feature/isaak/local_fs

b4b84e6

Merge branch 'feature/isaak/local_fs' into feature/isaak/local_system…

68645e4

…_reader

pickles-bread-and-butter requested a review from marccarre as a code owner May 22, 2024 19:57

requested changes

ebc2f55

pickles-bread-and-butter requested a review from aalavian May 22, 2024 20:03

Isaak Willett added 2 commits May 22, 2024 16:10

rename

7db4981

Revert "rename"

62109c2

This reverts commit 7db4981.

Base automatically changed from feature/isaak/local_fs to main May 22, 2024 23:26

Isaak Willett added 2 commits May 22, 2024 16:29

merge

adf1518

bump version

e76e6f1

aalavian approved these changes May 23, 2024

Isaak Willett added 2 commits May 23, 2024 10:35

change name to FileSystemStorage

49f374c

add back api access

fcf8206

pickles-bread-and-butter merged commit 0a71e72 into main May 23, 2024
2 checks passed

pickles-bread-and-butter deleted the feature/isaak/local_system_reader branch May 23, 2024 17:56
