Abstracts DataStorage and Makes Local DataStorage #56

Merged
merged 44 commits into from
May 23, 2024

Conversation

pickles-bread-and-butter
Collaborator

@pickles-bread-and-butter pickles-bread-and-butter commented May 9, 2024

Overview

This PR furthers the abstraction work by removing the repo's imposed S3 usage. It adds a local storage class that, at this point, only loads data from storage and cannot write. That's okay; that part is tackled in #57. The point here is to fully remove the need for S3 from the data storage and to provide a more abstract, class-level interface for coordinating the API between the different class implementations.

Adds

  • Abstract base dataset for other datasets to build on, giving one consistent API
  • API correction to S3Dataset: fetch_file_s3 -> fetch_file
  • Changes across the repo to update call sites to the new function name
  • LocalDataSet, which reads files from a local file system or a mounted file system using shutil
  • Test for the above dataset

Notes

  • This is backwards compatible because the user doesn't interact with this function directly; they use the iterator built into the dataset.
  • That iterator is what calls this function, so it's safe to rename it without exposing the user to the name change.

Testing

  • CI
  • Upgrade the version in your ML/ETL workflows and test to make sure changes are backwards compatible

@pickles-bread-and-butter
Collaborator Author

Only read so I can see CI

@pickles-bread-and-butter pickles-bread-and-butter changed the base branch from feature/isaak/local_fs to main May 9, 2024 23:41
@pickles-bread-and-butter pickles-bread-and-butter changed the base branch from main to feature/isaak/local_fs May 10, 2024 03:37
@pickles-bread-and-butter pickles-bread-and-butter changed the title add abstract storage usage Abstracts DataStorage and Makes Local DataStorage May 20, 2024
@@ -27,7 +28,61 @@
logger = logging.getLogger(__name__)


-class S3DataStorage:
+class AbstractDataStorage:
Collaborator

@isaak-willett do you want to declare AbstractDataStorage as an abstract base class (i.e. https://www.geeksforgeeks.org/abstract-classes-in-python/) and declare fetch_file as @abstractmethod?

Collaborator Author

Yeah, that totally works. I've never really been clear on why the ABC is there, tbh. Do you know if it actually implements something or if it's just convention?

Collaborator

I think it actually does throw an error if you attempt to instantiate an instance of the class! I could be wrong though.
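
For reference, here is a minimal sketch of the behavior being discussed. The class bodies are hypothetical stand-ins rather than the code in this PR, and the `fetch_file` signature is assumed to mirror `fetch_file_s3` from the diff. Deriving from `abc.ABC` and marking a method with `@abstractmethod` is more than convention: Python raises `TypeError` when you instantiate a class that still has unimplemented abstract methods.

```python
from abc import ABC, abstractmethod


class AbstractDataStorage(ABC):
    """Hypothetical minimal interface; the real class in the PR may differ."""

    @abstractmethod
    def fetch_file(self, input_path: str, local_prefix: str, timeout_seconds: int = 120) -> str:
        """Make input_path available locally and return the local path."""


class IncompleteStorage(AbstractDataStorage):
    """Does not implement fetch_file, so it remains abstract."""


class EchoStorage(AbstractDataStorage):
    """Toy subclass that implements the abstract method."""

    def fetch_file(self, input_path: str, local_prefix: str, timeout_seconds: int = 120) -> str:
        return input_path


def can_instantiate(cls) -> bool:
    """Return True if cls() succeeds, False if it raises TypeError."""
    try:
        cls()
        return True
    except TypeError:
        return False
```

With these definitions, `can_instantiate(AbstractDataStorage)` and `can_instantiate(IncompleteStorage)` are both false, while `can_instantiate(EchoStorage)` is true.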

Collaborator

@convexquad convexquad left a comment

👍 Thanks, Isaak! Be sure to read the comment that I added.

Collaborator

@aalavian aalavian left a comment

Main questions:

  1. the need to copy on local file system
  2. testing plan for the core module changes

    )
    def download_with_retries(self, input_path: str, local_path: str):
        try:
            shutil.copyfile(input_path, local_path)
Collaborator

Hmm, why do we need to copy on a local file system?

Collaborator Author

On GCloud, files are automatically cached when pulled locally, so you can read from /mnt/ and it won't redownload to the system. The same is true on AWS with the Amazon File Cache CSI Driver, https://docs.aws.amazon.com/eks/latest/userguide/file-cache-csi.html. On another system, though, this may not hold, for example on an internal NAS. In that case the files need to be moved from the mount to the local instance for later access, as Wicker does with S3.

Also, the current Wicker pattern is to download the data locally, and this replicates that behavior 1-1 for mounted drives. In a follow-up PR I'm going to add an option to turn this off and read directly from the mounted drive. I just haven't gotten to that piece yet. I'll add it in a PR after #57.

This is just an iterative step until we get to the end state you're describing.
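
To make the trade-off concrete, here is a rough sketch of what a copy-or-read-directly option could look like. It is not the code in this PR: `fetch_from_mount` and its `copy_locally` flag are hypothetical names, and the path handling assumes POSIX-style absolute paths.

```python
import os
import shutil


def fetch_from_mount(input_path: str, local_prefix: str, copy_locally: bool = True) -> str:
    """Return a path from which the file at input_path can be read.

    Hypothetical helper: copy_locally is an assumed flag, not part of the PR.
    """
    if not copy_locally:
        # Read straight from the mount, e.g. when the driver already caches.
        return input_path

    # Mirror the source path under local_prefix (assumes POSIX-style paths).
    local_path = os.path.join(local_prefix, input_path.lstrip(os.sep))
    if not os.path.isfile(local_path):
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        # Copy to a temporary name, then rename, so readers never see a partial file.
        tmp_path = local_path + ".tmp"
        shutil.copyfile(input_path, tmp_path)
        os.replace(tmp_path, local_path)
    return local_path
```

A second call for the same file skips the copy, which matches the download-once pattern described above. With `copy_locally=False` the function returns the mounted path unchanged.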

Collaborator

@aalavian it's true: for FSx-style systems (including EFS), the remote filesystem appears as a local volume, and it is not obvious to users that they are paying a network latency penalty for every access to these files.

In another one of our internal ML frameworks in Japan (that begins "In"), the data access pattern is actually to leave the files in FSx for every epoch of training, so that the network penalty is always paid. For most of the models, this might actually be fine as the data is often accessible by the time the GPU is ready. However, it would be nice to have the option to copy the data from the locally-mounted (remote) volume to a fast local SSD.

The FUSE filesystems for bucket storage from GCP and AWS have internal caching options, so the FUSE drivers effectively do this already; in that case we will not call this function.

Collaborator

Thanks for the clarification, folks. I think LocalDataStorage, and in particular the docstrings in this class that refer to the local file system, would raise the same question for others.

How about renaming this class and its docstring to something more generic than "local", maybe FileSystemDataStorage, and creating a GitHub issue to add the option to copy or not, so we don't forget it?

Collaborator Author

Good ask, changed the name. I'll make an issue and do it after #57 is in



class LocalDataStorage(AbstractDataStorage):
    """Storage routines for reading and writing objects in local fs"""
Collaborator

nit: replace FS with file system

Collaborator Author

done, replaced everywhere

Co-authored-by: Marc Carré <marccarre@users.noreply.github.com>
@convexquad
Collaborator

@isaak-willett when you can, share how you tested (or plan to test) these changes with @aalavian in a Google doc or in our private Slack (since it probably has details about our internal models).

Isaak Willett added 2 commits May 22, 2024 16:10
This reverts commit 7db4981.
Base automatically changed from feature/isaak/local_fs to main May 22, 2024 23:26
Isaak Willett added 2 commits May 22, 2024 16:29
Collaborator

@aalavian aalavian left a comment

Thanks, @isaak-willett. Please address the 2 review comments before merging.

@@ -100,7 +157,7 @@ def download_with_retries(self, bucket: str, key: str, local_path: str):
            logging.error(f"Failed to download s3 object in bucket: {bucket}, key: {key}")
            raise e

    def fetch_file_s3(self, input_path: str, local_prefix: str, timeout_seconds: int = 120) -> str:
Collaborator

I think fetch_file_s3 is used in other repos. To keep backward compatibility, let's keep fetch_file_s3 and have it call fetch_file.

Collaborator Author

Done. It would be nice to add a deprecation decorator to this in the future
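
As a sketch of how the alias plus a future deprecation notice might look (not the code merged here): the class below is a toy stand-in with a placeholder `fetch_file` body, and the warning text is an assumption. It uses only the standard `warnings` module.

```python
import warnings


class S3DataStorage:
    """Toy stand-in for the real class; only the renaming pattern matters here."""

    def fetch_file(self, input_path: str, local_prefix: str, timeout_seconds: int = 120) -> str:
        # Placeholder body (assumption): the real method downloads from S3.
        return f"{local_prefix}/{input_path}"

    def fetch_file_s3(self, input_path: str, local_prefix: str, timeout_seconds: int = 120) -> str:
        """Deprecated alias kept for backward compatibility."""
        warnings.warn(
            "fetch_file_s3 is deprecated; use fetch_file instead",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller, not this method
        )
        return self.fetch_file(input_path, local_prefix, timeout_seconds=timeout_seconds)
```

Existing callers of `fetch_file_s3` keep working and get the same result as `fetch_file`, while a `DeprecationWarning` points them at the new name.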

@pickles-bread-and-butter pickles-bread-and-butter merged commit 0a71e72 into main May 23, 2024
2 checks passed
@pickles-bread-and-butter pickles-bread-and-butter deleted the feature/isaak/local_system_reader branch May 23, 2024 17:56