Abstracts DataStorage and Makes Local DataStorage #56
Conversation
Only read so I can see CI
wicker/core/storage.py
Outdated
@@ -27,7 +28,61 @@
logger = logging.getLogger(__name__)

-class S3DataStorage:
+class AbstractDataStorage:
@isaak-willett do you want to declare AbstractDataStorage as an abstract base class (see https://www.geeksforgeeks.org/abstract-classes-in-python/) and declare fetch_file as @abstractmethod?
Yeah, that totally works. To be honest, I've never really been clear on why the ABC is there. Do you know if it actually implements something, or if it's just convention?
I think it actually does throw an error if you attempt to instantiate an instance of the class! I could be wrong though.
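For reference, a minimal sketch of the suggestion (class and method names match the PR diff; the body of fetch_file in the subclass is illustrative only): subclassing abc.ABC and marking fetch_file with @abstractmethod does in fact make Python raise a TypeError when the abstract class is instantiated directly.

```python
from abc import ABC, abstractmethod


class AbstractDataStorage(ABC):
    """Abstract interface for data storage backends."""

    @abstractmethod
    def fetch_file(self, input_path: str, local_prefix: str) -> str:
        """Fetch a file from storage into local_prefix and return the local path."""


class LocalDataStorage(AbstractDataStorage):
    """Concrete subclass: instantiable only because fetch_file is overridden."""

    def fetch_file(self, input_path: str, local_prefix: str) -> str:
        return local_prefix  # illustrative body only


try:
    AbstractDataStorage()
except TypeError as e:
    # TypeError: Can't instantiate abstract class AbstractDataStorage ...
    print(e)

LocalDataStorage()  # fine: the abstract method is overridden
```

So the ABC machinery is more than convention: it enforces the interface at instantiation time rather than at call time.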
👍 Thanks, Isaak! Be sure to read the comment that I added.
Main questions on:
- the need to copy on the local file system
- the testing plan for the core module changes
)
def download_with_retries(self, input_path: str, local_path: str):
    try:
        shutil.copyfile(input_path, local_path)
Hmm, why do we need to copy to the local file system?
On GCloud, files are automatically cached when pulled locally, so you can read from /mnt/ and it won't re-download to the system. The same is true on AWS with the Amazon File Cache CSI Driver (https://docs.aws.amazon.com/eks/latest/userguide/file-cache-csi.html). On other systems, though, this may not hold, e.g. an internal NAS. In that case, files need to be moved from that mount to the local instance for later access, as Wicker does with S3.
Also, the current Wicker pattern is to download the data locally, and this replicates that behavior 1:1 for mounted drives. In a follow-up PR I'll add an option to turn this off and read directly from the mounted drive; I just haven't gotten to that piece yet. I'll add it in a PR after #57.
Just an iterative step until we get to the end state you're describing.
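A minimal sketch of the pattern being discussed, under stated assumptions: the copy_locally flag is the hypothetical follow-up option mentioned above, not something in this PR, and the function name follows the diff. By default it replicates Wicker's S3 behavior (copy to local storage); opting out returns the mounted path directly, trusting the mount's own caching.

```python
import os
import shutil


def fetch_file(input_path: str, local_prefix: str, copy_locally: bool = True) -> str:
    """Fetch a file that lives on a mounted (possibly remote) file system.

    copy_locally is a hypothetical future flag: when True, copy the file
    to local storage (mirrors the S3 download pattern); when False, read
    in place and rely on the mount's caching (e.g. FUSE drivers).
    """
    if not copy_locally:
        return input_path  # read directly from the mounted drive
    local_path = os.path.join(local_prefix, os.path.basename(input_path))
    os.makedirs(local_prefix, exist_ok=True)
    shutil.copyfile(input_path, local_path)
    return local_path
```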
@aalavian it's true: for FSx-style systems (including EFS), the remote filesystem appears as a local volume, and it is not obvious to users that they are paying a network-latency penalty for every access to these files.
In another one of our internal ML frameworks in Japan (that begins "In"), the data access pattern is actually to leave the files in FSx for every epoch of training, so that the network penalty is always paid. For most of the models, this might actually be fine as the data is often accessible by the time the GPU is ready. However, it would be nice to have the option to copy the data from the locally-mounted (remote) volume to a fast local SSD.
The FUSE filesystems for bucket storage from GCP and AWS actually have internal caching options, so this behavior is effectively handled by the FUSE drivers; in that case we will not call this function.
Thanks for the clarification, folks. I think LocalDataStorage, and in particular the doc-strings in this class referring to the local file system, would raise the same question for others.
How about renaming this class (and its doc strings) to something more generic than local, maybe FileSystemDataStorage, and creating a GitHub issue to add the copy-or-not option so we don't forget it?
Good ask, changed the name. I'll make an issue and do it after #57 is in.
wicker/core/storage.py
Outdated

class LocalDataStorage(AbstractDataStorage):
    """Storage routines for reading and writing objects in local fs"""
nit: replace FS with file system
done, replaced everywhere
Co-authored-by: Marc Carré <marccarre@users.noreply.github.com>
@isaak-willett when you can, share how you tested (or plan to test) these changes with @aalavian in a Google doc or in our private Slack (since it probably has details about our internal models).
This reverts commit 7db4981.
Thanks @isaak-willett, please address the two reviews before merging.
@@ -100,7 +157,7 @@ def download_with_retries(self, bucket: str, key: str, local_path: str):
        logging.error(f"Failed to download s3 object in bucket: {bucket}, key: {key}")
        raise e

-    def fetch_file_s3(self, input_path: str, local_prefix: str, timeout_seconds: int = 120) -> str:
I think fetch_file_s3 is used elsewhere in other repos. To keep backward compatibility, let's keep fetch_file_s3 and point it to call fetch_file.
Done. It would be nice to add a deprecation decorator to this in the future.
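A minimal sketch of the backward-compatible alias agreed on above. The DeprecationWarning is the suggested future addition, not part of this PR; the class here is a hypothetical stand-in and the fetch_file body is illustrative, with the signature taken from the diff.

```python
import warnings


class DataStorage:
    """Hypothetical stand-in for the storage class in wicker/core/storage.py."""

    def fetch_file(self, input_path: str, local_prefix: str, timeout_seconds: int = 120) -> str:
        """New generic entry point (body illustrative)."""
        return input_path

    def fetch_file_s3(self, input_path: str, local_prefix: str, timeout_seconds: int = 120) -> str:
        """Deprecated alias kept for callers in other repos; delegates to fetch_file."""
        warnings.warn(
            "fetch_file_s3 is deprecated; use fetch_file instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.fetch_file(input_path, local_prefix, timeout_seconds)
```

Existing callers keep working unchanged, while new code can move to fetch_file and the alias can be dropped after a deprecation period.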
Overview
This furthers the abstraction work in this PR by removing imposed S3 usages from the repo. It does so by adding a local storage class which, at this point, only loads data from storage and cannot write; that's okay, since the write side is tackled in #57. The point here is to fully remove the need for S3 from the data storage and provide a more abstract class-level interface for coordinating the API between the different class implementations.
Adds
- fetch_file_s3 -> fetch_file
Notes
Testing