
Add FileSystemDataset & Remove S3 ColumnBytes and Cache Assumptions #57

Open · wants to merge 74 commits into main

Conversation

@pickles-bread-and-butter (Collaborator) commented May 10, 2024

Additions

1. A new class, FileSystemDataset, that reads from a file system via the WickerPathFactory and the FileSystemDataStorage. Users get the same functional API through this new class that S3Dataset already provides (see the usage sketch after this list).
2. Adds functions to FileSystemDataStorage that write to a mounted file drive in the correct locations via the WickerPathFactory.
3. Decouples the ColumnBytesFileCache from S3 by renaming the relevant function calls and consuming the generic outputs of the abstract storage class.
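
For illustration, here is a minimal usage sketch of the parallel read APIs described above. The module path and constructor arguments are assumptions inferred from the snippets quoted in this PR, not the final merged interface.

```python
# Hypothetical usage sketch; module path and constructor arguments are
# assumptions based on this PR's description, not the merged API.
from wicker.core.datasets import FileSystemDataset, S3Dataset

# Read from a locally mounted (or FUSE) filesystem.
local_ds = FileSystemDataset(
    "my_dataset",          # dataset name (hypothetical)
    "0.0.1",               # dataset version
    "train",               # partition
    columns_to_load=["image", "label"],
)

# The same functional API that the existing S3-backed dataset exposes.
s3_ds = S3Dataset("my_dataset", "0.0.1", "train", columns_to_load=["image", "label"])

print(len(local_ds))       # number of examples in the partition
example = local_ds[0]      # row keyed by column name
```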

Purpose

A FUSE filesystem or a locally mounted filesystem can now be used as a target for both reading and writing. This enables running ETL with FSx, GCloud, mounted S3, etc., as well as running training, evaluation, and other workflows the same way (those are all still ETLs, though 😸).

Testing

Testing is still to come: I'm going to exercise this with a large model on GKE, EKS, and SageMaker.

@convexquad (Collaborator):

@pickles-bread-and-butter in storage.py could you also adjust the indenting in the comments that describe the possible bucket layouts?

@pickles-bread-and-butter (Collaborator, Author):


@convexquad done, take a look and let me know if it all looks good!

@pickles-bread-and-butter pickles-bread-and-butter changed the title Add FileSystemDataset & Remove S3 ColumnBytes and Cache Assumptions [DRAFT] Add FileSystemDataset & Remove S3 ColumnBytes and Cache Assumptions Jul 3, 2024
@pickles-bread-and-butter pickles-bread-and-butter changed the title [DRAFT] Add FileSystemDataset & Remove S3 ColumnBytes and Cache Assumptions Add FileSystemDataset & Remove S3 ColumnBytes and Cache Assumptions Jul 3, 2024

```python
            logging.error(f"Failed to download/move object for file path: {input_path}")
            raise
        return local_dst_path
```

Collaborator:

@aalavian there will be a little bit of code duplication here with the function fetch_file in the S3DataStorage class. I think it is ok as the actual calls inside the two functions differ a bit (one calls shutil.copyfile and the other downloads from S3) and it is somewhat awkward to resolve in a common function.

Another major consideration is that the fetch_file function in S3DataStorage is critical to Wicker performance in downloading from S3. If it's ok, I'd like to leave that function completely alone (rather than refactoring it to remove the code that is duplicated above) in order to eliminate any chance of accidentally damaging how the function operates.
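
To make the duplication concrete, here is an illustrative sketch (not the actual Wicker code) of the two fetch paths: the surrounding logic is similar, but the copy step itself differs per backend. The function names and signatures below are hypothetical.

```python
# Illustrative sketch only; function names and signatures are hypothetical,
# not the real fetch_file implementations.
import os
import shutil

import boto3


def _fetch_via_copy(input_path: str, local_dst_path: str) -> str:
    """FileSystemDataStorage-style fetch: a plain filesystem copy."""
    os.makedirs(os.path.dirname(local_dst_path), exist_ok=True)
    shutil.copyfile(input_path, local_dst_path)
    return local_dst_path


def _fetch_via_s3(bucket: str, key: str, local_dst_path: str) -> str:
    """S3DataStorage-style fetch: a boto3 download instead of a copy."""
    os.makedirs(os.path.dirname(local_dst_path), exist_ok=True)
    boto3.client("s3").download_file(bucket, key, local_dst_path)
    return local_dst_path
```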

@convexquad (Collaborator) left a comment:

👍 Thanks, @pickles-bread-and-butter ! @aalavian we set a backwards-compatibility s3_path_factory variable on the ColumnBytesFileWriter class in order to ensure that any code using this class (that assumes this variable is present) will still find it.
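
For readers following along, here is a sketch of the kind of backwards-compatibility alias being described; the constructor and attribute layout are assumptions for illustration, not the exact ColumnBytesFileWriter code.

```python
# Hypothetical sketch of the backwards-compatibility alias.
class ColumnBytesFileWriter:
    def __init__(self, path_factory):
        self.path_factory = path_factory
        # Keep the old attribute name so existing callers that reach for
        # `writer.s3_path_factory` still find it.
        self.s3_path_factory = path_factory
```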

@pickles-bread-and-butter pickles-bread-and-butter requested review from aalavian and removed request for aalavian August 16, 2024 21:09
@pickles-bread-and-butter pickles-bread-and-butter removed the request for review from aalavian September 9, 2024 16:56

```python
        pa_filesystem: pafs.FileSystem,
        storage: AbstractDataStorage,
        columns_to_load: Optional[List[str]] = None,
        filelock_timeout_seconds: int = FILE_LOCK_TIMEOUT_SECONDS,
```
Collaborator:

This is not very abstract; it is more of a concrete cache implementation detail. Could we make a CacheManager interface and inject it into the dataset, just as with the PathFactory, etc.?
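
A rough sketch of what such an injected cache interface could look like; the class name and method below are hypothetical, not existing Wicker code.

```python
# Hypothetical CacheManager interface; names and methods are illustrative only.
from abc import ABC, abstractmethod


class AbstractCacheManager(ABC):
    """Injected into the dataset alongside the path factory and storage."""

    @abstractmethod
    def resolve(self, remote_path: str) -> str:
        """Return a local path for remote_path, fetching into the cache if needed."""


# The dataset would then take the cache as a dependency instead of raw knobs
# such as filelock_timeout_seconds, e.g.:
#   FileSystemDataset(..., storage=storage, cache=my_cache_manager)
```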

@pickles-bread-and-butter (Collaborator, Author):

I agree in principle, but it's unfortunately out of scope for this PR. It would be nice to abstract this further and unify the interface more cleanly. For now I don't want to take on that work, although the cache improvements in this PR get you most of the way there with their abstraction of the reading path. If it's okay with you, I'll open an issue and you can request that I or @convexquad take it on in the future.

Collaborator:

@zhenyu I totally agree with you! In the original code at https://github.com/woven-planet/wicker/blob/main/wicker/core/datasets.py#L141, the constructor of the original S3Dataset class exposes too many implementation details that should be wrapped up in another class.

However, in this PR we are not trying to fix the design. If it is ok with you, we will only try to enable local storage correctly (without breaking S3 storage) and will not refactor the design yet. I think once we can show with these changes that data loading from local disk and from S3 both work correctly, that will be the perfect time to refactor the class design here.

```python
    return column_bytes_file_class


class FileSystemDataset(AbstractDataset):
```
@zhenyu (Collaborator) commented Sep 26, 2024:

Is it possible for the same dataset to exist in both the file system and S3? My 2 cents: there should not be both an S3Dataset and a FileSystemDataset. The orthogonal design is S3DataStorage and FileDataStorage. The original design did not get the abstraction right, but since you have such a good idea for refactoring it, let's do it right this time.

@pickles-bread-and-butter (Collaborator, Author):

Possible, yes; for this PR, probably not. As I mentioned above, a larger refactor is out of scope for this PR, and maybe forever. I totally agree that a user should just define a Dataset object, but someone borked the original design and that never happened. The abstraction I did a while ago was meant as a halfway measure. I like what you're suggesting, but I think it belongs in a future version of the dataset library.

I'm happy to give you context on why this happened with Wicker, but right now it just needs to work, given its myriad other problems. The best path is probably to stop supporting Wicker and redesign it, with what we've learned, around a similar but improved API. If you're okay with that, my team or yours can take on the redesign, and I'm sure we can collaborate on and support it. Both @convexquad and I think this is the correct path and have discussed it.

@pickles-bread-and-butter (Collaborator, Author):

I can also tell from the other PR that was opened what you're getting at, but I'll note that these are meant to be parallel, not orthogonal. They share the same interfaces directly; they are just parallel implementations.

Collaborator:

A factory method that returns an AbstractDataset according to the storage backend is probably better. Otherwise the AbstractDataset becomes an implementation detail rather than an interface.
The ideal user experience: getDataset(name, version, storage: enum (S3, LocalFile)) -> AbstractDataset (probably no need to change the original AbstractDataset).
The implementation of getDataset would be to make a BaseDataset that inherits from AbstractDataset and encapsulates the shared logic as an implementation base rather than an interface, then:
1. Get the appropriate DataStorage backend, cache, etc.
2. Inject the DataStorage backend, cache, etc. into the BaseDataset.
3. Return it as the abstract dataset.
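
A rough sketch of the factory approach described above; every name here (StorageBackend, get_dataset, BaseDataset, the stand-in storage classes) is hypothetical, not an existing Wicker class.

```python
# Hypothetical sketch of the proposed factory; none of these names exist in
# Wicker today.
from enum import Enum


class StorageBackend(Enum):
    S3 = "s3"
    LOCAL_FILE = "local_file"


class _S3Storage:          # stand-in for the S3 data-storage backend
    pass


class _LocalFileStorage:   # stand-in for the file-system data-storage backend
    pass


class BaseDataset:
    """Would inherit from AbstractDataset and hold the shared read logic;
    only the injected storage (plus cache, path factory, etc.) differs."""

    def __init__(self, name: str, version: str, storage):
        self.name, self.version, self.storage = name, version, storage


def get_dataset(name: str, version: str, backend: StorageBackend) -> BaseDataset:
    """Factory: pick the backend-specific pieces and inject them into one
    shared dataset implementation, returned behind the abstract interface."""
    storage = _S3Storage() if backend is StorageBackend.S3 else _LocalFileStorage()
    return BaseDataset(name, version, storage=storage)
```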

Collaborator:

@zhenyu again I think you are right!!! Good job on the review.

Although I am a co-creator of this project, my team is not actually its owner. So we tried very hard to preserve the original S3 classes like S3Dataset, S3DataStorage, etc. (except that we had to add the WickerPathFactory base class) and to keep the S3 developer experience the same, even if it is not ideal.

If it is ok, we are going to try using local storage from this PR, see whether it is useful, and probably learn some new things. Then I think we can refactor the developer experience in the future.

@convexquad (Collaborator):

@zhenyu thank you for your comments! If it is ok with you, since in this PR we are trying just to add local storage (and not to change the S3 developer interface or the class design), I'm going to merge it. However, I think you have already identified two key places where the developer interface can be greatly simplified.

@zhenyu (Collaborator) commented Oct 1, 2024:


@convexquad, thanks so much for the explanation. Simply put, we should not sacrifice code quality for timeline reasons. For trying this out, I am okay with a feature-branch release to move forward. But this PR significantly changes the code structure and adds complex tech debt to the existing solution, so as a maintainer I am not comfortable merging it to the main branch. cc @marccarre and @aalavian

@pickles-bread-and-butter (Collaborator, Author) commented Oct 1, 2024:


What you're suggesting is a complete refactor of the dataset class; that is not feasible for feature development, nor should it be undertaken. Wicker is an old and out-of-date code base at this point, and the bare minimum needs to be completed to reach objectives, but nothing more. There is nothing wrong with the current code quality compared to the rest of the repo: it fits the existing paradigm, is tested, and improves aspects of the design.

@marccarre marccarre dismissed their stale review October 1, 2024 02:36

Dismissing my review as the PR is now quite old, and @zhenyu provided feedback & suggested a rework of this PR.

@zhenyu (Collaborator) commented Oct 1, 2024:


Thanks again @pickles-bread-and-butter! By code quality we don't mean only formatting and naming, but also the class design. As you mentioned, the Wicker repo code is outdated and not well designed, which makes it hard for teams like ours to add new features; in other words, the code quality of the rest of the repo is not good. Should we add non-orthogonal classes that make the repo even harder to extend and maintain?
I would suggest we write up a design for this new feature and understand its impact before merging to main.
Again, I am OK with a feature-branch release.
