
Add FileSystemDataset & Remove S3 ColumnBytes and Cache Assumptions #57

Open · wants to merge 74 commits into main

Conversation

@pickles-bread-and-butter (Collaborator) commented May 10, 2024

Additions

1. A new class, FileSystemDataset, that reads from a file system via the WickerPathFactory and the FileSystemDataStorage. Users get the same functional API through this new class that S3Dataset already provides (see the usage sketch after this list).
2. Adds functions to FileSystemDataStorage that write to a mounted file drive in the correct locations via the WickerPathFactory.
3. Decouples the ColumnBytesFileCache from S3 by renaming the relevant function calls and consuming the generic outputs of the abstract storage class.
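
For illustration, here is a minimal usage sketch of the parallel read APIs described above. The module path and constructor arguments are assumptions inferred from the snippets quoted in this PR, not the final merged interface.

```python
# Hypothetical usage sketch; module path and constructor arguments are
# assumptions based on this PR's description, not the merged API.
from wicker.core.datasets import FileSystemDataset, S3Dataset

# Read from a locally mounted (or FUSE) filesystem.
local_ds = FileSystemDataset(
    "my_dataset",          # dataset name (hypothetical)
    "0.0.1",               # dataset version
    "train",               # partition
    columns_to_load=["image", "label"],
)

# The same functional API that the existing S3-backed dataset exposes.
s3_ds = S3Dataset("my_dataset", "0.0.1", "train", columns_to_load=["image", "label"])

print(len(local_ds))       # number of examples in the partition
example = local_ds[0]      # row keyed by column name
```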

Purpose

A FUSE filesystem or a locally mounted filesystem can now be used as a target for both reading and writing. This enables running ETL with FSx, GCloud, mounted S3, etc., as well as running training, evaluation, and other workflows the same way (those are all still ETLs, though 😸).

Testing

Testing is still to come: I'm going to exercise this with a large model on GKE, EKS, and SageMaker.

@convexquad (Collaborator):

@pickles-bread-and-butter in storage.py could you also adjust the indenting in the comments that describe the possible bucket layouts?

@pickles-bread-and-butter (Collaborator, Author):


@convexquad done, take a look and let me know if it all looks good!

@pickles-bread-and-butter pickles-bread-and-butter changed the title Add FileSystemDataset & Remove S3 ColumnBytes and Cache Assumptions [DRAFT] Add FileSystemDataset & Remove S3 ColumnBytes and Cache Assumptions Jul 3, 2024
@pickles-bread-and-butter pickles-bread-and-butter changed the title [DRAFT] Add FileSystemDataset & Remove S3 ColumnBytes and Cache Assumptions Add FileSystemDataset & Remove S3 ColumnBytes and Cache Assumptions Jul 3, 2024

```python
            logging.error(f"Failed to download/move object for file path: {input_path}")
            raise
        return local_dst_path
```

Collaborator:

@aalavian there will be a little bit of code duplication here with the function fetch_file in the S3DataStorage class. I think it is ok as the actual calls inside the two functions differ a bit (one calls shutil.copyfile and the other downloads from S3) and it is somewhat awkward to resolve in a common function.

Another major consideration is that the fetch_file function in S3DataStorage is critical to Wicker performance in downloading from S3. If it's ok, I'd like to leave that function completely alone (rather than refactoring it to remove the code that is duplicated above) in order to eliminate any chance of accidentally damaging how the function operates.
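
To make the duplication concrete, here is an illustrative sketch (not the actual Wicker code) of the two fetch paths: the surrounding logic is similar, but the copy step itself differs per backend. The function names and signatures below are hypothetical.

```python
# Illustrative sketch only; function names and signatures are hypothetical,
# not the real fetch_file implementations.
import os
import shutil

import boto3


def _fetch_via_copy(input_path: str, local_dst_path: str) -> str:
    """FileSystemDataStorage-style fetch: a plain filesystem copy."""
    os.makedirs(os.path.dirname(local_dst_path), exist_ok=True)
    shutil.copyfile(input_path, local_dst_path)
    return local_dst_path


def _fetch_via_s3(bucket: str, key: str, local_dst_path: str) -> str:
    """S3DataStorage-style fetch: a boto3 download instead of a copy."""
    os.makedirs(os.path.dirname(local_dst_path), exist_ok=True)
    boto3.client("s3").download_file(bucket, key, local_dst_path)
    return local_dst_path
```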

@convexquad (Collaborator) left a comment:

👍 Thanks, @pickles-bread-and-butter ! @aalavian we set a backwards-compatibility s3_path_factory variable on the ColumnBytesFileWriter class in order to ensure that any code using this class (that assumes this variable is present) will still find it.
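
For readers following along, here is a sketch of the kind of backwards-compatibility alias being described; the constructor and attribute layout are assumptions for illustration, not the exact ColumnBytesFileWriter code.

```python
# Hypothetical sketch of the backwards-compatibility alias.
class ColumnBytesFileWriter:
    def __init__(self, path_factory):
        self.path_factory = path_factory
        # Keep the old attribute name so existing callers that reach for
        # `writer.s3_path_factory` still find it.
        self.s3_path_factory = path_factory
```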

@pickles-bread-and-butter pickles-bread-and-butter requested review from aalavian and removed request for aalavian August 16, 2024 21:09
@pickles-bread-and-butter pickles-bread-and-butter removed the request for review from aalavian September 9, 2024 16:56

```python
        pa_filesystem: pafs.FileSystem,
        storage: AbstractDataStorage,
        columns_to_load: Optional[List[str]] = None,
        filelock_timeout_seconds: int = FILE_LOCK_TIMEOUT_SECONDS,
```
Collaborator:

This is not very abstract; it is more of a concrete cache implementation detail. Could we make a CacheManager interface and inject it into the dataset, just as with the PathFactory, etc.?
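
A rough sketch of what such an injected cache interface could look like; the class name and method below are hypothetical, not existing Wicker code.

```python
# Hypothetical CacheManager interface; names and methods are illustrative only.
from abc import ABC, abstractmethod


class AbstractCacheManager(ABC):
    """Injected into the dataset alongside the path factory and storage."""

    @abstractmethod
    def resolve(self, remote_path: str) -> str:
        """Return a local path for remote_path, fetching into the cache if needed."""


# The dataset would then take the cache as a dependency instead of raw knobs
# such as filelock_timeout_seconds, e.g.:
#   FileSystemDataset(..., storage=storage, cache=my_cache_manager)
```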

@pickles-bread-and-butter (Collaborator, Author):

I agree in principle, but it's unfortunately out of scope for this PR. It would be nice to abstract this further and unify the interface more cleanly. For now I don't want to take on that work, although the cache improvements in this PR get you most of the way there with their abstraction of the reading path. If it's okay with you, I'll open an issue and you can request that I or @convexquad take it on in the future.

Collaborator:

@zhenyu I totally agree with you! In the original code at https://github.com/woven-planet/wicker/blob/main/wicker/core/datasets.py#L141, the constructor of the original S3Dataset class exposes too many implementation details that should be wrapped up in another class.

However, in this PR we are not trying to fix the design. If it is ok with you, we will only try to enable local storage correctly (without breaking S3 storage) and will not refactor the design yet. I think once we can show with these changes that data loading from local disk and from S3 both work correctly, that will be the perfect time to refactor the class design here.

```python
    return column_bytes_file_class


class FileSystemDataset(AbstractDataset):
```
@zhenyu (Collaborator) commented Sep 26, 2024:

Is it possible for the same dataset to exist in both the file system and S3? My 2 cents: there should not be both an S3Dataset and a FileSystemDataset. The orthogonal design is S3DataStorage and FileDataStorage. The original design did not get the abstraction right, but since you have such a good idea for refactoring it, let's do it right this time.

@pickles-bread-and-butter (Collaborator, Author):

Possible, yes; for this PR, probably not. As I mentioned above, a larger refactor is out of scope for this PR, and maybe forever. I totally agree that a user should just define a Dataset object, but someone borked the original design and that never happened. The abstraction I did a while ago was meant as a halfway measure. I like what you're suggesting, but I think it belongs in a future version of the dataset library.

I'm happy to give you context on why this happened with Wicker, but right now it just needs to work, given its myriad other problems. The best path is probably to stop supporting Wicker and redesign it, with what we've learned, around a similar but improved API. If you're okay with that, my team or yours can take on the redesign, and I'm sure we can collaborate on and support it. Both @convexquad and I think this is the correct path and have discussed it.

@pickles-bread-and-butter (Collaborator, Author):

I can also tell from the other PR that was opened what you're getting at, but I'll note that these are meant to be parallel, not orthogonal. They share the same interfaces directly; they are just parallel implementations.

Collaborator:

A factory method that returns an AbstractDataset according to the storage backend is probably better. Otherwise the AbstractDataset becomes an implementation detail rather than an interface.
The ideal user experience: getDataset(name, version, storage: enum (S3, LocalFile)) -> AbstractDataset (probably no need to change the original AbstractDataset).
The implementation of getDataset would be to make a BaseDataset that inherits from AbstractDataset and encapsulates the shared logic as an implementation base rather than an interface, then:
1. Get the appropriate DataStorage backend, cache, etc.
2. Inject the DataStorage backend, cache, etc. into the BaseDataset.
3. Return it as the abstract dataset.
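
A rough sketch of the factory approach described above; every name here (StorageBackend, get_dataset, BaseDataset, the stand-in storage classes) is hypothetical, not an existing Wicker class.

```python
# Hypothetical sketch of the proposed factory; none of these names exist in
# Wicker today.
from enum import Enum


class StorageBackend(Enum):
    S3 = "s3"
    LOCAL_FILE = "local_file"


class _S3Storage:          # stand-in for the S3 data-storage backend
    pass


class _LocalFileStorage:   # stand-in for the file-system data-storage backend
    pass


class BaseDataset:
    """Would inherit from AbstractDataset and hold the shared read logic;
    only the injected storage (plus cache, path factory, etc.) differs."""

    def __init__(self, name: str, version: str, storage):
        self.name, self.version, self.storage = name, version, storage


def get_dataset(name: str, version: str, backend: StorageBackend) -> BaseDataset:
    """Factory: pick the backend-specific pieces and inject them into one
    shared dataset implementation, returned behind the abstract interface."""
    storage = _S3Storage() if backend is StorageBackend.S3 else _LocalFileStorage()
    return BaseDataset(name, version, storage=storage)
```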

Collaborator:

@zhenyu again I think you are right!!! Good job on the review.

Although I am a co-creator of this project, my team is not actually its owner. So we tried very hard to preserve the original S3 classes like S3Dataset, S3DataStorage, etc. (except that we had to add the WickerPathFactory base class) and to keep the S3 developer experience the same, even if it is not ideal.

If it is ok, we are going to try using local storage from this PR, see whether it is useful, and probably learn some new things. Then I think we can refactor the developer experience in the future.

@convexquad (Collaborator):

@zhenyu thank you for your comments! If it is ok with you, since in this PR we are trying just to add local storage (and not to change the S3 developer interface or the class design), I'm going to merge it. However, I think you have already identified two key places where the developer interface can be greatly simplified.

@zhenyu (Collaborator) commented Oct 1, 2024:


@convexquad, thanks so much for the explanation. Simply put, we should not sacrifice code quality for timeline reasons. For trying this out, I am okay with a feature-branch release to move forward. But this PR significantly changes the code structure and adds complex tech debt to the existing solution, so as a maintainer I am not comfortable merging it to the main branch. cc @marccarre and @aalavian

@pickles-bread-and-butter (Collaborator, Author) commented Oct 1, 2024:


What you're suggesting is a complete refactor of the dataset class; that is not feasible for feature development, nor should it be undertaken. Wicker is an old and out-of-date code base at this point, and the bare minimum needs to be completed to reach objectives, but nothing more. There is nothing wrong with the current code quality compared to the rest of the repo: it fits the existing paradigm, is tested, and improves aspects of the design.

@marccarre marccarre dismissed their stale review October 1, 2024 02:36

Dismissing my review as the PR is now quite old, and @zhenyu provided feedback & suggested a rework of this PR.

@zhenyu (Collaborator) commented Oct 1, 2024:


Thanks again @pickles-bread-and-butter! By code quality we don't mean only formatting and naming, but also the class design. As you mentioned, the Wicker repo code is outdated and not well designed, which makes it hard for teams like ours to add new features; in other words, the code quality of the rest of the repo is not good. Should we add non-orthogonal classes that make the repo even harder to extend and maintain?
I would suggest we write up a design for this new feature and understand its impact before merging to main.
Again, I am OK with a feature-branch release.
