
Support arbitrary partition layout for HadoopFileDataObject #846

Open
zzeekk opened this issue Jun 12, 2024 · 0 comments
Labels
enhancement New feature or request

zzeekk (Contributor) commented Jun 12, 2024

Is your feature request related to a problem? Please describe.
All HadoopFileDataObjects work with the standard Hadoop partition layout, e.g. /<col1>=x/<col2>=y. This is also the layout Spark expects.
Sometimes it is necessary to read files from locations with a different partition layout, e.g. extracting partition values from a layout like /<col1>/abc/.
Using arbitrary partition layouts is currently possible with SFtpFileRefDataObject, but not when reading files from Hadoop filesystems (local files, S3, ...).
I would like to copy files from Hadoop filesystems with arbitrary partition layouts into another Hadoop filesystem location using FileTransferAction, creating a standard Hadoop partition layout. Currently this involves a lot of custom coding.
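
For illustration, the core of the requested feature is turning a layout pattern into partition values. Below is a minimal sketch, assuming a hypothetical %col% token syntax loosely modeled on the partitionLayout option of SFtpFileRefDataObject; all names are illustrative, not SDL's actual API:

```scala
import scala.util.matching.Regex

object PartitionLayoutParser {
  // Assumed token syntax: %colName% marks a partition column in the layout.
  private val TokenPattern = "%[A-Za-z0-9_]+%"

  /** Compile a layout such as "%col1%/abc/" into a Regex whose capture
    * groups correspond to the partition columns, in order of appearance. */
  def compile(layout: String): (Regex, Seq[String]) = {
    val cols = TokenPattern.r.findAllIn(layout).map(_.stripPrefix("%").stripSuffix("%")).toSeq
    // Quote the literal fragments and replace each token with a capture
    // group matching a single path element.
    val parts = layout.split(TokenPattern, -1).map(Regex.quote)
    (parts.mkString("([^/]+)").r, cols)
  }

  /** Extract partition values from a path relative to the DataObject root. */
  def parse(layout: String)(relPath: String): Option[Map[String, String]] = {
    val (regex, cols) = compile(layout)
    regex.findPrefixMatchOf(relPath).map(m => cols.zip(m.subgroups).toMap)
  }
}

// PartitionLayoutParser.parse("%col1%/abc/")("x/abc/data.csv")
//   == Some(Map("col1" -> "x"))
```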

Describe the solution you'd like
Implement a DataObject that reads files from Hadoop filesystems and reuses the logic of SFtpFileRefDataObject to handle arbitrary partition layouts.
Note that this DataObject could not be used with the Spark execution engine (e.g. CopyAction), but only with the file execution engine (e.g. FileTransferAction).
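
As a rough sketch of what the proposed DataObject plus FileTransferAction would automate, here is the kind of custom coding needed today, using plain Hadoop FileSystem calls and the hypothetical parser from above (all helper names are assumptions):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileUtil, Path}

object CustomLayoutCopy {
  /** Copy every file matching `layout` under `srcRoot` into `tgtRoot`,
    * rewriting the path to the standard Hadoop layout <col>=<value>/... */
  def copyToStandardLayout(srcRoot: Path, tgtRoot: Path, layout: String, conf: Configuration): Unit = {
    val srcFs = srcRoot.getFileSystem(conf)
    val tgtFs = tgtRoot.getFileSystem(conf)
    val (regex, cols) = PartitionLayoutParser.compile(layout) // sketched above
    val files = srcFs.listFiles(srcRoot, /* recursive = */ true)
    while (files.hasNext) {
      val src     = files.next().getPath
      val relPath = srcRoot.toUri.relativize(src.toUri).getPath
      regex.findPrefixMatchOf(relPath).foreach { m =>
        // Build the standard layout from the extracted partition values,
        // keeping the column order of the layout pattern.
        val partitionDir = cols.zip(m.subgroups).map { case (c, v) => s"$c=$v" }.mkString("/")
        val target = new Path(tgtRoot, s"$partitionDir/${src.getName}")
        FileUtil.copy(srcFs, src, tgtFs, target, /* deleteSource = */ false, conf)
      }
    }
  }
}
```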

zzeekk added the enhancement label on Jun 12, 2024