
Support arbitrary partition layout for HadoopFileDataObject #846

Open
zzeekk opened this issue Jun 12, 2024 · 0 comments
Labels
enhancement New feature or request

zzeekk (Contributor) commented Jun 12, 2024

Is your feature request related to a problem? Please describe.
All HadoopFileDataObjects work with the standard Hadoop partition layout, e.g. /<col1>=x/<col2>=y. This is also the layout Spark expects.
Sometimes it is necessary to read files from locations with a different partition layout, e.g. extracting partition values from a layout like /<col1>/abc/.
Using arbitrary partition layouts is currently possible with SFtpFileRefDataObject, but not when reading files from Hadoop filesystems (local files, S3, ...).
I would like to copy files from Hadoop filesystems with arbitrary partition layouts into another Hadoop filesystem location using FileTransferAction, creating a standard Hadoop partition layout. Currently this involves a lot of custom coding.
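
For illustration, the core of the requested feature is turning a layout pattern into partition values. Below is a minimal sketch, assuming a hypothetical %col% token syntax loosely modeled on the partitionLayout option of SFtpFileRefDataObject; all names are illustrative, not SDL's actual API:

```scala
import scala.util.matching.Regex

object PartitionLayoutParser {
  // Assumed token syntax: %colName% marks a partition column in the layout.
  private val TokenPattern = "%[A-Za-z0-9_]+%"

  /** Compile a layout such as "%col1%/abc/" into a Regex whose capture
    * groups correspond to the partition columns, in order of appearance. */
  def compile(layout: String): (Regex, Seq[String]) = {
    val cols = TokenPattern.r.findAllIn(layout).map(_.stripPrefix("%").stripSuffix("%")).toSeq
    // Quote the literal fragments and replace each token with a capture
    // group matching a single path element.
    val parts = layout.split(TokenPattern, -1).map(Regex.quote)
    (parts.mkString("([^/]+)").r, cols)
  }

  /** Extract partition values from a path relative to the DataObject root. */
  def parse(layout: String)(relPath: String): Option[Map[String, String]] = {
    val (regex, cols) = compile(layout)
    regex.findPrefixMatchOf(relPath).map(m => cols.zip(m.subgroups).toMap)
  }
}

// PartitionLayoutParser.parse("%col1%/abc/")("x/abc/data.csv")
//   == Some(Map("col1" -> "x"))
```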

Describe the solution you'd like
Implement a DataObject that reads files from Hadoop filesystems and reuses the logic of SFtpFileRefDataObject to handle arbitrary partition layouts.
Note that this DataObject could not be used with the Spark execution engine (e.g. CopyAction), but only with the file execution engine (e.g. FileTransferAction).
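
As a rough sketch of what the proposed DataObject plus FileTransferAction would automate, here is the kind of custom coding needed today, using plain Hadoop FileSystem calls and the hypothetical parser from above (all helper names are assumptions):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileUtil, Path}

object CustomLayoutCopy {
  /** Copy every file matching `layout` under `srcRoot` into `tgtRoot`,
    * rewriting the path to the standard Hadoop layout <col>=<value>/... */
  def copyToStandardLayout(srcRoot: Path, tgtRoot: Path, layout: String, conf: Configuration): Unit = {
    val srcFs = srcRoot.getFileSystem(conf)
    val tgtFs = tgtRoot.getFileSystem(conf)
    val (regex, cols) = PartitionLayoutParser.compile(layout) // sketched above
    val files = srcFs.listFiles(srcRoot, /* recursive = */ true)
    while (files.hasNext) {
      val src     = files.next().getPath
      val relPath = srcRoot.toUri.relativize(src.toUri).getPath
      regex.findPrefixMatchOf(relPath).foreach { m =>
        // Build the standard layout from the extracted partition values,
        // keeping the column order of the layout pattern.
        val partitionDir = cols.zip(m.subgroups).map { case (c, v) => s"$c=$v" }.mkString("/")
        val target = new Path(tgtRoot, s"$partitionDir/${src.getName}")
        FileUtil.copy(srcFs, src, tgtFs, target, /* deleteSource = */ false, conf)
      }
    }
  }
}
```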

zzeekk added the enhancement label on Jun 12, 2024