
[Datasets] Add fast file metadata provider and refactor Parquet datasource #24094

Merged · 5 commits into ray-project:master · Apr 29, 2022

Conversation

pdames (Member) commented Apr 21, 2022

Why are these changes needed?

Adds a fast file metadata provider that trades comprehensive file metadata collection for collection speed, and that also disables directory path expansion, which can be very slow on some cloud storage providers. This PR also refactors the Parquet datasource to take advantage of both of these changes and of the content-type agnostic partitioning support from #23624.

This is the second PR of a series originally proposed in #23179.
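
For a sense of the intended usage, here is a minimal sketch (hedged: the meta_provider read API keyword follows this PR series, and the bucket paths are placeholders):

```python
import ray
from ray.data.datasource import FastFileMetadataProvider

# Skip expensive per-file metadata resolution and directory path expansion
# when the input is already a flat list of known-good file paths.
ds = ray.data.read_csv(
    ["s3://bucket/logs/2022-04-01.csv", "s3://bucket/logs/2022-04-02.csv"],
    meta_provider=FastFileMetadataProvider(),
)
```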

Related issue number

Partially resolves #22910.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@pdames pdames requested a review from jianoaix April 21, 2022 23:42
@pdames pdames force-pushed the fast-file-metadata branch 4 times, most recently from 519dc7d to c7a61ba on April 25, 2022 23:03
@@ -237,6 +244,32 @@ def expand_paths(
return expanded_paths, file_sizes


class FastFileMetadataProvider(DefaultFileMetadataProvider):
Contributor

I find it weird to have these in a *_datasource.py file. Would it make more sense for all of these subclasses of FileMetadataProvider to be placed in file_metadata_provider.py? If not, why?

pdames (Member, Author) commented Apr 26, 2022

I had considered the same, and I think it would be a good idea to move them all over to file_metadata_provider.py as part of this PR.

From the perspective of an end-user, they should be able to use from ray.data.datasource import FastFileMetadataProvider regardless of which file we put the class inside of, so this shouldn't change much from their perspective.

From the perspective of a Ray Data maintainer, the current organization largely stems from organic growth of the code, where one-off utility functions gradually evolved into more generic/extensible classes over time and thus followed the existing convention of grouping utility classes with the datasource that uses them. So all file metadata providers in file_based_datasource.py are meant to be used with file-based datasources, while all file metadata providers in parquet_datasource.py are meant to be used specifically with the ParquetDatasource.

The primary downside of preserving this type of grouping is that core dependencies like file_based_datasource.py grow too large over time (now sitting at >700 LoC), which makes it harder to grok the important parts at a glance or during a quick top-to-bottom read-through.

If we were to move all file metadata provider implementations into file_metadata_provider.py, I'd expect it to be just over 300 LoC with all classes organized around the central purpose of providing file metadata. This better normalizes the size of our "large files" and keeps them more focused on a single purpose, so I'm slightly in favor of making this move now if we're in agreement that it's beneficial.

We probably also should do the same with block write path providers in a follow-up PR.
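
To illustrate the import-path stability point above, a simplified sketch of the package re-export (assuming the file_metadata_provider.py module name discussed here; the real `ray/data/datasource/__init__.py` exports many more names):

```python
# ray/data/datasource/__init__.py (simplified sketch)
# Re-export the classes from whichever module defines them, so that
# `from ray.data.datasource import FastFileMetadataProvider` keeps working
# even after the classes move between files.
from ray.data.datasource.file_metadata_provider import (
    DefaultFileMetadataProvider,
    FastFileMetadataProvider,
    FileMetadataProvider,
)

__all__ = [
    "DefaultFileMetadataProvider",
    "FastFileMetadataProvider",
    "FileMetadataProvider",
]
```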

pdames (Member, Author)

I refactored this in the latest commit - let me know what you think.

Contributor

Thanks, I think it makes sense for subclasses of FileMetadataProvider to live in file_metadata_provider.py.


def expand_paths(
    self,
    paths: List[str],
Contributor

Do all paths need to be for the same block?
If not, why does it make sense to return BlockMetadata as the value?

pdames (Member, Author)

I assume this comment is meant to apply to the core __call__ and _get_block_metadata methods, which do require that all provided paths are part of the same block. I think this could be made clearer with some docstring updates.
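
For reference, a rough sketch of that contract (simplified; the actual signatures in the PR take additional keyword arguments):

```python
from typing import List, Optional

import pyarrow

from ray.data.block import BlockMetadata


class FileMetadataProvider:
    def _get_block_metadata(
        self,
        paths: List[str],
        schema: Optional[pyarrow.Schema],
    ) -> BlockMetadata:
        # All input paths must belong to a single block; the returned
        # BlockMetadata describes that block in aggregate.
        raise NotImplementedError

    def __call__(self, paths: List[str], schema=None) -> BlockMetadata:
        return self._get_block_metadata(paths, schema)
```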

pdames (Member, Author)

I've added some clarification to the docstrings here in the latest commit.

Contributor

Thanks.
@clarkzinzow Is it true that we have multiple files to load into a single block?

jianoaix (Contributor) left a comment

Looking good overall. My question is whether we can make the inheritance chain shorter/simpler.

logger = logging.getLogger(__name__)


class ParquetBaseDatasource(FileBasedDatasource):
Contributor

This class has only internal methods and only one subclass -- can it be folded into its subclass, ParquetDatasource? My concern is that a separate base class is a bit of overkill for this case.

pdames (Member, Author)

I think that depends on what folding it into ParquetDatasource actually means, but if it means getting rid of this class altogether, I don't think that will work moving forward. Some degree of refactoring may be possible, but we will need to separate out this class in either this PR or the next one, since the upcoming Parquet bulk file reader API needs to use ParquetBaseDatasource directly. That API will not be able to use ParquetDatasource because its prepare_read method override is the root cause of the scalability/performance issues, cited in #22910, that arise when consuming a large number of Parquet files.

So we can delay separation of this class until the next PR, but I think it still needs to happen.
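
To make that concrete, a condensed sketch of the split being described (method bodies elided; a sketch, not the actual diff):

```python
import pyarrow.parquet as pq

from ray.data.datasource import FileBasedDatasource


class ParquetBaseDatasource(FileBasedDatasource):
    # Scalable base path: each read task reads its Parquet file directly,
    # with no upfront dataset-wide metadata resolution.
    def _read_file(self, f, path: str, **reader_args):
        return pq.read_table(f, **reader_args)


class ParquetDatasource(ParquetBaseDatasource):
    # Overrides prepare_read to resolve row-group metadata up front via
    # pyarrow.parquet.ParquetDataset. This override is the bottleneck on
    # very large file counts (#22910), so the upcoming bulk file reader
    # API targets ParquetBaseDatasource instead.
    def prepare_read(self, parallelism: int, paths, **read_args):
        ...
```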

Contributor

Ok, thanks for the explanation. Let's keep it then.



@DeveloperAPI
class ParquetMetadataProvider(FileMetadataProvider):
Contributor

I have a similar concern about whether this layer is needed, i.e. can DefaultParquetMetadataProvider directly subclass FileMetadataProvider?

pdames (Member, Author)

I have a slight preference to preserve the current layout since the intent is for anyone providing a custom ParquetDatasource metadata provider to implement the interface signature established in ParquetMetadataProvider and, in particular:

  1. Only implement metadata prefetching if their use-case requires it.
  2. Carefully consider the best way to fetch block metadata for their use-case rather than just keep a default implementation.

For example, my CachedFileMetadataProvider at https://github.com/pdames/deltacat/blob/edec6159c5acda3ede15653b4e92aaa45a43206f/deltacat/io/aws/redshift/redshift_datasource.py#L67-L70 inherits from ParquetMetadataProvider, does not require metadata prefetching, and uses a different implementation for getting block metadata from a prebuilt cache. I also expect this to be the general case for most upcoming data warehouse and data catalog integrations with Ray Datasets beyond just the Amazon Redshift integration that uses it here.

So, in summary, my slight preference is to keep DefaultParquetMetadataProvider set aside as an internal implementation detail exposed to Ray Data maintainers (and to continue to keep the @DeveloperAPI label excluded from this class), and for ParquetMetadataProvider to be exposed to end-users creating their own ParquetDatasource metadata providers.
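
As a concrete example of points (1) and (2), a hedged sketch of a cache-backed provider in the same spirit as the linked CachedFileMetadataProvider (names and signatures simplified; not the actual deltacat code):

```python
from typing import Dict, List

from ray.data.block import BlockMetadata
from ray.data.datasource import ParquetMetadataProvider


class PrebuiltCacheMetadataProvider(ParquetMetadataProvider):
    """Serves block metadata from a prebuilt cache keyed by file path."""

    def __init__(self, meta_cache: Dict[str, BlockMetadata]):
        self._meta_cache = meta_cache

    def prefetch_file_metadata(self, pieces):
        # (1) No prefetching required; the cache is built ahead of time.
        return None

    def _get_block_metadata(
        self, paths: List[str], schema, **kwargs
    ) -> BlockMetadata:
        # (2) Fetch block metadata from the cache instead of reading
        # Parquet file footers.
        assert len(paths) == 1, "This sketch assumes one file per block."
        return self._meta_cache[paths[0]]
```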

Contributor

SG, thanks.

jianoaix (Contributor) left a comment

Thank you Patrick, LG!


clarkzinzow (Contributor) left a comment

LGTM, awesome work!

clarkzinzow (Contributor)

Hey @pdames, I just reverted a commit that was breaking the Datasets CI job. Could you rebase on master one more time? 🙏

pdames (Member, Author) commented Apr 29, 2022

> Hey @pdames, I just reverted a commit that was breaking the Datasets CI job. Could you rebase on master one more time? 🙏

Done!

pdames (Member, Author) commented Apr 29, 2022

@clarkzinzow Looks like the remaining CI failures are unrelated. Could you give this a final pass?

clarkzinzow (Contributor)

LGTM, merging!

@clarkzinzow clarkzinzow merged commit 4691d2d into ray-project:master Apr 29, 2022
clarkzinzow pushed a commit that referenced this pull request May 12, 2022
…ata providers. (#24354)

API doc updates for #23179 and #24094. All data docs related to #23179 should be up-to-date once this PR and #24203 are merged.
Successfully merging this pull request may close these issues:

[Feature] Ray dataset loading large list of parquet files is extremely slow