Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: reduce parquet file size datalake #3035

Merged
merged 12 commits into from
Mar 7, 2023
Merged

Conversation

cisse21
Copy link
Member

@cisse21 cisse21 commented Feb 24, 2023

Description

  • Reduce parquet file size by not having a unified merged schema
  • Get rid of the unused mergedschema column
  • Clean up unused useParquetLoadFilesRS config variable

Notion Ticket

Notion Link

Security

  • The code changed/added as part of this pull request won't create any security issues with how the software is being used.

@cisse21 cisse21 force-pushed the chore.reduceParquetSize branch 5 times, most recently from f4bee74 to 9c1bef1 Compare February 24, 2023 12:24
@codecov
Copy link

codecov bot commented Feb 24, 2023

Codecov Report

Patch coverage: 80.67% and project coverage change: +0.30 🎉

Comparison is base (1b6af85) 53.09% compared to head (59d67f8) 53.39%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3035      +/-   ##
==========================================
+ Coverage   53.09%   53.39%   +0.30%     
==========================================
  Files         332      342      +10     
  Lines       52350    52812     +462     
==========================================
+ Hits        27794    28199     +405     
- Misses      22956    23000      +44     
- Partials     1600     1613      +13     
Impacted Files Coverage Δ
app/apphandlers/gatewayAppHandler.go 13.82% <ø> (ø)
app/apphandlers/setup.go 73.25% <ø> (ø)
gateway/internal/stats/stats.go 100.00% <ø> (ø)
jobsdb/admin.go 68.99% <0.00%> (ø)
warehouse/internal/model/upload.go 100.00% <ø> (ø)
warehouse/schema.go 52.89% <ø> (-1.56%) ⬇️
warehouse/upload.go 23.45% <6.25%> (+0.14%) ⬆️
runner/runner.go 69.66% <50.00%> (-0.61%) ⬇️
services/metric/registry.go 82.25% <50.00%> (-0.15%) ⬇️
testhelper/stats/prometheus.go 57.14% <57.14%> (ø)
... and 37 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@cisse21 cisse21 force-pushed the chore.reduceParquetSize branch 2 times, most recently from db27b78 to 95dd9c8 Compare February 27, 2023 11:43
@cisse21 cisse21 changed the title chore: reduce parquet file size datalate chore: reduce parquet file size datalake Feb 27, 2023
@cisse21 cisse21 marked this pull request as ready for review February 27, 2023 12:56
warehouse/upload.go Outdated Show resolved Hide resolved
Copy link
Member

@lvrach lvrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should not drop the column just yet.

Copy link
Member

@lvrach lvrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the database column, we also need to consider the revertability of the code.

If we introduce this change and have to revert back, the following code will not work correctly, as the MergedSchema won't be defined:

		schema := &job.Upload.UploadSchema
		if job.Upload.LoadFileType == warehouseutils.LOAD_FILE_TYPE_PARQUET {
			schema = &job.Upload.MergedSchema
		}

We need to consider if we can mitigate this situation, for example manually populate MergeSchema from upload schema in case of revert.

Or we can proactively populate both columns with UploadSchema, so in case we revert to a previous version a valid schema could be used.

@cisse21 cisse21 merged commit 4cb5907 into master Mar 7, 2023
@cisse21 cisse21 deleted the chore.reduceParquetSize branch March 7, 2023 06:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants