Skip to content

Fix: saving compressed load files with .gz extension #2835

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: devel
Choose a base branch
from

Conversation

anuunchin
Copy link
Contributor

@anuunchin anuunchin commented Jul 3, 2025

Description

This PR adds the .gz extension to compressed files based on a new config disable_extension that is set to True by default.

In a nutshell:

  • is_compression_disabled() is removed and destination load jobs have access to BufferedDataWriterConfiguration.
  • Since the task involves getting rid of tricks like FileStorage.is_gzipped(), solely relying on BufferedDataWriterConfiguration is not sufficient when it comes to imported files, since they must not be compressed in any situation. For this reason, a new context class FileImportContext was introduced that tracks whether a file is an imported file and, thus, does not require compression. The is_compressed_file attribute of FileImportContext is then accessed by the duckdb and clickhouse load jobs to correctly set the compression type.

Related Issues

@anuunchin anuunchin self-assigned this Jul 3, 2025
Copy link

netlify bot commented Jul 3, 2025

Deploy Preview for dlt-hub-docs canceled.

Name Link
🔨 Latest commit 1f7da2e
🔍 Latest deploy log https://app.netlify.com/projects/dlt-hub-docs/deploys/6877b40ad10c770008ceb5c6

@anuunchin anuunchin closed this Jul 3, 2025
@anuunchin anuunchin reopened this Jul 3, 2025
@anuunchin anuunchin force-pushed the fix/gzip branch 7 times, most recently from 8118936 to 41285c5 Compare July 4, 2025 13:44
@anuunchin anuunchin marked this pull request as ready for review July 7, 2025 07:48
@anuunchin anuunchin force-pushed the fix/gzip branch 6 times, most recently from 879abc7 to e95ebee Compare July 7, 2025 12:15
Comment on lines 183 to 187
@property
def _is_compression_enabled(self) -> bool:
"""Returns True if compression is enabled for this writer"""
return self.writer_spec.supports_compression and self.open == gzip.open

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be redundant. We can just check is self.open is gzip_open

@@ -516,7 +516,7 @@ def execute_query(self, query: AnyStr, *args: Any, **kwargs: Any) -> Iterator[DB
continue
schema = table.db
# add only tables from the dataset schema
if schema or schema.lower() != self.dataset_name.lower():
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure if this makes sense because it runs in every case except one corner-case with empty schema and empty dataset_name 🤔

@anuunchin anuunchin requested a review from sh-rp July 8, 2025 08:20
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is not exhaustive, since some writers are missing, but i don't think it's a problem for the first round of reviews 👀

Copy link
Collaborator

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the direction is good and the only complication is backward compat for filesystem destination. I think @sh-rp wants to chip-in here.

with this PR we can remove some of weird things we do to discover if files are zipped.

is_compression_disabled() - this function should go away and whenever it is used there's something wrong. we are better if we use extension to guess if file is zipped.

another so so pattern used for example in clikckhouse.py

if ext == "jsonl":
                compression = "gz" if FileStorage.is_gzipped(file_path) else "none"

we actually probe the file to see if it is zipped

we remove all those tricks and just use extensiosn. the only place is filesystem destination where we must decide how/if we do backward compatibility

@sh-rp
Copy link
Collaborator

sh-rp commented Jul 9, 2025

I did not read the PR, but just from a user / compatibility perspective, I think dlt should behave the exact same way with regards to adding these extensions after updating if you do not change any settings. This means:

  • adding the gzip extension must be configurable
  • This setting must be set to off by default
  • It would be nicer to have it switched on by default, but then we would need to make this change in dlt 2.0 and mark it as breaking

@anuunchin anuunchin force-pushed the fix/gzip branch 4 times, most recently from 756528a to 04eff17 Compare July 11, 2025 10:53
@anuunchin anuunchin force-pushed the fix/gzip branch 11 times, most recently from c215a3e to 9285e45 Compare July 14, 2025 09:34
@anuunchin anuunchin requested a review from rudolfix July 14, 2025 09:35
@sh-rp
Copy link
Collaborator

sh-rp commented Jul 14, 2025

@anuunchin what I would do is the following: Use the .gz extension by default everywhere because it makes stuff simpler and then remove the .gz extensions for the cases where the filesystem is the main destination and the user has selected to not keep the extension there (which should be the default). So as we discussed:

  • Always add or keep .gz extension in the DataWriter if compression is enabled, so after extracing and normalizing we have .gz on files that are compressed
  • When importing files, probe the file and add .gz if compressed
  • Add a new setting to the filesystem destination that controls wether the files should have a .gz compression if compressed, it be set to remove this extension by default to ensure backward compatibility. This setting should probably be ignored if the filesystem is used as the staging destination.
  • If this setting is set, rename files that are copied to buckets or the local filesystem to not have this extension while actually running the copy job.

Things to check:

  • This approach assumes that the users are ok with having the .gz extension on files in their staging buckets. Please confirm with @rudolfix that we can do this.
  • Check if this has any influence on delta and iceberg. I think we output parquet files in the normalizer for both of these destinations, so we should be good there
  • Athena (without iceberg) uses the filesystem destination as staging, but uses those exact files for backing the tables with data. In this case, the above setting should be respected.

@anuunchin anuunchin closed this Jul 15, 2025
@anuunchin anuunchin reopened this Jul 15, 2025
@anuunchin anuunchin marked this pull request as draft July 15, 2025 09:51
@anuunchin anuunchin force-pushed the fix/gzip branch 8 times, most recently from 56138ee to e464a15 Compare July 16, 2025 13:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Save compressed load files with .gz extension
3 participants