Fix: saving compressed load files with .gz extension #2835

anuunchin · 2025-07-03T08:47:28Z

Description

This PR adds the .gz extension to compressed files. The behavior is controlled by a new config option, disable_extension, which is set to True by default.

In a nutshell:

is_compression_disabled() is removed, and destination load jobs now have access to BufferedDataWriterConfiguration.
Because the task involves getting rid of tricks like FileStorage.is_gzipped(), relying solely on BufferedDataWriterConfiguration is not sufficient for imported files, which must not be compressed in any situation. For this reason, a new context class, FileImportContext, was introduced. It tracks whether a file is an imported file and therefore does not require compression. The duckdb and clickhouse load jobs then read the is_compressed_file attribute of FileImportContext to set the compression type correctly.
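
To make the idea above concrete, here is a rough sketch of how such a context could drive the compression type in a load job. This is not the code from the PR: the class names WriterConfigSketch and FileImportContextSketch, the is_imported_file field, and the compression_for_load_job helper are all hypothetical stand-ins, and only the is_compressed_file attribute name comes from the description.

```python
from dataclasses import dataclass


@dataclass
class WriterConfigSketch:
    # Hypothetical stand-in for BufferedDataWriterConfiguration.
    disable_compression: bool = False


@dataclass
class FileImportContextSketch:
    # Hypothetical stand-in for FileImportContext; the fields are assumed.
    is_imported_file: bool
    writer_config: WriterConfigSketch

    @property
    def is_compressed_file(self) -> bool:
        # Per the description, imported files are treated as uncompressed;
        # files written by the data writer follow the writer configuration.
        if self.is_imported_file:
            return False
        return not self.writer_config.disable_compression


def compression_for_load_job(context: FileImportContextSketch) -> str:
    # Hypothetical helper: what a load job could pass as its compression type.
    return "gz" if context.is_compressed_file else "none"
```

With this shape, the load job never has to open the file to decide how to read it.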

Related Issues

Resolves Save compressed load files with .gz extension #925

netlify · 2025-07-03T08:47:34Z

✅ Deploy Preview for dlt-hub-docs canceled.

Name	Link
🔨 Latest commit	`1f7da2e`
🔍 Latest deploy log	https://app.netlify.com/projects/dlt-hub-docs/deploys/6877b40ad10c770008ceb5c6

anuunchin · 2025-07-07T14:31:13Z

dlt/common/data_writers/buffered.py

+    @property
+    def _is_compression_enabled(self) -> bool:
+        """Returns True if compression is enabled for this writer"""
+        return self.writer_spec.supports_compression and self.open == gzip.open
+


This might be redundant. We can just check if self.open is gzip.open.

anuunchin · 2025-07-08T07:07:17Z

dlt/destinations/impl/duckdb/sql_client.py

@@ -516,7 +516,7 @@ def execute_query(self, query: AnyStr, *args: Any, **kwargs: Any) -> Iterator[DB
                continue
            schema = table.db
            # add only tables from the dataset schema
-            if schema or schema.lower() != self.dataset_name.lower():


I'm not 100% sure if this makes sense because it runs in every case except one corner-case with empty schema and empty dataset_name 🤔
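
The observation about the removed condition can be checked in isolation. The snippet below is only an illustration: old_condition is a hypothetical helper that copies the boolean expression from the removed line, and the schema and dataset names are made up.

```python
def old_condition(schema: str, dataset_name: str) -> bool:
    # Same expression as the removed line, wrapped in bool() for clarity.
    return bool(schema or schema.lower() != dataset_name.lower())


# Made-up inputs covering matching, non-matching, and empty values.
cases = [
    ("other_schema", "my_dataset"),
    ("my_dataset", "my_dataset"),
    ("", "my_dataset"),
    ("", ""),
]
results = {case: old_condition(*case) for case in cases}

for case, value in results.items():
    print(case, value)
```

Because of the `or`, any non-empty schema makes the condition true, even when it matches the dataset name, so only the empty/empty pair evaluates to False.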

anuunchin · 2025-07-08T08:30:25Z

tests/common/storages/test_compressed_files.py

This test is not exhaustive, since some writers are missing, but I don't think it's a problem for the first round of reviews 👀

rudolfix

the direction is good and the only complication is backward compat for the filesystem destination. I think @sh-rp wants to chip in here.

with this PR we can remove some of the weird things we do to discover if files are zipped.

is_compression_disabled() - this function should go away and whenever it is used there's something wrong. we are better off if we use the extension to guess if a file is zipped.

another so-so pattern is used, for example, in clickhouse.py

if ext == "jsonl":
    compression = "gz" if FileStorage.is_gzipped(file_path) else "none"

we actually probe the file to see if it is zipped

we remove all those tricks and just use extensions. the only exception is the filesystem destination, where we must decide how/if we do backward compatibility
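
To illustrate the difference between the two approaches, here is a small standalone sketch. It is not dlt code: probe_is_gzipped and compression_from_extension are hypothetical helpers, and the probe simply checks the two gzip magic bytes, which may differ from what FileStorage.is_gzipped does internally.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def probe_is_gzipped(file_path: str) -> bool:
    # Probing: open the file and inspect its leading bytes.
    with open(file_path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def compression_from_extension(file_path: str) -> str:
    # Extension-based: no I/O, the file name alone decides.
    return "gz" if file_path.endswith(".gz") else "none"


with tempfile.TemporaryDirectory() as tmp_dir:
    gz_path = os.path.join(tmp_dir, "items.jsonl.gz")
    with gzip.open(gz_path, "wb") as f:
        f.write(b'{"id": 1}\n')

    plain_path = os.path.join(tmp_dir, "items.jsonl")
    with open(plain_path, "wb") as f:
        f.write(b'{"id": 1}\n')

    probed = (probe_is_gzipped(gz_path), probe_is_gzipped(plain_path))
    by_extension = (
        compression_from_extension(gz_path),
        compression_from_extension(plain_path),
    )
```

The extension-based variant only works if every compressed file reliably carries the .gz suffix, which is what this PR is meant to guarantee.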

sh-rp · 2025-07-09T08:12:24Z

I did not read the PR, but just from a user / compatibility perspective, I think dlt should behave the exact same way with regard to adding these extensions after updating if you do not change any settings. This means:

adding the gzip extension must be configurable
This setting must be set to off by default
It would be nicer to have it switched on by default, but then we would need to make this change in dlt 2.0 and mark it as breaking

sh-rp · 2025-07-14T14:00:10Z

@anuunchin what I would do is the following: Use the .gz extension by default everywhere because it makes stuff simpler and then remove the .gz extensions for the cases where the filesystem is the main destination and the user has selected to not keep the extension there (which should be the default). So as we discussed:

Always add or keep the .gz extension in the DataWriter if compression is enabled, so after extracting and normalizing we have .gz on files that are compressed
When importing files, probe the file and add .gz if compressed
Add a new setting to the filesystem destination that controls whether files should have the .gz extension if compressed. It should be set to remove this extension by default to ensure backward compatibility. This setting should probably be ignored if the filesystem is used as the staging destination.
If this setting is set to remove the extension, rename files that are copied to buckets or the local filesystem so they do not have it, while actually running the copy job.
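
The naming rules proposed above could be sketched roughly as follows. Everything here is hypothetical: the function names, the keep_gz_extension and is_staging parameters, and the choice to always keep the extension on staging are assumptions for illustration, not the settings that ended up in dlt.

```python
def name_on_import(file_name: str, is_gzipped: bool) -> str:
    # Imported files get .gz appended only if probing found them compressed.
    if is_gzipped and not file_name.endswith(".gz"):
        return file_name + ".gz"
    return file_name


def name_at_destination(
    file_name: str,
    keep_gz_extension: bool = False,  # assumed default: strip, for backward compat
    is_staging: bool = False,
) -> str:
    # Assumption: on staging the setting is ignored and .gz is always kept.
    if is_staging or keep_gz_extension:
        return file_name
    # Main filesystem destination: drop the suffix during the copy job.
    if file_name.endswith(".gz"):
        return file_name[: -len(".gz")]
    return file_name
```

For example, name_at_destination("items.1234.jsonl.gz") gives "items.1234.jsonl", while the same call with is_staging=True leaves the name unchanged. The Athena case mentioned in this comment would need the staging rule to respect the setting instead, so that branch is the part most likely to change.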

Things to check:

This approach assumes that the users are ok with having the .gz extension on files in their staging buckets. Please confirm with @rudolfix that we can do this.
Check if this has any influence on delta and iceberg. I think we output parquet files in the normalizer for both of these destinations, so we should be good there
Athena (without iceberg) uses the filesystem destination as staging, but uses those exact files for backing the tables with data. In this case, the above setting should be respected.

--amend

anuunchin self-assigned this Jul 3, 2025

anuunchin force-pushed the fix/gzip branch from 9111b5f to 7f34192 Compare July 3, 2025 14:20

anuunchin closed this Jul 3, 2025

anuunchin force-pushed the fix/gzip branch from 7f34192 to 3a23b66 Compare July 3, 2025 14:29

anuunchin reopened this Jul 3, 2025

anuunchin force-pushed the fix/gzip branch 7 times, most recently from 8118936 to 41285c5 Compare July 4, 2025 13:44

anuunchin marked this pull request as ready for review July 7, 2025 07:48

anuunchin force-pushed the fix/gzip branch 6 times, most recently from 879abc7 to e95ebee Compare July 7, 2025 12:15

anuunchin requested a review from sh-rp July 8, 2025 08:20

rudolfix reviewed Jul 8, 2025

anuunchin force-pushed the fix/gzip branch 4 times, most recently from 756528a to 04eff17 Compare July 11, 2025 10:53

anuunchin force-pushed the fix/gzip branch 11 times, most recently from c215a3e to 9285e45 Compare July 14, 2025 09:34

anuunchin requested a review from rudolfix July 14, 2025 09:35

anuunchin closed this Jul 15, 2025

anuunchin force-pushed the fix/gzip branch from 9285e45 to 21b68e6 Compare July 15, 2025 09:04

anuunchin reopened this Jul 15, 2025

anuunchin marked this pull request as draft July 15, 2025 09:51

anuunchin force-pushed the fix/gzip branch 8 times, most recently from 56138ee to e464a15 Compare July 16, 2025 13:16

enable_gz_extension added to client configs

1f7da2e

--amend

anuunchin force-pushed the fix/gzip branch from e464a15 to 1f7da2e Compare July 16, 2025 14:15
