Fix: saving compressed load files with .gz extension #2835
base: devel
Conversation
dlt/common/data_writers/buffered.py (outdated)

```python
    @property
    def _is_compression_enabled(self) -> bool:
        """Returns True if compression is enabled for this writer"""
        return self.writer_spec.supports_compression and self.open == gzip.open
```
This might be redundant. We can just check `self.open is gzip_open`.
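The suggested simplification can be sketched as an identity check against `gzip.open` (the class and constructor flag below are hypothetical stand-ins for the writer in the diff above):

```python
import gzip


class WriterSketch:
    """Hypothetical stand-in for the buffered writer discussed above."""

    def __init__(self, compress: bool) -> None:
        # self.open is either gzip.open or the builtin open
        self.open = gzip.open if compress else open

    @property
    def _is_compression_enabled(self) -> bool:
        # the identity check alone suffices: gzip.open is only
        # assigned when compression was requested
        return self.open is gzip.open
```

Under this assumption, the extra `writer_spec.supports_compression` conjunct adds nothing, since `self.open` can only be `gzip.open` when compression was enabled in the first place.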
```python
@@ -516,7 +516,7 @@ def execute_query(self, query: AnyStr, *args: Any, **kwargs: Any) -> Iterator[DB
                continue
            schema = table.db
            # add only tables from the dataset schema
            if schema or schema.lower() != self.dataset_name.lower():
```
I'm not 100% sure this makes sense, because it runs in every case except one corner case with an empty schema and an empty dataset_name 🤔
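The observation can be verified by enumerating the cases: with `or`, a truthy `schema` short-circuits the condition to True, and an empty `schema` compares `""` against `dataset_name`, which only matches when that is empty too (the function name below is illustrative):

```python
def passes_filter(schema: str, dataset_name: str) -> bool:
    # the condition exactly as written in the diff above
    return bool(schema) or schema.lower() != dataset_name.lower()


# non-empty schema            -> True (short-circuit on the first operand)
# empty schema, non-empty ds  -> True ("" != dataset_name)
# empty schema, empty ds      -> False (the only corner case)
```

So the branch is taken in every case except `schema == "" and dataset_name == ""`, which suggests the intended operator may not be `or`.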
This test is not exhaustive, since some writers are missing, but I don't think that's a problem for the first round of reviews 👀
The direction is good; the only complication is backward compatibility for the filesystem destination. I think @sh-rp wants to chip in here.

With this PR we can remove some of the weird things we do to discover whether files are zipped:

- `is_compression_disabled()`: this function should go away, and wherever it is used there's something wrong. We are better off using the extension to guess whether a file is zipped.
- Another so-so pattern, used for example in clickhouse.py, where we actually probe the file to see if it is zipped:

```python
if ext == "jsonl":
    compression = "gz" if FileStorage.is_gzipped(file_path) else "none"
```

We remove all those tricks and just use extensions. The only place is the filesystem destination, where we must decide how/if we do backward compatibility.
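Extension-based detection, as suggested, needs no file I/O at all. A minimal sketch, assuming load files reliably carry the `.gz` suffix once this PR lands (the helper name is made up):

```python
def compression_from_name(file_path: str) -> str:
    """Infer compression from the file name instead of probing bytes.

    Sketch of the extension-based approach proposed above; under the
    stated assumption it can replace probing via FileStorage.is_gzipped().
    """
    return "gz" if file_path.endswith(".gz") else "none"
```

The trade-off is that the name becomes the single source of truth: a compressed file without the suffix would be misclassified, which is exactly why the backward-compatibility question for the filesystem destination matters.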
I did not read the PR, but just from a user / compatibility perspective, I think dlt should behave exactly the same way with regard to adding these extensions after updating, as long as you do not change any settings. This means:
@anuunchin what I would do is the following: use the .gz extension by default everywhere, because it makes things simpler, and then remove the .gz extension in the cases where the filesystem is the main destination and the user has chosen not to keep the extension there (which should be the default). So, as we discussed:
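The naming rule proposed above can be sketched as a single decision function (parameter names are mine, not the PR's; this is a sketch of the proposal, not the implementation):

```python
def load_file_name(
    base_name: str,
    compressed: bool,
    filesystem_destination: bool,
    keep_gz: bool,
) -> str:
    """Add .gz by default; drop it only when writing to the filesystem
    destination and the user opted not to keep the extension."""
    if compressed and not (filesystem_destination and not keep_gz):
        return base_name + ".gz"
    return base_name
```

With `keep_gz` defaulting to False for the filesystem destination, existing filesystem layouts would keep their current names after updating, while every other destination gets the suffix.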
Things to check:
Description

This PR adds the `.gz` extension to compressed files, based on a new config `disable_extension` that is set to `True` by default.

In a nutshell:
- `is_compression_disabled()` is removed, and destination load jobs have access to `BufferedDataWriterConfiguration`.
- `FileStorage.is_gzipped()`: solely relying on `BufferedDataWriterConfiguration` is not sufficient when it comes to imported files, since they must not be compressed in any situation. For this reason, a new context class `FileImportContext` was introduced that tracks whether a file is an imported file and, thus, does not require compression. The `is_compressed_file` attribute of `FileImportContext` is then accessed by the duckdb and clickhouse load jobs to correctly set the compression type.

Related Issues
- `.gz` extension #925
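How the `FileImportContext` described above might feed compression selection in the duckdb/clickhouse load jobs can be sketched as follows. Only the `is_compressed_file` attribute and the "imported files are never compressed" rule come from the PR text; the class shape, field names, and the helper are assumptions:

```python
from dataclasses import dataclass


@dataclass
class FileImportContextSketch:
    """Hypothetical sketch of the FileImportContext described above."""

    # whether the load file was imported (imported files are never compressed)
    is_imported: bool
    # whether the writer configuration enabled compression
    compression_enabled: bool = True

    @property
    def is_compressed_file(self) -> bool:
        # an imported file must never be treated as compressed,
        # regardless of the writer configuration
        return self.compression_enabled and not self.is_imported


def compression_type(ctx: FileImportContextSketch) -> str:
    # what a load job might use to set the compression type
    return "gzip" if ctx.is_compressed_file else "none"
```

The point of routing this through a context object rather than probing bytes is that the load job no longer needs `FileStorage.is_gzipped()` or any other inspection of the file itself.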