-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add strict_timestamps option #117
base: main
Are you sure you want to change the base?
Conversation
Ah thanks for the PR! So probably yes, should do something about this... But...
And also - I think would be good to add or change an existing tests around this, checking cases above and below the clamping range for both with and without the extended timestamps somehow. |
@frederikaalund Let us know if you want to do the above / discuss more? No is all good of course, but in that case I would be tempted for us to make the changes at our end. I'm seeing this PR as essentially a bug report and I'm keen to get it sorted. |
To just think out loud, I wonder also if the extended/unix timestamps need to be clamped for times outside of the range representable by a signed 32 bit integer, e.g. after 2038... |
Thanks for the feedback and I apologize for me late reply. It's been a busy week :) I lean towards (1) too and only decided to not implement it that way since it's technically a breaking change. If you're okay with this subtle change then I'll gladly implement it without the kwarg option. 👍 It is an edge case after all. If we go down this route, I too think that it's a good idea to clamp the timestamps in general (regardless of whether we store them in the MSDOS header or the Extended timestamp header). Something like this:
As for tests and docs, of course, let me add that as well as part of this PR. :) I would ask for the same thing myself so I just shamelessly tried to get away with not doing it 😅 Let me know what you think. Thanks again for your feedback on this PR! |
I think I am - it's the better behaviour I would say for my use cases, and it sounds like yours too, and it's simpler to not have an option. Avoiding a breaking change I don't think is worth it really.
Yes exactly (and I might lean to the making the range are large as possible i.e. clamping to exactly the largest/smallest second that's representable for each) |
f8a9d51
to
15139e3
Compare
Hi again, finally found some time to work on this. :) I just pushed a rework that silently clamps the timestamps to the available precision in the MS-DOS and UNIX headers. The timestamp limits are:
Note that I now use a lower limit of 1970 (and not 1902 as previously discussed). Some tools ( I also added tests and docs. Let me know if you want anything changed 👍 |
Looks great! Having a bit of a nose around and think still - will leave comments in the code |
If `extended_timestamps=True` (the default): | ||
|
||
* Timestamps are clamped between 1970-1-1 and 2038-1-19 (both inclusive) | ||
* Timestamps are rounded down with 1-second precision |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the situation a bit more complex than this? If extended_timestamps=True, then it depends on the client - ones that read the extended timestamps will use the greater range, but older ones will still use the smaller range.
I wonder if it's better if the documentation should be split into "older clients" and "newer clients" (or something like that). The older clients in all cases use the smaller range, but the newer ones depend on extended_timestamps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The data in the file (er, in the stream) doesn't depend on the client, and I think that's all you can reasonably make statements about. What the client will do with it depends on the client, but that doesn't change anything about the stream or the values encoded in it. And documenting what will be done with that data by all possible clients seems both out of scope and impossible.
I guess you could say something like this, to be more precise:
If `extended_timestamps=True` (the default), a second
timestamp field is added which will be read by
supporting clients. Extended timestamps:
* Use UNIX-style "seconds since epoch" values
* Are clamped between 1970-1-1 and 2038-1-19
(both inclusive)
* Are rounded down with 1-second precision
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Perhaps first documenting the standard timestamps without the "if extended_timestamps=False
" qualifier, since those timestamps will always be there.)
], | ||
) | ||
def test_unzip_modification_time(method, timezone, modified_at): | ||
def test_unzip_modification_time(method, timezone, modified_at, expected_modified_at): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just running the tests locally, and I'm getting some errors... I wonder if it's because I'm currently in British Summer Time, 1 our off from UTC...
pytest 'test_stream_zip.py::test_unzip_modification_time[UTC+0-modified_at0-expected_modified_at0-ZIP_32]'
Results in
> assert os.path.getmtime('my_file') == int(expected_modified_at.timestamp())
E AssertionError: assert 0.0 == -3600
E + where 0.0 = <function getmtime at 0x101338040>('my_file')
E + where <function getmtime at 0x101338040> = <module 'posixpath' (frozen)>.getmtime
E + where <module 'posixpath' (frozen)> = os.path
E + and -3600 = int(-3600.0)
E + where -3600.0 = <built-in method timestamp of datetime.datetime object at 0x1040fb990>()
E + where <built-in method timestamp of datetime.datetime object at 0x1040fb990> = datetime.datetime(1970, 1, 1, 0, 0).timestamp
The tests test_unzip_modification_time
do pass for me.
Not quite sure why right now...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to this jszip issue, convention is that MS-DOS timestamps are in local time, whereas extended timestamps use UNIX conventions which means all timestamps are values in UTC. That's backed up by a note in the unzip(1) man page, as well:
-f freshen existing files, i.e., extract only those files that al‐ ready exist on disk and that are newer than the disk copies. By default unzip queries before overwriting, but the -o option may be used to suppress the queries. Note that under many operating systems, the TZ (timezone) environment variable must be set cor‐ rectly in order for -f and -u to work properly (under Unix the variable is usually set automatically). The reasons for this are somewhat subtle but have to do with the differences between DOS-format file times (always local time) and Unix-format times (always in GMT/UTC) and the necessity to compare the two. A typical TZ value is ``PST8PDT'' (US Pacific time with automatic adjustment for Daylight Savings Time or ``summer time'').
Both os.path.getmtime()
and datetime.timestamp()
seem to respect that.
# These produce the same value
datetime.datetime.now().timestamp()
datetime.datetime.now(datetime.UTC).timestamp()
# as do these
datetime.datetime.fromtimestamp(os.path.getmtime('some-file')).timestamp()
datetime.datetime.fromtimestamp(os.path.getmtime('some-file'), datetime.UTC).timestamp()
But a datetime.datetime(1970, 1, 1, 0, 0)
with no timezone information will be interpreted as a local time, so the timestamp will have a timezone offset. IOW, here in NYC at UTC-5:
>>> dt = datetime.datetime(1970, 1, 1, 0, 0, 0, tzinfo=None)
>>> dt.timestamp()
18000.0
>>> dtutc = datetime.datetime(1970, 1, 1, 0, 0, 0, tzinfo=datetime.UTC)
>>> dtutc.timestamp()
0.0
# In principle, we the lower limit should be `-2**31` but we set it | ||
# to zero to avoid issues with common zip utilities like `unzip`. | ||
# Said tools do not correctly interpret negative timestamps. | ||
max(min(int(modified_at.timestamp()), 2**31 - 1), 0), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From (admittedly only brief tests) it looks like once times go below 1970 with extended timestamps, then unzip seems to clamp to 1980 (so probably using the MS DOS timestamp?). While odd certainly, I guess I'm leaning to going for the full range in the zip file, because the timestamp before 1970 will be wrong in unzip no matter what, but we can at least make timestamps before 1970 right for other clients.
subprocess.run(['unzip', f'{d}/test.zip', '-d', d], env={'TZ': timezone}) | ||
|
||
assert os.path.getmtime('my_file') == int(modified_at.timestamp()) | ||
assert os.path.getmtime('my_file') == int(expected_modified_at.timestamp()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test failures seem to be caused by the unzip
being run with the given timezone variable, but expected_modified_at.timestamp()
being interpreted as local time when it has no tzinfo
set.
I ran into an issue today with a device of ours that had a file from 1970 (or at least that was the timestamp). In turn, stream-zip gave a cryptic error. Turns out, the ZIP file format simply does not support timestamps from before 1980. This is a known issue and python's built-in zipfile module provides the
strict_timestamps
option as a workaround for this.In short, this PR:
strict_timestamps
as an optional argument tostream_zip
. It defaults to true (replicating the current behaviour; no breaking changes).strict_timestamps=False
then the library coerces anymodified_at
timestamps into [1980; 2108). This silently avoids underflow/overflow.The CPython zipfile implementation inspired my implementation in this PR.
Let me know what you think!
Side note: Thank you for this great library (along with stream-unzip). 👍 It's crucial for the backup/restore feature on our devices. :)