feat: add strict_timestamps option #117

frederikaalund · 2024-04-12T12:01:19Z

I ran into an issue today with a device of ours that had a file from 1970 (or at least that was the timestamp). In turn, stream-zip gave a cryptic error. Turns out, the ZIP file format simply does not support timestamps from before 1980. This is a known issue and python's built-in zipfile module provides the strict_timestamps option as a workaround for this.

In short, this PR:

Adds strict_timestamps as an optional argument to stream_zip. It defaults to true (replicating the current behaviour; no breaking changes).
When you set strict_timestamps=False then the library coerces any modified_at timestamps into [1980; 2108). This silently avoids underflow/overflow.

The CPython zipfile implementation inspired my implementation in this PR.

Let me know what you think!

Side note: Thank you for this great library (along with stream-unzip). 👍 It's crucial for the backup/restore feature on our devices. :)

michalc · 2024-04-12T20:47:22Z

Ah thanks for the PR!

So probably yes, should do something about this... But...

I'm leaning to not having an option at all and just "make it work" so earlier/later timestamps should in all cases be clamped so the zip file is created (Unless we can think of a use case where an error is desirable? Maybe some cases do need some validation for not-too-early-or-too-late timestamps, but I suspect that shouldn't be part of stream-zip)
And I'm leaning to not clamping the extended/unix timestamp if there is one, and only ever clamp the MS DOS one. I'm guessing a bit, but I think it would mean that clients that support the extended timestamps would then still in all cases use the right timestamp.
And leaning to having a bit of documentation on this behaviour somewhere, i.e. for older clients or all clients when extended_timestamps=False, the dates are clamped due to a limitation of the ZIP format

And also - I think would be good to add or change an existing tests around this, checking cases above and below the clamping range for both with and without the extended timestamps somehow.

michalc · 2024-04-20T06:36:51Z

@frederikaalund Let us know if you want to do the above / discuss more? No is all good of course, but in that case I would be tempted for us to make the changes at our end. I'm seeing this PR as essentially a bug report and I'm keen to get it sorted.

michalc · 2024-04-20T06:44:40Z

To just think out loud, I wonder also if the extended/unix timestamps need to be clamped for times outside of the range representable by a signed 32 bit integer, e.g. after 2038...

frederikaalund · 2024-04-21T10:24:34Z

Thanks for the feedback and I apologize for me late reply. It's been a busy week :)

I lean towards (1) too and only decided to not implement it that way since it's technically a breaking change. If you're okay with this subtle change then I'll gladly implement it without the kwarg option. 👍 It is an edge case after all.

If we go down this route, I too think that it's a good idea to clamp the timestamps in general (regardless of whether we store them in the MSDOS header or the Extended timestamp header). Something like this:

MSDOS timestamp header: Silently clamp to [1980; 2108)
Extended timestamp header: Silently clamp to [1902; 2038)

As for tests and docs, of course, let me add that as well as part of this PR. :) I would ask for the same thing myself so I just shamelessly tried to get away with not doing it 😅

Let me know what you think. Thanks again for your feedback on this PR!

michalc · 2024-04-21T11:26:09Z

If you're okay with this subtle change then I'll gladly implement it without the kwarg option. 👍 It is an edge case after all.

I think I am - it's the better behaviour I would say for my use cases, and it sounds like yours too, and it's simpler to not have an option. Avoiding a breaking change I don't think is worth it really.

Something like this:
MSDOS timestamp header: Silently clamp to [1980; 2108)
Extended timestamp header: Silently clamp to [1902; 2038)

Yes exactly (and I might lean to the making the range are large as possible i.e. clamping to exactly the largest/smallest second that's representable for each)

frederikaalund · 2024-05-04T10:33:41Z

Hi again, finally found some time to work on this. :)

I just pushed a rework that silently clamps the timestamps to the available precision in the MS-DOS and UNIX headers.

The timestamp limits are:

MSDOS timestamp header: Silently clamp to [1980; 2108)
Extended timestamp header: Silently clamp to [1970; 2038)

Note that I now use a lower limit of 1970 (and not 1902 as previously discussed). Some tools (unzip in particular) has problems with negative timestamps. It's technically a bug in unzip but it's a rather popular tool so why not just abide.

I also added tests and docs. Let me know if you want anything changed 👍

michalc · 2024-05-06T17:54:16Z

Looks great! Having a bit of a nose around and think still - will leave comments in the code

michalc · 2024-05-06T17:59:19Z

docs/get-started.md

+If `extended_timestamps=True` (the default):
+
+ * Timestamps are clamped between 1970-1-1 and 2038-1-19 (both inclusive)
+ * Timestamps are rounded down with 1-second precision


Is the situation a bit more complex than this? If extended_timestamps=True, then it depends on the client - ones that read the extended timestamps will use the greater range, but older ones will still use the smaller range.

I wonder if it's better if the documentation should be split into "older clients" and "newer clients" (or something like that). The older clients in all cases use the smaller range, but the newer ones depend on extended_timestamps?

The data in the file (er, in the stream) doesn't depend on the client, and I think that's all you can reasonably make statements about. What the client will do with it depends on the client, but that doesn't change anything about the stream or the values encoded in it. And documenting what will be done with that data by all possible clients seems both out of scope and impossible.

I guess you could say something like this, to be more precise:

If `extended_timestamps=True` (the default), a second timestamp field is added which will be read by supporting clients. Extended timestamps: * Use UNIX-style "seconds since epoch" values * Are clamped between 1970-1-1 and 2038-1-19 (both inclusive) * Are rounded down with 1-second precision

(Perhaps first documenting the standard timestamps without the "if extended_timestamps=False" qualifier, since those timestamps will always be there.)

michalc · 2024-05-06T18:37:31Z

test_stream_zip.py

    ],
 )
-def test_unzip_modification_time(method, timezone, modified_at):
+def test_unzip_modification_time(method, timezone, modified_at, expected_modified_at):


I'm just running the tests locally, and I'm getting some errors... I wonder if it's because I'm currently in British Summer Time, 1 our off from UTC...

pytest 'test_stream_zip.py::test_unzip_modification_time[UTC+0-modified_at0-expected_modified_at0-ZIP_32]'

Results in

> assert os.path.getmtime('my_file') == int(expected_modified_at.timestamp()) E AssertionError: assert 0.0 == -3600 E + where 0.0 = <function getmtime at 0x101338040>('my_file') E + where <function getmtime at 0x101338040> = <module 'posixpath' (frozen)>.getmtime E + where <module 'posixpath' (frozen)> = os.path E + and -3600 = int(-3600.0) E + where -3600.0 = <built-in method timestamp of datetime.datetime object at 0x1040fb990>() E + where <built-in method timestamp of datetime.datetime object at 0x1040fb990> = datetime.datetime(1970, 1, 1, 0, 0).timestamp

The tests test_unzip_modification_time do pass for me.

Not quite sure why right now...

According to this jszip issue, convention is that MS-DOS timestamps are in local time, whereas extended timestamps use UNIX conventions which means all timestamps are values in UTC. That's backed up by a note in the unzip(1) man page, as well:

-f freshen existing files, i.e., extract only those files that al‐ ready exist on disk and that are newer than the disk copies. By default unzip queries before overwriting, but the -o option may be used to suppress the queries. Note that under many operating systems, the TZ (timezone) environment variable must be set cor‐ rectly in order for -f and -u to work properly (under Unix the variable is usually set automatically). The reasons for this are somewhat subtle but have to do with the differences between DOS-format file times (always local time) and Unix-format times (always in GMT/UTC) and the necessity to compare the two. A typical TZ value is ``PST8PDT'' (US Pacific time with automatic adjustment for Daylight Savings Time or ``summer time'').

Both os.path.getmtime() and datetime.timestamp() seem to respect that.

# These produce the same value datetime.datetime.now().timestamp() datetime.datetime.now(datetime.UTC).timestamp() # as do these datetime.datetime.fromtimestamp(os.path.getmtime('some-file')).timestamp() datetime.datetime.fromtimestamp(os.path.getmtime('some-file'), datetime.UTC).timestamp()

But a datetime.datetime(1970, 1, 1, 0, 0) with no timezone information will be interpreted as a local time, so the timestamp will have a timezone offset. IOW, here in NYC at UTC-5:

>>> dt = datetime.datetime(1970, 1, 1, 0, 0, 0, tzinfo=None) >>> dt.timestamp() 18000.0 >>> dtutc = datetime.datetime(1970, 1, 1, 0, 0, 0, tzinfo=datetime.UTC) >>> dtutc.timestamp() 0.0

michalc · 2024-05-06T19:08:32Z

stream_zip.py

+                # In principle, we the lower limit should be `-2**31` but we set it
+                # to zero to avoid issues with common zip utilities like `unzip`.
+                # Said tools do not correctly interpret negative timestamps.
+                max(min(int(modified_at.timestamp()), 2**31 - 1), 0),


From (admittedly only brief tests) it looks like once times go below 1970 with extended timestamps, then unzip seems to clamp to 1980 (so probably using the MS DOS timestamp?). While odd certainly, I guess I'm leaning to going for the full range in the zip file, because the timestamp before 1970 will be wrong in unzip no matter what, but we can at least make timestamps before 1970 right for other clients.

ferdnyc · 2024-05-31T00:28:01Z

test_stream_zip.py

        subprocess.run(['unzip', f'{d}/test.zip', '-d', d], env={'TZ': timezone})

-        assert os.path.getmtime('my_file') == int(modified_at.timestamp())
+        assert os.path.getmtime('my_file') == int(expected_modified_at.timestamp())


The test failures seem to be caused by the unzip being run with the given timezone variable, but expected_modified_at.timestamp() being interpreted as local time when it has no tzinfo set.

frederikaalund changed the title ~~feat: add struct_timestamps option~~ feat: add strict_timestamps option Apr 21, 2024

feat: clamp timestamps to fit header fields

15139e3

frederikaalund force-pushed the feat-strict-timestamps branch from f8a9d51 to 15139e3 Compare May 4, 2024 10:30

michalc reviewed May 6, 2024

View reviewed changes

ferdnyc reviewed May 31, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add strict_timestamps option #117

feat: add strict_timestamps option #117

frederikaalund commented Apr 12, 2024

michalc commented Apr 12, 2024

michalc commented Apr 20, 2024 •

edited

Loading

michalc commented Apr 20, 2024

frederikaalund commented Apr 21, 2024

michalc commented Apr 21, 2024 •

edited

Loading

frederikaalund commented May 4, 2024 •

edited

Loading

michalc commented May 6, 2024

michalc May 6, 2024

ferdnyc May 30, 2024

ferdnyc May 30, 2024

michalc May 6, 2024

ferdnyc May 30, 2024 •

edited

Loading

michalc May 6, 2024

ferdnyc May 31, 2024

feat: add strict_timestamps option #117

Are you sure you want to change the base?

feat: add strict_timestamps option #117

Conversation

frederikaalund commented Apr 12, 2024

michalc commented Apr 12, 2024

michalc commented Apr 20, 2024 • edited Loading

michalc commented Apr 20, 2024

frederikaalund commented Apr 21, 2024

michalc commented Apr 21, 2024 • edited Loading

frederikaalund commented May 4, 2024 • edited Loading

michalc commented May 6, 2024

michalc May 6, 2024

Choose a reason for hiding this comment

ferdnyc May 30, 2024

Choose a reason for hiding this comment

ferdnyc May 30, 2024

Choose a reason for hiding this comment

michalc May 6, 2024

Choose a reason for hiding this comment

ferdnyc May 30, 2024 • edited Loading

Choose a reason for hiding this comment

michalc May 6, 2024

Choose a reason for hiding this comment

ferdnyc May 31, 2024

Choose a reason for hiding this comment

michalc commented Apr 20, 2024 •

edited

Loading

michalc commented Apr 21, 2024 •

edited

Loading

frederikaalund commented May 4, 2024 •

edited

Loading

ferdnyc May 30, 2024 •

edited

Loading