Skip to content

Conversation

@pmrowla
Copy link
Contributor

@pmrowla pmrowla commented Jan 4, 2022

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

Fixes #7220

@pmrowla pmrowla added the bugfix fixes bug label Jan 4, 2022
@pmrowla pmrowla requested a review from a team as a code owner January 4, 2022 06:34
@pmrowla pmrowla self-assigned this Jan 4, 2022
@pmrowla pmrowla requested a review from karajan1001 January 4, 2022 06:34
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pmrowla, will this break repo if they were using protocol 5 before?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, users may need to remove .dvc/tmp/index in this case (which is the same potential issue we had when we pinned the version for state #5563 (comment))

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python >=3.8 uses Protocol version 5, so we may as well just don't support compatibility between 3.7 and >=3.8.
What do you think of doing something like:

Suggested change
self.index = Index(self.index_dir, disk_pickle_protocol=4)
self.index = Index(
self.index_dir,
disk_pickle_protocol=4 if sys.version_info < (3,8) else 5
)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that works here since the reason users are still running into this issue is because they are mixing python versions (between 3.7 and >3.7)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the core python default protocol version in 3.8+ is still 4, the problem for us is that diskcache explicitly uses pickle.HIGHEST_PROTOCOL https://github.com/grantjenks/python-diskcache/blob/d55a50ee083784afa9c85e14e41c4a2d132f3111/diskcache/core.py#L60

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For myself, I'd prefer to raise some messages to suggest .dvc/tmp/index here.

Copy link
Contributor Author

@pmrowla pmrowla Jan 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding pickle protocol version, I think we should ensure compatibility across python versions that are supported by DVC, which means we should be enforcing a pinned version wherever we are using pickle (unless at some point in the future Python officially deprecates whatever ver we are using).

In the state/index use case, I don't think there's any practical benefit to using protocol ver 5 for the way we use diskcache (my understanding of protocol ver 5 is that it improves performance for serializing very large non-native objects that have to implement their own __reduce__ serialization methods, like numpy arrays and pandas dataframes), but with the way we use diskcache we are really just serializing (relatively) small native dicts.

There is an official backport library for pickle protocol v5, so if/when the state/index pickle performance becomes a real problem for us we can go that route, but I don't think it's a pressing need right now.

@skshetry

Copy link
Collaborator

@skshetry skshetry Jan 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pmrowla, I agree with all of your points, I have no issues on the technical side of things.

I'm just worried about potentially breaking things here for Python 3.8+ users in the middle of 2.x version. I'm fine with this as long as we are explicit about this being a breaking change and weigh in both options:

  1. By not fixing this, we break compatibility between 3.7 with 3.8+ Python versions. So, if the user used Python 3.8+ version, they cannot use dvc repo again with Python 3.7.
  2. By fixing this, we ensure compatibility between all Python versions by breaking compatibility for once for Python 3.8+ users (note all of our binary packages and homebrew package are 3.8+).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skshetry as a short term workaround we could just catch the protocol errors in state/index, and then add a troubleshooting link w/details suggesting users can try removing the relevant temp dirs (md5s/links/index) and then retry the command

Copy link
Contributor Author

@pmrowla pmrowla Jan 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So after some testing, it turns out that changing to protocol version 4 will not break an existing diskcache cache that was generated using protocol version 5 (as long as you are on a python version that supports v5).

It looks like disk_pickle_protocol is only used on writes, and on read it uses the protocol version specified in the object it is reading, rather than the parent cache's default protocol version. So if you change protocol versions, it will just rewrite the cache entry using the new ver on the next time that entry is modified.

This test:

@pytest.mark.parametrize("proto_a, proto_b", [(4, 5), (5, 4)])
def test_protocol_error(tmp_dir, proto_a, proto_b):
    with diskcache.Cache(directory=(tmp_dir / "test"), disk_pickle_protocol=proto_a) as cache:
        cache["key"] = ("value1", "value2")
    with diskcache.Cache(directory=(tmp_dir / "test"), disk_pickle_protocol=proto_b) as cache:
        assert ("value1", "value2")  == cache["key"]
        cache["key"] = ("value3", "value4")
        assert ("value3", "value4")  == cache["key"]

passes on 3.8, 3.9, 3.10 and fails on 3.7 (since 3.7 cannot read/write protocol version 5)

So this should not break compatibility with existing repos for anyone on python 3.8+ (but generating the troubleshooting message for py 3.7 will still be useful)

@pmrowla pmrowla force-pushed the 7220-index-pickle branch from 4ae76b8 to 8048266 Compare January 4, 2022 06:46
@pmrowla pmrowla force-pushed the 7220-index-pickle branch from 8048266 to ebab8a6 Compare January 4, 2022 06:49
@efiop efiop merged commit 9e590ed into treeverse:main Jan 4, 2022
@skshetry
Copy link
Collaborator

skshetry commented Jan 5, 2022

To be honest, I don't find it worth it to break for every python version, considering it may fail in unexpected places/ways.
As 3.7 is the minimum supported version and 3.0 is around, we could choose to leave it as-is, and just not support mixing it with other python versions. We have plans to change it in 3.0 anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bugfix fixes bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

odb: index pickle version should be explicit

4 participants