Revisit the pickle jar procedure #10768

Open
nthiery opened this issue Feb 10, 2011 · 14 comments

@nthiery
Contributor

nthiery commented Feb 10, 2011

The current pickle jar mechanism has some drawbacks:

  • We never add new pickles to the pickle jar

  • We don't know how old pickles in the pickle jar are

  • We may be testing an old pickle, but not a recent one

  • Updating specific pickles is a bit tedious

Here is a new proposal:

  1. Pickles will no longer be stored in a .tar.bz2 file but simply as files within the directory extcode/pickle_jar/$VERSION. This will likely increase the on-disk space needed for a Sage install, but will not have a big influence on Sage distributions, since we have an extcode spkg anyway (which is tarred and compressed).
  2. Pickles will be under git control (this will now become possible).
  3. The $VERSION in the directory name refers to the Sage version used to create the pickle. Once a pickle has been made, it will remain in place in that directory, even in subsequent Sage versions (so sage-4.7.2 will contain pickle_jar/4.7, pickle_jar/4.7.1 and pickle_jar/4.7.2).
  4. When making a new release, the release manager will unpickle all old pickles and repickle them with the new Sage version. Whenever a pickle has changed, the new (changed) pickle will be stored in pickle_jar/$NEWVERSION. The old pickle is kept where it was.
  5. sage.structure.sage_object.unpickle_all will check all pickles (old and new).
  6. If some day some pickle rots away and it is decided by consensus to not support unpickling it anymore, then the patch author would simply git remove the old pickle.
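
As a rough illustration of items 3 and 5, the check over all versioned directories could look like the sketch below. It uses the standard-library `pickle` module as a stand-in for Sage's loading machinery, and the function name `check_pickle_jar` and the file names are invented for the example; it is not the implementation of `unpickle_all`.

```python
import pickle
import tempfile
from pathlib import Path


def check_pickle_jar(jar_root):
    """Try to load every pickle under jar_root/<version>/ and report failures.

    Returns (number_loaded, failures), where failures is a list of
    (path, exception) pairs. Hypothetical helper, not a Sage function.
    """
    loaded, failures = 0, []
    version_dirs = sorted(p for p in Path(jar_root).iterdir() if p.is_dir())
    for version_dir in version_dirs:
        for pickle_file in sorted(version_dir.iterdir()):
            try:
                with pickle_file.open("rb") as f:
                    pickle.load(f)
                loaded += 1
            except Exception as exc:  # a rotten pickle can fail in many ways
                failures.append((pickle_file, exc))
    return loaded, failures


# Usage: a toy jar with two version directories and one rotten pickle.
with tempfile.TemporaryDirectory() as tmp:
    jar = Path(tmp) / "pickle_jar"
    toy_content = {"4.7": {"a.sobj": [1, 2]}, "4.7.1": {"a.sobj": (1, 2)}}
    for version, objects in toy_content.items():
        (jar / version).mkdir(parents=True)
        for name, obj in objects.items():
            (jar / version / name).write_bytes(pickle.dumps(obj))
    (jar / "4.7.1" / "rotten.sobj").write_bytes(b"not a pickle")

    loaded, failures = check_pickle_jar(jar)
    print(loaded, [path.name for path, _ in failures])
```

Collecting the failures rather than stopping at the first one gives a full list of the pickles that no longer load in the current version.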

CC: @sagetrac-sage-combinat @ohanar

Component: pickling

Issue created by migration from https://trac.sagemath.org/ticket/10768

@jdemeyer

comment:1

While we're at it, why does the pickle jar need to be a tar.bz2 file as opposed to just a directory in data/extcode/pickle_jar? When distributing the pickle jar, it is contained in the extcode spkg anyway, so I don't see the gain of having an additional layer of tarring.

@jdemeyer

comment:2

One major advantage of not having the tar file would be that the pickle jar could be updated using standard hg commands. This would instantly solve 2 of the 3 complaints:

  1. Using hg log, we would know exactly how old everything is
  2. Updating specific pickles would become as easy as adding a patch to the Sage library.

@jdemeyer

comment:3

Related ticket: #11069

@jdemeyer

comment:4

Nicolas, just to make sure I understand you correctly, is your proposal the following:

  1. Pickle jars are named after the Sage version (i.e. we would have a pickle_jar-4.6.2.tar.bz2 file or a pickle_jar-4.6.2 directory in my proposal).
  2. We always keep the old versions unchanged (so sage-4.7 would still contain pickle_jar-4.6.2).
  3. With every new Sage version, the release manager unpickles pickle_jar-$OLDVERSION, repickles them using the new Sage version and saves them as pickle_jar-$NEWVERSION.

I can see some merit to this proposal, however I would save only the pickles which actually changed. Otherwise you will end up with lots of copies of the same pickle.

@nthiery
Contributor Author

nthiery commented Mar 29, 2011

comment:5

Replying to @jdemeyer:

One major advantage of not having the tar file would be that the pickle jar could be updated using standard hg commands. This would instantly solve 2 of the 3 complaints:

  1. Using hg log, we would know exactly how old everything is
  2. Updating specific pickles would become as easy as adding a patch to the Sage library.

+1, definitely! Actually I did not suggest it earlier because I was
worrying about the disk space usage, not for the Sage distribution but
for the Sage install. But if there is a consensus that this is well
used disk space, let's go for it.

I was also wondering whether this could possibly slow down
unpickle_all since this would require loading lots of little files
instead of slurping in one large archive. Any clue?

@nthiery
Contributor Author

nthiery commented Mar 29, 2011

comment:6

Hi Jeroen!

Replying to @jdemeyer:

Nicolas, just to make sure I understand you correctly, is your proposal the following:

I am going to take this opportunity to amend the proposal a bit :-)

  1. Pickle jars are named after the Sage version (i.e. we would have a pickle_jar-4.6.2.tar.bz2 file or a pickle_jar-4.6.2 directory in my proposal).

Yes.

  2. We always keep the old versions unchanged (so sage-4.7 would still contain pickle_jar-4.6.2).

Yes. More precisely sage-4.7 would still contain the subset of the
pickles in pickle_jar-4.6.2 that:

  • still unpickle properly in sage-4.7
  • differ from the corresponding pickle in 4.7 (and any intermediate version)

  3. With every new Sage version, the release manager unpickles pickle_jar-$OLDVERSION, repickles them using the new Sage version and saves them as pickle_jar-$NEWVERSION.

More precisely: the release manager recreates a fresh pickle jar by running all the sage tests with SAGE_PICKLE_JAR set (as described in unpickle_all). And then removes from pickle_jar-$OLDVERSION those that did not change. An easy thing to script.
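
The pruning step described here could be scripted roughly as follows. This is only a sketch: it assumes that "did not change" means byte-identical files with the same name in both jars, which in turn assumes that pickling is deterministic. The function name `prune_unchanged` and the file names are made up for the example.

```python
import filecmp
import tempfile
from pathlib import Path


def prune_unchanged(old_dir, new_dir):
    """Delete from old_dir every file that is byte-identical to the file
    of the same name in new_dir. Returns the sorted names of deleted files.
    """
    removed = []
    for old_file in sorted(Path(old_dir).iterdir()):
        new_file = Path(new_dir) / old_file.name
        # shallow=False forces a comparison of the contents, not just stat data
        if new_file.is_file() and filecmp.cmp(old_file, new_file, shallow=False):
            old_file.unlink()
            removed.append(old_file.name)
    return removed


# Usage: x is unchanged between versions, y has changed.
with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "pickle_jar-4.6.2"
    new = Path(tmp) / "pickle_jar-4.7"
    old.mkdir()
    new.mkdir()
    (old / "x.sobj").write_bytes(b"same bytes")
    (new / "x.sobj").write_bytes(b"same bytes")
    (old / "y.sobj").write_bytes(b"old bytes")
    (new / "y.sobj").write_bytes(b"new bytes")

    removed = prune_unchanged(old, new)
    remaining = sorted(p.name for p in old.iterdir())
    print(removed, remaining)
```

Swapping the two arguments prunes the new jar instead, which corresponds to keeping each pickle in the directory of the oldest version that produced it.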

I can see some merit to this proposal, however I would save only the pickles which actually changed. Otherwise you will end up with lots of copies of the same pickle.

+1; this is a good refinement of the last point in the ticket description. The comments above should take care of this.

Note that if the pickle_jar for 3.1 and 4.6.2 contain the same pickle X (version numbers just for the example), then I prefer to delete that of 3.1 and keep that of 4.6.2. Indeed, if X does not unpickle anymore with 4.7, then the relevant question is: "is it acceptable to not unpickle in 4.7 a pickle generated by 4.6.2?".

Do you mind rephrasing the ticket description accordingly, and then making a quick call for comments on sage-devel?

Thanks!

Cheers,
Nicolas

@jdemeyer

comment:7

Replying to @nthiery:

Note that if the pickle_jar for 3.1 and 4.6.2 contain the same pickle X (version numbers just for the example), then I prefer to delete that of 3.1 and keep that of 4.6.2.

If we use hg to track the pickles, I actually think it is better not to constantly move pickles from one version to another. So while I understand your point, from a practical point of view, I prefer to keep the pickle in the old directory of the old version.

@jdemeyer

comment:8

Replying to @nthiery:

+1, definitely! Actually I did not suggest it earlier because I was
worrying about the disk space usage, not for the Sage distribution but
for the Sage install.

Currently, the pickle jar contains 1174 files. Assuming each file takes 4kB of actual disk space, this would use a few megabytes. I don't think this is an issue.
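
Spelling out that estimate, under the stated assumption that every file occupies one 4 kB block on disk:

```python
files = 1174            # number of pickles quoted above
block_size = 4 * 1024   # assumed on-disk footprint per file, in bytes

total_bytes = files * block_size
print(f"{total_bytes} bytes, about {total_bytes / 1024**2:.1f} MiB")
```

Files larger than one block would raise the figure, so this is a lower bound for a file system with 4 kB blocks.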

I was also wondering whether this could possibly slow down
unpickle_all since this would require loading lots of little files
instead of slurping in one large archive. Any clue?

This would depend very much on the operating system and file system...
But yes, on some systems this will be slower. On the other hand, it could even speed up things by not having to decompress and untar.
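
One way to get a rough feel for this on a given machine is a small timing experiment like the sketch below. The file count matches the figure above, but the 500-byte file size and the file names are arbitrary choices, and the freshly written files will sit in the OS cache, so this measures only the warm-cache case.

```python
import tarfile
import tempfile
import time
from pathlib import Path


def read_loose(directory):
    """Read every file in directory; return {name: bytes}."""
    return {p.name: p.read_bytes() for p in Path(directory).iterdir()}


def read_archive(archive_path):
    """Read every regular member of a .tar.bz2 archive; return {name: bytes}."""
    with tarfile.open(archive_path, "r:bz2") as tar:
        return {m.name: tar.extractfile(m).read() for m in tar if m.isfile()}


with tempfile.TemporaryDirectory() as tmp:
    # Build the same toy jar twice: as loose files and as one archive.
    loose = Path(tmp) / "loose"
    loose.mkdir()
    for i in range(1174):
        (loose / f"pickle_{i}.sobj").write_bytes(bytes([i % 256]) * 500)
    archive = Path(tmp) / "jar.tar.bz2"
    with tarfile.open(archive, "w:bz2") as tar:
        for p in loose.iterdir():
            tar.add(p, arcname=p.name)

    t0 = time.perf_counter()
    from_loose = read_loose(loose)
    t1 = time.perf_counter()
    from_archive = read_archive(archive)
    t2 = time.perf_counter()
    print(f"loose files: {t1 - t0:.4f}s, tar.bz2: {t2 - t1:.4f}s")
```

The numbers depend on the operating system, the file system and the cache state, so they are only meaningful when compared on the same machine.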

@AndrewMathas
Member

comment:10

Hi Nicolas,

I want to add to your proposal that the pickle_jar be properly documented. As far as I am aware, there is currently no documentation on what the pickle jar is for, how it should be used, and what to do when a pickle breaks with

sage -t  devel/sage-sf/sage/structure/sage_object.pyx

for example. A non-trivial example for using register_unpickle_override should also be added.
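
To illustrate the general idea behind such an override, here is a plain-Python analogue built on the standard `pickle.Unpickler.find_class` hook. It is not Sage's `register_unpickle_override` and does not reproduce its signature; the names `OVERRIDES`, `OverridingUnpickler`, `legacy_geometry`, `OldPoint` and `NewPoint` are all invented for the example.

```python
import io
import pickle
import sys
import types


class NewPoint:
    """Current class that should receive data pickled as legacy_geometry.OldPoint."""

    def __init__(self, x, y):
        self.x, self.y = x, y


def make_legacy_pickle():
    """Pickle an instance of a class whose module is removed afterwards,
    simulating a pickle that refers to a class which has since moved."""
    legacy = types.ModuleType("legacy_geometry")

    class OldPoint:
        def __init__(self, x, y):
            self.x, self.y = x, y

    # Make the class look as if it were defined at top level of the old module.
    OldPoint.__module__ = "legacy_geometry"
    OldPoint.__qualname__ = "OldPoint"
    legacy.OldPoint = OldPoint
    sys.modules["legacy_geometry"] = legacy
    try:
        return pickle.dumps(OldPoint(1, 2))
    finally:
        del sys.modules["legacy_geometry"]


# Hypothetical registry: (old module, old name) -> replacement class.
OVERRIDES = {("legacy_geometry", "OldPoint"): NewPoint}


class OverridingUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Consult the registry first, then fall back to the normal lookup.
        try:
            return OVERRIDES[(module, name)]
        except KeyError:
            return super().find_class(module, name)


legacy_bytes = make_legacy_pickle()
restored = OverridingUnpickler(io.BytesIO(legacy_bytes)).load()
print(type(restored).__name__, restored.x, restored.y)
```

Without the registry entry, loading `legacy_bytes` fails because the module `legacy_geometry` no longer exists, which is the situation an override is meant to repair.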

Secondly, I think that the procedure for adding new pickles to the jar needs to be streamlined. Again, I don't believe that it is described anywhere when or how this happens, but I do know that there are many "new" classes which are not represented in the pickle_jar with the consequence that the pickle_jar is unable to check backward compatibility for these classes.

Andrew

@vbraun
Member

vbraun commented Jan 17, 2014

comment:11

Do we really put all that into the git repo? The current (incredibly old) pickle jar is about 2MB uncompressed. A new one is likely considerably larger. There are of the order of 10 minor Sage releases every year. I don't know how often the pickles change, but it seems likely that this'll generate on the order of 10MB/year that will be with us forever. The whole git repo is currently <100MB.
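
Putting the figures from this comment side by side, and treating each as a rough order of magnitude rather than a measurement:

```python
releases_per_year = 10    # "of the order of 10 minor Sage releases every year"
growth_mb_per_year = 10   # estimated pickle growth per year
repo_mb = 100             # upper bound quoted for the whole git repo

per_release_mb = growth_mb_per_year / releases_per_year
years_to_match_repo = repo_mb / growth_mb_per_year
print(f"{per_release_mb:.0f} MB of new pickles per release")
print(f"{years_to_match_repo:.0f} years until the pickles alone match the current repo size")
```

The 10MB/year estimate therefore corresponds to about 1MB of changed pickles per release.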

@nthiery
Contributor Author

nthiery commented Jan 24, 2014

comment:12

Hi Volker!

I don't have a good feel for the orders of magnitude. Yet, with the proposed protocol, pickles that don't change don't get duplicated between versions, and I'd expect that only a few pickles get changed from one version to the other (especially if we emphasize pickling by construction rather than by internal data structure). A good experiment would be to regenerate a new pickle jar, and see how much we have added to it since last time!

I don't have a strong opinion about whether the pickle jar should be maintained under git or not. If we can afford it, that makes things easier, as changes to the pickle jar can be done within the usual workflow. But if it's too big, it's too big.

Cheers,
Nicolas
