-
-
Notifications
You must be signed in to change notification settings - Fork 30k
-
-
Notifications
You must be signed in to change notification settings - Fork 30k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multivolume support in tarfile module #62521
Comments
The patch attached provides implementation for multivolume support for tarfile module. It contains both the changes in the module and a battery of unit tests. It contains support for multivolume for both GNU and PAX formats. The main idea behind this proposal is that for dealing with new volumes, the user will provide a callback function. In this callback function the user typically calls to TarFile.open_volume(filename) with the appropiate filename. This is quite flexible in the sense that the callback function could even ask the user for the filename of the next volume (as tar command does), or do anything they need before returning or before calling to open_volume. Please feel free to comment on how to improve the implementation or the API. Documentation for the new feature will be provided at a later stage of the review process if the patch gets a good review. |
Eduardo, can you please fill out the contributor form? |
Sure, I will fill it out. But is it required? |
could you please check if my contributor form is already processed? |
You have an asterisk next to your user name. So yes, your form was received and processed. |
Do we have any news on this patch? |
Lars, would you like to take a look? |
I cannot yet go into the details, because I have not tested the patch. The comments, docstrings and quoting are not very consistent with the rest of the module. There are a few spelling mistakes. The open_volume() method is more or less a copy of the open() method which is not optimal. The patch adds a lot of complexity to the tarfile module for a use case that only a few connoisseurs benefit from. It seems to alter some inherent TarFile mechanics that people might rely on, e.g. the members attribute contains only the members stored in the current volume not the overall entirety of members. Does this patch reliably allow random-access? Would it be possible/easier to add the same functionality using a separate class MultiVolumeTarFile instead? |
I can try to take care of this, though it'd be best if you can point me out concrete examples.
I know, but trying to do something else might be even worse..
Well, that doesn't make much sense to me. You say that people rely on something like "members attribute contains only the members stored in the current volume not the overall entirety of members", but as you know, python tarfile lib didn't support volumes before this patch, so I guess people couldn't be relying on that anyway. Also please bear in mind that tarfile volumes support is part of the tar standard, and altough not everyone uses this, it has its niche uses (sliced backups etc).
Yes and no. No: when you open a volume for reading, the list of members is cleared as you pointed so you cannot access to them. Yes: you can open any volume at the begining of a file, and it read the tarfile from there. I do that in my use-case.
If you find open_volume similar to open() and don't like, I'm quite sure you would like all the needed copy-paste having this a separate class would entail. Also as I said before, might not make much sense as this is part of the standard. |
At first, I'd like to take back my comment on this patch being too complex for too little benefit. That is no real argument. Okay, I gave it a shot and I have a few more remarks: The patch does not support iterating over a multi-volume tar archive, e.g. for TarFile.list(). You must implement this. In my opinion, a tar archive is one logical unit even if it spans across multiple volumes. Thus, it is vital to have .getmembers() and .getnames() reflect the entirety of the archive, e.g. to support "if filename in .getnames()". I think it could be a good idea to store the volume number along each TarInfo object for random-access. By the way, which standard are you referring to? The only one I know of is POSIX pax which doesn't say anything about multiple volumes. |
I had the following idea: What about a separate class, let's call it TarVolumeSet for now, that maps a set of (virtual) volumes onto one big file-like object. This TarVolumeSet will be passed to a TarFile constructor as the fileobj argument. It is subclassable for implementing custom behavior. class MyTarVolumeSet(tarfile.TarVolumeSet):
def __init__(self, template):
self.template = template
def open_volume(self, volume_number):
return open(self.template % volume_number, "rb")
volumes = MyTarVolumesSet("test.tar.%03d")
with tarfile.open(fileobj=volumes, mode="r:") as tar:
for t in tar:
print(t.name) In my opinion, this approach has a number of advantages: Most importantly, it separates the multi-volume code from the TarFile class, which reduces the invasiveness, complexity and maintenance burden of the original approach. The TarFile class would be totally agnostic about the concept of multiple volumes, TarVolumeSet looks just like another file-object to TarFile. Looks like the cleanest solution to me so far. |
I guess I got it wrong, it's not part of the POSIX standard, just part of the GNU tar documentation. About the getmembers and getnames not reflecting the entirety of the archive, it's an optimization I needed and I think ccan be quite handy. It's also consistent with how the tar command works afaik, just listing the contents of the current volume. But it's true that could be done. Regarding the TarVolumeSet idea, it could work but it's not as easy as that. You don't want to directly do a plain open in there, because you want to be able to deal with read/write modes, with gzip/bzip/Stream class. You also want to give the user maximum flexibility. That's why in my implementation new_volume_handler and open_volume functions are separated. Similarly, we could do something like this: class MyTarVolumeSet(tarfile.TarVolumeSet):
def __init__(self, template, max_vol_size):
self.template = template
self.max_vol_size = max_vol_size
def new_volume_handler(self, tarfile, volume_number):
self.open_volume(self.template % volume_number, tarfile)
volumes = MyTarVolumesSet("test.tar.%03d")
with tarfile.open(fileobj=volumes, mode="w:") as tar:
for t in tar:
print(t.name) Note that the new_volume_handler in this example receives more or less the same params as in my patch (not the base name, because it's already stored in the template), and that the open_volume really abstracts which way it will be opened. It could also have, as in my patch, an optional fileobj parameter, for a new indirection/abstraction. In any case, this kind of abstraction would still really need some kind of hooking with tarfile, because a multivol tarfile is not exactly the same as a normal tarfile chopped up. This might be doable unilateraly by a smart TarVolumeSet getting the information from the tarfile and modifying/patching anything needed. This is one of the reasons why one would probably would still need access to the tarfile inside the open_volume function. |
No, GNU tar operates on the entirety of the archive and asks for the filename of the subsequent volume every time it hits eof in the current volume.
The example I gave is based on the idea that there is a TarVolumeSet class in the tarfile module that implements all the required file-object methods (e.g. read(), write(), seek(), etc.) and acts as if the sequence of volumes is actually one big file. It is passed to tarfile.open() as the fileobj argument. This TarVolumeSet class is supposed to be subclassable to let the user implement her/his own mode of operation. This way the open_volume() method can do whatever the user wants it to do. The TarVolumeSet class might as well have a new_volume() method for writing multivol archives, the example only covered the case of reading a multivol archive. BTW, my version of GNU tar refuses to create compressed multiple-volume archives which is why I doubt the usefulness of this feature overall.
No, I think it is exactly that. The only purpose of the GNUTYPE_MULTIVOL header that is at the start of each subsequent volume is to give GNU tar the ability to detect if it is reading the correct volume. It is not essential and could as well be left out. |
This is not so easy, because for example if you want to move the logic in addfile() that manages volumes to the write() function, you have some issues:
But it has multivolume support right? Which is what I am proposing here. Also, you can gzip (or encrypt or anything) the volumes after creating the volumes..
I'm not going to discuss this, because I think that we can implement it in the way you propose to implement it anyway. But my patch supports it and I think it's an *useful* feature, so I want it in :-P |
Hopefully? Sorry, but have you tested this? I did. I let GNU tar create a two volume archive that is split exactly between the two blocks of an XHDTYPE pax header. The result is terrifying. At the beginning of the second volume GNU tar creates an XGLTYPE header as the pax replacement for a GNUTYPE_MULTIVOL header, followed by an XHDTYPE header ("GNUFileParts") that somehow decorates the following REGTYPE(!) tar header that contains the continuation of the split XHDTYPE header data from the previous volume. After that comes the REGTYPE file that the split XHDTYPE header was actually meant for as decoration. I attached the archive to this issue. What happens if a GNUTYPE_LONGNAME header is split in two? I don't wanna know...
It is mandatory to do the split on a block boundary (a multiple of 512).
I don't know and this problem seems to be quite hard to address with my approach. That's too bad.
Yeah, it has multivolume support, but a very limited one that is not only weird but isn't even usable together with compression. And sure, I can compress and encrypt the volumes afterwards, but I can also create a compressed archive and pipe it through split(1) to split it into parts. Both ways create tar archives that are not readable by GNU tar because they're non-standard. So what? Please tell me, what is your actual personal use-case for this feature? |
I'm willing modify the patch to remove the "weirdness" you refer to. I differ on that it's not usable: it might not be useful to you, but it's certainly a feature that covers part of the functionality of GNU tar. Actually, some of the unit tests are like this: use GNU Tar to compress, then extract with tarfile - and viceversa. Of course you can use split. And I could use Ruby or Perl, but I'm using python and tarfile, and this is a GNU tar feature that is just not supported in python tarfile upstream, and I'm just trying to contribute this feature, if possible :-). BTW, If I create a multivol tar file and then compress the volumes, that does not make it "non-standard", in the same way that if I create a PNG file and then compress it and then store it in EXTFS, it doesn't make it non-standard. I'm just using multiple layers of standards. I'm a contractor, and I have been asked by a client to develop a python-based backup tool. The client is technical and had already an idea of what he wanted to do: use python-tarfile and add support to multivolume and some other goodies, and the client also wanted to try to push the changes upstream as we believe it is the correct thing to do. BTW, when we designed the backup tool, we ruled out the possibility of using split because split wouldn't allow to correctly list all the files in each file-slice separately. We wanted to be able to recover all the files of each "volume" so that if we lose other volumes, we can still recover all the data from the volumes we have. Anyway, if you are the maintainer of tarfile and you think it's not possible to push tar-multivolume support upstream in python tarfile for whatever reason, please tell me. |
Okay, let me tell you why I reject your contribution at this point. The patch you submitted may be well-suited for your purposes but it does not meet the requirements of a standard library implementation because it is not generic and comprehensive enough. It contains duplicate code, spelling mistakes and needless code changes e.g. in test_tarfile.py. It does not expose one set of volumes as one tar archive to the user. It is not possible to iterate over all members of all volumes in one go. It does not allow random-access. Actually, it does not implement complete multivolume support but only the "easy" parts. For example, it fails to read GNU tar archives that are split in the middle of a pax header block sequence. The other way around, when writing it makes a split only when it is inside the data part of a member. Hence, it is possible that a volume turns out smaller than max_volume_size which is not only inaccurate but also bad on a tape device. If you decide that you still want multivolume support in tarfile, feel free to reopen this issue with a new and significantly better patch. I gave you a number of clues on what I think is required. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: