Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multivolume support in tarfile module #62521

Closed
edulix mannequin opened this issue Jun 28, 2013 · 17 comments
Closed

Multivolume support in tarfile module #62521

edulix mannequin opened this issue Jun 28, 2013 · 17 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@edulix
Copy link
Mannequin

edulix mannequin commented Jun 28, 2013

BPO 18321
Nosy @loewis, @gustaebel, @tiran, @serhiy-storchaka
Files
  • cpython-tarfile-multivolume.patch: Patch providing the code
  • split-xhdtype.tar.gz
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/gustaebel'
    closed_at = <Date 2014-04-14.11:20:09.422>
    created_at = <Date 2013-06-28.08:06:17.009>
    labels = ['type-feature', 'library']
    title = 'Multivolume support in tarfile module'
    updated_at = <Date 2014-04-14.11:20:09.420>
    user = 'https://bugs.python.org/edulix'

    bugs.python.org fields:

    activity = <Date 2014-04-14.11:20:09.420>
    actor = 'lars.gustaebel'
    assignee = 'lars.gustaebel'
    closed = True
    closed_date = <Date 2014-04-14.11:20:09.422>
    closer = 'lars.gustaebel'
    components = ['Library (Lib)']
    creation = <Date 2013-06-28.08:06:17.009>
    creator = 'edulix'
    dependencies = []
    files = ['30720', '34798']
    hgrepos = []
    issue_num = 18321
    keywords = ['patch']
    message_count = 17.0
    messages = ['191979', '191993', '193811', '200913', '200914', '209418', '209437', '209633', '209635', '209711', '209997', '213141', '213243', '216007', '216015', '216036', '216073']
    nosy_count = 5.0
    nosy_names = ['loewis', 'lars.gustaebel', 'christian.heimes', 'serhiy.storchaka', 'edulix']
    pr_nums = []
    priority = 'normal'
    resolution = 'rejected'
    stage = 'patch review'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue18321'
    versions = ['Python 3.5']

    @edulix
    Copy link
    Mannequin Author

    edulix mannequin commented Jun 28, 2013

    The patch attached provides implementation for multivolume support for tarfile module. It contains both the changes in the module and a battery of unit tests. It contains support for multivolume for both GNU and PAX formats.

    The main idea behind this proposal is that for dealing with new volumes, the user will provide a callback function. In this callback function the user typically calls to TarFile.open_volume(filename) with the appropiate filename. This is quite flexible in the sense that the callback function could even ask the user for the filename of the next volume (as tar command does), or do anything they need before returning or before calling to open_volume.

    Please feel free to comment on how to improve the implementation or the API. Documentation for the new feature will be provided at a later stage of the review process if the patch gets a good review.

    @edulix edulix mannequin added the stdlib Python modules in the Lib dir label Jun 28, 2013
    @serhiy-storchaka serhiy-storchaka added the type-feature A feature request or enhancement label Jun 28, 2013
    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Jun 28, 2013

    Eduardo, can you please fill out the contributor form?

    @edulix
    Copy link
    Mannequin Author

    edulix mannequin commented Jul 28, 2013

    Sure, I will fill it out. But is it required?

    @edulix
    Copy link
    Mannequin Author

    edulix mannequin commented Oct 22, 2013

    could you please check if my contributor form is already processed?

    @tiran
    Copy link
    Member

    tiran commented Oct 22, 2013

    You have an asterisk next to your user name. So yes, your form was received and processed.

    @edulix
    Copy link
    Mannequin Author

    edulix mannequin commented Jan 27, 2014

    Do we have any news on this patch?

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Jan 27, 2014

    Lars, would you like to take a look?

    @gustaebel
    Copy link
    Mannequin

    gustaebel mannequin commented Jan 29, 2014

    I cannot yet go into the details, because I have not tested the patch.

    The comments, docstrings and quoting are not very consistent with the rest of the module. There are a few spelling mistakes. The open_volume() method is more or less a copy of the open() method which is not optimal.

    The patch adds a lot of complexity to the tarfile module for a use case that only a few connoisseurs benefit from. It seems to alter some inherent TarFile mechanics that people might rely on, e.g. the members attribute contains only the members stored in the current volume not the overall entirety of members. Does this patch reliably allow random-access?

    Would it be possible/easier to add the same functionality using a separate class MultiVolumeTarFile instead?

    @edulix
    Copy link
    Mannequin Author

    edulix mannequin commented Jan 29, 2014

    I cannot yet go into the details, because I have not tested the patch.

    The comments, docstrings and quoting are not very consistent with the rest of the module. There are a few spelling mistakes.

    I can try to take care of this, though it'd be best if you can point me out concrete examples.

    The open_volume() method is more or less a copy of the open() method which is not optimal.

    I know, but trying to do something else might be even worse..

    The patch adds a lot of complexity to the tarfile module for a use case that only a few connoisseurs benefit from. It seems to alter some inherent TarFile mechanics that people might rely on, e.g. the members attribute contains only the members stored in the current volume not the overall entirety of members.

    Well, that doesn't make much sense to me. You say that people rely on something like "members attribute contains only the members stored in the current volume not the overall entirety of members", but as you know, python tarfile lib didn't support volumes before this patch, so I guess people couldn't be relying on that anyway.

    Also please bear in mind that tarfile volumes support is part of the tar standard, and altough not everyone uses this, it has its niche uses (sliced backups etc).

    Does this patch reliably allow random-access?

    Yes and no.

    No: when you open a volume for reading, the list of members is cleared as you pointed so you cannot access to them.

    Yes: you can open any volume at the begining of a file, and it read the tarfile from there. I do that in my use-case.

    Would it be possible/easier to add the same functionality using a separate class MultiVolumeTarFile instead?

    If you find open_volume similar to open() and don't like, I'm quite sure you would like all the needed copy-paste having this a separate class would entail. Also as I said before, might not make much sense as this is part of the standard.

    @gustaebel
    Copy link
    Mannequin

    gustaebel mannequin commented Jan 30, 2014

    At first, I'd like to take back my comment on this patch being too complex for too little benefit. That is no real argument.

    Okay, I gave it a shot and I have a few more remarks:

    The patch does not support iterating over a multi-volume tar archive, e.g. for TarFile.list(). You must implement this.

    In my opinion, a tar archive is one logical unit even if it spans across multiple volumes. Thus, it is vital to have .getmembers() and .getnames() reflect the entirety of the archive, e.g. to support "if filename in .getnames()". I think it could be a good idea to store the volume number along each TarInfo object for random-access.

    By the way, which standard are you referring to? The only one I know of is POSIX pax which doesn't say anything about multiple volumes.

    @gustaebel
    Copy link
    Mannequin

    gustaebel mannequin commented Feb 2, 2014

    I had the following idea: What about a separate class, let's call it TarVolumeSet for now, that maps a set of (virtual) volumes onto one big file-like object. This TarVolumeSet will be passed to a TarFile constructor as the fileobj argument. It is subclassable for implementing custom behavior.

    class MyTarVolumeSet(tarfile.TarVolumeSet):
    
        def __init__(self, template):
            self.template = template
    
        def open_volume(self, volume_number):
            return open(self.template % volume_number, "rb")
    
    volumes = MyTarVolumesSet("test.tar.%03d")
    with tarfile.open(fileobj=volumes, mode="r:") as tar:
        for t in tar:
            print(t.name)

    In my opinion, this approach has a number of advantages: Most importantly, it separates the multi-volume code from the TarFile class, which reduces the invasiveness, complexity and maintenance burden of the original approach. The TarFile class would be totally agnostic about the concept of multiple volumes, TarVolumeSet looks just like another file-object to TarFile. Looks like the cleanest solution to me so far.

    @edulix
    Copy link
    Mannequin Author

    edulix mannequin commented Mar 11, 2014

    I guess I got it wrong, it's not part of the POSIX standard, just part of the GNU tar documentation. About the getmembers and getnames not reflecting the entirety of the archive, it's an optimization I needed and I think ccan be quite handy. It's also consistent with how the tar command works afaik, just listing the contents of the current volume. But it's true that could be done.

    Regarding the TarVolumeSet idea, it could work but it's not as easy as that. You don't want to directly do a plain open in there, because you want to be able to deal with read/write modes, with gzip/bzip/Stream class. You also want to give the user maximum flexibility. That's why in my implementation new_volume_handler and open_volume functions are separated. Similarly, we could do something like this:

    class MyTarVolumeSet(tarfile.TarVolumeSet):
    
        def __init__(self, template, max_vol_size):
            self.template = template
            self.max_vol_size = max_vol_size
    
        def new_volume_handler(self, tarfile, volume_number):
            self.open_volume(self.template % volume_number, tarfile)
    
    volumes = MyTarVolumesSet("test.tar.%03d")
    with tarfile.open(fileobj=volumes, mode="w:") as tar:
        for t in tar:
            print(t.name)

    Note that the new_volume_handler in this example receives more or less the same params as in my patch (not the base name, because it's already stored in the template), and that the open_volume really abstracts which way it will be opened. It could also have, as in my patch, an optional fileobj parameter, for a new indirection/abstraction.

    In any case, this kind of abstraction would still really need some kind of hooking with tarfile, because a multivol tarfile is not exactly the same as a normal tarfile chopped up. This might be doable unilateraly by a smart TarVolumeSet getting the information from the tarfile and modifying/patching anything needed. This is one of the reasons why one would probably would still need access to the tarfile inside the open_volume function.

    @gustaebel
    Copy link
    Mannequin

    gustaebel mannequin commented Mar 12, 2014

    It's also consistent with how the tar command works afaik, just listing the contents of the current volume.

    No, GNU tar operates on the entirety of the archive and asks for the filename of the subsequent volume every time it hits eof in the current volume.

    You don't want to directly do a plain open in there, because you want to be able to deal with read/write modes, with gzip/bzip/Stream class.

    The example I gave is based on the idea that there is a TarVolumeSet class in the tarfile module that implements all the required file-object methods (e.g. read(), write(), seek(), etc.) and acts as if the sequence of volumes is actually one big file. It is passed to tarfile.open() as the fileobj argument. This TarVolumeSet class is supposed to be subclassable to let the user implement her/his own mode of operation. This way the open_volume() method can do whatever the user wants it to do. The TarVolumeSet class might as well have a new_volume() method for writing multivol archives, the example only covered the case of reading a multivol archive.

    BTW, my version of GNU tar refuses to create compressed multiple-volume archives which is why I doubt the usefulness of this feature overall.

    [...] because a multivol tarfile is not exactly the same as a normal tarfile chopped up.

    No, I think it is exactly that. The only purpose of the GNUTYPE_MULTIVOL header that is at the start of each subsequent volume is to give GNU tar the ability to detect if it is reading the correct volume. It is not essential and could as well be left out.

    @edulix
    Copy link
    Mannequin Author

    edulix mannequin commented Apr 13, 2014

    The example I gave is based on the idea that there is a TarVolumeSet class in the tarfile module that implements all the required file-object methods (e.g. read(), write(), seek(), etc.) and acts as if the sequence of volumes is actually one big file. It is passed to tarfile.open() as the fileobj argument. This TarVolumeSet class is supposed to be subclassable to let the user implement her/his own mode of operation. This way the open_volume() method can do whatever the user wants it to do. The TarVolumeSet class might as well have a new_volume() method for writing multivol archives, the example only covered the case of reading a multivol archive.

    This is not so easy, because for example if you want to move the logic in addfile() that manages volumes to the write() function, you have some issues:

    • write() will need to take into account blocks (BLOCKSIZE), just to be able to split the volumes correctly. This is something that usually shouldn't belong in a write() function as it has to do to tarfile.. and it can be messy that both layers deal with it (write() splitting volumes, and tarfile adding NUL at the end of a BLOCK..) This can be done I guess, but remember, we split a volume only in the middle of a big file, not in any other case (AFAIK). Hopefully you don't get huge pax headers or anything strange. Usually all other records are either in the begining or just occupy one block, but can we rule this problem out for good?

    • multivolume logic in write() needs read/write access to the current tarinfo being written (look for "tarfile" in addfile() funcion in my patch to see why). How do you propose this object should be accessed from write()?

    BTW, my version of GNU tar refuses to create compressed multiple-volume archives which is why I doubt the usefulness of this feature overall.

    But it has multivolume support right? Which is what I am proposing here. Also, you can gzip (or encrypt or anything) the volumes after creating the volumes..

    > [...] because a multivol tarfile is not exactly the same as a normal tarfile chopped up.
    No, I think it is exactly that. The only purpose of the GNUTYPE_MULTIVOL header that is at the start of each subsequent volume is to give GNU tar the ability to detect if it is reading the correct volume. It is not essential and could as well be left out.

    I'm not going to discuss this, because I think that we can implement it in the way you propose to implement it anyway. But my patch supports it and I think it's an *useful* feature, so I want it in :-P

    @gustaebel
    Copy link
    Mannequin

    gustaebel mannequin commented Apr 13, 2014

    [...] but remember, we split a volume only in the middle of a big file, not in any other case (AFAIK). Hopefully you don't get huge pax headers or anything strange. [...]

    Hopefully? Sorry, but have you tested this? I did. I let GNU tar create a two volume archive that is split exactly between the two blocks of an XHDTYPE pax header.

    The result is terrifying. At the beginning of the second volume GNU tar creates an XGLTYPE header as the pax replacement for a GNUTYPE_MULTIVOL header, followed by an XHDTYPE header ("GNUFileParts") that somehow decorates the following REGTYPE(!) tar header that contains the continuation of the split XHDTYPE header data from the previous volume. After that comes the REGTYPE file that the split XHDTYPE header was actually meant for as decoration.

    I attached the archive to this issue.

    What happens if a GNUTYPE_LONGNAME header is split in two? I don't wanna know...

    write() will need to take into account blocks (BLOCKSIZE), just to be able to split the volumes correctly.

    It is mandatory to do the split on a block boundary (a multiple of 512).

    • multivolume logic in write() needs read/write access to the current tarinfo being written [...]. How do you propose this object should be accessed from write()?

    I don't know and this problem seems to be quite hard to address with my approach. That's too bad.

    > BTW, my version of GNU tar refuses to create compressed multiple-volume archives which is why I doubt the usefulness of this feature overall.
    But it has multivolume support right? Which is what I am proposing here. Also, you can gzip (or encrypt or anything) the volumes after creating the volumes..

    Yeah, it has multivolume support, but a very limited one that is not only weird but isn't even usable together with compression. And sure, I can compress and encrypt the volumes afterwards, but I can also create a compressed archive and pipe it through split(1) to split it into parts. Both ways create tar archives that are not readable by GNU tar because they're non-standard. So what?

    Please tell me, what is your actual personal use-case for this feature?

    @edulix
    Copy link
    Mannequin Author

    edulix mannequin commented Apr 13, 2014

    > [...] but remember, we split a volume only in the middle of a big file, not in any other case (AFAIK). Hopefully you don't get huge pax headers or anything strange. [...]
    Hopefully? Sorry, but have you tested this? I did. I let GNU tar create a two volume archive that is split exactly between the two blocks of an XHDTYPE pax header.

    The result is terrifying. At the beginning of the second volume GNU tar creates an XGLTYPE header as the pax replacement for a GNUTYPE_MULTIVOL header, followed by an XHDTYPE header ("GNUFileParts") that somehow decorates the following REGTYPE(!) tar header that contains the continuation of the split XHDTYPE header data from the previous volume. After that comes the REGTYPE file that the split XHDTYPE header was actually meant for as decoration.

    I attached the archive to this issue.

    What happens if a GNUTYPE_LONGNAME header is split in two? I don't wanna know...

    > write() will need to take into account blocks (BLOCKSIZE), just to be able to split the volumes correctly.

    It is mandatory to do the split on a block boundary (a multiple of 512).
    >> BTW, my version of GNU tar refuses to create compressed multiple-volume archives which is why I doubt the usefulness of this feature overall.
    > But it has multivolume support right? Which is what I am proposing here. Also, you can gzip (or encrypt or anything) the volumes after creating the volumes..

    Yeah, it has multivolume support, but a very limited one that is not only weird but isn't even usable together with compression. And sure, I can compress and encrypt the volumes afterward, but I can also create a compressed archive and pipe it through split(1) to split it into parts. Both ways create tar archives that are not readable by GNU tar because they're non-standard. So what?

    Please tell me, what is your actual personal use-case for this feature?

    I'm willing modify the patch to remove the "weirdness" you refer to. I differ on that it's not usable: it might not be useful to you, but it's certainly a feature that covers part of the functionality of GNU tar. Actually, some of the unit tests are like this: use GNU Tar to compress, then extract with tarfile - and viceversa.

    Of course you can use split. And I could use Ruby or Perl, but I'm using python and tarfile, and this is a GNU tar feature that is just not supported in python tarfile upstream, and I'm just trying to contribute this feature, if possible :-).

    BTW, If I create a multivol tar file and then compress the volumes, that does not make it "non-standard", in the same way that if I create a PNG file and then compress it and then store it in EXTFS, it doesn't make it non-standard. I'm just using multiple layers of standards.

    I'm a contractor, and I have been asked by a client to develop a python-based backup tool. The client is technical and had already an idea of what he wanted to do: use python-tarfile and add support to multivolume and some other goodies, and the client also wanted to try to push the changes upstream as we believe it is the correct thing to do.

    BTW, when we designed the backup tool, we ruled out the possibility of using split because split wouldn't allow to correctly list all the files in each file-slice separately. We wanted to be able to recover all the files of each "volume" so that if we lose other volumes, we can still recover all the data from the volumes we have.

    Anyway, if you are the maintainer of tarfile and you think it's not possible to push tar-multivolume support upstream in python tarfile for whatever reason, please tell me.

    @gustaebel
    Copy link
    Mannequin

    gustaebel mannequin commented Apr 14, 2014

    Okay, let me tell you why I reject your contribution at this point.

    The patch you submitted may be well-suited for your purposes but it does not meet the requirements of a standard library implementation because it is not generic and comprehensive enough.

    It contains duplicate code, spelling mistakes and needless code changes e.g. in test_tarfile.py.

    It does not expose one set of volumes as one tar archive to the user. It is not possible to iterate over all members of all volumes in one go. It does not allow random-access.

    Actually, it does not implement complete multivolume support but only the "easy" parts. For example, it fails to read GNU tar archives that are split in the middle of a pax header block sequence. The other way around, when writing it makes a split only when it is inside the data part of a member. Hence, it is possible that a volume turns out smaller than max_volume_size which is not only inaccurate but also bad on a tape device.

    If you decide that you still want multivolume support in tarfile, feel free to reopen this issue with a new and significantly better patch. I gave you a number of clues on what I think is required.

    @gustaebel gustaebel mannequin closed this as completed Apr 14, 2014
    @gustaebel gustaebel mannequin self-assigned this Apr 14, 2014
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants