Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Creating nested path hierarchies and empty directories inside the ZIP archive? #55

Closed
Arcitec opened this issue Jun 22, 2023 · 12 comments
Closed

Comments

@Arcitec
Copy link

Arcitec commented Jun 22, 2023

Hey, small question.

From what I've observed from the ZIP spec, it seems like the "directory hierarchy" of ZIP files are just an illusion, and that it's really just POSIX-style names (forward slash separator), where each file has its own full path embedded straight into the file "name" data (no "actual nested folders"), such as name = foo/bar/baz/readme.txt. I've also noticed that empty directories are stored as "files" which are named things like foo/bar/baz/ with a trailing slash at the end, and with a DATA LENGTH of ZERO. The trailing slash seems to be the only signifier that a "file" is actually a directory?

For instance, when I created a ZIP with the standard utilities on Linux, I observed that it has an entry for the "parent folder" as a standalone ZIP entry, before it has an entry for the actual file inside that folder.

The hierarchy in this example ZIP is:

  • A folder\with backslashes: This is the actual folder name, it's a POSIX name so backslash is legal. And it's stored in the ZIP as A folder\with backslashes/
  • A folder\with backslashes/my file\withbackslash.txt: This is the filename within the folder. It's just a file with a backslash in the name.

Here's my test ZIP:

backslashes.zip

So with that backstory about how directories seem to work in ZIPs, I'm very curious about the correct ways to handle directories in stream-zip:

  1. If I want to create a deeply nested directory, is there anything I need to think about and be careful about? Or can I just immediately do name = foo/bar/baz/readme.txt without having to manually create the parent path components (foo/, foo/bar/ and foo/bar/baz/)?
  2. Is it safe to do interleaved writes, like creating foo/a/1.txt, then bar/b/1.txt, then foo/a/2.txt? Meaning is it safe to add directories in a jumbled order to the ZIP file contents? I assume that stream-zip remembers which directories it has already written to the ZIP and doesn't create those paths again.
  3. And lastly... is there any way to add empty directories? Sometimes there are empty ones in the data I am packing, and I'd prefer to preserve that empty hierarchy if it's possible. Using some kind of "if directory doesn't exist in ZIP yet, create an empty one in the ZIP"? I thought of possibly creating a file named the/directory/ with a trailing slash and providing b"" (empty bytes) as the data?
  4. If the idea in point 3 would work, would there be any risk of clashing with or overwriting any actual files/subfolders from that hierachy if they already existed? Or any other failures, such as it generating new directory entries for the same "directory file/" over and over again? (i.e. if a/b/ was already auto-generated and added earlier, and I then manually try to write a/b/, I guess I'd end up with a duplicate directory entry)?
  5. If it's not currently possible to add empty directories, is there any chance of some API for adding empty directories? It could be as simple as "if the user provides a tuple which ONLY contains a name and timestamp and nothing else, add that as a directory entry to the ZIP, if it hasn't been added before".
@Arcitec
Copy link
Author

Arcitec commented Jun 22, 2023

Oh, I just had a look inside a ZIP createad with stream-zip and it doesn't create empty entries for the other directories in the hierarchy. Very interesting. The entries went straight to the deeply nested file:

somefile.dat
deeply/nested/file.dat

Whereas if I had created the same ZIP on Linux with regular tools, I would have had these entries in the ZIP:

somefile.dat
deeply/
deeply/nested/
deeply/nested/file.dat

But I suppose that the ZIP spec is flexible enough that it doesn't need the intermediary "this is a parent folder on the road to glory" stuff.

So if this is the case, I suppose that if I want to add empty folders, it's as easy as something like this:

  1. Do a pass through all files only, and add them to the archive.
  2. Do a pass through all leafs (deepest path nodes) regardless of type, and then check if is_dir() and len(children) == 0 to find all empty folders that weren't added yet (due to not containing any files), and then manually insert those empty directories all at the end of the archive via simple foo/bar/thisemptyfolder/ entries at the end, with 0-byte b"" data.

Would something like that solve all of these issues?

@michalc
Copy link
Member

michalc commented Jun 23, 2023

So if this is the case, I suppose that if I want to add empty folders, it's as easy as something like this:

Do a pass through all files only, and add them to the archive.
Do a pass through all leafs (deepest path nodes) regardless of type, and then check if is_dir() and len(children) == 0 to find all empty folders that weren't added yet (due to not containing any files), and then manually insert those empty directories all at the end of the archive via simple foo/bar/thisemptyfolder/ entries at the end, with 0-byte b"" data.

Would something like that solve all of these issues?

Sounds reasonable to me. I guess you don't absolutely have to have two passes - you can in one pass output files, while maintaining a list of folders that don't have any children. Or just output all folders like it sounds like other tools do. But that's a detail.

I have to admit I haven't tested zipping folders much, especially empty folders - it was never really a priority. So I suspect would be good to check the output using a few different unzip-ers, just in case.

@michalc
Copy link
Member

michalc commented Jun 23, 2023

If it's not currently possible to add empty directories, is there any chance of some API for adding empty directories?

On this - I think ending the name in a forward slash should create a directory: 66c500b

@Arcitec
Copy link
Author

Arcitec commented Jun 23, 2023

Hi Michal! I've been digging deeply into this, so before answering, I'll just separately dump the information I've dug up here, for reference since it could be useful (you mentioned not looking much into directory storage in ZIPs)! :)

ZIP Specification

   4.3.8  File data

      Immediately following the local header for a file
      SHOULD be placed the compressed or stored data for the file.
      If the file is encrypted, the encryption header for the file 
      SHOULD be placed after the local header and before the file 
      data. The series of [local file header][encryption header]
      [file data][data descriptor] repeats for each file in the 
      .ZIP archive. 

      Zero-byte files, directories, and other file types that 
      contain no content MUST NOT include file data.
   4.4.17 file name: (Variable)

       4.4.17.1 The name of the file, with optional relative path.
       The path stored MUST NOT contain a drive or
       device letter, or a leading slash.  All slashes
       MUST be forward slashes '/' as opposed to
       backwards slashes '\' for compatibility with Amiga
       and UNIX file systems etc.  If input came from standard
       input, there is no file name field.  
 4.4.3 version needed to extract (2 bytes)

        4.4.3.1 The minimum supported ZIP specification version needed 
        to extract the file, mapped as above.  This value is based on 
        the specific format features a ZIP program MUST support to 
        be able to extract the file.  If multiple features are
        applied to a file, the minimum version MUST be set to the 
        feature having the highest value. New features or feature 
        changes affecting the published format specification will be 
        implemented using higher version numbers than the last 
        published value to avoid conflict.

        4.4.3.2 Current minimum feature versions are as defined below:

         1.0 - Default value
         1.1 - File is a volume label
         2.0 - File is a folder (directory)
         2.0 - File is compressed using Deflate compression
         2.0 - File is encrypted using traditional PKWARE encryption
         2.1 - File is compressed using Deflate64(tm)
         2.5 - File is compressed using PKWARE DCL Implode 
         2.7 - File is a patch data set 
         4.5 - File uses ZIP64 format extensions
         4.6 - File is compressed using BZIP2 compression*
         5.0 - File is encrypted using DES
         5.0 - File is encrypted using 3DES
         5.0 - File is encrypted using original RC2 encryption
         5.0 - File is encrypted using RC4 encryption
         5.1 - File is encrypted using AES encryption
         5.1 - File is encrypted using corrected RC2 encryption**
         5.2 - File is encrypted using corrected RC2-64 encryption**
         6.1 - File is encrypted using non-OAEP key wrapping***
         6.2 - Central directory encryption
         6.3 - File is compressed using LZMA
         6.3 - File is compressed using PPMd+
         6.3 - File is encrypted using Blowfish
         6.3 - File is encrypted using Twofish
   4.4.15 external file attributes: (4 bytes)

       The mapping of the external attributes is
       host-system dependent (see 'version made by').  For
       MS-DOS, the low order byte is the MS-DOS directory
       attribute byte.  If input came from standard input, this
       field is set to zero.

This document taught us a few things:

  • PKZIP added support for directories ("folders") in version 2.0 of their application. However, that 2.0 version identifier does NOT have to be used by the zipped data. In fact, when using Linux utilities to make ZIPs of folders, the "minimum version required" field for the local file headers of the directory entries are set to 0A (10) 00 (0) as seen here (the "version required" is 2 bytes according to the spec), regardless of whether we interpret the two bytes as Little Endian or Big Endian, so that means the value is "10" (which is then cast into a string and split in the middle, so "10" decimal means version 1.0 of the ZIP format, which is before directory support was even added to the spec itself, hehe), and this means that the "minimum version required" attribute of the file is totally useless for the purpose of identifying folders and doesn't have any role in marking folders:

image

  • I also found another person who had noticed the 0A (10) 00 (0) version value for directories in their own ZIPs.
  • The spec confirms that forward slashes are the only correct directory separator.
  • It says nothing about how to mark empty directories.
  • But it makes it very clear that directories, if included, MUST BE ENCODED AS ZERO-BYTE files. So that part is now fully confirmed.

External File Attributes

  • The ZIP spec allows the appending of "external file attributes" to the zipped data. They are host OS-specific, for the OS that created the ZIP file.
   4.4.15 external file attributes: (4 bytes)

       The mapping of the external attributes is
       host-system dependent (see 'version made by').  For
       MS-DOS, the low order byte is the MS-DOS directory
       attribute byte.  If input came from standard input, this
       field is set to zero.

How to interpret the "external file attributes" depends on the OS that made the ZIP, which is controlled by this field:

   4.4.2 version made by (2 bytes)

        4.4.2.1 The upper byte indicates the compatibility of the file
        attribute information.  If the external file attributes 
        are compatible with MS-DOS and can be read by PKZIP for 
        DOS version 2.04g then this value will be zero.  If these 
        attributes are not compatible, then this value will 
        identify the host system on which the attributes are 
        compatible.  Software can use this information to determine
        the line record format for text files etc.  

        4.4.2.2 The current mappings are:

         0 - MS-DOS and OS/2 (FAT / VFAT / FAT32 file systems)
         1 - Amiga                     2 - OpenVMS
         3 - UNIX                      4 - VM/CMS
         5 - Atari ST                  6 - OS/2 H.P.F.S.
         7 - Macintosh                 8 - Z-System
         9 - CP/M                     10 - Windows NTFS
        11 - MVS (OS/390 - Z/OS)      12 - VSE
        13 - Acorn Risc               14 - VFAT
        15 - alternate MVS            16 - BeOS
        17 - Tandem                   18 - OS/400
        19 - OS X (Darwin)            20 thru 255 - unused

        4.4.2.3 The lower byte indicates the ZIP specification version 
        (the version of this document) supported by the software 
        used to encode the file.  The value/10 indicates the major 
        version number, and the value mod 10 is the minor version 
        number.  

Alright, so how do we identify directories?

We simply have to look at what 7-zip does:

// Just checks if a string ends in a "/".
bool HasTailSlash(const AString &name, UINT
  #if defined(_WIN32) && !defined(UNDER_CE)
    codePage
  #endif
  )
{
  if (name.IsEmpty())
    return false;
  char c;
    #if defined(_WIN32) && !defined(UNDER_CE)
    if (codePage != CP_UTF8)
      c = *CharPrevExA((WORD)codePage, name, name.Ptr(name.Len()), 0);
    else
    #endif
    {
      c = name.Back();
    }
  return (c == '/');
}

bool CItem::IsDir() const
{
  // FIXME: we can check InfoZip UTF-8 name at first.
  if (NItemName::HasTailSlash(Name, GetCodePage()))
    return true;
  
  Byte hostOS = GetHostOS();

  if (Size == 0 && PackSize == 0 && !Name.IsEmpty() && Name.Back() == '\\')
  {
    // do we need to use CharPrevExA?
    // .NET Framework 4.5 : System.IO.Compression::CreateFromDirectory() probably writes backslashes to headers?
    // so we support that case
    switch (hostOS)
    {
      case NHostOS::kFAT:
      case NHostOS::kNTFS:
      case NHostOS::kHPFS:
      case NHostOS::kVFAT:
        return true;
    }
  }

  if (!FromCentral)
    return false;
  
  UInt16 highAttrib = (UInt16)((ExternalAttrib >> 16 ) & 0xFFFF);

  switch (hostOS)
  {
    case NHostOS::kAMIGA:
      switch (highAttrib & NAmigaAttrib::kIFMT)
      {
        case NAmigaAttrib::kIFDIR: return true;
        case NAmigaAttrib::kIFREG: return false;
        default: return false; // change it throw kUnknownAttributes;
      }
    case NHostOS::kFAT:
    case NHostOS::kNTFS:
    case NHostOS::kHPFS:
    case NHostOS::kVFAT:
      return ((ExternalAttrib & FILE_ATTRIBUTE_DIRECTORY) != 0);
    case NHostOS::kAtari:
    case NHostOS::kMac:
    case NHostOS::kVMS:
    case NHostOS::kVM_CMS:
    case NHostOS::kAcorn:
    case NHostOS::kMVS:
      return false; // change it throw kUnknownAttributes;
    case NHostOS::kUnix:
      return MY_LIN_S_ISDIR(highAttrib);
    default:
      return false;
  }
}

So, breaking the algorithm down:

  • If the name ends in a /, it's immediately treated as a directory.
  • Otherwise, use heuristics as follows:
  • Is 0 bytes, ends in a backslash \, and the ZIP was made by a Windows host OS: That's a directory. This is just a "special case" meant for old versions of Windows and PowerShell which used to create illegal ZIPs that used backslashes as directory separators, even though that's forbidden in the ZIP spec. Modern versions of Windows (starting from like 2 years ago) and its tools finally properly use forward slashes instead. "Thanks", Microsoft for making things harder to parse, as usual.
  • If things have gotten this far, it has now immediately detected directories that end with / and (illegal) \.
  • All lines below this will be dealing with entries that don't look like obvious directories, and may therefore be either files or "improperly encoded directories without any kind of trailing slash".
  • It now does a shortcut: If the current metadata isn't being read from the Central Directory (the full index at the end of the ZIP file), then say False here, because it doesn't want to trust the file-local data headers, just in case the directory later overrode those values with something else. This is a reasonable decision, since "if it ended in a slash, we know that the name must be referring to a folder no matter what, but if not, we need to use metadata which may not be up-to-date here in the local metadata".
  • From now on, if we're reading from the Central Metadata Directory at the end of the ZIP, then the heuristics will try to identify improperly encoded directories that DON'T end with any kind of slash. Meaning, let's say a directory is stored as /some/thing even though thing IS a directory.
  • Look at the External Attributes field of the entry, check which Host OS made the ZIP, and check if those OS-native file attributes contain what that exact OS uses to signal "this is a directory": If so, this is a directory too.
  • Otherwise, it is a file.

So that's a full overview of "what is a directory?" in the ZIP spec.

@Arcitec
Copy link
Author

Arcitec commented Jun 23, 2023

@michalc Thanks again for the answer and the link to your commit, it was super helpful! :)

If it's not currently possible to add empty directories, is there any chance of some API for adding empty directories?

On this - I think ending the name in a forward slash should create a directory: 66c500b

Now, with the backstory of the ZIP spec, it becomes possible to evaluate what that code is doing! :)

  • If the name ends in /, stream-zip marks the "External Attribute" with the bit for 0x10.
  • To check whether that's correct, I simply looked at 7-Zip's value for "MS-DOS Directory", which is defined as #define FILE_ATTRIBUTE_DIRECTORY 0x0010.
  • So bingo, if there's a trailing slash, stream-zip marks it as a MS-DOS Directory.

The only missing puzzle pieces to ensure that stream-zip is 100% valid are:

  • stream-zip must mark the "Host OS that created this ZIP" as MS-DOS, otherwise the "External Attribute" for the file breaks/becomes meaningless. I haven't verified if it does this. It needs to do this regardless of what Host OS stream-zip is running on. The list of correct operating systems is listed in the 4.4.2 version made by (2 bytes) table that I copied from the ZIP spec in the previous message.
  • Valid "version made by" values that support the MS-DOS 0x10 == directory attribute are seen in 7-zips implementation (quoted in previous message), and those are in turn defined in CPP/7zip/Archive/Zip/ZipHeader.h:
    NHostOS::kFAT      =  0,
    NHostOS::kNTFS     = 11,  // filesystem used by Windows NT
    NHostOS::kHPFS     =  6,  // filesystem used by OS/2 (and NT 3.x)
    NHostOS::kVFAT     = 14,  // filesystem used by Windows 95, NT
  • So stream-zip must mark the "version made by" as one of those values to be legal when using 0x10 to signify directories. I would say that "NTFS" is the most appropriate of those alternatives. I'll try to find out if stream-zip already does this.
  • Interesting side-note: 7-Zip is a better "ground truth" for the operating system values than the official ZIP spec. The official ZIP spec claims that "NTFS = 10" but that's a legacy number, which is defined by 7-Zip as kTOPS20 = 10, // pkzip 2.50 NTFS. So that's some kind of legacy NTFS. To use modern NTFS as the format, the "version made by" value must be 11.
  • stream-zip should also always throw an error if input data length is > 0 if something is a directory (ends in a trailing slash). I don't think it does that sanity check at the moment, and would therefore create illegal directories. The spec says "Zero-byte files, directories, and other file types that contain no content MUST NOT include file data."

@Arcitec
Copy link
Author

Arcitec commented Jun 23, 2023

@michalc Okay, it was good to do this investigation! I see that stream-zip has a bug.

https://github.com/uktrade/stream-zip/blob/31245ce4f89aeede3e260db1430f4fb42911a51d/stream_zip.py#L157C1-L160C49

            return central_directory_header_struct.pack(
                45,           # Version made by
                3,            # System made by (UNIX)
                45,           # Version required

It's marking the directories ("ends with slash" as MS-DOS DIRECTORY which is a flag that only makes sense if the directory is marked as "Made on FAT/NTFS" as mentioned above) but it marks the archive itself as "Made on Unix".

So we'll need to fix this. Thankfully I doubt that many people ever used the "embed directory" feature of stream-zip. And extractors such as 7-Zip use other heuristics (namely, the filenames) to detect directories when the Extended Attributes are wrong. :)

@Arcitec
Copy link
Author

Arcitec commented Jun 23, 2023

Here are the correct attributes when creating "Made on UNIX" ZIP archives, which are necessary to match the UNIX heuristics 7-Zip uses when looking at Extended Attributes:

    UInt16 highAttrib = (UInt16)((ExternalAttrib >> 16 ) & 0xFFFF);

// ...

    case NHostOS::kUnix:  // kUnix     =  3, [The same value that stream-zip uses]
      return MY_LIN_S_ISDIR(highAttrib);
#define MY_LIN_S_IFMT  00170000

#define MY_LIN_S_IFDIR  0040000

#define MY_LIN_S_ISDIR(m)   (((m) & MY_LIN_S_IFMT) == MY_LIN_S_IFDIR)

I can't fully interpret this at the moment. The first part with UInt16 highAttrib = (UInt16)((ExternalAttrib >> 16 ) & 0xFFFF); means that it's shifting the ExternalAttrib value right by 16 bits. So already, that's something that I think stream-zip isn't doing properly. It looks like stream-zip instead writes the attribute (0x10 at the moment) to the LOW 16 bits of External Attributes. (Edit: As you pointed out, this bug also exists in Python's own zipfile module).

Next, after it has shifted the high bits into becoming low bits instead, it then needs to check the actual attributes.

To check whether something has the Unix "is a directory" attribute, it's first AND-ing the attributes with 00170000 and then checking if the remaining result is exactly 0040000. (Sidenote: I've never seen this way of writing the numbers before. I think it's just zero-padded base10 integers, but that would have to be verified to ensure that it's not Octal data!)

stream-zip needs to use that technique when marking things as directory. Not using DOS 0x10. :)

I can see that stream-zip currently does this, which means that it's storing UNIX PERMISSIONS in the top 16 bits of External Attributes. And then incorrectly placing 0x10 in the lower bits:

            external_attr = \
                (perms << 16) | \
                (0x10 if name_encoded[-1:] == b'/' else 0x0)  # MS-DOS directory

The solution would be something like this:

            external_attr = \
                (perms << 16) | \
                ((0040000 << 16) if name_encoded[-1:] == b'/' else 0x0)  # Unix directory

(I have not tested this! I simply assumed that the 00170000 masking is totally useless for our encoding purposes and is only used in 7-zip to filter out "permissions" etc from the attribute block, so the solution I proposed is most likely correct...)

I hope this helps. I would make a pull request if I fully understood the Unix External Attributes, but I don't. At least "all the info" to solve it exists in this ticket now. :)

Edit in case it got lost in all the noise: stream-zip also needs to throw a fatal error if anyone tries to create a directory (ends with slash) where input bytes length is > 0. Because that's an illegal ZIP.

Edit: The ultimate "success test" afterwards would be to try encoding a directory WITHOUT trailing slash, marked as "Made on UNIX", with the "is a directory" External Attribute, and then opening it in 7-Zip. If it shows as a directory in 7-Zip, then the metadata is correct.

@michalc
Copy link
Member

michalc commented Jun 23, 2023

So lots here - have to admit I haven't read it all quite yet. But: stream-zip is very similar to Python's zipfile module https://github.com/python/cpython/blob/3.11/Lib/zipfile.py#L546-L552

@Arcitec
Copy link
Author

Arcitec commented Jun 23, 2023

@michalc Yeah, there's lots of bugs in Python's zipfile module. This doesn't surprise me. :) It was old and crufty even when it was first invented like 20 years ago.

@Arcitec
Copy link
Author

Arcitec commented Jun 23, 2023

@michalc I have edited the bug summary/solution post, the most up to date info is all in this one:

#55 (comment)

Edit: Added potential code solution.

@Arcitec
Copy link
Author

Arcitec commented Jun 23, 2023

I might be able to make a pull request for this if you want to wait. :) Gonna investigate the weird 00 padding in C++ (most likely meaningless visual padding for a regular int), and then do some tests of the proposed solution.

Edit: Nope, the leading 0 was octal, as was my other theory...! I'll be making a pull request and updating the tests today. The findings are below. It's really just a bitmask (IFMT) + a single bit flag (IFDIR).

#include <iostream>

#define MY_LIN_S_IFMT  00170000

#define MY_LIN_S_IFDIR  0040000

int main() {
    // Write C++ code here
    std::cout << "MY_LIN_S_IFMT: " << MY_LIN_S_IFMT << std::endl;
    std::cout << "MY_LIN_S_IFDIR: " << MY_LIN_S_IFDIR << std::endl;

    return 0;
}

Result:

MY_LIN_S_IFMT: 61440
MY_LIN_S_IFDIR: 16384

This number looked funny (power of 2) so I used a binary converter and confirmed that it's really just a power-of-2 bit flag:

// MY_LIN_S_IFMT  61440 (octal 00170000) as binary representation:
1111000000000000

// MY_LIN_S_IFDIR 16384 (octal 0040000) as binary representation:
0100000000000000

So 7-Zip just used a super weird way of writing what I'd personally write as 1 << 14 (16384) for more clarity.

And both are 16-bit values (2 bytes), which matches up with the fact that 7-Zip does UInt16 highAttrib = (UInt16)((ExternalAttrib >> 16 ) & 0xFFFF); to discard the lower two bytes and cleanly filter out the top two bytes.

I'm normally on Linux where my dev environment is, but I'll set up dev tools here on Windows now and send in a pull request today. :)

Edit: Pull request submitted.

@michalc
Copy link
Member

michalc commented Jul 15, 2023

I've opted to go for a documentation approach to this in #60, documenting the directories must end in a forward slash, and should have the S_IFDIR mode. Potentially a code change could be made down the line, but I'm tempted to leave it for now since from what I can tell, this works in all clients, and is The Right Thing in terms of what ends up in the file.

@michalc michalc closed this as completed Jul 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants