Creating nested path hierarchies and empty directories inside the ZIP archive? #55
Comments
Oh, I just had a look inside a ZIP created with
Whereas if I had created the same ZIP on Linux with regular tools, I would have had these entries in the ZIP:
But I suppose that the ZIP spec is flexible enough that it doesn't need the intermediary "this is a parent folder on the road to glory" stuff. So if this is the case, I suppose that if I want to add empty folders, it's as easy as something like this:
Would something like that solve all of these issues? |
Sounds reasonable to me. I guess you don't absolutely have to have two passes - you can in one pass output files, while maintaining a list of folders that don't have any children. Or just output all folders like it sounds like other tools do. But that's a detail. I have to admit I haven't tested zipping folders much, especially empty folders - it was never really a priority. So I suspect it would be good to check the output using a few different unzip-ers, just in case. |
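A rough sketch of the "only emit folders that have no children" idea, in case it helps the discussion. This is not stream-zip code: `empty_directory_entries` is a hypothetical helper, and it assumes the caller already has the full lists of directory and file paths (POSIX-style, relative to the archive root).

```python
from pathlib import PurePosixPath

def empty_directory_entries(directories, files):
    """Return the directories that have no children, as ZIP-style names
    ending in a forward slash. Hypothetical helper, for illustration only."""
    non_empty = set()
    # Every ancestor of a file or of another directory has at least one child.
    for path in list(files) + list(directories):
        for parent in PurePosixPath(path).parents:
            non_empty.add(str(parent))
    return sorted(
        d.rstrip('/') + '/'
        for d in directories
        if d.rstrip('/') not in non_empty
    )

print(empty_directory_entries(['a', 'a/b', 'c', 'd'], ['c/1.txt']))
```

Here `a/` and `c/` are skipped because a child path already implies them, while `a/b/` and `d/` would need explicit entries to survive a round trip.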
On this - I think ending the name in a forward slash should create a directory: 66c500b |
Hi Michal! I've been digging deeply into this, so before answering, I'll just separately dump the information I've dug up here, for reference since it could be useful (you mentioned not looking much into directory storage in ZIPs)! :)
ZIP Specification
This document taught us a few things:
External File Attributes
How to interpret the "external file attributes" depends on the OS that made the ZIP, which is controlled by this field:
Alright, so how do we identify directories? We simply have to look at what 7-zip does:
// Just checks if a string ends in a "/".
bool HasTailSlash(const AString &name, UINT
#if defined(_WIN32) && !defined(UNDER_CE)
codePage
#endif
)
{
if (name.IsEmpty())
return false;
char c;
#if defined(_WIN32) && !defined(UNDER_CE)
if (codePage != CP_UTF8)
c = *CharPrevExA((WORD)codePage, name, name.Ptr(name.Len()), 0);
else
#endif
{
c = name.Back();
}
return (c == '/');
}
bool CItem::IsDir() const
{
// FIXME: we can check InfoZip UTF-8 name at first.
if (NItemName::HasTailSlash(Name, GetCodePage()))
return true;
Byte hostOS = GetHostOS();
if (Size == 0 && PackSize == 0 && !Name.IsEmpty() && Name.Back() == '\\')
{
// do we need to use CharPrevExA?
// .NET Framework 4.5 : System.IO.Compression::CreateFromDirectory() probably writes backslashes to headers?
// so we support that case
switch (hostOS)
{
case NHostOS::kFAT:
case NHostOS::kNTFS:
case NHostOS::kHPFS:
case NHostOS::kVFAT:
return true;
}
}
if (!FromCentral)
return false;
UInt16 highAttrib = (UInt16)((ExternalAttrib >> 16 ) & 0xFFFF);
switch (hostOS)
{
case NHostOS::kAMIGA:
switch (highAttrib & NAmigaAttrib::kIFMT)
{
case NAmigaAttrib::kIFDIR: return true;
case NAmigaAttrib::kIFREG: return false;
default: return false; // change it throw kUnknownAttributes;
}
case NHostOS::kFAT:
case NHostOS::kNTFS:
case NHostOS::kHPFS:
case NHostOS::kVFAT:
return ((ExternalAttrib & FILE_ATTRIBUTE_DIRECTORY) != 0);
case NHostOS::kAtari:
case NHostOS::kMac:
case NHostOS::kVMS:
case NHostOS::kVM_CMS:
case NHostOS::kAcorn:
case NHostOS::kMVS:
return false; // change it throw kUnknownAttributes;
case NHostOS::kUnix:
return MY_LIN_S_ISDIR(highAttrib);
default:
return false;
}
}
So, breaking the algorithm down:
So that's a full overview of "what is a directory?" in the ZIP spec. |
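For anyone who prefers reading Python, here is a simplified sketch of the same heuristic. It is my own condensed port, not 7-Zip's code: the function name and parameters are made up for illustration, the Amiga branch is left out, and the host OS numbers are the "version made by" codes from the ZIP specification (0 = FAT, 3 = UNIX, 6 = HPFS, 10 = NTFS, 14 = VFAT).

```python
import stat

HOST_FAT, HOST_UNIX, HOST_HPFS, HOST_NTFS, HOST_VFAT = 0, 3, 6, 10, 14
DOS_LIKE_HOSTS = {HOST_FAT, HOST_HPFS, HOST_NTFS, HOST_VFAT}
FILE_ATTRIBUTE_DIRECTORY = 0x10  # MS-DOS directory bit (low byte)

def looks_like_directory(name, host_os, external_attr,
                         size=0, pack_size=0, from_central=True):
    # 1. A trailing forward slash always means "directory".
    if name.endswith(b'/'):
        return True
    # 2. Zero-sized entry ending in a backslash, made on a DOS-like host.
    if (size == 0 and pack_size == 0 and name.endswith(b'\\')
            and host_os in DOS_LIKE_HOSTS):
        return True
    # 3. External attributes only exist in the central directory.
    if not from_central:
        return False
    if host_os in DOS_LIKE_HOSTS:
        return bool(external_attr & FILE_ATTRIBUTE_DIRECTORY)
    if host_os == HOST_UNIX:
        # Unix mode bits live in the high 16 bits.
        return stat.S_ISDIR((external_attr >> 16) & 0xFFFF)
    return False
```

Note that with a Unix host the MS-DOS directory bit is never consulted, so `looks_like_directory(b'foo', HOST_UNIX, 0x10)` is false.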
@michalc Thanks again for the answer and the link to your commit, it was super helpful! :)
Now, with the backstory of the ZIP spec, it becomes possible to evaluate what that code is doing! :)
The only missing puzzle pieces to ensure that
|
@michalc Okay, it was good to do this investigation! I see that
return central_directory_header_struct.pack(
45, # Version made by
3, # System made by (UNIX)
45, # Version required
It's marking the directories ("ends with slash") as MS-DOS DIRECTORY, which is a flag that only makes sense if the entry is marked as "Made on FAT/NTFS" as mentioned above, but it marks the archive itself as "Made on Unix". So we'll need to fix this. Thankfully I doubt that many people ever used the "embed directory" feature of |
Here are the correct attributes when creating "Made on UNIX" ZIP archives, which are necessary to match the UNIX heuristics 7-Zip uses when looking at External Attributes:
UInt16 highAttrib = (UInt16)((ExternalAttrib >> 16 ) & 0xFFFF);
// ...
case NHostOS::kUnix: // kUnix = 3, [The same value that stream-zip uses]
return MY_LIN_S_ISDIR(highAttrib);
#define MY_LIN_S_IFMT 00170000
#define MY_LIN_S_IFDIR 0040000
#define MY_LIN_S_ISDIR(m) (((m) & MY_LIN_S_IFMT) == MY_LIN_S_IFDIR)
I can't fully interpret this at the moment. The first part with Next, after it has shifted the high bits into becoming low bits instead, it then needs to check the actual attributes. To check whether something has the Unix "is a directory" attribute, it's first AND-ing the attributes with
I can see that
external_attr = \
    (perms << 16) | \
    (0x10 if name_encoded[-1:] == b'/' else 0x0)  # MS-DOS directory
The solution would be something like this:
external_attr = \
    (perms << 16) | \
    ((0o040000 << 16) if name_encoded[-1:] == b'/' else 0x0)  # Unix directory (octal; Python 3 requires the 0o prefix)
(I have not tested this! I simply assumed that the I hope this helps. I would make a pull request if I fully understood the Unix External Attributes, but I don't. At least "all the info" to solve it exists in this ticket now. :) Edit in case it got lost in all the noise: Edit: The ultimate "success test" afterwards would be to try encoding a directory WITHOUT trailing slash, marked as "Made on UNIX", with the "is a directory" External Attribute, and then opening it in 7-Zip. If it shows as a directory in 7-Zip, then the metadata is correct. |
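A small sketch of the Unix-style attribute using Python's `stat` constants instead of hand-written octal. This is only an illustration, not the stream-zip implementation: `unix_external_attr` is a hypothetical helper, and it assumes `perms` is meant to carry permission bits only (any file-type bits passed in are stripped with `stat.S_IMODE`).

```python
import stat

def unix_external_attr(name_encoded, perms):
    """Build the 32-bit external attributes for a "made on Unix" entry.
    Hypothetical helper: the Unix mode goes in the high 16 bits."""
    # Pick the file-type code from the trailing slash convention.
    file_type = stat.S_IFDIR if name_encoded.endswith(b'/') else stat.S_IFREG
    mode = file_type | stat.S_IMODE(perms)
    return (mode << 16) & 0xFFFFFFFF

dir_attr = unix_external_attr(b'foo/bar/', 0o755)
file_attr = unix_external_attr(b'foo/bar/readme.txt', 0o644)

# Reading it back the way an unzipper would: shift down, then test the type.
print(oct(dir_attr >> 16), stat.S_ISDIR(dir_attr >> 16))
print(oct(file_attr >> 16), stat.S_ISREG(file_attr >> 16))
```

Setting the regular-file type for non-directories is my own choice here; the thread only discusses the directory case.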
So lots here - have to admit I haven't read it all quite yet. But: stream-zip is very similar to Python's zipfile module https://github.com/python/cpython/blob/3.11/Lib/zipfile.py#L546-L552 |
@michalc Yeah, there's lots of bugs in Python's zipfile module. This doesn't surprise me. :) It was old and crufty even when it was first invented like 20 years ago. |
@michalc I have edited the bug summary/solution post, the most up to date info is all in this one: Edit: Added potential code solution. |
I might be able to make a pull request for this if you want to wait. :) Gonna investigate the weird
Edit: Nope, the leading 0 was octal, as was my other theory...! I'll be making a pull request and updating the tests today. The findings are below. It's really just a bitmask (IFMT) + a single bit flag (IFDIR).
#include <iostream>
#define MY_LIN_S_IFMT 00170000
#define MY_LIN_S_IFDIR 0040000
int main() {
// Write C++ code here
std::cout << "MY_LIN_S_IFMT: " << MY_LIN_S_IFMT << std::endl;
std::cout << "MY_LIN_S_IFDIR: " << MY_LIN_S_IFDIR << std::endl;
return 0;
}
Result:
This number looked funny (power of 2) so I used a binary converter and confirmed that it's really just a power-of-2 bit flag:
// MY_LIN_S_IFMT 61440 (octal 00170000) as binary representation:
1111000000000000
// MY_LIN_S_IFDIR 16384 (octal 0040000) as binary representation:
0100000000000000
So 7-Zip just used a super weird way of writing what I'd personally write as And both are 16-bit values (2 bytes), which matches up with the fact that 7-Zip does I'm normally on Linux where my dev environment is, but I'll set up dev tools here on Windows now and send in a pull request today. :) Edit: Pull request submitted. |
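The same arithmetic can be double-checked in Python, where octal literals use the `0o` prefix. The sketch below also suggests why the macro masks with IFMT before comparing instead of testing the single bit: the four type bits hold a type code, and as far as I can tell the block-device code (octal 060000) contains the IFDIR bit too. The names here are mine, chosen to mirror the 7-Zip macros.

```python
import stat

S_IFMT_MASK = 0o170000  # same value as MY_LIN_S_IFMT
S_IFDIR = 0o040000      # same value as MY_LIN_S_IFDIR

def is_dir(mode):
    # Mask out the 4 file-type bits, then compare the whole type code.
    return (mode & S_IFMT_MASK) == S_IFDIR

print(S_IFMT_MASK, format(S_IFMT_MASK, '016b'))
print(S_IFDIR, format(S_IFDIR, '016b'))

directory_mode = 0o040755     # drwxr-xr-x
block_device_mode = 0o060660  # brw-rw----

print(is_dir(directory_mode), is_dir(block_device_mode))
# A bare bit test would wrongly flag the block device as a directory:
print(bool(block_device_mode & S_IFDIR))
# Python's stat module agrees with the mask-and-compare approach:
print(stat.S_ISDIR(directory_mode), stat.S_ISDIR(block_device_mode))
```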
I've opted to go for a documentation approach to this in #60, documenting that directories must end in a forward slash, and should have the |
Hey, small question.
From what I've observed from the ZIP spec, it seems like the "directory hierarchy" of ZIP files is just an illusion, and that it's really just POSIX-style names (forward slash separator), where each file has its own full path embedded straight into the file "name" data (no "actual nested folders"), such as name = foo/bar/baz/readme.txt. I've also noticed that empty directories are stored as "files" which are named things like foo/bar/baz/ with a trailing slash at the end, and with a DATA LENGTH of ZERO. The trailing slash seems to be the only signifier that a "file" is actually a directory?
For instance, when I created a ZIP with the standard utilities on Linux, I observed that it has an entry for the "parent folder" as a standalone ZIP entry, before it has an entry for the actual file inside that folder.
The hierarchy in this example ZIP is:
A folder\with backslashes: This is the actual folder name, it's a POSIX name so backslash is legal. And it's stored in the ZIP as A folder\with backslashes/
A folder\with backslashes/my file\withbackslash.txt: This is the filename within the folder. It's just a file with a backslash in the name.
Here's my test ZIP: backslashes.zip
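To make the "flat names" observation concrete, here is a minimal sketch using Python's standard `zipfile` module (not stream-zip, whose behaviour may differ). It writes a nested file without any parent folder entries plus one empty directory entry, then lists what is actually stored. The entry names are made up for the example.

```python
import io
import zipfile

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    # A nested file, written without entries for foo/, foo/bar/ or foo/bar/baz/.
    archive.writestr('foo/bar/baz/readme.txt', b'hello')
    # An empty directory: a zero-length entry whose name ends in "/".
    archive.writestr('empty/', b'')

with zipfile.ZipFile(buffer) as archive:
    entries = [
        (info.filename, info.file_size, info.is_dir())
        for info in archive.infolist()
    ]

for entry in entries:
    print(entry)
```

Only two entries come back: the full-path file and the trailing-slash directory. The intermediate folders exist only as prefixes of the file name.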
So with that backstory about how directories seem to work in ZIPs, I'm very curious about the correct ways to handle directories in stream-zip:
name = foo/bar/baz/readme.txt without having to manually create the parent path components (foo/, foo/bar/ and foo/bar/baz/)?
foo/a/1.txt, then bar/b/1.txt, then foo/a/2.txt? Meaning is it safe to add directories in a jumbled order to the ZIP file contents? I assume that stream-zip remembers which directories it has already written to the ZIP and doesn't create those paths again.
the/directory/ with a trailing slash and providing b"" (empty bytes) as the data?
a/b/ was already auto-generated and added earlier, and I then manually try to write a/b/, I guess I'd end up with a duplicate directory entry)?