New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
zipfile can't extract file #51088
Comments
The following exception occured when I tried to extract on Windows. "zipfile.BadZipfile: File name in directory "test\test2.txt" and header It seems like problem about slash. |
P.S |
The zipfile is technically incorrect, the zipfile specification prescribes Even without that caveat the file is corrupt because the zipfile directory That said: IMHO the current code in zipfile.ZipFile.open is too strict, it |
FileRoller doesn't complain about the mismatched slashes either. Where As far as the renaming goes, it seems the standard says the header name |
alan: I don't quite understand which filename you want to use when the Where in the standard is this prescribed? I couldn't find anything in My preference would be to use the central directory as the canonical Furthermore, when the two are different the most reasonably explaination [1] |
Sorry about the confusion--I think I confused myself by looking at the n section J, under "file name (Variable)" there's a bit that says: "If input came from standard input, there is no file name field. If So in these cases the central directory name has to be used. And, as |
In the case at issue, the file name is the same (contrary to the error An actual filename difference might be argued differently. |
I added a patch to replace back slashes by forward slashes in three I kept the exception for mismatching filenames, but if you think it is |
I agree with the change, but the code should be factorized in a function (normalize_filename for example) |
I'd prefer if the code no longer checked if the filename in the directory matches the name in the per-file header. The reason of that is that the two don't have to match: it is relatively cheap to rename a file in the zipfile by rewriting the directory while rewriting the entire zipfile can be pretty expensive when zipfiles get large. It's probably worthwhile to test what other zipfile tools do in the respect (e.g., create a zipfile where the filename in the header doesn't match the name in the directory and extract that zip using a number of popular tools). (I have a slightly odd perspective on this because I regularly deal with zipfiles containing over 100K files and over 10GByte of data). |
I've got bitten by a different variation of this bug. In my case the issue can be summarized by: Attached is a patch for Python27/lib/zipfile.py. I understand that it might not be the best approach, but at least we just compare the filenames without caring much about those pesky paths preceding them. |
Just tested my patch on mac, and it appears that it didn't work on OSX (and likely on other unix platforms too). Conclusion... os.path.basename() will not do anything to windows paths when running on unix. I'm thinking that instead of bailing at 'File name in directory "%s" and header "%s" differ.', the library should just print a warning, and continue. |
I'm in a similar situation, my test file raises this: File name in directory "windows\TEMP\\test123.txt" and header "C:\windows\TEMP\\test123.txt" differ. It turns out that I can't find any cross platform procedures for processing the paths between the different platforms. And there are other things like doing it in portable way; os.path.split() nor os.path.basename() won't touch windows paths on un*x, etc... So, I'd like to propose an easy way, just allow the process to extract the files (and print a warning message) rather that just raising an exception (raise BadZipfile,...) and stopping the extraction altogether. |
This one has the parentheses for print, so that it works in python 3.x. Also, the default fallback behavior in this case is to use the filename from the zips' directory (the first path in the warning). |
As I wrote in msg116385 I'd prefer to drop the consistency check completely because updating data like the filename in the central directory is a cheap way to rename files without completely rewriting the zip file. |
Can we get this simple "fix" implemented in time for the next 2.7.x release?! Thank you! |
print() is not a good way to emit the warning; please use the warnings module. |
It should at least left as debugging print. It can't be a warning, because it depends not on user's actions, but on |
Excellent, please see my third attempt. |
Adam this is not a security issue (2.6, 3.1, 3.2), nor a future issue that must wait for 3.5. |
It might not be a regular "security" issue, but it is not extracting some files that it should. There's a possible scenario, where it can be a security issue. |
Gentlemen, Is there's any way this fix can be included in any version? This ridiculous behavior is really not consistent with other ZIP implementations (7zip just ignores the mismatch). Thank you for your time and effort. |
Adam P, please don't screw around with the version headers. If you want to claim that this is a security issue of the type we care about (threats to the public internet) for patching old releases, and severe enough that we should do anything about it, send a detailed explanation with links to evidence to security response team. Simple writing 'a possible scenario' is insufficient. |
For the version headers, I've added the versions featuring the broken behavior. That's all. I'm not saying that this is I'm extracting malware from the Central Quarantine files, and the vendor's implementation is broken and is causing this issue for me on every single file inside the archive. Let's say, I've got a wrapper script that feeds the contents of a zip file to be scanned with this, because of this behavior, the wrapper will error out... Customers will say your product sucks, etc. Does this really take an act of god to fix this? |
Also, this behavior is present on all platforms and all versions of Python (zipfile Library), so maybe the headers should be adjusted there too. I'm not saying that this is necessarily a big freaking hole, but by using this, one can prevent files from being extracted using this simple trick. |
Patch against zipfile 3.4.0 attached. |
update |
Once again patch against 2.7.6 |
Don't use print (to stdout) or sys.stderr directly. There are already many other uses of warnings.warn within the zipfile module. Be consistent with those. Existing zipfile warnings seem to favor lazily importing warnings when its needed rather than a top level 'import warnings'. While I find that annoying, there are sometimes reasons to do it and the minimally invasive change that is consistent with the rest of the existing code is to do the same thing here. something similar to: + if self.debug and fname != zinfo.orig_filename: |
As Greg suggested, the important thing is to follow the precedent set by |
Attached is a patch with warnings against 2.7.6 |
Attached is a patch with warnings against 3.4.0 |
Attached is a patch with warnings against 2.7.6 (this one should be good to go) |
--- a/zipfile.py Wed Apr 30 11:27:16 2014
+++ b/zipfile.py Wed Apr 30 11:27:01 2014
@@ -1174,8 +1174,9 @@
else:
fname_str = fname.decode("cp437")
- if fname_str != zinfo.orig_filename:
- raise BadZipFile(
+ if self.debug and fname_str != zinfo.orig_filename:
+ import warnings
+ warnings.warn(
'File name in directory %r and header %r differ.'
% (zinfo.orig_filename, fname)) Also, you need to add |
3.4.0 pathc with stacklevel=2 |
I'm leaving it as "needs patch" because it isn't clear exactly what a committer should do. I think the current intent is to make the changes listed in zipfile_???_filename_mismatch_v2.patch (which are not listed as reviewable -- but the changes are indeed sufficiently straightforward that the the files -- if need be -- could be edited by hand as if they were made originally by the committer.) This change is small enough (warning instead of raise) that a test case is probably not strictly required, but it would be helpful. test.zip would presumably be useful data for a test case. There is dispute over whether this would be an enhancement (more generous with what to accept), a bug fix, or a security *regression* because it still allows old vulnerable files to stick around unreplaced (or to hide from a malware scanner), but no longer raises an Exception to get attention. (warnings are often ignored) zlib_forward_slash.patch would also be good (and might even be a security fix, by allowing the new versions to be installed), but is not ready to be committed, as There is also a question of how general the filename correction should be, particularly with respect to windows drives and capitalization. The one in this patch seems to be the minimal change, and is explicitly supported by the zip spec. |
Presumably the stacklevel applies to all versions; verifying that it warns about the right code location is important enough to require a test case. |
I just looked through 2.7.6 version of zipfile, and the the error handling there is either through using raise() or print(). So, inline with the guidance provided for 2.7.6, perhapswe should stick with print() instead of warning.warn(). I'll post that a bit later. test.zip up there is the test case for this change. Is there any other test case needed? |
Adam, please stop deleting the files. It makes for a lot of noise to those on the nosy list, and is unnecessary. Just make sure you increment the version number on the files you upload and it will be fine. Thanks. |
Jim, I've got some test cases where the zlib_forward_slash.patch doesn't cut it. That was the reason for trying a broader approach with filename_mismatch patches. |
A small question related to: "zipfile_276_filename_mismatch_v3.patch" --- a/zipfile.py Wed Apr 30 11:44:38 2014
+++ b/zipfile.py Wed Apr 30 15:10:38 2014
@@ -970,10 +970,10 @@
if fheader[_FH_EXTRA_FIELD_LENGTH]:
zef_file.read(fheader[_FH_EXTRA_FIELD_LENGTH])
- if fname != zinfo.orig_filename:
- raise BadZipfile, \
+ if self.debug and fname != zinfo.orig_filename:
+ print( \
'File name in directory "%s" and header "%s" differ.' % (
- zinfo.orig_filename, fname)
+ zinfo.orig_filename, fname)) Shouldn't a change from raising an exception to a print be somewhere documented? Thanks |
The bug was that BadZipFile was being raised when it shouldn't be so I wouldn't worry about documenting the behavior change other than in the Misc/NEWS entry that the ultimate commiter writes up. |
Is there anything else that you need me to provide? |
On Wed, Apr 30, 2014 at 3:05 PM, Adam Polkosnik wrote:
ah; I see the confusion. test.zip is test *data*. When I asked for a Typically, that would involve adding something to -jJ |
On Wed, Apr 30, 2014 at 3:11 PM, Adam Polkosnik
My recommendation (and I could be convinced otherwise) would be to replace if fname_str != zinfo.orig_filename:
raise ... with something more like self.filename_check(fname_str, zinfo.orig_filename) and a default implementation of filename_check that does nothing if In 2.7.6, you would have to keep the new methods private, but in 3.5, |
Jim, The problems documented here are related to two cases (both apparently arriving from world of windows):
The extraction part seems to be doing a good job at writing the files into sane locations. IMHO, there's no point in trying to replace slashes or otherwise "normalize", as this would fix the cases where the presence of an inverted slashes should be noted in debug output. |
On Fri, May 2, 2014 at 1:14 AM, Adam Polkosnik
Good! I had thought you had even more!
My understanding from earlier -- and I may have been reading too much Notably, if you're extracting on windows with windows conventions, If you're extracting a windows file to a unix environment, then \t
These really are different, as leaving off the "C:" should mean This (and differing capitalization) are among the reasons to do the Note that for python 3.4 and newer, pathlib <URL:
My understanding had been that it was failing to extract entirely. So |
Extraction works fine, the issue was that raise() was creating an exception, and stoping the whole extraction process. When replaced with a warning, everything works fine. |
Adam Polkasnik said:
That doesn't make sense. If an exception was "stopping the whole extraction process" then extraction was not working fine. Questions:
|
Ethan, And Then in python:
Python 3.4.0 (v3.4.0:04f714765c13, Mar 16 2014, 19:25:23) [MSC v.1600 64 bit (AM
D64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import zipfile
>>> zf = zipfile.ZipFile('test.zip')
>>> namelist = zf.namelist()
>>> namelist
['test/', 'test/test2.txt', 'test1.txt']
>>> for af in namelist:
... zf.read(af)
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "c:\Python34\lib\zipfile.py", line 1117, in read
with self.open(name, "r", pwd) as fp:
File "c:\Python34\lib\zipfile.py", line 1180, in open
% (zinfo.orig_filename, fname))
zipfile.BadZipFile: File name in directory 'test\\' and header b'test/' differ. So, based on that everything is already converted to forward slashes for the extraction. |
Ah, so when you (Adam) said "extraction works fine", what you meant was "extraction works fine *in other programs*". Okay. |
Both. Other programs, and in python scripts when raise() is removed in zipfile.py. Unless your results are different. |
The encoding test in ZipFile.open() is highly opinionated and has no purpose beyond itself. Testing for encoding issues should be done outside this library in the user's own code. Using the 3.7.2 version of ZipFile, this is my proposal: https://gist.github.com/MorganRamsay/696e89450e0f172c16ac8dfc016eb79f/revisions?diff=unified Currently, I'm subclassing ZipFile with this patch and I've had no issues with extracting thousands of different ZIP files on Windows. I can't attest to this solution's applicability on other platforms. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: