tarfile module next() method hides exceptions #71777

Open
JieGhost mannequin opened this issue Jul 22, 2016 · 9 comments
Labels
3.11 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@JieGhost
Mannequin

JieGhost mannequin commented Jul 22, 2016

BPO 27590
Nosy @rhettinger, @gustaebel, @bitdancer, @socketpair

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2016-07-22.15:11:06.123>
labels = ['type-bug', 'library', '3.11']
title = 'tarfile module next() method hides exceptions'
updated_at = <Date 2021-04-26.22:45:55.534>
user = 'https://bugs.python.org/JieGhost'

bugs.python.org fields:

activity = <Date 2021-04-26.22:45:55.534>
actor = 'iritkatriel'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2016-07-22.15:11:06.123>
creator = 'JieGhost'
dependencies = []
files = []
hgrepos = []
issue_num = 27590
keywords = []
message_count = 9.0
messages = ['270990', '270998', '271005', '271011', '271031', '271033', '271235', '271261', '271505']
nosy_count = 5.0
nosy_names = ['rhettinger', 'lars.gustaebel', 'r.david.murray', 'socketpair', 'JieGhost']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue27590'
versions = ['Python 3.11']

@JieGhost
Mannequin Author

JieGhost mannequin commented Jul 22, 2016

I have seen a similar ticket, but it was opened two years ago and contains nothing more than a brief description, so I opened this new one hoping to get some answers.

A tarfile.TarFile object is iterable and has a next() method. next() parses the next header and saves the parsed information. During parsing, many checks are performed to make sure the header is valid, and an exception is raised if something is wrong with it. next() catches many of these exceptions but does not re-raise them in all cases.

I have a tgz file in which one of the headers is corrupted: its checksum field is wrong. During parsing, InvalidHeaderError is therefore raised. next() catches it but hides it silently. From the source code (https://hg.python.org/cpython/file/2.7/Lib/tarfile.py#l2335), we can see that InvalidHeaderError is ONLY re-raised if it happens at the beginning of the tar file. In fact, many exceptions are hidden by the tarfile module, which simply treats them as marking the end of the tarball.

Why does the tarfile module hide so many exceptions? In other words, why does tarfile treat these exceptions as the end marker of the tarball rather than as errors?
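The behavior described above can be reproduced without the original tgz file. The following is a minimal sketch, not the reporter's actual archive: it builds a small uncompressed ustar archive in memory and overwrites the checksum field of the second header. The helper names `build_archive` and `corrupt_header` are hypothetical and exist only for this illustration.

```python
import io
import tarfile

BLOCK = 512  # tar headers and data are laid out in 512-byte blocks


def build_archive(names):
    """Hypothetical helper: build an in-memory ustar archive, one tiny file per name."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in names:
            payload = name.encode("ascii")
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def corrupt_header(data, header_offset):
    """Hypothetical helper: overwrite the checksum field (header bytes 148-155)."""
    damaged = bytearray(data)
    damaged[header_offset + 148:header_offset + 156] = b"0000000\0"
    return bytes(damaged)


archive = build_archive(["a.txt", "b.txt", "c.txt"])
# Each member here takes one header block plus one data block,
# so the second header starts at offset 2 * BLOCK.
damaged = corrupt_header(archive, 2 * BLOCK)

with tarfile.open(fileobj=io.BytesIO(damaged), mode="r:") as tar:
    names = tar.getnames()

print(names)
```

On interpreters that still behave as described in this issue, no exception is raised and only `a.txt` is listed: the bad header is taken as the end of the archive, so `b.txt` and `c.txt` are dropped silently.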

Is it because of this from GNU doc:
"At the end of the archive file there are two 512-byte blocks filled with binary zeros as an end-of-file marker. A reasonable system should write such end-of-file marker at the end of an archive, but must not assume that such a block exists when reading an archive."?

Thanks!

@JieGhost JieGhost mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Jul 22, 2016
@bitdancer
Member

That would be my guess. If we are reading along and we hit garbage data, we assume we've reached the end of the tar. That doesn't mean there isn't room for improvement, or perhaps issuing a warning message about why we think we hit the end of the tar.

What is the issue number of the other issue? If it is still open we should consolidate the issues if appropriate.

@JieGhost
Mannequin Author

JieGhost mannequin commented Jul 22, 2016

The other issue is
http://bugs.python.org/issue16858

@bitdancer
Member

OK, I've closed bpo-16858 in favor of this one, since we at least had some discussion here.

I see you selected 2.7. Does python3 have the same issues? (I'm guessing it does, though there has been some work done on the module.)

@JieGhost
Mannequin Author

JieGhost mannequin commented Jul 22, 2016

Yeah, I just tried on Python3.5 and it didn't report any errors either.

@rhettinger
Contributor

Lars Gustäbel did most of the work on this and it would be nice to get his thoughts. The exception swallowing is explicit here rather than accidental. See http://bugs.python.org/issue6123

@gustaebel
Mannequin

gustaebel mannequin commented Jul 25, 2016

The question is what you're trying to accomplish. If you just want to prevent tarfile from stopping at the first invalid header in order to extract everything following it, you may use the ignore_zeros=True keyword argument.
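As an illustration of the `ignore_zeros=True` suggestion, here is a hedged sketch that reads the same kind of damaged archive twice, once with the default settings and once with `ignore_zeros=True`. The archive layout and the `build_damaged_archive` helper are assumptions made up for this example.

```python
import io
import tarfile

BLOCK = 512  # tar headers and data are laid out in 512-byte blocks


def build_damaged_archive(names, bad_index):
    """Hypothetical helper: build a ustar archive and break one header checksum."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in names:
            payload = name.encode("ascii")
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    data = bytearray(buf.getvalue())
    # Each member is one header block plus one data block in this layout.
    header_offset = bad_index * 2 * BLOCK
    data[header_offset + 148:header_offset + 156] = b"0000000\0"
    return bytes(data)


damaged = build_damaged_archive(["a.txt", "b.txt", "c.txt"], bad_index=1)

# Default: reading stops silently at the first invalid header.
with tarfile.open(fileobj=io.BytesIO(damaged), mode="r:") as tar:
    default_names = tar.getnames()

# ignore_zeros=True: invalid blocks are skipped and reading continues.
with tarfile.open(fileobj=io.BytesIO(damaged), mode="r:", ignore_zeros=True) as tar:
    lenient_names = tar.getnames()

print(default_names, lenient_names)
```

With `ignore_zeros=True` the member after the damaged header becomes reachable again, but the damaged member itself is still skipped without an exception, so this option does not surface the error either.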

@JieGhost
Mannequin Author

JieGhost mannequin commented Jul 25, 2016

I do want the tarfile module to stop at the first invalid header. My question is why the tarfile module does NOT raise an exception about the error in the header, and instead just hides it silently.
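Until the module reports such errors itself, a caller can approximate the requested behavior from the outside. The sketch below is one possible workaround, not part of the tarfile API: the hypothetical helper `names_strict` lists the members and then inspects the block at which tarfile stopped reading. It assumes that the `TarFile.offset` attribute still points at that block, which holds for the implementation discussed in this issue but is not documented behavior.

```python
import io
import tarfile

BLOCK = 512  # tar headers and data are laid out in 512-byte blocks


def build_archive(names):
    """Hypothetical helper: build an in-memory ustar archive, one tiny file per name."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in names:
            payload = name.encode("ascii")
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def names_strict(data):
    """Hypothetical helper: list member names, but raise ReadError if the block
    where tarfile stopped is neither empty nor an all-zero end marker."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as tar:
        names = tar.getnames()
        stop = tar.offset  # assumption: offset of the block tarfile gave up on
    block = data[stop:stop + BLOCK]
    if block.strip(b"\0"):
        raise tarfile.ReadError("unparsed data at offset %d" % stop)
    return names


clean = build_archive(["a.txt", "b.txt", "c.txt"])
damaged = bytearray(clean)
# Break the checksum field of the second header (at offset 2 * BLOCK).
damaged[2 * BLOCK + 148:2 * BLOCK + 156] = b"0000000\0"
damaged = bytes(damaged)

print(names_strict(clean))
try:
    names_strict(damaged)
except tarfile.ReadError as exc:
    print("error:", exc)
```

A missing end-of-archive marker is deliberately tolerated here, in line with the GNU documentation quoted earlier in this thread; only non-zero data that tarfile declined to parse is treated as an error.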

@gustaebel
Mannequin

gustaebel mannequin commented Jul 28, 2016

After all these years, it is not that easy to say why the decision to swallow this exception was made. One part surely was a lack of experience with the tar format itself and all of its implementations. The other part, I guess, was that it was supposed to avoid problems in case users did not use TarFile as an iterator. tarfile was developed on Python 2.2, which was the first release to feature iterators. The problem with doing random access on a tarfile, or with calling TarFile.getmembers(), is that all the headers must be collected first. If this fails somewhere in the middle, there is no way to resume the current operation and you get nothing out of the archive.

@iritkatriel iritkatriel added the 3.11 only security fixes label Apr 26, 2021
@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022