Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in TOC type (its __sub__ function) #6669

Closed
plu5 opened this issue Mar 12, 2022 · 10 comments · Fixed by #6671
Closed

Bug in TOC type (its __sub__ function) #6669

plu5 opened this issue Mar 12, 2022 · 10 comments · Fixed by #6671
Labels

Comments

@plu5
Copy link

plu5 commented Mar 12, 2022

Description of the issue

The documentation shows you can exclude binaries by adding something like the following to your spec file:

a.binaries -= [('badmodule', None, None)]

I noticed that if you then subtract again, a.binaries becomes an empty list.

For example:

toc = TOC([('module1', None, None), ('module2', None, None), ('badmodule', None, None)])
toc -= [('badmodule', None, None)]  # -> [('module1', None, None), ('module2', None, None)]
toc -= [('module1', None, None)]  # -> []

I believe L117 in PyInstaller/building/datastruct.py is the culprit. If I change it to result.append(entry) it fixes the issue. Otherwise the result __sub__ returns has an empty set for its filenames, which then causes subsequent subtractions to fail, because the subtraction relies on filenames being set properly (see L109).

Context information (for bug reports)

(this is probably irrelevant but just in case)

  • Output of pyinstaller --version: 4.9
  • Version of Python: 3.7.4
  • Platform: Windows 10 1909 en-GB
  • How you installed Python: python.org/downloads
  • Did you also try this on another platform? Does it work there? -- I was also able to reproduce this on a Ubuntu 16.10 VM with Python 3.9.9 and PyInstaller 4.10.

A minimal example program which shows the error

from PyInstaller.building.datastruct import TOC


toc = TOC([('module1', None, None), ('module2', None, None), ('badmodule', None, None)])

toc -= [('badmodule', None, None)]
assert toc == TOC([('module1', None, None), ('module2', None, None)])  # passes

toc -= [('module1', None, None)]
assert toc == TOC([('module2', None, None)])  # fails

Stacktrace / full error message

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/py249370e", line 10, in <module>
AssertionError

and if you print(toc) at this point you get [].

@plu5 plu5 added the triage Please triage and relabel this issue label Mar 12, 2022
@rokm rokm added bug and removed triage Please triage and relabel this issue labels Mar 12, 2022
@rokm
Copy link
Member

rokm commented Mar 12, 2022

Nice catch!

@plu5
Copy link
Author

plu5 commented Mar 12, 2022

Thanks rokm!

I have a related thing to note:

For TOC entries with typecode BINARY or DATA, unique_name does os.path.normcase on their name. In Windows, the effect of this is to lowercase the names. This means that on Windows if you try to remove a module with subtraction, giving its name with the same casing as it appears on disk, it will fail:

toc = TOC([('m1', None, None), ('m2', None, None), ('BadModule', None, 'BINARY')])
toc -= [('BadModule', None, None)]
print(toc)
# On Windows: -> [('m1', None, None), ('m2', None, None), ('BadModule', None, 'BINARY')]
# On Linux:   -> [('m1', None, None), ('m2', None, None)]

@rokm
Copy link
Member

rokm commented Mar 12, 2022

I'd say this example is invalid, because the subtracted element does not have type code. If it was

toc -= [('BadModule', None, 'BINARY')]

it would work as expected.

@plu5
Copy link
Author

plu5 commented Mar 12, 2022

True, but I guess that only works because specifying BINARY or DATA makes the name you give be run through os.path.normcase (and thus, on Windows, made lowercase) as well.

Since the docs say:

Because set membership is based on the name element of a tuple only, it is not necessary to give accurate path and typecode elements when subtracting.

perhaps this bears mentioning, but I am not sure how to word this in a way that is not going to confuse people.

Users of case-insensitive filesystems, like Windows by default, should either specify the name in lowercase, or give the correct typecode.

This actually sent me for a spin for several hours yesterday, I was really confused about why subtracting 'opengl32sw.dll' would work but something like 'Qt5DBus.dll' stays. That the casing is responsible did not even occur to me until I looked at the code.

@bwoodsend
Copy link
Member

perhaps this bears mentioning, but I am not sure how to word this in a way that is not going to confuse people.

Users of case-insensitive filesystems, like Windows by default, should either specify the name in lowercase, or give the correct typecode.

Ah simple - we don't. If Toc.__sub__() applies os.path.normcase() then the users don't need to know.

@rokm
Copy link
Member

rokm commented Mar 12, 2022

Hmmm, you're right. I wasn't aware that the docs were encouraging that sort of thing.

I'll revise that part of docs to mention case normalization for BINARY and DATA entries, and explain why typecode needs to be specified on case-insensitive OSes (or alternatively, lowercase names should be provided, like you mentioned).

@rokm
Copy link
Member

rokm commented Mar 12, 2022

Ah simple - we don't. If Toc.__sub__() applies os.path.normcase() then the users don't need to know.

Except we can't apply it consistently. If user does not provide the typecode of the operand, we have no way of knowing whether the normalization should be performed (i.e., the entry is data or binary) or not (i.e., the entry is for example a python module). And trying to infer this from the existing entries in the original TOC would require prior matching, leading to chicken-and-egg problem (I suppose pre-matching could be performed in lower-case, but it's a lot of additional complexity for very little gain).

That's why I think adding a note about normalization and specifying lower-case names in lieu of typecodes is the best solution for this...

@rokm
Copy link
Member

rokm commented Mar 13, 2022

Now that I think about it, why not perform case normalization on all TOC.filenames entries, regardless of the type? The case of entries' real filenames is preserved in the entries themselves, while filenames is used internally to prevent duplication and such. If we case-normalized all entries, the TOC should behave like the underlying filesystem for all entry types (i.e., you cannot have both mymodule.pyc and MyModule.pyc on filesystem, even if actual case matters for python's import).

@bwoodsend
Copy link
Member

Sounds like the sensible thing to me.

@rokm
Copy link
Member

rokm commented Mar 19, 2022

I've moved this second part to a separate issue for easier bookkeeping. The case normalization changes involve fixing up tests that explicitly test for the old behavior, so I'd rather have the changes in a separate PR.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 16, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants