Bug in TOC type (its sub function) #6669

plu5 · 2022-03-12T00:56:09Z

Description of the issue

The documentation shows you can exclude binaries by adding something like the following to your spec file:

a.binaries -= [('badmodule', None, None)]

I noticed that if you then subtract again, a.binaries becomes an empty list.

For example:

toc = TOC([('module1', None, None), ('module2', None, None), ('badmodule', None, None)])
toc -= [('badmodule', None, None)]  # -> [('module1', None, None), ('module2', None, None)]
toc -= [('module1', None, None)]  # -> []

I believe L117 in PyInstaller/building/datastruct.py is the culprit. If I change it to result.append(entry) it fixes the issue. Otherwise the result __sub__ returns has an empty set for its filenames, which then causes subsequent subtractions to fail, because the subtraction relies on filenames being set properly (see L109).

Context information (for bug reports)

(this is probably irrelevant but just in case)

Output of pyinstaller --version: 4.9
Version of Python: 3.7.4
Platform: Windows 10 1909 en-GB
How you installed Python: python.org/downloads
Did you also try this on another platform? Does it work there? -- I was also able to reproduce this on a Ubuntu 16.10 VM with Python 3.9.9 and PyInstaller 4.10.

A minimal example program which shows the error

from PyInstaller.building.datastruct import TOC


toc = TOC([('module1', None, None), ('module2', None, None), ('badmodule', None, None)])

toc -= [('badmodule', None, None)]
assert toc == TOC([('module1', None, None), ('module2', None, None)])  # passes

toc -= [('module1', None, None)]
assert toc == TOC([('module2', None, None)])  # fails

Stacktrace / full error message

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/py249370e", line 10, in <module>
AssertionError

and if you print(toc) at this point you get [].

The text was updated successfully, but these errors were encountered:

rokm · 2022-03-12T11:03:19Z

Nice catch!

plu5 · 2022-03-12T21:12:11Z

Thanks rokm!

I have a related thing to note:

For TOC entries with typecode BINARY or DATA, unique_name does os.path.normcase on their name. In Windows, the effect of this is to lowercase the names. This means that on Windows if you try to remove a module with subtraction, giving its name with the same casing as it appears on disk, it will fail:

toc = TOC([('m1', None, None), ('m2', None, None), ('BadModule', None, 'BINARY')])
toc -= [('BadModule', None, None)]
print(toc)
# On Windows: -> [('m1', None, None), ('m2', None, None), ('BadModule', None, 'BINARY')]
# On Linux:   -> [('m1', None, None), ('m2', None, None)]

rokm · 2022-03-12T21:56:43Z

I'd say this example is invalid, because the subtracted element does not have type code. If it was

toc -= [('BadModule', None, 'BINARY')]

it would work as expected.

plu5 · 2022-03-12T22:31:21Z

True, but I guess that only works because specifying BINARY or DATA makes the name you give be run through os.path.normcase (and thus, on Windows, made lowercase) as well.

Since the docs say:

Because set membership is based on the name element of a tuple only, it is not necessary to give accurate path and typecode elements when subtracting.

perhaps this bears mentioning, but I am not sure how to word this in a way that is not going to confuse people.

Users of case-insensitive filesystems, like Windows by default, should either specify the name in lowercase, or give the correct typecode.

This actually sent me for a spin for several hours yesterday, I was really confused about why subtracting 'opengl32sw.dll' would work but something like 'Qt5DBus.dll' stays. That the casing is responsible did not even occur to me until I looked at the code.

bwoodsend · 2022-03-12T23:10:40Z

perhaps this bears mentioning, but I am not sure how to word this in a way that is not going to confuse people.

Users of case-insensitive filesystems, like Windows by default, should either specify the name in lowercase, or give the correct typecode.

Ah simple - we don't. If Toc.__sub__() applies os.path.normcase() then the users don't need to know.

rokm · 2022-03-12T23:29:39Z

Hmmm, you're right. I wasn't aware that the docs were encouraging that sort of thing.

I'll revise that part of docs to mention case normalization for BINARY and DATA entries, and explain why typecode needs to be specified on case-insensitive OSes (or alternatively, lowercase names should be provided, like you mentioned).

rokm · 2022-03-12T23:34:29Z

Ah simple - we don't. If Toc.__sub__() applies os.path.normcase() then the users don't need to know.

Except we can't apply it consistently. If user does not provide the typecode of the operand, we have no way of knowing whether the normalization should be performed (i.e., the entry is data or binary) or not (i.e., the entry is for example a python module). And trying to infer this from the existing entries in the original TOC would require prior matching, leading to chicken-and-egg problem (I suppose pre-matching could be performed in lower-case, but it's a lot of additional complexity for very little gain).

That's why I think adding a note about normalization and specifying lower-case names in lieu of typecodes is the best solution for this...

rokm · 2022-03-13T10:17:24Z

Now that I think about it, why not perform case normalization on all TOC.filenames entries, regardless of the type? The case of entries' real filenames is preserved in the entries themselves, while filenames is used internally to prevent duplication and such. If we case-normalized all entries, the TOC should behave like the underlying filesystem for all entry types (i.e., you cannot have both mymodule.pyc and MyModule.pyc on filesystem, even if actual case matters for python's import).

bwoodsend · 2022-03-13T11:54:41Z

Sounds like the sensible thing to me.

rokm · 2022-03-19T12:32:41Z

I've moved this second part to a separate issue for easier bookkeeping. The case normalization changes involve fixing up tests that explicitly test for the old behavior, so I'd rather have the changes in a separate PR.

plu5 added the triage Please triage and relabel this issue label Mar 12, 2022

rokm added bug and removed triage Please triage and relabel this issue labels Mar 12, 2022

rokm mentioned this issue Mar 12, 2022

building: fix TOC.__sub__ operator #6671

Merged

rokm mentioned this issue Mar 19, 2022

Inconsistent TOC subtraction behavior when entry's type is not specified #6688

Closed

rokm closed this as completed in #6671 Mar 19, 2022

github-actions bot locked as resolved and limited conversation to collaborators Nov 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug in TOC type (its sub function) #6669

Bug in TOC type (its sub function) #6669

plu5 commented Mar 12, 2022

rokm commented Mar 12, 2022

plu5 commented Mar 12, 2022

rokm commented Mar 12, 2022

plu5 commented Mar 12, 2022

bwoodsend commented Mar 12, 2022

rokm commented Mar 12, 2022

rokm commented Mar 12, 2022

rokm commented Mar 13, 2022

bwoodsend commented Mar 13, 2022

rokm commented Mar 19, 2022

Bug in TOC type (its __sub__ function) #6669

Bug in TOC type (its __sub__ function) #6669

Comments

plu5 commented Mar 12, 2022

Description of the issue

Context information (for bug reports)

A minimal example program which shows the error

Stacktrace / full error message

rokm commented Mar 12, 2022

plu5 commented Mar 12, 2022

rokm commented Mar 12, 2022

plu5 commented Mar 12, 2022

bwoodsend commented Mar 12, 2022

rokm commented Mar 12, 2022

rokm commented Mar 12, 2022

rokm commented Mar 13, 2022

bwoodsend commented Mar 13, 2022

rokm commented Mar 19, 2022

Bug in TOC type (its sub function) #6669

Bug in TOC type (its sub function) #6669