compileall: option to hardlink duplicate optimization levels bytecode cache files #84675
assignee = None
creator = 'frenzy'
user = 'https://github.com/frenzymadness'
created_at = <Date 2020-05-04.09:08:41.735>
closed_at = <Date 2020-05-14.14:18:29.412>
closer = 'vstinner'
updated_at = <Date 2020-05-15.21:11:33.681>
title = 'compileall: option to hardlink duplicate optimization levels bytecode cache files'
labels = ['type-feature', 'library', '3.9']
components = ['Library (Lib)']
issue_num = 40495
keywords = ['patch']
message_count = 11
messages = ['368022', '368023', '368024', '368025', '368627', '368628', '368629', '368630', '368631', '368841', '368842']
nosy_count = 7
nosy_names = ['brett.cannon', 'vstinner', 'christian.heimes', 'hroncok', 'frenzy', 'FFY00', 'hauntsaninja']
pr_nums = ['19901', '19806']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue40495'
versions = ['Python 3.9']
We would like to add an option for hardlink deduplication of identical pyc files to the compileall module in Python 3.9. We've discussed the change and tested it in the Fedora RPM build system via an implementation in the compileall2 module.
The discussion contains a lot of detail, so I'll mention only the key features here:
We believe that this might be handy for more Pythonistas. In our case, this functionality lowers the installation size of Python 3.9 from 125 MiB to 103 MiB.
Python's import system is fully compatible with this approach.
importlib never writes directly to a .pyc file. Instead, it always creates a new temporary file next to the .pyc file and then replaces the .pyc file with an atomic file system operation. See _write_atomic() in Lib/importlib/_bootstrap_external.py.
compileall and py_compile also use _write_atomic().
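The atomic-write pattern described above can be sketched roughly like this (a simplified stand-in for `_write_atomic()`, not the actual CPython implementation, which also handles file modes and derives the temporary name from the PID):

```python
import os
import tempfile

def write_atomic(path, data):
    """Write data to path so readers never see a partial file.

    Sketch of the pattern used by importlib's _write_atomic():
    write to a temporary file in the same directory, then
    os.replace() it over the target in a single atomic step.
    """
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(data)
        os.replace(tmp, path)  # atomic on both POSIX and Windows
    except OSError:
        os.unlink(tmp)  # clean up the temporary file on failure
        raise
```

Because the temporary file lives in the same directory as the target, os.replace() is a same-filesystem rename and therefore atomic.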
Is it possible that the PYC file of optimization level 0 is modified when the PY file content changes, which would make the PYC files of optimization levels 1 and 2 inconsistent?
It seems like importlib doesn't have this issue because it doesn't open the PYC file to write its content; instead, _write_atomic() creates a *new* file and then calls os.replace() to rename the temporary file to the final PYC name.
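This can be demonstrated in a few lines (a standalone illustration, not CPython code): replacing one name of a hardlinked pair with os.replace() breaks the link and leaves the other name untouched, whereas writing in place would have changed both.

```python
import os
import tempfile

tmpdir = tempfile.mkdtemp()
a = os.path.join(tmpdir, 'mod.cpython-39.pyc')
b = os.path.join(tmpdir, 'mod.cpython-39.opt-1.pyc')

with open(a, 'wb') as f:
    f.write(b'level 0 bytecode')
os.link(a, b)  # deduplicate: b is now a hardlink to a

# importlib-style update: write a new file, then os.replace() it
new = a + '.tmp'
with open(new, 'wb') as f:
    f.write(b'new level 0 bytecode')
os.replace(new, a)

assert open(a, 'rb').read() == b'new level 0 bytecode'
assert open(b, 'rb').read() == b'level 0 bytecode'  # b is untouched
assert os.stat(a).st_ino != os.stat(b).st_ino  # the link was broken
```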
Alright, I think that I understood :-)
PYC files became more complicated with PEP 552. Here are my own notes trying to understand how they are supposed to be used.
Python 3.9 now has the _imp.check_hash_based_pycs string, which can be overridden by the --check-hash-based-pycs command line option. It can have 3 values:
These values are defined by the PEP-552:
When a PYC file is created, it has a "check_source" bit:
I mostly copied/pasted the PEP-552 :-)
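Concretely, the PEP 552 header is the first 16 bytes of a pyc file: the magic number, a flags word (bit 0: hash-based, bit 1: check_source), then either the source mtime and size (timestamp-based) or an 8-byte source hash (hash-based). A small helper of my own (not CPython code) that decodes it:

```python
import struct

def classify_pyc_header(data):
    """Decode the 16-byte PEP 552 header of a .pyc file.

    Timestamp-based pycs store the source mtime and size;
    hash-based pycs store an 8-byte source hash, and the
    check_source bit says whether to verify it on import.
    """
    magic, flags = struct.unpack('<4sI', data[:8])
    hash_based = bool(flags & 0b1)
    if hash_based:
        (source_hash,) = struct.unpack('<8s', data[8:16])
        return {'magic': magic, 'hash_based': True,
                'check_source': bool(flags & 0b10),
                'source_hash': source_hash}
    mtime, size = struct.unpack('<II', data[8:16])
    return {'magic': magic, 'hash_based': False,
            'mtime': mtime, 'source_size': size}
```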
py_compile and compileall have a new invalidation_mode which can have 3 values:
class PycInvalidationMode(Enum):
    TIMESTAMP
    CHECKED_HASH
    UNCHECKED_HASH
The default is computed in py_compile by:
def _get_default_invalidation_mode():
    if os.environ.get('SOURCE_DATE_EPOCH'):
        return PycInvalidationMode.CHECKED_HASH
    else:
        return PycInvalidationMode.TIMESTAMP
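The mode can also be forced explicitly when calling py_compile; the snippet below compiles a trivial module as CHECKED_HASH and then inspects the flags word of the resulting header (my own check, reusing the PEP 552 flags layout: bit 0 = hash-based, bit 1 = check_source):

```python
import os
import py_compile
import struct
import tempfile

# Write a trivial module and compile it as a hash-based pyc.
src = os.path.join(tempfile.mkdtemp(), 'mod.py')
with open(src, 'w') as f:
    f.write('X = 1\n')

pyc = py_compile.compile(
    src,
    invalidation_mode=py_compile.PycInvalidationMode.CHECKED_HASH,
)

# CHECKED_HASH sets both the hash-based and check_source bits.
with open(pyc, 'rb') as f:
    magic, flags = struct.unpack('<4sI', f.read(8))
assert flags & 0b1   # hash-based
assert flags & 0b10  # check_source
```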
importlib: SourceLoader.get_code(filename) uses:
flags = _classify_pyc(data, fullname, exc_details) bytes_data = memoryview(data)[16:] hash_based = flags & 0b1 != 0 if hash_based: check_source = flags & 0b10 != 0 if (_imp.check_hash_based_pycs != 'never' and (check_source or _imp.check_hash_based_pycs == 'always')): source_bytes = self.get_data(source_path) source_hash = _imp.source_hash( _RAW_MAGIC_NUMBER, source_bytes, ) _validate_hash_pyc(data, source_hash, fullname, exc_details) else: _validate_timestamp_pyc( data, source_mtime, st['size'], fullname, exc_details, )
Note that there is a test exactly for this, in case the implementation is changed in the future.
While reviewing PR 19901, I was confused by the py_compile and compileall documentation, which is outdated: it doesn't mention that the optimize argument can be a list of integers.
Currently, it's possible to implement this optimization using the Unix command "hardlink". Example:
On my Fedora 32, this command says:
For example, string.cpython-38.pyc and string.cpython-38.opt-1.pyc become hard links.
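Whether two pyc files were actually deduplicated can be verified from Python by comparing inodes (a small helper of my own; os.path.samefile() does the same thing):

```python
import os

def are_hardlinked(path_a, path_b):
    """True if both paths refer to the same inode on the same device."""
    st_a, st_b = os.stat(path_a), os.stat(path_b)
    return (st_a.st_ino, st_a.st_dev) == (st_b.st_ino, st_b.st_dev)
```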
PR 19901 avoids the dependency on external "hardlink" command.
In practice, PR 19901 only impacts newly written PYC files, whereas manually running the "hardlink" command cannot track which files are ours and which are not. The "hardlink" command is also less practical; PR 19901 avoids modifying PYC files that we don't "own".
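With the feature merged in Python 3.9, the deduplication is available directly from the compileall API (and as --hardlink-dupes on the command line). A minimal sketch: a module with no docstrings or asserts compiles to identical bytecode at all three optimization levels, so the three cache files end up sharing one inode.

```python
import compileall
import os
import tempfile

tree = tempfile.mkdtemp()
with open(os.path.join(tree, 'mod.py'), 'w') as f:
    f.write('X = 1\n')  # no docstring/assert, so all levels match

# hardlink_dupes requires more than one optimization level.
compileall.compile_dir(
    tree,
    quiet=1,
    optimize=[0, 1, 2],
    hardlink_dupes=True,
)

cache = os.path.join(tree, '__pycache__')
pycs = sorted(os.listdir(cache))
inodes = {os.stat(os.path.join(cache, p)).st_ino for p in pycs}
assert len(pycs) == 3 and len(inodes) == 1  # three names, one inode
```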