compileall: option to hardlink duplicate optimization levels bytecode cache files #84675

Closed
frenzymadness mannequin opened this issue May 4, 2020 · 11 comments
Labels
3.9 only security fixes stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@frenzymadness
Mannequin

frenzymadness mannequin commented May 4, 2020

BPO 40495
Nosy @brettcannon, @vstinner, @tiran, @hroncok, @frenzymadness, @FFY00, @hauntsaninja
PRs
  • bpo-40495: compileall option to hardlink duplicate pyc files #19901
  • bpo-40445: Update compileall.compile_dir docs #19806
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2020-05-14.14:18:29.412>
    created_at = <Date 2020-05-04.09:08:41.735>
    labels = ['type-feature', 'library', '3.9']
    title = 'compileall: option to hardlink duplicate optimization levels bytecode cache files'
    updated_at = <Date 2020-05-15.21:11:33.681>
    user = 'https://github.com/frenzymadness'

    bugs.python.org fields:

    activity = <Date 2020-05-15.21:11:33.681>
    actor = 'hauntsaninja'
    assignee = 'none'
    closed = True
    closed_date = <Date 2020-05-14.14:18:29.412>
    closer = 'vstinner'
    components = ['Library (Lib)']
    creation = <Date 2020-05-04.09:08:41.735>
    creator = 'frenzy'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 40495
    keywords = ['patch']
    message_count = 11.0
    messages = ['368022', '368023', '368024', '368025', '368627', '368628', '368629', '368630', '368631', '368841', '368842']
    nosy_count = 7.0
    nosy_names = ['brett.cannon', 'vstinner', 'christian.heimes', 'hroncok', 'frenzy', 'FFY00', 'hauntsaninja']
    pr_nums = ['19901', '19806']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue40495'
    versions = ['Python 3.9']

    @frenzymadness
    Mannequin Author

    frenzymadness mannequin commented May 4, 2020

    We would like to add optional hardlink deduplication of identical pyc files to the compileall module in Python 3.9. We've discussed the change [0] and tested it in the Fedora RPM build system via an implementation in the compileall2 module [1].

    The discussion [0] contains a lot of details, so I mention only the key features here:

    • the deduplication can be enabled only if multiple optimization levels are processed at once
    • it generates a pyc file (optimization level 0) as usual, but if it finds that the optimized files (optimization levels 1 and 2) have the same content, it uses hardlinks (os.link) to prevent duplicates
    • the deduplication is disabled by default
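
The key features above can be sketched roughly as follows. This is only an illustration built on py_compile, filecmp and os.link, not the code from compileall2 or the eventual PR; the helper name compile_with_hardlinks is hypothetical, and the demo assumes a filesystem that supports hardlinks.

```python
import filecmp
import os
import py_compile
import tempfile


def compile_with_hardlinks(source_path, levels=(0, 1, 2)):
    """Hypothetical helper: compile at each level, hardlink identical outputs."""
    pyc_paths = []
    previous = None
    for level in levels:
        # py_compile.compile() returns the path of the pyc file it wrote.
        pyc = py_compile.compile(source_path, optimize=level, doraise=True)
        # Byte-for-byte comparison with the pyc of the previous level.
        if previous is not None and filecmp.cmp(pyc, previous, shallow=False):
            os.unlink(pyc)
            os.link(previous, pyc)
        previous = pyc
        pyc_paths.append(pyc)
    return pyc_paths


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "demo.py")
    with open(src, "w") as f:
        # No docstrings or asserts, so every level should give the same bytecode.
        f.write("x = 1\n")
    paths = compile_with_hardlinks(src)
    all_linked = all(os.path.samefile(paths[0], p) for p in paths[1:])
```

A module containing docstrings or assert statements would produce different bytecode at some levels, and those files would stay separate.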

    We believe that this might be handy for other Pythonistas as well. In our case, this functionality lowers the installation size of Python 3.9 from 125 MiB to 103 MiB.

    [0] https://discuss.python.org/t/compileall-option-to-hardlink-duplicate-optimization-levels-bytecode-cache-files/3014
    [1] https://github.com/fedora-python/compileall2

    @frenzymadness frenzymadness mannequin added 3.9 only security fixes stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels May 4, 2020
    @tiran
    Member

    tiran commented May 4, 2020

    Python's import system is fully compatible with this approach.

    importlib never directly writes to a .pyc file. Instead it always creates a new temporary file next to the .pyc file and then overrides the .pyc file with an atomic file system operation. See _write_atomic() in Lib/importlib/_bootstrap_external.py.

    compileall and py_compile also use _write_atomic().
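
A minimal sketch of why this matters for hardlinked files, assuming a filesystem with hardlink support. The write_atomic function below is a simplified stand-in written for this example, not the real _write_atomic() from importlib, and the file names are made up. Renaming a new file over one name leaves the other name pointing at the old content.

```python
import os
import tempfile


def write_atomic(path, data):
    """Simplified stand-in: write a temp file next to path, then rename over it."""
    tmp_path = f"{path}.{os.getpid()}.tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
    # os.replace() swaps the directory entry; the old inode is not modified.
    os.replace(tmp_path, path)


with tempfile.TemporaryDirectory() as tmp:
    opt0 = os.path.join(tmp, "mod.pyc")
    opt1 = os.path.join(tmp, "mod.opt-1.pyc")
    with open(opt0, "wb") as f:
        f.write(b"old bytecode")
    os.link(opt0, opt1)  # both names now share one inode
    linked_before = os.path.samefile(opt0, opt1)

    write_atomic(opt0, b"new bytecode")
    linked_after = os.path.samefile(opt0, opt1)
    with open(opt1, "rb") as f:
        opt1_content = f.read()  # still the old content
```

Opening mod.pyc in place and writing to it would instead change the content seen through both names, which is the scenario the atomic rename avoids.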

    @tiran
    Member

    tiran commented May 4, 2020

    Brett, FYI

    @frenzymadness
    Mannequin Author

    frenzymadness mannequin commented May 4, 2020

    I forgot to mention that I am working on a PR, which should be ready soon because the implementation is already done and tested in compileall2.

    @vstinner
    Member

    Is it possible that the content of the optimization level 0 PYC file is modified when the PY file content changes, which would make the PYC files of optimization levels 1 and 2 inconsistent?

    Christian Heimes:

    Python's import system is fully compatible with this approach. importlib never directly writes to a .pyc file. Instead it always creates a new temporary file next to the .pyc file and then overrides the .pyc file with an atomic file system operation. See _write_atomic() in Lib/importlib/_bootstrap_external.py.

    It seems like importlib doesn't have the issue because it doesn't open the PYC file to write its content: _write_atomic() creates a *new* file and then calls os.replace() to rename the temporary file to the final PYC name.

    Alright, I think that I understood :-)

    --

    PYC files became more complicated with PEP 552. Here are my own notes to try to understand how it's supposed to be used.

    Python 3.9 now has an _imp.check_hash_based_pycs string, which can be overridden by the --check-hash-based-pycs command line option. It can have 3 values:

    • "always"
    • "never"
    • "default"

    These values are defined by PEP 552:

    • "never" causes the interpreter to always assume hash-based pycs are valid
    • "default" means the check_source flag in hash-based pycs determines invalidation
    • "always" causes the interpreter to hash the source file for invalidation regardless of value of check_source bit

    When a PYC file is created, it has a "check_source" bit:

    • Bit set: If the check_source flag is set, Python will determine the validity of the pyc by hashing the source file and comparing the hash with the expected hash in the pyc. If the pyc needs to be regenerated, it will be regenerated as a hash-based pyc again with the check_source flag set.
    • Bit unset: Python will simply load the pyc without checking the hash of the source file. The expectation in this case is that some external system (e.g., the local Linux distribution’s package manager) is responsible for keeping pycs up to date, so Python itself doesn’t have to check.

    I mostly copied/pasted this from PEP 552 :-)

    py_compile and compileall have a new invalidation_mode parameter, which can take 3 values:

    import enum

    class PycInvalidationMode(enum.Enum):
        TIMESTAMP = 1
        CHECKED_HASH = 2
        UNCHECKED_HASH = 3

    The default is computed in py_compile by:

    def _get_default_invalidation_mode():
        if os.environ.get('SOURCE_DATE_EPOCH'):
            return PycInvalidationMode.CHECKED_HASH
        else:
            return PycInvalidationMode.TIMESTAMP

    importlib: SourceLoader.get_code(filename) uses:

        flags = _classify_pyc(data, fullname, exc_details)
        bytes_data = memoryview(data)[16:]
        hash_based = flags & 0b1 != 0
        if hash_based:
            check_source = flags & 0b10 != 0
            if (_imp.check_hash_based_pycs != 'never' and
                (check_source or
                 _imp.check_hash_based_pycs == 'always')):
                source_bytes = self.get_data(source_path)
                source_hash = _imp.source_hash(
                    _RAW_MAGIC_NUMBER,
                    source_bytes,
                )
                _validate_hash_pyc(data, source_hash, fullname,
                                   exc_details)
        else:
            _validate_timestamp_pyc(
                data,
                source_mtime,
                st['size'],
                fullname,
                exc_details,
            )
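
To make the flag bits in the snippet above concrete, here is a small sketch that reads them back from a pyc header. It assumes the PEP 552 layout (4-byte magic number followed by a 4-byte little-endian flags field); classify_pyc is a hypothetical helper written for this example, unrelated to importlib's private _classify_pyc.

```python
import os
import py_compile
import tempfile
from py_compile import PycInvalidationMode


def classify_pyc(pyc_path):
    """Hypothetical helper: report the invalidation kind stored in a pyc header."""
    with open(pyc_path, "rb") as f:
        header = f.read(16)
    # Bytes 4..8 hold the flags field introduced by PEP 552.
    flags = int.from_bytes(header[4:8], "little")
    if not flags & 0b1:
        return "timestamp"
    return "checked-hash" if flags & 0b10 else "unchecked-hash"


results = {}
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "demo.py")
    with open(src, "w") as f:
        f.write("x = 1\n")
    for mode in PycInvalidationMode:
        # Write one pyc per invalidation mode under an explicit file name.
        pyc = py_compile.compile(
            src,
            cfile=os.path.join(tmp, f"{mode.name}.pyc"),
            invalidation_mode=mode,
            doraise=True,
        )
        results[mode.name] = classify_pyc(pyc)
```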

    @hroncok
    Mannequin

    hroncok mannequin commented May 11, 2020

    Is it possible that the content of the optimization level 0 PYC file is modified when the PY file content changes, which would make the PYC files of optimization levels 1 and 2 inconsistent? ...

    Note that there is a test exactly for this, in case the implementation is changed in the future.

    @vstinner
    Member

    While reviewing PR 19901, I was confused by the py_compile and compileall documentation, which is outdated: it doesn't mention that the optimize argument can be a list of integers.

    https://docs.python.org/dev/library/py_compile.html#py_compile.compile
    "optimize controls the optimization level and is passed to the built-in compile() function. The default of -1 selects the optimization level of the current interpreter."

    https://docs.python.org/dev/library/compileall.html#compileall.compile_dir
    "optimize specifies the optimization level for the compiler. It is passed to the built-in compile() function."

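A minimal sketch of the list form of optimize in compileall, assuming Python 3.9 or later. The hardlink_dupes keyword is the name under which this feature is documented for compileall in 3.9; it is not spelled out in this thread, so treat it as an assumption of the example.

```python
import compileall
import importlib.util
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "demo.py")
    with open(src, "w") as f:
        f.write("x = 1\n")
    # One call compiles all three levels; identical outputs get hardlinked.
    ok = bool(
        compileall.compile_file(
            src, quiet=1, optimize=[0, 1, 2], hardlink_dupes=True
        )
    )
    # Expected pyc paths for levels 0, 1 and 2.
    pycs = [
        importlib.util.cache_from_source(src, optimization=opt)
        for opt in ("", 1, 2)
    ]
    all_exist = all(os.path.exists(p) for p in pycs)
    all_linked = all(os.path.samefile(pycs[0], p) for p in pycs[1:])
```

On the command line, the corresponding spelling in 3.9 is to repeat the -o option and add --hardlink-dupes.
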
    @vstinner
    Member

    Currently, it's possible to implement this optimization using the Unix command "hardlink". Example:

    hardlink -c -v /usr/lib64/python3.8/__pycache__/*.pyc
    

    On my Fedora 32, this command says:

    Directories: 1
    Objects: 520
    Regular files: 519
    Comparisons: 133
    Linked: 133
    Saved: 2220032

    For example, string.cpython-38.pyc and string.cpython-38.opt-1.pyc become hard links.
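
To check from Python which files in a directory ended up sharing an inode, something along these lines could be used. It is a sketch that assumes a Unix-like filesystem; hardlink_groups is a hypothetical helper and the file names in the demo are made up.

```python
import os
import tempfile
from collections import defaultdict


def hardlink_groups(directory):
    """Hypothetical helper: list groups of file names that share an inode."""
    groups = defaultdict(list)
    for entry in os.scandir(directory):
        if entry.is_file(follow_symlinks=False):
            st = entry.stat(follow_symlinks=False)
            # Same device and inode number means the names are hardlinks.
            groups[(st.st_dev, st.st_ino)].append(entry.name)
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)


with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pyc", "b.pyc"):
        with open(os.path.join(tmp, name), "wb") as f:
            f.write(name.encode())
    os.link(os.path.join(tmp, "a.pyc"), os.path.join(tmp, "a.opt-1.pyc"))
    groups = hardlink_groups(tmp)
```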

    @vstinner
    Member

    Currently, it's possible to implement this optimization using the Unix command "hardlink".

    PR 19901 avoids the dependency on external "hardlink" command.

    In practice, PR 19901 only affects newly written PYC files, whereas running the "hardlink" command manually cannot track which files are new and which are not. The "hardlink" command is less practical; PR 19901 avoids modifying PYC files that we don't "own".

    @vstinner
    Member

    New changeset e77d428 by Lumír 'Frenzy' Balhar in branch 'master':
    bpo-40495: compileall option to hardlink duplicate pyc files (GH-19901)
    e77d428

    @vstinner
    Member

    Thanks Lumír and Miro! I'm closing the issue.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022