Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate slow writes to class variables #80193

Closed
rhettinger opened this issue Feb 16, 2019 · 11 comments
Closed

Investigate slow writes to class variables #80193

rhettinger opened this issue Feb 16, 2019 · 11 comments
Labels
3.8 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@rhettinger
Copy link
Contributor

BPO 36012
Nosy @nascheme, @rhettinger, @scoder, @pablogsal, @0xYasser
PRs
  • bpo-36012: Avoid linear slot search when it's easy to know that we are not looking for a slot name #11907
  • Files
  • perf-screenshot.png
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2019-02-20.17:29:55.708>
    created_at = <Date 2019-02-16.20:15:18.382>
    labels = ['interpreter-core', '3.8', 'performance']
    title = 'Investigate slow writes to class variables'
    updated_at = <Date 2019-02-20.17:29:55.707>
    user = 'https://github.com/rhettinger'

    bugs.python.org fields:

    activity = <Date 2019-02-20.17:29:55.707>
    actor = 'rhettinger'
    assignee = 'none'
    closed = True
    closed_date = <Date 2019-02-20.17:29:55.708>
    closer = 'rhettinger'
    components = ['Interpreter Core']
    creation = <Date 2019-02-16.20:15:18.382>
    creator = 'rhettinger'
    dependencies = []
    files = ['48157']
    hgrepos = []
    issue_num = 36012
    keywords = ['patch']
    message_count = 11.0
    messages = ['335714', '335733', '335747', '335763', '335776', '335794', '335795', '335992', '335995', '336031', '336129']
    nosy_count = 5.0
    nosy_names = ['nascheme', 'rhettinger', 'scoder', 'pablogsal', 'Yasser Alshalaan']
    pr_nums = ['11907']
    priority = 'low'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue36012'
    versions = ['Python 3.8']

    @rhettinger
    Copy link
    Contributor Author

    Benchmark show what writes to class variables are anomalously slow.

        class A(object):
            pass
    
        A.x = 1   # This write is 3 to 5 times slower than other writes.

    FWIW, the same operation for old-style classes in Python 2.7 was several times faster.

    We should investigate to understand why the writes are so slow. There might be a good reason or there might be an opportunity for optimization.

    -------------------------------------------------

    $ python3.8 Tools/scripts/var_access_benchmark.py
    Variable and attribute read access:
       4.3 ns	read_local
       4.6 ns	read_nonlocal
      14.5 ns	read_global
      19.0 ns	read_builtin
      18.4 ns	read_classvar_from_class
      16.2 ns	read_classvar_from_instance
      24.7 ns	read_instancevar
      19.7 ns	read_instancevar_slots
      19.5 ns	read_namedtuple
      26.4 ns	read_boundmethod

    Variable and attribute write access:
    4.4 ns write_local
    5.1 ns write_nonlocal
    18.2 ns write_global
    103.9 ns write_classvar <== Outlier
    35.4 ns write_instancevar
    25.6 ns write_instancevar_slots

    @rhettinger rhettinger added 3.8 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Feb 16, 2019
    @pablogsal
    Copy link
    Member

    It seems 50% of the overhead (50ns) is due to two reasons:

    • 30-40% is due to the call to update_slot(type, name) after the item is set in the class dictionary.

    • 70-60% is due to all the extra work from _PyObject_GenericSetAttrWithDict until it ends calling _PyObjectDict_SetItem with the slot. If in the code of typeobject.c:type_setattro you change:

    res = _PyObject_GenericSetAttrWithDict((PyObject *)type, name, value, NULL);

    by

    PyObject* dictptr = _PyObject_GetDictPtr(type);
    res = _PyObjectDict_SetItem(type, dictptr, name, value);

    and delete the update_slot(type, name) call afterwards, the times are reduced to 50ns.

    @0xYasser
    Copy link
    Mannequin

    0xYasser mannequin commented Feb 17, 2019

    can you include your python 2.7 runs? for me it looks similar

    @rhettinger
    Copy link
    Contributor Author

    can you include your python 2.7 runs? for me it looks similar

    It will give similar results unless you switch to old-style classes (edit out the inheritance from object).

        class A:
            pass
    
        A.x = 1

    $ python2.7 var_access_benchmark.py
    Variable and attribute read access:
       6.7 ns	read_local
      10.9 ns	read_global
      18.9 ns	read_builtin
      24.4 ns	read_classvar_from_class
      30.2 ns	read_classvar_from_instance
      22.7 ns	read_instancevar
      25.5 ns	read_instancevar_slots
      99.3 ns	read_namedtuple
      40.9 ns	read_boundmethod

    Variable and attribute write access:
    8.2 ns write_local
    18.7 ns write_global
    32.9 ns write_classvar <== Not formerly an outlier
    32.5 ns write_instancevar
    31.2 ns write_instancevar_slots

    @scoder
    Copy link
    Contributor

    scoder commented Feb 17, 2019

    It turns out that "update_slot()" is always called, even when we are not updating a slot name (which is always a special dunder-name). The linear search for names in "update_slots()" is a huge waste of time here, and short-circuiting out of it when the name does not start with "_" cuts the overall update time by 50%. I pushed a PR.

    Another improvement would be a sub-linear algorithm for searching the slot name, but that's a bigger change.

    @pablogsal
    Copy link
    Member

    This are the timings that I am measuring with PR 11907:

    Variable and attribute read access:
    5.7 ns read_local
    5.9 ns read_nonlocal
    16.2 ns read_global
    24.5 ns read_builtin
    20.9 ns read_classvar_from_class
    20.0 ns read_classvar_from_instance
    29.7 ns read_instancevar
    24.7 ns read_instancevar_slots
    22.9 ns read_namedtuple
    36.8 ns read_boundmethod

    Variable and attribute write access:
    6.9 ns write_local
    6.9 ns write_nonlocal
    26.7 ns write_global
    65.4 ns write_classvar <----- Down from 120.6 ns
    49.5 ns write_instancevar
    34.5 ns write_instancevar_slots

    @rhettinger
    Copy link
    Contributor Author

    Wow, I didn't expect to get an immediate win this of this magnitude :-)

    @nascheme
    Copy link
    Member

    Some profiling using 'perf'. This is for cpython 63fa1cf.

    children self
    [...]
    + 97.27% 0.00% run_mod (inlined)
    + 88.53% 6.33% PyObject_SetAttr
    + 79.34% 6.80% type_setattro
    + 43.92% 43.92% update_slot
    + 26.87% 5.84% _PyObject_GenericSetAttrWithDict
    + 14.62% 6.52% insertdict
    + 8.80% 8.80% lookdict_unicode_nodummy
    + 6.57% 0.00% _Py_DECREF (inlined)
    + 5.19% 5.19% PyUnicode_InternInPlace
    + 3.57% 3.57% _PyObjectDict_SetItem
    + 3.38% 3.38% _PyType_Lookup
    + 1.71% 0.00% _Py_INCREF (inlined)
    + 1.42% 0.00% _Py_XDECREF (inlined)
    [...]

    After applying PR 11907:

    children self
    [...]
    + 94.76% 0.00% run_mod (inlined)
    + 75.22% 6.77% PyObject_SetAttr
    + 64.65% 13.08% type_setattro
    + 47.51% 11.96% _PyObject_GenericSetAttrWithDict
    + 22.99% 13.95% insertdict
    + 10.10% 10.10% lookdict_unicode_nodummy
    + 9.47% 9.47% PyUnicode_InternInPlace
    + 7.10% 7.10% _PyObjectDict_SetItem
    + 7.02% 7.02% _PyType_Lookup
    + 6.52% 0.00% _Py_DECREF (inlined)
    + 3.17% 0.00% _Py_INCREF (inlined)
    + 3.16% 0.00% _Py_XDECREF (inlined)
    + 2.59% 0.00% PyDict_SetItem (inlined)
    + 1.50% 0.00% is_dunder_name (inlined)
    [...]

    The PyUnicode_InternInPlace() can mostly be eliminated by testing PyUnicode_CHECK_INTERNED() first (we already have called PyUnicode_Check() on it). That only gives a 7% speedup on my machine though. The is_dunder_name() is a much bigger optimization.

    @nascheme
    Copy link
    Member

    BTW, 'perf report [...]' has a really neat annotated assembly view. Scroll to the function you are interested in and press 'a'. Press 't' to toggle the time units (left side numbers). I'm attaching a screenshot of the disassembly of the lookdict function. The time units are sample accounts. Each count is a point where the profiler woke up on that specific instruction.

    @rhettinger
    Copy link
    Contributor Author

    Thanks Neil. The tooling does indeed look nice.

    I added your PyUnicode_InternInPlace() suggestion to the PR.

    At this point, the PR looks ready-to-go unless any of you think we've missed some low-hanging fruit.

    @rhettinger
    Copy link
    Contributor Author

    New changeset d8b9e1f by Raymond Hettinger (Stefan Behnel) in branch 'master':
    bpo-36012: Avoid linear slot search for non-dunder methods (GH-11907)
    d8b9e1f

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.8 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants