Prepare for splitting LOAD_CONST into several opcodes #89315

iritkatriel · 2021-09-09T14:09:24Z

BPO	45152
Nosy	@gvanrossum, @rhettinger, @markshannon, @serhiy-storchaka, @brandtbucher, @iritkatriel
PRs	bpo-45152: refactor the dis module to make handling of hasconst opcod… #28258 bpo-45152: Add HAS_CONST macro and get_const_value() function and use… #28262

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = 'https://github.com/iritkatriel'
closed_at = <Date 2021-09-15.09:35:25.243>
created_at = <Date 2021-09-09.14:09:24.396>
labels = ['interpreter-core', 'type-feature', 'library', '3.11']
title = 'Prepare for splitting LOAD_CONST into several opcodes'
updated_at = <Date 2021-09-15.09:35:25.243>
user = 'https://github.com/iritkatriel'

bugs.python.org fields:

activity = <Date 2021-09-15.09:35:25.243>
actor = 'iritkatriel'
assignee = 'iritkatriel'
closed = True
closed_date = <Date 2021-09-15.09:35:25.243>
closer = 'iritkatriel'
components = ['Interpreter Core', 'Library (Lib)']
creation = <Date 2021-09-09.14:09:24.396>
creator = 'iritkatriel'
dependencies = []
files = []
hgrepos = []
issue_num = 45152
keywords = ['patch']
message_count = 12.0
messages = ['401485', '401501', '401502', '401512', '401513', '401514', '401539', '401559', '401562', '401609', '401763', '401817']
nosy_count = 6.0
nosy_names = ['gvanrossum', 'rhettinger', 'Mark.Shannon', 'serhiy.storchaka', 'brandtbucher', 'iritkatriel']
pr_nums = ['28258', '28262']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue45152'
versions = ['Python 3.11']

iritkatriel · 2021-09-09T14:09:24Z

This issue is to prepare the code for splitting LOAD_CONST to several opcodes.

There are a number of places in the code where an opcode is compared to LOAD_CONST (such as dis, the compiler, and the peephole optimizer). These need to be refactored to make the query "is this a hasconst opcode", and the value calculation needs to be refactored into a single place, which can later be updated to get the value from places other than co_consts.

rhettinger · 2021-09-09T17:34:08Z

I didn't know this was in the works. Can you point me to the relevant discussion about why LOAD_CONST would be split?

iritkatriel · 2021-09-09T17:36:43Z

It's related to this work: faster-cpython/ideas#65

rhettinger · 2021-09-09T19:15:10Z

Thanks for the link. This is a worthwhile experiment. However, the potential gains will be hard to come by.

The workload of LOAD_CONST is very small. After paying for the usual dispatch logic overhead, all it does is an index into a struct member and an incref. Both the co_consts table and the popular constant objects are already likely to be in the L1 data cache.

##DEBUG_LABEL: TARGET_LOAD_CONST
movslq	%r15d, %rax             ## OpArg fetch, typically a zero code register rename   
movq	-368(%rbp), %rcx        ## 8-byte Reload to access co_consts
movq	24(%rcx,%rax,8), %rax   ## The actual indexing operation  (3 cycles)
incq	(%rax)                  ## The incref

A specialized opcode for a specific constant like None can 1) eliminate the oparg fetch (likely saving nothing), and 2) eliminate the two sequentially dependent memory access (this is a win):

##DEBUG_LABEL: TARGET_LOAD_NONE
    movq	\_\_Py_NoneStruct@GOTPCREL(%rip), rax
incq	(%rax)                  ## The incref

Any more general opcode for loading small ints would still need the oparg fetch and the incref. To win, it would need to convert the oparg into an int more efficiently than the two movq steps. If the small int table is in a fixed location (not per-subinterpreter), then you can save 2 cycles with the simpler address computation:

##DEBUG_LABEL: TARGET_SMALLINT
movslq	%r15d, %rax             ## OpArg fetch, typically a zero code register rename 
movq	\_\_Py_SmallInt@GOTPCREL(%rip), %rcx        ## Find an array of ints
movq	(%rcx,%rax,8), %rax     ## Cheaper address computation takes 1 cycle
incq	(%rax)                  ## The incref

The 2 cycle win (intel-only) will be partially offset by the additional pressure on the L1 data cache. Right now, the co_consts is almost certainly in cache, holding only the constants that actually get used (at a rate of 8 per cache line). Accesses into a small_int array will push other data out of L1.

IIRC, Serhiy already experimented with a LOAD_NONE opcode and couldn't get a measurable win.

iritkatriel · 2021-09-09T19:20:06Z

I’m not sure we’re hoping this to be faster, just reduce the code object size and not be slower. (Smaller code object could be faster to unmarshal).

rhettinger · 2021-09-09T19:31:52Z

Idea for improving unmarshalling speed: Since reading PYCs is I/O bound, it may be worthwhile to compress them. When the idea came up previously, Antoine suggested choosing an algorithm with the fastest possible decompression speed (iirc, it was lzma4 at the time). This should be an easy optimization that doesn't require changing the rest of the runtime.

gvanrossum · 2021-09-10T00:34:35Z

Raymond, I am super grateful that you are showing us the assembly language. That's the kind of thing we need to do more of if we want to be successful. (In this case I'm not surprised by the outcome, but I have never studied x86 assembly and this was a nice explanation.)

Regarding speeding up marshal, I am not at all convinced that this is still I/O bound. Raymond, do you have data about this? In my own informal experiments I believe the main cost I found was (a) stat() operations, and (b) the allocation and initialization of a large number of objects. In my (limited) understanding, "freezing" modules (like in bpo-45020) is mostly a win because it avoids stat() operations (and listdir(), too). But it would be nice to do a detailed experiment to get real facts here.

serhiy-storchaka · 2021-09-10T07:45:36Z

From my experience, the largest cost in importing module (after I/O) is for creating classes, especially classes with complex creation code: enums and dataclasses (and namedtuples in past, but they were significanly optimized since).

For I/O, would using zipimport or any other container help? It should reduce the number of stats and opens and allow to use compression.

As for increasing performance of demarshallization, it is already very fast (all is read from memory buffers, all repeated objects are cached). Switching to "wordcode" (32- or 64- bit instructions) and aligning all objects at the boundary of 4-8 bytes can marginally increase decoding speed, but it needs testing. Also more optimal using of references (like pickletools.optimize()) can reduce unmarshalling time at the cost of increasing marshalling time and memory consumption. It can also help with deterministic marshalling.

iritkatriel · 2021-09-10T08:13:48Z

(b) the allocation and initialization of a large number of objects

Yes, when I profiled import GC showed up as 20-30%.

gvanrossum · 2021-09-10T19:48:32Z

We're not (yet) trying to optimize the code being executed (we should perhaps add an issue about class creation to https://github.com/faster-cpython/ideas).

I agree that marshal is very fast already; I doubt the extra cost of wordcode would pay off since it would increase the size. This could be tested but it would be a big experiment.

I'm not sure that compression will really help, since a typical decompressor delivers its output in some memory buffer. But this needs to be tested, and it might be a simpler experiment.

I'm pretty sure zipimport will improve things -- the main problem with it (as always) is that not all tooling understands it, since resource loading is different.

markshannon · 2021-09-14T08:53:36Z

New changeset c2f1e95 by Irit Katriel in branch 'main':
bpo-45152: Add HAS_CONST macro and get_const_value() function and use… (bpo-28262)
c2f1e95

iritkatriel · 2021-09-15T09:14:31Z

New changeset 40d2ac9 by Irit Katriel in branch 'main':
bpo-45152: refactor the dis module to make handling of hasconst opcodes more generic (GH-28258)
40d2ac9

iritkatriel added the 3.11 only security fixes label Sep 9, 2021

iritkatriel self-assigned this Sep 9, 2021

iritkatriel added interpreter-core (Objects, Python, Grammar, and Parser dirs) stdlib Python modules in the Lib dir type-feature A feature request or enhancement 3.11 only security fixes labels Sep 9, 2021

iritkatriel self-assigned this Sep 9, 2021

iritkatriel added interpreter-core (Objects, Python, Grammar, and Parser dirs) stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Sep 9, 2021

iritkatriel closed this as completed Sep 15, 2021

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prepare for splitting LOAD_CONST into several opcodes #89315

Prepare for splitting LOAD_CONST into several opcodes #89315

iritkatriel commented Sep 9, 2021

iritkatriel commented Sep 9, 2021

rhettinger commented Sep 9, 2021

iritkatriel commented Sep 9, 2021

rhettinger commented Sep 9, 2021

iritkatriel commented Sep 9, 2021

rhettinger commented Sep 9, 2021

gvanrossum commented Sep 10, 2021

serhiy-storchaka commented Sep 10, 2021

iritkatriel commented Sep 10, 2021

gvanrossum commented Sep 10, 2021

markshannon commented Sep 14, 2021

iritkatriel commented Sep 15, 2021

Prepare for splitting LOAD_CONST into several opcodes #89315

Prepare for splitting LOAD_CONST into several opcodes #89315

Comments

iritkatriel commented Sep 9, 2021

iritkatriel commented Sep 9, 2021

rhettinger commented Sep 9, 2021

iritkatriel commented Sep 9, 2021

rhettinger commented Sep 9, 2021

iritkatriel commented Sep 9, 2021

rhettinger commented Sep 9, 2021

gvanrossum commented Sep 10, 2021

serhiy-storchaka commented Sep 10, 2021

iritkatriel commented Sep 10, 2021

gvanrossum commented Sep 10, 2021

markshannon commented Sep 14, 2021

iritkatriel commented Sep 15, 2021