Migrate decimal to use PEP 567 context variables #76811
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
assignee = 'https://github.com/skrah' closed_at = <Date 2018-01-27.18:47:18.042> created_at = <Date 2018-01-23.06:56:55.612> labels = ['3.7', 'type-feature', 'library'] title = 'Migrate decimal to use PEP 567 context variables' updated_at = <Date 2020-03-01.20:18:29.215> user = 'https://github.com/1st1'
activity = <Date 2020-03-01.20:18:29.215> actor = 'gregory.p.smith' assignee = 'skrah' closed = True closed_date = <Date 2018-01-27.18:47:18.042> closer = 'yselivanov' components = ['Library (Lib)'] creation = <Date 2018-01-23.06:56:55.612> creator = 'yselivanov' dependencies =  files = ['47410', '47411'] hgrepos =  issue_num = 32630 keywords = ['patch'] message_count = 32.0 messages = ['310472', '310482', '310630', '310698', '310699', '310700', '310701', '310704', '310714', '310715', '310716', '310738', '310790', '310791', '310794', '310795', '310796', '310799', '310800', '310803', '310805', '310806', '310809', '310811', '310815', '310817', '310820', '310821', '310864', '310875', '310876', '363091'] nosy_count = 8.0 nosy_names = ['gvanrossum', 'gregory.p.smith', 'mark.dickinson', 'vstinner', 'ned.deily', 'methane', 'skrah', 'yselivanov'] pr_nums = ['5278'] priority = 'normal' resolution = 'fixed' stage = 'resolved' status = 'closed' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue32630' versions = ['Python 3.7']
The text was updated successfully, but these errors were encountered:
I realize that you had to fight massive mailing list distractions
Let's start here:
>>> from decimal import * ==18887== Invalid read of size 8 ==18887== at 0x5324E0: contextvar_new (context.c:744) ==18887== by 0x53141A: PyContextVar_New (context.c:137) ==18887== by 0xFED052B: PyInit__decimal (_decimal.c:5542) ==18887== by 0x51FC56: _PyImport_LoadDynamicModuleWithSpec (importdl.c:159) ==18887== by 0x51F29F: _imp_create_dynamic_impl (import.c:2145) ==18887== by 0x51A4BA: _imp_create_dynamic (import.c.h:289) ==18887== by 0x43257A: _PyMethodDef_RawFastCallDict (call.c:530) ==18887== by 0x432710: _PyCFunction_FastCallDict (call.c:582) ==18887== by 0x432DD6: PyCFunction_Call (call.c:787) ==18887== by 0x4FAA44: do_call_core (ceval.c:4659) ==18887== by 0x4F58CC: _PyEval_EvalFrameDefault (ceval.c:3232) ==18887== by 0x4E7F99: PyEval_EvalFrameEx (ceval.c:545) ==18887== Address 0xcf589a8 is 8 bytes before a block of size 64 alloc'd ==18887== at 0x4C2A9A1: malloc (vg_replace_malloc.c:299) ==18887== by 0x470498: _PyMem_RawMalloc (obmalloc.c:75) ==18887== by 0x470FFC: PyMem_RawMalloc (obmalloc.c:503) ==18887== by 0x471DEF: _PyObject_Malloc (obmalloc.c:1560) ==18887== by 0x471312: PyObject_Malloc (obmalloc.c:616) ==18887== by 0x4A35D6: PyUnicode_New (unicodeobject.c:1293) ==18887== by 0x4CA16B: _PyUnicodeWriter_PrepareInternal (unicodeobject.c:13423) ==18887== by 0x4B1843: PyUnicode_DecodeUTF8Stateful (unicodeobject.c:4806) ==18887== by 0x4A5E67: PyUnicode_FromString (unicodeobject.c:2105) ==18887== by 0x5313F5: PyContextVar_New (context.c:133) ==18887== by 0xFED052B: PyInit__decimal (_decimal.c:5542) ==18887== by 0x51FC56: _PyImport_LoadDynamicModuleWithSpec (importdl.c:159) ==18887==
Oh thanks, but I see no reason for you to be condescending here.
I cannot reproduce this on Mac OS / Linux. Are you sure you've built your Python correctly? Can you run 'make distclean; ./configure --with-pydebug; make -j4'?
Please. I thought it was pretty much decided that we will update decimal if there is no significant performance degradation, so there's no need for a conspiracy. I put Guido to the nosy-list not because I want to force something, but just because we've discussed decimal and PEP 567/550 with him numerous times.
I ran some of my own tests (not even close to all), they seem fine.
However, I could not find any tests for the added feature (safe
I'm getting a large slowdown:
patched: [0.199, 0.206, 0.198, 0.199, 0.197, 0.202, 0.198, 0.201, 0.213, 0.199]
slowdown: > 10%
patched: [0.535, 0.541, 0.523]
slowdown: > 30%
Given the performance issues I'm -1 for adding the feature at
This is no problem, I can add a few async/await tests.
I'd like you to elaborate a bit more here. First, bench.py produces a completely different output from what you've quoted. How exactly did you compile these results? Are those numbers results of Pi calculation or factorial? Can you upload the actual script you used here (if there's one)?
Second, here's my run of bench.py with contextvars and without: https://gist.github.com/1st1/1187fc58dfdef86e3cad8874e0894938
I don't see any difference, left alone 10% slowdown.
This benchmark is specially constructed to profile creating decimal contexts and doing almost nothing with them.
I've optimized PEP-567 for contextvar.get() operation, not contextvar.set (it's hard to make hamt.set() as fast as dict.set()). That way, if you have an some decimal code that performs actual calculations with decimal objects, the operation of looking up the current context is cheap.
It's hard to imagine a situation, where a real decimal-related code just creates decimal contexts and does nothing else with them.
On Fri, Jan 26, 2018 at 09:06:38PM +0000, Yury Selivanov wrote:
It is not constructed at all. It was the first thing I wrote down trying
Even the telco benchmark (where there's a lot of other stuff going
I did not hunt for these benchmarks. They are the first things I tried out. I
Guido, I have the feeling that the feature -- about which I was actually
BTW, prec is changed quite frequently in decimal code, so if people
Stefan, I don't think a module author should retain veto over everything
Sorry Stefan, I never wanted this to look like "I'm pushing this without listening to Stefan". I apologize if it looked that way.
I ran bm_telco on my machine before submitting the PR, and I indeed did not see any performance impact. I'll try again. I also have a idea of a micro-optimization that might make it a tiny bit faster.
FWIW, I ran bm_telco with pyperformance on a benchmark-tuned system and did not observe the slowdown. Benchmarks were done on a release build (--enable-optimizations)
$ sudo (which python3) -m perf system tune
$ pyperformance run --python=envs/3.7-master-pgo-lto/prefix/bin/python3.7m --affinity=2,3 --rigorous --benchmarks=telco -o json/3.7-master.json Python benchmark suite 0.6.1
MASTER + contextvars patch:
$ pyperformance run --python=envs/3.7-master-pgo+lto+decimal-contextvars/prefix/bin/python3.7m --affinity=2,3 --rigorous --benchmarks=telco -o json/3.7-contextvars.json Python benchmark suite 0.6.1
### telco ###
Likewise, on the same builds, running _decimal/tests/bench.py does not show a significant difference: https://gist.github.com/elprans/fb31510ee28a3aa091aee3f42fe65e00
Since the root of the discussion is a performance regression, let me take a look since I also care of not regressing in term of performance. We (CPython core developers, as as team) spent a lot of time on optimizing CPython to make benchmarks like telco faster at each release. The good news is that Python 3.7 *is* faster than Python 3.6 on telco. If I recall correctly, it's not because of recent optimizations in the decimal module, but more general changes like CALL_METHOD optimization!
Python master vs 3.6 (normalized on 3.6):
Graph of telco performance on master since April 2014 to January 2017:
20.2 ms => 14.1 ms, well done!
If you are curious of reasons why Python became faster, see my documentation:
Or even my talk at Pycon 2017:
Sorry, I moved off topic. Let's move back to this measuring the performance of this issue...
I rewrote xwith.py using my perf module to use CPU pinning (on my isolated CPUs), automatic calibration of the number of loops, ignore the first "warmup" value, spawn 20 processes, compute the average and standard deviation, etc. => see attached xwidth2.py
Results on my laptop with 2 physical cores isolated for best benchmark stability (*):
This is obvious the *worst* case: a *micro* benchmark using local contexts and modifying this local context. In this case, I understand that this microbenchmark basically measures the overhead of contextvars on modying a context.
The question here is if the bottleneck of applications using decimal is the code modifying the context or the code computing numbers (a+b, a*b, a/b, etc.).
Except for a few small projects, I rarely use decimal, so I'm unable to judge that.
But just to add my 2 cents, I never used "with localcontext()", I don't see the point of this tool in my short applications. I prefer to modify directly the current context (getcontext()), and only modify this context *once*, at startup. For example, set the rounding mode and set the precision, and that's all.
The Python benchmark suite does have a benchmark dedicated to the decimal module:
I ran this benchmark on PR 5278:
vstinner@apu$ ./python -m perf compare_to telco_master.json telco_pr5278.json
... not significant. Honestly, I'm not surprised at all:
FYI timings can be seen in verbose mode:
vstinner@apu$ ./python -m perf compare_to telco_master.json telco_pr5278.json -v
Note: it may be interesting to rewrite this benchmark my perf module to be able to easily check if a benchmark result is significant.
"perf determines whether two samples differ significantly using a Student’s two-sample, two-tailed t-test with alpha equals to 0.95."
Usually, I consider that between 5% slower and 5% faster is not significant. But it depends how the benchmark was run, it depends on the type of benchmark, etc. Here I don't know bench.py so I cannot judge.
For example, for an optimization, I'm more interested by an optimization making a benchmark 10% faster ;-)
On Fri, Jan 26, 2018 at 11:11:00PM +0000, STINNER Victor wrote:
Thank you and Elvis for running the benchmarks. Yes, the exact version does seem
Yes, that's the big question. In the generator discussions people were advised
I would use the context functions, which would not require PEP-567 at all.
Okay. In my above reference to telco, I ran the "telco.py full" command
The numbers I posted weren't cooked, but I have a hard time reproducing
This means that I no longer have any objections, so Yury, please go ahead
Thank you, Stefan. I've updated the PR with an asyncio+decimal test and run tests in refleak mode to make sure there's no regression there.
If during the beta/rc period we see that contextvars isn't stable enough or something I'll revert this change before 3.7.0 myself, so that decimal users will not be disturbed.
I'll merge the PR once the CI is green.
Yes, I used decimal examples all the time to showcase how context is supposed to work with generators. Most of those examples were specifically constructed to illustrate some point, but I don't think that real-world code uses a 'with localcontext()' statement in every function.
Unfortunately there's no way (at least known to me) to make 'ContextVar.set()' faster than it is now. I use HAMT which guarantees that all set operations will have O(log n) performance; the other known approach is to use copy-on-write (as in .NET), but that has an O(n) ContextVar.set() performance. So I guess a slightly slower 'with localcontext()' is the price to pay to make decimal easier to use for async/await code.