Potential performance improvements #26

Closed · 8 tasks done
timcassell opened this issue Dec 29, 2021 · 7 comments · Fixed by #71

timcassell commented Dec 29, 2021

Currently in master, benchmark results show speedups when dealing with already-resolved promises and slowdowns when dealing with pending promises.

[Edit] Updated the table from #27 and adjusted the AsyncPending benchmark to use a simpler reusable awaiter.

|                  Type |          Method |   N | BaseIsPending |       Mean | Code Size |
|---------------------- |---------------- |---- |-------------- |-----------:|----------:|
|          AsyncPending | ProtoPromise_V1 | 100 |             ? |  73.930 μs |   4,702 B |
|          AsyncPending | ProtoPromise_V2 | 100 |             ? | 107.352 μs |   7,553 B |
|         AsyncResolved | ProtoPromise_V1 | 100 |             ? |  39.307 μs |   2,982 B |
|         AsyncResolved | ProtoPromise_V2 | 100 |             ? |  18.702 μs |   2,215 B |
|          AwaitPending | ProtoPromise_V1 | 100 |             ? | 278.061 μs |   9,178 B |
|          AwaitPending | ProtoPromise_V2 | 100 |             ? | 424.685 μs |  16,647 B |
|         AwaitResolved | ProtoPromise_V1 | 100 |             ? |   3.931 μs |   2,025 B |
|         AwaitResolved | ProtoPromise_V2 | 100 |             ? |  13.674 μs |   2,845 B |
|   ContinueWithPending | ProtoPromise_V1 | 100 |         False | 328.419 μs |   9,248 B |
|   ContinueWithPending | ProtoPromise_V2 | 100 |         False | 487.998 μs |  18,336 B |
|   ContinueWithPending | ProtoPromise_V1 | 100 |          True | 336.582 μs |   9,248 B |
|   ContinueWithPending | ProtoPromise_V2 | 100 |          True | 484.569 μs |  14,874 B |
| ContinueWithFromValue | ProtoPromise_V1 | 100 |         False |  77.698 μs |   4,059 B |
| ContinueWithFromValue | ProtoPromise_V2 | 100 |         False |  24.998 μs |   7,238 B |
| ContinueWithFromValue | ProtoPromise_V1 | 100 |          True |  73.843 μs |   4,391 B |
| ContinueWithFromValue | ProtoPromise_V2 | 100 |          True | 128.037 μs |   9,317 B |
|  ContinueWithResolved | ProtoPromise_V1 | 100 |         False | 106.440 μs |   4,017 B |
|  ContinueWithResolved | ProtoPromise_V2 | 100 |         False |  32.014 μs |   9,193 B |
|  ContinueWithResolved | ProtoPromise_V1 | 100 |          True | 100.274 μs |   4,522 B |
|  ContinueWithResolved | ProtoPromise_V2 | 100 |          True | 137.503 μs |   8,947 B |

This is expected due to changing promises to structs and adding thread safety. But there are areas where we can improve the performance for pending promises.

  • Interface method calls are twice as expensive as class virtual method calls. Some internal interfaces can be changed to abstract classes, such as ITreeHandleable, IValueContainer, and IMultiTreeHandleable.
  • Multiple virtual calls are made that can be reduced to a single virtual call. [Edit] I tried these in Optimize virtual calls #38, and it wound up being a wash with larger code, so I opted not to merge it.
    • When hooking up a callback (Promise.Then), MarkAwaited is a virtual call, then the new PromiseRef is created, then HookupNewPromise is a second virtual call. This was done for validation purposes, but it can be reduced to a single virtual call, with careful consideration for the PromiseRef created when the call is invalid. Also note that we cannot use a generic creator passed into the virtual function, because AOT compilers have trouble with generic virtual functions and structs (and using a non-struct would still be a second virtual call and consume heap space).
    • When ITreeHandleable.Handle is invoked, PromiseRef calls a separate virtual Execute function. This is done so that all calls to Execute are wrapped in the try/catch, but that logic can be shifted to pass the IDelegate structs to a generic function with the try/catch in it instead, similar to the CallbackHelper for synchronous invokes.
  • When promises are completed, ValueContainers are created with 0 retains, and retain is called when the container is passed to ResolveInternal or RejectOrCancelInternal. This can be changed to always create with 1 retain and skip the retain call in those methods.
  • PromiseRef.MaybeDispose() will always call Dispose() virtually. Most of the time MaybeDispose() is called from the most derived class, so that can be changed to a direct call instead of virtual. [Edit] No longer true since Handle combine #55
  • async/await can be improved by hooking up the AsyncPromiseRef directly to the awaited PromiseRef instead of creating an AwaiterRef, by doing a special type check on the awaiter for PromiseAwaiter in the PromiseMethodBuilder. This optimization is trickier than the others and will need to make sure it doesn't box (very similar to Reporting progress from an async Promise function #10).
  • It is possible to change object pooling to use an Interlocked method instead of a SpinLock, but to be efficient it would require using C# 7's ref local and ref return features. This can be done while still supporting older language versions, but it will get ugly with the #ifdefs (see the sketch after this list).
  • Call IPromiseWaiter.AwaitOnCompletedInternal directly in .Net Core instead of going through AwaitOverrider, since the JIT optimizes away the box instructions (but not in Unity).
  • Unsafe.As<T>(object) in .Net 5+
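
Regarding the Interlocked pooling bullet above, here is a minimal sketch of the push/pop idea using Interlocked.CompareExchange instead of a SpinLock. It is not ProtoPromise's actual pool: the type and member names are hypothetical, it uses a static head per pooled type rather than ref locals/returns, and a production version would also need to deal with the ABA problem.

```csharp
using System.Threading;

// Hypothetical pooled-node base type; pooled objects link through an intrusive _next field.
internal abstract class PooledNode
{
    internal PooledNode _next;
}

// Hypothetical pool with one head per pooled type. Note: the ABA problem is not handled here.
internal static class LockFreePool<T> where T : PooledNode, new()
{
    private static PooledNode s_head;

    internal static T TakeOrCreate()
    {
        // Pop the head with a CAS loop instead of acquiring a SpinLock.
        PooledNode head = Volatile.Read(ref s_head);
        while (head != null)
        {
            if (Interlocked.CompareExchange(ref s_head, head._next, head) == head)
            {
                head._next = null;
                return (T) head;
            }
            head = Volatile.Read(ref s_head);
        }
        return new T();
    }

    internal static void Return(T obj)
    {
        // Push onto the head with a CAS loop.
        PooledNode head;
        do
        {
            head = Volatile.Read(ref s_head);
            obj._next = head;
        } while (Interlocked.CompareExchange(ref s_head, obj, head) != head);
    }
}
```

Usage would look like `var node = LockFreePool<MyNode>.TakeOrCreate();` followed by `LockFreePool<MyNode>.Return(node);` when the object is no longer needed.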

Unfortunately, I don't think there is much that can be done for awaiting a resolved promise, as v1 was branchless (the promise was its own awaiter and didn't have to check itself for null). V2 is more complex, with the struct wrapping a reference and using a separate PromiseAwaiter struct for safety. [Edit] Awaiting a promise was improved in #39.

timcassell self-assigned this Dec 29, 2021

timcassell commented Dec 30, 2021

When the T result of Promise<T> is a reference type, we could store that directly in the object _valueOrPrevious field instead of wrapping it in a ResolveContainer<T>. Then instead of using the IValueContainer.GetState() to retrieve the previous promise's state, we can store the previous state in a field on the PromiseRef.

This would result in decreased memory consumption for reference types, but I'm not sure if it would help with execution time. On one hand, it would eliminate a virtual call to IValueContainer.GetState(), but on the other hand, storing the previous state as a field would either increase the size of the PromiseRef, or it would need to be placed inside the SmallFields struct with an explicit offset, sharing space with the _state field. There is enough bit space to do it (State only consumes 2 bits), but it would require bit-masking to read the proper values.
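
To make the bit-masking idea concrete, here is a hedged sketch of packing the promise's own state and the previous promise's state into one int, assuming a 2-bit state. The enum and field names are illustrative, not the library's actual layout, and real code would also need volatile/Interlocked access for thread safety.

```csharp
// Illustrative 2-bit state, standing in for the real promise state enum.
internal enum SketchState : byte { Pending = 0, Resolved = 1, Rejected = 2, Canceled = 3 }

// Hypothetical packed field; reads and writes use bit masks to isolate each 2-bit value.
internal struct PackedStates
{
    private const int StateMask = 0x3;         // a state only needs 2 bits
    private const int PreviousStateShift = 2;

    private int _packed;

    internal SketchState State
    {
        get { return (SketchState) (_packed & StateMask); }
        set { _packed = (_packed & ~StateMask) | (int) value; }
    }

    internal SketchState PreviousState
    {
        get { return (SketchState) ((_packed >> PreviousStateShift) & StateMask); }
        set { _packed = (_packed & ~(StateMask << PreviousStateShift)) | ((int) value << PreviousStateShift); }
    }
}
```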

If this change is made, benchmarks are a must before merging it.

[Edit] I investigated making this change, and it ended up costing extra checks to see whether the field was a ValueContainer or not (which right now it happily just assumes it is). But I did end up optimizing it so that reference types only create one ResolveContainer<object> type, instead of a type for every T, in #60.
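
Below is a hedged sketch of that idea, with simplified names that are not the library's actual internals: the factory funnels every reference-type T into the single ResolveContainer<object> instantiation, so only value types produce additional generic container types.

```csharp
internal class ResolveContainer<T>
{
    internal T _value;
}

internal static class ContainerFactory
{
    internal static object CreateResolveContainer<T>(T value)
    {
        // typeof(T).IsValueType is folded to a constant by the JIT for each T.
        if (!typeof(T).IsValueType)
        {
            // All reference types share one ResolveContainer<object> class;
            // the consumer casts the stored object back to T when reading the result.
            return new ResolveContainer<object>() { _value = value };
        }
        return new ResolveContainer<T>() { _value = value };
    }
}
```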

timcassell commented Dec 30, 2021

[Edit] This was done in #55

Another option is combining ITreeHandleable.MakeReady and ITreeHandleable.Handle into a single call, so each promise's callbacks will be invoked recursively. This would eliminate a virtual call, as well as the work of adding each promise to and removing it from the internal execution stack.

But this would also cause a StackOverflowException if the promise chain is very long. They are currently separate calls to allow the stack to unwind before invoking the next. It also allows objects to re-pool themselves before the next invoke in case they need to be re-used during the callback.
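
To illustrate the trade-off, here is a generic, hedged sketch (not ProtoPromise's internals) of the two dispatch styles: direct recursion, which is cheaper but grows the call stack with the chain length, versus draining an explicit execution stack, which lets the call stack unwind between invocations.

```csharp
using System.Collections.Generic;

internal interface IHandleable
{
    // Returns the next handler to run, or null when the chain ends.
    IHandleable Handle();
}

internal static class ExecutionSketch
{
    // Combined MakeReady/Handle style: each completion invokes the next directly.
    // Simple and avoids scheduling overhead, but a long chain grows the call stack.
    internal static void HandleRecursively(IHandleable first)
    {
        IHandleable next = first.Handle();
        if (next != null)
        {
            HandleRecursively(next);
        }
    }

    // Separate-call style: handlers are pushed onto an explicit stack and drained in a loop,
    // so the call stack unwinds between invocations regardless of chain length.
    internal static void HandleWithExecutionStack(IHandleable first)
    {
        var stack = new Stack<IHandleable>();
        stack.Push(first);
        while (stack.Count > 0)
        {
            IHandleable next = stack.Pop().Handle();
            if (next != null)
            {
                stack.Push(next);
            }
        }
    }
}
```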

I'm currently thinking the downsides outweigh the upside at this time.

As an aside, async/await can currently cause a StackOverflowException if the recursive depth is very large, which the async/await improvement in the OP would also solve. [Edit] This was fixed in #54.

timcassell commented Dec 30, 2021

For progress improvements:

  • Instead of a separate field (_depthAndProgress), we can store the depth of the promise as a short in SmallFields, sharing the bit space with _deferredId, because _deferredId is only needed for the base of the promise chain (from the Deferred, obviously), and the depth is always 0 at that point. Fixed32 will need to be changed to use a short for the whole value and a ushort for the decimal value, obsoleting Config.ProgressDecimalBits, which will always be 16 (the size of ushort); see the sketch after this list.
  • The algorithm for updating progress can be adjusted so that PromiseWaitPromise doesn't need to implement IProgressInvokable to save on memory. Since progressListener.SetProgress already has an out PromiseSingleAwaitWithProgress nextRef to continue updating each progress ref iteratively, we can change Fixed32 progress to ref Fixed32 progress and PromiseWaitPromise can update the calculated progress and out the nextRef instead of scheduling itself on the ExecutionScheduler.
  • PromiseMultiAwait can add an int _progressRetains field to retain itself while it's invoking progress, instead of locking, to improve concurrency.
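
For the first bullet, here is a hedged sketch of a 16.16 fixed-point progress value (16-bit whole part for the depth, 16-bit fraction, i.e. ProgressDecimalBits fixed at 16). The type is illustrative and not the library's actual Fixed32.

```csharp
internal readonly struct Fixed32Sketch
{
    private const int DecimalBits = 16;                     // ushort-sized fraction
    private const int DecimalMask = (1 << DecimalBits) - 1;
    private const double DecimalMax = 1 << DecimalBits;

    private readonly int _value;

    private Fixed32Sketch(int value) { _value = value; }

    internal static Fixed32Sketch FromWholeAndFraction(short wholePart, double fraction)
    {
        // Pack the whole part (e.g. promise depth) in the high 16 bits
        // and the normalized fraction [0, 1) in the low 16 bits.
        return new Fixed32Sketch((wholePart << DecimalBits) | (int) (fraction * DecimalMax));
    }

    internal short WholePart { get { return (short) (_value >> DecimalBits); } }

    internal double DecimalPart { get { return (_value & DecimalMask) / DecimalMax; } }

    internal double ToDouble() { return WholePart + DecimalPart; }
}
```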

[Edit] These were done, except for the progress retains for PromiseMultiAwait, where some race conditions proved difficult to handle. I may revisit it at a later time.

[Edit2] PromiseMultiAwait progress counter is implemented in #63.

timcassell commented

.Net 5 added the Unsafe class. We can use T Unsafe.As<T>(object o) to cast more efficiently when we know the cast will succeed (everywhere we use traditional (type) casts).
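
A minimal sketch of the idea; the helper and placeholder type below are hypothetical, not ProtoPromise's API. Unsafe.As<T> skips the castclass type check that a traditional cast emits, so it should only be used where the type is already guaranteed (e.g. guarded by a debug assertion).

```csharp
using System.Diagnostics;
using System.Runtime.CompilerServices;

// Placeholder standing in for an internal promise class.
internal class PromiseRefBase { }

internal static class CastHelper
{
    internal static PromiseRefBase AsPromiseRef(object valueOrPrevious)
    {
        // The caller guarantees the runtime type; assert it in debug builds.
        Debug.Assert(valueOrPrevious is PromiseRefBase);
#if NET5_0_OR_GREATER
        // No runtime type check is emitted here.
        return Unsafe.As<PromiseRefBase>(valueOrPrevious);
#else
        // Traditional cast: emits castclass, which checks the type at runtime.
        return (PromiseRefBase) valueOrPrevious;
#endif
    }
}
```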

timcassell commented Jan 18, 2022

https://issuetracker.unity3d.com/issues/il2cpp-incorrect-results-when-calling-a-method-from-outside-class-in-a-struct

This issue, which forced us to use a less efficient async method builder in IL2CPP builds, was fixed in Unity 2020.3.20f1 and 2021.1.24f1. We can use #if UNITY_2021_2_OR_NEWER to enable the optimized builder for IL2CPP builds that other runtimes already enjoy (it would take too many ifdefs to also support 2020, because Unity doesn't include OR_NEWER symbols for patch releases). A sketch of the guard is below.
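
A hedged, illustrative sketch of the compile-time guard; the builder type name and its contents are hypothetical placeholders, not the library's actual types. ENABLE_IL2CPP and UNITY_2021_2_OR_NEWER are Unity's standard scripting define symbols.

```csharp
#if ENABLE_IL2CPP && !UNITY_2021_2_OR_NEWER
// Older IL2CPP runtimes still hit the struct-call miscompilation,
// so keep the less efficient workaround builder.
internal struct AsyncPromiseMethodBuilderSketch
{
    // ...IL2CPP-safe workaround implementation...
}
#else
// Other runtimes, and IL2CPP on Unity 2021.2+, use the optimized builder.
internal struct AsyncPromiseMethodBuilderSketch
{
    // ...optimized implementation...
}
#endif
```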

[Edit] Done in #45.

timcassell commented Jan 18, 2022

[Edit] Bad measurement; I was benchmarking the same configuration both times. Here are the real results:

Without progress:

|                  Type |          Method |   N | BaseIsPending |      Mean | Code Size | Allocated | Survived |
|---------------------- |---------------- |---- |-------------- |----------:|----------:|----------:|---------:|
|          AwaitPending | ProtoPromise_V2 | 100 |             ? | 382.57 μs |  13,141 B |         - |    416 B |
|          AsyncPending | ProtoPromise_V2 | 100 |             ? |  86.27 μs |   7,240 B |         - |    728 B |
|         AsyncResolved | ProtoPromise_V2 | 100 |             ? |  16.82 μs |   2,172 B |         - |        - |
|         AwaitResolved | ProtoPromise_V2 | 100 |             ? |  11.03 μs |   1,906 B |      72 B |        - |
|   ContinueWithPending | ProtoPromise_V2 | 100 |         False | 412.14 μs |  17,455 B |         - | 17,264 B |
| ContinueWithFromValue | ProtoPromise_V2 | 100 |         False |  25.16 μs |   8,005 B |         - |        - |
|  ContinueWithResolved | ProtoPromise_V2 | 100 |         False |  24.67 μs |   8,195 B |         - |        - |
|   ContinueWithPending | ProtoPromise_V2 | 100 |          True | 403.14 μs |  14,912 B |         - | 17,216 B |
| ContinueWithFromValue | ProtoPromise_V2 | 100 |          True | 112.18 μs |   9,504 B |         - | 17,144 B |
|  ContinueWithResolved | ProtoPromise_V2 | 100 |          True | 126.77 μs |   9,521 B |         - | 17,144 B |

With progress:

|                  Type |          Method |   N | BaseIsPending |      Mean | Code Size | Allocated | Survived |
|---------------------- |---------------- |---- |-------------- |----------:|----------:|----------:|---------:|
|          AwaitPending | ProtoPromise_V2 | 100 |             ? | 355.67 μs |  13,403 B |         - |    416 B |
|          AsyncPending | ProtoPromise_V2 | 100 |             ? | 103.09 μs |   7,542 B |         - |    760 B |
|         AsyncResolved | ProtoPromise_V2 | 100 |             ? |  17.69 μs |   2,210 B |         - |        - |
|         AwaitResolved | ProtoPromise_V2 | 100 |             ? |  11.33 μs |   1,906 B |      72 B |        - |
|   ContinueWithPending | ProtoPromise_V2 | 100 |         False | 460.28 μs |  18,828 B |         - | 26,984 B |
| ContinueWithFromValue | ProtoPromise_V2 | 100 |         False |  24.85 μs |   8,179 B |         - |        - |
|  ContinueWithResolved | ProtoPromise_V2 | 100 |         False |  31.72 μs |  10,186 B |         - |        - |
|   ContinueWithPending | ProtoPromise_V2 | 100 |          True | 449.77 μs |  15,353 B |         - | 26,832 B |
| ContinueWithFromValue | ProtoPromise_V2 | 100 |          True | 116.52 μs |   9,882 B |         - | 17,160 B |
|  ContinueWithResolved | ProtoPromise_V2 | 100 |          True | 135.89 μs |  10,000 B |         - | 26,760 B |

But the largest difference is still only 11% (48 μs divided by 300 is only 160 ns per ContinueWith), so it may still be worth it to implement that progress improvement and remove the option to disable progress for v2.

timcassell commented Mar 7, 2022

#54 changed async/await to hook up PromiseRefs directly to prevent StackOverflowExceptions. The implementation uses a pass-through object or function pointer to avoid boxing allocations.

public void AwaitUnsafeOnCompleted<TAwaiter, TStateMachine>(ref TAwaiter awaiter, ref TStateMachine stateMachine)
    where TAwaiter : ICriticalNotifyCompletion
    where TStateMachine : IAsyncStateMachine
{
    SetStateMachine(ref stateMachine);
    if (null != default(TAwaiter) && AwaitOverrider<TAwaiter>.IsOverridden())
    {
        AwaitOverrider<TAwaiter>.AwaitOnCompletedInternal(ref awaiter, _ref);
    }
    else
    {
        awaiter.UnsafeOnCompleted(_ref.MoveNext);
    }
}

That code can be optimized to call the method directly on the awaiter in .Net 5, thanks to the JIT optimizing away the boxes. Unity has not yet implemented that optimization, so the AwaitOverrider implementation will need to remain until it does, but we can still implement the optimization for non-Unity consumption.

public void AwaitUnsafeOnCompleted<TAwaiter, TStateMachine>(ref TAwaiter awaiter, ref TStateMachine stateMachine)
    where TAwaiter : ICriticalNotifyCompletion
    where TStateMachine : IAsyncStateMachine
{
    SetStateMachine(ref stateMachine);
    if (null != default(TAwaiter) && awaiter is IPromiseAwaiter)
    {
        ((IPromiseAwaiter) awaiter).AwaitOnCompletedInternal(_ref);
    }
    else
    {
        awaiter.UnsafeOnCompleted(_ref.MoveNext);
    }
}

A quick benchmark showed that this change increases speed by ~10%–15% in .Net 5.
