Potential performance improvements #26

Closed · 8 tasks done
timcassell opened this issue Dec 29, 2021 · 7 comments · Fixed by #71

timcassell commented Dec 29, 2021

Currently in master, benchmark results show speedups when dealing with already-resolved promises and slowdowns when dealing with pending promises.

[Edit] Updated the table from #27 and adjusted the AsyncPending benchmark to use a simpler reusable awaiter.

|                  Type |          Method |   N | BaseIsPending |       Mean | Code Size |
|---------------------- |---------------- |---- |-------------- |-----------:|----------:|
|          AsyncPending | ProtoPromise_V1 | 100 |             ? |  73.930 μs |   4,702 B |
|          AsyncPending | ProtoPromise_V2 | 100 |             ? | 107.352 μs |   7,553 B |
|         AsyncResolved | ProtoPromise_V1 | 100 |             ? |  39.307 μs |   2,982 B |
|         AsyncResolved | ProtoPromise_V2 | 100 |             ? |  18.702 μs |   2,215 B |
|          AwaitPending | ProtoPromise_V1 | 100 |             ? | 278.061 μs |   9,178 B |
|          AwaitPending | ProtoPromise_V2 | 100 |             ? | 424.685 μs |  16,647 B |
|         AwaitResolved | ProtoPromise_V1 | 100 |             ? |   3.931 μs |   2,025 B |
|         AwaitResolved | ProtoPromise_V2 | 100 |             ? |  13.674 μs |   2,845 B |
|   ContinueWithPending | ProtoPromise_V1 | 100 |         False | 328.419 μs |   9,248 B |
|   ContinueWithPending | ProtoPromise_V2 | 100 |         False | 487.998 μs |  18,336 B |
|   ContinueWithPending | ProtoPromise_V1 | 100 |          True | 336.582 μs |   9,248 B |
|   ContinueWithPending | ProtoPromise_V2 | 100 |          True | 484.569 μs |  14,874 B |
| ContinueWithFromValue | ProtoPromise_V1 | 100 |         False |  77.698 μs |   4,059 B |
| ContinueWithFromValue | ProtoPromise_V2 | 100 |         False |  24.998 μs |   7,238 B |
| ContinueWithFromValue | ProtoPromise_V1 | 100 |          True |  73.843 μs |   4,391 B |
| ContinueWithFromValue | ProtoPromise_V2 | 100 |          True | 128.037 μs |   9,317 B |
|  ContinueWithResolved | ProtoPromise_V1 | 100 |         False | 106.440 μs |   4,017 B |
|  ContinueWithResolved | ProtoPromise_V2 | 100 |         False |  32.014 μs |   9,193 B |
|  ContinueWithResolved | ProtoPromise_V1 | 100 |          True | 100.274 μs |   4,522 B |
|  ContinueWithResolved | ProtoPromise_V2 | 100 |          True | 137.503 μs |   8,947 B |

This is expected due to changing promises to structs and adding thread safety. But there are areas where we can improve the performance for pending promises.

  • Interface method calls are twice as expensive as class virtual method calls. Some internal interfaces can be changed to abstract classes, such as ITreeHandleable, IValueContainer, and IMultiTreeHandleable.
  • Multiple virtual calls are made that can be reduced to a single virtual call. [Edit] I tried these in Optimize virtual calls #38, and it wound up being a wash with larger code, so I opted not to merge it.
    • When hooking up a callback (Promise.Then), MarkAwaited is a virtual call, then the new PromiseRef is created, then HookupNewPromise is a second virtual call. This was done for validation purposes, but it can be reduced to a single virtual call, with careful consideration for the PromiseRef created when the call is invalid. Also note that we cannot use a generic creator passed into the virtual function, because AOT compilers have trouble with generic virtual functions and structs (and using a non-struct would still be a second virtual call and consume heap space).
    • When ITreeHandleable.Handle is invoked, PromiseRef calls a separate virtual Execute function. This is done so that all calls to Execute are wrapped in the try/catch, but that logic can be shifted to pass the IDelegate structs to a generic function with the try/catch in it instead, similar to the CallbackHelper for synchronous invokes.
  • When promises are completed, ValueContainers are created with 0 retains, and retain is called when the container is passed to ResolveInternal or RejectOrCancelInternal. This can be changed to always create with 1 retain and skip the retain call in those methods.
  • PromiseRef.MaybeDispose() will always call Dispose() virtually. Most of the time MaybeDispose() is called from the most derived class, so that can be changed to a direct call instead of virtual. [Edit] No longer true since Handle combine #55
  • async/await can be improved by hooking up the AsyncPromiseRef directly to the awaited PromiseRef instead of creating an AwaiterRef, by doing a special type check on the awaiter for PromiseAwaiter in the PromiseMethodBuilder. This optimization is trickier than the others and will need to make sure it doesn't box (very similar to Reporting progress from an async Promise function #10).
  • It is possible to change object pooling to use an Interlocked method instead of a SpinLock, but to be efficient it would require using C# 7's ref local and ref return features. This can be done while still supporting older language versions, but it will get ugly with the #ifdefs (see the sketch after this list).
  • Call IPromiseWaiter.AwaitOnCompletedInternal directly in .Net Core instead of going through AwaitOverrider, since the JIT optimizes away the box instructions (but not in Unity).
  • Unsafe.As<T>(object) in .Net 5+
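
Regarding the Interlocked pooling bullet above, here is a minimal sketch of the push/pop idea using Interlocked.CompareExchange instead of a SpinLock. It is not ProtoPromise's actual pool: the type and member names are hypothetical, it uses a static head per pooled type rather than ref locals/returns, and a production version would also need to deal with the ABA problem.

```csharp
using System.Threading;

// Hypothetical pooled-node base type; pooled objects link through an intrusive _next field.
internal abstract class PooledNode
{
    internal PooledNode _next;
}

// Hypothetical pool with one head per pooled type. Note: the ABA problem is not handled here.
internal static class LockFreePool<T> where T : PooledNode, new()
{
    private static PooledNode s_head;

    internal static T TakeOrCreate()
    {
        // Pop the head with a CAS loop instead of acquiring a SpinLock.
        PooledNode head = Volatile.Read(ref s_head);
        while (head != null)
        {
            if (Interlocked.CompareExchange(ref s_head, head._next, head) == head)
            {
                head._next = null;
                return (T) head;
            }
            head = Volatile.Read(ref s_head);
        }
        return new T();
    }

    internal static void Return(T obj)
    {
        // Push onto the head with a CAS loop.
        PooledNode head;
        do
        {
            head = Volatile.Read(ref s_head);
            obj._next = head;
        } while (Interlocked.CompareExchange(ref s_head, obj, head) != head);
    }
}
```

Usage would look like `var node = LockFreePool<MyNode>.TakeOrCreate();` followed by `LockFreePool<MyNode>.Return(node);` when the object is no longer needed.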

Unfortunately, I don't think there is much that can be done for awaiting a resolved promise, as v1 was branchless (the promise was its own awaiter and didn't have to check itself for null). V2 is more complex, with the struct wrapping a reference and using a separate PromiseAwaiter struct for safety. [Edit] Awaiting a promise was improved in #39.

timcassell self-assigned this Dec 29, 2021

timcassell commented Dec 30, 2021

When the T result of Promise<T> is a reference type, we could store that directly in the object _valueOrPrevious field instead of wrapping it in a ResolveContainer<T>. Then instead of using the IValueContainer.GetState() to retrieve the previous promise's state, we can store the previous state in a field on the PromiseRef.

This would result in decreased memory consumption for reference types, but I'm not sure if it would help with execution time. On one hand, it would eliminate a virtual call to IValueContainer.GetState(), but on the other hand, storing the previous state as a field would either increase the size of the PromiseRef, or it would need to be placed inside the SmallFields struct with an explicit offset, sharing space with the _state field. There is enough bit space to do it (State only consumes 2 bits), but it would require bit-masking to read the proper values.
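
To make the bit-masking idea concrete, here is a hedged sketch of packing the promise's own state and the previous promise's state into one int, assuming a 2-bit state. The enum and field names are illustrative, not the library's actual layout, and real code would also need volatile/Interlocked access for thread safety.

```csharp
// Illustrative 2-bit state, standing in for the real promise state enum.
internal enum SketchState : byte { Pending = 0, Resolved = 1, Rejected = 2, Canceled = 3 }

// Hypothetical packed field; reads and writes use bit masks to isolate each 2-bit value.
internal struct PackedStates
{
    private const int StateMask = 0x3;         // a state only needs 2 bits
    private const int PreviousStateShift = 2;

    private int _packed;

    internal SketchState State
    {
        get { return (SketchState) (_packed & StateMask); }
        set { _packed = (_packed & ~StateMask) | (int) value; }
    }

    internal SketchState PreviousState
    {
        get { return (SketchState) ((_packed >> PreviousStateShift) & StateMask); }
        set { _packed = (_packed & ~(StateMask << PreviousStateShift)) | ((int) value << PreviousStateShift); }
    }
}
```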

If this change is made, benchmarks are a must before merging it.

[Edit] I investigated making this change, and it ended up costing extra checks to see whether the field was a ValueContainer or not (which right now it happily just assumes it is). But I did end up optimizing it so that reference types only create one ResolveContainer<object> type, instead of a type for every T, in #60.
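
Below is a hedged sketch of that idea, with simplified names that are not the library's actual internals: the factory funnels every reference-type T into the single ResolveContainer<object> instantiation, so only value types produce additional generic container types.

```csharp
internal class ResolveContainer<T>
{
    internal T _value;
}

internal static class ContainerFactory
{
    internal static object CreateResolveContainer<T>(T value)
    {
        // typeof(T).IsValueType is folded to a constant by the JIT for each T.
        if (!typeof(T).IsValueType)
        {
            // All reference types share one ResolveContainer<object> class;
            // the consumer casts the stored object back to T when reading the result.
            return new ResolveContainer<object>() { _value = value };
        }
        return new ResolveContainer<T>() { _value = value };
    }
}
```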

timcassell commented Dec 30, 2021

[Edit] This was done in #55

Another option is combining ITreeHandleable.MakeReady and ITreeHandleable.Handle into a single call, so each promise's callbacks will be invoked recursively. This would eliminate a virtual call, as well as the work of adding each promise to and removing it from the internal execution stack.

But this would also cause a StackOverflowException if the promise chain is very long. They are currently separate calls to allow the stack to unwind before invoking the next. It also allows objects to re-pool themselves before the next invoke in case they need to be re-used during the callback.
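
To illustrate the trade-off, here is a generic, hedged sketch (not ProtoPromise's internals) of the two dispatch styles: direct recursion, which is cheaper but grows the call stack with the chain length, versus draining an explicit execution stack, which lets the call stack unwind between invocations.

```csharp
using System.Collections.Generic;

internal interface IHandleable
{
    // Returns the next handler to run, or null when the chain ends.
    IHandleable Handle();
}

internal static class ExecutionSketch
{
    // Combined MakeReady/Handle style: each completion invokes the next directly.
    // Simple and avoids scheduling overhead, but a long chain grows the call stack.
    internal static void HandleRecursively(IHandleable first)
    {
        IHandleable next = first.Handle();
        if (next != null)
        {
            HandleRecursively(next);
        }
    }

    // Separate-call style: handlers are pushed onto an explicit stack and drained in a loop,
    // so the call stack unwinds between invocations regardless of chain length.
    internal static void HandleWithExecutionStack(IHandleable first)
    {
        var stack = new Stack<IHandleable>();
        stack.Push(first);
        while (stack.Count > 0)
        {
            IHandleable next = stack.Pop().Handle();
            if (next != null)
            {
                stack.Push(next);
            }
        }
    }
}
```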

I'm currently thinking the downsides outweigh the upside at this time.

As an aside, async/await can currently cause a StackOverflowException if the recursive depth is very large, which the async/await improvement in the OP would also solve. [Edit] This was fixed in #54.

timcassell commented Dec 30, 2021

For progress improvements:

  • Instead of a separate field (_depthAndProgress), we can store the depth of the promise as a short in SmallFields, sharing the bit space with _deferredId, because _deferredId is only needed for the base of the promise chain (from the Deferred, obviously), and the depth is always 0 at that point. Fixed32 will need to be changed to use a short for the whole value and a ushort for the decimal value, obsoleting Config.ProgressDecimalBits, which will always be 16 (the size of ushort); see the sketch after this list.
  • The algorithm for updating progress can be adjusted so that PromiseWaitPromise doesn't need to implement IProgressInvokable to save on memory. Since progressListener.SetProgress already has an out PromiseSingleAwaitWithProgress nextRef to continue updating each progress ref iteratively, we can change Fixed32 progress to ref Fixed32 progress and PromiseWaitPromise can update the calculated progress and out the nextRef instead of scheduling itself on the ExecutionScheduler.
  • PromiseMultiAwait can add an int _progressRetains field to retain itself while it's invoking progress, instead of locking, to improve concurrency.
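
For the first bullet, here is a hedged sketch of a 16.16 fixed-point progress value (16-bit whole part for the depth, 16-bit fraction, i.e. ProgressDecimalBits fixed at 16). The type is illustrative and not the library's actual Fixed32.

```csharp
internal readonly struct Fixed32Sketch
{
    private const int DecimalBits = 16;                     // ushort-sized fraction
    private const int DecimalMask = (1 << DecimalBits) - 1;
    private const double DecimalMax = 1 << DecimalBits;

    private readonly int _value;

    private Fixed32Sketch(int value) { _value = value; }

    internal static Fixed32Sketch FromWholeAndFraction(short wholePart, double fraction)
    {
        // Pack the whole part (e.g. promise depth) in the high 16 bits
        // and the normalized fraction [0, 1) in the low 16 bits.
        return new Fixed32Sketch((wholePart << DecimalBits) | (int) (fraction * DecimalMax));
    }

    internal short WholePart { get { return (short) (_value >> DecimalBits); } }

    internal double DecimalPart { get { return (_value & DecimalMask) / DecimalMax; } }

    internal double ToDouble() { return WholePart + DecimalPart; }
}
```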

[Edit] These were done, except for the progress retains for PromiseMultiAwait, where some race conditions proved difficult to handle. I may revisit it at a later time.

[Edit2] PromiseMultiAwait progress counter is implemented in #63.

timcassell commented

.Net 5 added the Unsafe class. We can use T Unsafe.As<T>(object o) to cast more efficiently when we know the cast will succeed (everywhere we use traditional (type) casts).
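
A minimal sketch of the idea; the helper and placeholder type below are hypothetical, not ProtoPromise's API. Unsafe.As<T> skips the castclass type check that a traditional cast emits, so it should only be used where the type is already guaranteed (e.g. guarded by a debug assertion).

```csharp
using System.Diagnostics;
using System.Runtime.CompilerServices;

// Placeholder standing in for an internal promise class.
internal class PromiseRefBase { }

internal static class CastHelper
{
    internal static PromiseRefBase AsPromiseRef(object valueOrPrevious)
    {
        // The caller guarantees the runtime type; assert it in debug builds.
        Debug.Assert(valueOrPrevious is PromiseRefBase);
#if NET5_0_OR_GREATER
        // No runtime type check is emitted here.
        return Unsafe.As<PromiseRefBase>(valueOrPrevious);
#else
        // Traditional cast: emits castclass, which checks the type at runtime.
        return (PromiseRefBase) valueOrPrevious;
#endif
    }
}
```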

timcassell commented Jan 18, 2022

https://issuetracker.unity3d.com/issues/il2cpp-incorrect-results-when-calling-a-method-from-outside-class-in-a-struct

This issue, which forced us to use a less efficient async method builder in IL2CPP builds, was fixed in Unity 2020.3.20f1 and 2021.1.24f1. We can use #if UNITY_2021_2_OR_NEWER to enable the optimized builder for IL2CPP builds that other runtimes already enjoy (it would take too many ifdefs to also support 2020, because Unity doesn't include OR_NEWER symbols for patch releases). A sketch of the guard is below.
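
A hedged, illustrative sketch of the compile-time guard; the builder type name and its contents are hypothetical placeholders, not the library's actual types. ENABLE_IL2CPP and UNITY_2021_2_OR_NEWER are Unity's standard scripting define symbols.

```csharp
#if ENABLE_IL2CPP && !UNITY_2021_2_OR_NEWER
// Older IL2CPP runtimes still hit the struct-call miscompilation,
// so keep the less efficient workaround builder.
internal struct AsyncPromiseMethodBuilderSketch
{
    // ...IL2CPP-safe workaround implementation...
}
#else
// Other runtimes, and IL2CPP on Unity 2021.2+, use the optimized builder.
internal struct AsyncPromiseMethodBuilderSketch
{
    // ...optimized implementation...
}
#endif
```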

[Edit] Done in #45.

timcassell commented Jan 18, 2022

[Edit] Bad measurement; I was benchmarking the same configuration both times. Here are the real results:

Without progress:

|                  Type |          Method |   N | BaseIsPending |      Mean | Code Size | Allocated | Survived |
|---------------------- |---------------- |---- |-------------- |----------:|----------:|----------:|---------:|
|          AwaitPending | ProtoPromise_V2 | 100 |             ? | 382.57 μs |  13,141 B |         - |    416 B |
|          AsyncPending | ProtoPromise_V2 | 100 |             ? |  86.27 μs |   7,240 B |         - |    728 B |
|         AsyncResolved | ProtoPromise_V2 | 100 |             ? |  16.82 μs |   2,172 B |         - |        - |
|         AwaitResolved | ProtoPromise_V2 | 100 |             ? |  11.03 μs |   1,906 B |      72 B |        - |
|   ContinueWithPending | ProtoPromise_V2 | 100 |         False | 412.14 μs |  17,455 B |         - | 17,264 B |
| ContinueWithFromValue | ProtoPromise_V2 | 100 |         False |  25.16 μs |   8,005 B |         - |        - |
|  ContinueWithResolved | ProtoPromise_V2 | 100 |         False |  24.67 μs |   8,195 B |         - |        - |
|   ContinueWithPending | ProtoPromise_V2 | 100 |          True | 403.14 μs |  14,912 B |         - | 17,216 B |
| ContinueWithFromValue | ProtoPromise_V2 | 100 |          True | 112.18 μs |   9,504 B |         - | 17,144 B |
|  ContinueWithResolved | ProtoPromise_V2 | 100 |          True | 126.77 μs |   9,521 B |         - | 17,144 B |

With progress:

|                  Type |          Method |   N | BaseIsPending |      Mean | Code Size | Allocated | Survived |
|---------------------- |---------------- |---- |-------------- |----------:|----------:|----------:|---------:|
|          AwaitPending | ProtoPromise_V2 | 100 |             ? | 355.67 μs |  13,403 B |         - |    416 B |
|          AsyncPending | ProtoPromise_V2 | 100 |             ? | 103.09 μs |   7,542 B |         - |    760 B |
|         AsyncResolved | ProtoPromise_V2 | 100 |             ? |  17.69 μs |   2,210 B |         - |        - |
|         AwaitResolved | ProtoPromise_V2 | 100 |             ? |  11.33 μs |   1,906 B |      72 B |        - |
|   ContinueWithPending | ProtoPromise_V2 | 100 |         False | 460.28 μs |  18,828 B |         - | 26,984 B |
| ContinueWithFromValue | ProtoPromise_V2 | 100 |         False |  24.85 μs |   8,179 B |         - |        - |
|  ContinueWithResolved | ProtoPromise_V2 | 100 |         False |  31.72 μs |  10,186 B |         - |        - |
|   ContinueWithPending | ProtoPromise_V2 | 100 |          True | 449.77 μs |  15,353 B |         - | 26,832 B |
| ContinueWithFromValue | ProtoPromise_V2 | 100 |          True | 116.52 μs |   9,882 B |         - | 17,160 B |
|  ContinueWithResolved | ProtoPromise_V2 | 100 |          True | 135.89 μs |  10,000 B |         - | 26,760 B |

But the largest difference is still only 11% (48 μs divided by 300 is only 160 ns per ContinueWith), so it may still be worth it to implement that progress improvement and remove the option to disable progress for v2.

timcassell commented Mar 7, 2022

#54 changed async/await to hook up PromiseRefs directly to prevent StackOverflowExceptions. The implementation uses a pass-through object or function pointer to avoid boxing allocations.

public void AwaitUnsafeOnCompleted<TAwaiter, TStateMachine>(ref TAwaiter awaiter, ref TStateMachine stateMachine)
    where TAwaiter : ICriticalNotifyCompletion
    where TStateMachine : IAsyncStateMachine
{
    SetStateMachine(ref stateMachine);
    if (null != default(TAwaiter) && AwaitOverrider<TAwaiter>.IsOverridden())
    {
        AwaitOverrider<TAwaiter>.AwaitOnCompletedInternal(ref awaiter, _ref);
    }
    else
    {
        awaiter.UnsafeOnCompleted(_ref.MoveNext);
    }
}

That code can be optimized to call the method directly on the awaiter in .Net 5, thanks to the JIT optimizing away the boxes. Unity has not yet implemented that optimization, so the AwaitOverrider implementation will need to remain until it does, but we can still implement the optimization for non-Unity consumption.

public void AwaitUnsafeOnCompleted<TAwaiter, TStateMachine>(ref TAwaiter awaiter, ref TStateMachine stateMachine)
    where TAwaiter : ICriticalNotifyCompletion
    where TStateMachine : IAsyncStateMachine
{
    SetStateMachine(ref stateMachine);
    if (null != default(TAwaiter) && awaiter is IPromiseAwaiter)
    {
        ((IPromiseAwaiter) awaiter).AwaitOnCompletedInternal(_ref);
    }
    else
    {
        awaiter.UnsafeOnCompleted(_ref.MoveNext);
    }
}

A quick benchmark showed that this change increases speed by ~10%–15% in .Net 5.
