There's a very low-level, small performance bug/improvement around the packet analysis dispatcher's std::shared_ptr usage. I've been wary of std::shared_ptr's thread-safety guarantees on the packet path, where we do not need them, and wondered whether their cost is actually visible. This discussion writes up the finding and further aims to start a conversation about possibly preventing further std::shared_ptr usage on the packet path when the thread-safety guarantees are not required.
It began with Dispatcher::Lookup showing up in flamegraphs, particularly with geneve+vxlan encapsulated traffic. Roughly 3% of samples ended in Dispatcher::Lookup. This was a bit much for something that should just be an array-based lookup.
Testing PCAP
Based on 2009-M57-day11-18.trace from external/zeek-testing, encapsulated with geneve+vxlan tunneling as used by AWS GLB. This results in an eth/ip/udp/geneve/ip/udp/vxlan/eth/ip/... encapsulation in the final pcap. Link to capture-fwd.
Raw packet throughput
Comparing the zeek -r runtime of the original and the encapsulated PCAP in bare mode with checksums disabled shows pretty striking overhead:
$ /usr/bin/time -p zeek -b -C -r /mnt/btrfs/pcaps/2009-M57-day11-18.trace
real 1.90
user 1.68
sys 0.22
$ /usr/bin/time -p zeek -b -C -r /mnt/btrfs/pcaps/geneve-vxlan-2009-M57-day11-18.trace
real 4.64
user 4.45
sys 0.19
Almost 2.5 times slower. Spoiler: it will still be ~2.0 times slower at the end.
perf annotate
Zooming into Dispatcher::Lookup using perf annotate shows that the most expensive instruction in the function is a lock addl that increments the reference count atomically.
This is mostly unneeded for how analyzers are used: we can return a const reference (or a raw pointer) to the caller. Introducing a new LookupAnalyzer() method that returns a const reference instead reduced runtime from ~4.6 seconds to ~4.3 seconds. In other words, roughly 6.5% of the time was spent on atomic reference counting in a single-threaded context.
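For illustration, here's a minimal sketch of the two variants, assuming a simple array-backed dispatcher table; names and layout are simplified stand-ins, not Zeek's actual code:

```cpp
#include <cstdint>
#include <memory>
#include <vector>

class Analyzer;
using AnalyzerPtr = std::shared_ptr<Analyzer>;

class Dispatcher {
public:
    // Returning by value copies the shared_ptr: an atomic increment on
    // every lookup (the lock addl seen above), plus an atomic decrement
    // when the caller's copy goes out of scope.
    AnalyzerPtr Lookup(uint32_t identifier) const {
        return table[identifier - lowest_identifier];
    }

    // Returning a const reference hands out the stored pointer without
    // touching the reference count. Fine as long as the caller does not
    // keep the reference across modifications of the table.
    const AnalyzerPtr& LookupAnalyzer(uint32_t identifier) const {
        return table[identifier - lowest_identifier];
    }

private:
    uint32_t lowest_identifier = 0; // bounds checking omitted for brevity
    std::vector<AnalyzerPtr> table;
};
```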
This can obviously be framed as a performance bug, and the relative difference shrinks if one doesn't run in bare mode. It is, however, an indication that these thread-safe reference count modifications on the packet path can be significant.
Looking a bit more
Running perf annotate zeek::packet_analysis::IPTunnel::build_inner_packet on the profile jumps directly to another lock add instruction.
And in zeek::packet_analysis::IP::IPAnalyzer::AnalyzePacket, the three hottest instructions are all lock-prefixed, summing up to ~50% of the self samples for this method.
The percentages over-promise, as they are based on self samples, and build_inner_packet invokes other methods that do more work, none of which is included.
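The same atomic traffic shows up whenever a std::shared_ptr is passed by value along the packet path. A small self-contained example; the type is a stand-in, not Zeek's actual signature:

```cpp
#include <memory>

struct EncapsulationStack {};

// Pass-by-value copies the shared_ptr: an atomic increment on entry and
// an atomic decrement on return, even if the callee only reads it.
static void ByValue(std::shared_ptr<EncapsulationStack> encap) { (void)encap; }

// Pass-by-const-reference never touches the control block; the caller's
// reference keeps the object alive for the duration of the call.
static void ByConstRef(const std::shared_ptr<EncapsulationStack>& encap) { (void)encap; }

int main() {
    auto encap = std::make_shared<EncapsulationStack>();
    ByValue(encap);    // lock-prefixed refcount traffic
    ByConstRef(encap); // no atomic operations
}
```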
Removing std::shared_ptr usage on the packet path
Given our API guarantees, I think this will be a nightmare or even impossible to transition, but I was still curious about the numbers.
boost::local_shared_ptr too slow to construct
With the above finding in mind, I attempted to remove the std::shared_ptrs on the packet path using boost::local_shared_ptr. That is primarily the Packet::encap and Packet::ip_hdr fields, but it ends up being a shotgun change all over the place.
Maybe we should consider introducing Ptr aliases rather than using std::shared_ptr directly in the API. This has come up before.
Unfortunately, constructing a boost::local_shared_ptr is more expensive than constructing a std::shared_ptr. It's unclear whether boost::make_local_shared() optimizes down to a single allocation, but this has come up elsewhere. In a tunneled scenario, a lot of IP_Hdr instances are allocated and there's not much reference counting happening, so the overall result ends up negative/neutral after switching the std::shared_ptrs over to boost::local_shared_ptr.
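For reference, the drop-in usage looks roughly like this; a sketch assuming a Boost version that ships local_shared_ptr (1.64+), with IP_Hdr reduced to an empty stand-in:

```cpp
#include <boost/smart_ptr/local_shared_ptr.hpp>
#include <boost/smart_ptr/make_local_shared.hpp>

struct IP_Hdr {};

int main() {
    // Copies only do a plain, non-atomic refcount increment...
    boost::local_shared_ptr<IP_Hdr> hdr = boost::make_local_shared<IP_Hdr>();
    auto copy = hdr;
    // ...but constructing the control block itself turned out to be the
    // dominant cost in the tunneled scenario, where many IP_Hdr objects
    // are created and only rarely copied.
}
```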
Use IntrusivePtr
Another approach is to make EncapsulationStack and IP_Hdr work with IntrusivePtr by embedding a ref_count member and implementing Ref() / Unref(). This results in just one allocation (same as std::make_shared()), but the reference count is non-atomic and should be cheaper to modify.
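A minimal sketch of what that embedding could look like, assuming an IntrusivePtr that finds Ref()/Unref() via ADL (as Zeek's does); the member layout is illustrative:

```cpp
#include <cstdint>

class IP_Hdr {
public:
    // Plain increments/decrements, no lock prefix: safe because the
    // packet path is single-threaded.
    friend void Ref(IP_Hdr* p) { ++p->ref_count; }
    friend void Unref(IP_Hdr* p) {
        if ( --p->ref_count == 0 )
            delete p;
    }

private:
    int64_t ref_count = 1; // object and count share one allocation
};
```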
The numbers on my system with the above PCAP:
Base: 4.61 sec
Dispatcher::Lookup fix: 4.34 sec
IntrusivePtr packet path: 4.19 sec
Removing the std::shared_ptrs squeezes another ~0.15 sec out of the 4.34 sec, so ~3.5% of the runtime was spent on thread-safe reference counting. Not all that much, but significant and unneeded.
Further, there's a bit of a concern that atomic instructions like lock addl may take varying lengths of time when several Zeek or other processes on the system execute them concurrently, as there's probably some cache coherency protocol required to make this work well.
A zeek -C -b -r test does not sufficiently represent this; a perf annotate on a live system may or may not show something more interesting here.
Summary
The main point of this write-up is to raise awareness that using and modifying the reference count of a std::shared_ptr within the packet path should not be taken lightly, particularly given that we don't require the thread-safety guarantees. The other point was to mention perf annotate :-)
The std::shared_ptr overhead is unfortunately not very visible in flamegraphs because it is heavily inlined. A flat bar with no children above it might be an indication in a flame graph.
With Zeek script events being raised and interpreted, a few cycles spent on atomic changes to std::shared_ptr reference counts may be a drop in the bucket, but I think raw packet processing throughput is something we should care about. Further, adding more std::shared_ptrs to the packet path should probably be avoided, mainly because they cause unnecessary overhead.
Appendix: More perf annotate micro optimization findings
Some more perf annotate listings point out a few more opportunities for micro optimizations, mostly in the tunneled traffic scenario:
Expensive floating point operations initializing the timestamp: it is first set to 0.0 by the Packet default constructor and then to the actual timestamp in Packet::Init. Can be fixed in build_inner_packet by not using the default constructor.
The inner packet's timestamp is computed from run_state rather than just re-using the outer packet's timestamp; re-using it avoids some more floating point operations.
GetAnalyzerTag() returns a Tag object by value, which is not cheap to construct and destruct and is quite visible in profiles. Switching to returning a const reference helps.
EncapsulationStack holds a pointer to a vector instead of embedding it. This requires one extra allocation/deallocation for the vector. Also, the vector's data is reallocated regularly at first. Reserving 4 entries when a conn is added to the stack avoids reallocations for smallish encapsulation stacks at the cost of some more memory usage (sketched below).
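A sketch of that last change with illustrative names (EncapsulatingConn exists in Zeek; the rest is simplified):

```cpp
#include <vector>

struct EncapsulatingConn {};

class EncapsulationStack {
public:
    void Add(const EncapsulatingConn& c) {
        // Reserve up front so small stacks (e.g. geneve+vxlan) never
        // reallocate; costs a little memory for single-layer tunnels.
        if ( conns.empty() )
            conns.reserve(4);
        conns.push_back(c);
    }

private:
    std::vector<EncapsulatingConn> conns; // embedded, no extra heap allocation
};
```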
Applying the above fixes squeezes out another 0.27 seconds: 4.19 sec down to 3.92 sec.
This is still pretty far away from the non-encapsulated case, but altogether ~15% better than before.