There's a very low-level, small performance bug/improvement around the packet analysis dispatcher's std::shared_ptr usage. I've been wary of std::shared_ptr's thread-safety guarantees on the packet path, where we do not need them, and wondered whether their cost is actually visible. This discussion writes up the finding and further aims to start a conversation about possibly preventing further std::shared_ptr usage on the packet path when the thread-safety guarantees are not required.
It began with Dispatcher::Lookup showing up in flamegraphs, particularly with geneve+vxlan encapsulated traffic. Roughly 3% of samples ended in Dispatcher::Lookup. This was a bit much for something that should just be an array-based lookup.
Testing PCAP
Based on 2009-M57-day11-18.trace from external/zeek-testing, encapsulated with geneve+vxlan tunneling as used by AWS GLB. This results in an eth/ip/udp/geneve/ip/udp/vxlan/eth/ip/... encapsulation in the final pcap. Link to capture-fwd.
Raw packet throughput
Comparing the zeek -r runtime of the original and the encapsulated PCAP in bare mode with checksums disabled shows pretty striking overhead:
$ /usr/bin/time -p zeek -b -C -r /mnt/btrfs/pcaps/2009-M57-day11-18.trace
real 1.90
user 1.68
sys 0.22
$ /usr/bin/time -p zeek -b -C -r /mnt/btrfs/pcaps/geneve-vxlan-2009-M57-day11-18.trace
real 4.64
user 4.45
sys 0.19
Almost 2.5 times slower. Spoiler: it will still be ~2.0 times slower at the end.
perf annotate
Zooming into Dispatcher::Lookup using perf annotate shows that the most expensive instruction in the function is a lock addl that increments the reference count atomically.
This is mostly unneeded for how analyzers are used: we can return a const reference (or a raw pointer) to the caller. Introducing a new LookupAnalyzer() method that returns a const reference instead reduced runtime from ~4.6 seconds to ~4.3 seconds. In other words, roughly 6.5% of the time was spent on atomic reference counting in a single-threaded context.
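For illustration, here's a minimal sketch of the two variants, assuming a simple array-backed dispatcher table; names and layout are simplified stand-ins, not Zeek's actual code:

```cpp
#include <cstdint>
#include <memory>
#include <vector>

class Analyzer;
using AnalyzerPtr = std::shared_ptr<Analyzer>;

class Dispatcher {
public:
    // Returning by value copies the shared_ptr: an atomic increment on
    // every lookup (the lock addl seen above), plus an atomic decrement
    // when the caller's copy goes out of scope.
    AnalyzerPtr Lookup(uint32_t identifier) const {
        return table[identifier - lowest_identifier];
    }

    // Returning a const reference hands out the stored pointer without
    // touching the reference count. Fine as long as the caller does not
    // keep the reference across modifications of the table.
    const AnalyzerPtr& LookupAnalyzer(uint32_t identifier) const {
        return table[identifier - lowest_identifier];
    }

private:
    uint32_t lowest_identifier = 0; // bounds checking omitted for brevity
    std::vector<AnalyzerPtr> table;
};
```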
This can obviously be framed as a performance bug, and the relative difference shrinks if one doesn't run in bare mode. It is, however, an indication that these thread-safe reference count modifications on the packet path can be significant.
Looking a bit more
Running perf annotate zeek::packet_analysis::IPTunnel::build_inner_packet on the profile jumps directly to another lock add instruction.
And in zeek::packet_analysis::IP::IPAnalyzer::AnalyzePacket, the three hottest instructions are all lock-prefixed, summing up to ~50% of the self samples for this method.
The percentages over-promise, as they are based on self samples, and build_inner_packet invokes other methods that do more work, none of which is included.
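The same atomic traffic shows up whenever a std::shared_ptr is passed by value along the packet path. A small self-contained example; the type is a stand-in, not Zeek's actual signature:

```cpp
#include <memory>

struct EncapsulationStack {};

// Pass-by-value copies the shared_ptr: an atomic increment on entry and
// an atomic decrement on return, even if the callee only reads it.
static void ByValue(std::shared_ptr<EncapsulationStack> encap) { (void)encap; }

// Pass-by-const-reference never touches the control block; the caller's
// reference keeps the object alive for the duration of the call.
static void ByConstRef(const std::shared_ptr<EncapsulationStack>& encap) { (void)encap; }

int main() {
    auto encap = std::make_shared<EncapsulationStack>();
    ByValue(encap);    // lock-prefixed refcount traffic
    ByConstRef(encap); // no atomic operations
}
```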
Removing std::shared_ptr usage on the packet path
Given our API guarantees, I think this will be a nightmare or even impossible to transition, but I was still curious about the numbers.
boost::local_shared_ptr too slow to construct
With the above finding in mind, I attempted to remove the std::shared_ptrs on the packet path using boost::local_shared_ptr. That is primarily the Packet::encap and Packet::ip_hdr fields, but it ends up being a shotgun change all over the place.
Maybe we should consider introducing Ptr aliases rather than using std::shared_ptr directly in the API. This has come up before.
Unfortunately, constructing a boost::local_shared_ptr is more expensive than constructing a std::shared_ptr. It's unclear whether boost::make_local_shared() optimizes down to a single allocation, but this has come up elsewhere. In a tunneled scenario, a lot of IP_Hdr instances are allocated and there's not much reference counting happening, so the overall result ends up negative/neutral after switching the std::shared_ptrs over to boost::local_shared_ptr.
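For reference, the drop-in usage looks roughly like this; a sketch assuming a Boost version that ships local_shared_ptr (1.64+), with IP_Hdr reduced to an empty stand-in:

```cpp
#include <boost/smart_ptr/local_shared_ptr.hpp>
#include <boost/smart_ptr/make_local_shared.hpp>

struct IP_Hdr {};

int main() {
    // Copies only do a plain, non-atomic refcount increment...
    boost::local_shared_ptr<IP_Hdr> hdr = boost::make_local_shared<IP_Hdr>();
    auto copy = hdr;
    // ...but constructing the control block itself turned out to be the
    // dominant cost in the tunneled scenario, where many IP_Hdr objects
    // are created and only rarely copied.
}
```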
Use IntrusivePtr
Another approach is to make EncapsulationStack and IP_Hdr work with IntrusivePtr by embedding a ref_count member and implementing Ref() / Unref(). This results in just one allocation (same as std::make_shared()), but the reference count is non-atomic and should be cheaper to modify.
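A minimal sketch of what that embedding could look like, assuming an IntrusivePtr that finds Ref()/Unref() via ADL (as Zeek's does); the member layout is illustrative:

```cpp
#include <cstdint>

class IP_Hdr {
public:
    // Plain increments/decrements, no lock prefix: safe because the
    // packet path is single-threaded.
    friend void Ref(IP_Hdr* p) { ++p->ref_count; }
    friend void Unref(IP_Hdr* p) {
        if ( --p->ref_count == 0 )
            delete p;
    }

private:
    int64_t ref_count = 1; // object and count share one allocation
};
```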
The numbers on my system with the above PCAP:
Base: 4.61 sec
Dispatcher::Lookup fix: 4.34 sec
IntrusivePtr packet path: 4.19 sec
Removing the std::shared_ptrs squeezes another ~0.15 sec out of the 4.34 sec, so ~3.5% of the runtime was spent on thread-safe reference counting. Not all that much, but significant and unneeded.
Further, there's a bit of a concern that atomic instructions like lock addl may take varying lengths of time when several Zeek or other processes on the system execute them concurrently, as there's probably some cache coherency protocol required to make this work well.
A zeek -C -b -r test does not sufficiently represent this; a perf annotate on a live system may or may not show something more interesting here.
Summary
The main point of this write-up is to raise awareness that using and modifying the reference count of a std::shared_ptr within the packet path should not be taken lightly, particularly given that we don't require the thread-safety guarantees. The other point was to mention perf annotate :-)
The std::shared_ptr overhead is unfortunately not very visible in flamegraphs because it is heavily inlined. A flat bar with no children above it might be an indication in a flame graph.
With Zeek script events being raised and interpreted, a few cycles spent on atomic changes to std::shared_ptr reference counts may be a drop in the bucket, but I think raw packet processing throughput is something we should care about. Further, adding more std::shared_ptrs to the packet path should probably be avoided, mainly because they cause unnecessary overhead.
Appendix: More perf annotate micro optimization findings
Some more perf annotate listings point out a few more opportunities for micro optimizations, mostly in the tunneled traffic scenario:
Expensive floating point operations initializing the timestamp: it is first set to 0.0 by the Packet default constructor and then to the actual timestamp in Packet::Init. Can be fixed in build_inner_packet by not using the default constructor.
The inner packet's timestamp is computed from run_state rather than just re-using the outer packet's timestamp; re-using it avoids some more floating point operations.
GetAnalyzerTag() returns a Tag object by value, which is not cheap to construct and destruct and is quite visible in profiles. Switching to returning a const reference helps.
EncapsulationStack holds a pointer to a vector instead of embedding it. This requires one extra allocation/deallocation for the vector. Also, the vector's data is reallocated regularly at first. Reserving 4 entries when a conn is added to the stack avoids reallocations for smallish encapsulation stacks at the cost of some more memory usage (sketched below).
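A sketch of that last change with illustrative names (EncapsulatingConn exists in Zeek; the rest is simplified):

```cpp
#include <vector>

struct EncapsulatingConn {};

class EncapsulationStack {
public:
    void Add(const EncapsulatingConn& c) {
        // Reserve up front so small stacks (e.g. geneve+vxlan) never
        // reallocate; costs a little memory for single-layer tunnels.
        if ( conns.empty() )
            conns.reserve(4);
        conns.push_back(c);
    }

private:
    std::vector<EncapsulatingConn> conns; // embedded, no extra heap allocation
};
```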
Applying the above fixes squeezes out another 0.27 seconds: 4.19 sec down to 3.92 sec.
This is still pretty far away from the non-encapsulated case, but altogether ~15% better than before.