
High CPU load due to a single misbehaving client #143

Closed
RalfJung opened this issue Jul 24, 2020 · 2 comments · Fixed by #148

Comments

@RalfJung (Member)

We occasionally see a client misbehave and establish multiple connections at the same time to all our servers. For some reason, even when there are just around 20 connections per 10 minutes, this causes 100% CPU load by tunneldigger. Python is not the most efficient language, but this seems excessive -- I'd like to better understand where in the broker all that CPU time is spent. Unfortunately, so far I have found no good way to do such an analysis for Python (what I am looking for is something like callgrind).

@RalfJung (Member Author) commented Aug 9, 2020

I did a cProfile run of this (on the live system under the problematic load situation), so now I can start analyzing that profiler data. So far I am not sure what to conclude.
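For reference, this is roughly how I am looking at the data -- just a sketch; the dump file name broker.prof and the print limits are arbitrary:

```python
import pstats

# The dump was produced with the stdlib profiler, roughly
#   python -m cProfile -o broker.prof <broker entry point>
# (exact invocation omitted). Load it and print the two views below.
stats = pstats.Stats("broker.prof")

# Functions with the highest cumulative time.
stats.sort_stats("cumulative").print_stats(30)

# Most-called functions.
stats.sort_stats("ncalls").print_stats(30)
```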

Here are the functions with the highest "cumtime":

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000  409.537  204.769 main.py:1(<module>)
     51/1    0.000    0.000  204.769  204.769 {built-in method builtins.exec}
     71/1    0.000    0.000  204.769  204.769 <frozen importlib._bootstrap>:978(_find_and_load)
     71/1    0.000    0.000  204.769  204.769 <frozen importlib._bootstrap>:948(_find_and_load_unlocked)
    104/2    0.000    0.000  204.769  102.384 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
     68/2    0.000    0.000  204.768  102.384 <frozen importlib._bootstrap>:663(_load_unlocked)
     47/2    0.000    0.000  204.768  102.384 <frozen importlib._bootstrap_external>:722(exec_module)
        1   79.314   79.314  203.672  203.672 eventloop.py:44(start)
 60789078  112.203    0.000  112.203    0.000 {method 'poll' of 'select.epoll' objects}
 84148432   11.585    0.000   11.585    0.000 {method 'get' of 'dict' objects}
       22    0.002    0.000    1.448    0.066 tunnel.py:239(close)
       91    0.001    0.000    1.392    0.015 netlink.py:127(send)
       91    0.000    0.000    1.392    0.015 netlink.py:152(send)
       91    1.391    0.015    1.391    0.015 {method 'send' of '_socket.socket' objects}
       22    0.001    0.000    1.388    0.063 l2tp.py:181(session_delete)
        1    0.000    0.000    1.048    1.048 broker.py:194(close)
      481    0.002    0.000    0.421    0.001 network.py:88(read)
      461    0.004    0.000    0.416    0.001 tunnel.py:224(keepalive)
      559    0.003    0.000    0.129    0.000 network.py:154(read)
       66    0.002    0.000    0.085    0.001 hooks.py:136(run_hook)
       71    0.001    0.000    0.080    0.001 broker.py:249(message)
      559    0.002    0.000    0.079    0.000 protocol.py:94(message)
       23    0.000    0.000    0.073    0.003 broker.py:236(create_tunnel)
       23    0.001    0.000    0.073    0.003 broker.py:67(create_tunnel)
       44    0.002    0.000    0.072    0.002 hooks.py:18(__init__)
       45    0.003    0.000    0.071    0.002 subprocess.py:656(__init__)
       22    0.001    0.000    0.066    0.003 tunnel.py:104(setup_tunnel)
       45    0.007    0.000    0.064    0.001 subprocess.py:1383(_execute_child)
      488    0.003    0.000    0.037    0.000 tunnel.py:310(message)
      562    0.036    0.000    0.036    0.000 {built-in method posix.read}
        2    0.000    0.000    0.034    0.017 limits.py:24(configure)
        8    0.000    0.000    0.033    0.004 traffic_control.py:18(tc)

Does that look as expected? Not sure. The number of calls to method 'poll' of 'select.epoll' objects and method 'get' of 'dict' objects seems rather large, in particular the former. Maybe we are just in too tight an epoll loop? I also have an (older) pcap file with traffic from that client, showing that it averages 440 packets per second, but almost all of these are packets from inside the tunnel; the number of control packets (that tunneldigger would actually interpret) is very low. So maybe tunneldigger has to wake up for data packets as well? I am not sure whether the kernel sends those packets to userspace or not. Cc @kaechele
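To illustrate what "too tight an epoll loop" would mean here, a minimal standalone sketch (not the broker's code; the port number and timings are arbitrary): a connected UDP socket with a pending ICMP error makes poll() return EPOLLERR immediately on every call, so a loop that never clears the error just spins.

```python
import select
import socket
import time

# Standalone illustration of the busy-loop hypothesis (not broker code).
# A connected UDP socket that received an ICMP "port unreachable" carries a
# pending error; epoll reports EPOLLERR for it on every poll() call until the
# error is consumed or the fd is unregistered, so a loop that ignores the
# flag never blocks.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.connect(("127.0.0.1", 50000))  # assumption: nothing listens on this port
sock.send(b"x")                     # provokes ICMP port unreachable
time.sleep(0.1)                     # let the kernel deliver the error

poller = select.epoll()
poller.register(sock.fileno(), select.EPOLLIN)

wakeups = 0
deadline = time.monotonic() + 1.0
while time.monotonic() < deadline:
    # EPOLLERR is reported even though we only registered for EPOLLIN, and it
    # is level-triggered, so poll() returns immediately every single time.
    for fd, flags in poller.poll(timeout=1.0):
        if flags & select.EPOLLERR:
            wakeups += 1  # nothing consumes the error here
print("epoll wakeups in one second:", wakeups)
```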

Here are the most-called functions:

 84148432   11.585    0.000   11.585    0.000 {method 'get' of 'dict' objects}
 60789078  112.203    0.000  112.203    0.000 {method 'poll' of 'select.epoll' objects}
6169/6069    0.001    0.000    0.001    0.000 {built-in method builtins.len}
     3474    0.001    0.000    0.001    0.000 {built-in method builtins.isinstance}
     2282    0.001    0.000    0.001    0.000 {built-in method _struct.pack}
     2003    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
     1922    0.000    0.000    0.000    0.000 {method 'rstrip' of 'str' objects}
     1609    0.001    0.000    0.001    0.000 {built-in method builtins.hasattr}
     1605    0.001    0.000    0.001    0.000 sre_parse.py:233(__next)
     1564    0.000    0.000    0.000    0.000 {method 'startswith' of 'str' objects}
     1531    0.000    0.000    0.000    0.000 {method 'isupper' of 'str' objects}
     1410    0.000    0.000    0.001    0.000 sre_parse.py:254(get)
     1251    0.001    0.000    0.001    0.000 {built-in method time.time}
     1125    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
      963    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:222(_verbose_message)
      900    0.001    0.000    0.001    0.000 <frozen importlib._bootstrap_external>:58(<listcomp>)
      900    0.001    0.000    0.002    0.000 <frozen importlib._bootstrap_external>:56(_path_join)
      862    0.001    0.000    0.001    0.000 {built-in method _struct.unpack}
      735    0.000    0.000    0.000    0.000 {built-in method posix.fspath}
      650    0.006    0.000    0.006    0.000 {method 'recvfrom' of '_socket.socket' objects}
      607    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
      576    0.010    0.000    0.010    0.000 {method 'sendto' of '_socket.socket' objects}
      576    0.001    0.000    0.011    0.000 network.py:115(write)
      576    0.003    0.000    0.015    0.000 network.py:128(write_message)
      562    0.036    0.000    0.036    0.000 {built-in method posix.read}
      559    0.003    0.000    0.129    0.000 network.py:154(read)
      559    0.002    0.000    0.003    0.000 protocol.py:50(parse_message)
      559    0.002    0.000    0.079    0.000 protocol.py:94(message)

The trace file says it was running for 200s, so 3000 calls does not seem excessive at all to me.

@RalfJung (Member Author) commented Aug 9, 2020

When the high load starts, I am seeing a ton of wakeups in our main epoll loop that all have event flag 0x8 set -- that's select.EPOLLERR. The events are all for tunnels associated with the misbehaving client.
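One way to stop such wakeups (just a sketch of a possible approach, not necessarily what #148 ends up doing; the fd -> socket mapping is made up) would be to consume the pending error with SO_ERROR and unregister the descriptor:

```python
import errno
import select
import socket

def handle_events(poller, sockets):
    """Sketch of one event-loop iteration; `sockets` is a hypothetical
    fd -> socket.socket mapping maintained by the broker."""
    for fd, flags in poller.poll(timeout=1.0):
        sock = sockets.get(fd)
        if sock is None:
            continue
        if flags & (select.EPOLLERR | select.EPOLLHUP):
            # getsockopt(SO_ERROR) returns and clears the pending error,
            # so epoll stops reporting EPOLLERR for this descriptor.
            err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
            print("closing fd %d after socket error %s"
                  % (fd, errno.errorcode.get(err, err)))
            poller.unregister(fd)
            sock.close()
            del sockets[fd]
            continue
        if flags & select.EPOLLIN:
            data, addr = sock.recvfrom(2048)
            # ... hand the datagram to the normal tunnel/control handling ...
```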
