
High CPU load due to a single misbehaving client #143

Closed
RalfJung opened this issue Jul 24, 2020 · 2 comments · Fixed by #148

Comments

@RalfJung (Member)

We occasionally see a client misbehave and establish multiple connections at the same time to all our servers. For some reason, even when there are just around 20 connections per 10 minutes, this causes 100% CPU load by tunneldigger. Python is not the most efficient language, but this seems excessive -- I'd like to better understand where in the broker all that CPU time is spent. Unfortunately, so far I have found no good way to do such an analysis for Python (what I am looking for is something like callgrind).

@RalfJung (Member Author) commented Aug 9, 2020

I did a cProfile run of this (on the live system under the problematic load situation), so now I can start analyzing that profiler data. So far I am not sure what to conclude.
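For reference, this is roughly how I am looking at the data -- just a sketch; the dump file name broker.prof and the print limits are arbitrary:

```python
import pstats

# The dump was produced with the stdlib profiler, roughly
#   python -m cProfile -o broker.prof <broker entry point>
# (exact invocation omitted). Load it and print the two views below.
stats = pstats.Stats("broker.prof")

# Functions with the highest cumulative time.
stats.sort_stats("cumulative").print_stats(30)

# Most-called functions.
stats.sort_stats("ncalls").print_stats(30)
```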

Here are the functions with the highest "cumtime":

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000  409.537  204.769 main.py:1(<module>)
     51/1    0.000    0.000  204.769  204.769 {built-in method builtins.exec}
     71/1    0.000    0.000  204.769  204.769 <frozen importlib._bootstrap>:978(_find_and_load)
     71/1    0.000    0.000  204.769  204.769 <frozen importlib._bootstrap>:948(_find_and_load_unlocked)
    104/2    0.000    0.000  204.769  102.384 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
     68/2    0.000    0.000  204.768  102.384 <frozen importlib._bootstrap>:663(_load_unlocked)
     47/2    0.000    0.000  204.768  102.384 <frozen importlib._bootstrap_external>:722(exec_module)
        1   79.314   79.314  203.672  203.672 eventloop.py:44(start)
 60789078  112.203    0.000  112.203    0.000 {method 'poll' of 'select.epoll' objects}
 84148432   11.585    0.000   11.585    0.000 {method 'get' of 'dict' objects}
       22    0.002    0.000    1.448    0.066 tunnel.py:239(close)
       91    0.001    0.000    1.392    0.015 netlink.py:127(send)
       91    0.000    0.000    1.392    0.015 netlink.py:152(send)
       91    1.391    0.015    1.391    0.015 {method 'send' of '_socket.socket' objects}
       22    0.001    0.000    1.388    0.063 l2tp.py:181(session_delete)
        1    0.000    0.000    1.048    1.048 broker.py:194(close)
      481    0.002    0.000    0.421    0.001 network.py:88(read)
      461    0.004    0.000    0.416    0.001 tunnel.py:224(keepalive)
      559    0.003    0.000    0.129    0.000 network.py:154(read)
       66    0.002    0.000    0.085    0.001 hooks.py:136(run_hook)
       71    0.001    0.000    0.080    0.001 broker.py:249(message)
      559    0.002    0.000    0.079    0.000 protocol.py:94(message)
       23    0.000    0.000    0.073    0.003 broker.py:236(create_tunnel)
       23    0.001    0.000    0.073    0.003 broker.py:67(create_tunnel)
       44    0.002    0.000    0.072    0.002 hooks.py:18(__init__)
       45    0.003    0.000    0.071    0.002 subprocess.py:656(__init__)
       22    0.001    0.000    0.066    0.003 tunnel.py:104(setup_tunnel)
       45    0.007    0.000    0.064    0.001 subprocess.py:1383(_execute_child)
      488    0.003    0.000    0.037    0.000 tunnel.py:310(message)
      562    0.036    0.000    0.036    0.000 {built-in method posix.read}
        2    0.000    0.000    0.034    0.017 limits.py:24(configure)
        8    0.000    0.000    0.033    0.004 traffic_control.py:18(tc)

Does that look as expected? Not sure. The number of calls to method 'poll' of 'select.epoll' objects and method 'get' of 'dict' objects seems rather large, in particular the former. Maybe we are just in too tight an epoll loop? I also have an (older) pcap file with traffic from that client, showing that it averages 440 packets per second, but almost all of these are packets from inside the tunnel; the number of control packets (that tunneldigger would actually interpret) is very low. So maybe tunneldigger has to wake up for data packets as well? I am not sure whether the kernel sends those packets to userspace or not. Cc @kaechele
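To illustrate what "too tight an epoll loop" would mean here, a minimal standalone sketch (not the broker's code; the port number and timings are arbitrary): a connected UDP socket with a pending ICMP error makes poll() return EPOLLERR immediately on every call, so a loop that never clears the error just spins.

```python
import select
import socket
import time

# Standalone illustration of the busy-loop hypothesis (not broker code).
# A connected UDP socket that received an ICMP "port unreachable" carries a
# pending error; epoll reports EPOLLERR for it on every poll() call until the
# error is consumed or the fd is unregistered, so a loop that ignores the
# flag never blocks.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.connect(("127.0.0.1", 50000))  # assumption: nothing listens on this port
sock.send(b"x")                     # provokes ICMP port unreachable
time.sleep(0.1)                     # let the kernel deliver the error

poller = select.epoll()
poller.register(sock.fileno(), select.EPOLLIN)

wakeups = 0
deadline = time.monotonic() + 1.0
while time.monotonic() < deadline:
    # EPOLLERR is reported even though we only registered for EPOLLIN, and it
    # is level-triggered, so poll() returns immediately every single time.
    for fd, flags in poller.poll(timeout=1.0):
        if flags & select.EPOLLERR:
            wakeups += 1  # nothing consumes the error here
print("epoll wakeups in one second:", wakeups)
```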

Here are the most-called functions:

 84148432   11.585    0.000   11.585    0.000 {method 'get' of 'dict' objects}
 60789078  112.203    0.000  112.203    0.000 {method 'poll' of 'select.epoll' objects}
6169/6069    0.001    0.000    0.001    0.000 {built-in method builtins.len}
     3474    0.001    0.000    0.001    0.000 {built-in method builtins.isinstance}
     2282    0.001    0.000    0.001    0.000 {built-in method _struct.pack}
     2003    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
     1922    0.000    0.000    0.000    0.000 {method 'rstrip' of 'str' objects}
     1609    0.001    0.000    0.001    0.000 {built-in method builtins.hasattr}
     1605    0.001    0.000    0.001    0.000 sre_parse.py:233(__next)
     1564    0.000    0.000    0.000    0.000 {method 'startswith' of 'str' objects}
     1531    0.000    0.000    0.000    0.000 {method 'isupper' of 'str' objects}
     1410    0.000    0.000    0.001    0.000 sre_parse.py:254(get)
     1251    0.001    0.000    0.001    0.000 {built-in method time.time}
     1125    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
      963    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:222(_verbose_message)
      900    0.001    0.000    0.001    0.000 <frozen importlib._bootstrap_external>:58(<listcomp>)
      900    0.001    0.000    0.002    0.000 <frozen importlib._bootstrap_external>:56(_path_join)
      862    0.001    0.000    0.001    0.000 {built-in method _struct.unpack}
      735    0.000    0.000    0.000    0.000 {built-in method posix.fspath}
      650    0.006    0.000    0.006    0.000 {method 'recvfrom' of '_socket.socket' objects}
      607    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
      576    0.010    0.000    0.010    0.000 {method 'sendto' of '_socket.socket' objects}
      576    0.001    0.000    0.011    0.000 network.py:115(write)
      576    0.003    0.000    0.015    0.000 network.py:128(write_message)
      562    0.036    0.000    0.036    0.000 {built-in method posix.read}
      559    0.003    0.000    0.129    0.000 network.py:154(read)
      559    0.002    0.000    0.003    0.000 protocol.py:50(parse_message)
      559    0.002    0.000    0.079    0.000 protocol.py:94(message)

The trace file says it was running for 200s, so 3000 calls does not seem excessive at all to me.

@RalfJung (Member Author) commented Aug 9, 2020

When the high load starts, I am seeing a ton of wakeups in our main epoll loop that all have event flag 0x8 set -- that's select.EPOLLERR. The events are all for tunnels associated with the misbehaving client.
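One way to stop such wakeups (just a sketch of a possible approach, not necessarily what #148 ends up doing; the fd -> socket mapping is made up) would be to consume the pending error with SO_ERROR and unregister the descriptor:

```python
import errno
import select
import socket

def handle_events(poller, sockets):
    """Sketch of one event-loop iteration; `sockets` is a hypothetical
    fd -> socket.socket mapping maintained by the broker."""
    for fd, flags in poller.poll(timeout=1.0):
        sock = sockets.get(fd)
        if sock is None:
            continue
        if flags & (select.EPOLLERR | select.EPOLLHUP):
            # getsockopt(SO_ERROR) returns and clears the pending error,
            # so epoll stops reporting EPOLLERR for this descriptor.
            err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
            print("closing fd %d after socket error %s"
                  % (fd, errno.errorcode.get(err, err)))
            poller.unregister(fd)
            sock.close()
            del sockets[fd]
            continue
        if flags & select.EPOLLIN:
            data, addr = sock.recvfrom(2048)
            # ... hand the datagram to the normal tunnel/control handling ...
```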
