[investigating] networking becomes "clogged" and max bandwidth is reduced to 5% #566

Closed
northys opened this issue Mar 13, 2022 · 10 comments
Labels
bug TYPE: a report on something that isn't working

Comments

@northys
Contributor

northys commented Mar 13, 2022

❗ ❗ ❗ DISCLAIMER: THIS MAY NOT BE CAUSED BY PORTMASTER, STILL INVESTIGATING ❗ ❗ ❗

What happened:

In the past week I had to reboot my laptop to make networking fast again. We are talking about LAN throughput, so I'm 99% sure it wasn't related to my ISP throttling me or anything like that.

I was watching video content streamed over DLNA from my local server, hosted by Plex Server. I use VLC most of the time, but sometimes I use the GNOME Videos app (Totem) because it works better on a slower LAN.

There is one main difference between VLC and Totem:

  • VLC streams the content "second by second" as you watch it, so Plex Server can track the exact second where you left off. That means that when networking isn't smooth, playback stalls frequently (e.g. every 1-2 seconds, for many seconds at a time, depending on the content bitrate).
  • Totem downloads the file to a local cache as fast as it can and plays it from that cache. When everything works fine, it downloads the content from the DLNA server at speeds reaching 300Mbps, so the whole file is cached within a few minutes at most and plays without issues. That's why I tend to use Totem on slower networks, e.g. hotel Wi-Fi, or in this case when I'm too lazy to reboot my laptop because something weird is happening with my networking.

However, in the past week it felt like the networking on my laptop was "clogged". I couldn't load self-hosted apps over LAN from my local server, nor stream DLNA content smoothly (VLC) or cache it at full speed (Totem). At most I could measure 500kB/s download from my home server, and content with a bitrate of 8Mbps was lagging. I blamed the home server at the beginning because there are some I/O-heavy tasks running from time to time (backups and so on), as well as network-heavy tasks.
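
A quick unit check makes the mismatch concrete: 500 kB/s is only 4 Mbit/s, half of what an 8 Mbit/s stream needs. A minimal sketch of the conversion, assuming decimal units (1 kB = 1000 bytes); the function name is made up for illustration:

```python
def kilobytes_per_s_to_mbit_per_s(kb_per_s: float) -> float:
    """Convert kB/s to Mbit/s, assuming decimal units (1 kB = 1000 bytes)."""
    return kb_per_s * 1000 * 8 / 1_000_000


measured = kilobytes_per_s_to_mbit_per_s(500)  # best download speed measured on the laptop
needed = 8.0                                   # bitrate of the lagging content, in Mbit/s

print(f"{measured:.1f} Mbit/s available, {needed:.1f} Mbit/s needed")
print("stream can keep up:", measured >= needed)
```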

Sadly, I wasn't able to find any issue on my home server. I checked the graphs scraped from Prometheus node-exporter and there wasn't anything relevant. Disks were under normal load, there were no packet drops, bandwidth wasn't fully used, CPU was running at 40% at most... Nothing weird.

That's why I'm opening this issue, and I will continue investigating. Rebooting my laptop has always fixed the issue. I've also installed node-exporter on my laptop, so I'll have some graphs for you when it happens again.

What I know so far:

  • restarting portmaster did not fix this issue
  • laptop reboot fixes the issue
  • I had similar issues when I had SPN connected all the time, so I disabled it again about 2 days ago. It did not make sense to me though, because SPN doesn't handle traffic over LAN (192.168.1.0/24 subnet).
  • I restarted portmaster from the UI (not systemctl restart) because earlier today (a few hours before this issue occurred again) it was consuming 100% CPU (that issue has already been opened by someone else and I contributed my logs a few days ago)
  • When the networking clogged on my laptop, I was able to stream 4K content (30Mbps bitrate) on my phone from the same home server, so I assume it is not an issue with my local network's components. I have high-end network components from Ubiquiti, not cheap $20 gear.

What did you expect to happen?:

Networking is working at full speed all the time.

How did you reproduce it?:

I have no idea.

Debug Information:

Of course I forgot to copy it. I'll try not to forget the next time. I'm running version 0.8.5.

@northys northys added the bug TYPE: a report on something that isn't working label Mar 13, 2022
@northys northys changed the title from [still investigating] networking becomes "clogged" and max bandwidth is reduced to 5% to [investigating] networking becomes "clogged" and max bandwidth is reduced to 5% Mar 13, 2022
@northys
Contributor Author

northys commented Mar 18, 2022

closing as I couldn't replicate it in the past 5 days

@northys
Contributor Author

northys commented Apr 15, 2022

@dhaavi there must definitely be something wrong with portmaster. I can't help myself but I have to blame something 😆

This is a graph of my laptop's networking throughput. At 22:10 (first blue dashed line) portmaster's SPN module crashed on a nil pointer dereference (reported as safing/spn#74). I've mentioned similar behaviour in another issue (safing/spn#67 (comment)) and, as you can see, a laptop reboot fixed it again (second dashed blue line at the very right of the graph).

I'm sitting here, watching how slowly Nextcloud syncs my files (over the tailscale network (yellow graph), but connected over LAN, no relay). I wish I had restarted my laptop earlier 😞

My laptop hasn't been connected to SPN for at least the last hour. I think I disabled it a few minutes after opening the nil pointer crash issue, but I'm not 100% sure. It's worth mentioning that this happened even before I deployed tailscale. I switched to tailscale yesterday.

Positive is download, negative is upload.

image

@dhaavi
Member

dhaavi commented Apr 15, 2022

That's some very interesting data. Do you have CPU/MEM data for the same time period? Would be interesting if there is a correlation.

Also, are you collecting metrics from the portmaster (http://127.0.0.1:817/metrics)?
You can find some docs on these metrics at http://127.0.0.1:817/api/v1/metrics/list.
Metrics have different levels of sensitivity, so you might need to enable dev mode or supply an API key.
Alternatively, you could use the push metrics setting.

These two metrics could tell us how many packets the Portmaster was handling during that period.

portmaster_firewall_handling_duration_seconds_sum
portmaster_firewall_handling_duration_seconds_count

I remember that if every packet goes through the Portmaster, the bandwidth limit was around 20Mbit/s - this suspiciously matches the reported speed here.
I don't know why this would happen, as the Portmaster currently marks all verdicts as permanent. One possibility is that another program broke the packet marking.
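
As a rough illustration of how those two metrics translate into "packets handled", here is a minimal sketch that compares two scrapes of the metrics text. It assumes the endpoint returns the usual Prometheus text format without timestamps, and that summing over all label combinations is acceptable; the helper names and the sample values are made up, and `scrape` is only defined, not called, since it may need dev mode or an API key.

```python
import urllib.request

COUNT = "portmaster_firewall_handling_duration_seconds_count"
SUM = "portmaster_firewall_handling_duration_seconds_sum"


def scrape(url: str = "http://127.0.0.1:817/metrics") -> str:
    """Fetch the raw metrics text (may require dev mode or an API key)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text: str) -> dict[str, float]:
    """Parse 'name{labels} value' lines into {name: value}, summed over labels."""
    values: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        try:
            values[name] = values.get(name, 0.0) + float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return values


def packet_stats(before: dict[str, float], after: dict[str, float],
                 interval_s: float) -> tuple[float, float]:
    """Return (packets handled per second, mean handling time in seconds)."""
    handled = after.get(COUNT, 0.0) - before.get(COUNT, 0.0)
    spent = after.get(SUM, 0.0) - before.get(SUM, 0.0)
    mean = spent / handled if handled > 0 else 0.0
    return handled / interval_s, mean


# Hypothetical scrapes taken 60 seconds apart (values are made up).
first = parse_metrics(f"{COUNT} 1000\n{SUM} 2.0\n")
second = parse_metrics(f"{COUNT} 1600\n{SUM} 2.6\n")
pps, mean_s = packet_stats(first, second, 60)
print(f"{pps:.1f} packets/s, {mean_s * 1000:.2f} ms per packet on average")
```

With real data, `first` and `second` would come from `parse_metrics(scrape())` called at two points in time. If only the first packets of each connection are handled, the packets/s figure should stay far below the interface's packet rate.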

@northys
Contributor Author

northys commented Apr 15, 2022

Also, are you collecting metrics from the portmaster (http://127.0.0.1:817/metrics)?

I do, for 30 minutes already 😅

Do you have a grafana dashboard for those stats? I discovered the /metrics endpoint a few weeks ago, but I don't collect metrics for which I don't have a dashboard.


Graphs

I've also added the 2 annotations to the CPU/RAM dashboard and I'm posting all 3 so you have them with the same time range.

image
image
image


Some more graphs about the network. There are some drops on UDP (wireguard managed by tailscale). I hadn't noticed those when I originally opened this issue, maybe because I wasn't using wireguard (tailscale) locally back then.

Anyway, I don't think it means anything because the spike in UDP errors comes after the laptop restart. I think it is a drawback of being connected over Wi-Fi.

image
image
image
image

@northys
Contributor Author

northys commented Apr 15, 2022

btw the swap is not swap but zram. People often see it as a problem since it's not common to use swap anywhere nowadays.

https://fedoraproject.org/wiki/Changes/SwapOnZRAM#Summary

@northys
Contributor Author

northys commented Apr 17, 2022

Ha! I've got some interesting stuff!

  • when I rsync 40G file over LAN without tailscale (that means full TCP) it goes 16-22MB/s (180Mbps 5min avg by iftop)
  • when I rsync 40G file over LAN with tailscale (TCP over encrypted UDP network) it goes 0-4MB/s (~18Mbps 5min avg by iftop)
  • when I download (firefox) 40G file over LAN without tailscale (http unencrypted connection) it goes 0-2MB/s
    • at the same time I can run the first rsync mentioned at 16MB/s
  • when I download (firefox) 40G file over LAN with tailscale (TCP over encrypted UDP connection + HTTPS) it goes 0-2MB/s
    • at the same time I can run the first rsync mentioned at 16MB/s

image
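
Putting the peak figures from the list above on one scale (a rough sketch; the labels and the choice of plain rsync as the baseline are only for illustration): the slow cases top out at 16-32 Mbit/s, in the same ballpark as the ~20 Mbit/s mentioned above, while plain rsync reaches 176 Mbit/s.

```python
def mbyte_per_s_to_mbit_per_s(mb_per_s: float) -> float:
    """Convert MB/s to Mbit/s (8 bits per byte)."""
    return mb_per_s * 8


# Peak speeds in MB/s, taken from the upper end of each range in the list.
peaks = {
    "rsync, no tailscale": 22,
    "rsync, tailscale": 4,
    "firefox, no tailscale": 2,
    "firefox, tailscale": 2,
}

baseline = peaks["rsync, no tailscale"]
for name, peak in peaks.items():
    mbit = mbyte_per_s_to_mbit_per_s(peak)
    print(f"{name}: up to {mbit:.0f} Mbit/s ({peak / baseline:.0%} of plain rsync)")
```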

@northys
Contributor Author

northys commented Apr 17, 2022

I don't know what I should see here or how to use those metrics (rate() could probably be used?), but I can see that today the metric values increase noticeably faster than before. 2 squares were enough for the last 2 days, and for today 4 squares aren't enough.

I wouldn't say that I did more heavy network stuff than I did yesterday.

Screenshot from 2022-04-17 17-30-23

@dhaavi
Member

dhaavi commented Apr 21, 2022

Wow, nice stats! It seems there is a correlation between CPU usage and the network. Interesting. This would point to a problem with the network integration.

Are you using tailscale exit node stuff?
Tailscale might be interfering with the packet marks of Portmaster.

Ha! I've got some interesting stuff!

Does rsync also go over Tailscale, or not?
If not, it seems that tailscale makes the connection go to 0-4MB/s.
Can you check the other way around and just do Tailscale without Portmaster?

how to use those metrics

This is a histogram in the form that victoriametrics does it: https://valyala.medium.com/improving-histogram-usability-for-prometheus-and-grafana-bc7e5df0e350

So, something like this should do it: (not tested)

histogram_quantile(0.95,
    sum(rate(
        portmaster_firewall_handling_duration_seconds[5m]
    )) by (vmrange)
)

What I hope to see is how many packets the Portmaster handled in a certain period. For this, you can just use rate(portmaster_firewall_handling_duration_seconds_count[5m])
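
For intuition about what rate() does with such a counter, here is a simplified sketch: it sums the increases between consecutive samples and treats any drop as a counter reset (for example after a Portmaster restart). The real Prometheus rate() additionally extrapolates to the edges of the window, which is left out here; the function name and the sample values are made up.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase of a counter given (timestamp_s, value) samples.

    A drop in value is treated as a counter reset, i.e. the counter is
    assumed to have restarted from 0 between the two samples.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Hypothetical samples of ..._seconds_count scraped every 60 s,
# with a restart between the third and fourth sample.
samples = [(0, 100), (60, 700), (120, 1300), (180, 200), (240, 800)]
print(f"{simple_rate(samples):.2f} packets/s")
```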

@dhaavi dhaavi self-assigned this Apr 21, 2022
@northys
Contributor Author

northys commented Apr 21, 2022

Are you using tailscale exit node stuff?

No, I tried it, it broke everything and I don't need it, so I disabled it, and since then I use SPN on my laptop.

Does rsync also go over Tailscale, or not?

I've posted 4 examples and the first one is rsync without tailscale. When I tried to download (firefox) the same file from the server through LAN without tailscale, it did not reach "full speed" though. I'm just wondering why this "problem" doesn't apply to rsync.

I just can't understand why rsync without tailscale copies the file at "full speed" while downloading the same file using firefox is slow. There must be something that rsync bypasses, and it's not tailscale...

If not, it seems that tailscale makes the connection go to 0-4MB/s.

Nope, normally it goes about 95% of my LAN throughput without tailscale. The overhead is really low.

Can you check the other way around and just do Tailscale without Portmaster?

I don't think the tailscale is the problem here. When it works, it works and the "overhead" of tailscale is only a few percent in speed drop... This happened even before I discovered tailscale.

@Raphty
Member

Raphty commented Aug 23, 2023

I am cleaning out old issues. If you feel this issue should not have been closed let me know.

Please keep in mind, the free version of Portmaster only has limited support.
For free users, our active Discord community and the chat bot are the fastest and best way to get help: https://discord.gg/safing
If you find our work brings value to you, please consider supporting it by purchasing the Plus or Pro package: https://safing.io/pricing/.
If you are already a subscriber, first, thank you! If you want priority support, please send an email and let me know your username so I can prioritize your request accordingly.

@Raphty Raphty closed this as completed Aug 23, 2023

3 participants