[investigating] networking becomes "clogged" and max bandwidth is reduced to 5% #566

northys · 2022-03-13T17:26:19Z

❗ ❗ ❗ DISCLAIMER: THIS MAY NOT BE CAUSED BY PORTMASTER, STILL INVESTIGATING ❗ ❗ ❗

What happened:

In the past week I had to reboot my laptop to make networking fast again. We are talking about LAN throughput, so I'm 99% sure it wasn't related to my ISP throttling me or anything.

I watched video content streamed from my local server over DLNA, hosted by Plex Server. I use VLC most of the time, but sometimes I use the GNOME Videos app (Totem player) because it works better on a slower LAN.

There is one main difference between VLC and Totem:

- VLC streams the content "second by second" as you watch it, so Plex Server can even track the exact second where you left off. That means when networking doesn't work smoothly, playback stalls frequently (e.g. every 1-2 seconds, for many seconds at a time, depending on the content bitrate).
- Totem downloads the file to a local cache as fast as it can while it plays it from that cache. When everything works fine, it downloads the content from the DLNA server at speeds reaching 300Mbps, so the file is cached within a few minutes at most and plays with no issues. That's why I tend to use Totem on slower networks, e.g. hotel Wi-Fi, or in this case, when I'm too lazy to reboot my laptop because something weird is happening with my networking.

However, in the past week it felt like the networking on my laptop was "clogged". I couldn't load self-hosted apps over LAN from my local server, nor stream DLNA content smoothly (VLC) or cache it at full speed (Totem). At most I could measure 500kB/s download from my home server, and content with a bitrate of 8Mbps was lagging. I blamed the home server at the beginning because some I/O-heavy tasks run on it from time to time (backups and so on), as well as some network-heavy tasks.

Sadly, I wasn't able to find any issue on my home server. I checked the graphs scraped from Prometheus' node-exporter and there wasn't anything relevant. Disks were under normal load, there were no packet drops, bandwidth wasn't fully used, CPU was at 40% at most... Nothing weird.

This is why I'm opening this issue, and I will continue investigating. Rebooting my laptop has always fixed the problem. I've also installed node-exporter on my laptop, so I will have some graphs for you when it happens again.

What I know so far:

- Restarting Portmaster did not fix this issue.
- A laptop reboot fixes the issue.
- I had similar issues when I had SPN connected all the time, so I disabled it again about 2 days ago. It did not make sense to me, though, because SPN doesn't handle traffic over LAN (192.168.1.0/24 subnet).
- I restarted Portmaster from the UI (not systemctl restart) because earlier today (a few hours before this issue occurred again) it was consuming 100% CPU (that issue has already been opened by someone else and I contributed my logs a few days ago).
- When the networking clogged on my laptop, I was able to stream 4K content (30Mbps bitrate) on my phone from the same home server, so I assume it is not an issue with my local network's components. I have high-end network components from Ubiquiti, not cheap $20 gear.

What did you expect to happen?:

Networking is working at full speed all the time.

How did you reproduce it?:

I have no idea.

Debug Information:

Of course I forgot to copy it. I'll try not to forget the next time. I'm running version 0.8.5.


northys · 2022-03-18T09:08:20Z

Closing, as I couldn't replicate it in the past 5 days.

northys · 2022-04-15T01:08:22Z

@dhaavi there must definitely be something wrong with Portmaster. I can't help it, I have to blame something 😆

This is a graph of my laptop's network throughput. At 22:10 (first blue dashed line) Portmaster's SPN module crashed on a nil pointer dereference (reported in safing/spn#74). I've mentioned similar behaviour in another issue (safing/spn#67 (comment)), and as you can see, a laptop reboot fixed it again (second blue dashed line at the very right of the graph).

I'm sitting here watching how slowly Nextcloud syncs my files (over the tailscale network (yellow graph), but connected over LAN, no relay). I wish I had restarted my laptop earlier 😞

My laptop hasn't been connected to SPN for at least the last hour. I think I disabled it a few minutes after reporting the nil pointer crash, but I'm not 100% sure. It's worth mentioning that this happened even before I deployed tailscale; I only switched to tailscale yesterday.

Positive is download, negative is upload.

dhaavi · 2022-04-15T07:28:22Z

That's some very interesting data. Do you have CPU/MEM data for the same time period? Would be interesting if there is a correlation.

Also, are you collecting metrics from the Portmaster (http://127.0.0.1:817/metrics)?
You can find some docs on these metrics at http://127.0.0.1:817/api/v1/metrics/list.
Metrics have different levels of sensitivity, so you might need to enable dev mode or supply an API key.
Alternatively, you could use the push metrics setting.

These two metrics could tell us how many packets the Portmaster was handling during that period.

portmaster_firewall_handling_duration_seconds_sum
portmaster_firewall_handling_duration_seconds_count

I remember that if every packet goes through the Portmaster, the bandwidth limit was around 20Mbit/s, which suspiciously matches the reported speed here.
I don't know why this would happen, as the Portmaster currently marks all verdicts as permanent. One possibility is that another program broke the packet marking.
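A rough way to read the two counters listed above: the difference in `_count` between two scrapes approximates how many packets the Portmaster handled in that interval, and the difference in `_sum` divided by that packet count approximates the average handling time per packet. The Python below is only a sketch under that assumption. The `Scrape` class, the `summarize` function, and all sample numbers are made up for illustration and are not taken from real Portmaster output.

```python
from dataclasses import dataclass


@dataclass
class Scrape:
    """One hypothetical reading of the two counters at a point in time."""
    timestamp: float      # seconds since some reference point
    duration_sum: float   # value of ..._handling_duration_seconds_sum
    packet_count: float   # value of ..._handling_duration_seconds_count


def summarize(earlier: Scrape, later: Scrape) -> dict:
    """Derive packets/s and average handling time between two scrapes."""
    elapsed = later.timestamp - earlier.timestamp
    if elapsed <= 0:
        raise ValueError("scrapes must be in chronological order")
    packets = later.packet_count - earlier.packet_count
    busy_seconds = later.duration_sum - earlier.duration_sum
    return {
        "packets_per_second": packets / elapsed,
        # Avoid dividing by zero when no packets were handled at all.
        "avg_seconds_per_packet": busy_seconds / packets if packets else 0.0,
    }


# Illustrative values only: two scrapes taken 60 seconds apart.
before = Scrape(timestamp=0.0, duration_sum=10.0, packet_count=1000.0)
after = Scrape(timestamp=60.0, duration_sum=13.0, packet_count=7000.0)
result = summarize(before, after)
print(result)
```

A sudden jump in packets per second while throughput collapses would fit the theory that every packet, not just the first of each connection, is being handed to the Portmaster. This sketch ignores counter resets, for example after a Portmaster restart.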

northys · 2022-04-15T09:55:50Z

> Also, are you collecting metrics from the portmaster (http://127.0.0.1:817/metrics)?

I do, for 30 minutes already 😅

Do you have a Grafana dashboard for those stats? I discovered the /metrics endpoint a few weeks ago, but I don't collect metrics for which I don't have a dashboard.

Graphs

I've also added the 2 annotations to the CPU/RAM dashboard and am posting all 3 so you have them with the same time range.

Some more graphs about the network. There are some drops on UDP (WireGuard managed by tailscale). I hadn't noticed those when I originally opened this issue, maybe because I wasn't using WireGuard (tailscale) locally back then.

Anyway, I don't think it means anything, because the UDP error spike comes after the laptop restart. I think it is a drawback of being connected over Wi-Fi.

northys · 2022-04-15T13:12:14Z

btw the swap is not swap but zram. People often see it as a problem since it's not common to use swap anywhere nowadays.

https://fedoraproject.org/wiki/Changes/SwapOnZRAM#Summary

northys · 2022-04-17T15:09:07Z

Ha! I've got some interesting stuff!

- When I rsync a 40G file over LAN without tailscale (that means plain TCP), it goes 16-22MB/s (180Mbps 5min avg by iftop).
- When I rsync a 40G file over LAN with tailscale (TCP over an encrypted UDP network), it goes 0-4MB/s (~18Mbps 5min avg by iftop).
- When I download (Firefox) a 40G file over LAN without tailscale (unencrypted HTTP connection), it goes 0-2MB/s.
  - At the same time I can run the first rsync mentioned at 16MB/s.
- When I download (Firefox) a 40G file over LAN with tailscale (TCP over an encrypted UDP connection + HTTPS), it goes 0-2MB/s.
  - At the same time I can run the first rsync mentioned at 16MB/s.
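The figures above mix MB/s (transfer tools) and Mbps (iftop), which makes them hard to compare with the roughly 20Mbit/s ceiling mentioned earlier in the thread. A small Python sketch that puts the reported ranges on one scale, assuming 1 MB means 10^6 bytes (the tools may actually use 1024-based units, which would shift the numbers by a few percent). The dictionary labels are just shorthand for the four cases.

```python
def mbytes_to_mbits(mbytes_per_s: float) -> float:
    """Convert MB/s to Mbit/s, assuming 1 byte = 8 bits and decimal units."""
    return mbytes_per_s * 8


# (low, high) transfer rates in MB/s as reported for the four cases.
observations = {
    "rsync, no tailscale": (16, 22),
    "rsync, tailscale": (0, 4),
    "firefox, no tailscale": (0, 2),
    "firefox, tailscale": (0, 2),
}

converted = {
    label: (mbytes_to_mbits(low), mbytes_to_mbits(high))
    for label, (low, high) in observations.items()
}

for label, (low, high) in converted.items():
    print(f"{label}: {low:.0f}-{high:.0f} Mbit/s")
```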

northys · 2022-04-17T15:33:06Z

I don't know what I should see here or how to use those metrics (rate() could probably be used?), but I can see that today the metric's value is increasing noticeably faster than before. 2 squares were enough for the last 2 days, and for today 4 squares aren't enough.

I wouldn't say that I did more heavy network stuff than I did yesterday.

dhaavi · 2022-04-21T12:23:21Z

Wow, nice stats! It seems there is a correlation between CPU usage and the network. Interesting. This would point to a problem with the network integration.

Are you using tailscale exit node stuff?
Tailscale might be interfering with the packet marks of Portmaster.

> Ha! I've got some interesting stuff!

Does rsync also go over Tailscale, or not?
If not, it seems that Tailscale makes the connection go to 0-4MB/s.
Can you check the other way around and just run Tailscale without Portmaster?

> how to use those metrics

This is a histogram in the form that victoriametrics does it: https://valyala.medium.com/improving-histogram-usability-for-prometheus-and-grafana-bc7e5df0e350

So, something like this should do it: (not tested)

histogram_quantile(0.95,
    sum(rate(
        portmaster_firewall_handling_duration_seconds_bucket[5m]
    )) by (vmrange)
)

What I hope to see is how many packets the Portmaster handled in a certain period. For this, you can just use rate(portmaster_firewall_handling_duration_seconds_count[5m])
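For anyone unsure what `rate()` over a `_count` series gives you, here is a simplified Python sketch of the idea: total increase of the counter across the window, divided by the time covered, with a drop in value treated as a counter reset (as would happen after a Portmaster restart). It is not Prometheus' actual implementation, which also extrapolates to the window boundaries. The function name and the sample values are made up for illustration.

```python
from typing import List, Optional, Tuple


def per_second_rate(samples: List[Tuple[float, float]]) -> Optional[float]:
    """Approximate a per-second rate from (timestamp, counter_value) samples.

    Samples must be sorted by timestamp. Returns None if there is not
    enough data to compute a rate.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # Counter went backwards: assume it was reset to zero.
            increase += value
        else:
            increase += value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Illustrative samples: the counter resets between t=60 and t=120.
samples = [(0.0, 100.0), (60.0, 700.0), (120.0, 50.0), (180.0, 650.0)]
packets_per_second = per_second_rate(samples)
print(packets_per_second)
```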

northys · 2022-04-21T15:14:11Z

> Are you using tailscale exit node stuff?

No. I tried it, it broke everything, and I don't need it, so I disabled it; since then I use SPN on my laptop.

> Does rsync also go over Tailscale, or not?

I've posted 4 examples, and the first one is rsync without tailscale. When I tried to download (Firefox) the same file from the server through LAN without tailscale, it did not reach "full speed" though. I'm just wondering why this "problem" doesn't apply to rsync.

I just can't understand why rsync without tailscale copies the file at "full speed" while downloading the same file using Firefox is slow. There must be something that rsync bypasses, and it's not tailscale...

> If not, it seems that tailscale makes the connection go to 0-4MB/s.

Nope, normally it reaches about 95% of the throughput I get over LAN without tailscale. The overhead is really low.

> Can check the other way around and just do Tailscale without Portmaster?

I don't think tailscale is the problem here. When it works, it works, and the "overhead" of tailscale is only a few percent in speed... This happened even before I discovered tailscale.

Raphty · 2023-08-23T07:57:41Z

I am cleaning out old issues. If you feel this issue should not have been closed let me know.

Please keep in mind, the free version of Portmaster only has limited support.
For free users, our active Discord community as well as the chat bot are the fastest and best way to get help. https://discord.gg/safing
If you find our work brings value to you, please consider supporting it by purchasing Plus or Pro Packages https://safing.io/pricing/.
If you are already a subscriber, first: thank you! Also, if you want priority support, please send in an email and let me know your username so I can prioritize your request accordingly.

northys added the bug TYPE: a report on something that isn't working label Mar 13, 2022

northys changed the title ~~[still investigating] networking becomes "clogged" and max bandwidth is reduced to 5%~~ [investigating] networking becomes "clogged" and max bandwidth is reduced to 5% Mar 13, 2022

northys closed this as completed Mar 18, 2022

northys mentioned this issue Apr 2, 2022

SPN outage? failed to ping home hub: timed out safing/spn#67

Closed

northys reopened this Apr 15, 2022

dhaavi self-assigned this Apr 21, 2022

dhaavi added the waiting for input label Apr 21, 2022

github-actions bot removed the waiting for input label Apr 22, 2022

Raphty closed this as completed Aug 23, 2023
