tcp_ecn causes my network downloads to fail or become _very_ slow #9748

Closed
tYYGH opened this Issue Jul 29, 2018 · 22 comments

Comments

7 participants

tYYGH commented Jul 29, 2018

systemd version the issue has been seen with

239.0

Used distribution

Archlinux

Expected behaviour you didn't see

working downloads

Unexpected behaviour you saw

very slow, or non-working downloads (curl, wget, firefox, git clone/fetch…)

Steps to reproduce the problem
Install systemd 239.0
See here: https://bugs.archlinux.org/task/59473

Steps to fix the problem
sysctl net.ipv4.tcp_ecn=0
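
A plain sysctl call like the one above only lasts until the next reboot. Below is a minimal sketch of making the workaround persistent, assuming a drop-in file under /etc/sysctl.d that sorts after the distribution defaults. The file name 99-tcp-ecn.conf and the helper name tcp_ecn_dropin are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the contents of a sysctl.d drop-in that pins
# net.ipv4.tcp_ecn to the given value (0, 1 or 2).
tcp_ecn_dropin() {
    case "$1" in
        0|1|2)
            printf 'net.ipv4.tcp_ecn = %s\n' "$1"
            ;;
        *)
            echo "tcp_ecn must be 0, 1 or 2" >&2
            return 1
            ;;
    esac
}

# Preview the line that would be written; nothing is changed on the system.
tcp_ecn_dropin 0

# To apply for real (as root; the file name is an assumption):
#   tcp_ecn_dropin 0 > /etc/sysctl.d/99-tcp-ecn.conf
#   sysctl --system
```

Printing the drop-in content first makes it easy to review before anything under /etc is touched.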

phomes Jul 29, 2018

Contributor

The man page for tcp says

When enabled, connectivity to some destinations could be affected due to older, misbehaving middle boxes along the path, causing connections to be dropped. However, to facilitate and encourage deployment with option 1, and to work around such buggy equipment, the tcp_ecn_fallback option has been introduced.

And it seems that you might be affected by that. Can you share some details about your network equipment and perhaps even check if there is an updated firmware for it?

To work around the problem I would suggest using sysctl net.ipv4.tcp_ecn=2 (edit: updated per next comment) instead of 0, as that was the default in use before the change in #9143.

SimonIremonger Jul 29, 2018

Correction to above: sysctl net.ipv4.tcp_ecn=2 [negotiate ECN only on incoming connections] used to be the default, I think.
In any case, I agree tYYGH should sort out which router and other network equipment are in use, and their firmware... Could easily be an issue there!
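
For reference, the three tcp_ecn values discussed so far can be summarized in a small lookup, following the descriptions in the tcp(7) man page. This is only an illustrative sketch; the function name tcp_ecn_mode is made up here.

```shell
#!/bin/sh
# Hypothetical helper: describe a net.ipv4.tcp_ecn value in plain words,
# following the semantics documented in tcp(7).
tcp_ecn_mode() {
    case "$1" in
        0) echo "disabled" ;;
        1) echo "requested on outgoing connections, accepted on incoming" ;;
        2) echo "accepted on incoming connections only" ;;
        *) echo "unknown" ;;
    esac
}

# Example: the value proposed as a workaround in this thread.
tcp_ecn_mode 2
```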

poettering Jul 30, 2018

Member

I wonder if we should revert #9143. We shouldn't really turn something on if it breaks people's connectivity. If routers are fucked, then this might very well be out of control of the person who uses the system, and issues like this are really not easy to debug and fix.

poettering Jul 30, 2018

Member

@SimonIremonger @enihcam @michich opinions on reverting?

@poettering poettering added this to the v240 milestone Jul 30, 2018

@poettering poettering added the sysctl label Jul 30, 2018

tYYGH Jul 30, 2018

I can confirm the “not easy to debug” part! I literally spent days on Freenode's ##Networking channel, where I got great help, and no one came anywhere close to the real cause of this issue. We ran many tests (ping, traceroute, mtr, tcpdump…), and we were always disturbed by the fact that some things worked and some did not…

I had this issue on 2 machines (out of 2!) where I upgraded to latest systemd. I will report on these machines’ specifics (a regular self-assembled PC, and an Udoo X86) as soon as I get home.

Thanks a lot for systemd, by the way! It’s so much better than what we had before :-)

SimonIremonger Jul 30, 2018

I would strongly suggest a three-pronged approach:

1. Definitely encourage those who HAVE found STILL-faulty network equipment to raise the issue (this NEEDS to happen somewhere!!). Is there a new wiki/location for an ECN hall of shame?

2. Advise the linux-net and bufferbloat communities that the issue still persists in some areas, and ask whether a more aggressive ECN-fallback option can be implemented (like Apple have been doing successfully).

3. Gauge the actual scale of the issue. It may be that reverting for the next systemd revision is worthwhile, but this still "needs raising" somewhere. I would have expected this issue to appear in a FEW places, but I really do get the impression it's increasingly rare. Many of us have had tcp_ecn=1 for a decade now. These few initial complaints might suggest the problem is now more likely in "consumer routers" than in "public services" (the latter appears to be much of a non-issue now)...
If the latter point is true, a better adaptive fallback in the kernel would likely alleviate or work around the slow connection establishment, at least.

SimonIremonger Jul 30, 2018

@tYYGH -- we don't need the machines' specifics as much as Router/firmware-on-router that you are connecting through, which is much more likely where the incompatibility lies.

tYYGH Jul 30, 2018

Here is absolutely all I can tell about this ISP-provided router:

[root@sedentaire ~]# nmap -A 192.168.1.1
Starting Nmap 7.70 ( https://nmap.org ) at 2018-07-30 17:49 CEST
Nmap scan report for bbox.lan (192.168.1.1)
Host is up (0.00045s latency).
Not shown: 997 closed ports
PORT    STATE SERVICE  VERSION
53/tcp  open  domain   dnsmasq 2.75
| dns-nsid: 
|_  bind.version: dnsmasq-2.75
80/tcp  open  http     lighttpd
|_http-server-header: Lighttpd
443/tcp open  ssl/http lighttpd
| ssl-cert: Subject: commonName=Bbox/organizationName=Bouygues Telecom/stateOrProvinceName=France/countryName=FR
| Not valid before: 2013-05-27T08:50:31
|_Not valid after:  2023-05-25T08:50:31
|_ssl-date: TLS randomness does not represent time
MAC Address: D0:84:B0:18:55:FC (Sagemcom Broadband SAS)
Device type: general purpose
Running: Linux 2.6.X
OS CPE: cpe:/o:linux:linux_kernel:2.6
OS details: Linux 2.6.9 - 2.6.27
Network Distance: 1 hop

Apparently, this would be the one:
https://www.bbox-mag.fr/box/fixe/37-presentation-video-de-la-bbox-sensation-ng-sagem-5330b/

Yep, it is, and it has firmware 15.1.2 from 2018-02-13 (probably the latest available).

tYYGH Jul 30, 2018

In your opinion, is this bug in the loaned router important enough to justify Bouygues (my ISP) replacing it with a bug-free model? Or is this just a sadly-acceptable nuisance?

phomes Jul 30, 2018

Contributor

I am a bit surprised that an ISP would not have this fixed. All iOS 11 devices have ECN enabled, and 50% of randomly selected OS X computers do. So I would expect an ISP to drown in complaints over the router. But maybe our fallback solution is just not as good as Apple's, as @SimonIremonger wrote.

Just to be sure: net.ipv4.tcp_ecn_fallback is set to 1, right?

tYYGH Jul 30, 2018

@phomes You mean on the router? I have no way of knowing. Seen from my seat, this is a black-box appliance… :-(

On my PC net.ipv4.tcp_ecn_fallback = 1, as you wrote.
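
For anyone else checking their own machine, the two ECN-related settings mentioned in this thread can be read in one go. A small sketch, assuming the usual procfs layout under /proc/sys; the function name show_ecn_sysctls and its optional root argument are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print the two ECN-related sysctls found under a
# procfs-style root directory (defaults to /proc/sys).
show_ecn_sysctls() {
    root="${1:-/proc/sys}"
    for name in tcp_ecn tcp_ecn_fallback; do
        file="$root/net/ipv4/$name"
        if [ -r "$file" ]; then
            printf 'net.ipv4.%s = %s\n' "$name" "$(cat "$file")"
        else
            printf 'net.ipv4.%s = (not available)\n' "$name"
        fi
    done
}

# Show the values of the running kernel.
show_ecn_sysctls
```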

SimonIremonger Jul 30, 2018

@tYYGH The issue COULD be further up the chain in the ISP's network, not just the router.
In any case, I DO think it really is an issue the ISP SHOULD be fixing -- these days I do think it is NOT a sadly-acceptable-nuisance. Please do try to raise it with them, I appreciate it can be hard to get through to 2nd-line support, but worth a try. They might in any case get you a 'different type of router' if asked.

tYYGH Jul 31, 2018

Interesting read!
I believe I had at least the packet-reordering problem:
https://ptpb.pw/fXg1
This is curl trying to download the Tor-Browser using the URL from the project’s web site.

SimonIremonger Jul 31, 2018

@tYYGH
Do find out what you can about an alternate router / ISP tech support... Post it to phomes's query:
https://www.reddit.com/r/linux/comments/933vys/is_tcp_ecn_still_a_problem_today/
If problem routers are reported there, I'll put notes in the bufferbloat wiki.

dtaht Aug 3, 2018

a packet cap of the failing connection with ecn on would be revealing.

for the record, I have deeply conflicting feelings about the wide use of ecn. In my mind it is a good idea at very short rtts(sub 2ms), and very long ones(>50ms), and for doing things like protecting video iframes from loss. I use it to protect routing babel protocol packets from being dropped. Etc.

Others (in the bbr, l4s, dctcp communities) want to change the definition of ecn to mean a multi-bit rate reduction and obsolete rfc3168, where a loss is equivalent to a mark and the recommended rate reduction is 1/2. fq_codel, pie, red, and all other deployed ecn capable aqm systems essentially implement rfc3168 behavior and it's what apple's tcp - and linux cubic - and bsd's - and windows - general deployment expects. I had hoped with wider deployment of aqms that dealt with ecn properly that we'd see more tcp's also enabling it... and we'd see tcps evolve to treat aqms doing multiple ecn marks per rtt per rfc3168 more sanely than they do today as it is a stronger signal of congestion than loss, as well as

Instead... well, see bbr, which currently more or less ignores packet loss on its quest to own the link, and currently has no ecn response. There's a thread on the bbr list that talks about how they are leaning towards not respecting that rfc. The l4s folk vehemently defend the idea that some form of dctcp can run outside the datacenter, based on bigbuckbunny based demos, combined with a custom and patented aqm, never tested against wifi or 3g, who create a lot of noise in the ietf, and little else.

Nearly every time I've quit smoking, an ecn debate started me up again, and instead of continuing to deal with it, I left the ietf, leaving the folk there to plot amongst themselves with no actual deployment to deal with. I'm so frustrated with the "make tcp go fast at any cost" people that periodically I fiddle with something called tcp-fu (for users), that has an adjustable response curve to fq_codel's ecn marks from background (torrent-like) to "gentle", rather than "rabid". Despite the enormous success of fq_codel and BQL in eliminating bufferbloat and network latency - I feel bad about essentially obsoleting the entire field of LPCC
(see: https://perso.telecom-paristech.fr/drossi/paper/rossi14comnet-b.pdf ) and would like to do something about it.

I certainly support systemd folk trying to move the state of the art forward and am eternally grateful for their adoption of fq_codel, which now covers the world, and (as a tiny component of the whole system) also has ecn enabled by default. If, in all the world, enabling ecn for cubic tcp only breaks this one box, well, fix that box.

Turning on ecn universally in systemd seems to be an idea that could be a major forcing, political and technical function, towards aiding deployment (fixing busted routers) and keeping the simplicity of an rfc3168-style ecn enabled aqm alive instead of crazed ietf alternatives like l4s + "dualQ". Of course those proponents would point to their lowered tcp delay (in the datacenter) as a forever lost opportunity and split the universe apart.

I used to take great glee in how fq techniques generally beat dctcp's, even in the datacenter... at how even the guy that invented dctcp moved onto fq...

But...

As the accidental co-author of what is now the largest edge-user ecn-enabled fq + aqm deployment in "fq_codel everywhere", + the implementation in the ath9k and athk wifi code... (does anyone have any idea how many users of systemd + fq_codel there are?) I lose sleep over the ecn component only. I have a ton of data on it. It's mixed.

What I see with lots of tcp-ecn traffic on a link is that other valuable packets are delayed (slightly) or lost and am always saying "ECN has mass" to anyone that will listen. This is made up for by nearly eliminating retransmits, mostly. I think. It's a huge win for interactive tcp traffic, which is why apple adopted it (and - helpfully - their reno tcp is far less aggressive than linux cubic). I worry about lockout at low speeds and that one day we'd have to mark all packets of all types of flows (including voip and dns) as ecn capable, unless tcps evolve appropriately towards an agreed upon response curve.

One really interesting side-effect of ecn on is that fq_codel, running locally on the server, can self congest and start marking packets locally, thus regulating the behavior of the server better after a rtt (try 128 flows coming out of a short path at a gbit). But local fq_codel induced loss (currently) on the other hand is sometimes not lost there but signals the local stack to immediately reduce cwnd without actually losing the packet. Others might view either behavior as a problem and prefer that that server serve 100s more flows at ever increasing self inflicted local delay (as sch_fq does) until you run out of cpu. (I know this is not a good description of this issue, this is a bug note not a paper).

I do wish we'd come up with a more robust response to overload in fq_codel for ecn, as currently a malicious ecn sender can push fq_codel to its memory limit before being dropped robustly. (pie drops ecn at overload). fq_codel (and now sch_cake) can continue to evolve as can everything else of course!

ECN is the wet paint of the congestion control universe. I hope the systemd deployment goes well and if it does, I'll sleep better, and if it doesn't, and gets reverted with all the reasons why well understood, I'll also sleep better. thx for pushing the limits.

dtaht Aug 3, 2018

@tYYGH - looking at that packet cap is not helpful without the actual .cap file as it lacks IP headers. Secondly it appears to be using tor. It would not surprise me if some tor clients'/servers' encapsulation of ip packets was wrong (in perpetual draft is this: https://tools.ietf.org/html/draft-ietf-tsvwg-ecn-encap-guidelines-10)

I'm currently not inclined to "blame the router" but the server you are connecting to. This server has ecn enabled. Try downloading this file both "straight" and through tor, with ecn on and off, with wget or curl: http://flent-fremont.bufferbloat.net/~d/losangeles_and_wish_you_were_here.mp3

(for the record, it's me and a friend playing those two songs and is 12MB in size, enjoy)

(If you can setup a time I can tcpdump from my side).
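
The comparison suggested above can be sketched roughly as follows, assuming curl is installed and that the script runs as root when the sysctl calls are actually executed. The helper names run and ecn_download_test are made up for illustration, and only the direct (non-tor) download is covered. With DRY_RUN=1, the default here, the commands are printed instead of executed.

```shell
#!/bin/sh
# Test file named earlier in this comment.
URL="http://flent-fremont.bufferbloat.net/~d/losangeles_and_wish_you_were_here.mp3"

# Default to a dry run so nothing is changed unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

# Either print a command (dry run) or execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Set tcp_ecn to the given value, then time one download of the test file.
ecn_download_test() {
    run sysctl -w "net.ipv4.tcp_ecn=$1"
    run curl -s -o /dev/null --max-time 120 \
        -w "ecn=$1 time=%{time_total}s speed=%{speed_download}B/s\n" "$URL"
}

# Compare ECN off (0) against ECN requested on outgoing connections (1).
for mode in 0 1; do
    ecn_download_test "$mode"
done
```

Running it with DRY_RUN=0 changes the live sysctl value, so the previous setting should be noted and restored afterwards.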

@SimonIremonger - if you can't tell, my position on ecn is nuanced. Also, apple's enormous test tested their clients against their servers primarily, and not things like torrent or tor. It would not surprise me at all if ecn failure modes were higher against many other people's servers, nor if linux's ecn response was insufficiently robust.

tYYGH Aug 5, 2018

@dtaht - There is no TOR involved in the tcpdump output I posted. This is a plain curl download of the Tor Browser, not through the Tor Browser.

I wouldn’t blame the server either (or at least not only this one), since this issue also prevented me from running any git fetch, for example; and downloading https://f-droid.org/FDroid.apk failed the same way as downloading the Tor Browser did.

As for the usefulness of the capture I posted, I am sorry if it is useless; it is only what I had available at the time. I ran this capture at the direction of someone who understands these things better than I do. I do not know how to use tcpdump, and I do not understand its output, and your long post above is gibberish to me, sadly.

However, I am willing to learn, and to help, and if I can produce any output that is more helpful, please ask! I will do my best to post the information you need.

dtaht Aug 6, 2018

ok, that much info points more at your router.

tcpdump -i yourinterface -w acapture.cap

captures the binary info needed to look that deeply into what was going on at the tcp level.
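
Once such a capture exists, the ECN negotiation is visible in the TCP flags of the handshake: per RFC 3168, a client requesting ECN sends a SYN with both ECE and CWR set, and a server that agrees answers with a SYN-ACK carrying ECE. A small sketch for picking these out, assuming the capture file name from the command above; the helper name tcp_flag_names and the variable ECN_SYN_FILTER are made up for illustration:

```shell
#!/bin/sh
# Capture filter matching SYN packets that have both ECE and CWR set,
# i.e. connection attempts that request ECN. tcp[13] is the TCP flags byte.
ECN_SYN_FILTER='tcp[13] & 0xc2 == 0xc2'

# Hypothetical helper: decode a TCP flags byte (decimal) into the flag names
# relevant to ECN negotiation. Bit values follow the TCP header layout:
# SYN=0x02, ACK=0x10, ECE=0x40, CWR=0x80.
tcp_flag_names() {
    flags=$1
    out=""
    if [ $((flags & 0x02)) -ne 0 ]; then out="$out SYN"; fi
    if [ $((flags & 0x10)) -ne 0 ]; then out="$out ACK"; fi
    if [ $((flags & 0x40)) -ne 0 ]; then out="$out ECE"; fi
    if [ $((flags & 0x80)) -ne 0 ]; then out="$out CWR"; fi
    echo "${out# }"
}

# 0xc2 (194) is the flags byte of an ECN-requesting SYN.
tcp_flag_names 194

# To list such packets from a saved capture:
#   tcpdump -nn -r acapture.cap "$ECN_SYN_FILTER"
```

If ECN-requesting SYNs show up in the capture but are never answered, that would be consistent with something on the path dropping them.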

jonathonf Aug 12, 2018

Just to weigh in here. Silently "breaking" end-users' network connections isn't a great approach to fixing middleware. You also need to remember that large parts of the world don't have the same level of networking infrastructure so some users won't be able to do anything about this even if they knew the reason behind their issue.

As things stand, there are numerous reports of network connections suddenly becoming "slow" after an update to 239 on rolling release distros. This leads those users to test a different (frozen-pool) distro, find it works as expected (as it has an older systemd version), and then simply move to that distro. Of course, the issue will reappear a few months/years down the line when they upgrade that distro, but as far as they're concerned the original distro was broken.

Now - if I have read #9748 (comment) correctly, it appears that ECN isn't a "magic bullet" that should be blanket-enabled, but is instead a "tweak" or "optimisation" which should be applied by network admins who know what it does and can check its effects. A "successful" test on iOS isn't representative for anything other than for iOS devices being used in areas where people can afford iOS devices - Linux isn't only run on high-end/pro-sumer hardware.

I don't know how new features can be rolled out without this sort of friction ("you can't please all the people all the time") but if ECN isn't "perfect" perhaps this change should be reverted? At least until it can be better/more widely tested? However, I'm certainly no expert so am happy to be overruled.

dtaht Aug 13, 2018

If the ecn enablement is causing trouble outside this bug report, I totally agree with reverting it in systemd.

In fact, at this point in time, as desirous as I was in an earlier post towards getting more data about what else can go wrong with it in a fuller ecn deployment, I believe the linux tcps are not ready to be ecn'd in the general case for the general public. So - while I don't mind if you get more data!! - please revert.

keszybz Aug 20, 2018

Member

Since #9880 is merged, let's close this. We can probably try again when the kernel gets a better fallback mode.

@keszybz keszybz closed this Aug 20, 2018
