my router showing on all hops #234

Open
brianjmurrell opened this issue Nov 24, 2017 · 35 comments

Comments

@brianjmurrell

Given some MTR output such as:

                                           Packets               Pings
 Host                                    Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. 2001:dcba:4321:5678::1                0.0%  1418    0.6   1.6   0.3 1061.  28.3
 2. 2001:dcba:4000:82::1                  4.0%  1418    7.5 455.6   4.3 8104. 1401.
    2001:dcba:4321:5678::1
 3. 2001:dcba:0:1a5::1                   25.2%  1417   14.7 742.4  11.1 8767. 1804.
    2001:dcba:4321:5678::1
 4. 2001:dcba:0:54::1                     9.9%  1417   15.6 611.6  12.3 7672. 1529.
    2001:dcba:0:53::1
    2001:dcba:0:50::1
    2001:dcba:0:51::1
    2001:dcba:0:52::1
    2001:dcba:4321:5678::1
 5. 2001:dcba:0:48::2                    30.5%  1417  3106. 1010.  14.2 29997 2463.
 6. 2620:0:1cff:dead:beee::13a            4.3%  1417   20.3 449.8  14.1 7991. 1396.
    2001:dcba:4321:5678::1
 7. 2620:0:1cff:dead:bef0::2d             4.5%  1417   14.6 439.9  12.7 7871. 1396.
    2620:0:1cff:dead:bef0::21
    2620:0:1cff:dead:bef0::13
    2620:0:1cff:dead:bef0::1b
    2001:dcba:4321:5678::1
 8. 2a03:2880:f00e:ffff::3                4.8%  1417   14.6 445.5  11.2 8383. 1430.
    2a03:2880:f00e:ffff::9f
    2a03:2880:f00e:ffff::11
    2a03:2880:f00e:ffff::9
    2a03:2880:f00e:ffff::a7
    2a03:2880:f00e:ffff::37
    2a03:2880:f00e:ffff::ad
    2a03:2880:f00e:ffff::73
 9. 2a03:2880:f00e:a:face:b00c:0:2        4.4%  1417  112.7 533.3  15.2 7209. 1396.
    2001:dcba:4321:5678::1

My router is of course 2001:dcba:4321:5678::1, but it is showing up in all hops. What might be causing MTR to display this?

Is my router really exhausting the TTLs of packets with TTLs > 1 to 0, such as a routing loop might do?

Or could it be that MTR is not being discriminating enough: rather than processing only ICMP TTL Exceeded, it is treating something like an ICMP Unreachable in the same way that it would treat ICMP TTL Exceeded?
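
To make that distinction concrete, here is a minimal Python sketch of how the two kinds of ICMP error differ by type number. This is not mtr's code; the function name `classify_icmp` and the returned labels are made up for illustration. The type numbers themselves come from the ICMP (RFC 792) and ICMPv6 (RFC 4443) specifications.

```python
# ICMP error types that matter to a traceroute-style tool.
# Type numbers per RFC 792 (ICMPv4) and RFC 4443 (ICMPv6).
ICMP4_DEST_UNREACHABLE = 3
ICMP4_TIME_EXCEEDED = 11
ICMP6_DEST_UNREACHABLE = 1
ICMP6_TIME_EXCEEDED = 3


def classify_icmp(icmp_type: int, ipv6: bool = False) -> str:
    """Return 'time-exceeded', 'unreachable' or 'other' (hypothetical labels)."""
    if ipv6:
        time_exceeded, unreachable = ICMP6_TIME_EXCEEDED, ICMP6_DEST_UNREACHABLE
    else:
        time_exceeded, unreachable = ICMP4_TIME_EXCEEDED, ICMP4_DEST_UNREACHABLE
    if icmp_type == time_exceeded:
        return "time-exceeded"
    if icmp_type == unreachable:
        return "unreachable"
    return "other"
```

Note that type 3 means Destination Unreachable in ICMPv4 but Time Exceeded in ICMPv6, so the address family has to be known before the type is interpreted.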

@osbjmg

osbjmg commented Nov 25, 2017

I do not know much about mtr or ipv6 specifically, but a tcpdump would probably be good to get some clues.

I assume this behavior doesn't happen with ipv4?

Edit/answer by REW: confirmed on IPv4.

@brianjmurrell
Author

I don't think there is anything particular about IPv6 and this phenomenon. I would reckon it could happen with IPv4 just as easily.

The problem with a tcpdump is that it can take many, many hours before this phenomenon occurs, and I would not know at what point during those hours of MTR running it had happened, so finding it in the capture would be a bit of a needle in a haystack, I think. Happy to be told I am wrong though.

@brianjmurrell
Author

But yes, I agree a tcpdump would be quite useful, if we could just solve the needle in the haystack problem.

@rewolff
Collaborator

rewolff commented Nov 25, 2017

The problem, I would think, is that this tcpdump would also catch regular traffic, or it would be tricky to get the stuff we need if you start being selective.

I have been able to recreate the situation on IPv4: mtr running to a host, and suddenly on the router I change the route to that single host (8.8.8.8 in my case) to the wrong interface. After a while the router sometimes sends a "host unreachable".

After a while mtr changes to a "no route to host" and ends up showing my router at every hop, but only AFTER the normal route comes back. I can confirm that older versions of MTR don't do this. This was introduced with the "no route to host" processing that we seem to have now.

I used:
tcpdump -l -ni eth0 icmp 2>&1 | grep -v 'echo request'
to see the responses from the router(s).

@brianjmurrell
Author

I have also been able to get a packet capture when this happens and also see that my router is sending Destination unreachable (Address unreachable) when these happen.

It would seem to me to be most accurate in the MTR display to attribute these to the first hop, even though technically the packet they are ICMPing was to some hop >1. It is, after all, the first hop router that is causing the packet not to make it to its desired hop, and the drop count really should go there, not to the router where the packet was intended to trigger the Time Exceeded.

@brianjmurrell
Author

Any comments on my proposal above?

@rewolff
Collaborator

rewolff commented Dec 6, 2017

It goes without saying.... mtr should attribute the returned packets to the right router/hop.... So the returned packets should be inspected: "no route to host?" then... What?
If we've seen the host on an earlier hop, attribute to that hop?

We don't want to discard them completely: Sometimes you get the "no route to host" message from say the 10th hop out, that actually belongs there.

(what remains is actually writing the code).
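
A rough Python sketch of that rule, purely to pin down the idea and not taken from mtr's source. The function name `attribute_hop`, the `first_seen` bookkeeping dict and the two string labels are all hypothetical.

```python
def attribute_hop(probe_hop, source, icmp_kind, first_seen):
    """Pick the hop an ICMP reply should be counted against.

    probe_hop  -- hop number (TTL) the probe was sent for
    source     -- address that sent the ICMP reply
    icmp_kind  -- 'time-exceeded' or 'unreachable' (hypothetical labels)
    first_seen -- dict: address -> earliest hop where it answered
                  with a time-exceeded reply (updated in place)
    """
    if icmp_kind == "time-exceeded":
        # Normal traceroute reply: remember where this router lives.
        first_seen[source] = min(probe_hop, first_seen.get(source, probe_hop))
        return probe_hop
    if icmp_kind == "unreachable":
        earlier = first_seen.get(source)
        if earlier is not None and earlier < probe_hop:
            # Known router from an earlier hop: count the reply there.
            return earlier
    # Unknown sender (e.g. a "no route to host" from hop 10): keep it as is.
    return probe_hop
```

With this rule an unreachable from a router already seen at hop 1 is counted at hop 1, while an unreachable from a host that has never answered before stays at the hop it was probed for.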

@brianjmurrell
Author

Yes, of course. But that's not what appears to be happening currently, otherwise I wouldn't see situations where my own router showed up in hops > 1 then, would I?

@jonesrick

Not sure it would work all the time, but if mtr included the initial hop count as part of the "data" portion of the packets it sent, close enough to the beginning to be reasonably likely to be included in that portion of the packet returned in the ICMP message, it could then compare the initial hop count with the TTL in the IP header of the datagram being returned in the ICMP message and calculate how many hops away the "no route to host" was generated without having to assume something based on the source IP of the ICMP message.
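
The arithmetic behind this suggestion can be sketched in Python as follows. This is only an illustration: the one-byte payload layout and both function names are made up, and the calculation assumes every router decrements the TTL by exactly one and quotes the header as it received it (some implementations quote it after their own decrement, which shifts the result by one). Whether the byte survives at all depends on how much of the probe the router quotes.

```python
import struct


def build_payload(initial_ttl: int, padding: int = 8) -> bytes:
    """Put the initial TTL in the first payload byte (hypothetical layout)."""
    return struct.pack("!B", initial_ttl) + bytes(padding)


def hops_to_sender(quoted_payload: bytes, quoted_ttl: int) -> int:
    """Estimate how many hops away the ICMP error was generated.

    quoted_payload -- probe payload as quoted inside the ICMP error
    quoted_ttl     -- TTL field of the IP header quoted inside the ICMP error
    """
    initial_ttl = quoted_payload[0]
    # The router at hop k sees the TTL after k - 1 decrements.
    return initial_ttl - quoted_ttl + 1
```

For a probe sent with TTL 8, a quoted TTL of 8 would mean the error came from the first hop, and a quoted TTL of 1 would mean it came from hop 8.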

@dbareiro

dbareiro commented Sep 25, 2022

Hi!

I am experiencing this issue still in 2022 with version 0.94 packaged by Debian.

We are using a CPE210 device to distribute internet in a cabin complex. The CPE210 is wired to a Mikrotik device that receives the Internet through an over-the-air connection.

According to what I was observing, mtr shows a "no route to host" situation, but not in all the hops at the same time: for example, in hop 8, then several minutes later in hop 4, and several minutes after that in the final hop of the test. This is an example, but it can happen randomly on any hop.

image

The topology used in the test is as follows:

My notebook (wifi)-> cabin router [192.168.2.1] (wifi)-> CPE210 [192.168.0.254] (wired)-> Mikrotik [192.168.88.1]

Any suggestions to avoid this behavior?

Thanks in advance.

Kind regards,
Daniel

@rewolff
Collaborator

rewolff commented Sep 26, 2022

You have to imagine the internet as being a country with lots of railway stations. First of all in this country there are no fast trains. All trains end at the next station. The railways have a service: You can send parcels across the railways and the people at the stations will put the parcels on the next train provided you've paid for the service with a special railway coin attached. They will take one coin for themselves and pass the parcel on to the next station by the next train in that direction.

Now because those coins are really cheap everybody just slaps on 64 coins and the parcels (so far) always get to their destination. (it used to be 32, and about a decade and a half ago, some parcels were starting to be sent back! So everybody quickly started putting on 64 coins instead of just 32).

When a parcel arrives at a station with no coins remaining, the address label (and just a tiny bit more) is sent back with the note "not enough coins at station X".

Normally when you want to communicate you just slap on enough coins and either it works or it doesn't but if you want to figure out how far your parcels are getting.... you can do a trick. Send out a parcel with just one coin and see which station reports: "No more coins at station X" . Send out a parcel with just two coins and watch for the "No more coins at station Y".

Now it seems your mikrotik is being smart. It seems to intercept parcels that, at that point in time, have plenty of coins left, but sends them back anyway with a "no more coins" message. I don't think mtr can do anything about that.

Does it happen often? I see you ran mtr for about 3 hours in the above example. Does it happen if you run it for a minute? No? Can you automate: Take a network dump, start mtr in report mode for a minute (-r -c 60), stop the network dump and check if there are multiple 192.168.0.254 in the report. If not repeat... Now leave that running for 3 hours.... and then I need the network dump and the mtr report.
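
The "check if an address shows up more than once" step could be automated with something like the Python sketch below. It assumes the `N.|--` layout that `mtr -r` prints, with extra addresses for a hop listed on indented lines beneath it (the layout seen in the reports later in this thread); the function name `hosts_at_multiple_hops` is made up. Starting and stopping tcpdump and mtr around it is left to a wrapper script.

```python
import re
from collections import defaultdict

# Matches report lines such as "  2.|-- 192.168.0.254   0.0%  60 ..."
HOP_LINE = re.compile(r"^\s*(\d+)\.\|--\s+(\S+)")


def hosts_at_multiple_hops(report: str) -> dict:
    """Map each host that appears under more than one hop to its hop numbers."""
    hops_by_host = defaultdict(set)
    current_hop = None
    for line in report.splitlines():
        match = HOP_LINE.match(line)
        if match:
            current_hop = int(match.group(1))
            host = match.group(2)
        elif current_hop is not None and len(line.split()) == 1:
            # Indented line with a single token: another host for this hop.
            host = line.split()[0]
        else:
            continue  # header lines, blank lines
        if host != "???":
            hops_by_host[host].add(current_hop)
    return {h: sorted(s) for h, s in hops_by_host.items() if len(s) > 1}
```

An empty result means the one-minute run was clean and the loop can repeat; a non-empty result means the matching capture is the one worth keeping.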

@dbareiro

dbareiro commented Sep 26, 2022

Hi, @rewolff.

Thanks for your time in replying. I loved the example with the train :)

When you say do a network dump do you mean to run tcpdump with some specific syntax on the same notebook that I am running mtr on?

After reading your answer, I've done some isolated tests by running mtr in report mode for a minute, but it doesn't seem to be a problem. At least during those tests.

root@orion:~# mtr -r -c 60 8.8.8.8
Start: 2022-09-26T10:23:29-0300
HOST: orion                       Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 192.168.2.1                1.7%    60    1.3  11.3   1.0 128.1  26.5
  2.|-- 192.168.0.254              0.0%    60    2.9  11.8   2.4  88.3  20.3
  3.|-- 192.168.88.1               0.0%    60    3.0  12.1   2.6 185.4  31.8
  4.|-- 20.0.0.1                   1.7%    60    5.5  44.9   4.5 229.2  55.4
  5.|-- 200.45.79.1                0.0%    60    5.9  36.4   5.9 185.5  43.3
  6.|-- host229.190-224-7.telecom  0.0%    60   12.6  27.6   6.2 197.8  38.9
  7.|-- host142.181-96-60.telecom  0.0%    60   11.7  31.8   6.8 139.9  34.9
  8.|-- host210.181-89-2.telecom.  0.0%    60   19.8  54.9  17.8 315.5  61.9
  9.|-- ???                       100.0    60    0.0   0.0   0.0   0.0   0.0
 10.|-- host234.181-96-113.teleco 70.0%    60  106.7  63.8  17.8 294.8  72.9
 11.|-- 72.14.194.198              0.0%    60   20.4  43.2  18.7 174.3  37.3
 12.|-- 142.250.62.83              0.0%    60   19.8  40.6  18.2 149.8  32.8
 13.|-- 142.251.239.191            0.0%    60   18.0  48.6  16.9 272.1  45.6
 14.|-- dns.google                 0.0%    60   18.6  38.3  16.4 200.5  37.1

> Now it seems your mikrotik is being smart. It seems to intercept parcels that, at that point in time, have plenty of coins left, but sends them back anyway with a "no more coins" message. I don't think mtr can do anything about that.

Just to clarify, the IP that is repeated in the hops is not that of the border router (Mikrotik) but that of the device wired to it that distributes the Internet wirelessly (CPE210).

I'm not sure if this adds any more information, but in case it does, I'm attaching a screenshot of the frequency spectrum reported by OpenWRT on the router located in my cabin. CabaniasDelRincon is the wireless network generated by the CPE210 to which my router in the cabin connects.

image

Thanks again for your time.

Kind regards,
Daniel

@rewolff
Collaborator

rewolff commented Sep 26, 2022

Re: Tcpdump: Yes!

Originally, the counter (the coins) was to be decremented for each SECOND that a packet spends in the network. When networks became so fast that everybody forwarded a packet within a second, nobody was decrementing the counter anymore. So then they decided to add "or when a hop is taken" (in the analogy: a train journey). (They may have had the foresight to do that in one step.)

So in theory it is possible that your "misbehaving" router is also counting the seconds and that in some cases it had that packet under its control for several seconds. However, I've never heard of a router that does this. And you'd see "several seconds" in the results beyond that router when that's happening more than once. With the worst packet getting back in 300 ms, that seems unlikely.

@jonesrick

jonesrick commented Sep 26, 2022 via email

@rewolff
Collaborator

rewolff commented Sep 26, 2022

Yeah. Not sure if and how MTR processes "unexpected responses from intermediate hosts". So the question is:

But generally, when mtr reports "funky stuff", that's actually what's happening. So it seems most likely to me that this router is actually "misbehaving".

@dbareiro

Thanks for your reply, @rewolff .

After running mtr with the report option without observing "anomalies", I ran it again for three more hours with the same target host as in the previous test (to maintain the same conditions) although without doing a network dump. The results were interesting.

root@orion:~# mtr -r -c 10800 8.8.8.8
Start: 2022-09-26T10:25:24-0300
HOST: orion                       Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 192.168.2.1                0.2% 10800    1.0  15.0   0.7 2997. 105.0
        192.168.2.100                    
  2.|-- 192.168.0.254              0.2% 10800    7.7  16.3   2.2 2925. 101.6
        192.168.2.100                    
  3.|-- 192.168.88.1               0.3% 10800    4.3  19.3   2.3 3053. 107.8
        192.168.2.100                    
  4.|-- 20.0.0.1                   0.3% 10800    5.2  27.6   3.4 2954. 106.2
        192.168.2.100                    
  5.|-- 200.45.79.1                0.2% 10800   11.1  28.8   5.2 2882. 103.1
        192.168.2.100                    
  6.|-- host229.190-224-7.telecom  0.3% 10800   11.8  28.1   4.5 3064. 107.5
        192.168.2.100                    
  7.|-- host142.181-96-60.telecom  0.3% 10800   11.8  29.0   6.3 2992. 106.4
        192.168.2.100                    
        192.168.0.254                    
  8.|-- host210.181-89-2.telecom.  0.2% 10800   20.0  42.4   2.8 2921. 104.1
        192.168.2.100                    
        192.168.0.254                    
  9.|-- 192.168.2.100             99.7% 10800  348.4 1541. 116.7 3030. 885.6

Observations:

  • This time the route was shown only up to hop 9 rather than hop 14, with the last four hops on the Telecom network. Hop 9, which should be a Telecom router and was shown as "???" in the previous test, appears here with the IP 192.168.2.100, which is the IP that the router in my cabin (192.168.2.1) assigned to the notebook used for these tests.
  • The IP of my notebook appears in several hops. I'm not quite sure what situation could cause this behavior. During the time that I was in front of the notebook, I did not notice it losing its connection with the OpenWRT router in the cabin.
  • The IP of the CPE210 appears in several hops.

I would like to make a network dump to share with you the log obtained with tcpdump. What syntax do you suggest to run tcpdump?

Thanks for your time.

Kind regards,
Daniel

@jonesrick

jonesrick commented Sep 26, 2022 via email

@dbareiro

Thanks, Rick. I'll do that network dump while running mtr against 8.8.8.8 in the next 3 hours and then share the results.

Thanks again for your time.

@dbareiro

Hi, Roger & Rick.

I'm sharing the result of mtr after running it for three hours. The result was similar to the previous run, with my laptop's and CPE's IP appearing over multiple hops.

root@orion:~# mtr -r -c 10800 8.8.8.8
Start: 2022-09-26T18:33:52-0300
HOST: orion                       Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 192.168.2.1                1.5% 10800   67.9  77.8   0.7 4542. 269.1
        192.168.2.100                    
  2.|-- 192.168.0.254              2.1% 10800  323.0 129.2   2.2 4966. 345.0
        192.168.2.100                    
  3.|-- 192.168.88.1               1.8% 10800  285.6 134.6   2.3 4896. 351.6
        192.168.2.100                    
  4.|-- 20.0.0.1                   3.1% 10800  222.1 156.0   3.4 5136. 350.5
        192.168.2.100                    
  5.|-- 200.45.79.1                3.5% 10800  164.4 160.5   2.4 5294. 356.4
        192.168.2.100                    
        192.168.0.254                    
  6.|-- host229.190-224-7.telecom  3.2% 10800   90.6 158.2   3.0 5168. 351.5
        192.168.2.100                    
        192.168.0.254                    
  7.|-- host142.181-96-60.telecom  3.4% 10800   47.0 160.4   6.3 5151. 350.8
        192.168.2.100                    
  8.|-- host210.181-89-2.telecom.  2.4% 10800  263.0 189.1  17.2 5051. 382.1
        192.168.2.100                    
  9.|-- 192.168.2.100             99.5% 10800  1128. 1685.  56.8 3063. 836.6

Again it shows up to hop 9 instead of up to hop 14. The hop count seems to change only at some point during a long run, because when I run mtr again for a short time, say a minute, I don't see this behaviour:

root@orion:~# mtr -r -c 60 8.8.8.8
Start: 2022-09-26T22:45:57-0300
HOST: orion                       Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 192.168.2.1                0.0%    60    1.3   9.5   0.9  82.9  20.5
  2.|-- 192.168.0.254              0.0%    60    2.6  14.6   2.6 157.4  27.7
  3.|-- 192.168.88.1               0.0%    60    2.9  11.9   2.5  89.7  20.9
  4.|-- 20.0.0.1                   0.0%    60   13.4  56.4   5.6 401.4  71.6
  5.|-- 200.45.79.1                5.0%    60   40.6  52.2   5.9 309.0  62.9
  6.|-- host229.190-224-7.telecom  5.0%    60   18.8  51.4   6.2 354.9  60.6
  7.|-- host142.181-96-60.telecom  3.3%    60   14.9  50.1   8.9 284.6  58.0
  8.|-- host210.181-89-2.telecom.  1.7%    60   21.7  70.3  17.4 312.8  71.9
  9.|-- ???                       100.0    60    0.0   0.0   0.0   0.0   0.0
 10.|-- host234.181-96-113.teleco 55.0%    60  162.6 211.3  18.9 945.7 218.1
 11.|-- 72.14.194.198              0.0%    60   21.3  74.4  20.7 247.2  63.4
 12.|-- 142.250.62.83              1.7%    60   21.3  66.6  17.8 299.1  55.8
 13.|-- 142.251.239.191            0.0%    60   19.1  66.2  16.5 278.9  63.3
 14.|-- dns.google

Here's the link to download the file with the network dump with tcpdump during those three hours running mtr:
https://we.tl/t-IdHvF9SvOD

Please let me know if there is any other information or test that might be helpful.

Thanks for your time. Much appreciated.

Kind regards,
Daniel

@rewolff
Collaborator

rewolff commented Sep 27, 2022

I'll take a look shortly.
At first glance, I'm afraid that the rick-types-faster tcpdump command line catches all the outgoing packets and NOT the responses from the routers that we want to see. The easiest addition would be to say "or icmp" to catch the return packets. Not sure if simply adding "or icmp" to the tcpdump command line is acceptable to tcpdump. I always need to grab the manual alongside. But maybe they are already in the dump because the ICMP returns include enough of the "destination host" for tcpdump to include them.

@rewolff
Collaborator

rewolff commented Sep 27, 2022

I analysed the packets and we see MTR sending out the correct queries, but only the expected replies from 8.8.8.8.
Because you use 8.8.8.8 as your DNS server and you captured everything to and from 8.8.8.8 I now know that you have a twitter app running and you browsed youtube.... You may want to take the capture down....

I browsed a bit more and I see your misbehaving router respond as if it is the destination host at 272.271 seconds into the capture as well as 4112.811, 4552.481, 8813.022, 8813.049 and 11357.187. The last one is surprising as it is 10 minutes after you stopped the mtr.....

My verdict is: You have a weird (misbehaving) router on your hands and mtr is reporting that correctly.

@rewolff
Collaborator

rewolff commented Sep 27, 2022

[grandpa tells a story] (About misbehaving hardware).

Back in the early nineties the whole EE department at TUD was on a single 10 Mbps ethernet. Occasionally important connections would drop. It took a while to figure out: there was a PC card that was configured to pass on only packets for its own MAC address, plus of course broadcast packets. Turns out it would occasionally pass on a packet that was not broadcast and not for its own MAC address, i.e. one for an entirely different computer. The IP stack back then didn't double check the destination IP address, so: "you say an ongoing packet for a connection to port 1234??? I know nothing about that. This port has been closed since the beginning of time!" And one random connection in the building would drop.

Once we found it, we opened up the PC, took out the network card and confiscated it. The sysop later begged to have it back. We made him promise never to use it on a public network, and it seems he kept his promise as we didn't have these problems after that.

@dbareiro

Hi, Roger.

Thanks for taking the time to check the network dump. I appreciate it.

> I browsed a bit more and I see your misbehaving router respond as if it is the destination host at 272.271 seconds into the capture as well as 4112.811, 4552.481, 8813.022, 8813.049 and 11357.187.

I'll take a look at the timestamps you mention. Thanks.

In the mtr report we saw two phenomena: the repetition of the IP of my notebook and the repetition of the IP of the CPE. I tend to think that the cause of each of these phenomena is different. What do you think based on what you could see in the capture?

> The last one is surprising as it is 10 minutes after you stopped the mtr.....

Hmmm... I may have stopped running tcpdump a few minutes after mtr delivered the summary report because I was cooking my dinner at that time so the traffic captured by tcpdump after that may have been something else but there shouldn't be any traffic generated by mtr itself.

> My verdict is: You have a weird (misbehaving) router on your hands and mtr is reporting that correctly.

It's strange. This TPLink CPE210 is a brand new device installed by the owner of the cabin complex some days ago. Before that we had a TPLink WA5210G that had several years of use but in which I was also observing a similar behavior in mtr, although I never got to analyze it in detail...

This CPE210 seems to have SSH access and has linux running from what I could see via SNMP. Maybe SSH access can allow us to see something else?

Kind regards,
Daniel

@rewolff
Collaborator

rewolff commented Sep 27, 2022

> I may have stopped running tcpdump a few minutes after mtr delivered the summary report because I was cooking my dinner at that time so the traffic captured by tcpdump after that may have been something else but there shouldn't be any traffic generated by mtr itself.

Yes, for sure that's what happened, but this means that either it is reacting to packets sent 10 minutes ago, or it is interfering with your normal network traffic.

As with my "story", it seems likely that the router that is accidentally messing with your network connections is still signing with its own IP address. If someone is maliciously messing with you then all bets are off as to where they are coming from.

If you can read the tcpdump manual and figure out how to specify "or ICMP" or just dump "all ICMP" that should work. The MTR outgoing packets are also ICMP, so just capturing ICMP will not capture your DNS traffic, but both the outgoing mtr traffic and the responses.
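
For what it's worth, the two filter variants can be written down as a small Python helper that assembles the tcpdump argument list. The helper name `tcpdump_args` and its defaults are made up for illustration; `-n`, `-i`, `-s` and `-w` are standard tcpdump options, and `icmp` and `host ... or icmp` are ordinary pcap filter expressions. Note that `icmp` matches IPv4 ICMP only; for an IPv6 trace the corresponding primitive is `icmp6`.

```python
def tcpdump_args(interface, outfile, host=None, snaplen=192):
    """Build a tcpdump command line that keeps ICMP replies in the capture.

    With host=None only ICMP is captured (mtr probes and router replies);
    with a host, traffic to and from that host is captured as well.
    """
    capture_filter = "icmp" if host is None else f"host {host} or icmp"
    return [
        "tcpdump",
        "-n",                 # do not resolve addresses to names
        "-i", interface,      # capture interface
        "-s", str(snaplen),   # bytes kept per packet
        "-w", outfile,        # write raw packets to a pcap file
        capture_filter,
    ]
```

The returned list can be handed to `subprocess.run` (as root), or joined with spaces to paste into a shell with the filter expression quoted.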

But I'm betting we'll see "it is very likely one of your routers is misbehaving".

I don't expect such "bad" behaviour from a Linux box. But people can do bad things to misconfigure them. e.g. my ISP that hosted my domain used to reconfigure their mailserver at 10 past the whole hour every hour. They would clear the whole configuration, and quickly reload it with the current state of what domains they hosted.

If an Email came in during that period they would refuse it with "no such account: This Email address doesn't exist". At first I considered such reports "user error, they probably mistyped my Email", but when it happened more often I started investigating. Sending an Email every minute at 0 seconds past the whole minute: ALL emails would go through. But sending an Email with a random delay between 0 and 59 seconds from the start of the minute would occasionally fail. They didn't think that this was important, so today I haven't been a client with them for more than 22 years.

It would've been easy to just configure the mailserver to respond "temporary error, try again later", before the flush and set everything back to normal afterwards, but they didn't do that.

@rewolff
Collaborator

rewolff commented Sep 27, 2022

Why are you running mtr, by the way?.... I'm guessing you're seeing "something funny" (actually NOT fun), and are trying to diagnose the problem with mtr. Well... All hints are that there is something fishy going on.

One thing that MTR could do better is that when "host at hop 8" says that it is the destination, then mtr immediately cancels sending packets further than that, even when it's been getting responses from different hosts at hop 8-14.... (for over an hour!).

@jonesrick

jonesrick commented Sep 27, 2022 via email

@dbareiro

Hi, Roger & Rick.

I've been checking the capture paying attention to the timestamps you mentioned:

> I browsed a bit more and I see your misbehaving router respond as if it is the destination host at 272.271 seconds into the capture as well as 4112.811, 4552.481, 8813.022, 8813.049 and 11357.187.

Taking note of what I see in the "info" field of these packets, we have:

  272.271  Destination unreachable (Port unreachable) [Packet size limited during capture]
 4112.811  Destination unreachable (Port unreachable) [Packet size limited during capture]
 4552.481  Destination unreachable (Port unreachable) [Packet size limited during capture]
 8813.022  Destination unreachable (Port unreachable) 
 8813.049  Destination unreachable (Port unreachable)
11357.187  Destination unreachable (Port unreachable)

I'm not sure if this should be considered as two different types of errors: the one with "packet size limited" and the one without.

In any case I was hoping to find a correlation with what we saw in the mtr report. In that report we saw:

  • 9 hops with the IP 192.168.2.100 (the IP of my notebook).
  • 2 hops with the IP 192.168.0.254 (the IP of the CPE).

That is, 11 packets presenting transmission errors, and these errors could be classified into two types. Is this reasoning logical? But here we are seeing 6 packets showing transmission errors with what appear to be 2 different types of errors.

But going back to the quoted text, in one part you say "I see your misbehaving router respond as if it is the destination". Where in the packet information do you see this?

> If you can read the tcpdump manual and figure out how to specify "or ICMP" or just dump "all ICMP" that should work. The MTR outgoing packets are also ICMP, so just capturing ICMP will not capture your DNS traffic, but both the outgoing mtr traffic and the responses.

Yes, or alternatively I could use another destination host such as fast.com. Something like this:

tcpdump -w mtr-dump.pcap -s 192 -i wlan0 "host fast.com"

> But I'm betting we'll see "it is very likely one of your routers is misbehaving".
>
> I don't expect such "bad" behaviour from a Linux box. But people can do bad things to misconfigure them.

Do you think the misbehaving router is the CPE or the Mikrotik?

As I mentioned in a previous message, it is strange to see similar behavior with both the brand new CPE210 and the old CPE (WA5210G). If the culprit is the CPE, I refuse to believe it's a coincidence and think of it more as an unhappy "feature" of these TPLink CPEs. But by the way, do you know someone who has one of these TPLink CPEs to test it?

Yesterday I found a way to access the CPE via SSH and although I can see some configuration files (like iptables) I don't have root access to modify files.

> my ISP that hosted my domain used to reconfigure their mailserver at 10 past the whole hour every hour. They would clear the whole configuration, and quickly reload it with the current state of what domains they hosted.
>
> If an Email came in during that period they would refuse it with "no such account: This Email address doesn't exist". At first I considered such reports "user error, they probably mistyped my Email", but when it happened more often I started investigating. Sending an Email every minute at 0 seconds past the whole minute: ALL emails would go through. But sending an Email with a random delay between 0 and 59 seconds from the start of the minute would occasionally fail. They didn't think that this was important, so today I haven't been a client with them for more than 22 years.

Sometimes I don't understand why human beings complicate things for themselves in a seemingly unnecessary way. But someone always does things for some reason, even if it seems meaningless to us. Who knows what happened in that scenario. Perhaps there was a bug in some software they used, and to save themselves from updating it they decided to do that trick...

> It would've been easy to just configure the mailserver to respond "temporary error, try again later", before the flush and set everything back to normal afterwards, but they didn't do that.

Unfortunately not all sysadmins manage the levels of awareness that one would expect. It's a shame because that's how they lose customers and not only that, but they get a bad reputation from word of mouth.

> Why are you running mtr, by the way?.... I'm guessing you're seeing "something funny" (actually NOT fun), and are trying to diagnose the problem with mtr. Well... All hints are that there is something fishy going on.

Well, I started using it here some time ago when I ran into latency and packet loss issues with the wireless provider we have here. At that time there were problems in the Telecom network.

Some time later I began to observe these details that caught my attention, such as the repetition of the IP of my notebook or that of the CPE.

Speaking of latency problems, there is something that is not clear to me: sometimes I see packet loss on the router in my cabin and it is something that catches my attention because I have the router in my other room. There should be no packet loss, unless it is some packet loss from routers in the next two hops (CPE->Mikrotik) which are also reflected as losses in the first hop router, although I don't know if such a thing is feasible.

> One thing that MTR could do better is that when "host at hop 8" says that it is the destination, then mtr immediately cancels sending packets further than that, even when it's been getting responses from different hosts at hop 8-14.... (for over an hour!).

Another thing that I don't quite understand: under normal conditions the destination host is reached in 14 hops for that test. If at some point during the test the destination host becomes unreachable and mtr shows 9 hops for a while, and the host then becomes reachable again as the test continues, why does mtr not reflect that and keep showing 9 hops instead of 14?

Thanks again for your time.

Kind regards,
Daniel

@dbareiro

Hi, Rick.

> Re the tcpdump command line - ncurses! Foiled again. Apologies for the
> omission of the "or icmp" in the filter expression. I must have
> subconsciously thought the matcher would have looked into the
> encapsulated packet snippet in the ICMPs coming back.

Doing a new test, I would have thought that this syntax would be enough for the purposes:

tcpdump -w mtr-fast.com.pcap -i wlan0 "host fast.com"

Having found the repetitions of these IPs in the approximate time frames on the specified hops, I stopped capturing:

Sent 627 -> 192.168.0.254 (hops: 10)
Sent 1265 -> 192.168.2.100 (hops: 1,2,3,4,5,6,7,8)

But I didn't find anything "weird". Am I missing something in the syntax?

Kind regards,
Daniel

@jonesrick

jonesrick commented Sep 28, 2022 via email

@dbareiro

Hi again!

I've retested using the following syntax (thanks Rick for reminding me of Roger's suggestion):

tcpdump -w mtr-fast.com.pcap -i wlan0 "host fast.com or icmp"

With this syntax, from what I've seen, the capture is more interesting because it also includes the packets from the hosts in the intermediate hops, I think.
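
The difference between the two filter expressions can be pictured with a small model. This is only a sketch of the matching logic, not how tcpdump is implemented: the `Packet` class and both function names are made up for illustration, and 23.196.29.135 is the address fast.com resolved to in this test. A `host` filter keeps packets whose source or destination is that host, so the "time exceeded" replies, which come from the intermediate routers, only get through once `or icmp` is added.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    proto: str  # e.g. "icmp", "udp", "tcp"


# Assumption: the address fast.com resolved to in this thread.
TARGET = "23.196.29.135"


def matches_host(pkt: Packet, host: str = TARGET) -> bool:
    """Model of `host <addr>`: true if addr is the source or the destination."""
    return host in (pkt.src, pkt.dst)


def matches_host_or_icmp(pkt: Packet, host: str = TARGET) -> bool:
    """Model of `host <addr> or icmp`."""
    return matches_host(pkt, host) or pkt.proto == "icmp"


# An outgoing probe and the reply an intermediate router sends back for it.
probe = Packet("192.168.2.100", TARGET, "icmp")
ttl_exceeded = Packet("192.168.88.1", "192.168.2.100", "icmp")

for name, pkt in (("probe", probe), ("ttl_exceeded", ttl_exceeded)):
    print(name, matches_host(pkt), matches_host_or_icmp(pkt))
```

The router's reply never mentions fast.com in its own source or destination address, which is why the first capture looked so empty.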

Under normal conditions, the test should give something like this:

root@orion:~# mtr -r -c 60 fast.com
Start: 2022-09-28T22:34:12-0300
HOST: orion                       Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 192.168.2.1                6.7%    60  177.1  72.2   0.9 702.8 158.0
  2.|-- 192.168.0.254              0.0%    60  120.6 148.4   2.2 1105. 232.5
  3.|-- 192.168.88.1               0.0%    60   39.6 161.2   2.7 1715. 284.7
  4.|-- 20.0.0.1                   1.7%    60   56.0 196.2   4.7 1680. 273.8
  5.|-- 200.45.79.1                0.0%    60    9.8 190.8   6.1 1805. 281.5
  6.|-- host229.190-224-7.telecom  0.0%    60   69.6 190.9   6.2 1731. 300.1
  7.|-- host142.181-96-60.telecom  1.7%    60   16.2 173.5   7.6 1718. 290.4
  8.|-- host208.181-89-2.telecom.  5.0%    60   59.9 150.7  20.6 1105. 208.7
  9.|-- ???                       100.0    60    0.0   0.0   0.0   0.0   0.0
 10.|-- host168.181-96-103.teleco 70.0%    60  609.8 399.5  21.7 2158. 489.2
 11.|-- 2-154-30-181.fibertel.com  3.3%    60   61.6 250.6  33.4 1386. 257.2
 12.|-- a23-196-29-135.deploy.sta  1.7%    60   57.2 194.0  20.8 1301. 259.2

But it gave something like this:

image

Some notes:

70 => 192.168.2.100 (1,2,3,4,5,6,7,8 and "no route to host" in 9)

That is, in "Snt" 70 approximately, the IP 192.168.2.100 (my notebook) appears in hops 1,2,3,4,5,6,7 and 8, and "no route to host" in hop 9.

Even though mtr still shows "no route to host" at hop 9, if I run the "ping" command against fast.com at the same time that mtr continues its execution showing that message at hop 9, it responds to the ping with no problem:

$ ping fast.com
PING fast.com (23.196.29.135) 56(84) bytes of data.
64 bytes from a23-196-29-135.deploy.static.akamaitechnologies.com (23.196.29.135): icmp_seq=1 ttl=53 time=25.4 ms
64 bytes from a23-196-29-135.deploy.static.akamaitechnologies.com (23.196.29.135): icmp_seq=2 ttl=53 time=19.6 ms
64 bytes from a23-196-29-135.deploy.static.akamaitechnologies.com (23.196.29.135): icmp_seq=3 ttl=53 time=22.9 ms
64 bytes from a23-196-29-135.deploy.static.akamaitechnologies.com (23.196.29.135): icmp_seq=4 ttl=53 time=22.5 ms
64 bytes from a23-196-29-135.deploy.static.akamaitechnologies.com (23.196.29.135): icmp_seq=5 ttl=53 time=19.7 ms
64 bytes from a23-196-29-135.deploy.static.akamaitechnologies.com (23.196.29.135): icmp_seq=6 ttl=53 time=27.1 ms
64 bytes from a23-196-29-135.deploy.static.akamaitechnologies.com (23.196.29.135): icmp_seq=7 ttl=53 time=23.7 ms
64 bytes from a23-196-29-135.deploy.static.akamaitechnologies.com (23.196.29.135): icmp_seq=8 ttl=53 time=26.8 ms
64 bytes from a23-196-29-135.deploy.static.akamaitechnologies.com (23.196.29.135): icmp_seq=9 ttl=53 time=47.6 ms 
64 bytes from a23-196-29-135.deploy.static.akamaitechnologies.com (23.196.29.135): icmp_seq=10 ttl=53 time=43.8 ms
64 bytes from a23-196-29-135.deploy.static.akamaitechnologies.com (23.196.29.135): icmp_seq=11 ttl=53 time=22.4 ms
64 bytes from a23-196-29-135.deploy.static.akamaitechnologies.com (23.196.29.135): icmp_seq=12 ttl=53 time=90.6 ms
64 bytes from a23-196-29-135.deploy.static.akamaitechnologies.com (23.196.29.135): icmp_seq=13 ttl=53 time=77.0 ms
64 bytes from a23-196-29-135.deploy.static.akamaitechnologies.com (23.196.29.135): icmp_seq=14 ttl=53 time=21.0 ms
64 bytes from a23-196-29-135.deploy.static.akamaitechnologies.com (23.196.29.135): icmp_seq=15 ttl=53 time=26.8 ms
64 bytes from a23-196-29-135.deploy.static.akamaitechnologies.com (23.196.29.135): icmp_seq=16 ttl=53 time=28.4 ms
64 bytes from a23-196-29-135.deploy.static.akamaitechnologies.com (23.196.29.135): icmp_seq=17 ttl=53 time=24.1 ms
64 bytes from a23-196-29-135.deploy.static.akamaitechnologies.com (23.196.29.135): icmp_seq=18 ttl=53 time=27.5 ms
64 bytes from a23-196-29-135.deploy.static.akamaitechnologies.com (23.196.29.135): icmp_seq=19 ttl=53 time=90.6 ms
64 bytes from a23-196-29-135.deploy.static.akamaitechnologies.com (23.196.29.135): icmp_seq=20 ttl=53 time=35.3 ms
64 bytes from a23-196-29-135.deploy.static.akamaitechnologies.com (23.196.29.135): icmp_seq=21 ttl=53 time=24.3 ms
64 bytes from a23-196-29-135.deploy.static.akamaitechnologies.com (23.196.29.135): icmp_seq=22 ttl=53 time=29.3 ms
64 bytes from a23-196-29-135.deploy.static.akamaitechnologies.com (23.196.29.135): icmp_seq=23 ttl=53 time=21.4 ms

I don't understand why this happens.

And another thing I can't understand is how I can possibly be getting 1.9% packet loss and about 104.7 ms average latency on a Wi-Fi router that's inside my cabin, not much more than 7 meters (about 23 feet) away. Even in the "good test", it showed a packet loss of 6.7% and an average latency of 72.2 ms. I would have expected 0% packet loss and at most 20 ms in both cases. In a previous message I wondered whether packet loss on the next two hops (CPE->Mikrotik) could also show up as loss on the first-hop router, although I don't know if such a thing is feasible...
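
To put those percentages in perspective, they can be turned back into probe counts. A rough sketch, assuming Loss% is simply the share of sent probes that got no reply, rounded to one decimal; `lost_probes` is a made-up helper, not part of mtr.

```python
def lost_probes(loss_percent: float, sent: int) -> int:
    """Estimate how many probes went unanswered, given Loss% and Snt."""
    return round(loss_percent / 100 * sent)


# Values from the "good test" report above (Snt = 60 on every hop).
for hop, loss in (("192.168.2.1", 6.7), ("20.0.0.1", 1.7), ("hop 10", 70.0)):
    print(hop, lost_probes(loss, 60))
```

With 60 probes, 6.7% corresponds to 4 unanswered probes and 1.7% to a single one.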

Going back to what happens at "Snt" 70, both before and after that timestamp throughout the entire packet capture, I see a lot of packets with "time to live exceeded in transit", but nothing in particular exactly at "Snt" 70 or in an interval close to this timestamp. In any case, I share the network dump in case you see something that I am missing:
https://we.tl/t-NMSrZNoodr

Although in the scenario that occurred in this test the IP of the CPE has not been repeated, the IP of my notebook has been repeated and that can serve to isolate why we are observing this peculiar behavior of mtr.

Thanks again for your time.

Kind regards,
Daniel

@rewolff
Collaborator

rewolff commented Sep 29, 2022

mtr sends out a parcel with enough coins for 8 hops. Then someone in the middle opens up the parcel, looks at the room number and says: "no such room here!".

That's only for the destination to do. So when mtr gets back a delivery note saying: "i'm stationmaster 192.168.2.100 and there is no such room here" mtr will assume that he knows what he's talking about.

When someone sends back a message indicating that they are the destination host, mtr will immediately assume that this is true (it always is, except for your misbehaving hardware) and consider that the destination host is at the position that elicited this response. This could be improved: with a lousy connection, mtr might have sent out packets for hosts 10-11-12-13 and eventually get responses for more than one, before finding that host 10 is in fact the destination. So responses to probes at distance 11, 12, 13 also arrived at the destination. mtr should fold those responses into the "host at position 10" instead of completely forgetting about them.
Another possible improvement might be that we MIGHT detect this current situation and display it in a different way. The problem is that this will require us to detect lots of different special cases and find a way to display them all.

The current situation is that when something misbehaves, mtr might display it in a peculiar way, but it is easy to analyse. If mtr tried to interpret this special case, I'm sure we'd run into another special case where mtr's interpretation is wrong, and trying to do better would make things worse.
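
The folding idea described above could look roughly like this. It is only a sketch of the proposal, not mtr's actual code: `fold_replies` and the `(probe_ttl, responder, says_destination)` tuple layout are invented for illustration.

```python
from collections import Counter


def fold_replies(replies):
    """Count replies per hop, folding destination replies into one hop.

    replies: iterable of (probe_ttl, responder, says_destination) tuples.
    Returns (dest_hop, counts); dest_hop is None if nobody claimed to be
    the destination.
    """
    replies = list(replies)
    dest_ttls = [ttl for ttl, _, is_dest in replies if is_dest]
    dest_hop = min(dest_ttls) if dest_ttls else None

    counts = Counter()
    for ttl, _, is_dest in replies:
        if dest_hop is None or ttl < dest_hop:
            counts[ttl] += 1        # ordinary intermediate hop
        elif is_dest:
            counts[dest_hop] += 1   # probes at TTL 10, 11, 12... all reached the destination
        # other replies at or beyond dest_hop are dropped in this sketch
    return dest_hop, counts


replies = [
    (9, "router-9", False),
    (10, "dest", True),
    (11, "dest", True),
    (12, "dest", True),
    (13, "dest", True),
]
dest_hop, counts = fold_replies(replies)
print(dest_hop, dict(counts))
```

The same logic also shows why a bogus reply hurts: if a device near the source wrongly claims to be the destination for a low-TTL probe, `dest_hop` collapses to that TTL and everything behind it disappears from the display.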

Can you make a tcpdump of the icmp packets ?

@rewolff
Collaborator

rewolff commented Sep 29, 2022

> Another thing that I don't quite understand: under normal conditions the destination host is reached in 14 hops in that test. If at some point the destination becomes unreachable and mtr shows 9 hops for a while, why does mtr keep showing 9 hops instead of 14 once the host is reachable again?

See above (the part about 10-11-12-13). Traceroute does the same as mtr, but it only sends out the next packet once the previous one has been answered or has timed out. That's painfully slow. So mtr sends out a whole stream of packets probing for the host at 1-2-3-.... With well-behaved network equipment, host 10 (say) will then respond with "I'm the destination and... "

Because you refuse to capture ICMP packets, you can't see the normal responses from the intermediate hosts.
Your suggestion to use another host instead of 8.8.8.8 will solve: "the capture includes the hosts you browsed" but not that we can't fully see what information mtr is working on.

All the packet timestamps that I quoted are "WRONG" packets where your router responded with "I'm the destination and there was a problem with your packet". Instead, when mtr sends out a probe to something further than hop 2, the router should NOT respond with an error packet; it should just forward the packet.

@dbareiro

Hi, Roger.

Let me see if I understood. In the packets you mentioned the source is 192.168.2.100 with destination 8.8.8.8 saying "Destination unreachable" but this doesn't make any logical sense. The correct packet should be from 192.168.0.254 to 192.168.2.100 saying it itself couldn't reach 8.8.8.8 when actually 8.8.8.8 becomes unreachable. (?)

There is an interesting difference that I noticed between the packet capture done here and the last one. In the first case, the IP of my notebook appears in the last hop. The most recent network dump shows "no route to host" on the last hop. I imagine they should be taken as two different scenarios.

On the other hand, it seems that the CPE's LAN IP (192.168.0.254) is on a "BRIDGE" interface. Do you think that could have something to do with the problem?

image

image

> Can you make a tcpdump of the icmp packets ?

Yes, of course. I did it in the last message I sent last night (early today) about 13 hours ago. You can see it here

Thanks for your time.

Kind regards,
Daniel

@rewolff
Collaborator

rewolff commented Sep 30, 2022

OK. Sorry. Missed that. Your mtr output now does not show any other hosts "misbehaving" except host "192.168.2.100", which is the source.

The TCPDUMP did not catch any packets that would suggest that this was based on a packet received over the line.

The only source for IP addresses is the packets that mtr receives. If it finds 83.72.73.84 there that's what it will print. So now it seems that the OS on your laptop is passing mtr packets that were mangled or passing mtr packets that it shouldn't get maybe?

Anyway, the failure now seems different from what you had with 8.8.8.8.

I used:
tcpdump -w test.pcap icmp
to dump the packets in a file. Then you can scroll through them with wireshark (file->open). Then run mtr again against 8.8.8.8 and I'd like to see if your CPE thingy starts misbehaving again. Maybe the CPE thing confuses the hell out of your system with something that wasn't captured (or maybe I couldn't filter well enough for it to stand out).

Edit: OK. I was wrong. Filter your last capture with wireshark:
icmp.type != 11 and icmp.type != 8 and icmp.type != 0
Now you see a whole bunch of packets FROM your laptop TO your CPE that say "port unreachable". This is odd. Because I trust your laptop more than the CPE I think that might be due to something the CPE is sending to your laptop causing it to be confused.
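
For reference, the display filter above hides the three ICMP types that dominate a normal mtr run and leaves everything else, including "destination unreachable" (type 3) with its "port unreachable" code (3). A small sketch of the same selection; the type and code numbers are the standard ICMPv4 ones, while `survives_filter` and `describe` are made-up helper names.

```python
ICMP_TYPES = {
    0: "echo reply",
    3: "destination unreachable",
    8: "echo request",
    11: "time exceeded",
}
UNREACHABLE_CODES = {
    0: "network unreachable",
    1: "host unreachable",
    2: "protocol unreachable",
    3: "port unreachable",
}


def survives_filter(icmp_type: int) -> bool:
    """Mirror of: icmp.type != 11 and icmp.type != 8 and icmp.type != 0"""
    return icmp_type not in (11, 8, 0)


def describe(icmp_type: int, code: int) -> str:
    """Human-readable name for an ICMPv4 (type, code) pair."""
    name = ICMP_TYPES.get(icmp_type, f"type {icmp_type}")
    if icmp_type == 3:
        return f"{name}: {UNREACHABLE_CODES.get(code, f'code {code}')}"
    return name


# (type, code) pairs as they might appear in a capture.
packets = [(8, 0), (11, 0), (0, 0), (3, 3)]
for icmp_type, code in packets:
    if survives_filter(icmp_type):
        print(describe(icmp_type, code))
```

Only the last packet survives, described as "destination unreachable: port unreachable".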

@dbareiro

dbareiro commented Oct 1, 2022

Hi, Roger.

> Edit: OK. I was wrong. Filter your last capture with wireshark:
> icmp.type != 11 and icmp.type != 8 and icmp.type != 0
> Now you see a whole bunch of packets FROM your laptop TO your CPE that say "port unreachable". This is odd. Because I trust your laptop more than the CPE I think that might be due to something the CPE is sending to your laptop causing it to be confused.

Were you referring to the latest capture file (mtr-fast.com.pcap)? Because when I filter that capture using the syntax you mentioned I only get packets with "Time-to-live exceeded (time to live exceeded in transit)". I don't see any packet saying "port unreachable". In fact, if I'm not mistaken, ALL the packets left after filtering (the ones with this "time to live exceeded" message) are FROM intermediate routers TO my notebook. My notebook (192.168.2.100) does not appear as "source" in any of those filtered packets.

image

Where I do see such packets when filtering with the syntax you mention is in the "mtr-dump.pcap" capture, that is, the previous file. There I see 5 packets from my notebook (192.168.2.100) to 8.8.8.8 saying "Port unreachable". But I didn't see such packets in the last capture.

image

It was precisely this difference that caught my attention:

  • mtr-dump.pcap: packets with "port unreachable" / no packets with "time to live exceeded"
  • mtr-fast.com.pcap (the last one): packets with "time to live exceeded" / no packets with "port unreachable"

Thanks for your time. Much appreciated.

Kind regards,
Daniel
