Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

android connection broken for too long time #278

Open
iwanbk opened this issue May 31, 2024 · 12 comments
Open

android connection broken for too long time #278

iwanbk opened this issue May 31, 2024 · 12 comments

Comments

@iwanbk
Copy link
Member

iwanbk commented May 31, 2024

I found that Android connection can sometime broken for longer time than expected.
From the code in peers.rs, looks like any broken connection will recover no more than 5 seconds.
But sometimes i found that ping was broken for more than 5 minutes.
What i did:

  • spawn my own public peer (to reduce possibility of overloaded public peer)
  • start myceliumflut on Android emulator on laptop
  • sleep the android by pressing the power button
  • use the laptop for work

ping side, with timestamp

May 30 16:50:00 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=450 ttl=61 time=29.2 ms
May 30 16:50:01 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=451 ttl=61 time=32.5 ms
May 30 16:50:02 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=452 ttl=61 time=28.7 ms
May 30 16:50:03 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=453 ttl=61 time=27.4 ms
May 30 16:51:41 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=548 Destination unreachable: No route
May 30 16:51:42 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=549 Destination unreachable: No route
May 30 16:51:43 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=550 Destination unreachable: No route
May 30 16:51:44 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=551 Destination unreachable: No route
May 30 16:51:45 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=552 Destination unreachable: No route
May 30 16:51:46 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=553 Destination unreachable: No route
May 30 16:51:47 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=554 Destination unreachable: No route
.....
.....
.....
May 30 16:57:05 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=865 Destination unreachable: No route                                                                                         09:59:30 [168/1004]
May 30 16:57:06 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=866 Destination unreachable: No route
May 30 16:57:07 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=867 Destination unreachable: No route
May 30 16:57:08 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=868 Destination unreachable: No route
May 30 16:57:09 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=869 ttl=61 time=35.9 ms
May 30 16:57:10 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=870 ttl=61 time=25.9 ms
May 30 16:57:11 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=871 ttl=61 time=62.0 ms
May 30 16:57:12 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=872 ttl=61 time=24.1 ms
May 30 16:57:13 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=873 ttl=61 time=23.0 ms
May 30 16:57:14 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=874 ttl=61 time=27.0 ms
May 30 16:57:15 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=875 ttl=61 time=22.9 ms
May 30 16:57:16 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=876 ttl=61 time=26.8 ms
May 30 16:57:17 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=877 ttl=61 time=22.9 ms
May 30 16:57:18 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=878 ttl=61 time=38.3 ms
May 30 16:57:19 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=879 ttl=61 time=27.7 ms
May 30 16:57:20 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=880 ttl=61 time=33.7 ms
May 30 16:57:21 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=881 ttl=61 time=34.6 ms
May 30 16:57:22 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=882 ttl=61 time=26.9 ms
May 30 16:57:23 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=883 ttl=61 time=25.3 ms
May 30 16:57:24 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=884 ttl=61 time=25.0 ms
May 30 16:57:25 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=885 ttl=61 time=23.8 ms
May 30 16:57:26 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=886 ttl=61 time=25.0 ms
May 30 16:57:27 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=887 ttl=61 time=29.7 ms
May 30 16:57:28 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=888 ttl=61 time=24.7 ms
May 30 16:57:29 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=889 ttl=61 time=23.0 ms
May 30 16:57:30 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=890 ttl=61 time=37.9 ms
May 30 16:57:31 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=891 ttl=61 time=23.3 ms
May 30 16:59:21 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=998 Destination unreachable: No route
May 30 16:59:22 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=999 Destination unreachable: No route
May 30 16:59:23 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=1000 Destination unreachable: No route
May 30 16:59:24 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=1001 Destination unreachable: No route
May 30 16:59:25 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=1002 Destination unreachable: No route
May 30 16:59:26 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=1003 Destination unreachable: No route
May 30 16:59:27 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=1004 Destination unreachable: No route
May 30 16:59:28 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=1005 Destination unreachable: No route
May 30 16:59:29 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=1006 Destination unreachable: No route
May 30 16:59:30 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=1007 Destination unreachable: No route
.......
......
......
.......
May 30 17:03:15 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=1227 Destination unreachable: No route
May 30 17:03:16 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=1228 Destination unreachable: No route
May 30 17:03:17 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=1229 Destination unreachable: No route
May 30 17:03:18 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=1230 Destination unreachable: No route
May 30 17:03:19 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=1231 Destination unreachable: No route
May 30 17:03:20 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=1232 Destination unreachable: No route
May 30 17:03:21 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=1233 Destination unreachable: No route
May 30 17:03:22 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=1234 Destination unreachable: No route
May 30 17:03:24 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=1235 Destination unreachable: No route
May 30 17:03:25 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=1236 Destination unreachable: No route
May 30 17:03:26 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=1237 Destination unreachable: No route
May 30 17:03:27 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=1238 Destination unreachable: No route
May 30 17:03:28 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=1239 Destination unreachable: No route
May 30 17:03:29 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=1240 Destination unreachable: No route
May 30 17:03:30 From 478:18d6:a0d4:59e9:b38d:e6b4:3b7b:656f icmp_seq=1241 Destination unreachable: No route
May 30 17:03:31 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=1242 ttl=61 time=26.7 ms
May 30 17:03:32 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=1243 ttl=61 time=24.6 ms
May 30 17:03:33 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=1244 ttl=61 time=22.6 ms
May 30 17:03:34 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=1245 ttl=61 time=103 ms
May 30 17:03:35 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=1246 ttl=61 time=23.4 ms
May 30 17:03:36 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=1247 ttl=61 time=23.7 ms
May 30 17:03:37 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=1248 ttl=61 time=31.3 ms
May 30 17:03:38 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=1249 ttl=61 time=24.6 ms
May 30 17:03:39 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=1250 ttl=61 time=24.9 ms
May 30 17:03:40 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=1251 ttl=61 time=56.8 ms
May 30 17:03:41 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=1252 ttl=61 time=101 ms
May 30 17:03:42 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=1253 ttl=61 time=26.4 ms
May 30 17:03:43 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=1254 ttl=61 time=24.6 ms
May 30 17:03:44 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=1255 ttl=61 time=55.4 ms
May 30 17:03:45 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=1256 ttl=61 time=94.4 ms
May 30 17:03:46 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=1257 ttl=61 time=23.8 ms
May 30 17:03:47 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=1258 ttl=61 time=30.0 ms
May 30 17:03:48 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=1259 ttl=61 time=22.8 ms
May 30 17:03:49 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=1260 ttl=61 time=23.2 ms
May 30 17:03:50 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=1261 ttl=61 time=26.2 ms
May 30 17:03:51 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=1262 ttl=61 time=22.8 ms
May 30 17:03:52 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=1263 ttl=61 time=25.4 ms
May 30 17:03:53 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=1264 ttl=61 time=22.9 ms
May 30 17:03:54 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=1265 ttl=61 time=26.8 ms
May 30 17:03:55 64 bytes from 40c:6d50:972e:be3a:7b79:547f:48bb:e97: icmp_seq=1266 ttl=61 time=31.4 ms

and from the android side
image

(it is picture because android log doesn't have timestamp (#275) and i still cant find a way to export android plain log with timestamp from Android Studio )

From the ping, there are two occurences of issue:

  • 16:50:04 -16:57:09
    from android side:

    • there is no log around the time when the issue started
    • just before recovered:
      - android detected that Stream to TCP .......is closed
      - close peer connection
      - and then reconnect to the peer
  • 16:57:32 - 17:03:30
    from android side:

    • there is no log around the time when the issue started
    • just before recovered:
      - dead_watcher.notified()
      - [router] peer is dead
      - close peer connection
      - reconnect to the peer

So, there is possibility that:

@iwanbk
Copy link
Member Author

iwanbk commented May 31, 2024

just before recovered:

  • android detected that Stream to TCP .......is closed

just before recovered:

  • dead_watcher.notified()

Tried with public peers, and the error before recoverd was:

mycelium::peer: Frame error from TCP 10.0.2.16:50224 <-> 65.21.231.58:9651: Connection reset by peer (os error 104)

I also tried it several times before with public peers, the error was mostly this os error 104

@iwanbk
Copy link
Member Author

iwanbk commented Jun 3, 2024

the underlying network is indeed lost, it will be handled in threefoldtech/myceliumflut#40

from my test so far:

  • network lost detector can detect when i disabled the wifi / mobile
  • there is no network lost when the issue happened

@iwanbk
Copy link
Member Author

iwanbk commented Jun 3, 2024

i was thinking of adding some timeout mechanism, only for android.
But from what i've read, timeout doesn't really fit in async world.
Timeout usually used in synchronous programming.

notes: Go uses timeout extensively while the engine is async, but the API is sync

@iwanbk
Copy link
Member Author

iwanbk commented Jun 3, 2024

wdyt @LeeSmet ?

@LeeSmet
Copy link
Contributor

LeeSmet commented Jun 3, 2024

Timeouts/timers are generally fine to add. Though I wonder where you'd add them. In general, the router should also detect broken connections in less than a minute, since there are Hello calls made to every peer which should get answered by an IHU (this is part of the babel protocol). So if that is also failing to reliably detect the broken connection, there seems to be something wrong there as well which needs some figuring out.

@iwanbk
Copy link
Member Author

iwanbk commented Jun 3, 2024

Though I wonder where you'd add them

I was thinking on the framed.next in peers.rs, but now i'm not sure.

In general, the router should also detect broken connections in less than a minute

which log that i should give attention for this?

@LeeSmet
Copy link
Contributor

LeeSmet commented Jun 3, 2024

You're mainly looking for https://github.com/threefoldtech/mycelium/blob/master/mycelium/src/router.rs#L304

@iwanbk
Copy link
Member Author

iwanbk commented Jun 4, 2024

You're mainly looking for https://github.com/threefoldtech/mycelium/blob/master/mycelium/src/router.rs#L304

i've looked at this log.
While peers log always printed when the issue happened, above log was not.

I'll look at the timeout way tomorrow or the day after

@iwanbk
Copy link
Member Author

iwanbk commented Jun 6, 2024

Tested it again,
Now on real android phone instead of emulator

ping was failed for ~1 minute, no error there

Croot@cilik:~# ping6 521:81b4:7f65:20c3:9a0f:2c45:8b92:5507 | ts

Jun 06 11:29:00 PING 521:81b4:7f65:20c3:9a0f:2c45:8b92:5507 (521:81b4:7f65:20c3:9a0f:2c45:8b92:5507) 56 data bytes
Jun 06 11:29:00 64 bytes from 521:81b4:7f65:20c3:9a0f:2c45:8b92:5507: icmp_seq=41 ttl=61 time=1023 ms
Jun 06 11:29:00 64 bytes from 521:81b4:7f65:20c3:9a0f:2c45:8b92:5507: icmp_seq=42 ttl=61 time=713 ms
Jun 06 11:29:02 64 bytes from 521:81b4:7f65:20c3:9a0f:2c45:8b92:5507: icmp_seq=43 ttl=61 time=943 ms
Jun 06 11:29:03 64 bytes from 521:81b4:7f65:20c3:9a0f:2c45:8b92:5507: icmp_seq=44 ttl=61 time=968 ms
Jun 06 11:29:04 64 bytes from 521:81b4:7f65:20c3:9a0f:2c45:8b92:5507: icmp_seq=45 ttl=61 time=890 ms
Jun 06 11:29:05 64 bytes from 521:81b4:7f65:20c3:9a0f:2c45:8b92:5507: icmp_seq=46 ttl=61 time=914 ms
Jun 06 11:29:06 64 bytes from 521:81b4:7f65:20c3:9a0f:2c45:8b92:5507: icmp_seq=47 ttl=61 time=938 ms

i changed the log on below line to error to make easier to see.

trace!("Checking for dead peers");

It ran every 10 seconds, but didn't detect the error (because error!("Peer {} is dead", peer.connection_identifier()); didn't printed.
It is the same with Emulator, check_for_dead_peers in router almost never detect any issue.

In case of using Emulator, usually peer loop broken -> peer disconnected -> reconnect to peer.

@iwanbk
Copy link
Member Author

iwanbk commented Jun 6, 2024

There are many ideas about this

  1. refresh peer when receive network available event.
    It is like what yggdrasil-android do.
    I don't think it would work:
  • the documentation about network callback is not very clear. But from my experiments, it is not about network become functional or not, but about the interface become up/down.
  • from my testing, there is no network available/loss even during the issue
  1. reset mycelium when the phone wake-up
    i think it will work, but i don't like it for these reasons:
  • we might hide some actual problem
  • we can do many unnecessary reset
  1. check peers status and reset when needed
  • check peer status using
    pub fn peer_info(&self) -> Vec<PeerStats> {
    self.peer_manager.peers()
    }
  • we can do it periodically and when the phone wake-up
  • but considering that check_for_dead_peers and peer loop can't detect the issue, then peer status might not work properly as well
  1. improve dead peer detection in router.rs or peer.rs
  • I think this is the best way, fix hidden issue first
  • if it is fixed, mycelium will recover by itself.
  • we can also add number (3), in case we need to make it recover faster.

@iwanbk
Copy link
Member Author

iwanbk commented Jun 6, 2024

ping was failed for ~1 minute, no error there
It ran every 10 seconds, but didn't detect the error (because error!("Peer {} is dead", peer.connection_identifier()); didn't printed.

for this, there is possibility that it was because the issue duration is still below DEAD_PEER_THRESHOLD, which is 43 seconds.

I'm afraid i can't test it quickly because mycelium is quite reliable on my test phone.
only one issue in few hours

It is the same with Emulator, check_for_dead_peers in router almost never detect any issue.

In case of Emulator, i found that when it happened, the timer also stopped working.
So, the issue might be because of the emulator side.

@iwanbk
Copy link
Member Author

iwanbk commented Jun 6, 2024

i'll check if it is half-open issue.
Maybe the peer closed the conn, but Android didn't know

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants