[🐛] Failed Reconnect After Extended Network Down (≈10s) #2952

arekkubaczkowski opened this issue Feb 14, 2025 · 25 comments

Comments

@arekkubaczkowski

arekkubaczkowski commented Feb 14, 2025

Issue

Description

We’ve observed that the GetStream library fails to reconnect automatically after a network outage lasting around 10 seconds or more. In addition, re-initializing the chat instance—by navigating back to the previous screen and then returning to the chat page—does not resolve the issue. A complete app reboot is required to restore connectivity.

Steps to Reproduce

  1. Simulate Network Down:

    • Disable network connectivity (e.g., via Wi‑Fi toggle).
  2. Extended Network Outage:

    • Keep the network offline for at least 10 seconds.
  3. Restore Network Connection:

    • Re-enable the network connection.
  4. Observe:

    • The connection fails to re-establish automatically.
    • The MessageList component continues to display as if it’s connected, without indicating any disconnection or ongoing reconnection process.
  5. Re-initialize the chat instance:

    • Navigate back to the previous screen and then return to the chat page to trigger chat instance re-initialization.
      • The connection is still not restored; a full app restart is necessary to regain connectivity.
      • The MessageList component continues to render as if it were connected, without indicating any disconnection or ongoing reconnection.

@isekovanic
Contributor

isekovanic commented Feb 17, 2025

Hi @arekkubaczkowski ,

Thanks a lot for the report !

I fiddled around with this for quite a bit today and could not get the particular behaviour to reproduce. In that regard, I have a couple of follow-up questions:

  • Which version of the SDK are you using? I'm only asking because we fixed a similar issue about 2 months ago that exhibited this behaviour.
  • Which HoC do you have mounted when this happens? Is it ChannelList or MessageList?
  • What makes you think that the connection cannot re-establish itself? You can check this by attaching a WS event listener (a general one, not one for a specific event) and bombarding the app with a couple of WS events.
  • Would you be able to provide a video of this happening? Perhaps something subtle will nudge me in the right direction.
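
For reference, a general listener of the kind mentioned above could be sketched roughly as follows. This is only an illustration: `AllEventsSource` is an assumed minimal shape standing in for the chat client (a single-argument `on` that receives every event and returns an object with `unsubscribe`), and `recordAllEvents` is a hypothetical helper name, not part of the SDK.

```typescript
// Assumed minimal event shape; real events carry many more fields.
interface LoggedChatEvent {
  type: string;
  cid?: string;
}

// Assumed subset of the client: the single-argument `on` overload that
// subscribes to every event and returns an object with `unsubscribe`.
interface AllEventsSource {
  on(handler: (event: LoggedChatEvent) => void): { unsubscribe: () => void };
}

// Hypothetical helper: appends one line per event to `sink` so the sequence
// can be shown on screen or written to a file while the debugger is offline.
function recordAllEvents(
  client: AllEventsSource,
  sink: string[],
  now: () => number = Date.now,
): () => void {
  const subscription = client.on((event) => {
    const suffix = event.cid ? ` ${event.cid}` : '';
    sink.push(`${now()} ${event.type}${suffix}`);
  });
  // Return a disposer, e.g. for use as a useEffect cleanup.
  return () => subscription.unsubscribe();
}
```

If events stop showing up in `sink` after the network returns, that suggests they are not reaching the client at all, rather than being dropped by the UI.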

@arekkubaczkowski
Author

hi @isekovanic thanks for the reply.

  • I am using the latest version of the SDK.
  • MessageList, but channel events don't arrive at all either, e.g. channel.on('message.new', handler).
  • It is very difficult to debug because when I disable Wi-Fi I lose the debugger connection. Is there a better way to debug it?
  • I will try to attach something soon.

@arekkubaczkowski
Author

@isekovanic I think I found something. We’re using GetStream along with LiveKit for streaming. Our logic calls router.back when the LiveKit disconnect event triggers. However, this event is sometimes delayed by up to 30 seconds after disabling Wi-Fi. If I turn Wi-Fi back on before router.back is executed, GetStream successfully re-establishes the connection. But once router.back has been called, if I reconnect and navigate back to the chat screen, GetStream fails to reconnect. It looks like if GetStream is offline when the chat is unmounted, it cannot reconnect after the next mount.

@isekovanic
Contributor

That's very interesting indeed ! I shall check it out later today and try to emulate this behaviour - but this was most helpful, so thanks a bunch !

@isekovanic
Contributor

isekovanic commented Feb 19, 2025

Hi @arekkubaczkowski ,

So first off thanks again for your astute observations on the topic !

After spending quite a bit of time tonight playing around with what you described, I was unfortunately not able to reproduce it at all. However, I do have some ideas of what might be going on.

Traditionally, our SDK has 2 ways in which it can kill a WS connection:

  • You put the app into background mode, which is done to preserve bandwidth and to not clog the open WS connection for long (and also because of the fact that Android will kill the connection on an OS level after about 2-3 minutes anyway)
  • You actually lose connection to the internet, after which we manually dispatch a WS event towards the StreamChat client to let it know that something's up (and after which a reconnection flow will begin after we've regained access to the internet)

These are different use cases, but the SDK handles them in a similar way. The main hook where all the magic happens is this one. Let's analyze it further.

  1. For the case of putting the app in background mode, we have this and this handler. Their job is to do exactly what they say in the given scenarios. Keep in mind that you need to actually have a connected user in the chatClient for this to work.
  2. If you actually lose connection, the chatClient is going to wait 5 seconds and then fire a connection.changed event, thus causing the state to change to offline. When internet is regained the WS connection will begin to recover and once that happens a connection.recovered event will be fired (both of these are not WS events directly, but local events that the client fires).
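
To make the second flow a bit more concrete, here is a small sketch of how the two local events could be folded into a connection state. It assumes that `connection.changed` carries an `online` boolean and that `connection.recovered` marks the end of recovery; the type and function names are hypothetical and only meant for illustration.

```typescript
// Assumed minimal event shape for the two local connection events.
interface ConnectionEventLike {
  type: string;
  online?: boolean;
}

interface ConnectionSnapshot {
  online: boolean;
  // True from the moment the client goes offline until recovery is reported.
  awaitingRecovery: boolean;
}

const initialConnection: ConnectionSnapshot = { online: true, awaitingRecovery: false };

// Pure reducer: returns the next snapshot for a given event.
function reduceConnection(
  state: ConnectionSnapshot,
  event: ConnectionEventLike,
): ConnectionSnapshot {
  switch (event.type) {
    case 'connection.changed':
      if (event.online === undefined) return state;
      return event.online
        ? { ...state, online: true }
        : { online: false, awaitingRecovery: true };
    case 'connection.recovered':
      return { online: true, awaitingRecovery: false };
    default:
      // Unrelated events leave the snapshot untouched.
      return state;
  }
}
```

A snapshot that stays in `awaitingRecovery` long after the network is back would be one sign that the recovery flow never completed.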

My current theory is the following:

  • If you unmount the entire Chat component while in background and client.closeConnection() has been called when going to background but not client.openConnection() (because the Chat component can never figure out it got to foreground), it should be theoretically possible that the client gets stuck in a weird state where it keeps being loaded (because it's a singleton) but the connection is closed. I was not able to reproduce this (and also the UI would've updated anyhow), but it is something I considered.
  • If the same thing happens, but with actually losing connection it is going to continue to try to reconnect with exponential backoff and only give up if all limits are reached (you would probably see something like initial WS connection could not be established when this happens). I cannot see a scenario where this particularly could get caught in a broken state, but I suppose the same applies as my previous point.

I have tried:

  • Unmounting the Chat component immediately when going offline
  • Unmounting the Chat component immediately when going to the background
  • Manually trying to break the connection and then unmounting

and after each one of these the remount afterwards uses the recovered connection.

With that said, I'm at a bit of a loss and kind of out of ideas. However, I do have some things you could check:

  • I know offline events and such are a pain to debug, but one way to do it is through Alerts or through passing a custom logger when instantiating the chatClient on your side (maybe you can write to a local file or something) so that we can see what's roughly going on
  • You can check the state of your chatClient when the issue is happening, specifically chatClient.wsConnection?.isHealthy for example (and possibly continue with some other values from our connection module in the low level client)
  • If you come to a conclusion that the WS connection indeed gets irrecoverably broken, trying to chatClient.disconnectUser() on running router.goBack() would probably get you to a clean slate, if you aren't doing this already
  • I'm not particularly familiar with the LiveKit codebase, so I can't say about what underlying libraries they're using - however, it might be worth also looking into some of their own dependencies and if maybe there are shared ones between the 2 SDKs (possibly a version mismatch ?)
  • Finally, since I cannot reproduce this - are you able to create a minimum reproducible example of this happening ? It does not have to be with LiveKit naturally, what I did was basically unmount on certain events and then have a button to remount the entire Chat component at will; if this is possible it would be most helpful
  • And last but not least, if you have a lightweight showcase of your implementation and how it works it could also reveal some interesting things
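
As a rough illustration of the custom logger idea, the sketch below keeps the most recent log lines in memory so they can be shown in an Alert or written to a file once the issue occurs. It assumes the logger is called with `(logLevel, message, extraData)`; `createBufferedLogger` and its return shape are hypothetical names for this example.

```typescript
type BufferedLogLevel = 'info' | 'error' | 'warn';

interface BufferedLogEntry {
  at: number;
  level: BufferedLogLevel;
  message: string;
}

// Hypothetical helper: a bounded in-memory log that survives losing the debugger.
function createBufferedLogger(maxEntries = 200, now: () => number = Date.now) {
  const entries: BufferedLogEntry[] = [];

  // Assumed to match the (logLevel, message, extraData) logger signature.
  const logger = (
    level: BufferedLogLevel,
    message: string,
    _extraData?: Record<string, unknown>,
  ): void => {
    entries.push({ at: now(), level, message });
    // Drop the oldest entry once the buffer is full.
    if (entries.length > maxEntries) entries.shift();
  };

  // Renders the buffer as plain text, oldest entry first.
  const dump = (): string =>
    entries.map((e) => `${e.at} [${e.level}] ${e.message}`).join('\n');

  return { logger, dump };
}
```

The returned `logger` would then be passed in the client options when the chatClient is instantiated, and `dump()` called from a debug button once the chat stops receiving messages.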

Sorry for the wall of text and for not being able to help more.

@isekovanic
Contributor

Oh and one more thing I forgot to mention - if you're trying to lose connection by turning wi-fi off, you likely have to put the app in background mode; which in turn goes down the other route rather than actually simulating losing internet.

To properly simulate this, you can either:

  • Actually turn your internet off (like kill the router and make sure you don't connect to 4G/5G)
  • Pass closeConnectionOnBackground={false} to the Chat component so that the connection does not immediately close when going to the background

This should at least isolate the unintended consequences of a connection close whenever doing this.

@arekkubaczkowski
Author

arekkubaczkowski commented Feb 20, 2025

@isekovanic I appreciate your detailed input. I just tried to reproduce it by unmounting/mounting the chat instance and you're right, this scenario works fine. However if you do the same but on the navigation level, then it breaks indeed.
Steps to repro:

  • Add some button on the chat screen that triggers router.back on press (expo-router)
  • Open the chat screen
  • disable wifi for ~50s
  • after 50s, press router.back button
  • enable wifi back
  • open chat screen again
  • see that connection is not recovered

I can actually reproduce it every time.

PS: I have disabled closeConnectionOnBackground.

@arekkubaczkowski
Author

arekkubaczkowski commented Feb 20, 2025

I just found out that with closeConnectionOnBackground enabled, after the scenario above, putting the app into the background and then back into the foreground re-establishes the connection properly.

> If you come to a conclusion that the WS connection indeed gets irrecoverably broken, trying to chatClient.disconnectUser() on running router.goBack() would probably get you to a clean slate, if you aren't doing this already

I actually cannot do that because I get a "Call connectUser or connectAnonymousUser before creating a channel" error. That is probably because the client was already disconnected due to the lack of connectivity.

@isekovanic
Contributor

Okay, so that's a good start ! I was actually testing on our RN CLI sample apps, so perhaps there's a hint on how this works within how Expo handles routing or something.

3 follow up questions:

  • When you say chat screen, which screen do you mean exactly ? Is it a ChannelList rendered there, MessageList or something else ?
  • When you run router.goBack(), do you unmount the entire Chat component or something else ? In other words, does the Chat component wrap your entire application or just the chat screen ?
  • When going back in these conditions, could you please try running chatClient.closeConnection() and then chatClient.openConnection() immediately after it ? This is what essentially happens whenever going to background/foreground and I'm interested to see if it resolves the broken state so that we know where to poke the bear

> I actually cannot do it because I get Call connectUser or connectAnonymousUser before creating a channel error. That is probably because the client was disconnected already due to lack of connectivity.

That's interesting, that error should not be the one appearing in these scenarios; but maybe it reveals something else.

@isekovanic
Contributor

PS: I will actually try with your steps to reproduce a bit later today and see

@arekkubaczkowski
Author

arekkubaczkowski commented Feb 20, 2025

@isekovanic

  • chat screen is the one that implements MessageList (I am not using ChannelList at all)
  • yes I am unmounting the entire chat component. btw. what is the most recommended approach? Should I wrap the entire app with the chat component?

the only thing that I run globally and store across the entire app is the chat client connection (useCreateChatClient).
I initialize it at the highest app level and share it via custom context.

the one reason why I would need to have multiple chat instances is that I need to support multiple themes (style prop)

@isekovanic
Contributor

No, this should be alright.

> When going back in these conditions, could you please try running chatClient.closeConnection() and then chatClient.openConnection() immediately after it ? This is what essentially happens whenever going to background/foreground and I'm interested to see if it resolves the broken state so that we know where to poke the bear

did you manage to try this ?

> the one reason why I would need to have multiple chat instances is that I need to support multiple themes (style prop)

you shouldn't need multiple Chat instances, passing a new theme to a single one should work

@arekkubaczkowski
Author

calling closeConnection before router.back doesn't seem to help

@isekovanic
Contributor

Are you running openConnection() immediately afterwards though ? Only asking because you mentioned that going to background -> foreground helps with the issue and these are pretty much the only 2 things happening when that occurs.

@arekkubaczkowski
Author

that's strange but also doesn't help

@isekovanic
Contributor

Ah, but wait - I think I'm being silly. You're still offline when doing that right ? It won't be able to establish the connection anyway. Can you try doing it once you regain connection ?

@arekkubaczkowski
Author

Nope, I am calling it after turning the connection back on. Moreover, openConnection should return wsPromise; the only scenario where it returns void is when the client is already connected, and I am actually getting a void response.

@isekovanic
Contributor

Yeah, that would probably explain why it can't reconnect - I guess it relates to what I suspected; the entire chatClient gets caught in a weird state and it can neither reconnect (because it thinks it's already connected) nor does it have connection.
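
For anyone following along, the close-then-open check discussed above can be sketched like this. It is written against an assumed minimal client shape (`CyclableClient`), and it relies on the observation from this thread that openConnection yields nothing only when the client believes it is already connected; `cycleConnection` is a hypothetical helper name.

```typescript
// Assumed subset of the chat client used by this check.
interface CyclableClient {
  closeConnection(): Promise<unknown>;
  openConnection(): Promise<unknown>;
  wsConnection?: { isHealthy: boolean } | null;
}

type CycleOutcome = 'reconnected' | 'already-connected' | 'failed';

interface CycleReport {
  outcome: CycleOutcome;
  healthyAfter: boolean;
}

// Hypothetical helper: mimics the background -> foreground sequence and
// reports what the client claimed to have done.
async function cycleConnection(client: CyclableClient): Promise<CycleReport> {
  let outcome: CycleOutcome;
  try {
    await client.closeConnection();
    const result = await client.openConnection();
    // Assumption: an undefined result means the client skipped reconnecting
    // because it considered the connection healthy already.
    outcome = result === undefined ? 'already-connected' : 'reconnected';
  } catch {
    outcome = 'failed';
  }
  return { outcome, healthyAfter: client.wsConnection?.isHealthy === true };
}
```

An `already-connected` outcome while no events are arriving would match the "stuck" state described here.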

I'll have a go at your updated reproduction steps probably tonight and see what's going on, using navigation specifically.

@arekkubaczkowski
Author

@isekovanic have you managed to reproduce it by chance?

@isekovanic
Contributor

Hi @arekkubaczkowski ,

No, sorry - very busy past couple of days and a ton of stuff to do. I'll let you know as soon as I try it out.

@arekkubaczkowski
Author

@isekovanic hi, I would like to double check if it's still on your plate. I would be grateful for any update 🙏

@isekovanic
Contributor

Hi @arekkubaczkowski , yes - it still is; it's just one of those scenarios where my work queue has grown quite large (and keeps changing). I promise the team and I will have a look at this soon. In the meantime, if you happen to have some bandwidth - a minimum reproducible sample would be extremely appreciated, especially as something to go off on (to avoid any potential back and forth later on if it's not reproducible properly).

I'm very sorry for the wait !

@arekkubaczkowski
Author

arekkubaczkowski commented Apr 11, 2025

@isekovanic
https://github.com/arekkubaczkowski/getstream-reconnect
here is the repository with reproduced issue:

please follow these steps:

  • replace your token and data in ChatClientProvider.tsx
  • replace channel data in ChatChannel.tsx
  • Open the chat screen
  • make sure you get any messages
  • disable internet connection
  • press go back button (which is already rendered above the chat)
  • turn on internet
  • go to the chat again
  • see that messages are not coming anymore

@isekovanic
Contributor

Hi @arekkubaczkowski ,

First off - thanks so much for taking the time to create a minimal reproducible sample. The team and I greatly appreciate it !

Apologies for the delay (again), things are getting pretty busy as we're pushing for the V7 release as well as doing MessageList performance improvements.

I spent the better part of last night debugging this in depth and have indeed found the issue. Let me break it down in a TL;DR format:

  • The WS connection is indeed very much stable and healthy; there's no issue there (as expected)
  • What does happen however, is when the connection actually gets killed (due to internet loss of >= 5 seconds) the channel we were watching is no longer being watched
    • This can be seen if you listen to all channel websocket events, where you'll see that you're receiving notification.message_new events rather than message.new ones (notification. events only arrive for entities we are not currently watching while the other ones arrive only if we're watching)
    • The issue that actually occurs is the channel being unwatched and yet its state not matching this (since channel.initialized is still very much true)
    • This then causes navigating to the Channel component again to not watch() the channel once again, causing the issue (we typically do this on the ChannelList by running queryChannels; however that logic depends on the ChannelList being mounted in the first place)
  • Due to some technical limitations for which I won't go into too many details, we are not able to update the channel state accordingly and mark it as "needed to be watched" once again
  • What did work for me (on your minimal reproducible sample) was simply adding channel?.watch() when navigating to the Channel screen and rendering the HOC

That would look something like this:

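import { useEffect, useState } from 'react';

// Inside the screen component that renders the Channel HOC:
const [ready, setReady] = useState(false);

useEffect(() => {
  const watchChannel = async () => {
    // Watch the channel again so that message.new events are delivered.
    await channel?.watch();
    setReady(true);
  };
  watchChannel();
}, [channel]);

// Render nothing until the client exists and the channel is being watched.
if (!client || !ready) {
  return null;
}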

right about here.

Please take this as an example of what the workaround should look like; ideally you would only do this when needed (i.e. whenever a Channel is unmounted while the network is down, or when the connection state is recovered while outside of the Chat component, etc.). Depending on your needs, if you need to watch() multiple channels, it's also a good idea to rely on queryChannels with their respective cids, like so:

// Query (and thereby watch) all channels with the given cids in one request.
await client.queryChannels(
  { cid: { $in: cids } } as ChannelFilters<StreamChatGenerics>,
  { last_message_at: -1 },
  { limit: 30 },
);
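
As for doing this only when needed, one possible shape is sketched below. It is not SDK code: `RewatchTracker` and `WatchableChannel` are hypothetical names, and the idea is simply to remember whether an outage happened since a channel was last watched. `markOutage` would be called from a `connection.changed` listener when the client goes offline.

```typescript
// Assumed subset of a channel: an identifier and a watch() call.
interface WatchableChannel {
  cid: string;
  watch(): Promise<unknown>;
}

// Hypothetical tracker: a channel needs watching if it has never been watched
// through this tracker, or if an outage happened since its last watch.
class RewatchTracker {
  private outageCount = 0;
  private watchedAtOutage = new Map<string, number>();

  // Call when a connection.changed event reports that the client is offline.
  markOutage(): void {
    this.outageCount += 1;
  }

  needsWatch(cid: string): boolean {
    return this.watchedAtOutage.get(cid) !== this.outageCount;
  }

  // Watches the channel only if needed; resolves to true when watch() ran.
  async ensureWatched(channel: WatchableChannel): Promise<boolean> {
    if (!this.needsWatch(channel.cid)) return false;
    // Capture the counter first so an outage during watch() is not masked.
    const outageAtStart = this.outageCount;
    await channel.watch();
    this.watchedAtOutage.set(channel.cid, outageAtStart);
    return true;
  }
}
```

With this, the effect in the screen component would call `ensureWatched(channel)` instead of `channel.watch()` directly, so repeated navigation without an outage in between does not trigger extra requests.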

As mentioned before, we unfortunately cannot fix this in the SDK right away, as it's a much bigger rabbit hole with a ton of implications at this time. I will, however, add this to our backlog so that we may pick it up in the future; for the time being, you can try the workaround and let me know whether it helps.
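
To check whether the workaround took effect, the event-type distinction from the TL;DR can be turned into a small diagnostic. The sketch below assumes each recorded event has a `type` and a `cid`; the helper names are hypothetical and the event log is something you would collect yourself with a general listener.

```typescript
type WatchHint = 'watched' | 'not-watched' | 'unknown';

// Assumed minimal event shape collected by a general listener.
interface HintEvent {
  type: string;
  cid?: string;
}

// message.new arrives for watched channels, notification.message_new for
// channels that are not being watched.
function watchHintFromType(type: string): WatchHint {
  if (type === 'message.new') return 'watched';
  if (type === 'notification.message_new') return 'not-watched';
  return 'unknown';
}

// Returns the hint given by the most recent relevant event for `cid`.
function latestWatchHint(events: HintEvent[], cid: string): WatchHint {
  for (const event of [...events].reverse()) {
    if (event.cid !== cid) continue;
    const hint = watchHintFromType(event.type);
    if (hint !== 'unknown') return hint;
  }
  return 'unknown';
}
```

If the latest hint for the open channel is `not-watched`, the channel still needs a watch() call even though channel.initialized is true.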

Thanks again !

@arekkubaczkowski
Author

arekkubaczkowski commented May 13, 2025

@isekovanic The workaround you suggested does work for this specific scenario. However, we’re still facing reconnection failures after the connection is lost. Many users report that in poor network conditions, the chat stops working entirely. Unfortunately, the only thing that seems to help is restarting the app. Simply re-entering the chat doesn’t fix the issue because of the problem we discussed earlier, and the workaround doesn't seem to help in this case.
