
Make Signal Work at the South Pole #6899

Closed
wants to merge 3 commits

Conversation


@g1a55er commented Jun 2, 2024

First time contributor checklist:

Contributor checklist:

  • My contribution is not related to translations.
  • My commits are in nice logical chunks with good commit messages
  • My changes are rebased on the latest main branch
  • A yarn ready run passes successfully (more about tests here)
  • My changes are ready to be shipped to users

Description

I was reading a post on the popular brr.fyi blog when I came across a familiar-looking "unnamed chat app 1" that reportedly did not work at the South Pole. I was quite disappointed to see that WhatsApp.... errrr.... "unnamed chat app 2" works fine in these conditions, so I decided to treat this post as a bug report.

The anonymous author was kind enough to provide the following debug log from the field:

There are a few things that come to mind when analyzing the problem of improving performance on low-throughput, high-latency connections:

  1. The WebSocket establishment handshake takes many round trips (SYN, SYN-ACK, ACK, ClientHello, ServerHello, the rest of TLS negotiation, and the WebSocket upgrade). I would imagine "competitor app 2" is probably using QUIC or something similar by now to consolidate these down to the bare minimum. This is, in my opinion, the main technical root cause of the performance differential. However, I completely understand that organizational and resource constraints probably make this a high-investment, longer-term item to fix.
  2. The WebSocket abstraction provided by the TypeScript library doesn't give us any window into whether the handshake is making progress. Ideally, I'd like to bail out once we stop making progress through the stages of the bootstrap handshake, but keep going (by resetting the timer) as long as we are still progressing. Unfortunately, we can't see whether we are still making progress or whether the attempt has already failed; we only get a single event once the connection is fully established. (There's a rough sketch of what progress-aware stall detection could look like after this list.)
  3. So, we're left with a trade-off between improving success rates on low-end connections and increasing wasted wait time in the disconnected failure state while we wait for doomed connection attempts to eventually fail. Ideally, we'd have some data to help drive this decision: How many failures are we getting? How long is the P99 success case? How many false positives will we get for a given connection timeout value? I don't have access to any data like that for the Signal user base, so I just chose a simple <10s, 20s, 30s, 30s, ...> progression (sketched just below). Hopefully this improves success rates without hurting reconnect speed for most users.
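
To make item 3 concrete, here is a minimal sketch of that timeout progression, assuming a simple per-attempt lookup; the names are illustrative and not the identifiers used in this PR's diff:

```ts
// Sketch only: the <10s, 20s, 30s, 30s, ...> connect-timeout progression.
const CONNECT_TIMEOUT_PROGRESSION_MS = [10_000, 20_000, 30_000];

function connectTimeoutForAttempt(attempt: number): number {
  // Attempt 0 -> 10s, attempt 1 -> 20s, attempt 2 and beyond -> 30s.
  const index = Math.min(attempt, CONNECT_TIMEOUT_PROGRESSION_MS.length - 1);
  return CONNECT_TIMEOUT_PROGRESSION_MS[index];
}
```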

I'd like to take a deeper dive into these issues if I get some more time. 🤞
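
In the meantime, here is a rough sketch of the progress-aware stall detection mentioned in item 2, written against a plain Node https request rather than the TypeScript WebSocket wrapper (which is exactly the visibility we don't have today). The milestone events ('lookup', 'connect', 'secureConnect') are standard Node socket events; everything else here is hypothetical:

```ts
import https from 'https';

// Sketch only: reset a stall timer at each handshake milestone instead of
// using one fixed timeout for the whole connection bootstrap.
function requestWithStallTimeout(url: string, stallMs: number): void {
  const req = https.get(url);
  let stallTimer: NodeJS.Timeout | undefined;

  const madeProgress = (stage: string) => {
    clearTimeout(stallTimer);
    stallTimer = setTimeout(() => {
      req.destroy(new Error(`connection stalled after stage: ${stage}`));
    }, stallMs);
  };

  madeProgress('start');
  req.on('socket', socket => {
    socket.on('lookup', () => madeProgress('dns'));          // DNS resolved
    socket.on('connect', () => madeProgress('tcp'));         // TCP handshake done
    socket.on('secureConnect', () => madeProgress('tls'));   // TLS handshake done
  });
  req.on('response', () => clearTimeout(stallTimer));        // bootstrap finished
  req.on('error', () => clearTimeout(stallTimer));
}
```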

Test Strategy

[x] Ran yarn ready and got a pass
[x] Ran 'yarn start' and nothing seemed obviously more broken than staging normally is.

@g1a55er closed this Jun 3, 2024
@g1a55er reopened this Jun 3, 2024

@g1a55er commented Jun 3, 2024

I saw this needed some work when I did some testing earlier this morning using the macOS link conditioner under a custom "South Pole" profile.

[Screenshot: macOS link conditioner with a custom "South Pole" profile]

I based these settings on the conditions reported in the blog post.

With these changes, I can now connect to Signal with the link conditioner set to those settings if I am very, very patient. Diving into how the app behaves, there are a few other areas for improvement I spotted that I'm noting down for my own reference:

  • The DNS cache does not appear to be very aggressive. This leads to a pretty pathological error state where the client will keep hammering the DNS resolver with multiple simultaneous requests for the same name. (A sketch of what in-flight coalescing could look like follows this list.)
  • I'm not seeing any sort of connection pooling or other connection strategy going on here. I get that we want to keep the unauthenticated and the authenticated socket separate for privacy reasons, but I think some sort of connection engine that is aware of that constraint could do a pretty good job of standing up working sockets over bad connections faster.
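
As a rough illustration of the DNS point above, here is a minimal sketch of coalescing concurrent lookups for the same hostname; dns.promises.lookup is a real Node API, but the cache shape and names here are hypothetical and not how Signal Desktop resolves names today:

```ts
import { promises as dns } from 'dns';

// Sketch only: share one in-flight lookup per hostname instead of issuing
// several identical queries at the same time.
const inFlight = new Map<string, Promise<string>>();

function resolveOnce(hostname: string): Promise<string> {
  const pending = inFlight.get(hostname);
  if (pending) {
    return pending; // reuse the lookup that is already underway
  }

  const lookup = dns
    .lookup(hostname)
    .then(({ address }) => address)
    .finally(() => inFlight.delete(hostname));

  inFlight.set(hostname, lookup);
  return lookup;
}
```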

@@ -54,10 +54,16 @@ export function connect<Resource extends IResource>({
...extraHeaders,
'User-Agent': getUserAgent(version),
};

const http_agent_options = {
@g1a55er (Author) commented:

I am not a TypeScript ninja, so let me know if there is a better way to do this. My desire here is to add the connect_timeout_ms option to the AgentOptions argument without overwriting any of the defaults.
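
For illustration, the pattern looks roughly like this, assuming the existing defaults live in an options object; the names and the 30-second value are placeholders, not the exact code in this diff:

```ts
// Sketch only: spread the existing defaults first so adding the new
// connect_timeout_ms key doesn't overwrite anything already configured.
const defaultAgentOptions = { keepAlive: true }; // stand-in for the current defaults

const http_agent_options = {
  ...defaultAgentOptions,
  connect_timeout_ms: 30_000,
};
```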

@jamiebuilds-signal (Member) commented:

Thank you for the pull request. We're actually working on a longer-term project to move a lot of this to a Rust library, and we're discussing how we could improve this. However, just increasing the timeouts pushes the problem downstream to a lot of places in the app and causes other issues. So I think for now we'll defer doing this.
