
Make Signal Work at the South Pole #6899

Closed
wants to merge 3 commits

Conversation


@g1a55er commented Jun 2, 2024

First time contributor checklist:

Contributor checklist:

  • My contribution is not related to translations.
  • My commits are in nice logical chunks with good commit messages
  • My changes are rebased on the latest main branch
  • A yarn ready run passes successfully (more about tests here)
  • My changes are ready to be shipped to users

Description

I was reading a post on the popular brr.fyi blog when I came across a familiar-looking "unnamed chat app 1" that reportedly did not work at the South Pole. I was quite disappointed to see that WhatsApp.... errrr.... "unnamed chat app 2" works fine in these conditions, so I decided to treat this post as a bug report.

The anonymous author was kind enough to provide the following debug log from the field:

There are a few things that come to mind when analyzing the problem of improving performance on low-throughput, high-latency connections:

  1. The WebSocket establishment handshake takes many round trips (SYN, SYN-ACK, ACK, ClientHello, ServerHello, the rest of TLS negotiation, and the WebSocket upgrade). I would imagine "competitor app 2" is probably using QUIC or something similar by now to consolidate these down to the bare minimum. This is, in my opinion, the main technical root cause of the performance differential. However, I completely understand that organizational and resource constraints probably make this a high-investment, longer-term item to fix.
  2. The WebSocket abstraction provided by the TypeScript library doesn't give us any window into whether the handshake is making progress. Ideally, I'd like to bail out once we stop making progress through the stages of the bootstrap handshake, but keep going (by resetting the timer) as long as we are still progressing. Unfortunately, we can't see whether we are still making progress or whether the attempt has already failed; we only get a single event once the connection is fully established. (There's a rough sketch of what progress-aware stall detection could look like after this list.)
  3. So, we're left with a trade-off between improving success rates on low-end connections and increasing wasted wait time in the disconnected failure state while we wait for doomed connection attempts to eventually fail. Ideally, we'd have some data to help drive this decision: How many failures are we getting? How long is the P99 success case? How many false positives will we get for a given connection timeout value? I don't have access to any data like that for the Signal user base, so I just chose a simple <10s, 20s, 30s, 30s, ...> progression (sketched just below). Hopefully this improves success rates without hurting reconnect speed for most users.
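
To make item 3 concrete, here is a minimal sketch of that timeout progression, assuming a simple per-attempt lookup; the names are illustrative and not the identifiers used in this PR's diff:

```ts
// Sketch only: the <10s, 20s, 30s, 30s, ...> connect-timeout progression.
const CONNECT_TIMEOUT_PROGRESSION_MS = [10_000, 20_000, 30_000];

function connectTimeoutForAttempt(attempt: number): number {
  // Attempt 0 -> 10s, attempt 1 -> 20s, attempt 2 and beyond -> 30s.
  const index = Math.min(attempt, CONNECT_TIMEOUT_PROGRESSION_MS.length - 1);
  return CONNECT_TIMEOUT_PROGRESSION_MS[index];
}
```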

I'd like to take a deeper dive into these issues if I get some more time. 🤞
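
In the meantime, here is a rough sketch of the progress-aware stall detection mentioned in item 2, written against a plain Node https request rather than the TypeScript WebSocket wrapper (which is exactly the visibility we don't have today). The milestone events ('lookup', 'connect', 'secureConnect') are standard Node socket events; everything else here is hypothetical:

```ts
import https from 'https';

// Sketch only: reset a stall timer at each handshake milestone instead of
// using one fixed timeout for the whole connection bootstrap.
function requestWithStallTimeout(url: string, stallMs: number): void {
  const req = https.get(url);
  let stallTimer: NodeJS.Timeout | undefined;

  const madeProgress = (stage: string) => {
    clearTimeout(stallTimer);
    stallTimer = setTimeout(() => {
      req.destroy(new Error(`connection stalled after stage: ${stage}`));
    }, stallMs);
  };

  madeProgress('start');
  req.on('socket', socket => {
    socket.on('lookup', () => madeProgress('dns'));          // DNS resolved
    socket.on('connect', () => madeProgress('tcp'));         // TCP handshake done
    socket.on('secureConnect', () => madeProgress('tls'));   // TLS handshake done
  });
  req.on('response', () => clearTimeout(stallTimer));        // bootstrap finished
  req.on('error', () => clearTimeout(stallTimer));
}
```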

Test Strategy

[x] Ran yarn ready and got a pass
[x] Ran 'yarn start' and nothing seemed obviously more broken than staging normally is.

@g1a55er closed this Jun 3, 2024
@g1a55er reopened this Jun 3, 2024

@g1a55er commented Jun 3, 2024

I saw this needed some work when I did some testing earlier this morning using the macOS link conditioner under a custom "South Pole" profile.

[Screenshot: macOS link conditioner with a custom "South Pole" profile]

I based these settings on the conditions reported in the blog post.

With these changes, I can now connect to Signal with the link conditioner set to those settings if I am very, very patient. Diving into how the app behaves, there are a few other areas for improvement I spotted that I'm noting down for my own reference:

  • The DNS cache does not appear to be very aggressive. This leads to a pretty pathological error state where the client will keep hammering the DNS resolver with multiple simultaneous requests for the same name. (A sketch of what in-flight coalescing could look like follows this list.)
  • I'm not seeing any sort of connection pooling or other connection strategy going on here. I get that we want to keep the unauthenticated and the authenticated socket separate for privacy reasons, but I think some sort of connection engine that is aware of that constraint could do a pretty good job of standing up working sockets over bad connections faster.
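
As a rough illustration of the DNS point above, here is a minimal sketch of coalescing concurrent lookups for the same hostname; dns.promises.lookup is a real Node API, but the cache shape and names here are hypothetical and not how Signal Desktop resolves names today:

```ts
import { promises as dns } from 'dns';

// Sketch only: share one in-flight lookup per hostname instead of issuing
// several identical queries at the same time.
const inFlight = new Map<string, Promise<string>>();

function resolveOnce(hostname: string): Promise<string> {
  const pending = inFlight.get(hostname);
  if (pending) {
    return pending; // reuse the lookup that is already underway
  }

  const lookup = dns
    .lookup(hostname)
    .then(({ address }) => address)
    .finally(() => inFlight.delete(hostname));

  inFlight.set(hostname, lookup);
  return lookup;
}
```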

@@ -54,10 +54,16 @@ export function connect<Resource extends IResource>({
...extraHeaders,
'User-Agent': getUserAgent(version),
};

const http_agent_options = {
@g1a55er (Author) commented:

I am not a TypeScript ninja, so let me know if there is a better way to do this. My desire here is to add the connect_timeout_ms option to the AgentOptions argument without overwriting any of the defaults.
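
For illustration, the pattern looks roughly like this, assuming the existing defaults live in an options object; the names and the 30-second value are placeholders, not the exact code in this diff:

```ts
// Sketch only: spread the existing defaults first so adding the new
// connect_timeout_ms key doesn't overwrite anything already configured.
const defaultAgentOptions = { keepAlive: true }; // stand-in for the current defaults

const http_agent_options = {
  ...defaultAgentOptions,
  connect_timeout_ms: 30_000,
};
```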

@jamiebuilds-signal (Member) commented:

Thank you for the pull request. We're actually working on a longer-term project to move a lot of this to a Rust library, and we're discussing how we could improve this. However, just increasing the timeouts pushes the problem downstream to a lot of places in the app and causes other issues. So I think for now we'll defer doing this.
