Twine used Artillery for load testing because Artillery's built-in Socket.IO engine made it straightforward to connect to the Twine server. However, in Artillery, "each virtual user will pick and run one of the scenarios in the test definition and run it to completion."1 That made the Twine user flow difficult to simulate, because each virtual user must complete a successful /set-cookie request followed by a second request that establishes the WebSocket connection; we also wanted to add custom error reporting.
We addressed these issues by adding custom processing to the Artillery load tests: moving the "scenario" logic out of Artillery's limited YAML options and into a more flexible JavaScript file. With that in place, each of Artillery's virtual users fetched a cookie, established a WebSocket connection with the Twine server, and then maintained that connection, all in sequence. Custom error reporting tracked the success or failure of each virtual user’s /set-cookie request and WebSocket connection interactions, and, when applicable, the receipt of a payload published over the WebSocket connection.
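A minimal sketch of what that custom JavaScript scenario step might look like, assuming Artillery's custom-function signature of (context, events, done), its custom-metrics events, and Node's socket.io-client; the function name, counter names, and the target and hold-time variables are illustrative, and only the /set-cookie step comes from the flow described above:

```javascript
// processor.js -- illustrative sketch of the custom scenario step.
const http = require('http');
const { io } = require('socket.io-client');

function connectVirtualUser(context, events, done) {
  const target = context.vars.target; // assumed to hold the Twine base URL (http in this sketch)
  let finished = false;
  const finish = (err) => {
    if (!finished) {
      finished = true;
      done(err);
    }
  };

  // Step 1: request a session cookie from the Twine server.
  http.get(`${target}/set-cookie`, (res) => {
    res.resume(); // we only need the headers
    if (res.statusCode !== 200) {
      events.emit('counter', 'twine.set_cookie_failed', 1);
      return finish(new Error(`set-cookie returned ${res.statusCode}`));
    }
    events.emit('counter', 'twine.set_cookie_ok', 1);
    const cookie = (res.headers['set-cookie'] || []).join('; ');

    // Step 2: establish the WebSocket connection, presenting the cookie.
    const socket = io(target, {
      transports: ['websocket'],
      extraHeaders: { cookie },
    });

    socket.on('connect_error', (err) => {
      events.emit('counter', 'twine.ws_connect_failed', 1);
      finish(err);
    });

    socket.on('connect', () => {
      events.emit('counter', 'twine.ws_connected', 1);
      // Step 3: hold the connection open, then release the virtual user.
      setTimeout(() => {
        socket.close();
        finish();
      }, Number(context.vars.holdMs || 60000));
    });

    // Step 4: record any payload published over the connection.
    socket.on('message', () => {
      events.emit('counter', 'twine.message_received', 1);
    });
  }).on('error', (err) => {
    events.emit('counter', 'twine.set_cookie_failed', 1);
    finish(err);
  });
}

module.exports = { connectVirtualUser };
```

In this shape, the exported function is referenced from the Artillery script's processor setting and invoked as a function step in the scenario flow, so every virtual user runs the cookie request, the WebSocket connection, and the hold period in sequence.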
Load testing Twine with 96,000 concurrent virtual users placed too much strain on the AWS EC2 instance running Artillery: maintaining tens of thousands of WebSocket connections created with the Socket.IO client library quickly reached the server’s memory limit. We also found that the Artillery server had a limited number of ephemeral ports and open file descriptors, and each WebSocket connection required one of each; Linux's default ephemeral port range provides roughly 28,000 ports, far short of 96,000 connections from one machine to a single endpoint. To resolve these issues, we upgraded the EC2 instance's CPU, memory, and network performance and added a second instance, so that both could generate load against Twine concurrently.
Phase one load tested a Twine deployment by ramping up to 96,000 concurrent virtual users over 20 minutes: the Twine servers auto-scaled from 1 to 4 instances based on a CPU-threshold trigger with a breach duration (how long the threshold must be exceeded before scaling), and handled the load successfully.
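For reference, roughly 96,000 arrivals over 1,200 seconds corresponds to an average of 80 new virtual users per second; because each virtual user holds its connection open, total arrivals approximate peak concurrency. A sketch of the corresponding phase configuration, normally written in Artillery's YAML script but shown here as an annotated JavaScript object (the target URL and scenario name are placeholders):

```javascript
// Sketch of the phase-one ramp profile as a JavaScript object; the real
// Artillery script would express the same structure in YAML.
const phaseOne = {
  config: {
    target: 'http://twine.example.com', // placeholder URL
    processor: './processor.js',        // custom scenario logic from above
    phases: [
      {
        duration: 1200, // 20 minutes
        arrivalRate: 1, // new virtual users per second at the start...
        rampTo: 159,    // ...rising linearly: (1 + 159) / 2 * 1200 = 96,000 arrivals
      },
    ],
  },
  scenarios: [
    { name: 'connect-and-hold', flow: [{ function: 'connectVirtualUser' }] },
  ],
};

module.exports = phaseOne;
```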
Phase two ramped up to 40,800 virtual users over 20 minutes and added the strain of subscribing each virtual user to one room and emitting one message per second to all users in that room. The Twine architecture handled the load without issue. However, the test report showed that 20-40% of virtual users, spread across the duration of the load test, connected yet failed to receive a single message. A load test of 6,000 virtual users over 10 minutes reported the same result, even as Twine server metrics showed low CPU and memory usage.
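A sketch of how the phase-two scenario step could extend the earlier one, subscribing each virtual user to a room and flagging connections that never receive one of the once-per-second broadcasts; the 'subscribe' event name, room-id variable, and counter names are assumptions rather than Twine's actual protocol:

```javascript
// Phase-two additions, factored as a helper the scenario step could call once
// the Socket.IO client exists. Event and counter names are illustrative.
function trackRoomMessages(socket, events, context) {
  let receivedMessage = false;

  socket.on('connect', () => {
    // Subscribe this virtual user to a single room.
    socket.emit('subscribe', context.vars.roomId);
  });

  socket.on('message', () => {
    receivedMessage = true;
    events.emit('counter', 'twine.message_received', 1);
  });

  // At the end of the hold period, record whether this connection ever
  // received a room broadcast.
  setTimeout(() => {
    events.emit(
      'counter',
      receivedMessage ? 'twine.vu_received_message' : 'twine.vu_no_message',
      1
    );
  }, Number(context.vars.holdMs || 60000));
}

module.exports = { trackRoomMessages };
```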
The failure rate immediately dropped to 2-4% when we upgraded the Artillery servers and ran multiple 6,000-virtual-user load tests, but it increased when testing with more concurrent connections. Following these results, we ran a simplified Artillery test that ramped up to 48,000 concurrent connections over 20 minutes: message receipt errors occurred for 0.7% of virtual users. While the results were encouraging, the test configuration did not combine the /set-cookie and WebSocket requests into a single flow for each virtual user.
After examining server metrics and Artillery reports, we believe the errors were caused by an Artillery server bottleneck, the load test configuration, or a combination of the two. We are investigating further and are also working on configuring Artillery to load test connection state recovery on a per-virtual-user basis.
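As a sketch of what a per-virtual-user connection state recovery check could look like, assuming the Twine server enables Socket.IO's connection state recovery feature and reusing the socket from the scenario step above; the counter names and the 30-second delay are illustrative:

```javascript
// Sketch of a per-virtual-user connection state recovery check. Assumes the
// server enables Socket.IO connection state recovery.
function checkConnectionStateRecovery(socket, events) {
  let firstConnect = true;

  socket.on('connect', () => {
    if (firstConnect) {
      firstConnect = false;
      // Partway through the hold period, drop the low-level connection to
      // trigger an automatic reconnection (a clean socket.disconnect() would
      // not exercise recovery).
      setTimeout(() => socket.io.engine.close(), 30000);
      return;
    }
    // After reconnecting, socket.recovered reports whether the server
    // restored the session id, rooms, and any missed packets.
    events.emit(
      'counter',
      socket.recovered ? 'twine.state_recovered' : 'twine.state_lost',
      1
    );
  });
}

module.exports = { checkConnectionStateRecovery };
```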