
perf: try to avoid buffer allocations #1998

Closed
ronag wants to merge 3 commits from the avoid-buf-allocations branch

Conversation

@ronag (Contributor) commented Jan 4, 2022:

No description provided.

@ronag (Contributor, Author) commented Jan 5, 2022:

const net = require('net')

const server = net.createServer((socket) => {
  socket.resume()
  socket.on('error', () => {})
}).listen(0)

const socket = net.connect(server.address().port, '0.0.0.0', () => {
  const str = '1'.repeat(4096)
  for (let n = 0; n < 1024 * 128; ++n) {
    // Toggle between the two lines to compare the variants:
    // socket.write(str)          // string variant: no per-write allocation
    socket.write(toBuffer(str))   // toBuffer variant: one Buffer per write
  }
  socket.end()
  socket.on('finish', () => {
    socket.on('error', () => {})
    server.close()
  })
})

// ws's toBuffer() helper, copied here for the benchmark.
function toBuffer(data) {
  toBuffer.readOnly = true;

  if (Buffer.isBuffer(data)) return data;

  let buf;

  if (data instanceof ArrayBuffer) {
    buf = Buffer.from(data);
  } else if (ArrayBuffer.isView(data)) {
    buf = Buffer.from(data.buffer, data.byteOffset, data.byteLength);
  } else {
    buf = Buffer.from(data);
    toBuffer.readOnly = false;
  }

  return buf;
}
ws$ time node bench.js   # string variant: socket.write(str)

real    0m0.324s
user    0m0.144s
sys     0m0.204s
ws$ time node bench.js   # toBuffer variant: socket.write(toBuffer(str))

real    0m0.857s
user    0m0.687s
sys     0m0.235s

It's more than 2x faster without toBuffer.

@ronag force-pushed the avoid-buf-allocations branch 7 times, most recently from 8875b42 to b163d6c on January 5, 2022 at 13:00
lib/sender.js (outdated), comment on lines 319 to 320:

const buf =
  perMessageDeflate || typeof data !== 'string' ? toBuffer(data) : data;
@lpinca (Member) commented Jan 5, 2022:

The buf.length >= perMessageDeflate._threshold check below might not work as expected when data is a string, because it does not use the string's byte length.

Also, when data is a string and permessage-deflate is enabled, toBuffer() should ideally be called only when the message will actually be compressed (this._compress === true).

It would be better to make this permessage-deflate agnostic. Removing the perMessageDeflate check here and passing the string to perMessageDeflate.compress() should work. The problem is that enqueue() and dequeue() currently use the buffer length to update the _bufferedBytes value.
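(To illustrate the first point with made-up values: String#length counts UTF-16 code units, not encoded bytes, so a threshold check against it can disagree with the converted buffer's length.)

const data = 'héllo';                  // 5 characters
console.log(data.length);              // 5
console.log(Buffer.byteLength(data));  // 6 — 'é' is 2 bytes in UTF-8
// So `data.length >= perMessageDeflate._threshold` and
// `toBuffer(data).length >= perMessageDeflate._threshold` can disagree
// for non-ASCII payloads.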

@ronag (Contributor, Author) replied Jan 5, 2022:

That branch is only used if perMessageDeflate is true, in which case data is converted to a buffer.

@ronag (Contributor, Author) replied:

While I agree that we could improve it, I don't think there is an issue here currently; it's just a little hard to read.

@lpinca (Member) replied:

Correct. Anyway, at the moment this improvement only works when permessage-deflate is disabled. What about websocket.ping() and websocket.pong()?

@ronag (Contributor, Author) replied Jan 5, 2022:

> What about websocket.ping() and websocket.pong()?

I don't see them as a hot path.

@lpinca (Member) commented Jan 5, 2022:

I think the benchmark is a bit fabricated. It writes 512 MiB synchronously by calling socket.write() 1024 * 128 times. There are 1024 * 128 allocations when using the toBuffer() variant but a lot fewer when using the string variant, because there is only one allocation for the data buffered in the socket writable queue. Also, in the string variant the string is always the same. There is basically no memory usage.

I don't think this is realistic.

@ronag (Contributor, Author) commented Jan 5, 2022:

> I think the benchmark is a bit fabricated. […]

I mean, the point here is that all of those buffer allocations are not necessary. The strings are already allocated. This just demonstrates the overhead of creating all of those buffers that you would otherwise not create, which is basically what happens in production when you are sending a large number of strings to ws.

i.e. what this PR is trying to do is avoid some of the overhead of going string -> Buffer. Hence the benchmark should only show the difference between doing the conversion in js before writing to the socket instance or inside the native writeGeneric.

> I don't think this is realistic.

While it could maybe be better, I don't see why it wouldn't be realistic. It's basically what's happening in my case, although not with quite such large values as 512 MiB.
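(Concretely, the two paths under comparison, with the names from the bench.js snippet above:)

// Variant A: allocate a Buffer in JS for every write.
socket.write(toBuffer(str))

// Variant B: pass the string through; the UTF-8 encoding is performed later,
// inside the native write path (writeGeneric), with no intermediate Buffer
// created in JS.
socket.write(str)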

@ronag (Contributor, Author) commented Jan 5, 2022:

Even with smaller values, e.g. 4 MiB, there is an improvement:

ws$ time node bench

real    0m0.058s
user    0m0.043s
sys     0m0.013s
ws$ time node bench

real    0m0.041s
user    0m0.031s
sys     0m0.009s
ws$ 

@ronag (Contributor, Author) commented Jan 5, 2022:

Added a cronometro benchmark:

╔══════════════╤═════════╤═══════════════╤═══════════╤═════════════════════════╗
║ Slower tests │ Samples │        Result │ Tolerance │ Difference with slowest ║
╟──────────────┼─────────┼───────────────┼───────────┼─────────────────────────╢
║ toBuffer     │     100 │ 141.63 op/sec │  ± 1.44 % │                         ║
╟──────────────┼─────────┼───────────────┼───────────┼─────────────────────────╢
║ Fastest test │ Samples │        Result │ Tolerance │ Difference with slowest ║
╟──────────────┼─────────┼───────────────┼───────────┼─────────────────────────╢
║ string       │     100 │ 495.38 op/sec │  ± 2.48 % │ + 249.78 %              ║
╚══════════════╧═════════╧═══════════════╧═══════════╧═════════════════════════╝
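(The benchmark itself isn't reproduced in the thread. A rough sketch of how the two variants might be wired up with cronometro follows; the callback-style test signature and options are assumed from cronometro's README, and backpressure handling is omitted for brevity.)

'use strict'

const cronometro = require('cronometro')

// Assumes `socket` (a connected net.Socket), `str` (a 4096-char string),
// and `toBuffer()` are set up as in the earlier bench.js snippet.
cronometro(
  {
    toBuffer (done) {
      // Convert the string to a Buffer in JS before writing.
      socket.write(toBuffer(str), done)
    },
    string (done) {
      // Write the string directly; encoding happens inside the stream.
      socket.write(str, done)
    }
  },
  { iterations: 1000 },
  (err) => {
    if (err) throw err
  }
)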

@ronag (Contributor, Author) commented Jan 5, 2022:

Without pre-allocating the string:

╔══════════════╤═════════╤══════════════╤═══════════╤═════════════════════════╗
║ Slower tests │ Samples │       Result │ Tolerance │ Difference with slowest ║
╟──────────────┼─────────┼──────────────┼───────────┼─────────────────────────╢
║ toBuffer     │      99 │ 49.33 op/sec │  ± 0.39 % │                         ║
╟──────────────┼─────────┼──────────────┼───────────┼─────────────────────────╢
║ Fastest test │ Samples │       Result │ Tolerance │ Difference with slowest ║
╟──────────────┼─────────┼──────────────┼───────────┼─────────────────────────╢
║ string       │     100 │ 71.23 op/sec │  ± 1.36 % │ + 44.40 %               ║
╚══════════════╧═════════╧══════════════╧═══════════╧═════════════════════════╝

@lpinca (Member) commented Jan 5, 2022:

╔══════════════╤═════════╤══════════════╤═══════════╤═════════════════════════╗
║ Slower tests │ Samples │       Result │ Tolerance │ Difference with slowest ║
╟──────────────┼─────────┼──────────────┼───────────┼─────────────────────────╢
║ toBuffer     │     100 │ 41.60 op/sec │  ± 1.00 % │                         ║
╟──────────────┼─────────┼──────────────┼───────────┼─────────────────────────╢
║ Fastest test │ Samples │       Result │ Tolerance │ Difference with slowest ║
╟──────────────┼─────────┼──────────────┼───────────┼─────────────────────────╢
║ string       │     100 │ 42.91 op/sec │  ± 1.66 % │ + 3.15 %                ║
╚══════════════╧═════════╧══════════════╧═══════════╧═════════════════════════╝

This is with Buffer.from() instead of toBuffer(), i.e. with the Buffer, ArrayBuffer, and DataView checks removed.

> Hence the benchmark should only show the difference between doing the conversion in "user space" or inside the native writeGeneric

The problem with this is that the difference is only the result of the huge amount of buffered data in the socket writable queue. The difference should not be noticeable if data is flowing regularly, because the number of allocations would basically be 1:1.

@ronag (Contributor, Author) commented Jan 5, 2022:

> This is with Buffer.from() instead of toBuffer(), i.e. with the Buffer, ArrayBuffer, and DataView checks removed.

Sorry, I don't understand. We're using toBuffer just as ws would.

> The problem with this is that the difference is only the result of the huge amount of buffered data in the socket writable queue. […]

This is with 4 MiB of back pressure, which is not unrealistic.

@ronag (Contributor, Author) commented Jan 5, 2022:

I don't quite see why you think the benchmark is not relevant. According to you, what would a "correct" benchmark look like?

@lpinca (Member) commented Jan 5, 2022:

> Sorry, I don't understand. We're using toBuffer just as ws would. We are not using Buffer.from.

Yes, but we could optimize it for the string case instead of dealing with the string case only at the end.

@ronag (Contributor, Author) commented Jan 5, 2022:

OK, updated benchmark:

  • Backpressure down to 16 * 4096
  • Added toBuffer2, which checks for the string case earlier (see the sketch after the table)
╔══════════════╤═════════╤════════════════╤═══════════╤═════════════════════════╗
║ Slower tests │ Samples │         Result │ Tolerance │ Difference with slowest ║
╟──────────────┼─────────┼────────────────┼───────────┼─────────────────────────╢
║ toBuffer2    │    1000 │ 1522.18 op/sec │  ± 1.60 % │                         ║
║ toBuffer     │     999 │ 1735.66 op/sec │  ± 0.89 % │ + 14.02 %               ║
╟──────────────┼─────────┼────────────────┼───────────┼─────────────────────────╢
║ Fastest test │ Samples │         Result │ Tolerance │ Difference with slowest ║
╟──────────────┼─────────┼────────────────┼───────────┼─────────────────────────╢
║ string       │     999 │ 2606.76 op/sec │  ± 0.72 % │ + 71.25 %               ║
╚══════════════╧═════════╧════════════════╧═══════════╧═════════════════════════╝
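(The toBuffer2 variant isn't shown in the thread; a plausible reconstruction, following lpinca's suggestion to handle the string case first rather than only in the final else branch:)

// Hypothetical string-first variant of toBuffer() used in the benchmark.
function toBuffer2(data) {
  if (typeof data === 'string') {
    toBuffer2.readOnly = false; // mirrors toBuffer(): a fresh copy is made
    return Buffer.from(data);
  }
  return toBuffer(data); // non-string inputs take the original path
}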

@lpinca (Member) commented Jan 5, 2022:

It's 15% on my machine but only 3% with Buffer.from(); a 12% difference just for the toBuffer() wrapper is odd.

@ronag (Contributor, Author) commented Jan 5, 2022:

On an EPYC server with Node v17.3.0:

╔══════════════╤═════════╤════════════════╤═══════════╤═════════════════════════╗
║ Slower tests │ Samples │         Result │ Tolerance │ Difference with slowest ║
╟──────────────┼─────────┼────────────────┼───────────┼─────────────────────────╢
║ toBuffer     │    1000 │ 1920.40 op/sec │  ± 1.36 % │                         ║
║ bufferFrom   │     999 │ 2024.28 op/sec │  ± 0.93 % │ + 5.41 %                ║
╟──────────────┼─────────┼────────────────┼───────────┼─────────────────────────╢
║ Fastest test │ Samples │         Result │ Tolerance │ Difference with slowest ║
╟──────────────┼─────────┼────────────────┼───────────┼─────────────────────────╢
║ string       │    1000 │ 3084.64 op/sec │  ± 1.56 % │ + 60.63 %               ║
╚══════════════╧═════════╧════════════════╧═══════════╧═════════════════════════╝

@lpinca (Member) commented Jan 5, 2022:

How do I make cronometro wait for the socket to be connected?

@ronag (Contributor, Author) commented Jan 5, 2022:

> How do I make cronometro wait for the socket to be connected?

cronometro does a warmup run, so waiting for the socket doesn't really make any difference.

@ronag (Contributor, Author) commented Jan 6, 2022:

OK, always sending the same string does not make sense and invalidates the test.

Added another one. See above.

@lpinca (Member) commented Jan 6, 2022:

That's a ~5% improvement.

@ronag (Contributor, Author) commented Jan 6, 2022:

And reduced GC pressure.

@ronag (Contributor, Author) commented Jan 6, 2022:

BTW, I did kind of manage to resolve that with the current version by setting Buffer.poolSize = 128 * 1024.
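(A plausible explanation, based on Node's documented Buffer pooling rather than anything stated in the thread: Buffer.from(string) is sliced out of the shared pre-allocated pool only when the encoded length is smaller than Buffer.poolSize >>> 1, so with the default 8 KiB pool a 4096-byte string just misses the cutoff and gets its own allocation on every call.)

// Default pool: Buffer.poolSize === 8192, so only strings encoding to
// fewer than 4096 bytes are served from the shared pool.
Buffer.poolSize = 128 * 1024; // cutoff becomes 64 KiB; the 4096-byte
                              // payloads above are now pool-allocated.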

@ronag closed this Jan 6, 2022
@lpinca (Member) commented Jan 6, 2022:

Yes, it's a small improvement that has some readability and maintenance drawbacks. We can move this forward, but I want to make it general and handle websocket.ping(), websocket.pong(), and permessage-deflate. It shouldn't be hard: we only have to precompute the message length in bytes and pass it along to sender.enqueue(), sender.dequeue(), and Sender.frame() with a dedicated option (see the sketch below).

Feel free to reopen and do that. I might open a new PR for this later.
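(A minimal sketch of that precomputation; the option shape is hypothetical and enqueue()/dequeue()/frame() are ws internals, so the real change may differ.)

// Compute the payload size in bytes once, without converting strings.
const byteLength =
  typeof data === 'string' ? Buffer.byteLength(data) : data.byteLength;

// Hypothetical: thread it through as a dedicated option so the
// _bufferedBytes bookkeeping no longer needs a Buffer, e.g.
// this.enqueue([this.dispatch, data, compress, { ...options, byteLength }]);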

@lpinca (Member) commented Jan 6, 2022:

Anyway, thank you for bearing with me.

@ronag reopened this Jan 6, 2022

@ronag (Contributor, Author) commented Jan 6, 2022:

Something like this?

@ronag requested a review from lpinca on January 6, 2022 at 12:31
@ronag force-pushed the avoid-buf-allocations branch 6 times, most recently from a5cf313 to c285e36 on January 6, 2022 at 13:15
@ronag (Contributor, Author) commented Jan 6, 2022:

Tried cleaning it up a little more, with proper commit messages. I'm a little exhausted on this one at the moment; I hope this can be included in some form. Feel free to take over.

lpinca added commits referencing this pull request on Jan 6 (twice) and Jan 7 (three times), 2022, each with the message:

Do not convert strings to `Buffer`s if data does not need to be masked.

Refs: #1998

@lpinca mentioned this pull request Jan 6, 2022

@ronag closed this Jan 9, 2022

lpinca added another commit with the same message referencing this pull request Jan 13, 2022
th37rose added a commit with the same message to th37rose/websocket_server referencing this pull request May 24, 2022 (Refs: websockets/ws#1998)