
fix(pg): fix binary format buffer handling #3496


Open · wants to merge 2 commits into master

Conversation

faulpeltz

@faulpeltz faulpeltz commented Jun 20, 2025

This tries to fix #3495

Functionally, this seems to fix the issue and the tests pass (edit: the stream tests are currently not working), but performance is not ideal: in binary mode a buffer slice is now created first, and then a string from the slice when a string is needed, whereas previously the string was created directly from the buffer. Maybe it would be better to hint the field type to the parser before parsing instead?

  • change Parser.parseDataRowMessage() to produce a buffer slice instead of a string
  • change Result.parseRow() and _parseRowAsArray() to create the string from the buffer slice if not in binary mode

@hjr3
Collaborator

hjr3 commented Jun 20, 2025

Did binary parsing for int32 work prior to #3494? If so, I wonder if we could special-case the array parsing to fix the regression. That would give us time to explore a larger, more correct change.

> change Parser.parseDataRowMessage() to produce a buffer slice instead of a string

I like the idea of passing around a buffer of bytes and only converting to a string at the last possible moment.

@faulpeltz
Author

No, I don't think it worked, but I never got to that point because the bind argument protocol message was wrong for binary mode.
I was just about to try to fix that after digging around with Wireshark when the new version was released 😅

Please feel free to make any necessary changes; I've never worked on the internals of node-postgres before, but I have written similar binary message parsers in JS.

@hjr3
Collaborator

hjr3 commented Jun 20, 2025

I will dive into this more later tonight and this weekend. I also plan on adding more data-type parsing integration tests.

@hjr3
Collaborator

hjr3 commented Jun 21, 2025

I am not finding the performance regression.

Before:

  1. parseDataRowMessage converts the buffer into a string using this.reader.string(len), which uses buffer.toString. This allocates.
  2. parseRow() and _parseRowAsArray() pass the string to the parser. This does not allocate.

After:

  1. parseDataRowMessage creates a buffer slice using this.reader.bytes(len), which uses buffer.slice. This does NOT allocate.
  2. parseRow() and _parseRowAsArray() convert the buffer into a string. This allocates.

In both cases, we now have a single allocation. (Actually, my recent fix introduced another allocation by adding the additional Buffer.from(rawValue), but you already fixed that.)

I did add tests for every type I could think of.
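The allocation comparison above can be demonstrated directly in Node: subarray (which Buffer's slice aliases) returns a new Buffer object over the same memory without copying bytes, while toString materializes a new string. A small sketch:

```javascript
// subarray() returns a new Buffer *object* that shares the parent's
// memory; no bytes are copied. toString() allocates a fresh JS string.
const buf = Buffer.from('hello world')

const view = buf.subarray(0, 5) // view over the same bytes
view[0] = 0x48 // write 'H' through the view...
console.log(buf.toString()) // ...and the parent sees it: "Hello world"

const copy = buf.toString('utf8', 0, 5) // "Hello" as a newly allocated string
```

So the cost that moved in this PR is the small Buffer view object itself, not a copy of the payload bytes.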

@charmander
Collaborator

I think the allocation they’re referring to is the additional Buffer view object itself on the string path.

@faulpeltz
Author

> I think the allocation they’re referring to is the additional Buffer view object itself on the string path.

Yes, I was referring to the creation of the additional Buffer view when parsing, but since no allocation/copy of the underlying bytes is involved, it's probably not relevant.

I was also just wondering about performance in general: when comparing text vs. binary mode in my app, I cannot see any improvement when fetching large numbers of rows.

@hjr3
Collaborator

hjr3 commented Jun 21, 2025

Ah, the new object from buffer.slice that is now created in the string path.

> but because there is no allocation/copy involved it's probably not relevant.

I would think this as well! I wrote a quick benchmark to be sure:

```javascript
import { run, bench } from 'mitata'
import { Client } from 'pg'

const client = new Client()
await client.connect()

bench('text format', async () => {
  const rowCount = 50000
  return client.query({
    text: `SELECT
      generate_series(1, $1) as id,
      (random() * 10000)::int as int_val,
      'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.' as text_val,
      ARRAY[random(), random(), random()]::real[] as array_val`,
    values: [rowCount],
    binary: false,
  })
})

await run({
  format: 'mitata',
  colors: true,
  throw: true,
})

await client.end()
```

This was run on my local laptop, so it's not scientific. However, I am seeing this difference consistently:

Before: [benchmark screenshot]

After: [benchmark screenshot]

If this is true, then maybe we should look at hinting the format to parseDataRowMessage and making DataRowMessage generic over String and Buffer. I think that may be a fair amount of work to accomplish, though.
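A hypothetical sketch of the format-hinting idea (none of these names or shapes are the actual node-postgres API): if the parser knows each field's wire format up front, it can materialize every field exactly once, as a string for text-format fields and a Buffer for binary ones.

```javascript
// Hypothetical: the connection passes per-field format hints (from the
// RowDescription / result format codes) into the data-row parser, so no
// intermediate representation is ever produced.
function parseDataRowMessage(buffer, fieldFormats) {
  const fieldCount = buffer.readUInt16BE(0)
  const fields = new Array(fieldCount)
  let offset = 2
  for (let i = 0; i < fieldCount; i++) {
    const len = buffer.readInt32BE(offset)
    offset += 4
    if (len === -1) {
      fields[i] = null // SQL NULL
      continue
    }
    fields[i] =
      fieldFormats[i] === 'binary'
        ? buffer.subarray(offset, offset + len) // keep raw bytes
        : buffer.toString('utf8', offset, offset + len) // decode once
    offset += len
  }
  return fields
}
```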

I will keep looking as I get time.

@hjr3
Collaborator

hjr3 commented Jun 21, 2025

> I was also just wondering about performance in general - when comparing text vs. binary mode in my app I cannot see any improvements when fetching large amounts of rows

I am seeing a noticeable difference over large results. Same benchmark as above, but text vs. binary:

Text: [benchmark screenshot]

Binary: [benchmark screenshot]

Are you not seeing any difference? Can you share your benchmark?

@faulpeltz
Author

Regarding your benchmark: in binary mode, the query result for the float array is only string garbage; it seems type 1021 (element type 700) is not supported by pg-types in binary mode, which might skew the benchmark result.
Also, the noParser always creates a string from the Buffer in binary mode, which I think is not a good default.
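As a small sketch of that last point (the names here are illustrative, not the actual node-postgres/pg-types API): stringifying unknown binary values discards the bytes' meaning, whereas a pass-through default would let the caller still decode them.

```javascript
// Hypothetical fallback parsers for types with no registered parser.
const noParseText = (value) => value // text mode: value is already a string
const noParseBinaryCurrent = (value) => value.toString('utf8') // lossy for non-text payloads
const noParseBinaryProposed = (value) => value // keep the raw Buffer for the caller
```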

> Are you not seeing any difference? Can you share your benchmark?

In the case of my app (which I cannot share), I wasn't able to measure any meaningful difference.
The part I was measuring basically just dumps large-ish tables to an ND-JSON stream. From profiling, the pg row parsing is still the largest chunk, but under 30%, so it's hard to say what's going on.

Development

Successfully merging this pull request may close these issues.

pg 8.16.2 in binary mode produces incorrect row values
3 participants