Conversation

@mikeauclair (Contributor)

Given that the underlying gocache struct takes an io.Reader as the body, pass the io.ReadCloser returned from the S3 SDK directly to it to avoid reading whole entries into memory. Also adjust the callers of GetData to close the returned io.ReadCloser.

@creachadair (Member) left a comment

Mechanically this change is fine, but I will note that it won't actually have the desired effect: The cache package accepts a reader there because that's what the prototype implementation does, but there is no practical way for it to actually stream data, so it's still going to read the whole thing into memory anyway. And the go tool will do the same, for roughly the same reason.

That said—I have no real objection to making the behaviour follow the interface. If we're going to do that, though, let's fix up the usage rather than changing the support library signature.

Review comment on this hunk:

```go
}
defer rc.Close()
return io.ReadAll(rc)
func (c *Client) GetData(ctx context.Context, key string) (io.ReadCloser, error) {
```

@creachadair (Member):
Let's not change the signature of this method—if callers want the semantics of a reader, I think they should switch to calling Get directly.

@mikeauclair (Contributor, Author) commented Mar 18, 2025

Can you say a little bit more about where in the chain the body would be read into memory regardless?

The protocol encodes all requests, data, and responses as JSON, using the standard encoder (on both ends). So you have to have the entire value in-hand before it can be written out anyway.

In theory one could write a custom encoder/decoder that turns the reader into an incremental writer for a base64 stream (with the JSON string framing, etc.), and reverses that on the other end, but that wouldn't currently be worthwhile, as the Go tool will pull the whole thing into memory anyway.

The protocol's use of a Reader here is aspirational; the protocol wasn't properly designed to support streaming. A better choice would have been to push the value as binary, since the caller provides the length anyway—but that wasn't how it shook out.

@creachadair (Member)

Apparently I failed at quoting your comment to reply to it, sorry.

@creachadair (Member)

The CI failure is not related to your change, if you rebase on main it should pass.

@mikeauclair (Contributor, Author)

> Can you say a little bit more about where in the chain the body would be read into memory regardless?

> The protocol encodes all requests, data, and responses as JSON, using the standard encoder (on both ends). So you have to have the entire value in-hand before it can be written out anyway.
>
> In theory one could write a custom encoder/decoder that turns the reader into an incremental writer for a base64 stream (with the JSON string framing, etc.), and reverses that on the other end, but that wouldn't currently be worthwhile, as the Go tool will pull the whole thing into memory anyway.
>
> The protocol's use of a Reader here is aspirational; the protocol wasn't properly designed to support streaming. A better choice would have been to push the value as binary, since the caller provides the length anyway—but that wasn't how it shook out.

Totally understood that the go bin will have to read the whole payload into memory, and that Puts necessarily have to have the data, but it looks like Get (which returns the path on the FS) doesn't actually need to have the payload in memory at all, as long as (in the "local miss but S3 hit" case) it can write it to the FS and get the path - am I missing something there?

For context, we're using this plugin in an environment where we're seeing a lot of CPU throttling in Kubernetes, and I'm trying to drive down GC-driven CPU utilization on Gets (the majority case in our situation) to avoid having to decrease GOMAXPROCS for the consuming Go process. Using my fork reduces CPU usage in my case by ~50%.

@creachadair (Member)

> Totally understood that the go bin will have to read the whole payload into memory, and that Puts necessarily have to have the data, but it looks like Get (which returns the path on the FS) doesn't actually need to have the payload in memory at all, as long as (in the "local miss but S3 hit" case) it can write it to the FS and get the path - am I missing something there?

No, I think that's right. And the disk cache implementation (already) handles that case, so it certainly won't break anything.

@creachadair (Member) left a comment

Thanks for your patience.

@creachadair creachadair merged commit 7346bbd into tailscale:main Mar 18, 2025
1 check passed
@creachadair (Member)

This is now tagged at and after v0.0.22.

@mikeauclair mikeauclair deleted the stream-get-data-to-disk branch March 20, 2025 14:26