Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Next steps with groupcache (galaxycache) #5037

Open
3 of 7 tasks
GiedriusS opened this issue Jan 6, 2022 · 10 comments
Open
3 of 7 tasks

Next steps with groupcache (galaxycache) #5037

GiedriusS opened this issue Jan 6, 2022 · 10 comments
Labels
component: store dont-go-stale Label for important issues which tells the stalebot not to close them feature request/improvement help wanted

Comments

@GiedriusS
Copy link
Member

GiedriusS commented Jan 6, 2022

After #4818 has been merged, I am writing down a list of TODO:

  • Make pull request that pushes changes (Thanos update) back into Cortex
  • Write documentation
  • Switch back to the original galaxycache version by implementing the epoch strategy on keys suggested here (*: add TTL support vimeo/galaxycache#25, https://medium.com/vimeo-engineering-blog/video-metadata-caching-at-vimeo-a54b25f0b304)
  • Add microbenchmarks for fetching keys via HTTP.
  • Enable H2C on our HTTP server to enable multiplexing even without TLS
  • Add e2e tests with TLS on Thanos Store w/ galaxycache. Currently galaxycache's HTTP client doesn't use TLS even though the server can do it.
  • Implement fetching multiple keys over one HTTP request. This isn't so trivial since we directly copy data from memory to the response's body :/ needs some kind of encoding scheme here that has minimal overhead?

For implementing the last two, perhaps we could even switch to gRPC?

Maybe I have missed something?

@yeya24
Copy link
Contributor

yeya24 commented Jan 8, 2022

Looking forward to the documentation. Can't wait to try it 👍

@GiedriusS
Copy link
Member Author

#5068 fixes [4] and [5].

@stale
Copy link

stale bot commented Apr 16, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Apr 16, 2022
@stale
Copy link

stale bot commented Apr 30, 2022

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Apr 30, 2022
@GiedriusS GiedriusS reopened this Apr 30, 2022
@stale stale bot removed the stale label Apr 30, 2022
@GiedriusS
Copy link
Member Author

From our experience, groupcache is still slow due to the last point - we need to fetch multiple keys in one request.

@stale
Copy link

stale bot commented Aug 13, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Aug 13, 2022
@GiedriusS GiedriusS added dont-go-stale Label for important issues which tells the stalebot not to close them and removed stale labels Aug 16, 2022
@pulkitent
Copy link

Hey @GiedriusS, I would like to pick a sub-issue of this open issue which is Make pull request that pushes changes (Thanos update) back into Cortex

As I am a newbie to open source contribution & Thanos codebase, can someone share some docs to help me understand the requirement better?

Any acceptance criteria for this feature? Or which flow/component in the Thanos codebase should I focus on? :)

@bwplotka
Copy link
Member

Amazing, thanks! So Make pull request that pushes changes (Thanos update) back into Cortex : I wonder if this is still relevant, given we vendor Cortex code. Could you double-check @GiedriusS ? Also I think it less important, given we are not sure if groupcache can be a better solution than Redis. We need to first figure out if groupcache is worth pursuing, so perhaps your (any new ideas) around making it faster (and benchmark) should be the next step here.

cc @yeya24 @GiedriusS does it sounds right?

@GiedriusS
Copy link
Member Author

Thanks for picking this up! I am personally doubtful whether it is needed to pursue this further, I might even be leaning towards deleting groupcache before 1.0 😢 as far as I understand, the problem is that we are serializing access to a single map on each key retrieval: https://github.com/vimeo/galaxycache/blob/master/singleflight/singleflight.go#L41-L64. In Thanos, on a bigger query we can easily get thousands of requests at the same time. I saw this once I've tried deploying groupcache to prod. The latency even jumped after groupcache was enabled 😱 I think the performance issue must be the contention. This blog post is pretty illuminating: https://dgraph.io/blog/post/introducing-ristretto-high-perf-go-cache/. I wonder if there's a better strategy we could use for the map 🤔

@bwplotka
Copy link
Member

Got it. I think it's nice piece to work, but best effort. Still something to consider if blockers are improved.

What would be a benchmark that could reproduce the problem? What requirements we are looking for in terms of efficiency here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component: store dont-go-stale Label for important issues which tells the stalebot not to close them feature request/improvement help wanted
Projects
None yet
Development

No branches or pull requests

4 participants