Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement pool telemetry #248

Merged
merged 15 commits into from
Jan 7, 2024
Merged

Implement pool telemetry #248

merged 15 commits into from
Jan 7, 2024

Conversation

oliveigah
Copy link
Contributor

@oliveigah oliveigah commented Nov 4, 2023

Second proposal to solve #183

After gather some feedback, I settled down on these 2 structs for the metrics:

   %Finch.HTTP1.PoolMetrics{
     pool_index: 1,
     pool_size: 50,
     available_connections: 40,
     in_use_connections: 10
   }
   
   %Finch.HTTP2.PoolMetrics{
      pool_index: 1,
      in_flight_requests: 4
   }

The Finch.get_pool_status/2 interface also changed a bit. Now it returns a list of metrics struct, one for each pool started defined by the count option. Like this:

Finch.get_pool_status(MyFinch, "https://somehost.com:4321")

{:ok,
 [
   %Finch.HTTP1.PoolMetrics{
     pool_index: 1,
     pool_size: 50,
     available_connections: 40,
     in_use_connections: 10
   },
   %Finch.HTTP1.PoolMetrics{
     pool_index: 2,
     pool_size: 50,
     available_connections: 40,
     in_use_connections: 10
   }
 ]}

I've also added some fields to the pool state (such as finch_name and pool_index) in order to enable future changes more easily.

Copy link
Owner

@sneako sneako left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for this contribution!

Running the test suite locally on main does require your patch for me as well so that is a welcome addition.

I have added some comments, please let me know what you think

lib/finch.ex Outdated Show resolved Hide resolved
lib/finch/http1/pool_metrics.ex Show resolved Hide resolved
lib/finch.ex Outdated Show resolved Hide resolved
lib/finch/http1/pool.ex Outdated Show resolved Hide resolved
@oliveigah
Copy link
Contributor Author

I agree with all suggestions! I'm currently on a business trip so probably won't be able to work on this until friday.

Regarding the averages and maxes, I think the potential margin of error is very low when we consider a large number of samples (which should be most cases).

The problem is if you take the measures without enough samples and therefore this margin of error may be more relevant.

One possible workaround would be only serve this measurements if there is enough samples, like 100. In this scenario even if the base metrics are slightly out of sync the real impact on the average/max measurements should not be relevant.

Anyway, I think is better to move on like you suggested with only available and in use connections and we can think more about this problem later.

Thanks for the feedback! (:

pool_timeout = Keyword.get(opts, :pool_timeout, 5_000)
receive_timeout = Keyword.get(opts, :receive_timeout, 15_000)

metadata = %{request: req, pool: pool}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're already passing the Finch name in, could we add it to the metadata so it's available in telemetry? It would be really useful in monitoring wait time per Finch instance?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes exactly @aselder I was thinking these changes enable #199 . I think that should be implemented in a new PR once this is complete.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good idea indeed. But just to be clear, I've only added the finch_name on the Finch.Pool request callback because of averages and maxes metrics.

If we gonna move on only with available and in use connections this is not necessary.

Should I keep it here unused anyway in order to enable this change? What you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering the modifications on #252 I think would be better to remove these changes from this PR in order to avoid conflicts.

@oliveigah
Copy link
Contributor Author

I'm working on the pool metrics implementation for HTTP2 and would like to know which metrics you think would be good to keep track of.

I'm thinking about in flight requests instead of available connections because it makes more sense on a multiplex scenario like HTTP2. There is another metric you think would be good @sneako?

@oliveigah
Copy link
Contributor Author

Regarding finch configurations with a pool count greater than one. Since the multiple pools are not exposed to the user in any way, I think makes sense to aggregate metrics from multiple pools under a single counter and generate metrics as there is only one pool.

An alternative would be keep separate metrics for each started pool using some index strategy and just show it as a unified metric when the user calls Finch.get_pool_status (maybe even have some option to show it separately). This would enable different forms of load balancing such as least_conn when there is more than one pool but the implementation may be harder.

What you think? I'm good with both. Just let me known the approach that better suits your vision for the library @sneako! (:

@sneako
Copy link
Owner

sneako commented Nov 20, 2023

Thanks @oliveigah !

I agree inflight requests make more sense for h2 👍

I think individual pool metrics will be more useful than only exposing the aggregates. I definitely like the idea of enabling new load balancing strategies. Individual metrics will also be more useful when users are configuring their pool sizes & counts. Any aggregation I assume will take place somewhere else in the observability stack (ie Datadog or similar)

@oliveigah oliveigah reopened this Nov 25, 2023
@oliveigah
Copy link
Contributor Author

I think the implementation is done. Gonna start to work on the docs now.

In summary:

  • No more persistent_term call is needed when adding to the metrics, the atomic references are stored inside the pool state
  • Finch.get_pool_status/2 now returns a list of metrics struct. Each struct contains a pool_index attribute that represents the index of the pool.
  • I've kept some other attributes on the pool state in order to facilitate future changes. This may causes some conflicts with Add finch instance name to telemetry metadata #252. Gonna wait for the merge to solve it.
  • The only persistent term that still remains is the pool manager that keep all atomic refs related to a given finch instance and shp, this is used on the Finch.get_pool_status/2 call.
  • Both h1 and h2 are ready

Let me know what you think @sneako. (:

@oliveigah
Copy link
Contributor Author

oliveigah commented Nov 26, 2023

  • No more persistent_term call is needed when adding to the metrics, the atomic references are stored inside the pool state

I've found a bug with this implementation. When the dynamic supervisor restarts the pool, the PoolManager references are not updated, leading to unexpected behaviour.

Gonna work on a fix for this, probably gonna need to store the reference on a persistent term and call it N times when the user call Finch.get_pool_status/2.

To add and subtract from the atomics I still think no persistent term call will be needed but the calls for Finch.get_pool_status/2 will need to execute N calls to persistent term instead of just one.

Maybe an ETS would be a better fit for what Finch.get_pool_status/2 needs. Since it's use case is far less performance sensitive than the add and subtract it may work. Gonna explore it a bit.

@oliveigah oliveigah marked this pull request as ready for review November 27, 2023 12:47
@oliveigah
Copy link
Contributor Author

I've found a bug with this implementation. When the dynamic supervisor restarts the pool, the PoolManager references are not updated, leading to unexpected behaviour.

Solved!

I think the PR is ready for review now! I've made some changes to some test functions to address some problems with flaky tests.

Mainly what I did is check for available ports dynamically and reuse the same server for all tests.

Let me know your thoughts! (:

lib/finch.ex Outdated Show resolved Hide resolved
lib/finch/http1/pool.ex Outdated Show resolved Hide resolved
@sneako sneako linked an issue Dec 8, 2023 that may be closed by this pull request
@oliveigah
Copy link
Contributor Author

Done @sneako! (:

Let me know if you see any other improvement to the proposed implementation.

Copy link
Owner

@sneako sneako left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @oliveigah!

@sneako sneako merged commit 3946acb into sneako:main Jan 7, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Test failing under OTP26/Elixir 1.15
3 participants