Conversation
@connorgorman Do you have any concerns about limiting the resources to lower than recommended?
So far we have not hit any issues; e.g., during the hack-fest we deployed >70 instances.
The secured clusters were very small in those scenarios.
```diff
 type AnalyzerDefaults struct {
 	MemoryRequest resource.Quantity `env:"MEMORY_REQUEST" envDefault:"100M"`
-	CPURequest    resource.Quantity `env:"CPU_REQUEST" envDefault:"5m"`
+	CPURequest    resource.Quantity `env:"CPU_REQUEST" envDefault:"100m"`
 }
```
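For illustration, the env-default behaviour of the struct above can be sketched with stdlib-only Go. `parseCPUMillis` and `envOrDefault` are simplified stand-ins invented here; the real code presumably uses the `env`/`envDefault` struct tags and `k8s.io/apimachinery`'s `resource.ParseQuantity`:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseCPUMillis is a simplified stand-in for resource.Quantity
// parsing: "5m" -> 5 millicores, "2" -> 2000 millicores.
func parseCPUMillis(s string) (int64, error) {
	if strings.HasSuffix(s, "m") {
		return strconv.ParseInt(strings.TrimSuffix(s, "m"), 10, 64)
	}
	cores, err := strconv.ParseFloat(s, 64)
	if err != nil {
		return 0, err
	}
	return int64(cores * 1000), nil
}

// envOrDefault mimics the `env:"..." envDefault:"..."` tag behaviour:
// use the environment variable if set, else the default.
func envOrDefault(key, def string) string {
	if v, ok := os.LookupEnv(key); ok {
		return v
	}
	return def
}

func main() {
	cpu, err := parseCPUMillis(envOrDefault("CPU_REQUEST", "100m"))
	if err != nil {
		panic(err)
	}
	fmt.Printf("CPU request: %dm\n", cpu)
}
```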
The CPU requests are so low because a request reserves CPU capacity. We hit a problem where, with too many instances deployed, the cluster ran out of schedulable resources even though actual CPU usage was < 40%.
For now, the idea is that instances with higher resource requirements are scaled vertically by hand.
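The scheduling math behind that trade-off can be sketched as follows; the node size and the resulting counts are hypothetical, chosen only to show how requests, not actual usage, determine packing:

```go
package main

import "fmt"

func main() {
	// Hypothetical node with 4 allocatable cores (4000 millicores).
	// The scheduler packs pods by their *requests*: with the old 5m
	// request, 800 analyzers fit per node; with 100m, 40 instances
	// fully reserve the node even if real CPU usage stays below 40%.
	const allocatableMillis = 4000
	for _, requestMillis := range []int{5, 100} {
		fmt.Printf("request %dm -> %d instances fit per node\n",
			requestMillis, allocatableMillis/requestMillis)
	}
}
```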
FTR, we had an alert today where the 0.005-core request might have been at least a contributing factor.
@SimonBaeumer I am seeing quite a few "context deadline exceeded" timeout errors when contacting the database. There also seems to be a decent amount of throttling. I've removed the requests but increased the limits so that customers can at least get a core of resources.
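The shape of such a limits-only container spec might look like the fragment below; the values are illustrative, not taken from the PR. One caveat worth noting: if a container declares a limit but no request, Kubernetes defaults the request to the limit, so the templating has to drop or override the request field explicitly for "no request" to take effect.

```yaml
# Illustrative only: no CPU request declared, a one-core limit so each
# instance can burst up to a full core. Values are assumptions, not the
# PR's actual numbers.
resources:
  limits:
    cpu: "1"
    memory: 4Gi
```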
You could take care of https://issues.redhat.com/browse/ROX-16657 and bump the memory request to
@kylape I had initially adjusted the requests in this PR, but there were concerns that we weren't utilizing the cluster enough.
I think when customers start using the cloud service for production workloads, the average central will likely use more resources.
We can increase the resources, but we need an additional alert for memory and CPU requests exceeding cluster capacity.
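One way such an alert could be expressed, assuming kube-state-metrics is scraped and the Prometheus Operator's PrometheusRule CRD is available; the metric names come from kube-state-metrics, and the 90% threshold is an illustrative choice, not a value from this thread:

```yaml
# Sketch of an overcommit alert. Fires when the sum of CPU requests
# across all pods approaches the cluster's allocatable CPU; a sibling
# rule on resource="memory" would cover memory requests the same way.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cluster-overcommit
spec:
  groups:
  - name: capacity
    rules:
    - alert: CPURequestsNearClusterCapacity
      expr: |
        sum(kube_pod_container_resource_requests{resource="cpu"})
          / sum(kube_node_status_allocatable{resource="cpu"}) > 0.9
      for: 15m
      labels:
        severity: warning
```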
Let me try to sum up all the things.
With that in mind, I will approve this PR and bump up the cluster node count to reduce the risk of evictions in the short term.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: connorgorman, kylape. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files; approvers can indicate their approval by writing
Description
The increases in requests are up for debate, but the limits should be closer to what we expect customers to run in production. This could lead to more evictions, but it should also give customers much faster API times; I'm seeing timeouts in the metrics.
Checklist (Definition of Done)
Test manual
ROX-12345: ...
TODO: Add manual testing efforts