ROX-16883: Increase resource limits#723

Merged
connorgorman merged 1 commit into main from cgorman-resource-tweaks
May 4, 2023
Conversation

@connorgorman
Contributor

@connorgorman connorgorman commented Jan 13, 2023

Description

The increases in requests are up for debate, but the limits should be closer to what we expect customers to run in production. This could lead to more evictions, but it should also give customers much faster API times. I'm seeing timeouts in the metrics.

Checklist (Definition of Done)

  • Unit and integration tests added
  • Added test description under Test manual
  • Evaluated and added CHANGELOG.md entry if required
  • Documentation added if necessary (i.e. changes to dev setup, test execution, ...)
  • CI and all relevant tests are passing
  • Add the ticket number to the PR title if available, i.e. ROX-12345: ...
  • Discussed security- and business-related topics privately. Will move any security- or business-related topics that arise to a private communication channel.

Test manual

TODO: Add manual testing efforts

# To run tests locally:
make db/teardown db/setup db/migrate
make ocm/setup OCM_OFFLINE_TOKEN=<ocm-offline-token> OCM_ENV=development
make verify lint binary test test/integration

@connorgorman connorgorman temporarily deployed to development January 13, 2023 20:49 — with GitHub Actions Inactive
@connorgorman connorgorman requested a review from ebensh January 13, 2023 20:49
Contributor

@SimonBaeumer SimonBaeumer left a comment


@connorgorman Do you have any concerns about limiting the resources to lower than recommended?
So far we have not hit any issues, e.g. during the hack-fest deploying >70 instances.
The secured clusters were very small in those scenarios.

type AnalyzerDefaults struct {
	MemoryRequest resource.Quantity `env:"MEMORY_REQUEST" envDefault:"100M"`
-	CPURequest    resource.Quantity `env:"CPU_REQUEST" envDefault:"5m"`
+	CPURequest    resource.Quantity `env:"CPU_REQUEST" envDefault:"100m"`
Contributor


The requests for CPU are so low because a request reserves CPU on the node. We hit a problem where, with too many instances deployed, the cluster ran out of allocatable resources even though actual CPU usage was < 40%.
For now the idea is that instances with higher resource requirements are scaled vertically by hand.

Collaborator


FTR, we had an alert today where the 0.005 core request might have been at least a contributing factor.

@connorgorman connorgorman force-pushed the cgorman-resource-tweaks branch from e0421a2 to f3d6cb5 May 3, 2023 21:25
@connorgorman connorgorman temporarily deployed to development May 3, 2023 21:25 — with GitHub Actions Inactive
@connorgorman
Contributor Author

@SimonBaeumer I am seeing quite a few "timeout: context deadline exceeded" errors when contacting the database, and there also seems to be a fair amount of CPU throttling. I've removed the requests but increased the limits so that customers can get at least a full core of resources.

@connorgorman connorgorman changed the title Increase resource limits and requests Increase resource limits May 3, 2023
@kylape
Contributor

kylape commented May 3, 2023

You could take care of https://issues.redhat.com/browse/ROX-16657 and bump the memory request to 1Gi. We'll need to make sure we are ready to increase the node count on our clusters, in the likely event that ACS instances cannot be scheduled due to a lack of allocatable memory.

@connorgorman
Contributor Author

@kylape I had initially adjusted the requests in this PR, but there were concerns that we weren't utilizing the cluster enough

@kylape
Contributor

kylape commented May 4, 2023 via email

@connorgorman connorgorman changed the title Increase resource limits ROX-16883: Increase resource limits May 4, 2023
@SimonBaeumer
Contributor

We can increase the resources but need an additional alert for memory and CPU requests exceeding cluster capacity.
Added a ticket for this: https://issues.redhat.com/browse/ROX-16884

@connorgorman connorgorman dismissed SimonBaeumer’s stale review May 4, 2023 15:37

Substantial changes

@kylape
Contributor

kylape commented May 4, 2023

Let me try to sum up all the things:

  • Need to bump memory and cpu limits because things are slow
  • This increases risk of evictions, so bump up cluster node count
  • Evictions trigger incidents - this should be fixed by modifying the alert rule
  • Evictions can also be reduced by increasing memory request, but this risks scheduling issues, especially at the point when requests are raised across the fleet

With that in mind, I will approve this PR and bump up cluster node count to reduce risk of evictions in the short term.

@openshift-ci
Contributor

openshift-ci bot commented May 4, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: connorgorman, kylape
Once this PR has been reviewed and has the lgtm label, please ask for approval from simonbaeumer. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@connorgorman connorgorman merged commit 61b65b7 into main May 4, 2023
@connorgorman connorgorman deleted the cgorman-resource-tweaks branch May 4, 2023 18:47
