ROX-16883: Increase resource limits#723

Merged
connorgorman merged 1 commit into main from cgorman-resource-tweaks
May 4, 2023
Conversation

@connorgorman
Contributor

@connorgorman connorgorman commented Jan 13, 2023

Description

The increases in requests are up for debate, but the limits should be closer to what we expect customers to run in production. This could lead to more evictions, but it should also give customers much faster API times. I'm seeing timeouts in the metrics.

Checklist (Definition of Done)

  • Unit and integration tests added
  • Added test description under Test manual
  • Evaluated and added CHANGELOG.md entry if required
  • Documentation added if necessary (i.e. changes to dev setup, test execution, ...)
  • CI and all relevant tests are passing
  • Add the ticket number to the PR title if available, i.e. ROX-12345: ...
  • Discussed security- and business-related topics privately. Will move any security- or business-related topics that arise to a private communication channel.

Test manual

TODO: Add manual testing efforts

# To run tests locally:
make db/teardown db/setup db/migrate
make ocm/setup OCM_OFFLINE_TOKEN=<ocm-offline-token> OCM_ENV=development
make verify lint binary test test/integration

@connorgorman connorgorman temporarily deployed to development January 13, 2023 20:49 — with GitHub Actions Inactive
@connorgorman connorgorman requested a review from ebensh January 13, 2023 20:49
Contributor

@SimonBaeumer SimonBaeumer left a comment


@connorgorman Do you have any concerns about limiting the resources to lower than recommended?
So far we have not hit any issues, e.g. during the hack-fest deploying >70 instances.
The secured clusters were very small in those scenarios.

type AnalyzerDefaults struct {
	MemoryRequest resource.Quantity `env:"MEMORY_REQUEST" envDefault:"100M"`
-	CPURequest    resource.Quantity `env:"CPU_REQUEST" envDefault:"5m"`
+	CPURequest    resource.Quantity `env:"CPU_REQUEST" envDefault:"100m"`
Contributor


The requests for CPU are so low because a request reserves CPU on the node. We hit a problem where, with too many instances deployed, the cluster ran out of allocatable resources even though actual CPU usage was < 40%.
For now the idea is that instances with higher resource requirements are scaled vertically by hand.

Collaborator


FTR, we had an alert today where the 0.005 core request might have been at least a contributing factor.

@connorgorman connorgorman force-pushed the cgorman-resource-tweaks branch from e0421a2 to f3d6cb5 May 3, 2023 21:25
@connorgorman connorgorman temporarily deployed to development May 3, 2023 21:25 — with GitHub Actions Inactive
@connorgorman
Contributor Author

@SimonBaeumer I am seeing quite a few "timeout: context deadline exceeded" errors when contacting the database, and there also seems to be a fair amount of CPU throttling. I've removed the requests but increased the limits so that customers can get at least a full core of resources.

@connorgorman connorgorman changed the title Increase resource limits and requests Increase resource limits May 3, 2023
@kylape
Contributor

kylape commented May 3, 2023

You could take care of https://issues.redhat.com/browse/ROX-16657 and bump the memory request to 1Gi. We'll need to make sure we are ready to increase the node count on our clusters, in the likely event that ACS instances cannot be scheduled due to a lack of allocatable memory.

@connorgorman
Contributor Author

@kylape I had initially adjusted the requests in this PR, but there were concerns that we weren't utilizing the cluster enough

@kylape
Contributor

kylape commented May 4, 2023 via email

@connorgorman connorgorman changed the title Increase resource limits ROX-16883: Increase resource limits May 4, 2023
@SimonBaeumer
Contributor

We can increase the resources but need an additional alert for memory and CPU requests exceeding cluster capacity.
Added a ticket for this: https://issues.redhat.com/browse/ROX-16884

@connorgorman connorgorman dismissed SimonBaeumer’s stale review May 4, 2023 15:37

Substantial changes

@kylape
Contributor

kylape commented May 4, 2023

Let me try to sum up all the things:

  • Need to bump memory and cpu limits because things are slow
  • This increases risk of evictions, so bump up cluster node count
  • Evictions trigger incidents - this should be fixed by modifying the alert rule
  • Evictions can also be reduced by increasing memory request, but this risks scheduling issues, especially at the point when requests are raised across the fleet

With that in mind, I will approve this PR and bump up cluster node count to reduce risk of evictions in the short term.

@openshift-ci
Contributor

openshift-ci bot commented May 4, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: connorgorman, kylape
Once this PR has been reviewed and has the lgtm label, please ask for approval from simonbaeumer. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@connorgorman connorgorman merged commit 61b65b7 into main May 4, 2023
@connorgorman connorgorman deleted the cgorman-resource-tweaks branch May 4, 2023 18:47
