Add new menu option for default settings with 4 GPUs #333

Merged on Apr 24, 2020 (6 commits)

Conversation

willgraf
Contributor

This PR adds a new Default configuration option for 4 GPUs. It sets the default node type to n1-standard-4, which helps avoid Prometheus being OOMKilled when the node it runs on does not have enough memory. n1-standard-4 has been shown to work for high-volume jobs with up to 8 GPUs. Fixes #296.

The troubleshooting section has also been updated to help users understand this issue and resolve it.

Here is what the new menu options look like:

[Screenshot: Screen Shot 2020-04-23 at 1.21.15 PM]
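For anyone trying to confirm whether they are hitting the Prometheus OOMKilled issue described above, here is a minimal sketch of one way to check. It is not part of this PR. It assumes `kubectl` is already configured for the cluster, and the function names `find_oomkilled` and `load_pods_from_kubectl` are hypothetical helpers made up for illustration. It reads the standard pod status fields (`status.containerStatuses[].state` and `.lastState`) and reports containers whose termination reason is `OOMKilled`.

```python
import json
import subprocess


def find_oomkilled(pod_list):
    """Return (namespace, pod, container) tuples for containers that were OOMKilled.

    `pod_list` is the parsed JSON from `kubectl get pods -o json`.
    """
    hits = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        # Pending pods may have no containerStatuses yet.
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for cs in statuses:
            # A restarted container reports the kill under lastState;
            # a container that has not restarted yet reports it under state.
            for key in ("state", "lastState"):
                terminated = (cs.get(key) or {}).get("terminated") or {}
                if terminated.get("reason") == "OOMKilled":
                    hits.append(
                        (meta.get("namespace"), meta.get("name"), cs.get("name"))
                    )
                    break
    return hits


def load_pods_from_kubectl():
    """Fetch all pods as parsed JSON (hypothetical helper; requires kubectl access)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "--all-namespaces", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling `find_oomkilled(load_pods_from_kubectl())` against a live cluster lists the affected containers. If the Prometheus container shows up, moving it to a node type with more memory, such as the n1-standard-4 default added here, is the fix this PR proposes.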

@willgraf
Contributor Author

I have successfully created and destroyed a cluster in this branch.

Collaborator

MekWarrior left a comment

👍

willgraf merged commit ecbccb9 into stable on Apr 24, 2020
willgraf deleted the default-large branch on April 24, 2020, 22:06
willgraf added a commit that referenced this pull request May 22, 2020
* Feature/cicd (#268)

* Set all custom charts image.pullPolicy to IfNotPresent (#258)

* setting TRANSLATE_COLON_NOTATION=false by default (#289)

* Update Getting Started  (#287)

* Update PULL_REQUEST.md for grammar (#292)

* Use gomplate to template patches/hpa.yaml. (#293)

* default account has 100 firewalls, not 200. (#297)

* Update all documentation and links to reference kiosk-console instead of kiosk (#295)

* Use yq and helmfile build to dynamically deploy helm charts based on release name. (#300)

* Upgrade the openvpn chart to latest 4.2.1. (#301)

* Change CLUSTER in Makefile to kiosk-console to fix binary name issue. (#302)

* update raw.gif and tracked.gif with new nearly perfect gif (#303)

* Update default values for tf-serving (#306)

* Update Redis to the latest helm chart before they migrate to bitnami (#307)

* Update autoscaler to 0.4.1 (#308)

* Update redis-janitor to 0.3.1 (#309)

* Update frontend to 0.4.1. (#310)

* Update OpenVPN command for version 4.2.1 (#313)

* Upgrade consumers to 0.5.1 and update models to DeepWatershed. (#311)

* Set no-appendfsync-on-rewrite=yes to prevent Redis latency issues during AOF fsync (#316)

* Install yq in install_script.sh (#319)

* Use 4 random digits for cluster names. (#318)

* update to latest version of the frontend (#322)

* Change default consumer machine type to n1-standard-2 (#323)

* Upgrade benchmarking to 0.2.4 and fix for Deep Watershed models (#324)

* Use GRAFANA_PASSWORD env var to override the default grafana password. (#325)

* Update Getting Started docs with new user feedback (#321)

* Add basic unit tests (#326)

* Use the docker container to run integration tests. (#327)

* Warn users if bucket's region and cluster's region do not match (#329)

* Bump benchmarking to latest 0.2.5 release (#331)

* Add Logo Banner and Update README (#332)

* Add new menu option for default settings with 4 GPUs (#333)

* Update HPA target to 2 keys per zip consumer pod. (#334)

* Bump consumers to version 0.5.2 (#336)

* Update consumer and benchmarking versions (#337)

* Bump redis-janitor to 0.3.2 to fix empty key bug. (#339)

* bump benchmarking to 0.3.1 to fix No route to host bug. (#341)

* Allow users to select which zone(s) to deploy the cluster (#340)

* Pin KUBERNETES_VERSION to 1.14. (#346)

* Fix bug indexing into last array element of valid_zones. (#348)

* Fix logs to indicate finality and be less redundant. (#351)

* If KUBERNETES_VERSION is 1.14, warn user of potential future version removal (#352)

Co-authored-by: dylanbannon <dylanbannon5@gmail.com>
Co-authored-by: MekWarrior <moen.erick@gmail.com>
willgraf added a commit that referenced this pull request May 23, 2020
* add new menu option for 4 GPU default (with n1-standard-4 default node type)

* add troubleshooting section for Prometheus OOMKilled issue.

* update the confirm display to include node information.

* update docs with new menu options.