2.21.0 hybrid #29

Merged: 4 commits, Feb 22, 2024
1 change: 1 addition & 0 deletions contrib/terraform/openstack/modules/compute/main.tf
@@ -596,6 +596,7 @@ resource "openstack_compute_instance_v2" "k8s_nodes" {
user_data = each.value.cloudinit != null ? templatefile("${path.module}/templates/cloudinit.yaml.tmpl", {
extra_partitions = each.value.cloudinit.extra_partitions
}) : data.cloudinit_config.cloudinit.rendered
security_groups = var.port_security_enabled ? local.worker_sec_groups : null
Owner:

is this possibly due to a bug in kubespray?

Author:

I was wondering the same thing, and after some reading I concluded that this likely is a problem that merits another PR or at least further discussion. In summary, I think it was an oversight when removing port definitions and then fixing the broken security groups in a future commit. Since we hadn't used the k8s_nodes resource, it was never updated.

I think that series of commits was done to force Terraform to add instances to the auto_allocated_network. I would suggest we find a way to accomplish this without removing the ports resources so that we diverge as little as possible from "vanilla" Kubespray. This would make updating easier.

I can open a new issue describing what I think the problem is in more depth, and work on it when I have time.
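
To illustrate the idea, a rough sketch of keeping a dedicated port resource (this is not the actual Kubespray code; the variable and local names are borrowed from this diff and may not match upstream, and security_group_ids expects group IDs rather than names):

# Sketch only: attach security groups on a port resource instead of on the
# instance, so the instance definition stays close to "vanilla" Kubespray.
resource "openstack_networking_port_v2" "k8s_nodes_port" {
  for_each              = var.k8s_nodes
  name                  = "${var.cluster_name}-k8s-node-${each.key}"
  network_id            = var.network_id
  admin_state_up        = true
  port_security_enabled = var.port_security_enabled
  security_group_ids    = var.port_security_enabled ? local.worker_sec_groups : null
}

resource "openstack_compute_instance_v2" "k8s_nodes" {
  for_each  = var.k8s_nodes
  name      = "${var.cluster_name}-k8s-node-${each.key}"
  flavor_id = each.value.flavor
  image_id  = var.image_id

  network {
    # attach the pre-created port instead of setting security_groups on the instance
    port = openstack_networking_port_v2.k8s_nodes_port[each.key].id
  }
}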

Owner:

Yes, sure, thanks. Keep it low priority; maybe we can reconsider this when we update kubespray next time.


dynamic "block_device" {
for_each = !local.k8s_nodes_settings[each.key].use_local_disk ? [local.k8s_nodes_settings[each.key].image_id] : []
44 changes: 44 additions & 0 deletions inventory/kubejetstream/cluster.tfvars
@@ -41,6 +41,50 @@ number_of_k8s_nodes_no_floating_ip = 0

flavor_k8s_node = "4"

# # Uncomment if all nodes will be GPU nodes
# # If you wish to use this var for another reason, add the ansible groups as a comma-separated list
# # E.g. "additional-group-1,additional-group-2,etc"
# supplementary_node_groups = "gpu-node"

# BEGIN HYBRID CLUSTER CONFIG

# # Set to true by default, but we make it explicit here
# port_security_enabled = true

# # Must be uncommented and set to 0 to use the k8s_nodes variable
# number_of_k8s_nodes = 0
# number_of_k8s_nodes_no_floating_ip = 0

# # "<cluster-name>-k8s-node-" will be prepended to each key name and used to create the instance name.
# # E.g. the first item below would result in an instance named "<cluster-name>-k8s-node-nf-cpu-1"
# # For a full list of options see ./contrib/terraform/openstack/README.md#k8s_nodes
# k8s_nodes = {
# "nf-cpu-1" = {
# "az" = "nova"
# "flavor": "4"
# "floating_ip": false
# },
# "nf-cpu-2" = {
# "az" = "nova"
# "flavor": "4"
# "floating_ip": false
# },
# "nf-gpu-1" = {
# "az" = "nova"
# "flavor": "10"
# "floating_ip": false
# "extra_groups": "gpu-node"
# },
# "nf-gpu-2" = {
# "az" = "nova"
# "flavor": "10"
# "floating_ip": false
# "extra_groups": "gpu-node"
Owner:

@ana-v-espinoza How do I create 2 profiles, 1 for CPU and 1 for GPU?
I see there is an extra group here, but it seems it is only in Terraform and not in Kubernetes.
This works for GPU pods, but not for CPU:
https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream/blob/master/gpu/jupyterhub_gpu.yaml#L2-L9

Author:

Hey Andrea,

I'm using the same "profiles" config option. Here's my snippet. You'll notice that I don't override the image in the GPU profile, as I'm using an image similar to that discussed in zonca/jupyterhub-deploy-kubernetes-jetstream#72

singleuser:
  image:
    name: "unidata/hybrid-gpu"
    tag: "minimal-tf"
  profileList:
  - display_name: "CPU Server"
    default: true
  - display_name: "GPU Server"
    kubespawner_override:
      extra_resource_limits:
        nvidia.com/gpu: "1"

Owner:

The problem is that I request a CPU server but spawn on a GPU node. I am wondering if it would be better to restrict CPU-only users to CPU nodes.

Author:

Ah okay I see what you mean! Yeah, you can taint the GPU node(s), then add a toleration in kubespawner_override for the GPU profile.
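
For reference, a rough sketch of what that could look like. The taint key and value below are placeholders, not something defined in this PR, and the GPU node would first need to be tainted, e.g. with "kubectl taint nodes <gpu-node> dedicated=gpu:NoSchedule":

singleuser:
  profileList:
  - display_name: "CPU Server"
    default: true
  - display_name: "GPU Server"
    kubespawner_override:
      extra_resource_limits:
        nvidia.com/gpu: "1"
      # tolerate the placeholder taint so only this profile can schedule onto GPU nodes
      tolerations:
        - key: "dedicated"
          operator: "Equal"
          value: "gpu"
          effect: "NoSchedule"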

# },
# }

# END HYBRID CLUSTER CONFIG

# GlusterFS
# either 0 or more than one
#number_of_gfs_nodes_no_floating_ip = 0
54 changes: 54 additions & 0 deletions inventory/kubejetstream/group_vars/gpu-node/containderd.yml
@@ -0,0 +1,54 @@
---
# Please see roles/container-engine/containerd/defaults/main.yml for more configuration options

# containerd_storage_dir: "/var/lib/containerd"
# containerd_state_dir: "/run/containerd"
# containerd_oom_score: 0

containerd_default_runtime: "nvidia"
# containerd_snapshotter: "native"

containerd_runc_runtime:
  name: nvidia
  type: "io.containerd.runc.v2"
  engine: ""
  root: ""
  options:
    BinaryName: '"/usr/bin/nvidia-container-runtime"'


# containerd_additional_runtimes:
# Example for Kata Containers as additional runtime:
#   - name: kata
#     type: "io.containerd.kata.v2"
#     engine: ""
#     root: ""

# containerd_grpc_max_recv_message_size: 16777216
# containerd_grpc_max_send_message_size: 16777216

# containerd_debug_level: "info"

# containerd_metrics_address: ""

# containerd_metrics_grpc_histogram: false

## An obvious use case is allowing insecure-registry access to self-hosted registries.
## Can be an IP address or a domain name.
## Example: define mirror.registry.io or 172.19.16.11:5000
## Set "name": "url". An insecure URL must start with http://
## A port number is also needed if the default HTTPS port is not used.
# containerd_insecure_registries:
# "localhost": "http://127.0.0.1"
# "172.19.16.11:5000": "http://172.19.16.11:5000"

# containerd_registries:
# "docker.io": "https://registry-1.docker.io"

# containerd_max_container_log_line_size: -1

# containerd_registry_auth:
#   - registry: 10.0.0.2:5000
#     username: user
#     password: pass