Cannot use cloud_config while creating VM #227
Hey @sMteX, sorry for the late reply. Can you please share the content of the cloud config? Xen Orchestra handles cloud-init configuration slightly differently from the Terraform provider because it performs some templating on the client side (JavaScript in the web UI). I noticed the cloud config you referenced uses those client-side features. Have you tried a cloud config that doesn't use them? They might result in cloud-init thinking the config file is malformed.
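For what it's worth, a minimal cloud config that avoids XO's client-side placeholders (such as {name}) could look like the sketch below; the hostname, user name, and key are made-up values, not taken from this thread:

```yaml
#cloud-config
# Literal values only -- no XO web-UI template variables such as {name}.
hostname: test-vm
users:
  - name: debian                      # hypothetical user, adjust as needed
    sudo: ALL=(ALL) NOPASSWD:ALL
    shell: /bin/bash
    ssh_authorized_keys:
      - ssh-ed25519 AAAAC3Nza... user@example   # placeholder key
```

If a config like this applies cleanly while the templated one doesn't, the client-side templating is the likely culprit.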
As for the missing cloud-init logs, I'm surprised that these files don't exist. The times I've debugged issues with cloud-init not running the way I expected, the logs were there but didn't contain the output I expected. Can you confirm that it's running on boot in this failed scenario? Even if the NoCloud data drive is missing, it should still be able to run.
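One quick way to check (a sketch; the paths assume a standard Debian/Ubuntu guest with cloud-init installed) is to look for the artifacts cloud-init leaves behind on any boot where it actually started:

```shell
# Report which of cloud-init's usual breadcrumb files exist on this guest.
for f in /run/cloud-init/result.json \
         /var/lib/cloud/data/status.json \
         /var/log/cloud-init.log; do
  if [ -e "$f" ]; then
    echo "present: $f"
  else
    echo "missing: $f"
  fi
done
```

If all three are missing after a boot, cloud-init almost certainly never started, which points at packaging or the systemd units rather than at the contents of the config.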
Hey @ddelnano, thanks for getting back!
In fact, what I posted there is the content of the cloud config created in XO (and fetching that with Terraform and using it in the VM gives the same result), so while that may be broken, I don't think it's the culprit.
Like I've mentioned, I originally tried to create a VM manually in XO with the same template and config, just to see how it looks while booting. It looks like it gets past that stage on the manually created VM, which seemed to work just fine. None of that showed on the VMs created with Terraform: no errors, nothing, just straight to the login prompt. I don't think cloud-init even runs on the Terraform VMs, and I can't figure out why. My only other thought is that something is still missing even though I've installed the necessary packages.
I've tried creating a new template, this time also with the following cloud-init configuration:

# The top level settings are used as module
# and system configuration.
# A set of users which may be applied and/or used by various modules
# when a 'default' entry is found it will reference the 'default_user'
# from the distro configuration specified below
users:
- default
# If this is set, 'root' will not be able to ssh in and they
# will get a message to login instead as the above $user (debian)
disable_root: true
# This will cause the set+update hostname module to not operate (if true)
preserve_hostname: false
# This prevents cloud-init from rewriting apt's sources.list file,
# which has been a source of surprise.
apt_preserve_sources_list: true
# Example datasource config
# datasource:
# Ec2:
# metadata_urls: [ 'blah.com' ]
# timeout: 5 # (defaults to 50 seconds)
# max_wait: 10 # (defaults to 120 seconds)
# The modules that run in the 'init' stage
cloud_init_modules:
- migrator
- seed_random
- bootcmd
- write-files
- growpart
- resizefs
- disk_setup
- mounts
- set_hostname
- update_hostname
- update_etc_hosts
- ca-certs
- rsyslog
- users-groups
- ssh
# The modules that run in the 'config' stage
cloud_config_modules:
# Emit the cloud config ready event
# this can be used by upstart jobs for 'start on cloud-config'.
- emit_upstart
- ssh-import-id
- locale
- set-passwords
- grub-dpkg
- apt-pipelining
- apt-configure
- ntp
- timezone
- disable-ec2-metadata
- runcmd
- byobu
# The modules that run in the 'final' stage
cloud_final_modules:
- package-update-upgrade-install
- fan
- puppet
- chef
- salt-minion
- mcollective
- rightscale_userdata
- scripts-vendor
- scripts-per-once
- scripts-per-boot
- scripts-per-instance
- scripts-user
- ssh-authkey-fingerprints
- keys-to-console
- phone-home
- final-message
- power-state-change
# System and/or distro specific settings
# (not accessible to handlers/transforms)
system_info:
# This will affect which distro class gets used
distro: debian
# Default user name + that default users groups (if added/used)
default_user:
name: debian
lock_passwd: True
gecos: Debian
groups: [adm, audio, cdrom, dialout, dip, floppy, netdev, plugdev, sudo, video]
sudo: ["ALL=(ALL) NOPASSWD:ALL"]
shell: /bin/bash
# Other config here will be given to the distro class and/or path classes
paths:
cloud_dir: /var/lib/cloud/
templates_dir: /etc/cloud/templates/
upstart_dir: /etc/init/
package_mirrors:
- arches: [default]
failsafe:
primary: http://deb.debian.org/debian
security: http://security.debian.org/
ssh_svcname: ssh

One thing I've yet to try is whether Debian itself is the culprit, so I'm setting up an Ubuntu image to see if the problem shows up there too.
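One thing that might be worth double-checking in a template like this: as far as I can tell, XO's CloudConfigDrive is picked up via cloud-init's NoCloud datasource, so if the image's datasource list was ever narrowed, the drive would be ignored. A drop-in like the following (a sketch; the filename is arbitrary) makes sure NoCloud is considered:

```yaml
# /etc/cloud/cloud.cfg.d/99-force-nocloud.cfg (hypothetical drop-in)
datasource_list: [ NoCloud, None ]
```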
Thanks for confirming. I thought that was the case, but didn't want to assume.
I thought in both cases you were using a cloud-init-ready VM image? Is that not the case? Since we are still skeptical that cloud-init is ever running, can you try to identify what systemd unit (or similar construct) is launching cloud-init and check its logs? Looking at the cloud-init package for Ubuntu focal, it appears these systemd config files would be of interest:
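On a Debian or Ubuntu guest, those units usually live under /lib/systemd/system; a quick look (a sketch, with a fallback message for machines where the package installed nothing) would be:

```shell
# List the cloud-init systemd units, if the package installed any.
ls -l /lib/systemd/system/cloud-init-local.service \
      /lib/systemd/system/cloud-init.service \
      /lib/systemd/system/cloud-config.service \
      /lib/systemd/system/cloud-final.service 2>/dev/null \
  || echo "no cloud-init unit files found"
```

If nothing is listed, systemd has nothing to start, regardless of what the config drive contains.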
I mean I was using a template that I created myself. It's a Debian 11.6 that just has updated packages and the aforementioned packages installed. This install creates the expected cloud-init files.
Inside the template, after installing the packages, all of the systemd unit files you mentioned appear to be missing. EDIT: Just for reference, when I tried using the Ubuntu template I made (just installed Ubuntu Server), it appears to already have the unit files.
This doc explains more about the different stages of the cloud-init boot process, and systemd is a large part of that. So those missing unit files seem problematic. Are the systemd generator and units mentioned in the docs above running when the instance is created through the XO UI? It would be interesting to see the kernel command line in the working and non-working cases as well. Cloud-init also tries to determine whether the current boot is the first boot or a later reboot (docs). I would also check whether your template has anything cached in it that is causing cloud-init to think it doesn't need to run. I'm also not sure that the Terraform provider is the issue here. From my anecdotal experience, creating these templates has many moving parts, and it's difficult to pinpoint the underlying cause at times.
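On a real template, that cached first-boot state is typically cleared with `sudo cloud-init clean --logs` before the VM is converted into a template. As a sketch of what that effectively removes, demonstrated on a throwaway directory standing in for /var/lib/cloud (so it is safe to run anywhere):

```shell
# Simulate the per-instance state that makes cloud-init treat a boot as
# "already configured", then remove it the way `cloud-init clean` would.
CLOUD_DIR="$(mktemp -d)"
mkdir -p "$CLOUD_DIR/instances/i-abc123"
ln -s "$CLOUD_DIR/instances/i-abc123" "$CLOUD_DIR/instance"
touch "$CLOUD_DIR/instances/i-abc123/obj.pkl"

# Drop the cached instance link and data.
rm -rf "$CLOUD_DIR/instance" "$CLOUD_DIR/instances"

ls -A "$CLOUD_DIR"   # nothing left: the next boot looks like a first boot
```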
My apologies for the radio silence last week. We initially thought we'd found the reason (the host we were trying to spin the VMs up on had, for some reason, duplicates of the same network interfaces, and on a different host it worked), but after reinstalling the faulty machine, the issue still persists. Currently the same cloud-init procedure only seems to work on one host out of three (coincidentally, the pool's master), and we have no clue what could be different between those hosts. Ultimately we've come to the (at least temporary) decision that it's not worth the effort to keep trying to get it working, as there isn't any obvious thing wrong, and we can accomplish more or less the same with Ansible; it's just not as automatic. Thank you for the help, and if we ever find the cause, I'll try to remember this thread and reply for anyone else facing a similar problem.
No worries, and it makes sense that it wasn't worth the investment to continue debugging. I wish we'd been able to get to the bottom of this, but if you do find the solution and remember to follow up, that would be great. For now I'm going to close this since there isn't any active lead to follow. If it becomes important for you again, or someone else is interested in debugging this further, we can reopen it in the future.
Hi,
Here is how I'm generating the cloud config using Talos:

resource "talos_machine_secrets" "machine_secrets" {
talos_version = var.talos_version
}
data "talos_machine_configuration" "controlplane" {
count = var.master_count
cluster_name = var.talos_cluster_name
machine_type = "controlplane"
cluster_endpoint = "https://${var.talos_vip}:6443"
machine_secrets = talos_machine_secrets.machine_secrets.machine_secrets
talos_version = var.talos_version
kubernetes_version = var.kubernetes_version
config_patches = [
templatefile("${path.module}/templates/controlplanepatch.yaml.tmpl", {
vip = var.talos_vip
hostname = "${var.talos_cluster_name}-master-${count.index + 1}"
ip = var.master_ips[count.index]
gateway = var.gateway_ip
nameserver = var.nameserver
talos_version = var.talos_version
})
]
}

How I'm passing it to the VM:

data "xenorchestra_pool" "pool" {
name_label = var.xo_pool
}
data "xenorchestra_hosts" "hosts" {
pool_id = data.xenorchestra_pool.pool.id
sort_by = "name_label"
sort_order = "asc"
}
data "xenorchestra_sr" "local_storage" {
count = length(data.xenorchestra_hosts.hosts.hosts)
name_label = format("%s %s", split(".", data.xenorchestra_hosts.hosts.hosts[count.index].name_label)[0], var.xo_storage_tier)
pool_id = data.xenorchestra_pool.pool.id
}
data "xenorchestra_template" "template" {
name_label = var.xo_vm_template
pool_id = data.xenorchestra_pool.pool.id
}
data "xenorchestra_network" "net" {
name_label = var.xo_vm_network
pool_id = data.xenorchestra_pool.pool.id
}
resource "xenorchestra_vm" "controlplane" {
count = var.master_count
memory_max = var.vm_memory * 1024 * 1024 * 1024
cpus = var.vm_cpu
name_label = "${var.talos_cluster_name}-master-${count.index + 1}"
template = data.xenorchestra_template.template.id
cloud_config = data.talos_machine_configuration.controlplane[count.index].machine_configuration
affinity_host = data.xenorchestra_hosts.hosts.hosts[count.index % length(data.xenorchestra_hosts.hosts.hosts)].id
network {
network_id = data.xenorchestra_network.net.id
#mac_address = var.master_macs[count.index]
}
disk {
sr_id = data.xenorchestra_sr.local_storage[count.index % length(data.xenorchestra_sr.local_storage)].id
name_label = "${var.talos_cluster_name}-master-${count.index + 1}-disk1"
size = var.vm_disk * 1024 * 1024 * 1024
}
tags = [
var.talos_cluster_name,
"controlplane"
]
}

@ddelnano any help on this would be appreciated
I have tried using the same VM template for Talos and created a VM manually with the desired cloud config via XO on host 2 (a non-master), and Talos bootstrapped using the config just fine. That makes me think this is not a host issue but rather some sort of misconfiguration during the vm.create RPC call, where wrong parameters are somehow passed in when the affinity_host is different from the pool master, but I cannot confirm or validate that this is the case.
Here is a diff of the master-2 and master-3 vm.create DEBUG logs: https://www.diffchecker.com/Cp6BSIeO/
After some debugging I also found what is happening, but I have no idea why. This warning gets generated on XO when the CloudConfigDrive is being created:
It seems to be coming from here: https://github.com/vatesfr/xen-orchestra/blob/master/packages/xo-server/src/xapi/index.mjs#L1332-L1334
@TheiLLeniumStudios thanks for the extremely detailed report and glad to hear that the XO team is working on the fix! |
@ddelnano the problem has been fixed in this commit: vatesfr/xen-orchestra@01ba10f I just tested it out and the cloud configs are created properly for all the VMs that are scheduled on the Slaves 🥳 |
I'm currently facing an issue where we can create a cloud config in XO (or have it created with resource "xenorchestra_cloud_config"), but when it's used to initialize a new VM, it doesn't get applied.

I've verified that:

What I've tried:

- creating the cloud config in XO, fetching it with data "xenorchestra_cloud_config" "cc", and using that
- creating it with Terraform (resource "xenorchestra_cloud_config" "cc") and using that
- the hashicorp/cloudinit provider (data "cloudinit_config" "cloudinit_config"), passing the config into part.content and using that
- inlining it into xenorchestra_vm.cloud_config with newlines or <<EOF ... EOF

Nothing seems to work. I've tried running terraform apply with TF_LOG_PROVIDER=DEBUG and noticed that the inputted cloud config made it all the way to the RPC call (I tried to format this with line breaks). However, a bit later, while we're waiting for the VM to be created, I'm receiving logs like this (notice CloudConfig: ResourceSet:<nil>):

The result is:

- the XO CloudConfigDrive is created
- /var/log/cloud-init[-output].log obviously don't exist either

System info:

- Terraform v1.3.8
- terra-farm/xenorchestra provider version 0.24.0
- xo-server 5.109.3
- xo-web 5.111.1

Attempted cloud config:

Terraform file: