Joining cluster fails #17

Closed
kaosmonk opened this issue Nov 11, 2020 · 4 comments

@kaosmonk

kaosmonk commented Nov 11, 2020

Hi guys,
I'm having an issue running a control plane with more than one node: the additional nodes don't seem to join the cluster. Here's the error I'm seeing:

2020-11-11 14:42:56,743 - util.py[DEBUG]: Running command ['/var/lib/cloud/instance/scripts/00_download.sh'] with allowed return codes [0] (shell=False, capture=False)
2020-11-11 14:44:34,107 - util.py[DEBUG]: Running command ['/var/lib/cloud/instance/scripts/01_rke2.sh'] with allowed return codes [0] (shell=False, capture=False)
2020-11-11 14:46:36,461 - util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/01_rke2.sh [1]
2020-11-11 14:46:36,461 - util.py[DEBUG]: Failed running /var/lib/cloud/instance/scripts/01_rke2.sh [1]
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/cloudinit/util.py", line 896, in runparts
    subp(prefix + [exe_path], capture=False)
  File "/usr/lib/python3.6/site-packages/cloudinit/util.py", line 2083, in subp
    cmd=args)
cloudinit.util.ProcessExecutionError: Unexpected error while running command.
Command: ['/var/lib/cloud/instance/scripts/01_rke2.sh']
Exit code: 1
Reason: -
Stdout: -
Stderr: -
2020-11-11 14:46:36,463 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
2020-11-11 14:46:36,464 - handlers.py[DEBUG]: finish: modules-final/config-scripts-user: FAIL: running config-scripts-user with frequency once-per-instance
2020-11-11 14:46:36,464 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3.6/site-packages/cloudinit/config/cc_scripts_user.py'>) failed
2020-11-11 14:46:36,464 - util.py[DEBUG]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3.6/site-packages/cloudinit/config/cc_scripts_user.py'>) failed
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/cloudinit/stages.py", line 852, in _run_modules
    freq=freq)
  File "/usr/lib/python3.6/site-packages/cloudinit/cloud.py", line 54, in run
    return self._runners.run(name, functor, args, freq, clear_on_fail)
  File "/usr/lib/python3.6/site-packages/cloudinit/helpers.py", line 187, in run
    results = functor(*args)
  File "/usr/lib/python3.6/site-packages/cloudinit/config/cc_scripts_user.py", line 45, in handle
    util.runparts(runparts_path)
  File "/usr/lib/python3.6/site-packages/cloudinit/util.py", line 903, in runparts
    % (len(failed), len(attempted)))
RuntimeError: Runparts: 1 failures in 2 attempted commands
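For context on the traceback above: cloud-init's scripts-user module passes /var/lib/cloud/instance/scripts to its runparts() helper, which runs every executable file there in sorted order, keeps going after a failure, and raises a RuntimeError at the end if any script exited non-zero. "1 failures in 2 attempted commands" therefore means 00_download.sh succeeded and 01_rke2.sh did not. The function below is a rough shell approximation of that behaviour, useful for reproducing it by hand; `run_parts_sketch` is a hypothetical name and this is not cloud-init's actual code.

```shell
#!/usr/bin/env bash
# Rough approximation of cloud-init's runparts() (a sketch, not cloud-init code):
# run every executable regular file in a directory in sorted order, continue
# after failures, and report how many scripts failed.
run_parts_sketch() {
    local dir="$1"
    local attempted=0 failed=0 exe
    # Glob expansion is sorted; this may differ from Python's sorted() under
    # some locales, but not for names like 00_download.sh / 01_rke2.sh.
    for exe in "$dir"/*; do
        # Skip anything that is not an executable regular file.
        [ -f "$exe" ] && [ -x "$exe" ] || continue
        attempted=$((attempted + 1))
        if ! "$exe"; then
            echo "Failed running $exe" >&2
            failed=$((failed + 1))
        fi
    done
    if [ "$failed" -gt 0 ]; then
        echo "Runparts: $failed failures in $attempted attempted commands" >&2
        return 1
    fi
    return 0
}
```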

If I then run the script manually, I get the following:

# /var/lib/cloud/instance/scripts/01_rke2.sh
[INFO]  Beginning user defined pre userdata
[INFO]  Beginning user defined pre userdata
[INFO]  Fetching rke2 join token...
REDACTED
[INFO]  Found token from s3 object
[INFO]  API server available, identifying as server joining existing cluster
[INFO]  Cluster is ready
[ERROR]  Failed to create kubeconfig

I'm also looking into this right now and will share more insights, unless you have any ideas. Not sure if it's relevant, but I am not using spot instances.

Thanks!

@kaosmonk
Author

It's really hard to troubleshoot this. Sometimes 2 out of 3 master nodes work, while the 3rd one has no kube-apiserver started (although the rke2-server service is running) and is therefore shown as OutOfService by the LB. Other times only one node is healthy, while the other 2 report the errors from my initial comment.

@joshrwolf
Contributor

Hi @kaosmonk, this is unfortunately a known issue with rke2 and is not specific to this Terraform implementation. You can see the full discussion and workarounds, and track the status, here.

@chadningle

chadningle commented Dec 17, 2020

In my case, I found that the generated /var/lib/cloud/instance/scripts/00_download.sh needs to be tweaked: the AWS CLI is already installed, so the install step in that script fails.

[root@ip-10-2-5-28 ~]# /var/lib/cloud/instance/scripts/00_download.sh
Loaded plugins: amazon-id, search-disabled-repos
Package unzip-6.0-21.el7.x86_64 already installed and latest version
Nothing to do
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 33.5M  100 33.5M    0     0  47.0M      0 --:--:-- --:--:-- --:--:-- 47.0M
Found preexisting AWS CLI installation: /usr/local/aws-cli/v2/current. Please rerun install script with --update flag.
[root@ip-10-2-5-28 ~]# echo $?
1

I'll try to figure out how to change the way /var/lib/cloud/instance/scripts/00_download.sh gets generated, but if anyone already knows, please let me know. :)
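As the installer's own message suggests, one way to make a download script tolerate a preinstalled AWS CLI v2 is to pass `--update` to the installer when an installation already exists. The sketch below assumes the default install directory /usr/local/aws-cli seen in the error above and the x86_64 Linux bundle; `awscli_install_args` and `install_awscli_v2` are hypothetical helper names, and this is not necessarily how the later fix in this repository works.

```shell
#!/usr/bin/env bash
# Print "--update" if an AWS CLI v2 install already exists under the given
# install dir (default: /usr/local/aws-cli, the installer's default location).
awscli_install_args() {
    local install_dir="${1:-/usr/local/aws-cli}"
    if [ -e "$install_dir/v2/current" ]; then
        echo "--update"
    fi
}

# Download and install (or update) AWS CLI v2. Assumes x86_64 Linux, root
# privileges, and that curl and unzip are available.
install_awscli_v2() {
    local workdir
    workdir=$(mktemp -d) || return 1
    curl -sSfL "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" \
        -o "$workdir/awscliv2.zip" || return 1
    unzip -q "$workdir/awscliv2.zip" -d "$workdir" || return 1
    # Unquoted on purpose: expands to nothing on a fresh host and to
    # --update when a v2 install is already present.
    (cd "$workdir" && ./aws/install $(awscli_install_args)) || return 1
    rm -rf "$workdir"
}
```

Checking for the existing installation keeps a fresh host on the plain install path while letting a rerun (or an image with the CLI baked in) succeed instead of exiting with status 1.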

@aleiner
Contributor

aleiner commented Mar 31, 2023

Issues with preexisting awscli installations were resolved via this commit:

7b729db

@aleiner aleiner closed this as completed Mar 31, 2023

4 participants