Unable to connect worker node to Kubernetes cluster #924
@ymc101, please provide the following information for us to understand the problem:
Hi @leokondrashov, the VMs I was previously running have been terminated. I will replicate the setup later and get back to you with the information.
Hi @leokondrashov, below is the requested information:
ifconfig -a (Worker VM):
ifconfig -a (Master VM):
sudo lsof output:
~/.kube/config (Master VM):
/etc/kubernetes/admin.conf (Master VM):
Worker VM: .kube folder not present; /etc/kubernetes/admin.conf file content is empty
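For reference, gathering that information presumably looked something like this (a sketch; the exact lsof invocation was not preserved in the thread):

```bash
# On each VM: list all network interfaces
ifconfig -a

# Assumed lsof invocation: show what is using the Kubernetes API server port
sudo lsof -i :6443

# On the master VM: print the kubeconfig files
cat ~/.kube/config
sudo cat /etc/kubernetes/admin.conf
```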
Hi, @ymc101. The data provided looks fine. The only reason I can think of is that a firewall is in place and/or ports are blocked. Can you check if port 6443 is whitelisted?
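A quick reachability check from the worker might look like this (192.168.56.10 is a placeholder for the master's IP; nc and ufw availability are assumptions):

```bash
# From the worker VM: test TCP connectivity to the API server port
nc -zv 192.168.56.10 6443

# On the master VM: inspect firewall rules, if ufw is in use
sudo ufw status verbose
```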
Hi @leokondrashov, the ports were apparently blocked due to the VM configuration. After solving that, the worker node was able to join the master node, but there was an error configuring MetalLB. Below are some terminal logs. Worker node when joining:
Here I forgot to create the temp log files before running, but it seems that did not affect the node joining. Master node:
I had automatic SSH set up for both the master and worker nodes. Do you have an idea what might be causing this error?
It's good that the networking issue was that simple to resolve. We should add it to the troubleshooting guide. Regarding the MetalLB error, I saw that once; I think it's just a sporadic error. First, check the available pods: MetalLB might be there, just too late to report to the script. Either way, try to rerun the setup.
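Checking the MetalLB pods directly might look like this (metallb-system is the namespace used by MetalLB's standard manifests):

```bash
# If the controller and speaker pods show Running, MetalLB likely came up
# but missed the setup script's deadline rather than failing outright.
kubectl get pods -n metallb-system -o wide
```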
The networking issue was caused by a VM setting in VirtualBox, so it was nothing to do with vHive itself. Regarding the MetalLB error, I tried running the setup tool command again and it passed the check; however, this time there is an error with deploying the Istio operator:
@leokondrashov, do you have any idea for this one? Thanks.
Can you please check the pods in the relevant namespace?
I ran into a different MetalLB error this time:
When I run the setup:
I'm not very confident in using the clean CRI script for a multi-node setup. It's better to start from clean nodes. Let's figure out the problems that we face: please run from the start and document the errors that you encounter. For Istio and MetalLB failures, please provide the relevant output. Most of our problems are timeouts from resources not being ready. Can you check the networking speed of the VMs?
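One way to measure it from inside a VM, assuming speedtest-cli (any bandwidth tool would do):

```bash
# Install and run a quick bandwidth test
pip3 install speedtest-cli
speedtest-cli --simple   # prints ping, download, and upload
```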
Using speedtest-cli, this is the networking speed from one of my VMs:
Is this within a sufficient range with respect to the timeout timers in the master node setup script? Additionally, can I ask if there is a script or method to reset a node after a setup failure, or after a node has been used? It is quite time-consuming to tear down the VM and set up a new one each time due to the initial OS setup. So far I have tried using the single-node clean CRI script, as I could not find any other cleanup script in the quickstart guide.
That seems to be a fair speed for the setup. You are using VirtualBox, right? Can you create a snapshot of the VM right after boot? That should speed up the process.
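From the host, snapshots can also be scripted with VBoxManage (the VM name "vhive-node" is a placeholder; restoring requires the VM to be powered off):

```bash
# Take a snapshot right after a clean boot...
VBoxManage snapshot "vhive-node" take "clean-boot" --description "fresh OS, before vHive setup"

# ...and roll back to it after a failed setup attempt
VBoxManage controlvm "vhive-node" poweroff
VBoxManage snapshot "vhive-node" restore "clean-boot"
VBoxManage startvm "vhive-node"
```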
Yes, I am using VirtualBox. I was previously not aware of this feature; thanks for the suggestion.
I tried running it from scratch and got the same MetalLB timeout error, and when I tried to rerun the command I got this index-out-of-range panic:
If MetalLB was ready but not in time for the script, resetting the VM snapshot might produce the same error. Do you have any ideas or suggestions? Right now I can only think of modifying the code on the master node to increase the timeout threshold for the MetalLB and Istio setup, but I'm not sure what might be causing this, especially when the download and upload speeds don't seem to be the bottleneck here.
I saw the timeout issue previously with network congestion, which is not the case here. However, there might also be the problem of not having enough CPU to install it in time. What is the VM size? The solution with more time would work, although the current limit of 3 minutes should be more than enough. I'm not sure that it can be done for Istio (which also experienced timeouts), so a more permanent solution might be to increase the VM size.
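For reference, the allocation is easy to confirm from inside the VM:

```bash
nproc    # CPU cores visible to the VM
free -h  # total and available memory
```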
I allocated 3 CPU cores and 8GB of RAM to this VM. I'll try again with a bigger VM size and see if the same issue occurs. This system has 12 cores and 32GB of RAM, so I can give each VM about 4-5 cores at most and about 12GB of RAM.
The script encountered the same error with MetalLB, even with 5 cores and 12GB of RAM, which is the maximum I can allocate to each VM without exceeding the system's hardware resources. May I know the specs of the nodes that you have tested on before?
We commonly use nodes with around 10 cores and 64GB, but your configuration should be enough. Can you supply the content of the log files?
Below are the contents of the two log files from the run I just did with the verbosity flag:
create_multinode_cluster_common.log:
create_multinode_cluster_error.log:
Can you provide the output of the command?
Do I run that command on the worker node right after it joins the cluster, and before I respond to the user prompt on the master node?
After it fails to deploy the MetalLB services.
This is the output I got:
Does it mention why there is an error with the MetalLB setup? I'm not very sure how to interpret this log.
I see several minutes of waiting due to the worker node not being ready (between the first two events). The other steps are not that long (only the image pull took 40s, but I have no idea how to improve that). Can you also add the output of the command mentioned above? I suppose you can try to continue the setup.
May I check what you mean by continuing the setup?
As I remember, the comment referenced above shows the result of a run where the MetalLB setup passed. For now, I don't know what's wrong with the node; it just takes more time. So, possibly, the only solution is to increase the timeout on this line to 600s just to fix this problem.
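As a sketch of that change, assuming the script waits on the MetalLB pods with kubectl wait (the pattern MetalLB's own install docs use; the actual line in the vHive script may differ):

```bash
# Raise the readiness deadline to 600s for slow VMs
kubectl wait --namespace metallb-system \
  --for=condition=ready pod \
  --selector=app=metallb \
  --timeout=600s
```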
OK, I will attempt increasing the timeout and try again. For the comment you referenced where it passed the MetalLB setup: I believe I had run the cleanup script for a single-node cluster and then tried running the command again. As you previously mentioned that the cleanup script was not really meant for multi-node clusters, I have stopped trying that method.
It seems that neither the MetalLB error nor the Istio error appeared this time. Is this supposed to be the full output when successful?
Yes, that is the correct setup result. It seems that these errors are just flaky; the solution is to increase the MetalLB timeout and hope that Istio is installed in time. I suppose we can close the issue then.
Thanks for all your help. Before we close the issue, I have one more question on function deployment. According to the recorded tutorial session on YouTube, there is supposed to be a deployer directory in the vHive repository.
Yes, we have moved them to the vSwarm repository now. You can check our quickstart guide; it has the most up-to-date instructions, including examples of how to use these tools.
Hi, could I ask a few questions regarding function deployment for my setup? When running the deployer client, I am getting some error messages:
Could those be ignored, or is there an issue getting in the way of the deployment? When running the invoker client, I got this error regarding Go versioning:
Is there a way to fix this? Thanks.
Errors are definitely bad; it shouldn't time out. Please send over the description of the pods. The invoker problem with Go is known; we will update the Go version in the next update, so it will be fixed. For now, you can reinstall Go:
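The inline snippets were lost above; presumably something along these lines (the function name "helloworld" and the Go version are placeholders):

```bash
# Describe the pods backing a deployed Knative service
kubectl describe pod -l serving.knative.dev/service=helloworld

# Reinstall Go from the official tarball
sudo rm -rf /usr/local/go
wget https://go.dev/dl/go1.21.13.linux-amd64.tar.gz
sudo tar -C /usr/local -xzf go1.21.13.linux-amd64.tar.gz
export PATH=$PATH:/usr/local/go/bin
go version
```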
I got a pod-not-found error; I tried the describe pod command for other functions as well, but it returns the same error:
Then describe the deployment:
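Knative names the underlying deployment <service>-<revision-suffix>-deployment, so the check would be roughly (names are placeholders):

```bash
kubectl get deployments                                   # find the function's deployment
kubectl describe deployment helloworld-00001-deployment   # placeholder name
```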
Weird. It says that the deployment was scaled up and down. What about revisions? The original error was about a revision not being ready.
Sorry, what do you mean by revisions?
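Knative records every deployed version of a service as a Revision object; inspecting them looks roughly like this (the revision name is a placeholder):

```bash
# List revisions and check their readiness conditions
kubectl get revisions
kubectl describe revision helloworld-00001
```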
I've never seen such errors: "Failed to get/pull image: failed to prepare extraction snapshot". Please open a separate issue and attach Firecracker logs from the worker nodes. It seems that it is an issue with Firecracker now.
Is there a command or file location from which I can access the Firecracker logs on the worker node?
Describe the bug
Error connecting worker node to the Kubernetes cluster when executing the following command:
This occurs when following the standard deployment steps in the quickstart guide.
To Reproduce:
Setting up 1 master and 1 worker node on 2 VMs running on the same computer (using VirtualBox), both running Ubuntu 20.04, and following the steps in the quickstart guide to "Setup a Serverless (Knative) Cluster" (standard setup, non-stargz).
Expected behaviour:
Success message as shown in the quickstart guide:
Logs:
Error message after running the above-mentioned command:
Stack trace: