NSG not defined in template #34

Closed
rian-hout opened this Issue Jul 17, 2017 · 20 comments


rian-hout commented Jul 17, 2017

The autoscaler is throwing an error when trying to scale up. I've created multiple clusters, new resource groups, and new acs-engine templates, and I always end up with the same error. I've also tried deploying the helm chart and receive the same error. The debug log is below. Am I missing something or doing something wrong?

2017-07-17 15:28:39,015 - autoscaler.cluster - DEBUG - Using kube service account
2017-07-17 15:28:39,016 - autoscaler.cluster - INFO - ++++ Running Scaling Loop ++++++
2017-07-17 15:28:39,085 - autoscaler.cluster - INFO - Pods to schedule: 10
2017-07-17 15:28:39,085 - autoscaler.cluster - INFO - ++++ Scaling Up Begins ++++++
2017-07-17 15:28:39,085 - autoscaler.cluster - INFO - Nodes: 2
2017-07-17 15:28:39,085 - autoscaler.cluster - INFO - To schedule: 10
2017-07-17 15:28:39,086 - autoscaler.cluster - INFO - Pending pods: 10
2017-07-17 15:28:39,086 - autoscaler.cluster - DEBUG - busybox
2017-07-17 15:28:39,086 - autoscaler.cluster - DEBUG - busybox19
2017-07-17 15:28:39,086 - autoscaler.cluster - DEBUG - busybox2
2017-07-17 15:28:39,086 - autoscaler.cluster - DEBUG - busybox3
2017-07-17 15:28:39,087 - autoscaler.cluster - DEBUG - busybox4
2017-07-17 15:28:39,087 - autoscaler.cluster - DEBUG - busybox5
2017-07-17 15:28:39,087 - autoscaler.cluster - DEBUG - busybox6
2017-07-17 15:28:39,087 - autoscaler.cluster - DEBUG - busybox7
2017-07-17 15:28:39,087 - autoscaler.cluster - DEBUG - busybox8
2017-07-17 15:28:39,087 - autoscaler.cluster - DEBUG - busybox9
2017-07-17 15:28:39,087 - autoscaler.scaler - INFO - ====Scaling for 10 pods ====
2017-07-17 15:28:39,088 - autoscaler.scaler - DEBUG - units_needed: 1
2017-07-17 15:28:39,088 - autoscaler.scaler - DEBUG - units_requested: 1
2017-07-17 15:28:39,088 - autoscaler.scaler - DEBUG - riantest2a actual capacity: 2 , units requested: 1
2017-07-17 15:28:39,088 - autoscaler.scaler - INFO - New capacity requested for pool riantest2a: 3 agents (current capacity: 2 agents)
2017-07-17 15:28:39,088 - autoscaler.scaler - DEBUG - remaining pending: 0
2017-07-17 15:28:39,105 - autoscaler.engine_scaler - INFO - Deployment autoscaler-deployment-43ece6c4 started...
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/msrestazure/azure_operation.py", line 348, in init
self._operation.set_initial_status(self._response)
File "/usr/local/lib/python3.6/site-packages/msrestazure/azure_operation.py", line 229, in set_initial_status
self._raise_if_bad_http_status_and_method(response)
File "/usr/local/lib/python3.6/site-packages/msrestazure/azure_operation.py", line 147, in _raise_if_bad_http_status_and_method
"Invalid return status for {!r} operation".format(self.method))
msrestazure.azure_operation.BadStatus: Invalid return status for 'PUT' operation

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "main.py", line 116, in
main()
File "/usr/local/lib/python3.6/site-packages/click/core.py", line 722, in call
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.6/site-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.6/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "main.py", line 105, in main
scaled = cluster.loop(debug)
File "/app/autoscaler/cluster.py", line 110, in loop
return self.loop_logic()
File "/app/autoscaler/cluster.py", line 163, in loop_logic
self.scale(pods_to_schedule, all_nodes, scaler)
File "/app/autoscaler/cluster.py", line 204, in scale
scaler.fulfill_pending(pending_pods)
File "/app/autoscaler/scaler.py", line 178, in fulfill_pending
self.scale_pools(new_pool_sizes)
File "/app/autoscaler/engine_scaler.py", line 149, in scale_pools
new_pool_sizes), new_pool_sizes)
File "/app/autoscaler/deployments.py", line 18, in deploy
self._current_deployment = func()
File "/app/autoscaler/engine_scaler.py", line 149, in
new_pool_sizes), new_pool_sizes)
File "/app/autoscaler/engine_scaler.py", line 180, in deploy_pools
properties, raw=False)
File "/usr/local/lib/python3.6/site-packages/azure/mgmt/resource/resources/v2017_05_10/operations/deployments_operations.py", line 285, in create_or_update
get_long_running_status, long_running_operation_timeout)
File "/usr/local/lib/python3.6/site-packages/msrestazure/azure_operation.py", line 351, in init
raise CloudError(self._response)
msrestazure.azure_exceptions.CloudError: Azure Error: InvalidTemplate
Message: Deployment template validation failed: 'The resource 'Microsoft.Network/networkSecurityGroups/k8s-master-38581513-nsg' is not defined in the template. Please see https://aka.ms/arm-template for usage details.'.

wbuchwalter (Owner) commented Jul 17, 2017

@rian-hout can you paste the JSON file you are using to generate the ARM template with acs-engine here?
Don't forget to remove the service principal credentials first :)
Also, which version of acs-engine are you using?

rian-hout commented Jul 17, 2017

My json files are attached.

I'm using Go on Windows to run acs-engine. I'm not sure what version it is; I rebuilt it on 6/30 and grabbed whatever was latest. When I run acs-engine version, the response says the version is unset.

azuredeploy.parameters.json.txt
azuredeploy.json.txt
apimodel.json.txt

wbuchwalter (Owner) commented Jul 18, 2017

Thanks for the details.
The file I'm looking for is the one you used as input for acs-engine.
To create the cluster you had to run acs-engine generate <template-file>; can you share that file here?

Some more questions:

  • Is the cluster deployed in its own resource group, or did you create additional resources in that resource group?
  • Can you share how you start the autoscaler (parameters used, etc.)?
  • Which Azure region are you using?

Thanks!

rian-hout commented Jul 18, 2017

I've attached the requested file.

The cluster, including the NSG, is in its own resource group. I have not created anything additional in that resource group. The VNet is in a different resource group.

I'm using the East US 2 region.

I originally tried to use the helm chart from https://github.com/kubernetes/charts/tree/master/stable/acs-engine-autoscaler

When the helm chart didn't work for me, I followed the documentation here. I created the secrets and deployed scaling-deployment.yaml. Below are the commands I used in the yaml. Everything else in the yaml file is the same as what is on GitHub.

   command:
        - python
        - main.py
        - --resource-group 
        - RianTestKubernetes2
        - --acs-deployment
        - RianDevKube2a
        - -vvv
        - --debug

Custom_kubernetes.json.txt

wbuchwalter (Owner) commented Jul 18, 2017

Currently the autoscaler cannot work on clusters with a custom VNet.

This is due to a limitation in ARM.
In order to scale up, the autoscaler has to remove the NSG from the template before redeploying, to avoid service interruptions in the Kubernetes cluster (see https://github.com/Azure/acs-engine/issues/979 for more info).

But when you deploy a k8s cluster with a custom VNet, the NIC of each VM directly references the NSG, so the NSG needs to be present in the template, which is not possible as explained above.

I will check if I can find a way to circumvent this; suggestions are welcome :)

Regarding the helm chart, the current version is deprecated. I have had a PR open for a while here: helm/charts#1408; hopefully it will get merged soon. I will probably duplicate the chart in this repo as well, so we can have an up-to-date version at any time.
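To make this concrete, here is a rough sketch (illustrative only, not the autoscaler's actual code) of what ARM is complaining about: once the NSG resource has been stripped from the generated azuredeploy.json, anything that still references variables('nsgID'), such as the NICs of a custom-VNet cluster, is exactly what validation rejects as "not defined in the template".

    # Rough illustration only (not the autoscaler's actual code): find resources
    # that still reference the NSG after the NSG resource itself was removed.
    import json

    with open('azuredeploy.json') as f:   # template generated by acs-engine
        template = json.load(f)

    nsg_type = 'Microsoft.Network/networkSecurityGroups'
    nsg_in_template = any(r['type'] == nsg_type for r in template['resources'])

    dangling = [r['name'] for r in template['resources']
                if r['type'] != nsg_type and "variables('nsgID')" in json.dumps(r)]

    if not nsg_in_template and dangling:
        # ARM validation rejects the deployment with "not defined in the template"
        print('Resources still referencing the removed NSG:', dangling)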

rian-hout commented Jul 18, 2017

Thanks for the info. I thought I had read that the autoscaler wouldn't work due to the custom VNet NSG, but I thought that pertained to a different autoscaler, not this one. Unfortunately, I don't have any suggestions for working around this.

Do you have any insights on how to manually scale the cluster? I've done some reading on acs-engine, and it sounds like I need to change the count in azuredeploy.parameters.json and redeploy the template. Two questions on this:

What is the proper way to redeploy the template? Do I use New-AzureRmResourceGroupDeployment in PowerShell or some other method? I tried this once and it seemed to overwrite everything that was already up and running.

Is it possible to change the server size by redeploying the template, or am I limited to changing only the agent count?

wbuchwalter (Owner) commented Jul 18, 2017

If you check the issue I linked earlier, I explained there how to scale manually.
But sadly, you will still face the ARM issue I described even when manually scaling, so it won't help you.
The only workaround I can think of is to use the Azure REST API to scale up instead of an ARM template, but that would be quite a lot of work in the autoscaler.

rian-hout commented Jul 18, 2017

Thank you for the information. Hopefully you can find a solution to this. I think we can close out this issue unless you wish to keep it open.

wbuchwalter (Owner) commented Jul 18, 2017

Let's keep it open so that others can easily find it and chime in.

oryagel commented Aug 9, 2017

Hi,
I'm having the same issue when scaling in a custom VNet.

X-E-n-G commented Nov 10, 2017

Same here! Thanks for the explanation though...

sprab commented Nov 23, 2017

Hi All,

I got this working with the following steps:

  1. Created a resource group.
  2. Created a VNet in the resource group.
  3. Created an NSG in the resource group with inbound allow rules for 443 and 22.
  4. Generated the template using acs-engine.
  5. Deleted the NSG resource from the template.
  6. Modified the NSG variables to reflect the NSG created in the portal.
  7. Modified the NSG ID to reflect the NSG created in the portal.
  8. Deployed the template successfully.
  9. Deployed the autoscaler.
  10. Deployed an RC and modified it to increase the pod count.
  11. The autoscaler correspondingly scales out the agent nodes.
  12. The autoscaler now successfully scales the agent nodes in and out without any error.

The idea is to remove the NSG from the template, create it separately, and then associate it with the agent and master nodes; a rough script version of steps 5-7 is sketched below.
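As a rough sketch only: the variable names nsgName and nsgID and the placeholder resource IDs below are assumptions, so check them against your own generated azuredeploy.json before using anything like this.

    # Rough sketch of steps 5-7 above; adjust variable names and IDs to your template.
    import json

    with open('azuredeploy.json') as f:
        template = json.load(f)

    # Step 5: delete the NSG resource that acs-engine generated.
    template['resources'] = [r for r in template['resources']
                             if r['type'] != 'Microsoft.Network/networkSecurityGroups']

    # Steps 6-7: point the NSG variables at the NSG created beforehand in the portal
    # (placeholders below; use your subscription, resource group and NSG name).
    template['variables']['nsgName'] = '<your-nsg-name>'
    template['variables']['nsgID'] = ('/subscriptions/<sub-id>/resourceGroups/<rg>'
                                      '/providers/Microsoft.Network/networkSecurityGroups/<your-nsg-name>')

    with open('azuredeploy.json', 'w') as f:
        json.dump(template, f, indent=2)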

X-E-n-G commented Nov 23, 2017

This solution is verified; I received the same setup from MS this afternoon, and it works fine! Great job, @sprab!

garytofu (Contributor) commented Nov 24, 2017

I have another workaround that is similar to @sprab's:

  1. Create a resource group.
  2. Create a VNet in the resource group.
  3. Generate the ARM template and parameters using acs-engine.
  4. Split the ARM template into two parts (the NSG resource, and the resources other than the NSG).
  5. Deploy the NSG resource.
  6. Deploy the other resources (removing "dependsOn": ["[variables('nsgID')]"]).
  7. Update the route table for the subnet.

The reason to do this is to get around the ARM issue (or bug?) of requiring the NSG named in dependsOn to exist in the ARM template, without checking for its existence in the resource group.

However, I think removing "dependsOn": ["[variables('nsgID')]"] from the template generated during the autoscaler's scale-out process would be a better solution. I am wondering if it could be done; a minimal sketch of the idea follows.
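To make the idea concrete, a minimal sketch of what that could look like (illustrative only, not the code in any actual fix):

    # Illustrative sketch: strip every dependsOn entry that points at the NSG so
    # the redeployment no longer requires the NSG to be defined in the template.
    import json

    with open('azuredeploy.json') as f:
        template = json.load(f)

    for resource in template['resources']:
        deps = [d for d in resource.get('dependsOn', []) if "variables('nsgID')" not in d]
        if deps:
            resource['dependsOn'] = deps
        else:
            resource.pop('dependsOn', None)

    with open('azuredeploy.json', 'w') as f:
        json.dump(template, f, indent=2)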

X-E-n-G commented Nov 24, 2017

The MS team and I deployed the ARM template after removing all the NSG entries, and the k8s cluster only worked in a limited way: you couldn't expose anything, and the controller manager pod crashed when you did. But the autoscaler worked, and you could deploy pods and such... you just couldn't expose them.

@sprab's solution is simple; you just have to change 5 lines of code for it to work... We deploy an autoscaler cluster in under 10 minutes. The autoscaling is nice too, as it seems to break the node deployments into groups, which take about 4 minutes to deploy...

wbuchwalter (Owner) commented Nov 24, 2017

> However, I think removing "dependsOn": ["[variables('nsgID')]"] from the template generated during the autoscaler's scale-out process would be a better solution. I am wondering if it could be done.

Yes, it should be possible. I had issues with that approach when I tried it a while back, but I can't remember exactly what they were (they might be unrelated). In any case, this is what I plan on implementing as a fix when I have time.
However, as you may have noticed, I am not very active these days, as I have a lot of other commitments and travel.
If someone wants to submit a PR for this I would gladly accept it; otherwise, I plan on doing a bunch of fixes around mid-December when I am back home.

garytofu (Contributor) commented Nov 24, 2017

I have created a pull request for your review: #67.

Thanks @wbuchwalter!

wbuchwalter (Owner) commented Nov 24, 2017

I have pushed a new image, wbuchwalter/kubernetes-acs-engine-autoscaler:nsg-beta, which includes the fix for this issue by @garytofu.
I haven't been able to test it myself yet, so if you do test it in your cluster, please let me know whether everything works fine with your configuration.

emondek (Contributor) commented Dec 2, 2017

Looks like we need to remove the NSG dependency from the Microsoft.Network/loadBalancers resource used for the master ILB. I created PR #68. I also pushed a test image to emondek/kubernetes-acs-engine-autoscaler:remove-nsg-dependson.
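If you want to verify your own generated template, an illustrative check (not part of the PR) for load balancers that still carry an NSG dependency could look like this:

    # Illustrative check: list load balancers whose dependsOn still mentions the NSG.
    import json

    with open('azuredeploy.json') as f:
        template = json.load(f)

    for r in template['resources']:
        if r['type'] == 'Microsoft.Network/loadBalancers':
            nsg_deps = [d for d in r.get('dependsOn', [])
                        if 'networkSecurityGroups' in d or 'nsgID' in d]
            if nsg_deps:
                print(r['name'], 'still depends on:', nsg_deps)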

X-E-n-G commented Dec 3, 2017
