openstack_networking_port_v2 all_fixed_ips is empty after creation (first run) but filled after rm from state (second run) #1606

Closed
frittentheke opened this issue Aug 22, 2023 · 16 comments

Comments

@frittentheke

Terraform Version

Terraform v1.5.5

  • provider registry.terraform.io/terraform-provider-openstack/openstack v1.52.1

Affected Resource(s)

Please list the resources as a list, for example:

  • openstack_networking_port_v2

Terraform Configuration Files

resource "openstack_networking_port_v2" "vpn" {
  name       = "vpn"
  network_id = var.network_id

  admin_state_up     = "true"
  security_group_ids = [openstack_networking_secgroup_v2.vpn.id]
}

resource "openstack_networking_router_route_v2" "vpn" {
  router_id        = var.router_id
  destination_cidr = var.cidr
  next_hop         = openstack_networking_port_v2.vpn.all_fixed_ips[0]
}

Debug Output

Panic Output

Expected Behavior

The openstack_networking_port_v2 is used as the interface for an instance providing a VPN service. A port is used in order to have a fixed / known IP address. I would expect the IP of the port to be returned as the first element of the all_fixed_ips list, to then be used as the next_hop for a static route.

In short: I want to route the network behind the VPN to the corresponding instance via a static route.

Actual Behavior

There are two errors thrown in relation to the port just created:

[...]
│ Error: Error creating OpenStack server: Bad request with: [POST https://compute.region.cloud.example.com/v2.1/servers], error message: {"badRequest": {"code": 400, "message": "Port ca6b83fd-e624-4e91-8dac-db291be55a42 requires a FixedIP in order to be used."}}
│ 
│   with module.vpn-server.openstack_compute_instance_v2.vpn,
│   on .terraform/modules/vpn-server/server/main.tf line 154, in resource "openstack_compute_instance_v2" "vpn":
│  154: resource "openstack_compute_instance_v2" "vpn" {
│ 
╵
╷
│ Error: Invalid index
│ 
│   on .terraform/modules/vpn-server/server/main.tf line 181, in resource "openstack_networking_router_route_v2" "vpn":
│  181:   next_hop         = openstack_networking_port_v2.vpn.all_fixed_ips[0]
│     ├────────────────
│     │ openstack_networking_port_v2.vpn.all_fixed_ips is empty list of string
│ 
│ The given key does not identify an element in this collection value: the collection has no elements.
[...]

causing the terraform run to abort with an error.

Steps to Reproduce

  • After a terraform apply the error is reached
  • Then terraform state rm openstack_networking_port_v2.vpn
  • terraform apply again and things work just fine.

It seems the port resource takes longer to be fully created and initialized, and the provider moves on too early.
Would a refresh of the just-created resource be enough? Or does some other indication that the port has actually finished provisioning have to be tracked via the API?
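
One way to surface the problem closer to its source is a postcondition on the port, so that the apply fails on the port itself with a descriptive message rather than with a later "Invalid index" on the route. This is only a minimal sketch, assuming Terraform 1.2 or later (custom condition checks); the error message wording is illustrative and not something the provider emits. It reports the empty list, it does not populate it.

```hcl
resource "openstack_networking_port_v2" "vpn" {
  name       = "vpn"
  network_id = var.network_id

  admin_state_up     = "true"
  security_group_ids = [openstack_networking_secgroup_v2.vpn.id]

  lifecycle {
    # Evaluated once the port's attributes are known; `self` refers to
    # this port. The message text below is an assumption for illustration.
    postcondition {
      condition     = length(self.all_fixed_ips) > 0
      error_message = "Port was created without any fixed IP; check that the network already has a subnet."
    }
  }
}
```

With this in place, resources that reference all_fixed_ips[0] are not evaluated against an empty list, because the port itself fails first.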

Important Factoids

References

@nikParasyr
Member

@frittentheke thanks for reporting this.

I speculate that this is occurring because the OpenStack API responds to the port creation before an IP gets allocated by the DHCP server, and the following GET for the port also happens before an IP is allocated.

Can you run the following scenario for me:

  1. Run initial terraform apply
  2. Optional: check how the port resource is written in the state
  3. Re-run terraform apply (without any changes to the code)
  4. Check again the state

@frittentheke
Author

@nikParasyr thanks for the fast response!

I did run the scenario you asked for. Full disclosure, there are quite a few resources spawned and the port for the instance running the VPN services is created by a terraform module .... but here we go:

  1. terraform apply ended with the error I reported about all_fixed_ips being empty.
  2. This is the state for the resource:
terraform state show module.vpn-server.openstack_networking_port_v2.vpn
# module.vpn-server.openstack_networking_port_v2.vpn:
resource "openstack_networking_port_v2" "vpn" {
    admin_state_up         = true
    all_fixed_ips          = []
    all_security_group_ids = [
        "87acb073-5123-4473-b33b-fc78f522c6b8",
    ]
    all_tags               = []
    dns_assignment         = []
    id                     = "9b37978b-ed53-41c2-983f-31570eb88259"
    mac_address            = "fa:16:3e:3a:58:ec"
    name                   = "vpn"
    network_id             = "f946cedc-94d1-4bde-a680-f59d615ad2e3"
    port_security_enabled  = true
    region                 = "fra"
    security_group_ids     = [
        "87acb073-5123-4473-b33b-fc78f522c6b8",
    ]
    tenant_id              = "REDACTED"

    allowed_address_pairs {
        ip_address = "10.3.4.0/24"
    }

    binding {
        vif_details = {}
        vnic_type   = "normal"
    }
}

so all_fixed_ips is empty.

  3. Running terraform apply again does not work; it ends with the same error about all_fixed_ips.
  4. State remains unchanged.

But we dug a little deeper:

  1. terraform refresh does NOT update the all_fixed_ips (whether called implicitly by the apply or explicitly)
  2. terraform apply -target module.vpn-server.openstack_networking_port_v2.vpn does "work", but finds nothing that needs changing. So also then the field is not populated.
  3. The terraform state rm module.vpn-server.openstack_networking_port_v2.vpn which I did initially of course caused a new port to be created, leaving the first one dangling. But then all_fixed_ips was set, so the other resource referring to it worked fine.
  4. As I said, there are quite a few resources in this terraform code, so I believe this is why the port resource does not behave the same way on the first apply, which creates everything, as on the second attempt, where only this port and the static route are changed / created via the API. Read: convergence time.

This is the terraform debug output / openstack API response to the port creation (initial terraform apply):

[...]
2023-08-23T12:06:29.893+0200 [INFO]  provider.terraform-provider-openstack_v1.52.1: 2023/08/23 12:06:29 [DEBUG] OpenStack Request URL: GET https://network.regsion.cloud.example.com/v2.0/ports?id=9b37978b-ed53-41c2-983f-31570eb88259: timestamp=2023-08-23T12:06:29.893+0200
2023-08-23T12:06:29.893+0200 [INFO]  provider.terraform-provider-openstack_v1.52.1: 2023/08/23 12:06:29 [DEBUG] OpenStack Request Headers:
Accept: application/json
Cache-Control: no-cache
User-Agent: HashiCorp Terraform/1.5.5 (+https://www.terraform.io) Terraform Plugin SDK/2.10.1 gophercloud/v1.4.0
X-Auth-Token: ***: timestamp=2023-08-23T12:06:29.893+0200
2023-08-23T12:06:29.983+0200 [INFO]  provider.terraform-provider-openstack_v1.52.1: 2023/08/23 12:06:29 [DEBUG] OpenStack Response Code: 200: timestamp=2023-08-23T12:06:29.983+0200
2023-08-23T12:06:29.983+0200 [INFO]  provider.terraform-provider-openstack_v1.52.1: 2023/08/23 12:06:29 [DEBUG] OpenStack Response Headers:
Content-Type: application/json
Date: Wed, 23 Aug 2023 10:06:29 GMT
Server: Apache
Strict-Transport-Security: max-age=63072000
Vary: Accept-Encoding
Via: 1.1 network.region.cloud.example.com
X-Openstack-Request-Id: req-1b02e0f3-442d-4b85-a9ce-40a765d05fb5: timestamp=2023-08-23T12:06:29.983+0200
2023-08-23T12:06:29.983+0200 [INFO]  provider.terraform-provider-openstack_v1.52.1: 2023/08/23 12:06:29 [DEBUG] OpenStack Response Body: {
  "ports": [
    {
      "admin_state_up": true,
      "allowed_address_pairs": [
        {
          "ip_address": "10.3.4.0/24",
          "mac_address": "fa:16:3e:3a:58:ec"
        }
      ],
      "binding:vnic_type": "normal",
      "created_at": "2023-08-23T10:06:12Z",
      "description": "",
      "device_id": "",
      "device_owner": "",
      "dns_assignment": [],
      "dns_name": "",
      "extra_dhcp_opts": [],
      "fixed_ips": [],
      "id": "9b37978b-ed53-41c2-983f-31570eb88259",
      "mac_address": "fa:16:3e:3a:58:ec",
      "name": "vpn",
      "network_id": "f946cedc-94d1-4bde-a680-f59d615ad2e3",
      "port_security_enabled": true,
      "project_id": "REDACTED",
      "revision_number": 1,
      "security_groups": [
        "87acb073-5123-4473-b33b-fc78f522c6b8"
      ],
      "status": "DOWN",
      "tags": [],
      "tenant_id": "REDACTED",
      "updated_at": "2023-08-23T10:06:12Z"
    }
  ]
}: timestamp=2023-08-23T12:06:29.983+0200
[...]

@nikParasyr
Member

From the Neutron docs I see that port creation returns 201. Also, I am unable to reproduce this in my environment:

2023-08-24T15:50:43.933+0200 [INFO]  provider.terraform-provider-openstack_v1.46.0: 2023/08/24 15:50:43 [DEBUG] OpenStack Request Body: {
  "port": {
    "admin_state_up": true,
    "name": "vpn",
    "network_id": "157c19ff-a568-45bc-88a7-0ec62d5a7a7a"
  }
}: timestamp=2023-08-24T15:50:43.933+0200
2023-08-24T15:50:44.505+0200 [INFO]  provider.terraform-provider-openstack_v1.46.0: 2023/08/24 15:50:44 [DEBUG] OpenStack Response Code: 201: timestamp=2023-08-24T15:50:44.505+0200
2023-08-24T15:50:44.505+0200 [INFO]  provider.terraform-provider-openstack_v1.46.0: 2023/08/24 15:50:44 [DEBUG] OpenStack Response Headers:
Content-Length: 699
Content-Type: application/json
Date: Thu, 24 Aug 2023 13:50:44 GMT
X-Openstack-Request-Id: req-7eb97add-533b-46f0-a0d3-31bae4ac65e9: timestamp=2023-08-24T15:50:44.505+0200
2023-08-24T15:50:44.505+0200 [INFO]  provider.terraform-provider-openstack_v1.46.0: 2023/08/24 15:50:44 [DEBUG] OpenStack Response Body: {
  "port": {
    "admin_state_up": true,
    "allowed_address_pairs": [],
    "binding:vnic_type": "normal",
    "created_at": "2023-08-24T13:50:44Z",
    "description": "",
    "device_id": "",
    "device_owner": "",
    "extra_dhcp_opts": [],
    "fixed_ips": [
      {
        "ip_address": "192.168.1.102",
        "subnet_id": "60882bd0-9597-4433-b841-aad6868d82b7"
      }
    ],
    "id": "0c092325-05ee-4dfd-85ec-a9571a37310a",
    "mac_address": "fa:16:1e:c7:be:fe",
    "name": "vpn",
    "network_id": "157c19ff-a568-45bc-88a7-0ec62d5a7a7a",
    "port_security_enabled": true,
    "project_id": "ed498e81f0cc448bae0ad4f8f21bf67f",
    "revision_number": 1,
    "security_groups": [
      "d6e94844-3231-42ca-bd35-7cc1a68bd095"
    ],
    "status": "DOWN",
    "tags": [],
    "tenant_id": "ed498e81f0cc448bae0ad4f8f21bf67f",
    "updated_at": "2023-08-24T13:50:44Z"
  }
}: timestamp=2023-08-24T15:50:44.505+0200

So fixed_ips is already populated in the response and is written to the state correctly.

@frittentheke what behavior do you get via the CLI?

❯ openstack port create --network 157c19ff-a568-45bc-88a7-0ec62d5a7a7a vpn
+-------------------------+-----------------------------------------------------------------------------+
| Field                   | Value                                                                       |
+-------------------------+-----------------------------------------------------------------------------+
| admin_state_up          | UP                                                                          |
| allowed_address_pairs   |                                                                             |
| binding_host_id         | None                                                                        |
| binding_profile         | None                                                                        |
| binding_vif_details     | None                                                                        |
| binding_vif_type        | None                                                                        |
| binding_vnic_type       | normal                                                                      |
| created_at              | 2023-08-24T14:13:46Z                                                        |
| data_plane_status       | None                                                                        |
| description             |                                                                             |
| device_id               |                                                                             |
| device_owner            |                                                                             |
| device_profile          | None                                                                        |
| dns_assignment          | None                                                                        |
| dns_domain              | None                                                                        |
| dns_name                | None                                                                        |
| extra_dhcp_opts         |                                                                             |
| fixed_ips               | ip_address='192.168.1.88', subnet_id='60882bd0-9597-4433-b841-aad6868d82b7' |
| id                      | 65822420-e3f9-47d2-9b20-63cb6dd9dd4c                                        |
| ip_allocation           | None                                                                        |
| mac_address             | fa:16:1e:ad:45:b2                                                           |
| name                    | vpn                                                                         |
| network_id              | 157c19ff-a568-45bc-88a7-0ec62d5a7a7a                                        |
| numa_affinity_policy    | None                                                                        |
| port_security_enabled   | True                                                                        |
| project_id              | ed498e81f0cc448bae0ad4f8f21bf67f                                            |
| propagate_uplink_status | None                                                                        |
| qos_network_policy_id   | None                                                                        |
| qos_policy_id           | None                                                                        |
| resource_request        | None                                                                        |
| revision_number         | 1                                                                           |
| security_group_ids      | d6e94844-3231-42ca-bd35-7cc1a68bd095                                        |
| status                  | DOWN                                                                        |
| tags                    |                                                                             |
| trunk_details           | None                                                                        |
| updated_at              | 2023-08-24T14:13:46Z                                                        |
+-------------------------+-----------------------------------------------------------------------------+

Also, are you aware of any specific Neutron/DHCP configuration in your OpenStack environment? Anything that could cause a port to get an IP only after a delay?

@frittentheke
Author

Maybe a few basics:

  • I am running the OpenStack Yoga release
  • Neutron uses the linuxbridge driver, HA is used (l3_ha=True), with max_l3_agents_per_router=3 and dhcp_agents_per_network=3
      • So maybe the RPC takes a little longer if there are many things to do / apply to the same router?

Looking at https://github.com/openstack/neutron/blob/5d97b13c7978c70673d1c886f0c49319076fdec5/neutron/db/models_v2.py#L110 makes me wonder if the port object might be returned without the IPAllocations if they are not already there when the subquery happens?

Diving into how an API call to create a port is distributed is kind of a rabbit hole ...

There is just so much code dealing with ports and their IPs ...

and I believe some of this is done asynchronously, racing the API response for the newly created port and its all_fixed_ips field.

@frittentheke
Author

@nikParasyr Is there any way I could assist more with this issue? Is this potentially even a bug with Neutron returning the port creation response prematurely?

@nikParasyr
Member

@frittentheke I'm not sure how to tackle this tbh. I'm also busy with some personal stuff for the next 2 weeks.

> Is this potentially even a bug with Neutron returning the port creation response prematurely?

It could be, but I'm not 100% sure. (We also run l3_ha x3 at our site and I get an IP instantly.)

We could potentially add a wait till the fixed_ip is populated.

@kayrus any ideas?

@frittentheke
Author

@nikParasyr thanks again for digging into this issue!
Is there a way to ask a Neutron dev if this is intended behavior for the API to potentially
return the port without fixed_ips populated?

I raised a bug with Neutron https://bugs.launchpad.net/neutron/+bug/2035230 to ask if this behavior is expected.
Would not want to add polling code and timeouts to the provider if this was an API issue in the first place.

@nikParasyr
Member

> Is there a way to ask a Neutron dev if this is intended behavior for the API to potentially
> return the port without fixed_ips populated?

The bug you opened is one way, and there is already a response. Otherwise, IRC channels are also an option: https://docs.openstack.org/contributors/common/irc.html

@frittentheke
Author

@nikParasyr ... did you see https://bugs.launchpad.net/neutron/+bug/2035230/comments/3 ?
So in short: it's expected that the fixed_ips are not initially returned and need to be waited for.

Do you see any chance this could be fixed?

@nikParasyr
Member

nikParasyr commented Oct 13, 2023

@frittentheke are you creating the subnet in the same run? and if so can you add a depends_on on the port resource for the subnet, or alternatively define subnet_id in the port resource => https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/networking_port_v2#subnet_id (inside the fixed_ip block)?

From their response this is expected if the subnet is not created. The above 2 options should force the port creation to happen after the subnet creation.

In the meantime I'll try to find some time to check whether we can/should add a "wait" to the port resource.

@nikParasyr
Member

@frittentheke were you successful when adding depends_on / defining subnet_id ?

@mnaser
Collaborator

mnaser commented Oct 31, 2023

I think a reason for this could be if you use routed provider networks which delegate the selection of the IP address until it's scheduled. You'll see something like ip_allocation set to deferred in this case.

https://docs.openstack.org/neutron/latest/admin/config-routed-networks.html

Could this be the case?

@frittentheke
Author

> @frittentheke are you creating the subnet in the same run? and if so can you add a depends_on on the port resource for the subnet, or alternatively define subnet_id in the port resource => https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/networking_port_v2#subnet_id (inside the fixed_ip block)?
>
> From their response this is expected if the subnet is not created. the above 2 options should force the port creation to happen after the subnet creation.

@nikParasyr yes, we are creating a number of networking resources in a single run. So a router, multiple networks and subnets, ...

The issue is not occurring 100% of the time. So it's a race condition. If it occurs, it's enough to replace the port resource, which then receives the "missing" all_fixed_ips when it is the only resource being created.

So testing with depends_on is not that easy. Maybe(tm) it does help as it causes some more serialization, but likely it's not tackling the root cause of a deferred / delayed IP address allocation.

> In the meantime I'll try to find some time to check whether we can/should add a "wait" to the port resource.

That would be awesome.
I was thinking that implementing support for refresh on this field might also be sensible, as this data might change?

> I think a reason for this could be if you use routed provider networks which delegate the selection of the IP address until it's scheduled. You'll see something like ip_allocation set to deferred in this case.
>
> https://docs.openstack.org/neutron/latest/admin/config-routed-networks.html
>
> Could this be the case?

@mnaser Thanks for diving into this issue!
This might just be another case in which the ips are not returned with the initial resource create response, but we are not using that.

@nikParasyr
Member

nikParasyr commented Nov 2, 2023

> but likely it's not tackling the root cause of a deferred / delayed IP address allocation.

I think it will tackle the root cause, which according to the Neutron people on Launchpad is:

> This result is something expected if the network where the port is created has no subnets

I've actually had deployments where this behavior occurred.


Explaining:

Currently you have this code:

resource "openstack_networking_port_v2" "vpn" {
  name       = "vpn"
  network_id = var.network_id

  admin_state_up     = "true"
  security_group_ids = [openstack_networking_secgroup_v2.vpn.id]
}

This only tells TF that the port resource depends on the network resource; it says nothing about the subnet. Based on the dependency graph, TF will parallelize the creation of the subnet AND the port => their creation is triggered "at the same time", meaning you are in a race condition (which you have noticed). Sometimes it will work, because internally on the Neutron level the subnet creation happens before the port creation and thus you get an IP; other times it is the opposite and you won't get an IP. If you look at your terraform apply output you will probably see something like:

creating network resource
...
network resource **created** (ID=blah_blah)
creating port resource
creating subnet resource <= (triggered at the same time, you are in a race condition)
...

If you switch your port resource to:

resource "openstack_networking_port_v2" "vpn" {
  name       = "vpn"
  network_id = var.network_id

  admin_state_up     = "true"
  security_group_ids = [openstack_networking_secgroup_v2.vpn.id]

  fixed_ip {
    subnet_id = openstack_networking_subnet_v2.name-here.id
  }
}

This makes it known to TF that the port resource depends on the subnet => the TF dependency graph will force the subnet creation to finish before it triggers the port creation. So this should remove the race condition. Your terraform apply logs will look like:

creating network resource
...
network resource **created** (ID=blah_blah)
creating subnet resource
...
subnet resource **created** (ID= bluh bluh)
creating port resource <= triggered after the subnet is created, so based on the Neutron people's input your port will get an IP now; there is no race condition
...

depends_on will have the same result but it is a bit more pesky to use when you have for_each etc to create multiple resources.
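
For comparison, a sketch of the depends_on variant. It assumes the subnet is defined in the same module; the resource name openstack_networking_subnet_v2.vpn is hypothetical and stands in for whatever the subnet is actually called.

```hcl
resource "openstack_networking_port_v2" "vpn" {
  name       = "vpn"
  network_id = var.network_id

  admin_state_up     = "true"
  security_group_ids = [openstack_networking_secgroup_v2.vpn.id]

  # Explicit ordering only: the port is created after the subnet exists,
  # but no subnet is pinned on the port itself.
  # `openstack_networking_subnet_v2.vpn` is a hypothetical resource name.
  depends_on = [openstack_networking_subnet_v2.vpn]
}
```

Unlike the fixed_ip variant, this leaves the choice of subnet to Neutron and only changes the order in which TF creates the resources.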


Given the Neutron people's input, similar behaviors I've noticed, and your input (race condition + not using deferred), I am rather certain the above will fix it. I would prefer that you test the above solution before we consider adding a wait.
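
For the for_each case, the subnet_id reference can carry the dependency per instance. A rough sketch, assuming the subnets themselves are created with for_each under the hypothetical name openstack_networking_subnet_v2.vpn, and that one port per subnet is wanted:

```hcl
resource "openstack_networking_port_v2" "vpn" {
  # One port per subnet; the subnet resource name is hypothetical and is
  # assumed to be created with for_each as well.
  for_each = openstack_networking_subnet_v2.vpn

  name       = "vpn-${each.key}"
  network_id = each.value.network_id

  admin_state_up     = "true"
  security_group_ids = [openstack_networking_secgroup_v2.vpn.id]

  # Referencing the subnet's id makes each port wait for its own subnet.
  fixed_ip {
    subnet_id = each.value.id
  }
}
```

Each port instance then depends only on the subnet it references, which is what makes this less awkward than a blanket depends_on.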

@frittentheke
Author

@nikParasyr sorry for the delay here. I added the subnet_id reference now and the issue seems to not occur anymore.
So you were indeed correct.

Thanks for all your time and deep-diving into this mess ;-)

@nikParasyr
Member

Thank you as well for the patience. I’ve updated the docs so hopefully it will be clear for other users.
I’ll close the issue.
