
Announcement: Upcoming Changes in Version 2.0 of the Azure Provider #2807

Closed
tombuildsstuff opened this issue Jan 30, 2019 · 36 comments

@tombuildsstuff (Member) commented Jan 30, 2019

Terraform initially shipped support for the AzureRM Provider back in December 2015. Since then we've added support for 191 Resources, 58 Data Sources and have launched a couple of related Providers in the form of the Azure Active Directory Provider and the Azure Stack Provider.

Version 2.0 of the AzureRM Provider will be a Major Release - in that it will include some larger-scale changes not seen in a regular release. A summary of these changes is outlined below - however a full breakdown will be available on the Terraform Website after the release of v1.22.

Summary

  • Existing Resources will be required to be imported
  • Custom Timeouts will be available on Resources - this will allow you to specify a custom timeout for provisioning the resource in your Terraform Configuration using the timeouts block.
  • New resources for Virtual Machines and Virtual Machine Scale Sets
  • Removing Fields, Data Sources and Resources which have been deprecated

A brief summary of each item can be found below - more details will be available in the Azure Provider 2.0 upgrade guide on the Terraform Website once v1.22 has been released.


Existing Resources will be required to be Imported

Terraform allows existing resources which were created outside of Terraform to be Imported into Terraform's State. Once a resource is imported into the state, it's possible for Terraform to track changes to and manage this resource. The Azure Provider supports Importing existing resources into the state (using terraform import) for (almost) every resource.

Version 2.0 of the Azure Provider aims to solve an issue where it's possible to unintentionally import resources into the state by running terraform apply. To explain this further: the majority of Azure's APIs are Upserts - which means that a resource will be updated if it exists, and created otherwise.

Since the unique identifier for (most) Azure resources is the name (rather than, for example, an aws_instance, where AWS generates a unique identifier) - it's possible that users have unintentionally imported resources into Terraform when running terraform apply against an existing resource.

Whilst this may allow resources to work in some cases, it leads to hard-to-diagnose bugs in others (which could have been caught during terraform plan).

In order to match the behaviour of other Terraform Providers, version 2.0 of the AzureRM Provider will require that existing resources are imported into the state prior to use. This means that Terraform will check for the presence of an existing resource prior to creating it - and will return an error similar to the below:

A resource with the ID /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/group1 already exists - to be managed via Terraform this resource needs to be imported into the State. Please see the resource documentation for `azurerm_resource_group` for more information.

You can opt into this behaviour in version 1.22 of the AzureRM Provider by setting the Environment Variable ARM_PROVIDER_STRICT to true.
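
For example - assuming the Resource Group from the error message above is defined in your configuration in a (hypothetical) block named azurerm_resource_group.group1 - opting in and importing it would look something like this:

# opt into the stricter behaviour in v1.22 of the AzureRM Provider
export ARM_PROVIDER_STRICT=true

# import the existing Resource Group into Terraform's State
terraform import azurerm_resource_group.group1 /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/group1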

Custom Timeouts for Resources

Resources can optionally support a timeouts block - which allows users to specify a Custom Timeout for resource creation/deletion as part of the Terraform Configuration.

Prior to version 2.0 the Azure Provider has a default resource timeout of an hour - which cannot be overridden. This works for the most part, but there are certain scenarios where it'd be helpful to override this.

This is useful for resources which can take a long time to delete - for example deleting the azurerm_resource_group resource will delete any resources within it, which can take time. Within your Terraform Configuration this could be represented like so:

resource "azurerm_resource_group" "test" {
  name     = "example-resource-group"
  location = "West Europe"

  timeouts {
    create = "10m"
    delete = "30m"
  }
}

We intend to support the timeouts block in version 2.0 of the Azure Provider - which will allow timeouts to be specified on resources (as shown above). This feature request is being tracked here and will form part of the 2.0 release of the AzureRM Provider.

New Resources for Virtual Machines and Virtual Machine Scale Sets

We originally shipped support for the azurerm_virtual_machine and azurerm_virtual_machine_scale_set resources back in March 2016.

Over time new features have been added to these resources by Azure - such as Managed Disks and Managed Service Identity - which these resources now support. Azure has also changed the behaviour of some fields since these resources first launched, so that it's now possible to update them where this wasn't previously possible - for example the Custom Data for a Virtual Machine.

We've spent some time thinking about how we can accommodate these changes and about how we can improve the user experience of both resources.
In particular we've wanted to give better validation during terraform plan, rather than bailing out with an Azure API error during terraform apply - however this isn't possible with the current resource structure since it's very generic. The validation requirements also vary substantially based on the fields provided - for example the name field for a Virtual Machine can be up to 63 characters for a Linux Virtual Machine but only 15 characters for a Windows Virtual Machine.

As such, after spending some time reading through bug reports and thinking about/prototyping some potential solutions - we believe the best path forward here is to split these resources out, so that we would have:

  • a Linux Virtual Machine Resource (working name: azurerm_linux_virtual_machine)
  • a Windows Virtual Machine Resource (working name: azurerm_windows_virtual_machine)
  • updating the Data Disk Attachment Resource to support Unmanaged Disks
  • a Linux Virtual Machine Scale Set Resource (working name: azurerm_linux_virtual_machine_scale_set)
  • a Windows Virtual Machine Scale Set Resource (working name: azurerm_windows_virtual_machine_scale_set)
  • a separate resource for Virtual Machine Scale Set Extensions (working name: azurerm_virtual_machine_scale_set_extension)

Please Note: all of the resources mentioned above do not currently exist, but will form part of the 2.0 release - a rough sketch of how one of them might look is shown below.
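
As a purely hypothetical sketch (the schema for these resources hasn't been designed yet, so the working name and every argument below are illustrative assumptions), the Linux Virtual Machine Resource might look something like:

resource "azurerm_linux_virtual_machine" "example" {
  # all arguments shown here are assumptions - the final schema may differ
  name                = "example-machine"
  resource_group_name = "example-resources"
  location            = "West Europe"
  size                = "Standard_F2"
  admin_username      = "adminuser"

  # ... disk, image and networking configuration omitted ...
}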

Whilst we're aware that this isn't ideal - since users will eventually have to update their code/import an existing resource - we believe this approach gives us a good footing for the future. In particular this allows us to reconsider the schema design so that we can support these new use-cases, fix some bugs and improve the user experience with these resources.

The existing azurerm_virtual_machine and azurerm_virtual_machine_scale_set resources will continue to be available throughout the 2.x releases - but over time we'd end up deprecating these in favour of the new resources.

Removing Deprecated Fields, Data Sources and Resources

As v2.0 of the AzureRM Provider is a major release - we'll be taking the opportunity to remove Fields, Data Sources and Resources which have been previously deprecated.

A detailed breakdown will be available on the Terraform Website once v1.22 has been released - and we'll update this issue with a link to that once it's live.


We've spent the past few months laying the groundwork for these changes - and whilst we appreciate that your Terraform Configurations may require code changes to upgrade to 2.0 - we take Semantic Versioning seriously and so try our best to limit these changes to major versions.

Pinning your Provider Version

We recommend pinning the version of each Provider you use in Terraform - you can do this using the version attribute in the provider block, pinning either to a specific version of the AzureRM Provider, like so:

provider "azurerm" {
  version = "=1.22.0"
}

.. or to any 1.x release:

provider "azurerm" {
  version = "~> 1.0"
}

More information on how to pin the version of a Terraform Provider being used can be found on the Terraform Website.

Once version 2.0 of the AzureRM Provider is released - you can then upgrade to it by updating the version specified in the Provider block, like so:

provider "azurerm" {
  version = "=2.0.0"
}

You can follow along with the work in the 2.0 release in this GitHub Milestone - we'll also post updates in this issue and publish the v2.0 upgrade guide on the Terraform Website once the v1.22 release is out.

There's a summary/update in the thread below: #2807 (comment)

@InterStateNomad commented Feb 1, 2019

Regarding the "Existing Resources will be required to be Imported" change - will we still be able to create Azure resources in an existing environment without having to import all of that existing environment into state?

For example, if I just want to create an Azure VM using Terraform in an existing environment - today I can just provide the name of the resource group I want the VM to be created in (resource_group_name) and use the azurerm_subnet data source to specify the networking details for the VM NIC.

Will this functionality still be possible, or would I have to import the VNET and RSG into a new state file before I can create my VM? I know it sounds odd, but a lot of the time I just like building VMs with Terraform because I prefer it, and I'm not really concerned with state management.

Thanks

@tombuildsstuff (Member Author) commented
@InterStateNomad

> Regarding the "Existing Resources will be required to be Imported" change - will we still be able to create Azure resources in an existing environment without having to import all of that existing environment into state?
> For example, if I just want to create an Azure VM using Terraform in an existing environment - today I can just provide the name of the resource group I want the VM to be created in (resource_group_name) and use the azurerm_subnet data source to specify the networking details for the VM NIC.
> Will this functionality still be possible, or would I have to import the VNET and RSG into a new state file before I can create my VM? I know it sounds odd, but a lot of the time I just like building VMs with Terraform because I prefer it, and I'm not really concerned with state management.

Yes, this will still work - only (existing) Resources need to be Imported into the State via terraform import - Data Sources can be used without importing and so will function as they do today.
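
To illustrate - this is a hypothetical example and the names below are placeholders - referencing an existing Subnet via the azurerm_subnet Data Source requires no import, whilst the new NIC is created and managed by Terraform as normal:

# existing Subnet, read via a Data Source - no terraform import required
data "azurerm_subnet" "existing" {
  name                 = "existing-subnet"
  virtual_network_name = "existing-vnet"
  resource_group_name  = "existing-resources"
}

# new resource created by Terraform inside the existing network
resource "azurerm_network_interface" "example" {
  name                = "example-nic"
  location            = "West Europe"
  resource_group_name = "existing-resources"

  ip_configuration {
    name                          = "internal"
    subnet_id                     = data.azurerm_subnet.existing.id
    private_ip_address_allocation = "Dynamic"
  }
}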

@TraGicCode commented

Hey @tombuildsstuff,

Is there an ETA for v2? Is there a workaround we can use in the meantime to stop this breaking our CI/CD pipelines in Azure Release Pipelines?

I'm really surprised the timeout issue with resources taking about an hour has been around for so long, as a lot of important resources in Azure have similar spin-up and spin-down times.

@tombuildsstuff (Member Author) commented

@TraGicCode

> Is there an ETA for v2? Is there a workaround we can use in the meantime to stop this breaking our CI/CD pipelines in Azure Release Pipelines?

Not at this time, unfortunately - we'll post more when we have that information. Originally we intended to focus on v2.0 after the v1.22 release - however we now plan to do a few more v1.x releases before starting work on v2.0.

> I'm really surprised the timeout issue with resources taking about an hour has been around for so long, as a lot of important resources in Azure have similar spin-up and spin-down times.

Unfortunately the upstream bugs which needed fixing to support this have only recently been resolved - so whilst we could add this to a handful of resources today, we feel there's more value in doing this for all resources at once, since we'll also be taking the opportunity to update the default timeouts for other resources to be more realistic (for example, it doesn't take an hour to create a Virtual Network - so setting the timeout to an hour [as we do today, since it's shared by all resources] is probably overkill).

Thanks!


@kmoe kmoe unpinned this issue Apr 28, 2019
@katbyte katbyte pinned this issue Apr 29, 2019
@Lachlan-White (Contributor) commented

@tombuildsstuff Is the Custom Timeout available in another version of the provider? Currently this is going to block my deployment of Azure Service Environments.

@tombuildsstuff (Member Author) commented

@a138076 unfortunately this is a change which needs to be made across every resource in the same release - so this won't land until 2.0.

@AdamCoulterOz (Contributor) commented

@tombuildsstuff - I'm from Vibrato in Australia, a HashiCorp APAC services provider, and I work with many clients using Terraform for Azure every week. Moving the resource model for VM and VMSS away from mapping to the Azure RM model, towards Windows- and Linux-specific versions, is going to make life very difficult for us. I agree with splitting out VMSS extensions though.

I understand what you're trying to achieve by moving to less generic versions, but can't the design goals still be met while retaining the generic nature of the resource types? There are many different resource types which have different validation rules based on provided attributes (in this case the OS), but if you were to create separate resource types every time validation rules had to vary based on attribute values, there would need to be 1000x+ as many resources maintained and the provider would become completely unusable. This seems inconsistent with the general approach for writing providers.

I also think you are conflating 2 different problems: one, the maintainability of lots of state migration code (hence the need to break to completely new resource implementations); and two, complex attribute-dependency validation logic.

Problem 1 - State Migration Maintenance
Since provider version 2.0.0 is marked as breaking, you don't need to maintain any compatibility with the existing implementations of VM or VMSS anyway - this is expected if you are following SemVer.
You could remove resource state migrations prior to the new "version 2" SchemaVersion, include only 1 SchemaVersion prior (provider version 1.xx or higher), and instruct users to upgrade by running against that provider version before switching to version 2. Alternatively, just break if they don't have a SchemaVersion of provider version 2.0.0 or higher and tell them they need to reimport - or the resource just reimports at that version, overwriting the state file as a one-time event with an explicit warning message.

Problem 2 - Attribute Validation Logic
I understand that maintaining the validation logic may be slightly more complex (kept within 1 resource) but I fundamentally believe you should be keeping them aligned to the AzureRM model. You still need to maintain the validation logic for both models anyway if you want to provide relevant feedback during the plan phase. Also, couldn't Microsoft provide some kind of validation via their Azure Go API client library?

I'm happy to receive feedback if any of my assumptions or reasoning are incorrect - I'd like to make sure we get this one right. Thanks!

@tombuildsstuff (Member Author) commented Feb 4, 2020

hey @AdamCoulterOz

Firstly - apologies it's taken so long to reply to this; I saw it shortly after you posted but didn't have a chance to reply, and it's been hidden in the "more comments" section until recently.

To go through each of your points in turn:

> I understand what you're trying to achieve by moving to less generic versions, but can't the design goals still be met while retaining the generic nature of the resource types?

> There are many different resource types which have different validation rules based on provided attributes (in this case the OS), but if you were to create separate resource types every time validation rules had to vary based on attribute values, there would need to be 1000x+ as many resources maintained and the provider would become completely unusable

Early versions of this Provider focused on matching the Azure APIs exactly - the result being that many users ended up with unclear error messages from the API which we were unable to catch during terraform plan/validate.

Whilst it's unfortunate that this is the case - the root cause is that the same Azure API can provision multiple things, which leads to the resources being generic. To go through a couple of examples in turn: HDInsight and App Services.

The HDInsight API looks like a good candidate for a generic resource, since the only thing that's really different between the cluster kinds is the set of blocks (e.g. worker_node, zookeeper_node and edge_node). We initially looked to ship the HDInsight resources as one resource - but quickly realised this wouldn't work due to the API behaving differently depending on the kind of cluster being provisioned (e.g. a number of the SKUs get mutated from their original value to Medium or Large - this happens for multiple SKUs and varies by cluster type, so we couldn't map the values back).

In addition the API flat-out rejected changes to some fields for some configurations but allowed them for others. The result of this being that whilst we /could/ have made this one generic resource - it would have been a pretty poor user experience. Ultimately we ended up splitting this into 8 different resources for HDInsight which behave as they should - which, whilst a little more work for us to maintain (and there's things we've done to alleviate this), allows these resources to behave the way that users expect.

The App Service API follows a similar pattern - where an App Service (Web App, Function App, API App, Mobile App etc.) can be provisioned within an App Service Plan of different kinds (Linux/Windows/Function/Consumption). Whilst at first glance this sounds fine, there are some pretty severe limitations to this resource which have ended up causing issues - namely that the API itself returns an empty HTTP 400 "Bad Request" when the configuration is wrong, with no details about what's gone wrong to be able to debug it.

The end result of this is that the more generic the resource, the more we end up pushing this complexity onto users - which in turn leads to errors we're unable to catch during a plan/validate and ultimately leads to more (GitHub/support) issues.

As such, whilst it does deviate from the Azure API - I fully expect that we'll introduce more "specialized" resources in the future to provide a better user experience than directly mapping the API - due to the issues caused by the design of the Azure APIs.

It's not something we plan to do anytime soon, but a good candidate for this is the azurerm_app_service resource which wants splitting into ~4 sub resources (Web App, Function App [which has already been done], Mobile App & API Apps).

> Since provider version 2.0.0 is marked as breaking, you don't need to maintain any compatibility with the existing implementations of VM or VMSS anyway - this is expected if you are following SemVer.

Unfortunately this'd mean we'd leave some users on older versions of the Provider unable to access new functionality - which isn't ideal (and is why the azurerm_virtual_machine and azurerm_virtual_machine_scale_set resources are feature-frozen rather than deprecated in 2.0).

In addition - based on the experiences we've had with both the HDInsight and App Service APIs - I think this'd be the wrong call, since we'd be pushing the complexity onto end users rather than handling it within the Azure Provider.

> Alternatively, just break if they don't have a SchemaVersion of provider version 2.0.0 or higher and tell them they need to reimport - or the resource just reimports at that version, overwriting the state file as a one-time event with an explicit warning message?

Terraform expects that all of the Resources in the Statefile can be represented by the Providers being used - as such users would be unable to change the state when using this newer version of the Provider, which would ultimately mean starting afresh with state - which again means we'd end up with a set of users who can't upgrade.

> I understand that maintaining the validation logic may be slightly more complex (kept within 1 resource) but I fundamentally believe you should be keeping them aligned to the AzureRM model. You still need to maintain the validation logic for both models anyway if you want to provide relevant feedback during the plan phase. Also, couldn't Microsoft provide some kind of validation via their Azure Go API client library?

Due to the way the Azure SDKs are built (they're generated from Swagger, which doesn't contain a means of expressing these conditional validation rules), unfortunately this logic can't easily be added automatically to the Azure SDK - it'd need to be some kind of manual mix-in. My concern with this approach is that the Swagger is already an afterthought for some API teams - and, if I'm being honest, based on prior experience I don't think these validation rules would be maintained if they got written, particularly in languages which aren't .NET (which is what most of the Azure APIs are written in).


Whilst I appreciate splitting these resources out does make creating generic modules harder - it's still possible to achieve if you need to; to use a hypothetical example:

variable "linux" {
  default = "yes"
}

resource "azurerm_linux_virtual_machine" "example" {
  count = var.linux == "yes" ? 1 : 0
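  # (other required arguments omitted from this hypothetical example)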
}

resource "azurerm_windows_virtual_machine" "example" {
  count = var.linux == "yes" ? 0 : 1
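  # (other required arguments likewise omitted for brevity)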
}

output "virtual_machine_id" {
  value = element(concat(azurerm_linux_virtual_machine.example.*.id, azurerm_windows_virtual_machine.example.*.id), 0)
}

(It's worth noting that whilst it's possible to create a generic module which configures every option on a resource - we'd recommend having multiple more specialized modules instead.)

Overall, whilst we're trying to match the schema design used by Azure where that makes sense - when the Azure API is too generic to be helpful to end users, we'd rather diverge from it by creating more specialized resources which provide a better user experience.

From our side we'll shortly be releasing an opt-in Beta of the new resources in the upcoming version 1.43 of the Azure Provider, which I'd encourage you to try if you have time (it's also worth noting there's a couple of minor known issues in the documentation). Whilst they are separate resources - overall the resources are (intentionally) pretty similar (albeit with more specific validation) but match the behaviour of the updated APIs - and as such we believe they should fit most use-cases.

Thanks!

@tombuildsstuff (Member Author) commented

👋

Over the past few months we've been working on the functionality coming in version 2.0 of the Azure Provider (outlined above).

We've just released version 1.43 of the Azure Provider, which allows you to opt in to the Beta of these upcoming features. Rather than detailing this across multiple issues - more information can be found in this GitHub issue (which is pinned for visibility).

Thanks!

@berney commented Feb 16, 2020

It sounds like more pressure needs to be applied on Azure to make their API and SDK better.

@tombuildsstuff (Member Author) commented

👋

Thanks for all of the input here - we've finished up the major changes needed for version 2.0 of the Azure Provider, and as such I'm going to close this meta issue for the moment. As this meta issue is assigned to the 2.0 GitHub Milestone - @hashibot will comment here when the 2.0 release of the Azure Provider is available.

Thanks!

@ghost commented Feb 24, 2020

This has been released in version 2.0.0 of the provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading. As an example:

provider "azurerm" {
    version = "~> 2.0.0"
}
# ... other configuration ...

@ghost commented Mar 28, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 hashibot-feedback@hashicorp.com. Thanks!

@ghost ghost locked and limited conversation to collaborators Mar 28, 2020