Support rack aware deployment across Azure and GCP #74

Merged · 7 commits · Nov 17, 2022

Conversation

tmgstevens
Contributor

When the ha variable is set (it defaults to false), the deployment will create a partition placement group and ensure that nodes are spread across its partitions.
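
For illustration, a minimal Terraform sketch of that behaviour on the AWS side; the resource names, AMI id, instance type, and partition count below are assumptions for the example, not code from this PR:

variable "ha" {
  description = "When true, spread nodes across the partitions of a placement group"
  type        = bool
  default     = false
}

# Partition placement group, only created when ha is enabled.
resource "aws_placement_group" "redpanda" {
  count           = var.ha ? 1 : 0
  name            = "redpanda"
  strategy        = "partition"
  partition_count = 3
}

# Instances join the placement group when ha is enabled; AWS then spreads them
# across the group's partitions.
resource "aws_instance" "redpanda" {
  count           = 3
  ami             = "ami-0123456789abcdef0" # placeholder AMI id
  instance_type   = "i3.2xlarge"            # example instance type
  placement_group = var.ha ? aws_placement_group.redpanda[0].name : null
}

Each instance's partition can then be read back (for example via the aws_instance placement_partition_number attribute) and used as that node's rack id.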

@CLAassistant

CLAassistant commented Sep 13, 2022

CLA assistant check
All committers have signed the CLA.

@tmgstevens force-pushed the aws-placement-groups branch 5 times, most recently from 5c62906 to 4322629, on September 16, 2022 13:25
@tmgstevens changed the title from "Supports AWS partition placement groups when the ha variable is set" to "Support rack aware deployment across AWS, Azure, GCP and via Ansible" on Sep 16, 2022
@vuldin (Member) left a comment

My only reason for not yet approving is that I haven't tested these changes out across multiple cloud providers (I need to get access to Azure, for instance).

azure/vars.tf (outdated; resolved)
ansible/playbooks/start-redpanda.yml (outdated; resolved)
@vuldin (Member) left a comment

Rogger and I met to discuss this PR yesterday and had some questions/concerns:

  • the use of the proximity placement group seems odd, from reading the docs it seems this is meant to do the opposite of splitting nodes up across racks and increasing availability
  • there isn't very much consistency in the variable names across the aws/azure/gcp code sets
  • we ran into issues with it working as expected when deployed to Azure (I can't recall the exact issue but will post more details)

At the same time, the AWS side worked well with a minor change. Since a customer wants this soon, we thought breaking out the AWS part into its own PR may be the best approach. More details in this PR: #85

@tmgstevens
Contributor Author

@vuldin @r-vasquez regarding your comments above:

the use of the proximity placement group seems odd, from reading the docs it seems this is meant to do the opposite of splitting nodes up across racks and increasing availability

The proximity placement group stuff was already in there - I think ideally we want nodes to be as close together as possible (for latency) while respecting the scale-set failure domain constraints. The only risk here is that we get availability problems by pairing the two together. If we see that, we'll have to consider making it an either/or.

there isn't very much consistency in the variable names across the aws/azure/gcp code sets

I agree, at this point the three terraform implementations are pretty disparate. I'm not really fancying a rework, but we should make sure that any new parameter names are consistent. I'll check that.

we ran into issues with it working as expected when deployed to Azure (I can't recall the exact issue but will post more details)

Grateful for info. It's worked fine for me so far.
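
On the proximity placement group point above, a rough Terraform sketch of the resource in question; the names are assumed for the example, and only the proximity_placement_group_id wiring mirrors what the module already does (azurerm_resource_group.redpanda is the resource group the module already defines):

# Proximity placement group: keeps the VMs physically close for low latency,
# at the cost of concentrating them, hence the availability trade-off noted above.
resource "azurerm_proximity_placement_group" "redpanda" {
  name                = "redpanda-ppg" # assumed name
  location            = azurerm_resource_group.redpanda.location
  resource_group_name = azurerm_resource_group.redpanda.name
}

# Each VM (or the availability set / scale set) then references it, as in the
# diffs further down:
#   proximity_placement_group_id = azurerm_proximity_placement_group.redpanda.id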

@vuldin self-requested a review on November 2, 2022 12:23
@vuldin (Member) left a comment

I don't think I have a clear understanding of how availability sets, placement groups, and scale sets work with each other to enable high availability. From reading the docs, I would expect that we want to create an availability set whenever ha is true.

@@ -1,6 +1,6 @@
variable "region" {
description = "Azure Region where the Resource Group will exist"
default = "North Europe"
default = "centralus"
Member

The README needs to be updated to reflect this change (it still mentions North Europe as the default).

@@ -8,8 +8,11 @@ resource "azurerm_linux_virtual_machine" "redpanda" {
count = var.vm_instances
resource_group_name = azurerm_resource_group.redpanda.name
location = azurerm_resource_group.redpanda.location
availability_set_id = azurerm_availability_set.redpanda.id
availability_set_id = var.ha ? null : azurerm_availability_set.redpanda.0.id
Member

So if ha is true, then we don't create an availability set? That seems the opposite of what we want, right?

Contributor Author

So the old behaviour for this provider is that we would automatically deploy machines in an availability set, presumably for resilience. However, availability sets don't give us the ability to introspect which machines are placed into which fault domain, so they are of limited use for rack awareness in Redpanda. We could remove availability sets altogether, except that it would be a change in behaviour from what was there before.

When ha=true we now deploy into a flexible scale set (with three fault domains), which gives us information about which fault domain each VM is in, and that can be used for rack awareness.

I figured there was no harm in leaving the existing behaviour in there, but happy to take advice.
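
To make the conditional concrete, here is a rough sketch of the pattern being described; the scale-set resource, its name, and the variable wiring are assumptions for the example rather than the module's actual code, and only the availability_set_id expression and the availability set count match the diffs in this thread (azurerm_resource_group.redpanda is the resource group the module already defines):

variable "ha" {
  type    = bool
  default = false
}

# Old behaviour, kept for ha = false: a plain availability set.
resource "azurerm_availability_set" "redpanda" {
  count               = var.ha ? 0 : 1
  name                = "redpanda-as" # assumed name
  location            = azurerm_resource_group.redpanda.location
  resource_group_name = azurerm_resource_group.redpanda.name
}

# ha = true: a flexible-orchestration scale set with three fault domains, so the
# fault domain of each VM is visible and can be fed into Redpanda's rack setting.
resource "azurerm_orchestrated_virtual_machine_scale_set" "redpanda" {
  count                       = var.ha ? 1 : 0
  name                        = "redpanda-vmss" # assumed name
  location                    = azurerm_resource_group.redpanda.location
  resource_group_name         = azurerm_resource_group.redpanda.name
  platform_fault_domain_count = 3
}

# On each azurerm_linux_virtual_machine, exactly one placement would then be set:
#   availability_set_id          = var.ha ? null : azurerm_availability_set.redpanda.0.id
#   virtual_machine_scale_set_id = var.ha ? azurerm_orchestrated_virtual_machine_scale_set.redpanda.0.id : null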

Member

Thanks, this helps explain the use of availability sets in this project. The scale set (rather than the availability set) is the vehicle for enabling rack awareness; I was mistakenly trying to understand how the availability set could provide that, especially given the changes in this PR.

resource_group_name = azurerm_resource_group.redpanda.name
location = azurerm_resource_group.redpanda.location
proximity_placement_group_id = azurerm_proximity_placement_group.redpanda.id
count = var.ha ? 0 : 1
Member

Same question as above: this value seems the opposite of what I would expect we want (to create an availability set when ha is true).

Contributor Author

Same answer as above

tmgstevens added a commit that referenced this pull request Nov 9, 2022
Allow for configuration of arbitrary node and cluster configuration items.
N.B. Further work to be done on idempotence and integration of TLS. #74 and #86 will need some rework
@vuldin self-requested a review on November 15, 2022 14:19
@tmgstevens
Contributor Author

Updated to address the comments. @vuldin, it would be good to get a re-review please.

@tmgstevens changed the title from "Support rack aware deployment across AWS, Azure, GCP and via Ansible" to "Support rack aware deployment across Azure and GCP" on Nov 17, 2022
@vuldin (Member) left a comment

Thanks Tristan, looks good. I see start-redpanda.yml is now included as an empty file in this PR; I think that file could be deleted entirely.

@tmgstevens
Contributor Author

Thanks @vuldin

@tmgstevens merged commit 337e1c7 into redpanda-data:main on Nov 17, 2022