Support rack aware deployment across Azure and GCP #74
Conversation
Force-pushed from 5c62906 to 4322629
My only reason for not yet approving is that I haven't tested these changes out across multiple cloud providers (I need to get access to Azure, for instance).
Rogger and I met to discuss this PR yesterday and had some questions/concerns:
- the use of the proximity placement group seems odd; from reading the docs, it appears to do the opposite of splitting nodes up across racks and increasing availability
- there isn't much consistency in the variable names across the aws/azure/gcp code sets
- we ran into issues with it working as expected when deployed to Azure (I can't recall the exact issue but will post more details)

At the same time, the AWS side worked well with a minor change. Since a customer wants this soon, we thought breaking the AWS part out into its own PR may be the best approach. More details in this PR: #85
Force-pushed from 4322629 to 274e131
@vuldin @r-vasquez regarding your comments above:
- The proximity placement group was already in there. Ideally we want nodes to be as close together as possible (for latency) while still respecting the scale set's fault-domain constraints; a sketch of the pairing follows below. The only risk is that combining the two causes availability problems, and if we see that we'll have to consider making it an either/or.
- I agree, at this point the three Terraform implementations are pretty disparate. I'm not really fancying a rework, but we should make sure that any new parameter names are consistent. I'll check that.
- Grateful for the info. It's worked fine for me so far.
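For reference, a minimal sketch of what pairing the two can look like. Resource names are illustrative, based on the identifiers visible in this PR's diffs, and the real module defines more arguments than shown here:

```hcl
# Hypothetical sketch: a proximity placement group keeps the VMs
# physically close, while a flexible scale set still spreads them
# across three fault domains.
resource "azurerm_proximity_placement_group" "redpanda" {
  name                = "redpanda-ppg" # assumed name
  location            = azurerm_resource_group.redpanda.location
  resource_group_name = azurerm_resource_group.redpanda.name
}

resource "azurerm_orchestrated_virtual_machine_scale_set" "redpanda" {
  name                         = "redpanda-vmss" # assumed name
  location                     = azurerm_resource_group.redpanda.location
  resource_group_name          = azurerm_resource_group.redpanda.name
  platform_fault_domain_count  = 3
  proximity_placement_group_id = azurerm_proximity_placement_group.redpanda.id
}
```

If the combination ever causes allocation failures, dropping proximity_placement_group_id from the scale set would be the either/or fallback mentioned above.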
I think I don't have a clear understanding of how availability sets, placement groups, and scale sets work with each other to enable high availability. From reading the docs, I would expect that we want to create an availability set whenever `ha` is `true`.
```diff
@@ -1,6 +1,6 @@
 variable "region" {
   description = "Azure Region where the Resource Group will exist"
-  default = "North Europe"
+  default = "centralus"
```
The README needs to be updated to reflect this change (it still mentions Northern Europe as default).
```diff
@@ -8,8 +8,11 @@ resource "azurerm_linux_virtual_machine" "redpanda" {
   count = var.vm_instances
   resource_group_name = azurerm_resource_group.redpanda.name
   location = azurerm_resource_group.redpanda.location
-  availability_set_id = azurerm_availability_set.redpanda.id
+  availability_set_id = var.ha ? null : azurerm_availability_set.redpanda.0.id
```
So if `ha` is `true`, then we don't create an availability set? That seems like the opposite of what we want, right?
So the old behaviour for this provider is that we would automatically deploy machines into an availability set, presumably for resilience. However, availability sets don't give the ability to introspect which machines are placed into which fault domain, so they are of limited use for rack awareness in Redpanda. We could remove availability sets altogether, except that it would be a change in behaviour from what was there before.

When `ha=true` we now deploy into a flexible scale set (with three fault domains), which tells us which fault domain each VM is in, and we can use that for rack awareness; see the sketch after this comment.

I figured there was no harm in leaving the existing behaviour in there, but happy to take advice.
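A minimal sketch of the `ha=true` path described above, assuming this repo's resource names and the azurerm provider's `platform_fault_domain` argument; the VM's remaining required arguments (name, size, image, network, credentials) are elided as in the diffs:

```hcl
# Sketch under the assumptions above: when ha is true, each VM joins the
# flexible scale set and is pinned round-robin to one of its three fault
# domains; that index is what Redpanda can use as the node's rack id.
resource "azurerm_linux_virtual_machine" "redpanda" {
  count                        = var.vm_instances
  resource_group_name          = azurerm_resource_group.redpanda.name
  location                     = azurerm_resource_group.redpanda.location
  availability_set_id          = var.ha ? null : azurerm_availability_set.redpanda.0.id
  virtual_machine_scale_set_id = var.ha ? azurerm_orchestrated_virtual_machine_scale_set.redpanda.id : null
  platform_fault_domain        = var.ha ? count.index % 3 : null # 0, 1 or 2 -> rack id
  # ... name, size, image, network and credential arguments elided ...
}
```

The pinned fault domain index could then be templated into each node's provisioning step to populate Redpanda's rack setting.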
Thanks, this helps explain the use of availability sets in this project. The scale set (rather than the availability set) is the vehicle for enabling rack awareness; I was mistakenly trying to understand how the availability set could provide that, especially given the changes in this PR.
```diff
   resource_group_name = azurerm_resource_group.redpanda.name
   location = azurerm_resource_group.redpanda.location
   proximity_placement_group_id = azurerm_proximity_placement_group.redpanda.id
+  count = var.ha ? 0 : 1
```
Same questions as above: this value seems like the opposite of what I would expect we want (to create an availability set when `ha` is `true`).
Same answer as above
Updated addressing the comments. @vuldin it would be good to get a re-review, please.
Thanks Tristans, looks good. I see `start-redpanda.yml` is now included as an empty file in this PR; I think that file could be deleted entirely.
Thanks @vuldin
When the `ha` variable is set (it defaults to `false`), the module will create a partition placement group and ensure that nodes are spread between the partitions; see the sketch below.
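As a rough illustration of that behaviour on AWS (the variable and resource names here are assumptions, not necessarily the module's):

```hcl
# Hypothetical sketch: a partition placement group with nodes spread
# round-robin across its partitions; the partition number can double as
# the Redpanda rack id. var.ami_id and var.instance_type are assumed.
resource "aws_placement_group" "redpanda" {
  count           = var.ha ? 1 : 0
  name            = "redpanda-pg"
  strategy        = "partition"
  partition_count = 3
}

resource "aws_instance" "redpanda" {
  count                      = var.vm_instances
  ami                        = var.ami_id
  instance_type              = var.instance_type
  placement_group            = var.ha ? aws_placement_group.redpanda.0.name : null
  placement_partition_number = var.ha ? (count.index % 3) + 1 : null # partitions are 1-indexed
}
```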