
Enable tiered storage in AWS via IAM policy #86

Merged 2 commits on Jan 18, 2023.
57 changes: 29 additions & 28 deletions README.md
@@ -1,6 +1,6 @@
# Terraform and Ansible Deployment for Redpanda

Terraform and Ansible configuration to easily provision a [Redpanda](https://www.redpanda.com/) cluster on AWS, GCP, Azure, or IBM.

## Installation Prerequisites

@@ -11,12 +11,12 @@

### On Mac OS X:
You can use brew to install the prerequisites. You will also need to install gnu-tar:
```commandline
brew tap hashicorp/tap
brew install hashicorp/tap/terraform
brew install ansible
brew install gnu-tar
ansible-galaxy install -r ansible/requirements.yml
```

## Usage
@@ -65,22 +65,24 @@
You can pass the following variables as `-e var=value`:
| `skip_node` | false | Per-node config to prevent the `redpanda_broker` role from being applied to this specific node. Use carefully when adding new nodes so that existing nodes are not reconfigured. |
| `restart_node` | false | Per-node config to prevent Redpanda brokers from being restarted after updating. Use with care: this can leave `rpk` reconfigured but the node not restarted, and therefore in an inconsistent state. |
| `rack` | `undefined` | Per-node config to enable rack awareness. N.B. Rack awareness will be enabled cluster-wide if at least one node has the `rack` variable set. |
| `tiered_storage_bucket_name` | | Set a bucket name to enable tiered storage (see the example below). |
| `aws_region` | | The AWS region to use if tiered storage is enabled. |
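For example, to provision with tiered storage pointed at a specific bucket (the bucket name and region below are illustrative placeholders, not values from this repo):

```commandline
ansible-playbook ansible/playbooks/provision-node.yml -i hosts.ini \
  -e tiered_storage_bucket_name=my-redpanda-tiered-storage -e aws_region=us-west-2
```

In the normal flow terraform writes these two values into `hosts.ini` for you (see the `aws/cluster.tf` changes below), so passing them by hand is only needed for ad-hoc runs.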

You can also specify any available Redpanda configuration value (or set of values) by passing a JSON dictionary as an Ansible extra-var. These values will be spliced with the calculated configuration and only override those values that you specify.
There are two sub-dictionaries that you can specify, `redpanda.cluster` and `redpanda.node`. Check the Redpanda docs for the available [Cluster configuration properties](https://docs.redpanda.com/docs/platform/reference/cluster-properties/) and [Node configuration properties](https://docs.redpanda.com/docs/platform/reference/node-properties/).

An example overriding specific properties would be as follows:

```commandline
ansible-playbook ansible/playbooks/provision-node.yml -i hosts.ini --extra-vars '{
"redpanda": {
"cluster": {
"auto_create_topics_enabled": "true"
},
"node": {
"developer_mode": "false"
}
}
}'
```

@@ -89,13 +91,13 @@

## Configure TLS

There are two options for configuring TLS. The first is to use externally provided and signed certificates (possibly via a corporately provided Certmonger) and re-run the `provision_node` playbook, specifying the relevant file locations and `tls=true`. For example:

```commandline
ansible-playbook ansible/playbooks/provision-node.yml -i hosts.ini --extra-vars redpanda_key_file='<path to key file>' --extra-vars redpanda_cert_file='<path to cert file>' --extra-vars redpanda_truststore_file='<path to truststore file>' --extra-vars tls=true
```

The second option is to deploy a private certificate authority using the playbooks provided below and generating private keys and signed certificates. For this approach, follow the steps below.

### Optional: Create a Local Certificate Authority

@@ -130,10 +132,10 @@
The playbooks can be used to add nodes to an existing cluster; however, care is required:
1. Add the new host(s) to the `hosts.ini` file. You may add `skip_node=true` to the existing hosts to avoid the playbooks being re-run on those nodes.
2. `install-node-deps.yml` - this will set up the Prometheus node_exporter and install package dependencies.
3. `prepare-data-dir.yml` - this will create any RAID devices required and format devices as XFS. Note: This playbook looks for devices presented to the operating system as NVMe devices (which can include EBS volumes built on the Nitro System). You may replace this playbook with your own method of formatting devices and presenting disks.
4. If managing TLS with the Redpanda playbooks:
   1. `generate-csrs.yml` - will create a private key and CSR and bring the CSR back to the Ansible host.
   2. If using the Redpanda provided CA: `issue-certs.yml` - signs the CSR and issues a certificate.
   3. `install-certs.yml` - Installs the certificate and also applies the `redpanda_broker` role to the cluster nodes. Note: This will install and start Redpanda (and restart any brokers that do not have `skip_node=true` set).
5. If `install-certs.yml` was not run in step iii above, you will need to run `provision-node.yml`, which will install the `redpanda_broker` role onto any nodes without `skip_node=true` set (see the example below). **Note: If TLS is enabled on the cluster, make sure that `-e tls=true` is set, otherwise this playbook will disable TLS across any nodes that don't have `skip_node=true` set.**
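A minimal sketch of that final step, assuming the inventory generated by the terraform run:

```commandline
ansible-playbook ansible/playbooks/provision-node.yml -i hosts.ini -e tls=true
```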

## Building a cluster with TLS enabled in one execution
@@ -144,9 +146,9 @@
A similar process can be used to build a cluster with TLS in one execution as to add nodes to an existing cluster:
2. `install-node-deps.yml` - this will set up the Prometheus node_exporter and install package dependencies.
3. `prepare-data-dir.yml` - this will create any RAID devices required and format devices as XFS. Note: This playbook looks for devices presented to the operating system as NVMe devices (which can include EBS volumes built on the Nitro System). You may replace this playbook with your own method of formatting devices and presenting disks.
4. If managing TLS with the Redpanda playbooks run the following steps. If you're using externally provided certificates, skip to step 5 remembering to set `tls=true`:
   1. `generate-csrs.yml` - will create a private key and CSR and bring the CSR back to the Ansible host.
   2. If using the Redpanda provided CA: `issue-certs.yml` - signs the CSR and issues a certificate.
   3. `install-certs.yml` - Installs the certificate and also applies the `redpanda_broker` role to the cluster nodes. Note: This will install and start Redpanda (and restart any brokers that do not have `skip_node=true` set).
5. If `install-certs.yml` was not run in step iii above, you will need to run `provision-node.yml`, which will install the `redpanda_broker` role. **Note: If TLS is enabled on the cluster, make sure that `-e tls=true` is set, otherwise this playbook will disable TLS across any nodes that don't have `skip_node=true` set.**


@@ -166,4 +168,3 @@
You might try resolving by setting an environment variable:
`export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES`

See: https://stackoverflow.com/questions/50168647/multiprocessing-causes-python-to-crash-and-gives-an-error-may-have-been-in-progr

10 changes: 10 additions & 0 deletions ansible/playbooks/roles/redpanda_broker/templates/configs/tiered_storage.j2
@@ -0,0 +1,10 @@
cluster:
  cloud_storage_access_key: THISVALUENOTUSED
**Contributor:** can you help me understand why this is here if it's not used? Are we expecting users to override this with their own storage key in the TF config?

**Member Author:** No, this is just a placeholder. Unfortunately Redpanda requires this to be something even when we are using `aws_instance_metadata` (IAM permissions applied to the EC2 instance).

  cloud_storage_bucket: {{ tiered_storage_bucket_name if tiered_storage_bucket_name is defined }}
  cloud_storage_enable_remote_read: true
  cloud_storage_enable_remote_write: true
**Member Author** (on lines +4 to +5): This makes all topics tiered-storage-enabled by default if the `tiered_storage_enabled` terraform variable is enabled.

  cloud_storage_region: {{ aws_region if aws_region is defined }}
  cloud_storage_secret_key: THISVALUENOTUSED
  cloud_storage_credentials_source: aws_instance_metadata
**Contributor:** Need to pass this in as a param from terraform otherwise we're assuming AWS only.

  # cloud_storage_enabled must be after other cloud_storage parameters
  cloud_storage_enabled: {{ true if tiered_storage_bucket_name is defined and tiered_storage_bucket_name|d('')|length > 0 else false }}
4 changes: 3 additions & 1 deletion ansible/playbooks/roles/redpanda_broker/vars/main.yml
@@ -3,4 +3,6 @@
custom_config_templates:
  - template: configs/defaults.j2
  - template: configs/tls.j2
    condition: "{{ tls | default(False) | bool }}"
  - template: configs/tiered_storage.j2
    condition: "{{ tiered_storage_bucket_name is defined | default(False) | bool }}"
**Member Author:** The `tiered_storage_bucket_name` ansible variable is pulled from `hosts.ini`, and is only defined if the `tiered_storage_enabled` terraform variable is true.

104 changes: 89 additions & 15 deletions aws/cluster.tf
@@ -3,9 +3,10 @@
resource "random_uuid" "cluster" {}
resource "time_static" "timestamp" {}

locals {
  uuid          = random_uuid.cluster.result
  timestamp     = time_static.timestamp.unix
  deployment_id = length(var.deployment_prefix) > 0 ? var.deployment_prefix : "redpanda-${substr(local.uuid, 0, 8)}-${local.timestamp}"
**Member Author:** Terraform complained about names being too long, so I shortened both the UUID and the timestamp.
**Contributor:** how safe is it to just grab the first 8 characters off the uuid? I haven't looked at how go-uuid works to see if it's one of the time-based uuids.
**Contributor:** also, why do we use a timestamp here instead of just the larger uuid?

  tiered_storage_bucket_name = "${local.deployment_id}-bucket"

  # tags shared by all instances
  instance_tags = {
@@ -14,15 +15,73 @@
  }
}
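To make the name-length issue above concrete: S3 bucket names are capped at 63 characters and IAM role names at 64, which is what the shortened UUID and unix timestamp are working around. A hedged sketch of a guard on `deployment_prefix` (its real definition is not part of this diff; the block below is illustrative only):

```hcl
variable "deployment_prefix" {
  type        = string
  default     = ""
  description = "Optional override for the generated deployment_id (assumed definition)."

  validation {
    # Leave room for the "-bucket" suffix appended in locals while staying
    # under the 63-character S3 bucket name limit.
    condition     = length(var.deployment_prefix) <= 56
    error_message = "deployment_prefix must be 56 characters or fewer."
  }
}
```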

resource "aws_iam_policy" "redpanda" {
count = var.tiered_storage_enabled ? 1 : 0
name = local.deployment_id
path = "/"
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
"Effect": "Allow",
"Action": [
"s3:*",
"s3-object-lambda:*",
**Member Author** (on lines +28 to +29): This needs to be more limited to list only those needed permissions.

**Member Author:** Discussed this with the broader team. From that conversation: "Having a policy that tightly limits redpanda to its own bucket is clearly the right thing to do; it's less obvious to me that restricting the verbs is helpful: it creates maintenance burden to keep those up to date as/when we change redpanda, and doesn't meaningfully change security if redpanda is the owner of the bucket." So limiting this policy to the specific bucket (as shown in this file below) is the best approach, and avoids a maintenance burden as tiered storage capabilities are expanded.
        ],
        "Resource": [
          "arn:aws:s3:::${local.tiered_storage_bucket_name}/*"
        ]
      },
    ]
  })
}
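If the verbs were ever tightened as first suggested, note that `s3:ListBucket` authorizes against the bucket ARN itself, so the `Resource` list would need both forms. A sketch of what that could look like (the action list is illustrative, not a vetted minimum for Redpanda):

```hcl
resource "aws_iam_policy" "redpanda_scoped" {
  count = var.tiered_storage_enabled ? 1 : 0
  name  = "${local.deployment_id}-scoped"
  path  = "/"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        # Hypothetical narrowed verbs; the PR keeps s3:* to avoid maintenance burden.
        Action = [
          "s3:GetObject",
          "s3:PutObject",
          "s3:DeleteObject",
          "s3:ListBucket",
        ]
        Resource = [
          "arn:aws:s3:::${local.tiered_storage_bucket_name}",   # bucket-level actions
          "arn:aws:s3:::${local.tiered_storage_bucket_name}/*", # object-level actions
        ]
      },
    ]
  })
}
```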

resource "aws_iam_role" "redpanda" {
count = var.tiered_storage_enabled ? 1 : 0
name = local.deployment_id
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Sid = ""
Principal = {
Service = "ec2.amazonaws.com"
}
},
]
})
}

resource "aws_iam_policy_attachment" "redpanda" {
count = var.tiered_storage_enabled ? 1 : 0
name = local.deployment_id
roles = [aws_iam_role.redpanda[count.index].name]
policy_arn = aws_iam_policy.redpanda[count.index].arn
}

resource "aws_iam_instance_profile" "redpanda" {
count = var.tiered_storage_enabled ? 1 : 0
name = local.deployment_id
role = aws_iam_role.redpanda[count.index].name
}

resource "aws_instance" "redpanda" {
count = var.nodes
ami = var.distro_ami[var.distro]
instance_type = var.instance_type
key_name = aws_key_pair.ssh.key_name
iam_instance_profile = var.tiered_storage_enabled ? aws_iam_instance_profile.redpanda[0].name : null
vpc_security_group_ids = [aws_security_group.node_sec_group.id]
placement_group = var.ha ? aws_placement_group.redpanda-pg[0].id : null
placement_partition_number = var.ha ? (count.index % aws_placement_group.redpanda-pg[0].partition_count) + 1 : null
tags = local.instance_tags
tags = merge(
local.instance_tags,
{
Name = "${local.deployment_id}-node-${count.index}",
}
)

  connection {
    user = var.distro_ssh_user[var.distro]

@@ -53,7 +112,12 @@
resource "aws_instance" "prometheus" {
  instance_type          = var.prometheus_instance_type
  key_name               = aws_key_pair.ssh.key_name
  vpc_security_group_ids = [aws_security_group.node_sec_group.id]
  tags = merge(
    local.instance_tags,
    {
      Name = "${local.deployment_id}-prometheus",
    }
  )

  connection {
    user = var.distro_ssh_user[var.distro]

@@ -68,7 +132,12 @@
resource "aws_instance" "client" {
  instance_type          = var.client_instance_type
  key_name               = aws_key_pair.ssh.key_name
  vpc_security_group_ids = [aws_security_group.node_sec_group.id]
  tags = merge(
    local.instance_tags,
    {
      Name = "${local.deployment_id}-client",
    }
  )

  connection {
    user = var.distro_ssh_user[var.client_distro]

@@ -176,20 +245,25 @@
resource "aws_placement_group" "redpanda-pg" {
resource "aws_key_pair" "ssh" {
key_name = "${local.deployment_id}-key"
public_key = file(var.public_key_path)
tags = local.instance_tags
}

resource "local_file" "hosts_ini" {
content = templatefile("${path.module}/../templates/hosts_ini.tpl",
{
redpanda_public_ips = aws_instance.redpanda.*.public_ip
redpanda_private_ips = aws_instance.redpanda.*.private_ip
monitor_public_ip = var.enable_monitoring ? aws_instance.prometheus[0].public_ip : ""
monitor_private_ip = var.enable_monitoring ? aws_instance.prometheus[0].private_ip : ""
ssh_user = var.distro_ssh_user[var.distro]
enable_monitoring = var.enable_monitoring
client_public_ips = aws_instance.client.*.public_ip
client_private_ips = aws_instance.client.*.private_ip
rack = aws_instance.redpanda.*.placement_partition_number
aws_region = var.aws_region
client_count = var.clients
client_public_ips = aws_instance.client.*.public_ip
client_private_ips = aws_instance.client.*.private_ip
enable_monitoring = var.enable_monitoring
monitor_public_ip = var.enable_monitoring ? aws_instance.prometheus[0].public_ip : ""
monitor_private_ip = var.enable_monitoring ? aws_instance.prometheus[0].private_ip : ""
rack = aws_instance.redpanda.*.placement_partition_number
redpanda_public_ips = aws_instance.redpanda.*.public_ip
redpanda_private_ips = aws_instance.redpanda.*.private_ip
ssh_user = var.distro_ssh_user[var.distro]
tiered_storage_bucket_name = local.tiered_storage_bucket_name
tiered_storage_enabled = var.tiered_storage_enabled
}
)
filename = "${path.module}/../hosts.ini"
Expand Down
2 changes: 1 addition & 1 deletion aws/provider.tf
@@ -2,7 +2,7 @@
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "4.35.0"
    }
    local = {
      source = "hashicorp/local"
13 changes: 7 additions & 6 deletions aws/readme.md
@@ -15,15 +15,15 @@
Example: `terraform apply -var="instance_type=i3.large" -var="nodes=3"`

| Name | Version |
|------|---------|
| aws | 4.35.0 |
| local | 2.1.0 |
| random | 3.1.0 |

## Providers

| Name | Version |
|------|---------|
| aws | 4.35.0 |
| local | 2.1.0 |
| random | 3.1.0 |

@@ -35,10 +35,10 @@
No Modules.

| Name |
|--------------------------------------------------------------------------------------------------------------------|
| [aws_instance](https://registry.terraform.io/providers/hashicorp/aws/4.35.0/docs/resources/instance) |
| [aws_key_pair](https://registry.terraform.io/providers/hashicorp/aws/4.35.0/docs/resources/key_pair) |
| [aws_security_group](https://registry.terraform.io/providers/hashicorp/aws/4.35.0/docs/resources/security_group) |
| [aws_placement_group](https://registry.terraform.io/providers/hashicorp/aws/4.35.0/docs/resources/placement_group) |
| [local_file](https://registry.terraform.io/providers/hashicorp/local/2.1.0/docs/resources/file) |
| [random_uuid](https://registry.terraform.io/providers/hashicorp/random/3.1.0/docs/resources/uuid) |
| [timestamp_static](https://registry.terraform.io/providers/hashicorp/time/latest/docs/resources/static) |
@@ -57,6 +57,7 @@
| nodes | The number of nodes to deploy | `number` | `"3"` | no |
| prometheus\_instance\_type | Instance type of the prometheus/grafana node | `string` | `"c5.2xlarge"` | no |
| public\_key\_path | The public key used to ssh to the hosts | `string` | `"~/.ssh/id_rsa.pub"` | no |
| tiered\_storage\_enabled | Enables or disables tiered storage | `bool` | `false` | no |
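Enabling the feature end to end is then a single flag on the terraform side (the region value below is illustrative):

```commandline
terraform apply -var="tiered_storage_enabled=true" -var="aws_region=us-west-2"
```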

### Client Inputs
By default, no client VMs are provisioned. If you want to also provision client
19 changes: 19 additions & 0 deletions aws/s3.tf
@@ -0,0 +1,19 @@
resource "aws_s3_bucket" "tiered_storage" {
count = var.tiered_storage_enabled ? 1 : 0
bucket = local.tiered_storage_bucket_name
tags = local.instance_tags
}

resource "aws_s3_bucket_acl" "tiered_storage" {
count = var.tiered_storage_enabled ? 1 : 0
bucket = aws_s3_bucket.tiered_storage[count.index].id
acl = "private"
}

resource "aws_s3_bucket_versioning" "tiered_storage" {
count = var.tiered_storage_enabled ? 1 : 0
bucket = aws_s3_bucket.tiered_storage[count.index].id
versioning_configuration {
status = "Disabled"
}
}
**Contributor** (on lines +13 to +19): Can we get a comment added on here on why we disable versioning (and maybe what happens, good or bad, if the user overrides this)?

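A possible resolution, with the rationale hedged as an assumption rather than a confirmed design decision:

```hcl
resource "aws_s3_bucket_versioning" "tiered_storage" {
  count  = var.tiered_storage_enabled ? 1 : 0
  bucket = aws_s3_bucket.tiered_storage[count.index].id
  versioning_configuration {
    # Assumed rationale: Redpanda manages segment lifecycle itself, so enabling
    # versioning would retain deleted segments as noncurrent versions and grow
    # storage costs. Overriding to "Enabled" is safe for Redpanda but costs
    # money unless a lifecycle rule expires noncurrent versions.
    # Note: "Disabled" is only valid for buckets that have never had versioning
    # enabled; once enabled, versioning can only be "Suspended".
    status = "Disabled"
  }
}
```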