Vault Production Readiness Checklist

This checklist will prepare you to launch production-ready Vault clusters on any major cloud provider or on-premises.

Infrastructure Architecture
Load Balancing
Monitoring and Alerting
Configuration Management
Vault Configuration
Identity & Access Management
Security Hardening
Operational Readiness
Observability
Governance and compliance
Disaster Recovery
Automation Playbooks
SLI, SLO and SLA
Keys Rotation

References

Infrastructure Architecture


☐	Infrastructure workloads spread across Availability Zones. Nodes (containers) in Vault clusters should be spread across two or more failure domains, known as Availability Zones. The loss of a single Availability Zone should not result in a loss of service.
☐	Helm Charts/Operator Created. If you are deploying Vault on Kubernetes, it is recommended to build Helm charts or an operator to create Vault cluster nodes in an immutable way.
☐	Replication Enabled (Enterprise only). If you are using the Enterprise version of Vault, you can enable replication between two or more Vault clusters in different geographical regions for added protection in Disaster Recovery scenarios. Replication can be configured in Disaster Recovery mode or Performance Replication mode. If you are planning to use replication, you need to provision infrastructure in an alternative region, with nodes spread across multiple Availability Zones. For more information about the Enterprise Replication feature, see the official documentation.
☐	Firewall rules configured to control access to Vault. Vault will likely contain business-critical secrets, which makes it a prime target for malicious actors. Access to Vault should be restricted to your private networks and should not be possible from the internet. A Virtual Private Network (VPN) is a commonly used approach to allow access to Vault from unknown networks.
☐	Compute Resources satisfy the minimum requirements. Ensure hardware servers, virtual machines and containers have been appropriately resourced in accordance with the Deployment System Requirements.
☐	Secondary Disk. Ensure that Vault servers have a secondary disk attached to them. This will help with audit device fault tolerance.
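
The Availability Zone requirement above can be checked mechanically. The following is a minimal sketch, assuming a quorum-based storage backend in which a majority of nodes must stay reachable; that assumption, the function name and the zone labels are illustrative and not part of the checklist.

```python
from collections import Counter
from typing import List


def survives_single_zone_loss(node_zones: List[str]) -> bool:
    """Return True if a majority of nodes remains after losing any one zone.

    node_zones holds one zone label per cluster node (hypothetical input).
    """
    total = len(node_zones)
    if total == 0:
        return False
    quorum = total // 2 + 1  # assumed majority requirement of the backend
    per_zone = Counter(node_zones)
    # The worst case is losing the zone that holds the most nodes.
    return total - max(per_zone.values()) >= quorum


print(survives_single_zone_loss(["az-a", "az-b", "az-c"]))  # one node per zone
print(survives_single_zone_loss(["az-a", "az-a", "az-b"]))  # two nodes share a zone
```

Under this assumption, three nodes placed one per zone pass the check, while any layout that puts a majority of nodes into one zone fails it.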

Load Balancing


☐	TLS Encryption is configured on Load Balancer. Vault’s communications should be encrypted end to end with TLS, so traffic should not be terminated at the load balancer layer and passed on unencrypted. The load balancer should also use TLS to communicate with Vault.
☐	HTTP Redirects configured. With TLS configured, all traffic sent via HTTPS is encrypted; however, we need to ensure that there are no connections to Vault via plain HTTP. The load balancer should be configured to redirect all HTTP traffic to HTTPS.
☐	Health checks enabled and configured. Load balancer health probes can be used to ensure that traffic is only routed to a healthy leader node. Configure routing rules according to the HTTP response codes returned by Vault’s sys/health endpoint.
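
To illustrate the health check item, here is a small sketch of how a probe result could be turned into a routing decision. The status codes are the defaults of Vault's sys/health endpoint; the function names and the "route only to the active node" policy are assumptions made for this example, and a real load balancer would express the same rule in its own configuration.

```python
# Default status codes of Vault's sys/health endpoint.
HEALTH_STATES = {
    200: "initialized, unsealed and active",
    429: "unsealed and standby",
    472: "disaster recovery secondary",
    473: "performance standby",
    501: "not initialized",
    503: "sealed",
}


def describe(status_code: int) -> str:
    """Translate a health probe status code into a readable state."""
    return HEALTH_STATES.get(status_code, "unknown")


def should_route(status_code: int) -> bool:
    """Assumed policy: send traffic only to the active (leader) node."""
    return status_code == 200


for code in (200, 429, 503):
    print(code, describe(code), should_route(code))
```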

Monitoring and Alerting


☐	Vault Telemetry Enabled. Vault telemetry should be configured in the telemetry stanza within the Vault config file. This enables monitoring and alerting with a wide range of open source tools (for example Telegraf and Prometheus).
☐	Vault platform monitoring configured. A monitoring system of your choice is configured to monitor and alert on Vault application metric thresholds, following HashiCorp's best practice guidance.
☐	Infrastructure Monitoring Configured. A monitoring system of your choice is configured to monitor and alert on Consul application metric thresholds, following HashiCorp's best practice guidance.
☐	Monitoring Dashboard Created. Using a dashboard tool of your choice, create a monitoring dashboard so that operations staff can easily identify any issues that may be occurring.
☐	Prometheus Alerts Configured. Configure and package Prometheus alerts for the operations team.
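
The threshold-based alerting described above can be sketched as follows, assuming metrics are scraped in the Prometheus text exposition format. The metric names and limits are hypothetical placeholders, not real Vault metrics or recommended values, and the parser deliberately skips labelled series to stay short.

```python
from typing import Dict, List


def parse_prometheus_text(body: str) -> Dict[str, float]:
    """Parse unlabelled samples from Prometheus text exposition format."""
    samples = {}
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines, HELP and TYPE comments
        name, _, rest = line.partition(" ")
        if "{" in name or not rest:
            continue  # labelled or malformed series are ignored in this sketch
        samples[name] = float(rest.split()[0])
    return samples


def breached(samples: Dict[str, float], limits: Dict[str, float]) -> List[str]:
    """Return the names of metrics whose value exceeds the configured limit."""
    return sorted(n for n, limit in limits.items() if samples.get(n, 0.0) > limit)


SAMPLE = """\
# TYPE example_request_latency_ms gauge
example_request_latency_ms 250
example_open_leases 1200
"""
LIMITS = {"example_request_latency_ms": 200, "example_open_leases": 5000}
print(breached(parse_prometheus_text(SAMPLE), LIMITS))
```

A metric that is missing from the scrape is treated as not breached here; a production rule would usually alert on absent metrics as well.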

Configuration Management


☐	Infrastructure as code written (Containers only). Code is written to deploy the infrastructure that Vault runs on.
☐	Vault Platform Configuration Code Written. Vault platform configuration should be described in code using a tool like Terraform. Configuration such as auth methods, secrets engines, audit devices and policies should all be managed as code.
☐	Code under Version Control in Source Code Repositories. All infrastructure code and application code should be stored in separate source control repositories and placed under version control. An appropriate branching strategy should be implemented and documented in the README file.
☐	Code Owners in repositories. Repository files should have code owners assigned to them to control who can approve Pull Requests that will be merged into the master branch.
☐	Repository rules implemented. Configure the minimum number of Pull Request approvers, restrictions on Pull Request authors approving their own requests, and any other rules that your organisation’s security standards require for integrity.
☐	Deployment pipelines implemented. Code deployments should be automated using deployment pipelines. Where possible, the pipeline should be written as code and stored under version control alongside the code it deploys.
☐	Dev/Staging environments created. Create development and staging environments for Vault. The staging environment should be identical to production; the only divergence should be while pre-production changes are applied for final testing before being deployed to production.
☐	Naming and coding standards established. Implement and document naming and coding standards: naming standards for namespaces, policies, Vault roles, secret keys and AppRoles, and coding standards, where applicable, for variable names and function names.
☐	Integration tests written. A suite of automated integration tests is written, to be run either during the deployment pipeline or as a pre-check on your chosen VCS that must pass before a Pull Request can be merged.
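
Naming standards are easiest to enforce when a pipeline check rejects names that break them. The sketch below assumes a purely hypothetical convention for policy names, `<team>-<service>-<access>` in lowercase; your organisation's standard will differ, and only the pattern would need to change.

```python
import re
from typing import Iterable, List

# Hypothetical convention: <team>-<service>-<access>, e.g. "payments-api-read".
POLICY_NAME = re.compile(r"[a-z][a-z0-9]*-[a-z][a-z0-9]*-(read|write|admin)")


def invalid_policy_names(names: Iterable[str]) -> List[str]:
    """Return the names that do not follow the assumed convention."""
    return [name for name in names if not POLICY_NAME.fullmatch(name)]


print(invalid_policy_names(["payments-api-read", "Payments_API", "ops-vault-admin"]))
```

Run as a pre-merge check, a non-empty result would fail the pipeline before the policy reaches Vault.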

Vault Configuration


☐	Enable audit device on 2 files on different disks. Vault logs all requests and the responses to those requests. If Vault is unable to log them to any enabled audit device, it will immediately cease operations. To provide redundancy, each Vault node should have 2 file audit devices enabled on separate volumes on separate disks.
☐	Log rotation configured on log files. Enable and configure log rotation on the audit files to ensure the disks do not fill up and cause a Vault outage.
☐	Configure at least one IDP as an auth method. Where appropriate, configure an existing identity provider (or several, if required) as an authentication method in Vault.
☐	Configure the required secrets engines. Identify and enable the secrets engines required for your business and technical use cases.
☐	Ensure KV v2 is enabled. Ensure that version 2 of the KV secrets engine is used, to enable secrets versioning.
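
The audit device redundancy item can be verified from the list of enabled audit devices. This sketch assumes the list has already been fetched and has the shape shown in the sample (device name mapped to a `type` and an `options.file_path`); that shape and the paths are assumptions for illustration. Note that differing directories only suggest separate volumes, so the mount layout still has to be confirmed on the host.

```python
import posixpath
from typing import Dict


def has_redundant_file_audit(devices: Dict[str, dict]) -> bool:
    """True if at least two file audit devices write to different directories."""
    directories = {
        posixpath.dirname(device["options"]["file_path"])
        for device in devices.values()
        if device.get("type") == "file" and "file_path" in device.get("options", {})
    }
    return len(directories) >= 2


# Hypothetical device listing with one audit file per volume.
devices = {
    "file-primary/": {"type": "file", "options": {"file_path": "/var/log/vault/audit.log"}},
    "file-secondary/": {"type": "file", "options": {"file_path": "/mnt/audit/audit.log"}},
}
print(has_redundant_file_audit(devices))
```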

Identity & Access Management


☐	Create default policies. Create default policies that all user entities will inherit according to your business security model. This could be list permissions on a particular KV path, for example.
☐	Create policy mappings for default policies. Create a mapping for default policies to ensure all user entities inherit these policies.
☐	Configure aliases for entities when more than one auth method is in use. Using the Identity Secrets engine, create aliases to attach Vault logins via different auth methods to a single entity, to ensure the correct policies are inherited and to make the logging data easier to mine.
☐	Design a path structure for KV v2 that matches the way your org works (team based or product/service based). Map your KV path design to the way your organisation works or to its product groupings.
☐	Meta AppRole process defined. Meta AppRoles are a mechanism that allows an application or service to read the secret ID of an AppRole without exposing it to application developers.
☐	SSO for end customer. Configure SSO so that users authenticate via the company IDP.
☐	SSO for management team. Configure SSO for the management team via the company IDP.
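
As an example of a default policy tied to a team-based KV v2 path design, the sketch below renders a policy that grants list permission on one team's subtree. It relies on the fact that KV v2 serves list operations under the `metadata/` prefix of the mount; the mount name `secret`, the team name and the function name are assumptions for illustration.

```python
def team_list_policy(mount: str, team: str) -> str:
    """Render a policy granting list on a team's subtree of a KV v2 mount."""
    # KV v2 exposes list operations under <mount>/metadata/<path>.
    return (
        f'path "{mount}/metadata/{team}/*" {{\n'
        f'  capabilities = ["list"]\n'
        f'}}\n'
    )


print(team_list_policy("secret", "payments"))
```

Generating policies from the path design keeps the two in step: adding a team means adding one entry to the input rather than hand-writing another policy.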

Security Hardening


☐	Ensure TLS is configured on Vault Cluster. Enable end-to-end encryption using TLS certificates. Vault agents should also use TLS certificates.
☐	Enable and configure SELinux / AppArmor. Enable and configure SELinux or AppArmor, depending on your operating system, to create sandboxed contexts that reduce the blast radius even if the system is compromised.
☐	Randomise the ports used to differ from standard ports for Vault. By default, Vault uses ports 8200 and 8201. Change these to non-standard ports to provide extra hardening.
☐	Revoke root token. Once the initial set-up of the Vault cluster has been completed, the root token should be revoked.
☐	Configure server firewalls to only allow access to required ports. Using firewalld or iptables, limit port access to the Vault servers.
☐	Disable SSH. Interaction with Vault is done via the API, even when using the CLI. As such, there is no reason to SSH on to a Vault server (if it’s a virtual machine), so SSH should be disabled to mitigate the risk of unauthorised access to the server.
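
On the client side, the TLS item above implies that tools talking to Vault should verify certificates and refuse old protocol versions. The following is a minimal sketch using Python's standard `ssl` module; the TLS 1.2 floor and the optional private CA file are assumptions for this example, not requirements stated by the checklist.

```python
import ssl
from typing import Optional


def vault_client_tls_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Build a client TLS context that verifies the server certificate.

    ca_file is an optional path to a private CA bundle (hypothetical);
    when omitted, the system trust store is used.
    """
    context = ssl.create_default_context(cafile=ca_file)
    context.minimum_version = ssl.TLSVersion.TLSv1_2  # assumed minimum version
    return context


context = vault_client_tls_context()
print(context.check_hostname, context.verify_mode == ssl.CERT_REQUIRED)
```

The returned context can be passed to any standard-library client that accepts one, for example `urllib.request.urlopen(url, context=context)`.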

Operational Readiness


☐	Configure auto-unseal. Add a seal stanza to the Vault config file to reduce the operational burden on operators. For more information, see the auto-unseal documentation.
☐	PGP encryption of unseal/recovery keys. Use PGP or Keybase to add an extra layer of security to the distribution of unseal/recovery keys. For more details, see the official documentation.
☐	Node rebuild practice run. Practice building and replacing a node in the Vault cluster with zero downtime.
☐	Vault upgrade practice run. Practice upgrading Vault binaries to newer versions with zero downtime.
☐	Load testing. Conduct load testing to ensure your infrastructure compute resources are sufficient for the load you are expecting. Projects like wrk can assist with generating traffic.
☐	Document key holders and contact details. Ensure unseal/recovery key holders are documented on a wiki and that this document is kept up to date.
☐	Trusted Broker/Platform pattern. Choose a platform or broker that your business trusts and use it for secure injection of initial secrets. Examples are using Azure as a trusted platform or Jenkins as a trusted broker. Each organisation will differ in what it trusts, so this should be a business-driven decision.

Observability


☐	Logs shipping to central logs data warehouse. Logs should be streamed to a central data warehouse, because log rotation should be enabled on the servers and older logs will therefore be lost locally. A platform like Splunk is ideal for this use case; other viable options are available.
☐	Logs data mining scripts written. Decide what value the log data should provide and write scripts to extract that value from the data. Scripts can be written in Python. Models can also be produced to predict future loads based on existing data sets. This kind of insight can be useful for planning.
☐	Logs alerting configured. Some events should generate an alert; for example, a root token being generated should be flagged and alerted on. Ensure these events have alerts configured for them.
☐	UpDown Alerting. Configure an up/down monitoring service that periodically checks the URL you specify and reports any anomaly, be it downtime, a bad response, degraded performance or even a broken SSL certificate, and publishes alerts in real time to your alerting system.
☐	Status Page. Configure a status page with uptime and performance metrics.
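
The root token alert mentioned above can be prototyped as a small log mining script. This sketch assumes audit logs are JSON lines with a `request.path` field and flags any request to the `sys/generate-root` endpoints; the sample entries are fabricated placeholders, and a real deployment would more likely express the same rule in its log platform's alerting language.

```python
import json
from typing import Iterable, List, Tuple


def root_generation_events(audit_lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (time, path) for audit entries that touch root token generation."""
    events = []
    for line in audit_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue
        request = entry.get("request") or {}
        path = request.get("path", "")
        if path.startswith("sys/generate-root"):
            events.append((entry.get("time", ""), path))
    return events


# Placeholder log lines for illustration only.
sample = [
    '{"type": "request", "time": "t1", "request": {"path": "secret/data/app"}}',
    '{"type": "request", "time": "t2", "request": {"path": "sys/generate-root/attempt"}}',
    "not json",
]
print(root_generation_events(sample))
```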

Governance and compliance


☐	Threat Model Exercise. Conduct a threat modelling exercise using a framework of your organisation’s choosing, and ensure you have documented and mitigated all identified threats.

Disaster Recovery


☐	DR Exercise. Conduct a DR exercise to ensure Vault can be recovered in case of disaster.

Automation Playbooks


☐	Automation Playbooks. Helm charts and/or an operator to deploy the above setup multiple times on different clusters.

SLI, SLO and SLA


☐	SLIs (service level indicators), SLOs (service level objectives) and SLAs (service level agreements). Have properly defined SLIs, SLOs and SLAs.
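
To make the relationship between an SLI and an SLO concrete, the sketch below computes an availability SLI from request counts and the share of the error budget that remains. The 99.9% objective and the request counts are illustrative assumptions, not targets recommended by this checklist.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that were served successfully."""
    return good_requests / total_requests


def error_budget_remaining(slo: float, good_requests: int, total_requests: int) -> float:
    """Fraction of the error budget left for the period (negative if overspent)."""
    allowed_failures = (1 - slo) * total_requests
    actual_failures = total_requests - good_requests
    return 1 - actual_failures / allowed_failures


# Illustrative numbers: 1,000,000 requests, 250 failures, 99.9% objective.
print(round(availability_sli(999_750, 1_000_000), 5))
print(round(error_budget_remaining(0.999, 999_750, 1_000_000), 2))
```

With these numbers the objective allows about 1,000 failed requests, so 250 failures consume roughly a quarter of the budget.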

Keys Rotation


☐	Storage Key Rotation. The Storage Key encrypts every secret that is stored in Vault, and only lives unencrypted in memory. This key can be rotated online by calling the sys/rotate API endpoint, or from the CLI with vault operator rotate. This requires the right privileges, as set in policy. From the moment the key is rotated, every new secret is encrypted with the new key. This is a fairly straightforward process that most organizations carry out every six months, unless there is a compromise.
☐	Master Key Rotation. The Master Key wraps the Storage Key, and only lives unencrypted in memory. When using automatic unsealing, the Master Key is encrypted by the Seal Key, and recovery keys are provided for certain operations. This process is also online and causes no disruption, but it requires the key holders to input their current shard or recovery key to validate the process, and it is time-bound. Organizations generally carry out this procedure yearly, unless there is a compromise.
☐	Seal Key Rotation ...
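
The rotation cadence described above (roughly six months for the Storage Key, yearly for the Master Key) can be tracked with a simple due-date check. This is a minimal sketch: the day counts approximate those intervals, and the key labels and dates are placeholders rather than anything Vault itself defines.

```python
from datetime import date, timedelta

# Approximate intervals taken from the guidance above.
ROTATION_INTERVALS = {
    "storage": timedelta(days=182),  # about every six months
    "master": timedelta(days=365),   # about yearly
}


def rotation_due(key: str, last_rotated: date, today: date) -> bool:
    """Return True if the key has reached its rotation interval."""
    return today - last_rotated >= ROTATION_INTERVALS[key]


last = date(2024, 1, 1)  # placeholder date of the previous rotation
print(rotation_due("storage", last, date(2024, 7, 15)))
print(rotation_due("master", last, date(2024, 7, 15)))
```

A suspected compromise overrides any schedule, so a check like this only covers the routine case.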

Define Organizational Roles

In most organizations where Vault has been deployed at scale, there is no requirement for extra staffing or hiring in order to deploy and run the solution. Vault has no predefined roles and profiles, as the policy system allows very granular definitions of the duties for different teams, but generally speaking these have been defined in most organizations as follows:

Consumers: Individuals or teams that require the ability to consume secrets or have a need for a namespaced Vault capability.
Operators: Individuals or teams who onboard consumers, as well as secret engine capabilities, policies, namespaces and authentication methods.
Crypto (Key Officers): Individuals or teams who manage key rotation and audit policies.
Audit: Individuals or teams who review and audit usage.
