
Security, Quality Assurance, Stability, Resilience, and Troubleshooting Features

Eric Jahn edited this page May 8, 2017 · 1 revision

Quality Assurance Practices

Using Release Versions - Development and Production

Predictable, stable release phases keep new features from constantly introducing bugs into customer-facing production software. Potentially disruptive coding events such as hackathons, third-party contributions, and short-term projects need not be a problem as long as our development process is methodical. All software development, open or closed source, should move through predictable phases leading to a production release, followed by maintenance with hotfixes. Each of these phases lives in its own branch. Our team (as well as the HOME App team) has adopted this code maintenance process. It requires more structure and code segregation, but it avoids unanticipated bugs arising from fixes and new features.

API Monitoring and Alerting

We already have notifications to Slack for Jenkins deployment issues, but these do not tell us about problems that occur once the services are running. We need to know immediately when any of our microservices is down, and we need each status change logged. The only good way to do this is to deploy or pay for a service (such as one listed at http://raml.org/developers/test-your-api , https://www.apiscience.com/, https://status.io/ or one of the many API Gateway offerings from Amazon or 3scale) that constantly tests our API health under a separate test account we establish. This monitoring will give us the data we need to advertise our service uptime and performance. Two issues relate to this: #207 and CES #45.
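As a rough illustration of the kind of check such a service performs, the sketch below probes each endpoint and logs only status changes. It is not a replacement for a hosted monitor; the service names, URLs, and the `probe`/`check_services` helpers are hypothetical.

```python
import logging
import urllib.error
import urllib.request
from typing import Callable, Dict

log = logging.getLogger("api_monitor")


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # HTTP errors, refused connections, and timeouts all count as "down".
        return False


def check_services(
    endpoints: Dict[str, str],
    last_status: Dict[str, bool],
    probe_fn: Callable[[str], bool] = probe,
) -> Dict[str, bool]:
    """Probe every service once; log and return only the status changes.

    `last_status` is updated in place so the next call can compare against it.
    The first observation of a service is reported as a change.
    """
    changes: Dict[str, bool] = {}
    for name, url in endpoints.items():
        up = probe_fn(url)
        if last_status.get(name) != up:
            log.warning("service %s is now %s", name, "UP" if up else "DOWN")
            changes[name] = up
        last_status[name] = up
    return changes
```

A scheduler (cron, or a loop with a sleep) would call `check_services` at a fixed interval and forward any returned changes to Slack or another alerting channel.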

Security Practices

Infrastructure level security

VPC - Critical for assuring customers that our system is secure. Draft of proposed cloud architecture

  • Bastion / NAT instance configuration - All supporting systems should be accessible only through the NAT or bastion host.
  • Cloud monitoring - Alerts / notifications on resource thresholds and access. We should use tools like fail2ban with customized rules to block unauthorized access to our applications.
  • Access logs - We need a mechanism for accessing logs securely, as they will contain sensitive information.
  • Access to support systems (Jenkins / Nexus) only through private subnets.
  • Notifications on unusual events
  • Two factor authentication set up for AWS login

Application Level security

  • Data source and connection pooling for database access from all the application instances.
  • Property management through the database - All sensitive information, in addition to database connection properties, should be maintained in the application database and not in the code base. Microservices should be altered to load these properties during application startup.
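A minimal sketch of loading properties from a database at startup might look like the following. It uses an in-memory SQLite database as a stand-in for the application database, and the `service_properties` table, its columns, and the sample keys are all hypothetical.

```python
import sqlite3
from typing import Dict


def load_properties(conn: sqlite3.Connection, service: str) -> Dict[str, str]:
    """Read all key/value properties for one service in a single query."""
    rows = conn.execute(
        "SELECT prop_key, prop_value FROM service_properties WHERE service = ?",
        (service,),
    ).fetchall()
    return dict(rows)


# Stand-in for the application database (assumption: the real one is not SQLite).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE service_properties ("
    " service TEXT NOT NULL,"
    " prop_key TEXT NOT NULL,"
    " prop_value TEXT NOT NULL,"
    " PRIMARY KEY (service, prop_key))"
)
conn.executemany(
    "INSERT INTO service_properties VALUES (?, ?, ?)",
    [
        ("example-service", "smtp.host", "mail.example.org"),
        ("example-service", "token.ttl.seconds", "3600"),
        ("other-service", "smtp.host", "mail2.example.org"),
    ],
)
conn.commit()

# Called once during application startup; the code base holds no secrets.
props = load_properties(conn, "example-service")
```

Loading everything once at startup keeps the lookup off the request path; a service that needs to pick up changes without a restart would have to reload on a timer or on demand.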

User level security

  • Most user-level security is already in place, such as forgotten-password notifications and password reset links.
  • Password lockout should be enforced after X unsuccessful attempts.
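The lockout rule could be sketched roughly as below. The threshold "X" is left open above, so `max_attempts=5` is only a placeholder; the time-based unlock (`lock_seconds`) and the `LockoutTracker` class itself are also assumptions for illustration.

```python
import time
from typing import Callable, Dict


class LockoutTracker:
    """Counts consecutive failed logins and locks the account at a threshold."""

    def __init__(
        self,
        max_attempts: int = 5,          # placeholder for "X"
        lock_seconds: float = 900.0,    # assumed automatic unlock delay
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_attempts = max_attempts
        self.lock_seconds = lock_seconds
        self._clock = clock
        self._failures: Dict[str, int] = {}
        self._locked_until: Dict[str, float] = {}

    def is_locked(self, user: str) -> bool:
        until = self._locked_until.get(user)
        if until is None:
            return False
        if self._clock() >= until:
            # Lock has expired: clear it and start counting from zero again.
            del self._locked_until[user]
            self._failures.pop(user, None)
            return False
        return True

    def record_failure(self, user: str) -> None:
        if self.is_locked(user):
            return
        count = self._failures.get(user, 0) + 1
        self._failures[user] = count
        if count >= self.max_attempts:
            self._locked_until[user] = self._clock() + self.lock_seconds

    def record_success(self, user: str) -> None:
        # A successful login resets the consecutive-failure counter.
        self._failures.pop(user, None)
```

The login handler would call `is_locked` before checking credentials, then `record_failure` or `record_success` depending on the result. A production version would keep this state in shared storage rather than in process memory.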

Troubleshooting Facilities

When things do go wrong, whether in development (preferably) or in production, established troubleshooting facilities will speed problem identification and correction. Some of the following features are also practices that NIST and HIPAA suggest or require.

Enabling a test API for developer testing

Before they even start coding, our customers' developers should read and really understand our API documentation. Many API docs now include a "Try It Now" feature that accepts manual input through the web. It also lets developers confirm the service is working if they don't have the means to set up a REST test client (mentioned later in this document). Typically, we hook up a mock web service (actually a test account in our real service) to the API documentation page we already have. MuleSoft's Anypoint, which we use, has a configuration setting to link a test endpoint to each API page. Issue #237

Pervasive logging of API calls (request/response) and email transmissions (request/response)

When a developer wants to know whether a web request they made at, say, 4am was received, we can send them a report after the fact. This gives them another tool for figuring out where issues lie. We could even provide an API call that lets them retrieve the logs themselves. Issue #52. Email logging is issue #104.
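To make the idea concrete, here is a small sketch of request/response logging written as WSGI middleware. It assumes a Python WSGI service purely for illustration (our microservices may use a different stack); `RequestLogMiddleware` and the `sink` parameter are hypothetical names.

```python
import logging
import time
from typing import Callable, Optional

log = logging.getLogger("api_audit")


class RequestLogMiddleware:
    """Wraps a WSGI app and records one entry per request/response pair."""

    def __init__(self, app, sink: Optional[Callable[[dict], None]] = None) -> None:
        self.app = app
        # By default entries go to the logger; a real deployment might write
        # them to a table that a "retrieve my logs" API call can query.
        self.sink = sink or (lambda entry: log.info("api_call %s", entry))

    def __call__(self, environ, start_response):
        entry = {
            "received_at": time.time(),
            "method": environ.get("REQUEST_METHOD", ""),
            "path": environ.get("PATH_INFO", ""),
            "status": None,  # filled in when the app starts its response
        }

        def recording_start_response(status, headers, exc_info=None):
            entry["status"] = status
            return start_response(status, headers, exc_info)

        try:
            return self.app(environ, recording_start_response)
        finally:
            # Runs even if the app raises, so failed requests are logged too.
            self.sink(entry)
```

An app is wrapped once at startup, for example `application = RequestLogMiddleware(application)`. Apps that call `start_response` lazily while streaming the body would be recorded with a `None` status in this simplified version.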

Guidance for quickly checking live customer (not live mock) APIs

Even more relevant and useful to a developer than "Try It Now" API documentation is help with testing through an API tool like SoapUI or the Advanced REST Client extension for Chrome. This is better because developers can try the same failing request their app is attempting to make, against the actual customer account. We should provide thorough documentation on how to use these tools in our environment, which might be a little tricky because of the OAuth tokens needed to make live requests. Issue #240
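The tricky part is usually attaching the token. The sketch below builds such a request in code, assuming the access token is sent as a bearer token in the `Authorization` header, which is the common OAuth 2.0 convention but is not confirmed here for our API. The URL, token value, and `build_api_request` helper are hypothetical.

```python
import json
import urllib.request
from typing import Optional


def build_api_request(
    url: str,
    access_token: str,
    method: str = "GET",
    payload: Optional[dict] = None,
) -> urllib.request.Request:
    """Build (but do not send) an authenticated JSON request."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = urllib.request.Request(url, data=data, method=method)
    # Assumption: the API accepts OAuth 2.0 bearer tokens in this header.
    req.add_header("Authorization", f"Bearer {access_token}")
    req.add_header("Accept", "application/json")
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


# Reproduce the request the customer's app is failing on (hypothetical values).
req = build_api_request(
    "https://api.example.org/v1/clients",
    access_token="PASTE-TOKEN-HERE",
    method="POST",
    payload={"firstName": "Test"},
)
```

Passing `req` to `urllib.request.urlopen` would send it. The same three pieces (URL, method, `Authorization` header) are what a developer enters by hand in a REST client.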

Stability and Recovery

These features will help our system crash less often and, when it does break down, recover data quickly if necessary.

Production Databases: Hadoop and PostgreSQL

We are currently running a development Hadoop instance, which is far less stable and scalable than a multi-zone production Hadoop instance. Issue #23

Our Postgres should be multi-node to make it more resilient and scalable. Issue #247

Remote backups of Postgres transactional data

Our relational data service is already backed up, but remote backups further safeguard customer data. Issue #241

Load balanced web servers

Keeps web requests evenly distributed amongst available web servers. Reduces latency of web requests/responses. Issue #187

Dynamic load-based deployment

"Spins up" or "spins down" Amazon server instances as needed. That way, we always have the appropriate number of servers running. Issue #31

Soft deletes for data recovery

We allow customers to recover from accidental deletes.
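One common way to implement soft deletes is a nullable `deleted_at` column: a delete sets the timestamp, normal queries filter on it, and recovery clears it. The sketch below shows that pattern with an in-memory SQLite database; the `clients` table and the helper names are hypothetical and not taken from our schema.

```python
import sqlite3
from datetime import datetime, timezone
from typing import List


def soft_delete(conn: sqlite3.Connection, client_id: int) -> None:
    """Mark the row as deleted instead of removing it."""
    conn.execute(
        "UPDATE clients SET deleted_at = ? WHERE id = ? AND deleted_at IS NULL",
        (datetime.now(timezone.utc).isoformat(), client_id),
    )


def restore(conn: sqlite3.Connection, client_id: int) -> None:
    """Undo a soft delete."""
    conn.execute("UPDATE clients SET deleted_at = NULL WHERE id = ?", (client_id,))


def active_clients(conn: sqlite3.Connection) -> List[str]:
    """Normal reads only see rows that have not been soft-deleted."""
    rows = conn.execute(
        "SELECT name FROM clients WHERE deleted_at IS NULL ORDER BY id"
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT NOT NULL, deleted_at TEXT)"
)
conn.executemany(
    "INSERT INTO clients (id, name) VALUES (?, ?)", [(1, "Alice"), (2, "Bob")]
)

soft_delete(conn, 1)
after_delete = active_clients(conn)    # Alice is hidden, not gone
restore(conn, 1)
after_restore = active_clients(conn)   # Alice is visible again
```

The trade-off is that every read path has to remember the `deleted_at IS NULL` filter, so it is usually wrapped in a view or a shared query helper.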
