
API uptime monitoring #91

Closed
benhowes opened this issue Apr 17, 2018 · 2 comments

@benhowes
Member

commented Apr 17, 2018

Hi @Alan-R, @Birne94, @BusinessFawn, @cetanu, @megetron, @funkyfuture, @WhileLoop, @zhaogp, @sdementen, @slmingol, @aweidner, @justinfay, @vit-goncharov, @jones77, @Fosity, @ppr-A320, @raghavakora, @DeadDuck, @hugogu, @lpodl,

We're investigating whether there is demand for additional tools in the Tavern ecosystem. At the moment, uptime monitoring built on Tavern looks the most exciting.

You'd use your existing Tavern tests to define uptime checks, with a CLI-based CI/CD deployment keeping your uptime tests in sync with your code changes. The service would generate a performance overview dashboard, an uptime page and alerts when there was a problem.
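To illustrate (this is just a sketch with a made-up URL, not a finished design), an ordinary Tavern test like the one below could be re-run on a schedule as an uptime check:

```yaml
# Illustrative only: a Tavern test that could be reused as a scheduled uptime check
test_name: Check the status endpoint is reachable

stages:
  - name: Status endpoint returns a successful response
    request:
      url: https://api.example.com/status   # made-up endpoint for this sketch
      method: GET
    response:
      status_code: 200
```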

Some other services that offer this kind of thing are Postman's monitoring, API Fortress and Runscope, but we think the clarity of Tavern's syntax, its multi-stage tests and the chance to re-use your integration tests could make it a substantial improvement on those tools.

Does this sound like a useful tool/service? If not, how do you monitor APIs at the moment?

If you'd rather not reply here, please feel free to drop me an email on ben@zoetrope.io.

Finally, thanks so much for using Tavern and helping to build a community around the tool! We're looking at value-add services around Tavern so that we can spend more time working on the open-source product, and we hope that new products and tools built around Tavern will help the main project thrive!

@Birne94

Contributor

commented Apr 17, 2018

Hello Ben,

Our main use case for Tavern so far is integration testing of our backend services and how they interact with each other. I have toyed with the idea of running the tests against a staging or even production server, but have refrained from doing so for several reasons:

  • The main reason is pollution of the production database, which I want to keep as clean as possible. Running tests against it would clutter the data, skew analytics and increase the overall load (depending on the check intervals) on resource- and calculation-intensive endpoints.

  • Given that we run the tests during development and before each deploy, we should be able to catch any integration issues at that stage. Currently, we only have one public-facing server (scaled across multiple instances but sharing the same code), so any form of data corruption will most likely affect everything and we are doomed anyway.

  • We are a small startup (4 employees), so we simply have not had the time yet to set up complex CI (apart from running unit tests on commit) or monitoring systems. However, this is likely to change in the future.

These reasons of course stem from my rather limited experience with the operational side of software engineering, so if there are any misconceptions here or any concepts I should know about, I would really like to hear about them!

Since we deploy all of our services on AWS, we use their integrated monitoring tools for now. Each API server provides a simple ping endpoint, which connects to the database, performs a SELECT 1 query and returns. This is used by the load balancer to detect unhealthy hosts and also catches issues with our database. The endpoint is accessed every 30 seconds.
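For what it's worth, a check against such a ping endpoint, if it were written as a Tavern test rather than a load balancer health check, might look roughly like this (the URL is made up):

```yaml
# Rough sketch: polling a ping/health endpoint like the one described above
test_name: Ping endpoint reports a healthy database connection

stages:
  - name: Ping returns 200 when the SELECT 1 check succeeds
    request:
      url: https://api.example.com/ping   # hypothetical health-check URL
      method: GET
    response:
      status_code: 200
```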

For catching issues in production, we have set up a collection of alarms, which include:

  • HTTP 5xx errors breach a threshold
  • HTTP 4xx errors breach a threshold
  • Response time from machine to load balancer breaches a threshold
  • several database-related thresholds

These alarms, together with stack traces from Sentry, are collected and pushed to a Slack channel. If anything breaks or behaves abnormally, we usually receive a notification within seconds.
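As a rough illustration only, one of these alarms could be expressed in CloudFormation along the following lines (the namespace, metric name, threshold and SNS topic are assumptions for the sketch, not our actual configuration):

```yaml
# Illustrative sketch of the HTTP 5xx alarm described above (values are made up)
Http5xxAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: Too many HTTP 5xx responses from the load balancer
    Namespace: AWS/ApplicationELB        # assumes an Application Load Balancer
    MetricName: HTTPCode_ELB_5XX_Count
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 1
    Threshold: 10
    ComparisonOperator: GreaterThanThreshold
    AlarmActions:
      - !Ref AlertTopic                  # hypothetical SNS topic forwarding to Slack
```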

Additionally, AWS allows creating dashboards of different metrics, which I keep open on a spare monitor most of the time. If I spot anything suspicious, I can usually investigate in no time.

I can see uptime monitoring for Tavern becoming a valuable asset when building a system of many (internal or external) services which need to communicate with each other. Such a system has many points of failure, one service might affect many others, and so on. Getting early notifications when any of this happens (ideally before a user encounters the problem) is crucial, so short-interval integration tests look promising to me!

Overall, I am very happy with the development and community of this project and - even though I haven't had much time recently to play around with it - would really like to see it grow.

@cetanu

Contributor

commented Apr 18, 2018

Hi Ben,

We use Tavern for local acceptance tests for microservices and some of our code-driven reverse proxies. In addition, we currently use Tavern to perform post-deployment verification (PDV) in various environments including production, where we use canaries.

We have a lot of monitoring systems in place currently, and most of the things that wake us up at night and lead to real problems involve raw metrics such as memory, cache, CPU, error rates and latency.

We would consider it, but we're unsure whether it's going to change our lives at this point. @Birne94 makes a good point regarding analytics/log pollution as well, but this wasn't an immediate concern for me.

We generally just run our PDV once after a deployment, and locally as part of our builds; we're happy with that and haven't thought about running it continuously on live systems.
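For context, the PDV step is essentially just the Tavern tests invoked from CI; a hypothetical job (the job name, stage and file path are invented for this sketch) might look like:

```yaml
# Hypothetical CI job (GitLab-CI style) running Tavern tests as post-deployment verification
post_deploy_verification:
  stage: verify
  script:
    - pip install tavern
    # pytest collects the Tavern YAML file via the tavern pytest plugin
    - py.test pdv/test_service.tavern.yaml -v
```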

Probably my biggest concern is the implications for the Tavern library if such an application were built; I imagine there would be additions to suit the new application, and we are wary of that. We like the lightweight nature of the library in its current state. I'd prefer any such changes to go through a more democratic process (if that is even possible).

@benhowes closed this Jun 27, 2018
