Playground to learn how to set up Service Level Objectives

tblxio/slo-laboratory

Simple Fleet Management System

The goal of this project is to illustrate the concepts defined in the Navigating Service Level Objectives Series. For that, we will build a simple fleet management system to track vehicles and trips. The goal is not to have a fully fledged production-grade system, but rather a simple prototype enabling anyone to create SLOs on a system that uses well-known technologies, along with a proven monitoring stack.

How to run

You need to have Docker and Docker Compose installed. Then, just run the following commands:

# Start the monitoring stack
docker compose up -d prometheus
docker compose up -d grafana
docker compose up -d pyrra-api
docker compose up -d pyrra-filesystem

# Start the REST API
docker compose up -d rest-api --build

# Run load tests against the REST API
docker compose run --rm k6

TL;DR: Exposed Services

REST API

The REST API is built using Spring Boot, WebFlux and Spring Data R2DBC. It is a simple application on top of a PostgreSQL database.

Endpoints

The API exposes the following endpoints:

  • GET /operators: Fetch operators in a paginated fashion
  • GET /operators/{operatorId}: Get operator by id
  • GET /operators/{operatorId}/fleets: Get fleets associated to an operator
  • GET /operators/{operatorId}/fleets/{fleetId}: Get a fleet by ID associated to an operator
  • GET /operators/{operatorId}/fleets/{fleetId}/vehicles: Get vehicles of a fleet and operator
  • GET /operators/{operatorId}/fleets/{fleetId}/vehicles/{vehicleId}: Get a vehicle by ID of a fleet and operator
  • GET /operators/{operatorId}/drivers: Get drivers associated to an operator
  • GET /operators/{operatorId}/drivers/{driverId}: Get a driver by ID associated to an operator

See the Swagger UI for more information.

Database

Here is a diagram representing the database schema used by the REST API (see V1__Initial.sql for the actual SQL):

Migrations are handled by a standalone Flyway instance. They can be run using:

docker compose run --rm flyway

Note that the migrations are always run before the rest-api service is launched in Docker Compose.

The database is populated with fake data generated using PostgreSQL Faker; details can be found in V2__Populate.sql. This step is run as part of the migrations.

Load Testing

Traffic to the REST API can be simulated using the load tests defined in k6. As the name implies, these load tests are based on k6, an open-source load testing tool.

The script.js script runs a simple load test hitting each endpoint sequentially with a constant rate of requests.

To ensure that the load tests query real data, relationships are extracted from the database and then used by k6 to query existing combinations of operators / fleets / vehicles and operators / drivers. This extraction is defined in Docker Compose and runs automatically before k6, but you can also run it yourself:

docker compose run --rm database-extractor

By default, the load tests run for 24 hours with 50 concurrent "clients", generating about 400 requests / second on the REST API.
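As a rough sanity check on these defaults, the figures above can be turned into a per-client rate and a total request count. This is only back-of-the-envelope arithmetic on the numbers quoted in this README; the function name is hypothetical and nothing here is read from script.js.

```python
# Back-of-the-envelope totals for the default load test settings quoted above.
# These constants restate the README figures; they are not read from script.js.
CLIENTS = 50
REQUESTS_PER_SECOND = 400  # approximate aggregate rate
DURATION_HOURS = 24


def load_test_totals(clients, requests_per_second, hours):
    """Return (requests per second per client, total requests over the run)."""
    per_client = requests_per_second / clients
    total = requests_per_second * hours * 3600
    return per_client, total


per_client, total = load_test_totals(CLIENTS, REQUESTS_PER_SECOND, DURATION_HOURS)
print(f"{per_client:.1f} req/s per client, {total:,} requests in total")
```

A full default run therefore produces tens of millions of requests, which is worth keeping in mind for disk usage in Prometheus.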

Monitoring (Prometheus)

The REST API Monitoring dashboard in Grafana helps visualize the performance of the REST API using metrics scraped by Prometheus.

The load tests can generate the traffic needed to produce the metrics that make this dashboard useful.

The dashboard also displays two common SLOs for REST APIs:

  • availability SLO, based on the HTTP request success rate
  • latency SLO, based on the HTTP request duration

These SLOs are represented over 3 time windows in order to illustrate the importance of carefully choosing the right window for your SLOs:

  • Instant
  • 1 hour
  • 1 day
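To see why the window matters, here is a minimal Python sketch that computes both indicators over a trailing window from raw request samples. It is an illustration, not the dashboard's actual PromQL: the Request type, the synthetic traffic (one request per second with a 5-minute outage), and the 50 ms threshold are all assumptions made for the example. A request counts as successful for availability when its status is not 5xx, and only 2xx requests count towards latency.

```python
from dataclasses import dataclass


@dataclass
class Request:
    timestamp: int      # seconds since an arbitrary start
    status: int         # HTTP status code
    duration_ms: float  # request duration in milliseconds


def in_window(request, now, window_s):
    """True if the request falls in the trailing window (now - window_s, now]."""
    return now - window_s < request.timestamp <= now


def availability(requests, now, window_s):
    """Share of requests in the window that did not return a 5xx status."""
    recent = [r for r in requests if in_window(r, now, window_s)]
    if not recent:
        return None
    return sum(r.status < 500 for r in recent) / len(recent)


def latency_sli(requests, now, window_s, threshold_ms):
    """Share of 2xx requests in the window that were faster than threshold_ms."""
    ok = [r for r in requests if in_window(r, now, window_s) and 200 <= r.status < 300]
    if not ok:
        return None
    return sum(r.duration_ms < threshold_ms for r in ok) / len(ok)


# Synthetic traffic (assumption): one request per second for a day, with a
# 5-minute outage in the last hour and one slow request out of every twenty.
NOW = 86_400
requests = [
    Request(
        timestamp=t,
        status=500 if 85_800 < t <= 86_100 else 200,
        duration_ms=80.0 if t % 20 == 0 else 20.0,
    )
    for t in range(1, NOW + 1)
]

hour_availability = availability(requests, NOW, 3_600)
day_availability = availability(requests, NOW, 86_400)
hour_latency = latency_sli(requests, NOW, 3_600, threshold_ms=50)
print(f"availability 1h: {hour_availability:.4f}, 1d: {day_availability:.4f}")
print(f"latency SLI 1h: {hour_latency:.4f}")
```

The same 5-minute outage looks severe over the 1-hour window and barely registers over the 1-day window, which is the trade-off the three windows are meant to show.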

You can change the SLOs using the dashboard variables.

The dashboard is composed of 4 rows:

  • Request Rate: displays the instant request rate (per method and URI) and the total number of requests processed in the time window displayed by Grafana
  • Availability SLO: SLO counterpart of the request rate row, displaying success rate vs. error rate, availability and error budget over different time windows
  • PXX Latency: displays the latency of requests based on the latency SLO chosen in the variables, per method and URI, as a time series and as a summary over the time window displayed by Grafana
  • Latency SLO: SLO counterpart of the latency row, displaying fast rate vs. slow rate, availability and error budget over different time windows
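The error budget shown in the SLO rows follows from the objective and the observed good/total counts. The sketch below uses the usual definition (the budget is the share of requests allowed to be bad) and is not taken from the dashboard queries; the function name and the sample counts are hypothetical.

```python
def error_budget_remaining(good, total, target):
    """Fraction of the error budget left for a window.

    1.0 means the budget is untouched, 0.0 means it is fully spent,
    and a negative value means the objective is violated.
    """
    if total == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_bad = (1 - target) * total
    actual_bad = total - good
    return 1 - actual_bad / allowed_bad


# Hypothetical counts: 1,000,000 requests, 3,000 of them bad.
relaxed = error_budget_remaining(good=997_000, total=1_000_000, target=0.99)
strict = error_budget_remaining(good=997_000, total=1_000_000, target=0.999)
print(f"remaining at 99%: {relaxed:.2f}, at 99.9%: {strict:.2f}")
```

With identical traffic, tightening the objective from 99% to 99.9% turns a comfortable budget into an overspent one, which is the kind of effect you can explore by changing the dashboard variables. The same formula applies to the latency SLO, with "fast" requests in place of "good" ones.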

Monitoring (Pyrra)

Pyrra makes it easy to configure Service Level Objectives using Custom Resource Definitions (CRDs). It then translates these simple CRDs into Prometheus recording rules in order to pre-compute the metrics needed to monitor SLOs (burn rates, error budgets, availabilities, ...).

These CRDs are defined in the slos directory. For the REST API, two SLOs are defined:

  • REST API Availability defines that 99% of requests to the REST API should succeed (status != 5xx) over 4 weeks
  • REST API Latency defines that 99% of successful requests (status = 2xx) should have a latency below 50ms over 4 weeks
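To give a feel for what a 99% objective over 4 weeks means in practice, the sketch below computes a burn rate and the time until the budget is exhausted. It uses the standard definition of burn rate (observed error ratio divided by the allowed error ratio) and is not Pyrra's implementation; the function names and the 5% error ratio are hypothetical.

```python
WINDOW_DAYS = 28  # 4 weeks, as in the two SLOs above
TARGET = 0.99     # 99% objective


def burn_rate(error_ratio, target):
    """How fast the budget is consumed; 1.0 spends it exactly over the window."""
    return error_ratio / (1 - target)


def days_until_exhausted(error_ratio, target, window_days):
    """Days a full budget lasts if the error ratio stays constant."""
    rate = burn_rate(error_ratio, target)
    return float("inf") if rate == 0 else window_days / rate


# Hypothetical situation: 5% of requests are failing (or are too slow).
rate = burn_rate(0.05, TARGET)
days = days_until_exhausted(0.05, TARGET, WINDOW_DAYS)
print(f"burn rate: {rate:.1f}, budget exhausted after {days:.1f} days")
```

A burn rate above 1 means the budget will run out before the window ends if nothing changes.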

The resulting recording rules are stored in the prometheus_pyrra directory (not committed to this repository).

Three dashboards let you monitor the metrics generated by Pyrra:

  • Pyrra - List shows the list of SLOs with their current availability and error budget
  • Pyrra - Detail shows the details of a specific SLO, including the burn rate, error budget, availability, etc.
  • Pyrra - Overview shows the error budgets of all the SLOs over time

The SLOs can also be monitored using the Pyrra UI.

Simulating outages

Since the REST API depends on the database to work properly, the easiest way to simulate outages is to act on the database container.

To trigger server errors (impacting the availability SLO), stop the database:

docker compose stop database
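Before stopping the database, it can help to estimate how much of the availability error budget an outage will cost. The sketch below assumes constant traffic and that every request fails while the database is down; both are simplifications, and the function name is hypothetical. The defaults reuse the 99% over 4 weeks objective described above.

```python
def budget_consumed_by_outage(outage_minutes, target=0.99, window_days=28):
    """Fraction of the error budget spent by a full outage of the given length.

    Assumes constant traffic and a 100% failure rate during the outage.
    """
    window_minutes = window_days * 24 * 60
    budget_minutes = (1 - target) * window_minutes
    return outage_minutes / budget_minutes


consumed = budget_consumed_by_outage(30)
print(f"A 30-minute outage spends {consumed:.1%} of the 4-week budget")
```

Under these assumptions the 4-week budget corresponds to a little under 7 hours of full outage.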

To trigger higher latencies (impacting the latency SLO), the easiest way is to insert a lot of data into the database. First, open a psql session in the database container:

docker compose exec database psql -U admin -d fms

Then insert data, for example:

INSERT INTO fms.drivers (operator_id, first_name, last_name, license_number, date_of_birth, hire_date)
     SELECT (random() * ((select max(operator_id) from fms.operators) - 1) + 1)::INT,
            faker.first_name(),
            faker.last_name(),
            'License--' || id,
            faker.date_this_century()::DATE,
            faker.date_this_decade()::DATE
       FROM generate_series(1, 1000000) AS s(id);

This can be reverted by running the following SQL command:

DELETE FROM fms.drivers
      WHERE driver_id > 300;
