- 1. Task description
- 2. Functional requirements
- 3. Non-Functional Requirements
- 4. Requirements Refinement Decisions
- 4.1. FastAPI or Django framework
- 4.2. Constraints to text fields according to the established practices in existing systems
- 4.3. The format and the storage of the image is to be chosen following the established practices in existing systems
- 4.4. The format for the unique product identifier (SKU) is to be chosen following the established practices in existing systems
- 4.5. The name of the category constraints
- 4.6. The parent field of the category
- 4.7. Pagination of the returned results
- 4.8. Database to store products, categories and images
- 4.9. Unit test for real database or for mock database
- 5. High-Level Design
- 6. Database Design
- 7. API Design
- 8. Run and Test
- 9. Observability
- 10. Implemented Improvements
- 11. Investigation Conclusions on 200ms Target
Create a service which handles operations on products in an E-commerce system.
- There should be two models - Product and Category.
- Text fields should be able to support short text input, which may not be in English/Bulgarian only.
- Text field constraints should follow the established practices in existing systems.
- Additional fields may be added to models in the future.
- Product fields:
  - title - text field
  - description - text field
  - image
  - unique product identifier (SKU)
  - price. Should not lose precision when rounded.
  - category - link to a category model. Can be empty.
- Category fields:
  - name - text field
  - parent - link to category model. Maximum depth of nesting of children under parent is 100.
- CRUD operations for both models.
- API endpoint to search and filter all products matching:
- certain name/SKU
- within a price range
- under a certain category.
- additional filters may be added
- Range borders are inclusive.
- Returned results do not need to be sorted.
- Search by a certain category should return results from child categories too.
- On deletion of a category, all products linked to it are to be unlinked.
- On deletion of a parent category, the child categories are to be deleted and all linked products are to be unlinked.
- Use FastAPI or Django frameworks
- Unit tests for the search functionality
- The expectation is that endpoints return results within 200ms. If there are technical difficulties in achieving such latency, the reasons should be justified.
- Expected number of products: tens of thousands
- Expected number of categories: thousands
- Users per day: thousands
- No need for user authorization
- Multiple parallel connections to service
FastAPI was chosen:
- Supports `async/await` throughout, enabling parallel DB queries and concurrent connections without threading overhead, matching the requirements for low latency and multiple parallel connections
- Built-in validation via Pydantic, used for input data constraints
- Auto-generated docs via OpenAPI/Swagger UI out of the box (FastAPI docs)
- Easier to pick up compared to Django with great example-driven documentation.
- Django would be preferable if ORM ecosystem, admin panel, or auth are required.
- References:
- Use UTF-8 encoding capable of representing all Unicode characters
- `Product.title`: 255 chars (`VARCHAR(255)`)
- `Product.description`: 10000 chars (`TEXT`)
- `Category.name`: 255 chars (`VARCHAR(255)`)
- References:
4.3. The format and the storage of the image is to be chosen following the established practices in existing systems
- Store a URL string (`VARCHAR(2083)`)
- Storing binary image data in the database worsens query performance and increases backup sizes and traffic.
- Accepted formats: JPEG (universal), PNG (universal), WebP (modern optimized delivery).
- References:
4.4. The format for the unique product identifier (SKU) is to be chosen following the established practices in existing systems
- Alphanumeric string (`VARCHAR(100)`)
- Must be unique per system (`UNIQUE` constraint in DB)
- Must be present (`NOT NULL` constraint in DB)
- Normalized to upper-case on write to prevent duplicates
- Validation regex: `^[A-Z0-9_-]{1,100}$` (enforced at the API layer via Pydantic)
- References:
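A minimal sketch of the normalization and validation rules above, using plain `re` rather than Pydantic; the `normalize_sku` helper name is illustrative:

```python
import re

# Mirrors the documented rule: normalize to upper-case on write, then validate
# against ^[A-Z0-9_-]{1,100}$ (the same pattern the API layer would enforce).
SKU_PATTERN = re.compile(r"^[A-Z0-9_-]{1,100}$")

def normalize_sku(raw: str) -> str:
    sku = raw.strip().upper()
    if not SKU_PATTERN.fullmatch(sku):
        raise ValueError(f"invalid SKU: {raw!r}")
    return sku
```

Normalizing before validating means `ubx13-16-512` and `UBX13-16-512` map to the same stored value, so the `UNIQUE` constraint catches case-only duplicates.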
- Use UTF-8 encoding capable of representing all Unicode characters
- String, 255 chars (`VARCHAR(255)`)
- Must be present (`NOT NULL` constraint in DB) and not empty (enforce `min_length=1` in Pydantic)
- Must be unique per parent (composite unique constraint on `(parent_id, name)`)
- References:
- Adjacency List with recursive CTE (Common Table Expression) queries. Self-referencing foreign key, nullable.
- DB column: `parent_id INTEGER REFERENCES category(id) ON DELETE CASCADE`
- Root categories: `parent_id IS NULL`
- Max depth: 100 (validated at the application layer before insert/update)
- Subtree queries: PostgreSQL recursive CTEs (`WITH RECURSIVE`)
- Cascade delete: `ON DELETE CASCADE` at the DB level, so deleting a parent removes all descendants. Product unlinking is handled via `ON DELETE SET NULL` on `product.category_id`.
- For the "search returns child results too" requirement, the recursive CTE fetches all descendant category IDs first, then filters products with `WHERE category_id IN (...)`.
- Why Adjacency List over Materialized Path or Nested Sets?
  - Simplest model; the category count is in the thousands, which recursive CTEs on PostgreSQL handle efficiently.
  - Writes (add/move/delete categories) are O(1); no tree rebalancing is needed.
  - Max depth of 100 is enforceable at the application layer during writes.
- Schema: `parent_id = ForeignKey('self', null=True, on_delete=SET_NULL)`; however, given the requirement that deleting a parent deletes all children, use `on_delete=CASCADE` for the FK constraint.
- References:
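An in-memory sketch of what the recursive CTE computes: collect the root category id plus every descendant id, then filter products by that set. The `children` mapping stands in for the `parent_id` column; names are illustrative:

```python
from collections import deque

def descendant_ids(root_id: int, children: dict[int, list[int]]) -> set[int]:
    """Breadth-first analogue of WITH RECURSIVE: the returned set is what
    the search query plugs into WHERE category_id IN (...)."""
    seen = {root_id}
    queue = deque([root_id])
    while queue:
        for child in children.get(queue.popleft(), ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

For example, if Electronics (1) contains Laptops (2) and Phones (3), and Laptops contains Ultrabooks (4), a search scoped to category 1 filters products on the set `{1, 2, 3, 4}`.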
- With tens of thousands of products, pagination is mandatory for the search endpoint
- Decision: Offset-based pagination with `limit`/`offset` query parameters
- The dataset is tens of thousands of products; offset/limit with proper indexes performs well within 200 ms.
- Simpler for clients to implement (random page access).
- The `total` count is cheap with proper indexes on the filtered columns.
- Trade-off: Results do not need to be sorted (per requirements), so cursor pagination has no anchor advantage.
- Trade-off: Offset pagination (`LIMIT x OFFSET y`) degrades at large offsets because the DB must scan and discard the skipped rows; avoid it for deep pages.
- Trade-off: Cursor-based (keyset) pagination uses a stable pointer (e.g., `?cursor=<last_id>`) and is O(1) regardless of page depth.
- Trade-off: Cursor-based pagination is the production standard when the dataset grows to millions of rows.
- References:
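A minimal sketch of the offset/limit envelope described above, applied to an in-memory list instead of a SQL `LIMIT x OFFSET y`; the function name and response shape mirror the paginated envelope of the search endpoint and are illustrative:

```python
def paginate(items: list, limit: int = 20, offset: int = 0,
             max_limit: int = 100) -> dict:
    """Clamp the limit, slice the page, and wrap it in the paginated envelope."""
    limit = max(1, min(limit, max_limit))
    return {
        "items": items[offset:offset + limit],
        "total": len(items),  # with SQL this is the separate COUNT(*) query
        "limit": limit,
        "offset": offset,
    }
```

In the real service the slice is a `LIMIT`/`OFFSET` clause and `total` is a `COUNT(*)` over the same filters, which is why deep offsets get progressively more expensive.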
- PostgreSQL with `asyncpg` (via SQLAlchemy 2.0 async) gives the best combination of correctness (exact decimals, recursive CTEs, cascading deletes) and performance (async I/O, rich indexing).
- Supports: recursive CTEs, `DECIMAL` precision, full UTF-8, concurrent writes, an async driver for FastAPI, and rich index types
- SQLite has no true decimal type, limited support for concurrent writes, and fewer index types
- MySQL requires explicit configuration for full UTF-8 support and has more limited index types
- Key indexes:
  - `product.sku`: unique B-tree
  - `product.category_id`: B-tree (for the category filter and cascade unlink)
  - `product.price`: B-tree (for range queries)
  - `product.title`: GIN trigram index (`pg_trgm`) for partial-match search
  - `category.parent_id`: B-tree (for recursive CTE traversal)
- References:
- Decision: Real database (PostgreSQL) via test containers or an in-process test database.
- A mock DB is fast, but doesn't exercise real SQL, recursive CTEs, cascades, or indexes
- A real DB (test container) tests actual queries, constraints, and cascade behavior. However, it is slower (~seconds of startup time)
The service is best modeled as a layered API application. The API layer handles transport concerns, Pydantic enforces request and response contracts, application services own business rules, and the persistence layer isolates database access. PostgreSQL remains the system of record for products and categories, while product images are stored externally and referenced by URL.
```mermaid
flowchart TB
    client[Clients<br/>Web UI / Admin UI / External Systems]
    docs[OpenAPI / Swagger UI]
    client --> api
    docs --> api
    subgraph service[Commerce Service]
        direction TB
        api[FastAPI Routers<br/>Products / Categories / Search]
        schemas[Pydantic Schemas<br/>Validation / Normalization / Serialization]
        app[Application Services<br/>CRUD / Search / Category Tree Rules]
        repo[Persistence Layer<br/>SQLAlchemy 2.0 Async Repositories]
        search[Search Composition<br/>Name or SKU / Price Range / Category Subtree / Pagination]
        api --> schemas
        schemas --> app
        app --> repo
        app --> search
        search --> repo
    end
    subgraph data[Data Platform]
        direction LR
        db[(PostgreSQL)]
        images[(Object Storage or CDN<br/>Image files)]
        idx[Index Strategy<br/>SKU unique / price / category_id / parent_id / trigram title]
        tree[Category Hierarchy<br/>Adjacency list + recursive CTE]
        integrity[Referential Integrity<br/>category.parent_id ON DELETE CASCADE<br/>product.category_id ON DELETE SET NULL]
        db --- idx
        db --- tree
        db --- integrity
    end
    repo --> db
    repo --> images
    tests[Integration Tests<br/>pytest + real PostgreSQL test DB] --> api
```
- Layered design keeps request validation, business logic, and persistence concerns separate.
- PostgreSQL is the source of truth for products, categories, prices, and relational constraints.
- Product image binaries should live outside the relational database; only image URLs and metadata belong in the service data model.
- Category subtree search is implemented through recursive CTE queries over an adjacency-list category structure.
- Search performance relies on targeted indexes and offset-based pagination for the current scale of tens of thousands of products.
The database design uses PostgreSQL as the primary transactional store. The schema keeps the core model deliberately small: categories are stored as a self-referencing hierarchy, products reference categories optionally, and image assets are represented as URLs rather than binary blobs. This keeps writes simple, supports recursive category traversal efficiently, and preserves room for future fields without redesigning the core relationships.
```mermaid
erDiagram
    CATEGORY ||--o{ CATEGORY : parent_of
    CATEGORY o|--o{ PRODUCT : classifies
    CATEGORY {
        bigint id PK
        varchar_255 name
        bigint parent_id FK
        timestamptz created_at
        timestamptz updated_at
    }
    PRODUCT {
        bigint id PK
        varchar_255 title
        text description
        varchar_2083 image_url
        varchar_100 sku UK
        numeric_12_2 price
        bigint category_id FK
        timestamptz created_at
        timestamptz updated_at
    }
```
- `category.id` and `product.id` should be surrogate primary keys generated by PostgreSQL identity columns.
- `category.parent_id` is nullable so root categories can exist without a parent.
- `product.category_id` is nullable because a product may remain in the system after its category is deleted.
- `product.sku` is the external business identifier; it should be normalized to uppercase before persistence.
- `product.price` should use `NUMERIC(12,2)` so prices are stored exactly and do not lose precision.
- `image_url` stores the external object location only; binaries stay in object storage or a CDN-backed asset store.
- `created_at` and `updated_at` timestamps are recommended on both tables for auditability and operational debugging.
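To illustrate why `NUMERIC(12,2)` pairs with Python's `Decimal` rather than `float`, here is a hedged sketch; the `to_price` helper is illustrative, and the quantize step mirrors the two-decimal-place column:

```python
from decimal import Decimal, ROUND_HALF_UP

def to_price(value: str) -> Decimal:
    # Parse from string (as the API receives prices) and round to cents exactly.
    return Decimal(value).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# Binary floats drift on decimal fractions; Decimal round-trips them exactly.
assert 0.1 + 0.2 != 0.3
assert Decimal("0.1") + Decimal("0.2") == Decimal("0.3")
```

This is also why the API examples below carry `price` as a JSON string (`"1299.99"`) instead of a JSON number.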
Example logical DDL:
```sql
CREATE TABLE category (
    id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    parent_id BIGINT REFERENCES category(id) ON DELETE CASCADE,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    CONSTRAINT uq_category_parent_name UNIQUE (parent_id, name)
);

CREATE TABLE product (
    id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    title VARCHAR(255) NOT NULL,
    description TEXT NOT NULL,
    image_url VARCHAR(2083),
    sku VARCHAR(100) NOT NULL,
    price NUMERIC(12,2) NOT NULL,
    category_id BIGINT REFERENCES category(id) ON DELETE SET NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    CONSTRAINT uq_product_sku UNIQUE (sku),
    CONSTRAINT chk_product_price_non_negative CHECK (price >= 0)
);
```

- `UNIQUE (sku)` prevents duplicate products under the same business identifier.
- `UNIQUE (parent_id, name)` prevents duplicate sibling category names while still allowing the same name in different branches.
- `ON DELETE CASCADE` on `category.parent_id` deletes child categories automatically when a parent category is removed.
- `ON DELETE SET NULL` on `product.category_id` preserves products while unlinking them from deleted categories, matching the functional requirements.
- `CHECK (price >= 0)` blocks invalid negative product prices at the database layer.
- A B-tree index on `product.category_id` supports category filtering and unlink operations.
- A B-tree index on `product.price` supports inclusive range queries.
- A unique B-tree index on `product.sku` supports exact SKU lookups and uniqueness enforcement.
- A B-tree index on `category.parent_id` supports recursive hierarchy traversal.
- A trigram GIN index on `product.title` is recommended for partial title search in PostgreSQL.
Recommended indexes:
```sql
CREATE INDEX idx_product_category_id ON product(category_id);
CREATE INDEX idx_product_price ON product(price);
CREATE INDEX idx_category_parent_id ON category(parent_id);

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_product_title_trgm ON product USING gin (title gin_trgm_ops);
```

The API follows REST conventions over JSON and is designed for low-latency filtering at scale. Validation is handled by Pydantic, and the generated OpenAPI document is the contract source for clients.
- Base path: `/api/v1`
- Content type: `application/json`
- No authentication layer is required in this task.
- Timestamps use ISO-8601 UTC format (example: `2026-03-12T10:20:30Z`).
- All list/search responses are wrapped in a paginated envelope.
- Validation errors return HTTP `422`; missing resources return HTTP `404`; uniqueness conflicts return HTTP `409`.
| Method | Path | Description |
|---|---|---|
| `POST` | `/api/v1/categories` | Create a category |
| `GET` | `/api/v1/categories/{category_id}` | Get category by id |
| `PATCH` | `/api/v1/categories/{category_id}` | Update category name and/or parent |
| `DELETE` | `/api/v1/categories/{category_id}` | Delete category subtree |
| `GET` | `/api/v1/categories` | List categories with pagination |
| Method | Path | Description |
|---|---|---|
| `POST` | `/api/v1/products` | Create a product |
| `GET` | `/api/v1/products/{product_id}` | Get product by id |
| `PATCH` | `/api/v1/products/{product_id}` | Update product fields |
| `DELETE` | `/api/v1/products/{product_id}` | Delete product |
| `GET` | `/api/v1/products` | List products with pagination |
| Method | Path | Description |
|---|---|---|
| `GET` | `/api/v1/products/search` | Filter products by name or SKU, price range, and category subtree |
Query parameters:
- `q`: optional string for partial title match or exact SKU match.
- `min_price`: optional decimal, inclusive lower bound.
- `max_price`: optional decimal, inclusive upper bound.
- `category_id`: optional integer category filter; includes all descendants.
- `limit`: integer, default `20`, max `100`.
- `offset`: integer, default `0`.
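An in-memory mirror of the WHERE clause these parameters compose: partial, case-insensitive title match or exact normalized SKU for `q`, inclusive price bounds, and subtree membership via a precomputed descendant-id set. A hedged sketch with illustrative names, not the service's actual query builder:

```python
def matches(product: dict, q=None, min_price=None, max_price=None,
            category_ids=None) -> bool:
    if q is not None:
        # Partial title match OR exact match on the upper-cased SKU.
        if (q.lower() not in product["title"].lower()
                and product["sku"] != q.upper()):
            return False
    if min_price is not None and product["price"] < min_price:  # inclusive
        return False
    if max_price is not None and product["price"] > max_price:  # inclusive
        return False
    if category_ids is not None and product["category_id"] not in category_ids:
        return False
    return True
```

Each filter is applied only when its parameter is present, which is how the SQL version composes optional `AND` clauses.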
Category create request:
```json
{
  "name": "Laptops",
  "parent_id": 12
}
```

Category response:
```json
{
  "id": 34,
  "name": "Laptops",
  "parent_id": 12,
  "created_at": "2026-03-12T10:20:30Z",
  "updated_at": "2026-03-12T10:20:30Z"
}
```

Product create request:
```json
{
  "title": "Ultrabook X13",
  "description": "13-inch ultrabook with 16GB RAM",
  "image_url": "https://cdn.example.com/products/x13.webp",
  "sku": "UBX13-16-512",
  "price": "1299.99",
  "category_id": 34
}
```

Product response:
```json
{
  "id": 501,
  "title": "Ultrabook X13",
  "description": "13-inch ultrabook with 16GB RAM",
  "image_url": "https://cdn.example.com/products/x13.webp",
  "sku": "UBX13-16-512",
  "price": "1299.99",
  "category_id": 34,
  "created_at": "2026-03-12T10:21:00Z",
  "updated_at": "2026-03-12T10:21:00Z"
}
```

Search response:
```json
{
  "items": [
    {
      "id": 501,
      "title": "Ultrabook X13",
      "description": "13-inch ultrabook with 16GB RAM",
      "image_url": "https://cdn.example.com/products/x13.webp",
      "sku": "UBX13-16-512",
      "price": "1299.99",
      "category_id": 34,
      "created_at": "2026-03-12T10:21:00Z",
      "updated_at": "2026-03-12T10:21:00Z"
    }
  ],
  "total": 1,
  "limit": 20,
  "offset": 0
}
```

Validation error response:
```json
{
  "detail": [
    {
      "loc": ["body", "sku"],
      "msg": "String should match pattern '^[A-Z0-9_-]{1,100}$'",
      "type": "string_pattern_mismatch"
    }
  ]
}
```

- Create and activate a virtual environment:
```shell
python3 -m venv .venv
. .venv/bin/activate
```

- Install dependencies:

```shell
pip install -e '.[dev]'
```

- Copy environment config and adjust as needed:

```shell
cp .env.example .env
```

The `.env` file is not committed to the repository (it is listed in `.gitignore`). The provided `.env.example` contains sensible defaults for local development. Key variables you may want to review:
| Variable | Default | Purpose |
|---|---|---|
| `DATABASE_URL` | `postgresql+asyncpg://postgres:postgres@localhost:5432/commerce_demo` | PostgreSQL connection string. Change host/port/credentials to match your local setup, or leave as-is when using Docker Compose (it provisions the DB automatically). |
| `TELEMETRY_ENABLED` | `true` | Set to `false` if you are not running the observability stack. |
| `API_PREFIX` | `/api/v1` | URL prefix for all API routes. |
| `AUTO_CREATE_SCHEMA` | `true` | Creates tables on startup when no Alembic migration has been run yet. |
Tip: When using Docker Compose (Option A below), the compose file passes its own `DATABASE_URL` pointing at the containerized PostgreSQL, so the value in `.env` is only used if you run the dev server directly on the host (Option C).
Build and start the full stack (API, PostgreSQL, OpenTelemetry Collector, Tempo, Prometheus, Grafana):
```shell
docker compose up --build
```

Run in detached mode:

```shell
docker compose up --build -d
```

Note:
- Compose now includes a one-shot `migrate-indexes` service that runs `scripts/migrate_indexes.py` after PostgreSQL becomes healthy.
- The `app` service waits for this migration to complete successfully before starting.
- On a fresh database, `migrate-indexes` now creates the base schema first, then applies indexes, so `docker compose up -d` succeeds on first boot.
- Optional: set `POSTGRES_USER`, `POSTGRES_PASSWORD`, and `POSTGRES_DB` in your shell or `.env` before `docker compose up` to override the default local database credentials.
Use the remote compose file when the API service and PostgreSQL already run on another host and you only want the local monitoring stack.
- Create a dedicated env file for the remote target:
```shell
cp .env.remote.example .env.remote
```

- Set the remote host and, if needed, the exposed app port:
```shell
REMOTE_SERVER_ADDRESS=api.example.com
REMOTE_APP_SCHEME=https
REMOTE_APP_PORT=8000
REMOTE_METRICS_PATH=/metrics
```

Example for this deployed site (https://commercesystemdemo.onrender.com/):
```shell
REMOTE_SERVER_ADDRESS=commercesystemdemo.onrender.com
REMOTE_APP_SCHEME=https
REMOTE_APP_PORT=443
REMOTE_METRICS_PATH=/metrics
```

- Start the remote-targeted observability stack:
```shell
docker compose -f docker-compose.remote.yml --env-file .env.remote up -d
```

This stack does not start the `app`, `migrate-indexes`, or `db` services locally. Prometheus scrapes `${REMOTE_APP_SCHEME}://${REMOTE_SERVER_ADDRESS}:${REMOTE_APP_PORT}${REMOTE_METRICS_PATH}`, while Grafana, Tempo, Loki, and the OpenTelemetry Collector continue to run locally.
Quick verification for the Render example:
```shell
curl -sS https://commercesystemdemo.onrender.com/metrics | grep '^commerce_' | head || true
```

Then open local dashboards:
- Grafana: `http://127.0.0.1:3000`
- Prometheus targets: `http://127.0.0.1:9090/targets`
Remote limitations:
- Metrics work immediately as long as the remote app exposes `/metrics` publicly or over a reachable private network.
- Traces appear only if the remote app is configured to send OTLP traffic to a collector endpoint reachable from that remote host.
- Logs require a separate `promtail` agent on the remote Docker host because the local stack can only scrape local container log files.
Optional server-side log shipping:
- On the remote Docker host, create a log shipper env file:
```shell
cp .env.server-logs.example .env.server-logs
```

- Point it to a Loki endpoint reachable from that server:
```shell
REMOTE_LOKI_PUSH_URL=https://logs.example.com/loki/api/v1/push
REMOTE_LOG_HOST=commerce-prod-01
REMOTE_LOG_JOB=commerce-remote
```

- Start the remote log shipper on that server:
```shell
docker compose -f docker-compose.server-logs.yml --env-file .env.server-logs up -d
```

This ships Docker JSON logs from the remote host into Loki. If Loki runs on your workstation, expose it through a tunnel or reverse proxy first, because the remote server must be able to reach `REMOTE_LOKI_PUSH_URL` directly.
Stop services:
```shell
docker compose down
```

Stop services and remove the PostgreSQL volume:

```shell
docker compose down -v
```

Before starting the server, ensure PostgreSQL is running. You can use Docker for the database only:
```shell
docker run -d \
  --name commerce-postgres \
  -e POSTGRES_USER=postgres \
  -e POSTGRES_PASSWORD=postgres \
  -e POSTGRES_DB=commerce_demo \
  -p 5432:5432 \
  postgres:16
```

Or configure a connection to an existing PostgreSQL instance by updating `.env`:
```shell
DATABASE_URL=postgresql+asyncpg://user:password@host:port/database
LOG_LEVEL=INFO
```

If the service is deployed on a remote host (for example Render), set these environment variables in the host's service settings (do not rely only on the local `.env`):
```shell
DATABASE_URL=postgresql+asyncpg://user:password@host:port/database
DATABASE_POOL_PRE_PING=false
DATABASE_POOL_SIZE=20
DATABASE_MAX_OVERFLOW=5
LOG_LEVEL=INFO
```

Notes:
- `DATABASE_POOL_PRE_PING=false` avoids an extra connection-check round trip on each checkout.
- `DATABASE_POOL_SIZE=20` and `DATABASE_MAX_OVERFLOW=5` are the tuned defaults used for concurrent load.
- After changing remote environment variables, trigger a redeploy/restart so the running container picks up the new values.
Use this checklist after every Render deploy to confirm that the running instance matches the expected config.
- In Render Dashboard -> `Environment`, confirm:
  - `DATABASE_POOL_PRE_PING=false`
  - `DATABASE_POOL_SIZE=20`
  - `DATABASE_MAX_OVERFLOW=5`
- In Render Dashboard -> `Deploys`, confirm the active deploy is the expected commit.
If you still observe a persistent ~100ms DB latency tier or occasional near-200ms mutation paths on Render, apply these checks in order:
- Region and network:
  - Keep the web service and Render Postgres in the same region.
  - Use the private/internal database hostname when available.
- Database plan sizing:
  - Upgrade the Render Postgres plan (CPU and I/O headroom) if DB metrics show pressure.
  - Re-run the same load scenario after each plan change for an apples-to-apples comparison.
- Connection settings (service env vars):
  - `DATABASE_POOL_PRE_PING=false`
  - `DATABASE_POOL_SIZE=20`
  - `DATABASE_MAX_OVERFLOW=5`
- Maintenance:
  - Run `VACUUM (ANALYZE)` on frequently updated tables.
  - Keep the indexes from `scripts/migrate_indexes.py` applied in production.
Important limitation:
- Render managed Postgres does not expose all of the low-level server tuning knobs typical of self-managed Postgres.
- The `db_time` undercount mismatch seen in some search requests is typically instrumentation-path timing (outside the query execute hooks), not a direct Postgres parameter issue.
Start the server locally:
```shell
uvicorn app.main:app --reload
```

Browse the interactive API:
- Swagger UI: `http://127.0.0.1:8000/docs`
- ReDoc: `http://127.0.0.1:8000/redoc`
- Health endpoint: `http://127.0.0.1:8000/health`
- Metrics endpoint: `http://127.0.0.1:8000/metrics`
- Grafana: `http://127.0.0.1:3000` (admin/admin)
- Prometheus: `http://127.0.0.1:9090`
Troubleshooting: ConnectionRefusedError
If you see `Connection refused [Errno 111]`, PostgreSQL is not accessible. Verify:
- Compose services are running: `docker compose ps`
- The Docker container is running: `docker ps | grep commerce-postgres`
- The connection string in `.env` is correct
- Firewall/network rules allow connections to the database port (5432)
The test suite includes 35 integration tests covering:
Service Layer Tests (4 tests) — tests/test_search.py
- Category subtree filtering with descendants
- Inclusive price range filtering
- Title and exact SKU matching
- Pagination with proper totals
Endpoint Integration Tests (31 tests) — tests/test_api.py
- CRUD operations for categories and products
- SKU normalization and validation
- Category hierarchy and cascade delete
- Advanced search with filters and pagination
- Error handling (404, 409, 422 responses)
Note: Tests automatically provision a PostgreSQL container via testcontainers — no manual database setup required.
Run all tests:
```shell
pytest -q
```

Expected output: `35 passed`
Run specific test file:
```shell
pytest tests/test_search.py -q   # Service layer tests
pytest tests/test_api.py -q      # Endpoint integration tests
```

Run opt-in performance benchmarks:
```shell
pytest -m performance -q
```

The default test run excludes performance benchmarks to keep CI fast.
To run benchmarks in GitHub Actions, start the Python application workflow manually and set `run_performance` to `true`.
Run with coverage:
```shell
pytest --cov=app --cov-report=term-missing
```

Troubleshooting: Test Container Issues
If tests fail to start PostgreSQL container:
- Ensure Docker is running: `docker ps`
- Grant permission: `sudo usermod -aG docker $USER` (then log out/in)
- Check internet access: testcontainers downloads the PostgreSQL image on first run (~100MB)
The service is instrumented with OpenTelemetry traces and metrics, plus structured JSON logs.
Traces:
- Incoming FastAPI requests (automatic instrumentation)
- SQLAlchemy database operations (automatic instrumentation)
Logs:
- Request completion events with method, route, status code, duration, payload size, and client IP
- Product and category mutation events for create, update, and delete outcomes
- Search summary events including filter presence and result totals
- Error and validation failure events at warning/error levels
- Log-to-trace correlation via `trace_id` and `span_id` fields in every log record
- Configurable minimum log level via `LOG_LEVEL` (`DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`)
Metrics:
- `commerce_http_request_duration_seconds` - request latency
- `commerce_http_response_payload_size_bytes` - response payload size
- `commerce_http_processing_duration_seconds` - route handler processing time
- `commerce_http_queue_wait_duration_seconds` - queue wait before handler processing
- `commerce_db_query_duration_seconds` - database query duration
- `commerce_http_requests_total` - request throughput
- `commerce_http_requests_in_flight` - in-flight requests
- `commerce_http_errors_total` - HTTP error responses with status class and type
- `commerce_http_exceptions_total` - unhandled exception count by class
- `commerce_db_pool_in_use_connections` - DB connections checked out from the pool
- `commerce_search_requests_total` - search request throughput
- `commerce_search_result_count` - search result count distribution
- `commerce_search_zero_results_total` - zero-result search count
- `commerce_product_mutations_total` - product create/update/delete outcomes
- `commerce_category_mutations_total` - category create/update/delete outcomes
- `commerce_category_validation_failures_total` - category validation failures by reason
Observability is wired end to end through the local Docker stack:
- The FastAPI application emits structured JSON logs to stdout.
- Promtail scrapes Docker container log files and pushes logs to Loki.
- The FastAPI application exposes Prometheus metrics on `/metrics` through the OpenTelemetry Prometheus exporter.
- The application also emits OTLP traces to the OpenTelemetry Collector at `otel-collector:4317`.
- The collector forwards traces to Tempo and exposes its own runtime metrics on port `8888`.
- Prometheus scrapes the application metrics endpoint and the collector metrics endpoint every 5 seconds.
- Grafana reads metrics from Prometheus, traces from Tempo, and logs from Loki through provisioned datasources.
Telemetry path summary:
```
FastAPI app
|- structured JSON logs -> stdout -> Promtail -> Loki -> Grafana
|- /metrics -> Prometheus
'- OTLP traces -> OpenTelemetry Collector -> Tempo -> Grafana
```
`docker-compose.yml` runs an observability stack:
- `otel-collector` receives OTLP traces from the API and forwards them to Tempo
- `tempo` stores traces
- `loki` stores logs
- `promtail` ships container logs to Loki
- `prometheus` scrapes `app:8000/metrics`
- `grafana` is pre-provisioned with Prometheus, Tempo, and Loki datasources
Useful endpoints:
- API: `http://127.0.0.1:8000`
- Metrics: `http://127.0.0.1:8000/metrics`
- Grafana: `http://127.0.0.1:3000`
- Prometheus: `http://127.0.0.1:9090`
- Tempo API: `http://127.0.0.1:3200`
- Loki API: `http://127.0.0.1:3100`
For a remote-hosted API and database, use `docker-compose.remote.yml` instead. It keeps Grafana, Prometheus, Tempo, Loki, and the collector local, but scrapes the application metrics endpoint from `REMOTE_SERVER_ADDRESS:REMOTE_APP_PORT` configured through `.env.remote`.
To ingest logs from that remote host as well, run `docker-compose.server-logs.yml` on the remote Docker server with `REMOTE_LOKI_PUSH_URL` set to a Loki endpoint the server can reach.
Relevant configuration files:
- `app/observability/logging.py` configures structured JSON logging and trace/span correlation fields
- `app/observability/setup.py` initializes tracing, metrics, middleware, and the `/metrics` endpoint
- `app/observability/middleware.py` records request lifecycle metrics and error classifications
- `app/observability/db.py` instruments SQLAlchemy and DB pool usage
- `observability/otel-collector-config.yaml` configures OTLP ingestion and the exporter pipeline
- `observability/loki-config.yaml` configures Loki storage and ingestion
- `observability/promtail-config.yaml` configures container log scraping and shipping to Loki
- `observability/promtail.remote.yml.tmpl` configures the remote-server Promtail agent for Loki shipping
- `observability/prometheus.yml` configures scrape jobs and alert rule loading
- `observability/grafana/provisioning` provisions Grafana datasources and dashboards
A ready-to-use dashboard set is provisioned from:
- `observability/grafana/dashboards/commerce-observability.json` (P1 reliability)
- `observability/grafana/dashboards/commerce-observability-p2.json` (P2 domain)
- `observability/grafana/dashboards/commerce-observability-p3.json` (P3 diagnostics)
Priority split:
- P1 Reliability: request rate, in-flight requests, error rate, exception rate, p95 request latency, p95 DB query duration, DB pool pressure, recent traces
- P2 Domain: search traffic and quality, mutation outcomes for products and categories, category validation failures
- P3 Diagnostics: latency decomposition, payload sizes, DB query duration by operation, HTTP error type breakdown, prebuilt Loki log panels for `commerce-app` warnings/errors, and request latency buckets derived from structured logs
Prometheus alert rules are defined in:
observability/prometheus-alerts/commerce-alerts.yml
Prometheus is configured to load alert rule files from:
`observability/prometheus.yml` via `rule_files: /etc/prometheus/alerts/*.yml`
Compose mounts the alert rules directory into the Prometheus container:
`docker-compose.yml` maps `./observability/prometheus-alerts` to `/etc/prometheus/alerts`
Configured P1 alerts:
- `CommerceHigh5xxRate` - 5xx ratio above 5% for 5 minutes
- `CommerceHighP95RequestLatency` - global p95 latency above 350 ms for 10 minutes
- `CommerceEndpointP95Over200ms` - per-endpoint p95 latency above 200 ms for 10 minutes (non-functional requirement)
- `CommerceSustainedHighInFlightRequests` - in-flight requests above 50 for 10 minutes
- `CommerceDbPoolPressure` - DB pool in-use connections above 12 for 10 minutes
The dashboards stay sparse until the service receives traffic. For a quick local smoke test:
- Start the full stack:
```shell
docker compose up --build -d
```

- Generate load against the API:
```shell
.venv/bin/python scripts/load_test.py --duration 60 --workers 8
```

If the catalog is already seeded, skip the initial setup phase:
```shell
.venv/bin/python scripts/load_test.py --skip-seed --duration 60 --workers 8
```

- Open the tools:
  - Grafana: `http://127.0.0.1:3000`
  - Prometheus: `http://127.0.0.1:9090`
  - Metrics endpoint: `http://127.0.0.1:8000/metrics`
  - Loki ready check: `http://127.0.0.1:3100/ready`
- Verify that custom application metrics are present:
```shell
curl -s http://127.0.0.1:8000/metrics | grep '^commerce_' || true
```

- Verify that logs are queryable in Loki:
```shell
curl -G -s "http://127.0.0.1:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={container="commerce-app"}' \
  --data-urlencode 'limit=20'
```

Notes:
- After instrumentation changes, rebuild the application container so the running service exposes the new metrics.
- Some counters appear only after the first matching event is observed. Until then, Prometheus and Grafana can show no data for that series.
- The load test intentionally produces successful traffic and controlled 4xx scenarios so the P1, P2, and P3 dashboards all receive data.
This section summarizes the concrete improvements implemented to reduce latency and improve observability for concurrent traffic.
- Enabled multi-worker app execution in Dockerfile using `WEB_CONCURRENCY=2` so requests are not serialized through a single worker.
- Applied `route_class=ObservabilityRoute` consistently to API routers in `app/api` so queue-wait and handler timing is captured for all endpoints.
- Added request lifecycle diagnostics in `app/observability/middleware.py`: `duration_ms`, `queue_wait_ms`, `handler_ms`, `in_flight_requests`, `response_size_bytes`, and request context fields.
- Added structured request slow-path logging (`request_slow`) for rapid triage when requests exceed the target threshold.
- Wired SQLAlchemy async pool settings in `app/db/session.py` and exposed tuning in `app/core/config.py`: `database_pool_size = 20`, `database_max_overflow = 5`, `database_pool_timeout = 5`, `database_pool_pre_ping = False`.
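A minimal sketch of how these pool settings can be handed to SQLAlchemy's async engine. The factory name and URL are illustrative; the real wiring lives in `app/db/session.py` and `app/core/config.py`.

```python
# Pool tuning values from the settings listed above.
POOL_KWARGS = dict(
    pool_size=20,        # database_pool_size
    max_overflow=5,      # database_max_overflow
    pool_timeout=5,      # seconds to wait for a free connection
    pool_pre_ping=False, # skip per-checkout liveness probe
)

def make_engine(url: str):
    # Imported lazily here only so this sketch stays importable
    # without a DB driver installed; create_async_engine itself is
    # lazy and opens no connection until first use.
    from sqlalchemy.ext.asyncio import create_async_engine
    return create_async_engine(url, **POOL_KWARGS)
```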
- Refactored search execution in `app/services/product_service.py` to release the data-query connection before the optional `COUNT(*)` execution.
- Added a `COUNT(*)` skip optimization on first-page short results (`offset == 0` and `len(records) < limit`) to avoid unnecessary aggregate scans.
- Added migration helper `scripts/migrate_indexes.py` to backfill performance indexes for existing databases, including a `pg_trgm` + GIN trigram index for `ILIKE` search on `product.title`.
- Added DB observability in `app/observability/db.py`: query duration metrics, slow-query logs, and pool in-use tracking.
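The first-page `COUNT(*)` skip described above can be sketched as follows. The helper names are hypothetical; the real logic sits inside `app/services/product_service.py`.

```python
def should_skip_count(offset: int, page: list, limit: int) -> bool:
    """A first page that came back shorter than the limit already
    reveals the exact total, so the aggregate query is redundant."""
    return offset == 0 and len(page) < limit

def total_for(offset: int, page: list, limit: int, run_count):
    # run_count is the (hypothetical) callable that issues COUNT(*).
    if should_skip_count(offset, page, limit):
        return len(page)  # exact total without a second DB round-trip
    return run_count()
```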
- Added application-level DB wall-time instrumentation in `app/observability/db_timing.py` and applied it across search, products, and categories paths:
  - connection acquire timing (`db_acquire_ms`)
  - execute and fetch wall timing (`db_execute_fetch_ms`)
  - wrapper helpers for `get`, `scalar`, and `list` query patterns
- Migrated logging to asynchronous queue-based handling in `app/observability/logging.py`:
  - `QueueHandler` on the request path
  - `QueueListener` background thread for I/O
  - startup/shutdown wiring in `app/main.py`
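The queue-based pattern is the standard-library `QueueHandler`/`QueueListener` pairing: the request path only enqueues records, and a background thread performs the blocking I/O. A self-contained sketch (the collecting handler stands in for the real file/stream handler):

```python
import logging
import queue
from logging.handlers import QueueHandler, QueueListener

log_queue: queue.Queue = queue.Queue(-1)  # unbounded
delivered = []

class _Collect(logging.Handler):
    # Stand-in for the real I/O handler; just records messages.
    def emit(self, record):
        delivered.append(record.getMessage())

logger = logging.getLogger("commerce.sketch")
logger.addHandler(QueueHandler(log_queue))  # request path: enqueue only
logger.setLevel(logging.INFO)

listener = QueueListener(log_queue, _Collect())  # background I/O thread
listener.start()                 # done at app startup in app/main.py
logger.info("request_slow")
listener.stop()                  # app shutdown: drains the queue, joins thread
```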
- Moved HTTP metric recording off the event-loop critical path in `app/observability/middleware.py` by dispatching histogram/counter updates to the thread pool via `run_in_executor`.
- Preserved synchronous in-flight counter updates to keep active-request gauges accurate while offloading heavier metric updates.
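The `run_in_executor` dispatch can be sketched as below. `_record_metrics` is a stand-in for the real histogram/counter updates; in the middleware the future is fire-and-forget, it is awaited here only to make the sketch deterministic.

```python
import asyncio

recorded = []

def _record_metrics(path: str, status: int, duration_ms: float) -> None:
    # Stand-in for prometheus histogram.observe / counter.inc calls,
    # which take locks and can contend under load.
    recorded.append((path, status, duration_ms))

async def finish_request(path: str, status: int, duration_ms: float) -> None:
    loop = asyncio.get_running_loop()
    # Offload heavier metric bookkeeping to the default thread pool so
    # the event loop stays free. (In-flight gauge updates remain
    # synchronous in the real middleware.)
    fut = loop.run_in_executor(None, _record_metrics, path, status, duration_ms)
    await fut  # awaited only so this example completes before returning

asyncio.run(finish_request("/api/products", 200, 87.0))
```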
- Extended request logs with DB timing decomposition fields in `app/observability/middleware.py`:
  - `db_acquire_ms`
  - `db_execute_fetch_ms`
  - `db_time_gap_ms` (`db_execute_fetch_ms - db_time_ms`, clamped at zero)
- This decomposition makes DB-time undercount mismatches visible in production logs and distinguishes true DB latency from instrumentation-scope gaps.
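The gap field follows directly from its definition above; a one-liner sketch (function name is illustrative):

```python
def db_time_gap_ms(db_execute_fetch_ms: float, db_time_ms: float) -> float:
    # Wall-clock DB time not attributed by lower-level instrumentation;
    # clamped at zero so measurement jitter never yields a negative gap.
    return max(0.0, db_execute_fetch_ms - db_time_ms)
```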
- Migration status: `scripts/migrate_indexes.py` executed successfully against local `commerce_demo` (all index statements applied).
- Test suite status after migration: `35 passed` (`python -m pytest tests/ -q`).
- Local mixed-load validation with `scripts/load_test.py`:
  - `workers=8`, `duration=30s`: ~103 RPS, no functional errors.
  - `workers=16`, `duration=60s`: ~69 RPS sustained, all requests completed successfully.
- Direct client-side latency sample after DB migration (`1000` requests, concurrency `16`) showed:
  - average ~102 ms
  - p50 ~89 ms
  - p95 ~196 ms
  - p99 ~373 ms
  - max ~387 ms
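Client-side summaries like the one above can be computed from raw latency samples with the standard library; this sketch is illustrative and is not the implementation in `scripts/load_test.py`:

```python
import statistics

def latency_summary(samples_ms: list) -> dict:
    # quantiles(n=100) returns 99 cut points; index k-1 approximates
    # the k-th percentile (exclusive method, interpolated).
    q = statistics.quantiles(samples_ms, n=100)
    return {
        "avg": statistics.fmean(samples_ms),
        "p50": q[49],
        "p95": q[94],
        "p99": q[98],
        "max": max(samples_ms),
    }
```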
- Post-deployment remote log validation confirms the DB timing decomposition is active and useful for diagnosis:
  - most requests now show a small `db_time_gap_ms` (typically ~0-1 ms)
  - occasional endpoint-specific outliers remain identifiable for targeted follow-up
These results indicate the service remains stable under heavier concurrent local traffic, with p95 staying near the 200 ms target and tail-latency behavior visible under stress for continued tuning.
The findings below are based on observed load-test runs, production-like remote logs, and the added DB timing decomposition (`db_acquire_ms`, `db_execute_fetch_ms`, `db_time_gap_ms`).
The 200ms target remains the default expectation, but the following scenarios can legitimately exceed it.
- Search requests that require both data retrieval and `COUNT(*)` (for example price-range searches with non-trivial result sets) may exceed 200 ms because they execute two DB-heavy phases.
- Mutation endpoints (`POST`, `PATCH`, `DELETE`) can exceed 200 ms under concurrent load because they include validation reads plus write/commit phases in one request.
- Database slow-tier events (observed as ~90-100 ms query durations) can push otherwise normal requests above 200 ms, especially when multiple queries occur in the same request.
- Burst concurrency can temporarily increase queue wait and in-flight pressure, which raises end-to-end latency even when individual SQL statements are healthy.
- Remote-host conditions (cross-region app/DB placement, undersized managed DB plan, or noisy-neighbor CPU/IO effects) can increase p95/p99 latency.
- Cold-start and warmup windows after deploy/restart may produce transient latency spikes before caches and pools stabilize.
Mitigation is tracked in the observability dashboards and logs (`request_slow`, DB timing decomposition fields, pool usage), and tuning guidance is documented in the remote service configuration section.