transport-discovery: persist latency in a dedicated key, decoupled from registration TTL#2418

Merged
0pcom merged 2 commits into skycoin:develop from 0pcom:feat/persistent-latency
May 3, 2026

Conversation

0pcom (Collaborator) commented May 3, 2026

Summary

Latency lived inside the tp:<id> registration blob and inherited its 5-minute entry-timeout. Bandwidth has always been stored at bw:daily:<id>:<date> with a 35-day TTL, so any visor that paused re-registering (TPD restart, network blip, normal churn) silently lost its latency from /metrics until the next CXO push — while bandwidth-today survived intact. After a TPD bounce in production this is observable: bandwidth recovers as visors re-register, latency does not.

This change moves latency into its own durable key, making it a peer of bandwidth.

What's stored

  • New key: transport-discovery:lat:<id>, JSON {min, max, avg, updated_at} in microseconds, 35-day TTL (latencyTTL matches bw:daily:* retention).
  • TransportData.Latency{Min,Max,Avg} stays in the schema so older payloads decode cleanly, but no write touches it anymore.
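
For concreteness, a minimal sketch of the record and key layout described above. The type and helper names (latencyRecord, latencyKey) are illustrative assumptions; only the key prefix, the field set, and latencyTTL come from this PR:

```go
package store

import "time"

// latencyTTL matches the bw:daily:* retention (35 days), per the summary above.
const latencyTTL = 35 * 24 * time.Hour

// latencyRecord is the JSON value stored at transport-discovery:lat:<id>.
// min/max/avg are in microseconds; updated_at marks the last write.
type latencyRecord struct {
	Min       int64 `json:"min"`
	Max       int64 `json:"max"`
	Avg       int64 `json:"avg"`
	UpdatedAt int64 `json:"updated_at"`
}

// latencyKey builds the dedicated key, a peer of bw:daily:<id>:<date>.
func latencyKey(id string) string {
	return "transport-discovery:lat:" + id
}
```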

Write path

UpdateLatency writes only to the new key. The "must be registered" coupling and the TTL-inheriting Set on tp:<id> are gone — those were exactly the reasons latency disappeared with registration churn. avg <= 0 still drops the update. Last-writer-wins, no per-edge tracking, no daily aggregation — only the storage location and lifetime change.
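
A hedged sketch of that write path, building on the types above. redisStore, its client field, and the exact UpdateLatency signature are assumptions about the store package, not its actual code:

```go
package store

import (
	"context"
	"encoding/json"
	"time"

	"github.com/redis/go-redis/v9"
)

type redisStore struct {
	client *redis.Client
}

func (s *redisStore) UpdateLatency(ctx context.Context, id string, min, max, avg int64) error {
	if avg <= 0 {
		return nil // unchanged rule: non-positive averages are dropped
	}
	rec := latencyRecord{Min: min, Max: max, Avg: avg, UpdatedAt: time.Now().UnixMicro()}
	raw, err := json.Marshal(rec)
	if err != nil {
		return err
	}
	// Last-writer-wins write to the dedicated key: no "must be registered" check,
	// no Set on tp:<id>, and the TTL is latencyTTL rather than the 5-minute
	// registration entry-timeout.
	return s.client.Set(ctx, latencyKey(id), raw, latencyTTL).Err()
}
```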

Read paths

getLatencyRecord(ctx, id) plus a batched hydrateDurableLatency([]*Entry) overlay durable values onto entry/entries:

  • GetTransportMetrics (/metrics endpoint) — reads the new key directly in its existing pipeline.
  • getAllTransportsWithQoS (GetNetworkMetrics, GetVisorAggregateMetrics) — hydrates after scanAllTransports.
  • GetTransportsByEdge (/transports/edge:, /transports/edges) — hydrates the returned slice.
  • GetTransportByID (/transports/id:) — single-entry hydration.
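
Continuing the sketch in the same package, this is roughly what the single read and the batched overlay could look like. Entry here is a simplified stand-in for the store's real type, and the pipelined GETs are one plausible way to batch, not necessarily what the PR does:

```go
// Entry is a simplified stand-in for the store's transport entry type.
type Entry struct {
	ID      string
	Latency struct{ Min, Max, Avg int64 }
}

func (s *redisStore) getLatencyRecord(ctx context.Context, id string) (*latencyRecord, error) {
	raw, err := s.client.Get(ctx, latencyKey(id)).Bytes()
	if err == redis.Nil {
		return nil, nil // no durable sample yet
	}
	if err != nil {
		return nil, err
	}
	var rec latencyRecord
	if err := json.Unmarshal(raw, &rec); err != nil {
		return nil, err
	}
	return &rec, nil
}

// hydrateDurableLatency overlays durable values onto entries whose in-blob latency
// is now always zero; on a miss or error the entry is left untouched.
func (s *redisStore) hydrateDurableLatency(ctx context.Context, entries []*Entry) {
	pipe := s.client.Pipeline()
	cmds := make([]*redis.StringCmd, len(entries))
	for i, e := range entries {
		cmds[i] = pipe.Get(ctx, latencyKey(e.ID))
	}
	_, _ = pipe.Exec(ctx) // per-command misses surface as redis.Nil below

	for i, e := range entries {
		raw, err := cmds[i].Bytes()
		if err != nil {
			continue
		}
		var rec latencyRecord
		if json.Unmarshal(raw, &rec) == nil {
			e.Latency.Min, e.Latency.Max, e.Latency.Avg = rec.Min, rec.Max, rec.Avg
		}
	}
}
```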

Test plan

  • go build ./... clean.
  • go vet ./pkg/transport-discovery/... clean.
  • go test ./pkg/transport-discovery/... ./pkg/transport/ ./pkg/router/ ./pkg/visor/stats/ all pass.
  • No new unit test added — the store suite runs against newMemoryStore() (where UpdateLatency is a no-op) and there's no miniredis fixture in-repo; a hedged sketch of what a fixture-backed check could look like follows this list.
  • Post-deploy verification:
    • redis-cli TTL transport-discovery:lat:<id> returns ~3024000s (35d), never the 5-min registration TTL.
    • After a TPD bounce, /metrics carries latency for transports whose visors haven't yet re-pushed a CXO sample.
    • /metrics/visor, /metrics/visors, /transports/edge:, /transports/id: reflect the persisted latency.
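
Since there is no miniredis fixture in-repo, the following is only a hypothetical sketch of what a fixture-backed check of the key's TTL could look like; it exercises the Redis-level behaviour described above directly rather than the store's own API:

```go
package store_test

import (
	"context"
	"encoding/json"
	"testing"
	"time"

	"github.com/alicebob/miniredis/v2"
	"github.com/redis/go-redis/v9"
)

func TestLatencyKeyTTL(t *testing.T) {
	srv, err := miniredis.Run()
	if err != nil {
		t.Fatal(err)
	}
	defer srv.Close()

	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: srv.Addr()})

	// Write a record the way the PR describes: JSON {min,max,avg,updated_at}
	// under transport-discovery:lat:<id> with the 35-day latency TTL.
	rec, _ := json.Marshal(map[string]int64{
		"min": 1200, "max": 4800, "avg": 2100, "updated_at": time.Now().UnixMicro(),
	})
	key := "transport-discovery:lat:test-id"
	if err := rdb.Set(ctx, key, rec, 35*24*time.Hour).Err(); err != nil {
		t.Fatal(err)
	}

	// The TTL must be the 35-day latency TTL, never the 5-minute registration timeout.
	if ttl := srv.TTL(key); ttl <= 5*time.Minute {
		t.Fatalf("latency TTL %v looks like the registration TTL", ttl)
	}
}
```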

Sequencing

Independent of #2415 (zero-clobber guards), but the two interact: #2415 prevents partial-zero snapshots from being persisted in the first place; this PR ensures whatever does land is durable.

0pcom added 2 commits May 3, 2026 00:02
…om registration TTL

Latency lived inside the tp:<id> registration blob and inherited its
5-minute entry-timeout. Bandwidth has always been stored separately
at bw:daily:<id>:<date> with a 35-day TTL, so any visor that paused
re-registering (TPD restart, network blip, normal churn) silently lost
its latency from /metrics until the next CXO push, while bandwidth-
today survived intact.

Move latency to a peer of bandwidth:

  - New key: transport-discovery:lat:<id>, JSON {min, max, avg,
    updated_at} in microseconds, 35-day TTL.
  - UpdateLatency writes only to that key. The "must be registered"
    coupling and the TTL-inheriting Set on tp:<id> are gone — those
    were exactly the reasons latency disappeared with registration
    churn. avg<=0 still drops the update.
  - GetTransportMetrics reads the new key in its existing pipeline.
  - getAllTransportsWithQoS, GetTransportsByEdge, GetTransportByID
    hydrate entry.Latency from the durable record so the aggregate
    paths (GetNetworkMetrics, GetVisorAggregateMetrics) and the
    /transports/id:, /transports/edge:, /transports/edges API
    endpoints all see the persisted value.

TransportData.Latency{Min,Max,Avg} stays in the schema so older
payloads decode cleanly, but no write touches them anymore. Reads
that go through the QoS hydration step end up with the durable
value overlaid on top of the (now always 0) blob field.

No semantic change to the value itself: still last-writer-wins,
still per-transport (round-trip is symmetric, no per-edge
tracking), still no daily aggregation. Only the storage location
and lifetime change.

Verified post-deploy by:
  - redis-cli TTL transport-discovery:lat:<id> returns ~3024000s,
    never the registration TTL.
  - After a TPD bounce, /metrics carries latency for transports
    whose visors haven't yet re-pushed a CXO sample.
0pcom merged commit a619f20 into skycoin:develop May 3, 2026
3 of 4 checks passed
0pcom added a commit that referenced this pull request May 4, 2026
…iminator (#2422)

Production TPD was restarting every 30-40s (RestartCount 556 over 6h
on the prod host) because two distinct panics tear down the process:

1. pkg/cxo/skyobject/cache.go (*Cache).Finc:1189,1216
   panic: "Finc to negative for: <hash>"
   The filling-refcount went below zero — likely a duplicate Finc on
   a Filler.incs map, or an Inc/Finc mismatch across overlapping
   fillers. Hard process kill via panic. Filler.apply / Filler.reject
   already consume Finc's error return and just log; surface the
   inconsistency through that path instead. Clamp fc to 0, log the
   condition with key and the offending inc, and continue. Worst
   case is a leaked filling-item slot — orders of magnitude better
   than killing the service.

2. pkg/httputil/httputil.go WriteJSON:50
   panic: "short write: i/o deadline reached"
   isIOError checks errors.Is(err, io.ErrShortWrite), but
   net/http's timeoutWriter returns its own error value with the
   same message string when a write deadline expires mid-response.
   errors.Is misses it (different sentinel), the fallback string
   match didn't include "short write", so getAllTransports'
   ~1MB JSON write to a slow client panics on every deadline hit.
   Added "short write", "i/o timeout", and "deadline exceeded"
   to the string-match fallback. New TestIsIOErrorShortWriteVariants
   pins all sentinel and string-match cases.

Neither bug is caused by #2415/#2418/#2421; they were just made
visible because deploys cycling at 30-40s aren't subtle. Together
these stop the panic loop without changing any data semantics.
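
To make point 2 above concrete, here is a hedged sketch of what a string-match fallback of that shape looks like; the sentinel check and the surrounding isIOError structure in pkg/httputil are paraphrased from the description, not copied from the repository:

```go
package httputil

import (
	"errors"
	"io"
	"strings"
)

// isIOError reports whether err is a transport-level write failure that should be
// logged and swallowed rather than allowed to panic mid-response.
func isIOError(err error) bool {
	if err == nil {
		return false
	}
	// Sentinel match catches the canonical value...
	if errors.Is(err, io.ErrShortWrite) {
		return true
	}
	// ...but net/http's timeoutWriter returns its own error whose message merely
	// resembles io.ErrShortWrite, so errors.Is misses it. Fall back to substrings.
	msg := err.Error()
	for _, s := range []string{"short write", "i/o timeout", "deadline exceeded"} {
		if strings.Contains(msg, s) {
			return true
		}
	}
	return false
}
```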
