Skip to content

[Bug]: pgjsonb returner and event_return propagate database errors with poor diagnostics #69058

@co-cy

Description

@co-cy

What happened?

Description

Two asymmetries in the pgjsonb writer functions cause database failures to be either propagated to callers that do not expect them or dropped without enough context.

returner(ret) only catches salt.exceptions.SaltMasterError, which _get_serv raises on psycopg2.OperationalError during connect. A psycopg2.DatabaseError during the actual cur.execute(...) into salt_returns (foreign-key violation, NOT NULL violation, deadlock, DataError) propagates to the caller. The publish path in master.py:_prep_pub is wrapped in except Exception → log.critical, so it absorbs the error. But the syndic-aggregate path at master.py:1699-1701 is not wrapped:

if load.get("load") and self.opts["master_job_cache"]:
    fstr = "{}.save_load".format(self.opts["master_job_cache"])
    self.mminion.returners[fstr](load["jid"], load["load"])  # unhandled

So a single bad row in syndic aggregation can escape into a master subprocess.

event_return(events) has no local error handling at all. The event-bus caller in EventReturn._flush_event_single already wraps it in except Exception → log.error, which keeps the master alive, but the entire event_queue is dropped on any failure with no local context — operators see a generic stacktrace and no indication of how many events were lost.

There is a smaller leftover too: event_return passes the events list as the positional ret argument to _get_serv(events, commit=True). The ret parameter is meant to be the return-dict of one returner call (a dict carrying optional ret_config / ret_kwargs); passing a list happens to behave like ret=None because salt.returners.get_returner_options does in checks that are permissive on a list, but it is misleading copy-paste from returner(ret).

Setup

  • on-prem machine
  • classic packaging
  • onedir packaging

Any deployment using master_job_cache: pgjsonb or event_return: pgjsonb.

Steps to Reproduce the behavior

For the syndic path:

  1. Set master_job_cache: pgjsonb and run a master in syndic mode.
  2. Have a syndic publish a return whose row would violate a constraint on salt_returns (e.g. wrong encoding in return, oversized id).
  3. The psycopg2.DataError raised by cur.execute propagates out of pgjsonb.returner, through salt-master's syndic-handler, with no log line that names the offending jid/id.

For the events path:

  1. Set event_return: pgjsonb and event_return_queue: 5000.
  2. During a flush, raise any DatabaseError (e.g. take down salt_events's primary key with a manual schema change).
  3. All 5000 buffered events are dropped; the only log line is the generic Could not store events - returner '...' raised exception: ... from EventReturn.

Expected behavior

  • A psycopg2.DatabaseError during cur.execute should not propagate out of pgjsonb.returner regardless of which call path used it.
  • Each drop should be logged with enough context to correlate it to the lost payload — jid/id for returns, batch size for events.
  • event_return should not pretend its events list is a return-dict.

Additional context

The accompanying PR adds a psycopg2.DatabaseError catch to both functions that logs through log.exception (so the traceback is preserved) and tightens the existing SaltMasterError messages to include the same context. While there it also drops event_return's misleading positional events argument to _get_serv. Four behavioural unit tests cover both error paths and the corrected event_return happy path (none existed before).

Builds on #69049 (which already routed _get_serv / _purge_jobs / _archive_jobs through Salt's logger) and should be merged after it.

Type of salt install

Official deb

Major version

3006.x, 3007.x

What supported OS are you seeing the problem on? Can select multiple. (If bug appears on an unsupported OS, please open a GitHub Discussion instead)

debian-11, debian-12

salt --versions-report output

salt --versions-report
Salt Version:
          Salt: 3007.13

Python Version:
        Python: 3.10.19 (main, Feb  5 2026, 07:05:38) [GCC 11.2.0]

Dependency Versions:
          cffi: 2.0.0
      cherrypy: unknown
  cryptography: 42.0.5
      dateutil: 2.8.2
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.6
       libgit2: 1.9.1
  looseversion: 1.3.0
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.7
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 24.0
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.19.1
        pygit2: 1.18.2
  python-gnupg: 0.5.2
        PyYAML: 6.0.1
         PyZMQ: 25.1.2
        relenv: 0.22.3
         smmap: Not Installed
       timelib: 0.3.0
       Tornado: 6.5.4
           ZMQ: 4.3.4

Salt Extensions:
 saltext.vault: 1.5.0

Salt Package Information:
  Package Type: onedir

System Versions:
          dist: debian 12.13 bookworm
        locale: utf-8
       machine: x86_64
       release: 6.12.73+deb12-amd64
        system: Linux
       version: Debian GNU/Linux 12.13 bookworm

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugbroken, incorrect, or confusing behaviorneeds-triage

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions