spilo/patroni not able to elect new leader if previous leader, last working member failed due to full disk?
Scenario
Unfortunately, /home/postgres/pgdata ran out of space (in all pods, it seems,
probably almost simultaneously) and spilo/patroni started logging:
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/patroni/async_executor.py", line 39, in run
wakeup = func(*args) if args else func()
File "/usr/local/lib/python3.5/dist-packages/patroni/postgresql.py", line 1067, in _do_followself.write_recovery_conf(primary_conninfo)
File "/usr/local/lib/python3.5/dist-packages/patroni/postgresql.py", line 911, in write_recovery_conf
f.write("{0} = '{1}'\n".format(name, value))
OSError: [Errno 28] No space left on device
I believe the last leader before all pods ran out of disk space was
either patroni-set-0003-1 or patroni-set-0003-2.
Recovery
In order to solve the issue I:
Scaled down patroni-set-0003 to 1 replica (the remaining pod still failing
with OSError: No space left on device). Note that this left me without any
running old leader, broken or not; I believe this could be key to my issue
(see the sketch below).
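For reference, a minimal sketch of that scale-down step, assuming the
StatefulSet lives in the default namespace (the namespace is an assumption;
the name is from above):

    # Scale the broken StatefulSet down to a single replica
    kubectl scale statefulset patroni-set-0003 --replicas=1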
Created a new StatefulSet, patroni-set-0004, with the same configuration as
patroni-set-0003 except [...]
With only the broken patroni-set-0003-0 running, patroni-set-0004-0 started
restoring from the WAL archive; I left it overnight to restore. During this
time both patroni-set-0003-0 and patroni-set-0004-0 were running, but
patroni-set-0003-0 was out of disk.
Several hours later, patroni-set-0004-0 was logging lots of:
following a different leader because i am not the healthiest node
Lock owner: None; I am patroni-set-0004-0
wal_e.blobstore.gs.utils WARNING MSG: could no longer locate object while performing wal restore
DETAIL: The absolute URI that could not be located is gs://the-bucket/spilo/the-scope/wal/wal_005/the-file.lzo.
HINT: This can be normal when Postgres is trying to detect what timelines are available during restoration.
STRUCTURED: time=2017-06-26T12:05:23.646236-00 pid=207
lzop: <stdin>: not a lzop file
[...]
I expected patroni-set-0004-0 to take over the master lock by this time.
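To check whether the object the HINT mentions really is absent from the
archive, the bucket can be listed directly (a hedged example; the bucket and
scope names below are the placeholders from the log, not real values):

    # List the archived WAL segments wal-e is trying to fetch
    gsutil ls gs://the-bucket/spilo/the-scope/wal/wal_005/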
Debugging why the disk outage occurred, I found out about ext reserved
blocks. I then recovered 25Gi of disk space on patroni-set-0003-0's pgdata by
running tune2fs -m 0 /dev/$PGDATA_DEV (sketched below). I realize in
hindsight that simply resizing the GCE PD would have been easier.
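For anyone hitting the same wall, a minimal sketch of inspecting and then
releasing the reserved blocks ($PGDATA_DEV stands in for the actual block
device, as above; this needs to run somewhere with access to the device, such
as the node or a privileged container):

    # Show how many blocks ext2/3/4 reserves for root (the default is 5%)
    tune2fs -l /dev/$PGDATA_DEV | grep -i 'reserved block'
    # Set the reserved percentage to 0, handing the space back to Postgres
    tune2fs -m 0 /dev/$PGDATA_DEV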
However, once patroni-set-0003-0 was given extra space and restarted, it did
not seem willing to take the leader role, even with the extra disk space and
no current leader, logging lots of:
Lock owner: None; I am patroni-set-0003-0
wal_e.blobstore.gs.utils WARNING MSG: could no longer locate object while performing wal restore
DETAIL: The absolute URI that could not be located is gs://the-bucket/spilo/the-scope/wal/wal_005/the-file.lzo.
HINT: This can be normal when Postgres is trying to detect what timelines are available during restoration.
STRUCTURED: time=2017-06-26T12:05:23.646236-00 pid=207
lzop: <stdin>: not a lzop file
[...]
I expected patroni-set-0003-0 to take the leader role by this time.
I then did the same thing to patroni-set-0003-{1,2}, freeing up 25Gi of
space. Once patroni-set-0003-1 was given extra disk space and restarted, it
took the master lock.
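Throughout this, the cluster state and lock owner can be watched from inside
any member pod with patronictl (a sketch; assuming patronictl is on the PATH
in the spilo image, and using the placeholder scope name from the logs):

    # Show members, their roles, state, and who holds the leader lock
    patronictl list the-scope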