[Question] no more wal's after my initial one #1038

@redscaresu


Hi, I have configured WAL-G backups with the config shown further down. With it, the initial base backup is created in my object storage just fine.

For example, on DB initialisation I see the following object uploaded to my object storage:

spilo/foo-cluster/fdgdffg-dfgdfg-dfgdg-8fddfg8dc-dfgdfg/wal/basebackups_005/base_000000010000000000000002/tar_partitions/part_001.tar.lz4

This is the config I am using:

apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-pod-config
  namespace: postgres-cluster
data: 
  BACKUP_SCHEDULE: "0 */12 * * *"
  USE_WALG_BACKUP: "true"
  BACKUP_NUM_TO_RETAIN: "14"
  AWS_ACCESS_KEY_ID: xxxxx
  AWS_SECRET_ACCESS_KEY: xxxxxx
  AWS_ENDPOINT: https://sdfdsfdfgdfg.compat.objectstorage.us-east-1.oraclecloud.com
  AWS_S3_FORCE_PATH_STYLE: "true"
  AWS_REGION: us-east-1

However, after a while the container starts running out of memory, with wal-g triggering the OOM killer:

[2090157.723139] Memory cgroup stats for /kubepods/podd31a1ea6-7005-41d2-9e00-a78a08402cf9/c39b84616fedb31df69edb62c423677a82d25bd74ca590b186c9807e610f6cdf: cache:20192KB rss:469716KB rss_huge:360448KB shmem:19932KB mapped_file:16744KB dirty:0KB writeback:0KB swap:0KB inactive_anon:6996KB active_anon:482652KB inactive_file:136KB active_file:124KB unevictable:0KB
[2090157.723148] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[2090157.723275] [24301]     0 24301     1098      193       8       3        0          -998 dumb-init
[2090157.723277] [24360]     0 24360     1159      459       8       3        0          -998 sh
[2090157.723279] [24499]     0 24499    12738      862      29       3        0          -998 su
[2090157.723281] [24500]     0 24500     1140      213       8       3        0          -998 runsvdir
[2090157.723283] [24501]     0 24501     1102      206       7       3        0          -998 runsv
[2090157.723284] [24502]     0 24502     1102      213       7       3        0          -998 runsv
[2090157.723286] [24503]     0 24503     1102      195       7       3        0          -998 runsv
[2090157.723288] [24504]   101 24504   173946    10175     101       3        0          -998 patroni
[2090157.723290] [24505]     0 24505     7089      705      19       3        0          -998 cron
[2090157.723292] [24506]   101 24506    27000     2014      52       3        0          -998 pgqd
[2090157.723294] [24510]   101 24510   279237    10805      86       5        0          -998 wal-g
[2090157.723296] [24708]   101 24708    78672     7358      88       3        0          -998 postgres
[2090157.723297] [24711]   101 24711    48841     1126      74       3        0          -998 postgres
[2090157.723299] [24713]   101 24713    99430     3186      86       3        0          -998 postgres
[2090157.723301] [24924]   101 24924    78705     3019      85       3        0          -998 postgres
[2090157.723303] [24925]   101 24925    78709     1618      81       3        0          -998 postgres
[2090157.723305] [24926]   101 24926    49436     1303      75       3        0          -998 postgres
[2090157.723307] [25076]   101 25076    79011     4159      87       3        0          -998 postgres
[2090157.723309] [25309]   101 25309    78672     2149      78       3        0          -998 postgres
[2090157.723311] [25310]   101 25310    78853     2123      80       3        0          -998 postgres
[2090157.723313] [25311]   101 25311    49371     1684      76       3        0          -998 postgres
[2090157.723315] [25312]   101 25312    78935     3006      84       3        0          -998 postgres
[2090157.723316] [25314]   101 25314    78813     2065      83       3        0          -998 postgres
[2090157.723318] [25315]   101 25315    78812     1723      79       3        0          -998 postgres
[2090157.723320] [26175]   101 26175    78977     2946      83       3        0          -998 postgres
[2090157.723322] [28384]   101 28384    79065     4629      88       3        0          -998 postgres
[2090157.723330] [31757]     0 31757    11288      700      27       3        0          -998 cron
[2090157.723332] [31758]   101 31758     1157      215       7       3        0          -998 sh
[2090157.723334] [31759]   101 31759   242369    17043      78       5        0          -998 wal-g
[2090157.723336] [31809]   101 31809    79065     4801      88       3        0          -998 postgres
[2090157.723338] [31834]   101 31834     1157      206       9       3        0          -998 sh
[2090157.723340] [31835]   101 31835   382233    95774     240       5        0          -998 wal-g
[2090157.723342] Memory cgroup out of memory: Kill process 24301 (dumb-init) score 0 or sacrifice child
[2090157.764125] Killed process 24360 (sh) total-vm:4636kB, anon-rss:116kB, file-rss:1720kB, shmem-rss:0kB
[2090413.393720] wal-g invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null),  order=0, oom_score_adj=-998
[2090413.393724] wal-g cpuset=a3937f08578a7c0522de5a73245347bf990a4fe182e0a424670fdedb1787ed61 mems_allowed=0
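
To see which processes dominate the cgroup when the kill happens, the process table in the log above can be summed per process name. The following is a minimal sketch, not part of the operator or wal-g: the function name `rss_by_name` is made up for illustration, and it assumes the `rss` column is counted in 4 KiB pages (typical on x86-64, but not confirmed for this node).

```python
import re
from collections import defaultdict

PAGE_KIB = 4  # assumption: 4 KiB pages; adjust if the node uses another page size

# Matches process rows of the kernel OOM dump, e.g.
# [2090157.723294] [24510]   101 24510   279237    10805      86       5        0          -998 wal-g
# Columns after [pid]: uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
ROW = re.compile(
    r"\[\s*\d+\.\d+\]\s+\[\s*(?P<pid>\d+)\]"
    r"\s+\d+\s+\d+\s+(?P<total_vm>\d+)\s+(?P<rss>\d+)"
    r"\s+\d+\s+\d+\s+\d+\s+-?\d+\s+(?P<name>\S+)\s*$"
)


def rss_by_name(log_text):
    """Return {process name: (process count, summed RSS in KiB)}."""
    totals = defaultdict(lambda: [0, 0])
    for line in log_text.splitlines():
        match = ROW.search(line)
        if match is None:
            continue  # header row, cgroup stats, kill messages, etc.
        entry = totals[match.group("name")]
        entry[0] += 1
        entry[1] += int(match.group("rss")) * PAGE_KIB
    return {name: (count, kib) for name, (count, kib) in totals.items()}


# A few rows copied from the log above.
SAMPLE = """\
[2090157.723148] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[2090157.723288] [24504]   101 24504   173946    10175     101       3        0          -998 patroni
[2090157.723294] [24510]   101 24510   279237    10805      86       5        0          -998 wal-g
[2090157.723332] [31759]   101 31759   242369    17043      78       5        0          -998 wal-g
[2090157.723340] [31835]   101 31835   382233    95774     240       5        0          -998 wal-g
"""

if __name__ == "__main__":
    summary = rss_by_name(SAMPLE)
    # Largest consumers first.
    for name, (count, kib) in sorted(summary.items(), key=lambda kv: -kv[1][1]):
        print(f"{name:10s} processes={count} rss={kib} KiB")
```

On the sample rows this counts three concurrent wal-g processes with a combined RSS of about 483 MiB under the 4 KiB assumption. Per-process RSS counts shared pages more than once, so the sum is an upper bound rather than the exact cgroup usage.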

Looking closer, it seems that after the initial base backup no more WAL files are uploaded, which eventually fills up my disk. The Postgres logs show that uploads fail after the initial base backup:

failed to upload 'spilo/foo-cluster/d210db5f-e8f3-4807-9359-4b8df275df6f/wal/wal_005/00000009.history.lz4' to bucket 'postgres-foo-wal-backup': SignatureDoesNotMatch: The required information to complete authentication was not provided.
	status code: 403, request id: east-1:sfsdfgd231DIdfdgOIUX-_Q, host id:

This seems to be an issue on my side, in how I authenticate with my cloud provider, rather than with the operator. However, I do not understand why the initial base backup works but subsequent uploads don't, when they all use the same config. Can you give me any tips for debugging this?

Thanks!
