Upgrade fails due to postmaster.pid #550
Comments
Hello, thank you for your report. Since every layer (podman, openshift, ...) complicates the debugging, it would be really helpful to provide a reproducer using just podman or at the rpm level.
you can use this on minikube:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: postgres-pv-claim
  labels:
    app: postgres
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
spec:
  selector:
    matchLabels:
      app: postgres
  replicas: 1
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: quay.io/centos7/postgresql-12-centos7
          imagePullPolicy: "IfNotPresent"
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRESQL_UPGRADE
              value: "hardlink"
            - name: POSTGRESQL_UPGRADE_FORCE
              value: "true"
            - name: POSTGRESQL_DATABASE
              valueFrom:
                secretKeyRef:
                  name: assisted-installer-rds
                  key: db.name
            - name: POSTGRESQL_USER
              valueFrom:
                secretKeyRef:
                  name: assisted-installer-rds
                  key: db.user
            - name: POSTGRESQL_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: assisted-installer-rds
                  key: db.password
          volumeMounts:
            - mountPath: /var/lib/pgsql/data
              name: postgredb
          resources:
            limits:
              memory: 500Mi
            requests:
              cpu: 100m
              memory: 400Mi
      volumes:
        - name: postgredb
          persistentVolumeClaim:
            claimName: postgres-pv-claim

then redeploy step 2 with a newer image. These are the same steps that I'm running.
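With minikube, the redeploy step is essentially applying the same file twice with a changed image; a minimal sketch, assuming the two manifests above are saved together as postgres.yaml (the filename is just an example):

```
kubectl apply -f postgres.yaml
# change the image from postgresql-12-centos7 to postgresql-13-centos7, then
kubectl apply -f postgres.yaml
```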
BTW, when you restart the image you can see that the
It seems to be caused by the orchestration. I tested the scenario just by deploying the image with podman and then redeploying, and the upgrade works fine.
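A minimal podman-only sketch of that scenario, in case it helps to compare setups (the volume and container names are placeholders; the env variables, mount path, and images are the ones from the manifests above):

```
podman volume create pgdata
podman run -d --name pg -v pgdata:/var/lib/pgsql/data \
  -e POSTGRESQL_USER=user1 -e POSTGRESQL_PASSWORD=password -e POSTGRESQL_DATABASE=testdb \
  quay.io/centos7/postgresql-12-centos7

# "redeploy" with the newer image and the upgrade variables
podman stop pg && podman rm pg
podman run -d --name pg -v pgdata:/var/lib/pgsql/data \
  -e POSTGRESQL_USER=user1 -e POSTGRESQL_PASSWORD=password -e POSTGRESQL_DATABASE=testdb \
  -e POSTGRESQL_UPGRADE=hardlink -e POSTGRESQL_UPGRADE_FORCE=true \
  quay.io/centos7/postgresql-13-centos7
```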
But I did it with podman, minikube, and openshift.
Of course, I am using a persistent data location. Without the upgrade option, it isn't possible to re-deploy it with the higher PostgreSQL version.
so here's a step by step running it locally:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
spec:
  selector:
    matchLabels:
      app: postgres
  replicas: 1
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: quay.io/centos7/postgresql-12-centos7
          imagePullPolicy: "IfNotPresent"
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRESQL_UPGRADE
              value: "hardlink"
            - name: POSTGRESQL_UPGRADE_FORCE
              value: "true"
            - name: POSTGRESQL_DATABASE
              value: testdb
            - name: POSTGRESQL_USER
              value: user1
            - name: POSTGRESQL_PASSWORD
              value: password
          volumeMounts:
            - mountPath: /var/lib/pgsql/data
              name: postgresdb
          resources:
            limits:
              memory: 500Mi
            requests:
              cpu: 100m
              memory: 400Mi
      volumes:
        - name: postgresdb
          persistentVolumeClaim:
            claimName: postgres-pv-claim

> # create volume
> podman volume create postgres-pv-claim
postgres-pv-claim
> # starting pod
> podman kube play deployment.yaml
Pod:
2e7286f9445a896a07a9f2f6ee0f8dc0a5438d2f63d98fa61c2fa98054076cb8
Container:
53021a21703d7b141c4c511dc71da6d97db738e3401c6bc8c402b758d466d4e3
> # change the image
> sed -i 's/postgresql-12/postgresql-13/g' deployment.yaml
> grep -Rn image: deployment.yaml
19: image: quay.io/centos7/postgresql-13-centos7
> # replace pod
> podman kube play deployment.yaml --replace
Pods stopped:
7b45233e856809db7527ec4cb074bfaedec0f3d9c16a4954500980f34ddb0b67
Pods removed:
7b45233e856809db7527ec4cb074bfaedec0f3d9c16a4954500980f34ddb0b67
Secrets removed:
Volumes removed:
Trying to pull quay.io/centos7/postgresql-13-centos7:latest...
Getting image source signatures
Copying blob b929e0b929d2 done |
Copying blob 06c7e4737942 skipped: already exists
Copying blob c61d16cfe03e skipped: already exists
Copying config 7d69e95a7f done |
Writing manifest to image destination
Pod:
f96d847bf394168c141c8c5a5d9e63cb4c373eb7b0defaa1f6468431a61ccaf5
Container:
52e3699579e0398985e4d8d815059c42b0a8ed285471398c0bcb4f24ff4f41ed
> # logs...
> podman logs postgres-pod-postgres
========== $PGDATA upgrade: 12 -> 13 ==========
===> Starting old postgresql once again for a clean shutdown...
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2024-03-13 08:58:58.037 UTC [26] FATAL: lock file "postmaster.pid" already exists
2024-03-13 08:58:58.037 UTC [26] HINT: Is another postmaster (PID 1) running in data directory "/var/lib/pgsql/data/userdata"?
stopped waiting
pg_ctl: could not start server
Examine the log output.
p.s.
FYI, when you run I'm guessing that the
I have the same opinion; it would make sense that the PostgreSQL stop is not handled correctly in the OpenShift environment.
I see it in minikube / podman and openshift. Maybe a better way is to add a preStop command:

```
preStop:
  exec:
    command: [ pg_ctl, -D, /var/lib/pgsql/data/userdata, stop ]
```
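For context, in the Deployment above such a hook would sit under the container's lifecycle field; this is only a sketch of the placement, reusing the names from the manifests in this thread:

```
      containers:
        - name: postgres
          image: quay.io/centos7/postgresql-13-centos7
          lifecycle:
            preStop:
              exec:
                command: [ pg_ctl, -D, /var/lib/pgsql/data/userdata, stop ]
```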
@eifrach Can you please test your deployment with the defined preStop hook?
What do you mean? The original is in
I was reviewing the PR which should fix this issue: #554. However, the main communication seems to be here, so I'll ask here. As you have discussed, it indeed seems like the container/pod is not terminating gracefully on SIGTERM. OCP/podman should send SIGKILL when SIGTERM does not manage to terminate the running container. If this is really happening here - and the container is terminated by SIGKILL - then based on the upstream postgresql documentation this is not safe and should be avoided: "It is best not to use SIGKILL to shut down the server. Doing so will prevent the server from releasing shared memory and semaphores. Furthermore, SIGKILL kills the postgres process without letting it relay the signal to its subprocesses, so it might be necessary to kill the individual subprocesses by hand as well." In that case I think that the proposed fix could have some unwanted consequences. SIGTERM seems to work correctly when running the container directly from podman (podman run / podman kill).
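One related knob, mentioned here only as an assumption and not something verified in this thread: Kubernetes sends SIGKILL only after the pod's termination grace period (30 seconds by default) expires, so if the clean shutdown is simply slow, raising that limit might let SIGTERM finish its work:

```
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 120  # illustrative value; the Kubernetes default is 30
```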
Thanks so much - I will try it out in a day or two and let you know.
Hey, but I found a new problem: the repo no longer has and
I understand your point that sending SIGKILL to the server is not a good practice. We should definitely identify why it happens and try to avoid it. But anyway we also need a mechanism to "recover" after such an ugly shutdown. The proposed PR doesn't handle the cause but the consequence.
So I would propose to merge it, if it works and doesn't cause any other issue. And we can open an issue for further investigation to find out why the .pid file persists.
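To illustrate what such a recovery could look like (this is only a sketch of the general idea, not necessarily what #554 implements), a pre-start check could remove postmaster.pid only when the PID recorded in it no longer belongs to a running postgres process:

```
# Sketch only: drop a stale postmaster.pid before starting the server.
# Note that inside a container the old server ran as PID 1 and PID 1 always
# exists, so a plain "is the PID alive" check is not enough; check the
# process name instead.
PIDFILE=/var/lib/pgsql/data/userdata/postmaster.pid
if [ -f "$PIDFILE" ]; then
    pid=$(head -n 1 "$PIDFILE")
    comm=$(cat "/proc/$pid/comm" 2>/dev/null || true)
    if [ "$comm" != "postgres" ] && [ "$comm" != "postmaster" ]; then
        echo "Removing stale $PIDFILE (PID $pid is not a running postgres)"
        rm -f "$PIDFILE"
    fi
fi
```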
CentOS 7 images from SCLOrg are no longer being rebuilt, see more in commit 9ebea2b.
I do not know, and even if this is a bug, it would not be in the container image layer, but rather in the rpm layer. @fila43 Do you have any info about this?
@fila43, OK, I agree.
Now I tried to use the build to upgrade from 12 to 13 using the fedora image; it seems that PostgreSQL 12 is not found:

========== $PGDATA upgrade: 12 -> 13 ==========
/usr/share/container-scripts/postgresql/common.sh: line 335: scl_source: No such file or directory
===> Starting old postgresql once again for a clean shutdown...
/usr/share/container-scripts/postgresql/common.sh: line 353: /opt/rh/rh-postgresql12/root/usr/bin/pg_ctl: No such file or directory
Warning: Can't detect cpuset size from cgroups, will use nproc
Just FYI, after playing a bit with the script
@eifrach Thanks for the analysis. We discussed this within the team and came to the conclusion that some bigger changes to how upgrades in sclorg postgresql container images work would need to be incorporated. This could, however, take some time.
Sorry for the late response, but for RHEL images this is a big issue for some of the OpenShift operators.
At the RPM level, we also have the Dump and Restore upgrade approach. It should also work for containers. More about it here. Would it be a suitable workaround for your use case?
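For anyone who wants to try that route with these images, a rough sketch of dump and restore (container names, the user, and the dump path are placeholders, not taken from this thread):

```
# dump everything from the still-running 12 container
podman exec postgresql-12 pg_dumpall -U postgres > dump.sql

# start a fresh 13 container on a new, empty data volume, then load the dump
podman exec -i postgresql-13 psql -U postgres -f - < dump.sql
```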
The app uses official Red Hat images, which also get security updates, and those are very important to us. For this I would need to either create a custom build or create some kind of startup script for the migration.
btw, this is how I got the upgrade to work:

FROM quay.io/sclorg/postgresql-13-c8s:latest

USER root

ENV PQSL12_BIN=/usr/lib64/pgsql/postgresql-12
ENV PQSL12_DESTANATION=/opt/rh/rh-postgresql12/root/usr

RUN dnf install -y procps-ng util-linux postgresql-upgrade

RUN mkdir -p $PQSL12_DESTANATION && \
    /usr/libexec/fix-permissions $PQSL12_BIN/bin && \
    ln -s $PQSL12_BIN/bin $PQSL12_DESTANATION/bin

# removing issues from the current script
RUN sed -i 's#"${old_pgengine}/pg_isready"# #g' /usr/share/container-scripts/postgresql/common.sh && \
    sed -i 's#new_pgengine=/opt/rh/rh-postgresql${new_raw_version}/root/usr/bin#new_pgengine="/bin"#g' /usr/share/container-scripts/postgresql/common.sh && \
    sed -i 's#&& ! pgrep -f "postgres" > /dev/null# #g' /usr/share/container-scripts/postgresql/common.sh

USER 26

ENTRYPOINT ["container-entrypoint"]
CMD ["run-postgresql"]
Since we do not support in-place upgrades for images newer than centos7, as you mentioned, the container would need a couple of changes. This would lead to an increase in the size of the container. I recommend filing a feature request in Jira.
Out of curiosity, why was this dropped? Seems like a rather large feature to remove.
More precisely, it has never been removed. Between rhel7 and rhel8 there was a change in postgresql on the rpm level, and the new design hasn't been adopted yet. It's an area where we will welcome any kind of contribution. The proposed change is heading in the right direction but needs to be adjusted.
Container platform
OCP 4, Podman/Docker
Version
12, 13
OS version of the container image
RHEL 7, RHEL 8, CentOS 7, CentOS Stream 8
Bugzilla, Jira
No response
Description
We are trying to work on an upgrade flow from version 12 to version 13. We are using k8s Deployment pods. When we try to deploy the new image, we get an error during the upgrade.
It seems that the postmaster.pid is always there; it is also present on restart / redeployment, and the service recovers without an issue. Only during the upgrade do we face the problem. Deleting the file and redeploying the upgrade solves it.
My question is: I'm not sure if deleting a PID file is a good solution to run in a production environment.
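For reference, the manual workaround mentioned above amounts to something like this before re-running the upgrade (the pod name is a placeholder):

```
kubectl exec <postgres-pod> -- rm /var/lib/pgsql/data/userdata/postmaster.pid
```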
Reproducer
Deploy version 12, and redeploy version 13 with the upgrade env. Can be done with podman play or minikube.