Segfault upon startup with 2.0.1-pg12 and 1.7.5-pg12 using Docker on ARM #2968

Closed
aaron97neu opened this issue Feb 19, 2021 · 8 comments

@aaron97neu

Relevant system information:

  • OS: Debian 10 Buster
  • PostgreSQL version: 12.6
  • TimescaleDB version: 2.0.1 / 1.7.5
  • Installation method: Docker

Describe the bug
timescaledb fails to start, citing a segfault. This behavior was observed on a Toradex iMX6 ARM device; no other ARM devices were tested. There was no failure when tested on a Debian 10 x86 machine.

To Reproduce

  1. On an ARM machine, use the following docker-compose file:
services:
  tsdb:
    environment:
      POSTGRES_DB: testdb
      POSTGRES_PASSWORD: notapassword
      POSTGRES_USER: sampleuser
      TIMESCALEDB_TELEMETRY: "off"
    image: timescale/timescaledb:2.0.1-pg12
    ports:
    - 5432:5432/tcp
    restart: unless-stopped
    volumes:
    - postgres:/var/lib/postgresql/data:rw
version: '3.1'
volumes:
  postgres: 
  2. Run docker-compose up
  3. Watch the log output and note the crash on startup and on each subsequent restart. docker logs container-name may be needed, since the docker-compose CLI exits after the first crash
  4. Be sure to remove all volumes before changing to another version and testing again

Expected behavior
timescaledb should start up without issues

Actual behavior
timescaledb fails to start

Screenshots
Log file from a failed startup with 2.0.1. Note the failure to start up and the repeated failure after restarting, each citing a segfault:

The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/lib/postgresql/data ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... GMT
creating configuration files ... ok
running bootstrap script ... ok
sh: locale: not found
1970-05-02 08:41:52.010 GMT [31] WARNING:  no usable system locales were found
performing post-bootstrap initialization ... ok
initdb: warning: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.
syncing data to disk ... ok


Success. You can now start the database server using:

    pg_ctl -D /var/lib/postgresql/data -l logfile start

waiting for server to start....1970-04-30 03:13:20.010 GMT [36] LOG:  starting PostgreSQL 12.6 on arm-unknown-linux-musleabihf, compiled by gcc (Alpine 10.2.1_pre1) 10.2.1 20201203, 32-bit
1970-04-30 03:13:20.010 GMT [36] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
....1970-04-30 03:13:20.010 GMT [36] LOG:  startup process (PID 37) was terminated by signal 11: Segmentation fault
1970-04-30 03:13:20.010 GMT [36] LOG:  aborting startup due to startup process failure
.1970-04-30 03:13:20.010 GMT [36] LOG:  database system is shut down
 stopped waiting
pg_ctl: could not start server
Examine the log output.

PostgreSQL Database directory appears to contain a database; Skipping initialization

1970-04-26 18:26:24.009 GMT [1] LOG:  starting PostgreSQL 12.6 on arm-unknown-linux-musleabihf, compiled by gcc (Alpine 10.2.1_pre1) 10.2.1 20201203, 32-bit
1970-04-26 18:26:24.009 GMT [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
1970-04-26 18:26:24.009 GMT [1] LOG:  listening on IPv6 address "::", port 5432
1970-04-26 18:26:24.009 GMT [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
1970-04-26 18:26:24.009 GMT [1] LOG:  startup process (PID 21) was terminated by signal 11: Segmentation fault
1970-04-26 18:26:24.009 GMT [1] LOG:  aborting startup due to startup process failure
1970-04-26 18:26:24.009 GMT [1] LOG:  database system is shut down

PostgreSQL Database directory appears to contain a database; Skipping initialization

1970-04-27 08:05:36.009 GMT [1] LOG:  starting PostgreSQL 12.6 on arm-unknown-linux-musleabihf, compiled by gcc (Alpine 10.2.1_pre1) 10.2.1 20201203, 32-bit
1970-04-27 08:05:36.009 GMT [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
1970-04-27 08:05:36.009 GMT [1] LOG:  listening on IPv6 address "::", port 5432
1970-04-27 08:05:36.009 GMT [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
1970-04-27 08:05:36.009 GMT [1] LOG:  startup process (PID 21) was terminated by signal 11: Segmentation fault
1970-04-27 08:05:36.009 GMT [1] LOG:  aborting startup due to startup process failure
1970-04-27 08:05:36.009 GMT [1] LOG:  database system is shut down

Additional context
The incorrect date in the logs (1970) is noteworthy. On successful starts with previous versions, the date is correct.
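
One way to spot this symptom automatically is to scan a captured log for timestamps that predate any plausible run date. The sketch below is only an illustration: the function name suspicious_lines and the earliest_year cutoff are hypothetical, and it assumes every timestamp carries the "GMT" label seen in the logs above.

```python
import re
from datetime import datetime

# Matches the timestamp prefix seen in the logs above, e.g.
# "1970-04-30 03:13:20.010 GMT [36] LOG:  ...".
# Assumption: timestamps are always labelled GMT, as in this report.
TIMESTAMP_RE = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d{3} GMT")


def suspicious_lines(log_text, earliest_year=2000):
    """Return (timestamp, line) pairs whose log timestamp predates earliest_year.

    earliest_year is a hypothetical cutoff chosen for illustration.
    """
    hits = []
    for line in log_text.splitlines():
        match = TIMESTAMP_RE.search(line)
        if match is None:
            continue  # lines without a timestamp (initdb output etc.) are skipped
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if stamp.year < earliest_year:
            hits.append((stamp, line))
    return hits


# Two lines copied from the failing log above.
sample = """\
1970-04-26 18:26:24.009 GMT [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
1970-04-26 18:26:24.009 GMT [1] LOG:  startup process (PID 21) was terminated by signal 11: Segmentation fault
"""

for stamp, line in suspicious_lines(sample):
    print(stamp.year, line)
```

To use it on a real container, the output of docker logs container-name could be saved to a file and passed in as log_text.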

This bug appears to have been introduced between 2.0.0 and 2.0.1. Below is my testing of successes/fails within docker. All volumes, images, and containers were deleted between tests

Docker Image             TSDB Version   PG Version   Startup Failure?
timescaledb:1.7.4-pg12   1.7.4          12.4         No
timescaledb:2.0.0-pg12   2.0.0          12.5         No
timescaledb:2.0.1-pg12   2.0.1          12.6         Yes
timescaledb:1.7.5-pg12   1.7.5          12.6         Yes
postgres:12.6            none           12.6         No
postgres:12.5            none           12.5         No
postgres:12.4            none           12.4         No

System information:

$ inxi -S -M -C
System:    Host: avena-apalis-dev06 Kernel: 5.9.0-0.bpo.2-armmp armv7l bits: 32 Console: tty 1 
           Distro: Debian GNU/Linux 10 (buster) 
Machine:   Type: ARM Device System: Toradex Apalis iMX6Q/D Module on Ixora Carrier Board V1.1 
           details: Freescale i.MX6 Quad/DualLite rev: N/A serial: 04958937 
CPU:       Topology: Quad Core model: ARMv7 v7l variant: cortex-a9 bits: 32 type: MCP 
           Speed: 792 MHz min/max: 396/792 MHz Core speeds (MHz): 1: 792 2: 792 3: 792 4: 792 
$ docker --version
Docker version 20.10.3, build 48d30b5
$ docker-compose --version 
docker-compose version 1.28.3, build unknown

I am unsure whether this appears on other ARM systems, as I have not had the opportunity to check.

Is there a way to provide a more verbose/informative log? I briefly searched but could not find an easy way to enable this in docker

2.0.1 and 1.7.5 were the most recent timescaledb releases. I wonder whether a patch shared between the two is causing this, or whether the switch to PostgreSQL 12.6 is responsible.

@svenklemm
Member

A stack trace of the segfault would be amazing. Did you try the 2.0.2 version published today? Your log shows you have a data volume mounted. Do you also get the segfault without a data volume?

@svenklemm
Member

Since our images are based on the postgres alpine images, could you try postgres:12.6-alpine as well?

@aaron97neu
Author

What is the best way to get the stack trace?

2.0.2 fails as well. Removing the data volume on 2.0.2 still leads to a failure.

postgres:12.6-alpine fails.

In addition, postgres:12.5-alpine and postgres:13.2-alpine fail. postgres:12.4-alpine and postgres:13.2 work.
It is strange that postgres:12.5-alpine fails but 2.0.0, which appears to be based on it, does not.

@svenklemm
Member

OK, this seems to be an upstream problem.

Instructions for getting a postgres stack trace: https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD

@aaron97neu
Author

Agreed. I will see if I can get it fixed upstream. Thanks for the assistance.

@svenklemm
Member

There is already a bug report about it on the postgres bugs mailing list: https://www.postgresql.org/message-id/CABnSV8Zswy5vi1Ns_1FX6fKVzJcxPWVw7Mtuv8EXZW13VQJVSg@mail.gmail.com

@aaron97neu
Author

For anyone who runs into the same issue in the future, I believe I have found the cause:
https://wiki.alpinelinux.org/wiki/Release_Notes_for_Alpine_3.13.0#time64_requirements

This issue stems from the postgres image upgrading to Alpine Linux 3.13. The key change here is time64 compatibility. Critically:

Alpine Linux 3.13.0 requires the host Docker to be version 19.03.9 (which contains backported moby commit 89fabf0) or greater and the host libseccomp to be version 2.4.2 (which contains backported libseccomp commit bf747eb) or greater. ... Therefore, the following platforms are not suitable as Docker hosts for 32-bit Alpine Linux 3.13.0, due to containing out-of-date libseccomp: Amazon Linux 1 or 2, CentOS 7 or 8, Debian stable without debian-backports, Raspbian stable, Ubuntu 14.04 or earlier, and Windows. This applies regardless of whether the Linux distribution Docker packages or separate Docker package repositories are used.

Because the host runs Debian 10, libseccomp2 was still at 2.3.3. Upgrading it to the 2.4.4 version from backports fixed the issue. This matters because no 32-bit image based on Alpine 3.13 will work on such a host without it.
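
For checking other hosts, a minimal sketch that compares installed versions against the minimums quoted from the Alpine release notes. The helper names version_tuple and meets_minimum are hypothetical, and a plain version comparison is only a heuristic: the release notes tie the requirement to specific backported commits, which a distribution could in principle carry under a different version number.

```python
import re

# Minimum versions quoted from the Alpine 3.13.0 release notes above.
MIN_LIBSECCOMP = "2.4.2"
MIN_DOCKER = "19.03.9"


def version_tuple(version):
    """Turn '2.4.4-1~bpo10+1' or '20.10.3' into a tuple of its leading dotted integers."""
    # Drop a Debian epoch prefix such as '1:' if one is present.
    upstream = version.split(":", 1)[-1]
    match = re.match(r"\d+(?:\.\d+)*", upstream)
    if match is None:
        raise ValueError(f"no leading version number in {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum):
    """Heuristic check: is the installed version at least the required minimum?"""
    return version_tuple(installed) >= version_tuple(minimum)


print(meets_minimum("2.3.3", MIN_LIBSECCOMP))   # libseccomp2 on stock Debian 10, as reported above
print(meets_minimum("2.4.4", MIN_LIBSECCOMP))   # libseccomp2 from backports, as reported above
print(meets_minimum("20.10.3", MIN_DOCKER))     # Docker version from the system information above
```

The installed version strings would have to be read from the host by hand, for example from docker --version and the package manager.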

Alpine issue thread: alpinelinux/docker-alpine#135

@svenklemm I believe this is enough to close the issue?

@svenklemm
Member

Yes, feel free to close the issue. Thank you for the investigation.
