
Server container hanging [500] #1649

Open
Sticcia opened this issue Jun 23, 2023 · 32 comments
Labels
bug Something isn't working

Comments

@Sticcia

Sticcia commented Jun 23, 2023

Prerequisites

What package are you referring to?

Server (backend)

Describe the bug

Our Speckle Server unexpectedly goes into an unresponsive state.
Restarting the speckle-server Docker container resolves the issue, but we can't work out what is causing it.
We have investigated the logs and only found some error messages from /graphql API requests (part of the log file is attached below).
Following the unexpected crash, all requests to the Speckle server return: Error: failed with status code 500.

Expected behavior

Server running smoothly and all requests handled correctly

System Info

Google Cloud VM Instance:

  • OS: Debian GNU/Linux 11 (bullseye)
  • OS version: 11
  • OS Config agent version: 20230330.00-g1
  • CPU platform: Intel Skylake
  • Architecture: x86/64
  • Boot Disk size: 10 GB
  • External Disk size (container mount point): 256 GB
  • Memory: 4 GB

Failure Logs

[36mspeckle-server_1 |�[0m {"level":"error","time":"2023-06-23T07:20:12.697Z","req":{"id":"8f50e384-010e-4da7-a79f-24cc87b068c2","method":"POST","path":"/api/getobjects/19e2cbffeb","headers":{"host":"geo.stykka.com","connection":"upgrade","content-length":"11078","accept":"text/plain","sec-fetch-site":"cross-site","accept-language":"da-DK,da;q=0.9","accept-encoding":"gzip, deflate, br","sec-fetch-mode":"cors","content-type":"application/json","origin":"https://dev-velux.web.app","user-agent":"Mozilla/5.0 (iPhone; CPU iPhone OS 16_5_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5.1 Mobile/15E148 Safari/604.1","referer":"https://dev-velux.web.app/","sec-fetch-dest":"empty","x-request-id":"8f50e384-010e-4da7-a79f-24cc87b068c2","x-speckle-client-ip":"91.143.119.255"}},"authContext":{"auth":true,"userId":"3727070022","role":"server:admin","scopes":["streams:read","streams:write","profile:read","profile:email","users:read","server:stats"]},"userId":"3727070022","streamId":"19e2cbffeb","err":{"type":"Error","message":"Premature close","stack":"Error [ERR_STREAM_PREMATURE_CLOSE]: Premature close\n at new NodeError (node:internal/errors:399:5)\n at Gzip. (node:internal/streams/pipeline:352:14)\n at Gzip.emit (node:events:525:35)\n at Gzip.emit (node:domain:489:12)\n at emitCloseNT (node:internal/streams/destroy:132:10)\n at emitErrorCloseNT (node:internal/streams/destroy:117:3)\n at process.processTicksAndRejections (node:internal/process/task_queues:82:21)","code":"ERR_STREAM_PREMATURE_CLOSE"},"msg":"App error streaming objects"}

[36mspeckle-server_1 |�[0m {"level":"error","time":"2023-06-23T07:20:14.050Z","req":{"id":"cd65b754-33de-443d-80f5-0f35e880a240","method":"POST","path":"/api/getobjects/19e2cbffeb","headers":{"host":"geo.stykka.com","connection":"upgrade","content-length":"33167","accept":"text/plain","sec-fetch-site":"cross-site","accept-language":"da-DK,da;q=0.9","accept-encoding":"gzip, deflate, br","sec-fetch-mode":"cors","content-type":"application/json","origin":"https://dev-velux.web.app","user-agent":"Mozilla/5.0 (iPhone; CPU iPhone OS 16_5_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5.1 Mobile/15E148 Safari/604.1","referer":"https://dev-velux.web.app/","sec-fetch-dest":"empty","x-request-id":"cd65b754-33de-443d-80f5-0f35e880a240","x-speckle-client-ip":"91.143.119.255"}},"authContext":{"auth":true,"userId":"3727070022","role":"server:admin","scopes":["streams:read","streams:write","profile:read","profile:email","users:read","server:stats"]},"userId":"3727070022","streamId":"19e2cbffeb","err":{"type":"Error","message":"Premature close","stack":"Error [ERR_STREAM_PREMATURE_CLOSE]: Premature close\n at new NodeError (node:internal/errors:399:5)\n at onclose (node:internal/streams/end-of-stream:154:30)\n at process.processTicksAndRejections (node:internal/process/task_queues:77:11)","code":"ERR_STREAM_PREMATURE_CLOSE"},"msg":"App error streaming objects"}

[36mspeckle-server_1 |�[0m {"level":"error","time":"2023-06-23T07:23:54.255Z","req":{"id":"be9f11cf-02bf-47eb-996b-75e5b5d8cf8b","method":"POST","path":"/graphql","headers":{"host":"geo.stykka.com","connection":"upgrade","content-length":"468","user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/114.0","accept":"/","accept-language":"en-US,en;q=0.5","accept-encoding":"gzip, deflate, br","referer":"https://app.stykka.com/","content-type":"application/json","origin":"https://app.stykka.com","sec-fetch-dest":"empty","sec-fetch-mode":"cors","sec-fetch-site":"same-site","x-request-id":"be9f11cf-02bf-47eb-996b-75e5b5d8cf8b","x-speckle-client-ip":"194.182.0.255"}},"res":{"statusCode":500,"headers":{"x-request-id":"be9f11cf-02bf-47eb-996b-75e5b5d8cf8b","content-type":"application/json; charset=utf-8","content-length":"118","etag":"W/"76-xSxs8YfXn3PEb+TNZywNRe/DrDo""}},"err":{"type":"Error","message":"failed with status code 500","stack":"Error: failed with status code 500\n at ServerResponse.onResFinished (/speckle-server/node_modules/pino-http/logger.js:113:40)\n at ServerResponse.emit (node:events:525:35)\n at ServerResponse.emit (node:domain:489:12)\n at onFinish (node:_http_outgoing:989:10)\n at callback (node:internal/streams/writable:555:21)\n at afterWrite (node:internal/streams/writable:500:5)\n at afterWriteTick (node:internal/streams/writable:487:10)\n at process.processTicksAndRejections (node:internal/process/task_queues:81:21)"},"responseTime":60001,"msg":"request errored"}

Last 500 lines of the speckle-server container logs

Affected Projects

Speckle Frontend UI and Javascript viewer are not loading, .NET requests are failing.

Sticcia added the bug label on Jun 23, 2023
@iainsproat
Contributor

iainsproat commented Jun 23, 2023

@Sticcia - sorry to hear this is happening.

Have you deployed the server via Kubernetes or Docker Compose?
What version of the Docker Compose or Kubernetes Helm Chart have you deployed, and what version is the Speckle Server image that is running?
How frequently is the 500 error occurring, is it at a regular frequency, or only when certain actions are performed?
Have you tried running the server with debug logging level? (either set the LOG_LEVEL environment variable in the Docker Compose file, or set the .server.logLevel value in the Helm Chart to debug)
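
For example, in a Docker Compose deployment this is just one extra environment variable on the server service, and in a Helm deployment it is a single value (a minimal sketch; adapt it to your own files):

  speckle-server:
    environment:
      LOG_LEVEL: 'debug'

or, in the Helm values file:

server:
  logLevel: debug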

@Sticcia
Author

Sticcia commented Jun 23, 2023

We deployed the server using Docker Compose.
Docker Compose version: 1.25.0
Speckle Server version: 2.13.3

It happens once a week on average, our previous deployment was on Digital Ocean and the issue persisted after the migration.
I don't see any relation to actions taken by us, but we have integrated the viewer into our web application, which receives requests from external users (you can see requests through app.stykka.com and dev-velux.web.app in the log).

I have enabled debug logging now, will come back with an updated log file next time this happens.

Thanks for the support.

@saschabul

saschabul commented Jun 28, 2023

Here is the error (GraphQLError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?) that was probably the trigger the last time the server container was hanging. The full log is attached.
Thanks
log.txt

@iainsproat
Contributor

Hi @saschabul

Thanks for sending the logs.

Knex is the module used to connect to the Postgres database. This error may indicate a number of things; it may be that it temporarily lost connection to the Postgres database, or that more connections were requested than there were available. It may also indicate other issues, such as the server not freeing up connections quickly enough.

Do you know if this failure occurred during any particular operation, for example sending a particular model from a Speckle connector?

Do you have a monitoring system in place to display the prometheus-format monitoring data that Speckle server generates? You may be able to see how the number of database connections changes over time, and whether this correlates with higher demand on Speckle server.

@Sticcia
Author

Sticcia commented Jun 29, 2023

Hi @iainsproat

No, we don't see any relation with actions taken by users.
No, we don't have a monitoring system, all we have is an up-time check that will notify us if the Speckle server is not responding. We will consider setting that up, thanks.

@iainsproat
Contributor

@Sticcia @saschabul

I've created a Pull Request with a change which will include a health check for Speckle Server when run via Docker Compose. This health check will repeatedly poll Speckle Server and, if it stops responding for approximately 30 seconds, automatically restart it.

Could you try out the change in your environment and let me know if this works for you? https://github.com/specklesystems/speckle-server/pull/1651/files

Unfortunately this doesn't address the root cause, but will help reduce the amount of downtime and need for manual intervention to restart it.

@Sticcia
Author

Sticcia commented Jun 29, 2023

@iainsproat

I updated the docker compose configuration following your changes.
(I was not able to start the container with a healthcheck without updating the version to 2.4)
This is our current docker-compose.yml:

version: '2.4'
services:
  ####
  # Speckle Server dependencies
  #######
  postgres:
    image: 'postgres:14.5-alpine'
    restart: always
    environment:
      POSTGRES_DB: speckle
      POSTGRES_USER: speckle
      POSTGRES_PASSWORD: speckle
    volumes:
    - ./postgres-data:/var/lib/postgresql/data/
    ports:
    - '127.0.0.1:5432:5432'

  redis:
    image: 'redis:7.0-alpine'
    restart: always
    volumes:
    - ./redis-data:/data
    ports:
    - '127.0.0.1:6379:6379'

  minio:
    image: 'minio/minio'
    command: server /data --console-address ":9001"
    restart: always
    volumes:
    - ./minio-data:/data
    ports:
    - '127.0.0.1:9000:9000'
    - '127.0.0.1:9001:9001'

  ####
  # Speckle Server
  #######
  speckle-frontend:
    image: speckle/speckle-frontend:2.13.3
    restart: always
    ports:
    - '127.0.0.1:8080:8080'
    environment:
      FILE_SIZE_LIMIT_MB: 100

  speckle-server:
    image: speckle/speckle-server:2.13.3
    restart: always
    healthcheck:
      test:
        - CMD
        - node
        - -e
        - "require('node:http').request({headers: {'Content-Type': 'application/json'}, port:3000, hostname:'127.0.0.1', path:'/graphql?query={serverInfo{version}}', method: 'GET' }, (res) => { body = ''; res.on('data', (chunk) => {body += chunk;}); res.on('end', () => {process.exit(res.statusCode != 200 || body.toLowerCase().includes('error'));}); }).end();"
      interval: 10s
      timeout: 10s
      retries: 3
      start_period: 5s
    ports:
    - '127.0.0.1:3000:3000'
    command: ['bash', '-c', '/wait && node bin/www']
    environment:
      CANONICAL_URL: "https://geo.stykka.com"
      SESSION_SECRET: "***"

      STRATEGY_LOCAL: 'true'
      LOG_LEVEL: 'debug'

      POSTGRES_URL: 'postgres'
      POSTGRES_USER: 'speckle'
      POSTGRES_PASSWORD: 'speckle'
      POSTGRES_DB: 'speckle'

      REDIS_URL: 'redis://redis'
      WAIT_HOSTS: 'postgres:5432, redis:6379, minio:9000'

      EMAIL: "true"
      EMAIL_HOST: "smtp.sendgrid.net"
      EMAIL_PORT: "465"
      EMAIL_USERNAME: "apikey"
      EMAIL_PASSWORD: "***"
      EMAIL_FROM: "***"

      EMAIL_SECURE: 'true'

      S3_ENDPOINT: 'http://minio:9000'
      S3_ACCESS_KEY: 'minioadmin'
      S3_SECRET_KEY: 'minioadmin'
      S3_BUCKET: 'speckle-server'
      S3_CREATE_BUCKET: 'true'
      S3_REGION: '' # optional, defaults to 'us-east-1'

      FILE_SIZE_LIMIT_MB: 500

      SENTRY_DSN: "***"

  speckle-preview-service:
    image: speckle/speckle-preview-service:2.13.3
    restart: always
    mem_limit: '1000m'
    memswap_limit: '1000m'
    command: ['bash', '-c', '/wait && node bin/www']

    environment:
      LOG_LEVEL: 'info'
      PG_CONNECTION_STRING: 'postgres://speckle:speckle@postgres/speckle'
      WAIT_HOSTS: 'postgres:5432'

  speckle-webhook-service:
    image: speckle/speckle-webhook-service:2.13.3
    restart: always
    command: ['bash', '-c', '/wait && node main.js']
    environment:
      LOG_LEVEL: 'info'
      PG_CONNECTION_STRING: 'postgres://speckle:speckle@postgres/speckle'
      WAIT_HOSTS: 'postgres:5432'

  fileimport-service:
    image: speckle/speckle-fileimport-service:2.13.3
    restart: always
    command: ['bash', '-c', '/wait && node src/daemon.js']
    environment:
      LOG_LEVEL: 'info'
      PG_CONNECTION_STRING: 'postgres://speckle:speckle@postgres/speckle'
      WAIT_HOSTS: 'postgres:5432'
      SPECKLE_SERVER_URL: 'http://speckle-server:3000'

@iainsproat
Contributor

@Sticcia - thanks for looking at it. After you updated the docker compose file version, did the server start and work as normal?

@Sticcia
Author

Sticcia commented Jun 29, 2023

@iainsproat - yes, the server is running as normal with the configuration above (I just now did a small edit). We will let you know if the server crashes again. Thanks for the support.

@Sticcia
Author

Sticcia commented Jul 3, 2023

@iainsproat - we had a crash again Friday with the same error pasted by @saschabul above. We set up an up-time check with the same settings you shared in the yaml configuration. The check notified us of the failure but the container did not automatically restart.

Also, I tried to look into the logs with docker-compose logs -f --tail=50 speckle-server but there is nothing logged at the time of the failure (we restarted the service just after the crash, thus the Attaching to speckle_speckle-server_1 line)

[36mspeckle-server_1           |�[0m {"level":"info","time":"2023-06-23T08:41:13.272Z","req":{"id":"504f4d9e-a8c6-429b-84b5-3c2e74665a71","method":"POST","path":"/graphql","headers":{"host":"geo.stykka.com","connection":"upgrade","content-length":"213","user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36","accept":"*/*","referer":"https://geo.stykka.com/streams/f5ad2b244d?token=9d6313e5756d9dbdd253eff4747f90469f4d56ca205bb8483a","accept-encoding":"gzip, deflate, br","accept-language":"da-DK,da;q=0.9,en-US;q=0.8,en;q=0.7","apollographql-client-name":"web","apollographql-client-version":"unknown","content-type":"application/json","origin":"https://geo.stykka.com","sec-fetch-site":"same-origin","sec-fetch-mode":"cors","sec-fetch-dest":"empty","x-request-id":"504f4d9e-a8c6-429b-84b5-3c2e74665a71","x-speckle-client-ip":"194.182.232.255"}},"authContext":{"auth":false},"req":{"id":"504f4d9e-a8c6-429b-84b5-3c2e74665a71","method":"POST","headers":{"host":"geo.stykka.com","connection":"upgrade","content-length":"213","user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36","accept":"*/*","referer":"https://geo.stykka.com/streams/f5ad2b244d?token=9d6313e5756d9dbdd253eff4747f90469f4d56ca205bb8483a","accept-encoding":"gzip, deflate, br","accept-language":"da-DK,da;q=0.9,en-US;q=0.8,en;q=0.7","apollographql-client-name":"web","apollographql-client-version":"unknown","content-type":"application/json","origin":"https://geo.stykka.com","sec-fetch-site":"same-origin","sec-fetch-mode":"cors","sec-fetch-dest":"empty","x-request-id":"504f4d9e-a8c6-429b-84b5-3c2e74665a71","x-speckle-client-ip":"194.182.232.255"}},"authContext":{"auth":false},"component":"graphql","graphql_operation_kind":"query","graphql_query":"query EmailVerificationBannerState {\n  activeUser {\n    id\n    email\n    verified\n    hasPendingVerification\n    __typename\n  }\n}\n","graphql_variables":{},"graphql_operation_value":"GQL query activeUser","grqphql_operation_name":"GQL activeUser","actionName":"query activeUser","msg":"graphql response"}
[36mspeckle-server_1           |�[0m {"level":"info","time":"2023-06-23T08:41:13.274Z","req":{"id":"504f4d9e-a8c6-429b-84b5-3c2e74665a71","method":"POST","path":"/graphql","headers":{"host":"geo.stykka.com","connection":"upgrade","content-length":"213","user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36","accept":"*/*","referer":"https://geo.stykka.com/streams/f5ad2b244d?token=9d6313e5756d9dbdd253eff4747f90469f4d56ca205bb8483a","accept-encoding":"gzip, deflate, br","accept-language":"da-DK,da;q=0.9,en-US;q=0.8,en;q=0.7","apollographql-client-name":"web","apollographql-client-version":"unknown","content-type":"application/json","origin":"https://geo.stykka.com","sec-fetch-site":"same-origin","sec-fetch-mode":"cors","sec-fetch-dest":"empty","x-request-id":"504f4d9e-a8c6-429b-84b5-3c2e74665a71","x-speckle-client-ip":"194.182.232.255"}},"res":{"statusCode":200,"headers":{"x-request-id":"504f4d9e-a8c6-429b-84b5-3c2e74665a71","x-ratelimit-remaining":44,"access-control-allow-origin":"*","content-type":"application/json; charset=utf-8","content-length":"29","etag":"W/\"1d-soSuydrEK1Bd58NMwBp5Y4Wv7p8\""}},"responseTime":8,"msg":"request completed"}
Attaching to speckle_speckle-server_1
[36mspeckle-server_1           |�[0m {"level":"info","time":"2023-07-03T08:22:24.332Z","req":{"id":"2c4b0b27-2239-4a38-a72e-b4ea083448bb","method":"GET","path":"/graphql","headers":{"content-type":"application/json","host":"127.0.0.1:3000","connection":"close","x-request-id":"2c4b0b27-2239-4a38-a72e-b4ea083448bb"}},"authContext":{"auth":false},"req":{"id":"2c4b0b27-2239-4a38-a72e-b4ea083448bb","method":"GET","headers":{"content-type":"application/json","host":"127.0.0.1:3000","connection":"close","x-request-id":"2c4b0b27-2239-4a38-a72e-b4ea083448bb"}},"authContext":{"auth":false},"component":"graphql","graphql_operation_kind":"query","graphql_query":"{serverInfo{version}}","graphql_operation_value":"GQL query serverInfo","grqphql_operation_name":"GQL serverInfo","actionName":"query serverInfo","msg":"graphql response"}
[36mspeckle-server_1           |�[0m {"level":"info","time":"2023-07-03T08:22:24.338Z","req":{"id":"2c4b0b27-2239-4a38-a72e-b4ea083448bb","method":"GET","path":"/graphql","headers":{"content-type":"application/json","host":"127.0.0.1:3000","connection":"close","x-request-id":"2c4b0b27-2239-4a38-a72e-b4ea083448bb"}},"res":{"statusCode":200,"headers":{"x-request-id":"2c4b0b27-2239-4a38-a72e-b4ea083448bb","x-ratelimit-remaining":499,"access-control-allow-origin":"*","content-type":"application/json; charset=utf-8","content-length":"45","etag":"W/\"2d-lgbNWZvtrzDaO7tQCG8eXovuMp8\""}},"responseTime":10,"msg":"request completed"}

@Sticcia
Author

Sticcia commented Jul 7, 2023

Hi @iainsproat the issue is happening more often and we cannot trace it to any actions that we take ourselves. The Docker Compose configuration you shared with us does not restart the service automatically. We can set it up ourselves to do this but it won't solve the issue. Do you have any news or information regarding the issue?

@iainsproat
Contributor

iainsproat commented Jul 7, 2023

Hi @Sticcia - can you share the version of docker (docker --version) and docker compose (docker compose version) that your server is running on?

Can you also run docker inspect --format='{{json .State.Health}}' <ID of the running server container> for the server container? You may have to run docker ps -a to list the IDs of the running containers and select the one for the server container.

Please also run docker events --since 10m --until 0m (replacing 10m and 0m with the appropriate duration in the past when the incident occurred; the format is that of Go duration strings).
Are health_status events reported? These events should have been created when the healthcheck identified changes in the status of the container.

Are there any other events reported by Docker around that time that might help debug this issue?
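
If the output is noisy, it can be narrowed down to just the server container, for example:

docker events --since 10m --until 0m --filter 'container=<ID of the running server container>'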

@Sticcia
Author

Sticcia commented Jul 7, 2023

@iainsproat

Docker versions:

Docker version 20.10.5+dfsg1, build 55c4c8
docker-compose version 1.25.0, build unknown
docker-py version: 4.1.0
CPython version: 3.9.2
OpenSSL version: OpenSSL 1.1.1n  15 Mar 2022

Docker inspect:

{"Status":"unhealthy","FailingStreak":12,"Log":[{"Start":"2023-07-07T12:18:58.144622803Z","End":"2023-07-07T12:19:08.144699892Z","ExitCode":-1,"Output":"Health check exceeded timeout (10s)"},{"Start":"2023-07-07T12:19:18.15169539Z","End":"2023-07-07T12:19:28.151803699Z","ExitCode":-1,"Output":"Health check exceeded timeout (10s)"},{"Start":"2023-07-07T12:19:38.158376948Z","End":"2023-07-07T12:19:48.158506217Z","ExitCode":-1,"Output":"Health check exceeded timeout (10s)"},{"Start":"2023-07-07T12:19:58.165514836Z","End":"2023-07-07T12:20:08.165713169Z","ExitCode":-1,"Output":"Health check exceeded timeout (10s)"},{"Start":"2023-07-07T12:20:18.171790741Z","End":"2023-07-07T12:20:28.171900318Z","ExitCode":-1,"Output":"Health check exceeded timeout (10s)"}]}

Docker events:
docker_events.txt

Thanks for the help

@iainsproat
Contributor

I can see in the events that it is returning an exitCode=127 due to the timeout, and in Docker inspect output it says the exit code is -1.

Docker's documentation states that:

The command’s exit status indicates the health status of the container. The possible values are:

0: success - the container is healthy and ready for use
1: unhealthy - the container is not working correctly
2: reserved - do not use this exit code

So neither -1 nor 127 is valid, and perhaps we should handle the timeout condition better so that it returns a 1.

Despite this, it is reporting the container as unhealthy in your Docker inspect output. As we have correctly set the restart policy to always, I would expect it to have restarted.

The latest Docker version is 24, and the latest docker compose version is 2.19. I do not know what the impact of updating would be, or whether it would help the stability of speckle server in this particular situation.

@iainsproat
Contributor

Hi @Sticcia - I've made a change to the healthcheck to ensure it times out correctly and responds with the exit code expected by docker.

The healthcheck command should now be:

"try { require('node:http').request({headers: {'Content-Type': 'application/json'}, port:3000, hostname:'127.0.0.1', path:'/graphql?query={serverInfo{version}}', method: 'GET', timeout: 2000 }, (res) => { body = ''; res.on('data', (chunk) => {body += chunk;}); res.on('end', () => {process.exit(res.statusCode != 200 || body.toLowerCase().includes('error'));}); }).end(); } catch { process.exit(1); }"

Let me know how you get on with this change, and please provide the docker events output, docker inspect output, and logs if you encounter any further issues.
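
For clarity, in the compose file shared above the new string simply replaces the last element of the healthcheck test; the rest of the healthcheck block stays as it is:

    healthcheck:
      test:
        - CMD
        - node
        - -e
        - "try { require('node:http').request({headers: {'Content-Type': 'application/json'}, port:3000, hostname:'127.0.0.1', path:'/graphql?query={serverInfo{version}}', method: 'GET', timeout: 2000 }, (res) => { body = ''; res.on('data', (chunk) => {body += chunk;}); res.on('end', () => {process.exit(res.statusCode != 200 || body.toLowerCase().includes('error'));}); }).end(); } catch { process.exit(1); }"
      interval: 10s
      timeout: 10s
      retries: 3
      start_period: 5s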

@Sticcia
Author

Sticcia commented Jul 12, 2023

@iainsproat - the server crashed again and didn't restart:
speckle_speckle-server_1 docker-entrypoint.sh bash ... Up (unhealthy) 127.0.0.1:3000->3000/tcp

Here are the logs you mentioned in the last comment.

Docker inspect output:

{"Status":"unhealthy","FailingStreak":14,"Log":[{"Start":"2023-07-12T10:39:18.166560137Z","End":"2023-07-12T10:39:28.166681021Z","ExitCode":-1,"Output":"Health check exceeded timeout (10s)"},{"Start":"2023-07-12T10:39:38.174308127Z","End":"2023-07-12T10:39:48.174427043Z","ExitCode":-1,"Output":"Health check exceeded timeout (10s)"},{"Start":"2023-07-12T10:39:58.182222813Z","End":"2023-07-12T10:40:08.182349343Z","ExitCode":-1,"Output":"Health check exceeded timeout (10s)"},{"Start":"2023-07-12T10:40:18.18899444Z","End":"2023-07-12T10:40:28.189113566Z","ExitCode":-1,"Output":"Health check exceeded timeout (10s)"},{"Start":"2023-07-12T10:40:38.195697535Z","End":"2023-07-12T10:40:48.195779531Z","ExitCode":-1,"Output":"Health check exceeded timeout (10s)"}]}

Docker events output:

speckle-server_1           | {"level":"info","time":"2023-07-12T10:42:26.992Z","req":{"id":"6922d38f-eb83-41d4-91f0-8c16ba949016","method":"GET","path":"/graphql","headers":{"host":"geo.stykka.com","connection":"upgrade","accept":"*/*","accept-encoding":"deflate, gzip","content-type":"application/json","user-agent":"GoogleStackdriverMonitoring-UptimeChecks(https://cloud.google.com/monitoring)","x-request-id":"6922d38f-eb83-41d4-91f0-8c16ba949016","x-speckle-client-ip":"35.233.206.255"}},"authContext":{"auth":false},"req":{"id":"6922d38f-eb83-41d4-91f0-8c16ba949016","method":"GET","headers":{"host":"geo.stykka.com","connection":"upgrade","accept":"*/*","accept-encoding":"deflate, gzip","content-type":"application/json","user-agent":"GoogleStackdriverMonitoring-UptimeChecks(https://cloud.google.com/monitoring)","x-request-id":"6922d38f-eb83-41d4-91f0-8c16ba949016","x-speckle-client-ip":"35.233.206.255"}},"authContext":{"auth":false},"component":"graphql","graphql_operation_kind":"query","graphql_query":"{serverInfo{version}}","graphql_operation_value":"GQL query serverInfo","grqphql_operation_name":"GQL serverInfo","actionName":"query serverInfo","msg":"graphql response"}
speckle-server_1           | {"level":"error","time":"2023-07-12T10:42:38.433Z","req":{"id":"9d30bf95-e157-4ce1-a370-8045e535f294","method":"GET","path":"/graphql","headers":{"content-type":"application/json","host":"127.0.0.1:3000","connection":"close","x-request-id":"9d30bf95-e157-4ce1-a370-8045e535f294"}},"authContext":{"auth":false},"req":{"id":"9d30bf95-e157-4ce1-a370-8045e535f294","method":"GET","headers":{"content-type":"application/json","host":"127.0.0.1:3000","connection":"close","x-request-id":"9d30bf95-e157-4ce1-a370-8045e535f294"}},"authContext":{"auth":false},"component":"graphql","graphql_operation_kind":"query","graphql_query":"{serverInfo{version}}","graphql_operation_value":"GQL query serverInfo","grqphql_operation_name":"GQL serverInfo","actionName":"query serverInfo","query-path":"serverInfo","err":{"type":"GraphQLError","message":"Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?","stack":"KnexTimeoutError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?\n    at Client_PG.acquireConnection (/speckle-server/node_modules/knex/lib/client.js:312:26)\n    at async Runner.ensureConnection (/speckle-server/node_modules/knex/lib/execution/runner.js:287:28)\n    at async Runner.run (/speckle-server/node_modules/knex/lib/execution/runner.js:30:19)\n    at async getServerInfo (/speckle-server/packages/server/dist/modules/core/services/generic.js:11:28)\n    at async Object.serverInfo (/speckle-server/packages/server/dist/modules/core/graph/resolvers/server.js:7:20)","locations":[{"line":1,"column":2}],"path":["serverInfo"]},"msg":"graphql error"}
speckle-server_1           | {"level":"info","time":"2023-07-12T10:42:38.435Z","req":{"id":"9d30bf95-e157-4ce1-a370-8045e535f294","method":"GET","path":"/graphql","headers":{"content-type":"application/json","host":"127.0.0.1:3000","connection":"close","x-request-id":"9d30bf95-e157-4ce1-a370-8045e535f294"}},"authContext":{"auth":false},"req":{"id":"9d30bf95-e157-4ce1-a370-8045e535f294","method":"GET","headers":{"content-type":"application/json","host":"127.0.0.1:3000","connection":"close","x-request-id":"9d30bf95-e157-4ce1-a370-8045e535f294"}},"authContext":{"auth":false},"component":"graphql","graphql_operation_kind":"query","graphql_query":"{serverInfo{version}}","graphql_operation_value":"GQL query serverInfo","grqphql_operation_name":"GQL serverInfo","actionName":"query serverInfo","msg":"graphql response"}
speckle-server_1           | {"level":"info","time":"2023-07-12T10:42:38.444Z","req":{"id":"9d30bf95-e157-4ce1-a370-8045e535f294","method":"GET","path":"/graphql","headers":{"content-type":"application/json","host":"127.0.0.1:3000","connection":"close","x-request-id":"9d30bf95-e157-4ce1-a370-8045e535f294"}},"res":{"statusCode":200,"headers":{"x-request-id":"9d30bf95-e157-4ce1-a370-8045e535f294","x-ratelimit-remaining":499,"access-control-allow-origin":"*","content-type":"application/json; charset=utf-8","content-length":"290","etag":"W/\"122-/esAIs0eu+FuVa0OMIU0zOWANWc\""}},"responseTime":60019,"msg":"request completed"}
speckle-server_1           | {"level":"error","time":"2023-07-12T10:42:44.836Z","req":{"id":"48e35154-6ad7-49a0-bf35-04655652960f","method":"GET","path":"/graphql","headers":{"host":"geo.stykka.com","connection":"upgrade","accept":"*/*","accept-encoding":"deflate, gzip","content-type":"application/json","user-agent":"GoogleStackdriverMonitoring-UptimeChecks(https://cloud.google.com/monitoring)","x-request-id":"48e35154-6ad7-49a0-bf35-04655652960f","x-speckle-client-ip":"35.205.72.255"}},"authContext":{"auth":false},"req":{"id":"48e35154-6ad7-49a0-bf35-04655652960f","method":"GET","headers":{"host":"geo.stykka.com","connection":"upgrade","accept":"*/*","accept-encoding":"deflate, gzip","content-type":"application/json","user-agent":"GoogleStackdriverMonitoring-UptimeChecks(https://cloud.google.com/monitoring)","x-request-id":"48e35154-6ad7-49a0-bf35-04655652960f","x-speckle-client-ip":"35.205.72.255"}},"authContext":{"auth":false},"component":"graphql","graphql_operation_kind":"query","graphql_query":"{serverInfo{version}}","graphql_operation_value":"GQL query serverInfo","grqphql_operation_name":"GQL serverInfo","actionName":"query serverInfo","query-path":"serverInfo","err":{"type":"GraphQLError","message":"Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?","stack":"KnexTimeoutError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?\n    at Client_PG.acquireConnection (/speckle-server/node_modules/knex/lib/client.js:312:26)\n    at async Runner.ensureConnection (/speckle-server/node_modules/knex/lib/execution/runner.js:287:28)\n    at async Runner.run (/speckle-server/node_modules/knex/lib/execution/runner.js:30:19)\n    at async getServerInfo (/speckle-server/packages/server/dist/modules/core/services/generic.js:11:28)\n    at async Object.serverInfo (/speckle-server/packages/server/dist/modules/core/graph/resolvers/server.js:7:20)","locations":[{"line":1,"column":2}],"path":["serverInfo"]},"msg":"graphql error"}
speckle-server_1           | {"level":"info","time":"2023-07-12T10:42:44.839Z","req":{"id":"48e35154-6ad7-49a0-bf35-04655652960f","method":"GET","path":"/graphql","headers":{"host":"geo.stykka.com","connection":"upgrade","accept":"*/*","accept-encoding":"deflate, gzip","content-type":"application/json","user-agent":"GoogleStackdriverMonitoring-UptimeChecks(https://cloud.google.com/monitoring)","x-request-id":"48e35154-6ad7-49a0-bf35-04655652960f","x-speckle-client-ip":"35.205.72.255"}},"authContext":{"auth":false},"req":{"id":"48e35154-6ad7-49a0-bf35-04655652960f","method":"GET","headers":{"host":"geo.stykka.com","connection":"upgrade","accept":"*/*","accept-encoding":"deflate, gzip","content-type":"application/json","user-agent":"GoogleStackdriverMonitoring-UptimeChecks(https://cloud.google.com/monitoring)","x-request-id":"48e35154-6ad7-49a0-bf35-04655652960f","x-speckle-client-ip":"35.205.72.255"}},"authContext":{"auth":false},"component":"graphql","graphql_operation_kind":"query","graphql_query":"{serverInfo{version}}","graphql_operation_value":"GQL query serverInfo","grqphql_operation_name":"GQL serverInfo","actionName":"query serverInfo","msg":"graphql response"}
speckle-server_1           | {"level":"error","time":"2023-07-12T10:42:49.831Z","req":{"id":"ad35d1b0-bbcd-4e1a-a5c2-38c780194504","method":"GET","path":"/graphql","headers":{"host":"geo.stykka.com","connection":"upgrade","accept":"*/*","accept-encoding":"deflate, gzip","content-type":"application/json","user-agent":"GoogleStackdriverMonitoring-UptimeChecks(https://cloud.google.com/monitoring)","x-request-id":"ad35d1b0-bbcd-4e1a-a5c2-38c780194504","x-speckle-client-ip":"35.238.118.255"}},"authContext":{"auth":false},"req":{"id":"ad35d1b0-bbcd-4e1a-a5c2-38c780194504","method":"GET","headers":{"host":"geo.stykka.com","connection":"upgrade","accept":"*/*","accept-encoding":"deflate, gzip","content-type":"application/json","user-agent":"GoogleStackdriverMonitoring-UptimeChecks(https://cloud.google.com/monitoring)","x-request-id":"ad35d1b0-bbcd-4e1a-a5c2-38c780194504","x-speckle-client-ip":"35.238.118.255"}},"authContext":{"auth":false},"component":"graphql","graphql_operation_kind":"query","graphql_query":"{serverInfo{version}}","graphql_operation_value":"GQL query serverInfo","grqphql_operation_name":"GQL serverInfo","actionName":"query serverInfo","query-path":"serverInfo","err":{"type":"GraphQLError","message":"Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?","stack":"KnexTimeoutError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?\n    at Client_PG.acquireConnection (/speckle-server/node_modules/knex/lib/client.js:312:26)\n    at async Runner.ensureConnection (/speckle-server/node_modules/knex/lib/execution/runner.js:287:28)\n    at async Runner.run (/speckle-server/node_modules/knex/lib/execution/runner.js:30:19)\n    at async getServerInfo (/speckle-server/packages/server/dist/modules/core/services/generic.js:11:28)\n    at async Object.serverInfo (/speckle-server/packages/server/dist/modules/core/graph/resolvers/server.js:7:20)","locations":[{"line":1,"column":2}],"path":["serverInfo"]},"msg":"graphql error"}
speckle-server_1           | {"level":"info","time":"2023-07-12T10:42:49.833Z","req":{"id":"ad35d1b0-bbcd-4e1a-a5c2-38c780194504","method":"GET","path":"/graphql","headers":{"host":"geo.stykka.com","connection":"upgrade","accept":"*/*","accept-encoding":"deflate, gzip","content-type":"application/json","user-agent":"GoogleStackdriverMonitoring-UptimeChecks(https://cloud.google.com/monitoring)","x-request-id":"ad35d1b0-bbcd-4e1a-a5c2-38c780194504","x-speckle-client-ip":"35.238.118.255"}},"authContext":{"auth":false},"req":{"id":"ad35d1b0-bbcd-4e1a-a5c2-38c780194504","method":"GET","headers":{"host":"geo.stykka.com","connection":"upgrade","accept":"*/*","accept-encoding":"deflate, gzip","content-type":"application/json","user-agent":"GoogleStackdriverMonitoring-UptimeChecks(https://cloud.google.com/monitoring)","x-request-id":"ad35d1b0-bbcd-4e1a-a5c2-38c780194504","x-speckle-client-ip":"35.238.118.255"}},"authContext":{"auth":false},"component":"graphql","graphql_operation_kind":"query","graphql_query":"{serverInfo{version}}","graphql_operation_value":"GQL query serverInfo","grqphql_operation_name":"GQL serverInfo","actionName":"query serverInfo","msg":"graphql response"}
speckle-server_1           | {"level":"error","time":"2023-07-12T10:42:58.351Z","req":{"id":"ca40c1f5-9fb7-4451-a10e-42500ceae487","method":"GET","path":"/graphql","headers":{"content-type":"application/json","host":"127.0.0.1:3000","connection":"close","x-request-id":"ca40c1f5-9fb7-4451-a10e-42500ceae487"}},"authContext":{"auth":false},"req":{"id":"ca40c1f5-9fb7-4451-a10e-42500ceae487","method":"GET","headers":{"content-type":"application/json","host":"127.0.0.1:3000","connection":"close","x-request-id":"ca40c1f5-9fb7-4451-a10e-42500ceae487"}},"authContext":{"auth":false},"component":"graphql","graphql_operation_kind":"query","graphql_query":"{serverInfo{version}}","graphql_operation_value":"GQL query serverInfo","grqphql_operation_name":"GQL serverInfo","actionName":"query serverInfo","query-path":"serverInfo","err":{"type":"GraphQLError","message":"Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?","stack":"KnexTimeoutError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?\n    at Client_PG.acquireConnection (/speckle-server/node_modules/knex/lib/client.js:312:26)\n    at async Runner.ensureConnection (/speckle-server/node_modules/knex/lib/execution/runner.js:287:28)\n    at async Runner.run (/speckle-server/node_modules/knex/lib/execution/runner.js:30:19)\n    at async getServerInfo (/speckle-server/packages/server/dist/modules/core/services/generic.js:11:28)\n    at async Object.serverInfo (/speckle-server/packages/server/dist/modules/core/graph/resolvers/server.js:7:20)","locations":[{"line":1,"column":2}],"path":["serverInfo"]},"msg":"graphql error"}
speckle-server_1           | {"level":"info","time":"2023-07-12T10:42:58.353Z","req":{"id":"ca40c1f5-9fb7-4451-a10e-42500ceae487","method":"GET","path":"/graphql","headers":{"content-type":"application/json","host":"127.0.0.1:3000","connection":"close","x-request-id":"ca40c1f5-9fb7-4451-a10e-42500ceae487"}},"authContext":{"auth":false},"req":{"id":"ca40c1f5-9fb7-4451-a10e-42500ceae487","method":"GET","headers":{"content-type":"application/json","host":"127.0.0.1:3000","connection":"close","x-request-id":"ca40c1f5-9fb7-4451-a10e-42500ceae487"}},"authContext":{"auth":false},"component":"graphql","graphql_operation_kind":"query","graphql_query":"{serverInfo{version}}","graphql_operation_value":"GQL query serverInfo","grqphql_operation_name":"GQL serverInfo","actionName":"query serverInfo","msg":"graphql response"}
speckle-server_1           | {"level":"info","time":"2023-07-12T10:42:58.368Z","req":{"id":"ca40c1f5-9fb7-4451-a10e-42500ceae487","method":"GET","path":"/graphql","headers":{"content-type":"application/json","host":"127.0.0.1:3000","connection":"close","x-request-id":"ca40c1f5-9fb7-4451-a10e-42500ceae487"}},"res":{"statusCode":200,"headers":{"x-request-id":"ca40c1f5-9fb7-4451-a10e-42500ceae487","x-ratelimit-remaining":499,"access-control-allow-origin":"*","content-type":"application/json; charset=utf-8","content-length":"290","etag":"W/\"122-/esAIs0eu+FuVa0OMIU0zOWANWc\""}},"responseTime":60020,"msg":"request completed"}
speckle-server_1           | {"level":"info","time":"2023-07-12T10:42:59.679Z","req":{"id":"5ffeddee-6c6e-4dae-af1a-5f92c7ff10f8","method":"GET","path":"/graphql","headers":{"host":"geo.stykka.com","connection":"upgrade","accept":"*/*","accept-encoding":"deflate, gzip","content-type":"application/json","user-agent":"GoogleStackdriverMonitoring-UptimeChecks(https://cloud.google.com/monitoring)","x-request-id":"5ffeddee-6c6e-4dae-af1a-5f92c7ff10f8","x-speckle-client-ip":"35.238.118.255"}},"res":{"statusCode":null,"headers":{"x-request-id":"5ffeddee-6c6e-4dae-af1a-5f92c7ff10f8","x-ratelimit-remaining":499,"access-control-allow-origin":"*"}},"responseTime":9865,"msg":"request aborted"}
speckle-server_1           | {"level":"error","time":"2023-07-12T10:43:02.830Z","req":{"id":"98f4c73b-3016-472f-b666-b1b9b1d071bd","method":"GET","path":"/graphql","headers":{"host":"geo.stykka.com","connection":"upgrade","accept":"*/*","accept-encoding":"deflate, gzip","content-type":"application/json","user-agent":"GoogleStackdriverMonitoring-UptimeChecks(https://cloud.google.com/monitoring)","x-request-id":"98f4c73b-3016-472f-b666-b1b9b1d071bd","x-speckle-client-ip":"35.186.159.255"}},"authContext":{"auth":false},"req":{"id":"98f4c73b-3016-472f-b666-b1b9b1d071bd","method":"GET","headers":{"host":"geo.stykka.com","connection":"upgrade","accept":"*/*","accept-encoding":"deflate, gzip","content-type":"application/json","user-agent":"GoogleStackdriverMonitoring-UptimeChecks(https://cloud.google.com/monitoring)","x-request-id":"98f4c73b-3016-472f-b666-b1b9b1d071bd","x-speckle-client-ip":"35.186.159.255"}},"authContext":{"auth":false},"component":"graphql","graphql_operation_kind":"query","graphql_query":"{serverInfo{version}}","graphql_operation_value":"GQL query serverInfo","grqphql_operation_name":"GQL serverInfo","actionName":"query serverInfo","query-path":"serverInfo","err":{"type":"GraphQLError","message":"Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?","stack":"KnexTimeoutError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?\n    at Client_PG.acquireConnection (/speckle-server/node_modules/knex/lib/client.js:312:26)\n    at async Runner.ensureConnection (/speckle-server/node_modules/knex/lib/execution/runner.js:287:28)\n    at async Runner.run (/speckle-server/node_modules/knex/lib/execution/runner.js:30:19)\n    at async getServerInfo (/speckle-server/packages/server/dist/modules/core/services/generic.js:11:28)\n    at async Object.serverInfo (/speckle-server/packages/server/dist/modules/core/graph/resolvers/server.js:7:20)","locations":[{"line":1,"column":2}],"path":["serverInfo"]},"msg":"graphql error"}
speckle-server_1           | {"level":"info","time":"2023-07-12T10:43:02.833Z","req":{"id":"98f4c73b-3016-472f-b666-b1b9b1d071bd","method":"GET","path":"/graphql","headers":{"host":"geo.stykka.com","connection":"upgrade","accept":"*/*","accept-encoding":"deflate, gzip","content-type":"application/json","user-agent":"GoogleStackdriverMonitoring-UptimeChecks(https://cloud.google.com/monitoring)","x-request-id":"98f4c73b-3016-472f-b666-b1b9b1d071bd","x-speckle-client-ip":"35.186.159.255"}},"authContext":{"auth":false},"req":{"id":"98f4c73b-3016-472f-b666-b1b9b1d071bd","method":"GET","headers":{"host":"geo.stykka.com","connection":"upgrade","accept":"*/*","accept-encoding":"deflate, gzip","content-type":"application/json","user-agent":"GoogleStackdriverMonitoring-UptimeChecks(https://cloud.google.com/monitoring)","x-request-id":"98f4c73b-3016-472f-b666-b1b9b1d071bd","x-speckle-client-ip":"35.186.159.255"}},"authContext":{"auth":false},"component":"graphql","graphql_operation_kind":"query","graphql_query":"{serverInfo{version}}","graphql_operation_value":"GQL query serverInfo","grqphql_operation_name":"GQL serverInfo","actionName":"query serverInfo","msg":"graphql response"}

Logs

Sentry error:

{"event_id":"13bbb381b042402e9d6b4dd03245d292","project":4505408537427968,"release":null,"dist":null,"platform":"node","message":"","datetime":"2023-07-12T10:47:58+00:00","tags":[["environment","production"],["handled","yes"],["kind","query"],["level","error"],["mechanism","generic"],["runtime","node v18.15.0"],["runtime.name","node"],["user","id:93d69b57eb759c2ef030eba5ee7186b4acdf0629e2762c3fdc36b84222f71807"],["server_name","0f7e8173e604"],["transaction","GET /graphql"],["url","http://127.0.0.1/graphql"]],"_metrics":{"bytes.ingested.event":5619,"bytes.stored.event":14234},"breadcrumbs":{"values":[{"timestamp":1688975303.689,"type":"http","category":"http","level":"info","data":{"method":"HEAD","status_code":200,"url":"http://minio:9000/speckle-server/"}},{"timestamp":1689158878.485,"type":"default","category":"query-path","level":"debug","message":"serverInfo"}]},"contexts":{"runtime":{"name":"node","version":"v18.15.0","type":"runtime"},"trace":{"trace_id":"7fe1728c5e1e4bc09d148704c179edf0","span_id":"bbe45e0d09007e38","op":"GQL query serverInfo","status":"unknown","type":"trace"}},"culprit":"GET /graphql","environment":"production","errors":[{"type":"js_no_source","url":"/speckle-server/node_modules/knex/lib/client.js"},{"type":"js_no_source","url":"/speckle-server/node_modules/knex/lib/execution/runner.js"},{"type":"js_no_source","url":"/speckle-server/packages/server/dist/modules/core/graph/resolvers/server.js"},{"type":"js_no_source","url":"/speckle-server/packages/server/dist/modules/core/services/generic.js"}],"exception":{"values":[{"type":"GraphQLError","value":"Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?","stacktrace":{"frames":[{"function":"Object.serverInfo","module":"server","filename":"/speckle-server/packages/server/dist/modules/core/graph/resolvers/server.js","abs_path":"/speckle-server/packages/server/dist/modules/core/graph/resolvers/server.js","lineno":7,"colno":20,"pre_context":["'use strict';","const { validateServerRole, validateScopes } = require('@/modules/shared');","const { updateServerInfo, getServerInfo, getPublicScopes, getPublicRoles } = require('../../services/generic');","module.exports = {","    Query: {","        async serverInfo() {"],"context_line":"            return await getServerInfo();","post_context":["        }","    },","    ServerInfo: {","        async roles() {","            return await getPublicRoles();","        },","        async scopes() {"],"in_app":true},{"function":"getServerInfo","module":"generic","filename":"/speckle-server/packages/server/dist/modules/core/services/generic.js","abs_path":"/speckle-server/packages/server/dist/modules/core/services/generic.js","lineno":11,"colno":28,"pre_context":["const Scopes = () => knex('scopes');","const Info = () => knex('server_config');","module.exports = {","    /**","     * @returns {Promise<import('@/modules/core/helpers/types').ServerInfo>}","     */","    async getServerInfo() {"],"context_line":"        const serverInfo = await Info().select('*').first();","post_context":["        serverInfo.version = process.env.SPECKLE_SERVER_VERSION || 'dev';","        serverInfo.canonicalUrl = process.env.CANONICAL_URL || 'localhost';","        return serverInfo;","    },","    async getAllScopes() {","        return await Scopes().select('*');","    
},"],"in_app":true},{"function":"Runner.run","module":"knex.lib.execution:runner","filename":"/speckle-server/node_modules/knex/lib/execution/runner.js","abs_path":"/speckle-server/node_modules/knex/lib/execution/runner.js","lineno":30,"colno":19,"pre_context":["\r","  // \"Run\" the target, calling \"toSQL\" on the builder, returning\r","  // an object or array of queries to run, each of which are run on\r","  // a single connection.\r","  async run() {\r","    const runner = this;\r","    try {\r"],"context_line":"      const res = await this.ensureConnection(ensureConnectionCallback);\r","post_context":["\r","      // Fire a single \"end\" event on the builder when\r","      // all queries have successfully completed.\r","      runner.builder.emit('end');\r","      return res;\r","\r","      // If there are any \"error\" listeners, we fire an error event\r"],"in_app":false},{"function":"Runner.ensureConnection","module":"knex.lib.execution:runner","filename":"/speckle-server/node_modules/knex/lib/execution/runner.js","abs_path":"/speckle-server/node_modules/knex/lib/execution/runner.js","lineno":287,"colno":28,"pre_context":["\r","    if (this.connection) {\r","      return cb(this, cbParams);\r","    }\r","\r","    let acquiredConnection;\r","    try {\r"],"context_line":"      acquiredConnection = await this.client.acquireConnection();\r","post_context":["    } catch (error) {\r","      if (!(error instanceof KnexTimeoutError)) {\r","        return Promise.reject(error);\r","      }\r","      if (this.builder) {\r","        error.sql = this.builder.sql;\r","        error.bindings = this.builder.bindings;\r"],"in_app":false},{"function":"Client_PG.acquireConnection","module":"knex.lib:client","filename":"/speckle-server/node_modules/knex/lib/client.js","abs_path":"/speckle-server/node_modules/knex/lib/client.js","lineno":312,"colno":26,"pre_context":["    try {\r","      const connection = await this.pool.acquire().promise;\r","      debug('acquired connection from pool: %s', connection.__knexUid);\r","      return connection;\r","    } catch (error) {\r","      let convertedError = error;\r","      if (error instanceof TimeoutError) {\r"],"context_line":"        convertedError = new KnexTimeoutError(\r","post_context":["          'Knex: Timeout acquiring a connection. The pool is probably full. 
' +\r","            'Are you missing a .transacting(trx) call?'\r","        );\r","      }\r","      throw convertedError;\r","    }\r","  }\r"],"in_app":false}]},"raw_stacktrace":{"frames":[{"function":"Object.serverInfo","module":"server","filename":"/speckle-server/packages/server/dist/modules/core/graph/resolvers/server.js","abs_path":"/speckle-server/packages/server/dist/modules/core/graph/resolvers/server.js","lineno":7,"colno":20,"pre_context":["'use strict';","const { validateServerRole, validateScopes } = require('@/modules/shared');","const { updateServerInfo, getServerInfo, getPublicScopes, getPublicRoles } = require('../../services/generic');","module.exports = {","    Query: {","        async serverInfo() {"],"context_line":"            return await getServerInfo();","post_context":["        }","    },","    ServerInfo: {","        async roles() {","            return await getPublicRoles();","        },","        async scopes() {"],"in_app":true},{"function":"getServerInfo","module":"generic","filename":"/speckle-server/packages/server/dist/modules/core/services/generic.js","abs_path":"/speckle-server/packages/server/dist/modules/core/services/generic.js","lineno":11,"colno":28,"pre_context":["const Scopes = () => knex('scopes');","const Info = () => knex('server_config');","module.exports = {","    /**","     * @returns {Promise<import('@/modules/core/helpers/types').ServerInfo>}","     */","    async getServerInfo() {"],"context_line":"        const serverInfo = await Info().select('*').first();","post_context":["        serverInfo.version = process.env.SPECKLE_SERVER_VERSION || 'dev';","        serverInfo.canonicalUrl = process.env.CANONICAL_URL || 'localhost';","        return serverInfo;","    },","    async getAllScopes() {","        return await Scopes().select('*');","    },"],"in_app":true},{"function":"Runner.run","module":"knex.lib.execution:runner","filename":"/speckle-server/node_modules/knex/lib/execution/runner.js","abs_path":"/speckle-server/node_modules/knex/lib/execution/runner.js","lineno":30,"colno":19,"pre_context":["\r","  // \"Run\" the target, calling \"toSQL\" on the builder, returning\r","  // an object or array of queries to run, each of which are run on\r","  // a single connection.\r","  async run() {\r","    const runner = this;\r","    try {\r"],"context_line":"      const res = await this.ensureConnection(ensureConnectionCallback);\r","post_context":["\r","      // Fire a single \"end\" event on the builder when\r","      // all queries have successfully completed.\r","      runner.builder.emit('end');\r","      return res;\r","\r","      // If there are any \"error\" listeners, we fire an error event\r"],"in_app":false},{"function":"Runner.ensureConnection","module":"knex.lib.execution:runner","filename":"/speckle-server/node_modules/knex/lib/execution/runner.js","abs_path":"/speckle-server/node_modules/knex/lib/execution/runner.js","lineno":287,"colno":28,"pre_context":["\r","    if (this.connection) {\r","      return cb(this, cbParams);\r","    }\r","\r","    let acquiredConnection;\r","    try {\r"],"context_line":"      acquiredConnection = await this.client.acquireConnection();\r","post_context":["    } catch (error) {\r","      if (!(error instanceof KnexTimeoutError)) {\r","        return Promise.reject(error);\r","      }\r","      if (this.builder) {\r","        error.sql = this.builder.sql;\r","        error.bindings = 
this.builder.bindings;\r"],"in_app":false},{"function":"Client_PG.acquireConnection","module":"knex.lib:client","filename":"/speckle-server/node_modules/knex/lib/client.js","abs_path":"/speckle-server/node_modules/knex/lib/client.js","lineno":312,"colno":26,"pre_context":["    try {\r","      const connection = await this.pool.acquire().promise;\r","      debug('acquired connection from pool: %s', connection.__knexUid);\r","      return connection;\r","    } catch (error) {\r","      let convertedError = error;\r","      if (error instanceof TimeoutError) {\r"],"context_line":"        convertedError = new KnexTimeoutError(\r","post_context":["          'Knex: Timeout acquiring a connection. The pool is probably full. ' +\r","            'Are you missing a .transacting(trx) call?'\r","        );\r","      }\r","      throw convertedError;\r","    }\r","  }\r"],"in_app":false}]},"mechanism":{"type":"generic","handled":true}}]},"extra":{"query":"{serverInfo{version}}","variables":"[undefined]"},"fingerprint":["{{ default }}"],"grouping_config":{"enhancements":"eJybzDRxc15qeXFJZU6qlZGBkbGugaGuoeEEAHJMCAM","id":"newstyle:2023-01-11"},"hashes":["417af1dfb2d7a62282628d53be75280d","398a1bbc2e7bcd6edbe5b994316b8afd"],"ingest_path":[{"version":"23.6.2","public_key":"XE7QiyuNlja9PZ7I9qJlwQotzecWrUIN91BAO7Q5R38"}],"key_id":"3236202","level":"error","location":"/speckle-server/packages/server/dist/modules/core/services/generic.js","logger":"","metadata":{"display_title_with_tree_label":false,"filename":"/speckle-server/packages/server/dist/modules/core/services/generic.js","function":"getServerInfo","type":"GraphQLError","value":"Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?"},"nodestore_insert":1689158879.262104,"processed_by_symbolicator":true,"received":1689158878.589285,"request":{"url":"http://127.0.0.1/graphql","method":"GET","query_string":[["query","{serverInfo{version}}"]],"headers":[["Connection","close"],["Content-Type","application/json"],["Host","127.0.0.1:3000"],["X-Request-Id","2c5a421e-dfce-45ec-9e16-7d5823587d96"]],"inferred_content_type":"application/json"},"sdk":{"name":"sentry.javascript.node","version":"6.19.7","integrations":["InboundFilters","FunctionToString","ContextLines","Console","OnUncaughtException","OnUnhandledRejection","LinkedErrors","Postgres","Http","Express"],"packages":[{"name":"npm:@sentry/node","version":"6.19.7"}]},"timestamp":1689158878.485,"title":"GraphQLError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(t...","transaction":"GET /graphql","type":"error","user":{"id":"93d69b57eb759c2ef030eba5ee7186b4acdf0629e2762c3fdc36b84222f71807"},"version":"7"}

The only difference we noticed is that the Sentry error was triggered many times (until we stopped and started it manually); we believe it was every health check (alternating between the container check on 127.0.0.1 and our monitoring check on the public URL).

@iainsproat
Contributor

iainsproat commented Jul 12, 2023

Thanks for providing this.

After reading a little more about Docker Compose's behaviour, it seems that it will not restart unhealthy containers 😞 . (For security reasons I would avoid the 'autoheal' solution mentioned in that link, as it requires mounting the docker socket to that container, effectively giving it access to your host machine.)

From your error message we're again seeing the same error: Knex: Timeout acquiring a connection. The pool is probably full. This might be for a number of reasons (as I mentioned above), for example:

  • it may be that it temporarily lost connection to the Postgres database
  • more connections were requested than there were available
  • It may also indicate other issues, such as the server not freeing up connections quickly enough.

If you are able to, connect Prometheus or Grafana to the metrics endpoint (/metrics) of the speckle server. There is a metric, speckle_server_knex_free, which provides data on the number of available connections. You can plot this over time to see if the connections become saturated. If connections remain free, it may instead point to an issue with the network or the Postgres server itself.
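
If you have Prometheus scraping the server, plotting the metric name alone is enough in Grafana; a query like the following (PromQL; the 5m window is just an example) also makes sustained saturation easier to spot:

min_over_time(speckle_server_knex_free[5m])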

As a possible solution to saturated connections, you could attempt to increase the number of available connections by adding the POSTGRES_MAX_CONNECTIONS_SERVER environment variable to the server. The default for this environment variable is 4, so you could try increasing it to 8. Please note that there may be limitations on the number of connections that your postgres server can handle, and the larger number of connections may require more cpu & memory resources for both the database and speckle server.
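
In the compose file that would be one more entry under the server's environment block, for example:

  speckle-server:
    environment:
      POSTGRES_MAX_CONNECTIONS_SERVER: '8'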

You could also add redundancy by running one or more replicas of the speckle server container. If one container were to fail, or its database connections became temporarily saturated, the other could continue to operate and prevent downtime. Please refer to Docker's documentation on replicas.

@Sticcia
Author

Sticcia commented Jul 14, 2023

Hi again @iainsproat, I have set up a Grafana server on a separate VM instance but I am having trouble configuring the Prometheus data source.
You mentioned connecting to the /metrics endpoint; do I need to enable that explicitly in the configuration file?
Does the Grafana server need to run in a container on the same host as the Speckle services, or can it be on a different host?
I couldn't find any documentation regarding this. Maybe you have an example of a data source configuration?

@iainsproat
Contributor

Can you connect to your speckle server and query your-domain/metrics or localhost:3000/metrics?
I'm not sure how you've configured ingress; if you're using Speckle's DigitalOcean 1Click deployment, the ingress will prevent external access to /metrics. If you have deployed Grafana to a separate machine without direct network access to the speckle server, you would want an ingress which adds some sort of authentication to prevent public access to /metrics.
It is probably easiest to run it on the same VM, and deploy it via docker compose on the same docker network as speckle server.
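
As a rough sketch (the image tags, ports and file names here are illustrative, not an official Speckle setup), adding something like this alongside the existing services keeps /metrics off the public internet while still letting Prometheus scrape it over the Compose network:

  prometheus:
    image: 'prom/prometheus:latest'
    restart: always
    volumes:
    - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
    - '127.0.0.1:9090:9090'

  grafana:
    image: 'grafana/grafana-oss:latest'
    restart: always
    ports:
    - '127.0.0.1:3001:3000' # 3001 on the host to avoid clashing with speckle-server

with a minimal prometheus.yml along the lines of:

scrape_configs:
  - job_name: 'speckle-server'
    metrics_path: /metrics
    static_configs:
    - targets: ['speckle-server:3000']

Grafana can then use http://prometheus:9090 as the URL of its Prometheus data source.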

@Sticcia
Author

Sticcia commented Jul 24, 2023

Hi @iainsproat, we have set up Grafana and Prometheus monitoring.
Server crashed again this weekend:
I created a public snapshot (I can send you the JSON file too if needed).

Screenshot
Our health check was triggering restarts too fast; we have fixed that now. You can see in the graph that we stopped the health check at 17:07 and let the server do a full restart, which worked fine.

Also, the server usage was basically zero at the time of the crash (being a Sunday with most of us on vacation), so no relation to user actions was seen.

@iainsproat
Contributor

Thanks @Sticcia
For my understanding, can you confirm that the crash first occurred at approximately 14:30?
And what we're seeing between 14:30 and 17:07 are aggressive restarts until you disabled healthchecks?
To prevent the aggressive restarts, I think we may need to increase the start_period time on the healthcheck. It takes the server a moment to start up, during which it is expected to be unresponsive. If the healthcheck runs during that time it will experience a failure and cause the restart behaviour.
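
For example, in the existing healthcheck block only the start_period line would change (60s here is an illustrative value; tune it to how long your server actually takes to start):

      interval: 10s
      timeout: 10s
      retries: 3
      start_period: 60s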

@Sticcia
Author

Sticcia commented Jul 24, 2023

@iainsproat - your understanding is correct, but when I mentioned a health check in the previous comment I meant a check made by us, not the Docker Compose health check; I wasn't very clear, sorry. We had already made the same fix as the one you just described.
Does the output from Grafana help find the core issue?
Can we find where all these connections came from?
Can we increase the number of available Knex connections? If yes, how?

@iainsproat
Contributor

Grafana is helpful, as I think we see a dramatic increase in connections at around 14:30. But it's hard to tell whether this is related, or whether it is due to the restart and is a consequence rather than the cause.

Do you still have the data available in Grafana to plot the metric speckle_server_knex_free? I'd like to see its values between 14:20 and 14:30, and a minute or two at either side of that time period.
Do you have logs for the events that occur around 14:30?

I'd like to establish a more exact timeline: do we see the increase in connections before or after the restart?

@Sticcia
Author

Sticcia commented Jul 24, 2023

@iainsproat

I was mistaken, the server crashed twice this weekend. Once at 10:45, when our uptime check got stuck in a restart loop (this is clearly visible in the graphs too; there are gaps during the restarts).

And once at 14:20, when the health check was disabled and the restart was done manually at 17:00.

Sorry for the confusion. Here is the data you asked for, covering the crash at 14:20 (with no action taken in this time slot).

Data from 14:18 - 14:35

Speckle Server container logs:
logs.txt

Grafana data export:
grafana_data_export.csv

The increase of connections is before the restart.

@Sticcia
Author

Sticcia commented Jul 26, 2023

Hi again @iainsproat

We managed to replicate the crash fairly consistently (although not 100% of the time) by lowering the VM type (CPU and memory) and overloading the server with many open connections.

We don't know if this was the same issue that happened sporadically before but the crash errors are the same.

We will write some tests to load the server in a more controlled way and see if that affects the performance/crash frequency.

@iainsproat
Contributor

@Sticcia - that's a great finding, it sounds as if it might be a resource management issue.

You may wish to run the server in inspect mode to gain more diagnostic capabilities. Add the flag --inspect=PORT to the command when running the server (replacing PORT with the port number you wish to expose the inspect interface on). More details about inspect can be found on Node's website. This may provide more insight into the resource usage.

@Sticcia
Author

Sticcia commented Jul 26, 2023

@iainsproat - thanks for the quick response; if I understand correctly, I should edit the docker-compose configuration with something like

speckle-server:
    image: speckle/speckle-server:2.13.3
    ports:
    - '127.0.0.1:3000:3000'
    - '127.0.0.1:9229:9229'
    command: ['bash', '-c', '/wait && node --inspect=9229 bin/www']

and then I can run something like Chrome DevTools on geo.stykka.com:9229 to see the debug output?

@iainsproat
Contributor

@Sticcia - yes, though I think the command can just be --inspect=9229 bin/www as the entrypoint is already node.
Most importantly, I would not expose the port to the internet due to the security implications of this. Instead I would suggest creating an ssh tunnel to the host machine, or something similar which would allow you to access the port securely and without exposing it publicly.
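
For example, a local port forward over ssh lets you point Chrome DevTools at 127.0.0.1:9229 without publishing the port (user and hostname below are placeholders):

ssh -L 9229:127.0.0.1:9229 your-user@your-vm-host

and keep the inspect port mapping bound to 127.0.0.1 only, as in the compose snippet above.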

@ayansengupta17

Any update on this? We are facing the very same issue as well. We have increased VM sizes to

EC2 instance type: c7i.8xlarge
RDS: db.t4g.xlarge

and still face random crashes

@gjedlicska
Contributor

up @iainsproat

@iainsproat
Contributor

Hi @Sticcia - did you manage to find time to investigate this further? Does it still occur if you use the latest version of speckle's server?

@Sticcia
Author

Sticcia commented Jun 24, 2024

Hi @iainsproat,
We do not have the resources to investigate this further. We scaled up the host configuration which, together with an auto-heal service, reduced the downtime to only a few seconds a month.
We are still running a 2.18.x version; I will monitor it more closely after we upgrade and let you know if the problem persists.
