SSL SYSCALL error: EOF detected #8348

csangonzo · 2022-08-04T06:35:15Z

Hey guys!

A little background on what we're working on and how we got here.
We've built a server-side application (FastAPI) to handle data synchronization between a Postgres database and an internal API. We're running multiple workers simultaneously with rq-scheduler: each environment has its own worker and its own queue. Most notably, there's a job that runs every hour to keep the data up to date.

Now for the tricky part: we have one database, let's call it X, where we store the metadata for the synchronization process (environment ids, external ids, etc.), along with some logs about it, in a sync schema. In another schema there's our source data that needs to be synchronized to this other service. This data originally lives in another database, let's call it Y, which we access through postgres fdw (foreign data wrapper) pointing to a read-only replica of the original live database Y.
Lately we ran into issues with a query that fails with the following error: sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) SSL SYSCALL error: EOF detected.

The query itself runs in under 3 minutes without a problem when I run it directly (in DBeaver) against the fdw tables. In the original source it runs almost instantly. The slowdown is caused by the fdw, and I'm okay with that as long as the query finishes. When the same query runs through sqlalchemy it no longer finishes: it starts, and after ~20 minutes or so it raises the EOF error.
One strange thing about it: we recently reached 10+ simultaneous workers, and we don't get this error if I keep the number of workers under 10. (such a magical number)
One thing that might have an effect is that each worker has its own sqlalchemy engine:

engine = create_engine(
    connection_string,
    echo=echo,
    future=True,
    pool_pre_ping=True
)

Could this be the cause of this problem?
I tried using the pool_timeout, max_overflow and pool_size options as well, but they didn't help.
Where should I look for this?

What we're using:
python: 3.8
fastapi: 0.68.1
SQLAlchemy: 1.4.22
rq: 1.9.0
rq-scheduler: 0.11.0
psycopg2-binary: 2.9.1

UPDATE:
The doc says:

When using an Engine with multiple Python processes, such as when using os.fork or Python multiprocessing, it’s important that the engine is initialized per process. See Using Connection Pools with Multiprocessing or os.fork() for details.

So each worker having its own sqlalchemy engine should not be a problem (since each worker is a different Python process).

CaselIT · 2022-08-04T10:21:46Z

Maintainer

hi,

the behaviour you are experiencing here seems environment-dependent, not something really connected to sqlalchemy.
from the error it seems that something is closing the connection at some point.

since you mentioned that the number of workers does have an impact, maybe you need to increase the total number of connections allowed by postgresql?

also strange that you are using psycopg2 with fastapi that's async, but I don't think this has any influence here

3 replies

csangonzo Aug 4, 2022
Author

since you mentioned that the number of workers does have an impact, maybe you need to increase the total number of connections allowed by postgresql?

The total number of connections (max_connections: 200) is fine, we ran into it once with an overpopulated table, and the error in that case is OperationalError: FATAL: sorry, too many clients already.
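As a rough sanity check on the connection budget, the worst case can be worked out from the pool settings. This is a back-of-the-envelope sketch, assuming every worker uses SQLAlchemy's default QueuePool settings (`pool_size=5`, `max_overflow=10`); the function name is hypothetical, and connections that the fdw opens on the replica are not counted here.

```python
def max_pooled_connections(
    workers: int, pool_size: int = 5, max_overflow: int = 10
) -> int:
    """Upper bound on connections that `workers` engines can open in total.

    Each engine may hold pool_size persistent connections plus
    max_overflow temporary ones, so this is a worst case, not typical usage.
    """
    return workers * (pool_size + max_overflow)


MAX_CONNECTIONS = 200  # the max_connections value mentioned above

for workers in (9, 10, 14):
    budget = max_pooled_connections(workers)
    print(workers, budget, budget <= MAX_CONNECTIONS)
```

Actual usage is normally far below this bound, since a worker only opens the connections it needs at a given moment.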

also strange that you are using psycopg2 with fastapi that's async, but I don't think this has any influence here

Yes, you're absolutely right, I'm about to change it sometime soon.

I just noticed I had another engine created that wasn't used, I'll try it with that removed.
I also thought about isolation levels, but the default is Read Committed, which shouldn't be an issue here, unless sqlalchemy changes it without being explicitly told to, which I don't think it does.

I also checked a few SO questions, like https://stackoverflow.com/questions/24130305/postgres-ssl-syscall-error-eof-detected-with-python-and-psycopg, which suggests that pool_pre_ping=True should solve the issue, but I already had that set.

I'll dig deeper into fdw and replication with postgres. Is there any fdw specific sqlalchemy behaviour that could cause any issues or something I should know about?

CaselIT Aug 4, 2022
Maintainer

Not sure then. Try checking whether some proxy or similar has a max timeout on the connection or something like that. The issue seems to be that the connection is somehow being terminated.

I'll dig deeper into fdw and replication with postgres. Is there any fdw specific sqlalchemy behaviour that could cause any issues or something I should know about?

Not to my knowledge. An fdw should behave the same for any sql client.
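If a proxy or firewall is dropping connections that look idle during a long-running query, enabling TCP keepalives on the client side is one thing to try. This is a sketch under that assumption, not a confirmed fix for this thread: the helper name and the default values are hypothetical, while the keys are standard libpq connection parameters that psycopg2 passes through. Note that pool_pre_ping only tests a connection when it is checked out of the pool, so it cannot help once a query is already running.

```python
from typing import Dict


def keepalive_connect_args(
    idle: int = 30, interval: int = 10, count: int = 5
) -> Dict[str, int]:
    """Build libpq keepalive settings for use as SQLAlchemy connect_args.

    idle:     seconds of inactivity before the first keepalive probe
    interval: seconds between probes that get no reply
    count:    unanswered probes before the connection is considered dead
    """
    return {
        "keepalives": 1,  # turn client-side TCP keepalives on
        "keepalives_idle": idle,
        "keepalives_interval": interval,
        "keepalives_count": count,
    }


# Usage (requires SQLAlchemy and psycopg2; connection_string as in the post):
# engine = create_engine(
#     connection_string,
#     pool_pre_ping=True,
#     connect_args=keepalive_connect_args(),
# )
```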

csangonzo Aug 4, 2022
Author

It seems like it's a problem with the foreign data wrapper queries ...
I posted a question here if someone wants to follow up: https://dba.stackexchange.com/questions/315219/postgresql-foreign-data-wrappers-simultaneous-queries-wont-finish

Thanks for the help @CaselIT !
