Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sqlalchemy queue pool limit lockup/timeout #104

Closed
jklaw90 opened this issue Jan 29, 2020 · 62 comments
Closed

sqlalchemy queue pool limit lockup/timeout #104

jklaw90 opened this issue Jan 29, 2020 · 62 comments
Labels

Comments

@jklaw90
Copy link

jklaw90 commented Jan 29, 2020

While doing some testing to see how this fastapi with sqlalchemy would hold up, my server seemed to lock up when running 100 concurrent requests. If i ran the requests sequentially it was totally fine.

sqlalchemy.exc.TimeoutError: QueuePool limit of size 5 overflow 10 reached,

Is it possible I've made some error while trying to mimic the structure of the code base?
Or is it possible the main can lock up in the middleware with the session implementation of sqlalchemy?

Has anyone tested the performance of this cookie cutter project?

@jklaw90
Copy link
Author

jklaw90 commented Jan 30, 2020

Running the test attached i was able to reproduce the failed responses for a majority of the requests on this cookiecutter project.
And only a few requests can even respond before it seems to lock up the main, could this have something to do with the management of the session in the async middle-ware https://docs.sqlalchemy.org/en/13/orm/session_basics.html#is-the-session-thread-safe ?

I know this isn't the most scientific test but i just wanted to prove out this repro before switching to something like it.
Results:
200: 47 - 17 before the lock up and 30 of them after the sqlalch timeouts)
500: 53
98 - 98 requests took longer than 2 seconds to respond.

import aiohttp
import asyncio
from datetime import datetime

headers = {
    "Authorization": "Bearer xxxxxxxxxxxxxxxxxxx",
    "Content-Type": "application/json; charset=utf-8"
}

timeout = aiohttp.ClientTimeout(total=60)

async def get(url):
    async with aiohttp.ClientSession(timeout=timeout) as session:
        s = datetime.now()
        async with session.get(url, headers=headers) as resp:
            return resp.status, datetime.now() - s

loop = asyncio.get_event_loop()

coroutines = [get("http://localhost/api/v1/users/me") for i in range(100)]


results = loop.run_until_complete(asyncio.gather(*coroutines))

codes = {}
count = 0
for s, r in results:
    codes[s] = codes.get(s, 0) + 1
    if r.seconds > 2:
        count += 1

print(codes, count)

@chdsbd
Copy link

chdsbd commented Feb 2, 2020

I think this is related to #56.

@jklaw90
Copy link
Author

jklaw90 commented Feb 5, 2020

Yeah, I guess people just use this for convenience and don't have serious spikes of traffic with fastapi + sqlalchemy? Maybe the people who actually are using it in production are using one of the other async db clients.

@ebreton
Copy link
Contributor

ebreton commented Feb 5, 2020

For information, I also ran on the 100 requests limit, which was actually due to the default configuration of the postgres DB.

Launching the DB with more connections (500 in my case) solved the issue. You just have to modify the docker-compose by adding the following line in the postgres service description:

services:
  postgres:
    ...
    command: postgres -N 500

@jklaw90
Copy link
Author

jklaw90 commented Feb 6, 2020

@ebreton What was the default connection limit on that postgres container and did you also increase the pool size in sqlalchemy? I understand that would fix this issue, but the main shouldn't get blocked and timeout after 30 seconds in any condition with only 100 concurrent requests. Running all 100 requests sequentially works perfectly fine and finishes in no time.

I don't know fastapi/starlette and sqlachemy that well so I'm a little confused if the main is grabbing a session but blocking when the QueuePool is full? If so is it possible we could see better results by just changing the sqlalchemy configurations?

@ebreton
Copy link
Contributor

ebreton commented Feb 6, 2020

Hi @jklaw90

The default configuration in the official postgres container is 100, there is an example of overriding this value on the docker hub repo 1. I upgrades it to 500 for the sake of doing it only once. 😄
The default size of a SQLAlchemy pool is 5 2. I also increased it to 20 (keeping the default overflow to 10)

I agree that 100 concurrent requests works fine. In my case, I have actually just realized that I made an entirely different mistake, which is to serve (a lot of) pictures directly from the app. I am going to replace this by a NGinx, and I will probably be able to come back to the default SQL values in the app...

Thanks for the question 🎉

@jklaw90
Copy link
Author

jklaw90 commented Feb 7, 2020

Ahh thank you for the resources and insight!

I am still a little confused on why even with default 100 connections + pool size of 5 we would get timeouts rather than just getting responses slowly?

If there is some simple fix i'd love to know, really looking to switch from falcon to fastapi.

@unmade
Copy link

unmade commented Feb 14, 2020

@jklaw90

That's interesting! I have slightly different setup for my app and I couldn't reproduce the error you have, but occurred with another one. I don't have middleware and don't use Depends to get session. Instead, I create session directly in the endpoint.

I refactor my app to use Depends and have exactly the same error you got. Strangely enough, but with PYTHONASYNCIODEBUG=1 everything works fine.

So, I guess the problem might be in how Depends work, although I haven't looked at it yet. You might want to try to get rid of the session middleware and Depends and use custom context manager for session instead, something like:

from contextlib import contextmanager
from sqlalchemy.orm import sessionmaker

Session = sessionmaker(...)

@contextmanager
def SessionManager():
    session = Session()
    try:
        yield session
    finally:
        session.close()

And then you can use it in you endpoint like that:

@router.get("/")
def hello_world():
    with SessionManager() as db_session:
        ...

So, with a setup like that I also tried to fire 100 concurrent requests and got this error:

sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) FATAL:  sorry, too many clients already

However, if I set WEB_CONCURRENCY=1 (this env specifies number of workers for gunicorn - in this case it will run gunicorn with a single worker), everything is fine.

I think I got this error, because gunicorn spawns several workers (you can see how the number of worker is calculated here). Each worker thinks it has pool size of 100 available to it, when it is actually db_pool_size / number_of_workers:

So as a fix I created engine like this:

DB_POOL_SIZE = int(os.getenv("DB_POOL_SIZE", "100"))
WEB_CONCURRENCY = int(os.getenv("WEB_CONCURRENCY", "2"))
POOL_SIZE = max(DB_POOL_SIZE // WEB_CONCURRENCY, 5)

engine = create_engine(config.DATABASE_DSN, pool_size=POOL_SIZE, max_overflow=0)

Note: WEB_CONCURRENCY calculated a little bit more complicated

The problem is gone and everything seems to work fine

@vumaasha
Copy link

@tiangolo @dmontagu facing the same issue. When the number of concurrent requests is higher than the number of available of connections ( max pool size + overflow), many requests just time out. Are we doing something wrong? is there a way to fix it?

@unmade
Copy link

unmade commented Mar 19, 2020

@vumaasha can you please try solution I described above and see if it helps?

@vumaasha
Copy link

@vumaasha can you please try solution I described above and see if it helps?

I make heavy use of Depends, getting rid of Depends will need lot of refactoring for me. So it is not an option.

I only tried the part corresponding to setting pool size. It didn't help.

@stratosgear
Copy link

For me it seems to work (no more timeouts and I do not exceed the default of 100 open connections).

But I have the impression that using the SessionManager slows things down. Does this make sense?

I am running some loadtesting scenarios with locust and I will report back.

@vumaasha
Copy link

vumaasha commented Mar 19, 2020

@stratosgear you wont get timeout if you don't exceed number of available connections. Thats right but when it does exceed, it can be slow it shouldn't be timing out. Increasing the timeout also doesnt help

@stratosgear
Copy link

If I can explain my issue correctly it was like this:

  • My requests are usually replying in 200ms (on the average)
  • When I would try hitting the server with 300 concurrent users, I would exceed the 100 open connections of postgres.
  • After removing the Depends(get _db) and using the SessionManager (from above) I never went more that 25-30 db connections ( I was monitoring from pgadmin) all other things being equal (number of concurrent users, duration of test etc)

Unfortunately now I see that my response rate is larger than 1000ms and I'm wondering what causes this. I still have to clean things up and try the locust scenarios again.

@stratosgear
Copy link

stratosgear commented Mar 19, 2020

Sweet!
I have no extra overhead with the SessionManager, after rerunning my load testing scenarios.

One other configuration change I have is when I instantiate the sqlachemy engine:

engine = create_engine(
    config.SQLALCHEMY_DATABASE_URI, pool_pre_ping=True, pool_size=32, max_overflow=64
)

That ought to be able to utilize all 100 default Postgres connections, but my load tests anyway rarely seem to utlize more than 25-30 connections (now with the SessionManager)

🍺 to @unmade (or 🍰, 🍵 or 🥃 or whatever rocks his boat...)

@jklaw90
Copy link
Author

jklaw90 commented Mar 19, 2020

The problem is there is a lockup/timeout on sqlalchemy. The session maker should block and wait, so slow responses would be acceptable over no responses.
Updating the pool size just hides the underlying issue and isn't a solution I would be willing to use in production.

@unmade
Copy link

unmade commented Mar 19, 2020

@jklaw90 it is not just about updating the pool size. The problem seems to be somewhere in Depends. BTW, have you tried to do load testing with WEB_CONCURRENCY=1? Did you get the same error?

@jklaw90
Copy link
Author

jklaw90 commented Mar 19, 2020

I think what you've done it just moved the session creation to the thread itself rather the main.
And WEB_CONCURRENCY=1 is also limiting the worker count which i wouldn't do if performance is a concern.

@unmade
Copy link

unmade commented Mar 19, 2020

of course, you don't want WEB_CONCURRENCY=1 in production. I asked to understand if it affects problem or not - because when I tested it, the number of workers caused (directly or not) the problem.

@jklaw90
Copy link
Author

jklaw90 commented Apr 1, 2020

Sorry, i haven't had time to test it out. I actually stopped looking into fast api for now. It's for work so i just can't afford throw it up to find out if it works haha. I'll probably check it out again in the future, but most likely without sqlalchemy.

@jklaw90 jklaw90 closed this as completed Apr 14, 2020
@sandys
Copy link

sandys commented May 16, 2020

The problem is there is a lockup/timeout on sqlalchemy. The session maker should block and wait, so slow responses would be acceptable over no responses.
Updating the pool size just hides the underlying issue and isn't a solution I would be willing to use in production.

@tiangolo we are also hitting the Depends issue in production. Is there a long-term production-ready fix here ?

@vumaasha
Copy link

The problem is there is a lockup/timeout on sqlalchemy. The session maker should block and wait, so slow responses would be acceptable over no responses.
Updating the pool size just hides the underlying issue and isn't a solution I would be willing to use in production.

@tiangolo we are also hitting the Depends issue in production. Is there a long-term production-ready fix here ?

@sandys this setting seem to help for me. I am using a higher number for max_overflow and smaller number for pool. Only pool size connections are maintained, other connections are released as soon as possible. This does seem to help to mitigate the problem

engine = create_engine(config.SQLALCHEMY_DATABASE_URI, pool_pre_ping=True, pool_size=30, max_overflow=120)

@stratosgear
Copy link

Sorry I cannot be more helpful, but I tried all combinations of create_engine params, so I will be doubtful that this can be fixed solely with the max_overflow param.

I will insist that Depends(get _db) is somehow flowed. Moving to SessionManager suggestion by unmade made a hell of a difference, at least for my case.

@sandys
Copy link

sandys commented May 22, 2020

@stratosgear thanks for replying. Could you copy paste what you did with Sessionmanager ? the comment here seems to make the impression that Sessiomanager does not work and we need to put some WEB_CONCURRENCY and PYTHONASYNCIODEBUG to get it to work.

If you are deploying with gunicorn, could you also mention your config please. this will be super helpful

@stratosgear
Copy link

Sure....

So I got rid of the Depends(get _db) in the API method and I am using:

@router.get("/cleanup/{jobid}", response_model=schemas.Job, status_code=200)
def cleanup(response: Response, jobid: int):
    """ some comment"""

    with SessionManager() as db:
        jobRow = crud.job.get(db=db, id=jobid)
....
....

so my session is provided by SessioManager as db and I used that from that point on.

SessionManager, pulled straight out of my repo, is externally defined as defined as:

import warnings
from contextlib import contextmanager

from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

from rapidxmm.core import config

# How to echo all Sqlalchemy sql statements
# engine = create_engine(config.SQLALCHEMY_DATABASE_URI, pool_pre_ping=True, echo=True)

# Increase connection pool sizes to support more than 100 concurrent users.
# After some failed tests while testing 100 conc.users an a DB with
# pointed at nside 15 and slew at nside 14
# defaults were: pool_size=5, max_overflow=10
# https://docs.sqlalchemy.org/en/13/errors.html#error-3o7r
engine = create_engine(
    config.SQLALCHEMY_DATABASE_URI, pool_pre_ping=True, pool_size=32, max_overflow=64
)
db_session = scoped_session(
    sessionmaker(autocommit=False, autoflush=False, bind=engine)
)
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)


# SessionManager is an alternative way to get a DB connection
# as an alternative to FasTApi Depends that seems to quickly
# exhaust the Postgres open connections:
# https://github.com/tiangolo/full-stack-fastapi-postgresql/issues/104
@contextmanager
def SessionManager():
    db = SessionLocal()
    try:
        yield db
    except:
        # if we fail somehow rollback the connection
        warnings.warn("We somehow failed in a DB operation and auto-rollbacking...")
        db.rollback()
        raise
    finally:
        db.close()

Embarrassingly enough I do not know the exact gunicorn settings. I am using the standard main.py file for instantiating my App and I let it boot up with the default settings as they are defined in this cookiecutter skeleton repo.... :(

@sandys
Copy link

sandys commented May 22, 2020

@unmade @stratosgear i have a theory

what if you keep Depends.....but put it under Contextmanager. So for example


from typing import List
from contextlib import contextmanager

from fastapi import Depends, FastAPI, HTTPException
from sqlalchemy.orm import Session

from . import crud, models, schemas
from .database import SessionLocal, engine

models.Base.metadata.create_all(bind=engine)

app = FastAPI()


# Dependency
@contextmanager
def get_db():
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()


@app.post("/users/", response_model=schemas.User)
def create_user(user: schemas.UserCreate, db: Session = Depends(get_db)):
   with db:
       db_user = crud.get_user_by_email(db, email=user.email)
       if db_user:
         raise HTTPException(status_code=400, detail="Email already registered")
       return crud.create_user(db=db, user=user)

maybe the whole issue is happening because contextmanager is needed to close the connection before the async corountine exits.

@DBendit
Copy link

DBendit commented Jul 31, 2020

I'll just add another "success story" of switching from using Depends(get_db) to using a context manager inside of API functions. I was seeing the exact same issue as everyone else in this thread, with the connection pool becoming quickly exhausted. Every alternative I tried using Depends() failed, but as soon as I moved to using a context manager, I was able to throw hundreds of connections at my service and see it queue them up properly.

@mtucker502
Copy link

@DBendit mind sharing your new context manager function and endpoint code? I ran into some obscure error from context manager (probably my fault) and was too busy (see lazy) at the time to figure it out.

@AlexanderPodorov
Copy link

@abrichr , to your questions:
In regards to background_tasks, this is the standard FastApi feature. It allows you to add any function that will be executed after the response is sent. In my case I've added the close_session function and passed session object as positional argument.
In regards to get_settings, this just returns my custom Settings class. In particular this class has session_maker attribute which I'm using to initiate a new DB session, actually returned by SQLAlchemy's sessionmaker factory function.
I found it useful to store SQLAlchemy's Engine instance on my custom Settings class to take advantage on dependency injection features.

@andersonrocha0
Copy link

Hey guys, I made a benchmark using SQLAlchemy Sync and Async.

The code and the results are in the link below:

https://github.com/andersonrocha0/fastApiAndSqlAlchemySyncVsAsync

@sandys
Copy link

sandys commented Apr 12, 2021 via email

@andersonrocha0
Copy link

@sandys Thanks for sharing man.

My intention was demonstrate that the SQLAchemy not work well using the sync way. Do you agree with my tests and my conclusion?

If is possible can you check if my tests and conclusion are right?

Tks a lot.

@sandys
Copy link

sandys commented Apr 13, 2021 via email

@andersonrocha0
Copy link

@sandys If you try the same SQLAlchemy pool technique with flask you do not have problems. So, I understand that the problem is the combination between SQLAlchemy and FastAPI.

However, I as said, using SQLAlchemy with async way, fix the problem and you don't need uses tools like PgBouncer, etc.

Thanks man.

@sandys
Copy link

sandys commented Apr 14, 2021 via email

@remidinishanth-ntnx
Copy link

@unmade @stratosgear i have a theory

what if you keep Depends.....but put it under Contextmanager. So for example


from typing import List
from contextlib import contextmanager

from fastapi import Depends, FastAPI, HTTPException
from sqlalchemy.orm import Session

from . import crud, models, schemas
from .database import SessionLocal, engine

models.Base.metadata.create_all(bind=engine)

app = FastAPI()


# Dependency
@contextmanager
def get_db():
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()


@app.post("/users/", response_model=schemas.User)
def create_user(user: schemas.UserCreate, db: Session = Depends(get_db)):
   with db:
       db_user = crud.get_user_by_email(db, email=user.email)
       if db_user:
         raise HTTPException(status_code=400, detail="Email already registered")
       return crud.create_user(db=db, user=user)

maybe the whole issue is happening because contextmanager is needed to close the connection before the async corountine exits.

@sandys Did this work? Just by adding @contextmanager to get the local session?

What's your suggestion to use this in production? Should we move to asyncpg and async-sqlalchemy extension?

@sandys
Copy link

sandys commented Jul 19, 2021

I have a fully working, sqlalchemy and pydantic compatible gist here with both syncpg and asyncpg.

https://gist.github.com/sandys/671b8b86ba913e6436d4cb22d04b135f

You can try it and let me know. I have already benchmarked it. So should work fine.

@thebleucheese
Copy link

thebleucheese commented Aug 3, 2021

I have a fully working, sqlalchemy and pydantic compatible gist here with both syncpg and asyncpg.

https://gist.github.com/sandys/671b8b86ba913e6436d4cb22d04b135f

You can try it and let me know. I have already benchmarked it. So should work fine.

@sandys I'm relatively new to Python and these tools so excuse my ignorance, but don't you run into the issue outlined here: tiangolo/fastapi#3620 if you commit your session transaction after the yield in the get_db function?

I tried a similar but slightly different approach to what you have above and I was seeing some weird behavior. I fixed it but I'm not sure I completely understand all of the different methods of using a db session as a fastapi route dependency yet.

@sandys
Copy link

sandys commented Aug 3, 2021 via email

@sandys
Copy link

sandys commented Aug 21, 2021

@thebleucheese did u try my code ? would love any feedback if u have

@AlexanderPodorov
Copy link

@andersonrocha0 , @sandys , I can confirm it works with async SQLAlchemy. I had pool_size=5 and pool_overflow=5 and 50 concurrent users. Works like a charm.

@sandys
Copy link

sandys commented Oct 29, 2021 via email

@AlexanderPodorov
Copy link

@sandys , sorry, I was not specific about that. I'm not referring to any code here, rebuilt my app to async SQLA several months ago, and just wanted to give it a try, and it worked.

faustomorales added a commit to faustomorales/qsl that referenced this issue Dec 11, 2021
@maitham
Copy link

maitham commented Dec 22, 2021

@sandys Looked at your gist, I still seem to get errors, does the lru_cache guarantee reuse of the same async_engine?

@thebleucheese
Copy link

@thebleucheese did u try my code ? would love any feedback if u have

Yes, sorry it took me a while to get back to this. I had to move on to other issues I had with my project on the front-end.

I've encountered an issue with this approach and using multiple dependencies that rely on the DB session.

The following works fine in isolation.

@app.get("/", response_model=List[UserPyd])
async def foo(context_session: AsyncSession = Depends(get_db)):
    async with context_session as db:
        # code that uses db (AsyncSession)

If I add a dependency on another dependency that uses get_db I get this error:

File "/usr/lib/python3.8/contextlib.py", line 173, in aenter
raise RuntimeError("generator didn't yield") from None

async def bar(db_session: : AsyncSession = Depends(get_db)):
    async with db_session as db:
        # use db to check user permissions or some other action

@app.get("/", response_model=List[UserPyd])
async def foo(current_user: User = Depends(bar), context_session: AsyncSession = Depends(get_db)):
    async with context_session as db:
        # code that uses db (AsyncSession)

I'm not running the exact gist but the code is similar enough.

When I switch to a simplified setup without the contextmanager I don't see this error:

engine = create_async_engine(
    DATABASE_URL,
    pool_pre_ping=True,
    echo=True
)
_async_session = sessionmaker(engine, class_=AsyncSession, expire_on_commit=False)

async def get_db() -> AsyncSession:
    db: AsyncSession = _async_session()
    try:
        yield db
        await db.commit()
    except:
        await db.rollback()
    finally:
        await db.close()

@bharathanr
Copy link

bharathanr commented Jan 30, 2022

Hi @thebleucheese , thanks for your feedback on @sandys code. The documentation discourages decorating Dependencies with @contextmanager/@asynccontextmanager here in the very last tip. It looks like this is because it breaks dependency resolution, based on your experience.

Do you have any recommendations on how to set up an Asyncio SQLAlchemy Session in production, based on your experiences?

@tiangolo
Copy link
Owner

tiangolo commented Sep 4, 2022

Thanks for the discussion everyone!

This was most probably solved in tiangolo/fastapi#5122, released as part of FastAPI 0.82.0 🎉

If this solves it for you, you could close the issue. 🤓

@shriramsb11
Copy link

shriramsb11 commented Aug 22, 2023

Hi,

Thanks for the discussion everyone!

This was most probably solved in tiangolo/fastapi#5122, released as part of FastAPI 0.82.0 🎉

If this solves it for you, you could close the issue. 🤓

So, can we use the dependency injection itself(Depends(get_db)), if the version is updated to 0.82.0 or above?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests