Tweak gunicorn launch parameters #196
Merged
These are minor tweaks to the production webserver's launch parameters. Since nobody else touches this, I'll probably deploy it in a few days even if nobody reviews it.
Background
Since the update to Django 5.2 a few weeks ago (#191), the application has been crashing roughly every other day and sending me an email like the one below. We didn't change the application drastically, so I'm guessing memory usage is simply a bit higher on Django 5.2 than on 4.2. The free-tier machines we're deployed on have 256MB of RAM, and even though we only use ~200MB with 2 gunicorn workers, we still get killed sometimes. I did browse the Grafana graphs that Fly makes available (see below), but I didn't actually see a spike in memory usage around the crash times.
There's also the matter of database migrations: deploying one means connecting to a running machine, and if that machine is already struggling under 2 gunicorn workers, there isn't enough spare memory to SSH in and run the migration when the need arises. We need more headroom. The red dashed lines in the graph below mark me SSHing into an instance and the instance getting OOM killed.
This change
The main change is to switch from 2 gunicorn workers to 1 per machine. I tested this for a little while and so far it's looking good; I'm going to let it sit a few days and see whether we get a crash. Fly's free tier lets you run 3 machines, so I'm going to run 2 of these small/cheap machines instead of the 1 we were running previously, with each machine running a single worker instead of 2.
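For reference, a minimal sketch of what the worker-count side of this looks like, assuming the settings live in a gunicorn config file; the file name and bind address here are placeholders, not pulled from this repo:

```python
# gunicorn.conf.py -- hypothetical sketch of the single-worker setup;
# file name and bind address are assumptions, not taken from this repo.

# One worker per 256MB Fly machine; a second worker is what pushed
# memory close enough to the limit to get the process OOM killed.
workers = 1

bind = "0.0.0.0:8000"
```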
I also made a small change to gunicorn's worker-recycling settings, based on a recommendation from ChatGPT (don't tell Hobson). This app doesn't get much non-bot traffic, so reaching 10k requests can take a long time; it seems better to restart the workers more frequently in wall-clock terms. In general this prevents a slow memory leak from building up and causing a crash. That said, I don't actually believe a leak caused the crashes I'm seeing: the Grafana graph doesn't show memory slowly climbing until processes get OOM killed, which is what I'd expect if leaks were the problem.
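For concreteness, here's a hedged sketch of what the recycling tweak could look like. Gunicorn recycles workers by request count (max_requests) rather than by elapsed time, so "more frequently" here means a lower request threshold; the numbers below are placeholders, not necessarily the values in this change.

```python
# gunicorn.conf.py -- hypothetical worker-recycling values; the real
# numbers in this PR may differ.

# Recycle the worker after this many requests so any slow leak gets
# reset; a low-traffic app hits a small threshold like this far sooner
# (in wall-clock time) than it would ever hit 10k.
max_requests = 1000

# Jitter the threshold so the workers on the two machines don't all
# restart at the same moment.
max_requests_jitter = 50
```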