
background instance getting terminated constantly #931

Closed
snarfed opened this issue Apr 5, 2020 · 26 comments


snarfed commented Apr 5, 2020

originally reported on google's issue tracker, 2020-03-27. i'm going to copy its comments here, more or less one for one.

for a while now, the background service's single instance (on manual scaling) runs out of memory, dies, and gets restarted a handful of times per day. not great, but tolerable.

starting 3/26 around 9am PST (4pm UTC), its instance suddenly started getting terminated multiple times per minute. when this happens, all currently processing requests die, return 500, and show the usual error code 203 log message:

The process handling this request unexpectedly died. This is likely to cause a new process to be used for the next request to your application. (Error code 203)

due to this, roughly 40-50% of all requests to this service 500ed for the first day or so, usually after running normally for a while first. the rest of the requests succeeded as usual and returned 200 or occasionally 304.

i looked for other 500s from my app itself that might be triggering the scheduler to restart the instance, but no luck. nothing in the rest of the logs seems related either.

did something on App Engine's side change around 9am PST 3/26? i know that's when the Atlanta router failure outage started, but that was supposedly fixed by around 1pm yesterday. could this be related, even though it's still happening?

i didn't deploy or otherwise make any changes to the app at any time on 3/26, and i'm not aware of any other changes in external dependencies that could have triggered this.

i tried redeploying from scratch. i also paused the Cloud Tasks queues feeding requests to this instance, waited ~20m, and then resumed them. no luck, the instance still constantly gets terminated while it's serving requests.
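
(for reference, not the app's actual code: pausing and resuming a queue can be done through the Cloud Tasks client library, roughly like this sketch. the project, location, and queue names are made-up placeholders.)

```python
# minimal sketch: pause, then later resume, a Cloud Tasks queue.
# project, location, and queue names below are made-up placeholders.
from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
queue_name = client.queue_path("my-project", "us-central1", "poll")

client.pause_queue(name=queue_name)   # tasks stop being dispatched
# ...wait, investigate, redeploy, etc...
client.resume_queue(name=queue_name)  # dispatching picks back up
```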


snarfed commented Apr 5, 2020

3/27 5pm: after about 36h, i'm now seeing some brief periods of respite, 10-15m here and there without instance terminations, but otherwise it's still happening.

3/28 5pm: this has been gradually subsiding over the last 24h. it's still happening, but less often, now only every 10-15m or so. see attached graph.

[attached graph]


snarfed commented Apr 5, 2020

3/29 7am, from google: ...Could you please provide me with your application’s runtime environment? I have found a particular issue with NodeJS runtimes which may be the cause of the issue.

I have also identified the possibility of this being an issue with manual scaling: since there is only one instance set up to deal with the requests coming from Cloud Tasks, that single instance would become overburdened[1] and would not scale up accordingly. I have also investigated further and noticed a significant burden on that one instance in terms of memory.

I would suggest seeing if increasing the number of instances under the respective service helps with your issue[2], or if changing the “instance_class”[3] would help alleviate the stress on the instance.

There is no guarantee for uptime for manual scaling instances[4], so please keep in mind the best practices when implementing manual scaling.

I also investigated the GCP outages; there was an incident on Friday which could have caused issues for all customers. As you have pointed out, there was an influx of 500 messages on Friday, likely caused by the outages. However, since these outages have already been fixed and the issue seems to still persist, they are not the root cause of the issue at hand[5].

[1]https://cloud.google.com/appengine/docs/standard/java/how-instances-are-managed#shutdown
[2]https://cloud.google.com/appengine/docs/standard/python/config/appref#manual_scaling
[3]https://cloud.google.com/appengine/docs/standard/python/config/appref#instance_class
[4]https://cloud.google.com/appengine/docs/standard/java/how-instances-are-managed#instance_uptime
[5]https://status.cloud.google.com/


snarfed commented Apr 5, 2020

3/29 10am: thank you for your reply!

this app is on the Python 3 Standard runtime. here's the YAML config file for the affected service: https://github.com/snarfed/bridgy/blob/master/background.yaml#L12

i'll try increasing the number of instances. however, this service has been happily serving for years with this same single instance configuration and the same load, including request volume, latency, and utilization. this issue started suddenly without any change to the app or its load.

fortunately, the issue has continued to gradually resolve itself. the instance is still getting terminated with error code 203 and no other cause i can see, but only every 15-60m or so:

[attached graph]


snarfed commented Apr 5, 2020

3/30 10am: this is slowly increasing again. the instance is now back to getting terminated every few minutes or so. graph attached.

i definitely buy that the router outage on 3/26 may have been the original cause, but if so, this issue itself evidently wasn't resolved when the router outage itself was resolved on 3/27.

[attached graph]


snarfed commented Apr 5, 2020

3/31 9am, from google: From what I have discovered through internal tools, your instance is overburdened and has been seeing memory errors throughout its lifespan, including the 203 errors you are facing as well. Please see if any of the solutions I have provided help with the issue you are facing, and please let me know the results. The issue may have been amplified by the outages, but that is not likely the cause; from what I have seen, the instance has been terminated due to the errors you are seeing since before the outages started occurring.

I have tried to reproduce your issue myself and have also been unsuccessful.


snarfed commented Apr 5, 2020

3/31 1pm: thank you for investigating so thoroughly! i very much appreciate it.

you're absolutely right, this instance runs hot and under high memory pressure, and it exceeds the memory limit and gets terminated multiple times per day.

however, that's been happening for many months now. it's definitely not ideal, but it has been stable and not a big problem. these error code 203 terminations are new, different, much more frequent, and started suddenly, at the same time as the router outage.

also, the log messages for the two types of terminations are different:

"While handling this request, the process that handled this request was found to be using too much memory and was terminated"

vs

"The process handling this request unexpectedly died. This is likely to cause a new process to be used for the next request to your application. (Error code 203)"

so, the two problems do both cause instance terminations (and request 500s), but otherwise they seem unrelated.
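
(as an aside, for the older out-of-memory terminations: one way to chase that kind of leak is to compare allocation snapshots per request with the standard library's tracemalloc. a rough sketch, not the app's actual code; the handler and helper names are made up.)

```python
# rough sketch: log the top allocation growth per request to find leak candidates.
# handle_poll and do_poll are hypothetical stand-ins, not the app's real handlers.
import logging
import tracemalloc

tracemalloc.start()

def do_poll(request):
    # hypothetical stand-in for the real work of the task handler
    return [bytearray(1024) for _ in range(1000)]

def handle_poll(request):
    before = tracemalloc.take_snapshot()
    do_poll(request)
    after = tracemalloc.take_snapshot()
    # log the call sites whose allocations grew the most during this request
    for stat in after.compare_to(before, "lineno")[:5]:
        logging.info("leak candidate: %s", stat)
```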


snarfed commented Apr 5, 2020

4/1 11am, from google: Hey, no problem, we are here to help the community identify any discrepancies with Google Cloud Platform functionalities. As I mentioned, your instance has been throwing both of these types of errors throughout its lifespan: the memory errors as well as the 203 errors that you have seen a spike in recently. Through my investigation using internal tools and the logs of your application, these errors (“The process handling this request unexpectedly died. This is likely to cause a new process to be used for the next request to your application. (Error code 203)”) have been thrown since March 5th and have increased in rate since; the outages must have increased the frequency of these errors temporarily. I am still unable to reproduce this error so far, so I would like to ask you for more information.
...

  • What kinds of operations are being performed by Cloud Tasks and then handled by your Cloud Tasks requests handler?
    • From what it seems, the load on your instance is significant and I would like to know if these operations are the cause of the failures in terms of memory and/or the 203 you are facing.


snarfed commented Apr 5, 2020

4/2 8am: thanks again for the deep investigation! you're absolutely right, this instance is somewhat overburdened, at least in terms of memory. i've added another instance to see if that helps, but i suspect it won't, since i believe it runs out of memory due to a memory leak, not due to the workload itself. (i've never quite been able to fix that leak. ☹️)

to answer your questions:

  • If you made any particular changes to your application logic, and/or task request handler code on or near March 5th.

yes! this is a great clue. i use Google's Python 3 ndb library, and on March 5, i switched to using its code for managing compressed TextPropertys in the datastore instead of my own. details in this commit and this ndb PR that added the feature. the engineers who maintain ndb and worked on that change (at Google, i expect) are Chris Rossi and Carlos de la Guardia.

i've temporarily reverted this change and redeployed my app without it to see if that helps. fingers crossed!
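
(roughly the kind of change this was, as a sketch with made-up model and property names. the compressed flag on TextProperty is an assumption here; the linked commit and ndb PR have the exact API.)

```python
# sketch of the before/after described above. model and property names are
# made up; TextProperty(compressed=True) is an assumption, see the linked PR.
import zlib
from google.cloud import ndb

# before: compress by hand and store the raw bytes
class ResponseBefore(ndb.Model):
    body_zipped = ndb.BlobProperty()

    def set_body(self, text):
        self.body_zipped = zlib.compress(text.encode())

    def get_body(self):
        return zlib.decompress(self.body_zipped).decode()

# after: let ndb compress and decompress transparently
class ResponseAfter(ndb.Model):
    body = ndb.TextProperty(compressed=True)  # assumed flag
```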

  • A minimum reproducible example of your Cloud Tasks request handler[1].
  • What kinds of operations are being performed by Cloud Tasks and then handled by your Cloud Tasks requests handler?

sure! the app is open source; here's the code for the task handlers. Poll is the heaviest and most frequent task, but all of the task handlers read data from the datastore, make multiple external HTTP requests to social network APIs and other web sites, and then read and write more data to the datastore. Poll tasks in particular can run for a while, often 5-15s or longer, largely due to the HTTP requests.

Poll in particular has a substantial amount of logic and external HTTP requests. i'll look into minimizing it, but i don't know that i'll have much luck.
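
(a hand-wavy sketch of that general shape, not the real Poll handler: model, field, and route names are made up, and Flask plus google-cloud-ndb are assumptions here.)

```python
# rough shape of a poll-style Cloud Tasks handler: read from the datastore,
# make slow external HTTP calls, write back. all names here are made up.
import requests
from flask import Flask, request
from google.cloud import ndb

app = Flask(__name__)
ndb_client = ndb.Client()

class Source(ndb.Model):               # hypothetical model
    silo_api_url = ndb.StringProperty()
    last_response = ndb.TextProperty()

@app.route("/_ah/queue/poll", methods=["POST"])
def poll():
    with ndb_client.context():
        source = ndb.Key(urlsafe=request.form["source_key"]).get()  # read
        # one or more slow external HTTP calls, often 5-15s in total
        resp = requests.get(source.silo_api_url, timeout=60)
        source.last_response = resp.text
        source.put()                                                 # write back
    return "", 200
```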

  • This being when the errors first started to occur, from what I have found, I would like to verify that no changes to the code at this time were responsible.

i didn't change anything in the app on or near 3/27. i did make the ndb change mentioned above on 3/5, but that was the only change at that time.


snarfed commented Apr 5, 2020

4/2 4pm: i tried reverting the compressed TextProperty change i mentioned and downgraded ndb from 1.1.2 to 1.1.1, but no luck. i'll try downgrading to 1.1.0 next. if that doesn't help, i'll try adding instances to see if relieving the memory pressure helps.

4/4 9am: no luck. i tried adding a second instance, but that didn't seem to help either. both instances still get terminated every few minutes or so with error code 203.

looking at the logs, instances do still get terminated due to memory, but much less often, only once every hour or two.

i'm switching it back to one instance now. for my own reference for looking at logs, the service ran on two instances from 4/3 ~8:30am PST to 4/4 ~9:30am PST. i'm also limiting all task queues to just one concurrent task, so the single instance will now only ever run one Poll task (the heaviest handler) and one Propagate task at a time.
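
(for reference, a sketch of capping a queue at one concurrent task via the Cloud Tasks admin API; the project, location, and queue names are placeholders. the equivalent queue.yaml setting is max_concurrent_requests.)

```python
# sketch: limit a queue to one concurrent dispatch. names are placeholders.
from google.cloud import tasks_v2
from google.protobuf import field_mask_pb2

client = tasks_v2.CloudTasksClient()
queue = tasks_v2.Queue(
    name=client.queue_path("my-project", "us-central1", "poll"),
    rate_limits=tasks_v2.RateLimits(max_concurrent_dispatches=1),
)
mask = field_mask_pb2.FieldMask(paths=["rate_limits.max_concurrent_dispatches"])
client.update_queue(queue=queue, update_mask=mask)
```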


snarfed commented Apr 5, 2020

4/4 2pm, from google: Thanks for your patience during this investigation, but I must inform you that at this point in the investigation, I can safely assume that the issue is not a GAE-wide issue; it seems to be directly associated with your application logic/code. I have come to this conclusion because, first of all, most issues with 203 error codes are directly related to customer code, as I found using internal tools, and I have already ruled out all other causes for 203 errors. Issues on the Issue Tracker are meant to identify platform-wide issues (service-wide issues) which may be affecting all customers. I would suggest investigating further, either through a support case[1] for dedicated support, or by asking the community on Stack Overflow[2] for suggestions about this error as well as the memory leak you have mentioned. We would love to continue helping you with this issue, but as it is not platform wide, we recommend using our other support channels to do so.

I did, however, discover that the issue could be related to your datastore operations through Cloud Tasks. It turns out that transactions expire after 60 seconds, or after being idle for 30 seconds[3], and a likely possibility is that your datastore operations may be taking longer than 60 seconds, causing these errors to occur. A good way to mitigate this possible issue is by applying the datastore best practices[4].

[1]https://cloud.google.com/support/docs#role_based_support
[2]https://cloud.google.com/support/docs/community#technical_questions_stack_exchange_stack_overflow_and_server_fault
[3]https://cloud.google.com/datastore/docs/concepts/transactions#using_transactions
[4]https://cloud.google.com/datastore/docs/best-practices
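
(To illustrate the transaction suggestion above: keep the slow external HTTP calls outside the transaction so the transaction itself only does a quick read and write. A minimal sketch with google-cloud-ndb; the model and field names are hypothetical, not the app's actual code.)

```python
# sketch: keep the transaction short by doing slow HTTP work outside it.
# Source and its fields are hypothetical, not the app's real models.
import requests
from google.cloud import ndb

ndb_client = ndb.Client()

class Source(ndb.Model):
    silo_api_url = ndb.StringProperty()
    last_response = ndb.TextProperty()

def poll_one(source_id):
    with ndb_client.context():
        source = Source.get_by_id(source_id)
        # slow part: external HTTP call, done with no transaction open
        resp = requests.get(source.silo_api_url, timeout=60)

        @ndb.transactional()
        def save():
            fresh = Source.get_by_id(source_id)  # re-read inside the transaction
            fresh.last_response = resp.text
            fresh.put()

        save()  # the transaction itself is just one read and one write
```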


Ryuno-Ki commented Apr 5, 2020

So … have you asked on StackOverflow yet?


snarfed commented Apr 5, 2020

not yet, but that's a good next step.

@armingrewe

Not really a tech person, so I at best understand half of this. But is this the reason the mentions currently only arrive intermittently, sometimes only after I manually request a pull?


snarfed commented Apr 5, 2020

@armingrewe hmm! this means they might be delayed a bit, but only hours if so, not days, and shouldn't mean any are missed entirely. if you're seeing some missed entirely, and https://brid.gy/about#missing doesn't explain them, feel free to file another issue!


Ryuno-Ki commented Apr 5, 2020

So from what I gather, your program is under such high pressure that Google is killing it off to save the rest of their platform (and restarting a new instance).

Sounds like a use-case for load-balancing. But you already tried to spin up a second instance.
So … do you have a diagram displaying the architecture? Perhaps using Redis as a cache could help … something like a task queue.

Or, perhaps, Python is the wrong language for this kind of problem (or a subset of it).
I wonder whether Go, Rust or Node.js could do a better job here (but I know that many Pythonistas hate to leave their language).
What do you think?

@armingrewe

@snarfed sorry, I don't think I described it very well. That's pretty much what I meant. It sometimes seems to take many hours until it runs / picks them up; if memory serves me correctly, I've seen cases where it said "last pulled 11 hours ago". Out of impatience I've then clicked on "Pull now", which then sometimes still took an hour or two to run and send. As far as I can tell none were missed, though.


snarfed commented Apr 5, 2020

@Ryuno-Ki thanks for thinking about this!

we haven't conclusively found the root cause yet. the google support person thinks it's load/pressure, but i don't entirely agree. this service is network I/O bound, not CPU bound. it has been under a bit of memory pressure for months now, and occasionally uses too much and gets terminated, but only a few times per day, with an "out of memory" error message.

this pattern of terminations, on the other hand:

  • started suddenly, without any change to bridgy itself, its load, any silo API, etc.
  • happens much more often - multiple times per minute at the beginning, now every 5-10m
  • includes a different error message
  • doesn't go away when i drastically reduce or even temporarily pause most of the load
  • doesn't go away when i turn on autoscaling

(as for its architecture, it's a background task processing service that handles polling silo accounts, sending webmentions, and superfeedr notifications. it caches most things that it can, and it's not latency sensitive. if it were CPU bound and latency sensitive, or if i couldn't add instances, i'd agree that a new language might help. that would be a massive rewrite, though, so i'm glad it's probably not warranted.)


Ryuno-Ki commented Apr 5, 2020

I read all the logs you provided here, but thanks for the summary.
What I was thinking was rather to turn this background task processing service into a microservice that communicates via a Redis store (or events, like Kafka). Python has a tendency toward monolithic architecture.

To explore the idea, you could pick one of your streams (e.g. webmentions) and duplicate the events to a second queue, which would then get picked up by the microservice (see the sketch below).
This should allow you to compare the performance of the two against each other.
Plus, it has a limited scope, so it shouldn't be too much investment to prototype (not sure whether you'd need to learn another programming language for this; I could write something in Node.js, but I'm not familiar with GCP, so I'd deploy it to Heroku instead).
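
(A sketch of what that duplication could look like with the Cloud Tasks client: enqueue each event onto the existing queue and onto a second queue whose target is the prototype service. Queue names, routes, and the prototype URL are all made up.)

```python
# sketch: fan each webmention event out to the existing queue and to a
# second queue targeting a prototype service. all names/URLs are made up.
import json
from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()

def enqueue_webmention(event):
    body = json.dumps(event).encode()

    # existing App Engine handler
    client.create_task(
        parent=client.queue_path("my-project", "us-central1", "webmention"),
        task=tasks_v2.Task(app_engine_http_request=tasks_v2.AppEngineHttpRequest(
            http_method=tasks_v2.HttpMethod.POST,
            relative_uri="/_ah/queue/webmention",
            body=body,
        )),
    )

    # duplicate copy for the prototype, wherever it is hosted
    client.create_task(
        parent=client.queue_path("my-project", "us-central1", "webmention-proto"),
        task=tasks_v2.Task(http_request=tasks_v2.HttpRequest(
            http_method=tasks_v2.HttpMethod.POST,
            url="https://example-prototype.herokuapp.com/webmention",
            body=body,
        )),
    )
```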


snarfed commented Apr 5, 2020

i'm not sure how splitting this service into microservices would help here. we still don't fully understand the root cause of the problem itself, so it's hard to think about solutions when we don't yet know exactly what we need to solve. i get that microservices could reduce the memory footprint, but given all the evidence, i'm not yet convinced that that's the problem.


snarfed commented Apr 5, 2020

btw, i'm happy to talk in indieweb chat too if you want!


snarfed commented Apr 7, 2020

this has gradually subsided and pretty much gone away entirely over the last couple days, without any obvious change on the app's side that would have fixed it. guess it may remain a mystery. at least it's gone! hope it stays that way.


Ryuno-Ki commented Apr 7, 2020

So … preliminarily closing this issue then?


snarfed commented Apr 7, 2020

maybe! i'll give it a few more days and see first.


snarfed commented Apr 12, 2020

seems stable. guess this did indeed fix itself. maybe something internal to app engine, maybe something else. unsatisfying, but oh well. 🤷

snarfed closed this as completed Apr 12, 2020
snarfed added a commit that referenced this issue Apr 15, 2020
doesn't seem to affect throughput, still doing >10k polls per day. 500s seem down a bit, just 2-3% of all poll requests now. for #931.

snarfed commented Apr 30, 2020

this has crept back up over the last few days. restarts are back to every couple minutes now on average. ☹️

[attached graph]

snarfed reopened this Apr 30, 2020

snarfed commented May 28, 2020

closing again, it settled back down a few weeks ago.
[attached graph]

snarfed closed this as completed May 28, 2020