background instance getting terminated constantly #931
Comments
3/27 5pm: after about 36h, i'm now seeing some brief periods of respite, 10-15m here and there without instance terminations, but otherwise it's still happening.

3/28 5pm: this has been gradually subsiding over the last 24h. it's still happening, but less often, now only every 10-15m or so. see attached graph.
3/29 7am, from google: ...Could you please provide me with your application’s runtime environment? I have found a particular issue with NodeJS runtimes which may be the cause of the issue. I have also identified the possibility of this being an issue with manual scaling: since there is only one instance set up to deal with the requests coming from Cloud Tasks, that single instance would become overburdened[1] and would not scale up accordingly due to manual scaling. I have also investigated further and noticed a significant burden on that one instance in terms of memory. I would suggest seeing if increasing the number of instances under the respective service helps with your issue[2], or if changing the “instance_class”[3] alleviates the stress on the instance. There is no guarantee of uptime for manual scaling instances[4], so please keep in mind the best practices when implementing manual scaling. I also investigated the GCP outages; there was an incident on Friday which could have caused issues for all customers. As you have pointed out, there was an influx of 500 messages on Friday, likely caused by the outages. However, since those outages have already been fixed and the issue seems to persist, they are not the root cause of the issue at hand[5]. [1] https://cloud.google.com/appengine/docs/standard/java/how-instances-are-managed#shutdown
3/29 10am: thank you for your reply! this app is on the Python 3 Standard runtime. here's the YAML config file for the affected service: https://github.com/snarfed/bridgy/blob/master/background.yaml#L12 i'll try increasing the number of instances. however, this service has been happily serving for years with this same single instance configuration and the same load, including request volume, latency, and utilization. this issue started suddenly without any change to the app or its load. fortunately, the issue has continued to gradually resolve itself. the instance is still getting terminated with error code 203 and no other cause i can see, but only every 15-60m or so:
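For reference, both of Google's suggestions (more instances, bigger instance class) are single-line changes in the service's YAML config. A sketch of what the change being tried might look like; the service name, runtime, and values here are illustrative, not the app's actual config:

```yaml
# background.yaml (sketch -- values illustrative)
service: background
runtime: python37

manual_scaling:
  instances: 2        # was 1; trying Google's suggestion of adding an instance

instance_class: B4    # a class with more memory than the default B2
```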
3/30 10am: this is slowly increasing again. the instance is now back to getting terminated every few minutes or so. graph attached. i definitely buy that the router outage on 3/26 may have been the original cause, but if so, this issue evidently wasn't resolved when the router outage itself was fixed on 3/27.
3/31 9am, from google: From what I have discovered through internal tools, your instance is overburdened and has been seeing memory errors throughout its lifespan, including the 203 errors you are facing as well. Please see if any of the solutions I have provided help with the issue you are facing, and please let me know the results. The issue may have been amplified by the outages, but it is not likely the cause: from what I have seen, the instance was being terminated due to the errors you are seeing from before the outages started occurring. I have tried to reproduce your issue myself and have also been unsuccessful.
3/31 1pm: thank you for investigating so thoroughly! i very much appreciate it. you're absolutely right, this instance runs hot and high on memory pressure, and it exceeds the memory limit and gets terminated multiple times per day. however, that's been happening for many months now. it's definitely not ideal, but it has been stable and not a big problem. these error code 203 terminations are new, different, much more frequent, and started suddenly, at the same time as the router outage. the log messages for the two types of terminations also differ: "While handling this request, the process that handled this request was found to be using too much memory and was terminated" vs "The process handling this request unexpectedly died. This is likely to cause a new process to be used for the next request to your application. (Error code 203)". so, the two problems do both cause instance terminations (and request 500s), but otherwise they seem unrelated.
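One way to back up the claim that these are two distinct failure modes is to tally each termination message separately in exported logs. A minimal sketch, assuming the logs have been exported to plain text; the matched substrings come from the two log messages quoted above, and the sample lines are fabricated for illustration:

```python
from collections import Counter

# Substrings of the two distinct App Engine termination messages quoted
# above; matching on a fragment is enough to tell them apart.
OOM_MSG = "was found to be using too much memory"
DIED_MSG = "unexpectedly died"  # the error code 203 terminations

def count_terminations(log_lines):
    """Tally each kind of instance-termination log message."""
    counts = Counter()
    for line in log_lines:
        if OOM_MSG in line:
            counts["oom"] += 1
        elif DIED_MSG in line:
            counts["203"] += 1
    return counts

# Fabricated example log lines:
sample = [
    "09:01 The process handling this request unexpectedly died. (Error code 203)",
    "09:03 ...was found to be using too much memory and was terminated",
    "09:04 The process handling this request unexpectedly died. (Error code 203)",
]
print(count_terminations(sample))  # Counter({'203': 2, 'oom': 1})
```

Graphing the two counts per hour would show whether they spike together or independently.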
4/1 11am, from google: Hey, no problem, we are here to help the community identify any discrepancies with Google Cloud Platform functionalities. As I mentioned, your instance has been throwing both of these types of errors throughout the lifespan of the instance, both the memory errors as well as the 203 errors that you have seen a spike in recently. Through my investigation using internal tools and the logs of your application these errors (“The process handling this request unexpectedly died. This is likely to cause a new process to be used for the next request to your application. (Error code 203)”) have been thrown since March 5th, and have increased in rate since, however the outages must have increased the frequency of these errors temporarily. I am still unable to reproduce this error so far, so I would like to ask you for more information.
4/2 8am: thanks again for the deep investigation! you're absolutely right, this instance is somewhat overburdened, at least in terms of memory. i've added another instance to see if that helps, but i suspect it won't, since i believe it runs out of memory due to a memory leak, not due to the workload itself. (i've never quite been able to fix that leak.) to answer your questions:
yes! this is a great clue. i use Google's Python 3 ndb library, and on March 5, i switched to using its code for managing compressed … i've temporarily reverted this change and redeployed my app without it to see if that helps. fingers crossed!
sure! the app is open source; here's the code for the task handlers.
i didn't change anything in the app on or near 3/27. i did make the ndb change mentioned above on 3/5, but that was the only change at that time.
4/2 4pm: i tried reverting the compressed …

4/4 9am: no luck. i tried adding a second instance, but that didn't seem to help either. both instances still get terminated every few minutes or so with error code 203. looking at logs, instances still do get terminated due to memory, but much less often, only once every hour or two. i'm switching it back to one instance now. for my own reference when looking at logs: the service ran on two instances from 4/3 ~8:30am PST to 4/4 ~9:30am PST. i'm also limiting all task queues to just one concurrent task, so the single instance will now only ever run one …
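Limiting each queue to one concurrent task, as described above, is a queue-level setting in App Engine's queue.yaml. A sketch of the shape of that change; the queue names and rates here are illustrative, not the app's actual queue config:

```yaml
# queue.yaml (sketch -- queue names and rates illustrative)
queue:
- name: poll
  rate: 1/s
  max_concurrent_requests: 1   # at most one task in flight per queue
- name: webmention
  rate: 1/s
  max_concurrent_requests: 1
```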
4/4 2pm, from google: Thanks for your patience during this investigation, but I must inform you that at this point, I can safely assume the issue is no longer a GAE-wide issue; it seems to be directly associated with your application logic/code. I have come to this conclusion because, first of all, most issues with 203 error codes are directly related to customer code, as I found using internal tools. All other causes for 203 errors I have already ruled out. Issues on Issue Tracker are meant to identify platform-wide issues (service-wide issues) which may be affecting all customers. I would suggest investigating further either through a support case[1] for dedicated support, or asking the community on Stack Overflow[2] for suggestions about this error as well as the memory leak you have mentioned. We would love to continue helping you with this issue, but as it is not platform-wide, we recommend using our other support channels to do so. I did, however, discover that the issue could be related to your datastore operations through Cloud Tasks. It turns out that transactions expire after 60 seconds, or if idle for 30 seconds[3], and a likely possibility is that your datastore operations may be taking longer than 60 seconds, causing these errors to occur. A good way to mitigate this possible issue is by applying the datastore best practices[4]. [1] https://cloud.google.com/support/docs#role_based_support
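On the 60-second transaction limit Google mentions: one common mitigation is to split a large batch of datastore writes into several short transactions, so no single transaction approaches the deadline. A minimal, library-free sketch of the chunking pattern; `commit_batch` is a hypothetical stand-in for committing one short transaction, not a real datastore call:

```python
def chunked(items, size):
    """Yield successive fixed-size chunks of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def commit_batch(batch):
    """Hypothetical stand-in for committing one short transaction."""
    return len(batch)

def save_all(entities, chunk_size=25):
    """Write entities in many small transactions instead of one long
    one, keeping each commit well under the 60s transaction deadline."""
    written = 0
    for batch in chunked(entities, chunk_size):
        written += commit_batch(batch)
    return written

print(save_all(list(range(103))))  # 103
```

The tradeoff is that the overall write is no longer atomic, so each batch needs to be safe to retry on its own.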
So … have you asked on StackOverflow yet?
not yet, but that's a good next step.
Not really a tech person, so I at best understand half of this. But is this the reason the mentions currently only arrive intermittently, sometimes only after me manually requesting a pull?
@armingrewe hmm! this means they might be delayed a bit, but only hours if so, not days, and shouldn't mean any are missed entirely. if you're seeing some missed entirely, and https://brid.gy/about#missing doesn't explain them, feel free to file another issue!
So, from what I gather, your program is under such high pressure that Google is killing it off to save the rest of their platform (and restarting a new instance). Sounds like a use case for load balancing. But you already tried to spin up a second instance. Or perhaps Python is the wrong language for this kind of problem (or a subset of it).
@snarfed sorry, I think I didn't describe it very well. That's pretty much what I meant. It sometimes seems to take many hours until it runs / picks them up; if memory serves, I've seen cases where it said "last pulled 11 hours ago". Out of impatience I've then clicked "Pull now", which sometimes still took an hour or two to run and send. As far as I can tell, none were missed though.
@Ryuno-Ki thanks for thinking about this! we haven't conclusively found the root cause yet. the google support person thinks it's load/pressure, but i don't entirely agree. this service is network I/O bound, not CPU bound. it has been under a bit of memory pressure for months now, and occasionally uses too much and gets terminated, but only a few times per day, with an "out of memory" error message. this pattern of terminations, on the other hand:
(as for its architecture, it's a background task processing service that handles polling silo accounts, sending webmentions, and superfeedr notifications. it caches most things that it can, and it's not latency sensitive. if it were CPU bound and latency sensitive, or if i couldn't add instances, i'd agree that a new language might help. that would be a massive rewrite, though, so i'm glad it's probably not warranted.)
I read all the logs you provided here, but thanks for the summary. To explore the idea: you could pick one of your streams (e.g. webmentions) and duplicate the events to a second queue, which would then get picked up by the microservice.
i'm not sure how splitting this service into microservices would help here. we still don't fully understand the root cause of the problem itself, so it's hard to think about solutions when we don't yet know exactly what we need to solve. i get that microservices could reduce the memory footprint, but given all the evidence, i'm not yet convinced that that's the problem.
btw, i'm happy to talk in indieweb chat too if you want!
this has gradually subsided and gone away pretty much entirely over the last couple of days, without any obvious change on the app's side that could have fixed it. guess it may remain a mystery. at least it's gone! hope it stays that way.
So … preliminarily closing this issue then?
maybe! i'll give it a few more days and see first.
seems stable. guess this did indeed fix itself. maybe something internal to app engine, maybe something else. unsatisfying, but oh well. 🤷
doesn't seem to affect throughput, still doing >10k polls per day. 500s seem down a bit, just 2-3% of all poll requests now. for #931.
originally reported on google's issue tracker, 2020-03-27. i'm going to copy its comments here, more or less one for one.
for a while now, the background service's single instance (on manual scaling) runs out of memory, dies, and gets restarted a handful of times per day. not great, but tolerable.
starting 3/26 around 9am PST (4pm UTC), its instance suddenly began getting terminated multiple times per minute. when this happens, all currently processing requests die, return 500, and show the usual error code 203 log message:
due to this, roughly 40-50% of all requests to this service 500ed for the first day or so after this, usually after spending some time running normally. the rest of the requests succeeded, as usual, and returned 200 or occasionally 304.
i looked for other 500s from my app itself that might be triggering the scheduler to restart the instance, but no luck. nothing in the rest of the logs seems related either.
did something on App Engine's side change around 9am PST 3/26? i know that's when the Atlanta router failure outage started, but that was supposedly fixed by around 1pm yesterday. could this be related, even though it's still happening?
i didn't deploy or otherwise make any changes to the app at all anytime on 3/26, and i'm not aware of any other changes in external dependencies that could have triggered this.
i tried redeploying from scratch. i also paused the Cloud Tasks queues feeding requests to this instance, waited ~20m, and then resumed them. no luck, the instance still constantly gets terminated while it's serving requests.