This issue was moved to a discussion. You can continue the conversation there.
Experiencing long lambda cold start delays of 2 - 3 seconds on Vercel #6292
Comments
@styxlab The issue you experienced is caused by the database (in this case Nexus) not being optimised for serverless connections. Database connections cannot be shared between serverless invocations across cold boots. Therefore, each time your serverless function is called (while cold), a new database connection needs to be established. You can get around this by adding a pooler between your database and the serverless function or by switching to a serverless-friendly database. You can find more information about this over at https://vercel.com/docs/solutions/databases#connecting-to-your-database.
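One common mitigation for the connection problem described above (a sketch, not from the issue's repo; `getClient` and the `connect` factory are illustrative names) is to cache the database client in module scope, so warm invocations reuse the connection and only cold starts pay the setup cost:

```javascript
// Reuse a database client across warm invocations by caching it in module
// scope. `getClient` and the `connect` factory are illustrative names, not
// APIs from the issue's repo.
let cachedClient = null;

async function getClient(connect) {
  // `connect` is any async factory that opens a connection (e.g. a pg Pool).
  if (!cachedClient) {
    cachedClient = await connect(); // paid only on a cold start
  }
  return cachedClient; // warm invocations reuse the same client
}

module.exports = { getClient };
```

Poolers such as PgBouncer attack the same problem from the database side, multiplexing many short-lived lambda connections onto a few real ones.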
Thanks @williamli for looking into this issue. Unfortunately, connection pooling cannot explain the issue; I ruled that out already. Why? Because I took the database out of the example: there is not a single call to a database! In the example I simply return mocked data in the GraphQL resolver, which is where a real-world example would make a request to the database. Nexus is not a database, it is a GraphQL schema generator, and Prisma is an ORM or model mapper. I include them in the example because they have an influence on the problem (maybe through lambda function bundle size, I don't know).
Just for the record, I enhanced the reproduction example with
Here are some additional findings:
As Vercel infra is basically a black box, I would very much appreciate some more insight into what determines the cold starts and what can be done to reduce them (both on the user and Vercel side). It would also be interesting to know why warming does not help in all cases.
If you develop with plain AWS, you can significantly decrease cold start time by increasing function memory size (this will also give you more virtual CPU cores). I think you can also change the memory setting in Vercel.
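For reference, a memory setting can be expressed per function in `vercel.json`; a minimal sketch, assuming the glob pattern matches your API routes and that the chosen value is available on your plan:

```json
{
  "functions": {
    "api/**/*.js": {
      "memory": 1024
    }
  }
}
```

On AWS Lambda, CPU is allocated proportionally to memory, which is why a higher memory setting can shorten cold starts even for small bundles.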
@styxlab I'm having this issue as well, my backend uses Prisma + Nexus + Vercel. I'm using connection pooling, so I know it's not what @williamli mentioned in his comment. Have you made any more progress on this issue?
@nhuesmann This is still an unsolved problem for me, and that's why I still run my API endpoints on a DigitalOcean droplet (everything else on Vercel). The best I could do with Vercel lambdas was to call the endpoints every 3-4 minutes (warming), but that didn't reliably help all the time. It's also difficult to test that from different regions. I am planning to write an in-depth blog article about my findings but do not yet know when I will have the time for this.
We're (https://github.com/gooditcollective) quite interested in this as well, since we build all of our clients' projects on Vercel. Specifically, we use a GraphQL function in a plain Node.js Vercel environment, made with Apollo Server. Even a completely minimal solution (with no external connections to databases and similar things, with no extra code dependencies, just a plain Apollo Server initialisation and a single HTTP handler made with micro) boots up in about 1.5-2 seconds. We'd love to find a way to make it reasonable (I guess 200-500 ms would already be satisfactory). Are there any suggestions or ideas from Vercel's team or the community, I wonder? We'll try limiting function memory size, but I reckon that will have a small effect, if any. Rewarming is something we will do also, but this feels like a broken solution and is unreliable. Is there anything we can try?
@neoromantic I don't want to get in the way of a reply from @vercel, but it's good to see you reporting figures that correspond very well with my own observations (I am also using apollo-micro and tested with empty resolvers, no DB connection). I am also very interested in moving my temporary solution (GraphQL API endpoints on DO) back to Vercel, but the performance difference is really huge, as I am getting <~ 100 ms consistently without worrying about cold starts. I am a bit puzzled as to why this topic does not get more attention; it seems to me that all apps using serverless functions would run into this issue sooner or later. In any case, a real solution would probably have to come from AWS, so maybe it's better addressed there?
If I remember correctly, Vercel's position and strategy is that their platform is very much cache-oriented. So Vercel's main use case is not hosting real-time API functions, but generating a response and caching it so it can be delivered statically. Personally, I want to consider solutions like fly.io, which allows a multi-zone setup for a GraphQL server and a Redis cache backend, for example. But I've been using Vercel (called Zeit then) since the very first versions, and I adore their ideology and wonderful support. So I'm very hopeful that we would at least get an understanding of how to manage cold boot times.
I am experiencing 10s+ cold starts with a 255kb function; it's quite a deal breaker.
@timuric This sounds a bit high, with a 255kb function I would expect cold start times of ~ 1 second. Did you make the following checks?
I missed the latter check initially, that's why I ended up with ~ 7 secs, because individual cold starts accumulated. Once you understand your access pattern, you can optimize. However, the barrier of ~ 1 second remains, which is still a big issue for me. |
Just wanted to chime in here to confirm that we are also running into this exact same issue with, in fact, the same stack causing this (apollo-server-micro with nexus). As the OP already mentioned, this has nothing to do with the database as we also ruled that out entirely (returning stubbed data performs exactly as poorly as it does with a database connection). |
We are experiencing similar issues, though cold start times are shorter for us, at around 1.5s. Something that surprised me was that for Next.js (which is what we use), API endpoints are bundled together up to a size of 50mb. Therefore, despite having a number of separate API endpoints, they are actually bundled together with a size of ~30mb. This is to reduce the number of cold starts and keep things warm. However, when there is a cold start (which happens quite frequently, as the API does not experience high traffic), it is long enough to cause issues for our application. I haven't tried creating a small endpoint to keep it warm yet, but will try that next and see what effect it has.
A solution to the problem: https://vercel.com/docs/concepts/functions/edge-functions ? |
Can't see how this is a solution. Edge Functions are limited to 1mb in size and must return a response within 1.5s. That's great for authentication and other quick routes, but not for APIs, as far as I can tell. So, I guess, in a real production setup we should refrain from hosting APIs on Vercel and consider a serverful approach, or consider always-hot serverless functions like in Google Cloud.
I'm dealing with the same issue! Any feedback? |
Some problems for me using apollo-micro-server and MongoDB Atlas - the cold-start is unpredictable and varies in stability and duration. Is this just a problem with GraphQL in combination with Vercel or a more general problem with Vercel? |
Can confirm having the same slow cold boots using Nuxt3 (Node & Vue3) |
Having the same slow cold boots using GRAND stack. Any advice would be really helpful |
We opted for pinging the lambda every 5 minutes. It's far from good and against some serverless practices, but it eliminates the cold boot entirely and does not affect our quota limits that much. We have started to think about migrating away to a solution that isn't serverless.
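A warming ping like the one described can be scheduled outside Vercel, for example with a GitHub Actions cron job; a sketch, where the endpoint URL and the 5-minute interval are placeholders:

```yaml
# .github/workflows/keep-warm.yml
# Hypothetical warming job: hits the endpoint on a schedule so the lambda
# container is less likely to be recycled.
name: keep-warm
on:
  schedule:
    - cron: "*/5 * * * *" # every 5 minutes (GitHub may delay scheduled runs)
jobs:
  ping:
    runs-on: ubuntu-latest
    steps:
      - run: curl -fsS "https://example.vercel.app/api/graphql?query=%7B__typename%7D" > /dev/null
```

Note that this only keeps one container warm; concurrent traffic or requests routed through other regions can still hit cold starts, which matches the unreliability reported above.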
Just reporting the same here. |
My current solution is to rewrite all API routes to a DigitalOcean droplet where I run a copy of my Next.js project. So, pages are served from Vercel but my GraphQL endpoint runs on a real server with no cold start delays.
Having the same issues with long cold starts. It used to be ~0.5s on Next.js
@piotrpawlik: I haven't noticed longer cold start times after 12.1.0. However, I am also experiencing accumulating cold start issues with Unfortunately, this is in addition to the cold start time of ~ 1 sec of the calling lambda function itself (hence accumulating). With some inevitable network latency, the cold start of a revalidate endpoint will take approx. 3 seconds in total. I experimented with warming, but you have to trigger every edge server worldwide, so this is not a practical workaround. I am sad to say, but cold start issues are the biggest bummer with Vercel/AWS lambda. |
currently having this problem |
We are experiencing long cold starts for Next.js SSR; the resulting bundles are roughly 260B. From the logs we see an Init Duration of 4-5s, which is far from an acceptable dynamic web response time. Is it possible to increase the memory size for SSR functions?
This has become a completely untenable issue for our application. API endpoints are basically useless on Vercel because of the cold start issue...why is this not solved? |
+1 API endpoints take way too long on a cold start. Will have to find a different solution. It's quite sad the lack of response here from the Vercel team. I also find it strange that my API function size is 30MB+ even though I only have a couple small functions (and @next/bundle-analyzer is reporting them at 200kb...). I love Vercel but this is super disappointing. |
Having the same issue with signin/signup API routes. I spent a while adjusting email generation, trialling templating, switching from SMTP to a REST API for mail, etc., but discovered that although I see ~8s for the first test, if I log out and straight back in, the second attempt is much faster (<~1s?, which is fine for me). So I guess my problem is not my emailing but the cold start behaviour? (Reading above, it seems I might significantly reduce the 8s by flattening my API requests down to a single endpoint; not sure that will be ideal for DRY, but worth it if it cuts 8s down to 2s, which would be a bit annoying but no longer terrible.) Maybe the Edge Functions will solve my problem; I will have to try and see if I can rewrite using those (my functions are small, so hopefully they will fit in the 1MB limit).
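The "flatten everything into a single endpoint" idea can be sketched with a Next.js catch-all API route (e.g. `pages/api/[...route].js`), so only one function has to warm up; the route names and handlers below are illustrative, not from anyone's repo:

```javascript
// Sketch: collapse several API endpoints into one Next.js catch-all route
// so a single lambda serves them all and only one container must warm up.
// Route names and handler bodies are illustrative placeholders.
const routes = {
  signin: async (req, res) => res.end("signin ok"),
  signup: async (req, res) => res.end("signup ok"),
};

async function handler(req, res) {
  // For pages/api/[...route].js, Next.js exposes the path segments as an
  // array on req.query.route.
  const [name] = req.query.route || [];
  const fn = routes[name];
  if (!fn) {
    res.statusCode = 404;
    return res.end("not found");
  }
  return fn(req, res);
}

module.exports = handler;
```

The trade-off is a bigger single bundle, but as noted above Next.js already groups API routes into shared bundles, so the size cost may be smaller than expected.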
I wrote up ways to debug and detect your root issue with Serverless Function performance decreases. Apologies for the slow response - let me know if this helps 😄 |
Bug description
My Next.js app is deployed to Vercel and uses a lambda route for a GraphQL server (apollo-server-micro) that is configured with Prisma + Nexus. Lambda cold starts on Vercel lead to slow queries that take approximately 7 seconds. I see the 7 seconds on the private deploy with project name "blogody". A typical cold start signature looks as follows:
x-vercel-id | cdg1::iad1::28qc9-1622127003386-1cc4995040e3
As I cannot share this repo publicly, I made a smaller example that still shows smaller but significant cold start times of approximately 2.5 seconds. I have not managed to find the influencing factors, and I hope Vercel can shed some light on them. Here is the deploy output for the serverless functions:
I see the cold starts after approx. 10 minutes inactivity, but that could vary.
I put some simple timestamps into the app, both on the client and the server. From those timestamps, you can see that in the case of a cold start the total query time is governed by the waiting time between query initiation (`start request`) and endpoint function invocation (`before fetching`). I hope Vercel can debug what happens within this timespan and give some guidance on how to reduce it. Some screenshots from the example:
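The timestamp measurement described above can be sketched as a small helper: the client sends its request-start time (the header name below is made up), and the server compares it with the time the handler actually runs; the gap approximates cold start plus network latency:

```javascript
// Sketch of the cold-start measurement: the client attaches its request
// start time (epoch ms) in a custom header such as "x-request-start" (a
// made-up name), and the server computes the gap when the handler runs.
function coldStartGapMs(requestStartHeader, handlerInvokedAt) {
  const started = Number(requestStartHeader); // epoch ms sent by the client
  if (!Number.isFinite(started)) return null; // header missing or malformed
  return handlerInvokedAt - started;
}

module.exports = { coldStartGapMs };
```

Client clocks are not perfectly synchronised with the server's, so this is only an estimate, but on a warm invocation the gap should collapse to roughly the network round-trip, making cold starts easy to spot in logs.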
How to reproduce
Expected behavior
I know that cold starts cannot be fully eliminated, but cold start times of 2 - 7 seconds are a problem for me. I can accept cold start time of roughly 1 second. Thus, I expect the following help from this issue:
I have also opened an issue with @prisma to see if the issue is amplified by that stack.
Additional information
You can find `package.json` and the Prisma schema in the linked repo. Note that the example takes out any calls to the database (all Prisma queries are taken out of the GraphQL resolvers). This is to show that we are indeed dealing with a cold start issue and not database latency. I'd be happy to provide more information if needed.