
Experiencing long lambda cold start delays of 2 - 3 seconds on Vercel #6292

Closed

styxlab opened this issue May 27, 2021 · 31 comments

Comments

@styxlab

styxlab commented May 27, 2021

Bug description

My Next.js app is deployed to Vercel and uses a lambda route for a GraphQL server (apollo-server-micro) that is configured with Prisma + Nexus. Lambda cold starts on Vercel lead to slow queries that take approximately 7 seconds. I see the 7 seconds on the private deploy with project name "blogody". A typical cold start signature looks as follows:

x-vercel-id | cdg1::iad1::28qc9-1622127003386-1cc4995040e3

image

image

As I cannot share this repo publicly, I made a smaller example that still shows smaller but significant cold start times of approximately 2.5 seconds. I have not managed to pin down the influencing factors and I hope Vercel can shed some light on them. Here is the deploy output for the serverless functions:


00:04:49.070 | Serverless function size info
00:04:49.071 | Serverless Function's page: api/graphql.js
00:04:49.074 | Large Dependencies                          Uncompressed size  Compressed size
00:04:49.074 | node_modules/.prisma/client                             47 MB          16.4 MB
00:04:49.074 | node_modules/prettier/parser-typescript.js            3.18 MB           817 kB
00:04:49.074 | node_modules/prettier/index.js                        1.72 MB           396 kB
00:04:49.074 | node_modules/prettier/parser-flow.js                  3.11 MB           310 kB
00:04:49.074 | node_modules/@prisma/client                           1.17 MB           243 kB
00:04:49.074 | node_modules/busboy/deps                               618 kB           196 kB
00:04:49.075 | node_modules/micro/node_modules                        360 kB           190 kB
00:04:49.075 | node_modules/encoding/node_modules                     329 kB           179 kB
00:04:49.075 | node_modules/lodash/lodash.js                          544 kB          96.4 kB
00:04:49.075 | All dependencies                                      63.5 MB          20.2 MB
00:04:49.163 | Created all serverless functions in: 43.168s

I see the cold starts after approx. 10 minutes of inactivity, but that could vary.

I put some simple timestamps into the app, both on the client and the server. From those timestamps, you can see that in the case of a cold start, the total query time is governed by the waiting time between query initiation and endpoint function invocation (start request - before fetching). I hope Vercel can debug what happens within this timespan and give some guidance on how to reduce it.
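For illustration, a minimal sketch of the kind of timestamps I mean (the query, file names and log labels are placeholders, not the actual code from the repro repo):

```js
// Client side: record when the query is initiated ("start request").
async function timedQuery() {
  const t0 = Date.now();
  console.log('start request', t0);
  const res = await fetch('/api/graphql', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ query: '{ posts { title } }' }), // placeholder query
  });
  console.log('response received after', Date.now() - t0, 'ms', res.status);
}

// Server side (pages/api/graphql.js): log as early as possible in the
// invocation ("before fetching"). On a cold start, the gap between
// "start request" and this log is network latency plus the cold-start delay.
export default async function handler(req, res) {
  console.log('before fetching', Date.now());
  // ... run the GraphQL resolvers (mocked data in the example) ...
  res.status(200).json({ data: { posts: [] } });
}
```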

Some screenshots from the example (not reproduced here).

How to reproduce

  1. Clone project lambda-cold-start and deploy to Vercel
  2. You can inspect it right away under https://lambda-cold-start.vercel.app (but it is better to deploy it yourself, so you can control the inactivity).

Expected behavior

I know that cold starts cannot be fully eliminated, but cold start times of 2 - 7 seconds are a problem for me. I can accept a cold start time of roughly 1 second. Thus, I expect the following help from this issue:

  1. What exactly happens on Vercel between request initiation and lambda function invocation
  2. Understand the influencing factors (hopefully something can be improved)
  3. Based on the analysis, maybe some ideas for viable remedies (warm-up strategies).

I have also opened an issue with @prisma to see if the issue is amplified by that stack.

Additional information

You can find package.json and the Prisma schema in the linked repo. Note that the example takes out any calls to the database (all Prisma queries are taken out of the GraphQL resolvers). This is to show that we are indeed dealing with a cold start issue and not database latency.

I'd be happy to provide more information if needed.

@williamli
Contributor

williamli commented May 28, 2021

@styxlab The issue you experienced is caused by the database (in this case Nexus) not being optimised for serverless connections. Database connections cannot be shared between serverless invocations between cold boots. Therefore, each time your serverless function is called (while cold), a new database connection will need to be established. You can get around this by adding poolers between your database and the serverless function or by switching to a serverless friendly database.

You can find more information about this over at https://vercel.com/docs/solutions/databases#connecting-to-your-database.

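As a rough sketch of that suggestion with Prisma (the env variable, datasource name and query-string flags are illustrative and not taken from this repo; a PgBouncer-style pooler in front of Postgres is assumed):

```js
// lib/prisma.js - route Prisma through a connection pooler instead of
// connecting to the database directly, and keep the per-function
// connection count low so bursts of cold invocations don't exhaust it.
import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient({
  datasources: {
    db: {
      // POOLED_DATABASE_URL is an assumed env var pointing at the pooler.
      url: `${process.env.POOLED_DATABASE_URL}?pgbouncer=true&connection_limit=1`,
    },
  },
});

export default prisma;
```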

@styxlab
Author

styxlab commented May 28, 2021

Thanks @williamli for looking into this issue. Unfortunately, connection pooling cannot explain the issue; I ruled that out already. Why? Because I took the database out of the example: there is not a single call to a database! In the example I simply return mocked data in the GraphQL resolvers, which is where a real-world example would make a request to the database.

Nexus is not a database; it is a GraphQL schema generator, and Prisma is an ORM or model mapper. I include them in the example because they have an influence on the problem (maybe through the lambda function bundle size, I don't know).

@styxlab
Author

styxlab commented May 28, 2021

Just for the record, I enhanced the reproduction example with

  • /api/hello endpoint showing that a simple function does not exhibit the reported problem (always <500ms); a sketch of such an endpoint is shown below
  • expose some AWS process env variables for easier debugging
  • I am testing from Frankfurt, Germany (should not have an influence on cold starts, but here you go).
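The /api/hello endpoint is just a trivial handler along these lines (sketch; the file in the repro repo may differ slightly):

```js
// pages/api/hello.js - no Prisma, no Nexus, no GraphQL: consistently <500ms.
export default function handler(req, res) {
  res.status(200).json({ hello: 'world', now: Date.now() });
}
```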

@styxlab styxlab changed the title Experiencing long lambda cold start delays from 2 - 7 seconds on Vercel Experiencing long lambda cold start delays from approx. 2 seconds on Vercel Jun 2, 2021
@styxlab styxlab changed the title Experiencing long lambda cold start delays from approx. 2 seconds on Vercel Experiencing long lambda cold start delays of 2 - 3 seconds on Vercel Jun 2, 2021
@styxlab
Author

styxlab commented Jun 2, 2021

Here are some additional findings:

  1. The initially reported 7 secs of cold-start delay were caused by functions that were calling other endpoints that also experienced cold-start delays, so the times accumulated. (I updated the title accordingly.)

  2. Initially I was measuring total round-trip times, which included the network delays from my location. As the warm function times are around 500ms, this can be subtracted to get the time that should be attributed to the cold starts.

  3. Finally, I am left with 2 - 3 secs of "cold start" delays on Vercel. When testing only 1 endpoint, cold-start times are around 2 secs, but if I fire many requests at the same time, cold start times increase to approx 3 secs per endpoint.

  4. The cold start times are related to the function size (which is mostly determined by the included modules/packages). So the reported times refer to a function size of ~ 20 MB compressed (60 MB uncompressed).

  5. I played with "warming" the endpoint by issuing a dummy request every 3 minutes (see the sketch below). This helps in 80% of all cases, but interestingly, not always. Please let me know what the best strategy is for "warming".

As Vercel infra is basically a black box, I would very much appreciate some more insight into what determines the cold starts and what can be done to reduce them (both in user land and on Vercel's side). It would also be interesting to know why warming does not help in all cases.
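Something along these lines implements the warming mentioned in point 5 (a sketch rather than the exact script used; it needs to run somewhere outside Vercel, the endpoint URL is the public example deployment, and Node 18+ is assumed for the global fetch):

```js
// warm.js - hit the endpoint every 3 minutes with a cheap query so the
// lambda stays warm. In my tests this helped in ~80% of cases, not always.
const ENDPOINT = 'https://lambda-cold-start.vercel.app/api/graphql';

setInterval(async () => {
  try {
    const res = await fetch(ENDPOINT, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ query: '{ __typename }' }), // near no-op query
    });
    console.log(new Date().toISOString(), 'warm ping', res.status);
  } catch (err) {
    console.error('warm ping failed', err);
  }
}, 3 * 60 * 1000);
```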

@OG84

OG84 commented Aug 7, 2021

If you develop with plain AWS, you can significantly decrease cold start time by increasing the function memory size (which will also give you more virtual CPU cores). I think you can also change the memory setting in Vercel; a sketch follows below.
The second thing which slows down lambda startup is dynamic require statements in JS code. But we don't have any influence on this. That's up to Vercel. Periodically running a ping on your function with an early return should also be helpful.
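If memory can indeed be raised on Vercel, it would be via the functions block in vercel.json, roughly like this (a sketch; the path pattern and memory value are illustrative and plan limits apply, so check Vercel's docs):

```json
{
  "functions": {
    "api/graphql.js": {
      "memory": 3008
    }
  }
}
```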

@nhuesmann


@styxlab I'm having this issue as well, my backend uses Prisma + Nexus + Vercel. I'm using connection pooling, so I know it's not what @williamli mentioned in his comment.

Have you made any more progress on this issue?

@styxlab
Author

styxlab commented Aug 26, 2021

@nhuesmann This is still an unsolved problem for me, and that's why I still run my API endpoints on a DigitalOcean droplet (everything else on Vercel).

The best I could do with Vercel lambda was to call the endpoints every 3-4 minutes (warming), but that didn't reliably help all the time. It's also difficult to test that from different regions.

I am planning to write an in-depth blog article about my findings but do not yet know when I have the time for this.

@neoromantic

We're (https://github.com/gooditcollective) quite interested in this as well, since we build all of our clients' projects on Vercel.

Specifically, we use a GraphQL function in a plain Node.js Vercel environment, made with Apollo Server.

Even a completely minimal solution (with no external connections to databases and similar things, with no extra code dependencies, just plain Apollo Server initialisation and a single HTTP handler made with micro) boots up in about 1.5-2 seconds.
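A minimal setup of that shape looks roughly like this (a sketch using the apollo-server-micro v3 pattern, not the exact code from our project):

```js
// pages/api/graphql.js - plain Apollo Server on micro, no DB, no extras.
import { ApolloServer, gql } from 'apollo-server-micro';

const typeDefs = gql`
  type Query {
    hello: String
  }
`;

const resolvers = {
  Query: {
    hello: () => 'world', // static data only
  },
};

const apolloServer = new ApolloServer({ typeDefs, resolvers });
const startPromise = apolloServer.start();

export default async function handler(req, res) {
  await startPromise; // started once per (warm) lambda instance
  return apolloServer.createHandler({ path: '/api/graphql' })(req, res);
}

export const config = {
  api: { bodyParser: false }, // let apollo-server-micro read the body itself
};
```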

We'd love to find a way to make it reasonable (I guess, 200-500ms would be already satisfactory).

Are there any suggestions or ideas from Vercel's team or the community, I wonder?

We'll try limiting the function memory size, but I reckon that will have a small effect, if any. Rewarming is something that we will do also, but this feels like a broken solution and is unreliable.

Is there anything we can try?

@styxlab
Author

styxlab commented Sep 1, 2021

@neoromantic I don't want to get in the way of a reply from @vercel, but it's good to see you are reporting figures that correspond very well with my own observations (I am also using apollo-server-micro and tested with empty resolvers - no DB connection).

I am also very interested in moving my temporary solution (graphql API endpoints on DO) back to Vercel, but the performance difference is really huge as I am getting <~ 100ms consistently without worrying about cold startups. I am a bit puzzled as to why this topic does not get more attention - it seems to me that all apps using serverless functions would run into that issue sooner or later. In any case, a real solution would probably have to come from AWS, so maybe it's better addressed there?

@neoromantic

I am a bit puzzled as to why this topic does not get more attention - it seems to me that all apps using serverless functions would run into that issue sooner or later.

If I remember correctly, Vercel's position and strategy is that their platform is very much cache-oriented. So hosting real-time API functions is not the main use case for Vercel; the idea is to generate a response and cache it so it can be delivered statically.

Personally, I want to consider solutions like fly.io, which allows a multi-zone setup for a GraphQL server and a Redis cache backend, for example.

But I've used Vercel (called Zeit back then) since the very first versions, and I adore their ideology and wonderful support. So I'm very hopeful that at least we will get an understanding of how to manage cold boot times.

@timuric

timuric commented Sep 9, 2021

I am experiencing 10s+ cold starts with a 255kb function; it's quite a deal breaker.

@styxlab
Author

styxlab commented Sep 9, 2021

@timuric This sounds a bit high, with a 255kb function I would expect cold start times of ~ 1 second. Did you make the following checks?

  • exclude the network round-trip time
  • make no calls to database or other async tasks (just for testing)
  • make sure you are not calling other serverless functions (which could also exhibit cold starts)

I missed the latter check initially, that's why I ended up with ~ 7 secs, because individual cold starts accumulated. Once you understand your access pattern, you can optimize. However, the barrier of ~ 1 second remains, which is still a big issue for me.

@fubhy

fubhy commented Sep 20, 2021

Just wanted to chime in here to confirm that we are also running into this exact same issue with, in fact, the same stack causing this (apollo-server-micro with nexus). As the OP already mentioned, this has nothing to do with the database as we also ruled that out entirely (returning stubbed data performs exactly as poorly as it does with a database connection).

@joshsny

joshsny commented Oct 24, 2021

We are experiencing similar issues, though cold start times are shorter for us, at around 1.5s.

Something that surprised me was that for Next.js (which is what we use), API endpoints are bundled together up to a size of 50 MB. Therefore, despite having a number of separate API endpoints, they are actually bundled together with a size of ~30 MB.

This is to reduce the number of cold starts and keep things warm. However, when there is a cold start (which happens quite frequently, as the API does not experience high traffic), it is long enough to cause issues for our application.

I haven't tried creating a small endpoint to keep it warm yet, but will try that next and see what effect it has.

@styxlab
Author

styxlab commented Oct 26, 2021

A solution to the problem: https://vercel.com/docs/concepts/functions/edge-functions ?

@neoromantic

A solution to the problem: https://vercel.com/docs/concepts/functions/edge-functions ?

Can't see how this is a solution. Edge Functions are limited to 1 MB in size and should return a response within 1.5s.

It's great for authentication and other quick routes, but not for APIs, as far I can tell.

So, I guess, in a real production setup we should refrain from hosting APIs on Vercel, and consider a serverful approach to this, or consider always-hot serverless functions like in Google Cloud.

@edgesoft

I'm dealing with the same issue! Any feedback?

@devdev-dev

devdev-dev commented Nov 22, 2021

Some problems for me using apollo-server-micro and MongoDB Atlas - the cold start is unpredictable and varies in stability and duration.

Is this just a problem with GraphQL in combination with Vercel or a more general problem with Vercel?

@SchneiderOr

SchneiderOr commented Dec 30, 2021

Can confirm having the same slow cold boots using Nuxt3 (Node & Vue3)
We aren't opening any DB connections, only making some fetch requests to our API, which is deployed on AWS, and then rendering quite a lightweight page.

@DomVinyard

Having the same slow cold boots using GRAND stack. Any advice would be really helpful

@SchneiderOr

SchneiderOr commented Jan 26, 2022 via email

@digitaljohn

Just reporting the same here. apollo-server-micro with or without DB connections.

@styxlab
Author

styxlab commented Jan 26, 2022

My current solution is to rewrite all API routes to a DigitalOcean droplet where I run a copy of my Next.js project. So, pages are served from Vercel but my GraphQL endpoint runs on a real server with no cold start delays.

@piotrpawlik

Having the same issues with long cold starts. It used to be ~0.5s on Next.js 12.0.8 but turned into ~2s after upgrading to 12.1.0...

@styxlab
Author

styxlab commented Mar 14, 2022

@piotrpawlik: I haven't noticed longer cold start times after 12.1.0. However, I am also experiencing accumulating cold start issues with unstable_revalidate(). I can confirm that res.unstable_revalidate() takes ~ 300ms in a warm scenario as advertised. However, it's ~ 1.3 secs on cold start.

Unfortunately, this is in addition to the cold start time of ~ 1 sec of the calling lambda function itself (hence accumulating). With some inevitable network latency, the cold start of a revalidate endpoint will take approx. 3 seconds in total. I experimented with warming, but you have to trigger every edge server worldwide, so this is not a practical workaround. I am sad to say, but cold start issues are the biggest bummer with Vercel/AWS lambda.
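For context, a revalidation endpoint of the kind described typically follows the standard Next.js 12.1 pattern, roughly like this (a sketch; the path and secret are illustrative and not from my actual project):

```js
// pages/api/revalidate.js - on-demand ISR endpoint whose own cold start
// adds to the cold start of the lambda function that calls it.
export default async function handler(req, res) {
  if (req.query.secret !== process.env.REVALIDATE_SECRET) {
    return res.status(401).json({ message: 'Invalid token' });
  }
  try {
    const started = Date.now();
    await res.unstable_revalidate('/blog/hello-world'); // ~300ms warm, ~1.3s cold
    return res.json({ revalidated: true, ms: Date.now() - started });
  } catch (err) {
    return res.status(500).send('Error revalidating');
  }
}
```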

@karmatradeDev

Currently having this problem.

@vicary

vicary commented Apr 21, 2022

We are experiencing long cold starts for Next.js SSR; the resulting bundles are roughly 260B.

From the logs we see an Init Duration of 4s - 5s, which is far from an acceptable dynamic web response time.

Is it possible to increase memory size for SSR functions?

@jonbnewman

jonbnewman commented May 17, 2022

This has become a completely untenable issue for our application.

API endpoints are basically useless on Vercel because of the cold start issue...why is this not solved?

@ltbittner

+1 API endpoints take way too long on a cold start. Will have to find a different solution. The lack of response here from the Vercel team is quite sad.

I also find it strange that my API function size is 30MB+ even though I only have a couple small functions (and @next/bundle-analyzer is reporting them at 200kb...).

I love Vercel but this is super disappointing.

@chrisb2244

Having the same issue with signin/signup API routes - I spent a while adjusting email generation, trialling templating, switching from SMTP to a REST API for mail, etc., but discovered now that although I see ~8s for the first test, if I log out and straight back in, the second attempt is much faster (<~1s, which is fine for me).

So I guess my problem is not my emailing, but the cold start behaviour? (Reading above, it seems like I might significantly reduce the 8s by trying to flatten API requests down to a single endpoint; not sure that will be ideal for DRY, but worth it if it cuts 8 seconds down to 2, which would be a bit annoying but no longer terrible.)

Maybe Edge Functions will solve my problem; I will have to try and see if I can rewrite using those (my functions are small, so hopefully they will fit in the 1 MB limit).

@leerob
Member

leerob commented Jun 14, 2022

I wrote up ways to debug and detect your root issue with Serverless Function performance decreases.

Apologies for the slow response - let me know if this helps 😄

@vercel vercel locked and limited conversation to collaborators Jun 14, 2022
@leerob leerob converted this issue into discussion #7961 Jun 14, 2022

