
Experiencing long lambda cold start delays of 2 - 3 seconds on Vercel #6292

Closed

styxlab opened this issue May 27, 2021 · 31 comments

Comments

@styxlab

styxlab commented May 27, 2021

Bug description

My Next.js app is deployed to Vercel and uses a lambda route for a GraphQL server (apollo-server-micro) that is configured with Prisma + Nexus. Lambda cold starts on Vercel lead to slow queries that take approximately 7 seconds. I see the 7 seconds on the private deploy with project name "blogody". A typical cold start signature looks as follows:

x-vercel-id | cdg1::iad1::28qc9-1622127003386-1cc4995040e3

image

image

As I cannot share this repo publicly, I made a smaller example that still shows smaller but significant cold start times of approximately 2.5 seconds. I have not managed to pin down the influencing factors and I hope Vercel can shed some light on them. Here is the deploy output for the serverless functions:


00:04:49.070 | Serverless function size info
00:04:49.071 | Serverless Function's page: api/graphql.js
00:04:49.074 | Large Dependencies                          Uncompressed size  Compressed size
00:04:49.074 | node_modules/.prisma/client                             47 MB          16.4 MB
00:04:49.074 | node_modules/prettier/parser-typescript.js            3.18 MB           817 kB
00:04:49.074 | node_modules/prettier/index.js                        1.72 MB           396 kB
00:04:49.074 | node_modules/prettier/parser-flow.js                  3.11 MB           310 kB
00:04:49.074 | node_modules/@prisma/client                           1.17 MB           243 kB
00:04:49.074 | node_modules/busboy/deps                               618 kB           196 kB
00:04:49.075 | node_modules/micro/node_modules                        360 kB           190 kB
00:04:49.075 | node_modules/encoding/node_modules                     329 kB           179 kB
00:04:49.075 | node_modules/lodash/lodash.js                          544 kB          96.4 kB
00:04:49.075 | All dependencies                                      63.5 MB          20.2 MB
00:04:49.163 | Created all serverless functions in: 43.168s

I see the cold starts after approx. 10 minutes of inactivity, but that could vary.

I put some simple timestamps into the app, both on the client and the server. From those timestamps, you can see that in the case of a cold start, the total query time is governed by the waiting time between query initiation and endpoint function invocation (start request - before fetching). I hope Vercel can debug what happens within this timespan and give some guidance on how to reduce it.
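For illustration, a minimal sketch of the kind of timestamps I mean (the query, file names and log labels are placeholders, not the actual code from the repro repo):

```js
// Client side: record when the query is initiated ("start request").
async function timedQuery() {
  const t0 = Date.now();
  console.log('start request', t0);
  const res = await fetch('/api/graphql', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ query: '{ posts { title } }' }), // placeholder query
  });
  console.log('response received after', Date.now() - t0, 'ms', res.status);
}

// Server side (pages/api/graphql.js): log as early as possible in the
// invocation ("before fetching"). On a cold start, the gap between
// "start request" and this log is network latency plus the cold-start delay.
export default async function handler(req, res) {
  console.log('before fetching', Date.now());
  // ... run the GraphQL resolvers (mocked data in the example) ...
  res.status(200).json({ data: { posts: [] } });
}
```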

Some screenshots from the example (not reproduced here).

How to reproduce

  1. Clone project lambda-cold-start and deploy to Vercel
  2. You can inspect it right away under https://lambda-cold-start.vercel.app (but it is better to deploy it yourself, so you can control the inactivity).

Expected behavior

I know that cold starts cannot be fully eliminated, but cold start times of 2 - 7 seconds are a problem for me. I can accept a cold start time of roughly 1 second. Thus, I expect the following help from this issue:

  1. What exactly happens on Vercel between request initiation and lambda function invocation
  2. Understand the influencing factors (hopefully something can be improved)
  3. Based on the analysis, maybe some ideas for viable remedies (warm-up strategies).

I have also opened an issue with @prisma to see if the issue is amplified by that stack.

Additional information

You can find package.json and the Prisma schema in the linked repo. Note that the example takes out any calls to the database (all Prisma queries are taken out of the GraphQL resolvers). This is to show that we are indeed dealing with a cold start issue and not database latency.

I'd be happy to provide more information if needed.

@williamli
Contributor

williamli commented May 28, 2021

@styxlab The issue you experienced is caused by the database (in this case Nexus) not being optimised for serverless connections. Database connections cannot be shared between serverless invocations between cold boots. Therefore, each time your serverless function is called (while cold), a new database connection will need to be established. You can get around this by adding poolers between your database and the serverless function or by switching to a serverless friendly database.

You can find more information about this over at https://vercel.com/docs/solutions/databases#connecting-to-your-database.

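As a rough sketch of that suggestion with Prisma (the env variable, datasource name and query-string flags are illustrative and not taken from this repo; a PgBouncer-style pooler in front of Postgres is assumed):

```js
// lib/prisma.js - route Prisma through a connection pooler instead of
// connecting to the database directly, and keep the per-function
// connection count low so bursts of cold invocations don't exhaust it.
import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient({
  datasources: {
    db: {
      // POOLED_DATABASE_URL is an assumed env var pointing at the pooler.
      url: `${process.env.POOLED_DATABASE_URL}?pgbouncer=true&connection_limit=1`,
    },
  },
});

export default prisma;
```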

@styxlab
Author

styxlab commented May 28, 2021

Thanks @williamli for looking into this issue. Unfortunately, connection pooling cannot explain the issue; I ruled that out already. Why? Because I took the database out of the example: there is not a single call to a database! In the example I simply return mocked data in the GraphQL resolvers, which is where a real-world example would make a request to the database.

Nexus is not a database; it is a GraphQL schema generator, and Prisma is an ORM or model mapper. I include them in the example because they have an influence on the problem (maybe through the lambda function bundle size, I don't know).

@styxlab
Author

styxlab commented May 28, 2021

Just for the record, I enhanced the reproduction example with

  • /api/hello endpoint showing that a simple function does not exhibit the reported problem (always <500ms); a sketch of such an endpoint is shown below
  • expose some AWS process env variables for easier debugging
  • I am testing from Frankfurt, Germany (should not have an influence on cold starts, but here you go).
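The /api/hello endpoint is just a trivial handler along these lines (sketch; the file in the repro repo may differ slightly):

```js
// pages/api/hello.js - no Prisma, no Nexus, no GraphQL: consistently <500ms.
export default function handler(req, res) {
  res.status(200).json({ hello: 'world', now: Date.now() });
}
```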

@styxlab styxlab changed the title Experiencing long lambda cold start delays from 2 - 7 seconds on Vercel Experiencing long lambda cold start delays from approx. 2 seconds on Vercel Jun 2, 2021
@styxlab styxlab changed the title Experiencing long lambda cold start delays from approx. 2 seconds on Vercel Experiencing long lambda cold start delays of 2 - 3 seconds on Vercel Jun 2, 2021
@styxlab
Author

styxlab commented Jun 2, 2021

Here are some additional findings:

  1. The initially reported 7 secs of cold-start delay were caused by functions that were calling other endpoints that also experienced cold-start delays, so the times accumulated. (I updated the title accordingly.)

  2. Initially I was measuring total round-trip times, which included the network delays from my location. As the warm function times are around 500ms, this can be subtracted to get the time that should be attributed to the cold starts.

  3. Finally, I am left with 2 - 3 secs of "cold start" delays on Vercel. When testing only 1 endpoint, cold-start times are around 2 secs, but if I fire many requests at the same time, cold start times increase to approx 3 secs per endpoint.

  4. The cold start times are related to the function size (which is mostly determined by the included modules/packages). So the reported times refer to a function size of ~ 20 MB compressed (60 MB uncompressed).

  5. I played with "warming" the endpoint by issuing a dummy request every 3 minutes (see the sketch below). This helps in 80% of all cases, but interestingly, not always. Please let me know what the best strategy is for "warming".

As Vercel infra is basically a black box, I would very much appreciate some more insight into what determines the cold starts and what can be done to reduce them (both in user land and on Vercel's side). It would also be interesting to know why warming does not help in all cases.
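Something along these lines implements the warming mentioned in point 5 (a sketch rather than the exact script used; it needs to run somewhere outside Vercel, the endpoint URL is the public example deployment, and Node 18+ is assumed for the global fetch):

```js
// warm.js - hit the endpoint every 3 minutes with a cheap query so the
// lambda stays warm. In my tests this helped in ~80% of cases, not always.
const ENDPOINT = 'https://lambda-cold-start.vercel.app/api/graphql';

setInterval(async () => {
  try {
    const res = await fetch(ENDPOINT, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ query: '{ __typename }' }), // near no-op query
    });
    console.log(new Date().toISOString(), 'warm ping', res.status);
  } catch (err) {
    console.error('warm ping failed', err);
  }
}, 3 * 60 * 1000);
```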

@OG84

OG84 commented Aug 7, 2021

If you develop with plain AWS, you can significantly decrease cold start time by increasing the function memory size (which will also give you more virtual CPU cores). I think you can also change the memory setting in Vercel; a sketch follows below.
The second thing which slows down lambda startup is dynamic require statements in JS code. But we don't have any influence on this. That's up to Vercel. Periodically running a ping on your function with an early return should also be helpful.
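If memory can indeed be raised on Vercel, it would be via the functions block in vercel.json, roughly like this (a sketch; the path pattern and memory value are illustrative and plan limits apply, so check Vercel's docs):

```json
{
  "functions": {
    "api/graphql.js": {
      "memory": 3008
    }
  }
}
```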

@nhuesmann


@styxlab I'm having this issue as well, my backend uses Prisma + Nexus + Vercel. I'm using connection pooling, so I know it's not what @williamli mentioned in his comment.

Have you made any more progress on this issue?

@styxlab
Author

styxlab commented Aug 26, 2021

@nhuesmann This is still an unsolved problem for me, and that's why I still run my API endpoints on a DigitalOcean droplet (everything else on Vercel).

The best I could do with Vercel lambda was to call the endpoints every 3-4 minutes (warming), but that didn't reliably help all the time. It's also difficult to test that from different regions.

I am planning to write an in-depth blog article about my findings but do not yet know when I have the time for this.

@neoromantic

We're (https://github.com/gooditcollective) quite interested in this as well, since we build all of our clients' projects on Vercel.

Specifically, we use a GraphQL function in a plain Node.js Vercel environment, made with Apollo Server.

Even a completely minimal solution (with no external connections to databases and similar things, with no extra code dependencies, just plain Apollo Server initialisation and a single HTTP handler made with micro) boots up in about 1.5-2 seconds.
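A minimal setup of that shape looks roughly like this (a sketch using the apollo-server-micro v3 pattern, not the exact code from our project):

```js
// pages/api/graphql.js - plain Apollo Server on micro, no DB, no extras.
import { ApolloServer, gql } from 'apollo-server-micro';

const typeDefs = gql`
  type Query {
    hello: String
  }
`;

const resolvers = {
  Query: {
    hello: () => 'world', // static data only
  },
};

const apolloServer = new ApolloServer({ typeDefs, resolvers });
const startPromise = apolloServer.start();

export default async function handler(req, res) {
  await startPromise; // started once per (warm) lambda instance
  return apolloServer.createHandler({ path: '/api/graphql' })(req, res);
}

export const config = {
  api: { bodyParser: false }, // let apollo-server-micro read the body itself
};
```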

We'd love to find a way to make it reasonable (I guess, 200-500ms would be already satisfactory).

Are there any suggestions or ideas from Vercel's team or the community, I wonder?

We'll try limiting the function memory size, but I reckon that will have a small effect, if any. Rewarming is something that we will do also, but this feels like a broken solution and is unreliable.

Is there anything we can try?

@styxlab
Author

styxlab commented Sep 1, 2021

@neoromantic I don't want to get in the way of a reply from @vercel, but it's good to see you are reporting figures that correspond very well with my own observations (I am also using apollo-server-micro and tested with empty resolvers - no DB connection).

I am also very interested in moving my temporary solution (graphql API endpoints on DO) back to Vercel, but the performance difference is really huge as I am getting <~ 100ms consistently without worrying about cold startups. I am a bit puzzled as to why this topic does not get more attention - it seems to me that all apps using serverless functions would run into that issue sooner or later. In any case, a real solution would probably have to come from AWS, so maybe it's better addressed there?

@neoromantic

I am a bit puzzled as to why this topic does not get more attention - it seems to me that all apps using serverless functions would run into that issue sooner or later.

If I remember correctly, Vercel's position and strategy is that their platform is very much cache-oriented. So hosting real-time API functions is not the main use case for Vercel; the idea is to generate a response and cache it so it can be delivered statically.

Personally, I want to consider solutions like fly.io, which allows a multi-zone setup for a GraphQL server and a Redis cache backend, for example.

But I've used Vercel (called Zeit back then) since the very first versions, and I adore their ideology and wonderful support. So I'm very hopeful that at least we will get an understanding of how to manage cold boot times.

@timuric

timuric commented Sep 9, 2021

I am experiencing 10s+ cold starts with a 255kb function; it's quite a deal breaker.

@styxlab
Author

styxlab commented Sep 9, 2021

@timuric This sounds a bit high, with a 255kb function I would expect cold start times of ~ 1 second. Did you make the following checks?

  • exclude the network round-trip time
  • make no calls to database or other async tasks (just for testing)
  • make sure you are not calling other serverless functions (which could also exhibit cold starts)

I missed the latter check initially, that's why I ended up with ~ 7 secs, because individual cold starts accumulated. Once you understand your access pattern, you can optimize. However, the barrier of ~ 1 second remains, which is still a big issue for me.

@fubhy

fubhy commented Sep 20, 2021

Just wanted to chime in here to confirm that we are also running into this exact same issue with, in fact, the same stack causing this (apollo-server-micro with nexus). As the OP already mentioned, this has nothing to do with the database as we also ruled that out entirely (returning stubbed data performs exactly as poorly as it does with a database connection).

@joshsny

joshsny commented Oct 24, 2021

We are experiencing similar issues, though cold start times are shorter for us, at around 1.5s.

Something that surprised me was that for Next.js (which is what we use), API endpoints are bundled together up to a size of 50 MB. Therefore, despite having a number of separate API endpoints, they are actually bundled together with a size of ~30 MB.

This is to reduce the number of cold starts and keep things warm. However, when there is a cold start (which happens quite frequently, as the API does not experience high traffic), it is long enough to cause issues for our application.

I haven't tried creating a small endpoint to keep it warm yet, but will try that next and see what effect it has.

@styxlab
Author

styxlab commented Oct 26, 2021

A solution to the problem: https://vercel.com/docs/concepts/functions/edge-functions ?

@neoromantic

A solution to the problem: https://vercel.com/docs/concepts/functions/edge-functions ?

Can't see how this is a solution. Edge Functions are limited to 1 MB in size and should return a response within 1.5s.

It's great for authentication and other quick routes, but not for APIs, as far I can tell.

So, I guess, in a real production setup we should refrain from hosting APIs on Vercel, and consider a serverful approach to this, or consider always-hot serverless functions like in Google Cloud.

@edgesoft

I'm dealing with the same issue! Any feedback?

@devdev-dev

devdev-dev commented Nov 22, 2021

Some problems for me using apollo-server-micro and MongoDB Atlas - the cold start is unpredictable and varies in stability and duration.

Is this just a problem with GraphQL in combination with Vercel or a more general problem with Vercel?

@SchneiderOr

SchneiderOr commented Dec 30, 2021

Can confirm having the same slow cold boots using Nuxt3 (Node & Vue3)
We aren't opening any DB connections, only making some fetch requests to our API, which is deployed on AWS, and then rendering quite a lightweight page.

@DomVinyard

Having the same slow cold boots using GRAND stack. Any advice would be really helpful

@SchneiderOr

SchneiderOr commented Jan 26, 2022 via email

@digitaljohn

Just reporting the same here. apollo-server-micro with or without DB connections.

@styxlab
Author

styxlab commented Jan 26, 2022

My current solution is to rewrite all API routes to a DigitalOcean droplet where I run a copy of my Next.js project. So, pages are served from Vercel but my GraphQL endpoint runs on a real server with no cold start delays.

@piotrpawlik

Having the same issues with long cold starts. It used to be ~0.5s on Next.js 12.0.8 but turned into ~2s after upgrading to 12.1.0...

@styxlab
Author

styxlab commented Mar 14, 2022

@piotrpawlik: I haven't noticed longer cold start times after 12.1.0. However, I am also experiencing accumulating cold start issues with unstable_revalidate(). I can confirm that res.unstable_revalidate() takes ~ 300ms in a warm scenario as advertised. However, it's ~ 1.3 secs on cold start.

Unfortunately, this is in addition to the cold start time of ~ 1 sec of the calling lambda function itself (hence accumulating). With some inevitable network latency, the cold start of a revalidate endpoint will take approx. 3 seconds in total. I experimented with warming, but you have to trigger every edge server worldwide, so this is not a practical workaround. I am sad to say, but cold start issues are the biggest bummer with Vercel/AWS lambda.
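For context, a revalidation endpoint of the kind described typically follows the standard Next.js 12.1 pattern, roughly like this (a sketch; the path and secret are illustrative and not from my actual project):

```js
// pages/api/revalidate.js - on-demand ISR endpoint whose own cold start
// adds to the cold start of the lambda function that calls it.
export default async function handler(req, res) {
  if (req.query.secret !== process.env.REVALIDATE_SECRET) {
    return res.status(401).json({ message: 'Invalid token' });
  }
  try {
    const started = Date.now();
    await res.unstable_revalidate('/blog/hello-world'); // ~300ms warm, ~1.3s cold
    return res.json({ revalidated: true, ms: Date.now() - started });
  } catch (err) {
    return res.status(500).send('Error revalidating');
  }
}
```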

@karmatradeDev

Currently having this problem.

@vicary

vicary commented Apr 21, 2022

We are experiencing long cold starts for Next.js SSR; the resulting bundles are roughly 260B.

From the logs we see an Init Duration of 4s - 5s, which is far from an acceptable dynamic web response time.

Is it possible to increase memory size for SSR functions?

@jonbnewman

jonbnewman commented May 17, 2022

This has become a completely untenable issue for our application.

API endpoints are basically useless on Vercel because of the cold start issue...why is this not solved?

@ltbittner

+1 API endpoints take way too long on a cold start. Will have to find a different solution. The lack of response here from the Vercel team is quite sad.

I also find it strange that my API function size is 30MB+ even though I only have a couple small functions (and @next/bundle-analyzer is reporting them at 200kb...).

I love Vercel but this is super disappointing.

@chrisb2244

Having the same issue with signin/signup API routes - I spent a while adjusting email generation, trialling templating, switching from SMTP to a REST API for mail, etc., but discovered now that although I see ~8s for the first test, if I log out and straight back in, the second attempt is much faster (<~1s, which is fine for me).

So I guess my problem is not my emailing, but the cold start behaviour? (Reading above, it seems like I might significantly reduce the 8s by trying to flatten API requests down to a single endpoint; not sure that will be ideal for DRY, but worth it if it cuts 8 seconds down to 2, which would be a bit annoying but no longer terrible.)

Maybe Edge Functions will solve my problem; I will have to try and see if I can rewrite using those (my functions are small, so hopefully they will fit in the 1 MB limit).

@leerob
Member

leerob commented Jun 14, 2022

I wrote up ways to debug and detect your root issue with Serverless Function performance decreases.

Apologies for the slow response - let me know if this helps 😄

@vercel vercel locked and limited conversation to collaborators Jun 14, 2022
@leerob leerob converted this issue into discussion #7961 Jun 14, 2022

