Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

External Host Error causing renovate to fail for monorepo #17483

Closed
dbasilio opened this issue Aug 29, 2022 · 28 comments
Closed

External Host Error causing renovate to fail for monorepo #17483

dbasilio opened this issue Aug 29, 2022 · 28 comments
Assignees
Labels
priority-3-medium Default priority, "should be done" but isn't prioritised ahead of others status:in-progress Someone is working on implementation type:bug Bug fix of existing functionality

Comments

@dbasilio
Copy link

How are you running Renovate?

Mend Renovate hosted app on github.com

If you're self-hosting Renovate, tell us what version of Renovate you run.

No response

Please select which platform you are using if self-hosting.

github.com

If you're self-hosting Renovate, tell us what version of the platform you run.

No response

Was this something which used to work for you, and then stopped?

It used to work, and then stopped

Describe the bug

We maintain a monorepo with 138 packages in them. We used to get renovate PRs for the repo, but it stopped working a while ago (not exactly sure when). I finally got around to checking the logs and it's failing with the error below. The package itself is not that important as it fails on a different package pretty frequently (usually it fails on the first one listed in devDependencies on the global package.json).

The error is not consistent as we've had three successful runs in the last 3 days, but even on successful runs we haven't been getting new PRs. I'm going to guess we're hitting some sort of limit on requests to npm and being throttled, but if that's the case, are there config options we can make to increase the timeout or paginate more or something?

Relevant debug logs

Logs
WARN: Host error
{
  "hostType": "npm",
  "packageName": "react-phone-number-input",
  "err": {
    "name": "TimeoutError",
    "code": "ETIMEDOUT",
    "timings": {
      "start": 1661790262126,
      "socket": 1661790262127,
      "lookup": 1661790262130,
      "connect": 1661790267339,
      "secureConnect": 1661790267369,
      "upload": 1661790267369,
      "response": 1661790267420,
      "error": 1661790323264,
      "abort": 1661790341747,
      "phases": {
        "wait": 1,
        "dns": 3,
        "tcp": 5209,
        "tls": 30,
        "request": 0,
        "firstByte": 51,
        "total": 79621
      }
    },
    "event": "request",
    "message": "Timeout awaiting 'request' for 60000ms",
    "stack": "RequestError: Timeout awaiting 'request' for 60000ms\n    at ClientRequest.<anonymous> (/home/ubuntu/renovateapp/node_modules/got/dist/source/core/index.js:970:65)\n    at Object.onceWrapper (events.js:520:26)\n    at ClientRequest.emit (events.js:412:35)\n    at ClientRequest.emit (domain.js:475:12)\n    at ClientRequest.origin.emit (/home/ubuntu/renovateapp/node_modules/@szmarczak/http-timer/dist/source/index.js:43:20)\n    at TLSSocket.socketErrorListener (_http_client.js:475:9)\n    at TLSSocket.emit (events.js:400:28)\n    at TLSSocket.emit (domain.js:475:12)\n    at emitErrorNT (internal/streams/destroy.js:106:8)\n    at emitErrorCloseNT (internal/streams/destroy.js:74:3)\n    at processTicksAndRejections (internal/process/task_queues.js:82:21)\n    at Timeout.timeoutHandler [as _onTimeout] (/home/ubuntu/renovateapp/node_modules/got/dist/source/core/utils/timed-out.js:36:25)\n    at listOnTimeout (internal/timers.js:559:11)\n    at processTimers (internal/timers.js:500:7)",
    "options": {
      "headers": {
        "user-agent": "Renovate Bot (GitHub App 2740)",
        "accept": "application/json",
        "accept-encoding": "gzip, deflate, br"
      },
      "url": "https://registry.npmjs.org/react-phone-number-input",
      "hostType": "npm",
      "username": "",
      "password": "",
      "method": "GET",
      "http2": false
    },
    "response": {
      "statusCode": 200,
      "statusMessage": "OK",
      "headers": {
        "date": "Mon, 29 Aug 2022 16:24:27 GMT",
        "content-type": "application/json",
        "transfer-encoding": "chunked",
        "connection": "close",
        "cf-ray": "7426a313295ec378-SEA",
        "access-control-allow-origin": "*",
        "age": "173",
        "cache-control": "public, max-age=300",
        "etag": "W/\"baa4594aa8274c326bcf05b7e8f487db\"",
        "last-modified": "Sat, 27 Aug 2022 08:55:56 GMT",
        "vary": "accept-encoding, accept",
        "cf-cache-status": "HIT",
        "expect-ct": "max-age=604800, report-uri=\"https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct\"",
        "x-amz-replication-status": "PENDING",
        "server": "cloudflare",
        "content-encoding": "gzip"
      },
      "httpVersion": "1.1"
    }
  }
}

Have you created a minimal reproduction repository?

No reproduction repository

@dbasilio dbasilio added priority-5-triage status:requirements Full requirements are not yet known, so implementation should not be started type:bug Bug fix of existing functionality labels Aug 29, 2022
@rarkins
Copy link
Collaborator

rarkins commented Aug 30, 2022

We should analyse these timestamps above to work out if this timeout should be reduced. I'm not sure if that 60s is a timeout for the full file transfer or a timeout failing to connect, for no data received, etc.

Then we should check if we can retry such failures at least once.

@rarkins rarkins added priority-2-high Bugs impacting wide number of users or very important features status:ready and removed priority-5-triage status:requirements Full requirements are not yet known, so implementation should not be started labels Aug 30, 2022
@viceice
Copy link
Member

viceice commented Aug 31, 2022

We've a similar issue on our helm charts repo

WARN: Host error
{
  "hostType": "github-releases",
  "packageName": "norwoodj/helm-docs",
  "err": {
    "name": "TimeoutError",
    "code": "ETIMEDOUT",
    "timings": {
      "start": 1661933000729,
      "socket": 1661933000945,
      "lookup": 1661933108895,
      "error": 1661933108935,
      "phases": {
        "wait": 216,
        "dns": 107950,
        "total": 108206
      }
    },
    "event": "request",
    "message": "Timeout awaiting 'request' for 60000ms",
    "stack": "RequestError: Timeout awaiting 'request' for 60000ms\n    at ClientRequest.<anonymous> (/home/ubuntu/renovateapp/node_modules/got/dist/source/core/index.js:970:65)\n    at Object.onceWrapper (events.js:520:26)\n    at ClientRequest.emit (events.js:412:35)\n    at ClientRequest.emit (domain.js:475:12)\n    at ClientRequest.origin.emit (/home/ubuntu/renovateapp/node_modules/@szmarczak/http-timer/dist/source/index.js:43:20)\n    at TLSSocket.socketErrorListener (_http_client.js:475:9)\n    at TLSSocket.emit (events.js:400:28)\n    at TLSSocket.emit (domain.js:475:12)\n    at emitErrorNT (internal/streams/destroy.js:106:8)\n    at emitErrorCloseNT (internal/streams/destroy.js:74:3)\n    at processTicksAndRejections (internal/process/task_queues.js:82:21)\n    at Timeout.timeoutHandler [as _onTimeout] (/home/ubuntu/renovateapp/node_modules/got/dist/source/core/utils/timed-out.js:36:25)\n    at listOnTimeout (internal/timers.js:559:11)\n    at processTimers (internal/timers.js:500:7)",
    "options": {
      "headers": {
        "user-agent": "Renovate Bot (GitHub App 2740)",
        "accept": "application/vnd.github.machine-man-preview+json",
        "authorization": "***********",
        "content-type": "application/json",
        "content-length": "621",
        "accept-encoding": "gzip, deflate, br"
      },
      "url": "https://api.github.com/graphql",
      "hostType": "github-releases",
      "username": "",
      "password": "",
      "method": "POST",
      "http2": false
    }
  }
}

@viceice
Copy link
Member

viceice commented Aug 31, 2022

@viceice
Copy link
Member

viceice commented Aug 31, 2022

@dbasilio can you try to set a higher timeout1 or limit concurrency2 for npm registry?

Footnotes

  1. https://docs.renovatebot.com/configuration-options/#timeout

  2. https://docs.renovatebot.com/configuration-options/#concurrentrequestlimit

@rarkins rarkins added status:in-progress Someone is working on implementation and removed status:ready labels Aug 31, 2022
@dbasilio
Copy link
Author

dbasilio commented Sep 1, 2022

@viceice We've tried a timeout of 120s, and a concurrency limit of 2 individually with no change. Going to try both together

@dbasilio
Copy link
Author

dbasilio commented Sep 1, 2022

Both settings together don't seem to help, we're still getting the external-host-error more often than note. Anecdotally, it does seem to fail on "later" packages alphabetically

{
    "hostType": "npm",
    "matchHost": "https://registry.npmjs.org/",
    "timeout": 120000,
    "concurrentRequestLimit": 2
}

@rarkins
Copy link
Collaborator

rarkins commented Sep 2, 2022

I'm going to use this issue as a general performance tuning issue. It seems some of the issues are environmental/app-related, because we can't replicate them when running locally yet they occur consistently in the hosted app. This may be because of:

  • Docker itself
  • The Ubuntu base image
  • AWS infra e.g. DNS
  • Memory or CPU limitations

@rarkins
Copy link
Collaborator

rarkins commented Sep 2, 2022

This tried so far:

  • Custom DNS cache
  • Custom DNS settings in env: RES_OPTIONS: 'ndots:3 retrans:1000 retry:3 rotate'
  • Enable http2 for api.github.com
  • Enable http2 for public npm registry

@dbasilio
Copy link
Author

dbasilio commented Sep 6, 2022

Since it's been running for a few days now with

{
    "hostType": "npm",
    "matchHost": "https://registry.npmjs.org/",
    "timeout": 120000,
    "concurrentRequestLimit": 2
}

I've noticed that it started hitting a timeout more than the external host error. We are still getting the occasional success but most often it fails.
image

@rarkins
Copy link
Collaborator

rarkins commented Sep 9, 2022

Looking at the last "done" run, here are the starts:

  "hostStats": {
    "api.github.com": {
      "requestCount": 36,
      "requestAvgMs": 460,
      "queueAvgMs": 0
    },
    "registry.npmjs.org": {
      "requestCount": 103,
      "requestAvgMs": 19024,
      "queueAvgMs": 0
    },
    "xxx.jfrog.io": {
      "requestCount": 4,
      "requestAvgMs": 23116,
      "queueAvgMs": 0
    }
  },

Even though concurrentRequestLimit was configured, we see no queueAvgMs for registry.npmjs.org plus very high requestAvgMs for both npm and jfrog.

@rarkins
Copy link
Collaborator

rarkins commented Sep 9, 2022

I found the queue problem, see #17713. Please remove hostType from your host rule

@dbasilio
Copy link
Author

Updated my hostrule to:

{
    "matchHost": "registry.npmjs.org/",
    "concurrentRequestLimit": 2,
    "timeout": 120000
,

(and then the same for our jfrog registry as well in a separate host rule.

We're still getting mostly external host errors
image

@rarkins
Copy link
Collaborator

rarkins commented Sep 12, 2022

Try removing all limits again

@dbasilio
Copy link
Author

No change
image

@rarkins
Copy link
Collaborator

rarkins commented Sep 13, 2022

What's the details for the errors you get now? Eg ETIMEDOUT, ECONNRESET, etc. pasting details here would be helpful

@dbasilio
Copy link
Author

External Host Error

{
  "hostType": "npm",
  "packageName": "@datadog/browser-logs",
  "err": {
    "name": "TimeoutError",
    "code": "ETIMEDOUT",
    "timings": {
      "start": 1663081515075,
      "socket": 1663081515076,
      "lookup": 1663081519436,
      "connect": 1663081519436,
      "secureConnect": 1663081522929,
      "upload": 1663081522931,
      "response": 1663081523051,
      "error": 1663081595475,
      "abort": 1663081635691,
      "phases": {
        "wait": 1,
        "dns": 4360,
        "tcp": 0,
        "tls": 3493,
        "request": 2,
        "firstByte": 120,
        "total": 120616
      }
    },
    "event": "request",
    "message": "Timeout awaiting 'request' for 60000ms",
    "stack": "RequestError: Timeout awaiting 'request' for 60000ms\n    at ClientRequest.<anonymous> (/home/ubuntu/renovateapp/node_modules/got/dist/source/core/index.js:970:65)\n    at Object.onceWrapper (node:events:628:26)\n    at ClientRequest.emit (node:events:525:35)\n    at ClientRequest.emit (node:domain:489:12)\n    at ClientRequest.origin.emit (/home/ubuntu/renovateapp/node_modules/@szmarczak/http-timer/dist/source/index.js:43:20)\n    at TLSSocket.socketErrorListener (node:_http_client:481:9)\n    at TLSSocket.emit (node:events:513:28)\n    at TLSSocket.emit (node:domain:489:12)\n    at emitErrorNT (node:internal/streams/destroy:157:8)\n    at emitErrorCloseNT (node:internal/streams/destroy:122:3)\n    at processTicksAndRejections (node:internal/process/task_queues:83:21)\n    at runNextTicks (node:internal/process/task_queues:65:3)\n    at listOnTimeout (node:internal/timers:528:9)\n    at Timeout.timeoutHandler [as _onTimeout] (/home/ubuntu/renovateapp/node_modules/got/dist/source/core/utils/timed-out.js:36:25)\n    at listOnTimeout (node:internal/timers:561:11)\n    at processTimers (node:internal/timers:502:7)",
    "options": {
      "headers": {
        "user-agent": "Renovate Bot (GitHub App 2740)",
        "accept": "application/json",
        "accept-encoding": "gzip, deflate, br"
      },
      "url": "https://registry.npmjs.org/@datadog%2Fbrowser-logs",
      "hostType": "npm",
      "username": "",
      "password": "",
      "method": "GET",
      "http2": false
    },
    "response": {
      "statusCode": 200,
      "statusMessage": "OK",
      "headers": {
        "date": "Tue, 13 Sep 2022 15:05:22 GMT",
        "content-type": "application/json",
        "transfer-encoding": "chunked",
        "connection": "close",
        "cf-ray": "74a1c7de2db7680a-SEA",
        "access-control-allow-origin": "*",
        "age": "77",
        "cache-control": "public, max-age=300",
        "etag": "W/\"3786e68d0270801401b11f9b4ac6d624\"",
        "last-modified": "Tue, 13 Sep 2022 11:35:42 GMT",
        "vary": "accept-encoding, accept",
        "cf-cache-status": "HIT",
        "x-amz-replication-status": "COMPLETED",
        "server": "cloudflare",
        "content-encoding": "gzip"
      },
      "httpVersion": "1.1"
    }
  }
}

And this is all the INFO level logs for the timeout I posted above

INFO: Repository started
{
  "renovateVersion": "32.194.5"
}
INFO: Dependency extraction complete
{
  "baseBranch": "master",
  "stats": {
    "managers": {
      "dockerfile": {
        "fileCount": 1,
        "depCount": 1
      },
      "npm": {
        "fileCount": 140,
        "depCount": 4456
      },
      "nvm": {
        "fileCount": 1,
        "depCount": 1
      }
    },
    "total": {
      "fileCount": 142,
      "depCount": 4458
    }
  }
}
WARN: Job timeout

@rarkins
Copy link
Collaborator

rarkins commented Sep 13, 2022

Have you configured a 60s timeout?

@dbasilio
Copy link
Author

We do not have a timeout configured anywhere in our config

@rarkins
Copy link
Collaborator

rarkins commented Sep 13, 2022

We've had a 4 minute timeout configured for a number of days. Strange that your log shows 60s - when was it captured?

Note: the file in question shouldn't take so long either way

@dbasilio
Copy link
Author

dbasilio commented Sep 14, 2022

These are the logs for the last day, with no change to our config:
image

The errors were respectively:

  • Fetching @datadog/browser-logs - Timeout awaiting 'request' for 60000ms
  • Fetching @axe-core/react - Timeout awaiting 'request' for 60000ms
  • Fetching @axe-core/react - Timeout awaiting 'request' for 60000ms
  • Fetching @datadog/browser-logs - Timeout awaiting 'request' for 60000ms
  • Fetching @axe-core/react - Timeout awaiting 'request' for 60000ms
  • Fetching @sentry/types - Client network socket disconnected before secure TLS connection was established
  • Fetching @datadog/browser-logs - Timeout awaiting 'request' for 60000ms
  • Fetching @datadog/browser-logs - Timeout awaiting 'request' for 60000ms
  • Fetching @sentry/types - Timeout awaiting 'request' for 60000ms

All runs used renovate version 32.194.5. I also double confirmed that we are not specifying a timeout anywhere in our shared configs.

@rarkins
Copy link
Collaborator

rarkins commented Sep 16, 2022

Strange, it seems to be inheriting the default 60s hostRules.timeout instead of the 240s rule we've set in the app. Any change if you then try to apply a longer timeout in your own config? e.g.

"hostRules": [{
  "timeout": 240000
}]

@dbasilio
Copy link
Author

We've had it running all weekend with a 4m timeout and this is what it looks like
image

The failures were (in order):

  • @types/svg-sprite-loader - Client network socket disconnected before secure TLS connection was established
  • @yarnpkg/shell - socket hang up
  • @types/event-emitter - Client network socket disconnected before secure TLS connection was established
  • @datadog/browser-logs - socket hang up
  • @juggle/resize-observer - socket hang up
  • monodeploy - Client network socket disconnected before secure TLS connection was established
  • @types/react-tooltip - Client network socket disconnected before secure TLS connection was established
  • @svgr/plugin-svgo - socket hang up
  • the last two errors were before the 4m change

@dbasilio
Copy link
Author

We've been running with the 4m change for a week now, we're still seeing intermittent failures but it's now completing often enough that we're getting PRs through so most of my concern has been alleviated.

Is there a problem with increasing the timeout further? Like 5 or 6 minutes? Is there a max on the timeout? I don't want to further increase the timeout if it's going to negatively impact the running of the app.

image

@rarkins
Copy link
Collaborator

rarkins commented Sep 26, 2022

No problem with you trying a higher number. Something is clearly going wrong somewhere in the chain and causing rate limiting, but it's good if it eventually completes.

@rarkins
Copy link
Collaborator

rarkins commented Oct 23, 2022

@dbasilio I'm seeing only occasional failures for your repo now, can you confirm the same from your side?

Only ~1 failure in the last day, for the record it was still ECONNRESET, and I'm surprised to see that Renovate did retry 2 times:

{"level":20,"msg":"GET https://registry.npmjs.org/@yarnpkg%2Fcore = (code=ECONNRESET, statusCode=-1 retryCount=2, duration=17249)","time":"2022-10-22T14:06:04.321Z"}
{"level":20,"msg":"GET https://registry.npmjs.org/@yarnpkg%2Ffslib = (code=ECONNRESET, statusCode=-1 retryCount=2, duration=17446)","time":"2022-10-22T14:06:04.515Z"}

@rarkins rarkins added priority-3-medium Default priority, "should be done" but isn't prioritised ahead of others and removed priority-2-high Bugs impacting wide number of users or very important features labels Oct 23, 2022
@dbasilio
Copy link
Author

We're still seeing a decent amount of failures, but it's completing frequently enough that we're at least getting PRs put up
image

@dbasilio
Copy link
Author

We're seeing success often enough now that I think we can close this issue out
image
Thanks for all the help!

@rarkins rarkins closed this as completed Nov 30, 2022
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 31, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
priority-3-medium Default priority, "should be done" but isn't prioritised ahead of others status:in-progress Someone is working on implementation type:bug Bug fix of existing functionality
Projects
None yet
Development

No branches or pull requests

3 participants