
URL checker fails on URLs that work #93

Closed
valentinsulzer opened this issue Jun 29, 2022 · 16 comments · Fixed by #96

Comments

@valentinsulzer

e.g. https://github.com/pybamm-team/PyBaMM/runs/7115511512?check_suite_focus=true#step:4:631

Are there different settings we can use to make it work?

@vsoch
Collaborator

vsoch commented Jun 29, 2022

The GitHub failures are strange - testing locally, it seems to work okay. The DOI and sciencedirect links appear to be set up so that when I try them locally with requests I get "Service Unavailable" (503) and "Forbidden" (403). If you haven't already tried re-running the workflow, I would do that first.
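For illustration, the kind of naive local check described above looks roughly like this (a minimal sketch with requests' default headers, not urlchecker's actual code; the status set is an assumption about which responses indicate bot-blocking):

```python
import requests

# Statuses worth retrying with different headers: Forbidden, rate-limited,
# and Service Unavailable -- the responses seen from doi.org / sciencedirect.
RETRY_STATUSES = {403, 429, 503}

def is_retryable(status_code):
    """True for responses that a bot-blocking server typically returns."""
    return status_code in RETRY_STATUSES

def check_url(url, timeout=10):
    """Plain GET with requests' default headers, as a naive checker would do."""
    try:
        return requests.get(url, timeout=timeout).status_code
    except requests.RequestException:
        return None
```

A 503 or 403 here, for a URL that opens fine in a browser, is the signature of a server rejecting automated clients rather than a genuinely broken link.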


I've had trouble with DOI services too, beyond URL checker (e.g., Zenodo goes down a lot). So I think your best bet is to add those to the list of patterns to skip - there is a good example of that in the USRSE repository: https://github.com/USRSE/usrse.github.io/blob/c604fe86a19dbecaa9bf4333ace0e8a5b154981f/.github/workflows/clean-expired-jobs.yaml#L26-L34

The other thing you could try is pinning 0.0.25 or earlier - in 0.0.26 we added multiprocessing that sped up the checks, but perhaps it made the checker easier for servers to detect. If you try that earlier pin and it does seem to work more reliably, I can do the work to bring back that slower (but maybe more resilient?) mode.

@SuperKogito
Member

I can confirm what @vsoch mentioned about DOI and sciencedirect. Usually higher timeout and retry_count values should help, but you already tried that. Alternatively, you can exclude the links from the checks using exclude_urls.
Running the checks locally with urlchecker check PyBaMM --timeout 15 --retry-count 5 --verbose true --file-types .rst,.md,.py,.ipynb --exclude-patterns https://www.datacamp.com/community/tutorials/fuzzy-string-python,http://127.0.0.1,https://github.com/pybamm-team/PyBaMM/tree/v --exclude-files CHANGELOG.md --branch master --no-print confirms what @vsoch mentioned. The higher input values might only help with the failing GitHub links.
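Conceptually, pattern exclusion amounts to dropping any collected URL that matches one of the given patterns. A sketch of the idea (illustrative substring matching, not urlchecker's actual implementation; the pattern values are examples from this discussion):

```python
def is_excluded(url, exclude_patterns):
    """Return True if the URL matches any exclusion pattern (substring match)."""
    return any(pattern in url for pattern in exclude_patterns)

# Patterns one might skip, per the discussion above (illustrative values).
patterns = ["doi.org", "sciencedirect.com", "http://127.0.0.1"]

assert is_excluded("https://doi.org/some-doi", patterns)
assert not is_excluded("https://www.linux.org/", patterns)
```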


@vsoch I just opened a related issue at urlstechie/urlchecker-python#73 and described what I think could be the issue.

@valentinsulzer
Author

Thanks for the quick response! I will try v0.0.25, and if that doesn't work I'll try skipping doi.org (I guess those links are fairly safe), and report back.

@vsoch
Collaborator

vsoch commented Jun 29, 2022

@tinosulzer - as @SuperKogito mentioned, I think we have some information about the sciencedirect issue! I am planning to open a PR this evening to try to fix it - I'll ping here.

@vsoch
Collaborator

vsoch commented Jun 29, 2022

Okay, here is a PR to test: urlstechie/urlchecker-python#74. You should be able to use the action branch. Since we can't be sure the Accept header won't break other servers that don't expect it, I only add it after the first failure. I also moved the headers derivation inside the retry loop, so a different user agent is tried on each attempt.
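The behavior described - a fresh user agent per retry, with the Accept header added only after the first failure - can be sketched like this (the agent list, helper name, and header value are illustrative assumptions, not the actual urlchecker code):

```python
import random

# Illustrative pool of browser user agents to rotate through on retries.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/63.0.3239.108 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
]

def build_headers(attempt):
    """Pick a (possibly different) user agent on each attempt; add an Accept
    header only after the first failure, in case some servers reject it."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    if attempt > 0:
        headers["Accept"] = (
            "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
        )
    return headers
```

Calling build_headers inside the retry loop (rather than once before it) is what lets each retry present a different client fingerprint.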

@SuperKogito
Member

Nice work @vsoch 👍 DOI is fixed, but sciencedirect isn't yet :))

@vsoch
Collaborator

vsoch commented Jun 30, 2022

Was there a fix I missed for sciencedirect?

@vsoch
Collaborator

vsoch commented Jun 30, 2022

It might be the user agent - I'm going to look into other ways to provide (possibly newer) ones. Will again report back :)

@SuperKogito
Member

The fix I mentioned should work for both, I think.

@vsoch
Collaborator

vsoch commented Jun 30, 2022

I applied the fix - so I'm suggesting that it doesn't, and that perhaps the user agent is an issue too?

@SuperKogito
Member

That could be it. That's why I mentioned it: in my previous test I only used this agent:

agent = (
    "Mozilla/5.0 (X11; Linux x86_64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/63.0.3239.108 "
    "Safari/537.36"
)

@rootwork

Coming over from #94, just to report that if the cause is the same for my failures, they are all URLs that don't involve DOI or sciencedirect. Here are some that are failing:

https://www.linux.org/
https://drupal.org/
https://codepen.io/rootwork/
https://www.progressiveexchange.org/
http://groundwire.org/blog/groundwire-engagement-pyramid/ [http is accurate there; there's no https version]

@vsoch
Collaborator

vsoch commented Jul 22, 2022

Some progress! I'm able to fix the last one, but the rest look like they have pretty sophisticated methods for detecting this kind of thing. This is the set that is still failing:

🤔 Uh oh... The following urls did not pass:
❌️ https://www.progressiveexchange.org/
❌️ https://groups.drupal.org/node/278968
❌️ https://drupal.org/
❌️ https://www.linux.org/
❌️ https://www.drupal.org/node/1982024
❌️ https://www.flickr.com/photos/username
❌️ https://codepen.io/rootwork/

But I have one more idea - since it's an action, we could install selenium (and provide a browser binary) as a last-resort headless method to do the check. Since it's an actual browser, I think that could work. Otherwise, you can just add those to the skip list. What do you think?
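The last-resort fallback proposed above could be structured roughly like this (a sketch, not an implemented urlchecker feature; the function names and Chrome setup are assumptions, and the selenium path requires a browser binary):

```python
def check_with_fallback(url, plain_check, browser_check):
    """Try the cheap HTTP check first; fall back to a real browser only
    when the plain request fails or is blocked (status >= 400)."""
    status = plain_check(url)
    if status is not None and status < 400:
        return True
    return browser_check(url)

def selenium_check(url):
    """Last-resort check with a headless Chrome (requires a browser binary)."""
    from selenium import webdriver  # imported lazily: only needed as fallback
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return bool(driver.title)  # crude success signal: page rendered a title
    finally:
        driver.quit()
```

Because the browser check is injected as a callable, the expensive selenium path only runs for URLs the plain request could not verify.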

@vsoch
Collaborator

vsoch commented Jul 23, 2022

@tinosulzer I've tested your URLs against the new version (with a better user agent), and I think the only non-working one from your set is the sciencedirect one:

🤔 Uh oh... The following urls did not pass:
...
❌️ https://www.sciencedirect.com/science/article/pii/S0013468608005045

I think this is an improvement, so I'm going to draft a release for this update and then go back to tackle the other URLs. I'm thinking it's time to bring in a heavier-duty tester - an actual headless browser like selenium.

@vsoch
Collaborator

vsoch commented Jul 23, 2022

Note that the first fix is merged; for the remaining URLs I will be trying selenium soon, tracked in urlstechie/urlchecker-python#73.

@rootwork

https://www.flickr.com/photos/username is a legitimate broken link, sorry for not catching that. (It's a placeholder URL in a comment.)
