
URL checker fails on URLs that work #93

Closed
valentinsulzer opened this issue Jun 29, 2022 · 16 comments · Fixed by #96

Comments

@valentinsulzer

e.g. https://github.com/pybamm-team/PyBaMM/runs/7115511512?check_suite_focus=true#step:4:631

Are there different settings we can use to make it work?

@vsoch
Collaborator

vsoch commented Jun 29, 2022

The GitHub failures are strange - testing locally, it seems to work okay. The DOI and sciencedirect links appear to be set up so that when I try them locally with requests I get "Service Unavailable" (503) and "Forbidden" (403). If you haven't already tried re-running the workflow, I would do that first.
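For illustration, the kind of naive local check described above looks roughly like this (a minimal sketch with requests' default headers, not urlchecker's actual code; the status set is an assumption about which responses indicate bot-blocking):

```python
import requests

# Statuses worth retrying with different headers: Forbidden, rate-limited,
# and Service Unavailable -- the responses seen from doi.org / sciencedirect.
RETRY_STATUSES = {403, 429, 503}

def is_retryable(status_code):
    """True for responses that a bot-blocking server typically returns."""
    return status_code in RETRY_STATUSES

def check_url(url, timeout=10):
    """Plain GET with requests' default headers, as a naive checker would do."""
    try:
        return requests.get(url, timeout=timeout).status_code
    except requests.RequestException:
        return None
```

A 503 or 403 here, for a URL that opens fine in a browser, is the signature of a server rejecting automated clients rather than a genuinely broken link.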


I've had trouble with DOI services too, beyond URL checker (e.g., Zenodo goes down a lot). So I think your best bet is to add those to the list of patterns to skip - there is a good example of that in the USRSE repository: https://github.com/USRSE/usrse.github.io/blob/c604fe86a19dbecaa9bf4333ace0e8a5b154981f/.github/workflows/clean-expired-jobs.yaml#L26-L34

The other thing you could try is pinning 0.0.25 or earlier - in 0.0.26 we added multiprocessing that sped up the checks, but perhaps it made the checker easier for servers to detect. If you try that earlier pin and it does seem to work more reliably, I can do the work to bring back that slower (but maybe more resilient?) mode.

@SuperKogito
Member

I can confirm what @vsoch mentioned about DOI and sciencedirect. Usually higher timeout and retry_count values should help, but you already tried that. Alternatively, you can exclude the links from the checks using exclude_urls.
Running the checks locally with urlchecker check PyBaMM --timeout 15 --retry-count 5 --verbose true --file-types .rst,.md,.py,.ipynb --exclude-patterns https://www.datacamp.com/community/tutorials/fuzzy-string-python,http://127.0.0.1,https://github.com/pybamm-team/PyBaMM/tree/v --exclude-files CHANGELOG.md --branch master --no-print confirms what @vsoch mentioned. The higher input values might only help with the failing GitHub links.
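Conceptually, pattern exclusion amounts to dropping any collected URL that matches one of the given patterns. A sketch of the idea (illustrative substring matching, not urlchecker's actual implementation; the pattern values are examples from this discussion):

```python
def is_excluded(url, exclude_patterns):
    """Return True if the URL matches any exclusion pattern (substring match)."""
    return any(pattern in url for pattern in exclude_patterns)

# Patterns one might skip, per the discussion above (illustrative values).
patterns = ["doi.org", "sciencedirect.com", "http://127.0.0.1"]

assert is_excluded("https://doi.org/some-doi", patterns)
assert not is_excluded("https://www.linux.org/", patterns)
```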


@vsoch I just opened a related issue at urlstechie/urlchecker-python#73 and described what I think could be the issue.

@valentinsulzer
Author

Thanks for the quick response! I will try v0.0.25, and if that doesn't work I'll try skipping doi.org (I guess those links are fairly safe), and report back.

@vsoch
Collaborator

vsoch commented Jun 29, 2022

@tinosulzer - as @SuperKogito mentioned, I think we have some information about the sciencedirect issue! I am planning to open a PR this evening to try to fix it - I'll ping here.

@vsoch
Collaborator

vsoch commented Jun 29, 2022

Okay, here is a PR to test: urlstechie/urlchecker-python#74. You should be able to use the action branch. Since we can't be sure the Accept header won't break other servers that don't expect it, I only add it after the first failure. I also moved the headers derivation inside the retry loop, so a different user agent is tried on each attempt.
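The behavior described - a fresh user agent per retry, with the Accept header added only after the first failure - can be sketched like this (the agent list, helper name, and header value are illustrative assumptions, not the actual urlchecker code):

```python
import random

# Illustrative pool of browser user agents to rotate through on retries.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/63.0.3239.108 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
]

def build_headers(attempt):
    """Pick a (possibly different) user agent on each attempt; add an Accept
    header only after the first failure, in case some servers reject it."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    if attempt > 0:
        headers["Accept"] = (
            "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
        )
    return headers
```

Calling build_headers inside the retry loop (rather than once before it) is what lets each retry present a different client fingerprint.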

@SuperKogito
Member

Nice work @vsoch 👍 DOI is fixed, but sciencedirect isn't yet :))

@vsoch
Collaborator

vsoch commented Jun 30, 2022

Was there a fix I missed for sciencedirect?

@vsoch
Collaborator

vsoch commented Jun 30, 2022

It might be the user agent - I'm going to look into other ways to provide (possibly newer) ones. Will again report back :)

@SuperKogito
Member

The fix I mentioned should work for both, I think.

@vsoch
Collaborator

vsoch commented Jun 30, 2022

I applied the fix - so I'm suggesting that it doesn't, and that perhaps the user agent is an issue too?

@SuperKogito
Member

That could be it. That's why I mentioned it: in my previous test I only used this agent:

agent = (
    "Mozilla/5.0 (X11; Linux x86_64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/63.0.3239.108 "
    "Safari/537.36"
)

@rootwork

Coming over from #94, just to report that if the cause is the same for my failures, they are all URLs that don't involve DOI or sciencedirect. Here are some that are failing:

https://www.linux.org/
https://drupal.org/
https://codepen.io/rootwork/
https://www.progressiveexchange.org/
http://groundwire.org/blog/groundwire-engagement-pyramid/ [http is accurate there; there's no https version]

@vsoch
Collaborator

vsoch commented Jul 22, 2022

Some progress! I'm able to fix the last one, but the rest look like they have pretty sophisticated methods for detecting this kind of thing. This is the set that is still failing:

🤔 Uh oh... The following urls did not pass:
❌️ https://www.progressiveexchange.org/
❌️ https://groups.drupal.org/node/278968
❌️ https://drupal.org/
❌️ https://www.linux.org/
❌️ https://www.drupal.org/node/1982024
❌️ https://www.flickr.com/photos/username
❌️ https://codepen.io/rootwork/

But I have one more idea - since it's an action, we could install selenium (and provide a browser binary) as a last-resort headless method to do the check. Since it's an actual browser, I think that could work. Otherwise, you can just add those to the skip list. What do you think?
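The last-resort fallback proposed above could be structured roughly like this (a sketch, not an implemented urlchecker feature; the function names and Chrome setup are assumptions, and the selenium path requires a browser binary):

```python
def check_with_fallback(url, plain_check, browser_check):
    """Try the cheap HTTP check first; fall back to a real browser only
    when the plain request fails or is blocked (status >= 400)."""
    status = plain_check(url)
    if status is not None and status < 400:
        return True
    return browser_check(url)

def selenium_check(url):
    """Last-resort check with a headless Chrome (requires a browser binary)."""
    from selenium import webdriver  # imported lazily: only needed as fallback
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return bool(driver.title)  # crude success signal: page rendered a title
    finally:
        driver.quit()
```

Because the browser check is injected as a callable, the expensive selenium path only runs for URLs the plain request could not verify.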

@vsoch
Collaborator

vsoch commented Jul 23, 2022

@tinosulzer I've tested your URLs against the new version (with a better user agent), and I think the only non-working one from your set is the sciencedirect one:

🤔 Uh oh... The following urls did not pass:
...
❌️ https://www.sciencedirect.com/science/article/pii/S0013468608005045

I think this is an improvement, so I'm going to draft a release for this update and then go back to tackle the other URLs. I'm thinking it's time to bring in a heavier-duty tester - an actual headless browser like selenium.

@vsoch
Collaborator

vsoch commented Jul 23, 2022

Note that the first fix is merged; for the remaining URLs I will be trying selenium soon, tracked in urlstechie/urlchecker-python#73.

@rootwork

https://www.flickr.com/photos/username is a legitimate broken link, sorry for not catching that. (It's a placeholder URL in a comment.)
