URL checker fails on URLs that work #93
Comments
The GitHub failures are strange: testing locally, it seems to work okay. The DOI and sciencedirect servers appear to have a setup such that when I try locally with requests I get "Service unavailable (503)" and denied (403). If you haven't already tried running the workflow again, I would do that first. I've had trouble with DOI services too, beyond URL checker (e.g., zenodo goes down a lot). So I think your best bet is to add those to the list of patterns to skip; there is a good example of that in the USRSE repository: https://github.com/USRSE/usrse.github.io/blob/c604fe86a19dbecaa9bf4333ace0e8a5b154981f/.github/workflows/clean-expired-jobs.yaml#L26-L34 The other thing you could try is using 0.0.25 or earlier. For 0.0.26 we added multiprocessing that sped up the checks, but perhaps it made the checker easier for the server to detect. If you want to try that earlier pin and it does seem to work more reliably, I can do the work to bring back that slower (but maybe more resilient?) mode.
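To illustrate the skip-pattern idea, here is a minimal sketch of substring-based filtering. It is not urlchecker's actual implementation; `filter_urls`, `SKIP_PATTERNS`, and the example DOI are hypothetical names and values chosen for illustration.

```python
def filter_urls(urls, skip_patterns):
    """Split urls into (kept, skipped).

    A URL is skipped when any pattern occurs in it as a plain substring.
    """
    kept, skipped = [], []
    for url in urls:
        if any(pattern in url for pattern in skip_patterns):
            skipped.append(url)
        else:
            kept.append(url)
    return kept, skipped


# Hypothetical skip list covering the two services discussed above.
SKIP_PATTERNS = ["https://doi.org", "https://www.sciencedirect.com"]

urls = [
    "https://doi.org/10.1000/example",  # made-up DOI, for illustration only
    "https://www.sciencedirect.com/science/article/pii/S0013468608005045",
    "https://github.com/pybamm-team/PyBaMM",
]
kept, skipped = filter_urls(urls, SKIP_PATTERNS)
```

Only the URLs in `kept` would then be checked; the trade-off is that a genuinely dead link on a skipped domain goes unnoticed.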
I can confirm what @vsoch mentioned about DOI and sciencedirect. Usually using higher @vsoch I just opened a related issue at urlstechie/urlchecker-python#73 and described what, imo, could be the issue.
Thanks for the quick response! I will try v0.0.25, and if that doesn't work try skipping
@tinosulzer and, as @SuperKogito mentioned, I think we have some information about sciencedirect! I am planning on opening a PR this evening to try and fix it - I'll ping here.
Okay, here is a PR to test: urlstechie/urlchecker-python#74. You should be able to use the action branch. Since we can't be sure whether the Accept header will break other servers that don't expect it, I only add it after the first failure. I also derive the headers inside the retry loop so a different user agent is tried each time.
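The retry behaviour described in that comment could look roughly like the sketch below. It is an illustration under stated assumptions, not the code from the PR: `build_headers`, `check_url`, `requests_fetch`, the user-agent strings, and the Accept value are all hypothetical. The `fetch` callable is injected so the loop can be exercised without network access.

```python
# Illustrative user-agent strings (assumptions, not urlchecker's actual list).
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/103.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Firefox/102.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/15.5 Safari/605.1.15",
]

# Browser-like Accept value; only sent after the first attempt has failed.
ACCEPT = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"


def build_headers(attempt):
    """Derive headers for one attempt: rotate the user agent by attempt index."""
    headers = {"User-Agent": USER_AGENTS[attempt % len(USER_AGENTS)]}
    if attempt > 0:
        headers["Accept"] = ACCEPT
    return headers


def check_url(url, fetch, retries=3):
    """Return True as soon as fetch(url, headers) reports HTTP 200."""
    for attempt in range(retries):
        headers = build_headers(attempt)  # rebuilt inside the retry loop
        try:
            status = fetch(url, headers)
        except Exception:
            continue  # treat connection errors like a failed attempt
        if status == 200:
            return True
    return False


def requests_fetch(url, headers):
    """One possible fetch backend; assumes the requests package is installed."""
    import requests  # imported lazily so the sketch loads without it

    return requests.get(url, headers=headers, timeout=5).status_code
```

A call such as `check_url(url, requests_fetch)` would then try up to three user agents, sending the Accept header only from the second attempt onward.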
Nice work @vsoch 👍 DOI fixed but sciencedirect not yet :))
Was there a fix I missed for sciencedirect?
It might be the user agent - I'm going to look into other ways to provide (possibly newer) ones. Will again report back :)
The fix I mentioned should work for both, I think.
I applied the fix - so I think I'm suggesting that it doesn't, and perhaps the user agent is an issue too?
That could be it. That's why I mentioned it, since in my previous test I only tested it with this agent:
Coming over from #94, just to report that if the cause is the same for my failures, they are all URLs that don't involve DOI or sciencedirect. Here are some that are failing:
Some progress! I'm able to fix the last one, but it looks like the rest probably use pretty sophisticated methods to detect this kind of automated check. This is the current still failing set:
But I have one more idea: since it's an action, we could install selenium (and provide a browser binary) as a last-resort headless method to do the check. Since it's an actual browser, I think that could work. Otherwise, you can just add those to the skip list. What do you think?
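A rough sketch of what such a last-resort fallback might look like follows. This is an assumption about the design, not code from urlchecker: `browser_check` and `check_with_fallback` are hypothetical names, and the sketch presumes selenium plus a Chrome binary are available. Note that selenium does not expose the HTTP status code, so the browser step only confirms that the page loaded without raising; an error page served by the site would still count as a pass.

```python
def browser_check(url):
    """Last-resort check: load the page in headless Chrome via selenium.

    Assumes selenium and a Chrome binary are installed. Returns True if the
    page load did not raise; it cannot tell a 200 from an error page.
    """
    from selenium import webdriver  # optional dependency, imported lazily

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return True
    except Exception:
        return False
    finally:
        driver.quit()


def check_with_fallback(url, checkers):
    """Try each (name, checker) pair in order.

    Returns the name of the first checker that passes, or None if all fail.
    """
    for name, checker in checkers:
        if checker(url):
            return name
    return None
```

The ordering matters: a cheap HTTP check would go first in `checkers`, with `("browser", browser_check)` last, so the browser is only started for URLs that everything else rejected.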
@tinosulzer I've tested your URLs against the new version (with a better user agent) and I think the only non-working one (from your set) is the sciencedirect one:
🤔 Uh oh... The following urls did not pass:
...
❌️ https://www.sciencedirect.com/science/article/pii/S0013468608005045
I think this is an improvement, so I'm going to draft a release for this update and then go back to tackle these other URLs. I'm thinking it's time to bring in a heavier tester: an actual headless browser driven by selenium.
Note that the first fix is merged here. For the remainder of the URLs I will be trying selenium soon, tracking in urlstechie/urlchecker-python#73.
e.g. https://github.com/pybamm-team/PyBaMM/runs/7115511512?check_suite_focus=true#step:4:631
Are there different settings we can use to make it work?