Add selenium for (still) failing requests #73

Closed
SuperKogito opened this issue Jun 29, 2022 · 12 comments · Fixed by #77

Comments

@SuperKogito
Member

According to urlstechie/urlchecker-action#93, the checker cannot check DOI and ScienceDirect links. A quick inspection suggests that adding 'Accept': 'application/json' to the request headers should fix this. I think you can add this to #72. Unfortunately, I cannot test or help much here these days.

This fails:

import requests

# Browser-like User-Agent string sent with the request.
agent = (
    "Mozilla/5.0 (X11; Linux x86_64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/63.0.3239.108 "
    "Safari/537.36"
)
timeout = 5
url_doi = 'https://doi.org/10.1063/5.0023771'
url_sci = 'https://www.sciencedirect.com/science/article/pii/S0013468608005045'

# Only the User-Agent header is sent here.
r = requests.get(url_sci, headers={"User-Agent": agent}, timeout=timeout)

print(f"Status Code: {r.status_code}")

This works:

import requests

# Browser-like User-Agent string sent with the request.
agent = (
    "Mozilla/5.0 (X11; Linux x86_64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/63.0.3239.108 "
    "Safari/537.36"
)
timeout = 5
url_doi = 'https://doi.org/10.1063/5.0023771'
url_sci = 'https://www.sciencedirect.com/science/article/pii/S0013468608005045'

# Same request, with an extra Accept header asking for JSON.
r = requests.get(
    url_sci,
    headers={"User-Agent": agent, 'Accept': 'application/json'},
    timeout=timeout,
)

print(f"Status Code: {r.status_code}")

you can test the code online here
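
For anyone following along, the two snippets above differ only in the Accept header, so the comparison can be folded into a small helper. This is a minimal sketch, not the checker's actual code: `build_headers` and `status_for` are hypothetical names introduced here for illustration.

```python
# Hypothetical helpers for comparing header sets; not part of urlchecker.
AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/63.0.3239.108 "
    "Safari/537.36"
)


def build_headers(user_agent=AGENT, accept=None):
    """Return request headers; add an Accept header only when one is given."""
    headers = {"User-Agent": user_agent}
    if accept is not None:
        headers["Accept"] = accept
    return headers


def status_for(url, headers, timeout=5):
    """Return the HTTP status code for url, or None if the request raised."""
    import requests  # imported lazily so build_headers works without it

    try:
        return requests.get(url, headers=headers, timeout=timeout).status_code
    except requests.RequestException:
        return None
```

Calling `status_for(url_sci, build_headers())` and then `status_for(url_sci, build_headers(accept="application/json"))` would reproduce the two cases above. The status codes returned depend on the remote site and may change over time.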

@vsoch
Collaborator

vsoch commented Jun 29, 2022

oh fantastic! @SuperKogito I'll put in a PR tonight.

@vsoch
Collaborator

vsoch commented Jun 29, 2022

Okay, here is a PR to test: #74

@vsoch
Collaborator

vsoch commented Jun 30, 2022

@SuperKogito I found this https://github.com/Luqman-Ud-Din/random_user_agent but it doesn't have many stars / users.

@SuperKogito
Member Author

Well, stars are not exactly the best metric, because we don't have that many either :P But we will need a full review of the tool before using it or adapting parts of it for our use. So how about we release the change for DOI and then work on further improvements in another PR?

@vsoch
Collaborator

vsoch commented Jun 30, 2022

haha true! Let me know what you'd like to try - I'm okay using it or trying to emulate it either way.

@SuperKogito
Member Author

I will try to take a look at it this weekend or the one after, since I have been quite busy lately, even on weekends. As soon as I review it, I will let you know ;)

@vsoch
Collaborator

vsoch commented Jun 30, 2022

Okay, sounds good! Ping me when you want to discuss the direction to take. And thanks for your help today!!

vsoch changed the title from "DOI and sciencedirect requests" to "Add selenium for (still) failing requests" on Jul 23, 2022
@vsoch
Collaborator

vsoch commented Jul 23, 2022

@SuperKogito I think the DOI one should be fixed, and for these other tricky ones I think we can give selenium a shot! I'm not sure if I'll have time this weekend, but it's on my TODO to look at soon.

@SuperKogito
Member Author

SuperKogito commented Jul 23, 2022

I was actually considering this option: https://pypi.org/project/fake-useragent/
It seems to have realistic agents and many more choices. Hopefully, I will be able to test it this weekend.

@vsoch
Collaborator

vsoch commented Jul 23, 2022

@SuperKogito that’s exactly what I tested and added! We are on the same wavelength ❤️

it worked like a charm and fixed a subset of URLs, and I added additional browser headers too. Unfortunately not all are fixed and we need a backup strategy - I’m going to hopefully have time to do a selenium test this weekend.
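
As a rough illustration of the approach described above (a random realistic User-Agent plus extra browser headers), here is a minimal sketch. It is not the code that was merged: `pick_user_agent` and `browser_headers` are hypothetical names, and the particular extra headers are assumptions, since the thread does not say which ones were added.

```python
# Static agent used when fake-useragent is missing or fails to load its data.
FALLBACK_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/63.0.3239.108 "
    "Safari/537.36"
)


def pick_user_agent(fallback=FALLBACK_AGENT):
    """Return a random User-Agent from fake-useragent, or fallback on any error."""
    try:
        from fake_useragent import UserAgent

        return UserAgent().random
    except Exception:
        return fallback


def browser_headers(user_agent=None):
    """Build browser-like headers; the header choices here are assumptions."""
    return {
        "User-Agent": user_agent or pick_user_agent(),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Upgrade-Insecure-Requests": "1",
    }
```

The result can be passed straight to `requests.get(url, headers=browser_headers(), timeout=5)`. Falling back to a static agent keeps the checker usable when the optional package is not installed.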

@SuperKogito
Member Author

Oh, perfect. Then let me know how I can support you and what I can help with. I have some time this weekend.

@vsoch
Collaborator

vsoch commented Jul 23, 2022

I say if you have extra time, enjoy it for yourself to relax! Unless there is a pressing issue on our board that you are excited about, or you want to mess around with testing other ways to improve our results. E.g., the ScienceDirect link still doesn’t work, but I’m hoping selenium can help! I ran out of header ideas. But yeah, I’d say put yourself first and just enjoy the time for yourself and/or family.
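
For context, the selenium backup strategy mentioned above could look roughly like the sketch below. It rests on several assumptions: the names `loads_in_browser` and `check_with_fallback` are hypothetical, Selenium 4 with a local Chrome install is assumed, and "the page loaded without a WebDriver error" is used as the success criterion because WebDriver does not expose HTTP status codes. That means an error page that renders normally would still count as a pass.

```python
def loads_in_browser(url, timeout=30):
    """Return True if headless Chrome loads url without a WebDriver error."""
    # Imported lazily so the fallback wrapper below works without selenium.
    from selenium import webdriver
    from selenium.common.exceptions import WebDriverException
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)  # raises if Chrome is unavailable
    try:
        driver.set_page_load_timeout(timeout)
        driver.get(url)
        return True
    except WebDriverException:
        # Covers page-load timeouts and navigation errors.
        return False
    finally:
        driver.quit()


def check_with_fallback(url, primary, fallback=loads_in_browser):
    """Run the cheap check first; start a browser only if it fails."""
    return bool(primary(url)) or bool(fallback(url))
```

Here `primary` would be whatever requests-based check is already in place, so the slower browser session runs only for URLs that still fail.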
