Warning in log when fetching paywalled article from newscientist.com #3818

Open
Najrim opened this issue Dec 21, 2018 · 8 comments

@Najrim

Najrim commented Dec 21, 2018

Issue details

I'm getting a "first byte timeout" warning in the log when fetching a New Scientist article using the site config below. Here's the full log:

[2018-12-21 12:07:06] graby.DEBUG: Graby is ready to fetch [] []
[2018-12-21 12:07:06] graby.DEBUG: . looking for site config for newscientist.com in primary folder {"host":"newscientist.com"} []
[2018-12-21 12:07:06] graby.DEBUG: ... found site config newscientist.com.txt {"host":"newscientist.com.txt"} []
[2018-12-21 12:07:06] graby.DEBUG: Appending site config settings from global.txt [] []
[2018-12-21 12:07:06] graby.DEBUG: . looking for site config for global in primary folder {"host":"global"} []
[2018-12-21 12:07:06] graby.DEBUG: ... found site config global.txt {"host":"global.txt"} []
[2018-12-21 12:07:06] graby.DEBUG: Cached site config with key: newscientist.com {"key":"newscientist.com"} []
[2018-12-21 12:07:06] graby.DEBUG: . looking for site config for global in primary folder {"host":"global"} []
[2018-12-21 12:07:06] graby.DEBUG: ... found site config global.txt {"host":"global.txt"} []
[2018-12-21 12:07:06] graby.DEBUG: Appending site config settings from global.txt [] []
[2018-12-21 12:07:06] graby.DEBUG: Cached site config with key: global {"key":"global"} []
[2018-12-21 12:07:06] graby.DEBUG: Cached site config with key: newscientist.com.merged {"key":"newscientist.com.merged"} []
[2018-12-21 12:07:06] graby.DEBUG: Fetching url: https://www.newscientist.com/article/mg24032041-600-the-race-to-green-domestic-heating-and-prevent-climate-catastrophe/ {"url":"https://www.newscientist.com/article/mg24032041-600-the-race-to-green-domestic-heating-and-prevent-climate-catastrophe/"} []
[2018-12-21 12:07:06] graby.DEBUG: Trying using method "get" on url "https://www.newscientist.com/article/mg24032041-600-the-race-to-green-domestic-heating-and-prevent-climate-catastrophe/" {"method":"get","url":"https://www.newscientist.com/article/mg24032041-600-the-race-to-green-domestic-heating-and-prevent-climate-catastrophe/"} []
[2018-12-21 12:07:06] graby.DEBUG: Use default user-agent "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2" for url "https://www.newscientist.com/article/mg24032041-600-the-race-to-green-domestic-heating-and-prevent-climate-catastrophe/" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://www.newscientist.com/article/mg24032041-600-the-race-to-green-domestic-heating-and-prevent-climate-catastrophe/"} []
[2018-12-21 12:07:06] graby.DEBUG: Use default referer "http://www.google.co.uk/url?sa=t&source=web&cd=1" for url "https://www.newscientist.com/article/mg24032041-600-the-race-to-green-domestic-heating-and-prevent-climate-catastrophe/" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://www.newscientist.com/article/mg24032041-600-the-race-to-green-domestic-heating-and-prevent-climate-catastrophe/"} []
[2018-12-21 12:07:07] graby.DEBUG: Returning cached and merged site config for newscientist.com {"host":"newscientist.com"} []
[2018-12-21 12:07:07] graby.DEBUG: Auth: add parameters. {"host":"newscientist.com","parameters":{"host":"newscientist.com","requiresLogin":true,"loginUri":"https://www.newscientist.com/login/","usernameField":"email","passwordField":"password","extraFields":[],"notLoggedInXpath":"//*[@id=\"subscription-barrier\"]","username":"**masked**","password":"**masked**"}} []
[2018-12-21 12:07:59] graby.WARNING: Request throw exception (with a response): Server error response [url] https://www.newscientist.com/login/ [status code] 503 [reason phrase] first byte timeout {"error_message":"Server error response [url] https://www.newscientist.com/login/ [status code] 503 [reason phrase] first byte timeout"} []
[2018-12-21 12:07:59] graby.DEBUG: Data fetched: [array] {"data":{"effective_url":"https://www.newscientist.com/login/","body":"(only length for debug): 898","headers":"text/html; charset=utf-8","all_headers":{"server":"Varnish","retry-after":"0","content-type":"text/html; charset=utf-8","accept-ranges":"bytes","date":"Fri, 21 Dec 2018 11:07:59 GMT","via":"1.1 varnish","connection":"close","set-cookie":"FastlyEdge=0Hn7/4IXzxmkidBNrKvgQ5uZvbHKk0FxlB1cNP9GrHg=; path=/, visid_incap_276977=HUnLrOVFTgKRGx/qEkevVlrJHFwAAAAAQUIPAAAAAADedXTrsCPiAaZuUFVWe+RW; expires=Sat, 21 Dec 2019 07:13:41 GMT; path=/; Domain=.newscientist.com, nlbi_276977=foWePze26xybU2g/kc7wwQAAAAB/NWVVErw+UZKEARAjFcdW; path=/; Domain=.newscientist.com, incap_ses_275_276977=YydzE58fB2raRf7xxwLRA47JHFwAAAAAs7ze6aEphCGKxzx9Lwvnew==; path=/; Domain=.newscientist.com","x-is-ssl":"yes","x-served-by":"cache-bma1641-BMA","x-cache":"MISS","x-cache-hits":"0","x-iinfo":"10-14605285-14605293 NNNN CT(2 0 0) RT(1545390426143 197) q(0 0 0 1) r(522 522) U5","x-cdn":"Incapsula","transfer-encoding":"chunked"},"status":503}} []
[2018-12-21 12:07:59] graby.DEBUG: Treating as UTF-8 {"encoding":"utf-8"} []
[2018-12-21 12:07:59] graby.DEBUG: Opengraph data: [array] {"ogData":[]} []
[2018-12-21 12:07:59] graby.DEBUG: Looking for site config files to see if single page link exists [] []
[2018-12-21 12:07:59] graby.DEBUG: Returning cached and merged site config for newscientist.com {"host":"newscientist.com"} []
[2018-12-21 12:07:59] graby.DEBUG: No "single_page_link" config found [] []
[2018-12-21 12:07:59] graby.DEBUG: Attempting to extract content [] []
[2018-12-21 12:07:59] graby.DEBUG: Returning cached and merged site config for newscientist.com {"host":"newscientist.com"} []
[2018-12-21 12:07:59] graby.DEBUG: Strings replaced: 0 (find_string and/or replace_string) {"count":0} []
[2018-12-21 12:07:59] graby.DEBUG: Attempting to parse HTML with libxml {"parser":"libxml"} []
[2018-12-21 12:07:59] graby.DEBUG: Body size after Readability: 479 {"length":479} []
[2018-12-21 12:07:59] graby.DEBUG: Trying //meta[@property="og:title"]/@content for title {"pattern":"//meta[@property=\"og:title\"]/@content"} []
[2018-12-21 12:07:59] graby.DEBUG: Trying //meta[@property="article:published_time"]/@content for date {"pattern":"//meta[@property=\"article:published_time\"]/@content"} []
[2018-12-21 12:07:59] graby.DEBUG: Trying //html[@lang]/@lang for language {"pattern":"//html[@lang]/@lang"} []
[2018-12-21 12:07:59] graby.DEBUG: Trying //meta[@name="DC.language"]/@content for language {"pattern":"//meta[@name=\"DC.language\"]/@content"} []
[2018-12-21 12:07:59] graby.DEBUG: Using Readability [] []
[2018-12-21 12:07:59] graby.DEBUG: Detected title: 503 first byte timeout {"title":"503 first byte timeout"} []
[2018-12-21 12:07:59] graby.DEBUG: Trying again without tidy [] []
[2018-12-21 12:07:59] graby.DEBUG: Strings replaced: 0 (find_string and/or replace_string) {"count":0} []
[2018-12-21 12:07:59] graby.DEBUG: Attempting to parse HTML with libxml {"parser":"libxml"} []
[2018-12-21 12:07:59] graby.DEBUG: Body size after Readability: 574 {"length":574} []
[2018-12-21 12:07:59] graby.DEBUG: Trying //meta[@property="og:title"]/@content for title {"pattern":"//meta[@property=\"og:title\"]/@content"} []
[2018-12-21 12:07:59] graby.DEBUG: Trying //meta[@property="article:published_time"]/@content for date {"pattern":"//meta[@property=\"article:published_time\"]/@content"} []
[2018-12-21 12:07:59] graby.DEBUG: Trying //html[@lang]/@lang for language {"pattern":"//html[@lang]/@lang"} []
[2018-12-21 12:07:59] graby.DEBUG: Trying //meta[@name="DC.language"]/@content for language {"pattern":"//meta[@name=\"DC.language\"]/@content"} []
[2018-12-21 12:07:59] graby.DEBUG: Using Readability [] []
[2018-12-21 12:07:59] graby.DEBUG: Detected title: 503 first byte timeout {"title":"503 first byte timeout"} []
[2018-12-21 12:07:59] graby.DEBUG: Success ?  {"is_success":false} []
[2018-12-21 12:07:59] graby.DEBUG: Extract failed [] []

Environment

  • wallabag version (or git revision) that exhibits the issue: git
  • How did you install wallabag? Via git clone or by downloading the package? git clone
  • Last wallabag version that did not exhibit the issue (if applicable):
  • php version: 7.0
  • OS:
  • type of hosting (shared or dedicated): self-hosting
  • which storage system you choose at install (SQLite, MySQL/MariaDB or PostgreSQL):

Steps to reproduce/test case

This is the site-config:

requires_login: yes

login_uri: https://www.newscientist.com/login/
login_username_field: email
login_password_field: password

not_logged_in_xpath: //*[@id="subscription-barrier"]
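
For anyone checking the config by hand, here is a minimal sketch of how these `key: value` lines can be read. It is not Graby's actual parser (Graby is written in PHP); `parse_site_config` is a hypothetical helper, and treating `#` lines as comments and keeping only one value per key are simplifying assumptions. Splitting on the first colon only is what keeps the `https://` URL intact.

```python
def parse_site_config(text):
    """Parse 'key: value' lines into a dict (hypothetical helper, not Graby's code)."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines and lines starting with '#' carry no settings.
        if not line or line.startswith("#"):
            continue
        # Split on the FIRST colon only, so 'https://...' values stay whole.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        config[key.strip()] = value.strip()
    return config


SITE_CONFIG = """\
requires_login: yes

login_uri: https://www.newscientist.com/login/
login_username_field: email
login_password_field: password

not_logged_in_xpath: //*[@id="subscription-barrier"]
"""

config = parse_site_config(SITE_CONFIG)
for key, value in config.items():
    print(f"{key} = {value}")
```

The five values it prints line up with the `loginUri`, `usernameField`, `passwordField` and `notLoggedInXpath` parameters in the "Auth: add parameters" log entry above, so the config is at least being picked up as written.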
@techexo
Contributor

techexo commented Dec 21, 2018

The site config looks good to me. (Edit: I know this isn't a very useful message; I just wanted to say that I checked and didn't see anything obviously wrong.)

@Najrim
Author

Najrim commented Dec 21, 2018

Any idea what the warning means? "first byte timeout" and "Server error response" sound like the server is blocking whatever program is used to download the page. There is a 503 in there; could it be the HTTP error 503 "Service Unavailable"?

@techexo
Contributor

techexo commented Dec 23, 2018

IIRC, "first byte timeout" means that the server didn't answer the request within 60 seconds. So it could be an overload on their side, a DDoS and so on, or there could be issues with the parser or the authentication that make the remote server misbehave.
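
To make the warning easier to read, here is a rough sketch that splits the logged error message into its parts. The message string is copied from the log above; `parse_error_message` and the regular expression are hypothetical helpers written for this thread, not part of Graby or wallabag. It uses Python's standard `http.HTTPStatus` to look up the standard phrase for the status code, which is separate from the custom reason phrase the server sent.

```python
import re
from http import HTTPStatus

# Copied from the graby.WARNING entry in the log above.
ERROR_MESSAGE = (
    "Server error response [url] https://www.newscientist.com/login/ "
    "[status code] 503 [reason phrase] first byte timeout"
)

# Assumption: the message always follows this "[url] ... [status code] ... [reason phrase] ..." layout.
PATTERN = re.compile(
    r"\[url\] (?P<url>\S+) "
    r"\[status code\] (?P<status>\d{3}) "
    r"\[reason phrase\] (?P<reason>.+)$"
)


def parse_error_message(message):
    """Return the URL, status code and reason phrases, or None if the layout differs."""
    match = PATTERN.search(message)
    if match is None:
        return None
    status = int(match.group("status"))
    try:
        standard_phrase = HTTPStatus(status).phrase
    except ValueError:
        # Status code not known to the standard library.
        standard_phrase = None
    return {
        "url": match.group("url"),
        "status": status,
        "standard_phrase": standard_phrase,
        "server_reason": match.group("reason"),
    }


info = parse_error_message(ERROR_MESSAGE)
print(info)
```

So the 503 is indeed the standard "Service Unavailable" status, and "first byte timeout" is the reason phrase the responding server chose to send. Note that the failing URL is the login page, not the article itself, and the logged response headers (`server: Varnish`, `via: 1.1 varnish`) suggest the reply may have come from a caching layer in front of the site rather than from the site's own backend; that last part is an inference from the headers, not something the log states.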

@Jeannotisintheplace

Hello,
Did anyone finally manage to get their articles from New Scientist with wallabag?

I have a paid New Scientist account and tried different settings in wallabag's "Site Credentials" management menu, but I never managed to get the full articles, only the free, partial version of each article.

@Najrim
Author

Najrim commented Mar 23, 2020

Nope, stopped trying.

@Jeannotisintheplace

Oh, that's too bad, especially during this lockdown!
Did you find an alternative way to read New Scientist articles on an e-book reader, by any chance?

@Najrim
Author

Najrim commented Mar 25, 2020

Unfortunately not. It's the one source I've been struggling with.

@Jeannotisintheplace

Jeannotisintheplace commented Mar 25, 2020

Hello,
Found that DOTEPUB does the job:

https://dotepub.com/

:-)
