Warning in log when fetching paywalled article from newscientist.com #3818

Najrim · 2018-12-21T11:29:13Z

Issue details

I'm getting a "first byte timeout" warning in the log when fetching a New Scientist article using the site config below. Here's the full log:

[2018-12-21 12:07:06] graby.DEBUG: Graby is ready to fetch [] []
[2018-12-21 12:07:06] graby.DEBUG: . looking for site config for newscientist.com in primary folder {"host":"newscientist.com"} []
[2018-12-21 12:07:06] graby.DEBUG: ... found site config newscientist.com.txt {"host":"newscientist.com.txt"} []
[2018-12-21 12:07:06] graby.DEBUG: Appending site config settings from global.txt [] []
[2018-12-21 12:07:06] graby.DEBUG: . looking for site config for global in primary folder {"host":"global"} []
[2018-12-21 12:07:06] graby.DEBUG: ... found site config global.txt {"host":"global.txt"} []
[2018-12-21 12:07:06] graby.DEBUG: Cached site config with key: newscientist.com {"key":"newscientist.com"} []
[2018-12-21 12:07:06] graby.DEBUG: . looking for site config for global in primary folder {"host":"global"} []
[2018-12-21 12:07:06] graby.DEBUG: ... found site config global.txt {"host":"global.txt"} []
[2018-12-21 12:07:06] graby.DEBUG: Appending site config settings from global.txt [] []
[2018-12-21 12:07:06] graby.DEBUG: Cached site config with key: global {"key":"global"} []
[2018-12-21 12:07:06] graby.DEBUG: Cached site config with key: newscientist.com.merged {"key":"newscientist.com.merged"} []
[2018-12-21 12:07:06] graby.DEBUG: Fetching url: https://www.newscientist.com/article/mg24032041-600-the-race-to-green-domestic-heating-and-prevent-climate-catastrophe/ {"url":"https://www.newscientist.com/article/mg24032041-600-the-race-to-green-domestic-heating-and-prevent-climate-catastrophe/"} []
[2018-12-21 12:07:06] graby.DEBUG: Trying using method "get" on url "https://www.newscientist.com/article/mg24032041-600-the-race-to-green-domestic-heating-and-prevent-climate-catastrophe/" {"method":"get","url":"https://www.newscientist.com/article/mg24032041-600-the-race-to-green-domestic-heating-and-prevent-climate-catastrophe/"} []
[2018-12-21 12:07:06] graby.DEBUG: Use default user-agent "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2" for url "https://www.newscientist.com/article/mg24032041-600-the-race-to-green-domestic-heating-and-prevent-climate-catastrophe/" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://www.newscientist.com/article/mg24032041-600-the-race-to-green-domestic-heating-and-prevent-climate-catastrophe/"} []
[2018-12-21 12:07:06] graby.DEBUG: Use default referer "http://www.google.co.uk/url?sa=t&source=web&cd=1" for url "https://www.newscientist.com/article/mg24032041-600-the-race-to-green-domestic-heating-and-prevent-climate-catastrophe/" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://www.newscientist.com/article/mg24032041-600-the-race-to-green-domestic-heating-and-prevent-climate-catastrophe/"} []
[2018-12-21 12:07:07] graby.DEBUG: Returning cached and merged site config for newscientist.com {"host":"newscientist.com"} []
[2018-12-21 12:07:07] graby.DEBUG: Auth: add parameters. {"host":"newscientist.com","parameters":{"host":"newscientist.com","requiresLogin":true,"loginUri":"https://www.newscientist.com/login/","usernameField":"email","passwordField":"password","extraFields":[],"notLoggedInXpath":"//*[@id=\"subscription-barrier\"]","username":"**masked**","password":"**masked**"}} []
[2018-12-21 12:07:59] graby.WARNING: Request throw exception (with a response): Server error response [url] https://www.newscientist.com/login/ [status code] 503 [reason phrase] first byte timeout {"error_message":"Server error response [url] https://www.newscientist.com/login/ [status code] 503 [reason phrase] first byte timeout"} []
[2018-12-21 12:07:59] graby.DEBUG: Data fetched: [array] {"data":{"effective_url":"https://www.newscientist.com/login/","body":"(only length for debug): 898","headers":"text/html; charset=utf-8","all_headers":{"server":"Varnish","retry-after":"0","content-type":"text/html; charset=utf-8","accept-ranges":"bytes","date":"Fri, 21 Dec 2018 11:07:59 GMT","via":"1.1 varnish","connection":"close","set-cookie":"FastlyEdge=0Hn7/4IXzxmkidBNrKvgQ5uZvbHKk0FxlB1cNP9GrHg=; path=/, visid_incap_276977=HUnLrOVFTgKRGx/qEkevVlrJHFwAAAAAQUIPAAAAAADedXTrsCPiAaZuUFVWe+RW; expires=Sat, 21 Dec 2019 07:13:41 GMT; path=/; Domain=.newscientist.com, nlbi_276977=foWePze26xybU2g/kc7wwQAAAAB/NWVVErw+UZKEARAjFcdW; path=/; Domain=.newscientist.com, incap_ses_275_276977=YydzE58fB2raRf7xxwLRA47JHFwAAAAAs7ze6aEphCGKxzx9Lwvnew==; path=/; Domain=.newscientist.com","x-is-ssl":"yes","x-served-by":"cache-bma1641-BMA","x-cache":"MISS","x-cache-hits":"0","x-iinfo":"10-14605285-14605293 NNNN CT(2 0 0) RT(1545390426143 197) q(0 0 0 1) r(522 522) U5","x-cdn":"Incapsula","transfer-encoding":"chunked"},"status":503}} []
[2018-12-21 12:07:59] graby.DEBUG: Treating as UTF-8 {"encoding":"utf-8"} []
[2018-12-21 12:07:59] graby.DEBUG: Opengraph data: [array] {"ogData":[]} []
[2018-12-21 12:07:59] graby.DEBUG: Looking for site config files to see if single page link exists [] []
[2018-12-21 12:07:59] graby.DEBUG: Returning cached and merged site config for newscientist.com {"host":"newscientist.com"} []
[2018-12-21 12:07:59] graby.DEBUG: No "single_page_link" config found [] []
[2018-12-21 12:07:59] graby.DEBUG: Attempting to extract content [] []
[2018-12-21 12:07:59] graby.DEBUG: Returning cached and merged site config for newscientist.com {"host":"newscientist.com"} []
[2018-12-21 12:07:59] graby.DEBUG: Strings replaced: 0 (find_string and/or replace_string) {"count":0} []
[2018-12-21 12:07:59] graby.DEBUG: Attempting to parse HTML with libxml {"parser":"libxml"} []
[2018-12-21 12:07:59] graby.DEBUG: Body size after Readability: 479 {"length":479} []
[2018-12-21 12:07:59] graby.DEBUG: Trying //meta[@property="og:title"]/@content for title {"pattern":"//meta[@property=\"og:title\"]/@content"} []
[2018-12-21 12:07:59] graby.DEBUG: Trying //meta[@property="article:published_time"]/@content for date {"pattern":"//meta[@property=\"article:published_time\"]/@content"} []
[2018-12-21 12:07:59] graby.DEBUG: Trying //html[@lang]/@lang for language {"pattern":"//html[@lang]/@lang"} []
[2018-12-21 12:07:59] graby.DEBUG: Trying //meta[@name="DC.language"]/@content for language {"pattern":"//meta[@name=\"DC.language\"]/@content"} []
[2018-12-21 12:07:59] graby.DEBUG: Using Readability [] []
[2018-12-21 12:07:59] graby.DEBUG: Detected title: 503 first byte timeout {"title":"503 first byte timeout"} []
[2018-12-21 12:07:59] graby.DEBUG: Trying again without tidy [] []
[2018-12-21 12:07:59] graby.DEBUG: Strings replaced: 0 (find_string and/or replace_string) {"count":0} []
[2018-12-21 12:07:59] graby.DEBUG: Attempting to parse HTML with libxml {"parser":"libxml"} []
[2018-12-21 12:07:59] graby.DEBUG: Body size after Readability: 574 {"length":574} []
[2018-12-21 12:07:59] graby.DEBUG: Trying //meta[@property="og:title"]/@content for title {"pattern":"//meta[@property=\"og:title\"]/@content"} []
[2018-12-21 12:07:59] graby.DEBUG: Trying //meta[@property="article:published_time"]/@content for date {"pattern":"//meta[@property=\"article:published_time\"]/@content"} []
[2018-12-21 12:07:59] graby.DEBUG: Trying //html[@lang]/@lang for language {"pattern":"//html[@lang]/@lang"} []
[2018-12-21 12:07:59] graby.DEBUG: Trying //meta[@name="DC.language"]/@content for language {"pattern":"//meta[@name=\"DC.language\"]/@content"} []
[2018-12-21 12:07:59] graby.DEBUG: Using Readability [] []
[2018-12-21 12:07:59] graby.DEBUG: Detected title: 503 first byte timeout {"title":"503 first byte timeout"} []
[2018-12-21 12:07:59] graby.DEBUG: Success ?  {"is_success":false} []
[2018-12-21 12:07:59] graby.DEBUG: Extract failed [] []

Environment

wallabag version (or git revision) that exhibits the issue: git
How did you install wallabag? Via git clone or by downloading the package? git clone
Last wallabag version that did not exhibit the issue (if applicable):
php version: 7.0
OS:
type of hosting (shared or dedicated): self-hosting
which storage system you choose at install (SQLite, MySQL/MariaDB or PostgreSQL):

Steps to reproduce/test case

This is the site-config:

requires_login: yes

login_uri: https://www.newscientist.com/login/
login_username_field: email
login_password_field: password

not_logged_in_xpath: //*[@id="subscription-barrier"]
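
To illustrate what `not_logged_in_xpath` is presumably for, here is a minimal sketch, assuming the XPath is meant to match only when the paywall marker is on the page (that is, when the reader is not logged in). The helper name `looks_logged_out` and both sample pages are hypothetical. The XPath is rewritten in the limited syntax that Python's `xml.etree.ElementTree` supports, and this is not how graby itself evaluates the rule.

```python
import xml.etree.ElementTree as ET

# XPath from the site config, adapted to ElementTree's limited XPath subset.
NOT_LOGGED_IN_XPATH = ".//*[@id='subscription-barrier']"


def looks_logged_out(xhtml: str) -> bool:
    """Return True when the paywall marker element is present in the page."""
    root = ET.fromstring(xhtml)  # requires well-formed markup
    return root.find(NOT_LOGGED_IN_XPATH) is not None


# Hypothetical, heavily simplified pages used only for illustration.
paywalled_page = (
    '<html><body><div id="subscription-barrier">Subscribe</div></body></html>'
)
full_page = '<html><body><div id="article">Full text</div></body></html>'

print(looks_logged_out(paywalled_page))
print(looks_logged_out(full_page))
```

Real pages are rarely well-formed XML, so a real check would need an HTML-tolerant parser; the sketch only shows the matching logic.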

techexo · 2018-12-21T16:15:10Z

The site config looks good to me. (Edit: I know this isn't a very useful message; it's just to indicate that I checked and didn't see anything obviously wrong.)

Najrim · 2018-12-21T16:21:28Z

Any idea what the warning means? "first byte timeout" and "Server error response" sound like the server is blocking whatever program is used to download the page. There is a 503 in there; could it be HTTP error 503, "Service Unavailable"?
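
As a small aid for reading the warning, the sketch below pulls the URL, status code, and reason phrase out of the message and looks up the standard phrase for that code in Python's `http.HTTPStatus`. The function name `parse_warning` and the regular expression are assumptions made for illustration; they are not part of graby or wallabag.

```python
import re
from http import HTTPStatus

# The error message copied from the graby.WARNING line in the log above.
WARNING = (
    "Server error response [url] https://www.newscientist.com/login/ "
    "[status code] 503 [reason phrase] first byte timeout"
)

# Hypothetical pattern matching the "[url] ... [status code] ... [reason phrase] ..." layout.
PATTERN = re.compile(
    r"\[url\] (?P<url>\S+) \[status code\] (?P<status>\d{3}) "
    r"\[reason phrase\] (?P<reason>.+)$"
)


def parse_warning(message):
    """Return the parts of the warning as a dict, or None if the layout differs."""
    match = PATTERN.search(message)
    if match is None:
        return None
    status = int(match.group("status"))
    try:
        standard_phrase = HTTPStatus(status).phrase
    except ValueError:
        standard_phrase = None  # not a status code known to the standard library
    return {
        "url": match.group("url"),
        "status": status,
        "reason": match.group("reason"),
        "standard_phrase": standard_phrase,
    }


info = parse_warning(WARNING)
print(info["status"], info["standard_phrase"])  # 503 Service Unavailable
print(info["reason"])  # first byte timeout
```

So the status code is the standard 503 "Service Unavailable"; "first byte timeout" is a custom reason phrase supplied by the responding server in place of the standard one.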

techexo · 2018-12-23T01:15:16Z

IIRC, "first byte timeout" means that the server didn't answer the request within 60 seconds. So it could be an overload on their side, a DDoS, and so on, or there are issues with the parser or the authentication that make the remote server misbehave.
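
To check how long the request actually waited, the timestamps in the log can be compared. The sketch below is a minimal example, assuming the wait starts at the "Auth: add parameters" line and ends at the WARNING line; the two sample lines are shortened copies of the log entries above, and the helper names are hypothetical.

```python
import re
from datetime import datetime

# Matches the leading "[YYYY-MM-DD HH:MM:SS]" timestamp of a log line.
TIMESTAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]")


def log_time(line):
    """Parse the leading timestamp of a log line into a datetime."""
    match = TIMESTAMP.match(line)
    if match is None:
        raise ValueError("line does not start with a timestamp")
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")


def elapsed_seconds(start_line, end_line):
    """Seconds between two log lines (the log has one-second resolution)."""
    return (log_time(end_line) - log_time(start_line)).total_seconds()


# Shortened copies of the two relevant log lines.
auth_line = "[2018-12-21 12:07:07] graby.DEBUG: Auth: add parameters."
warning_line = "[2018-12-21 12:07:59] graby.WARNING: Request throw exception"

print(elapsed_seconds(auth_line, warning_line))  # 52.0
```

In this log the gap is about 52 seconds, so the login request did hang for most of a minute before the 503 came back.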

Jeannotisintheplace · 2020-03-23T11:28:38Z

Hello,
Did anyone finally manage to fetch their New Scientist articles with Wallabag?

Having a paid account on New Scientist, I tried different settings in the "Site Credentials" management menu of Wallabag, but never managed to get the full articles, only the free preview of each article.

Najrim · 2020-03-23T13:36:42Z

Nope, stopped trying.

Jeannotisintheplace · 2020-03-25T20:46:58Z

Oh, that's too bad, especially during this lockdown period!
Did you find an alternative way to read New Scientist articles on an e-book reader, by any chance?

Najrim · 2020-03-25T21:08:36Z

Unfortunately not. It's the one source I've been struggling with.

Jeannotisintheplace · 2020-03-25T21:59:20Z

Hello,
Found that DOTEPUB does the job:

https://dotepub.com/

:-)

j0k3r added the Site Config label Dec 21, 2018
