www.theverge.com consistent "unable to retrieve full-text content" #969

JulienPalard · 2014-12-14T22:59:28Z

I consistently get "unable to retrieve full-text content" when saving articles from The Verge. Here is an example: http://www.theverge.com/2014/12/11/7376599/anti-piracy-meeting-between-google-sony-eli-lilly-homeland-security

Have you tried different "content extractors"? If so, have you documented that research? I may be interested in trying to implement one, if nothing good exists.

tcitworld · 2014-12-14T23:08:22Z

Wallabag uses this to parse websites: http://help.fivefilters.org/customer/portal/articles/223153-site-patterns

You might be interested in adapting the siteconfig file.

JulienPalard · 2014-12-14T23:58:31Z

I tried a POC in a few lines of python: it looks for promising elements using their ratio of words to tags, then looks for the common parents of the matched elements. It typically finds paragraphs, whose common parent is typically the article div. I only took the time to try it on The Verge and melty, and it worked, but we may need a large test set to compare implementations. Would this kind of work help?
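
For illustration, the heuristic described above can be sketched roughly as follows. This is not the original POC, which is not shown in this thread. It is a minimal standard-library sketch under stated assumptions: the names `Node`, `TreeBuilder`, and `find_article` and the thresholds `min_words` and `min_ratio` are hypothetical, and the input is assumed to be reasonably well-formed HTML.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, and elements whose text is not prose.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
SKIP = {"script", "style"}


class Node:
    def __init__(self, tag, attrs=(), parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.words = 0  # words anywhere inside this element
        self.tags = 0   # this element plus all of its descendants


class TreeBuilder(HTMLParser):
    """Builds a crude element tree while counting words and tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.nodes.append(node)
        ancestor = node
        while ancestor is not None:  # count the tag in every enclosing element
            ancestor.tags += 1
            ancestor = ancestor.parent
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag in SKIP:
            return
        count = len(data.split())
        node = self.current
        while node is not None:  # count the words in every enclosing element
            node.words += count
            node = node.parent


def find_article(html, min_words=20, min_ratio=10.0):
    """Return the element that is most often the parent of text-dense elements."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    # Step 1: keep elements with enough words and a high words/tags ratio.
    dense = [n for n in builder.nodes
             if n.words >= min_words and n.words / n.tags >= min_ratio]
    if not dense:
        return None
    # Step 2: each dense element votes for its parent; the most voted one wins.
    votes = Counter(n.parent for n in dense)
    return votes.most_common(1)[0][0]
```

`find_article(html)` returns the chosen node, or `None` when nothing is dense enough. Real pages with unclosed tags or heavy inline markup would likely need an HTML5-aware parser and tuned thresholds.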

tcitworld · 2014-12-15T09:43:43Z

ping @nicosomb

nicosomb · 2014-12-15T09:52:00Z

I don't know why we can't get theverge content ... @fivefilters manages it here: http://fivefilters.org/content-only/

> I tried a POC in a few lines of python

Today, wallabag only uses PHP scripts. If you write a great tool in python, how can we use it? Would you host it on your own server?

tcitworld · 2014-12-15T09:53:23Z

@fivefilters have a newer version of FullTextRSS, and maybe a newer siteconfig for TheVerge.

nicosomb · 2014-12-15T09:53:51Z

Not here https://github.com/fivefilters/ftr-site-config/blob/master/theverge.com.txt

fivefilters · 2014-12-15T11:06:31Z

Thanks for the report. I'll test using the Full-Text RSS version used by Wallabag to see if I can reproduce this. Has anyone else managed to reproduce this using Wallabag? (I'm wondering if it's perhaps an issue related to The Verge blocking the server making the request.)

tcitworld · 2014-12-15T11:36:34Z

I haven't tried it myself yet.

nicosomb · 2014-12-15T12:24:57Z

I tried with my framabag account and I see the same behavior as @JulienPalard.

JulienPalard · 2014-12-16T14:15:52Z

@nicosomb I know that wallabag uses PHP, but what I wrote is just a POC, it's probably easily portable to PHP. I was just throwing a few lines of code in a language I like, to test my ideas, with ease, don't worry ;-)

BTW it seems to work on the only two pages I tested. I was not aware of http://fivefilters.org/content-only/ and now I'll be able to compare my results against an existing tool.

But I'll continue my POC only if someone here thinks the algorithm used by fivefilters is not good enough :-)

Wikunia · 2015-01-24T18:53:44Z

Looks like there is a bug in parsing the config files. At the moment I have no idea why :/
This config file:

title: //h1
body: //article
prune: no
#tidy: no

Works for my article:
http://www.theverge.com/2015/1/21/7868251/microsoft-hololens-hologram-hands-on-experience and for the one from @JulienPalard.

*Works => it shows the article, but unfortunately some extra content as well.
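
For readers unfamiliar with the site config format, the file above is a list of `directive: value` lines, with `#` starting a comment. The snippet below is a minimal sketch of reading such a file, not the parser Full-Text RSS actually uses. The function name `parse_site_config` is hypothetical, and it assumes each directive fits on one line and that the directive name contains no colon.

```python
def parse_site_config(text):
    """Parse 'directive: value' lines into a dict of lists (sketch only)."""
    config = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and commented-out directives such as '#tidy: no'.
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so XPath values may contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        # A directive may appear several times, so collect values in a list.
        config.setdefault(key.strip(), []).append(value.strip())
    return config


sample = """title: //h1
body: //article
prune: no
#tidy: no
"""
print(parse_site_config(sample))
# {'title': ['//h1'], 'body': ['//article'], 'prune': ['no']}
```

Values are kept in lists because a site config may repeat a directive, for example several `body:` lines tried in order.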

tcitworld · 2015-01-25T12:33:12Z

Thank you. I'll try getting rid of the "something more"

fivefilters · 2015-01-25T16:00:51Z

I haven't had a chance to test this yet, but if anyone has access to the debug output, that might reveal why it's not working with the original site config. I tried with ftr.fivefilters.org and here's the output:

* APC is enabled and available on server
* Supplied URL: http://www.theverge.com/2015/1/21/7868251/microsoft-hololens-hologram-hands-on-experience
* Caching is enabled...
* Cache key not found in APC
* ** Loading class HumbleHttpAgent (humble-http-agent/HumbleHttpAgent.php)
* ** Loading class CookieJar (humble-http-agent/CookieJar.php)
* ** Loading class ContentExtractor (content-extractor/ContentExtractor.php)
* ** Loading class SiteConfig (content-extractor/SiteConfig.php)
* --------
* Attempting to process URL as feed
* ** Loading class SimplePie_HumbleHttpAgent (humble-http-agent/SimplePie_HumbleHttpAgent.php)
* ** Loading class DisableSimplePieSanitize (DisableSimplePieSanitize.php)
* Fetching URL (http://www.theverge.com/2015/1/21/7868251/microsoft-hololens-hologram-hands-on-experience)
* Starting parallel fetch (HttpRequestPool)
* Processing set of 1
* ...http://www.theverge.com/2015/1/21/7868251/microsoft-hololens-hologram-hands-on-experience
* ......adding to pool
* Sending request...
* Received responses
* --------
* Constructing a single-item feed from URL
* ** Loading class FeedWriter (feedwriter/FeedWriter.php)
* --------
* Fetching feed items
* Starting parallel fetch (HttpRequestPool)
* Processing set of 1
* ...http://www.theverge.com/2015/1/21/7868251/microsoft-hololens-hologram-hands-on-experience
* ......in memory
* --------
* Processing feed item 1
* Item URL: http://www.theverge.com/2015/1/21/7868251/microsoft-hololens-hologram-hands-on-experience
* ** Loading class FeedItem (feedwriter/FeedItem.php)
* URL already fetched - in memory (http://www.theverge.com/2015/1/21/7868251/microsoft-hololens-hologram-hands-on-experience, effective: http://www.theverge.com/2015/1/21/7868251/microsoft-hololens-hologram-hands-on-experience)
* Character encoding: utf-8
* Looking for site config files to see if single page link exists
* Returning cached and merged site config for theverge.com
* . looking for site config for theverge.com.merged in primary folder
* ... site config for theverge.com.merged in APC cache
* --------
* Attempting to extract content
* Returning cached and merged site config for theverge.com
* . looking for site config for theverge.com.merged in primary folder
* ... site config for theverge.com.merged in APC cache
* Strings replaced: 169 (find_string and/or replace_string)
* Using Tidy
* Attempting to parse HTML with libxml
* ** Loading class Readability (readability/Readability.php)
* Title matched: We just tried HoloLens, Microsoft's most intriguing product in years
* ...XPath match: //meta[@property="og:title"]/@content
* Language matched: en-US
* Date matched: 2015-01-21 23:28:24
* ...XPath match: //meta[@property="article:published_time"]/@content
* Stripping 1 elements (strip)
* Stripping 3 elements (strip)
* Stripping 3 elements (strip)
* Stripping 4 elements (strip)
* Stripping 3 elements (strip)
* Stripping 2 elements (strip)
* Stripping 1 elements (strip)
* Stripping 10 elements (strip_id_or_class)
* Stripping 4 elements (strip_id_or_class)
* Body matched
* ...XPath match: //article
* Done!

Link: http://ftr.fivefilters.org/makefulltextfeed.php?debug&url=www.theverge.com%2F2015%2F1%2F21%2F7868251%2Fmicrosoft-hololens-hologram-hands-on-experience
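
In the debug output above, the title and date are matched with XPath expressions of the form `//meta[@property="..."]/@content`. As a rough illustration of what those expressions select, here is a standard-library sketch that collects `property`/`content` pairs from `<meta>` tags. It is not how Full-Text RSS does it: the names `MetaPropertyCollector` and `read_meta_properties` are hypothetical, and the sample markup is hand-written, not the real page.

```python
from html.parser import HTMLParser


class MetaPropertyCollector(HTMLParser):
    """Collects <meta property="..." content="..."> pairs, first match wins."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop, content = attr.get("property"), attr.get("content")
        if prop and content is not None:
            # Keep the first occurrence, like taking the first XPath match.
            self.properties.setdefault(prop, content)


def read_meta_properties(html):
    collector = MetaPropertyCollector()
    collector.feed(html)
    collector.close()
    return collector.properties


# Hand-written sample markup for illustration only.
sample = (
    '<html><head>'
    '<meta property="og:title" content="Sample headline">'
    '<meta property="article:published_time" content="2015-01-21T23:28:24">'
    '</head><body></body></html>'
)
props = read_meta_properties(sample)
print(props.get("og:title"))
print(props.get("article:published_time"))
```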

nicosomb · 2016-01-24T08:17:18Z

I just tried on wallabag v2 (http://v2.wallabag.org) and it's fixed.

tcitworld added the Bug label Dec 15, 2014

tcitworld self-assigned this Jan 31, 2015

nicosomb closed this as completed Jan 24, 2016
