www.theverge.com consistent "unable to retrieve full-text content" #969

Closed
JulienPalard opened this issue Dec 14, 2014 · 14 comments

@JulienPalard

I consistently get "unable to retrieve full-text content" when saving articles from The Verge. Here is an example: http://www.theverge.com/2014/12/11/7376599/anti-piracy-meeting-between-google-sony-eli-lilly-homeland-security

Have you tried different "content extractors"? If so, have you documented what you found? I may be interested in implementing one if nothing suitable exists.

@tcitworld
Member

Wallabag uses this to parse websites: http://help.fivefilters.org/customer/portal/articles/223153-site-patterns

You might be interested in adapting the siteconfig file.

@JulienPalard
Author

I tried a POC in a few lines of Python: it looks for promising elements using their words-to-tags ratio, then looks for the common parents of the matched elements. It typically finds paragraphs, whose common parent is typically the article div. I only took the time to try it on The Verge and melty, where it worked, but we may need a large test set to compare implementations. Would this kind of work help?
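
For readers following along, the heuristic described in this comment could be sketched roughly as below. This is not the POC itself, which was not posted in this thread. The names `Node`, `TreeBuilder`, `find_article`, the `min_ratio` threshold of 10 and the `SAMPLE` page are all assumptions made for illustration, and only the Python standard library is used.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # never have children
SKIP_TAGS = {"script", "style"}                           # text here is not prose


class Node:
    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []
        self.text = []  # text chunks found directly inside this node

    def words(self):
        """Count words in this node and all of its descendants."""
        own = sum(len(chunk.split()) for chunk in self.text)
        return own + sum(child.words() for child in self.children)

    def tags(self):
        """Count this node plus all descendant tags."""
        return 1 + sum(child.tags() for child in self.children)


class TreeBuilder(HTMLParser):
    """Build a very small DOM-like tree, tolerant of unclosed tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("root", [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in SKIP_TAGS:
            self.current.text.append(data)


def iter_nodes(node):
    for child in node.children:
        yield child
        yield from iter_nodes(child)


def find_article(html, min_ratio=10.0):
    """Return the most common parent of text-dense elements, or None."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    # Step 1: keep elements with many words per tag (typically paragraphs).
    dense = [n for n in iter_nodes(builder.root)
             if n.words() / n.tags() >= min_ratio]
    if not dense:
        return None
    # Step 2: the parent shared by most dense elements is the candidate body.
    return Counter(n.parent for n in dense).most_common(1)[0][0]


# Hypothetical page used only to exercise the sketch.
SAMPLE = """
<html><body>
  <div id="nav"><a href="/">Home</a> <a href="/tech">Tech</a> <a href="/science">Science</a></div>
  <div id="story">
    <h1>Headline</h1>
    <p>First paragraph with enough plain words in it to look like real article text here.</p>
    <p>Second paragraph that also carries more than ten words of plain running text inside it.</p>
    <p>Third paragraph, again long enough to pass the words per tag threshold used below.</p>
  </div>
  <div id="footer"><a href="/about">About</a> <a href="/contact">Contact</a></div>
</body></html>
"""
```

On this sample, `find_article(SAMPLE)` selects the `div` whose `id` is `story`, because its three paragraphs are the only elements above the threshold. A real comparison between extractors would still need the large test set mentioned above.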

@tcitworld
Member

ping @nicosomb

@nicosomb
Member

I don't know why we can't get The Verge content. @fivefilters manages it here: http://fivefilters.org/content-only/

> I tried a POC in a few lines of python

Today, wallabag only uses PHP scripts. If you write a great tool in Python, how can we use it? Would you host it on your own server?

@tcitworld
Member

@fivefilters have a newer version of FullTextRSS, and maybe a newer siteconfig for TheVerge.

@nicosomb
Member

@tcitworld added the Bug label Dec 15, 2014

@fivefilters

Thanks for the report. I'll test using the Full-Text RSS version used by Wallabag to see if I can reproduce this. Has anyone else managed to reproduce this using Wallabag? I'm wondering if it's perhaps an issue related to The Verge blocking the server making the request.

@tcitworld
Member

I haven't tried it myself yet.

@nicosomb
Member

I tried with my framabag account and I see the same behavior as @JulienPalard.

@JulienPalard
Author

@nicosomb I know that wallabag uses PHP, but what I wrote is just a POC, and it's probably easy to port to PHP. I was just throwing together a few lines of code in a language I like so I could test my ideas easily, don't worry ;-)

BTW it seems to work on the only two pages I tested. I was not aware of http://fivefilters.org/content-only/, so now I'll be able to compare my results against an existing tool.

But I'll continue my POC only if someone here thinks the algorithm used by fivefilters is not good enough :-)

@Wikunia

Wikunia commented Jan 24, 2015

Looks like there is a bug in parsing the config files. At the moment I have no idea why :/
This config file:

title: //h1
body: //article
prune: no
#tidy: no

Works for my article:
http://www.theverge.com/2015/1/21/7868251/microsoft-hololens-hologram-hands-on-experience and for the one from @JulienPalard.

*Works: it shows the article, but unfortunately also some extra content.
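
The site config shown in this comment is a list of `directive: value` lines, with `#` marking a comment. As a rough illustration only, such a file could be read as below. This is not the parser used by wallabag or Full-Text RSS, and `parse_site_config` is a hypothetical name chosen for the example.

```python
def parse_site_config(text):
    """Parse 'directive: value' lines into a dict mapping names to lists."""
    config = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a comment such as '#tidy: no'
        # Split on the FIRST colon only: XPath values may contain colons.
        name, sep, value = line.partition(":")
        if not sep:
            continue  # no colon, so this is not a directive
        # A directive may appear several times, so collect values in a list.
        config.setdefault(name.strip(), []).append(value.strip())
    return config


config = parse_site_config("title: //h1\nbody: //article\nprune: no\n#tidy: no\n")
```

Splitting on the first colon matters for values such as `//meta[@property="og:title"]/@content`, where a naive split on every colon would cut the XPath expression in two.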

@tcitworld
Member

Thank you. I'll try getting rid of the "something more".

@fivefilters

I haven't had a chance to test this yet, but if anyone has access to the debug output, that might reveal why it's not working with the original site config. I tried with ftr.fivefilters.org and here's the output:

* APC is enabled and available on server
* Supplied URL: http://www.theverge.com/2015/1/21/7868251/microsoft-hololens-hologram-hands-on-experience
* Caching is enabled...
* Cache key not found in APC
* ** Loading class HumbleHttpAgent (humble-http-agent/HumbleHttpAgent.php)
* ** Loading class CookieJar (humble-http-agent/CookieJar.php)
* ** Loading class ContentExtractor (content-extractor/ContentExtractor.php)
* ** Loading class SiteConfig (content-extractor/SiteConfig.php)
* --------
* Attempting to process URL as feed
* ** Loading class SimplePie_HumbleHttpAgent (humble-http-agent/SimplePie_HumbleHttpAgent.php)
* ** Loading class DisableSimplePieSanitize (DisableSimplePieSanitize.php)
* Fetching URL (http://www.theverge.com/2015/1/21/7868251/microsoft-hololens-hologram-hands-on-experience)
* Starting parallel fetch (HttpRequestPool)
* Processing set of 1
* ...http://www.theverge.com/2015/1/21/7868251/microsoft-hololens-hologram-hands-on-experience
* ......adding to pool
* Sending request...
* Received responses
* --------
* Constructing a single-item feed from URL
* ** Loading class FeedWriter (feedwriter/FeedWriter.php)
* --------
* Fetching feed items
* Starting parallel fetch (HttpRequestPool)
* Processing set of 1
* ...http://www.theverge.com/2015/1/21/7868251/microsoft-hololens-hologram-hands-on-experience
* ......in memory
* --------
* Processing feed item 1
* Item URL: http://www.theverge.com/2015/1/21/7868251/microsoft-hololens-hologram-hands-on-experience
* ** Loading class FeedItem (feedwriter/FeedItem.php)
* URL already fetched - in memory (http://www.theverge.com/2015/1/21/7868251/microsoft-hololens-hologram-hands-on-experience, effective: http://www.theverge.com/2015/1/21/7868251/microsoft-hololens-hologram-hands-on-experience)
* Character encoding: utf-8
* Looking for site config files to see if single page link exists
* Returning cached and merged site config for theverge.com
* . looking for site config for theverge.com.merged in primary folder
* ... site config for theverge.com.merged in APC cache
* --------
* Attempting to extract content
* Returning cached and merged site config for theverge.com
* . looking for site config for theverge.com.merged in primary folder
* ... site config for theverge.com.merged in APC cache
* Strings replaced: 169 (find_string and/or replace_string)
* Using Tidy
* Attempting to parse HTML with libxml
* ** Loading class Readability (readability/Readability.php)
* Title matched: We just tried HoloLens, Microsoft's most intriguing product in years
* ...XPath match: //meta[@property="og:title"]/@content
* Language matched: en-US
* Date matched: 2015-01-21 23:28:24
* ...XPath match: //meta[@property="article:published_time"]/@content
* Stripping 1 elements (strip)
* Stripping 3 elements (strip)
* Stripping 3 elements (strip)
* Stripping 4 elements (strip)
* Stripping 3 elements (strip)
* Stripping 2 elements (strip)
* Stripping 1 elements (strip)
* Stripping 10 elements (strip_id_or_class)
* Stripping 4 elements (strip_id_or_class)
* Body matched
* ...XPath match: //article
* Done!

Link: http://ftr.fivefilters.org/makefulltextfeed.php?debug&url=www.theverge.com%2F2015%2F1%2F21%2F7868251%2Fmicrosoft-hololens-hologram-hands-on-experience
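
The debug output above reports three XPath matches: `og:title` for the title, `article:published_time` for the date, and `//article` for the body. A minimal sketch of that lookup order follows, assuming a small well-formed page and using only Python's standard `xml.etree.ElementTree`. It is not how Full-Text RSS implements extraction; the `PAGE` sample, its values and the `extract` function are made up for illustration. ElementTree does not support `/@content`, so the attribute is read with `.get("content")` instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a fetched article page.
PAGE = """<html>
  <head>
    <meta property="og:title" content="Example headline"/>
    <meta property="article:published_time" content="2015-01-01T00:00:00Z"/>
  </head>
  <body>
    <div id="nav">Menu</div>
    <article><p>Story text.</p></article>
  </body>
</html>"""


def extract(page):
    """Look up title, date and body the way the three XPath rules suggest."""
    root = ET.fromstring(page)

    def meta(prop):
        # Equivalent in spirit to //meta[@property="..."]/@content
        node = root.find(f".//meta[@property='{prop}']")
        return node.get("content") if node is not None else None

    body = root.find(".//article")  # equivalent in spirit to //article
    return {
        "title": meta("og:title"),
        "date": meta("article:published_time"),
        "body": "".join(body.itertext()).strip() if body is not None else None,
    }


result = extract(PAGE)
```

Real pages are rarely well-formed XML, which is presumably why the log shows a Tidy pass before the libxml parse.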

@tcitworld self-assigned this Jan 31, 2015

@nicosomb
Member

I just tried on wallabag v2 (http://v2.wallabag.org) and it's fixed.
