www.theverge.com consistent "unable to retrieve full-text content" #969
Comments
Wallabag uses this to parse websites: http://help.fivefilters.org/customer/portal/articles/223153-site-patterns You might be interested in adapting the siteconfig file. |
I tried a POC in a few lines of Python: it looks for good candidate elements using their ratio of words to tags, then looks for parents common to the matched elements. It typically finds paragraphs, whose common parent is typically the article div. I only took the time to try it on The Verge and melty, where it worked, but we may need a large test set to compare implementations. Would this kind of work help? |
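For readers following along, the heuristic described in the comment above could be sketched roughly like this. This is not the POC itself (its code was not posted in the thread), only a minimal illustration using the standard library. The names `Node`, `TreeBuilder`, `find_article` and the thresholds `min_words` and `min_ratio` are assumptions made up for the example, and the sample page is invented.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not become "current".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose text is code or styling, not readable words.
NON_TEXT_TAGS = {"script", "style"}


class Node:
    """Hypothetical minimal DOM node: tag, attributes, children, direct text."""

    def __init__(self, tag, attrs, parent):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []
        self.text = []

    def words(self):
        # Words in this element and all of its descendants.
        if self.tag in NON_TEXT_TAGS:
            return 0
        own = sum(len(chunk.split()) for chunk in self.text)
        return own + sum(child.words() for child in self.children)

    def tags(self):
        # This element plus every descendant element.
        return 1 + sum(child.tags() for child in self.children)


class TreeBuilder(HTMLParser):
    """Builds a tolerant tree of Node objects from (possibly sloppy) HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root", [], None)
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text.append(data)


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def find_article(html, min_words=10, min_ratio=5.0):
    """Return the parent shared by the most text-dense elements, or None."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    votes = Counter()
    for node in iter_nodes(builder.root):
        if node.parent is None:
            continue  # skip the synthetic root
        words = node.words()
        if words >= min_words and words / node.tags() >= min_ratio:
            votes[node.parent] += 1  # each dense element votes for its parent
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Invented sample page: navigation, three article paragraphs, a footer.
SAMPLE = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/about">About</a></div>
<div id="article">
<p>one two three four five six seven eight nine ten eleven twelve</p>
<p>one two three four five six seven eight nine ten eleven twelve</p>
<p>one two three four five six seven eight nine ten eleven twelve</p>
</div>
<div id="footer"><p>Short footer</p></div>
</body></html>
"""

article = find_article(SAMPLE)
print(article.tag, article.attrs)
```

On this sample the three paragraphs all vote for the `div` with `id="article"`, so that element is returned. Real pages are much noisier, so the thresholds would need tuning against a larger test set, as the comment above suggests.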
ping @nicosomb |
I don't know why we can't get The Verge's content; @fivefilters manages it here: http://fivefilters.org/content-only/
Today, wallabag only uses PHP scripts. If you write a great tool in Python, how can we use it? Would you host it on your own server? |
@fivefilters have a newer version of FullTextRSS, and maybe a newer siteconfig for TheVerge. |
Thanks for the report. I'll test using the Full-Text RSS version used by Wallabag to see if I can reproduce this. Has anyone else managed to reproduce this using Wallabag? (I'm wondering if it's perhaps an issue related to The Verge blocking the server making the request.) |
I didn't try yet myself. |
I tried with my framabag account and I see the same behavior as @JulienPalard. |
@nicosomb I know that wallabag uses PHP, but what I wrote is just a POC, and it's probably easy to port to PHP. I was just throwing together a few lines of code in a language I like, to test my ideas easily, don't worry ;-) BTW it seems to work on the only two pages I tested. I was not aware of http://fivefilters.org/content-only/, so now I'll be able to compare my results against an existing tool. But I'll continue my POC only if someone here thinks the algorithm used by fivefilters is not good enough :-) |
Looks like there is a bug in parsing the config files. At the moment I have no idea why :/
Works* for my article: *Works => shows the article, but unfortunately something more as well. |
Thank you. I'll try getting rid of the "something more" |
I haven't had a chance to test this yet, but if anyone has access to the debug output, that might reveal why it's not working with the original site config. I tried with ftr.fivefilters.org and here's the output:
|
I just tried on wallabag v2 (http://v2.wallabag.org), it's fixed. |
I consistently get "unable to retrieve full-text content" when storing articles from The Verge. Here is an example: http://www.theverge.com/2014/12/11/7376599/anti-piracy-meeting-between-google-sony-eli-lilly-homeland-security
Have you tried different "content extractors"? If so, have you documented that search? I may be interested in trying to implement one if nothing suitable exists.