Automatic site config #1029

Closed
Wikunia opened this issue Jan 26, 2015 · 2 comments

Wikunia commented Jan 26, 2015

I often run into websites that aren't supported.
Maybe the parsing process could be rethought: instead of settling for the "not possible" outcome, there might be a way to create an XPath script automatically.
At the very least, there could be a default config that tries things like:
h1 for the title and article for the body.
As a second step, it could pick the div with the most words, or something like that.

What do you think?
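
A minimal sketch of the generic config proposed above, written in Python for illustration only (wallabag itself is PHP, and none of these names come from the project). It assumes the first h1 is the title, that the text inside article elements is the body, and that otherwise the div holding the most words wins. Text is credited to the innermost open div only, so an outer wrapper div does not win automatically.

```python
from html.parser import HTMLParser


def _squash(parts):
    """Join text chunks and collapse runs of whitespace."""
    return " ".join(" ".join(parts).split())


class GenericExtractor(HTMLParser):
    """Hypothetical fallback parser: first <h1>, <article> text, per-<div> text."""

    def __init__(self):
        super().__init__()
        self.title = None
        self._h1_parts = None      # a list only while inside the first <h1>
        self._article_depth = 0    # > 0 while inside any <article>
        self._article_parts = []
        self._div_stack = []       # one list of text chunks per open <div>
        self.div_texts = []        # text of every closed <div>

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and self.title is None and self._h1_parts is None:
            self._h1_parts = []
        elif tag == "article":
            self._article_depth += 1
        elif tag == "div":
            self._div_stack.append([])

    def handle_endtag(self, tag):
        if tag == "h1" and self._h1_parts is not None:
            self.title = _squash(self._h1_parts)
            self._h1_parts = None
        elif tag == "article" and self._article_depth:
            self._article_depth -= 1
        elif tag == "div" and self._div_stack:
            self.div_texts.append(_squash(self._div_stack.pop()))

    def handle_data(self, data):
        if self._h1_parts is not None:
            self._h1_parts.append(data)   # title text is kept out of the body
            return
        if self._article_depth:
            self._article_parts.append(data)
        if self._div_stack:
            self._div_stack[-1].append(data)  # innermost open <div> only


def extract(html):
    """Return {'title': ..., 'body': ...} using the two-step fallback."""
    parser = GenericExtractor()
    parser.feed(html)
    parser.close()
    body = _squash(parser._article_parts)
    if not body and parser.div_texts:
        # Second step: no <article> text, so take the wordiest <div>.
        body = max(parser.div_texts, key=lambda text: len(text.split()))
    return {"title": parser.title, "body": body}
```

Calling `extract(page_html)` on a page without an article element falls through to the word-count step. This is far cruder than a real readability heuristic: it ignores link density, class names, and unclosed tags.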

@tcitworld tcitworld added this to the 1.9.0 milestone Jan 26, 2015
tcitworld (Member) commented

You may be interested in reading #564, #598 and #849.

I'm adding @fivefilters to the discussion.

fivefilters commented

Hi Wikunia,

Wallabag relies on Full-Text RSS which already tries to do what you're suggesting. It looks for specific extraction rules first (in case the site being requested has custom extraction rules associated with it), but if none is found it uses PHP Readability to try and detect the article using word counts and other heuristics to identify the element most likely to contain the article text.

Of course it's not foolproof and there are still cases where it fails. If you can provide problem URLs, we might be able to help fix the extraction.

If you're curious about the implementation of Full-Text RSS, you'll find the code at http://code.fivefilters.org
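
The lookup order described above (site-specific rules first, generic detection only when no rule exists) can be sketched roughly as follows. This is not Full-Text RSS's actual code, which is PHP: the rule table, the `between` helper, and the `longest_paragraph` stand-in for PHP Readability are all hypothetical and exist only to show the control flow.

```python
import re
from urllib.parse import urlparse


def strip_tags(fragment):
    """Crudely drop tags and collapse whitespace (illustration only)."""
    return " ".join(re.sub(r"<[^>]+>", " ", fragment).split())


def between(start, end):
    """Build a hypothetical site rule: keep the text between two literal markers."""
    def rule(html):
        _, found, rest = html.partition(start)
        if not found:
            return ""
        return strip_tags(rest.partition(end)[0])
    return rule


def longest_paragraph(html):
    """Stand-in for the generic heuristic: the <p> with the most words."""
    paragraphs = re.findall(r"<p\b[^>]*>(.*?)</p>", html, flags=re.S | re.I)
    texts = [strip_tags(p) for p in paragraphs]
    return max(texts, key=lambda text: len(text.split()), default="")


# Hypothetical rule table keyed by hostname (without a leading "www.").
SITE_RULES = {
    "example.com": between('<div id="story">', "</div>"),
}


def extract_article(url, html, site_rules=SITE_RULES, fallback=longest_paragraph):
    """Use the site's own rule if one exists, otherwise the generic heuristic."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    rule = site_rules.get(host)
    if rule is not None:
        return rule(html), "site rule"
    return fallback(html), "heuristic"
```

The second element of the returned tuple records which path produced the text, which makes it easier to tell whether a bad extraction needs a new site rule or a better heuristic.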

@tcitworld tcitworld modified the milestones: 1.9.0, 2.0 Feb 10, 2015
@nicosomb nicosomb modified the milestones: 2.2.0, 2.0.0 Feb 19, 2016
@j0k3r j0k3r closed this as completed Apr 14, 2016
@nicosomb nicosomb removed this from the 2.2.0 milestone Sep 15, 2016