Automatic site config #1029

Wikunia · 2015-01-26T21:30:03Z

I often run into websites that aren't supported!
Maybe you could rethink how websites are parsed: instead of giving up with a "not possible" result, there might be a way to create an XPath config automatically.
At the very least, there could be a generic config that tries things like:
h1 for the title and article for the body.
As a second step, it could pick the div with the most words, or something like that.
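
A rough sketch of that idea, just to make it concrete. It uses only Python's standard `html.parser`; the class and function names (`GenericExtractor`, `extract`) are made up for illustration and are not part of wallabag. It assumes "the div with the most words" means the div whose own text (not counting nested divs) is longest, since otherwise the outermost wrapper div would always win.

```python
from html.parser import HTMLParser


class GenericExtractor(HTMLParser):
    """Collects the first <h1>, the first <article>, and the wordiest <div>."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.article_parts = []
        self.in_h1 = False
        self.h1_done = False
        self.article_depth = 0
        self.article_done = False
        self.div_stack = []  # one list of text chunks per currently open <div>
        self.best_div = ""   # text of the wordiest <div> closed so far

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and not self.h1_done:
            self.in_h1 = True
        elif tag == "article" and not self.article_done:
            self.article_depth += 1
        elif tag == "div":
            self.div_stack.append([])

    def handle_endtag(self, tag):
        if tag == "h1" and self.in_h1:
            self.in_h1 = False
            self.h1_done = True
        elif tag == "article" and self.article_depth:
            self.article_depth -= 1
            if self.article_depth == 0:
                self.article_done = True
        elif tag == "div" and self.div_stack:
            text = " ".join(self.div_stack.pop())
            if len(text.split()) > len(self.best_div.split()):
                self.best_div = text

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        if self.in_h1:
            self.title_parts.append(data)
        if self.article_depth:
            self.article_parts.append(data)
        if self.div_stack:
            # Credit text only to the innermost open <div>.
            self.div_stack[-1].append(data)


def extract(html):
    """Return (title, body); body falls back to the wordiest <div>."""
    parser = GenericExtractor()
    parser.feed(html)
    parser.close()
    title = " ".join(parser.title_parts) or None
    body = " ".join(parser.article_parts) or parser.best_div or None
    return title, body
```

A real implementation would need to keep the markup, skip script and style content, and cope with badly broken HTML; this only shows the order of the fallbacks.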

What do you think about it?


tcitworld · 2015-01-26T21:40:31Z

You may be interested in reading #564, #598 and #849.

I'm adding @fivefilters to the discussion.

fivefilters · 2015-01-27T18:46:13Z

Hi Wikunia,

Wallabag relies on Full-Text RSS which already tries to do what you're suggesting. It looks for specific extraction rules first (in case the site being requested has custom extraction rules associated with it), but if none is found it uses PHP Readability to try and detect the article using word counts and other heuristics to identify the element most likely to contain the article text.
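
The "specific rules first, heuristics second" order described above could be sketched roughly as follows. This is not Full-Text RSS code (which is written in PHP); the `SITE_RULES` table, its XPath strings, and the `choose_extraction` function are hypothetical names used only to illustrate the dispatch.

```python
from urllib.parse import urlsplit

# Hypothetical per-site rules keyed by hostname; the XPath strings are placeholders.
SITE_RULES = {
    "example.com": {"title": "//h1", "body": "//article"},
}


def choose_extraction(url, rules=SITE_RULES):
    """Return ("site_rules", rule) if the host has custom rules, else ("heuristic", None)."""
    host = (urlsplit(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # treat www.example.com like example.com
    rule = rules.get(host)
    if rule is not None:
        return "site_rules", rule
    # No custom rules: the caller would fall back to Readability-style heuristics.
    return "heuristic", None
```

For a URL on `www.example.com` this returns the custom rule; for any host missing from the table it signals that the generic heuristic should run instead.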

Of course it's not foolproof and there are still cases where it fails. If you can provide problem URLs, we might be able to help fix the extraction.

If you're curious about the implementation of Full-Text RSS, you'll find code at http://code.fivefilters.org

tcitworld added the Improvement label Jan 26, 2015

tcitworld added this to the 1.9.0 milestone Jan 26, 2015

tcitworld modified the milestones: 1.9.0, 2.0 Feb 10, 2015

nicosomb modified the milestones: 2.2.0, 2.0.0 Feb 19, 2016

j0k3r closed this as completed Apr 14, 2016

nicosomb removed this from the 2.2.0 milestone Sep 15, 2016
