I often run into websites that aren't supported!
Maybe you should rethink how pages are parsed instead of falling back to the "Not possible" result. There might be a way to generate an XPath script automatically.
At the very least, there could be a default config that tries common patterns, such as h1 for the title and article for the body.
As a second step, it could pick the div with the most words, or something like that.
What do you think about it?
Wallabag relies on Full-Text RSS which already tries to do what you're suggesting. It looks for specific extraction rules first (in case the site being requested has custom extraction rules associated with it), but if none is found it uses PHP Readability to try and detect the article using word counts and other heuristics to identify the element most likely to contain the article text.
Of course it's not foolproof and there are still cases where it fails. If you can provide problem URLs, we might be able to help fix the extraction.
If you're curious about the implementation of Full-Text RSS, you'll find the code at http://code.fivefilters.org
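To make the order of lookups described above concrete, here is a rough Python sketch: a site-specific rule first, then a generic guess such as `article`, then a word-count fallback. It is not Full-Text RSS or PHP Readability code. The `SITE_RULES` table, the `example.com` entry, the function names, and the "words in direct `<p>` children" score are all assumptions made for illustration, and the sketch only handles well-formed markup because it uses the standard-library XML parser.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Hypothetical per-site rules (host -> XPath of the article body).
# Real site rules in Full-Text RSS use their own format; this is a stand-in.
SITE_RULES = {
    "example.com": ".//div[@id='story']",
}

# Generic guesses tried when no site rule applies (assumed, per the issue text).
GENERIC_BODY_PATHS = [".//article"]


def text_of(node):
    """Return the node's text with runs of whitespace collapsed."""
    return " ".join(" ".join(node.itertext()).split())


def paragraph_words(div):
    """Crude score: number of words inside the div's direct <p> children."""
    return sum(len(text_of(p).split()) for p in div.findall("p"))


def extract_body(url, markup):
    """Return the extracted body text, or None if nothing plausible is found."""
    root = ET.fromstring(markup)  # assumes well-formed (XHTML-like) input
    host = urlparse(url).hostname or ""

    # 1. A site-specific rule, if one is registered for this host.
    rule = SITE_RULES.get(host)
    if rule is not None:
        node = root.find(rule)
        if node is not None:
            return text_of(node)

    # 2. Generic guesses such as <article>.
    for path in GENERIC_BODY_PATHS:
        node = root.find(path)
        if node is not None:
            return text_of(node)

    # 3. Fallback: the div whose own paragraphs hold the most words.
    candidates = list(root.iter("div"))
    if not candidates:
        return None
    best = max(candidates, key=paragraph_words)
    return text_of(best) if paragraph_words(best) > 0 else None
```

Counting only direct `<p>` children is a deliberate simplification: counting every word under a div would always favor the outermost wrapper. Real pages are rarely well-formed and need far more heuristics than this, which is why extraction still fails on some sites.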