I'm looking for a way to extract the main content from non-English pages,
where Boilerpipe gives varying results. Here are a few examples where
Boilerpipe misses the main portion of the text (tested with
http://boilerpipe-web.appspot.com/ on 2011-01-06):
* http://www.dn.se/nyheter/vetenskap/annu-godare-choklad-med-hjalp-av-dna-teknik
  - picks up some teasers instead
* http://www.sydsvenskan.se/malmo/article1346121/I-natt-bargas-det---forhoppningsvis.html
  - picks up the comment section
* http://www.dn.se/sthlm/tva-raddade-ur-malarvak - all sorts of content from
  around the article
* http://www.expressen.se/nyheter/1.2280178/smhi-utfardar-klass-2-varning -
  picks up the comment section
I also see minor artifacts from non-content sections throughout the extracted
text:
* http://hd.se/skane/2011/01/06/mangder-med-sno-over-skane/ - "Skriv ut" is a
link to print the article. "Bildmaterial" is a header from the sidebar
* http://www.dn.se/sthlm/misstankt-brott-bakom-ung-mans-dod - "Dela med andra"
is a header from the sidebar with sharing links
* http://www.expressen.se/noje/1.2280351/lotta-engberg-lamnar-bingolotto -
Misses main header and teaser
I know it's hard to get all of the above URLs right without site-specific code,
but I also know it's possible. I've run all of the URLs above through
readability.js, and it parses all of them without any artifacts. Maybe it's
Readability's reliance on class names (which are generally in English even on
foreign-language sites) that makes it cope better. The problem is that
readability.js is a mess to run server-side and has not undergone the rigorous
testing Boilerpipe has, so I would much rather see Boilerpipe succeed than
switch to readability.js.
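To make the class-name guess above concrete, here is a minimal Python sketch of that kind of heuristic. It is not Readability's actual code: the patterns, the weights, and the names `class_weight` and `pick_main_block` are all assumptions made up for illustration. The point is only that hints like "article" or "comment" in class/id names stay useful even when the page text itself is Swedish.

```python
import re

# Illustrative patterns only (assumed, not copied from readability.js).
POSITIVE = re.compile(r"article|body|content|entry|main|post|text|story", re.I)
NEGATIVE = re.compile(r"comment|footer|sidebar|share|related|promo|widget|teaser", re.I)


def class_weight(class_and_id: str) -> int:
    """Score an element from hints in its class/id names (hypothetical weights)."""
    weight = 0
    if POSITIVE.search(class_and_id):
        weight += 25
    if NEGATIVE.search(class_and_id):
        weight -= 25
    return weight


def pick_main_block(blocks):
    """Given (class_and_id, text) pairs, return the text of the best-scoring block."""
    def score(block):
        names, text = block
        # Small, capped bonus for longer text so class names dominate.
        return class_weight(names) + min(len(text) // 100, 3)
    return max(blocks, key=score)[1]


blocks = [
    ("comments-list", "x" * 1000),            # long, but looks like comments
    ("sidebar teaser", "Dela med andra"),     # sidebar header
    ("article-body", "Två räddade ur vak"),   # short, but looks like the article
]
main_text = pick_main_block(blocks)
```

With these made-up weights, the short block named `article-body` wins over the much longer comment block, which is the behaviour the affected pages would need.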
Thanks for your hard work.
Original issue reported on code.google.com by EmilStenstrom on 6 Jan 2011 at 2:43
Here's an update from 2011-12-08 on the above URLs, using the web version of
Boilerpipe:
* http://www.dn.se/nyheter/vetenskap/annu-godare-choklad-med-hjalp-av-dna-teknik
  - Misses the header altogether (dn.se has had a new design since then...)
* http://www.sydsvenskan.se/malmo/article1346121/I-natt-bargas-det---forhoppningsvis.html
  - picks up the comment section
* http://www.dn.se/sthlm/tva-raddade-ur-malarvak - picks up some teasers
instead of main text.
* http://www.expressen.se/nyheter/1.2280178/smhi-utfardar-klass-2-varning - One
teaser, and various text from popups
Minor artifacts:
* http://hd.se/skane/2011/01/06/mangder-med-sno-over-skane/ - "Skriv ut" is
a link to print the article. "Bildmaterial" is a header from the sidebar.
"Dela" at the bottom is from the sharing feature
* http://www.dn.se/sthlm/misstankt-brott-bakom-ung-mans-dod - This one no
longer has any artifacts, well done!
* http://www.expressen.se/noje/1.2280351/lotta-engberg-lamnar-bingolotto -
Misses main header and teaser
I don't know what magic Readability uses, but all of the above URLs work
perfectly with Readability.
Original comment by EmilStenstrom on 8 Dec 2011 at 9:08