
Better support for non-english pages #16

Open
GoogleCodeExporter opened this issue May 25, 2015 · 3 comments

Comments

@GoogleCodeExporter

I'm looking for a way to parse non-English pages, which seem to give varying 
results with boilerpipe. Here are a couple of examples where boilerpipe misses 
the main portion of the text (tested with http://boilerpipe-web.appspot.com/ 
on 2011-01-06):

* http://www.dn.se/nyheter/vetenskap/annu-godare-choklad-med-hjalp-av-dna-teknik - picks up some teasers instead
* http://www.sydsvenskan.se/malmo/article1346121/I-natt-bargas-det---forhoppningsvis.html - picks up the comment section
* http://www.dn.se/sthlm/tva-raddade-ur-malarvak - picks up all sorts of content from around the article
* http://www.expressen.se/nyheter/1.2280178/smhi-utfardar-klass-2-varning - picks up the comment section

I also see minor artifacts from non-content sections throughout the extracted 
text:
* http://hd.se/skane/2011/01/06/mangder-med-sno-over-skane/ - "Skriv ut" is a 
link to print the article. "Bildmaterial" is a header from the sidebar
* http://www.dn.se/sthlm/misstankt-brott-bakom-ung-mans-dod - "Dela med andra" 
is a header from the sidebar with sharing links
* http://www.expressen.se/noje/1.2280351/lotta-engberg-lamnar-bingolotto - 
Misses main header and teaser

I know it's hard to get all of the above URLs right without site-specific code, 
but I also know it's possible. I've run all of the URLs above through 
readability.js, and it parses all of them without any artifacts. Maybe it's 
Readability's reliance on class names (which are generally in English even on 
foreign-language sites) that makes it cope better. The problem is that 
readability.js is a mess to run server-side and has not undergone the rigorous 
testing boilerpipe has, so I would much rather see boilerpipe succeed than 
switch to readability.js.
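
To make the class-name guess above concrete, here is a minimal sketch of that kind of heuristic in Python. It is not Readability's or boilerpipe's actual code: the patterns, the weight of 25, the length bonus, and the names `class_weight` and `pick_main_block` are all assumptions made up for illustration.

```python
import re

# Hypothetical hint patterns (assumptions, not taken from any library).
# They match English class names, which is the point of the guess above:
# the page text can be Swedish while the markup stays English.
POSITIVE = re.compile(r"article|body|content|entry|main|post|text", re.I)
NEGATIVE = re.compile(r"comment|footer|sidebar|share|teaser|widget", re.I)


def class_weight(class_attr: str) -> int:
    """Score an element's class attribute: +25 for a content hint, -25 for a non-content hint."""
    weight = 0
    if POSITIVE.search(class_attr):
        weight += 25
    if NEGATIVE.search(class_attr):
        weight -= 25
    return weight


def pick_main_block(blocks):
    """Given (class_attr, text) pairs, return the text of the highest-scoring block."""
    def score(block):
        class_attr, text = block
        # One point per 100 characters, so a long comment section
        # does not automatically beat a shorter article body.
        return class_weight(class_attr) + len(text) // 100

    return max(blocks, key=score)[1]


blocks = [
    ("article-body", "Huvudtext " * 30),
    ("comments", "Kommentar " * 60),
    ("teaser-list", "Puff " * 20),
]
main_text = pick_main_block(blocks)
```

Here the comment block is twice as long as the article block, yet the class-name penalty keeps it from being picked. A purely text-based classifier has no such signal to fall back on.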

Thanks for your hard work.

Original issue reported on code.google.com by EmilStenstrom on 6 Jan 2011 at 2:43

@GoogleCodeExporter

[deleted comment]

@GoogleCodeExporter

Here's an update from 2011-12-08 on the above URLs, using the web version of 
boilerpipe:

* http://www.dn.se/nyheter/vetenskap/annu-godare-choklad-med-hjalp-av-dna-teknik - misses the header altogether (dn.se has had a new design since then...)
* http://www.sydsvenskan.se/malmo/article1346121/I-natt-bargas-det---forhoppningsvis.html - picks up the comment section
* http://www.dn.se/sthlm/tva-raddade-ur-malarvak - picks up some teasers instead of the main text
* http://www.expressen.se/nyheter/1.2280178/smhi-utfardar-klass-2-varning - one teaser, and various text from popups

Minor artifacts:
* http://hd.se/skane/2011/01/06/mangder-med-sno-over-skane/ - "Skriv ut" is 
a link to print the article. "Bildmaterial" is a header from the sidebar. 
"Dela" at the bottom is from the sharing feature
* http://www.dn.se/sthlm/misstankt-brott-bakom-ung-mans-dod - This one no 
longer has any artifacts, well done!
* http://www.expressen.se/noje/1.2280351/lotta-engberg-lamnar-bingolotto - 
Misses main header and teaser

I don't know what magic Readability uses, but all of the above URLs work 
perfectly with Readability.

Original comment by EmilStenstrom on 8 Dec 2011 at 9:08

@GoogleCodeExporter

http://www.anspress.com/index.php?a=2&cid=48&lng=az&nid=270848

Original comment by eyusi...@gmail.com on 13 May 2014 at 1:44
