Paragraph detection problems #33
Comments
First, add test cases for these.
Number 5: starting with a link seems fixed.
Number 8: fixed in ec09fcf.
Starting with special letters: fixed in 31be88c.
Comment on (2): never seen elsewhere; bad practice?
Maybe use a more radical approach to get paragraphs: http://stackoverflow.com/a/10730777
It has been a while since we last encountered a website where not a single paragraph is detected.
One issue to rule them all :
(2) Custom tags ( )
http://pro.clubic.com/entreprises/google/actualite-714679-achats-app-google-cible-ftc-concours-apple.html
(3) Add spans
http://www.gameblog.fr/news/44514-monument-valley-decroche-le-million
http://blog.colepeters.com/design-culture-is-frozen-shithole/
http://techblog.netflix.com/2014/09/introducing-chaos-engineering.html
(4) With many links
http://en.wikipedia.org/wiki/Ubuntu_(operating_system)
(7) First and last in HN:
https://news.ycombinator.com/item?id=8078356
(9) P starting with an img:
http://www.skorks.com/2010/05/closures-a-simple-explanation-using-ruby/
https://www.eff.org/deeplinks/2014/07/inaugural-stupid-patent-month
(10) Poorly structured article (a bunch of text with paragraphs separated by br tags):
http://www.romandie.com/news/LOtan-en-alerte-face-a-une-intense-activite-de-laviation-russe_RP/532140.rom
http://paulgraham.com/pinch.html (not from a clueless person)
(11) Content is in an iframe: http://popist.com/s/16aa4fc/ (How disgraceful is that?)
5. Starting with a link
https://medium.com/the-physics-arxiv-blog/the-face-recognition-algorithm-that-finally-outperforms-humans-2c567adbf7fc
http://eloquentjavascript.net/00_intro.html
8. Avoid wrapper divs:
http://www.slate.fr/story/90333/economie-collaborative-partage
http://www.bbc.com/future/story/20140808-music-like-never-heard-before (word count NOK)
http://arstechnica.com/tech-policy/2014/08/why-the-head-of-mt-gox-bitcoin-exchange-should-be-in-jail/
any article read in Instapaper
5. Starting with special letters (span or font tag)
http://www.jp-petit.org/science/Z-machine/FOCUS/principe_fonctionnement_FOCUS.htm
http://nautil.us/issue/18/genius/super_intelligent-humans-are-coming
1. Side text
Next to Jeff portrait: http://www.wired.com/2014/07/google_brain
This was only a matter of an "orphaned paragraph".
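
For cases like (10), where paragraphs are only separated by br tags, and (3), (5), (9), where a paragraph starts with or contains inline elements, one possible approach is to treat br and block-level tags as paragraph boundaries and let text inside inline tags (span, a, font, img) flow into the current paragraph. The following is a rough sketch only, not this project's actual code: `ParagraphSplitter`, `split_paragraphs` and the `BLOCK_TAGS` set are hypothetical names chosen for illustration, and it does not handle script/style content, iframes, or custom tags.

```python
from html.parser import HTMLParser

# Assumption: tags that end the current paragraph. The list is illustrative,
# not exhaustive; "br" is included so br-separated text gets split.
BLOCK_TAGS = {"p", "div", "br", "li", "blockquote",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class ParagraphSplitter(HTMLParser):
    """Collects text into paragraphs, splitting on BLOCK_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._buffer = []

    def _flush(self):
        # Join buffered text, collapse whitespace, keep non-empty results.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        # Text inside inline tags (span, a, font, ...) lands here as well,
        # so it stays part of the current paragraph.
        self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()  # emit any trailing text after the last tag


def split_paragraphs(html):
    parser = ParagraphSplitter()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


if __name__ == "__main__":
    sample = 'First<br><br>Second <span>with span</span><br/>Third'
    for paragraph in split_paragraphs(sample):
        print(paragraph)
```

Empty paragraphs (for example between two consecutive br tags) are dropped, which also avoids counting wrapper divs that contain no direct text.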