Paragraph detection problems #33

Open
trochr opened this issue Jul 25, 2014 · 7 comments

trochr commented Jul 25, 2014

One issue to rule them all :

(2) Custom tags (

)
http://pro.clubic.com/entreprises/google/actualite-714679-achats-app-google-cible-ftc-concours-apple.html

(3) Add spans
http://www.gameblog.fr/news/44514-monument-valley-decroche-le-million
http://blog.colepeters.com/design-culture-is-frozen-shithole/
http://techblog.netflix.com/2014/09/introducing-chaos-engineering.html

(4) With many links
http://en.wikipedia.org/wiki/Ubuntu_(operating_system)

(7) First and last in HN:
https://news.ycombinator.com/item?id=8078356

(9) P starting by an img :
http://www.skorks.com/2010/05/closures-a-simple-explanation-using-ruby/
https://www.eff.org/deeplinks/2014/07/inaugural-stupid-patent-month

(10) Poorly structured article : (bunch of text with paragraphs separated by br tags)
http://www.romandie.com/news/LOtan-en-alerte-face-a-une-intense-activite-de-laviation-russe_RP/532140.rom
http://paulgraham.com/pinch.html (not from a clueless person)

(11) Content is in an iFrame : http://popist.com/s/16aa4fc/ (How disgraceful is that ?)

5. Starting with a link
https://medium.com/the-physics-arxiv-blog/the-face-recognition-algorithm-that-finally-outperforms-humans-2c567adbf7fc
http://eloquentjavascript.net/00_intro.html

8. Avoid wrapper divs :
http://www.slate.fr/story/90333/economie-collaborative-partage
http://www.bbc.com/future/story/20140808-music-like-never-heard-before (word count NOK)
http://arstechnica.com/tech-policy/2014/08/why-the-head-of-mt-gox-bitcoin-exchange-should-be-in-jail/
any article read in Instapaper

5. Starting with special letters (span or font tag)
http://www.jp-petit.org/science/Z-machine/FOCUS/principe_fonctionnement_FOCUS.htm
http://nautil.us/issue/18/genius/super_intelligent-humans-are-coming

1. Side text
Next to Jeff portrait : http://www.wired.com/2014/07/google_brain
This was only a matter of "orphaned paragraph"
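
For case (10), here is a minimal sketch of one possible way to recover paragraphs from text separated by br tags. It assumes that a run of two or more consecutive br tags marks a paragraph boundary and that a single br is only a line break. It is written in Python with the standard library purely for illustration; `BrParagraphSplitter`, `split_on_br` and `min_breaks` are hypothetical names, not part of this project.

```python
from html.parser import HTMLParser


class BrParagraphSplitter(HTMLParser):
    """Collect text and cut a paragraph at each long enough run of <br> tags."""

    def __init__(self, min_breaks=2):
        super().__init__(convert_charrefs=True)
        self.min_breaks = min_breaks  # assumed threshold, not taken from the project
        self.paragraphs = []
        self._chunks = []  # text pieces of the paragraph being built
        self._breaks = 0   # consecutive <br> tags seen since the last text

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing <br/> by the default handle_startendtag.
        if tag == "br":
            self._breaks += 1

    def handle_data(self, data):
        if not data.strip():
            return  # whitespace between <br> tags does not interrupt a run
        if self._breaks >= self.min_breaks:
            self._flush()  # enough breaks: the previous paragraph is complete
        elif self._breaks and self._chunks:
            self._chunks.append(" ")  # a single <br> is treated as a line break
        self._breaks = 0
        self._chunks.append(data)

    def _flush(self):
        # Join the pieces and normalise internal whitespace.
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []

    def close(self):
        super().close()
        self._flush()  # emit the last paragraph


def split_on_br(html, min_breaks=2):
    parser = BrParagraphSplitter(min_breaks)
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

Calling `split_on_br(page_html)` returns a list of paragraph strings. Pages that use a single br between paragraphs would need `min_breaks=1`, at the cost of also splitting on ordinary line breaks.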

This was referenced Jul 25, 2014

trochr commented Jul 25, 2014

First, add test cases for these

trochr commented Jul 31, 2014

Number 5 : Starting with a link seems fixed

trochr commented Sep 1, 2014

Number 8: fixed in ec09fcf
(Ars technica was a different case, with a strange article structure)

trochr commented Oct 16, 2014

Starting with special letters : fixed in 31be88c

trochr added the bug label Oct 30, 2014

trochr commented Nov 24, 2014

Comment on (2) : never seen elsewhere, bad practice ?

trochr commented Dec 9, 2014

Maybe use a more radical approach to get paragraphs : http://stackoverflow.com/a/10730777

trochr commented Sep 13, 2016

It's been a while since we last encountered a website where not a single paragraph is detected.
Keeping open for reference.
