Don't parse HTML with RegEx #11

schlamar · 2012-11-19T20:59:15Z

That's just wrong. ⚠️ There are XML/HTML parsers like lxml or Beautiful Soup.

See references:

theanti9 · 2012-11-21T04:30:40Z

I have seen both of these links before and I'm well aware of the problem, but I do not have time to rewrite it. The one upside of regex is that it is much more portable. That doesn't necessarily outweigh its downsides, or the benefits of DOM parsers, but it does help when trying to stick something together in a very short amount of time just for fun (the original point of this project). If you have the time and are willing, please feel free to redo it with lxml or Beautiful Soup (I recommend the latter; I have used it on other things and it's wonderful) and I will gladly accept the changes. This repo does get a lot of attention. I wish I had more time to devote to it, but life is busy.

schlamar · 2012-11-21T12:56:55Z

No, thanks, already did it :-) http://www.schlamar.org/blog/2010/04/10/python-search-engine-crawler-part-1/

FYI: This took me about 30 minutes of programming, so don't tell me about a short amount of time. Doing it right doesn't have to take more time than a dirty approach.
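
For readers following the thread, here is a minimal sketch of the difference being argued about. It is not code from this project or from the linked blog post. It uses the standard library's `html.parser` as a stand-in for lxml or Beautiful Soup, so it has no third-party dependency, and the names `LinkExtractor`, `extract_links`, `NAIVE_PATTERN`, and `SAMPLE` are hypothetical. The regex is only an assumed example of a naive link pattern, not the one the crawler actually uses.

```python
import re
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper for illustration)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(html):
    """Return all link targets found by the parser, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Assumed sample markup covering a few cases a naive pattern tends to miss.
SAMPLE = """
<a href="/plain">plain</a>
<A HREF='/single-quoted'>upper case, single quotes</A>
<a class="x"
   href="/multiline">attribute on its own line</a>
<!-- <a href="/commented-out">not a real link</a> -->
"""

# Assumed naive pattern: only matches lowercase <a> with href as the first,
# double-quoted attribute, and it also matches inside HTML comments.
NAIVE_PATTERN = re.compile(r'<a href="([^"]*)"')

if __name__ == "__main__":
    print("parser:", extract_links(SAMPLE))
    print("regex: ", NAIVE_PATTERN.findall(SAMPLE))
```

On this sample the parser finds the three real links and skips the commented-out one, while the assumed regex finds only the first link and wrongly picks up the one inside the comment. A more careful regex could close some of these gaps, but each new case tends to need another special rule, which is the point schlamar is making.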
