Don't parse HTML with RegEx #11

Open
schlamar opened this issue Nov 19, 2012 · 2 comments

@schlamar

That's just wrong. ⚠️ There are XML/HTML parsers like lxml or Beautiful Soup.

See references:

@theanti9
Owner

I have seen both of these links before and I'm well aware of the problem, but I do not have time to rewrite it. The one upside of regex is that it is much more portable. That doesn't necessarily outweigh its downsides or the benefits of DOM parsers, but it does help when sticking something together in a very short amount of time just for fun (the original point of this project). If you have the time and are willing, please feel free to redo it with lxml or Beautiful Soup (I recommend the latter; I have used it on other things and it's wonderful) and I will gladly accept the changes. This repo does get a lot of attention. I wish I had more time to devote to it, but life is busy.

@schlamar
Author

No, thanks, already did it :-) http://www.schlamar.org/blog/2010/04/10/python-search-engine-crawler-part-1/

FYI: This took me about 30 minutes of programming, so don't tell me about a short amount of time. Doing it right doesn't have to take more time than a dirty approach.
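
For illustration only, here is a minimal sketch of parser-based link extraction. It is not code from this repository or from the linked blog post. It assumes the crawler's job is to collect `href` values from anchor tags, and it uses the standard library's `html.parser` as a dependency-free stand-in for lxml or Beautiful Soup, which speaks to the portability concern above. The names `LinkExtractor` and `extract_links` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every anchor target found in the given HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Unlike a naive regex, this handles mixed-case tags, single-quoted attributes, and anchors that sit inside HTML comments (those are skipped). With Beautiful Soup or lxml the same task is a one-liner, at the cost of an extra dependency.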
