[Question]: BeautifulSoup #22141

Closed
twaddington opened this issue Aug 18, 2019 · 4 comments
@twaddington (Contributor) commented Aug 18, 2019

Checklist

  • I'm asking a question
  • I've looked through the README and FAQ for similar questions
  • I've searched the bugtracker for similar questions including closed ones

Question

I recently started digging through the youtube-dl code and noticed that regular expressions are used all over the place. I'm wondering if the developers have considered migrating to an HTML parser like BeautifulSoup instead? This would be a more readable and idiomatic way of extracting data from video pages.

@dstftw (Collaborator) commented Aug 18, 2019

Regular expressions are used intentionally. Code using HTML parsers will break on any minor change in page layout.

@dstftw closed this Aug 18, 2019

@twaddington (Contributor, Author) commented Aug 18, 2019

Regular expressions are used intentionally. — @dstftw

That's great that it's an intentional decision. I'm not advocating a change. Was just wondering if it'd been discussed before.

Code using HTML parsers will break on any minor change in page layout. — @dstftw

I don't mean to be blunt, but this isn't accurate. For example, you can use a parser to query the DOM for all anchor tags with a call like soup.find_all("a") (soup.a on its own returns only the first one). The whole idea is that even if the HTML structure changes, the query still works as expected.
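
A minimal sketch of that idea, using the standard-library html.parser instead of BeautifulSoup so it runs without extra dependencies (BeautifulSoup's find_all("a") plays the same role). The two layouts and the names AnchorCollector and extract_hrefs are made up for illustration; nothing here comes from youtube-dl.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the href of every <a> tag, wherever it sits in the tree."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def extract_hrefs(html):
    """Return all anchor hrefs in document order."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Hypothetical "before" and "after" layouts: the link moves deeper into the
# tree and gains an extra attribute, but the query itself does not change.
OLD_LAYOUT = '<div><a href="/watch?v=abc">Video</a></div>'
NEW_LAYOUT = (
    '<section><ul><li><span>'
    '<a class="t" href="/watch?v=abc">Video</a>'
    '</span></li></ul></section>'
)

print(extract_hrefs(OLD_LAYOUT))
print(extract_hrefs(NEW_LAYOUT))
```

Both calls return the same single href, because the query depends on the tag name rather than on where the tag is nested.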

@dstftw (Collaborator) commented Aug 18, 2019

It's all good until it comes to more realistic and complicated extraction scenarios, such as extracting something that lies an arbitrary number of nesting levels deep and can be reached through several alternative paths. In such cases parser-based code will most likely grow and branch, ending up less readable. It also won't work for HTML embedded in JavaScript, may not work on malformed HTML, and won't work for non-HTML content at all. There are plenty of other potential corner cases that would need special treatment.
Keeping this in mind, I see zero point in switching to HTML parsers.
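
The "HTML embedded in JavaScript" case can be sketched as follows. This is only an illustration under stated assumptions: the page, the URL, and the helper names (VideoTagFinder, parser_sources, regex_source) are hypothetical and are not taken from youtube-dl. An HTML parser treats the contents of a script element as opaque text, so it never sees the escaped video tag, while a short list of alternative regular expressions still finds it.

```python
import re
from html.parser import HTMLParser

# Hypothetical page: the player markup only exists inside a JavaScript string,
# so its quotes are backslash-escaped.
PAGE = r'''
<html><body>
<script>
var player = {"html": "<video src=\"https://example.com/v/123.mp4\"></video>"};
</script>
</body></html>
'''


class VideoTagFinder(HTMLParser):
    """Collect the src of every <video> tag the HTML parser actually sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "video":
            self.sources.append(dict(attrs).get("src"))


def parser_sources(page):
    finder = VideoTagFinder()
    finder.feed(page)
    finder.close()
    return finder.sources


# Alternative patterns, tried in order until one matches.
PATTERNS = (
    r'<video\s+src="(?P<url>[^"]+)"',        # plain HTML attribute
    r'<video\s+src=\\"(?P<url>[^"\\]+)\\"',  # same tag escaped inside a JS string
)


def regex_source(page):
    for pattern in PATTERNS:
        match = re.search(pattern, page)
        if match:
            return match.group("url")
    return None


print(parser_sources(PAGE))  # the parser finds no <video> element
print(regex_source(PAGE))    # the second pattern recovers the URL
```

The trade-off is that every new variant of the page needs another pattern in the list, which is the kind of growth the parser-based version would also see, just expressed as branching code instead.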

@twaddington (Contributor, Author) commented Aug 19, 2019

I have definitely seen some parsers choke on sites with squirrelly markup. Thanks for offering your opinion. I definitely didn't mean to come in and tell you how to maintain your own project. No reason to change what's already working well for you ;)
