[Question]: BeautifulSoup #22141
Comments
Regular expressions are used intentionally. Code using HTML parsers will break on any minor change in page layout.
That's great that it's an intentional decision. I'm not advocating a change. Was just wondering if it'd been discussed before.
I don't mean to be blunt, but this isn't accurate. For example, you can use a parser to query the DOM for all anchor tags with a command like
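The command from the comment above did not survive on this page, so what follows is only an illustration of the kind of query being described, not the commenter's actual snippet. It is a minimal sketch using Python's standard-library `html.parser`; the names `AnchorCollector`, `find_anchor_hrefs`, and `SAMPLE_PAGE` are made up for the example. With BeautifulSoup the analogous call would be along the lines of `soup.find_all("a")`.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Hypothetical helper: records the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def find_anchor_hrefs(html):
    """Return the href values of all anchor tags, at any nesting depth."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


# Made-up markup standing in for a video page.
SAMPLE_PAGE = (
    '<div><p><a href="/watch?v=abc">First</a></p>'
    '<a href="/watch?v=def">Second</a></div>'
)

print(find_anchor_hrefs(SAMPLE_PAGE))  # ['/watch?v=abc', '/watch?v=def']
```

The query does not mention where the anchors sit in the page, which is the point being made: moving them into a different container would not change the result.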
It's all good until it comes to more realistic and complicated extraction scenarios, such as extracting something that lies an arbitrary number of nesting levels deep and can be reached through several alternative paths. In such cases parser-based code will most likely grow and branch, resulting in less readable code. It also won't work for HTML embedded in JavaScript, may not work on malformed HTML, and won't work for non-HTML content. Plenty of other corner cases would require special treatment as well.
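To make the embedded-JavaScript case concrete, here is a minimal sketch, not code taken from youtube-dl. The page content, the `playerConfig` variable name, and the `extract_player_config` helper are all assumptions invented for the example. An HTML parser would see the script body as one opaque block of text, whereas a regular expression can reach directly into it.

```python
import json
import re

# Made-up page: the interesting data lives inside a <script> block,
# not in any HTML element a DOM query could select.
SAMPLE_PAGE = """<html><body>
<script>var playerConfig = {"title": "Demo", "url": "https://example.com/video.mp4"};</script>
</body></html>"""


def extract_player_config(page):
    """Return the JSON object assigned to the hypothetical playerConfig variable.

    Returns None when the assignment is not present in the page.
    """
    match = re.search(r"var\s+playerConfig\s*=\s*(\{.*?\});", page, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))


config = extract_player_config(SAMPLE_PAGE)
print(config["url"])  # https://example.com/video.mp4
```

The non-greedy pattern is only adequate here because the sample object has no nested `};` sequence; a real page would likely need a more careful expression.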
I have definitely seen some parsers choke on sites with squirrelly markup. Thanks for offering your opinion. I definitely didn't mean to come in and tell you how to maintain your own project. No reason to change what's already working well for you ;)
Question
I recently started digging through the youtube-dl code and noticed that regular expressions are used all over the place. I'm wondering whether the developers have considered migrating to an HTML parser like BeautifulSoup instead. This would be a more readable and idiomatic way of extracting data from video pages.