CPAN Pull Request Challenge: Pull Request Preview: Tidying Up (Requesting Comments) #3
As mentioned in the other issue I have been assigned this CPAN module as a pull request challenge assignment for January. I have begun attempting to tidy things up where I can. The work is currently a work in progress in my
A snapshot has been pushed to
Thanks. I will try to finalize something and submit an official pull request when I'm ready. Or you can feel free to merge anytime you wish and I'll work around the upstream history.
I am also considering replacing the regex
Yes, performance was a consideration of mine as well. In my experience, parsing SGML-based dialects is faster than you'd imagine. I have implemented Web applications that parse complex, user-uploaded HTML data (albeit with .NET), and the performance hit is negligible. I'd imagine that would be true in this case as well.
My concern is that it isn't possible to parse all possible HTML markup with just a regular expression. It will come down to whether it has to always work or whether performance matters more than working every time.
I will try to put together some test cases that break the regular expression and we can go from there. :)
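A test case of the kind described might look like the following. This is only a sketch in Python, with the standard library's `html.parser` standing in for HTML::Parser. The `FORM_TAG` pattern is a simplified, hypothetical regex, not the one the module actually uses, and the sample markup is invented for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical, simplified pattern; NOT the regex the module really uses.
FORM_TAG = re.compile(r"<form\b[^>]*>", re.IGNORECASE)


class FormTagFinder(HTMLParser):
    """Collects the attributes of every real <form> start tag."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.forms.append(dict(attrs))


# Two things that are legal HTML but awkward for a regex:
#   1. a commented-out <form> tag
#   2. a '>' inside a quoted attribute value
html = (
    '<!-- <form action="/old"> -->'
    '<form action="/search?a>b" method="get"><input name="q"></form>'
)

# The regex reports the commented-out tag and cuts the real tag short.
regex_matches = FORM_TAG.findall(html)

# The parser skips the comment and keeps the attribute value whole.
finder = FormTagFinder()
finder.feed(html)
finder.close()

print(regex_matches)  # ['<form action="/old">', '<form action="/search?a>']
print(finder.forms)   # [{'action': '/search?a>b', 'method': 'get'}]
```

Whether inputs like these show up in practice is exactly the "does it always have to work" question above.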
If you're sure you want to keep the regular expression parsing, I was thinking that perhaps we could add a configurable branch on the HTML transformation to choose the desired implementation. Depending on how popular the module is, we can either leave the legacy behavior as the default and require a constructor argument to enable a proper HTML parser for users who need or want it, or vice versa. E.g.,
That way the user can do what is necessary, and we can choose an appropriate default for the users that don't care. The performance hit will be optional. :) I'll try to figure out the tests and create some nasty HTML samples. :)
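The opt-in switch described above could be sketched roughly as follows. It is written in Python for illustration only; the class name `FormScanner`, the `use_parser` argument, and the `form_actions` method are all hypothetical and do not reflect the module's real interface. The sketch only lists form actions rather than transforming the HTML, to keep the branch itself visible.

```python
import re
from html.parser import HTMLParser


class _ActionCollector(HTMLParser):
    """Parser-based implementation: records the action of each <form>."""

    def __init__(self):
        super().__init__()
        self.actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.actions.append(dict(attrs).get("action"))


class FormScanner:
    """Hypothetical stand-in for the module's object; names are assumptions."""

    # Simplified legacy-style pattern, not the module's actual regex.
    _FORM_RE = re.compile(r'<form\b[^>]*\baction="([^"]*)"', re.IGNORECASE)

    def __init__(self, use_parser=False):
        # The legacy regex behavior stays the default; the parser is opt-in.
        self.use_parser = use_parser

    def form_actions(self, html):
        """Return the action URLs found, using the configured implementation."""
        if self.use_parser:
            collector = _ActionCollector()
            collector.feed(html)
            collector.close()
            return collector.actions
        return self._FORM_RE.findall(html)


sample = '<!-- <form action="/old"> --><form action="/new"></form>'
print(FormScanner().form_actions(sample))                 # ['/old', '/new']
print(FormScanner(use_parser=True).form_actions(sample))  # ['/new']
```

Flipping the default later would only mean changing `use_parser=False` to `use_parser=True`, so callers who set the argument explicitly would be unaffected.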
I am using HTML::StickyQuery (HTML::Parser) for <a href="...">.
It might be best to handle <form> and <a> together.
That should also improve performance compared to now :)
I see that
Firstly, it sounds very inefficient to concatenate onto the output string that many times (it would probably create a ton of temporary strings, whereas a smarter, lower-level interface to Perl strings could concatenate them all at once). That seems like it might be a design flaw of
Secondly, this makes it difficult to combine both effects without parsing twice. Instead of manipulating the tag text through an API, allowing multiple processes to have their turn, they just assume that they own the entire parser and modify the output string. Perhaps it would work though if we implemented a wrapper that checked the tag name and dispatched to each type of parser's methods... E.g., (this is a big hack just to demonstrate a POSSIBLE idea)
In theory something like that could work, but it would be a huge hack that could break when the internals of either package changed, and it would also depend on merging the state of
A cleaner approach might be to patch both packages, split their methods (e.g.,
I hope that there's a much easier way that I'm missing...
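For illustration, the tag-name dispatch idea from this comment can be sketched as a single parsing pass in which each handler returns the text for its own tag instead of writing to a shared output string. This is a Python sketch with `html.parser` standing in for HTML::Parser. The handler functions, the `sid=42` value, and the `TagDispatcher` class are hypothetical and are not the API of either package.

```python
from html import escape
from html.parser import HTMLParser

SESSION = "sid=42"  # hypothetical value to carry from page to page


def add_hidden_field(tag, attrs, text):
    """Form handler (hypothetical): keep the tag, append a hidden input."""
    return text + '<input type="hidden" name="sid" value="42">'


def add_sticky_query(tag, attrs, text):
    """Link handler (hypothetical): rebuild the tag with an extended href."""
    parts = [tag]
    for name, value in attrs:
        if name == "href" and value is not None:
            value += ("&" if "?" in value else "?") + SESSION
        parts.append(name if value is None else f'{name}="{escape(value)}"')
    return "<" + " ".join(parts) + ">"


class TagDispatcher(HTMLParser):
    """Parses once and routes each start tag to a handler chosen by name."""

    def __init__(self, handlers):
        super().__init__(convert_charrefs=False)
        self.handlers = handlers
        self.pieces = []  # output fragments, joined once at the end

    def handle_starttag(self, tag, attrs):
        text = self.get_starttag_text()
        handler = self.handlers.get(tag)
        self.pieces.append(handler(tag, attrs, text) if handler else text)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: emit once, with no end tag.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        self.pieces.append(f"</{tag}>")

    def handle_data(self, data):
        self.pieces.append(data)

    def handle_entityref(self, name):
        self.pieces.append(f"&{name};")

    def handle_charref(self, name):
        self.pieces.append(f"&#{name};")

    def handle_comment(self, data):
        self.pieces.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.pieces.append(f"<!{decl}>")


def transform(html, handlers):
    dispatcher = TagDispatcher(handlers)
    dispatcher.feed(html)
    dispatcher.close()
    return "".join(dispatcher.pieces)


result = transform(
    '<form action="/s"><a href="/page?x=1">next</a></form>',
    {"form": add_hidden_field, "a": add_sticky_query},
)
print(result)
```

Because the handlers never touch the output directly, both effects are applied in one pass. Doing the same with the real packages would still require them to expose their per-tag logic separately from their output handling.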