Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Any reason to make parser methods private? #56

Open
slusarz opened this issue Mar 20, 2013 · 4 comments
Open

Any reason to make parser methods private? #56

slusarz opened this issue Mar 20, 2013 · 4 comments

Comments

@slusarz
Copy link
Contributor

slusarz commented Mar 20, 2013

A bit of background (if you are wondering why I've suddenly shown up and made a bunch of pull requests :) ):

I'm a dev on the Horde project. We finally got fed up with our csstidy parser module and decided to look for code that could actually parse CSS3, and do it in a coherent fashion (i.e. stateful parsing, not preg crap). Found this project and am happy. We are going to package the parser in a PEAR package to be able to be used in our framework (i.e. autoloading is setup, etc.).

As such, one of the things I would like to see is stream support instead of reading all data into memory. We use streams all over the place to reduce memory usage - since we have installations that have millions of users, we need to cut down memory usage as much as possible.

One way to do that is to open CSS files with fopen() and then pass the stream resource to the CSS parser instead of reading the whole file into memory (which can be 200KB+ of data). I would then write a local Horde wrapper which extends Parser to overwrite the strlen, substr (and new strpos method I just added a pull request for) to access the stream instead of making them string operations.

However, with those methods being marked as private, there is no way to do that. As a project, we mark everything as protected for precisely this reason - so end users can extend our classes as they see fit.

Any chance of making this change upstream? Appreciate the requests you've already processed from me - it would be great to keep our copy as pristine as possible in order to be able to update from the main source as needed, thus the reason for the request.

@sabberworm
Copy link
Contributor

Thanks for your interest in the project.

I have no qualms with making private methods protected. Send me a pull request and I’ll merge it.

As for the streaming parser (still not a real streaming parser as it will only work for streams to files – which are rewindable and advancable at will): that would be a wonderful addition and a very welcome change. I see two main obstacles (as the project currently stands):

  • consumeExpression uses inputLeft, which gets the entire rest of the string. We would have to get rid of both these methods. This means changing the handling of unicode escapes to match one character at a time.
  • Some methods (like consume) are too monolithic to be easily overridden by subclasses. These would have to be split into smaller methods first.

On a related note: What this project currently does not do is the parsing of selectors and the like. It even breaks fundamentally on something like body[data-some-token="{"] because it will try to start the declaration block at the next { it sees. To really be “code that could actually parse CSS3”, this project will have to either start parsing selectors or at least adapt consumeUntil to respect the nesting of parens and quotes (except when parsing comments, where it is also currently used).

@slusarz
Copy link
Contributor Author

slusarz commented Mar 20, 2013

Re the streaming parser: you are actually incorrect about the need for the stream to be a file resource. We use PHP temporary streams all over the place. Tremendously useful for things like reading data from an IMAP server. Since mail attachments can, and frequently are, large data chunks - and often base64 encoded - they would normally take 10-20 MB to store in a PHP string and then another 10-20 MB to decode using the built-in PHP base 64 decoding function. By instead feeding the IMAP server output directly into a PHP temporary stream, we guarantee that no more than 2 MB of memory is ever used. And the base64 decoding is done on-the-fly, so there is no need to duplicate data. Thus we can handle 10+MB attachments with only a 2MB hit to memory - useful since we often have 1000's of connections a second on a single machine.

Now for CSS parsing, the odds that someone has a 2MB+ CSS data string is very small so memory savings is not a killer feature in this regard. The flip side is that you are treating a string like a stream currently anyway - doing tokenization character by character - so using streams just makes sense from a coding perspective.

So there's two things that could be done:

  1. Have separate classes for parsing string data and stream data, and use whichever class reflects the input from the consumer. --OR--
  2. Maintain the single parsing class, and rewrite it so that the parsing methods handle stream data. If string data is passed into the constructor, just convert it to a stream internally ($fp = fopen('php://temp', 'r+'); fwrite($fp, $data); rewind($fp);) This seems to be the correct way of doing things because it doesn't require a rewrite of Parser (from a consumer's perspective), you don't have to change the definitions of any of the internal methods, and there is less code to maintain.

However, the problem with 2 is that, by allowing non-ASCII input, you do have to write your stream to correctly parse multibyte characters. This is annoying. (Aside: we do string tokenization for IMAP data, but the nice thing about IMAP data is that it is ASCII only so I don't have to worry about this). It shouldn't be that hard to find a stream class/code example that handles UTF-8 stream data. But this becomes progressively more difficult if you want to support other multibyte charsets.

From our project's perspective, we don't need any of the multibyte stuff (or, at the most, anything outside of UTF-8) since we require UTF-8 for everything and don't use non-ASCII stuff in any of our CSS files. So if this is going to be too complicated to add upstream, I'll stick with the conversion of private -> protected and do what we need in our code. However, since you have shown interest in having stream handling code here, it is worth my time to work with you and see if we can come up with something.

@sabberworm
Copy link
Contributor

There are different definitions of what a streaming parser is but I was thinking of a stream that possibly may not yet be read till the very end. Such parsers may stop at any given moment and wait for more data to become available. They can either do this using asynchronous I/O or on a thread by waiting. Since neither is possible with PHP, we can’t do this kind of streaming parser.

But of course you’re right: streams whose data is available at any given time are not limited to point to files.

I think if we’re going to implement the changes necessary to deal with a stream instead of a string, we might as well do it in the base class.

@unclecheese
Copy link

I've submitted a PR for the private visibility. That was bringing me down, too!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants