A different approach #105

Closed
garfix opened this issue Jul 25, 2021 · 17 comments

garfix commented Jul 25, 2021

Hi,

I found out about this package because Magento 2 is using it in its build process. The build process is very time consuming, and part of this time is spent compressing JavaScript files with JShrink. When I looked at the code it occurred to me that a different approach might make it much faster. This approach would be based on the pivotal use of the PHP function preg_replace_callback, which would process the entire JavaScript file at once.

Here's a bit of (very incomplete) code I used to test if this would work:

$exp = "~(
            /\*!.*?\*/ |                                             # /* license */
            /\*.*?\*/ |                                             # /* comment */
            //[^\n]* |                                              # // comment
            /(?:\\\\/|[^/])+/[dgimsuy]*[ ]*[,;\n)] |                 # reg exp: /(ape|monkey)\/banana/mi;
            \"(?:\\\\\"|[^\"])*\" |                                 # double quoted string
            '(?:\\\\'|[^'])*' |                                     # single quoted string
            (?P<negatives1>--?)\s+(?P<negatives2>--?) |             # a - --b     a-- - b
            (?P<positives1>\+\+?)\s+(?P<positives2>\+\+?) |         # a + ++b     a++ + b
            (?:return|var) |                                        # operator keyword
            [ \t\n]+                                                # whitespace
        )~xs";

        $normalized = str_replace(["\r\n", "\r"], ["\n", "\n"], $js);

        $result = preg_replace_callback($exp, function($matches) {

            $match = $matches[1];
            $first = $match[0];
            switch ($first) {
                case '"':
                    // remove line continuation
                    $string = str_replace("\\\n", "", $match);
                    return $string;
                case ' ':
                    return '';
                case "\n":
                    return '';
                case "\t":
                    return '';
            }
            $firstTwo = substr($match, 0, 2);
            switch ($firstTwo) {
                case '//':
                    return '';
                case '/*':
                    return '';
            }
            switch ($match) {
                case 'var':
                    return 'var ';
                case 'return':
                    return 'return ';
            }
            if (isset($matches['negatives1']) && $matches['negatives1'] !== "") {
                return $matches['negatives1'] . " " . $matches['negatives2'];
            }
            if (isset($matches['positives1']) && $matches['positives1'] !== "") {
                return $matches['positives1'] . " " . $matches['positives2'];
            }
            return $match;
        }, $normalized);

So, basically, the outermost loop is replaced by a single preg_replace_callback. It is much faster because it implements the inner loop with C code (the implementation of preg_replace_callback is written in C), rather than PHP code.

Before I work this out in detail, I was wondering if you are open to this complete rewrite. I am willing to perform the complete rewrite myself, but there are also some points in which the new code will not be backwards compatible with the existing code, like checking if a regular expression is well formed and throwing a RuntimeException. This new code will not do that. Doubtlessly there will be some other minor points that are not completely backwards compatible. So the question is: are you willing to lose some of the compatibility for a significant increase in processing speed?

If you are interested, I would like the help of some beta testers to debug my code. I just had this idea, it doesn't mean I make no mistakes working it out.

Of course I can start a repository in my own domain, but then it would have to build its user base from scratch. By rewriting JShrink it may benefit all of your users.

What do you think?

Member

tedivm commented Aug 4, 2021

I would be okay with changing approaches; let's just make sure to benchmark it and confirm that it won't break existing tests.

Author

garfix commented Aug 5, 2021

Great! I've continued to work out the details, and it still looks promising. I faced a number of problems but they could be resolved. After the weekend I hope to be able to present a working version that passes the tests.

I understand how to run the tests. What should I do to benchmark?
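For the benchmarking question, one possible starting point is a small timing harness like the sketch below. It is not part of JShrink; `benchmarkMinifier` and the stub minifier are hypothetical names introduced here, and the stub only stands in for whichever implementation is being measured.

```php
<?php
// Hypothetical helper (not part of JShrink): run a minifier callable several
// times and return the best wall-clock time in milliseconds.
function benchmarkMinifier(callable $minify, string $js, int $runs = 5): float
{
    $best = INF;
    for ($i = 0; $i < $runs; $i++) {
        $start = hrtime(true);                        // nanoseconds, PHP 7.3+
        $minify($js);
        $elapsed = (hrtime(true) - $start) / 1e6;     // convert to milliseconds
        $best = min($best, $elapsed);
    }
    return $best;
}

// Stand-in for a real minifier; replace it with the implementation to measure,
// for example [\JShrink\Minifier::class, 'minify'].
$stub = function (string $js): string {
    return preg_replace('~[ \t\n]+~', ' ', $js);
};

$sample = str_repeat("var x = 1;\n", 1000);
printf("best of 5: %.3f ms\n", benchmarkMinifier($stub, $sample));
```

Taking the best of several runs reduces noise from the first, cold run; feeding both implementations the same real-world file (jQuery, for instance) keeps the comparison fair.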

The problem of removing whitespace is fascinating. It seems to me that to do it right, you would at least need to tokenize and parse the JavaScript. But even then you could not just remove all whitespace tokens and serialize the parse tree. There are some very specific rules you need to follow to ensure that the resulting JS code still works. But these rules are not given; they need to be discovered by trial and error. Tokenizing and parsing JS in PHP is out of the question, just because it takes up a lot more time. But without it, you run a real risk of messing things up. For example, I ran across what I will call "the double division problem":

let new = old / (temp * 10) / 2;

In my code, and your code as well, if I remember correctly, / (temp * 10) / will be parsed as a JS regexp.

It freaked me out pretty bad. This is just an example of the things that may go wrong without proper tokenization and parsing. But we have to deal with it.
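The double division problem can be reproduced with a few lines of PHP. The pattern below is a simplified stand-in for a regexp-literal matcher, not the exact expression from either code base, and the variable name `result` is used in place of `new`.

```php
<?php
// Naive pattern for a JS regexp literal: a slash, one or more escaped or
// non-slash characters, a closing slash, and optional flags.
$regexpLiteral = '~/(?:\\\\.|[^/\n])+/[a-z]*~';

$js = 'let result = old / (temp * 10) / 2;';

preg_match($regexpLiteral, $js, $m);

// $m[0] is "/ (temp * 10) /": the two division operators are taken for the
// delimiters of a regexp literal, so the whitespace between them is kept.
var_dump($m[0]);
```

Without knowing whether the previous token is an operand or an operator, a pattern alone cannot tell a division from the start of a regexp literal.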

What is also troubling me is that the existing JShrink code will probably create output that is just a little different than mine. An extra enter here and there. More to the point: why does JShrink add enters after every function?

var x=5;function bar(){return--x;}
function foo(){while(bar());}
function mak(){for(;;);}

I see these newlines in several tests, but I have not found a reason for them. They seem to be superfluous. I hesitate to add them just to make the tests pass; I would like to know why they are necessary.

If the output differs too much, or the users depend on the exceptions being thrown, consider the possibility of running in two modes: "classic mode" (existing code) and "preg_replace_callback mode" (or something).

Finally, I don't see the relevance of the IE 10- style conditional comments and the bizarre series of whitespace characters (Mongolian vowel separator?), but they don't slow things down much, so I will put them in.

These are my concerns. I hope you can find some time to look at them.

Author

garfix commented Aug 8, 2021

Hello Robert,

I can now present a fully working version. I changed the code many times and tried all sorts of techniques. This latest version is fast and extendable.
It handles most test-cases, and the ones it does not pass can be debated. [2] It handles utf-8. It is several times faster [1] than the existing implementation and contains 1/3 of the code. I added comments and split up regular expressions for legibility.
I only used the jQuery library as a real-world test case. The minified version is parsed without errors by PhpStorm; I used this to check whether the code still worked. The resulting JavaScript is now 35.06% of the original size (was: 35.22%), so a little bit smaller.

The code needs a lot more testing on real-world examples. I could use some help with that.

I pasted the code below for you to study and try. Let me know what you think of it, and how to proceed.

Next Thursday I will be away on holiday; I will not be able to respond for 2 weeks.

[1] 4x on PHP5.6; 7x on PHP7; 15x on PHP8.0
[2] Tests that were not passed:

  • minify/input/condcomm.js ; the // lines are removed completely, perhaps this is ok?
  • minify/input/issue132.js ; the space between return and the regular expression is removed, this should be ok
  • jshrink/input/utf_chars.js ; this is the "double division problem" at work; my version leaves in more space
  • input/ifreturn.js, input/assignment.js, input/empty-blocks.js, input/forstatement.js, input/ifreturn2.js ; JShrink leaves the newline after some }'s, which I believe is not necessary. See also issue #87, "not minifying line breaks after function end"
  • input/whitespace.js: my version doesn't handle the mongolian vowel separator, not sure why
<?php

/**
 * Definition
 * A "word" in this class is any scalar value, variable, or keyword. Two words may not be glued together; whitespace between them is required.
 *
 * This Javascript minifier removes whitespace from a JavaScript file this way:
 *
 * - Go through the file step by step
 *   - Leave alone strings and regular expressions
 *   - Remove optional whitespace (between identifier and non-word; and between two non-words)
 *   - Replace required whitespace (between two words) by newline or space
 *   - Replace comments by removal markers (\r)
 * - Combine adjacent removal markers
 * - Remove removal markers that can be safely removed
 * - Replace the other markers by a newline (\n)
 *
 * Remarks
 * - The removal markers were introduced to get the comments out of the way, which makes the main expressions much simpler.
 * - This code supports both ASCII and UTF-8 JavaScript.
 *   All regular expressions have the unicode modifier (u), which ensures that the JS is not treated as bytes but as encoded code points.
 */
class Minifier2
{
    protected static $expressions;
    protected static $e = null;

    protected static $defaultOptions = [
        'flaggedComments' => true
    ];

    protected $options = [];

    /**
     * Processes a javascript string and outputs only the required characters,
     * stripping out all unneeded characters.
     *
     * @param string $js      The raw javascript to be minified
     * @param array  $options Various runtime options in an associative array
     */
    public static function minify($js, $options = [])
    {
        $minifier = new Minifier2($options);
        return $minifier->minifyUsingCallbacks($js);
    }

    public function __construct($options)
    {
        $this->options = array_merge(static::$defaultOptions, $options);
    }

    protected static function init()
    {
        if (self::$e !== null) {
            return [self::$e, self::$expressions];
        }

        $e = [];
        // white space (includes space and tab) https://262.ecma-international.org/6.0/#sec-white-space
        $e['whitespace-chars'] = " \t\f\v\p{Zs}";
        $e['whitespace-chars-newline'] = "\n" . $e['whitespace-chars'];
        $e['whitespace'] = "[" . $e['whitespace-chars'] . "]";
        $e['whitespace-newline'] = "[" . $e['whitespace-chars-newline'] . "]";
        // possessive quantifier (++) is needed when used in combination with look ahead
        $e['some-whitespace'] = $e['whitespace'] . "++";
        $e['some-whitespace-newline'] = $e['whitespace-newline'] . "++";
        $e['optional-whitespace'] = $e['whitespace'] . "*";
        $e['optional-whitespace-newline'] = $e['whitespace-newline'] . "*";
        // a single escaped character in a regexp, like \\ or \n (may not be newline)
        $e['regex-escape'] = "\\\\[^\n]";
        // regexp character class: [..] with escape characters
        // [^\n\\]] is any character other than ] or newline
        $e['character-class'] = "\\[(?:" . $e['regex-escape'] . "|[^\n\\]])*?\\]";
        // regexp: /..(..)../i with character class and escaped characters
        // [^\n/] is any character other than a slash or newline
        $e['regexp'] = "/(?:" . $e['regex-escape'] . "|" . $e['character-class'] . "|[^\n/])+/[a-z]*";
        // characters that can form a word; these characters should not be joined together by the removal of whitespace
        $e['word'] = "[a-zA-Z0-9_\$'\"\x{0080}-\x{FFFF}]";
        // temporary placeholder that will later be replaced or removed
        $e['removal-marker'] = "\r";

        self::$e = $e;

        // note : the order of these expressions is important
        self::$expressions = [
            // /** comment */
            "(?<starComment>" . $e['optional-whitespace-newline'] . "/\*.*?\*/" . $e['optional-whitespace-newline'] . ")",
            // // comment
            "(?<lineComment>//[^\n]*" . $e['optional-whitespace-newline'] . ")",
            // regular expression
            "(?<regexp>" . $e['regexp'] . $e['optional-whitespace-newline'] . ")",
            // "double quotes"
            '(?<doubleQuote>"(?:\\\\.|[^"])*")',
            // 'single quotes'
            "(?<singleQuote>'(?:\\\\.|[^'])*')",
            // `template literal`
            "(?<templateLiteral>`(?:\\\\.|[^`])*`)",
            // a sequence of - and -- operators; i.e. a - --b; b-- -c; d-- - -e; f - -g
            "(?<min>--?" . $e['some-whitespace-newline'] . "(?:\-{$e['some-whitespace-newline']})?" . "--?)",
            // a sequence of + and ++ operators
            "(?<plus>\+\+?" . $e['some-whitespace-newline'] . "(?:\+{$e['some-whitespace-newline']})?" . "\+\+?)",

            // Optional whitespace
            // the following expression should not be captured in a (named) group
            // it matches so often that this extra memory allocation slows down execution by a factor two
            "(?:" .
                // whitespace not preceded by a word char
                "(?<!" . $e['word'] . ")". $e['some-whitespace-newline'] .
            "|" .
                // whitespace not followed by a word char
                $e['some-whitespace-newline'] . "(?!" . $e['word'] . ")" .
            ")",

            // Required whitespace
            // whitespace both preceded and followed by a word char
            "(?<requiredSpace>" . $e['some-whitespace-newline'] . ")"
        ];

        return [self::$e, self::$expressions];
    }

    protected function processMatch($matches)
    {
        $e = self::$e;

        // the fully matching text
        $match = $matches[0];               //echo "[$match]";

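        // create a version without leading and trailing whitespace
        // trim() takes a literal character list, so passing it \p{Zs} would also strip the letters p, Z and s (e.g. a trailing regexp flag)
        $trimmed = preg_replace("~^{$e['whitespace-newline']}+|{$e['whitespace-newline']}+$~u", "", $match);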

        // Required whitespace
        // Should be handled before optional whitespace
        if (!empty($matches['requiredSpace'])) {
            return strpos($matches['requiredSpace'], "\n") === false ? " " : "\n";
        }

        // Optional whitespace
        // note: this match is not captured by a named group (because of speed)
        if ($trimmed === '') {
            return "";
        }

        if (!empty($matches['doubleQuote'])) {
            // remove line continuation
            return str_replace("\\\n", "", $match);
        }
        if (!empty($matches['singleQuote']) || !empty($matches['templateLiteral'])) {
            return $match;
        }
        if (!empty($matches['lineComment'])) {
            return $e['removal-marker'];
        }
        if (!empty($matches['starComment'])) {
            switch ($trimmed[2]) {
                case '@':
                    // IE conditional comment
                    return $match;
                case '!':
                    if ($this->options['flaggedComments']) {
                        preg_match("~^(?P<pre>" . $e['optional-whitespace-newline'] . ")(?P<license>/\*.*?\*/)" .
                            $e['optional-whitespace-newline'] . "$~su", $match, $newMatches);
                        $prefix = $newMatches['pre'] === "" ? "" : "\n";
                        return $prefix . $newMatches['license'] . "\n";
                    }
                    return $e['removal-marker'];
            }
            // multi line comment
            return $e['removal-marker'];
        }
        if (!empty($matches['plus']) || !empty($matches['min'])) {
            return preg_replace("~{$e['some-whitespace-newline']}~su", " ", $trimmed);
        }
        if (!empty($matches['regexp'])) {
            // regular expression
            // only if the space after the regexp contains a newline, keep it
            preg_match("~^{$e['regexp']}(?P<post>" . $e['optional-whitespace-newline'] . ")$~su", $match, $newMatches);
            $postfix = strpos($newMatches['post'], "\n") === false ? "" : "\n";
            return $trimmed . $postfix;
        }

        return $match;
    }

    public function minifyUsingCallbacks($js)
    {
        // prepare the expressions (once)
        list($e, $expressions) = self::init();

        // treat all newlines as unix newlines, to keep working with newlines simple
        $normalized = str_replace(["\r\n", "\r"], ["\n", "\n"], $js);

        // build the main expression; it will give just one match
        // modifier s: the dot (.) also matches the newline
        // modifier u: the JavaScript file and the regexp pattern are treated as UTF-8
        $exp = "~(?:" .  implode("|", $expressions) . ")~su";

        // main loop
        $result = preg_replace_callback($exp, [$this, 'processMatch'], $normalized);

        // remove all markers that do not cause words to be joined together
        $result = preg_replace("~(?<!" . $e['word'] . ")". $e['removal-marker'] ."++~su", "", $result);
        $result = preg_replace("~" . $e['removal-marker'] . "++" . "(?!" . $e['word'] . ")~su", "", $result);

        // replace the remaining markers with a single newline
        // this cannot be a space, because two assignments cannot be separated by space
        // this cannot be a semicolon, because this might split up expressions
        return preg_replace("~" . $e['removal-marker'] . "++~su", "\n", $result);
    }
}
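The removal-marker idea used in this version can be shown in isolation. The sketch below is a simplified illustration, not the class above: `stripCommentsWithMarkers` is a hypothetical name, the word class is reduced to ASCII, and strings or regexps that contain comment-like text are deliberately ignored.

```php
<?php
// Sketch of the marker idea: comments become "\r", then markers are dropped
// unless they are the only thing separating two word characters.
function stripCommentsWithMarkers(string $js): string
{
    $word = '[a-zA-Z0-9_$]';    // simplified word class (assumption)

    // 1. replace /* ... */ and // ... comments by a marker
    $js = preg_replace('~/\*.*?\*/|//[^\n]*~s', "\r", $js);
    // 2. drop markers that are not preceded by a word character
    $js = preg_replace('~(?<!' . $word . ')\r++~', '', $js);
    // 3. drop markers that are not followed by a word character
    $js = preg_replace('~\r++(?!' . $word . ')~', '', $js);
    // 4. the remaining markers sit between two words: keep the words apart
    return preg_replace('~\r++~', "\n", $js);
}

var_dump(stripCommentsWithMarkers("a=1;/* c */b=2;"));  // marker removed entirely
var_dump(stripCommentsWithMarkers("var/* c */a=1;"));   // marker becomes a newline
```

In the first call the comment follows a semicolon, so nothing needs to separate the two statements; in the second, removing the comment outright would glue `var` and `a` together, so a newline takes its place.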

Author

garfix commented Aug 9, 2021

Sorry to have to replace this code so soon; I hope you didn't start reading it. I thought of a better way to handle the comments, that didn't involve markers and allowed the resulting code to be even smaller, and simpler to reason about.

This new version consists of 2 phases: in the first phase all non-essential comments are removed. Then, in phase 2, with all nasty comments out of the way, it is very simple to reason about whitespace, because there are no comments polluting it.

I restricted whitespace to space and tab because it actually takes up a considerable amount of time to check for all possible whitespace variants. Which forms of whitespace are actually used by programmers?

<?php

/**
 * Definition
 * A "word" in this class is any scalar value, variable, or keyword. Two words may not be glued together; whitespace between them is required.
 *
 * This Javascript minifier removes whitespace from a JavaScript file in two phases:
 *
 * Phase 1: remove comments
 * All comments that may be removed, are removed
 * By replacing comments by whitespace, it is much easier to reason about whitespace, in phase 2
 *
 * Phase 2: remove whitespace
 * Remove optional whitespace (between word and non-word; and between two non-words)
 * Replace required whitespace (between two words) by newline or space
 *
 * Remarks
 * - This code supports both ASCII and UTF-8 JavaScript.
 *   All regular expressions have the unicode modifier (u), which ensures that the JS is not treated as bytes but as encoded code points.
 */
class Minifier2
{
    protected static $commentExpressions;
    protected static $mainExpressions;
    protected static $e = null;

    protected static $defaultOptions = [
        'flaggedComments' => true
    ];

    protected $options = [];

    /**
     * Processes a javascript string and outputs only the required characters,
     * stripping out all unneeded characters.
     *
     * @param string $js      The raw javascript to be minified
     * @param array  $options Various runtime options in an associative array
     */
    public static function minify($js, $options = [])
    {
        $minifier = new Minifier2($options);
        return $minifier->minifyUsingCallbacks($js);
    }

    public function __construct($options)
    {
        $this->options = array_merge(static::$defaultOptions, $options);
    }

    protected static function init()
    {
        if (self::$e === null) {

            $e = [];
            // white space (includes space and tab) https://262.ecma-international.org/6.0/#sec-white-space
            $e['whitespace-chars'] = " \t";
            $e['whitespace-chars-newline'] = "\n" . $e['whitespace-chars'];
            $e['whitespace'] = "[" . $e['whitespace-chars'] . "]";
            $e['whitespace-newline'] = "[" . $e['whitespace-chars-newline'] . "]";
            // possessive quantifier (++) is needed when used in combination with look ahead
            $e['some-whitespace'] = $e['whitespace'] . "++";
            $e['some-whitespace-newline'] = $e['whitespace-newline'] . "++";
            $e['optional-whitespace'] = $e['whitespace'] . "*";
            $e['optional-whitespace-newline'] = $e['whitespace-newline'] . "*";
            // a single escaped character in a regexp, like \\ or \n (may not be newline)
            $e['regex-escape'] = "\\\\[^\n]";
            // regexp character class: [..] with escape characters
            // [^\n\\]] is any character other than ] or newline
            $e['character-class'] = "\\[(?:" . $e['regex-escape'] . "|[^\n\\]])*?\\]";
            // regexp: /..(..)../i with character class and escaped characters
            // [^\n/] is any character other than a slash or newline
            $e['regexp'] = "/(?:" . $e['regex-escape'] . "|" . $e['character-class'] . "|[^\n/])+/[a-z]*";
            // characters that can form a word; these characters should not be joined together by the removal of whitespace
            $e['word'] = "[a-zA-Z0-9_\$'\"\x{0080}-\x{FFFF}]";

            self::$e = $e;

            // these expression must always be used, because they keep tokens together
            $basicExpressions = [
                // /** comment */
                'starComment' => "(?<starComment>" . $e['optional-whitespace-newline'] . "/\*.*?\*/" . $e['optional-whitespace-newline'] . ")",
                // // comment
                'lineComment' => "(?<lineComment>//[^\n]*" . $e['optional-whitespace-newline'] . ")",
                // regular expression
                'regexp' => "(?<regexp>" . $e['regexp'] . $e['optional-whitespace-newline'] . ")",
                // "double quotes"
                'double' => '(?<doubleQuote>"(?:\\\\.|[^"])*")',
                // 'single quotes'
                'single' => "(?<singleQuote>'(?:\\\\.|[^'])*')",
                // `template literal`
                'template' => "(?<templateLiteral>`(?:\\\\.|[^`])*`)",
            ];

            $specificExpressions = [
                // Required whitespace
                 // a sequence of - and -- operators; i.e. a - --b; b-- -c; d-- - -e; f - -g
                'min' => "(?<=-)(?<min>" . $e['some-whitespace-newline'] . ")(?=-)",
                 // a sequence of + and ++ operators
                'plus' => "(?<=\+)(?<plus>" . $e['some-whitespace-newline'] . ")(?=\+)",

                // Optional whitespace
                // the following expression should not be captured in a (named) group
                // it matches so often that this extra memory allocation slows down execution by a factor two
                'optional' =>
                    "(?:" .
                        // whitespace not preceded by a word char
                        "(?<!" . $e['word'] . ")" . $e['some-whitespace-newline'] .
                    "|" .
                        // whitespace not followed by a word char
                        $e['some-whitespace-newline'] . "(?!" . $e['word'] . ")" .
                    ")",

                // Required whitespace
                // whitespace both preceded and followed by a word char
                'required' => "(?<requiredSpace>" . $e['some-whitespace-newline'] . ")"
            ];

            // note : the order of these expressions is important
            self::$commentExpressions = $basicExpressions;

            self::$mainExpressions = array_merge($basicExpressions, $specificExpressions);
        }

        return [self::$commentExpressions, self::$mainExpressions];
    }

    public function minifyUsingCallbacks($js)
    {
        // prepare the expressions (once)
        list($commentExpressions, $mainExpressions) = self::init();

        // treat all newlines as unix newlines, to keep working with newlines simple
        $normalized = str_replace(["\r\n", "\r"], ["\n", "\n"], $js);

        // build the main expression; it will give just one match
        // modifier s: the dot (.) also matches the newline
        // modifier u: the JavaScript file and the regexp pattern are treated as UTF-8
        $exp1 = "~(?:" .  implode("|", $commentExpressions) . ")~su";

        // remove unnecessary comments
        $minimallyCommented = preg_replace_callback($exp1, [$this, 'processComments'], $normalized);

        $exp2 = "~(?:" .  implode("|", $mainExpressions) . ")~su";

        // remove whitespace
        return preg_replace_callback($exp2, [$this, 'processWhitespace'], $minimallyCommented);
    }

    protected function processComments($matches)
    {
        $e = self::$e;

        // the fully matching text
        $match = $matches[0];

        if (!empty($matches['lineComment'])) {
            return '';
        }
        if (!empty($matches['starComment'])) {

            // create a version without leading and trailing whitespace
            $trimmed = trim($match, $e['whitespace-chars-newline']);

            switch ($trimmed[2]) {
                case '@':
                    // IE conditional comment
                    return $match;
                case '!':
                    if ($this->options['flaggedComments']) {
                        preg_match("~^(?P<pre>" . $e['optional-whitespace-newline'] . ")(?P<license>/\*.*?\*/)" .
                            $e['optional-whitespace-newline'] . "$~su", $match, $newMatches);
                        $prefix = $newMatches['pre'] === "" ? "" : "\n";
                        return $prefix . $newMatches['license'] . "\n";
                    }
                    return '';
            }
            // multi line comment
            return '';
        }

        return $match;
    }

    protected function processWhitespace($matches)
    {
        $e = self::$e;

        // the fully matching text
        $match = $matches[0];

        // create a version without leading and trailing whitespace
        $trimmed = trim($match, $e['whitespace-chars-newline']);

        // Required whitespace
        // Should be handled before optional whitespace
        if (!empty($matches['requiredSpace'])) {
            return strpos($matches['requiredSpace'], "\n") === false ? " " : "\n";
        }
        // + followed by +, or - followed by -
        if (!empty($matches['plus']) || !empty($matches['min'])) {
            return ' ';
        }

        // Optional whitespace
        // note: this match is not captured by a named group (because of speed)
        if ($trimmed === '') {
            return "";
        }

        if (!empty($matches['doubleQuote'])) {
            // remove line continuation
            return str_replace("\\\n", "", $match);
        }
        if (!empty($matches['regexp'])) {
            // regular expression
            // only if the space after the regexp contains a newline, keep it
            preg_match("~^{$e['regexp']}(?P<post>" . $e['optional-whitespace-newline'] . ")$~su", $match, $newMatches);
            $postfix = strpos($newMatches['post'], "\n") === false ? "" : "\n";
            return $trimmed . $postfix;
        }

        return $match;
    }
}
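The way this version treats `+`/`++` and `-`/`--` sequences, with a lookbehind and a lookahead around the whitespace, can be tried on its own. The function name `collapseOperatorSpace` is made up for this sketch, and the whitespace class is limited to space, tab and newline.

```php
<?php
// Whitespace between two plus signs (or two minus signs) must survive as a
// single space; the lookbehind and lookahead keep the operators themselves
// out of the match, so only the whitespace is replaced.
function collapseOperatorSpace(string $js): string
{
    return preg_replace(
        '~(?<=\+)[ \t\n]++(?=\+)|(?<=-)[ \t\n]++(?=-)~',
        ' ',
        $js
    );
}

var_dump(collapseOperatorSpace("a +   ++b"));  // "a + ++b"
var_dump(collapseOperatorSpace("b--\n-c"));    // "b-- -c"
var_dump(collapseOperatorSpace("a - b"));      // unchanged: no operator after the space
```

Because the operators are no longer part of the match, the callback only has to return a single space, which is simpler than re-assembling the operator sequence as the first version did.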

Author

garfix commented Aug 12, 2021

Okay, I did another rewrite. It occurred to me that most replacements are simple whitespace deletions, and these don't require a callback; a simple replace will do. This version is again faster than the one before.
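A reduced sketch of this idea, following the steps described in the class comment of the code further down: whitespace that must survive is swapped for placeholders in a callback, everything else is deleted with a plain `str_replace`, and the placeholders are swapped back. The function name and the pattern are illustrative assumptions; only string literals and whitespace between two word characters are protected here, and the `"\u{E000}"` escapes need PHP 7.0 or later.

```php
<?php
// Sketch: protect required whitespace with placeholders, delete the rest
// with a plain str_replace, then restore the protected whitespace.
function shrinkWithPlaceholders(string $js): string
{
    $whitespace   = [" ", "\t", "\n"];
    $placeholders = ["\u{E000}", "\u{E001}", "\u{E002}"]; // private-use code points

    // the callback is only needed for the blocks whose whitespace must be kept
    $blocks = '~"(?:\\\\.|[^"\\\\])*"'            // double-quoted string
            . '|\'(?:\\\\.|[^\'\\\\])*\''         // single-quoted string
            . '|(?<=[\w$])[ \t\n]++(?=[\w$])~su'; // whitespace between two words
    $js = preg_replace_callback($blocks, function (array $m) use ($whitespace, $placeholders) {
        $first = $m[0][0];
        if ($first === '"' || $first === "'") {
            // keep every whitespace character inside a string literal
            return str_replace($whitespace, $placeholders, $m[0]);
        }
        // required whitespace between words collapses to a single space
        return $placeholders[0];
    }, $js);

    // everything that is still whitespace is optional: no callback needed
    $js = str_replace($whitespace, '', $js);

    // put the protected whitespace back
    return str_replace($placeholders, $whitespace, $js);
}

var_dump(shrinkWithPlaceholders("var  s = \"a b\" ;"));  // var s="a b";
```

The bulk of the work moves from a PHP callback into `str_replace`, which is the reason the approach can be faster than matching every whitespace run in the callback.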

I also added a solution for the double division problem. The check for a valid regexp is extended by matching the characters that follow it.
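A check of this kind could look like the sketch below. The set of characters allowed after a regexp literal is a guess for illustration, loosely based on the first test expression in this thread, and `looksLikeRegexp` is a hypothetical helper rather than code from the class.

```php
<?php
// Hypothetical check: accept a regexp literal only when it is followed by
// optional spaces and a character that can come after one (comma, semicolon,
// dot, closing parenthesis, newline) or by the end of the input.
function looksLikeRegexp(string $js): bool
{
    $pattern = '~/(?:\\\\.|[^/\n])+/[a-z]*(?=[ \t]*(?:[,;.)\n]|$))~';
    return preg_match($pattern, $js) === 1;
}

var_dump(looksLikeRegexp('x = old / (temp * 10) / 2;'));  // false: "2" follows
var_dump(looksLikeRegexp('ok = /ab+c/i.test(s);'));       // true: "." follows
```

This remains a heuristic: input such as `a / b /c;` is still accepted, because `c` is read as a flag and is followed by a semicolon.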

I am off on a holiday now. Won't be able to answer any questions at this time. Hope you will have some time to look at the code and decide whether or not you would like to use it in JShrink, and which new form this will take. I realize this is quite a change to the library. If there's any way I can help to ease the change, I will be happy to help. And I will be available after the integration to fix the bugs that will pop up.

Here are the main issues with this change, as I see it:

  • no exceptions
  • no newlines after functions
  • failing tests (tests may need to be updated)
  • fewer forms of whitespace (can be re-added if desired)
  • generated code is smaller but also different in many respects

And these are the advantages:

  • 10x faster on PHP7.4; even more on PHP 8.0
  • Less code
  • No buffers used
  • fixes the problems with regular expressions
  • fixes many of the open issues
<?php

/**
 * This minifier removes whitespace from a JavaScript file.
 *
 * Definition
 * A "word" in this class is any scalar value, variable, or keyword. Two words may not be glued together; whitespace between them is required.
 * A "block" in this class is a piece of JavaScript code that contains whitespace that may not be removed.
 *
 * The algorithm has four steps. The whitespace is removed in step 3, in a single sweep.
 * Steps 1 and 2 prepare for this step. Step 4 cleans up after.
 *
 * Step 1: remove comments
 * All comments that may be removed, are removed
 * By removing comments, it is much easier to reason about whitespace, in step 2, because the whitespace is no longer polluted by comments.
 *
 * Step 2: handle all blocks
 * Match the blocks (pieces of code that contain required whitespace)
 * Leave the blocks in the Javascript, but replace their whitespace by placeholders
 * All whitespace that is now left is optional and can be removed
 *
 * Step 3: remove all remaining whitespace
 * At this point all whitespace that is left can be safely and quickly removed
 *
 * Step 4: replace placeholders by whitespace
 * Replace the placeholders that were created in step 2 by the whitespace they stand for
 *
 * JavaScript regular expression syntax
 * JavaScript has a special syntactic construct for regular expression: /abc(d)ef/i
 * Care has been taken that this construct is not confused with
 * - single line comments, which look like an empty regexp: //
 * - a combination of two divisions: 1 / (x + y) / z
 * When a regexp is followed by a newline, this newline may not be replaced by a simple space.
 *
 * The ++ and -- operators
 * When the space between a + operator and a ++ operator is removed, the meaning of the code changes: a + ++b -> a+++b, which is parsed as a++ + b
 * So some sort of whitespace must be preserved.
 *
 * Remarks
 * - This code supports both ASCII and UTF-8 JavaScript.
 *   All regular expressions have the unicode modifier (u), which ensures that the JS is not treated as bytes but as encoded code points.
 *   The "dotall" modifier (s) is used everywhere: the dot (.) also matches the newline
 *
 */
class Minifier2
{
    protected static $e = null;
    protected static $tokenExpressions;
    protected static $allExpressions;

    protected static $defaultOptions = [
        'flaggedComments' => true
    ];

    protected $options = [];

    /**
     * Processes a javascript string and outputs only the required characters,
     * stripping out all unneeded characters.
     *
     * @param string $js      The raw javascript to be minified
     * @param array  $options Various runtime options in an associative array
     */
    public static function minify($js, $options = [])
    {
        self::init();

        $minifier = new Minifier2($options);
        return $minifier->minifyUsingCallbacks($js);
    }

    public function __construct($options)
    {
        $this->options = array_merge(static::$defaultOptions, $options);
    }

    protected static function init()
    {
        if (self::$e === null) {

            $e = [];
            // white space (includes space and tab) https://262.ecma-international.org/6.0/#sec-white-space
            $e['whitespace-array'] = [" ", "\t"];
            $e['whitespace-array-newline'] = array_merge($e['whitespace-array'], ["\n"]);
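            // placeholders are unused code points; see also https://en.wikipedia.org/wiki/Private_Use_Areas
            // note: a PHP string needs \u{...} (PHP 7.0+) for a code point; "\x{E000}" is the literal text \x{E000}
            $e['whitespace-array-placeholders'] = ["\u{E000}", "\u{E001}", "\u{E002}"];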

            $e['whitespace-chars'] = implode("", $e['whitespace-array']);
            $e['whitespace-chars-newline'] = "\n" . $e['whitespace-chars'];

            $e['whitespace'] = "[" . $e['whitespace-chars'] . "]";
            $e['whitespace-newline'] = "[" . $e['whitespace-chars-newline'] . "]";

            // possessive quantifier (++) is needed when used in combination with look ahead
            $e['some-whitespace'] = $e['whitespace'] . "++";
            $e['some-whitespace-newline'] = $e['whitespace-newline'] . "++";
            $e['optional-whitespace'] = $e['whitespace'] . "*";
            $e['optional-whitespace-newline'] = $e['whitespace-newline'] . "*";

            // a single escaped character in a regexp, like \\ or \n (may not be newline)
            $e['regex-escape'] = "\\\\[^\n]";
            // regexp character class: [..] with escape characters
            // [^\n\\]] is any character other than ] (and not a newline)
            $e['character-class'] = "\\[(?:" . $e['regex-escape'] . "|[^\n\\]])*?\\]";
            // regexp: /..(..)../i with character class and escaped characters
            // [^\n/] is any non-slash character (and not a newline)
            // accepted flags: d, g, i, m, s, u, y
            $e['regexp'] = "/(?:" . $e['regex-escape'] . "|" . $e['character-class'] . "|[^\n/])+/[dgimsuy]*";

            // characters that can form a word; these characters should not be joined together by the removal of whitespace
            $e['word'] = "[a-zA-Z0-9_\$'\"\x{0080}-\x{FFFF}]";

            self::$e = $e;

            // these expressions must always be present, because they keep tokens that contain whitespace together
            self::$tokenExpressions = [
                // /** comment */
                'starComment' => "(?<starComment>" . "/\*.*?\*/" . $e['optional-whitespace-newline'] . ")",
                // // comment
                'lineComment' => "(?<lineComment>//[^\n]*" . $e['optional-whitespace-newline'] . ")",
                // regular expression
                'regexp' =>
                    "(?<regexp>" .
                        // if there's whitespace, match it all (possessive "+")
                        $e['regexp'] . $e['optional-whitespace'] . "+" .
                        // to distinguish a regexp from a sequence of divisions (e.g. x / y / z):
                        "(?:" .
                            // it is followed by a newline; add it to the match
                            "\n" .
                        "|" .
                            // it is followed by any of these characters
                            "(?=[;,\.)])" .
                        ")" .
                    ")",
                // "double quotes"
                'double' => '(?<doubleQuote>"(?:\\\\.|[^"])*")',
                // 'single quotes'
                'single' => "(?<singleQuote>'(?:\\\\.|[^'])*')",
                // `template literal`
                'template' => "(?<templateLiteral>`(?:\\\\.|[^`])*`)",
            ];

            $specificExpressions = [
                // a sequence of - and -- operators; i.e. a - --b; b-- -c; d-- - -e; f - -g
                'min' => "(?<=-)(?<min>" . $e['some-whitespace-newline'] . ")(?=-)",
                // a sequence of + and ++ operators
                'plus' => "(?<=\+)(?<plus>" . $e['some-whitespace-newline'] . ")(?=\+)",
                // whitespace both preceded and succeeded by a word char
                'requiredSpace' => "(?<=" . $e['word'] . ")" . "(?<requiredSpace>" . $e['some-whitespace-newline'] . ")" . "(?=" . $e['word'] . ")"
            ];

            self::$allExpressions = array_merge(self::$tokenExpressions, $specificExpressions);
        }
    }

    public function minifyUsingCallbacks($js)
    {
        // treat all newlines as unix newlines, to keep working with newlines simple
        $normalized = str_replace(["\r\n", "\r"], ["\n", "\n"], $js);

        // remove comments
        $exp1 = "~(?:" .  implode("|", self::$tokenExpressions) . ")~su";
        $minimallyCommented = preg_replace_callback($exp1, [$this, 'removeComments'], $normalized);

        // rewrite blocks and insert whitespace placeholders
        $exp2 = "~(?:" .  implode("|", self::$allExpressions) . ")~su";
        $placeholderText = preg_replace_callback($exp2, [$this, 'processBlocks'], $minimallyCommented);

        // remove all remaining space
        $shrunkText = preg_replace("~" . self::$e['some-whitespace-newline'] . "~su", '', $placeholderText);

        // replace whitespace placeholders by their original whitespace
        return str_replace(self::$e['whitespace-array-placeholders'], self::$e['whitespace-array-newline'], $shrunkText);
    }

    /**
     * Removes all comments that need to be removed
     * The newlines that are added here may later be removed again
     *
     * @param $matches
     * @return mixed|string
     */
    protected function removeComments($matches)
    {
        // the fully matching text
        $match = $matches[0];

        if (!empty($matches['lineComment'])) {
            // not empty because this might glue words together
            return "\n";
        }
        if (!empty($matches['starComment'])) {

            // create a version without leading and trailing whitespace
            $trimmed = trim($match, self::$e['whitespace-chars-newline']);

            switch ($trimmed[2]) {
                case '@':
                    // IE conditional comment
                    return $match;
                case '!':
                    if ($this->options['flaggedComments']) {
                        // option says: leave flagged comments in
                        return $match;
                    }
            }
            // multi line comment; not empty because this might glue words together
            return "\n";
        }

        // leave other matches unchanged
        return $match;
    }

    /**
     * Updates the code for all blocks (they contain whitespace that should be conserved)
     * No early returns: all code must reach `end` and have the whitespace replaced by placeholders
     *
     * @param $matches
     * @return array|mixed|string|string[]
     */
    protected function processBlocks($matches)
    {
        // the fully matching text
        $match = $matches[0];

        // create a version without leading and trailing whitespace
        $trimmed = trim($match, self::$e['whitespace-chars-newline']);

        // Should be handled before optional whitespace
        if (!empty($matches['requiredSpace'])) {
            $match = strpos($matches['requiredSpace'], "\n") === false ? " " : "\n";
            goto end;
        }
        // + followed by +, or - followed by -
        if (!empty($matches['plus']) || !empty($matches['min'])) {
            $match = ' ';
            goto end;
        }
        if (!empty($matches['doubleQuote'])) {
            // remove line continuation
            $match = str_replace("\\\n", "", $match);
            goto end;
        }
        if (!empty($matches['starComment'])) {
            switch ($trimmed[2]) {
                case '@':
                    // IE conditional comment
                    goto end;
                case '!':
                    if ($this->options['flaggedComments']) {
                        // ensure newlines before and after
                        $match = "\n" . $trimmed . "\n";
                        goto end;
                    }
            }
            // simple multi line comment; will have been removed in the first step
            goto end;
        }
        if (!empty($matches['regexp'])) {
            // regular expression
            // only if the space after the regexp contains a newline, keep it
            preg_match("~^" .
                self::$e['regexp'] . "(?P<post>" . self::$e['optional-whitespace-newline'] . ")" .
                "$~su", $match, $newMatches);
            $postfix = strpos($newMatches['post'], "\n") === false ? "" : "\n";
            $match = $trimmed . $postfix;
            goto end;
        }

        end:

        return str_replace(self::$e['whitespace-array-newline'], self::$e['whitespace-array-placeholders'], $match);
    }
}

@yellow1912

yellow1912 commented Aug 29, 2021

Very nice, @garfix, thank you for the code.

Edit: I think there is an issue with semicolons. You may have to add semicolons automatically at the end of lines.

@garfix
Author

garfix commented Aug 29, 2021

Thanks, @yellow1912! Can you give an example of the problem with semicolons?

@yellow1912

Let me know if you need the full code, but take for example something like this:

window['nuReady'] = function (callback) {
    // do something here
}

var abc = 'test';

If you minify this with the given code, it generates something like this:

window['nuReady']=function (callback) {}var abc='test';

The generated code will not work because a semicolon is missing. It should be:

window['nuReady']=function (callback) {};var abc='test';
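
One way to confirm this without loading a page is to ask a JavaScript engine to parse each variant. The sketch below assumes Node.js (or any engine that allows `new Function`); `parses` is a helper name made up for this illustration and is not part of JShrink or the code in this thread.

```javascript
// Returns true when `source` parses as a JavaScript function body.
// new Function() only compiles the text; it does not execute it,
// so `window` does not need to exist.
function parses(source) {
  try {
    new Function(source);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

// The unminified input: the newline after "}" lets the parser insert a semicolon.
const original = "window['nuReady'] = function (callback) {\n}\n\nvar abc = 'test';";
// The minified output with the newline removed.
const joined = "window['nuReady']=function (callback) {}var abc='test';";
// The minified output with an explicit semicolon.
const repaired = "window['nuReady']=function (callback) {};var abc='test';";

console.log(parses(original)); // true
console.log(parses(joined));   // false
console.log(parses(repaired)); // true
```

The same helper could be run over a minifier's output as a cheap sanity check, although it only detects syntax errors, not changes in meaning.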

@garfix
Author

garfix commented Aug 29, 2021

Yes, I understand. This is probably the reason why JShrink adds newlines after functions. Thanks for the clear example. I will dive into it.

@garfix
Author

garfix commented Sep 4, 2021

Hi @yellow1912, yes, this was a big thing. In JavaScript the semantics may depend on newlines, in pretty complicated ways. To tackle this problem, newlines are now only removed when this is known to be safe. This is very important when the code has no semicolons, but the problem is also present when there are semicolons.

For readers new to this, the issue is best explained with an example:

function a(x) { return x * x; }
let b = a(2) + a(3);

The newline between the function and the assignment may be omitted:

function a(x) { return x * x; } let b = a(2) + a(3);

But if we write

let a = function(x) { return x * x; }
let b = a(2) + a(3);

then the newline may not be removed:

let a = function(x) { return x * x; } let b = a(2) + a(3);

// Error: identifier expected

I changed the code to deal with this.
I also changed the way the expressions are built from using an array to creating an object.
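
As a rough illustration of the "only remove newlines that are known to be safe" rule, here is a sketch in JavaScript rather than PHP. The names `SAFE_OPERATORS`, `OPENING`, `CLOSING` and `removeSafeNewlines` are made up for this example; the character sets follow the safeNewlines patterns in the PHP code below. It assumes an engine with lookbehind support (for example a recent Node.js) and is not a full treatment of automatic semicolon insertion.

```javascript
// Character sets, written so they can be placed inside a regexp character class.
const SAFE_OPERATORS = ",.&|+\\-*/%=<>?:";
const OPENING = "({\\[";
const CLOSING = ")}\\]";

const safeNewlines = [
  // newline preceded by a semicolon, an opening bracket or an operator
  new RegExp(`(?<=[;${OPENING}${SAFE_OPERATORS}])\\n`, "g"),
  // newline followed by a semicolon, a closing bracket or an operator
  new RegExp(`\\n(?=[;${CLOSING}${SAFE_OPERATORS}])`, "g"),
  // newline between any two brackets
  new RegExp(`(?<=[${OPENING}${CLOSING}])\\n(?=[${OPENING}${CLOSING}])`, "g"),
];

// Applies the patterns one after another; every other newline is kept.
function removeSafeNewlines(source) {
  return safeNewlines.reduce((text, pattern) => text.replace(pattern, ""), source);
}

const input = "let a=function(x){\nreturn x*x;\n}\nlet b=a(2)+\na(3);";
const output = removeSafeNewlines(input);

console.log(JSON.stringify(output));
// "let a=function(x){return x*x;}\nlet b=a(2)+a(3);"
```

The newline after the closing brace of the function expression survives, because a closing bracket before a newline is not in the first set; that is the newline the example above depends on.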

<?php

/**
 * This minifier removes whitespace from a JavaScript file.
 *
 * Definition
 * A "word" in this class is any scalar value, variable, or keyword. Two words may not be glued together; whitespace between them is required.
 * A "block" in this class is a piece of JavaScript code that contains whitespace that may not be removed.
 *
 * The algorithm has four steps. The whitespace is removed in step 3, in a single sweep.
 * Steps 1 and 2 prepare for this step. Step 4 cleans up after.
 *
 * Step 1: remove comments
 * All comments that may be removed, are removed
 * By removing comments, it is much easier to reason about whitespace, in step 2, because the whitespace is no longer polluted by comments.
 *
 * Step 2: handle all blocks
 * Match the blocks (pieces of code that contain required whitespace)
 * Leave the blocks in the Javascript, but replace their whitespace by placeholders
 * All whitespace that is now left is optional and can be removed
 *
 * Step 3: remove all remaining whitespace
 * At this point all whitespace that is left can be safely and quickly removed
 * Special caution is taken for newlines; they are only removed if it is certain that the semantics of the code is not changed.
 *
 * Step 4: replace placeholders by whitespace
 * Replace the placeholders that were created in step 2 by their original whitespace
 *
 * JavaScript regular expression syntax
 * JavaScript has a special syntactic construct for regular expression: /abc(d)ef/i
 * Care has been taken that this construct is not confused with
 * - single line comments, which look like an empty regexp: //
 * - a combination of two divisions: 1 / (x + y) / z
 * When a regexp is followed by a newline, this newline may not be replaced by a simple space.
 *
 * The ++ and -- operators
 * When the space between a + operator and a ++ operator is removed, the code is tokenized differently: a + ++b -> a+++b, which is read as a++ + b
 * So some sort of whitespace must be preserved.
 *
 * Remarks
 * - This code supports both ASCII and UTF-8 JavaScript.
 *   All regular expressions have the unicode modifier (u), which ensures that the JS is not treated as bytes but as encoded code points.
 *   The "dotall" modifier (s) is used everywhere: the dot (.) also matches the newline
 *
 */
class Minifier2
{
    protected static $defaultOptions = [
        'flaggedComments' => true
    ];

    protected $options = [];

    /**
     * Processes a javascript string and outputs only the required characters,
     * stripping out all unneeded characters.
     *
     * @param string $js      The raw javascript to be minified
     * @param array  $options Various runtime options in an associative array
     */
    public static function minify($js, $options = [])
    {
        $minifier = new Minifier2($options);
        return $minifier->minifyUsingCallbacks($js);
    }

    public function __construct($options)
    {
        $this->options = array_merge(static::$defaultOptions, $options);
    }

    public function minifyUsingCallbacks($js)
    {
        $e = MinifierExpressions::get();

        // treat all newlines as unix newlines, to keep working with newlines simple
        $normalized = str_replace(["\r\n", "\r"], ["\n", "\n"], $js);

        // remove comments
        $exp1 = "~(?:" .  implode("|", $e->tokenExpressions) . ")~su";
        $minimallyCommented = preg_replace_callback($exp1, [$this, 'removeComments'], $normalized);

        // rewrite blocks and insert whitespace placeholders
        $exp2 = "~(?:" .  implode("|", $e->allExpressions) . ")~su";
        $placeholderText = preg_replace_callback($exp2, [$this, 'processBlocks'], $minimallyCommented);

        // remove all remaining space (without the newlines)
        $shrunkText = preg_replace("~" . $e->someWhitespace . "~", '', $placeholderText);

        // reduce multiple newlines to single one
        $shrunkText = preg_replace("~[\\n]+~", "\n", $shrunkText);

        // remove newlines that may safely be removed
        foreach ($e->safeNewlines as $safeNewline) {
            $shrunkText = preg_replace("~" . $safeNewline . "~", "", $shrunkText);
        }

        // replace whitespace placeholders by their original whitespace
        $shrunkText = str_replace($e->whitespaceArrayPlaceholders, $e->whitespaceArrayNewline, $shrunkText);

        // remove leading and trailing whitespace
        return trim($shrunkText);
    }

    /**
     * Removes all comments that need to be removed
     * The newlines that are added here may later be removed again
     *
     * @param $matches
     * @return mixed|string
     */
    protected function removeComments($matches)
    {
        $e = MinifierExpressions::get();

        // the fully matching text
        $match = $matches[0];

        if (!empty($matches['lineComment'])) {
            // not empty because this might glue words together
            return "\n";
        }
        if (!empty($matches['starComment'])) {

            // create a version without leading and trailing whitespace
            $trimmed = trim($match, $e->whitespaceCharsNewline);

            switch ($trimmed[2]) {
                case '@':
                    // IE conditional comment
                    return $match;
                case '!':
                    if ($this->options['flaggedComments']) {
                        // option says: leave flagged comments in
                        return $match;
                    }
            }
            // multi line comment; not empty because this might glue words together
            return "\n";
        }

        // leave other matches unchanged
        return $match;
    }

    /**
     * Updates the code for all blocks (they contain whitespace that should be conserved)
     * No early returns: all code must reach `end` and have the whitespace replaced by placeholders
     *
     * @param $matches
     * @return array|mixed|string|string[]
     */
    protected function processBlocks($matches)
    {
        $e = MinifierExpressions::get();

        // the fully matching text
        $match = $matches[0];

        // create a version without leading and trailing whitespace
        $trimmed = trim($match, $e->whitespaceCharsNewline);

        // Should be handled before optional whitespace
        if (!empty($matches['requiredSpace'])) {
            $match = strpos($matches['requiredSpace'], "\n") === false ? " " : "\n";
            goto end;
        }
        // + followed by +, or - followed by -
        if (!empty($matches['plus']) || !empty($matches['min'])) {
            $match = ' ';
            goto end;
        }
        if (!empty($matches['doubleQuote'])) {
            // remove line continuation
            $match = str_replace("\\\n", "", $match);
            goto end;
        }
        if (!empty($matches['starComment'])) {
            switch ($trimmed[2]) {
                case '@':
                    // IE conditional comment
                    goto end;
                case '!':
                    if ($this->options['flaggedComments']) {
                        // ensure newlines before and after
                        $match = "\n" . $trimmed . "\n";
                        goto end;
                    }
            }
            // simple multi line comment; will have been removed in the first step
            goto end;
        }
        if (!empty($matches['regexp'])) {
            // regular expression
            // only if the space after the regexp contains a newline, keep it
            preg_match("~^" . $e->regexp . "(?P<post>" . $e->optionalWhitespaceNewline . ")" . "$~su",
                $match, $newMatches);
            $postfix = strpos($newMatches['post'], "\n") === false ? "" : "\n";
            $match = $trimmed . $postfix;
            goto end;
        }

        end:

        return str_replace($e->whitespaceArrayNewline, $e->whitespaceArrayPlaceholders, $match);
    }
}

class MinifierExpressions
{
    protected static $expressions = null;

    public $tokenExpressions = [];
    public $allExpressions = [];
    public $safeNewlines = [];
    public $whitespaceArrayPlaceholders = [];
    public $whitespaceArrayNewline;
    public $whitespaceCharsNewline;
    public $optionalWhitespaceNewline;
    public $someWhitespace;
    public $regexp;

    public static function get() {
        if (self::$expressions === null) {
            self::$expressions = self::create();
        }
        return self::$expressions;
    }

    protected static function create() {

        $e = new MinifierExpressions();

        // white space (includes space and tab) https://262.ecma-international.org/6.0/#sec-white-space
        $whitespaceArray = [" ", "\t"];
        $e->whitespaceArrayNewline = array_merge($whitespaceArray, ["\n"]);
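        // placeholders are unused code points; see also https://en.wikipedia.org/wiki/Private_Use_Areas
        // note: a PHP string needs \u{...} (PHP 7.0+) for a code point; "\x{E000}" is the literal text \x{E000}
        $e->whitespaceArrayPlaceholders = ["\u{E000}", "\u{E001}", "\u{E002}"];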

        $whitespaceChars = implode("", $whitespaceArray);
        $e->whitespaceCharsNewline = "\n" . $whitespaceChars;

        $whitespace = "[" . $whitespaceChars . "]";
        $whitespaceNewline = "[" . $e->whitespaceCharsNewline . "]";

        // possessive quantifier (++) is needed when used in combination with look ahead
        $e->someWhitespace = $whitespace . "++";
        $someWhitespaceNewline = $whitespaceNewline . "++";
        $optionalWhitespace = $whitespace . "*";
        $e->optionalWhitespaceNewline = $whitespaceNewline . "*";

        // a single escaped character in a regexp, like \\ or \n (may not be newline)
        $regexEscape = "\\\\[^\n]";
        // regexp character class: [..] with escape characters
        // [^\n\\]] is any character other than ] (and not a newline)
        $characterClass = "\\[(?:" . $regexEscape . "|[^\n\\]])*?\\]";
        // regexp: /..(..)../i with character class and escaped characters
        // [^\n/] is any non-slash character (and not a newline)
        // accepted flags: d, g, i, m, s, u, y
        $e->regexp = "/(?:" . $regexEscape . "|" . $characterClass . "|[^\n/])+/[dgimsuy]*";

        // characters that can form a word; these characters should not be joined together by the removal of whitespace
        $word = "[a-zA-Z0-9_\$'\"\x{0080}-\x{FFFF}]";

        // note: ! is not infix and not safe
        $safeOperators = ",\.&|+\-*/%=<>?:";
        $openingBrackets = "({[";
        $closingBrackets = ")}\\]";
        $allBrackets = $openingBrackets . $closingBrackets;

        $e->safeNewlines = [
            // newline preceded by opening bracket or operator
            "(?<=[;" . $openingBrackets . $safeOperators . "])" . "\n",
            // newline followed by closing bracket or operator
            "\n" . "(?=[;" . $closingBrackets. $safeOperators . "])",
            // newline between any two brackets
            "(?<=[" . $allBrackets . "])" . "\n" . "(?=[" . $allBrackets . "])"
        ];

        // these expressions must always be present, because they keep tokens that contain whitespace together
        $e->tokenExpressions = [
            // /** comment */
            'starComment' => "(?<starComment>" . "/\*.*?\*/" . $e->optionalWhitespaceNewline . ")",
            // // comment
            'lineComment' => "(?<lineComment>//[^\n]*" . $e->optionalWhitespaceNewline . ")",
            // regular expression
            'regexp' =>
                "(?<regexp>" .
                // if there's whitespace, match it all (possessive "+")
                $e->regexp . $optionalWhitespace . "+" .
                // to distinguish a regexp from a sequence of divisions (e.g. x / y / z):
                "(?:" .
                // it is followed by a newline; add it to the match
                "\n" .
                "|" .
                // it is followed by any of these characters
                "(?=[;,\.)])" .
                ")" .
                ")",
            // "double quotes"
            'double' => '(?<doubleQuote>"(?:\\\\.|[^"])*")',
            // 'single quotes'
            'single' => "(?<singleQuote>'(?:\\\\.|[^'])*')",
            // `template literal`
            'template' => "(?<templateLiteral>`(?:\\\\.|[^`])*`)",
        ];

        $specificExpressions = [
            // a sequence of - and -- operators; i.e. a - --b; b-- -c; d-- - -e; f - -g
            'min' => "(?<=-)(?<min>" . $someWhitespaceNewline . ")(?=-)",
            // a sequence of + and ++ operators
            'plus' => "(?<=\+)(?<plus>" . $someWhitespaceNewline . ")(?=\+)",
            // whitespace both preceded and succeeded by a word char
            'requiredSpace' => "(?<=" . $word . ")" . "(?<requiredSpace>" . $someWhitespaceNewline . ")" . "(?=" . $word . ")"
        ];

        $e->allExpressions = array_merge($e->tokenExpressions, $specificExpressions);

        return $e;
    }
}

This new code now passes the tests that require newlines to be present. The tests that still fail are:

  • minify/input/condcomm.js: some comments are removed; this is probably alright
  • minify/input/issue132.js: removed an extra space
  • jshrink/input/utf_chars.js: removed extra spaces
  • requests/input/ifreturn.js: outputs LF instead of CR/LF
  • requests/input/whitespace.js: the unusual whitespace characters are not supported (they can be added if needed)

I have prepared some new tests as well.

@garfix
Author

garfix commented Sep 13, 2021

After jQuery, I checked Lodash and Babel. This brought to light a JIT stack overflow problem, which troubled me quite a bit, but which proved to be solvable by using greedy quantifiers.

Any problem in the execution of the regular expressions is now reported via a RuntimeException.

I changed some of the expressions to have the output look a bit more like the old JShrink output. There remains an obvious difference in the use of newlines after each - and + by JShrink that I don't understand.

I would like to have some feedback or further instructions from @tedivm ;)

<?php

/**
 * This minifier removes whitespace from a JavaScript file.
 *
 * Definition
 * A "word" in this class is any scalar value, variable, or keyword. Two words may not be glued together; whitespace between them is required.
 * A "block" in this class is a piece of JavaScript code that contains whitespace that may not be removed.
 *
 * The algorithm has four steps. The whitespace is removed in step 3, in a single sweep.
 * Steps 1 and 2 prepare for this step. Step 4 cleans up after.
 *
 * Step 1: remove comments
 * All comments that may be removed, are removed
 * By removing comments, it is much easier to reason about whitespace, in step 2, because the whitespace is no longer polluted by comments.
 *
 * Step 2: handle all blocks
 * Match the blocks (pieces of code that contain required whitespace)
 * Leave the blocks in the Javascript, but replace their whitespace by placeholders
 * All whitespace that is now left is optional and can be removed
 *
 * Step 3: remove all remaining whitespace
 * At this point all whitespace that is left can be safely and quickly removed
 * Special caution is taken for newlines; they are only removed if it is certain that the semantics of the code is not changed.
 *
 * Step 4: replace placeholders by whitespace
 * Replace the placeholders that were created in step 2 by their original whitespace
 *
 * JavaScript regular expression syntax
 * JavaScript has a special syntactic construct for regular expression: /abc(d)ef/i
 * Care has been taken that this construct is not confused with
 * - single line comments, which look like an empty regexp: //
 * - a combination of two divisions: 1 / (x + y) / z
 * When a regexp is followed by a newline, this newline may not be replaced by a simple space.
 *
 * The ++ and -- operators
 * When the space between a + operator and a ++ operator is removed, the code is tokenized differently: a + ++b -> a+++b, which is read as a++ + b
 * So some sort of whitespace must be preserved.
 *
 * Remarks
 * - This code supports both ASCII and UTF-8 JavaScript.
 *   All regular expressions have the unicode modifier (u), which ensures that the JS is not treated as bytes but as encoded code points.
 *   The "dotall" modifier (s) is used everywhere: the dot (.) also matches the newline
 *   The expressions use the greedy quantifier after or-groups, to avoid the chance of JIT stack overflow.
 *
 */
class Minifier2
{
    protected static $defaultOptions = [
        'flaggedComments' => true
    ];

    protected $options = [];

    /**
     * Processes a javascript string and outputs only the required characters,
     * stripping out all unneeded characters.
     *
     * @param string $js      The raw javascript to be minified
     * @param array  $options Various runtime options in an associative array
     */
    public static function minify($js, $options = [])
    {
        $minifier = new Minifier2($options);
        return $minifier->minifyUsingCallbacks($js);
    }

    public function __construct($options)
    {
        $this->options = array_merge(static::$defaultOptions, $options);
    }

    public function minifyUsingCallbacks($js)
    {
        $e = MinifierExpressions::get();

        // treat all newlines as unix newlines, to keep working with newlines simple
        $shrunkText = str_replace(["\r\n", "\r"], ["\n", "\n"], $js);

        // remove comments
        $exp1 = "~(?:" .  implode("|", $e->tokenExpressions) . ")~su";
        $shrunkText = preg_replace_callback($exp1, [$this, 'removeComments'], $shrunkText);
        $this->checkRegexpError();

        // rewrite blocks and insert whitespace placeholders
        $exp2 = "~(?:" .  implode("|", $e->allExpressions) . ")~su";
        $placeholderText = preg_replace_callback($exp2, [$this, 'processBlocks'], $shrunkText);
        $this->checkRegexpError();

        // remove all remaining space (without the newlines)
        $shrunkText = preg_replace("~" . $e->someWhitespace . "~", '', $placeholderText);
        $this->checkRegexpError();

        // reduce multiple newlines to single one
        $shrunkText = preg_replace("~[\\n]+~", "\n", $shrunkText);
        $this->checkRegexpError();

        // remove newlines that may safely be removed
        foreach ($e->safeNewlines as $safeNewline) {
            $shrunkText = preg_replace("~" . $safeNewline . "~", "", $shrunkText);
            $this->checkRegexpError();
        }

        // replace whitespace placeholders by their original whitespace
        $shrunkText = str_replace($e->whitespaceArrayPlaceholders, $e->whitespaceArrayNewline, $shrunkText);
        $this->checkRegexpError();

        // remove leading and trailing whitespace
        return trim($shrunkText);
    }

    protected function checkRegexpError()
    {
        $error = preg_last_error();
        if ($error === PREG_NO_ERROR) { return; }

        // fallback for error codes that are not listed below
        $msg = "Unknown error";
        switch ($error) {
            case PREG_INTERNAL_ERROR: $msg = "Internal error (not specified)"; break;
            case PREG_BACKTRACK_LIMIT_ERROR: $msg = "Backtrack limit error"; break;
            case PREG_RECURSION_LIMIT_ERROR: $msg = "Recursion limit error"; break;
            case PREG_BAD_UTF8_ERROR: $msg = "Bad UTF-8 error"; break;
            case PREG_BAD_UTF8_OFFSET_ERROR: $msg = "Bad UTF-8 offset error"; break;
            case 6 /* PREG_JIT_STACKLIMIT_ERROR */: $msg = "JIT stack limit error"; break;
        }
        throw new RuntimeException("A regular expression error occurred: " . $msg);
    }

    /**
     * Removes all comments that need to be removed
     * The newlines that are added here may later be removed again
     *
     * @param $matches
     * @return mixed|string
     */
    protected function removeComments($matches)
    {
        $e = MinifierExpressions::get();

        // the fully matching text
        $match = $matches[0];

        if (!empty($matches['lineComment'])) {
            // not empty because this might glue words together
            return "\n";
        }
        if (!empty($matches['starComment'])) {

            // create a version without leading and trailing whitespace
            $trimmed = trim($match, $e->whitespaceCharsNewline);

            switch ($trimmed[2]) {
                case '@':
                    // IE conditional comment
                    return $match;
                case '!':
                    if ($this->options['flaggedComments']) {
                        // option says: leave flagged comments in
                        return $match;
                    }
            }
            // multi line comment; not empty because this might glue words together
            return "\n";
        }

        // leave other matches unchanged
        return $match;
    }

    /**
     * Updates the code for all blocks (they contain whitespace that should be conserved)
     * No early returns: all code must reach `end` and have the whitespace replaced by placeholders
     *
     * @param $matches
     * @return array|mixed|string|string[]
     */
    protected function processBlocks($matches)
    {
        $e = MinifierExpressions::get();

        // the fully matching text
        $match = $matches[0];

        // create a version without leading and trailing whitespace
        $trimmed = trim($match, $e->whitespaceCharsNewline);

        // Should be handled before optional whitespace
        if (!empty($matches['requiredSpace'])) {
            $match = strpos($matches['requiredSpace'], "\n") === false ? " " : "\n";
            goto end;
        }
        // + followed by +, or - followed by -
        if (!empty($matches['plus']) || !empty($matches['min'])) {
            $match = ' ';
            goto end;
        }
        if (!empty($matches['doubleQuote'])) {
            // remove line continuation
            $match = str_replace("\\\n", "", $match);
            goto end;
        }
        if (!empty($matches['starComment'])) {
            switch ($trimmed[2]) {
                case '@':
                    // IE conditional comment
                    goto end;
                case '!':
                    if ($this->options['flaggedComments']) {
                        // ensure newlines before and after
                        $match = "\n" . $trimmed . "\n";
                        goto end;
                    }
            }
            // simple multi line comment; will have been removed in the first step
            goto end;
        }
        if (!empty($matches['regexp'])) {
            // regular expression
            // only if the space after the regexp contains a newline, keep it
            preg_match("~^" . $e->regexp . "(?P<post>" . $e->optionalWhitespaceNewline . ")" . "$~su",
                $match, $newMatches);
            $postfix = strpos($newMatches['post'], "\n") === false ? "" : "\n";
            $match = $trimmed . $postfix;
            goto end;
        }

        end:

        return str_replace($e->whitespaceArrayNewline, $e->whitespaceArrayPlaceholders, $match);
    }
}

class MinifierExpressions
{
    protected static $expressions = null;

    public $tokenExpressions = [];
    public $allExpressions = [];
    public $safeNewlines = [];
    public $whitespaceArrayPlaceholders = [];
    public $whitespaceArrayNewline;
    public $whitespaceCharsNewline;
    public $optionalWhitespaceNewline;
    public $someWhitespace;
    public $regexp;

    public static function get() {
        if (self::$expressions === null) {
            self::$expressions = self::create();
        }
        return self::$expressions;
    }

    protected static function create() {

        $e = new MinifierExpressions();

        // white space (includes space and tab) https://262.ecma-international.org/6.0/#sec-white-space
        $whitespaceArray = [" ", "\t"];
        $e->whitespaceArrayNewline = array_merge($whitespaceArray, ["\n"]);
        // placeholders are unused code points; see also https://en.wikipedia.org/wiki/Private_Use_Areas
        $e->whitespaceArrayPlaceholders = ["\u{E000}", "\u{E001}", "\u{E002}"];

        $whitespaceChars = implode("", $whitespaceArray);
        $e->whitespaceCharsNewline = "\n" . $whitespaceChars;

        $whitespace = "[" . $whitespaceChars . "]";
        $whitespaceNewline = "[" . $e->whitespaceCharsNewline . "]";

        // possessive quantifier (++) is needed when used in combination with look ahead
        $e->someWhitespace = $whitespace . "++";
        $someWhitespaceNewline = $whitespaceNewline . "++";
        $optionalWhitespace = $whitespace . "*+";
        $e->optionalWhitespaceNewline = $whitespaceNewline . "*+";

        // a single escaped character in a regexp, like \\ or \n (the escaped character may not be a newline)
        $regexEscape = "\\\\[^\n]";
        // regexp character class: [..] with escape characters
        // [^\n\\]] is any character other than ] (and not a newline either)
        $characterClass = "\\[(?:" . $regexEscape . "|[^\n\\]])*+\\]";
        // regexp: /..(..)../i with character class and escaped characters
        // [^\n/] is any character other than a slash (and not a newline either)
        $e->regexp = "/(?:" . $regexEscape . "|" . $characterClass . "|[^\n/])++/[igmus]*+";

        // characters that can form a word; these characters should not be joined together by the removal of whitespace
        $word = "[a-zA-Z0-9_\$\x{0080}-\x{FFFF}]";

        // note: ! is not infix and not safe
        $safeOperators = ",\.&|+\-*/%=<>?:";
        $openingBrackets = "({[";
        $closingBrackets = ")}\\]";
        $expressionClosingBracket = ")";
        $blockOpeningBracket = "{";

        // newlines that may be safely removed
        $e->safeNewlines = [
            // newline preceded by opening bracket or operator
            "(?<=[;" . $openingBrackets . $safeOperators . "])" . "\n",
            // newline followed by closing bracket or operator
            "\n" . "(?=[;" . $closingBrackets. $safeOperators . "])",
            // newline between any two brackets
            "(?<=[" . $openingBrackets . "])" . "\n" . "(?=[" . $openingBrackets . "])",
            "(?<=[" . $closingBrackets . "])" . "\n" . "(?=[" . $closingBrackets . "])",
            "(?<=[" . $expressionClosingBracket . "])" . "\n" . "(?=[" . $blockOpeningBracket . "])",
        ];

        // these expressions must always be present, because they keep tokens that contain whitespace together
        $e->tokenExpressions = [
            // /** comment */
            'starComment' => "(?<starComment>" . "/\*.*?\*/" . $e->optionalWhitespaceNewline . ")",
            // // comment
            'lineComment' => "(?<lineComment>//[^\n]*" . $e->optionalWhitespaceNewline . ")",
            // regular expression
            'regexp' =>
                "(?<regexp>" .
                // if there's whitespace, match it all
                $e->regexp . $optionalWhitespace .
                // to distinguish a regexp from a sequence of divisions (e.g. x / y / z):
                "(?:" .
                // it is followed by a newline; add it to the match
                "\n" .
                "|" .
                // it is followed by any of these characters
                "(?=[;,\.)])" .
                ")" .
                ")",
            // "double quotes"
            'double' => '(?<doubleQuote>"(?:\\\\.|[^"])*+")',
            // 'single quotes'
            'single' => "(?<singleQuote>'(?:\\\\.|[^'])*+')",
            // `template literal`
            'template' => "(?<templateLiteral>`(?:\\\\.|[^`])*+`)",
        ];

        $specificExpressions = [
            // a sequence of - and -- operators; e.g. a - --b; b-- -c; d-- - -e; f - -g
            'min' => "(?<=-)(?<min>" . $someWhitespaceNewline . ")(?=-)",
            // a sequence of + and ++ operators
            'plus' => "(?<=\+)(?<plus>" . $someWhitespaceNewline . ")(?=\+)",
            // whitespace both preceded and succeeded by a word char
            'requiredSpace' => "(?<=" . $word . ")" . "(?<requiredSpace>" . $someWhitespaceNewline . ")" . "(?=" . $word . ")"
        ];

        $e->allExpressions = array_merge($e->tokenExpressions, $specificExpressions);

        return $e;
    }
}
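
To show the placeholder idea from `processBlocks` in isolation, here is a minimal sketch. It is not the code above: the function name `minifySketch` and the simplified patterns are assumptions made for this example. It only protects double-quoted strings, and unlike the real code it drops every newline that is not between two word characters, which is not safe for real JavaScript.

```php
<?php
// Hypothetical, simplified illustration; not part of the classes above.
function minifySketch(string $js): string
{
    $whitespace   = [" ", "\t", "\n"];
    // private-use code points stand in for whitespace that must survive
    $placeholders = ["\u{E000}", "\u{E001}", "\u{E002}"];

    // 1. protect the whitespace inside double-quoted strings
    $js = preg_replace_callback(
        '~"(?:\\\\.|[^"\\\\])*+"~su',
        function (array $m) use ($whitespace, $placeholders) {
            return str_replace($whitespace, $placeholders, $m[0]);
        },
        $js
    );

    // 2. whitespace between two word characters is required:
    //    keep a newline if there was one, otherwise a single space
    $js = preg_replace_callback(
        '~(?<=[\w$])[ \t\n]++(?=[\w$])~u',
        function (array $m) use ($placeholders) {
            return strpos($m[0], "\n") === false ? $placeholders[0] : $placeholders[2];
        },
        $js
    );

    // 3. everything that is still real whitespace can go
    $js = str_replace($whitespace, '', $js);

    // 4. turn the placeholders back into whitespace
    return str_replace($placeholders, $whitespace, $js);
}

echo minifySketch("var  a = \"x  y\" ;\nreturn   a ;"), "\n";
// prints: var a="x  y";return a;
```

The point of the placeholders is that steps 2 and 3 can use very blunt replacements, because any whitespace that has to be kept is temporarily not whitespace at all.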

@garfix
Author

garfix commented Sep 19, 2021

Hello Robert,

I worked out the regular-expression-based approach. I tested it on some libraries and it seems to have reached an acceptable level of completeness and robustness. On PHP 7 it's about 10 times faster than the existing code. On PHP 8 (with JIT enabled!) it's about 5 times faster. It also solves a number of the package's open issues.

As mentioned, it does not pass all tests, and this is intentional.

I think it's now up to you to decide if you accept the new code or not. I can imagine that you feel this code is just too different from the old code; or that for some reason you just don't feel comfortable with it. This is fine too. I will just start a new package under my own account and move this code into it.

If you like it but don't trust it enough to swap in the new code completely, we could add a "mode" option with two values: "classic" (the existing code, the default) and "regexp" (the new code). Users could then explicitly select the "regexp" mode if they need more speed or run into errors with the existing code.
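
A rough sketch of how such an option might be wired in, purely as an illustration: the function `minifyWithMode` and its callable parameters are hypothetical and do not exist in JShrink; they only stand in for the two implementations.

```php
<?php
// Hypothetical sketch of the proposed "mode" option; none of these names exist in JShrink.
function minifyWithMode(string $js, array $options, callable $classic, callable $regexp): string
{
    // "classic" stays the default, so existing users see no change in behaviour
    $mode = $options['mode'] ?? 'classic';

    switch ($mode) {
        case 'classic':
            return $classic($js);
        case 'regexp':
            return $regexp($js);
        default:
            throw new InvalidArgumentException("Unknown mode: " . $mode);
    }
}
```

Only users who explicitly pass `['mode' => 'regexp']` would get the new code path.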

If you just think it's good, I suppose the package needs to jump to version 2.

In any case, I would like you to tell me what to do next. It is not clear to me and I need your assistance.

@garfix
Author

garfix commented Oct 9, 2021

As it has been three weeks since I placed my request, I assume you have neither the time nor the inclination to do this integration. And because the proposed code is quite different from the existing code, I think it is probably a good idea to start a new package. I will start making preparations.

@yellow1912

@garfix sounds good, I think. Please make sure to put a link here.

@garfix
Author

garfix commented Oct 10, 2021

https://github.com/garfix/js-minify

Still working on it. Feedback is welcome.

Robert, I would like to add a thank-you-Robert-for-your-work-on-JShrink text to the README, but your license says I have to ask your permission. So if you don't mind, please say so.

Also, I copied some of your tests (changing some of the content and the names). If you don't agree with this, I will rewrite them.

@garfix
Author

garfix commented Oct 11, 2021

I have made the first release. It's on Packagist:

https://packagist.org/packages/garfix/js-minify

@tedivm
Member

tedivm commented Oct 27, 2021

The new package looks awesome! Thanks for your work on it!
