
feature: implement EngineData parsing #43

Merged
merged 7 commits into main on Aug 18, 2022

Conversation

scoiatael
Collaborator

fixes #6

This feature should be more or less finished - during the code review I'll run it on sample designs to verify parsing works consistently.


@pastelmind pastelmind left a comment


First, thank you for working on the PR.

I'll be honest: I have been building an EngineData parser myself. I began working on it around the 27th of July, a couple of days after you volunteered to write the PR.

Please understand that I had no intention of undermining your hard work. I had been planning to attack the problem myself for a long time, and your initial suggestion motivated me to make one last attempt. I apologize for not telling you this sooner. I didn't want to stop you in case my solution failed, and I did not anticipate that you would submit a PR in such a short time.

Correctness

I found two bugs in your parser; please refer to my other comments. The first is trivial to fix and not critical, since it doesn't affect well-formed EngineData documents.

However, the second bug (unicode decoding) is a blocker. Because our use case involves unicode text (CJK characters), we need to correctly decode unicode strings. (My parser appears to handle them correctly, but I'd be grateful for bug reports.)
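For context, EngineData string literals are commonly observed to be UTF-16BE with a leading byte-order mark; a minimal decoding sketch under that assumption (the function name `decodeUtf16BE` is hypothetical, not code from this PR):

```typescript
// Sketch: decode a big-endian UTF-16 byte sequence, as EngineData
// parenthesized strings are commonly observed to be encoded.
function decodeUtf16BE(bytes: Uint8Array): string {
  let start = 0;
  // Skip the big-endian BOM (0xFE 0xFF) if present.
  if (bytes.length >= 2 && bytes[0] === 0xfe && bytes[1] === 0xff) start = 2;
  let out = "";
  for (let i = start; i + 1 < bytes.length; i += 2) {
    // Assemble each big-endian 16-bit code unit. Surrogate pairs compose
    // naturally, because String.fromCharCode works on raw code units.
    out += String.fromCharCode((bytes[i] << 8) | bytes[i + 1]);
  }
  return out;
}
```

Decoding byte-by-byte like this avoids depending on `TextDecoder("utf-16be")`, which is only available in runtimes built with full ICU.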

Performance

I ran some quick benchmarks on my code as well as yours (with slight modifications), and the results currently favor mine by 5-60% on different browsers:

  • Chrome 104.0 (benchmark screenshot)

  • Firefox 103.0.1 (benchmark screenshot)

  • Safari 15.6 (benchmark screenshot)

(My benchmarks were done on a MacBook Pro 13-inch / M1 / 2020.)

Integration

My parser is almost done and only needs to be integrated into the rest of the library. If you wish to work on your parser's performance, I'm ready to wait for further submissions. Otherwise, we could reuse your code (e.g. Layer.ts) to integrate my parser into @webtoon/psd.

it = this.advance();
}
switch (it.type) {
case TokenType.Name:
Collaborator


TokenType.Name should not be parsed as a value here. It causes << /abc /abc >> to be parsed as { abc: "abc" }.
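The behavior being described can be sketched as follows (the token shapes are illustrative, not this PR's actual TokenType code):

```typescript
// Illustrative token stream for "<< /abc /abc >>"; these types are
// hypothetical stand-ins for the PR's lexer output.
type Token =
  | { type: "DictOpen" }
  | { type: "DictClose" }
  | { type: "Name"; value: string };

// If the parser accepts a Name token in value position, the second /abc
// becomes the string value of the first, yielding { abc: "abc" }.
function parseDict(tokens: Token[]): Record<string, unknown> {
  const dict: Record<string, unknown> = {};
  let i = 1; // skip DictOpen
  while (tokens[i].type !== "DictClose") {
    const key = (tokens[i++] as { type: "Name"; value: string }).value;
    const next = tokens[i++];
    // Treating a Name as a value here is the behavior in question.
    dict[key] = next.type === "Name" ? next.value : undefined;
  }
  return dict;
}

const tokens: Token[] = [
  { type: "DictOpen" },
  { type: "Name", value: "abc" },
  { type: "Name", value: "abc" },
  { type: "DictClose" },
];
```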

Collaborator Author


Are you sure about that interpretation?

At least in PDF spec /-prefixed sequences are called "Name Objects" and are

an atomic symbol uniquely defined by a sequence of any characters
(8-bit values) except null (character code 0). Uniquely defined means that any two name objects made up of
the same sequence of characters denote the same object.

A dictionary on the other hand:

is an associative table containing pairs of objects, known as the dictionary’s entries. The first
element of each entry is the key and the second element is the value. The key shall be a name (unlike
dictionary keys in PostScript, which may be objects of any type). The value may be any kind of object, including
another dictionary

And so names are in no way tied to dictionaries; as such, the sequence << /abc /abc >> should mean "a dictionary containing one entry, whose key is the Name "abc" and whose value is the Name "abc"", with both key and value referencing the same object (although IIRC not many implementations care about that second part). There's even an example of a dict with Name values in the PDF spec:

EXAMPLE << /Type /Example
/Subtype /DictionaryExample
/Version 0 . 01
/IntegerItem 12
/StringItem ( a string )
/Subdictionary << /Item1 0 . 4
/Item2 true
/LastItem ( not ! )
/VeryLastItem ( OK )
>>
>>

See PDF spec - section 7.3.5 (Name objects) and 7.3.7 (Dictionary objects).

Collaborator


That is interesting. I did not know that EngineData is based on the PDF format.

Have you observed any PSD file that actually uses /name as anything other than a dictionary key? Personally I have not.

Collaborator Author


I doubt there's any official mention anywhere, but seeing how both are Adobe formats and how Adobe likes to reach for PDF wherever they have to put unstructured data, it'd IMO make sense. That's why I based the lexer and parser on the PDF spec :)

0x3e, // >
]);
expect(() => parseEngineData(data)).toThrowError(
MissingEngineDataProperties
Collaborator


With the parser fixed, this test case should throw UnexpectedEngineDataToken instead of MissingEngineDataProperties.

packages/psd/src/engineData/lexer.ts (outdated thread, resolved)
@scoiatael
Collaborator Author

No hard feelings, but next time you could mention you are willing to give a feature one more try - I could have implemented something else in the same time, and the library would've benefited from that :)


As for this PR - I implemented a fix for the CJK parsing bug and added a test for it. I also migrated the parser to be stack-based, to make optimizations a little easier (flamegraphs can now properly merge similar invocations, because the stack looks largely the same).

That said, I couldn't find any meaningful optimizations (no more than ~5% on Node.js). I suspect your hand-written lexer simply behaves better than what is generated from the async function in my case :)
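The stack-based shape can be sketched roughly like this (illustrative only; simplified to Name keys and Number values, not this PR's actual parser):

```typescript
// Sketch: instead of recursing into each nested dictionary, keep an
// explicit stack of open containers so the JS call stack stays flat and
// flamegraph frames for similar invocations merge.
type Tok =
  | { type: "DictOpen" | "DictClose" }
  | { type: "Name"; value: string }
  | { type: "Number"; value: number };

function parse(tokens: Tok[]): Record<string, unknown> {
  const stack: Record<string, unknown>[] = [];
  let current: Record<string, unknown> | undefined;
  let pendingKey: string | undefined;
  let root: Record<string, unknown> | undefined;
  for (const tok of tokens) {
    switch (tok.type) {
      case "DictOpen": {
        const dict: Record<string, unknown> = {};
        if (current && pendingKey !== undefined) {
          current[pendingKey] = dict; // attach nested dict to its key
          pendingKey = undefined;
          stack.push(current); // remember the parent instead of recursing
        } else if (!current) {
          root = dict;
        }
        current = dict;
        break;
      }
      case "DictClose":
        current = stack.pop() ?? current; // return to the parent container
        break;
      case "Name":
        if (pendingKey === undefined) pendingKey = tok.value;
        break;
      case "Number":
        if (current && pendingKey !== undefined) {
          current[pendingKey] = tok.value;
          pendingKey = undefined;
        }
        break;
    }
  }
  return root ?? {};
}
```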


As for next steps - I'm indifferent as to which implementation ends up on master, as long as one does :) As you are the maintainer, the choice is yours (and your colleagues'). The code in this PR is also yours to use as you see fit - e.g. for integrating your library, if that speeds up development and such is the choice. Just let me know ;)


One final question: would it make sense to rewrite this part in Rust? It looks like parsing EngineData might take a large portion of the total design parsing time.

@pastelmind
Collaborator

pastelmind commented Aug 3, 2022

Thank you. I handled the situation poorly, and will not make the same mistake again.

I doubt it would be appropriate to convert the EngineData parser to Rust/WebAssembly. Decoding layer images is by far the most time- and memory-intensive task (hundreds of milliseconds per layer). Parsing an EngineData document with JS takes < 1ms when optimized, and perf gains from Rust+WebAssembly would be barely noticeable.

@scoiatael
Collaborator Author

Ran some benchmarks; it looks like your parser provides a significant speedup for the overall file-opening process:

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:--|--:|--:|--:|--:|
| `cat samples.txt \| xargs node parse.mjs scoiatael` | 6.868 ± 0.086 | 6.727 | 7.014 | 1.14 ± 0.05 |
| `cat samples.txt \| xargs node parse.mjs pastelmind` | 6.049 ± 0.262 | 5.682 | 6.361 | 1.00 |

(where parse.mjs is here, Node is v18.2.0, and the hardware is an 11th Gen Intel i7-1185G7 (8) @ 4.800GHz)

@pastelmind Maybe the simplest way forward is to just merge this PR and then replace my implementation with yours?

- array instead of Generator
- heavy optimize string() function in lexer
    1) slice copies buffer,
    2) creating decoder is costly,
    3) memoize results for each character instead of lookup
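The first two points in that commit list can be sketched as follows (illustrative, not the PR's actual lexer code):

```typescript
// 1) `slice` copies the underlying bytes; `subarray` returns a view over
//    the same backing buffer, avoiding the allocation entirely.
const buf = new Uint8Array([0x61, 0x62, 0x63, 0x64]); // "abcd"
const copy = buf.slice(1, 3);    // new backing ArrayBuffer
const view = buf.subarray(1, 3); // shares buf's backing ArrayBuffer

// 2) Constructing a TextDecoder per string is costly; hoist a single
//    instance to module scope and reuse it for every decode.
const utf8 = new TextDecoder("utf-8");
const decoded = utf8.decode(view);
```

Reusing one decoder and taking views instead of copies keeps the hot lexer loop free of per-token allocations.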
@scoiatael
Collaborator Author

@pastelmind I had some spare time and managed to find some pretty substantial optimizations; right now the results are:

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:--|--:|--:|--:|--:|
| `cat samples.txt \| xargs node parse.mjs scoiatael` | 4.793 ± 0.076 | 4.677 | 4.899 | 1.00 |
| `cat samples.txt \| xargs node parse.mjs pastelmind` | 5.889 ± 0.052 | 5.815 | 5.977 | 1.23 ± 0.02 |


@pastelmind pastelmind left a comment


I was sick for most of the week and only got around to reviewing your PR just now.

  • Awesome work on optimization! It seems that avoiding unnecessary memory allocations greatly boosted your lexer's performance.
  • Could you clarify "memoize results for each character instead of lookup" in cd40e08? I don't see any memoizing code.
  • Cursor is becoming somewhat burdened. We could split it into two variants: the regular Cursor for parsing standard structures, and a specialized EngineDataCursor exclusively for the lexer. This isn't needed now, though.
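A rough sketch of such a split (everything here beyond the Cursor name is an assumption about shape, not the library's actual API):

```typescript
// Hypothetical general-purpose cursor: only the operations every parser
// needs, so it stays small.
class Cursor {
  constructor(protected bytes: Uint8Array, public position = 0) {}
  read(length: number): Uint8Array {
    const out = this.bytes.subarray(this.position, this.position + length);
    this.position += length;
    return out;
  }
}

// Lexer-only helpers live on a specialized subclass, keeping EngineData
// concerns out of the shared Cursor.
class EngineDataCursor extends Cursor {
  peekByte(): number | undefined {
    return this.bytes[this.position];
  }
}
```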

I will accept the PR for now.

@scoiatael
Collaborator Author

scoiatael commented Aug 12, 2022

memoize results for each character instead of lookup

refers to STRING_TOKEN_JT, which pre-computes a boolean value for each character instead of doing two map lookups at runtime :) It was originally intended to be a specialized jump table (hence the JT suffix), but it turned out that a simple boolean array works best for this case.
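The idea can be sketched like this (the delimiter set below is illustrative, not the actual contents of STRING_TOKEN_JT):

```typescript
// Precompute a boolean per byte value once, at module load. The hot loop
// then does a single array index instead of two map lookups per character.
const IS_STRING_DELIMITER: boolean[] = new Array(256).fill(false);
for (const ch of ["(", ")", "\\"]) {
  IS_STRING_DELIMITER[ch.charCodeAt(0)] = true;
}

function isDelimiter(byte: number): boolean {
  return IS_STRING_DELIMITER[byte] === true;
}
```

Dense boolean arrays like this tend to beat `Map`/`Set` lookups for byte-indexed tables because the JIT compiles the access to a plain bounds-checked load.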

It seems that avoiding unnecessary memory allocations greatly boosted your lexer's performance.

If you are referring to using an Array instead of a Generator, then not really - I did it just to have a clean demarcation line when benchmarking with 0x. This change can easily be reverted if we ever start to run into memory issues :)

PS. Hope you feel better now :)

@scoiatael
Collaborator Author

@pastelmind are we waiting on someone / something? Dunno what the process is, but I can't merge it on my own ;)

@pastelmind pastelmind merged commit b8d4db7 into webtoon:main Aug 18, 2022
Successfully merging this pull request may close these issues.

Parse EngineData