
feature: implement EngineData parsing #43

Merged
merged 7 commits into main on Aug 18, 2022

Conversation

scoiatael
Collaborator

fixes #6

This feature should be more or less finished - during the code review I'll run it on sample designs to verify parsing works consistently.


@pastelmind pastelmind left a comment


First, thank you for working on the PR.

I'll be honest: I have been building an EngineData parser myself. I began working on it around the 27th of July, a couple of days after you volunteered to write the PR.

Please understand that I had no intention of undermining your hard work. I had been planning to attack the problem myself for a long time, and your initial suggestion motivated me to make one last attempt. I apologize for not telling you this sooner. I didn't want to stop you in case my solution failed, and I did not anticipate that you would submit a PR in such a short time.

Correctness

I found two bugs in your parser; please refer to my other comments. The first is trivial to fix and not critical, since it doesn't affect well-formed EngineData documents.

However, the second bug (unicode decoding) is a blocker. Because our use case involves unicode text (CJK characters), we need to correctly decode unicode strings. (My parser appears to handle them correctly, but I'd be grateful for bug reports.)
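For context, EngineData string literals are commonly observed to be UTF-16BE with a leading byte-order mark; a minimal decoding sketch under that assumption (the function name `decodeUtf16BE` is hypothetical, not code from this PR):

```typescript
// Sketch: decode a big-endian UTF-16 byte sequence, as EngineData
// parenthesized strings are commonly observed to be encoded.
function decodeUtf16BE(bytes: Uint8Array): string {
  let start = 0;
  // Skip the big-endian BOM (0xFE 0xFF) if present.
  if (bytes.length >= 2 && bytes[0] === 0xfe && bytes[1] === 0xff) start = 2;
  let out = "";
  for (let i = start; i + 1 < bytes.length; i += 2) {
    // Assemble each big-endian 16-bit code unit. Surrogate pairs compose
    // naturally, because String.fromCharCode works on raw code units.
    out += String.fromCharCode((bytes[i] << 8) | bytes[i + 1]);
  }
  return out;
}
```

Decoding byte-by-byte like this avoids depending on `TextDecoder("utf-16be")`, which is only available in runtimes built with full ICU.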

Performance

I ran some quick benchmarks on my code as well as yours (with slight modifications), and the results currently favor mine by 5-60% on different browsers:

  • Chrome 104.0 (benchmark screenshot)

  • Firefox 103.0.1 (benchmark screenshot)

  • Safari 15.6 (benchmark screenshot)

(My benchmarks were done on a MacBook Pro 13-inch / M1 / 2020.)

Integration

My parser is almost done and only needs to be integrated into the rest of the library. If you wish to work on your parser's performance, I'm ready to wait for further submissions. Otherwise, we could reuse your code (e.g. Layer.ts) to integrate my parser into @webtoon/psd.

it = this.advance();
}
switch (it.type) {
case TokenType.Name:
Collaborator


TokenType.Name should not be parsed as a value here. It causes << /abc /abc >> to be parsed as { abc: "abc" }.
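The behavior being described can be sketched as follows (the token shapes are illustrative, not this PR's actual TokenType code):

```typescript
// Illustrative token stream for "<< /abc /abc >>"; these types are
// hypothetical stand-ins for the PR's lexer output.
type Token =
  | { type: "DictOpen" }
  | { type: "DictClose" }
  | { type: "Name"; value: string };

// If the parser accepts a Name token in value position, the second /abc
// becomes the string value of the first, yielding { abc: "abc" }.
function parseDict(tokens: Token[]): Record<string, unknown> {
  const dict: Record<string, unknown> = {};
  let i = 1; // skip DictOpen
  while (tokens[i].type !== "DictClose") {
    const key = (tokens[i++] as { type: "Name"; value: string }).value;
    const next = tokens[i++];
    // Treating a Name as a value here is the behavior in question.
    dict[key] = next.type === "Name" ? next.value : undefined;
  }
  return dict;
}

const tokens: Token[] = [
  { type: "DictOpen" },
  { type: "Name", value: "abc" },
  { type: "Name", value: "abc" },
  { type: "DictClose" },
];
```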

Collaborator Author


Are you sure about that interpretation?

At least in PDF spec /-prefixed sequences are called "Name Objects" and are

an atomic symbol uniquely defined by a sequence of any characters
(8-bit values) except null (character code 0). Uniquely defined means that any two name objects made up of
the same sequence of characters denote the same object.

A dictionary on the other hand:

is an associative table containing pairs of objects, known as the dictionary’s entries. The first
element of each entry is the key and the second element is the value. The key shall be a name (unlike
dictionary keys in PostScript, which may be objects of any type). The value may be any kind of object, including
another dictionary

And so names are in no way tied to dictionaries; as such, the sequence << /abc /abc >> should mean "a dictionary containing one entry, whose key is the Name "abc" and whose value is the Name "abc"", with both key and value referencing the same object (although IIRC not many implementations care about that second part). There's even an example of a dict with Name values in the PDF spec:

EXAMPLE << /Type /Example
/Subtype /DictionaryExample
/Version 0 . 01
/IntegerItem 12
/StringItem ( a string )
/Subdictionary << /Item1 0 . 4
/Item2 true
/LastItem ( not ! )
/VeryLastItem ( OK )
>>
>>

See PDF spec - section 7.3.5 (Name objects) and 7.3.7 (Dictionary objects).

Collaborator


That is interesting. I did not know that EngineData is based on the PDF format.

Have you observed any PSD file that actually uses /name as anything other than a dictionary key? Personally I have not.

Collaborator Author


I doubt there's any official mention anywhere, but seeing how both are Adobe formats and how Adobe likes to reach for PDF wherever they have to put unstructured data, it'd IMO make sense. That's why I based the lexer and parser on the PDF spec :)

0x3e, // >
]);
expect(() => parseEngineData(data)).toThrowError(
MissingEngineDataProperties
Collaborator


With the parser fixed, this test case should throw UnexpectedEngineDataToken instead of MissingEngineDataProperties.

packages/psd/src/engineData/lexer.ts (outdated thread, resolved)
@scoiatael
Collaborator Author

No hard feelings, but next time you could mention you are willing to give a feature one more try - I could have implemented something else in the same time, and the library would've benefited from that :)


As for this PR - I implemented a fix for the CJK parsing bug and added a test for it. I also migrated the parser to be stack-based, to make optimizations a little easier (flamegraphs can now properly merge similar invocations, because the stack looks largely the same).

That said, I couldn't find any meaningful optimizations (no more than ~5% on Node.js). I suspect your hand-written lexer simply behaves better than what is generated from the async function in my case :)
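The stack-based shape can be sketched roughly like this (illustrative only; simplified to Name keys and Number values, not this PR's actual parser):

```typescript
// Sketch: instead of recursing into each nested dictionary, keep an
// explicit stack of open containers so the JS call stack stays flat and
// flamegraph frames for similar invocations merge.
type Tok =
  | { type: "DictOpen" | "DictClose" }
  | { type: "Name"; value: string }
  | { type: "Number"; value: number };

function parse(tokens: Tok[]): Record<string, unknown> {
  const stack: Record<string, unknown>[] = [];
  let current: Record<string, unknown> | undefined;
  let pendingKey: string | undefined;
  let root: Record<string, unknown> | undefined;
  for (const tok of tokens) {
    switch (tok.type) {
      case "DictOpen": {
        const dict: Record<string, unknown> = {};
        if (current && pendingKey !== undefined) {
          current[pendingKey] = dict; // attach nested dict to its key
          pendingKey = undefined;
          stack.push(current); // remember the parent instead of recursing
        } else if (!current) {
          root = dict;
        }
        current = dict;
        break;
      }
      case "DictClose":
        current = stack.pop() ?? current; // return to the parent container
        break;
      case "Name":
        if (pendingKey === undefined) pendingKey = tok.value;
        break;
      case "Number":
        if (current && pendingKey !== undefined) {
          current[pendingKey] = tok.value;
          pendingKey = undefined;
        }
        break;
    }
  }
  return root ?? {};
}
```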


As for next steps - I'm indifferent as to which implementation ends up on master, as long as one does :) As you are the maintainer, the choice is yours (and your colleagues'). The code in this PR is also yours to use as you see fit - e.g. for integrating your library, if that speeds up development and such is the choice. Just let me know ;)


One final question: would it make sense to rewrite this part in Rust? It looks like parsing EngineData might take a large portion of the total design parsing time.

@pastelmind
Collaborator

pastelmind commented Aug 3, 2022

Thank you. I handled the situation poorly, and will not make the same mistake again.

I doubt it would be appropriate to convert the EngineData parser to Rust/WebAssembly. Decoding layer images is by far the most time- and memory-intensive task (hundreds of milliseconds per layer). Parsing an EngineData document with JS takes < 1ms when optimized, and perf gains from Rust+WebAssembly would be barely noticeable.

@scoiatael
Collaborator Author

Ran some benchmarks; it looks like your parser provides a significant speedup for the overall file-opening process:

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:--|--:|--:|--:|--:|
| `cat samples.txt \| xargs node parse.mjs scoiatael` | 6.868 ± 0.086 | 6.727 | 7.014 | 1.14 ± 0.05 |
| `cat samples.txt \| xargs node parse.mjs pastelmind` | 6.049 ± 0.262 | 5.682 | 6.361 | 1.00 |

(where parse.mjs is here, Node is v18.2.0, and the hardware is an 11th Gen Intel i7-1185G7 (8) @ 4.800GHz)

@pastelmind Maybe the simplest way forward is to just merge this PR and then replace my implementation with yours?

- array instead of Generator
- heavy optimize string() function in lexer
    1) slice copies buffer,
    2) creating decoder is costly,
    3) memoize results for each character instead of lookup
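The first two points in that commit list can be sketched as follows (illustrative, not the PR's actual lexer code):

```typescript
// 1) `slice` copies the underlying bytes; `subarray` returns a view over
//    the same backing buffer, avoiding the allocation entirely.
const buf = new Uint8Array([0x61, 0x62, 0x63, 0x64]); // "abcd"
const copy = buf.slice(1, 3);    // new backing ArrayBuffer
const view = buf.subarray(1, 3); // shares buf's backing ArrayBuffer

// 2) Constructing a TextDecoder per string is costly; hoist a single
//    instance to module scope and reuse it for every decode.
const utf8 = new TextDecoder("utf-8");
const decoded = utf8.decode(view);
```

Reusing one decoder and taking views instead of copies keeps the hot lexer loop free of per-token allocations.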
@scoiatael
Collaborator Author

@pastelmind I had some spare time and managed to find some pretty substantial optimizations; right now the results are:

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:--|--:|--:|--:|--:|
| `cat samples.txt \| xargs node parse.mjs scoiatael` | 4.793 ± 0.076 | 4.677 | 4.899 | 1.00 |
| `cat samples.txt \| xargs node parse.mjs pastelmind` | 5.889 ± 0.052 | 5.815 | 5.977 | 1.23 ± 0.02 |


@pastelmind pastelmind left a comment


I was sick for most of the week and only got around to reviewing your PR just now.

  • Awesome work on optimization! It seems that avoiding unnecessary memory allocations greatly boosted your lexer's performance.
  • Could you clarify "memoize results for each character instead of lookup" in cd40e08? I don't see any memoizing code.
  • Cursor is becoming somewhat burdened. We could split it into two variants: the regular Cursor for parsing standard structures, and a specialized EngineDataCursor exclusively for the lexer. This isn't needed now, though.
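A rough sketch of such a split (everything here beyond the Cursor name is an assumption about shape, not the library's actual API):

```typescript
// Hypothetical general-purpose cursor: only the operations every parser
// needs, so it stays small.
class Cursor {
  constructor(protected bytes: Uint8Array, public position = 0) {}
  read(length: number): Uint8Array {
    const out = this.bytes.subarray(this.position, this.position + length);
    this.position += length;
    return out;
  }
}

// Lexer-only helpers live on a specialized subclass, keeping EngineData
// concerns out of the shared Cursor.
class EngineDataCursor extends Cursor {
  peekByte(): number | undefined {
    return this.bytes[this.position];
  }
}
```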

I will accept the PR for now.

@scoiatael
Collaborator Author

scoiatael commented Aug 12, 2022

memoize results for each character instead of lookup

refers to STRING_TOKEN_JT, which pre-computes a boolean value for each character instead of doing two map lookups at runtime :) It was originally intended to be a specialized jump table (hence the JT suffix), but it turned out that a simple boolean array works best for this case.
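The idea can be sketched like this (the delimiter set below is illustrative, not the actual contents of STRING_TOKEN_JT):

```typescript
// Precompute a boolean per byte value once, at module load. The hot loop
// then does a single array index instead of two map lookups per character.
const IS_STRING_DELIMITER: boolean[] = new Array(256).fill(false);
for (const ch of ["(", ")", "\\"]) {
  IS_STRING_DELIMITER[ch.charCodeAt(0)] = true;
}

function isDelimiter(byte: number): boolean {
  return IS_STRING_DELIMITER[byte] === true;
}
```

Dense boolean arrays like this tend to beat `Map`/`Set` lookups for byte-indexed tables because the JIT compiles the access to a plain bounds-checked load.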

It seems that avoiding unnecessary memory allocations greatly boosted your lexer's performance.

If you are referring to using an Array instead of a Generator, then not really - I did it just to have a clean demarcation line when benchmarking with 0x. This change can easily be reverted if we ever start to run into memory issues :)

PS. Hope you feel better now :)

@scoiatael
Collaborator Author

@pastelmind are we waiting on someone / something? Dunno what the process is, but I can't merge it on my own ;)

@pastelmind pastelmind merged commit b8d4db7 into webtoon:main Aug 18, 2022
Successfully merging this pull request may close these issues.

Parse EngineData