Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adjust line wrapping algorithm to be closer to the GNU gettext tooling #51

Open
Kingdutch opened this issue Oct 23, 2019 · 4 comments
Open

Comments

@Kingdutch
Copy link

Kingdutch commented Oct 23, 2019

In our project we use both tools for different parts of our translation pipeline. Unfortunately the inconsistency creates very noisy commits. Bringing them closer together will make reviewing changes made with tools relying on gettext-parser a lot easier to review.

Examples from a diff
Below are examples of a diff. On the left is a .po file created by msginit and manipulated by GNU gettext tools. On the right side is the output of a call to gettext-parser's po.compile function.

afbeelding

The string <b><em class=\"placeholder\">@count</em> Members</b> are selected is 65 characters long, so it's not wrapped. However the entire line msgid_plural "<b><em class=\"placeholder\">@count</em> Members</b> are selected" is 80 characters long which seems to cause the GNU tools to wrap the line.

The same happens for
afbeelding


afbeelding

<em>Books</em> have a built-in hierarchical navigation. Use for handbooks or is 77 characters long. The GNU tools seem to allow this with the space being on the first line as the 77th character. gettext-parser will wrap one space earlier.

This seems to occur more often than expected. For example in the following lines as well.
afbeelding


afbeelding
It seems the GNU tools have a bit more knowledge of HTML while the gettext-parser treats href=\"\">[social_mentions:mentioned_user]</a> as an unbreakable string, the GNU break makes more sense because <a href=\"\"> together is more important.

A similar case for this can be found below, where the space is found to be a better breaking point than within a tag.

afbeelding
afbeelding


I also saw the following which actually suggests that the GNU tools allow 77 characters on a line so it may be a better default than 76. I couldn't find the GNU tools' line-break algorithm so I'm not sure whether they special case spaces and dots or just count to 77.
afbeelding

Open Social Branding will be replaced by site name (and slogan if available). is exactly 77 characters.

@smhg
Copy link
Owner

smhg commented Oct 23, 2019

Thank you for your extensive comparison! Always a pleasure to see issues reported this way!

I'm all for matching gettext tools in every possible way.

That said, based on your HTML example, this might go deep.

Maybe we can wrap the C library? I don't know what the recommended approach is these days, so any input would be appreciated.
Something like emscripten maybe? Browser support is important.

@Kingdutch
Copy link
Author

Kingdutch commented Oct 24, 2019

Cool, yeah I tried searching around to see if I can figure out how this logic is implemented in the gettext tools but was unsuccesful. Perhaps someone else has experience there and can chime in.

I don't think wrapping the C library is an option if you want browser support (and I think it takes away some of the appeal of this library ^^).

One approach to this would be to brute force it. The examples in the start of this post could be added as test cases. Some testing tool could then run the same texts through msgfmt (I believe that simply formats a .po file right? Otherwise replace msgfmt in this text with another gettext tool ^^) and parse/compile with this library and diff the two.

Alternatively a framework like jest could be used together with Snapshot testing. The snapshot could be the output from the msgfmt tool. Then Jest can tell us the difference and we can go from there. This means we don't have to install the gettext tooling in the test environment but can use pure Javascript. As a bonus it would help spot regressions from the formatting made by future changes once the algorithm has improved 😋

One thing that could expose is that maybe changing from 76 to 77 as the default wrapping length already makes a good difference.

EDIT: I see this library already uses chai and mocha ('m a Jest fanboy sorry) which has this package for snapshotting: https://www.npmjs.com/package/mocha-chai-snapshot

@smhg
Copy link
Owner

smhg commented Oct 24, 2019

If you see low-hanging fruit (any fruit, really ;)), please send in a PR.
There are some line-folding tests which you can add yours to.

In general:
Testing is a secondary issue I think. The hard part is getting a 1-on-1 match with gettext without reinventing the wheel (e.g. understanding HTML,...). That would need a solution first, if one exists.

@vHeemstra
Copy link
Contributor

I'm not sure, but browsing through the GNU gettext git, it seems the wrap() method handles line breaking, using the ulc_width_linebreaks() helper method to find the best breaking position.

In the wrap() method, page_width seems to be the maximum number of characters/columns per line and it is used as a starting point. Indentation, opening/closing characters, etc, are subtracted and added to keep track of the current possible room/width on that line. (see here)

wrap() is used in many print methods, including message_print().

Maybe this helps shed a little light on what's happening. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants