InDesign finalizations #87
openformats/formats/indesign.py
Outdated
@@ -104,6 +104,9 @@ def _can_skip_content(self, string):
             return True
         except ValueError:
             pass
+        # Special content in BackingStory.xml
+        if u'\ufeff' == string:
+            return True
@rigaspapas why not just check for content after stripping all non-printable characters?
ref: https://mayart.de/download/Indesign-IDML/special-idml-chars.pdf
One idea would be to use the unicodedata module to ignore all characters that are not in printable categories:
https://www.unicode.org/reports/tr44/#General_Category_Values
The reason I'm saying this is: who's to say we won't find another file that has, e.g., a u'\u200c' character?
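As a rough illustration of the category-based idea, here is a minimal sketch. The helper name `has_translatable_chars` and the constant `ACCEPTABLE` are hypothetical, and the choice of major classes (L, P, S) is an assumption borrowed from the one-liners later in this thread, not necessarily what the PR ends up using.

```python
import unicodedata

# Assumption: letters (L), punctuation (P) and symbols (S) count as
# translatable content; everything else (format chars, spaces, digits) does not.
ACCEPTABLE = ("L", "P", "S")


def has_translatable_chars(string):
    """Return True if any character belongs to an acceptable major class.

    Hypothetical helper, not part of the openformats code base.
    """
    for letter in string:
        # unicodedata.category returns e.g. 'Lu', 'Cf', 'Zs';
        # the first letter is the major class.
        if unicodedata.category(letter)[0] in ACCEPTABLE:
            return True
    return False
```

Both u'\ufeff' and u'\u200c' are in the format category (Cf), so a check like this treats them the same way without listing each character explicitly.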
@kouk what we want to achieve is to avoid exporting text that is not translatable. As far as I know, Unicode has no standard way to distinguish letters from symbols. Is there one?
You are right that any non-printable character can cause the same unwanted result. I chose to ignore one specific character because I guess that's what InDesign inserts into every document automatically (it is not user input).
9cf6aca to 9770ffb
There's a bug here that needs fixing (see my comment below about the missing test case). Ignore the other comment (unless you like one-liners).
u'<?ACE 8?> <Br/>;',
u'\ufeff',
u' \ufeff ',
u' \ufeff 5',
here's another test case:
u'\ufeff<Br/>;'
;-)
You are right. We should first strip the special characters and then check for the translatable ones. Thanks!
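To make the ordering concrete, here is a minimal sketch of the stripping step alone. The helper name `strip_format_chars` is hypothetical, and restricting the filter to the Cf (format) category is an assumption; the actual change may strip a different set of characters.

```python
import unicodedata


def strip_format_chars(string):
    """Remove characters in the Unicode 'Cf' (format) category.

    Hypothetical helper: u'\ufeff' and u'\u200c' are both 'Cf', so they are
    dropped, while visible text, markup and whitespace are kept.
    """
    return u"".join(ch for ch in string if unicodedata.category(ch) != "Cf")


# An exact comparison such as `u'\ufeff' == string` misses this input,
# but after stripping only the markup remains for the later checks.
remaining = strip_format_chars(u'\ufeff<Br/>;')
```

Here `remaining` is u'<Br/>;', so whichever check already handles that kind of content can run on it unchanged.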
for letter in string:
    char_type = unicodedata.category(letter)
    if char_type[0] in acceptable:
        return True
return False
just for fun:
from six.moves import map
return any(c[0] in ["L", "P", "S"] for c in map(unicodedata.category, string))
or even
from operator import itemgetter
from six.moves import map
return any(map(["L", "P", "S"].__contains__,
               map(itemgetter(0), map(unicodedata.category, string))))
@kouk I really like the first approach, which is very functional, but it would require a new dependency. So, should we leave it as is?
Sure, it's fine as it is. But these examples don't really require the new dependency; you could do from itertools import imap instead. It's just that with six.moves it's compatible with both Python 2 and Python 3.
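For reference, a lazy map can also be obtained on both Python versions without six by falling back on the builtin. This is only a sketch of that alternative; the names `lazy_map` and `contains_lps` are hypothetical and do not appear in the PR.

```python
import unicodedata

try:
    # Python 2: itertools.imap is the lazy counterpart of map.
    from itertools import imap as lazy_map
except ImportError:
    # Python 3: itertools.imap no longer exists; the builtin map is lazy.
    lazy_map = map


def contains_lps(string):
    """Return True if any character is a letter, punctuation mark or symbol."""
    return any(
        category[0] in ("L", "P", "S")
        for category in lazy_map(unicodedata.category, string)
    )
```

Because `any` stops at the first match, the lazy map avoids computing categories for the rest of the string.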
Use python's unicodedata library to identify printable characters and ignore strings that don't contain any.
9770ffb to 6f70b68
LGTM