Encodings and \ide #86

Open
jaakristioja opened this Issue Mar 18, 2019 · 0 comments

jaakristioja commented Mar 18, 2019

The documentation for \ide states:

Use: An optional character encoding specification. This marker should be used to specify the character encoding of the text within the file. For example: CP-1252, CP-1251, UTF-8, UTF-16, OR Custom . If the character encoding does not conform to a known standard, but is rather a customized solution for the project, a minimum of the name of the font used for the project should be included. For archive purposes, texts which rely upon a custom encoding solution should be converted to Unicode, if at all possible.

Does this apply to the whole USFM file (including the markers) or just character data contained by the markers? Does an \ide marker only affect everything up until the next \ide marker?

I couldn't find any mention in the USFM specification of a default or initial encoding to use when decoding USFM files. This leads to a chicken-and-egg problem when reading USFM files: one cannot be certain of the encoding before reading an \ide marker, yet one generally needs to know the encoding in order to read the file up to that \ide marker.
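
To make the problem concrete, here is a small Python illustration (not taken from the USFM specification). It encodes the same marker line three ways and checks whether a naive byte search for the ASCII bytes of `\ide ` would find it. The variable names are made up for this example.

```python
# Illustration only: the same marker line encoded three different ways.
line = "\\ide UTF-16\n"

utf8 = line.encode("utf-8")
utf16 = line.encode("utf-16-le")   # no BOM, two bytes per character here
utf32 = line.encode("utf-32-le")   # no BOM, four bytes per character

# A reader that assumes an ASCII-compatible encoding searches for these bytes.
needle = b"\\ide "

found_in_utf8 = needle in utf8     # the marker bytes are contiguous
found_in_utf16 = needle in utf16   # NUL bytes sit between the marker bytes
found_in_utf32 = needle in utf32   # three NUL bytes follow each marker byte

print(found_in_utf8, found_in_utf16, found_in_utf32)
```

So a reader that starts out assuming ASCII or UTF-8 cannot even locate an \ide marker that is itself stored as UTF-16 or UTF-32.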

For example, is the following valid USFM?

$ ((echo '\id MAT test'; echo '\ide UTF-8')|iconv -f ASCII -t UTF-16; (echo '\usfm 3.0'; echo '\ide UTF-16')|iconv -f ASCII -t UTF-8; echo '\rem Hello, World!' | iconv -f ASCII -t UTF-32) | hexdump -C
00000000  ff fe 5c 00 69 00 64 00  20 00 4d 00 41 00 54 00  |..\.i.d. .M.A.T.|
00000010  20 00 74 00 65 00 73 00  74 00 0a 00 5c 00 69 00  | .t.e.s.t...\.i.|
00000020  64 00 65 00 20 00 55 00  54 00 46 00 2d 00 38 00  |d.e. .U.T.F.-.8.|
00000030  0a 00 5c 75 73 66 6d 20  33 2e 30 0a 5c 69 64 65  |..\usfm 3.0.\ide|
00000040  20 55 54 46 2d 31 36 0a  ff fe 00 00 5c 00 00 00  | UTF-16.....\...|
00000050  72 00 00 00 65 00 00 00  6d 00 00 00 20 00 00 00  |r...e...m... ...|
00000060  48 00 00 00 65 00 00 00  6c 00 00 00 6c 00 00 00  |H...e...l...l...|
00000070  6f 00 00 00 2c 00 00 00  20 00 00 00 57 00 00 00  |o...,... ...W...|
00000080  6f 00 00 00 72 00 00 00  6c 00 00 00 64 00 00 00  |o...r...l...d...|
00000090  21 00 00 00 0a 00 00 00                           |!.......|

Should implementations try to guess the encoding when reading the file and only switch after encountering an \ide marker, or should a default or initial encoding such as ASCII, ISO-8859-something or UTF-8 be assumed?
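
For what it's worth, one way an implementation could guess is sketched below in Python. This is only a heuristic under my own assumptions, not behaviour defined by the USFM specification: it checks for a byte order mark first, then looks for an \ide marker in ASCII-compatible bytes, and otherwise falls back to a default. The function name `guess_usfm_encoding` and the UTF-8 default are hypothetical choices for this sketch.

```python
import codecs
import re

# Order matters: the UTF-32 LE BOM (ff fe 00 00) begins with the UTF-16 LE BOM (ff fe).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Matches "\ide <name>" only when the file is stored in an ASCII-compatible encoding.
_IDE = re.compile(rb"\\ide[ \t]+([^\r\n\\]+)")


def guess_usfm_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return a Python codec name guessed for raw USFM bytes (heuristic only)."""
    # 1. A byte order mark at the start of the file is the strongest hint.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name

    # 2. Otherwise look for an \ide marker readable as ASCII.
    match = _IDE.search(data)
    if match:
        declared = match.group(1).decode("ascii", errors="replace").strip()
        # Names such as "CP-1252" are not known to Python as written,
        # so also try the name with hyphens removed ("CP1252").
        for candidate in (declared, declared.replace("-", "")):
            try:
                return codecs.lookup(candidate).name
            except (LookupError, ValueError):
                continue

    # 3. Nothing usable found (or a "Custom" encoding): assumed default.
    return default


print(guess_usfm_encoding(b"\\id MAT test\n\\ide CP-1252\n"))
print(guess_usfm_encoding("\\id MAT test\n".encode("utf-16")))
```

Note that this treats the whole file as having a single encoding, so it would not handle a file that switches encodings after each \ide marker as in the hexdump above.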
