Some Unicode blocks not supported #25
Okay, this is interesting. As it stands, it seems to work with everything I would consider more or less exotic word characters. What it cannot handle are the non-word blocks. I don't know whether there would be any obvious problems with simply including all character ranges and letting it handle them? |
Okay, this was fun. We should now have full Unicode support. As it is now, all Unicode code points that don't fit into the Python
This may be a problem when someone wants to make an edition with text in one of these blocks, such as Devanagari, Runic, Old Persian or Linear B, to name just a few. Maybe I should consider adding a configuration option to indicate the language(s). That way it would be easy to switch when necessary without overmatching 90% of the time. |
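A minimal sketch of what such a language configuration could look like, assuming the matcher is built on Python's `re` module. The names `SCRIPT_RANGES` and `word_pattern` are hypothetical and not part of the project; only the block ranges themselves come from the Unicode standard.

```python
import re

# Hypothetical configuration table: language name -> inclusive code point
# range of the corresponding Unicode block. Names and layout are assumptions.
SCRIPT_RANGES = {
    "devanagari": (0x0900, 0x097F),
    "runic": (0x16A0, 0x16FF),
    "old_persian": (0x103A0, 0x103DF),
    "linear_b": (0x10000, 0x1007F),  # Linear B Syllabary block
}


def word_pattern(languages):
    """Compile a regex matching runs of \\w plus every configured block."""
    extra = "".join(
        "{}-{}".format(chr(low), chr(high))
        for low, high in (SCRIPT_RANGES[name] for name in languages)
    )
    return re.compile("[\\w{}]+".format(extra))
```

With `word_pattern(["devanagari"])`, a Devanagari word that contains vowel signs or a virama is matched as one unit, while an empty language list leaves plain `\w` behaviour untouched.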
No more breaking off, and the matching is correct here too 👍 That leaves the problem of how to find out whether a character should be regarded as a word character or a word boundary without checking all 109,242 Unicode characters manually:
\documentclass[a5paper]{scrartcl}
\usepackage{fontspec}
\setmainfont{Junicode}
\usepackage[series={A},noledgroup,draft]{reledmac}
\begin{document}
\beginnumbering
\pstart
o “o ⸀o o. o
\edtext{o}{%
\Afootnote{test}}
\pend
\endnumbering
\end{document}
->
\documentclass[a5paper]{scrartcl}
\usepackage{fontspec}
\setmainfont{Junicode}
\usepackage[series={A},noledgroup,draft]{reledmac}
\begin{document}
\beginnumbering
\pstart
\sameword{o} “o ⸀o \sameword{o}. \sameword{o}
\edtext{\sameword[1]{o}}{%
\Afootnote{test}}
\pend
\endnumbering
\end{document}
The information should be available in the Unicode database, but I haven't yet seen a quick way to extract it. (That "Basic Latin" appears in the list above is probably just a mix-up and not something you use in the actual code, is it?) |
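One possible way to pull this from the Unicode database without a manual list is Python's standard `unicodedata` module, which exposes the general category of each code point. The sketch below assumes that letters, marks and numbers count as word characters and everything else as a boundary; that split, and the helper names `is_word_char` and `split_words`, are assumptions, not what the project does.

```python
import unicodedata


def is_word_char(ch):
    """Assumed rule: categories L* (letters), M* (marks), N* (numbers)."""
    return unicodedata.category(ch)[0] in "LMN"


def split_words(text):
    """Group consecutive word characters; anything else acts as a boundary."""
    words, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words
```

Applied to the test line from the example above, this treats the quotation mark, the editorial sign and the full stop as boundaries, so each "o" comes out as its own word.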
In the example you give: as you see it, should the composed character be interpreted as identical to "o"? That doesn't seem to be the case to me. I'm almost afraid to ask, but can we think of cases? Of course, another problem is the curious case of two different code points with the same graphical representation (there are some examples, and I'm sure you can remind me of them). Those will not match although they look identical to the reader. I imagine a list of such cases could be compiled and taken into consideration. About "Basic Latin": that confused me too. I don't really know what it's doing there, so it cannot be ruled out that the way I checked for matches in the different blocks was wrong. But anyway. EDIT: By the way, I think I'll close this for now. I have made a new issue for enabling configuration of language. |
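For the look-alike code points, a compiled list could be applied as a simple translation table before comparing words. The two entries below are only illustrative examples of such pairs, and `CONFUSABLES` and `fold_confusables` are hypothetical names; a real list would have to be collected separately.

```python
# Hypothetical look-alike table: each key is folded to the value before
# matching. Only two illustrative entries are given here.
CONFUSABLES = {
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}

_TABLE = str.maketrans(CONFUSABLES)


def fold_confusables(text):
    """Replace every listed look-alike with its chosen representative."""
    return text.translate(_TABLE)
```

Whether folding like this is desirable for a critical edition is a separate question, since it would make a Greek omicron match a Latin "o".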
I should have been more detailed about my examples:
Are you thinking of "canonically equivalent" characters, e.g. that 006F (o) followed by 0308 (̈) is to be considered the same as 00F6 (ö)? As far as I know, this is still not equally well implemented across different systems and software, so it may be very complex to support anything that Python itself doesn't support.
I'd definitely suggest not going into this. |
Okay, so this is very useful. I am now improving the punctuation tokenization to include more characters. I will describe that in #24 when I'm done. About the composing characters (which I mistook the punctuation for): all composed sequences, such as the suggested 006F (o) and 0308 (̈) = 00F6 (ö), are normalized to the single-character form (00F6 here) before processing, so that shouldn't be a problem. I am adding a note on this to the readme, as it actually is a change in the file that can be very hard to perceive. |
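The normalization described here can be illustrated with Python's standard `unicodedata.normalize`. The sketch assumes the NFC form, since that is the one that composes a base letter and a combining mark into a single code point; which form the project actually uses is not stated in this thread, and `normalize_text` is a hypothetical helper name.

```python
import unicodedata

decomposed = "o\u0308"  # U+006F followed by U+0308 COMBINING DIAERESIS
precomposed = "\u00f6"  # U+00F6 LATIN SMALL LETTER O WITH DIAERESIS


def normalize_text(text):
    """Compose base letter + combining mark sequences (NFC, an assumption)."""
    return unicodedata.normalize("NFC", text)
```

After `normalize_text`, both spellings compare equal, but the byte content of the file changes even though the rendered text looks the same, which is why a readme note is warranted.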
While trying some real-life examples with branch issue-24, I noticed that some Unicode characters break compilation, i.e. it says "Starting conversion." and never finishes.
It is a bit curious which characters are affected. It seems to go by Unicode block, and neither by frequency (e.g. typographical quotes and € don't work, Runes do) nor by function.
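The pattern in which characters fail is at least consistent with how Python's `\w` class behaves, assuming the tokenizer relies on it (an assumption, not something confirmed in this report). A small check, with `matches_word_class` as a hypothetical helper:

```python
import re


def matches_word_class(ch):
    """Return True if a single character is matched by Python's \\w."""
    return re.fullmatch(r"\w", ch) is not None


# Sample characters from the report: a rune, the euro sign, a typographic quote.
samples = {"rune": "\u16a0", "euro": "\u20ac", "left quote": "\u201c"}
results = {name: matches_word_class(ch) for name, ch in samples.items()}
```

Runes are letters and so fall inside `\w`, while the euro sign and typographical quotes are symbols and punctuation and fall outside it.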