Unicode support #108

Closed
vtereshkov opened this issue Jul 18, 2021 · 21 comments
Labels
enhancement New feature or request

Comments

@vtereshkov
Owner

vtereshkov commented Jul 18, 2021

What Unicode to choose for chars and strings?

UTF-8
Pros: Backward-compatible with ASCII, no need to support both "narrow" and "wide" strings
Cons: Chars have variable width, ambiguous len, sizeof and indexing. Poor support on Windows

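UTF-16
Pros: Native for Windows. Fixed char width within the Basic Multilingual Plane
Cons: Unnatural for Linux. Incompatible with ASCII. Chars outside the Basic Multilingual Plane need surrogate pairs, so not all Unicode chars fit in one 16-bit unit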

UTF-32
Pros: Native for Linux. Fixed char width. Complete Unicode supported
Cons: Unnatural for Windows. Incompatible with ASCII
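
To make the width trade-offs above concrete, here is a small sketch (Python is used only for illustration, the sample characters are arbitrary choices) that reports how many bytes one character takes in each of the three encodings.

```python
# Byte counts for the same characters in the three encodings under discussion.
# Samples: ASCII letter, Cyrillic letter, euro sign, and a character outside
# the Basic Multilingual Plane (U+1FB00).
samples = ["A", "\u041f", "\u20ac", "\U0001FB00"]

def encoded_sizes(ch):
    """Return (utf8, utf16, utf32) sizes in bytes for one character."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),  # "-le" avoids counting a byte-order mark
        len(ch.encode("utf-32-le")),
    )

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

Only UTF-32 has the same width for every sample; UTF-16 needs four bytes for the last one.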

vtereshkov added the enhancement label Jul 18, 2021
vtereshkov added a commit that referenced this issue Jul 18, 2021
@vtereshkov
Owner Author

vtereshkov commented Jul 18, 2021

We now have rudimentary support for UTF-8, since recent terminals and C runtime libraries on Windows 10 and Linux support C.UTF-8 or similar locale strings. The string length returned by len() is in bytes, not in characters. Go does the same, though it is inconvenient.

fn main() {
    s := "Привет" + ',' + " мир!"
    printf("Строка: " + s + ", длина: " + repr(len(s)) + '\n')
}

...:~/umka-lang/umka_linux$ ./umka -locale C.UTF-8 ../test.um
Строка: Привет, мир!, длина: 21 

On Windows, this feature is available under MSVC, but not under MinGW (older runtime?). The MSVC runtime also seems to be buggy: scanf() fails to read non-ASCII UTF-8.

On Linux everything works as expected.
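
The difference between a byte-based length and a character count for the test string can be checked independently. This is a sketch in Python, not Umka; it only reproduces the arithmetic.

```python
# Same string as in the Umka example above.
s = "Привет" + "," + " мир!"

byte_len = len(s.encode("utf-8"))  # what a byte-based len() reports
char_len = len(s)                  # number of Unicode code points

print(byte_len, char_len)
```

Each Cyrillic letter takes two bytes in UTF-8, while the comma, the space and the exclamation mark take one byte each.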

@vtereshkov
Owner Author

@marekmaskarinec Please note the API change: umkaInit() now requires a locale argument, which can be NULL.

vtereshkov added a commit that referenced this issue Jul 19, 2021
@vtereshkov
Owner Author

Need to consider creating a module like utf8 in Go: https://pkg.go.dev/unicode/utf8

@vtereshkov
Owner Author

@marekmaskarinec Do Umka's printf() and scanf() work correctly with non-ASCII UTF-8 strings on Void Linux? Everything is fine on Ubuntu 20, but not on Windows 10.

@marekmaskarinec
Contributor

This program:

fn main() {
    s := ""
    scanf("%s", &s)
    printf("%s\n", repr([]char(s)))
    printf("%s\n", s)
}

Produces this (input included):

🬀🬾
{ 0xFFFFFFF0 0xFFFFFF9F 0xFFFFFFAC 0xFFFFFF80 0xFFFFFFF0 0xFFFFFF9F 0xFFFFFFAC 0xFFFFFFBE 0x00 } 
🬀🬾

I did not touch the locale.

@vtereshkov
Owner Author

@marekmaskarinec And what if you set -locale C.UTF-8?

@marekmaskarinec
Contributor

marekmaskarinec commented Jul 23, 2021

It doesn't seem to work.

[ tests ]$ umka -locale C.UTF-8 test.um
Error test.um (1, 1): Cannot set locale

I think the characters I used to test aren't UTF-8. Should I test with utf-8 characters?

@marekmaskarinec
Contributor

Here is a test with some Czech characters, which are UTF-8.

řášďéě
{ 0xFFFFFFC5 0xFFFFFF99 0xFFFFFFC3 0xFFFFFFA1 0xFFFFFFC5 0xFFFFFFA1 0xFFFFFFC4 0xFFFFFF8F 0xFFFFFFC3 0xFFFFFFA9 0xFFFFFFC4 0xFFFFFF9B 0x00 } 
řášďéě
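
The 0xFFFFFFxx values in both dumps look like bytes printed as sign-extended signed chars. That reading is an assumption; the thread does not confirm it. Under that assumption, keeping only the low 8 bits of each value gives byte sequences that are valid UTF-8 for both inputs, as this Python sketch shows (dump_to_text is a hypothetical helper name).

```python
# Values copied from the two repr() dumps above.
czech_dump = [0xFFFFFFC5, 0xFFFFFF99, 0xFFFFFFC3, 0xFFFFFFA1, 0xFFFFFFC5,
              0xFFFFFFA1, 0xFFFFFFC4, 0xFFFFFF8F, 0xFFFFFFC3, 0xFFFFFFA9,
              0xFFFFFFC4, 0xFFFFFF9B, 0x00]
symbol_dump = [0xFFFFFFF0, 0xFFFFFF9F, 0xFFFFFFAC, 0xFFFFFF80,
               0xFFFFFFF0, 0xFFFFFF9F, 0xFFFFFFAC, 0xFFFFFFBE, 0x00]

def dump_to_text(values):
    """Keep the low byte of each value, drop the trailing NUL, decode as UTF-8."""
    raw = bytes(v & 0xFF for v in values)
    return raw.rstrip(b"\x00").decode("utf-8")

print(dump_to_text(czech_dump))
print([f"U+{ord(c):X}" for c in dump_to_text(symbol_dump)])
```

Both dumps decode without error, so the first test input was UTF-8 as well: two 4-byte sequences for code points outside the Basic Multilingual Plane.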

@vtereshkov
Owner Author

@marekmaskarinec Thank you. I doubt there are any characters in Unicode that cannot be encoded in UTF-8. And what does the Linux shell command locale -a print on your machine?

@marekmaskarinec
Contributor

[ ~ ]$ locale -a
C
POSIX
en_GB.utf8
en_US.utf8
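
The list above does not contain C.UTF-8, which would explain the "Cannot set locale" error. One way a caller could cope is to try several UTF-8 locale names in turn. The sketch below shows that idea in Python; the helper name and the candidate list are hypothetical and are not part of Umka.

```python
import locale

def first_available_locale(candidates):
    """Return the first locale name the C runtime accepts, or None.

    Note: a successful call changes the process-wide locale as a side effect.
    """
    for name in candidates:
        try:
            locale.setlocale(locale.LC_ALL, name)
            return name
        except locale.Error:
            continue  # this name is not installed; try the next one
    return None

# Hypothetical preference order; the available names differ per system.
print(first_available_locale(["C.UTF-8", "en_US.utf8", "en_GB.utf8", "C"]))
```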

@vtereshkov
Owner Author

@marekmaskarinec When running utf8test.um on my Windows machine, I get

bytes: 9
characters: 4
▀: U+2580
€: U+20ac
$: U+24
¢: U+a2

whereas, according to expected.log, it should be

bytes: 6
characters: 2
▀: U+2580
€: U+20ac

I'm not sure that expected.log is correct.

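Another problem is that when I print the output to the console rather than a file, the characters are interpreted in a legacy Cyrillic code page instead of UTF-8 (the garbled output below matches CP866, the DOS Cyrillic code page, rather than Windows-1251):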

bytes: 9
characters: 4
тЦА: U+2580
тВм: U+20ac
$: U+24
┬в: U+a2

But as I said in another place, this is probably a problem with the MinGW C runtime.
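
The garbled console listing can be reproduced by taking the UTF-8 bytes of the test string and decoding them with a single-byte Cyrillic code page. In this Python sketch, CP866 reproduces the listing exactly, while Windows-1251 gives different characters. Which code page the console was actually using is not stated in the thread.

```python
text = "▀€$¢"
raw = text.encode("utf-8")        # 3 + 3 + 1 + 2 = 9 bytes

as_cp866 = raw.decode("cp866")    # DOS Cyrillic code page
as_cp1251 = raw.decode("cp1251")  # Windows Cyrillic code page

print(len(raw))
print(as_cp866)
print(as_cp1251)
```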

@marekmaskarinec
Contributor

I'm not sure that expected.log is correct.

Yes. I added some additional characters, so expected.log is now incorrect.

@vtereshkov
Owner Author

@marekmaskarinec I have tested utf8.um on a Cyrillic string. The behavior seems to be incorrect:

string: ▀€$¢
bytes: 9
characters: 4
▀: U+2580
€: U+20ac
$: U+24
¢: U+a2

string: Привет, мир!
bytes: 21
characters: 12
ҟ: U+49f
?: U+4c0
Ҹ: U+4b8
Ҳ: U+4b2
ҵ: U+4b5
?: U+4c2
,: U+2c
 : U+20
Ҽ: U+4bc
Ҹ: U+4b8
?: U+4c0
!: U+21

A third-party UTF-8 encoder gives the following representation for "Привет, мир!":

\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82\x2C\x20\xD0\xBC\xD0\xB8\xD1\x80\x21
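
For comparison with the listing above, the code points that a correct decoder should report for these bytes can be obtained from any reference UTF-8 implementation. The sketch below uses Python's built-in decoder.

```python
# Byte string copied from the third-party encoder output above.
raw = (b"\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82\x2C\x20"
       b"\xD0\xBC\xD0\xB8\xD1\x80\x21")

text = raw.decode("utf-8")
code_points = [f"U+{ord(c):x}" for c in text]

for ch, cp in zip(text, code_points):
    print(f"{ch}: {cp}")
```

The first letter should come out as U+41f rather than U+49f, the second as U+440 rather than U+4c0, and so on. The ASCII characters are unaffected.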

@vtereshkov
Owner Author

@marekmaskarinec Two other things to consider:

  • r^ < 0x7f etc. Shouldn't it be r^ <= 0x7f?
  • 1 << 8. Shouldn't it be 1 << 7?
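
The code under review is not quoted in the thread, so the following is only a sketch of the standard UTF-8 length boundaries that both questions refer to. The last one-byte code point is 0x7F, and the first two-byte code point is 1 << 7 = 0x80. The function name utf8_length is hypothetical.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 needs for a code point (surrogates not checked)."""
    if code_point <= 0x7F:    # up to 7 bits: the ASCII range, note <= rather than <
        return 1
    if code_point <= 0x7FF:   # up to 11 bits
        return 2
    if code_point <= 0xFFFF:  # up to 16 bits
        return 3
    return 4                  # up to U+10FFFF

# Cross-check the boundaries against Python's own encoder.
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
```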

@marekmaskarinec
Contributor

marekmaskarinec commented Sep 6, 2021

I fixed those things, but with no effect. As far as I know, the problem is in getNextRune. Encoding works as intended.

Update: the problem might be with characters that have significant bits set to 1 in the first byte.

Update 2: it turns out it was a problem with the mask. I fixed it, and now all but two characters decode correctly.
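
The actual mask fix in utf8.um is not shown in the thread. As an illustration of how a mask error can produce the wrong Cyrillic code points reported earlier, the Python sketch below decodes a 2-byte sequence with and without the 0x3F mask on the continuation byte. Leaving the mask out happens to reproduce U+49f and U+4c0; this is one possible explanation, not a confirmed diagnosis. The function decode_two_byte is hypothetical.

```python
def decode_two_byte(lead, cont, mask_continuation=True):
    """Decode a 2-byte UTF-8 sequence: lead is 110xxxxx, cont is 10xxxxxx."""
    high = (lead & 0x1F) << 6  # 5 payload bits from the lead byte
    # The continuation byte carries 6 payload bits; its top two bits (10)
    # are a marker and must be masked off.
    low = (cont & 0x3F) if mask_continuation else cont
    return high | low

print(hex(decode_two_byte(0xD0, 0x9F)))         # correct decoding
print(hex(decode_two_byte(0xD0, 0x9F, False)))  # marker bit leaks into the result
```

For the three test characters that already decoded correctly (U+2580, U+20ac, U+a2), the leaked bit 7 coincides with a bit that is set in the correct result anyway, which would explain why only the Cyrillic letters were affected.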

@vtereshkov
Copy link
Owner Author

@marekmaskarinec Are you going to commit the changes? Or do you hope to first figure out what happened with the two remaining characters?

@marekmaskarinec
Contributor

The changes are currently in my fork, in the utf8 branch. I tried one of the letters that doesn't work: CYRILLIC CAPITAL LETTER ER. It generates 0x440, but the correct code point is 0x420. What I found is that the byte I was getting was d1, but it is supposed to be d0.
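
For reference, the two code points mentioned here and their UTF-8 encodings can be listed with Python's standard library. This only shows what the correct bytes are; it does not identify where the d1 byte came from. The helper name utf8_hex is hypothetical.

```python
import unicodedata

def utf8_hex(code_point):
    """Hex string of the UTF-8 encoding of one code point."""
    return chr(code_point).encode("utf-8").hex()

for cp in (0x420, 0x440):
    print(f"U+{cp:04X}", unicodedata.name(chr(cp)), utf8_hex(cp))
```

U+0440 is the lowercase counterpart of U+0420, and the two encodings differ in both bytes, not only in the lead byte.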

vtereshkov added a commit that referenced this issue Sep 10, 2021
@skejeton
Contributor

skejeton commented Oct 16, 2021

I'm for UTF-8, to be honest. Either that or UTF-32, but given the poor support for UTF-32, I'd choose UTF-8, since UTF-16 can't represent all characters in 2 bytes anyway. The nature of UTF-8 makes it opt-in: you have a plain ASCII string, and if you want, you add a non-ASCII character. In that case it uses the 8th bit, so it never conflicts with ASCII.
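
The "8th bit" property can be checked mechanically. In this Python sketch (check_ascii_compat is a hypothetical helper), ASCII characters encode to themselves, and every byte of a non-ASCII character has the high bit set, so such bytes can never be mistaken for ASCII.

```python
def check_ascii_compat(text):
    """True if ASCII chars encode to themselves and all other bytes are >= 0x80."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if ord(ch) < 0x80:
            if encoded != bytes([ord(ch)]):
                return False
        elif any(b < 0x80 for b in encoded):
            return False
    return True

print(check_ascii_compat("Привет, мир! ▀€$¢"))
print("plain ascii".encode("utf-8") == "plain ascii".encode("ascii"))
```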

@vtereshkov
Owner Author

@ishdx2 Yes, this is what I chose myself, but I hoped for better support of UTF-8 by the C runtime and consoles across platforms. On Linux the support is very good; on Windows it is not. MinGW does not have UTF-8 locales at all, while MSVC supports them in printf(), but not in scanf(). This is weird.

@skejeton
Contributor

I'm afraid you have to use the UTF-16 WinAPI functions.

@vtereshkov
Owner Author

UTF-8 is now supported by the utf8.um standard library module.

For Windows-specific console I/O problems, see #354.


3 participants