UTF-8 Character Support #244

Closed
bdkjones opened this issue Jan 11, 2014 · 24 comments

@bdkjones

Try to compile the following code with libsass:

$someVariable: ⧲;

html, body {
    font-size: 20px;
    background-color: orange;
}

It will fail and return this error message:

main.scss:1: error: error reading values after :

If you comment out that first line, the file will compile fine. The issue appears to be that libsass cannot support non-ASCII characters. The code above compiles correctly with the current version of Ruby Sass.

@akhleung

I'm looking into this now. Hopefully it'll be as simple as accepting any byte whose value is greater than 127.

@bdkjones
Author

I don't think that's the case. Unicode is pretty complex. Some characters take up more than one byte, for instance, and must be handled correctly. There are several C libraries that deal with Unicode. Perhaps it would be worth looking at those? This is somewhat of an edge case: the characters that complicate Unicode are really all the Asian language ones, which are rarely used in CSS (but can be!)

The real issue is handling European characters such as: é, ü, ø, etc. Those will currently fail just as the above code does.

@akhleung

Hmm ... my understanding is that if we restrict the input to UTF-8, then each byte in a multibyte character will be greater than 127 -- this appears to be by design, to assist with backward compatibility with ASCII. Moreover, the leading byte of a given character will specify how many bytes the whole character takes (I mention this mainly because it seems like it would be useful for implementing Sass 3.3's string length function). I got this all from the Wikipedia article on UTF-8 (particularly the first table in the "description" section):
http://en.wikipedia.org/wiki/UTF-8

As for using a unicode library, we're sort of attached to the notion of LibSass being zero-dependency (well, aside from standard POSIX and Windows libs). If these issues keep coming up, however, then I suppose we'll have to reconsider....
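
The leading-byte rule described above can be sketched without any library. This is a minimal illustration, not LibSass code: the names `utf8_sequence_length` and `utf8_length` are hypothetical, and the functions assume reasonably well-formed UTF-8 (they do not reject overlong or otherwise malformed sequences).

```cpp
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 sequence that starts with `lead`,
// or 0 if `lead` is a continuation byte or not a valid leading byte.
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // 10xxxxxx (continuation) or invalid
}

// Counts characters (code points) by counting every byte that is
// not a continuation byte (10xxxxxx).
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

A string length function along the lines of the one mentioned for Sass 3.3 could be built on `utf8_length`, since it reports characters rather than bytes.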

@bdkjones
Author

I absolutely LOVE the idea of zero-dependency. It makes me want to hug you guys. I have to deal with Jade and Stylus, which have like 7,231 nested dependencies. (Seriously, it's an absolute disaster).

I'm fine with restricting libsass to UTF-8, but technically CSS supports all of Unicode, right? That is, it's possible for someone to specify a charset that's UTF-16, with or without a BOM. In those situations, the above would not work.

I require UTF-8 for my app and make no attempt to handle any other encodings, so I'm fine with it!

@mgreter
Contributor

mgreter commented Feb 8, 2014

As akhleung pointed out, the real difference begins when you interpret Unicode text as strings (for example, when asking for the length, or put differently, "how many chars"). I don't see why libsass would need to really understand UTF-8 (at this point). The main chars for CSS to care about are ;:{}. But a possible problem could arise if you have these ASCII codes in the second/third/... byte of a Unicode char. So yeah, it would be nice if libsass were Unicode aware.
There are UTF BOMs (pretty much a "must" for UTF-16, but "annoying" for UTF-8). Besides that, text editors do some sophisticated guessing about which encoding a text file was saved in. I guess that Unicode has its main use in the content property of pseudo selectors, because that is pretty much the only reason to write CSS in "Unicode".
If I'm not wrong, that sign is a Greek capital Phi (ce a6). As a workaround, you could replace that char with its HTML entity "&Phi;".
IMO Unicode handling is not something you implement yourself. Either your environment supports it or it doesn't. For C++ I guess it should be "as easy" as changing all CHAR types to WCHAR.

@QuLogic
Contributor

QuLogic commented Feb 8, 2014

if you have these ASCII codes in the second/third/... byte of a Unicode char

UTF-8 is specifically designed so that that is impossible, unless you are referring to characters above 127, which are not standardized at all (hence the various incompatible code pages).

@mgreter
Contributor

mgreter commented Feb 8, 2014

What I mean is: a colon has the hex value "3A" (ASCII, or any other codepage, afaik). If I looked it up correctly, this char "䌺" should have the hex value "43 3A". As you see, the hex value 3A is the same (for the second byte), and if the string handling is not aware of Unicode, this could be falsely interpreted as a colon.

@QuLogic
Contributor

QuLogic commented Feb 8, 2014

No, 0x433A is the code point for that character, but when encoded in UTF-8, no byte less than 128 (that is, no ASCII byte) will occur in the second and higher bytes. That character encoded in UTF-8 is 0xE4 0x8C 0xBA.
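
To make the distinction between code point and encoded bytes concrete, here is a small sketch of the standard UTF-8 encoding rule. The function name `utf8_encode` is hypothetical, and the sketch assumes its argument is a valid Unicode scalar value (at most 0x10FFFF and not a surrogate); it performs no validation.

```cpp
#include <cstdint>
#include <string>

// Encodes one Unicode code point as UTF-8.
// Assumption: `cp` is a valid scalar value; no validation is done.
std::string utf8_encode(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // One byte: 0xxxxxxx (identical to ASCII)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling `utf8_encode(0x433A)` yields the three bytes E4 8C BA. Every byte of a multi-byte sequence carries the prefix `10`, `110`, `1110` or `11110`, so none of them can equal 0x3A.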

@mgreter
Contributor

mgreter commented Feb 8, 2014

@QuLogic: You are correct, sir! Seems like I still haven't figured out unicode completely! Although I guess it's still a valid point if we're talking UTF-16 :) Is wchar_t portable?

@bdkjones
Author

bdkjones commented Feb 8, 2014

I'm not sure I've followed the above conversation correctly, but I do think it's important that libsass be able to handle non-ASCII characters such as é, ï, œ, ø and so on.

With the content: property of CSS, it's quite common to run into these sorts of characters in stylesheets and libsass should, at the very least, not choke on them. As for UTF-16 and BOM... to hell with that. I suggest we make libsass UTF-8 compliant. That covers 99.9% of use cases and it is far, far easier than trying to deal with every possible character encoding, from ISO-Latin to UTF-16 BOM.

@QuLogic
Contributor

QuLogic commented Feb 9, 2014

@QuLogic: You seem to be correct, sir! Seems like I still haven't figured out unicode completely!

It's probably a very common mistake to think that for multi-byte characters, you can just set the MSB for bytes that need to be continued. But in UTF-8, every single byte in a multi-byte character has the MSB set. This has several advantages, one of which being you'd never see a lower ASCII character "by mistake".

Anyway, another advantage is that you can just treat UTF-8 as "a bunch of bytes" so long as you don't need to count actual characters. You don't accidentally see other characters, and you don't have embedded NULs to worry about.
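
The "bunch of bytes" point can be illustrated with a byte-wise split at the first colon. This is only a sketch under the assumption that the input is UTF-8; `split_declaration` is a hypothetical helper, not something taken from LibSass.

```cpp
#include <string>
#include <utility>

// Splits "property:value" at the first ':' byte, working purely on bytes.
// For UTF-8 input this is safe, because the byte 0x3A can never occur
// inside a multi-byte sequence. If there is no colon, the whole input is
// returned as the first element and the second element is empty.
std::pair<std::string, std::string> split_declaration(const std::string& decl) {
    const std::string::size_type colon = decl.find(':');
    if (colon == std::string::npos) {
        return std::make_pair(decl, std::string());
    }
    return std::make_pair(decl.substr(0, colon), decl.substr(colon + 1));
}
```

The same search would not be safe on UTF-16 data, where a byte with the value 0x3A can be one half of an unrelated character.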

@mgreter
Contributor

mgreter commented Feb 9, 2014

I just tested the example the OP has posted.
When saved in ANSI I got the same error "error reading values after :"
When saved in UTF-8 I got another error "error reading values after Ô"
When I put the unicode char into quotes ('⧲'), everything works as expected.
I also tested them against Ruby Sass, which showed only one difference: the second test compiles with Ruby Sass (unquoted identifiers with Unicode chars don't seem to work in libsass).
So for me it looks like UTF-8 is (pretty much) already working (and IMO this makes sense, as @QuLogic already pointed out). Maybe someone else can confirm that?

Maybe something like this could solve this difference between ruby sass and libsass (prelexer.cpp)?

const char* alpha(const char* src) { return std::isalpha(*src) || !isascii(*src) ? src+1 : 0; }
const char* alnum(const char* src) { return std::isalnum(*src) || !isascii(*src) ? src+1 : 0; }
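
A possible variation on the two functions above avoids two portability concerns: `isascii` is not part of standard C or C++, and passing a negative `char` value to `std::isalpha` or `std::isalnum` is undefined behavior on platforms where `char` is signed. The helper `is_non_ascii` is a hypothetical name introduced here; this is a sketch, not the code that was merged.

```cpp
#include <cctype>

// True for any byte >= 128, i.e. any byte of a UTF-8 multi-byte sequence.
inline bool is_non_ascii(unsigned char c) { return c >= 0x80; }

// Accepts one ASCII letter or one non-ASCII byte; returns the position
// after it, or 0 if the byte at `src` does not match.
const char* alpha(const char* src) {
    // Convert to unsigned char first so the <cctype> call is well defined.
    const unsigned char c = static_cast<unsigned char>(*src);
    return (std::isalpha(c) || is_non_ascii(c)) ? src + 1 : 0;
}

// Same as alpha, but also accepts ASCII digits.
const char* alnum(const char* src) {
    const unsigned char c = static_cast<unsigned char>(*src);
    return (std::isalnum(c) || is_non_ascii(c)) ? src + 1 : 0;
}
```

Like the original, each call advances by a single byte, so a multi-byte character is consumed over several calls.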

@akhleung

@mgreter I'm working on more general improvements to LibSass's scanning functions, but I'll try your suggestions and see how well they work. Thanks!

@mgreter
Contributor

mgreter commented Feb 10, 2014

I actually just tried it with a basic example (which did compile fine with the "hack"):

SELECTÖR { paräm: valüe; }

@akhleung

Sounds good ... incidentally, do you want to put this into a pull request?

@mgreter
Contributor

mgreter commented Feb 10, 2014

Created pull request (#283). I guess this should be safe (as chars above 127 should not have any meaning other than alpha character). Maybe someone knows what the CSS specification says about Unicode in selectors or property names? I also don't know how portable isascii is (I guess if it is portable, it should be the safest test, as I've read that chars may be signed or unsigned, and I'm a bit unsure whether that can really be predicted). To make it short, others can probably tell better whether this is a good solution or not.

@akhleung

I've been redoing the scanning functions to be closer to the diagrams and algorithms detailed in http://www.w3.org/TR/css3-syntax/ ... it looks like their definition of a non-ascii character is any char whose value is >= 128. isascii appears not to be a standard function, but assuming it's defined the way one would expect, it should be fine for now.
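
As a rough illustration of that rule, the following sketch scans an identifier-like name byte by byte, treating every byte >= 128 as a name character. It is a simplification made for this example: the function names are hypothetical, and escapes and a leading hyphen, which the full ident-token grammar also allows, are not handled.

```cpp
#include <cctype>
#include <cstddef>

// Name-start byte: ASCII letter, underscore, or any non-ASCII byte (>= 128).
bool is_name_start(unsigned char c) {
    return std::isalpha(c) || c == '_' || c >= 0x80;
}

// Name byte: a name-start byte, an ASCII digit, or a hyphen.
bool is_name(unsigned char c) {
    return is_name_start(c) || std::isdigit(c) || c == '-';
}

// Returns the length in bytes of the name at the start of the
// NUL-terminated string `src`, or 0 if `src` does not start with a name.
std::size_t scan_name(const char* src) {
    if (!is_name_start(static_cast<unsigned char>(*src))) return 0;
    std::size_t n = 1;
    // The terminating NUL is not a name byte, so the loop stops there.
    while (is_name(static_cast<unsigned char>(src[n]))) ++n;
    return n;
}
```

The result is a byte count, so a name containing one two-byte character is reported as one byte longer than its visible length.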

@mgreter
Contributor

mgreter commented Feb 10, 2014

I would say that's correct, so the fix should pretty much be in line with the css specs:
http://www.w3.org/TR/css3-syntax/#token-diagrams (under <ident-token>)
Just retested the original reported problem:

$someVariäble: ⧲;
html, body { paräm: $someVariäble; }

Resulted in:

html, body {
  paräm: ⧲; }

IMO this bug can be closed.

@akhleung

All right, I'll tentatively mark this as ready for validation. Please let me know if you have use-cases that still fail.

@IGZjaviernieto

hi,

when I try something like:

.some-class{
content: "";
}

I'm getting: "error: error reading values after..."

@akhleung

That particular case is supposed to generate an error (though LibSass's error message isn't very helpful in this case) -- you need to escape the backslash in your string: "\\".

@IGZjaviernieto

We're already escaping the backslash "\" (typing error in my previous post), and we still get the error. We're using grunt-sass->node-sass->libsass. Maybe there is some over-escaping in the lib call chain?

@akhleung

Hmm ... I think your error is probably related to #102, which was fixed just a few days ago. If you're using grunt-sass and node-sass, they probably haven't pulled in the latest updates.

@IGZjaviernieto

ok, thanks.

mgreter added a commit to mgreter/libsass-spec that referenced this issue Mar 22, 2020