UTF-8 Character Support #244

bdkjones · 2014-01-11T20:59:07Z

Try to compile the following code with libsass:

$someVariable: ⧲;

html, body {
    font-size: 20px;
    background-color: orange;
}

It will fail and return this error message:

main.scss:1: error: error reading values after :

If you comment out that first line, the file will compile fine. The issue appears to be that libsass cannot support non-ASCII characters. The code above compiles correctly with the current version of Ruby Sass.

akhleung · 2014-01-22T18:54:30Z

I'm looking into this now. Hopefully it'll be as simple as accepting any byte whose value is greater than 127.

bdkjones · 2014-01-22T23:40:30Z

I don't think that's the case. Unicode is pretty complex. Some characters take up more than one byte, for instance, and must be handled correctly. There are several C libraries that deal with Unicode. Perhaps it would be worth looking at those? This is somewhat of an edge case: the characters that complicate Unicode are really all the Asian language ones, which are rarely used in CSS (but can be!)

The real issue is handling European characters such as: é, ü, ø, etc. Those will currently fail just as the above code does.

akhleung · 2014-01-22T23:56:52Z

Hmm ... my understanding is that if we restrict the input to UTF-8, then each byte in a multibyte character will be greater than 127 -- this appears to be by design, to assist with backward compatibility with ASCII. Moreover, the leading byte of a given character will specify how many bytes the whole character takes (I mention this mainly because it seems like it would be useful for implementing Sass 3.3's string length function). I got this all from the Wikipedia article on UTF-8 (particularly the first table in the "description" section):
http://en.wikipedia.org/wiki/UTF-8
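
The leading-byte rule described above can be sketched roughly as follows. This is only an illustration, not LibSass code: the names `utf8_sequence_length` and `utf8_code_point_count` are hypothetical, and the input is assumed to be well-formed UTF-8 (continuation bytes are not validated).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
// with `lead`, or 0 if `lead` is a continuation byte or an invalid lead byte.
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // 10xxxxxx or invalid
}

// Hypothetical helper: counts code points by counting every byte that is
// not a continuation byte (10xxxxxx). Assumes well-formed UTF-8.
std::size_t utf8_code_point_count(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

For a three-byte sequence such as E2 A7 B2, the byte length is 3 while `utf8_code_point_count` reports 1, which is the distinction a string length function would have to make.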

As for using a unicode library, we're sort of attached to the notion of LibSass being zero-dependency (well, aside from standard POSIX and Windows libs). If these issues keep coming up, however, then I suppose we'll have to reconsider....

bdkjones · 2014-01-23T00:00:22Z

I absolutely LOVE the idea of zero-dependency. It makes me want to hug you guys. I have to deal with Jade and Stylus, which have like 7,231 nested dependencies. (Seriously, it's an absolute disaster).

I'm fine with restricting libsass to UTF-8, but technically CSS supports other Unicode encodings too, right? That is, it's possible for someone to specify a charset that's UTF-16, with or without a BOM. In those situations, the above would not work.

I require UTF-8 for my app and make no attempt to handle any other encodings, so I'm fine with it!

mgreter · 2014-02-08T07:31:02Z

As akhleung pointed out, the real difference begins when you interpret Unicode text as strings (for example when asking for the length or, put differently, "how many chars"). I don't see why libsass would need to really understand UTF-8 (at this point). The main chars for CSS to care about are ;:{}. But a possible problem could arise if you have these ASCII codes in the second/third/... byte of the Unicode char. So yeah, it would be nice if libsass were Unicode-aware.
There are UTF BOMs (pretty much a "must" for UTF-16, but "annoying" for UTF-8). Besides that, text editors do some sophisticated guessing about which encoding a text file was saved in. I guess that Unicode has its main use in the content property (on ::before/::after pseudo-elements), because that is pretty much the only reason to write CSS in "Unicode".
If I'm not wrong, that sign is a Greek capital Phi (ce a6). As a workaround, you could replace that char with its HTML entity "&Phi;".
IMO Unicode handling is not something you implement yourself. Either your environment supports it or not. For C++ I guess it should be "as easy" as to change all CHAR types to WCHAR.

QuLogic · 2014-02-08T07:40:11Z

> if you have these ASCII codes in the second/third/... byte of the Unicode char

UTF-8 is specifically designed so that that is impossible, unless you are referring to characters above 127, which are not standardized at all (hence the various incompatible code pages).

mgreter · 2014-02-08T07:54:16Z

What I mean is: a colon has the hex value "3A" (in ASCII, or any other code page, afaik). If I looked it up correctly, this char "䌺" should have the hex value "43 3A". As you see, the hex value 3A is the same (for the second byte), and if the string handling is not aware of Unicode, it could be falsely interpreted as a colon.

QuLogic · 2014-02-08T07:57:52Z

No, 0x433A is the code point for that character, but when encoded in UTF-8, no byte below 128 (that is, no ASCII character) will occur in the second and higher bytes. That character encoded in UTF-8 is 0xE4 0x8C 0xBA.
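
The byte values quoted above can be reproduced with a small encoder. This is a sketch under stated assumptions rather than anything from LibSass: `utf8_encode` is a hypothetical name, surrogate code points are not rejected, and values above U+10FFFF simply yield an empty string.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8.
// Sketch only: surrogates (U+D800..U+DFFF) are not rejected, and
// code points above U+10FFFF produce an empty string.
std::string utf8_encode(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                          // 0xxxxxxx
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10xxxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));  // 10xxxxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10xxxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    }
    return out;
}
```

Calling `utf8_encode(0x433A)` gives the three bytes E4 8C BA. The low six bits of the code point (0x3A) end up in the last byte as 0x80 | 0x3A = 0xBA, so the colon byte 0x3A never appears in the encoded form.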

mgreter · 2014-02-08T08:01:57Z

@QuLogic: You are correct, sir! Seems like I still haven't figured out unicode completely! Although I guess it's still a valid point if we're talking UTF-16 :) Is wchar_t portable?

bdkjones · 2014-02-08T08:49:39Z

I'm not sure I've followed the above conversation correctly, but I do think it's important that libsass be able to handle non-ASCII characters such as é, ï, œ, ø and so on.

With the content: property of CSS, it's quite common to run into these sorts of characters in stylesheets and libsass should, at the very least, not choke on them. As for UTF-16 and BOM... to hell with that. I suggest we make libsass UTF-8 compliant. That covers 99.9% of use cases and it is far, far easier than trying to deal with every possible character encoding, from ISO-Latin to UTF-16 BOM.

QuLogic · 2014-02-09T00:48:27Z

> @QuLogic: You seem to be correct, sir! Seems like I still haven't figured out unicode completely!

It's probably a very common mistake to think that for multi-byte characters, you can just set the MSB for bytes that need to be continued. But in UTF-8, every single byte in a multi-byte character has the MSB set. This has several advantages, one of which being you'd never see a lower ASCII character "by mistake".

Anyway, another advantage is that you can just treat UTF-8 as "a bunch of bytes" so long as you don't need to count actual characters. You don't accidentally see other characters, and you don't have embedded NULs to worry about.
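
As a rough illustration of the "bunch of bytes" point, a plain byte search for an ASCII delimiter works on UTF-8 input without any decoding. The function name `split_declaration` is hypothetical and the sketch assumes well-formed UTF-8; it is not how LibSass parses declarations.

```cpp
#include <string>
#include <utility>

// Hypothetical helper: splits "property: value" at the first ':' byte.
// No decoding is needed, because in well-formed UTF-8 every byte of a
// multi-byte character is >= 0x80 and so can never equal ':' (0x3A).
// Returns {decl, ""} when there is no colon.
std::pair<std::string, std::string> split_declaration(const std::string& decl) {
    const std::string::size_type colon = decl.find(':');
    if (colon == std::string::npos) return {decl, std::string()};
    return {decl.substr(0, colon), decl.substr(colon + 1)};
}
```

The split always lands on a character boundary, so both halves remain valid UTF-8 even when the property name or value contains characters such as ä or ü.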

mgreter · 2014-02-09T20:55:58Z

I just tested the example the OP has posted.
When saved in ANSI I got the same error "error reading values after :"
When saved in UTF-8 I got another error "error reading values after Ô"
When I put the unicode char into quotes ('⧲'), everything works as expected.
I also tested them against Ruby Sass, which showed only one difference: the second test compiles with Ruby Sass (unquoted identifiers with Unicode chars don't seem to work in libsass).
So for me it looks like UTF-8 is (pretty much) already working (and IMO this makes sense, as @QuLogic already pointed out). Maybe someone else can confirm that?

Maybe something like this could solve this difference between ruby sass and libsass (prelexer.cpp)?

const char* alpha(const char* src) { return std::isalpha(*src) || !isascii(*src) ? src+1 : 0; }
const char* alnum(const char* src) { return std::isalnum(*src) || !isascii(*src) ? src+1 : 0; }

akhleung · 2014-02-10T20:04:49Z

@mgreter I'm working on more general improvements to LibSass's scanning functions, but I'll try your suggestions and see how well they work. Thanks!

mgreter · 2014-02-10T20:10:37Z

I actually just tried it with a basic example (which did compile fine with the "hack"):

SELECTÖR { paräm: valüe; }

akhleung · 2014-02-10T20:12:20Z

Sounds good ... incidentally, do you want to put this into a pull request?

mgreter · 2014-02-10T20:29:03Z

Created pull request (#283). I guess this should be safe (as chars above 127 should not have any meaning other than alpha character). Maybe someone knows what the CSS specification says about Unicode in selectors or property names? I also don't know how portable isascii is (I guess if it is portable, it should be the safest test, as I've read that chars may be signed or unsigned, and I'm a bit unsure whether that can really be predicted). In short, others can probably tell better whether this is a good solution or not.

akhleung · 2014-02-10T20:38:09Z

I've been redoing the scanning functions to be closer to the diagrams and algorithms detailed in http://www.w3.org/TR/css3-syntax/ ... it looks like their definition of a non-ascii character is any char whose value is >= 128. isascii appears not to be a standard function, but assuming it's defined the way one would expect, it should be fine for now.
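
The ">= 128" test can also be written without isascii, which sidesteps both the portability question and the signed/unsigned char concern raised earlier. The following is a minimal sketch, not the code merged in #283: the names `is_non_ascii`, `alpha_utf8` and `alnum_utf8` are hypothetical, and the return convention (pointer past the matched byte, or null) is assumed from the snippet posted earlier in the thread.

```cpp
#include <cctype>

// Hypothetical helper: true for any byte with the high bit set, i.e. any
// byte that belongs to a multi-byte UTF-8 sequence.
inline bool is_non_ascii(unsigned char c) { return c >= 0x80; }

// Matches one alphabetic or non-ASCII byte. The cast to unsigned char
// matters: passing a negative char value to std::isalpha is undefined.
const char* alpha_utf8(const char* src) {
    const unsigned char c = static_cast<unsigned char>(*src);
    return (std::isalpha(c) || is_non_ascii(c)) ? src + 1 : nullptr;
}

// Same idea for alphanumeric bytes.
const char* alnum_utf8(const char* src) {
    const unsigned char c = static_cast<unsigned char>(*src);
    return (std::isalnum(c) || is_non_ascii(c)) ? src + 1 : nullptr;
}
```

Because the comparison is done on an unsigned value, the result is the same whether the platform's plain char is signed or unsigned.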

mgreter · 2014-02-10T20:56:01Z

I would say that's correct, so the fix should pretty much be in line with the css specs:
http://www.w3.org/TR/css3-syntax/#token-diagrams (under <ident-token>)
Just retested the original reported problem:

$someVariäble: ⧲;
html, body { paräm: $someVariäble; }

Resulted in:

html, body {
  paräm: ⧲; }

IMO this bug can be closed.

akhleung · 2014-02-10T23:36:43Z

All right, I'll tentatively mark this as ready for validation. Please let me know if you have use-cases that still fail.

IGZjaviernieto · 2014-06-10T17:28:48Z

hi,

when I try something like:

.some-class{
content: "";
}

I'm getting: "error: error reading values after..."

akhleung · 2014-06-10T17:36:24Z

That particular case is supposed to generate an error (though LibSass's error message isn't very helpful in this case) -- you need to escape the backslash in your string: "\\".

IGZjaviernieto · 2014-06-11T07:47:12Z

We're already escaping the backslash "\" (typo in my previous post), and still getting the error. We're using grunt-sass -> node-sass -> libsass. Maybe some over-escaping in the library call chain?

akhleung · 2014-06-11T17:39:29Z

Hmm ... I think your error is probably related to #102, which was fixed just a few days ago. If you're using grunt-sass and node-sass, they probably haven't pulled in the latest updates.

IGZjaviernieto · 2014-06-12T08:00:02Z

ok, thanks.

mgreter mentioned this issue Feb 10, 2014

Treat all non ascii chars (utf-8) as alpha characters #283

Merged

akhleung added the validate label Feb 10, 2014

akhleung closed this as completed Mar 19, 2014

mgreter mentioned this issue Jun 5, 2014

Unicode Support by CSS Specifications #381

Closed

mgreter added a commit to mgreter/libsass-spec that referenced this issue Mar 22, 2020

Add spec test for libsass issue 244

fb1a7b4

sass/libsass#244

mgreter removed the Dev - Needs Test label Mar 22, 2020

mgreter added the Dev - Test Written label Mar 22, 2020
