Unicode Support by CSS Specifications #381

Closed
mgreter opened this Issue Jun 5, 2014 · 6 comments

5 participants
@mgreter
Contributor

mgreter commented Jun 5, 2014

Hi all

I came across this whole topic quite recently while refactoring our FE Asset Manager Tool (Webmerge), so I wanted to document my findings here.

Declaring character encodings in CSS

This explains how the character encoding of a CSS file is determined. Since we are only dealing with local files, we never have an HTTP header. So the precedence should be: '@charset' rule, then byte-order mark (BOM), then auto-detection (finally falling back to the system default or UTF-8). This may not sound too hard to implement, but what about import rules? The CSS specs do not forbid mixing different encodings! I solved that by converting all files to UTF-8 internally. On writing, there is an option to tell the tool which encoding to use (UTF-8 by default). One can also define whether it should write a BOM and whether it should add the charset declaration.
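A minimal sketch of that precedence for a local file, written in Python rather than the Perl used in Webmerge. The name `sniff_css_encoding` and the UTF-8 default are assumptions for illustration, and real auto-detection heuristics are left out. Note that an '@charset' rule can only be read as ASCII bytes when the file is ASCII compatible, so for UTF-16/UTF-32 the sketch has to rely on the BOM.

```python
import codecs
import re

# The UTF-32 marks are tested before UTF-16 because the UTF-32-LE BOM
# begins with the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Matches a rule such as: @charset "ISO-8859-1";
_CHARSET_RE = re.compile(rb'^@charset "([^"]*)";')


def sniff_css_encoding(data: bytes, default: str = "utf-8") -> str:
    """Guess the encoding of a local CSS file (hypothetical helper)."""
    bom_encoding = None
    body = data
    for bom, name in _BOMS:
        if data.startswith(bom):
            bom_encoding = name
            body = data[len(bom):]
            break

    # 1) '@charset' rule, readable only in ASCII-compatible input.
    if bom_encoding in (None, "utf-8"):
        match = _CHARSET_RE.match(body)
        if match:
            label = match.group(1).decode("ascii", "replace")
            try:
                return codecs.lookup(label).name
            except (LookupError, ValueError):
                pass  # unknown label: fall through to the BOM/default

    # 2) BOM, 3) fallback (assumed to be UTF-8 here).
    return bom_encoding or default
```

The returned value is Python's normalized codec name, so a Latin-1 label comes back in whatever spelling `codecs.lookup` uses for it.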

Since my tool is written in Perl, I have a lot of utilities at hand to deal with different Unicode charsets. I'm pretty sure that most OSS uses libiconv to convert between different encodings. But I have no idea how easy or hard it would be to integrate in a platform-independent way (it seems doable).

Current status on libsass unicode support

Currently libsass seems to handle the common UTF-8 case pretty well. I believe it should correctly support all ASCII-compatible encodings (like UTF-8 or Latin-1). If all includes use the same encoding, the output should be correct (in that same encoding). It should also handle Unicode characters in selectors, variable names and other identifiers. This holds for all ASCII-compatible encodings. So the main incompatible encodings (that I'm aware of) are UTF-16/UTF-32 (which could be converted to UTF-8 with libiconv).
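To illustrate what "ASCII compatible" means here, a small Python sketch (the helper name `is_ascii_transparent` is made up for this example). It checks whether plain CSS syntax encodes to the same bytes as in ASCII, which is what a byte-oriented parser relies on.

```python
def is_ascii_transparent(encoding: str, sample: str = "a { color: red; }") -> bool:
    """True if ASCII-only text has the same bytes in `encoding` as in ASCII."""
    return sample.encode(encoding) == sample.encode("ascii")


for name in ("utf-8", "latin-1", "utf-16-le", "utf-32-le"):
    print(name, is_ascii_transparent(name))

# In UTF-8, every byte of a non-ASCII character has the high bit set,
# so it cannot be mistaken for '{', '}', ';' or a quote.
print([hex(b) for b in "é".encode("utf-8")])
```

UTF-16 and UTF-32 fail the check because every ASCII character is padded with zero bytes, so a parser scanning for single-byte delimiters sees something quite different from the source text.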

Current encoding auto detection

Libsass currently reads all kinds of BOMs and will error out if it finds something it doesn't know how to handle! It seems that it throws away the optional UTF-8 BOM (if one is found). IMO it would be nice if users could configure that (and also whether a charset rule should be added to the output).

What is currently not supported

  • Using non-ASCII-compatible encodings (like UTF-16)
  • Using non-ASCII characters in different encodings in different includes

What is missing to support the above cases

  • A way to convert between encodings (like libiconv)
  • Sniffing the charset inside the file (source is available)
  • Handling the conversion on import (and export)
  • Optional: Make output encoding configurable
  • Optional: Add optional/mandatory BOM (configurable)
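The conversion steps listed above could look roughly like the following Python sketch. It uses Python's built-in codecs in place of libiconv, and the names `import_as_utf8` and `export_css` and their parameters are hypothetical, not part of libsass or Webmerge.

```python
import codecs

_BOM_FOR = {
    "utf-8": codecs.BOM_UTF8,
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-16-be": codecs.BOM_UTF16_BE,
    "utf-32-le": codecs.BOM_UTF32_LE,
    "utf-32-be": codecs.BOM_UTF32_BE,
}


def import_as_utf8(data: bytes, source_encoding: str) -> bytes:
    """Convert one imported file to UTF-8 and drop a leading BOM."""
    text = data.decode(source_encoding)
    if text.startswith("\ufeff"):
        text = text[1:]
    return text.encode("utf-8")


def export_css(utf8_css: bytes, encoding: str = "utf-8",
               write_bom: bool = False, add_charset: bool = False) -> bytes:
    """Re-encode compiled UTF-8 output according to the configured options."""
    text = utf8_css.decode("utf-8")
    if add_charset:
        # Assumes the compiled output does not already start with @charset.
        text = f'@charset "{encoding}";\n' + text
    out = text.encode(encoding)
    if write_bom:
        # Encodings without a BOM (e.g. Latin-1) get none.
        out = _BOM_FOR.get(codecs.lookup(encoding).name, b"") + out
    return out
```

Every import goes through `import_as_utf8` with its own detected encoding, so files in mixed encodings end up in one internal representation; only the final write decides encoding, BOM and charset rule.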

Low priority feature

I guess the current implementation should handle more than 99% of all real world use cases.
A) Unicode characters are still rarely seen (as they can be written escaped)
B) It will still work if the input is UTF-8 or any of the most common Western ISO codepages.
Although I'm not sure how this applies to Asian and other "exotic" codepages!
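As a hedged illustration of why some East Asian codepages may need extra care: in Shift_JIS, the second byte of a two-byte character can fall in the ASCII range. The Python sketch below (the helper name is made up) lists such bytes; whether this actually trips libsass is not established in this thread.

```python
def ascii_bytes_inside(char: str, encoding: str) -> list:
    """Return the bytes below 0x80 in the encoding of a non-ASCII character."""
    return [b for b in char.encode(encoding) if b < 0x80]


# The trailing byte of this character in Shift_JIS is 0x5C, the ASCII backslash.
print([hex(b) for b in ascii_bytes_inside("表", "shift_jis")])
# UTF-8 never reuses ASCII byte values inside a multi-byte sequence.
print(ascii_bytes_inside("表", "utf-8"))
```

A parser that treats every 0x5C byte as an escape character could therefore misread such input, even though the encoding passes for ASCII compatible on plain ASCII text.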

I guess the biggest problem is having libiconv (or some other library) as a dependency. Since it contains a lot of rules for the conversions, I see it as the only way to handle this correctly. Once that is sorted out, it should be fairly straightforward to implement the missing pieces (in parser.cpp: Parser::parse should return the encoding, a Parser::sniff_charset should be added, and the source byte stream should then be converted to UTF-8).

I hope the statements above all hold true. Unicode is really not the easiest topic to wrap your head around. But since I did all the above recently in Perl, I wanted to document it here. Feel free to extend or criticize.

Have a nice day
Marcel

@mgreter mgreter changed the title from Unicode Support by CSS Specifications [discussion] to Unicode Support by CSS Specifications Jun 5, 2014

@akhleung

akhleung commented Oct 7, 2014

My knee-jerk reaction is that I'd rather not make libiconv a dependency of LibSass itself ... it might be appropriate to put it in SassC instead, as a conversion step before invoking LibSass. If libiconv needs to be invoked on each import in a Sass project, then maybe this could be solved with custom importers (i.e., SassC could provide a callback that would pass each import through libiconv). Or, since the command-line version of iconv is installed on most Unix-like systems, we could even simply pipe each import through that before compiling (not what I would actually recommend).

@hcatlin

Member

hcatlin commented Oct 7, 2014

Yeah, iconv is a BEAST. I've spent many years wrangling with it... and never come out unscarred.

@mgreter

Contributor

mgreter commented Dec 11, 2014

I wonder if we should close this and preserve the information in a wiki page?

@xzyfer

Contributor

xzyfer commented Dec 14, 2014

I'm not sure if this is related, but it was recently reported to me that our str-slice specs are disabled, presumably due to the UTF-8 tests in the spec

str-slice("øáéíóúüñ¿éàŤDžǂɊɱʭʬѪ҈ݓ", -80, -200);

Is this related to utf-8 handling or a bug with string-functions and utf-8?
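For context, a small Python sketch of the general class of problem this question hints at: indexing a UTF-8 string by bytes versus by code points gives different results. This is only an illustration with made-up helper names; it does not diagnose the actual cause of the failing spec.

```python
def slice_code_points(text: str, start: int, end: int) -> str:
    """Slice by code points, as a string function is expected to behave."""
    return text[start:end]


def slice_utf8_bytes(text: str, start: int, end: int) -> bytes:
    """Slice the raw UTF-8 bytes, as a byte-oriented implementation would."""
    return text.encode("utf-8")[start:end]


sample = "\u00f8\u00e1\u00e9"  # the first three characters of the spec string
print(len(sample), len(sample.encode("utf-8")))  # code points vs bytes
print(slice_code_points(sample, 0, 1))
print(slice_utf8_bytes(sample, 0, 1))  # half of a two-byte sequence
```

Any offset computed in one unit and applied in the other lands in the wrong place, and can cut a multi-byte character in half.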

@HugoGiraudel

HugoGiraudel commented Dec 15, 2014

Note: LibSass only chokes with this str-slice test. All other tests for str-slice including special characters pass. All other tests about Sass string functions including special characters pass. Only this one fails.

@mgreter

Contributor

mgreter commented Mar 9, 2015

I created a wiki page to document the findings.
IMO we will not have full support in the foreseeable future!

@mgreter mgreter closed this Mar 9, 2015
