Skip to content
This repository

How should we handle encodings? #75

Closed
judofyr opened this Issue April 12, 2011 · 55 comments
Magnus Holm
Collaborator

Templates are very much based on files, which therefore means we'll have to worry about encodings. There are two ways templates can be passed to an engine: through a block (which returns a string: klass.new { 'foo' }) or through a filename.

Currently, encodings are handled as such:

  • Templates accessed through blocks are encoded in whatever encoding the string already has.
  • Templates accessed through the file systems are encoded in BINARY.

For all cases, you can pass :default_encoding => "FOO" as an option which will set the @default_encoding-ivar. In addition, if the precompiled code (when using precompiled mode) includes a magic comment (# coding: FOO) OR a @default_encoding is set, the generated code will include a magic comment before the preamble.

My thoughts

Some observation:

  • If a template is encoded in encoding FOO, we can assume the rendered string should also be in encoding FOO
  • If the template is fetched through a block, it already has a known encoding that we should respect
  • If the template is fetched from the file system, we don't know the encoding and our safest bet is to use BINARY
  • We can still give the template engine a hint of which encoding it's in (Encoding.default_external or the :default_encoding option)
  • In any case, the template engine might know better which encoding the template should be in and should be able to tell Tilt that the rendered string should be in encoding BAR.
  • The template engine shouldn't really need to worry about the file system or Encoding.default_external/internal. It should only get a template (encoded in some encoding) together with a guess from Tilt's side (which handles the file reading).

Let's go over each of the use cases and how I believe we should change Tilt:

Fetched through block

Any template that's fetched through a block already has an encoding. I think we should keep the current behaviour such that the template keeps its encoding and @default_encoding is set to options[:default_encoding] (which defaults to nil).

Fetched through file system

We don't know the encoding, so we'll read the file in binary mode and set @default_encoding to options[:default_encoding] || Encoding.default_external (this last one is different from today).

The precompiled code from the template engine

Because tilt.rb is written in UTF-8 and the precompiled code will be manipulated with literal strings from tilt.rb, the precompiled code generated by the template engine must be compatible with UTF-8 (not a code change, just a spec).

The encoding of the final generated code

The encoding of the final generated code should be decided by the template engine. There are two ways: the precompiled code is encoded in a specific encoding or the precompiled code includes a magic comment. Tilt should not add a magic comment based on @default_encoding because that's just a hint from Tilt's side and it's the engine responsibility to handle it correctly. (code change).

Why @default_encoding is only a hint

Even when you pass :default_encoding as an option (and it's not inherited from Encoding.default_external when reading from the file system) we'll have to depend on that the template engine handles it. The engine might have a different way of checking the encoding of the template, so we can't just @data.encode(@default_encoding) in case it leads to an encoding error. The default encoding is after all only that: the default encoding, which the engine should fall back to if it can't guess it.

How template engine should handle the encoding

Many simple template engine don't really need to work on the encoding-level, so from theirs point of view, this should be enough:

def compile(template, default_encoding)
  data = template.dup.force_encoding("BINARY")
  enc = encoding_specified_in_template(data) # extract magic comment or whatnot
  compiled = parse_and_compile(data)
  compiled.force_encoding(enc || default_encoding || template.encoding)
end

If you really need to be fully encoding-aware before you parse and compile, you should do something like this:

def compile(template, default_encoding)
  data = template.dup.force_encoding("BINARY")
  enc = encoding_specified_in_template(data) || default_encoding || template.encoding
  data.force_encoding(enc)
  compiled = parse_and_compile(data)
  compiled.force_encoding(enc) # You don't need this if you rather generate a magic comment
end

Your thoughs

What do you think?

Joshua Peek
josh commented April 12, 2011

Summoning @wycats. He wanted this cleaned up before Rails could adopt Tilt. I plead ignorance.

Magnus Holm
Collaborator

Calling in some other template engine hackers who might be interested:

Tilt: @rtomayko
Sinatra: @rkh
Rails: @josh
Slim: @stonean, @minad, @fredwu
Haml: @nex3
Issue #66: @julian7

Tilt is working fine at the moment, so this isn't a pressing issue; take your time and let's come up with the best solution together :-)

Magnus Holm
Collaborator

The Padrino guys might be interested too I guess: @nesquena and @DAddYE

Denny Trebbin

If you try to find file encoding (i.e. a client send you somehow, uploaded somewhere or brought to you by usb or whatnot) you may could get some help from
http://prometheus.rubyforge.org/cmess/

charset = CMess::GuessEncoding::Automatic.guess(input)

Konstantin Haase
Collaborator
rkh commented April 13, 2011

In Sinatra we adopted the "If I don't know the encoding, use UTF-8" way, which is why default_encoding was introduced in the first place. Encoding is a really hairy thing and the ruby behavior currently differs not only between Windows and Unix but also between Linux and OSX. Take into account that as soon as one string used in a template or the template itself is BINARY, Ruby will raise an exception on interpolation. It is hard already to have proper encoding for user input over HTTP (the HTTP client does not tell you what encoding form data is in). Also note that there was a common agreement on ruby-core to have 1.9.3 default to Unicode rather than making any default dependent on the OS. I would love to keep changes in a separate branch for now and have that branch intensely tested on different platforms with different settings. Targeting 1.4 is a good idea imo.

Davide D'Agostino

100% agree with @rkh, In my opinion we can wait some months and stress test it... or instead wait ruby 1.9.3.

Nathan Esquenazi

Related (extended) discussion about this same issue. Keep in mind while this started as a Padrino bug, it runs way deeper back to this issue with Tilt:

padrino/padrino-framework#519 (comment)

I think we do need to address this. As it stands templates cannot contain UTF-8 characters inside them if Encoding.default_internal is set to Encoding::UTF_8 unless an encoding is set to every single template explicitly. In Sinatra, this is hackily avoided since Encoding.default_internal is set to nil. But that decision will come back to bite us relatively quickly. Tilt needs to respect default_external and not default to parsing every template as ASCII-8BIT.

Nathan Esquenazi

Quoting @nex3 in that other thread because this seems to make the most sense:

The UTF-8-over-HTTP problem is going to exist just as much with BINARY templates as with default_external-encoded ones. Fundamentally, you just can't hide from users the fact that they must specify the encodings of their external files.

From a template engine's point of view, Rails' solution is pretty much ideal: use default_external (which defaults to UTF-8) when loading a template, but provide a hook for template engines to recognize encoding comments. If no encoding comment is found and the file isn't valid in default_external, throw an error.

Konstantin Haase
Collaborator
rkh commented May 09, 2011
Nathan Esquenazi

I will run this against the simplest failing case I outlined here: padrino/padrino-framework#519 (comment) and see if it works later and report back. Thanks for whipping up this patch.

Mislav Marohnić
mislav commented July 05, 2011

Agreed with @nesquena. Just got bit by the inability to render CoffeeScript templates which contain multibyte characters (Tilt reads them in as ASCII). Here's the monkeypatch:

Tilt::CoffeeScriptTemplate.class_eval do
  alias original_prepare prepare
  def prepare
    original_prepare
    data.force_encoding @default_encoding
  end
end
Nathan Esquenazi

Yeah this is still a lingering problem although admittedly we fixed it hackily by simply no longer setting the correct encodings in 1.9.2. This has unfortunate ramifications but it is better then the alternative from the padrino developers perspective until this gets resolved at the Tilt level.

Ryan Tomayko
Owner

/cc @apotonick, @brianmario

Some more eyeballs on this issue potentially.

Also wanted to make sure everyone saw this link from @nesquena earlier since it has some really good discussion around how much we should rely on Encoding.default_external:

padrino/padrino-framework#519

@rkh How does that patch factor in? Do you feel like that addresses the main issues with template files on disk?

Brian Lopez

I mostly agree with @nex3 on padrino/padrino-framework#519 (comment) (also quoted by @nesquena above).

Unless you use an encoding detection library like charlock_holmes, the only way to reliably know what the actual encoding of the template off the filesystem is to have the coder specify it using a magic header. Encoding.default_external should be used as the fallback, and if it blows up I think it's perfectly reasonable to raise that to the user asking them to specify the encoding of the offending template using a magic header. They should only need to do that for templates that aren't compatible with Encoding.default_external.

Then again, a valid point could be made to say that we should never raise an exception but instead just do our best to display the content - even if it ends up corrupt once sent down to the browser. Maybe that could be a configurable switch? raise_encoding_errors or something? I could go either way, but adding more config options feels like it should be a last-resort.

Ryan Tomayko
Owner

Then again, a valid point could be made to say that we should never raise an exception but
instead just do our best to display the content - even if it ends up corrupt once sent down to
the browser

That decision can still be left up to the app, though, I think. If you wanted non-draconian encoding error handling you'd set things up to use BINARY everywhere, essentially mimicking 1.8 behavior.

This raises a really good point though. If we move to Encoding.default_external as a default for templates read from disk, which is starting to feel right to me, then shit is going to blow up on existing apps since we'll be draconian by default. I'm going to want to do a major rev on the version number if that's the case.

EDIT: s/Encoding.default_encoding/Encoding.default_external/

Davide D'Agostino

@brianmario 100% agree.

Davide D'Agostino

I don't know how much cost in terms of performances reading everything in BINARY we should check this; I really hate the magic comment & c. Btw setting Encoding.default_internal or Encoding.default_external is quite simple so the major problem for the developer is to have all files with the same encoding. I think this isn't and bad behavior almost for me. So, my vote is to use Encoding.default_external.

Ryan Tomayko
Owner

I'm leaning toward @rkh's patch with maybe an additional layer where you can override default_external specifically for tilt.

  • Add Tilt.default_external_encoding, nil by default.
  • Use Encoding.default_external when Tilt.default_external_encoding is nil.
  • Always use the :default_encoding option when given.
  • Always use magic comment when found in template (guess the engine needs to support this?)

This would let you Tilt.default_external_encoding = 'utf8' if you wanted to ensure templates are treated as utf-8 but don't want to change Encoding.default_external for whatever reason. You could also set Tilt.default_external_encoding = 'BINARY' to get the current behavior.

Tilt 1.4 could ship with Tilt.default_external_encoding = 'BINARY' and Tilt 2.0 would ship with Tilt.default_external_encoding = nil.

Nathan Esquenazi

@rtomayko Yeah I think we are all on the same page. The Tilt.default_external_encoding which will eventually default to Encoding.default_external is a fairly safe approach. This allows for a smooth transition from the existing behavior. I am happy to test any branch with these adjustments in Padrino to try and reproduce the error as outlined in the old issue. If it works there with utf-8 encodings set, then that should be a good indicator.

Konstantin Haase
Collaborator

After figuring out the encoding, should we leave it by that or call String#encode (with no arguments) to convert everything to the same encoding internally? It's what Rails does and @wycats recommends, and should ease dealing with templates in different encoding (imagine templates shipping with a gem and such) but it might mess with the all-binary scenario.

Brian Lopez

FWIW the Encoding.default_external= method raises a warning with ruby -w. Not sure what ruby-core has planned for it, but that warning kinda makes me feel like they might disable the ability to set it from inside a running app (would have to be set from $LC_TYPE or $LANG I guess?)`. I don't know either way, but just wanted to make us all aware.

Ryan Tomayko
Owner

Blah. Yeah that's a whole other dimension to this I haven't considered. We may need Tilt.default_internal_encoding as well, with semantics similar to Tilt.default_external_encoding.

Davide D'Agostino

@rtomayko I suggest to keep it more strict only to Encoding.default_external, at the moment I don't know when can be useful have two separate encodings, but is true that we don't break compatibility.

Always use magic comment when found in template (guess the engine needs to support this?)

As said I don't love this, but maybe others would like to use it.

Nathan Esquenazi

Interesting, wasn't aware of that warning when setting default_external. We would still be able to access the default encoding type through that even if it isn't expected to be set explicitly. Relying on the default external encoding and encoding everything to that internally strikes me as the most consistent approach that is least likely to be an annoyance for the end user developer as @rkh noted above.

/cc @raggi (would be interested to hear your thoughts)

Magnus Holm
Collaborator

Always use magic comment when found in template (guess the engine needs to support this?)

Instead of this I think we should just have a method on Tilt::Template (detect_encoding maybe) which returns nil by default and can be overridden by the template mapper. It's up to the template mapper to decide if it wants to use magic comments or if it just defaults to BINARY.

Magnus Holm
Collaborator

After figuring out the encoding, should we leave it by that or call String#encode (with no arguments) to convert everything to the same encoding internally? It's what Rails does and @wycats recommends, and should ease dealing with templates in different encoding (imagine templates shipping with a gem and such) but it might mess with the all-binary scenario.

Well, the output from the template engine isn't guaranteed to be a string. It could be an object or an Array (think Rack-streaming) instead. Beside, it's pretty simple for anyone who uses Tilt to add an #encode at the end.

Konstantin Haase
Collaborator

By the way: "@konstantinhaase any reason not to just extract what rails is doing? I spent a lot of time on it and don't want to deal with the bugs again." - @wycats - http://twitter.com/wycats/status/113458749854842880

Ryan Tomayko
Owner

@rkh Which is ... ?

Davide D'Agostino

Bugs?

Konstantin Haase
Collaborator
Brian Lopez

+1 to what rails is doing

Ryan Tomayko
Owner

@judofyr Thanks for running that down. Most of AV's behavior seems to make good sense with Tilt, although I think we need a way to enable / disable some of it.

To summarize AV's behavior:

Determining template source encoding:

  • Assume template source is encoded in Encoding.default_external.
  • Use generic Ruby style encoding comment (supported by all templates) if supplied in template source.
  • Use template engine specific encoding comment if supplied in template source.
  • Raise exception when template source is not valid in the determined encoding.

Template source transcoding:

  • Handled by AV before the template source is handed off to template engine regardless of where string comes from.
  • Force template source data string to determined encoding with String#force_encoding.
  • Convert to Encoding.default_internal with String#encode!().
  • Raise exception when template source cannot be converted.

Template source to Ruby source conversion:

  • This is relevant only to source generating templates like ERB
  • Pass template source encoded as Encoding.default_internal into template specific handler.
  • Template handler converts the template source to ruby source.
  • Handlers must return the ruby source string encoded as default_internal.
  • AV also includes a safe-measure call to encode!() to guarantee the ruby source is default_internal.
  • Create compiled render method via call to module_eval.

All that's left is calling the method, which should return a string encoded as default_internal.

AV seems to assume Encoding.default_internal will be set. Tilt needs to work when default_internal is nil, which complicates things somewhat.

One thing that absolutely should happen is: for anything that tilt reads from disk using an IO object it creates, do something like:

data.force_encoding(magic || default_encoding || Encoding.default_external)
data.encode!

Note that the transcode step is a noop when Encoding.default_internal is nil.

Some other things we should consider supporting:

  • The generic magic comment syntax. Would cause force_encoding(encoding) on all template source strings when detected.

  • Raising an exception when template source data is not valid in its encoding (configurable).

The main issue I'm struggling with in all this is conversion to default_internal of template source strings returned from a reader block. I really hoped we could punt that responsibility to the caller but it doesn't work out. Here's my reasoning:

  1. Evaluating all templates in utf-8 / default_internal regardless of external encoding is a worthwhile feature and makes sense for many environments.

  2. You can't get that behavior from IO's normal external to internal conversion because of magic comments (especially template specific ones).

  3. The caller can't be responsible for conversion because they might not know about template specific magic comments either, nor will they want to deal with them even when they do know.

  4. The only option remaining is for Tilt to be responsible for performing external to internal conversion.

Essentially I think we should use Encoding.default_internal as a flag that says "always convert template source from the detected template source encoding to default_internal" regardless of where the source comes from. Perhaps additionally an option is needed to turn that behavior off at the Template instance level; otherwise, evaluating templates under other encodings becomes impossible when Encoding.default_internal is set.

Joshua Peek

To some extent, encodings should be handled individually by each handler. Not saying tilt shouldn't do anything, but I think each handler class should opt into the magic comment syntax and automatic transcoding. Not everything needs this. The coffee handler only works with utf-8, you can't write CS in anything else. Also sass/less can handle encodings via @charset css directive.

Think the important part is defining the contact for how all handlers should be expected to behave.

Ryan's outline looks good.

Ryan Tomayko
Owner

@josh Good to know. Yeah, I think we can get some basic stuff into Template without too much work but then we're really going to need to audit each template engine individually.

Nick Sutterer
Ryan Tomayko
Owner

A more concrete proposal with code: #107

Pavel Argentov

Sorry guys that I suddenly appear from my bear lair (actually I've encountered the problem yesterday and saw that almost everybody stopped discussing it by May), but ... can anybody say how to conquer this issue when I have it with Padrino stack? I use haml/markdown/ruby_vars in UTF-8. How to settle them all?

Nathan Esquenazi

@argent-smith Try:

# config/boot.rb
Padrino.after_load do
  Encoding.default_internal = nil
end

does that help?

Pavel Argentov

It helped until I used the single UTF-8 encoded variable. So now I have UTF-8 haml rendered well and "ncompatible character encodings: UTF-8 and ASCII-8BIT" when I try to do "= utf8_string_returning_method" in the haml template.

Nathan Esquenazi

The next step would be to try and create a simple stripped down test case of this failure, preferably in Sinatra as a reference. Then if you can, check out this tilt patch from @rtomayko: https://github.com/rtomayko/tilt/pull/107/files, and see if with that applied the issue goes away.

Pavel Argentov

I'll finish the test suite by tomorrow.

Pavel Argentov

See my test suite argent-smith/sinatra-test.

It looks like on clean Sinatra with stock Tilt the issue isn't reproduced: everything goes just fine. I'll make yet another test suite with Padrino and tell you what happens.

Pavel Argentov

Well, after adding markdown to my test I've reproduced the issue. Actually markdown was in my app when I've first encountered the issue. The error doesn't depend on a particular markdown implementation: I've tried kramdown and redcarpet with the same result. For the further details see my test suite mentioned above.

Nathan Esquenazi

Great, I was hoping you could isolate this test with sinatra alone, since Padrino doesn't really mess with encodings. So the above repo fails with stock Tilt with Sinatra? Can you try again with the patched Tilt that Ryan has a pull request for and see if that makes the issue go away?

Pavel Argentov

I'll try it ASAP

Pavel Argentov

@nesquena @rtomayko I've got a conflict when tried to merge encodings to master: see /lib/tilt/template.rb in gist 1374322. Would you tell me how to resolve it?

Damian Janowski

So... are we going to honor Encoding.default_external?

Pavel Argentov

@djanowski doesn't help in my case (xcuse my stupidity ;-)

Aleksander Pohl
apohllo commented May 25, 2012

Any progress with that issue?

Konstantin Haase rkh referenced this issue in sinatra/sinatra August 31, 2012
Closed

Encoding problem in coffee-script #559

Eero Saynatkari
rue commented October 06, 2012

Is this seriously still unsolved? What do you need to get this done, I’ll help?

Ryan Tomayko
Owner

@rue Yes. I'd say the work is maybe half done and it's going to require a major rev to release due to the change in behavior. #107 is IMO still the furthest we've got on this and its not being actively worked on. I assume it's suffered from bit rot over the last year as well.

The docs added at the top of https://github.com/rtomayko/tilt/pull/107/files give a basic roadmap of what work remains. The two biggest issues are implementing the :transcode option and auditing every single template engine. If you'd like to help with either, fork and send a pull request with the encodings branch as the base.

Gopal Patel nixme referenced this issue in fnando/i18n-js November 12, 2012
Open

New release coming soon #89

Alex Peattie alexpeattie referenced this issue in gettalong/kramdown March 07, 2013
Closed

Unicode problem (I guess) #38

Daniel Mendler
minad commented March 12, 2013

@judofyr Any ideas how this will evolve?

Vitaly A. Sorokin

I've just ran into this issue and I will just leave a workaround here that worked for me.
At the top of an .erb file: <% # encoding: utf-8 %>

Magnus Holm judofyr closed this April 28, 2013
Magnus Holm
Collaborator

Initial implementation merged in #175. Open a new issue if you have any problems.

Daniel Mendler minad referenced this issue in judofyr/temple August 16, 2011
Closed

Encoding handling #42

Daniel Mendler minad referenced this issue in slim-template/slim March 23, 2012
Closed

Encoding handling #166

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.