Skip to content

Encoding Support #107

Closed
wants to merge 11 commits into from
@rtomayko
Owner

Taking a crack at the encoding problem (#75). The code I have written so far is pretty close to @judofyr's original write up on that issue. The Template class has been modified somewhat and I'm using the ERBTemplate, BuilderTemplate, and CoffeeScriptTemplate implementations to exercise the requirements.

I have a section in the README on encodings to spec out the expected behavior and general strategy. There's an overviewy part and then it describes how templates should think about template source data:

Template Source Encoding

The template source data may come from a file or from a string. In either case, the real template source encoding should be determined as follows in order of preference:

  • Template specific encoding rules (e.g., utf-8 only formats).
  • A (template specific) magic encoding comment embedded in the source string.
  • The source string's existing encoding (string only).
  • The :default_encoding option to Template.new (file only).
  • Encoding.default_external - the default system encoding (file only)

Some template file formats have strict encoding requirements. CoffeeScript is a utf-8 only format for instance. Template implementations are encouraged to use this type of information to constrain the detection logic defined above.

This part feels pretty solid to me. The ERB/Erubis, Builder, and CS templates all adhere to these rules and document their behavior when the source encoding detection process is somehow constrained.

This next part I've only prose and no code. It's a lot more shaky:

Render Context Encoding

When the system internal encoding (Encoding.default_internal) is not set (MRI default), templates should be evaluated and produce a result string encoded the same as the template source data. e.g., A Big5 encoded template on disk will generate a Big5 result string and expect interpolated values to be Big5 compatible.

When Encoding.default_internal is set, templates should be converted from the template source encoding to the internal encoding before being compiled / evaluated and the result string should be encoded in the default internal encoding. For instance, when default_internal is set to UTF-8, a Big5 encoded template on disk will be converted to and generate a UTF-8 result string and interpolated values must be utf-8 compatible.

I do like this behavior in theory. It's compatible with ActionView and seems consistent with the spirit of Encoding.default_internal. However, the caller needs the ability to override these behaviors for cases where default_internal cannot be changed but the caller knows render context transcoding is needed or not needed. For that I think we should add another option:

Templates that perform render context transcoding must allow these default behaviors to be controlled via the :transcode option:

  • :transcode => true - Convert from template source encoding to the system default internal encoding before evaluating the template. The result string is guaranteed to be in the default internal encoding. Do nothing when Encoding.default_internal is nil.

    This is the default behavior when no :transcode option is given.

  • :transcode => false - Perform no encoding conversion. The result string will have the same encoding as the detected template source string.

    This is the default behavior when default_internal is nil.

  • :transcode => 'utf-8' - Ignore default_internal. Instead, convert from template source encoding to utf-8 before evaluating the template. The result string is guaranteed to be utf-8 encoded. The encoding value ('utf-8') may be any valid encoding name or Encoding constant.

My plan is to start working on the default transcoding behavior + :transcode option for all of the templates I've touched so far. What do you guys think?

/cc @judofyr, @josh, @rkh, @josevalim, @apotonick, @nesquena, @brianmario, @DAddYE, everyone ...

added some commits Sep 18, 2011
@rtomayko revise and define more strictly the default_encoding option
The default_encoding option takes effect only when tilt reads
template data from the filesystem. Templates provided via custom
reader block are assumed to be tagged with a best guess encoding
already.

It's also worth noting that, unlike File.read, the default file
reader does not perform Encoding.default_internal transcoding. The
string is marked with the default_encoding or the system encoding
(Encoding.default_external) but no transcoding is performed. This is
because magic comments or template specific encoding settings are
not yet available.
d602069
@rtomayko generate ruby source in same encoding as template source data e18b99c
@rtomayko specify template source encoding behavior in README, tests ee65ca9
@rtomayko ERB templates adhere to source template encoding behavior 35e5a48
@rtomayko refine ruby source comment extraction utility methods 7ef2de5
@rtomayko Builder template adheres to source template encoding behavior 9c0ecf8
@rtomayko start to spec out :transcode behavior a little
None of this is happening in the code yet. I'm just trying to figure
out what it might look like.
fcb9a1e
@rtomayko CoffeeScript template requires utf-8 input, generates utf-8 output only
All of the various input and output overrides are ignored
essentially.
b73c7da
@rtomayko reorg encoding spec in README f36cc18
@rtomayko typos abound in binary method comments ae8900b
@rtomayko fix markdown bullet indent in README 9a474c5
@josh josh commented on the diff Sep 19, 2011
lib/tilt/builder.rb
@@ -14,7 +35,10 @@ module Tilt
require_template_library 'builder'
end
- def prepare; end
+ def prepare
+ return if !data.respond_to?(:to_str)
+ @source = assign_source_encoding(data.to_str)
@josh
josh added a note Sep 19, 2011

I like the opt-in for assign_source_encoding.

@rtomayko
Owner
rtomayko added a note Sep 19, 2011

Yeah. I'm going with an approach where the base class provides some convenience APIs but the behavior is more or less in the hands of the template subclass. I don't think there's any other way since template's vary so much.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@josh josh commented on the diff Sep 19, 2011
README.md
+ - `:transcode => true` - Convert from template source encoding to the system
+ default internal encoding (`Encoding.default_internal`) before evaluating the
+ template. The result string is guaranteed to be in the default internal
+ encoding. Do nothing when `Encoding.default_internal` is nil.
+
+ This is the default behavior when no `:transcode` option is given.
+
+ - `:transcode => false` - Perform no encoding conversion. The result string
+ will have the same encoding as the detected template source string.
+
+ This is the default behavior when `Encoding.default_internal` is nil.
+
+ - `:transcode => 'utf-8'` - Ignore `Encoding.default_internal`. Instead,
+ convert from template source encoding to utf-8 before evaluating the
+ template. The result string is guaranteed to be utf-8 encoded. The encoding
+ value (`'utf-8'`) may be any valid encoding name or Encoding constant.
@josh
josh added a note Sep 19, 2011

I don't think this :transcode option is implemented yet. Is this going to be set as an option on initialize or supposed to be passed each time to the render method?

We might be able to provide some of these things for the handler by adding them to the default render implementation. The before evaluating step doesn't always exist for every handler. For an example, coffee, sass and markdown are contextless and ignore the locals option. So these handlers only need to call encode on the final string they produce. So we might be able to move that final encode to our render. But I'm not sure if thats entirely a good idea.

@rtomayko
Owner
rtomayko added a note Sep 19, 2011

Nope I haven't started in on any of the transcoding yet.

I hadn't even considered making this an option to render. I figured we'd want it on initialize and also maybe as a Template class attribute that can be overridden by subclasses. Having it on render could be interesting. I'm not sure how much it'd be used though. I think the 99% case for transcoding is going to be people setting Encoding.default_internal = 'utf-8' or wanting the same effect when running templates.

@josh
josh added a note Sep 19, 2011

Don't really want to put it on render cause we have no options arg. Passing to initialize makes the most sense.

We could provide some sort of Template#encode() helper that encodes the string to options[:transcode] || default_internal. So handlers can call that before they return the final result.

@rtomayko
Owner
rtomayko added a note Sep 19, 2011

Exactly what I'm thinking. It needs to happen before evaluation for context / interpolating templates though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@rtomayko
Owner

Related: #75

@joshbuddy

@rtomayko Any update on this?

@rtomayko
Owner
rtomayko commented Jan 4, 2012

@joshbuddy: I haven't had a second to look at this for a long while now, unfortunately. Hopefully things will settle down soon. I'd definitely like to get this closed out.

@joshbuddy

@rtomayko No worries. Just wanted to make sure it didn't get lost in the shuffle.

@morgoth
morgoth commented Feb 22, 2012

I have same problems with template encodings.

I tested encoding branch and it works fine for me.

Any update on this issue?

@conanbatt

This is a pretty serious issue for my application. I cannot support internationalization and users can easily break pages if they put the wrong characters.

Is there a quickfix or a monkey-patch i can do to fix this? adding the magic comment to the erb did not work. I've been waiting 2 months for this merge :S.

@timhaines

Rail's asset pipeline fails on templates with unicode in them due to the lack of encoding support right?

@timhaines

FWIW The encodings branch of the gem worked for me.

@conanbatt

Yes that was my solution also.

@matiaskorhonen

What's the status on this? Just wondering why the encodings branch hasn't gotten merged into master?

@konklone
konklone commented May 9, 2012

I too would love these changes, having some related breakages on my end - any chance these'll be merged into master soon?

@mariosant mariosant referenced this pull request in middleman/middleman Jan 20, 2013
Closed

Bug with encoding #738

@rkh
Collaborator
rkh commented Feb 27, 2013

What's the status of this?

Rails, Sinatra and Ruby 2.0 default everything to UTF-8.

@judofyr
Collaborator
judofyr commented Feb 27, 2013

I'm going to revisit this for the 1.4 release. I'm planning on implementing (or, copying from this branch) the minimal code we need for supporting encodings. No transcoding or funky stuff, but something that makes UTF-8 templates work out-of-the-box for 99% of all users.

@Wardrop
Wardrop commented Mar 4, 2013

Please do @judofyr, I've just hit this in a SCSS template I have which contains a single UTF-8 character. I've worked around it in my framework by simply doing a manual File#read in the block I send to #new. It seems File#binread otherwise used by Tilt always defaults to ASCII-8BIT, even on Ruby 2.0.0.

@judofyr
Collaborator
judofyr commented Mar 4, 2013

@Wardrop: Can you try the branch in #175 and see if it fixes your problem?

@judofyr
Collaborator
judofyr commented Apr 28, 2013

#175 has now been merged.

@judofyr judofyr closed this Apr 28, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.