Implement minimal encoding support #175

Merged
merged 6 commits into from

7 participants

@judofyr
Collaborator

Oh boy, it's this time again. I want to fix Tilt's encoding support so that it's at least usable. This PR tries to handle encodings correctly within Tilt. I've excluded transcoding because it's fairly simple to implement this outside of Tilt.

The encoding of data

The encoding of the data is done in two "steps". First we try to guess the encoding, then we check if the Template-implementation wants to override it.

Step 1: If you're using a custom reader (Template.new { foobar }) we already know the encoding. If you're reading a file, then we read it as binary and force_encoding it to default_external (to avoid raising exceptions).

Step 2: There's a method called #default_encoding. If this returns a truthy value, we force_encoding the data to this encoding. By default it simply looks up the :default_encoding-option, but template engines can override this with custom behavior (e.g. always return UTF-8 for CoffeeScript).

This means that the encoding of the data has/is:

  • Sensible defaults (Encoding.default_external)
  • Overridable by the user (through the :default_encoding-option)
  • Overridable by the template engine (through #default_encoding)
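
The two steps above can be sketched in plain Ruby. This is only an illustration of the described behavior, not Tilt's actual code: `load_template_data` and its keyword arguments are hypothetical names, and the template-engine override is reduced to a single `default_encoding` argument.

```ruby
# Hypothetical helper illustrating the two-step encoding resolution.
def load_template_data(path: nil, default_encoding: nil, &reader)
  data =
    if reader
      # Step 1a: a custom reader returns a string that already has an encoding.
      # (Copied here so the sketch never mutates the caller's string.)
      reader.call.dup
    else
      # Step 1b: read raw bytes, then tag them as default_external.
      # force_encoding only relabels the bytes; it never raises or transcodes.
      File.binread(path).force_encoding(Encoding.default_external)
    end

  # Step 2: a truthy default encoding (user option or engine override) wins.
  data.force_encoding(default_encoding) if default_encoding
  data
end
```

With no override the file data is tagged as `Encoding.default_external`; passing `default_encoding: 'GBK'` relabels the same bytes as GBK without changing them.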

The encoding of Tilt's generated source code

Tilt combines the output from the template engine with its own "add local variables and define it as an unbound method for performance". It's the encoding of this source code that determines the encoding that the code runs under (e.g. the value of "".encoding).
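
A small sketch of why the source encoding matters, using plain `eval` rather than Tilt's generated method (an assumption made to keep the example self-contained). The same source text is evaluated twice, tagged with different encodings:

```ruby
# The same Ruby source text, tagged with two different encodings.
code = '"".encoding'

# eval uses the encoding of the source string as the source encoding,
# so string literals inside the evaluated code inherit it.
as_utf8 = eval(code.dup.force_encoding('UTF-8'))
as_sjis = eval(code.dup.force_encoding('Shift_JIS'))

puts as_utf8 # UTF-8
puts as_sjis # Shift_JIS
```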

In this PR we use the same encoding for Tilt's source code as the template engine's source code (the result of precompiled_template). For legacy reasons I've added support for extracting magic comments, so Tilt behaves the same whether a template engine returns "foo".force_encoding('…') or "# coding: …\nfoo" (although the former is preferred).

The encoding of the final string will (most likely) be the same as the source code encoding (although technically the source code can return whatever string with whatever encoding it likes).

Thoughts?

Is this good enough for now? Will this solve your problems?

/cc @josh, @rtomayko, @mislav, @djanowski, @rue, @nesquena, @DAddYE, @apotonick, @brianmario, @rkh, @apohllo, @argent-smith, @fibric

@brianmario

I like what's here so far. I think it offers enough flexibility to cover the most common cases where someone might have a set of templates in mixed encodings.

Something that's always going to be hard is actually knowing the encoding of the file being read off disk. I've seen plenty of cases where people either forget or don't realize they have to put a magic comment in their files. Ideally that wouldn't be required anyway.

I know it would potentially add some headache (especially on Heroku) but would you all be opposed to adding charlock_holmes as a dependency and using that to make a much more accurate encoding detection of the file on disk? That way at least the string would be tagged with the correct encoding and Ruby's encoding implementation could help out with the rest of the hard work that comes along with concatenation.

Although I think using charlock_holmes would work best, requiring libicu as a dependency might be a bit heavy for most users.

@josh

I don't think charlock_holmes should be a hard dependency, but it'd be nice if we had a story to extend it neatly into the detection process.

@judofyr
Collaborator

The easiest way to do it now would be to just use a different reader:

reader = proc do |t|
  File.read(t.file).detect_encoding!
end

However, this has to be provided by the library/framework that uses Tilt.
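
For comparison, a dependency-free reader could try a fixed list of candidate encodings instead of real detection. This is a much cruder stand-in for charlock_holmes, and `CANDIDATES` and `read_with_guess` are hypothetical names for this sketch; only `t.file` comes from the snippet above.

```ruby
# Candidate encodings, tried in order. Order matters: pure ASCII data is
# valid in both, so it is reported as the first candidate.
CANDIDATES = [Encoding::UTF_8, Encoding::Shift_JIS]

# Hypothetical helper: tag the raw bytes with the first encoding they are valid in.
def read_with_guess(path, candidates = CANDIDATES)
  bytes = File.binread(path)
  candidates.each do |enc|
    bytes.force_encoding(enc)
    return bytes if bytes.valid_encoding?
  end
  # Nothing matched: leave the data tagged as binary.
  bytes.force_encoding(Encoding::BINARY)
end

reader = proc do |t|
  read_with_guess(t.file)
end
```

Validity is a weak signal (many byte sequences are valid in several encodings), so this only helps when the set of plausible encodings is small and known in advance.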

@judofyr
Collaborator

I see the value of having more "magic" in Tilt to alleviate some of the encoding problems, but I don't feel we have a safe way to implement it now. Tilt is currently very global; all mappings (and prefers) are registered in a global namespace. We can't really change these semantics in Tilt because it would change behavior for every user of Tilt. For Tilt 2.0 I want to add a more granular API that makes it possible to opt in to features without changing Tilt's behavior globally.

For Tilt 1.4 I'm mostly looking into making encodings suck less. Then I want to structure Tilt 2.0 so that we can make encodings awesome in later releases without breaking anything.

@judofyr
Collaborator

Actually, considering that this turned out to be a much smaller patch, we might want to consider releasing this as 1.3.5 and jump right to 2.0 later.

@josh

> For Tilt 2.0 I want to add a more granular API that makes it possible to opt-in to features without changing Tilt's behavior globally.

Definite :+1: in that direction. That's basically my use case with sprockets. I have forks of most of the template engines I care about just to get that level of control.

@rkh
Collaborator

-1 on charlock_holmes as a hard dependency, due to it depending on icu and being a C extension.

@judofyr judofyr referenced this pull request
Closed

Encoding Support #107

@rtomayko
Owner

This looks like a great step to me as well. :+1:

@nesquena

I looked through the proposed patch and tested this version of tilt against my original issues and it seems to have solved our issues for the most part. :+1: Thanks @judofyr for putting this together.

@judofyr
Collaborator

I tweaked it a bit (it now raises an error if you initialize it with an incorrect encoding) and added some documentation.
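
The validity check boils down to `String#valid_encoding?`. A minimal sketch, with `check_encoding!` as a hypothetical helper name and the error message modeled on the one in the diff:

```ruby
# Hypothetical helper: reject data whose bytes do not match its encoding tag.
def check_encoding!(data, name)
  unless data.valid_encoding?
    raise Encoding::InvalidByteSequenceError, "#{name} is not valid #{data.encoding}"
  end
  data
end

check_encoding!("stuff", "hello.erb") # passes and returns the string

begin
  # A lone 0xE4 byte is not a complete UTF-8 sequence.
  check_encoding!("\xe4".dup.force_encoding('UTF-8'), "hello.erb")
rescue Encoding::InvalidByteSequenceError => e
  puts e.message # hello.erb is not valid UTF-8
end
```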

@djanowski

@judofyr Great to see this issue revived. The PR description seems to make sense. It's everything I've ever wanted.

@djanowski djanowski commented on the diff
lib/tilt/template.rb
((6 lines not shown))
@data = @reader.call(self)
+
+ if @data.respond_to?(:force_encoding)

Just wondering. Can't we just use File.read options here? Like: File.read(@file, encoding: default_encoding)

@judofyr Collaborator
judofyr added a note

Because Ruby will try to encode it to Encoding.default_internal.

Right, right. Thanks.
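
The difference can be seen in a short experiment. ISO-8859-1 is used here instead of the encodings from the thread purely because a single byte is enough to show the effect; the file contents and variable names are assumptions for this sketch.

```ruby
require 'tempfile'

file = Tempfile.new('template')
file.binmode
file.write("\xE9".b) # the single byte for "é" in ISO-8859-1
file.close

saved_internal = Encoding.default_internal
begin
  Encoding.default_internal = 'UTF-8'

  # With default_internal set, File.read transcodes on the way in.
  transcoded = File.read(file.path, encoding: 'ISO-8859-1')

  # Binary read plus force_encoding only relabels the original bytes.
  untouched = File.binread(file.path).force_encoding('ISO-8859-1')
ensure
  Encoding.default_internal = saved_internal
  file.unlink
end

p transcoded.encoding, transcoded.bytes # UTF-8, [195, 169]
p untouched.encoding, untouched.bytes   # ISO-8859-1, [233]
```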

@judofyr judofyr merged commit a6789fa into from
Commits on Mar 2, 2013
  1. @judofyr
  2. @judofyr
Commits on Mar 6, 2013
  1. @judofyr

    We don't need to encode! preamble/postamble:

    judofyr authored
    Ruby will do the conversion for us when we append.
  2. @judofyr
  3. @judofyr
Commits on Mar 24, 2013
  1. @judofyr

    Doc tweaks

    judofyr authored
Showing with 185 additions and 24 deletions.
  1. +21 −0 README.md
  2. +67 −24 lib/tilt/template.rb
  3. +97 −0 test/tilt_template_test.rb
21 README.md
@@ -194,6 +194,27 @@ template, but if you depend on a specific implementation, you should use #prefer
When a file extension has a preferred template class, Tilt will *always* use
that class, even if it raises an exception.
+Encodings
+---------
+
+Tilt needs to know the encoding of the template in order to work properly:
+
+Tilt will use `Encoding.default_external` as the encoding when reading external
+files. If you're mostly working with one encoding (e.g. UTF-8) we *highly*
+recommend setting this option. When providing a custom reader block (`Tilt.new
+{ custom_string }`) you'll have to ensure the string is properly encoded yourself.
+
+Most of the template engines in Tilt also allow you to override the encoding
+using the `:default_encoding`-option:
+
+```ruby
+tmpl = Tilt.new('hello.erb', :default_encoding => 'Big5')
+```
+
+Ultimately it's up to the template engine how to handle the encoding: It might
+respect `:default_encoding`, it might always assume it's UTF-8 (like
+CoffeeScript), or it can do its own encoding detection.
+
Template Compilation
--------------------
91 lib/tilt/template.rb
@@ -65,11 +65,37 @@ def initialize(file=nil, line=1, options={}, &block)
@default_encoding = @options.delete :default_encoding
# load template data and prepare (uses binread to avoid encoding issues)
- @reader = block || lambda { |t| File.respond_to?(:binread) ? File.binread(@file) : File.read(@file) }
+ @reader = block || lambda { |t| read_template_file }
@data = @reader.call(self)
+
+ if @data.respond_to?(:force_encoding)
+ @data.force_encoding(default_encoding) if default_encoding
+
+ if !@data.valid_encoding?
+ raise Encoding::InvalidByteSequenceError, "#{eval_file} is not valid #{@data.encoding}"
+ end
+ end
+
prepare
end
+ # The encoding of the source data. Defaults to the
+ # default_encoding-option if present. You may override this method
+ # in your template class if you have a better hint of the data's
+ # encoding.
+ def default_encoding
+ @default_encoding
+ end
+
+ def read_template_file
+ data = File.open(file, 'rb') { |io| io.read }
+ if data.respond_to?(:force_encoding)
+ # Set it to the default external (without verifying)
+ data.force_encoding(Encoding.default_external) if Encoding.default_external
+ end
+ data
+ end
+
# Render the template in the given scope with the locals specified. If a
# block is given, it is typically available within the template via
# +yield+.
@@ -156,26 +182,29 @@ def evaluate(scope, locals, &block)
def precompiled(locals)
preamble = precompiled_preamble(locals)
template = precompiled_template(locals)
- magic_comment = extract_magic_comment(template)
- if magic_comment
- # Magic comment e.g. "# coding: utf-8" has to be in the first line.
- # So we copy the magic comment to the first line.
- preamble = magic_comment + "\n" + preamble
+ postamble = precompiled_postamble(locals)
+ source = ''
+
+ # Ensure that our generated source code has the same encoding as
+ # the source code generated by the template engine.
+ if source.respond_to?(:force_encoding)
+ template_encoding = extract_encoding(template)
+
+ source.force_encoding(template_encoding)
+ template.force_encoding(template_encoding)
end
- parts = [
- preamble,
- template,
- precompiled_postamble(locals)
- ]
- [parts.join("\n"), preamble.count("\n") + 1]
+
+ source << preamble << "\n" << template << "\n" << postamble
+
+ [source, preamble.count("\n")+1]
end
# A string containing the (Ruby) source code for the template. The
- # default Template#evaluate implementation requires either this method
- # or the #precompiled method be overridden. When defined, the base
- # Template guarantees correct file/line handling, locals support, custom
- # scopes, and support for template compilation when the scope object
- # allows it.
+ # default Template#evaluate implementation requires either this
+ # method or the #precompiled method be overridden. When defined,
+ # the base Template guarantees correct file/line handling, locals
+ # support, custom scopes, proper encoding, and support for template
+ # compilation.
def precompiled_template(locals)
raise NotImplementedError
end
@@ -212,8 +241,13 @@ def compiled_method(locals_keys)
def compile_template_method(locals)
source, offset = precompiled(locals)
method_name = "__tilt_#{Thread.current.object_id.abs}"
- method_source = <<-RUBY
- #{extract_magic_comment source}
+ method_source = ""
+
+ if method_source.respond_to?(:force_encoding)
+ method_source.force_encoding(source.encoding)
+ end
+
+ method_source << <<-RUBY
TOPOBJECT.class_eval do
def #{method_name}(locals)
Thread.current[:tilt_vars] = [self, locals]
@@ -234,13 +268,22 @@ def unbind_compiled_method(method_name)
method
end
+ def extract_encoding(script)
+ extract_magic_comment(script) || script.encoding
+ end
+
def extract_magic_comment(script)
- comment = script.slice(/\A[ \t]*\#.*coding\s*[=:]\s*([[:alnum:]\-_]+).*$/)
- if comment && !%w[ascii-8bit binary].include?($1.downcase)
- comment
- elsif @default_encoding
- "# coding: #{@default_encoding}"
+ binary script do
+ script[/\A[ \t]*\#.*coding\s*[=:]\s*([[:alnum:]\-_]+).*$/n, 1]
end
end
+
+ def binary(string)
+ original_encoding = string.encoding
+ string.force_encoding(Encoding::BINARY)
+ yield
+ ensure
+ string.force_encoding(original_encoding)
+ end
end
end
97 test/tilt_template_test.rb
@@ -1,3 +1,4 @@
+# coding: utf-8
require 'contest'
require 'tilt'
require 'tempfile'
@@ -165,4 +166,100 @@ def initialize(name)
inst = SourceGeneratingMockTemplate.new { |t| 'Hey #{CONSTANT}!' }
assert_equal "Hey Bob!", inst.render(Person.new("Joe"))
end
+
+ ##
+ # Encodings
+
+ class DynamicMockTemplate < MockTemplate
+ def precompiled_template(locals)
+ options[:code]
+ end
+ end
+
+ class UTF8Template < MockTemplate
+ def default_encoding
+ Encoding::UTF_8
+ end
+ end
+
+ if ''.respond_to?(:encoding)
+ original_encoding = Encoding.default_external
+
+ setup do
+ @file = Tempfile.open('template')
+ @file.puts "stuff"
+ @file.close
+ @template = @file.path
+ end
+
+ teardown do
+ Encoding.default_external = original_encoding
+ Encoding.default_internal = nil
+ @file.delete
+ end
+
+ test "reading from file assumes default external encoding" do
+ Encoding.default_external = 'Big5'
+ inst = MockTemplate.new(@template)
+ assert_equal 'Big5', inst.data.encoding.to_s
+ end
+
+ test "reading from file with a :default_encoding overrides default external" do
+ Encoding.default_external = 'Big5'
+ inst = MockTemplate.new(@template, :default_encoding => 'GBK')
+ assert_equal 'GBK', inst.data.encoding.to_s
+ end
+
+ test "reading from file with default_internal set does no transcoding" do
+ Encoding.default_internal = 'utf-8'
+ Encoding.default_external = 'Big5'
+ inst = MockTemplate.new(@template)
+ assert_equal 'Big5', inst.data.encoding.to_s
+ end
+
+ test "using provided template data verbatim when given as string" do
+ Encoding.default_internal = 'Big5'
+ inst = MockTemplate.new(@template) { "blah".force_encoding('GBK') }
+ assert_equal 'GBK', inst.data.encoding.to_s
+ end
+
+ test "uses the template from the generated source code" do
+ tmpl = "ふが"
+ code = tmpl.inspect.encode('Shift_JIS')
+ inst = DynamicMockTemplate.new(:code => code) { '' }
+ res = inst.render
+ assert_equal 'Shift_JIS', res.encoding.to_s
+ assert_equal tmpl, res.encode(tmpl.encoding)
+ end
+
+ test "uses the magic comment from the generated source code" do
+ tmpl = "ふが"
+ code = ("# coding: Shift_JIS\n" + tmpl.inspect).encode('Shift_JIS')
+ # Set it to an incorrect encoding
+ code.force_encoding('UTF-8')
+
+ inst = DynamicMockTemplate.new(:code => code) { '' }
+ res = inst.render
+ assert_equal 'Shift_JIS', res.encoding.to_s
+ assert_equal tmpl, res.encode(tmpl.encoding)
+ end
+
+ test "uses #default_encoding instead of default_external" do
+ Encoding.default_external = 'Big5'
+ inst = UTF8Template.new(@template)
+ assert_equal 'UTF-8', inst.data.encoding.to_s
+ end
+
+ test "uses #default_encoding instead of current encoding" do
+ tmpl = "".force_encoding('Big5')
+ inst = UTF8Template.new(@template) { tmpl }
+ assert_equal 'UTF-8', inst.data.encoding.to_s
+ end
+
+ test "raises error if the encoding is not valid" do
+ assert_raises(Encoding::InvalidByteSequenceError) do
+ UTF8Template.new(@template) { "\xe4" }
+ end
+ end
+ end
end