Add support for transliteration to ASCII.
norman committed Apr 28, 2010
1 parent 9c82841 commit 928fdb4
Showing 8 changed files with 271 additions and 11 deletions.
19 changes: 11 additions & 8 deletions README.textile
@@ -7,6 +7,7 @@ Features:
* translation and localization
* interpolation of values to translations (Ruby 1.9 compatible syntax)
* pluralization (CLDR compatible)
* customizable transliteration to ASCII
* flexible defaults
* bulk lookup
* lambdas as translation data
@@ -35,25 +36,25 @@ gem install i18n

h3. Installation on Rails < 2.3.5 (deprecated)

Up to version 2.3.4 Rails will not accept i18n gems > 0.1.3. There is an unpacked
gem inside of active_support/lib/vendor which gets loaded unless gem 'i18n', '~> 0.1.3'.
This requirement is relaxed in "6da03653":http://github.com/rails/rails/commit/6da03653

The new i18n gem can be loaded from vendor/plugins like this:

def reload_i18n!
raise "Move to i18n version 0.2.0 or greater" if Rails.version > "2.3.4"

$:.grep(/i18n/).each { |path| $:.delete(path) }
I18n::Backend.send :remove_const, "Simple"
$: << Rails.root.join('vendor', 'plugins', 'i18n', 'lib').to_s
end

Then you can `reload_i18n!` inside an i18n initializer.

h2. Tests

You can run tests both with

* `rake test` or just `rake`
* run any test file directly, e.g. `ruby test/api/simple_test.rb`
@@ -62,7 +63,7 @@ You can run tests both with
The structure of the test suite is a bit unusual as it uses modules to reuse
particular tests in different test cases.

The reason for this is that we need to enforce the I18n API across various
combinations of extensions. E.g. the Simple backend alone needs to support
the same API as any combination of feature and/or optimization modules included
to the Simple backend. We test this by reusing the same API definition (implemented
@@ -71,7 +72,7 @@ as test methods) in test cases with different setups.
You can find the test cases that enforce the API in test/api. And you can find
the API definition test methods in test/api/tests.

All other test cases (e.g. as defined in test/backend, test/core\_ext) etc.
follow the usual test setup and should be easy to grok.
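
The module-based reuse described above can be sketched without any test framework. This is only an illustration of the pattern, not code from the i18n test suite: `LookupApiChecks`, `PlainBackend`, `Memoize` and the two case classes are hypothetical names invented for the example.

```ruby
# Shared API checks, written once. Each check returns true on success.
module LookupApiChecks
  def check_lookup_returns_stored_value
    backend.store(:en, :greeting, "hello")
    backend.lookup(:en, :greeting) == "hello"
  end

  def check_missing_key_returns_nil
    backend.lookup(:en, :missing).nil?
  end
end

# Hypothetical minimal backend.
class PlainBackend
  def initialize
    @data = {}
  end

  def store(locale, key, value)
    (@data[locale] ||= {})[key] = value
  end

  def lookup(locale, key)
    (@data[locale] || {})[key]
  end
end

# Hypothetical optimization module layered on top of a backend.
module Memoize
  def store(locale, key, value)
    @memo = {} # drop cached lookups whenever the data changes
    super
  end

  def lookup(locale, key)
    @memo ||= {}
    @memo[[locale, key]] ||= super
  end
end

class MemoizedBackend < PlainBackend
  include Memoize
end

# Two "test cases" with different setups reuse the same checks.
class PlainBackendCase
  include LookupApiChecks

  def backend
    @backend ||= PlainBackend.new
  end
end

class MemoizedBackendCase
  include LookupApiChecks

  def backend
    @backend ||= MemoizedBackend.new
  end
end

# Run every check_* method on a case and collect the results by name.
def run_checks(test_case)
  names = test_case.public_methods.grep(/\Acheck_/).sort
  names.map { |name| [name, test_case.public_send(name)] }.to_h
end

results = [PlainBackendCase, MemoizedBackendCase].map { |c| run_checks(c.new) }
```

Because both case classes include the same module, a backend extended with an extra feature has to pass the same checks as the plain one, which is the property the README describes.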

h2. Authors
@@ -90,15 +91,17 @@ h2. Contributors
* Frederick Cheung
* Jeremy Kemper
* José Valim
* Krzysztof Knapik
* Lawrence Pit
* Luca Guidi
* M4SSIVE
* Marko Seppae
* Mathias Meyer
* Michael Lang
* Norman Clarke
* Theo Cushion
* Yaroslav Markin

h2. License

MIT License. See the included MIT-LICENSE file.
7 changes: 5 additions & 2 deletions contributors.txt
@@ -3,16 +3,19 @@ Andrew Briening
Clemens Kofler
Frederick Cheung
Jeremy Kemper
Josh Harvey
Joshua Harvey
José Valim
Krzysztof Knapik
Lawrence Pit
Luca Guidi
M4SSIVE
Marko Seppae
Mathias Meyer
Matt Aimonetti
Michael Lang
Norman Clarke
Saimon Moore
Stephan Soller
Sven Fuchs
Theo Cushion
Yaroslav Markin
Krzysztof Knapik
63 changes: 63 additions & 0 deletions lib/i18n.rb
@@ -244,6 +244,69 @@ def translate!(key, options = {})
end
alias :t! :translate!

# Transliterates UTF-8 characters to ASCII. By default this method will
# transliterate only Latin strings to an ASCII approximation:
#
# I18n.transliterate("Ærøskøbing")
# # => "AEroskobing"
#
# I18n.transliterate("日本語")
# # => "???"
#
# It's also possible to add support for per-locale transliterations. I18n
# expects transliteration rules to be stored at
# <tt>i18n.transliterate.rule</tt>.
#
# Transliteration rules can either be a Hash or a Proc. Procs must accept a
# single string argument. Hash rules inherit the default transliteration
# rules, while Procs do not.
#
# *Examples*
#
# Setting a Hash in <locale>.yml:
#
# i18n:
# transliterate:
# rule:
# ü: "ue"
# ö: "oe"
#
# Setting a Hash using Ruby:
#
# store_translations(:de, :i18n => {
# :transliterate => {
# :rule => {
# "ü" => "ue",
# "ö" => "oe"
# }
# }
# })
#
# Setting a Proc:
#
# translit = lambda {|string| MyTransliterator.transliterate(string) }
# store_translations(:xx, :i18n => {:transliterate => {:rule => translit}})
#
# Transliterating strings:
#
# I18n.locale = :en
# I18n.transliterate("Jürgen") # => "Jurgen"
# I18n.locale = :de
# I18n.transliterate("Jürgen") # => "Juergen"
# I18n.transliterate("Jürgen", :locale => :en) # => "Jurgen"
# I18n.transliterate("Jürgen", :locale => :de) # => "Juergen"
def transliterate(*args)
options = args.pop if args.last.is_a?(Hash)
key = args.shift
locale = options && options.delete(:locale) || config.locale
raises = options && options.delete(:raise)
replacement = options && options.delete(:replacement)
config.backend.transliterate(locale, key, replacement)
rescue I18n::ArgumentError => exception
raise exception if raises
handle_exception(exception, locale, key, options)
end

# Localizes certain objects, such as dates and numbers to local formatting.
def localize(object, options = {})
locale = options.delete(:locale) || config.locale
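
The documentation in this hunk says that Hash rules inherit the default transliteration rules while Procs do not. A minimal standalone sketch of that difference follows. It does not use the i18n gem: `SAMPLE_DEFAULTS` is a hypothetical three-entry stand-in for the real default table, and `build_transliterator` is an illustrative helper, not part of the library.

```ruby
# encoding: utf-8

# Hypothetical, heavily shortened stand-in for the default approximations.
SAMPLE_DEFAULTS = { "ü" => "u", "ö" => "o", "é" => "e" }.freeze

# Build a callable from a rule. A Hash is merged over the defaults;
# a Proc is returned as-is and never consults the defaults.
def build_transliterator(rule = nil)
  case rule
  when nil, Hash
    table = SAMPLE_DEFAULTS.merge(rule || {})
    lambda { |string| string.gsub(/[^\x00-\x7f]/) { |char| table[char] || "?" } }
  when Proc
    rule
  else
    raise ArgumentError, "Transliteration rule must be a proc or a hash."
  end
end

german    = build_transliterator({ "ü" => "ue", "ö" => "oe" })
fallback  = build_transliterator
proc_only = build_transliterator(lambda { |string| string.gsub("ü", "ue") })

german.call("Jürgen")    # => "Juergen"
german.call("café")      # => "cafe"  (the é rule comes from the defaults)
fallback.call("Jürgen")  # => "Jurgen"
proc_only.call("café")   # => "café"  (the Proc knows nothing about é)
```

The last call shows the practical consequence: a Proc rule is responsible for the whole string, so any character it does not handle passes through unchanged.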
1 change: 1 addition & 0 deletions lib/i18n/backend.rb
@@ -15,5 +15,6 @@ module Backend
autoload :Metadata, 'i18n/backend/metadata'
autoload :Pluralization, 'i18n/backend/pluralization'
autoload :Simple, 'i18n/backend/simple'
autoload :Transliterator, 'i18n/backend/transliterator'
end
end
9 changes: 9 additions & 0 deletions lib/i18n/backend/base.rb
@@ -49,6 +49,15 @@ def translate(locale, key, options = {})
entry
end

# Given a locale and a UTF-8 string, return the locale's ASCII
# approximation for the string.
def transliterate(locale, string, replacement = nil)
@transliterators ||= {}
@transliterators[locale] ||= Transliterator.get I18n.t(:'i18n.transliterate.rule',
:locale => locale, :resolve => false, :default => {})
@transliterators[locale].transliterate(string, replacement)
end

# Acts the same as +strftime+, but uses a localized version of the
# format string. Takes a key from the date/time formats translations as
# a format argument (<em>e.g.</em>, <tt>:short</tt> in <tt>:'date.formats'</tt>).
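
The `transliterate` method added in this hunk memoizes one transliterator per locale with `||=`. One consequence, as far as can be read from the diff, is that rules stored after the first call for a locale would not be picked up by the cached instance. The sketch below reproduces that behaviour in isolation. `CachingBackend` and its `builds` counter are hypothetical and exist only for this illustration.

```ruby
# encoding: utf-8

# Hypothetical stand-in for a backend that builds one transliterator per locale.
class CachingBackend
  attr_reader :builds

  def initialize(rules)
    @rules  = rules # locale => Hash of rules, read lazily
    @builds = 0     # counts how often a transliterator is constructed
  end

  def transliterate(locale, string, replacement = nil)
    @transliterators ||= {}
    @transliterators[locale] ||= build(locale)
    @transliterators[locale].call(string, replacement)
  end

  private

  def build(locale)
    @builds += 1
    table = (@rules[locale] || {}).dup # snapshot taken at first use
    lambda do |string, replacement|
      string.gsub(/[^\x00-\x7f]/) { |char| table[char] || replacement || "?" }
    end
  end
end

rules   = { :de => { "ü" => "ue" } }
backend = CachingBackend.new(rules)

backend.transliterate(:de, "Jürgen") # => "Juergen"
rules[:de] = { "ü" => "u" }          # changed after the first call
backend.transliterate(:de, "Jürgen") # => "Juergen" (cached rules still apply)
backend.builds                       # => 1
```

This may explain why the test file in this commit assigns a fresh backend in `setup` before each test.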
94 changes: 94 additions & 0 deletions lib/i18n/backend/transliterator.rb
@@ -0,0 +1,94 @@
# encoding: utf-8
module I18n
module Backend
module Transliterator

DEFAULT_REPLACEMENT_CHAR = "?"

# Get a transliterator instance.
def self.get(rule = nil)
if !rule || rule.kind_of?(Hash)
HashTransliterator.new(rule)
elsif rule.kind_of? Proc
ProcTransliterator.new(rule)
else
raise I18n::ArgumentError, "Transliteration rule must be a proc or a hash."
end
end

# A transliterator which accepts a Proc as its transliteration rule.
class ProcTransliterator

def initialize(rule)
@rule = rule
end

def transliterate(string, replacement = nil)
@rule.call(string)
end

end

# A transliterator which accepts a Hash of characters as its translation
# rule.
class HashTransliterator

DEFAULT_APPROXIMATIONS = {
"À"=>"A", "Á"=>"A", "Â"=>"A", "Ã"=>"A", "Ä"=>"A", "Å"=>"A", "Æ"=>"AE",
"Ç"=>"C", "È"=>"E", "É"=>"E", "Ê"=>"E", "Ë"=>"E", "Ì"=>"I", "Í"=>"I",
"Î"=>"I", "Ï"=>"I", "Ð"=>"D", "Ñ"=>"N", "Ò"=>"O", "Ó"=>"O", "Ô"=>"O",
"Õ"=>"O", "Ö"=>"O", "×"=>"x", "Ø"=>"O", "Ù"=>"U", "Ú"=>"U", "Û"=>"U",
"Ü"=>"U", "Ý"=>"Y", "Þ"=>"Th", "ß"=>"ss", "à"=>"a", "á"=>"a", "â"=>"a",
"ã"=>"a", "ä"=>"a", "å"=>"a", "æ"=>"ae", "ç"=>"c", "è"=>"e", "é"=>"e",
"ê"=>"e", "ë"=>"e", "ì"=>"i", "í"=>"i", "î"=>"i", "ï"=>"i", "ð"=>"d",
"ñ"=>"n", "ò"=>"o", "ó"=>"o", "ô"=>"o", "õ"=>"o", "ö"=>"o", "ø"=>"o",
"ù"=>"u", "ú"=>"u", "û"=>"u", "ü"=>"u", "ý"=>"y", "þ"=>"th", "ÿ"=>"y",
"Ā"=>"A", "ā"=>"a", "Ă"=>"A", "ă"=>"a", "Ą"=>"A", "ą"=>"a", "Ć"=>"C",
"ć"=>"c", "Ĉ"=>"C", "ĉ"=>"c", "Ċ"=>"C", "ċ"=>"c", "Č"=>"C", "č"=>"c",
"Ď"=>"D", "ď"=>"d", "Đ"=>"D", "đ"=>"d", "Ē"=>"E", "ē"=>"e", "Ĕ"=>"E",
"ĕ"=>"e", "Ė"=>"E", "ė"=>"e", "Ę"=>"E", "ę"=>"e", "Ě"=>"E", "ě"=>"e",
"Ĝ"=>"G", "ĝ"=>"g", "Ğ"=>"G", "ğ"=>"g", "Ġ"=>"G", "ġ"=>"g", "Ģ"=>"G",
"ģ"=>"g", "Ĥ"=>"H", "ĥ"=>"h", "Ħ"=>"H", "ħ"=>"h", "Ĩ"=>"I", "ĩ"=>"i",
"Ī"=>"I", "ī"=>"i", "Ĭ"=>"I", "ĭ"=>"i", "Į"=>"I", "į"=>"i", "İ"=>"I",
"ı"=>"i", "Ĳ"=>"IJ", "ĳ"=>"ij", "Ĵ"=>"J", "ĵ"=>"j", "Ķ"=>"K", "ķ"=>"k",
"ĸ"=>"k", "Ĺ"=>"L", "ĺ"=>"l", "Ļ"=>"L", "ļ"=>"l", "Ľ"=>"L", "ľ"=>"l",
"Ŀ"=>"L", "ŀ"=>"l", "Ł"=>"L", "ł"=>"l", "Ń"=>"N", "ń"=>"n", "Ņ"=>"N",
"ņ"=>"n", "Ň"=>"N", "ň"=>"n", "ŉ"=>"'n", "Ŋ"=>"NG", "ŋ"=>"ng",
"Ō"=>"O", "ō"=>"o", "Ŏ"=>"O", "ŏ"=>"o", "Ő"=>"O", "ő"=>"o", "Œ"=>"OE",
"œ"=>"oe", "Ŕ"=>"R", "ŕ"=>"r", "Ŗ"=>"R", "ŗ"=>"r", "Ř"=>"R", "ř"=>"r",
"Ś"=>"S", "ś"=>"s", "Ŝ"=>"S", "ŝ"=>"s", "Ş"=>"S", "ş"=>"s", "Š"=>"S",
"š"=>"s", "Ţ"=>"T", "ţ"=>"t", "Ť"=>"T", "ť"=>"t", "Ŧ"=>"T", "ŧ"=>"t",
"Ũ"=>"U", "ũ"=>"u", "Ū"=>"U", "ū"=>"u", "Ŭ"=>"U", "ŭ"=>"u", "Ů"=>"U",
"ů"=>"u", "Ű"=>"U", "ű"=>"u", "Ų"=>"U", "ų"=>"u", "Ŵ"=>"W", "ŵ"=>"w",
"Ŷ"=>"Y", "ŷ"=>"y", "Ÿ"=>"Y", "Ź"=>"Z", "ź"=>"z", "Ż"=>"Z", "ż"=>"z",
"Ž"=>"Z", "ž"=>"z"
}

def initialize(rule = nil)
@rule = rule
add DEFAULT_APPROXIMATIONS
add rule if rule
end

def transliterate(string, replacement = nil)
string.gsub(/[^\x00-\x7f]/u) do |char|
approximations[char] || replacement || DEFAULT_REPLACEMENT_CHAR
end
end

private

def approximations
@approximations ||= {}
end

# Add transliteration rules to the approximations hash.
def add(hash)
hash.keys.each {|key| hash[key.to_s] = hash.delete(key).to_s}
approximations.merge! hash
end

end
end
end
end
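
As written in this diff, the private `add` helper rewrites the keys of the hash it receives in place, so it also touches `DEFAULT_APPROXIMATIONS` and any rule hash passed in by the caller. One possible alternative is to build a fresh hash instead. The sketch below is a suggestion under that assumption, not the library's implementation, and `stringify_rule` is a hypothetical name.

```ruby
# encoding: utf-8

# Convert keys and values to strings without touching the caller's hash.
def stringify_rule(rule)
  rule.each_with_object({}) do |(key, value), result|
    result[key.to_s] = value.to_s
  end
end

# A frozen input proves that nothing is mutated along the way.
rule  = { :"ü" => "ue", :"ö" => :oe }.freeze
table = stringify_rule(rule) # => {"ü"=>"ue", "ö"=>"oe"}
```

The result can then be merged into the approximations table exactly as before, while the caller's hash and the constant stay unchanged.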
2 changes: 1 addition & 1 deletion lib/i18n/version.rb
@@ -1,3 +1,3 @@
module I18n
VERSION = "0.3.7"
end
87 changes: 87 additions & 0 deletions test/backend/transliterator_test.rb
@@ -0,0 +1,87 @@
# encoding: utf-8
$:.unshift(File.expand_path(File.dirname(__FILE__) + '/../')); $:.uniq!
require 'test_helper'

class I18nBackendTransliterator < Test::Unit::TestCase

class Backend
include I18n::Backend::Base
end

def setup
I18n.backend = Backend.new
@proc = lambda { |n| n.upcase }
@hash = { :"ü" => "ue", :"ö" => "oe" }
@transliterator = I18n::Backend::Transliterator.get
end

test "transliteration rule can be a proc" do
store_translations(:xx, :i18n => {:transliterate => {:rule => @proc}})
assert_equal "HELLO", I18n.backend.transliterate(:xx, "hello")
end

test "transliteration rule can be a hash" do
store_translations(:xx, :i18n => {:transliterate => {:rule => @hash}})
assert_equal "ue", I18n.backend.transliterate(:xx, "ü")
end

test "transliteration rule must be a proc or hash" do
store_translations(:xx, :i18n => {:transliterate => {:rule => ""}})
assert_raise I18n::ArgumentError do
I18n.backend.transliterate(:xx, "ü")
end
end

test "transliterator defaults to latin => ascii when no rule is given" do
assert_equal "AEroskobing", I18n.backend.transliterate(:xx, "Ærøskøbing")
end

test "default transliterator should not modify ascii characters" do
(0..127).each do |byte|
char = [byte].pack("U")
assert_equal char, @transliterator.transliterate(char)
end
end

test "default transliterator correctly transliterates latin characters" do
# create string with range of Unicode's western characters with
# diacritics, excluding the division and multiplication signs which for
# some reason or other are floating in the middle of all the letters.
string = (0xC0..0x17E).to_a.reject {|c| [0xD7, 0xF7].include? c}.pack("U*")
string.split(//).each do |char|
assert_match %r{^[a-zA-Z']*$}, @transliterator.transliterate(char)
end
end

test "should replace non-ASCII chars not in map with a replacement char" do
assert_equal "abc?", @transliterator.transliterate("abcſ")
end

test "can replace non-ASCII chars not in map with a custom replacement string" do
assert_equal "abc#", @transliterator.transliterate("abcſ", "#")
end

if RUBY_VERSION >= "1.9"
test "default transliterator raises errors for invalid UTF-8" do
assert_raise ArgumentError do
@transliterator.transliterate("a\x92b")
end
end
end

test "I18n.transliterate should transliterate using a default transliterator" do
assert_equal "aeo", I18n.transliterate("áèö")
end

test "I18n.transliterate should transliterate using a locale" do
store_translations(:xx, :i18n => {:transliterate => {:rule => @hash}})
assert_equal "ue", I18n.transliterate("ü", :locale => :xx)
end

test "default transliterator fails with custom rules with uncomposed input" do
char = [117, 776].pack("U*") # "ü" as ASCII "u" plus COMBINING DIAERESIS
transliterator = I18n::Backend::Transliterator.get(@hash)
assert_not_equal "ue", transliterator.transliterate(char)
end

end
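
The last test documents that decomposed input ("u" followed by a combining diaeresis) does not match a rule keyed on the precomposed "ü". One way a caller could work around this is to normalize the string to NFC before transliterating. The sketch below assumes Ruby 2.2 or later, where `String#unicode_normalize` is available (it did not exist when this commit was written). `UMLAUT_RULES` and both helper methods are hypothetical names for illustration only.

```ruby
# encoding: utf-8

# Hypothetical rule table mirroring @hash from the test above,
# keyed on the precomposed code points U+00FC and U+00F6.
UMLAUT_RULES = { "\u00FC" => "ue", "\u00F6" => "oe" }.freeze

# Replace each non-ASCII character using the rule table.
def transliterate_raw(string, rules = UMLAUT_RULES, replacement = "?")
  string.gsub(/[^\x00-\x7f]/) { |char| rules[char] || replacement }
end

# Same as above, but compose the input to NFC first.
def transliterate_composed(string, rules = UMLAUT_RULES, replacement = "?")
  transliterate_raw(string.unicode_normalize(:nfc), rules, replacement)
end

decomposed = [117, 776].pack("U*") # "u" followed by U+0308

transliterate_raw(decomposed)      # => "u?"
transliterate_composed(decomposed) # => "ue"
```

Without normalization the combining mark is treated as a separate unknown character and replaced on its own, which is the failure the test pins down.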