
Improve reliability of Inflector.transliterate. [#4374 state:resolved]

Signed-off-by: Jeremy Kemper <jeremy@bitsweat.net>
commit dceef0828a23e8298dd9a9aab1a33c49e84f17d6 1 parent 36f3634
Norman Clarke authored April 12, 2010 jeremy committed April 12, 2010
2  activesupport/CHANGELOG
@@ -1,5 +1,7 @@
 *Rails 3.0.0 [beta 3] (pending)*
 
+* Improve transliteration quality.  #4374 [Norman Clarke]
+
 * Speed up and add Ruby 1.9 support for ActiveSupport::Multibyte::Chars#tidy_bytes.  #4350 [Norman Clarke]
 
 
61  activesupport/lib/active_support/inflector/transliterate.rb
@@ -1,32 +1,47 @@
 # encoding: utf-8
-require 'iconv'
-require 'kconv'
 require 'active_support/core_ext/string/multibyte'
 
 module ActiveSupport
   module Inflector
     extend self
-    
-    # Replaces accented characters with their ascii equivalents.
-    def transliterate(string)
-      Iconv.iconv('ascii//ignore//translit', 'utf-8', string).to_s
-    end
 
-    if RUBY_VERSION >= '1.9'
-      undef_method :transliterate
-      def transliterate(string)
-        proxy = ActiveSupport::Multibyte.proxy_class.new(string)
-        proxy.normalize(:kd).gsub(/[^\x00-\x7F]+/, '')
-      end
+    # Unicode code point => ASCII approximation code point(s)
+    ASCII_APPROXIMATIONS = {
+      198 => [65, 69],   # Æ => AE
+      208 => 68,         # Ð => D
+      216 => 79,         # Ø => O
+      222 => [84, 104],  # Þ => Th
+      223 => [115, 115], # ß => ss
+      230 => [97, 101],  # æ => ae
+      240 => 100,        # ð => d
+      248 => 111,        # ø => o
+      254 => [116, 104], # þ => th
+      272 => 68,         # Đ => D
+      273 => 100,        # đ => d
+      294 => 72,         # Ħ => H
+      295 => 104,        # ħ => h
+      305 => 105,        # ı => i
+      306 => [73, 74],   # Ĳ => IJ
+      307 => [105, 106], # ĳ => ij
+      312 => 107,        # ĸ => k
+      319 => 76,         # Ŀ => L
+      320 => 108,        # ŀ => l
+      321 => 76,         # Ł => L
+      322 => 108,        # ł => l
+      329 => 110,        # ŉ => n
+      330 => [78, 71],   # Ŋ => NG
+      331 => [110, 103], # ŋ => ng
+      338 => [79, 69],   # Œ => OE
+      339 => [111, 101], # œ => oe
+      358 => 84,         # Ŧ => T
+      359 => 116         # ŧ => t
+    }
 
-    # The iconv transliteration code doesn't function correctly
-    # on some platforms, but it's very fast where it does function.
-    elsif "foo" != (Inflector.transliterate("föö") rescue nil)
-      undef_method :transliterate
-      def transliterate(string)
-        string.mb_chars.normalize(:kd). # Decompose accented characters
-          gsub(/[^\x00-\x7F]+/, '')     # Remove anything non-ASCII entirely (e.g. diacritics).
-      end
+    # Replaces accented characters with an ASCII approximation, or deletes them if none exists.
+    def transliterate(string)
+      ActiveSupport::Multibyte::Chars.new(string).tidy_bytes.normalize(:d).unpack("U*").map do |char|
+        ASCII_APPROXIMATIONS[char] || (char if char < 128)
+      end.compact.flatten.pack("U*")
     end
 
     # Replaces special characters in a string so that it may be used as part of a 'pretty' URL.
@@ -45,8 +60,6 @@ def transliterate(string)
     #   <%= link_to(@person.name, person_path(@person)) %>
     #   # => <a href="/person/1-donald-e-knuth">Donald E. Knuth</a>
     def parameterize(string, sep = '-')
-      # remove malformed utf8 characters
-      string = string.toutf8 unless string.is_utf8?
       # replace accented chars with their ascii equivalents
       parameterized_string = transliterate(string)
       # Turn unwanted chars into the separator
@@ -59,6 +72,6 @@ def parameterize(string, sep = '-')
         parameterized_string.gsub!(/^#{re_sep}|#{re_sep}$/i, '')
       end
       parameterized_string.downcase
-    end    
+    end
   end
 end
5  activesupport/test/inflector_test_cases.rb
@@ -188,7 +188,10 @@ module InflectorTestCases
   StringToParameterizedAndNormalized = {
     "Malmö"                               => "malmo",
     "Garçons"                             => "garcons",
-    "Ops\331"                            => "ops"
+    "Ops\331"                             => "opsu",
+    "Ærøskøbing"                          => "aeroskobing",
+    "Aßlar"                               => "asslar",
+    "Japanese: 日本語"                    => "japanese"
   }
 
   UnderscoreToHuman = {
50  activesupport/test/transliterate_test.rb
@@ -0,0 +1,50 @@
+# encoding: utf-8
+require 'abstract_unit'
+require 'active_support/inflector/transliterate'
+
+class TransliterateTest < Test::Unit::TestCase
+
+  APPROXIMATIONS = {
+    "À"=>"A", "Á"=>"A", "Â"=>"A", "Ã"=>"A", "Ä"=>"A", "Å"=>"A", "Æ"=>"AE",
+    "Ç"=>"C", "È"=>"E", "É"=>"E", "Ê"=>"E", "Ë"=>"E", "Ì"=>"I", "Í"=>"I",
+    "Î"=>"I", "Ï"=>"I", "Ð"=>"D", "Ñ"=>"N", "Ò"=>"O", "Ó"=>"O", "Ô"=>"O",
+    "Õ"=>"O", "Ö"=>"O", "Ø"=>"O", "Ù"=>"U", "Ú"=>"U", "Û"=>"U", "Ü"=>"U",
+    "Ý"=>"Y", "Þ"=>"Th", "ß"=>"ss", "à"=>"a", "á"=>"a", "â"=>"a", "ã"=>"a",
+    "ä"=>"a", "å"=>"a", "æ"=>"ae", "ç"=>"c", "è"=>"e", "é"=>"e", "ê"=>"e",
+    "ë"=>"e", "ì"=>"i", "í"=>"i", "î"=>"i", "ï"=>"i", "ð"=>"d", "ñ"=>"n",
+    "ò"=>"o", "ó"=>"o", "ô"=>"o", "õ"=>"o", "ö"=>"o", "ø"=>"o", "ù"=>"u",
+    "ú"=>"u", "û"=>"u", "ü"=>"u", "ý"=>"y", "þ"=>"th", "ÿ"=>"y", "Ā"=>"A",
+    "ā"=>"a", "Ă"=>"A", "ă"=>"a", "Ą"=>"A", "ą"=>"a", "Ć"=>"C", "ć"=>"c",
+    "Ĉ"=>"C", "ĉ"=>"c", "Ċ"=>"C", "ċ"=>"c", "Č"=>"C", "č"=>"c", "Ď"=>"D",
+    "ď"=>"d", "Đ"=>"D", "đ"=>"d", "Ē"=>"E", "ē"=>"e", "Ĕ"=>"E", "ĕ"=>"e",
+    "Ė"=>"E", "ė"=>"e", "Ę"=>"E", "ę"=>"e", "Ě"=>"E", "ě"=>"e", "Ĝ"=>"G",
+    "ĝ"=>"g", "Ğ"=>"G", "ğ"=>"g", "Ġ"=>"G", "ġ"=>"g", "Ģ"=>"G", "ģ"=>"g",
+    "Ĥ"=>"H", "ĥ"=>"h", "Ħ"=>"H", "ħ"=>"h", "Ĩ"=>"I", "ĩ"=>"i", "Ī"=>"I",
+    "ī"=>"i", "Ĭ"=>"I", "ĭ"=>"i", "Į"=>"I", "į"=>"i", "İ"=>"I", "ı"=>"i",
+    "Ĳ"=>"IJ", "ĳ"=>"ij", "Ĵ"=>"J", "ĵ"=>"j", "Ķ"=>"K", "ķ"=>"k", "ĸ"=>"k",
+    "Ĺ"=>"L", "ĺ"=>"l", "Ļ"=>"L", "ļ"=>"l", "Ľ"=>"L", "ľ"=>"l", "Ŀ"=>"L",
+    "ŀ"=>"l", "Ł"=>"L", "ł"=>"l", "Ń"=>"N", "ń"=>"n", "Ņ"=>"N", "ņ"=>"n",
+    "Ň"=>"N", "ň"=>"n", "ŉ"=>"n", "Ŋ"=>"NG", "ŋ"=>"ng", "Ō"=>"O", "ō"=>"o",
+    "Ŏ"=>"O", "ŏ"=>"o", "Ő"=>"O", "ő"=>"o", "Œ"=>"OE", "œ"=>"oe", "Ŕ"=>"R",
+    "ŕ"=>"r", "Ŗ"=>"R", "ŗ"=>"r", "Ř"=>"R", "ř"=>"r", "Ś"=>"S", "ś"=>"s",
+    "Ŝ"=>"S", "ŝ"=>"s", "Ş"=>"S", "ş"=>"s", "Š"=>"S", "š"=>"s", "Ţ"=>"T",
+    "ţ"=>"t", "Ť"=>"T", "ť"=>"t", "Ŧ"=>"T", "ŧ"=>"t", "Ũ"=>"U", "ũ"=>"u",
+    "Ū"=>"U", "ū"=>"u", "Ŭ"=>"U", "ŭ"=>"u", "Ů"=>"U", "ů"=>"u", "Ű"=>"U",
+    "ű"=>"u", "Ų"=>"U", "ų"=>"u", "Ŵ"=>"W", "ŵ"=>"w", "Ŷ"=>"Y", "ŷ"=>"y",
+    "Ÿ"=>"Y", "Ź"=>"Z", "ź"=>"z", "Ż"=>"Z", "ż"=>"z", "Ž"=>"Z", "ž"=>"z"
+  }
+
+  def test_transliterate_should_not_change_ascii_chars
+    (0..127).each do |byte|
+      char = [byte].pack("U")
+      assert_equal char, ActiveSupport::Inflector.transliterate(char)
+    end
+  end
+
+  def test_should_convert_accented_chars_to_approximate_ascii_chars
+    APPROXIMATIONS.each do |given, expected|
+      assert_equal expected, ActiveSupport::Inflector.transliterate(given)
+    end
+  end
+
+end

1 note on commit dceef08

Yaroslav Markin

Will you accept patches that increase this list? Maybe 2x-3x?

Norman Clarke

@Yaroslav, what characters do you want to use that are not caught by this patch? Keep in mind other accented Roman characters are handled by decomposition; these are only the ones in UTF-8's Basic Latin/Supplement 1 zones that don't decompose and so can't be stripped of diacritics to produce an ASCII letter.
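
The decomposition point can be illustrated with a minimal sketch. It assumes Ruby 2.2+ and uses the built-in `String#unicode_normalize` rather than the `ActiveSupport::Multibyte::Chars` API from the commit; `decomposed_codepoints` is a hypothetical helper name used only for this example.

```ruby
# Sketch only: relies on Ruby's built-in String#unicode_normalize (Ruby 2.2+),
# not on the ActiveSupport::Multibyte::Chars API used in the commit.

# Hypothetical helper: code points of a string after canonical decomposition (NFD).
def decomposed_codepoints(string)
  string.unicode_normalize(:nfd).codepoints
end

# U+00E9 (e with acute) decomposes into "e" (101) plus a combining acute
# accent (769), so dropping everything above 127 leaves a plain "e".
decomposed_codepoints("\u00E9") # => [101, 769]

# U+00D8 (O with stroke) has no canonical decomposition, so it survives NFD
# unchanged and would simply be deleted unless a lookup table maps it to "O".
decomposed_codepoints("\u00D8") # => [216]
```

Characters of the second kind are the ones that need an explicit entry in a table such as ASCII_APPROXIMATIONS.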

@Yehuda, regarding Yaroslav's request, I had considered adding a hash argument to transliterate, to allow custom transliterations to be set per locale using Rails' i18n support. This would be useful for the German and Spanish cases I resolved in the friendly_id plugin here. The code for that is here. However, I was concerned that now might not be a good time to add new features to Rails since the release is pending. Is this something you would consider adding? I'd be happy to work on it if you are interested.
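
A rough sketch of what such a hash argument could look like, for discussion only. Everything here is hypothetical: `transliterate_with`, `DEFAULT_APPROXIMATIONS`, and the `overrides` parameter are illustrative names, the German mapping is an assumed example, and the code uses Ruby 2.2+'s `String#unicode_normalize` instead of ActiveSupport.

```ruby
# Hypothetical sketch, not the Rails or i18n API.

# Tiny excerpt of a default table, keyed by character for readability.
DEFAULT_APPROXIMATIONS = {
  "\u00DF" => "ss", # sharp s
  "\u00D8" => "O",  # O with stroke
  "\u00F8" => "o"   # o with stroke
}.freeze

# Transliterates +string+ to ASCII. Entries in +overrides+ take precedence
# over the defaults, so a locale can supply its own conventions.
def transliterate_with(string, overrides = {})
  table = DEFAULT_APPROXIMATIONS.merge(overrides)
  # Compose first so that overrides match precomposed characters.
  string.unicode_normalize(:nfc).each_char.map do |char|
    table.fetch(char) do
      # No table entry: decompose and keep only the ASCII code points.
      char.unicode_normalize(:nfd).codepoints.select { |cp| cp < 128 }.pack("U*")
    end
  end.join
end

# Assumed German-style override: u with diaeresis becomes "ue".
german = { "\u00FC" => "ue" }

transliterate_with("M\u00FCller")         # => "Muller"
transliterate_with("M\u00FCller", german) # => "Mueller"
```

The overrides are looked up before decomposition on purpose: once a character has been split into a base letter and a combining mark, a per-locale rule for the precomposed form can no longer match.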

Yaroslav Markin

@norman Simply put, http://github.com/yaroslav/russian/blob/master/lib/russian/transliteration.rb#L14. This is added easily via method aliasing if you have a plugin, but if we are doing a full-blown Rails transliteration, it may fit.

Having custom transliterations would be awesome, in fact. Less hacks = better code. Could you collaborate with us on rails-i18n google group?

Jeremy Kemper

Yeah, let's get this in. It's a core concern that we should handle well. No need to wait; git master is always hungry for the latest and greatest!

Norman Clarke

@Jeremy, Great! I'll send a patch tomorrow.
@Yaroslav, I'll send the patch to the list for feedback before sending it to Rails core.

halo commented on dceef08 April 30, 2010

you guys rock.

José Valim

Norman, you added transliterate to I18n. Wouldn't it make sense for Rails' transliterate to just delegate to I18n?

Norman Clarke

Yes, in fact I have this commit staged and ready to go for Rails. I've just been waiting for the transliteration in i18n to be released before sending the patch, so that Rails can update its i18n gem version dependency. I'll send the patch to the RoR core list and you can apply it if and when you see fit.

José Valim

Awesome, Norman! Please add the patch to Lighthouse and assign it to me. I will coordinate with Sven to release I18n soon (I think he's actually waiting for my KeyValue backend commits).
