Normalization - NFD #17

Merged
merged 24 commits into from Apr 27, 2012
+805 −3
@@ -0,0 +1,17 @@
+# encoding: UTF-8
+
+module TwitterCldr
+ module Normalizers
+ class Base
+ class << self
+ def code_point_to_char(code_point)
+ [code_point.upcase.hex].pack('U*')
+ end
+ def char_to_code_point(char)
+ code_point = char.unpack('U*').first.to_s(16).upcase
+ code_point.rjust(4, '0') #Pad to at least 4 digits
+ end
+ end
+ end
+ end
+end
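A quick round trip may help when reading the two helpers above. This is a minimal sketch that copies their bodies into a standalone, hypothetical `CodePointHelpers` module so it runs without the gem; code points are hex strings padded to at least four digits, as in the diff.

```ruby
# Standalone copies of the two helpers from the diff, for illustration only.
# CodePointHelpers is a hypothetical name, not part of TwitterCldr.
module CodePointHelpers
  module_function

  # Hex string -> single-character UTF-8 string.
  def code_point_to_char(code_point)
    [code_point.upcase.hex].pack('U*')
  end

  # Single character -> upper-case hex string, zero-padded to at least 4 digits.
  def char_to_code_point(char)
    char.unpack('U*').first.to_s(16).upcase.rjust(4, '0')
  end
end

CodePointHelpers.char_to_code_point('A')       # => "0041"
CodePointHelpers.char_to_code_point("\u00E9")  # => "00E9"
CodePointHelpers.code_point_to_char('0041')    # => "A"
```

Code points above U+FFFF simply come back with five or more digits, since `rjust` only pads and never truncates.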
@@ -0,0 +1,81 @@
+# encoding: UTF-8
+
+module TwitterCldr
+ module Normalizers
+ class NFD < Base
+ @@hangul_constants = {:SBase => "AC00".hex, :LBase => "1100".hex, :VBase => "1161".hex, :TBase => "11A7".hex,
+ :SCount => 11172, :LCount => 19, :VCount => 21, :TCount => 28, :NCount => 588}
+ class << self
+ def normalize(string)
+ #Convert string to code points
+ code_points = string.split('').map { |char| char_to_code_point(char) }
+
+ #Normalize code points
+ normalized_code_points = normalize_code_points(code_points)
+
+ #Convert normalized code points back to string
+ normalized_code_points.map { |code_point| code_point_to_char(code_point) }.join
+ end
+
+ def normalize_code_points(code_points)
+ code_points = code_points.map { |code_point| decompose code_point }.flatten
+ reorder code_points
+ code_points
+ end
+
+ #Recursively replace the given code point with the values in its Decomposition_Mapping property
+ def decompose(code_point)
+ unicode_data = TwitterCldr::Shared::UnicodeData.for_code_point(code_point)
@camertron

camertron Apr 19, 2012

Collaborator

Wouldn't it be cool if for_code_point returned an instance of something like TwitterCldr::Shared::UnicodeData::CodePoint? That way we wouldn't have to use array indices to access the code point data. In other words, you could do unicode_data.code_point instead of unicode_data[0]. What do you think?

@timothyandrew

timothyandrew Apr 19, 2012

Contributor

That's a great idea…I actually started implementing something like this, but let it go early. In my version, for_code_point returns a hash of values. I can just zip up a list of keys with the values returned by for_code_point to get me my hash:

keys = [:codepoint, :name, :category, :combining_class, :bidi_class, :decomposition, :digit_value, :non_decimal_digit_value, :numeric_value, :bidi_mirrored, :unicode1_name, :iso_comment, :simple_uppercase_map, :simple_lowercase_map, :simple_titlecase_map]
Hash[keys.zip UnicodeData.for_code_point('1F3E9')]

which gives me:

{:codepoint=>"1F3E9", :name=>"LOVE HOTEL", :category=>"So", :combining_class=>"0", :bidi_class=>"ON", :decomposition=>"", :digit_value=>"", :non_decimal_digit_value=>"", :numeric_value=>"", :bidi_mirrored=>"N", :unicode1_name=>"", :iso_comment=>"", :simple_uppercase_map=>"", :simple_lowercase_map=>"", :simple_titlecase_map=>""}

Wouldn't that be simpler than returning an instance of TwitterCldr::Shared::UnicodeData::CodePoint?

@KL-7

KL-7 Apr 19, 2012

Contributor

@timothyandrew, you can use the Struct class for that. Creating a CodePoint struct is pretty easy, and in return you'll be able to call unicode_data.codepoint, and any typo in an attribute name won't go unnoticed.

@timothyandrew

timothyandrew Apr 19, 2012

Contributor

@KL-7 Yeah, I think that's the perfect solution for this. Thanks for the idea; I had no idea Ruby had something like this! :)

@timothyandrew

timothyandrew Apr 19, 2012

Contributor

The change is at pull request #19.
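For readers following the thread, the Struct approach might look roughly like the sketch below. It is not the code from pull request #19: the `CodePoint` constant here is hypothetical, the field names are the keys listed earlier in this thread, and the row values are copied from the example output for U+1F3E9.

```ruby
# Hypothetical CodePoint struct; field names come from the key list in this thread.
CodePoint = Struct.new(
  :codepoint, :name, :category, :combining_class, :bidi_class,
  :decomposition, :digit_value, :non_decimal_digit_value, :numeric_value,
  :bidi_mirrored, :unicode1_name, :iso_comment,
  :simple_uppercase_map, :simple_lowercase_map, :simple_titlecase_map
)

# A raw row in the array shape that for_code_point is described as returning.
row = ['1F3E9', 'LOVE HOTEL', 'So', '0', 'ON', '', '', '', '', 'N', '', '', '', '', '']

data = CodePoint.new(*row)
data.name             # => "LOVE HOTEL"
data.combining_class  # => "0"

# A misspelled attribute raises NoMethodError, whereas a Hash lookup with a
# misspelled key would quietly return nil.
begin
  data.combinig_class
rescue NoMethodError => e
  puts "caught #{e.class}"
end
```

A Struct also keeps index access (`data[3]`), so existing array-style call sites would presumably keep working during a migration.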

+ return code_point unless unicode_data
+ decomposition_mapping = unicode_data[5].split
+
+ # Special decomposition for Hangul syllables.
+ # Documented in Section 3.12 at http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
+ if unicode_data[1].include? 'Hangul'
+ sIndex = code_point.hex - @@hangul_constants[:SBase]
+
+ lIndex = sIndex / @@hangul_constants[:NCount]
+ vIndex = (sIndex % @@hangul_constants[:NCount]) / @@hangul_constants[:TCount]
+ tIndex = sIndex % @@hangul_constants[:TCount]
+
+ lPart = (@@hangul_constants[:LBase] + lIndex).to_s(16).upcase
+ vPart = (@@hangul_constants[:VBase] + vIndex).to_s(16).upcase
+ tPart = (@@hangul_constants[:TBase] + tIndex).to_s(16).upcase if tIndex > 0
+
+ [lPart, vPart, tPart].compact
+
+ #Return the code point if compatibility mapping or if no mapping exists
+ elsif decomposition_mapping.first =~ /<.*>/ || decomposition_mapping.empty?
+ code_point
+ else
+ decomposition_mapping.map do |decomposition_code_point|
+ decompose(decomposition_code_point)
+ end.flatten
+ end
+ end
+
+ #Swap any two adjacent code points A & B if ccc(A) > ccc(B) > 0
+ def reorder(code_points)
+ (code_points.size).times do
@KL-7

KL-7 Apr 14, 2012

Contributor

Why the parentheses here?

@timothyandrew

timothyandrew Apr 17, 2012

Contributor

Just thought it made it a little more readable than code_points.size.times do

@KL-7

KL-7 Apr 17, 2012

Contributor

I see, but afaik people usually chain methods as much as they need without adding any unnecessary parentheses.
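To make the loop under review easier to follow, here is a minimal standalone sketch of the same reordering rule (swap adjacent A and B when ccc(A) > ccc(B) > 0), written with the chained `code_points.size.times` form. The `CCC` table is a hypothetical stand-in for the UnicodeData lookup, and unlike the method in the diff this version returns the array explicitly.

```ruby
# Toy combining-class table (hypothetical stand-in for the UnicodeData lookup).
# U+0061 is a starter (ccc 0), U+0301 has ccc 230, U+0323 has ccc 220.
CCC = { '0061' => 0, '0301' => 230, '0323' => 220 }

def combining_class_for(code_point)
  CCC.fetch(code_point, 0)
end

# Bubble-sort style passes: swap adjacent A, B when ccc(A) > ccc(B) > 0.
# Starters (ccc 0) are never moved past, so each run of marks sorts in place.
def reorder(code_points)
  code_points.size.times do
    (0...code_points.size - 1).each do |i|
      ccc_a = combining_class_for(code_points[i])
      ccc_b = combining_class_for(code_points[i + 1])
      if ccc_a > ccc_b && ccc_b > 0
        code_points[i], code_points[i + 1] = code_points[i + 1], code_points[i]
      end
    end
  end
  code_points
end

reorder(%w[0061 0301 0323])  # => ["0061", "0323", "0301"]
```
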

+ code_points.each_with_index do |cp, i|
+ unless i == (code_points.size - 1)
+ ccc_a, ccc_b = combining_class_for(cp), combining_class_for(code_points[i+1])
+ if (ccc_a > ccc_b) && (ccc_b > 0)
+ code_points[i], code_points[i+1] = code_points[i+1], code_points[i]
+ end
+ end
+ end
+ end
+ end
+
+ def combining_class_for(code_point)
+ begin
+ unicode_data = TwitterCldr::Shared::UnicodeData.for_code_point(code_point)[3].to_i
+ rescue NoMethodError
+ 0
+ end
+ end
+ end
+ end
+ end
+end
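The Hangul branch of `decompose` is pure arithmetic, so a worked example may be useful. The sketch below restates it as a standalone, hypothetical `decompose_hangul` method using the constants from Section 3.12 of the Unicode standard; it assumes the input is already known to be a precomposed Hangul syllable.

```ruby
# Constants from Section 3.12 of the Unicode standard.
S_BASE  = 0xAC00
L_BASE  = 0x1100
V_BASE  = 0x1161
T_BASE  = 0x11A7
T_COUNT = 28
N_COUNT = 588 # VCount * TCount = 21 * 28

# Hypothetical helper: hex string for a Hangul syllable -> array of jamo hex strings.
def decompose_hangul(code_point)
  s_index = code_point.hex - S_BASE
  l_index = s_index / N_COUNT
  v_index = (s_index % N_COUNT) / T_COUNT
  t_index = s_index % T_COUNT

  parts = [L_BASE + l_index, V_BASE + v_index]
  # A trailing consonant exists only when t_index is non-zero.
  parts << (T_BASE + t_index) if t_index > 0
  parts.map { |cp| cp.to_s(16).upcase }
end

decompose_hangul('AC00')  # => ["1100", "1161"]
decompose_hangul('D55C')  # => ["1112", "1161", "11AB"]
```

For U+D55C the index is 0xD55C - 0xAC00 = 10588, which gives L index 18, V index 0 and T index 4.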
@@ -15,8 +15,25 @@ def for_code_point(code_point)
range.include? code_point.to_i(16)
end
- TwitterCldr.get_resource("unicode_data", target.first)[code_point.to_sym] if target
- end
+ if target
+ block_data = TwitterCldr.get_resource("unicode_data", target.first)
+ block_data.fetch(code_point.to_sym) { |code_point_sym| get_range_start(code_point_sym, block_data) }
+ end
+ end
+
+ private
+ # Check if block constitutes a range. The code point beginning a range will have a name enclosed in <>, ending with 'First'
+ # eg: <CJK Ideograph Extension A, First>
+ # http://unicode.org/reports/tr44/#Code_Point_Ranges
+ def get_range_start(code_point, block_data)
+ start_code_point = block_data.keys.sort_by { |key| key.to_s.to_i(16) }.first
+ start_data = block_data[start_code_point].clone
+ if start_data[1] =~ /<.*, First>/
+ start_data[0] = code_point.to_s
+ start_data[1] = start_data[1].sub(', First', '')
+ start_data
+ end
+ end
end
end
end
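The range fallback above can be tried in isolation. This is a minimal sketch with a toy `block_data` hash rather than the real resource files; `range_start_for` is a hypothetical name mirroring `get_range_start`, and the rows are shortened to six fields for brevity.

```ruby
# Hypothetical standalone version of the range fallback shown in the diff.
def range_start_for(code_point, block_data)
  # The lowest code point in the block is the candidate range start.
  start_key  = block_data.keys.min_by { |key| key.to_s.to_i(16) }
  start_data = block_data[start_key].dup
  return nil unless start_data[1] =~ /<.*, First>/

  # Reuse the range's properties, with this code point's own value and a cleaned name.
  start_data[0] = code_point.to_s
  start_data[1] = start_data[1].sub(', First', '')
  start_data
end

# Toy data: only the First and Last entries of a range are listed.
block_data = {
  :'3400' => ['3400', '<CJK Ideograph Extension A, First>', 'Lo', '0', 'L', ''],
  :'4DB5' => ['4DB5', '<CJK Ideograph Extension A, Last>', 'Lo', '0', 'L', '']
}

# fetch yields the missing key to the block, which triggers the fallback.
entry = block_data.fetch(:'3456') { |missing| range_start_for(missing, block_data) }
entry[0]  # => "3456"
entry[1]  # => "<CJK Ideograph Extension A>"
```

Because the start row is duplicated before it is modified, the cached block data keeps its original `First` entry.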
@@ -125,4 +125,8 @@ def supported_locale?(locale)
# formatter helpers
require 'formatters/numbers/helpers/base'
require 'formatters/numbers/helpers/fraction'
-require 'formatters/numbers/helpers/integer'
+require 'formatters/numbers/helpers/integer'
+
+# all normalizers
+require 'normalizers/base'
+require 'normalizers/canonical/nfd'