Normalization - NFD #17
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
# encoding: UTF-8 | ||
|
||
module TwitterCldr | ||
module Normalizers | ||
class Base | ||
class << self | ||
def code_point_to_char(code_point) | ||
[code_point.upcase.hex].pack('U*') | ||
end | ||
def char_to_code_point(char) | ||
code_point = char.unpack('U*').first.to_s(16).upcase | ||
#Pad to at least 4 digits | ||
code_point.rjust(4, '0') | ||
end | ||
end | ||
end | ||
end | ||
end |
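The two `Base` helpers are inverses of each other for any code point written as an upper-case hex string padded to at least four digits. A minimal sketch of that round trip, using standalone top-level methods (an assumption made so the snippet runs without the `TwitterCldr` module loaded):

```ruby
# Standalone copies of the Base helpers, for illustration only.
# In the pull request they are class methods on TwitterCldr::Normalizers::Base.
def code_point_to_char(code_point)
  # "0041" -> 0x41 -> "A"
  [code_point.hex].pack('U*')
end

def char_to_code_point(char)
  # "A" -> 65 -> "41" -> "0041" (padded to at least 4 digits)
  char.unpack('U*').first.to_s(16).upcase.rjust(4, '0')
end

puts char_to_code_point('A')        # 0041
puts char_to_code_point("\u00E9")   # 00E9
puts code_point_to_char('1F3E9') == "\u{1F3E9}"  # true
```

Code points above U+FFFF keep their full five or six hex digits, since `rjust` only pads and never truncates.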
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,81 @@ | ||
# encoding: UTF-8 | ||
|
||
module TwitterCldr | ||
module Normalizers | ||
class NFD < Base | ||
@@hangul_constants = {:SBase => "AC00".hex, :LBase => "1100".hex, :VBase => "1161".hex, :TBase => "11A7".hex, | ||
:SCount => 11172, :LCount => 19, :VCount => 21, :TCount => 28, :NCount => 588} | ||
class << self | ||
def normalize(string) | ||
#Convert string to code points | ||
code_points = string.split('').map { |char| char_to_code_point(char) } | ||
|
||
#Normalize code points | ||
normalized_code_points = normalize_code_points(code_points) | ||
|
||
#Convert normalized code points back to string | ||
normalized_code_points.map { |code_point| code_point_to_char(code_point) }.join | ||
end | ||
|
||
def normalize_code_points(code_points) | ||
code_points = code_points.map { |code_point| decompose code_point }.flatten | ||
reorder code_points | ||
code_points | ||
end | ||
|
||
#Recursively replace the given code point with the values in its Decomposition_Mapping property | ||
def decompose(code_point) | ||
unicode_data = TwitterCldr::Shared::UnicodeData.for_code_point(code_point) | ||
Wouldn't it be cool if
That's a great idea…I actually started implementing something like this, but let it go early. In my version,
keys = [:codepoint, :name, :category, :combining_class, :bidi_class, :decomposition, :digit_value, :non_decimal_digit_value, :numeric_value, :bidi_mirrored, :unicode1_name, :iso_comment, :simple_uppercase_map, :simple_lowercase_map, :simple_titlecase_map]
Hash[keys.zip UnicodeData.for_code_point('1F3E9')]
which gives me:
{:codepoint=>"1F3E9", :name=>"LOVE HOTEL", :category=>"So", :combining_class=>"0", :bidi_class=>"ON", :decomposition=>"", :digit_value=>"", :non_decimal_digit_value=>"", :numeric_value=>"", :bidi_mirrored=>"N", :unicode1_name=>"", :iso_comment=>"", :simple_uppercase_map=>"", :simple_lowercase_map=>"", :simple_titlecase_map=>""}
Wouldn't that be simpler than returning an instance of
@timothyandrew, you can use
@KL-7 Yeah, I think that's the perfect solution for this. Thanks for the idea; I had no idea ruby had something like this! :)
The change is at pull request #19.
||
return code_point unless unicode_data | ||
decomposition_mapping = unicode_data[5].split | ||
|
||
# Special decomposition for Hangul syllables. | ||
# Documented in Section 3.12 at http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf | ||
if unicode_data[1].include? 'Hangul' | ||
sIndex = code_point.hex - @@hangul_constants[:SBase] | ||
|
||
lIndex = sIndex / @@hangul_constants[:NCount] | ||
vIndex = (sIndex % @@hangul_constants[:NCount]) / @@hangul_constants[:TCount] | ||
tIndex = sIndex % @@hangul_constants[:TCount] | ||
|
||
lPart = (@@hangul_constants[:LBase] + lIndex).to_s(16).upcase | ||
vPart = (@@hangul_constants[:VBase] + vIndex).to_s(16).upcase | ||
tPart = (@@hangul_constants[:TBase] + tIndex).to_s(16).upcase if tIndex > 0 | ||
|
||
[lPart, vPart, tPart].compact | ||
|
||
#Return the code point if compatibility mapping or if no mapping exists | ||
elsif decomposition_mapping.first =~ /<.*>/ || decomposition_mapping.empty? | ||
code_point | ||
else | ||
decomposition_mapping.map do |decomposition_code_point| | ||
decompose(decomposition_code_point) | ||
end.flatten | ||
end | ||
end | ||
|
||
#Swap any two adjacent code points A & B if ccc(A) > ccc(B) > 0 | ||
def reorder(code_points) | ||
(code_points.size).times do | ||
Why parenthesis here?
Just thought it made it a little more readable than
I see, but afaik people usually chain methods as much as they need without adding any unnecessary parenthesis.
||
code_points.each_with_index do |cp, i| | ||
unless i == code_points.size - 1 | ||
ccc_a, ccc_b = combining_class_for(cp), combining_class_for(code_points[i+1]) | ||
if (ccc_a > ccc_b) && (ccc_b > 0) | ||
code_points[i], code_points[i+1] = code_points[i+1], code_points[i] | ||
end | ||
end | ||
end | ||
end | ||
end | ||
|
||
def combining_class_for(code_point) | ||
begin | ||
TwitterCldr::Shared::UnicodeData.for_code_point(code_point)[3].to_i | ||
rescue NoMethodError | ||
0 | ||
end | ||
end | ||
end | ||
end | ||
end | ||
end |
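The Hangul branch of `decompose` is plain integer arithmetic on the constants from Section 3.12 of the Unicode standard, so it can be checked without any Unicode data file. A minimal sketch follows; the method name `decompose_hangul` and the top-level constants are hypothetical, chosen so the snippet runs on its own:

```ruby
# Constants from Section 3.12 of the Unicode standard.
S_BASE  = 0xAC00
L_BASE  = 0x1100
V_BASE  = 0x1161
T_BASE  = 0x11A7
V_COUNT = 21
T_COUNT = 28
N_COUNT = V_COUNT * T_COUNT # 588

# Hypothetical helper: takes a Hangul syllable as a hex string and
# returns its jamo as hex strings, the same shape the NFD class uses.
def decompose_hangul(code_point)
  s_index = code_point.hex - S_BASE

  l_index = s_index / N_COUNT
  v_index = (s_index % N_COUNT) / T_COUNT
  t_index = s_index % T_COUNT

  parts = [L_BASE + l_index, V_BASE + v_index]
  # A trailing consonant exists only when t_index is non-zero.
  parts << T_BASE + t_index if t_index > 0

  parts.map { |cp| cp.to_s(16).upcase }
end

p decompose_hangul('D55C') # ["1112", "1161", "11AB"]
p decompose_hangul('AC00') # ["1100", "1161"]
```

For U+D55C the syllable index is 10588, which gives a leading index of 18, a vowel index of 0 and a trailing index of 4.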
Spelling ^_^
Oops. Saying it that way is more like the norm here in India. 😃
Will make the change.
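For reference, the swap rule in `reorder` (exchange adjacent code points A and B when ccc(A) > ccc(B) > 0) can be tried in isolation. This is a sketch under stated assumptions, not the code in the pull request: `COMBINING_CLASSES` is a hypothetical three-entry table standing in for `UnicodeData`, the loop compares by index, and it returns a copy instead of mutating its argument.

```ruby
# Hypothetical stand-in for the UnicodeData lookup; values are the
# canonical combining classes of U+0061, U+0301 and U+0323.
COMBINING_CLASSES = { '0061' => 0, '0301' => 230, '0323' => 220 }

def combining_class_for(code_point)
  # Unknown code points are treated as starters (class 0).
  COMBINING_CLASSES.fetch(code_point, 0)
end

# Bubble adjacent pairs until no swap happens in a full pass.
def reorder(code_points)
  result = code_points.dup
  loop do
    swapped = false
    (0...result.size - 1).each do |i|
      ccc_a = combining_class_for(result[i])
      ccc_b = combining_class_for(result[i + 1])
      if ccc_a > ccc_b && ccc_b > 0
        result[i], result[i + 1] = result[i + 1], result[i]
        swapped = true
      end
    end
    break unless swapped
  end
  result
end

p reorder(%w[0061 0301 0323]) # ["0061", "0323", "0301"]
p reorder(%w[0301 0061 0323]) # ["0301", "0061", "0323"]
```

The second call is left unchanged because a starter (class 0) is never moved, so combining marks cannot cross it.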