Normalization - NFD #17

Merged
24 commits merged on Apr 27, 2012
Changes from 21 commits
Commits
5a674dc
Add helper functions to convert a character to a code point and vice …
timothyandrew Apr 13, 2012
8d568b5
When converting a character to a code point, pad it up to 4 digits wi…
timothyandrew Apr 13, 2012
b62ab8f
Add NormalizationTest.txt
timothyandrew Apr 13, 2012
e6f47b1
Add basic NFD (no Hangul decomposition).
timothyandrew Apr 13, 2012
4c3cd7c
Merge branch 'master' of https://github.com/twitter/twitter-cldr-rb i…
timothyandrew Apr 13, 2012
a79a36b
Rename #decompose_code_points to #decompose
timothyandrew Apr 14, 2012
a2f33f7
Minor refactoring of #reorder
timothyandrew Apr 14, 2012
72b0218
Convert tabs to spaces in Base.rb & nfd_specs.rb.
timothyandrew Apr 17, 2012
d559bed
Convert tabs to spaces in base_spec.rb
timothyandrew Apr 17, 2012
e7b66d2
Merge branch 'master' of https://github.com/twitter/twitter-cldr-rb i…
timothyandrew Apr 17, 2012
a5701b3
Use `&&` and `||` instead of `and` and `or`.
timothyandrew Apr 17, 2012
2d7a38b
Code point range support for UnicodeData#for_code_point.
timothyandrew Apr 17, 2012
13c1403
Remove Array.new(size=15, obj=""). No longer required because of 2d7a…
timothyandrew Apr 17, 2012
c88c407
A first attempt at Hangul decomposition.
timothyandrew Apr 17, 2012
1776bb7
Rename a test in nfd_spec to be accurate with the method name.
timothyandrew Apr 18, 2012
d970564
Use String#rjust to pad the code point up to 4 digits.
timothyandrew Apr 18, 2012
775275e
Add Normalizers::NFD#normalize
timothyandrew Apr 19, 2012
2a260d6
Use Hash#fetch instead of `or`
timothyandrew Apr 19, 2012
7b0b34c
Reduce the size of NormalizationTest.txt
timothyandrew Apr 19, 2012
11c6d28
Fail gracefully for unassigned code points by returning the code point.
timothyandrew Apr 20, 2012
b9ffc4e
Return 0 as the combining class for unassigned code points.
timothyandrew Apr 20, 2012
d5c4955
Fix spelling mistake in comment.
timothyandrew Apr 22, 2012
7dfd9f2
[NFD Bug] Compare indexes instead of elements.
timothyandrew Apr 22, 2012
f91f5c6
Merge branch 'master' of https://github.com/twitter/twitter-cldr-rb i…
timothyandrew Apr 27, 2012
18 changes: 18 additions & 0 deletions lib/normalizers/base.rb
@@ -0,0 +1,18 @@
# encoding: UTF-8

module TwitterCldr
  module Normalizers
    class Base
      class << self
        def code_point_to_char(code_point)
          [code_point.upcase.hex].pack('U*')
        end
        def char_to_code_point(char)
          code_point = char.unpack('U*').first.to_s(16).upcase
          #Pad to atleast 4 digits
Collaborator

Spelling ^_^

Contributor Author

Oops. Saying it that way is more like the norm here in India. 😃
Will make the change.

          code_point.rjust(4, '0')
        end
      end
    end
  end
end
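To illustrate what the two helpers above do, here is a minimal standalone sketch. It assumes top-level methods instead of class methods on `TwitterCldr::Normalizers::Base`, purely so the example runs on its own; the logic mirrors the diff.

```ruby
# Standalone versions of the two helpers, for illustration only.

# "00E9" -> "é": parse the hex string and pack the integer as a UTF-8 character.
def code_point_to_char(code_point)
  [code_point.hex].pack('U*')
end

# "é" -> "00E9": take the first code point, render it as uppercase hex,
# and left-pad with zeros to at least 4 digits.
def char_to_code_point(char)
  char.unpack('U*').first.to_s(16).upcase.rjust(4, '0')
end

puts char_to_code_point("\u00E9")     # 00E9
puts char_to_code_point("\u{1F3E9}")  # 1F3E9 (already longer than 4 digits)
puts code_point_to_char("0041")       # A
```

Code points above U+FFFF are left at their natural length, since `rjust` only pads and never truncates.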
81 changes: 81 additions & 0 deletions lib/normalizers/canonical/nfd.rb
@@ -0,0 +1,81 @@
# encoding: UTF-8

module TwitterCldr
  module Normalizers
    class NFD < Base
      @@hangul_constants = {:SBase => "AC00".hex, :LBase => "1100".hex, :VBase => "1161".hex, :TBase => "11A7".hex,
        :LCount => 19, :VCount => 21, :TCount => 28, :NCount => 588, :SCount => 11172}
      class << self
        def normalize(string)
          #Convert string to code points
          code_points = string.split('').map { |char| char_to_code_point(char) }

          #Normalize code points
          normalized_code_points = normalize_code_points(code_points)

          #Convert normalized code points back to string
          normalized_code_points.map { |code_point| code_point_to_char(code_point) }.join
        end

        def normalize_code_points(code_points)
          code_points = code_points.map { |code_point| decompose code_point }.flatten
          reorder code_points
          code_points
        end

        #Recursively replace the given code point with the values in its Decomposition_Mapping property
        def decompose(code_point)
          unicode_data = TwitterCldr::Shared::UnicodeData.for_code_point(code_point)
Collaborator

Wouldn't it be cool if for_code_point returned an instance of something like TwitterCldr::Shared::UnicodeData::CodePoint? That way we wouldn't have to use array indices to access the code point data. In other words, you could do unicode_data.code_point instead of unicode_data[0]. What do you think?

Contributor Author

That's a great idea…I actually started implementing something like this, but let it go early. In my version, for_code_point returns a hash of values. I can just zip up a list of keys with the values returned by for_code_point to get me my hash:

keys = [:codepoint, :name, :category, :combining_class, :bidi_class, :decomposition, :digit_value, :non_decimal_digit_value, :numeric_value, :bidi_mirrored, :unicode1_name, :iso_comment, :simple_uppercase_map, :simple_lowercase_map, :simple_titlecase_map]
Hash[keys.zip UnicodeData.for_code_point('1F3E9')]

which gives me:

{:codepoint=>"1F3E9", :name=>"LOVE HOTEL", :category=>"So", :combining_class=>"0", :bidi_class=>"ON", :decomposition=>"", :digit_value=>"", :non_decimal_digit_value=>"", :numeric_value=>"", :bidi_mirrored=>"N", :unicode1_name=>"", :iso_comment=>"", :simple_uppercase_map=>"", :simple_lowercase_map=>"", :simple_titlecase_map=>""}

Wouldn't that be simpler than returning an instance of TwitterCldr::Shared::UnicodeData::CodePoint?

Contributor

@timothyandrew, you can use the Struct class for that purpose. Creating a CodePoint struct will be pretty easy, and in return you'll be able to do unicode_data.codepoint, and any typo in an attribute name won't go unnoticed.

Contributor Author

@KL-7 Yeah, I think that's the perfect solution for this. Thanks for the idea; I had no idea Ruby had something like this! :)

Contributor Author

The change is at pull request #19.
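
A rough sketch of the Struct approach suggested in this thread, using the key names from the earlier comment. The struct name, the field list, and the hard-coded sample row are assumptions for illustration; they may not match what pull request #19 actually introduced.

```ruby
# Hypothetical CodePoint struct; field names follow the keys listed in the
# comment above, not necessarily the ones used in the real change.
CodePoint = Struct.new(
  :codepoint, :name, :category, :combining_class, :bidi_class,
  :decomposition, :digit_value, :non_decimal_digit_value, :numeric_value,
  :bidi_mirrored, :unicode1_name, :iso_comment,
  :simple_uppercase_map, :simple_lowercase_map, :simple_titlecase_map
)

# Stand-in for the 15-element array returned by for_code_point('1F3E9').
fields = ["1F3E9", "LOVE HOTEL", "So", "0", "ON", "", "", "", "", "N", "", "", "", "", ""]
love_hotel = CodePoint.new(*fields)

puts love_hotel.name             # LOVE HOTEL
puts love_hotel.combining_class  # 0

# A mistyped attribute raises NoMethodError, whereas a mistyped Hash key
# would quietly return nil.
begin
  love_hotel.combinig_class
rescue NoMethodError => e
  puts "typo caught: #{e.class}"
end
```

Compared with `Hash[keys.zip(values)]`, the struct costs one extra definition but gives named readers instead of array indices such as `unicode_data[5]`.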

          return code_point unless unicode_data
          decomposition_mapping = unicode_data[5].split

          # Special decomposition for Hangul syllables.
          # Documented in Section 3.12 at http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
          if unicode_data[1].include? 'Hangul'
            sIndex = code_point.hex - @@hangul_constants[:SBase]

            lIndex = sIndex / @@hangul_constants[:NCount]
            vIndex = (sIndex % @@hangul_constants[:NCount]) / @@hangul_constants[:TCount]
            tIndex = sIndex % @@hangul_constants[:TCount]

            lPart = (@@hangul_constants[:LBase] + lIndex).to_s(16).upcase
            vPart = (@@hangul_constants[:VBase] + vIndex).to_s(16).upcase
            tPart = (@@hangul_constants[:TBase] + tIndex).to_s(16).upcase if tIndex > 0

            [lPart, vPart, tPart].compact

          #Return the code point if compatibility mapping or if no mapping exists
          elsif decomposition_mapping.first =~ /<.*>/ || decomposition_mapping.empty?
            code_point
          else
            decomposition_mapping.map do |decomposition_code_point|
              decompose(decomposition_code_point)
            end.flatten
          end
        end
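
The Hangul branch above is pure arithmetic, so it can be checked by hand. The following is a minimal standalone sketch of the same calculation; the helper name `decompose_hangul` and the use of integer code points (rather than the hex strings used in the diff) are assumptions made for brevity.

```ruby
# Constants from the Hangul syllable decomposition algorithm in the Unicode Standard.
S_BASE  = 0xAC00
L_BASE  = 0x1100
V_BASE  = 0x1161
T_BASE  = 0x11A7
T_COUNT = 28
N_COUNT = 588 # V_COUNT (21) * T_COUNT (28)

# Hypothetical helper: takes the integer code point of a precomposed syllable
# and returns the integer code points of its jamo.
def decompose_hangul(code_point)
  s_index = code_point - S_BASE
  l = L_BASE + s_index / N_COUNT
  v = V_BASE + (s_index % N_COUNT) / T_COUNT
  t_index = s_index % T_COUNT
  # A trailing consonant is only emitted when t_index is non-zero.
  t_index > 0 ? [l, v, T_BASE + t_index] : [l, v]
end

p decompose_hangul(0xD55C).map { |cp| cp.to_s(16).upcase } # ["1112", "1161", "11AB"]
p decompose_hangul(0xAC00).map { |cp| cp.to_s(16).upcase } # ["1100", "1161"]
```

For U+D55C the syllable index is 10588, which gives a leading-consonant index of 18, a vowel index of 0, and a trailing-consonant index of 4.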

        #Swap any two adjacent code points A & B if ccc(A) > ccc(B) > 0
        def reorder(code_points)
          (code_points.size).times do
Contributor

Why parentheses here?

Contributor Author

Just thought it made it a little more readable than code_points.size.times do

Contributor

I see, but afaik people usually chain methods as much as they need without adding any unnecessary parentheses.

            code_points.each_with_index do |cp, i|
              # Compare positions, not values: a code point equal to the last one may also occur earlier
              unless i == code_points.size - 1
                ccc_a, ccc_b = combining_class_for(cp), combining_class_for(code_points[i+1])
                if (ccc_a > ccc_b) && (ccc_b > 0)
                  code_points[i], code_points[i+1] = code_points[i+1], code_points[i]
                end
              end
            end
          end
        end

        def combining_class_for(code_point)
          TwitterCldr::Shared::UnicodeData.for_code_point(code_point)[3].to_i
        rescue NoMethodError
          # No data for this code point (e.g. unassigned): treat its combining class as 0
          0
        end
      end
    end
  end
end
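
The reordering rule used by `reorder` (swap adjacent A and B when ccc(A) > ccc(B) > 0) can be tried in isolation. This is a minimal sketch that assumes a tiny hand-written combining-class table in place of the `UnicodeData` lookup, and that returns a new array instead of mutating its argument.

```ruby
# Small hand-written table of canonical combining classes, standing in for
# UnicodeData lookups. Anything not listed is treated as ccc 0 (a starter).
CCC = { "0301" => 230, "0323" => 220, "0327" => 202 }

def ccc(code_point)
  CCC.fetch(code_point, 0)
end

# Bubble-sort style passes: swap adjacent A, B whenever ccc(A) > ccc(B) > 0.
# Starters (ccc 0) never move, so marks are only reordered within a run.
def reorder(code_points)
  result = code_points.dup
  result.size.times do
    (0...result.size - 1).each do |i|
      a, b = ccc(result[i]), ccc(result[i + 1])
      result[i], result[i + 1] = result[i + 1], result[i] if a > b && b > 0
    end
  end
  result
end

p reorder(%w[0061 0301 0323])      # ["0061", "0323", "0301"]
p reorder(%w[0061 0301 0062 0323]) # unchanged: the starter 0062 separates the marks
```

Iterating over indexes avoids the pitfall of comparing an element against `code_points.last`, which misfires when the last code point also appears earlier in the sequence.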
21 changes: 19 additions & 2 deletions lib/shared/unicode_data.rb
@@ -12,8 +12,25 @@ def for_code_point(code_point)
             range.include? code_point.to_i(16)
           end

-          TwitterCldr.get_resource("unicode_data", target.first)[code_point.to_sym] if target
-        end
+          if target
+            block_data = TwitterCldr.get_resource("unicode_data", target.first)
+            block_data.fetch(code_point.to_sym) { |code_point_sym| get_range_start(code_point_sym, block_data) }
+          end
+        end
+
+        private
+        # Check if block constitutes a range. The code point beginning a range will have a name enclosed in <>, ending with 'First'
+        # eg: <CJK Ideograph Extension A, First>
+        # http://unicode.org/reports/tr44/#Code_Point_Ranges
+        def get_range_start(code_point, block_data)
+          start_code_point = block_data.keys.sort_by { |key| key.to_s.to_i(16) }.first
+          start_data = block_data[start_code_point].clone
+          if start_data[1] =~ /<.*, First>/
+            start_data[0] = code_point.to_s
+            start_data[1] = start_data[1].sub(', First', '')
+            start_data
+          end
+        end
       end
     end
   end
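
The range fallback relies on the block form of `Hash#fetch`, which only runs when the key is missing. Below is a minimal sketch of that lookup, assuming a made-up, shortened block table (four fields per row instead of fifteen) and a hypothetical helper name `range_start_data`.

```ruby
# Made-up block data: only the first code point of the range has an entry.
block_data = {
  :"3400" => ["3400", "<CJK Ideograph Extension A, First>", "Lo", "0"]
}

# Hypothetical stand-in for get_range_start: copy the row of the lowest code
# point in the block and relabel it, but only if that row marks a range start.
def range_start_data(code_point, block_data)
  first_key = block_data.keys.min_by { |key| key.to_s.to_i(16) }
  data = block_data[first_key].clone
  if data[1] =~ /<.*, First>/
    data[0] = code_point.to_s
    data[1] = data[1].sub(', First', '')
    data
  end
end

lookup = lambda do |code_point|
  # The block receives the missing key and supplies the fallback value.
  block_data.fetch(code_point.to_sym) { |key| range_start_data(key, block_data) }
end

p lookup.call("3450") # ["3450", "<CJK Ideograph Extension A>", "Lo", "0"]
p lookup.call("3400") # the stored row itself, name still ending in ", First>"
```

Because the row is cloned before being relabeled, the stored entry for the range start is left untouched.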
4 changes: 4 additions & 0 deletions lib/twitter_cldr.rb
@@ -114,3 +114,7 @@ def self.supported_locale?(locale)
require 'formatters/numbers/helpers/base'
require 'formatters/numbers/helpers/fraction'
require 'formatters/numbers/helpers/integer'

# all normalizers
require 'normalizers/base'
require 'normalizers/canonical/nfd'