NFKD implementation #32
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
# encoding: UTF-8 | ||
|
||
# Copyright 2012 Twitter, Inc | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
|
||
module TwitterCldr | ||
module Normalizers | ||
|
||
# Implements normalization of a Unicode string to Normalization Form D (NFD). | ||
# This normalization includes only Canonical Decomposition. | ||
# | ||
class NFD < NFKD | ||
|
||
class << self | ||
|
||
protected | ||
|
||
# Returns code point's Decomposition Mapping based on its Unicode data. Returns nil if the mapping has | ||
# compatibility type (it contains compatibility formatting tag). | ||
# | ||
def decomposition_mapping(unicode_data) | ||
mapping = parse_decomposition_mapping(unicode_data) | ||
mapping unless compatibility_decomposition?(mapping) | ||
end | ||
|
||
end | ||
|
||
end | ||
end | ||
end |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,163 @@ | ||
# encoding: UTF-8 | ||
|
||
# Copyright 2012 Twitter, Inc | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
|
||
module TwitterCldr | ||
# The Normalizers module includes algorithms for Unicode normalization. Basic information on this topic can be found in | ||
# Unicode Standard Annex #15 "Unicode Normalization Forms" at http://www.unicode.org/reports/tr15/. A more detailed | ||
# description is given in section "3.11 Normalization Forms" of the Unicode Standard core specification. The | ||
# latest version at the moment (for Unicode 6.1) is available at http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf. | ||
# | ||
module Normalizers | ||
class NFKD | ||
|
||
class << self | ||
|
||
def normalize(string) | ||
code_points = TwitterCldr::Utils::CodePoints.from_string(string) | ||
normalized_code_points = normalize_code_points(code_points) | ||
TwitterCldr::Utils::CodePoints.to_string(normalized_code_points) | ||
end | ||
|
||
def normalize_code_points(code_points) | ||
canonical_ordering(decomposition(code_points)) | ||
end | ||
|
||
protected | ||
|
||
def decomposition(code_points) | ||
code_points.map{ |code_point| decompose_recursively(code_point) }.flatten | ||
end | ||
|
||
# Recursively decomposes a given code point with the values in its Decomposition Mapping property. | ||
# | ||
def decompose_recursively(code_point) | ||
unicode_data = TwitterCldr::Shared::UnicodeData.for_code_point(code_point) | ||
return code_point unless unicode_data | ||
|
||
if unicode_data.name.include?('Hangul') | ||
decompose_hangul(code_point) | ||
else | ||
decompose_regular(code_point, decomposition_mapping(unicode_data)) | ||
end | ||
end | ||
|
||
# Decomposes regular (non-Hangul) code point. | ||
# | ||
def decompose_regular(code_point, mapping) | ||
if mapping && !mapping.empty? | ||
mapping.map{ |cp| decompose_recursively(cp) }.flatten | ||
else | ||
code_point | ||
end | ||
end | ||
|
||
# Returns code point's Decomposition Mapping based on its Unicode data. | ||
# | ||
def decomposition_mapping(unicode_data) | ||
mapping = parse_decomposition_mapping(unicode_data) | ||
mapping.shift if compatibility_decomposition?(mapping) # remove compatibility formatting tag | ||
mapping | ||
end | ||
|
||
def compatibility_decomposition?(mapping) | ||
!!(COMPATIBILITY_FORMATTING_TAG_REGEXP =~ mapping.first) | ||
end | ||
|
||
def parse_decomposition_mapping(unicode_data) | ||
unicode_data.decomposition.split | ||
end | ||
|
||
# Special decomposition for Hangul syllables. Documented in Section 3.12 at | ||
# http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf | ||
# | ||
def decompose_hangul(code_point) | ||
s_index = code_point.hex - HANGUL_DECOMPOSITION_CONSTANTS[:SBase] | ||
|
||
l_index = s_index / HANGUL_DECOMPOSITION_CONSTANTS[:NCount] | ||
v_index = (s_index % HANGUL_DECOMPOSITION_CONSTANTS[:NCount]) / HANGUL_DECOMPOSITION_CONSTANTS[:TCount] | ||
t_index = s_index % HANGUL_DECOMPOSITION_CONSTANTS[:TCount] | ||
|
||
result = [] | ||
|
||
result << (HANGUL_DECOMPOSITION_CONSTANTS[:LBase] + l_index).to_s(16).upcase | ||
result << (HANGUL_DECOMPOSITION_CONSTANTS[:VBase] + v_index).to_s(16).upcase | ||
result << (HANGUL_DECOMPOSITION_CONSTANTS[:TBase] + t_index).to_s(16).upcase if t_index > 0 | ||
|
||
result | ||
end | ||
|
||
# Performs the Canonical Ordering Algorithm by stable sorting of every subsequence of combining code points | ||
# (code points that have combining class greater than zero). | ||
# | ||
def canonical_ordering(code_points) | ||
code_points_with_cc = code_points.map { |cp| [cp, combining_class_for(cp)] } | ||
|
||
result = [] | ||
accum = [] | ||
|
||
code_points_with_cc.each do |cp_with_cc| | ||
if cp_with_cc[1] == 0 | ||
unless accum.empty? | ||
result.concat(stable_sort(accum)) | ||
accum = [] | ||
end | ||
result << cp_with_cc | ||
else | ||
accum << cp_with_cc | ||
end | ||
end | ||
|
||
result.concat(stable_sort(accum)) unless accum.empty? | ||
|
||
result.map { |cp_with_cc| cp_with_cc[0] } | ||
end | ||
|
||
# Performs stable sorting of a sequence of [code_point, combining_class] pairs. | ||
Could you add a link to the document (if you or Andrew used one) that explains this algorithm?
It's nothing more than a regular bubble sort with a small optimization that stops the algorithm if nothing was swapped during the iteration.
Ah, @camertron, do you mean a document explaining the NFD & NFKD algorithms themselves?
It would be great to add a comment at the top of this file with a link to the NFKD spec, but I was referring to a description of the stable sort algorithm you used. If you mention it's just a bubble sort that maintains the order of equal elements and exits early if no swap was performed, that should be fine. |
||
# | ||
def stable_sort(code_points_with_cc) | ||
n = code_points_with_cc.size - 2 | ||
|
||
code_points_with_cc.size.times do | ||
swapped = false | ||
|
||
(0..n).each do |j| | ||
if code_points_with_cc[j][1] > code_points_with_cc[j + 1][1] | ||
code_points_with_cc[j], code_points_with_cc[j + 1] = code_points_with_cc[j + 1], code_points_with_cc[j] | ||
swapped = true | ||
end | ||
end | ||
|
||
break unless swapped | ||
n -= 1 | ||
end | ||
|
||
code_points_with_cc | ||
end | ||
|
||
def combining_class_for(code_point) | ||
TwitterCldr::Shared::UnicodeData.for_code_point(code_point).combining_class.to_i | ||
rescue NoMethodError | ||
0 | ||
end | ||
|
||
end | ||
|
||
COMPATIBILITY_FORMATTING_TAG_REGEXP = /^<.*>$/ | ||
|
||
HANGUL_DECOMPOSITION_CONSTANTS = { | ||
:SBase => 0xAC00, | ||
:LBase => 0x1100, | ||
:VBase => 0x1161, | ||
:TBase => 0x11A7, | ||
:LCount => 19, | ||
:VCount => 21, | ||
:TCount => 28, | ||
:NCount => 588, # VCount * TCount | ||
:SCount => 11172 # LCount * NCount | ||
} | ||
|
||
end | ||
end | ||
end |
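To make the index arithmetic in `decompose_hangul` easier to check, here is a minimal standalone sketch. It is not part of the PR: `decompose_hangul_syllable` and `to_hex` are hypothetical helpers, and unlike the method in the diff the sketch takes Integer code points rather than hex strings. The constants are the same values used in `HANGUL_DECOMPOSITION_CONSTANTS`.

```ruby
# Constants copied from the diff (Unicode Standard, Section 3.12).
S_BASE  = 0xAC00
L_BASE  = 0x1100
V_BASE  = 0x1161
T_BASE  = 0x11A7
T_COUNT = 28
N_COUNT = 588 # VCount * TCount

# Hypothetical helper: decomposes a precomposed Hangul syllable (Integer)
# into its leading consonant, vowel and optional trailing consonant.
def decompose_hangul_syllable(code_point)
  s_index = code_point - S_BASE

  l_index = s_index / N_COUNT
  v_index = (s_index % N_COUNT) / T_COUNT
  t_index = s_index % T_COUNT

  jamo = [L_BASE + l_index, V_BASE + v_index]
  jamo << (T_BASE + t_index) if t_index > 0 # t_index 0 means "no trailing consonant"
  jamo
end

# Hypothetical helper: formats code points the way the diff does (uppercase hex).
def to_hex(code_points)
  code_points.map { |cp| cp.to_s(16).upcase }
end

# U+D55C: s_index = 10588, l_index = 18, v_index = 0, t_index = 4
puts to_hex(decompose_hangul_syllable(0xD55C)).inspect # ["1112", "1161", "11AB"]
# U+AC00: s_index = 0, so there is no trailing consonant
puts to_hex(decompose_hangul_syllable(0xAC00)).inspect # ["1100", "1161"]
```

The sketch assumes the input is already known to be in the precomposed syllable range (U+AC00..U+D7A3); it does no range checking of its own.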
Ah ok. Just so I understand completely: this method does not handle compatibility mapping as NFKD does, so it simply returns nil if a compatibility mapping is required.
Yes. When NFD comes across a compatibility mapping, it skips it, while NFKD removes the compatibility tag and continues decomposing.
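The difference described in this exchange can be illustrated with a small standalone sketch. It is not code from the PR: `nfkd_mapping` and `nfd_mapping` are hypothetical free functions that mirror the two `decomposition_mapping` variants, and they take the raw decomposition field as a string instead of a Unicode data object. The sample fields are the UnicodeData.txt entries for U+00A8 (a compatibility mapping) and U+00C5 (a canonical one).

```ruby
# Same pattern as COMPATIBILITY_FORMATTING_TAG_REGEXP in the diff.
COMPATIBILITY_TAG = /^<.*>$/

# Hypothetical helper mirroring NFKD: drop the tag, keep the mapping.
def nfkd_mapping(decomposition_field)
  mapping = decomposition_field.split
  mapping.shift if COMPATIBILITY_TAG =~ mapping.first
  mapping
end

# Hypothetical helper mirroring NFD: a tagged mapping is ignored (nil).
def nfd_mapping(decomposition_field)
  mapping = decomposition_field.split
  mapping unless COMPATIBILITY_TAG =~ mapping.first
end

compat    = "<compat> 0020 0308" # U+00A8 DIAERESIS
canonical = "0041 030A"          # U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE

puts nfkd_mapping(compat).inspect    # ["0020", "0308"]
puts nfd_mapping(compat).inspect     # nil
puts nfkd_mapping(canonical).inspect # ["0041", "030A"]
puts nfd_mapping(canonical).inspect  # ["0041", "030A"]
```

Because `decompose_regular` in the diff treats a nil or empty mapping as "leave the code point as is", returning nil from the NFD variant is enough to skip compatibility decompositions.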