Unicode Collation Algorithm. #40

Merged

merged 49 commits into twitter:master

2 participants

@KL-7
Collaborator

It's not ready to be merged at the moment, but there's already a very basic implementation of the UCA.

In addition, I squeezed in a couple of changes to the Resources specs to test it more thoroughly. Honestly, I was going to change the implementation a bit, so I added some specs and made the changes. Then I decided to revert the changes (at least for now), but I can't throw the specs away ;) So, here they are.

I'll be adding stuff to this PR as the implementation grows. At the moment we still need to handle at least some of the special cases described in the docs, and to normalize the input string (maybe with some optimizations in this part, because straightforward application of NFD to the argument of the sort_key method has a noticeable impact on performance).

@KL-7
Collaborator

Yay! Only 84 of the total 160K tests are left failing. I hope they're all related to the same issue: canonical reordering (performed during normalization) removes a contraction that exists in the original (denormalized) string. I don't see a good workaround for that at the moment, but I'll think about it over the weekend.

KL-7 added some commits
@KL-7 KL-7 Improve Resources spec. 150cbc5
@KL-7 KL-7 Add FractionalUCA_SHORT.txt to resources. 3b3555a
@KL-7 KL-7 Add Collation::Trie. 9a00894
@KL-7 KL-7 Add Collation::Collator. 58ab7c1
@KL-7 KL-7 Do not upcase in CodePoints.to_char. 6d79525
@KL-7 KL-7 Handle surrogates as completely ignorable. cdff267
@KL-7 KL-7 Assign implicit weights to unknown and ill-formed code points. 871284f
@KL-7 KL-7 Extract Hangul decomposition from Normalizers. 9798d9f
@KL-7 KL-7 Add Hangul.hangul_syllable? method. 5987fe0
@KL-7 KL-7 Decompose Hangul syllables before collation. 17daf5a
@KL-7 KL-7 Fix empty wydes array for 0. 8b4dd09
@KL-7 KL-7 Normalize string before collation. 30060d6
@KL-7 KL-7 After NFD Hangul syllables are processed normally. b583d45
@KL-7 KL-7 Fix typo. fafc1ee
@KL-7 KL-7 Strip case bits from the 3rd level of fractional CE. 23eab88
@KL-7 KL-7 Write failures_short.txt for easy diff comparison. a126452
@KL-7 KL-7 Use CollationTest_CLDR_NON_IGNORABLE.txt. e238ce5
@KL-7 KL-7 Do not convert fractional CE into regular ones. 872abca
@KL-7 KL-7 Fix parsing FractionalUCA data for fractional CE with comments. 1d76e20
@KL-7 KL-7 Treat illegal xxFFFE and xxFFFF values as ignorable. 64370c7
@KL-7 KL-7 Remove Collator#build_collation_elements. b857f41
@KL-7 KL-7 Separate sort key levels generation. 7e1b447
@KL-7 KL-7 Extract TwitterCldr::Collation::SortKey. 8a97ad8
@KL-7 KL-7 Print failures count while running collation test. 2f0ddd3
@KL-7 KL-7 Add basic specs for SortKey. e0d568e
@KL-7 KL-7 Compress tertiary weights. 2440666
@KL-7 KL-7 Unfold '006C | 00B7;' syntax in FractionalUCA_SHORT.txt
Replaced

    006C | 00B7; [, DB A9, 05]

with

    006C 00B7; [3D, 05, 05][, DB A9, 05]

where [3D, 05, 05] is the fractional collation element of 006C code
point.

This can break if 006C is tailored (because 006C 00B7 won't be tailored
automatically in that case).
e35eeec
@KL-7 KL-7 Make Collator#trie public. 9c3a012
@KL-7 KL-7 Add secondaries compression. 2370b74
@KL-7 KL-7 Extract Collation::TrieBuilder. a998ca5
@KL-7 KL-7 Port ImplicitCEGenerator from ICU. 8957981
@KL-7 KL-7 Extract Hangul composition into Hangul.compose. 153c5c3
@KL-7
Collaborator

@camertron, have a look at 153c5c3, please. I extracted NFKC.compose_hangul into Hangul.compose so all the Hangul-related stuff is in one place, and made two tiny changes in NFKC itself. Did I get the idea behind the valid_hangul_sequence? method right?

@KL-7
Collaborator

Hey, @camertron. Looks like the first part of collation is almost done. Only 3 of the 160K tests are left failing, and they're related to the problem with Hangul decomposition that I described here. I'd appreciate it if you could help me with that, but maybe I'll look into it myself tomorrow. Besides that, I'd like to go over all the changes I've made (it turned out to be quite a lot of code in the end), add some comments, and maybe clean a few things up.

Also, I'm going to extract a short version of the Unicode collation test for running locally and set up a full test run on Travis CI.

One thing I'd like to add is a couple of methods on the Collator class: compare(str1, str2) and sort(strings). Our users can do this themselves (as I do in the specs right now), but it'd be nice to have these methods built in. Plus, sort keys are efficient for multiple comparisons: if you want to sort a big list of strings, you should build the sort keys first and then sort by them, instead of rebuilding sort keys every time you compare two strings. We can handle that inside the sort method.
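The keys-once idea can be sketched like this (the key function below is a placeholder for illustration, not the real Collator#get_sort_key):

```ruby
# Hypothetical sketch: sort with precomputed keys so each string's sort key
# is built once, instead of once per pairwise comparison.
def dummy_sort_key(str)
  str.unpack("U*") # placeholder: code points instead of real UCA sort keys
end

def sort_with_keys(strings)
  # sort_by computes the key once per element (a Schwartzian transform),
  # which is exactly what a built-in sort method could do internally.
  strings.sort_by { |s| dummy_sort_key(s) }
end

sort_with_keys(%w[banana cherry apple]) # => ["apple", "banana", "cherry"]
```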

Please, have a look at the code and let me know if you have any comments.

@KL-7
Collaborator

@camertron, I renamed the Normalizers module to Normalization because it sounds like a better match for the Collation module. Do you mind?

@camertron
Collaborator

@KL-7, renaming Normalizers to Normalization is fine, but it doesn't match up with Formatters and Tokenizers as it used to. What if you put your collation code into a Collators module?

@camertron camertron commented on the diff
lib/twitter_cldr/collation/collator.rb
((23 lines not shown))
+      def self.trie
+        @trie ||= TwitterCldr::Collation::TrieBuilder.load_trie(FRACTIONAL_UCA_SHORT_RESOURCE)
+      end
+
+      private
+
+      def sort_key_for_code_points(integer_code_points)
+        TwitterCldr::Collation::SortKey.build(get_collation_elements(integer_code_points))
+      end
+
+      def get_integer_code_points(str_or_code_points)
+        code_points = str_or_code_points.is_a?(String) ? TwitterCldr::Utils::CodePoints.from_string(str_or_code_points) : str_or_code_points
+
+        # Normalization makes the collation process significantly slower (like seven times slower on the UCA
+        # non-ignorable test from CollationTest_NON_IGNORABLE.txt). ICU uses some optimizations to apply normalization
+        # only in special, rare cases. We need to investigate possible solutions and do normalization cleverly too.
@camertron Collaborator

Are you referring to the nfd_quick_check technique?

@KL-7 Collaborator

Huh, do we have that? There's something called Fast C or D (FCD) in the ICU documentation. They use it to skip normalization when it's not necessary.

@camertron Collaborator

Yeah, I think that's the same thing. We don't have it yet - it will require adding and loading a few more resources.

@KL-7
Collaborator

Twitter::Collators::Collator doesn't look good to me (mostly because there's only one collator, and it's not the only significant class here). I think of these two modules as implementations of Unicode normalization and collation (hence the names), so it seems fine that neither of them matches the others. What do you think?

lib/twitter_cldr/collation/collator.rb
@@ -0,0 +1,125 @@
+# encoding: UTF-8
+
+# Copyright 2012 Twitter, Inc
+# http://www.apache.org/licenses/LICENSE-2.0
+
+module TwitterCldr
+  module Collation
+
+    class Collator
+
+      FRACTIONAL_UCA_SHORT_RESOURCE = 'collation/FractionalUCA_SHORT.txt'
+
+      UNMARKED = 3
@camertron Collaborator

Is this used anywhere?

@KL-7 Collaborator

It was...

lib/twitter_cldr/collation/implicit_ce_generator.rb
((46 lines not shown))
+        last3 = last2 / MEDIAL_COUNT
+        last2 %= MEDIAL_COUNT
+
+        # spread out, leaving gap at start
+        last0 = MIN_TRAIL + last0 * FINAL_4_MULTIPLIER
+
+        # offset
+        last1 += MIN_TRAIL
+        last2 += MIN_TRAIL
+        last3 += MIN_4_PRIMARY
+
+        (last3 << 24) + (last2 << 16) + (last1 << 8) + last0
+      end
+    end
+
+    # Method used to:
@camertron Collaborator

Did you follow an algorithm or a spec for this def? If so, please give the URL here :)

@KL-7 Collaborator

I haven't found a good description of what has to be done with CJK, so I ported this class (the necessary methods and constants) from ICU4J. I'll add some notes about that later.
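For comparison with the ported ICU code: UTS #10 itself derives implicit primary weights for ideographic and unassigned code points with simple arithmetic. This is a sketch of the UCA formula, not the ICU fractional algorithm in the diff above:

```ruby
# UTS #10 implicit weight sketch: a code point maps to a two-part primary
# weight (AAAA BBBB). Base 0xFB40 is used for core Han ideographs; 0xFBC0
# for unassigned code points.
def implicit_primary_weights(code_point, base = 0xFB40)
  aaaa = base + (code_point >> 15)          # high bits select the lead weight
  bbbb = (code_point & 0x7FFF) | 0x8000     # low 15 bits, top bit set
  [aaaa, bbbb]
end

# U+4E00 (the first CJK Unified Ideograph):
implicit_primary_weights(0x4E00).map { |w| w.to_s(16).upcase } # => ["FB40", "CE00"]
```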

lib/twitter_cldr/collation/sort_key.rb
((9 lines not shown))
+    class SortKey
+
+      PRIMARY_LEVEL, SECONDARY_LEVEL, TERTIARY_LEVEL = 0, 1, 2
+
+      LEVEL_SEPARATOR = 1 # separate levels in a sort key with '01' bytes
+
+      TERTIARY_LEVEL_MASK = 0x3F # mask for removing case bits from a tertiary weight ('CC' bits in 'CC00 0000')
+
+      attr_reader :collation_elements
+
+      # Returns a sort key as an array of bytes.
+      #
+      # collation_elements - an array of collation elements, represented as arrays of integer weights.
+      #
+      def self.build(collation_elements)
+        new(collation_elements).bytes_array
@camertron Collaborator

I would have expected build to return an instance of SortKey instead of a byte array. If you're going to return a byte array, why create an instance of SortKey at all? Throughout the rest of the collation code, it doesn't look like instances of SortKey are retained anywhere - just the byte arrays.

@KL-7 Collaborator

I extracted it from Collator to keep things separate. And I need an instance because otherwise @bytes_array and @common_count would have to be passed from one method to another as arguments, which doesn't feel right in this particular case, where quite a lot of methods are involved.
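The pattern under discussion can be sketched in isolation (a toy class, not the real SortKey — it flattens only primary weights, and the names are illustrative):

```ruby
# Class-method constructor that builds a short-lived instance, so
# intermediate state lives in ivars instead of being threaded through
# method arguments, then returns only the byte array.
class SortKeySketch
  def self.build(collation_elements)
    new(collation_elements).bytes_array
  end

  def initialize(collation_elements)
    @collation_elements = collation_elements
    @bytes_array = []
  end

  def bytes_array
    append_primaries
    @bytes_array
  end

  private

  def append_primaries
    # The real SortKey also appends secondary and tertiary levels; this
    # sketch only collects the primary weight of each collation element.
    @collation_elements.each { |ce| @bytes_array << ce.first }
  end
end

SortKeySketch.build([[0x3D, 0x05, 0x05], [0x3E, 0x05, 0x05]]) # => [0x3D, 0x3E]
```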

@camertron camertron commented on the diff
lib/twitter_cldr/collation/trie.rb
((15 lines not shown))
+      def add(key, value)
+        final = key.inject(@root) do |node, key_element|
+          node[1][key_element] ||= [nil, {}]
+        end
+
+        final[0] = value
+      end
+
+      def get(key)
+        final = key.inject(@root) do |node, key_element|
+          subtree = node[1][key_element]
+          return unless subtree
+          subtree
+        end
+
+        final[0]
@camertron Collaborator

Cleaner (IMHO) to say final.first

@KL-7 Collaborator

Agreed, but we don't have Array#second from ActiveSupport here, so final.first and node[1] wouldn't look nice together. That's why I thought about creating a Node class. But, as I said, I eventually decided against it and implemented everything with plain arrays and raw indexes =)
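For reference, the trie from the diff works as a self-contained class (same node shape — each node is a [value, children] pair — with only the class name added here):

```ruby
# Array-based trie: node = [value, children_hash]. Keys are arrays of
# elements (e.g. code points); intermediate nodes carry a nil value.
class ArrayTrie
  def initialize
    @root = [nil, {}]
  end

  def add(key, value)
    final = key.inject(@root) do |node, key_element|
      node[1][key_element] ||= [nil, {}] # create missing child nodes
    end
    final[0] = value
  end

  def get(key)
    final = key.inject(@root) do |node, key_element|
      subtree = node[1][key_element]
      return unless subtree # bail out with nil on a missing path
      subtree
    end
    final[0]
  end
end

trie = ArrayTrie.new
trie.add([0x00, 0x6C], [0x3D, 0x05, 0x05])
trie.get([0x00, 0x6C]) # => [0x3D, 0x05, 0x05]
trie.get([0x00, 0x6D]) # => nil
```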

@camertron camertron commented on the diff
lib/twitter_cldr/collation/trie_builder.rb
((12 lines not shown))
+
+      def self.load_trie(file_path)
+        new(file_path).build
+      end
+
+      def initialize(resource)
+        @file_path = File.join(TwitterCldr::RESOURCES_DIR, resource)
+      end
+
+      def build
+        parse_trie(load_collation_elements_table)
+      end
+
+      private
+
+      def parse_trie(table)
@camertron Collaborator

Streaming the resource is clever, nicely done! However, the table argument is a little misleading - it's not really the table you're passing in, it's a file handle. Could we make this streaming behavior available in TwitterCldr::Resources?

@KL-7 Collaborator

What do you mean by streaming? I separated this method from load_collation_elements_table to be able to stub the latter with a string in the specs. That's why I chose the neutral name table for the argument.

@camertron Collaborator

I meant streaming in the sense that you're not loading the entire table into memory. Instead, you're opening the file with open and reading and parsing each line individually. The line parse_trie(load_collation_elements_table) doesn't pass the table to parse_trie, it passes an instance of File.

@KL-7 Collaborator

Ah, now I see what you mean. Unfortunately, I guess, one does not simply stream a YAML file. And all our other resources are in this format.

@KL-7 Collaborator

Well, you can actually pass a file handle (an IO object) into YAML.load, or even a file name directly into YAML.load_file. That might be more efficient, I think.

@camertron Collaborator

Oh no, I meant we could use streaming for non-yml files, but move the logic for streaming into TwitterCldr::Resources. Just a thought, probably not necessary.
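The distinction above can be sketched like this (file contents and helper names are illustrative, not the gem's API):

```ruby
require "tempfile"

# Line-by-line streaming for plain-text resources: the table is never held
# in memory as one big String. YAML resources, by contrast, have to be
# parsed whole (YAML.load_file), though passing an IO to YAML.load at least
# avoids building the String yourself.
def stream_lines(path)
  File.open(path) do |file|
    file.each_line { |line| yield line.chomp }
  end
end

entries = []
Tempfile.create("table") do |f|
  f.puts "006C; [3D, 05, 05]"
  f.puts "006D; [3E, 05, 05]"
  f.flush
  stream_lines(f.path) { |line| entries << line.split(";").first.strip }
end
entries # => ["006C", "006D"]
```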

@camertron camertron commented on the diff
lib/twitter_cldr/normalization/hangul.rb
@@ -0,0 +1,68 @@
+# encoding: UTF-8
+
+# Copyright 2012 Twitter, Inc
+# http://www.apache.org/licenses/LICENSE-2.0
+
+module TwitterCldr
+  module Normalization
+    module Hangul
@camertron Collaborator

This is a great refactoring of the Hangul stuff, nice work.
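For reference, arithmetic Hangul composition (the algorithm a module like this wraps) follows the standard Unicode formula; a minimal sketch using the standard constants, with names that are illustrative rather than the gem's:

```ruby
# Arithmetic Hangul syllable composition per the Unicode Standard:
# S = SBase + (LIndex * VCount + VIndex) * TCount + TIndex
module HangulSketch
  S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
  V_COUNT, T_COUNT = 21, 28

  # lead, vowel, trail are integer code points; the trailing jamo may be absent.
  def self.compose(lead, vowel, trail = nil)
    l_index = lead - L_BASE
    v_index = vowel - V_BASE
    t_index = trail ? trail - T_BASE : 0
    S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index
  end
end

# U+1112 (ㅎ) + U+1161 (ㅏ) + U+11AB (ㄴ) compose to U+D55C (한):
HangulSketch.compose(0x1112, 0x1161, 0x11AB).to_s(16).upcase # => "D55C"
```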

@camertron camertron commented on the diff
lib/twitter_cldr/normalization/nfkc.rb
@@ -44,7 +44,7 @@ def compose(code_points)
end
if hangul_code_points.size > 1 && !next_hangul_type
-          hangul_code_points.size.times { final.pop }
+          final.pop(hangul_code_points.size)
@camertron Collaborator

Ha! Nice.

@camertron camertron commented on the diff
lib/twitter_cldr/normalization/nfkc.rb
@@ -55,23 +55,11 @@ def compose(code_points)
end
def valid_hangul_sequence?(buffer_size, hangul_type)
-        case [buffer_size, hangul_type]
-        when [0, :lparts], [1, :vparts], [2, :tparts]
-          true
-        else
-          false
-        end
+        [[0, :lparts], [1, :vparts], [2, :tparts]].include?([buffer_size, hangul_type])
@camertron Collaborator

Yes, this is much better. I was doing some other logic with this case before refactoring and just didn't see it this way - thanks!

@camertron camertron commented on the diff
lib/twitter_cldr/normalization/nfkc.rb
((16 lines not shown))
def compose_hangul(code_points)
-        l_index = code_points.first.hex - HANGUL_DECOMPOSITION_CONSTANTS[:LBase]
-        v_index = code_points[1].hex - HANGUL_DECOMPOSITION_CONSTANTS[:VBase]
-        t_index = code_points[2] ? code_points[2].hex - HANGUL_DECOMPOSITION_CONSTANTS[:TBase] : 0 # tpart may be missing, that's ok
-        lv_index = (l_index * HANGUL_DECOMPOSITION_CONSTANTS[:NCount]) + (v_index * HANGUL_DECOMPOSITION_CONSTANTS[:TCount])
-        (HANGUL_DECOMPOSITION_CONSTANTS[:SBase] + lv_index + t_index).to_s(16).upcase.rjust(4, "0")
+        TwitterCldr::Normalization::Hangul.compose(code_points.map { |cp| cp.hex }).to_s(16).upcase.rjust(4, "0")
@camertron Collaborator

Ugh. I'm getting really tired of this .to_s(16).upcase.rjust(4, "0") thing. Maybe we can add a method to Fixnum or something. Hopefully we can move away from doing string code point to int conversions in the future too.

@KL-7 Collaborator

I think the second option is the right choice, and I'd rather leave things as they are until then, so we don't forget about these annoying conversions.
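If a helper were extracted in the meantime, it might look like this (a module function rather than a Fixnum monkey-patch; the name is purely illustrative):

```ruby
# Wraps the repeated "integer code point to 4-digit uppercase hex" idiom
# discussed above in one place.
module CodePointFormat
  # 0x6C => "006C", matching to_s(16).upcase.rjust(4, "0").
  def self.to_hex(code_point)
    code_point.to_s(16).upcase.rjust(4, "0")
  end
end

CodePointFormat.to_hex(0x6C)    # => "006C"
CodePointFormat.to_hex(0x1F1E6) # => "1F1E6" (longer code points pass through)
```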

@camertron
Collaborator

Overall, this PR is magical - I'm super excited about getting it merged in! However I'll confess, I don't understand everything that's going on here, so I'd love to get a walk-through from you on Monday. Here are a few quick things I wanted to bring up:

  1. Where is the FCE table? I don't see it in resources.
  2. As you mentioned previously, it would be awesome to provide a sort method in LocalizedArray that would handle computing sort keys and comparing the items.
  3. What would you think about maintaining a sort key cache? It may grow to be too large if users are computing them for a bunch of different strings, so perhaps we'll need to specify some kind of expiration mechanism. Actually, I was thinking it might be time to think about a TwitterCldr::Cache class that can have some fancy functionality. We could also provide the ability to set a global memory limit - after all, the gem is getting pretty fat.
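A size-bounded cache of the kind floated above could be sketched like this (class name, API, and eviction policy are all assumptions, not part of the gem):

```ruby
# Tiny LRU cache built on Ruby's insertion-ordered Hash: accessing an entry
# re-inserts it so hash order tracks recency; inserts past the limit evict
# the least recently used entry (the hash's first key).
class LruCache
  def initialize(max_entries)
    @max_entries = max_entries
    @store = {}
  end

  def fetch(key)
    if @store.key?(key)
      value = @store.delete(key) # re-insert on access to refresh recency
      @store[key] = value
    else
      value = yield(key)
      @store[key] = value
      @store.delete(@store.first[0]) if @store.size > @max_entries # evict LRU
      value
    end
  end

  def size
    @store.size
  end
end

cache = LruCache.new(2)
cache.fetch("a") { |k| k.upcase } # computed
cache.fetch("b") { |k| k.upcase } # computed
cache.fetch("a") { |k| k.upcase } # cached, recency refreshed
cache.fetch("c") { |k| k.upcase } # evicts "b"
```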

Let me know what you think :)

@KL-7
Collaborator

Thanks for your great comments, @camertron. It's really awesome to know that someone is carefully reading your code =)

  1. Don't worry, it's right there. GitHub is just clever enough not to show such a big file on the diff tab (it's there but the content is not displayed).

  2. Ok, I'll add it.

  3. Caching sort keys is a good idea, but together with a carefully tuned expiration mechanism it sounds like more than a 5-minute task. What if I finish this PR so you can merge it, and then look into caching? A global cache sounds even more awesome. I saw you've already extracted the compute_cache_key method, since we need caching almost everywhere. Though I'm not sure how much memory all our caches will need in a real app. Will our internal cache be good enough? Should we think about an adapter for, say, memcached?

@camertron
Collaborator
  1. Ah ok cool!
  2. Thanks.
  3. Yes, it's much too big a feature to add to this PR - we should add it later for sure. I looked at the size of resources and there are only 3.4MB worth of files in there. There will be some additional overhead of course when the data is read into Ruby objects, but that's really not too large to handle for our internal cache. Still, it might be worth examining how much memory gets used running certain tests (like normalization) or formatting a few dates. That's for our existing cached data. Sort keys are a different story, however. Whereas before we were only caching a distinct, limited collection of values, now we could potentially be caching an unlimited collection of values. I don't know if the problem warrants multiple adapters like memcached or redis... but maybe it does. Do you happen to know how ICU4J handles this?
@KL-7
Collaborator

I like how this PR looks now. What do you say, @camertron?

@camertron camertron merged commit 8b4fff8 into twitter:master
Commits on Jun 15, 2012

  - Improve Resources spec.
  - Add Collation::Trie.
  - Add Collation::Collator.
  - Fix empty wydes array for 0.
  - Fix typo.
  - Add basic specs for SortKey.
  - Compress tertiary weights.
  - Unfold '006C | 00B7;' syntax in FractionalUCA_SHORT.txt
  - Make Collator#trie public.
  - Add secondaries compression.
  - Extract Collation::TrieBuilder.

Commits on Jun 16, 2012

Commits on Jun 18, 2012

  - Merge branch 'master' into collation (conflicts: spec/shared/code_point_spec.rb)
  - Cleanup.
  - Update collation test.

Commits on Jun 19, 2012

  - Add Collator#compare.

Commits on Jun 20, 2012

  - Add Collator#sort.
  - Add ICU4J license information.
  - Rephrase some comments.