Unicode Collation Algorithm. #40

Merged
merged 49 commits into twitter:master from KL-7:collation on Jun 20, 2012

Conversation

Contributor

KL-7 commented Jun 4, 2012

It's not ready to be merged at the moment, but there is already a very basic implementation of the UCA.

In addition, I squeezed in a couple of changes to the Resources specs to test it more thoroughly. Honestly, I was going to change the implementation a bit, so I added some specs and made the changes. Then I decided to revert the changes (at least for now), but I can't throw the specs away ;) So, here they are.

I'll be adding stuff to this PR as the implementation grows. At the moment we still need to handle at least some of the special cases described in the docs, and to normalize the input string (ideally with some optimizations in that part, because a straightforward application of NFD to the argument of the sort_key method has a noticeable impact on performance).
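Roughly, the intended flow looks like this (a hedged sketch; normalize_nfd is a placeholder for whatever NFD entry point ends up being used, while the other names follow the code in this PR):

    # Hedged sketch: NFD normalization before building the sort key is the
    # expensive step mentioned above.
    def sort_key(str)
      code_points = TwitterCldr::Utils::CodePoints.from_string(str)
      code_points = normalize_nfd(code_points) # placeholder; noticeably slow today
      TwitterCldr::Collation::SortKey.build(get_collation_elements(code_points))
    end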

Contributor

KL-7 commented Jun 15, 2012

Yay! Only 84 failures left out of a total of 160K tests. I hope they're all related to the same issue: canonical reordering (performed during normalization) removes a contraction that exists in the original (denormalized) string. I don't see a good workaround for that at the moment, but I'll think about it over the weekend.

KL-7 added some commits Jun 1, 2012

@KL-7 KL-7 Improve Resources spec. 150cbc5
@KL-7 KL-7 Add FractionalUCA_SHORT.txt to resources. 3b3555a
@KL-7 KL-7 Add Collation::Trie. 9a00894
@KL-7 KL-7 Add Collation::Collator. 58ab7c1
@KL-7 KL-7 Do not upcase in CodePoints.to_char. 6d79525
@KL-7 KL-7 Handle surrogates as completely ignorable. cdff267
@KL-7 KL-7 Assign implicit weights to unknown and ill-formed code points. 871284f
@KL-7 KL-7 Extract Hangul decomposition from Normalizers. 9798d9f
@KL-7 KL-7 Add Hangul.hangul_syllable? method. 5987fe0
@KL-7 KL-7 Decompose Hangul syllables before collation. 17daf5a
@KL-7 KL-7 Fix empty wydes array for 0. 8b4dd09
@KL-7 KL-7 Normalize string before collation. 30060d6
@KL-7 KL-7 After NFD Hangul syllables are processed normally. b583d45
@KL-7 KL-7 Fix typo. fafc1ee
@KL-7 KL-7 Strip case bits from the 3rd level of fractional CE. 23eab88
@KL-7 KL-7 Write failures_short.txt for easy diff comparison. a126452
@KL-7 KL-7 Use CollationTest_CLDR_NON_IGNORABLE.txt. e238ce5
@KL-7 KL-7 Do not convert fractional CE into regular ones. 872abca
@KL-7 KL-7 Fix parsing FractionalUCA data for fractional CE with comments. 1d76e20
@KL-7 KL-7 Treat illegal xxFFFE and xxFFFF values as ignorable. 64370c7
@KL-7 KL-7 Remove Collator#build_collation_elements. b857f41
@KL-7 KL-7 Separate sort key levels generation. 7e1b447
@KL-7 KL-7 Extract TwitterCldr::Collation::SortKey. 8a97ad8
@KL-7 KL-7 Print failures count while running collation test. 2f0ddd3
@KL-7 KL-7 Add basic specs for SortKey. e0d568e
@KL-7 KL-7 Compress tertiary weights. 2440666
@KL-7 KL-7 Unfold '006C | 00B7;' syntax in FractionalUCA_SHORT.txt
Replaced

    006C | 00B7; [, DB A9, 05]

with

    006C 00B7; [3D, 05, 05][, DB A9, 05]

where [3D, 05, 05] is the fractional collation element of 006C code
point.

This can break if 006C is tailored (because 006C 00B7 won't be tailored
automatically in that case).
e35eeec
@KL-7 KL-7 Make Collator#trie public. 9c3a012
@KL-7 KL-7 Add secondaries compression. 2370b74
@KL-7 KL-7 Extract Collation::TrieBuilder. a998ca5
@KL-7 KL-7 Port ImplicitCEGenerator from ICU. 8957981
@KL-7 KL-7 Extract Hangul composition into Hangul.compose. 153c5c3
Contributor

KL-7 commented Jun 15, 2012

@camertron, have a look at 153c5c3, please. I extracted NFKC.compose_hangul into Hangul.compose so that all Hangul-related stuff lives there, and made two tiny changes in NFKC itself. Did I get the idea behind the valid_hangul_sequence? method right?

Contributor

KL-7 commented Jun 16, 2012

Hey, @camertron. Looks like the first part of collation is almost done. Only 3 out of 160K tests are still failing, and they're related to the problem with Hangul decomposition that I described here. I'd appreciate it if you could help me with that, but maybe I'll look into it myself tomorrow. Besides that, I'd like to go over all the changes I've made (it turned out to be quite a lot of code in the end), add some comments, and maybe clean a few things up.

Also, I'm going to extract a short version of the Unicode collation test for running locally and set up a full test run on Travis CI.

One thing I'd like to add is a couple of methods on the Collator class: compare(str1, str2) and sort(strings). Our users can do this themselves (as I do in the specs right now), but it'd be nice to have these methods built in. Plus, sort keys are efficient for repeated comparisons: if you want to sort a big list of strings, you should build the sort keys once up front and sort by them, instead of rebuilding sort keys every time you compare two strings. We can handle that inside the sort method.
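A hedged sketch of what those two methods could look like (not the final API): sort keys are plain byte arrays, so Array#<=> compares them directly, and sort_by builds each key only once.

    class Collator
      # -1, 0 or 1, like <=>
      def compare(str1, str2)
        sort_key(str1) <=> sort_key(str2)
      end

      # Compute each sort key once, then sort by it.
      def sort(strings)
        strings.sort_by { |s| sort_key(s) }
      end
    end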

Please, have a look at the code and let me know if you have any comments.

Contributor

KL-7 commented Jun 16, 2012

@camertron, I renamed the Normalizers module to Normalization because it sounds like a better match for the Collation module. Do you mind?

Collaborator

camertron commented Jun 16, 2012

@KL-7, renaming Normalizers to Normalization is fine, but it doesn't match up with Formatters and Tokenizers as it used to. What if you put your collation code into a Collators module?

@camertron camertron commented on the diff Jun 16, 2012

lib/twitter_cldr/collation/collator.rb
+ def self.trie
+ @trie ||= TwitterCldr::Collation::TrieBuilder.load_trie(FRACTIONAL_UCA_SHORT_RESOURCE)
+ end
+
+ private
+
+ def sort_key_for_code_points(integer_code_points)
+ TwitterCldr::Collation::SortKey.build(get_collation_elements(integer_code_points))
+ end
+
+ def get_integer_code_points(str_or_code_points)
+ code_points = str_or_code_points.is_a?(String) ? TwitterCldr::Utils::CodePoints.from_string(str_or_code_points) : str_or_code_points
+
+ # Normalization makes the collation process significantly slower (like seven times slower on the UCA
+ # non-ignorable test from CollationTest_NON_IGNORABLE.txt). ICU uses some optimizations to apply normalization
+ # only in special, rare cases. We need to investigate possible solutions and do normalization cleverly too.
@camertron

camertron Jun 16, 2012

Collaborator

Are you referring to the nfd_quick_check technique?

@KL-7

KL-7 Jun 16, 2012

Contributor

Huh, do we have it? There's something called Fast C or D (or FCD) in the ICU document. They use it to skip normalization when it's not necessary.

@camertron

camertron Jun 16, 2012

Collaborator

Yeah, I think that's the same thing. We don't have it yet - it will require adding and loading a few more resources.
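For reference, a minimal sketch of the kind of FCD-style check being discussed; lead_ccc and trail_ccc are hypothetical helpers that would need exactly the extra resources mentioned above (per-code-point canonical combining classes and decompositions):

    # A string can skip NFD if, scanning left to right, no character's leading
    # combining class (the ccc of the first code point of its canonical
    # decomposition) is smaller than the previous character's trailing combining
    # class, unless that leading class is 0 (a starter).
    def fcd?(code_points)
      prev_trail = 0
      code_points.all? do |cp|
        lead = lead_ccc(cp)            # hypothetical helper
        ok   = lead == 0 || prev_trail <= lead
        prev_trail = trail_ccc(cp)     # hypothetical helper
        ok
      end
    end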

Contributor

KL-7 commented Jun 16, 2012

TwitterCldr::Collators::Collator doesn't look good to me (mostly because there's only one collator, and it's not the only significant class here). I think of these two modules as implementations of Unicode normalization and collation (hence the names), so it seems fine that neither of them matches the others. What do you think?

@camertron camertron and 1 other commented on an outdated diff Jun 16, 2012

lib/twitter_cldr/collation/collator.rb
@@ -0,0 +1,125 @@
+# encoding: UTF-8
+
+# Copyright 2012 Twitter, Inc
+# http://www.apache.org/licenses/LICENSE-2.0
+
+module TwitterCldr
+ module Collation
+
+ class Collator
+
+ FRACTIONAL_UCA_SHORT_RESOURCE = 'collation/FractionalUCA_SHORT.txt'
+
+ UNMARKED = 3
@camertron

camertron Jun 16, 2012

Collaborator

Is this used anywhere?

@KL-7

KL-7 Jun 16, 2012

Contributor

It was...

@camertron camertron and 1 other commented on an outdated diff Jun 16, 2012

lib/twitter_cldr/collation/implicit_ce_generator.rb
+ last3 = last2 / MEDIAL_COUNT
+ last2 %= MEDIAL_COUNT
+
+ # spread out, leaving gap at start
+ last0 = MIN_TRAIL + last0 * FINAL_4_MULTIPLIER
+
+ # offset
+ last1 += MIN_TRAIL
+ last2 += MIN_TRAIL
+ last3 += MIN_4_PRIMARY
+
+ (last3 << 24) + (last2 << 16) + (last1 << 8) + last0
+ end
+ end
+
+ # Method used to:
@camertron

camertron Jun 16, 2012

Collaborator

Did you follow an algorithm or a spec for this def? If so, please give the URL here :)

@KL-7

KL-7 Jun 16, 2012

Contributor

I haven't found a good description of what has to be done with CJK, so I ported this class (necessary methods and constants) from ICU4J. I'll add some notes about that later.
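For reference only (this is not what the ported ICU4J class does), the default implicit-weight construction from UTS #10 splits the primary weight of an unmapped code point into two 16-bit parts; the fractional generator in this PR achieves the same ordering with different constants and a byte-oriented layout:

    # Default UCA implicit primary weight (UTS #10, "Implicit Weights").
    # base is 0xFB40 for core Han ideographs and 0xFBC0 for unassigned code points.
    def default_implicit_primary(code_point, base = 0xFBC0)
      aaaa = base + (code_point >> 15)
      bbbb = (code_point & 0x7FFF) | 0x8000
      [aaaa, bbbb]
    end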

@camertron camertron and 1 other commented on an outdated diff Jun 16, 2012

lib/twitter_cldr/collation/sort_key.rb
+ class SortKey
+
+ PRIMARY_LEVEL, SECONDARY_LEVEL, TERTIARY_LEVEL = 0, 1, 2
+
+ LEVEL_SEPARATOR = 1 # separate levels in a sort key '01' bytes
+
+ TERTIARY_LEVEL_MASK = 0x3F # mask for removing case bits from tertiary weight ('CC' bits in 'CC00 0000')
+
+ attr_reader :collation_elements
+
+ # Returns a sort key as an array of bytes.
+ #
+ # collation_elements - an array of collation elements, represented as arrays of integer weights.
+ #
+ def self.build(collation_elements)
+ new(collation_elements).bytes_array
@camertron

camertron Jun 16, 2012

Collaborator

I would have expected build to return an instance of SortKey instead of a byte array. If you're going to return a byte array, why create an instance of SortKey at all? Throughout the rest of the collation code, it doesn't look like instances of SortKey are retained anywhere - just the byte arrays.

@KL-7

KL-7 Jun 16, 2012

Contributor

I extracted it from Collator to keep things separate. And I need an instance because otherwise @bytes_array and @common_count would have to be passed from one method to another as arguments, which doesn't feel right in this particular case, since quite a lot of methods are involved.
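For illustration, the entry point as it stands in the diff, using the [3D, 05, 05] fractional collation element from the commit message above (treat the exact output as an implementation detail):

    # Each collation element is an array of integer weights [primary, secondary, tertiary].
    # build flattens them into one byte array: primaries, a 0x01 level separator,
    # secondaries, another separator, then tertiaries with the case bits masked off.
    bytes = TwitterCldr::Collation::SortKey.build([[0x3D, 0x05, 0x05]])
    # => an array of plain integers, so two keys can be compared with Array#<=>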

@camertron camertron commented on the diff Jun 16, 2012

lib/twitter_cldr/collation/trie.rb
+ def add(key, value)
+ final = key.inject(@root) do |node, key_element|
+ node[1][key_element] ||= [nil, {}]
+ end
+
+ final[0] = value
+ end
+
+ def get(key)
+ final = key.inject(@root) do |node, key_element|
+ subtree = node[1][key_element]
+ return unless subtree
+ subtree
+ end
+
+ final[0]
@camertron

camertron Jun 16, 2012

Collaborator

Cleaner (IMHO) to say final.first

@KL-7

KL-7 Jun 16, 2012

Contributor

Agreed, but we don't have Array#second from ActiveSupport here, so final.first and node[1] wouldn't look nice together. That's why I thought about creating a Node class, but, as I said, I eventually decided against it and implemented everything with plain arrays and raw indexes =)
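For illustration, here's how the plain-array node layout behaves with the add/get methods from the diff, assuming the root node starts out as [nil, {}] and borrowing values from the FractionalUCA example above (the empty primary is written as 0x00 purely for illustration):

    trie = TwitterCldr::Collation::Trie.new

    # keys are arrays of integer code points; values here are arrays of collation elements
    trie.add([0x006C], [[0x3D, 0x05, 0x05]])             # "l"
    trie.add([0x006C, 0x00B7], [[0x00, 0xDBA9, 0x05]])   # the "l·" contraction

    trie.get([0x006C])          # => [[0x3D, 0x05, 0x05]]
    trie.get([0x006C, 0x00B7])  # => [[0x00, 0xDBA9, 0x05]]
    trie.get([0x0041])          # => nil (no node along that path)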

@camertron camertron commented on the diff Jun 16, 2012

lib/twitter_cldr/collation/trie_builder.rb
+
+ def self.load_trie(file_path)
+ new(file_path).build
+ end
+
+ def initialize(resource)
+ @file_path = File.join(TwitterCldr::RESOURCES_DIR, resource)
+ end
+
+ def build
+ parse_trie(load_collation_elements_table)
+ end
+
+ private
+
+ def parse_trie(table)
@camertron

camertron Jun 16, 2012

Collaborator

Streaming the resource is clever, nicely done! However, the table argument is a little misleading - it's not really the table you're passing in, it's a file handle. Could we make this streaming behavior available in TwitterCldr::Resources?

@KL-7

KL-7 Jun 16, 2012

Contributor

What do you mean by streaming? I separated this method from load_collation_elements_table to be able to stub the latter with a string in the specs. That's why I chose the neutral name table for the argument.

@camertron

camertron Jun 16, 2012

Collaborator

I meant streaming in the sense that you're not loading the entire table into memory. Instead, you're opening the file with open and reading and parsing each line individually. The line parse_trie(load_collation_elements_table) doesn't pass the table to parse_trie, it passes an instance of File.

@KL-7

KL-7 Jun 16, 2012

Contributor

Ah, now I see what you mean. Unfortunately, I guess, one does not simply stream a YAML file. And all our other resources are in this format.

@KL-7

KL-7 Jun 16, 2012

Contributor

Well, you can actually pass a file handle (as an IO object) into YAML.load, or even a file name directly into YAML.load_file. That might be more efficient, I think.

@camertron

camertron Jun 16, 2012

Collaborator

Oh no, I meant we could use streaming for non-yml files, but move the logic for streaming into TwitterCldr::Resources. Just a thought, probably not necessary.
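To make the two approaches being contrasted concrete (a hedged sketch; the YAML path and parse_line are illustrative placeholders):

    require 'yaml'

    # YAML resources are parsed as a whole document:
    data = YAML.load_file(File.join(TwitterCldr::RESOURCES_DIR, 'some_resource.yml'))

    # A plain-text table like FractionalUCA_SHORT.txt can be consumed line by line,
    # which is roughly what TrieBuilder does, without holding the whole table in memory:
    File.open(File.join(TwitterCldr::RESOURCES_DIR, 'collation/FractionalUCA_SHORT.txt')) do |file|
      file.each_line { |line| parse_line(line) } # parse_line stands in for the per-line parsing
    end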

@camertron camertron commented on the diff Jun 16, 2012

lib/twitter_cldr/normalization/hangul.rb
@@ -0,0 +1,68 @@
+# encoding: UTF-8
+
+# Copyright 2012 Twitter, Inc
+# http://www.apache.org/licenses/LICENSE-2.0
+
+module TwitterCldr
+ module Normalization
+ module Hangul
@camertron

camertron Jun 16, 2012

Collaborator

This is a great refactoring of the Hangul stuff, nice work.

@camertron camertron commented on the diff Jun 16, 2012

lib/twitter_cldr/normalization/nfkc.rb
@@ -44,7 +44,7 @@ def compose(code_points)
end
if hangul_code_points.size > 1 && !next_hangul_type
- hangul_code_points.size.times { final.pop }
+ final.pop(hangul_code_points.size)
@camertron

camertron Jun 16, 2012

Collaborator

Ha! Nice.

@camertron camertron commented on the diff Jun 16, 2012

lib/twitter_cldr/normalization/nfkc.rb
@@ -55,23 +55,11 @@ def compose(code_points)
end
def valid_hangul_sequence?(buffer_size, hangul_type)
- case [buffer_size, hangul_type]
- when [0, :lparts], [1, :vparts], [2, :tparts]
- true
- else
- false
- end
+ [[0, :lparts], [1, :vparts], [2, :tparts]].include?([buffer_size, hangul_type])
@camertron

camertron Jun 16, 2012

Collaborator

Yes, this is much better. I was doing some other logic with this case before refactoring and just didn't see it this way - thanks!

@camertron camertron commented on the diff Jun 16, 2012

lib/twitter_cldr/normalization/nfkc.rb
def compose_hangul(code_points)
- l_index = code_points.first.hex - HANGUL_DECOMPOSITION_CONSTANTS[:LBase]
- v_index = code_points[1].hex - HANGUL_DECOMPOSITION_CONSTANTS[:VBase]
- t_index = code_points[2] ? code_points[2].hex - HANGUL_DECOMPOSITION_CONSTANTS[:TBase] : 0 # tpart may be missing, that's ok
- lv_index = (l_index * HANGUL_DECOMPOSITION_CONSTANTS[:NCount]) + (v_index * HANGUL_DECOMPOSITION_CONSTANTS[:TCount])
- (HANGUL_DECOMPOSITION_CONSTANTS[:SBase] + lv_index + t_index).to_s(16).upcase.rjust(4, "0")
+ TwitterCldr::Normalization::Hangul.compose(code_points.map { |cp| cp.hex }).to_s(16).upcase.rjust(4, "0")
@camertron

camertron Jun 16, 2012

Collaborator

Ugh. I'm getting really tired of this .to_s(16).upcase.rjust(4, "0") thing. Maybe we can add a method to Fixnum or something. Hopefully we can move away from doing string code point to int conversions in the future too.

@KL-7

KL-7 Jun 16, 2012

Contributor

I think the second option is the right choice, and I'd rather leave things the way they are until then, so we don't forget about these annoying conversions.
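For reference, a minimal sketch of the kind of helper being suggested above (the name and location are hypothetical, not part of the PR):

    # Hypothetical helper for the recurring int -> 4-digit uppercase hex conversion.
    module TwitterCldr
      module Utils
        def self.to_hex_code_point(int)
          int.to_s(16).upcase.rjust(4, '0')
        end
      end
    end

    TwitterCldr::Utils.to_hex_code_point(0xAC00) # => "AC00"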

Collaborator

camertron commented Jun 16, 2012

Overall, this PR is magical - I'm super excited about getting it merged in! However, I'll confess I don't understand everything that's going on here, so I'd love to get a walk-through from you on Monday. Here are a few quick things I wanted to bring up:

  1. Where is the FCE table? I don't see it in resources.
  2. As you mentioned previously, it would be awesome to provide a sort method in LocalizedArray that would handle computing sort keys and comparing the items.
  3. What would you think about maintaining a sort key cache? It may grow too large if users are computing keys for a bunch of different strings, so perhaps we'll need to specify some kind of expiration mechanism. Actually, I was thinking it might be time to think about a TwitterCldr::Cache class that can have some fancy functionality (see the sketch after this comment for roughly what I mean). We could also provide the ability to set a global memory limit - after all, the gem is getting pretty fat.

Let me know what you think :)
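A rough, hypothetical sketch of the kind of size-capped cache floated in item 3 (nothing like this exists in the gem at this point):

    # Hypothetical TwitterCldr::Cache: a hash with a crude FIFO cap, just to
    # illustrate the expiration / memory-limit idea.
    module TwitterCldr
      class Cache
        def initialize(max_entries = 10_000)
          @max_entries = max_entries
          @store = {}
        end

        def fetch(key)
          return @store[key] if @store.key?(key)
          @store.shift if @store.size >= @max_entries # evict the oldest entry
          @store[key] = yield
        end
      end
    end

    # e.g. a collator could do: @cache.fetch(string) { sort_key(string) }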

Contributor

KL-7 commented Jun 16, 2012

Thanks for your great comments, @camertron. It's really awesome to know that someone is carefully reading your code =)

  1. Don't worry, it's right there. GitHub is just clever enough not to show such a big file on the diff tab (it's there but the content is not displayed).
  2. Ok, I'll add it.
  3. Caching sort keys is a good idea, but together with a carefully tuned expiration mechanism it doesn't sound like a 5-minute task. What if I finish this PR so you can merge it, and then look into caching? A global cache sounds even more awesome. I saw you've already extracted the compute_cache_key method, since we need caching almost everywhere. Though I'm not sure how much memory all our caches will need in a real app. Will our internal cache be good enough? Should we think about an adapter for, say, memcached?
Collaborator

camertron commented Jun 16, 2012

  1. Ah ok cool!
  2. Thanks.
  3. Yes, it's much too big a feature to add to this PR - we should add it later for sure. I looked at the size of the resources and there are only 3.4 MB worth of files in there. There will be some additional overhead, of course, when the data is read into Ruby objects, but that's really not too much for our internal cache to handle. Still, it might be worth examining how much memory gets used when running certain tests (like normalization) or formatting a few dates. That's for our existing cached data. Sort keys are a different story, however: whereas before we were only caching a distinct, limited collection of values, now we could potentially be caching an unlimited collection of values. I don't know if the problem warrants multiple adapters like memcached or redis... but maybe it does. Do you happen to know how ICU4J handles this?
Contributor

KL-7 commented Jun 20, 2012

I like how this PR looks now. What do you say, @camertron?

@camertron camertron added a commit that referenced this pull request Jun 20, 2012

@camertron camertron Merge pull request #40 from KL-7/collation
Adding the Unicode Collation Algorithm (does not include language-specific tailoring).
8b4fff8

@camertron camertron merged commit 8b4fff8 into twitter:master Jun 20, 2012
