Skip to content

Some minor fixes and changes for Normalizers::NFD. #30

Merged
merged 4 commits into from May 14, 2012

2 participants

@KL-7
KL-7 commented May 10, 2012

Hi, @camertron.

I started getting into Unicode normalization in order to implement NFKD. As NFKD algorithm is pretty close to NFD, most likely I'll extract some logic from Normalizers::NFD into a base class to be able to use it in a sibling Normalizers::NFKD class. At least that's the plan at the moment.

I looked at Normalizers::NFD class and besides removing incorrect duplicate of :Scount in @@hangul_constants I made a couple of other minor changes, like extracting Hangul decomposition into a method, moving Hangul algorithms constants from class variable into a constant and making internal methods private.

@KL-7
KL-7 commented May 10, 2012

@camertron, I also removed core_ext/strings, because Symbol extension sitting in this directory is a bit strange. And changed LocalizedString#to_s to return a copy of the underlying object to prevent its accidental modification. Do you think it's a good idea?

KL-7 added some commits May 10, 2012
@KL-7 KL-7 Some minor fixes and changes for Normalizers::NFD. 856845b
@KL-7 KL-7 Remove core_ext/strings directory. 2e7939d
@KL-7 KL-7 Duplicate @base_obj in LocalizedString#to_s. 2c6ffe7
@KL-7 KL-7 Optimize Canonical Ordering Algorithm.
Canonical Ordering Algorithm requires stable sorting of every sequence
of combining characters (characters with combining class > 0) inside the
string. Performing stable sorting (bubble sort) only on those sequences
instead of applying sorting to the whole string reduces the running time
of the nomalization test (on the full suit from NormalizationTest.txt)
in half (on my machine it reduced from ~40 seconds to ~20 seconds). I
think the performance boost might be even more noticeable on longer
strings.
ea3f3eb
@KL-7
KL-7 commented May 10, 2012

One more change: I rewrote reordering that is used as a part of canonical ordering algorithm. It passes the test on the full version of NormalizationTest.txt and completes it in 20 seconds instead of the original 40 seconds. I think it's quite a good performance boost. See the commit message for more information.

@KL-7
KL-7 commented May 10, 2012

@camertron, why don't we use travis ci to run normalization test against the full version of NormalizationTest.txt?

If you upload a gzip-ed version of this file (for faster downloading and in case of unicode.org downtime) to the Github, I can setup a separate rake task for CI testing or a simple env variable that will trigger downloading of this file and running normalization test against it instead of the smaller version that we have in the repository.

Tests running for 20 seconds might be less pleasant than tests running for 3 seconds for developers, but it doesn't matter for travis. That way we'll be sure that our implementation passes the full test provided by the Unicode Consortium. Plus we will have an easy way to trigger full test locally (I had to manually download and replace NormalizationTest.txt file for that and then remove it to make sure it won't get commited).

@camertron

@KL-7 the refactor looks good. Moving the files in core_ext/string is fine with me, and it makes sense to have LocalizedString#to_s return a copy of the underlying string too.

I also like your idea of having Travis run our test suite over the entire NormalizationTest.txt file. My only concern is that doing so will add an "external" dependency and a bit more complexity. Let's try it and see if it causes any problems.

@camertron camertron merged commit 5f2c5ab into twitter:master May 14, 2012
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.