Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP

Loading…

Territories #93

Merged
merged 14 commits into from

4 participants

@jasonkb

Hi Cameron and all,

I am an engineer at Airbnb. Thank you for this wonderful gem!

I want to localize our country selector (on e.g. https://www.airbnb.com/rooms/new where you choose the "Country" of your new listing) so that the names of the countries are translated.

To do this, I thought the best way would be to add a class to TwitterCLDR to get at the territory names in the CLDR locales resource. What do you think of this pull request? It makes the "territories" component of the locales resource be extracted when "rake update:locales_resources" is run, and adds a TwitterCLDR::Shared::Territories class that makes it easy to translate country/territory names, similar to how TwitterCLDR::Shared::Languages makes it easy to translate language names.

Thank you!
Jason

lib/twitter_cldr/shared/territories.rb
((8 lines not shown))
+ module Territories
+
+ class << self
+
+ def all
+ all_for(TwitterCldr.get_locale)
+ end
+
+ def all_for(code)
+ get_resource(code)[:territories]
+ rescue
+ {}
+ end
+
+ def from_code(code)
+ from_code_for_locale(code, TwitterCldr.get_locale)
@camertron Collaborator

Just FYI TwitterCldr.get_locale is going away in favor of TwitterCldr.locale. See PR #92 :)

@jasonkb
jasonkb added a note

Ah, excellent changes! I will change to TwitterCLDR.locale when PR #92 is merged in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
spec/shared/territories_spec.rb
@@ -0,0 +1,101 @@
+# encoding: UTF-8
+
+# Copyright 2012 Twitter, Inc
+# http://www.apache.org/licenses/LICENSE-2.0
+
+require 'spec_helper'
+
+include TwitterCldr::Shared
+
+describe Territories do
+ describe "#translate_territory" do
+ it "should translate a territory name from one locale to another" do
+ Territories.translate_territory("Russia", :en, :es).should == "Rusia"
@camertron Collaborator

There's a handy RSpec matcher we wrote (match_normalized) that I like to use when comparing text values. It normalizes both values before comparison, ensuring that characters like non-breaking spaces and accents are treated as equivalent even if their bytes aren't quite the same. For this example, you'd do this:

Territories.translate_territory("Russia", :en, :es).should match_normalized("Rusia")
@jasonkb
jasonkb added a note

I'll change all the 'should == "..."' to 'should match_normalized("...")' (and I'll also change languages_spec.rb in the same way, which I modeled this file from) but I want to check one thing -- don't we actually want to test that there are no strange non-breaking spaces or accents anywhere? Also, below, we would like to make sure that the accent in
Territories.translate_territory("Spain", :en, :es).should == "España"
isn't lost or mangled, so at least there, it would be appropriate to do a direct string comparison maybe?

@camertron Collaborator

Normalizing the text before matching won't remove important accent marks and such, it just combines them so the bytes can be compared. For example, the character ñ can either be represented in memory as n and ˜ separately or combined into a single character. Most display systems (eg. browsers) will combine these characters visually, even if they're actually separate characters. The normalization step just ensures we're comparing apples to apples and not driving ourselves crazy trying to find the difference between strings that look visually equivalent.

@jasonkb
jasonkb added a note

Ah I see -- I misunderstood what match_normalized does. Cool. I changed territories_spec.rb and languages_spec.rb to use match_normalized.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
spec/shared/territories_spec.rb
((43 lines not shown))
+ Territories.translate_territory("Russia").should == "Rusia"
+ end
+
+ it "successfully translates locale codes that are and are not in the CLDR using the locale map" do
+ Territories.translate_territory("Russia", :en, :'zh-cn').should == "俄罗斯"
+ Territories.translate_territory("Russia", :en, :'zh').should == "俄罗斯"
+ end
+
+ it "should return nil if no translation was found" do
+ Territories.translate_territory("Russia", :en, :blarg).should == nil
+ end
+ end
+
+ describe "#from_code_for_locale" do
+ it "should return the territory in the correct locale for the given locale code (i.e. RU in English should be Russia)" do
+ Territories.from_code_for_locale(:ES, :en).should == "Spain"
@camertron Collaborator

What's up with the uppercased locale code symbols (eg. :ES)?

@jasonkb
jasonkb added a note

These are ISO 3166-1 alpha-2 country codes. In CLDR they are (seem to always be) uppercase.

@KL-7 Collaborator
KL-7 added a note

Probably it'd make sense to downcase them for internal consistency – TwitterCldr::Shared::LanguageCodes works only with lower case values as you can see here.

@KL-7 Collaborator
KL-7 added a note

Great pull request by the way! :+1:

@camertron Collaborator

Yeah, I agree. Lowercasing would be consistent with the rest of the locale codes in the project.

@jasonkb
jasonkb added a note

Ah now I see what you mean by "consistent with the rest of the project" -- I see that methods phone code and postal code related use lower-cased ISO 3166-1 alpha-2 codes.

I'll make it so all ISO 3166-1 alpha-2 codes are downcased before they are stored or compared.

(If there weren't consistency concerns, I think upper-case country codes would be preferable -- official references all use uppercase codes:

They are uppercase in references on iso.org:
http://www.iso.org/iso/country_codes/iso_3166_code_lists/country_names_and_code_elements.htm
http://www.iso.org/iso/home/standards/country_codes/iso-3166-1_decoding_table.htm

They are uppercase in CLDR itself:

# grep 'territory' vendor/cldr/common/main/en.xml
...
                        <territory type="KE">Kenya</territory>
                        <territory type="KG">Kyrgyzstan</territory>
                        <territory type="KH">Cambodia</territory>
                        <territory type="KI">Kiribati</territory>
                        <territory type="KM">Comoros</territory>
                        <territory type="KN">Saint Kitts and Nevis</territory>
                        <territory type="KP">North Korea</territory>
                        <territory type="KR">South Korea</territory>
...

Plus, it makes it easier to not get confused between language and country codes :es, :ES, :sv, :SE when there is also a case difference =P)

@jasonkb
jasonkb added a note

I made a bunch of changes to the pull request to store territory codes lower-case, and allow asking for their names from either an upper-case code or a lower-case code. Could you take another look?

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@camertron
Collaborator

Hey @jasonkb this looks mostly great, thanks for contributing! Before I can merge this PR, there are a few legal details to wrap up - our open-source manager will get in touch with you via email.

@caniszczyk
Owner

@jasonkb, what's your email as in the diff 'jasonkb@airbnb.com' bounced for me :/

@jasonkb
@caniszczyk
Owner

Thanks Jason, email sent.

lib/twitter_cldr/utils.rb
@@ -65,9 +65,51 @@ def traverse_hash(hash, path)
end
end
+ # Normalizes each key in the 'arg' hash or constituent hashes as
+ # if it were a territory code.
+ #
+ # In addition, removes entries in hashes where the key begins with a digit.
+ # Because of the way the twitter-cldr-rb YAML resource pipeline works,
+ # these three-digit codes get mangled (e.g. interpreted as octal then
+ # reinterpreted out in decimal), and translations for UN three-digit
+ # area codes cannot be trusted.
+ def deep_normalize_territory_code_keys(arg)
@KL-7 Collaborator
KL-7 added a note

I think TwitterCldr::Utils::Territories is a better place for these two methods, because they are related only to this feature. On the other hand, Utils module is supposed to contain more generic helpers that are used in multiple places across the project.

@jasonkb
jasonkb added a note

Good point, I moved those two methods to TwitterCldr::Utils::Territories.

@KL-7 Collaborator
KL-7 added a note

Ah, sorry, I actually meant TwitterCldr::Shared::Territories module – the only module (beside auxiliary importer that is used only rarely when CLDR is updated) that uses these methods :( My bad.

Luckily, you did it step by step, so if you simply remove the last commit it'd be perfect. I apologize for confusing you.

@jasonkb
jasonkb added a note

Hm right I did it in two steps, because I did the first step and then thought it was pretty unclean to have
1) LocalesResourcesImporter have this one static method, and have a spec file that tests only that one method
2) TwitterCldr::Shared::Territories.normalize_territory_code() be public

So I like it better this way, but I'm happy to revert the last commit if you still think it is better in that state. Thanks!

@KL-7 Collaborator
KL-7 added a note

Importers don't have specs, because they're used only by TwitterCLDR developers and in most cases it's easy enough to check the final resource files that they produce. At least that's how I justified my laziness when I added these importers without a single spec :)

Anyway, I see your point, so leave it in TwitterCldr::Utils::Territories if you think it's appropriate.

@jasonkb
jasonkb added a note

Makes sense, the importers are indirectly tested by all the rest of the specs!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
lib/twitter_cldr/utils.rb
@@ -65,9 +65,51 @@ def traverse_hash(hash, path)
end
end
+ # Normalizes each key in the 'arg' hash or constituent hashes as
+ # if it were a territory code.
+ #
+ # In addition, removes entries in hashes where the key begins with a digit.
+ # Because of the way the twitter-cldr-rb YAML resource pipeline works,
@KL-7 Collaborator
KL-7 added a note

Looks like the problem is in the way ruby-cldr dump data. Specifically, in the way they override this method of Ya2YAML class. With this method override you always get numeric strings without quotes (that are later parsed by the standard Ruby YAML parser as integers in TwitterCLDR):

1.9.3p362 :001 > Cldr::Export::Yaml.new.yaml("en"=>{"territories"=>{"001"=>"World", "002"=>"Africa", "003"=>"North America"}})
 => "en: \n  territories: \n    001: World\n    002: Africa\n    003: \"North America\""

Without this override (if you comment it and use the original implementation from ya2yaml lib) you get quotes around keys:

1.9.3p362 :001 > Cldr::Export::Yaml.new.yaml("en"=>{"territories"=>{"001"=>"World", "002"=>"Africa", "003"=>"North America"}})
 => "en: \n  territories: \n    ? \"001\"\n    : World\n    ? \"002\"\n    : Africa\n    ? \"003\"\n    : \"North America\""

It looks better, but for some reason this YAML string can't be parsed by standard Ruby YAML parser. Probably the fix in ruby-cldr should be a bit more accurate than just commenting is_one_plain_line? method :) If you're interested, you can dig deeper and try to fix the way ruby-cldr dumps data. I think it'd be great to get it fixed rather than ignore these type of codes.

FYI, standard Ruby YAML engine tries to omit quotes around strings when possible, but it leaves them if the string represents a valid octal number (with '0' prefix):

1.9.3p362 :010 > YAML.dump(['001', '008', 'foo'])
 => "---\n- '001'\n- 008\n- foo\n"

Looks like that's how it should be according to the YAML spec and I think that is what we need to get in the YAML files generated by ruby-cldr to solve this problem.

@jasonkb
jasonkb added a note

Thank you for the thorough investigation!

Given how many headaches YAML has already caused you (reading the "Unicode YAML Support" section of the readme) my guess is time would be better spent refactoring to not use YAML at all =P

Because these UN three-digit area codes are exceedingly infrequently used, I would prefer to leave this be (and make sure it is well-documented that they are not supported).

@KL-7 Collaborator
KL-7 added a note

Fair enough, I think.

The biggest problem with YAML is that MRI 1.8.7 is pretty bad at dumping Unicode characters into YAML and most of this gem is about Unicode. @camertron tried to solve it by adopting parts of ya2yaml gem internally in TwitterCLDR and probably that's also the reason why ruby-cldr uses ya2yaml. It fixes some issues, but the differences between all these YAML implementations are causing a lot of troubles :(

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Jason Katz-B... added some commits
lib/twitter_cldr/utils/territories.rb
((25 lines not shown))
+
+ # Normalizes each key in the 'arg' hash or constituent hashes as
+ # if it were a territory code.
+ #
+ # In addition, removes entries in hashes where the key begins with a digit.
+ # Because of the way the twitter-cldr-rb YAML resource pipeline works,
+ # these three-digit codes get mangled (e.g. interpreted as octal then
+ # reinterpreted out in decimal), and translations for UN three-digit
+ # area codes cannot be trusted.
+ def deep_normalize_territory_code_keys(arg)
+ case arg
+ when Array
+ arg.map { |elem| deep_normalize_territory_code_keys(elem) }
+ when Hash
+ normalized = arg.inject({}) do |carry, (key, value)|
+ normalized_key = TwitterCldr::Utils::Territories.normalize_territory_code(key)
@KL-7 Collaborator
KL-7 added a note

I believe you can remove explicit TwitterCldr::Utils::Territories since both methods are in the same module now.

@jasonkb
jasonkb added a note

Done.

@KL-7 Collaborator
KL-7 added a note

Great, thanks! :+1:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@jasonkb

What are the next steps here? What is the status of PR #92?

Thanks!
--Jason

Jason Katz-Brown Make Territories work for Traditional Chinese.
The problem was that before it lowercased "zh-Hant" into "zh-hant" when
producing resources/locales/zh-Hant/territories.yml.
1e47a2c
@camertron
Collaborator

LGTM gentlemen. I'll merge this request first, make a few additional changes recommended by @KL-7 to #92, then merge that. Should be ready by end-of-day today Jan 12th PST.

@camertron camertron merged commit e4fdb37 into twitter:master
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Commits on Jan 9, 2013
  1. Import "territories" component of locale data.

    Jason Katz-Brown authored
  2. Class for localizing country names

    Jason Katz-Brown authored
  3. Territories resource files.

    Jason Katz-Brown authored
  4. Remove debugging puts.

    Jason Katz-Brown authored
  5. Fix typo "Macedona" in spec.

    Jason Katz-Brown authored
Commits on Jan 10, 2013
  1. Regeneraet territories YML files with lower-case codes.

    Jason Katz-Brown authored
  2. Use "should match_normalized" in languages_spec.rb.

    Jason Katz-Brown authored
  3. Move territory-specific methods out of utils.rb.

    Jason Katz-Brown authored
  4. Make territory munging use the correct function name.

    Jason Katz-Brown authored
  5. Add Afrikaans territories.yml.

    Jason Katz-Brown authored
    Not sure why it wasn't included in previous commit!
  6. Move territory-related utils to Utils::Territories.

    Jason Katz-Brown authored
  7. Remove unneeded namespacing.

    Jason Katz-Brown authored
Commits on Jan 11, 2013
  1. Make Territories work for Traditional Chinese.

    Jason Katz-Brown authored
    The problem was that before it lowercased "zh-Hant" into "zh-hant" when
    producing resources/locales/zh-Hant/territories.yml.
Something went wrong with that request. Please try again.