Implement upper, lower case conversion for char #12561

Merged
merged 3 commits into from Mar 13, 2014

7 participants
Contributor

pzol commented Feb 26, 2014

Added common and simple case folding, i.e. one-to-one character mapping. For more information see http://www.unicode.org/faq/casemap_charprop.html

Removed auto-generated dead code which wasn't used.

kud1ing commented Feb 26, 2014

See also #9084

Owner

huonw commented Feb 26, 2014

Removed auto-generated dead code which wasn't used.

Would it be possible to do this in a separate commit (in this PR), for ease of review and general good-git-practice?

Contributor

pzol commented Feb 26, 2014

Ok, will do that in two separate commits. Need to iron out an issue I have found first.

Owner

huonw commented Feb 26, 2014

Thanks.

Contributor

pzol commented Feb 26, 2014

Done.

Contributor

pzol commented Feb 26, 2014

Docs updated.

Contributor

pzol commented Mar 1, 2014

@flaper87 removed code duplication.

@Valloric Valloric commented on the diff Mar 1, 2014

src/libstd/char.rs
@@ -486,6 +518,39 @@ fn test_to_digit() {
}
#[test]
+fn test_to_lowercase() {
+ assert_eq!('A'.to_lowercase(), 'a');
@pzol

pzol Mar 2, 2014

Contributor

The problem appears when you deal with string comparison in a language-sensitive context. If you just convert to upper case and then compare, without ever considering a locale, you cannot know that you are dealing with Turkish. A plain conversion to lower and to upper case works according to UnicodeData.txt, without the special language-sensitive context, i.e. the upper case I with a dot (İ) converts to a plain i.

So currently this would not pass:

  assert_eq!('ı'.to_uppercase(), 'İ');
  assert_eq!('İ'.to_lowercase(), 'i');

  let tr_alphabet = "abcçdefgğhıijklmnoöprsştuüvyz";
  let tr_upper    = "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ";
  let tr_lower    = "abcçdefgğhıijklmnoöprsştuüvyz";

  for (a, e) in upper_chars(tr_alphabet).zip(tr_upper.chars()) {
    assert!(a == e, format!("actual {} != expected {}", a, e));
  }

This should be tackled in a different lib. I will be proposing a libi18n that should have things like case insensitive comparison in a locale context.
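
A minimal sketch of the locale problem described above, written against current Rust's standard library rather than the 2014 API in this PR (today `char::to_uppercase` returns an iterator instead of a `char`). The helper names `upper_string` and `upper_string_tr` are hypothetical, and the Turkish variant only special-cases the two letters that differ; it is an illustration of what a locale-aware library would have to do, not a complete implementation.

```rust
/// Hypothetical helper: uppercase every char with the locale-independent
/// Unicode mapping that `char::to_uppercase` provides.
pub fn upper_string(s: &str) -> String {
    s.chars().flat_map(char::to_uppercase).collect()
}

/// Hypothetical helper: same as `upper_string`, but applies the Turkish rule
/// for dotted and dotless i before falling back to the default mapping.
pub fn upper_string_tr(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            'i' => out.push('İ'), // dotted small i keeps its dot in Turkish
            'ı' => out.push('I'), // dotless small i maps to plain capital I
            _ => out.extend(c.to_uppercase()),
        }
    }
    out
}
```

With these helpers, `upper_string("i")` and `upper_string("ı")` both give `"I"`, so the two Turkish letters collapse into one, while `upper_string_tr("ıi")` keeps them apart as `"Iİ"`.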

On 1 mar 2014, at 21:05, Val Markovic notifications@github.com wrote:

In src/libstd/char.rs:

@@ -486,6 +518,39 @@ fn test_to_digit() {
}

#[test]
+fn test_to_lowercase() {

  • assert_eq!('A'.to_lowercase(), 'a');
    An important test case to have would be the infamous Turkish i.

It's an issue in many languages/frameworks:
http://blogs.msdn.com/b/deeptanshuv/archive/2004/09/04/225720.aspx
http://haacked.com/archive/2012/07/05/turkish-i-problem-and-why-you-should-care.aspx/
https://groups.google.com/d/topic/golang-nuts/w8eZxT3dA48/discussion
http://stackoverflow.com/questions/16830570/qt-turkish-characters-case-conversion
http://wiki.tcl.tk/748
epeli/underscore.string#252
nicolas-grekas/Patchwork-UTF8#2
http://lotusnotus.com/lotusnotus_en.nsf/dx/dotless-i-tolowercase-and-touppercase-functions-use-responsibly.htm

I could add more, but basically every language/framework gets this wrong and it has cost people's lives.



@flaper87 flaper87 commented on an outdated diff Mar 2, 2014

src/etc/unicode.py
- f.write(" %c %s to %s\n" %
- (prefix,
- escape_char(pair[0]),
- escape_char(pair[1])))
- prefix = '|'
- f.write(" { true }\n")
- f.write(" _ { false }\n")
- f.write(" };\n")
- f.write(" }\n\n")
+def emit_conversions_module(f, lowerupper, upperlower):
+ f.write("pub mod conversions {\n")
+ f.write("""
+ use cmp::{Equal, Less, Greater};
+ use vec::ImmutableVector;
+ use tuple::Tuple2;
+ use option::{ Option, Some, None };
@flaper87

flaper87 Mar 2, 2014

Contributor

small nit: no need to have spaces after { and before }

Contributor

flaper87 commented Mar 2, 2014

Just a small nit from a partial review. It looks good to me! Thanks a lot!

Contributor

flaper87 commented Mar 3, 2014

@pzol could you squash the last 2 commits into the second one? This is looking good. Thanks

Contributor

pzol commented Mar 4, 2014

Squashed!

Contributor

flaper87 commented Mar 6, 2014

LGTM, @huonw mind taking a final look here?

Owner

alexcrichton commented Mar 9, 2014

In the past I've found unicode case sensitivity to be a very tricky and hairy topic. I've heard things like it's based on locale, based on which variant of unicode you're using, it changes from revision to revision, etc.

My only worry about this is that the to_uppercase and to_lowercase functions are a little vague about what exactly they are doing. It would be nice for them to be as transparent as possible, with explicit references to any online standards or documentation explaining exactly how the case conversion is performed.

I'm a little worried to merge this as I'm certainly no unicode expert, but the code looks good to me and I'd be willing to r+ with more comprehensive comments.

Contributor

pzol commented Mar 10, 2014

@alexcrichton comments with references in the code or in the commit?

Case folding is described here http://unicode.org/reports/tr21/tr21-3.html.
The conversion implemented here covers the so-called common (basically ASCII) and simple case folding, where one codepoint translates to one codepoint without locale-specific sensitivity, such as the Turkish special cases of i. The conversion is based on the UnicodeData.txt file ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt which was already being used in std::unicode and is documented here http://www.unicode.org/reports/tr44/.

The above-mentioned document says:

There are a number of complications to case mappings that occur once the repertoire of characters is expanded beyond ASCII.

Because of the inclusion of certain composite characters for compatibility, such as 01F1 "DZ" capital dz, there is a third case, called titlecase, which is used where the first letter of a word is to be capitalized (e.g. Titlecase, vs. UPPERCASE, or lowercase).
For example, the title case of the example character is 01F2 "Dz" capital d with small z.
Case mappings may produce strings of different length than the original.
For example, the German character 00DF "ß" small letter sharp s expands when uppercased to the sequence of two characters "SS". This also occurs where there is no precomposed character corresponding to a case mapping, such as with 0149 "ʼn" latin small letter n preceded by apostrophe.
Characters may also have different case mappings, depending on the context.
For example, 03A3 "Σ" capital sigma lowercases to 03C3 "σ" small sigma if it is followed by another letter, but lowercases to 03C2 "ς" small final sigma if it is not.
Characters may have case mappings that depend on the locale.
For example, in Turkish the letter 0049 "I" capital letter i lowercases to 0131 "ı" small dotless i.
Case mappings are not, in general, reversible.
For example, once the string "McGowan" has been uppercased, lowercased or titlecased, the original cannot be recovered by applying another uppercase, lowercase, or titlecase operation.

Go, Python and other languages use the same mechanism.
My recommendation would be to start with this approach and provide a more sophisticated and more complete implementation in a separate library, liblocale. I am currently working on one. An example of a locale library I like is http://www.boost.org/doc/libs/1_55_0/libs/locale/doc/html/index.html

Currently all Rust character handling is based on the (naive) assumption that one Rust char, being one codepoint, maps to one letter, and thus the implemented simple case conversion seems appropriate.
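
The complications quoted above can be observed in current Rust, whose standard library now applies the full mappings and therefore returns an iterator from `char::to_uppercase`. The sketch below is an assumption-laden approximation of the one-to-one behaviour discussed in this PR: the hypothetical helper `simple_upper` keeps a mapping only when it is exactly one char long and otherwise returns the input. It is not identical to the simple mapping in UnicodeData.txt for every character. `case_round_trip` is likewise a hypothetical name.

```rust
/// Hypothetical helper approximating a one-to-one mapping: return the
/// uppercase char only when the full mapping is exactly one char long,
/// otherwise return `c` unchanged.
pub fn simple_upper(c: char) -> char {
    let mut mapped = c.to_uppercase();
    match (mapped.next(), mapped.next()) {
        (Some(upper), None) => upper,
        _ => c,
    }
}

/// Hypothetical helper: uppercase then lowercase a string, to show that
/// case mappings are not reversible in general.
pub fn case_round_trip(s: &str) -> String {
    s.to_uppercase().to_lowercase()
}
```

Here `simple_upper('ß')` returns `'ß'` because the full mapping expands to `"SS"`, and `case_round_trip("McGowan")` yields `"mcgowan"`, so the original capitalisation is lost.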

Owner

alexcrichton commented Mar 10, 2014

I'd be looking for comments on the functions themselves. The current documentation only states:

/// Convert a char to its uppercase equivalent
///
/// The case-folding performed is the common or simple mapping:
/// it only maps a codepoint to its equivalent if it is also a single codepoint
///
/// # Return value
///
/// Returns the char itself if no conversion is possible

This isn't very descriptive about how it's doing the uppercase/lowercase behind the scenes.

Owner

huonw commented Mar 10, 2014

I agree with @alexcrichton: having references and citations to the canonical source of algorithms is really good so that everyone is on the same page with precisely what is implemented.

Contributor

pzol commented Mar 13, 2014

How about

/// Convert a char to its uppercase equivalent
///
/// The case-folding performed is the common or simple mapping:
/// it maps one unicode codepoint (one char in Rust) to its uppercase equivalent according
/// to the Unicode database at ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
/// The additional SpecialCasing.txt is not considered here, as it expands to multiple
/// codepoints in some cases.
///
/// A full reference can be found here
/// http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf#G33992
///
/// # Return value
///
/// Returns the char itself if no conversion was made
#[inline]
pub fn to_uppercase(c: char) -> char {
    conversions::to_upper(c)
}
/// Convert a char to its lowercase equivalent
///
/// The case-folding performed is the common or simple mapping
/// see `to_uppercase` for references and more information
///
/// # Return value
///
/// Returns the char itself if no conversion is possible
#[inline]
pub fn to_lowercase(c: char) -> char {
    conversions::to_lower(c)
}
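
One way a lookup such as `conversions::to_upper` could work is a binary search over a table of (lowercase, uppercase) pairs sorted by the first element; the comparison imports in the generated module suggest something along these lines, but that is an assumption. The sketch below uses current Rust, and `LU_TABLE` is a tiny hand-written, hypothetical excerpt, not the table generated by src/etc/unicode.py from UnicodeData.txt.

```rust
/// Hypothetical excerpt of a lowercase-to-uppercase table.
/// It must stay sorted by the first element for the binary search to work.
const LU_TABLE: &[(char, char)] = &[
    ('a', 'A'),
    ('b', 'B'),
    ('z', 'Z'),
    ('ä', 'Ä'),
    ('ı', 'I'),
];

/// Look up the one-to-one uppercase mapping of `c`;
/// return `c` itself when the table has no entry for it.
pub fn to_upper(c: char) -> char {
    match LU_TABLE.binary_search_by(|&(key, _)| key.cmp(&c)) {
        Ok(index) => LU_TABLE[index].1,
        Err(_) => c,
    }
}
```

Characters without an entry, such as `'A'` or `'ß'` in this excerpt, come back unchanged, which matches the documented return value.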
Owner

huonw commented Mar 13, 2014

Seems fine to me. (cc mozilla#12862 re the links, you don't have to do anything about them now, though.)

Contributor

pzol commented Mar 13, 2014

Sorry, should have updated them before, done now.

Contributor

flaper87 commented Mar 13, 2014

@pzol could you squash your last commit into the one that implements the upper, lower case conversion? With that and @huonw's comments addressed, it LGTM

Owner

huonw commented Mar 13, 2014

(My comment is already addressed... as I said in it, there is nothing that needs work.)

Contributor

flaper87 commented Mar 13, 2014

(I wasn't referring to that one, anyway, looks fine)

@pzol pzol Remove code duplication
Remove whitespace

Update documentation for to_uppercase, to_lowercase
dba5625
Contributor

pzol commented Mar 13, 2014

Squashed the last commit!

@bors bors added a commit that referenced this pull request Mar 13, 2014

@bors bors auto merge of #12561 : pzol/rust/char-case, r=alexcrichton
Added common and simple case folding, i.e. mapping one to one character mapping. For more information see http://www.unicode.org/faq/casemap_charprop.html

Removed auto-generated dead code which wasn't used.
47a8c76

@bors bors merged commit dba5625 into rust-lang:master Mar 13, 2014

1 check passed

default all tests passed

@pzol pzol deleted the pzol:char-case branch Mar 13, 2014
