Word-final sigma in str::to_lowercase #26035

Closed
SimonSapin opened this issue Jun 5, 2015 · 5 comments

Comments

@SimonSapin
Contributor

By design, str::to_lowercase and str::to_uppercase do not depend on the language of the text (which shouldn’t be assumed to be the same as the locale of the machine running the program).

Mostly, this means ignoring the conditional mappings in Unicode’s SpecialCasing.txt, with one exception: the Greek letter Sigma is Σ in upper-case and σ in lower-case except in word-final position, where it is ς. The corresponding mapping in SpecialCasing.txt is:

# <code>; <lower>; <title>; <upper>; (<condition_list>;)? # <comment>
03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA

With Final_Sigma defined in the Unicode standard:

C is preceded by a sequence consisting of a cased letter and then zero or more case-ignorable characters, and C is not followed by a sequence consisting of zero or more case-ignorable characters and then a cased letter.

(cased letter and other terms have a precise definition given beforehand.)
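For readers unfamiliar with the file format, the SpecialCasing.txt entry quoted above can be read field by field. The following is a minimal sketch of parsing one such line; the `SpecialCasingEntry` type and both function names are hypothetical and exist only for this illustration. It is not how src/etc/unicode.py processes the file.

```rust
/// One entry of SpecialCasing.txt (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
struct SpecialCasingEntry {
    code: u32,
    lower: Vec<u32>,
    title: Vec<u32>,
    upper: Vec<u32>,
    conditions: Vec<String>,
}

/// Parses a space-separated list of hexadecimal code points.
fn parse_code_points(field: &str) -> Option<Vec<u32>> {
    field
        .split_whitespace()
        .map(|hex| u32::from_str_radix(hex, 16).ok())
        .collect()
}

/// Parses `<code>; <lower>; <title>; <upper>; (<condition_list>;)? # <comment>`.
/// Returns `None` for comment-only, empty, or malformed lines.
fn parse_special_casing_line(line: &str) -> Option<SpecialCasingEntry> {
    // Drop the trailing `# comment`, if any.
    let data = line.split('#').next()?.trim();
    if data.is_empty() {
        return None;
    }
    let mut fields = data.split(';').map(str::trim);
    let code = u32::from_str_radix(fields.next()?, 16).ok()?;
    let lower = parse_code_points(fields.next()?)?;
    let title = parse_code_points(fields.next()?)?;
    let upper = parse_code_points(fields.next()?)?;
    // The condition list is optional; an absent field yields an empty list.
    let conditions = fields
        .next()
        .unwrap_or("")
        .split_whitespace()
        .map(String::from)
        .collect();
    Some(SpecialCasingEntry { code, lower, title, upper, conditions })
}
```

Applied to the Sigma line, this yields code U+03A3, lower-case mapping U+03C2 (ς), and the single condition `Final_Sigma`.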

Since char::to_lowercase doesn’t know context, I think it should just return σ for Σ. But str::to_lowercase does have context and could implement this conditional mapping.
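To illustrate the distinction from user code, here is a small sketch. It assumes a toolchain whose standard library applies the word-final rule in `str::to_lowercase`, as the documentation of current Rust releases describes; the helper name `lowercase_sigma_forms` is made up for this example.

```rust
/// Hypothetical helper: returns the lower-case forms produced by `char` and `str`.
fn lowercase_sigma_forms() -> (String, String, String) {
    // `char::to_lowercase` sees a single character, so it has no context.
    let from_char: String = 'Σ'.to_lowercase().collect();
    // `str::to_lowercase` sees the whole string and can inspect the neighbours.
    let word_final = "ΦΩΣ".to_lowercase(); // Σ ends the word
    let medial = "ΔΩΣΩ".to_lowercase(); // Σ is followed by a cased letter
    (from_char, word_final, medial)
}
```

On such a toolchain the three results are "σ", "φως" and "δωσω" respectively.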

@SimonSapin
Contributor Author

CC @aturon, @alexcrichton

@alexcrichton
Member

Our current trend of conventions and such leads me to think that we should do the special behavior in str::to_lowercase.

@SimonSapin
Contributor Author

Untested first attempt:

impl str {
    pub fn to_lowercase(&self) -> String {
        let mut s = String::with_capacity(self.len());
        for (i, c) in self[..].char_indices() {
            if c == 'Σ' && is_final_sigma(self, i) {
                s.push_str("ς")
            } else {
                s.extend(c.to_lowercase());
            }
        }
        return s;

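        // Final_Sigma: the nearest non-case-ignorable character before Σ is a
        // cased letter, and the nearest one after it is not.
        fn is_final_sigma(s: &str, i: usize) -> bool {
            debug_assert!('Σ'.len_utf8() == 2);
            s[..i].chars().rev().find(|&c| !is_case_ignorable(c)).map_or(false, is_cased_letter) &&
            !s[i + 2..].chars().find(|&c| !is_case_ignorable(c)).map_or(false, is_cased_letter)
        }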
    }
}

… where is_case_ignorable and is_cased_letter would be private functions based on tables generated by src/etc/unicode.py, according to the definitions in the linked PDF.
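For experimenting outside of `std`, the attempt above can be turned into a free function. This is only a sketch under loud assumptions: `is_cased_letter` and `is_case_ignorable` below are rough stand-ins, not the table-driven predicates the comment describes, and `lowercase_with_final_sigma` is a hypothetical name. The sketch also inspects only the nearest non-case-ignorable character on each side, which is how the Final_Sigma definition reads.

```rust
/// Rough stand-in for the Unicode `Cased` property (ASSUMPTION: titlecase
/// letters and other edge cases are not covered).
fn is_cased_letter(c: char) -> bool {
    c.is_lowercase() || c.is_uppercase()
}

/// Rough stand-in for `Case_Ignorable` (ASSUMPTION: a few punctuation marks
/// plus the basic combining diacritics block; the real property is far larger).
fn is_case_ignorable(c: char) -> bool {
    matches!(c, '\'' | '.' | ':' | '\u{00AD}' | '\u{2019}' | '\u{0300}'..='\u{036F}')
}

/// `i` is the byte offset of a 'Σ' in `s`.
fn is_final_sigma(s: &str, i: usize) -> bool {
    let sigma_len = 'Σ'.len_utf8();
    // Walking backwards: skip case-ignorable characters, then require a cased letter.
    let preceded = s[..i]
        .chars()
        .rev()
        .find(|&c| !is_case_ignorable(c))
        .map_or(false, is_cased_letter);
    // Walking forwards: the same sequence must NOT be present.
    let followed = s[i + sigma_len..]
        .chars()
        .find(|&c| !is_case_ignorable(c))
        .map_or(false, is_cased_letter);
    preceded && !followed
}

fn lowercase_with_final_sigma(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for (i, c) in input.char_indices() {
        if c == 'Σ' {
            out.push(if is_final_sigma(input, i) { 'ς' } else { 'σ' });
        } else {
            // All other characters use the context-free mapping.
            out.extend(c.to_lowercase());
        }
    }
    out
}
```

With these stand-ins, `lowercase_with_final_sigma("ΦΩΣ")` gives "φως", while a Σ separated from the previous word by a space, as in "ΦΩ Σ", stays σ because the space is neither case-ignorable nor cased.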

@SimonSapin
Contributor Author

In other words: since there is only one such mapping, I think it makes more sense to hard-code it than to try and generalize rustc_unicode::tables::conversions::to_lowercase_table to include arbitrary conditions.

@SimonSapin
Contributor Author

According to a Greek speaker, not doing this is "Very, very bad. Basically, bad enough that people won’t use it.".

bors added a commit that referenced this issue Jun 9, 2015
* Add “complex” mappings to `char::to_lowercase` and `char::to_uppercase`, making them sometimes yield more than one `char`: #25800. `str::to_lowercase` and `str::to_uppercase` are affected as well.
* Add `char::to_titlecase`, since it’s the same algorithm (just different data). However this does **not** add `str::to_titlecase`, as that would require UAX#29 Unicode Text Segmentation which we decided not to include in `std`: rust-lang/rfcs#1054. I made `char::to_titlecase` immediately `#[stable]`, since it’s so similar to `char::to_uppercase` that’s already stable. Let me know if it should be `#[unstable]` for a while.
* Add a special case for upper-case Sigma in word-final position in `str::to_lowercase`: #26035. This is the only language-independent conditional mapping currently in `SpecialCasing.txt`.
* Stabilize `str::to_lowercase` and `str::to_uppercase`. The `&self -> String` on `str` signature seems straightforward enough, and the only relevant issue I’ve found is #24536 about naming. But `char` already has stable methods with the same name, and deprecating them for a rename doesn’t seem worth it.

r? @alexcrichton
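
As an illustration of the “complex” mappings mentioned in the first bullet, the sketch below collects the iterator returned by `char::to_uppercase` and `char::to_lowercase` into a `String`. The helper names `upper` and `lower` are hypothetical; the mappings themselves are the ones documented for the standard library.

```rust
/// Hypothetical helper: collects the upper-case mapping of one `char`.
fn upper(c: char) -> String {
    c.to_uppercase().collect()
}

/// Hypothetical helper: collects the lower-case mapping of one `char`.
fn lower(c: char) -> String {
    c.to_lowercase().collect()
}

/// Number of `char`s a single input character expands to when upper-cased.
fn upper_len(c: char) -> usize {
    c.to_uppercase().count()
}
```

For example, `upper('ß')` is "SS" (two `char`s), and `lower('İ')` is "i" followed by U+0307 COMBINING DOT ABOVE.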
@bors bors closed this as completed in f901086 Jun 9, 2015
nvzqz added a commit to nvzqz/fmty that referenced this issue Feb 1, 2023
This `to_lowercase` approach converts Σ to σ instead of ς in word-final
position. See rust-lang/rust#26035.