\N{...} neglects formal aliases and named sequences from Unicode charnames namespace #56962
Unicode character names share a common namespace with formal aliases and with named sequences, but Python recognizes only the original name. That means not everything in the namespace is accessible from Python. (If this is construed to be an extant bug rather than an absent feature, you probably want to change this from a wish to a bug in the ticket.) This is a problem because aliases correct errors in the original names, and are the preferred versions.

For example, ISO screwed up when they called U+01A2 LATIN CAPITAL LETTER OI. It is actually LATIN CAPITAL LETTER GHA according to the file NameAliases.txt in the Unicode Character Database. However, Python blows up when you try to use this:
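For instance, on a 3.2 build (the exact error text may vary by version) the original, mistaken name works but the corrected alias does not:

>>> "\N{LATIN CAPITAL LETTER OI}"
'Ƣ'
>>> "\N{LATIN CAPITAL LETTER GHA}"
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-27: unknown Unicode character name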
This is unfortunate, because the formal aliases correct egregious blunders, such as the Standard reading "BRAKCET" instead of "BRACKET":

$ uninames '^\s+%'
Ƣ 01A2 LATIN CAPITAL LETTER OI
% LATIN CAPITAL LETTER GHA
ƣ 01A3 LATIN SMALL LETTER OI
% LATIN SMALL LETTER GHA
* Pan-Turkic Latin alphabets
ೞ 0CDE KANNADA LETTER FA
% KANNADA LETTER LLLA
* obsolete historic letter
* name is a mistake for LLLA
ຝ 0E9D LAO LETTER FO TAM
% LAO LETTER FO FON
= fo fa
* name is a mistake for fo sung
ຟ 0E9F LAO LETTER FO SUNG
% LAO LETTER FO FAY
* name is a mistake for fo tam
ຣ 0EA3 LAO LETTER LO LING
% LAO LETTER RO
= ro rot
* name is a mistake, lo ling is the mnemonic for 0EA5
ລ 0EA5 LAO LETTER LO LOOT
% LAO LETTER LO
= lo ling
* name is a mistake, lo loot is the mnemonic for 0EA3
࿐ 0FD0 TIBETAN MARK BSKA- SHOG GI MGO RGYAN
% TIBETAN MARK BKA- SHOG GI MGO RGYAN
* used in Bhutan
ꀕ A015 YI SYLLABLE WU
% YI SYLLABLE ITERATION MARK
* name is a misnomer
︘ FE18 PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET
% PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET
* misspelling of "BRACKET" in character name is a known defect
# <vertical> 3017
𝃅 1D0C5 BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS
% BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS
* misspelling of "FTHORA" in character name is a known defect There are only In Perl, \N{...} grants access to the single, shared, common namespace of Unicode character names, formal aliases, and named sequences without distinction:
It is my suggestion that Python do the same thing. There are currently only 11 of these formal aliases. The third element in this shared namespace, named sequences, consists of multiple code points masquerading under one name. They come from the NamedSequences.txt file in the Unicode Character Database. An example entry is:
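LATIN CAPITAL LETTER A WITH MACRON AND GRAVE;0100 0300

(the sequence name, a semicolon, then the code points; this is the sequence used in the examples below)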
There are 418 of these named sequences as of Unicode 6.0.0. This shows that Perl can also access named sequences:

$ env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}")'
Ā̀
$ env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}")' | uniquote -x
\x{100}\x{300}
$ env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{KATAKANA LETTER AINU P}")'
ㇷ゚
$ env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{KATAKANA LETTER AINU P}")' | uniquote -x
\x{31F7}\x{309A}

Since it is a single namespace, it makes sense that all members of that namespace should be accessible using \N{...} as a sort of equal-opportunity accessor mechanism, and it does not make sense that they not be. Just make sure you take only the approved named sequences from the NamedSequences.txt file. It would be unwise to give users access to the provisional sequences located in a neighboring file I shall not name :) because those are not guaranteed never to be withdrawn the way the approved ones are, and so you would risk introducing an incompatibility. If you look at the ICU UCharacter class, you can see that they provide a more expressive set of lookup functions, where it is clear which kind of thing you are asking for. |
Here’s the right test file for the right ticket. |
I verified that the test file raises the quoted SyntaxError on 3.2 on Win7. This:

>>> "\N{LATIN CAPITAL LETTER GHA}"
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-27: unknown Unicode character name

is most likely a result of this:

>>> unicodedata.lookup("LATIN CAPITAL LETTER GHA")
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
unicodedata.lookup("LATIN CAPITAL LETTER GHA")
KeyError: "undefined character name 'LATIN CAPITAL LETTER GHA'" Although the lookup comes first in nametests.py, it is never executed because of the later SyntaxError. The Reference for string literals says" The doc for unicodedata says The module uses the same names and symbols as defined by Unicode Standard Annex #44, “Unicode Character Database”." So the question is, what are the 'names' therein defined? The annex refers to http://www.unicode.org/Public/6.0.0/ucd/ As best I can tell, the annex plus files are a bit ambiguous as to 'Unicode character name'. The following quote seems neutral: "the Unicode Character Database (UCD), a collection of data files which contain the Unicode character code points and character names." The following: "Unicode character names constitute a special case. Formally, they are values of the Name property." points toward UnicodeData.txt, which lists the Name property along with others. However, "Unicode character name, as published in the Unicode names list," indirectly points toward including aliases. NamesList.txt says it contains the "Final Unicode 6.0 names list." (but one which "should not be parsed for machine-readable information". It includes all 11 aliases in NameAliases.txt. My current opinion is that adding the aliases might be done in current releases. It certainly would serve the any user who does not know to misspell 'FTHORA' as 'FHTORA' for just one of the 17 'FTHORA' chars. Adding named sequences is definitely a feature request. The definition of .lookup(name) would be enlarged to "Look up character by name, alias, or named sequence" with reference to the specific files. The meaning of \N{} would also have to be enlarged. Minimal test code might be: from unicodedata import lookup
assertEqual(lookup("LATIN CAPITAL LETTER GHA"), "\u01a2")
assertEqual(lookup("LATIN CAPITAL LETTER A WITH MACRON AND GRAVE"),
            "\u0100\u0300")
plus a test that "\N{LATIN CAPITAL LETTER GHA}" and
"\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}" compile without error (I have no idea how to write that).
More what ;-) |
"Terry J. Reedy" <report@bugs.python.org> wrote
Yes, I think the 11 aliases pose no problem. It's amazing the trouble
But these do. The problem is bracketed character classes.
as
but that doesn't help if the sequence got replaced as a string escape. If you ask how we do this in Perl, the answer is "poorly". It really only
More expressive set of lookup functions where it is clear which thing
Well, there are some Python bindings for ICU that I was eager to try out. Hm, maybe they are only for Python 2, not Python 3, which is what I try to use. --tom |
For the "Line_Break" property, one of the possible values is "Inseparable", with 2 permitted aliases, the shorter "IN" (which is reasonable) and "Inseperable" (ouch!). |
Matthew Barnett <report@bugs.python.org> wrote
Yeah, I've shaken my head at that one, too. It's one thing to make an alias for something you typo'd in the first place...

Bidi_Class=Paragraph_Separator
Bidi_Class=Common_Separator
Bidi_Class=European_Separator
Bidi_Class=Segment_Separator
General_Category=Line_Separator
General_Category=Paragraph_Separator
General_Category=Separator
General_Category=Space_Separator
Line_Break=Inseparable
Line_Break=Inseperable

And there's still this set, which makes you wonder:

Sentence_Break=Sep   SB=SE
Sentence_Break=Sp    SB=Sp

You really have to look those up to realize they're two different things. And that none of them have something like SB=Space or SB=Separator. --tom |
+1 on the feature request. |
The attached patch changes Tools/unicode/makeunicodedata.py to create a list of names and codepoints taken from http://www.unicode.org/Public/6.0.0/ucd/NameAliases.txt and adds it to Modules/unicodename_db.h. I'm not sure this is the best way to implement this, and someone will probably want to review and tweak both the approach and the C code, but it works fine:
>>> "\N{LATIN CAPITAL LETTER GHA}"
'Ƣ'
>>> import unicodedata
>>> unicodedata.lookup("LATIN CAPITAL LETTER GHA")
'Ƣ'
>>> "\N{LATIN CAPITAL LETTER OI}"
'Ƣ'
>>> unicodedata.lookup("LATIN CAPITAL LETTER OI")
'Ƣ'

The patch doesn't include changes for NamedSequences.txt. |
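The data-extraction side is simple; the helper below is only a rough sketch of reading NameAliases.txt, assuming the 6.0.0 two-field code;alias layout (later UCD versions add a third "type" field) -- the real work lives in makeunicodedata.py and the generated C tables:

    def parse_name_aliases(path="NameAliases.txt"):
        """Yield (alias, codepoint) pairs, skipping comments and blank lines."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.split("#", 1)[0].strip()  # drop trailing comments
                if not line:
                    continue
                fields = line.split(";")
                code, alias = fields[0], fields[1]
                yield alias.strip(), int(code, 16)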
I propose to use a better lookup algorithm using binary search, and then integrate the NamedSequences into this as well. The search result could be a record:

    struct {
        char *name;
        int len;
        Py_UCS4 chars[3]; /* no sequence is more than 3 chars */
    }

You would have two tables for these: one for the aliases, and one for the named sequences. _getcode would continue to return a single char only, and thus not support named sequences. lookup could well return strings longer than 1, but only in 3.3. I'm not sure that \N escapes should support named sequences: people rightfully expect that each escaped element in a string literal constitutes exactly one character. |
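A rough Python model of that layout, just to illustrate the binary search over name-sorted records (the field names and sample entries are illustrative, not the eventual C tables):

    import bisect
    from collections import namedtuple

    Record = namedtuple("Record", "name chars")  # chars: a tuple of up to 3 code points

    # the table must be kept sorted by name for the binary search to work
    aliases = sorted([
        Record("LATIN CAPITAL LETTER GHA", (0x01A2,)),
        Record("LATIN SMALL LETTER GHA", (0x01A3,)),
    ])

    def find(table, name):
        """Return the record for `name`, or None if it is absent."""
        i = bisect.bisect_left(table, (name,))  # namedtuples compare like tuples
        if i < len(table) and table[i].name == name:
            return table[i]
        return None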
Leaving named sequences for unicodedata.lookup() only (and not for \N{}) makes sense. The list of aliases is so small (11 entries) that I'm not sure using a binary search for it would bring any advantage. Having a single lookup algorithm that looks in both tables doesn't work because the aliases lookup must be in _getcode for \N{...} to work, whereas the lookup of named sequences will happen in unicodedata_lookup (Modules/unicodedata.c:1187). |
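A toy Python model of that split (the dict contents are illustrative, and _getcode/unicodedata_lookup are the C internals being discussed, not importable functions):

    ALIASES = {"LATIN CAPITAL LETTER GHA": 0x01A2}
    NAMED_SEQUENCES = {"LATIN CAPITAL LETTER A WITH MACRON AND GRAVE": "\u0100\u0300"}

    def getcode(name):
        # model of _getcode: shared by \N{...} and lookup(), so it knows aliases
        if name in ALIASES:
            return ALIASES[name]
        raise KeyError("undefined character name %r" % name)

    def lookup(name):
        # model of unicodedata.lookup(): additionally resolves named sequences
        if name in NAMED_SEQUENCES:
            return NAMED_SEQUENCES[name]
        return chr(getcode(name))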
There are certainly advantages to that strategy: you don't have to

You may wish unicode.name() to return the alias in preference, however. The rest of this perhaps painfully long message is just elaboration. --tom
If you mean, is it ok to add just the aliases and not the named sequences to

However, because the one namespace comprises all three of names,

The ICU library supports this sort of thing. In ICU4J's Java bindings:

static int getCharFromExtendedName(String name)
[icu] Find a Unicode character by either its name and return its code point value.
static int getCharFromName(String name)
[icu] Finds a Unicode code point by its most current Unicode name and return its code point value.
static int getCharFromName1_0(String name)
[icu] Find a Unicode character by its version 1.0 Unicode name and return its code point value.
static int getCharFromNameAlias(String name)
[icu] Find a Unicode character by its corrected name alias and return its code point value.

The first one obviously has a bug in its definition, as the English
The UCharNameChoice enum tells what sort of thing you want:
Looking at the src for the Java is no more immediately illuminating,

Now I'll tell you what Perl does. I do this not to say it is "right",

Perl does not provide the old 1.0 names at all. We don't have a Unicode

We also provide for certain well known aliases from the Names file:

Perl makes no distinction between anything in the namespace when using

However, the "functional" API does make a slight distinction.

-- charnames::vianame() takes a name or alias (as a string) and returns a single
-- charnames::string_vianame() takes a string name, alias, *or* sequence,
-- charnames::viacode() takes an integer can gives back the official alias
Consider
That was an error, and there is an official alias fixing it:
(That's FHTORA vs FTHORA.) You may use either as the name, and if you reverse the code

% perl -mcharnames -wle 'printf "%04X\n", charnames::vianame("BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS")'
% perl -mcharnames -wle 'printf "%04X\n", charnames::vianame("BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS")'
% perl -mcharnames -wle 'print charnames::viacode(charnames::vianame("BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS"))'

So on round-tripping, I gave it the "wrong" one (the original) and it gave

Using the \N{} thing, it again doesn't matter:

% perl -mcharnames=:full -wle 'printf "%04X\n", ord "\N{BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS}"'
% perl -mcharnames=:full -wle 'printf "%04X\n", ord "\N{BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS}"'

The interesting thing is the named sequences. string_vianame() works just fine on those:

% perl -mcharnames -wle 'print length charnames::string_vianame("LATIN CAPITAL LETTER A WITH MACRON AND GRAVE")'
% perl -mcharnames -wle 'printf "U+%v04X\n", charnames::string_vianame("LATIN CAPITAL LETTER A WITH MACRON AND GRAVE")'

And that works fine with \N{} as well (provided you don't try charclasses):

% perl -mcharnames=:full -wle 'print "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"'
% perl -mcharnames=:full -wle 'print "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"' | uniquote -v
% perl -mcharnames=:full -wle 'print length "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"'
% perl -mcharnames=:full -wle 'printf "U+%v04X\n", "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"'

It's kinda sad that for \N{} and sequences you can't just "do the right
However, that falls part if you do
Because the compiler will do the substitution early on the first
So you can do:
And it is just fine. The issue is that there are ways for you to get
That works, but only accidentally, because of course U+0100.0300 contains This is not a solved problem. I hope this helps. --tom |
Attached a new patch that adds support for named sequences (still needs some test and can probably be improved).
I assume with [] you mean a regex character class, right?
With my latest patch, all 3 are supported.
\N{} will only support names and aliases (maybe this can go in 2.7/3.2 too).
This can be done for 3.3, but I wonder if it might create problems. People might use unicodedata.name() to get a name and use it elsewhere, and the other side might not be aware of aliases. |
Yes, that's fine as well. |
-1. .name() is documented (and users familiar with it expect it) as It doesn't really matter much to me if it's non-sensical - it's just
Python doesn't use regexes in the language parser, but does do \N
If there would be a reasonably official source for these names, and one
-1. Readability counts, writability not so much (I know this is |
The C0 and C1 control code names don't change. There is/was one stability The problem with official names is that they have things in them that you
Rather than the more obvious pair of
?? If so, then I don't understand that. Nobody in their right
I actually very strongly resent and rebuff that entire mindset in the most *PLEASE* don't start. Yes, I just got done driving 16 hours and am overtired, but it's
There are 15 "commonly abbreviated as" aliases in the Names.txt file.
All of the standards documents *talk* about things like LRO and ZWNJ. From the charnames manpage, which shows that we really don't just make
Those are the defaults. They are overridable. That's because we feel that
let alone
then they can, because there is a mechanism for making aliases:
That way you can do
This is probably not as persuasive as the private-use case described below. It is important to remember that all charname bindings in Perl are attached
Which dutifully prints out:
So charname bindings are never "hard to read" because the effect is I realize (or at least, believe) that Python has no notion of nested The most persuasive use-case for user-defined names is for private-use For example, Apple has a bunch of private-use glyphs they use all the time. Now what are you supposed to do in your program when you want a named character So all you do is
and now you can use \N{APPLE LOGO} anywhere within that lexical scope. The I assert that this facility makes your program more readable, and its Private use characters are important in Asian texts, but they are also
I have an entire Tengwar module that makes heavy use of named
Now you can write \N{TENGWAR LETTER TINCO} etc. See how slick that is? It gets better. Perl lets you define your character properties, too.
So I have code in my Tengwar module that does stuff like this, using
Not to mention this using my own properties:
Actually, I'm fibbing. I *never* write regexes all on one line like
People who write patterns without whitespace for cognitive chunking (plus Anyway, do you see how much better that is than opaque unreadable magic No, I don't expect Python to do this sort of thing. You don't have proper I just wanted to give a concrete example where flexibility leads to a --tom
|
Actually Python doesn't seem to support \N{LINE FEED (LF)}, most likely because that's a Unicode 1 name, and nowadays these codepoints are simply marked as '<control>'.
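A quick check on a 3.2 build (the exact error text may vary slightly):

>>> "\N{LINE FEED (LF)}"
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-17: unknown Unicode character name
>>> import unicodedata
>>> unicodedata.name("\n")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name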
They probably don't, but they just write \n anyway. I don't think we need to support any of these aliases, especially if they are not defined in the Unicode standard. I'm also not sure humans use \N{...}: you don't want to write
Right, I had to read down till the table with the meanings before figuring out what they were (and I already forgot it).
This is actually a good use case for \N{...}. One way to solve that problem is doing something like the sketch below. This requires the format call for each string and it's a workaround, but at least it's readable (I hope you don't have too many apples in your strings). I guess we could add some way to define a global list of names, and that would probably be enough for most applications. Making it per-module would be more complicated and maybe not too elegant.
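A minimal sketch of that workaround (U+F8FF is the private-use code point Apple conventionally uses for its logo; the constant name is mine):

    APPLE_LOGO = "\uf8ff"  # private-use code point, shown as the Apple logo on Apple systems

    # one .format() call per string that needs the named character
    print("I wish I had an {APPLE_LOGO} sticker".format(APPLE_LOGO=APPLE_LOGO))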
I actually find those *less* readable. If there's something fancy in the regex, a comment *before* it is welcomed, but having to read a regex divided on several lines and remove meaningless whitespace and redundant comments just makes the parsing more difficult for me. |
Attached a new patch with more tests and doc. |
Ezio Melotti <report@bugs.python.org> wrote
Yes, but there are a lot of them, 65 of them in fact. I do not care to
If you look at Names.txt, there are significant "aliases" there for There are still "holes" of course. Code point 128 has no name even in C1.
I recognize that redefining what sort of object the compiler treats some But it still has to happen at compile time, of course, so I don't know The run-time looks of Python's unicodedata.lookup (like Perl's I do note that if you could extend \N{...} the way we do with charname d=20
>> Python doesn't require it. :)/2
Really? White space makes things harder to read? I thought Pythonistas Inomorewantaregexwithoutwhitespacethananyothercodeortext. :) I do grant you that chatty comments may be a separate matter. White space in patterns is also good when you have successive patterns
Now, that isn't good code for all *kinds* of reasons, but white space By virtue of having a "titlecase each word's first letter and lowercase the But because Perl has always made it easy to grab "words" (actually,
all the time, and that has all kind of problems. If you prefer the
but that is still wrong.
I think Python is being smarter than Perl in simply providing people
all the time. However, as I have written elsewhere, I question a lot of However, the problem is that what a word is cannot be considered Each of these is a single word:
The capitalization there should be
Notice how you can't do the same with the first apostrophe+t as with the And of course, you can't actually create something in true English With English titlecasing, you have to respect what your publishing house 2: as at by in of on to up vs 3: but for off out per pro qua via 4: amid atop down from into like near next onto over <cutoff point for O'Reilly Media> 5: about above after among below circa given minus 6: across amidst around before behind beside beside beyond 7: against barring beneath besides between betwixt 10: throughout underneath The thing is that prepositions become adverbs in phrasal verbs, like "to Merely getting something like this right:
is going to take a bit of work. So is
(and that must give the same answer in NFC vs NFD, of course.) Plus what to do with something like num2ascii is ill-defined in English, And that is just English. Other languages have completely different rules.
Isn't that tricky! I guess that you would have to treat punctuation I'm really not sure. It is not obvious what the right thing to do here. I do believe that Python's titlecase function can and should be fixed to I fear the only thing you can do with the confusion of Unicode titlecase However, I'm still bothered by things with apostrophes though.
since I can't countenance the obviously wrong:
with the last the hardest to get right. I do have code that correctly And Swedes might be upset seeing Antonia Ax:Son Johnson instead Maybe we should just go back to the Pythonic equivalent of
where \w is specifically per tr18's Annex C, and give up on punctuation Thank you very much for all your hard work -- and patience with me. --tom |
I was surprised at that too ;-). One person's opinion in a specific
Except that I can imagine someone using the latter as a noun to make the |
The example I initially showed probably wasn't the best for that.
If Good-Looking looks more officous than Good-looking, I bet GOOD-LOOKING
If there aren't any rules, then how come all book and movie titles always
Those are the basic rules. The main problem is that "short" isn't English has sentence casing (only the first word) and headline casing (most of them).
I myself usually fall back to the Chicago Manual of Style or the Oxford But I completely agree that this should *not* be in the titlecase()
One of the goals of Unicode is that casing not be language dependent. And Did you know there is a problem with all the case stuff in Python? It str.islower()
That really isn't right. A cased character is one with the Unicode "Cased" I've spent all bloody day trying to model Python's islower, isupper, and istitle
I really don't understand. BTW, I feel that MᶜKinley is titlecase in that lowercase

I really don't understand any of these functions. I'm very sad. I think they are

Shall I file a separate bug report? --tom

from __future__ import unicode_literals
from __future__ import print_function

import regex

VERBOSE = 0

data = [
    # first test the problem cases just one at a time
    # test superscripts
    # test romans
    # test small caps
    # test cased combining mark (this is in titlecase)
    # test cased symbols
    # test titlecased code point 3-way
    # test titlecase
]

for s in data:
    # "Return true if all cased characters in the string are lowercase
    if s.islower():
        if not (regex.search(r'\p{cased}', s)
                and not regex.search(r'(?=\p{cased})\P{LOWERCASE}', s)):
            print(s + " islower() but fails to have at least one cased character with all cased characters lowercase!")
    else:
        if (regex.search(r'\p{cased}', s)
                and not regex.search(r'(?=\p{cased})\P{LOWERCASE}', s)):
            print(s + " not islower() but has at least one cased character with all cased characters lowercase!")

    # "Return true if all cased characters in the string are uppercase
    if s.isupper():
        if not (regex.search(r'\p{cased}', s)
                and not regex.search(r'(?=\p{cased})\P{UPPERCASE}', s)):
            print(s + " isupper() but fails to have at least one cased character with all cased characters uppercase!")
    else:
        if (regex.search(r'\p{cased}', s)
                and not regex.search(r'(?=\p{cased})\P{UPPERCASE}', s)):
            print(s + " not isupper() but has at least one cased character with all cased characters uppercase!")

    # "Return true if the string is a titlecased string and there is at
    has_it = s.istitle()
    want_it1 = (
        # at least one title/uppercase
        regex.search(r'[\p{Lt}\p{uppercase}]', s)
        and not
        # plus no title/uppercase follows cased character
        regex.search(r'(?<=\p{cased})[\p{Lt}\p{uppercase}]', s)
        and not
        # plus no lowercase follows uncased character
        regex.search(r'(?<=\P{CASED})\p{lowercase}', s)
    )
    want_it = regex.search(r'''(?x)
        ^
        (?:
            \P{CASED} *
            [\p{Lt}\p{uppercase}]
            (?! [\p{Lt}\p{uppercase}] )
            \p{lowercase} *
        ) +
        \P{CASED} *
        $
    ''', s)
    if VERBOSE:
        if has_it and want_it:
            print(s + " istitle() and should be (OK)")
        if not has_it and not want_it:
            print(s + " not istitle() and should not be (OK)")
    if has_it and not want_it:
        print(s + " istitle() but should not be")
    if want_it and not has_it:
        print(s + " not istitle() but should be") |
I think things like "from __future__ import ..." do something similar, but I'm not sure it will work in this case (also because you will have to provide the list of aliases somehow).
Also don't generalize my opinion regarding *where* whitespace makes things less readable: I was just talking about regex. To provide an example, I find:

    # define a function to capitalize s
    def my_capitalize(s):
        """This function capitalizes the argument s and returns it"""
        the_first_letter = s[0]  # 0 means the first char
        the_rest_of_s = s[1:]  # 1: means from the second till the end
        the_first_letter_uppercased = the_first_letter.upper()  # upper makes the string uppercase
        the_rest_of_s_lowercased = the_rest_of_s.lower()  # lower makes the string lowercase
        s_capitalized = the_first_letter_uppercased + the_rest_of_s_lowercased  # + concatenates
        return s_capitalized

less readable than:

    def my_capitalize(s):
        return s[0].upper() + s[1:].lower()

You could argue that the first is much more explicit and in a way clearer, but overall I think you agree with me that it is less readable. Also this clearly depends on how well you know the notation you are reading: if you don't know it very well, you might still prefer the commented/verbose/extended/redundant version. Another important thing to mention is that the notation of regular expressions is fairly simple (especially if you leave out look-arounds and Unicode-related things that are not used too often), but having a similarly succinct notation for a whole programming language (like Perl) might not work as well (I'm not picking on Perl here; as you said, you can write readable programs if you don't abuse the notation, and the succinctness offered by the language has some advantages, but with Python we prefer more readable, even if we have to be a little more verbose). Another example of a trade-off between verbosity and succinctness is the new string formatting mini-language.
You might want to take a look and possibly add a comment on bpo-12204 about this.
If by "model" you mean "trying to figure out how they work", it's probably easier to look at the implementation (I assume you know enough C to understand what they do). You can find the code for str.istitle() at http://hg.python.org/cpython/file/default/Objects/unicodeobject.c#l10358 and the actual implementation of some macros like Py_UNICODE_ISTITLE at http://hg.python.org/cpython/file/default/Objects/unicodectype.c.
If after reading the code and/or the documentation you still think they are broken and/or that they can be improved, then you can open another issue. BTW, instead of writing custom scripts to test things, it might be better to use unittest (see http://docs.python.org/py3k/library/unittest.html#basic-example), or even better write a patch for Lib/test/test_unicode.py. |
Can we please leave the English language out of this issue? As a point of order, please all try to stick at the issue at hand. |
The patch is pretty much complete, it just needs a review (I left some comments on the review page). |
The patch needs to take versioning into account. It seems that NamedSequences were added in 4.1, and NameAliases in 5.0. So for the moment, when using 3.2 (i.e. when self is not NULL), it is fine to look up neither. Please put an assertion into makeunicodedata that this needs to be reviewed when an old version other than 3.2 needs to be supported. The size of the DB does matter; there are frequent complaints about it. The named sequences take 20kB on my system; not sure whether that's too much. If you want to reduce the size (and also speed up lookup), you could use private-use characters, like so:
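(A minimal sketch of one way that could look, in Python rather than the generated C tables; the plane-15 range and the sample entry are illustrative only.)

    # Map each named sequence to an otherwise-unused plane-15 code point inside
    # the ordinary name table, and expand it again on lookup.
    NAMED_SEQUENCES = {
        "LATIN CAPITAL LETTER A WITH MACRON AND GRAVE": "\u0100\u0300",
    }

    PUA_START = 0xF0000  # assumed free for internal use

    pua_by_name = {}  # what the single-code-point name table would store
    seq_by_pua = {}   # small side table that lookup() uses to expand again
    for i, (name, chars) in enumerate(sorted(NAMED_SEQUENCES.items())):
        cp = PUA_START + i
        pua_by_name[name] = cp
        seq_by_pua[cp] = chars

    def lookup(name):
        cp = pua_by_name[name]              # ordinary name -> code point lookup
        return seq_by_pua.get(cp, chr(cp))  # expand stand-ins back into sequences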
|
Ezio Melotti <report@bugs.python.org> wrote
Ah yes, that's right. Hm. I bet then it *would* be possible, just perhaps
Certainly. It's a bit like the way bug rate per lines of code is invariant across
Thanks, that helps immensely. I'm completely fluent in C. I've gone

The main underlying problem is that the internal macros are defined in a

The originating culprit is Tools/unicode/makeunicodedata.py:

    if category in ["Lm", "Lt", "Lu", "Ll", "Lo"]:
        flags |= ALPHA_MASK
    if category == "Ll":
        flags |= LOWER_MASK
    if 'Line_Break' in properties or bidirectional == "B":
        flags |= LINEBREAK_MASK
        linebreaks.append(char)
    if category == "Zs" or bidirectional in ("WS", "B", "S"):
        flags |= SPACE_MASK
        spaces.append(char)
    if category == "Lt":
        flags |= TITLE_MASK
    if category == "Lu":
        flags |= UPPER_MASK

It needs to use DerivedCoreProperties.txt to figure out whether
This affects a lot of things, but you should be able to just fix it

You will probably also want to add

    Py_UCS4 _PyUnicode_IsWord(Py_UCS4 ch)

that uses the UTS#18 Annex C definition, so that you catch marks, too.
where Alphabetic is defined above to include Nl and Other_Alphabetic.

Somewhat related is stuff like this:

    typedef struct {
        const Py_UCS4 upper;
        const Py_UCS4 lower;
        const Py_UCS4 title;
        const unsigned char decimal;
        const unsigned char digit;
        const unsigned short flags;
    } _PyUnicode_TypeRecord;

There are two different bugs here. First, you are missing

    const Py_UCS4 fold;

which is another field from UnicodeData.txt, one that is critical

Second, there's also the problem that Py_UCS4 is an int. That means you
You will also need to extend the API from just

    Py_UCS4 _PyUnicode_ToUppercase(Py_UCS4 ch)

to something like

I don't know what the ??? return type is there, but it's whatever the

I know that Matthew Barnett has had to cover a bunch of these for his regex

I hadn't actually *looked* at capitalize yet, because I stumbled over

Ok, more bugs. Consider this:

static
int fixcapitalize(PyUnicodeObject *self)
{
Py_ssize_t len = self->length;
Py_UNICODE *s = self->str;
int status = 0;
if (len == 0)
return 0;
if (Py_UNICODE_ISLOWER(*s)) {
*s = Py_UNICODE_TOUPPER(*s);
status = 1;
}
s++;
while (--len > 0) {
if (Py_UNICODE_ISUPPER(*s)) {
*s = Py_UNICODE_TOLOWER(*s);
status = 1;
}
s++;
}
return status;
}

There are several bugs there. First, you have to use the TITLECASE if there

Second, you cannot decide to do the case change only if it starts out as a

Does this help at all? I have to go to a meeting now. --tom |
Tom: PLEASE focus on one issue at a time. This is about formal |
Here is a new patch that stores the names of aliases and named sequences in the Private Use Area.

To summarize a bit, this is what we want: \N{...} should only support aliases, unicodedata.lookup should support aliases and named sequences, unicodedata.name doesn't support either, and when 3.2.0 is used nothing is supported. The function calls involved for these 3 functions are: \N{...} and .lookup: .name:

My patch adds an extra arg to _getcode and _getucname (I hope that's fine -- or are they public?). _getcode is called by \N{...} and .lookup; both support aliases, so _getcode now resolves aliases by default. Since only .lookup wants named sequences, _getcode now accepts an extra 'with_named_seq' arg and looks up named sequences only when its value is 1. .lookup passes 1, gets the codepoint, and converts it to a sequence. \N{...} passes 0 and doesn't get named sequences.

_getucname is called by .name and indirectly (through _cmpname) by .lookup and \N{...}. Since _getcode takes care of deciding who gets aliases and sequences, _getucname now accepts an extra 'with_alias_and_seq' arg and looks up aliases and named sequences only when its value is 1. _cmpname passes 1, gets aliases and named sequences and then lets _getcode decide what to do with them. .name passes 0 and doesn't get aliases and named sequences.

All this happens on 6.0.0 only; when self != NULL (i.e. we are using 3.2.0), named sequences and aliases are ignored.

The patch doesn't include the changes to unicodename_db.h -- run makeunicodedata.py to get them. |
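For reference, the intended behavior as an interactive session (assuming a build with this patch and the 6.0.0 data, i.e. not the 3.2.0 compatibility mode):

>>> import unicodedata
>>> "\N{LATIN CAPITAL LETTER GHA}"                  # alias works in \N{...}
'Ƣ'
>>> unicodedata.lookup("LATIN CAPITAL LETTER GHA")  # alias works in lookup()
'Ƣ'
>>> unicodedata.lookup("LATIN CAPITAL LETTER A WITH MACRON AND GRAVE")
'Ā̀'
>>> unicodedata.name("\u01a2")                      # name() keeps the original name
'LATIN CAPITAL LETTER OI'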
Ezio Melotti <report@bugs.python.org> wrote
Looks good! Thanks! --tom |
(I had to re-upload the patch a couple of times to get the review button to work. Apparently if there are some conflicts Rietveld fails to apply the patch, whereas hg is able to merge files without problems here. Sorry for the noise.) |
If you don't use git-style diffs, Rietveld will much better accommodate patches that don't apply to tip cleanly. Unfortunately, hg git-style diffs don't indicate the base revision, so Rietveld guesses that the base line is tip, and then fails if it doesn't apply exactly. |
If the latest patch is fine I'll commit it shortly. |
Yes, it looks good. Thank you very much. -tom |
LGTM |
New changeset a985d733b3a3 by Ezio Melotti in branch 'default': |
I committed the patch and the buildbots seem happy. Thanks for the report and the feedback! Tom, about the problems you mentioned in msg144836, can you report it in a new issue or, if there are already issues about them, add a message there? |
New changeset 329b96fe4472 by Ezio Melotti in branch 'default': |