Decide how to handle str/unicode #208
Comments
A new proposal: use the existing triad bytes-str-unicode. |
Let me try to explain the new proposal with more care.
Gradual Byting
I am most interested in solving this issue for straddling code; my assumption is that most of the interest in type annotations for Python 2 has to do with that. (This is the case at Dropbox, and everyone who has enough Python 2 code to want to annotate it probably should be thinking about porting to Python 3 anyway. :-)
In the proposal, str has a position similar to the one that Any has in the type system as a whole -- i.e. assuming we have three variables b, s, t, with respective types bytes, str, (typing.)Text, then: b is compatible with s, s is compatible with b and t, t is compatible with s, but b and t are not compatible with each other. IOW the relationships between the three types are not expressible using subtyping relationships only. (It's actually a little more complicated, I'll spell out the actual rules I'm proposing below.)
Before we get to that, I'd like to discuss the use cases for the proposal. In straddling code we often have the problem that some Python 2 code is vague about whether it works on bytes or text or both. The corresponding Python 3 code may work only on bytes, or only on Text, or on both as long as they are used consistently (i.e. AnyStr), or possibly on Union[bytes, Text].
To find such cases, maybe we could just type-check the code twice, once in Python 2 mode and once in Python 3 mode. If it type-checks cleanly in both modes, it should run correctly in both Python versions too (insofar as type-checking cleanly can ever say anything about running without errors :-). However, when we have a large code base, it is usually a struggle to make it type-check cleanly even in one mode, and typically we start with Python 2.
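As an illustration of the four annotation choices mentioned above (bytes only, Text only, AnyStr, and Union[bytes, Text]), here is a minimal sketch using Python 2 compatible type comments. The function names are hypothetical and chosen only for this example; `AnyStr`, `Text`, and `Union` are the real `typing` names.

```python
from typing import AnyStr, Text, Union


def checksum(data):
    # type: (bytes) -> int
    """Works only on bytes: sums the byte values modulo 256."""
    return sum(bytearray(data)) % 256


def shout(message):
    # type: (Text) -> Text
    """Works only on text."""
    return message.upper()


def concat(a, b):
    # type: (AnyStr, AnyStr) -> AnyStr
    """Works on bytes or on text, as long as both arguments agree."""
    return a + b


def to_text(value):
    # type: (Union[bytes, Text]) -> Text
    """Accepts either kind and always returns text (assumes UTF-8 input)."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return value
```

With `AnyStr`, a checker should reject `concat(b"a", u"b")`, while `to_text` deliberately accepts both kinds and normalizes them.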
So if we have code that runs correctly using Python 2 and type-checks cleanly in Python 2 mode, and we want to port it to Python 3, requiring it to type-check cleanly in Python 3 mode is setting the bar very high (as high as expecting it to run correctly using Python 3). Therefore I am proposing a gradual approach. Similar to the way we start by type-checking an untyped program (which by definition should type-check cleanly, since all types are Any -- even though in practice there are some holes in that theory), I propose to start with a Python 2 program that uses str for all string types, and type-checks cleanly that way, and gradually change the program to replace each occurrence of str with either bytes or Text (or one of the rarer alternatives like AnyStr or Union[str, Text]). That way we can gradually tighten the type annotations, keeping the code type-check clean as we go. Just like, when I define a function with The actual details are a bit subtle. I'm proposing (in builtins; recall that this is for Python 2 only):
The subclassing relationship between bytes and str makes str acceptable where bytes is required. In mypy we can add a "promotion" from bytes to str to enable compatibility in the other direction. Mypy (in Python 2 mode) has an existing promotion from str to unicode that accepts str where unicode is required. I don't actually propose to make unicode acceptable where str is required (this is a deviation from the "str is like Any" idea). Because promotions are not transitive (unlike subclassing), bytes is not acceptable where unicode is required, nor the other way around. There is still a lot more to explain. I want to show in detail what happens in various cases, and why I think that is right. I need to explain the concept of "text in spirit" to motivate why I am okay with the difference between these rules and the actual workings of Python 2. I want to go over some examples involving container types (since that's where the "AsciiBytes" proposal went astray). And I need to give some guidelines for stub authors and changes to existing stubs. (E.g. I think that Python 2 getattr() may have to be changed to accept unicode.) [But that will have to wait until tomorrow.] |
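The rules summarized in the comment above can be written down as a tiny model. This is only a sketch of the rules as stated there (str subclasses bytes, one promotion from bytes to str, one from str to unicode, promotions never chained); it is not how mypy implements subtype checks, and the names `SUBCLASS_OF`, `PROMOTIONS`, and `is_compatible` are hypothetical.

```python
# Direct subclass edges: str is declared as a subclass of bytes.
SUBCLASS_OF = {"str": "bytes"}

# Promotions are single, non-transitive steps: (from_type, to_type).
PROMOTIONS = {("bytes", "str"), ("str", "unicode")}


def is_subclass(actual, declared):
    """Follow the subclass chain transitively."""
    while actual is not None:
        if actual == declared:
            return True
        actual = SUBCLASS_OF.get(actual)
    return False


def is_compatible(actual, declared):
    """True if a value of type `actual` is accepted where `declared` is required."""
    if is_subclass(actual, declared):
        return True
    # Apply at most one promotion; promotions are never chained.
    return (actual, declared) in PROMOTIONS
```

Under this model `is_compatible("bytes", "unicode")` is false even though bytes promotes to str and str promotes to unicode, which is the non-transitivity the comment relies on.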
This contradicts this part of the proposal:
Also, the example with
Mypy actually considers the promotions Other notes:
|
That's why I wrote "It's actually a little more complicated." I am having a hard time summarizing the proposal briefly and also writing it up in detail without contradictions between the two. In case of conflict the detailed version should win and the summary should be seen as a hint at most. Maybe we'll have to use more vagueness in the summary to avoid confusing experts who know the terminology.
More imprecision in the summary. :-( It really can't, unless
I will try to spec out the true compatibility as a bunch of tables.
Yes. There are already some overloads like that. The bigger difference will be that these overloads won't exist for bytes+unicode.
Yes. In fact IO[unicode] would only be obtainable by calling io.open().
I think not (so we agree here). This will lead to some of the same issues as I ran into when trying to implement the AsciiBytes idea, but the issues will be much less common. [In the next installment I will try to construct the tables of compatibilities. I will also talk about the concept of "text in spirit".] |
Text in Spirit
(This is still pretty messy. But I promised I would explain the concept.) I'll sometimes say that some variable in Python 2 is "text in spirit". For example, in Note that text encoded as bytes is not "text in spirit". The requirement is that the corresponding Python 3 API uses Basically the point of "text in spirit" is to make the argument that an API should not use bytes even though it may accept non-ASCII str instances. But I have to do more exploration before I decide how important this concept is. |
Here's another table that could be useful -- if we define |
Compatibility Tables
[UPDATE: made function calls primary, per Jukka's suggestion below] Let's start by stating the compatibility between expressions of types bytes, str, text and functions with arguments of those types. Each row corresponds to a declared argument type; each column corresponds to the type of an expression passed in for that argument.
The above table also describes compatibility of expressions with variables (assuming the type checker, like mypy currently, doesn't just change the type of the variable). Note that I haven't decided yet whether to allow passing a Text value to a str argument, but I'm inclined to put "No" there, even though that breaks the illusion of "str as the Any of string types" (IOW gradual byting :-). Next let's describe the return type for expressions of the form
Note that this table is more regular and I'm pretty confident about it. |
Irregularities
|
A reason we might prefer to use a call instead of an assignment as a basis for the table is that some type checkers like to infer a new type from assignment, and thus arbitrary assignments are considered correct -- they just redefine the type of a variable. |
OK, edited the text. |
If we don't do the Example first phase annotation where we'd need
Here we'd need to use |
@gvanrossum @JukkaL I would like to join your discussion. Here is a summary of the current ASCII types and gradual byting proposals as I understand them.
Rationale
(This piece is here to make sure we’re solving the same problem.) The changes in text and binary data handling in Python 3 are one of the major reasons people cannot run their Python 2 code on Python 3. These changes were:
The most viable approach to porting to Python 3 is via porting to Python 2+3. Therefore we need a way to make this transition from a Python 2 program with implicit conversions to a Python 2 program with fewer implicit conversions to a Python 2+3 program that runs on both versions but still contains some implicit conversions to a Python 3 program.
ASCII Types
The idea is to introduce two new types, one new type compatibility rule, and special type inference rules for binary and text literals. The new types are:
The If a text or binary literal contains only ASCII characters then type checkers should infer the corresponding ASCII types instead of regular text / binary types.
Pros
Cons
Gradual Byting
A proposal by Guido described here. The idea is to distinguish the
Questions
What do you think of the workaround for invariant collections of Could you please answer my questions about the gradual byting proposal? |
Writing my answer...
Re: Rationale
Agreed, although your way of describing the changes makes it sound like Python 3 is a step back from Python 2 in this respect; I believe the opposite (PY3 is better than PY2). The key part is that some things around strings changed and we want to provide a gradual way to convert PY2 via straddling (2+3) to (eventually) PY3. The proposal is mostly concerned with how to write string types for the straddling case (in a way that works in PY2 and is idiomatic PY3).
Re: ASCII Types
I'm not sure you characterized the proposal the same way as I heard it (as retold by @ddfisher), but you're pretty close. IIRC the proposal actually made ASCIIText compatible with all bytes, and ASCIIBytes compatible with all Text, in Python 2. So e.g. As an example that goes in the other direction, in Python 2, I'd like to point out that in both these examples (taken from the behavior of builtins, there are many more like it) declaring the argument or variable as ASCIIBytes or ASCIIText is inappropriate, since while ASCII characters/bytes are accepted in either persuasion (text or bytes), the full range of one or the other base types (bytes or Text) is still accepted. Another problem I have with the ASCII proposals is that they emphasize literals too much. Yes, in examples we often like to write things like
Re: Gradual Byting
(NOTE: The nickname "gradual byting" may actually be a bad pun, as in the end the compatibility rules are more complicated than those for
(You probably meant supertype, since I propose to make bytes a supertype of str in PY2.) It's not completely resolved, but I think it's more reasonable to ask people to distinguish between My main reason that I find this less of a problem than the corresponding problem with
This situation is inherently confusing, because in PY2, sometimes str+unicode works, and sometimes it doesn't. As long as you don't know, you may be better off with
I think we have two choices: So in the end maybe "gradual byting" is correct. Ideally for straddling code, you should run mypy twice, once with
Mypy in
BTW I find |
The full gradual byting idea (
(The type checking rules for generic types like What I don't like is that gradual byting gives up on checking for implicit ASCII conversions. Basically we'll be able to check the two distinctly marked But it seems that it's more important to give developers tools for gradually widening the subsets of their PY2 programs that handle strictly binary or strictly textual data in order to make them more PY2+3 compatible rather than helping to catch all the I would like to experiment with the idea of gradual byting in PyCharm for the next few days to see if there are any concerns with it. I'll report about my findings later this week. |
My current inclination is not to do anything special about these, because List is invariant. Although if
Yes, that's the most important use case we have for mypy at Dropbox -- we want our code to become more Python 3 ready. Mixing bytes and Text is a type error. But it's harder to argue that mixing ASCII and non-ASCII is a type error. I don't like to treat strings containing only ASCII characters as a subtype of bytes or Text, because dynamic sources of characters (other than literals in the source code) don't typically tell you whether they can ever return non-ASCII characters.
As an analogy, let's say we wanted to treat non-negative integers as a subtype of int. If we define a "type" as a set of values, this is certainly a reasonable thought, and non-negative integers are closed under addition and multiplication (just like ASCII strings are closed under concatenation and slicing). But there are few input functions that return only non-negative integers -- int() in particular can definitely return a negative int. So it's hard to enforce the non-negativity of integers being processed by a program without explicit range checks or complicated proofs that a certain algorithm preserves that property. I feel it's similar for the ASCII-ness of bytes and Text -- a function that reads a string from a file or socket (or from e.g. os.environ) has no particular reason to believe that the source will only contain ASCII characters.
I'm looking forward to the outcome of your experiments. In the meantime I will also try to look into a more complete set of changes to typeshed and mypy. |
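The point about dynamic sources can be illustrated with a short sketch: ASCII-ness is a property of a particular value, so code that depends on it needs an explicit runtime check, much like a range check for non-negative integers. The helper names below are hypothetical and exist only for this example.

```python
from typing import Optional, Text


def is_ascii(data):
    # type: (bytes) -> bool
    """Return True if every byte in `data` is in the ASCII range (0-127)."""
    return all(byte < 128 for byte in bytearray(data))


def decode_ascii_or_none(data):
    # type: (bytes) -> Optional[Text]
    """Decode `data` as ASCII, or return None if it contains other bytes."""
    if is_ascii(data):
        return data.decode("ascii")
    return None
```

A static annotation of `bytes` on `data` says nothing about which branch is taken; only the check at runtime does.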
I'm done with my experiments with the idea of gradual byting. I've created a proof-of-concept implementation of gradual byting in PyCharm by modifying
Original Gradual Byting
What I've learned from my experiment with the original gradual byting proposal is that the type checker doesn't help in porting PY2 code to PY2+3. I mean if you already use But overall you get no guidance on how to proceed with porting your code. The type checker doesn't tell you if there are any variables or functions with no type hints. It doesn't promote the use of
Guided Gradual Byting
During my further experiments I came up with the following guided process for making PY2 text/binary data handling more PY3-like. The original gradual byting is a step in this process.
Summary of Proposed Changes
Idea
(Note: It includes some pictures, please see the comment on the GitHub page). Hypothesis: Most text/binary data in PY2 programs can be converted to either Unicode data or 8-bit data in PY2+3. Native In PY3 you have a clear separation of text and binary data: they are not compatible with each other. In PY2 things are complicated because of the implicit conversion between text and binary data using the ASCII encoding. If you want to make your PY2 code PY2+3 compatible (straddling), you have to make it more PY3-like with respect to text/binary data separation. A good way to proceed is to start putting type hints into your code so that a type checker is able to check your code for correctness. The steps of the proposed approach to porting are described below. You may start with no or some type hints in your code. You may proceed module-by-module or modify the whole program at once.
1. Add type hints for all declarations in your code
Use the "Warn about implicit For text and binary data you have the following options:
2. Remove all
|
@gvanrossum @JukkaL I'm looking forward to your feedback. |
Sorry, I'm tied up at the core python sprint this week. I hope to have time
next week!
|
@vlasovskikh Thanks for the detailed write-up! Your approach sounds mostly reasonable. If @gvanrossum agrees, hopefully we can experiment with it and mypy. A few things I'm not sure about:
This could be useful during migration, but I'm not sure if users will find it easy to understand. An alternative would be to propose that users define a similar type alias by themselves instead of including it in
The implications of a mode that requires casts between
Would
This would be fine in Python 2 mode but it wouldn't work in Python 3 or strict
This may have to be written like this, which seems a bit excessive but perhaps still reasonable:
It seems that the final example would also work in the strict |
@whatisaphone I can open up my fork to you if you are able to contribute to this |
I doubt it, you have to parse the Also, |
@gvanrossum is six a prerequisite of mypy? Can I just import six.text_type as unicode? |
Yeah, if you want to write cross-version code that needs unicode in Python 2 and str in Python 3, and you don't want to use |
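For readers who want a concrete picture of the cross-version alias being discussed, one common pattern is to select the text type with a `sys.version_info` check, which is roughly the role `six.text_type` plays. This is a sketch rather than a recommendation from the thread; the names `text_type` and `as_text` are hypothetical, and UTF-8 is an assumed default encoding.

```python
import sys

if sys.version_info[0] >= 3:
    text_type = str
else:
    text_type = unicode  # noqa: F821  (this branch only runs on Python 2)


def as_text(value, encoding="utf-8"):
    """Return `value` as text, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return text_type(value)
```

On Python 3 `text_type` is simply `str`; on Python 2 it is `unicode`, and `isinstance(value, bytes)` matches Python 2 `str` values.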
43k LOC are happy, and a few obscure bugs were found. It's not quite 100k LOC, but hopefully it's a start. It does cause some minor bumps. IO[Text] doesn't work, but I probably shouldn't have been doing that anyway. |
I reviewed the CI failures and I see there's a "mypy_test.py" that I need to be running, so I've started going through that and I need a little assistance, please. In stdlib/2/_threading_local.pyi:4 I found this:
I'm getting the following error:
I'm thinking I need to change that line to this:
but that I also need to change the definition of I've grepped mypy's source, but I'm having difficulty finding where Thanks |
I think you're looking for this: https://github.com/python/mypy/blob/1a9e2804cdad401a3019eabd37002f32d08fe0ec/mypy/checker.py#L291 PS: For help with the PR it's usually quicker to add a comment to the PR itself rather than to the issue to which it relates. |
I ran the experiment you suggested, with remdragon's typeshed fork. With your suggested change to semanal.py, my test above works as expected!
I'm happy to see this taking shape! 😃 @remdragon I'm happy to be added, but unfortunately I can't promise I'll have a lot of time to dedicate to this. I was able to spend a bit of time just now. I cloned your fork and tested it on our codebase. It correctly found a few lurking bugs 😃 but also one interesting regression. Here's a minimal repro:
    from typing import Any
    class Class: pass
    def foo(**kwargs):
        # type: (**Any) -> None
        c = Class()
        c.__dict__.update(kwargs)
I ran mypy with
My intuition tells me that a
    Python 2.7.12 (default, Dec 4 2017, 14:50:18)
    [GCC 5.4.0 20160609] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> class X: pass
    ...
    >>> x = X()
    >>> setattr(x, u'unicode', True)
    >>> x.__dict__
    {'unicode': True}
    >>> # ^ It's a `str` now
    ...
    >>> setattr(x, u'\x80', True)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\x80' in position 0: ordinal not in range(128)
It's similar to the I changed it back locally and our codebase typechecked with no ill effects. (Specifically, I changed What do you think about that change? |
I've started a serious Python 3 porting effort myself recently, and I think I have to agree with those who suggest one refinement: In Python 2, make I have not thought much about how we could implement this. |
This (no promotion from |
typeshed also already recommends using |
I still have to investigate this, but the problem with typeshed is likely that we don’t have an easy standard way to determine whether a specific function needs Unicode or Text. |
I think remdragon's fork was the right approach, and it's too bad it went stale so quickly. When it was fresh I tried it (combined with the one-liner mypy tweak) on our codebase and there was only the one minor regression (easily fixable, and I wrote way too many words about it, whoops!) I think that's the way forward. As I understand it, typeshed is maintained by hand, so I don't know that there's any way around a big "Textification" commit that is also written by hand. In cases where it's unclear, fall back to I propose we land remdragon's fork (after a fix and rebase) first—which should cause no change in behavior. Then make the changes in mypy to drop the Did I miss anything? What do you think about that approach? EDIT: Changed |
Is this issue still relevant? I think there is a consensus on how to handle this and it has been implemented by all type checkers supporting Python 2. |
I don't think anything is going to change, especially now that Python 2 has reached its end of life (and acknowledging that many projects and companies are still working on porting their legacy Python 2 code to Python 3). I'm not sure what the consensus on how to handle it is. I believe we changed some language in PEP 484 about not promoting str to unicode but I don't believe mypy made a change to match (and the PEP was intentionally not requiring it, just allowing it). |
FWIW I migrated a reasonably sized codebase from Python 2.7 to 3.5 in late 2016, and I would claim that what typing and mypy provided back then was already good enough: as far as I remember, I had no runtime errors related to str/bytes/unicode/text handling once mypy was satisfied. |
The consensus from what I understand (which might be wrong) is:
That said I have no particular stake in this discussion, since I'm fortunate enough not to work with Python 2 anymore. |
Mypy treats I don't think that it's important to change anything or to reach consensus any more. I'm fine with closing this issue. |
Sounds good. Closing. |
<http://pyenchant.github.io/pyenchant/api/index.html> TODO: API accepts Union[Text, bytes], see <python/typing#208>
<https://github.com/pyenchant/pyenchant> <http://pyenchant.github.io/pyenchant/api/index.html> - [ ] API-2 accepts Union[Text, bytes], see <python/typing#208> - [ ] API-3 provides new functions and is Pyhton-3-only
There's a long discussion on this topic in the mypy tracker: python/mypy#1141
I'm surfacing it here because I can never remember whether that discussion is here, or in the typeshed repo, or in the mypy tracker.
(Adding str, bytes, unicode, Text, basestring as additional search keywords.)