Decide how to handle str/unicode #208

Closed
gvanrossum opened this issue Apr 25, 2016 · 115 comments

Comments

@gvanrossum
Member

gvanrossum commented Apr 25, 2016

There's a long discussion on this topic in the mypy tracker: python/mypy#1141

I'm surfacing it here because I can never remember whether that discussion is here, or in the typeshed repo, or in the mypy tracker.

(Adding str, bytes, unicode, Text, basestring as additional search keywords.)

@gvanrossum
Member Author

A new proposal: use the existing triad bytes-str-unicode.

@gvanrossum
Member Author

gvanrossum commented Aug 17, 2016

Let me try to explain the new proposal with more care.

Gradual Byting

I am most interested in solving this issue for straddling code; my assumption is that most of the interest in type annotations for Python 2 has to do with that. (This is the case at Dropbox, and everyone who has enough Python 2 code to want to annotate it probably should be thinking about porting to Python 3 anyway. :-)

In the proposal, str has a position similar to the one that Any has in the type system as a whole -- i.e. assuming we have three variables b, s, t, with respective types bytes, str, (typing.)Text, then: b is compatible with s, s is compatible with b and t, t is compatible with s, but b and t are not compatible with each other. IOW the relationships between the three types are not expressible using subtyping relationships only. (It's actually a little more complicated, I'll spell out the actual rules I'm proposing below.)

Before we get to that, I'd like to discuss the use cases for the proposal. In straddling code we often have the problem that some Python 2 code is vague about whether it works on bytes or text or both. The corresponding Python 3 code may work only on bytes, or only on Text, or on both as long as they are used consistently (i.e. AnyStr), or possibly on Union[bytes, Text]. To find such cases, maybe we could just type-check the code twice, once in Python 2 mode and once in Python 3 mode. If it type-checks cleanly in both modes, it should run correctly in both Python versions too (insofar as type-checking cleanly can ever say anything about running without errors :-).

However, when we have a large code base, it is usually a struggle to make it type-check cleanly even in one mode, and typically we start with Python 2. So if we have code that runs correctly using Python 2 and type-checks cleanly in Python 2 mode, and we want to port it to Python 3, requiring it to type-check cleanly in Python 3 mode is setting the bar very high (as high as expecting it to run correctly using Python 3).

Therefore I am proposing a gradual approach. Similar to the way we start by type-checking an untyped program (which by definition should type-check cleanly, since all types are Any -- even though in practice there are some holes in that theory), I propose to start with a Python 2 program that uses str for all string types, and type-checks cleanly that way, and gradually change the program to replace each occurrence of str with either bytes or Text (or one of the rarer alternatives like AnyStr or Union[str, Text]). That way we can gradually tighten the type annotations, keeping the code type-check clean as we go.

Just like, when I define a function with def f(x: Any), I can call f(1), f('') and f([0]) and it's all the same to the type checker, and in f's body I can use x+1, x() or x[0], the idea here is that a function defined with def g(s: str) can be called as g(''), g(b'') or g(u''), and in g's body I can use s+b'xxx', s+'yyy' or s+u'zzz'.

The actual details are a bit subtle. I'm proposing (in builtins; recall that this is for Python 2 only):

  • class bytes with (mostly) the methods currently present on str, with arguments of type bytes and returning bytes (as appropriate).
  • class str(bytes) with overloaded methods that return str if the other argument is a str, returning bytes for bytes (more or less).
  • class unicode unchanged from its current definition, keeping typing.Text as a pure alias for it.

The subclassing relationship between bytes and str makes str acceptable where bytes is required. In mypy we can add a "promotion" from bytes to str to enable compatibility in the other direction. Mypy (in Python 2 mode) has an existing promotion from str to unicode that accepts str where unicode is required. I don't actually propose to make unicode acceptable where str is required (this is a deviation from the "str is like Any" idea). Because promotions are not transitive (unlike subclassing), bytes is not acceptable where unicode is required, nor the other way around.
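The rules in the paragraph above can be modelled with a small toy, shown here only to make the non-transitivity concrete. This is a sketch and not mypy's implementation: the classes `Bytes`, `Str`, `Unicode`, the `PROMOTIONS` table, and `is_compatible()` are all hypothetical names invented for the illustration.

```python
# Toy model of the proposed Python 2 compatibility rules (not mypy's code).
# All names here are stand-ins defined only for this sketch.

class Bytes(object):
    pass

class Str(Bytes):       # proposal: str subclasses bytes
    pass

class Unicode(object):  # typing.Text would stay an alias for this
    pass

# One-step promotions: actual type -> type it may be promoted to.
PROMOTIONS = {
    Bytes: Str,     # the newly proposed promotion
    Str: Unicode,   # the promotion mypy already has in Python 2 mode
}

def is_compatible(actual, declared):
    """Return True if `actual` is accepted where `declared` is required."""
    if issubclass(actual, declared):
        return True
    # Apply at most one promotion step; promotions are not chained.
    promoted = PROMOTIONS.get(actual)
    return promoted is not None and issubclass(promoted, declared)
```

Because only a single promotion step is applied, `is_compatible(Bytes, Unicode)` is false even though `Bytes` promotes to `Str` and `Str` promotes to `Unicode`.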

There is still a lot more to explain. I want to show in detail what happens in various cases, and why I think that is right. I need to explain the concept of "text in spirit" to motivate why I am okay with the difference between these rules and the actual workings of Python 2. I want to go over some examples involving container types (since that's where the "AsciiBytes" proposal went astray). And I need to give some guidelines for stub authors and changes to existing stubs. (E.g. I think that Python 2 getattr() may have to be changed to accept unicode.)

[But that will have to wait until tomorrow.]

@JukkaL
Contributor

JukkaL commented Aug 17, 2016

t [Text] is compatible with s [str]

This contradicts with this part of the proposal:

I don't actually propose to make unicode acceptable where str is required

Also, the example with def g(s: str) suggests that it can be called as g(u''), with
a unicode argument. This should be clarified and made consistent across the proposal, as
otherwise things get confusing.

Because promotions are not transitive (unlike subclassing)

Mypy actually considers the promotions int -> float and float -> complex
transitive, and int can be promoted to complex. We could change the language
to something like "these promotions are not transitive" or we could perhaps treat
the int -> complex promotion as a separate promotion.

Other notes:

  • I'd assume that str methods would return unicode if the other argument is unicode.
    Currently this is left unspecified. It could be useful to have table of the result types
    of s1 + s2 for all combinations of str, bytes and unicode (9 cases).
  • AnyStr would have to range over str, bytes and unicode. This means that we may want
    to give different meanings to IO[str] and IO[bytes], for example.
  • List[Any] is compatible with List[int] and vice versa in mypy (though PEP 484/483
    seems to be silent on this), but should List[str] be compatible with List[bytes], and
    vice versa? I'd argue that List[str] and List[bytes] should be incompatible, similar to
    how List[int] and List[float] are incompatible, but I don't have a strong opinion on
    this.

@gvanrossum
Member Author

This contradicts with this part of the proposal

That's why I wrote "It's actually a little more complicated." I am having a hard time summarizing the proposal briefly and also writing it up in detail without contradictions between the two. In case of conflict the detailed version should win and the summary seen as a hint at most. Maybe we'll have to use more vagueness in the summary to avoid confusing experts who know the terminology.

the example with def g(s: str) suggests that it can be called as g(u'')

More imprecision in the summary. :-( It really can't, unless g() is implemented in C in a certain way, e.g. getattr(x, u"foo"). But for a Python function this is wrong. Actually, for a Python function, the other way around is also wrong. But nevertheless the promotion allows it. Just like the promotion from int to float is technically wrong in Python 2, as shown here:

def f(a):
    # type: (float) -> float
    return a/2
assert f(3) == 1.5  # Fails, it returns 1

I will try to spec out the true compatibility as a bunch of tables.

str methods would return unicode if the other argument is unicode

Yes. There are already some overloads like that. The bigger difference will be that these overloads won't exist for bytes+unicode.

AnyStr would have to range over str, bytes and unicode. This means that we may want to give different meanings to IO[str] and IO[bytes], for example.

Yes. In fact IO[unicode] would only be obtainable by calling io.open().

should List[str] be compatible with List[bytes], and vice versa?

I think not (so we agree here). This will lead to some of the same issues as I ran into when trying to implement the AsciiBytes idea, but the issues will be much less common.

[In the next installment I will try to construct the tables of compatibilities. I will also talk about the concept of "text in spirit".]

@gvanrossum
Member Author

gvanrossum commented Aug 17, 2016

Text in Spirit

(This is still pretty messy. But I promised I would explain the concept.)

I'll sometimes say that some variable in Python 2 is "text in spirit". For example, in getattr(x, name), name is "text in spirit". In this case I mean two things with this: first, that in Python 3 the name argument to getattr() has type str, not bytes. Second, that even in Python 2, the name is an identifier, and even though you can write getattr(x, '\xff\x01'), that would be useless.

Note that text encoded as bytes is not "text in spirit". The requirement is that the corresponding Python 3 API uses str, and the Python 2 API supports bytes or unicode, though not necessarily all bytes or all unicode -- e.g. getattr() only accepts a unicode name if it contains only ASCII characters, even though it doesn't make that requirement when the argument is str.

Basically the point of "text in spirit" is to make the argument that an API should not use bytes even though it may accept non-ASCII str instances. But I have to do more exploration before I decide how important this concept is.
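The Python 3 half of the getattr() example can be checked directly. The sketch below assumes it is run under Python 3 (the thread itself is about Python 2), and the `Config` class is a hypothetical stand-in.

```python
class Config(object):
    timeout = 30

# A text name works.
value = getattr(Config, "timeout")

# A bytes name is rejected outright on Python 3.
try:
    getattr(Config, b"timeout")
    bytes_name_accepted = True
except TypeError:
    bytes_name_accepted = False
```

On Python 3 the second call raises TypeError, which is the sense in which the name argument is text rather than bytes.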

@JukkaL
Contributor

JukkaL commented Aug 17, 2016

Here's another table that could be useful -- if we define def f(s: s1) -> None: ..., is a call with an argument of type s2 valid, when s1 and s2 range over str, bytes and unicode.

@gvanrossum
Member Author

gvanrossum commented Aug 17, 2016

Compatibility Tables

[UPDATE: made function calls primary, per Jukka's suggestion below]

Let's start by stating the compatibility between expressions of types bytes, str, Text and functions with arguments of those types. Each row corresponds to a declared argument type; each column corresponds to the type of an expression passed in for that argument.

| Argument type | xb: bytes | xs: str | xt: Text |
|---|---|---|---|
| arg_b: bytes | Yes (same) | Yes (str <: bytes) | No |
| arg_s: str | Yes (promotion) | Yes (same) | ??? |
| arg_t: Text | No | Yes (promotion) | Yes (same) |

The above table also describes compatibility of expressions with variables (assuming the type checker, like mypy currently, doesn't just change the type of the variable). Note that I'm not decided yet whether to allow passing a Text value to a str argument, but I'm inclined to put "No" there, even though that breaks the illusion of "str as the Any of string types" (IOW gradual byting :-).

Next let's describe the return type for expressions of the form x + y where x and y can each be of type bytes, str, or Text.

| x + y | yb: bytes | ys: str | yt: Text |
|---|---|---|---|
| xb: bytes | bytes | bytes | ERROR |
| xs: str | bytes | str | Text |
| xt: Text | ERROR | Text | Text |

Note that this table is more regular and I'm pretty confident about it.
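For anyone wanting to experiment with the x + y table, it can be written down as a plain lookup. This is only a transcription of the table into data; `ADD_RESULT`, `ERROR`, and `add_result_type()` are hypothetical names, and the type names are ordinary strings rather than real types.

```python
ERROR = None  # marker for combinations the proposal rejects

# (type of x, type of y) -> type of x + y, copied from the table above.
ADD_RESULT = {
    ("bytes", "bytes"): "bytes",
    ("bytes", "str"): "bytes",
    ("bytes", "Text"): ERROR,
    ("str", "bytes"): "bytes",
    ("str", "str"): "str",
    ("str", "Text"): "Text",
    ("Text", "bytes"): ERROR,
    ("Text", "str"): "Text",
    ("Text", "Text"): "Text",
}

def add_result_type(left, right):
    """Return the proposed result type of left + right, or raise TypeError."""
    result = ADD_RESULT[(left, right)]
    if result is ERROR:
        raise TypeError("cannot add %s and %s" % (left, right))
    return result
```

The table is symmetric, so the result does not depend on operand order; only the bytes/Text pairing is an error.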

@gvanrossum
Member Author

gvanrossum commented Aug 17, 2016

Irregularities

encode() and decode()

For bizarre reasons, in Python 2 both str and unicode support both encode() and decode(). This makes no sense, e.g. u'abc'.decode('utf8') is equivalent to u'abc'.encode('ascii').decode('utf8'), and 'abc'.encode('utf8') really means 'abc'.decode('ascii').encode('utf8').

I propose to rationalize this to the extent possible, as follows:

  • bytes only supports .decode(), and it returns unicode
  • unicode only supports .encode(), and it returns bytes (not str!)
  • str supports .encode(), returning bytes, and .decode(), returning unicode

This would mean complete removal of unicode.decode() from the stubs, since it basically always means some terrible misunderstanding happened. For variables declared as bytes, it would likewise remove the encode() method, whose use would point to a similar (but opposite) misunderstanding. For str we remain generous (since using str means the code probably hasn't received enough attention from the straddling police).
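The bytes and unicode halves of this rationalization match what Python 3 enforces at runtime. A quick check, assuming it is run under Python 3 (under Python 2 both `hasattr()` calls return True, which is the irregularity described above):

```python
text = u"abc"
data = b"abc"

# On Python 3, text has no decode() and bytes has no encode().
text_has_decode = hasattr(text, "decode")
data_has_encode = hasattr(data, "encode")

# The only sensible directions: text -> bytes -> text.
round_trip = text.encode("utf8").decode("utf8")
```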

__str__() and __repr__()

The return types of bytes.__str__() and bytes.__repr__() are still str, because that's how they are constrained by object. (FWIW these are examples of methods returning "text in spirit" strings.)

@JukkaL
Contributor

JukkaL commented Aug 17, 2016

A reason we might prefer to use a call instead of an assignment as a basis for the table is that some type checkers like to infer a new type from assignment, and thus arbitrary assignments are considered correct -- they just redefine the type of a variable.

@gvanrossum
Member Author

OK, edited the text.

@JukkaL
Contributor

JukkaL commented Aug 18, 2016

If we don't do the Text -> str promotion (which seems reasonable), then the original "gradual byting" story may need tweaking, as the first step would be to annotate with str and Text only (not just str, because unicode literals wouldn't be compatible with it), and the gradual byting migration would migrate some str types to bytes (or maybe unicode). Also, the gradual migration may involve changing some '' literals to b'' literals.

Example first phase annotation where we'd need Text:

def utf8_len(x: Text) -> int:
    return len(x.encode('utf8'))

utf8_len(u'\u1234')

Here we'd need to use Text unless we include the Text -> str promotion.

@vlasovskikh
Member

@gvanrossum @JukkaL I would like to join your discussion.

Here is a summary of the current ASCII types and gradual byting proposals as I understand them.

Rationale

(This piece is here to make sure we’re solving the same problem.)

The changes in text and binary data handling in Python 3 are one of the major reasons people cannot run their Python 2 code on Python 3. These changes were:

  • Disabled implicit str to unicode and vice versa conversions using the ASCII encoding
  • Some missing methods or methods with different semantics for the text and binary classes
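The first change is easy to observe. The sketch below assumes Python 3; under Python 2 the mixed concatenation succeeds through an implicit ASCII decode. `try_concat()` is a hypothetical helper written for this illustration.

```python
def try_concat(left, right):
    """Return left + right, or the name of the exception that was raised."""
    try:
        return left + right
    except TypeError as exc:
        return type(exc).__name__

mixed = try_concat(b"abc", u"def")      # bytes + text
same_kind = try_concat(u"abc", u"def")  # text + text
```

On Python 3, `mixed` is the string "TypeError" while `same_kind` is the concatenated text.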

The most viable approach to porting to Python 3 is via porting to Python 2+3.

Therefore we need a way to make this transition in stages: from a Python 2 program with implicit conversions, to a Python 2 program with fewer implicit conversions, to a Python 2+3 program that runs on both versions but still contains some implicit conversions, and finally to a Python 3 program.

ASCII Types

The idea is to introduce two new types, one new type compatibility rule, and special type inference rules for binary and text literals.

The new types are:

  • class typing.ASCIIBytes(bytes): ...
  • class typing.ASCIIText(typing.Text): ...

The ASCIIBytes type is compatible with ASCIIText for Python 2.

If a text or binary literal contains only ASCII characters then type checkers should infer the corresponding ASCII types instead of regular text / binary types.

Pros

  • People can be very precise about their Python 2 types and implicit conversions. This precision is much needed for forbidding implicit conversions eventually while porting from Python 2 to Python 2+3 to Python 3.

Cons

  • List[ASCIIBytes] is not compatible with List[str], which causes lots of errors according to Guido’s experiments. As a workaround, we might make ASCIIBytes compatible with any TypeVar that is constrained by ASCIIText or Text despite its variance. The cost of this workaround is more false negatives for unsafe modifications of invariant collections of strings.
  • People will have to write ASCII types for their ASCII-preserving functions (e.g. for a user-defined equivalent of str.upper()).

Gradual Byting

A proposal by Guido described here.

The idea is to distinguish the str in type hints from bytes and Text (unicode in Python 2). Then there are new types compatibility rules:

  • bytes is compatible with str
  • str is compatible with bytes
  • Text is maybe compatible with str (??? undecided yet)
  • str is compatible with Text

Questions

  • How the issue with List[bytes] being not a subtype of List[str] is resolved here? (see the similar issue in the ASCII types proposal)
  • If Text is not compatible with str then str is no longer a safe starting point in the Python 3 migration process. Won’t it confuse people?
  • What should be the type of functions like getattr() that do implicit text to bytes conversion?

What do you think of the workaround for invariant collections of ASCIIBytes?

Could you please answer my questions about the gradual byting proposal?

@gvanrossum
Member Author

Writing my answer...

Re: Rationale

Agreed, although your way of describing the changes makes it sound like Python 3 is a step back from Python 2 in this respect; I believe the opposite (PY3 is better than PY2).

The key part is that some things around strings changed and we want to provide a gradual way to convert PY2 via straddling (2+3) to (eventually) PY3.

The proposal is mostly concerned with how to write string types for the straddling case (in a way that works in PY2 and is idiomatic PY3).

Re: ASCII Types

I'm not sure you characterized the proposal the same way as I heard it (as retold by @ddfisher), but you're pretty close.

IIRC the proposal actually made ASCIIText compatible with all bytes, and ASCIIBytes compatible with all Text, in Python 2. So e.g. getattr(obj, name) requires name to be str in Python 3, but in Python 2 it accepts all bytes (as an alias for str) and ASCIIText, but throws UnicodeEncodeError for Text instances containing non-ASCII characters. "Morally" I think it's text, not bytes, since in Python 3 it only accepts text strings, and the best annotation for straddling code is name: str.

As an example that goes in the other direction, in Python 2, s.encode('utf8') works for all Text and for ASCIIBytes, but throws UnicodeDecodeError if s is a str bytes object containing non-ASCII bytes. In Python 3 only text strings have this method. Again s is "morally" text, but we have to type it differently (I'd propose s: Text).

I'd like to point out that in both these examples (taken from the behavior of builtins, there are many more like it) declaring the argument or variable as ASCIIBytes or ASCIIText is inappropriate, since while ASCII characters/bytes are accepted in either persuasion (text or bytes), the full range of one or the other base types (bytes or Text) is still accepted.

Another problem I have with the ASCII proposals is that it emphasizes literals too much. Yes, in examples we often like to write things like getattr(obj, 'xxx') and then it works out nicely that the argument is an ASCIIBytes. But in real code it's much more likely that you're computing a name from some other information and then pass it to getattr(obj, name). That computed name is much more likely to have the (inferred or explicitly declared) type str, or if it came from a unicode-aware computation it could have the type Text. In the latter case, inferring (i.e., proving) through such a computation that the value is in fact ASCIIText is usually too hard.

Re: Gradual Byting

(NOTE: The nickname "gradual byting" may actually be a bad pun, as in the end the compatibility rules are more complicated than those for Any in gradual typing.)

How the issue with List[bytes] being not a subtype of List[str] is resolved here?

(You probably meant supertype, since I propose to make bytes a supertype of str in PY2.)

It's not completely resolved, but I think it's more reasonable to ask people to distinguish between bytes and str in their annotations (and thinking!) than to start introducing the ASCII types (which have no use in PY3 code).

The main reason I find this less of a problem than the corresponding problem with List[str] vs. List[ASCIIBytes] is that we actually have bytes literals in PY2. (It's a little tricky to keep track of them in the parser, but not impossible, and I plan to do it.) So if you meant a list of bytes, use [b'xx', b'yy', b'zz']. Also, making the distinction carefully is useful going forward to pure PY3 code, while distinguishing between ASCII and non-ASCII is artificial (only PY2 cares about them).

If Text is not compatible with str then str is no longer a safe starting point in the Python 3 migration process. Won’t it confuse people?

This situation is inherently confusing, because in PY2, sometimes str+unicode works, and sometimes it doesn't. As long as you don't know, you may be better off with Any.

What should be the type of functions like getattr() that do implicit text to bytes conversion?

I think we have two choices: str or Text. Both are "morally" text but in PY2, str has preference for 8-bit characters while Text prefers Unicode characters. Since this is a C function that internally works with 8-bit characters (in PY2) I think it should be typed as str and from this I concede that making Text compatible with str may actually be the right thing to do.

So in the end maybe "gradual byting" is correct. Ideally for straddling code, you should run mypy twice, once with --py2 and once without. The recommendation would then be to strive for the following, in straddling code (or code striving to become straddling):

  • Use bytes where PY2 has strings that must be bytes in PY3
  • Use str where PY3 has strings and it's complicated in PY2
  • Use Text where PY2 uses Unicode

Mypy in --py2 mode would have to learn that bytes <-> str and str <-> Text but not bytes <-> Text, and it would have to assign the correct type based on the form of literal:

  • b'x' -> bytes
  • 'x' -> str (unless from __future__ import unicode_literals; then Text)
  • u'x' -> Text (an alias for unicode)

BTW I find from __future__ import unicode_literals an anti-pattern that does more harm than good, and I now recommend against it.
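The literal rules listed above amount to a three-way mapping with one switch. A minimal sketch of that mapping follows; `literal_type()` is a hypothetical helper, not a mypy function, and it returns type names as plain strings.

```python
def literal_type(prefix, unicode_literals=False):
    """Map a string literal prefix ('b', 'u', or '') to the proposed type name."""
    prefix = prefix.lower()
    if prefix == "b":
        return "bytes"
    if prefix == "u":
        return "Text"
    if prefix == "":
        # An unprefixed literal is str unless unicode_literals is in effect.
        return "Text" if unicode_literals else "str"
    raise ValueError("unsupported literal prefix: %r" % prefix)
```

With `unicode_literals=True` the str case disappears entirely for unprefixed literals.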

@gvanrossum gvanrossum modified the milestone: PEP 484 finalization Aug 26, 2016
@vlasovskikh
Member

The full gradual byting idea (bytes <-> str <-> Text, but not bytes <-> Text) sounds logical and easy to explain. I particularly like the following about it:

  • You can start with just using str type hints in your Python 2 code base, you'll get no false positives
  • You can gradually introduce new more precise types bytes and Text and get better type checking, while keeping str in places where it's complicated

(The type checking rules for generic types like List[str / bytes / Text] are still a bit unclear. I guess the idea is to pretend that List[str] is compatible with List[bytes] and List[Text] and vice versa.)

What I don't like is that gradual byting gives up on checking for implicit ASCII conversions. Basically we'll be able to check the two distinctly marked bytes and Text subsets of the program and won't be able to tell anything about implicit ASCII conversions since str is compatible with both bytes and Text.

But it seems that it's more important to give developers tools for gradually widening the subsets of their PY2 programs that handle strictly binary or strictly textual data in order to make them more PY2+3 compatible rather than helping to catch all the UnicodeDecodeErrors and UnicodeEncodeErrors in PY2 programs. @gvanrossum @JukkaL Do you agree with it?

I would like to experiment with the idea of gradual byting in PyCharm for the next few days to see if there are any concerns with it. I'll report about my findings later this week.

@gvanrossum
Member Author

type checking rules for generic types like List[str / bytes / Text] are still a bit unclear

My current inclination is not to do anything special about these, because List is invariant. Although if str was really analogous to Any here it would indeed work. I think experiments will have to decide whether it's needed.

it's more important to give developers tools for gradually widening the subsets of their PY2 programs that handle strictly binary or strictly textual data in order to make them more PY2+3 compatible rather than helping to catch all the UnicodeDecodeErrors and UnicodeEncodeErrors in PY2

Yes, that's the most important use case we have for mypy at Dropbox -- we want our code to become more Python 3 ready.

Mixing bytes and Text is a type error. But it's harder to argue that mixing ASCII and non-ASCII is a type error. I don't like to treat strings containing only ASCII characters as a subtype of bytes or Text, because dynamic sources of characters (other than literals in the source code) don't typically tell you whether they can ever return non-ASCII characters.

As an analogy, let's say we wanted to treat non-negative integers as a subtype of int. If we define a "type" as a set of values, this is certainly a reasonable thought, and non-negative integers are closed under addition and multiplication (just like ASCII strings are closed under concatenation and slicing). But there are few input functions that return only non-negative integers -- int() in particular can definitely return a negative int. So it's hard to enforce the non-negativity of integers being processed by a program without explicit range checks or complicated proofs that a certain algorithm preserves that property.

I feel it's similar for the ASCII-ness of bytes and Text -- a function that reads a string from a file or socket (or from e.g. os.environ) has no particular reason to believe that the file will only contain ASCII characters.

I'm looking forward to the outcome of your experiments. In the meantime I will also try to look into a more complete set of changes to typeshed and mypy.

@vlasovskikh
Member

I'm done with my experiments with the idea of gradual byting.

I've created a proof-of-concept implementation of gradual byting in PyCharm by modifying __builtin__.pyi from Typeshed and tweaking our type inference engine and type checker. Then I tried to port some real-life code from PY2 to PY2+3 using the modified IDE.

Original Gradual Byting

What I've learned from my experiment with the original gradual byting proposal is that the type checker doesn't help in porting PY2 code to PY2+3. I mean if you already use Text and bytes alongside str then yes, it helps you to some extent.

But overall you get no guidance on how to proceed with porting your code. The type checker doesn't tell you if there are any variables or functions with no type hints. It doesn't promote the use of Text and bytes instead of str. It doesn't catch most of the text/binary data compatibility errors.

Guided Gradual Byting

During my further experiments I came up with the following guided process for making PY2 text/binary data handling more PY3-like. The original gradual byting is a step in this process.

Summary of Proposed Changes

  • Gradual byting
    • Make Text <-> str <-> bytes, but Text and bytes are not compatible
    • Infer u'foo' -> Text, 'foo' -> str, b'foo' -> bytes
  • Introduce typing.NativeStr as an alias for str that says explicitly that it's a native string
  • Recommend new type checking options
    • Warn about implicit Any for declarations
    • Warn about str in type hints (use Text / bytes / NativeStr instead)
    • Strict str checks (disables gradual byting promotions, use cast() if you're sure)
  • Recommend extra type checking options for not 100%-typehinted code
    • Warn about implicit Any for expressions
    • Warn about str literals

Idea

(Note: It includes some pictures, please see the comment on the GitHub page).

Hypothesis: Most of text/binary data in PY2 programs can be converted to either Unicode data or 8-bit data in PY2+3. Native str strings (the ones that are strictly 8-bit in PY2 and Unicode in PY3, i.e. mixed in PY2+3) are the minority. Handling native strings causes many problems while porting from PY2 to PY2+3. Type checkers should make you aware of these problems and provide some help in reducing the amount of native strings.

In PY3 you have a clear separation of text and binary data: they are not compatible with each other. In PY2 things are complicated because of the implicit conversion between text and binary data using the ASCII encoding.

Figure 1

If you want to make your PY2 code PY2+3 compatible (straddling), you have to make it more PY3-like in respect of text/binary data separation. A good way to proceed is to start putting type hints into your code so that a type checker would be able to check your code for correctness.

The steps of the proposed approach to porting are described below. You may start with no or some type hints in your code. You may proceed module-by-module or modify the whole program at once.

1. Add type hints for all declarations in your code

Use the "Warn about implicit Any for declarations" type checker option to get notified of all the places where type hints are missing.

Figure 2

For text and binary data you have the following options:

  • Use Text when data is Unicode in PY2 and PY3
  • Use bytes when data is 8-bit string in PY2 and PY3
  • Use AnyStr when your code works with both Unicode and 8-bit strings in PY2 and PY3 as long as all the function arguments are of the same type (or use Union[Text, bytes] if it doesn't matter)
  • Use NativeStr when data is Unicode in PY3 and 8-bit string in PY2 (see the footnotes about NativeStr)
  • Use str otherwise (if things are complicated)

2. Remove all str entries in type hints in favor of Text/bytes/NativeStr/etc.

Use the "Warn about str in type hints" type checker option to get notified of all the remaining occurrences of str in type hints.

Figure 3

Go through all the places you marked with str as complicated and figure out which of the text/binary types is actually appropriate here.

The purpose of this step is to make the native string subset as small as possible since a) native string operations are the hardest to port; b) your code will look more PY3-like with mostly Text and bytes subsets.

3. Enable strict str checking

Use the "Strict str checks" type checker option to enable strict separation of Text, str, and bytes data in your code.

Figure 4

The remaining type checker warnings at this step show the most tricky parts of your text/binary data handling code that has to be carefully written to become PY2 and PY3 compatible. It may involve:

  • Re-categorizing your values into Text, NativeStr, and bytes again
  • Doing things differently for PY2 and PY3 inside if PY2 conditions
  • Explicitly casting types using typing.cast(<type>, <value>) if you're sure what you are doing and you need a way to make the type checker happy
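The last two items could look roughly like this in straddling code. This is a sketch under assumptions: the `PY2` flag and the `to_native()` helper are hypothetical names, and on Python 2 the `typing` module would have to come from the backport package.

```python
import sys
from typing import cast

PY2 = sys.version_info[0] == 2

def to_native(value, encoding="utf-8"):
    """Convert text or bytes to the native str type of the running interpreter."""
    if PY2:
        # Native str is 8-bit on Python 2: pass str through, encode unicode.
        if isinstance(value, bytes):
            return value
        return value.encode(encoding)
    # Native str is Unicode on Python 3: decode bytes, pass str through.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# cast() has no runtime effect; it only tells the type checker what to assume.
label = cast(str, to_native(b"caf\xc3\xa9"))
```

Keeping the version check inside one small helper limits the number of places where a cast is needed.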

Extra options

These type checker options might be helpful during PY2 to PY2+3 porting if your code is not 100% type hinted:

  • Warn about implicit Any for expressions as well
  • Warn about str literals

I found these extra options very useful when you have more modules to port or you use third-party libraries with no type hints / stubs.

Footnotes

PyCharm doesn't support the stubs from Typeshed yet, it's still a work in progress.

The idea of an option for warnings about declarations with no type hints comes from the --noImplicitAny option of the TypeScript compiler. It is used heavily in the TypeScript community for testing the TypeScript stubs of untyped JavaScript libraries.

typing.NativeStr is a new type needed to help people get rid of ambiguous str (is it really a native string or is it a marker that things are complicated?). It could be an alias to str. There should be an option to warn about any str and unicode entries in type hints.
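Until something like this exists in typing, a project could experiment with a local alias. The sketch below is an assumption about how that might look: `NativeStr` is defined locally and `header_name()` is a hypothetical example function.

```python
NativeStr = str  # hypothetical local alias; not part of the typing module

def header_name(raw):
    # type: (NativeStr) -> NativeStr
    """Normalize a header name, keeping the interpreter's native string type."""
    return raw.strip().lower()
```

Since the alias is just str at runtime, it changes nothing about behavior; it only records that the native string type was chosen deliberately.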

@vlasovskikh
Member

@gvanrossum @JukkaL I'm looking forward to your feedback.

@gvanrossum
Member Author

gvanrossum commented Sep 7, 2016 via email

@JukkaL
Contributor

JukkaL commented Sep 8, 2016

@vlasovskikh Thanks for the detailed write-up! Your approach sounds mostly reasonable. If @gvanrossum agrees, hopefully we can experiment with it and mypy.

A few things I'm not sure about:

  1. The str / NativeStr distinction

This could be useful during migration, but I'm not sure if users will find it easy to understand. An alternative would be to propose that users define a similar type alias by themselves instead of including it in typing.

  2. Strict str checking

The implications of a mode that requires casts between str and other string types for Python 2 are still unclear to me. Stubs would potentially require things like Union[str, Text] in places for things to work seamlessly, without needing seemingly redundant casts from str to Text/bytes when interacting with library modules. I'm not sure how much of a problem this would be. Also, we'd need a separate stub for the str class in this mode.

  3. AnyStr

Would AnyStr range over str, bytes and Text? Functions that use AnyStr would likely now be a little tricky to write in some cases. Consider this function:

def f(x: AnyStr) -> AnyStr:
    return x + 'a'

This would be fine in Python 2 mode but it wouldn't work in Python 3 or strict str checking mode. Here's a straightforward straddling implementation that actually wouldn't work, since given a str argument, the return type would be bytes, not str in Python 2 mode:

def f(x: AnyStr) -> AnyStr:
    if isinstance(x, Text):
        return x + u'a'
    else:
        return x + b'a'

This may have to be written like this, which seems a bit excessive but perhaps still reasonable:

def f(x: AnyStr) -> AnyStr:
    if isinstance(x, Text):
        return x + u'a'
    elif isinstance(x, str):
        return x + 'a'
    else:
        return x + b'a'

It seems that the final example would also work in the strict str checking mode.
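For anyone who wants to try the three-branch variant at runtime, here it is with its imports, using a type comment so that it also parses on Python 2. On Python 3 the first branch handles str (Text is str there) and the last one handles bytes; on Python 2 a plain str is caught by the second branch. This sketch only checks runtime behavior, not whether a given checker accepts the function in each mode.

```python
from typing import AnyStr, Text


def f(x):
    # type: (AnyStr) -> AnyStr
    if isinstance(x, Text):
        # Text: unicode on Python 2, str on Python 3.
        return x + u'a'
    elif isinstance(x, str):
        # Native str that is not Text: only reachable on Python 2.
        return x + 'a'
    else:
        # bytes on Python 3.
        return x + b'a'
```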

@remdragon

@whatisaphone I can open up my fork to you if you are able to contribute to this

@gvanrossum
Member Author

@takeda

I'm wondering if in case of format() protocols could be used, for example something like SupportsStr (where it checks for presence of __str__() etc).

I doubt it, you have to parse the {...} thingies in the format string since these may indicate that only certain types are allowed. Example: '{:.3f}'.format(3.14) works, but '{:.3f}'.format('abc') reports a ValueError.

Also, unicode defines __str__.
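The point can be checked with a few lines. The helper name format_or_error is made up for this sketch; it only wraps str.format to show that two values which both define __str__ are treated differently by the same format spec:

```python
def format_or_error(spec, value):
    # Return the formatted string, or the error text if the format spec
    # does not accept this type of value.
    try:
        return spec.format(value)
    except ValueError as exc:
        return "ValueError: " + str(exc)


# Both values define __str__, so a SupportsStr-style check would accept both.
both_have_str = hasattr(3.14, "__str__") and hasattr("abc", "__str__")

ok = format_or_error("{:.3f}", 3.14)    # '3.140'
bad = format_or_error("{:.3f}", "abc")  # starts with 'ValueError: '
```

So a protocol that only looks for __str__ cannot tell which arguments a particular format string will accept.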

@remdragon

@gvanrossum is six a prerequisite of mypy? Can I just import six.text_type as unicode?

@gvanrossum
Member Author

Yeah, if you want to write cross-version code that needs unicode in Python 2 and str in Python 3, and you don't want to use Text, six.text_type should do the trick. (Right now the two are equivalent, but if you make Text a union, six.text_type should remain an alias for unicode, which might help.)
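A minimal sketch of that approach. To keep it runnable without six installed, text_type is defined by hand here the way six.text_type behaves (unicode on Python 2, str on Python 3); in real code it would simply be imported from six. The function takes_unicode is hypothetical.

```python
import sys

if sys.version_info[0] == 2:
    text_type = unicode  # noqa: F821 (only evaluated on Python 2)
else:
    text_type = str


def takes_unicode(value):
    # type: (text_type) -> int
    # Hypothetical function that requires real text, never bytes.
    if not isinstance(value, text_type):
        raise TypeError("expected text, got %s" % type(value).__name__)
    return len(value)
```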

@remdragon

43k loc are happy, a few obscure bugs found. It's not quite 100k loc, but hopefully it's a start.

It does cause some minor bumps. IO[Text] doesn't work, but I probably shouldn't have been doing that anyway.

@remdragon

remdragon commented Nov 22, 2018

I reviewed the CI failures and I see there's a "mypy_test.py" that I need to be running, so I've started going through that and I need a little assistance, please.

In stdlib/2/_threading_local.pyi:4 I found this:

__all__: List[str]

I'm getting the following error:

stdlib\2\_threading_local.pyi:4: error: Type of __all__ must be "Sequence[unicode]", not "List[str]"

I'm thinking I need to change that line to this:

__all__: List[Text]

but that I also need to change the definition of __all__ to be Sequence[Text].

I've grepped mypy's source, but I'm having difficulty finding where __all__'s type would be defined. Would someone point me in the right direction?

Thanks

@gvanrossum
Member Author

I think you're looking for this: https://github.com/python/mypy/blob/1a9e2804cdad401a3019eabd37002f32d08fe0ec/mypy/checker.py#L291

PS: For help with the PR it's usually quicker to add a comment to the PR itself rather than to the issue to which it relates.

@whatisaphone

whatisaphone commented Nov 27, 2018

@gvanrossum

I ran the experiment you suggested, with remdragon's typeshed fork. With your suggested change to semanal.py, my test above works as expected!

error: Argument 1 to "takes_unicode" has incompatible type "str"; expected "unicode"

I'm happy to see this taking shape! 😃


@remdragon I'm happy to be added, but I can't promise I'll have a lot of time to dedicate to this unfortunately.

I was able to spend a bit of time just now. I cloned your fork and tested it on our codebase. It correctly found a few lurking bugs 😃 but also one interesting regression. Here's a minimal repro:

from typing import Any

class Class: pass

def foo(**kwargs):
    # type: (**Any) -> None
    c = Class()
    c.__dict__.update(kwargs)

I ran mypy with --python-version=2.7. The last line errors with:

error: Argument 1 to "update" of "dict" has incompatible type "Dict[str, Any]"; expected "Mapping[Union[str, unicode], Any]"

My intuition tells me that a **kwargs should be compatible with a __dict__. I poked at it in the repl, and I came to the conclusion that object.__dict__ should remain with str instead of being changed to use Text. The reason being:

Python 2.7.12 (default, Dec  4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> class X: pass
...
>>> x = X()
>>> setattr(x, u'unicode', True)
>>> x.__dict__
{'unicode': True}
>>> # ^ It's a `str` now
...
>>> setattr(x, u'\x80', True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\x80' in position 0: ordinal not in range(128)

It's similar to the encode and format issues I mentioned above. You can treat a unicode as an attribute key, but only if it's not actually unicode. The actual __dict__ won't contain unicode (unless you munge it directly, in which case you can break the rules with unicode, integers, or whatever else).
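The same observation as a script that runs on both Python 2 and Python 3. The class and function mirror the repro above; key_types is a name introduced only for this check:

```python
from typing import Any


class Class(object):
    pass


def foo(**kwargs):
    # type: (**Any) -> Class
    c = Class()
    c.__dict__.update(kwargs)
    return c


c = foo(alpha=1)
# An ASCII-only unicode attribute name is stored as a native str key.
setattr(c, u"beta", 2)
key_types = set(type(k) for k in c.__dict__)
```

Both the keyword-argument key and the key written through setattr end up as native str, which is why keeping str as the key type of __dict__ matches runtime behavior.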

I changed it back locally and our codebase typechecked with no ill effects. (Specifically, I changed stdlib/2/__builtin__.pyi:30 to __dict__ = ... # type: Dict[str, Any])

What do you think about that change?

@gvanrossum
Member Author

I've started a serious Python 3 porting effort myself recently, and I think I have to agree with those who suggest one refinement: In Python 2, make unicode be exactly that (IOW drop the promotion from str to unicode), and make Text be essentially a union of str and unicode. (Perhaps Text could be equivalent to basestring.) In Python 3 nothing would change.

I have not thought much about how we could implement this.

@rchen152
Collaborator

This (no promotion from str to unicode, Text = Union[str, unicode]) is what pytype does. For us, the implementation was as easy as changing a line in the pyi file (https://github.com/google/pytype/blob/master/pytype/pytd/builtins/2and3/typing.pytd#L22). The tricky parts were (1) fixing all the google code with now-incorrect unicode annotations and (2) dealing with the performance ramifications of unions everywhere.
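To make these Python 2 semantics concrete, here is a small sketch with hypothetical function names. The comments describe what a checker following this scheme would be expected to report in Python 2 mode; they are not output from any particular tool.

```python
from typing import Text


def takes_text(value):
    # type: (Text) -> int
    # With Text treated as Union[str, unicode] on Python 2, both 'abc'
    # and u'abc' would be accepted here.
    return len(value)


def takes_unicode(value):
    # type: (unicode) -> int
    # Without the str-to-unicode promotion, a plain 'abc' argument would
    # be flagged on Python 2; only u'abc' would be accepted.
    return len(value)


takes_text("abc")
takes_text(u"abc")
takes_unicode(u"abc")
# takes_unicode("abc")  # expected to be reported in Python 2 mode
```

On Python 3 nothing changes, since str, unicode literals and Text all denote the same type there.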

@srittau
Collaborator

srittau commented Dec 27, 2018

typeshed also already recommends using Text instead of unicode if unicode and str are accepted. It's quite likely that there is legacy code annotated with unicode, though. I suppose that this will be easily greppable.

@gvanrossum
Member Author

I still have to investigate this, but the problem with typeshed is likely that we don’t have an easy standard way to determine whether a specific function needs unicode or Text.

@whatisaphone

whatisaphone commented Jan 7, 2019

I think remdragon's fork was the right approach, and it's too bad it went stale so quickly. When it was fresh I tried it (combined with the one-liner mypy tweak) on our codebase and there was only the one minor regression (easily fixable, and I wrote way too many words about it, whoops!) I think that's the way forward.

As I understand it, typeshed is maintained by hand, so I don't know that there's any way around a big "Textification" commit that is also written by hand. In cases where it's unclear, fall back to AnyStr and mypy should keep accepting the things that it accepts today.

I propose we land remdragon's fork (after a fix and rebase) first—which should cause no change in behavior. Then make the changes in mypy to drop the str promotion (presumably behind a config flag), and include the new typeshed.

Did I miss anything? What do you think about that approach?

EDIT: Changed Text to AnyStr

@srittau
Collaborator

srittau commented Apr 25, 2020

Is this issue still relevant? I think there is a consensus on how to handle this and it has been implemented by all type checkers supporting Python 2.

@gvanrossum
Member Author

I don't think anything is going to change, especially now Python 2 has reached its end of life (and acknowledging that many projects and companies are still working on porting their legacy Python 2 code to Python 3).

I'm not sure what is the consensus on how to handle it? I believe we changed some language in PEP 484 about not promoting str to unicode but I don't believe mypy made a change to match (and the PEP was intentionally not requiring it, just allowing it).

@jstasiak
Contributor

FWIW I migrated a reasonably sized codebase from Python 2.7 to 3.5 in late 2016, and I would claim that what typing and mypy provided back then was already good enough: as far as I remember, once mypy was satisfied I had no runtime errors related to str/bytes/unicode/text handling.

@srittau
Collaborator

srittau commented Apr 25, 2020

The consensus from what I understand (which might be wrong) is:

  • PEP 484: Text is an alias for str (py3) or unicode (py2)
  • PEP 484: Text in argument types accepts both str and unicode (py2)
  • unicode can't be assigned to str, whether the reverse holds true is up to type checkers.

That said I have no particular stake in this discussion, since I'm fortunate enough not to work with Python 2 anymore.

@JukkaL
Contributor

JukkaL commented Apr 27, 2020

Mypy treats Text as an alias to str (Python 3) or unicode (Python 2). In Python 2, str is considered as a subtype of unicode (but not vice versa). Mypy is not going to change this, since Python 2 has reached end of life, and because any changes would likely be very disruptive.

I don't think that it's important to change anything or to reach consensus any more. I'm fine with closing this issue.

@gvanrossum
Member Author

Sounds good. Closing.

pmhahn added a commit to univention/typeshed that referenced this issue Jun 26, 2020
pmhahn added a commit to univention/typeshed that referenced this issue Jun 30, 2020
pmhahn added a commit to univention/typeshed that referenced this issue Jul 21, 2020
<https://github.com/pyenchant/pyenchant>
<http://pyenchant.github.io/pyenchant/api/index.html>

- [ ] API-2 accepts Union[Text, bytes], see <python/typing#208>
- [ ] API-3 provides new functions and is Pyhton-3-only
pmhahn added a commit to univention/typeshed that referenced this issue Nov 16, 2020
<https://github.com/pyenchant/pyenchant>
<http://pyenchant.github.io/pyenchant/api/index.html>

- [ ] API-2 accepts Union[Text, bytes], see <python/typing#208>
- [ ] API-3 provides new functions and is Pyhton-3-only