Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-36502: Correct documentation of `str.isspace`. #15019

Merged
merged 4 commits into from Aug 14, 2019

Conversation

@gnprice
Copy link
Contributor

commented Jul 30, 2019

The documented definition was much broader than the real one:
there are tons of characters with general category "Other",
and we don't (and shouldn't) treat most of them as whitespace.

Rewrite the definition to agree with the comment on
_PyUnicode_IsWhitespace, and with the logic in makeunicodedata.py,
which is what generates that function and so ultimately governs.

Add suitable breadcrumbs so that a reader who wants to pin down
exactly what this definition means (what's a "bidirectional class"
of "B"?) can do so. The unicodedata module documentation is an
appropriate central place for our references to Unicode's own copious
documentation, so point there.

Also add to the isspace test a thorough check that the
implementation agrees with the intended definition.

https://bugs.python.org/issue36502

Correct documentation of `str.isspace`.
The documented definition was much broader than the real one:
there are tons of characters with general category "Other",
and we don't (and shouldn't) treat most of them as whitespace.

Rewrite the definition to agree with the comment on
_PyUnicode_IsWhitespace, and with the logic in makeunicodedata.py,
which is what generates that function and so ultimately governs.

Add suitable breadcrumbs so that a reader who wants to pin down
exactly what this definition means (what's a "bidirectional class"
of "B"?) can do so.  The `unicodedata` module documentation is an
appropriate central place for our references to Unicode's own copious
documentation, so point there.

Also add to the `isspace` test a thorough check that the
implementation agrees with the intended definition.
@the-knights-who-say-ni

This comment has been minimized.

Copy link

commented Jul 30, 2019

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Unfortunately we couldn't find an account corresponding to your GitHub username on bugs.python.org (b.p.o) to verify you have signed the CLA (this might be simply due to a missing "GitHub Name" entry in your b.p.o account settings). This is necessary for legal reasons before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

You can check yourself to see if the CLA has been received.

Thanks again for your contribution, we look forward to reviewing it!

@tirkarthi

This comment has been minimized.

Copy link
Contributor

commented Jul 30, 2019

I think this would require an issue in the tracker. There are several issues regarding definition of whitespace and usage. Some isspace related issues :

@mangrisano

This comment has been minimized.

Copy link
Contributor

commented Jul 30, 2019

@gnprice

This comment has been minimized.

Copy link
Contributor Author

commented Jul 30, 2019

There are several issues regarding definition of whitespace and usage. Some isspace related issues :

Ah, thanks for those links!

https://bugs.python.org/issue36502

Indeed this looks like exactly the documentation bug this patch fixes.

https://bugs.python.org/issue18236

This is a proposal to change the behavior of isspace to use the Unicode White_Space property, which didn't yet exist when Python first gained Unicode support. That sounds like very much the right thing to me.

That will represent a change in behavior; I'm not sure what that means for releasing such a change. It's also significant work (as we don't currently parse that part of the Unicode database at all), which perhaps partly explains why it hasn't been done in the 6 years since it was proposed.

For any release before the change to use White_Space gets done, I think a doc fix like this one would be an improvement. The "Other" category is very far from either the actual or desired behavior.

@@ -617,7 +618,14 @@ def test_isspace(self):
self.checkequalnofix(True, '\u2000', 'isspace')
self.checkequalnofix(True, '\u200a', 'isspace')
self.checkequalnofix(False, '\u2014', 'isspace')
# apparently there are no non-BMP spaces chars in Unicode 6
for i in range(0x10000):

This comment has been minimized.

Copy link
@vstinner

vstinner Jul 30, 2019

Member

I would prefer to have a constant declared at the top level of the module. Why not testing the full range of the UCS? range(0x110000)

This comment has been minimized.

Copy link
@gnprice

gnprice Jul 30, 2019

Author Contributor

I would prefer to have a constant declared at the top level of the module.

Ah, good idea. I think in fact perhaps I'll make the range itself a constant, and then be able to write:

for codepoint in BMP_CODEPOINTS:
  char = chr(codepoint)

because that seems to get at the meaning very well.

(Or perhaps for char in bmp_chars():. OTOH I just spent a few minutes trying to see a clean way to make it for char in BMP_CHARS:, and couldn't find one.)

This comment has been minimized.

Copy link
@gnprice

gnprice Jul 30, 2019

Author Contributor

Why not testing the full range of the UCS? range(0x110000)

Well, the most causal answer to "why" 🙂 is that I first added this as a test in test_unicodedata.py before seeing the existing test_isspace here -- and two of the central tests there (running through the whole database and computing overall checksums) have a loop much like this one, so I followed their lead.

Those two tests (test_method_checksum and test_function_checksum) look to be substantially unchanged since they were first added in 6a20ee7 back in 2000. They probably should be updated to go through the full range of codepoints; I think back in 2000 this was the full range of codepoints.

(Hmm... as I look closer at what test_method_checksum in particular does, it's really all about methods on str, and doesn't actually mention unicodedata. Perhaps it should move to test_unicode.py.)

The one real cost of looping through all planes (vs. just the BMP) is test runtime -- the test takes about 50ms for just the BMP (on my fast desktop) but 900ms for everything. Fast test suites are precious, so in test_isspace it seems a bit of a shame to spend almost a second looping through values where nothing at all is expected to happen. Well worth it for the loops in test_unicodedata, though, where there's plenty happening the whole way along.

This comment has been minimized.

Copy link
@gnprice

gnprice Jul 31, 2019

Author Contributor

I would prefer to have a constant declared at the top level of the module. Why not testing the full range of the UCS? range(0x110000)

Just pushed an update making both of these changes. Thanks for the review!

I was about to edit the existing two loops in test_unicodedata to run the full range too, but then realized that that requires updating the checksum. I feel like that should be a separate change and not mixed in with this one. So I'll keep that around as a commit on a branch on top of this one; I can send it right after this PR is complete, or also happy to squash it in here if you think that'd be preferable.

@@ -102,6 +106,14 @@ def test_function_checksum(self):
result = h.hexdigest()
self.assertEqual(result, self.expectedchecksum)

def test_isspace_invariant(self):

This comment has been minimized.

Copy link
@vstinner

vstinner Jul 31, 2019

Member

It seems like you are testing str.isspace() rather than the unicodedata module here. I suggest to move this test to test_unicode.py instead, after test_isspace().

You wrote that the test takes 900 ms: that's slow. I suggest to decorate the test with:

    @support.requires_resource('cpu')

So it will be skipped on slow CIs but only run on fast CIs.

This comment has been minimized.

Copy link
@gnprice

gnprice Aug 1, 2019

Author Contributor

It seems like you are testing str.isspace() rather than the unicodedata module here. I suggest to move this test to test_unicode.py instead, after test_isspace().

Sure; I've waffled on that myself. Will move it back to there.

(Note that the existing test at the top of test_unicodedata.py similarly doesn't mention the unicodedata module. Perhaps it'd be good to move it to test_unicode.py too, though I think not as part of this PR.)

You wrote that the test takes 900 ms: that's slow. I suggest to decorate the test [...]

Ah, thanks! Will do.

@@ -14,6 +14,10 @@
encoding = 'utf-8'
errors = 'surrogatepass'

def all_chars():
'''Each Unicode codepoint, as a one-character string.'''
for codepoint in range(0x110000):

This comment has been minimized.

Copy link
@vstinner

vstinner Jul 31, 2019

Member

Hum, I only expected a constant, not a whole function. I'm not sure that such function is needed, since it's only called once. I suggest to remove the function, and use sys.maxunicode rather than 0x110000. So the test will be updated automatically if sys.maxunicode is increased.

This comment has been minimized.

Copy link
@gnprice

gnprice Aug 1, 2019

Author Contributor

I made it a generator because I think that makes the abstraction a bit crisper: the test just says for char in all_chars(), and the meaning is it iterates through all possible characters. But open-coding the loop works too; I can change it.

So the test will be updated automatically if sys.maxunicode is increased.

I doubt very much that sys.maxunicode will ever be changed in the future. 😉 Expanding Unicode beyond the original 16 bits was traumatic enough -- it's still a widespread cause of bugs, decades later -- and when they did, they gave themselves quite a lot of safety margin.

If you'd find the code clearer to read that way, though, that'd certainly be a good reason.

FWIW I find the hex constant somewhat clearer. One reason is that I don't have to wonder about the fencepost -- note that the correct version would be range(sys.maxunicode + 1).

As one form of data, a quick grep finds the existing usage in the codebase leans toward the explicit hex:

$ git grep maxunicode | wc -l
31

$ git grep -ie 0x10ffff -e 0x110000 | wc -l
100

This comment has been minimized.

Copy link
@vstinner

vstinner Aug 1, 2019

Member

I doubt very much that sys.maxunicode will ever be changed in the future.

It's not just about being future-proof: a named constant also helps to better self-explain the code, it helps the readability. I always wanted to replace hardcoded constants with named constants: I just did it in PR #15067.

This comment has been minimized.

Copy link
@gnprice

gnprice Aug 1, 2019

Author Contributor

Sounds good. In this case, as I said I find the explicit literal somewhat clearer; but I definitely think that's the important criterion here, and I don't mind changing it.

@gnprice

This comment has been minimized.

Copy link
Contributor Author

commented Aug 1, 2019

@vstinner I just pushed another update:

* 578335f44 Move back to test_unicode; open-code loop; mark as uses-CPU.

Thanks again for the review!

I wrote the explicit range(0x110000) rather than range(sys.maxunicode + 1), for the reasons mentioned above. Happy to change it if you'd prefer, though.

@vstinner
Copy link
Member

left a comment

LGTM apart the hardcoded 0x110000 constant (so I don't approve the change yet, sorry!).

for ch in ['\U00010401', '\U00010427', '\U00010429', '\U0001044E',
'\U0001F40D', '\U0001F46F']:
self.assertFalse(ch.isspace(), '{!a} is not space.'.format(ch))

@support.requires_resource('cpu')
def test_isspace_invariant(self):
for codepoint in range(0x110000):

This comment has been minimized.

Copy link
@vstinner

vstinner Aug 1, 2019

Member

I would prefer to use sys.maxunicode + 1 here.

This comment has been minimized.

Copy link
@gnprice

gnprice Aug 1, 2019

Author Contributor

OK; done!

@gnprice gnprice changed the title Correct documentation of `str.isspace`. bpo-36502: Correct documentation of `str.isspace`. Aug 3, 2019

@gnprice

This comment has been minimized.

Copy link
Contributor Author

commented Aug 14, 2019

@vstinner Ping -- I believe this will be quick for you to act on, because I made the small change you requested in the last round.

I see you're back online this week, so replying just to make sure this returns to your inbox 🙂

@vstinner vstinner removed the skip issue label Aug 14, 2019

@vstinner vstinner merged commit 6bccbe7 into python:master Aug 14, 2019

5 checks passed

Azure Pipelines PR #20190801.28 succeeded
Details
bedevere/issue-number Issue number 36502 found
Details
bedevere/news "skip news" label found
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@miss-islington

This comment has been minimized.

Copy link

commented Aug 14, 2019

Thanks @gnprice for the PR, and @vstinner for merging it 🌮🎉.. I'm working now to backport this PR to: 3.8.
🐍🍒🤖 I'm not a witch! I'm not a witch!

@miss-islington

This comment has been minimized.

Copy link

commented Aug 14, 2019

I'm having trouble backporting to 3.8. Reason: 'Error 110 while writing to socket. Connection timed out.'. Please retry by removing and re-adding the needs backport to 3.8 label.

@miss-islington

This comment has been minimized.

Copy link

commented Aug 14, 2019

Thanks @gnprice for the PR, and @vstinner for merging it 🌮🎉.. I'm working now to backport this PR to: 3.8.
🐍🍒🤖

@miss-islington

This comment has been minimized.

Copy link

commented Aug 14, 2019

Sorry @gnprice and @vstinner, I had trouble checking out the 3.8 backport branch.
Please backport using cherry_picker on command line.
cherry_picker 6bccbe7dfb998af862a183f2c36f0d4603af2c29 3.8

@vstinner

This comment has been minimized.

Copy link
Member

commented Aug 14, 2019

@gnprice: The bot failed to automatically backport the change to Python 3.8. Can you please try to create a PR for Python 3.8?

@gnprice gnprice deleted the gnprice:isspace-doc branch Aug 15, 2019

gnprice added a commit to gnprice/cpython that referenced this pull request Aug 15, 2019

bpo-36502: Correct documentation of str.isspace() (pythonGH-15019)
The documented definition was much broader than the real one:
there are tons of characters with general category "Other",
and we don't (and shouldn't) treat most of them as whitespace.

Rewrite the definition to agree with the comment on
_PyUnicode_IsWhitespace, and with the logic in makeunicodedata.py,
which is what generates that function and so ultimately governs.

Add suitable breadcrumbs so that a reader who wants to pin down
exactly what this definition means (what's a "bidirectional class"
of "B"?) can do so.  The `unicodedata` module documentation is an
appropriate central place for our references to Unicode's own copious
documentation, so point there.

Also add to the isspace() test a thorough check that the
implementation agrees with the intended definition.
@bedevere-bot

This comment has been minimized.

Copy link

commented Aug 15, 2019

GH-15296 is a backport of this pull request to the 3.8 branch.

@gnprice

This comment has been minimized.

Copy link
Contributor Author

commented Aug 15, 2019

Thanks @vstinner for the reviews and the merge!

Can you please try to create a PR for Python 3.8?

Done. Turned out to be an easy merge conflict in the imports block.

vstinner added a commit that referenced this pull request Aug 19, 2019

bpo-36502: Correct documentation of str.isspace() (GH-15019) (GH-15296)
The documented definition was much broader than the real one:
there are tons of characters with general category "Other",
and we don't (and shouldn't) treat most of them as whitespace.

Rewrite the definition to agree with the comment on
_PyUnicode_IsWhitespace, and with the logic in makeunicodedata.py,
which is what generates that function and so ultimately governs.

Add suitable breadcrumbs so that a reader who wants to pin down
exactly what this definition means (what's a "bidirectional class"
of "B"?) can do so.  The `unicodedata` module documentation is an
appropriate central place for our references to Unicode's own copious
documentation, so point there.

Also add to the isspace() test a thorough check that the
implementation agrees with the intended definition.

miss-islington added a commit to miss-islington/cpython that referenced this pull request Aug 19, 2019

bpo-36502: Correct documentation of str.isspace() (pythonGH-15019) (p…
…ythonGH-15296)

The documented definition was much broader than the real one:
there are tons of characters with general category "Other",
and we don't (and shouldn't) treat most of them as whitespace.

Rewrite the definition to agree with the comment on
_PyUnicode_IsWhitespace, and with the logic in makeunicodedata.py,
which is what generates that function and so ultimately governs.

Add suitable breadcrumbs so that a reader who wants to pin down
exactly what this definition means (what's a "bidirectional class"
of "B"?) can do so.  The `unicodedata` module documentation is an
appropriate central place for our references to Unicode's own copious
documentation, so point there.

Also add to the isspace() test a thorough check that the
implementation agrees with the intended definition.
(cherry picked from commit 8c1c426)

Co-authored-by: Greg Price <gnprice@gmail.com>

miss-islington added a commit that referenced this pull request Aug 19, 2019

bpo-36502: Correct documentation of str.isspace() (GH-15019) (GH-15296)
The documented definition was much broader than the real one:
there are tons of characters with general category "Other",
and we don't (and shouldn't) treat most of them as whitespace.

Rewrite the definition to agree with the comment on
_PyUnicode_IsWhitespace, and with the logic in makeunicodedata.py,
which is what generates that function and so ultimately governs.

Add suitable breadcrumbs so that a reader who wants to pin down
exactly what this definition means (what's a "bidirectional class"
of "B"?) can do so.  The `unicodedata` module documentation is an
appropriate central place for our references to Unicode's own copious
documentation, so point there.

Also add to the isspace() test a thorough check that the
implementation agrees with the intended definition.
(cherry picked from commit 8c1c426)

Co-authored-by: Greg Price <gnprice@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
7 participants
You can’t perform that action at this time.