gh-135676: Add a summary of source characters #138194
Conversation
The lexical analyzer determines the program text's :ref:`encoding <encodings>`
(UTF-8 by default), and decodes the text into
:ref:`source characters <lexical-source-character>`.
If the text cannot be decoded, a :exc:`SyntaxError` is raised.
Can this also not raise a Unicode*Error? For example, running a file with a surrogate:

UnicodeEncodeError: 'utf-8' codec can't encode characters in position 1-2: surrogates not allowed
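A minimal repro sketch of the behavior being described, assuming the surrogate gets into the source as undecodable bytes (the temp-file setup is just for illustration):

```python
# Sketch: write a .py file whose bytes are not valid UTF-8 (the UTF-8-style
# encoding of the lone surrogate U+D800), run it, and look at which
# exception the interpreter reports on stderr.
import subprocess
import sys
import tempfile

with tempfile.NamedTemporaryFile("wb", suffix=".py", delete=False) as f:
    f.write(b"x = '\xed\xa0\x80'\n")
    path = f.name

result = subprocess.run([sys.executable, path], capture_output=True, text=True)
print(result.stderr)
```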
* formfeed
* * :ref:`Whitespace <whitespace>`
* * * CR, LF
Should we also list CRLF, as we list in the formal grammar:

newline: <ASCII LF> | <ASCII CR> <ASCII LF> | <ASCII CR>
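As a side note, all three sequences are already accepted as line terminators by the compiler; a quick check (just a sketch):

```python
# Sketch: LF, CRLF, and a lone CR should each terminate a logical line,
# matching the `newline` production quoted above.
for ending in ("\n", "\r\n", "\r"):
    source = "a = 1" + ending + "b = a + 1" + ending
    namespace = {}
    exec(compile(source, "<newline-check>", "exec"), namespace)
    print(repr(ending), namespace["a"], namespace["b"])
```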
.. (the following uses zero-width-joiner characters to render
.. a literal backquote)

backquote (`````)
Could a substitution be used here? See example for nbsp in math.rst.
The lexical analysis docs have notes like this at the end:
The period can also occur in floating-point and imaginary literals.
The following printing ASCII characters have special meaning as part of other tokens or are otherwise significant to the lexical analyzer:
' " # \
The following printing ASCII characters are not used in Python. Their occurrence outside string literals and comments is an unconditional error:
$ ? `
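That last note is easy to spot-check (a throwaway sketch, not part of this change):

```python
# Sketch: $, ?, and ` outside string literals and comments are rejected
# with a SyntaxError.
for ch in "$?`":
    try:
        compile(f"x {ch} y", "<char-check>", "exec")
    except SyntaxError as exc:
        print(f"{ch!r}: {exc.msg}")
```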
The intent behind these notes seems to be to provide a "map" of what all the ASCII characters do in Python, but as it stands that map is incomplete and isn't really kept up to date.
This instead provides a summary of source characters -- nominally the ones that start tokens, with notes for other notable cases.
The table can also serve as an alternate "table of contents".
The presentation -- a table of bulleted lists -- is a bit wacky but I think it gets the job done.
📚 Documentation preview 📚: https://cpython-previews--138194.org.readthedocs.build/