Doc/reference/lexical_analysis.rst (76 changes: 71 additions & 5 deletions)

@@ -10,12 +10,76 @@ Lexical analysis
A Python program is read by a *parser*. Input to the parser is a stream of
:term:`tokens <token>`, generated by the *lexical analyzer* (also known as
the *tokenizer*).
-This chapter describes how the lexical analyzer breaks a file into tokens.
+This chapter describes how the lexical analyzer produces these tokens.

-Python reads program text as Unicode code points; the encoding of a source file
-can be given by an encoding declaration and defaults to UTF-8, see :pep:`3120`
-for details. If the source file cannot be decoded, a :exc:`SyntaxError` is
-raised.
+The lexical analyzer determines the program text's :ref:`encoding <encodings>`
+(UTF-8 by default), and decodes the text into
+:ref:`source characters <lexical-source-character>`.
+If the text cannot be decoded, a :exc:`SyntaxError` is raised.
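
A rough sketch of this step using the standard :mod:`tokenize` module, which
implements the same encoding-detection rules (the sample byte strings below
are made up for illustration):

.. code-block:: python

   import io
   import tokenize

   # An explicit encoding declaration is honored...
   src = b"# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n"
   encoding, _ = tokenize.detect_encoding(io.BytesIO(src).readline)
   print(encoding)  # iso-8859-1

   # ...and UTF-8 is the default when no declaration is present.
   encoding, _ = tokenize.detect_encoding(io.BytesIO(b"x = 1\n").readline)
   print(encoding)  # utf-8

   # Text that cannot be decoded is rejected with SyntaxError.
   try:
       tokenize.detect_encoding(io.BytesIO(b"s = '\xff\xfe'\n").readline)
   except SyntaxError as exc:
       print(exc)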

Next, the lexical analyzer uses the source characters to generate a stream of tokens.
The type of a generated token, as well as other special behavior of the
analyzer, generally depends on the first source character that has not yet
been processed.
The following table gives a quick summary of these source characters, with
links to sections that contain more information; a short example of the
resulting token stream follows the table.

.. list-table::
   :header-rows: 1

   * - Character
     - Next token (or other relevant documentation)

   * - * space
       * tab
       * formfeed
     - * :ref:`Whitespace <whitespace>`

   * - * CR, LF
     - * :ref:`New line <line-structure>`
       * :ref:`Indentation <indentation>`

   * - * backslash (``\``)
     - * :ref:`Explicit line joining <explicit-joining>`
       * (Also significant in :ref:`string escape sequences <escape-sequences>`)

   * - * hash (``#``)
     - * :ref:`Comment <comments>`

   * - * quote (``'``, ``"``)
     - * :ref:`String literal <strings>`

   * - * ASCII letter (``a``-``z``, ``A``-``Z``)
       * non-ASCII character
     - * :ref:`Name <identifiers>`
       * Prefixed :ref:`string or bytes literal <strings>`

   * - * underscore (``_``)
     - * :ref:`Name <identifiers>`
       * (Can also be part of :ref:`numeric literals <numbers>`)

   * - * number (``0``-``9``)
     - * :ref:`Numeric literal <numbers>`

   * - * dot (``.``)
     - * :ref:`Numeric literal <numbers>`
       * :ref:`Operator <operators>`

   * - * question mark (``?``)
       * dollar (``$``)
       *
         .. (the following uses zero-width space characters to render
            a literal backquote)

         backquote (``​`​``)
       * control character
     - * Error (outside string literals and comments)

   * - * other printing character
     - * :ref:`Operator or delimiter <operators>`

   * - * end of file
     - * :ref:`End marker <endmarker-token>`
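
For illustration, the standard :mod:`tokenize` module (whose behavior closely
mirrors, but does not exactly match, the compiler's own tokenizer) can show
the token stream produced for a small piece of source; the sample code here
is arbitrary:

.. code-block:: python

   import io
   import tokenize

   source = "x = 42 + y  # a comment\n"
   for tok in tokenize.generate_tokens(io.StringIO(source).readline):
       print(tokenize.tok_name[tok.type], repr(tok.string))

   # NAME 'x'        -- a letter starts a name
   # OP '='          -- other printing character: operator/delimiter
   # NUMBER '42'     -- a digit starts a numeric literal
   # OP '+'
   # NAME 'y'
   # COMMENT '# a comment'
   # NEWLINE '\n'
   # ENDMARKER ''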


.. _line-structure:
@@ -120,6 +184,8 @@ If an encoding is declared, the encoding name must be recognized by Python
encoding is used for all lexical analysis, including string literals, comments
and identifiers.
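
A quick check of the "must be recognized" rule, again via the standard
:mod:`tokenize` module (the bogus encoding name below is made up):

.. code-block:: python

   import io
   import tokenize

   src = b"# -*- coding: no-such-codec -*-\nx = 1\n"
   try:
       tokenize.detect_encoding(io.BytesIO(src).readline)
   except SyntaxError as exc:
       print(exc)  # unknown encoding: no-such-codec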

.. _lexical-source-character:

All lexical analysis, including string literals, comments
and identifiers, works on Unicode text decoded using the source encoding.
Any Unicode code point, except the NUL control character, can appear in
Python source.
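
A short demonstration of both halves of that statement; whether NUL in a
source string is reported as :exc:`ValueError` or :exc:`SyntaxError` varies
across Python versions, so the sketch below catches both:

.. code-block:: python

   # Non-ASCII code points are fine once the text is decoded:
   π = 3.14159  # Greek letters are valid in identifiers
   print(π)

   # ...but NUL is rejected:
   try:
       compile("x = 1\x00", "<example>", "exec")
   except (SyntaxError, ValueError) as exc:
       print(exc)  # e.g. "source code string cannot contain null bytes"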