From 05252cc6892e7aed7364b6b4bae8dc2f7aa78f9f Mon Sep 17 00:00:00 2001 From: Adam Turner <9087854+aa-turner@users.noreply.github.com> Date: Sun, 6 Aug 2023 03:59:48 +0100 Subject: [PATCH 1/3] GH-107678: Update ``library/re.rst`` for Unicode --- Doc/library/re.rst | 229 +++++++++++++++++++++++++++------------------ 1 file changed, 139 insertions(+), 90 deletions(-) diff --git a/Doc/library/re.rst b/Doc/library/re.rst index 3f03f0341d8166..6c4eca2480dda9 100644 --- a/Doc/library/re.rst +++ b/Doc/library/re.rst @@ -17,7 +17,7 @@ those found in Perl. Both patterns and strings to be searched can be Unicode strings (:class:`str`) as well as 8-bit strings (:class:`bytes`). However, Unicode strings and 8-bit strings cannot be mixed: -that is, you cannot match a Unicode string with a byte pattern or +that is, you cannot match a Unicode string with a bytes pattern or vice-versa; similarly, when asking for a substitution, the replacement string must be of the same type as both the pattern and the search string. @@ -257,8 +257,7 @@ The special characters are: .. index:: single: \ (backslash); in regular expressions * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted - inside a set, although the characters they match depends on whether - :const:`ASCII` or :const:`LOCALE` mode is in force. + inside a set, although the characters they match depends on the flags_ used. .. index:: single: ^ (caret); in regular expressions @@ -326,18 +325,24 @@ The special characters are: currently supported extensions. ``(?aiLmsux)`` - (One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``, - ``'s'``, ``'u'``, ``'x'``.) The group matches the empty string; the - letters set the corresponding flags: :const:`re.A` (ASCII-only matching), - :const:`re.I` (ignore case), :const:`re.L` (locale dependent), - :const:`re.M` (multi-line), :const:`re.S` (dot matches all), - :const:`re.U` (Unicode matching), and :const:`re.X` (verbose), - for the entire regular expression. + (One or more letters from the set + ``'a'``, ``'i'``, ``'L'``, ``'m'``, ``'s'``, ``'u'``, ``'x'``.) + The group matches the empty string; + the letters set the corresponding flags for the entire regular expression: + + * :const:`re.A` (ASCII-only matching), + * :const:`re.I` (ignore case), + * :const:`re.L` (locale dependent), + * :const:`re.M` (multi-line), + * :const:`re.S` (dot matches all), + * :const:`re.U` (Unicode matching), + * :const:`re.X` (verbose). + (The flags are described in :ref:`contents-of-module-re`.) This is useful if you wish to include the flags as part of the regular expression, instead of passing a *flag* argument to the - :func:`re.compile` function. Flags should be used first in the - expression string. + :func:`re.compile` function. + Flags should be used first in the expression string. .. versionchanged:: 3.11 This construction can only be used at the start of the expression. @@ -351,14 +356,20 @@ The special characters are: pattern. ``(?aiLmsux-imsx:...)`` - (Zero or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``, - ``'s'``, ``'u'``, ``'x'``, optionally followed by ``'-'`` followed by + (Zero or more letters from the set + ``'a'``, ``'i'``, ``'L'``, ``'m'``, ``'s'``, ``'u'``, ``'x'``, + optionally followed by ``'-'`` followed by one or more letters from the ``'i'``, ``'m'``, ``'s'``, ``'x'``.) - The letters set or remove the corresponding flags: - :const:`re.A` (ASCII-only matching), :const:`re.I` (ignore case), - :const:`re.L` (locale dependent), :const:`re.M` (multi-line), - :const:`re.S` (dot matches all), :const:`re.U` (Unicode matching), - and :const:`re.X` (verbose), for the part of the expression. + The letters set or remove the corresponding flags for the part of the expression: + + * :const:`re.A` (ASCII-only matching) + * :const:`re.I` (ignore case), + * :const:`re.L` (locale dependent) + * :const:`re.M` (multi-line), + * :const:`re.S` (dot matches all) + * :const:`re.U` (Unicode matching), + * :const:`re.X` (verbose) + (The flags are described in :ref:`contents-of-module-re`.) The letters ``'a'``, ``'L'`` and ``'u'`` are mutually exclusive when used @@ -366,7 +377,7 @@ The special characters are: when one of them appears in an inline group, it overrides the matching mode in the enclosing group. In Unicode patterns ``(?a:...)`` switches to ASCII-only matching, and ``(?u:...)`` switches to Unicode matching - (default). In byte pattern ``(?L:...)`` switches to locale depending + (default). In bytes patterns ``(?L:...)`` switches to locale depending matching, and ``(?a:...)`` switches to ASCII-only matching (default). This override is only in effect for the narrow inline group, and the original matching mode is restored outside of the group. @@ -529,47 +540,60 @@ character ``'$'``. ``\b`` Matches the empty string, but only at the beginning or end of a word. - A word is defined as a sequence of word characters. Note that formally, - ``\b`` is defined as the boundary between a ``\w`` and a ``\W`` character - (or vice versa), or between ``\w`` and the beginning/end of the string. - This means that ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``, - ``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``. - - By default Unicode alphanumerics are the ones used in Unicode patterns, but - this can be changed by using the :const:`ASCII` flag. Word boundaries are - determined by the current locale if the :const:`LOCALE` flag is used. - Inside a character range, ``\b`` represents the backspace character, for - compatibility with Python's string literals. + A word is defined as a sequence of word characters. + Note that formally, ``\b`` is defined as the boundary + between a ``\w`` and a ``\W`` character (or vice versa), + or between ``\w`` and the beginning or end of the string. + This means that ``r'\bat\b'`` matches ``'at'``, ``'at.'``, ``'(at)'``, + and ``'as at ay'`` but not ``'attempt'`` or ``'atlas'``. + + The default word characters in Unicode (str) patterns + are Unicode alphanumerics and the underscore, + but this can be changed by using the :py:const:`~re.ASCII` flag. + Word boundaries are determined by the current locale + if the :py:const:`~re.LOCALE` flag is used. + + .. note:: + + Inside a character range, ``\b`` represents the backspace character, + for compatibility with Python's string literals. .. index:: single: \B; in regular expressions ``\B`` - Matches the empty string, but only when it is *not* at the beginning or end - of a word. This means that ``r'py\B'`` matches ``'python'``, ``'py3'``, - ``'py2'``, but not ``'py'``, ``'py.'``, or ``'py!'``. - ``\B`` is just the opposite of ``\b``, so word characters in Unicode - patterns are Unicode alphanumerics or the underscore, although this can - be changed by using the :const:`ASCII` flag. Word boundaries are - determined by the current locale if the :const:`LOCALE` flag is used. + Matches the empty string, + but only when it is *not* at the beginning or end of a word. + This means that ``r'at\B'`` matches ``'athens'``, ``'atom'``, + ``'attorney'``, but not ``'at'``, ``'at.'``, or ``'at!'``. + ``\B`` is just the opposite of ``\b``, + so word characters in Unicode (str) patterns + are Unicode alphanumerics or the underscore, + although this can be changed by using the :py:const:`~re.ASCII` flag. + Word boundaries are determined by the current locale + if the :py:const:`~re.LOCALE` flag is used. .. index:: single: \d; in regular expressions ``\d`` For Unicode (str) patterns: - Matches any Unicode decimal digit (that is, any character in - Unicode character category [Nd]). This includes ``[0-9]``, and - also many other digit characters. If the :const:`ASCII` flag is - used only ``[0-9]`` is matched. + Matches any Unicode decimal digit + (that is, any character in Unicode character category `[Nd]`__). + This includes ``[0-9]``, and also many other digit characters. + If the :py:const:`~re.ASCII` flag is used, only matches ``[0-9]``. + + __ https://www.unicode.org/versions/Unicode15.0.0/ch04.pdf#G134153 For 8-bit (bytes) patterns: - Matches any decimal digit; this is equivalent to ``[0-9]``. + Matches any decimal digit in the ASCII character set; + this is equivalent to ``[0-9]``. .. index:: single: \D; in regular expressions ``\D`` - Matches any character which is not a decimal digit. This is - the opposite of ``\d``. If the :const:`ASCII` flag is used this - becomes the equivalent of ``[^0-9]``. + Matches any character which is not a decimal digit. + This is the opposite of ``\d``. + If the :py:const:`~re.ASCII` flag is used, + matches the equivalent of ``[^0-9]``. .. index:: single: \s; in regular expressions @@ -578,7 +602,7 @@ character ``'$'``. Matches Unicode whitespace characters (which includes ``[ \t\n\r\f\v]``, and also many other characters, for example the non-breaking spaces mandated by typography rules in many - languages). If the :const:`ASCII` flag is used, only + languages). If the :py:const:`~re.ASCII` flag is used, only ``[ \t\n\r\f\v]`` is matched. For 8-bit (bytes) patterns: @@ -589,30 +613,37 @@ character ``'$'``. ``\S`` Matches any character which is not a whitespace character. This is - the opposite of ``\s``. If the :const:`ASCII` flag is used this + the opposite of ``\s``. If the :py:const:`~re.ASCII` flag is used this becomes the equivalent of ``[^ \t\n\r\f\v]``. .. index:: single: \w; in regular expressions ``\w`` For Unicode (str) patterns: - Matches Unicode word characters; this includes alphanumeric characters (as defined by :meth:`str.isalnum`) + Matches Unicode word characters; + this includes all Unicode alphanumeric characters + (as defined by :py:meth:`str.isalnum`), as well as the underscore (``_``). - If the :const:`ASCII` flag is used, only ``[a-zA-Z0-9_]`` is matched. + If the :py:const:`~re.ASCII` flag is used, + only ``[a-zA-Z0-9_]`` is matched. For 8-bit (bytes) patterns: Matches characters considered alphanumeric in the ASCII character set; - this is equivalent to ``[a-zA-Z0-9_]``. If the :const:`LOCALE` flag is - used, matches characters considered alphanumeric in the current locale - and the underscore. + this is equivalent to ``[a-zA-Z0-9_]``. + If the :py:const:`~re.LOCALE` flag is used, + matches characters considered alphanumeric in the current locale and the underscore. .. index:: single: \W; in regular expressions ``\W`` - Matches any character which is not a word character. This is - the opposite of ``\w``. If the :const:`ASCII` flag is used this - becomes the equivalent of ``[^a-zA-Z0-9_]``. If the :const:`LOCALE` flag is - used, matches characters which are neither alphanumeric in the current locale + Matches any character which is not a word character. + This is the opposite of ``\w``. + By default, matches non-underscore (``_``) characters + for which :py:meth:`str.isalnum` returns ``False``. + If the :py:const:`~re.ASCII` flag is used, + matches ``[^a-zA-Z0-9_]``. + If the :py:const:`~re.LOCALE` flag is used, + matches characters which are neither alphanumeric in the current locale nor the underscore. .. index:: single: \Z; in regular expressions @@ -644,9 +675,11 @@ string literals are also accepted by the regular expression parser:: (Note that ``\b`` is used to represent word boundaries, and means "backspace" only inside character classes.) -``'\u'``, ``'\U'``, and ``'\N'`` escape sequences are only recognized in Unicode -patterns. In bytes patterns they are errors. Unknown escapes of ASCII -letters are reserved for future use and treated as errors. +``'\u'``, ``'\U'``, and ``'\N'`` escape sequences are +only recognized in Unicode (str) patterns. +In bytes patterns they are errors. +Unknown escapes of ASCII letters are reserved +for future use and treated as errors. Octal escapes are included in a limited form. If the first digit is a 0, or if there are three octal digits, it is considered an octal escape. Otherwise, it is @@ -694,30 +727,37 @@ Flags Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S`` perform ASCII-only matching instead of full Unicode matching. This is only - meaningful for Unicode patterns, and is ignored for byte patterns. + meaningful for Unicode (str) patterns, and is ignored for bytes patterns. + Corresponds to the inline flag ``(?a)``. - Note that for backward compatibility, the :const:`re.U` flag still - exists (as well as its synonym :const:`re.UNICODE` and its embedded - counterpart ``(?u)``), but these are redundant in Python 3 since - matches are Unicode by default for strings (and Unicode matching - isn't allowed for bytes). + .. note:: + + The :py:const`~re.U` flag still exists for backward compatibility, + but is redundant in Python 3 since + matches are Unicode by default for ``str`` patterns, + and Unicode matching isn't allowed for bytes patterns. + :py:const`~re.UNICODE` and the inline flag ``(?u)`` are similarly redundant. .. data:: DEBUG Display debug information about compiled expression. + No corresponding inline flag. .. data:: I IGNORECASE - Perform case-insensitive matching; expressions like ``[A-Z]`` will also - match lowercase letters. Full Unicode matching (such as ``Ü`` matching - ``ü``) also works unless the :const:`re.ASCII` flag is used to disable - non-ASCII matches. The current locale does not change the effect of this - flag unless the :const:`re.LOCALE` flag is also used. + Perform case-insensitive matching; + expressions like ``[A-Z]`` will also match lowercase letters. + Full Unicode matching (such as ``Ü`` matching ``ü``) + also works unless the :py:const:`~re.ASCII` flag + is used to disable non-ASCII matches. + The current locale does not change the effect of this flag + unless the :py:const:`~re.LOCALE` flag is also used. + Corresponds to the inline flag ``(?i)``. Note that when the Unicode patterns ``[a-z]`` or ``[A-Z]`` are used in @@ -725,29 +765,35 @@ Flags letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital letter I with dot above), 'ı' (U+0131, Latin small letter dotless i), 'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign). - If the :const:`ASCII` flag is used, only letters 'a' to 'z' + If the :py:const:`~re.ASCII` flag is used, only letters 'a' to 'z' and 'A' to 'Z' are matched. .. data:: L LOCALE Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching - dependent on the current locale. This flag can be used only with bytes - patterns. The use of this flag is discouraged as the locale mechanism - is very unreliable, it only handles one "culture" at a time, and it only - works with 8-bit locales. Unicode matching is already enabled by default - in Python 3 for Unicode (str) patterns, and it is able to handle different - locales/languages. + dependent on the current locale. + This flag can be used only with bytes patterns. + Corresponds to the inline flag ``(?L)``. + .. warning:: + + This flag is discouraged, consider Unicode matching instead. + The locale mechanism is very unreliable, + as it only handles one "culture" at a time, + and it only works with 8-bit locales. + Unicode matching is enabled by default for Unicode (str) patterns, + and it is able to handle different locales and languages. + .. versionchanged:: 3.6 - :const:`re.LOCALE` can be used only with bytes patterns and is - not compatible with :const:`re.ASCII`. + :py:const:`~re.LOCALE` can be used only with bytes patterns + and is not compatible with :py:const:`~re.ASCII`. .. versionchanged:: 3.7 - Compiled regular expression objects with the :const:`re.LOCALE` flag no - longer depend on the locale at compile time. Only the locale at - matching time affects the result of matching. + Compiled regular expression objects with the :py:const:`~re.LOCALE` flag + no longer depend on the locale at compile time. + Only the locale at matching time affects the result of matching. .. data:: M @@ -759,6 +805,7 @@ Flags end of each line (immediately preceding each newline). By default, ``'^'`` matches only at the beginning of the string, and ``'$'`` only at the end of the string and immediately before the newline (if any) at the end of the string. + Corresponds to the inline flag ``(?m)``. .. data:: NOFLAG @@ -778,19 +825,19 @@ Flags Make the ``'.'`` special character match any character at all, including a newline; without this flag, ``'.'`` will match anything *except* a newline. + Corresponds to the inline flag ``(?s)``. .. data:: U UNICODE - In Python 2, this flag made :ref:`special sequences ` - include Unicode characters in matches. Since Python 3, Unicode characters - are matched by default. + In Python 3, Unicode characters are matched by default + for ``str`` patterns. + This flag is therefore redundant with **no effect**, + and is only kept for backward compatibility. - See :const:`A` for restricting matching on ASCII characters instead. - - This flag is only kept for backward compatibility. + See :py:const:`~re.ASCII` to restrict matching to ASCII characters instead. .. data:: X VERBOSE @@ -916,6 +963,8 @@ Functions Empty matches for the pattern split the string only when not adjacent to a previous empty match. + .. code:: pycon + >>> re.split(r'\b', 'Words, words, words.') ['', 'Words', ', ', 'words', ', ', 'words', '.'] >>> re.split(r'\W*', '...words...') @@ -1229,7 +1278,7 @@ attributes: The regex matching flags. This is a combination of the flags given to :func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit - flags such as :data:`UNICODE` if the pattern is a Unicode string. + flags such as :py:const`~re.UNICODE` if the pattern is a Unicode string. .. attribute:: Pattern.groups From a45f7de5b83f4a9a78a4bd9c1b3c802a4173cc5a Mon Sep 17 00:00:00 2001 From: Adam Turner <9087854+aa-turner@users.noreply.github.com> Date: Sun, 6 Aug 2023 04:13:09 +0100 Subject: [PATCH 2/3] Add missing colons --- Doc/library/re.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/Doc/library/re.rst b/Doc/library/re.rst index 6c4eca2480dda9..22ee98f85c3d10 100644 --- a/Doc/library/re.rst +++ b/Doc/library/re.rst @@ -733,11 +733,11 @@ Flags .. note:: - The :py:const`~re.U` flag still exists for backward compatibility, + The :py:const:`~re.U` flag still exists for backward compatibility, but is redundant in Python 3 since matches are Unicode by default for ``str`` patterns, and Unicode matching isn't allowed for bytes patterns. - :py:const`~re.UNICODE` and the inline flag ``(?u)`` are similarly redundant. + :py:const:`~re.UNICODE` and the inline flag ``(?u)`` are similarly redundant. .. data:: DEBUG @@ -1278,7 +1278,7 @@ attributes: The regex matching flags. This is a combination of the flags given to :func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit - flags such as :py:const`~re.UNICODE` if the pattern is a Unicode string. + flags such as :py:const:`~re.UNICODE` if the pattern is a Unicode string. .. attribute:: Pattern.groups From f40c66072e0cb36904dc997192d1669b182823ae Mon Sep 17 00:00:00 2001 From: Adam Turner <9087854+aa-turner@users.noreply.github.com> Date: Sun, 6 Aug 2023 16:59:55 +0100 Subject: [PATCH 3/3] Grammar; commas; switch order of ASCII matches sentences Reviewed-by: Joe Kaufeld Reviewed-by: Hugo van Kemenade --- Doc/library/re.rst | 64 ++++++++++++++++++++++++---------------------- 1 file changed, 34 insertions(+), 30 deletions(-) diff --git a/Doc/library/re.rst b/Doc/library/re.rst index 22ee98f85c3d10..367a14824f01ad 100644 --- a/Doc/library/re.rst +++ b/Doc/library/re.rst @@ -257,7 +257,7 @@ The special characters are: .. index:: single: \ (backslash); in regular expressions * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted - inside a set, although the characters they match depends on the flags_ used. + inside a set, although the characters they match depend on the flags_ used. .. index:: single: ^ (caret); in regular expressions @@ -330,13 +330,13 @@ The special characters are: The group matches the empty string; the letters set the corresponding flags for the entire regular expression: - * :const:`re.A` (ASCII-only matching), - * :const:`re.I` (ignore case), - * :const:`re.L` (locale dependent), - * :const:`re.M` (multi-line), - * :const:`re.S` (dot matches all), - * :const:`re.U` (Unicode matching), - * :const:`re.X` (verbose). + * :const:`re.A` (ASCII-only matching) + * :const:`re.I` (ignore case) + * :const:`re.L` (locale dependent) + * :const:`re.M` (multi-line) + * :const:`re.S` (dot matches all) + * :const:`re.U` (Unicode matching) + * :const:`re.X` (verbose) (The flags are described in :ref:`contents-of-module-re`.) This is useful if you wish to include the flags as part of the @@ -363,11 +363,11 @@ The special characters are: The letters set or remove the corresponding flags for the part of the expression: * :const:`re.A` (ASCII-only matching) - * :const:`re.I` (ignore case), + * :const:`re.I` (ignore case) * :const:`re.L` (locale dependent) - * :const:`re.M` (multi-line), + * :const:`re.M` (multi-line) * :const:`re.S` (dot matches all) - * :const:`re.U` (Unicode matching), + * :const:`re.U` (Unicode matching) * :const:`re.X` (verbose) (The flags are described in :ref:`contents-of-module-re`.) @@ -377,7 +377,7 @@ The special characters are: when one of them appears in an inline group, it overrides the matching mode in the enclosing group. In Unicode patterns ``(?a:...)`` switches to ASCII-only matching, and ``(?u:...)`` switches to Unicode matching - (default). In bytes patterns ``(?L:...)`` switches to locale depending + (default). In bytes patterns ``(?L:...)`` switches to locale dependent matching, and ``(?a:...)`` switches to ASCII-only matching (default). This override is only in effect for the narrow inline group, and the original matching mode is restored outside of the group. @@ -565,7 +565,7 @@ character ``'$'``. but only when it is *not* at the beginning or end of a word. This means that ``r'at\B'`` matches ``'athens'``, ``'atom'``, ``'attorney'``, but not ``'at'``, ``'at.'``, or ``'at!'``. - ``\B`` is just the opposite of ``\b``, + ``\B`` is the opposite of ``\b``, so word characters in Unicode (str) patterns are Unicode alphanumerics or the underscore, although this can be changed by using the :py:const:`~re.ASCII` flag. @@ -579,7 +579,8 @@ character ``'$'``. Matches any Unicode decimal digit (that is, any character in Unicode character category `[Nd]`__). This includes ``[0-9]``, and also many other digit characters. - If the :py:const:`~re.ASCII` flag is used, only matches ``[0-9]``. + + Matches ``[0-9]`` if the :py:const:`~re.ASCII` flag is used. __ https://www.unicode.org/versions/Unicode15.0.0/ch04.pdf#G134153 @@ -592,8 +593,8 @@ character ``'$'``. ``\D`` Matches any character which is not a decimal digit. This is the opposite of ``\d``. - If the :py:const:`~re.ASCII` flag is used, - matches the equivalent of ``[^0-9]``. + + Matches ``[^0-9]`` if the :py:const:`~re.ASCII` flag is used. .. index:: single: \s; in regular expressions @@ -602,8 +603,9 @@ character ``'$'``. Matches Unicode whitespace characters (which includes ``[ \t\n\r\f\v]``, and also many other characters, for example the non-breaking spaces mandated by typography rules in many - languages). If the :py:const:`~re.ASCII` flag is used, only - ``[ \t\n\r\f\v]`` is matched. + languages). + + Matches ``[ \t\n\r\f\v]`` if the :py:const:`~re.ASCII` flag is used. For 8-bit (bytes) patterns: Matches characters considered whitespace in the ASCII character set; @@ -613,8 +615,9 @@ character ``'$'``. ``\S`` Matches any character which is not a whitespace character. This is - the opposite of ``\s``. If the :py:const:`~re.ASCII` flag is used this - becomes the equivalent of ``[^ \t\n\r\f\v]``. + the opposite of ``\s``. + + Matches ``[^ \t\n\r\f\v]`` if the :py:const:`~re.ASCII` flag is used. .. index:: single: \w; in regular expressions @@ -624,8 +627,8 @@ character ``'$'``. this includes all Unicode alphanumeric characters (as defined by :py:meth:`str.isalnum`), as well as the underscore (``_``). - If the :py:const:`~re.ASCII` flag is used, - only ``[a-zA-Z0-9_]`` is matched. + + Matches ``[a-zA-Z0-9_]`` if the :py:const:`~re.ASCII` flag is used. For 8-bit (bytes) patterns: Matches characters considered alphanumeric in the ASCII character set; @@ -640,8 +643,9 @@ character ``'$'``. This is the opposite of ``\w``. By default, matches non-underscore (``_``) characters for which :py:meth:`str.isalnum` returns ``False``. - If the :py:const:`~re.ASCII` flag is used, - matches ``[^a-zA-Z0-9_]``. + + Matches ``[^a-zA-Z0-9_]`` if the :py:const:`~re.ASCII` flag is used. + If the :py:const:`~re.LOCALE` flag is used, matches characters which are neither alphanumeric in the current locale nor the underscore. @@ -779,11 +783,11 @@ Flags .. warning:: - This flag is discouraged, consider Unicode matching instead. - The locale mechanism is very unreliable, - as it only handles one "culture" at a time, - and it only works with 8-bit locales. - Unicode matching is enabled by default for Unicode (str) patterns, + This flag is discouraged; consider Unicode matching instead. + The locale mechanism is very unreliable + as it only handles one "culture" at a time + and only works with 8-bit locales. + Unicode matching is enabled by default for Unicode (str) patterns and it is able to handle different locales and languages. .. versionchanged:: 3.6 @@ -834,7 +838,7 @@ Flags In Python 3, Unicode characters are matched by default for ``str`` patterns. - This flag is therefore redundant with **no effect**, + This flag is therefore redundant with **no effect** and is only kept for backward compatibility. See :py:const:`~re.ASCII` to restrict matching to ASCII characters instead.