In [1]:
import re

(?<=...)

Matches if the current position in the string is preceded by a match for ... that ends at the current position.

This is called a positive lookbehind assertion. (?<=abc)def will find a match in 'abcdef', since the lookbehind will back up 3 characters and check if the contained pattern matches.

The contained pattern must only match strings of some fixed length, meaning that abc or a|b are allowed, but a* and a{3,4} are not.
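
The fixed-length rule can be checked by compiling a few patterns. This is a minimal sketch assuming the standard library re module, where a variable-length lookbehind is rejected at compile time; the helper name compiles is made up for illustration and is not part of re.

```python
import re

def compiles(pattern):
    """Hypothetical helper: True if re.compile accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Lookbehind bodies that match a fixed number of characters are accepted.
fixed_ok = [compiles(r'(?<=abc)def'), compiles(r'(?<=a|b)c')]

# Bodies whose match length can vary are rejected when compiling.
variable_ok = [compiles(r'(?<=a*)b'), compiles(r'(?<=a{3,4})b')]

print(fixed_ok)     # [True, True]
print(variable_ok)  # [False, False]
```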

Note that patterns which start with positive lookbehind assertions will not match at the beginning of the string being searched; you will most likely want to use the search() function rather than the match() function:

In [9]:
m = re.search('(?<=abc)def', 'abcdef')
m

<re.Match object; span=(3, 6), match='def'>

In [10]:
m.group(0)

'def'

In [18]:
m = re.search('(?<=abc)dddef', 'abcdddef')
m

<re.Match object; span=(3, 8), match='dddef'>

In [19]:
m.group(0)

'dddef'

In [20]:
m = re.search('(?<=def)xx', 'abcdddefxx')
m

<re.Match object; span=(8, 10), match='xx'>

In [21]:
m.group(0)

'xx'
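
For contrast with the search() calls above, the sketch below runs the same pattern through match(). The variable names are illustrative only. Because match() only tries position 0, where nothing precedes the cursor, the lookbehind cannot succeed there.

```python
import re

pattern = r'(?<=abc)def'
text = 'abcdef'

# match() is anchored at position 0, so there is no 'abc' before the cursor.
anchored = re.match(pattern, text)

# search() scans forward and succeeds once 'abc' lies behind the cursor.
scanned = re.search(pattern, text)

print(anchored)          # None
print(scanned.span())    # (3, 6)
print(scanned.group(0))  # def
```

Note that the text consumed by the lookbehind ('abc') is not part of the reported match.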

\W

Matches any character which is not a word character. This is the opposite of \w. If the ASCII flag is used this becomes the equivalent of [^a-zA-Z0-9_]. If the LOCALE flag is used, matches characters which are neither alphanumeric in the current locale nor the underscore.
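
The effect of the ASCII flag on \W can be seen with a string that contains a non-ASCII letter. This is a small sketch; the sample string is an arbitrary choice for illustration.

```python
import re

# 'naive' spelled with U+00EF (i with diaeresis), then '_', '1' and '!'.
text = 'na\u00efve_1!'

# Default (Unicode) mode: the accented letter counts as a word character.
unicode_hits = re.findall(r'\W', text)

# ASCII mode: \W behaves like [^a-zA-Z0-9_], so the accented letter matches.
ascii_hits = re.findall(r'\W', text, flags=re.ASCII)

print(unicode_hits)  # ['!']
print(ascii_hits)    # ['\u00ef', '!'] shown as the accented letter and '!'
```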

In [22]:
m = re.search(r'(?<=-)\w+', 'spam-egg')
m

<re.Match object; span=(5, 8), match='egg'>

In [26]:
m = re.search(r'(?<=-)\w+', 'spam-1132egg13232323232')
m

<re.Match object; span=(5, 23), match='1132egg13232323232'>

In [27]:
m.group(0)

'1132egg13232323232'

+

Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match 'a' followed by any non-zero number of 'b's; it will not match just 'a'.
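
A quick check of the one-or-more rule, using fullmatch() so that the whole string has to fit the pattern. The variable names are illustrative only.

```python
import re

# A lone 'a' has zero 'b's, so ab+ does not match it.
only_a = re.fullmatch(r'ab+', 'a')

# One or more 'b's after the 'a' both satisfy the quantifier.
one_b = re.fullmatch(r'ab+', 'ab')
many_b = re.fullmatch(r'ab+', 'abbbb')

print(only_a)           # None
print(one_b.group(0))   # ab
print(many_b.group(0))  # abbbb
```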

In [28]:
re.split(r'\W+', 'Words, words, words.')

['Words', 'words', 'words', '']

In [29]:
re.split(r'(\W+)', 'Words, words, words.')

['Words', ', ', 'words', ', ', 'words', '.', '']

In [30]:
re.split(r'\W+', 'Words, words, words.', maxsplit=1)

['Words', 'words, words.']

[]

Used to indicate a set of characters. In a set:

Characters can be listed individually, e.g. [amk] will match 'a', 'm', or 'k'.

Ranges of characters can be indicated by giving two characters and separating them by a '-', for example [a-z] will match any lowercase ASCII letter, [0-5][0-9] will match all the two-digit numbers from 00 to 59, and [0-9A-Fa-f] will match any hexadecimal digit. If - is escaped (e.g. [a\-z]) or if it’s placed as the first or last character (e.g. [-a] or [a-]), it will match a literal '-'.

Special characters lose their special meaning inside sets. For example, [(+*)] will match any of the literal characters '(', '+', '*', or ')'.

Character classes such as \w or \S are also accepted inside a set, although the characters they match depend on whether ASCII or LOCALE mode is in force.

Characters that are not within a range can be matched by complementing the set. If the first character of the set is '^', all the characters that are not in the set will be matched. For example, [^5] will match any character except '5', and [^^] will match any character except '^'. ^ has no special meaning if it’s not the first character in the set.

To match a literal ']' inside a set, precede it with a backslash, or place it at the beginning of the set. For example, both [()[\]{}] and []()[{}] will match a right bracket, as well as left bracket, braces, and parentheses.

Support of nested sets and set operations as in Unicode Technical Standard #18 might be added in the future. This would change the syntax, so to facilitate this change a FutureWarning will be raised in ambiguous cases for the time being. That includes sets starting with a literal '[' or containing literal character sequences '--', '&&', '~~', and '||'. To avoid a warning, escape them with a backslash.
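
The set rules described above can be tried one at a time. This is a minimal sketch; the input strings are arbitrary examples chosen for illustration.

```python
import re

# Individually listed characters.
listed = re.findall(r'[amk]', 'makes')

# Ranges: [0-5][0-9] covers 00 through 59 only.
in_range = re.fullmatch(r'[0-5][0-9]', '59') is not None
out_of_range = re.fullmatch(r'[0-5][0-9]', '60') is not None

# An escaped '-' is a literal hyphen, not a range operator.
hyphen = re.findall(r'[a\-z]', 'a-z b')

# Special characters are literal inside a set.
specials = re.findall(r'[(+*)]', '(a+b)*c')

# A leading '^' complements the set.
not_five = re.findall(r'[^5]', '1535')

# A ']' placed first in the set is taken literally.
brackets = re.findall(r'[]()[{}]', 'x[1](2){3}')

print(listed)                  # ['m', 'a', 'k']
print(in_range, out_of_range)  # True False
print(hyphen)                  # ['a', '-', 'z']
print(specials)                # ['(', '+', ')', '*']
print(not_five)                # ['1', '3']
print(brackets)                # ['[', ']', '(', ')', '{', '}']
```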

In [33]:
re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)

['0', '3', '9']