# re — Regular expression operations

>Category : Text Processing Services  
Source : https://docs.python.org/3/library/re.html  
See also : The third-party `regex` module  

Python provides the `re` module for working with regular expressions. Both the patterns and the strings to be searched can be Unicode strings (`str`) as well as 8-bit strings (`bytes`), but the two types cannot be mixed.

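Regular expressions use the backslash character ('\') to indicate special forms. This collides with Python's use of the same character for the same purpose in string literals. **For example, to match a literal backslash, one might have to write '`\\\\`' as the pattern string**, because the regular expression must be `\\` and each backslash must be written as `\\` inside a regular Python string literal.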

The solution is to use **Python's raw string notation** for regular expression patterns: backslashes are not handled in any special way in a string literal prefixed with 'r'. So `r"\n"` is a two-character string containing '\' and 'n', while `"\n"` is a one-character string containing a newline.
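The difference between the two kinds of literal can be checked directly. The following is a small sketch; the sample text `C:\temp` and the variable names are made up for illustration.

```python
import re

# In a regular string literal Python processes the escape first,
# so "\n" is a single newline character.
newline = "\n"
# In a raw string literal the backslash is kept, so r"\n" is two characters.
raw_newline = r"\n"
lengths = (len(newline), len(raw_newline))   # (1, 2)

# To match one literal backslash the regex engine needs a two-backslash pattern.
# Without the r prefix each of those backslashes must itself be doubled.
sample = r"C:\temp"                       # illustrative text with one backslash
plain_match = re.search("\\\\", sample)   # regular literal: four backslashes
raw_match = re.search(r"\\", sample)      # raw literal: two backslashes
same_span = plain_match.span() == raw_match.span()   # True, both are (2, 3)
```

Both calls compile the same pattern; the raw form is simply easier to read and harder to get wrong.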

In [1]:
import re

## Regular Expression Syntax

- Special characters : Special characters either stand for classes of ordinary characters, or affect how the regular expressions around them are interpreted.

- Ordinary characters : Ordinary characters simply match themselves, like 'A', 'a', '0'.


The special characters are: 

Character | Match | Notes
:---------|:-------|:----
`.`|Matches any character except a newline.|Matches everything except the newline character.
`\`|Either escapes special characters (permitting you to match characters like '*', '?', and so forth), or signals a special sequence.|Escape character.
`[]`|Used to indicate a set of characters.|Special characters lose their special meaning inside a set.
***Quantifier***|***Used after a character or `()`***|
`*`|Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible.|Matches the preceding subexpression zero to unlimited times.
`+`|Causes the resulting RE to match 1 or more repetitions of the preceding RE.|Matches the preceding subexpression one to unlimited times.
`?`|Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.|Matches the preceding subexpression zero times or once.
`*?`, `+?`, `??`|These are the non-greedy versions of the previous quantifiers `*`, `+`, `?`.|Match as few repetitions as possible.
`{m}`|Specifies that exactly m copies of the previous RE should be matched.|Matches the preceding subexpression m times.
`{m,n}`|Causes the resulting RE to match from m to n repetitions of the preceding RE.|Matches the preceding subexpression at least m and at most n times. If m is omitted, matches 0 to n times; if n is omitted, matches m to unlimited times.
`{m,n}?`|This is the non-greedy version of the previous quantifier.|Non-greedy form of `{m,n}`; matches as few times as possible.
***Pre-defined character set***||
`\d`|Matches any decimal digit. For ASCII-only matching (bytes patterns or the `re.ASCII` flag) this is equivalent to `[0-9]`.|Matches all digit characters.
`\D`|This is the opposite of `\d`. For ASCII-only matching this is equivalent to `[^0-9]`.|Matches all non-digit characters.
`\s`|Matches whitespace characters. For ASCII-only matching this is equivalent to `[ \t\n\r\f\v]`.|Matches whitespace characters, including `\t\n\r\f\v`.
`\S`|This is the opposite of `\s`. For ASCII-only matching this is equivalent to `[^ \t\n\r\f\v]`.|Matches all non-whitespace characters.
`\w`|Matches word characters. For ASCII-only matching this is equivalent to `[a-zA-Z0-9_]`.|Matches all word characters.
`\W`|Matches any character which is not a word character. For ASCII-only matching this is equivalent to `[^a-zA-Z0-9_]`.|Matches all non-word characters.
***Boundary Match***|***Does not consume characters***|
`^`|Matches the start of the string (and, in `MULTILINE` mode, immediately after each newline).|Matches the beginning of the string.
`$`|Matches the end of the string or just before the newline at the end of the string.|Matches the end of the string.
`\A`|Matches only at the start of the string.|Matches only the beginning of the string.
`\Z`|Matches only at the end of the string.|Matches only the end of the string.
`\b`|Matches the empty string, but only at the beginning or end of a word. This is defined as the boundary between a `\w` and a `\W` character, or between `\w` and the beginning/end of the string.|Matches the position between `\w` and `\W`.
`\B`|Matches the empty string, but only when it is not at the beginning or end of a word.|The opposite of `\b`.
***Logic and Group***||
&#124;|In `A`&#124;`B`, A and B can be arbitrary REs; this creates a regular expression that will match either A or B.|Matches subexpression A or subexpression B; if A matches, B is not tried.
`(...)`|Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group.|The parenthesized expression forms a group; the group is treated as a unit and can be followed by a quantifier.
`\number`|Matches the contents of the group of the same number. Groups are numbered starting from 1.|Matches the content of the group numbered `number` (a backreference); groups are numbered starting from 1.
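A few of the rows above are easier to see in action. The following sketch contrasts greedy and non-greedy quantifiers and shows `\b` and a `\1` backreference; the sample strings are invented for illustration.

```python
import re

markup = "<a> b <c>"

# Greedy: .* takes as much as possible, so the match runs to the last '>'.
greedy = re.search(r"<.*>", markup).group()     # '<a> b <c>'
# Non-greedy: .*? stops at the first '>' that lets the pattern succeed.
lazy = re.search(r"<.*?>", markup).group()      # '<a>'

# \b matches at a word boundary without consuming a character,
# so the 'cat' inside 'concat' is skipped.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")   # ['cat', 'cat']

# \1 must repeat whatever group 1 captured, here a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "it is is fine").group(1)   # 'is'
```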

## Module Contents
### re.match vs re.search
- re.match(pattern, string, flags=0)
- re.search(pattern, string, flags=0)

### re.findall vs re.finditer
- re.findall(pattern, string, flags=0)
- re.finditer(pattern, string, flags=0)
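As a rough illustration of the difference: `re.findall` returns plain strings (or tuples of strings when the pattern has more than one group), while `re.finditer` yields match objects, so positions remain available. The sample text below is made up.

```python
import re

text = "set width=20 and height=10"
pattern = r"(\w+)=(\d+)"

# findall: a list of tuples, one tuple per match, one item per group.
pairs = re.findall(pattern, text)
# [('width', '20'), ('height', '10')]

# finditer: an iterator of match objects, useful when spans are needed.
located = [(m.group(1), m.span()) for m in re.finditer(pattern, text)]
# [('width', (4, 12)), ('height', (17, 26))]
```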

### re.split
- re.split(pattern, string, maxsplit=0, flags=0)
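A short sketch of how the pattern and `maxsplit` affect the result. The input string is only a sample.

```python
import re

sentence = "Words, words, words."

# Split on runs of non-word characters; the trailing '.' leaves an empty string.
parts = re.split(r"\W+", sentence)
# ['Words', 'words', 'words', '']

# With a capturing group, the separators are returned as well.
with_separators = re.split(r"(\W+)", sentence)
# ['Words', ', ', 'words', ', ', 'words', '.', '']

# maxsplit limits the number of splits; the rest of the string is kept whole.
first_split = re.split(r"\W+", sentence, maxsplit=1)
# ['Words', 'words, words.']
```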

### re.sub
- re.sub(pattern, repl, string, count=0, flags=0)
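The `repl` argument can be a string (optionally with backreferences such as `\1`) or a function that receives each match object. A minimal sketch with invented inputs:

```python
import re

# Backreferences in repl: swap the two captured words.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "Isaac Newton")
# 'Newton Isaac'

# count limits how many occurrences are replaced.
first_only = re.sub(r"\d", "#", "a1b2c3", count=1)
# 'a#b2c3'

# repl as a function: called once per match, its return value is inserted.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 apples, 10 pears")
# '6 apples, 20 pears'
```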

### re.compile
- re.compile(pattern, flags=0)
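Compiling is mainly useful when the same pattern is used many times. A compiled pattern object exposes the same operations as methods, and its `search()` and `match()` additionally accept `pos` and `endpos`. The sketch below uses a made-up sample string.

```python
import re

word = re.compile(r"[a-z]+", re.IGNORECASE)
text = "Spam and EGGS"

# Methods on the compiled pattern mirror the module-level functions.
all_words = word.findall(text)                      # ['Spam', 'and', 'EGGS']
same_result = re.findall(r"[a-z]+", text, flags=re.IGNORECASE)

# The optional second argument (pos) starts the search at that index.
from_index_5 = word.search(text, 5)
found = (from_index_5.group(), from_index_5.span())   # ('and', (5, 8))
```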

## Match Object

Match objects support the following methods and attributes:
- Methods:
    - match.group([group1, ...])
    - match.groups(default=None)
    - match.groupdict(default=None)
    - match.start([group])
    - match.end([group])
    - match.span([group])
- Attributes:
    - match.pos
    - match.endpos
    - match.string
    - match.lastindex
    - match.lastgroup
    - match.re

### match.group([group1, ...])
Returns one or more subgroups of the match.
- Without arguments, group1 defaults to zero (the whole match is returned).
- If there is a single argument, the result is a single string.
    - If a groupN argument is zero, the corresponding return value is the entire matching string.
    - If it is in the inclusive range [1..99], it is the string matching the corresponding parenthesized group.
- If there are multiple arguments, the result is a tuple with one item per argument. 
- If a group is contained in a part of the pattern that did not match, the corresponding result is None.
- If a group is contained in a part of the pattern that matched multiple times, the last match is returned.

In [2]:
m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
m.group(0)       # The entire match

'Isaac Newton'

In [3]:
m.group(1)       # The first parenthesized subgroup.

'Isaac'

In [4]:
m.group(2)       # The second parenthesized subgroup.

'Newton'

In [5]:
m.group(1, 2)    # Multiple arguments give us a tuple.

('Isaac', 'Newton')

In [6]:
m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
m.group('first_name')  # Same as m.group(1)

'Malcolm'

In [7]:
m.group('last_name')  # Same as m.group(2)

'Reynolds'

In [8]:
m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
m.group(1)

'c3'

### match.groups(default=None)
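Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern. The default argument is used for groups that did not participate in the match; it defaults to None.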

In [9]:
m = re.match(r"(\d+)\.(\d+)", "24.1632")
m.groups()

('24', '1632')

In [10]:
m = re.match(r"(\d+)\.?(\d+)?", "24")
m.groups()

('24', None)
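Passing a value for `default` replaces the `None` entries for groups that did not take part in the match. A short sketch reusing the pattern from the cell above:

```python
import re

m = re.match(r"(\d+)\.?(\d+)?", "24")

# The second group is optional and did not participate in the match.
without_default = m.groups()      # ('24', None)
with_default = m.groups("0")      # ('24', '0')
```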

### match.groupdict(default=None)
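Return a dictionary containing all the named subgroups of the match, keyed by the subgroup name. The default argument is used for groups that did not participate in the match; it defaults to None.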

In [11]:
m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
m.groupdict()

{'first_name': 'Malcolm', 'last_name': 'Reynolds'}
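The same `default` mechanism applies to named groups. In this sketch the last name is made optional (a variation on the pattern above, introduced only for illustration) so that one named group is left unmatched.

```python
import re

# The non-capturing group (?: ...)? makes the last name optional.
m = re.match(r"(?P<first_name>\w+)(?: (?P<last_name>\w+))?", "Malcolm")

plain = m.groupdict()
# {'first_name': 'Malcolm', 'last_name': None}
filled = m.groupdict(default="")
# {'first_name': 'Malcolm', 'last_name': ''}
```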

### match.start([group]) & match.end([group]) & match.span([group])
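- `start()` and `end()` return the indices of the start and end of the substring matched by group; group defaults to zero, meaning the whole matched substring.
- `span()` returns the 2-tuple `(m.start(group), m.end(group))`.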

In [12]:
email = "tony@tiremove_thisger.net"
m = re.search("remove_this", email)
email[:m.start()] + email[m.end():]

'tony@tiger.net'

In [13]:
m.span()

(7, 18)

In [14]:
m.endpos

25
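The remaining attributes, and the optional group argument of `start()`, `end()` and `span()`, can be explored the same way. The following is a sketch with a hypothetical pattern and text; `pos` and `endpos` are passed explicitly so that they differ from the defaults of 0 and `len(string)`.

```python
import re

pattern = re.compile(r"(?P<user>\w+)@(?P<host>\w+)")
text = "mail tony@tiger now"

# Search only within text[2:16]; these become m.pos and m.endpos.
m = pattern.search(text, 2, 16)

whole = m.span()                  # (5, 15): span of the whole match
user_span = m.span("user")        # (5, 9): span of one named group
host_bounds = (m.start("host"), m.end("host"))   # (10, 15)

limits = (m.pos, m.endpos)        # (2, 16)
last = (m.lastindex, m.lastgroup) # (2, 'host'): last matched group
same_pattern = m.re is pattern    # True: the pattern that produced the match
same_text = m.string == text      # True: the string passed to search()
```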

## Regular Expression Examples

In the examples in this section, we'll use the following helper function to display match objects a little more gracefully:

In [15]:
def displaymatch(match):
    if match is None:
        return None
    return '<Match: %r, groups=%r>' % (match.group(), match.groups())

### Checking for a pair
Suppose you are writing a poker program where a player’s hand is represented as a 5-character string with each character representing a card, “a” for ace, “k” for king, “q” for queen, “j” for jack, “t” for 10, and “2” through “9” representing the card with that value. To check whether a given string is a valid hand, one can do the following:

In [16]:
valid = re.compile(r"^[a2-9tjqk]{5}$")
displaymatch(valid.match("akt5q"))  # Valid.

"<Match: 'akt5q', groups=()>"

In [17]:
displaymatch(valid.match("akt5e"))  # Invalid.

In [18]:
displaymatch(valid.match("akt"))    # Invalid.

In [19]:
displaymatch(valid.match("727ak"))  # Valid.

"<Match: '727ak', groups=()>"

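That last hand, "727ak", contains a pair, that is, two cards of the same value. To match this with a regular expression, one can use a backreference:

In [20]: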
pair = re.compile(r".*(.).*\1")       # \1 must repeat the character captured by (.)
displaymatch(pair.match("717ak"))     # Pair of 7s.

"<Match: '717', groups=('7',)>"

In [21]:
displaymatch(pair.match("718ak"))     # No pairs.

In [22]:
pair.match("717aa").groups()          # Pair of aces.

('a',)

### search() vs. match()
- re.match() checks for a match only at the beginning of the string. 
- re.search() checks for a match anywhere in the string (this is what Perl does by default).

In [23]:
re.match("c", "abcdef")  # No match
re.search("c", "abcdef") # Match

<_sre.SRE_Match object; span=(2, 3), match='c'>
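The distinction also holds in `MULTILINE` mode: `re.match()` still only matches at the very beginning of the string, whereas `re.search()` with a pattern starting with `^` matches at the beginning of each line. A minimal sketch with a made-up three-line string:

```python
import re

lines = "A\nB\nX"

# search() with '^' can restrict the match to the start of the string.
anchored_miss = re.search("^c", "abcdef")    # None: 'c' is not at the start
anchored_hit = re.search("^a", "abcdef")     # matches 'a' at index 0

# MULTILINE does not change match(): it still only tries index 0.
match_result = re.match("X", lines, re.MULTILINE)       # None
# search() with '^' in MULTILINE mode matches at the start of any line.
search_result = re.search("^X", lines, re.MULTILINE)    # span (4, 5)
```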