# String Manipulation and Regular Expressions

# 字符串操作和正则表达式

> One place where the Python language really shines is in the manipulation of strings.
This section will cover some of Python's built-in string methods and formatting operations, before moving on to a quick guide to the extremely useful subject of *regular expressions*.
Such string manipulation patterns come up often in the context of data science work, and are one big perk of Python in this context.

Python语言对于字符串的操作是其一大亮点。本章会讨论Python的一些內建的字符串操作和格式化方法。在这之后，我们会简单讨论一下一个非常有用的话题*正则表达式*。这类字符串的操作经常会在数据科学中出现，因此也是Python中很重要的一节。

> Strings in Python can be defined using either single or double quotation marks (they are functionally equivalent):

Python中的字符串可以使用单引号或双引号定义（它们的功能是一致的）：

In [1]:
x = 'a string'
y = "a string"
x == y

True

> In addition, it is possible to define multi-line strings using a triple-quote syntax:

除此之外，还可以使用连续的三个引号定义多行的字符串：

In [2]:
multiline = """
one
two
three
"""

> With this, let's take a quick tour of some of Python's string manipulation tools.

好了，接下来我们来快速的看一下Python的字符串操作工具。
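One detail of the triple-quoted string defined above is easy to miss: everything between the quotes is kept literally, including the line break right after the opening quotes. A minimal check, reusing the same `multiline` value:

```python
multiline = """
one
two
three
"""

# The newline after the opening quotes and the one before the closing quotes
# are both part of the string.
print(repr(multiline))

# strip() is one way to drop the surrounding newlines if they are unwanted.
trimmed = multiline.strip()
print(repr(trimmed))
```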

## Simple String Manipulation in Python

## Python的简单字符串操作

> For basic manipulation of strings, Python's built-in string methods can be extremely convenient.
If you have a background working in C or another low-level language, you will likely find the simplicity of Python's methods extremely refreshing.
We introduced Python's string type and a few of these methods earlier; here we'll dive a bit deeper.

对于基本的字符串操作来说，Python內建的字符串方法使用起来非常方便。如果你有在C或其他底层语言的编程经历的话，你会发现Python的字符串操作非常简单。我们之前介绍了Python的字符串类型和一些方法；下面我们稍微深入的了解一下。

### Formatting strings: Adjusting case

### 格式化字符串：转换大小写

> Python makes it quite easy to adjust the case of a string.
Here we'll look at the ``upper()``, ``lower()``, ``capitalize()``, ``title()``, and ``swapcase()`` methods, using the following messy string as an example:

Python对字符串进行大小写转换非常容易。我们将会看到`upper()`，`lower()`，`capitalize()`，`title()`和`swapcase()`方法，下面我们用一个大小写混乱的字符串作为例子来说明：

In [3]:
fox = "tHe qUICk bROWn fOx."

> To convert the entire string into upper-case or lower-case, you can use the ``upper()`` or ``lower()`` methods respectively:

想要将整个字符串转为大写或者小写，使用`upper()`或者`lower()`方法：

In [4]:
fox.upper()

'THE QUICK BROWN FOX.'

In [5]:
fox.lower()

'the quick brown fox.'

> A common formatting need is to capitalize just the first letter of each word, or perhaps the first letter of each sentence.
This can be done with the ``title()`` and ``capitalize()`` methods:

还有一个很常见的格式化需求，将每个单词的首字母编程大写，或者每个句子的首字母变为大写。可以使用`title()`和`capitalize()`方法：

In [6]:
fox.title()

'The Quick Brown Fox.'

In [7]:
fox.capitalize()

'The quick brown fox.'
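One caveat with ``title()`` is that it treats any non-letter, including an apostrophe, as a word boundary. The sketch below uses a made-up phrase to show the effect, with ``string.capwords()`` from the standard library as one possible alternative that splits on whitespace only:

```python
import string

phrase = "they're bill's friends"

# title() capitalizes the letter after each apostrophe as well.
titled = phrase.title()
print(titled)

# string.capwords() splits on whitespace only, which avoids that effect.
capped = string.capwords(phrase)
print(capped)
```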

> The cases can be swapped using the ``swapcase()`` method:

可以使用`swapcase()`方法切换大小写：

In [8]:
fox.swapcase()

'ThE QuicK BrowN FoX.'

### Formatting strings: Adding and removing spaces

### 格式化字符串：增加和去除空格

> Another common need is to remove spaces (or other characters) from the beginning or end of the string.
The basic method of removing characters is the ``strip()`` method, which strips whitespace from the beginning and end of the line:

另外一个常见的需求是在字符串开头或结束为止去除空格（或者其他字符）。`strip()`方法可以去除开头和结尾的空白。

In [9]:
line = '         this is the content         '
line.strip()

'this is the content'

> To remove whitespace from only the right or only the left, use ``rstrip()`` or ``lstrip()`` respectively:

如果需要去除右边或者左边的空格，可以使用`rstrip()`或`lstrip()`方法：

In [10]:
line.rstrip()

'         this is the content'

In [11]:
line.lstrip()

'this is the content         '

> To remove characters other than spaces, you can pass the desired character to the ``strip()`` method:

想要去除非空格的其他字符，你可以将你想要去除的字符作为参数传给`strip()`方法：

In [12]:
num = "000000000000435"
num.strip('0')

'435'
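Note that the argument to ``strip()`` is read as a *set* of characters to remove from both ends, not as a literal prefix or suffix. A small sketch with a made-up hostname (``str.removeprefix()`` assumes Python 3.9 or later):

```python
url = "www.example.com"

# Every leading and trailing character found in the set "w.moc" is removed,
# so both "www." and ".com" disappear.
stripped = url.strip("w.moc")
print(stripped)

# To remove one exact prefix, str.removeprefix() (Python 3.9+) is more precise.
no_prefix = url.removeprefix("www.")
print(no_prefix)
```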

> The opposite of this operation, adding spaces or other characters, can be accomplished using the ``center()``, ``ljust()``, and ``rjust()`` methods.

与strip相反的操作，往字符串中加入空格或其他字符，可以使用`center()`，`ljust()`，`rjust()`方法。

> For example, we can use the ``center()`` method to center a given string within a given number of spaces:

例如，我们可以使用`center()`方法在给定长度的空格中居中：

In [13]:
line = "this is the content"
line.center(30)

'     this is the content      '

> Similarly, ``ljust()`` and ``rjust()`` will left-justify or right-justify the string within spaces of a given length:

同理，`ljust()`和`rjust()`让字符串在给定长度的空格中居左或居右：

In [14]:
line.ljust(30)

'this is the content           '

In [15]:
line.rjust(30)

'           this is the content'

> All these methods additionally accept any character which will be used to fill the space.
For example:

上述的方法都可以接收一个额外的参数用来取代空白字符，例如：

In [16]:
'435'.rjust(10, '0')

'0000000435'

> Because zero-filling is such a common need, Python also provides ``zfill()``, which is a special method to left-pad a string with zeros:

因为0填充也经常需要用到，因此Python提供了`zfill()`方法来直接提供0填充的功能：

In [17]:
'435'.zfill(10)

'0000000435'
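The two approaches differ for signed numbers: ``zfill()`` inserts the zeros after a leading sign, while ``rjust()`` simply pads on the left. A brief sketch with an illustrative value:

```python
value = "-42"

# zfill() keeps the sign in front of the padding zeros.
zero_filled = value.zfill(6)
print(zero_filled)

# rjust() pads blindly, so the sign ends up in the middle.
justified = value.rjust(6, "0")
print(justified)
```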

### Finding and replacing substrings

### 查找和替换子串

> If you want to find occurrences of a certain character or substring in a string, the ``find()``/``rfind()``, ``index()``/``rindex()``, and ``replace()`` methods are the best built-in methods.

如果你想要在字符串中查找特定的子串，內建的`find()`/`rfind()`，`index()`/`rindex()`以及`replace()`方法是最合适的选择。

> ``find()`` and ``index()`` are very similar, in that they search for the first occurrence of a character or substring within a string, and return the index of the substring:

`find()`和`index()`是非常相似的，它们都是查找子串在字符串中第一个出现的位置，返回位置的序号值：

In [18]:
line = 'the quick brown fox jumped over a lazy dog'
line.find('fox')

16

In [19]:
line.index('fox')

16

> The only difference between ``find()`` and ``index()`` is their behavior when the search string is not found; ``find()`` returns ``-1``, while ``index()`` raises a ``ValueError``:

两个方法唯一的区别在于如果找不到子串情况下的处理方式；`find()`会返回-1，而`index()`会生成一个`ValueError`异常：

In [20]:
line.find('bear')

-1

In [21]:
line.index('bear')

ValueError: substring not found
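Which method to choose depends on how you want to handle a missing substring. The sketch below wraps ``index()`` in a hypothetical helper, ``locate()`` (not part of Python), and also shows why the ``-1`` returned by ``find()`` should be checked before it is used as an index:

```python
line = 'the quick brown fox jumped over a lazy dog'

def locate(text, word):
    """Return the index of word in text, or None if absent (hypothetical helper)."""
    try:
        return text.index(word)
    except ValueError:
        return None

print(locate(line, 'fox'))
print(locate(line, 'bear'))

# -1 is a valid index (the last character), so an unchecked find() result
# can silently point at the wrong place.
position = line.find('bear')
wrong_char = line[position]
print(position, repr(wrong_char))
```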

> The related ``rfind()`` and ``rindex()`` work similarly, except they search for the first occurrence from the end rather than the beginning of the string:

相应的`rfind()`和`rindex()`方法很类似，区别是这两个方法查找的是子串在字符串中最后出现的位置。

In [23]:
line.rfind('a')

35

> For the special case of checking for a substring at the beginning or end of a string, Python provides the ``startswith()`` and ``endswith()`` methods:

对于需要检查字符串是否以某个子串开始或者结束，Python提供了`startswith()`和`endswith()`方法：

In [24]:
line.endswith('dog')

True

In [25]:
line.startswith('fox')

False
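Both methods also accept a tuple of candidates, which is handy when several prefixes or suffixes are acceptable. A minimal sketch with made-up file names:

```python
filenames = ["report.csv", "photo.png", "notes.txt", "scan.jpg"]

# endswith() returns True if the string ends with any item of the tuple.
images = [name for name in filenames if name.endswith((".png", ".jpg"))]
print(images)
```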

> To go one step further and replace a given substring with a new string, you can use the ``replace()`` method.
Here, let's replace ``'brown'`` with ``'red'``:

要将字符串中的某个子串替换成新的子串的内容，可以使用`replace()`方法。下例中将`'brown'`替换成`'red'`：

In [26]:
line.replace('brown', 'red')

'the quick red fox jumped over a lazy dog'

> The ``replace()`` method returns a new string, and will replace all occurrences of the input:

`replace()`方法会返回一个新的字符串，并将里面所有找到的子串替换：

In [27]:
line.replace('o', '--')

'the quick br--wn f--x jumped --ver a lazy d--g'
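If only some occurrences should change, ``replace()`` takes an optional third argument giving the maximum number of replacements. The sketch below also confirms that the original string is left untouched:

```python
line = 'the quick brown fox jumped over a lazy dog'

# Replace only the first 'o'; later occurrences are kept.
once = line.replace('o', '--', 1)
print(once)

# Strings are immutable, so `line` itself is unchanged.
print(line)
```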

> For a more flexible approach to this ``replace()`` functionality, see the discussion of regular expressions in [Flexible Pattern Matching with Regular Expressions](#Flexible-Pattern-Matching-with-Regular-Expressions).

想要更加灵活的使用`replace()`方法，参见[使用正则表达式进行模式匹配](#Flexible-Pattern-Matching-with-Regular-Expressions)。

### Splitting and partitioning strings

### 分割字符串

> If you would like to find a substring *and then* split the string based on its location, the ``partition()`` and/or ``split()`` methods are what you're looking for.
Both will return a sequence of substrings.

如果需要查找一个子串*并且*根据找到的子串的位置将字符串进行分割，`partition()`和/或`split()`方法正是你想要的。

> The ``partition()`` method returns a tuple with three elements: the substring before the first instance of the split-point, the split-point itself, and the substring after:

`partition()`方法返回三个元素的一个元组：查找的子串前面的子串，查找的子串本身和查找的子串后面的子串：

In [28]:
line.partition('fox')

('the quick brown ', 'fox', ' jumped over a lazy dog')

> The ``rpartition()`` method is similar, but searches from the right of the string.

`rpartition()`方法类似，不过是从字符串右边开始查找。
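The difference between the two shows up when the separator occurs more than once, or not at all. A short sketch with illustrative file names:

```python
filename = "archive.tar.gz"

# partition() splits at the first separator, rpartition() at the last.
first = filename.partition(".")
last = filename.rpartition(".")
print(first)
print(last)

# If the separator is missing, the whole string lands in the first slot
# for partition() and in the last slot for rpartition().
missing = "README".partition(".")
missing_r = "README".rpartition(".")
print(missing)
print(missing_r)
```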

> The ``split()`` method is perhaps more useful; it finds *all* instances of the split-point and returns the substrings in between.
The default is to split on any whitespace, returning a list of the individual words in a string:

`split()`方法可能更加有用；它会查找所有子串出现的位置，然后返回这些位置之间的内容列表。默认的子串会是任何的空白字符，返回字符串中所有的单词：

In [29]:
line.split()

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog']
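``split()`` also accepts an explicit separator and a maximum number of splits. With an explicit separator, consecutive separators produce empty strings, unlike the default whitespace mode. A small sketch with made-up inputs:

```python
# An explicit separator keeps empty fields.
fields = "a,b,,c".split(",")
print(fields)

# Default mode collapses runs of whitespace and ignores the ends ...
words = " a  b ".split()
# ... while an explicit single space does not.
pieces = " a  b ".split(" ")
print(words, pieces)

# The second argument limits the number of splits.
key, value = "k=v=w".split("=", 1)
print(key, value)
```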

> A related method is ``splitlines()``, which splits on newline characters.
Let's do this with a haiku, popularly attributed to the 17th-century poet Matsuo Bashō:

还有一个`splitlines()`方法，会按照换行符分割字符串。我们以日本17世纪诗人松尾芭蕉的俳句为例：

In [30]:
haiku = """matsushima-ya
aah matsushima-ya
matsushima-ya"""

haiku.splitlines()

['matsushima-ya', 'aah matsushima-ya', 'matsushima-ya']
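``splitlines()`` is not quite the same as ``split("\n")``: it does not produce a trailing empty string when the text ends with a newline. A minimal comparison on an illustrative string:

```python
text = "first\nsecond\n"

# splitlines() drops the final line break cleanly.
lines = text.splitlines()
print(lines)

# split("\n") leaves an empty string after the last newline.
parts = text.split("\n")
print(parts)
```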

> Note that if you would like to undo a ``split()``, you can use the ``join()`` method, which builds a string by placing a separator between the items of an iterable:

如果你需要撤销`split()`方法，可以使用`join()`方法，使用一个特定字符串将一个迭代器串联起来：

In [31]:
'--'.join(['1', '2', '3'])

'1--2--3'

> A common pattern is to use the special character ``"\n"`` (newline) to join together lines that have been previously split, and recover the input:

使用换行符`"\n"`将刚才拆开的诗句连起来，恢复成原来的字符串：

In [32]:
print("\n".join(['matsushima-ya', 'aah matsushima-ya', 'matsushima-ya']))

matsushima-ya
aah matsushima-ya
matsushima-ya
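``join()`` expects every item to be a string already; other types need to be converted first. A short sketch with illustrative numbers:

```python
numbers = [1, 2, 3]

# Joining integers directly raises a TypeError.
try:
    ", ".join(numbers)
    failed = False
except TypeError:
    failed = True

# Converting each item with str() first works as expected.
joined = ", ".join(str(n) for n in numbers)
print(failed, joined)
```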


## Format Strings

## 格式化字符串

> In the preceding methods, we have learned how to extract values from strings, and to manipulate strings themselves into desired formats.
Another use of string methods is to manipulate string *representations* of values of other types.
Of course, string representations can always be found using the ``str()`` function; for example:

在前面介绍的方法中，我们学习到了怎样从字符串中提取值和如果将字符串本身操作成需要的格式。对于字符串来说，还有一个重要的需求，就是将其他类型的值使用字符串*表达出来*。当然，你总是可以使用`str()`函数将其他类型的值转换为字符串，例如：

In [33]:
pi = 3.14159
str(pi)

'3.14159'

> For more complicated formats, you might be tempted to use string arithmetic as outlined in [Basic Python Semantics: Operators](04-Semantics-Operators.ipynb):

对于更加复杂的格式，你可能试图使用在[Python语法: 操作符](04-Semantics-Operators.ipynb)介绍过的字符串运算来实现：

In [34]:
"The value of pi is " + str(pi)

'The value of pi is 3.14159'

> A more flexible way to do this is to use *format strings*, which are strings with special markers (noted by curly braces) into which string-formatted values will be inserted.
Here is a basic example:

但是我们又一个更灵活的方式来处理格式化，那就是使用*格式化字符串*，也就是在字符串中含有特殊的标记代表格式（这个特殊标记指的是花括号），然后将需要表达的值插入到字符串的相应位置上。例如：

In [35]:
"The value of pi is {}".format(pi)

'The value of pi is 3.14159'

> Inside the ``{}`` marker you can also include information on exactly *what* you would like to appear there.
If you include a number, it will refer to the index of the argument to insert:

在花括号`{}`之间，你可以加入需要的信息。例如你可以在花括号中加入数字，表示该位置插入的参数的序号：

In [36]:
"""First letter: {0}. Last letter: {1}.""".format('A', 'Z')

'First letter: A. Last letter: Z.'

> If you include a string, it will refer to the key of any keyword argument:

如果你在花括号中加入字符串，表示的是该位置插入的关键字参数的名称：

In [37]:
"""First letter: {first}. Last letter: {last}.""".format(last='Z', first='A')

'First letter: A. Last letter: Z.'

> Finally, for numerical inputs, you can include format codes which control how the value is converted to a string.
For example, to print a number as a floating point with three digits after the decimal point, you can use the following:

最后，对于数字输入，你可以在花括号中加入格式化的代码控制数值转换为字符串的格式。例如，将一个浮点数转换为字符串，并且保留小数点后3位，可以这样写：

In [38]:
"pi = {0:.3f}".format(pi)

'pi = 3.142'

> As before, here the "``0``" refers to the index of the value to be inserted.
The "``:``" marks that format codes will follow.
The "``.3f``" encodes the desired precision: three digits beyond the decimal point, floating-point format.

如前所述，`"0"`表示参数位置序号。`":"`表示格式化代码分隔符。`".3f"`表示浮点数格式化的代码，小数点后保留3位。

> This style of format specification is very flexible, and the examples here barely scratch the surface of the formatting options available.
For more information on the syntax of these format strings, see the [Format Specification](https://docs.python.org/3/library/string.html#formatspec) section of Python's online documentation.

这样的格式定义非常灵活，我们这里的例子仅仅是一个简单的介绍。想要查阅更多有关格式化字符串的语法内容，请参见Python在线文档有关[格式化定义](https://docs.python.org/3/library/string.html#formatspec)的章节。
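To give a flavor of those options, here is a small sampler of format codes from the format specification linked above. The labels and values are illustrative only:

```python
examples = {
    "right-aligned": "{:>10}".format("right"),   # pad to width 10, align right
    "centered": "{:^9}".format("mid"),           # pad to width 9, center
    "thousands": "{:,}".format(1234567),         # comma as thousands separator
    "percent": "{:.1%}".format(0.256),           # multiply by 100, one decimal
    "zero-padded": "{:08.3f}".format(3.14159),   # width 8, zero fill, 3 decimals
}

for label, text in examples.items():
    print(label, repr(text))
```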

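## f-strings (added by the translator)

Since Python 3.6 there is another flexible and efficient way to format strings, called the *f-string*. It lets you insert variable values directly into the format string.

Using the pi example from before: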

In [39]:
f"The value of pi is {pi}"

'The value of pi is 3.14159'

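An f-string is written by prefixing the string literal with `f`. As before, curly braces mark the formatted content, and inside the braces you put the variable name.

Another example: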

In [40]:
first = 'A'
last = 'Z'
f"First letter: {first}. Last letter: {last}."

'First letter: A. Last letter: Z.'

Numeric formatting works the same way: just use `":"` to separate the variable name from the format code.

Here is the floating-point example from above:

In [41]:
f"pi = {pi:.3f}"

'pi = 3.142'
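The braces in an f-string are not limited to plain variable names: any expression can appear there, optionally followed by a format code. A minimal sketch, where `radius` is an illustrative value:

```python
pi = 3.14159
radius = 2

# The expression is evaluated first, then formatted with two decimals.
area = f"area = {pi * radius ** 2:.2f}"
print(area)

# Method calls work as well.
shout = f"{'fox'.upper()}!"
print(shout)
```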

## Flexible Pattern Matching with Regular Expressions

## 使用正则表达式实现模式匹配

> The methods of Python's ``str`` type give you a powerful set of tools for formatting, splitting, and manipulating string data.
But even more powerful tools are available in Python's built-in *regular expression* module.
Regular expressions are a huge topic; there are entire books written on the topic (including Jeffrey E.F. Friedl’s [*Mastering Regular Expressions, 3rd Edition*](http://shop.oreilly.com/product/9780596528126.do)), so it will be hard to do justice within just a single subsection.

Python的`str`类型的內建方法提供了一整套强大的格式化、分割和操作字符串的工具。Python內建的*正则表达式*模块提供了更为强大的字符串操作工具。正则表达式是一个巨大的课题；在这个课题上可以写一本书来详细介绍（包括Jeffrey E.F. Friedl写的[*Mastering Regular Expressions, 3rd Edition*](http://shop.oreilly.com/product/9780596528126.do)），所以期望在一个小节中介绍完它是不现实的。

> My goal here is to give you an idea of the types of problems that might be addressed using regular expressions, as well as a basic idea of how to use them in Python.
I'll suggest some references for learning more in [Further Resources on Regular Expressions](#Further-Resources-on-Regular-Expressions).

作者期望通过这个小节的介绍，能够让读者对于什么情况下需要使用正则表达式以及在Python中最基本的正则表达式使用方法有初步的了解。作者建议在[更多的正则表达式资源](#Further-Resources-on-Regular-Expressions)中进一步拓展阅读和学习。

> Fundamentally, regular expressions are a means of *flexible pattern matching* in strings.
If you frequently use the command-line, you are probably familiar with this type of flexible matching with the "``*``" character, which acts as a wildcard.
For example, we can list all the IPython notebooks (i.e., files with extension *.ipynb*) with "Python" in their filename by using the "``*``" wildcard to match any characters in between:

从最基础上来说，正则表达式其实就是一种在字符串中进行*灵活模式匹配*的方法。如果你经常使用命令行，你可能已经习惯了这种灵活匹配机制，比方说`"*"`号，就是一个典型的通配符。我们来看一个例子，我们可以列示所有的IPython notebook（扩展名为*.ipynb*），然后文件名中含有"Python"的文件列表。

In [42]:
!ls *Python*.ipynb

01-How-to-Run-Python-Code.ipynb  02-Basic-Python-Syntax.ipynb


> Regular expressions generalize this "wildcard" idea to a wide range of flexible string-matching syntaxes.
The Python interface to regular expressions is contained in the built-in ``re`` module; as a simple example, let's use it to duplicate the functionality of the string ``split()`` method:

正则表达式就是一种泛化了的"通配符"，使用标准的语法对字符串进行模式匹配。Python中的正则表达式功能包含在`re`內建模块；作为一个简单的例子，我们使用`re`里面的`split()`方法来实现字符串`str`的字符串分割功能：

In [43]:
import re
regex = re.compile(r'\s+')  # raw string, so the backslash reaches the regex engine unchanged
regex.split(line)

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog']

> Here we've first *compiled* a regular expression, then used it to *split* a string.
Just as Python's ``split()`` method returns a list of all substrings between whitespace, the regular expression ``split()`` method returns a list of all substrings between matches to the input pattern.

本例中，我们首先*编译了*一个正则表达式，然后我们用这个表达式对字符串进行*分割*。就像`str`的`split()`方法会使用空白字符切割字符串一样，正则表达式的`split()`方法也会返回所有匹配输入的模式的字符串切割出来的字符串列表。

> In this case, the input is ``"\s+"``: "``\s``" is a special character that matches any whitespace (space, tab, newline, etc.), and the "``+``" is a character that indicates *one or more* of the entity preceding it.
Thus, the regular expression matches any substring consisting of one or more whitespace characters.

在这个例子里，输入的模式是`"\s+"`：`"\s"`是正则表达式里面的一个特殊的字符，代表着任何空白字符（空格，制表符，换行等），`"+"`号代表前面匹配到的字符出现了*一次或多次*。因此，这个正则表达式的意思是匹配任何一个或多个的空白符号。
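The two ``split()`` variants agree on the example line, but they are not identical in general. When the string starts or ends with whitespace, the regular expression version reports the empty strings on either side of those matches. A minimal sketch on an illustrative padded string:

```python
import re

padded = "  the quick  brown fox "

# str.split() with no argument ignores leading and trailing whitespace.
plain = padded.split()
print(plain)

# re.split() returns what lies between matches, including empty ends.
with_regex = re.split(r"\s+", padded)
print(with_regex)
```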

> The ``split()`` method here is basically a convenience routine built upon this *pattern matching* behavior; more fundamental is the ``match()`` method, which will tell you whether the beginning of a string matches the pattern:

这里的`split()`方法是一个在*模式匹配*之上的字符串分割方法；对于正则表达式来说，更加基础的可能是`match()`方法，它会返回字符串是否成功匹配到了某种模式：

In [44]:
for s in ["     ", "abc  ", "  abc"]:
    if regex.match(s):
        print(repr(s), "matches")
    else:
        print(repr(s), "does not match")

'     ' matches
'abc  ' does not match
'  abc' matches
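Because ``match()`` only looks at the beginning of the string, it helps to compare it with two related methods of compiled patterns: ``search()``, which scans the whole string, and ``fullmatch()``, which requires the entire string to fit the pattern. A short sketch:

```python
import re

regex = re.compile(r"\s+")

# match() anchors at the start, so text that begins with letters fails.
at_start = regex.match("abc  ")

# search() finds the first match anywhere in the string.
anywhere = regex.search("abc  ")
print(at_start, anywhere.start())

# fullmatch() succeeds only if the whole string is whitespace.
all_space = regex.fullmatch("   ")
not_all = regex.fullmatch("  abc")
print(all_space is not None, not_all is not None)
```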


> Like ``split()``, there are similar convenience routines to find the first match (like ``str.index()`` or ``str.find()``) or to find and replace (like ``str.replace()``).
We'll again use the line from before:

就像`split()`，正则表达式中也有相应的方法能够找到首个匹配位置（就像`str.index()`或者`str.find()`一样）或者是查找和替换（就像`str.replace()`）。我们还是以前面的那行字符串为例：

In [45]:
line = 'the quick brown fox jumped over a lazy dog'

> With this, we can see that the ``regex.search()`` method operates a lot like ``str.index()`` or ``str.find()``:

可以使用`regex.search()`方法像`str.index()`或者`str.find()`那样查找模式位置：

In [46]:
line.index('fox')

16

In [47]:
regex = re.compile('fox')
match = regex.search(line)
match.start()

16

> Similarly, the ``regex.sub()`` method operates much like ``str.replace()``:

类似的，`regex.sub()`方法就像`str.replace()`那样替换字符串：

In [48]:
line.replace('fox', 'BEAR')

'the quick brown BEAR jumped over a lazy dog'

In [49]:
regex.sub('BEAR', line)

'the quick brown BEAR jumped over a lazy dog'

> With a bit of thought, other native string operations can also be cast as regular expressions.

其他的原始字符串操作也可以转换为正则表达式操作。
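As one example of something ``str.replace()`` cannot do in a single call, the sketch below uses ``re.sub()`` to collapse every run of mixed whitespace into one space. The input string is made up for illustration:

```python
import re

messy = "the   quick\tbrown\n fox"

# Any run of spaces, tabs, or newlines becomes a single space.
clean = re.sub(r"\s+", " ", messy)
print(clean)

# The optional count argument limits how many matches are replaced.
first_only = re.sub(r"\s+", " ", messy, count=1)
print(repr(first_only))
```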

### A more sophisticated example

### 一个更加复杂的例子

> But, you might ask, why would you want to use the more complicated and verbose syntax of regular expressions rather than the more intuitive and simple string methods?
The advantage is that regular expressions offer *far* more flexibility.

于是，你就会问，既然如此，为什么我们要用复杂的正则表达式的方法，而不用简单的字符串方法呢？原因就是正则表达式提供了更多的灵活性。

> Here we'll consider a more complicated example: the common task of matching email addresses.
I'll start by simply writing a (somewhat indecipherable) regular expression, and then walk through what is going on.
Here it goes:

下面我们来考虑一个更加复杂的例子：匹配电子邮件地址。作者会使用一个简单的（但又难以理解的）正则表达式，然后我们看看这个过程中发生了什么。如下：

In [50]:
email = re.compile(r'\w+@\w+\.[a-z]{3}')  # raw string keeps the backslashes intact

> Using this, if we're given a line from a document, we can quickly extract things that look like email addresses:

使用这个正则表达式，我们可以很快地在一行文本中提取出来所有的电子邮件地址：

In [51]:
text = "To email Guido, try guido@python.org or the older address guido@google.com."
email.findall(text)

['guido@python.org', 'guido@google.com']
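When the position of each match matters as well as its text, ``finditer()`` yields match objects instead of plain strings. A minimal sketch using the same pattern and text:

```python
import re

email = re.compile(r"\w+@\w+\.[a-z]{3}")
text = "To email Guido, try guido@python.org or the older address guido@google.com."

# Each match object knows where it starts and ends in the original text.
spans = []
for m in email.finditer(text):
    spans.append((m.start(), m.end()))
    print(m.start(), m.end(), m.group())
```

Slicing ``text`` with each recorded span gives back the matched address.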

> (Note that these addresses are entirely made up; there are probably better ways to get in touch with Guido).

（请注意这两个地址都是编撰的；肯定有更好的方式能够联系上Guido，译者注：Guido是Python的创始人）。

> We can do further operations, like replacing these email addresses with another string, perhaps to hide addresses in the output:

我们可以做更多的处理，比方说将电子邮件地址替换成另一个字符串，此处做了一个脱敏处理：

In [52]:
email.sub('--@--.--', text)

'To email Guido, try --@--.-- or the older address --@--.--.'

> Finally, note that if you really want to match *any* email address, the preceding regular expression is far too simple.
For example, it only allows addresses made of alphanumeric characters that end in exactly three lower-case letters, as in common domain suffixes such as ``.com`` or ``.org``.
So, for example, the period used here means that we only find part of the address:

最后，如果你需要匹配*任何*的电子邮件地址，那么上面的正则表达式还远远不够。它只允许地址由字母数字组成并且一级域名仅能支持少数的通用域名。因为下面的地址含有点`.`，因此只能匹配到一部分的电子邮件地址。

In [53]:
email.findall('barack.obama@whitehouse.gov')

['obama@whitehouse.gov']

> This goes to show how unforgiving regular expressions can be if you're not careful!
If you search around online, you can find some suggestions for regular expressions that will match *all* valid emails, but beware: they are much more involved than the simple expression used here!

这表明了如果你不小心的话，正则表达式会发生多奇怪的错误。如果你在网上搜索的话，你可以发现一些能够匹配*所有*的电子邮件地址的正则表达式，但是，它们比我们这个简单的版本难理解多了。

### Basics of regular expression syntax

### 正则表达式基本语法

> The syntax of regular expressions is much too large a topic for this short section.
Still, a bit of familiarity can go a long way: I will walk through some of the basic constructs here, and then list some more complete resources from which you can learn more.
My hope is that the following quick primer will enable you to use these resources effectively.

正则表达式的语法对于这个小节的内容来说显得太庞大了。然而，了解一些基础的内容能够让读者走的更远：作者会在这里简单介绍一些最基本的结构，然后列出一个完整的资源以供读者继续深入研究和学习。作者希望通过这些简单的基础内容能让读者更加有效的阅读那些额外的资源。

#### Simple strings are matched directly

#### 简单的字符串会直接匹配

> If you build a regular expression on a simple string of characters or digits, it will match that exact string:

如果你的正则表达式只包括简单的字符和数字的组合，那么它将匹配自身：

In [54]:
regex = re.compile('ion')
regex.findall('Great Expectations')

['ion']

#### Some characters have special meanings

#### 特殊含义的字符

> While simple letters or numbers are direct matches, there are a handful of characters that have special meanings within regular expressions. They are:

```
. ^ $ * + ? { } [ ] \ | ( )
```

> We will discuss the meaning of some of these momentarily.
In the meantime, you should know that if you'd like to match any of these characters directly, you can *escape* them with a back-slash:

普通的字符和数字会直接匹配，然后正则表达式中包含很多的特殊字符，他们是：

```shell
. ^ $ * + ? { } [ ] \ | ( )
```

一会我们会稍微详细的介绍其中的部分。同时，你需要知道的是，如果你希望直接匹配上述的特殊字符的话，你需要使用反斜杠`"\"`来转义他们：

In [55]:
regex = re.compile(r'\$')
regex.findall("the cost is $20")

['$']

> The ``r`` preface in ``r'\$'`` indicates a *raw string*; in standard Python strings, the backslash is used to indicate special characters.
For example, a tab is indicated by ``"\t"``:

上面的正则表达式中的前缀`r`是说明改字符串是一个*原始字符串*; 在标准的Python字符串中，反斜杠用来转义并表示一个特殊字符。例如，制表符写成字符串的形式为`"\t"`：

In [56]:
print('a\tb\tc')

a	b	c


> Such substitutions are not made in a raw string:

这种转义不会出现在原始字符串中：

In [57]:
print(r'a\tb\tc')

a\tb\tc


> For this reason, whenever you use backslashes in a regular expression, it is good practice to use a raw string.

因此，当你需要在正则表达式中使用反斜杠时，使用原始字符串是一个好的选择。
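The difference is easy to verify by checking string lengths. The sketch below also shows that a raw string is only a notational convenience: doubling the backslash in a standard string yields the same pattern.

```python
import re

# "\t" is a single tab character; r"\t" is a backslash followed by "t".
tab_len = len("\t")
raw_len = len(r"\t")
print(tab_len, raw_len)

# These two spellings are the same two-character string.
same = "\\$" == r"\$"

# Both therefore compile to the same regular expression.
found_plain = re.findall("\\$", "the cost is $20")
found_raw = re.findall(r"\$", "the cost is $20")
print(same, found_plain, found_raw)
```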

#### Special characters can match character groups

#### 特殊字符能匹配一组字符

> Just as the ``"\"`` character within regular expressions can escape special characters, turning them into normal characters, it can also be used to give normal characters special meaning.
These special characters match specified groups of characters, and we've seen them before.
In the email address regexp from before, we used the character ``"\w"``, which is a special marker matching *any alphanumeric character*. Similarly, in the simple ``split()`` example, we also saw ``"\s"``, a special marker indicating *any whitespace character*.

就像反斜杠在正则表达式中能转义特殊字符那样，反斜杠也能将一些普通字符转义成特殊字符。这些特殊字符能代表一组或一类的字符组合，就像我们在前面的例子当中看到的那样。在电子邮件地址的正则表达式中，我们使用了字符`"\w"`，这个特殊字符代表着*所有的字母数字符号*。同样的，在前面的`split()`例子中，`"\s"`代表着*所有的空白字符*。

> Putting these together, we can create a regular expression that will match *any two letters/digits with whitespace between them*:

把这两个特殊符号放在一起，我们就可以构造一个*任意两个字母或数字之间含有一个空格*的正则表达式：

In [58]:
regex = re.compile(r'\w\s\w')
regex.findall('the fox is 9 years old')

['e f', 'x i', 's 9', 's o']

> This example begins to hint at the power and flexibility of regular expressions.

这个例子已经开始展示正则表达式的力量和灵活性了。

> The following table lists a few of these characters that are commonly useful:

> | Character | Description                 | Character | Description                     |
> |-----------|-----------------------------|-----------|---------------------------------|
> | ``"\d"``  | Match any digit             | ``"\D"``  | Match any non-digit             |
> | ``"\s"``  | Match any whitespace        | ``"\S"``  | Match any non-whitespace        |
> | ``"\w"``  | Match any alphanumeric char | ``"\W"``  | Match any non-alphanumeric char |

下表列出了常用的特殊符号:

| 特殊符号 | 描述                 || 特殊符号 | 描述                     |
|-----------|-----------------------------||-----------|---------------------------------|
| ``"\d"``  | 任意数字             || ``"\D"``  | 任意非数字             |
| ``"\s"``  | 任意空白符号        || ``"\S"``  | 任意非空白符号        |
| ``"\w"``  | 任意字符或数字 || ``"\W"``  | 任意非字符或数字 |

> This is *not* a comprehensive list or description; for more details, see Python's [regular expression syntax documentation](https://docs.python.org/3/library/re.html#re-syntax).

这张表很不完整；需要详细描述，请参见：[正则表达式语法文档](https://docs.python.org/3/library/re.html#re-syntax)。
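The lower-case and upper-case markers in the table are complements of each other, which makes them useful in pairs. A short sketch with made-up inputs:

```python
import re

# \d+ picks out runs of digits ...
digits = re.findall(r"\d+", "Order 66 shipped 3 items on 2024-01-15")
print(digits)

# ... while \D+ can serve as the separator between them.
date_parts = re.split(r"\D+", "2024-01-15")
print(date_parts)

# \W+ splits on anything that is not a word character.
words = re.split(r"\W+", "hello, world! bye")
print(words)
```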

#### Square brackets match custom character groups

#### 中括号匹配自定义的字符组

> If the built-in character groups aren't specific enough for you, you can use square brackets to specify any set of characters you're interested in.
For example, the following will match any lower-case vowel:

如果內建的字符组并不满足你的要求，你可以使用中括号来指定你需要的字符组。例如，下例中的正则表达式匹配任意小写元音字母：

In [59]:
regex = re.compile('[aeiou]')
regex.split('consequential')

['c', 'ns', 'q', '', 'nt', '', 'l']

> Similarly, you can use a dash to specify a range: for example, ``"[a-z]"`` will match any lower-case letter, and ``"[1-3]"`` will match any of ``"1"``, ``"2"``, or ``"3"``.
For instance, you may need to extract from a document specific numerical codes that consist of a capital letter followed by a digit. You could do this as follows:

你还可以使用横线`"-"`来指定字符组的范围：例如，`"[a-z]"`匹配任意小写字母，`"[1-3]"`匹配`"1"`，`"2"`或`"3"`。例如，你希望从某个文档中提取出特定的数字代码，该代码由一个大写字母后面跟一个数字组成。你可以这样写：

In [60]:
regex = re.compile('[A-Z][0-9]')
regex.findall('1043879, G2, H6')

['G2', 'H6']
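Two further variations on square brackets, sketched here with illustrative inputs: several ranges can be combined in one set, and a leading ``^`` inside the brackets negates the set.

```python
import re

# Digits and the letters a-f together: lower-case hexadecimal runs.
hex_runs = re.findall(r"[0-9a-f]+", "ff 1a zz 07")
print(hex_runs)

# [^...] matches any character NOT listed: here, runs of characters that
# are neither lower-case vowels nor whitespace.
consonants = re.findall(r"[^aeiou\s]+", "the quick fox")
print(consonants)
```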

#### Wildcards match repeated characters

#### 通配符匹配重复次数的字符

> If you would like to match a string with, say, three alphanumeric characters in a row, it is possible to write, for example, ``"\w\w\w"``.
Because this is such a common need, there is a specific syntax to match repetitions – curly braces with a number:

如果你想要匹配一个字符串包含3个字符或数字，当然你可以这样写`"\w\w\w"`。但是因为这个需求太普遍了，因此正则表达式将它做成了重复次数的规则 - 使用花括号中的数字表示重复的次数：

In [61]:
regex = re.compile(r'\w{3}')
regex.findall('The quick brown fox')

['The', 'qui', 'bro', 'fox']

> There are also markers available to match any number of repetitions – for example, the ``"+"`` character will match *one or more* repetitions of what precedes it:

当然还有一些标记能够匹配任意次数的重复 - 例如，`"+"`号代表前面匹配到的字符重复*一次或多次*：

In [62]:
regex = re.compile(r'\w+')
regex.findall('The quick brown fox')

['The', 'quick', 'brown', 'fox']

> The following is a table of the repetition markers available for use in regular expressions:

> | Character | Description | Example |
|-----------|-------------|---------|
| ``?`` | Match zero or one repetitions of preceding  | ``"ab?"`` matches ``"a"`` or ``"ab"`` |
| ``*`` | Match zero or more repetitions of preceding | ``"ab*"`` matches ``"a"``, ``"ab"``, ``"abb"``, ``"abbb"``... |
| ``+`` | Match one or more repetitions of preceding  | ``"ab+"`` matches ``"ab"``, ``"abb"``, ``"abbb"``... but not ``"a"`` |
| ``{n}`` | Match ``n`` repetitions of preceding | ``"ab{2}"`` matches ``"abb"`` |
| ``{m,n}`` | Match between ``m`` and ``n`` repetitions of preceding | ``"ab{2,3}"`` matches ``"abb"`` or ``"abbb"`` |

下表列示了正则表达式中可用的重复标记:

| 特殊字符 | 描述 | 例子 |
|-----------|-------------|---------|
| ``?`` | 匹配0次或1次  | ``"ab?"`` 匹配 ``"a"`` 或 ``"ab"`` |
| ``*`` | 匹配0次或多次 | ``"ab*"`` 匹配 ``"a"``, ``"ab"``, ``"abb"``, ``"abbb"``... |
| ``+`` | 匹配1次或多次  | ``"ab+"`` 匹配 ``"ab"``, ``"abb"``, ``"abbb"``... 但不匹配 ``"a"`` |
| ``{n}`` | 匹配正好n次 | ``"ab{2}"`` 匹配 ``"abb"`` |
| ``{m,n}`` | 匹配最小m次最大n次 | ``"ab{2,3}"`` 匹配 ``"abb"`` 或 ``"abbb"`` |
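The rows of the table can be checked directly. The sketch below uses ``fullmatch()`` so that each test string must fit the pattern in its entirety; the rejected strings are extra illustrative cases, not all taken from the table:

```python
import re

# (pattern, strings that should match in full, strings that should not)
cases = [
    (r"ab?", ["a", "ab"], ["abb"]),
    (r"ab*", ["a", "ab", "abbb"], ["b"]),
    (r"ab+", ["ab", "abb"], ["a"]),
    (r"ab{2}", ["abb"], ["ab", "abbb"]),
    (r"ab{2,3}", ["abb", "abbb"], ["ab", "abbbb"]),
]

results = {}
for pattern, accepted, rejected in cases:
    regex = re.compile(pattern)
    # fullmatch() succeeds only if the whole string fits the pattern.
    all_accepted = all(regex.fullmatch(s) for s in accepted)
    any_rejected = any(regex.fullmatch(s) for s in rejected)
    results[pattern] = (all_accepted, any_rejected)
    print(pattern, "behaves as in the table:", all_accepted and not any_rejected)
```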

> With these basics in mind, let's return to our email address matcher:

了解了上述基础只是后，让我们回到我们的电子邮件地址的例子：

In [63]:
email = re.compile(r'\w+@\w+\.[a-z]{3}')

> We can now understand what this means: we want one or more alphanumeric characters (``"\w+"``) followed by the *at sign* (``"@"``), followed by one or more alphanumeric characters (``"\w+"``), followed by a period (``"\."`` – note the need for a backslash escape), followed by exactly three lower-case letters.

现在我们能理解这个表达式了：我们首先需要一个或多个字母数字字符`"\w+"`，然后需要字符`"@"`，然后需要一个或多个字母数字字符`"\w+"`，然后需要一个`"\."`（注意这里使用了反斜杠，因此这个点没有特殊含义），最后我们需要正好三个小写字母。

> If we want to now modify this so that the Obama email address matches, we can do so using the square-bracket notation:

如果我们需要修改这个正则表达式，让它可以匹配奥巴马的电子邮件地址的话，我们可以使用中括号写法：

In [65]:
email2 = re.compile(r'[\w.]+@\w+\.[a-z]{3}')
email2.findall('barack.obama@whitehouse.gov')

['barack.obama@whitehouse.gov']

> We have changed ``"\w+"`` to ``"[\w.]+"``, so we will match any alphanumeric character *or* a period.
With this more flexible expression, we can match a wider range of email addresses (though still not all – can you identify other shortcomings of this expression?).

上面我们将`"\w+"`改成了`"[\w.]+"`，因此我们可以在这里匹配上任意的字母数字*或*点号。经过这一修改后，这一正则表达式能够匹配更多的电子邮件地址了（虽然还不是全部 - 你能举例说明哪些电子邮件地址不能匹配到吗？）

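Translator's note: `"[\w.]+"` does not need to be written as `"[\w\.]+"`, because inside square brackets in a regular expression, characters other than `^`, `-`, `]`, and `\` have no special meaning.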
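As a partial answer to the question of remaining shortcomings, the sketch below runs the same expression on three made-up addresses that it handles poorly: a hyphen in the user name, a two-letter suffix, and a domain with a subdomain.

```python
import re

email2 = re.compile(r"[\w.]+@\w+\.[a-z]{3}")

# A hyphen is not in [\w.], so only the part after it is found.
hyphen = email2.findall("first-last@example.org")

# A two-letter suffix cannot satisfy [a-z]{3}, so nothing matches.
short_suffix = email2.findall("someone@example.io")

# With a subdomain, the match stops three letters into the second label.
subdomain = email2.findall("user@mail.example.com")

print(hyphen, short_suffix, subdomain)
```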

#### Parentheses indicate *groups* to extract

#### 使用小括号进行分组匹配

> For compound regular expressions like our email matcher, we often want to extract their components rather than the full match. This can be done using parentheses to *group* the results:

对于像上面的电子邮件地址匹配那样复杂的正则表达式来说，我们通常希望提取他们的部分内容而非完全匹配。这可以使用小括号进行分组匹配来完成：

In [66]:
email3 = re.compile(r'([\w.]+)@(\w+)\.([a-z]{3})')

In [67]:
text = "To email Guido, try guido@python.org or the older address guido@google.com."
email3.findall(text)

[('guido', 'python', 'org'), ('guido', 'google', 'com')]

> As we see, this grouping actually extracts a list of the sub-components of the email address.

正如结果所示，这个分组后的正则表达式将电子邮件地址的各个部分分别提取了出来。

> We can go a bit further and *name* the extracted components using the ``"(?P<name> )"`` syntax, in which case the groups can be extracted as a Python dictionary:

更进一步，我们可以给提取出来的各个部分*命名*，这可以通过使用`"(?P<name>)"`的语法实现，在这种情况下，匹配的分组将会提取到Python的字典结构当中：

In [68]:
email4 = re.compile(r'(?P<user>[\w.]+)@(?P<domain>\w+)\.(?P<suffix>[a-z]{3})')
match = email4.match('guido@python.org')
match.groupdict()

{'user': 'guido', 'domain': 'python', 'suffix': 'org'}
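Individual groups can also be read one at a time with ``group()``, by name or by position. Since ``search()`` and ``match()`` return ``None`` when nothing matches, it is worth checking the result first. A minimal sketch on an illustrative sentence:

```python
import re

email4 = re.compile(r"(?P<user>[\w.]+)@(?P<domain>\w+)\.(?P<suffix>[a-z]{3})")

match = email4.search("Contact: guido@python.org")
if match is not None:
    user = match.group("user")   # access by group name
    domain = match.group(2)      # access by position (1-based)
    whole = match.group(0)       # group 0 is the entire match
    print(user, domain, whole)

# No address here, so search() returns None.
no_match = email4.search("no address in this text")
print(no_match)
```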

> Combining these ideas (as well as some of the powerful regexp syntax that we have not covered here) allows you to flexibly and quickly extract information from strings in Python.

把上述知识结合起来（加上还有很多我们没有介绍的很强大的正则表达式语法功能）能让你迅速灵活地从字符串中提取信息。

### Further Resources on Regular Expressions

### 正则表达式拓展阅读

> The above discussion is just a quick (and far from complete) treatment of this large topic.
If you'd like to learn more, I recommend the following resources:

> - [Python's ``re`` package Documentation](https://docs.python.org/3/library/re.html): I find that I promptly forget how to use regular expressions just about every time I use them. Now that I have the basics down, I have found this page to be an incredibly valuable resource to recall what each specific character or sequence means within a regular expression.
> - [Python's official regular expression HOWTO](https://docs.python.org/3/howto/regex.html): a more narrative approach to regular expressions in Python.
> - [Mastering Regular Expressions (O'Reilly, 2006)](http://shop.oreilly.com/product/9780596528126.do) is a 500+ page book on the subject. If you want a really complete treatment of this topic, this is the resource for you.

上面对于正则表达式的介绍只是一个快速的入门（远远未达到完整的介绍）。如果你希望学习更多的内容，下面是作者推荐的一些资源：

- [Python的`re`模块文档](https://docs.python.org/3/library/re.html): 每次作者忘记了如何使用正则表达式时都会去浏览它。
- [Python官方正则表达式HOWTO](https://docs.python.org/3/howto/regex.html): 对于Python正则表达式更加详尽介绍。
- [掌握正则表达式(OReilly, 2006)](http://shop.oreilly.com/product/9780596528126.do) 是一本500多页的正则表达式的书籍，你如果需要完全了解正则表达式的方方面面，这是一个不错的选择。

> For some examples of string manipulation and regular expressions in action at a larger scale, see [Pandas: Labeled Column-oriented Data](15-Preview-of-Data-Science-Tools.ipynb#Pandas:-Labeled-Column-oriented-Data), where we look at applying these sorts of expressions across *tables* of string data within the Pandas package.

如果需要学习更多有关字符串操作和正则表达式使用的例子，可以参见[Pandas: 标签化的列数据](15-Preview-of-Data-Science-Tools.ipynb#Pandas:-Labeled-Column-oriented-Data)，那里我们会对Pandas下的表状数据进行字符串的处理和正则表达式的应用。