

# Chapter 3: Regular Expression (正则表达式)


**A regular expression (regex or regexp for short) is a pattern that describes a set of strings**.

With regular expressions, you can:
   - Search through a file to find the text matching the regex.
   - Edit (delete or replace) the text matching the pattern.
   - Extract information from the text (which is useful in text mining).
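
As a rough illustration of these three uses, here is a minimal sketch with Python's standard `re` module. The sample text and the simple `yyyy-mm-dd` pattern are invented for this example and are not part of the chapter's material.

```python
import re

# Hypothetical sample text, made up for this illustration.
text = "Released 2016-02-23, patched 2016-03-01."
date = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")  # a simple yyyy-mm-dd pattern

# 1. Search: find the first piece of text matching the regex.
first = date.search(text).group()

# 2. Edit: replace every match with a placeholder.
redacted = date.sub("<DATE>", text)

# 3. Extract: collect all matches into a list.
all_dates = date.findall(text)

print(first)
print(redacted)
print(all_dates)
```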
   


## Regular expression: Examples

   - <span style="color:red">regex</span> is a regular expression to match exactly the string "regex".
   - <span style="color:red">\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,}\b</span> is a more complex regular expression that matches an email address (assuming case-insensitive matching, since its character classes list only upper-case letters).
   
If you want to get your feet wet with regular expressions, continue with the next section.

## Literal characters (普通文本字符)

1. The most basic regular expression consists of a single literal character, for example, <span style="color:red">a</span>, which matches the first occurrence of 'a' in the string "__J<font color='red'>a</font>ck is a student__".
2. The second kind of regular expression consists of a series of literal characters, such as <span style="color:red">cat</span>, which matches the three contiguous literal characters in the string "__About <font color='red'>cat</font>s and dogs__".
3. Note that the regex engine is case-sensitive by default, so the regex <font color='red'>cat</font> does **NOT match** the string "<font color='red'>Cat</font>s and dogs" unless it is explicitly told to ignore case.
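
The case-sensitivity rule can be checked with a short Python sketch. It assumes the standard `re` module, where the `re.IGNORECASE` flag is the way to "explicitly tell" the engine to ignore case; other tools use their own switch (for example `grep -i`).

```python
import re

sentence = "Cats and dogs"

# Default behaviour: 'c' does not match 'C', so there is no match.
case_sensitive = re.search(r"cat", sentence)

# With the IGNORECASE flag the same regex matches "Cat".
ignore_case = re.search(r"cat", sentence, re.IGNORECASE)

print(case_sensitive)
print(ignore_case.group())
```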

## Special characters (特殊字符/元字符)

Special characters (a.k.a. metacharacters) are so named because they carry special meanings:

1. the backslash \ is the escape character
2. the caret ^ anchors the start of a string or line
3. the dollar sign $ anchors the end of a string or line
4. the period or dot . matches any single character (in most flavors, any character except a newline)
5. the vertical bar or pipe symbol | separates alternatives
6. the question mark ? makes the preceding token optional
7. the asterisk or star * repeats the preceding token zero or more times
8. the plus sign + repeats the preceding token one or more times
9. the opening parenthesis ( starts a group
10. the closing parenthesis ) ends a group
11. the opening square bracket [ starts a character class
12. the opening curly brace { starts a repetition count such as {m,n}

<font color='red'>Note</font>: If you want to use these special characters as literal characters, you need to **escape them with a backslash**.
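
To see why escaping matters, here is a small Python sketch. The string `1+1=2` is just an illustrative choice; `re.escape` is the standard-library helper that adds the backslashes for you.

```python
import re

literal = "1+1=2"

# Unescaped: '+' is a quantifier, so the regex means "one or more 1s, then 1=2".
unescaped = re.search(r"1+1=2", literal)

# Escaped: '\+' is a literal plus sign.
escaped = re.search(r"1\+1=2", literal)

# re.escape() escapes every metacharacter in a string automatically.
auto = re.search(re.escape(literal), literal)

print(unescaped)        # no match
print(escaped.group())
print(auto.group())
```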

## Non-printable characters

| Characters | Meaning |
| --- | --- |
| \t  | TAB character (ASCII: 0x09) |
| \r  | carriage return (ASCII: 0x0D) |
| \n  | line feed (ASCII: 0x0A) |
| \a  | bell (ASCII: 0x07) |
| \e  | escape (ASCII: 0x1B) |
| \f  | form feed (ASCII: 0x0C) |
| \v  | vertical TAB (ASCII: 0x0B) |
| \cA - \cZ | CTRL+A - CTRL+Z |

## Character classes or character sets (字符集合)

In regular expressions, a character class (or character set) is enclosed within a pair of square brackets [] and matches exactly one character out of the several listed.

| Regular expressions | Meaning |
| --- | --- |
| gr[ae]y | <font color="red">gray</font> or <font color="red">grey</font> |
| [a-z] | a single lower-case letter between a and z |
| [A-Z] | a single upper-case letter between A and Z |
| [0-9] | a single digit between 0 and 9 |
| q[^u] | a "<font color="red">q</font>" followed by a character that is NOT "<font color="red">u</font>".|
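
The rows of this table can be tried out with a brief Python sketch. The test strings are invented for illustration. Note in particular that a negated class such as `[^u]` still has to match a character, so `q[^u]` cannot match a "q" at the very end of a string.

```python
import re

# gr[ae]y matches "gray" and "grey", but not "griy".
greys = re.findall(r"gr[ae]y", "gray, grey, griy")

# [0-9] matches one digit at a time.
digits = re.findall(r"[0-9]", "room 42b")

# q[^u] needs some non-"u" character after the "q" (here: a space).
q_then_other = re.search(r"q[^u]", "Iraq is a country")

# At the end of the string there is no character for [^u] to match.
q_at_end = re.search(r"q[^u]", "Iraq")

print(greys, digits, repr(q_then_other.group()), q_at_end)
```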

Metacharacters inside square brackets do not need to be escaped with a backslash, **with the exception of**

   - **the closing square bracket (])**,
   - **the caret (^) in the first position**,
   - **the hyphen (-) anywhere other than the first or last position**, and
   - **the backslash (\\) itself** (in Perl-style flavors such as Python's; POSIX bracket expressions treat the backslash as a literal character).
   
### <font color="blue">Exercise</font>
For each regular expression below, write down what it matches in the given string(s):

| Regular expression | String(s) to search |
| --- | --- |
| [b-f^] | "aegf^" |
| [^b-f] | "aegf^" |
| [0-9-] | "021-34204348" |
| [0-123456789] | "021-34204348" |
| [1-9][0-9.][0-9] | 12.5, 2.3, 125 |
| [1-9][0-9].[0-9] | 12.5, 2.3, 125 |
| [1-9][0-9]\\.[0-9] | 12.5, 2.3, 125 |

## Repeating character classes (重复)

If you repeat a character class by using the <font color="red">?, * or +</font> operators, you're repeating the entire character class. You're not repeating just the character that it matched. **The regex [0-9]+ can match 837 as well as 222**.
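
A quick way to convince yourself of this is the following Python sketch. The third test string, `83a`, is an added counterexample that is not from the text above.

```python
import re

number = re.compile(r"[0-9]+")

# Each repetition may pick a different digit from the class,
# so both "837" and "222" are matched in full.
samples = ["837", "222", "83a"]
full_matches = [s for s in samples if number.fullmatch(s)]

print(full_matches)
```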

### <font color="blue">Exercise</font>

1. Can the regular expression "([0-9])\1+" match the numbers "22", "333" and "234"?
2. How about the regular expression "([0-9])\1\*"?

<font color="red">Note</font>: Here we use the backreference (\\1) to repeat the matched character itself, rather than the character class.

## Word boundaries (单词边界)

The metacharacter <font color="red">\\b</font> is an anchor like the caret (^) and the dollar ($) sign. It matches at a position that is called a "**word boundary**". This match is *zero-length*.

There are **three kinds of word boundaries**:

   - Before the first character in the string, if the first character is a word character.
   - After the last character in the string, if the last character is a word character.
   - Between two characters in the string, where one is a word character and the other is not a word character.
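
The three kinds of boundary can be made visible by listing the positions at which `\b` matches. This is a minimal Python sketch; the sample strings are illustrative only.

```python
import re

text = "Hi, you"

# \b is zero-length, so each match is an empty string; we record its position.
# 0: before "H"; 2: between "i" and ","; 4: between " " and "y"; 7: after "u".
boundaries = [m.start() for m in re.finditer(r"\b", text)]

# Wrapping a word in \b...\b restricts the match to whole words.
whole_word = re.findall(r"\byou\b", "you, your, bayou")

print(boundaries)
print(whole_word)
```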
   
### <font color="blue">Exercise</font>
1. Is the underscore character (_) a word character? Use an example to verify your statement.
2. Can the regular expression "[aeiou]\b" match the following texts?
   - "Goodbye!"
   - "Bye_"
   - "Here you are."

## Repetitions (重复)

There are 3 kinds of repetitions:

1. ?, \*, + repeat the preceding token 0-1, 0-$\infty$, and 1-$\infty$ times, respectively.
2. {m,n} repeats the preceding token between $m$ and $n$ times. When $n$ is omitted ({m,}), the upper limit is $\infty$; in flavors that also allow $m$ to be omitted ({,n}), such as Python and GNU `grep`, the lower limit is 0.
3. A backreference \n placed after a group () repeats the text that the group matched. Here $n$ is the group number: 1, 2, 3, and so on.
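
The quantifiers can be compared side by side with a small Python sketch. The helper `matches` is a hypothetical name introduced only for this example; it keeps the strings that a pattern matches in full.

```python
import re

def matches(pattern, strings):
    """Return the strings that the pattern matches from start to end."""
    return [s for s in strings if re.fullmatch(pattern, s)]

samples = ["", "a", "aa", "aaa", "aaaa"]

zero_or_one = matches(r"a?", samples)       # 0 or 1 repetition
zero_or_more = matches(r"a*", samples)      # any number of repetitions
one_or_more = matches(r"a+", samples)       # at least 1 repetition
two_to_three = matches(r"a{2,3}", samples)  # between 2 and 3 repetitions
at_least_two = matches(r"a{2,}", samples)   # 2 or more repetitions

for name, result in [("a?", zero_or_one), ("a*", zero_or_more),
                     ("a+", one_or_more), ("a{2,3}", two_to_three),
                     ("a{2,}", at_least_two)]:
    print(name, result)
```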

### <font color="blue">Exercise</font>
Tell whether the following statements for the metacharacters are TRUE:
   - (a) \* is equivalent to {0,}
   - (b) + is equivalent to {1,}
   - (c) ? is equivalent to {0,1}
   - (d) ([a-z])\1 is equivalent to [a-z]{2}

## Optional items (可选项)

The **question mark (?)** or the **limited repetition ({0,1})** makes the preceding token in the regular expression optional.

| Regular expression | Meaning |
| --- | --- |
| colou?r | matches both "color" and "colour" |
| Nov(ember)? | matches "Nov" and "November" |
| Feb(ruary)? 23(rd)? | matches "February 23rd", "February 23", "Feb 23rd" and "Feb 23" |
| colou{0,1}r | matches both "color" and "colour" |
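
The third row of the table can be verified with a short Python sketch. The last candidate, `Febr 23`, is an added counterexample that is not from the table.

```python
import re

date = re.compile(r"Feb(ruary)? 23(rd)?")

candidates = ["February 23rd", "February 23", "Feb 23rd", "Feb 23", "Febr 23"]

# Both groups are optional as a whole: "ruary" and "rd" are either
# fully present or fully absent, so "Febr 23" is rejected.
accepted = [s for s in candidates if date.fullmatch(s)]

print(accepted)
```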

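**<font color="red">Note</font>** that POSIX BRE (Basic Regular Expression) and GNU BRE do not support either syntax as written. These flavors require backslashes to give curly braces their special meaning: `colou\{0,1\}r`. GNU BRE additionally accepts `colou\?r` as an extension; POSIX BRE itself has no optional operator other than `\{0,1\}`.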

### Important Regex Concept: Greediness (贪婪匹配) or Laziness (懒惰匹配)

The question mark is a greedy metacharacter, which means that it first tries to match the optional part and skips it only if the rest of the regex cannot match otherwise. For example, if we apply the regex <font color="red">"Feb 23(rd)?"</font> to the string <font color="blue">"Today is Feb 23rd, 2003"</font>, the match is always <font color="red">Feb 23rd</font> and not <font color="red">Feb 23</font>. You can make the question mark **lazy** (i.e. turn off the greediness) by **putting a second question mark after the first**, as in <font color="red">Feb 23(rd)??</font>. This works in Perl-style flavors, including Python.

By default, `grep` does NOT support non-greedy matching, but you can use `grep -P` to use the Perl syntax:

```bash
# -o prints only the matched text, so the lazy "(rd)??" visibly leaves out "rd"
echo "Feb 23rd, 2016" | grep -oP "Feb 23(rd)??"
echo "February 23rd, 2016" | grep -oP "Feb(ruary)? 23(rd)??"
```
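
The same comparison can be sketched in Python, whose `re` module supports the lazy `??` quantifier. The third pattern, with a trailing comma, is an addition for illustration: it shows that a lazy quantifier still matches the optional part when the rest of the regex requires it.

```python
import re

text = "Today is Feb 23rd, 2003"

# Greedy: the optional group is matched whenever possible.
greedy = re.search(r"Feb 23(rd)?", text).group()

# Lazy: the optional group is skipped whenever possible.
lazy = re.search(r"Feb 23(rd)??", text).group()

# Lazy, but the following comma can only match after "rd" has been consumed.
forced = re.search(r"Feb 23(rd)??,", text).group()

print(greedy)
print(lazy)
print(forced)
```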

## Grouping with parentheses (分组)

By placing part of a regular expression inside round brackets or parentheses (), you can group that part of the regular expression together. This allows you to apply a quantifier to the entire group or to restrict alternation to part of the regex.

Only parentheses can be used for grouping. Square brackets ([]) define a character class, and curly braces ({}) are used by a quantifier with specific limits.

The regex <font color="red">Set(Value)?</font> matches <font color="blue">Set</font> or <font color="blue">SetValue</font>. In the first case, **the first (and only) capturing group remains empty**. In the second case, **the first capturing group matches Value**.

**The captured groups can be backreferenced by backslash+NUM, such as \\1, \\2**, and so on.
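
Here is a minimal Python sketch of capturing and backreferencing. Note one flavor-specific detail: where the text above says a group "remains empty", Python reports `None` for a group that did not take part in the match. The repeated-word pattern and its sample sentence are invented for this illustration.

```python
import re

pattern = re.compile(r"Set(Value)?")

# Group 1 did not participate in the match: Python returns None.
without_value = pattern.fullmatch("Set").group(1)

# Group 1 captured "Value".
with_value = pattern.fullmatch("SetValue").group(1)

# \1 must match exactly the text captured by group 1 (here: a repeated word).
repeated = re.search(r"([a-z]+) \1", "it is is fine")

print(without_value, with_value)
print(repeated.group(), "|", repeated.group(1))
```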

### <font color="blue">Exercise</font>

Write down the outputs without executing the following commands. 
```bash
echo "abbabba" | grep -E "((a)(b))\3\2"
echo "ababba" | grep -E "(..)\1"
```

## Grouping: Non-capturing, named-capturing and branch-reset capturing

### 1. Non-capturing group

If you do not need the group to capture its match, you can optimize this regular expression into <font color="red">Set(?:Value)?</font>. The **question mark plus colon after the opening parenthesis <font color="red">(?:</font>** is the syntax that creates a non-capturing group.

<font color="red">color=(?:red|green|blue)</font> is another regex with a non-capturing group. This regex has no quantifiers.
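
The practical difference shows up clearly in Python's `re.findall`, which returns the group contents when the pattern has a capturing group and the whole match otherwise. This is a small sketch with a made-up input string.

```python
import re

text = "color=red; color=blue"

# With a capturing group, findall returns what the group captured.
capturing = re.findall(r"color=(red|green|blue)", text)

# With a non-capturing group, findall returns the whole match.
non_capturing = re.findall(r"color=(?:red|green|blue)", text)

# A compiled pattern reports how many capturing groups it contains.
group_count = re.compile(r"Set(?:Value)?").groups

print(capturing)
print(non_capturing)
print(group_count)
```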

### 2. Named-capturing group

Nearly all modern regular expression engines support **numbered capturing groups** and **numbered backreferences**. *Long regular expressions with lots of groups and backreferences may be hard to read*. They can be particularly **difficult to maintain as adding or removing a capturing group** in the middle of the regex upsets the numbers of all the groups that follow the added or removed group.

Python's `re` module was the first to offer a solution: **named capturing groups** and **named backreferences**. Here is the syntax:
```python
(?P<name>group)
```
which captures the match of "group" into the backreference "name". 

   - *The "name" must be an alphanumeric sequence starting with a letter*.
   - *The "group" can be any regular expression*.
   
You can reference the contents of the group with the named backreference <font color="red">(?P=name)</font>.
   - The question mark, P, angle brackets, and equals signs are all part of the syntax.
   - Though the syntax for the named backreference uses parentheses, it's just a backreference that doesn't do any capturing or grouping.
   
One example is matching a pair of HTML tags (with case-insensitive matching turned on):

```python
<(?P<tag>[A-Z][A-Z0-9]*)\b[^>]*>.*?</(?P=tag)>
```

When doing a search-and-replace in Python, you can use
```python
\g<name>
```
in the replacement text to insert the text matched by the named capturing group.
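
Putting the pieces together, the following Python sketch uses the HTML-tag regex with a named group, a named backreference, and `\g<name>` in a replacement. The sample HTML string and the bracketed replacement format are assumptions made for this illustration.

```python
import re

# The HTML-tag regex from above; IGNORECASE lets [A-Z] match lower-case tags.
tag_pair = re.compile(r"<(?P<tag>[A-Z][A-Z0-9]*)\b[^>]*>.*?</(?P=tag)>",
                      re.IGNORECASE)

html = '<b class="x">bold</b> and <i>italic</i>'

# Read the named group from every match.
tags = [m.group("tag") for m in tag_pair.finditer(html)]

# Replace each element with its tag name in brackets, using \g<tag>.
labelled = tag_pair.sub(r"[\g<tag>]", html)

# The named backreference requires the closing tag to equal the opening one.
mismatched = tag_pair.search("<b>text</i>")

print(tags)
print(labelled)
print(mismatched)
```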

### 3. Branch-reset capturing group

Alternations inside a branch reset group share the same capturing groups. The syntax is <font color="red">(?|regex)</font> where <font color="red">(?|</font> opens the group and <font color="red">regex</font> is any regular expression. If you don't use any alternation or capturing groups inside the branch reset group, then its special function doesn't come into play. It then acts as a non-capturing group. Branch reset groups are a Perl/PCRE feature (available through `grep -P`); Python's `re` module does not support them.

The regex <font color="red">(?|(a)|(b)|(c))</font> consists of a single branch reset group with three alternatives. This regex matches either a, b, or c. The regex has only a single capturing group with number 1 that is shared by all three alternatives. After the match, \$1 holds a, b, or c.

Compare this with the regex <font color="red">(a)|(b)|(c)</font> that lacks the branch reset group. This regex also matches a, b, or c. But it has three capturing groups. After the match, \$1 holds a or nothing at all, \$2 holds b or nothing at all, while \$3 holds c or nothing at all.

Backreferences to capturing groups inside branch reset groups work like you'd expect. <font color="red">(?|(a)|(b)|(c))\\1</font> matches aa, bb, or cc. Since only one of the alternatives inside the branch reset group can match, the alternative that participates in the match determines the text stored by the capturing group and thus the text matched by the backreference.

The alternatives in the branch reset group don't need to have the same number of capturing groups. <font color="red">(?|abc|(d)(e)(f)|g(h)i)</font> has three capturing groups. When this regex matches abc, all three groups are empty. When def is matched, \$1 holds d, \$2 holds e and \$3 holds f. When ghi is matched, \$1 holds h while the other two are empty.


### <font color="blue">Exercise</font>

1. Write down the output without running the following commands, and then explain your reasoning:

```bash
echo "abcaghiideff" | grep -oP "(?|(a)(b)(c)|de(f)|g(h)i)\1*"
echo "abcaghiideff" | grep -oE "(?|(a)(b)(c)|de(f)|g(h)i)\1*"
echo "abcaghiideff" | grep -oP "(?|(a)(b)(c)|de(f)|g(h)i)\1?"
echo "abcaghiideff" | grep -oE "(?|(a)(b)(c)|de(f)|g(h)i)\1?"
echo "abcaghiideff" | grep -oP "(?|(a)(b)(c)|de(f)|g(h)i)\1+"
echo "abcaghiideff" | grep -oE "(?|(a)(b)(c)|de(f)|g(h)i)\1+"
```


## Grouping Example: Day and Month with Accurate Number of Days

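It's time for a more practical example. The following two regular expressions match a date in m/d or mm/dd format. They exclude invalid dates such as 2/31, 02/30, 4/31. Both are written in free-spacing mode, in which whitespace and `#` comments inside the pattern are ignored. Note that, as written, the day part <font color="red">[012]?[0-9]</font> also accepts 0 and 00.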

```
^(?:(0?[13578]|1[02])/([012]?[0-9]|3[01]) # 31 days
 |  (0?[469]|11)/([012]?[0-9]|30)         # 30 days
 |  (0?2)/([012]?[0-9])                   # 29 days
 )$
```

The first version uses a **non-capturing group** <font color="red">(?:regex)</font> to group the alternatives. It has six separate capturing groups. \$1 and \$2 would hold the month and the day for months with 31 days, \$3 and \$4 for months with 30 days, and \$5 and \$6 would only be used for February.
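
Since Python's `re` module supports non-capturing groups and free-spacing mode (`re.VERBOSE`), this first version can be tried directly. The sketch below is only an illustration: `MONTH_DAY` and `month_day` are hypothetical names, and the helper shows the bookkeeping that six separate groups force on the caller.

```python
import re

# First version from above; re.VERBOSE ignores whitespace and # comments.
MONTH_DAY = re.compile(r"""
^(?:(0?[13578]|1[02])/([012]?[0-9]|3[01]) # 31 days
 |  (0?[469]|11)/([012]?[0-9]|30)         # 30 days
 |  (0?2)/([012]?[0-9])                   # 29 days
 )$
""", re.VERBOSE)

def month_day(text):
    """Return (month, day) as strings, or None if text is not a valid date."""
    m = MONTH_DAY.match(text)
    if m is None:
        return None
    # Only one alternative took part, so exactly two groups are not None.
    month, day = (g for g in m.groups() if g is not None)
    return month, day

print(MONTH_DAY.match("2/29").groups())  # only groups 5 and 6 are filled
print(month_day("12/31"))
print(month_day("04/30"))
print(month_day("4/31"))                 # invalid date
```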

```
^(?|(0?[13578]|1[02])/([012]?[0-9]|3[01]) # 31 days
 |  (0?[469]|11)/([012]?[0-9]|30)         # 30 days
 |  (0?2)/([012]?[0-9])                   # 29 days
 )$
```

The second version uses a **branch reset group** <font color="red">(?|regex)</font> to group the alternatives and merge their capturing groups. Now there are only two capturing groups that are shared between the three alternatives. When a match is found, \$1 always holds the month, and \$2 always holds the day, regardless of the number of days in the month.

### <font color="blue">Exercise</font>

1. How would you write a regex that matches a valid date in `yyyy-mm-dd`, `yyyy-m-d`, `yyyy.mm.dd`, `yyyy.m.d`, `yyyy/mm/dd`, or `yyyy/m/d` format? (Hint: the year can be 19xx or 20xx, the month can be 1-12, and the day can be 1-31 depending on the month. We can use \\d to represent a single digit.)

# About `grep`




### Versions of Regular Expression

Before we turn to `grep`, let's first introduce three versions of regular expressions:

1. **Basic regular expression (BRE)**

BRE suppresses the special meaning of the metacharacters <font color="red">?, +, {, |, (, and )</font>. To use them as metacharacters, escape them with a backslash.

2. **Extended regular expression (ERE)**

In ERE, the above metacharacters have their special meanings without a backslash. To use extended regular expressions, give `grep` the option `-E` (not to be confused with the lowercase `-e`, which introduces a pattern).

3. **Perl-style regular expression (PRE)**

PRE (what `grep -P` provides) is the most powerful of the three flavors.

One difference between ERE and PRE is that PRE supports lazy matching, while ERE does NOT. Here is an example:
```
echo "see" | grep -E "e??"
echo "see" | grep -P "e??"
```
The regular expression "e?" matches either the empty string "" or a single "e". With `-P`, the lazy "e??" prefers the empty match at every position, so the second command highlights nothing (assuming colored output is enabled). With `-E`, GNU `grep` treats the second question mark as just another greedy quantifier, so the first command matches each "e" and highlights the substring "ee" in the string "see".

### What is `grep`?

`grep` searches input files for lines containing a match to a given pattern list.  When it finds a match in a line, it copies the line to STDOUT (by default), or produces whatever other sort of output you have requested with options.

Though `grep` expects to do the matching on text, it has no limits on input line length other than available memory, and it can match arbitrary characters within a line.  If the final byte of an input file is not a newline, `grep` silently supplies one.  Since newline is also a separator for the list of patterns, there is no way to match newline characters in a text.

Here is the SYNOPSIS for `grep`:
```bash
grep [OPTIONS] PATTERN [FILE...]
grep [OPTIONS] [-e PATTERN | -f FILE] [FILE...]
```



# `grep` tutorial

```
Pattern selection:
 -E, --extended-regexp     use extended regular expressions (ERE); same as egrep
 -F, --fixed-strings       PATTERN is a set of newline-separated fixed strings with no metacharacters; same as fgrep
 -G, --basic-regexp        use basic regular expressions (BRE); this is the default
 -P, --perl-regexp         use Perl-style regular expressions (PRE)
 -e, --regexp=PATTERN      use PATTERN for matching; may be repeated: -e REG1 -e REG2
 -f, --file=FILE           read patterns from FILE
 -i, --ignore-case         ignore case distinctions
 -w, --word-regexp         match whole words only
 -x, --line-regexp         match whole lines only
 -z, --null-data           a data line ends in a 0 byte, not a newline

Miscellaneous:
 -s, --no-messages         suppress error messages
 -v, --invert-match        select non-matching lines
 -V, --version             print version information
 --help                    display help text
 --mmap                    use memory-mapped input if possible

Output control:
 -m, --max-count=NUM       stop after NUM matching lines
 -b, --byte-offset         print the byte offset with each output line
 -n, --line-number         print the line number with each output line
 --line-buffered           flush output on every line
 -H, --with-filename       print the file name for each match
 -h, --no-filename         suppress the file name prefix on output
 --label=LABEL             use LABEL as the file name for standard input
 -o, --only-matching       print only the matched part of a line, not the whole line
 -q, --quiet, --silent     suppress all normal output
 --binary-files=TYPE       treat binary files as TYPE: 'binary', 'text', or 'without-match'
 -a, --text                treat binary files as text
 -I                        skip binary files
 -d, --directories=ACTION  how to handle directories: ACTION is 'read', 'recurse', or 'skip'
 -D, --devices=ACTION      how to handle devices, FIFOs and sockets: ACTION is 'read' or 'skip'
 -R, -r, --recursive       search directories recursively
 --include=PATTERN         files that match PATTERN will be examined
 --exclude=PATTERN         files that match PATTERN will be skipped
 --exclude-from=FILE       files that match a PATTERN in FILE will be skipped
 -L, --files-without-match print only the names of files with no match
 -l, --files-with-matches  print only the names of files with a match
 -c, --count               print only a count of matching lines per file
 -Z, --null                print a 0 byte after each file name

Context control:
 -B, --before-context=NUM  print NUM lines of leading context
 -A, --after-context=NUM   print NUM lines of trailing context
 -C, --context=NUM         print NUM lines of context before and after
 -NUM                      same as --context=NUM
 --color[=WHEN],
 --colour[=WHEN]           highlight the matching text; WHEN is 'always', 'never', or 'auto'
```

### `grep` Example

Here is the test file `test`:

```
root:x:0:0:root:/root:/bin/bash  
bin:x:1:1:bin:/bin:/bin/false,aaa,bbbb,cccc,aaaaaa  
DADddd:x:2:2:daemon:/sbin:/bin/false  
mail:x:8:12:mail:/var/spool/mail:/bin/false  
ftp:x:14:11:ftp:/home/ftp:/bin/false  
nobody:$:99:99:nobody:/:/bin/false  
bio:x:500:500:,,,:/home/bio:/bin/bash  
http:x:33:33::/srv/http:/bin/false  
dbus:x:81:81:System message bus:/:/bin/false  
hal:x:82:82:HAL daemon:/:/bin/false  
mysql:x:89:89::/var/lib/mysql:/bin/false  
aaa:x:1001:1001::/home/aaa:/bin/bash  
ba:x:1002:1002::/home/zhangy:/bin/bash  
test:x:1003:1003::/home/test:/bin/bash  
@bio:*:1004:1004::/home/test:/bin/bash  
policykit:x:102:1005:Po  
```

### 1. Let's apply the BRE on the file to find the lines starting with "root" or "bio":
```
cat test | grep "^\(root\|bio\)"
```
Since BRE does not treat (, |, and ) as metacharacters, we use backslashes to give them their special meanings. Here is the output:
```
root:x:0:0:root:/root:/bin/bash
bio:x:500:500:,,,:/home/bio:/bin/bash  
```

To use the ERE, we can apply the option "-E" to obtain the special meanings of the metacharacters:
```
cat test | grep -E "^(root|bio)"
```
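
For comparison, the same line filter can be sketched in Python. This is only an approximation of what `grep` does (keep every line in which the regex finds a match); the list below copies four lines from the `test` file rather than reading the file itself.

```python
import re

# Four lines copied from the test file above.
lines = [
    "root:x:0:0:root:/root:/bin/bash",
    "bin:x:1:1:bin:/bin:/bin/false,aaa,bbbb,cccc,aaaaaa",
    "bio:x:500:500:,,,:/home/bio:/bin/bash",
    "@bio:*:1004:1004::/home/test:/bin/bash",
]

# Python's syntax resembles ERE here: (, | and ) need no backslashes.
starts_with = re.compile(r"^(root|bio)")

# Like grep, keep each line that contains a match.
selected = [line for line in lines if starts_with.search(line)]

for line in selected:
    print(line)
```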

### <font color="blue">Exercise</font>
What if we apply the following regular expression?
```
cat test | grep "^root\|bio"
```
Explain why.

### 2. Character set

### 3. Grouping and backreference

### 4. Repetition

### 5. Lazy matching

To apply lazy matching, use the option "-P" (that is, PRE) and add a question mark (?) after the quantifier "?", "*", or "+".

```
echo "eeeeee" | grep -P "^ee??"
echo "eeeeee" | grep -E "^ee??"
echo "eeeeee" | grep -P "^ee+?"
echo "eeeeee" | grep -E "^ee+?"
echo "eeeeee" | grep -P "^ee*?"
echo "eeeeee" | grep -E "^ee*?"
```

### <font color="blue">Exercise</font>

What differences do you see among the outputs of the above commands? Explain the reasons.


## Concluding Remarks

You have now learned the basics of regular expressions and how to apply them with the `grep` command. In the next chapter, we will introduce `sed` and `awk`, which will **extend the usage of regex**. The key topics covered in this chapter were:

1. basic regular expression, extended regular expression, and perl regular expression

2. literal character, metacharacter, character set, repetition, grouping, backreference, alternation

3. `grep` usage