  
  
# Regular Expression

* As the name suggests, a regular expression is a **notation**.
* You define a <span style="color: green">pattern string</span> that follows regular expression syntax, then use that <span style="color: green">pattern</span> to find the parts of a <span style="color: red">target text</span> that match it.

# Diagram

<center> ![Regex diagram](./images/re1.png) </center>

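# Applications

* In-page text search in a browser (Ctrl+F)
* Server-side validation of username and password formats
* Text mining on large data sets
* Substring replacement in documents
* And many more...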


# Quick Glance

In [4]:
import re

# Search sub-string in text
text = "Today is good day to learn regular expression."

# Define re pattern
pattern = r'regular expression'

# Search if there is 'regular expression' in the text
match = re.search(pattern, text)

# Print result
print(match.group())

regular expression


# Using Regular Expressions

* Ordinary text
* Basic workflow
* Common functions
* Special characters
* Repetition syntax

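# Ordinary Text

If the regular expression consists of ordinary text (e.g. 'hello', 'johnny', 'name'), the program looks in the target string for a substring that is **exactly identical** to that text.  
Think of it as the Ctrl+F function you use in a browser.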

![CTRL+F](./images/re2.png)

In [33]:
# Ordinary literal in regular expression
import re

# The text which re applies to
text = "Today is Monday, tomorrow is Tuesday."

# define pattern
pattern = r'Monday'

# search pattern in the text
match = re.search(pattern, text)

# Check
print(type(match))
print(match.group())

<class '_sre.SRE_Match'>
Monday


In [None]:
# Live Demo 

# The text which re applies to
text = "Today is Monday, tomorrow is Tuesday."

# import module

# define pattern

# search pattern

# Check result

In [35]:
# Ordinary literal in regular expression
# re cannot find a match in the text 
import re

# The text which re applies to
text = "Today is Monday, tomorrow is Tuesday."

# Specify pattern
pattern = r'Wednesday'

# Apply re to the text
match = re.search(pattern, text)

# Print out result
print(type(match))
print(match)

<class 'NoneType'>
None
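
Because `re.search` returns `None` when nothing matches, calling `.group()` on the result without checking would raise an `AttributeError`. A minimal sketch of guarding against that is shown below; the helper name `find_day` is hypothetical and only used for illustration.

```python
import re

def find_day(pattern, text):
    """Return the matched string, or None when the pattern is not found."""
    match = re.search(pattern, text)
    # re.search returns None on failure, so test the result before using it
    if match is None:
        return None
    return match.group()

text = "Today is Monday, tomorrow is Tuesday."
print(find_day(r'Monday', text))     # Monday
print(find_day(r'Wednesday', text))  # None
```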


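# Basic Workflow
  
* Define the regular expression  
* Compile it and keep the compiled regex object for reuse  
* Apply the regular expression to the target text  
* Extract the matched string  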

In [3]:
import re

# Define pattern
pattern = r'cookie'

In [4]:
# Better option: Compile regex object
regex = re.compile(pattern)

In [5]:
text = "Cake and cookie"

# Compiled Version
match = regex.search(text)

# Non-Compiled Version
match = re.search(pattern, text)

In [6]:
print(match.group())

cookie


# Common Functions

The classes and functions below are the ones used most often with the re module.
* re module
    * search
    * match
    * sub
    * compile
* Match object methods
    * span
    * group
    * groups
    * groupdict

# [search(pattern, string, flags=0)](https://docs.python.org/3/library/re.html#re.search)

* pattern  
The regular expression to look for (e.g. r'cookie')  
  
* string  
The text the regular expression is applied to
  
* flags (not needed for now)  
Defaults to 0
  
* **Return Value**  
A Match object if some part of string matches pattern; otherwise None.
> **Note**    
> Unlike match, search returns a Match object when the pattern occurs at **any position** in string.

In [7]:
# 5 mins for coding
# re.search 

import re

# text which re applies to
text = "Cake and cookie"

# Define pattern
pattern = "cookie"

# search pattern in the text
match = re.search(pattern, text)

# Check
print("Type of 'match' object:", type(match))
print("Matching pattern: %r" % match.group())
print("Matching location:", match.span())

Type of 'match' object: <class '_sre.SRE_Match'>
Matching pattern: 'cookie'
Matching location: (9, 15)


In [None]:
# Live Demo 

# The text which re applies to
text = "Cake and cookie"

# import module

# define pattern

# search pattern

# Check result

# [match(pattern, string, flags=0)](https://docs.python.org/3/library/re.html#re.match)

* pattern  
The regular expression to look for (e.g. r'cookie')  
  
* string  
The text the regular expression is applied to
  
* flags (not needed for now)  
Defaults to 0
  
* **Return Value**    
A Match object if the beginning of string matches pattern; otherwise None.
> **Note**
> match only compares against the start of string. If the pattern occurs only in the **middle** of string, match does **not** return a Match object.

In [None]:
# 5 mins for coding
# re.match

import re

# texts which re applies to
text1 = "Cake and cookie"
text2 = "cookie and Cake"

# specify pattern
pattern = "cookie"

# search pattern in texts
match1 = re.match(pattern, text1)
match2 = re.match(pattern, text2)

# Check
print("Type of 'match1' object:", type(match1))
print(match1)
print("="*50)
print("Type of 'match2' object:", type(match2))
print("Matching pattern: %r" % match2.group())

In [None]:
# Live Demo 

# texts which re applies to
text1 = "Cake and cookie"
text2 = "cookie and Cake"

# import module

# define pattern

# search pattern in texts

# Check

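# [sub(pattern, repl, string, count=0, flags=0)](https://docs.python.org/3/library/re.html#re.sub)

* pattern  
The regular expression to look for (e.g. r'cookie')
  
* repl  
Either a string or a function.  
If it is a string, every substring that matches pattern is replaced with it.  
If it is a function, it is called with each Match object and its return value is used as the replacement text.
  
* string  
The text the regular expression is applied to
  
* count  
Defaults to 0, which replaces **all** substrings that match pattern.  
If greater than 0, only the first **count** matches are replaced.
  
* flags (not needed for now)  
Defaults to 0
  
* **Return Value**  
A new string with the replacements applied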

In [39]:
# 5 mins for coding
# re.sub with plain string

import re

# text which re applies to
text = "cookie and Cake"

# Define pattern
pattern = r'cookie'

# substitute the string 'cookie' with 'biscuit'
new_text = re.sub(pattern, 'biscuit', text)

# Check
print(type(new_text))
print(new_text)

<class 'str'>
biscuit and Cake
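
The example above replaces every occurrence because `count` is left at its default. As a small sketch of how `count` limits the number of replacements, the text below is an assumed sample with three occurrences of the pattern.

```python
import re

# Assumed sample text containing the pattern three times
text = "cookie, cookie and cookie"
pattern = r'cookie'

# count=0 (the default) replaces every occurrence
print(re.sub(pattern, 'biscuit', text))           # biscuit, biscuit and biscuit

# count=2 replaces only the first two occurrences, scanning left to right
print(re.sub(pattern, 'biscuit', text, count=2))  # biscuit, biscuit and cookie
```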


In [None]:
# Live Demo 

# texts which re applies to
text = "cookie and Cake"

# import module

# Define pattern

# substitute the string 'cookie' with 'biscuit' in text

# Check

In [10]:
# 5 mins for coding
# re.sub with custom function

import re

# text which re applies to
text = "cookie and Cake"

# Define pattern
pattern = r'cookie'

# Custom replace function
def repl(match):
    new_string = match.group()+'-'+match.group()
    return new_string

new_text = re.sub(pattern, repl, text)

print(new_text)

cookie-cookie and Cake


In [None]:
# Live Demo 

# texts which re applies to
text = "cookie and Cake"

# import module

# Define pattern

# Define substitute function

# substitute the string 'cookie' in text with the result of the substitute function

# Check

# [compile(pattern, flags=0)](https://docs.python.org/3/library/re.html#re.compile)

* pattern  
The regular expression to compile
  
* flags (not needed for now)  
Defaults to 0
  
* **Return Value**  
A Pattern object

In [5]:
# re.compile

import re

# Define pattern
pattern = r'cookie'

# Compile
regex = re.compile(pattern)

# Why Compile?

* Keeps a reusable regex object  
* Avoids writing the same pattern repeatedly
* Can run faster when the same pattern is applied many times

In [6]:
%%time

# Newbie style

text1 = 'Cake and cookie'
text2 = 'cookie and Cake'
text3 = 'Cake cookie'

# search pattern
for i in range(100):
    match1 = re.search(r'cookie', text1)
    match2 = re.search(r'cookie', text2)
    match3 = re.search(r'cookie', text3)

CPU times: user 412 µs, sys: 61 µs, total: 473 µs
Wall time: 481 µs


In [7]:
%%time

# Prof style

text1 = 'Cake and cookie'
text2 = 'cookie and Cake'
text3 = 'Cake cookie'

# Compile regex
regex = re.compile(r'cookie')

# search pattern
for i in range(100):
    match1 = regex.search(text1)
    match2 = regex.search(text2)
    match3 = regex.search(text3)

CPU times: user 95 µs, sys: 14 µs, total: 109 µs
Wall time: 113 µs
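
A single `%%time` run is noisy, and the first module-level call also pays the one-time cost of compiling the pattern. The re module keeps an internal cache of recently compiled patterns, so for a small pattern like this one the difference is often modest. A rough sketch of a more repeatable comparison with `timeit` follows; the function names `module_level` and `precompiled` are hypothetical, and the numbers will differ from machine to machine.

```python
import re
import timeit

texts = ['Cake and cookie', 'cookie and Cake', 'Cake cookie']
regex = re.compile(r'cookie')

def module_level():
    # re.search looks the pattern up in the module's internal cache on every call
    return [re.search(r'cookie', t) for t in texts]

def precompiled():
    # the compiled object is reused directly, skipping the cache lookup
    return [regex.search(t) for t in texts]

# Both approaches find the same matches
assert [m.span() for m in module_level()] == [m.span() for m in precompiled()]

# Repeat each version many times; absolute numbers depend on the machine
print("module-level:", timeit.timeit(module_level, number=10000))
print("precompiled: ", timeit.timeit(precompiled, number=10000))
```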


# [match.span( [group] )](https://docs.python.org/3.5/library/re.html#re.match.span)

* \[group\]  
Defaults to group 0, the entire matched string. It can be the index of any valid group (e.g. 1, 2, 3, etc.).
  
  
* **Return Value**  
A tuple (start, end) giving the position of that group in the text

In [15]:
# 5 min practice
# match.span

import re

text = 'Cake and cookie'

# Define pattern
pattern = r'cookie'

# search pattern
match = re.search(pattern, text)

# Check
print("Match starting index:", match.span()[0])
print("Match ending index:", match.span()[1])
print("Result String:", text[match.span()[0]:match.span()[1]])

Match starting index: 9
Match ending index: 15
Result String: cookie
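
The optional group argument can be illustrated with a pattern that contains groups. Groups are written with parentheses and are introduced in the next section, so treat the following as a short preview rather than a complete explanation.

```python
import re

text = 'Cake and cookie'
match = re.search(r'(Cake) and (cookie)', text)

print(match.span())    # (0, 15)  the entire match (group 0)
print(match.span(1))   # (0, 4)   position of 'Cake'
print(match.span(2))   # (9, 15)  position of 'cookie'

# The tuple can be unpacked and used to slice the original text
start, end = match.span(2)
print(text[start:end])  # cookie
```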


# [match.group( [group1, ...] )](https://docs.python.org/3.5/library/re.html#re.match.group)

* \[group1, ...\]  
If group1 is omitted it defaults to 0, which returns the entire matched string.  
groupN can be any valid group index (e.g. 1, 2, 3, etc.).
  
  
* **Return Value**  
The string captured by that group. If more than one argument is given, a tuple with the string of each requested group is returned.  
A group that did not take part in the match yields None.

# How to Define a Group

<center><h2>(<span style="color:green"> pattern </span>)</h2></center>
  
> <span style="color:green">pattern</span>: the regular expression for the group  

In [2]:
# Recap
# Only one parameter in match.group

import re

text = 'Cake and cookie'

# Define pattern
pattern = r'cookie'

# search pattern
match = re.search(pattern, text)

# Check
print("Entire matching string:", match.group())
print("Entire matching string:", match.group(0))


Entire matching string: cookie
Entire matching string: cookie


In [8]:
# 5 mins for coding
# two or more parameter in match.group

import re

text = 'Cake and cookie'

# Define pattern: () will define a group
pattern = r'(Cake) and (cookie)'

# search pattern
match = re.search(pattern, text)

# Check: only one argument
print("Group1 matching string:", match.group(1))
print("Group2 matching string:", match.group(2))

# Check: two arguments
print("Group1 and Group2 matching strings:", match.group(1, 2))

# Error: raises IndexError because the pattern has no group 3
# print("Group3 matching string:", match.group(3))

Group1 matching string: Cake
Group2 matching string: cookie
Group1 and Group2 matching strings: ('Cake', 'cookie')


In [None]:
# Live Demo

import re

# text which re applies to
text = 'Cake and cookie'

# Define pattern: () will define a group

# search pattern in text

# Check: match.group(...) with only one argument

# Check: match.group(...) with more than one argument


# [match.groups(default=None)](https://docs.python.org/3.5/library/re.html#re.match.groups)

* **Return Value**  
A tuple containing the string of every group in the pattern.  
A group that did not take part in the match is set to default, which is None unless specified.

In [42]:
# 5 min for coding
# match.groups()

import re

text = 'Cake and cookie'

# Define pattern
pattern = r'(Cake) and (cookie)'

# search pattern
match = re.search(pattern, text)

# Check
print("Groups of match:", match.groups())

Groups of match: ('Cake', 'cookie')
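
To see the `default` argument in action, a group has to be skipped during matching. The sketch below makes the first group optional with `?`, a repetition operator that is covered later in this material; the replacement value `'nothing'` is an arbitrary choice for illustration.

```python
import re

# (Cake)? makes the first group optional, so it may not take part in the match
pattern = r'(Cake)? and (cookie)'
match = re.search(pattern, ' and cookie')

print(match.groups())                   # (None, 'cookie')
print(match.groups(default='nothing'))  # ('nothing', 'cookie')
print(match.group(1))                   # None
```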


# [match.groupdict(default=None)](https://docs.python.org/3.5/library/re.html#re.match.groupdict)

* **Return Value**  
A dictionary containing all named groups. Each **key** is a group name and each **value** is the string captured by that group.

# How to Name a Group

<center><h2> (<span style="color:blue">?P</span><<span style="color:red">group_name</span>> <span style="color:green">pattern</span>) </h2></center>  
  
> <span style="color:blue">?P</span>: the prefix required before a group name  
> <span style="color:red">group_name</span>: the name of the group  
> <span style="color:green">pattern</span>: the regular expression for the group  


In [16]:
# 5 min for coding
# match.groupdict

import re

text = "Cake and cookie"

# Define pattern
pattern = r'(?P<fatter>Cake) and (?P<fat>cookie)'

# search pattern
match = re.search(pattern, text)

# Check
d = match.groupdict()
print(type(d))
print(d)

<class 'dict'>
{'fat': 'cookie', 'fatter': 'Cake'}
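
Named groups are not limited to `groupdict`. As a brief sketch, the same names can also be passed to `group` and `span`, and the groups remain reachable by index.

```python
import re

text = "Cake and cookie"
match = re.search(r'(?P<fatter>Cake) and (?P<fat>cookie)', text)

print(match.group('fat'))           # cookie  (access by name)
print(match.group(1))               # Cake    (named groups still have an index)
print(match.span('fat'))            # (9, 15)
print(match.groupdict()['fatter'])  # Cake
```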


In [None]:
# Live Demo

import re

# text which re applies to
text = "Cake and cookie"

# Define pattern (?P<name> pattern)

# search pattern

# Check

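# Special Characters

Regular expressions would be rather weak if they could only match plain text. Beyond plain text, they provide **special characters** that express **sets**, **special symbols**, **flags**, and other more flexible and powerful features.

<center><h1 style="color:red">Be sure to practice this part a lot!</h1></center>

# Special Character Reference Table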

| Character | Meaning
| :------- | :--------
| . | Match any single character except newline('\n')
| \w | Match any single letter, digit, or underscore
| \W | Match any character not part of \w
| \s | Match a single whitespace character like: space, newline, tab, return
| \S | Match any character not part of \s
| \t | Match tab
| \n | Match newline
| \r | Match return
| \d | Match decimal digit 0-9
| ^ | Match a pattern at the start of the string
| $ | Match a pattern at the end of the string
| \A | Match only at the start of the string
| \b | Match only the beginning or end of the word
| \ | Escape a special character so it matches literally (e.g. a backslash before . matches a literal dot)
| [...] | Match character that appears between '[ ]'
| [^...] | Match character that does not appear in '[ ]'

> There are many more special characters; if you are interested, see the Python re documentation.
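
A few rows of the table can be tried out directly. The sketch below uses an assumed two-line sample text; it is only meant to show how each special character behaves, not to cover every entry.

```python
import re

# Assumed sample text with a newline in the middle
text = "Room 65405\nCSIE building"

print(re.search(r'\d\d\d\d\d', text).group())   # 65405     five decimal digits
print(re.search(r'^Room', text).group())        # Room      anchored at the start
print(re.search(r'building$', text).group())    # building  anchored at the end
print(re.search(r'\bCSIE\b', text).group())     # CSIE      whole word only
print(re.search(r'[^a-zA-Z\s]', text).group())  # 6         first non-letter, non-space

# '.' does not match a newline, but \s does
print(re.search(r'65405.CSIE', text))                # None
print(repr(re.search(r'65405\sCSIE', text).group()))  # '65405\nCSIE'
```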

<center><h1> \w and \d Practice</h1></center>

In [1]:
# 5 mins for coding
# \w and \d practice
# Extract CSIE and room_id

import re

# text which re applies to
text = "CSIE-65405"

# specify pattern
pattern = r'(?P<department>\w\w\w\w)-(?P<room_id>\d\d\d\d\d)'

# search pattern
match = re.search(pattern, text)

# Check
print(match.groupdict())

{'room_id': '65405', 'department': 'CSIE'}


# In-Class Exercise (10 minutes)

Using the re module, write a program that lets the user enter a name and extracts its lastname and firstname.

* Input:  
    Tom Tsai   
    Amy Wang    
    Tim Chen    
    
---

* Output:  
    "Tom Tsai": { 'lastname': "Tsai", 'firstname': "Tom" }  
    "Amy Wang": { 'lastname': "Wang", 'firstname': "Amy" }  
    "Tim Chen": { 'lastname': 'Chen', 'firstname': "Tim" } 

<center><h1> [...] Practice </h1></center>

In [44]:
# 5 mins for coding
# [...] practice
# Extract CSIE and room_id 

import re

text = "CSIE-65405"

# define pattern
pattern = r'(?P<department>[a-zA-Z][a-zA-Z][a-zA-Z][a-zA-Z])-(?P<room_id>[0-9][0-9][0-9][0-9][0-9])'

# search pattern in text
match = re.search(pattern, text)

# check
print(match.groupdict())

{'department': 'CSIE', 'room_id': '65405'}


In [None]:
# Live Demo

import re

# text which re applies to
text = "CSIE-65405"

# Define pattern (?P<name> [a-zA-Z])

# search pattern

# Check

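# In-Class Exercise (1) (40 minutes)

Using the re module, write a simple **user authentication system**: the user supplies a username and a password, and the program decides whether both have a valid format.  
* **Username format**:
    1. Total length is 3  
    2. The first letter is uppercase  
    3. The remaining letters are lowercase  
  
* **Password format**:
    1. Total length is 9  
    2. The first 3 characters are lowercase letters  
    3. The last 6 characters are digits 0-9  
    4. The first of those digits must be 0     

# In-Class Exercise (1) Cont'd
* Input:  
    Tom, tom059357  
    Amy, amy154852  
    TiM, tim0002356  
    Yen, yen0054321  
    
---

* Output:  
    Welcome, Tom!  
    Password format error! Your password is amy154852  
    Username format error! Your username is TiM  
    Password length error! Your password length is 10.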

In [52]:
# Example

import re

class AuthSystem:
    
    def __init__(self):
        """Define regex"""
        self.username_regex = re.compile(r'johnny')
        self.password_regex = re.compile(r'johnny860410')
    
    def _check_username(self, username):
        """Check username is valid or not"""
        if self.username_regex.search(username) is not None:
            print("Correct username")
            return True
        else: 
            print("Wrong username")
            return False
        
    def _check_password(self, password):
        """Check password is valid or not"""
        if self.password_regex.search(password) is not None:
            print("Correct password")
            return True
        else:
            print("Wrong password")
            return False
        
    def authenticate(self, username, password):
        """authenticate the user"""
        if not self._check_username(username):
            return
        
        if not self._check_password(password):
            return
        
        print("Valid user")

    
# Construct an object of type AuthSystem
auth = AuthSystem()

# authenticate the user's credentials
auth.authenticate("johnny", "johnny860410")

Correct username
Correct password
Valid user


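# Repetition Syntax

To make regular expressions more **concise** and more **flexible**, the syntax provides **repetition operators**.  
With them, the patterns written earlier can be shortened.

<center><h2> \w\w\w\w ==> \w{4} </h2></center>

# Repetition Syntax Reference Table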

| Character | Meaning
| :--- | :--- 
| + | Match one or more repetitions of the element to its left
| * | Match zero or more repetitions of the element to its left
| ? | Match zero or one repetition of the element to its left
| {x} | Match exactly 'x' repetitions of the element to its left
| {x,} | Match 'x' or more repetitions of the element to its left
| {x,y} | Match at least 'x' and at most 'y' repetitions of the element to its left
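
As a quick sketch of the counted forms, the earlier CSIE example can be rewritten with `{x}`, followed by a few `{x,}` and `{x,y}` checks on the same text. Note that the bounds are written without a space after the comma.

```python
import re

text = "CSIE-65405"

# \w{4} replaces \w\w\w\w and \d{5} replaces \d\d\d\d\d
pattern = r'(?P<department>\w{4})-(?P<room_id>\d{5})'
match = re.search(pattern, text)
print(match.groupdict())

# {x,} and {x,y} bound the number of repetitions
print(re.search(r'\d{3,}', text).group())   # 65405  three or more digits, as many as possible
print(re.search(r'\d{2,3}', text).group())  # 654    at most three digits
print(re.search(r'\d{6}', text))            # None   the text has only five digits
```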

<center><h1> + Practice </h1></center>

In [53]:
# 5 mins for coding
# '+' practice
import re

# texts which re applies to
text1 = " and cookie"
text2 = "CakeCakeCake and cookie"

# Define pattern
plus_pattern = "(Cake)+ and cookie"

# search pattern in texts # 
plus_match1 = re.search(plus_pattern, text1)
plus_match2 = re.search(plus_pattern, text2)

# check
print(plus_match1)
print(plus_match2.group())

None
CakeCakeCake and cookie


In [None]:
# Live Demo

text1 = ' and cookie'
text2 = 'CakeCakeCake and cookie'

# import module

# Define pattern with '+'

# search pattern in texts

# check

<center><h1> * Practice </h1></center>

In [51]:
# 5 mins for coding

import re

# texts which re applies to
text1 = " and cookie"
text2 = "CakeCakeCake and cookie"

# Define pattern
mul_pattern = "(Cake)* and cookie"

# search pattern in texts
mul_match1 = re.search(mul_pattern, text1)
mul_match2 = re.search(mul_pattern, text2)

# check
print(mul_match1.group())
print(mul_match2.group())

 and cookie
CakeCakeCake and cookie


<center><h1> ? Practice </h1></center>

In [46]:
# 5 mins for coding
# '?' practice
import re

# texts which re applies to
text1 = " and cookie"
text2 = "Cake and cookie"

# Define pattern
ques_pattern = "(Cake)? and cookie"

# search pattern in texts 
ques_match1 = re.search(ques_pattern, text1)
ques_match2 = re.search(ques_pattern, text2)

# Check
print(ques_match1.group())
print(ques_match2.group())

 and cookie
Cake and cookie


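# In-Class Exercise (2) (10 minutes)

Continue from the previous exercise and extend it so that it accepts more complex username and password formats.    
* **Username format**:  
    1. Total length is at least 6 and at most 12  
    2. The first letter is uppercase  
    3. The remaining characters may be digits or letters of either case    
    
* **Password format**:  
    1. Total length is at least 6  
    2. Only lowercase letters and digits are allowed  
   

# In-Class Exercise (2) Cont'd

* Input:  
    Tommy7410, tom7410  
    Amy8520, amy85  
    tim9630, tim9630  
    Yen5566123456, yen0054321  
    
---

* Output:  
    Welcome, Tommy7410!  
    Password length error! Your password length is 5  
    Username format error! Your username is tim9630  
    Username length error! Your username length is 13. 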


# Greedy and Non-Greedy Match

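When matching, a regular expression tries to match the **longest possible string**. This behavior is called a greedy match, but it is not always what we want.  
Stopping at the **shortest possible string** instead is called a non-greedy match.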

<center><h1> Greedy Match </h1></center>

In [29]:
# Greedy Match
import re

text = "<h1> Title </h1>"

# Define pattern
pattern = r'<.*>'

# search pattern
match = re.search(pattern, text)

# Check
print(match.group())

<h1> Title </h1>


<center><h1> Non-Greedy Match </h1></center>

In [30]:
# Non-Greedy match
import re

text = "<h1> Title </h1>"

# Define pattern
pattern = r'<.*?>'

# search pattern
match = re.search(pattern, text)

# Check
print(match.group())

<h1>
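
Besides adding `?` after the repetition operator, a negated set from the special character table can give the same result here. This is only a sketch of an alternative for this particular text, not a general replacement for non-greedy matching.

```python
import re

text = "<h1> Title </h1>"

greedy = re.search(r'<.*>', text).group()
lazy = re.search(r'<.*?>', text).group()
# [^>]* cannot run past the first '>', so it stops at the same place
negated = re.search(r'<[^>]*>', text).group()

print(greedy)   # <h1> Title </h1>
print(lazy)     # <h1>
print(negated)  # <h1>
```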


# In-Class Exercise (10 minutes, ungraded)

Using the re module, extract the information in the 'a' tag of an HTML file.

<center><h1> <a href="www.google.com"\> Google </a\> </h1></center>
> Link: www.google.com  
> Content: Google

In [3]:
# Live Demo

text = """
    <a href="www.google.com"> Google </a>
"""

# import module
import re

# Define pattern
# a_tag_pattern = r'<a\s+href=(?P<link>.+)>(?P<content>.*?)</a>'

# search pattern

# Check


<center><h1> Thank You for Listening </h1></center>
<center><h3> Reference: https://docs.python.org/3/library/re.html </h3></center>