# Regex: Regular Expressions

### Regex modules
`import re`

### Regex in Python

```python
pat = re.compile("pattern")
pat.match("string")

# Equivalent to the above
re.match("pattern", "string")
```

### Regex functions
The `re` module offers a set of functions for searching, splitting, and replacing the parts of a string that match a pattern:

|Function|Description|
|:-:|:-:|
|findall|Returns a list containing all non-overlapping matches|
|finditer|Returns an iterator that yields a Match object for each match|
|match|Returns a Match object only if there is a match at the beginning of the string|
|search|Returns a Match object if there is a match anywhere in the string|
|split|Returns a list where the string has been split at each match|
|sub|Replaces one or more matches with a string (or with the return value of a function)|
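
The difference between some of these functions is easiest to see side by side. The following is a minimal sketch on a made-up sample string; the names `text`, `parts`, `at_start`, and `anywhere` are illustrative only.

```python
import re

text = "apple, banana;cherry  date"

# split: cut the string wherever the pattern matches
# (a comma or semicolon with optional surrounding spaces, or a run of whitespace)
parts = re.split(r"\s*[,;]\s*|\s+", text)
print(parts)  # ['apple', 'banana', 'cherry', 'date']

# match only looks at the start of the string; search scans the whole string
at_start = re.match(r"banana", text)
anywhere = re.search(r"banana", text)
print(at_start)         # None
print(anywhere.span())  # (7, 13)
```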

### Metacharacters
Metacharacters are characters with a special meaning:

|Character|Description|Example|
|:-:|:-:|:-:|
|"[]"|A set of characters|[abc] [a-z] [a-zA-Z0-9] [가-힣]|
|"\\"|Signals a special sequence (can also be used to escape special characters)|"\d"|
|"."|Any character (except a newline)|"he..o"|
|"^"|Starts with|"^hello"|
|"\$"|Ends with|"world$"|
|"*"|Zero or more occurrences|"aix*"|
|"+"|One or more occurrences|"aix+"|
|"?"|Zero or one occurrence of the preceding RE|"ab?"|
|"{}"|The specified number of occurrences (an exact count or a range)|"al{2}" "[0-9]{2,3}"|
|"\|"|Either or|"falls\|stays"|
|"()"|Capture and group|"(ab)+"|
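
To make the table concrete, here is a short sketch that exercises several of the example patterns above. The input strings are made up for illustration.

```python
import re

# "." matches any single character except a newline
dot_hits = re.findall(r"he..o", "hello helio hexxo")
print(dot_hits)  # ['hello', 'helio', 'hexxo']

# "^" and "$" anchor the pattern to the start and end of the string
starts = bool(re.search(r"^hello", "hello world"))
ends = bool(re.search(r"world$", "hello world"))
print(starts, ends)  # True True

# "{2}" requires exactly two repetitions of the preceding "l"
double_l = re.findall(r"al{2}", "al all alll")
print(double_l)  # ['all', 'all']

# "|" picks either alternative; "()" captures the alternative that matched
m = re.search(r"(falls|stays)", "The rain in Spain stays mainly in the plain")
print(m.group(1))  # stays
```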

### Special Sequences
A special sequence is a `\` followed by one of the characters in the list below, and has a special meaning:

|Character|Description|Example|
|:-:|:-:|:-:|
|"\A"|Returns a match if the specified characters are at the beginning of the string|"\AThe"|
|"\b"|Returns a match where the specified characters are at the beginning or at the end of a word (the `r` prefix makes Python treat the pattern as a raw string)|r"\bain", r"ain\b"|
|"\B"|Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word (the `r` prefix makes Python treat the pattern as a raw string)|r"\Bain", r"ain\B"|
|"\d"|Returns a match where the string contains digits (numbers from 0-9)|"\d"|
|"\D"|Returns a match where the string DOES NOT contain digits|"\D"|
|"\s"|Returns a match where the string contains a white space character|"\s"|
|"\S"|Returns a match where the string DOES NOT contain a white space character|"\S"|
|"\w"|Returns a match where the string contains any word characters (letters a-z and A-Z, digits 0-9, and the underscore _ character)|"\w"|
|"\W"|Returns a match where the string DOES NOT contain any word characters|"\W"|
|"\Z"|Returns a match if the specified characters are at the end of the string|"Spain\Z"|
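
The sketch below illustrates why the raw-string prefix matters for `\b`, and shows `\d` and `\w` on made-up input. In a normal string literal Python interprets `"\b"` as a backspace character before the regex engine ever sees it.

```python
import re

txt = "The rain in Spain"

# r"\bain": "ain" at the start of a word; r"ain\b": "ain" at the end of a word
at_word_start = re.findall(r"\bain", txt)
at_word_end = re.findall(r"ain\b", txt)
print(at_word_start)  # []
print(at_word_end)    # ['ain', 'ain']

# Without the r prefix, "\b" becomes a backspace character,
# so the pattern looks for "ain" followed by a literal backspace
no_raw = re.findall("ain\b", txt)
print(no_raw)  # []

# \d matches digits, \w matches word characters (letters, digits, underscore)
digits = re.findall(r"\d+", "Room 12, floor 3")
words = re.findall(r"\w+", "snake_case is-fun")
print(digits)  # ['12', '3']
print(words)   # ['snake_case', 'is', 'fun']
```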

### Sets
A set is a set of characters inside a pair of square brackets `[]` with a special meaning:

|Set|Description|
|:-:|:-:|
|[arn]|Returns a match where one of the specified characters (a, r, or n) is present|
|[a-n]|Returns a match for any lower case character, alphabetically between a and n|
|[^arn]|Returns a match for any character EXCEPT a, r, and n|
|[0123]|Returns a match where any of the specified digits (0, 1, 2, or 3) are present|
|[0-9]|Returns a match for any digit between 0 and 9|
|[0-5][0-9]|Returns a match for any two-digit number from 00 to 59|
|[a-zA-Z]|Returns a match for any character alphabetically between a and z, lower case OR upper case|
|[+]|In sets, `+`, `*`, `.`, `\|`, `()`, `$`, `{}` have no special meaning, so [+] means: return a match for any + character in the string|
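
A few of the sets from the table, applied to small made-up strings. This is only a sketch to show what each set accepts and rejects.

```python
import re

# [arn] matches any one of the listed characters
listed = re.findall(r"[arn]", "rain")
print(listed)  # ['r', 'a', 'n']

# [^arn] matches any character that is NOT a, r, or n
negated = re.findall(r"[^arn]", "rain")
print(negated)  # ['i']

# [0-5][0-9] matches two-digit numbers from 00 to 59, so "99" is skipped
minutes = re.findall(r"[0-5][0-9]", "08:45 and 99")
print(minutes)  # ['08', '45']

# Inside a set, "+" is a literal plus sign, not a quantifier
pluses = re.findall(r"[+]", "1 + 2 + 3")
print(pluses)  # ['+', '+']
```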

In [1]:
import re

txt = """
kim 901015-1028473
lee 991018-1939403
"""

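# Approach 1: plain string methods
for line in txt.split("\n"):
    for w in line.split(" "):
        # Keep 14-character tokens whose last 7 characters are digits
        if len(w) == 14 and w[7:].isdigit():
            print(w[:6] + "-" + "*******")

# Approach 2: capture the first six digits with a regex and mask the rest
mypat = re.compile(r"(\d{6})[-]\d{7}")
print(mypat.sub(r"\g<1>-*******", txt))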

901015-*******
991018-*******

kim 901015-*******
lee 991018-*******



In [2]:
pat = re.compile("[a-z]+")

In [3]:
match = bool(pat.match("1test"))  # match() only checks the start of the string
print(match)

False


In [4]:
match = bool(pat.search("1test"))  # search() scans the whole string
print(match)

True


In [5]:
match = bool(pat.search("1test 2test 3test"))
print(match)

True


In [6]:
match = bool(pat.findall("1test 2test 3test"))  # a non-empty list is truthy
print(match)

True


In [7]:
pat.findall("1test 2test 3test")

['test', 'test', 'test']

In [8]:
res = pat.finditer("1test 2 test 3 test")
for r in res:
    print(r)

<re.Match object; span=(1, 5), match='test'>
<re.Match object; span=(8, 12), match='test'>
<re.Match object; span=(15, 19), match='test'>


### __findall vs finditer__
- `finditer` yields Match objects, so you also get the position (index) of each match
- `findall` returns the matched strings as a list
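
As a side-by-side sketch of the two functions on the same input used above (the names `words` and `positions` are illustrative only):

```python
import re

pat = re.compile(r"[a-z]+")
text = "1test 2 test 3 test"

# findall: only the matched strings
words = pat.findall(text)
print(words)  # ['test', 'test', 'test']

# finditer: Match objects, from which the text and its position can be read
positions = [(m.group(), m.start(), m.end()) for m in pat.finditer(text)]
print(positions)  # [('test', 1, 5), ('test', 8, 12), ('test', 15, 19)]
```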

In [9]:
res = pat.match("mypython")

In [10]:
print(res.group()) # the matched string
print(res.start()) # start index of the match
print(res.end())   # end index of the match (exclusive)
print(res.span())  # (start, end)

mypython
0
8
(0, 8)


In [11]:
res2 = pat.search("7 python")
print(res2)

<re.Match object; span=(2, 8), match='python'>


In [12]:
print(re.match("[0-9]", "1234"))
print(re.match("[0-9]*", "1234"))
print(re.match("[0-9]+", "1234"))

print(re.match(r"\d*", "1234"))
print(re.match(r"\d+", "1234"))
print(re.match(r"\d{4}", "1234"))

<re.Match object; span=(0, 1), match='1'>
<re.Match object; span=(0, 4), match='1234'>
<re.Match object; span=(0, 4), match='1234'>
<re.Match object; span=(0, 4), match='1234'>
<re.Match object; span=(0, 4), match='1234'>
<re.Match object; span=(0, 4), match='1234'>


In [13]:
print(re.match("a*b", "a"))
print(re.match("a+b", "b"))
print(re.match("a*b", "b"))
print(re.match("a?b", "b"))
print(re.match("a+b", "aaab"))

None
None
<re.Match object; span=(0, 1), match='b'>
<re.Match object; span=(0, 1), match='b'>
<re.Match object; span=(0, 4), match='aaab'>


In [14]:
# Examples
# Create a regular expression that matches a phone number
re.match("[0-9]{3}-[0-9]{4}-[0-9]{4}$", "010-6291-8015")

# 2 or 3 digits - 3 or 4 digits - 4 digits
re.match("[0-9]{2,3}-[0-9]{3,4}-[0-9]{4}", "010-6291-8015")

<re.Match object; span=(0, 13), match='010-6291-8015'>
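
The first pattern above ends with `$` and the second does not. The sketch below, using a made-up number with one digit too many, shows how that changes validation. `re.fullmatch` is shown as an alternative that requires the whole string to match; the names `loose`, `strict`, and `candidate` are illustrative only.

```python
import re

loose = re.compile(r"[0-9]{2,3}-[0-9]{3,4}-[0-9]{4}")
strict = re.compile(r"[0-9]{2,3}-[0-9]{3,4}-[0-9]{4}$")

candidate = "010-6291-80159"  # made-up input with an extra trailing digit

# Without "$", match() accepts the string and ignores the trailing "9"
loose_ok = bool(loose.match(candidate))
# With "$", the pattern must reach the end of the string, so it fails
strict_ok = bool(strict.match(candidate))
# fullmatch() applies the same whole-string requirement to the loose pattern
full_bad = bool(loose.fullmatch(candidate))
full_good = bool(loose.fullmatch("010-6291-8015"))

print(loose_ok, strict_ok, full_bad, full_good)  # True False False True
```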

In [15]:
# Note: \w also matches digits, so \w* consumes the whole string and \d* matches nothing
re.match(r"\w*\d*", "Hello1234")

<re.Match object; span=(0, 9), match='Hello1234'>

In [16]:
re.match("[A-Z]+", "Hello")

<re.Match object; span=(0, 1), match='H'>

In [17]:
# [A-Za-z] rather than [A-z]: the range A-z also includes [ \ ] ^ _ and `
re.match("^[A-Za-z]+", "Hello")

<re.Match object; span=(0, 5), match='Hello'>

In [18]:
re.search(r"\*+", "1 ** 2")

<re.Match object; span=(2, 4), match='**'>

In [19]:
re.match("[$()a-z]+", "$(test)")

<re.Match object; span=(0, 7), match='$(test)'>

In [20]:
# \w : returns a match where the string contains word characters (letters, digits, _)
# \W : NOT

print(re.match(r"\w+", "hello_123"))
print(re.search(r"\W+", "(!@#_)"))

<re.Match object; span=(0, 9), match='hello_123'>
<re.Match object; span=(0, 4), match='(!@#'>


In [21]:
# whitespace
print(re.match("[A-Za-z0-9]+", "Hello 123"))
print(re.match("[A-Za-z0-9 ]+", "Hello 123"))

# \s : returns a match where the string contains a whitespace character
# \S : NOT
# No quantifier here, so only a single character is matched
print(re.match(r"[A-Za-z0-9\s]", "Hello 123"))


<re.Match object; span=(0, 5), match='Hello'>
<re.Match object; span=(0, 9), match='Hello 123'>
<re.Match object; span=(0, 1), match='H'>


### Named Groups and Backreferences
- Syntax
    - Named capturing group
        - `(?P<name>regex)`
    - Named backreference syntax
        - `(?P=name)`
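
The named backreference can be sketched with a doubled-word detector. The pattern name `doubled` and the sample sentence are made up for illustration: `(?P=word)` must match exactly the text that the group `word` captured, and `\g<word>` refers to the same group inside a replacement string.

```python
import re

# (?P<word>...) names the group; (?P=word) must repeat the same text
doubled = re.compile(r"\b(?P<word>\w+) (?P=word)\b")

m = doubled.search("this is is a test")
print(m.group("word"))  # is
print(m.span())         # (5, 10)

# In a replacement string, \g<word> inserts the text captured by the group
fixed = doubled.sub(r"\g<word>", "this is is a test")
print(fixed)  # this is a test
```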

In [22]:
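# Example: check whether a function name is valid and extract its arguments
# ([A-Za-z_]\w* also accepts one-character names; [A-z_] would allow [ \ ] ^ and `)

r = re.match(r"(?P<fn>[A-Za-z_]\w*)\((?P<arg>[,\w ]*)\)", "__init__(a, b, c)")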
f_name = r.group('fn')
parameters = r.group('arg').split(", ")
print(f_name, parameters)

__init__ ['a', 'b', 'c']


### re.sub()
- Replaces each matched substring with the desired value
- syntax
    - `re.sub(pattern, repl, string, count=0, flags=0)`
    - repl: string or function
    - count: maximum number of replacements (0 replaces every match)
    - flags: flag constants (re.IGNORECASE, ...)
    
```python
# Normalize several Korean and English names for South Korea to one form, "대한민국"
re.sub("우리나라|한국|대한민국|남한|코리아|South Korea", "대한민국", "대한민국, 한국 코리아")

# output : '대한민국, 대한민국 대한민국'
```
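
The `count` and `flags` parameters can be sketched on a made-up string as follows. Passing them by keyword avoids confusing one for the other.

```python
import re

text = "Python, PYTHON and python"

# flags=re.IGNORECASE makes the pattern match regardless of letter case
all_replaced = re.sub("python", "Py", text, flags=re.IGNORECASE)
print(all_replaced)  # Py, Py and Py

# count=2 stops after the first two replacements
first_two = re.sub("python", "Py", text, count=2, flags=re.IGNORECASE)
print(first_two)  # Py, Py and python
```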

In [23]:
# Replace numbers with the string "num" (count=1 replaces only the first match)
s = "1 2 three 4 50 600 seven"

re.sub("[0-9]+", "num", s, count=1)

'num 2 three 4 50 600 seven'

In [24]:
# repl can be a function: it receives each Match object and returns the replacement string
def mul10(arg):
    res = int(arg.group()) * 10
    return str(res)

re.sub("[0-9]+", mul10, s)

'10 20 three 40 500 6000 seven'