### Learning Regular Expressions
#### Methods to search for matches
match(), search(), findall(), finditer()

In [1]:
import re

test_string = "123abc456789abc123ABC"
pattern = re.compile(r"abc")
matches = pattern.finditer(test_string) # -> iterator of match objects
# matches = pattern.findall(test_string) # -> list of matched strings
# match = pattern.match(test_string) # -> match object or None, looks only at the beginning of the string
# match = pattern.search(test_string) # -> match object for the first occurrence, or None

# print(match)

# finditer() yields one match object per occurrence
for match in matches:
    print(match)

<re.Match object; span=(3, 6), match='abc'>
<re.Match object; span=(12, 15), match='abc'>
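
To make the difference between the four methods concrete, here is a minimal sketch that runs all of them on the same string as above. The variable names (`at_start`, `first`, `all_strings`, `all_spans`) are illustrative and not part of the original notebook.

```python
import re

test_string = "123abc456789abc123ABC"
pattern = re.compile(r"abc")

# match() only succeeds if the pattern matches at position 0
at_start = pattern.match(test_string)      # the string starts with "123", so no match

# search() scans forward and returns the first match object (or None)
first = pattern.search(test_string)

# findall() returns a list of the matched strings
all_strings = pattern.findall(test_string)

# finditer() returns an iterator of match objects
all_spans = [m.span() for m in pattern.finditer(test_string)]

print(at_start)      # None
print(first.span())  # (3, 6)
print(all_strings)   # ['abc', 'abc']
print(all_spans)     # [(3, 6), (12, 15)]
```

Note that the uppercase "ABC" at the end is never matched, because patterns are case-sensitive by default.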


#### Methods on match objects
group(), start(), end(), span()

In [4]:
import re

test_string = "123abc456789abc123ABC"
pattern = re.compile(r"abc")
matches = pattern.finditer(test_string)

for match in matches:
    print(f"{match.span()=}, {match.start()=}, {match.end()=}, {match.group()=}")

match.span()=(3, 6), match.start()=3, match.end()=6, match.group()='abc'
match.span()=(12, 15), match.start()=12, match.end()=15, match.group()='abc'
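
These methods also accept a group number when the pattern contains capture groups. The sketch below assumes a pattern with two groups, `(\d+)(abc)`, chosen only for illustration; it is not from the original notebook.

```python
import re

test_string = "123abc456789abc123ABC"
# Two capture groups: a run of digits immediately followed by "abc"
pattern = re.compile(r"(\d+)(abc)")

match = pattern.search(test_string)

whole = match.group()        # same as match.group(0): the entire match
digits = match.group(1)      # text captured by the first group
letters = match.group(2)     # text captured by the second group
digit_span = match.span(1)   # span() also accepts a group number

# start() and end() are indices into the original string
sliced = test_string[match.start():match.end()]

print(whole, digits, letters)  # 123abc 123 abc
print(digit_span)              # (0, 3)
print(sliced == whole)         # True
```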


#### Meta Characters
All meta characters: . ^ $ * + ? { } [ ] \ | ( )<br>
. Any character (except newline character) "he..o" <br>
^ Starts with "^hello"<br>
\$ Ends with "world\$"<br>
'*' Zero or more occurrences "aix*"<br>
'+' One or more occurrences "aix+"<br>
{ } Exactly the specified number of occurrences "al{2}"<br>
[] A set of characters "[a-m]"<br>
\ Signals a special sequence (can also be used to escape special characters) "\d"<br>
| Either or "falls|stays"<br>
( ) Capture and group<br>

In [5]:
test_string = "python-engineee.com"
pattern = re.compile(r'\.')
matches = pattern.finditer(test_string)

for match in matches:
    print(match)

<re.Match object; span=(15, 16), match='.'>
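
As a quick check of the table above, the following sketch tries several meta characters on a short sample string. The string "hello world" and the variable names are assumptions made for this example.

```python
import re

text = "hello world"

# '.' matches any single character except a newline
any_char = re.search(r"he..o", text).group()

# '^' anchors the pattern at the start, '$' at the end
starts = re.search(r"^hello", text) is not None
ends = re.search(r"world$", text) is not None

# '|' matches either alternative
either = re.findall(r"hello|world", text)

# '( )' captures sub-matches as groups
groups = re.search(r"(hel)(lo)", text).groups()

# An unescaped '.' matches every character; '\.' matches only a literal dot
unescaped = re.findall(r".", "a.b")
escaped = re.findall(r"\.", "a.b")

print(any_char)        # hello
print(starts, ends)    # True True
print(either)          # ['hello', 'world']
print(groups)          # ('hel', 'lo')
print(unescaped)       # ['a', '.', 'b']
print(escaped)         # ['.']
```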


\d : Matches any decimal digit. With the re.ASCII flag this is exactly the class [0-9]; for Unicode (str) patterns it also matches digits from other scripts.<br>
\D : Matches any non-digit character, the opposite of \d.<br>
\s : Matches any whitespace character (space, tab, newline, and so on).<br>
\S : Matches any non-whitespace character.<br>
\w : Matches any word character: letters, digits, and the underscore. With the re.ASCII flag this is exactly the class [a-zA-Z0-9_]; for Unicode (str) patterns it also matches non-ASCII letters and digits.<br>
\W : Matches any non-word character, the opposite of \w.<br>
\b : Returns a match where the specified characters are at the beginning or at the end of a word r"\bain" r"ain\b"<br>
\B : Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word r"\Bain" r"ain\B"<br>
\A : Returns a match if the specified characters are at the beginning of the string r"\AThe"<br>
\Z : Returns a match if the specified characters are at the end of the string r"Spain\Z"<br>

In [19]:
test_string = 'hello 123_ heyho hohey'
pattern = re.compile(r'\d')
matches = pattern.findall(test_string)
[match for match in matches]

['1', '2', '3']

In [21]:
pattern = re.compile(r'\s')
matches = pattern.findall(test_string)
[match for match in matches]

[' ', ' ', ' ']

In [23]:
pattern = re.compile(r'\w')
matches = pattern.findall(test_string)
[match for match in matches]

['h',
 'e',
 'l',
 'l',
 'o',
 '1',
 '2',
 '3',
 '_',
 'h',
 'e',
 'y',
 'h',
 'o',
 'h',
 'o',
 'h',
 'e',
 'y']

In [28]:
pattern = re.compile(r'hey\b')
matches = pattern.findall("heyho hohey all")
[match for match in matches]

['hey']
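
The boundary and anchor sequences are easier to tell apart by looking at match positions. Here is a small sketch on the same string as the cell above; the variable names are illustrative, and the last two lines use an extra sample character ("é") that is not in the original.

```python
import re

text = "heyho hohey all"

# \b requires a word boundary, \B requires that there is none
word_start = [m.span() for m in re.finditer(r"\bhey", text)]
word_end = [m.span() for m in re.finditer(r"hey\b", text)]
not_at_start = [m.span() for m in re.finditer(r"\Bhey", text)]

# \A anchors at the start of the string, \Z at the end
at_string_start = re.search(r"\Ahey", text).span()
at_string_end = re.search(r"all\Z", text).span()

# For str patterns \w is Unicode-aware unless re.ASCII is set
unicode_word = re.findall(r"\w", "\u00e9")
ascii_word = re.findall(r"\w", "\u00e9", flags=re.ASCII)

print(word_start)       # [(0, 3)]
print(word_end)         # [(8, 11)]
print(not_at_start)     # [(8, 11)]
print(at_string_start)  # (0, 3)
print(at_string_end)    # (12, 15)
print(unicode_word)     # ['é']
print(ascii_word)       # []
```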

#### Sets
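A set is a group of characters inside a pair of square brackets []; it matches any single character from the group. Conditions can be combined back-to-back, e.g. [aA-Z] matches a lowercase a or any uppercase letter.<br>
A ^ (caret) as the first character inside a set negates it; anywhere else in the set it is a literal caret.<br>
A - (dash) between two characters specifies a range; at the start or end of the set it matches the dash itself.<br>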

Examples:<br>
- [arn] Returns a match where one of the specified characters (a, r, or n) is present<br>
- [a-n] Returns a match for any lower case character, alphabetically between a and n<br>
- [^arn] Returns a match for any character EXCEPT a, r, and n<br>
- [0123] Returns a match where any of the specified digits (0, 1, 2, or 3) is present<br>
- [0-9] Returns a match for any digit between 0 and 9<br>
- [0-5][0-9] Returns a match for any two-digit number from 00 to 59<br>
- [a-zA-Z] Returns a match for any character alphabetically between a and z, lower case OR upper case<br>

In [29]:
test_string = "hello 123_"
pattern = re.compile(r'[elo]')
matches = pattern.finditer(test_string)
[match for match in matches]

[<re.Match object; span=(1, 2), match='e'>,
 <re.Match object; span=(2, 3), match='l'>,
 <re.Match object; span=(3, 4), match='l'>,
 <re.Match object; span=(4, 5), match='o'>]

In [36]:
test_string = "hello 123_"
pattern = re.compile(r'[0-1][0-9][0-1]')
matches = pattern.finditer(test_string)
[match for match in matches]

[]

In [42]:
dates = '''
01.04.2020

2020.04.01

2020-04-01
2020-05-23
2020-06-11
2020-07-11
2020-08-11

2020/04/02

2020_04_04
2020_04_04
'''

pattern = re.compile(r'\d\d\d\d[-/]\d\d[-/]\d\d')
matches = pattern.finditer(dates)
[match for match in matches]


[<re.Match object; span=(25, 35), match='2020-04-01'>,
 <re.Match object; span=(36, 46), match='2020-05-23'>,
 <re.Match object; span=(47, 57), match='2020-06-11'>,
 <re.Match object; span=(58, 68), match='2020-07-11'>,
 <re.Match object; span=(69, 79), match='2020-08-11'>,
 <re.Match object; span=(81, 91), match='2020/04/02'>]
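
The position of the caret and the dash inside a set changes what they mean. The sketch below illustrates this on a few short strings that are assumptions for this example only.

```python
import re

text = "hello 123_ a-b"

# A leading ^ negates the set: everything except lowercase letters and spaces
negated = re.findall(r"[^a-z ]", text)

# A ^ that is not first is an ordinary character
literal_caret = re.findall(r"[a^]", "a^b")

# A dash between two characters is a range (a, b, c) ...
as_range = re.findall(r"[a-c]", "a-c b")

# ... but a dash at the end of the set is a literal dash
literal_dash = re.findall(r"[ac-]", "a-c b")

print(negated)        # ['1', '2', '3', '_', '-']
print(literal_caret)  # ['a', '^']
print(as_range)       # ['a', 'c', 'b']
print(literal_dash)   # ['a', '-', 'c']
```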

#### Quantifier
'*' : 0 or more<br>
'+' : 1 or more<br>
? : 0 or 1, used when a character is optional<br>
{4} : exactly 4 occurrences<br>
{4,6} : between 4 and 6 occurrences (min, max)<br>

In [53]:
test_string = "20/04/1995"
pattern = re.compile(r'\d{4}')
matches = pattern.finditer(test_string)
[match for match in matches]

[<re.Match object; span=(6, 10), match='1995'>]
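
To compare the quantifiers side by side, here is a minimal sketch. The sample strings are assumptions for illustration; the last pattern rewrites the earlier date pattern `\d\d\d\d[-/]\d\d[-/]\d\d` with counted repetition.

```python
import re

sample = "ai aix aixxx"

zero_or_more = re.findall(r"aix*", sample)   # "x" may be absent
one_or_more = re.findall(r"aix+", sample)    # at least one "x" required

# '?' makes the preceding character optional
optional = re.findall(r"colou?r", "color colour")

# {4,6} is greedy: it takes up to 6 digits when they are available
ranged = re.findall(r"\d{4,6}", "123 1234 1234567")

# The date pattern from the Sets section, written with {n}
dates = re.findall(r"\d{4}[-/]\d{2}[-/]\d{2}", "2020-04-01 2020/04/02 2020_04_04")

print(zero_or_more)  # ['ai', 'aix', 'aixxx']
print(one_or_more)   # ['aix', 'aixxx']
print(optional)      # ['color', 'colour']
print(ranged)        # ['1234', '123456']
print(dates)         # ['2020-04-01', '2020/04/02']
```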