# Use Regular Expressions (RegEx) in Python

## RegEx notes and cheatsheet

* `|` - means 'or': matches the expression on either side
* `[ ]` - matches any one of the characters inside the brackets
* `?` - makes the preceding token optional
* `.+` - matches one or more of any character except a newline, so on its own it runs to the end of the line
* `( )` - groups part of a pattern; without parentheses, `|` splits the entire regex into a left side and a right side
* `^` - matches only at the beginning of a string
* `$` - matches only at the end of a string
* `+` is greedy - it matches as many repetitions of the preceding token as possible
* `+?` is not greedy (lazy) - it matches as few repetitions as possible, stopping at the first point where the rest of the pattern can match
* `{3}` - matches exactly 3 occurrences of the preceding token
* `{3,4}` - matches 3 to 4 occurrences of the preceding token
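
The notes on greediness and grouping are easier to see side by side. The following is a minimal sketch; the sample strings are made up for illustration, and `(?:...)` is a non-capturing group (used here only so that `re.findall` returns the whole match rather than the group contents).

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: `.+` takes as much as it can, so the match runs to the last '>'.
greedy = re.findall(r"<.+>", html)

# Lazy: `.+?` stops at the first '>' that lets the pattern succeed.
lazy = re.findall(r"<.+?>", html)

# Without parentheses, `|` splits the whole pattern into two alternatives:
# "gray" or "grey cat". Grouping limits the alternation to the vowel.
ungrouped = re.findall(r"gray|grey cat", "gray cat, grey cat")
grouped = re.findall(r"gr(?:a|e)y cat", "gray cat, grey cat")

print(greedy)     # ['<b>bold</b> and <i>italic</i>']
print(lazy)       # ['<b>', '</b>', '<i>', '</i>']
print(ungrouped)  # ['gray', 'grey cat']
print(grouped)    # ['gray cat', 'grey cat']
```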

### Special Sequences
* `\r` = carriage return
* `\n` = newline
* `\s` = whitespace character (space, tab, newline, etc.)
* `\w` = word character (letter, digit, or underscore)
* `\d` = digit 0-9

### Lookahead and Lookbehind
* `(?<!`  = negative look behind
* `(?<=`  = positive look behind
* `(?=`  = positive look ahead 
* `(?!`  = negative look ahead
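
A lookaround checks what comes before or after a position without including it in the match. Here is a small sketch of all four forms; the `text` string and the variable names are invented for this example.

```python
import re

text = "price: $100, cost: 200, fee: $30"

# Positive lookbehind: digit runs that are preceded by a dollar sign.
with_dollar = re.findall(r"(?<=\$)\d+", text)

# Negative lookbehind: digit runs not preceded by '$' or another digit.
without_dollar = re.findall(r"(?<![\$\d])\d+", text)

# Positive lookahead: words that are immediately followed by a colon.
labels = re.findall(r"\w+(?=:)", text)

# Negative lookahead: digit runs not followed by a comma or another digit.
not_before_comma = re.findall(r"\d+(?![\d,])", text)

print(with_dollar)       # ['100', '30']
print(without_dollar)    # ['200']
print(labels)            # ['price', 'cost', 'fee']
print(not_before_comma)  # ['30']
```

The extra `\d` inside the negative lookarounds keeps the engine from matching only part of a number, such as the `00` in `$100`.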

### Verify Email
The first part of email can contain uppercase letters, lowercase letters, and numbers

In [2]:
import re # import the regex module

# define the regex pattern
# a-z = any lowercase letter
# A-Z = any uppercase letter
# 0-9 = any digit
# + = one or more of the preceding token (here, the character set)
# @ = the @ symbol
pattern = r"[a-zA-Z0-9]+@[a-zA-Z0-9]+\.(com|edu|net)" 
# you can interpret the above as 
# "one or more of any lowercase letter, 
# uppercase letter, or digit, followed by the @ symbol
# followed by one or more of any lowercase letter,
# uppercase letter, or digit, followed by a period,
# followed by com, edu, or net"




In [3]:
user_input = input() # get user input

if re.fullmatch(pattern, user_input): # if the entire input matches the pattern
    print("Valid email address") # print valid email address
else: # otherwise
    print("Invalid email address") # print invalid email address

Valid email address
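
For validation, it matters whether the pattern has to cover the whole input or may match anywhere inside it. The sketch below compares `re.search` and `re.fullmatch` using the same pattern as above; the sample addresses are made up for illustration.

```python
import re

pattern = r"[a-zA-Z0-9]+@[a-zA-Z0-9]+\.(com|edu|net)"
samples = ["jane99@example.com", "jane@example.org", "!!jane@example.com!!"]

# For each sample, record (found anywhere?, matches the whole string?).
results = {
    s: (bool(re.search(pattern, s)), bool(re.fullmatch(pattern, s)))
    for s in samples
}

for s, (found, whole) in results.items():
    print(f"{s!r}: search={found}, fullmatch={whole}")
# 'jane99@example.com': search=True, fullmatch=True
# 'jane@example.org': search=False, fullmatch=False
# '!!jane@example.com!!': search=True, fullmatch=False
```

`re.search` accepts the last sample because a valid address appears inside it, while `re.fullmatch` rejects it because of the surrounding characters.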


### Replacing Substrings

In [8]:
import re

# define the regex pattern for a phone number
pattern = r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)' # raw string so the backslashes reach the regex engine unchanged

# new pattern for the groups of numbers in the phone number
# this will be used to remove the dashes
new_pattern = r"\1\2\3"

In [9]:
user_input = input() # get user input

new_user_input = re.sub(pattern, new_pattern, user_input) # replace the pattern with the new pattern
print(new_user_input) # print the new user input

1234567890
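
Backreferences such as `\1` can also rearrange or partly hide the captured groups rather than just concatenating them. This is a rough sketch on a hard-coded string; `phone_pattern`, `sample`, and the output formats are assumptions made for the example.

```python
import re

# Same idea as the pattern above, written with {n} quantifiers.
phone_pattern = r"(\d{3})-(\d{3})-(\d{4})"
sample = "Call 123-456-7890 or 987-654-3210."

# Reuse the captured groups in a different layout.
reformatted = re.sub(phone_pattern, r"(\1) \2-\3", sample)

# Keep only the last group and mask the rest.
masked = re.sub(phone_pattern, r"***-***-\3", sample)

# `count=1` limits the substitution to the first match.
first_only = re.sub(phone_pattern, r"\1\2\3", sample, count=1)

print(reformatted)  # Call (123) 456-7890 or (987) 654-3210.
print(masked)       # Call ***-***-7890 or ***-***-3210.
print(first_only)   # Call 1234567890 or 987-654-3210.
```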


### Methods to Search for Matches

In [16]:
import re

test_string = "123abc456789abc123ABC"

pattern = re.compile(r"abc") # create a pattern object
matches = pattern.finditer(test_string) # create an iterator of match objects

# to do this in one line, you can use the following:
# matches = re.finditer(r'abc', test_string) # create an iterator of match objects

print('Match Objects using finditer():')
for match in matches: # for each match object
    print(match) # print the match object

# to just return the substring that matches the pattern, use the following:
targets = re.findall(pattern, test_string) # create a list of substrings that match the pattern

print('\nTargets using findall():')
for target in targets: # for each substring
    print(target) # print the substring

# to find a single match at the beginning of the string, use the following:
match = re.match(pattern, test_string) # create a match object

print('\nMatch Object (beginning of string) using match():')
print(match) # print the match object, if it exists, otherwise None

# to find a single match anywhere in the string, use the following:
match = re.search(pattern, test_string) # create a match object

print('\nMatch Object (whole string) using search():')
print(match) # print the match object, if it exists, otherwise None


Match Objects using finditer():
<re.Match object; span=(3, 6), match='abc'>
<re.Match object; span=(12, 15), match='abc'>

Targets using findall():
abc
abc

Match Object (beginning of string) using match():
None

Match Object (whole string) using search():
<re.Match object; span=(3, 6), match='abc'>


### Methods on a Match Object

In [17]:
import re

test_string = '123abc456789abc123ABC'

pattern = re.compile(r'abc') # create a pattern object
matches = pattern.finditer(test_string) # create an iterator of match objects

for match in matches: # for each match object
    print(match) # print the match object
    print(match.span()) # print the start and end positions of the match
    print(match.start()) # print the start position of the match
    print(match.end()) # print the end position of the match
    print(match.group()) # print the substring that matches the pattern

<re.Match object; span=(3, 6), match='abc'>
(3, 6)
3
6
abc
<re.Match object; span=(12, 15), match='abc'>
(12, 15)
12
15
abc


### Meta Characters

All meta characters: . ^ $ * + ? { } [ ] \ | ( )

* `.` - any character except newline
* `^` - beginning of a string "^Hello"
* `$` - end of a string "World$"
* `*` - 0 or more occurrences "aix*"
* `+` - 1 or more occurrences "aix+"
* `?` - 0 or 1 occurrence "aix?"
* `{n}` - exactly n occurrences "al{2}"
* `{n,}` - n or more occurrences "al{2,}"
* `{,n}` - 0 to n occurrences "al{,2}"
* `{m,n}` - m to n occurrences "al{2,3}"
* `[abc]` - matches a, b, or c 
* `[a-z]` - matches any lowercase letter
* `[A-Z]` - matches any uppercase letter
* `[0-9]` - matches any digit
* `[^abc]` - matches anything except a, b, or c
* `[^0-9]` - matches anything except a digit
* `\` - escape character
* `|` - matches either expression
* `()` - groups subexpressions
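
To see how the four brace quantifiers differ, one option is to run each against the same set of candidate strings. The sketch below does that with `re.fullmatch`; the `candidates` list and the `matching` helper are hypothetical names introduced for this example.

```python
import re

candidates = ["a", "al", "all", "alll", "allll"]

def matching(pattern):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

exactly_two = matching(r"al{2}")     # 'a' followed by exactly two 'l'
two_or_more = matching(r"al{2,}")    # 'a' followed by two or more 'l'
up_to_two = matching(r"al{,2}")      # 'a' followed by zero to two 'l'
two_to_three = matching(r"al{2,3}")  # 'a' followed by two or three 'l'

print(exactly_two)   # ['all']
print(two_or_more)   # ['all', 'alll', 'allll']
print(up_to_two)     # ['a', 'al', 'all']
print(two_to_three)  # ['all', 'alll']
```

Note that the quantifier applies only to the token right before it, the `l`, not to the whole `al`.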

#### More Meta Characters
* `\d` - any digit [0-9]
* `\D` - any non-digit
* `\s` - any whitespace character [space " " tab "\t" newline "\n" carriage return "\r"]
* `\S` - any non-whitespace character
* `\w` - any word character: alphanumeric or underscore [a-zA-Z0-9_]
* `\W` - any non-word character
* `\b` - word boundary: matches where the pattern is at the start or end of a word
* `\B` - non-boundary: matches where the pattern is inside a word, not at the start or end of a word



In [28]:
import re

test_string = '123abc456789abc123ABC.'

pattern = re.compile(r'.') # matches any character except newline
# pattern = re.compile(r'\.') # matches the period character
# pattern = re.compile(r'^123') # does the string start with 123?
# pattern = re.compile(r'^abc') # does the string start with abc?
# pattern = re.compile(r'123$') # does the string end with 123?
# pattern = re.compile(r'ABC\.$') # does the string end with ABC.?


matches = pattern.finditer(test_string) # create an iterator of match objects

for match in matches: # for each match object
    print(match.group()) # print the substring that matches the pattern


1
2
3
a
b
c
4
5
6
7
8
9
a
b
c
1
2
3
A
B
C
.


In [50]:
import re

test_string = r'hello 123_ heyho $% honey'

# pattern = re.compile(r'\d') # matches any digit
# pattern = re.compile(r'\D') # matches any non-digit
# pattern = re.compile(r'\s') # matches any whitespace character
# pattern = re.compile(r'\S') # matches any non-whitespace character
# pattern = re.compile(r'\w') # matches any word character (letter, digit, or underscore)
# pattern = re.compile(r'\W') # matches any non-word character
# pattern = re.compile(r'\bhello') # matches 'hello' at the start of a word
# pattern = re.compile(r'\bhoney') # matches 'honey' at the start of a word
# pattern = re.compile(r'\b23') # no match: '23' follows '1', so it is not at a word boundary
# pattern = re.compile(r'\Bhello') # no match: 'hello' is at the start of a word
# pattern = re.compile(r'\Bhoney') # no match: 'honey' is at the start of a word
pattern = re.compile(r'\B23') # matches '23' inside '123_', where it is not at a word boundary

matches = pattern.finditer(test_string) # create an iterator of match objects

for match in matches: # for each match object
    print(match) # print the match object


<re.Match object; span=(7, 9), match='23'>


### Character Sets
A pattern between square brackets `[]` matches any single character in the set.
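
A short sketch of character sets in use follows; the sample strings are invented for illustration. It also shows that metacharacters such as `.` and `+` are treated as literal characters inside a set.

```python
import re

sample = "Room 42B, floor 3"

# Any single character from the listed set.
vowels = re.findall(r"[aeiou]", sample)

# A range inside a set, repeated with `+` to collect whole numbers.
digits = re.findall(r"[0-9]+", sample)

# A leading `^` negates the set: anything that is not a letter or digit.
non_alnum = re.findall(r"[^a-zA-Z0-9]", sample)

# Inside a set, `.` and `+` lose their special meaning.
literal_symbols = re.findall(r"[.+]", "1.5+2")

print(vowels)           # ['o', 'o', 'o', 'o']
print(digits)           # ['42', '3']
print(non_alnum)        # [' ', ',', ' ', ' ']
print(literal_symbols)  # ['.', '+']
```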