## Python Regex - An exhaustive guide with examples

**Note**:
- This notebook is built on the `learn-by-doing` principle.
- It contains many code examples, each with an explanation.
- For a thorough understanding of any built-in regex function, please refer to the official Python documentation.
- The **re** module is the powerhouse, so importing it is a must.

In [1]:
import re

### Simple patterns (letter, word etc.) via re.search()

- Function signature: `re.search(pattern, string-to-search)`
- If the search is successful, it returns a match object
- Otherwise it returns `None`

>- Format: re.search(**r"pattern"**, string-to-search)
- `r"pattern"` is a `raw Python string`: the interpreter leaves backslashes as they are instead of treating them as escape sequences. Thus `print("\n")` prints a newline, while `print(r"\n")` prints the two characters `\n`.

In [1]:
print("\n")





In [2]:
print(r"\n")

\n
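The difference matters for regex because many patterns contain backslashes. Here is a minimal sketch (the sample text is made up for illustration) of what can go wrong without the `r` prefix:

```python
import re

# A normal string turns "\n" into one newline character;
# a raw string keeps the backslash and the letter n as two characters.
print(len("\n"))    # 1
print(len(r"\n"))   # 2

# Without the r prefix, "\b" is a backspace character, not a word boundary.
text = "a cat sat"
plain = re.search("\bcat\b", text)   # looks for backspace + cat + backspace
raw = re.search(r"\bcat\b", text)    # looks for the whole word cat
print(plain)         # None
print(raw.group())   # cat
```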


In [4]:
#Example:
re.search(r"b","badboy")

<re.Match object; span=(0, 1), match='b'>

In [5]:
re.search(r"b","badboy").group()

'b'

### Observations from above:
- `span=(0, 1)` means the match starts at index 0
- and ends just before index 1
- The `group()` method returns the text that matched
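As a small sketch of how these pieces fit together, the match object also exposes `start()`, `end()` and `span()`. The pattern `boy` is chosen here only so that the match does not start at index 0:

```python
import re

match_obj = re.search(r"boy", "badboy")

print(match_obj.group())  # the text that matched: boy
print(match_obj.start())  # index where the match begins: 3
print(match_obj.end())    # index just past the match: 6
print(match_obj.span())   # (start, end) as a tuple: (3, 6)

# The span can be used directly as a slice of the original string.
start, end = match_obj.span()
print("badboy"[start:end])  # boy
```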

In [7]:
#Example: let's wrap the above within a function
def func1(pattern, string):
    match_obj = re.search(pattern, string)
    if match_obj:
        print(f"Found a match for {match_obj.group()} in: {string}")
    else:
        print("Pattern not found")   

In [8]:
pat1 = r"n"
pat2 = r"cat"

In [9]:
func1(pat1,"number")

Found a match for n in: number


In [10]:
func1(pat2,"There's one caterpillar and a cat in the closet") # note that only the 1st occurrence of cat (inside caterpillar) is matched

Found a match for cat in: There's one caterpillar and a cat in the closet


In [11]:
func1(pat2,"There's one Caterpillar and a Cat in the closet") # note: cat and Cat are different

Pattern not found


In [12]:
# Ignoring case via re.IGNORECASE
re.search(r"cat","There's one Caterpillar and a Cat in the closet",re.IGNORECASE).group() # note: this will work

'Cat'

In [5]:
re.search(r"[cC]at","There's one caterpillar in the closet") # note: this will also work

<re.Match object; span=(12, 15), match='cat'>

### Big Idea:
- `r"a"` matches `a` at any place within the string
- `r"bit"` matches `bit,fitbit,bite,bitten etc.`, but doesn't match `biit,bid etc.`
- `re.IGNORECASE` can be used to take care of both upper and lowercases; or character class can also be used.

### Above is an example of using a class of characters

- Intuition: A list of characters within `[]` matches any one character from that list.
- So, `r"[Bb][aA]t"` will match Bat,bat,BAt,bAt.
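A quick way to check this claim is to run the pattern over a few candidates. This is only a sketch, and the candidate words are made up for illustration:

```python
import re

pattern = r"[Bb][aA]t"
candidates = ["Bat", "bat", "BAt", "bAt", "BAT", "bet"]

# Each [] consumes exactly one character, and the final t must be lowercase.
matched = [word for word in candidates if re.search(pattern, word)]
print(matched)  # ['Bat', 'bat', 'BAt', 'bAt']
```

`BAT` is rejected because the trailing `t` is outside any class and therefore case-sensitive.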

In [20]:
pat4 = r"[aeiou]"
func1(pat4,"another")
func1(pat4,"elephant")
func1(pat4,"is")
func1(pat4,"oliver")
func1(pat4,"uber")

Found a match for a in: another
Found a match for e in: elephant
Found a match for i in: is
Found a match for o in: oliver
Found a match for u in: uber


### Example:

- Let's try to see if "elephant" contains any vowels

**A plain Python implementation would be something like:**
>"a" in "elephant" or "e" in "elephant" or "i" in "elephant" and so on..

**The regex implementation would be:**
> match_obj = re.search(r"[aeiou]", "elephant")


### Matching a single character (using .):
 `.`  matches any single character except a newline: letters, digits, punctuation and the `.` itself.

_Let's look at an example_

In [48]:
#Example:
filelst = ["f1.py","f11.py","f2.py","f111.py","fat4.py","f..py"]
print(". will match a single character")
print("\\. is used to escape the dot extension in filename\n")  # \\ prints a single backslash
print("Matched filenames:")
for file in filelst:
    print(re.search(r"f.\.",file)) # we want f, any one character, then a dot (e.g. f1.py, f5.py, but not f55.py, f555.py and so on)
print("\nNotice the last match where . is matching itself")    
    

. will match a single character
\. is used to escape the dot extension in filename

Matched filenames:
<re.Match object; span=(0, 3), match='f1.'>
None
<re.Match object; span=(0, 3), match='f2.'>
None
None
<re.Match object; span=(0, 3), match='f..'>

Notice the last match where . is matching itself
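To make the difference between `.` and `\.` explicit, here is a minimal sketch with made-up strings. It also shows that, by default, `.` does not match a newline:

```python
import re

# Unescaped dot: matches any single character except a newline.
any_char = re.search(r"f.py", "f1py")
newline_case = re.search(r"f.py", "f\npy")

# Escaped dot: matches only a literal '.'.
literal_dot = re.search(r"f\.py", "f.py")
not_literal = re.search(r"f\.py", "f1py")

print(any_char.group())     # f1py
print(newline_case)         # None
print(literal_dot.group())  # f.py
print(not_literal)          # None
```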


### Few abbreviations:
- [a-e] = [abcde] (will match any one character within the class)
- [0-3] = [0123] (will match any one character within the class)
- \d = any digit (\D matches a non-digit character)
- \w = any word character = [a-zA-Z0-9_] for ASCII text (note: `_` is also included in the word class; \W matches a non-word character)
- \s = a whitespace character = [ \t\n\r\f\v] (\S matches a non-whitespace character)
- **^ is used for negation when used within a character class**; Thus,
    - `r"[^aeiou]"` will match `not-vowels`
    - `r"[^\d]"` will match `not-digits`
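The abbreviations above can be tried out on a short string. This is a sketch only, and the sample string `id_7 ok` is invented for the purpose:

```python
import re

sample = "id_7 ok"

first_digit = re.search(r"\d", sample).group()       # first digit
first_non_digit = re.search(r"\D", sample).group()   # first non-digit
first_non_word = re.search(r"\W", sample).group()    # first non-word character
space_index = re.search(r"\s", sample).start()       # position of the first whitespace
not_lowercase = re.search(r"[^a-z]", sample).group() # first char outside a-z
not_vowel = re.search(r"[^aeiou]", "audio").group()  # first non-vowel

print(first_digit)           # 7
print(first_non_digit)       # i
print(repr(first_non_word))  # ' '
print(space_index)           # 4
print(not_lowercase)         # _
print(not_vowel)             # d
```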

### Anchors

- Anchors make sure that the match happens at a certain boundary
- `^` matches the beginning of the string (and of each line when `re.MULTILINE` is set); thus `^[0-9]` matches a digit at the beginning of the string
- `$` matches the end of the string (and of each line when `re.MULTILINE` is set); `\w$` matches a string ending with a word character
    - **Note**: inside a character class `^` means negation, so `[^0-9]` matches any non-digit character
- `\b` is used to define a word boundary 
    - `\bman` will match words starting with man (e.g. mankind, manly, mane etc. but not human, superman etc.)
    - `man\b` will match words ending with man (e.g. human, superman etc.)
    - `\bman\b` will match the word man in a sentence
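Whether `^` and `$` anchor to the whole string or to each line depends on the `re.MULTILINE` flag. The following sketch uses a made-up two-line string to show the difference:

```python
import re

text = "first row\n2nd line"

# By default ^ only matches at the very start of the string.
default_start = re.search(r"^\d", text)
# With re.MULTILINE, ^ also matches right after each newline.
multiline_start = re.search(r"^\d", text, re.MULTILINE)

# By default $ matches at the end of the string ...
default_end = re.search(r"\w+$", text)
# ... and with re.MULTILINE also at the end of every line.
multiline_end = re.search(r"\w+$", text, re.MULTILINE)

print(default_start)            # None
print(multiline_start.group())  # 2
print(default_end.group())      # line
print(multiline_end.group())    # row
```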

### Quick introduction to re.sub
- Used for regex substitution
- Syntax: `re.sub(r'pattern-to-replace', "replacement-string", string)`

**Note**: Strings are `immutable`; hence a new string gets generated by re.sub(), and original string remains unchanged.
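A minimal sketch of that note, using an invented sample string: the result of `re.sub()` is a new string, and the variable holding the original text is left untouched.

```python
import re

original = "good morning"
updated = re.sub(r"o", "0", original)

print(updated)   # g00d m0rning
print(original)  # good morning (unchanged)
```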

## Worked out Examples

In [2]:
# Example: Replace g with G
tstr = "This is great news"
re.sub(r"g","G",tstr)

'This is Great news'

In [5]:
# Example: Replace g with G up to a certain count (the first 2 occurrences of g)
tstr = "This is great that we gathered here that I can gurantee going forward"
re.sub(r"g","G",tstr,count=2)

'This is Great that we Gathered here that I can gurantee going forward'

In [16]:
# Example: Let's match a set of alphanumeric numbers starting with 5, followed by letters/numbers at next 2 places
numlst = ["555-aa-5555","5a5-5a-1234","655-bb-4444","555-b3-abc5"] ## 1st 2 matches, but not the last 2
for val in numlst:
    print(re.search(r"^5[\w][\w]-[\w][\w]-[\d][\d][\d][\d]",val))

<re.Match object; span=(0, 11), match='555-aa-5555'>
<re.Match object; span=(0, 11), match='5a5-5a-1234'>
None
None


In [17]:
# Example: This will match only the word cat, as defined by the boundary; so the 2nd cat is matched, but not Caterpillar
re.search(r"\b[cC]at\b","There's one Caterpillar and a cat in the closet")

<re.Match object; span=(30, 33), match='cat'>

In [18]:
# Example: This will match the single word cat (begin with c|C, then a and end with t)
re.search(r"^[cC]at$","cat")

<re.Match object; span=(0, 3), match='cat'>

In [48]:
# Example: This won't match anything as the sentence doesn't start with c|Cat
re.search(r"^[cC]at","There's one Caterpillar and a cat in the closet") 

In [49]:
# Example: This will match Cat as the sentence starts with Cat
re.search(r"^[cC]at","Cat and mouse game") 

<re.Match object; span=(0, 3), match='Cat'>

In [20]:
# Example: There's no anchor; so Cat from Caterpillar is matched
re.search(r"[cC]at","There's one Caterpillar and a cat in the closet")

<re.Match object; span=(12, 15), match='Cat'>

In [33]:
# Example: Using a list comprehension to match multiple strings; to find man at the beginning of a word
lst = ["Mankind is to blame","Human and the masters","A child will grow to be a man","Manchester, England"]
[wrd for wrd in lst if re.search(r'\b[mM]an',wrd) ]

['Mankind is to blame', 'A child will grow to be a man', 'Manchester, England']

In [50]:
# Example: Using a list comprehension to match multiple strings; to find man at the end of a word
lst = ["Mankind is to blame","Human and the masters","A child will grow to be a man","Manchester, England"]
[wrd for wrd in lst if re.search(r'[mM]an\b',wrd) ]

['Human and the masters', 'A child will grow to be a man']

In [51]:
# Example: Using a list comprehension to match multiple strings; to find man as a whole word
lst = ["Mankind is to blame","Human and the masters","A child will grow to be a man","Manchester, England"]
[wrd for wrd in lst if re.search(r'\b[mM]an\b',wrd) ]

['A child will grow to be a man']

In [44]:
# Example: Also few subtle examples to note:
print(re.search(r'\b[mM]an',"There are many ppl out there")) # will match man at the start of the word many
print(re.search(r'^[mM]an',"There are many ppl out there")) # won't match anything as the sentence doesn't start with man or Man

<re.Match object; span=(10, 13), match='man'>
None


### Repetition
- Match more than 1 character 
- Match specific number of characters
- Match characters within a given count
- Match `0 or more occurrences` of a character using `*`
- Match `1 or more occurrences` of a character using `+`
- Match `0 or 1 occurrence` of a character using `?`
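The three quantifiers `*`, `+` and `?` can be compared side by side. This is a sketch with made-up words, anchored with `^` and `$` so that the whole word has to fit the pattern:

```python
import re

words = ["ct", "cat", "caat", "caaat"]

star = [w for w in words if re.search(r"^ca*t$", w)]      # zero or more a
plus = [w for w in words if re.search(r"^ca+t$", w)]      # one or more a
optional = [w for w in words if re.search(r"^ca?t$", w)]  # zero or one a

print(star)      # ['ct', 'cat', 'caat', 'caaat']
print(plus)      # ['cat', 'caat', 'caaat']
print(optional)  # ['ct', 'cat']
```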

In [62]:
# Match more than 1 character 
print(re.search(r'[0-9]{2}',"12")) # will match exactly 2 occurrences of a digit
print(re.search(r'[0-9]{2,}',"12223456")) # will match 2 or more occurrences of a digit
print(re.search(r'[0-9]{2,3}',"12")) # will match 2 to 3 occurrences of a digit
print(re.search(r'[0-9]{2,3}',"12223456")) # will match 2 to 3 occurrences; stops after 3 digits
print(re.search(r'[0-9]{2,3}',"1")) # won't match anything as there is only 1 digit
print(re.search(r'[0-9]{,3}',"12223456")) # will match up to 3 digits (0 to 3)

<re.Match object; span=(0, 2), match='12'>
<re.Match object; span=(0, 8), match='12223456'>
<re.Match object; span=(0, 2), match='12'>
<re.Match object; span=(0, 3), match='122'>
None
<re.Match object; span=(0, 3), match='122'>


In [63]:
# Let's write a regex to match all SSNs
# Let's assume that ssn might contain - or , as separator
ssns = ["122-51-1752","221,55,2222","33-444-5555","0,000,000","999-99-9999"]
[num for num in ssns if re.search(r'\d{3}[-,]\d{2}[-,]\d{4}',num)]

['122-51-1752', '221,55,2222', '999-99-9999']
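Note that `re.search()` succeeds if the pattern is found anywhere in the string, so longer strings that merely contain an SSN-shaped substring also pass. One possible way to tighten this, sketched here with hypothetical inputs, is `re.fullmatch()`, which requires the entire string to match:

```python
import re

# Hypothetical inputs: the first two only contain an SSN-shaped substring.
candidates = ["1122-51-17529", "id:122-51-1752", "122-51-1752"]
ssn_pattern = r"\d{3}[-,]\d{2}[-,]\d{4}"

loose = [c for c in candidates if re.search(ssn_pattern, c)]
strict = [c for c in candidates if re.fullmatch(ssn_pattern, c)]

print(loose)   # ['1122-51-17529', 'id:122-51-1752', '122-51-1752']
print(strict)  # ['122-51-1752']
```

Wrapping the pattern in `^` and `$` with `re.search()` gives a similar effect.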

In [6]:
# Let's write a regex to match the below names
namelst = ["Alice,B12","Mr. Rob6666","Oliver Hans Hulk"]
[name for name in namelst if re.search(r'^[\w][\w]+[,\.]?\s?\w+\s?\w+',name)]

['Alice,B12', 'Mr. Rob6666', 'Oliver Hans Hulk']

In [9]:
# Check whether the given strings contain 0xC0 . Display a boolean result
line1 = 'start address: 0xF0, func1 address: 0xC0'
line2 = 'end address: 0xEF, func2 address: 0xA0'

print(bool(re.search(r'0xC0',line1)))
print(bool(re.search(r'0xC0',line2)))

True
False


In [10]:
# Example: Replace all occurrences of 6 with six for the given string
nstr = 'They have 6 shoes and 6 socks'
re.sub(r'6',"six",nstr)

'They have six shoes and six socks'

In [11]:
# Example: Replace first 2 occurrences of 6 with six for the given string
nstr = 'They have 6 shoes and 6 socks and 6 shirts'
re.sub(r'6',"six",nstr,count=2)

'They have six shoes and six socks and 6 shirts'

In [25]:
# Example: For the given list, filter all elements that do not contain e 
items = ['apple', 'item', 'care', 'boy', 'girl', 'lunch',"get","baffle","eat","elephant"]
print(f"All elements that do not contain e:{[word for word in items if not re.search(r'e', word)]}")

# Example: For the given list, filter all elements that contain e 
print(f"All elements that contain e:{[word for word in items if re.search(r'e', word)]}")

# Example: For the given list, filter all elements that end with e 
print(f"All elements that end with e:{[word for word in items if re.search(r'e$', word)]}")

# Example: For the given list, filter all elements that start with e 
print(f"All elements that start with e:{[word for word in items if re.search(r'^e', word)]}")


All elements that do not contain e:['boy', 'girl', 'lunch']
All elements that contain e:['apple', 'item', 'care', 'get', 'baffle', 'eat', 'elephant']
All elements that end with e:['apple', 'care', 'baffle']
All elements that start with e:['eat', 'elephant']


In [28]:
# Example: Replace all occurrences of doc irrespective of case with DOC 
nstr = 'This document needs to be DOcKed'
re.sub(r'doc',"DOC",nstr,flags=re.IGNORECASE) # flags must be passed by keyword here, because the 4th positional argument of re.sub is count (in re.search the 3rd positional argument is flags)

'This DOCument needs to be DOCKed'

In [31]:
# Example: Check if pill is present in the given byte input data
tstr = b'Domestic catterpillar'
print(re.search(r'pill',tstr)) # this will raise a TypeError as the pattern is a str but the input is bytes

TypeError: cannot use a string pattern on a bytes-like object

In [32]:
print(re.search(rb'pill',tstr)) #this will work

<re.Match object; span=(15, 19), match=b'pill'>


In [51]:
# Example: For the given string, replace 0xA0 with 0x7F and 0xC0 with 0x1F 

## Option 1: Using a dictionary
tstr = 'start address: 0xA0, func1 address: 0xC0'        
dct = {r"0xA0":"0x7F",r"0xC0":"0x1F"} 

for k,v in dct.items():
    tstr = re.sub(k,v,tstr) # rebind tstr to the new string after each substitution

print(f"Option 1:Using dct; modifies original string : {tstr}")


## Option 2: Nesting one re.sub() call inside another. This doesn't scale as the number of replacements increases
nstr = 'start address: 0xA0, func1 address: 0xC0'        
print("Option 2:Original string unmodified: ", re.sub(r"0xC0","0x1F",re.sub(r"0xA0","0x7F",nstr)))

Option 1:Using dct; modifies original string : start address: 0x7F, func1 address: 0x1F
Option 2:Original string unmodified:  start address: 0x7F, func1 address: 0x1F
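A third possibility, shown here only as a sketch and not as part of the options above, is to do all replacements in a single pass: join the keys into one alternation pattern and pass a function as the replacement. The names `replacements` and `pattern` are illustrative.

```python
import re

tstr = "start address: 0xA0, func1 address: 0xC0"
replacements = {"0xA0": "0x7F", "0xC0": "0x1F"}

# Build one alternation pattern: 0xA0|0xC0
# (re.escape guards against keys that contain special regex characters).
pattern = "|".join(re.escape(key) for key in replacements)

# re.sub calls the function with each match object and uses its return value.
result = re.sub(pattern, lambda m: replacements[m.group()], tstr)
print(result)  # start address: 0x7F, func1 address: 0x1F
```

Because the string is scanned only once, text produced by one replacement is never picked up again by a later one.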
