## Python Regex - An exhaustive guide with examples

**Note**:
- This notebook is built on the `learn-by-doing` principle.
- Plenty of code examples are given, each with an explanation.
- For a thorough understanding of any built-in regex function, please refer to the official Python documentation.
- The **re** module is the powerhouse, so importing it is a must.

In [2]:
import re

### Simple patterns (letter, word etc.) via re.search()

- Function signature: `re.search(pattern, string-to-perform-search)`
- If the search is successful, it returns a match object
- Otherwise it returns `None`

>- Format: re.search(**r"pattern"**, string-to-perform-search)
- `r"pattern"` has a special meaning. It stands for a `raw python string`, which in plain English is a directive to the interpreter to treat backslashes in the pattern literally instead of as escape sequences. Thus `print("\n")` will generate a newline, while `print(r"\n")` will output the two characters `\n`.

In [1]:
print("\n")





In [2]:
print(r"\n")

\n
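Raw strings matter for regex because several regex tokens collide with Python's own escape sequences. The sketch below (the variable names are only illustrative) shows that `"\b"` in a normal string is the backspace character, so a pattern written without the `r` prefix silently looks for something else.

```python
import re

# "\n" is one character (a newline); r"\n" is two characters (backslash + n)
print(len("\n"), len(r"\n"))

# In a normal string "\b" is the backspace character,
# so the regex engine never sees the word-boundary token
plain_pattern = "\bcat"
raw_pattern = r"\bcat"

text = "a cat sat"
print(re.search(plain_pattern, text))  # no backspace character in the text
print(re.search(raw_pattern, text))    # 'cat' found at a word boundary
```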


In [4]:
#Example:
re.search(r"b","badboy")

<re.Match object; span=(0, 1), match='b'>

### Observations from above:
- The match starts at position 0
- The match ends at position 1 (the end index is exclusive, so the span `(0, 1)` covers a single character)
- The `group()` method returns the matched text

In [6]:
re.search(r"b","badboy").group()

'b'
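Besides `group()`, a match object exposes the positions shown in its printed form. A small sketch, using a pattern chosen only for illustration:

```python
import re

match_obj = re.search(r"boy", "badboy")

print(match_obj.group())  # the matched text
print(match_obj.start())  # index where the match begins
print(match_obj.end())    # index just past the last matched character
print(match_obj.span())   # (start, end) as a tuple

# The span can be used to slice the original string
start, end = match_obj.span()
print("badboy"[start:end])
```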

In [7]:
#Example: let's wrap the above within a function
def func1(pattern, string):
    match_obj = re.search(pattern, string)
    if match_obj:
        print(f"Found a match for {match_obj.group()} in: {string}")
    else:
        print("Pattern not found")   

In [8]:
pat1 = r"n"
pat2 = r"cat"

In [9]:
func1(pat1,"number")

Found a match for n in: number


In [10]:
func1(pat2,"There's one caterpillar and a cat in the closet") # note that only the 1st occurrence of cat (the one inside caterpillar) is matched

Found a match for cat in: There's one caterpillar and a cat in the closet


In [11]:
func1(pat2,"There's one Caterpillar and a Cat in the closet") # note: cat and Cat are different

Pattern not found


In [12]:
# Ignoring case via re.IGNORECASE
re.search(r"cat","There's one Caterpillar and a Cat in the closet",re.IGNORECASE).group() # note: this will work

'Cat'

In [5]:
re.search(r"[cC]at","There's one caterpillar in the closet") # note: this will also work

<re.Match object; span=(12, 15), match='cat'>

### Big Idea:
- `r"a"` matches `a` at any place within the string
- `r"bit"` matches `bit,fitbit,bite,bitten etc.`, but doesn't match `biit,bid etc.`
- `re.IGNORECASE` makes the match case-insensitive; alternatively, a character class such as `[cC]` can cover both cases of a single letter.

### Above is an example of using a class of characters

- Intuition: Putting a list of characters within `[]` matches any one character from the given list.
- So, `r"[Bb][aA]t"` will match Bat, bat, BAt, bAt.

In [20]:
pat4 = r"[aeiou]"
func1(pat4,"another")
func1(pat4,"elephant")
func1(pat4,"is")
func1(pat4,"oliver")
func1(pat4,"uber")

Found a match for a in: another
Found a match for e in: elephant
Found a match for i in: is
Found a match for o in: oliver
Found a match for u in: uber


### Example:

- Let's try to see if "elephant" contains any vowels

**A plain Python implementation will be something like:**
>"a" in "elephant" or "e" in "elephant" or "i" in "elephant" and so on..

**The regex implementation will be:**
> match_obj = re.search(r"[aeiou]", "elephant")
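The comparison above can be sketched as runnable code. The variable names are only illustrative, and `any()` is just one way of writing the plain Python check.

```python
import re

word = "elephant"

# Plain Python: test each vowel in turn
has_vowel_plain = any(vowel in word for vowel in "aeiou")

# Regex: a single character class covers all five vowels
has_vowel_regex = bool(re.search(r"[aeiou]", word))

print(has_vowel_plain, has_vowel_regex)

# A word without any of the five vowels
no_vowel = bool(re.search(r"[aeiou]", "rhythm"))
print(no_vowel)
```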


### Matching a single character (using .):
 `.` matches any single character except a newline (letters, digits, punctuation and the `.` itself). To match a literal dot, escape it as `\.`.

_Let's look at an example_

In [5]:
#Example:
filelst = ["f1.py","f11.py","f2.py","f111.py","fat4.py","f..py"]
print(". will match a single character")
print(r"\. is used to escape the dot extension in filename" + "\n")
print("Matched filenames:")
for file in filelst:
    # f, any one character, then a word boundary (\b): matches f1.py and f2.py, but not f11.py, f111.py or fat4.py
    # f..py is not matched either, as there is no word boundary between the two dots; r"f.\." would match it
    print(re.search(r"f.\b",file))

. will match a single character
\. is used to escape the dot extension in filename

Matched filenames:
<re.Match object; span=(0, 2), match='f1'>
None
<re.Match object; span=(0, 2), match='f2'>
None
None
None
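If the intent is "f, one arbitrary character, then the literal `.py` extension", escaping the dot expresses it directly. A minimal sketch, assuming whole filenames are to be matched (hence `re.fullmatch`):

```python
import re

filelst = ["f1.py", "f11.py", "f2.py", "f111.py", "fat4.py", "f..py"]

# f, then any one character, then a literal dot followed by py
pattern = r"f.\.py"

matched = [name for name in filelst if re.fullmatch(pattern, name)]
print(matched)
```

Here `f..py` is matched as well, since the unescaped `.` accepts a dot as its single character.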


### Few abbreviations:
- [a-e] = [abcde] (will match any one character within the class)
- [0-3] = [0123] (will match any one character within the class)
- \d = any digit (\D matches a non-digit character)
- \w = any word character = [a-zA-Z0-9_] for ASCII text (note: `_` is also included in the word class; \W matches a non-word character)
- \s = a whitespace character = [ \t\n\r\f\v] (\S matches a non-whitespace character)
- **^ is used for negation when used within a character class**; Thus,
    - `r"[^aeiou]"` will match any character that is not a lowercase vowel
    - `r"[^\d]"` will match any character that is not a digit
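The abbreviations can be tried out quickly with `re.findall`, which returns every non-overlapping match as a list. It is used here purely for illustration, and the sample string is made up.

```python
import re

sample = "Room 42_b, floor 7!"

digits = re.findall(r"\d", sample)           # digits only
words = re.findall(r"\w+", sample)           # runs of word characters (note the underscore)
spaces = re.findall(r"\s", sample)           # whitespace characters
punctuation = re.findall(r"[^\w\s]", sample) # neither word characters nor whitespace
a_to_e = re.findall(r"[a-e]", sample)        # lowercase letters from a to e

print(digits, words, spaces, punctuation, a_to_e, sep="\n")
```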

### Anchors

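- Used to make sure that the match happens at a certain position (boundary) in the text
- `^` matches the beginning of the string (and of every line when `re.MULTILINE` is set); thus `^[0-9]` matches a digit at the beginning of the string
- `$` matches the end of the string, or just before a trailing newline (and the end of every line when `re.MULTILINE` is set); `\w$` matches a string ending with a word character
    - **Note**: inside a character class `^` means negation instead; `r"[^0-9]"` matches any character that is not a digit
- `\b` is used to define a word boundary 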
    - This works for both the `start of word and end of word anchoring`. **Start of word** means either the character _prior to the word is a non-word character_ or there is _no character (start of string)_. Similarly, **end of word** means the _character after the word is a non-word character or no character (end of string)_. This implies that you cannot have word boundary \b without a word character.
    - `\bman` will match words starting with man (e.g: mankind, manly, mane etc. but not human,superman etc.)
    - `man\b` will match words ending with man (e.g. human, superman etc.)
    - `\bman\b` will match the word man in a sentence
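The three `\b` variants above can be compared side by side. A short sketch with an illustrative word list:

```python
import re

words = ["mankind", "human", "man", "superman", "manly"]

starts_with_man = [w for w in words if re.search(r"\bman", w)]
ends_with_man = [w for w in words if re.search(r"man\b", w)]
whole_word_man = [w for w in words if re.search(r"\bman\b", w)]

print(starts_with_man)
print(ends_with_man)
print(whole_word_man)
```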

### Quick introduction to re.sub
- Used for regex substitution
- Syntax: `re.sub(r"pattern-to-replace", "replacement", string)`

**Note**: Strings are `immutable`; hence re.sub() returns a new string, and the original string remains unchanged.
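A quick check of the immutability note, with made-up strings: the original is untouched unless the name is rebound to the value returned by `re.sub()`.

```python
import re

original = "spam and eggs"
replaced = re.sub(r"eggs", "ham", original)

print(original)  # unchanged
print(replaced)

# To "modify" a variable, rebind the name to the returned string
original = re.sub(r"spam", "toast", original)
print(original)
```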

## Worked out Examples

In [2]:
# Example: Replace g with G
tstr = "This is great news"
re.sub(r"g","G",tstr)

'This is Great news'

In [5]:
# Example: Replace only the first 2 occurrences of g with G (via count)
tstr = "This is great that we gathered here that I can gurantee going forward"
re.sub(r"g","G",tstr,count=2)

'This is Great that we Gathered here that I can gurantee going forward'

In [16]:
# Example: Let's match alphanumeric IDs starting with 5, followed by letters/numbers at the next 2 places
numlst = ["555-aa-5555","5a5-5a-1234","655-bb-4444","555-b3-abc5"] ## 1st 2 matches, but not the last 2
for val in numlst:
    print(re.search(r"^5[\w][\w]-[\w][\w]-[\d][\d][\d][\d]",val))

<re.Match object; span=(0, 11), match='555-aa-5555'>
<re.Match object; span=(0, 11), match='5a5-5a-1234'>
None
None


In [110]:
#Examples related to word boundary
words = 'par spar apparent spare part'
# replace 'par' irrespective of where it occurs
print("Full substitution:",re.sub(r'par', 'X', words))
# replace 'par' only at start of word
print("Word boundary;so par and part are replaced:",re.sub(r'\bpar', 'X', words))
print("Line anchor;so par at the beginning of line is replaced:",re.sub(r'^par', 'X', words))
# replace 'par' only at end of word
print("This won't work:",re.sub(r"par$","X", words)) # this won't work as line doesn't end in par
print("Word boundary;so par and spar are replaced:", re.sub(r"par\b","X",words))

Full substitution: X sX apXent sXe Xt
Word boundary;so par and part are replaced: X spar apparent spare Xt
Line anchor;so par at the beginning of line is replaced: X spar apparent spare part
This won't work: par spar apparent spare part
Word boundary;so par and spar are replaced: X sX apparent spare part


In [17]:
# Example: This will match only the word cat, as defined by the boundary; so the 2nd cat is matched, but not Caterpillar
re.search(r"\b[cC]at\b","There's one Caterpillar and a cat in the closet")

<re.Match object; span=(30, 33), match='cat'>

In [113]:
# Example: For the given input string, change only whole word red to brown
words = 'bred red spread credible'
print(re.sub(r"\bred\b","brown",words))

bred brown spread credible


In [16]:
# Example: For the given input list, filter all elements that contain 42 surrounded by word characters
words = ['hi42bye', 'nice1423', 'bad42', 'cool_42a', 'fake4b']
print("Option 1:",[w for w in words if re.search(r"\B42\B",w)])
print("Option 2:",[w for w in words if re.search(r"\w+42\w+",w)])

Option 1: ['hi42bye', 'nice1423', 'cool_42a']
Option 2: ['hi42bye', 'nice1423', 'cool_42a']


### Explanation of the above approach (using \B):

- `\b` is a word boundary. So `\b42` (start of word) will only match if the character before 42 is a non-word character, or there is no character (start of string). The same goes for `42\b`: it needs a non-word character after 42, or no character (end of string).
- Conversely, `\B` matches wherever `\b` does not; so `\B42` will match only if there's a **word character** right before 42
- Thus, `\B42\B` matches only if there's a word character on either side of 42. That is why `bad42` is filtered out: 42 sits at the end of the string.
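One way to see where `\b` and `\B` apply is to mark every such position with a visible character. This is only a visualisation sketch; it relies on `re.sub()` replacing zero-width matches, which behaves this way on Python 3.7 and later.

```python
import re

text = "hi42bye bad42"

# Mark every word boundary with |
boundaries = re.sub(r"\b", "|", text)

# Mark every non-boundary position with |
non_boundaries = re.sub(r"\B", "|", text)

print(boundaries)
print(non_boundaries)
```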


In [127]:
# Example: For the given input list, filter all elements that start with den or end with ly 
words = ['lovely', '1\ndentist', '2 lonely', 'eden', 'fly\n', 'dent']
[w for w in words if re.search(r"\Aden|ly\Z",w)]

['lovely', '2 lonely', 'dent']

In [128]:
# Example: For the given input list, filter all lines that start with den or end with ly (take \n into account)
words = ['lovely', '1\ndentist', '2 lonely', 'eden', 'fly\n', 'dent']
[w for w in words if re.search(r"^den|ly$",w,flags = re.MULTILINE)]

['lovely', '1\ndentist', '2 lonely', 'fly\n', 'dent']
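The two results above differ because of how the anchors treat newlines. A small sketch isolating the two elements that changed:

```python
import re

text = "fly\n"

dollar = re.search(r"ly$", text)    # $ also matches just before a trailing newline
big_z = re.search(r"ly\Z", text)    # \Z matches only at the very end of the string

multi = "1\ndentist"
caret_default = re.search(r"^den", multi)                        # ^ is start of string only
caret_multiline = re.search(r"^den", multi, flags=re.MULTILINE)  # ^ is start of every line
big_a_multiline = re.search(r"\Aden", multi, flags=re.MULTILINE) # \A ignores re.MULTILINE

print(dollar, big_z, caret_default, caret_multiline, big_a_multiline, sep="\n")
```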

In [17]:
# Example: For the given input string, change whole word mall to 1234 only if it is at the start of a line
para = '''ball fall wall tall\nmall call ball pall\nwall mall ball fall\nmallet wallet malls'''
print("Org line:")
print(para)
print("\nModified line:")
print(re.sub(r"^mall\b","1234",para,flags = re.MULTILINE))

Org line:
ball fall wall tall
mall call ball pall
wall mall ball fall
mallet wallet malls

Modified line:
ball fall wall tall
1234 call ball pall
wall mall ball fall
mallet wallet malls


In [146]:
# Example: For the given input list, keep only the elements that are exactly 12\nthree as a whole, irrespective of case (re.MULTILINE has no effect here, as the pattern has no ^ or $)
items = ['12\nthree\n', '12\nThree', '12\nthree\n4', '12\nthree']
print("Option 1:",[ele for ele in items if re.fullmatch(r"12\nthree",ele,flags = re.MULTILINE|re.IGNORECASE)])

Option 1: ['12\nThree', '12\nthree']


In [172]:
# Example: For the given input list, replace hand with X for all elements that start with hand followed by at least one word character.
items = ['handed', 'hand', 'handy', 'unhanded', 'handle', 'hand-2']
print("Option 1:")
print("Elements with hand as a word boundary; note the \"-\" which doesn't fall within \\w",[ele for ele in items if re.search(r"\bhand\b",ele)])
print("We need to negate the above list and pass it on to re.sub(); so newlist will be:")
nlst = [ele for ele in items if re.search(r"\bhand\B",ele)]
print(nlst)
print("\nApplying sub:")
print([re.sub(r"hand","X",ele) for ele in nlst ])

print("\nDoing all in one shot:")
print([re.sub(r"\bhand\B","X",val) for val in items])


Option 1:
Elements with hand as a word boundary; note the "-" which doesn't fall within \w ['hand', 'hand-2']
We need to negate the above list and pass it on to re.sub(); so newlist will be:
['handed', 'handy', 'handle']

Applying sub:
['Xed', 'Xy', 'Xle']

Doing all in one shot:
['Xed', 'hand', 'Xy', 'unhanded', 'Xle', 'hand-2']


In [18]:
# Example: This will match the single word cat (begin with c|C, then a and end with t)
re.search(r"^[cC]at$","cat")

<re.Match object; span=(0, 3), match='cat'>

In [75]:
# Example: Same can be achieved with re.fullmatch() (begin with c|C, then a and end with t)
re.fullmatch(r"[cC]at","cat") # matches complete input string, not just part

<re.Match object; span=(0, 3), match='cat'>

In [77]:
# Example: This won't match anything
print(re.search(r"^[cC]at$","catterpillar"))
print(re.fullmatch(r"[cC]at","catterpillar"))

None
None


In [48]:
# Example: This won't match anything as the sentence doesn't start with c|Cat
re.search(r"^[cC]at","There's one Caterpillar and a cat in the closet") 

In [49]:
# Example: This will match Cat as the sentence starts with Cat
re.search(r"^[cC]at","Cat and mouse game") 

<re.Match object; span=(0, 3), match='Cat'>

In [20]:
# Example: There's no anchor; so Cat from Caterpillar is matched
re.search(r"[cC]at","There's one Caterpillar and a cat in the closet")

<re.Match object; span=(12, 15), match='Cat'>

In [86]:
# Example: This won't match anything as we haven't specified multiline
print(re.search(r'^pop', 'hi top\npop spot'))
print(re.search(r'^pop', 'hi top\npop spot', flags=re.MULTILINE)) # this will work

None
<re.Match object; span=(7, 10), match='pop'>


In [90]:
# Example: Filter all elements having lines ending with 'are'
ele = ['air\ntool\ndeclare', 'their\n', 'dare','tool\nare useful','tools']
[val for val in ele if re.search(r"are$",val, flags = re.MULTILINE)]

['air\ntool\ndeclare', 'dare']

In [92]:
# Example: Check if any complete line in the string is 'tool'
[val for val in ele if re.search(r"^tool$",val,flags = re.MULTILINE)]

['air\ntool\ndeclare', 'tool\nare useful']

In [100]:
# Example: Add a full stop at the end of every line
lines = "Enter\nThe\nDragon"
print("Original line:")
print(lines)
print("\nModified line:")
print(re.sub(r"$",".",lines, flags = re.MULTILINE))

Original line:
Enter
The
Dragon

Modified line:
Enter.
The.
Dragon.


In [33]:
# Example: Using a list comprehension to check multiple strings; to find man at the beginning of a word
lst = ["Mankind is to blame","Human and the masters","A child will grow to be a man","Manchester, England"]
[wrd for wrd in lst if re.search(r'\b[mM]an',wrd) ]

['Mankind is to blame', 'A child will grow to be a man', 'Manchester, England']

In [44]:
# Example: Also a few subtle examples to note:
print(re.search(r'\b[mM]an',"There are many ppl out there")) # will match man at the start of the word many
print(re.search(r'^[mM]an',"There are many ppl out there")) # won't match anything as the sentence doesn't start with man or Man

<re.Match object; span=(10, 13), match='man'>
None


### String operations using anchors and sub()

In [69]:
re.sub(r"^","para","meter") # prepend: ^ matches the empty position at the start of the string

'parameter'

In [71]:
re.sub(r"$","try","coun") # append

'country'

### Repetition
- Match more than 1 character 
- Match a specific number of characters using `{n}`
- Match a number of characters within a given range using `{m,n}` (`{m,}` means m or more, `{,n}` means up to n)
- Match `0 or more occurrences` of a character using `*`
- Match `1 or more occurrences` of a character using `+`
- Match `0 or 1 occurrence` of a character using `?`
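The quantifiers `*`, `+` and `?` from the list above can be compared on one small set of strings. A sketch with illustrative candidates, using `re.fullmatch` so that the whole string has to fit the pattern:

```python
import re

candidates = ["ct", "cat", "caat", "caaat"]

zero_or_more = [s for s in candidates if re.fullmatch(r"ca*t", s)]
one_or_more = [s for s in candidates if re.fullmatch(r"ca+t", s)]
zero_or_one = [s for s in candidates if re.fullmatch(r"ca?t", s)]
exactly_two = [s for s in candidates if re.fullmatch(r"ca{2}t", s)]

print(zero_or_more)
print(one_or_more)
print(zero_or_one)
print(exactly_two)
```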

In [62]:
# Example:
# Match more than 1 character 
print(re.search(r'[0-9]{2}',"12")) # will match exactly 2 occurrences of a digit
print(re.search(r'[0-9]{2,}',"12223456")) # will match 2 or more occurrences of a digit
print(re.search(r'[0-9]{2,3}',"12")) # will match 2 to 3 occurrences of a digit
print(re.search(r'[0-9]{2,3}',"12223456")) # at most 3 digits are taken, so only 122 is matched
print(re.search(r'[0-9]{2,3}',"1")) # won't match anything, as at least 2 digits are needed
print(re.search(r'[0-9]{,3}',"12223456")) # will match up to 3 digits (0 to 3)

<re.Match object; span=(0, 2), match='12'>
<re.Match object; span=(0, 8), match='12223456'>
<re.Match object; span=(0, 2), match='12'>
<re.Match object; span=(0, 3), match='122'>
None
<re.Match object; span=(0, 3), match='122'>


In [63]:
# Example:
# Let's write a regex to match all SSNs
# Let's assume that ssn might contain - or , as separator
ssns = ["122-51-1752","221,55,2222","33-444-5555","0,000,000","999-99-9999"]
[num for num in ssns if re.search(r'\d{3}[-,]\d{2}[-,]\d{4}',num)]

['122-51-1752', '221,55,2222', '999-99-9999']
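Note that `re.search` with this pattern also accepts strings that merely contain an SSN-shaped substring, or that mix the two separators. A hypothetical stricter variant (the extra test strings and the "same separator" rule are assumptions, not part of the example above) could use `re.fullmatch`:

```python
import re

ssns = ["122-51-1752", "221,55,2222", "9122-51-17520", "122-51,1752"]

# Pattern from the example above: a match anywhere in the string is enough
loose = [num for num in ssns if re.search(r"\d{3}[-,]\d{2}[-,]\d{4}", num)]

# Assumed stricter rule: the whole string must match and both separators must be the same
strict = [num for num in ssns if re.fullmatch(r"\d{3}-\d{2}-\d{4}|\d{3},\d{2},\d{4}", num)]

print(loose)
print(strict)
```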

In [6]:
# Example: Let's write a regex to match the below names
namelst = ["Alice,B12","Mr. Rob6666","Oliver Hans Hulk"]
[name for name in namelst if re.search(r'^[\w][\w]+[,\.]?\s?\w+\s?\w+',name)]

['Alice,B12', 'Mr. Rob6666', 'Oliver Hans Hulk']

In [9]:
# Example: Check whether the given strings contain 0xC0 . Display a boolean result
line1 = 'start address: 0xF0, func1 address: 0xC0'
line2 = 'end address: 0xEF, func2 address: 0xA0'

print(bool(re.search(r'0xC0',line1)))
print(bool(re.search(r'0xC0',line2)))

True
False


In [10]:
# Example: Replace all occurrences of 6 with six for the given string
nstr = 'They have 6 shoes and 6 socks'
re.sub(r'6',"six",nstr)

'They have six shoes and six socks'

In [11]:
# Example: Replace first 2 occurrences of 6 with six for the given string
nstr = 'They have 6 shoes and 6 socks and 6 shirts'
re.sub(r'6',"six",nstr,count=2)

'They have six shoes and six socks and 6 shirts'

In [27]:
# Example: For the given list, filter all elements that do not contain e 
items = ['apple', 'item', 'care', 'boy', 'girl', 'lunch',"get","baffle","eat","elephant"]
print(f"All elements that do not contain e:{[word for word in items if not re.search(r'e', word)]}")

# Example: For the given list, filter all elements that contain e 
print(f"All elements that contain e:{[word for word in items if re.search(r'e', word)]}")

# Example: For the given list, filter all elements that end with e 
print(f"All elements that end with e:{[word for word in items if re.search(r'e$', word)]}")

# Example: For the given list, filter all elements that start with e 
# Note: before Python 3.12 an f-string expression cannot include a backslash,
# so anchors such as \A or \Z need the plain print("label", value) form instead
print(f"All elements that start with e:{[word for word in items if re.search(r'^e', word)]}")


All elements that do not contain e:['boy', 'girl', 'lunch']
All elements that contain e:['apple', 'item', 'care', 'get', 'baffle', 'eat', 'elephant']
All elements that end with e:['apple', 'care', 'baffle']
All elements that start with e:['eat', 'elephant']

In [49]:
# Example
items = ['apple', 'item', 'care', 'boy', 'girl', 'lunch',"get","baffle","eat","elephant"]
print("All elements that ends with e:",[word for word in items if re.search(r'e\Z', word)])
print("All elements that starts with e:",[word for word in items if re.search(r'\Ae', word)])

All elements that ends with e: ['apple', 'care', 'baffle']
All elements that starts with e: ['eat', 'elephant']


In [28]:
# Example: Replace all occurrences of doc irrespective of case with DOC 
nstr = 'This document needs to be DOcKed'
re.sub(r'doc',"DOC",nstr,flags=re.IGNORECASE) # flags must be passed by keyword here: the 4th positional parameter of re.sub is count (in re.search the 3rd positional parameter is flags)

'This DOCument needs to be DOCKed'
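The reason for the keyword can be made visible. `re.IGNORECASE` is an integer-valued flag, so when it lands in the `count` slot it is read as a replacement limit. The sketch below passes that integer as `count` explicitly to show what such a call effectively does.

```python
import re

nstr = 'This document needs to be DOcKed'

# re.IGNORECASE is an integer-valued flag
flag_value = int(re.IGNORECASE)
print(flag_value)

# What happens when the flag ends up in the count slot:
# a case-sensitive search with at most flag_value replacements
as_count = re.sub(r'doc', "DOC", nstr, count=flag_value)

# Passing it by keyword gives the intended case-insensitive replacement
as_flag = re.sub(r'doc', "DOC", nstr, flags=re.IGNORECASE)

print(as_count)
print(as_flag)
```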

In [31]:
# Example: Check if pill is present in the given byte input data
tstr = b'Domestic catterpillar'
print(re.search(r'pill',tstr)) # this will throw an error, as the pattern is a str while the input is bytes

TypeError: cannot use a string pattern on a bytes-like object

In [32]:
print(re.search(rb'pill',tstr)) #this will work

<re.Match object; span=(15, 19), match=b'pill'>


In [51]:
# Example: For the given string, replace 0xA0 with 0x7F and 0xC0 with 0x1F 

## Option 1: Using a dictionary
tstr = 'start address: 0xA0, func1 address: 0xC0'        
dct = {r"0xA0":"0x7F",r"0xC0":"0x1F"} 

for k,v in dct.items():
    nstr = re.sub(k,v,tstr)   # v is the replacement for pattern k
    tstr = nstr               # rebind tstr, so the next substitution works on the updated string

print(f"Option 1:Using dct; modifies original string : {nstr}")


## Option 2: Calling outer re.sub() on inner re.sub(). This does not scale, as the nesting grows with the number of replacements
nstr = 'start address: 0xA0, func1 address: 0xC0'        
print("Option 2:Original string unmodified: ", re.sub(r"0xC0","0x1F",re.sub(r"0xA0","0x7F",nstr)))

Option 1:Using dct; modifies original string : start address: 0x7F, func1 address: 0x1F
Option 2:Original string unmodified:  start address: 0x7F, func1 address: 0x1F
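A third possibility, shown here only as a sketch, is to do all replacements in a single pass: join the keys into one alternation and let a function pick the replacement for each match. The names `replacements` and `pattern` are illustrative.

```python
import re

tstr = 'start address: 0xA0, func1 address: 0xC0'
replacements = {"0xA0": "0x7F", "0xC0": "0x1F"}

# Build one alternation from the keys; re.escape guards against special characters in a key
pattern = "|".join(re.escape(key) for key in replacements)

# A function used as the replacement receives each match object and looks up its substitute
result = re.sub(pattern, lambda m: replacements[m.group()], tstr)
print(result)
```

Because the string is scanned only once, the output of one replacement is never picked up again by a later one.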


In [56]:
# Example: For the given list, filter all elements that contain either a or w
items = ['goal', 'new', 'user', 'sit', 'eat', 'dinner']

# Option 1:
print("Option 1:",[wrd for wrd in items if re.search(r"a",wrd) or re.search(r"w",wrd)])
print("Option 2:",[wrd for wrd in items if re.search(r"[aw]",wrd)])
print("Option 3:",[wrd for wrd in items if re.search(r"a|w",wrd)])

Option 1: ['goal', 'new', 'eat']
Option 2: ['goal', 'new', 'eat']
Option 3: ['goal', 'new', 'eat']


In [68]:
# Example: For the given list, filter all elements that contains both e and n
items = ['goal', 'new', 'user', 'sit', 'eat', 'dinner']

# Option 1:
print("Option 1:",[wrd for wrd in items if re.search(r"e",wrd) and re.search(r"n",wrd)])


Option 1: ['new', 'dinner']
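The same filter can also be written as a single pattern. One possible sketch uses alternation to cover both orders, e before n and n before e. It assumes the elements contain no newlines, since `.` does not match a newline by default.

```python
import re

items = ['goal', 'new', 'user', 'sit', 'eat', 'dinner']

# Either an e followed later by an n, or an n followed later by an e
both = [wrd for wrd in items if re.search(r"e.*n|n.*e", wrd)]
print("Option 2:", both)
```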


### a) Check if the given strings start with `be`

In [13]:
# Example:
line1 = 'be nice'
line2 = '"best!"'
line3 = 'better?'
line4 = 'oh no\nbear spotted'
pat = re.compile(r"\Abe")       # \A anchors the match to the start of the string

print(bool(pat.search(line1)))
print(bool(pat.search(line2)))
print(bool(pat.search(line3)))
print(bool(pat.search(line4)))

True
False
True
False


### For the given input list, filter all elements starting with `h` . Additionally, replace `e` with `X` for these filtered elements.

In [48]:
# Example:
items = ['handed', 'hand', 'handy', 'unhanded', 'handle', 'hand-2']
print([re.sub(r"e","X",val) for val in items if re.search(r"\bh",val) ])

['handXd', 'hand', 'handy', 'handlX', 'hand-2']
