# Regular Expressions

A regular expression (or RE) specifies a set of strings that matches it; the functions in Python's re module let you check whether a particular string matches a given regular expression (or whether a given regular expression matches a particular string, which comes down to the same thing).
Regular expressions can be concatenated to form new regular expressions: if A and B are both regular expressions, then AB is also a regular expression. In general, if a string p matches A and another string q matches B, the string pq will match AB. This holds unless A or B contain low-precedence operations, boundary conditions between A and B, or numbered group references. Thus, complex expressions can easily be constructed from simpler primitive expressions like the ones described here.
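
As a rough illustration of the concatenation rule, the sketch below joins two small patterns and checks the combined pattern against the combined string. The names A, B, p and q mirror the paragraph above; the concrete patterns and strings are made up for this example. It also shows one way the rule can fail when a low-precedence operator such as | is involved.

```python
import re

# Two simple expressions (illustrative choices, not from the text above)
A = r"\d+"      # one or more digits
B = r"[a-z]+"   # one or more lowercase letters

p, q = "2024", "abc"   # p matches A, q matches B

# Concatenating the patterns gives a pattern that matches the concatenated strings
concat_match = re.fullmatch(A + B, p + q)
print(concat_match)

# A low-precedence operator breaks the rule: "a|b" + "c" is read as a|(bc)
broken = re.fullmatch(r"a|b" + r"c", "ac")
print(broken)   # None

# Wrapping the alternation in a (non-capturing) group restores the expected behaviour
fixed = re.fullmatch(r"(?:a|b)" + r"c", "ac")
print(fixed)
```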

***CERTAIN IMPORTANT POINTS***

Both patterns and strings to be searched can be Unicode strings (str) as well as 8-bit strings (bytes). 
However, Unicode strings and 8-bit strings cannot be mixed: that is, you cannot match a Unicode string with a
byte pattern or vice-versa; similarly, when asking for a substitution, the replacement string must be of the 
same type as both the pattern and the search string.
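
A small sketch of the str/bytes rule follows. The sample text is invented for illustration; the point is only that a str pattern goes with a str subject, a bytes pattern goes with a bytes subject, and mixing the two raises a TypeError.

```python
import re

text = "order 66"   # hypothetical sample text

# str pattern on a str subject
str_hit = re.search(r"\d+", text).group()
print(str_hit)      # 66

# bytes pattern on a bytes subject
bytes_hit = re.search(rb"\d+", text.encode()).group()
print(bytes_hit)    # b'66'

# bytes pattern on a str subject: not allowed
mixed_error = None
try:
    re.search(rb"\d+", text)
except TypeError as exc:
    mixed_error = exc
    print("TypeError:", exc)
```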

Patterns in this notebook are written in Python’s raw string notation; 
backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a 
two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. 
Usually patterns will be expressed in Python code using this raw string notation.
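
The difference between the two notations can be checked directly. The sketch below compares the lengths of the two literals and shows why raw strings are convenient when the pattern itself needs a backslash; the sample strings are assumptions made for this example.

```python
import re

plain = "\n"    # one character: a newline
raw = r"\n"     # two characters: a backslash and the letter n
print(len(plain), len(raw))   # 1 2

# The re module interprets the two-character sequence \n as "newline",
# so the raw pattern still finds the newline in the subject string
newline_span = re.search(raw, "a\nb").span()
print(newline_span)           # (1, 2)

# Matching a literal backslash: r"\\" is the same pattern as "\\\\"
print(r"\\" == "\\\\")        # True
backslash = re.search(r"\\", r"C:\temp").group()
print(backslash)              # \
```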


### 1. compile() method and search() method

syntax - re.compile(pattern, flags=0)

Compile a regular expression pattern into a regular expression object, which can be used for matching using its 
match(), search() and other methods, described below.

The expression’s behaviour can be modified by specifying a flags value. Values can be any of the module's flag 
constants (for example re.IGNORECASE, re.DOTALL or re.VERBOSE), combined using bitwise OR (the | operator).


Example: US and Canadian phone numbers are of the form 445-666-1234.

In [None]:
import re
message = "Call me at 445-555-1234 or if not reachable then at 445-555-4354"
expr = re.compile(r"\d\d\d-\d\d\d-\d\d\d\d") # \d is used for numeric digits
mo = expr.search(message) #returns an object called match object
print(mo)
print(mo.group())

# 2. findall(str) method
print(expr.findall(message)) # findall returns all the matches for the RE in the form of a list


<re.Match object; span=(11, 23), match='445-555-1234'>
445-555-1234
['445-555-1234', '445-555-4354']


In [None]:
message = "Call me at 445-555-1234 or if not reachable then at 445-555-4354"
expr = re.compile(r"(\d\d\d)-(\d\d\d-\d\d\d\d)") # two separate groups: area code and main number
mo = expr.search(message)
print(mo.group(1)) # group(n) returns the text matched by the n-th group
print(mo.group(2))

445
555-1234
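
Besides group(n), a match object can hand back all groups at once. The following sketch reuses the message and pattern from the cell above; the variable names area_code and main_number are introduced here only for illustration.

```python
import re

message = "Call me at 445-555-1234 or if not reachable then at 445-555-4354"
expr = re.compile(r"(\d\d\d)-(\d\d\d-\d\d\d\d)")
mo = expr.search(message)

print(mo.group(0))     # 445-555-1234 (group 0 is the whole match)
print(mo.groups())     # ('445', '555-1234')
print(mo.group(1, 2))  # ('445', '555-1234')

# groups() returns a tuple, so it can be unpacked directly
area_code, main_number = mo.groups()
print(area_code, main_number)
```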


#### more on compile method

In [None]:
# re.IGNORECASE (or re.I) flag passed to compile()
pattern = re.compile(r"\s+", re.IGNORECASE)

In [None]:
# COMBINING DIFFERENT FLAGS
# re.compile() takes a single flags argument, so how do we pass multiple flags?
# By combining them with the bitwise OR (|) operator
pattern = re.compile(r"\s+", re.IGNORECASE | re.DOTALL | re.VERBOSE)
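
To see the combined flags actually change behaviour, here is a minimal sketch with a pattern where each flag matters. The pattern and the sample text are assumptions made for this example: re.VERBOSE allows whitespace and comments inside the pattern, re.IGNORECASE lets "call me" match "CALL ME", and re.DOTALL lets the dot cross the newline.

```python
import re

pattern = re.compile(r"""
    call \s+ me            # the words "call me", separated by whitespace
    .*?                    # anything in between, as little as possible
    (\d{3}-\d{3}-\d{4})    # a phone number in the form 445-555-1234
    """, re.IGNORECASE | re.DOTALL | re.VERBOSE)

mo = pattern.search("CALL ME\nat 445-555-1234")
print(mo.group(1))   # 445-555-1234

# Without the flags the same text does not match
no_flags = re.search(r"call\s+me.*?(\d{3}-\d{3}-\d{4})", "CALL ME\nat 445-555-1234")
print(no_flags)      # None
```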

In [None]:
# Match object

# You can get methods and attributes of a match object using dir() function.

# Some of the commonly used methods and attributes of match objects are:
#  1. match.group()

# The group() method returns the part of the string where there is a match.


import re

string = '39801 356, 2102 1111'
pattern = r'(\d{3}) (\d{2})'  # three digits, a space, then two digits
match = re.search(pattern, string)

if match:
    print(match.group())
else:
    print("pattern not found")

# 2. match.start() and match.end()

# The start() function returns the index of the start of the matched substring. Similarly, end() returns the end 
# index of the matched substring.
print(match.start())
print(match.end())

# 3. match.span()

print(match.span())

801 35
2
8
(2, 8)
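
The same methods also accept a group number. As a short sketch, reusing the string and pattern from the cell above:

```python
import re

string = '39801 356, 2102 1111'
match = re.search(r'(\d{3}) (\d{2})', string)

# group(n) returns the text of the n-th parenthesised group
print(match.group(1), match.group(2))   # 801 35

# start(n), end(n) and span(n) give the position of that group in the string
print(match.span(1))   # (2, 5)
print(match.span(2))   # (6, 8)

# The indices can be used to slice the original string
whole = string[match.start():match.end()]
print(whole)           # 801 35
```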


### METACHARACTERS

In [None]:
# 1. \ (backslash) character
# What if the number was of the format (415) 123-4567? Here we use \( and \) to indicate that the parentheses are literal characters in the pattern

message = "Call me at (445) 555-1234 or if not reachable then at (445) 555-4354"
expr = re.compile(r"\(\d\d\d\) (\d\d\d-\d\d\d\d)") # escaped parentheses match literally; the unescaped pair forms one group
mo = expr.search(message) 
mo.group()

'(445) 555-1234'

In [None]:
# 2. | (pipe) character
# suppose we want to match the words batman, batmobile, batwoman, batcopter. All of them have bat as a prefix, so we can write a RE of the form bat(man|mobile|woman|copter), where | separates the alternatives
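
A minimal sketch of the pipe character in use, assuming the four bat-words from the comment above; the sample sentence is made up for this example.

```python
import re

# bat followed by exactly one of the four alternatives
batreg = re.compile(r'bat(man|mobile|woman|copter)')

mo = batreg.search("the batmobile lost a wheel")
print(mo.group())    # batmobile (the whole match)
print(mo.group(1))   # mobile (just the alternative that matched)

# A word that uses none of the alternatives does not match
print(batreg.search("batcave"))   # None
```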
 

In [None]:
# What to do when we want to match a certain pattern a certain number of times?
# Regex has special characters (quantifiers) for exactly this purpose

# 3. ? character - (0 or 1) time

    # for example, we want to write a RE to match batwoman or batman. We can write the RE as (batman|batwoman)
    
    # we can shorten it using the ? character

batreg = re.compile(r'bat(wo)?man')
gen = "batwoman"
print(batreg.search(gen).group())


# 4. * character - 0 or more times


batreg = re.compile(r'bat(wo)*man')
gen = "batwowowowoman"
print(batreg.search(gen).group())

# 5. + character - 1 or more times


batreg = re.compile(r'bat(wo)+man')
gen = "batwowowowoman"
gen2 = "batman"
print(batreg.search(gen).group())
print(batreg.search(gen2))


print()
# escaping ?, * , + characters
regexp = re.compile(r'\?\+\*')
msg = "i learned about ?+* regex methods"
print(regexp.search(msg).group())

print()

# 6. {} character 

# matching specific number of repetitions in the group using {} braces
regexp = re.compile(r'(ha){3}')
msg = 'hahaha'
msg2 = 'haha'
print(regexp.search(msg))
print(regexp.search(msg2))


print()
regexp = re.compile(r'(ha){3,5}') # at least 3 and at most 5
msg = 'hahaha'
msg2 = 'haha'
msg3 = "hahahaha"
print(regexp.search(msg))
print(regexp.search(msg2))
print(regexp.search(msg3))

# as with slicing, a bound can be left out
regexp = re.compile(r'(ha){3,}') # at least 3, with no upper limit

batwoman
batwowowowoman
batwowowowoman
None

?+*

<re.Match object; span=(0, 6), match='hahaha'>
None

<re.Match object; span=(0, 6), match='hahaha'>
None
<re.Match object; span=(0, 8), match='hahahaha'>
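
The open-ended form (ha){3,} is compiled in the cell above but never run. Here is a short sketch of how it and its counterpart {,n} behave; the variable names and test strings are chosen for this example.

```python
import re

at_least_three = re.compile(r'(ha){3,}')   # 3 or more repetitions
at_most_two = re.compile(r'(ha){,2}')      # 0 to 2 repetitions

# Greedy: all six repetitions are consumed
long_laugh = at_least_three.search('hahahahahaha').group()
print(long_laugh)                              # hahahahahaha
print(at_least_three.search('haha'))           # None (only 2 repetitions)

# fullmatch() requires the whole string to fit the pattern
print(at_most_two.fullmatch('haha') is not None)    # True
print(at_most_two.fullmatch('hahaha') is not None)  # False
```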


In [None]:
# In Python, REs match greedily by default, meaning they match the longest possible string. For example

reg = re.compile(r'\d{3,5}')
mo = reg.search('123456789')
print(mo.group()) # it could have matched 3 or 4 digits but matched the maximum possible, which is 5

# how to do a non-greedy match?
reg = re.compile(r'\d{3,5}?') # put a ? after the quantifier (a different use of ? from the "0 or 1" one above)
mo = reg.search('123456789')
print(mo.group()) 


12345
123


### findall() method

In [None]:

import re
message = "Call me at 445-555-1234 or if not reachable then at 445-555-4354"
expr = re.compile(r"\d\d\d-\d\d\d-\d\d\d\d") # \d is used for numeric digits

# if the RE has 0 or 1 group in it, findall will return a list of strings (with 1 group, the strings are that group's text)
print(expr.findall(message)) 

expr = re.compile(r"(\d\d\d)-(\d\d\d-\d\d\d\d)") # same pattern, split into two groups

# if the RE has 2 or more groups in it, findall will return a list of tuples, where each tuple holds the groups in order
print(expr.findall(message)) 



['445-555-1234', '445-555-4354']
[('445', '555-1234'), ('445', '555-4354')]
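
The single-group case is easy to overlook, so here is a small sketch of it using the same message. With exactly one capturing group, findall() returns only that group's text rather than the whole match. The non-capturing form (?:...) used at the end is standard re syntax but is not covered elsewhere in this notebook.

```python
import re

message = "Call me at 445-555-1234 or if not reachable then at 445-555-4354"

# One capturing group: the list holds the group's text, not the full number
one_group = re.compile(r"(\d\d\d)-\d\d\d-\d\d\d\d")
area_codes = one_group.findall(message)
print(area_codes)     # ['445', '445']

# A non-capturing group (?:...) groups without capturing,
# so findall() goes back to returning the whole match
no_capture = re.compile(r"(?:\d\d\d)-\d\d\d-\d\d\d\d")
full_numbers = no_capture.findall(message)
print(full_numbers)   # ['445-555-1234', '445-555-4354']
```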


### character classes

In [None]:
# \d - any numeric digit from 0 to 9
# \D - any character that is not a numeric digit from 0 to 9
# \w - any letter, digit or underscore character 
# \W - any character that is not a letter, digit or underscore character 
# \s - any whitespace character, such as a space, tab or newline
# \S - any character that is not whitespace

# HOW TO MAKE YOUR OWN CHARACTER CLASSES?
# ANS - using [] brackets - square brackets specify a set of characters you wish to match.

# for example, if we want a RE to match vowels, we can either write it as (a|e|i|o|u) or

vowelre = re.compile(r'[aeiou]')
word = 'hippopotamus'
print(vowelre.findall(word))

# let's say we want to identify all the lowercase letters
sentence = "THis Is A Mix SenTeNCe"
lowerre = re.compile(r'[a-z]')
print(lowerre.findall(sentence))
# we can use notation like [a-f] to match a narrower range of letters

sentence = "THis Is A Mix SenTeNCe"
letterre = re.compile(r'[a-zA-Z]') # both capital and lowercase letters
print(letterre.findall(sentence))

# now we can combine [] brackets with other special characters
sentence = "robocop eats baby food"
doublevowelre = re.compile(r'[aeiou]{2}') # two vowels in a row
print(doublevowelre.findall(sentence))

# NEGATIVE CHARACTER CLASS
# by adding a caret (^) right after the opening bracket, we make a negative character class. For example

consonantre = re.compile(r'[^aeiou]')
word = 'hippopotamus'
print(consonantre.findall(word)) # prints every character that is not a vowel: here, all the consonants


['i', 'o', 'o', 'a', 'u']
['i', 's', 's', 'i', 'x', 'e', 'n', 'e', 'e']
['T', 'H', 'i', 's', 'I', 's', 'A', 'M', 'i', 'x', 'S', 'e', 'n', 'T', 'e', 'N', 'C', 'e']
['ea', 'oo']
['h', 'p', 'p', 'p', 't', 'm', 's']
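
The shorthand classes listed at the top of the cell above are not exercised there, so here is a brief sketch of them with findall(). The sample text is invented for this example.

```python
import re

text = "Order_7 shipped\ton 12 May"   # hypothetical sample text containing a tab

digits = re.findall(r'\d+', text)
print(digits)    # ['7', '12']

words = re.findall(r'\w+', text)
print(words)     # ['Order_7', 'shipped', 'on', '12', 'May']

spaces = re.findall(r'\s', text)
print(spaces)    # [' ', '\t', ' ', ' ']

# The uppercase versions are the complements of the lowercase ones
non_word = re.findall(r'\W', "a-b c")
print(non_word)  # ['-', ' ']
```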


### (^) symbol and ($) symbol


In [None]:

# The caret symbol ^ is used to check if a string starts with a certain pattern.
# The dollar symbol $ is used to check if a string ends with a certain pattern.

msg1 = 'hello my name is vedant'
msg2 = 'hello, how are you vedant'
msg3 = 'a hello a day.'

exreg = re.compile(r'^hello')
print(exreg.match(msg1))
print(exreg.match(msg2))
print(exreg.match(msg3))

print()
exreg = re.compile(r'vedant$')
print(exreg.search(msg1))
print(exreg.search(msg2))
print(exreg.search(msg3))


# regex to check that a string consists only of digits
print()
exreg = re.compile(r'^\d+$')
print(exreg.search('11234'))
print(exreg.search('1123a'))

<re.Match object; span=(0, 5), match='hello'>
<re.Match object; span=(0, 5), match='hello'>
None

<re.Match object; span=(17, 23), match='vedant'>
<re.Match object; span=(19, 25), match='vedant'>
None

<re.Match object; span=(0, 5), match='11234'>
None
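
One detail worth knowing about $ is that it also matches just before a newline at the very end of the string. The sketch below shows this and, as an alternative under that caveat, re.fullmatch(), which requires the entire string to match; the test strings are chosen for this example.

```python
import re

digits_only = re.compile(r'^\d+$')

# $ matches at the end of the string OR just before a trailing newline
with_newline = digits_only.search('11234\n')
print(with_newline is not None)   # True

# fullmatch() needs no anchors and does not tolerate the trailing newline
strict_newline = re.fullmatch(r'\d+', '11234\n')
print(strict_newline)             # None

strict_clean = re.fullmatch(r'\d+', '11234')
print(strict_clean.group())       # 11234
```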


### wildcard character - (.)

In [None]:
# wildcard dot(.) character 
# stands for any character except newline character (\n)

exreg = re.compile(r'.at')
print(exreg.findall("the cat in the hat sat on the flat mat."))

# if you notice the above o/p, 'flat' is only matched as 'lat', since the dot stands for exactly one character
exreg = re.compile(r'..at')
print(exreg.findall("the cat in the hat sat on the flat mat."))
# .{1,2}at allows one or two characters before 'at'; on this sentence it gives the same output
exreg = re.compile(r'.{1,2}at')
print(exreg.findall("the cat in the hat sat on the flat mat."))

['cat', 'hat', 'sat', 'lat', 'mat']
[' cat', ' hat', ' sat', 'flat', ' mat']
[' cat', ' hat', ' sat', 'flat', ' mat']


In [None]:
# dot star wildcard (.*) 
# dot means any character and * means 0 or more times, so .* matches any run of characters (except newlines)

# example:

text = "First name: Vedant Last name: Barbhaya"
# how to get the first name and last name of a person from such a string
regexp = re.compile(r'First name: (.*) Last name: (.*)')
mo = regexp.search(text)
print(mo.group(1))
print(mo.group(2))

Vedant
Barbhaya


In [None]:
# .* is a greedy match; .*? matches the same kind of text in a non-greedy way
# example

serve = '<to serve fish> in the dinner>'
# non-greedy approach
regexp = re.compile(r'<.*?>')
print(regexp.findall(serve))

# greedy approach
print()
regexp = re.compile(r'<.*>')
print(regexp.findall(serve))

['<to serve fish>']

['<to serve fish> in the dinner>']


In [None]:
# How to make the dot match newline characters too

primedir = "serve the public.\nHelp the innocent.\nUphold the law."
print(primedir)
regexp = re.compile(r'.*',re.DOTALL) # re.DOTALL as the second argument makes . match newlines as well
print(regexp.search(primedir))

# another important flag is re.IGNORECASE which, as the name suggests, ignores case when matching

serve the public.
Help the innocent.
Uphold the law.
<re.Match object; span=(0, 52), match='serve the public.\nHelp the innocent.\nUphold the>
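
For comparison, the sketch below runs the same .* pattern with and without re.DOTALL, and adds a small re.IGNORECASE example since that flag is only mentioned in the cell above. The sentence used for the case-insensitive search is made up for this example.

```python
import re

primedir = "serve the public.\nHelp the innocent.\nUphold the law."

# Without DOTALL the dot stops at the first newline
first_line = re.search(r'.*', primedir).group()
print(first_line)        # serve the public.

# With DOTALL the match runs to the end of the string
everything = re.search(r'.*', primedir, re.DOTALL).group()
print(len(everything))   # 52

# IGNORECASE matches regardless of upper or lower case
hits = re.findall(r'the', "The law. THE public. the innocent.", re.IGNORECASE)
print(hits)              # ['The', 'THE', 'the']
```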


### sub() method

In [None]:


# this is like a find-and-replace method
# there are 2 ways to use it: the module-level function and the method of a compiled pattern
# The syntax of re.sub() is:

# re.sub(pattern, replace, string)

import re

# string written across two source lines (the trailing backslash joins them); it also contains a \n newline
string = 'abc 12\
de 23 \n f45 6'

# matches runs of whitespace characters
pattern = r'\s+'

# empty string
replace = ''

new_string = re.sub(pattern, replace, string) 
print(new_string)

# can also be used like this

pattern = re.compile(r"\s+")
print(pattern.sub(replace, string))

# using \1, \2 to select groups to substitute

namesRegex = re.compile(r"Agent (\w)(\w*)")
namesRegex.sub(r"Agent \1****","Agent alice gave the secret documents to Agent Bob")

abc12de23f456
abc12de23f456


'Agent a**** gave the secret documents to Agent B****'
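
The Agent example uses only \1. As a further sketch, the snippet below uses both \1 and \2 to reorder two groups, and shows the optional count argument of sub(), which limits how many replacements are made. The sample strings are assumptions made for this example.

```python
import re

# Swap two words by referring to both groups in the replacement
swap = re.compile(r"(\w+) (\w+)")
swapped = swap.sub(r"\2, \1", "Vedant Barbhaya")
print(swapped)      # Barbhaya, Vedant

# count limits the number of substitutions (default 0 means replace all)
first_only = re.sub(r"\s+", "-", "a b c", count=1)
print(first_only)   # a-b c

replace_all = re.sub(r"\s+", "-", "a b c")
print(replace_all)  # a-b-c
```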