# Regular Expressions

## Definition

- A RegEx, or Regular Expression, is a `sequence of characters` that forms a `search pattern`. A RegEx can be used to check whether a string contains the specified search pattern.
- Some common applications:
    - Password checker
    - Email format checker
    - Smart character replacement
    - Text pre-processing and text data analytics
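
As a minimal sketch of the "check whether a string contains a pattern" idea, the snippet below uses `re.search` inside a hypothetical helper, `contains_digit`, which is not part of the original notebook. It could serve as one rule in a password checker.

```python
import re

def contains_digit(text):
    """Return True if text contains at least one digit (hypothetical helper)."""
    # re.search returns a match object on success and None otherwise
    return re.search(r"\d", text) is not None

print(contains_digit("passw0rd"))  # True
print(contains_digit("password"))  # False
```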

## Match Vs Search

- `re.match()` checks for a match only at the `beginning of the string` and returns a match object if one is found. `If the pattern occurs only somewhere later in the string, it returns None`.
- `re.search()` scans the whole string, including every line of a multi-line string, and returns a match object for the first location where the pattern matches.
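
The difference on a multi-line string can be sketched as follows. The sample text is made up for illustration; note that even with `re.M`, `re.match()` still only tries the very start of the string.

```python
import re

text = "first line\nsecond line"

# re.match only tries position 0, so "second" is not found
at_start = re.match("second", text)
# re.M does not change this: re.match still anchors at the start of the whole string
at_start_multiline = re.match("second", text, re.M)
# re.search scans forward and finds the word on the second line
anywhere = re.search("second", text)

print(at_start)            # None
print(at_start_multiline)  # None
print(anywhere.span())     # (11, 17)
```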

In [1]:
import re

### Match

In [2]:
# re.match("<pattern>",<string>,<flags>)

In [3]:
a = "The quick brown fox jump over the Dog"
mo = re.match("quick brown",a)
print(mo)

None


In [4]:
a = "The quick brown fox jump over the Dog"
mo = re.match("The quick brown",a)
print(mo)
print(mo.span())
print(mo.start())
print(mo.end())
print(mo.group())

<re.Match object; span=(0, 15), match='The quick brown'>
(0, 15)
0
15
The quick brown


### Search

In [5]:
a = "The quick brown fox jump over the Dog"
so = re.search("quick brown",a)
print(so)
print(so.span())
print(so.start())
print(so.end())
print(so.group())

<re.Match object; span=(4, 15), match='quick brown'>
(4, 15)
4
15
quick brown


## What is re.compile()

- `re.compile()` compiles a regular expression pattern into a pattern object, which can then be used for pattern matching. It also lets us reuse the same pattern without rewriting it.

In [6]:
a = "The quick brown fox jump over the Dog"
co = re.compile("brown")
so = co.search("quick brown")
print(so)
print(so.span())
print(so.start())
print(so.end())
print(so.group())

<re.Match object; span=(6, 11), match='brown'>
(6, 11)
6
11
brown
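
The reuse benefit can be sketched by compiling once and applying the pattern object to several strings. The sentences below are made-up sample data.

```python
import re

# compile once, then reuse the pattern object on several strings
word = re.compile("brown")
sentences = ["The quick brown fox", "A red fox", "brown bread"]

# keep only the sentences in which the compiled pattern is found
hits = [s for s in sentences if word.search(s)]
print(hits)  # ['The quick brown fox', 'brown bread']
```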


## Substitute substrings

In [7]:
# re.sub("<pattern>", <replacement_string>, <string_to_search>, count=<max number of replacements>, flags=<flags>)

In [8]:
a = "The quick brown fox jump over the Dog"
so = re.sub("Dog","Cat",a)
print(so)

The quick brown fox jump over the Cat


In [9]:
a = "The quick brown fox jump over the Dog Dog Dog dog DoG"
so = re.sub("Dog","Cat",a)
print(so)

The quick brown fox jump over the Cat Cat Cat dog DoG


In [10]:
a = "The quick brown fox jump over the Dog Dog Dog dog DoG"
so = re.sub("Dog","Cat",a,count=0,flags=re.I) # re.I or re.IGNORECASE makes the search case-insensitive; count=0 replaces every match
print(so)

The quick brown fox jump over the Cat Cat Cat Cat Cat


In [11]:
a = "The quick brown fox jump over the Dog Dog Dog dog DoG"
so = re.sub("Dog","Cat",a,count=2,flags=re.I) # count=2 replaces only the first two matches
print(so)

The quick brown fox jump over the Cat Cat Dog dog DoG


## Meta Vs Literal characters

<div> <img src="attachment:image.png" width="450"/> </div>

In [12]:
a = "The quick brown fox [jump] over the Dog"
so = re.search("[jump]",a)
print(so)
print(so.group())

<re.Match object; span=(5, 6), match='u'>
u


In [13]:
a = "The quick brown fox [jump] over the Dog"
so = re.search(r"\[jump\]",a) # escape [ and ] to match them literally; the raw string keeps the backslashes intact
print(so)
print(so.group())

<re.Match object; span=(20, 26), match='[jump]'>
[jump]


In [14]:
a = "the quick brown fox [jump] over the Dog"
so = re.search("^The",a,re.I)
print(so)
print(so.group())

<re.Match object; span=(0, 3), match='the'>
the


In [15]:
a = "The quick brown fox [jump] over the Dog1"
so = re.search("Dog1$",a,re.I)
print(so)
print(so.group())

<re.Match object; span=(36, 40), match='Dog1'>
Dog1


In [16]:
a = "The quick brown fox [jump] over the dog"
so = re.search("Dog|dog",a)
print(so)
print(so.group())

<re.Match object; span=(36, 39), match='dog'>
dog


In [17]:
a = "The quick brown fox [jump] over the Dog"
so = re.search("[dD]og",a)
print(so)
print(so.group())

<re.Match object; span=(36, 39), match='Dog'>
Dog


In [18]:
a = "The quick brown fox [jump] over the Cat"
so = re.search("[bC]at",a,re.I)
print(so)
print(so.group())

<re.Match object; span=(36, 39), match='Cat'>
Cat


In [19]:
a = "The quick brown fox [jump] over the at"
so = re.search("[bC]?at",a,re.I) #bat,cat,at
print(so)
print(so.group())

<re.Match object; span=(36, 38), match='at'>
at


In [20]:
a = "1122333"
so = re.search("22",a)
print(so)
print(so.group())

<re.Match object; span=(2, 4), match='22'>
22


In [21]:
a = "1122222222333"
so = re.search("2+",a)
print(so)
print(so.group())

<re.Match object; span=(2, 10), match='22222222'>
22222222


In [22]:
a = "1122222222333"
so = re.search("2{2,5}",a)
print(so)
print(so.group())

<re.Match object; span=(2, 7), match='22222'>
22222


In [23]:
a = "1122222222333"
so = re.search("2{4}",a)
print(so)
print(so.group())

<re.Match object; span=(2, 6), match='2222'>
2222


In [24]:
a = "1122222222333"
so = re.search("2{4,}",a)
print(so)
print(so.group())

<re.Match object; span=(2, 10), match='22222222'>
22222222


## Various patterns and their descriptions

<div> <img src="attachment:image-2.png" width="600"/> </div>

In [25]:
a = "This is PyCSR video"
so = re.search("P.C.R",a)
print(so)
print(so.group())

<re.Match object; span=(8, 13), match='PyCSR'>
PyCSR


In [26]:
a = "This is PyC\nR video"
so = re.search("P.C.R",a)
print(so)
print(so.group())

None


AttributeError: 'NoneType' object has no attribute 'group'
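
The `AttributeError` above is raised because `re.search()` returned `None` and `.group()` was still called on it. A minimal sketch of guarding against that case (the `result` variable is introduced here only for illustration):

```python
import re

a = "This is PyC\nR video"
so = re.search("P.C.R", a)  # "." does not match "\n" by default, so this is None

# check the result before calling match-object methods
if so is not None:
    result = so.group()
else:
    result = "no match"
print(result)  # no match
```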

In [27]:
a = "This is PyC\nR video"
so = re.search("P.C.R",a,re.S) # re.S (re.DOTALL) lets . match newline characters too
print(so)
print(so.group())

<re.Match object; span=(8, 13), match='PyC\nR'>
PyC
R


## Grouping in patterns

In [28]:
a = "The quick brown fox jump over the Dog"
so = re.search("(quick).*(jump).*(Dog)",a)
print(so)
print(so.group())
print(so.groups())
print(so.group(1))
print(so.group(2))
print(so.group(3))

<re.Match object; span=(4, 37), match='quick brown fox jump over the Dog'>
quick brown fox jump over the Dog
('quick', 'jump', 'Dog')
quick
jump
Dog


## Most commonly used regex flags

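- `re.I` (`re.IGNORECASE`): case-insensitive matching
- `re.S` (`re.DOTALL`): `.` also matches newline characters
- `re.M` (`re.MULTILINE`): `^` and `$` match at the start and end of each line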

In [29]:
# re.M has no visible effect here because the pattern contains no ^ or $ anchors
a = '''The quick 

The brown 

the fox jump over

The Dog'''
so = re.search("jump",a,re.M)
print(so)

<re.Match object; span=(32, 36), match='jump'>


In [30]:
# re.findall returns every non-overlapping match; no flag is passed, so matching is case-sensitive
a = '''The quick 

The brown 

the fox jump over

the Dog'''
re.findall("the",a)

['the', 'the']
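
The effect of `re.M` only shows up with the `^` and `$` anchors. A minimal sketch, using a shortened made-up version of the text above:

```python
import re

a = "The quick\nThe brown\nthe fox"

# without re.M, ^ matches only at the very start of the string
plain = re.findall("^The", a)
# with re.M, ^ also matches right after each newline
multi = re.findall("^The", a, re.M)
# flags can be combined with |
both = re.findall("^The", a, re.M | re.I)

print(plain)  # ['The']
print(multi)  # ['The', 'The']
print(both)   # ['The', 'The', 'the']
```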

## Greedy Vs Non-greedy search

In [31]:
a = "<abc><def><ghi><jkl>"
so = re.search("<.*>",a)
so.group()

'<abc><def><ghi><jkl>'

In [32]:
a = "<abc><def><ghi><jkl>"
so = re.search("<.*?>",a) # add ? after the quantifier to make the search non-greedy
so.group()

'<abc>'

## Back referencing example

In [33]:
a = "111 222 111 333 222 333"
so = re.search("(111 )(222 )\\1(333 )\\2",a)
so.group()

'111 222 111 333 222 '
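
A common use of back references is spotting a word that is repeated twice in a row. The sketch below uses a made-up sentence; `\1` must match exactly the text captured by the first group.

```python
import re

text = "this is is a test of of back references"

# \b(\w+) captures a word; \1 requires the same word again after one space
repeats = re.findall(r"\b(\w+) \1\b", text)
print(repeats)  # ['is', 'of']
```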

In [34]:
# \d,\w,\D,\W,\s,\S
a = "dkjsfgjshfgW*&^%E*&#^$*&%kdjfhkghdfAKJDHGSD747457"
so = re.search(r"\d",a)
so.group()

'7'

In [35]:
# \d,\w,\D,\W,\s,\S
a = "dkjsfgjshfgW*&^%E*&#^$*&%kdjfhkghdfAKJDHGSD747457"
so = re.search(r"\d+",a)
so.group()

'747457'

In [36]:
# \d,\w,\D,\W,\s,\S
a = "dkjsfgjshfgW*&^%E*&#^$*&%kdjfhkghdfAKJDHGSD747457"
so = re.search(r"\w",a)
so.group()

'd'

In [37]:
# \d,\w,\D,\W,\s,\S
a = "dkjsfgjshfgW*&^%E*&#^$*&%kdjfhkghdfAKJDHGSD747457"
so = re.search(r"\w+",a)
so.group()

'dkjsfgjshfgW'

In [38]:
# \d,\w,\D,\W,\s,\S
a = "dkjsfgjshfgW*&^%E*&#^$*&%kdjfhkghdfAKJDHGSD747457"
so = re.search(r"\W",a)
so.group()

'*'

In [39]:
# \d,\w,\D,\W,\s,\S
a = "dkjsfgjshfgW*&^%*&#^$*&%kdjfhkghdfAKJDHGSD747457"
so = re.search(r"\W+",a)
so.group()

'*&^%*&#^$*&%'

In [40]:
# \d,\w,\D,\W,\s,\S
a = "dkjsfgjshfgW*&^%*&#^$*&%kdjfhkghdfAKJDHGSD747457"
so = re.search(r"(\w+)(\W+)(.*)(\d+)",a) # the greedy .* leaves only the last digit for (\d+)
so.group()
so.groups()

('dkjsfgjshfgW', '*&^%*&#^$*&%', 'kdjfhkghdfAKJDHGSD74745', '7')

# Interview questions Exercise

## What are greedy and non-greedy quantifiers?

## Write a regex for finding a valid email ID
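
One possible answer, as a minimal sketch. The pattern is a deliberately simplified assumption, not a full RFC 5322 validator, and `EMAIL_RE` and `is_valid_email` are hypothetical names introduced for illustration.

```python
import re

# simplified pattern (an assumption, not a complete validator):
# local part, "@", dot-separated domain labels, and a final label of 2+ letters
EMAIL_RE = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"
)

def is_valid_email(candidate):
    """Return True if the whole string looks like an email address."""
    # fullmatch requires the pattern to cover the entire string
    return EMAIL_RE.fullmatch(candidate) is not None

print(is_valid_email("user.name@example.com"))  # True
print(is_valid_email("user@example"))           # False
print(is_valid_email("user name@example.com"))  # False
```

`fullmatch` is used instead of `search` so that extra characters before or after the address make the check fail.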

## Write a regex for finding a valid IPv4 address
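
One possible answer, sketched under the assumption that each octet must be 0 to 255 written in decimal without leading zeros. `OCTET`, `IPV4_RE`, and `is_valid_ipv4` are hypothetical names introduced for illustration.

```python
import re

# one octet: 250-255, 200-249, 100-199, or 0-99 (leading zeros rejected by choice)
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"
# four octets separated by literal dots
IPV4_RE = re.compile(rf"{OCTET}(?:\.{OCTET}){{3}}")

def is_valid_ipv4(candidate):
    """Return True if the whole string is a dotted-quad IPv4 address."""
    return IPV4_RE.fullmatch(candidate) is not None

print(is_valid_ipv4("192.168.1.1"))  # True
print(is_valid_ipv4("256.1.1.1"))    # False
print(is_valid_ipv4("1.2.3"))        # False
```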

## Write a regex for finding the total number of N-letter words in a document

In [41]:
a = "This is line1 line2 line3 line4 Example of pycsr video"
re.findall(r"\w{4}",a) # without word boundaries this also matches the first 4 characters of longer words

['This', 'line', 'line', 'line', 'line', 'Exam', 'pycs', 'vide']

In [42]:
a = "This is line1 line2 line3 line4 Example of pycsr video"
re.findall(r"\b\w{5}\b",a) # \b word boundaries restrict matches to words of exactly 5 characters

['line1', 'line2', 'line3', 'line4', 'pycsr', 'video']
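
To get the total count for any N, the pattern can be built from `n` and the matches counted. This is a minimal sketch; `count_n_letter_words` is a hypothetical helper, and "letter" here means any `\w` character (letters, digits, underscore), as in the cells above.

```python
import re

def count_n_letter_words(text, n):
    """Count words made of exactly n word characters."""
    # the doubled braces are literal braces in the f-string, producing e.g. \b\w{5}\b
    pattern = rf"\b\w{{{n}}}\b"
    return len(re.findall(pattern, text))

a = "This is line1 line2 line3 line4 Example of pycsr video"
print(count_n_letter_words(a, 5))  # 6
print(count_n_letter_words(a, 2))  # 2
```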

## What are positive (`+ve`) and negative (`-ve`) `Lookahead/Lookbehind` patterns?

In [43]:
# Syntax (pattern is written without quotes inside the parentheses):
# +ve lookahead: (?=pattern)
# -ve lookahead: (?!pattern)
# +ve lookbehind: (?<=pattern)
# -ve lookbehind: (?<!pattern)

In [44]:
a = "This is PyCSR123 video"
so = re.search(r"PyCSR(?=\d+)",a) # +ve lookahead
print(so.group())

PyCSR


In [45]:
a = "This is PyCSRabc video"
so = re.search(r"PyCSR(?!\d+)",a) # -ve lookahead
print(so.group())

PyCSR


In [46]:
a = "This is 123PyCSR video"
so = re.search("(?<=123)PyCSR",a)# +ve lookbehind
print(so.group())

PyCSR


In [47]:
a = "This is abcPyCSR video"
so = re.search("(?<!123)PyCSR",a)# -ve lookbehind
print(so.group())

PyCSR
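
The cells above show only the matching cases. As a short sketch, the snippet below checks that lookarounds are zero-width (the looked-at text is not part of the match) and shows two inputs on which they fail. The variable names are introduced only for illustration.

```python
import re

# lookahead is zero-width: the digits are required but not included in the match
with_digits = re.search(r"PyCSR(?=\d+)", "This is PyCSR123 video")
# the positive lookahead fails when no digit follows
without_digits = re.search(r"PyCSR(?=\d+)", "This is PyCSRabc video")
# the negative lookbehind fails when "123" comes right before
after_123 = re.search(r"(?<!123)PyCSR", "This is 123PyCSR video")

print(with_digits.span())  # (8, 13)
print(without_digits)      # None
print(after_123)           # None
```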
