# Regular expressions

Regular expressions are a powerful language for matching text patterns, such as particular characters, words, or sequences of characters. Python provides the 're' module to support regular expressions.

In [1]:
# import module 're'
import re

## Search for Pattern

The re.search() function finds the first (leftmost) match for a pattern.

In [2]:
match = re.search(r'll', 'Hello')
if match :
    print('Match')
else :
    print('Not Match')
print(match.group())

Match
ll


In [3]:
print(re.search(r'lal', 'Hello'))

None


re.search() scans the string from the start, looking for the pattern. If a match is found, it returns a match object and match.group() gives the matched text; otherwise it returns None.
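
Because re.search() can return None, it is safer to check the result before calling group(). The following is a minimal sketch of that check; the helper name `first_match` is hypothetical and is not part of the 're' module.

```python
import re

def first_match(pattern, text):
    """Return the text of the first match, or None when nothing matches."""
    match = re.search(pattern, text)
    # Only call group() when a match object was returned
    return match.group() if match else None

print(first_match(r'll', 'Hello'))   # ll
print(first_match(r'lal', 'Hello'))  # None
```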

### dot '.' : any char

In [4]:
match = re.search(r'..ll.', 'Hello')
print(match is None)
print(match.group())

False
Hello


### \d : digit char

In [5]:
match = re.search(r'\d\d\d', 'Hello123')
print(match is None)
print(match.group())

match = re.search(r'\d\d\d', 'Hello')
print(match is None)
# match is None here, so calling match.group() directly would raise
# AttributeError: 'NoneType' object has no attribute 'group'
if match:
    print(match.group())

False
123
True


### \w : word char (letter, digit, or underscore)

In [6]:
match = re.search(r'\w\w\w', '#Hello123')
if match :
    print('Match Found : ', match.group())
else :
    print('Match Not Found')

match = re.search(r'\w\w\w', '###')
if match :
    print('Match Found : ', match.group())
else :
    print('Match Not Found')

Match Found :  Hel
Match Not Found
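
As a small illustration of which characters '\w' covers, the sketch below uses a made-up sample string: the underscore counts as a word char, while '#' and '-' do not.

```python
import re

# '\w+' matches a run of letters, digits, and underscores
match = re.search(r'\w+', '#snake_case-42')
print(match.group())  # snake_case
```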


### '+' Plus : 1 or more occurrences of the pattern to its left

In [7]:
match = re.search(r'Hel+o', 'Helllllo Python')
print(match is None)
print(match.group())

match = re.search(r'Hel+o', 'Hello Helllo Python')
print(match is None)
print(match.group())

False
Helllllo
False
Hello


#### Note: re.search() returns only the first (leftmost) match and ignores any later matches in the string.
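
One way to confirm which occurrence was matched is to look at its position with match.span(). The sketch below reuses the string from the cell above; searching the remainder of the string (the `rest` variable is only for illustration) finds the later occurrence.

```python
import re

text = 'Hello Helllo Python'
match = re.search(r'Hel+o', text)

# span() gives the (start, end) indexes of the match in the string
print(match.group())  # Hello
print(match.span())   # (0, 5)

# Searching only the text after the first match finds the next occurrence
rest = re.search(r'Hel+o', text[match.end():])
print(rest.group())   # Helllo
```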

### '*' Star : 0 or more occurrences of the pattern to its left

In [8]:
match = re.search(r'\s\d*\s*', 'abc 123  def 456')
print(match is None)
print(match.group())

False
 123  
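
In the cell above, '\s' matches the first space, '\d*' matches '123', and '\s*' matches the two spaces that follow. Since '*' also accepts zero occurrences, a pattern made only of starred items can succeed with an empty match. A short sketch of the difference between '*' and '+':

```python
import re

# '\d*' may match zero digits, so the search succeeds with an empty string
star = re.search(r'\d*', 'abc')
print(star is None)        # False
print(repr(star.group()))  # ''

# '\d+' needs at least one digit, so the same search fails
plus = re.search(r'\d+', 'abc')
print(plus)                # None
```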


In [9]:
# '\.' matches a literal dot; an unescaped '.' would match any char
match = re.search(r'\w+@\w+\.com', 'mailtosatish@gmail.com')
if match :
    print('Match Found : ', match.group())
else :
    print('Match Not Found')

Match Found :  mailtosatish@gmail.com


In [10]:
match = re.search(r'\w+@\w+\.com', 'mailto-satish@gmail.com')
if match :
    print('Match Found : ', match.group())
else :
    print('Match Not Found')

Match Found :  satish@gmail.com


#### Note: Here we don't get the complete email address, since '\w' does not match the '-' char.

In [11]:
match = re.search(r'[\w-]+@\w+\.com', 'mailto-satish@gmail.com')
if match :
    print('Match Found : ', match.group())    
else :
    print('Match Not Found')

Match Found :  mailto-satish@gmail.com


Square brackets indicate a set of chars, so [ab-] matches 'a', 'b', or '-'. A '-' between two chars indicates a range, so [a-z] matches any lowercase letter; to match a literal '-', put it first or last in the set.

Note: A caret (^) at the start of a square-bracket set inverts it, so [^ab] matches any char except 'a' or 'b'.
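
The three kinds of sets described above can be tried out as follows. This is only a sketch; the sample strings are made up for illustration.

```python
import re

# Set with a literal '-': matches a run of 'a', 'b', or '-'
literal = re.search(r'[ab-]+', 'a-b-c')
print(literal.group())   # a-b-

# Range: matches a run of lowercase letters
lower = re.search(r'[a-z]+', 'hello123')
print(lower.group())     # hello

# Inverted set: matches a run of chars other than 'a' or 'b'
inverted = re.search(r'[^ab]+', 'abcdab')
print(inverted.group())  # cd
```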

#### Extract the username and host from the email address by adding parentheses '( )' to the pattern

In [12]:
match = re.search(r'([\w-]+)@(\w+)\.com', 'mailto-satish@gmail.com')
if match :
    print('Match Found : ', match.group())    
    print('user name : ', match.group(1)) 
    print('host : ', match.group(2)) 
else :
    print('Match Not Found')

Match Found :  mailto-satish@gmail.com
user name :  mailto-satish
host :  gmail
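
When a pattern has several groups, match.groups() returns all of them at once as a tuple, which can be unpacked directly. A minimal sketch using the same email address as above:

```python
import re

match = re.search(r'([\w-]+)@(\w+)\.com', 'mailto-satish@gmail.com')
if match:
    # groups() returns every captured group as a tuple, in order
    user, host = match.groups()
    print(user)  # mailto-satish
    print(host)  # gmail
```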


## re.findall() 

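re.search() finds only the first (leftmost) match for a pattern, whereas re.findall() finds all the matches and returns them as a list of strings, with each string representing one match. When the pattern contains groups, findall() returns the group text instead (a list of tuples for two or more groups).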

In [13]:
msg = 'Got invoice email from shipment-tracking@amazon.in and no-reply@amazon.in about the product that bought'
lst1 = re.findall(r'[\w-]+@\w+\.in', msg)
print('lst1 = ',lst1)

lst2 = re.findall(r'\S+@\S+', msg)
print('lst2 = ',lst2)

lst1 =  ['shipment-tracking@amazon.in', 'no-reply@amazon.in']
lst2 =  ['shipment-tracking@amazon.in', 'no-reply@amazon.in']


In [14]:
lst3 = re.findall(r'([\w-]+)@(\w+)\.in', msg)
print('lst3 = ',lst3)

for email in lst3:
    print('user name : ', email[0]) 
    print('host : ', email[1])

lst3 =  [('shipment-tracking', 'amazon'), ('no-reply', 'amazon')]
user name :  shipment-tracking
host :  amazon
user name :  no-reply
host :  amazon
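
The shape of the list returned by findall() depends on the number of groups in the pattern: no groups gives the whole matches, two or more groups give tuples (as above), and exactly one group gives a plain list of that group's text. A small sketch of the single-group case, using a made-up `sample` string:

```python
import re

sample = 'Contact alice@example.in or bob@example.in'

# One group: findall() returns a list of strings holding only the group text
users = re.findall(r'([\w-]+)@\w+\.in', sample)
print(users)  # ['alice', 'bob']
```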


## Python Regular Expression Quick Guide

#### References: 

http://www.pyregex.com/