The Python "re" module provides regular expression support.

In Python, a regular expression search is typically written as re.search(pat, str), which is covered below.

**Using the r prefix before a regex**

- When the r or R prefix is used before a string literal, it marks a raw string. For example, '\n' is a newline, whereas r'\n' is two characters: a backslash \ followed by n.

- Backslash \ is used to escape various characters, including all metacharacters. With the r prefix, Python itself treats \ as a normal character, so the backslash reaches the regex engine unchanged.
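
A quick way to see the difference is to compare string lengths and to match a literal backslash. This is only a small sketch; the variable names and the sample path are made up for illustration.

```python
import re

# A normal string turns \n into one newline character;
# a raw string keeps the backslash and the n as two characters.
normal = '\n'
raw = r'\n'
print(len(normal), len(raw))  # 1 2

# To match one literal backslash, the regex itself needs an escaped backslash.
# As a raw string that is r'\\'; as a normal string it has to be '\\\\'.
path = r'C:\temp'  # hypothetical sample text containing a single backslash
with_raw = re.search(r'\\', path)
without_raw = re.search('\\\\', path)
print(with_raw.span(), without_raw.span())  # (2, 3) (2, 3)
```

Both searches find the same backslash; the raw string is simply easier to read.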


#### Wild Card Characters: Special Characters

    . A period. Matches any single character except newline character.
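
The newline exception can be checked directly. The sketch below uses made-up sample strings; re.DOTALL is a flag of the re module that is not used elsewhere in this notebook, and it makes the period match a newline as well.

```python
import re

# The dot matches any single character except a newline.
plain = re.search(r'c.t', 'cat')
across_newline = re.search(r'c.t', 'c\nt')
print(plain.group())      # cat
print(across_newline)     # None

# With re.DOTALL the dot also matches a newline.
dotall = re.search(r'c.t', 'c\nt', re.DOTALL)
print(repr(dotall.group()))  # 'c\nt'
```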

In [47]:
import re

# re.match

re.match only finds a match if it occurs at the start of the string being searched. Otherwise it returns None.

The examples below first test an exact match, then find a run of alphabet letters, both uppercase and lowercase.

In [48]:
# exact matches:

pattern = r"Cookie"
sequence = "Cookie"

if re.match(pattern, sequence):
    print("Match!")
else: 
    print("Not a match!")

Match!


In [49]:
input_str = "Thhhhe film Titanic was released in 1998"  

result = re.match(r"[a-zA-Z]+", input_str)  

print(result)

<re.Match object; span=(0, 6), match='Thhhhe'>


This regular expression matches any letter from lowercase a to z or uppercase A to Z. The plus sign specifies that the match must contain at least one such character.

In the output, you can see that the first word, Thhhhe, is returned. The + keeps consuming letters until it reaches the space, and the match function only returns a match that begins at the start of the string.

In [4]:
input_str = "1998 The film Titanic was released in 1998"  

result = re.match(r"[a-zA-Z]+", input_str)  

print(result)

None


If a string starts with a number instead of a letter, the match function returns None even if there are letters after the number, since match only tries the pattern at the start of the string.

In [5]:
re.match(r'dog', 'dog cat dog')

<re.Match object; span=(0, 3), match='dog'>

In [6]:
match = re.match(r'dog', 'dog cat dog')
match.group(0)

'dog'

In [7]:
match.group()

'dog'

But if we call match() on the same string looking for the pattern 'cat', we won't find a match, because 'cat' is not at the start. match() returns None, so the cell shows no output:

In [8]:
match = re.match(r'cat', 'dog cat dog')
match

# re.search(pat, str)

Scans through the string looking for the first location where the regular expression pattern produces a match, and returns a corresponding match object. Returns None if no position in the string matches the pattern.

The search() method is similar to match(), but search() doesn’t restrict us to only finding matches at the beginning of the string, so searching for ‘cat’ in our example string finds a match:
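
As a minimal side-by-side sketch (the variable names are illustrative), the same pattern can be tried with both functions on the example string:

```python
import re

text = 'dog cat dog'

# match() only looks at the start of the string, so 'cat' is not found.
at_start = re.match(r'cat', text)

# search() scans the whole string and returns the first match it finds.
anywhere = re.search(r'cat', text)

print(at_start)          # None
print(anywhere.span())   # (4, 7)
```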

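**\w**
- matches any alphanumeric character and the underscore;
- for ASCII text (or with the re.ASCII flag) this is equivalent to the set [a-zA-Z0-9_]. By default, Python 3 string patterns also match Unicode letters and digits.

**\W**

- matches any non-word character, i.e. anything \w does not match;
- for ASCII text this is equivalent to the set [^a-zA-Z0-9_].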
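
The following sketch shows how \w and \W behave on text with accented letters. The sample string is made up, and the accented characters are written as escapes so the result does not depend on how the file is encoded. re.ASCII is a standard flag of the re module that restricts \w to [a-zA-Z0-9_].

```python
import re

# Made-up sample: 'café_1 naïve!' with the accented letters written as escapes.
sample = 'caf\u00e9_1 na\u00efve!'

# By default, \w in a str pattern matches Unicode word characters.
unicode_words = re.findall(r'\w+', sample)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so accented letters split the words.
ascii_words = re.findall(r'\w+', sample, re.ASCII)

# \W matches everything \w does not (here the space and the '!').
non_words = re.findall(r'\W', sample)

print(unicode_words)
print(ascii_words)   # ['caf', '_1', 'na', 've']
print(non_words)     # [' ', '!']
```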


    \w - Lowercase w. Matches any single letter, digit or underscore.
    \W - Uppercase w. Matches any character not part of \w (lowercase w).
    \s - Lowercase s. Matches a single whitespace character like: space, newline, tab, return.
    \S - Uppercase s. Matches any character not part of \s (lowercase s).
    \t - Lowercase t. Matches tab.
    \n - Lowercase n. Matches newline.
    \r - Lowercase r. Matches return.
    \d - Lowercase d. Matches decimal digit 0-9.
    ^ - Caret. Matches at the start of the string (and, with re.MULTILINE, at the start of each line).
    $ - Dollar. Matches at the end of the string (and, with re.MULTILINE, at the end of each line).
    [abc] - Matches a or b or c.
    \A - Uppercase a. Matches only at the start of the string, even with re.MULTILINE.
    \b - Lowercase b. Matches the empty string at a word boundary, i.e. at the beginning or end of a word.
    

In [9]:
# The group() function returns the string matched by the re
re.search(r'c\d\dkie', 'c00kie').group()

'c00kie'

In [10]:
re.search(r'Co\wk\we', 'Cookie')

<re.Match object; span=(0, 6), match='Cookie'>

In [11]:
re.search(r'Co\wk\we', 'Coo78788kie')

In [12]:
re.search(r'C\Wke', 'C@ke')

<re.Match object; span=(0, 4), match='C@ke'>

In [13]:
re.search(r'Eat\scake', 'Eat cake')

<re.Match object; span=(0, 8), match='Eat cake'>

In [14]:
re.search(r'Cook\Se', 'Cookie').group()

'Cookie'

In [15]:
# No output: \t matches a tab character, but this string contains spaces
re.search(r'Eat\tcake', 'Eat  cake')

\d - Lowercase d. Matches decimal digit 0-9.

In [16]:
re.search(r'c\d\dkie', 'c00kie').group()

'c00kie'

^ - Caret. Matches a pattern at the start of the string.

In [17]:
re.search(r'^Eat', 'Eat cake').group()

'Eat'

$ - Matches a pattern at the end of string.

In [18]:
re.search(r'cake$', 'Eat cake').group()

'cake'

[abc] - Matches a or b or c.

[a-zA-Z0-9] - Matches any letter from (a to z) or (A to Z) or (0 to 9). 

In [19]:
re.search(r'Number: [0-6]', 'Number: 5').group()

'Number: 5'

In [20]:
# Matches any character except 5
re.search(r'Number: [^5]', 'Number: 0').group()

'Number: 0'

\A - Uppercase a. Matches only at the start of the string, even when re.MULTILINE is set.

In [21]:
re.search(r'\A[A-E]ookie', 'Cookie').group()

'Cookie'
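
The difference between ^ and \A only shows up with the re.MULTILINE flag. A small sketch, using a made-up two-line string:

```python
import re

text = 'first line\nsecond line'

# With re.MULTILINE, ^ also matches right after each newline ...
caret_hits = re.findall(r'^\w+', text, re.MULTILINE)

# ... while \A still matches only at the very start of the string.
start_hits = re.findall(r'\A\w+', text, re.MULTILINE)

print(caret_hits)  # ['first', 'second']
print(start_hits)  # ['first']
```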

\b - Lowercase b. Matches at a word boundary, i.e. the beginning or end of a word.

In [22]:
re.search(r'\b[A-E]ookie', 'Cookie').group()

'Cookie'
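
Word boundaries are most useful for finding whole words only. The sketch below uses a made-up sentence in which 'cat' also appears inside longer words:

```python
import re

text = 'cat concat cathedral cat.'

# Without boundaries, 'cat' is also found inside longer words.
loose = re.findall(r'cat', text)

# \b on both sides only accepts 'cat' as a whole word.
whole_words = re.findall(r'\bcat\b', text)

print(len(loose))    # 4
print(whole_words)   # ['cat', 'cat']
```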

\ - Backslash. If the character following the backslash forms a recognized escape sequence, its special meaning is used. For example, \n is treated as a newline and \s as whitespace. To match a literal backslash, escape it with a second backslash (\\).

In [23]:
# The escaped backslash '\\' matches a literal '\' in the text, so '\s' is not read as whitespace here
re.search(r'Back\\stail', r'Back\stail').group()

'Back\\stail'

In [24]:
# Here '\s' is the whitespace class because its backslash is not escaped
re.search(r'Back\stail', 'Back tail').group()

'Back tail'

In [25]:
text   = "1998 was the year when the film titanic was released"  
result = re.search(r"[a-zA-Z]+", text)  
print(result)  

<re.Match object; span=(5, 8), match='was'>


In [26]:
re.search(r'cat', 'dog cat dog')

<re.Match object; span=(4, 7), match='cat'>

The search() method, however, stops looking after it finds a match, so search()-ing for ‘dog’ in our example string only finds the first occurrence:

In [27]:
match = re.search(r'dog', 'dog cat dog')
match.group(0)

'dog'

In [28]:
input_str     = 'Machines Learning is great, Maths is fun 11 22 33 44'
search_str    = r'macHineS'

search_results = re.search(search_str, input_str, re.IGNORECASE)

print(search_results)

<re.Match object; span=(0, 8), match='Machines'>


In [29]:
# If-statement after search() tests if it succeeded
if search_results:
    print ('found', search_results.group()) ## prints 'found Machines'
else:
    print ('did not find')

found Machines


Find out if any of the lines begin with a number:

In [30]:
text = """
1. ricochet robots
2. settlers of catan
3. acquire
"""
# ^ with re.MULTILINE anchors at the start of each line; \. matches a literal period
match = re.search(r'^\d+\.', text, re.MULTILINE)
match.group()

'1.'

## Repetition

    + -- 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's
    * -- 0 or more occurrences of the pattern to its left
    ? -- match 0 or 1 occurrences of the pattern to its left
    
Square brackets can be used to indicate a set of chars, so [abc] matches 'a' or 'b' or 'c'. 

- \d matches a decimal digit [0-9]

In [31]:
re.search(r'Co+kie', 'Cooookie').group()

'Cooookie'

In [32]:
# Matches zero or more a's followed by zero or more o's between C and kie
re.search(r'Ca*o*kie', 'Caokie').group()

'Caokie'

In [33]:
# Matches at most one a followed by at most one o between C and kie
re.search(r'Ca?o?kie', 'Caokie').group()

'Caokie'

But what if you want to check for an exact number of repetitions?

For example, checking the validity of a phone number in an application. The re module handles this as well, using the following quantifiers:

{x} - Repeat exactly x times.

{x,} - Repeat at least x times.

{x,y} - Repeat at least x times but no more than y times. Do not put a space after the comma, or the braces are treated as literal text.
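
A rough sketch of the phone number check mentioned above. The format is an assumption made for illustration (3 digits, 4 digits, 4 digits, separated by hyphens, like '080-2856-1733'), and looks_like_phone is a hypothetical helper, not part of the re module.

```python
import re

# Assumed format for illustration only: 3-4-4 digits separated by hyphens.
phone_pattern = r'\d{3}-\d{4}-\d{4}'

def looks_like_phone(text):
    """Return True if the whole string has the assumed phone shape."""
    # fullmatch() requires the pattern to cover the entire string.
    return re.fullmatch(phone_pattern, text) is not None

print(looks_like_phone('080-2856-1733'))    # True
print(looks_like_phone('80-2856-1733'))     # False: only 2 leading digits
print(looks_like_phone('080-2856-17334'))   # False: 5 trailing digits

# {x,} and {x,y} applied to runs of digits.
at_least_three = re.findall(r'\d{3,}', '1 22 333 4444')
two_to_three = re.findall(r'\d{2,3}', '1 22 333 4444')
print(at_least_three)  # ['333', '4444']
print(two_to_three)    # ['22', '333', '444']
```

re.fullmatch is used instead of re.search so that extra characters before or after the number cause the check to fail.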

In [34]:
re.search(r'\d{2,4}', '1234')

<re.Match object; span=(0, 4), match='1234'>

In [35]:
input_string = 'abcdefghijklmnopqrstuvwxyz1234567890'
print (re.search('[a-z]*', input_string))

input_string = ' abcdefghijklmnopqrstuvwxyz1234567890'
print (re.search('[a-z]*', input_string))

<re.Match object; span=(0, 26), match='abcdefghijklmnopqrstuvwxyz'>
<re.Match object; span=(0, 0), match=''>


    - * repeats the previous token zero or more times, so it can also match the empty string that exists before any non-matching character. The first [a-z]* returns abcdefghijklmnopqrstuvwxyz because this substring is located at the start.
    
    - If the input starts with a space, as in ' abcdefghijklmnopqrstuvwxyz', it returns an empty string.
    
    - This is because re.search returns the leftmost match: at position 0 the space is not in [a-z], so [a-z]* succeeds there with zero characters and the search stops.

In [36]:
input_string = 'abcdefghijklmnopqrstuvwxyz1234567890'
print (re.search(r'[0-9]+', input_string))

<re.Match object; span=(26, 36), match='1234567890'>


In [37]:
# first matching digit
input_string = 'abcdefghijklmnopqrstuvwxyz1234567890'
re.search(r'[0-9]', input_string)

<re.Match object; span=(26, 27), match='1'>

In [38]:
# 0 or more occurrences of a digit
input_string = 'abcdefghijklmnopqrstuvwxyz1234567890'
re.search(r'[0-9]*', input_string)

<re.Match object; span=(0, 0), match=''>

re.search stops after finding the first match.

Here a is not matched by [0-9], but [0-9]* matches the empty string that exists before a, because * repeats the previous token zero or more times.

That is why we get an empty string as output.

In [39]:
# 0 or 1 occurrences of a digit
input_string = 'abcdefghijklmnopqrstuvwxyz1234567890'
re.search(r'[0-9]?', input_string)

<re.Match object; span=(0, 0), match=''>

In [40]:
## i+ = one or more i's, as many as possible.
re.search(r'pi+', 'piiig') # found, match.group() == "piii"

<re.Match object; span=(0, 4), match='piii'>

In [41]:
## Finds the first/leftmost solution, and within it drives the +
## as far as possible (aka 'leftmost and largest').
## In this example, note that it does not get to the second set of i's.
re.search(r'i+', 'piigiiii') # found, match.group() == "ii"

<re.Match object; span=(1, 3), match='ii'>

In [42]:
## \s* = zero or more whitespace chars
## Here look for 3 digits, possibly separated by whitespace.
print(re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx')) # found, match.group() == "1 2   3"
print(re.search(r'\d\s*\d\s*\d', 'xx12  3xx') )  # found, match.group() == "12  3"
print(re.search(r'\d\s*\d\s*\d', 'xx123xx   ') )    # found, match.group() == "123"

<re.Match object; span=(2, 9), match='1 2   3'>
<re.Match object; span=(2, 7), match='12  3'>
<re.Match object; span=(2, 5), match='123'>


In [43]:
contactInfo = 'Raman, Kumar, 080-2856-1733'

re.search(r'\w+, \w+, [\d-]+', contactInfo)

<re.Match object; span=(0, 27), match='Raman, Kumar, 080-2856-1733'>

Example ...

Use a regular expression to match a date string in the form of **month name** followed by **day number**.

In [44]:
input_str = "Hari was born on June 24"

match = re.search(r"([a-zA-Z]+) (\d+)", input_str) 

if match is not None: 
    print ("Match at index %s, %s" % (match.start(), match.end()) )
    
    print ("Full match: %s" % (match.group(0)) )
  
    print ("Month: %s" % (match.group(1)) )
  
    print ("Day: %s" % (match.group(2)) )

else: 
    print ("The regex pattern does not match.")

Match at index 17, 24
Full match: June 24
Month: June
Day: 24


#### Emails Example

Suppose you want to find the email address inside a string such as 'xyz alice-b@google.com purple monkey'. 

    - \w (lowercase w) matches a "word" character: a letter or digit or underscore [a-zA-Z0-9_]. 
    - \W (upper case W) matches any non-word character.
    - @ matches a literal @ character.
    - [\w.] a set of characters to potentially match: \w is any word character, and the trailing period . adds a literal dot to that set.
    - + one or more of the previous set.

In [50]:
input_str    = 'purple support@openDBtech.com monkey dishwasher'
result       = re.search(r'\w+@\w+', input_str)

if result:
    print (result.group() ) 

support@openDBtech


In [46]:
input_str    = 'purple support@openDBtech.com monkey blahblah@gmail.com'
result       = re.search(r'@[\w.]+', input_str)  # \w, not a bare w, so the set covers word characters and dots

if result:
    print (result.group()) 

@openDBtech.com

#### Group Extraction

The "group" feature of a regular expression allows you to pick out parts of the matching text.

Suppose for the emails problem that we want to extract the username and host separately. 

To do this, add parentheses ( ) around the username and host in the pattern, like this:

r'([\w.-]+)@([\w.-]+)'. 

In this case, the parentheses do not change what the pattern will match; instead they establish logical "groups" inside the match text.

On a successful search, match.group(1) is the match text corresponding to the 1st left parenthesis, and match.group(2) is the text corresponding to the 2nd left parenthesis. 

The plain match.group() is still the whole match text as usual.

In [9]:
email_address = 'Please contact us at: bhupen@popcorn-ai.com'

match = re.search(r'([\w\.-]+)@([\w\.-]+)', email_address)

if match:
    print(match.group()) # The whole matched text
    print(match.group(1)) # The username (group 1)
    print(match.group(2)) # The host (group 2)

bhupen@popcorn-ai.com
bhupen
popcorn-ai.com


In [30]:
contactInfo = 'Raman, Kumar, 080-2856-1733'

match = re.search(r'(\w+), (\w+), ([\d-]+)', contactInfo)
#match = re.search(r'(\w+), (\w+): (\S+)', contactInfo)

print(match.group(1))
print(match.group(2))
print(match.group(3))
match.group(0)

Raman
Kumar
080-2856-1733


'Raman, Kumar, 080-2856-1733'

Group numbering starts at 1 because group 0 is reserved to hold the entire match.

In [31]:
input_str    = 'purple support@openDBtech.com monkey dishwasher'
result       = re.search(r'([\w.-]+)@([\w.-]+)', input_str)

if result:
    print (result.group() ) 
    print (result.group(1) ) 
    print (result.group(2) ) 

support@openDBtech.com
support
openDBtech.com


Use a regular expression to match a few date strings.

In [32]:
input_str    = 'June 24, August 9, Dec 12'

matches = re.findall(r"[a-zA-Z]+ \d+", input_str)

for match in matches:
    print("match: %s" % (match))

match: June 24
match: August 9
match: Dec 12


To capture only the month of each date, put parentheses around the month part. With a single group, findall() returns just that group's text:

In [33]:
input_str    = 'June 24, August 9, Dec 12'

matches = re.findall(r"([a-zA-Z]+) \d+", input_str)

for match in matches:
    print("match: %s" % (match))

match: June
match: August
match: Dec


#### Grouping by Name

Sometimes, especially when a regular expression has a lot of groups, it is impractical to address each group by its number. A group can be given a name with the (?P<name>...) syntax and then retrieved by that name:

In [34]:
contactInfo = 'Raman, Kumar, 080-2856-1733'

match = re.search(r'(?P<last>\w+), (?P<first>\w+), (?P<phone>\S+)', contactInfo)

print(match.group('last'))
print(match.group('first'))
print(match.group('phone'))
match.group(0)

Raman
Kumar
080-2856-1733


'Raman, Kumar, 080-2856-1733'
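
Named groups can also be collected in one step. The sketch below reuses the pattern from the cell above and calls groupdict(), a method of match objects that returns the named groups as a dictionary; the variable names are illustrative.

```python
import re

contact = 'Raman, Kumar, 080-2856-1733'
match = re.search(r'(?P<last>\w+), (?P<first>\w+), (?P<phone>\S+)', contact)

# groupdict() returns all named groups as a dictionary.
fields = match.groupdict()
print(fields)

# Named groups are still numbered, so both access styles return the same text.
same_text = match.group('first') == match.group(2)
print(same_text)  # True
```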

Grouping can be used with the findall() method too, even though it doesn't return match objects. Instead, findall() returns a list of tuples, where the Nth element of each tuple corresponds to the Nth group of the regex pattern:

_Named groups are still allowed in the pattern, but findall() does not expose the names; the tuples are indexed by position only._

In [302]:
contactInfo = 'Raman, Kumar, 080-2856-1733'

re.findall(r'(\w+), (\w+), (\S+)', contactInfo)

[('Raman', 'Kumar', '080-2856-1733')]
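
When the group names are needed for every match, re.finditer() is one option: it yields match objects instead of tuples. The sketch below assumes a made-up string holding two records separated by a semicolon; the second contact is invented for the example.

```python
import re

# Made-up sample with two records separated by a semicolon.
contacts = 'Raman, Kumar, 080-2856-1733; Iyer, Meera, 044-1234-5678'
pattern = r'(?P<last>\w+), (?P<first>\w+), (?P<phone>[\d-]+)'

# findall() returns plain tuples, so the group names are not available.
as_tuples = re.findall(pattern, contacts)

# finditer() yields match objects, so groupdict() still works.
as_dicts = [m.groupdict() for m in re.finditer(pattern, contacts)]

print(as_tuples)
print(as_dicts[1]['first'])  # Meera
```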

# findall()

findall() returns a list of all non-overlapping matches of the pattern in the string.

In [35]:
re.findall(r'dog', 'dog cat dog')

['dog', 'dog']

In [36]:
results = re.findall(r'dog', 'dog cat dog')
results

['dog', 'dog']

In [37]:
results = re.findall(r'\w', 'http://www.openDB.com')
print(results)

['h', 't', 't', 'p', 'w', 'w', 'w', 'o', 'p', 'e', 'n', 'D', 'B', 'c', 'o', 'm']


In [38]:
input_str     = 'Machine Learning is great, Maths is fun 11 22 33 44. Machine is fine'
search_str    = r'Machine'

search_results = re.findall(search_str, input_str)

print(search_results)

['Machine', 'Machine']


In [46]:
input_str = 'purple bks@google.com, blah monkey arvind@abc.com blah dishwasher'

## Here re.findall() returns a list of all the found email strings
results = re.findall(r'[\w.-]+@[\w.-]+', input_str) 
  
for email in results:  
    print (email)

bks@google.com
arvind@abc.com


In [47]:
text = """
1. ricochet robots
2. settlers of catan
3. acquire
"""
re.findall(r'^\d+\.', text, re.MULTILINE)

['1.', '2.', '3.']

In [48]:
string  = """Hello my Number is 123456789 and 
             my friend's number is 987654321"""       
  
match = re.findall(r'\d+', string) 
print(match) 

['123456789', '987654321']


# Substituting text in a String

- The sub function is used for this purpose: re.sub(pattern, repl, string) returns a new string with every match of pattern replaced by repl.

In [52]:
input_str = "The film THREE IDIOTS was released in year 2003"  

print(re.sub(r"2003", "2001", input_str)  )
print("----------------------------------------")

input_str

The film THREE IDIOTS was released in year 2001
----------------------------------------


'The film THREE IDIOTS was released in year 2003'
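
Two further features of re.sub() are worth a quick sketch: the count argument, which limits the number of replacements, and backreferences such as \1 in the replacement text, which insert captured groups. The sample strings are reused from earlier cells; the variable names are illustrative.

```python
import re

text = 'dog cat dog'

# sub() returns a new string; the original is left unchanged.
all_replaced = re.sub(r'dog', 'bird', text)

# count limits how many matches are replaced (leftmost first).
first_only = re.sub(r'dog', 'bird', text, count=1)

# \1 and \2 in the replacement refer to the captured groups.
swapped = re.sub(r'(\w+), (\w+)', r'\2 \1', 'Raman, Kumar')

print(all_replaced)  # bird cat bird
print(first_only)    # bird cat dog
print(swapped)       # Kumar Raman
print(text)          # dog cat dog
```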

Let's substitute the lowercase letters in our string with the character "D":

In [54]:
input_str = "The film THREE IDIOTS was released in year 2003"  

result = re.sub(r"[a-z]", "D", input_str)  
print(result)  

TDD DDDD THREE IDIOTS DDD DDDDDDDD DD DDDD 2003


All the letters have been replaced except the capital ones. Passing flags=re.I makes the pattern case-insensitive, so the capitals are replaced too:

In [325]:
re.sub(r"[a-z]", "D", input_str, flags=re.I)  

'DDD DDDD DDDDD DDDDDD DDD DDDDDDDD DD DDDD 2003'

#### Removing Digits from a String

The regular expression for a digit is \d. This pattern can be used to remove digits from a string by replacing them with an empty string.

In [57]:
text = "The film Pulp Fiction was released in year 1994"  
result = re.sub(r"\d", "", text)  
print(result)  

The film Pulp Fiction was released in year 


#### Removing Alphabet Letters from a String

In [58]:
text = "The film Pulp Fiction was released in year 1994"  
result = re.sub(r"[a-z]", "", text, flags=re.I)  
print(result)  

        1994


In [330]:
text = "The film Pulp Fiction was released in year 1994"  
result = re.sub(r"[a-z\s]", "", text, flags=re.I)  
print(result)  

1994


#### Removing Word Characters

To remove all the word characters (letters, digits and the underscore) from a string and keep the remaining characters, use the \w pattern and replace it with an empty string.

In [59]:
text = "The film, '@Pulp Fiction' was ? released in % $ year 1994."  
result = re.sub(r"\w","", text, flags = re.I)  
print(result)  

 , '@ '  ?   % $  .


#### Removing Non-Word Characters
To remove all the non-word characters, the \W pattern can be used as follows:

In [332]:
text = "The film, '@Pulp Fiction' was ? released in % $ year 1994."  
result = re.sub(r"\W", "", text, flags=re.I)  
print(result)  

ThefilmPulpFictionwasreleasedinyear1994


#### Removing Multiple Spaces
Sometimes, multiple spaces appear between words as a result of removing words or punctuation. 

For instance, in the text below there are multiple spaces between some of the words. These runs can be collapsed with the pattern \s+, where \s matches a single whitespace character and + repeats it, replacing each run with one space.

In [60]:
text = "The film      Pulp Fiction      was released in   year 1994."  
result = re.sub(r"\s+", " ", text, flags = re.I)  
print(result) 

The film Pulp Fiction was released in year 1994.


#### Removing Spaces from Start and End
Sometimes we have a sentence that starts or ends with whitespace, which is often not desirable. The example below removes leading whitespace by anchoring \s+ with ^:

In [61]:
text = "         The film Pulp Fiction was released in year 1994"  
result = re.sub(r"^\s+", "", text)  
print(result)  

The film Pulp Fiction was released in year 1994
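
Trailing whitespace can be handled the same way by anchoring with $ instead of ^. The sketch below uses a made-up string padded on both sides and also shows an alternation that trims both ends in one call; for this simple case the built-in str.strip() gives the same result.

```python
import re

text = '   The film Pulp Fiction was released in year 1994   '

# $ anchors the match at the end of the string, so this removes trailing whitespace.
no_trailing = re.sub(r'\s+$', '', text)

# An alternation handles both ends in one call.
no_edges = re.sub(r'^\s+|\s+$', '', text)

print(repr(no_trailing))
print(repr(no_edges))
print(no_edges == text.strip())  # True
```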


#### Removing a Single Character
Sometimes removing punctuation marks, such as an apostrophe, leaves behind a single character which has no meaning. 

For instance, if you remove the apostrophe from the word Akram's and replace it with a space, the resulting string is Akram s. The pattern below replaces any single letter surrounded by whitespace with one space.

In [336]:
text = "The film Pulp Fiction     s was b released in year 1994"  
result = re.sub(r"\s+[a-zA-Z]\s+", " ", text)  
print(result)  

The film Pulp Fiction was released in year 1994


#### Splitting a String
Split a string of words wherever one or more whitespace characters are found:

In [337]:
text = "The film      Pulp   Fiction was released in year 1994      "  
result = re.split(r"\s+", text)  
print(result)  

['The', 'film', 'Pulp', 'Fiction', 'was', 'released', 'in', 'year', '1994', '']
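
The empty string at the end of the list comes from the trailing spaces: the pattern matches them, and nothing follows. One way to avoid it, sketched below, is to strip the text before splitting; the variable names are illustrative.

```python
import re

text = 'The film      Pulp   Fiction was released in year 1994      '

# Splitting the raw text leaves an empty string after the trailing spaces.
raw_parts = re.split(r'\s+', text)

# Stripping first avoids the empty item.
clean_parts = re.split(r'\s+', text.strip())

print(raw_parts[-1] == '')           # True
print(clean_parts)
print(clean_parts == text.split())   # True
```

For plain whitespace splitting, str.split() with no arguments gives the same list; re.split() becomes useful when the separator is a more complex pattern.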


Split a string of words wherever a comma is found:

In [338]:
text = "The film, Pulp Fiction, was released in year 1994"  
result = re.split(r",", text)  # a comma has no special meaning in a regex, so no backslash is needed
print(result)

['The film', ' Pulp Fiction', ' was released in year 1994']
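
The pieces above keep the space that followed each comma. A small sketch of one way to drop it is to let the pattern consume the whitespace as well; maxsplit is a standard argument of re.split() that limits the number of splits.

```python
import re

text = 'The film, Pulp Fiction, was released in year 1994'

# ',\s*' consumes the comma plus any whitespace that follows it.
parts = re.split(r',\s*', text)
print(parts)  # ['The film', 'Pulp Fiction', 'was released in year 1994']

# maxsplit limits the number of splits performed.
first_split = re.split(r',\s*', text, maxsplit=1)
print(first_split)  # ['The film', 'Pulp Fiction, was released in year 1994']
```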
