A regular expression (RegEx) is a powerful tool for matching text against a pre-defined pattern. It can detect whether a piece of text is present or absent, and it can split a pattern into one or more sub-patterns. The Python standard library provides the re module for regular expressions. Its most basic function is search, which takes a regular expression and a string and returns either the first match or None.

In [2]:
import re 


match = re.search(r'portal', 'GeeksforGeeks: A computer science portal for geeks') 
print(match) 
print(match.group()) 

print('Start Index:', match.start()) 
print('End Index:', match.end()) 


<re.Match object; span=(34, 40), match='portal'>
portal
Start Index: 34
End Index: 40


Here the r prefix (r'portal') stands for raw, not RegEx. In a raw string, Python does not treat the \ character as an escape character, so every backslash reaches the regular expression engine unchanged. This matters because the regular expression engine uses the \ character for its own escaping.
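
As a minimal sketch of that difference (the variable names below are illustrative, not part of the original example), compare how Python reads the same characters with and without the r prefix:

```python
import re

# In a regular string literal, '\n' is a single newline character.
regular = '\n'
# In a raw string literal, r'\n' is two characters: a backslash and 'n'.
raw = r'\n'
print(len(regular), len(raw))  # 1 2

# To match one literal backslash, the regex engine needs the pattern \\.
# As a raw string that is r'\\'; as a regular string it would be '\\\\'.
path = r'C:\temp'
backslash_match = re.search(r'\\', path)
print(backslash_match.start())  # 2
```

Writing patterns as raw strings keeps the pattern you type identical to the pattern the engine receives.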

# Why RegEx?
Let’s take a moment to understand why regular expressions are worth using.

Data Mining: Regular expressions are well suited to data mining. They efficiently pick out a piece of text from a larger body of text by checking it against a pre-defined pattern. Common scenarios include identifying email addresses, URLs, or phone numbers in a pile of text.


Data Validation: Regular expressions can validate data against an expected format. Different patterns cover a wide array of validation rules, for example for phone numbers and email addresses.
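
Both scenarios can be sketched in a few lines. The email pattern below is a deliberately simplified assumption for illustration rather than a complete validator, and the NNN-NNN-NNNN phone format is a hypothetical example format:

```python
import re

text = 'Contact alice@example.com or bob.smith@mail.example.org for details'

# Data mining: pull every email-like token out of a larger text.
# Simplified pattern (an assumption), not a full email grammar.
email_pattern = r'[\w.+-]+@[\w-]+\.[\w.-]+'
emails = re.findall(email_pattern, text)
print(emails)  # ['alice@example.com', 'bob.smith@mail.example.org']

# Data validation: fullmatch succeeds only if the whole string fits.
phone_pattern = r'\d{3}-\d{3}-\d{4}'
valid_phone = re.fullmatch(phone_pattern, '555-867-5309')
invalid_phone = re.fullmatch(phone_pattern, '5558675309')
print(valid_phone is not None, invalid_phone is not None)  # True False
```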

# MetaCharacters

Metacharacters are characters that carry a special meaning inside a pattern. They are essential to understanding regular expressions and are used throughout the functions of the re module. Below is the list of metacharacters.

MetaCharacters	Description
![image.png](attachment:image.png)

# Character Classes
A character class matches a single character against a set of possible characters. You write a character class inside square brackets. Let’s consider an example that matches a word whether its first letter is upper or lower case.

In [3]:
import re 


print(re.findall(r'[Gg]eeks', 'GeeksforGeeks: A computer science portal for geeks'))


['Geeks', 'Geeks', 'geeks']


# Ranges
A range lets you match any character within a span, such as a range of digits (0 to 9) or a range of letters (A to Z). Inside a character class, the hyphen character represents a range.

In [4]:
import re 


print('Range',re.search(r'[a-zA-Z]', 'x'))


Range <re.Match object; span=(0, 1), match='x'>
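
A short sketch of a few more ranges, using findall to list every character that falls inside each range (the sample strings are made up for illustration):

```python
import re

text = 'Room B12, Level 3'

# [0-9] matches any single digit.
digits = re.findall(r'[0-9]', text)
print(digits)  # ['1', '2', '3']

# [A-Z] matches any single uppercase ASCII letter.
uppercase = re.findall(r'[A-Z]', text)
print(uppercase)  # ['R', 'B', 'L']

# Several ranges can share one class: hexadecimal digits.
hex_digits = re.findall(r'[0-9a-fA-F]', 'ff-3G')
print(hex_digits)  # ['f', 'f', '3']
```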


# Negation
Negation inverts a character class: a ^ placed immediately after the opening bracket makes the class match any character except the characters or ranges listed in it.

In [5]:
import re 

print(re.search(r'[^a-z]', 'c'))


None


In the above case, we have inverted the character class that ranges from a to z. Because the only character in the string, c, falls within that range, search finds no match and returns None.



In [6]:
import re 

print(re.search(r'G[^e]', 'Geeks')) 


None
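
For contrast, here is a small sketch of negated classes that do find matches (the sample strings are illustrative). It also shows that ^ is only special directly after the opening bracket:

```python
import re

# Everything that is not an ASCII letter.
non_letters = re.findall(r'[^a-zA-Z]', 'Geeks 4 Geeks!')
print(non_letters)  # [' ', '4', ' ', '!']

# 'G' followed by any character other than 'a' matches 'Ge'.
first = re.search(r'G[^a]', 'Geeks')
print(first.group())  # Ge

# A caret that is not the first character in the class is a literal '^'.
literal_caret = re.findall(r'[a^]', 'a^b')
print(literal_caret)  # ['a', '^']
```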


![image.png](attachment:image.png)

![image.png](attachment:image.png)

Let’s discuss some of the shorthand sequences provided by the regular expression engine.

\w – matches a word character (a letter, a digit, or an underscore)

\d – matches a digit character

\s – matches a whitespace character (space, tab, newline, etc.)

\b – matches a word boundary, a zero-length position at the edge of a word

In [7]:
import re 
  
  
print('Geeks:', re.search(r'\bGeeks\b', 'Geeks')) 
print('GeeksforGeeks:', re.search(r'\bGeeks\b', 'GeeksforGeeks')) 

Geeks: <re.Match object; span=(0, 5), match='Geeks'>
GeeksforGeeks: None
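
The sketch below exercises \b and \s on made-up strings, and adds \D, the uppercase form of \d, which matches any character that is not a digit:

```python
import re

# \b keeps 'cat' from matching inside 'concat' or 'cats'.
whole_words = re.findall(r'\bcat\b', 'cat concat cats cat.')
print(whole_words)  # ['cat', 'cat']

# \s+ treats any run of spaces, tabs, or newlines as one separator.
fields = re.split(r'\s+', 'a  b\tc\nd')
print(fields)  # ['a', 'b', 'c', 'd']

# \D is the negation of \d: it matches any non-digit character.
non_digits = re.findall(r'\D+', '2020-08-26')
print(non_digits)  # ['-', '-']
```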


# Beginning and End of String
The ^ character anchors a match to the beginning of a string and the $ character anchors it to the end of a string.

In [8]:
import re 
  
  
# Beginning of String 
match = re.search(r'^Geek', 'Campus Geek of the month') 
print('Beg. of String:', match) 
  
match = re.search(r'^Geek', 'Geek of the month') 
print('Beg. of String:', match) 
  
# End of String 
match = re.search(r'Geeks$', 'Compute science portal-GeeksforGeeks') 
print('End of String:', match) 

Beg. of String: None
Beg. of String: <re.Match object; span=(0, 4), match='Geek'>
End of String: <re.Match object; span=(31, 36), match='Geeks'>
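
Using both anchors together is a common way to check a whole string. The sketch below uses a hypothetical four-digit check and also shows one subtlety: $ matches just before a trailing newline as well, whereas re.fullmatch does not tolerate it.

```python
import re

pattern = r'^\d{4}$'

exact = re.search(pattern, '2020')         # whole string is four digits
too_long = re.search(pattern, '20201')     # five digits: no match
prefixed = re.search(pattern, 'year 2020') # extra text in front: no match
print(exact is not None, too_long, prefixed)  # True None None

# $ also matches right before a newline at the very end of the string.
trailing_newline = re.search(pattern, '2020\n')
print(trailing_newline is not None)  # True

# fullmatch requires the entire string, newline included, to fit.
strict = re.fullmatch(r'\d{4}', '2020\n')
print(strict)  # None
```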


# Any Character
The . character matches any single character except a newline. Inside a bracketed character class it loses this special meaning and matches a literal dot.

In [9]:
import re 

print('Any Character', re.search(r'p.th.n', 'python 3'))


Any Character <re.Match object; span=(0, 6), match='python'>
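
A brief sketch of the two edge cases: the newline exception (which the re.DOTALL flag lifts) and the literal dot inside brackets. The sample strings are illustrative.

```python
import re

# By default the dot does not match a newline.
no_newline = re.search(r'p.th.n', 'p\nthon')
print(no_newline)  # None

# With re.DOTALL the dot matches a newline as well.
with_dotall = re.search(r'p.th.n', 'p\nthon', re.DOTALL)
print(with_dotall.span())  # (0, 6)

# Outside brackets the dot matches every character here.
any_char = re.findall(r'.', 'a.b')
print(any_char)  # ['a', '.', 'b']

# Inside brackets it matches only a literal dot.
literal_dot = re.findall(r'[.]', 'a.b')
print(literal_dot)  # ['.']
```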


# Optional Characters
The ? character marks the preceding character or character class as optional: it may occur once or not at all. Let’s consider the example of a word with an alternative spelling – color or colour.

In [10]:
import re 


print('Color',re.search(r'colou?r', 'color')) 
print('Colour',re.search(r'colou?r', 'colour'))


Color <re.Match object; span=(0, 5), match='color'>
Colour <re.Match object; span=(0, 6), match='colour'>


# Repetition
Repetition enables you to repeat the same character or character class a fixed number of times. Consider an example of a date that consists of day, month, and year. Let’s use a regular expression to identify the date (dd-mm-yyyy).

In [11]:
import re 


print('Date{dd-mm-yyyy}:', re.search(r'[\d]{2}-[\d]{2}-[\d]{4}', '18-08-2020')) 


Date{dd-mm-yyyy}: <re.Match object; span=(0, 10), match='18-08-2020'>


Here, the regular expression engine checks for two consecutive digits. Upon finding them, it moves on to the hyphen character. After that, it checks for the next two consecutive digits, then a second hyphen, and finally four consecutive digits.

# Repetition ranges
A repetition range is useful when you have to accept more than one length. Consider a scenario where both three-digit and four-digit numbers are accepted. Let’s have a look at the regular expression.

In [12]:
import re 


print('Three Digit:', re.search(r'[\d]{3,4}', '189')) 
print('Four Digit:', re.search(r'[\d]{3,4}', '2145')) 


Three Digit: <re.Match object; span=(0, 3), match='189'>
Four Digit: <re.Match object; span=(0, 4), match='2145'>
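
One thing worth knowing about a repetition range is that it is greedy and that search does not check what comes after the match. A small sketch with made-up inputs:

```python
import re

pattern = r'\d{3,4}'

# Greedy: given five digits, search takes as many as the range allows.
longest = re.search(pattern, '12345')
print(longest.group())  # 1234

# Fewer than three digits cannot satisfy the lower bound.
too_short = re.search(pattern, '12')
print(too_short)  # None

# To reject inputs longer than four digits, match the whole string.
whole = re.fullmatch(pattern, '12345')
print(whole)  # None
```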


# Open-Ended Ranges
There are scenarios where there is no upper limit on how often a character may repeat. In such scenarios, you can leave the upper limit open by omitting it, as in {1,}. A common example is matching street addresses. Let’s have a look.

In [14]:
import re 


print(re.search(r'[\d]{1,}', '5th Floor, A-118, Sector-136, Noida, Uttar Pradesh - 201305'))


<re.Match object; span=(0, 1), match='5'>


# Shorthand
Shorthand characters allow you to use the + character to specify one or more ({1,}) and the * character to specify zero or more ({0,}).

In [16]:
import re 

print(re.search(r'[\d]+', '5th Floor, A-118, Sector-136, Noida, Uttar Pradesh - 201305'))


<re.Match object; span=(0, 1), match='5'>
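
Because search stops at the first match, it only reports the 5. As a sketch, findall with the same + shorthand collects every run of digits in the address, and a second call shows that * is satisfied even by zero characters:

```python
import re

address = '5th Floor, A-118, Sector-136, Noida, Uttar Pradesh - 201305'

# + means one or more: each run of digits becomes one result.
numbers = re.findall(r'\d+', address)
print(numbers)  # ['5', '118', '136', '201305']

# * means zero or more: it matches the empty string at position 0.
empty = re.search(r'\d*', 'abc')
print(empty.span())  # (0, 0)
```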


# Grouping
Grouping is the process of separating an expression into groups by using parentheses, and it allows you to fetch each individual matching group.  

In [17]:
import re 


grp = re.search(r'([\d]{2})-([\d]{2})-([\d]{4})', '26-08-2020') 
print(grp) 


<re.Match object; span=(0, 10), match='26-08-2020'>


# Return the entire match
The match object returned by the re module gives you the entire match through the group() method.

In [18]:
import re 


grp = re.search(r'([\d]{2})-([\d]{2})-([\d]{4})','26-08-2020') 
print(grp.group())


26-08-2020


# Return a tuple of matched groups
You can use the groups() method to return a tuple that holds the individual matched groups.

In [19]:
import re 


grp = re.search(r'([\d]{2})-([\d]{2})-([\d]{4})','26-08-2020') 
print(grp.groups())


('26', '08', '2020')


# Retrieve a single group
Passing an index to the group() method retrieves just a single group. Group indexes start at 1; index 0 refers to the entire match.

In [20]:
import re 


grp = re.search(r'([\d]{2})-([\d]{2})-([\d]{4})','26-08-2020') 
print(grp.group(3))


2020
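
The group methods combine naturally. As a sketch on the same date string (variable names are illustrative), the tuple from groups() can be unpacked directly, group() accepts several indexes at once, and span() reports where a single group matched:

```python
import re

grp = re.search(r'(\d{2})-(\d{2})-(\d{4})', '26-08-2020')

# Unpack the tuple returned by groups().
day, month, year = grp.groups()
print(day, month, year)  # 26 08 2020

# Several indexes return a tuple of those groups.
first_and_last = grp.group(1, 3)
print(first_and_last)  # ('26', '2020')

# span(n) gives the start and end position of group n.
month_span = grp.span(2)
print(month_span)  # (3, 5)

# group(0) is the same as group() with no arguments.
print(grp.group(0) == grp.group())  # True
```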


# Name your groups
The re module allows you to name your groups. Let’s look into the syntax.

In [21]:
import re 


match = re.search(r'(?P<dd>[\d]{2})-(?P<mm>[\d]{2})-(?P<yyyy>[\d]{4})', '26-08-2020') 
print(match.group('mm')) 


08


# Individual match as a dictionary
We have seen how a match object provides a tuple of individual groups. It can also provide the named groups as a dictionary through the groupdict() method, in which the name of each group acts as the dictionary key.

In [22]:
import re 


match = re.search(r'(?P<dd>[\d]{2})-(?P<mm>[\d]{2})-(?P<yyyy>[\d]{4})', '26-08-2020') 
print(match.groupdict()) 


{'dd': '26', 'mm': '08', 'yyyy': '2020'}
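
One way to put the dictionary to use, sketched here under the assumption that the input is a valid dd-mm-yyyy date, is to convert the pieces to integers and build a datetime.date from them:

```python
import re
from datetime import date

match = re.search(r'(?P<dd>\d{2})-(?P<mm>\d{2})-(?P<yyyy>\d{4})', '26-08-2020')

# Convert each named group from text to an integer.
parts = {name: int(value) for name, value in match.groupdict().items()}
print(parts)  # {'dd': 26, 'mm': 8, 'yyyy': 2020}

# date() raises ValueError for impossible dates such as month 13.
parsed = date(parts['yyyy'], parts['mm'], parts['dd'])
print(parsed.isoformat())  # 2020-08-26
```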


# Lookahead
A negated character class still has to consume a character, so it fails when there is no character left to check, such as at the end of a string. A lookahead overcomes this: it accepts or rejects a match based on the presence or absence of the content that follows, without consuming that content.

In [23]:
import re 


print('negation:', re.search(r'n[^e]', 'Python')) 
print('lookahead:', re.search(r'n(?!e)', 'Python')) 


negation: None
lookahead: <re.Match object; span=(5, 6), match='n'>


A lookahead can also require that the match be followed by a particular character. This is called a positive lookahead, and it is written by replacing the ! character with the = character.

In [24]:
import re 

print('positive lookahead', re.search(r'n(?=e)', 'jasmine')) 


positive lookahead <re.Match object; span=(5, 6), match='n'>
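
Lookaheads are often used to pick values by what follows them. The sketch below uses a made-up string of CSS-like sizes. It also shows a common pitfall: with a negative lookahead the engine may backtrack to a shorter match, so the lookahead usually needs to rule out further digits too.

```python
import re

css = '12px 5em 30px'

# Positive lookahead: numbers followed by 'px'; 'px' is not consumed.
pixel_values = re.findall(r'\d+(?=px)', css)
print(pixel_values)  # ['12', '30']

# Naive negative lookahead: '12' is rejected, but the engine backtracks
# to '1', which is followed by '2' rather than 'px', and accepts it.
naive = re.findall(r'\d+(?!px)', css)
print(naive)  # ['1', '5', '3']

# Also forbidding a following digit prevents the partial matches.
strict = re.findall(r'\d+(?!\d|px)', css)
print(strict)  # ['5']
```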


# Compiled RegEx
The re module can return a compiled regular expression (RegEx) object using the compile function. This object has its own search and sub methods, so a developer can reuse the same pattern whenever it is needed.

In [25]:
import re 

regex = re.compile(r'([\d]{2})-([\d]{2})-([\d]{4})') 

# search method 
print('compiled reg expr', regex.search('26-08-2020')) 

# sub method 
print(regex.sub(r'\1.\2.\3', '26-08-2020')) 


compiled reg expr <re.Match object; span=(0, 10), match='26-08-2020'>
26.08.2020
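
Reuse is where a compiled pattern is most convenient. As a sketch (the sample lines and the name date_re are illustrative), the same object is applied to several strings, and sub refers to named groups with the \g<name> syntax to rewrite dd-mm-yyyy dates as yyyy-mm-dd:

```python
import re

# Compile once, reuse for every line.
date_re = re.compile(r'(?P<dd>\d{2})-(?P<mm>\d{2})-(?P<yyyy>\d{4})')

lines = ['Invoice dated 26-08-2020', 'No date here', 'Due 01-09-2020']

# \g<name> inserts the text captured by the named group.
iso_lines = [date_re.sub(r'\g<yyyy>-\g<mm>-\g<dd>', line) for line in lines]
print(iso_lines)
# ['Invoice dated 2020-08-26', 'No date here', 'Due 2020-09-01']
```

Lines without a date pass through sub unchanged.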


In [26]:
import re 
p = re.compile('[a-e]') 

print(p.findall("Aye, said Mr. Gibenson Stark")) 


['e', 'a', 'd', 'b', 'e', 'a']


In [27]:
import re 
# Raw strings keep Python from treating \d as a string escape
p = re.compile(r'\d') 
print(p.findall("I went to him at 11 A.M. on 4th July 1886")) 

p = re.compile(r'\d+') 
print(p.findall("I went to him at 11 A.M. on 4th July 1886")) 


['1', '1', '4', '1', '8', '8', '6']
['11', '4', '1886']


In [29]:
import re 
  
p = re.compile(r'\w') 
print(p.findall("He said * in some_lang.")) 
  
p = re.compile(r'\w+') 
print(p.findall("I went to him at 11 A.M., he said *** in some_language.")) 
  
p = re.compile(r'\W') 
print(p.findall("he said *** in some_language.")) 

['H', 'e', 's', 'a', 'i', 'd', 'i', 'n', 's', 'o', 'm', 'e', '_', 'l', 'a', 'n', 'g']
['I', 'went', 'to', 'him', 'at', '11', 'A', 'M', 'he', 'said', 'in', 'some_language']
[' ', ' ', '*', '*', '*', ' ', ' ', '.']


In [30]:
from re import split 

print(split(r'\W+', 'Words, words , Words')) 
print(split(r'\W+', "Word's words Words")) 
print(split(r'\W+', 'On 12th Jan 2016, at 11:02 AM')) 
print(split(r'\d+', 'On 12th Jan 2016, at 11:02 AM')) 


['Words', 'words', 'Words']
['Word', 's', 'words', 'Words']
['On', '12th', 'Jan', '2016', 'at', '11', '02', 'AM']
['On ', 'th Jan ', ', at ', ':', ' AM']


In [31]:
import re 
# maxsplit=1 stops after the first split
print(re.split(r'\d+', 'On 12th Jan 2016, at 11:02 AM', maxsplit=1)) 
print(re.split('[a-f]+', 'Aey, Boy oh boy, come here', flags=re.IGNORECASE)) 
print(re.split('[a-f]+', 'Aey, Boy oh boy, come here')) 


['On ', 'th Jan 2016, at 11:02 AM']
['', 'y, ', 'oy oh ', 'oy, ', 'om', ' h', 'r', '']
['A', 'y, Boy oh ', 'oy, ', 'om', ' h', 'r', '']
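
One more behaviour of split is worth a sketch: if the pattern contains a capturing group, the text matched by the group is kept in the result instead of being discarded. The sample string is illustrative.

```python
import re

text = 'On 12th Jan 2016'

# Without a group, the digits are dropped from the result.
without_group = re.split(r'\d+', text)
print(without_group)  # ['On ', 'th Jan ', '']

# With a capturing group, the matched digits are returned as well.
with_group = re.split(r'(\d+)', text)
print(with_group)  # ['On ', '12', 'th Jan ', '2016', '']

# maxsplit limits the number of splits performed.
limited = re.split(r'\d+', text, maxsplit=1)
print(limited)  # ['On ', 'th Jan 2016']
```

The trailing empty string appears because the last match sits at the very end of the text.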


In [32]:
import re 
regex = r"([a-zA-Z]+) (\d+)"

match = re.search(regex, "I was born on June 24") 
if match is not None: 
    print("Match at index %s, %s" % (match.start(), match.end())) 
    print("Full match: %s" % (match.group(0))) 
    print("Month: %s" % (match.group(1))) 
    print("Day: %s" % (match.group(2))) 
else: 
    print("The regex pattern does not match.") 


Match at index 14, 21
Full match: June 24
Month: June
Day: 24
