# Regular Expressions (Regex)

Python makes regular expressions available through the re module.
A RegEx, or regular expression, is a sequence of characters that forms a search pattern.

RegEx can be used to check whether a string contains the specified search pattern. For
instance, the expression 'amount\D+\d+' matches the word amount followed by an integer,
with one or more non-digit characters in between, as in: amount=100, amount is 3, amount is equal to: 33, etc.
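
The pattern above can be tried directly. The following is a minimal sketch; the sample strings are the ones from the text, plus one extra string ("amount100", added here as an assumption) that should not match because no non-digit separates the word from the number.

```python
import re

# Pattern from the text: "amount", one or more non-digits, then one or more digits.
pattern = r"amount\D+\d+"

samples = ["amount=100", "amount is 3", "amount is equal to: 33", "amount100"]

# re.search() returns a Match object on success and None otherwise,
# so bool() turns the result into a simple yes/no answer.
results = {sample: bool(re.search(pattern, sample)) for sample in samples}

for sample, matched in results.items():
    print(f"{sample!r}: {matched}")
```

Only the last sample is reported as False, since \D+ requires at least one non-digit character after the word.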

RegEx Functions
The re module offers a set of functions that allow us to search a string for a match:

Function	Description
findall	Returns a list containing all matches
search	Returns a Match object if there is a match anywhere in the string
split	Returns a list where the string has been split at each match
sub	Replaces one or many matches with a string
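
As a quick orientation, the four functions can be run side by side on one string. This is only a sketch; the variable names are chosen here for illustration.

```python
import re

text = "The rain in Spain"

all_matches = re.findall(r"ai", text)   # every non-overlapping match, as a list
first_match = re.search(r"ai", text)    # Match object for the first match, or None
pieces = re.split(r"\s", text)          # the string cut at each white-space character
replaced = re.sub(r"\s", "-", text)     # each white-space character replaced by "-"

print(all_matches)         # ['ai', 'ai']
print(first_match.span())  # (5, 7)
print(pieces)              # ['The', 'rain', 'in', 'Spain']
print(replaced)            # The-rain-in-Spain
```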


Metacharacters
Metacharacters are characters with a special meaning:

Character	Description	Example
[]	A set of characters	"[a-m]"
\	Signals a special sequence (can also be used to escape special characters)	r"\d"
.	Any character (except the newline character)	"he..o"
^	Starts with	"^hello"
$	Ends with	"world$"
*	Zero or more occurrences	"aix*"
+	One or more occurrences	"aix+"
{}	Exactly the specified number of occurrences	"al{2}"
|	Either or	"falls|stays"
()	Capture and group
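
Capturing with () is easiest to see in a small example. The sketch below uses a made-up sample string (an assumption, not taken from the notebook) and combines () with {} to pull the parts of a date out of a sentence.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "Order 66 shipped on 2024-01-15"

# Each pair of parentheses captures one part of the match;
# {4} and {2} require exactly four and exactly two digits.
match = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)

print(match.group())   # 2024-01-15  (the whole match)
print(match.group(1))  # 2024        (the first captured group)
print(match.groups())  # ('2024', '01', '15')
```

The number 66 is skipped because it has fewer than four digits and is not followed by a hyphen.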

In [1]:
import re

txt = "The rain in Spain"

# Find all lowercase characters alphabetically between "a" and "m":

x = re.findall("[a-m]", txt)
print(x)


['h', 'e', 'a', 'i', 'i', 'a', 'i']


In [2]:
import re

txt = "That will be 59 dollars"

# Find all digit characters:

x = re.findall(r"\d", txt)
print(x)

['5', '9']


In [3]:
import re

txt = "hello world"

# Search for a sequence that starts with "he", followed by any two characters, and an "o":

x = re.findall("he..o", txt)
print(x)

['hello']


In [4]:
import re

txt = "hello world"

# Check if the string starts with 'hello':

x = re.findall("^hello", txt)
if x:
  print("Yes, the string starts with 'hello'")
else:
  print("No match")

Yes, the string starts with 'hello'


In [5]:
import re

txt = "hello world"

# Check if the string ends with 'world':

x = re.findall("world$", txt)
if x:
  print("Yes, the string ends with 'world'")
else:
  print("No match")

Yes, the string ends with 'world'


In [7]:
import re

txt = "The rain in Spain falls mainly in the plain!"

# Check if the string contains "ai" followed by 0 or more "x" characters:

x = re.findall("aix*", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['ai', 'ai', 'ai', 'ai']
Yes, there is at least one match!


In [6]:
import re

txt = "The rain in Spain falls mainly in the plain!"

# Check if the string contains "ai" followed by 1 or more "x" characters:

x = re.findall("aix+", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

[]
No match


In [8]:
import re

txt = "The rain in Spain falls mainly in the plain!"

# Check if the string contains "a" followed by exactly two "l" characters:

x = re.findall("al{2}", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['all']
Yes, there is at least one match!


In [9]:
import re

txt = "The rain in Spain falls mainly in the plain!"

# Check if the string contains either "falls" or "stays":

x = re.findall("falls|stays", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['falls']
Yes, there is at least one match!


Special Sequences
A special sequence is a \ followed by one of the characters in the list below, and has a special meaning:

Character	Description	Example
\A	Matches only at the beginning of the string	r"\AThe"
\b	Matches at a word boundary, i.e. at the beginning or at the end of a word	r"\bain", r"ain\b"
\B	Matches where there is NO word boundary, i.e. not at the beginning (or at the end) of a word	r"\Bain", r"ain\B"
\d	Matches any digit (0-9)	r"\d"
\D	Matches any character that is NOT a digit	r"\D"
\s	Matches any white-space character	r"\s"
\S	Matches any character that is NOT a white-space character	r"\S"
\w	Matches any word character (letters a-z and A-Z, digits 0-9, and the underscore _ character)	r"\w"
\W	Matches any character that is NOT a word character	r"\W"
\Z	Matches only at the end of the string	r"Spain\Z"
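
The r prefix in the examples matters most for \b. The sketch below compares the same pattern with and without it; the variable names are chosen here for illustration.

```python
import re

text = "The rain in Spain"

# Without the r prefix, Python itself turns "\b" into a backspace character
# before the re module sees the pattern, so the search looks for "ain"
# followed by a literal backspace.
without_raw = re.findall("ain\b", text)
print(without_raw)  # []

# With the r prefix, the backslash reaches the re module unchanged
# and \b means "word boundary".
with_raw = re.findall(r"ain\b", text)
print(with_raw)     # ['ain', 'ain']
```

Sequences such as \d or \s are not Python string escapes, so they happen to work without the prefix, but recent Python versions emit a warning for them. Writing every pattern as a raw string avoids both problems.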


In [10]:
# \A matches only at the beginning of the string
import re

txt = "The rain in Spain"

# Check if the string starts with "The":

x = re.findall(r"\AThe", txt)

print(x)

if x:
  print("Yes, there is a match!")
else:
  print("No match")

['The']
Yes, there is a match!


In [11]:
# \b matches at a word boundary: the beginning or the end of a word
import re

txt = "The rain in Spain"

# Check if "ain" is present at the beginning of a WORD:

x = re.findall(r"\bain", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")



[]
No match


In [12]:
import re

txt = "The rain in Spain"

# Check if "ain" is present at the end of a WORD:

x = re.findall(r"ain\b", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['ain', 'ain']
Yes, there is at least one match!


In [13]:
# \B matches where there is NO word boundary: not at the beginning (or at the end) of a word
import re

txt = "The rain in Spain"

# Check if "ain" is present, but NOT at the beginning of a word:

x = re.findall(r"\Bain", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['ain', 'ain']
Yes, there is at least one match!


In [14]:
import re

txt = "The rain in Spain"

# Check if "ain" is present, but NOT at the end of a word:

x = re.findall(r"ain\B", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

[]
No match


In [15]:
# \d matches any digit (0-9)
import re

txt = "The rain in Spain"

# Check if the string contains any digits (numbers from 0-9):

x = re.findall(r"\d", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

[]
No match


In [16]:
# \D matches any character that is NOT a digit
import re

txt = "The rain in Spain"

# Return a match at every non-digit character:

x = re.findall(r"\D", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['T', 'h', 'e', ' ', 'r', 'a', 'i', 'n', ' ', 'i', 'n', ' ', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!


In [17]:
# \s matches any white-space character
import re

txt = "The rain in Spain"

# Return a match at every white-space character:

x = re.findall(r"\s", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

[' ', ' ', ' ']
Yes, there is at least one match!


In [18]:
# \S matches any character that is NOT a white-space character
import re

txt = "The rain in Spain"

# Return a match at every NON white-space character:

x = re.findall(r"\S", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!


In [19]:
# \w matches any word character (letters a-z and A-Z, digits 0-9, and the underscore _ character)
import re

txt = "The rain in Spain"

# Return a match at every word character (letters, digits 0-9, and the underscore _ character):

x = re.findall(r"\w", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!


In [20]:
# \W matches any character that is NOT a word character
import re

txt = "The rain in Spain"

# Return a match at every NON word character (anything other than letters, digits, and the underscore, such as "!", "?", or white-space):

x = re.findall(r"\W", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

[' ', ' ', ' ']
Yes, there is at least one match!


In [21]:
# \Z matches only at the end of the string
import re

txt = "The rain in Spain"

# Check if the string ends with "Spain":

x = re.findall(r"Spain\Z", txt)

print(x)

if x:
  print("Yes, there is a match!")
else:
  print("No match")

['Spain']
Yes, there is a match!


The findall() Function
The findall() function returns a list containing all matches.
The list contains the matches in the order they are found.

If no matches are found, an empty list is returned.
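
A short sketch of these two behaviours follows, together with one detail that often surprises readers: when the pattern contains a capturing group, findall() returns the group contents rather than the whole match. The variable names are chosen here for illustration.

```python
import re

text = "The rain in Spain"

# Matches are listed left to right, in the order they occur in the string.
in_order = re.findall(r"\w+ain", text)
print(in_order)       # ['rain', 'Spain']

# No match gives an empty list (not None), so the result is always safe to loop over.
nothing = re.findall(r"Portugal", text)
print(nothing)        # []

# With one capturing group in the pattern, findall() returns what the
# group captured instead of the whole match.
first_letters = re.findall(r"(\w)ain", text)
print(first_letters)  # ['r', 'p']
```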

The search() Function
The search() function searches the string for a match, and returns a Match object if there is a match.

If there is more than one match, only the first occurrence of the match will be returned.
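
Because search() returns None when nothing matches, calling a method such as start() on the result without checking can raise an AttributeError. The sketch below wraps the check in a small helper; first_position is a hypothetical name introduced here, not part of the re module.

```python
import re


def first_position(pattern, text):
    """Return the start index of the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.start()


# "ain" occurs twice, but only the first occurrence is reported.
found = first_position(r"ain", "The rain in Spain")
missing = first_position(r"Portugal", "The rain in Spain")

print(found)    # 5
print(missing)  # None
```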

In [22]:
# Search for the first white-space character in the string.
# If no match is found, search() returns None.
import re

txt = "The rain in Spain"
x = re.search(r"\s", txt)

print("The first white-space character is located in position:", x.start())

The first white-space character is located in position: 3


In [23]:
# Make a search that returns no match:
import re

txt = "The rain in Spain"
x = re.search("Portugal", txt)
print(x)

None


The split() Function
The split() function returns a list where the string has been split at each match.

In [24]:
# Split at each white-space character
import re

txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x)

['The', 'rain', 'in', 'Spain']


In [25]:
# You can limit the number of splits with the maxsplit parameter
# (passed by keyword, since passing it positionally is deprecated in recent Python versions)
import re

txt = "The rain in Spain"
x = re.split(r"\s", txt, maxsplit=1)
print(x)

['The', 'rain in Spain']


The sub() Function
The sub() function replaces the matches with the text of your choice.

In [26]:
# Replace every white-space character with the number 9:
import re

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt)
print(x)

The9rain9in9Spain


In [27]:
# You can limit the number of replacements with the count parameter
# (passed by keyword, since passing it positionally is deprecated in recent Python versions)
import re

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt, count=2)
print(x)

The9rain9in Spain


Match Object
A Match object is an object containing information about the search and the result.

Note: If there is no match, the value None is returned instead of a Match object.

The Match object has properties and methods used to retrieve information about the search and the result:

.span() returns a tuple containing the start and end positions of the match.
.string returns the string passed into the function.
.group() returns the part of the string where there was a match.
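
The three members listed above can be tried on a single search. This is a minimal sketch; the pattern looks for a word that starts with an upper-case "S", and the check against None follows the note above.

```python
import re

text = "The rain in Spain"

# \b anchors the match at the start of a word; \w+ takes the rest of the word.
match = re.search(r"\bS\w+", text)

if match is not None:
    print(match.span())    # (12, 17)
    print(match.string)    # The rain in Spain
    print(match.group())   # Spain
```

The end position in the tuple is exclusive, so text[12:17] gives back the matched word.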

In [31]:
import re

match = re.search('word', 'an example word cat1!!')
print(match)
print(match.group())
print(dir(match))

<re.Match object; span=(11, 15), match='word'>
word
['__class__', '__copy__', '__deepcopy__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'end', 'endpos', 'expand', 'group', 'groupdict', 'groups', 'lastgroup', 'lastindex', 'pos', 're', 'regs', 'span', 'start', 'string']
