A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern.

RegEx can be used to check if a string contains the specified search pattern.

# RegEx Module
Python has a built-in package called re, which can be used to work with Regular Expressions.

Import the re module:

In [1]:
import re

# Example
Search the string to see if it starts with "The" and ends with "Spain":

In [3]:
txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)
x


<re.Match object; span=(0, 17), match='The rain in Spain'>
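The Match object shown above can be inspected with its own methods. A minimal sketch, reusing the same string and pattern (the variable name `m` is only illustrative):

```python
import re

txt = "The rain in Spain"
m = re.search(r"^The.*Spain$", txt)

# span() gives the (start, end) positions of the match
print(m.span())   # (0, 17)
# group() gives the text that was matched
print(m.group())  # The rain in Spain
```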

# RegEx Functions
The re module offers a set of functions that allows us to search a string for a match:

<table style="float:left;">
    <tbody><tr>
        <th >Function</th>
        <th>Description</th>
        </tr>
        <tr>
        <td><a href="#">findall</a></td>
        <td>Returns a list containing all matches</td>
        </tr>
        <tr>
        <td><a href="#">search</a></td>
        <td>Returns a <a href="#">Match object</a> if there is a match anywhere in the string</td>
        </tr>
        <tr>
        <td><a href="#">split</a></td>
        <td>Returns a list where the string has been split at each match </td>
        </tr>
        <tr>
        <td><a href="#">sub</a></td>
        <td>Replaces one or many matches with a string</td>
        </tr>
    </tbody>
</table>

In [7]:
txt = "The rain in Spain and Bahrain"
x = re.findall("ai", txt)
print(x)

txt = "The rain in Spain"
x = re.findall("Portugal", txt)
print(x)

['ai', 'ai', 'ai']
[]


In [9]:
txt = "The rain in Spain"
x = re.search(r"\s", txt)

print("The first white-space character is located in position:", x.start())

txt = "The rain in Spain"
x = re.search("Portugal", txt)
print(x)

The first white-space character is located in position: 3
None
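Because `re.search` returns `None` when nothing matches, calling `x.start()` on a failed search raises an `AttributeError`. One possible way to guard against that is sketched below; the helper name `first_whitespace_position` is hypothetical and not part of the `re` module:

```python
import re

def first_whitespace_position(text):
    """Return the index of the first white-space character, or None if there is none."""
    match = re.search(r"\s", text)
    # re.search returns None on failure, so check before using Match methods
    if match is None:
        return None
    return match.start()

print(first_whitespace_position("The rain in Spain"))  # 3
print(first_whitespace_position("Portugal"))           # None
```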


In [14]:
txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x)
# You can control the number of splits by specifying the maxsplit parameter.
# Let's use 1 as the maxsplit value:
txt = "The rain in Spain"
x = re.split(r"\s", txt, maxsplit=1)
print(x)

['The', 'rain', 'in', 'Spain']
['The', 'rain in Spain']


In [16]:
txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt)
print(x)
# You can control the number of replacements by specifying the count parameter.
# Let's use 2 as the count:
txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt, count=2)
print(x)

The9rain9in9Spain
The9rain9in Spain


# Metacharacters
Metacharacters are characters with a special meaning:

<table style="float:left;">
<tbody><tr>
<th >Character</th>
<th >Description</th>
<th>Example</th>

</tr>
<tr>
<td >[]</td>
<td>A set of characters</td>
<td>"[a-m]"</td>

</tr>
<tr>
<td>\</td>
<td>Signals a special sequence (can also be used to escape special characters)</td>
<td>"\d"</td>

</tr>
<tr>
<td>.</td>
<td>Any character (except newline character)</td>
<td>"he..o"</td>

</tr>
<tr>
<td>^</td>
<td>Starts with</td>
<td>"^hello"</td>

</tr>
  <tr>
<td> &#36; </td>
<td> Ends with </td>
<td>"world&#36;"</td>

  </tr>
  <tr>
<td>*</td>
<td>Zero or more occurrences</td>
<td>"aix*"</td>

  </tr>
  <tr>
<td>+</td>
<td>One or more occurrences</td>
<td>"aix+"</td>

  </tr>
  <tr>
<td>{}</td>
<td>Exactly the specified number of occurrences</td>
<td>"al{2}"</td>

  </tr>
  <tr>
<td>|</td>
<td>Either or</td>
<td>"falls|stays"</td>

  </tr>
  <tr>
<td>()</td>
<td>Capture and group</td>
<td>&nbsp;</td>
  </tr>
</tbody></table>
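The table gives no example for `()`. As a rough illustration of capturing and grouping, the sketch below uses a pattern chosen only for this demonstration:

```python
import re

txt = "The rain in Spain"

# Two groups: the word before " in " and the word after it
m = re.search(r"(\w+) in (\w+)", txt)
if m:
    print(m.group(0))  # rain in Spain (the whole match)
    print(m.group(1))  # rain
    print(m.group(2))  # Spain

# When the pattern contains a group, findall returns the group, not the whole match
print(re.findall(r"(\w)ain", txt))  # ['r', 'p']
```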

In [17]:

txt = "The rain in Spain"

#Find all lower case characters alphabetically between "a" and "m":
x = re.findall("[a-m]", txt)
print(x)

['h', 'e', 'a', 'i', 'i', 'a', 'i']


In [18]:
txt = "That will be 59 dollars"

#Find all digit characters:
x = re.findall(r"\d", txt)
print(x)

['5', '9']


In [19]:
txt = "hello world"

#Search for a sequence that starts with "he", followed by two (any) characters, and an "o":
x = re.findall("he..o", txt)
print(x)

['hello']


In [23]:
txt = "hello world"

#Check if the string starts with 'hello':
x = re.findall("^hello", txt)
if x:
  print("Yes, the string starts with 'hello'")
else:
  print("No match")

Yes, the string starts with 'hello'


In [27]:
txt = "hello world"

#Check if the string ends with 'world':

x = re.findall("world$", txt)
if x:
  print("Yes, the string ends with 'world'")
else:
  print("No match")

Yes, the string ends with 'world'


In [28]:
txt = "The rain in Spain falls mainly in the plain!"

#Check if the string contains "ai" followed by 0 or more "x" characters:
x = re.findall("aix*", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['ai', 'ai', 'ai', 'ai']
Yes, there is at least one match!


In [31]:
txt = "The rain in Spain falls mainly in the plain!"

#Check if the string contains "ai" followed by 1 or more "x" characters:
x = re.findall("aix+", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

[]
No match


In [32]:
txt = "The rain in Spain falls mainly in the plain!"

#Check if the string contains "a" followed by exactly two "l" characters:
x = re.findall("al{2}", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['all']
Yes, there is at least one match!


In [33]:
txt = "The rain in Spain falls mainly in the plain!"

#Check if the string contains either "falls" or "stays":
x = re.findall("falls|stays", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['falls']
Yes, there is at least one match!
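The metacharacter table notes that `\` can also escape special characters. A short sketch of what that looks like in practice, using a sample string made up for this example:

```python
import re

txt = "Total: 3.50 (approx.)"

# Unescaped "." matches any character except a newline, so every character matches
print(len(re.findall(r".", txt)) == len(txt))  # True
# Escaped "\." matches only a literal period
print(re.findall(r"\.", txt))  # ['.', '.']
# re.escape builds a pattern that matches the given text literally
print(re.findall(re.escape("(approx.)"), txt))  # ['(approx.)']
```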


# Special Sequences
A special sequence is a \ followed by one of the characters in the list below, and has a special meaning:

<table style="float:left;">
<tbody><tr>
<th >Character</th>
<th>Description</th>
<th >Example</th>

</tr>
<tr>
<td>\A</td>
<td>Returns a match if the specified characters are at the beginning of the 
string</td>
<td>"\AThe"</td>

</tr>
  <tr>
<td>\b</td>
<td>Returns a match where the specified characters are at the beginning or at the 
end of a word<br>(the "r" in the beginning is making sure that the string is 
being treated as a "raw string")</td>
<td>r"\bain"<br>r"ain\b"</td>

  </tr>
  <tr>
<td>\B</td>
<td>Returns a match where the specified characters are present, but NOT at the beginning 
(or at 
the end) of a word<br>(the "r" in the beginning is making sure that the string 
is being treated as a "raw string")</td>
<td>r"\Bain"<br>r"ain\B"</td>

  </tr>
  <tr>
<td>\d</td>
<td>Returns a match where the string contains digits (numbers from 0-9)</td>
<td>"\d"</td>

  </tr>
  <tr>
<td>\D</td>
<td>Returns a match for every character that is NOT a digit</td>
<td>"\D"</td>

  </tr>
  <tr>
<td>\s</td>
<td>Returns a match where the string contains a white space character</td>
<td>"\s"</td>

  </tr>
  <tr>
<td>\S</td>
<td>Returns a match for every character that is NOT a white space character</td>
<td>"\S"</td>

  </tr>
  <tr>
<td>\w</td>
<td>Returns a match where the string contains any word characters (letters from 
a to z and A to Z, digits from 0-9, and the underscore _ character)</td>
<td>"\w"</td>

  </tr>
  <tr>
<td>\W</td>
<td>Returns a match for every character that is NOT a word character</td>
<td>"\W"</td>

  </tr>
<tr>
<td>\Z</td>
<td>Returns a match if the specified characters are at the end of the string</td>
<td>"Spain\Z"</td>

</tr>
</tbody>
</table>
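The raw-string note in the table matters because Python itself interprets `"\b"` as a backspace character before `re` ever sees it. The following sketch compares the two forms on the tutorial's sample string:

```python
import re

txt = "The rain in Spain"

# In a normal string literal, "\b" is a single backspace character
print(len("ain\b"))   # 4
# In a raw string the backslash is kept, so re receives the \b word boundary
print(len(r"ain\b"))  # 5

print(re.findall("ain\b", txt))   # [] (searches for "ain" followed by a backspace)
print(re.findall(r"ain\b", txt))  # ['ain', 'ain']
```

Writing every pattern as a raw string is a common habit, since it also avoids invalid-escape warnings for sequences such as `\s` or `\d`.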

In [35]:
txt = "The rain in Spain"

#Check if the string starts with "The":
x = re.findall(r"\AThe", txt)

print(x)

if x:
  print("Yes, there is a match!")
else:
  print("No match")

['The']
Yes, there is a match!


In [40]:
txt = "The rain in Spain"

#Check if "ain" is present at the beginning of a WORD:

x = re.findall(r"\bain", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

[]
No match


In [41]:
txt = "The rain in Spain"

#Check if "ain" is present at the end of a WORD:

x = re.findall(r"ain\b", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['ain', 'ain']
Yes, there is at least one match!


In [42]:
txt = "The rain in Spain"

#Check if "ain" is present, but NOT at the beginning of a word:
x = re.findall(r"\Bain", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['ain', 'ain']
Yes, there is at least one match!


In [43]:
txt = "The rain in Spain"

#Check if "ain" is present, but NOT at the end of a word:
x = re.findall(r"ain\B", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

[]
No match


In [44]:
txt = "The rain in Spain"

#Check if the string contains any digits (numbers from 0-9):

x = re.findall(r"\d", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

[]
No match


In [45]:
txt = "The rain in Spain"

#Return a match at every non-digit character:

x = re.findall(r"\D", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['T', 'h', 'e', ' ', 'r', 'a', 'i', 'n', ' ', 'i', 'n', ' ', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!


In [46]:
txt = "The rain in Spain"

#Return a match at every white-space character:

x = re.findall(r"\s", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

[' ', ' ', ' ']
Yes, there is at least one match!


In [47]:
txt = "The rain in Spain"

#Return a match at every NON white-space character:

x = re.findall(r"\S", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!


In [49]:
txt = "The rain 2in Sp_ain"

#Return a match at every word character (letters from a to z and A to Z, digits from 0-9, and the underscore _ character):

x = re.findall(r"\w", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['T', 'h', 'e', 'r', 'a', 'i', 'n', '2', 'i', 'n', 'S', 'p', '_', 'a', 'i', 'n']
Yes, there is at least one match!


In [50]:
txt = "The rain in Spain"

#Return a match at every NON word character (anything that is not a letter, digit, or underscore, such as "!", "?", or white-space):

x = re.findall(r"\W", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

[' ', ' ', ' ']
Yes, there is at least one match!


In [51]:
txt = "The rain in Spain"

#Check if the string ends with "Spain":

x = re.findall(r"Spain\Z", txt)

print(x)

if x:
  print("Yes, there is a match!")
else:
  print("No match")

['Spain']
Yes, there is a match!


# Sets
A set is a group of characters inside a pair of square brackets [] that has a special meaning:

<table style="float:left;">
<tbody><tr>
<th >Set</th>
<th>Description</th>

</tr>
  <tr>
<td>[arn]</td>
<td>Returns a match where one of the specified characters (<code class="w3-codespan">a</code>,
<code class="w3-codespan">r</code>, or <code class="w3-codespan">n</code>) is 
present</td>

  </tr>
  <tr>
<td>[a-n]</td>
<td>Returns a match for any lower case character, alphabetically between
<code class="w3-codespan">a</code> and <code class="w3-codespan">n</code></td>

  </tr>
  <tr>
<td>[^arn]</td>
<td>Returns a match for any character EXCEPT <code class="w3-codespan">a</code>,
<code class="w3-codespan">r</code>, and <code class="w3-codespan">n</code></td>

  </tr>
  <tr>
<td>[0123]</td>
<td>Returns a match where any of the specified digits (<code class="w3-codespan">0</code>,
<code class="w3-codespan">1</code>, <code class="w3-codespan">2</code>, or <code class="w3-codespan">3</code>) is 
present</td>

  </tr>
  <tr>
<td>[0-9]</td>
<td>Returns a match for any digit between
<code class="w3-codespan">0</code> and <code class="w3-codespan">9</code></td>

  </tr>
<tr>
<td>[0-5][0-9]</td>
<td>Returns a match for any two-digit number from <code>00</code> to <code>59</code></td>

</tr>
  <tr>
<td>[a-zA-Z]</td>
<td>Returns a match for any character alphabetically between
<code>a</code> and <code>z</code>, lower case OR upper case</td>

  </tr>
  <tr>
<td>[+]</td>
<td>In sets, <code >+</code>, <code>*</code>,
<code >.</code>, <code >|</code>,
<code >()</code>, <code>&#36;</code>, and <code>{}</code> 
have no special meaning, so <code>[+]</code> means: return a match for any
<code >+</code> character in the string</td>

  </tr>
</tbody></table>

In [52]:
txt = "The rain in Spain"

#Check if the string has any a, r, or n characters:

x = re.findall("[arn]", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['r', 'a', 'n', 'n', 'a', 'n']
Yes, there is at least one match!


In [53]:
txt = "The rain in Spain"

#Check if the string has any characters between a and n:

x = re.findall("[a-n]", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['h', 'e', 'a', 'i', 'n', 'i', 'n', 'a', 'i', 'n']
Yes, there is at least one match!


In [54]:
txt = "The rain in Spain"

#Check if the string has any characters other than a, r, or n:

x = re.findall("[^arn]", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['T', 'h', 'e', ' ', 'i', ' ', 'i', ' ', 'S', 'p', 'i']
Yes, there is at least one match!


In [55]:
txt = "The rain in Spain"

#Check if the string has any 0, 1, 2, or 3 digits:

x = re.findall("[0123]", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


[]
No match


In [56]:
txt = "8 times before 11:45 AM"

#Check if the string has any digits:

x = re.findall("[0-9]", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['8', '1', '1', '4', '5']
Yes, there is at least one match!


In [57]:
txt = "8 times before 11:45 AM"

#Check if the string has any two-digit numbers, from 00 to 59:

x = re.findall("[0-5][0-9]", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['11', '45']
Yes, there is at least one match!


In [58]:
txt = "8 times before 11:45 AM"

#Check if the string has any characters from a to z lower case, and A to Z upper case:

x = re.findall("[a-zA-Z]", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['t', 'i', 'm', 'e', 's', 'b', 'e', 'f', 'o', 'r', 'e', 'A', 'M']
Yes, there is at least one match!


In [59]:
txt = "8 times before 11:45 AM"

#Check if the string has any + characters:

x = re.findall("[+]", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

[]
No match
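The `[+]` example above finds nothing because the sample string contains no `+`. To see the rule from the table produce matches, here is a small sketch on a made-up string that does contain these characters:

```python
import re

txt = "2+2=4 and 3*3=9."

# Inside a set, + * . are ordinary characters
print(re.findall(r"[+*.]", txt))     # ['+', '*', '.']
# Outside a set they must be escaped to be matched literally
print(re.findall(r"\+|\*|\.", txt))  # ['+', '*', '.']
```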
