A Regular Expression (RegEx) is a sequence of characters that defines a search pattern. For example,

^a...s$
The above code defines a RegEx pattern: any five-character string starting with a and ending with s.

A pattern defined using RegEx can be used to match against a string.

Expression	String	Matched?
^a...s$	abs	No match
alias	Match
abyss	Match
Alias	No match
An abacus	No match


In [1]:
import re

pattern = '^a...s$'
test_string = 'abyss'
result = re.match(pattern, test_string)

if result:
  print("Search successful.")
else:
  print("Search unsuccessful.")	

Search successful.
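
The rows of the table above can be checked in one loop. This is a minimal sketch; the names `candidates` and `results` are introduced here for illustration only.

```python
import re

pattern = r'^a...s$'
candidates = ['abs', 'alias', 'abyss', 'Alias', 'An abacus']

# Map each candidate to True when the pattern matches, False otherwise.
results = {s: re.match(pattern, s) is not None for s in candidates}

for s, matched in results.items():
    print(f'{s!r}: {"Match" if matched else "No match"}')
```

Only alias and abyss match: abs is too short, and the other two start with an uppercase A.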


Metacharacters
Metacharacters are characters that are interpreted in a special way by a RegEx engine. Here's a list of metacharacters:

[] . ^ $ * + ? {} () \ |



[] - Square brackets

Square brackets specify a set of characters you wish to match.

Expression	String	Matched?
[abc]	a	1 match
ac	2 matches
Hey Jude	No match
abc de ca	5 matches
Here, [abc] matches if the string contains any of a, b, or c.

You can also specify a range of characters using - inside square brackets.

[a-e] is the same as [abcde].
[1-4] is the same as [1234].
[0-39] is the same as [01239].
You can complement (invert) the character set by placing a caret ^ at the start of the square brackets.

[^abc] means any character except a or b or c.
[^0-9] means any non-digit character.
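
As a quick check of the sets above, re.findall() (used later in this section) lists every character a set matches. The sample strings are illustrative assumptions, not part of the original tables apart from 'abc de ca'.

```python
import re

# Each call returns every non-overlapping match as a list of strings.
set_matches = re.findall(r'[abc]', 'abc de ca')
range_matches = re.findall(r'[0-39]', '0123456789')
negated_matches = re.findall(r'[^0-9]', 'a1b2')

print(set_matches)      # ['a', 'b', 'c', 'c', 'a']
print(range_matches)    # ['0', '1', '2', '3', '9']
print(negated_matches)  # ['a', 'b']
```
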
. - Period

A period matches any single character (except newline '\n').

Expression	String	Matched?
..	a	No match
ac	1 match
acd	1 match
acde	2 matches (contains 4 characters)
^ - Caret

The caret symbol ^ is used to check if a string starts with a certain character.

Expression	String	Matched?
^a	a	1 match
abc	1 match
bac	No match
^ab	abc	1 match
acb	No match (starts with a but not followed by b)
$ - Dollar

The dollar symbol $ is used to check if a string ends with a certain character.

Expression	String	Matched?
a$	a	1 match
formula	1 match
cab	No match
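
The period, caret and dollar rows above can be reproduced with a few calls. This is a small sketch; the variable names are chosen here for illustration.

```python
import re

# '..' consumes two characters per match, so 'acde' yields two matches.
pairs = re.findall(r'..', 'acde')

# '^a' matches only when the string starts with a.
starts_with_a = [s for s in ['a', 'abc', 'bac'] if re.search(r'^a', s)]

# 'a$' matches only when the string ends with a.
ends_with_a = [s for s in ['a', 'formula', 'cab'] if re.search(r'a$', s)]

print(pairs)          # ['ac', 'de']
print(starts_with_a)  # ['a', 'abc']
print(ends_with_a)    # ['a', 'formula']
```
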
* - Star

The star symbol * matches zero or more occurrences of the pattern to its left.

Expression	String	Matched?
ma*n	mn	1 match
man	1 match
maaan	1 match
main	No match (a is not followed by n)
woman	1 match
+ - Plus

The plus symbol + matches one or more occurrences of the pattern to its left.

Expression	String	Matched?
ma+n	mn	No match (no a character)
man	1 match
maaan	1 match
main	No match (a is not followed by n)
woman	1 match
? - Question Mark

The question mark symbol ? matches zero or one occurrence of the pattern to its left.

Expression	String	Matched?
ma?n	mn	1 match
man	1 match
maaan	No match (more than one a character)
main	No match (a is not followed by n)
woman	1 match
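
To compare the three quantifiers side by side, the same word list can be run through each pattern. A minimal sketch; the dictionary name `found` is an assumption made for this example.

```python
import re

words = ['mn', 'man', 'maaan', 'main', 'woman']

# For each quantifier, collect the words in which the pattern is found.
found = {
    pattern: [w for w in words if re.search(pattern, w)]
    for pattern in (r'ma*n', r'ma+n', r'ma?n')
}

for pattern, matched in found.items():
    print(pattern, matched)

# ma*n ['mn', 'man', 'maaan', 'woman']
# ma+n ['man', 'maaan', 'woman']
# ma?n ['mn', 'man', 'woman']
```
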
{} - Braces

Consider the form {n,m}. It means at least n and at most m repetitions of the pattern to its left.

Expression	String	Matched?
a{2,3}	abc dat	No match
abc daat	1 match (aa in daat)
aabc daaat	2 matches (aa in aabc, aaa in daaat)
aabc daaaat	2 matches (aa in aabc, aaa in daaaat)
Let's try one more example. The RegEx [0-9]{2,4} matches at least 2 digits but not more than 4 digits.

Expression	String	Matched?
[0-9]{2,4}	ab123csde	1 match (123 in ab123csde)
12 and 345673	3 matches (12, 3456, 73)
1 and 2	No match
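
The brace examples can be verified directly. This sketch reuses strings from the tables above; the variable names are illustrative.

```python
import re

# Matching is greedy: four a's in a row give one match of three, and the
# leftover single a is too short to match again.
runs_of_a = re.findall(r'a{2,3}', 'aabc daaaat')

# Six digits in a row split into a run of four and a run of two.
digit_runs = re.findall(r'[0-9]{2,4}', '12 and 345673')

# Single digits never satisfy the minimum of two.
too_short = re.findall(r'[0-9]{2,4}', '1 and 2')

print(runs_of_a)   # ['aa', 'aaa']
print(digit_runs)  # ['12', '3456', '73']
print(too_short)   # []
```
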
| - Alternation

The vertical bar | is used for alternation (the or operator).

Expression	String	Matched?
a|b	cde	No match
ade	1 match (a in ade)
acdbea	3 matches (a, b and a in acdbea)
Here, a|b matches any string that contains either a or b.

() - Group

Parentheses () are used to group sub-patterns. For example, (a|b|c)xz matches any string that contains a, b, or c followed by xz.

Expression	String	Matched?
(a|b|c)xz	ab xz	No match
abxz	1 match (bxz in abxz)
axz cabxz	2 matches (axz, and bxz in cabxz)
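
The alternation and grouping tables can be reproduced as follows. This is a sketch under the assumption that you want the full matched text; note that re.findall() returns only the group contents when the pattern contains a group, so re.finditer() is used to get whole matches.

```python
import re

either = re.findall(r'a|b', 'acdbea')

# finditer yields Match objects; group() with no argument is the whole match.
whole = [m.group() for m in re.finditer(r'(a|b|c)xz', 'axz cabxz')]

# With one group in the pattern, findall returns just that group's text.
groups_only = re.findall(r'(a|b|c)xz', 'axz cabxz')

# The space between b and xz prevents a match.
no_match = re.search(r'(a|b|c)xz', 'ab xz')

print(either)       # ['a', 'b', 'a']
print(whole)        # ['axz', 'bxz']
print(groups_only)  # ['a', 'b']
print(no_match)     # None
```
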
\ - Backslash

Backslash \ is used to escape various characters, including all metacharacters. For example,

\$a matches if a string contains $ followed by a. Here, $ is not interpreted by the RegEx engine in a special way.

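If you are unsure whether a punctuation character has a special meaning, you can put \ in front of it. This makes sure the character is not treated in a special way. Do not do this for letters or digits: a backslash followed by a letter or digit can form a special sequence such as \d.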
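
A short sketch of escaping in practice. The sample strings are assumptions made for illustration; re.escape() is the standard-library helper that inserts the needed backslashes for you.

```python
import re

text = 'price: $a'

# \$ matches a literal dollar sign.
escaped = re.search(r'\$a', text)

# Unescaped, $ means "end of string", so nothing can follow it.
unescaped = re.search(r'$a', text)

# Without escaping, + is a quantifier and the text does not match itself.
literal = '1+1=2'
naive = re.search(literal, literal)

# re.escape() builds a pattern that matches the string literally.
safe_pattern = re.escape(literal)
safe = re.fullmatch(safe_pattern, literal)

print(escaped.group())    # $a
print(unescaped)          # None
print(naive)              # None
print(safe is not None)   # True
```
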

Special Sequences

Special sequences make commonly used patterns easier to write. Here's a list of special sequences:

\A - Matches if the specified characters are at the start of a string.

Expression	String	Matched?
\Athe	the sun	Match
In the sun	No match


\b - Matches if the specified characters are at the beginning or end of a word.

Expression	String	Matched?
\bfoo	football	Match
a football	Match
afootball	No match
foo\b	the foo	Match
the afoo test	Match
the afootest	No match
\B - Opposite of \b. Matches if the specified characters are not at the beginning or end of a word.

Expression	String	Matched?
\Bfoo	football	No match
a football	No match
afootball	Match
foo\B	the foo	No match
the afoo test	No match
the afootest	Match
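
The word-boundary rows above can be checked like this. It is a minimal sketch with an illustrative sample sentence. The raw-string prefix r matters here: in an ordinary Python string, '\b' is the backspace character rather than a boundary.

```python
import re

text = 'foo food afoo foo.'

# \b on both sides keeps only standalone occurrences of foo.
whole_words = re.findall(r'\bfoo\b', text)

# \B requires that foo does NOT start at a word boundary.
inside_word = re.search(r'\Bfoo', 'afootball')
at_boundary = re.search(r'\bfoo', 'afootball')

# Without the r prefix, \b is a single backspace character (0x08).
not_raw = '\bfoo'

print(whole_words)          # ['foo', 'foo']
print(inside_word.group())  # foo
print(at_boundary)          # None
print(len(not_raw))         # 4
```
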
\d - Matches any decimal digit. Equivalent to [0-9]

Expression	String	Matched?
\d	12abc3	3 matches (1, 2 and 3)
Python	No match

\s - Matches where a string contains any whitespace character. Equivalent to [ \t\n\r\f\v].

Expression	String	Matched?
\s	Python RegEx	1 match
PythonRegEx	No match
\S - Matches where a string contains any non-whitespace character. Equivalent to [^ \t\n\r\f\v].

Expression	String	Matched?
\S	a b	2 matches (a and b)

\w - Matches any alphanumeric character (digits and letters). Equivalent to [a-zA-Z0-9_]. By the way, underscore _ is also considered an alphanumeric character.

Expression	String	Matched?
\w	12&": ;c	3 matches (1, 2 and c)
%"> !	No match
\W - Matches any non-alphanumeric character. Equivalent to [^a-zA-Z0-9_]

Expression	String	Matched?
\W	1a2%c	1 match (% in 1a2%c)
Python	No match
\Z - Matches if the specified characters are at the end of a string.

Expression	String	Matched?
Python\Z	I like Python	1 match
I like Python Programming	No match
Python is fun.	No match
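
The character-class sequences can be tried on one sample string. This is a sketch with an assumed sample text. One caveat worth knowing: with ordinary (str) patterns in Python 3, \d, \s and \w are Unicode-aware, so the bracket equivalents listed above hold exactly only when the re.ASCII flag is passed.

```python
import re

text = 'Room 12, floor 3'

digits = re.findall(r'\d', text)
spaces = re.findall(r'\s', text)
words = re.findall(r'\w+', text)
non_words = re.findall(r'\W', text)

# '\u0663' is ARABIC-INDIC DIGIT THREE: matched by \d unless re.ASCII is set.
unicode_digit = re.findall(r'\d', '\u0663')
ascii_only = re.findall(r'\d', '\u0663', flags=re.ASCII)

print(digits)      # ['1', '2', '3']
print(spaces)      # [' ', ' ', ' ']
print(words)       # ['Room', '12', 'floor', '3']
print(non_words)   # [' ', ',', ' ', ' ']
print(len(unicode_digit), len(ascii_only))  # 1 0
```
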


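re.findall()
The re.findall() method returns a list of strings containing all non-overlapping matches.

In [3]:

# Program to extract numbers from a string

import re

string = 'hello 12 hi 89. Howdy 34'
pattern = r'\d+'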

result = re.findall(pattern, string) 
print(result)

# Output: ['12', '89', '34']

['12', '89', '34']


re.split()
The re.split method splits the string where there is a match and returns a list of strings where the splits have occurred.

In [4]:

import re

string = 'Twelve:12 Eighty nine:89.'
pattern = r'\d+'

result = re.split(pattern, string) 
print(result)

# Output: ['Twelve:', ' Eighty nine:', '.']

['Twelve:', ' Eighty nine:', '.']
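
re.split() also accepts an optional maxsplit argument that caps the number of splits. The sketch below uses an illustrative sample string and passes maxsplit by keyword.

```python
import re

string = 'Twelve:12 Eighty nine:89 Nine:9.'

# Split at most once; the rest of the string is returned as the last item.
limited = re.split(r'\d+', string, maxsplit=1)

# When the pattern is not found, the list holds the original string.
unchanged = re.split(r'\d+', 'no digits here')

print(limited)    # ['Twelve:', ' Eighty nine:89 Nine:9.']
print(unchanged)  # ['no digits here']
```
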


re.sub()
The syntax of re.sub() is:

re.sub(pattern, replace, string)
The method returns a string in which every matched occurrence is replaced with the content of the replace variable.



In [5]:

# Program to remove all whitespace
import re

# the trailing backslash joins the two source lines into one string literal
string = 'abc 12\
de 23 \n f45 6'

# matches one or more whitespace characters
pattern = r'\s+'

# empty string
replace = ''

new_string = re.sub(pattern, replace, string) 
print(new_string)

# Output: abc12de23f456

abc12de23f456


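The optional count argument limits the number of replacements. If it is omitted, every match is replaced.

In [6]:

import re

# the trailing backslash joins the two source lines into one string literal
string = 'abc 12\
de 23 \n f45 6'

# matches one or more whitespace characters
pattern = r'\s+'
replace = ''

# replace only the first match
new_string = re.sub(pattern, replace, string, count=1) 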
print(new_string)

# Output:
# abc12de 23
# f45 6


abc12de 23 
 f45 6


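re.search()
The re.search() method scans through the string looking for the first location where the pattern matches. It returns a Match object, or None if no match is found.

In [7]:
import re

string = "Python is fun"

# check if 'Python' is at the beginning
match = re.search(r'\APython', string)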

if match:
  print("pattern found inside the string")
else:
  print("pattern not found")  

# Output: pattern found inside the string

pattern found inside the string


In [5]:

import re

string = '39801 356 2102 1111'

# Three digit number followed by space followed by two digit number
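pattern = r'(\d{3}) (\d{2})'

# match variable contains a Match object, or None if nothing matched.
match = re.search(pattern, string) 

if match:
  print(match.group())   # the whole match
  print(match.group(1))  # first parenthesized group
  print(match.group(2))  # second parenthesized group
else:
  print("pattern not found")


# Output:
# 801 35
# 801
# 35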

801 35
801
35


In [9]:

import re

string = '\n and \r are escape sequences.'

result = re.findall(r'[\n\r]', string) 
print(result)

# Output: ['\n', '\r']

['\n', '\r']


In [6]:

import re

string = '39801 356 2102 1111'

# Three digit number followed by space followed by two digit number
pattern = r'(\d{3}) (\d{2})'

# findall returns a list of tuples, one tuple of groups per match.
match = re.findall(pattern, string) 

print(match)

[('801', '35'), ('102', '11')]


In [11]:

import re

string = '39801 356 2102 1111'

# Three digit number followed by space followed by two digit number
pattern = r'\d{3} \d{2}'

# match variable contains a Match object.
match = re.search(pattern, string) 
print(match.group())


801 35


In [7]:
help(re.search)

Help on function search in module re:

search(pattern, string, flags=0)
    Scan through string looking for a match to the pattern, returning
    a Match object, or None if no match was found.



In [9]:
match

<re.Match object; span=(2, 8), match='801 35'>
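
The span shown in the representation above can also be read from the Match object itself. The following sketch lists a few commonly used methods and attributes; it recreates the match with the grouped pattern so that it runs on its own.

```python
import re

string = '39801 356 2102 1111'
match = re.search(r'(\d{3}) (\d{2})', string)

if match:
    whole_span = match.span()      # (start, end) of the whole match
    first_span = match.span(1)     # (start, end) of group 1
    all_groups = match.groups()    # every group as a tuple
    matched_text = string[match.start():match.end()]

    print(whole_span)    # (2, 8)
    print(first_span)    # (2, 5)
    print(all_groups)    # ('801', '35')
    print(matched_text)  # 801 35
    print(match.string)  # the string that was searched
```

The end index is exclusive, so slicing the original string with start() and end() gives the same text as match.group().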
