# Python RegEx (Regular Expression)

A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern.

RegEx can be used to check if a string contains the specified search pattern.

### RegEx Module

Python has a built-in package called re, which can be used to work with Regular Expressions.

Import the re module:

In [2]:
import re

In [None]:
string = "The quick brown fox jumps over the lazy dog"
string.find('u')

5

In [None]:
# Commonly used functions in the re module (each one needs a pattern and a string):
# re.search()
# re.match()
# re.findall()
# re.finditer()
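To make the difference between these four functions concrete, here is a minimal sketch that runs each of them on one sentence. The sample text and the variable names are illustrative assumptions, not part of the original notebook.

```python
import re

text = "The quick brown fox jumps over the lazy fox"

# re.search scans the whole string and returns the first match (or None)
first = re.search("fox", text)

# re.match only tries to match at the very start of the string
at_start = re.match("fox", text)

# re.findall returns every non-overlapping match as a list of strings
all_matches = re.findall("fox", text)

# re.finditer yields Match objects, which also carry the match positions
spans = [m.span() for m in re.finditer("fox", text)]

print(first.span())   # (16, 19)
print(at_start)       # None
print(all_matches)    # ['fox', 'fox']
print(spans)          # [(16, 19), (40, 43)]
```

`re.search` and `re.match` return `None` when nothing matches, so it is usually worth checking the result before calling methods on it.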

In [None]:
string = "The quick brown fox jumps quick over the lazy dog"
search = re.search('lion',string)

In [None]:
string = "The quick brown fox jumps over the lazy dog fox along with another fox with some other fox."
find = re.finditer('fox',string)

In [None]:
for i in find:
    print(i)

<re.Match object; span=(16, 19), match='fox'>
<re.Match object; span=(44, 47), match='fox'>
<re.Match object; span=(67, 70), match='fox'>
<re.Match object; span=(87, 90), match='fox'>


In [None]:
string = "The quick brown fox jumps over the lazy dog fox along with another fox with some other fox."
result = re.finditer('jumps',string)
for i in result:
    print(i)

<re.Match object; span=(20, 25), match='jumps'>


![image.png](attachment:image.png)

### [] – Square Brackets

Square Brackets ([]) represent a character class consisting of a set of characters that we wish to match. For example, the character class [abc] will match any single a, b, or c.

We can also specify a range of characters using a hyphen (-) inside the square brackets. For example,

* [0-3] is the same as [0123]
* [a-c] is the same as [abc]

We can also invert the character class using the caret(^) symbol. For example,

* [^0-3] means any character except 0, 1, 2, or 3
* [^a-c] means any character except a, b, or c
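As a small illustration of ranges and negation, the sketch below applies the classes from the lists above to a made-up sample string (the string and the variable names are assumptions for illustration).

```python
import re

sample = "a1b2c3d4 x0"

# [0-3] is the same set as [0123]
digits_0_to_3 = re.findall("[0-3]", sample)

# [^0-3] matches every other character, including letters and spaces
not_0_to_3 = re.findall("[^0-3]", sample)

# [a-c] is the same set as [abc]
letters_a_to_c = re.findall("[a-c]", sample)

print(digits_0_to_3)   # ['1', '2', '3', '0']
print(not_0_to_3)      # ['a', 'b', 'c', 'd', '4', ' ', 'x']
print(letters_a_to_c)  # ['a', 'b', 'c']
```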

In [None]:
string = "The quick brown fox jumps 20 over 10 the ;,.[lazy dog"
output = re.findall('[0-9]',string)
print(output)

['2', '0', '1', '0']


In [None]:
x = re.compile('[0-9]')
x.findall(string)

['2', '0', '1', '0']

In [4]:
string = "The quick brown fox jumps over the lazy dog"
pattern = "[a-c]"
result = re.findall(pattern, string)

print(result)

['c', 'b', 'a']


### ^ – Caret

The caret (^) symbol matches the beginning of the string, i.e. it checks whether the string starts with the given character(s). For example –

* ^g will check if the string starts with g such as globe, girl, g, etc.
* ^ge will check if the string starts with ge such as geese, geof, etc.
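The two bullet points can be checked with a short sketch. The word list below is a made-up example; `re.search` with a leading ^ only succeeds when the word starts with the given characters.

```python
import re

words = ["globe", "girl", "g", "geese", "egg"]

# ^g: the word must start with g
starts_with_g = [w for w in words if re.search("^g", w)]

# ^ge: the word must start with ge
starts_with_ge = [w for w in words if re.search("^ge", w)]

print(starts_with_g)   # ['globe', 'girl', 'g', 'geese']
print(starts_with_ge)  # ['geese']
```

"egg" contains a g but does not start with one, so neither pattern matches it.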

In [None]:
string = 'The quick brown fox'
print(re.search('^The', string))

<re.Match object; span=(0, 3), match='The'>


In [None]:
string = 'fox The quick 1253721828 brown fox'
print(re.findall('^[a-zA-Z ]', string))

['f']


In [None]:
string = 'The lazy dog'
print(re.match('^The', string))

<re.Match object; span=(0, 3), match='The'>


In [None]:
string = 'A quick The 1a2b67 brown 1z2t35 fox 7893 his and 452'
print(re.findall(r'\d[a-z]\d[a-z]\d\d', string))

['1a2b67', '1z2t35']


In [1]:
documents = ['andgfdjsakdba_sjh13418n3518n@#lahdn',
             'anhsdi__nwqejdwn67128nsaajdasn',
             'asndsanfajsdfkas----$$$$$*****dnbanf',
             'asndg__234sadka8654sbdas,nshdas???A$$$&&&&***dbajdfs',
             'asbgd_@_san 21352189andksan@#']

In [29]:
string = 'andgfdjsakdba_sjh13418n3518n@#lahdnanhsdi__nwqejdwn67128nsaajdasnasndsanfajsdfkas----$$$$$*****dnbanf'
pat = re.compile(r'[a-zA-Z_]\w*')
pat.findall(string)

['andgfdjsakdba_sjh13418n3518n',
 'lahdnanhsdi__nwqejdwn67128nsaajdasnasndsanfajsdfkas',
 'dnbanf']

In [5]:
emp_list = []
x = re.compile('[0-9:"@#]')

# Collect the documents that contain none of: digits, colon, double quote, @, #
for i in documents:
    result = x.search(i)
    if result is None:
        emp_list.append(i)

print(emp_list)

['asndsanfajsdfkas----$$$$$*****dnbanf']


In [6]:
string = "The quick brown fox jumps over the lazy dog"
pattern = "[^a-c]"
result = re.findall(pattern, string)

print(result)

['T', 'h', 'e', ' ', 'q', 'u', 'i', 'k', ' ', 'r', 'o', 'w', 'n', ' ', 'f', 'o', 'x', ' ', 'j', 'u', 'm', 'p', 's', ' ', 'o', 'v', 'e', 'r', ' ', 't', 'h', 'e', ' ', 'l', 'z', 'y', ' ', 'd', 'o', 'g']


### * – Star

Star (*) symbol matches zero or more occurrences of the regex preceding the * symbol. For example –  

* ab*c will be matched for the string ac, abc, abbbc, dabc, etc. but will not be matched for abdc because b is not followed by c.
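A quick way to see this is to run ab*c with `re.search` and `re.findall`. The sketch below reuses the strings from the bullet above plus one longer test string; the variable names are illustrative.

```python
import re

candidates = ["ac", "abc", "abbbc", "dabc", "abdc"]

# ab*c: an a, then zero or more b characters, then a c
matched = [s for s in candidates if re.search("ab*c", s)]
print(matched)  # ['ac', 'abc', 'abbbc', 'dabc']

# findall lists every non-overlapping match in a longer string
string = "abbcacabdcabbbbcadbcabc"
all_matches = re.findall("ab*c", string)
print(all_matches)  # ['abbc', 'ac', 'abbbbc', 'abc']
```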

In [None]:
string = "abbcacabdcabbbbcadbcabc"
pattern = "ab{0,1}c"  # {0,1} allows at most one b; "ab*c" would allow any number of b's

match = re.findall(pattern, string)

print(match)

['ac', 'abc']


In [None]:
string = "abdc"
pattern = "ab*c"

match = re.search(pattern, string)

print(match)

None


### + – Plus

Plus (+) symbol matches one or more occurrences of the regex preceding the + symbol. For example –  

* ab+c will be matched for the strings abc, abbc, and dabc, but will not be matched for ac or abdc, because there is no b in ac, and in abdc the b is not followed by c.
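The same strings can be run through `re.search` to confirm which ones contain a match. This is only a sketch; the list and variable names are illustrative.

```python
import re

candidates = ["abc", "abbc", "dabc", "ac", "abdc"]

# ab+c: an a, then at least one b, then a c
matched = [s for s in candidates if re.search("ab+c", s)]
not_matched = [s for s in candidates if not re.search("ab+c", s)]

print(matched)      # ['abc', 'abbc', 'dabc']
print(not_matched)  # ['ac', 'abdc']
```

Compared with ab*c, the only difference is that "ac" no longer matches, because + requires at least one b.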

In [3]:
string = "abbbbbbbc"
pattern = "ab+c"

match = re.findall(pattern, string)

print(match)

['abbbbbbbc']


In [4]:
string = "abdc"
pattern = "ab+c"

match = re.search(pattern, string)

print(match)

None


### ? – Question Mark

The question mark (?) is a quantifier in regular expressions that indicates that the preceding element should be matched zero or one time. It allows you to specify that the element is optional, meaning it may occur once or not at all. For example,

* ab?c will be matched for the strings ac, acb, and dabc, but will not be matched for abbc because there are two b's. Similarly, it will not be matched for abdc because b is not followed by c.
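A short sketch of the optional-element behaviour, using the strings from the bullet above. It also compares ? with the equivalent quantifier {0,1}; the test string and variable names are illustrative assumptions.

```python
import re

candidates = ["ac", "acb", "dabc", "abbc", "abdc"]

# ab?c: an a, then at most one b, then a c
matched = [s for s in candidates if re.search("ab?c", s)]
print(matched)  # ['ac', 'acb', 'dabc']

# b? behaves like b{0,1}: both allow zero or one b
string = "abbcacabdcabbbbcadbcabc"
with_question_mark = re.findall("ab?c", string)
with_braces = re.findall("ab{0,1}c", string)
print(with_question_mark)                 # ['ac', 'abc']
print(with_question_mark == with_braces)  # True
```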

In [6]:
string = "I can write regularcn expressioncaan."
pattern = "ca?n"

match = re.findall(pattern, string)

print(match)

['can', 'cn']


In [7]:
string = "abdc"
pattern = "ab?c"

match = re.search(pattern, string)

print(match)

None


### {m,n} – Braces

Braces match between m and n repetitions (both inclusive) of the preceding regex. Write the quantifier without a space after the comma. For example –

* a{2,4} will be matched for the strings aaab, baaaac, and gaad, but will not be matched for strings like abc or bc because they contain only one a or no a at all.
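The sketch below checks the strings from the bullet above. It also shows, as far as Python's `re` module is concerned, what happens when a space is left inside the braces; the variable names are illustrative.

```python
import re

candidates = ["aaab", "baaaac", "gaad", "abc", "bc"]

# a{2,4}: between two and four consecutive a characters
matched = [s for s in candidates if re.search("a{2,4}", s)]
print(matched)  # ['aaab', 'baaaac', 'gaad']

# With a space after the comma, re does not read the braces as a quantifier;
# the pattern is then taken as the literal text "a{2, 4}"
spaced = re.search("a{2, 4}", "aaab")
print(spaced)  # None
```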

In [8]:
string = "aaabcaabcaghaaaabhaaaaaolp"
pattern = "a{2,4}"

match = re.findall(pattern, string)

print(match)

['aaa', 'aa', 'aaaa', 'aaaa']


In [9]:
string = "abc"
pattern = "a{2,4}"

match = re.search(pattern, string)

print(match)

None


In [10]:
string = 'A quick The 2356 brown 9178 fox 789 his and 452'
print(re.findall('[0-9]{3,4}', string))

['2356', '9178', '789', '452']


### $ – Dollar

The dollar ($) symbol matches the end of the string, i.e. it checks whether the string ends with the given character(s). For example –

* s$ will check for the string that ends with s such as geeks, ends, s, etc.
* ks$ will check for the string that ends with ks such as geeks, geeksforgeeks, ks, etc.
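Both bullet points can be verified with a short sketch. The word list is a made-up example that adds one word ("desk") that should match neither pattern.

```python
import re

words = ["geeks", "ends", "s", "geeksforgeeks", "ks", "desk"]

# s$: the word must end with s
ends_with_s = [w for w in words if re.search("s$", w)]

# ks$: the word must end with ks
ends_with_ks = [w for w in words if re.search("ks$", w)]

print(ends_with_s)   # ['geeks', 'ends', 's', 'geeksforgeeks', 'ks']
print(ends_with_ks)  # ['geeks', 'geeksforgeeks', 'ks']
```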

In [11]:
string = "Hello World."
pattern = ".$"

match = re.search(pattern, string)

print(match)

<re.Match object; span=(11, 12), match='.'>


In [12]:
string = "Hello World!"
pattern = "!$"

match = re.search(pattern, string)

print(match)

<re.Match object; span=(11, 12), match='!'>


### . – Dot

The dot (.) symbol matches any single character except the newline character (\n). For example –

* a.b will check for a string that contains any single character in place of the dot, such as acb, acbd, abbb, etc.
* .. will check if the string contains at least 2 characters
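A minimal sketch of both bullets, including the newline exception. The candidate strings are illustrative; "ab" and "a\nb" are added as cases that should not match.

```python
import re

candidates = ["acb", "acbd", "abbb", "ab", "a\nb"]

# a.b: an a, then exactly one character that is not a newline, then a b
matched = [s for s in candidates if re.search("a.b", s)]
print(matched)  # ['acb', 'acbd', 'abbb']

# ..: the string needs at least two characters in a row
at_least_two = [s for s in ["x", "xy", "xyz"] if re.search("..", s)]
print(at_least_two)  # ['xy', 'xyz']
```

"ab" fails because the dot must consume one character, and "a\nb" fails because the dot does not match a newline by default.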

In [13]:
string = "The quick newe brown fox jumps over f_1x fx the lazy dog."
pattern = "f..x"

match = re.findall(pattern, string)

print(match)

['f_1x']


In [14]:
string = "The quick brown fox jumps over the lazy xof dog."
pattern = "fox|xof"

match = re.findall(pattern, string)

print(match)

['fox', 'xof']


### | – Or

The or symbol (|) works as the or operator: it checks whether the pattern before it or the pattern after it is present in the string. For example –

* a|b will match any string that contains a or b such as acd, bcd, abcd, etc.
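The bullet above can be sketched as follows. The list adds one made-up string ("xyz") that contains neither alternative.

```python
import re

candidates = ["acd", "bcd", "abcd", "xyz"]

# a|b: succeeds if either an a or a b appears anywhere in the string
matched = [s for s in candidates if re.search("a|b", s)]
print(matched)  # ['acd', 'bcd', 'abcd']

# findall reports each alternative that matched, in order of appearance
found = re.findall("a|b", "abcd")
print(found)  # ['a', 'b']
```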

In [17]:
string = "aicujbud"
pattern = "c|b"

match = re.findall(pattern, string)

print(match)

['c', 'b']


### \ – Backslash

The backslash (\) makes sure that the character following it is not treated in a special way. This can be considered a way of escaping metacharacters.

For example, if you search for the dot (.) in a string, the dot is treated as a special character because it is one of the metacharacters (as shown in the above table). In this case, put a backslash (\) just before the dot (.) so that it loses its special meaning. See the example below for a better understanding.

In [28]:
import re

s = 'data?science'

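# \? matches a literal question mark; an unescaped ? would make the preceding a optional
match = re.search(r'a\?', s)
print(match)

# the same applies to the dot: r'\.' matches a literal period (s has none, so this would print None)
# match = re.search(r'\.', s)
# print(match)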

<re.Match object; span=(3, 5), match='a?'>


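The search <code>re.search(r'a\?', s)</code> matches a literal a followed by a literal question mark. Without the backslash, <code>a?</code> would mean an optional a.
The dot behaves the same way: <code>r'.'</code> matches any character, not just the period, while <code>r'\.'</code> specifically looks for and matches the period character.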
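The contrast between an unescaped and an escaped dot can be sketched on a made-up string. `re.escape` is a standard-library helper that adds the backslashes for you; the sample text and variable names are illustrative assumptions.

```python
import re

text = "v1.2 or v1x2"

# Unescaped dot: matches any character between 1 and 2
any_char = re.findall(r"1.2", text)

# Escaped dot: matches only a literal period
literal_dot = re.findall(r"1\.2", text)

# re.escape builds the escaped pattern from plain text
escaped_pattern = re.escape("1.2")
via_escape = re.findall(escaped_pattern, text)

print(any_char)     # ['1.2', '1x2']
print(literal_dot)  # ['1.2']
print(via_escape)   # ['1.2']
```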

In [21]:
string = "The quick brown fox jumps over The lazy fox dog 123780 and some extra fox numbers 208410."
pattern = "fox"
result = re.finditer(pattern, string)

for i in result:
    print(i)

<re.Match object; span=(16, 19), match='fox'>
<re.Match object; span=(40, 43), match='fox'>
<re.Match object; span=(70, 73), match='fox'>


![image.png](attachment:image.png)

In [23]:
string = """Hello my Number is 12345u6789 and
            my friend's number is 987654321"""
regex = r'\d+'

match = re.findall(regex, string)
print(match)

['12345', '6789', '987654321']


In [25]:
string = """Hello my Number is 123456789 and
            my friend's number is 987654321"""
regex = r'\D'

match = re.findall(regex, string)
print(match)

['H', 'e', 'l', 'l', 'o', ' ', 'm', 'y', ' ', 'N', 'u', 'm', 'b', 'e', 'r', ' ', 'i', 's', ' ', ' ', 'a', 'n', 'd', '\n', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', 'm', 'y', ' ', 'f', 'r', 'i', 'e', 'n', 'd', "'", 's', ' ', 'n', 'u', 'm', 'b', 'e', 'r', ' ', 'i', 's', ' ']


In [30]:
string = """He said * in 1234 some_lang."""
regex = r'\W'

match = re.findall(regex, string)
print(match)

[' ', ' ', '*', ' ', ' ', ' ', '.']


In [31]:
string = """I went to him at 11 A.M., he said *** in some_language."""
regex = r'\w+'

match = re.findall(regex, string)
print(match)

['I', 'went', 'to', 'him', 'at', '11', 'A', 'M', 'he', 'said', 'in', 'some_language']


In [33]:
string = """he said *** in some_language."""
regex = r'\W'

match = re.findall(regex, string)
print(match)

[' ', ' ', '*', '*', '*', ' ', ' ', '.']


In [34]:
string = """he said *** in some_language.
            I didn't understand."""
regex = r'\S'

match = re.findall(regex, string)
print(match)

['h', 'e', 's', 'a', 'i', 'd', '*', '*', '*', 'i', 'n', 's', 'o', 'm', 'e', '_', 'l', 'a', 'n', 'g', 'u', 'a', 'g', 'e', '.', 'I', 'd', 'i', 'd', 'n', "'", 't', 'u', 'n', 'd', 'e', 'r', 's', 't', 'a', 'n', 'd', '.']


In [None]:
string = """he said *** in some_language."""
regex = r'\S'

match = re.findall(regex, string)
print(match)

['h', 'e', 's', 'a', 'i', 'd', '*', '*', '*', 'i', 'n', 's', 'o', 'm', 'e', '_', 'l', 'a', 'n', 'g', 'u', 'a', 'g', 'e', '.']


![image.png](attachment:image.png)

#### \A Returns a match if the specified characters are at the beginning of the string

In [None]:
import re

txt = "The rain in Spain"

#Check if the string starts with "The":

x = re.findall(r"\AThe", txt)

print(x)

['The']


In [None]:
import re

txt = "The rain in Spain"

#Check if the string starts with "The":

b = re.compile(r"\AThe")
b.findall(txt)

['The']

#### \b Returns a match where the specified characters are at the beginning or at the end of a word

In [None]:
import re

txt = "The rain in Spain india"

#Check if "in" is present at the beginning of a WORD:

x = re.finditer(r"\bin", txt)

for i in x:
    print(i)

<re.Match object; span=(9, 11), match='in'>
<re.Match object; span=(18, 20), match='in'>


In [None]:
import re

txt = "The rain in Spain"

#Check if "in" is present at the end of a WORD:

x = re.findall(r"in\b", txt)

print(x)

['in', 'in', 'in']


In [None]:
import re

txt = "The rain in Spain"

#Check if "ain" is present at the end of a WORD:

x = re.finditer(r"ain\b", txt)

for i in x:
    print(i)

<re.Match object; span=(5, 8), match='ain'>
<re.Match object; span=(14, 17), match='ain'>


#### \B Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word

In [None]:
import re

txt = "The rain in Spain"

#Check if "ain" is present, but NOT at the beginning of a word:

x = re.findall(r"\Bain", txt)

print(x)

['ain', 'ain']


In [None]:
import re

txt = "The rain in Spain"

#Check if "ain" is present, but NOT at the end of a word:

x = re.findall(r"ain\B", txt)

print(x)

[]


### re.split()

In [None]:
text = "This test, **\\ is short and sweet ,,,... --- words 23567 ##asdfg"

p = re.compile(r'\W+')
p.split(text)

['This', 'test', 'is', 'short', 'and', 'sweet', 'words', '23567', 'asdfg']

### re.sub()

In [None]:
text = "blue red shoes and white blue red white socks"

p = re.compile(r'(blue\s|white\s|red\s)+')

p.sub('color ', text)

'color shoes and color socks'

In [None]:
string = """Hello my Number is 123456789 and
            my friend's number is 987654321"""

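# Note: without the r prefix, '\b' is a backspace character rather than a word boundary,
# so this pattern finds nothing and the digits are left unchanged
p = re.compile('\d{5}\b')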
new_text = p.sub('#####',string)
p1 = re.compile(r'\s+')
p1.subn(' ', new_text)

("Hello my Number is 123456789 and my friend's number is 987654321", 10)

In [None]:
print(r"Hello my Number is 123456789 and \nmy friend's number is 987654321")

Hello my Number is 123456789 and \nmy friend's number is 987654321


In [None]:
string = """Hello my Number is 123456789 and
            my friend's number is 987654321"""

p = re.compile(r'\d{5}\b')
p.subn('#####',string)

("Hello my Number is 1234##### and \n            my friend's number is 9876#####",
 2)
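The two substitution cells above differ only in the r prefix on the pattern. The sketch below, which uses a made-up sample string, shows why that prefix matters: in a normal Python string "\b" is a single backspace character, while in a raw string it stays as a backslash followed by b, which `re` reads as a word boundary.

```python
import re

plain = "\b"   # one character: ASCII backspace (0x08)
raw = r"\b"    # two characters: a backslash and the letter b

print(len(plain), len(raw))  # 1 2

text = "number 123456789 here"

# Non-raw pattern: five digits followed by a backspace character, so nothing matches
without_raw = re.findall("[0-9]{5}\b", text)

# Raw pattern: five digits that end at a word boundary
with_raw = re.findall(r"[0-9]{5}\b", text)

print(without_raw)  # []
print(with_raw)     # ['56789']
```

Writing every pattern as a raw string avoids this kind of silent mismatch.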

### re.subn()

In [None]:
text = "blue red shoes and white blue red white socks"

p = re.compile(r'(blue\s|white\s|red\s)+')

p.subn('color ', text)

('color shoes and color socks', 2)