Simply put, a regular expression (regex) is a sequence of characters that defines a search pattern, mainly used to find and replace text in a string or file.

Python has a built-in module called re for working with regular expressions. Some of its commonly used functions are listed below:

1. re.match()

2. re.search()

3. re.findall()

4. re.split()

5. re.sub()

#### 1. re.match()

It checks for the pattern only at the beginning of the string. It returns a match object if the pattern is found there, and None otherwise.

In [1]:
import re

string = "hockey is the national sport of india"
pattern = "hockey"
mo = re.match(pattern, string)

In [2]:
mo

<re.Match object; span=(0, 6), match='hockey'>

Since re.match() returns a match object, we use the object's group() method to get the matched text.

In [3]:
mo.group()

'hockey'
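
Because re.match() only looks at the start of the string, it returns None for a pattern that occurs later on. The sketch below reuses the sentence from above; the names `at_start` and `not_at_start` are only illustrative.

```python
import re

string = "hockey is the national sport of india"

# "hockey" sits at index 0, so re.match() succeeds.
at_start = re.match("hockey", string)

# "national" occurs later in the string, so re.match() returns None.
not_at_start = re.match("national", string)

# Check for None before calling group(), otherwise an AttributeError is raised.
for word, mo in [("hockey", at_start), ("national", not_at_start)]:
    if mo is not None:
        print(word, "->", mo.group())
    else:
        print(word, "-> no match at the start of the string")
```

Checking the result against None before calling group() is a common way to avoid an AttributeError on a failed match.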

#### 2. re.search()
This function scans the whole string and returns a match object for the first location where the pattern is found, or None if there is no match.

In [4]:
pattern2 = "national"
mo2 = re.search(pattern2, string)

In [5]:
mo2

<re.Match object; span=(14, 22), match='national'>
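
The match object returned by re.search() also carries the position of the match. As a small sketch (the longer sentence and the variable names are chosen here for illustration), note that only the first occurrence is reported even when the word appears twice:

```python
import re

sentence = "hockey is the national sport of india and national bird is peacock"

first = re.search("national", sentence)

# span() gives (start, end); the end index is exclusive, like a slice.
print(first.span())
print(first.start(), first.end())

# Slicing the original string with the span recovers the matched text.
matched_text = sentence[first.start():first.end()]
print(matched_text)
```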

#### 3. re.findall()
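It returns all non-overlapping occurrences of the pattern in the string as a list of strings. It is convenient when you need every match, but unlike re.match() and re.search() it does not return match objects, so use re.finditer() when you also need the position of each match.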

In [6]:
string2 = "hockey is the national sport of india and national bird is peacock"
pattern3 = "national"
mo3 = re.findall(pattern3, string2)

In [7]:
mo3

['national', 'national']

In [8]:
mo4 = re.finditer(pattern3, string2)
for m in mo4:
    print(m)

<re.Match object; span=(14, 22), match='national'>
<re.Match object; span=(42, 50), match='national'>
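
To make the difference between the two functions concrete, here is a minimal sketch that collects the same matches both ways. The variable names and the search word "cricket" are illustrative choices.

```python
import re

sentence = "hockey is the national sport of india and national bird is peacock"

# findall() gives plain strings, so position information is lost.
words = re.findall("national", sentence)

# finditer() yields match objects, so the spans are still available.
spans = [m.span() for m in re.finditer("national", sentence)]

# A pattern that never occurs gives an empty list rather than None.
missing = re.findall("cricket", sentence)

print(words)
print(spans)
print(missing)
```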


#### 4. re.split()
This function splits the text at every match of the given regular expression pattern.

In [10]:
string3 = "this:is,a,simple,text,string"
pattern4 = r'[:,\s]'
re.split(pattern4, string3)

['this', 'is', 'a', 'simple', 'text', 'string']
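
When delimiters appear next to each other, a single-character pattern produces empty strings in the result. The sketch below, using a made-up input string, shows one way to handle that with the + quantifier, plus the maxsplit argument of re.split().

```python
import re

# Hypothetical input with adjacent delimiters (": " and a double space).
messy = "this: is, a,simple  text"

# A single-character class yields empty strings between adjacent delimiters.
single = re.split(r"[:,\s]", messy)

# Adding + treats a run of delimiters as one separator.
collapsed = re.split(r"[:,\s]+", messy)

# maxsplit limits the number of splits; the remainder stays in the last item.
limited = re.split(r"[:,\s]+", messy, maxsplit=2)

print(single)
print(collapsed)
print(limited)
```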

#### 5. re.sub()
This function searches for a pattern and replaces every match with the given replacement string.

In [11]:
string5 = "cricket is a popular sport of India"
pattern5 = "India"
replacement = "the world"
re.sub(pattern5, replacement, string5)

'cricket is a popular sport of the world'
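
re.sub() also accepts the optional count and flags arguments. A short sketch with an invented input line shows how they change the result:

```python
import re

# Hypothetical input with the same word in three different cases.
line = "India, india and INDIA"

# By default the search is case-sensitive and replaces every match.
default = re.sub("india", "X", line)

# flags=re.IGNORECASE makes the pattern match regardless of case.
all_cases = re.sub("india", "X", line, flags=re.IGNORECASE)

# count=1 stops after the first replacement.
first_only = re.sub("india", "X", line, count=1, flags=re.IGNORECASE)

print(default)
print(all_cases)
print(first_only)
```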

## Special Sequences

1. **\A** returns a match if the specified pattern is at the beginning of the string.

In [12]:
str = r'Corona virus spreading exponentially in India'

# use a raw string so that Python does not treat \A as a string escape
x = re.findall(r"\ACorona", str)

print(x)

['Corona']


This is useful when you have multiple strings of text and want to pick out only those that begin with a given word, here 'Corona'.

If you search for a word that is not at the beginning of the string, an empty list is returned, as shown below.

In [13]:
x = re.findall(r"\Avirus", str)

print(x)

[]
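
A related anchor is ^, which behaves like \A by default but also matches at the start of every line when re.MULTILINE is set. The following sketch, with a two-line string made up for the purpose, contrasts the two:

```python
import re

# Hypothetical two-line text; both lines begin with the same word.
report = "Corona virus spreading\nCorona cases rising"

# \A matches only at the very start of the string, even with re.MULTILINE.
only_string_start = re.findall(r"\ACorona", report, flags=re.MULTILINE)

# ^ matches at the start of each line when re.MULTILINE is set.
every_line_start = re.findall(r"^Corona", report, flags=re.MULTILINE)

print(only_string_start)
print(every_line_start)
```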


2. **\b** matches at a word boundary, so the specified pattern must sit at the beginning or at the end of a word.

In [14]:
#Check if there is any word that ends with "lly"
x = re.findall(r"lly\b", str)
print(x)

['lly']


It returns the last three characters of the word "exponentially".
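
Placing \b on both sides of a pattern restricts it to whole words, and combining it with \w+ returns the complete word instead of just its ending. A minimal sketch on the same sentence (variable names are illustrative):

```python
import re

sentence = "Corona virus spreading exponentially in India"

# Without boundaries, "in" is also found inside "spreading".
anywhere = re.findall(r"in", sentence)

# With \b on both sides, only the standalone word "in" matches.
whole_word = re.findall(r"\bin\b", sentence)

# \w+ in front of "lly" captures the whole word that ends with "lly".
ends_with_lly = re.findall(r"\b\w+lly\b", sentence)

print(anywhere)
print(whole_word)
print(ends_with_lly)
```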

3. **\B** returns a match where the specified pattern is present but NOT at a word boundary, that is, not at the beginning (or at the end) of a word. In the example below, "en" is found inside "exponentially".

In [15]:
x = re.findall(r"\Ben", str)

print(x)

['en']


4. **\d** matches a single digit (0-9), so it returns a match wherever the string contains a digit.

In [16]:
str = "Approx. 800 cases repoted everyday."

#Check if the string contains any digits (numbers from 0-9):
x = re.findall(r"\d", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['8', '0', '0']
Yes, there is at least one match!


In [17]:
# Check if the string contains any digits (numbers from 0-9):
# adding '+' after '\d' keeps extending the match until a non-digit character is encountered
x = re.findall(r"\d+", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['800']
Yes, there is at least one match!


We can infer that **\d+** matches one or more consecutive occurrences of **\d**, extending the match until a non-matching character is found, whereas **\d** alone matches one character at a time.
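
A run of digits ends at any non-digit character, not only at a space. The sketch below uses an invented sample string where digit runs are followed by letters and punctuation:

```python
import re

# Hypothetical sample: digit runs end at a letter, a dot, a percent sign and a comma.
sample = "800cases, 12.5% and 3,000"

# \d alone returns one digit per match.
single_digits = re.findall(r"\d", sample)

# \d+ returns each maximal run of consecutive digits.
digit_runs = re.findall(r"\d+", sample)

# The matches are strings; convert them when numbers are needed.
as_ints = [int(run) for run in digit_runs]

print(single_digits)
print(digit_runs)
print(as_ints)
```

Note that "12.5" and "3,000" are split at the punctuation, so a pattern for decimal or grouped numbers would need to allow those characters explicitly.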

5. **\D** matches any character that is not a digit.

In [18]:
#Return every character that is not a digit (0-9):
x = re.findall(r"\D", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['A', 'p', 'p', 'r', 'o', 'x', '.', ' ', ' ', 'c', 'a', 's', 'e', 's', ' ', 'r', 'e', 'p', 'o', 't', 'e', 'd', ' ', 'e', 'v', 'e', 'r', 'y', 'd', 'a', 'y', '.']
Yes, there is at least one match!


In [19]:
#Return every run of consecutive non-digit characters:

x = re.findall(r"\D+", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['Approx. ', ' cases repoted everyday.']
Yes, there is at least one match!


6. **\w** matches word characters only (letters a-z and A-Z, digits 0-9, and the underscore _ character)

In [20]:
#returns a match at every word character (letters a-z and A-Z, digits 0-9, and the underscore _ character)

x = re.findall(r"\w", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['A', 'p', 'p', 'r', 'o', 'x', '8', '0', '0', 'c', 'a', 's', 'e', 's', 'r', 'e', 'p', 'o', 't', 'e', 'd', 'e', 'v', 'e', 'r', 'y', 'd', 'a', 'y']
Yes, there is at least one match!


In [21]:
#returns a match at every word, i.e. every run of word characters (letters, digits 0-9, and the underscore _ character)

x = re.findall(r"\w+", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['Approx', '800', 'cases', 'repoted', 'everyday']
Yes, there is at least one match!
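
In Python 3, \w applied to a text string also matches letters outside a-z and A-Z unless the re.ASCII flag is given. A small sketch with a made-up sample (written with escapes so the accented letters are unambiguous):

```python
import re

# Hypothetical sample containing non-ASCII letters: "café naïve_1".
sample = "caf\u00e9 na\u00efve_1"

# Default (Unicode) matching treats accented letters as word characters.
unicode_words = re.findall(r"\w+", sample)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so accented letters split the words.
ascii_words = re.findall(r"\w+", sample, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```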


7. **\W** matches every non-word character, that is, anything other than a letter, digit, or underscore.

In [22]:
#returns a match at every NON word character (anything that is not a letter, digit, or underscore, such as "!", "?", white-space etc.):

x = re.findall(r"\W", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['.', ' ', ' ', ' ', ' ', '.']
Yes, there is at least one match!

1. **[ ]** matches any single character listed inside the brackets; a range such as [0-4] stands for every digit from 0 to 4.


In [23]:
str = "Mars' average distance from the Sun is roughly 230 million km and its orbital period is 687 (Earth) days."

# extract the numbers of two or more digits whose first digit is between 0 and 4
x = re.findall(r"\b[0-4]\d+", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['230']
Yes, there is at least one match!


2. **[^]** matches any single character that is NOT listed after the ^ inside the brackets.

In [24]:
str = "Analytics Vidhya is the largest data sciece community of India"

#Return every character other than y, d, or h

x = re.findall("[^ydh]", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['A', 'n', 'a', 'l', 't', 'i', 'c', 's', ' ', 'V', 'i', 'a', ' ', 'i', 's', ' ', 't', 'e', ' ', 'l', 'a', 'r', 'g', 'e', 's', 't', ' ', 'a', 't', 'a', ' ', 's', 'c', 'i', 'e', 'c', 'e', ' ', 'c', 'o', 'm', 'm', 'u', 'n', 'i', 't', ' ', 'o', 'f', ' ', 'I', 'n', 'i', 'a']
Yes, there is at least one match!


3. **[a-zA-Z0-9]** matches any single alphanumeric character; with a leading ^ inside the brackets it matches any character that is not in the set.

In [25]:
str = "@AV Largest Data Science community #AV!!"

# extract words that start with a special character (anything other than a letter, digit, or space)
x = re.findall(r"[^a-zA-Z0-9 ]\w+", str)

print(x)

['@AV', '#AV']
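
A negated set accepts every special character, which may be broader than intended. As a sketch, assuming only @ and # are of interest and using a slightly modified sentence containing a hyphen, an explicit set is stricter:

```python
import re

# Hypothetical variant of the sentence above with a hyphenated word.
post = "@AV Largest Data-Science community #AV!!"

# The negated set also picks up "-Science", because "-" is a special character.
negated = re.findall(r"[^a-zA-Z0-9 ]\w+", post)

# Listing the wanted prefixes explicitly keeps only mentions and hashtags.
explicit = re.findall(r"[@#]\w+", post)

print(negated)
print(explicit)
```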


## Some Complex Queries

#### Extracting Email IDs

In [27]:
str = 'Send a mail to sshivam.singh1996@gmail.com, smith_david34@yahoo.com and priya@yahoo.com about the meeting @2PM'
  
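# \w matches any alphanumeric character or underscore
# + repeats the preceding character or set one or more times
# \w+ alone would stop at the dot in 'sshivam.singh1996', so a character set is used instead
#x = re.findall(r'\w+@\w+\.com', str)
x = re.findall(r'[a-zA-Z0-9._-]+@\w+\.com', str)

# Print the list
print(x)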

['sshivam.singh1996@gmail.com', 'smith_david34@yahoo.com', 'priya@yahoo.com']
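
The pattern above only finds addresses ending in .com. A rough sketch of a more permissive pattern follows; the sample addresses are invented, and the pattern is a simplification that does not cover every valid email address.

```python
import re

# Hypothetical message with addresses on different domains.
message = ("Send a mail to a.b@example.org, c_d@mail.example.co.in "
           "and e@yahoo.com about the meeting @2PM")

# Local part: word characters, dots, hyphens.
# Domain: one label followed by one or more ".label" groups.
# (?: ... ) is a non-capturing group, so findall() returns the full match.
email_pattern = r"[\w.-]+@[\w-]+(?:\.[\w-]+)+"

emails = re.findall(email_pattern, message)
print(emails)
```

The "@2PM" token is skipped because nothing precedes the @ sign directly.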


#### Extracting Dates

In [28]:
text = "London Olympic 2012 was held from 2012-07-27 to 2012/08/12."

# '\d{4}' repeats '\d' 4 times
# the unescaped '.' matches any single character, so it accepts both '-' and '/'
match = re.findall(r'\d{4}.\d{2}.\d{2}', text)
print(match)

['2012-07-27', '2012/08/12']
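
Because an unescaped '.' matches any character, the date pattern can also match a long run of digits. A minimal sketch, with a made-up ID number in the text, shows the effect and a stricter alternative that lists the allowed separators:

```python
import re

# Hypothetical text: the ID is not a date but has ten digits in a row.
record = "ID 2012507527 issued on 2012-07-27"

# '.' accepts the digit 5 as a "separator", so the ID is reported as a date.
loose = re.findall(r"\d{4}.\d{2}.\d{2}", record)

# [-/] accepts only a hyphen or a slash between the parts.
strict = re.findall(r"\d{4}[-/]\d{2}[-/]\d{2}", record)

print(loose)
print(strict)
```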


In [29]:
text="London Olympic 2012 was held from 27 Jul 2012 to 12-Aug-2012."

match = re.findall(r'\d{2}.\w{3}.\d{4}', text)

print(match)

['27 Jul 2012', '12-Aug-2012']


In [30]:
# extract dates with varying lengths
text="London Olympic 2012 was held from 27 July 2012 to 12 August 2012."

# '\w{3,10}' repeats '\w' between 3 and 10 times, which covers month names of different lengths
match = re.findall(r'\d{2}.\w{3,10}.\d{4}', text)

print(match)

['27 July 2012', '12 August 2012']
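
Once the dates are found, capture groups can separate day, month, and year, and the datetime module can turn them into date objects. This is a sketch under the assumption that month names are written in full in English and that the default locale is in use; the variable names are illustrative.

```python
import re
from datetime import datetime

olympics_text = "London Olympic 2012 was held from 27 July 2012 to 12 August 2012."

# With groups in the pattern, findall() returns one tuple per match.
date_pattern = r"(\d{1,2}) ([A-Za-z]{3,9}) (\d{4})"
parts = re.findall(date_pattern, olympics_text)

# strptime() raises ValueError for impossible dates, which acts as a validity check.
# %B expects the full month name in the current locale (assumed English here).
dates = [datetime.strptime(" ".join(p), "%d %B %Y").date() for p in parts]

print(parts)
print(dates)
```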
