## Regular Expression

A regular expression (RegEx) is a sequence of characters that defines a search pattern, mainly used to find or replace patterns in text.

## Python Built-in Module for Regular Expressions

### re.match(pattern, string)
The re.match function checks for a match only at the beginning of the string. It returns a match object on success and None on failure.

In [1]:
import re
result = re.match('Vishnu','Vishnu is student')
print(result)

<re.Match object; span=(0, 6), match='Vishnu'>
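The match object carries the matched text and its position, and a failed match is simply None. The following is a minimal sketch using the same sentence; the variable names (text, hit, miss) are illustrative only.

```python
import re

text = "Vishnu is student"

# The pattern sits at the start of the string, so re.match succeeds.
hit = re.match("Vishnu", text)
matched_text = hit.group()   # the substring that matched
matched_span = hit.span()    # (start, end) indices of the match

# "student" occurs later in the string, so re.match returns None.
miss = re.match("student", text)

print(matched_text, matched_span)
print(miss)
```

Because a failed match is None, it is safer to test the result before calling group() or span() on it.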


### re.search(pattern, string)
Matches the first occurrence of a pattern anywhere in the string (not just at the beginning).

In [2]:
result = re.search('good','Vishnu is good student ')
print(result)

<re.Match object; span=(10, 14), match='good'>
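To make the difference from re.match concrete, here is a small sketch that runs both functions on the same string. The variable names are illustrative assumptions.

```python
import re

text = "Vishnu is good student"

at_start = re.match("good", text)    # None: "good" is not at index 0
anywhere = re.search("good", text)   # match object: "good" found later on

print(at_start)
if anywhere is not None:
    print(anywhere.group(), anywhere.start())
```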


### re.findall(pattern, string)
Returns a list of all non-overlapping occurrences of the pattern in the string.

In [3]:
result = re.findall('founded','Andrew NG founded Coursera. He also founded deeplearning.ai')
print(result)

['founded', 'founded']
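Two behaviours of re.findall are worth seeing side by side: it returns an empty list when nothing matches, and when the pattern contains one capturing group it returns the group contents rather than the whole match. A short sketch, with "acquired" chosen only as a word that does not occur in the sentence:

```python
import re

text = "Andrew NG founded Coursera. He also founded deeplearning.ai"

# No occurrence of the pattern: the result is an empty list, not None.
no_hits = re.findall("acquired", text)

# One capturing group: findall returns what the group captured.
founded_what = re.findall(r"founded (\w+)", text)

print(no_hits)
print(founded_what)
```

The second result stops at "deeplearning" because the dot is not a word character.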


## Special Sequences in Regular Expressions

### \b
\b matches at a word boundary, so it returns a match where the specified pattern is at the beginning or at the end of a word.



In [4]:
str ='Vishnu is cool and fool'
#Check if there is any word that ends with "ool"
x = re.findall(r"ool\b", str)
print(x)

['ool', 'ool']
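The same boundary can be placed before a pattern to match the start of a word, or on both sides to pull out whole words. The sketch below also shows why the raw-string prefix matters here: in an ordinary Python string, "\b" is the backspace character rather than a word boundary.

```python
import re

text = "Vishnu is cool and fool"

# \b before the pattern: "co" must begin a word.
starts_with_co = re.findall(r"\bco", text)

# \b on both sides: whole words that end in "ool".
whole_words = re.findall(r"\b\w+ool\b", text)

# Without the r prefix, "\b" is a backspace character, so nothing matches.
not_raw = re.findall("ool\b", text)

print(starts_with_co)
print(whole_words)
print(not_raw)
```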


### \d
\d matches any single digit (0-9), so it returns a match wherever the string contains a digit.

In [5]:
str = "2 million monthly visits in Jan'19."
#Find every digit (0-9) in the string; the raw string keeps the backslash intact
x = re.findall(r"\d", str)
print(x)

['2', '1', '9']
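Note that \d on its own matches one digit at a time, which is why 19 comes back as '1' and '9'. As a sketch, adding the + quantifier (one or more) keeps consecutive digits together:

```python
import re

text = "2 million monthly visits in Jan'19."

# \d+ matches each run of consecutive digits as a single item.
numbers = re.findall(r"\d+", text)

# findall returns strings; convert them when numeric values are needed.
as_ints = [int(n) for n in numbers]

print(numbers)
print(as_ints)
```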


### \D
\D matches any character that is not a digit. It is the opposite of \d.

In [6]:
str = "2 million monthly visits in Jan'19."
#Find every character that is not a digit (0-9):
x = re.findall(r"\D", str)
print(x)

[' ', 'm', 'i', 'l', 'l', 'i', 'o', 'n', ' ', 'm', 'o', 'n', 't', 'h', 'l', 'y', ' ', 'v', 'i', 's', 'i', 't', 's', ' ', 'i', 'n', ' ', 'J', 'a', 'n', "'", '.']


### \w
\w matches any word character: letters (a-z and A-Z), digits (0-9), and the underscore _ character.

In [7]:
str = "2 million monthly visits!"
#\w+ returns each run of one or more word characters (letters, digits 0-9, and the underscore _ character)
x = re.findall(r"\w+", str)
print(x)

['2', 'million', 'monthly', 'visits']


### \W
\W matches every non-word character, that is, anything other than a letter, a digit, or the underscore. It is the opposite of \w.


In [8]:
str = "How are You? and wow beautiful !"
#Returns a match at every non-word character (such as "!", "?", and whitespace):
x = re.findall(r"\W", str)
print(x)

[' ', ' ', '?', ' ', ' ', ' ', ' ', '!']
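Since \w and \W are complements, running both on one string splits it cleanly into word characters and everything else. The sample string below is made up for illustration; it includes an underscore and a hyphen to show which side each falls on.

```python
import re

text = "user_name42 logged-in!"   # hypothetical sample text

# The underscore and digits count as word characters; the hyphen does not.
words = re.findall(r"\w+", text)
non_words = re.findall(r"\W", text)

print(words)
print(non_words)
```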


## Metacharacters in Regular Expressions
Metacharacters are characters with a special meaning.


### (.) matches any character (except newline character)

In [9]:
str = "rohan and rohit recently published a research paper!"
#Search for "ro" followed by a fixed number of arbitrary characters

x = re.findall("ro.", str)           #searches one character after ro
x2 = re.findall("ro...", str)        #searches three characters after ro

print(x)
print(x2)

['roh', 'roh']
['rohan', 'rohit']
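Two details about the dot are easy to miss: it skips newlines unless the re.DOTALL flag is passed, and it has to be escaped as \. to match a literal period. A brief sketch with made-up sample strings:

```python
import re

multiline = "a\nb a-b"

# By default the dot does not match the newline between "a" and "b".
default_dot = re.findall("a.b", multiline)

# With re.DOTALL the dot matches the newline as well.
dotall_dot = re.findall("a.b", multiline, flags=re.DOTALL)

# An escaped dot matches only a literal period.
any_char = re.findall("a.b", "a.b axb")
literal_dot = re.findall(r"a\.b", "a.b axb")

print(default_dot, dotall_dot)
print(any_char, literal_dot)
```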


### (^) starts with
It checks whether the string starts with the given pattern or not.



In [10]:
str = "Data Science"
#Check if the string starts with 'Data':
x = re.findall("^Data", str)
if (x):
    print("Yes, the string starts with 'Data'")
else:
    print("No match")

Yes, the string starts with 'Data'


### ($) ends with
It checks whether the string ends with the given pattern or not.

In [11]:
str = "Data Science"
#Check if the string ends with 'Science':
x = re.findall("Science$", str)
if (x):
    print("Yes, the string ends with 'Science'")
else:
    print("No match")

Yes, the string ends with 'Science'
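The anchors tie a pattern to a position, so the same word fails to match when it sits at the wrong end of the string. The sketch below shows both failure cases and then combines ^ and $ to check the start and the end in one pattern; the .* in the middle stands for "any characters in between".

```python
import re

text = "Data Science"

# "Science" is in the string, but not at the start.
wrong_start = re.findall("^Science", text)

# "Data" is in the string, but not at the end.
wrong_end = re.findall("Data$", text)

# Both anchors together: starts with "Data" and ends with "Science".
both_ends = re.search("^Data.*Science$", text) is not None

print(wrong_start, wrong_end, both_ends)
```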


### (*) matches zero or more occurrences of the pattern to the left of it

In [12]:
str = "easy easssy eay ey"
#Check if the string contains "ea" followed by 0 or more "s" characters and ending with y
x = re.findall("eas*y", str)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

['easy', 'easssy', 'eay']
Yes, there is at least one match!


### (+) matches one or more occurrences of the pattern to the left of it

In [13]:
#Check if the string contains "ea" followed by 1 or more "s" characters and ends with y

x = re.findall("eas+y", str)
print(x)

['easy', 'easssy']


### (|) either or


In [14]:
str = "Analytics Vidhya is the largest data science community of India"
#Check if the string contains either "data" or "India":
x = re.findall("data|India", str)
print(x)

['data', 'India']
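The | operator splits the whole pattern into alternatives unless it is enclosed in a group. As a sketch with made-up sample words, the first pattern below means "gra" or "ey", while the second uses a non-capturing group (?:...) to limit the choice to a single letter:

```python
import re

text = "gray grey griy"   # hypothetical sample text

# Without a group, | separates the entire left side from the entire right side.
ungrouped = re.findall("gra|ey", text)

# (?:a|e) restricts the alternation to one character inside the word.
grouped = re.findall(r"gr(?:a|e)y", text)

print(ungrouped)
print(grouped)
```

A non-capturing group is used here so that re.findall still returns the whole match rather than just the group contents.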
