# Python - Regular Expressions


A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern.

In [1]:
s = 'foo123bar'
'123' in s

True

In [2]:
'7' in s

False

In [7]:
s.find('12')

3

In [3]:
s.index('123')

3

In [2]:
s.index('5')

ValueError: substring not found
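
The two lookups above differ in how they report a missing substring: find() returns -1, while index() raises ValueError. A minimal sketch of that difference (the variable names are illustrative only):

```python
s = 'foo123bar'

# str.find returns -1 when the substring is absent.
missing_with_find = s.find('5')

# str.index raises ValueError in the same situation.
try:
    s.index('5')
    index_raised = False
except ValueError:
    index_raised = True

print(missing_with_find, index_raised)
```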

## RegEx Functions

The re module is part of Python's standard library, so there is nothing to install with pip (pip install re fails because no such package exists); import re is enough. The module offers a set of functions that allow us to search a string for a match:



<table align="left">
<tr >
    <th align="left">Function</th>
    <th align="left">Description</th>
</tr>

<tr>
    <td>search()</td> 
    <td>Returns a Match object if there is a match anywhere in the string</td> 
</tr>

<tr>
    <td>match()</td> 
    <td>Returns a Match object if the pattern matches at the beginning of the string</td> 
</tr>
    
<tr>
    <td>findall() </td>
    <td>Returns a list containing all matches </td>
</tr>

<tr>
    <td>finditer() </td>
    <td>Returns an iterator of Match objects, one for each match </td>
</tr>
    
<tr>
    <td>split() </td>
    <td>Returns a list where the string has been split at each match </td>
</tr>

<tr>
    <td>sub() </td>
    <td>Replaces one or many matches with a string </td>
</tr>
    


</table>

### The search() Function

This function searches for the first occurrence of a pattern within a string, with optional flags.

    re.search(pattern, string, flags=0)

In [6]:
import re

s = 'foo123bar'
re.search('123', s)

<re.Match object; span=(3, 6), match='123'>

In [7]:
if re.search('123', s):
    print("Match!")
else: 
    print("Not a match!")

Match!
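
The Match object returned by search() carries more than a yes/no answer. As a small sketch (variable names are illustrative), group() gives the matched text and span() gives its position:

```python
import re

s = 'foo123bar'
m = re.search('123', s)

matched_text = m.group()   # the substring that matched
start, end = m.span()      # half-open index range of the match

print(matched_text, start, end)
print(s[start:end] == matched_text)
```

Because search() returns None when nothing matches, calling group() is only safe after checking the result, as the if/else cell above does.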


### Python Regex Metacharacters

In [9]:
s = 'foo191bar'
if re.search('[0-9][0-9][0-9]', s):
    print("Match!")
else: print("Not a match!")

Match!


In [23]:
s = 'foo1002bar'
if re.search('[0-9][0-9][0-9][0-9]', s):
    print("Match!")
else: print("Not a match!")

Match!
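
Repeating [0-9] by hand gets unwieldy as the count grows. A quantifier such as {3} expresses the same repetition more compactly. A minimal sketch, with illustrative variable names:

```python
import re

s = 'foo191bar'

long_form = r'[0-9][0-9][0-9]'
short_form = r'[0-9]{3}'   # exactly three characters from the set 0-9

long_match = re.search(long_form, s)
short_match = re.search(short_form, s)

print(long_match.group(), short_match.group())
```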


![Screenshot%202023-04-13%20095537.png](attachment:Screenshot%202023-04-13%20095537.png)

### Special Sequences
A special sequence is a \ followed by one of the characters in the list below, and has a special meaning:

![Screenshot%202023-04-13%20095931.png](attachment:Screenshot%202023-04-13%20095931.png)
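
As a rough illustration, and assuming the list includes the common sequences \d, \w, \A and \b, the following sketch applies each one to a sample string (the string and variable names are made up for this example):

```python
import re

txt = "The rain in Spain 2023"

digits = re.findall(r'\d+', txt)                 # runs of digits
words = re.findall(r'\w+', txt)                  # runs of word characters
starts_with_the = re.search(r'\AThe', txt) is not None   # \A anchors at the start of the string
ain_at_word_end = re.findall(r'ain\b', txt)      # "ain" followed by a word boundary

print(digits)
print(words)
print(starts_with_the)
print(ain_at_word_end)
```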

### Sets

A set is a set of characters inside a pair of square brackets [] with a special meaning:

![Screenshot%202023-04-13%20100124.png](attachment:Screenshot%202023-04-13%20100124.png)
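
A short sketch of how sets behave in practice, assuming the usual forms: an explicit list of characters, a range, and a negated set starting with ^. The variable names are illustrative:

```python
import re

txt = "The rain in Spain"

vowels = re.findall(r'[aeiou]', txt)              # any one of the listed characters
uppercase = re.findall(r'[A-Z]', txt)             # any character in the range A to Z
not_lower_or_space = re.findall(r'[^a-z ]', txt)  # anything except a-z and space

print(vowels)
print(uppercase)
print(not_lower_or_space)
```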

In [19]:
# Search for the first white-space character in the string:

txt = "The rain in Spain"
x = re.search(r"\s", txt)
print(x)

print("The first white-space character is located in position:", x.start())

<re.Match object; span=(3, 4), match=' '>
The first white-space character is located in position: 3


In [20]:
txt = "The rain in Spain"
x = re.search("Portugal", txt)
print(x)

None


In [24]:
txt = "The rain in Spain"
x = re.search("..in", txt)
print(x)

<re.Match object; span=(4, 8), match='rain'>


In [11]:
line = "Cats are smarter than dogs"

searchObj = re.search(r'(.*) are (.*?) .*', line, re.M|re.I)

if searchObj:
    print("searchObj.group() : ", searchObj.group())
    print("searchObj.group(1) : ", searchObj.group(1))
    print("searchObj.group(2) : ", searchObj.group(2))
else:
    print("Nothing found!!")

searchObj.group() :  Cats are smarter than dogs
searchObj.group(1) :  Cats
searchObj.group(2) :  smarter


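re.M (multiline) allows ^ and $ to match at the beginning and end of each line, not just the beginning and end of the entire string.

re.I (ignorecase) makes the regular expression case-insensitive.

Neither flag changes the result in the example above, because the pattern contains no ^ or $ and the letter case already matches.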
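
To see each flag make a difference, here is a minimal sketch on a two-line string (the sample text and variable names are made up for illustration):

```python
import re

text = "first line\nSecond line"

# Without re.M, ^ only matches at the very start of the string.
without_m = re.findall(r'^\w+', text)

# With re.M, ^ also matches right after each newline.
with_m = re.findall(r'^\w+', text, re.M)

# re.I lets a lowercase pattern match uppercase text.
with_i = re.findall(r'second', text, re.I)

print(without_m)
print(with_m)
print(with_i)
```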

#### Problem 1

 Write a regular expression that matches any sequence of 3 letters, followed by a dash, followed by 4 digits (e.g. "ABC-1234").

In [29]:
text = "ABC-1234"
# r"\w{3}-\d{4}" also matches this text, but \w accepts digits and underscores too
pattern = r"[A-Za-z]{3}-[0-9]{4}"

if re.search(pattern,text):
    print("Match is found")
else:
    print("No match")

Match is found
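
One caveat with this solution: search() is satisfied by a match anywhere in the text, so longer strings that merely contain the pattern also pass. If the whole string should have the form, re.fullmatch() is one option. The sketch below compares the two, along with the looser \w-based pattern; the candidate strings are made up for illustration:

```python
import re

strict = r'[A-Za-z]{3}-[0-9]{4}'
loose = r'\w{3}-\d{4}'   # \w also accepts digits and underscores

candidates = ["ABC-1234", "A_1-1234", "xABC-12345"]

# For each candidate: (strict fullmatch, strict search, loose fullmatch)
results = {
    c: (
        bool(re.fullmatch(strict, c)),
        bool(re.search(strict, c)),
        bool(re.fullmatch(loose, c)),
    )
    for c in candidates
}

for c, flags in results.items():
    print(c, flags)
```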


### The match() Function

This function attempts to match a pattern at the beginning of a string, with optional flags.

    re.match(pattern, string, flags=0)

In [12]:
pattern = "Cookie"
sequence = "Cookie is tasty"
if re.match(pattern, sequence):
    print("Match!")
else: 
    print("Not a match!")

Match!


In [13]:
pattern = '^a...s$'
test_string = 'abyss'
result = re.match(pattern, test_string)
if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")

Search successful.


In [23]:
line = "Cats are smarter than dogs"

matchObj = re.match( r'(.*) are (.*?) .*', line, re.M|re.I)

if matchObj:
    print("matchObj.group() : ", matchObj.group())
    print("matchObj.group(1) : ", matchObj.group(1))
    print("matchObj.group(2) : ", matchObj.group(2))
else:
    print("No match!!")

matchObj.group() :  Cats are smarter than dogs
matchObj.group(1) :  Cats
matchObj.group(2) :  smarter


### Matching Versus Searching

Python offers two different primitive operations based on regular expressions: match checks for a match only at the beginning of the string, while search checks for a match anywhere in the string (this is what Perl does by default).

In [33]:
line = "Cats are smarter than dogs"

matchObj = re.match( r'dogs', line, re.M|re.I)
if matchObj:
    print("match --> matchObj.group() : ", matchObj.group())
else:
    print("No match!!")

searchObj = re.search( r'dogs', line, re.M|re.I)
if searchObj:
    print("search --> searchObj.group() : ", searchObj.group())
else:
    print("Nothing found!!")

No match!!
search --> searchObj.group() :  dogs
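
Put another way, match() behaves roughly like search() with the pattern anchored at the start of the string. A small sketch of that equivalence on the same sentence (variable names are illustrative):

```python
import re

line = "Cats are smarter than dogs"

match_result = re.match(r'dogs', line)       # only tries position 0
anchored_search = re.search(r'^dogs', line)  # ^ forces search to start at position 0 too
plain_search = re.search(r'dogs', line)      # scans the whole string

print(match_result)
print(anchored_search)
print(plain_search.span())
```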


### The findall() Function
The findall() function returns a list containing all matches.

In [34]:
## Print a list of all matches:

txt = "The rain in Spain"
x = re.findall("..ai", txt)
print(x)

[' rai', 'Spai']


In [24]:
x = re.findall("Portugal", txt)
print(x)

[]


In [37]:
string = 'hello 12 hi 89. Howdy 34'
pattern = r'\d+'

result = re.findall(pattern, string) 
print(result)

['12', '89', '34']


In [35]:
text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'  # avoid naming a variable str, which shadows the built-in
emails = re.findall(r'[\w\.-]+@[\w\.-]+', text)  # ['alice@google.com', 'bob@abc.com']
for email in emails:
    print(email)

alice@google.com
bob@abc.com
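
When the pattern contains capturing groups, findall() returns the groups rather than the whole match: one tuple per match if there are several groups. A minimal sketch that splits each address into user and host (the group layout is an illustrative choice):

```python
import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# Two capturing groups: the part before @ and the part after it.
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', text)

for user, host in pairs:
    print(user, host)
```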


### The finditer() Function

Similar to findall(): it finds all non-overlapping matches in the string, but returns them as an iterator of Match objects rather than a list of strings.

In [36]:

statement = "Please contact us at: support@datacamp.com, xyz@datacamp.com"

# 'addresses' is an iterator over all the matches
addresses = re.finditer(r'[\w\.-]+@[\w\.-]+', statement)
for address in addresses:
    print(address)

<re.Match object; span=(22, 42), match='support@datacamp.com'>
<re.Match object; span=(44, 60), match='xyz@datacamp.com'>
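
Since each item is a Match object, the matched text and its position can be pulled out directly. A short sketch on the same statement (the tuple layout is just one way to collect the values):

```python
import re

statement = "Please contact us at: support@datacamp.com, xyz@datacamp.com"

# Collect (text, start, end) for every address found.
found = [
    (m.group(), m.start(), m.end())
    for m in re.finditer(r'[\w.-]+@[\w.-]+', statement)
]

for text, start, end in found:
    print(text, start, end)
```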


### The split() Function

The split() function returns a list where the string has been split at each match:

In [29]:
## Split at each white-space character:

txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x)

['The', 'rain', 'in', 'Spain']


In [25]:
## Split the string only at the first 2 occurrences:
txt = "The rain in Spain"
x = re.split(r"\s", txt, maxsplit=2)
print(x)

['The', 'rain', 'in Spain']


In [26]:
string = 'Twelve:12 Eighty nine:89.'
pattern = r'\d+'

result = re.split(pattern, string) 
print(result)

['Twelve:', ' Eighty nine:', '.']
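
In the result above the numbers themselves are discarded. Wrapping the pattern in a capturing group keeps the separators in the output list. A minimal sketch of the difference (variable names are illustrative):

```python
import re

string = 'Twelve:12 Eighty nine:89.'

without_group = re.split(r'\d+', string)    # separators dropped
with_group = re.split(r'(\d+)', string)     # separators kept as list items

print(without_group)
print(with_group)
```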


### The sub() Function
The sub() function replaces the matches with the text of your choice:

In [32]:
## Replace every white-space character with the number 9:

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt)
print(x)

The9rain9in9Spain


In [33]:
## Replace the first 2 occurrences:
txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt, count=2)
print(x)

The9rain9in Spain


In [27]:
# string containing a newline (the backslash ending the first line only continues the literal)
string = 'abc 12\
de 23 \n f45 6'

# matches runs of whitespace characters
pattern = r'\s+'

# empty string
replace = ''

new_string = re.sub(pattern, replace, string) 
print(new_string)

abc12de23f456


In [28]:
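# string containing a newline (the backslash ending the first line only continues the literal)
string = 'abc 12\
de 23 \n f45 6'

# matches runs of whitespace characters
pattern = r'\s+'
replace = ''

# count=1 replaces only the first match
new_string = re.sub(pattern, replace, string, count=1) 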
print(new_string)

abc12de 23 
 f45 6


In [29]:
print(re.sub('ub', '~*', 'Subject has Uber booked already',flags=re.IGNORECASE))

S~*ject has ~*er booked already


In [30]:
print(re.sub('ub', '~*', 'Subject has Uber booked already'))

S~*ject has Uber booked already
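
The replacement does not have to be a fixed string. As a sketch of three further options in the re module, it can refer back to captured groups, it can be a function that receives each Match object, and re.subn() reports how many replacements were made. The sample strings here are made up for illustration:

```python
import re

# Backreferences \1 and \2 reuse the captured groups, here to swap them.
swapped = re.sub(r'(\d{2})/(\d{2})', r'\2/\1', 'Due 04/13 and 05/20')

# A callable receives each Match object and returns the replacement text.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), 'hello 12 hi 89')

# re.subn returns the new string together with the number of replacements.
result, count = re.subn(r'\s', '9', 'The rain in Spain')

print(swapped)
print(doubled)
print(result, count)
```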
