# Regular Expressions

Regular expressions, or "regex" for short, are a powerful tool for matching patterns in text. In Python, the `re` module provides support for regular expressions. With this module, you can search for patterns in strings, replace parts of strings that match a pattern, and split strings based on a pattern.

The basic building blocks of regular expressions are characters and metacharacters. Ordinary characters match themselves, while metacharacters match classes of characters or specify the structure of the pattern. Some examples of metacharacters are `.` (matches any character except a newline), `*` (matches zero or more occurrences of the preceding character or group), `+` (matches one or more occurrences of the preceding character or group), and `?` (matches zero or one occurrence of the preceding character or group).
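As a rough sketch of the three operations mentioned above (searching, replacing, and splitting), the snippet below uses `re.search()`, `re.sub()`, and `re.split()` on one sentence. The replacement word `sleepy` and the variable names are only illustrative choices.

```python
import re

text = "The quick brown fox jumps over the lazy dog."

# search: "." stands for any single character, so "q.ick" finds "quick"
first = re.search(r"q.ick", text)
print(first.group())

# sub: replace every match of the pattern with another string
replaced = re.sub(r"lazy", "sleepy", text)
print(replaced)

# split: break the string wherever one or more spaces occur
words = re.split(r" +", text)
print(words[:3])
```

Each function takes the pattern first and the text to work on afterwards, which makes it easy to swap one operation for another.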

In [1]:
import re # importing re module to perform regular expressions

text = "The quick brown fox jumps over the lazy dog."
match = re.search(r"fox", text)
print(match.group())

fox


In [2]:
# raw strings in Python

# In a normal string, "\01" is read as an octal escape (an invisible control character)
firstPath = "D:\01.dataScience\01.Python"

# ...and "\n" is read as a newline
secondPath = "D:\01.dataScience\01.Python\newfolder"

# The r prefix makes a raw string: backslashes are kept exactly as typed
secondPathModified = r"D:\01.dataScience\01.Python\newfolder"

print(firstPath)
print(secondPath)          # "\n" starts a new line, so "ewfolder" lands on the next line
print(secondPathModified)

D:.dataScience.Python
D:.dataScience.Python
ewfolder
D:\01.dataScience\01.Python\newfolder
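Raw strings matter for regex because patterns such as `\d` rely on backslashes reaching the `re` module untouched. The following is a small sketch of the difference; the sample text `"Room 101, floor 3"` is made up for illustration.

```python
import re

newline = "\n"        # normal string: one character, a newline
raw_newline = r"\n"   # raw string: two characters, a backslash and the letter n
print(len(newline), len(raw_newline))

# A raw string and a doubled backslash spell the same pattern
same_pattern = r"\d+" == "\\d+"
print(same_pattern)

# The raw form is usually the easier one to read and type
numbers = re.findall(r"\d+", "Room 101, floor 3")
print(numbers)
```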


### Matching patterns with regex

Suppose you want to match a US-format phone number such as 415-555-4757.

In regex, `\d` stands for a digit character, that is, any single numeral from 0 to 9.

The phone number format above can be matched by writing `\d\d\d-\d\d\d-\d\d\d\d`

Regular expressions can also be more concise. For example, adding a 3 in braces (`{3}`) after a pattern means that the pattern should be repeated *three times*, so `\d{3}-\d{3}-\d{4}` matches the same phone number format.
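To check that the long and short spellings behave the same way, here is a minimal sketch that runs both against one sample sentence. The sentence and the names `long_form` and `short_form` are assumptions made for the example.

```python
import re

long_form = re.compile(r"\d\d\d-\d\d\d-\d\d\d\d")
short_form = re.compile(r"\d{3}-\d{3}-\d{4}")   # {3} repeats the preceding \d three times

sample = "Call 415-555-4757 today"

long_match = long_form.search(sample).group()
short_match = short_form.search(sample).group()

print(long_match)
print(short_match)
print(long_match == short_match)
```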

Read: https://docs.python.org/3/library/re.html?highlight=re#module-re

In [3]:
phoneNumRegex = re.compile(r'(\d{3})-(\d{3}-\d{4})') # saving the pattern using compile

In [4]:
mo = phoneNumRegex.search("My number is 312-415-2222")

print(mo.group(1))


312


`mo.groups()` returns a tuple containing the text matched by every group. Parentheses `(` and `)` in the pattern mark where each group starts and ends, so each parenthesized part becomes one item in the `mo.groups()` output.
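A short sketch of how the group numbers relate to `groups()`, reusing the phone number pattern from the cells above (the variable names here are illustrative):

```python
import re

phone_regex = re.compile(r"(\d{3})-(\d{3}-\d{4})")
mo = phone_regex.search("My number is 312-415-2222")

whole = mo.group(0)        # group 0 (or group() with no argument) is the entire match
first = mo.group(1)        # text matched by the first pair of parentheses
second = mo.group(2)       # text matched by the second pair
all_groups = mo.groups()   # tuple of every numbered group

print(whole, first, second)
print(all_groups)

# The tuple can be unpacked into separate variables
area_code, main_number = mo.groups()
print(area_code, main_number)
```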


### Special Characters in Regular Expressions

`.  ^  $  *  +  ?  {  }  [  ]  \  |  (  )`


If you want to match any of these characters literally, put a backslash in front of the character in your pattern. For example, `\.` matches a literal period and `\(` matches a literal opening parenthesis.
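The sketch below shows escaping in practice. The phone number written with parentheses and the string `"1+1=2?"` are made-up inputs; `re.escape()` is a standard-library helper that adds the backslashes for you.

```python
import re

# \( and \) match literal parentheses; the unescaped pair still forms a group
paren_regex = re.compile(r"\((\d{3})\) \d{3}-\d{4}")
mo = paren_regex.search("My number is (312) 415-2222")
print(mo.group())
print(mo.group(1))

# re.escape() backslash-escapes the special characters in a string,
# which is handy when the text to find comes from a variable
literal = re.escape("1+1=2?")
found = re.search(literal, "Is 1+1=2? Yes.").group()
print(found)
```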


# Matching Multiple groups with pipe symbol

### `|` = The pipe symbol

In [5]:
heroRegex = re.compile(r'Frog|Batman') # saving the pattern using compile

mo = heroRegex.search("Cat and Frog")

print(mo.group())

Frog


You can use it anywhere you want to match one of several expressions. For example, the regular expression `r'Frog|Batman'` will match either `'Frog'` or `'Batman'`.

When both `Batman` and `Frog` occur in the searched string, the first occurrence of matching text will be returned as the `Match` object.

In [9]:
mo2 = heroRegex.search('Frog and Batman')
mo2.group()


'Frog'

You can also use the pipe to match one of several patterns as part of your regex. For example, say you wanted to match any of the strings `'Batman'`, `'Batmobile'`, `'Batcopter'`, and `'Batbat'`. Since all these strings start with `Bat`, it would be nice if you could specify that prefix only once.

In [18]:
batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = batRegex.search('Batman lost a wheel')

In [19]:
mo.group()

'Batman'

In [20]:
mo.group(1)

'man'

### Optional Matching with the Question Mark `?`

Sometimes there is a pattern that you want to match only optionally. That is, the regex should find a match regardless of whether that bit of text is there. The `?` character flags the group that precedes it as an optional part of the pattern.

In [21]:
batRegex = re.compile(r'Bat(wo)?man')
mo1 = batRegex.search('The Adventures of Batman')
mo1.group()

'Batman'

In [24]:
mo2 = batRegex.search('The Adventures of Batman and Batwoman')
mo2.group()

'Batman'

In [37]:
phoneNumRegex = re.compile(r'(\d{3}-)?\d{3}-\d{4}')
mo1 = phoneNumRegex.search("My number is 312-415-2222")

print(mo1.group())

312-415-2222


In [38]:
mo2 = phoneNumRegex.search("My number is 415-2222")

print(mo2.group())

415-2222


### Matching zero or more with a star (asterisk) `*`

The `*` (called the star or asterisk) means __"match zero or more"__: the group that precedes the star can occur any number of times in the text, including not at all.

In [39]:
batRegex = re.compile(r'Bat(wo)*man')
mo1 = batRegex.search('The Adventures of Batman')
mo1.group()

'Batman'

In [40]:
mo2 = batRegex.search('The Adventures of Batwoman')
mo2.group()

'Batwoman'

In [44]:
mo3 = batRegex.search('The Adventures of Batwowowowoman')
mo3.group()

'Batwowowowoman'

### Matching one or more with the Plus `+`

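The `+` (plus) means __"match one or more"__. Unlike the star, the group that precedes a plus must appear at least once. If it does not, `search()` returns `None`, and calling `.group()` on that result raises an `AttributeError`.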

In [45]:
batRegex = re.compile(r'Bat(wo)+man')
mo1 = batRegex.search('The Adventures of Batman')
mo1.group()

AttributeError: 'NoneType' object has no attribute 'group'

In [48]:
mo2 = batRegex.search('The Adventures of Batwowowowoman')
mo2.group()

'Batwowowowoman'
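Because `search()` returns `None` when nothing matches, it is usually safer to check the result before calling `.group()`. A minimal sketch, where `describe` is a hypothetical helper written only for this example:

```python
import re

bat_regex = re.compile(r"Bat(wo)+man")

def describe(text):
    """Return the matched text, or a placeholder when there is no match."""
    mo = bat_regex.search(text)
    if mo is None:          # no match, so there is no Match object to query
        return "no match"
    return mo.group()

print(describe("The Adventures of Batman"))
print(describe("The Adventures of Batwoman"))
```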

### Matching Specific repetitions with Braces

`(Ha){3}` - this will match the string `HaHaHa`

`(Ha){3,5}` - will match `'HaHaHa'`, `'HaHaHaHa'`, and `'HaHaHaHaHa'`
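The brace patterns above can be tried out as follows. This is a small sketch with illustrative variable names; the last line also shows that Python's regexes are greedy by default, so `{3,5}` takes the longest run it can.

```python
import re

exactly_three = re.compile(r"(Ha){3}")
three_match = exactly_three.search("HaHaHa").group()
too_short = exactly_three.search("HaHa")      # only two repetitions: no match

three_to_five = re.compile(r"(Ha){3,5}")
# The text holds six "Ha"s, but the pattern stops after five
longest = three_to_five.search("HaHaHaHaHaHa").group()

print(three_match)
print(too_short)
print(longest)
```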


### The `findall()` method

While `search()` returns a `Match` object for the first _matched_ text in the searched string, `findall()` returns a list containing the string for every match in the searched string.

In [49]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # has no groups

phoneNumRegex.findall('Home: 415-555-9999 Work: 212-555-0000')

['415-555-9999', '212-555-0000']
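When the pattern does contain groups, `findall()` returns a list of tuples instead, one tuple per match and one item per group. A brief sketch using the same text as above (the name `grouped_regex` is illustrative):

```python
import re

grouped_regex = re.compile(r"(\d\d\d)-(\d\d\d)-(\d\d\d\d)")   # has three groups

results = grouped_regex.findall("Home: 415-555-9999 Work: 212-555-0000")
print(results)

# Each tuple can be joined back into the full number if needed
numbers = ["-".join(parts) for parts in results]
print(numbers)
```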

### Character classes

`\d` is shorthand for the regular expression `(0|1|2|3|4|5|6|7|8|9)`. Python provides several such shorthand character classes:


`\d` - Any digit from 0 to 9.

`\D` - Any character that is __not__ a digit from 0 to 9.

`\w` - Any letter, digit, or the underscore character. (Think of it as matching "word" characters.)

`\W` - Any character that is __not__ a letter, digit, or the underscore character.

`\s` - Any space, tab, or newline character. (Think of it as matching "space" characters.)

`\S` - Any character that is __not__ a space, tab, or newline character.

In [52]:
xmasRegex = re.compile(r'\d+\s\w+')

xmasRegex.findall('12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, 7 swans, 6 geese, 5 rings, 4 birds, 3 hens, 2 doves, 1 partridge')

['12 drummers',
 '11 pipers',
 '10 lords',
 '9 ladies',
 '8 maids',
 '7 swans',
 '6 geese',
 '5 rings',
 '4 birds',
 '3 hens',
 '2 doves',
 '1 partridge']
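The negated classes can be explored the same way. The sketch below runs several shorthand classes over one made-up sample string so the results can be compared side by side.

```python
import re

sample = "Room 42_b, floor 7!"

digits = re.findall(r"\d", sample)        # single digits
non_digits = re.findall(r"\D+", sample)   # runs of anything that is not a digit
words = re.findall(r"\w+", sample)        # runs of letters, digits, underscores
non_words = re.findall(r"\W+", sample)    # runs of everything else
chunks = re.findall(r"\S+", sample)       # runs of non-whitespace characters

for name, value in [("digits", digits), ("non_digits", non_digits),
                    ("words", words), ("non_words", non_words),
                    ("chunks", chunks)]:
    print(name, value)
```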

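# Project - Find Email IDs

This program finds every email ID in the text currently on your clipboard and copies the results back to the clipboard. Copy the text of any webpage first, then run it.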

In [5]:
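import re

import pyperclip  # third-party package for clipboard access

# Example addresses the pattern should match:
# donjoe93@gmail.com
# donjoe@gov.co.in
# ayesha_riyaz@hot-mail.com
# joe@pm.me

emailRegex = re.compile(r'''(
            [a-zA-Z0-9._%+-]+      # username
            @                      # @ symbol
            [a-zA-Z0-9.-]+         # domain name
            (\.[a-zA-Z]{2,4})      # dot-something, such as .com or .in
            )''', re.VERBOSE)

text = str(pyperclip.paste())

# findall() returns one tuple per match because the pattern has groups;
# item 0 of each tuple is the whole email address
matches = [groups[0] for groups in emailRegex.findall(text)]

if matches:
    pyperclip.copy('\n'.join(matches))
    print('Copied to Clipboard:')
    print('\n'.join(matches))
else:
    print("No email id found")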
    

No email id found
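The run above found nothing because the clipboard held no email IDs at the time. To try the pattern without touching the clipboard, the sketch below applies the same regex to a hard-coded sample built from the example addresses in the comments. The surrounding sentence and the variable names are assumptions made for the test.

```python
import re

email_regex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+      # username
    @                      # @ symbol
    [a-zA-Z0-9.-]+         # domain name
    (\.[a-zA-Z]{2,4})      # dot-something
)''', re.VERBOSE)

sample = """Contact donjoe93@gmail.com or donjoe@gov.co.in,
ayesha_riyaz@hot-mail.com, joe@pm.me."""

# Item 0 of each findall() tuple is the outer group: the full address
found = [groups[0] for groups in email_regex.findall(sample)]
print('\n'.join(found))
```

Note that the trailing comma and period in the sample are left out of the matches, since neither can complete the final `\.[a-zA-Z]{2,4}` part of the pattern.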
