# Regular Expressions

In [299]:
import re

## regular vs raw string

**regular string** - \t will be interpreted as a tab

In [300]:
print('\tthis is a tab') 

	this is a tab


**raw string** - doesn't treat \t as a tab but instead prints it as written

In [301]:
print(r'\tthis is not a tab')

\tthis is not a tab


hence, for regex we use raw strings, so that backslashes reach the regex engine unchanged
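
a minimal sketch of why this matters (the sample sentence is made up for illustration): in a regular string Python turns `\b` into a backspace character before `re` ever sees the pattern, while the raw string passes `\b` through so regex reads it as a word boundary.

```python
import re

# in a regular string '\t' is one character (a tab); in a raw string it is two
regular = '\t'
raw = r'\t'
print(len(regular), len(raw))

# '\b' in a regular string is a backspace character, not the regex word boundary
broken = re.compile('\bcat\b')
working = re.compile(r'\bcat\b')

print(broken.search('a cat sat'))    # None, the text contains no backspace characters
print(working.search('a cat sat'))   # finds 'cat'
```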

## searching text

In [302]:
text_to_search = '''
abcdefghijklmnopqurtuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890

Ha HaHa

MetaCharacters (Need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )

coreyms.com

321-555-4321
123.555.1234
123*555*1234
800-555-1234
900-555-1234

Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
'''

### creating pattern - .compile()

In [303]:
pattern = re.compile(r'abc')

### finding using pattern - pattern.finditer()

In [304]:
matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(1, 4), match='abc'>


use the indices given in **span** to slice the matched text out of the string - abc

In [305]:
text_to_search[1:4]

'abc'
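
the same information is available directly on the match object. a small sketch, assuming a made-up `sample` string: `span()` gives the indices and `group()` gives the matched text, so there is no need to slice by hand.

```python
import re

sample = 'xyz abc xyz abc'  # hypothetical sample text
pattern = re.compile(r'abc')

found = []
for match in pattern.finditer(sample):
    start, end = match.span()   # the same numbers shown in span=(...)
    # match.group() returns the matched text, same as slicing with the span
    found.append((start, end, match.group(), sample[start:end]))

print(found)
```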

## characters that need to be escaped

if we use . to search

In [306]:
pattern = re.compile(r'.')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(1, 2), match='a'>
<re.Match object; span=(2, 3), match='b'>
<re.Match object; span=(3, 4), match='c'>
<re.Match object; span=(4, 5), match='d'>
<re.Match object; span=(5, 6), match='e'>
<re.Match object; span=(6, 7), match='f'>
<re.Match object; span=(7, 8), match='g'>
<re.Match object; span=(8, 9), match='h'>
<re.Match object; span=(9, 10), match='i'>
<re.Match object; span=(10, 11), match='j'>
<re.Match object; span=(11, 12), match='k'>
<re.Match object; span=(12, 13), match='l'>
<re.Match object; span=(13, 14), match='m'>
<re.Match object; span=(14, 15), match='n'>
<re.Match object; span=(15, 16), match='o'>
<re.Match object; span=(16, 17), match='p'>
<re.Match object; span=(17, 18), match='q'>
<re.Match object; span=(18, 19), match='u'>
<re.Match object; span=(19, 20), match='r'>
<re.Match object; span=(20, 21), match='t'>
<re.Match object; span=(21, 22), match='u'>
<re.Match object; span=(22, 23), match='v'>
<re.Match object; span=(23, 24), match='w'>
... (output truncated)

this matches almost every character (everything except newlines)  
hence there are some special characters that mean something in regex and need to be escaped to match them literally

**MetaCharacters (Need to be escaped)**:  
. ^ $ * + ? { } [ ] \ | ( )

**SO TO ESCAPE WE USE \ IN FRONT OF META CHARACTERS**

In [307]:
pattern = re.compile(r'\.')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(113, 114), match='.'>
<re.Match object; span=(149, 150), match='.'>
<re.Match object; span=(171, 172), match='.'>
<re.Match object; span=(175, 176), match='.'>
<re.Match object; span=(223, 224), match='.'>
<re.Match object; span=(254, 255), match='.'>
<re.Match object; span=(267, 268), match='.'>


another example

In [308]:
pattern = re.compile(r'coreyms\.com')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(142, 153), match='coreyms.com'>


In [309]:
text_to_search[142:153]

'coreyms.com'
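
when the text to match comes from a variable, escaping by hand is easy to get wrong. a minimal sketch using `re.escape` from the standard library, which adds the backslashes for us (the sample sentences are made up):

```python
import re

literal = 'coreyms.com'        # text we want to match exactly
escaped = re.escape(literal)   # puts a backslash in front of the metacharacters
pattern = re.compile(escaped)

print(escaped)
print(pattern.search('visit coreyms.com today'))
print(pattern.search('visit coreymsXcom today'))  # None, the dot is now a literal dot
```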

## for a regex cheatsheet, refer to this [txt file](./module_regular_expressions.txt)

the regex shorthands follow a pattern  
eg:  

if lowercase **\d** matches any digit (0-9)  
then uppercase **\D** matches anything that is _not a digit_

so the capital letter just negates the meaning of the lowercase one
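
a quick sketch of this lowercase/uppercase pairing on a made-up string. besides `\d`/`\D` it also shows `\w`/`\W` (word characters) and `\s`/`\S` (whitespace), which are standard in Python's `re`. it uses `re.findall`, which returns the matches as a list of strings.

```python
import re

sample = 'ab 12_?'  # hypothetical sample text

digits = re.findall(r'\d', sample)      # digits
non_digits = re.findall(r'\D', sample)  # everything that is not a digit
word_chars = re.findall(r'\w', sample)  # letters, digits and underscore
non_word = re.findall(r'\W', sample)    # everything that is not a word character
spaces = re.findall(r'\s', sample)      # whitespace
non_spaces = re.findall(r'\S', sample)  # everything that is not whitespace

print(digits, non_digits)
print(word_chars, non_word)
print(spaces, non_spaces)
```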

In [310]:
sentence = 'start something just start'

In [311]:
pattern = re.compile(r'^start') # match start only at the beginning of the string (^ - caret symbol)

matches = pattern.finditer(sentence)

for match in matches:
    print(match)

<re.Match object; span=(0, 5), match='start'>


In [312]:
pattern = re.compile(r'start$') # match start only at the end of the string ($ - dollar symbol)

matches = pattern.finditer(sentence)

for match in matches:
    print(match)

<re.Match object; span=(21, 26), match='start'>


## pattern to match phone numbers

```
321-555-4321  
123.555.1234  
123*555*1234  
800-555-1234  
900-555-1234  
```

*notice that the separators are different*

In [313]:
pattern = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(155, 167), match='321-555-4321'>
<re.Match object; span=(168, 180), match='123.555.1234'>
<re.Match object; span=(181, 193), match='123*555*1234'>
<re.Match object; span=(194, 206), match='800-555-1234'>
<re.Match object; span=(207, 219), match='900-555-1234'>


**\d** matches a single digit  
**.** matches any character except a newline 

## finding from a file [regex.txt](./regex.txt)

In [314]:
with open('regex.txt', 'r') as f:

    data = f.read()

    pattern = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')

    matches = pattern.finditer(data) # find phone numbers in the file

    for match in matches:
        print(match)

<re.Match object; span=(12, 24), match='615-555-7164'>
<re.Match object; span=(102, 114), match='800-555-5669'>
<re.Match object; span=(191, 203), match='560-555-5153'>
<re.Match object; span=(281, 293), match='900-555-9340'>
<re.Match object; span=(378, 390), match='714-555-7405'>
<re.Match object; span=(467, 479), match='800-555-6771'>
<re.Match object; span=(557, 569), match='783-555-4799'>
<re.Match object; span=(647, 659), match='516-555-4615'>
<re.Match object; span=(740, 752), match='127-555-1867'>
<re.Match object; span=(829, 841), match='608-555-4938'>
<re.Match object; span=(915, 927), match='568-555-6051'>
<re.Match object; span=(1003, 1015), match='292-555-1875'>
<re.Match object; span=(1091, 1103), match='900-555-3205'>
<re.Match object; span=(1180, 1192), match='614-555-1166'>
<re.Match object; span=(1269, 1281), match='530-555-2676'>
<re.Match object; span=(1355, 1367), match='470-555-2750'>
<re.Match object; span=(1439, 1451), match='800-555-6089'>
... (output truncated)

#### Q: pattern to match phone numbers with separator as . & - only!

we can use a **character set []**, which only matches the characters we list inside it

also we don't need to escape . as \. inside a character set

**one charset only matches one character**

In [315]:
pattern = re.compile(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(155, 167), match='321-555-4321'>
<re.Match object; span=(168, 180), match='123.555.1234'>
<re.Match object; span=(194, 206), match='800-555-1234'>
<re.Match object; span=(207, 219), match='900-555-1234'>
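
to see the difference between `.` and `[-.]` side by side, here is a small sketch. the first three numbers come from the text above, the last one is made up to show that `.` accepts letters as separators too.

```python
import re

numbers = ['321-555-4321', '123.555.1234', '123*555*1234', '123a555b1234']

loose = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')         # . accepts any separator
strict = re.compile(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d')   # only - or . as separator

loose_hits = [n for n in numbers if loose.search(n)]
strict_hits = [n for n in numbers if strict.search(n)]

print(loose_hits)
print(strict_hits)
```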


#### Q: pattern to match only the 800 and 900 numbers

In [316]:
pattern = re.compile(r'[89]00[-.]\d\d\d[-.]\d\d\d\d')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(194, 206), match='800-555-1234'>
<re.Match object; span=(207, 219), match='900-555-1234'>


## - inside a character set

\- matches a literal hyphen when it is at the start or end of the character set  
but when used between two characters it specifies a **range** 

can be used to specify alphabet/numeric ranges
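
a small sketch of how the position of the hyphen changes its meaning, using a made-up string:

```python
import re

sample = 'a-c b'  # hypothetical sample text

as_range = re.findall(r'[a-c]', sample)     # range: a, b or c
as_literal = re.findall(r'[ac-]', sample)   # hyphen at the end: a, c or a literal -
escaped = re.findall(r'[a\-c]', sample)     # a backslash also makes the hyphen literal

print(as_range)
print(as_literal)
print(escaped)
```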

In [317]:
pattern = re.compile(r'[2-4]')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(56, 57), match='2'>
<re.Match object; span=(57, 58), match='3'>
<re.Match object; span=(58, 59), match='4'>
<re.Match object; span=(155, 156), match='3'>
<re.Match object; span=(156, 157), match='2'>
<re.Match object; span=(163, 164), match='4'>
<re.Match object; span=(164, 165), match='3'>
<re.Match object; span=(165, 166), match='2'>
<re.Match object; span=(169, 170), match='2'>
<re.Match object; span=(170, 171), match='3'>
<re.Match object; span=(177, 178), match='2'>
<re.Match object; span=(178, 179), match='3'>
<re.Match object; span=(179, 180), match='4'>
<re.Match object; span=(182, 183), match='2'>
<re.Match object; span=(183, 184), match='3'>
<re.Match object; span=(190, 191), match='2'>
<re.Match object; span=(191, 192), match='3'>
<re.Match object; span=(192, 193), match='4'>
<re.Match object; span=(203, 204), match='2'>
<re.Match object; span=(204, 205), match='3'>
<re.Match object; span=(205, 206), match='4'>
... (output truncated)

a-z or A-Z

In [318]:
pattern = re.compile(r'[a-zA-Z]')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(1, 2), match='a'>
<re.Match object; span=(2, 3), match='b'>
<re.Match object; span=(3, 4), match='c'>
<re.Match object; span=(4, 5), match='d'>
<re.Match object; span=(5, 6), match='e'>
<re.Match object; span=(6, 7), match='f'>
<re.Match object; span=(7, 8), match='g'>
<re.Match object; span=(8, 9), match='h'>
<re.Match object; span=(9, 10), match='i'>
<re.Match object; span=(10, 11), match='j'>
<re.Match object; span=(11, 12), match='k'>
<re.Match object; span=(12, 13), match='l'>
<re.Match object; span=(13, 14), match='m'>
<re.Match object; span=(14, 15), match='n'>
<re.Match object; span=(15, 16), match='o'>
<re.Match object; span=(16, 17), match='p'>
<re.Match object; span=(17, 18), match='q'>
<re.Match object; span=(18, 19), match='u'>
<re.Match object; span=(19, 20), match='r'>
<re.Match object; span=(20, 21), match='t'>
<re.Match object; span=(21, 22), match='u'>
<re.Match object; span=(22, 23), match='v'>
<re.Match object; span=(23, 24), match='w'>
... (output truncated)

## negation in character set


### ^ inside [] character set represents negation
so it matches every character except the ones listed in the set

In [319]:
pattern = re.compile(r'[^a-zA-Z]')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(0, 1), match='\n'>
<re.Match object; span=(27, 28), match='\n'>
<re.Match object; span=(54, 55), match='\n'>
<re.Match object; span=(55, 56), match='1'>
<re.Match object; span=(56, 57), match='2'>
<re.Match object; span=(57, 58), match='3'>
<re.Match object; span=(58, 59), match='4'>
<re.Match object; span=(59, 60), match='5'>
<re.Match object; span=(60, 61), match='6'>
<re.Match object; span=(61, 62), match='7'>
<re.Match object; span=(62, 63), match='8'>
<re.Match object; span=(63, 64), match='9'>
<re.Match object; span=(64, 65), match='0'>
<re.Match object; span=(65, 66), match='\n'>
<re.Match object; span=(66, 67), match='\n'>
<re.Match object; span=(69, 70), match=' '>
<re.Match object; span=(74, 75), match='\n'>
<re.Match object; span=(75, 76), match='\n'>
<re.Match object; span=(90, 91), match=' '>
<re.Match object; span=(91, 92), match='('>
<re.Match object; span=(96, 97), match=' '>
<re.Match object; span=(99, 100), match=' '>
... (output truncated)


#### Q: match every word ending in at, except those starting with b or h

In [320]:
eg_text="""
cat
mat
hat
bat
"""

In [321]:
pattern = re.compile(r'[^bh]at')

matches = pattern.finditer(eg_text)

for match in matches:
    print(match)

<re.Match object; span=(1, 4), match='cat'>
<re.Match object; span=(5, 8), match='mat'>
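
one thing to watch for: `[^bh]` means any character except b or h, and that includes spaces and punctuation. a sketch on a made-up sentence, with a second pattern that lists the allowed letters explicitly (every lowercase letter except b and h):

```python
import re

sample = 'cat hat look at'  # hypothetical sample text

# the negated set also accepts the space in front of the standalone 'at'
loose = re.findall(r'[^bh]at', sample)

# a to z without b and h, so only real letters are accepted
letters_only = re.findall(r'[ac-gi-z]at', sample)

print(loose)
print(letters_only)
```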


## quantifiers 
to match more than one character

```
*       - 0 or More
+       - 1 or More
?       - 0 or One
{3}     - Exact Number
{3,4}   - Range of Numbers (Minimum, Maximum)
```
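
a short sketch of each quantifier from the table, applied to the letter b. the test strings and the helper `matching` are made up for illustration, and `re.fullmatch` is used so a string only counts when the whole of it fits the pattern.

```python
import re

samples = ['ac', 'abc', 'abbc', 'abbbc', 'abbbbc']  # hypothetical test strings

def matching(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

star = matching(r'ab*c')         # 0 or more b
plus = matching(r'ab+c')         # 1 or more b
optional = matching(r'ab?c')     # 0 or 1 b
exact = matching(r'ab{3}c')      # exactly 3 b
ranged = matching(r'ab{3,4}c')   # 3 to 4 b

for name, result in [('*', star), ('+', plus), ('?', optional),
                     ('{3}', exact), ('{3,4}', ranged)]:
    print(name, result)
```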

In [322]:
pattern = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')

writing \d for each and every digit as in the code above is error-prone, so we can use a quantifier instead

In [323]:
pattern = re.compile(r'\d{3}.\d{3}.\d{4}')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(155, 167), match='321-555-4321'>
<re.Match object; span=(168, 180), match='123.555.1234'>
<re.Match object; span=(181, 193), match='123*555*1234'>
<re.Match object; span=(194, 206), match='800-555-1234'>
<re.Match object; span=(207, 219), match='900-555-1234'>


at times we don't know the exact number of characters in advance, or a character is optional

eg: Mr or Mr. or Mrs or Ms

#### Q: match names that start with Mr or Mr.

In [324]:
text_eg = """
Mr. Yash
Mr Yash
Ms. Vxyz
Ms. Vxyz
Mrs. Vxyz
Mrs. Vxyz
"""

In [325]:
pattern = re.compile(r'Mr\.?\s[A-Z]\w*')

matches = pattern.finditer(text_eg)

for match in matches:
    print(match)

<re.Match object; span=(1, 9), match='Mr. Yash'>
<re.Match object; span=(10, 17), match='Mr Yash'>


## groups

allow us to match several different patterns  
treat multiple characters as a single unit

#### Q: match names that start with Mr, Mr., Mrs, Mrs., Ms or Ms.

In [326]:
pattern = re.compile(r'M(r|s|rs)\.?\s[A-Z]\w*')

matches = pattern.finditer(text_eg)

for match in matches:
    print(match)

<re.Match object; span=(1, 9), match='Mr. Yash'>
<re.Match object; span=(10, 17), match='Mr Yash'>
<re.Match object; span=(18, 26), match='Ms. Vxyz'>
<re.Match object; span=(27, 35), match='Ms. Vxyz'>
<re.Match object; span=(36, 45), match='Mrs. Vxyz'>
<re.Match object; span=(46, 55), match='Mrs. Vxyz'>


(r|s|rs) - can either be **r** or **s** or **rs**

| represents **OR**
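
a sketch of the "single unit" idea, using the `Ha HaHa` line from `text_to_search`: a quantifier after a group repeats the whole group, while without the group it only repeats the last character.

```python
import re

sample = 'Ha HaHa'

# (Ha)+ repeats the whole unit 'Ha', so 'HaHa' is found as one match
grouped = [m.group() for m in re.finditer(r'(Ha)+', sample)]

# Ha+ only repeats the 'a', so every 'Ha' is a separate match
ungrouped = [m.group() for m in re.finditer(r'Ha+', sample)]

print(grouped)
print(ungrouped)
```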

#### Q: match these emails which are very different

CoreyMSchafer@gmail.com  
corey.schafer@university.edu  
corey-321-schafer@my-work.net  

In [327]:
emails = '''
CoreyMSchafer@gmail.com
corey.schafer@university.edu
corey-321-schafer@my-work.net
'''

In [328]:
pattern = re.compile(r'[a-zA-Z]+@[a-zA-Z]+\.com')

matches = pattern.finditer(emails)

for match in matches:
    print(match)

<re.Match object; span=(1, 24), match='CoreyMSchafer@gmail.com'>


to match second email we need to add  
. to character set  
&  
.edu to group along with .com

In [329]:
pattern = re.compile(r'[a-zA-Z.]+@[a-zA-Z]+\.(com|edu)')

matches = pattern.finditer(emails)

for match in matches:
    print(match)

<re.Match object; span=(1, 24), match='CoreyMSchafer@gmail.com'>
<re.Match object; span=(25, 53), match='corey.schafer@university.edu'>


to match third email we need to add  
\- and 0-9 to the first character set (and \- to the domain character set)  
&  
.net to group along with .com and .edu

In [330]:
pattern = re.compile(r'[a-zA-Z0-9.-]+@[a-zA-Z-]+\.(com|edu|net)')

matches = pattern.finditer(emails)

for match in matches:
    print(match)

<re.Match object; span=(1, 24), match='CoreyMSchafer@gmail.com'>
<re.Match object; span=(25, 53), match='corey.schafer@university.edu'>
<re.Match object; span=(54, 83), match='corey-321-schafer@my-work.net'>
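
to check the three patterns above side by side, a small sketch that counts how many of the emails each one accepts. it uses `re.fullmatch`, so an email only counts when the whole string fits the pattern.

```python
import re

emails = [
    'CoreyMSchafer@gmail.com',
    'corey.schafer@university.edu',
    'corey-321-schafer@my-work.net',
]

# the three patterns from the cells above, in the order they were built
patterns = [
    r'[a-zA-Z]+@[a-zA-Z]+\.com',
    r'[a-zA-Z.]+@[a-zA-Z]+\.(com|edu)',
    r'[a-zA-Z0-9.-]+@[a-zA-Z-]+\.(com|edu|net)',
]

# number of emails each pattern accepts as a whole string
accepted = [sum(1 for e in emails if re.fullmatch(p, e)) for p in patterns]

print(accepted)
```

these patterns are fitted to these three emails only, they are not a general email validator.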


### Q: strip the website to only get the domain name

In [331]:
urls = '''
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
'''

In [332]:
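pattern = re.compile(r'https?://(www\.)?\w+\.\w+') # first attempt: matches the whole url
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)') # adding subsections in groups (this pattern overwrites the first and is the one used below)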

matches = pattern.finditer(urls)

for match in matches:
    print(match)

<re.Match object; span=(1, 23), match='https://www.google.com'>
<re.Match object; span=(24, 42), match='http://coreyms.com'>
<re.Match object; span=(43, 62), match='https://youtube.com'>
<re.Match object; span=(63, 83), match='https://www.nasa.gov'>


we have added groups to existing pattern

we can access them using indexing

In [333]:
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)') # adding subsections in groups

matches = pattern.finditer(urls)

for match in matches:
    print(match.group(2))
    # print(match[2]) ## this also works!

google
coreyms
youtube
nasa
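
a sketch of a few other ways to read groups from a match. `group(0)` and `groups()` are part of the match object, and the names `domain` and `tld` are hypothetical labels added here with the `(?P<name>...)` syntax, they are not in the pattern above.

```python
import re

# same pattern as above, with names given to the second and third group
pattern = re.compile(r'https?://(www\.)?(?P<domain>\w+)(?P<tld>\.\w+)')

with_www = pattern.search('https://www.nasa.gov')
without_www = pattern.search('http://coreyms.com')

print(with_www.group(0))      # group 0 is always the whole match
print(with_www.groups())      # all groups as a tuple
print(without_www.groups())   # an optional group that did not take part is None
print(without_www.group('domain'), without_www.group('tld'))
```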


## substituting values

use back-references (\1, \2, ...) in the replacement text to reference the groups

pattern.sub(replacement, text to search)

In [334]:
subbed_urls = pattern.sub(r'\2\3', urls)

print(subbed_urls)


google.com
coreyms.com
youtube.com
nasa.gov
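
the same idea works for the phone numbers from earlier. a minimal sketch that captures the three digit blocks and rewrites every number with - as the separator (the grouping is an addition for this example):

```python
import re

numbers = '321-555-4321\n123.555.1234\n123*555*1234'

# three groups for the digit blocks, a character set for the separators
pattern = re.compile(r'(\d{3})[-.*](\d{3})[-.*](\d{4})')

# \1, \2 and \3 put the captured digit blocks back, joined with -
normalized = pattern.sub(r'\1-\2-\3', numbers)

print(normalized)
```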



## findall - just return matches with no extra information

**IF THE PATTERN HAS GROUPS, IT ONLY RETURNS THE GROUPS**

In [335]:
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)') # adding subsections in groups

matches = pattern.findall(urls)

for match in matches:
    print(match)

('www.', 'google', '.com')
('', 'coreyms', '.com')
('', 'youtube', '.com')
('www.', 'nasa', '.gov')


**IF THERE ARE NO GROUPS, IT WILL JUST RETURN LIST OF STRINGS**

In [336]:
pattern = re.compile(r'\d{3}.\d{3}.\d{4}')

matches = pattern.findall(text_to_search)

for match in matches:
    print(match)

321-555-4321
123.555.1234
123*555*1234
800-555-1234
900-555-1234
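
a compact sketch of the three shapes `findall` can return, depending on the number of groups in the pattern (the sample string and patterns are made up for illustration):

```python
import re

sample = 'http://coreyms.com https://www.nasa.gov'

no_groups = re.findall(r'https?://\S+', sample)       # list of whole matches
one_group = re.findall(r'https?://(\S+)', sample)     # list of that group's text
two_groups = re.findall(r'(https?)://(\S+)', sample)  # list of tuples, one item per group

print(no_groups)
print(one_group)
print(two_groups)
```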


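## match - only checks the start of the string, returns a match object if found else None

a pattern starting with ^ used with search() behaves much the same, but match() is a separate method that always anchors at the start of the string (even with re.MULTILINE)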

In [337]:
sentence = 'start something just start'

In [338]:
pattern = re.compile(r'start') 

exists = pattern.match(sentence)

print(exists)

<re.Match object; span=(0, 5), match='start'>


In [339]:
pattern = re.compile(r'not-there') 

exists = pattern.match(sentence)

print(exists)

None


## search - searches the whole string, returns a match object if found else None

only returns the first occurrence

In [340]:
pattern = re.compile(r'just') 

exists = pattern.search(sentence)

print(exists)

<re.Match object; span=(16, 20), match='just'>


In [341]:
pattern = re.compile(r'not-there') 

exists = pattern.search(sentence)

print(exists)

None
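
a side-by-side sketch of `match`, `search` and `fullmatch` (the last one is not covered above, it only succeeds when the entire string fits the pattern). since a failed lookup returns None, the result can be used directly in an `if`.

```python
import re

sentence = 'start something just start'
pattern = re.compile(r'just')

print(pattern.match(sentence))      # None, 'just' is not at position 0
print(pattern.search(sentence))     # found further inside the string
print(pattern.fullmatch(sentence))  # None, the string is more than just 'just'
print(pattern.fullmatch('just'))    # the whole string fits

# None is falsy and a match object is truthy
if pattern.search(sentence):
    print('found it')
```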


## flags

special options that change how the regex behaves

In [None]:
sentence = 'Start something just start'

eg: we want to match the _start_ word but every character can be either capital or small

In [343]:
pattern = re.compile(r'start', re.IGNORECASE) # add a flag to ignore cases

exists = pattern.search(sentence)

print(exists)

<re.Match object; span=(0, 5), match='Start'>


there are many other flags, eg: re.MULTILINE (^ and $ also match at every line) and re.DOTALL (. also matches a newline)
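
flags can be combined with `|`. a small sketch on a made-up two-line text, showing how `re.IGNORECASE` and `re.MULTILINE` change what `^start` finds:

```python
import re

text = 'Start here\nstart again'  # hypothetical two-line text

plain = re.findall(r'^start', text)                                # case-sensitive, start of string only
ignore = re.findall(r'^start', text, re.IGNORECASE)                # any case, start of string only
multi = re.findall(r'^start', text, re.IGNORECASE | re.MULTILINE)  # any case, start of every line

print(plain)
print(ignore)
print(multi)
```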