## Regular Expressions - 1 

#### A raw string in Python is a normal string literal prefixed with an r, which tells Python not to treat \ as the start of an escape sequence.

In [1]:
print("\tc1\tc2")

	c1	c2


In [2]:
print(r"\tc1\tc2")     # raw string

\tc1\tc2
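
Raw strings matter for regular expressions because a sequence such as `\b` means one thing to Python and another to the regex engine. A minimal sketch of the difference, using a made-up `sample` string:

```python
import re

sample = "a word here"   # hypothetical sample text

# In a normal string, "\b" is the backspace character,
# so this pattern looks for a literal backspace followed by "word".
print(re.findall("\bword", sample))    # []

# In a raw string the backslash reaches the regex engine intact,
# where \b means "word boundary".
print(re.findall(r"\bword", sample))   # ['word']

# The two pattern strings are not even the same length.
print(len("\bword"), len(r"\bword"))   # 5 6
```

This is why the patterns in the rest of this notebook are written as raw strings.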


In [3]:
import re

In [4]:
random_string = """Dolores wouldn't FROM eaten the meal if she had known what it actually was.
The tortoise jumped into the lake with dreams of becoming a sea turtle. Garlic ice-cream was her favorite.
She looked at the masterpiece hanging in the museum but all she could think is that her five-year-old could do better.
There from have been days from //-2586

lol lollol

782-665-56
061.564.58
100-150-55

MetaCharacters (Need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )

Mr. schafer
Mr Y
Mrs. Davis

helloworld@python.org"""

### Let's try to find a pattern in the random string that we created

In [5]:
# Creating the pattern variable using compile()

pattern = re.compile(r'from')

# Creating an object of all the possible matches

matches = pattern.finditer(random_string)


# Printing all the matches

for match in matches:
    print(match)

<re.Match object; span=(308, 312), match='from'>
<re.Match object; span=(328, 332), match='from'>


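***The spans of the matches are (308, 312) and (328, 332) respectively. The uppercase FROM in the first line is not matched because the pattern is case-sensitive. Slicing the string with a span returns the matched text.***

In [6]:
random_string[308:312]

'from'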
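
Each match object exposes its position through `span()` and its text through `group()`. The sketch below uses a shortened, made-up sentence (`text`) rather than the full sample string, and also shows the `re.IGNORECASE` flag as one way to pick up the uppercase spelling:

```python
import re

text = "There from have been days FROM now"   # hypothetical sample text

pattern = re.compile(r'from')

for match in pattern.finditer(text):
    start, end = match.span()
    # Slicing with the span returns exactly the matched text.
    print(start, end, text[start:end], match.group())   # 6 10 from from

# Matching is case-sensitive by default, so 'FROM' is skipped above.
# The re.IGNORECASE flag relaxes that.
insensitive = re.compile(r'from', re.IGNORECASE)
found = [m.group() for m in insensitive.finditer(text)]
print(found)   # ['from', 'FROM']
```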

#### If we want to find a metacharacter, we need to escape it.

In [7]:
# Creating the pattern variable using compile()

pattern = re.compile(r'\.')

# Creating an object of all the possible matches

matches = pattern.finditer(random_string)


# Printing all the matches

for match in matches:
    print(match)

<re.Match object; span=(74, 75), match='.'>
<re.Match object; span=(146, 147), match='.'>
<re.Match object; span=(181, 182), match='.'>
<re.Match object; span=(300, 301), match='.'>
<re.Match object; span=(368, 369), match='.'>
<re.Match object; span=(372, 373), match='.'>
<re.Match object; span=(425, 426), match='.'>
<re.Match object; span=(456, 457), match='.'>
<re.Match object; span=(474, 475), match='.'>
<re.Match object; span=(500, 501), match='.'>
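
To see what escaping changes, it can help to run the same search with and without the backslash. The following is a small sketch on a made-up string; `re.escape()` is the standard-library helper that adds the backslashes when the text to search for is held in a variable:

```python
import re

text = "python.org costs $5"   # hypothetical sample text

# Unescaped, the dot is a wildcard and matches every character here.
print(len(re.findall(r'.', text)) == len(text))   # True

# Escaped, it matches only literal dots.
print(re.findall(r'\.', text))                    # ['.']

# re.escape() escapes the metacharacters in an arbitrary string.
literal = "$5"
print(re.escape(literal))                         # \$5
print(re.findall(re.escape(literal), text))       # ['$5']
```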


#### To match more general patterns, we are going to use some of the metacharacters and special sequences.

<pre>
i)    . - Any character except new line.
ii)  \d - Digit (0-9)
iii) \D - Not a Digit (0-9)
iv)  \w - Word Character (a-z, A-Z, 0-9, _)
v)   \W - Not a word character
vi)  \s - Whitespace (space, tab, newline)
vii) \S - Not Whitespace (space, tab, newline)

</pre>

The dot matches any character except a newline. Each of the other sequences matches the class of characters described next to it, and the uppercase forms (\D, \W, \S) are the negations of the lowercase ones.
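
A compact way to compare these classes is to run each one over the same short string with `re.findall()`. The string below is a made-up example, not part of the sample text:

```python
import re

text = "Room 42,\tfloor_3"   # hypothetical sample text

print(re.findall(r'\d', text))        # ['4', '2', '3']
print(re.findall(r'\s', text))        # [' ', '\t']
print(re.findall(r'\W', text))        # [' ', ',', '\t']

# \w covers letters, digits and the underscore, so it matches
# everything that \W did not.
print(len(re.findall(r'\w', text)))   # 13
```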

#### Let's try to see a few in action.

In [8]:
pattern = re.compile(r'\d')
matches = pattern.finditer(random_string)

for match in matches:     # Prints all digits
    print(match)

<re.Match object; span=(336, 337), match='2'>
<re.Match object; span=(337, 338), match='5'>
<re.Match object; span=(338, 339), match='8'>
<re.Match object; span=(339, 340), match='6'>
<re.Match object; span=(354, 355), match='7'>
<re.Match object; span=(355, 356), match='8'>
<re.Match object; span=(356, 357), match='2'>
<re.Match object; span=(358, 359), match='6'>
<re.Match object; span=(359, 360), match='6'>
<re.Match object; span=(360, 361), match='5'>
<re.Match object; span=(362, 363), match='5'>
<re.Match object; span=(363, 364), match='6'>
<re.Match object; span=(365, 366), match='0'>
<re.Match object; span=(366, 367), match='6'>
<re.Match object; span=(367, 368), match='1'>
<re.Match object; span=(369, 370), match='5'>
<re.Match object; span=(370, 371), match='6'>
<re.Match object; span=(371, 372), match='4'>
<re.Match object; span=(373, 374), match='5'>
<re.Match object; span=(374, 375), match='8'>
<re.Match object; span=(376, 377), match='1'>
<re.Match object; span=(377, 378),

In [9]:
pattern = re.compile(r'\W')
matches = pattern.finditer(random_string)

for match in matches:     # Prints all non-word characters
    print(match)

<re.Match object; span=(7, 8), match=' '>
<re.Match object; span=(14, 15), match="'">
<re.Match object; span=(16, 17), match=' '>
<re.Match object; span=(21, 22), match=' '>
<re.Match object; span=(27, 28), match=' '>
<re.Match object; span=(31, 32), match=' '>
<re.Match object; span=(36, 37), match=' '>
<re.Match object; span=(39, 40), match=' '>
<re.Match object; span=(43, 44), match=' '>
<re.Match object; span=(47, 48), match=' '>
<re.Match object; span=(53, 54), match=' '>
<re.Match object; span=(58, 59), match=' '>
<re.Match object; span=(61, 62), match=' '>
<re.Match object; span=(70, 71), match=' '>
<re.Match object; span=(74, 75), match='.'>
<re.Match object; span=(75, 76), match='\n'>
<re.Match object; span=(79, 80), match=' '>
<re.Match object; span=(88, 89), match=' '>
<re.Match object; span=(95, 96), match=' '>
<re.Match object; span=(100, 101), match=' '>
<re.Match object; span=(104, 105), match=' '>
<re.Match object; span=(109, 110), match=' '>
<re.Match object; span=(114

<br><br>

#### The next set are called anchors. They don't match any characters; they match invisible positions before or after characters, and we can use them together with the other patterns we are searching for.

<pre>
i)   \b - Word Boundary
ii)  \B - Not a Word Boundary
iii)  ^ - Beginning of a String
iv)   $ - End of a String
</pre>

In [10]:
pattern = re.compile(r'\blol')
matches = pattern.finditer(random_string)

for match in matches:     # Finds every 'lol' that starts at a word boundary
    print(match)

<re.Match object; span=(342, 345), match='lol'>
<re.Match object; span=(346, 349), match='lol'>


**Hence, we can see that it gave the location of only those 'lol' which start at a word boundary. If we want to get the other 'lol', we can use \B**

In [11]:
pattern = re.compile(r'\Blol')
matches = pattern.finditer(random_string)

for match in matches:     #This is going to find all the 'lol' except ones at word boundary
    print(match)

<re.Match object; span=(349, 352), match='lol'>
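
The same comparison can be reproduced on just the `lol lollol` line. In this sketch `spans` is a hypothetical helper introduced only for illustration, and the last call adds a trailing `\b` to keep only the standalone word:

```python
import re

text = "lol lollol"

def spans(regex):
    """Return the span of every match of regex in text."""
    return [m.span() for m in re.finditer(regex, text)]

print(spans(r'\blol'))     # [(0, 3), (4, 7)]  starts at a word boundary
print(spans(r'\Blol'))     # [(7, 10)]         starts inside a word
print(spans(r'\blol\b'))   # [(0, 3)]          boundary on both sides
```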


#### ^ checks whether the pattern is present at the start of the string, and $ checks whether it is present at the end of the string.

In [12]:
pattern = re.compile(r'^Dolo')
matches = pattern.finditer(random_string)

for match in matches:     
    print(match)

<re.Match object; span=(0, 4), match='Dolo'>
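
The `$` anchor works the same way at the other end. The sketch below uses a shortened copy of the tail of the sample string, and additionally shows the `re.MULTILINE` flag, which makes `^` and `$` match at the start and end of every line rather than only of the whole string:

```python
import re

# Shortened copy of the end of the sample string.
text = "Mr Y\nMrs. Davis\n\nhelloworld@python.org"

print(re.findall(r'org$', text))     # ['org']  the string ends with 'org'
print(re.findall(r'Davis$', text))   # []       'Davis' only ends a line
print(re.findall(r'^Mrs', text))     # []       'Mrs' only starts a line

# With re.MULTILINE the anchors apply per line.
print(re.findall(r'Davis$', text, re.MULTILINE))   # ['Davis']
print(re.findall(r'^Mrs', text, re.MULTILINE))     # ['Mrs']
```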


#### Let's say we want to match the phone numbers (if any) present in the string, assuming that we know the pattern in which they appear: 000-000-00.

In [13]:
pattern = re.compile(r'\d\d\d.\d\d\d.\d\d')
matches = pattern.finditer(random_string)

for match in matches:     
    print(match)

<re.Match object; span=(354, 364), match='782-665-56'>
<re.Match object; span=(365, 375), match='061.564.58'>
<re.Match object; span=(376, 386), match='100-150-55'>


##### In the above, we added three <b>\d</b> to get the first 3 digits; we added a <b>.</b> to match any character that could be present after the first 3 digits; we added three <b>\d</b> again to match the next 3 digits; <b>.</b> to match the next character and so on...
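
Because the unescaped dot accepts any character, this pattern is looser than it may look. A quick check, where the third candidate is a made-up string that is not in the sample text:

```python
import re

candidates = ["782-665-56", "061.564.58", "782x665y56"]   # last one is made up

loose = re.compile(r'\d\d\d.\d\d\d.\d\d')

for text in candidates:
    # fullmatch() succeeds only if the whole string fits the pattern.
    print(text, bool(loose.fullmatch(text)))   # True for all three
```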

<br>

### Character Sets:

<pre>
[]    -  Matches characters in brackets.
[^ ]  -  Matches characters not in brackets.
|     -  Either or
()    -  Group 
</pre>

#### Now let's say that we want to match only the phone numbers with - in between. Using <b>.</b> would match any character, so we use a character set containing only - instead.

In [14]:
pattern = re.compile(r'\d\d\d[-]\d\d\d[-]\d\d')
matches = pattern.finditer(random_string)

for match in matches:     
    print(match)

<re.Match object; span=(354, 364), match='782-665-56'>
<re.Match object; span=(376, 386), match='100-150-55'>
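
A character set can hold more than one alternative. As a sketch, `[-.]` accepts either separator used in the sample numbers and nothing else; inside a set the dot is a literal dot, and a hyphen placed first is a literal hyphen rather than a range. The third number below is made up for the comparison:

```python
import re

numbers = ["782-665-56", "061.564.58", "100*150*55"]   # last one is made up

dash_or_dot = re.compile(r'\d\d\d[-.]\d\d\d[-.]\d\d')

results = [bool(dash_or_dot.fullmatch(n)) for n in numbers]
print(results)   # [True, True, False]
```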


#### Now, let's say that we want to get only those numbers which begin with 100. That's again a good use case for a character set.

In [15]:
pattern = re.compile(r'[1]00[-]\d\d\d[-]\d\d')

matches = pattern.finditer(random_string)

for match in matches:
    print(match)

<re.Match object; span=(376, 386), match='100-150-55'>


#### Let's say you want to match everything except a-z and A-Z. Putting a caret at the start of the character set negates whatever is written inside it.

In [16]:
pattern = re.compile(r'[^a-zA-Z]') 
matches = pattern.finditer(random_string)

for match in matches:     
    print(match)

<re.Match object; span=(7, 8), match=' '>
<re.Match object; span=(14, 15), match="'">
<re.Match object; span=(16, 17), match=' '>
<re.Match object; span=(21, 22), match=' '>
<re.Match object; span=(27, 28), match=' '>
<re.Match object; span=(31, 32), match=' '>
<re.Match object; span=(36, 37), match=' '>
<re.Match object; span=(39, 40), match=' '>
<re.Match object; span=(43, 44), match=' '>
<re.Match object; span=(47, 48), match=' '>
<re.Match object; span=(53, 54), match=' '>
<re.Match object; span=(58, 59), match=' '>
<re.Match object; span=(61, 62), match=' '>
<re.Match object; span=(70, 71), match=' '>
<re.Match object; span=(74, 75), match='.'>
<re.Match object; span=(75, 76), match='\n'>
<re.Match object; span=(79, 80), match=' '>
<re.Match object; span=(88, 89), match=' '>
<re.Match object; span=(95, 96), match=' '>
<re.Match object; span=(100, 101), match=' '>
<re.Match object; span=(104, 105), match=' '>
<re.Match object; span=(109, 110), match=' '>
<re.Match object; span=(114
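
The effect of a negated set is easier to read on a short string. A minimal sketch with made-up text, which also shows that the caret only negates when it comes first in the set:

```python
import re

text = "ice-cream, 2 scoops!"   # hypothetical sample text

# Everything that is NOT an ASCII letter: punctuation, spaces and digits.
non_letters = re.findall(r'[^a-zA-Z]', text)
print(non_letters)   # ['-', ',', ' ', '2', ' ', '!']

# Anywhere other than first position, the caret is an ordinary character.
print(re.findall(r'[a^]', "a^b"))   # ['a', '^']
```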

### Quantifiers:

<pre>
*     -   0 or more
+     -   1 or more
?     -   0 or one
{3}   -   Exact number
{3,4} -   Range of Numbers (Minimum, Maximum)
</pre>
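
Each quantifier applies to the item immediately before it. The sketch below runs all five forms over one made-up string, with `\b` on both sides so that only whole words are returned:

```python
import re

text = "a ab abb abbb abbbb"   # hypothetical sample text

print(re.findall(r'\bab*\b', text))      # ['a', 'ab', 'abb', 'abbb', 'abbbb']
print(re.findall(r'\bab+\b', text))      # ['ab', 'abb', 'abbb', 'abbbb']
print(re.findall(r'\bab?\b', text))      # ['a', 'ab']
print(re.findall(r'\bab{3}\b', text))    # ['abbb']
print(re.findall(r'\bab{2,3}\b', text))  # ['abb', 'abbb']
```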

#### In the previous example, trying to match a phone number with multiple digits is a tedious task and is prone to mistakes if the number is large. That's where we can use quantifiers.

In [17]:
pattern = re.compile(r'\d{3}.\d{3}.\d{2}')

matches = pattern.finditer(random_string)

for match in matches:
    print(match)

<re.Match object; span=(354, 364), match='782-665-56'>
<re.Match object; span=(365, 375), match='061.564.58'>
<re.Match object; span=(376, 386), match='100-150-55'>


#### To match every Mr, with or without the trailing period, we can do the following. Note that the pattern also matches the Mr inside Mrs.

In [18]:
pattern = re.compile(r'Mr\.?')  # Question mark ensures that the . is optional

matches = pattern.finditer(random_string)

for match in matches:
    print(match)

<re.Match object; span=(454, 457), match='Mr.'>
<re.Match object; span=(466, 468), match='Mr'>
<re.Match object; span=(471, 473), match='Mr'>


#### Let's say that we want to match the complete name now.

In [19]:
pattern = re.compile(r'Mr\.?\s[a-zA-Z]\w*')

matches = pattern.finditer(random_string)

for match in matches:
    print(match)

<re.Match object; span=(454, 465), match='Mr. schafer'>
<re.Match object; span=(466, 470), match='Mr Y'>


#### How do we also include Mrs? We can use a group.

In [20]:
pattern = re.compile(r'M(r|s|rs)\.?\s[a-zA-Z]\w*')

matches = pattern.finditer(random_string)

for match in matches:
    print(match)

<re.Match object; span=(454, 465), match='Mr. schafer'>
<re.Match object; span=(466, 470), match='Mr Y'>
<re.Match object; span=(471, 481), match='Mrs. Davis'>
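
Parentheses also capture what they matched, which changes what some functions return. The sketch below reuses the pattern above on a shortened copy of the names, with a made-up `Ms Smith` line added to exercise the `s` alternative; `titles` is a hypothetical variant that uses a non-capturing group `(?:...)`:

```python
import re

# 'Ms Smith' is a made-up addition; the other names come from the sample.
text = "Mr. schafer\nMr Y\nMrs. Davis\nMs Smith"

pattern = re.compile(r'M(r|s|rs)\.?\s[a-zA-Z]\w*')

for match in pattern.finditer(text):
    # group(0) is the whole match; group(1) is what the parentheses captured.
    print(match.group(0), '->', match.group(1))

# With a capturing group, findall() returns only the captured part.
print(pattern.findall(text))   # ['r', 'r', 'rs', 's']

# A non-capturing group still groups the alternatives,
# but findall() then returns the whole matches.
titles = re.compile(r'M(?:r|s|rs)\.?\s[a-zA-Z]\w*')
print(titles.findall(text))    # ['Mr. schafer', 'Mr Y', 'Mrs. Davis', 'Ms Smith']
```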
