#### Regex Review.

RegEx is short for regular expression: a pattern to search for in a string.
The following symbols have special meaning in regex patterns.

•	 The ? matches zero or one of the preceding group.

•	 The * matches zero or more of the preceding group.

•	 The + matches one or more of the preceding group.

•	 The {n} matches exactly n of the preceding group.

•	 The {n,} matches n or more of the preceding group.

•	 The {,m} matches 0 to m of the preceding group.

•	 The {n,m} matches at least n and at most m of the preceding group.

•	 {n,m}? or *? or +? performs a non-greedy match of the preceding group.

•	 ^spam means the string must begin with spam.

•	 spam$ means the string must end with spam.

•	 The . matches any character, except newline characters.

•	 \d, \w, and \s match a digit, word, or space character, respectively.

•	 \D, \W, and \S match anything except a digit, word, or space character,
respectively.

•	 [abc] matches any character between the brackets (such as a, b, or c).

•	 [^abc] matches any character that isn’t between the brackets
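The quantifier and character-set rules above can be checked on short strings. The sketch below is only an illustration; the sample strings and variable names are made up for this example.

```python
import re

# Greedy: .+ grabs as much as it can, so it runs to the last '>'.
html = '<b>bold</b>'
greedy = re.search(r'<.+>', html).group()

# Non-greedy: .+? stops at the first '>' that lets the match succeed.
lazy = re.search(r'<.+?>', html).group()

# {n,m} takes between n and m repetitions, preferring as many as possible.
digits = re.search(r'\d{2,4}', '123456').group()

# [^abc] matches one character that is NOT a, b, or c.
not_abc = re.findall(r'[^abc]', 'aXbYc')

print(greedy)   # <b>bold</b>
print(lazy)     # <b>
print(digits)   # 1234
print(not_abc)  # ['X', 'Y']
```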




### Patterns Examples
'abc'                 # Matches the literal 'abc'

r'\w+'                # Matches a word

r'\d+\s\w+'           # Matches '1555 dogs'

r'\d+-\d+-\d+\s.+'    # Matches '2005-12-05 Jons birthday'

r'\s+'                # Matches one or more whitespace characters
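As a quick check, each example pattern can be tested against a sample string with `re.fullmatch`. The samples for the first, third, and fourth patterns come from the comments above; `'dogs'` and `' \t\n'` are assumed samples added for illustration.

```python
import re

# Each pair is (pattern, sample text).
examples = [
    (r'abc', 'abc'),
    (r'\w+', 'dogs'),
    (r'\d+\s\w+', '1555 dogs'),
    (r'\d+-\d+-\d+\s.+', '2005-12-05 Jons birthday'),
    (r'\s+', ' \t\n'),
]

# fullmatch() succeeds only when the pattern covers the whole string.
results = [re.fullmatch(pattern, text) is not None for pattern, text in examples]
print(results)  # [True, True, True, True, True]
```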





### Re Module 

re.search (<regex>, <string>)
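A minimal sketch of how `re.search` behaves: it scans the string and returns a match object for the first match, or `None` when nothing matches. The sample strings are made up for illustration.

```python
import re

# The first run of digits is found; later ones are ignored by search().
found = re.search(r'\d+', 'order 66 shipped on day 7')
# No digits at all, so search() returns None.
missing = re.search(r'\d+', 'no digits here')

if found:
    print(found.group(), found.span())  # 66 (6, 8)
print(missing)  # None
```

Because the result can be `None`, it is worth testing it before calling `.group()` or `.span()`.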

    
#### Import the re module and let's start practicing

In [3]:
import re

In [58]:
text_to_search = '''
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890

Ha HaHa

MetaCharacters (Need to be Escaped during Search)

. ^ $ * + ? { } [] \ | ( )

nitinrana.blogspot.com 

123-456-7890
123.123.4567
123*369*4567
123+369-4567
800.123.4567
900.555.4567

Mr. Nitin 
Mr Nitin_Rana

Ms Smith
Mr Rana 
Mrs Rana 

Mr. T

cat
mat
pat
bat

'''




### A raw string (prefix r) tells Python not to treat the backslash (\) as an escape character

In [8]:
print('\tTab')  # A normal string: \t is printed as a tab

	Tab
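For comparison, here is a small sketch of the same text written as a normal string and as a raw string. The variable names are chosen only for this illustration.

```python
# A normal string turns \t into one tab character.
normal = '\tTab'
# A raw string keeps the backslash and the 't' as two separate characters.
raw = r'\tTab'

print(len(normal))  # 4
print(len(raw))     # 5
print(raw)          # \tTab
```

This is why regex patterns are usually written as raw strings: sequences such as `\d` or `\b` reach the `re` module unchanged.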


#### Compile the pattern with re.compile() and store it in a variable so it can be reused for later searches


In [9]:
pattern = re.compile(r'abc')  # Use a raw string for regex patterns

matches = pattern.finditer(text_to_search)  # finditer() returns an iterator over all matches

for match in matches:
    print(match)

<re.Match object; span=(1, 4), match='abc'>


In [10]:
# Span gives the start and end index location
print(text_to_search[1:4])

abc
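Besides printing the match object, its parts can be read directly. The sketch below uses a short stand-in string (an assumption, not the full `text_to_search`) that also starts with a newline, so the span is the same.

```python
import re

text = '\nabcdef'  # hypothetical stand-in with a leading newline
match = re.compile(r'abc').search(text)

print(match.span())   # (1, 4)
print(match.start())  # 1
print(match.end())    # 4
print(match.group())  # abc

# Slicing the text with the span recovers the matched string.
same = text[match.start():match.end()] == match.group()
print(same)  # True
```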


In [12]:
pattern = re.compile(r'.')  # . matches any character except a newline

matches = pattern.finditer(text_to_search)  # finditer() returns an iterator over all matches

for match in matches:
    print(match)

<re.Match object; span=(1, 2), match='a'>
<re.Match object; span=(2, 3), match='b'>
<re.Match object; span=(3, 4), match='c'>
<re.Match object; span=(4, 5), match='d'>
<re.Match object; span=(5, 6), match='e'>
<re.Match object; span=(6, 7), match='f'>
<re.Match object; span=(7, 8), match='g'>
<re.Match object; span=(8, 9), match='h'>
<re.Match object; span=(9, 10), match='i'>
<re.Match object; span=(10, 11), match='j'>
<re.Match object; span=(11, 12), match='k'>
<re.Match object; span=(12, 13), match='l'>
<re.Match object; span=(13, 14), match='m'>
<re.Match object; span=(14, 15), match='n'>
<re.Match object; span=(15, 16), match='o'>
<re.Match object; span=(16, 17), match='p'>
<re.Match object; span=(17, 18), match='q'>
<re.Match object; span=(18, 19), match='r'>
<re.Match object; span=(19, 20), match='s'>
<re.Match object; span=(20, 21), match='t'>
<re.Match object; span=(21, 22), match='u'>
<re.Match object; span=(22, 23), match='v'>
<re.Match object; span=(23, 24), match='w'>
... (output truncated)

In [13]:
pattern = re.compile(r'\.')  # Escape the dot to search for a literal ., since . is a special character in RegEx

matches = pattern.finditer(text_to_search)  # finditer() returns an iterator over all matches

for match in matches:
    print(match)

<re.Match object; span=(127, 128), match='.'>
<re.Match object; span=(164, 165), match='.'>
<re.Match object; span=(173, 174), match='.'>
<re.Match object; span=(195, 196), match='.'>
<re.Match object; span=(199, 200), match='.'>
<re.Match object; span=(208, 209), match='.'>
<re.Match object; span=(249, 250), match='.'>


In [14]:
pattern = re.compile(r'nitinrana\.blogspot\.com')  # Escape each dot so it matches only a literal .

matches = pattern.finditer(text_to_search)  # finditer() returns an iterator over all matches

for match in matches:
    print(match)

 

<re.Match object; span=(155, 177), match='nitinrana.blogspot.com'>


In [16]:
print(text_to_search[155:177])

nitinrana.blogspot.com
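Escaping by hand works, but the standard library also provides `re.escape`, which escapes the metacharacters in a literal string. The sketch below shows why the escaping matters; the look-alike string is made up for illustration.

```python
import re

literal = 'nitinrana.blogspot.com'
escaped = re.escape(literal)
print(escaped)  # nitinrana\.blogspot\.com

# Unescaped, each dot matches any character, so a look-alike slips through.
lookalike = 'nitinranaXblogspotYcom'
loose = bool(re.search(literal, lookalike))
strict = bool(re.search(escaped, lookalike))
print(loose)   # True
print(strict)  # False
```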




## Now Let's Look into Pattern Search

. - Any Character except New Line

\d - Digit (0-9)

\D - Not a Digit (0-9)

\w - Word Character (a-z, A-Z, 0-9, _)

\W - Not a Word Character

\s - Whitespace (Space, Tab, New Line)

\S - Not Whitespace (Space, Tab, New Line)

In [27]:
#pattern = re.compile(r'.')   # Any character except a newline
#pattern = re.compile(r'\d')  # Digits only
#pattern = re.compile(r'\D')  # Anything that is not a digit
#pattern = re.compile(r'\w')  # Word characters
#pattern = re.compile(r'\W')  # Anything that is not a word character
#pattern = re.compile(r'\s')  # Whitespace
pattern = re.compile(r'\S')   # Anything that is not whitespace

matches = pattern.finditer(text_to_search)  # finditer() returns an iterator over all matches

for match in matches:
    print(match)

<re.Match object; span=(1, 2), match='a'>
<re.Match object; span=(2, 3), match='b'>
<re.Match object; span=(3, 4), match='c'>
<re.Match object; span=(4, 5), match='d'>
<re.Match object; span=(5, 6), match='e'>
<re.Match object; span=(6, 7), match='f'>
<re.Match object; span=(7, 8), match='g'>
<re.Match object; span=(8, 9), match='h'>
<re.Match object; span=(9, 10), match='i'>
<re.Match object; span=(10, 11), match='j'>
<re.Match object; span=(11, 12), match='k'>
<re.Match object; span=(12, 13), match='l'>
<re.Match object; span=(13, 14), match='m'>
<re.Match object; span=(14, 15), match='n'>
<re.Match object; span=(15, 16), match='o'>
<re.Match object; span=(16, 17), match='p'>
<re.Match object; span=(17, 18), match='q'>
<re.Match object; span=(18, 19), match='r'>
<re.Match object; span=(19, 20), match='s'>
<re.Match object; span=(20, 21), match='t'>
<re.Match object; span=(21, 22), match='u'>
<re.Match object; span=(22, 23), match='v'>
<re.Match object; span=(23, 24), match='w'>
... (output truncated)

### Anchors in RegEx. Anchors do not match characters themselves; they match positions in the string and are used together with the patterns explained above

\b - Word Boundary

\B - Not a Word Boundary

^ - Beginning of the String

$ - End of the String



In [29]:
#pattern = re.compile(r'\bHa')  # Ha that starts at a word boundary
pattern = re.compile(r'\BHa')   # Ha that does not start at a word boundary

matches = pattern.finditer(text_to_search)  # finditer() returns an iterator over all matches

for match in matches:
    print(match)

<re.Match object; span=(72, 74), match='Ha'>
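To see the difference between `\b` and `\B` side by side, the sketch below runs both on just the `Ha HaHa` line. The spans are relative to this short string, not to `text_to_search`.

```python
import re

line = 'Ha HaHa'

# \bHa: 'Ha' preceded by the start of the string or a non-word character.
at_boundary = [m.span() for m in re.finditer(r'\bHa', line)]
# \BHa: 'Ha' preceded by another word character.
not_at_boundary = [m.span() for m in re.finditer(r'\BHa', line)]

print(at_boundary)      # [(0, 2), (3, 5)]
print(not_at_boundary)  # [(5, 7)]
```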


In [33]:
sentence = 'Start a sentence and then bring it to an end'
#pattern = re.compile(r'^Start')  # Matches only if the string begins with Start
pattern = re.compile(r'end$')     # Matches only if the string ends with end

matches = pattern.finditer(sentence)  # finditer() returns an iterator over all matches

for match in matches:
    print(match)

<re.Match object; span=(41, 44), match='end'>
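A short sketch of how `^` and `$` restrict where a match may occur, using the same sentence. The extra patterns `^a` and `a$` are added here only to show a failing case.

```python
import re

sentence = 'Start a sentence and then bring it to an end'

starts = re.search(r'^Start', sentence)
ends = re.search(r'end$', sentence)
# 'a' occurs in the sentence, but not at the very beginning or the very end.
a_at_start = re.search(r'^a', sentence)
a_at_end = re.search(r'a$', sentence)

print(starts.span())  # (0, 5)
print(ends.span())    # (41, 44)
print(a_at_start)     # None
print(a_at_end)       # None
```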


In [54]:
# Search for a phone number
#pattern = re.compile(r'\d\d\d')
#pattern = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d') # . matches any character as the separator, which may be too permissive
#pattern = re.compile(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d') # [-.] accepts only a dash (-) or a dot (.) as the separator
#pattern = re.compile(r'[89]00[-.]\d\d\d[-.]\d\d\d\d') # Matches phone numbers starting with 800 or 900

# Use a hyphen to define a range inside the character set []
#pattern = re.compile(r'[a-zA-Z]') # Matches a single character in a-z or A-Z
#pattern = re.compile(r'[^a-zA-Z]') # A caret (^) at the start of the character set negates it

# Find three-letter strings that end with "at", such as cat and mat, but not bat
#pattern = re.compile(r'[^b]at') # bat will not be found
pattern = re.compile(r'[b]at') # Only bat will be found

matches = pattern.finditer(text_to_search)  # finditer() returns an iterator over all matches

for match in matches:
    print(match)

<re.Match object; span=(324, 327), match='bat'>
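The character-set variants in the cell above can be compared quickly with `re.findall`, which returns the matched strings as a list. The short strings `words` and `phones` are assumed samples built from entries in `text_to_search`.

```python
import re

words = 'cat mat pat bat'
no_bat = re.findall(r'[^b]at', words)   # any character except b, then 'at'
only_bat = re.findall(r'[b]at', words)  # b, then 'at'
c_or_m = re.findall(r'[cm]at', words)   # c or m, then 'at'

phones = '123-456-7890 123.123.4567 123*369*4567'
# [-.] accepts only a dash or a dot, so the * separated number is skipped.
dash_or_dot = re.findall(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d', phones)

print(no_bat)       # ['cat', 'mat', 'pat']
print(only_bat)     # ['bat']
print(c_or_m)       # ['cat', 'mat']
print(dash_or_dot)  # ['123-456-7890', '123.123.4567']
```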


#### Use Quantifiers to Describe Repetition of a Pattern


###### Quantifier Listing

"*" - 0 or More

"+" - 1 or More

"?" - 0 or 1

{3} - Exact Number

{3,4} - Range of Numbers (Minimum, Maximum)

[] - A character set on its own always matches a single character; add a quantifier to repeat it


In [64]:
#pattern = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d') # Find all numbers in the XXX XXX XXXX format (Separator can be any character)

#pattern = re.compile(r'\d{3}.\d{3}.\d{4}') # Same search written with quantifiers


# Search for Mr/Ms
#pattern = re.compile(r'Mr\.') # Find every Mr followed by a literal dot
pattern = re.compile(r'Mr\.?\s[A-Z]\w*') # Mr, an optional dot, one whitespace character, a capital letter, then zero or more word characters

matches = pattern.finditer(text_to_search)  # finditer() returns an iterator over all matches

for match in matches:
    print(match)

<re.Match object; span=(259, 268), match='Mr. Nitin'>
<re.Match object; span=(270, 283), match='Mr Nitin_Rana'>
<re.Match object; span=(294, 301), match='Mr Rana'>
<re.Match object; span=(314, 319), match='Mr. T'>
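The sketch below checks two claims from this cell: that `\d{3}` is shorthand for `\d\d\d`, and that `\.?` makes the dot optional. The sample strings are assumptions; `Mr smith` is added only to show a non-match.

```python
import re

long_form = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')
short_form = re.compile(r'\d{3}.\d{3}.\d{4}')
phones = '123-456-7890\n123.123.4567\n123*369*4567'

# Both spellings describe the same pattern, so the results are identical.
same_result = long_form.findall(phones) == short_form.findall(phones)
numbers = short_form.findall(phones)
print(same_result)  # True
print(numbers)      # ['123-456-7890', '123.123.4567', '123*369*4567']

names = 'Mr. Nitin\nMr Rana\nMr. T\nMr smith'
# 'Mr smith' is skipped because [A-Z] requires a capital letter.
titles = re.findall(r'Mr\.?\s[A-Z]\w*', names)
print(titles)       # ['Mr. Nitin', 'Mr Rana', 'Mr. T']
```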


#### Use a group with alternation (|) to match several alternatives in one pattern instead of setting up multiple searches

In [66]:
#pattern = re.compile(r'M(r|s|rs)\.?\s[A-Z]\w*') # M followed by r, s, or rs

pattern = re.compile(r'(Mr|Ms|Mrs)\.?\s[A-Z]\w*') # Same search with the full titles spelled out in the group

matches = pattern.finditer(text_to_search)  # finditer() returns an iterator over all matches

for match in matches:
    print(match)

<re.Match object; span=(259, 268), match='Mr. Nitin'>
<re.Match object; span=(270, 283), match='Mr Nitin_Rana'>
<re.Match object; span=(285, 293), match='Ms Smith'>
<re.Match object; span=(294, 301), match='Mr Rana'>
<re.Match object; span=(303, 311), match='Mrs Rana'>
<re.Match object; span=(314, 319), match='Mr. T'>
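Parentheses also capture what they match, so the title can be read back with `group(1)`. The sketch below adds a second group around the name; that extra group and the sample string are assumptions for illustration, not part of the pattern above.

```python
import re

# Group 1 captures the title, group 2 (added here) captures the name.
pattern = re.compile(r'(Mr|Ms|Mrs)\.?\s([A-Z]\w*)')
names = 'Mr. Nitin\nMs Smith\nMrs Rana'

pairs = [(m.group(1), m.group(2)) for m in pattern.finditer(names)]
print(pairs)  # [('Mr', 'Nitin'), ('Ms', 'Smith'), ('Mrs', 'Rana')]
```

For `Mrs Rana` the engine first tries `Mr`, fails at the `s`, and backtracks until the `Mrs` alternative succeeds.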


#### Load data.txt file for Pattern Searches

In [45]:
# Search for a phone number
#pattern = re.compile(r'\d\d\d')
#pattern = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d') # . matches any character as the separator, which may be too permissive
pattern = re.compile(r'[89]00[-.]\d\d\d[-.]\d\d\d\d') # Phone numbers starting with 800 or 900

# The with block closes the file automatically, so no f.close() is needed
with open('data.txt', 'r') as f:
    contents = f.read()

    matches = pattern.finditer(contents)  # finditer() returns an iterator over all matches

    for match in matches:
        print(match)

<re.Match object; span=(102, 114), match='800-555-5669'>
<re.Match object; span=(281, 293), match='900-555-9340'>
<re.Match object; span=(467, 479), match='800-555-6771'>
<re.Match object; span=(1093, 1105), match='900-555-3205'>
<re.Match object; span=(1443, 1455), match='800-555-6089'>
<re.Match object; span=(1794, 1806), match='800-555-7100'>
<re.Match object; span=(2055, 2067), match='900-555-5118'>
<re.Match object; span=(2830, 2842), match='900-555-5428'>
<re.Match object; span=(3290, 3302), match='800-555-8810'>
<re.Match object; span=(3977, 3989), match='900-555-9598'>
<re.Match object; span=(4951, 4963), match='800-555-2420'>
<re.Match object; span=(5572, 5584), match='900-555-3567'>
<re.Match object; span=(6195, 6207), match='800-555-3216'>
<re.Match object; span=(6897, 6909), match='900-555-7755'>
<re.Match object; span=(7872, 7884), match='800-555-1372'>
<re.Match object; span=(8751, 8763), match='900-555-6426'>


#### RegEx for email Search

In [90]:
emails = '''
nitinrana@gmail.com
nitin.rana@gmail.com
nitin.rana@stern.nyu.edu 
nr743@stern.nyu.edu
nitinrana1976@yahoo.co.uk
nitin-rana@ocp.com
'''

In [91]:
#pattern = re.compile(r'[a-zA-Z]+@[a-zA-Z]+\.com') # Simplest case
#pattern = re.compile(r'[a-zA-Z0-9.]+@[a-zA-Z]+\.com') # Also allow digits and dots before the @ sign

#pattern = re.compile(r'[a-zA-Z0-9.]+@[a-zA-Z]+\.(com|edu|co\.uk)') # Allow several endings; the dot in co.uk is escaped

pattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+') # General case: subdomains and multi-part endings

matches = pattern.finditer(emails)  # finditer() returns an iterator over all matches

for match in matches:
    print(match)

<re.Match object; span=(1, 20), match='nitinrana@gmail.com'>
<re.Match object; span=(21, 41), match='nitin.rana@gmail.com'>
<re.Match object; span=(42, 66), match='nitin.rana@stern.nyu.edu'>
<re.Match object; span=(68, 87), match='nr743@stern.nyu.edu'>
<re.Match object; span=(88, 113), match='nitinrana1976@yahoo.co.uk'>
<re.Match object; span=(114, 132), match='nitin-rana@ocp.com'>
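When only the matched strings are needed, `findall` returns them directly. The sketch below reuses the last pattern from the cell above on made-up sentences, and also shows one limitation: because the final character set includes `.`, a sentence-ending period is picked up as part of the address.

```python
import re

pattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')

sample = 'Contact nitin.rana@stern.nyu.edu or nitin-rana@ocp.com today'
found = pattern.findall(sample)
print(found)  # ['nitin.rana@stern.nyu.edu', 'nitin-rana@ocp.com']

# Limitation: the trailing period of the sentence is included in the match.
trailing = pattern.findall('Write to nr743@stern.nyu.edu.')
print(trailing)  # ['nr743@stern.nyu.edu.']
```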




### Once the pattern is found, it is time to retrieve the matched strings

In [93]:
urls = '''
http://nitinrana.blogspot.com
http://www.nitinrana.blogspot.com
https://www.nitinrana.blogspot.com
https://www.nasa.gov 
www.cnn.com
'''

### Goal: extract only the domain name from each URL

In [96]:
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)') # Groups: optional www., domain name, first dotted suffix

matches = pattern.finditer(urls)

for match in matches:
    print(match)

<re.Match object; span=(1, 26), match='http://nitinrana.blogspot'>
<re.Match object; span=(31, 60), match='http://www.nitinrana.blogspot'>
<re.Match object; span=(65, 95), match='https://www.nitinrana.blogspot'>
<re.Match object; span=(100, 120), match='https://www.nasa.gov'>
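The captured groups can be read from each match with `groups()` or `group(n)`, and reused in a replacement through backreferences. This is a minimal sketch on a shortened copy of `urls`; using `sub` with `\2\3` is one possible way to keep only the captured parts, not necessarily the intended next step.

```python
import re

urls = '''
http://nitinrana.blogspot.com
https://www.nasa.gov
'''
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

# groups() returns all captured groups; group 1 is None when "www." is absent.
rows = [m.groups() for m in pattern.finditer(urls)]
print(rows)
# [(None, 'nitinrana', '.blogspot'), ('www.', 'nasa', '.gov')]

# \2 and \3 refer back to groups 2 and 3, dropping the scheme and "www.".
stripped = pattern.sub(r'\2\3', urls)
print(stripped)
```

Note that `www.cnn.com` in the original list has no `http://` or `https://` prefix, so this pattern does not match it.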
