# What Are Regular Expressions
Regular expressions (regex) are a powerful language for matching text patterns. This tutorial gives a basic introduction to regular expressions.

The Python "re" module provides regular expression support.

[Regular Expression Tester](https://regex101.com/)

# When To Use Regular Expressions
Regular expressions are useful for:

- extracting information, such as dates and events, from log files
- checking the format of email addresses
- checking the format of phone numbers for a given country
- global search and replace: finding a particular string in some data and replacing it with another one

Regular expressions are available in many languages: Python, Java, JavaScript, C#, Ruby, PHP, Swift, Groovy, Scala, and others.

## Python Regex Cheatsheet

https://www.cheatography.com/davechild/cheat-sheets/regular-expressions/


## Frequently Used Regular Expressions 
       
        ^        Matches the beginning of a line
        $        Matches the end of the line
        .        Matches any character (except a newline, by default)
        \s       Matches whitespace
        \S       Matches any non-whitespace character
        *        Repeats a character zero or more times
        *?       Repeats a character zero or more times 
                 (non-greedy)
        +        Repeats a character one or more times
        +?       Repeats a character one or more times 
                 (non-greedy)
        [aeiou]  Matches a single character in the listed set
        [^XYZ]   Matches a single character not in the listed set
        [a-z0-9] The set of characters can include a range
        (        Indicates where string extraction is to start
        )        Indicates where string extraction is to end
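
As a small illustration of the greedy and non-greedy quantifiers and of the parentheses listed above, the sketch below runs three patterns over a made-up string. The sample text and the addresses in it are placeholders, not part of this tutorial's data.

```python
import re

# Hypothetical sample text, used only for illustration
sample = "From: <alice@example.com> To: <bob@example.com>"

# Greedy: .+ grabs as much as possible, so a single long match is returned
greedy = re.findall(r"<.+>", sample)

# Non-greedy: .+? stops at the first closing bracket
lazy = re.findall(r"<.+?>", sample)

# Parentheses: only the text inside ( ) is extracted
captured = re.findall(r"<(\S+?)>", sample)

print(greedy)
print(lazy)
print(captured)
```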


# Basic Regular Expressions Operations
In this section we will:
- find a word in a string
- generate an iterator over matches
- match one of several letters
- match a range of characters
- replace a string
- match a single character


In [12]:
nameAge = '''
Jessica is 15 years old, and Daniel is 27 years old.
Edward is 97 years old, and his grandfather, Oscar, is 102. 
'''

NOTE: Every name starts with a capital letter, and every age is an integer.

Now we define the regular expressions and use the findall method, which finds all occurrences of the pattern given as the first parameter within the string given as the second parameter.

In [13]:
import re
ages = re.findall(r'\d{1,3}', nameAge)
names = re.findall(r'[A-Z][a-z]*', nameAge)

NOTE1: ages = **\d{1,3}**

**\d** matches a digit (equal to [0-9])

**{1,3}** is a quantifier: it matches between 1 and 3 times, as many times as possible, giving back as needed

![title](./../../img/regex-number.png)

NOTE2: names = **[A-Z][a-z]\***

**[A-Z]** matches a single uppercase letter in the range A to Z (case sensitive)

**[a-z]** matches a single lowercase letter in the range a to z (case sensitive)

**\*** is a quantifier: it matches the preceding element between zero and unlimited times.

![title](./../../img/regex-string.png)

In [14]:
print(ages)
print(names)

['15', '27', '97', '102']
['Jessica', 'Daniel', 'Edward', 'Oscar']


In [17]:
ageDict = {}
x = 0
for eachName in names:
    ageDict[eachName] = ages[x]
    x +=1
print(ageDict)

{'Edward': '97', 'Daniel': '27', 'Oscar': '102', 'Jessica': '15'}
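
The loop with a manual counter can also be written with zip(), which pairs each name with the age at the same position. This is a minimal sketch, assuming (as in the text above) that names and ages appear in the same order; the conversion of the ages to int is an addition for illustration.

```python
import re

nameAge = '''
Jessica is 15 years old, and Daniel is 27 years old.
Edward is 97 years old, and his grandfather, Oscar, is 102.
'''

ages = re.findall(r'\d{1,3}', nameAge)
names = re.findall(r'[A-Z][a-z]*', nameAge)

# zip() walks both lists in parallel; int() turns '15' into 15
ageDict = {name: int(age) for name, age in zip(names, ages)}
print(ageDict)
```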


## Understanding the cursor in regular expressions

![title](./../../img/stringAndregexCursor.png)


## Search a word in a string

    Syntax: match = re.search(pat, str)

The re.search() method takes a regular expression pattern and a string and searches for that pattern within the string. If the search is successful, search() returns a match object; otherwise it returns None. Therefore, the search is usually immediately followed by an if-statement to test whether the search succeeded.

In [27]:
# search a word in a string
import re
if re.search('inform', 'we need to inform him with the latest information'):
    print('There is the word inform in the text string')

There is the word inform in the text string


In [16]:
import re
text = 'an example word:cat!!'  # "str" would shadow the built-in str type
match = re.search(r'word:\w\w\w', text)
# If-statement after search() tests if it succeeded
if match:
    print('found', match.group())  # prints: found word:cat
else:
    print('did not find')

found word:cat
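
A match object carries more than the matched text. The sketch below adds parentheses to the same pattern to show group numbering and span(); the variable names are chosen for illustration only.

```python
import re

text = 'an example word:cat!!'
match = re.search(r'word:(\w\w\w)', text)

if match:
    whole = match.group(0)    # the entire match
    animal = match.group(1)   # only the part inside the parentheses
    position = match.span()   # (start, end) indexes of the entire match
    print(whole, animal, position)
```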


## Find All
findall() is probably the single most powerful function in the re module. Above we used re.search() to find the first match for a pattern. findall() finds all the matches and returns them as a list of strings, with each string representing one match.

In [19]:
# findall method
import re
allInform = re.findall(r'inform', 'we need to inform him with the latest information')

for i in allInform:
    print(i)

inform
inform
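
The shape of the list returned by findall() depends on the pattern: without parentheses each item is the whole match, and with two or more groups each item is a tuple of the groups. A short sketch on the same sentence, with patterns made up for illustration:

```python
import re

text = 'we need to inform him with the latest information'

# No groups: each item is the full match (\b is a word boundary)
whole_words = re.findall(r'\binform\w*', text)

# Two groups: each item is a tuple (group 1, group 2)
pairs = re.findall(r'(inform)(\w*)', text)

print(whole_words)
print(pairs)
```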


## Generate an iterator
re.finditer() returns an iterator of match objects, which we can use to get the starting and ending index of each match.

In [23]:
string = 'we need to inform him with the latest information'
for i in re.finditer("inform", string):
    locTuple = i.span()
    print(locTuple)

(11, 17)
(38, 44)


NOTE: Using the method **finditer**, we can find each location of the string 'inform' in that sentence. The span() method gives the starting and ending index of each match:

- first location at (11, 17): "inform"
- second location at (38, 44): "inform[ation]"

Each pair of starting and ending indexes is returned as a tuple; the ending index is exclusive.
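
Because the ending index is exclusive, the two numbers can be used directly as a slice. A minimal sketch that collects the indexes together with the text they point to:

```python
import re

string = 'we need to inform him with the latest information'

found = []
for m in re.finditer("inform", string):
    start, end = m.span()
    # slicing with the span gives back exactly the matched text
    found.append((start, end, string[start:end]))

print(found)
```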

## Match words with a particular pattern

In [24]:
import re

string = "Sat, hat, mat, pat"

# get all the words that end with at
allstr = re.findall('[shmp]at', string)

for i in allstr:
    print(i)

hat
mat
pat


NOTE: The first string 'Sat' was not printed because the regex does not include a capital S. Let's fix it.

In [35]:
import re

string = "Sat, hat, mat, pat"

# get all the words that end with at
allstr = re.findall('[Sshmp]at', string)

for i in allstr:
    print(i)

Sat
hat
mat
pat
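
An alternative to listing both cases inside the set is the re.IGNORECASE flag, which makes the whole pattern case-insensitive. A small sketch on the same string:

```python
import re

string = "Sat, hat, mat, pat"

# The set only lists lowercase letters; the flag lets 's' also match 'S'
allstr = re.findall('[shmp]at', string, flags=re.IGNORECASE)
print(allstr)
```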


## Match series of range of characters
For instance, find all words ending in "at" whose first letter is in the range h to m.


In [36]:
import re

string = "Sat, hat, mat, pat"

# match series of range of characters 
someStr = re.findall("[h-m]at", string)

for i in someStr:
    print(i)

hat
mat


In [40]:
# using a caret symbol ^ to indicate not
import re

string = "Sat, hat, mat, pat"

# match series of range of characters - everything except hat and mat
someStr = re.findall("[^h-m]at", string)

for i in someStr:
    print(i)

Sat
pat


NOTE: **[^h-m]at** matches a single character that is not in the range h to m, followed by "at".

## Replace a string
We will compile a pattern object and use its substitute method (sub) to replace every match with another string.

In [7]:
import re
string = 'hat mat rat pat'

# let's replace rat with food
# we will compile a pattern object to get additional method, one of which is substitute.
regex = re.compile('[r]at')

stuff = regex.sub('food', string)
print(stuff)


hat mat food pat
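
Compiling is optional: the re module also offers sub() as a module-level function. The sketch below shows that form together with two standard options, the count argument and a backreference (\1) to a captured group; the replacement words are made up for illustration.

```python
import re

string = 'hat mat rat pat'

# Module-level sub(): same result as compiling first
replaced = re.sub(r'[r]at', 'food', string)

# count limits how many matches are replaced (here: the first two)
first_two = re.sub(r'[hmrp]at', 'food', string, count=2)

# \1 re-inserts whatever the first group captured
swapped = re.sub(r'(\w)at', r'\1og', string)

print(replaced)
print(first_two)
print(swapped)
```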


In [50]:
# a literal backslash is written as \\ inside a regular Python string
string = "here is \\Antony"
print(string)

here is \Antony


In [8]:
import re
# To match the literal backslash, escape it in the pattern and use a raw string:
string = "here is \\Antony"
print(re.search(r'\\Antony', string))

<_sre.SRE_Match object; span=(8, 15), match='\\Antony'>
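
The backslash has to be escaped twice: once for Python's string syntax and once for the regex engine. The sketch below compares three ways of writing the same pattern; re.escape() is a standard helper that escapes special characters for you.

```python
import re

string = "here is \\Antony"        # the string contains ONE backslash

raw_pattern = r'\\Antony'          # raw string: what you type is what the regex sees
plain_pattern = '\\\\Antony'       # same pattern without the r prefix
escaped_pattern = re.escape('\\Antony')  # built from the literal text

match = re.search(raw_pattern, string)
print(match.group(), match.span())
```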


## Match a single character

In [74]:
import re
string = '''
Keep the UN flag
flying high
at the United Nations
'''
print(string)


Keep the UN flag
flying high
at the United Nations



In [75]:
# let's replace each newline (\n) with a space
import re
regex = re.compile('\n')
string = regex.sub(' ', string)
print(string)

 Keep the UN flag flying high at the United Nations 
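
The result above still has a space at each end, left over from the first and last newline. One possible variation, sketched here, collapses any run of whitespace with **\s+** and then trims the ends with strip():

```python
import re

string = '''
Keep the UN flag
flying high
at the United Nations
'''

# \s+ matches one or more whitespace characters (spaces, tabs, newlines)
oneLine = re.sub(r'\s+', ' ', string).strip()
print(oneLine)
```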


In [66]:
# now let's count the digits in a string.
import re
string = '12345'
print('Matches', len(re.findall(r"\d", string)))

Matches 5


NOTE: **\d** matches a digit (equal to [0-9])

In [70]:
# now let's count the non-digit characters in a string.
import re
string = '12345'
print('Matches', len(re.findall(r"\D", string)))

Matches 0


NOTE:  **\D** matches any character that IS NOT A DIGIT (equal to [^0-9])

In [71]:
# now let's match a run of exactly five digits.
import re
string = '12345'
print('Matches', len(re.findall(r"\d{5}", string)))

Matches 1


NOTE: **\d{5}** matches exactly five consecutive digits

In [73]:
import re
string = '123 1234 12345 123456 1234567'
print('Matches', len(re.findall(r"\d{5,7}", string)))

Matches 3


**\d{5,7}** matches a run of 5 to 7 consecutive digits
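
A quantifier only says how many digits to consume; it does not stop a match from starting inside a longer number. The sketch below adds the word boundary **\b** (not used elsewhere in this tutorial) to require that the run of digits stands on its own.

```python
import re

string = '123 1234 12345 123456 1234567'

# Without boundaries, \d{5} also matches the first five digits of longer numbers
unbounded = re.findall(r"\d{5}", string)

# With \b on both sides, only a number of exactly five digits matches
bounded = re.findall(r"\b\d{5}\b", string)

print(unbounded)
print(bounded)
```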

# Phone Verification Using Regular Expressions
Let's verify this phone number format: ex.: 444-122-1234

Format is 3 digits + '-' + 3 digits + '-' + 4 digits



**\w** matches any word character (equal to [a-zA-Z0-9_])

**\W** matches any non-word character (equal to [^a-zA-Z0-9_])

In [10]:
import re
phn = '412-555-1212'

if re.search(r'\w{3}-\w{3}-\w{4}', phn):
    print('it is a phone number')

it is a phone number


NOTE: the regular expression above fully matches the phone number 412-555-1212.

We could also have used \d instead of \w, because a phone number contains only digits; \w would also accept letters and underscores.
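
A stricter check is sketched below, assuming the format of 3 digits, 3 digits and 4 digits described above. It uses **\d** and re.fullmatch(), which requires the whole string to match rather than just a part of it. The function name is_phone_number is hypothetical.

```python
import re

PHONE = re.compile(r'\d{3}-\d{3}-\d{4}')

def is_phone_number(text):
    """Return True only if the entire string has the form ddd-ddd-dddd."""
    return PHONE.fullmatch(text) is not None

print(is_phone_number('412-555-1212'))      # valid format
print(is_phone_number('abc-def-ghij'))      # letters are rejected by \d
print(is_phone_number('x412-555-12129'))    # extra characters are rejected by fullmatch

# For comparison: search() with \w accepts the letters-only string
loose = re.search(r'\w{3}-\w{3}-\w{4}', 'abc-def-ghij') is not None
print(loose)
```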


## Phone number - China
format: +86 yyy xxx xxxx (for calls from outside China)

In [26]:
import re
phn = '+86 021 412 2542'

if re.search(r'\w{3}\s\w{3}\s\w{4}', phn):
    print('it is a phone number')

it is a phone number


**\s** matches any whitespace character (equal to [\r\n\t\f\v ])

**\S** matches any non-whitespace character (equal to [^\r\n\t\f\v ])
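
The pattern used for the Chinese number only looks at the three groups of digits, so it also succeeds when the +86 prefix is missing. The sketch below is one way to include the prefix, assuming the +86 yyy xxx xxxx format given above; the plus sign is escaped because **+** is a quantifier.

```python
import re

# \+ matches a literal plus sign; \d restricts each group to digits
CHINA_PHONE = re.compile(r'\+86\s\d{3}\s\d{3}\s\d{4}')

with_prefix = CHINA_PHONE.fullmatch('+86 021 412 2542') is not None
without_prefix = CHINA_PHONE.fullmatch('021 412 2542') is not None

# The looser pattern accepts the number even without the prefix
loose = re.search(r'\w{3}\s\w{3}\s\w{4}', '021 412 2542') is not None

print(with_prefix, without_prefix, loose)
```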

In [82]:
# validating a full name
import re
if re.search(r'\w{2,20}\s\w{2,20}', 'anne-marie roy'):
    print('fullname is valid')
    

fullname is valid


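NOTE: \w{2,20}\s\w{2,20} reads as \w{2,20} (2 to 20 word characters), \s (a whitespace character), \w{2,20} (2 to 20 word characters). Because \w does not match the hyphen, the search actually matches 'marie roy' rather than the whole of 'anne-marie roy'.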
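
To see what the search really matched, and to accept hyphenated names as a whole, one option is to add the hyphen to a character set and use re.fullmatch(). This is a sketch only; real names can contain many other characters.

```python
import re

fullname = 'anne-marie roy'

# search() finds the first place where the pattern fits
partial = re.search(r'\w{2,20}\s\w{2,20}', fullname).group()

# [\w-] allows word characters and hyphens; fullmatch() checks the entire string
whole = re.fullmatch(r'[\w-]{2,20}\s[\w-]{2,20}', fullname)

print(partial)
print(whole.group())
```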

# Email Verification Using Regular Expressions
Verifying the email format.

Email addresses should have the following pattern:

- 1 to 20 characters: lowercase or uppercase letters, digits, underscores, or any of ._%+-
- an @ symbol
- 2 to 20 characters: lowercase or uppercase letters, digits, underscores, periods (.) or hyphens (-)
- a period (.)
- 2 to 3 lowercase or uppercase letters


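In [104]:
import re
emails = 'sk@aol.com md@gmail.com, @seo.com, dc.@.com sk@aol.com'

# the period before the top-level domain is escaped (\.) so that it only matches a literal period
print('Email Matches:', len(re.findall(r"[\w._%+-]{1,20}@[\w.-]{2,20}\.[A-Za-z]{2,3}", emails)))

Email Matches: 3


NOTE: sk@aol.com appears twice and md@gmail.com once, giving 3 valid email addresses. @seo.com and dc.@.com do not match the pattern.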
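
Counting matches can hide what was actually matched. The sketch below lists the matches themselves, once with an unescaped period in front of the top-level domain and once with an escaped one (\.), to show why the escape matters: an unescaped period matches any character, including a space.

```python
import re

emails = 'sk@aol.com md@gmail.com, @seo.com, dc.@.com sk@aol.com'

# Unescaped period: "." matches the space, so matches run into the next word
unescaped = re.findall(r"[\w._%+-]{1,20}@[\w.-]{2,20}.[A-Za-z]{2,3}", emails)

# Escaped period: "\." only matches a literal period
escaped = re.findall(r"[\w._%+-]{1,20}@[\w.-]{2,20}\.[A-Za-z]{2,3}", emails)

print(unescaped)
print(escaped)
```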

In [15]:
import re

# Simple regex for syntax checking (raw string, so the backslashes reach the regex engine unchanged)
regex = r'^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,})$'

# Email address to verify (input() already returns a string)
addressToVerify = input('Please enter the emailAddress to verify:')

# Syntax check
match = re.match(regex, addressToVerify)
if match:
    print('Good email syntax')
else:
    print('Bad email syntax')
    

Please enter the emailAddress to verify:amroy@codeacademy123.com
Good email syntax


NOTE: ^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,})$
    
    ^ start at the beginning of the string
    
    $ end of the string
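
The same check can be wrapped in a function so that it can be tested without typing at a prompt. This is a sketch under stated assumptions: the function name is_valid_email is hypothetical, and the re.IGNORECASE flag is an addition so that uppercase letters are accepted. It only checks syntax and does not tell you whether the address exists.

```python
import re

EMAIL = re.compile(
    r'^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,})$',
    re.IGNORECASE,  # assumption: treat upper and lower case alike
)

def is_valid_email(address):
    """Return True if the address passes the syntax check above."""
    return EMAIL.match(address) is not None

for candidate in ['amroy@codeacademy123.com', 'Anne.Roy@Example.COM',
                  '@seo.com', 'dc.@.com', 'no-at-sign.com']:
    print(candidate, is_valid_email(candidate))
```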
    

# Web Scraping Using Regular Expressions

Getting phone numbers from a webpage

http://www.summet.com/dmsi/html/codesamples/addresses.html

In [39]:
import urllib.request
from re import findall
#import ssl

url = 'https://www.summet.com/dmsi/html/codesamples/addresses.html'

response = urllib.request.urlopen(url)

html = response.read()  # read() must be called to get the page content as bytes

htmlStr = html.decode()

pdata = findall(r"\(\d{3}\) \d{3}-\d{4}", htmlStr)

for item in pdata:
    print(item)



URLError: <urlopen error EOF occurred in violation of protocol (_ssl.c:645)>
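
Since the request above failed, the extraction step can still be tried offline. The sketch below applies the same phone pattern to a small hard-coded HTML fragment; the fragment, names and numbers are invented for illustration and are not taken from the page above.

```python
from re import findall

# Hypothetical HTML fragment standing in for the downloaded page
htmlStr = """
<ul>
  <li>Alice Example, (555) 010-1234</li>
  <li>Bob Example, (555) 010-5678</li>
  <li>Fax: 555-010-0000</li>
</ul>
"""

# \( and \) match literal parentheses around the area code
pdata = findall(r"\(\d{3}\) \d{3}-\d{4}", htmlStr)

for item in pdata:
    print(item)
```

The fax line is skipped because it does not have the area code in parentheses.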

# Summary
We covered the following:
- What are Regular Expressions?
- When to use Regular Expressions?
- Basic Regular Expressions operations
- E-mail verification using Regular Expressions
- Phone number verification using Regular Expressions
- Web scraping using Regular Expressions
