# <center>Regular Expressions</center>

References:
- http://www.tutorialspoint.com/python/python_reg_expressions.htm
- https://developers.google.com/edu/python/regular-expressions
- https://www.datacamp.com/community/tutorials/python-regular-expression-tutorial

## 1. What is a regular expression
- A regular expression is a special sequence of characters, written in a specialized syntax and held in a **pattern**, that helps you match or find other strings or sets of strings.
- Regular expressions are widely used in the UNIX world
- <font color="green">**re**</font> is Python's built-in module for regular expressions
- Other packages, such as BeautifulSoup and NLTK, also use regular expressions

## 2. Useful Regular Expression Patterns

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import re

In [2]:
text = """COM-101   COMPUTERS
COM-111   DATABASE
COM-211   ALGORITHM
MAT-103   STATISTICS learning
MAT-102   STATISTICS"""
print(text) 
# a triple-quoted string can span multiple lines

COM-101   COMPUTERS
COM-111   DATABASE
COM-211   ALGORITHM
MAT-103   STATISTICS learning
MAT-102   STATISTICS


In [9]:
# Find all occurrences of a pattern 
re.findall('101',text)
re.findall('COM',text)
re.findall('^COM',text) # ^ anchors at the start of the string, so only the first COM matches
#re.findall('.',text) #all elements
re.findall('10[12]',text)
re.findall('COM-1*',text)
re.findall('COM-1+',text)
re.findall('COM-1?',text)
re.findall('[a-z]+',text)

['101']

['COM', 'COM', 'COM', 'COM']

['COM']

['101', '102']

['COM-1', 'COM-111', 'COM-']

['COM-1', 'COM-111']

['COM-1', 'COM-1', 'COM-']

['learning']

| Pattern     | Description                              | Example |
| :------------|:----------------------------------------|:-------|
| ^     | Matches beginning of a string | '^COM'         |
| \$     | Matches end of a string        | 'STATISTICS$' |
| .     | Matches any single character except newline. |'.' |
| [...] | Matches any single character in brackets. | '10[12]'|
| [^...] | Matches any single character not in brackets | '10[^12]'| 
| \*     | Matches 0 or more occurrences of preceding expression  |   'COM-1\*'                 |
| +           | Matches 1 or more occurrence of preceding expression  | 'COM-1+'    |
| ?           | Matches 0 or 1 occurrence of preceding expression    |  'COM-1?'  |
| {n}         | Matches exactly n number of occurrences of preceding expression    |  '1{2}'|
| {n,}        | Matches n or more occurrences of preceding expression.   |  '1{3,}'|
| {n,m}       | Matches at least n and at most m occurrences of preceding expression       | '1{2,3}'|
| [0-9]       | Match any digit; same as [0123456789] |'[0-9]+' |
| [a-z]       | Match any lowercase ASCII letter | '[a-z]+'     |
| [A-Z]       | Match any uppercase ASCII letter | '[A-Z]+'     |
| [a-zA-Z0-9] | Match any ASCII letter or digit |'[a-zA-Z0-9]+'                                  |
| [^0-9]      |Match anything other than a digit |  '[^0-9]+' |
| a&#124;b    | Matches either a or b  | '101&#124;102'    |
| \w          | Matches word characters. For ASCII text, equivalent to [A-Za-z0-9\_]; Unicode letters and digits also match unless the re.ASCII flag is set.   | '\w+'                                    |
| \W          | Matches nonword characters | '\W\w+'      |
| \s          | Matches whitespace. For ASCII text, equivalent to [ \t\n\r\f\v] |   '\s\w+'   |
| \S          | Matches nonwhitespace  |'\S+'     |
| \d          | Matches digits. For ASCII text, equivalent to [0-9] | '\d+'  |
| \D          | Matches nondigits | '\D+' |
| ( )         | Group regular expressions and remember matched text|  '(\w+)-(\w+)' |
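
Several patterns in the table are not exercised by the cell above. The following is a minimal sketch that tries a few of them on the same course list; the variable names are only illustrative.

```python
import re

text = """COM-101   COMPUTERS
COM-111   DATABASE
COM-211   ALGORITHM
MAT-103   STATISTICS learning
MAT-102   STATISTICS"""

# $ anchors the match at the end of the string
end_match = re.findall(r'STATISTICS$', text)      # ['STATISTICS']

# {n} and {n,m} count repetitions of the preceding expression
double_ones = re.findall(r'1{2}', text)           # ['11', '11']
two_or_three_ones = re.findall(r'1{2,3}', text)   # ['111', '11']

# [^...] negates a character class; | separates alternatives
not_1_or_2 = re.findall(r'10[^12]', text)         # ['103']
either = re.findall(r'101|102', text)             # ['101', '102']

print(end_match, double_ones, two_or_three_ones, not_1_or_2, either)
```

Note that `1{2}` finds `11` inside both `COM-111` and `COM-211`, because `findall` scans left to right and returns non-overlapping matches.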

<div class="alert alert-block alert-info"> Sequences such as '\n', which have a special meaning in ordinary Python strings, can cause confusion when they are used in a regular expression. To avoid such confusion, it is recommended to write patterns as <b>raw strings</b> by adding the prefix <b>r</b>, e.g. r'\n'.

For example, '\n' is a single newline character in Python. With the prefix <b>r</b>, r'\n' is two characters: a backslash and the letter 'n' (i.e. they are treated literally instead of signifying a new line).
</div>
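
A small sketch of the difference follows. The `\b` example is an added illustration, not taken from the notes above: in an ordinary string `\b` is the backspace character, while in a regular expression it means a word boundary, so the raw prefix changes the result.

```python
import re

plain = '\n'    # one character: a line break
raw = r'\n'     # two characters: a backslash and the letter n
lengths = (len(plain), len(raw))

# Without the r prefix, Python turns \b into a backspace character
# before the regular expression engine ever sees the pattern.
without_raw = re.findall('\bcat\b', 'the cat sat')
with_raw = re.findall(r'\bcat\b', 'the cat sat')

print(lengths)       # (1, 2)
print(without_raw)   # []
print(with_raw)      # ['cat']
```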

<div class="alert alert-block alert-info"> However, to match characters that have a special meaning in regular expressions (e.g. |, \, $, .) literally, put the escape character "<b>\</b>" in front of them, e.g. r'\|' or r'\\'.</div>
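
As a rough illustration of escaping, the sketch below matches a few special characters literally. The sample string is made up for this example. `re.escape` is the standard library helper that adds the backslashes for you.

```python
import re

# Hypothetical sample text containing |, $ and a backslash
text = r"a|b costs $5 at C:\temp"

pipes = re.findall(r'\|', text)          # a literal |
dollars = re.findall(r'\$\d', text)      # a literal $ followed by a digit
backslashes = re.findall(r'\\', text)    # a literal backslash

# re.escape builds a pattern that matches the given string literally
found = re.findall(re.escape('a|b'), text)

print(pipes, dollars, backslashes, found)
```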

## 3. Regular Expression Functions
- <font color="green">**match(pattern, string, flags=0)**</font>: matches *pattern* against *string* starting from the beginning. The re.match function returns a match object on success and None on failure. Use the group() method of the match object to get the matched text.
- <font color="green">**search(pattern, string, flags=0)**</font>: matches *pattern* against *string*, similar to <font color="green">**match**</font>. The difference between <font color="green">**match**</font> and <font color="green">**search**</font> is: <font color="green">**match**</font> checks for a match only at the **beginning** of the string, while <font color="green">**search**</font> checks for a match **anywhere** in the string.
- <font color="green">**findall(pattern, string, flags=0)**</font>: finds **all non-overlapping occurrences** of the *pattern* in *string* and returns them as a list. Note that the **match** and **search** functions only find the **first match**.
- <font color="green">**sub(pattern, repl, string, count=0, flags=0)**</font>: replaces occurrences of the *pattern* in *string* with *repl*. All occurrences are replaced unless *count* is provided, in which case at most *count* replacements are made. This function returns the modified string.
- <font color="green">**split(pattern, string, maxsplit=0, flags=0)**</font>: splits *string* by the occurrences of *pattern*. If *maxsplit* is nonzero, at most *maxsplit* splits occur, and the remainder of the string is returned as the final element of the list.
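
The optional limits on `sub` and `split` are easy to overlook. Here is a minimal sketch using the keyword names `count` and `maxsplit` from the standard `re` module:

```python
import re

text = "The cat catches a rat"

# Replace only the first occurrence
first_only = re.sub(r'cat', 'CAT', text, count=1)

# Split at most twice; the rest of the string stays in the last element
parts = re.split(r'\s+', text, maxsplit=2)

print(first_only)   # The CAT catches a rat
print(parts)        # ['The', 'cat', 'catches a rat']
```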

In [10]:
# Exercise 3.1. match function

text="The cat catches a rat"

# match is to find the pattern in the 
# string from ***the beginning***
# if found, a match object is returned
# otherwise, None

match= re.match(r'cat', text)
if match:
    print ("find cat!")
else:
    print ("not found!")
    
# How to modify the pattern so that 
# 'cat' strings can be found ?

not found!
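
One possible answer to the question in Exercise 3.1, offered as a sketch rather than the only solution: since `match` always starts at the beginning, the pattern itself has to account for the text before `cat`.

```python
import re

text = "The cat catches a rat"

# Option 1: describe the leading word and the whitespace before 'cat'
m1 = re.match(r'\w+\s+cat', text)

# Option 2: spell out the beginning of the string
m2 = re.match(r'The cat', text)

print(m1.group())   # The cat
print(m2.group())   # The cat
```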


In [11]:
# Exercise 3.2. search function

# search is to find the pattern in the string 
# from ***any position***
# group() is the function to return matched string

text="The cat catches a rat"

match= re.search(r'cat',text)
if match:
    print ("find cat!")
    print (match.group())
else:
    print ("not found!")

find cat!
cat


In [12]:
text="The cat catches a rat"

match= re.search(r'.*cat',text)
if match:
    print ("find cat!")
    print (match.group())
else:
    print ("not found!")

find cat!
The cat cat
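
The result `The cat cat` comes from `*` being greedy: `.*` takes as many characters as it can while still letting `cat` match. The sketch below compares it with the non-greedy form `.*?`, which stops at the first possible `cat`.

```python
import re

text = "The cat catches a rat"

greedy = re.search(r'.*cat', text).group()   # longest possible match
lazy = re.search(r'.*?cat', text).group()    # shortest possible match

print(greedy)   # The cat cat
print(lazy)     # The cat
```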


In [13]:
# Exercise 3.3. findall function

# find all "cat" substrings in text

text="The cat catches a rat"

match= re.findall(r'cat', text)
print (match)

['cat', 'cat']


In [14]:
# Exercise 3.4. sub function

# replace all "cat" substrings in text with 'CAT'

text="The cat catches a rat"

match= re.sub(r'cat','CAT', text)
print (match)

The CAT CATches a rat


In [15]:
# Exercise 3.5. split the sentence into words

text="The cat catches a rat!!!!!!!"

words= re.split(r'\W+', text) # \W+ matches one or more non-word characters
print (words)

# any other way to tokenize?

['The', 'cat', 'catches', 'a', 'rat', '']
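
The trailing empty string appears because the text ends with a separator. Two alternative ways to tokenize are sketched below: match the words directly with `findall`, or filter out the empty strings after `split`.

```python
import re

text = "The cat catches a rat!!!!!!!"

# Match runs of word characters instead of splitting on separators
words = re.findall(r'\w+', text)

# Or keep split and drop the empty strings
words2 = [w for w in re.split(r'\W+', text) if w]

print(words)    # ['The', 'cat', 'catches', 'a', 'rat']
print(words2)   # ['The', 'cat', 'catches', 'a', 'rat']
```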


In [17]:
text="The cat catches a rat"

match= re.findall(r't', text)                      
print (match)

['t', 't', 't']


In [16]:
# Exercise 3.6. case insensitive search

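# The flag re.I (re.IGNORECASE) makes matching case-insensitive.
# It can be applied to search, match, findall, and sub
# (for sub, pass it by keyword as flags=re.I, because the
# fourth positional argument of sub is count)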

# find all "t" or "T"     

text="The cat catches a rat"

match= re.findall(r't', text, re.I)                      
print (match)

['T', 't', 't', 't']
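
A short sketch of the same flag with `sub` and with a compiled pattern. With `sub` the flag is passed by keyword, because the fourth positional parameter of `re.sub` is `count`, not `flags`.

```python
import re

text = "The cat catches a rat"

# flags must be given by keyword for sub
replaced = re.sub(r't', '_', text, flags=re.I)

# A compiled pattern carries its flags with it
pattern = re.compile(r't', re.I)
matches = pattern.findall(text)

print(replaced)   # _he ca_ ca_ches a ra_
print(matches)    # ['T', 't', 't', 't']
```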


In [18]:
# Exercise 3.7. Match with capturing groups (i.e. "()")

m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")

# group() or group(0) always returns the whole text that was matched 
# no matter if it was captured in a group or not
print(m.group())
print(m.group(0))

# refer to each group by index starting from 1
print("first word:", m.group(1))
print("second word:", m.group(2))
print("first & second groups:", m.group(1,2))

Isaac Newton
Isaac Newton
first word: Isaac
second word: Newton
first & second groups: ('Isaac', 'Newton')
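
Groups can also be given names with the `(?P<name>...)` syntax, which can be easier to read than numeric indexes. A minimal sketch follows; the group names `first` and `last` are chosen here only for illustration.

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton, physicist")

first = m.group('first')    # look up a group by name
all_groups = m.groups()     # tuple of all captured groups
by_name = m.groupdict()     # dict mapping group names to matched text

print(first)        # Isaac
print(all_groups)   # ('Isaac', 'Newton')
print(by_name)      # {'first': 'Isaac', 'last': 'Newton'}
```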


## 4. Expression pattern examples

In [19]:
# Exercise 4.1. Replace line breaks with a single space

text='''first
        second'''

# \s matches whitespace characters, including line breaks,
# tabs, spaces etc. + means one or more
print (re.sub(r"\s+", ' ',text)  ) 

first second


In [25]:
# Exercise 4.2. find phone number
text = "201-959-5599 # This is Phone Number 201-966-5599"

# \d matches any digit,
# {n} means the preceding expression must occur exactly n times
phones = re.findall(r'\d{3}-\d{3}-\d{4}', text)
print (phones)

# How about phone numbers like 201.959.5599?
text = "201.959-5599 # This is Phone Number 201-966-5599"
phones2 = re.findall(r'\d{3}[-\.]\d{3}[-\.]\d{4}', text)
print(phones2)
# how to get all the numbers, no hyphen or dot?
phones3 = re.findall(r'(\d{3})[-\.](\d{3})[-\.](\d{4})', text)
print(phones3)

['201-959-5599', '201-966-5599']
['201.959-5599', '201-966-5599']
[('201', '959', '5599'), ('201', '966', '5599')]
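
To turn the captured tuples into plain digit strings, the groups can be joined, or the separators can be stripped from the full matches. Both variants are sketched below as one possible approach.

```python
import re

text = "201.959-5599 # This is Phone Number 201-966-5599"

# Variant 1: capture the three digit groups and join them
groups = re.findall(r'(\d{3})[-.](\d{3})[-.](\d{4})', text)
digits_only = [''.join(g) for g in groups]

# Variant 2: find the full numbers, then delete every non-digit
full = re.findall(r'\d{3}[-.]\d{3}[-.]\d{4}', text)
digits_only2 = [re.sub(r'\D', '', p) for p in full]

print(digits_only)    # ['2019595599', '2019665599']
print(digits_only2)   # ['2019595599', '2019665599']
```

Inside `[]` the dot does not need a backslash, so `[-.]` behaves the same as `[-\.]`.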


In [27]:
# Exercise 4.3. find email address
text = "email me at joe.doe@example1.com or at abc-xyz@example2.edu"

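# First attempt: note the '.-_' inside the second character class
emails = re.findall(r'[a-zA-Z0-9._-]+@[a-zA-Z0-9.-_]+', text)
print (emails)
# Second attempt: the hyphen is moved to the end of the class
emails = re.findall(r'[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+', text)
print (emails)

# [a-zA-Z0-9._-] means any ASCII letter, digit,
# .(dot), _, or - is allowed
# note that most special characters lose their special meaning inside []:
# although .(dot) has a special meaning elsewhere,
# within [] it is treated literally
# -(hyphen) is treated literally only when placed first or last inside [];
# in the first pattern, '.-_' is a range from '.' to '_', which also
# admits characters such as ':' and '<' (same result on this text)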

['joe.doe@example1.com', 'abc-xyz@example2.edu']
['joe.doe@example1.com', 'abc-xyz@example2.edu']
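
Both patterns return the same result on the text above, but they are not equivalent. The sketch below uses a made-up input to show how `.-_` acts as a character range in the first pattern and lets a colon into the match.

```python
import re

# Hypothetical text where the two patterns disagree
text = "contact: bob@mail:8080 or amy@site.org"

# '.-_' is a range from '.' to '_', which includes digits, ':' and more
loose = re.findall(r'[a-zA-Z0-9._-]+@[a-zA-Z0-9.-_]+', text)

# With the hyphen last, only letters, digits, '.', '_' and '-' are allowed
strict = re.findall(r'[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+', text)

print(loose)    # ['bob@mail:8080', 'amy@site.org']
print(strict)   # ['bob@mail', 'amy@site.org']
```

Neither pattern is a full email validator; they are only meant to illustrate character classes.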


In [32]:
# Exercise 4.4. Extract course code and title as 
# [('COM-101', 'COMPUTERS'),
#  ('COM-111', 'DATABASE'),
#  ... ]

text = '''COM-101   COMPUTERS
COM-111   DATABASE
COM-211   ALGORITHM
MAT-103   STATISTICS learning
MAT-102   STATISTICS'''

re.findall(r'([A-Z]{3}-\d{3})\s+(\w+)',text)

[('COM-101', 'COMPUTERS'),
 ('COM-111', 'DATABASE'),
 ('COM-211', 'ALGORITHM'),
 ('MAT-103', 'STATISTICS'),
 ('MAT-102', 'STATISTICS')]
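
The result above keeps only the first word of each title, so `STATISTICS learning` is cut short. One possible variation, sketched here, uses the `re.M` (`re.MULTILINE`) flag so that `^` and `$` match at the start and end of every line, and captures the rest of the line as the title.

```python
import re

text = '''COM-101   COMPUTERS
COM-111   DATABASE
COM-211   ALGORITHM
MAT-103   STATISTICS learning
MAT-102   STATISTICS'''

# With re.M, ^ matches at the start of each line, not just the string
line_starts = re.findall(r'^COM', text, re.M)

# Capture the code, skip the spaces, and take the rest of the line
courses = re.findall(r'^([A-Z]{3}-\d{3})\s+(.+)$', text, re.M)

print(line_starts)   # ['COM', 'COM', 'COM']
print(courses)
```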