# <center>Regular Expressions</center>

References:
- http://www.tutorialspoint.com/python/python_reg_expressions.htm
- https://developers.google.com/edu/python/regular-expressions

## 1. What is a regular expression
- A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a **pattern**. 
- Regular expressions are widely used in the UNIX world.
- <font color="green">**re**</font> is Python's built-in module for regular expressions.
- Other modules, such as BeautifulSoup and NLTK, also use regular expressions.

## 2. Useful Regular Expression Patterns

| Pattern     | Description                                                              |
| :------------|:----------------------------------------------------------------------------|
| ^           | Matches the beginning of the string (and of each line with `re.MULTILINE`) |
| $           | Matches the end of the string (and of each line with `re.MULTILINE`)       |
| .           | Matches any single character except newline.                              |
| [...]       | Matches any single character in brackets. e.g. [ab] matches either a or b                                  |
| [^...]      | Matches any single character not in brackets                               |
| \*           | Matches 0 or more occurrences of preceding expression                      |
| +           | Matches 1 or more occurrences of preceding expression                      |
| ?           | Matches 0 or 1 occurrence of preceding expression                          |
| {n}         | Matches exactly n occurrences of preceding expression                      |
| {n,}        | Matches n or more occurrences of preceding expression.                     |
| {n,m}       | Matches at least n and at most m occurrences of preceding expression       |
| a&#124;b    | Matches either a or b                                                      |
| ( )         | Group regular expressions and remember matched text                        |
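| \w          | Matches word characters (letters, digits, underscore). Equivalent to [A-Za-z0-9_] with `re.ASCII`; otherwise Unicode letters and digits also match |
| \W          | Matches nonword characters                                                 |
| \s          | Matches whitespace. For ASCII text, equivalent to [ \t\n\r\f\v]            |
| \S          | Matches nonwhitespace                                                      |
| \d          | Matches digits. Equivalent to [0-9] with `re.ASCII`; otherwise other Unicode decimal digits also match |
| \D          | Matches nondigits                                                          |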
| [0-9]       | Matches any ASCII digit; same as [0123456789]                              |
| [a-z]       | Matches any lowercase ASCII letter                                         |
| [A-Z]       | Matches any uppercase ASCII letter                                         |
| [a-zA-Z0-9] | Matches any ASCII letter or digit                                          |
| [^0-9]      | Matches anything other than an ASCII digit                                 |
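
A few rows of the table can be checked directly. The sketch below uses `re.fullmatch`, which succeeds only when the whole string matches the pattern; the sample strings are made up for illustration.

```python
import re

# Quantifiers apply to the expression immediately before them
star = re.fullmatch(r'ab*', 'a')              # * allows zero 'b's
plus = re.fullmatch(r'ab+', 'a')              # + requires at least one 'b'
optional = re.fullmatch(r'colou?r', 'color')  # ? makes the 'u' optional
bounded = re.fullmatch(r'\d{2,4}', '12345')   # five digits exceed {2,4}

print(star is not None, plus is not None)         # True False
print(optional is not None, bounded is not None)  # True False

# Negated character class and alternation
non_digits = re.findall(r'[^0-9\s]+', 'room 101 floor 3')
either = re.findall(r'cat|rat', 'The cat catches a rat')
print(non_digits)   # ['room', 'floor']
print(either)       # ['cat', 'cat', 'rat']
```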

<div class="alert alert-block alert-info"> Characters such as '\n' have a special meaning in Python string literals, which can cause confusion when they are used in a regular expression. To avoid such confusion, it is recommended to write patterns as <b>raw strings</b>, i.e. r'expression' (e.g. r'\n').

For example, "\n" means a new line in Python. With the prefix **r**, r'\n' is two characters: a backslash and the letter 'n' (i.e. they are treated literally instead of signifying a new line).
</div>
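
The difference is easy to see by counting characters. The sketch below also uses `\b`, which is a backspace character in a plain Python string but a word boundary in a regular expression; the example sentence is the one used in the exercises of this notebook.

```python
import re

plain = '\n'    # one character: a line break
raw = r'\n'     # two characters: a backslash and the letter 'n'
print(len(plain), len(raw))   # 1 2

text = 'The cat catches a rat'

# Without the r prefix, Python turns \b into a backspace before re sees it
without_raw = re.findall('\bcat\b', text)
# With the r prefix, re receives \b and treats it as a word boundary
with_raw = re.findall(r'\bcat\b', text)

print(without_raw)   # []
print(with_raw)      # ['cat']
```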

<div class="alert alert-block alert-info"> However, to match characters such as |, \, etc. literally, which otherwise have a special meaning in regular expressions, precede them with the escape character "<b>\</b>".</div>
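
As a small illustration of escaping, the sketch below matches `|`, `$`, `.` and parentheses literally. The sample text is invented; `re.escape` is the standard helper that adds the backslashes automatically.

```python
import re

text = 'a|b costs $5.00 (approx.)'

# Unescaped, | means "or", so 'a' and 'b' are matched separately
unescaped = re.findall(r'a|b', text)
# Escaped, \| matches the literal bar
escaped = re.findall(r'a\|b', text)
# $ and . are metacharacters too and need a backslash to be literal
price = re.findall(r'\$\d+\.\d+', text)
# re.escape escapes every metacharacter in a literal string
literal = re.findall(re.escape('(approx.)'), text)

print(unescaped)   # ['a', 'b', 'a']
print(escaped)     # ['a|b']
print(price)       # ['$5.00']
print(literal)     # ['(approx.)']
```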

## 3. Regular Expression Functions
- <font color="green">**match(pattern, string, flags=0)**</font>: match *pattern* to *string* from the beginning. The re.match function returns a match object on success and None on failure. Use the group() method of the match object to get the matched text.
- <font color="green">**search(pattern, string, flags=0)**</font>: match *pattern* to *string*, similar to <font color="green">**match**</font>. The difference between <font color="green">**match**</font> and <font color="green">**search**</font> is: <font color="green">**match**</font> checks for a match only at the **beginning** of the string, while <font color="green">**search**</font> checks for a match **anywhere** in the string.
- <font color="green">**findall(pattern, string, flags=0)**</font>: find **all occurrences** of the *pattern* in *string* and return them as a list. Note that the **match** and **search** functions only find the **first match**.
- <font color="green">**sub(pattern, repl, string, count=0, flags=0)**</font>: replaces occurrences of the *pattern* in *string* with *repl*. All occurrences are replaced unless *count* is provided. This function returns the modified string.
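
In the standard `re` module, the keyword of `re.sub` that limits the number of replacements is `count`. A minimal sketch, using the example sentence from the exercises in this notebook:

```python
import re

text = 'The cat catches a rat'

# By default every occurrence is replaced
all_replaced = re.sub(r'cat', 'CAT', text)
# count=1 stops after the first replacement
first_only = re.sub(r'cat', 'CAT', text, count=1)

print(all_replaced)   # The CAT CATches a rat
print(first_only)     # The CAT catches a rat
```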


In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import re    # import re module

In [None]:
# Exercise 3.1. match function


text = "The cat catches a rat"

# match looks for the pattern only at
# ***the beginning*** of the string
# if found, a match object is returned
# otherwise, None

match = re.match(r'cat', text)
if match:
    print("found cat!")
else:
    print("not found!")

In [None]:
# Exercise 3.2. match with any preceding characters

# change the pattern to allow any characters preceding "cat"
# ".*" is greedy: it consumes as many characters as possible,
# so the match extends to the last "cat" in the string

text = "The cat catches a rat"
match = re.match(r'.*cat', text)
if match:
    print(match.group())
else:
    print("not found!")
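
To stop at the first "cat" instead, a `?` can be appended to the quantifier, which makes it non-greedy (lazy). A short sketch comparing the two on the same sentence:

```python
import re

text = 'The cat catches a rat'

# Greedy: .* grabs as much as it can, then backs up to the last 'cat'
greedy = re.match(r'.*cat', text).group()
# Lazy: .*? grabs as little as it can, stopping at the first 'cat'
lazy = re.match(r'.*?cat', text).group()

print(greedy)   # The cat cat
print(lazy)     # The cat
```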
    

In [None]:
# Exercise 3.3. search function

# search looks for the pattern at
# ***any position*** in the string
# group() is the method that returns the matched string

text = "The cat catches a rat"

match = re.search(r'cat', text)
if match:
    print("found cat!")
    print(match.group())
else:
    print("not found!")

In [None]:
# Exercise 3.4. findall function

# find all "cat" substrings in text

text="The cat catches a rat"

matches = re.findall(r'cat', text)
print(matches)

In [None]:
# Exercise 3.5. sub function

# replace all "cat" substrings in text with 'CAT'

text="The cat catches a rat"

result = re.sub(r'cat', 'CAT', text)
print(result)

In [None]:
# Exercise 3.6. case insensitive search

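# flag re.I means case insensitive.
# It can be applied to search, match, findall, and sub
# (for sub, pass it by keyword as flags=re.I, because the
# fourth positional parameter of sub is count)

# find all "t" or "T"

text = "The cat catches a rat"

matches = re.findall(r't', text, re.I)
print(matches)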
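
The same flag can be supplied in a few other ways. The sketch below shows `flags=re.I` with `re.sub`, the inline form `(?i)` at the start of a pattern, and a pattern precompiled with `re.compile`; the replacement character `_` is an arbitrary choice for illustration.

```python
import re

text = 'The cat catches a rat'

# Keyword form, required for sub because its 4th positional slot is count
replaced = re.sub(r't', '_', text, flags=re.I)
# Inline flag: (?i) at the start of the pattern has the same effect
inline = re.findall(r'(?i)t', text)
# Compiled pattern: the flag is stored with the pattern and reused
pattern = re.compile(r't', re.I)
compiled = pattern.findall(text)

print(replaced)   # _he ca_ ca_ches a ra_
print(inline)     # ['T', 't', 't', 't']
print(compiled)   # ['T', 't', 't', 't']
```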

In [None]:
# Exercise 3.7. Match with capturing groups (i.e. "()")

m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")

# group() or group(0) always returns the whole text that was matched 
# no matter if it was captured in a group or not
print(m.group())
print(m.group(0))

# refer to each group by index starting from 1
print("first word:", m.group(1))
print("second word:", m.group(2))
print("first & second groups:", m.group(1,2))
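
Groups also change what `findall` returns, and they can be given names. The following sketch is illustrative only: the group names `first` and `last` and the sample strings are assumptions, not part of the exercise above.

```python
import re

# Named groups: (?P<name>...) lets you refer to a group by name
m = re.match(r'(?P<first>\w+) (?P<last>\w+)', 'Isaac Newton, physicist')
first = m.group('first')
names = m.groupdict()
print(first)   # Isaac
print(names)   # {'first': 'Isaac', 'last': 'Newton'}

# With capturing groups, findall returns a tuple of groups per match
pairs = re.findall(r'(\w+)@(\w+)', 'joe@example1 and abc@example2')
print(pairs)   # [('joe', 'example1'), ('abc', 'example2')]

# (?:...) groups without capturing, so findall returns the whole match
titles = re.findall(r'(?:Mr|Ms)\. \w+', 'Mr. Smith met Ms. Jones')
print(titles)  # ['Mr. Smith', 'Ms. Jones']
```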

## 4. Expression pattern examples

In [None]:
# Exercise 4.1. Replace line breaks with a single space

text='''first
        second'''

# \s matches whitespace, including line breaks,
# tabs, spaces etc. + means one or more
print(re.sub(r"\s+", ' ', text))

In [None]:
# Exercise 4.2. Tokenization: Get words out of a paragraph
text='Regular expressions ("re") are a powerful language for matching text patterns.' 

# We use "split" to get words, but punctuation will be included
tokens_with_punctuation=text.split(" ")
print (tokens_with_punctuation)

# Use regular expression to tokenize
# get substrings only containing word characters
tokens=re.findall(r'\w+', text)           
print (tokens)
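
One limitation of `\w+` is that it splits contractions and hyphenated words, because `'` and `-` are not word characters. A possible variation, shown here on a made-up sentence, adds both to a character class; whether that suits a given text is a judgment call.

```python
import re

text = "Don't split e-mail addresses, please."

# \w+ breaks tokens at the apostrophe and the hyphen
simple = re.findall(r'\w+', text)
# Adding ' and - to the class keeps such tokens together
# (a hyphen placed last inside [...] is treated literally)
extended = re.findall(r"[\w'-]+", text)

print(simple)     # ['Don', 't', 'split', 'e', 'mail', 'addresses', 'please']
print(extended)   # ["Don't", 'split', 'e-mail', 'addresses', 'please']
```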

In [None]:
# Exercise 4.3. find phone number
text = "201-959-5599 # This is Phone Number"

# \d matches any digit,
# {n} means exactly n repetitions of the preceding expression
# ("-" needs no escaping outside a character class, but "\-" is harmless)
phones = re.findall(r'\d{3}\-\d{3}\-\d{4}', text)
print(phones)
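
Phone numbers are often written with other separators. As a hedged sketch, the pattern below accepts an optional pair of parentheses around the area code and a dash, dot, or whitespace between the parts. The sample numbers and the set of accepted formats are assumptions for illustration, not a complete phone number validator.

```python
import re

text = 'Call 201-959-5599, (201) 959 5599 or 201.959.5599'

# \(? and \)?  : optional literal parentheses
# [-.\s]?      : optional separator (dash, dot, or whitespace)
pattern = r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'
phones = re.findall(pattern, text)

print(phones)   # ['201-959-5599', '(201) 959 5599', '201.959.5599']
```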

In [None]:
# Exercise 4.4. find email address
text = "email me at joe.doe@example1.com or at abc-xyz@example2.edu"

# [a-zA-Z0-9\.\-_] means any ASCII letter,
# digit, .(dot), -, and _ is allowed
# note: inside [...] a dot is literal, and - is special only
# between two characters. Escaping them with "\" is optional.
emails = re.findall(r'[a-zA-Z0-9\.\-_]+@[a-zA-Z0-9\.\-_]+',
                    text)
print(emails)
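
Because the dot is allowed anywhere after the `@`, this pattern also captures a sentence-ending period that directly follows an address. One possible refinement, sketched below on a made-up sentence, requires every dot in the domain to be followed by at least one more character. It is still a simplification and not a full email address validator.

```python
import re

text = 'Write to joe.doe@example1.com. Or try abc-xyz@example2.edu, please.'

# Loose pattern: the trailing period of the sentence is captured too
loose = re.findall(r'[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+', text)
# Stricter domain: one label, then one or more ".label" parts
strict = re.findall(r'[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)+', text)

print(loose)    # ['joe.doe@example1.com.', 'abc-xyz@example2.edu']
print(strict)   # ['joe.doe@example1.com', 'abc-xyz@example2.edu']
```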

In [None]:
# Exercise 4.5. Find topics (starting with #) in the list of tokens

tokens=['#Blockchain','Block#chain','Decentralized', 'education','economy','#EDU', '#cryptocurrency', ]

# retrieve a token if it satisfies the following:
# a. starts with "#" (^)
# b. has at least one word character (\w+)

tags=[token for token in tokens if re.search(r'^#\w+', token)]
print(tags)

# or using match (search from beginning) without the need of "^"
[token for token in tokens if re.match(r'#\w+', token)]

In [None]:
# Exercise 4.6. Find sentences ending with a question mark (?)

sentences = ['Where you going?', 'Come here!',
             "I'm leaving.", 'Where are you?',
             'Put a question mark (?) at the end.']

# note "?" is a metacharacter in re
# escape it using "\"

questions = [s for s in sentences if re.match(r'.+\?$', s)]
print(questions)
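
Since only the end of the string matters here, `re.search` with `\?$` is an equivalent and shorter check. One detail worth knowing is that `$` also matches just before a trailing newline, whereas `\Z` matches only at the very end of the string. A brief sketch with made-up inputs:

```python
import re

sentences = ['Where you going?', 'Come here!',
             'Put a question mark (?) at the end.']

# search scans the whole string, so no leading .+ is needed
questions = [s for s in sentences if re.search(r'\?$', s)]
print(questions)   # ['Where you going?']

with_newline = 'Really?\n'
dollar = bool(re.search(r'\?$', with_newline))    # $ tolerates a final \n
strict_end = bool(re.search(r'\?\Z', with_newline))  # \Z does not
print(dollar, strict_end)   # True False
```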



In [None]:
# Exercise 4.7. Class exercise

ss=[ 'acbacb','abbcbb', 'babbbac','A_B_C_','A_bb_c']

#1. find strings that have an "a" followed by zero or one 'b'


#2. find strings that have an "a" followed by one or more 'b's


#3. find strings that have an "a" followed by three 'b's


#4. find strings that have an 'a' followed by anything, ending in 'b'


#5. find strings that have uppercase letters joined by an underscore


#6. find strings containing 'b', but not at the start or end

