# <center>Regular Expressions</center>

References:
- http://www.tutorialspoint.com/python/python_reg_expressions.htm
- https://developers.google.com/edu/python/regular-expressions
- https://www.datacamp.com/community/tutorials/python-regular-expression-tutorial

## 1. What is a regular expression
- A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a **pattern**. 
   - Patterns with ordinary characters, e.g. '101'
   - Patterns with special characters (metacharacters), such as <b><font color="red">. + \* ?</font></b>, e.g. '101\*'
- Regular expressions are widely used in the UNIX world
- <font color="green">**re**</font> is Python's built-in package for regular expressions
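
As a minimal sketch of the two kinds of pattern (the sample string `s` is made up for illustration), a pattern of ordinary characters matches only itself, while a metacharacter such as `*` changes how the preceding character is matched:

```python
import re

s = "10 101 1011"  # hypothetical sample string

# Ordinary characters match themselves exactly
plain = re.findall(r"101", s)

# '*' means "zero or more of the preceding character", so '101*'
# matches '10' followed by any number of '1's
starred = re.findall(r"101*", s)

print(plain)    # ['101', '101']
print(starred)  # ['10', '101', '1011']
```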

## 2. Useful Regular Expression Patterns

In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import re

In [18]:
text = """COM-101   COM-PUTERS
COM-101   DATABASE
COM-211   ALGORITHM
MAT-103   STATISTICS_AND_MACHINE_LEARNING
MAT-102   STATISTICS"""
print(text)

COM-101   COM-PUTERS
COM-101   DATABASE
COM-211   ALGORITHM
MAT-103   STATISTICS_AND_MACHINE_LEARNING
MAT-102   STATISTICS


In [5]:
# Simple example: Find all occurrences of a pattern 
# e.g. 'COM-101' 

re.findall("COM-101",text)

['COM-101', 'COM-101']

- Try the example patterns listed in the table to find matching strings, e.g. `re.findall('^COM', text)`
- Write down your answer in the last column and be sure to understand the result

| Pattern | Description | Example | Example Output |
| :--- | :--- | :--- | :--- |
| `^`     | Matches beginning of a string | `^COM`         | |
| `$`     | Matches end of a string        | `STATISTICS$` | | 
| `.`     | Matches any single character except newline. |`.` | |
| `[...]` | Matches any single character in brackets. | `10[12]`|  |
| `[^...]` | Matches any single character not in brackets | `10[^12]`|  |
| `*`     | Matches 0 or more occurrences of preceding expression  |   `COM-1*`                 | |
| `+`           | Matches 1 or more occurrence of preceding expression  | `COM-1+`    |  |
| `?`           | Matches 0 or 1 occurrence of preceding expression    |  `COM-1?`  |                  |
| `{n}`         | Matches exactly n number of occurrences of preceding expression    |  `1{2}`| |
| `{n,}`        | Matches n or more occurrences of preceding expression.   |  `1{3,}`| | 
| `{n,m}`       | Matches at least n and at most m occurrences of preceding expression       | `1{2,3}`| |
| `[0-9]`       | Matches any digit; same as [0123456789] |`[0-9]+` | |
| `[a-z]`       | Matches any lowercase ASCII letter | `[a-z]+`     | |
| `[A-Z]`       | Matches any uppercase ASCII letter | `[A-Z]+`     | |
| `[a-zA-Z0-9]` | Matches any ASCII letter or digit |`[a-zA-Z0-9]+`                                  | |
| `[^0-9]`      | Matches any character other than a digit |  `[^0-9]+` | |
| `a`&#124;`b`    | Matches either a or b  | `101`&#124;`102`    |  |
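| `\w`          | Matches word characters: letters, digits, and underscore. For ASCII text, same as [A-Za-z0-9\_] (Python 3 also matches Unicode letters and digits unless `re.ASCII` is set).   | `\w+`                                    | |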
| `\W`          | Matches nonword characters | `\W\w+`      |  |
| `\s`          | Matches whitespace.  |   `\s\w+`   |  |
| `\S`          | Matches nonwhitespace  |`\S+`     |  |
| `\d`          | Matches digits. For ASCII text, same as [0-9] (Python 3 also matches other Unicode decimal digits unless `re.ASCII` is set) | `\d+`  |  |
| `\D`          | Matches nondigits | `\D+` |  |
| `( )`         | Group regular expressions and remember matched text|  `(\w+)-(\w+)` |  |



In [11]:
re.findall('^COM',text)
re.findall('^MAT',text)

['COM']

[]

In [12]:
re.findall('STATISTICS$',text)
re.findall('DATABASE$',text)

['STATISTICS']

[]
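
The empty results above follow from the fact that, by default, `^` and `$` anchor to the start and end of the whole string rather than of each line. A minimal sketch, using a shortened copy of the course list (named `courses` here so that `text` is left untouched), shows how the `re.MULTILINE` flag changes this:

```python
import re

courses = """COM-101   DATABASE
MAT-103   STATISTICS_AND_MACHINE_LEARNING
MAT-102   STATISTICS"""

# Default: '^' matches only at the very start of the string
default_start = re.findall(r"^MAT", courses)
print(default_start)   # []

# re.MULTILINE: '^' and '$' also match at the start and end of each line
line_start = re.findall(r"^MAT", courses, re.MULTILINE)
line_end = re.findall(r"DATABASE$", courses, re.MULTILINE)
print(line_start)      # ['MAT', 'MAT']
print(line_end)        # ['DATABASE']
```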

In [13]:
re.findall('10[12]',text)

['101', '101', '102']

In [61]:
# Use a separate variable so the multi-line `text` above stays available
s="COM-1"
#s="COM-111"
#s="COM-"

re.findall('COM-1*',s)
re.findall('COM-1+',s)
re.findall('COM-1?',s)

['COM-1']

['COM-1']

['COM-1']
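
All three quantifiers give the same result on `"COM-1"`, so the difference only shows up on the other two candidate strings. The following sketch runs all three patterns on each string and collects the results in a dictionary (the name `results` is just for illustration):

```python
import re

results = {}
for s in ["COM-", "COM-1", "COM-111"]:
    star = re.findall(r"COM-1*", s)   # zero or more '1'
    plus = re.findall(r"COM-1+", s)   # one or more '1'
    opt = re.findall(r"COM-1?", s)    # zero or one '1'
    results[s] = (star, plus, opt)
    print(f"{s!r}: * -> {star}, + -> {plus}, ? -> {opt}")

# 'COM-':    * -> ['COM-'],    + -> [],          ? -> ['COM-']
# 'COM-1':   * -> ['COM-1'],   + -> ['COM-1'],   ? -> ['COM-1']
# 'COM-111': * -> ['COM-111'], + -> ['COM-111'], ? -> ['COM-1']
```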

In [21]:
# The r prefix (raw string, explained below) keeps backslashes such as \w intact
re.findall(r'\w+',text)
re.findall(r'\W+',text)
re.findall(r'\w+-\w+',text)
re.findall(r'(\w+)-(\d+)',text)  # groups control what is returned; the dash is not captured
re.findall(r'\S+',text)

['COM',
 '101',
 'COM',
 'PUTERS',
 'COM',
 '101',
 'DATABASE',
 'COM',
 '211',
 'ALGORITHM',
 'MAT',
 '103',
 'STATISTICS_AND_MACHINE_LEARNING',
 'MAT',
 '102',
 'STATISTICS']

['-',
 '   ',
 '-',
 '\n',
 '-',
 '   ',
 '\n',
 '-',
 '   ',
 '\n',
 '-',
 '   ',
 '\n',
 '-',
 '   ']

['COM-101', 'COM-PUTERS', 'COM-101', 'COM-211', 'MAT-103', 'MAT-102']

[('COM', '101'),
 ('COM', '101'),
 ('COM', '211'),
 ('MAT', '103'),
 ('MAT', '102')]

['COM-101',
 'COM-PUTERS',
 'COM-101',
 'DATABASE',
 'COM-211',
 'ALGORITHM',
 'MAT-103',
 'STATISTICS_AND_MACHINE_LEARNING',
 'MAT-102',
 'STATISTICS']

<div class="alert alert-block alert-info">To match characters that have a special meaning in regular expressions, such as | or \, put the escape character "\" in front of them. <br>
Note that regular expression escapes can collide with Python's own string escapes, since both use the backslash.
</div>
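
A small sketch of escaping in practice, using a made-up sample string `s`. The patterns are written as raw strings (the `r` prefix, covered further down) to sidestep the collision with Python's own escapes, and `re.escape` is the standard-library helper that adds the backslashes automatically:

```python
import re

s = "a|b costs $5 (approx.)"  # hypothetical sample string

# A backslash in front of a metacharacter makes it literal
pipes = re.findall(r"\|", s)
prices = re.findall(r"\$\d", s)
print(pipes)    # ['|']
print(prices)   # ['$5']

# re.escape() inserts the backslashes for you
pattern = re.escape("(approx.)")
approx = re.findall(pattern, s)
print(approx)   # ['(approx.)']
```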

In [23]:
x = "I'd like to match '\\' literally"
print(x)

re.findall('\\\\',x)   # matches one literal backslash; try '\\' instead to see the error

I'd like to match '\' literally


['\\']

**Raw string notation (r prefix)**

Prefixing a string literal with `r` tells Python to treat every backslash in it literally rather than as the start of an escape sequence.

For example, without the r-prefix, `'\\'` in Python means a single `\` (the first `\` escapes the second one). However, `\` in a regular expression is itself the escape character. So `re.findall('\\',x)` raises an error: the regular expression engine receives a lone `\` and waits for another character to escape. To find `\` literally without a raw string, you need to write the pattern as `'\\\\'`.

With the prefix `r`, `r'\\'` stands for two backslashes, i.e. Python treats them literally instead of as an escape, and the regular expression engine then reads them as one escaped `\`.

In [26]:
x = "I'd like to match '\\' literally"
print(x)

re.findall(r'\\',x)

# The r prefix can be used in front of any string literal
print('this is \n a test\n')
print(r'this is \n a test\n')

I'd like to match '\' literally


['\\']

this is 
 a test

this is \n a test\n
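
To make the equivalence concrete, the following sketch compares the two spellings of the same pattern directly (the variable names are only for illustration):

```python
# The same two-character regex (backslash, backslash) written both ways
normal = '\\\\'   # Python collapses each '\\' into one backslash
raw = r'\\'       # raw string: backslashes are kept as typed

print(normal == raw)   # True
print(len(raw))        # 2

# Escapes such as \n are also left alone in a raw string
newline_len = len('\n')
raw_newline_len = len(r'\n')
print(newline_len, raw_newline_len)   # 1 2
```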


## 3. Regular Expression Functions
- <font color="green">**match(pattern, string, flags=0)**</font>: tries to match `pattern` at the <b>beginning</b> of `string`. It returns a match object on success and `None` on failure. Use the `group()` method of the match object to get the matched text.
- <font color="green">**search(pattern, string, flags=0)**</font>: scans `string` for `pattern`, similar to <font color="green">**match**</font>. The difference: <font color="green">**match**</font> checks for a match only at the **beginning** of the string, while <font color="green">**search**</font> checks for a match **anywhere** in the string.
- <font color="green">**findall(pattern, string, flags=0)**</font>: finds **all occurrences** of the `pattern` in `string` and returns them as a list. Note that **match** and **search** only find the **first match**.
- <font color="green">**sub(pattern, repl, string, count=0, flags=0)**</font>: replaces occurrences of the `pattern` in `string` with `repl`. All occurrences are replaced unless `count` is given, in which case at most `count` are replaced. Returns the modified string.
- <font color="green">**split(pattern, string,  maxsplit=0, flags=0)**</font>: Split `string` by the occurrences of `pattern`. If `maxsplit` is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.
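
The optional limits on `sub` and `split` are easiest to see side by side. This is a small sketch on the sentence used in the exercises below (stored in `s` here to keep it separate from `text`):

```python
import re

s = "The cat catches a rat"

# count=1 replaces only the first occurrence
first_only = re.sub(r"cat", "CAT", s, count=1)
print(first_only)   # The CAT catches a rat

# maxsplit=2 performs at most two splits; the rest stays in one piece
parts = re.split(r"\s+", s, maxsplit=2)
print(parts)        # ['The', 'cat', 'catches a rat']
```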

In [3]:
# Exercise 3.1. match function

text="The cat catches a rat"

# match looks for the pattern only at 
# ***the beginning*** of the string
# if found, a match object is returned
# otherwise, None
# (quantifiers such as * and + are greedy: they match as much as possible)

match= re.match(r'cat', text)
if match:
    print ("found cat!")
    
    # .group returns matched string 
    # from the match object
    
    print(match.group())
else:
    print ("not found!")
    
# How to modify the pattern so that 
# 'cat' strings can be found ?

not found!
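
One possible answer to the question above, offered as a sketch rather than the only solution: since `re.match` anchors at the start, the pattern itself has to account for whatever precedes `cat`.

```python
import re

text = "The cat catches a rat"

# Option 1: let the lazy '.*?' consume the leading characters and capture 'cat'
m = re.match(r".*?(cat)", text)
print(m.group())    # The cat
print(m.group(1))   # cat

# Option 2: spell out the prefix explicitly
m2 = re.match(r"The cat", text)
print(m2.group())   # The cat
```

Capturing groups, used in option 1, are covered in Exercise 3.7.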


In [31]:
# Exercise 3.2. search function

# search looks for the pattern 
# at ***any position*** in the string
# group() returns the matched string

text="The cat catches a rat"

match= re.search(r'cat',text)
if match:
    print ("found cat!")
    print (match.group())
else:
    print ("not found!")

found cat!
cat


In [32]:
# Exercise 3.3. findall function

# find all "cat" substrings in text

text="The cat catches a rat"

match= re.findall(r'cat', text)
print (match)

['cat', 'cat']


In [33]:
# Exercise 3.4. sub function

# replace all "cat" substrings in text with 'CAT'

text="The cat catches a rat"

match= re.sub(r'cat','CAT', text)
print (match)

The CAT CATches a rat


In [34]:
# Exercise 3.5. split the sentence into words

text="The cat catches a rat!!!!!!!"

words= re.split(r'\W+', text) # \W+ matches a run of non-word characters
print (words)

re.findall(r'\w+', text)
# Two ways to tokenize:
# find runs of word characters (findall),
# or split on non-word characters (split)

['The', 'cat', 'catches', 'a', 'rat', '']


['The', 'cat', 'catches', 'a', 'rat']
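
The trailing `''` in the `split` result appears because the string ends with a separator. A minimal way to make both approaches agree is to drop the empty strings afterwards:

```python
import re

text = "The cat catches a rat!!!!!!!"

# Keep only the non-empty pieces produced by split()
tokens = [w for w in re.split(r"\W+", text) if w]
print(tokens)   # ['The', 'cat', 'catches', 'a', 'rat']
```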

In [159]:
# Exercise 3.6. case insensitive search

# flag re.I means case insensitive. 
# It can be applied to search, match, findall, and sub

# find all "t" or "T"     

text="The cat catches a rat"

match= re.findall(r't', text, re.I)                      
print (match)

['T', 't', 't', 't']
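
One detail worth a sketch when using flags with `re.sub`: its fourth positional argument is `count`, not `flags`, so the flag is best passed by keyword. The inline form `(?i)` at the start of a pattern is an alternative that works with every function.

```python
import re

text = "The cat catches a rat"

# Pass the flag by keyword: the fourth positional argument of re.sub is count
replaced = re.sub(r"t", "_", text, flags=re.I)
print(replaced)   # _he ca_ ca_ches a ra_

# The inline flag (?i) makes the whole pattern case insensitive
inline = re.findall(r"(?i)t", text)
print(inline)     # ['T', 't', 't', 't']
```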


In [35]:
# Exercise 3.7. Match with capturing groups (i.e. "()")

m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")

# group() or group(0) always returns the whole text that was matched 
# no matter if it was captured in a group or not
print(m.group())
print(m.group(0))

# refer to each group by index starting from 1
print("first word:", m.group(1))
print("second word:", m.group(2))
print("first & second groups:", m.group(1,2))

Isaac Newton
Isaac Newton
first word: Isaac
second word: Newton
first & second groups: ('Isaac', 'Newton')
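
Groups can also be given names with the `(?P<name>...)` syntax, which tends to read better than numeric indexes once a pattern has several groups. A short sketch on the same input (the group names `first` and `last` are chosen here for illustration):

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton, physicist")

print(m.groups())         # ('Isaac', 'Newton')
print(m.group("first"))   # Isaac
print(m.groupdict())      # {'first': 'Isaac', 'last': 'Newton'}
```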


## 4. Expression pattern examples

In [36]:
# Exercise 4.1. Replace multiple spaces or 
# line breaks with a single space

text='''first            second
        third'''

# \s matches whitespace, including line breaks, 
# tabs, spaces etc.; + means one or more
print(re.sub(r"\s+", ' ', text))

first second third


In [8]:
# Exercise 4.2. find phone number
text = "201-966-5599 # This is Phone Number 201.966.5599  201396645599"

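# \d matches any digit, 
# {n} means exactly n repetitions of the preceding element
# an unescaped "." matches any character, which is why
# the 12-digit string 201396645599 is also matched
# "-" only has the special meaning of a range when placed within []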
re.findall(r'\d{3}.\d{3}.\d{4}', text)


# How about phone numbers like 201.959.5599? What about both?
# how to get all the numbers, no hyphen or dot?



['201-966-5599', '201.966.5599', '201396645599']
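
One possible approach to the follow-up questions, as a sketch: a character class `[-.]` restricts the separator to a hyphen or a literal dot, and `re.sub` with `\D` strips everything that is not a digit.

```python
import re

text = "201-966-5599 # This is Phone Number 201.966.5599  201396645599"

# [-.] accepts only a hyphen or a literal dot as the separator
phones = re.findall(r"\d{3}[-.]\d{3}[-.]\d{4}", text)
print(phones)   # ['201-966-5599', '201.966.5599']

# Remove every non-digit to keep the digits only
digits = [re.sub(r"\D", "", p) for p in phones]
print(digits)   # ['2019665599', '2019665599']
```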

In [9]:
# Exercise 4.3. find email address
text = "email me at abc-xyz@example2.edu"

emails = re.findall(r'[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+',
                    text)
print(emails)


['abc-xyz@example2.edu']


**Some Key Points here**:
- `[a-zA-Z0-9._-]` matches any single ASCII letter, digit, `.` (dot), `_`, or `-`. 
- Note that most special characters lose their special meaning inside `[]`. For example, although `.` (dot) normally matches any character, within `[]` it is treated *literally*. 
- A `-` (hyphen) placed at the start or end of the list inside `[]` is treated *literally*; between two characters it denotes a range.
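
These points can be checked with a short sketch on a made-up string `s`:

```python
import re

s = "a.b a-b axb"  # hypothetical sample string

# Outside brackets '.' matches any character; inside it is a literal dot
any_char = re.findall(r"a.b", s)
literal_dot = re.findall(r"a[.]b", s)
print(any_char)      # ['a.b', 'a-b', 'axb']
print(literal_dot)   # ['a.b']

# A hyphen at the end of the class is literal; between two characters it is a range
dot_or_hyphen = re.findall(r"a[.-]b", s)
in_range = re.findall(r"a[w-z]b", s)
print(dot_or_hyphen) # ['a.b', 'a-b']
print(in_range)      # ['axb']
```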

In [59]:
# Exercise 4.4. Extract course code and title as 
# [('COM-101', 'COMPUTERS'),
#  ('COM-111', 'DATABASE'),
#  ... ]

text = '''COM-101   COMPUTERS
COM-111   DATABASE
COM-211   ALGORITHM
MAT-103   STATISTICS learning
MAT-102   STATISTICS'''

re.findall(r'',text)  # fill in the pattern to produce the output below

[('COM-101', 'COMPUTERS'),
 ('COM-111', 'DATABASE'),
 ('COM-211', 'ALGORITHM'),
 ('MAT-103', 'STATISTICS learning'),
 ('MAT-102', 'STATISTICS')]
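
One pattern that produces this output is sketched below; it is a possible solution, not necessarily the intended one. It relies on `.` not matching a newline, so `(.+)` stops at the end of each line.

```python
import re

text = '''COM-101   COMPUTERS
COM-111   DATABASE
COM-211   ALGORITHM
MAT-103   STATISTICS learning
MAT-102   STATISTICS'''

# (\w+-\d+) captures the course code; after the spaces, (.+) captures
# the rest of the line as the title
courses = re.findall(r"(\w+-\d+)\s+(.+)", text)
print(courses)
```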

In [64]:
text = '''Symbol   Last Price  Change   % Change
          BTC-USD  56,212.15   -58.16   -0.10%
          ETH-USD  1,787.79    -53.63   -2.91%
          BNB-USD  290.51      +5.81    +2.04%
          USDT-USD 1.0003      -0.0004  -0.04%
          ADA-USD  1.1187      -0.0528  -4.51%
      '''
# write a regular expression to return  
# [('BTC-USD', '56,212.15', '-58.16','-0.10%'),...] 
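
A possible solution, given as a sketch under the assumption that every data row has a hyphenated symbol, a price made of digits, commas and dots, and two signed numbers. The header row is skipped because it contains no hyphenated symbol.

```python
import re

text = '''Symbol   Last Price  Change   % Change
          BTC-USD  56,212.15   -58.16   -0.10%
          ETH-USD  1,787.79    -53.63   -2.91%
          BNB-USD  290.51      +5.81    +2.04%
          USDT-USD 1.0003      -0.0004  -0.04%
          ADA-USD  1.1187      -0.0528  -4.51%
      '''

pattern = (
    r"(\w+-\w+)\s+"       # symbol, e.g. BTC-USD
    r"([\d,.]+)\s+"       # last price, digits with optional commas and dots
    r"([+-][\d.]+)\s+"    # signed change
    r"([+-][\d.]+%)"      # signed percentage change
)
rows = re.findall(pattern, text)
print(rows[0])    # ('BTC-USD', '56,212.15', '-58.16', '-0.10%')
print(len(rows))  # 5
```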
