## Regular Expression (= Regex) Cheatsheet 

A regular expression (regex) describes a pattern to search for in a string, such as a specific term, an email address, or a particular string format.
Python documentation for the re module: https://docs.python.org/3.6/library/re.html

### There are a few common ways to use regex
| Function | Definition | Return data type |
| --- | --- | --- |
| re.findall(pattern, text) | Finds all non-overlapping occurrences of the pattern. | **a list** |
| re.split(pattern, text) | Splits the text wherever the pattern matches. | **a list** |
| re.search(pattern, text) | Finds the first place in the text where the pattern matches. | match object (or None) |
| re.match(pattern, text) | Matches only at the **beginning** of the text. | match object (or None) |
| re.fullmatch(pattern, text) | Matches only if the whole text matches the pattern. | match object (or None) |
| re.sub(pattern, repl, text) | Replaces every match of the pattern in the text with "repl(acement)". | **a string** |
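
To make the return types in the table concrete, here is a minimal sketch that calls each function once. The `sample` string and the variable names are made up for illustration.

```python
import re

sample = "one, two, three"  # hypothetical sample string

found = re.findall(r"\w+", sample)        # list of every match
parts = re.split(r",\s*", sample)         # list of the pieces between matches
first = re.search(r"t\w+", sample)        # match object for the first hit
start = re.match(r"t\w+", sample)         # None: the string does not start with "t"
whole = re.fullmatch(r"[\w, ]+", sample)  # match object: the pattern covers the whole string
replaced = re.sub(r"\w+", "X", sample)    # a new string

print(found)              # ['one', 'two', 'three']
print(parts)              # ['one', 'two', 'three']
print(first.group())      # two
print(start)              # None
print(whole is not None)  # True
print(replaced)           # X, X, X
```

Because re.search, re.match and re.fullmatch return None when nothing matches, it is safer to check the result before calling `.group()` on it.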


In [None]:
import re

In [11]:
# Create the text used throughout the following examples.
text = "A grey Woodchunk chunk gray wood as his work. Haha! 2-345+10-"

In [24]:
# Quick try:
pattern1 = "[Ww]ood"
print(re.findall(pattern1, text))
print(re.search(pattern1, text))

['Wood', 'wood']
<_sre.SRE_Match object; span=(7, 11), match='Wood'>


### There are "literals" and "metacharacters"
* Literals: no special meaning; they match themselves.
* Metacharacters: have a special meaning and are not read as normal text.
* Metacharacters = [ ] { } ( ) | ? + * ^ $ \ .
* "-" is special only inside a character class, where it marks a range such as a-z.
* To read a metacharacter literally, put a "\" in front of it.
* OR put it in [ ] <-- This is a character class. Inside [ ], "\", "]", "^" (at the start) and "-" (between two characters) still need a "\".
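
The two ways of matching a metacharacter literally can be sketched as follows. The `price` string is a made-up example; `re.escape` is the standard library helper that adds the backslashes for you.

```python
import re

price = "Total: $5.00 (approx.)"  # hypothetical sample string

# Unescaped, "$" anchors the end of the string, so nothing can follow it.
unescaped = re.findall(r"$5.00", price)
# A backslash in front of each metacharacter makes it literal.
escaped = re.findall(r"\$5\.00", price)
# Inside a character class, "$" and "." are literal as well.
in_class = re.findall(r"[$]5[.]00", price)
# re.escape() escapes a whole string at once.
auto = re.findall(re.escape("(approx.)"), price)

print(unescaped)  # []
print(escaped)    # ['$5.00']
print(in_class)   # ['$5.00']
print(auto)       # ['(approx.)']
```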

### 1. Character classes: 

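* Most metacharacters inside [ ] are read as literals.
* [abc0-9] <--- this is a character class. It matches exactly **one** character from the set.
* "^" as the first character inside [ ] means: not. [^abc] matches any one character except a, b or c.
* "|" means: or = Alternatives. Match one or another. It works only outside [ ]; inside a character class "|" is a literal character, so write [ae], not [a|e].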
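
A small sketch of how "|" and "^" behave inside and outside a character class. The `words` string is invented so that the difference becomes visible.

```python
import re

words = "grey gray gr|y"  # hypothetical sample string

with_pipe = re.findall(r"gr[a|e]y", words)   # "|" is a literal member of the class
without_pipe = re.findall(r"gr[ae]y", words) # one character: "a" or "e"
alternation = re.findall(r"grey|gray", words)  # "|" outside [ ] means "or"
negated = re.findall(r"gr[^ae]y", words)     # any one character except "a" or "e"

print(with_pipe)     # ['grey', 'gray', 'gr|y']
print(without_pipe)  # ['grey', 'gray']
print(alternation)   # ['grey', 'gray']
print(negated)       # ['gr|y']
```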

#### There are also pre-defined character classes
|Shorthand|**Meaning**|Equivalent character class (ASCII only)|
|---|---|---|
|. | any character except a newline (including metacharacters)|(none)|
|\d| digit    |  **[0-9]**|
|\D| non-digit| **[^0-9]**|
|\s| whitespace character (space, tab, newline, etc.)| **[ \t\n\r\f\v]**|
|\S| non-whitespace character| **[^ \t\n\r\f\v]**|
|\w| word character (letter, digit or underscore)| **[a-zA-Z0-9_]**|
|\W| non-word character| **[^a-zA-Z0-9_]**|
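
The shorthand classes can be tried out on a short string. This is only a sketch; `line` is a made-up example containing a space, a tab and a newline.

```python
import re

line = "Room 42,\tfloor_3\n"  # hypothetical sample string

digits = re.findall(r"\d", line)      # single digits
words = re.findall(r"\w+", line)      # runs of letters, digits and underscores
spaces = re.findall(r"\s", line)      # space, tab and newline
non_words = re.findall(r"\W", line)   # everything that is not a word character
dots = re.findall(r".", "a\nb")       # "." skips the newline by default

print(digits)     # ['4', '2', '3']
print(words)      # ['Room', '42', 'floor_3']
print(spaces)     # [' ', '\t', '\n']
print(non_words)  # [' ', ',', '\t', '\n']
print(dots)       # ['a', 'b']
```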


In [13]:
# Examples:
pattern2 = r"gr[ae]y"  # "grey" or "gray"; "|" is not needed inside [ ]
pattern3 = r"wo[^o]"   # "wo" followed by any character except "o", so not "woo"
pattern4 = r"[a-mA-M][n-zN-Z]"  # two character classes = 2 characters: the first from a-m/A-M, the second from n-z/N-Z
pattern5 = r"[1-5][0-9]"  # a two-digit number from 10 to 59
pattern6 = r"\s\S"  # a whitespace character followed by a non-whitespace character

# Swap in another pattern to see its output.
re.findall(pattern3, text)

['wor']

### 2. Repetition quantifiers <------ how many times the directly preceding element (character, character class or group) may repeat.
|quantifier|meaning|
|---|---|
| ? | zero or one time (the preceding element is optional)|
| \+| one or more times|
| \*| zero or more times|
| {n}| exactly n times|
| {n,m}| between n and m times (inclusive)|


**Note:** ? appears again later as a "reluctant" (non-greedy) modifier.
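
Each quantifier from the table can be checked against one small string. This is a minimal sketch; `sample` is invented so that every quantifier gives a different result.

```python
import re

sample = "wo woo wood woodd"  # hypothetical sample string

optional = re.findall(r"wood?", sample)       # the final "d" may be missing
one_or_more = re.findall(r"wo+", sample)      # at least one "o"
zero_or_more = re.findall(r"wood*", sample)   # any number of "d", including none
exactly = re.findall(r"wo{2}", sample)        # exactly two "o"
ranged = re.findall(r"wood{1,2}", sample)     # one or two "d"

print(optional)      # ['woo', 'wood', 'wood']
print(one_or_more)   # ['wo', 'woo', 'woo', 'woo']
print(zero_or_more)  # ['woo', 'wood', 'woodd']
print(exactly)       # ['woo', 'woo', 'woo']
print(ranged)        # ['wood', 'woodd']
```

Note that a quantifier applies only to the element directly before it: in "wood?" only the "d" is optional, not the whole word.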

### 3. Boundary characters
1. ^ start of the string
2. $ end of the string

**Note:** ^ appeared earlier as "not" inside character classes.
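
The two roles of "^" and the effect of "$" can be sketched on a two-line string. `lines` is a made-up example, and `re.MULTILINE` is the standard flag that lets the boundaries match at every line.

```python
import re

lines = "first line\nsecond line"  # hypothetical sample string

at_start = re.findall(r"^\w+", lines)                     # only the very start of the string
at_end = re.findall(r"\w+$", lines)                       # only the very end of the string
each_line = re.findall(r"^\w+", lines, re.MULTILINE)      # the start of every line
negated_tail = re.findall(r"[^aeiou ]+$", "boundary")     # inside [ ], "^" means "not"

print(at_start)      # ['first']
print(at_end)        # ['line']
print(each_line)     # ['first', 'second']
print(negated_tail)  # ['ry']
```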

In [40]:
# Examples:
# Repetition quantifiers.
pattern7 = r"wood?"   # Matches "wood" or "woo".
pattern8 = r"wood??"  # Matches "wood" or "woo", but as short as possible.
pattern9 = r"\w+"     # Matches runs of word characters, each as long as possible.
pattern10 = r"\s\w{3,5}\s"  # The matches include the surrounding whitespace.
pattern11 = r"[1-9]\d*"  # A number without a leading zero.
pattern12 = r"06[1-5][1-9][0-9]{6}"  # 10 digits: "06", a digit 1-5, a digit 1-9, then six more digits.


# 3. Boundary characters
pattern13 = r"\w$"  # Returns [] because the last character of text ("-") is not a word character.
pattern14 = r"^\w"  # Returns ['A'], the first character of text.

re.findall(pattern9, text)


['A',
 'grey',
 'Woodchunk',
 'chunk',
 'gray',
 'wood',
 'as',
 'his',
 'work',
 'Haha',
 '2',
 '345',
 '10']

### 4. Capturing groups () <--- re.findall returns only the part in the (    )

* Non-capturing groups: (?:  ). This part must match, but it is not returned.
* A pattern can have several groups, and groups can be nested (one inside another). re.findall then returns a list of tuples, one entry per group (see pattern17).




In [35]:
text2 = "First Name: Liting, Last Name: Chen, Email: alicechen@gmail.com, Telephone: 06-47831990"
pattern15 = r"Name:\s(\w+)"
text3 = "Bold font marks <b>important</b> words."
pattern16 = r'<b>(\w+)</b>'
pattern17 = r'((\w+)\W(g\w+))'  # nested groups: the whole match, the first word, the word starting with "g"
pattern18 = r'((\w+)\W(?:g\w+))'  # same, but the "g" word is matched without being returned

print(re.findall(pattern16, text2))
print(re.findall(pattern16, text3))

[]
['important']
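
The nested and non-capturing patterns defined in this section are not run in the cell above, so here is a minimal sketch of what they return. The strings and patterns are repeated from the notebook so the block runs on its own; the variable names are made up.

```python
import re

text = "A grey Woodchunk chunk gray wood as his work. Haha! 2-345+10-"
text2 = "First Name: Liting, Last Name: Chen, Email: alicechen@gmail.com, Telephone: 06-47831990"

# One group: findall returns a list of the captured strings.
names = re.findall(r"Name:\s(\w+)", text2)

# Three groups (one wrapping the other two): findall returns a list of tuples.
nested = re.findall(r"((\w+)\W(g\w+))", text)

# The (?: ) part still has to match, but it is left out of the result.
non_capturing = re.findall(r"((\w+)\W(?:g\w+))", text)

print(names)          # ['Liting', 'Chen']
print(nested)         # [('A grey', 'A', 'grey'), ('chunk gray', 'chunk', 'gray')]
print(non_capturing)  # [('A grey', 'A'), ('chunk gray', 'chunk')]
```

Groups are numbered by the position of their opening parenthesis, which is why the outer group comes first in each tuple.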


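### 5. Configuring groups (inline flags)

* (?i) = ignore case   <----- most useful!
* (?m) = multiline: ^ and $ also match at the start and end of each line
* (?a) = ASCII-only matching for \w, \d, \s and their opposites
* (?L) = locale dependent (bytes patterns only)
* (?s) = dot matches all, including newlines
* (?u) = Unicode matching (the default for str patterns in Python 3)
* (?x) = verbose: whitespace and comments in the pattern are ignored

Put these flags at the very start of the pattern.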


In [37]:
# Examples
pattern19 = "(?i)wood"
pattern20 = "(?i)pdf"
print(re.findall(pattern19, text))
print(re.findall(pattern20, "Pdf, pdf, PDF, jpeg, JPEG, JPG"))


['Wood', 'wood']
['Pdf', 'pdf', 'PDF']
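
Each inline flag has an equivalent constant that can be passed as an extra argument instead. The sketch below compares the two spellings for (?i) and shows what (?s) and (?m) change; `files` and `block` are made-up sample strings.

```python
import re

files = "Pdf, pdf, PDF, jpeg"        # hypothetical sample string
block = "<p>line one\nline two</p>"  # hypothetical sample string

inline = re.findall(r"(?i)pdf", files)               # flag written inside the pattern
argument = re.findall(r"pdf", files, re.IGNORECASE)  # same flag passed as an argument

no_dotall = re.findall(r"<p>(.*)</p>", block)     # "." stops at the newline
dotall = re.findall(r"(?s)<p>(.*)</p>", block)    # "." now also matches the newline
multiline = re.findall(r"(?m)^line \w+", block)   # "^" also matches after each newline

print(inline)     # ['Pdf', 'pdf', 'PDF']
print(argument)   # ['Pdf', 'pdf', 'PDF']
print(no_dotall)  # []
print(dotall)     # ['line one\nline two']
print(multiline)  # ['line two']
```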


### 6. Greedy behaviour
By default, re quantifiers are greedy: they match as much as possible and stop only when the match cannot get any longer. A shorter match may exist, but it is not returned.  

If you want the match to stop as soon as possible ( = non-greedy, reluctant), add a ? after the quantifier.

In [39]:
pattern21 = "A.*a"
pattern22 = "A.*?a"  # ? non greedy!
name = "Anna-Lena"
print(re.findall(pattern21, name))
print(re.findall(pattern22,name))

pattern23 = r"\w+"
pattern24 = r"\w+?"  # non-greedy: one word character per match

print(re.findall(pattern23, text))
print(re.findall(pattern24, text))


['Anna-Lena']
['Anna']
['A', 'grey', 'Woodchunk', 'chunk', 'gray', 'wood', 'as', 'his', 'work', 'Haha', '2', '345', '10']
['A', 'g', 'r', 'e', 'y', 'W', 'o', 'o', 'd', 'c', 'h', 'u', 'n', 'k', 'c', 'h', 'u', 'n', 'k', 'g', 'r', 'a', 'y', 'w', 'o', 'o', 'd', 'a', 's', 'h', 'i', 's', 'w', 'o', 'r', 'k', 'H', 'a', 'h', 'a', '2', '3', '4', '5', '1', '0']
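
A common place where greedy matching surprises people is text with repeated delimiters. The sketch below uses a made-up `html` string with two bold tags to compare the greedy and the reluctant version.

```python
import re

html = "<b>one</b> and <b>two</b>"  # hypothetical sample string

greedy = re.findall(r"<b>.*</b>", html)       # runs from the first <b> to the last </b>
reluctant = re.findall(r"<b>.*?</b>", html)   # stops at the first </b> it can reach
contents = re.findall(r"<b>(.*?)</b>", html)  # with a group, only the tag contents are returned

print(greedy)     # ['<b>one</b> and <b>two</b>']
print(reluctant)  # ['<b>one</b>', '<b>two</b>']
print(contents)   # ['one', 'two']
```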


### Explore other re functions

In [42]:
matcher = re.match(pattern12,"06478319910647831991jkl;")
print(matcher)

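# Try it yourself:
# re.fullmatch
# re.search
# Alternatives are tried from left to right, so "woodchunk" can never win here: "wood" always matches first.
matcher2 = re.search("wood|chunk|woodchunk",text)
print(matcher2)

# re.sub
# for substitution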
text_new = re.sub("wood|chunk|woodchunk","XXX",text)
print(text_new)


<_sre.SRE_Match object; span=(0, 10), match='0647831991'>
<_sre.SRE_Match object; span=(11, 16), match='chunk'>
A grey WoodXXX XXX gray XXX as his work. Haha! 2-345+10-
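
For the functions left as an exercise, here is one possible sketch. It reuses the phone pattern and the example text from this notebook, adds re.split on the tail of that text, and shows how to read values out of a match object with `.group()` and `.span()`. The variable names are made up.

```python
import re

text = "A grey Woodchunk chunk gray wood as his work. Haha! 2-345+10-"
phone_pattern = r"06[1-5][1-9][0-9]{6}"  # same as pattern12

# re.match accepts extra characters after the match, re.fullmatch does not.
prefix = re.match(phone_pattern, "0647831991jkl;")
strict = re.fullmatch(phone_pattern, "0647831991jkl;")
exact = re.fullmatch(phone_pattern, "0647831991")

# re.split cuts the string at every "-" or "+"; the trailing "-" leaves an empty string.
numbers = re.split(r"[-+]", "2-345+10-")

# A match object carries the matched text and its position.
hit = re.search(r"wood|chunk", text)

print(prefix.group())           # 0647831991
print(strict)                   # None
print(exact.group())            # 0647831991
print(numbers)                  # ['2', '345', '10', '']
print(hit.group(), hit.span())  # chunk (11, 16)
```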


### Thank you! Please leave your comment!

Edited: 05/05/2018, Lit