# [Regular expressions](https://www.regular-expressions.info/tutorial.html)

"a regular expression is a pattern describing a certain amount of text"

#### There are special characters:

. + * ? ^ $ ( ) [ ] { } | \

Each one of them has a specific meaning. To search for one of these characters literally, escape it with a backslash (for example, `\.` matches a literal dot).
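
A minimal sketch of what escaping changes. The `sample` string and the variable names are made up for illustration; `re.escape` is the standard-library helper that adds the backslashes automatically.

```python
import re

sample = "3.50 3x50"

# Unescaped, "." is a wildcard, so the pattern also matches "3x50".
unescaped = re.findall(r'3.50', sample)
print(unescaped)  # ['3.50', '3x50']

# Escaped, "\." matches only a literal dot.
escaped = re.findall(r'3\.50', sample)
print(escaped)  # ['3.50']

# re.escape() inserts the backslashes for you.
literal_pattern = re.escape("(202)")
area_codes = re.findall(literal_pattern, "(202) 555-0128")
print(area_codes)  # ['(202)']
```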

#### Import regex library

In [None]:
import re

#### Raw strings
_An 'r' before a string tells the Python interpreter to treat backslashes as literal (raw) characters rather than as the start of an escape sequence_

In [None]:
print('This is a line \nand this is a new one')
print(r'This is a line \nand this is a new one')

#### How are regex useful?
For example, imagine that you have a list of dates one hundred times longer than the one below. How can you find all the dates in August?

In [None]:
dates = """
08-12-2012
06-07-2015
08/08/08
04.08.08
09.09.2019
9.9.2019
8.9.2019
"""

In [None]:
# Month first: an optional leading zero and 8, then separator, day, separator, year
results = re.findall(r'\b0?8[-/.]\d{1,2}[-/.]\d{2,4}\b', dates)
print(results)

#### Text string

In [None]:
faketext_path = 'faketext.txt'  # hypothetical file name; point it at your own text file
with open(faketext_path) as f:
    faketext = f.read()

In [None]:
faketext = """A

Alex
Georgios
Shama
Suleiman
Liam
Olivia
Noah
Emma
Oliver
Charlotte
Elijah
Amelia
James
Ava
William
Sophia
Benjamin
Isabella
Lucas
Mia
Henry
Evelyn
Theodore
Harper

08-12-2012
06-07-2015
08/08/08
04.08.08
09.09.2019
9.9.2019
8.9.2019

202-555-0166
201*555*0177
(202) 555-0128
(201)555-0178
202 555 0198
900-555-0166
800*555*0177
(900) 555-0128
(800)555-0178
900 555 0198
800.555.0199
800.555.0152
201.555.0199
201.555.0152


Mr. Darcy
Dr. Tsolakis
Prof. Cartledge
Mr. T
Mrs Robinson
Mr. Bean
Miss Piggy 


JohnDoe@gmail.com
John_Doe@gmail.com
JohnDoe@facebook.net
john.doe@uchicago.edu
sincere.jakubowski@gutkowski.com
kerluke.cierra@bradtke.com
nromaguera@yahoo.com
john-doe@gmail.com
johndoe1990@gmail.com

Z"""

#### finditer
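`re.finditer` returns an iterator of match objects, each carrying the matched text and its position. A minimal sketch on a short stand-in for `faketext` (the `sample` string is made up for illustration):

```python
import re

sample = "Alex\n08-12-2012\n202-555-0166"

# Each item is a match object, not a plain string.
matches = list(re.finditer(r'\d+', sample))
for match in matches:
    # span() gives (start, end) indices; group() gives the matched text.
    print(match.span(), match.group())
```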

### Word Characters

#### . 
Any character except a new line

#### \d
Digits

#### \D
Not a digit

#### \w
Word character (lowercase and uppercase letters, digits, or underscore)

#### \W
Not a word character 

#### \s
Whitespace (space, tab, newline)

#### \S
Not whitespace
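
The classes above can be compared side by side on one short string. This is only a sketch; `sample` is an invented example.

```python
import re

sample = "Mr. T\t42_x"

digits = re.findall(r'\d', sample)
print(digits)       # ['4', '2']

word_chars = re.findall(r'\w', sample)
print(word_chars)   # ['M', 'r', 'T', '4', '2', '_', 'x']

whitespace = re.findall(r'\s', sample)
print(whitespace)   # [' ', '\t']

non_word = re.findall(r'\W', sample)
print(non_word)     # ['.', ' ', '\t']

# "." skips the newline by default.
no_newline = re.findall(r'.', "a\nb")
print(no_newline)   # ['a', 'b']
```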

### Anchors and Boundaries
They match positions, not characters

#### \b
Word boundary 

#### \B
Not a word boundary 

#### ^
Beginning of the string (or of each line with the `re.MULTILINE` flag)

#### $
End of the string (or of each line with the `re.MULTILINE` flag)
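
Because anchors match positions, it helps to look at where a match starts rather than at its text. A small sketch; the `sample` string and the `starts` helper are hypothetical names introduced here.

```python
import re

sample = "cat concat catalog"

def starts(pattern):
    """Return the start index of every match of pattern in sample."""
    return [m.start() for m in re.finditer(pattern, sample)]

print(starts(r'\bcat\b'))  # [0]   "cat" as a whole word
print(starts(r'\Bcat'))    # [7]   "cat" inside "concat"
print(starts(r'cat\B'))    # [11]  "cat" at the start of "catalog"
print(starts(r'^cat'))     # [0]   only at the beginning of the string
print(starts(r'log$'))     # [15]  only at the end of the string
```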

#### Phone numbers

#### Quantifiers

| Quantifier | Meaning |
| ----------- | ----------- |
| a? | Zero or one of a |
| a* | Zero or more of a |
| a+ | One or more of a |
| [0-9]+ | One or more of 0-9 |
| a{3} | Exactly 3 of a |
| a{3,} | 3 or more of a |
| a{3,6} | Between 3 and 6 of a |
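| a* | Greedy quantifier (matches as much as possible) |
| a*? | Lazy quantifier (matches as little as possible) |
| a*+ | Possessive quantifier (never gives back what it matched; Python's `re` supports it from version 3.11) |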
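
The difference between greedy and lazy matching is easiest to see on text with repeated delimiters. A minimal sketch with an invented `html` string; the possessive form is left out so the code also runs on Python versions before 3.11.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: ".*" grabs as much as it can, so there is one long match.
greedy = re.findall(r'<.*>', html)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Lazy: ".*?" stops at the first ">" it can.
lazy = re.findall(r'<.*?>', html)
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']

# Counted repetition: runs of three or four digits.
counted = re.findall(r'\d{3,4}', "12 123 12345")
print(counted)  # ['123', '1234']
```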

In [None]:
sentence = """
coffe coffee please sir I need some more coffee please
"""

#### Simplify the above regex
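The pattern to simplify is not shown here, so the sketch below assumes it was an alternation that matches both spellings in the sentence, such as `coffee|coffe`. Under that assumption, the `?` quantifier gives a shorter equivalent.

```python
import re

sentence = "coffe coffee please sir I need some more coffee please"

# Assumed starting point: list both spellings, longer one first.
verbose = re.findall(r'coffee|coffe', sentence)

# Simplified: the final "e" is optional.
simplified = re.findall(r'coffee?', sentence)

print(verbose)     # ['coffe', 'coffee', 'coffee']
print(simplified)  # ['coffe', 'coffee', 'coffee']
```

Order matters in the alternation: `coffe|coffee` would stop at `coffe` every time, because the first alternative that succeeds wins.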

#### Get all phone numbers
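One possible pattern, shown on a shortened copy of the phone lines in `faketext` plus one date to confirm it is skipped. It assumes every number is three digits, three digits, four digits, with optional parentheses around the area code and `-`, `*`, `.` or a space as separator.

```python
import re

phones = """202-555-0166
201*555*0177
(202) 555-0128
(201)555-0178
202 555 0198
800.555.0199
08-12-2012"""

# \(? and \)? : optional parentheses around the area code
# [-*. ]?     : optional separator after the area code
phone_pattern = r'\(?\d{3}\)?[-*. ]?\d{3}[-*. ]\d{4}'
found = re.findall(phone_pattern, phones)
print(found)
```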

#### Get all phone numbers that begin with 800
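A sketch that replaces the generic area code `\d{3}` with the literal `800`. The `phones` string is a shortened sample, and the pattern assumes `800` never appears in the middle of another number.

```python
import re

phones = """202-555-0166
800*555*0177
(800)555-0178
900 555 0198
800.555.0199"""

toll_free = re.findall(r'\(?800\)?[-*. ]?\d{3}[-*. ]\d{4}', phones)
print(toll_free)  # ['800*555*0177', '(800)555-0178', '800.555.0199']
```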

#### -
A range inside a character class, e.g., [1-5] or [a-z] or [A-Z]

#### ^-
A negated range: any character not in the range, e.g., [^1-5] or [^a-z] or [^A-Z]
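
A short sketch of ranges and negated ranges; the `sample` string is invented for illustration.

```python
import re

sample = "Room 4B, level 17"

digits = re.findall(r'[0-9]', sample)
print(digits)     # ['4', '1', '7']

capitals = re.findall(r'[A-Z]', sample)
print(capitals)   # ['R', 'B']

# "^" right after "[" negates the class: anything that is not a letter or a space.
others = re.findall(r'[^a-zA-Z ]', sample)
print(others)     # ['4', ',', '1', '7']
```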

#### Get all phone numbers that begin with 800 or 900
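A character class covers both prefixes at once: `[89]00` matches `800` or `900`. A sketch on a shortened sample, with the same separator assumptions as the other phone patterns:

```python
import re

phones = """202-555-0166
900-555-0166
800*555*0177
(900) 555-0128
201.555.0152"""

special = re.findall(r'\(?[89]00\)?[-*. ]?\d{3}[-*. ]\d{4}', phones)
print(special)  # ['900-555-0166', '800*555*0177', '(900) 555-0128']
```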

#### Get all the prefixes with the attached names
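One way to approach this, assuming the only prefixes are the ones that appear in `faketext` (Mr, Mrs, Miss, Dr, Prof) and that every name starts with a capital letter. The group is non-capturing so that `findall` returns the whole match rather than just the prefix.

```python
import re

people = """Mr. Darcy
Dr. Tsolakis
Mrs Robinson
Miss Piggy
Mr. T
Alex"""

# (?:...) groups the alternatives without capturing; \.? makes the dot optional.
titled = re.findall(r'(?:Mr|Mrs|Miss|Dr|Prof)\.?\s[A-Z]\w*', people)
print(titled)  # ['Mr. Darcy', 'Dr. Tsolakis', 'Mrs Robinson', 'Miss Piggy', 'Mr. T']
```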

#### Match the emails
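A deliberately simple sketch that covers the addresses in `faketext`: letters, digits, underscores, dots and hyphens before the `@`, then a domain and a suffix. It is not a full email validator.

```python
import re

contacts = """JohnDoe@gmail.com
john.doe@uchicago.edu
john-doe@gmail.com
Mr. Bean"""

emails = re.findall(r'[\w.-]+@[\w-]+\.\w+', contacts)
print(emails)  # ['JohnDoe@gmail.com', 'john.doe@uchicago.edu', 'john-doe@gmail.com']
```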

#### Replace

#### sub

In [None]:
dates = """08-12-2012
06-07-2015
08/08/08
04.08.08
09.09.2019
9.9.2019
8.9.2019"""
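
`re.sub(pattern, replacement, string)` returns a new string with every match replaced. A sketch on a shortened copy of `dates`: the first call unifies the separators, the second uses backreferences (`\1`, `\2`, `\3`) to swap the first two fields.

```python
import re

dates = """08-12-2012
08/08/08
04.08.08"""

# Replace "/" and "." with "-".
normalized = re.sub(r'[/.]', '-', dates)
print(normalized)

# Capture the three fields and write them back in a different order.
swapped = re.sub(r'(\d{1,2})[-/.](\d{1,2})[-/.](\d{2,4})', r'\2/\1/\3', dates)
print(swapped)
```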

#### findall
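`re.findall` returns all non-overlapping matches as a list of strings. If the pattern contains capturing groups, it returns the groups instead, as tuples. A minimal sketch on an invented two-date string:

```python
import re

two_dates = "08-12-2012 06-07-2015"

whole = re.findall(r'\d{2}-\d{2}-\d{4}', two_dates)
print(whole)   # ['08-12-2012', '06-07-2015']

parts = re.findall(r'(\d{2})-(\d{2})-(\d{4})', two_dates)
print(parts)   # [('08', '12', '2012'), ('06', '07', '2015')]
```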

#### search
Returns the first match; if there is no match, it returns None
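
A short sketch of both outcomes; the `text` string is made up for illustration.

```python
import re

text = "08-12-2012 and 06-07-2015"

# Only the first four-digit run is returned, as a match object.
first_year = re.search(r'\d{4}', text)
print(first_year.group(), first_year.start())  # 2012 6

# No match: search returns None, so check before calling .group().
missing = re.search(r'XYZ', text)
print(missing)  # None
```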

#### Flags
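Flags change how a pattern is interpreted. A minimal sketch of two common ones, `re.IGNORECASE` and `re.MULTILINE`, on an invented `names` string:

```python
import re

names = "alex\nGeorgios\nALEX"

# IGNORECASE: letters match regardless of case.
any_case = re.findall(r'alex', names, flags=re.IGNORECASE)
print(any_case)    # ['alex', 'ALEX']

# Without MULTILINE, ^ and $ anchor to the whole string, so nothing matches.
whole_string = re.findall(r'^\w+$', names)
print(whole_string)  # []

# MULTILINE: ^ and $ also match at the start and end of each line.
per_line = re.findall(r'^\w+$', names, flags=re.MULTILINE)
print(per_line)    # ['alex', 'Georgios', 'ALEX']
```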

#### Write a regex that finds all Roman numerals and print only how many Roman numerals exist in the string.

| Roman | Value |
| --- | --- |
| I | 1 |
| II | 2 |
| III | 3 |
| IV | 4 |
| IX | 9 |
| X | 10 |
| L | 50 |
| C | 100 |
| D | 500 |
| M | 1000 |

In [None]:
numbers = "DCXXXIV, CXXXI, CCXXXI, DCXX, DXCIX, DCIV, DCLVII, CLXXXV, XVI, CV, MCCLXXIV, CMIX, DXXXI, DL, DXCI, CDLXXIX, DLXII, CMXLII, CDX, CLXXXVIII, CDV, CXCI, XLII, LXVI, DCCXLVI, CDLXXXVII"
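
One possible approach, sketched on a shortened sample (`sample_numbers` is a hypothetical stand-in for `numbers`): treat any whole word made only of the letters I, V, X, L, C, D and M as a numeral and count the matches. This does not check that a numeral is well formed, so a string like `IIII` would also be counted.

```python
import re

sample_numbers = "XVI, CV, MCCLXXIV, CMIX"  # shortened sample, not the full list

# \b ... \b keeps each numeral as one whole-word match.
numerals = re.findall(r'\b[IVXLCDM]+\b', sample_numbers)
print(len(numerals))  # 4
```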

----