# <u>Reg</u>ular <u>Ex</u>pression - Introduction to Regex

Regex is a powerful way to match text. Instead of trying to match a `literal string`, you can try to match `patterns`.

In [1]:
import re

A website that visualizes your regular expressions: https://regexper.com/

In [10]:
# matching literal strings
text = 'My neighbor, Mr. Roger, has 5 dogs and one of them is named roger.'

pattern = 'Roger'

re.findall(pattern, text)

['Roger']
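Note that a literal pattern is matched character for character, including case, which is why only the capitalized name was found above. A minimal sketch using the same sentence (the variable names are only illustrative):

```python
import re

text = 'My neighbor, Mr. Roger, has 5 dogs and one of them is named roger.'

# A literal pattern must match exactly, including upper/lower case.
capitalized = re.findall('Roger', text)
lowercase = re.findall('roger', text)

print(capitalized)  # ['Roger']
print(lowercase)    # ['roger']
```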

## Introducing Sets

Sets are used to match `one of the characters inside them`. That is, the set `[Roger]` matches any single one of the following characters: `R`, `o`, `g`, `e`, or `r`.

In [11]:
text = 'My neighbor, Mr. Roger, has 5 dogs and one of them is named roger.'
pattern = '[Roger]'

re.findall(pattern, text)

['e',
 'g',
 'o',
 'r',
 'r',
 'R',
 'o',
 'g',
 'e',
 'r',
 'o',
 'g',
 'o',
 'e',
 'o',
 'e',
 'e',
 'r',
 'o',
 'g',
 'e',
 'r']
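A smaller sketch of the same idea, using a made-up string: each match is a single character, and the order of the characters inside the set does not matter.

```python
import re

sample = 'good dog'  # illustrative string, not from the examples above

# Each match is ONE character taken from the set.
matches = re.findall('[dog]', sample)

# The order of the characters inside the brackets makes no difference.
same_matches = re.findall('[god]', sample)

print(matches)       # ['g', 'o', 'o', 'd', 'd', 'o', 'g']
print(same_matches)  # ['g', 'o', 'o', 'd', 'd', 'o', 'g']
```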

You can mix sets and literal strings:

In [12]:
text = 'My neighbor, Mr. Roger, has 5 dogs and one of them is named roger.'

pattern = '[Rr]oger'
# matches 'Roger' or 'roger' (but not, for example, 'xoger')

re.findall(pattern, text)

['Roger', 'roger']
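A pattern can contain more than one set. As a rough sketch with a made-up word list, the pattern below accepts two spellings and two capitalizations of the same word:

```python
import re

words = 'gray grey Gray gruy groy'  # illustrative input

# [Gg] allows either capitalization, [ae] allows either spelling.
pattern = '[Gg]r[ae]y'
found = re.findall(pattern, words)

print(found)  # ['gray', 'grey', 'Gray']
```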

The `re` module has a function called `.sub()` that substitutes every match of a regex with a replacement string.

In [14]:
text = 'My neighbor, Mr. Roger, has 5 dogs and one of them is named roger.'

pattern = '[Rr]oger' # R or r, followed by the literal string 'oger'

# find this pattern and substitute it with 'Rogers'
re.sub(pattern, 'Rogers', text)

'My neighbor, Mr. Rogers, has 5 dogs and one of them is named Rogers.'
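Two details of `re.sub` are worth a quick sketch: it returns a new string rather than modifying the original, and its optional `count` argument limits how many matches are replaced. The sentence below is made up for illustration.

```python
import re

text = 'A grey cat and a gray dog.'
pattern = 'gr[ae]y'  # 'gray' or 'grey'

# re.sub returns a new string; the original text is left unchanged.
all_replaced = re.sub(pattern, 'gray', text)

# The optional count argument limits how many matches are replaced.
first_only = re.sub(pattern, 'black', text, count=1)

print(all_replaced)  # A gray cat and a gray dog.
print(first_only)    # A black cat and a gray dog.
```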

In [16]:
text = '''
Sáo Paulo 
São Paulo 
Sao Paulo 
Sao Paolo 
San Pablo 
sao paulo 
sao Paulo 
são Paulo 
sao-paulo 
são paulo 
São Paulo 
San Paulo'''

pattern = '[Ss][áãa][on][ -][Pp]a[uob]lo'

re.findall(pattern, text)

['Sáo Paulo',
 'São Paulo',
 'Sao Paulo',
 'Sao Paolo',
 'San Pablo',
 'sao paulo',
 'sao Paulo',
 'são Paulo',
 'sao-paulo',
 'são paulo',
 'São Paulo',
 'San Paulo']

In [None]:
print(re.sub(pattern, 'São Paulo', text))
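To make the effect of the substitution easier to see, here is a shortened sketch with the same pattern on a one-line input. The input string, including the non-matching `Santos`, is an assumption added for illustration.

```python
import re

pattern = '[Ss][áãa][on][ -][Pp]a[uob]lo'

# Three spelling variants plus one name that should be left alone.
text = 'sao-paulo, San Pablo, Sao Paolo, Santos'

normalized = re.sub(pattern, 'São Paulo', text)

print(normalized)  # São Paulo, São Paulo, São Paulo, Santos
```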

# Pattern sets:

Range

1. `[a-z]`: Any lowercase letter between a and z.
2. `[A-Z]`: Any uppercase letter between A and Z.
3. `[0-9]`: Any numeric character between 0 and 9.

Match the opposite of the range
1. `[^a-z]`: Any character that is **NOT** a lowercase letter between a and z.
2. `[^A-Z]`: Any character that is **NOT** an uppercase letter between A and Z.
3. `[^0-9]`: Any character that is **NOT** a numeric character between 0 and 9.



In [20]:
text = 'My neighbor, Mr. Rogers, has 5 dogs.'
pattern = '[a-e]'

re.findall(pattern, text)

['e', 'b', 'e', 'a', 'd']

In [21]:
re.findall('[A-Z]', text)

['M', 'M', 'R']

In [22]:
re.findall('[A-N]', text)

['M', 'M']

In [23]:
re.findall('[efghijklmno]', text)

['n', 'e', 'i', 'g', 'h', 'o', 'o', 'g', 'e', 'h', 'o', 'g']

In [24]:
re.findall('[e-o]', text)

['n', 'e', 'i', 'g', 'h', 'o', 'o', 'g', 'e', 'h', 'o', 'g']

In [25]:
re.findall('[0-9]', text)

['5']

## Pattern sets concatenation

- `[A-Za-z]`: Matches any single letter, either lowercase between a and z OR uppercase between A and Z

In [28]:
# you can concatenate ranges

re.findall('[A-Za-z0-9]', text)

['M',
 'y',
 'n',
 'e',
 'i',
 'g',
 'h',
 'b',
 'o',
 'r',
 'M',
 'r',
 'R',
 'o',
 'g',
 'e',
 'r',
 's',
 'h',
 'a',
 's',
 '5',
 'd',
 'o',
 'g',
 's']
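Ranges can also be partial. As a small sketch on a made-up string, the set below combines three ranges to match hexadecimal digits only:

```python
import re

sample = '#1A2bZz'  # illustrative string

# Three ranges in one set: digits, a-f and A-F (hexadecimal digits).
hex_digits = re.findall('[0-9a-fA-F]', sample)

# Two ranges in one set: any ASCII letter.
letters = re.findall('[A-Za-z]', sample)

print(hex_digits)  # ['1', 'A', '2', 'b']
print(letters)     # ['A', 'b', 'Z', 'z']
```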

The opposite: 
- A `^` at the start of a set matches any character that is NOT in the set

In [29]:
pattern = '[^a-z]'
re.findall(pattern, text)

['M', ' ', ',', ' ', 'M', '.', ' ', 'R', ',', ' ', ' ', '5', ' ', '.']

In [30]:
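# sets can combine several ranges and character classes
# \s matches any whitespace character (space, tab, newline, ...)

pattern = r'[^a-zA-Z0-9\s]'  # excludes all whitespace
pattern = r'[^a-zA-Z0-9 ]'   # alternative: excludes only the space character
re.findall(pattern, text)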


[',', '.', ',', '.']
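Negated sets combine well with `re.sub` for cleaning text. A minimal sketch, assuming the goal is to keep only the digits of a phone-like string (the string is illustrative):

```python
import re

raw = '+55 (11) 99111-1831'

# Replace every character that is NOT a digit with an empty string.
digits_only = re.sub('[^0-9]', '', raw)

print(digits_only)  # 5511991111831
```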

# Character classes:

A single character that represents a class of characters

1. `\w`: Any word character (letters, digits, and underscore); for ASCII text this is `[A-Za-z0-9_]`.
2. `\d`: Any numeric character (a digit); for ASCII text this is `[0-9]`.
3. `\s`: Any whitespace character (space, tab, newline, ...).

And their opposites:
1. `\W`: Any **NON** word character; for ASCII text this is `[^A-Za-z0-9_]`.
2. `\D`: Any **NON** numeric character (anything that is not a digit).
3. `\S`: Any **NON** whitespace character.

And to match `ANY` character:
1. `.`: matches any character except a newline
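The classes above can be compared side by side. This is a small sketch on a made-up string; it also shows two standard `re` flags, `re.ASCII` (restricts `\w` to ASCII) and `re.DOTALL` (lets `.` match a newline too).

```python
import re

sample = 'Hi_5 é!\n'  # illustrative string with an accent and a newline

word_chars = re.findall(r'\w', sample)            # Unicode-aware by default
ascii_words = re.findall(r'\w', sample, re.ASCII) # only [A-Za-z0-9_]
digits = re.findall(r'\d', sample)
spaces = re.findall(r'\s', sample)
non_words = re.findall(r'\W', sample)
any_char = re.findall(r'.', sample)               # stops at the newline
any_char_nl = re.findall(r'.', sample, re.DOTALL) # newline included

print(word_chars)   # ['H', 'i', '_', '5', 'é']
print(ascii_words)  # ['H', 'i', '_', '5']
print(digits)       # ['5']
print(spaces)       # [' ', '\n']
print(non_words)    # [' ', '!', '\n']
print(len(any_char), len(any_char_nl))  # 7 8
```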

In [31]:
text = 'My phone number is +55 (11) 99111-1831'

In [37]:
# to find the literal string '+', it has to be escaped because the '+' itself means something else
# a raw string (r'...') keeps the backslashes intact for the regex engine
country_code = re.findall(r'\+\d\d', text)
country_code

['+55']

In [38]:
# to find parentheses, they have to be escaped because parentheses themselves mean something else
ddd = re.findall(r'\(\d\d\)', text)
ddd

['(11)']

In [None]:
phone = re.findall(r'\d\d\d\d\d-\d\d\d\d', text)
phone
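The escaping rules above can be sketched in one place. The example shows what happens when `+` is left unescaped, matches the whole number with manual escapes, and uses the standard helper `re.escape` to escape a literal string automatically. Variable names are illustrative.

```python
import re

text = 'My phone number is +55 (11) 99111-1831'

# An unescaped '+' at the start of a pattern has nothing to repeat,
# so the pattern fails to compile.
try:
    re.findall('+55', text)
    unescaped_failed = False
except re.error:
    unescaped_failed = True

# Raw string with manual escapes for '+', '(' and ')'.
full_number = re.findall(r'\+\d\d \(\d\d\) \d\d\d\d\d-\d\d\d\d', text)

# re.escape builds an escaped pattern from a literal string.
prefix = re.findall(re.escape('+55 (11)'), text)

print(unescaped_failed)  # True
print(full_number)       # ['+55 (11) 99111-1831']
print(prefix)            # ['+55 (11)']
```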

In [None]:
text
re.findall(r'.\d\d', text)


In [39]:
# Let's say I want to match each of these, except the 'abc1'

text = '''
cat.
896.
?=+.
abc1
'''

In [40]:
re.findall(r'...\.', text)

['cat.', '896.', '?=+.']
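The difference between `.` and `\.` is what excludes `abc1` here. A short sketch on the same four lines, comparing the escaped and unescaped versions:

```python
import re

text = 'cat.\n896.\n?=+.\nabc1'

# Three arbitrary characters followed by a LITERAL dot.
with_literal_dot = re.findall(r'...\.', text)

# Four arbitrary characters: the last '.' now matches anything,
# so 'abc1' is matched as well.
any_four = re.findall(r'....', text)

print(with_literal_dot)  # ['cat.', '896.', '?=+.']
print(any_four)          # ['cat.', '896.', '?=+.', 'abc1']
```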