# Advanced Regex

## Regular Expression review

- A powerful way to match text

In [1]:
import pandas as pd
import numpy as np
import re

A tool for visualizing regular expressions: https://regexper.com/

In [None]:
text = "That person wears marvelous trousers."

### `Literal strings` vs `sets`

In [None]:
# literal strings: find the pattern 'person'
pattern = 'person'
re.findall(pattern, text)

In [None]:
pattern = 'persona'
re.findall(pattern, text)

In [None]:
pattern = 'person'
re.sub(pattern,'man', text)

In [None]:
# sets: Finding the pattern `p` or `e` or `r` or ...
pattern = '[person]'
print(re.findall(pattern, text))

In [None]:
text = 'São Paulo Sao Paulo Sáo Paulo Sun Paulo seu paulo san paolo sao paulo são paolo sAo Paolo sao_paulo'

pattern = '[Ss][ãaáàâAÃÁÀâeu][oun][ _][Pp]a[uo]lo'
print(re.sub(pattern, 'São Paulo\n', text))

In [None]:
text = "Is it spelled gray or grey?"

pattern = 'gr[ae]y'
re.findall(pattern, text)

> So anything within brackets `[ ]` is considered a `set` in RegEx: a set of characters, any one of which can match at that position.

## Since it is a set, you can specify whole ranges

For example: The set of upper-case letters from A to C.

In [None]:
text = "This is an A and B conversation, so C your way out of it, or Even F."

pattern = '[A-C]'
re.findall(pattern, text)

In [None]:
pattern = '[A-Z]'
re.findall(pattern, text)

In [None]:
text = "I'm not going to 0A the party because 1) Karen is going, 2) I don't like her, and 3) 3B I already have a headache."

pattern = '[1-3]'
re.findall(pattern, text)

In [None]:
pattern = '[0-9]'
re.findall(pattern, text)

In [None]:
pattern = '[0-9A-Z]'
re.findall(pattern, text)

In [None]:
# pattern = '[0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ]'
# re.findall(pattern, text)

In [None]:
pattern = '[0-9][A-Z]'
re.findall(pattern, text)

Some useful sets: 

* `[a-z]`: Any lowercase letter between a and z.
* `[A-Z]`: Any uppercase letter between A and Z.
* `[0-9]`: Any numeric character between 0 and 9.

In [None]:
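# a ^ at the start of a set negates it: match anything that is NOT a digit, a space, or a lowercase letter
pattern = '[^0-9 a-z]'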
re.findall(pattern, text)

# Meta characters: they mean something different from the literal character they look like.

* `.` : Match **any character** except newline (`\n`)
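* `^` : If used at the start of a `set`, negates the set (similar to `~` on a boolean mask in pandas)
> Careful, this character also has a second meaning: if used <u>outside a set</u>, it means `match only at the beginning of the string` (or at the beginning of each line when the `re.MULTILINE` flag is set)
* `$` : Match only at the end of the string (or at the end of each line when the `re.MULTILINE` flag is set)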
* `|` : "OR" operator

## OR

In [None]:
text = 'Andre andre'

In [None]:
pattern = '[Aa]'
re.findall(pattern, text)

In [None]:
pattern = 'A|a'
re.findall(pattern, text)

In [None]:
text = '''
I like penguins
I like lions
I like penguins and lions
'''

pattern = 'penguins|lions'
re.findall(pattern, text)

## Match any character

In [None]:
text = """My boss asked me to turn in my TPS reports. 
I told him they were done, but they are not."""

pattern = '.'
print(re.findall(pattern, text))

## Match everything not in specific set

In [None]:
text = """My boss asked me to turn in my TPS reports. 
I told him they were done, but they are not."""

In [None]:
pattern = '[^a-m]'
print(re.findall(pattern, text))

## Match sentences `beginning with pattern`

In [15]:
text = '''My boss asked me to turn in my TPS reports. 
The boss told him they were done, but they are not.'''

In [16]:
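# careful: the brackets make this a set, so only ONE character from the set is matched at the start
pattern = '^[My boss]'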
print(re.findall(pattern, text))

['M']


In [17]:
pattern = '^[The boss]'
print(re.findall(pattern, text))

[]


In [18]:
pattern = '^turn'
print(re.findall(pattern, text))

[]


In [19]:
# no match: without re.MULTILINE, $ only anchors at the end of the whole string
pattern = 'reports.$'
print(re.findall(pattern, text))

[]


In [20]:
pattern = 'are not.$'
print(re.findall(pattern, text))

['are not.']
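
The empty results above come from the default behaviour of `^` and `$`, which anchor to the whole string rather than to each line. A minimal sketch of the difference, using a made-up `sample` string (not the notebook's `text`) and the standard `re.MULTILINE` flag:

```python
import re

# Hypothetical two-line string with no trailing spaces, used only for this sketch.
sample = "My boss asked for the reports.\nThe boss says they are not."

# Default: ^ and $ anchor to the start and end of the whole string.
start_default = re.findall(r'^The', sample)
end_default = re.findall(r'reports\.$', sample)
print(start_default, end_default)   # [] []

# With re.MULTILINE, ^ and $ also match at the start and end of each line.
start_multi = re.findall(r'^The', sample, flags=re.MULTILINE)
end_multi = re.findall(r'reports\.$', sample, flags=re.MULTILINE)
print(start_multi, end_multi)       # ['The'] ['reports.']
```

Note that a trailing space before the line break would still prevent `reports\.$` from matching, since `$` must sit right after the period.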


## Character classes

* `\d`: digits
* `\w`: word characters (letters, digits, and the underscore)
* `\s`: whitespace (spaces, tabs, newlines)
* `\D`: anything that is not a digit
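
As a quick illustration of these classes, here is a small sketch on a made-up string (`sample` is an assumption for the example, not part of the notebook). Raw strings (`r'...'`) are used so that the backslash reaches the regex engine unchanged.

```python
import re

# Hypothetical sample containing digits, an underscore, a tab, and spaces.
sample = "Room 42, floor_3\tBuilding B"

digits = re.findall(r'\d', sample)       # every single digit
words = re.findall(r'\w+', sample)       # runs of letters, digits, and underscores
whitespace = re.findall(r'\s', sample)   # spaces and the tab

print(digits)       # ['4', '2', '3']
print(words)        # ['Room', '42', 'floor_3', 'Building', 'B']
print(whitespace)   # [' ', ' ', '\t', ' ']
```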

In [None]:
text = 'Andre andre aoijo (  $ p io x -o = 3232 13 ™¡¡™£¡Ωå 3.1 áéóãà'

# raw string (r'...') so the backslash reaches the regex engine unchanged
pattern = r'\d'
print(re.findall(pattern, text))

In [None]:
pattern = r'[^\d]'
# pattern = r'\D'  # equivalent shorthand

print(re.findall(pattern, text))

# Quantifiers 

* `*`: Matches the preceding element (a character, set, or group) 0 or more times
* `+`: Matches the preceding element 1 or more times
* `?`: Matches the preceding element 0 or 1 times (optional)
* `{}`: Matches the preceding element the number of times specified within:
    * `{n}`: Exactly n times
    * `{n,}`: At least n times
    * `{n,m}`: Between n and m times

## `\d*`: matches any numeric character that appears 0 or more times

In [None]:
text = 'Andre andre aoijo (  $ p io x -o = 3232 13 ™¡¡™£¡Ωå 3.1 áéóãà'

pattern = r'\d*'
print(re.findall(pattern, text))
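
The output of `\d*` tends to surprise: because "zero digits" is a valid match, `re.findall` also returns an empty string at positions where no digit is found. A minimal sketch on a short made-up string (`sample` is hypothetical), comparing it with `\d+`:

```python
import re

sample = "a12b"  # hypothetical short string for this sketch

star = re.findall(r'\d*', sample)   # includes empty matches where no digit is found
plus = re.findall(r'\d+', sample)   # requires at least one digit per match

print(star)
print(plus)                         # ['12']

# Dropping the empty strings from the \d* result leaves the same matches as \d+.
non_empty = [m for m in star if m]
print(non_empty)                    # ['12']
```

In practice `\d+` is usually the better choice when the goal is to extract numbers.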

## `\d+`: matches any numeric character that appears 1 or more times

In [None]:
text = 'Andre andre aoijo (  $ p io x -o = 3232 13 ™¡¡™£¡Ωå 3.1 áéóãà'

pattern = r'\d+'
print(re.findall(pattern, text))

In [9]:
text = 'Andre andre aoijo (  $ p io x -o = 3232 13 ™¡¡™£¡Ωå 3.1 áéóãà'

pattern = r'\d+\.?\d+'
print(re.findall(pattern, text))

['3232', '13', '3.1']
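
One caveat with `\d+\.?\d+`: it needs at least two digits, so a lone digit such as `7` is skipped. A possible alternative, sketched here on a made-up `sample` string, makes the whole decimal part optional with a non-capturing group `(?:...)`, which groups without changing what `re.findall` returns:

```python
import re

sample = "Prices: 7, 13, 3.1 and 3232"  # hypothetical data for this sketch

# Original pattern: one or more digits, optional dot, one or more digits.
two_or_more = re.findall(r'\d+\.?\d+', sample)

# Alternative: digits, optionally followed by a dot and more digits.
with_optional_decimals = re.findall(r'\d+(?:\.\d+)?', sample)

print(two_or_more)             # ['13', '3.1', '3232']
print(with_optional_decimals)  # ['7', '13', '3.1', '3232']
```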


## Revisiting the `$` example with one of the most useful quantifiers, `*`

In [None]:
text = '''My boss asked me to turn in my TPS reports. 
My boss told him they were done, but they are not.'''

In [None]:
pattern = r'are not\.$'
print(re.findall(pattern, text))

In [None]:
pattern = r'.are not\.$'
print(re.findall(pattern, text))

In [None]:
pattern = r'.*\n.*are not\.$'
print(re.findall(pattern, text))

In [None]:
text

In [None]:
pattern = ',.*are not.$'
print(re.findall(pattern, text))

In [None]:
text = '''My boss asked, me to turn in my TPS reports. 
My boss told, him they were done, but they, are not.'''

In [None]:
pattern = ',.*are not.$'
print(re.findall(pattern, text))

In [10]:
text = '''My boss asked, me to turn in my TPS reports. 
My boss (told him they) were done (but they) are not.'''

In [11]:

pattern = r'\(.*?\)'
print(re.findall(pattern, text))

['(told him they)', '(but they)']


In [None]:
# ? makes the final 's' optional ('coisa' is Portuguese for 'thing')
re.findall('coisas?','coisa coisas')

# Capturing group

What if I wanted to capture only the text up to a delimiter such as a comma (`,`), but not the delimiter itself?

I would have to use a capturing group, written with parentheses `( )`, to specify exactly what I want to capture.
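
To make the idea concrete, here is a small sketch with `re.search` on a shortened, made-up sentence (`sample` is an assumption for the example). `group(0)` is the whole match, while `group(1)` is only what the first pair of parentheses captured:

```python
import re

sample = "My boss (told him they) were done."  # hypothetical short sentence

# \( and \) match literal parentheses; the unescaped ( ) form the capturing group.
match = re.search(r'\((.*?)\)', sample)

whole = match.group(0)   # the full match, delimiters included
inner = match.group(1)   # only the captured part

print(whole)   # (told him they)
print(inner)   # told him they

# When a pattern has one group, re.findall returns the group, not the full match.
found = re.findall(r'\((.*?)\)', sample)
print(found)   # ['told him they']
```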

In [None]:
text = '''My boss asked, me to turn in my TPS reports. 
My boss (told him they) were done (but they) are not.'''

In [None]:
pattern = r'\((.*?)\)'
print(re.findall(pattern, text))

In [None]:
text = '''My boss asked, me to turn in my TPS reports. 
My boss -told him they- were done -but they- are not.'''

In [None]:
pattern = '-(.*?)-'
print(re.findall(pattern, text))

In [None]:
text = '''My boss asked, me to turn in my TPS reports. 
My boss told, him they were done, but they, are ,not.'''

In [None]:
pattern = ',(.*?),'
print(re.findall(pattern, text))

In [12]:
text = "TerraPower, a nuclear-energy company founded by Bill Gates, is unlikely to follow through on building a demonstration reactor in China, due largely to the Trump administration’s crackdown on the country."

pattern = '[A-Z][a-zA-Z]+'
print(re.findall(pattern, text))

['TerraPower', 'Bill', 'Gates', 'China', 'Trump']


In [13]:
pattern = '([A-Z][a-zA-Z]+ ?[A-Z][a-zA-Z]+)|([A-Z][a-z]+)'

In [14]:
print(re.findall(pattern, text))

[('TerraPower', ''), ('Bill Gates', ''), ('', 'China'), ('', 'Trump')]


In [None]:
simple_names = [name[1] for name in re.findall(pattern, text) if name[1] != '']
combined_names = [name[0] for name in re.findall(pattern, text) if name[0] != '']

In [None]:
combined_names

In [None]:
pattern = '([A-Z][a-z]+)|([A-Z][a-zA-Z]+ ?[A-Z][a-zA-Z]+)'

In [None]:
print(re.findall(pattern, text))
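
With two capturing groups, `re.findall` returns tuples that then have to be unpacked. One possible way to get a flat list instead, sketched on a shortened made-up sentence (`sample` is hypothetical), is to make the second word optional inside a non-capturing group `(?:...)`:

```python
import re

sample = "Bill Gates visited China with TerraPower."  # hypothetical sentence

# A capitalized word, optionally followed by a space and another capitalized word.
# (?:...) groups without capturing, so findall returns the full matches.
pattern = r'[A-Z][a-zA-Z]+(?: [A-Z][a-zA-Z]+)?'

names = re.findall(pattern, sample)
print(names)   # ['Bill Gates', 'China', 'TerraPower']
```

This is only a rough heuristic for names: any capitalized word, such as the first word of a sentence, would match as well.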

# Important Regex Concept: Greediness


What will this match?

In [None]:
text = 'You are yelling! So I will yell too! Let me yell!.'

# anything up to exclamation point
pattern = ".*!"
print(re.findall(pattern, text))

In [None]:
pattern = ".*?!"
re.findall(pattern, text)
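
The two cells above differ only by `?`: `.*` is greedy and takes as much as it can, while `.*?` is lazy and stops at the first chance. A side-by-side sketch follows, with a third option added here as an assumption rather than something from the notebook: a negated set `[^!]*`, which cannot run past an exclamation point in the first place.

```python
import re

sample = 'You are yelling! So I will yell too! Let me yell!.'

greedy = re.findall(r'.*!', sample)       # runs to the LAST exclamation point
lazy = re.findall(r'.*?!', sample)        # stops at the FIRST exclamation point each time
negated = re.findall(r'[^!]*!', sample)   # any run of non-'!' characters, then '!'

print(greedy)    # ['You are yelling! So I will yell too! Let me yell!']
print(lazy)      # ['You are yelling!', ' So I will yell too!', ' Let me yell!']
print(negated)   # ['You are yelling!', ' So I will yell too!', ' Let me yell!']
```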

In [3]:
text = "Let's see how we can match the following: aw, aww, awww, awwww, awwwww"

pattern = "aw{2}"
print(re.findall(pattern, text))

['aww', 'aww', 'aww', 'aww']


In [8]:
text = "Let's see how we can match the following: aw, aww, awww, awwww, awwwww"

pattern = "aw{2,}"
print(re.findall(pattern, text))

['aww', 'awww', 'awwww', 'awwwww']


In [4]:
text = "Let's see how we can match the following: aw, aww, awww, awwww, awwwww"

pattern = "aw{2,3}"
print(re.findall(pattern, text))

['aww', 'awww', 'awww', 'awww']


In [None]:
text = "Ooooooiiiii gente"

# the two patterns are equivalent: + is shorthand for {1,}
pattern = "[Oo]{1,}i{1,}"
pattern = "[Oo]+i+"
print(re.findall(pattern, text))

In [None]:
text = "If you tell the truth 1 time, you don't have to remember anything 2 times."

pattern = r'\w+'
print(re.findall(pattern, text))

In [9]:
# word length: words with at least 4 characters
pattern = r'\w{4,}'
print(re.findall(pattern, text))

['tell', 'truth', 'time', 'have', 'remember', 'anything', 'times']
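
The `{4,}` quantifier finds words of at least four characters. To find words of exactly four characters, `{4}` alone is not enough, because it would also match the first four letters of a longer word. A minimal sketch, assuming the word-boundary anchor `\b` (not covered above) to pin both ends of the word:

```python
import re

sample = "If you tell the truth 1 time, you don't have to remember anything 2 times."

at_least_four = re.findall(r'\w{4,}', sample)
# \b matches the position between a word character and a non-word character.
exactly_four = re.findall(r'\b\w{4}\b', sample)

print(at_least_four)   # ['tell', 'truth', 'time', 'have', 'remember', 'anything', 'times']
print(exactly_four)    # ['tell', 'time', 'have']
```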


https://phoneregex.com/