# Advanced Regex

## Regular Expression review

- A powerful way to match text

In [1]:
import pandas as pd
import numpy as np
import re

1. https://regexper.com/ (draws a pattern as a railroad diagram)
2. https://regexr.com/ (interactive pattern tester)

In [15]:
text = "That person wears marvelous trousers."

### `Literal strings` vs `sets`

In [21]:
# literal strings: find the pattern 'person'
pattern = 'person'
re.findall(pattern, text)

['person']

In [18]:
pattern = 'persona'
re.findall(pattern, text)

[]

In [22]:
pattern = 'person'
re.sub(pattern,'man', text)

'That man wears marvelous trousers.'

In [25]:
# sets: Finding the pattern `p` or `e` or `r` or ...
pattern = '[person]'
print(re.findall(pattern, text))

['p', 'e', 'r', 's', 'o', 'n', 'e', 'r', 's', 'r', 'e', 'o', 's', 'r', 'o', 's', 'e', 'r', 's']


In [30]:
text = 'São Paulo Sao Paulo Sáo Paulo Sun Paulo seu paulo san paolo sao paulo são paolo sAo Paolo sao_paulo'

pattern = '[Ss][ãaáàâAÃÁÀâeu][oun][ _][Pp]a[uob]lo'
print(re.sub(pattern, 'São Paulo\n', text))

São Paulo
 São Paulo
 São Paulo
 São Paulo
 São Paulo
 São Paulo
 São Paulo
 São Paulo
 São Paulo
 São Paulo



In [None]:
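# `df` is assumed to be a DataFrame with a string column named 'text' (not defined in this notebook).
# Both lines apply the same replacement to every row.
df['text'].apply(lambda x: re.sub(pattern, 'São Paulo\n', x))
df['text'].str.replace(pattern, 'São Paulo\n', regex=True)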
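Without a DataFrame at hand, the same normalization can be sketched over a plain list. The `cities` values below are hypothetical stand-ins for a text column, and `sp_pattern` is just the set-based pattern from above compiled once for reuse.

```python
import re

# Same set-based pattern as above, compiled once so it can be reused.
sp_pattern = re.compile('[Ss][ãaáàâAÃÁÀâeu][oun][ _][Pp]a[uob]lo')

# Hypothetical sample values standing in for a DataFrame column.
cities = ['Sao Paulo', 'seu paulo', 'sao_paulo', 'Rio de Janeiro']

# Replace every spelling variant; strings without a match are left untouched.
normalized = [sp_pattern.sub('São Paulo', city) for city in cities]
print(normalized)
```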

In [31]:
text = "Is it spelled gray or grey?"

pattern = 'gr[ae]y'
re.findall(pattern, text)

['gray', 'grey']

> Anything within brackets `[ ]` is treated as a `set` in RegEx: a set of characters, any one of which may match at that position.
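A small sketch of two consequences of that rule, reusing the sentence from the first example: the order of characters inside the brackets does not matter, and a `.` inside a set is matched literally.

```python
import re

text = "That person wears marvelous trousers."

# Order inside the brackets is irrelevant: both sets match the same characters.
same_matches = re.findall('[person]', text) == re.findall('[nosrep]', text)
print(same_matches)

# Inside a set, '.' is a literal dot rather than "any character".
literal_dot = re.findall('[.]', text)
print(literal_dot)
```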

## Since it is a set, you can match whole ranges

For example: The set of upper-case letters from A to C.

In [32]:
text = "This is an A and B conversation, so C your way out of it, or Even F."

pattern = '[A-C]'
re.findall(pattern, text)

['A', 'B', 'C']

In [33]:
pattern = '[A-Z]'
re.findall(pattern, text)

['T', 'A', 'B', 'C', 'E', 'F']

In [34]:
text = "I'm not going to 0A the party because 1) Karen is going, 2) I don't like her, and 3) 3B I already have a headache."

pattern = '[1-3]'
re.findall(pattern, text)

['1', '2', '3', '3']

In [35]:
pattern = '[0-9]'
re.findall(pattern, text)

['0', '1', '2', '3', '3']

In [36]:
pattern = '[0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ]'
re.findall(pattern, text)

['I', '0', 'A', '1', 'K', '2', 'I', '3', '3', 'B', 'I']

In [37]:
pattern = '[0-9A-Z]'
re.findall(pattern, text)

['I', '0', 'A', '1', 'K', '2', 'I', '3', '3', 'B', 'I']

In [41]:
pattern = '[0-9][A-Z]'
re.findall(pattern, text)

['0A', '3B']

Some useful sets: 

* `[a-z]`: Any lowercase letter between a and z.
* `[A-Z]`: Any uppercase letter between A and Z.
* `[0-9]`: Any numeric character between 0 and 9.

In [44]:
pattern = '[^0-9 a-z]'
re.findall(pattern, text)

['I', "'", 'A', ')', 'K', ',', ')', 'I', "'", ',', ')', 'B', 'I', '.']

# Meta characters - They mean something different from the literal character.

* `.` : Match **any character** except newline (`\n`)
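* `^` : If used as the first character within a `set`, negates the set (similar to `not` in Python, or `~` on a pandas boolean mask)
> Careful, this character also has another meaning: If used <u>outside a set</u>, it means `match only at the beginning of the text` (or of each line when the `re.MULTILINE` flag is set)
* `$` : Match only at the end of the text (or of each line when the `re.MULTILINE` flag is set)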
* `|` : "OR" operator
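By default, `^` and `$` anchor to the whole text rather than to each line. A minimal sketch of the difference, using a two-line sample string of my own (no trailing space after the first sentence) and the standard `re.MULTILINE` flag:

```python
import re

text = "My boss asked me to turn in my TPS reports.\nThe boss told him they were done."

# Without a flag, '^' only matches at the very start of the text.
without_flag = re.findall(r'^The boss', text)

# With re.MULTILINE, '^' and '$' also match at the start and end of each line.
with_flag = re.findall(r'^The boss', text, re.MULTILINE)
line_ends = re.findall(r'reports\.$', text, re.MULTILINE)

print(without_flag)
print(with_flag)
print(line_ends)
```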

## OR

In [45]:
text = 'Andre andre'

In [46]:
pattern = '[Aa]'
re.findall(pattern, text)

['A', 'a']

In [47]:
pattern = 'A|a'
re.findall(pattern, text)

['A', 'a']

In [48]:
text = '''
I like penguins
I like lions
I like penguins and lions
'''

pattern = 'penguins|lions'
re.findall(pattern, text)

['penguins', 'lions', 'penguins', 'lions']

In [50]:
text = '''
I like penguins
I like lions
I like pinguins and lions
'''

pattern = 'p[ei]nguins|l[ie][oõ][en]s'
re.findall(pattern, text)

['penguins', 'lions', 'pinguins', 'lions']

In [56]:
text = '''
I like penguins
I like lions
I like penguins and lions
'''

pattern = '[pl][ei][no][gn][us]u?i?n?s?'
re.findall(pattern, text)

['penguins', 'lions', 'penguins', 'lions']

In [53]:
text='Andre Park Andre Picchi Andre Gordon'
pattern = 'Andre[_ ]Park|Andre Gordon'
re.findall(pattern, text)

['Andre Park', 'Andre Gordon']

## Match any character

In [63]:
text = """My boss asked me to turn in my TPS reports. 
I told him they were done, but they are not."""

pattern = '.|\n'
print(re.findall(pattern, text))

['M', 'y', ' ', 'b', 'o', 's', 's', ' ', 'a', 's', 'k', 'e', 'd', ' ', 'm', 'e', ' ', 't', 'o', ' ', 't', 'u', 'r', 'n', ' ', 'i', 'n', ' ', 'm', 'y', ' ', 'T', 'P', 'S', ' ', 'r', 'e', 'p', 'o', 'r', 't', 's', '.', ' ', '\n', 'I', ' ', 't', 'o', 'l', 'd', ' ', 'h', 'i', 'm', ' ', 't', 'h', 'e', 'y', ' ', 'w', 'e', 'r', 'e', ' ', 'd', 'o', 'n', 'e', ',', ' ', 'b', 'u', 't', ' ', 't', 'h', 'e', 'y', ' ', 'a', 'r', 'e', ' ', 'n', 'o', 't', '.']


## Match everything not in specific set

In [64]:
text = """My boss asked me to turn in my TPS reports. 
I told him they were done, but they are not."""

In [65]:
pattern = '[^a-m]'
print(re.findall(pattern, text))

['M', 'y', ' ', 'o', 's', 's', ' ', 's', ' ', ' ', 't', 'o', ' ', 't', 'u', 'r', 'n', ' ', 'n', ' ', 'y', ' ', 'T', 'P', 'S', ' ', 'r', 'p', 'o', 'r', 't', 's', '.', ' ', '\n', 'I', ' ', 't', 'o', ' ', ' ', 't', 'y', ' ', 'w', 'r', ' ', 'o', 'n', ',', ' ', 'u', 't', ' ', 't', 'y', ' ', 'r', ' ', 'n', 'o', 't', '.']


## Match sentences `beginning with pattern`

In [66]:
text = '''My boss asked me to turn in my TPS reports. 
The boss told him they were done, but they are not.'''

In [67]:
pattern = '^My boss'
print(re.findall(pattern, text))

['My boss']


In [68]:
pattern = '^The boss'
print(re.findall(pattern, text))

[]


In [69]:
pattern = '^turn'
print(re.findall(pattern, text))

[]


## Match sentences `ending with pattern`

In [80]:
text = '''My boss asked me to turn in my TPS reports. 
The boss told him they were done, but they are not.'''

In [76]:
# no match: `$` only matches at the very end of the text, not at the end of the first line
pattern = r'reports\.$'
print(re.findall(pattern, text))

[]


In [81]:
pattern = r'are not\.$'
print(re.findall(pattern, text))

['are not.']


In [82]:
pattern = r'.*are not\.$'
print(re.findall(pattern, text))

['The boss told him they were done, but they are not.']


## Character classes

* `\d`: digits (numeric characters)
* `\w`: word characters (letters, digits, and underscore)
* `\s`: whitespace (spaces, tabs, newlines)
* `\D`: anything that is not a digit
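The cells below exercise `\d` and `\D`; a short sketch for `\w` and `\s` follows, on a made-up sample string. It also uses `\S`, the negated form of `\s`, in the same way that `\D` negates `\d`.

```python
import re

# Hypothetical sample containing a tab, spaces, a decimal number and a newline.
sample = "id_42\tcosts 3.50 USD\n"

words = re.findall(r'\w+', sample)   # runs of letters, digits and underscores
spaces = re.findall(r'\s', sample)   # every whitespace character, one by one
chunks = re.findall(r'\S+', sample)  # runs of anything that is not whitespace

print(words)
print(spaces)
print(chunks)
```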

In [82]:
text = 'Andre andre aoijo (  $ p io x -o = 3232 13 ™¡¡™£¡Ωå 3.1 áéóãà'

pattern = r'\d'
print(re.findall(pattern, text))

['3', '2', '3', '2', '1', '3', '3', '1']


In [84]:
pattern = r'[^\d]'
# equivalent shorthand: \D is the negation of \d
pattern = r'\D'

print(re.findall(pattern, text))

['A', 'n', 'd', 'r', 'e', ' ', 'a', 'n', 'd', 'r', 'e', ' ', 'a', 'o', 'i', 'j', 'o', ' ', '(', ' ', ' ', '$', ' ', 'p', ' ', 'i', 'o', ' ', 'x', ' ', '-', 'o', ' ', '=', ' ', ' ', ' ', '™', '¡', '¡', '™', '£', '¡', 'Ω', 'å', ' ', '.', ' ', 'á', 'é', 'ó', 'ã', 'à']


# Quantifiers 

* `*`: Matches the previous element (a character, set, or group) 0 or more times
* `+`: Matches the previous element 1 or more times
* `?`: Matches the previous element 0 or 1 times (optional)
* `{}`: Matches the previous element the number of times specified within:
  * `{n}`: Exactly n times
  * `{n,}`: At least n times
  * `{n,m}`: Between n and m times (inclusive)
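To compare the quantifiers listed above side by side, here is a minimal sketch that applies each one to the same made-up string; only the number of `b` characters allowed between `a` and `c` changes.

```python
import re

text = "ac abc abbc abbbc abbbbc"

patterns = [r'ab*c', r'ab+c', r'ab?c', r'ab{2}c', r'ab{2,}c', r'ab{1,3}c']

# Map each pattern to the list of substrings it matches.
results = {pattern: re.findall(pattern, text) for pattern in patterns}

for pattern, matches in results.items():
    print(pattern, matches)
```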

## `\d*` --> Matches any numeric character that appears 0 or more times.

In [74]:
text = 'Andre andre aoijo (  $ p io x -o = 3232 13 ™¡¡™£¡Ωå 3.11648 áéóãà 1'

pattern = r'\d*'
print(re.findall(pattern, text))

['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '3232', '', '13', '', '', '', '', '', '', '', '', '', '', '3', '', '11648', '', '', '', '', '', '', '', '1', '']


## `\d+` --> Matches any numeric character that appears 1 or more times.

In [75]:
text = 'Andre andre aoijo (  $ p io x -o = 3232 13 ™¡¡™£¡Ωå 3.1 áéóãà'

pattern = r'\d+'
print(re.findall(pattern, text))

['3232', '13', '3', '1']


In [76]:
text = 'Andre andre aoijo 1 (  $ p io x -o = 3232 13 ™¡¡™£¡Ωå 3.1 áéóãà'

# digits, optionally followed by a decimal point and more digits
pattern = r'\d+\.?\d*'
print(re.findall(pattern, text))

['1', '3232', '13', '3.1']


## Application of previous example of `$` using one of the most useful quantifiers `*`

In [85]:
text = '''My boss asked me to turn in my TPS reports. 
My boss told him they were done, but they are not.'''

In [84]:
pattern = r'are not\.$'
print(re.findall(pattern, text))

['are not.']


In [92]:
pattern = r'.*are not\.$'
print(re.findall(pattern, text)[0])

My boss told him they were done, but they are not.


In [88]:
pattern = r'.*\n?.*are not\.$'
print(re.findall(pattern, text))

['My boss asked me to turn in my TPS reports. \nMy boss told him they were done, but they are not.']


In [89]:
text

'My boss asked me to turn in my TPS reports. \nMy boss told him they were done, but they are not.'

In [90]:
pattern = r',.*are not\.$'
print(re.findall(pattern, text))

[', but they are not.']


In [93]:
text = '''My boss asked, me to turn in my TPS reports. 
My boss told, him they were done, but they, are not.'''

In [94]:
pattern = r',.*are not\.$'
print(re.findall(pattern, text))

[', him they were done, but they, are not.']


In [124]:
text = '''My boss asked, me to turn in my TPS reports. 
My boss (told him they) were done (but they) are not.'''

In [128]:
pattern = r'\(.+\)'
print(re.findall(pattern, text))

['(told him they) were done (but they)']


In [126]:
pattern = r'\([\w ]*\)'
print(re.findall(pattern, text))

['(told him they)', '(but they)']


# Important Regex Concept: Greediness


What will this match?

In [118]:
text = 'You are yelling! So I will yell too! Let me yell!'

# anything up to exclamation point
pattern = ".*!"
print(re.findall(pattern, text))

['You are yelling! So I will yell too! Let me yell!']


In [119]:
pattern = ".*?!"
re.findall(pattern, text)

['You are yelling!', ' So I will yell too!', ' Let me yell!']
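The greedy versus lazy behaviour shown above is easy to trip over when matching delimited text. A minimal sketch on a made-up snippet of markup, comparing a greedy quantifier, its lazy form, and a negated set that sidesteps the question:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: '.+' runs to the LAST '>' in the text.
greedy = re.findall(r'<.+>', html)

# Lazy: '.+?' stops at the FIRST '>' it can.
lazy = re.findall(r'<.+?>', html)

# Negated set: "anything except '>'" cannot run past the closing bracket.
negated = re.findall(r'<[^>]+>', html)

print(greedy)
print(lazy)
print(negated)
```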

In [129]:
text = "Let's see how we can match the following: aw, aww, awww, awwww, awwwww"

pattern = "aw{2}"
print(re.findall(pattern, text))

['aww', 'aww', 'aww', 'aww']


In [130]:
text = "Let's see how we can match the following: aw, aww, awww, awwww, awwwww"

pattern = "aw{2,}"
print(re.findall(pattern, text))

['aww', 'awww', 'awwww', 'awwwww']


In [135]:
text = "Let's see how we can match the following: aw, aww, awww, awwww, awwwww"

pattern = "aw{2,3}"
print(re.findall(pattern, text))

['aww', 'awww', 'awww', 'awww']


In [136]:
text = "Let's see how we can match the following: aw, aww, awww, awwww, awwwww"

pattern = "aw{2,3}?"
print(re.findall(pattern, text))

['aww', 'aww', 'aww', 'aww']


In [139]:
# "Oie gente" is Portuguese for "Hi, folks"
text = "Ooooooiiiiie! gente"

pattern = "[Oo]{1,}i{1,}e{0,}!{0,1}"
# equivalent shorthand: pattern = "[Oo]+i+e*!?"
print(re.findall(pattern,text))

['Ooooooiiiiie!']


In [141]:
text = "If you tell the truth 1 time, you don't have to remember anything 2 times."

pattern = r'\w+'
print(re.findall(pattern, text))

['If', 'you', 'tell', 'the', 'truth', '1', 'time', 'you', 'don', 't', 'have', 'to', 'remember', 'anything', '2', 'times']


In [142]:
# word length: only words with at least 4 letters
pattern = '[A-Za-z]{4,}'
print(re.findall(pattern, text))

['tell', 'truth', 'time', 'have', 'remember', 'anything', 'times']


# Capturing group

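What if I want to match the text between two delimiters (dashes or commas in the examples below) but leave the delimiters themselves out of the result?

A capturing group `( )` marks the part of the pattern I actually want returned; `re.findall` then returns only what the group matched.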
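A minimal sketch of the idea with `re.search`, on a shortened one-line sample. The group name `inner` is my own choice for illustration; `group(0)` is the whole match and `group(1)` is the first capturing group.

```python
import re

text = 'My boss -told him they- were done.'

# The parentheses capture what sits between the dashes; (?P<inner>...) also names the group.
match = re.search(r'-(?P<inner>[\w ]*)-', text)

whole = match.group(0)        # full match, dashes included
inner = match.group(1)        # first capturing group only
named = match.group('inner')  # same group, accessed by name

print(whole)
print(inner)
print(named)
```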

In [144]:
text = '''My boss asked, me to turn in my TPS reports. 
My boss -told him they- were done -but they- are not.'''

In [145]:
pattern = r'-[\w ]*-'
print(re.findall(pattern, text))

['-told him they-', '-but they-']


In [146]:
pattern = r'-([\w ]*)-'
print(re.findall(pattern, text))

['told him they', 'but they']


In [147]:
text = '''My boss asked, me to turn in my TPS reports. 
My boss told, him they were done, but they, are ,not.'''

In [148]:
pattern = r',([\w ]*),'
print(re.findall(pattern, text))

[' him they were done', ' are ']


In [149]:
pattern = r',([\w ]*)'
print(re.findall(pattern, text))

[' me to turn in my TPS reports', ' him they were done', ' but they', ' are ', 'not']


In [108]:
text = "TerraPower, a nuclear-energy company founded by Bill Gates, is unlikely to follow through on building a demonstration reactor in China, due largely to the Trump administration’s crackdown on the country."

pattern = '[A-Z][a-z]+'
print(re.findall(pattern, text))

['Terra', 'Power', 'Bill', 'Gates', 'China', 'Trump']


In [5]:
# Find all the first and last names
text = "TerraPower, a nuclear-energy company founded by Bill Gates, is unlikely to follow through on building a demonstration reactor in China, due largely to the Trump administration’s crackdown on the country."

pattern = '[A-Z][a-z]+ ?[A-Z][a-z]*'
print(re.findall(pattern, text))

['TerraPower', 'Bill Gates']


In [163]:
# two capturing groups: combined names | single names
# (the same alternation without groups: '[A-Z][a-z]+ ?[A-Z][a-z]+|[A-Z][a-z]+')
pattern = '([A-Z][a-z]+ ?[A-Z][a-z]+)|([A-Z][a-z]+)'

In [164]:
print(re.findall(pattern, text))

[('TerraPower', ''), ('Bill Gates', ''), ('', 'China'), ('', 'Trump')]


In [167]:
name = re.findall(pattern, text)

In [172]:
for n in name:
    print(n[1])



China
Trump


In [13]:
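# A cell displays only its last expression, so just the single-word names are shown
[name[0] for name in re.findall(pattern, text) if name[1]=='']
[name[1] for name in re.findall(pattern, text) if name[0]=='']

['China', 'Trump']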

In [165]:
simple_names = [name[1] for name in re.findall(pattern, text) if name[1] != '']
combined_names = [name[0] for name in re.findall(pattern, text) if name[0] != '']

In [166]:
print(simple_names)
print(combined_names)

['China', 'Trump']
['TerraPower', 'Bill Gates']
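When a pattern has several capturing groups, `re.findall` returns tuples, which is why the filtering above is needed. Two alternatives, sketched on a shortened version of the sentence: non-capturing groups `(?: )` give back plain strings, and `re.finditer` exposes `lastindex`, the number of the group that matched. The variable names are my own.

```python
import re

# Shortened, hypothetical version of the sentence used above.
text = "TerraPower, founded by Bill Gates, may not build a reactor in China."

# Non-capturing groups: findall returns the whole match as a plain string.
flat_pattern = r'(?:[A-Z][a-z]+ ?[A-Z][a-z]+)|(?:[A-Z][a-z]+)'
flat_names = re.findall(flat_pattern, text)

# Capturing groups with finditer: lastindex is 1 for combined names, 2 for single names.
grouped_pattern = r'([A-Z][a-z]+ ?[A-Z][a-z]+)|([A-Z][a-z]+)'
labelled = [(m.group(0), m.lastindex) for m in re.finditer(grouped_pattern, text)]

print(flat_names)
print(labelled)
```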


https://phoneregex.com/

In [173]:
phone='+55 (61) 99453-6852\n'
pattern = r'(\+55\s?)?(^|\()?\s*(\d{2})\s*(\s|\))*(9?\.?\d{4})(\s|-)?(\d{4})($|\n)'
print(re.findall(pattern,phone))

[('+55 ', '(', '61', ' ', '99453', '-', '6852', '')]
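Long patterns like the one above are hard to read. One option is the standard `re.VERBOSE` flag, which ignores whitespace in the pattern and allows comments. The sketch below is a simplified pattern of my own, not a rewrite of the one above: it assumes the `+55 (61) 99453-6852` layout and does not cover every variant, and the group names are hypothetical.

```python
import re

phone_re = re.compile(r"""
    (?P<country>\+55)?        # optional country code
    \s*
    \(?(?P<area>\d{2})\)?     # two-digit area code, parentheses optional
    \s*
    (?P<prefix>9?\d{4})       # first block, with an optional leading 9
    [\s-]?                    # optional space or dash
    (?P<line>\d{4})           # last four digits
""", re.VERBOSE)

match = phone_re.search('+55 (61) 99453-6852\n')
parts = match.groupdict()
print(parts)
```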
