# <u>Reg</u>ular <u>Ex</u>pression - Intro to Regex

Regex is a powerful way to match text. Instead of trying to match a `literal string`, you can try to match `patterns`.

In [2]:
import re

To see a regular expression drawn as a diagram, try https://regexper.com/

In [3]:
text = 'My neighbor, Mr. Rogers, has 5 dogs.'
pattern = 'neighbor'

re.findall(pattern, text)

['neighbor']

In [4]:
text = 'My neighbor, Mr. neighrogers, has 5 dogs.'
pattern = 'neigh'

re.findall(pattern, text)

['neigh', 'neigh']

In [5]:
text = 'My , Mr. Rogers, has 5 dogs.'
pattern = 'neigh'

re.findall(pattern, text)

[]

## Introducing Sets

In [6]:
text = 'My neighbor, Mr. Rogers, has 5 dogs.'
pattern = '[neigh]'

re.findall(pattern, text)

['n', 'e', 'i', 'g', 'h', 'g', 'e', 'h', 'g']

In [7]:
text = 'My neighbor, Mr. Rogers, has 5 rogers.'

#pattern = 'rogers'
pattern = '[Rr]ogers'
#pattern = '[Rr][Oo][Gg][Ee][Rr][Ss]'
#pattern = '[RrOoGgEeRrSs]'
# pattern = rogers or Rogers or xogers

re.findall(pattern, text)
#re.findall(pattern, text,re.I)
#re.findall(pattern, text.lower())

['Rogers', 'rogers']
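The commented-out alternatives in the cell above can be compared side by side. The following is a minimal sketch rather than part of the original notebook; the variable names `with_set`, `with_flag` and `with_lower` are made up for illustration. It suggests that `re.I` (an alias of `re.IGNORECASE`) keeps the original casing of each match, whereas lowercasing the text first does not.

```python
import re

text = 'My neighbor, Mr. Rogers, has 5 rogers.'

# Option 1: a set that accepts either case for the first letter only
with_set = re.findall('[Rr]ogers', text)

# Option 2: ignore case for the whole pattern (re.I is an alias of re.IGNORECASE)
with_flag = re.findall('rogers', text, flags=re.I)

# Option 3: lowercase the text first; the matches lose their original casing
with_lower = re.findall('rogers', text.lower())

print(with_set)
print(with_flag)
print(with_lower)
```

The flag is usually the least intrusive choice when the original casing of the matches matters.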

In [8]:
text = 'Sãn Paulo São Paulo Sao Paulo Sao Paolo San Pablo sao paulo sao Paulo são Paulo sao-paulo são paulo São Paulo Saon Paulo'

pattern = '[Ss][áãa][on][ -][Pp]a[uob]lo'

re.findall(pattern, text)

['Sãn Paulo',
 'São Paulo',
 'Sao Paulo',
 'Sao Paolo',
 'San Pablo',
 'sao paulo',
 'sao Paulo',
 'são Paulo',
 'sao-paulo',
 'são paulo',
 'São Paulo']

# Pattern sets:

Range

1. [a-z]: Any lowercase letter between a and z.
2. [A-Z]: Any uppercase letter between A and Z.
3. [0-9]: Any numeric character between 0 and 9.

In [9]:
text = 'My neighbor, Mr. Rogers, has 5 rogers.'
pattern = '[a-e]'

re.findall(pattern, text)

['e', 'b', 'e', 'a', 'e']

In [10]:
re.findall('[A-Z]', text)

['M', 'M', 'R']

In [11]:
re.findall('[A-N]', text)

['M', 'M']

In [12]:
re.findall('[efghijklmno]', text)

['n', 'e', 'i', 'g', 'h', 'o', 'o', 'g', 'e', 'h', 'o', 'g', 'e']

In [13]:
re.findall('[e-o]', text)

['n', 'e', 'i', 'g', 'h', 'o', 'o', 'g', 'e', 'h', 'o', 'g', 'e']

In [14]:
re.findall('[0123456789]', text)

['5']

In [15]:
text

'My neighbor, Mr. Rogers, has 5 rogers.'

In [21]:
re.findall('[0-9]', text)

['5']

In [79]:
# you can concatenate ranges

re.findall('[A-Za-z0-9]', text)

['M',
 'y',
 'n',
 'e',
 'i',
 'g',
 'h',
 'b',
 'o',
 'r',
 'M',
 'r',
 'R',
 'o',
 'g',
 'e',
 'r',
 's',
 'h',
 'a',
 's',
 '5',
 'r',
 'o',
 'g',
 'e',
 'r',
 's']

In [23]:
text

'My neighbor, Mr. Rogers, has 5 rogers.'

In [24]:
re.findall('[A-Z,0-9]', text)

['M', ',', 'M', 'R', ',', '5']
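The commas in the output above hint that a comma inside a set is not a separator between ranges: it is matched as a literal character. A small sketch, with illustrative variable names, that contrasts the two spellings:

```python
import re

text = 'My neighbor, Mr. Rogers, has 5 rogers.'

# The comma is part of the set, so literal commas are matched too
with_comma = re.findall('[A-Z,0-9]', text)

# Ranges are simply written next to each other, with no separator
without_comma = re.findall('[A-Z0-9]', text)

print(with_comma)
print(without_comma)
```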

In [82]:
text = 'My neighbor, Mr. Rogers, has] 5 rogers.'

In [86]:
# raw string (r'...') so that \] reaches the regex engine unchanged
re.findall(r'[a-m\]]', text)

['e', 'i', 'g', 'h', 'b', 'g', 'e', 'h', 'a', ']', 'g', 'e']

The opposite: 
- `^` as the first character inside a set negates it: the set then matches any single character that is not listed.
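The position of `^` inside the set matters. As a hedged illustration (the sample string and variable names are invented for this sketch), `^` negates the set only when it comes first; anywhere else in the set it is an ordinary character:

```python
import re

sample = 'a^b c'

# ^ in first position: match every character that is NOT a lowercase letter
negated = re.findall(r'[^a-z]', sample)

# ^ in a later position: a literal caret, alongside the lowercase letters
literal_caret = re.findall(r'[a-z^]', sample)

print(negated)
print(literal_caret)
```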

In [89]:
pattern = '[^a-z]'
re.findall(pattern, text)

['M', ' ', ',', ' ', 'M', '.', ' ', 'R', ',', ' ', ']', ' ', '5', ' ', '.']

In [90]:
# combine several ranges inside one negated set []
# \s matches any whitespace character (space, tab, newline), not only the space
#pattern = r'[^a-zA-Z0-9\s]'
pattern = '[^a-zA-Z0-9 ]'
re.findall(pattern, text)


[',', '.', ',', ']', '.']

In [29]:
re.findall('[ ]', text)

[' ', ' ', ' ', ' ', ' ', ' ']

In [30]:
re.findall(r'[\s]', text)

[' ', ' ', ' ', ' ', ' ', ' ']

# Meta Characters:

Characters that have a special meaning instead of matching themselves literally.

1. `\w`: Any word character: a letter, a digit, or the underscore (`_`).
2. `\d`: Any numeric character.
3. `.` : Any character except newline (\n).
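To make the list above concrete, here is a short sketch on an invented sample string. It assumes Python 3, where `\w` is Unicode-aware by default; the `re.ASCII` flag is shown only as a way to restrict `\w` to `[A-Za-z0-9_]`. The patterns are written as raw strings (`r'...'`) so that the backslash is passed to the regex engine untouched.

```python
import re

sample = 'id_42 π\nok'

# \w matches letters, digits and the underscore, including non-ASCII letters
word_chars = re.findall(r'\w', sample)

# With re.ASCII, \w is limited to [A-Za-z0-9_]
ascii_word_chars = re.findall(r'\w', sample, flags=re.ASCII)

# . matches every character except the newline
dot_chars = re.findall(r'.', sample)

print(word_chars)
print(ascii_word_chars)
print(dot_chars)
```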

In [93]:
text = 'My neighbor, Mr. Rogers, ] has 5 - dogs 10. α π _'

In [94]:
pattern = r'\w'
print(re.findall(pattern, text))

['M', 'y', 'n', 'e', 'i', 'g', 'h', 'b', 'o', 'r', 'M', 'r', 'R', 'o', 'g', 'e', 'r', 's', 'h', 'a', 's', '5', 'd', 'o', 'g', 's', '1', '0', 'α', 'π', '_']


In [97]:
print(re.findall(r'\d', text))
#print(re.findall('[0-9]', text))

['5', '1', '0']


In [98]:
print(re.findall('.', text))

['M', 'y', ' ', 'n', 'e', 'i', 'g', 'h', 'b', 'o', 'r', ',', ' ', 'M', 'r', '.', ' ', 'R', 'o', 'g', 'e', 'r', 's', ',', ' ', ']', ' ', 'h', 'a', 's', ' ', '5', ' ', '-', ' ', 'd', 'o', 'g', 's', ' ', '1', '0', '.', ' ', 'α', ' ', 'π', ' ', '_']


## Quantifiers

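1. `*`: 0 or more repetitions of the preceding element
2. `?`: 0 or 1 repetition of the preceding element
3. `+`: 1 or more repetitions of the preceding element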
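A quick way to see the difference between the three quantifiers is to apply them to the same string. This is an illustrative sketch with an invented sample; each quantifier applies to the `b` that precedes it.

```python
import re

sample = 'a ab abb abbb'

# b may appear zero or more times after a
star = re.findall(r'ab*', sample)

# b may appear zero times or once after a
question = re.findall(r'ab?', sample)

# b must appear at least once after a
plus = re.findall(r'ab+', sample)

print(star)
print(question)
print(plus)
```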

In [143]:
text = '''My neighbor, Mr. Rogers, has 5 dogs and 100 cats and β sheeps.'''

In [138]:
# \d* also matches the empty string, so every non-digit position yields ''
print(re.findall(r'\d*', text))

['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '5', '', '', '', '', '', '', '', '', '', '', '100', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']


In [104]:
print(re.findall(r'\d+', text))

['5', '100']


In [141]:
print(re.findall(',.*', text))

[', Mr. Rogers, has 5 dogs and 100 cats and β sheeps.']


In [144]:
print(re.findall(r'\w+', text))

['My', 'neighbor', 'Mr', 'Rogers', 'has', '5', 'dogs', 'and', '100', 'cats', 'and', 'β', 'sheeps']


In [145]:
my_string = 'Andre Park 1 andre21 and andre 21, Andre Aguiar André 21,65 are part of Ironhack"s andré 21.65 Da Silva andré 32 Sauro'


In [155]:
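# find all the numbers, including decimals written with , or .
# (?:[,.]\d+)? makes "separator plus digits" optional as a single unit
re.findall(r'\d+(?:[,.]\d+)?', my_string)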

['1', '21', '21', '21,65', '21.65', '32']

In [153]:
# simpler attempt: an optional separator on its own also picks up the trailing comma in '21,'
re.findall(r'\d+[,.]?\d*', my_string)

['1', '21', '21,', '21,65', '21.65', '32']

In [57]:
text = 'Sáo Paulo São Paulo Sao Paulo Sao Paolo San Pablo sao paulo sao Paulo são Paulo sao-paulo são paulo São SãoPaulo Saon Paulo'

pattern = '[Ss][ãaáàâä][on]n?[ -]?[Pp]a[buo]lo'

re.findall(pattern, text)

['Sáo Paulo',
 'São Paulo',
 'Sao Paulo',
 'Sao Paolo',
 'San Pablo',
 'sao paulo',
 'sao Paulo',
 'são Paulo',
 'sao-paulo',
 'são paulo',
 'SãoPaulo',
 'Saon Paulo']

In [156]:
text = 'This colonel has the colour or color blue'

re.findall('colou?r', text)

['colour', 'color']

In [157]:
text = 'These apples are beautiful and the apple is blue.'

re.findall('apples?', text)

['apples', 'apple']

In [159]:
text = 'Andre Aguiar and Andre Park and Fatima Aguiar and Frank Park are from the Ironhack team'

re.findall(r'Andre \w+', text)

['Andre Aguiar', 'Andre Park']

In [165]:
# count limits the number of replacements; re.sub returns a new string
re.sub(r'\w+ Park', 'Joao Park', text, count=2)

'Andre Aguiar and Joao Park and Fatima Aguiar and Joao Park are from the Ironhack team'

In [166]:
text

'Andre Aguiar and Andre Park and Fatima Aguiar and Frank Park are from the Ironhack team'
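As the unchanged `text` above suggests, `re.sub` returns a new string and leaves the original alone. The sketch below, with illustrative variable names, shows how the `count` argument limits the number of replacements; leaving it out replaces every match.

```python
import re

text = 'Andre Aguiar and Andre Park and Fatima Aguiar and Frank Park are from the Ironhack team'

# Replace only the first "<word> Park"
first_only = re.sub(r'\w+ Park', 'Joao Park', text, count=1)

# Without count, every match is replaced
all_matches = re.sub(r'\w+ Park', 'Joao Park', text)

print(first_only)
print(all_matches)
print(text)  # the original string is not modified
```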

# Other methods for regular expressions

In [170]:
text = 'My neighbor, Mr. Rogers, ] has 5 - rogers 1000,'

In [168]:
re.sub('[Rr]ogers','Andre', text)

'My neighbor, Mr. Andre, ] has 5 - Andre 1000,'

In [169]:
re.sub(r'\d+', '-1', text)

'My neighbor, Mr. Rogers, ] has -1 - rogers -1,'

In [87]:
text.split('Rogers')

['My neighbor, Mr. ', ', ] has 5 - rogers 1000,']

In [171]:
print(re.split('[Rr]ogers', text))

['My neighbor, Mr. ', ', ] has 5 - ', ' 1000,']


In [173]:
text

'My neighbor, Mr. Rogers, ] has 5 - rogers 1000,'

In [172]:
print(re.split('[0-9]+', text))

['My neighbor, Mr. Rogers, ] has ', ' - rogers ', ',']


In [174]:
print(re.findall('[^0-9]+', text))

['My neighbor, Mr. Rogers, ] has ', ' - rogers ', ',']


In [175]:
re.search(r'\d+', text)

<re.Match object; span=(31, 32), match='5'>

In [178]:
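# re.search returns None when nothing matches, so the result works as a truth test
search_result = re.search(r'\d+', 'mnosanbdias f saonsiao fsamnp')
if search_result:
    print(search_result.group())  # the text that matched
else:
    print('Pattern not found')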

Pattern not found
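When the pattern is found, `re.search` returns a Match object for the first occurrence only. A minimal sketch of reading it, reusing the `text` defined earlier in this section (the variable name `match` is just for illustration):

```python
import re

text = 'My neighbor, Mr. Rogers, ] has 5 - rogers 1000,'

match = re.search(r'\d+', text)
if match is not None:
    print(match.group())  # the matched substring
    print(match.span())   # (start, end) indices into text
else:
    print('Pattern not found')
```

Only the first number is reported here; `re.findall` is the tool to use when every occurrence is needed.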


# Examples

Find regexes that: 
1. Match “Dan” and “Ban” (the first letter can be "D" or "B").
2. Match “Dan”, “Ban”, “Tan”, and “Pan”.
3. Match “Dan” and “Dag” (the last letter can be "n" or "g").
4. Match “Dan” followed by a lowercase “and”.


In [180]:
text = 'Dan and Ban and Tan Dah andSan And Dag Dan Pan i 09wie'

In [191]:
pattern= '[DB]an'
re.findall(pattern, text)

['Dan', 'Ban', 'Dan']

In [193]:
pattern = '[DBTP]an'
re.findall(pattern, text)

['Dan', 'Ban', 'Tan', 'Dan', 'Pan']

In [194]:
pattern = 'Da[ng]'
re.findall(pattern, text)

['Dan', 'Dag', 'Dan']

In [201]:
pattern = 'Dan and'
re.findall(pattern, text)

['Dan and']

# Summary
* Cheatsheet: https://cheatography.com/davechild/cheat-sheets/regular-expressions/
* Test your regexes: https://regexr.com/
* import re - performs string operations based on patterns
## Functions
* re.findall(pattern, text) - finds every occurrence of the pattern in the text and returns them as a list
* re.sub(pattern, replacement, text, count) - replaces matches of the pattern with the replacement; count limits the number of replacements
* re.split(pattern, text) - splits the string into a list wherever the pattern matches, dropping the matched text
* re.search(pattern, text) - returns a Match object for the first occurrence of the pattern, or None if there is none, so the result can be used as a truth test
## Patterns
* pattern = '[characters]' - matches any one of the characters inside the set (square brackets)
* '[A-Za-z0-9]' - ranges that match from one character through another
* '[\]]' - matches the character ] inside the set
* '[^...]' - matches any character that is not in the set
* Meta characters:
* `\w`: Any word character (a letter, a digit, or the underscore).
* `\d`: Any numeric character.
* `.`: Any character except newline (\n)
## Quantifiers
* `*`: 0 or more
* `?`: 0 or 1
* `+`: 1 or more