# Regular expressions

- are a sequence of characters that define a search pattern
- used for find and replace algorithms

### The "in"-operator

- checks if a character or substring is contained in a string

In [2]:
s = 'an expression'
'expression' in s

True

### The "re"-module

- has regular expression matching algorithms


In [4]:
import re
x = re.search("cat","A cat and a rat can't be friends.")
# prints a match object showing the matched text and its position
print(x)
# re.search returns None if there is no match
y = re.search("cow","A cat and a rat can't be friends.")
print(y)

<re.Match object; span=(2, 5), match='cat'>
None


### Placeholders
- r' .at ' can be used to find all 3 letter words ending with at
    - this might lead to overmatching though

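- r'M[ae][iy]er' is an expression in which each bracket [...] matches exactly one of the listed characters, so it accepts Mayer, Maier, Meyer and Meier

- the expression [-a-z] matches any lowercase letter from a to z or a literal - (the hyphen is literal because it comes first)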

- [A-Za-z] defines all upper case and lowercase characters from a to z

- [n-m] with single digits n <= m defines a range of accepted digits, e.g. [2-5]

- [^0-9] is a negated class: a ^ directly after the opening bracket means any character BUT a digit is accepted

- [a^bc], however, matches a or ^ or b or c, because the ^ is not in first position
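
The character classes above can be tried out with a minimal sketch like the following. The sample strings and variable names are made up for illustration and are not part of the original notes.

```python
import re

# Each bracket matches exactly one character from its set.
names = re.findall(r"M[ae][iy]er", "Mayer Meier Meyer Maier Mueller")
print(names)

# '.' matches any character, so r".at" also accepts non-letters (overmatching).
dot_hits = re.findall(r".at", "cat 1at @at")
print(dot_hits)

# A leading '^' negates the class: every character except a digit.
non_digits = re.findall(r"[^0-9]", "a1b2")
print(non_digits)

# A '^' that is not in first position is an ordinary character.
literal_caret = re.findall(r"[a^bc]", "x^ay")
print(literal_caret)
```

Replacing the dot with a letter class, e.g. r"[a-z]at", is one way to reduce the overmatching shown in the second call.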

### Predefined placeholders

- since there are common expression classes there are predefined placeholders

    - \d a digit
    - \D NON digit = [^0-9]
    - \s whitespace = [ \t\n\r\f\v] space, tabs, newlines etc.
    - \S non whitespace = [^ \t\n\r\f\v]
    - \w alphanumerical character including "_". For str patterns in Python 3 this covers Unicode letters such as German umlauts by default
    - \W complement of \w
    - \b empty string at the start or end of a word
    - \B empty string that is NOT at the start or end of a word
    - \\ a backslash
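
A short sketch of how some of these predefined placeholders behave. The sample text is invented for illustration; the note on \b reflects the word-boundary behaviour of Python's re module.

```python
import re

text = "Order 66: a cat, cat_food and concat"

# \d+ collects runs of digits.
digits = re.findall(r"\d+", text)
print(digits)

# \w+ collects runs of letters, digits and underscores.
words = re.findall(r"\w+", text)
print(words)

# \b marks a word boundary, so only the standalone word "cat" matches.
whole_word = re.findall(r"\bcat\b", text)
print(whole_word)

# \B requires that there is NO boundary, so this finds the "cat" inside "concat".
inside_word_start = re.search(r"\Bcat\b", text).start()
print(inside_word_start)

# \s matches single whitespace characters such as space and tab.
spaces = re.findall(r"\s", "a b\tc")
print(spaces)
```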

In [11]:
import re
s1 = "Mayer is a very common Name"
s2 = "He is called Meyer but he isn't German."
# the ^ at the start of the pattern anchors the match to the start of the string
print(re.search(r"^M[ae][iy]er", s1))
# no match here, because the name is not at the beginning of the string
print(re.search(r"^M[ae][iy]er", s2))
# with the re.M (multiline) flag, ^ also matches right after each newline, so the name is found
print(re.search(r"^M[ae][iy]er", s2+'\n'+s1, re.M))

<re.Match object; span=(0, 5), match='Mayer'>
None
<re.Match object; span=(40, 45), match='Mayer'>


### Optional placeholders
- r"M[ae][iy]e?r" is an expression where the last e is optional
- this is also useful for dates e.g. r'Feb(ruary)?'

### Quantifier
- r"[0-9]*" means that any series (also empty series) of digits is allowed
- r".*" allows any series of characters
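
The optional placeholder and the "*" quantifier can be illustrated with a small sketch. The test strings and variable names are assumptions chosen for this example.

```python
import re

# 'e?' makes the last e optional, so spellings with and without it match.
names = re.findall(r"M[ae][iy]e?r", "Mayer Mayr Meier Meir")
print(names)

# A group followed by '?' makes the whole group optional.
short_month = re.search(r"Feb(ruary)?", "Feb 12, 2011").group()
long_month = re.search(r"Feb(ruary)?", "February 12, 2011").group()
print(short_month, long_month)

# '*' also accepts zero repetitions, so the match can be empty.
empty_match = re.search(r"[0-9]*", "abc").group()
digit_run = re.search(r"[0-9]*", "123abc").group()
print(repr(empty_match), digit_run)

# '.*' takes any series of characters (up to a newline).
everything = re.search(r".*", "any text").group()
print(everything)
```

Note that the empty match on "abc" still counts as a successful match, which is why "*" alone is rarely a good test for "contains digits".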

### Exercise 8.1

#### Write a regular expression matching strings that begin with at least one digit followed by a space

In [7]:
import re
s_test_1 = "1 man cannot handle manufacturing a table."
s_test_1_1 = "11 man cannot handle manufacturing a table."
s_test_2 = "I cannot read this text, because I'm blind."
pattern = r'^\d+ '
print(re.search(pattern, s_test_1))
print(re.search(pattern, s_test_1_1))

<re.Match object; span=(0, 2), match='1 '>
<re.Match object; span=(0, 3), match='11 '>


In [17]:
test = "9111 "
test2 = '234'
print(re.match(r"^\d\d* ", test))
print(re.match(r"^\d\d* ", test2))

<re.Match object; span=(0, 5), match='9111 '>
None


### The "+"-operator and the {} Syntax
- the "+"-operator requires at least one occurrence of the preceding expression
- the {}-operator specifies an exact number of occurrences
    - \d{4} means that exactly 4 digits are expected here
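
A minimal sketch of "+" and the {} syntax. The range form {m,n} and the \b anchors are additions that do not appear in the notes above; both are standard syntax in Python's re module.

```python
import re

# '+' needs at least one digit, so letters alone never match.
one_or_more = re.findall(r"\d+", "a1 b22 c333")
print(one_or_more)

# {4} needs exactly four digits; without anchors it also matches inside longer numbers.
unanchored = re.findall(r"\d{4}", "12 2011 123456")
print(unanchored)

# Word boundaries restrict the match to numbers that are exactly four digits long.
exactly_four = re.findall(r"\b\d{4}\b", "12 2011 123456")
print(exactly_four)

# {m,n} accepts between m and n repetitions (here: two or three digits).
two_to_three = re.findall(r"\b\d{2,3}\b", "1 22 333 4444")
print(two_to_three)
```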

### Exercise 8.2

#### Write a regular expression that can parse all postal codes of Germany

In [9]:
pattern_zip = r'\d{5} \w+'
test_plz = "58644 Iserlohn"
test_plz2 = "48143 Münster"
print(re.match(pattern_zip, test_plz))
print(re.match(pattern_zip, test_plz2)) # works because \w matches Unicode letters such as ü by default for str patterns

<re.Match object; span=(0, 14), match='58644 Iserlohn'>
<re.Match object; span=(0, 13), match='48143 Münster'>


In [19]:
expr = r"\d{5} \w+"
test_plz = "58644 Iserlohn"
test_plz2 = "48143 Münster"
print(re.match(expr, test_plz))
print(re.match(expr, test_plz2)) # works because \w matches Unicode letters such as ü by default for str patterns

<re.Match object; span=(0, 14), match='58644 Iserlohn'>
<re.Match object; span=(0, 13), match='48143 Münster'>


### Accessing the values of the match-object
- to work with a match we have to read its values from the match object: the matched text (group) and its position (span, start, end)


In [21]:
import re
mo = re.search("[0-9]+", "Customer number: 232454, Date: February 12, 2011")
print(mo.group())
print(mo.span())
print(mo.start())
print(mo.end())
print(mo.span()[0])
print(mo.span()[1])

232454
(17, 23)
17
23
17
23


### More regular expression operations
- finding ALL matching substrings: re.findall(pattern, string [,flags])

### Exercise 8.3

#### Use the re.findall method to find all substrings ending with "at" in the sentence: "A fat cat doesn't eat oat but a rat eats bats."

In [13]:
test_str = "A fat cat doesn't eat oat but a rat eats bats."
mo = re.findall(r'\w*at', test_str)
print(mo)

['fat', 'cat', 'eat', 'oat', 'rat', 'eat', 'bat']


In [22]:
t="A fat cat doesn't eat oat but a rat eats bats."
print(re.findall(r"\w*at", t))

['fat', 'cat', 'eat', 'oat', 'rat', 'eat', 'bat']


### Logical  operations
- logical operations are also possible in regular expressions
- the | is the logical or (alternation) in regular expression syntax

In [26]:
import re
text = "The destination is London!"  # renamed from str to avoid shadowing the built-in type
mo = re.search(r"destination.*(London|Paris|Zurich|Strasbourg)", text)
if mo:
    print(mo.group())

destination is London


### Flags
- re can be run with flags
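    - re.I ignore lower and upper case
    - re.L make \w, \W, \b, \B and case-insensitive matching depend on the current LOCALE (only valid for bytes patterns)
    - re.M multiline: ^ and $ also match at the start and end of each line
    - re.S "." also matches the newline \n
    - re.U Unicode matching (already the default for str patterns in Python 3)
    - re.X verbose regular expressions are allowed, meaning whitespace in the pattern is ignored and # comments can be added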
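
A brief sketch of three of these flags in use. The strings and the name zip_pattern are illustrative assumptions; several flags can also be combined with |, e.g. re.I | re.M.

```python
import re

# re.I: upper and lower case are treated as equal.
ignore_case = re.search(r"london", "The destination is London!", re.I).group()
print(ignore_case)

# Without re.S the dot stops at a newline, with re.S it matches it.
without_s = re.search(r"a.b", "a\nb")
with_s = re.search(r"a.b", "a\nb", re.S).group()
print(without_s, repr(with_s))

# re.X: whitespace in the pattern is ignored and comments are allowed.
zip_pattern = re.compile(r"""
    \d{5}   # five digits
    [ ]     # a literal space (bare spaces are ignored in verbose mode)
    \w+     # the town name
""", re.X)
verbose = zip_pattern.match("58644 Iserlohn").group()
print(verbose)
```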

### Splitting strings
- the built-in str.split method splits strings at a fixed separator (or at any whitespace if none is given)
- re.split splits strings at every match of a regular expression

In [32]:
teststring = 'Hello, this is'
print(teststring.split())
print(re.split(r' ', teststring))
print(teststring.split(','))
print(re.split(r',', teststring))

['Hello,', 'this', 'is']
['Hello,', 'this', 'is']
['Hello', ' this is']
['Hello', ' this is']
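
The cell above only uses fixed separators, where both functions behave the same. As a rough sketch of the difference, re.split can split at a pattern that covers several separators at once. The sample string here is an assumption made for this example.

```python
import re

sample = "Hello, this is;a  test"

# str.split only knows one fixed separator at a time.
fixed = sample.split(",")
print(fixed)

# re.split treats any run of commas, semicolons or whitespace as one separator.
parts = re.split(r"[,;\s]+", sample)
print(parts)
```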


### Find and replace with re.sub

In [35]:
string = 'Yes this is'
print(string)
res = re.sub(r'[Yy]es','no', string)
print(res)

Yes this is
no this is
