**SOURCE TUTORIALS:**   
https://realpython.com/lessons/building-regexes-overview/  
https://www.educative.io/courses/python-regular-expressions-with-data-scraping-projects
  
  
#### Regexp Reference Documentation
- Libraries: re and regex (much more powerful than re)
- https://docs.python.org/fr/3/howto/regex.html
- https://www.w3schools.com/python/python_regex.asp
- https://www.w3schools.com/python/python_ref_string.asp (all string methods)

#### Tools

- [pythex: online regex editor](https://pythex.org/)
- [Regular Expressions 101: online regex editor](https://regex101.com/)
- [Matthew Barnett's regex library](https://bitbucket.org/mrabarnett/mrab-regex/src/hg/)   
  nested sets, set operations, infinite look-behind
- [Parse: parse strings using a specification based on the Python format() syntax](https://github.com/r1chardj0n3s/parse)

#### Further reading about regexes

- [Regular Expression: Wikipedia article](https://en.wikipedia.org/wiki/Regular_expression)
- [Perl Compatible Regular Expressions: Wikipedia article](https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions)
- [Finite State Machine: Wikipedia article](https://en.wikipedia.org/wiki/Finite-state_machine)
- [Regex Denial of Service (DOS)](https://levelup.gitconnected.com/the-regular-expression-denial-of-service-redos-cheat-sheet-a78d0ed7d865)

#### Python all modules by category

- [Python 3 Module of the Week](https://pymotw.com/3/)

 ### When to compile a regex
 
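 - Small speed advantage: `re` already caches recently compiled patterns internally (the cache size is an implementation detail), but a precompiled pattern skips the cache lookup
 - precompile a pattern that will be reused multiple times
 - use it across your code base, pass it as an argument to functions...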

In [None]:
import re
sentence = "Hello, world! How are you?"   # sample input (assumed)
pattern = re.compile(r'([-\s.,;!?])+')
tokens = pattern.split(sentence)

<h1 style="font-size:34px; font-weight:bold; background:#DDEEEE;padding: 15px;">Meta-characters</h1> 


- `.` : any character except line terminations (like `\n`)
- `\` : escapes meta-characters, or denotes classes
- `|` : OR
- `-` : range inside a set (see 'Sets' below)   

  
- `^` : start of string (multiline off) – start of each line (multiline on)
- `\A` : absolute start of string (even in multiline mode)    
     
     
- `$` : end of string, or just before a trailing newline (multiline off) – end of each line (multiline on)
- `\Z` : absolute end of string (even in multiline mode); the Perl/PCRE spelling `\z` is rejected by older versions of `re`  
  
  
- **Multiline** mode changes how `^` and `$` behave around newlines  
  
  
- `[-\s]` : meta-char can be used inside matching class

In [34]:
r"^P"      # line starts with P
r"\AP"     # string starts with P (same as ^ only when multiline mode is off)
r"\.$"     # all lines ending with period character (multiline mode on)
           # period character at the end of the string (multiline mode off)

'\\.$'

In [48]:
import re
print(re.search(r"^aaa", "xxx bbb\naaa ccc", re.MULTILINE))
print(re.search(r"\Aaaa", "xxx bbb\naaa ccc", re.MULTILINE))

<re.Match object; span=(8, 11), match='aaa'>
None
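
The difference between `$` and `\Z` shows up when the string ends with a newline. A minimal sketch, using a made-up sample `text`:

```python
import re

text = "first line\nsecond line\n"   # note the trailing newline

# Without MULTILINE, `$` matches at the very end or just before a trailing newline
dollar_default = re.findall(r"line$", text)
# With MULTILINE, `$` also matches before every embedded newline
dollar_multiline = re.findall(r"line$", text, re.MULTILINE)
# `\Z` only matches at the absolute end, so the trailing newline blocks it
absolute_end = re.search(r"line\Z", text)

print(dollar_default)     # ['line']
print(dollar_multiline)   # ['line', 'line']
print(absolute_end)       # None
```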


<h1 style="font-size:34px; font-weight:bold; background:#DDEEEE;padding: 15px;">Classes</h1> 

Lower case classes are opposite to upper case classes

  - `\d` : digits, `[0-9]`
  - `\D` : non-digits
  
  
  - `\w` : word characters, `[a-zA-Z0-9_]` (plus Unicode letters and digits for `str` patterns unless `re.ASCII` is set)
  - `\W` : non-word characters
    
    
  - `\b` : boundary (empty string) at start and end of a word, that is, between `\w` and `\W`
  - `\B` : matches where `\b` does not, that is, between two `\w` characters or between two `\W` characters
    
    
  - `\s` : whitespace (includes the space, `\t`, `\n`, `\r`, `\f` and `\v` characters)
  - `\S` : non-whitespace
    
    
- `\G` : point where last match finished (not supported by `re`; available in the third-party `regex` module)

In [25]:
r"^P"      # line starts with P
r"\AP"     # string starts with P (same as ^ only when multiline mode is off)
r"\.$"     # all lines ending with period character (multiline mode on)
           # period character at the end of the string (multiline mode off)
    
r"\bf"     # words that start with f
r"er\b"    # words that end with er
r"\Ber\B"  # er inside of a word (non word boundary)

'\\Ber\\B'
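
To see the boundary classes at work, here is a small sketch that extends the three patterns above with `\w*` so that whole words are returned. The sample sentence is an assumption chosen for illustration:

```python
import re

text = "fast offer interest river"

starts_with_f = re.findall(r"\bf\w*", text)        # word boundary right before 'f'
ends_with_er = re.findall(r"\w*er\b", text)        # word boundary right after 'er'
er_inside = re.findall(r"\w*\Ber\B\w*", text)      # 'er' with word characters on both sides

print(starts_with_f)   # ['fast']
print(ends_with_er)    # ['offer', 'river']
print(er_inside)       # ['interest']
```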

<h1 style="font-size:34px; font-weight:bold; background:#DDEEEE;padding: 15px;">Sets</h1> 


- `[ ]` : set of characters to match
- `[abc]` : a or b or c
- `[^abc]` : not a nor b nor c (excludes any character in the set)
- `[a-z0-9]` : a to z and 0 to 9
- `[a\-z]` = `[-az]` = `[az-]` : a or - or z
- `[(+*)]` : `(` or `+` or `*` or `)` (special characters become literal in a set)
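
A quick sketch of the set rules listed above, run against made-up sample strings:

```python
import re

any_of = re.findall(r"[abc]", "a-b+c*d")          # a or b or c
none_of = re.findall(r"[^abc]", "a-b+c*d")        # anything except a, b, c
ranges = re.findall(r"[a-z0-9]", "A1b2")          # lower case letters and digits
literal_dash = re.findall(r"[a\-z]", "a-z m")     # a or - or z (not a range)
literal_meta = re.findall(r"[(+*)]", "(1+2)*3")   # meta-characters are literal in a set

print(any_of)         # ['a', 'b', 'c']
print(none_of)        # ['-', '+', '*', 'd']
print(ranges)         # ['1', 'b', '2']
print(literal_dash)   # ['a', '-', 'z']
print(literal_meta)   # ['(', '+', ')', '*']
```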

<h1 style="font-size:34px; font-weight:bold; background:#DDEEEE;padding: 15px;">Quantifiers</h1> 


`?` when added to quantifiers (`+`, `*` and `?` itself) has the meaning 'non-greedy'
- `+` : 1 or more times, greedy – `+?` lazy, absorbs the minimum: 1
- `*` : 0 or more times, greedy – `*?` lazy, absorbs the minimum: 0
- `?` : 0 or 1 times, greedy – `??` lazy, absorbs the minimum: 0
- `{m,n}` : m to n matches, greedy – `{m,n}?` : lazy, matches the minimum: m
- `{m}` : exactly m matches
- `{m,}` : m or more matches
- `{,m}` : 0 to m matches

In [24]:
r"\d*"       # any number of digits
r"[a-z]+"    # at least one lower case character
r"<.*>"      # any character any number of times

r"Aa+"       # [Aaaaaaaa]: absorbs the maximum found
r"Aa+?"      # [Aa]aaaaaaa: absorbs the minimum found, which is 1 (+ means 1 or more)

r"Aa?"       # [Aa]aaaaaaa: absorbs the maximum found
r"Aa??"      # [A]aaaaaaaa: absorbs the minimum found, which is 0 (? means 0 or 1)

r"\d{7}"     # 7 digits
r"\d{2,7}"   # 2 to 7 digits
r"\d{,7}"    # 0 to 7 digits
r"\d{2,7}?"  # absorbs the minimum found which is 2, ignores the 7

import re
text = 'a "witch" and her "broom" is one'
print(re.findall("\".+\"", text))    # "witch" and her "broom"
print(re.findall("\".+?\"", text))   # "witch", "broom"

['"witch" and her "broom"']
['"witch"', '"broom"']


### Issue

In [148]:
re.sub("x*", "-", "spam")  # 0 or more match, so replaces even if no 'x' in the string!

'-s-p-a-m-'
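
One way around this, sketched below, is to use a quantifier that cannot match the empty string, such as `+`. The sample strings are assumptions:

```python
import re

# `x+` needs at least one 'x', so a string without 'x' is left untouched
untouched = re.sub(r"x+", "-", "spam")
# a run of several 'x' is replaced as a single match
collapsed = re.sub(r"x+", "-", "spaxxm")

print(untouched)   # spam
print(collapsed)   # spa-m
```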

<h1 style="font-size:34px; font-weight:bold; background:#DDEEEE;padding: 15px;">Groups</h1> 

# Numbered groups

- `()` : capture a portion of the match
- `\N` : backreference to group number `N`, written with a backslash: `\1`, `\2`...   
  
  
- a backreference matches the same text again at a later position   
- numbered backreferences go from `\1` to `\99`

In [33]:
r"(\d[aeiou])\1"        # digit + vowel, then the same text again: matches '1a1a', not '1a2e'
r"\d[aeiou]\d[aeiou]"   # the pattern repeated, not the text: matches both '1a1a' and '1a2e'

'\\d[aeiou]\\d[aeiou]'
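
A backreference repeats the captured text, not the sub-pattern. The short check below makes the difference visible, with sample inputs chosen for illustration:

```python
import re

backref = r"(\d[aeiou])\1"          # same text twice
repeated = r"\d[aeiou]\d[aeiou]"    # same pattern twice

same_text = re.search(backref, "1a1a")           # matches: '1a' is followed by '1a'
different_text = re.search(backref, "1a2e")      # None: '2e' is not the captured '1a'
pattern_only = re.search(repeated, "1a2e")       # matches: any digit + vowel, twice

print(same_text.group())      # 1a1a
print(different_text)         # None
print(pattern_only.group())   # 1a2e
```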

### Issue

In [56]:
#-- Use \g<group_number>

# re.sub(r"(\d+)", r"\10", "is this a 1?")  # meant: group \1 followed by '0'
                                            # but the interpreter reads it as 'group 10'
                                            # -> raises re.error (invalid group reference)

re.sub(r"(\d+)", r"\g<1>0", "is this a 1?")

'is this a 10?'

# Group without capturing

```
(?:...)
```

- `?:` indicates that the group is not captured and cannot be referenced later


In [91]:
#-- "(?:x{6})*" would match any multiple of six 'x' characters

print(re.search(r"(?:x{6})", "xxxxxxxxx"))   # 9 'x' -> matches only the first 6
print(re.search(r"(?:x{3})", "xxxxxx"))      # 6 'x' -> matches only the first 3 (no * after the group)

<re.Match object; span=(0, 6), match='xxxxxx'>
<re.Match object; span=(0, 3), match='xxx'>
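
To actually test for "a multiple of six", the group needs a quantifier and the whole string has to be consumed. A minimal sketch using `re.fullmatch`:

```python
import re

SIX_X = r"(?:x{6})*"   # the group is repeated as a unit, but never captured

twelve = re.fullmatch(SIX_X, "x" * 12)   # 12 = 2 * 6 -> match
nine = re.fullmatch(SIX_X, "x" * 9)      # 9 is not a multiple of 6 -> None

print(twelve.group())      # xxxxxxxxxxxx
print(twelve.groups())     # () : nothing was captured
print(nine)                # None
```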


# Named groups

```
(?P<group_name>...)
```

- `?P` indicates that it's a named group
- `<>` contain the name of the group

In [101]:
import re
content = "I count like this: one, two, three"
match = re.search(r"(\w+), (\w+), (\w+)", content)
print(match)
print(match.groups())

<re.Match object; span=(19, 34), match='one, two, three'>
('one', 'two', 'three')


In [32]:
import re
content = "I count like this: one, two, three"
match = re.search(r"(?P<first>\w+), (?P<second>\w+), (?P<third>\w+)", content)
print(match)
print(match.groups())
print(match.groupdict())

<re.Match object; span=(19, 34), match='one, two, three'>
('one', 'two', 'three')
{'first': 'one', 'second': 'two', 'third': 'three'}


# Backreference

```
(?P=group_name)
```

- backreference: matches the same content as a previous named group

In [109]:
import re
content = "I count like this: one, one, one"
match = re.search(r"(?P<first>\w+), (?P=first), (?P=first)", content)
print(match)
print(match.groups())
print(match.groupdict())

<re.Match object; span=(19, 32), match='one, one, one'>
('one',)
{'first': 'one'}
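
A common use of a named backreference is spotting an accidentally doubled word. The sketch below assumes a simple definition of "word" (`\w+`) and a made-up sentence:

```python
import re

DOUBLED_WORD = r"\b(?P<word>\w+)\s+(?P=word)\b"

match = re.search(DOUBLED_WORD, "this is is a test")
clean = re.search(DOUBLED_WORD, "this is a test")

print(match.group())         # is is
print(match.group("word"))   # is
print(clean)                 # None
```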


# Non-capturing groups

```
(?:...)
```

Useful because:
- saves memory (each group takes up memory)
- readability by grouping sub-components but without capturing them

In [117]:
import re
content = "sep"
match = re.search(r"(?P<first>[^aeiou])(?:[aeiou])(?P<second>[^aeiou])", content)
print(match)
print(match.groups())
print(match.groupdict())

<re.Match object; span=(0, 3), match='sep'>
('s', 'p')
{'first': 's', 'second': 'p'}


# Conditional matches

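- `(?(id/name)yes|no)` : conditional group, matches `yes` if the referenced group took part in the match, otherwise `no` (the `|no` part is optional)
- the condition can be a group number, `(?(1)...)`, or a group name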

In [209]:
MY_REGEX = r'''
    (ACME\s)?      # optional plain string, followed by whitespace
    Super\s        # then 'Super', then whitespace
    (?(1)(Out)     # if 'ACME' is found (group 1), then read 'Out'
     |\w*(fit)     # otherwise, read a word that ends with 'fit'
    )
'''
text = re.compile(MY_REGEX, re.VERBOSE)
print(text.match('ACME Super Outfit'))
print(text.match('Super Outfit'))

<re.Match object; span=(0, 14), match='ACME Super Out'>
<re.Match object; span=(0, 12), match='Super Outfit'>
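
Another sketch of a conditional, modelled on the kind of example found in the `re` documentation: a closing `>` is required only if an opening `<` was matched. The pattern is deliberately simplistic and not a real e-mail validator:

```python
import re

# group 1 is the optional '<'; (?(1)>|$) demands '>' if it matched, end of string otherwise
BRACKETED = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)")

with_brackets = BRACKETED.match("<user@host.com>")
without_brackets = BRACKETED.match("user@host.com")
missing_close = BRACKETED.match("<user@host.com")
stray_close = BRACKETED.match("user@host.com>")

print(with_brackets.group(2))      # user@host.com
print(without_brackets.group(2))   # user@host.com
print(missing_close)               # None
print(stray_close)                 # None
```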


# Look ahead matches

- `(?=...)` : positive look ahead = '***followed by something***' – doesn't absorb the part after
- `(?!...)` : negative look ahead = ***'not followed by something'***

In [105]:
MY_REGEX = r'writing\s(?=to)'    # 'to' is not absorbed
print(re.search(MY_REGEX, "I'm writing to you."))

MY_REGEX = r'writing\s(to)'     # 'to' is absorbed
print(re.search(MY_REGEX, "I'm writing to you."))

<re.Match object; span=(4, 12), match='writing '>
<re.Match object; span=(4, 14), match='writing to'>


In [53]:
MY_REGEX = r'\d{4}(?=\[\w\])'   # is followed by
print(re.search(MY_REGEX, "3990[X]"))

MY_REGEX = r'\d{4}(?!\[\w\])'   # negative look ahead: is not followed by
print(re.search(MY_REGEX, "3990[X] 3881 "))

<re.Match object; span=(0, 4), match='3990'>
<re.Match object; span=(8, 12), match='3881'>
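
With a variable-length quantifier, a negative look ahead can be satisfied by backtracking to a shorter match. The sketch below, on a made-up CSS-like string, shows the pitfall and one possible guard:

```python
import re

text = "10px 20em 30px"

followed_by_px = re.findall(r"\d+(?=px)", text)
# Naive version: '10' fails, but the engine backtracks to '1' (followed by '0', not 'px')
naive_not_px = re.findall(r"\d+(?!px)", text)
# Guarded version: also forbid a digit after the match, so partial numbers are rejected
guarded_not_px = re.findall(r"\d+(?!\d|px)", text)

print(followed_by_px)   # ['10', '30']
print(naive_not_px)     # ['1', '20', '3']
print(guarded_not_px)   # ['20']
```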


# Look behind matches

- `(?<=...)` : positive look behind = ***'preceded by something'***
- `(?<!...)` : negative look behind = ***'not preceded by something'***

In [97]:
MY_REGEX = r'(?<=\[\w\])\d{4}'    # look behind: is preceded by...
re.search(MY_REGEX, " 3990 [X]3991")

<re.Match object; span=(9, 13), match='3991'>

In [106]:
MY_REGEX = r'(?<!\[\w\])\d{4}'   # negative look behind: is not preceded by...
re.search(MY_REGEX, " 3990 [X]3990")

<re.Match object; span=(1, 5), match='3990'>
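
A small sketch of both look behind forms on a made-up price string. Note that the negative form also excludes a preceding digit, otherwise the tail of `25` would match:

```python
import re

text = "cost $25, qty 3, fee $7"

prices = re.findall(r"(?<=\$)\d+", text)          # digits preceded by '$'
not_prices = re.findall(r"(?<![$\d])\d+", text)   # digits preceded by neither '$' nor a digit

print(prices)       # ['25', '7']
print(not_prices)   # ['3']
```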

# Comments

- `(?#my comments)` : anything inside of the group is a comment

In [231]:
MY_REGEX = r'(\d{4})(?#four digits)'
re.findall(MY_REGEX, " 5888 [X]3999")

['5888', '3999']

<h1 style="font-size:34px; font-weight:bold; background:#DDEEEE;padding: 15px;">Flags</h1> 

- Change regex behavior (example: make them case insensitive)
  
  
- Combine flags with `|`
  
  
# Behavior flags  

- `re.I`, `re.IGNORECASE`: matching is case insensitive for alphabetic characters
  
  
- `re.M`, `re.MULTILINE`: makes `^` and `$` match at the start and end of each line, not only of the whole string
  
  
- `re.L`, `re.LOCALE`: interprets words according to the current locale (only usable with bytes patterns in Python 3)  
  Affects the behavior of `\w`, `\W`, `\b` and `\B`
    
    
- `re.U`, `re.UNICODE`: interprets letters according to the Unicode character database (default for `str` patterns in Python 3)  
  Affects the behavior of `\w`, `\W`, `\b` and `\B`
   
   
- `re.S`, `re.DOTALL`: causes dot metacharacter (`.`) to match a newline
  
  
- `re.X`, `re.VERBOSE`: allows whitespace and comments within regex   
  (changes how literal whitespace works inside the regexp)
    
- `re.DEBUG`: Debug information is printed to console
  

# Character encoding flags

- `re.A`, `re.ASCII`: use ASCII encoding for character classification   
  (default in Python 3 if regex pattern is a bytestring)
- `re.U`, `re.UNICODE`: use Unicode encoding for character classification     
  (default in Python 3 if regex pattern is a Unicode string)
- `re.L`, `re.LOCALE`: use current locale to determine encoding for character classification

# Activate flags from inside the regexp
- `(?<flag_chars>)` 
  - turns on the flag for the entire regexp; recent Python versions (3.11+) require it at the very start of the pattern    
  - example: `re.search("(?im)^n", "Yes! \nNo!")`
  
  
- `(?<flag_chars>:...)` 
  - turns on the flag for a group   
  - example: `re.findall("(?i:s)pam", "Spam, spam, spam")`
  
  
- `(?-<flag_chars>:...)` 
  - turns off the flag for a group   
  - example: `re.findall("(?-i:s)pam", "Spam, spam, spam")`

# Other

See [mrabarnett regex library](https://bitbucket.org/mrabarnett/mrab-regex/src/hg/)
- `V1`, `VERSION1`:   
  -> nested set operations, ex '[[a-z]--[aeiou]]' = any lower case letter except vowels
- scoped flags: apply to only part of a pattern, can be turned on or off   
  global flags: apply to the entire pattern, can only be turned on
- other flags
- `(*PRUNE)`, `(*SKIP)`, `(*FAIL)`

# Examples

In [232]:
import re
re.search("a+", "AAA", re.IGNORECASE)

<re.Match object; span=(0, 3), match='AAA'>

In [59]:
print(re.search("^N", "Yes!\nNo!", re.MULTILINE))

# \A : no match with or without multiline mode
print(re.search(r"\AN", "Yes!\nNo!", re.MULTILINE))

<re.Match object; span=(5, 6), match='N'>
None


In [61]:
print(re.search("1.2", "First 1\n2 next"))
print(re.search("1.2", "First 1\n2 next", re.DOTALL))

None
<re.Match object; span=(6, 9), match='1\n2'>


In [236]:
# North American style phone numbers:
PHONE_REGEX = r'''
    (1\s)?          # optional leading 1
    \(\d\d\d\)      # area code
    \s
    \d\d\d-         # prefix
    \d\d\d\d        # line number
'''
phone = re.compile(PHONE_REGEX, re.VERBOSE)
phone.match('1 (416) 967-1111')

<re.Match object; span=(0, 16), match='1 (416) 967-1111'>

In [71]:
# No result: literal whitespace not meaningful with re.VERBOSE flag -> use \s
re.search("bacon, eggs", "bacon, eggs, and spam", re.VERBOSE)
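
If a literal space is needed in verbose mode, it can be escaped with a backslash or placed in a character class. A minimal sketch of the three usual spellings:

```python
import re

text = "bacon, eggs, and spam"

escaped_space = re.search(r"bacon,\ eggs", text, re.VERBOSE)   # backslash-escaped space
class_space = re.search(r"bacon,[ ]eggs", text, re.VERBOSE)    # space inside a set
any_whitespace = re.search(r"bacon,\seggs", text, re.VERBOSE)  # \s class

print(escaped_space.group())    # bacon, eggs
print(class_space.group())      # bacon, eggs
print(any_whitespace.group())   # bacon, eggs
```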

In [66]:
#-- Combine multiple flags
print(re.search("^n", "Yes! \nNo!", re.MULTILINE | re.IGNORECASE))

# Same as before: (?<flag_chars>) turns on the flag for all regexp
print(re.search("(?im)^n", "Yes! \nNo!"))

<re.Match object; span=(6, 7), match='N'>
<re.Match object; span=(6, 7), match='N'>


In [70]:
#  (?<flag_chars>:...) -> turns on the flag for the group
#  re.IGNORECASE   -> turns it on for the whole regexp
#  ?-i:            -> turns it off for the group

print(re.findall("(?i:s)pam",  "Spam, spam, spam, SPAM"))
print(re.findall("(?-i:s)pam", "Spam, spam, spam, SPAM", re.IGNORECASE))

['Spam', 'spam', 'spam']
['spam', 'spam']


<h1 style="font-size:34px; font-weight:bold; background:#DDEEEE;padding: 15px;">Escaping meta-characters</h1>

# Escape with `[ ]` and `\`

In [27]:
# To match square brackets:

#-- put them into square brackets:
#   following example matches: 
#   - an opening square bracket: [[]
#   - a capital letter:          [A-Z]
#   - a closing square bracket:  []]
r"[[][A-Z][]]"

#-- or escape them with a backslash:
r"\[[A-Z]\]"

'\\[[A-Z]\\]'

# Issue

In [70]:
# When you put a backslash into a normal (non-raw) Python string
# and it escapes a character that is not a valid escape sequence,
# Python keeps the backslash as a literal character
# (recent versions also emit a warning about the invalid escape sequence).

import re
re.search("\w+,", "one,two")

<re.Match object; span=(0, 4), match='one,'>

In [71]:
# But when you overlap between escape sequences in the string
# and the escape sequences in the regex,
# you need to use 2 backslashes in the regex and in the string.
# So we should have done:

import re
print(re.search("\\w+,", "one,two"))

<re.Match object; span=(0, 4), match='one,'>


## SOLUTION 1: Escape the escape

In [81]:
print(re.search("\w+\\two", "one\\two"))   # re receives \w+\two -> \t is read as a tab: no match

# "\\\\" in source code is the 2-character string \\,
# which the re module reads as one literal backslash
print(re.search("\\w+\\\\two", "one\\two"))

None
<re.Match object; span=(0, 7), match='one\\two'>


In [None]:
print(re.search("\\w+\\eight", "one\\eight"))
# Error: bad escape \e at position 3
# because the re module receives \w+\eight, and \e is not a valid regex escape

In [82]:
print(re.search("\\w+\\\\eight", "one\\eight"))

<re.Match object; span=(0, 9), match='one\\eight'>


## SOLUTION 2: Use raw strings = BETTER = Avoid regexp bugs !
A raw string tells Python not to process any escape sequences in the literal

In [95]:
content = r"one\eight"
content

'one\\eight'

In [96]:
regexp = r"\w+\\eight"
re.search(regexp, content)

<re.Match object; span=(0, 9), match='one\\eight'>

In [121]:
re.search(r"\w+\\\w+", r"one\two")

<re.Match object; span=(0, 7), match='one\\two'>
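
The sketch below compares raw and non-raw literals directly. `\b` is a good illustration, because Python reads it as a backspace character in a normal string:

```python
import re

# The two spellings produce the same 3-character pattern string
same_pattern = ("\\w+" == r"\w+")

# "\n" is one newline character, r"\n" is a backslash followed by 'n'
newline_len = len("\n")
raw_newline_len = len(r"\n")

# "\b" is a backspace character, so the non-raw pattern never finds a word boundary
non_raw = re.findall("\bcat\b", "a cat sat")
raw = re.findall(r"\bcat\b", "a cat sat")

print(same_pattern)                    # True
print(newline_len, raw_newline_len)    # 1 2
print(non_raw)                         # []
print(raw)                             # ['cat']
```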

<h1 style="font-size:34px; font-weight:bold; background:#DDEEEE;padding: 15px;">The re standard module</h1>

**Reference:** https://www.w3schools.com/python/python_regex.asp

- Most methods take ***a pattern string*** and ***a string*** to match against:   
  `re.<method>(pattern, string, flags)`
  
  
- Most methods return:
  - a `re.Match` object if success
  - `None` if failed

# The `re.Match` object: Accessing groups

`re.Match` object contains attributes and methods providing details about a match:
  
  
- Is "truthy": can be used directly in a boolean test (`if match:`)
  
  
- `.start()` : starting index of a match
- `.end()` : ending index of a match
- `.span()` : tuple containing the start and end of match
  
  
- `.string` : the string passed into the function
  
  
- `.group()` : the part of the string where there was a match
- `.groups()` : all matched groups
- `.group(*args)` : the matched group or tuple of matched groups
- `.groupdict()` : dictionary of all named subgroups of the match   
   keys = group names (see above 'Named groups' `(?P<group_name>\d+)`)
  
  
- `.expand(template)` : returns the template string with backreferences (`\1`, `\g<name>`) replaced by the matched groups

In [96]:
import re
question = "one,two,three"
match = re.search(r"(\w+),(\w+)", question)

match.groups()           # all matched groups

print(match.group(0))    # group 0 : the entire match

print(match.group(1))    # first matched
print(match[1])          # idem

print(match.group(2,1))  # reverse order

print(match.start(2))    # start index of the second group

one,two
one
one
('two', 'one')
4


In [93]:
number = "124.13"
m = re.match( r'(?P<Exponent>\d+)\.(?P<Fraction>\d+)', number)
print(m.groupdict())

{'Exponent': '124', 'Fraction': '13'}
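
Named groups can also be read one at a time, by name, through `group()`, indexing, `start()` and `span()`. A small sketch on a made-up string (the group names `user` and `host` are assumptions):

```python
import re

text = "contact: alice@example.org today"
m = re.search(r"(?P<user>[\w.]+)@(?P<host>[\w.]+)", text)

user = m["user"]                       # indexing by group name
host = m.group("host")                 # same thing with .group()
host_span = m.span("host")             # (start, end) of the named group
whole = m.string[m.start():m.end()]    # slice of the original string = group(0)

print(user)        # alice
print(host)        # example.org
print(host_span)   # (15, 26)
print(whole)       # alice@example.org
```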


In [84]:
match.expand("Second was '\\2'. First was '\\1'")   # like f-strings

"Second was 'two'. First was 'one'"

# `re.compile()` compiles regexp string into regexp object

`re.compile(pattern, flags=0)`
  
  
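- Little difference in efficiency for a handful of patterns
- Python still compiles (and caches) the regexp when we call the module-level `re` functions
- So it is close to putting the regexp in a variable, plus access to the pattern object's methods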
  
  
- Syntax:
  
```
match = re.search(regexp, string)
```
  
  
  becomes 
  
  
```
pattern = re.compile(regexp)
match = pattern.search(string)
```

In [104]:
import re
digits_re = re.compile(r"(\d+)")
print(digits_re)
print(digits_re.search("My favorite numbers are 13 and 42."))

re.compile('(\\d+)')
<re.Match object; span=(24, 26), match='13'>


In [103]:
#-- Same as:
DIGITS_RE = r"(\d+)"
re.search(DIGITS_RE, "My favorite numbers are 13 and 42.")

<re.Match object; span=(24, 26), match='13'>
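
One practical difference is that a compiled pattern is an object: it can be passed around, and its `search()` accepts a start position. A minimal sketch, where `first_number` is a hypothetical helper written for this example:

```python
import re

DIGITS = re.compile(r"\d+")

def first_number(pattern, text, start=0):
    """Hypothetical helper: first match of a precompiled pattern at or after `start`."""
    match = pattern.search(text, start)   # Pattern.search(string, pos)
    return match.group() if match else None

text = "My favorite numbers are 13 and 42."
first = first_number(DIGITS, text)
second = first_number(DIGITS, text, start=26)   # resume right after '13'
nothing = first_number(DIGITS, "no digits here")

print(first)     # 13
print(second)    # 42
print(nothing)   # None
```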

# `re.search()` matches the first instance

`re.search(pattern, string, flags=0)`

In [81]:
import re
question = "Hello world! Wonderful world!"
match = re.search("world", question)
print(match)

<re.Match object; span=(6, 11), match='world'>


In [82]:
print(bool(match))
print(match is None)

print(question[6:11])
print(match.span())

True
False
world
(6, 11)


# `re.match()` matches the beginning of a string

`re.match(pattern, string, flags=0)`

In [20]:
question = "Hello world! Wonderful world!"
match = re.match("world", question)
match

In [26]:
print(re.match(r"\w{5}", question))   # matches 5 word characters at the start of the string

<re.Match object; span=(0, 5), match='Hello'>


# `re.fullmatch()` matches the complete string

`re.fullmatch(pattern, string, flags=0)`

In [29]:
text = "spam"
print(re.fullmatch("Hello", text))
print(re.fullmatch("spam", text))

None
<re.Match object; span=(0, 4), match='spam'>


In [30]:
question = "Hello world! Wonderful world!"
print(re.fullmatch(r"((\w*\s*)*!)*", question))   # nested quantifiers: can backtrack badly on non-matching input

<re.Match object; span=(0, 29), match='Hello world! Wonderful world!'>


# `re.findall()` list of matching substrings or list of tuples if groups used

`re.findall(pattern, string, flags=0)`

In [102]:
#-- Without groups -> returns a list of substrings
question = "Lovely spam! Wonderful spam!"
print(re.findall("[aeiou][^aeiou]", question))

#-- With groups -> returns a list of tuples of substrings
#   Each tuple is one match of the pattern
line = 'your alpha@scientificprograming.io, blah beta@scientificprogramming.me blah user'
print(re.findall(r'([\w\.-]+)@([\w\.-]+)', line))

['ov', 'el', 'am', 'on', 'er', 'ul', 'am']
[('alpha', 'scientificprograming.io'), ('beta', 'scientificprogramming.me')]


# `re.finditer()` same but returns an iterator instead of list

`re.finditer(pattern, string, flags=0)`
  
    
See: https://www.educative.io/courses/python-regular-expressions-with-data-scraping-projects/qV2WpVJk6gG

In [45]:
question = "Lovely spam! Wonderful spam!"
for match in re.finditer("[aeiou][^aeiou]", question):
    s = match.start()
    e = match.end()
    print(f'String match {question[s:e]} at {s}:{e}')

String match ov at 1:3
String match el at 3:5
String match am at 9:11
String match on at 14:16
String match er at 17:19
String match ul at 20:22
String match am at 25:27
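
Because `finditer()` yields full `re.Match` objects, it combines well with named groups. The sketch below builds one dictionary per match; the group names and sample addresses are assumptions:

```python
import re

line = "your alpha@example.io, blah beta@example.me blah user"
EMAIL_RE = r"(?P<user>[\w.-]+)@(?P<domain>[\w.-]+)"

# one dict per match, keyed by group name
records = [m.groupdict() for m in re.finditer(EMAIL_RE, line)]

for record in records:
    print(record)
# {'user': 'alpha', 'domain': 'example.io'}
# {'user': 'beta', 'domain': 'example.me'}
```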


# `re.sub()` returns new string with substitutions

`re.sub(pattern, repl, string, count=0, flags=0)`

In [136]:
#-- Substitute with a plain string:

import re
content = "My favorite numbers are 1, 8, 13 and 99"
content = re.sub(r"\d+", "#", content)
content

'My favorite numbers are #, #, # and #'

In [138]:
#-- Substitute only the first match:

content = "My favorite numbers are 1, 8, 13 and 99"
content = re.sub(r"\d+", "#", content, count=1)
content

'My favorite numbers are #, 8, 13 and 99'

In [146]:
#-- Substitute with a regexp:

re.sub(r"(\w+)\s(\w+)", r"\2 \1", "one two")

'two one'

In [141]:
#-- Substitute with a function:

def reverse(match):
    '''Slice reverses the string'''
    return match.group(0)[::-1]  

content = "My favorite numbers are 1, 8, 13 and 99"
print(re.sub(r"\d+", reverse, content))

My favorite numbers are 1, 8, 31 and 99
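
The replacement function can compute anything from the match, as long as it returns a string. A short sketch that doubles every number, using a lambda instead of a named function:

```python
import re

content = "My favorite numbers are 1, 8, 13 and 99"

# the callable receives a re.Match object and must return a str
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), content)

print(doubled)   # My favorite numbers are 2, 16, 26 and 198
```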


# `re.subn()` same but returns tuple `(new_string, nb_substitutions)`

`re.subn(pattern, repl, string, count=0, flags=0)`
  
  
- same as `sub()`, different return value

In [153]:
import re
print(re.sub("x*", "-", "spam"))
print(re.subn("x*", "-", "spam"))

-s-p-a-m-
('-s-p-a-m-', 5)


# `str.split()` vs `re.split()`

`re.split(pattern, string, maxsplit=0, flags=0)`

In [154]:
#-- Split point is a character:
"one,two,three".split(',')

['one', 'two', 'three']

In [98]:
#-- Split point is a regexp = more complex:
print(re.split(r"\d+", "My favorite numbers are 13 and 42."))
print(re.split('[A-Za-z]+', "+61Lean7489Scientific324234", flags=re.I))

['My favorite numbers are ', ' and ', '.']
['+61', '7489', '324234']


In [158]:
#-- Use groups to include the split points in the result:
re.split(r"(\d+)", "My favorite numbers are 13 and 42.")

['My favorite numbers are ', '13', ' and ', '42', '.']

In [163]:
#-- Non capturing groups (same as without groups):
re.split(r"(?:\d+)", "My favorite numbers are 13 and 42.")

['My favorite numbers are ', ' and ', '.']

In [99]:
re.split(r"\d+", "My favorite numbers are 13 and 42.", maxsplit=1)

['My favorite numbers are ', ' and 42.']
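
Two more behaviours worth knowing, sketched on made-up inputs: a set makes it easy to split on several delimiters at once, and a delimiter at the very start or end produces empty strings in the result:

```python
import re

# split on any run of commas, semicolons or whitespace
fields = re.split(r"[;,\s]+", "a, b;c  d")
# a leading or trailing delimiter yields empty strings
edges = re.split(r",", ",a,")

print(fields)   # ['a', 'b', 'c', 'd']
print(edges)    # ['', 'a', '']
```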

# `re.escape(pattern)`

In [176]:
re.findall("2^4", "Even more than 2^2 is 2^4")
# no result: an unescaped caret is the start-of-string anchor, which can never match after the '2'

[]

In [186]:
regex = re.escape("2^4")
print(regex)   # build a properly escaped string
print(re.findall(regex, "Even more than 2^2 is 2^4"))

2\^4
['2^4']
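
`re.escape()` is especially useful when a pattern is assembled from arbitrary strings. The sketch below builds an alternation from a made-up list of search terms:

```python
import re

terms = ["C++", "a.b", "(x)"]   # sample terms full of meta-characters

# escape each term, then join them with the OR meta-character
pattern = "|".join(re.escape(term) for term in terms)
found = re.findall(pattern, "C++ and a.b and aXb and (x)")

print(found)     # ['C++', 'a.b', '(x)']
```

Without escaping, `a.b` would also match `aXb`, and `C++` or `(x)` would not be read as literal text.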


<h1 style="font-size:34px; font-weight:bold; background:#DDEEEE;padding: 15px;">Fun (?!?!) regular expressions</h1>

- [RFC822 in Perl: regexp-based address validation](http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html)
- [Divide by 7 - Example](https://github.com/matthiasgoergens/Div7)
- [Hard Code Golf: Regex for divisibility by 7](https://codegolf.stackexchange.com/questions/3503/hard-code-golf-regex-for-divisibility-by-7)
- [Demystifying the regular expression that checks if a number is a prime](https://iluxonchik.github.io/regular-expression-check-if-number-is-prime/)
- [Xavier Noria - repo on math by regex](https://github.com/fxn/math-with-regexps)

In [254]:
PRIME_RE = r'''
    ^.?$             # empty string or a single character (n = 0 or 1): not prime
    |^(..+?)\1+$     # captures 2 characters, then 3 characters, etc.
                     # then looks for repetitions of the captured text with '\1'
                     # -> not prime!
'''
def is_prime(n):
    text = re.compile(PRIME_RE, re.VERBOSE)
    return not text.match("1"*n)

is_prime(2)  # 2, 3, 5, 7, 11...

# converts n to unary format (1, 2, 3, 4) = (1, 11, 111, 1111)
# Originally from a Perl hacker known as Abigail
# deep dive on how this works:
#  https://iluxonchik.github.io/regular-expression-check-if-number-is-prime

True
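
As a sanity check, the same idea can be run over a range of numbers. This is a self-contained sketch: the pattern is written as a raw string so that `\1` reaches the regex engine as a backreference, and the names `NOT_PRIME` and `is_prime_regex` are chosen for this example:

```python
import re

# matches unary strings whose length is 0, 1, or a composite number
NOT_PRIME = re.compile(r"^.?$|^(..+?)\1+$")

def is_prime_regex(n):
    """Return True if n is prime, by testing a unary string of length n."""
    return NOT_PRIME.match("1" * n) is None

primes = [n for n in range(20) if is_prime_regex(n)]
print(primes)   # [2, 3, 5, 7, 11, 13, 17, 19]
```

This is a curiosity rather than a practical primality test: the amount of backtracking grows quickly with `n`.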