# Python Regex

Objectives:
  - parse strings using `str` functions
  - match patterns using `re` module

Resources:
  - https://docs.python.org/3/library/re.html
  - ECiP, Chapter 8

## `str` magic

Useful tools:

  - `in`
  - `str.split`
  - `str.find`
  - `str.count`
  - `str.replace`
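The cells below do not exercise `in` or `str.count`, so here is a minimal sketch of both, along with the "not found" behavior of `str.find`. The sample string `line` is made up for illustration and does not come from any real file.

```python
# Hypothetical sample string, for illustration only
line = 'kinf 1.25 burnup 10.0 burnup 20.0'

has_kinf = 'kinf' in line         # substring membership test
n_burnup = line.count('burnup')   # number of non-overlapping occurrences
missing = line.find('keff')       # str.find returns -1 when nothing is found

print(has_kinf, n_burnup, missing)
```

Because `str.find` signals failure with `-1` rather than an exception, it is worth checking the result before using it as a slice index.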

In [6]:
s = '1      2 3 4 5\n 6 7 8 9'
s.split()  

['1', '2', '3', '4', '5', '6', '7', '8', '9']

In [19]:
s = '1,2,3,4,5'
import numpy as np
nums = np.array(list(map(int, s.split(','))))  # map() is lazy in Python 3, so materialize it


In [22]:
s = 'hayneedlestack'
s.find('needle') 


3

In [None]:
s[3:(3+len('needle'))]

In [31]:
set(s)  

{'a', 'c', 'd', 'e', 'h', 'k', 'l', 'n', 's', 't', 'y'}

In [27]:
ss = s.replace('e', ' ')
ss

'hayn  dl stack'

**Exercise**: Process the file `pwr.log` and store `kinf` as a function of `burnup`. The file is simple, but just irregular enough that `np.loadtxt` cannot read it directly.

## regex

A REGular EXpression (regex) is a *pattern* that defines the set of strings that match it.

In [32]:
import re

In [33]:
p = '123' # the pattern

In [34]:
s = '123 abc' # the string that matches (or not)

In [37]:
re.match(p, s)

<_sre.SRE_Match object; span=(0, 3), match='123'>

In [38]:
p = 'abc'

In [39]:
re.match(p, s)

In [40]:
re.search(p, s)

<_sre.SRE_Match object; span=(4, 7), match='abc'>
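The difference between the calls above is where the pattern is allowed to match. As a small sketch using the same string: `re.match` only tries position 0, `re.search` scans the whole string, and `re.fullmatch` requires the pattern to cover the entire string. The variable names are chosen here for illustration.

```python
import re

s = '123 abc'

at_start = re.match('abc', s)    # None: 'abc' does not start at position 0
anywhere = re.search('abc', s)   # match object covering span (4, 7)
whole = re.fullmatch('123', s)   # None: the pattern must cover all of s

# A failed match returns None, so test the result before using it
if anywhere is not None:
    print(anywhere.group(0), anywhere.start(), anywhere.end())
```

Since a failed match is `None`, calling `.group()` on it raises `AttributeError`; the `is not None` check avoids that.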

### Basic Special Characters

  - `.`  any character
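  - `^` beginning of the string (of each line with `re.MULTILINE`)
  - `$` end of the string (of each line with `re.MULTILINE`)
  - `*` 0 or more repetitions of the preceding item
  - `+` 1 or more repetitions of the preceding item
  - `?` 0 or 1 repetitions of the preceding item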
  - `[]` e.g., `[abc]` matches `a`, `b`, or `c` individually
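A quick way to get a feel for these characters is to try each one on a short string. The following sketch uses made-up inputs; the expected results are given in the comments.

```python
import re

# Made-up strings, for illustration only
dot = re.findall('a.c', 'abc a-c ac')          # ['abc', 'a-c']: '.' needs exactly one character
star = re.findall('ab*', 'a ab abbb')          # ['a', 'ab', 'abbb']
plus = re.findall('ab+', 'a ab abbb')          # ['ab', 'abbb']
optional = re.findall('ab?', 'a ab abbb')      # ['a', 'ab', 'ab']
sets = re.findall('[0-9]+', 'abc123abc456')    # ['123', '456']
starts = re.search('^abc', 'abc123') is not None   # True: 'abc' is at the beginning
ends = re.search('abc$', 'abc123') is not None     # False: the string ends with '123'

print(dot, star, plus, optional, sets, starts, ends)
```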

In [6]:
import re
s = "abcabcabc123abc456abc"
p = "[abc]+?"

In [7]:
for match in re.finditer(p, s):
    print(match)

<_sre.SRE_Match object; span=(0, 1), match='a'>
<_sre.SRE_Match object; span=(1, 2), match='b'>
<_sre.SRE_Match object; span=(2, 3), match='c'>
<_sre.SRE_Match object; span=(3, 4), match='a'>
<_sre.SRE_Match object; span=(4, 5), match='b'>
<_sre.SRE_Match object; span=(5, 6), match='c'>
<_sre.SRE_Match object; span=(6, 7), match='a'>
<_sre.SRE_Match object; span=(7, 8), match='b'>
<_sre.SRE_Match object; span=(8, 9), match='c'>
<_sre.SRE_Match object; span=(12, 13), match='a'>
<_sre.SRE_Match object; span=(13, 14), match='b'>
<_sre.SRE_Match object; span=(14, 15), match='c'>
<_sre.SRE_Match object; span=(18, 19), match='a'>
<_sre.SRE_Match object; span=(19, 20), match='b'>
<_sre.SRE_Match object; span=(20, 21), match='c'>


### The Special `\` Sequences

  - `\d` any decimal digit
  - `\D` any character that is *not* `\d`
  - `\s` any whitespace character (`[ \t\n\r\f\v]`)
  - `\S` any character that is *not* `\s`

In [1]:
p = '[0-9]'  # matches the same ASCII digits as r'\d'
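The four sequences come in complementary pairs, which a short sketch can make concrete. The string below is made up for illustration. The patterns are written as raw strings (`r'...'`) so that Python passes the backslash through to `re` unchanged.

```python
import re

s = 'abc 123\tdef\n456'   # made-up string with a space, a tab, and a newline

digits = re.findall(r'\d+', s)       # ['123', '456']
non_digits = re.findall(r'\D+', s)   # ['abc ', '\tdef\n']
words = re.findall(r'\S+', s)        # ['abc', '123', 'def', '456']
gaps = re.findall(r'\s', s)          # [' ', '\t', '\n']

print(digits, non_digits, words, gaps)
```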

### Special Operations

  - `?` following `*` or `+` or `?` makes it *non-greedy*
  - `{m}` requires `m` repeats
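  - `{m,n}` requires `m`, `m+1`, ..., or `n` repeats (no space after the comma, or the braces are matched literally)
  - `\` escapes a special character so that it matches literally, e.g., `\.` (unless it starts one of the special sequences above)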
  - `|` "or" between arbitrary patterns
  - `(...)` group
  - `(?:...)` non-capturing group
  - `(?P<name>...)` named group
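As a rough sketch of the repetition, alternation, and escape operations listed above, the following reuses the string from the earlier `finditer` cell. The variable names are chosen here for illustration.

```python
import re

s = 'abcabcabc123abc456abc'   # same string as in the finditer cell

greedy = re.search('[abc]+', s).group(0)      # 'abcabcabc': takes as much as possible
lazy = re.search('[abc]+?', s).group(0)       # 'a': takes as little as possible
three = re.findall(r'\d{3}', s)               # ['123', '456']
two_or_three = re.findall(r'\d{2,3}', s)      # ['123', '456']
either = re.findall('123|456', s)             # ['123', '456']
dots = re.findall(r'\.', '1.5 and 2.5')       # ['.', '.']: an escaped '.' is a literal dot

print(greedy, lazy, three, two_or_three, either, dots)
```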

In [19]:
result1 = re.search(r"(abc)\d", s)
result2 = re.search(r"(?:abc)\d", s)
result3 = re.search(r"(?P<foo>abc)\d", s)
result3.group('foo')

'abc'
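The three group flavors differ only in what the match object remembers. The sketch below compares them on the same string; the variable names are chosen here for illustration.

```python
import re

s = 'abcabcabc123abc456abc'

capturing = re.search(r'(abc)\d', s)
non_capturing = re.search(r'(?:abc)\d', s)
named = re.search(r'(?P<foo>abc)\d', s)

print(capturing.group(0))        # 'abc1': the whole match
print(capturing.groups())        # ('abc',): one captured group
print(non_capturing.groups())    # (): grouped for matching, but nothing captured
print(named.group('foo'))        # 'abc': access by name
print(named.groupdict())         # {'foo': 'abc'}
```

Note that `groups()` takes a default value for groups that did not participate in the match, not a group name; use `group(name)` to look a group up by name.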

In [77]:
s = """
    varA         varB        varC
    0.0000E+00   0.00000E+00 0.0000
    3.0000E-01   7.00000E-04 0.3778
    2.0000E+02   9.99300E-01 0.0003
"""
print(s)
#p = r'(\d\.\d{4,6}E[+-]\d{2})\s+(\d\.\d{4,6}E[+-]\d{2})\s+(\d\.\d{4,6})'
# raw string; note that a repeated group keeps only its last repetition
p = r'((\d\.\d+E[+-]\d{2})\s+){2}'

m = re.findall(p, s)
m


    varA         varB        varC
    0.0000E+00   0.00000E+00 0.0000
    3.0000E-01   7.00000E-04 0.3778
    2.0000E+02   9.99300E-01 0.0003



[('0.00000E+00 ', '0.00000E+00'),
 ('7.00000E-04 ', '7.00000E-04'),
 ('9.99300E-01 ', '9.99300E-01')]
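The output above contains only the `varB` values because a repeated group keeps just its last repetition. One way to recover every column, sketched here along the lines of the commented-out pattern, is to write one group per column and convert the captured text with `float`. The helper names `number` and `rows` are introduced for illustration.

```python
import re

s = """
    varA         varB        varC
    0.0000E+00   0.00000E+00 0.0000
    3.0000E-01   7.00000E-04 0.3778
    2.0000E+02   9.99300E-01 0.0003
"""

number = r'\d\.\d+E[+-]\d{2}'                   # e.g. 3.0000E-01
p = rf'({number})\s+({number})\s+(\d\.\d+)'     # one capturing group per column

# findall returns one tuple of strings per row; convert each entry to float
rows = [tuple(float(x) for x in m) for m in re.findall(p, s)]
for row in rows:
    print(row)
```

The header line is skipped automatically because it contains no digits for the pattern to match.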

In [None]:
pattern = r'\d\.\d\d\d\dE[+-]\d\d'  # escape the dot so it matches only a literal '.'
re.search(pattern, s)

In [None]:
re.findall(pattern, s)