## Regular Expressions

<img src="https://imgs.xkcd.com/comics/regular_expressions.png" width=400>

## RegEx Golf

<img src="https://imgs.xkcd.com/comics/regex_golf.png" width=600>

## Perl Problems

    Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. - Jamie Zawinski 
<img src="https://imgs.xkcd.com/comics/perl_problems.png" width=600>

## Regular Expressions Workflow in Python

* `import re`
* Use *raw strings* to define a pattern
* Use `re.match` or `re.search` to apply the pattern to a string

In [1]:
import re

In [2]:
m1 = re.match(r'ab', 'abc')
m1

<re.Match object; span=(0, 2), match='ab'>

### `match` returns `None` when there is no match

In [3]:
m2 = re.match(r'ab', 'acb')
m2

In [4]:
m2 is None

True

In [5]:
not m2 

True

#### `match` versus `search`

* `match` $\longrightarrow$ *match* only at the beginning of the string
* `search` $\longrightarrow$ *search* the whole string for the first match

In [6]:
re.match(r'ab', 'abc')

<re.Match object; span=(0, 2), match='ab'>

In [7]:
re.match(r'bc', 'abc')

In [8]:
re.search(r'ab', 'abc')

<re.Match object; span=(0, 2), match='ab'>

In [9]:
re.search(r'bc', 'abc')

<re.Match object; span=(1, 3), match='bc'>
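One way to see the difference, as a rough sketch: `re.match` behaves like `re.search` with the pattern anchored at the start of the string. The `\A` anchor and the variable names below are not part of the examples above and are used only for illustration.

```python
import re

text = 'abc'

# match only tries position 0, so 'bc' is not found
via_match = re.match(r'bc', text)

# search with \A (start-of-string anchor) behaves the same way
via_anchored_search = re.search(r'\Abc', text)

# an unanchored search scans forward and finds 'bc' at index 1
via_search = re.search(r'bc', text)

print(via_match, via_anchored_search, via_search.span())
# None None (1, 3)
```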

### The result of a `match`/`search` can be used in Boolean expressions

In [10]:
"Yes" if m1 else "No"

'Yes'

In [11]:
"Yes" if m2 else "No"

'No'

### Filter using `match` and `search` in a comprehension

In [12]:
strings = ('abcdefg','abcde', 'abc', 'cdefg')
strings

('abcdefg', 'abcde', 'abc', 'cdefg')

In [13]:
[s for s in strings if re.match(r'cd', s)]

['cdefg']

In [14]:
[s for s in strings if re.search(r'cd', s)]

['abcdefg', 'abcde', 'cdefg']

## Whitespace and escaped characters

* Whitespace includes spaces, tabs, and newlines
* Python uses escape characters: `"\t"`, `"\n"`

#### Use `"\n"` for newlines

In [15]:
"\n"

'\n'

In [16]:
len("\n")

1

In [17]:
print('\n')





In [18]:
print('a\nb')

a
b


#### Use `"\t"` for tab

In [19]:
"\t"

'\t'

In [20]:
len("\t")

1

In [21]:
print('\t')

	


In [22]:
print('a\tb')

a	b


## Why use `r"raw strings"` in `regex`

* Regular strings $\rightarrow$ `\` is for special characters: `'\n'`, `'\t'`
* In regular expressions, `\` is for
    * Escaping: e.g. `\.` vs. `.`
* Without a raw string, each backslash meant for the regex must be doubled
    * `'\\n'` to pass the regex escape `\n` (newline)
    * `'\\t'` to pass the regex escape `\t` (tab)

In [23]:
r'\n' # Raw strings let us write the regex escape \n without the extra \

'\\n'
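A small sketch of what the raw prefix changes. The word-boundary escape `\b` is not covered above and is used here only as an illustration, because Python and the regex engine read it differently; all variable names are hypothetical.

```python
import re

# A regular string turns \n into one newline character;
# a raw string keeps the backslash and the n as two characters.
lengths = (len('\n'), len(r'\n'))            # (1, 2)

# Both spellings still match a newline: the regex engine understands \n itself.
plain_hit = re.search('\n', 'a\nb')
raw_hit = re.search(r'\n', 'a\nb')

# \b shows the difference: '\b' is a backspace character,
# while r'\b' is the regex word-boundary assertion.
backspace_hit = re.search('\bcat', 'a cat')  # None
boundary_hit = re.search(r'\bcat', 'a cat')  # span (2, 5)

print(lengths, plain_hit.span(), raw_hit.span(), backspace_hit, boundary_hit.span())
```

Using raw strings for every pattern avoids having to remember which escapes Python would otherwise interpret first.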

## Important `match object` methods

<table class="docutils" border="1">
<colgroup>
<col width="29%">
<col width="71%">
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Method/Attribute</th>
<th class="head">Purpose</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">group()</span></code></td>
<td>Return the string matched by the RE</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">start()</span></code></td>
<td>Return the starting position of the match</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">end()</span></code></td>
<td>Return the ending position of the match</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">span()</span></code></td>
<td>Return a tuple containing the (start, end) positions of the match</td>
</tr>
</tbody>
</table>

In [24]:
m1

<re.Match object; span=(0, 2), match='ab'>

In [25]:
[m for m in dir(m1) if not m.startswith('__')]

['end',
 'endpos',
 'expand',
 'group',
 'groupdict',
 'groups',
 'lastgroup',
 'lastindex',
 'pos',
 're',
 'regs',
 'span',
 'start',
 'string']

In [26]:
m1.group()

'ab'

In [27]:
m1.start()

0

In [28]:
m1.end()

2

In [29]:
m1.span()

(0, 2)
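The positions returned by these methods are ordinary slice indices into the searched string. A minimal sketch, using a made-up string `'xxabyy'` and hypothetical variable names:

```python
import re

text = 'xxabyy'
m = re.search(r'ab', text)

# start() and end() are slice indices into the searched string
same_as_group = text[m.start():m.end()]    # 'ab'

# span() packs the same two numbers into a tuple
start, end = m.span()
before, after = text[:start], text[end:]   # 'xx', 'yy'

print(same_as_group, (start, end), before, after)
```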

## Be sure to check for `None`

In [30]:
m2 # example that DIDN'T match 

In [31]:
m2 is None # non-matches return None

True

In [32]:
m2.group() # Oh you silly 'NoneType' errors

AttributeError: 'NoneType' object has no attribute 'group'

## Dealing with `None`

1. Forget it (and the code crashes),
2. Filter on match/search, or
3. Use a conditional expression.

### Bad solution - Forget about `None` :(

In [33]:
{s:re.search(r'cde?', s).group() for s in strings}

AttributeError: 'NoneType' object has no attribute 'group'

### Solution 1 - Filter

In [34]:
{s:re.search(r'cde?', s).group() for s in strings if re.search(r'cde?', s)}

{'abcdefg': 'cde', 'abcde': 'cde', 'cdefg': 'cde'}

### Solution 2 - Conditional expression

Always check for `None`

In [35]:
m2.group() if m2 else None

In [36]:
{s:re.search(r'cde?', s).group() if re.search(r'cde?', s) else None for s in strings}

{'abcdefg': 'cde', 'abcde': 'cde', 'abc': None, 'cdefg': 'cde'}

## DRY principle - Ways to remove the replicated code

### Option 1 - Compile the pattern

In [37]:
pat = re.compile(r'([abd]+)')

In [38]:
# With filter
{s:pat.search(s).group() for s in strings if pat.search(s) }

{'abcdefg': 'ab', 'abcde': 'ab', 'abc': 'ab', 'cdefg': 'd'}

In [39]:
# With conditional expression
{s:pat.search(s).group() if pat.search(s) else None for s in strings}

{'abcdefg': 'ab', 'abcde': 'ab', 'abc': 'ab', 'cdefg': 'd'}

### Option 2 - Abstract with functions

In [40]:
pat = r'([abd]+)'

search_pat = lambda s: re.search(pat, s)  # reuse pat instead of repeating the pattern

get_group = lambda s: search_pat(s).group()

In [41]:
{s:get_group(s) for s in strings if search_pat(s)}

{'abcdefg': 'ab', 'abcde': 'ab', 'abc': 'ab', 'cdefg': 'd'}

In [42]:
maybe_get_group = lambda m: m.group() if m else None

In [43]:
{s:maybe_get_group(search_pat(s)) for s in strings}

{'abcdefg': 'ab', 'abcde': 'ab', 'abc': 'ab', 'cdefg': 'd'}

### EVEN BETTER! Pipeable functions

In [44]:
from composable import pipeable

pat = r'([abd]+)'
search_pat = pipeable(lambda s: re.search(pat, s))  # reuse pat instead of repeating the pattern

get_group = pipeable(lambda m: m.group())

maybe_get_group = pipeable(lambda m: m.group() if m else None)

In [45]:
{s: s >> search_pat >> get_group for s in strings if s >> search_pat}

{'abcdefg': 'ab', 'abcde': 'ab', 'abc': 'ab', 'cdefg': 'd'}

In [46]:
{s:s >> search_pat >> maybe_get_group for s in strings}

{'abcdefg': 'ab', 'abcde': 'ab', 'abc': 'ab', 'cdefg': 'd'}
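If the `composable` package is not available, a standard-library alternative is an assignment expression (the `:=` operator, Python 3.8+). This is a different technique from the pipeable functions above, sketched here with the `cde?` pattern from the earlier `None` examples; the names `cde`, `filtered`, and `with_none` are hypothetical.

```python
import re

strings = ('abcdefg', 'abcde', 'abc', 'cdefg')
cde = re.compile(r'cde?')

# Filter: the search runs once per string and the match is reused as m
filtered = {s: m.group() for s in strings if (m := cde.search(s))}

# Conditional expression: keep every key, with None for non-matches
with_none = {s: (m.group() if (m := cde.search(s)) else None) for s in strings}

print(filtered)
print(with_none)
```

Each string is searched only once, so the pattern and the search call are not repeated.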

## A better `str.replace`

* We often chain many `replace` calls
* Example `s.replace('(', '').replace(')','').replace(':', '')`
* We can use `re.sub` to simplify.

In [47]:
s = "The string (has) some: things in (it)"
s.replace('(', '').replace(')','').replace(':', '')

'The string has some things in it'

In [48]:
re.sub(r"[():]", '', s)

'The string has some things in it'

## Substitutions with a compiled RegEx

1. Compile a pattern
2. Use `pat.sub(new_substr, s)`

In [49]:
paren_or_colon = re.compile(r"[():]")
paren_or_colon.sub('', s)

'The string has some things in it'
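A compiled pattern pays off when the same substitution is applied to several strings. The sketch below assumes a small made-up list `lines` and also shows a non-empty replacement string.

```python
import re

paren_or_colon = re.compile(r"[():]")

lines = ["The string (has) some: things in (it)", "key: (value)"]

# Reuse the compiled pattern across many strings
cleaned = [paren_or_colon.sub('', line) for line in lines]

# The replacement does not have to be empty
marked = paren_or_colon.sub('_', "key: (value)")

print(cleaned)  # ['The string has some things in it', 'key value']
print(marked)   # key_ _value_
```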

## <font color="red"> Exercise 3.0.3 </font>

**Task:** Write and test a function that uses `re.sub` to remove all punctuation from a string.  **Hint:** Use the `punctuation` variable from the `string` module.

In [57]:
from string import punctuation

practice1 = 'He110!! W0rld>>>'

In [56]:
def remove_punc(s):
    """
    Removes punctuation from a string.

    Args:
        s: a string, input string

    Returns:
        The string with all punctuation removed
    """

    # re.escape keeps characters such as \, ], ^ and - literal inside the [...] set
    return re.sub(f"[{re.escape(punctuation)}]", '', s)

# The f-string inserts the value of punctuation into the pattern argument of
# re.sub(pattern, replacement, string).
# [] defines a character set, which matches any one of the characters inside it.

In [61]:
def test_remove_punc():
    assert remove_punc('123!@#') == '123'
    assert remove_punc('H3!!0 W..ld') == 'H30 Wld'
    assert remove_punc('[{fstring}]') == 'fstring'
    assert remove_punc('10 >= 10') == '10  10'
    assert remove_punc('tasha_koehl') == 'tashakoehl'

test_remove_punc()
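One caveat when building a character set directly from `punctuation`: the string contains `\` followed by `]`, which the regex engine reads as an escaped `]`, so the backslash itself is left out of the set. A sketch of the difference, using `re.escape` and a made-up `sample` string:

```python
import re
from string import punctuation

sample = r'a\b-c]d^e'

# Unescaped: inside [...] the sequence \] reads as an escaped ], so the
# backslash never becomes a member of the set and survives the substitution.
naive = re.sub(f"[{punctuation}]", '', sample)

# Escaped: re.escape makes every special character literal, backslash included.
escaped = re.sub(f"[{re.escape(punctuation)}]", '', sample)

print(naive)    # a\bcde
print(escaped)  # abcde
```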