## documentation

* This presentation: [https://github.com/wolf/re-presentation](https://github.com/wolf/re-presentation)
* Online regex tester and debugger [regular expressions 101](https://regex101.com)
* [re module documentation](https://docs.python.org/3.7/library/re.html)
* third party [regex module homepage](https://bitbucket.org/mrabarnett/mrab-regex)
* by Jeffrey Friedl: ["Mastering Regular Expressions" First Edition](https://www.amazon.com/Mastering-Regular-Expressions-Techniques-Handbooks/dp/1565922573) (covers Python) [Third Edition](https://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124) (does *not* cover Python)

In [1]:
import re

## regular expressions themselves

### what is a regular expression?

A regular expression is a formal description of a set of strings, together with a well-known collection of rules for deciding whether any given string is *in* that set.  If a string is in the set, the regular expression is said to *match* that string.  The formal description is often referred to as the *pattern*.  The `re` module has a function to test whether a regular expression matches a string: `re.match(pattern, string)`.  It has a couple of others as well: `re.search` and `re.fullmatch`.

This document is about regular expressions (sometimes called regexes) and the `re` module from Python 3.7.  Most of it applies to earlier versions of Python but your mileage may vary.

### where will you use regular expressions?

* In your favorite editor for find and replace
* In Python code (or any other language, really)
* In `grep`, `sed`, and `awk`
* When writing compilers, with tools such as `flex`, `lex`, and `re2c`
* In full-text database searches
* When trying to understand other people's code
* To impress your friends at parties

In [2]:
from termcolor import colored

def demo(pattern, match_against, fn=None, show_location=False):
    """demo is built around re.match, but applied to a list of strings
       instead of a single string; and it prints a message for each, appropriate
       to a demonstration.  It uses colors and bold to make things nice.
       It's really *mostly* about printing.
    """
    compiled_pattern = re.compile(pattern)  # don't really need to do this, but it's good form
    if isinstance(match_against, str):  # OK, it takes a list, but if you give it a single string it still works
        match_against = [ match_against ]
    if fn is None:
        fn = re.match
    else:
        show_location = True
        print(f'using re.{fn.__name__}, the pattern ', end='')

    print(colored(f'{pattern}', 'blue', 'on_yellow', attrs=['bold']))
    for string in match_against:
        match = fn(pattern=compiled_pattern, string=string)
        if match:
            location_info = ''
            r = match.span()
            if show_location:
                location_info = f'at {r[0]}-{r[1]} of '
            print(f'  matches {location_info}', end='')
            
            print(colored(f'{string[:r[0]]}', 'green'), end='')
            print(colored(f'{string[r[0]:r[1]]}', 'green', attrs=['bold']), end='')
            print(colored(f'{string[r[1]:]}', 'green'))
        else:
            print(f'  does *not* match ', end='')
            print(colored(f'{string}', 'red'))
    print()

As we progress, the patterns will look more and more complicated.  Don't be intimidated!

In [3]:
# Simple characters like 'c', 'a', or 't', match themselves
# Note that I specified the pattern as a raw string... make that a habit for yourself
if re.match(pattern=r'cat', string='cat'):
    print('The pattern "cat" matches the string "cat"')

demo(pattern=r'cat', match_against='dog')

The pattern "cat" matches the string "cat"
cat
  does *not* match dog



`re.match` actually only tests that the string *starts* with a match.  Later we'll talk about `re.search` and `re.fullmatch`.

In [4]:
demo(pattern=r'cat', match_against='catch')

cat
  matches catch



#### character classes or sets

In [5]:
# Using a simple set: [aou]
demo(pattern=r'c[aou]t', match_against=['cat', 'cote', 'cute', 'city'])

c[aou]t
  matches cat
  matches cote
  matches cute
  does *not* match city



In [6]:
# Inverting a set: [^y]
demo(r'c[^y]t', ['cat', 'cote', 'cytoplasm', 'cctv'])

c[^y]t
  matches cat
  matches cote
  does *not* match cytoplasm
  matches cctv



In [7]:
# Using ranges: [a-f0-9]
demo(r'0x[a-f0-9]', ['0x2', '0xc', '0xq'])

0x[a-f0-9]
  matches 0x2
  matches 0xc
  does *not* match 0xq



In [8]:
# Using shortcuts: \d == [0-9] (for ASCII); . == [^\n] (unless re.DOTALL is set)
demo(r'\d\d\d-\d\d\d\d', ['555-12q2', '411', '867-5309'])
demo(r'\d\d\d-\d\d.\d', ['555-12q2', '411', '867-5309'])

\d\d\d-\d\d\d\d
  does *not* match 555-12q2
  does *not* match 411
  matches 867-5309

\d\d\d-\d\d.\d
  matches 555-12q2
  does *not* match 411
  matches 867-5309



#### repetition

In [9]:
# {min,max} (no space after the comma); note that \w == [a-zA-Z0-9_] (at least for ASCII)
demo(r'\d{3}-\d{4}', ['555-12q2', '411', '867-5309'])
demo(r'=\w{3,5}=', ['=cat=', '=catch=', '=dogs=', '=catchy=', '=my='])

\d{3}-\d{4}
  does *not* match 555-12q2
  does *not* match 411
  matches 867-5309

=\w{3,5}=
  matches =cat=
  matches =catch=
  matches =dogs=
  does *not* match =catchy=
  does *not* match =my=



In [10]:
# shortcuts for repetition
# ?, +, and *; note that \s is whitespace
# ? == {0,1}, + == {1,}, * == {0,}
demo(r'\d+\s*-?\s*\d*', ['555-\t1212', '411', '12345-abc'], show_location=True)

\d+\s*-?\s*\d*
  matches at 0-9 of 555-	1212
  matches at 0-3 of 411
  matches at 0-6 of 12345-abc



#### alternation

In [11]:
# | binds more loosely than any other operator, so cat|dog does what it looks like
demo(r'cat|dog', ['catchy', 'dog-lover', 'apple pie'])

cat|dog
  matches catchy
  matches dog-lover
  does *not* match apple pie



#### groups

In [12]:
# the default kind of group, capturing, is specified with simple parens
demo(r'\d{3}(-\d{4})?', ['555-1212', '411'])

# When you've captured a group, you can refer back to it later with a backslash and its number
demo(r'(\w+) \1', ['abc def', 'abc abc'])
demo(r'href=([\'\"]).*?\1', ['href="hello"', 'href=\'goodbye\'', 'href=\'abc"'])
# note the non-greedy '.*?' -- that gets the shortest matching string instead of the longest

\d{3}(-\d{4})?
  matches 555-1212
  matches 411

(\w+) \1
  does *not* match abc def
  matches abc abc

href=([\'\"]).*?\1
  matches href="hello"
  matches href='goodbye'
  does *not* match href='abc"
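
The difference between the non-greedy `.*?` and the greedy `.*` is easiest to see on a string with more than one closing quote. The following is a minimal sketch; the sample string is made up for illustration.

```python
import re

# Hypothetical sample: two quoted attributes on one line
text = 'href="one" title="two"'

greedy = re.match(r'href=(["\']).*\1', text)   # .* grabs as much as it can
lazy = re.match(r'href=(["\']).*?\1', text)    # .*? stops at the first closing quote

print(greedy.group(0))  # href="one" title="two"
print(lazy.group(0))    # href="one"
```

Both patterns match; they only disagree on where the match ends.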



In [13]:
# special groups start with a '?' just inside the left paren
# (?:...) is a non-capturing group
demo(r'(?:\d+-)?(\d{4})', ['555-1212', '4501'])

# 'x' is the verbose flag, this is a flag-setting group
# verbose mode lets me break apart the regex, even onto multiple lines
demo(r'''(?x)  # verbose mode
    \d{3}          # prefix
    ( - \d{4} )?   # optional''', ['555-1212', '411'])

(?:\d+-)?(\d{4})
  matches 555-1212
  matches 4501

(?x)  # verbose mode
    \d{3}          # prefix
    ( - \d{4} )?   # optional
  matches 555-1212
  matches 411



In [14]:
# setting flags for the whole pattern, or for just a portion of the pattern
# the 'i' flag makes the pattern case-insensitive
demo(r'(?i)abcXYZ', ['abcxyz', 'AbCXYz'])
demo(r'(?i:abc)XYZ', ['abcxyz', 'abcXYZ', 'ABCXYZ'])

(?i)abcXYZ
  matches abcxyz
  matches AbCXYz

(?i:abc)XYZ
  does *not* match abcxyz
  matches abcXYZ
  matches ABCXYZ



In [15]:
# assertions: positive and negative look-ahead and look-behind (specialized groups)
# `re.match`: the pattern must match at the beginning of the string;
# `re.search`: the pattern can match anywhere inside the string
demo(r'Isaac(?=\s+Asimov)', ['Isaac Newton', 'Isaac Asimov'], show_location=True)
demo(r'Isaac(?!\s+Newton)', ['Isaac Newton', 'Isaac Asimov'], show_location=True)
demo(r'(?<!particle)-physics', ['nuclear-physics', 'particle-physics'], fn=re.search)
demo(r'(?<=nuclear)-physics', ['nuclear-physics', 'particle-physics'], fn=re.search)

Isaac(?=\s+Asimov)
  does *not* match Isaac Newton
  matches at 0-5 of Isaac Asimov

Isaac(?!\s+Newton)
  does *not* match Isaac Newton
  matches at 0-5 of Isaac Asimov

using re.search, the pattern (?<!particle)-physics
  matches at 7-15 of nuclear-physics
  does *not* match particle-physics

using re.search, the pattern (?<=nuclear)-physics
  matches at 7-15 of nuclear-physics
  does *not* match particle-physics



#### special matches

Look-ahead and look-behind are a special form of match instruction called an 'assertion'.  They match zero characters, but still influence where the actual match can happen.  Here are a couple of other zero-length assertions:

In [16]:
# beginning or end of a word, or not (zero-length)
demo(r'cat\b', ['cat', 'catchy', 'cat-call'], show_location=True)
demo(r'cat\B', ['cat', 'catchy', 'cat-call'], show_location=True)

# beginning or end of the string (zero-length)
demo(r'^abc', ['abcb', 'bbabc'], fn=re.search)
demo(r'abc$', ['abcb', 'bbabc'], fn=re.search)

cat\b
  matches at 0-3 of cat
  does *not* match catchy
  matches at 0-3 of cat-call

cat\B
  does *not* match cat
  matches at 0-3 of catchy
  does *not* match cat-call

using re.search, the pattern ^abc
  matches at 0-3 of abcb
  does *not* match bbabc

using re.search, the pattern abc$
  does *not* match abcb
  matches at 2-5 of bbabc



When the string to search contains several lines, pass the flag `re.MULTILINE`.  Then `^` and `$` match at the beginning and end of every line, and you use `\A` to match only at the beginning of the entire string and `\Z` to match only at the end.
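
As a small sketch of that behavior, the cell below runs the same anchored patterns with and without `re.MULTILINE`. The two-line sample text is made up for illustration.

```python
import re

# Hypothetical two-line sample
text = 'first line\nsecond line'

# Without re.MULTILINE, ^ only matches at the very start of the string
print(re.findall(r'^\w+', text))                 # ['first']

# With re.MULTILINE, ^ also matches right after each newline
print(re.findall(r'^\w+', text, re.MULTILINE))   # ['first', 'second']

# \A and \Z ignore the flag: they anchor to the whole string
print(re.findall(r'\A\w+', text, re.MULTILINE))  # ['first']
print(re.findall(r'\w+\Z', text, re.MULTILINE))  # ['line']
```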

## the `re` module

We've already discussed pattern strings and the strings against which we are matching.  Let's talk about the other nouns in the `re` world: compiled patterns, match objects, and the `re` module itself.  To get a compiled pattern, call `re.compile(pattern_string, flags=0)`.  A compiled pattern has method calls that are mirrored in the `re` module:

* `match`, `search`, `fullmatch`
* `split`
* `findall`, `finditer`
* `sub`, `subn`

The `re` versions take a compiled pattern *or* a pattern-string, plus optional flags; the compiled pattern versions take optional start and end indices into the string instead.  So for example:

`re.match(pattern, string, flags=0)` [doc](https://docs.python.org/3.7/library/re.html#re.match) vs.<br/>
`pattern.match(string[, pos[, endpos]])` [doc](https://docs.python.org/3.7/library/re.html#re.Pattern.match)
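
A short sketch of the two call styles side by side. The name `word` is only an illustrative variable, not something from the `re` module.

```python
import re

word = re.compile(r'cat')

# Module-level call: a pattern string (or compiled pattern) plus optional flags
print(re.match(r'cat', 'alley-cat'))      # None

# Compiled-pattern call: pos says where the match attempt starts
m = word.match('alley-cat', 6)
print(m.span())                           # (6, 9)

# endpos limits how far into the string the pattern may look
print(word.search('alley-cat', 0, 8))     # None
```

Note that the span is still reported relative to the whole string, not relative to `pos`.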

In [17]:
# `match` vs. `search` vs. `fullmatch`
# `match` and `search` you know; `fullmatch` must match the entire string
demo(r'cat', ['catchy', 'cat', 'alley-cat'], fn=re.match)
demo(r'cat', ['catchy', 'cat', 'alley-cat'], fn=re.search)
demo(r'cat', ['catchy', 'cat', 'alley-cat'], fn=re.fullmatch)

using re.match, the pattern cat
  matches at 0-3 of catchy
  matches at 0-3 of cat
  does *not* match alley-cat

using re.search, the pattern cat
  matches at 0-3 of catchy
  matches at 0-3 of cat
  matches at 6-9 of alley-cat

using re.fullmatch, the pattern cat
  does *not* match catchy
  matches at 0-3 of cat
  does *not* match alley-cat



In [18]:
# split with a regex can do things regular str.split cannot
row = 'a,  b;\tc;d, e,          f'
re.split(r'[,;]\s*', row)

['a', 'b', 'c', 'd', 'e', 'f']

In [19]:
# or clean the data with re.sub
re.sub(r'[,;]\s*', ', ', row)

'a, b, c, d, e, f'

In [20]:
# just for completeness' sake, we'll look at re.finditer and re.findall
for i, match in enumerate(re.finditer(r'[,;]\s*', row), 1):
    print(f'match {i} spans {match.span()}')

re.findall(r'\w+', row)

match 1 spans (1, 4)
match 2 spans (5, 7)
match 3 spans (8, 9)
match 4 spans (10, 12)
match 5 spans (13, 24)


['a', 'b', 'c', 'd', 'e', 'f']

#### match objects

Matching functions will return a match object (success) or `None` (failure).  A simple `if` is all you need to test for success.  Save the match object to access some details about the match.  For instance, remember we talked about "capturing" groups.  This is where the data in a captured group is saved.

In [21]:
match = re.match(r'(\d{3})-(\d{4})', '867-5309')
if match:
    print(f'group 1: {match.group(1)}')
    print(f'group 2: {match[2]}')
    print(f'try a tuple: {match.group(1, 2)}')
    print(f'group 1 matched the range {match.span(1)}')
    print(f'group 2 matched the range ({match.start(2)}, {match.end(2)})')
    print(f'original pattern: {match.re}')

group 1: 867
group 2: 5309
try a tuple: ('867', '5309')
group 1 matched the range (0, 3)
group 2 matched the range (4, 8)
original pattern: re.compile('(\\d{3})-(\\d{4})')
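
Match objects can also hand back all the groups at once. The sketch below reuses the phone-number pattern but gives the groups names; the names `prefix` and `line` are chosen here purely for illustration.

```python
import re

m = re.match(r'(?P<prefix>\d{3})-(?P<line>\d{4})', '867-5309')

print(m.groups())      # ('867', '5309')
print(m.groupdict())   # {'prefix': '867', 'line': '5309'}
print(m['line'])       # 5309
print(m.group(0))      # 867-5309  (group 0 is the whole match)
```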


## tips and tricks

* verbose mode
* to compile or not
* using a function argument in `sub` and `subn`
* `re.escape`
* `match.expand`
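
Three of the items above fit into one short cell. This is only a sketch: the helper `double` and the sample strings are invented for the demonstration.

```python
import re

# 1. A function as the replacement in sub/subn: it receives each match object
#    and returns the replacement text.
def double(match):
    return str(int(match.group(0)) * 2)

print(re.sub(r'\d+', double, '3 cats, 12 dogs'))    # 6 cats, 24 dogs
print(re.subn(r'\d+', double, '3 cats, 12 dogs'))   # ('6 cats, 24 dogs', 2)

# 2. re.escape: make arbitrary text safe to embed in a pattern
literal = 'a.b'
print(bool(re.fullmatch(literal, 'axb')))              # True: '.' is a wildcard
print(bool(re.fullmatch(re.escape(literal), 'axb')))   # False: '.' is now literal
print(bool(re.fullmatch(re.escape(literal), 'a.b')))   # True

# 3. match.expand: fill a template from the groups of a match
m = re.match(r'(\d{3})-(\d{4})', '867-5309')
print(m.expand(r'prefix=\1 line=\2'))               # prefix=867 line=5309
```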

#### Simple string operations vs. `re` calls

In [22]:
%timeit 'cat' in 'dog-catcher'

36.7 ns ± 2.47 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [23]:
%timeit re.search('cat', 'dog-catcher')

639 ns ± 17.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [24]:
%timeit 'catchy'.startswith('cat')

126 ns ± 4.28 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [25]:
%timeit re.match('cat', 'catchy')

690 ns ± 10.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [26]:
%timeit 'cat' == 'cat'

23.7 ns ± 1.13 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [27]:
%timeit re.fullmatch('cat', 'cat')

742 ns ± 41.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


The conclusion: if your regular expression is a simple string, plain string operations are much faster (roughly 5 to 30 times in the timings above), simpler, and easier to read.  If it's a file-system path, you probably want to be working mostly with `os.path`, or possibly `glob`.  Although I parse URLs below, you probably *really* want to be using `urllib.parse`.

## common pitfalls

* failure to make your pattern a raw string
* not understanding `re.MULTILINE` (which we'll use below) and `re.DOTALL`
* mixing `str` with `bytes`
* trying to [parse HTML](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)
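
The first three pitfalls are easy to reproduce. A minimal sketch, with sample strings made up for the purpose:

```python
import re

# Raw vs. non-raw: in a normal string literal '\b' is a backspace character,
# so the regex engine never sees the word-boundary escape.
print(re.search('\bcat\b', 'a cat'))                  # None
print(re.search(r'\bcat\b', 'a cat').span())          # (2, 5)

# re.DOTALL lets '.' match a newline as well
print(re.match(r'a.b', 'a\nb'))                       # None
print(re.match(r'a.b', 'a\nb', re.DOTALL).span())     # (0, 3)

# str patterns go with str data, bytes patterns with bytes data
print(re.match(rb'\d+', b'123abc').group(0))          # b'123'
try:
    re.match(r'\d+', b'123abc')
except TypeError as err:
    print('TypeError:', err)
```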

## Real-World(ish) Examples

### parsing URLs

In [28]:
# (?P<name>...) is a special group: named capturing group
# and we'll use verbose mode to make a hard pattern a little easier
# but really you should be using urllib.parse

url_pattern_str = r'''
    (?P<protocol>  # URL protocol, required and captured
      http s?      # the 's' in 'https' is optional
    )
    ://            # required, but not captured
    (?P<host>      # host, required _and_ captured
      [^/:]+       # ...stops at the first slash or colon
    )
    (?:            # an optional group for the port
      :            # ...so we don't capture the colon
      (?P<port>    # optional (because of the containing group), but captured
          \d+      # the port is all digits
      )
    )?
    (?P<path>      # path, optional but captured
      /[^?]*       # ...stops at the first question mark
    )?
    (?:            # an optional group for the query
      \?           # ...so we don't capture the '?' that starts a query
      (?P<query>
          .+       # everything _after_ the question mark is the query itself
      )
    )?
'''
url_pattern = re.compile(url_pattern_str, re.VERBOSE)

urls = [
    'https://google.com',
    'https://google.com/',
    'https://www.kidsa-z.com/main/Activity/reading',
    'http://local.kidsa-z.com:8088/main/Activity/reading?module=razkids',
    'https://github.com/wolf/re-presentation.git',
    'https://www.amazon.com/Mastering-Regular-Expressions-Techniques-Handbooks/dp/1565922573',
]

for url in urls:
    match = url_pattern.match(url)

    for group_name in url_pattern.groupindex.keys():
        if match[group_name]:
            print(f'{group_name:>8}:', colored(f'{match[group_name]}', 'green'))
    print()

protocol: https
    host: google.com

protocol: https
    host: google.com
    path: /

protocol: https
    host: www.kidsa-z.com
    path: /main/Activity/reading

protocol: http
    host: local.kidsa-z.com
    port: 8088
    path: /main/Activity/reading
   query: module=razkids

protocol: https
    host: github.com
    path: /wolf/re-presentation.git

protocol: https
    host: www.amazon.com
    path: /Mastering-Regular-Expressions-Techniques-Handbooks/dp/1565922573
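
For comparison, here is roughly what the standard-library route looks like for one of the URLs above. It is a sketch using `urllib.parse.urlsplit`, not a drop-in replacement for the verbose pattern.

```python
from urllib.parse import urlsplit

parts = urlsplit('http://local.kidsa-z.com:8088/main/Activity/reading?module=razkids')

print(parts.scheme)    # http
print(parts.hostname)  # local.kidsa-z.com
print(parts.port)      # 8088
print(parts.path)      # /main/Activity/reading
print(parts.query)     # module=razkids
```

Unlike the regex, `urlsplit` returns the port as an `int`.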



### extracting flags from a status page (abridged)

In [29]:
import os.path

# I've cleaned partial-web-page-dump.txt of passwords, but I still don't want
# to check it into the repository.  I'll just make this cell take a pass if
# you don't have it.  Which you don't.

if os.path.exists('partial-web-page-dump.txt'):
    with open('partial-web-page-dump.txt', 'r') as f:
        text = f.read()

    # Raw strings, so that '\(' reaches the regex engine exactly as written
    enabled_features_pattern = re.compile(r'enabledFeatures[^(]+\(([^)]*)\)', re.MULTILINE)
    enabled_features = enabled_features_pattern.search(text)

    # The switch itself is the thing inside the single quotes
    switch_pattern = re.compile(r"'(.+)' => true", re.MULTILINE)
    # Only look inside the span matched by enabled_features_pattern
    switches = set(switch_pattern.findall(text, *enabled_features.span()))

    print('\n'.join(sorted(switches)))

ACCESS_CONTROL_USAGE_REPORTING_EMAIL
ACCOUNTS_IMPORT_TEACHERS
ACCOUNTS_REGISTER_USER_CHANGES
ACTIVITY_TESTLET_RENAME
ADDITIONAL_PARENT_LETTER_LANGUAGES
ADMIN_DASHBOARDS
ADMIN_DASHBOARDS_REWRITE
AUTHORING_TOOL
AVATAR_BUILDER_REDESIGN
BATCHADD_SISID
BENCHMARK_BOOK_REVISIONS_PROJECT
BOOK_REVIEW_FAVORITES
BRAINHONEY_LINK_ACCOUNTS
BRAINHONEY_LINK_HEADSPROUT
BRAINHONEY_LINK_KIDSAZ
BRAINHONEY_LINK_RAZ
BRAINHONEY_LINK_RAZKIDS
BRAINHONEY_LINK_SAZ
BRAINHONEY_LINK_VOCAB
BRAINHONEY_LINK_WAZ
CLASSROOM_REPORTS_ANGULAR_UI_ROUTING
CLEAR_SESSION_FOR_SSO_AUTHENTICATION
CMS_STICKER_BOOK
CONSTRUCTED_RESPONSE_TOGGLE
CREATE_FLAG_CATEGORIES
CSI_SOLR_SEARCH
CSI_SSO_AUTHENTICATION
DEMO_ROSTER
DEMO_ROSTER_ACTIVITY
DESK_CUSTOMER_SERVICE
DESK_LIVE_CHAT_SERVICE
DISABLE_ACT_ON_EMAIL_CREATE
DISABLE_ACT_ON_EMAIL_SEND
EL_MULTI_LINEITEM_ORDER_API
EL_SFC_LINE_ITEMS
EL_SFC_TAX
ENABLE_MOST_RECENTLY_AVAILABLE_DIST_RPT_DATE
EUROPEAN_CURRICULUM_STANDARDS
FFMPEG_2_0_2
FF_CREATE_NOTES
FF_HTML5_CLIENTS
FF_NEW_DEFINITIONS
FF_NOT