In [1]:
import re

## regular expressions themselves

### what is a regular expression?

A regular expression is a formal description of a set of strings, together with a well-defined collection of rules for deciding whether any given string is *in* that set.  If a string is in the set, the regular expression is said to *match* that string.  The formal description is often referred to as the *pattern*.  The `re` module has a function to test if a regular expression matches a string: `re.match(pattern, string)`.

In [2]:
def demo_re_match(pattern, match_against, show_location=False):
    """demo_re_match is built around re.match, but applied to a list of strings
       instead of a single string; and it prints a message for each, appropriate
       to a demonstration.
    """
    compiled_pattern = re.compile(pattern)  # don't really need to do this, but it's good form
    if isinstance(match_against, str):  # OK, it takes a list, but if you give it a single string it still works
        match_against = [ match_against ]

    print(f'The pattern "{pattern}"...')
    for string in match_against:
        match = re.match(pattern=compiled_pattern, string=string)
        if match:
            location_info = ''
            if show_location:
                r = match.span()
                location_info = f'at {r[0]}-{r[1]} of '
            print(f'  ...matches {location_info}"{string}"')
        else:
            print(f'  ...does *not* match "{string}"')
    print()

As we progress, the patterns will look more and more complicated.  Don't be intimidated!

In [3]:
# Simple characters like 'c', 'a', or 't' match themselves
# Note that I specified the pattern as a raw string... make that a habit for yourself
if re.match(pattern=r'cat', string='cat'):
    print('The pattern "cat" matches the string "cat"')

demo_re_match(pattern=r'cat', match_against='dog')

The pattern "cat" matches the string "cat"
The pattern "cat"...
  ...does *not* match "dog"



`re.match` actually only tests that the string *starts* with a match.

In [4]:
demo_re_match(pattern=r'cat', match_against='catch')

The pattern "cat"...
  ...matches "catch"
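
To make the "starts with" behavior concrete, here is a small sketch comparing `re.match` with its siblings `re.search` and `re.fullmatch`.  The sample strings are my own, chosen only for illustration.

```python
import re

# Hypothetical sample strings, chosen only for illustration.
samples = ['cat', 'catch', 'tomcat']

results = {}
for s in samples:
    results[s] = (
        bool(re.match(r'cat', s)),      # anchored at the start of the string only
        bool(re.search(r'cat', s)),     # may match anywhere in the string
        bool(re.fullmatch(r'cat', s)),  # must cover the whole string
    )

for s, (matched, searched, fullmatched) in results.items():
    print(f'{s!r}: match={matched} search={searched} fullmatch={fullmatched}')
```

Only `'cat'` passes all three; `'catch'` fails `fullmatch`, and `'tomcat'` succeeds only with `search`.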



#### character sets 

In [6]:
# Using a simple set: [aou]
demo_re_match(pattern=r'c[aou]t', match_against=['cat', 'cote', 'cute', 'city'])

The pattern "c[aou]t"...
  ...matches "cat"
  ...matches "cote"
  ...matches "cute"
  ...does *not* match "city"



In [7]:
# Inverting a set: [^y]
demo_re_match(r'c[^y]t', ['cat', 'cote', 'cytoplasm', 'cctv'])

The pattern "c[^y]t"...
  ...matches "cat"
  ...matches "cote"
  ...does *not* match "cytoplasm"
  ...matches "cctv"



In [8]:
# Using ranges: [a-f0-9]
demo_re_match(r'0x[a-f0-9]', ['0x2', '0xc', '0xq'])

The pattern "0x[a-f0-9]"...
  ...matches "0x2"
  ...matches "0xc"
  ...does *not* match "0xq"



In [9]:
# Using a shortcut: \d == [0-9] (at least for ASCII)
demo_re_match(r'\d\d\d-\d\d\d\d', ['555-12q2', '411', '867-5309'])

The pattern "\d\d\d-\d\d\d\d"...
  ...does *not* match "555-12q2"
  ...does *not* match "411"
  ...matches "867-5309"



#### repetition

In [10]:
# {min,max} (no space inside the braces); note that \w == [a-zA-Z0-9_] (at least for ASCII)
demo_re_match(r'\d{3}-\d{4}', ['555-12q2', '411', '867-5309'])
demo_re_match(r'=\w{3,5}=', ['=cat=', '=catch=', '=dogs=', '=catchy=', '=my='])

The pattern "\d{3}-\d{4}"...
  ...does *not* match "555-12q2"
  ...does *not* match "411"
  ...matches "867-5309"

The pattern "=\w{3,5}="...
  ...matches "=cat="
  ...matches "=catch="
  ...matches "=dogs="
  ...does *not* match "=catchy="
  ...does *not* match "=my="



In [11]:
# ?, +, *
# ? == {0,1}, + == {1,}, * == {0,}
demo_re_match(r'\d+\s*-?\s*\d*', ['555-\t1212', '411'])

The pattern "\d+\s*-?\s*\d*"...
  ...matches "555-	1212"
  ...matches "411"



#### alternation

In [12]:
# | binds more loosely than the other operators, so cat|dog does what it looks like
demo_re_match(r'cat|dog', ['catchy', 'dog-lover', 'apple pie'])

The pattern "cat|dog"...
  ...matches "catchy"
  ...matches "dog-lover"
  ...does *not* match "apple pie"
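
Because `|` binds so loosely, it splits the *whole* pattern unless you wrap it in a group.  The following is a minimal sketch of that difference; the patterns and the word list are my own, not part of the demos above.

```python
import re

# Without a group, the alternation splits the whole pattern: 'gra' OR 'ey'.
loose = re.compile(r'gra|ey')
# With a (non-capturing) group, only the vowel alternates: 'gray' OR 'grey'.
tight = re.compile(r'gr(?:a|e)y')

# Hypothetical word list, chosen only for illustration.
words = ['gray', 'grey', 'graph', 'eye']
loose_hits = [w for w in words if loose.match(w)]
tight_hits = [w for w in words if tight.match(w)]

print(loose_hits)  # ['gray', 'graph', 'eye']
print(tight_hits)  # ['gray', 'grey']
```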



#### groups

In [13]:
# the default kind of group (capturing)
demo_re_match(r'\d{3}(-\d{4})?', ['555-1212', '411'])

The pattern "\d{3}(-\d{4})?"...
  ...matches "555-1212"
  ...matches "411"



In [14]:
# special groups start with a '?' just inside the left paren
# this one sets the verbose flag
demo_re_match(r'''(?x)  # verbose mode
    \d{3}               # prefix
    ( - \d{4} )?        # optional''', ['555-1212', '411'])

The pattern "(?x)  # verbose mode
    \d{3}               # prefix
    ( - \d{4} )?        # optional"...
  ...matches "555-1212"
  ...matches "411"



In [15]:
# setting flags for the whole pattern, or for just a portion of the pattern
demo_re_match(r'(?i)abcXYZ', ['abcxyz', 'AbCXYz'])
demo_re_match(r'(?i:abc)XYZ', ['abcxyz', 'abcXYZ', 'ABCXYZ'])

The pattern "(?i)abcXYZ"...
  ...matches "abcxyz"
  ...matches "AbCXYz"

The pattern "(?i:abc)XYZ"...
  ...does *not* match "abcxyz"
  ...matches "abcXYZ"
  ...matches "ABCXYZ"



In [16]:
# lookahead assertions: (?=...) is positive, (?!...) is negative; both are zero-length
demo_re_match(r'Isaac(?=\s+Asimov)', ['Isaac Newton', 'Isaac Asimov'], show_location=True)
demo_re_match(r'Isaac(?!\s+Newton)', ['Isaac Newton', 'Isaac Asimov'], show_location=True)

The pattern "Isaac(?=\s+Asimov)"...
  ...does *not* match "Isaac Newton"
  ...matches at 0-5 of "Isaac Asimov"

The pattern "Isaac(?!\s+Newton)"...
  ...does *not* match "Isaac Newton"
  ...matches at 0-5 of "Isaac Asimov"



#### special matches
* beginning or end of the string (zero-length)
* word characters or not
* beginning or end of a word, or not (zero-length)
* whitespace or not
* digit or not

## the `re` module

* regular expression strings (already discussed)
* strings to be searched
* compiled patterns
    * `re.compile`
    * `match` vs. `search` vs. `fullmatch`
* `re` module functions parallel the compiled-pattern methods
    * `search`, `match`, `fullmatch`
    * `split`
    * `findall`, `finditer`
    * `sub`, `subn`
    * `re` versions take a compiled pattern *or* a pattern string + flags; compiled-pattern versions take string indices (`pos`, `endpos`)
* match objects
    * testing for success
    * accessing captured groups
* other interesting functions
    * `re.escape`
    * `match.expand`
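
Here is a rough tour of the outline above in one cell.  It is only a sketch: the sample text and the `phone` pattern are invented for the purpose, and each call shows the simplest form of the function.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = 'one two  three'
word = re.compile(r'\w+')                      # compile once, reuse many times

# Module-level functions and compiled-pattern methods parallel each other.
same = re.findall(r'\w+', text) == word.findall(text)

# Compiled-pattern methods also accept string indices (pos, endpos).
second = word.search(text, 3).group()          # start looking at index 3

pieces = re.split(r'\s+', text)                # split on runs of whitespace
spans = [m.span() for m in word.finditer(text)]
collapsed, count = re.subn(r'\s+', ' ', text)  # subn also reports the count

# Match objects: None means failure; otherwise the groups are accessible.
phone = re.compile(r'(\d{3})-(\d{4})')
m = phone.fullmatch('555-1212')
failed = phone.fullmatch('411')                # None, which is falsy
whole, prefix, line = m.group(0), m.group(1), m[2]
swapped = m.expand(r'\2/\1')                   # template built from the groups

# re.escape makes arbitrary text safe to embed in a pattern.
literal = re.compile(re.escape('a.b*c'))

print(second, pieces, spans)
print(collapsed, count, swapped)
```

Testing for success is just a truth test on the return value (`if m:`), since a failed match returns `None`.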

## tips and tricks

* verbose mode
* to compile or not
* greedy or not
* using a function argument in `sub` and `subn`
* setting flags within the regex
* when to use alternatives (e.g., `os.path`, `glob`, `regex`, `str.startswith`)
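
A few of these tips are easier to see than to describe.  The sketch below covers greedy versus non-greedy repetition, a function argument to `sub`, and one case where a plain string method does the job.  The `double` helper and the sample strings are hypothetical.

```python
import re

# Hypothetical sample markup, chosen only for illustration.
html = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ grabs as much as possible, so there is one long match.
greedy = re.findall(r'<.+>', html)
# Non-greedy: .+? grabs as little as possible, so each tag matches separately.
lazy = re.findall(r'<.+?>', html)

def double(match):
    """Replacement function: receives the match object, returns a string."""
    return str(int(match.group()) * 2)

doubled = re.sub(r'\d+', double, '3 apples and 12 pears')

# Sometimes no regex is needed at all.
simple = 'catch'.startswith('cat')

print(greedy)   # ['<b>bold</b> and <i>italic</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
print(doubled)  # 6 apples and 24 pears
```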

## common pitfalls

* failure to use a raw string
* not understanding `re.MULTILINE`, `re.DOTALL`
* mixing `str` with `bytes`
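
Each of these pitfalls can be reproduced in a line or two.  This is a minimal sketch with made-up strings; the variable names are mine.

```python
import re

# 1. Without a raw string, '\b' is a backspace character, not a word boundary.
not_raw = re.search('\bcat\b', 'a cat')     # None
raw = re.search(r'\bcat\b', 'a cat')        # matches

# 2. re.MULTILINE changes what ^ and $ mean; re.DOTALL changes what . means.
text = 'first\nsecond'
first_only = re.findall(r'^\w+', text)                 # ^ only at the string start
every_line = re.findall(r'^\w+', text, re.MULTILINE)   # ^ at each line start
one_line = re.match(r'.+', text).group()               # . stops at the newline
all_lines = re.match(r'.+', text, re.DOTALL).group()   # . matches the newline too

# 3. str patterns need str input; bytes patterns need bytes input.
digits = re.search(rb'\d+', b'abc 123').group()
try:
    re.search(r'\d+', b'abc 123')           # str pattern against bytes
    mixed_error = None
except TypeError as error:
    mixed_error = error

print(not_raw, raw.group())
print(first_only, every_line)
print(repr(one_line), repr(all_lines))
print(digits, type(mixed_error).__name__)
```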

## documentation

* This presentation: [https://github.com/wolf/re-presentation](https://github.com/wolf/re-presentation)
* Online regex tester and debugger: [regular expressions 101](https://regex101.com)
* [re](https://docs.python.org/3.7/library/re.html)
* [regex](https://bitbucket.org/mrabarnett/mrab-regex)
* Jeffrey Friedl, *Mastering Regular Expressions*: [First Edition](https://www.amazon.com/Mastering-Regular-Expressions-Techniques-Handbooks/dp/1565922573) (covers Python); [Third Edition](https://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124) (does *not* cover Python)

In [17]:
url_pattern_str = r'''
    (?P<protocol>  # URL protocol, required and captured
      https?       # the 's' in 'https' is optional
    )
    ://            # required, but not captured
    (?P<host>      # host, required _and_ captured
      [^/:]+       # ...stops at the first slash or colon
    )
    (?:            # an optional group for the port
      :            # ...so we don't capture the colon
      (?P<port>    # optional (because of the containing group), but captured
          \d+      # the port is all digits
      )
    )?
    (?P<path>      # path, optional but captured
      /[^?]*       # ...stops at the first question mark
    )?
    (?:            # an optional group for the query
      \?           # ...so we don't capture the '?' that queries start with
      (?P<query>
          .+       # everything _after_ the question mark is the query
      )
    )?
'''
url_pattern = re.compile(url_pattern_str, re.VERBOSE)

In [18]:
urls = [
    r'https://google.com',
    r'https://google.com/',
    r'https://www.learninga-z.com/main/Activity/reading',
    r'http://learninga-z.com:8088/main/Activity/reading?module=razkids',
    r'https://github.com/wolf/re-presentation.git',
    r'https://www.amazon.com/Mastering-Regular-Expressions-Techniques-Handbooks/dp/1565922573',
]

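for url in urls:
    match = url_pattern.match(url)
    if match is None:  # guard: indexing a failed match would raise TypeError
        print(f"no match:  {url}")
        print()
        continue

    # groupindex maps each group name to its number; iterate over the names
    for group_name in url_pattern.groupindex:
        if match[group_name]:  # skip optional groups that did not participate
            print(f"{group_name:>8}:  {match[group_name]}")
    print()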

protocol:  https
    host:  google.com

protocol:  https
    host:  google.com
    path:  /

protocol:  https
    host:  www.learninga-z.com
    path:  /main/Activity/reading

protocol:  http
    host:  learninga-z.com
    port:  8088
    path:  /main/Activity/reading
   query:  module=razkids

protocol:  https
    host:  github.com
    path:  /wolf/re-presentation.git

protocol:  https
    host:  www.amazon.com
    path:  /Mastering-Regular-Expressions-Techniques-Handbooks/dp/1565922573
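
If you would rather have the captured pieces as a dictionary, `match.groupdict()` is one way to get there.  The sketch below uses a deliberately simplified pattern (`simple_url`) and a hypothetical helper (`parse`), not the full `url_pattern` above, and it returns `None` when the URL doesn't match at all.

```python
import re

# A simplified, hypothetical pattern: protocol, host, and optional port only.
simple_url = re.compile(
    r'(?P<protocol>https?)://(?P<host>[^/:]+)(?::(?P<port>\d+))?'
)

def parse(url):
    """Return the named groups that participated in the match, or None."""
    match = simple_url.match(url)
    if match is None:
        return None
    # groupdict() maps every group name to its text (None if it didn't match)
    return {name: value for name, value in match.groupdict().items()
            if value is not None}

parsed = parse('http://learninga-z.com:8088/main')
no_port = parse('https://google.com')
missing = parse('ftp://example.com')

print(parsed)   # {'protocol': 'http', 'host': 'learninga-z.com', 'port': '8088'}
print(no_port)  # {'protocol': 'https', 'host': 'google.com'}
print(missing)  # None
```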

