# Regular Expressions Demo

A notebook of examples showing how to use regular expressions in Python.

In [5]:
import re



First, a simple illustration of matching:

In [13]:
test_num = '03223401'
test_str = 'highlight'

p = re.compile('[a-z]+')

num_match = p.match(test_num)
str_match = p.match(test_str)

if num_match is None:
    print('When a match is not found, return None')

if str_match:
    print(f"When a match is found, a match object ({str_match}) is returned")

When a match is not found, return None
When a match is found, a match object (<re.Match object; span=(0, 9), match='highlight'>) is returned


The example below shows the basic matching principle: **match** starts at the beginning of the string and stops consuming characters at the first one that fails to meet the criteria. If the very first character of the test string does not meet the requirement, **match** returns None.

In [23]:
q = re.compile('[h-z]+')

print(q.match('highland'))     # Stops after the second letter: 'g' is outside [h-z]
print(q.match('grassland'))    # First letter is outside [h-z], even though later letters fit
print(q.match('land'))         # Only 'l' matches: 'a' is outside [h-z]

<re.Match object; span=(0, 2), match='hi'>
None
<re.Match object; span=(0, 1), match='l'>
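
To see exactly where a match stopped, the match object can be inspected directly. The snippet below is a minimal sketch; the variable names are illustrative, and **fullmatch** is brought in only for contrast, as it is not used elsewhere in this notebook.

```python
import re

q = re.compile('[h-z]+')

m = q.match('highland')
matched_text = m.group()            # the text consumed by the pattern
start, end = m.span()               # indices where the match starts and ends
remainder = 'highland'[m.end():]    # everything after the point where matching stopped

print(matched_text, (start, end), remainder)

# fullmatch only succeeds if the whole string fits the pattern
whole_fail = q.fullmatch('highland')   # None: 'g', 'a' and 'd' are outside [h-z]
whole_ok = q.fullmatch('hilltop')      # every letter is inside [h-z]
print(whole_fail, whole_ok.group())
```

Here `remainder` is `'ghland'`: matching stopped at `'g'` even though later letters such as `'h'` and `'l'` would fit the character class.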


The behavior of the **match** function contrasts with that of the **search** function, which scans the whole string for the first position where the expression matches:

In [24]:
print(q.match('grassland'))     # First letter doesn't meet condition
print(q.search('grassland'))    # Second letter ('r') does meet condition

None
<re.Match object; span=(1, 2), match='r'>
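
One way to relate the two functions, sketched below, is that **search** with a pattern anchored by `^` behaves like **match** (assuming the re.MULTILINE flag is not set). The `anchored` pattern and the use of **findall** are additions for illustration only.

```python
import re

q = re.compile('[h-z]+')
anchored = re.compile('^[h-z]+')    # illustrative variant: same pattern, anchored at the start

text = 'grassland'

first_hit = q.search(text)            # first place in the string where the pattern fits
anchored_hit = anchored.search(text)  # None, just like q.match(text)
all_hits = q.findall(text)            # every non-overlapping run of letters in [h-z]

print(first_hit.group(), first_hit.span())
print(anchored_hit)
print(all_hits)
```

**findall** returns `['r', 'ssl', 'n']` here, because `'g'`, `'a'` and `'d'` break the string into separate runs.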


### Making long expressions readable with re.VERBOSE

Long expressions can be hard to read and interpret. The re.VERBOSE flag lets us spread a complex expression over several lines: whitespace outside character classes is ignored, and comments can be added after a `#`.

In [26]:
# rather than writing 
pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")

# we can write
pat = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value -- *? used to
                     # lose the following trailing whitespace
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)
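
As a quick check that the two forms are equivalent, the sketch below runs both against a made-up header line and compares the named groups. The names `compact`, `verbose` and `line` are chosen for this illustration only.

```python
import re

compact = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")
verbose = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)

line = "From: author@example.com   "   # made-up sample input

compact_groups = compact.match(line).groupdict()
verbose_groups = verbose.match(line).groupdict()

print(verbose_groups)
print(compact_groups == verbose_groups)
```

Note that the value group drops the trailing spaces but keeps the single space after the colon, so a call to `strip()` may still be needed in practice.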

### Greedy vs. Non-Greedy

In the example below, the greedy expression matches as much text as it can and therefore returns the full test string, which is usually not what we want when looking for a single tag. To be more specific, we can use **non-greedy** operators, which match as little text as possible and so stop at the first `>`.

In [20]:
test_str = '<html><head><title>Title</title>'

greedy_p = re.compile('<.*>')           # .* matches as many characters as possible (except newlines)
nongreedy_p = re.compile('<.*?>')       # .*? is the non-greedy version: as few characters as possible

print(greedy_p.match(test_str))
print(nongreedy_p.match(test_str))

<re.Match object; span=(0, 32), match='<html><head><title>Title</title>'>
<re.Match object; span=(0, 6), match='<html>'>
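
The non-greedy pattern is also useful for pulling out every tag in turn. The sketch below uses **findall** for that, and compares the result with a negated character class, a common alternative that is not part of the example above.

```python
import re

test_str = '<html><head><title>Title</title>'

# Non-greedy: each match stops at the first '>' it can reach
lazy_tags = re.findall('<.*?>', test_str)

# Alternative: allow anything except '>' between the angle brackets
negated_tags = re.findall('<[^>]*>', test_str)

print(lazy_tags)
print(lazy_tags == negated_tags)
```

Both patterns return `['<html>', '<head>', '<title>', '</title>']` for this string.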
