<a id = 're'></a>

# re - Regular expression operations

Walkthrough of Python Standard Library documentation

1. [re - Regular expression operations](#re)
    1. [Metacharacters](#Metacharacters)
    1. [Using regular expressions](#Using-regular-expressions)
    1. [Module contents](#Module-contents)
    1. [Regular expression objects](#Regular-expression-objects)
    1. [Match objects](#Match)
    1. [Examples](#Examples)
        

## Metacharacters

The complete list of metacharacters:

. ^ $ * + ? { } [ ] \ | ( )

[ ] - The square brackets specify a character class. Characters can be listed individually, or a range can be given with a hyphen '-'. For example, both [abc] and [a-c] refer to the letters a, b and c.

Also, most metacharacters lose their special meaning inside square brackets. So [abc?] matches a, b, c or '?', even though '?' is a metacharacter elsewhere.
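
As a quick check of the two points above, the following sketch uses `re.findall()` on short sample strings (the strings and variable names are made up for illustration):

```python
import re

# Inside a class, '?' is a literal character, not a quantifier.
literal_hits = re.findall(r"[abc?]", "a?d-c")
print(literal_hits)  # ['a', '?', 'c']

# [abc] and [a-c] describe the same set of characters.
listed = re.findall(r"[abc]", "cabbage")
ranged = re.findall(r"[a-c]", "cabbage")
print(listed == ranged)  # True
```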

\ - The backslash '\' signals a special sequence of characters. It can also be used to escape metacharacters so that they can be matched literally in patterns. For example, '\w' is a special sequence that matches any alphanumeric character (or underscore). This is equivalent to [a-zA-Z0-9_]. In addition to '\w', here are several other special sequences:
- \d - Any decimal digit, equivalent to [0-9]
- \D - Any non-digit character, equivalent to [^0-9]
- \s - Any whitespace character, equivalent to [ \t\n\r\f\v]
- \W - Any non-alphanumeric character, equivalent to [^a-zA-Z0-9_]

These special sequences can be included inside character classes. For example, [\s,.] will match any whitespace character, comma or period.
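
A minimal sketch of special sequences in use, assuming a class that combines `\s` with a literal comma and period; the sample strings and variable names are illustrative only:

```python
import re

# \s, a literal comma, and a literal period combined in one class.
separators = re.compile(r"[\s,.]+")
parts = separators.split("one, two.three  four")
print(parts)  # ['one', 'two', 'three', 'four']

# \d and \D are complements of each other.
digits = re.findall(r"\d", "a1b22")
non_digits = re.findall(r"\D", "a1b22")
print(digits, non_digits)  # ['1', '2', '2'] ['a', 'b']
```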

. - The dot '.' metacharacter matches any character except a newline.

\* - The asterisk '*' specifies that the previous character can be matched zero or more times rather than exactly once. For example, 'ca*t' matches 'ct' (where 'a' appears 0 times), 'cat' ('a' appears 1 time), and 'caaaaat' ('a' appears 5 times).

\? - The question mark '?' is another repetition character. It matches either zero times or once, which in effect marks something as optional. For example, 'home-?brew' matches 'homebrew' and 'home-brew'.

\+ - The plus sign '+' is another repetition character. It matches one or more times, but not zero. For example, 'ca+t' matches 'cat' and 'caaat' but not 'ct'.

{} - The pair of curly braces takes the form {m,n}, where m and n are decimal integers giving the minimum and maximum number of repetitions, respectively. For example, 'a/{1,3}b' will match 'a/b', 'a//b' and 'a///b', but not 'ab' (no slashes) or 'a////b' (four slashes). Replacing {1,3} with an asterisk would match those two strings as well. Either m or n can be omitted: a missing m means a lower limit of 0, and a missing n means no upper limit.

Interestingly {0,} is equal to *, {0,1} is equal to '?', and {1,} is equal to +. 
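
The repetition rules above can be sketched with `re.fullmatch()`, which only succeeds when the whole string fits the pattern. The candidate strings come from the examples in the text; the variable names are illustrative:

```python
import re

# {1,3}: between one and three slashes are allowed.
bounded = re.compile(r"a/{1,3}b")
candidates = ["ab", "a/b", "a//b", "a///b", "a////b"]
bounded_hits = [s for s in candidates if bounded.fullmatch(s)]
print(bounded_hits)  # ['a/b', 'a//b', 'a///b']

# '*' is shorthand for {0,}, so it also accepts zero or four slashes.
star_hits = [s for s in candidates if re.fullmatch(r"a/*b", s)]
print(star_hits == candidates)  # True

# '?' behaves like {0,1}: the hyphen is optional, but not repeatable.
optional_hits = [
    s for s in ("homebrew", "home-brew", "home--brew")
    if re.fullmatch(r"home-?brew", s)
]
print(optional_hits)  # ['homebrew', 'home-brew']
```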

| - The pipe character '|' is an 'or' operator in practice. For example, 'A|B' will match on 'A' or 'B'

^ - The caret character '^' matches at the beginning of lines. Unless the MULTILINE flag has been set, this only matches at the beginning of the string. With MULTILINE enabled, this matches immediately after each newline

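\\$ - The dollar sign character '\$' matches at the end of a line. Unless the MULTILINE flag has been set, this means the end of the string (or just before a trailing newline). With MULTILINE enabled, it also matches directly before each newline.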
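
A small sketch of how the MULTILINE flag changes both anchors, using a made-up two-line string:

```python
import re

text = "first line\nsecond line"

# Without MULTILINE, '^' only matches at the start of the string.
default_starts = re.findall(r"^\w+", text)
print(default_starts)  # ['first']

# With MULTILINE, '^' also matches right after each newline.
multiline_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)
print(multiline_starts)  # ['first', 'second']

# '$' works the same way for line endings.
multiline_ends = re.findall(r"\w+$", text, flags=re.MULTILINE)
print(multiline_ends)  # ['line', 'line']
```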

( ) - The parenthesis characters '(' and ')' group together the expressions contained inside them. The contents of a group can be repeated with a repetition operator, such as '*', '+', '?' or '{m,n}'. For example, '(ab)*' will match zero or more repetitions of 'ab'.

<a id = 'Metacharacters'></a>

## Using regular expressions

<a id = 'Using-regular-expressions'></a>

In [108]:
import collections
import re

__Compiling regular expressions__

In [135]:
# compiling regular expressions
p = re.compile(r"ab*")
p

re.compile(r'ab*', re.UNICODE)

In [137]:
# re.compile() also accepts a flags argument, which is used to enable
# various special features and alternate syntaxes. For example,
p = re.compile(r"ab*", flags=re.IGNORECASE)
p

re.compile(r'ab*', re.IGNORECASE|re.UNICODE)

__Performing matches__

match() - Determine if RE matches at beginning of string

search() - Determine if RE matches anywhere within the entire string

findall() - Find all substrings where the RE matches and return them as a list

finditer() - Find all substrings where the RE matches and return them as an iterator

match() and search() return None if no match is found, but otherwise return a match object that has its own attributes, such as the beginning and end location, the string that matches, etc.
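
Because a failed match returns None, calling a method such as group() on the result without checking would raise an AttributeError. A minimal sketch of the usual guard, where `first_word` is a hypothetical helper written only for this illustration:

```python
import re

word = re.compile(r"[a-z]+")


def first_word(text):
    """Return the first lowercase word in text, or None if there is none."""
    m = word.search(text)
    if m is None:  # search() found nothing
        return None
    return m.group()


print(first_word("::: message"))  # message
print(first_word("12345"))  # None
```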

In [157]:
# compile regex rule
p = re.compile("[a-z]+")
p

re.compile(r'[a-z]+', re.UNICODE)

In [158]:
# an empty string won't match at all since the '+' means 1 or more repetitions
print(p.match(""))

None


__Match methods__

group() - Return the string matched

start() - Return the starting position of the match

end() - Return the ending position of the match

span() - Return a tuple containing the (start, end) positions of the match


In [159]:
# sample matching operation
m = p.match("tempo")
print(m)

<_sre.SRE_Match object; span=(0, 5), match='tempo'>


In [160]:
# review word
m.group()

'tempo'

In [161]:
# start() is 0 here because match() only looks at the
# beginning of the string
m.start()

0

In [162]:
# ending position
m.end()

5

In [163]:
# span of matched word
m.span()

(0, 5)

In [166]:
# search()
m = p.search("::: message")
print(m)
print(m.group())
print(m.start())
print(m.end())
print(m.span())

<_sre.SRE_Match object; span=(4, 11), match='message'>
message
4
11
(4, 11)


In [167]:
# findall()
p = re.compile(r"\d+")
p.findall("12 drummers drumming, 11 pipers pipig, 10 lords whatever")

['12', '11', '10']

In [168]:
# finditer()
# While findall() must build the entire list before returning the result,
# finditer() yields match objects one at a time as an iterator
iterator = p.finditer("12 drummers drumming, 11 pipers pipig, 10 lords whatever")
for match in iterator:
    print(match.group(), match.span())

12 (0, 2)
11 (22, 24)
10 (39, 41)


__Module level functions__

We don't need to create a pattern object before calling its methods. There are also top-level functions that accomplish the same thing and return the same results in one-off fashion.

In [173]:
# match 'From' followed immediately by 1 or more whitespace characters
print(re.match(pattern=r"From\s+", string="Fromage amk"))

print(re.match(pattern=r"From\s+", string="From me to you"))

None
<_sre.SRE_Match object; span=(0, 5), match='From '>


__Groupings__

In [175]:
# this RE will match on any number of repetitions of 'ab'
p = re.compile(r"(ab)*")
print(p.match("ababababab").span())

(0, 10)


In [179]:
# groups also capture the start/end index of the matched text.
# this is retrieved by passing an argument to group(), end(), etc.
p = re.compile(r"(a)b")
m = p.match("ab")
print(m.group())
print(m.group(0))  # The same as passing no argument, the whole string
print(m.group(1))

ab
ab
a


In [182]:
# compound matching rule
p = re.compile("(a(b)c)d")
m = p.match("abcd")
print(m.group())
print(m.group(0))
print(m.group(1))
print(m.group(2))
print(m.group(1, 2, 2))

abcd
abcd
abc
b
('abc', 'b', 'b')


## Module contents
A small sample of functions available

<a id = 'Module-contents'></a>


re.__compile__(_pattern_, _flags = 0_)

Compile a regular expression pattern into a regular expression object, which can be used for matching using its match(), search() and other methods.

_The sequence..._
```python
prog = re.compile(pattern)
result = prog.match(string)
```

_is equivalent to..._
```python
result = re.match(pattern, string)
```

Using re.compile() and saving the resulting RE object for reuse is more efficient when the expression will be used several times.

re.__split__(_pattern, string, maxsplit = 0, flags = 0_)

Split a string by the occurrences of a pattern.

In [19]:
# evaluate sample string with varying capitalization and punctuation
string = "Words, words, words."

print(re.split(r"\W+", string))

# Capturing parentheses in the pattern make split() also return the
# text of the separators: the comma, space and period
print(re.split(r"(\W+)", string))

# A nonzero maxsplit means that at most that many splits occur.
# The remainder of the string is returned as the final element of the list.
# In this case, a value of 1 or 2 prevents the empty string from appearing
print(re.split(r"\W+", string, maxsplit=2))

# flags apply to the pattern: IGNORECASE lets [a-f] also match 'B'
print(re.split("[a-f]+", "0a3B9", flags=re.IGNORECASE))

['Words', 'words', 'words', '']
['Words', ', ', 'words', ', ', 'words', '.', '']
['Words', 'words', 'words.']
['0', '3', '9']


## Regular expression objects

Compiled regular expression objects support several methods and attributes. Here is a sample.

<a id = 'Regular-expression-objects'></a>

Pattern.__search__(_string[, pos [, endpos]]_)

Scan through string looking for the first match of this RE and return a corresponding match object.

pos is an optional parameter that gives an index in the string indicating where the search should start

endpos is an index indicating how far the string will be searched.

In [36]:
# simple rule
pattern = re.compile("d")

# match at index 0
pattern.search("dog")

<_sre.SRE_Match object; span=(0, 1), match='d'>

In [37]:
# no match, so None is returned, because the search starts at index 1
pattern.search("dog", 1)

Pattern.__match__(_string[, pos [, endpos]]_)

If 0+ characters at the beginning of the string match this RE, return a corresponding match object. Return None otherwise.

pos and endpos have the same meaning as they do for the search() method

In [38]:
# simple rule
pattern = re.compile("o")

# no match because 'o' is not at the start of 'dog'
pattern.match("dog")

In [39]:
# match because search starts at index 1
pattern.match("dog", 1)

<_sre.SRE_Match object; span=(1, 2), match='o'>

Pattern.__fullmatch__(_string[, pos [, endpos]]_)

If the whole string matches this RE, return a match object

pos and endpos have the same meaning as they do for the search() method

In [40]:
# slightly less simple rule
pattern = re.compile("o[gh]")

# no match because 'o' is not at the start of 'dog'
pattern.fullmatch("dog")

In [41]:
# no match even though 'o' starts the string.
# the full string doesn't match

pattern.fullmatch("ogre")

In [42]:
# matches within the given index limits
pattern.fullmatch("doggie", 1, 3)

<_sre.SRE_Match object; span=(1, 3), match='og'>

## Match objects

Match objects support several methods and attributes. Here is a sample.

<a id = 'Match'></a>

Match.__group__(_[group1,...]_)

Returns one or more subgroups of the match

In [107]:
# the 'r' prefix (raw string) avoids having to escape the backslash '\'
#      '\' signals a special sequence
# (\w+) captures a run of one or more word characters
m = re.match(pattern=r"(\w+) (\w+)", string="Isaac Newton, physicist")
print(m.group(0))
print(m.group(1))
print(m.group(2))
print(m.group(1, 2))

Isaac Newton
Isaac
Newton
('Isaac', 'Newton')


In [54]:
# using (?P<name>...) allows for identifying groups with more descriptive indexes
m = re.match(pattern=r"(?P<firstName>\w+) (?P<lastName>\w+)", string="Malcolm Reynolds")
print(m.group("firstName"))
print(m.group("lastName"))

# can still reference groups by their index
print(m.group(0))
print(m.group(1))
print(m.group(2))
print(m.group(1, 2))

Malcolm
Reynolds
Malcolm Reynolds
Malcolm
Reynolds
('Malcolm', 'Reynolds')


In [66]:
# if a group matches multiple times, only the last match is accessible
# m.group(2) would raise an IndexError, since the pattern has only one group
m = re.match(pattern=r"(..)+", string="a1b2c3")
print(m.group(0))
print(m.group(1))

a1b2c3
c3
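
When every repetition is needed rather than only the last one, one option is to drop the repeated group and collect the pieces with `re.findall()`. This is a sketch of that workaround, not the only approach:

```python
import re

# A repeated group keeps only its final iteration...
m = re.match(r"(..)+", "a1b2c3")
print(m.group(1))  # c3

# ...whereas findall() on the unrepeated pattern returns every pair.
pairs = re.findall(r"..", "a1b2c3")
print(pairs)  # ['a1', 'b2', 'c3']
```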


Match.__getitem__(*g*)

Same as m.group(g); this allows easier access to individual groups in a match.

In [74]:
# access individual matches via object indexing
m = re.match(pattern=r"(\w+) (\w+)", string="Isaac Newton, physicist")
print(m[0])
print(m[1])
print(m[2])

Isaac Newton
Isaac
Newton


Match.__groups__(_default = None_)

Return a tuple that contains all subgroups in the match

In [75]:
# (\d+) captures a run of one or more digits
m = re.match(pattern=r"(\d+)\.(\d+)", string="24.1632")
m.groups()

('24', '1632')
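
The `default` argument matters when a group does not participate in the match. A short sketch, assuming a pattern whose second group is optional:

```python
import re

# The second group is optional, so it does not participate for "24".
m = re.match(r"(\d+)\.?(\d+)?", "24")

no_default = m.groups()  # non-participating groups default to None
with_default = m.groups("0")  # or to whatever value is passed in
print(no_default)  # ('24', None)
print(with_default)  # ('24', '0')
```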

Match.__groupdict__(_default = None_)

Return dictionary containing all the named subgroups of the match.

In [79]:
# tidy result of match
m = re.match(pattern=r"(?P<firstName>\w+) (?P<lastName>\w+)", string="Tyler Peterson")
m.groupdict()

{'firstName': 'Tyler', 'lastName': 'Peterson'}

Match.__start__(_[group]_)
Match.__end__(_[group]_)

Return the indices of the start/end of the substring matched by group. If group is omitted, the whole match is used.

In [82]:
# remove part of string
email = "tony@tiremove_thisger.net"
m = re.search("remove_this", email)
email[: m.start()] + email[m.end() :]

'tony@tiger.net'
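
Passing a group number or name to start(), end() or span() restricts the result to that group. A minimal sketch with hypothetical group names `user` and `domain`:

```python
import re

m = re.match(r"(?P<user>\w+)@(?P<domain>[\w.]+)", "tony@tiger.net")

# With no argument, start()/end() describe the whole match.
print(m.start(), m.end())  # 0 14

# With a group number or name, they describe that group only.
print(m.start(1), m.end(1))  # 0 4
print(m.span("domain"))  # (5, 14)
```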

## Examples

<a id = 'Examples'></a>

__Check if a string matches a range of possibilities__

Suppose you are writing a poker program where a player’s hand is represented as a 5-character string with each character representing a card, “a” for ace, “k” for king, “q” for queen, “j” for jack, “t” for 10, and “2” through “9” representing the card with that value.

To see if a given string is a valid hand, one could do the following:

In [87]:
# matching function
def displayMatch(match):
    if match is None:
        return None
    return "<Match: {0}, groups = {1}>".format(match.group(), match.groups())


valid = re.compile(r"[a2-9tjqk]{5}$")
displayMatch(valid.match("akt5q"))  # valid

'<Match: akt5q, groups = ()>'

In [88]:
# demonstrate invalid match
displayMatch(valid.match("akt5e"))  # invalid

A hand such as "717ak" contains a pair, or two cards of the same value. To match this with a regular expression, one could use backreferences as such:

In [90]:
# '.' matches any character except a newline
# '*' quantifier, match 0 or more of the preceding token
# (.) capturing group #1. Groups tokens together and creates a capture
#     group for extracting a substring or using a backreference
# '\1' numbered backreference. Matches the text captured by group #1
pair = re.compile(r".*(.).*\1")
displayMatch(pair.match("717ak"))  # Pair of 7s

"<Match: 717, groups = ('7',)>"

In [96]:
#  test match
displayMatch(pair.match("aaakq"))  # three of a kind

# Three of a kind consists of this group
print(pair.match("aaakq").group(1))

a


__search() vs. match()__

In [98]:
# demonstrate search vs. match
re.match(pattern="c", string="abcdef")  # no match because starts with 'a'
re.search(pattern="c", string="abcdef")  # match because searches whole string

<_sre.SRE_Match object; span=(2, 3), match='c'>

In [100]:
# demonstrate successful search vs. invalid search
re.search(pattern="^c", string="abcdef")  # no match due to ^ restriction
re.search(pattern="^a", string="abcdef")  # match

<_sre.SRE_Match object; span=(0, 1), match='a'>

__Find all adverbs__

In [102]:
# \w Word. Matches any word character (alphanumeric & underscore)
#     + Quantifier. Match 1 or more of the preceding token
# 'l' Character. Matches an 'l' character. Case sensitive
# 'y' Character. Matches a 'y' character. Case sensitive
# Roughly translates to: any word ending in 'ly'
text = "he was carefully disguised by captured quickly by police."
re.findall(pattern=r"\w+ly", string=text)

['carefully', 'quickly']

__Finding all adverbs and their positions__

In [106]:
# find adverbs based on targeting of words that end in 'ly'
text = "he was carefully disguised by captured quickly by police."
for m in re.finditer(pattern=r"\w+ly", string=text):
    print("{0}-{1} {2}".format(m.start(), m.end(), m.group(0)))

7-16 carefully
39-46 quickly


__Writing a tokenizer__

A tokenizer analyzes strings and categorizes groups of characters.

Text categories are specified with regular expressions. In this function, the strategy is to combine the REs into a single master RE and loop over its successive matches.

In [132]:
# custom tokenizer
Token = collections.namedtuple("Token", ["typ", "value", "line", "column"])


def tokenize(code):
    keywords = {"IF", "THEN", "ENDIF", "FOR", "NEXT", "GOSUB", "RETURN"}
    token_specification = [
        ("NUMBER", r"\d+(\.\d*)?"),  # Integer or decimal number
        ("ASSIGN", r":="),  # Assignment operator
        ("END", r";"),  # Statement terminator
        ("ID", r"[A-Za-z]+"),  # Identifiers
        ("OP", r"[+\-*/]"),  # Arithmetic operators
        ("NEWLINE", r"\n"),  # Line endings
        ("SKIP", r"[ \t]+"),  # Skip over spaces and tabs
        ("MISMATCH", r"."),  # Any other character
    ]
    tok_regex = "|".join("(?P<%s>%s)" % pair for pair in token_specification)
    line_num = 1
    line_start = 0
    for mo in re.finditer(tok_regex, code):
        kind = mo.lastgroup
        value = mo.group(kind)
        if kind == "NEWLINE":
            line_start = mo.end()
            line_num += 1
        elif kind == "SKIP":
            pass
        elif kind == "MISMATCH":
            raise RuntimeError(f"{value!r} unexpected on line {line_num}")
        else:
            if kind == "ID" and value in keywords:
                kind = value
            column = mo.start() - line_start
            yield Token(kind, value, line_num, column)


statements = """
    IF quantity THEN
        total := total + price * quantity;
        tax := price * 0.05;
    ENDIF;
"""

In [133]:
# demonstrate tokenizer
for token in tokenize(statements):
    print(token)

Token(typ='IF', value='IF', line=2, column=4)
Token(typ='ID', value='quantity', line=2, column=7)
Token(typ='THEN', value='THEN', line=2, column=16)
Token(typ='ID', value='total', line=3, column=8)
Token(typ='ASSIGN', value=':=', line=3, column=14)
Token(typ='ID', value='total', line=3, column=17)
Token(typ='OP', value='+', line=3, column=23)
Token(typ='ID', value='price', line=3, column=25)
Token(typ='OP', value='*', line=3, column=31)
Token(typ='ID', value='quantity', line=3, column=33)
Token(typ='END', value=';', line=3, column=41)
Token(typ='ID', value='tax', line=4, column=8)
Token(typ='ASSIGN', value=':=', line=4, column=12)
Token(typ='ID', value='price', line=4, column=15)
Token(typ='OP', value='*', line=4, column=21)
Token(typ='NUMBER', value='0.05', line=4, column=23)
Token(typ='END', value=';', line=4, column=27)
Token(typ='ENDIF', value='ENDIF', line=5, column=4)
Token(typ='END', value=';', line=5, column=9)
