# RegEx Overview

Source: https://docs.python.org/3/howto/regex.html

Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like. You can then ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways.
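The questions in the paragraph above map onto functions in the standard-library `re` module. The following is a minimal sketch; the sample string and the deliberately simplistic e-mail pattern are made up for illustration and are not part of the original notebook.

```python
import re

sample = "Contact alice@example.com or bob@example.org today."  # made-up sample text
pattern = r"\w+@\w+\.\w+"  # simplistic e-mail pattern, for illustration only

# "Does this string match the pattern?" (the whole string must match)
whole = re.fullmatch(pattern, "alice@example.com") is not None

# "Is there a match for the pattern anywhere in this string?"
first = re.search(pattern, sample)

# Modify a string: replace every match with a placeholder
redacted = re.sub(pattern, "<hidden>", sample)

# Split a string apart wherever a pattern matches
parts = re.split(r"\s+or\s+", sample)

print(whole)
print(first.group(0))
print(redacted)
print(parts)
```

`re.search` returns `None` when nothing matches, so real code should check the result before calling `.group()`.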

Regular expression patterns are compiled into a series of bytecodes which are then executed by a matching engine written in C. For advanced use, it may be necessary to pay careful attention to how the engine will execute a given RE, and write the RE in a certain way in order to produce bytecode that runs faster. Optimization isn’t covered in this document, because it requires that you have a good understanding of the matching engine’s internals.
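Compilation can also be requested explicitly. As a sketch, `re.compile` returns a pattern object that can be stored and reused across many strings; the pattern and the sample lines below are made up for illustration.

```python
import re

# Compile once; the pattern object exposes the same matching methods as the module
word_re = re.compile(r"reg(ex(es)?|ular)", re.IGNORECASE)

lines = ["Regular expressions", "plain text", "two regexes"]  # made-up samples

# Reuse the compiled pattern for every line
hits = [line for line in lines if word_re.search(line)]
print(hits)
```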

In [1]:
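# Function to display regex matches
# Note: `regex` is the third-party package (pip install regex), imported here under the name `re`;
# the helper below relies on its Match.captures() method.
import regex as re
import IPython.display as ipd  # public home of display() and HTML()
import ipywidgets as ipw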

@ipw.interact(regex=ipw.Text(), string=ipw.Textarea())
def findall(dotall=False, multiline=False, ignorecase=False, only_first=False, regex="", string=""):
    if not (regex and string):
        ipd.display(ipd.HTML(""))
        return None
    flags = 0
    if dotall:
        flags |= re.DOTALL
    if multiline:
        flags |= re.MULTILINE
    if ignorecase:
        flags |= re.IGNORECASE
    start = '<span style="background-color: gold">'
    end = "</span>"
    offset_bump = len(start) + len(end)
    offset = 0
    html = string
    matches = []
    for m in re.finditer(regex, string, flags):
        matches.append(m.captures()[0])
        span = m.span()
        sstart, send = span[0] + offset, span[1] + offset
        html = html[:sstart] + start + html[sstart:send] + end + html[send:]
        offset += offset_bump
        if only_first:
            break
    ipd.display(ipd.HTML("<p>regex: <strong>" + regex + "</strong></p>" + "<pre>" + html + "</pre>"))
    return matches

In [2]:
text = """
Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like. You can then ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways.

Regular expression patterns are compiled into a series of bytecodes which are then executed by a matching engine written in C. For advanced use, it may be necessary to pay careful attention to how the engine will execute a given RE, and write the RE in a certain way in order to produce bytecode that runs faster. Optimization isn’t covered in this document, because it requires that you have a good understanding of the matching engine’s internals.
"""

## Matching Characters
Ordinary characters match themselves. A sequence of ordinary characters in a pattern matches the same sequence of characters in the target text.

In [3]:
findall(regex="expression", string=text)

['expression', 'expression']

## Metacharacters
Some characters have special meanings in the context of a regex: 
```
. ^ $ * + ? { } [ ] \ | ( )
```
These metacharacters give patterns wildcards, optionality, repetition, character classes, disjunction, groups, and anchors.

## Character classes

In [None]:
findall(regex="[Rr]eg", string=text)
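A character class, written in square brackets, matches any single character from the listed set, so `[Rr]eg` matches both `Reg` and `reg`. Here is a minimal sketch using the standard-library `re` module rather than the interactive helper; the sample string is made up for illustration.

```python
import re

sample = "Regular expressions are also called regexes."  # made-up sample text

# [Rr] matches exactly one character: either "R" or "r"
either_case = re.findall(r"[Rr]eg", sample)

# A range such as [a-z] matches any one lowercase ASCII letter
with_next_letter = re.findall(r"[Rr]eg[a-z]", sample)

print(either_case)
print(with_next_letter)
```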

## Wildcards

In [8]:
findall(regex="RE.", string=text)

['REs', 'REs', 'RE,', 'RE ']
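The `.` wildcard matches any single character, which is why the output above includes `'RE,'` and `'RE '`. By default it does not match a newline; the `DOTALL` flag (the helper's `dotall` option) changes that. A small sketch with a made-up sample string, using the standard-library `re` module:

```python
import re

sample = "RE\nREs"  # made-up sample: "RE", a newline, then "REs"

# Default: "." refuses to match the newline, so the first "RE" has no third character
default = re.findall(r"RE.", sample)

# With DOTALL, "." may also match the newline
dotall = re.findall(r"RE.", sample, re.DOTALL)

print(default)
print(dotall)
```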

## Optionality

In [9]:
findall(regex="patterns?", string=text)

['patterns', 'pattern', 'pattern', 'patterns']
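The `?` metacharacter makes the item immediately before it optional, so `patterns?` matches `pattern` with or without the final `s`. The sketch below shows that `?` applies only to the single preceding character; the sample string is made up for illustration.

```python
import re

sample = "one pattern, two patterns, one color, one colour"  # made-up sample text

# The trailing "s" is optional
plural_optional = re.findall(r"patterns?", sample)

# Only the "u" is optional, not the whole "colo" prefix
spelling = re.findall(r"colou?r", sample)

print(plural_optional)
print(spelling)
```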

## Repetition

In [27]:
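# Three repetition variants; change the one passed to findall to compare them
re1 = "pattern.{0,2}"   # "pattern" plus up to two of any character except a newline
re2 = "pattern[a-z?]*"  # "pattern" plus zero or more lowercase letters or "?"
re3 = "pattern[a-z?]+"  # "pattern" plus one or more lowercase letters or "?"
findall(regex=re2, string=text)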

['patterns', 'pattern?', 'pattern', 'patterns']
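To see how the three quantifiers differ, the sketch below runs the same three patterns side by side with the standard-library `re` module. The sample string is made up for illustration; note that quantifiers are greedy and take as many characters as they are allowed.

```python
import re

sample = "pattern, patterns, pattern?"  # made-up sample text

# {0,2}: between zero and two of any character, as many as possible
up_to_two = re.findall(r"pattern.{0,2}", sample)

# *: zero or more characters from the class, so bare "pattern" still matches
zero_or_more = re.findall(r"pattern[a-z?]*", sample)

# +: at least one character from the class, so bare "pattern" is skipped
one_or_more = re.findall(r"pattern[a-z?]+", sample)

print(up_to_two)
print(zero_or_more)
print(one_or_more)
```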

## Escapes

In [33]:
findall(regex=r"\.", string=text)

['.', '.', '.', '.', '.', '.', '.']
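A backslash strips a metacharacter of its special meaning, so `\.` matches only a literal period. Writing the pattern as a raw string (`r"..."`) keeps Python from interpreting the backslash first. The sketch below also uses `re.escape`, which escapes all metacharacters in a piece of literal text; the sample strings are made up for illustration.

```python
import re

sample = "Version 3.12 costs $5 (approx.)"  # made-up sample text

# Escaped: only literal periods match
dots = re.findall(r"\.", sample)

# Unescaped: "." is a wildcard and matches every character
unescaped = re.findall(r".", "a.b")

# re.escape builds a pattern that matches the given text literally
literal = "$5 (approx.)"
found = re.search(re.escape(literal), sample) is not None

print(dots)
print(unescaped)
print(found)
```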

## Anchors

In [38]:
findall(regex="^R", string=text, multiline=True)

['R', 'R']
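Anchors match positions rather than characters. Without the `MULTILINE` flag, `^` matches only at the very start of the string; with it, `^` also matches at the start of each line, and `$` at the end of each line. A minimal sketch with a made-up three-line sample, using the standard-library `re` module:

```python
import re

sample = "Regular\nRegexes\nare Regular"  # made-up three-line sample

# Default: "^" anchors to the start of the whole string
start_only = re.findall(r"^R", sample)

# MULTILINE: "^" anchors to the start of every line
each_line = re.findall(r"^R", sample, re.MULTILINE)

# MULTILINE: "$" anchors to the end of every line
line_ends = re.findall(r"\w+$", sample, re.MULTILINE)

print(start_only)
print(each_line)
print(line_ends)
```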

## Groups

In [42]:
findall(regex="(th)?is", string=text)

['this', 'this', 'is', 'this', 'this', 'is', 'this']
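Parentheses group part of a pattern so that a quantifier such as `?` applies to the whole group, and they also capture the text the group matched. The helper above records the whole match; the sketch below, using the standard-library `re` module and a made-up sample string, shows how the captured group can be read separately.

```python
import re

sample = "this is it"  # made-up sample text

# group(0) is the whole match; group(1) is the first parenthesized group,
# or None when the optional group did not take part in the match
pairs = [(m.group(0), m.group(1)) for m in re.finditer(r"(th)?is", sample)]

# With exactly one capturing group, re.findall returns the group text, not the whole match
groups_only = re.findall(r"(th)?is", sample)

# A non-capturing group (?:...) groups without capturing
whole_matches = re.findall(r"(?:th)?is", sample)

print(pairs)
print(groups_only)
print(whole_matches)
```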

## Disjunction

In [45]:
findall(regex="(th)|(Engl)is", string=text)

['th',
 'th',
 'th',
 'th',
 'th',
 'th',
 'th',
 'th',
 'Englis',
 'th',
 'th',
 'th',
 'th',
 'th',
 'th',
 'th',
 'th',
 'th',
 'th',
 'th',
 'th',
 'th',
 'th']
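The long run of `'th'` results above follows from operator precedence: `|` binds more loosely than sequencing, so `(th)|(Engl)is` means "either `th`, or `Engl` followed by `is`". To make the alternation apply only to the prefix, both alternatives can be placed inside one group. A minimal sketch with a made-up sample string, using the standard-library `re` module:

```python
import re

sample = "this English thesis"  # made-up sample text

# "|" splits the whole pattern: "th" alone, or "Englis"
loose = [m.group(0) for m in re.finditer(r"(th)|(Engl)is", sample)]

# Grouping limits the alternation: ("th" or "Engl") followed by "is"
tight = [m.group(0) for m in re.finditer(r"(th|Engl)is", sample)]

print(loose)
print(tight)
```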