# Important note!

Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
YOUR_ID = "" # Please enter your GT login, e.g., "rvuduc3" or "gtg911x"
COLLABORATORS = [] # list of strings of your collaborators' IDs

In [None]:
import re

RE_CHECK_ID = re.compile (r'''[a-zA-Z]+\d+|[gG][tT][gG]\d+[a-zA-Z]''')
assert RE_CHECK_ID.match (YOUR_ID) is not None

collab_check = [RE_CHECK_ID.match (i) is not None for i in COLLABORATORS]
assert all (collab_check)

del collab_check
del RE_CHECK_ID
del re

**Jupyter / IPython version check.** The following code cell verifies that you are using the correct version of Jupyter/IPython.

In [None]:
import IPython
assert IPython.version_info[0] >= 3, "Your version of IPython is too old, please update it."

# Part 3: An extreme case of regular expression processing

There is a beautiful theory underlying regular expressions, and efficient regular expression processing is regarded as one of the classic problems of computer science. In the last part of this lab, you will explore a bit of that theory, albeit by experiment.

In particular, the code cells below will walk you through a simple example of the potentially hidden cost of regular expression parsing.

> If you really want to geek out, look at the article from which this example is taken: https://swtch.com/~rsc/regexp/regexp1.html

## Quick review

**Exercise 1.** Let $a^n$ be a shorthand notation for a string in which $a$ is repeated $n$ times. For example, $a^3$ is the same as $aaa$ and $a^6$ is the same as $aaaaaa$. Write a function to generate $a^n$.

In [None]:
def rep_str (s, n):
    """Returns a string consisting of an input string repeated a given number of times."""
    assert type (s) is str
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
assert rep_str ('a', 3) == 'aaa'
assert rep_str ('cat', 4) == 'catcatcatcat'
assert rep_str ('', 100) == ''

## An initial experiment

Intuitively, you should expect (or hope) that the time to determine whether a string of length $n$ matches a given pattern will be proportional to $n$. Let's see if this holds when matching simple input strings of repeated letters against a pattern designed to match such strings.
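One way to make "proportional to $n$" concrete is a rough model in which a match costs a fixed overhead plus a constant amount of work per character. This is only an illustrative assumption, not a statement about how `re` is implemented; the symbols $T(n)$, $c_0$, and $c$ are introduced here for the sketch.

$$T(n) \approx c_0 + c\,n \quad\Longrightarrow\quad \frac{T(n)}{n} \approx \frac{c_0}{n} + c \;\longrightarrow\; c \quad \mbox{as } n \rightarrow \infty$$

Under this model, the time per match divided by $n$ should level off at a roughly constant value once $n$ is large enough for the fixed overhead $c_0$ to stop dominating.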

In [None]:
import re

In [None]:
# Set up an input problem
n = 3
s_n = rep_str ('a', n) # Input string
pattern = '^a{%d}$' % n # Pattern to match it exactly

# Test it
print ("Matching input '{}' against pattern '{}'...".format (s_n, pattern))
assert re.match (pattern, s_n) is not None

# Benchmark it & report time, normalized to 'n'
timing = %timeit -q -o re.match (pattern, s_n)
t_avg = sum (timing.all_runs) / len (timing.all_runs) / timing.loops / n * 1e9
print ("Average time per match per `n`: {:.1f} ns".format (t_avg))

Before moving on, be sure you understand what the above benchmark is doing. For more on the Jupyter "magic" command, `%timeit`, see: http://ipython.readthedocs.io/en/stable/interactive/magics.html?highlight=magic#magic-magic
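If it helps to see the benchmark without the magic command, here is a minimal sketch of a similar measurement using Python's standard `timeit` module. It is not a description of what `%timeit` does internally: the loop count `loops` and the number of runs `runs` are fixed by hand here as assumptions, whereas `%timeit` chooses the loop count automatically. The input and pattern are written out as literals for $n=3$.

```python
import re
import timeit

s = 'aaa'           # literal input string, i.e., a^3
pattern = '^a{3}$'  # pattern that matches it exactly
loops = 10000       # assumed loop count per run (chosen by hand)
runs = 7            # assumed number of repeated runs (chosen by hand)

# Each entry of all_runs is the total time, in seconds, for `loops` matches.
all_runs = timeit.repeat(lambda: re.match(pattern, s), repeat=runs, number=loops)

# Average over runs, divide by loops to get time per match,
# then normalize by the input length and convert to nanoseconds.
t_avg_ns = sum(all_runs) / len(all_runs) / loops / len(s) * 1e9
print("Average time per match per `n`: {:.1f} ns".format(t_avg_ns))
```

The arithmetic mirrors the cell above: `all_runs` plays the role of `timing.all_runs` and `loops` the role of `timing.loops`.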

**Exercise 2.** Repeat the above experiment for various values of `n`. To help keep track of the results, feel free to create new code cells that repeat the benchmark for different values of `n`. Explain what you observe. Can you conclude that matching simple patterns of the form `^a{n}$` against input strings of the form $a^n$ does, indeed, scale linearly?

In [None]:
# Use this code cell (and others, if you wish) to set up an experiment
# to test whether matching simple patterns behaves at worst linearly
# in the length of the input.

# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

## A more complex pattern

Consider patterns of the form:

$$(a?)^n(a^n)$$

For instance, when $n=3$, the regular expression pattern is `(a?){3}a{3}`, which is equivalent to `a?a?a?aaa`. Start by convincing yourself that an input string of the form,

$$a^n = \underbrace{aa\cdots a}_{n \mbox{ occurrences}}$$

should match this pattern. Here is some code to set up an experiment to benchmark this case.
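To help convince yourself for $n=3$, the following sketch matches the expanded form of the pattern, with each `a?` placed in its own capture group so that you can inspect what each part consumed. The expanded, fully grouped pattern is written this way only for illustration.

```python
import re

# Expanded n=3 pattern: three optional a's, then exactly three a's.
expanded = '^(a?)(a?)(a?)(aaa)$'
m = re.match(expanded, 'aaa')
assert m is not None

# Each optional group ends up matching the empty string, which leaves
# all three characters for the mandatory 'aaa' at the end.
print(m.groups())  # ('', '', '', 'aaa')

# The compact form used in this notebook matches the same input.
assert re.match('^(a?){3}(a{3})$', 'aaa') is not None
```

Try the same check on an input such as `'aaaa'` to see how the optional groups absorb any extra characters.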

In [None]:
def setup_inputs (n):
    """Sets up the 'complex pattern example' above."""
    s_n = rep_str ('a', n)
    p_n = "^(a?){%d}(a{%d})$" % (n, n)
    print ("[n={}] Matching pattern '{}' against input '{}'...".format (n, p_n, s_n))
    assert re.match (p_n, s_n) is not None
    return (p_n, s_n)

n = 3
p_n, s_n = setup_inputs (n)
timing = %timeit -q -o re.match (p_n, s_n)
t_n = sum (timing.all_runs) / len (timing.all_runs) / timing.loops / n * 1e9
print ("==> Time per run per `n`: {:.1f} ns".format (t_n))

**Exercise 3.** Repeat the above experiment but for different values of $n$, such as $n \in \{3, 6, 9, 12, 15, 18\}$. As before, feel free to use the code cell below or make new code cells to contain the code for your experiments. Summarize what you observe. How does the execution time vary with $n$? Can you explain this behavior?

In [None]:
# Use this code cell (and others, if you wish) to set up an experiment
# to measure how the time to match the more complex pattern varies
# with `n`.

# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE