# Practice with regular expressions using Python tools

Many programming languages have libraries (sets of functions) that support the use of regular expressions for string analysis.  This notebook, and the set of activities it asks you to complete, make use of a Python library for regular expressions called the `re` module.

## REMINDER: Running cells in a notebook
Remember that to run a cell in a notebook, you can either:

* Go to the *Runtime* menu and choose the option *Run all*
* Press the *Play* button next to a given cell

**NOTE**: Before running any other code cell, run the code cell in the Introduction below that contains `import re`. It imports the regular expression module (a set of functions) that the rest of the notebook needs.

## Introduction 
The cell of code below imports the `re` module into this notebook so we can access the regular expression functions.

In [None]:
import re

The Python function `re.fullmatch(pattern_arg, string_arg)` indicates whether the *string* represented by the parameter `string_arg` is in the language generated by the *regular expression string* represented by `pattern_arg`. If the word is not in the language, the function returns `None`. If the word is in the language, it returns a `Match` object that records the matching string and where the match occurs within the string. The `Match` object carries this level of detail because the `re` library also contains functions that look for instances of a regular expression *anywhere within* a string, not just across the whole string. For those functions, knowing both *what matched* and *where it occurs* is useful.
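
As a rough illustration of the two possible return values, the sketch below uses the pattern `ab*` (an `a` followed by zero or more `b` symbols) and a few example strings chosen only for this demonstration. It also calls `re.search`, one of the `re` functions that looks anywhere within a string, to show why the position stored in a `Match` object can matter.

```python
import re

# No match: "ba" does not start with "a", so fullmatch returns None.
no_match = re.fullmatch("ab*", "ba")
print(no_match)

# Match: the whole string "abb" is in the language of ab*.
match = re.fullmatch("ab*", "abb")
print(match.group())  # the text that matched: abb
print(match.span())   # (start, end) positions of the match: (0, 3)

# re.search scans for the pattern anywhere inside the string,
# so here the match starts at position 2 rather than 0.
found = re.search("ab*", "xxabbyy")
print(found.group(), found.span())  # abb (2, 5)
```

With `re.fullmatch`, a successful match always spans the entire string, which is why the exercises in this notebook only need a yes/no answer.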


The cell below shows the use of the `re.fullmatch` function, passing in the regular expression string `ab*` for the parameter `pattern_arg` and the string `abb` for the parameter `string_arg`. `ab*` represents strings that start with the `a` symbol and are followed by *zero or more* `b` symbols. There are infinitely many strings in the language generated by this regular expression; a few examples are `a`, `ab`, `abb`, and `abbb`, so `abb` should be a match. The result of the comparison is captured in the `result` variable. By executing the cell below, one can see the `Match` object that is returned, indicating that `abb` is in the language generated by the regular expression `ab*`.



In [None]:
result = re.fullmatch("ab*","abb")
print("Result is: ", result)
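
To make the "zero or more `b` symbols" idea concrete, here is a small sketch that builds the first few words of the language and checks each against `ab*`. The variable names `members` and `non_members` are made up for this illustration, and the non-member strings are just a few hand-picked examples.

```python
import re

# Build the first few words of the language: "a" followed by n copies of "b".
members = ["a" + "b" * n for n in range(5)]
print(members)  # ['a', 'ab', 'abb', 'abbb', 'abbbb']

# Every generated word should produce a Match object rather than None.
all_match = all(re.fullmatch("ab*", w) is not None for w in members)
print(all_match)  # True

# Words that break the shape "one a, then only b symbols" return None.
non_members = ["", "b", "ba", "aab"]
for w in non_members:
    print(repr(w), re.fullmatch("ab*", w))
```

Note that the empty string is not in the language: the pattern requires exactly one leading `a`.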

To simplify the exercises, I've written two functions.  

* One, named `boolean_match_result` with one parameter, `match_arg`, takes the result from a call to `re.fullmatch` and converts that result to a simple `True` or `False` boolean value.
* The other, named `match_against_list`, takes in a regular expression parameter, `pattern_arg`, and a list of strings, `list_arg`, and returns a list which contains only the strings in `list_arg` that are in the language generated by the regular expression defined by `pattern_arg`.

**NOTE**: Make sure to run the cell immediately below before running any of the later cells to ensure that the two functions are defined.

In [None]:
# This function converts the result returned from matching a regular expression 
# against a string into a Boolean value 
# True (there was a match) or False (there was not a match)
def boolean_match_result(match_arg):
    if match_arg is None:
        return False
    else:
        return True

# This function determines, for each string in a list of strings (list_arg),
# which of those strings matches a given regular expression (pattern_arg)
def match_against_list(pattern_arg, list_arg):
    output_list = []
    for word in list_arg:
        result = re.fullmatch(pattern_arg, word)
        if result is not None:
            output_list.append(word)
    return output_list

The cell below shows the same use of the `re.fullmatch` function as in the cell a few steps back, but also shows how the function `boolean_match_result` converts the `Match` object into the value `True`.

In [None]:
result = re.fullmatch("ab*","abb")
print("Result is: ", result)
print("Boolean result is: ", boolean_match_result(result))
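
For comparison, the conversion to a Boolean can also be written more compactly. The sketch below is only an alternative phrasing, not a replacement for the function above; the name `boolean_match_result_short` is hypothetical. It relies on the fact that `re.fullmatch` returns `None` when there is no match, and that `Match` objects are always truthy in Python.

```python
import re

# Hypothetical shorter variant of boolean_match_result:
# a Match object is "not None", so the comparison itself is the answer.
def boolean_match_result_short(match_arg):
    return match_arg is not None

in_language = boolean_match_result_short(re.fullmatch("ab*", "abb"))
not_in_language = boolean_match_result_short(re.fullmatch("ab*", "ba"))
print(in_language)      # True
print(not_in_language)  # False

# bool() gives the same answer, because Match objects are always truthy.
print(bool(re.fullmatch("ab*", "abb")))  # True
```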

The cell below shows the use of the `match_against_list` function, with the string `ab*` as the value for the parameter `pattern_arg` and the list `["a", "ab", "abb", "b", "ba", "bba", "bab", "abbb"]` as the value for the parameter `list_arg`. When run, it should return and print a list containing only the elements of the list that are in the language generated by `ab*`, namely `["a", "ab", "abb", "abbb"]`.

In [None]:
regular_expression = "ab*"
list_of_words = ["a", "ab", "abb", "b", "ba", "bba", "bab", "abbb"]
matching_word_list = match_against_list(regular_expression, list_of_words)
print(matching_word_list)
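
The same filtering idea can be sketched with a list comprehension and a pattern that is compiled once with `re.compile`. The function name `match_against_list_compiled` is made up for this example; it is meant as an illustration of an equivalent approach, not as the version the exercises depend on.

```python
import re

# Hypothetical variant of match_against_list: compile the pattern once,
# then keep only the words that the compiled pattern fully matches.
def match_against_list_compiled(pattern_arg, list_arg):
    compiled = re.compile(pattern_arg)
    return [word for word in list_arg if compiled.fullmatch(word) is not None]

list_of_words = ["a", "ab", "abb", "b", "ba", "bba", "bab", "abbb"]
matching_word_list = match_against_list_compiled("ab*", list_of_words)
print(matching_word_list)  # ['a', 'ab', 'abb', 'abbb']
```

Both versions preserve the order of the input list, which matters later when two result lists are compared with `==`.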

### Problems for you to practice with

**Problem 1.** Given the regular expression `b*aab*`, replace the string `CHANGEME` in the cell below with a length-5 word (a word containing 5 symbols) such that running the cell prints `Boolean result is:  True` and `Word length is:  5`.


In [None]:
regular_expression = "b*aab*"
word = "CHANGEME"
print("Boolean result is: ", boolean_match_result(re.fullmatch(regular_expression,word)))
print("Word length is: ", len(word))

**Problem 2.** Given the regular expression `b*aab*`, replace the contents of the list `list_of_words` with five words, of any length, that are all in the language generated by the regular expression. The cell will print the original list of words, the list that results from matching against the regular expression, and whether or not the two lists are the same.

*Big picture problem solving*: You are trying to come up with five words that are generated by the provided regular expression.


In [None]:
regular_expression = "b*aab*"
list_of_words = ["CHANGE","ME"]

matching_word_list = match_against_list(regular_expression, list_of_words)

print("Original list: ", list_of_words)
print("Result list: ", matching_word_list)
if matching_word_list == list_of_words:
    print("All words in original list are in the language")
else:
    print("An error occurred - there were one or more words in the original list not in the language")

**Problem 3.** Consider the alphabet `{a,b}`. The variable `list_of_words` in the cell below contains all of the one- and two-symbol words over that alphabet (there are 6 such words).

Without running the cell, predict how many of those words are in the language generated by the regular expression `b*aab*` and fill in your number as the value for the `expected_match_count` variable. 

Then, run the cell and see if the actual number of words in that list that match the regular expression matches your expectation.

*Big picture problem solving*: You are trying to predict how many of the words provided could be generated from the regular expression provided.



In [None]:
regular_expression = "b*aab*"
list_of_words = ["a", "b", "aa", "ab", "ba", "bb"]
expected_match_count = -1 # Change this 

matching_word_list = match_against_list(regular_expression, list_of_words)
actual_match_count = len(matching_word_list)
if actual_match_count == expected_match_count:
    print("Your predicted number of matches, " + str(expected_match_count) + ", was correct")
else:
    print("Your predicted number of matches, " + str(expected_match_count) + ", was incorrect")
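
The list of one- and two-symbol words in Problem 3 was typed by hand, but it could also be generated. The sketch below is one possible way to do that with `itertools.product`; the helper name `words_of_length` is an assumption made for this example. It only builds the word list and does not reveal which words match.

```python
from itertools import product

# Hypothetical helper: every word of exactly `length` symbols over `alphabet`.
def words_of_length(alphabet, length):
    return ["".join(symbols) for symbols in product(alphabet, repeat=length)]

alphabet = ["a", "b"]
short_words = words_of_length(alphabet, 1) + words_of_length(alphabet, 2)
print(short_words)       # ['a', 'b', 'aa', 'ab', 'ba', 'bb']
print(len(short_words))  # 6, that is 2 one-symbol words plus 4 two-symbol words
```

An alphabet with 2 symbols has 2 to the power n words of length n, which is where the count of 2 + 4 = 6 comes from.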

**Problem 4.** In the cell below is a list of words in the variable `list_of_words`.  The words are all from the alphabet `{0,1}` and are all in the language generated from a given regular expression. 

Come up with a regular expression, replacing the `CHANGEME` associated with the regular expression, that generates all the words.  The output from the cell will be the words from the original list, the words in the original list covered by your regular expression, and the number of words that are not covered.  

*Big picture problem solving*: You are trying to come up with a regular expression that would generate all the words in a list.

In [None]:
regular_expression = "CHANGEME"
list_of_words = ["0","00","000","0000","1","10","100","1000","01","001", "0001", "010","0010","0100"]

matching_word_list = match_against_list(regular_expression, list_of_words)
number_of_words_not_covered = len(list_of_words) - len(matching_word_list)

print("Original list: ", list_of_words)
print("Result list: ", matching_word_list)
print("Number of words not covered by regular expression: ", number_of_words_not_covered)
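
When a candidate regular expression leaves some words uncovered, it can help to see which ones. The sketch below shows one way to list them; the helper name `words_not_covered` is hypothetical, and the demonstration deliberately uses the earlier `ab*` pattern with a made-up word list so that it does not give away an answer to Problem 4.

```python
import re

# Hypothetical helper: the words in list_arg that pattern_arg does NOT fully match.
def words_not_covered(pattern_arg, list_arg):
    return [word for word in list_arg if re.fullmatch(pattern_arg, word) is None]

uncovered = words_not_covered("ab*", ["a", "ab", "ba", "abb", "bab"])
print(uncovered)  # ['ba', 'bab']
```

Looking at the uncovered words usually suggests what the pattern is missing, for example a symbol that is not yet allowed at the start of a word.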

**Problem 5.** In the cell below, the variable `list_of_words` contains a set of words over the alphabet `{0,1}`.  In addition, there is a regular expression stored in the variable `re_one` that has the value `1(0*|1*)1`.  Come up with another regular expression, replacing the string associated with variable `re_two` (which currently contains `"CHANGEME"`), such that the two regular expressions match the same set of words from the original list.  The cell will print the original list of words, the resulting list from matching against `re_one`, the resulting list from matching against `re_two` and then, whether or not the two lists of matches are the same. 

*Big picture*: You are trying to come up with a second regular expression that is expressed differently but is equivalent to the one stored in `re_one`.


In [None]:
list_of_words = ["0","1","10","11","100","101","110","111","1000","1001","1010","1011","1100","1101","1110","1111"]

re_one = "1(0*|1*)1"
re_one_matching_word_list = match_against_list(re_one, list_of_words)

re_two = "CHANGEME"
re_two_matching_word_list = match_against_list(re_two, list_of_words)


print("Original list: ", list_of_words)
print("re_one match list: ", re_one_matching_word_list)
print("re_two match list: ", re_two_matching_word_list)
if re_one_matching_word_list == re_two_matching_word_list:
    print("The matches from both regular expressions are the same")
else:
    print("The matches from both regular expressions are different")
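
Agreeing on one hand-written list is weaker evidence than agreeing on every short word. As a rough sketch, the function below compares two regular expressions on all words up to a chosen length and returns the words where they disagree. The name `disagreements_up_to` and its parameters are assumptions made for this example, and the demonstration uses `ab*` rather than the Problem 5 patterns. An empty result is still not a proof of equivalence, since longer words are never checked.

```python
import re
from itertools import product

# Hypothetical helper: words of length 0..max_length over `alphabet`
# that one pattern fully matches and the other does not.
def disagreements_up_to(re_a, re_b, alphabet, max_length):
    disagreements = []
    for length in range(max_length + 1):
        for symbols in product(alphabet, repeat=length):
            word = "".join(symbols)
            in_a = re.fullmatch(re_a, word) is not None
            in_b = re.fullmatch(re_b, word) is not None
            if in_a != in_b:
                disagreements.append(word)
    return disagreements

# "ab*" and "a|ab+" describe the same language, so no word separates them.
print(disagreements_up_to("ab*", "a|ab+", "ab", 4))  # []

# "ab+" requires at least one b, so the word "a" separates it from "ab*".
print(disagreements_up_to("ab*", "ab+", "ab", 2))    # ['a']
```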