# UNCLASSIFIED

Transcribed from FOIA Doc ID: 6689695

https://archive.org/details/comp3321

# (U) Regular Expressions (Regex) 

## (U) Now You’ve Got Two Problems... 

> Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
>
> - Jamie Zawinski, 1997

(U) A regular expression is a tool for finding and capturing patterns in text strings. It is very powerful and can be very complicated; the second problem referred to in the quote is a commentary on how regular expressions are essentially a separate programming language. As a rule of thumb, use the `in` operator or string methods like `find` or `startswith` if they are suitable for the task. When things get more complicated, use regular expressions, but try to use them sparingly, like a seasoning. At times it may be tempting to write one giant, powerful, super regular expression, but a few smaller, simpler patterns are usually easier to read, test, and debug.

(U) The power of regular expressions comes from their special characters. Some, like `^` and `$`, are roughly equivalent to the string methods `startswith` and `endswith`, while others, especially `.` and `*`, allow much more flexible matching.
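
As a minimal sketch of that rough equivalence, the anchors can be compared directly with the string methods. The sample string and variable names here are made up for illustration.

```python
import re

name = "mike"

# ^ anchors the pattern to the start of the string, like str.startswith
starts_with_mi = re.search(r"^mi", name) is not None
# $ anchors the pattern to the end of the string, like str.endswith
ends_with_ke = re.search(r"ke$", name) is not None

print(starts_with_mi, name.startswith("mi"))  # both True
print(ends_with_ke, name.endswith("ke"))      # both True

# . matches any single character and * repeats it zero or more times,
# so this asks for "starts with m, ends with e, anything in between"
loose = re.search(r"^m.*e$", name) is not None
print(loose)
```

The last pattern is where regular expressions start to earn their keep: it combines two checks and a wildcard in one expression.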

## (U) Getting Stuff Done without Regex 

In [None]:
"mike" in "so many mikes!"

In [None]:
"mike".startswith("mi")

In [None]:
"mike".endswith("ke")

In [None]:
"mike".find("k")

In [None]:
"mike".isalpha()

In [None]:
"mike".isdigit()

In [None]:
"mike".replace("k", "c")

## (U) Regular expressions in Python 

There are only a few common functions in the **re** module, but they don't always do what you would first expect. Some functionality is exposed through _flags_, which are integer-like constants defined in the **re** module. They are normally combined with the bitwise OR operator `|`; adding distinct flags with `+` gives the same result, but `|` is the safer habit because it stays correct even if the same flag is included twice.

In [None]:
import re

In [None]:
re.match("c", "abcdef")

In [None]:
re.match("a", "abcdef")

In [None]:
re.search("c", "abcdef")

In [None]:
re.search("C", "abcdef") 

In [None]:
re.search("C", "abcdef", re.I) # re.IGNORECASE

In [None]:
re.search("^c", "ab\ncdef")

In [None]:
re.search("^c", "ab\ncdef", re.M) # re.MULTILINE

In [None]:
re.search("^C", "ab\ncdef", re.M + re.I)

(U) In both `match` and `search`, the _regular expression_ precedes the string to search. The difference between the two functions is that `match` works only at the beginning of the string, while `search` examines the whole string.
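
A short sketch of the difference, using the same kind of sample string as the cells above. It also shows flags combined with bitwise OR (`|`), which is how the Python documentation describes combining them.

```python
import re

text = "ab\ncdef"

# match only tries position 0; search scans forward for the first hit
at_start = re.match("c", text)    # None, because the string starts with "a"
anywhere = re.search("c", text)   # match object for the "c" after the newline

# Flags combined with bitwise OR: ^ may match after a newline, and case is ignored
combined = re.search("^C", text, re.MULTILINE | re.IGNORECASE)

print(at_start)
print(anywhere.start())
print(combined.group())
```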

(U) When repeatedly using the same regular expression, _compiling_ it can speed up processing. After a compiled regular expression is created, `match`, `search`, `findall`, and other methods can be called on it, and given only the search string as a single argument.

In [None]:
c_re = re.compile("c")
c_re.search("abcde")
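
A minimal sketch of reusing one compiled pattern across several strings. The list of sample lines is made up for illustration.

```python
import re

# Compile once, then reuse the pattern object for every line
c_re = re.compile("c")

lines = ["abcde", "xyz", "cab"]  # hypothetical sample data
hits = [line for line in lines if c_re.search(line)]
print(hits)

# Compiled patterns offer the same operations as the module-level functions
all_cs = c_re.findall("cocoa")
starts_with_c = c_re.match("cab") is not None
print(all_cs, starts_with_c)
```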

## Regex Operators 

```
. - matches any character but the newline character. Wildcard
^ - matches beginning of a string (with re.MULTILINE, also the start of each line)
$ - matches end of string (with re.MULTILINE, also the end of each line)
* - 0 or more of something
+ - 1 or more of something
? - 0 or 1 of something
*?, +?, ?? - don’t be greedy (see example below)
{3} - match 3 of something
{2,4} - match 2 to 4 of something
\ - escape character
[lrnLRN] - match any ONE of the letters l, r, n, L, R, N
[a-m] - match any ONE of letters from a to m
[a|m] - match any ONE of a, |, or m (| is literal inside brackets; write a|m to match a or m)
\w - match any letter, number, or underscore. Word characters
\W - match any character that is NOT a letter, number, or underscore
\s - match a whitespace character (space, tab, newline, etc.)
\S - match any character that is NOT whitespace
\d - match a digit 0-9
\D - match any character that is NOT a digit 0-9
```
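
To make a few of these operators concrete, here is a small sketch run against a made-up sample string. It also shows that `|` is a literal character inside square brackets but means "or" outside them.

```python
import re

text = "Room 12, floor 3, ext 4455: call Lynn or Ron"  # hypothetical sample text

digits = re.findall(r"\d+", text)            # one or more digits
two_to_four = re.findall(r"\d{2,4}", text)   # runs of 2 to 4 digits
names = re.findall(r"[LR]\w+", text)         # L or R followed by word characters

print(digits)
print(two_to_four)
print(names)

# Inside a character class, | is just another character to match
in_class = re.findall(r"[a|m]", "a|m")
# Outside a class, | means "either side"
alternation = re.findall(r"a|m", "a|m")
print(in_class, alternation)
```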

In [None]:
re.search(r"\w*s$", "Mike likes cheese\nand Mike likes bees")

In [None]:
re.findall(r"\(\d{3}\)\d{3}-\d{4}", "Hello, I am a very bad terrorist. If you wanted to know, my phone number is (303)555-2345")

In [None]:
# greedy search will match everything between the 1st 'mi' and the last 'ke'
re.findall("mi.*ke", "i am looking for mike and not all this stuff in between mike, but micheal and ike is okay.")

In [None]:
# the '?' after '*' makes the search non-greedy: each match stops at the first 'ke' that follows its 'mi'
re.findall("mi.*?ke", "i am looking for mike and not all this stuff in between mike, but micheal and ike is okay.")

### Interlude

How would we have recognized the bad terrorist's phone number without a regex? We could write a function that could recognize phone numbers. What would that function look like?

In [None]:
def match_phone_numbers(text):
    if len(text) != 13:
        return False
    if text[0] != '(' or text[4] != ')':
        return False
    if text[8] != '-':
        return False
    for i in range(1, 4):
        if not text[i].isdecimal():
            return False
    for i in range(5, 8):
        if not text[i].isdecimal():
            return False
    for i in range(9, len(text)):
        if not text[i].isdecimal():
            return False
    return True

terror_message = "Hello, I am a very bad terrorist. If you wanted to know, my phone number is (303)555-2345"

for word in terror_message.split():
    if match_phone_numbers(word):
        print('Phone number found!')
        print(word)

That function took up 17 lines and it can really only match phone numbers that are in the same format as our bad terrorist's number: (303)555-2345. What if there's a space between the area code and the main number? `.split()` will treat that as two separate words and we won't be able to match it. What if someone writes the area code separated from the rest of the number by a `-` instead of by parentheses `()`? The phone number is now going to be 12 characters long instead of 13 and our length check won't work anymore.

There are other ways of breaking chunks of text up besides `.split()` that might help us, but regular expressions are ideal when you're looking for patterns instead of exact text because they provide the language for setting up our matches for us.

Here's an example of a regular expression that will match multiple phone number formats:

In [None]:
phone_re = re.compile(r'(\(?\d{3}\)?-?\s?)?(\d{3}-\d{4})')

In [None]:
phone_re.search("Hello, I am a very bad terrorist. If you wanted to know, my phone number is 303-555-2345")

In [None]:
phone_re.search("Hello, I am a very bad terrorist. If you wanted to know, my phone number is (303) 555-2345")

In [None]:
phone_re.search("Hello, I am a very bad terrorist. If you wanted to know, my phone number is 555-2345")

In [None]:
phone_re.search("Hello, I am a very bad terrorist. If you wanted to know, my phone number is (303)555-2345")

Let's look at our regular expression and break it down:

`r'(\(?\d{3}\)?-?\s?)?(\d{3}-\d{4})'`

Notice there is an `r` in front of our regular expression. It's very common to set up our regular expression patterns as raw strings since regular expression patterns usually have lots of escape characters in them. This can sometimes make things easier, especially if we need to match a literal backslash. It's probably a good idea to get in the habit of using raw strings for patterns even though it isn't always necessary.
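
A small sketch of why the raw string matters, using `\b` as the example. In a normal Python string `\b` is a backspace character, while in a regular expression it is the word-boundary token. The sample text is made up for illustration.

```python
import re

# In a normal string, Python turns "\b" into a single backspace character
normal_len = len("\b")
# In a raw string the backslash survives, so re sees the two-character \b token
raw_len = len(r"\b")
print(normal_len, raw_len)

text = "so many mikes, but only one mike"  # hypothetical sample text
whole_word = re.findall(r"\bmike\b", text)  # only the standalone word
no_raw = re.findall("\bmike\b", text)       # searches for backspace characters instead
print(whole_word)
print(no_raw)
```

Without the `r` prefix the second pattern silently matches nothing, which is the kind of bug the raw-string habit avoids.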

After the raw string starts we notice we have several parentheses in our pattern. Some of them are escaped with backslashes and others aren't. The escaped parenthesis characters are literal parentheses, and the non-escaped parentheses set up capture groups, the first of which is optional. More about that in the Capture Groups section below, but basically most of our regular expression pattern is for matching different types of area codes and will let our pattern match a phone number that doesn't even have an area code.

`(\(?\d{3}\)?-?\s?)?` is the part of our pattern that is specific to the area code. The outermost parentheses set up the capture group for the area code and the trailing question mark makes the whole capture group optional.

- `(` Start the capture group.
- `\(?` Optionally match a literal opening parenthesis.
- `\d{3}` Match three numeric characters.
- `\)?` Optionally match a closing parenthesis.
- `-?` Optionally match a hyphen.
- `\s?` Optionally match a whitespace character, such as a space.
- `)?` Close the capture group and make the whole thing optional.

That takes care of the area code, but what about the remaining bits?

- `(` Start a capture group for the main part of the phone number.
- `\d{3}` Match three numbers.
- `-` Match a hyphen.
- `\d{4}` Match four numbers.
- `)` Close our second capture group and keep it non-optional.

None of the elements in this second group are optional, so something that looks like 555-1212 will always match whether there's an area code or not.
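
The same breakdown can be written into the pattern itself. This is a sketch using the `re.VERBOSE` flag, which ignores whitespace in the pattern and allows `#` comments; the name `phone_re_verbose` is made up here, and the pattern is intended to behave like the one-line version above.

```python
import re

phone_re_verbose = re.compile(r"""
    (               # start the area-code group
        \(?         # optional opening parenthesis
        \d{3}       # three digits
        \)?         # optional closing parenthesis
        -?          # optional hyphen
        \s?         # optional whitespace character
    )?              # close the group and make it optional
    (               # start the main-number group
        \d{3}       # three digits
        -           # a hyphen
        \d{4}       # four digits
    )               # close the main-number group
""", re.VERBOSE)

with_area = phone_re_verbose.search("call (303) 555-2345").groups()
without_area = phone_re_verbose.search("call 555-2345").groups()
print(with_area)
print(without_area)
```

When the optional area-code group does not take part in the match, its entry in `groups()` is `None`.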

This regular expression isn't perfect. It will match some weird but unlikely strings.

In [None]:
phone_re.findall('These are not phone numbers: (000)- 000-0000, 000)000-0000, (000- 000-0000')

Even though it is catching some things we might want to exclude, it should catch American-style phone numbers written in any of the formats shown above. We could also improve it by adding a capture group to catch country codes. Sometimes you have to decide how much tolerance you have for false positives and whether it's worth the extra effort to craft a more precise regular expression.

### Adding some logic

The previous problem with matching invalid area codes can be solved by using the alternation ("or") operator `|` to match more precisely.

`(\d{3}-|\(\d{3}\)\s?)?` This capture group will capture either three digits without parentheses followed by a hyphen, or three digits surrounded by parentheses and followed by an optional space.

Let's try it on our bad phone numbers and some good phone numbers and see what it finds.

In [None]:
phone_re = re.compile(r'(\d{3}-|\(\d{3}\)\s?)?(\d{3}-\d{4})')

In [None]:
phone_nums = '(000)- 000-0000, 000)000-0000, (000- 000-0000, 555-1212, (801)555-1212, (801) 555-1212, 801-555-1212'

phone_re.findall(phone_nums)

Success? Sort of. It still matched the main part of each invalid phone number, because `\d{3}-\d{4}` is a valid match on its own, but at least it excluded the invalid area codes. These days the area code isn't really optional anymore, so maybe we can just drop the question mark from that capture group and require our matches to have valid area codes. As long as the area code portion is optional, the pattern is still going to match on the valid main number. There's more we could do to fine-tune this to avoid matching invalid area codes, but this seems like a good place to stop and move on. Let's learn more about capture groups.
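
As a sketch of that last idea, here is the same pattern with the question mark removed from the area-code group, so an area code in one of the two accepted forms is required. The name `strict_phone_re` is made up for this example.

```python
import re

# Same pattern as above, but the area-code group is no longer optional
strict_phone_re = re.compile(r'(\d{3}-|\(\d{3}\)\s?)(\d{3}-\d{4})')

phone_nums = '(000)- 000-0000, 000)000-0000, (000- 000-0000, 555-1212, (801)555-1212, (801) 555-1212, 801-555-1212'

strict_matches = strict_phone_re.findall(phone_nums)
for area_code, main_number in strict_matches:
    print(repr(area_code), main_number)
```

Only the last three numbers should survive. The trade-off is that a bare 555-1212 is now rejected too.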

## Capture Groups 

Put what you want to pull out of the strings in parentheses () 

In [None]:
my_string = "python is the best language for doing 'pro'gramming" 
result = re.findall(r"'(\w+)", my_string)
print(result) 
print(result[0]) 

## Matches and Groups 

(U) The return value from a successful call of `match` or `search` is a _match object_; an unsuccessful call returns `None`. This makes the result suitable for direct use in `if` statements, such as `if c_re.search("abcde"): ...`. For complicated regular expressions, the match object has all the details about the substring that was matched, as well as any captured groups, i.e. regions surrounded by parentheses in the regular expression. These are available via the `group` and `groups` methods. Group 0 is always the whole matching string, after which remaining groups (which can be nested) are numbered according to the position of their opening parenthesis.

In [None]:
m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist") 

In [None]:
m.group() 

In [None]:
m.group(1) 

In [None]:
m.group(2) 

In [None]:
m.groups()
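
A short sketch of both points: guarding with `if` before touching the match object, and how nested groups are numbered by their opening parenthesis. The nested pattern is a variation on the one above, made up for illustration.

```python
import re

m = re.match(r"((\w+) (\w+)), (\w+)", "Isaac Newton, physicist")

if m:  # a match object is truthy; None (no match) is falsy
    print(m.group(0))  # the whole match
    print(m.group(1))  # outer group, because its parenthesis opens first
    print(m.group(2))  # first inner group
    print(m.group(3))  # second inner group
    print(m.group(4))  # group after the comma

no_match = re.match(r"(\w+) (\w+)", "!!!")
if no_match is None:
    # Calling no_match.group() here would raise AttributeError
    print("no match")
```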

## Other Methods 

(U) Other regular expression methods work through all matches in the string, although what is returned is not always straightforward, especially when captured groups are involved. We demonstrate some basic uses without captured groups. When doing more complicated things, please remember: be careful, read the documentation, and do experiments to test!

In [None]:
re.findall("a.c", "abcadcaecafc") # returns list of strings

In [None]:
re.finditer("a.c","abcadcaecafc") # returns iterator of match objects

In [None]:
re.split("a.", "abcadcaecafc") # returns list of strings.
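
Here is a sketch of how the return values change once captured groups are involved, using the same sample string. This is the sort of experiment worth running before relying on these functions.

```python
import re

text = "abcadcaecafc"

no_groups = re.findall(r"a.c", text)       # list of whole matches
one_group = re.findall(r"a(.)c", text)     # list of just the group's text
two_groups = re.findall(r"(a)(.)c", text)  # list of tuples, one item per group
print(no_groups)
print(one_group)
print(two_groups)

# finditer yields match objects, so every detail of each match is available
positions = [(m.start(), m.group(), m.group(1)) for m in re.finditer(r"a(.)c", text)]
print(positions)

# With a capture group in the pattern, split keeps the separators in the result
plain_split = re.split(r"a.", text)
keeping_separators = re.split(r"(a.)", text)
print(plain_split)
print(keeping_separators)
```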

(U) The `sub` method returns a modified copy of the target string. The first argument is the regular expression to match; the second is the replacement, which can be either a string or a function; and the third is the string on which the substitutions are carried out. If `sub` is passed a function, the function should take a single match object as its argument and return a string. If a replacement string needs to reference captured groups from the regular expression, it can do so using the syntax `\g<number>`, which is equivalent to calling the `group` method with that number inside a function.

In [None]:
re.sub("a.*?c", "a--c", "abracadabra")

In [None]:
re.sub(r"a(.*?)c", r"a\g<1>\g<1>c", "abracadabra")

In [None]:
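def reverse_first_group(matchobj):
    match = matchobj.group()
    rev_group = matchobj.group(1)[::-1]
    # start() and end() index into the searched string, so shift them
    # by the start of the match before slicing the matched text
    offset = matchobj.start()
    return match[:matchobj.start(1) - offset] + rev_group + match[matchobj.end(1) - offset:]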

In [None]:
re.sub("a(.*?)c", reverse_first_group, "abracadabra")

(U) In the above, we used `start` and `end`, which are methods on a match object that take a single numeric argument, the group number, and return the starting and ending indices of the captured group within the searched string.
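
A small sketch of those indices on a made-up string where the match does not begin at position 0, which makes it easier to see that they refer to the searched string rather than to the matched text.

```python
import re

text = "xx-abrac-yy"  # hypothetical sample text
m = re.search(r"a(.*?)c", text)

whole = (m.start(), m.end())     # indices of the whole match (group 0)
inner = (m.start(1), m.end(1))   # indices of group 1
print(whole, inner)

# span(n) returns the same pair in one call
print(m.span(1))

# Slicing the searched string with these indices recovers the group
print(text[m.start(1):m.end(1)])
```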

(U) One final warning: if a group can be captured more than once, for instance when its definition is followed by a `+` or a `*`, then only the last occurrence of the group will be captured and stored.
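
A minimal sketch of that warning, using a made-up pattern and string. One possible workaround, shown at the end, is to match the repeated piece on its own with `findall`.

```python
import re

m = re.match(r"(\w\d)+", "a1b2c3")

whole = m.group(0)   # the repetition as a whole matched all three pairs
last = m.group(1)    # but the group only remembers its final repetition
print(whole, last)

# Collect every repetition by matching the inner piece by itself
pairs = re.findall(r"\w\d", "a1b2c3")
print(pairs)
```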

## Resources:

- Regular Expression Tester https://regex101.com/
    - Paste in some text to match against and see how different patterns will match against that text.
- Python RegEx Module Documentation https://docs.python.org/3/library/re.html
    - Read the docs.
- The book _Automate the Boring Stuff with Python_ has a very good chapter about regular expressions. It's available in Safari Books.
