# Regular Expressions

Regular expressions provide a powerful and flexible way to search for or match patterns in text, and they can be combined with other methods for processing textual data. They are a general concept in computer science and are not specific to Python; they are available in many other languages as well (e.g., C, Perl). Regular expressions can be used for tasks such as:

* Parsing data in many formats (e.g., unstructured text, CSV, HTML, XML, JSON)
* Extracting data elements (e.g., email addresses, URLs, IP addresses, user names)
* Text formatting and standardization
* Data validation
* Syntax highlighting
* Natural language processing

Today, we will explore the core functionality of regular expressions within the **re** module. However, this is a LARGE topic, and one that we will not be able to cover in its entirety. For additional information, please refer to the Python documentation for the **re** module:

* Tutorial: https://docs.python.org/3/howto/regex.html#regex-howto
* Complete: https://docs.python.org/3/library/re.html

Next time, we will focus on processing text more generally, which will combine what we know about string methods, regular expressions, plus vectorized string methods in pandas.

Friendly Reminders:

* Homework #3 due tonight by 11:59 p.m.
* Project proposal due March 7 by 11:59 p.m.
* Homework #4 released, due March 14 by 11:59 p.m.

In [1]:
import re

## Character Matching

A single regular expression--often called a *regex* or a *pattern*--is a string object that contains the pattern of text that you want to match. Oftentimes, there are multiple regexes that work for matching a pattern of a given string. **_However, you must be VERY careful with the syntax of your regex; otherwise, you may not get the match that you are seeking_**. 

Some examples of patterns could include text such as:

* Email addresses (user@domain.com)
* IP addresses (172.16.254.1)
* URL (http://www.google.com)
* Dates and times (6-Mar-2018 12:30:00 P.M.)
* Phone numbers (301-555-1234)
* And more!

The regex itself often contains a mix of regular characters and metacharacters that specify the pattern that you want to match, which can range from very simple to very complex. Regular characters will match the text exactly (i.e., in a literal sense), whereas metacharacters are more flexible and can capture variations in text. Regex metacharacters include:

```
. ^ $ * + ? { } [ ] \ | ( )
```

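To match a metacharacter in a literal sense, you must precede the metacharacter with a backslash (e.g., '\\\\$') or enclose it in brackets (e.g., '[\$]'). Standard Python string escape sequences are recognized as well (e.g., '\t', '\n'). Because the backslash is also Python's own escape character, it is good practice to write regex patterns as raw strings (e.g., r'\d+'), which pass backslashes through to the **re** module unchanged.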
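As a minimal sketch of the escaping rules above, the cell below compares an unescaped pattern with its escaped forms. The sample text is made up for illustration, and re.escape is shown as one convenient way to escape an arbitrary literal string.

```python
import re

text = 'Total: $4.52 (approx.)'

# Unescaped, '$' anchors at the end of the string and '.' matches any character,
# so this pattern does not find the literal text '$4.52'.
unescaped = re.search('$4.52', text)
print(unescaped)   # None

# Escape with a backslash (in a raw string) or wrap in brackets to match literally.
backslash = re.search(r'\$4\.52', text).group()
brackets = re.search(r'[$]4[.]52', text).group()
print(backslash, brackets)   # $4.52 $4.52

# re.escape builds the escaped pattern for an arbitrary literal string.
literal = '(approx.)'
escaped_match = re.search(re.escape(literal), text).group()
print(escaped_match)   # (approx.)
```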

Let's begin by defining a function that will check whether or not a regex matches a pattern in a string:

In [3]:
def check_re(pat, S):
    # Return True if the pattern is found anywhere in S; otherwise, return False
    return bool(re.search(pat, S))

Now, let's explore the simplest case, for which we want to match a pattern with a specific substring that only contains regular characters. For these cases, if we only want to verify whether a string contains the substring, we can simply use **in**. Regular expressions work as well, but they are not needed for such simple cases.

In [4]:
S = 'This is a tweet about apples'

In [5]:
# Standard string method approach
'apple' in S

True

In [6]:
# Check regex for apple
check_re('apple', S)

True

Regular expressions become increasingly more useful when there is more potential variation in the text.

In [7]:
S = 'This is a tweet about Apple'

In [8]:
# Case-sensitive string method approach
'apple' in S

False

In [9]:
# Case-insensitive string method approach
'apple' in S.lower()

True

In [10]:
# Check regex for Apple or apple
pat = '[Aa]pple'
check_re(pat, S)

True

Let's suppose we want to match text with alternative spellings.

In [11]:
S = 'Clouds are gray'
('gray' in S) or ('grey' in S)

True

In [12]:
pat = 'gr[ae]y'
[check_re(pat, S) for S in ['Clouds are grey in the U.K.', 'Clouds are gray in the U.S.']]

[True, True]

As we've seen, we can use square brackets [ ] to enclose a set of characters in our regex that can be used to match a single character in a string. You can also specify a range of characters or digits:

* [a-z] will match any lowercase letter (equivalent to string.ascii_lowercase)
* [A-Z] will match any uppercase letter (equivalent to string.ascii_uppercase)
* [0-9] will match any digit (equivalent to string.digits)

Abbreviated and combined ranges also work (e.g., [a-e], [1-4], [a-e1-4]).

In [13]:
pat = '20[01][0-9]'
year = 2000
check_re(pat, 'This data was collected in the year %d' % year)

True

We can also use [^ ] to specify characters that we do not want to include when matching a single character in a string (set complement). 

In [14]:
# Standard string methods approach
import string
letters_to_exclude = list('etaoin')
letters_to_include = [let for let in string.ascii_lowercase if let not in letters_to_exclude]
print('Letters to exclude:', letters_to_exclude)
print('Letters to include:', letters_to_include)
word = 'eat'
any([let in word for let in letters_to_include])

Letters to exclude: ['e', 't', 'a', 'o', 'i', 'n']
Letters to include: ['b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'p', 'q', 'r', 's', 'u', 'v', 'w', 'x', 'y', 'z']


False

In [15]:
# Test for a word that contains at least one character outside of the most popular English letters
word = 'great'
check_re('[^etaoin]', word)

True

In addition, we can use the **or** operator ('|') to specify multiple options for matching text. The **or** operator is much more flexible than specifying a set of characters, as it can more easily capture patterns that differ by more than one or two characters.

In [16]:
pat = 'gray|grey'
[check_re(pat, word) for word in ['grey', 'gray', 'green']]

[True, True, False]

In [17]:
pat = 'blue|yellow|red'
[check_re(pat, S) for S in ['green','red','purple','yellow','orange','blue']]

[False, True, False, True, False, True]

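Most metacharacters listed inside brackets will be matched literally (e.g., $, .), so you do not need the escape backslash that would be required outside of brackets. The exceptions are the backslash itself, a closing bracket, a caret in the first position (which negates the set), and a hyphen between two characters (which defines a range).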

In [18]:
pat = r'[$][0-9]\.[0-9][0-9]'
check_re(pat, 'I paid $4.52 for my coffee this morning')

True
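As a caveat to the rule above, a few characters keep a special role inside brackets. The sketch below uses a made-up list of candidate characters to show which ones a bracketed set matches literally, and how a leading caret changes the meaning of the set.

```python
import re

# Inside brackets: '$' is literal, '^' is literal when it is not first, and '-'
# is literal when it is last, while ']' and '\' still need a backslash.
pat = r'[$^\]\\-]'
candidates = ['$', '^', ']', '\\', '-', 'a']
matched = [c for c in candidates if re.search(pat, c)]
print(matched)

# A leading '^' negates the set instead of matching a literal caret.
negated = bool(re.search(r'[^$]', '$'))
print(negated)   # False
```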

In addition to using brackets to specify search patterns for special characters, there are several special sequences that represent specific sets of characters. The most commonly used are listed below. The bracketed equivalents hold for ASCII text; by default, Python 3 string patterns also match the corresponding Unicode characters unless the re.ASCII flag is set.

* \d - Represents all digits, and is equivalent to [0-9]
* \s - Represents all whitespace characters, and is equivalent to [ \t\n\r\f\v] (note the space character in the first position)
* \w - Represents all alphanumeric characters, including the underscore ( \_ ) and is equivalent to [a-zA-Z0-9\_]

There are also complements to these sets, represented by the corresponding capital letter:

* \D - Represents all non-digits, and is equivalent to [^0-9]
* \S - Represents all non-whitespace characters, and is equivalent to [^ \t\n\r\f\v]
* \W - Represents all non-alphanumeric characters, and is equivalent to [^a-zA-Z0-9\_]

These character sets can also be specified within square brackets (e.g., [\s;,.!?]).
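To illustrate a special sequence inside brackets, here is a small sketch that looks for the word 'apple' followed by whitespace or punctuation. The sample strings are invented for the example.

```python
import re

# 'apple' followed by a whitespace character or one of ; , . ! ?
pat = r'apple[\s;,.!?]'
samples = ['I ate an apple.', 'apple pie', 'pineapples']
results = [bool(re.search(pat, s)) for s in samples]
print(results)   # [True, True, False]
```

The last string fails because 'apple' is followed by the letter 's', which is not in the set.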

The **.** metacharacter matches any character except for a newline character (\n) by default (the re.DOTALL flag lifts this restriction), and is especially useful when applying regular expressions to unstructured text.
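The newline behavior of the dot can be checked with a short sketch. The two-line string is made up, and re.DOTALL is the standard flag that lets the dot match a newline as well.

```python
import re

text = 'line one\nline two'

# By default, '.' does not match '\n', so the match cannot cross lines.
default_match = re.search(r'one.line', text)
print(default_match)   # None

# With re.DOTALL, '.' also matches the newline character.
dotall_match = re.search(r'one.line', text, flags=re.DOTALL)
print(dotall_match.group() == 'one\nline')   # True
```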

In [19]:
# Money
pat = r'[$]\d[.]\d\d'
check_re(pat, 'I paid $4.52 for my coffee this morning')

True

In [20]:
# Numbering conventions
pat = r'[A-Z]\d\d'
check_re(pat, 'Now serving customer C52 at booth 14')

True

In [21]:
# Washington, DC license plates
pat = r'[A-Z][A-Z] \d\d\d\d'
check_re(pat, 'In Washington, D.C., I saw a car with a license plate number of AA 1010')

True

In [22]:
# Virginia license plates
pat = r'[A-Z][A-Z][A-Z]-\d\d\d\d'
check_re(pat, 'In Virginia, I saw a car with a license plate number of ABC-1010')

True

In [23]:
# Dates
pat = r'\w\w\w \d'
check_re(pat, 'Today, the date is Mar 6.')

True

In [24]:
# Times
pat = r'[\s\d]\d:\d\d [AP]\.M\.'
check_re(pat, 'Right now, the time is 7:06 P.M.')

True

In [25]:
# Space or tab-delimited data
pat = r'\w\s\d\s\d\s\d'
S = 'A\t1 5\n10'
print(S)
check_re(pat, S)

A	1 5
10


True

## Quantification

Up to this point, we have only been able to perform character-by-character pattern matching on a string. In practice, we do not always have the luxury of processing structured string patterns such as monetary strings, dates, times, or other structured formats. And even for the structured formats, there may still be some variability (e.g. \\$12.38 vs. \\$8.92).

Quantification allows us to specify the number of times that a specific element of a pattern is allowed to be repeated. The most common quantifiers are:

* \* \- specifies that the previous element can be matched zero or more times (equivalent to {0,})
* \+ \- specifies that the previous element can be matched one or more times (equivalent to {1,})
* ? \- specifies that the previous element can be matched zero or one time(s) (equivalent to {0,1})
* {n} - specifies that the previous element must be matched exactly *n* times
* {n,m} - specifies that the previous element must be matched between *n* and *m* times
* {n,} - specifies that the previous element must be matched at least *n* times
* {,m} - specifies that the previous element must be matched at most *m* times

Quantifiers are greedy by default, which means that they will try to match as many characters as possible (while still getting a match). If this behavior does not match what you want, you can append the '?' character to the end of the quantifier to make it non-greedy (lazy), so that it matches as few characters as possible (e.g., \*?, \+?, ??, { }?).
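The difference between greedy and non-greedy matching is easiest to see side by side. The sketch below uses a made-up HTML-like string; the only change between the two patterns is the trailing '?' on the quantifier.

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: '.+' runs to the LAST '>' that still allows a match.
greedy = re.search(r'<.+>', html).group()
print(greedy)   # <b>bold</b> and <i>italic</i>

# Non-greedy: '.+?' stops at the FIRST '>' that allows a match.
lazy = re.search(r'<.+?>', html).group()
print(lazy)     # <b>
```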

In [26]:
# Money
pat = r'[$]\d*[.]\d\d'
check_re(pat, 'I paid $14.52 for my coffee this morning')

True

In [27]:
# Dates
pat = r'\w\w\w \d+'
check_re(pat, 'Today, the date is Mar 06.')

True

In [28]:
# More dates
pat = r'\d?\d-[A-Za-z]+-\d{2,4}'
check_re(pat, 'Today, the date is 16-March-18')

True

## Regex Methods

By combining character matching and quantification, you can create some extremely complex patterns to match various types of text.

Given that you have defined an appropriate regex, what can you do with it? The **re** module contains several functions for applying a regex to a string, which are categorized into 3 types:

* Pattern matching
    - re.search, re.match, re.fullmatch, re.findall, re.finditer
* Substitution
    - re.sub, re.subn
* Splitting
    - re.split
    
All of these functions accept a regex pattern, the string on which to search for the pattern, and an optional set of flags that specify how the pattern matching should be performed (e.g., re.IGNORECASE). Additionally, the substitution functions accept the replacement for each matched pattern (either a string or a function that is applied to each non-overlapping match), and an optional count argument that specifies the maximum number of replacements.
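As a rough sketch of the optional arguments described above, the cell below applies the re.IGNORECASE flag and the count argument to a made-up string.

```python
import re

text = 'Apple, apple, APPLE'

# Without flags, matching is case-sensitive.
n_default = len(re.findall(r'apple', text))
print(n_default)   # 1

# re.IGNORECASE matches regardless of case.
n_ignorecase = len(re.findall(r'apple', text, flags=re.IGNORECASE))
print(n_ignorecase)   # 3

# count limits the number of replacements made by a substitution function.
replaced = re.sub(r'apple', 'pear', text, count=2, flags=re.IGNORECASE)
print(replaced)   # pear, pear, APPLE
```

Passing flags and count by keyword avoids mixing up their positions in the argument list.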

**re.search** and **re.match** both attempt to match the pattern within a given string object. The difference is that re.match will attempt to match the pattern from the beginning of the string, whereas re.search is more general and will search for the pattern throughout the string until it finds a match (if there is one). Technically, re.search can be used to search at the beginning of a string by placing the '^' character at the beginning of the regex pattern.

If processing a consistently structured set of text (e.g., one statement per line), re.match is probably more appropriate, whereas re.search is probably more appropriate for unstructured text. Either function could work for many cases.

**re.fullmatch** succeeds only if the pattern matches the entire string, from the first character to the last.
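The three matching functions can be compared on the same input. In this sketch, the phone number pattern and the sample strings are invented for illustration; re.fullmatch behaves like a strict validator.

```python
import re

pat = r'\d{3}-\d{3}-\d{4}'
with_extra = '301-555-1234 ext. 5'
exact = '301-555-1234'

# search and match both succeed: the pattern is found at the start of the string.
found_search = bool(re.search(pat, with_extra))
found_match = bool(re.match(pat, with_extra))

# fullmatch fails when anything is left over after the pattern.
found_full_extra = bool(re.fullmatch(pat, with_extra))
found_full_exact = bool(re.fullmatch(pat, exact))

print(found_search, found_match, found_full_extra, found_full_exact)
# True True False True
```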

These pattern matching functions return a **match object** if a match is found; otherwise, they will return None. Oftentimes, you can check whether a match is found by combining a pattern matching attempt with a conditional (e.g., if statement, ternary expression, np.where). The match object has several methods for extracting information about the match:

* .group - returns the text that matched the regex pattern
* .start - returns the starting index of the text that matched the regex pattern
* .end - returns the end index of the text that matched the regex pattern
* .span - returns a tuple of (start, end) indices of the text that matched the regex pattern

In addition, regexes allow you to specify groups within a given pattern. Any patterns enclosed within parentheses ( ) will be assigned to a group (numbered in order, beginning at 1), and this allows you to extract the specific matched pattern and use it for other purposes (e.g., assign to a variable, append to a list). The above methods accept a group number (if applicable), which returns the associated information about the specific group.

In [29]:
# Extracting dates
pat = r'\d?\d-[A-Za-z]+-\d{2,4}'
S = 'Today, the date is 6-Mar-2018'
m = re.search(pat, S)
m

<re.Match object; span=(19, 29), match='6-Mar-2018'>

In [30]:
# .group method
m.group()

'6-Mar-2018'

In [31]:
# .start and .end methods
S[m.start():m.end()]

'6-Mar-2018'

In [32]:
# .span method
start, end = m.span()
S[start:end]

'6-Mar-2018'

In [33]:
# match function
m = re.match(pat,S)
print(m)

None


In [34]:
# Revised search pattern for match
pat = r'.*\d?\d-[A-Za-z]+-\d{2,4}'
m = re.match(pat,S)
print(m)

<re.Match object; span=(0, 29), match='Today, the date is 6-Mar-2018'>


In [35]:
# Another revised search pattern for match
pat = r'.*(\d?\d-[A-Za-z]+-\d{2,4})'
m = re.match(pat,S)
print(m)

<re.Match object; span=(0, 29), match='Today, the date is 6-Mar-2018'>


In [36]:
# .group method - specify group number
m.group(1)

'6-Mar-2018'

In [37]:
# .span method - specify group number
m.span(1)

(19, 29)

The **re.findall** and **re.finditer** functions also match a pattern within a given string, but rather than stopping at the first match, they return all non-overlapping matches. The re.findall function returns the results as a list of strings (or a list of tuples if the pattern contains more than one group). The re.finditer function returns an iterator of match objects, which is useful for a process with many potential matches.

In [38]:
# Times
pat = r'\d{1,2}:\d\d'
S = 'The games tomorrow will begin at 1:00, 7:00, and 10:00.'
re.findall(pat, S)

['1:00', '7:00', '10:00']

In [39]:
# Times with groups
pat = r'(\d{1,2}):(\d\d)'
S = 'The games tomorrow will begin at 1:00, 7:00, and 10:00.'
re.findall(pat, S)

[('1', '00'), ('7', '00'), ('10', '00')]
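For comparison, the same pattern can be run through re.finditer. This sketch reuses the string from the cells above and shows that each item produced by the iterator is a match object, so the .group and .span methods are available for every match.

```python
import re

pat = r'(\d{1,2}):(\d\d)'
S = 'The games tomorrow will begin at 1:00, 7:00, and 10:00.'

times = []
hours = []
for m in re.finditer(pat, S):
    start, end = m.span()
    # The span indexes back into the original string.
    times.append(S[start:end])
    # Group 1 holds the hour captured by the first set of parentheses.
    hours.append(m.group(1))

print(times)   # ['1:00', '7:00', '10:00']
print(hours)   # ['1', '7', '10']
```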

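The substitution functions--**re.sub** and **re.subn**--are especially useful for data standardization. Both replace all non-overlapping instances of the matched pattern by default, or at most *count* instances if the optional count argument is given. The difference is in the return value: re.sub returns the new string, whereas re.subn returns a tuple of the new string and the number of substitutions made.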

In [40]:
# Phone numbers
pat = r'\(?(\d{3})\)?\s*(\d{3})-?(\d{4})'
S = 'My phone number is (301) 555-1234.'
repl = lambda m: '(' + m.group(1) + ') ' + m.group(2) + '-' + m.group(3)
re.sub(pat, repl, S)

'My phone number is (301) 555-1234.'
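To see a substitution that actually changes the text, here is a sketch with differently formatted phone numbers. The pattern is a hypothetical variation on the one above (it also tolerates dots and missing separators), the sample string is made up, and the replacement uses the backreferences \1, \2, and \3 in place of the lambda.

```python
import re

# Assumed variation of the phone pattern: optional parentheses, then any mix of
# whitespace, dots, or hyphens between the three groups of digits.
pat = r'\(?(\d{3})\)?[\s.-]*(\d{3})[\s.-]*(\d{4})'
S = 'Call 301.555.1234 or 2025550199 or (703) 555-0100.'

# re.subn returns the new string together with the number of substitutions.
new_S, n_subs = re.subn(pat, r'(\1) \2-\3', S)
print(new_S)    # Call (301) 555-1234 or (202) 555-0199 or (703) 555-0100.
print(n_subs)   # 3

# The count argument limits how many matches are replaced.
first_only = re.sub(pat, r'(\1) \2-\3', S, count=1)
print(first_only)   # Call (301) 555-1234 or 2025550199 or (703) 555-0100.
```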

The **re.split** function is a more flexible implementation of the standard string split method, which can split a string using regular expression patterns.

In [41]:
# Parsing URLs
pat = r'://|\.|\?|#|/'
S = 'scheme://www.host.com/path?query#fragment'
re.split(pat,S)

['scheme', 'www', 'host', 'com', 'path', 'query', 'fragment']

## Compiling Regex Objects

Up to this point, we have created a string object that contains the regex pattern, and input this object as the pattern argument into a regex function. For patterns that we expect to use repeatedly (e.g., in a loop), it can be more efficient, and is often more readable, to compile the pattern once up front and then use the compiled object directly. You can compile regex objects using the **re.compile** function, where you can specify any optional flags similar to the standard **re** functions (see previous section). A compiled **re** object provides the same functions as the **re** module, accessed as methods of the compiled object.

In [42]:
date_regex = re.compile(r'\d?\d-[A-Z]+-\d{2,4}', flags=re.IGNORECASE)
date_regex

re.compile(r'\d?\d-[A-Z]+-\d{2,4}', re.IGNORECASE|re.UNICODE)

In [43]:
date_list = ['5-APR-18', '22-February-2018', '12-august-18', '2/8/2016', 'March, 28, 2018']

In [44]:
for dt in date_list:
    m = date_regex.search(dt)
    if m:
        print(m.group())
    else:
        print('No match found, date invalid.')

5-APR-18
22-February-2018
12-august-18
No match found, date invalid.
No match found, date invalid.
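A compiled object is not limited to .search. As a small sketch, the cell below compiles the time pattern from the previous section (the name time_regex is chosen here for illustration) and calls the .findall and .sub methods on it.

```python
import re

# Compile once, then reuse the object's methods.
time_regex = re.compile(r'(\d{1,2}):(\d\d)')
S = 'The games tomorrow will begin at 1:00, 7:00, and 10:00.'

# Same result as re.findall(pattern, S), without passing the pattern again.
pairs = time_regex.findall(S)
print(pairs)   # [('1', '00'), ('7', '00'), ('10', '00')]

# Same as re.sub(pattern, repl, S); backreferences refer to the groups.
reformatted = time_regex.sub(r'\1h\2', S)
print(reformatted)   # The games tomorrow will begin at 1h00, 7h00, and 10h00.
```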


## Next Time: Text Processing