# Introduction to Data Science – Regular Expressions
*COMP 5360 / MATH 4100, University of Utah, http://datasciencecourse.net/* 

In this lecture we'll learn about regular expressions. Regular expressions are a way to match strings. They are very useful for finding (and replacing) text, for extracting structured information such as e-mail addresses or phone numbers, for cleaning up text that was entered by humans, and for many other applications. 

In Python, regular expressions are available as part of the [`re`](https://docs.python.org/3/library/re.html#module-re) module. There are various [good](https://docs.python.org/3/howto/regex.html) [tutorials](https://developers.google.com/edu/python/regular-expressions) available on which this document is partially based. 

The basic syntax to search for a match in a string is this: 

```python
match = re.search(pattern, text)
```

Here, `pattern` is the regular expression and `text` is the string that the regular expression is applied to. `match` holds the search result as a match object, or `None` if the pattern was not found.

[`search()`](https://docs.python.org/3/library/re.html#re.search) returns only the first occurrence of a match; in contrast, [`findall()`](https://docs.python.org/3/library/re.html#re.findall) returns all matches.

Another useful function is [`split()`](https://docs.python.org/3/library/re.html#re.split), which splits a string based on a regex pattern – we'll use all of these functions and others where appropriate. 

We'll mostly use `search()` to explore the syntax, and occasionally `split()` when it illustrates a pattern better.
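
As a quick side-by-side, here is a minimal sketch of the three functions applied to the same string. The sample text and the variable names are made up for illustration.

```python
import re

# hypothetical sample text, chosen only for illustration
sample = "one 1, two 22, three 333"

first = re.search(r"\d+", sample)         # match object for the first hit, or None
all_numbers = re.findall(r"\d+", sample)  # list of all non-overlapping hits
pieces = re.split(r",\s*", sample)        # cut the string wherever the pattern matches

print(first.group())  # 1
print(all_numbers)    # ['1', '22', '333']
print(pieces)         # ['one 1', 'two 22', 'three 333']
```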

## A simple Example

We'll use a regular expression: 
```python
'animal:\w\w\w'
```

This pattern matches the substring `animal:` followed by three word characters, each encoded by `\w`.

In [1]:
import re

In [2]:
# example text
text = "an example animal:cat!! animal:dog! animal:hedgehog"

# run the search; the r before the string denotes a raw string
match = re.search(r"animal:\w\w\w", text)
# search() returns None if nothing matched, so test the result first
if match:
    print("found:", match.group())  # prints 'found: animal:cat'
else:
    print("did not find")

found: animal:cat


Here, the `r` before the string denotes a raw string literal, i.e., Python doesn't interpret the backslashes as escape characters, as it would, e.g., for `\n` (newline). This is quite useful for regular expressions, because otherwise we'd have to write the above pattern like this:

```
"animal:\\w\\w\\w"
```
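
To check that the two spellings really describe the same pattern, a small comparison like the following can help. The variable names are only illustrative.

```python
import re

raw_pattern = r"animal:\w\w\w"
escaped_pattern = "animal:\\w\\w\\w"

# both literals produce the identical string, so they are the same regex
same_pattern = raw_pattern == escaped_pattern
print(same_pattern)  # True

# in a raw string, the backslash and the n stay two separate characters
print(len(r"\n"), len("\n"))  # 2 1
```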

The specific match can be retrieved using [`match.group()`](https://docs.python.org/3/library/re.html#re.match.group).

## Basic Patterns

Ordinary characters, such as "`a, X, 9, <`" match themselves literally. 

In [3]:
# search for occurrence of "sc"
re.search(r"sc", "datascience").group()

'sc'

In [4]:
# search for occurrence of <
re.search(r"<", "data<science").group()

'<'

Special characters do not match themselves because they are part of the language. These are `. ^ $ * + ? { [ ] \ | ( )`.
 

In [5]:
# search for the beginning of the string, not the ^ symbol
re.search(r"^", "datascience^2").group()

''

We can escape special characters with a backslash `\` to match them literally.

In [6]:
# search for the ^ symbol by escaping it
re.search(r"\^", "datascience^2").group()

'^'

A period `.` matches any single character except a newline character.

In [7]:
# search for the first single character
re.search(r".", "datascience.net").group()

'd'

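`\w` matches a "word" character: a letter, digit, or underscore `[a-zA-Z0-9_]`. In Python 3, `\w` also matches letters and digits from other scripts (e.g., `é`) unless the `re.ASCII` flag is set. Note that it only matches a single word character, not a whole word.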

In [8]:
# search for the first word char
re.search(r"\w", "datascience").group()

'd'

In [9]:
# search for the first word char - note that < doesn't match
re.search(r"\w", "<datascience>").group()

'd'

`\W` (upper case W) matches any non-word character.

In [10]:
# search for the first non-word char
re.search(r"\W", "<datascience>").group()

'<'

`\s` matches a single whitespace character: space, newline, return, tab, form feed, or vertical tab `[ \n\r\t\f\v]`.

In [11]:
# split by whitespace - searching for whitespace is boring
re.split(r"\s", "Intro datascience")

['Intro', 'datascience']

`\S` (upper case S) matches any non-whitespace character.

In [12]:
# search for first non-whitespace character
re.search(r"\S", " Intro datascience").group()

'I'

`\t`, `\n`, and `\r` match tab, newline, and return respectively.

In [13]:
# split the string based on tab \t
re.split(r"\t", "Intro\tdatascience    2018")

['Intro', 'datascience    2018']

`\d` matches a decimal digit `[0-9]`.

In [14]:
re.search(r"\d", "Intro datascience 2018").group()

'2'

`^` matches the start and `$` matches the end of the string. These are useful in the context of a larger regular expression, but not very useful in isolation. 
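
A minimal sketch of the anchors inside larger patterns follows. The input strings and variable names are hypothetical and chosen only for illustration.

```python
import re

# ^data only matches if the string begins with "data"
starts_with_data = re.search(r"^data", "datascience") is not None
data_not_at_start = re.search(r"^data", "big datascience") is not None

# \d+$ only matches digits that run up to the end of the string
trailing_digits = re.search(r"\d+$", "Intro datascience 2018").group()

print(starts_with_data)   # True
print(data_not_at_start)  # False
print(trailing_digits)    # 2018
```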

### Repetition

A key concept in regex is repetition.

`+` matches 1 or more occurrences of the pattern to its left.

In [15]:
# this matches as much as it can
re.search(r"o+", "Introoooo datascience").group()

'ooooo'

`*` matches 0 or more occurrences of the pattern to its left.

In [16]:
# search for digits \d possibly separated by one or more whitespace characters
re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx').group()

'1 2   3'

In [17]:
re.search(r'\d\s*\d\s*\d', 'xx123xx').group()

'123'

We can use this, for example to look for words starting with a certain character: 

In [18]:
# d\w* start with a d, then match zero or more word characters
re.search(r"d\w*", "Introoooo datascience !").group()

'datascience'

`?` matches 0 or 1 occurrences of the pattern on its left:

In [19]:
# d\w? start with a d, then match zero or one characters. Why is the result "da" not "d"?
re.search(r"d\w?", "Introoooo datascience !").group()

'da'

Be aware that the zero-or-more condition can be tricky. For example, if we try to match `dd` with `d*`, we get an empty match: `search()` starts at the beginning of the string, where zero occurrences of `d` already satisfy the pattern. The correct pattern here would be `d+`.

In [20]:
re.search(r"d*", "Introoooo ddatascience !").group()

''
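
For comparison, here is a short sketch of the `d+` variant on the same string. The variable names are illustrative.

```python
import re

text_dd = "Introoooo ddatascience !"

# d* is satisfied by zero d's at position 0, so the match is empty
empty_match = re.search(r"d*", text_dd).group()
# d+ needs at least one d, so the search moves on until it finds "dd"
real_match = re.search(r"d+", text_dd).group()

print(repr(empty_match))  # ''
print(repr(real_match))   # 'dd'
```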

### Example: E-Mails

Let's take a look at how we can use regular expressions. Suppose you're a spammer and you want to scrape e-mail addresses from a website. 

Here is an example:

In [21]:
html = 'You can reach me <a href="mailto:alex@sci.utah.edu">by e-mail</a> if necessary.'

# a first attempt:
# \w+ 1-n word letters, @, 1-n word letters
re.search(r'\w+@\w+', html).group()

'alex@sci'

That didn't work because `\w` doesn't match the period `.`. We can write a more specific query:

In [22]:
# \w+ 1-n word letters, @, 1-n word letters, 1-n periods \.+, 1-n word letters, another period \., 
# and 1-n more word letters 
re.search(r'\w+@\w+\.+\w+\.\w+', html).group()

'alex@sci.utah.edu'

That worked! But it's easy to see that this isn't very general, i.e., it doesn't work for every legal e-mail. 

In [23]:
html2 = 'You can reach me <a href="mailto:alex@utah.edu">by e-mail</a> if necessary.'
match = re.search(r'\w+@\w+\.+\w+\.\w+', html2)
if match:
    print(match.group())
else:
    print ("didn't match")

didn't match


In [24]:
html3 = "You can reach me <a href='mailto:alex-lex@sci.utah.edu'>by e-mail</a> if necessary."

# same pattern as before; note that \w does not match the dash in "alex-lex"
match = re.search(r'\w+@\w+\.+\w+\.\w+', html3)
if match:
    print(match.group())
else:
    print ("didn't match")

lex@sci.utah.edu


Here, something matched but it's the wrong e-mail! It's not `alex-lex@sci.utah.edu`, but `lex@sci.utah.edu`. 

To fix this, we need another concept:

## Sets of legal chars 

**Square brackets** `[]` define a set of characters: the set matches any single character listed inside it. Inside a set, the period `.` stands for a literal period and needs no escaping:

In [73]:
# [\w.-]+ matches one or more characters that are each a word character, a period ., or a dash -
re.search(r'[\w.-]+@[\w.-]+', html).group()

'alex@sci.utah.edu'

In [74]:
re.search(r'[\w.-]+@[\w.-]+', html2).group()

'alex@utah.edu'

That worked wonderfully! See how easy it is to extract an e-mail from a website. 

This pattern matches valid e-mail addresses, but it also matches invalid ones. So this is a fine regex if you want to extract e-mail addresses, but not if you want to validate an e-mail address:

In [75]:
html4 = "alexander@sci..."

re.search(r'[\w.-]+@[\w.-]+', html4).group()

'alexander@sci...'
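
If some validation is wanted, one option is to tighten the pattern and require that it covers the whole string. The following is only a rough sketch under simplifying assumptions: the pattern `EMAIL_SHAPE` and the helper `looks_like_email` are hypothetical names, and the check is far looser than the actual e-mail address standard.

```python
import re

# assumption: a simplified shape check, not the full e-mail address standard.
# name, @, one label, then one or more ".label" parts
EMAIL_SHAPE = r"[\w.-]+@[\w-]+(\.[\w-]+)+"

def looks_like_email(candidate):
    """Return True if the whole string has the shape name@label.label"""
    # fullmatch() succeeds only if the pattern covers the entire string
    return re.fullmatch(EMAIL_SHAPE, candidate) is not None

print(looks_like_email("alex@sci.utah.edu"))  # True
print(looks_like_email("alexander@sci..."))   # False
```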

## Grouping
To repeat a whole substring rather than a single character, for example, we need to be able to group a part of a regular expression. We can group with parentheses `()`:

In [76]:
# (da)+ gives us 1+ matches of the string "da", e.g., this will match da dada dadada, etc.
re.search(r"(da)+", "Introoooo dadadadascience 2016").group()

'dadadada'

Groups are also a handy way to match a larger string, but only extract what is nested within a group. The [`group()`](https://docs.python.org/3/library/re.html#re.match.group) method we've been using provides access to matched groups independently. Here is an example of extracting a URL from a string: 

In [77]:
url = 'Visit the course website <a href="http://datasciencecourse.net">here</a>'
# legal characters in this url are word characters \w, the colon :, the slash /, and the period .
# Inside a character set the period is literal, so the backslash in \. is optional.
match = re.search(r'href="([\w:/\.]+)"', url)

print("The whole match:", match.group())
# Here we retrieve the first individual group:
print("Only the match within the first group at index 1:", match.group(1))

The whole match: href="http://datasciencecourse.net"
Only the match within the first group at index 1: http://datasciencecourse.net
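
A pattern can contain several groups, each with its own index. The following sketch splits the same URL into two parts. The pattern and variable names are illustrative and only cover simple URLs like this one.

```python
import re

url = 'Visit the course website <a href="http://datasciencecourse.net">here</a>'

# group 1 captures the scheme, group 2 the host name
match = re.search(r'href="(\w+)://([\w.]+)"', url)

scheme = match.group(1)
host = match.group(2)
both = match.groups()  # tuple with all captured groups

print(scheme)  # http
print(host)    # datasciencecourse.net
print(both)    # ('http', 'datasciencecourse.net')
```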


## Exercise 2.1

You're an evil spammer who's observed that many people try to obfuscate their e-mail addresses using this notation: "`alex at utah dot edu`". Below are three examples of such obfuscated e-mail addresses. Try to extract "alex at utah dot edu", etc. Start with the first string, then extend your regular expression to work on all of them at the same time. Note that the second and third are slightly harder to do! 

In [26]:
html_smart = "You can reach me: alex at utah dot edu"
html_smart2 = "You can reach me: alex dot lex at utah dot edu"
html_smart3 = "You can reach me: alex dot lex at sci dot utah dot edu"

def testRegex(regex):
    for html in (html_smart, html_smart2, html_smart3):
        print(re.search(regex, html).group())

In [34]:
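# one possible solution: zero or more "name dot " parts, a name, " at ", a name,
# then one or more " dot name" parts
mail_regex = r"(\w+\sdot\s)*\w+\sat\s\w+(\sdot\s\w+)+"
testRegex(mail_regex)

alex at utah dot edu
alex dot lex at utah dot edu
alex dot lex at sci dot utah dot edu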


## Find All Occurrences

Instead of finding only a single occurrence of a match, we can also find all occurrences. Here is an example:

In [35]:
findall_html = 'You can reach us at <a href=\"mailto:alex-lex@sci.utah.edu\">Alex\'s</a>  ' \
    'or <a href="mailto:b-osting@math.utah.edu">Braxton\'s</a> e-mail if necessary.'

e_mail_re = r'[\w.-]+@[\w.-]+'

re.findall(e_mail_re, findall_html)

['alex-lex@sci.utah.edu', 'b-osting@math.utah.edu']

You can also combine the findall with groups:

In [36]:
# separating username and domain
e_mail_re_groups = r'([\w.-]+)@([\w.-]+)'

re.findall(e_mail_re_groups, findall_html)

[('alex-lex', 'sci.utah.edu'), ('b-osting', 'math.utah.edu')]

If we want to use parentheses only for logic, not for grouping, we can use the `(?:)` syntax (a non-capturing grouping):

In [37]:
re.findall(r'(?:[\w.-]+)@(?:[\w.-]+)', findall_html)

['alex-lex@sci.utah.edu', 'b-osting@math.utah.edu']

## Greedy vs Non-Greedy

By default, regular expressions are greedy. In this example, we try to match HTML tags:

In [38]:
html_tags = "The <b>amount and complexity</b> of information produced in <i>science</i>..."

# start with <, repeat any character 1-n times, close with >
re.findall("<.+>", html_tags)

['<b>amount and complexity</b> of information produced in <i>science</i>']

This wasn't what we tried to do: the greedy nature of regex matched from the first opening tag `<` to the last closing tag. We can modify this behavior with the `?` character: placed directly after a repetition operator, it signals that the repetition should not be greedy, i.e., it should match as little as possible:

In [39]:
# start with <, repeat any character 1-n times in a non-greedy way, terminate at the first >
re.findall("<.+?>", html_tags)

['<b>', '</b>', '<i>', '</i>']

Greediness applies to the `*`, `+`, and `?` operators, so these are legal sequences: `*?`, `+?`, `??`.
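
The other two non-greedy forms behave analogously. Here is a small sketch with illustrative variable names and a made-up second input string:

```python
import re

html_tags = "The <b>amount and complexity</b> of information produced in <i>science</i>..."

# greedy .* runs on to the last "</", non-greedy .*? stops at the first one
greedy_star = re.search(r"<b>.*</", html_tags).group()
lazy_star = re.search(r"<b>.*?</", html_tags).group()

# greedy a? takes the optional "a" if it can, non-greedy a?? prefers to skip it
greedy_optional = re.search(r"da?", "data").group()
lazy_optional = re.search(r"da??", "data").group()

print(lazy_star)        # <b>amount and complexity</
print(greedy_optional)  # da
print(lazy_optional)    # d
```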

## Custom character subsets

You can also define custom character sets by specifying a range with a dash: 

In [40]:
re.search(r"[2-9]+", "0123405").group()

'234'

Inside square brackets, a `^` as the first character inverts the set, so that it matches any character that is not listed.

In [41]:
re.search(r"[^0-2]+", "0123405").group()

'34'

## Specifying number of copies

`{m}` Specifies that exactly m copies of the previous RE should be matched. Fewer matches cause the entire RE not to match. 

In [42]:
phone_numbers = "(857) 131-2235, (801) 134-2215, this is common in twelve (12) countries and one (1) state"

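# match exactly three digits enclosed in parentheses; a notebook only displays the
# last expression of a cell, so this result (['857', '801']) is not shown below
re.findall(r"\(([0-9]{3})\)", phone_numbers)
# match any run of exactly three digits, with or without parentheses
re.findall(r"[0-9]{3}", phone_numbers)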

['857', '131', '223', '801', '134', '221']

`{m,n}` specifies that m to n copies match:

In [43]:
# match two to three digits enclosed in parentheses
re.findall(r"\(([0-9]{2,3})\)", phone_numbers)

['857', '801', '12']

## Or expression

We can use the pipe `|` to define an or between two or more regular expressions:

In [44]:
weekdays = "We could meet Monday or Wednesday"

re.findall("Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday", weekdays)

['Monday', 'Wednesday']
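
Note that `|` separates everything to its left from everything to its right. To limit the alternatives to a part of the pattern, we can wrap them in a group, as in this small sketch (the shortened patterns are only for illustration):

```python
import re

weekdays = "We could meet Monday or Wednesday"

# without a group the alternatives are "Mon" and "Wednesday"
ungrouped = re.findall(r"Mon|Wednesday", weekdays)
# with a non-capturing group the alternatives are "Mon" and "Wednes",
# and "day" has to follow either of them
grouped = re.findall(r"(?:Mon|Wednes)day", weekdays)

print(ungrouped)  # ['Mon', 'Wednesday']
print(grouped)    # ['Monday', 'Wednesday']
```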

## Replacing strings

We can use [`sub()`](https://docs.python.org/3/library/re.html#re.sub) to replace every match of a pattern with other content. 

In [45]:
re.sub("Monday|Tuesday|Wednesday|Thursday|Friday", "Weekday",  weekdays)

'We could meet Weekday or Weekday'
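
The replacement does not have to be a fixed string. As a sketch: `sub()` also accepts a function that receives each match object, and a replacement string can refer back to a captured group with `\1`. The helper `shorten` is a hypothetical name introduced for this example.

```python
import re

weekdays = "We could meet Monday or Wednesday"

def shorten(match):
    # hypothetical helper: keep only the first three letters of the matched day
    return match.group()[:3]

# the function is called once per match and returns the replacement text
shortened = re.sub(r"Monday|Wednesday", shorten, weekdays)

# \1 inserts whatever the first group captured, here the part before "day"
stems = re.sub(r"(\w+)day", r"\1", weekdays)

print(shortened)  # We could meet Mon or Wed
print(stems)      # We could meet Mon or Wednes
```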

## Other Functions

We've covered a lot, but not all of the functionality of regex. A few other functions and options that could be helpful:

* [finditer](https://docs.python.org/3/library/re.html#re.finditer) returns an iterator over match objects instead of a list of strings
* the [IGNORECASE](https://docs.python.org/3/library/re.html#re.IGNORECASE) option makes matching case-insensitive
* the [DOTALL](https://docs.python.org/3/library/re.html#re.DOTALL) option makes a `.` match a newline character too.
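
The three items above can be sketched together as follows. The sample string and the variable names are made up for illustration.

```python
import re

# hypothetical two-line sample text
sample = "Data science uses data.\nDATA everywhere"

# finditer yields match objects, so the position of each hit is available;
# IGNORECASE lets "data" match regardless of capitalization
hits = [(m.group(), m.start()) for m in re.finditer(r"data", sample, re.IGNORECASE)]

# by default . stops at the newline, so this pattern cannot span both lines
without_dotall = re.search(r"uses.*DATA", sample)
# with DOTALL the . also matches the newline
with_dotall = re.search(r"uses.*DATA", sample, re.DOTALL)

print(hits)                  # [('Data', 0), ('data', 18), ('DATA', 24)]
print(without_dotall)        # None
print(with_dotall.group())
```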



## Exercises

### Exercise 2.2: Find Adverbs

Write a regular expression that finds all adverbs in a sentence. Adverbs are characterized by ending in "ly".

In [46]:
text = "He was carefully disguised but captured quickly by police."

In [55]:
re.findall(r"\w+ly",text)

['carefully', 'quickly']

### Exercise 2.3: Phone Numbers

Extract the phone numbers that follow a (xxx) xxx-xxxx pattern from the text:

In [44]:
phone_numbers = "(857) 131-2235, (801) 134-2215, but this one (12) 13044441 shouldnt match. Also, this is common in twelve (12) countries and one (1) state"

In [54]:
re.findall(r"\([0-9]{3}\)\s[0-9]{3}-[0-9]{4}",phone_numbers)

['(857) 131-2235', '(801) 134-2215']

### Exercise 2.4: HTML Content

Extract the content between the `<b>` and `<i>` tags but not the other tags:

In [57]:
html_tags = "This is <b>important</b> and <u>very</u><i>timely</i>"

In [58]:
re.findall(r"<[bi]>(.*?)<\/[bi]>",html_tags)

['important', 'timely']