# Lab 7 Regular Expressions and Strings

In [6]:
require(tidyverse)
require(stringr)

Loading required package: tidyverse

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──

✔ ggplot2 3.1.0     ✔ purrr   0.3.0
✔ tibble  2.1.3     ✔ dplyr   0.8.1
✔ tidyr   0.8.2     ✔ stringr 1.3.1
✔ readr   1.1.1     ✔ forcats 0.3.0

“package ‘tibble’ was built under R version 3.5.2”
“package ‘dplyr’ was built under R version 3.5.2”
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



## Regular Expressions

Regular expressions (regex) are a way to describe patterns in text and are used to search for and match certain patterns in strings.

`Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.` - Jamie Zawinski

For instance, say that we want to find and extract all the email addresses in a document automatically. How might we do that?

### Special characters

Regex takes advantage of several reserved characters that are used for special functions. 

`. \ | ( ) [ ] ^ $ { } * + ?`

### Character classes

* `.` matches anything (wildcard)
* `[aeiou]` matches a single character in the set provided
* `[^aeiou]` matches a single character NOT in the set
* `[a-e]` matches a range, equivalent to `[abcde]`

#### Shorthand

* `\w` matches a "word" character; for ASCII text this is equivalent to `[a-zA-Z0-9_]`
* `\s` matches any whitespace, including tabs and newlines
* `\d` matches a digit; for ASCII text this is equivalent to `[0-9]`
* `\W`, `\S`, and `\D` match the opposite of the lower-case versions
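
As a quick illustration of the character classes and shorthands listed above, here is a minimal sketch using `str_extract_all` and `str_detect` from `stringr`. The input strings and variable names are made up for this example. Note that each backslash is doubled because the pattern is written as an `R` string.

```r
library(stringr)

# Sets and negated sets match exactly one character at a time
vowels     <- str_extract_all("banana", "[aeiou]")[[1]]   # "a" "a" "a"
consonants <- str_extract_all("banana", "[^aeiou]")[[1]]  # "b" "n" "n"

# Shorthand classes; the backslash is doubled inside an R string
digits     <- str_extract_all("room 101", "\\d")[[1]]     # "1" "0" "1"
word_chars <- str_extract_all("a_b c", "\\w")[[1]]        # "a" "_" "b" "c"
non_word   <- str_extract_all("a_b c", "\\W")[[1]]        # " "

# \s matches spaces and tabs alike
has_space  <- str_detect(c("one two", "tab\there", "nospace"), "\\s")
has_space                                                 # TRUE TRUE FALSE
```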

#### Special characters

* Note that `\t` and `\n` match the tab and newline characters. 
* If you want the "literal" versions of any of the reserved characters, you will need to escape them with a backslash `\`, e.g. `[\.\\\|]`


### Grouping

* `()` group patterns together, so that any of the operators below can apply to the whole group. Groups also let you extract parts of a match individually, which we will cover later.
* `\1`, `\2`, etc. refer to the text matched by the first, second, etc. group.
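
To make grouping and backreferences concrete, the sketch below uses `str_extract` and `str_match` from `stringr`. The example strings (`fruit`, `"key=value"`) and variable names are assumptions chosen for illustration only.

```r
library(stringr)

fruit <- c("banana", "coconut", "apple", "pear")

# (..) captures any two characters; \1 requires the same two to follow
repeated_pair <- str_extract(fruit, "(..)\\1")
repeated_pair                      # "anan" "coco" NA NA

# str_match() returns the full match followed by each group's text
parts <- str_match("key=value", "(\\w+)=(\\w+)")
parts[1, ]                         # "key=value" "key" "value"
```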

### Operators

* `|` is the OR operator and allows matches of either side
* `{}` describes how many times the preceding character or group must occur:
  * `{m}` must occur exactly `m` times
  * `{m,n}` must occur between `m` and `n` times, inclusive
  * `{m,}` must occur at least `m` times
* `*` means the preceding character or group can appear zero or more times, equivalent to `{0,}`
* `+` means the preceding character or group must appear one or more times, equivalent to `{1,}`
* `?` means the preceding character or group can appear zero or one time, equivalent to `{0,1}`
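
The operators above can be compared side by side on a small test vector. This is a minimal sketch with made-up inputs; `str_detect` from `stringr` returns `TRUE` wherever the pattern matches somewhere in the string.

```r
library(stringr)

x <- c("ct", "cat", "caat", "caaat")

zero_or_one  <- str_detect(x, "ca?t")     # TRUE  TRUE  FALSE FALSE
zero_or_more <- str_detect(x, "ca*t")     # TRUE  TRUE  TRUE  TRUE
one_or_more  <- str_detect(x, "ca+t")     # FALSE TRUE  TRUE  TRUE
exactly_two  <- str_detect(x, "ca{2}t")   # FALSE FALSE TRUE  FALSE
two_or_more  <- str_detect(x, "ca{2,}t")  # FALSE FALSE TRUE  TRUE

# | inside a group matches either alternative
either <- str_detect(c("cat", "cot", "cut"), "c(a|o)t")  # TRUE TRUE FALSE
```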

### Anchors

* `^` matches the start of a string (or line)
* `$` matches the end of a string (or line)
* `\b` matches a word "boundary"
* `\B` matches any position that is not a word boundary
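
A short sketch of how the anchors change what matches, again using `str_detect` from `stringr` on made-up strings. The backslashes in `\b` and `\B` are doubled because the patterns are `R` strings.

```r
library(stringr)

x <- c("apple pie", "pineapple", "apple")

starts     <- str_detect(x, "^apple")        # TRUE  FALSE TRUE
ends       <- str_detect(x, "apple$")        # FALSE TRUE  TRUE
whole      <- str_detect(x, "^apple$")       # FALSE FALSE TRUE
as_word    <- str_detect(x, "\\bapple\\b")   # TRUE  FALSE TRUE
inside     <- str_detect(x, "\\Bapple")      # FALSE TRUE  FALSE
```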

### Examples

Go to https://regex101.com/ to play around with creating regex patterns. To start, copy-paste the following paragraph (from [The Ringer](https://www.theringer.com/mlb/2018/10/22/18008004/world-series-boston-red-sox-los-angeles-dodgers-mookie-betts-second-base-jd-martinez)) into the text field.

`According to Baseball Reference’s wins above average, the Red Sox had the best outfield in baseball— one-tenth of a win ahead of the Milwaukee Brewers, 11.5 to 11.4. And that’s despite, I’d argue, the two best position players in the NL this year (Christian Yelich and Lorenzo Cain) being Brewers outfielders. More importantly, the distance from Boston and Milwaukee to the third-place Yankees is about five wins. Two-thirds of the Los Angeles Angels’ outfield is Mike Trout (the best player in baseball) and Justin Upton (a four-time All-Star who hit 30 home runs and posted a 122 OPS+ and .348 wOba this year), and in order to get to 11.5 WAA, the Angels’ outfield would have had to replace right fielder Kole Calhoun with one of the three best outfielders in baseball this year by WAA.`

#### 1. Write a regex that captures all capitalized words.



#### 2. Write a regex that captures all the numbers.



#### 3. Write a regex that captures all hyphenated words.



#### 4. Write a regex that captures all words with two consecutive vowels (do not consider `y` to be a vowel).



#### 5. Write a regex that captures all words with a repeated letter.



#### 6. Write a regex that matches `this` and `the` but not `third`.




### Exercise

#### Write a regex that matches all the valid numbers below but none of the invalid ones.

```
12
1048
3.14529
0.87
-255.34
123,340.00 
-16,123,340
1.9e10 
-5.8e5
1.45e-5
720p
164.
.87
124..43
153.243.232
123,,546
24.256,453
123,34,123
,253
12.4e6
```



In [4]:
nums <- c('12',
'1048',
'3.14529',
'0.87',
'-255.34',
'123,340.00', 
'-16,123,340',
'1.9e10', 
'-5.8e5',
'1.45e-5',
'720p',
'164.',
'.87',
'124..43',
'153.243.232',
'123,,546',
'24.256,453',
'123,34,123',
',253',
'12.4e6')
nums

## Strings

In [11]:
string1 = "Michigan: BIG 10 Champion!!"
string1

In [13]:
our_state = "Michigan"
ne_states = c("Connecticut", "Maine", "Massachusetts", "Vermont", "New Hampshire", "Rhode Island")
lakemich_states = c("Wisconsin", "Illinois", "Michigan", "Indiana")

In [13]:
our_state %in% ne_states
our_state %in% lakemich_states

Strings can contain special characters written as escape sequences. The most commonly used ones are `\n` and `\t` for newlines and tabs, respectively.

Also note that some reserved characters, such as the double quote and the backslash itself, do special things in strings. If you want to include them literally, you must escape them with a backslash `\`.

In [136]:
double_quote = "hi\"bye"
backslash_ex = "a\\tb"
backslash_ex2 = "a\tb"

In [137]:
cat(double_quote)

hi"bye

In [34]:
cat(backslash_ex)

a\tb

In [35]:
cat(backslash_ex2)

a	b

You’ll also sometimes see strings like `"\u00b5"`. This is called Unicode escaping, and it is a way of writing non-ASCII characters that works on all platforms.

In [23]:
x = "\u00b6"
x

### String Functions

In [24]:
ne_states

In [25]:
str_length(ne_states)

In [26]:
str_c('Ann Arbor', 'Michigan', sep=', ')

In [27]:
x = c('abc', '123', NA)

In [28]:
# NA is contagious: combining anything with NA gives NA
str_c('|-', x, '-|')

In [29]:
# str_replace_na() turns NA into the string "NA" so it can be combined
str_c('|-', str_replace_na(x), '-|')

To collapse a vector of strings, use the `collapse` argument to `str_c`:

In [30]:
str_c(ne_states, collapse=", ")

### Subsetting Strings

In [31]:
ne_states = c("Connecticut", "Maine", "Massachusetts", "Vermont", "New Hampshire", "Rhode Island")
ne_states

In [32]:
str_sub(ne_states, 1, 3)

In [33]:
str_sub(ne_states, -3, -1)

In [34]:
str_sub(ne_states, 1, 7)  # notice that this didn't fail for Maine

In [35]:
# str_sub() can also appear on the left-hand side to modify part of each string
str_sub(ne_states, 1, 1) = str_to_lower(str_sub(ne_states, 1, 1))
ne_states

In [36]:
str_sub(ne_states, -3, -1) = str_to_upper(str_sub(ne_states, -3, -1))
ne_states

## Using regular expressions in R

In `R`, we will use `str_view` and `str_view_all` to play with regular expressions. 

Note that other functions you've used previously, such as `str_extract`, `str_detect` and `str_replace`, can also take regular expressions as patterns.
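
As a small sketch of that point, the calls below pass the same kind of regex pattern to `str_detect`, `str_extract`, `str_replace`, and `str_replace_all`. The vector `fruit` is a made-up example defined only for this block.

```r
library(stringr)

fruit <- c("apple", "banana", "pear", "orange")

has_an      <- str_detect(fruit, "an")                 # FALSE TRUE FALSE TRUE
first_vowel <- str_extract(fruit, "[aeiou]")           # "a" "a" "e" "o"
one_masked  <- str_replace(fruit, "[aeiou]", "-")      # "-pple" "b-nana" "p-ar" "-range"
all_masked  <- str_replace_all(fruit, "[aeiou]", "-")  # "-ppl-" "b-n-n-" "p--r" "-r-ng-"
```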

`str_view` and `str_view_all` take a string (or a vector of strings) and show you the matches to a pattern.

In [37]:
x = c("apple", "banana", "pear", "orange")

In [39]:
str_view(x, "an")

In [40]:
str_view_all(x, "an")

In [47]:
baseball = "According to Baseball Reference’s wins above average, the Red Sox had the best 
outfield in baseball— one-tenth of a win ahead of the Milwaukee Brewers, 11.5 to 11.4. And 
that’s despite, I’d argue, the two best position players in the NL this year (Christian 
Yelich and Lorenzo Cain) being Brewers outfielders. More importantly, the distance from 
Boston and Milwaukee to the third-place Yankees is about five wins. Two-thirds of the Los 
Angeles Angels’ outfield is Mike Trout (the best player in baseball) and Justin Upton (a 
four-time All-Star who hit 30 home runs and posted a 122 OPS+ and .348 wOba this year), 
and in order to get to 11.5 WAA, the Angels’ outfield would have had to replace right 
fielder Kole Calhoun with one of the three best outfielders in baseball this year by WAA."

str_view_all(baseball, "\\b[A-Z][a-z]+")

Note that any time you want to use a backslash `\` in a regex pattern in `R`, you'll need to write a double backslash `\\` instead. This is because `R` has its own layer of string processing that also uses backslashes to escape reserved characters. Writing `\\` tells `R` to store one literal backslash in the string, and that single backslash is what gets passed to the regex function.
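
One way to see the two layers is to compare what you type with what the string actually contains. The sketch below is illustrative only; `pattern` and the input `"Lab 7"` are made-up names and values.

```r
library(stringr)

pattern <- "\\d+"          # typed with two backslashes

# The stored string has only three characters: a backslash, d, and +
n_chars <- str_length(pattern)
n_chars                    # 3

# writeLines() prints the string as the regex engine receives it: \d+
writeLines(pattern)

# The engine therefore sees \d+ and matches a run of digits
found <- str_extract("Lab 7", pattern)
found                      # "7"
```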

In [48]:
naive = "a.c"
dot = "a\\.c"

# cat(naive)
str_view(c("abc", "a.c", "bef"), naive)

# cat(dot)
str_view(c("abc", "a.c", "bef"), dot)

Question: How many backslashes do you need to create a regex pattern that matches a literal backslash when using `R`?

In [49]:
x = "a\\b"
cat(x)

a\b

## Exercises

Given the corpus of common words in `stringr::words`, create regular expressions that find all words that:

- Start with `y` (I've done this one for you)
- End with `x`
- Are exactly two letters long (don’t cheat by using `str_length`!)
- Have ten letters or more

In [135]:
words = stringr::words

In [118]:
str_view(words, "^y\\w*", match=TRUE)

Create regular expressions to find all words that:

- End with `ed`, but not with `eed`
- End with `ing` or `ise`
- End with the same two-letter sequence they start with (e.g. `church`)

Try to match the valid `dates` below (first row) without matching the invalid dates (the rest).

Hint: Start by writing a pattern that matches all the entries. Then try to refine your pattern to omit the invalid dates.

In [3]:
dates = c('2012-05-13', '2014-12-31', '1991-06-14', '1991/06/14',
          '200a-05-13',  # invalid year
          '2014-15-20',  # invalid month
          '2014-00-20',  # invalid month
          '2016-04-35',  # invalid day
          '2014-12-00',  # invalid day
          '2013/03-25')  # non-matching separators

str_view(dates, 'pattern')