In [5]:
library(tidyverse)
library(stringr)
options(jupyter.rich_display=T)

# Lecture 13: Regular Expressions II

In this lecture we continue learning about regular expressions.

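* [Character classes](#Character-classes)
* [Alternatives](#Alternatives)
* [Repetition](#Repetition)
* [Grouping](#Grouping)
* [Backreferences](#Backreferences)
* [Tools](#Tools)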

## Character classes
Recap from last lecture:

* `\d`: matches any digit.
* `\s`: matches any whitespace (e.g. space, tab, newline).
* `[abc]`: matches a, b, or c.
* `[^abc]`: matches anything *except* a, b, or c.

In [7]:
x = c("apple", "banana", "pear")
str_view(x, '[be]a')

We can also *negate* the character class. A character class which begins with a `^` will match everything except what is in the brackets:

In [121]:
str_view(x, '[^b]a')  # Match anything except b, followed by a

Note also that, inside of the brackets, the `.` loses its special meaning:

In [122]:
str_view(c('a.', 'ban', 'tribeca.'), 'a[nea.]')  # Match a, followed by one of n, e, a or period
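
To check these rules without relying on the highlighting in `str_view`, here is a minimal sketch using `str_detect`, which returns `TRUE` or `FALSE` for each string. The vector `s` is made up for illustration and is not part of the lecture data.

```r
library(stringr)

# Hypothetical test strings, chosen only for illustration
s <- c("room 101", "tab\there", "a.b", "axb")

has_digit <- str_detect(s, "\\d")    # \d: any digit
has_space <- str_detect(s, "\\s")    # \s: any whitespace, including the tab in s[2]
dot_class <- str_detect(s, "a[.]b")  # inside brackets, . is a literal period
dot_bare  <- str_detect(s, "a.b")    # outside brackets, . matches any character

print(data.frame(s, has_digit, has_space, dot_class, dot_bare))
```

Only `"a.b"` matches `a[.]b`, while both `"a.b"` and `"axb"` match `a.b`.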

#### Exercise
I want to match all words that begin with a `q` that is not followed by a `u`.

In [8]:
str_view(stringr::words, '^q[^u]', match=T)

It turns out that there aren't any in this list of common words. (These are uncommon words.) Let's try a bigger list:

In [124]:
english = read_table('/usr/share/dict/words', col_names = F)$X1  # works on OSX or linux
length(english)
str_view(english, '^q[^u]', match=T)

Parsed with column specification:
cols(
  X1 = col_character()
)


How about words that end in q?

In [125]:
str_view(english, 'q$', match=T)

## Alternatives
An *alternative* means *match this or that*. Alternative patterns can be matched using the syntax `(this|that)`.

In [9]:
color_re = "colo(r|ur)"
x <- c("color", "red colour", "coloured glass", "chair", "colored chair")
str_view(x, color_re)
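
The parentheses matter because `|` otherwise splits the *whole* pattern in two. The sketch below compares the grouped and ungrouped forms; the test string `"fur"` is made up for illustration.

```r
library(stringr)

x <- c("color", "red colour", "coloured glass", "chair", "colored chair")

# With parentheses, the alternation covers only the ending: "color" or "colour"
grouped <- str_extract(x, "colo(r|ur)")
print(grouped)

# Without parentheses, | splits the whole pattern: "color" OR "ur"
in_fur_grouped   <- str_detect("fur", "colo(r|ur)")
in_fur_ungrouped <- str_detect("fur", "color|ur")
print(c(in_fur_grouped, in_fur_ungrouped))
```

The ungrouped pattern matches `"fur"` because `"ur"` on its own is one of its two alternatives.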

### Example
Suppose we want to match telephone numbers of the form:

* xxx-xxx-xxxx
* (xxx) xxx-xxxx

In [11]:
phone_re = "(\\d\\d\\d-|\\(\\d\\d\\d\\) )\\d\\d\\d-\\d\\d\\d\\d" # complicated because of all the double backslashes
writeLines(phone_re)

(\d\d\d-|\(\d\d\d\) )\d\d\d-\d\d\d\d


In [12]:
n <- c("123-456-7890", "(123) 456-7890", "1234567890", "+1-123-456-7890")
str_view(n, phone_re)
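
Note that `phone_re` is not anchored, so it can also match *inside* a longer string such as `"+1-123-456-7890"`. If we only want strings that consist of a phone number and nothing else, one option (a sketch, not the only way) is to wrap the pattern in the `^` and `$` anchors used earlier in this lecture:

```r
library(stringr)

phone_re <- "(\\d\\d\\d-|\\(\\d\\d\\d\\) )\\d\\d\\d-\\d\\d\\d\\d"
n <- c("123-456-7890", "(123) 456-7890", "1234567890", "+1-123-456-7890")

# Unanchored: a phone number anywhere in the string counts as a match
unanchored <- str_detect(n, phone_re)

# Anchored: the whole string must be a phone number
anchored <- str_detect(n, paste0("^", phone_re, "$"))

print(data.frame(n, unanchored, anchored))
```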

## Repetition
Above we had to repeat `\d` many times in order to match phone numbers. Fortunately, regexps let us control precisely how many times a pattern is repeated, using the following quantifiers:

* `?`: 0 or 1
* `+`: 1 or more
* `*`: 0 or more

Each of these modifies the thing before it. So:

* `ab?` matches `a` or `ab`.
* `ab+` matches `ab`, `abb`, `abbb`, etc. (`a` followed by one or more b's.)
* `ab*` matches `a` as well as everything that `ab+` matches.

In [13]:
x <- c("cat", "dog", "dogs", "cats")
str_view(x, "cats?")
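
To see the three quantifiers side by side, here is a small sketch that uses `str_extract` to pull out the matched text. The vector `ab` is made up for illustration.

```r
library(stringr)

ab <- c("a", "ab", "abbb")  # hypothetical test strings

opt  <- str_extract(ab, "ab?")  # a, then at most one b
plus <- str_extract(ab, "ab+")  # a, then at least one b
star <- str_extract(ab, "ab*")  # a, then any number of b's (including none)

print(data.frame(ab, opt, plus, star))
```

Here `ab+` fails on `"a"` (giving `NA`), while `ab?` stops after the first `b` in `"abbb"`.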

As with previous examples, to match a literal `*`, `+` or `?` you must properly escape them:

In [15]:
str_view_all("Why? No really, why?", "\\?") # to match a literal question mark

We can also specify the number of matches precisely:

* `{n}`: exactly n
* `{n,}`: n or more
* `{n,m}`: between n and m (inclusive)

In [21]:
x = c('abc', 'bc', 'aabc', 'aaabbc')
str_view(x, "ab{2}c")
str_view(x, "a{2,}b{1,}c")
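
With `{n}` we can go back and shorten the phone number pattern from the previous section. The sketch below checks that the shorter pattern extracts the same text as the original on the same test numbers; the names `phone_long` and `phone_short` are introduced here for illustration.

```r
library(stringr)

phone_long  <- "(\\d\\d\\d-|\\(\\d\\d\\d\\) )\\d\\d\\d-\\d\\d\\d\\d"
phone_short <- "(\\d{3}-|\\(\\d{3}\\) )\\d{3}-\\d{4}"  # same pattern, written with {n}

n <- c("123-456-7890", "(123) 456-7890", "1234567890", "+1-123-456-7890")

long_matches  <- str_extract(n, phone_long)
short_matches <- str_extract(n, phone_short)

print(short_matches)
print(identical(long_matches, short_matches))
```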

## Grouping
Above we vaguely noted that `+`/`*`/`?` modify the previous "thing". "Thing" could be a character (the simplest example), a character class, or a group of characters. To group characters together, surround them with parentheses:

In [16]:
x <- c("hiking is fun", "reading is fun", "driving is not fun", "flying is not fun", "biking is fun")
str_view(x, "is (not )?fun")
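
To make the difference concrete, compare a quantifier applied to a single character with the same quantifier applied to a group. This is a small sketch on made-up strings:

```r
library(stringr)

laugh <- c("hahaha", "haaa")  # hypothetical test strings

# + applies only to the a: h followed by one or more a's
single <- str_extract(laugh, "ha+")

# + applies to the whole group: one or more copies of "ha"
grouped <- str_extract(laugh, "(ha)+")

print(data.frame(laugh, single, grouped))
```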

## Backreferences
Parentheses define groups that can be referred to later in the match as `\1`, `\2` etc. Let's look at a somewhat advanced example:
```
re = "^(.).*\\1$" 
```

Let's unpack what this regex does. We go from left to right:

1. The first character `^` anchors the regex to the beginning of the string.
2. Then there is a group (denoted by the parentheses); inside the group is a period (`.`) which, as we learned, matches any single character.
3. Moving past the group, we encounter another period, which is modified by `*`. `.` matches any character and `*` repeats it zero or more times, so `.*` is regexp-ese for "match as many characters as possible".
4. Next, we encounter `\1`, which is a backreference to the match that occurred inside the parentheses in step 2. This tells the pattern matching engine that whatever character was matched by `(.)` must also occur at this point in the string.
5. Finally, we encounter the end-of-string anchor `$`.

Putting it all together, this regex will match all words of at least two characters whose start and end characters are the same.

In [25]:
x = c("mom", "dad", "brother", "sister")
re = "^(.).*\\1$" 
str_view(x, re) # find strings that start and end with the same character

In [27]:
# out of curiosity
str_view(stringr::words, re, match=T)
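
To see what the backreference actually captured, we can combine `str_detect` with `str_match`, whose second column holds the text matched by the first group. This is a sketch; the extra string `"a"` is added here only to show that a single character cannot match, since `(.)` and `\1` each need a character of their own.

```r
library(stringr)

re <- "^(.).*\\1$"
x <- c("mom", "dad", "brother", "sister", "a")  # "a" added for illustration

same_ends <- str_detect(x, re)
captured  <- str_match(x, re)[, 2]  # column 2 = text matched by group 1

print(data.frame(x, same_ends, captured))
```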

Let's try another example:
```
re <- "(..).*\\1"
```
Using the same logic as before, this matches strings in which some pair of characters appears and then appears again later in the string.

In [28]:
x <- c("he moved his head", "she moved her car", "nobody moved anything", "they moved their bikes")
re <- "(..).*\\1" # find a repeated pair of characters
str_view(x, re, match = TRUE)
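
If we want to know *which* pair was repeated, and not only where the match is, one option is to read the first group out of `str_match`. A minimal sketch on the same sentences:

```r
library(stringr)

x <- c("he moved his head", "she moved her car", "nobody moved anything", "they moved their bikes")
re <- "(..).*\\1"

# Column 2 of the str_match result is the text captured by (..)
pair <- str_match(x, re)[, 2]
print(data.frame(x, pair))
```

A sentence with no repeated pair gets `NA`.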

### Example
How could I match all the words that end with the same vowel repeated twice? (For example, "levee".)

In [36]:
str_view(stringr::words, "([aeiou])\\1$", match=T)

### Example
Same thing, but ending in a consonant instead of a vowel. (For example, "hiss".)

In [35]:
str_view(stringr::words, "([^aeiou])\\1$", match=T)

## Tools
Now that we know how to write regular expressions, let's look at some examples of how they are used in programming and data analysis.

### `str_detect` / `str_subset`
The `str_detect(v, re)` function returns a logical vector indicating whether each element of `v` matches the regex `re`:

In [52]:
words[str_detect(words, 'ing$')]
# This can be abbreviated:
str_subset(words, "ing$")
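
Because `str_detect` returns a logical vector, we can also pass it to `sum` or `mean` to count matches or compute the fraction of matching strings. The sketch below uses a small made-up vector `w` so the results are easy to check by eye.

```r
library(stringr)

w <- c("bring", "king", "apple", "singer")  # hypothetical words

ends_ing <- str_detect(w, "ing$")   # TRUE where the word ends in "ing"
n_match  <- sum(ends_ing)           # TRUE counts as 1, FALSE as 0
frac     <- mean(ends_ing)          # fraction of words that match
matches  <- str_subset(w, "ing$")   # the matching words themselves

print(ends_ing)
print(c(n_match = n_match, frac = frac))
print(matches)
```

Note that `"singer"` contains `ing` but does not end with it, so the `$` anchor excludes it.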

`str_detect` is useful in conjunction with `filter`:

In [49]:
df = tibble(word=words) %>% mutate(i=row_number())
filter(df, str_detect(word, "ing$"))  # use the column `word`, not the global vector `words`

| word | i |
|---|---|
| bring | 113 |
| during | 251 |
| evening | 280 |
| king | 448 |
| meaning | 512 |
| morning | 533 |
| ring | 709 |
| sing | 765 |
| thing | 860 |


### Exercise
How many of the words in `words` begin with "q" but do not have a "u" immediately following?

In [54]:
sum(str_detect(words, "^q[^u]"))  # one possible answer: count the TRUEs returned by str_detect

### `str_count`
`str_count(v, re)` will count up the number of matches of `re` for each entry of `v`:

In [72]:
# count all the words in the sentence
re = "\\S+"  # one possible pattern: each run of non-whitespace characters counts as a word
str_count("A gentleman is one who never hurts anyone's feelings unintentionally.", re)

### Example
What is the median number of vowels, and of consonants, per word in `words`?

In [55]:
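# One possible approach: count vowels and non-vowels in each word, then take the medians
c(vowels = median(str_count(words, "[aeiou]")), consonants = median(str_count(words, "[^aeiou]")))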

### `str_extract`
`str_extract(v, re)` extracts the first substring matched by `re` from each element of `v`. Another way to think of this is as returning the portion of the string which is highlighted by `str_view`:

In [67]:
q = 'Research is formalized curiosity. It is poking and prying with a purpose.'
# re to match capitalized words
re = "[A-Z][a-z]*"  # one possible pattern: a capital letter followed by lowercase letters
str_view(q, re)
str_extract(q, re)

Analogous to `str_view_all` we have `str_extract_all`:

In [69]:
str_view_all(q, re)
str_extract_all(q, re)

### `str_match`
`str_match(v, re)` will create a matrix out of the grouped matches in `re`. The first column has the whole match, and additional columns are added for each parenthesized group. If the pattern does not match, you will get `NA`s.

In [90]:
head(str_match(words, '^(.).*(.)$'))

| whole match | group 1 | group 2 |
|---|---|---|
| NA | NA | NA |
| able | a | e |
| about | a | t |
| absolute | a | e |
| accept | a | t |
| account | a | t |
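
`str_match` is handy for pulling structured fields out of text. As a sketch, here is how we might split phone numbers of the form xxx-xxx-xxxx into their three parts. The numbers and the name `parts` are made up for illustration.

```r
library(stringr)

n <- c("123-456-7890", "555-867-5309", "1234567890")  # hypothetical numbers

# Three groups: area code, exchange, line number
parts <- str_match(n, "(\\d{3})-(\\d{3})-(\\d{4})")
print(parts)

area_code <- parts[, 2]  # column 1 is the whole match, column 2 is the first group
print(area_code)
```

The third string has no dashes, so its entire row is `NA`.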


### `str_replace`
`str_replace(v, re, rep)` replaces the first match of `re` in each element of `v` with `rep`. The most basic usage is as a sort of find and replace:

In [95]:
str_replace('Give me liberty or give me death', '\\w+$', 'pizza')
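
Analogous to `str_extract_all`, stringr also provides `str_replace_all`, which replaces every match instead of only the first. It is not used elsewhere in this lecture; the sketch below shows the difference on the same sentence.

```r
library(stringr)

s <- "Give me liberty or give me death"

first_only <- str_replace(s, "me", "us")      # replaces only the first "me"
every_one  <- str_replace_all(s, "me", "us")  # replaces every "me"

print(first_only)
print(every_one)
```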

A very useful feature of regexp replacements is the ability to use backreferences:

In [103]:
str_replace("If you're going through hell, keep going.", "(\\w)'(\\w)", "\\1\\2") # one way to de-apostrophize: keep the letters on either side, drop the apostrophe
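
As another sketch of backreferences in the replacement string, we can rewrite phone numbers of the form (xxx) xxx-xxxx as xxx-xxx-xxxx. Each `\\1`, `\\2`, `\\3` in the replacement stands for the text captured by the corresponding group. The numbers are made up for illustration.

```r
library(stringr)

n <- c("(123) 456-7890", "123-456-7890")  # hypothetical numbers

# Capture the three blocks of digits, then reassemble them with dashes
normalized <- str_replace(n, "\\((\\d{3})\\) (\\d{3})-(\\d{4})", "\\1-\\2-\\3")
print(normalized)
```

The second number does not match the pattern, so it is returned unchanged.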