analysis/ch11_strings.Rmd

---
title: "Chapter 11 - Strings with {stringr}"
author: "Vebash Naidoo"
date: "31/10/2020"
output: html_document
---

```{css, echo = FALSE}
.tabset h2 {display: none;}
```

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE,
                      comment = "#>", collapse = TRUE)

options(scipen=10000)
library(tidyverse)
library(flair)
library(magrittr)
library(stringr)
```
# Strings {#buttons .tabset .tabset-fade .tabset-pills}


__Click on the tab buttons below for each section__

<h2>String Basics</h2>
## String Basics

```{r str1, include=FALSE}
(string1 <- "This is a string")
(string2 <- 'To put a "quote" inside a string, use single quotes')

writeLines(string1)
writeLines(string2)
```

```{r, echo = FALSE}
decorate('str1') %>% 
  flair("\"", 
        background = "#9FDDBA", 
        color = "#008080") %>% 
  flair("\'", 
        background = "#e5989b", 
        color = "#6d6875") 
```

```{r}
double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"
```

If you want to include a literal backslash, you'll need to <span style="color: #008080;background-color:#9FDDBA">double it up: `"\\"`</span>.

The printed representation of a string is not the same as the string itself, because the printed representation shows the escapes. To see the <span style="color: #008080;background-color:#9FDDBA">raw contents of the string, use `writeLines()`</span>:

```{r write, include=FALSE}
x <- c("\"", "\\")
x
writeLines(x)
```

```{r, echo=FALSE}
decorate("write") %>% 
  flair("writeLines", background = "#9FDDBA", 
        color = "#008080")
```

__Other useful ones__:

- `"\n"`: newline
- `"\t"`: tab
- See the complete list by getting help on `"`: `?'"'`, or `?"'"`. 
- When you see strings like `"\u00b5"`, this is a way of writing non-English characters.

```{r}
(string3 <- "This\tis\ta\tstring\twith\t\ttabs\tin\tit.\nHow about that?")
writeLines(string3)

## From `?'"'` help page
## Backslashes need doubling, or they have a special meaning.
x <- "In ALGOL, you could do logical AND with /\\."
print(x)      # shows it as above ("input-like")
writeLines(x) # shows it as you like it ;-)
```
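
As a quick check that the printed form differs from the contents, the sketch below counts the characters in a few escaped strings. The vector `esc_demo` is a made-up example, not from the book. Each escape sequence should count as a single character.

```{r}
library(stringr)

# esc_demo is a hypothetical example vector
esc_demo <- c("\"", "\\", "a\tb")

esc_demo             # printed representation shows the escapes
writeLines(esc_demo) # raw contents
str_length(esc_demo) # each escape sequence is one character
```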

<h2>Some String Functions</h2>
## Some String Functions

### String Length

Use <span style="color: #008080;background-color:#9FDDBA">`str_length()`</span>.

```{r str_len, include=FALSE}
str_length(c("a", "R for Data Science", NA))
```

```{r, echo=FALSE}
decorate("str_len") %>% 
  flair("str_length", background = "#9FDDBA", 
        color = "#008080")
```

### Combining Strings

Use <span style="color: #008080;background-color:#9FDDBA">`str_c()`</span>.

- Use `sep = some_char` to separate values with a character; the 
default separator is the empty string.
- Shorter length vectors are recycled.
- Use `str_replace_na(x)` to replace NAs with the literal string `"NA"`.
- Objects of length 0 are silently dropped.
- Use `collapse` to reduce a __vector of strings__ to a single
string.

```{r str_com, include=FALSE}
str_c("a", "R for Data Science")

str_c("x", "y", "z")

str_c("x", "y", "z", sep = ", ") # separate using character

str_c("prefix-", c("a","b", "c"), "-suffix")
```

```{r, echo=FALSE}
decorate("str_com") %>% 
  flair("str_c", background = "#9FDDBA", 
        color = "#008080") %>% 
  flair("sep = ", background = "#9FDDBA", 
        color = "#008080")  
```

```{r str_com2, include=FALSE}
x <- c("abc", NA)

str_c("|=", x, "=|") # concatenating a 1 long, with 2 long, with 1 long

str_c("|=", str_replace_na(x), "=|") # to actually show the NA
```

```{r, echo=FALSE}
decorate("str_com2") %>% 
  flair("NA", background = "#9FDDBA", 
        color = "#008080") %>% 
  flair("str_replace_na(x)", background = "#9FDDBA", 
        color = "#008080")  
```

Notice that the shorter vector is recycled.

Objects of 0 length are dropped.

```{r str_com3, include=FALSE}
name <- "Vebash"
time_of_day <- "evening"
birthday <- FALSE

str_c("Good ", time_of_day, " ",
      name, if(birthday) ' and Happy Birthday!')

str_c("prefix-", c("a","b", "c"), "-suffix", collapse = ', ')

str_c("prefix-", c("a","b", "c"), "-suffix") # note the diff without
```

```{r, echo=FALSE}
decorate("str_com3") %>% 
  flair("if(birthday) ' and Happy Birthday!'", 
        background = "#9FDDBA", 
        color = "#008080") %>% 
  flair("collapse = ', '", 
        background = "#9FDDBA", 
        color = "#008080")
```


### Subsetting Strings

Use <span style="color: #008080;background-color:#9FDDBA">`str_sub()`</span>.

- The `start` and `end` arguments give the (inclusive) positions of the substring you're looking for; negative values count backwards from the end of the string.
- It does not fail if the string is too short; it returns as much as it can.
- You can use the assignment form, `str_sub(x, start, end) <- value`, to modify strings.

```{r}
x <- c("Apple", "Banana", "Pear")

str_sub(x, 1, 3) # get 1st three chars of each

str_sub(x, -3, -1) # get last three chars of each

str_sub("a", 1, 5) # too short but no failure

x # before change

# Go get from x the 1st char, and assign to it
# the lower version of its character
str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))

x # after the str_sub assign above
```

### Locales

`str_to_lower()`, `str_to_upper()` and `str_to_title()` are all
functions that change case. The result may depend on your
locale though.

```{r}
# Turkish has two i's: with and without a dot, and it
# has a different rule for capitalising them:
str_to_upper(c("i", "ı"))
str_to_upper(c("i", "ı"), locale = "tr")
```

Sorting is also affected by locale. In base R we use `sort()` or `order()`; in {stringr} we use `str_sort()` and `str_order()`, which take the
additional argument `locale`.

```{r}
x <- c("apple", "banana", "eggplant")

str_sort(x, locale = "en")

str_sort(x, locale = "haw")

str_order(x, locale = "en")

str_order(x, locale = "haw")
```

### Exercises

1.  In code that doesn't use stringr, you'll often see `paste()` and `paste0()`.
    What's the difference between the two functions? What stringr function are
    they equivalent to? How do the functions differ in their handling of 
    `NA`?
    
    ```{r}
    # from the help page
    ## When passing a single vector, paste0 and paste work like as.character.
    paste0(1:12)
    paste(1:12)        # same
    as.character(1:12) # same
    
    ## If you pass several vectors to paste0, they are concatenated in a
    ## vectorized way.
    (nth <- paste0(1:12, c("st", "nd", "rd", rep("th", 9))))
    
    (nth <- paste(1:12, c("st", "nd", "rd", rep("th", 9))))
    
    (nth <- str_c(1:12, c("st", "nd", "rd", rep("th", 9))))
    
    
    (na_th <- paste0(1:13, c("st", "nd", "rd", rep("th", 9), NA)))
    
    (na_th <- paste(1:13, c("st", "nd", "rd", rep("th", 9), NA)))
    
    (na_th <- str_c(1:13, c("st", "nd", "rd", rep("th", 9), NA)))
    ```

    - `paste()` inserts a space between values, and may be overridden
    with `sep = ""`. In other words the default separator is a 
    space.
    
    - `paste0()` has a separator that is by default the empty 
    string so resulting vector values have no spaces in
    between.
    
    - `str_c()` is the stringr equivalent.
    
    - `paste()` and `paste0()` treat `NA` values as the literal string `"NA"`,
    whereas `str_c()` treats `NA` as missing, so any element combined
    with an `NA` results in an `NA`.
    
    
1.  In your own words, describe the difference between the `sep` and `collapse`
    arguments to `str_c()`.
    
    - `sep` is the separator that appears between vector values
    when these are concatenated in a vectorised fashion.
    - `collapse` is the separator between values when all 
    vectors are collapsed into a single contiguous string value.
    
    ```{r}
    (na_th_sep <- str_c(1:12, c("st", "nd", "rd", rep("th", 9)),
                        # sep only
                        sep = "'"))
    
    (na_th_col <- str_c(1:12, c("st", "nd", "rd", rep("th", 9)),
                        # collapse only
                        collapse = "; "))
    
    (na_th <- str_c(1:12, c("st", "nd", "rd", rep("th", 9)),
                    # both
                    sep = " ", collapse = ", "))
    ```

1.  Use `str_length()` and `str_sub()` to extract the middle character from 
    a string. What will you do if the string has an even number of characters?
    
    ```{r}
    x <- "This is a string."
    
    y <- "This is a string, no full stop"
    
    z <- "I"
    
    str_length(x)/2
    str_length(y)/2
    
    str_sub(x, ceiling(str_length(x)/2),
            ceiling(str_length(x)/2))
    
    str_sub(y, str_length(y)/2,
            str_length(y)/2 + 1)
    
    str_sub(z, ceiling(str_length(z)/2),
            ceiling(str_length(z)/2))
    ```
    

1.  What does `str_wrap()` do? When might you want to use it?

    It is a wrapper around `stringi::stri_wrap()` which implements 
    the Knuth-Plass paragraph wrapping algorithm.
    
    The text is wrapped to a given width. The default
    is 80; overriding this to 40 means roughly 40 characters
    on a line. Further arguments such as `indent` (the indentation
    of the first line of each paragraph) may be specified. It is
    useful when long text has to fit a fixed-width display, such as
    console output or a plot label.

1.  What does `str_trim()` do? What's the opposite of `str_trim()`?

    It removes whitespace from the start and end of a string.
    `str_pad()` does the opposite: it adds padding.
    
    `str_squish()` removes extra whitespace at the beginning of the string,
    at the end of the string and in the middle. `r emo::ji("celebrate")`
    
    ```{r}
    (x <- str_trim("  This has \n some spaces   in the     middle and end    "))
    # whitespace removed from begin and end of string
    writeLines(x)
    
    (y <- str_squish("  This has \n some spaces   in the     middle and end    ... oh, not any more ;)"))
    # whitespace removed from begin, middle and end of string
    writeLines(y)
    ```
    

1.  Write a function that turns (e.g.) a vector `c("a", "b", "c")` into 
    the string `a, b, and c`. Think carefully about what it should do if
    given a vector of length 0, 1, or 2.
    
    - length 0: return the empty string
    - length 1: return the string unchanged
    - length 2: return the first part, "and", then the second part
    - length 3 or more: separate the parts with ", " and put ", and " before the last part.
    
    ```{r}
    stringify <- function(v){
      n <- length(v)
      if (n == 0){
        ""  # nothing to join, so return the empty string
      }
      else if (n == 1){
        v
      }
      else if (n == 2){
        str_c(v, collapse = " and ")
      }
      else {
        # join all but the last value with ", ", then add ", and <last>"
        str_c(str_c(v[-n], collapse = ", "), ", and ", v[n])
      }
    }
    emp <- character(0) # a vector of length 0
    stringify(emp)
    
    x <- "a"
    stringify(x)
    
    y <- c("a", "b")
    stringify(y)
    
    z <- c("a", "b", "c")
    stringify(z)
    
    l <- letters
    stringify(letters)
    ```
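
The answers for `str_wrap()` and `str_pad()` above have no code, so here is a minimal sketch of both. The sentence `long_text` and the widths are made up for illustration.

```{r}
library(stringr)

# long_text is a made-up sentence used only for illustration
long_text <- "The quick brown fox jumps over the lazy dog and keeps on running through the field."

# wrap to a target width of 40 characters; line breaks are inserted as "\n"
wrapped <- str_wrap(long_text, width = 40)
writeLines(wrapped)

# pad "abc" to 7 characters on both sides, then trim the padding off again
padded <- str_pad("abc", width = 7, side = "both")
padded
str_trim(padded)
```

`str_pad()` followed by `str_trim()` gets back the original string here, which is one way to see them as opposites.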

<h2>Pattern Matching with Regex</h2>
## Pattern Matching with Regex

- Find a specific pattern
    
  ```{r view1, include=FALSE}
  x <- c("apple", "banana", "pear")
  # find any "an" char seq in vector x
  str_view(x, "an") 
  ```
    
  ```{r, echo=FALSE}
  decorate("view1") %>% 
    flair("str_view", background = "#9FDDBA", 
          color = "#008080")
  ```

- Use `.` to match any character except a newline.

    ```{r}
    # find any char followed by an "a" followed by any char
    str_view(x, ".a.") 
    ```
    
- What if we want to literally match `.`? 

    We need to escape the `.` to say "hey, literally find me a
    . char in the string, I don't want to use its special
    behaviour this time".
    
    `\\.`

    ```{r}
    (dot <- "\\.")
    writeLines(dot)
    
    str_view(c("abc", "a$c", "a.c", "b.e"), 
             # find a char 
             # followed by a literal . 
             # followed by another char
             ".\\..")
    
    ```

- What if we want the literal `\`?

    Recall that to add a literal backslash in a string we have to
    escape it using `\\`.
    
    ```{r}
    (backslash <- "This string contains the \\ char and we want to find it.")
    writeLines(backslash)
    ```
    
    So to find it using regex we need to escape each backslash
    in our regex i.e. `\\\\`. `r emo::ji("horns")`
    
    ```{r}
    writeLines(backslash)
    str_view(backslash, "\\\\")
    ```
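
An alternative to stacking backslashes, which the notes above do not cover, is to wrap the pattern in stringr's `fixed()`. It matches the text literally instead of treating it as a regex. A minimal sketch, with `dot_demo` as a made-up test vector:

```{r}
library(stringr)

# dot_demo is a hypothetical vector: no special char, a dot, a backslash
dot_demo <- c("abc", "a.c", "a\\c")

str_detect(dot_demo, "\\.")       # regex: escaped dot
str_detect(dot_demo, fixed("."))  # literal dot, no regex escaping needed

str_detect(dot_demo, "\\\\")      # regex: escaped backslash
str_detect(dot_demo, fixed("\\")) # literal backslash, only the string escape remains
```

With `fixed()` only the string-level escape is left, so a backslash needs two characters in the source instead of four.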

#### Exercises

1.  Explain why each of these strings don't match a `\`: `"\"`, `"\\"`, `"\\\"`.

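    As we saw above, to get a literal `\` into a string 
    we write `"\\"`.
    In a regex the `\` must itself be escaped with a `\`, 
    and each of those __two__ `\`'s must be escaped again in the string, 
    so matching requires 2 * 2 i.e. `r 2*2` `\`: `"\\\\"`.
    
    - `"\"`: the `\` escapes the closing quote, so the string is never terminated.
    - `"\\"`: this is the string `\`, which as a regex is an escape with nothing after it to escape.
    - `"\\\"`: the first two characters form a literal `\`, and the third `\` escapes the closing quote, so again the string is never terminated.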

1.  How would you match the sequence `"'\`?

    ```{r}
    (string4 <- "This is the funky string: \"\'\\")
    writeLines(string4)
    str_view(string4, "\\\"\\\'\\\\")
    ```


1.  What patterns will the regular expression `\..\..\..` match? 
    How would you represent it as a string?
    
    It matches a literal `.` followed by any character, three times in a row, e.g. `.x.y.z`. As a string it is written `"\\..\\..\\.."`.
    
    ```{r}
    (string5 <- ".x.y.z something else .z.a.r")
    writeLines(string5)
    str_view_all(string5, "\\..\\..\\..")
    ```

<h2>Anchors</h2>
## Anchors 

Use:

* `^` to match the start of the string.
* `$` to match the end of the string.

    ```{r}
    x
    str_view(x, "^a") # any starting with a?
    str_view(x, "a$") # any ending with a?
    ```

* To force a regex to match only the complete string (not just part 
of a bigger string), anchor it with both `^` and `$`.

    ```{r}
    (x <- c("apple pie", "apple", "apple cake"))
    str_view(x, "apple") # match any "apple"
    str_view(x, "^apple$") # match the word "apple"
    ```
    
* Match boundary between words with `\b`.
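
The notes give no example for `\b`, so here is a small sketch. The vector `boundary_demo` is made up. In a string the pattern is written `"\\bsum\\b"`.

```{r}
library(stringr)

# boundary_demo is a hypothetical vector of strings containing "sum"
boundary_demo <- c("summary", "sum", "rosum of", "sum total")

str_detect(boundary_demo, "sum")       # "sum" anywhere, even inside a word
str_detect(boundary_demo, "\\bsum\\b") # "sum" only as a whole word
```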

### Exercises

1.  How would you match the literal string `"$^$"`?

    ```{r}
    (x <- "How would you match the literal string $^$?")
    str_view(x, "\\$\\^\\$")
    ```

1.  Given the corpus of common words in `stringr::words`, create regular
    expressions that find all words that:
    
    a. Start with "y".
    
      ```{r}
      stringr::words %>% 
          as_tibble()
          
      str_view(stringr::words, "^y", match = TRUE)

      ```
    a. End with "x"
    
      ```{r}
      str_view(stringr::words, "x$", match = TRUE)
      ```
    
    a. Are exactly three letters long. (Don't cheat by using `str_length()`!)
    
      ```{r}
      str_view(stringr::words, "^...$", match = TRUE)
      ```
    
    a. Have seven letters or more.
    
      ```{r}
      str_view(stringr::words, "^.......", match = TRUE)
      ```

    Since this list is long, you might want to use the `match` argument to
    `str_view()` to show only the matching or non-matching words.

<h2>Character classes</h2>
## Character classes

* `\d`: matches any digit.
* `\s`: matches any whitespace (e.g. space, tab, newline).
* `[abc]`: matches a, b, or c.
* `[^abc]`: matches anything except a, b, or c.

To create a regular expression containing `\d` or `\s`, we'll need to escape the `\` for the string, so we'll type `"\\d"` or `"\\s"`.

A character class containing a single character is a nice alternative to backslash escapes when we're looking for a single metacharacter in a regex.

```{r}
(x <- "How would you match the literal string $^$?")
str_view(x, "[$][\\^][$]")

(y <- "This sentence has a full stop. Can we find it?")
str_view(y, "[.]")

# Look for a literal character that normally has special meaning in a regex
str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")
str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")
str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]")
```

This works for most (but not all) regex metacharacters: 

- __Works for__: `$` `.` `|` `?` `*` `+` `(` `)` `[` `{`. 
- __Does not work for__: Some characters have special meaning even inside a character class, and hence must be handled with backslash escapes. These are `]` `\` `^` and `-`. This is why the first example above uses `[\\^]` rather than `[^]`.
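
As a sketch of the second point, the characters that stay special inside a character class can be escaped there with a backslash. The string `cls_demo` is a made-up example.

```{r}
library(stringr)

# cls_demo is a hypothetical string containing -, ^ and ]
cls_demo <- "a-b^c]d"

# the regex is [\^\-\]] : a class holding an escaped ^, - and ]
str_extract_all(cls_demo, "[\\^\\-\\]]")[[1]]
```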

You can use _alternation_ to pick between two or more alternative patterns. For example, `abc|d..f` will match either `"abc"` or `"deaf"`. Note that the precedence for `|` is low, and
hence may be confusing (e.g. we may have read the above as `ab(c|d)..f` and expected it to match _abdeaf_ or _abchgf_ in full, but it does not: it matches either the first part `abc` OR the second part `d..f`). We need to use parentheses to make it clear what we are looking for.

```{r}
str_view(c("grey", "gray"), "gr(e|a)y")
```
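
To make the precedence point concrete, this sketch compares `abc|d..f` with the parenthesised `ab(c|d)..f` on a made-up vector `alt_demo`, using `str_extract()` to show exactly what each one matches.

```{r}
library(stringr)

# alt_demo is a hypothetical test vector
alt_demo <- c("abc", "deaf", "abdeaf", "abchgf")

# low precedence: either "abc" OR "d..f"
str_extract(alt_demo, "abc|d..f")

# parentheses limit the alternation to the third character
str_extract(alt_demo, "ab(c|d)..f")
```

The first pattern pulls only `deaf` out of `abdeaf` and only `abc` out of `abchgf`, while the second matches those two strings in full and misses `abc` and `deaf`.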

#### Exercises

1.  Create regular expressions to find all words that:

    1. Start with a vowel.
    
        ```{r}
        reg_ex <- "^[aeiou]"
        (x <- c("aardvark", "bat", "umbrella", 
                "escape", "xray", "owl"))
        str_view(x, reg_ex)
        ```
    

    1. That only contain consonants. (Hint: thinking about matching 
       "not"-vowels.)
       
       I don't know how to do this with only the tools we have 
       learnt so far, so you will see a new character below: the `+` 
       after the closing bracket of the character class. It 
       means one or more, i.e. find words in `stringr::words` that
       consist of one or more non-vowel characters and nothing else.

        ```{r}
        reg_ex <- "^[^aeiou]+$"
        str_view(stringr::words, reg_ex, match = TRUE)
        ```       

    1. End with `ed`, but not with `eed`.
    
        ```{r}
        reg_ex <- "[^e][e][d]$"
        str_view(stringr::words, reg_ex, match = TRUE)
        ```        
    
    1. End with `ing` or `ise`.
    
        ```{r}
        reg_ex <- "i(ng|se)$"
        str_view(stringr::words, reg_ex, match = TRUE)
        ```          
    
1.  Empirically verify the rule "i before e except after c".

    ```{r}
    correct_reg_ex <- "[^c]ie|[c]ei"
    str_view(stringr::words, correct_reg_ex, match = TRUE)
    opp_reg_ex <- "[^c]ei|[c]ie" # the opposite: "ei" not after c, or "ie" after c
    str_view(stringr::words, opp_reg_ex, match = TRUE)
    ```


1.  Is "q" always followed by a "u"?

    ```{r}
    reg_ex <- "q[^u]"
    str_view(stringr::words, reg_ex, match = TRUE)
    reg_ex <- "qu"
    str_view(stringr::words, reg_ex, match = TRUE)
    ```
    
    In the `stringr::words` dataset yes.


1.  Write a regular expression that matches a word if it's probably written
    in British English, not American English.
    
    ```{r}
    # match the British spelling only; the American one is left unmatched
    reg_ex <- "colour"
    str_view(c("colour", "color", "colouring"), reg_ex)
    reg_ex <- "visualis(e|ation)"
    str_view(c("visualisation", "visualization", 
               "visualise", "visualize"), 
             reg_ex)
    ```
    

1.  Create a regular expression that will match telephone numbers as commonly
    written in your country.
    
    ```{r}
    reg_ex <- "[+]27[(]0[)][\\d]+"
    str_view(c("0828907654", "+27(0)862345678", "777-8923-111"),
             reg_ex)
    ```
    

<h2>Repetition</h2>
## Repetition

The next step up in power involves controlling how many times a pattern matches:

* `?`: 0 or 1
* `+`: 1 or more
* `*`: 0 or more

You can also specify the number of matches precisely:

* `{n}`: exactly n
* `{n,}`: n or more
* `{,m}`: at most m
* `{n,m}`: between n and m

```{r}
x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"

str_view(x, "CC?") # C or CC if exists

str_view(x, "CC+") # CC or CCC or CCCC etc. at least two C's

# a C followed by one or more L's or X's, e.g. CL, CX or CLXXX
str_view(x, "C[LX]+") 

str_view(x, "C{2}") # find exactly 2 C's

str_view(x, "C{1,}") # find 1 or more C's

str_view(x, "C{1,2}") # min 1 C, max 2 C's

(y <- '&lt;span style="color:#008080;background-color:#9FDDBA"&gt;`alpha`&lt;//span&gt;')
writeLines(y)

# .*? - find to the first > otherwise greedy
str_view(y, '^&lt;.*?(&gt;){1,}') 
```

The `?` after `.*` makes the matching lazy rather than greedy: it matches as
few characters as possible, stopping at the first `>` (written `&gt;` in this string) instead of the last one.
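
Here is a smaller sketch of greedy versus lazy matching, using a made-up string `tag_demo` with plain angle brackets.

```{r}
library(stringr)

# tag_demo is a hypothetical string with two pairs of tags
tag_demo <- "<b>bold</b> and <i>italic</i>"

str_extract(tag_demo, "<.*>")  # greedy: runs to the last ">"
str_extract(tag_demo, "<.*?>") # lazy: stops at the first ">"
```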

### Exercises

1.  Describe the equivalents of `?`, `+`, `*` in `{m,n}` form.

    - `?` is `{0,1}`: 0 or 1
    - `+` is `{1,}`: 1 or more
    - `*` is `{0,}`: 0 or more

1.  Describe in words what these regular expressions match:
    (read carefully to see if I'm using a regular expression 
    or a string that defines a regular expression.)

    1. `^.*$`
    
        Matches any string that does not contain a newline
        character. This is a regular expression; as a string it is
        written the same way, `"^.*$"`, since nothing needs escaping.
        
        ```{r}
        reg_ex <-  "^.*$"
        (x <- "This is a string with 0 newline chars")
        writeLines(x)
        str_view(x, reg_ex)
        
        (y <- "This is a string with a couple \n\n newline chars")
        writeLines(y)
        str_view(y, reg_ex)
        ```
        
        Notice no match for y (none of the text highlighted).
        
        
    1. `"\\{.+\\}"`
    
        Matches a `{` followed by one or more of any character 
        but the newline character followed by the `}`. String
        defining a regular expression.
        
        ```{r}
        reg_ex <- "\\{.+\\}"
        str_view(c("{a}", "{}", "{a,b,c}", "{a, b\n, c}"), reg_ex)
        ```
        
        Notice that `{a, b , c}` is not highlighted, this is because
        there is a `\n` (newline sequence) after the b.
        
    1. `\d{4}-\d{2}-\d{2}`
    
        Matches exactly 4 digits followed by a `-`, followed by exactly
        2 digits, followed by a `-`, followed by exactly 2 digits.
        This is a regular expression; to write it as a string each `\d` needs another `\`.
        
        ```{r}
        reg_ex <- "\\d{4}-\\d{2}-\\d{2}" 
        str_view(c("1234-34-12", "12345-34-23", "084-87-98",
                   "2020-01-01"), reg_ex)
        ```
        
    1. `"\\\\{4}"`
      
        Matches exactly 4 backslashes. String defining reg expr.
        
        ```{r}
        reg_ex <- "\\\\{4}"
        str_view(c("\\\\", "\\\\\\\\"),
                 reg_ex)
        ```
        

1.  Create regular expressions to find all words that:

    1. Start with three consonants.
    
      ```{r}
      reg_ex <- "^[^aeiou]{3}.*"
      str_view(c("fry", "fly", "scrape", "scream", "ate", "women",
                 "strap", "splendid", "test"), reg_ex)
      ```
    
    1. Have three or more vowels in a row.
    
      ```{r}
      reg_ex <- ".*[aeiou]{3,}.*"
      str_view(stringr::words, reg_ex, match=TRUE)
      ```
    
    1. Have two or more vowel-consonant pairs in a row.
    
      ```{r}
      reg_ex <- ".*([aeiou][^aeiou]){2,}.*"
      str_view(stringr::words, reg_ex, match = TRUE)
      ```
    

1.  Solve the beginner regexp crosswords at
    <https://regexcrossword.com/challenges/beginner>.
    
    <img src="assets/exercise.PNG" width="1167" height="258" alt="regex complete">
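
The equivalences from the first exercise can be checked with a short sketch. The helper `same_match()` and the vector `rep_demo` are made up for this check. Agreement on a handful of strings is an illustration, not a proof.

```{r}
library(stringr)

# rep_demo is a hypothetical test vector
rep_demo <- c("", "a", "aa", "baaa")

# hypothetical helper: do two patterns extract the same first match everywhere?
same_match <- function(p1, p2) {
  identical(str_extract(rep_demo, p1), str_extract(rep_demo, p2))
}

same_match("a?", "a{0,1}")
same_match("a+", "a{1,}")
same_match("a*", "a{0,}")
```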


<h2>Backreferences</h2>
## Backreferences

Parentheses can be used to make complex expressions more clear, and can also create a _numbered_ capturing group (number 1, 2 etc.). A capturing group stores _the part of the string_ matched by the part of the regular expression inside the parentheses. You can refer to the same text as previously matched by a capturing group with _backreferences_, like `\1`, `\2` etc. 

The following regex finds all fruits that have a repeated pair of letters.

```{r}
# (..)\\1 says find any two letters - these are a group, is 
# this then followed by the same 2 letters?
# Yes - match found
# No - whawha
str_view(fruit, "(..)\\1", match = TRUE)
```

For example, for `banana`:

- It starts at "ba", which becomes group 1, then it moves along 
and asks: are the next 2 letters "ba" (i.e. the same as group 1) too? Nope.
- It moves along to "an", which is the new group 1. Then it asks: are the next two letters the same as group 1 (i.e. "an")? Yes they are! Found a word that matches.
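
To see what group 1 actually captured, one option is `str_match()`, which returns the full match in the first column and each capturing group in the following columns. A small sketch on a made-up vector `backref_demo`:

```{r}
library(stringr)

# backref_demo is a hypothetical vector; "apple" has no repeated pair
backref_demo <- c("banana", "coconut", "apple")

str_extract(backref_demo, "(..)\\1") # the full match only
str_match(backref_demo, "(..)\\1")   # column 1: full match, column 2: group 1
```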

### Exercises

1.  Describe, in words, what these expressions will match:

    1. `(.)\1\1`
    
        This matches any character repeated three times.
        
        ```{r}
        reg_ex <- "(.)\\1\\1"
        str_view(c("Oooh", "Ahhh", "Awww", "Ergh"), reg_ex)
        ```
        
        Note that `O` and `o` are different.
        
        
    1. `"(.)(.)\\2\\1"`
    
        This matches any two characters followed by the same two 
        characters in reverse order, e.g. abba
        
        ```{r}
        reg_ex <- "(.)(.)\\2\\1"
        str_view(c("abba"), reg_ex)
        str_view(words, reg_ex, match=TRUE)
        ```
        
    1. `(..)\1`
        
        This matches any two characters immediately followed by the same two characters, e.g. b`anan`a.
        
        ```{r}
        str_view(fruit, "(..)\\1", match = TRUE)
        ```
        
    1. `"(.).\\1.\\1"`
    
        This matches a character followed by another char followed 
        by the same character as the start, followed by another char,
        followed by the character. e.g. abaca
        
        ```{r}
        str_view(words, "(.).\\1.\\1", match = TRUE)
        ```        
        
    1. `"(.)(.)(.).*\\3\\2\\1"`
    
        This matches three characters, followed by 0 or more other
        characters, followed by the same 3 characters in
        reverse order.
        
        ```{r}
        reg_ex <- "(.)(.)(.).*\\3\\2\\1"
        str_view(c("bbccbb"), reg_ex)
        str_view(words, reg_ex, match=TRUE)
        ```        
        

1.  Construct regular expressions to match words that:

    1. Start and end with the same character.
    
      ```{r}
      reg_ex <- "^(.).*\\1$"
      str_view(words, reg_ex, match = TRUE)
      ```
    
    1. Contain a repeated pair of letters
       (e.g. "church" contains "ch" repeated twice.)
       
      ```{r}
      reg_ex <- "(..).*\\1"
      str_view("church", reg_ex)
      str_view(words, reg_ex, match=TRUE)
      ```
    
    1. Contain one letter repeated in at least three places
       (e.g. "eleven" contains three "e"s.)
       
      ```{r}
      reg_ex <- "(.).*\\1.*\\1"
      str_view(words, reg_ex, match = TRUE)
      ```

<h2>Detect Matches</h2>
## Detect Matches

#### str_detect()

Use <span style="color: #008080;background-color:#9FDDBA">`str_detect()`</span>. It returns a logical vector the same length as the input.

Since it is a logical vector and numerically TRUE == 1 and FALSE == 0
we can also use `sum()`, `mean()` to get information about
matches found.

```{r detect1, include=FALSE}
(x <- c("apple", "banana", "pear"))
str_detect(x, "e")
```

```{r, echo=FALSE}
decorate("detect1") %>% 
  flair("str_detect", background = "#9FDDBA", 
        color = "#008080") %>% 
  flair("TRUE", background = "#9FDDBA", 
        color = "#008080")
```

```{r detect2, include=FALSE}
x
sum(str_detect(x, "e"))
# How many common words start with t?
sum(str_detect(words, "^t"))
# What proportion of common words end with a vowel?
mean(str_detect(words, "[aeiou]$"))
```

```{r, echo=FALSE}
decorate("detect2") %>% 
  flair("sum", background = "#9FDDBA", 
        color = "#008080") %>% 
  flair("mean", background = "#9FDDBA", 
        color = "#008080")
```

```{r}
# Find all words containing at least one vowel, and negate
no_vowels_1 <- !str_detect(words, "[aeiou]")
# Find all words consisting only of consonants (non-vowels)
no_vowels_2 <- str_detect(words, "^[^aeiou]+$")
identical(no_vowels_1, no_vowels_2)
# you can also use `negate = TRUE`
no_vowels_3 <- str_detect(words, "[aeiou]", negate = TRUE)
identical(no_vowels_1, no_vowels_3)
identical(no_vowels_3, no_vowels_2)
```

#### str_subset()

A common use of `str_detect()` is to select the elements that match a pattern. The wrapper
<span style="color: #008080;background-color:#9FDDBA">`str_subset()`</span> does this in one step.

```{r sub1, include=FALSE}
words[str_detect(words, "x$")]
# str_subset() is a wrapper around x[str_detect(x, pattern)]
str_subset(words, "x$")
```

```{r, echo=FALSE}
decorate("sub1") %>% 
  flair("str_subset", background = "#9FDDBA", 
        color = "#008080")
```

#### filter(str_detect())

When we want to find matches in a column in a dataframe we can combine `str_detect()` with `filter()`.

```{r sub2, include=FALSE}
(df <- tibble(
  word = words,
  i = seq_along(word)
))

df %>% 
  filter(str_detect(word, "x$"))
```

```{r, echo=FALSE}
decorate("sub2") %>% 
  flair("filter", background = "#9FDDBA", 
        color = "#008080") %>% 
  flair("str_detect", background = "#9FDDBA", 
        color = "#008080")
```

#### str_count()

Instead of using `str_detect()` which returns a __TRUE__ OR __FALSE__ we can use <span style="color: #008080;background-color:#9FDDBA">`str_count()`</span> which gives
us a number of matches in each string.

```{r count1, include=FALSE}
(x <- c("apple", "banana", "pear"))
str_count(x, "e")
str_count(x, "a")
# On average, how many vowels per word?
mean(str_count(words, "[aeiou]"))
```

```{r, echo=FALSE}
decorate("count1") %>% 
  flair("str_count", background = "#9FDDBA", 
        color = "#008080")
```

We often use `str_count()` with `mutate()`.

```{r}
df %>% 
  mutate(vowels = str_count(word, "[aeiou]"),
         consonants = str_count(word, "[^aeiou]"))
```

Matches never overlap. For example, in `"abababa"`, the pattern `"aba"` matches twice, not three times. You can think of it as placing a marker at the beginning of the string, then moving along looking for `pattern`: it sees `a` then `b` then `a`, so it has found one 
`pattern == aba`. The marker now lies at the 4th letter in the string, and the search for more occurrences proceeds from there. `b` cannot start a match, so it skips over and goes to the 5th
character `a`, then the 6th `b`, then the 7th `a` and has found another occurrence. Hence 2 occurrences are found, i.e. it moves sequentially over the string, and does not reuse characters that already belong to a match.

  <img src="assets/str_count.PNG" width="650" height="550" alt="how matching proceeds">

```{r}
str_count("abababa", "aba")
str_view_all("abababa", "aba")
```
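
If overlapping occurrences are what you are after, one option (an addition to these notes) is a zero-width lookahead `(?=...)`, the same syntax used in the vowel exercise below. It checks for the pattern at each position without consuming any characters. A minimal sketch:

```{r}
library(stringr)

str_count("abababa", "aba")       # non-overlapping matches
str_locate_all("abababa", "aba")  # where those matches start and end

str_count("abababa", "(?=aba)")   # positions where "aba" starts, overlaps included
```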

### Exercises

1.  For each of the following challenges, try solving it by using both a single
    regular expression, and a combination of multiple `str_detect()` calls.
    
    a.  Find all words that start or end with `x`.
    
        ```{r}
        reg_ex <- "(^x.*|.*x$)"
        str_view(words, reg_ex, match = TRUE)
        str_detect(c("xray", "box", "text", "vex"), reg_ex)
        
        # the same with multiple str_detect() calls
        str_detect(c("xray", "box", "text", "vex"), "^x") |
          str_detect(c("xray", "box", "text", "vex"), "x$")
        ```
    
    
    a.  Find all words that start with a vowel and end 
    with a consonant.
    
        ```{r}
        reg_ex <- "^[aeiou].*[^aeiou]$"
        df %>% 
          filter(str_detect(word, reg_ex))
        ```
    
    
    a.  Are there any words that contain at least one of each different
        vowel?
        ```{r}
        # https://stackoverflow.com/questions/54267095/what-is-the-regex-to-match-the-words-containing-all-the-vowels
        reg_ex <- "\\b(?=\\w*?a)(?=\\w*?e)(?=\\w*?i)(?=\\w*?o)(?=\\w*?u)[a-zA-Z]+\\b"
        str_detect(c("eunomia", "eutopia", "sequoia"), reg_ex)
        str_view(c("eunomia", "eutopia", "sequoia"), reg_ex)
        ```
        

    a.  What word has the highest number of vowels? What word 
    has the highest proportion of vowels? (Hint: what 
    is the denominator?)
    
        ```{r}
        # "[aeiou]" counts every vowel; "[aeiou]+" would count runs of vowels
        df %>% 
          mutate(vowels = str_count(word, "[aeiou]"),
                 len_word = str_length(word),
                 prop_vowels = vowels / len_word) %>% 
          arrange(-prop_vowels)
          
        df %>% 
          mutate(vowels = str_count(word, "[aeiou]"),
                 len_word = str_length(word),
                 prop_vowels = vowels / len_word) %>% 
          arrange(-vowels, -prop_vowels)
        ```
        
        I see these are two different things.
        The highest number of vowels, is just the word with the
        most vowels. The proportion on the other hand is
        `num_vowels_in_word / num_letters_in_word`.
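
The "at least one of each vowel" question above is only answered with a single regex, so here is a sketch of the other approach the exercise asks for: one `str_detect()` call per vowel, combined with `&`. The vector `vowel_demo` is made up; swapping in `words` should apply the same check to the full list.

```{r}
library(stringr)

# vowel_demo is a hypothetical test vector
vowel_demo <- c("sequoia", "banana", "education")
vowels <- c("a", "e", "i", "o", "u")

# one logical vector per vowel, then combine them element-wise with &
has_each <- lapply(vowels, function(v) str_detect(vowel_demo, v))
has_all <- Reduce(`&`, has_each)

vowel_demo[has_all]
```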

<h2>Extract Matches</h2>  
## Extract Matches

To extract the actual text of a match, use <span style="color: #008080;background-color:#9FDDBA">`str_extract()`</span>.

```{r}
length(sentences)
head(sentences)
```

Let's say we want to find all sentences that contain a colour.

```{r}
colours <- c("red", "orange", "yellow", "green", "blue", "purple")
# make a match string by saying red|orange|...|purple
(colour_match <- str_c(colours, collapse = "|"))
```

```{r}
has_colour <- str_subset(sentences, colour_match)
matches <- str_extract(has_colour, colour_match)
head(matches)
```

```{r}
more <- sentences[str_count(sentences, colour_match) > 1]
str_view_all(more, colour_match)
```

- To extract multiple matches, use <span style="color: #008080;background-color:#9FDDBA">`str_extract_all()`</span>. 
- This returns a list.
- To get this in matrix format use `simplify = TRUE`
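
The three return shapes can be compared on a small made-up example. `colour_demo` and `pattern_demo` are hypothetical names, chosen so they do not clash with `colours` and `colour_match` above.

```{r}
library(stringr)

# hypothetical sentences and pattern, for illustration only
colour_demo <- c("a red and blue kite", "green grass", "no colour here")
pattern_demo <- "red|green|blue"

str_extract(colour_demo, pattern_demo)                      # first match only, NA if none
str_extract_all(colour_demo, pattern_demo)                  # a list, one element per string
str_extract_all(colour_demo, pattern_demo, simplify = TRUE) # a matrix padded with ""
```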

### Exercises

1.  In the previous example, you might have noticed that the regular
    expression matched "flickered", which is not a colour. Modify the
    regex to fix the problem.
    
    ```{r}
    colours <- c("red", "orange", "yellow", "green", "blue", "purple")
    # make a match string with word boundaries: \bred\b|\borange\b|...|\bpurple\b
    (colour_match <- str_c("\\b", colours, "\\b", collapse = "|"))
    more <- sentences[str_count(sentences, colour_match) > 1]
    str_view_all(more, colour_match)
    ```
    

1.  From the Harvard sentences data, extract:

    a. The first word from each sentence.
    
        ```{r}
        reg_ex <- "^[A-Za-z']+\\b"
        first_word <- str_extract(sentences, reg_ex)
        head(first_word)
        ```
    
    a. All words ending in `ing`.
    
        ```{r}
        reg_ex <- "\\b[a-zA-Z']+ing\\b"
        words_ <- str_extract_all(str_subset(sentences, reg_ex),
                                  reg_ex, simplify = TRUE)
        head(words_)
        ```
    
    a. All plurals.
    
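        Ok so some words end with `s` but are NOT plurals, e.g.
        `bass`, `mass` etc. The regex below is only a heuristic: it looks
        for words of at least five letters ending in `s`, so it will still
        catch some non-plurals.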
    
        ```{r}
        reg_ex <- "\\b[a-zA-Z]{4,}(es|ies|s)\\b"
        words_ <- str_extract_all(sentences, reg_ex,
                              simplify = TRUE)
        head(words_, 10)
        ```