# String Manipulation with stringr in R
## -- DataCamp - R Programmer Career Track - Course 7 --

 **Note**

To use this code, select the R environment in Anaconda
 
 **Used packages**
 - tidyverse
 - rebus
 - stringi - read txt
 - babynames - data package
 
**Table of contents**
- [&nbsp;&nbsp;1. String basics](#toc_46431_1)
    - [&nbsp;&nbsp;1.1  Entering strings](#toc_46431_1.1)
    - [&nbsp;&nbsp;1.2  Convert numbers to string using formats](#toc_46431_1.2)
    - [&nbsp;&nbsp;1.3  Paste together strings and variables](#toc_46431_1.3)
- [&nbsp;&nbsp;2. Introducing stringr](#toc_46431_2)
    - [&nbsp;&nbsp;2.1  Basic string functions](#toc_46431_2.1)
    - [&nbsp;&nbsp;2.2  Subset strings and detect matches using patterns](#toc_46431_2.2)
    - [&nbsp;&nbsp;2.3  Join and Split strings](#toc_46431_2.3)
    - [&nbsp;&nbsp;2.4  Remove and replace matches in strings](#toc_46431_2.4)
- [&nbsp;&nbsp;3. Regular expressions](#toc_46431_3)
    - [&nbsp;&nbsp;3.1  Stringr regular expressions](#toc_46431_3.1)
    - [&nbsp;&nbsp;3.2  Using the rebus package](#toc_46431_3.2)
- [&nbsp;&nbsp;4. Advanced string manipulation](#toc_46431_4)
    - [&nbsp;&nbsp;4.1  Capture groups](#toc_46431_4.1)
    - [&nbsp;&nbsp;4.2  Backreferences](#toc_46431_4.2)
    - [&nbsp;&nbsp;4.3  Unicode matching](#toc_46431_4.3)

**Set environment and plot size**

In [2]:
suppressMessages(library(tidyverse))
suppressMessages(library(rebus))
suppressMessages(library(stringi))
suppressMessages(library(babynames))
options(repr.plot.width=7, repr.plot.height=7) # controls display format
theme_set(theme_grey(base_size =10))

Note: if the above code returns an error message:
- Check that the correct R environment is selected in Anaconda
- Restart computer

**Import data**

In [117]:
adverbs <- readRDS("data/strings/adverbs.RDS")
catcidents <- readRDS("data/strings/catcidents.RDS")
dna <- readRDS("data/strings/dna.RDS")
narratives <- readRDS("data/strings/narratives.RDS")

<a name="toc_46431_1"></a>
## 1.   String basics

<a name="toc_46431_1.1"></a>
**1.1 Entering strings**
 
- To tell R something is a string you surround it with double quotes `" "`.
- If we need to input a string that has double quotes inside it, we can
    - use single quotes `' '`, 
    - use an **escape sequence**: `\"`. Because R always prints strings with double quotes (even if we used single quotes to define them), double quotes inside a printed string are always shown with a backslash in front of them: `\"`.
    
<u>Good practices</u>
- If the string's text does not contain any quotes, wrap it in double quotes.
- If the text contains double quotes but not single quotes, wrap it in single quotes. 
- If the text contains both kinds of quotes, wrap it in double quotes and escape the double quotes in the text.

<u>Other uses of escape sequences</u>:
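- A literal backslash: write `"\\"`
- Newline: `"\n"`
- Unicode characters: `"\u"` followed by up to 4 hex digits, or `"\U"` followed by up to 8 hex digits (see the [Unicode charts](http://www.unicode.org/charts/) for the code of a particular character). E.g. `writeLines("\U1F30D")`, which may not display correctly in Jupyter.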
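
The escape sequences listed above can be tried out with a minimal sketch like the following. The variable names (`win_path`, `mu_short`, `mu_long`) are made up for illustration; only base R is used.

```r
# A literal backslash is typed as "\\" but counts as ONE character
win_path <- "C:\\temp"
writeLines(win_path)   # written out with a single backslash
nchar("\\")            # 1
nchar(win_path)        # 7

# "\n" starts a new line when the string is written out
writeLines("first line\nsecond line")

# \u takes up to 4 hex digits, \U up to 8; here both give the Greek letter mu
mu_short <- "\u03BC"
mu_long  <- "\U000003BC"
identical(mu_short, mu_long)   # TRUE
```

Note that `print()` would show the doubled backslash again, because it displays the string the way it has to be typed.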

<u>Some common ways to print strings:</u>
- `print()`
- `writeLines()`
- `cat()`

In [3]:
print("This is a string")
print('I said "hi!"')
print("I'd say\"hi!\"")

[1] "This is a string"
[1] "I said \"hi!\""
[1] "I'd say\"hi!\""


To see the string the intended way, use `writeLines()` or `cat()` instead of `print()`:

In [4]:
writeLines("I'd say\"hi!\"")
cat("I'd say\"hi!\"")

I'd say"hi!"
I'd say"hi!"

These functions can also be used to concatenate strings:

In [5]:
line1 <- "The table was a large one, but the three were all crowded together at one corner of it:"
line2 <- '"No room! No room!" they cried out when they saw Alice coming.'
line3 <- "\"There's plenty of room!\" said Alice indignantly, and she sat down in a large arm-chair at one end of the table."
lines <- c(line1, line2, line3)

print(lines) 
writeLines("\n")

writeLines(lines, sep = " ")
writeLines("\n")

cat(lines) # adds a space between elements by default

[1] "The table was a large one, but the three were all crowded together at one corner of it:"                           
[2] "\"No room! No room!\" they cried out when they saw Alice coming."                                                  
[3] "\"There's plenty of room!\" said Alice indignantly, and she sat down in a large arm-chair at one end of the table."


The table was a large one, but the three were all crowded together at one corner of it: "No room! No room!" they cried out when they saw Alice coming. "There's plenty of room!" said Alice indignantly, and she sat down in a large arm-chair at one end of the table. 

The table was a large one, but the three were all crowded together at one corner of it: "No room! No room!" they cried out when they saw Alice coming. "There's plenty of room!" said Alice indignantly, and she sat down in a large arm-chair at one end of the table.

<a name="toc_46431_1.2"></a>
**1.2 Convert numbers to string using formats**

- `as.character()` - gives an unformatted character representation of the number. Trailing zeros are removed
- `format()` - Arguments:
    - `scientific = TRUE`
    - `digits =` - in scientific representation, the number of digits shown before the exponent; in fixed (normal) representation, the number of significant digits used for the smallest (in magnitude) number.
    - `trim =` - if FALSE, values are right-justified to a common width; if TRUE, the leading blanks used for justification are suppressed.
    - `big.mark =` - character inserted between every three digits before the decimal point, to make long numbers easier to read.
    - `nsmall =` - the minimum number of digits to the right of the decimal point.
- `formatC()` - has more arguments than `format()` and formats numbers individually. Some examples:
    - `format =` - `"f"`: fixed (normal) format; `"e"`: scientific format; `"g"`: use the scientific format only if it saves space (default for real numbers)
    - `digits =` - with `"g"`: number of significant digits; with `"f"`: number of digits after the decimal point
    - `flag =` - allows modifiers:
        - `+`: force the display of the sign
        - `-`: left-align numbers
        - `0`: pad numbers with leading zeros

<u>`as.character()` and `as.numeric()`</u> 

In [6]:
as.character(c(1.30, 0.01))
as.numeric(c("1.30", "0.01"))

☝️ No trailing zeros

<u>`format()`</u> 

In [7]:
format(1.3458985) # rounded up
format(1.3458975) # rounded down

☝️ `format()` uses the **"round half to even"** rule (like the `round()` function) instead of the more familiar **"round half up"** rule. This means: if a number lies exactly halfway between two candidate values, it is rounded to the one whose last kept digit is even. For rounding to an integer: if the fractional part is exactly 0.5 and the integer part is EVEN, round towards zero; if the integer part is ODD, round away from zero. In either case, the rounded number is an even integer.
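
The rule is easiest to see with `round()` on values that lie exactly halfway between two integers. This is a small illustrative sketch in base R; `halves` and `rounded` are made-up names.

```r
# Values exactly halfway between two integers
halves <- c(0.5, 1.5, 2.5, 3.5)

# Each one is rounded to the nearest EVEN integer
rounded <- round(halves)
rounded   # 0 2 2 4
```

With the "round half up" rule the result would have been 1 2 3 4 instead.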

In [8]:
format(c(0.0011, 10012, 0.011, 1, 1.0011, 2.011),  scientific = FALSE, digits = 2) # the smallest number is 0.0011
format(c(1.0011, 2.011, 1), scientific = FALSE, digits = 1) # the smallest number is 1

☝️ When `scientific = FALSE`, `digits=` sets the number of significant digits shown for the smallest (in magnitude) number in the vector; the other numbers are formatted to the same number of decimal places.

In [9]:
format(1.34589829100, scientific = TRUE)
format(1.34589829100, scientific = TRUE, digits = 3)

☝️ When `scientific = TRUE`, `digits=` sets the number of digits before the exponent

In [10]:
format(c(1.000, 2.0, 1, 1.245), scientific = FALSE, digits = 2)
format(c(1.000, 2.0, 1, 1.245), scientific = FALSE, digits = 2, nsmall = 2)

☝️ `nsmall` sets the minimum number of digits after ".". It also keeps trailing zeros.

In [11]:
income <- c(72.19 , 1030.18, 10291.93, 1189192.18)
pretty_income <- format(income, digits = 2, big.mark = ",")
writeLines(pretty_income)

       72
    1,030
   10,292
1,189,192


☝️ When numbers are long, it can be helpful to "prettify" them by inserting a "," every three digits.
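
The `trim` argument described in the notes above is not shown in the cells, so here is a minimal sketch of its effect (the vector `counts` is a made-up example):

```r
counts <- c(1, 10, 100)

# Default (trim = FALSE): values are right-justified to a common width
padded <- format(counts)
padded    # "  1" " 10" "100"

# trim = TRUE: the leading blanks are suppressed
trimmed <- format(counts, trim = TRUE)
trimmed   # "1" "10" "100"
```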

<u>`formatC()`</u> 

In [12]:
# format() vs formatC() defaults
format(1.34589889100)
formatC(1.34589889100)

In [13]:
formatC(c(0.0011, 0.021, 10012, 1),  format = "g", digits = 1) # flexible format
formatC(c(0.0011, 0.021, 10012, 1),  format = "e", digits = 1)  # scientific format
formatC(c(0.0011, 0.021, 10012, 1),  format = "f", digits = 1)  # fixed format

☝️ Behaviour of the `digits` argument:
- `formatC()` formats numbers individually.
- With the flexible format (`"g"`), `digits` is the number of significant digits, as in `format()`.
- With the scientific (`"e"`) and fixed (`"f"`) formats, `digits` is the number of digits after the decimal point. This is more predictable than `format()`, because the number of places after the decimal point is fixed regardless of the values being formatted.

In [14]:
percent_change <-c(4.000, -1.910,  3.000, -5.002)
formatC(percent_change, flag = "+")
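
The other two flags from the notes above (`0` and `-`) only have a visible effect when a field width is given. The sketch below assumes the `width` argument of `formatC()`, which is not covered in these notes, and a made-up vector `ids`:

```r
ids <- c(5, 42, 123)

# Pad with leading zeros up to a total width of 5 characters
zero_padded <- formatC(ids, width = 5, format = "d", flag = "0")
zero_padded    # "00005" "00042" "00123"

# Left-align within the same width (padding goes to the right)
left_aligned <- formatC(ids, width = 5, format = "d", flag = "-")
left_aligned   # "5    " "42   " "123  "
```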

<a name="toc_46431_1.3"></a>
**1.3 Paste together strings and variables**

- `paste()` - separates the pasted elements with a space by default
- `paste0()` - uses no separator
- `str_c()` - similar to `paste0()` (no separator by default), but part of the stringr package (see later)

Arguments:
- `sep=` - separator placed between the elements being pasted together
- `collapse=` - additionally collapses all the resulting strings into one, with the specified separator between them

We can use `paste()` to paste strings together element by element:

In [15]:
paste("A", "B", "C", " D", "E")
paste0("A", "B", "C", " D", "E")

In [16]:
paste("A", "B", "C", "D", "E", sep = "-")
paste(income, "%", sep = "")
paste(c("A", "B", "C", "D", "E"), "a", sep = "-")

However, usually we'll use paste with a mix of fixed strings and variables, allowing us to build up complicated output:

In [17]:
years <- c(2010, 2011, 2012, 2013)
pretty_percent <- formatC(percent_change, flag = "+")

paste(years, paste(pretty_percent, "%", sep=""), sep=": ")

In [18]:
animal_goes <- "moo"

paste(c("Here", "There", "Everywhere"), "a", animal_goes, sep = ", ") # returns 3 separate strings
paste(c("Here", "There", "Everywhere"), "a", animal_goes, collapse = ", ") # returns 1 collapsed string
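
`sep` and `collapse` can also be combined: `sep` is applied first, element by element, and `collapse` then joins the results. A minimal sketch with made-up vectors `first` and `second`:

```r
first  <- c("a", "b", "c")
second <- 1:3

# sep joins the elements pairwise: still three strings
joined <- paste(first, second, sep = "-")
joined       # "a-1" "b-2" "c-3"

# collapse then glues those three strings into a single one
one_string <- paste(first, second, sep = "-", collapse = ", ")
one_string   # "a-1, b-2, c-3"
```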

Some more complex examples:

In [19]:
# Create a function that works for any animal on Old MacDonald's farm

old_mac <- function(animal, animal_goes){
    
    eieio <- paste("E", "I", "E", "I", "O", sep = "-")
    old_mac <- "Old MacDonald had a farm"
    
    writeLines(c(
        old_mac,
        eieio,
        paste("And on his farm he had a", animal),
        eieio,
        paste(c("Here", "There", "Everywhere"), "a", 
              c(animal_goes, animal_goes, paste(rep(animal_goes, 2), collapse = "-")),
              collapse = ", "),
        old_mac,
        eieio))
    }

old_mac("dog", "woof")

Old MacDonald had a farm
E-I-E-I-O
And on his farm he had a dog
E-I-E-I-O
Here a woof, There a woof, Everywhere a woof-woof
Old MacDonald had a farm
E-I-E-I-O


In [20]:
# When re-running this cell we will get a brand new pizza order each time!

toppings <- c("anchovies","artichoke","bacon","breakfast bacon","Canadian bacon","cheese","chicken","chili peppers","feta","garlic","green peppers","grilled onions","ground beef","ham","hot sauce","meatballs","mushrooms","olives","onions","pepperoni","pineapple","sausage","spinach","sun-dried tomato","tomatoes")

my_toppings <- sample(toppings, size = 3) # Randomly sample 3 toppings
my_toppings_and <- paste(c("", "", "and "), my_toppings, sep = "")
these_toppings <- paste(my_toppings_and, collapse = ", ")
my_order <- paste("I want to order a pizza with ", these_toppings, ".", sep = "")
writeLines(my_order)

I want to order a pizza with sausage, spinach, and feta.


<a name="toc_46431_2"></a>
## 2.   Introducing stringr

- all functions start with `str_`
- all functions take a vector of strings as the first argument

In [21]:
# Prepare data
head(babynames::babynames)

babynames_2014 <- filter(babynames, year == 2014)
boy_names <- filter(babynames_2014, sex == "M")$name
girl_names <- filter(babynames_2014, sex == "F")$name

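| year `<dbl>` | sex `<chr>` | name `<chr>` | n `<int>` | prop `<dbl>` |
|---|---|---|---|---|
| 1880 | F | Mary | 7065 | 0.07238359 |
| 1880 | F | Anna | 2604 | 0.02667896 |
| 1880 | F | Emma | 2003 | 0.02052149 |
| 1880 | F | Elizabeth | 1939 | 0.01986579 |
| 1880 | F | Minnie | 1746 | 0.01788843 |
| 1880 | F | Margaret | 1578 | 0.0161672 |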


<a name="toc_46431_2.1"></a>
**2.1 Basic string functions**

Manage case
- `str_to_lower()` - lowercase
- `str_to_upper()` - uppercase
- `str_to_title()` - title case (the first letter of every word is capitalised)

Manage length
- `str_length()` - gives the number of characters for each string in a vector
- `str_trim()` - removes whitespace from the left, the right, or both sides
- `str_squish()` - removes leading and trailing whitespace and collapses multiple spaces into single spaces

Manage order
- `str_sort()` - sort a character vector
- `str_order()` - returns the indexes that would sort a character vector

<u>Manage case </u>

In [22]:
st <-  "MyStrinG"
str_to_lower(st)
str_to_upper(st)
str_to_title(st)
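
To make the difference between title case and sentence case concrete, here is a small sketch. It assumes a stringr version that provides `str_to_sentence()` (1.4.0 or later); `phrase` is a made-up example string.

```r
library(stringr)

phrase <- "the quick brown fox"

# Title case: every word starts with a capital letter
title_case <- str_to_title(phrase)
title_case      # "The Quick Brown Fox"

# Sentence case: only the first word is capitalised
sentence_case <- str_to_sentence(phrase)
sentence_case   # "The quick brown fox"
```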

<u>The `str_length()` function </u>

In [23]:
# Find the length of all boy_names

length(boy_names) # gives the number of elements

str_length(boy_names) %>%
  head()  # gives the length of each element

In [24]:
str_trim(" hel   lo  ")
str_trim(" hel   lo  ", side = "left")
str_squish(" hel  lo  ")

<u>Manage order </u>

In [25]:
head(boy_names)

str_order(boy_names)  %>%
  head()

str_sort(boy_names, decreasing = TRUE)  %>%
  head()

<a name="toc_46431_2.2"></a>
**2.2 Subset strings and detect matches using patterns**

Apart from `str_sub()`, which works with positions, all of the functions below take a **pattern** argument, which can also be a **regular expression**:

- `str_sub()` - extracts parts of strings based on their location.
- `str_subset()` - returns the strings that are a match
- `str_detect()` - detect matches, gives TRUE, FALSE result
- `str_which()` - finds the indexes of the strings that contain the pattern
- `str_count()` - counts the number of matches in each string
- `str_locate()` - locates the start and end positions of the first match in each string
- `str_starts()` | `str_ends()` - detects a match at the beginning | end of the string
- `str_extract()` - returns the first matched part of each string (or NA) as a vector
- `str_match()` - returns the first matched part of each string (or NA), plus any capture groups, as a matrix
- `str_view_all()` - displays the strings with all matches highlighted (also works with regex)

In [131]:
boy_names %>% head()

str_sub(boy_names, start = -1, end = -1) %>% head() # last letters

str_sub(boy_names, start = 1, end = 1) %>% # table of first letters
  table()

.
   A    B    C    D    E    F    G    H    I    J    K    L    M    N    O    P 
1454  651  770  998  549  185  334  403  235 1390 1291  537  914  424  207  230 
   Q    R    S    T    U    V    W    X    Y    Z 
  56  778  806  771   43  160  174   56  252  379 

<u>Using patterns:</u>

In [27]:
pizzas <- c("cheese", "pepperoni", "sausage and green peppers ", "sausage and green peppers and red peppers")

str_subset(pizzas, pattern = "pepper")
str_detect(pizzas, pattern = "pepper")
str_which(pizzas, pattern = "pepper")
str_count(pizzas, pattern = "pepper")
str_starts(pizzas, pattern = "pepper")

In [28]:
str_locate(pizzas, pattern = "pepper")

| start | end |
|---|---|
| NA | NA |
| 1 | 6 |
| 19 | 24 |
| 19 | 24 |


In [29]:
str_extract(pizzas, pattern = "^p.+i$")  # for regular expression, see chapter 3
str_match(pizzas, pattern = "^p.+i$")

| [,1] |
|---|
| NA |
| pepperoni |
| NA |
| NA |


In [30]:
# Find girl_names that contain "U" and "z"
starts_U <- str_subset(girl_names, "U")
str_subset(starts_U, "z")

In [31]:
# Find rows where baby name contains "zz"
contains_zz <- str_detect(babynames$name, "zz")
babynames[contains_zz,] %>%
  head()

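| year `<dbl>` | sex `<chr>` | name `<chr>` | n `<int>` | prop `<dbl>` |
|---|---|---|---|---|
| 1880 | F | Lizzie | 388 | 0.00397521 |
| 1880 | F | Kizzie | 13 | 0.00013319 |
| 1881 | F | Lizzie | 396 | 0.00400587 |
| 1881 | F | Kizzie | 9 | 9.104e-05 |
| 1882 | F | Lizzie | 495 | 0.00427849 |
| 1882 | F | Kizzie | 9 | 7.779e-05 |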


In [32]:
# Compare how many boy names end in "ee" compared to girl names. 
all_names <- babynames_2014$name
last_two_letters <- str_sub(all_names, start = -2, end = -1)
ends_in_ee <- str_detect(last_two_letters, "ee")
table(babynames_2014$sex, ends_in_ee)

   ends_in_ee
    FALSE  TRUE
  F 18609   572
  M 13963    84

The `fixed()` function specifies that a pattern is a fixed string, rather than a regular expression:

In [33]:
pattern <- "a.b"
strings <- c("abb", "a.b")
str_detect(strings, pattern)
str_detect(strings, fixed(pattern))

In [34]:
# Visualize matches
str_view_all("Alvin & Simon & Theodore", pattern = "&")
str_view_all("Alvin & Simon & Theodore", pattern = "[Ao]") # for more regex patterns, see chapter 3

"`str_view()` was deprecated in stringr 1.5.0.
i Please use `str_view_all()` instead."


[1] | Alvin <&> Simon <&> Theodore

[1] | <A>lvin & Sim<o>n & The<o>d<o>re

<a name="toc_46431_2.3"></a>
**2.3 Join and Split strings**

Joining:
- `str_dup()` - repeats strings
- `str_c()` - stringr version of `paste()`
- `str_flatten()` - combines a vector of strings into one long string

Splitting:
- `str_split()` - splits into substrings at occurrences of a pattern match. Arguments:
    - `pattern =` - pattern to be matched
    - `n =` - maximum number of substrings to return; splitting stops once n is reached
    - `simplify = TRUE` - returns a matrix of substrings instead of a list
    - wrapping the pattern in `fixed()` takes it literally
   
Other:
- `str_unique()` - removes duplicate strings from a vector of strings

<u>The `str_c()` function </u>

In [35]:
paste("I want to order a pizza with ", these_toppings, ".", sep = "")
str_c("I want to order a pizza with ", these_toppings, ".", sep = "") # same result

<u>The `str_dup()` and `str_unique()` functions </u>

In [36]:
rep("Hey", 3)  # creates 3 strings
str_dup("Hey", times = 3) # creates 1 string

In [37]:
hey <- rep("Hey", 3) # vector of strings
str_unique(hey)

<u>The `str_flatten()` function </u>

In [38]:
rep("Hey", 3)
str_flatten(rep("Hey", 3), ", ")

<u>The `str_split()` function </u>

In [39]:
str_split("Alvin & Simon & Theodore", pattern = "&")
str_split("Alvin & Simon & Theodore", pattern = "&", n = 2) # n = number of substrings returned. 

chars <- c("You & Me", "Tom & Jerry", "Alvin & Simon & Theodore")
str_split(chars, pattern = "&") # gives a list of substrings

In [40]:
str_split(chars, pattern = "&", n = 2, simplify = TRUE) # gives a matrix of substrings

| [,1] | [,2] |
|---|---|
| You | Me |
| Tom | Jerry |
| Alvin | Simon & Theodore |


In [41]:
# Extract start day, month and year
date_ranges <- c("23.01.2017 - 29.01.2017", "30.01.2017 - 06.02.2017")

split_dates_n <- str_split(date_ranges, fixed(" - "), n = 2, simplify = TRUE)
split_dates_n

start_dates <- split_dates_n[,1]
str_split(start_dates, fixed("."), simplify = TRUE)

| [,1] | [,2] |
|---|---|
| 23.01.2017 | 29.01.2017 |
| 30.01.2017 | 06.02.2017 |

| [,1] | [,2] | [,3] |
|---|---|---|
| 23 | 01 | 2017 |
| 30 | 01 | 2017 |


In [42]:
# Extract first and last names 
both_names <- c("Box, George", "Cox, David")

both_names_split <- str_split(both_names, fixed(", "), simplify = TRUE)
both_names_split[,2]
both_names_split[,1]

In [43]:
# Some simple text statistics
lines

words <- str_split(lines, fixed(" "))

lapply(words, length) # Number of words per line
  
word_lengths <- lapply(words, str_length) # Number of characters in each word
word_lengths

lapply(word_lengths, mean) # Average word length per line


<a name="toc_46431_2.4"></a>
**2.4 Remove and replace matches in strings**

- `str_remove()` - removes the first match in each string
- `str_remove_all()` - removes all matches
- `str_replace()` - replaces the first match in each string
- `str_replace_all()` - replaces all matches in each string

In [8]:
timepoint <- c("8.52 minutes", "4.62 minutes minutes")
str_remove(timepoint, " minutes")
str_remove_all(timepoint, " minutes")

In [9]:
str_replace("Tom & Jerry", pattern = "&", replacement = "and")
str_replace("Alvin & Simon & Theodore", pattern = "&", replacement = "and")
str_replace_all("Alvin & Simon & Theodore", pattern = "&", replacement = "and")

<a name="toc_46431_3"></a>
## 3.  Regular expressions

Regular expressions are a language for describing patterns in strings. Apart from `stringr`, we can also use the package `rebus` to write regular expressions.

<a name="toc_46431_3.1"></a>
**3.1 `Stringr` regular expressions**

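In stringr, a regular expression is passed as a string (in `""`) to the `pattern =` argument. Because the backslash is itself an escape character in R strings, regex backslashes have to be doubled (e.g. `\\d`).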


- Match characters
    - `.` - any character (except a new line)
    - `\\.`, `\\-`, `\\\\`, `\\}` - match literal `"."`, `"-"`, `"\"` or `"}"` characters
    - `\\d` - match any digit; `\\D` - match any non-digit
    - `\\w` - match any word character (letters, digits or underscore); `\\W` - match any non-word character
    - `\\s` - match any whitespace; `\\S` - match any non-whitespace
    - `\\n` - match a new line
    - `\\t` - match a tab
    - `\\b` - match a word boundary (the position between a word character and a non-word character, e.g. at the start or end of a word)
    - `[:alpha:]` - match any letter
    - `[:lower:]` - match lower case letters
    - `[:upper:]` - match upper case letters
    - `[:punct:]` - match punctuation (e.g. `.!?(\}"@`)
    - `[:symbol:]` - match symbols (e.g. `=+^$>`)
    - `[:space:]` - match whitespace characters
    - `[:blank:]` - match spaces and tabs (but not new lines)
- Anchors:
    - `^a` - starts with "a"
    - `a$` - ends with "a"
- Quantifiers:
    - `a?` - 0 or 1
    - `a*` - 0 or more
    - `a+` - 1 or more
    - `a{n}` - exactly n  matches consecutively
    - `a{n,}` - n or more
    - `a{n,m}` - between n and m
- Alternatives
    - `ab|c` - "ab" or "c"
    - `[abc]` - one of "a" "b" or "c"
    - `[^abc]` - anything but "a", "b" or "c"
    - `[a-z]` `[A-F]` `[1-9]` `[a-zA-Z0-9]`- in the range given
- Look Arounds
    - `a(?=c)` - "a" that is followed by "c", e.g. acb
    - `a(?!c)` - "a" that is not followed by "c", e.g. abc
    - `(?<=c)a` - "a" that is preceded by "c", e.g. cab
    - `(?<!c)a` - "a" that is not preceded by "c", e.g. bac

In [46]:
str_detect(c("S85 dc", "Xff.", "0125", "abcd ", ".-?"), pattern = "\\d")
str_detect(c("S85 dc", "Xff.", "0125", "abcd ", ".-?"), pattern = "\\D")
str_detect(c("S85 dc", "Xff.", "0125", "abcd ", ".-?"), pattern = "\\w")
str_detect(c("S85 dc", "Xff.", "0125", "abcd ", ".-?"), pattern = "\\s")
str_detect(c("S85 dc", "Xff.", "0125", "abcd ", ".-?"), pattern = "[:upper:]")

In [47]:
str_detect(c("R2-D2", "C-3PO"), pattern = "^R")
str_detect(c("R2-D2", "C-3PO"), pattern = "O$")

In [48]:
str_detect(c("R2-D2", "C-3PO"), pattern = "C?")
str_detect(c("R2-D2", "C-3PO"), pattern = "C+")
str_detect(c("R2-D2", "C-3PO", "C22"), pattern = "2{2}") # in R2-D2 the two "2"s are not consecutive

In [49]:
str_detect(c("R2-D2", "C-3PO"), pattern = "R2|PO")
str_detect(c("R2-D2", "C-3PO"), pattern = "[RG]")
str_detect(c("R2-D2", "C-3PO"), pattern = "[^RD2-]")
str_detect(c("R2-D2", "C-3PO"), pattern = "[1-3]")

In [50]:
str_detect(c("baca", "acab", "cab", "bac"), pattern = "a(?=c)") # followed by
str_detect(c("baca", "acab", "cab", "bac"), pattern = "a(?!c)") # not followed by
str_detect(c("baca", "acab", "cab", "bac"), pattern = "(?<=c)a") # preceded by
str_detect(c("baca", "acab", "cab", "bac"), pattern = "(?<!c)a") # not preceded by

str_view(c("baca", "acab", "cab", "bac"), pattern = "(?<!c)a") # view the matches

[1] | b<a>ca
[2] | <a>cab
[4] | b<a>c

In [51]:
# Complex examples
str_detect(c("R2-D2", "C-3PO"), pattern = "^.\\d+") # starts with any character then one or more digits
str_detect(c("R2-D2", "C-3PO"), pattern = "^C\\-\\d+") # starts with "C" then a "-" then one or more digits
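
A few of the elements listed above (`\\b`, the `{n,m}` quantifier and `[:punct:]`) do not appear in the cells, so here is a short sketch of how they behave. The input strings are made up for illustration.

```r
library(stringr)

# \\b: "cat" only counts when it is a whole word
whole_word <- str_detect(c("cat", "concat", "cat food"), "\\bcat\\b")
whole_word      # TRUE FALSE TRUE

# {n,m}: quantifiers are greedy, so as many "a"s as allowed are taken
greedy <- str_extract("aaaa", "a{2,3}")
greedy          # "aaa"

# [:punct:]: strip punctuation characters
no_punct <- str_remove_all("Hi! How are you?", "[:punct:]")
no_punct        # "Hi How are you"
```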

<a name="toc_46431_3.2"></a>
**3.2 Using the `rebus` package**

Rebus provides an alternative syntax that can be easier to read. Its building blocks (special constants and functions) are chained with `%R%`, a special rebus operator that can be read as "then".

Rebus wildcards:
- `START` - match the start of the string
- `END` - match the end of the string
- `ANY_CHAR` - match any single character
- `or("a", "b")` - match "a" or "b"
- `DGT` - match a digit
- `one_or_more("a")` - match 1 or more "a"
- `optional()` - the expression inside is optional
- `capture()` - use with `str_match()` to capture groups

For more, see the rebus documentation.

`rebus` syntax can be converted to basic regular expressions:

In [52]:
START %R% ANY_CHAR %R% "a"
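
One way to convince yourself that a rebus pattern is an ordinary regular expression is to compare it with the hand-written equivalent. This is only a sketch: `rebus_pattern`, `plain_pattern` and `words` are made-up names, and `"^.a"` is the regex this rebus expression is expected to stand for.

```r
library(stringr)
library(rebus)

words <- c("cat", "coat", "scotland", "tic toc")

rebus_pattern <- START %R% ANY_CHAR %R% "a"   # start, then any character, then "a"
plain_pattern <- "^.a"                         # the same idea written by hand

# Look at the underlying regular expression string
as.character(rebus_pattern)

# Both patterns should flag exactly the same strings
same_result <- identical(str_detect(words, rebus_pattern),
                         str_detect(words, plain_pattern))
same_result
```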

In [53]:
x <- c("cat", "coat", "scotland", "tic toc")

str_detect(x, pattern = START %R% "c") # same result as with "^c"
str_detect(x, pattern = "at" %R% END)
str_detect(x, pattern = START %R% "cat" %R% END)
str_detect(x, pattern = START %R% ANY_CHAR %R% "a") # second character is "a"

In [54]:
ckath <- START %R% or("Cathr", "Kathr")
str_view(girl_names, pattern = ckath, match = TRUE)

  [293] | <Kathr>yn
 [2929] | <Cathr>yn
 [4598] | <Kathr>ine
 [7854] | <Kathr>ynn
 [8257] | <Kathr>yne
[11136] | <Cathr>ine
[17803] | <Kathr>ynne

In [55]:
contact <- c("Call me at 555-555-0191", "123 Main St", "(555) 555 0191", "Phone: 555.555.0191 Mobile: 555.555.0192")

three_digits <- DGT %R% DGT %R% DGT
four_digits <- three_digits %R% DGT
separator <- char_class("-.() ")

phone_pattern <- optional(OPEN_PAREN) %R% 
  three_digits %R% 
  zero_or_more(separator) %R% 
  three_digits %R% 
  zero_or_more(separator) %R%
  four_digits
  
str_extract(contact, pattern = phone_pattern) # Extract phone numbers

str_extract_all(contact, pattern = phone_pattern) # Extract ALL phone numbers

<a name="toc_46431_4"></a>
## 4.  Advanced string manipulation

<a name="toc_46431_4.1"></a>
**4.1 Capture groups**

Pattern groups can be matched with `str_match()`. The different groups will be in different columns of the output matrix. To create groups:
- `()` - With stringr syntax, simply put the part of the regex in () to indicate it is a group
- `capture()` - With rebus, put the given expression inside the capture() function

☝️ Here each row corresponds to an input string. The first column will be the entire match, the same as you'd get from str_extract. Then, there is a column with just the piece that matched the captured part of the pattern. The piece of the string that matched the ANY_CHAR was "F" in the first string and "c" in the second.

In [56]:
str_match(c("Fat", "cat"), pattern = "(.+)a")
str_match(c("Fat", "cat"), pattern = capture(ANY_CHAR) %R% "a")  # same result with rebus

| [,1] | [,2] |
|---|---|
| Fa | F |
| ca | c |

| [,1] | [,2] |
|---|---|
| Fa | F |
| ca | c |


In [57]:
pattern <-  DOLLAR %R% 
            capture(DGT %R% optional(DGT)) %R% 
            DOT %R% 
            capture(dgt(2))
                              
str_match(c("$5.50", "$32.00"), pattern = pattern)

| [,1] | [,2] | [,3] |
|---|---|---|
| $5.50 | 5 | 50 |
| $32.00 | 32 | 00 |


<a name="toc_46431_4.2"></a>
**4.2 Backreferences**

Referring to a captured part of a pattern is known as a **backreference**. Since a regular expression may contain multiple captures, we need to distinguish between them:
- `stringr`: `\\1`, `\\2` for the 1st and 2nd capture, respectively
- `rebus`: `REF1`, `REF2`

In [58]:
str_view("Paris in the the spring",
         SPC %R%
         capture(one_or_more(WRD)) %R%
         SPC %R%
         REF1)

[1] | Paris in< the the> spring

☝️ Here `REF1` changes to `"the"` as that is the 1st captured word we are referring to.

In [63]:
# Capture names with a pair of letters repeated twice
pairs <- capture(LOWER %R% LOWER) %R% REF1 %R% END
str_view(boy_names, pattern = pairs, match = TRUE)

 [3439] | T<alal>
 [5703] | Yoch<anan>
 [5828] | J<alal>
 [6347] | Ch<anan>
 [7290] | Ke<anan>
 [7347] | M<anan>
 [7832] | K<anan>
 [7991] | R<onon>
 [8308] | Elch<onon>
 [9058] | H<anan>
 [9215] | Ke<enen>
[10649] | Yoh<anan>
[11356] | Joh<anan>
[13039] | K<arar>

In [77]:
# The same with stringr syntax:
pairss <- '([:lower:][:lower:])\\1$'
str_view(boy_names, pattern = pairss, match = TRUE)

 [3439] | T<alal>
 [5703] | Yoch<anan>
 [5828] | J<alal>
 [6347] | Ch<anan>
 [7290] | Ke<anan>
 [7347] | M<anan>
 [7832] | K<anan>
 [7991] | R<onon>
 [8308] | Elch<onon>
 [9058] | H<anan>
 [9215] | Ke<enen>
[10649] | Yoh<anan>
[11356] | Joh<anan>
[13039] | K<arar>

In [73]:
# Capture names with a pair of letters followed by their reverse
pairs2 <- capture(LOWER) %R% capture(LOWER) %R% REF2 %R% REF1 %R% END
str_view(boy_names, pattern = pairs2, match = TRUE)

 [173] | J<esse>
 [452] | R<occo>
 [802] | Ap<ollo>
[1011] | Pi<erre>
[1078] | Ca<naan>
[1365] | Gius<eppe>
[1494] | Ja<leel>
[2124] | Ever<ette>
[3001] | Jah<leel>
[3714] | Eti<enne>
[3956] | Ka<naan>
[4324] | Lav<elle>
[4842] | Ta<meem>
[4957] | Jo<elle>
[5140] | Jeanpi<erre>
[5316] | G<reer>
[5362] | Keny<atta>
[6131] | Ka<leel>
[6152] | Kha<leel>
[6164] | Lafay<ette>
... and 36 more

In [80]:
x <- c("hello", "sweet", "kitten")
str_replace(x, capture(ANY_CHAR), str_c(REF1, REF1))
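
The same kind of replacement can be written with plain stringr syntax, using `\\1` and `\\2` in the replacement string. The following is a minimal sketch; the vectors are re-created here so the block runs on its own.

```r
library(stringr)

x <- c("hello", "sweet", "kitten")

# Capture the first character and write it twice
doubled <- str_replace(x, "(.)", "\\1\\1")
doubled   # "hhello" "ssweet" "kkitten"

# Two captures: swap "Last, First" into "First Last"
both_names <- c("Box, George", "Cox, David")
swapped <- str_replace(both_names, "(\\w+), (\\w+)", "\\2 \\1")
swapped   # "George Box" "David Cox"
```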

<a name="toc_46431_4.3"></a>
**4.3 Unicode matching**

Unicode code points can be looked up online, e.g. at https://symbl.cc/en/search/?q=plus+minus

<u>Writing unicode characters in R:</u>
- `"\U___"` - where "___" is the unicode (often written after `U+` on websites). 

Note: On Windows, Unicode characters with code points of more than 4 hex digits may not be handled correctly!

<u>Matching Unicode groups:</u>
- `\\p{name}` - where `name` is a Unicode property, such as a script or a general category

In [103]:
"\U03BC"
"\U0001F51F"
"\U1F33B"
"\U82B1"

In [106]:
x <- "Normal(\U03BC = 0, \U03C3 = 1)"
x
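
Matching by Unicode property can be sketched with plain stringr as follows. The property names `Greek` (a script) and `Lu` (uppercase letters) are examples chosen for this sketch, not taken from the notes above; `x` is re-created so the block runs on its own.

```r
library(stringr)

x <- "Normal(\u03BC = 0, \u03C3 = 1)"

# All characters that belong to the Greek script
greek_letters <- str_extract_all(x, "\\p{Greek}")[[1]]
greek_letters   # the two letters mu and sigma

# Number of uppercase letters (general category Lu): just the "N"
n_upper <- str_count(x, "\\p{Lu}")
n_upper         # 1
```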

In [112]:
str_view(x, pattern = "\U03BC")
str_view(x, pattern = rebus::greek_and_coptic())

[1] | Normal(<μ> = 0, σ = 1)

[1] | Normal(<μ> = 0, <σ> = 1)