# Manipulating Strings
These are the solutions to selected code blocks. You will have to run the code blocks to see what they do.

In [1]:
# Load libraries
library(stringr)    # string manipulation functions (str_*)
library(tidyverse)  # attaches the core tidyverse packages (dplyr, ggplot2, ...)


── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.0     ✔ readr   1.1.1
✔ tibble  2.1.3     ✔ purrr   0.2.5
✔ tidyr   0.8.1     ✔ dplyr   0.8.3
✔ ggplot2 3.2.0     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


## String literals and variables
When you use string functions, you can pass in a string literal (a standalone string) or a variable that contains the string data. Let's get familiar with how to create string literals and save them in variables.

For diagnostic purposes, you can enclose a line of code in parentheses '(< code >)' and it will display the result.

The examples below are provided for reference, because these steps are easier to show in code than to describe in text.

In [2]:
# String literal
"This is a string literal"

In [3]:
# A three letter string
str_three <- "abc"

# Strings can use single or double quotes
str_single <- 'This is a string using single quotes'
str_double <- "This is a string using double quotes"

# Choose single or double quotes with embedded quote characters
str_single_embedded <- 'This is a string using single quotes with "embedded" double quotes'
str_double_embedded <- "This is a string using double quotes with 'embedded' single quotes"

# Escape an embedded quote character with a backslash
str_double_embedded_escaped <- "This is a string using double quotes with \"embedded\" double quotes using escape"

# Print each string
print(str_single)
print(str_double)
print(str_single_embedded)
print(str_double_embedded)
print(str_double_embedded_escaped)

[1] "This is a string using single quotes"
[1] "This is a string using double quotes"
[1] "This is a string using single quotes with \"embedded\" double quotes"
[1] "This is a string using double quotes with 'embedded' single quotes"
[1] "This is a string using double quotes with \"embedded\" double quotes using escape"


In [4]:
# Assign to variable and print using parentheses
(str_assign_print <- "Assigning string and printing using parentheses")

In [5]:
# multi-line string
str_multi_line <- "Line 1
Line 2
Line 3"

# Print displays newline as \n
print(str_multi_line)


[1] "Line 1\nLine 2\nLine 3"


In [6]:
# cat displays actual newline
cat(str_multi_line)
cat("\n") # Needed to reset to new line for next output

Line 1
Line 2
Line 3


In [7]:
# Storing multiple strings in one variable
# known as a character vector
# Can store literals, variables, and NA
str_vector <- c("string 1", str_single_embedded, NA, str_multi_line)

# cat output
cat(str_vector)
cat("\n") # Needed to reset to new line for next output

string 1 This is a string using single quotes with "embedded" double quotes NA Line 1
Line 2
Line 3


In [8]:
# print() output makes the individual vector elements easier to see
print(str_vector)

[1] "string 1"                                                            
[2] "This is a string using single quotes with \"embedded\" double quotes"
[3] NA                                                                    
[4] "Line 1\nLine 2\nLine 3"                                              


## Stringr package

The help index of the stringr package lists the following 26 topics. Most of them are string functions or families of related functions.

* case                    Convert case of a string.

* invert_match            Switch location of matches to location of non-matches.
                        
* modifiers               Control matching behaviour with modifier functions.
                        
* str_c                   Join multiple strings into a single string.

* str_conv                Specify the encoding of a string.

* str_count               Count the number of matches in a string.

* str_detect              Detect the presence or absence of a pattern in a string.
                        
* str_dup                 Duplicate and concatenate strings within a character vector.
                        
* str_extract             Extract matching patterns from a string.

* str_interp              String interpolation.

* str_length              The length of a string.

* str_locate              Locate the position of patterns in a string.

* str_match               Extract matched groups from a string.

* str_order               Order or sort a character vector.

* str_pad                 Pad a string.

* str_replace             Replace matched patterns in a string.

* str_replace_na          Turn NA into "NA".

* str_split               Split up a string into pieces.

* str_sub                 Extract and replace substrings from a character vector.
                        
* str_subset              Keep strings matching a pattern.

* str_trim                Trim whitespace from start and end of string.

* str_trunc               Truncate a character string.

* str_view                View HTML rendering of regular expression match.
                        
* str_wrap                Wrap strings into nicely formatted paragraphs.

* stringr-data            Sample character vectors for practicing string manipulations.
                        
* word                    Extract words from a sentence.


In [9]:
# Use library(help = "package")
# to see the functions
library(help = "stringr")

## str_length()


In [10]:
# What is the string length of 
# str_three
print(str_three)
str_length(str_three)

[1] "abc"


In [11]:
# What is the string length of 
# str_multi_line
print(str_multi_line)
str_length(str_multi_line)

[1] "Line 1\nLine 2\nLine 3"


In [12]:
# What is the string length of 
# str_vector
print(str_vector)
str_length(str_vector)

[1] "string 1"                                                            
[2] "This is a string using single quotes with \"embedded\" double quotes"
[3] NA                                                                    
[4] "Line 1\nLine 2\nLine 3"                                              


In [13]:
# What is the string length of 
# letters
print(letters)
str_length(letters)

 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"


Are the string lengths what you expected?

str_length() counts the characters in each string. Note that the function is vectorized: when you pass it a character vector with several elements, such as letters, it returns one length per element rather than a single total.

Notice that a newline, although printed as \n, is counted as a single character.
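
As a minimal sketch of the points above, the following cell checks a single string, a string containing a newline, and a small character vector with a missing value. The variable names (n_abc, n_lines, n_vec) are made up for this illustration and are not part of the exercises.

```r
library(stringr)

# A single string: the result is the number of characters
n_abc <- str_length("abc")                  # 3

# The newline escape \n counts as one character: 6 + 1 + 6
n_lines <- str_length("Line 1\nLine 2")     # 13

# A character vector: one length per element, and NA stays NA
n_vec <- str_length(c("a", NA, "abc"))      # 1 NA 3
```

If you need the total number of characters across a vector, something like sum(str_length(x), na.rm = TRUE) would be one way to get it.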

## str_c()
Join multiple strings into a single string. To understand how str_c works, you need to imagine that you are building up a matrix of strings. Each input argument forms a column, and is expanded to the length of the longest argument, using the usual recycling rules. The sep string is inserted between each column. If collapse is NULL each row is collapsed into a single string. If non-NULL that string is inserted at the end of each row, and the entire matrix collapsed to a single string.
### Usage
str_c(..., sep = "", collapse = NULL)

In [14]:
# Combine two string literals
# print result and length
str_combine_literals <- str_c("first", "second")
print(str_combine_literals)
str_length(str_combine_literals)

[1] "firstsecond"


In [15]:
# reminder of str_vector
print(str_vector)

[1] "string 1"                                                            
[2] "This is a string using single quotes with \"embedded\" double quotes"
[3] NA                                                                    
[4] "Line 1\nLine 2\nLine 3"                                              


In [16]:
# Combine literals and single string variables
# print result and length
str_combine_alltypes <- str_c("literal", str_three, str_vector)
print(str_combine_alltypes)
str_length(str_combine_alltypes)

[1] "literalabcstring 1"                                                            
[2] "literalabcThis is a string using single quotes with \"embedded\" double quotes"
[3] NA                                                                              
[4] "literalabcLine 1\nLine 2\nLine 3"                                              


In [17]:
# Combine literals and character vector str_vector
# print result and length
str_combine_alltypes <- str_c("literal2", str_vector)
print(str_combine_alltypes)
str_length(str_combine_alltypes)

[1] "literal2string 1"                                                            
[2] "literal2This is a string using single quotes with \"embedded\" double quotes"
[3] NA                                                                            
[4] "literal2Line 1\nLine 2\nLine 3"                                              


In [18]:
# Combine literals and character vector letters
# print result and length
str_combine_letters <- str_c("Letter: ", letters)
print(str_combine_letters)
str_length(str_combine_letters)


 [1] "Letter: a" "Letter: b" "Letter: c" "Letter: d" "Letter: e" "Letter: f"
 [7] "Letter: g" "Letter: h" "Letter: i" "Letter: j" "Letter: k" "Letter: l"
[13] "Letter: m" "Letter: n" "Letter: o" "Letter: p" "Letter: q" "Letter: r"
[19] "Letter: s" "Letter: t" "Letter: u" "Letter: v" "Letter: w" "Letter: x"
[25] "Letter: y" "Letter: z"


## Recycling
Combining strings reveals an interesting property of R called recycling. This does not refer to memory garbage collection. It is the replication of data in one vector to match the length of another vector, and it is part of the vectorization built into R.

When combining the string literal "Letter: " with the letters character vector of length 26, the str_c() function repeated the literal 26 times (the original plus 25 copies) so that it could be matched up with each entry of the letters vector.

In [19]:
# Combine two character vectors of same size
str_1_2 <- c("1", "2")
str_a_b <- c("a", "b")
str_c(str_1_2, str_a_b)

In [20]:
# Combine two character vectors of multiples of longest one
str_c(str_1_2, letters)

In [21]:
# Combine two character vectors not multiple of longest one
str_1_2_3 <- c("1", "2", "3")
str_c(str_1_2_3, letters)

“longer object length is not a multiple of shorter object length”

* When both character vectors have the same length, their elements are matched up one to one.
* When one character vector is shorter and the longer length is a whole multiple of the shorter length, the shorter vector is recycled (replicated) to match the longer one.
* When the longer length is not a whole multiple of the shorter length, the shorter vector is still recycled, but a warning is issued: "longer object length is not a multiple of shorter object length"
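
The sketch below makes recycling explicit for the simplest case, a length-one string combined with a longer vector, by comparing the result with a manual rep() call. The names labels and manual are hypothetical. One caveat, stated as an assumption about versions newer than the one used in this notebook: stringr 1.5.0 and later follow stricter tidyverse recycling rules, as far as I am aware, so combining vectors of different lengths greater than one may raise an error there instead of recycling. Length-one recycling, as used here, should behave the same in both cases.

```r
library(stringr)

# The length-one literal is recycled to match the three-element vector
labels <- str_c("Letter: ", c("a", "b", "c"))

# The same result with the recycling written out by hand using rep()
manual <- str_c(rep("Letter: ", 3), c("a", "b", "c"))

# Both approaches give "Letter: a" "Letter: b" "Letter: c"
same_result <- identical(labels, manual)
```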

## Collapsing and separating strings
str_c() can also take multiple strings and character vectors and collapse them into a single string.
* The sep value is inserted between the strings that are joined element by element.
* The collapse value is inserted between the joined results, which are then collapsed into a single string.

In [22]:
# Collapse letters character vector into a single string
# No separator between strings
# Hint: collapse=""
str_c(letters, collapse="")

In [23]:
# Collapse letters character vector into a single string
# Separate using a comma and a space
# Hint: collapse=", "
str_c(letters, collapse=", ")

In [24]:
# Combine str_1_2 and letters and str_a_b character vectors into a single string
# separating each combined string by a colon 
# separating each collapsed string by a comma and space
# Result: '1:a:a, 2:b:b, 1:c:a, ...'
# Hint: sep=":", collapse=", "
str_c(str_1_2, letters, str_a_b, sep=":", collapse=", ")

There is a lot of flexibility in how to combine, separate, and collapse strings.
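
To see the difference between sep and collapse side by side, here is a small sketch using a hypothetical three-element vector named keys. With sep alone the result is still a vector. Adding collapse reduces it to one string.

```r
library(stringr)

keys <- c("a", "b", "c")

# sep only: joins element by element, result has three elements
pairs <- str_c("x", keys, sep = ":")
# "x:a" "x:b" "x:c"

# sep and collapse: the three joined strings become one string
one_string <- str_c("x", keys, sep = ":", collapse = ", ")
# "x:a, x:b, x:c"
```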

## str_replace_na()
Turn NA into "NA". 
### Usage
str_replace_na(string, replacement = "NA")

In [25]:
# Missing inputs give missing outputs
str_c("something", NA)

In [26]:
# Combining character vectors containing NA
# Combine str_contains_na with letters
str_contains_na <- c("-", NA)
str_c(str_contains_na, letters)

In [27]:
# Use str_replace_na before str_c to display NA
# Convert NA in str_contains_na to "NA"
# Combine above with letters
str_c(str_replace_na(str_contains_na), letters)
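
The following sketch contrasts the three behaviours in one place: NA propagating through str_c(), NA made visible with str_replace_na(), and a custom replacement value. The vector ids and the replacement text "missing" are assumptions chosen for illustration.

```r
library(stringr)

ids <- c("a", NA, "c")

# NA in any input gives NA in the corresponding output
with_na <- str_c("id-", ids)
# "id-a" NA "id-c"

# Convert NA to the string "NA" first so it survives the join
visible <- str_c("id-", str_replace_na(ids))
# "id-a" "id-NA" "id-c"

# The replacement argument lets you pick a different placeholder
custom <- str_c("id-", str_replace_na(ids, replacement = "missing"))
# "id-a" "id-missing" "id-c"
```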

## str_sub()
Extract and replace substrings from a character vector. str_sub will recycle all arguments to be the same length as the longest argument. If any arguments are of length 0, the output will be a zero length character vector.
### Usage
str_sub(string, start = 1L, end = -1L)
str_sub(string, start = 1L, end = -1L) <- value

In [28]:
# Define a string to work with
str_fox_dog <- "The crazy brown fox jumped over the lazy dog."

# Display the first 9 characters
# of str_fox_dog
# use all positional arguments
# no named parameters
str_sub(str_fox_dog, 1, 9)

In [29]:
# Pipe in string using %>% 
# use named parameter 'end'
str_fox_dog %>% str_sub(end = 9)

In [30]:
# Display the last 9 characters
# of str_fox_dog
str_fox_dog %>% str_sub(start = -9)

In [31]:
# Display just dog using str_sub
# of str_fox_dog
str_fox_dog %>% str_sub(start = -4, end = -2)

In [32]:
# In str_fox_dog replace dog with cats
# Hint: Pipe doesn't work in this case
str_sub(str_fox_dog, start = -4, end = -2) <- "cats"

# Print str_fox_dog
print(str_fox_dog)

[1] "The crazy brown fox jumped over the lazy cats."


If you know the starting and ending locations in a string relative to either the start or the end of the string, you can extract the substring and even replace it.
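
As a compact recap, this sketch uses a hypothetical string greeting to show positive positions (counted from the start), negative positions (counted from the end), and the assignment form of str_sub().

```r
library(stringr)

greeting <- "hello world"

# Positive positions count from the start of the string
first_word <- str_sub(greeting, 1, 5)      # "hello"

# Negative positions count from the end; -1 is the last character
last_word <- str_sub(greeting, -5)         # "world"

# The assignment form replaces the selected substring in place
str_sub(greeting, -5, -1) <- "there"
greeting                                   # "hello there"
```

The replacement does not have to be the same length as the substring it replaces, which is why "dog" could become "cats" earlier.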

## Substrings using character vectors
When applying str_sub() to a character vector (containing multiple strings) the function is applied element-wise in typical vectorized fashion including vector recycling as necessary. 

In [33]:
# Define some strings
str_single <- 'This is a string using single quotes'
str_double <- "This is a string using double quotes"
str_combined <- c(str_single, str_double)
print(str_combined)

[1] "This is a string using single quotes"
[2] "This is a string using double quotes"


In [34]:
# Return 'This' from str_combined
str_combined %>% str_sub(end = 4)

# Return 'quotes' from str_combined
str_combined %>% str_sub(start = -6)

In [35]:
# Return 'quotes' from str_combined
# Return a *single* comma separated string
# Hint: str_c with collapse
str_combined %>% 
   str_sub(start = -6) %>% 
   str_c(collapse = ", ")

In [36]:
# Return single string 'b, c, d' 
# using letters vector
# Hint: str_c collapse the str_sub
letters %>% 
   str_c(collapse = ", ") %>% 
   str_sub(start = 3, end = 10)

You can combine multiple string functions together which enables additional flexibility. This also provides many ways to accomplish the same task. Always look for the most closely matching function for your desired result.

## Upper case, lower case and title case
Convert case of a string.
### Usage
str_to_upper(string, locale = "")

str_to_lower(string, locale = "")

str_to_title(string, locale = "")

In [37]:
# Define a string to work with
str_fox_dog <- "The crazy brown fox jumped over the lazy dog."

# Return the above string in
# lowercase
# uppercase
# First character of each word capitalized
str_to_lower(str_fox_dog)
str_to_upper(str_fox_dog)
str_to_title(str_fox_dog)
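
A short sketch with mixed-case input shows what each function does to letters that are already upper or lower case. The string mixed is a made-up example.

```r
library(stringr)

mixed <- "the LAZY dog"

lower <- str_to_lower(mixed)   # "the lazy dog"
upper <- str_to_upper(mixed)   # "THE LAZY DOG"

# Title case capitalizes the first letter of each word
# and lowercases the remaining letters
title <- str_to_title(mixed)   # "The Lazy Dog"
```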

## str_trim()
Trim whitespace from start and end of string. 
### Usage
str_trim(string, side = c("both", "left", "right"))

In [38]:
# Define strings
str_left_whitespace <- "   Whitespace    on left"
str_right_whitespace <- "Whitespace   on right   "
str_both_whitespace <- "   Whitespace    on both    "
str_newline <- "
line2
"
str_combined <- c(str_left_whitespace, str_right_whitespace, 
                  str_both_whitespace, str_newline)
print(str_combined)

[1] "   Whitespace    on left"     "Whitespace   on right   "    
[3] "   Whitespace    on both    " "\nline2\n"                   


In [39]:
# Trim whitespace from str_combined
# left only
# right only
# both
print("Trim left")
str_combined %>% str_trim("left")

print("Trim right")
str_combined %>% str_trim("right")

print("Trim both")
str_combined %>% str_trim("both")

[1] "Trim left"


[1] "Trim right"


[1] "Trim both"


In [40]:
# str_trim without specifying a side does what?
print("Trim <none>")
str_combined %>% str_trim()

[1] "Trim <none>"


* It trims all kinds of whitespace characters, including spaces and newline characters.
* It does not trim whitespace in the middle of the string.
* The default is to trim both the left and right sides.
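
The three observations above can be checked with a small sketch. The string padded is a hypothetical example with leading spaces, a double space in the middle, and a trailing space plus newline.

```r
library(stringr)

padded <- "  left  and right \n"

# Default side is "both": leading and trailing whitespace is removed,
# the double space in the middle is kept
both <- str_trim(padded)                    # "left  and right"

# Only the leading spaces are removed
left_only <- str_trim(padded, side = "left")    # "left  and right \n"

# Only the trailing space and newline are removed
right_only <- str_trim(padded, side = "right")  # "  left  and right"
```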

## Regular Expressions
Regular expressions are pattern strings that can be passed to string functions to match and replace text. Regular expression syntax gives many symbols a special meaning for wildcard pattern matching.

To match such a symbol literally, prefix it with two backslash characters to "escape" it. Two are needed because both R string literals and the regular expression engine use the backslash as the escape character: R turns "\\." into \., which the regular expression engine then reads as a literal period.

* \s matches whitespace
* * means zero or more occurrences of previous character
* . means any character
* ^ means start of string
* $ means end of string
* [abc] means one of these characters
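
To illustrate the double backslash, here is a minimal sketch that matches a period both as a wildcard and as a literal character. It uses str_detect(), which is described in the next section, and the variable names are assumptions made for this example.

```r
library(stringr)

# An unescaped . matches any character, so both strings match
dot_wild_1 <- str_detect("a.b", "a.b")      # TRUE
dot_wild_2 <- str_detect("axb", "a.b")      # TRUE

# "\\." in R source is the regular expression \. (a literal period)
dot_literal_1 <- str_detect("a.b", "a\\.b") # TRUE
dot_literal_2 <- str_detect("axb", "a\\.b") # FALSE

# The same rule applies to \s: write it as "\\s" in R
has_space <- str_detect("a b", "a\\sb")     # TRUE
```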

## str_detect()
Detect the presence or absence of a pattern in a string. Vectorised over string and pattern. 
### Usage
str_detect(string, pattern)

In [41]:
# Define strings
str_apple <- c(" apple pie", "apple", "Apple pie cake", 
               "banana apple pie", "blueberry pie", "apple apple", "apricot applesause cake")

# Return a TRUE/FALSE vector for strings containing 'apple'
# Assign to match_index
print("strings containing 'apple'")
match_index <- str_detect(str_apple, "apple")

# print match_index
print(match_index)

[1] "strings containing 'apple'"
[1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE


In [42]:
# Print strings associated with
# TRUE in match_index
# Hint: Use the index inside [] for the string variable
str_apple[match_index] %>% 
   print()

[1] " apple pie"              "apple"                  
[3] "banana apple pie"        "apple apple"            
[5] "apricot applesause cake"


In [43]:
# Print strings containing 'pie'
print("strings containing 'pie'")
match_index <- str_detect(str_apple, "pie")
str_apple[match_index] %>% 
   print()

[1] "strings containing 'pie'"
[1] " apple pie"       "Apple pie cake"   "banana apple pie" "blueberry pie"   


In [44]:
# Print strings ending in 'pie'
# Hint: use $
print("strings ending in 'pie'")
match_index <- str_detect(str_apple, "pie$")
str_apple[match_index] %>% 
   print()

[1] "strings ending in 'pie'"
[1] " apple pie"       "banana apple pie" "blueberry pie"   


In [46]:
# Print strings starting with apple
# Hint: use ^ for starting
print("strings starts with 'apple'")
match_index <- str_detect(str_apple, "^apple")
str_apple[match_index] %>% 
   print()

[1] "strings starts with 'apple'"
[1] "apple"       "apple apple"


In [48]:
# Print strings starting with apple
# Ignore whitespace and match both 'apple' and 'Apple'
# Hint: use ^ for starting
# Hint: use \s for space and * for zero or more
# Hint: use [Aa] for upper and lower case A
print(str_apple)
print("strings starts with 'apple' enhanced")
match_index <- str_detect(str_apple, "^\\s*[Aa]pple")
str_apple[match_index] %>% 
   print()

[1] " apple pie"              "apple"                  
[3] "Apple pie cake"          "banana apple pie"       
[5] "blueberry pie"           "apple apple"            
[7] "apricot applesause cake"
[1] "strings starts with 'apple' enhanced"
[1] " apple pie"     "apple"          "Apple pie cake" "apple apple"   


str_detect is useful for finding the rows of data that match a given pattern.
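
The pattern used throughout this section, building a logical index with str_detect() and then subsetting with it, can be sketched on a smaller hypothetical vector named fruit_dishes:

```r
library(stringr)

fruit_dishes <- c("apple pie", "Apple tart", "pineapple")

# A plain pattern matches anywhere in the string and is case sensitive
anywhere <- str_detect(fruit_dishes, "apple")       # TRUE FALSE TRUE

# ^ anchors the match to the start of the string
at_start <- str_detect(fruit_dishes, "^apple")      # TRUE FALSE FALSE

# The logical vector can be used directly as an index
starts_with_apple <- fruit_dishes[str_detect(fruit_dishes, "^[Aa]pple")]
# "apple pie" "Apple tart"
```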

## str_extract()
Extract matching patterns from a string. Vectorised over string and pattern. 
### Usage
str_extract(string, pattern)

In [47]:
# Define strings
str_apple <- c(" apple pie", "apple", "Apple pie cake", 
               "banana apple pie", "blueberry pie", "apple apple", "apricot applesause cake")

# Print strings starting with apple
# Hint: use ^ for starting
print("strings starts with 'apple'")
str_extract(str_apple, "^apple")

[1] "strings starts with 'apple'"


In [48]:
# Print strings starting with apple
# Ignore whitespace and match both 'apple' and 'Apple'
# Hint: use ^ for starting
# Hint: use \s for space and * for zero or more
# Hint: use [Aa] for upper and lower case A
print("strings starts with 'apple' enhanced")
str_extract(str_apple, "^\\s*[Aa]pple")

[1] "strings starts with 'apple' enhanced"


In [49]:
# Find 'apple'or 'Apple' _middle_text_ then 'cake'
# Hint: use [Aa] and .*
print("Find 'apple' _middle_text_ then 'cake'")
str_extract(str_apple, "[Aa]pple.*cake")

[1] "Find 'apple' _middle_text_ then 'cake'"


str_extract is useful for extracting the matching text. Strings without a match return NA.
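
The sketch below shows the shape of the str_extract() result on a small hypothetical vector named desserts: the output has one element per input string, holding the matched text or NA when nothing matched.

```r
library(stringr)

desserts <- c("apple pie", "blueberry pie", "Apple pie cake")

# Matched text where the string starts with apple or Apple, NA otherwise
starts <- str_extract(desserts, "^[Aa]pple")
# "apple" NA "Apple"

# .* matches everything between the two fixed parts of the pattern
apple_to_cake <- str_extract(desserts, "[Aa]pple.*cake")
# NA NA "Apple pie cake"
```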

## str_replace()
Replace matched patterns in a string. Vectorised over string, pattern and replacement. 
### Usage
str_replace(string, pattern, replacement)

In [49]:
# Define strings
str_apple <- c(" apple pie", "apple", "Apple pie cake", 
               "banana apple pie", "blueberry pie", "apple apple", "apricot applesause cake")

# Replace apple with cherry for str_apple
str_replace(str_apple, "apple", "cherry")
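
Note that str_replace() replaces only the first match in each string. The sketch below compares it with str_replace_all(), which is also part of stringr and replaces every match. The variable names are made up for this illustration.

```r
library(stringr)

twice <- "apple apple"

# Only the first occurrence is replaced
first_only <- str_replace(twice, "apple", "cherry")
# "cherry apple"

# Every occurrence is replaced
all_matches <- str_replace_all(twice, "apple", "cherry")
# "cherry cherry"

# Strings without a match are returned unchanged
no_match <- str_replace("blueberry pie", "apple", "cherry")
# "blueberry pie"
```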

## str_detect with factors

In [51]:
# Work with mpg dataset model column
# Glimpse mpg
glimpse(mpg)

Observations: 234
Variables: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi"...
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro"...
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0,...
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, ...
$ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, ...
$ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "a...
$ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4",...
$ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17...
$ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25...
$ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p",...
$ class        <chr> "compact", "compact", "compact", "compact", "compact",...


In [52]:
# Make a copy of mpg as df
df <- mpg

# Convert df$model to a factor
# Hint: as.factor()
df <- df %>% mutate(model = as.factor(model))

# Display model factor levels
# Hint: levels()
df$model %>% levels()

In [53]:
# Select only model column
# Provide distinct data only
# Sort alphabetically
# Display first 5
df %>% select(model) %>% 
   distinct() %>%
   arrange(model) %>%   # arrange() needs a column to sort by
   head(5)

model
4runner 4wd
a4
a4 quattro
a6 quattro
altima


In [54]:
# Detect 2, 4 and all wheel drive in df$model directly
# Display unique values of model containing 2 or 4 or all wheel drive
# Sort alphabetically
# Display first 5
# Hint: use filter() and str_detect()
df %>% filter(str_detect(model, "[24a]wd")) %>%
   select(model) %>% 
   distinct() %>%
   arrange(model) %>%   # arrange() needs a column to sort by
   head(5)

model
4runner 4wd
c1500 suburban 2wd
caravan 2wd
dakota pickup 4wd
durango 4wd



In [55]:
# Update df$model to remove 2wd, 4wd, and awd
# Also remove any surrounding whitespace
# Hint: mutate(model), convert to character and back to factor with as.character() and as.factor()
# Hint: str_replace() with \s for whitespace, * for zero or more occurrences, [] for character groups
df <- df %>% mutate(model = str_replace(model %>% as.character(), "\\s*[24a]wd\\s*", "") %>% as.factor())

# Display model factor levels
# Hint: levels()
df$model %>% levels()

* Notice the removal of 2wd and 4wd and awd from model column
* Factors need to be converted to character vectors before replacement
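
The round trip from factor to character and back can be sketched on a small hypothetical factor named drive_models, without needing the full mpg data. Rebuilding the factor after the replacement gives a fresh set of levels that reflects the cleaned values.

```r
library(stringr)

drive_models <- factor(c("a4 quattro", "forester awd",
                         "caravan 2wd", "forester awd"))

# Convert to character, strip the drive suffix, convert back to factor
cleaned <- as.factor(
  str_replace(as.character(drive_models), "\\s*[24a]wd\\s*", "")
)

# Levels are rebuilt from the cleaned strings
cleaned_levels <- levels(cleaned)
# "a4 quattro" "caravan" "forester"
```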

# Summary
The stringr package provides many string functions to help you wrangle data. You can combine them to handle a wide range of string manipulation tasks.