In [1]:
library(tidyverse)
library(stringr)

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1     ✔ purrr   0.2.4
✔ tibble  1.4.2     ✔ dplyr   0.7.4
✔ tidyr   0.8.0     ✔ stringr 1.3.0
✔ readr   1.1.1     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()  masks stats::filter()
✖ purrr::flatten() masks jsonlite::flatten()
✖ dplyr::lag()     masks stats::lag()


# Administrative stuff
- Midterms are graded. Statistics:
    - Mean / s.d.: 72% ± 15%
    - Median: 74%
    - Mode: 84%
    - Max: 100% (x5)
- Grades and solutions will be posted Wednesday night.
- PS6 went up Monday at midnight and is due at the usual time next week.

# Lecture 11: Strings

Reading for this week: [Chapter 14](http://r4ds.had.co.nz/strings.html). In this lecture, we will cover:

* [String basics and length](#String-basics-and-length)
* [Combining strings](#Combining-strings)
* [Subsetting strings](#Subsetting-strings)
* [Other useful functions](#Other-functions)
* [Regular expressions](#Regular-expressions) (part i)

## String basics

We've already encountered strings several times in the book and on problem sets.

You can create a string by assigning a quoted value to a variable. It does not matter if you use `"double quotes"` or `'single quotes'`. They are the same.

In [2]:
(mystring <- "STATS 306")
(mystring2 <- 'STATS 306')

[1] "STATS 306"

[1] "STATS 306"

One reason to prefer one over the other is that the string itself may contain quotes. Wrapping the string in single quotes lets you write double quotes inside it without escaping:

In [3]:
(mystring3 <- '"MLE" stands for "Maximum Likelihood Estimate"')

[1] "\"MLE\" stands for \"Maximum Likelihood Estimate\""

To create a string containing double quotes, while using double quotes to create it, you must *escape* the quotes using a backslash (`\`):

In [4]:
(mystring3 <- "\"MLE\" stands for \"Maximum Likelihood Estimate\"")

[1] "\"MLE\" stands for \"Maximum Likelihood Estimate\""

What if you actually want a backslash? Then you need to escape it as well:

In [5]:
(mystring4 <- "\\ is the backslash character")

[1] "\\ is the backslash character"

The printed representation of strings shows the escapes:

In [6]:
mystrings = c("\"", '"', '\'', "'", "\\", "\\/")
print(mystrings)

[1] "\""  "\""  "'"   "'"   "\\"  "\\/"


Use `writeLines()` to see the raw contents of the string. 

In [7]:
writeLines(mystrings)

"
"
'
'
\
\/
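
To check that the escapes belong only to the printed representation, we can count the characters in each string. This is a minimal sketch that rebuilds the same `mystrings` vector and uses `str_length()` from `stringr`, which is covered later in this lecture:

```r
library(stringr)

mystrings <- c("\"", '"', '\'', "'", "\\", "\\/")

# Each escape sequence stands for one character in the stored string,
# so lengths are counted on the raw contents, not on the printed form.
n_chars <- str_length(mystrings)
print(n_chars)
```

Only the last string has two characters (a backslash and a forward slash); all the others contain exactly one.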


### Escape sequences
`\\` and `\"` are examples of what are called "escape sequences". They tell R to do something special instead of just printing the character. There are a couple of other useful escape sequences:

In [8]:
writeLines("First line\nSecond line") # newline

First line
Second line


In [9]:
writeLines("Text\tIndented Text\tText") # tab

Text	Indented Text	Text


### ASCII and Unicode
Early computers could only read and write the ASCII character set, essentially just Roman letters, numbers and some punctuation.

Nowadays, computers need to be able to understand alphabets from all over the world. For this we have *Unicode*.

You can print a character if you know its Unicode code point by using `\u`. For example, the copyright character has code point `00A9`. Wikipedia has [a complete list](https://en.wikipedia.org/wiki/List_of_Unicode_characters).

In [10]:
writeLines("\u00A9")

©


## String functions in R

Base R has built-in commands for dealing with strings, but as with the base R data manipulation commands, they have an inconsistent interface and are hard to remember. Instead we will focus on functions provided by the `stringr` package. They all start with `str_`. Make sure to `library(stringr)` in order to be able to use them.

### String length

In [11]:
str_length(c("a", "character", "vector"))

[1] 1 9 6

#### Example
What's the median length of airport names?

In [12]:
library(nycflights13)
airports %>% summarize(m=median(str_length(name)))

“package ‘bindrcpp’ was built under R version 3.4.4”

  m 
1 19

### Combining strings
Combining two strings into one is called "concatenation" by computer scientists and "combining strings" by everyone else. `concatenate` is hard to type, so it is abbreviated `str_c`:

In [13]:
str_c("Let us con", "catenate strings!")

[1] "Let us concatenate strings!"

Like most other commands, `str_c` is vectorized, meaning it will take vector arguments and recycle the shorter ones to the length of the longest:

In [14]:
mystrings <- c("one", "two", "ten")
str_c("*** ", mystrings , " ***") # each argument is expanded to the length of the longest

[1] "*** one ***" "*** two ***" "*** ten ***"

As usual, `NA` values propagate:

In [15]:
mystrings_na <- c("one", "two", NA)
str_c("*** ", mystrings_na, " ***") # missingness is contagious!

[1] "*** one ***" "*** two ***" NA           

There is a command which will replace `NA` with the *string* `"NA"`:

In [16]:
str_replace_na(NA)
str_c("*** ", str_replace_na(mystrings_na), " ***") # converts missing values to the string "NA"

[1] "NA"

[1] "*** one ***" "*** two ***" "*** NA ***" 

Another use of `str_c` is to combine multiple strings into one:

In [17]:
str_c("one", "two", "ten", sep = ", ") # can provide a separator

[1] "one, two, ten"

If you already know some R, you might recognize this as being equivalent to 
```{r} 
paste("one", "two", "ten", sep=", ")
```
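
The two functions agree in this case, but they are not interchangeable in general. Here is a minimal sketch of two differences worth knowing about, the default separator and the handling of `NA` (the variable names are just for illustration):

```r
library(stringr)

# Same result when the separator is given explicitly.
same <- identical(paste("one", "two", sep = ", "),
                  str_c("one", "two", sep = ", "))

# Default separators differ: paste() inserts a space, str_c() inserts nothing.
paste_default <- paste("one", "two")
str_c_default <- str_c("one", "two")

# Missing values: paste() turns NA into the string "NA", str_c() returns NA.
paste_na <- paste("one", NA)
str_c_na <- str_c("one", NA)

print(c(paste_default, str_c_default, paste_na, str_c_na))
```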

Be mindful of the difference between passing in a vector of strings as a single argument, and passing in multiple strings as separate arguments:

In [18]:
str_c(mystrings, sep = ", ") # why does this not combine the strings?
str_c(mystrings, collapse = ", ") # use collapse if the strings you want to combine are in a vector

[1] "one" "two" "ten"

[1] "one, two, ten"
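
The two arguments can also be used together: `sep` goes between the arguments, element by element, and `collapse` then joins the resulting vector into a single string. A small sketch, where `numbers` is a made-up second vector:

```r
library(stringr)

mystrings <- c("one", "two", "ten")
numbers <- c("1", "2", "10")  # hypothetical second vector, for illustration only

# sep joins the arguments elementwise and returns a vector of length 3.
pairs <- str_c(mystrings, numbers, sep = " = ")

# Adding collapse joins that vector into one string.
one_string <- str_c(mystrings, numbers, sep = " = ", collapse = "; ")

print(pairs)
print(one_string)
```

This also answers the question in the cell above: with a single vector argument there is nothing for `sep` to go between, so the vector comes back unchanged.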

### Subsetting strings
`str_sub` can be used to extract a *sub-string* from a larger string:

In [19]:
letters
(letters_str = str_c(letters, collapse = ""))
str_sub(letters_str, 1, 10) # the substring from position 1 through 10 (both inclusive)

 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"

[1] "abcdefghijklmnopqrstuvwxyz"

[1] "abcdefghij"

The indices passed to `str_sub` can be negative, in which case they count from the end of the string:

In [20]:
str_sub(letters_str, -10, -1) # negative numbers count from the end
str_sub(letters_str, -1, -1) # z

[1] "qrstuvwxyz"

[1] "z"
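
Like `str_c`, `str_sub` is vectorized, so it can pull the same positions out of every string in a vector. A brief sketch, using a made-up vector called `fruits`:

```r
library(stringr)

fruits <- c("apple", "banana", "pear")  # example vector, for illustration only

first_letter <- str_sub(fruits, 1, 1)    # first character of each string
last_letter <- str_sub(fruits, -1, -1)   # last character of each string
last_three <- str_sub(fruits, -3, -1)    # final three characters of each string

print(first_letter)
print(last_letter)
print(last_three)
```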

You can change part of a string using the assignment form of `str_sub()`.

In [21]:
str_sub(letters_str, 1, 10) = str_to_upper(str_sub(letters_str, 1, 10))
letters_str

[1] "ABCDEFGHIJklmnopqrstuvwxyz"
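
The assignment form is vectorized as well. One possible use, sketched here on a made-up vector `fruits`, is to capitalize the first letter of every string:

```r
library(stringr)

fruits <- c("apple", "banana", "pear")  # example vector, for illustration only

# Replace the first character of each string with its upper-case version.
str_sub(fruits, 1, 1) <- str_to_upper(str_sub(fruits, 1, 1))

print(fruits)
```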

## Other functions
There are a number of other useful functions:
```{r}
> str_<TAB TAB>
str_c            str_extract      str_locate       str_pad          str_split        str_to_lower     str_view
str_conv         str_extract_all  str_locate_all   str_replace      str_split_fixed  str_to_title     str_view_all
str_count        str_interp       str_match        str_replace_all  str_sub          str_to_upper     str_which
str_detect       str_join         str_match_all    str_replace_na   str_sub<-        str_trim         str_wrap
str_dup          str_length       str_order        str_sort         str_subset       str_trunc
```

In [22]:
writeLines(str_wrap("this is a super duper really long line of text which should be wrapped", width = 20))

this is a super
duper really long
line of text which
should be wrapped


In [23]:
str_trim("  whitespace on either side  ")

[1] "whitespace on either side"

In [24]:
str_interp("Hello ${name}, this is a string template", list(name="Jonathan"))

[1] "Hello Jonathan, this is a string template"

In [25]:
str_to_title("this will be capitalized")

[1] "This Will Be Capitalized"
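
A few more functions from the list above, shown as a short sketch on made-up inputs:

```r
library(stringr)

# Pad a string on the left with zeros until it is 3 characters wide.
padded <- str_pad("7", width = 3, side = "left", pad = "0")

# Repeat a string a given number of times.
repeated <- str_dup("ab", 3)

# Sort a character vector alphabetically.
sorted <- str_sort(c("pear", "apple", "banana"))

print(padded)
print(repeated)
print(sorted)
```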

## Regular expressions

Regular expressions (regexps) are a compact language for describing patterns in strings. They have a steep learning curve but are very powerful for working with text data. In this class we will just focus on the basics of regexps. A good tool for learning regexps is [regex101](https://regex101.com/), which lets you interactively edit and debug your regular expressions.

The commands `str_view` and `str_view_all` take a character vector and a regular expression, and show you how they match. 

The most basic regular expression is a plain string. It will match if the other string contains it as a substring.

In [26]:
options(jupyter.rich_display=T) # needed for str_view to work in jupyter notebook

In [27]:
x = c("apple", "banana", "pear")
str_view(x, "an")

Here `str_view` has matched our regexp (`"an"`) inside of the second string `banana` of the vector `x`.

You might wonder why `str_view` highlighted only the first match when `banana` contains two instances of the pattern `an`. This is its default behavior. To show all the matches, use `str_view_all`:

In [28]:
str_view_all(x, "an")
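
`str_view` is meant for looking at matches interactively. When you need the result as data, `str_detect` and `str_count` take the same two arguments. A minimal sketch using the same vector `x`:

```r
library(stringr)

x <- c("apple", "banana", "pear")

has_an <- str_detect(x, "an")  # does each string contain a match?
n_an <- str_count(x, "an")     # how many matches does each string contain?

print(has_an)
print(n_an)
```

Only `banana` contains the pattern, and it contains it twice.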

### Wildcards
Matching plain substrings is not that interesting -- perhaps you already know how to do it using `grep` or `str_detect`. Next we will learn about the `.`. A period inside of a regexp matches any character, except a newline:

In [29]:
str_view(x, '.a.')  # Match any character triple with an a in the middle position

#### Exercise
What does `...` match?

In [30]:
str_view(x, '...')

In [31]:
str_view_all(x, '...')

Notice that `str_view_all` produces one match for apple and pear, but two matches for banana. Matches cannot overlap: characters that have already participated in a match cannot participate in another one.
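
To see the non-overlapping matches as text rather than as highlights, one option is `str_extract_all`, which returns a list with one character vector per input string. A small sketch with the same vector `x`:

```r
library(stringr)

x <- c("apple", "banana", "pear")

# Each list element holds the successive, non-overlapping matches of "...".
triples <- str_extract_all(x, "...")

print(triples)
```

For `apple`, the leftover `le` is only two characters long, so it cannot form a second match.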

The period has a special meaning in regexps. If we wanted to match a literal period, we would need to escape it by writing `\.`. Recall from earlier in the lecture that the backslash itself must be escaped in strings!

In [32]:
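str_view(c("abc", "a.c", "bef"), "a.c")  # no: the unescaped . also matches "abc"
# str_view(c("abc", "a.c", "bef"), "a\.c")  # error: "\." is not a valid string escape
str_view(c("abc", "a.c", "bef"), "a\\.c")  # yes: matches only the literal "a.c"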
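
The same comparison can be made with `str_detect`, which makes the difference easy to read off. The sketch below also prints the pattern with `writeLines()` to show what the regexp engine actually receives, and uses `fixed()`, a `stringr` helper that treats the pattern as a plain string instead of a regexp:

```r
library(stringr)

s <- c("abc", "a.c", "bef")

writeLines("a\\.c")  # the regexp engine sees a\.c

wildcard <- str_detect(s, "a.c")        # . matches any character
escaped <- str_detect(s, "a\\.c")       # \. matches a literal period
literal <- str_detect(s, fixed("a.c"))  # no regexp interpretation at all

print(wildcard)
print(escaped)
print(literal)
```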

#### Example
`stringr::words` contains a large list of common words:

In [33]:
head(stringr::words)
length(stringr::words)

How can we find all the words that contain a `b`, followed by any character, followed by an `e`, followed by any character?

In [34]:
str_view(stringr::words, "b.e.", match = T)

(The `match = T` option tells `str_view` to only return the words that had a match.)
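
If you want the matching words as a character vector rather than as a display, `str_subset` is one way to get them. A minimal sketch:

```r
library(stringr)

# Keep only the words that contain b, any character, e, any character.
b_e_words <- str_subset(stringr::words, "b.e.")

print(b_e_words)
```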

### Anchors
Sometimes we want a match to occur at a particular position in the string. For example, "all words which start with `b`". For this we have the special *anchor* characters: `^` and `$`. The caret `^` matches the beginning of a string. The `$` matches the end. 

In [35]:
str_view(x, '^b')
str_view(x, 'r$')

Both anchors can be used together to match an entire string:

In [36]:
str_view(x, "^pear$")
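
The effect of the anchors is easy to check with `str_detect`. In this sketch, `pears` is a made-up vector chosen so that only one element is exactly `pear`:

```r
library(stringr)

x <- c("apple", "banana", "pear")
pears <- c("pear", "pears", "spear")  # hypothetical vector, for illustration only

starts_b <- str_detect(x, "^b")           # begins with b
ends_r <- str_detect(x, "r$")             # ends with r
exact <- str_detect(pears, "^pear$")      # the whole string is "pear"
unanchored <- str_detect(pears, "pear")   # "pear" appears anywhere

print(starts_b)
print(ends_r)
print(exact)
print(unanchored)
```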

Again, if you wanted to match the *character* `^` or `$` you would need to escape them:

In [37]:
str_view('^this string', "\\^.h")

### Character classes
A "character class" is a special pattern that matches a collection of characters. For example, `\d` will match any digit:

In [38]:
str_view(c("number1", "two", "3hree"), "\\d")
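
To pull the matched digit out instead of viewing it, one option is `str_extract`, which returns the first match in each string, or `NA` if there is none. A short sketch on the same input:

```r
library(stringr)

# The first digit in each string; "two" contains no digit, so it yields NA.
digits <- str_extract(c("number1", "two", "3hree"), "\\d")

print(digits)
```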

Similarly, `\s` will match whitespace (spaces, tabs and newlines):

In [39]:
y = c("spa ce", "hello\tworld", "multi\nline")
writeLines(y)
str_view(y, "\\s")

spa ce
hello	world
multi
line
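
Because the three kinds of whitespace are hard to tell apart on screen, it can help to replace each one with a visible character. A small sketch using `str_replace_all` on the same vector `y`:

```r
library(stringr)

y <- c("spa ce", "hello\tworld", "multi\nline")

# Replace every whitespace character (space, tab, newline) with an underscore.
visible <- str_replace_all(y, "\\s", "_")

print(visible)
```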


You can form your own character class using square brackets: `[abc]` will match *one of* `a`, `b`, or `c`. In other words, the 'width' of a character class is one character by default.

In [40]:
str_view(x, '[be]a')  # Match either 'b' or 'e' followed by a

We can also *negate* the character class. A character class which begins with a `^` will match any single character except those in the brackets:

In [41]:
str_view(x, '[^b]a')  # Match anything except b, followed by a

Note also that, inside of the brackets, the `.` loses its special meaning:

In [42]:
str_view(c('a.', 'ban', 'tribeca.'), 'a[nea.]')  # Match a, followed by one of n, e, a or period
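
The three patterns from this section can be checked side by side with `str_extract`, which returns the first match in each string or `NA`. A minimal sketch on the same inputs:

```r
library(stringr)

x <- c("apple", "banana", "pear")

be_a <- str_extract(x, "[be]a")    # b or e, followed by a
not_b_a <- str_extract(x, "[^b]a") # any character except b, followed by a

# Inside the brackets the period is a literal period, not a wildcard.
a_then <- str_extract(c("a.", "ban", "tribeca."), "a[nea.]")

print(be_a)
print(not_b_a)
print(a_then)
```

Note that `apple` does not match `[^b]a`: the negated class still has to match one character, and nothing comes before the `a`.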

#### Exercise
I want to match all words that begin with a `q` that is not followed by a `u`.

In [43]:
str_view(stringr::words, '^q[^u]', match=T)

It turns out that there aren't any in this list of common words. (These are uncommon words.) Let's try a bigger list:

In [44]:
english = read_table('/usr/share/dict/words', col_names = F)$X1  # works on OSX or linux
length(english)
str_view(english, '^q[^u]', match=T)

Parsed with column specification:
cols(
  X1 = col_character()
)


How about words that end in q?

In [45]:
str_view(english, 'q$', match=T)
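
The dictionary file is not available on every system, so here is a sketch of the same two patterns on a small made-up vector. The words in `sample_words` are chosen for illustration only and are not taken from the dictionary file:

```r
library(stringr)

# Hypothetical word list, for illustration only.
sample_words <- c("qat", "queen", "qi", "iraq", "quiz", "q")

q_not_u <- str_subset(sample_words, "^q[^u]")  # q at the start, then a non-u character
ends_q <- str_subset(sample_words, "q$")       # q at the end

print(q_not_u)
print(ends_q)
```

The one-letter string `"q"` is not matched by `^q[^u]`, because `[^u]` must match an actual character after the `q`.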