-
Notifications
You must be signed in to change notification settings - Fork 3
/
week9.Rmd
303 lines (222 loc) · 15.1 KB
/
week9.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
---
title: 'Text manipulation: regular expressions'
author: "Jeremy Van Cleve"
output: html_document
date: 24th October 2018
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Reminders
- **Think about data to use and plots to make for lightning talk**.
- E-mail me if you want to discuss your ideas.
# Outline for today
- Text basics
- Basic string manipulation using `stringr`
- Regular expressions: basics
- Regex: character classes
- Regex: quantifiers
- Regex: grouping
- Regex: replacing matches
- Regex: practice!
# Text basics
Working with text and strings can feel like the most boring and tedious aspect of data analysis and wrangling, but it is often important and sometimes the most crucial element. Not only are data sets often untidy, but they are also often full of quirks related to how the data was initially entered. For example, *categorical variables*, like a country name, might be mixed with *numerical variables*, such as dates or population sizes, and these must then be separated intelligently in order to be further analyzed in R. **String** analysis tools are crucial for this task. The first tool you will use includes functions from the `stringr` package. The second tool, more difficult but arguably the most crucial, is the "regular expression", which is the like a Swiss Army knife for finding patterns in text.
## `stringr`
R comes with some base functions for manipulating text, but these can be inconsistent in terms of their arguments and their outputs. As normal for material in this course, Hadley Wickham comes to the rescue with a package, `stringr` (loaded when you load `tidyverse`) that provides some basic function for manipulating strings and for using regular expressions. All the functions start with `str_`, which makes seeing a list of them easy using the "tab complete" functionality in RStudio. While it is certainly useful to learn the base packages for string manipulation, the hope is that learning a simpler framework with `stringr` can make the concepts easier so that if you need to understand how to work one of the base functions, the task will be easier.
## Encoding
Computers store information in bits and while it can seems intuitive to convert bits to numbers (go from base 2 to base 10), converting bits to text requires some arbitrary choices. How many characters can convert to bits? Which special characters (e.g., ?, !, ^, etc) do you include beyond numbers and digits? Once you settle on which characters to include, you can simply list them in a table with the corresponding binary value that represents them. Initially, since English speakers were the dominant users of computers, this table, or "encoding", was called ASCII and encoding only 128 characters (CS challenge: how many bits do you need to encode 128 characters?).
![ASCII chart from 1972](assets/ASCII.png)
Since 128 characters is not enough to encode characters found in many foreign languages (particularly those with non-Roman characters), other encodings exist that do much better. The main encoding using on the internet now is "Unicode", which contains a repertoire of more than 128,000 characters covering 135 modern and historic scripts. The most common variant of Unicode is "UTF-8", which was designed for backward compatibility with ASCII.
As more and more computer software moves to use UTF-8 as the default encoding, encoding will become something users don't have to worry about. However, some software, including older R functions, may read into text expecting ASCII or some limited encoding when text is actually UTF-8. This means that some or all of the text will look mangled, say like this:
> ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔǵÇ≠ǻǢ
There are some ways to fix this kind of problem. Jeny Bryan discusses some of these solutions here <http://stat545.com/block032_character-encoding.html> and more generally gives some examples of how to convert from one encoding to another. For the purpose of learning more about text wrangling, assume that everything is in UTF-8 and that you do not have to convert anything. You may run into problems and need to google/pray to the R gods, but its a small price to pay for the generality of unicode. For more on encodings than you probably want to know, try this article: <http://kunststube.net/encoding/>.
## Creating strings: quotes and special characters
You've already been creating strings all over the place, but we'll step through things a bit more methodically here.
Creating strings is easy: just use quotes, single or double.
```{r}
iamstring = "I am very worried about the country right now."
alsostring = 'So very very worried'
print(iamstring)
print(alsostring)
```
Including a quote within a string is easy if you mix the quote styles.
```{r}
print("A 'quote' within a quote")
print('Another "quote" within a quote')
```
Note in the second example how R converted the single quotes defining the string to double quotes and put backslashes before the double quotes inside the string? This is how you can tell R to ignore the special function some characters have. Within a double quoted string, a double quote would mean "end of the string". Putting a backslash before the double quote says "this is just a double quote character." Generally, putting a backslash before a character in a string "*escapes*" the character. Other special "escape characters", which are often characters you can't type with the keyboard easily, include tab, "\t", and the end-of-line character, "\n".
```{r}
cat("at a tab separates this\tand\tthis\n")
cat("\n")
cat("creating multiple lines\nis easy with end-of-line\ncharacter")
```
# Basic string manipulation using `stringr`
First, load `stringr` via `tidyverse.
```{r}
library(tidyverse)
```
### Length
The function `str_length()` not only tells you how a string is,
```{r}
str_length("ugh. so disappointed. so very disappointed.")
```
but it "threads" over vectors and will tell you how long each string is in a vector.
```{r}
whysosad = c("trump", "gingrich", "giuliani")
str_length(whysosad)
```
### Combining strings
Putting strings together is done using `str_c()` where the "c=concatenate" (again, R, use more characters for this word!).
```{r}
str_c("x", "y", "z", "w")
```
To put characters separating the strings, add the `sep` argument:
```{r}
str_c("x", "y", "z", "w", sep = " is a character and ")
```
`str_c` is also "vectorized" and threads over vectors,
```{r}
str_c("prefix-", c("a", "b", "c"), "-suffix")
```
which can a very handy way to avoid writing `for` loops.
Combined strings can be collapsed into a single string with the `collapse` argument.
```{r}
str_c(c("x", "y", "z", "w")) # just a vector still!
str_c(c("x", "y", "z", "w"), collapse = ", ") # now, a single string
```
### Detecting a string
Use `str_detect` to see if a string is contained in a vector of strings. Use `str_subset` to return the vector containing only those strings that match
```{r}
# vector of strings from tidyverse
fruit
str_detect(fruit, "fruit")
# get the subset that match
my_fruit <- str_subset(fruit, "fruit")
print(my_fruit)
```
### Splitting a string
Passing a delimiter to `str_split()` will split strings or vectors of strings based on the delimiter:
```{r}
str_split(my_fruit, " ") # split on a space
```
### Subsetting a string
You can pull out parts of a string with `str_sub()`, which is also vectorized,
```{r}
x <- c("Trump", "Gingrich", "Giuliani")
str_sub(x, 1, 3)
# negative numbers count backwards from end
str_sub(x, -3, -1)
```
and works for assignment too,
```{r}
str_sub(x, 1, 3) = "XXX"
print(x)
```
### Replacing a substring within a string
Replacing a substring can be done using `str_replace()`:
```{r}
str_replace(my_fruit, "fruit", "XXX")
my_fruit # it didn't modify the original (as usual for most R functions)
```
# Regular expressions: basics
In the previous examples, strings were matched directly and entirely; either the who string (e.g, "fruit") matched some piece of the string exactly or it didn't match at all. Regular expressions ("regex"" for short) provide a syntax for constructing much more complicated ways to match strings than an exact match. Using a regular expression, you can construct a pattern that will only match dates (no matter the date!) or ones that will only match URLs (no matter the URL!). Needless to say, this can be extremely helpful when you have to modify/edit tons of data, but the drawback is the regular expressions have a steep learning curve.
## Metacharacters
Like the quote characters and backslash, regex have special characters that are meant to represent something other than the character itself. The metacharacters are: '. \\ | ( [ { ^ $ * + ?'. To see how some of them work, you can use the function `str_view_all()`, which shows visually all places where a regex matches a string.
- `.`: match any character
```{r}
str_view_all("some characters #*$#(", ".")
```
- `^`: match at the beginning of the string
```{r}
str_view_all(c("trump", "gingrich is a troll", "travesty"), "^tr")
```
- `$`: match at the end of the string
```{r}
str_view_all(c("trump", "gingrich is a sycophant", "travesty"), "p$")
```
- `|`, `[]`: creating alternatives and character classes; see below
- `*`, `+`, `?`, `{}`: quantifiers; see below
- `()`: create "groups"; see below
# Regex: character classes
Suppose that you want to match any of the characters "a" or "b"? The "|" operator, which is an "or" operator, works for this because it matches either pattern on its sides. For example,
```{r}
str_view_all("we have to go back to our abc's to figure out what happened in 2016!!!!", "a|b")
```
What about matching "a", "b", or "c"? A character class is used for this where "[abc]" denotes the class of characters "a", "b", and "c" and will matches any of those characters.
```{r}
str_view_all("we have to go back to our abc's to figure out what happened in 2016!!!!", "[abc]")
```
You can create an "inverted" character class that matches all characters except those specified by putting "^" in the class:
```{r}
str_view_all("we have to go back to our abc's to figure out what happened in 2016!!!!", "[^abc]")
```
Finally, there are some handy character classes that are built into regex. For example,
- `\d`, `\D`, `\s`, `\S`, and `\w` represent any decimal digit, not a digit, a space character, not a space character, and a ‘word’ character, respectively
```{r}
# we need the extra "\" so that the "\" is actually included in the string!
str_view_all("we have to go back to our abc's to figure out what happened in 2016!!!!", "\\d")
str_view_all("we have to go back to our abc's to figure out what happened in 2016!!!!", "\\D")
str_view_all("we have to go back to our abc's to figure out what happened in 2016!!!!", "\\s")
str_view_all("we have to go back to our abc's to figure out what happened in 2016!!!!", "\\S")
str_view_all("we have to go back to our abc's to figure out what happened in 2016!!!!", "\\w")
```
- `\b` and `\B`: matches the empty string at either edge of a word, or not at the edge of a word, respectively. The example below shows how this allows you to match the character "h" either at the edge of a word or within a word.
```{r}
str_view_all("we have to go back to our abc's to figure out what happened in 2016!!!!", "\\bh")
str_view_all("we have to go back to our abc's to figure out what happened in 2016!!!!", "\\Bh")
```
Other built in character classes are specified with "[::]" or ranges such as [a-z]. Some of these include
- `[:digit:]`: same as `\d` and [0-9].
- `[:lower:]`: lower-case letters, equivalent to [a-z].
- `[:upper:]`: upper-case letters, equivalent to [A-Z].
- `[:alpha:]`: alphabetic characters, equivalent to [A-z].
- `[:alnum:]`: alphanumeric characters, equivalent to [A-z0-9].
- `[:blank:]`: blank characters, i.e. space and tab.
- `[:space:]`: space characters: tab, newline, vertical tab, form feed, carriage return, space.
- `[:punct:]`: punctuation characters, ! " # $ % & ’ ( ) * + , - . / : ; < = > ? @ [ ] ^ _ ` { | } ~.
# Regex: quantifiers
One of the most important pieces of a regex pattern is a **quantifier** that denotes how many times each piece of the pattern show match. There are four basic quantifiers:
- `*`: match zero or more times
- `+`: match one or more times
- `?`: match zero or one times
- `{n}`, `{n,}`, or `{n,m}`: match n times, at least n times, or between n and m times.
```{r}
str_view("we have to go back to our abc's to figure out what happened in 2016!!!!", "!+")
str_view("we have to go back to our abc's to figure out what happened in 2016!!!!", "!?")
str_view("we have to go back to our abc's to figure out what happened in 2016!!!!", "!{3}")
str_view("we have to go back to our abc's to figure out what happened in 2016!!!!", "!+?")
```
# Regex: grouping and detecting matches.
Eventually you will want to extract pieces of a match. For example, if you try to match four digit years, you may want to know what years the regex matched. In order to extract this piece from a larger pattern, you can "group" the year part of the pattern.
```{r}
str_match("we have to go back to our abc's to figure out what happened in 2016!!!!", "(\\w+) +(\\d{4})")
```
The above example shows a match for a word followed by a four digit year. Using the `str_match()` function, you get the matched string and then the grouped expressions afterwards in columns two and three.
If you want to a true/false for whether the pattern matched, you can use `str_detect()` instead of `str_match()`
```{r}
str_detect("we have to go back to our abc's to figure out what happened in 2016!!!!", "(\\w+) +(\\d{4})")
```
and `str_count()` to get the number of matches in the whole string
```{r}
str_count("we have to go back to our abc's to figure out what happened in 2016!!!!", "!{2}")
```
Note here that matches don't overlap; otherwise you would count more than 2 matches in the four exclamation points.
# Regex: replacing matches
Replacing text with regular expression involves supplying a pattern to match and a replacement string. For example, using `str_replace()`,
```{r}
str_replace("we have to go back to our abc's to figure out what happened in 2016!!!!", "(\\w+) +(\\d{4})", "in our dreams; thank god it wasn't real")
```
The groups can also be taken advantage of in the replacement where the variables (called "backreferences") "\\n" contain the nth saved or grouped part of the pattern.
```{r}
str_replace("we have to go back to our abc's to figure out what happened in 2016!!!!", "(\\w+) +(\\d{4})", "\\1 this awful time. I wish that \\2 hadn't happened.")
```
# Regex: practice!
There are a couple of nice websites that allow you to easily construct a regex pattern and see how it matches text at the same time. This can be very useful when trying to construct a pattern to do a very specific job.
- <http://regexr.com/>
- <http://regex101.com/>
This is a good places to learn the basics too in a more fun way:
- <https://regexcrossword.com/challenges/tutorial>
# Lab ![](assets/beaker.png)
### Problems
1. Go to <https://regexcrossword.com/challenges/beginner>. Solve the first three puzzles and submit the answers to them.
2. Write your own 2x2 regex crossword in the style of the puzzles in problem #1. Provide the puzzle and solution.