---
title: "Numbers Within Strings"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Numbers Within Strings}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  tidy = ifelse(utils::packageVersion("knitr") >= "1.20.15", "styler", TRUE)
)
library(stringr)
```
# A common way to encode numerical data
Numerical information is often encoded in strings, particularly in file names. Consider a series of microscope images of cells from different patients, where each file name records the patient number, the cell number and the number of hours after biopsy at which the image was taken. They might be named like:
```{r make-img_names, echo=FALSE}
img_names <- expand.grid(1:2, 1:3, c(0, 2.5)) %>%
  apply(1, function(x) {
    str_c("patient", x[1], "-cell", x[2], "-", x[3], "hours-after-biopsy.tif")
  }) %>%
  sort()
```
```{r print-img_names}
img_names
```
# All of the numbers
As a crude first step, you might just want all of the numbers:
```{r extract-all-numbers}
library(strex)
str_extract_numbers(img_names)
```
Notice that 2.5 was read as two separate numbers, 2 and 5, rather than as a single number. This is because the default is `decimals = FALSE`. To recognise decimals, set `decimals = TRUE`. There is also an option to recognise scientific notation; more on that below.
```{r extract-all-numbers-use-decimals}
str_extract_numbers(img_names, decimals = TRUE)
```
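To make the difference between the two modes concrete, here is a minimal base R sketch of the idea. It is an illustration only, not how `strex` is implemented, and the two regular expressions are assumptions chosen for this example.

```{r sketch-decimals}
# Illustration only: base R patterns that approximate the two modes.
x <- "patient1-cell2-2.5hours-after-biopsy.tif"
# Integers only: every run of digits counts as its own number.
ints <- as.numeric(regmatches(x, gregexpr("[0-9]+", x))[[1]])
# Decimal-aware: a run of digits, optionally followed by "." and more digits.
decs <- as.numeric(regmatches(x, gregexpr("[0-9]+(\\.[0-9]+)?", x))[[1]])
ints
decs
```

With the integer-only pattern, the "." acts as a separator, which is why 2.5 is split in two.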
It's also possible to extract the non-numeric parts of the strings:
```{r extract-non-numbers}
str_extract_non_numerics(img_names, decimals = TRUE)
```
# Extract specific numbers
What if we just want the cell number from each image?
## The `n`^th^ number
We know the cell number is always the second number, so we can use the `str_nth_number()` function with `n = 2`.
```{r nth-number-n2}
str_nth_number(img_names, n = 2)
```
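Conceptually, taking the `n`^th^ number amounts to extracting all of the numbers and then indexing into the result. A rough base R sketch of that idea follows; the file names are typed out by hand and the approach is an assumption made for illustration, not the `strex` implementation.

```{r sketch-nth}
# Illustration only: "extract everything, then take the n-th element".
x <- c(
  "patient1-cell2-0hours-after-biopsy.tif",
  "patient2-cell3-2.5hours-after-biopsy.tif"
)
all_ints <- regmatches(x, gregexpr("[0-9]+", x))
# Take the second number found in each string.
second <- vapply(all_ints, function(nums) as.numeric(nums[2]), numeric(1))
second
```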
## Numbers after patterns
To be more specific, you could say the cell number is the first number after the first instance of the word "cell". To go this route, `strex` provides `str_nth_number_after_mth()` which gives the `n`^th^ number after the `m`^th^ appearance of a given pattern:
```{r nth-number-after-mth}
str_nth_number_after_mth(img_names, "cell", n = 1, m = 1)
```
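The logic here can be pictured as three steps: locate the `m`^th^ occurrence of the pattern, keep only the text after it, and take the `n`^th^ number from what remains. The base R sketch below walks through those steps for a single string. It is only an approximation of the behaviour described above, and the variable names are made up for this example.

```{r sketch-after-pattern}
# Illustration only: the n-th number after the m-th occurrence of a pattern.
x <- "patient1-cell2-2.5hours-after-biopsy.tif"
m <- 1
n <- 1
# Step 1: find every occurrence of the literal pattern "cell".
hits <- gregexpr("cell", x, fixed = TRUE)[[1]]
# Step 2: keep the text that follows the m-th occurrence.
after <- substring(x, hits[m] + attr(hits, "match.length")[m])
# Step 3: take the n-th run of digits in that remainder.
nums <- regmatches(after, gregexpr("[0-9]+", after))[[1]]
cell_number <- as.numeric(nums[n])
cell_number
```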
There's also a convenient wrapper for getting the first number after the first appearance of a pattern:
```{r first-number-after-first}
str_first_number_after_first(img_names, "cell")
```
## Numbers before patterns
Now what if we want the number of hours after biopsy for each image? Looking at the image file names, we'd need the last number _before_ the first occurrence of the word "biopsy".
```{r last-number-before-first}
str_last_number_before_first(img_names, "biopsy", decimals = TRUE)
```
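The "before" direction mirrors the "after" one: keep the text up to the pattern, then take the last number in it. A minimal base R sketch of that reasoning, using a decimal-aware pattern chosen for this illustration (not taken from `strex` itself):

```{r sketch-before-pattern}
# Illustration only: the last number before the first occurrence of a pattern.
x <- "patient1-cell2-2.5hours-after-biopsy.tif"
# Position of the first occurrence of the literal pattern "biopsy".
pos <- as.integer(regexpr("biopsy", x, fixed = TRUE))
# Keep only the text that precedes it.
before <- substring(x, 1, pos - 1)
# Decimal-aware extraction, then take the final match.
nums <- regmatches(before, gregexpr("[0-9]+(\\.[0-9]+)?", before))[[1]]
hours <- as.numeric(nums[length(nums)])
hours
```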
## Tidy number extraction
To extract all of this information tidily, use a data frame:
```{r dataframe}
data.frame(img_names,
  patient = str_first_number_after_first(img_names, "patient"),
  cell = str_first_number_after_first(img_names, "cell"),
  hrs_after_biop = str_last_number_before_first(img_names, "biop",
    decimals = TRUE
  )
)
```
## Other number formats
`strex` can also deal with numbers in scientific and comma notation.
```{r scicom}
string <- c("$1,000", "$1e6")
str_first_number(string, big_mark = ",", sci = TRUE)
```
It can even do underscore notation or space notation, or both at once:
```{r underscore-or-space}
string <- c("1_000", "1 000", "1_000 000", "1 000_000")
str_first_number(string, big_mark = "_ ")
```
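One way to think about a big mark is as a separator that is dropped before the digits are parsed. The base R sketch below shows that idea on already-isolated number strings. It is a simplification for illustration, not a description of `strex` internals, and it does not cover locating the number inside a longer string.

```{r sketch-big-mark}
# Illustration only: drop the separators, then parse what is left.
raw <- c("1,000", "1e6", "1_000 000")
# Remove commas, underscores and spaces used as thousands separators.
cleaned <- gsub("[,_ ]", "", raw)
# as.numeric() already understands scientific notation such as "1e6".
vals <- as.numeric(cleaned)
vals
```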
# All of the number functions
There is a whole host of functions for extracting numbers from strings in the `strex` package:
```{r all-number-functions}
str_subset(ls("package:strex"), "number")
```
# Regular expressions
Of course, all of the above is possible with regular expressions using `stringr`; it's just more difficult and less expressive:
```{r regex}
data.frame(img_names,
  patient = str_match(img_names, "patient(\\d+)")[, 2],
  cell = str_match(img_names, "cell(\\d+)")[, 2],
  # optional integer part, at most one decimal point, then digits before "hour"
  hrs_after_biop = str_match(img_names, "(\\d*\\.?\\d+)hour")[, 2]
)
```
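One more practical difference is that capture groups come back as character strings, so the regular expression route needs an explicit conversion before the values can be used as numbers. The sketch below shows this with base R's `regexec()` and `regmatches()`, used here as a stand-in for `str_match()`. The file names are typed out by hand for the example.

```{r sketch-regex-numeric}
# Illustration only: captured groups are character and need converting.
x <- c(
  "patient1-cell2-0hours-after-biopsy.tif",
  "patient2-cell3-2.5hours-after-biopsy.tif"
)
# Each list element holds the full match followed by the captured group.
matches <- regmatches(x, regexec("cell([0-9]+)", x))
cell_chr <- vapply(matches, function(v) v[2], character(1))
cell_num <- as.numeric(cell_chr)
class(cell_chr)
cell_num
```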