## Lecture 18: Regular expressions in R

### STAT598z: Intro. to computing for statistics


***




### Vinayak Rao

#### Department of Statistics, Purdue University

In [1]:
options(repr.plot.width=5, repr.plot.height=3)

We have seen the `print` function:

In [48]:
x <- 1
print(x)
y <- list('Hello', TRUE, c(1,2,3))
print(y)

[1] 1
[[1]]
[1] "Hello"

[[2]]
[1] TRUE

[[3]]
[1] 1 2 3



`print` is a *generic* function:
+ it looks at the class of its input and calls the appropriate method (e.g. `print.data.frame` for a data frame)
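
To see this dispatch at work, here is a minimal sketch with a made-up class. The class name `temperature` and its method `print.temperature` are hypothetical and exist only for this illustration.

```r
# Hypothetical S3 class 'temperature', used only to illustrate dispatch
temp <- structure(list(value = 21.5), class = 'temperature')

# print() finds this method because its name is print.<class>
print.temperature <- function(x, ...) {
  cat('Temperature:', x$value, 'degrees\n')
  invisible(x)
}

print(temp)           # dispatches to print.temperature
print(unclass(temp))  # no class attribute, so print.default is used
```

Once the class attribute is removed, the same object is printed as an ordinary list.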

In [51]:
my_df <- data.frame(x = c(1,2), y = c(3,4))
print(my_df)

  x y
1 1 3
2 2 4


In [52]:
print.default(my_df)

$x
[1] 1 2

$y
[1] 3 4

attr(,"class")
[1] "data.frame"


In [53]:
print.data.frame(my_df)

  x y
1 1 3
2 2 4


In [54]:
class(my_df) <- NULL
print(my_df)

$x
[1] 1 2

$y
[1] 3 4

attr(,"row.names")
[1] 1 2


### `print` and `cat`

`print` prints only its first argument; any further arguments are treated as printing options (here `date()` is taken as `digits`)

In [56]:
print('Right now it is', date())

“NAs introduced by coercion”

ERROR: Error in print.default("Right now it is", date()): invalid 'digits' argument
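
The error mentions `digits` because the second positional argument of `print.default` is `digits`, so the output of `date()` was read as a number of digits. A small sketch of what that argument normally does:

```r
print(pi)              # [1] 3.141593
print(pi, digits = 3)  # [1] 3.14
```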


For this we need the `cat` (concatenate) function

In [3]:
cat('Right now it is', date())

Right now it is Mon Mar 27 21:39:50 2017

```
cat(..., file = '' , sep = ' ' , fill = FALSE, labels = NULL,
     append = FALSE)
```

`...`: Inputs that R concatenates and prints

`sep`: Separator appended after each input (default is a space)

`file`: Destination file (default is stdout)

Use `paste()` to store the concatenated output (a string)

In [58]:
cat(1:5)

1 2 3 4 5

In [59]:
cat(1:5,sep= ',' )

1,2,3,4,5

In [60]:
cat(1:5,sep= '\n' )

1
2
3
4
5


In [61]:
cat('[' ,1:5, ']' ,sep=(',' ))

[,1,2,3,4,5,]

In [62]:
cat('[',1:5, ']' ,sep=c('', rep(',' ,4), '' ))

[1,2,3,4,5]

In [63]:
cat('Hello','World','New para',sep='\n',file='new_file.txt')
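
As a rough sketch of the `file` and `append` arguments, the example below writes to a temporary file (created with `tempfile()`, an assumption made here so that nothing in the working directory is touched) and reads the result back with `readLines`.

```r
# Write to a temporary file instead of the working directory
tmp <- tempfile(fileext = '.txt')

cat('Hello\nWorld\n', file = tmp)                # creates or overwrites the file
cat('New para\n', file = tmp, append = TRUE)     # adds to the end of the file

file_lines <- readLines(tmp)
print(file_lines)   # "Hello" "World" "New para"
```

Without `append = TRUE`, the second call would overwrite the first two lines.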

In [64]:
# paste() needs a single string for sep, so collapse the numbers first
my_cmd <- paste0('[', paste(1:5, collapse = ','), ']')
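
The difference between printing and storing can be sketched as follows: `cat` returns `NULL`, so a string to keep has to come from `paste`. The variable names are illustrative only.

```r
# collapse joins the elements of one vector into a single string
joined <- paste(1:5, collapse = ',')    # "1,2,3,4,5"

# sep goes between the arguments, element by element
labels <- paste('x', 1:3, sep = '_')    # "x_1" "x_2" "x_3"

# cat prints its inputs but returns nothing useful
res <- cat(joined, '\n')
is.null(res)                            # TRUE
```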

`cat` does not add a newline at the end of its output, so in the R console the string should end with `\n` (RStudio does not need this)

Section 8.1.22 in *The R Inferno*, Patrick Burns:
+ `print` outputs all characters in the string
+ `cat` outputs what the string represents

Compare:

In [15]:
print('Hello\n') 

[1] "Hello\n"


In [16]:
cat('Hello\n')

Hello


+ `\` escapes the following character (indicating it is special)

What if we want to output `\n` using `cat`?

Escape `\` with another `\`

In [65]:
cat('Hello\\n')

Hello\n
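
One way to check what a string really contains is to count its characters. In this small sketch, the escape `\n` is one character, while `\\n` is two (a backslash and an `n`).

```r
nchar('Hello\n')    # 6: five letters plus one newline character
nchar('Hello\\n')   # 7: five letters, a backslash, and the letter n
```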

**Regular expression**: representation of a collection of strings

Useful for searching and replacing patterns in strings

Regular expressions come with a grammar for building complicated patterns from simple pieces

R has functions that, coupled with regular expressions, allow powerful string manipulation

E.g. `grep`, `grepl`, `regexpr`, `gregexpr`, `sub`, `gsub`

### Matching simple patterns

In [17]:
cities <- c('lafayette', 'indianapolis' , 'cincinnati')
grep('in', cities)

In [66]:
grepl('in', cities)

Usage:  
``` grep(pattern, x, ignore.case = FALSE,  perl = FALSE, value = FALSE) ```

In [67]:
grep('in',cities,value=TRUE) #Return values instead of indices

Where in each element did the match occur?

In [68]:
regexpr('in', cities)

What if more than one match occurred?

In [69]:
gregexpr('in', cities)
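
The sketch below shows one way to read the results of `regexpr` and `gregexpr` for the same `cities` vector. A position of `-1` means no match, and `regmatches` (a base R helper not used elsewhere in this lecture) extracts the matched text.

```r
cities <- c('lafayette', 'indianapolis', 'cincinnati')

# Position of the first match in each string (-1 if none)
first <- regexpr('in', cities)
as.integer(first)                        # -1 1 2

# Positions of all matches, one list element per string
all_m <- gregexpr('in', cities)
as.integer(all_m[[3]])                   # 2 5 for 'cincinnati'

# Number of matches per string
counts <- sapply(all_m, function(m) sum(m > 0))
counts                                   # 0 1 2

# Extract the matched substrings themselves
regmatches(cities, all_m)
```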

What if we want to match
+ any letter followed by `n`?
+ any vowel followed by `n`?
+ two letters followed by `n`?
+ any number of letters followed by `n`?

### Regular expressions!
+ allow us to match much more complicated patterns
+ build patterns from a simple vocabulary and grammar

R supports two flavors of regular expressions: POSIX extended (the default) and Perl-like. We will always use the Perl flavor (set option `perl = TRUE`)

`.` (period) matches any single character, so it does not match the empty string `''`

In [71]:
vec<-c('ct','at', 'cat', 'cart', 'dog', 'rat', 'carert', 'bet')

In [72]:
grep('.at', vec, perl = TRUE)

In [73]:
grep('..t', vec, perl = TRUE)

`+` represents one or more occurrences

In [74]:
grep( 'c.+t', vec, perl = TRUE)

`*` represents zero or more occurrences

In [75]:
grep('c.*t', vec, perl = TRUE)

Group terms with parentheses `(` and `)`

In [76]:
grep('c(.r)+t', vec, perl = TRUE)

In [78]:
grep('c(.r)*t', vec, perl = TRUE)
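
To make the differences between these patterns easier to see, the sketch below reruns them with `value = TRUE`, so that the matching strings are returned instead of their indices.

```r
vec <- c('ct', 'at', 'cat', 'cart', 'dog', 'rat', 'carert', 'bet')

grep('.at', vec, perl = TRUE, value = TRUE)      # "cat" "rat"
grep('c.+t', vec, perl = TRUE, value = TRUE)     # "cat" "cart" "carert"
grep('c.*t', vec, perl = TRUE, value = TRUE)     # "ct" "cat" "cart" "carert"

# The group (.r) is repeated as a unit: 'ar', 'arer', ...
grep('c(.r)+t', vec, perl = TRUE, value = TRUE)  # "cart" "carert"
grep('c(.r)*t', vec, perl = TRUE, value = TRUE)  # "ct" "cart" "carert"
```

Note that `'at'` does not match `.at`, since `.` must consume one character, and `'cat'` does not match `c(.r)*t`, since a lone `a` is not a complete `(.r)` group.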

`.`, `+` and `*` are all metacharacters, as are `(` and `)`

Other useful ones include:

+ `^` and `$` (start and end of line)

In [79]:
grep('r.$', vec, perl = TRUE)

+ `|` (logical OR)

In [80]:
grep('(c.t)|(c.rt)', vec, perl = TRUE)
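
A short sketch of anchors and alternation on the same vector, again with `value = TRUE` so that the matches are visible:

```r
vec <- c('ct', 'at', 'cat', 'cart', 'dog', 'rat', 'carert', 'bet')

grep('^c', vec, perl = TRUE, value = TRUE)            # "ct" "cat" "cart" "carert"
grep('at$', vec, perl = TRUE, value = TRUE)           # "at" "cat" "rat"
grep('r.$', vec, perl = TRUE, value = TRUE)           # "cart" "carert"
grep('(c.t)|(c.rt)', vec, perl = TRUE, value = TRUE)  # "cat" "cart"

# Anchors plus a group: the whole string must be 'cat' or 'rat'
grep('^(c|r)at$', vec, perl = TRUE, value = TRUE)     # "cat" "rat"
```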

+ `[` and `]` (create character classes)

`[a-z]`: lowercase letters

`[a-zA-Z]`:  any letter

`[0-9]`: any digit

`[aeiou]`: any vowel

`[0-7ivx]`: any of 0 to 7, i, v, and x

Inside a character class, a leading `^` means "anything except the following characters". E.g.

`[^0-9]`: anything except a digit
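
With character classes, the questions posed earlier (a vowel followed by `n`, two letters followed by `n`, and so on) can be answered. The patterns below are one possible sketch, applied to the `cities` vector from before. The last line uses a small made-up input vector.

```r
cities <- c('lafayette', 'indianapolis', 'cincinnati')

grepl('[aeiou]n', cities, perl = TRUE)      # vowel then n:      FALSE TRUE TRUE
grepl('[^aeiou]n', cities, perl = TRUE)     # non-vowel then n:  FALSE FALSE TRUE
grepl('[a-z][a-z]n', cities, perl = TRUE)   # two letters then n: FALSE TRUE TRUE

# Strings made up entirely of digits (illustrative input)
grepl('^[0-9]+$', c('2017', 'x27', ''), perl = TRUE)   # TRUE FALSE FALSE
```

Only `'cincinnati'` matches `[^aeiou]n`, because of the `nn` in the middle.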

What if we want to match metacharacters like `.` or `+`?

In [82]:
vec <- c('ct', 'cat', 'caat', 'caart', 'caaaat', 'caaraat', 
         'c.t')
grep('c.t', vec, perl = TRUE) #Is this what we want?

Escape them with `\`

WARNING: a single `\` doesn’t work. Why?

In [83]:
cat('c\.t')

ERROR: Error: '\.' is an unrecognized escape in character string starting "'c\."


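R tries to read `\.` as an escape sequence like `\n`, and no such escape exists.

Use two `\`s: R turns `'\\.'` into the two characters `\.`, which the regular expression then reads as a literal period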

In [84]:
cat('c\\.t')

c\.t

In [85]:
grep('c\.t', vec, perl = TRUE)

ERROR: Error: '\.' is an unrecognized escape in character string starting "'c\."


In [86]:
grep('c\\.t', vec, perl = TRUE)

To match a literal `\`, the regular expression itself must be `\\`, which is written as the R string `'\\\\'`

In [88]:
my_var <- '\n'
grep('\\n', my_var)

In [90]:
my_var <- ('\\')
grep('\\\\', my_var)
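
The sketch below collects a few ways of matching a literal period. Besides the double backslash, it uses a character class `[.]` and the `fixed = TRUE` argument of `grep`, which treats the whole pattern as plain text. These two alternatives are not covered above and are shown here only for comparison.

```r
vec <- c('ct', 'cat', 'caat', 'caart', 'caaaat', 'caaraat', 'c.t')

grep('c.t', vec, perl = TRUE, value = TRUE)     # "cat" "c.t"  (. matches anything)
grep('c\\.t', vec, perl = TRUE, value = TRUE)   # "c.t"        (escaped period)
grep('c[.]t', vec, perl = TRUE, value = TRUE)   # "c.t"        (period inside a class)
grep('c.t', vec, fixed = TRUE, value = TRUE)    # "c.t"        (no regex at all)

# The R string 'c\\.t' holds four characters: c \ . t
nchar('c\\.t')                                  # 4
```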

### Search and replace
The `sub` function allows search and replacement:

In [91]:
vec <-c('ct','cat','caat','caart','caaaat','caaraat','c.t')
sub('a+', 'a', vec, perl = TRUE)

`sub` replaces only the first match, `gsub` replaces all
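
A side-by-side sketch of the two functions on the same vector. The only element that differs is `'caaraat'`, which contains two separate runs of `a`.

```r
vec <- c('ct', 'cat', 'caat', 'caart', 'caaaat', 'caaraat', 'c.t')

first_only <- sub('a+', 'a', vec, perl = TRUE)
first_only   # "ct" "cat" "cat" "cart" "cat" "caraat" "c.t"

all_runs <- gsub('a+', 'a', vec, perl = TRUE)
all_runs     # "ct" "cat" "cat" "cart" "cat" "carat" "c.t"
```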

Use backreferences `\1`, `\2` etc. to refer to the first, second group etc. (written `'\\1'`, `'\\2'` in an R string)

In [92]:
gsub('(a+)r(a+)', 'b\\1brc\\2c', vec, perl = TRUE)


Use `\U`, `\L` and `\E` (with `perl = TRUE`) to convert the backreferences that follow to upper case, to lower case, or to leave them unchanged, respectively

In [45]:
gsub('(a+)r(a+)', '\\U\\1r\\2', vec, perl = TRUE)

In [47]:
gsub('(a+)r(a+)', '\\U\\1r\\E\\2', vec, perl = TRUE)
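
Two further uses of backreferences, sketched on small illustrative inputs: swapping two captured words, and capitalizing the first letter of each city name with `\U`.

```r
# Swap two words by referring to the groups in reverse order
swapped <- sub('([a-z]+) ([a-z]+)', '\\2 \\1', 'hello world', perl = TRUE)
swapped       # "world hello"

# Mark what each group captured
marked <- gsub('(a+)r(a+)', '<\\1|\\2>', 'caaraat', perl = TRUE)
marked        # "c<aa|aa>t"

# Capitalize the first letter: \U upper-cases the backreference that follows
cities <- c('lafayette', 'indianapolis', 'cincinnati')
capitalized <- sub('^([a-z])', '\\U\\1', cities, perl = TRUE)
capitalized   # "Lafayette" "Indianapolis" "Cincinnati"
```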