In [58]:
library(tidyverse)
library(nycflights13)

# Lecture 16: Functions

In this lecture we will learn about functions. We already have plenty of experience calling functions in order to mutate, plot, and explore data. Now we will learn how and when to write our own functions.

## When to write a function

Often when programming we find ourselves repeating the same block of code with minor modifications. Suppose we want to normalize each column of this tibble to be in $[0,1]$:

In [13]:
df = tibble::tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
) %>% print

# A tibble: 10 x 4
        a       b       c       d
    <dbl>   <dbl>   <dbl>   <dbl>
 1 -0.626  1.51    0.919   1.36  
 2  0.184  0.390   0.782  -0.103 
 3 -0.836 -0.621   0.0746  0.388 
 4  1.60  -2.21   -1.99   -0.0538
 5  0.330  1.12    0.620  -1.38  
 6 -0.820 -0.0449 -0.0561 -0.415 
 7  0.487 -0.0162 -0.156  -0.394 
 8  0.738  0.944  -1.47   -0.0593
 9  0.576  0.821  -0.478   1.10  
10 -0.305  0.594   0.418   0.763 


To normalize we'll subtract the min from each column and divide by its range:

In [59]:
df$a <- (df$a - min(df$a, na.rm = TRUE)) / 
  (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) / 
  (max(df$b, na.rm = TRUE) - min(df$b, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) / 
  (max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) / 
  (max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))
 
print(df)

# A tibble: 10 x 4
         a     b     c     d
     <dbl> <dbl> <dbl> <dbl>
 1 0.0860  1     1     1    
 2 0.419   0.699 0.953 0.466
 3 0       0.428 0.710 0.645
 4 1       0     0     0.484
 5 0.479   0.896 0.897 0    
 6 0.00624 0.582 0.665 0.352
 7 0.544   0.590 0.630 0.359
 8 0.647   0.848 0.178 0.482
 9 0.581   0.815 0.520 0.905
10 0.218   0.754 0.828 0.782


This required a bunch of repetitive typing. (Worse, copying and pasting like this invites errors: forget to change a single `a` to `b` and the result is silently wrong.) In situations like this we should write a function!

## Anatomy of a function
To write a function we should first think about the inputs and output. A function takes input(s), does something(s) to them, and then returns an output.

What are the input(s) and output of our normalize function?
```{r}
df$a <- (df$a - min(df$a, na.rm = TRUE)) / 
  (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
```

We will call our function `rescale01`. The input is the vector we wish to normalize, and the output is a copy of the vector where each entry is normalized.

In [70]:
rescale01 <- function(x) {
#  ^ function name   ^ function argument (input vector)
    (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
#   ^ function output
}

Notice how we have taken our code and converted every instance of `df$a` to `x`, which is the name that we have assigned to our function argument.

Let's test our function on a few examples:

In [71]:
rescale01(c(.01, .2, .3))

[1] 0.0000000 0.6551724 1.0000000

In [17]:
rescale01(c(1,0,3,2))

[1] 0.3333333 0.0000000 1.0000000 0.6666667

Now that we have defined our function, we can replace our code with a nicer looking version:

In [18]:
df$a = rescale01(df$a)
df$b = rescale01(df$b)
df$c = rescale01(df$c)
df$d = rescale01(df$d)

This is considerably simpler, but still has some repetition. Soon we will learn about iteration and ways to cut down further on repetition.
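
As a small preview, here is a minimal sketch of one way to remove the remaining repetition, using base R's `lapply()`. The data frame `df_demo` is made up for illustration, and `rescale01` is repeated so that the block runs on its own.

```r
# Same definition as above, repeated so this block is self-contained
rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Hypothetical example data (not the tibble used in the lecture)
df_demo <- data.frame(a = c(1, 2, 3), b = c(10, 20, 40))

# Apply rescale01 to every column; assigning into df_demo[] keeps the data frame structure
df_demo[] <- lapply(df_demo, rescale01)
print(df_demo)
```

Each column is rescaled by one call to the same function, so a fix to `rescale01` automatically applies everywhere.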

What happens if we pass an infinite value to our function?

In [73]:
x = c(1:10, Inf)
rescale01(x)

 [1]   0   0   0   0   0   0   0   0   0   0 NaN


We have turned up a bug in our function! But since the code now all lives in one place, we can fix the function once rather than having to chase down the bug every place that we copied and pasted the code.

In [75]:
rescale01 = function(x) {
  rng = range(x, na.rm = TRUE, finite = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
# range(x, finite=T)
rescale01(x)

 [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
 [8] 0.7777778 0.8888889 1.0000000       Inf

## Conditional execution
Often when writing functions we need to do different things depending on what data is passed in. This is known as *conditional execution*, and is accomplished using the `if/else` construct:
```{r}
if (condition) {
  # code executed when condition is TRUE
} else {
  # code executed when condition is FALSE
}
```

### Exercise
The *Heaviside step function* is defined as
$$H(x)=\begin{cases}0,&x\le 0\\
1,&\text{otherwise}
\end{cases}.$$
How can we code this as an R function?

In [86]:
H <- function(x) as.integer(x > 0)
H(-.5)
H(1)

[1] 0

[1] 1
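
The solution above relies on a vectorized comparison. For a single number, the same function can also be written with `if/else`. This is only a sketch, and the name `H_if` is chosen here for illustration.

```r
# if/else version of the Heaviside step function (single numbers only)
H_if <- function(x) {
  if (x > 0) {
    1L  # x is strictly positive
  } else {
    0L  # x is zero or negative
  }
}
H_if(-0.5)
H_if(1)
```

Unlike `H`, this version expects exactly one number, because `if` needs a single `TRUE` or `FALSE`.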

### Exercise
Write a function `fizzbuzz(x)` that prints "fizz" if x is divisible by three, and "buzz" otherwise.
```{r}
> fizzbuzz(3)
[1] "fizz"
> fizzbuzz(4)
[1] "buzz"
```

In [91]:
fizzbuzz <- function(x) {
    # x %% 3 is the remainder after dividing by 3; zero means divisible
    if (x %% 3 == 0) {
        "fizz"
    } else {
        "buzz"
    }
}
3:10 %% 3

[1] 0 1 2 0 1 2 0 1

### Exercise
What do the following functions do?
```{r}
f1 <- function(string, prefix) {
  substr(string, 1, nchar(prefix)) == prefix
}

f2 <- function(x) {
  if (length(x) <= 1) return(NULL)
  x[-length(x)]
}
```

In [100]:
f1 <- function(string, prefix) {
  substr(string, 1, nchar(prefix)) == prefix
}

f2 <- function(x) {
  if (length(x) <= 1) return(NULL)
  x[-length(x)]
}

c(1:10, -1)[-11]

 [1]  1  2  3  4  5  6  7  8  9 10

### Conditions
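The `condition` part of the `if` statement must evaluate to a single `TRUE` or `FALSE`. If it is a vector of length greater than one, older versions of R use only the first element and issue a warning, as in the output below; R 4.2.0 and later treat this as an error: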

In [104]:
if (c(T, F)) { 1 } else { }

ifelse(
    1:10 > 5,
    "gt5",
    "lte5"
)


“the condition has length > 1 and only the first element will be used”

[1] 1

 [1] "lte5" "lte5" "lte5" "lte5" "lte5" "gt5"  "gt5"  "gt5"  "gt5"  "gt5"

(Why?) Similarly, a condition of `NA` will generate an error:
```{r}
> if (NA) { 1 }
Error in if (NA) {: missing value where TRUE/FALSE needed
```
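
One way to guard against a missing or multi-element condition is base R's `isTRUE()`, which returns `TRUE` only for a single, non-missing `TRUE`. The function `describe_sign` below is a hypothetical example, not part of the lecture code.

```r
describe_sign <- function(x) {
  # isTRUE() is FALSE for NA and for vectors of length != 1,
  # so the if statement always receives a valid condition
  if (isTRUE(x > 0)) {
    "positive"
  } else {
    "not positive (or unknown)"
  }
}
describe_sign(5)
describe_sign(NA)
```

Whether treating `NA` as "not positive" is appropriate depends on the application; sometimes an explicit error is the better choice.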

#### Logical operators
Often you will need to combine multiple logical conditions in an `if` statement. To do this we have the `&&` and `||` operators, which take the logical `and` and `or`, respectively, of several logical conditions:

In [109]:
# TRUE && FALSE && TRUE

if(c(T, T, F) & c(T, F, T)) {}  # Vectorized version. single &

“the condition has length > 1 and only the first element will be used”

NULL

In [25]:
FALSE || TRUE || FALSE

[1] TRUE

There is a subtle but important difference between the single and double versions of these operators. The single `&` performs entrywise `AND` over logical vectors:

In [26]:
c(T, T, F) & c(F, T, F)

[1] FALSE  TRUE FALSE

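In contrast, the double ampersand `&&` works on single logical values: it evaluates its operands from left to right and returns `F` as soon as it encounters a value of `F`. When given vectors, older versions of R silently look only at the first element of each (so the call below compares `T` with `F`), while R 4.3.0 and later raise an error: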

In [27]:
c(T, T, T) && c(F, T, F)

[1] FALSE

It only returns `T` if every operand it examines is `T` (here, again, only the first element of each vector):

In [28]:
c(T, T, T) && c(T, T, T)

[1] TRUE

This is known as "short-circuiting": R can stop evaluating as soon as it hits *one* false value, since this will cause the `&&` to return false:

In [29]:
f = function() { print("f called"); F }
g = function() { print("g called"); T }
f() && g()

[1] "f called"


[1] FALSE

The or operator works similarly:

In [30]:
g() || f()

[1] "g called"


[1] TRUE
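
Short-circuiting is useful for writing conditions in which later checks only make sense if earlier ones pass. The following is a minimal sketch; `is_positive_number` is a made-up helper.

```r
is_positive_number <- function(x) {
  # Each check is evaluated only if all the checks to its left were TRUE,
  # so `x > 0` is never reached for non-numeric, missing, or multi-element input
  is.numeric(x) && length(x) == 1 && !is.na(x) && x > 0
}
is_positive_number(3)
is_positive_number("3")
is_positive_number(NULL)
```

Because of the ordering, the final comparison always sees a single non-missing number, so the result is always a single `TRUE` or `FALSE`.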

#### Testing for equality
Be careful when testing for equality in conditionals. The `==` operator will return a *vector* of logicals. If you want to make sure that any/all entries of a vector are `TRUE`, use the `any()` or `all()` functions:

In [112]:
v1 = c(1, 2, 3)
v2 = c(1, 1, 2)
v1 == v2
all(v1 == v2)
any(v1 == v2)
#if (v1 == v2) { print("Wrong!") }
#if (all(v1 == v2)) { print("All!") }
#if (any(v1 == v2)) { print("Any!") }

[1]  TRUE FALSE FALSE

[1] FALSE

[1] TRUE

Also be wary of testing floating point numbers for equality:

$$2 = \left(\sqrt{2}\right)^2?$$

In [126]:
sqrt(2)

[1] 1.414214

If you need to compare floating point numbers, use dplyr's `near()` function, which allows for a small numerical tolerance:

In [120]:
near(2, sqrt(2) ^ 2)

[1] TRUE
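
To see why exact comparison is risky, the sketch below checks the same quantity with `==` and with two tolerance-based alternatives from base R. The tolerance `1e-8` is an arbitrary choice for illustration.

```r
x <- sqrt(2)^2

x == 2                    # FALSE: x differs from 2 by a tiny rounding error
x - 2                     # the size of that error
abs(x - 2) < 1e-8         # TRUE: equal up to a chosen tolerance
isTRUE(all.equal(x, 2))   # TRUE: all.equal() compares with a default tolerance
```

`all.equal()` returns a description of the difference rather than `FALSE` when the values differ, which is why it is wrapped in `isTRUE()` before being used in a condition.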

### Multiple conditions
Sometimes you will want to check multiple conditions using an `if` statement. For example, let's define the function $$\operatorname{sgn}(x) = \begin{cases}-1,&x<0\\0,&x=0\\1,&x>0.\end{cases}$$

In [34]:
sgn = function(x) {
    if (x < 0) {
        return(-1)
    } else if (x == 0) {
        return(0)
    }
    return(1)
} 
sgn(0)

[1] 0

The general form is
```{r}
if (this) {
  # do that
} else if (that) {
  # do something else
} else {
  # 
}
```

If you find yourself chaining together a long string of `if/else if/else if/else if/.../else` statements, chances are there is an easier way. For example, say we have a continuous variable `temp` that we want to convert to a factor. One way is using lots of `if/else`:
```{r}
if (temp <= 0) {
  "freezing"
} else if (temp <= 10) {
  "cold"
} else if (temp <= 20) {
  "cool"
} else if (temp <= 30) {
  "warm"
} else {
  "hot"
}
```

Alternatively, we could use the `cut` function:
```{r}
cut(temp, c(-Inf, 0, 10, 20, 30, Inf), 
    labels = c('freezing', 'cold', 'cool', 'warm', 'hot'))
```
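
Here is a small worked example of `cut` on a vector of temperatures. The values in `temp` are invented for illustration; note that five labels require six break points, with `-Inf` and `Inf` covering the open-ended categories.

```r
# Hypothetical temperatures, one per category plus a boundary value
temp <- c(-5, 0, 5, 15, 25, 35)

temp_label <- cut(
  temp,
  breaks = c(-Inf, 0, 10, 20, 30, Inf),  # intervals are (lower, upper] by default
  labels = c("freezing", "cold", "cool", "warm", "hot")
)
as.character(temp_label)
```

Because the intervals are closed on the right by default, a temperature of exactly 0 is labelled "freezing", matching the `temp <= 0` test in the `if/else` chain. Unlike that chain, `cut` also handles a whole vector at once.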

### Brackets
The bodies of `function` and `if` are usually wrapped in the curly bracket delimiters `{` and `}`. For one-line statements, the brackets are optional:

In [35]:
if (TRUE) { 
    print("A1") 
} else { 
    print("B") 
}

if (TRUE) print("A2") else print("B")

[1] "A1"
[1] "A2"


You should almost always use the curly braces. One exception is for very brief, unnamed functions. We'll see some examples of this next week when we study map/reduce computations.

### Exercise

**Beginner**: Write a function `f(x, a, b)` that takes three numbers $x,a,b$ and returns true iff $x \in [a, b)$.
```
> f(1, 2, 3)
[1] FALSE
> f(2.5, 2, 3)
[1] TRUE
```

**Advanced**: Write a function `f(v)` that takes a vector of numbers and returns their product.

**More advanced**: Do the same in one line.

In [146]:
1:10 %>% log %>% sum %>% exp

[1] 3628800
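
The pipeline above computes a product by summing logarithms, which only works when every entry is positive. The sketch below wraps that idea in a function (the name `product_logs` is made up) and compares it with base R's `prod()`.

```r
# Product via logs: exp(log(v1) + log(v2) + ...) = v1 * v2 * ...
# Only valid when all entries of v are positive
product_logs <- function(v) exp(sum(log(v)))
product_logs(1:10)

# prod() multiplies directly and also handles zero and negative entries
prod(1:10)
prod(c(-2, 3))
```

The log-based version can differ from the exact answer by a small rounding error, so `prod()` is usually the simpler choice.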

### Exercise

**Beginner**: Write a function `num_words(s)` which takes a sentence `s` and returns the number of words.
```
> num_words("This sentence has five words.")
[1] 5
```

**Advanced**: A *pangram* is a sentence that uses all 26 letters of the alphabet. Write a function `is_pangram(s)` that checks this:
```
> is_pangram("Not a pangram.")
[1] FALSE
> is_pangram("J. Terhorst very quickly ate beer and pizza from Zachary's while relaxing.")
[1] TRUE
```

In [148]:
setequal(c(1, 2, 3),c(1, 1, 2, 3))

[1] TRUE
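
One possible approach to both parts, using only base R, is sketched below. It assumes that words are separated by whitespace and that only the unaccented letters a to z count toward a pangram.

```r
num_words <- function(s) {
  # Trim the ends, split on runs of whitespace, and count the pieces
  length(strsplit(trimws(s), "\\s+")[[1]])
}

is_pangram <- function(s) {
  # Break the lower-cased sentence into single characters
  chars <- strsplit(tolower(s), "")[[1]]
  # Keep only the letters that occur, then compare that set with a..z
  setequal(intersect(chars, letters), letters)
}

num_words("This sentence has five words.")
is_pangram("Not a pangram.")
is_pangram("The quick brown fox jumps over the lazy dog")
```

`setequal()` ignores duplicates and order, which is exactly what is needed here: it does not matter how often or where each letter appears.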

### Exercise

**Beginner**: Write a function `is_even(x)` that tells whether the integer $x$ is even.

**Advanced**: Write a function `is_prime(x)` that tells whether the integer $x$ is prime.

In [164]:
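is_prime <- function(n) {
   if (n < 2) return(FALSE)  # 0, 1 and negative numbers are not prime
   if (n < 4) return(TRUE)   # 2 and 3 are prime; this also avoids the descending sequence 2:1
   ! any((n %% 2:(n-1)) == 0)  # prime iff no number in 2..n-1 divides n
}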
is_prime(10)

[1] FALSE

## Review
The *mode* of a vector `v` is its most frequent value:
```
> v <- c(1, 1, 2, 3)
> mode(v)
[1] 1
```
Write a function `mode(v)` that computes this.

In [19]:
v <- c(1, 1, 2, 3, NA)
sum(v)

[1] NA
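
A possible solution is sketched below. It is named `vec_mode` (a made-up name) so that it does not mask base R's own `mode()` function, and the `na.rm` option is an assumption about how missing values should be treated.

```r
vec_mode <- function(v, na.rm = TRUE) {
  # Optionally drop missing values so they cannot be the "most frequent" entry
  if (na.rm) {
    v <- v[!is.na(v)]
  }
  ux <- unique(v)                       # the distinct values
  counts <- tabulate(match(v, ux))      # how often each distinct value occurs
  ux[which.max(counts)]                 # first value with the highest count
}
vec_mode(c(1, 1, 2, 3, NA))
```

When several values tie for the highest count, `which.max()` picks the first one, so this sketch returns the tied value that appears earliest in `v`.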

## Function arguments
Functions can take multiple arguments. Generally they fall into one of two categories:
* *Data* to be processed by the function, and
* *Options*, which affect how the data gets processed.

```{r}
mean(x, na.rm=TRUE)
log(x, base=y)
str_c(..., sep=" ")
```
What is/are the data? What are the options?

### Rules for function arguments
Generally:
1. The *data* parameters should come first; and
2. The *options* should come second, and have sensible defaults.

Default parameter values are specified by the `option=default` notation:

In [22]:
mean_ci <- function(x, conf=0.95) {
  se <- sd(x) / sqrt(length(x))
  alpha <- 1 - conf
  mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}
mean_ci(1:40, conf=.99)

[1] 15.73878 25.26122

When you call a function, you can omit arguments that have default values. To override a default, pass the argument by name, as in `conf = .99`, rather than by position:
```{r}
mean_ci(c(1, 2, 3, 4))  # standard
mean_ci(c(1, 2, 3, 4), conf = .99)  # yes
mean_ci(c(1, 2, 3, 4), .99)  # no
```

### Validation
When writing functions it's a good idea to *validate* the input -- that is, make sure it matches your assumptions about what is being passed to the function. Consider the following function which returns the weighted average of a vector:

In [26]:
w_mean = function(x, w) {
    sum(x * w) / sum(w)
}

This function relies implicitly on the fact that the weight vector `w` is the same length as the input vector `x`. If it's not, R recycles the shorter vector (warning only when the longer length is not a multiple of the shorter one), and you get unexpected behavior. Here every element silently receives weight 1, so the result is just the sum of `x`:

In [33]:
w_mean(c(1, 2, 3), w = 1)

[1] 6


It's best to make the assumption of equal length explicit by checking it:

In [32]:
w_mean = function(x, w) {
    stopifnot(length(w) == length(x))
    sum(x * w) / sum(w)
}

Now:
```{r}
> w_mean(c(1,2,3), w=c(1, 2))
Error: length(w) == length(x) is not TRUE
Traceback:

1. w_mean(c(1, 2, 3), w = c(1, 2))
2. stopifnot(length(w) == length(x))   # at line 2 of file <text>
3. stop(msg, call. = FALSE, domain = NA)
```
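
`stopifnot()` is concise, but its error message just echoes the failed expression. When a friendlier message is wanted, the check can be written out with `if` and `stop()`. The sketch below uses the made-up name `w_mean_checked`, and the wording of the message is only an example.

```r
w_mean_checked <- function(x, w) {
  # Fail early, with an explicit message, if the lengths disagree
  if (length(x) != length(w)) {
    stop("`x` and `w` must have the same length", call. = FALSE)
  }
  sum(x * w) / sum(w)
}

w_mean_checked(c(1, 2, 3), w = c(1, 1, 2))

# The failing call is wrapped in tryCatch() so that this block runs to completion
tryCatch(
  w_mean_checked(c(1, 2, 3), w = 1),
  error = function(e) conditionMessage(e)
)
```

The trade-off is a few more lines of code in exchange for an error that tells the caller what to fix.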

Adding comments is another good way to make sure that you don't encounter unexpected situations in your functions:

In [36]:
w_mean = function(x, w, ...) {
    # Return the average of `x` weighted by weight vector `w`
    stopifnot(length(w) == length(x))
    sum(x * w) / sum(w)
}

w_mean(1:3, 1:3, 1:3)

[1] 2.333333

### Dot-dot-dot (`...`)
Some functions are designed to take a variable number of inputs. We saw this for example with the `str_c` function:

In [38]:
stringr::str_c("a", "b")
stringr::str_c("a", "b", "c", "d")

[1] "ab"

[1] "abcd"

To construct a function that takes a variable number of arguments we use the `...` notation:
```{r}
f = function(...) {
    # do something with the variable arguments
}
```

One thing you can do with the `...` is pass it to another function:

In [44]:
commas <- function(...) stringr::str_c(..., collapse = ", ")
commas(letters[1:10])

[1] "a, b, c, d, e, f, g, h, i, j"

You can also access individual arguments in `...` using the `list(...)` notation. We'll learn more about lists in the next lecture.
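
As a brief illustration of `list(...)`, the sketch below captures the variable arguments, reports how many there are, and returns the first one. The function `describe_args` is hypothetical.

```r
describe_args <- function(...) {
  args <- list(...)  # capture the variable arguments as a list
  cat("Got", length(args), "argument(s)\n")
  if (length(args) > 0) {
    args[[1]]        # double brackets extract a single element of the list
  } else {
    NULL
  }
}
describe_args("a", 2, TRUE)
```

A list is used because the arguments may have different types, which an ordinary vector could not hold without converting them.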

## Return values
Thus far we have relied on the default behavior of R, which is to return the value of the last expression evaluated in the function:

In [45]:
f = function() {
    1
    2
    3  # this will be returned
}
f()

[1] 3

In more complicated functions you'll need to manually return values using the `return()` function:
```{r}
complicated_function <- function(x, y, z) {
  if (length(x) == 0 || length(y) == 0) {
      return(0)  # this immediately returns and halts the function.
  }
  # Complicated code here
}
```

### Pipeable functions
We've seen a lot of uses of the pipe operator `%>%`. As you become more advanced, you may find it useful to create your own functions which can be used in data pipelines. 

#### Transformations
For pipeable functions that transform a data frame, simply return the altered version of the data frame. For example:

In [40]:
first_row <- function(df) {
    df %>% slice(1)
} 
tibble(x=c(1,2,3), y=c("a","b","c")) %>% first_row

  x y
1 1 a

### Exercise
Define a function `drop_even(df)` which drops all the even-numbered rows from a data frame:
```{r}
> tibble(x=c(1,2,3), y=c("a","b","c")) %>% drop_even
# A tibble: 2 x 2
      x y    
  <dbl> <chr>
1     1 a    
2     3 c    
```

In [50]:
# Keep the odd-numbered rows, i.e. drop rows 2, 4, 6, ...
drop_even <- function(df) filter(df, row_number() %% 2 == 1)

## Environments
The environment is, roughly, the set of variables and data defined in your R session. The default environment is called the "global environment":

In [52]:
environment()$drop_even

function(df) filter(df, row_number() %% 2 == 1)

The environment in which a function was defined is called the *enclosing environment*. If you reference a variable inside of a function, which is not *defined* in that function, R will look for it in the enclosing environment.

In [53]:
f = function(x) {
    x + y
}
x = 1:3
y = 3
f(x)

[1] 4 5 6

In [54]:
y = 10
f(x)

[1] 11 12 13
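
The enclosing environment does not have to be the global one. When a function is created inside another function, its enclosing environment is that particular call of the outer function. The sketch below shows this with a made-up helper `make_adder`.

```r
make_adder <- function(n) {
  # The returned function does not define `n` itself,
  # so R finds it in the environment where the function was created
  function(x) {
    x + n
  }
}

add3 <- make_adder(3)
add10 <- make_adder(10)
add3(1:3)
add10(1:3)
```

Each call to `make_adder` creates a separate environment, so `add3` and `add10` each remember their own value of `n`.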

If you want to modify a variable which lives in the enclosing environment, you need to use a special syntax:

In [57]:
i <- 0
f1 <- function() {
    i <- i + 1
}
f2 <- function() {
    i <<- i + 1  # special assignment syntax: <<-
}
# f1()
i
f2()
i

[1] 0

[1] 1

### Exercise
Write a function `howmany()` which prints the number of times that it has been called:
```
> howmany()
[1] 1
> howmany()
[1] 2
> howmany()
[1] 3
```

In [62]:
howmany = function(){
   if(exists('num')){
       num <<- num + 1
   }else{
       num <<- 1
   }
   print(num)
}
howmany()
howmany()
howmany()
num

[1] 7
[1] 8
[1] 9


[1] 9
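
The counter above lives in the global variable `num`, so its output depends on whatever `num` held before the cell was run (which is why the run shown starts at 7 rather than 1). An alternative, sketched below with the made-up name `make_counter`, keeps the count in the enclosing environment of the function instead.

```r
make_counter <- function() {
  count <- 0  # private to this particular counter
  function() {
    # `<<-` updates `count` in the enclosing environment rather than creating a local copy
    count <<- count + 1
    print(count)
    invisible(count)  # also return the value without printing it a second time
  }
}

howmany2 <- make_counter()
howmany2()
howmany2()
```

Each call to `make_counter()` starts a fresh count at zero, and nothing in the global environment can interfere with it by accident.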

### Exercise

Harder: write a function `same(x)` which prints "yes!" if `x` is the same value that was passed into `same()` on the previous call, and "no!" otherwise. (`same()` always prints "no!" to start):
```{r}
> same(1)
[1] "no!"
> same(1)
[1] "yes!"
> same(1)
[1] "yes!"
> same("hello")
[1] "no!"
```

In [57]:
# Your code here