# Lab 11: Misc Functions

In [2]:
library(tidyverse)
library(nycflights13)

Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats


## Review

Remember that **lists** are sequences where the elements are allowed to be different data types (including vectors or even other lists). You will usually want to name your list elements.

In [3]:
x = list(a='a', b=FALSE, c=1:3, d=list(first=c(1, 3, 5), second=c('a', 'b', 'c')), e=pi)
print(names(x))
print(x)

[1] "a" "b" "c" "d" "e"
$a
[1] "a"

$b
[1] FALSE

$c
[1] 1 2 3

$d
$d$first
[1] 1 3 5

$d$second
[1] "a" "b" "c"


$e
[1] 3.141593



There are three ways to subset or extract elements from a list:
* `[...]` will extract a sublist. Note that the result of this will always be another list. Integer, logical, or character vectors can be used.
* `[[...]]` will extract a single element. Either the index or name of the desired element can be provided.
* `$a` will also extract a single element. Note that this requires a named list, and the name must be used. 
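
To make the difference between the three operators concrete, here is a minimal sketch on a small list (a shortened version of the `x` defined above; the names `sub`, `elt`, and `dol` are only for illustration):

```r
# A small named list, mirroring part of the structure of `x` above
x = list(a = 'a', b = FALSE, c = 1:3)

sub = x[c('a', 'c')]   # single brackets: a list holding elements a and c
elt = x[['c']]         # double brackets: the integer vector 1 2 3 itself
dol = x$c              # same element as x[['c']]

is.list(sub)           # TRUE
is.list(elt)           # FALSE
identical(elt, dol)    # TRUE
```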

### Functional Programming

The tidyverse provides a suite of functional programming tools in the `purrr` package ([documentation](https://purrr.tidyverse.org/index.html)).

Functional programming is generally built on three main operations:
* `map`
* `keep` (usually known as `filter` in other languages)
* `reduce`

Note that `map` always returns a list. If you want a vector instead, use `map_lgl`, `map_int`, `map_dbl`, or `map_chr` for logicals, integers, doubles/floats, and strings, respectively.

In [4]:
map(1:5, function(x) x^2) %>% print

[[1]]
[1] 1

[[2]]
[1] 4

[[3]]
[1] 9

[[4]]
[1] 16

[[5]]
[1] 25



In [5]:
map_dbl(1:5, function(x) x^2)

In [6]:
keep(1:5, function(x) x %% 2 == 0)

In [7]:
reduce(1:5, function(x, y) x + y)

In [8]:
accumulate(1:5, function(x, y) x + y)

In [9]:
map_dbl(1:5, ~ .^2)
keep(1:5, ~ . %% 2 == 0)
reduce(1:5, ~ .x + .y)
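
The last cell uses `purrr`'s formula shorthand: `~ .^2` stands for an anonymous function, where `.` or `.x` refers to the first argument and `.y` to the second. A minimal sketch showing that the two spellings agree (the variable names are only for illustration):

```r
library(purrr)

# Explicit anonymous functions
sq_long   = map_dbl(1:5, function(x) x^2)
even_long = keep(1:5, function(x) x %% 2 == 0)
sum_long  = reduce(1:5, function(x, y) x + y)

# Formula shorthand for the same three calls
sq_short   = map_dbl(1:5, ~ .x^2)
even_short = keep(1:5, ~ .x %% 2 == 0)
sum_short  = reduce(1:5, ~ .x + .y)

identical(sq_long, sq_short)      # TRUE
identical(even_long, even_short)  # TRUE
identical(sum_long, sum_short)    # TRUE
```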

`map` applies the function to every component of its input and returns the results as a list, even when the components have different types.

In [19]:
mixed_list = list(a=2, b=c(1,2,3), c = matrix(c(1,2,3,4), nrow=2), d='a')
map(mixed_list, mean)

“argument is not numeric or logical: returning NA”

In [20]:
map(mixed_list, sd)

“NAs introduced by coercion”

### If we just want to keep the numeric class components and apply the map function, how do we proceed?

In [24]:
# keep() needs a predicate; ~ is.numeric(.) or function(x) is.numeric(x) would work equally well
mixed_numeric = keep(mixed_list, is.numeric)
map(mixed_numeric, mean)


Remember that dataframes are also lists in which each element is a vector (i.e. a column of the data), so these functions operate column by column when applied to a dataframe.

In [198]:
mtcars %>% map_dbl(sum) %>% keep(~ . > 200) %>% print

    mpg    disp      hp    qsec 
 642.90 7383.10 4694.00  571.16 
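
As a quick check of the "dataframes are lists" claim, the sketch below inspects `mtcars` directly (the names `col_types` and `col_means` are only for illustration):

```r
library(purrr)

is.list(mtcars)                  # TRUE: a data frame is a list of columns
length(mtcars) == ncol(mtcars)   # TRUE: one list element per column

col_types = map_chr(mtcars, class)   # class of each column, named by column
col_means = map_dbl(mtcars, mean)    # one mean per column, named by column
```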


## Q0: Warm-up with variance
The variance of a vector $x$ is calculated as
$$ \sigma^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^n x_i^2}{n-1} - \frac{n}{n-1} \bar{x}^2 $$
Write a function that calculates the variance using `reduce` and `mean`.

In [65]:
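variance = function(x){
    n = length(x)
    # .init = 0 makes sure the first element is squared too;
    # without it, reduce starts from x[1] unsquared
    (reduce(x, function(prev, xi) prev + xi^2, .init = 0) - n*mean(x)^2)/(n-1)
}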
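
One way to sanity-check a solution is to compare it with the built-in `var`. The sketch below does so on a small hand-picked vector. The name `variance_sketch` and the test vector `v` are ours, not part of the lab, and the sum of squares is seeded with `.init = 0` so that every element passes through the squaring step.

```r
library(purrr)

# Hypothetical helper: sum of squares via reduce, then the shortcut formula
variance_sketch = function(x) {
    n = length(x)
    sum_sq = reduce(x, function(acc, xi) acc + xi^2, .init = 0)
    (sum_sq - n * mean(x)^2) / (n - 1)
}

v = c(2, 4, 4, 4, 5, 5, 7, 9)
all.equal(variance_sketch(v), var(v))   # TRUE
```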

## Q1: Working with logicals

### 1.1

Given a logical vector, consider a function that returns `TRUE` if every element is `TRUE`, and returns `FALSE` if any element is `FALSE`. Have this function ignore missing values (don't worry about empty vectors or vectors with all missing values).

Write the function `all_iter` which uses iteration/loops to perform this calculation.

Write the function `all_func` which uses functional programming tools.

In [28]:
all_iter = function(x) {
    for (val in x) {
        if (!is.na(val)) {
            if (!val) {
                return(FALSE)
            }
        }
    }
    return(TRUE)
}

In [1]:
all_func = function(x) {
    # T & F --> F
    # F & F --> F
    # T & T --> T
    x %>% keep(~!is.na(.)) %>% reduce(`&`)
}
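
The `reduce` call in `all_func` passes the operator `&` as a function. Wrapping an operator in backticks lets it be called and passed around like any other two-argument function, as this small sketch illustrates:

```r
library(purrr)

`&`(TRUE, FALSE)                    # same as TRUE & FALSE, i.e. FALSE
reduce(c(TRUE, TRUE, FALSE), `&`)   # (TRUE & TRUE) & FALSE, i.e. FALSE
reduce(c(2, 3, 4), `*`)             # (2 * 3) * 4, i.e. 24
```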

In [2]:
stopifnot(all_iter(c(TRUE, TRUE, TRUE)))
stopifnot(!all_iter(c(TRUE, TRUE, FALSE)))
stopifnot(!all_iter(c(FALSE, FALSE, FALSE)))
stopifnot(all_iter(c(TRUE, TRUE, NA)))
stopifnot(!all_iter(c(TRUE, FALSE, NA)))
stopifnot(all_iter(c(NA, TRUE, NA)))

stopifnot(all_func(c(TRUE, TRUE, TRUE)))
stopifnot(!all_func(c(TRUE, TRUE, FALSE)))
stopifnot(!all_func(c(FALSE, FALSE, FALSE)))
stopifnot(all_func(c(TRUE, TRUE, NA)))
stopifnot(!all_func(c(TRUE, FALSE, NA)))
stopifnot(all_func(c(NA, TRUE, NA)))



### 1.2

Now consider the function that does the opposite: it returns `TRUE` if any of the elements are `TRUE`, and returns `FALSE` if all of the elements are `FALSE`.

Write a function `any_func` that does this. Hint: this function can be written extremely simply given a correct implementation of `all_func` or `all_iter` if you think carefully about what this is doing.

In [5]:
require(tidyverse)
any_func = function(x) {
    !all_func(!x)
}
x = c(FALSE, FALSE, FALSE)
!x
all_func(!x)
!all_func(!x)


In [32]:
stopifnot(any_func(c(TRUE, TRUE, TRUE)))
stopifnot(any_func(c(TRUE, TRUE, FALSE)))
stopifnot(!any_func(c(FALSE, FALSE, FALSE)))
stopifnot(any_func(c(TRUE, TRUE, NA)))
stopifnot(any_func(c(TRUE, FALSE, NA)))
stopifnot(!any_func(c(FALSE, FALSE, NA)))
stopifnot(any_func(c(NA, TRUE, NA)))

Note that `any` and `all` are built-in `R` functions that do the same things as the functions you wrote above, provided you pass `na.rm = TRUE` so that missing values are ignored.
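
A short comparison of the built-ins with and without `na.rm`, for reference (the vectors `x` and `y` are only for illustration):

```r
x = c(TRUE, NA, TRUE)
y = c(FALSE, NA)

all(x)                 # NA: the missing value could be FALSE
all(x, na.rm = TRUE)   # TRUE: missing values are dropped first

any(y)                 # NA: the missing value could be TRUE
any(y, na.rm = TRUE)   # FALSE: missing values are dropped first
```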

### 1.3

Write a function `exists_outlier` that takes a vector and checks whether it contains an outlier. We define an outlier to be a point that is more than 2.5 standard deviations away from the mean.

In [33]:
exists_outlier = function(x) {
    any(abs(x - mean(x)) > 2.5*sd(x))
}

In [34]:
stopifnot(!exists_outlier(c(1, 2, 3, 4, 5)))
stopifnot(exists_outlier(c(rep(1, 20), 5)))

### 1.4

The following code creates a dataframe `dat`. Write a one-liner using `exists_outlier` that prints the variable names of `dat` that contain an outlier.

In [35]:
set.seed(123515)
dat = data.frame(X1=rnorm(30))
for (varn in 2:20) {
    dat[[paste0('X', varn)]] = rnorm(30)
}

In [36]:
names(dat)[map_lgl(dat, exists_outlier)] %>% print

[1] "X1"  "X2"  "X5"  "X14" "X17"


### 1.5
We are going to use the columns of dataframe `dat` that do not have outliers. For each column, calculate the mean of the values above the 50th percentile (the median).

In [58]:
dat5 = dat %>% select(names(dat)[!map_lgl(dat, exists_outlier)])

In [61]:
results = c()
for (x in dat5){
    results = c(results, mean(x[x > median(x)]))
}
results

In [59]:
sol5 = map(map(dat5, function(x) keep(x, ~ . > median(x))), mean) # quantile(x)[3] is same as median(x)
head(sol5)
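
The same calculation can also be written as a single `map_dbl` call that returns a named numeric vector instead of a list. A minimal sketch on a hypothetical two-column data frame `toy` (a stand-in for `dat5`, which depends on the cells above):

```r
library(purrr)

# Hypothetical stand-in for dat5, small enough to check by hand
toy = data.frame(A = c(1, 2, 3, 4), B = c(10, 20, 30, 40))

# For each column, average only the values strictly above the column median
upper_means = map_dbl(toy, ~ mean(.x[.x > median(.x)]))
upper_means   # A = 3.5, B = 35
```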

## Q2: Max and Argmax

### 2.1

Define the function `max_iter` that takes a vector and returns the maximum element using iteration/loops.

Define the function `max_func` that does the same thing using `reduce`. Note: don't use the `max` function; instead, define your own function to pass to `reduce`.

In [211]:
max_iter = function(x) {
    curr = x[1]
    
    for (i in seq_along(x)[-1]) {  # unlike 2:length(x), this is empty for a length-1 vector
        if (x[i] > curr) {
            curr = x[i]
        }
    }
    curr
}


In [212]:
reduce_max = function(curr, newval) {
    if (newval > curr) {
        return(newval)
    } else {
        return(curr)
    }
}

max_func = function(x) {
    reduce(x, reduce_max)
}

In [213]:
test = rnorm(100)
stopifnot(max_iter(test) == max(test))
stopifnot(max_func(test) == max(test))

### 2.2

Define the function `argmax_iter` that takes a vector and returns the index of the maximum element using iteration/loops. If there is a tie, return the index of the first occurrence of the max. Hint: you may need to keep track of multiple things as you iterate.

(Difficult) Define the function `argmax_func` that does the same thing using reduce. Hint: how can you keep track of multiple things through the reduce steps?


In [214]:
argmax_iter = function(x) {
    curr_val = x[1]
    curr_arg = 1
    
    for (i in seq_along(x)[-1]) {  # unlike 2:length(x), this is empty for a length-1 vector
        if (x[i] > curr_val) {
            curr_val = x[i]
            curr_arg = i
        }
    }
    curr_arg
}

In [215]:
reduce_argmax = function(curr, newval) {
    if (newval > curr$val) {
        return(list(val=newval, arg=curr$i+1, i=curr$i+1))
    } else {
        return(list(val=curr$val, arg=curr$arg, i=curr$i+1))
    }
}

argmax_func = function(x) {
    res = reduce(x, reduce_argmax, .init=list(val=-Inf, arg=0, i=0))
    res$arg
}

In [216]:
test = rnorm(100)
stopifnot(argmax_iter(c(1, -3, 2, 6, 3)) == 4)
stopifnot(argmax_iter(test) == which.max(test))
stopifnot(argmax_func(c(1, -3, 2, 6, 3)) == 4)
stopifnot(argmax_func(test) == which.max(test))