## R Things You Should Know
In this notebook, we'll go over fundamentals of R including:
 - Setting up a working directory and importing data
 - Data structures
 - Sampling
 - Control Flow
 - Packages

## Working Directory and Importing Data
 - The working directory (`wd`) is the folder where your `R` session loads and saves files by default
 - To see where your current working directory is, run

In [202]:
getwd()

- To set the working directory to a desired `path`, run

In [207]:
setwd("/insert/path/here")

ERROR: Error in setwd("/insert/path/here"): cannot change working directory
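
The error above appears because the placeholder path does not exist. As a minimal sketch (the variable names `old_wd` and `target` are made up for illustration), you can check that a folder exists before switching to it, and switch back afterwards:

```r
old_wd <- getwd()      # remember where we started

target <- tempdir()    # a folder that always exists for the current session
if (dir.exists(target)) {
  setwd(target)        # only change directory if the folder is really there
}
print(getwd())

setwd(old_wd)          # go back to the original working directory
```

Using `tempdir()` here is only a convenience so the sketch runs anywhere; in practice `target` would be your own project folder.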


 # .csv 
 - csv stands for comma-separated values, a very common format for storing data.
 - You can import a csv file into R using `read_csv` from the `readr` package (loaded as part of the `tidyverse`).

In [206]:
read_csv('some_data.csv')

Parsed with column specification:
cols(
  x = col_integer(),
  y = col_integer()
)


x,y
1,2
2,3
3,4


You can also use write_csv to save a table of data to a csv file, which is created in your working directory (an existing file with the same name is overwritten). Here we use data.frame to create a table; we'll talk more about tables later.

In [282]:
write_csv(data.frame(a = c(1, 2, 3),
                 b = c(2, 3, 4)),
         'new_data.csv')

Try reading in the file you just wrote.

In [283]:
#use read_csv to read in new_data.csv
# START
read_csv('new_data.csv')
# END

Parsed with column specification:
cols(
  a = col_integer(),
  b = col_integer()
)


a,b
1,2
2,3
3,4
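
If the `readr` package is not available, base R offers `write.csv` and `read.csv`, which do a similar job. The sketch below is an alternative to the cells above, not the method used in this notebook; it writes to a temporary folder (an assumption made so it runs anywhere) and reads the file back:

```r
# Build a small table and choose a file location in the session's temp folder
df <- data.frame(a = c(1, 2, 3), b = c(2, 3, 4))
tmp_file <- file.path(tempdir(), "new_data.csv")

# row.names = FALSE stops write.csv from adding an extra column of row numbers
write.csv(df, tmp_file, row.names = FALSE)

# Read the file back in and check that it matches what we wrote
df_back <- read.csv(tmp_file)
print(df_back)
```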


# Data and Data structures

## Strings

 - Strings are what computer science people call text
 - A string can be written in either double quotes (`""`) or single quotes (`''`)



In [284]:
s <- "This is a valid string"
s

 - And you can reassign to the same variable

In [285]:
s <- 'and so is this'
s

# Vectors

## Vectors: `c()`
 - Vectors are the building blocks of `R`: even a single value is actually
 a vector of length 1 (all elements of an "atomic" vector share one type)
 - Vectors in `R` are created by *c*oncatenating a series of elements
 - We saw lots of vectors in the birthday problem notebook!
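
As a quick refresher, here is a minimal sketch of building vectors with `c()`. The variable names are made up for illustration:

```r
# Combine individual values into a numeric vector
ages <- c(21, 35, 48)

# c() works on strings too
names_vec <- c("Ann", "Bob")

# c() can also join an existing vector with more values
combined <- c(ages, 60, 72)

print(ages)
print(names_vec)
print(length(combined))  # number of elements in the combined vector
```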

## Vectors: `seq()`
- Create a vector from a sequence with `seq(from, to, by=1)`

In [189]:
print(seq(1, 10))

 [1]  1  2  3  4  5  6  7  8  9 10


In [218]:
print(seq(1, 10, 2))

[1] 1 3 5 7 9


- Use short-hand `from:to` if you're incrementing by one

In [219]:
print(1:10)

 [1]  1  2  3  4  5  6  7  8  9 10


## Vectors: `rep()`
- Use `rep()` to repeat values

In [220]:
print(rep(13, 4))

[1] 13 13 13 13


In [222]:
print(rep('Yes!', 3))

[1] "Yes!" "Yes!" "Yes!"


In [223]:
print(rep(c('Sat.', 'Sun.'), 2))

[1] "Sat." "Sun." "Sat." "Sun."


# Exercise
- Make a vector using seq() and rep() which would be equivalent to (10,20,30,10,20,30,10,20,30)

In [290]:
#START
print(rep(seq(10,30,by=10), 3))
#END

[1] 10 20 30 10 20 30 10 20 30


## Vectors: Indexing
- We saw vector indexing in the birthday problem notebook, but let's review.
- Use square brackets (`[]`) to index a vector (indices start at 1, not 0)
    - Indexing out-of-bounds returns a special value called `NA`, *does NOT*
      fail

In [127]:
X <- c(10, 11, 12, 13)

In [128]:
print(X[1])

[1] 10


In [129]:
print(X[4])

[1] 13


In [130]:
print(X[5])  # Does NOT fail; but returns NA

[1] NA


## Vectors: Indexing (cont'd)
- Negative indexing is used to exclude elements

In [252]:
print(X[-1])

[1] 11 12 13


- Index multiple objects by indexing with a vector

In [253]:
ind <- c(2, 4)
print(X[ind])

[1] 11 13


## Vectors: Re-assignment with Indices
- Replace elements by re-assigning with index

In [254]:
X[1] <- 101
print(X)

[1] 101  11  12  13


- Replace multiple elements as well

In [255]:
X[2:3] <- c(22, 33)
print(X)

[1] 101  22  33  13


## Vectors: Add Elements by Index
- Add new elements to a vector by assigning to an index past its current end

In [293]:
print(X[5])

[1] NA


In [294]:
X[5] <- 555
print(X[5])

[1] 555


# Exercise
The below vector is supposed to contain all the days of the week, but two of them are messed up.  Write code to reassign the missing days using re-assignment with indices.

In [301]:
days_of_the_week <- c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 5, 6, 'Sunday')

In [302]:
#Replace the numbers with the correct days
#START
days_of_the_week[5:6] <- c('Friday', 'Saturday')
print(days_of_the_week)
#END

[1] "Monday"    "Tuesday"   "Wednesday" "Thursday"  "Friday"    "Saturday" 
[7] "Sunday"   


# Sampling

- In the birthday problem notebook, we used sample() to sample from a vector of numbers.
        sample(1:366, 30, replace = TRUE)
- Sometimes we instead want to generate samples from known distributions
    - e.g., simulating 1,000 coin flips and counting the number of heads
- For this, we can use the family of `r`*`dist`*`()` functions, where *`dist`*
  is replaced with the desired distribution
  (e.g., `unif`orm, `norm`al, `pois`son)

In [225]:
print(runif(n = 5)) # 5 samples from Unif(0, 1)

[1] 0.8217362 0.4133482 0.2749151 0.9316357 0.5912007


In [226]:
print(rnorm(n = 5))  # 5 samples from Norm(0, 1)

[1]  1.7609726 -0.5051783  1.3015926  0.1025125  0.4249084


In [227]:
print(rpois(n = 5, 1))  # 5 samples from Poisson(1)

[1] 0 1 1 1 0


In [228]:
print(rexp(n = 5))  # 5 samples from Exp(1)

[1] 0.755312 1.315421 1.532311 2.353120 1.113656


- Distribution parameters can be specified as arguments, e.g.

In [229]:
# 7 samples from a Norm(20, 5) distribution
print(rnorm(n = 7, mean = 20, sd = 5))

[1] 26.65328 25.37107 23.78810 20.11862 24.46719 27.37631 20.32917
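
Returning to the coin-flip example mentioned earlier, here is one way it could be sketched. The seed value and variable names are arbitrary choices for illustration:

```r
set.seed(42)  # arbitrary seed so the run is reproducible

# Approach 1: sample() from the two possible outcomes, then count heads
flips <- sample(c("H", "T"), size = 1000, replace = TRUE)
n_heads <- sum(flips == "H")
print(n_heads)

# Approach 2: a single draw from Binomial(1000, 0.5) counts the heads directly
n_heads_binom <- rbinom(n = 1, size = 1000, prob = 0.5)
print(n_heads_binom)
```

Both counts should land near 500, though the exact values depend on the seed.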


- In general, you can bring up a help page for any function which comes from a library or base R by typing ?function_name

In [230]:
?rnorm

Normal {stats}    R Documentation

Arguments
x, q          vector of quantiles.
p             vector of probabilities.
n             number of observations. If length(n) > 1, the length is taken to be the number required.
mean          vector of means.
sd            vector of standard deviations.
log, log.p    logical; if TRUE, probabilities p are given as log(p).
lower.tail    logical; if TRUE (default), probabilities are P[X ≤ x] otherwise, P[X > x].


 - The help page tells us that qnorm is the quantile function.

In [261]:
qnorm(.975, mean = 0, sd = 1)

This means that the 97.5th percentile of the normal distribution with mean 0 and sd 1 is at about 1.96.
 - The help page also tells us that pnorm is the distribution function.

In [245]:
pnorm(1)

This means that about 84% of draws from a normal distribution with mean 0 and sd 1 will be less than 1. This squares with the rule of thumb that 68% of draws from N(0, 1) fall within 1 standard deviation of the mean: the remaining 32% is split evenly between the two tails, so (1 - .68)/2 = .16 of draws are more than 1 standard deviation above the mean, and 1 - .16 = .84 are below that point.
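
The arithmetic above can be checked directly with `pnorm`. This is a small sketch with made-up variable names:

```r
p_below_1    <- pnorm(1)               # P(Z < 1)
p_within_1sd <- pnorm(1) - pnorm(-1)   # P(-1 < Z < 1), the "68%" rule of thumb
p_above_1    <- 1 - pnorm(1)           # P(Z > 1), the upper tail

# Roughly 0.84, 0.68 and 0.16
print(round(c(p_below_1, p_within_1sd, p_above_1), 2))

# qnorm undoes pnorm: this recovers the value 1 we started from
print(qnorm(p_below_1))
```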

# Exercise
Let's say we are ice cream researchers measuring the effect of a treatment where we bombard people with advertising which talks about how delicious ice cream is.  Say we measure our control group eating on average 3 ice creams a day, with a standard error of .5.  Then say our treatment group eats on average 4 ice creams a day, with a standard error of .7.  Give an 80% confidence interval on the effect of our advertising treatment.

Remember: when you take the difference between two independent normals with means a and c and standard deviations b and d, N(a, b) - N(c, d), the result is N(a - c, sqrt(b^2 + d^2)).
So our estimate of the difference in how much ice cream people eat (the effect of our treatment) is a normal with mean = (4 - 3) and standard deviation = sqrt(.5^2 + .7^2).

In [304]:
#START
qnorm(.1, mean = 1, sd = sqrt(.5^2 + .7^2)) # lower end
qnorm(.9, mean = 1, sd = sqrt(.5^2 + .7^2)) # upper end
#END
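
One way to convince yourself of the difference-of-normals rule is to simulate it. The sketch below assumes the two group estimates are independent and treats each standard error as the sd of a normal; the seed and variable names are arbitrary:

```r
set.seed(1)       # arbitrary seed for reproducibility
n_sim <- 100000

# Simulated estimates for each group
control   <- rnorm(n_sim, mean = 3, sd = 0.5)
treatment <- rnorm(n_sim, mean = 4, sd = 0.7)
effect    <- treatment - control

# Compare the simulated spread with the formula sqrt(b^2 + d^2)
sim_sd    <- sd(effect)
theory_sd <- sqrt(0.5^2 + 0.7^2)
print(c(simulated = sim_sd, theory = theory_sd))

# Compare the simulated 10th/90th percentiles with the qnorm answer
sim_ci    <- quantile(effect, probs = c(0.1, 0.9))
theory_ci <- qnorm(c(0.1, 0.9), mean = 1, sd = theory_sd)
print(sim_ci)
print(theory_ci)
```

The simulated and theoretical values should agree to about two decimal places.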

# Custom functions and control

## Control Flow
- We saw `if` statements inside of loops in the birthday problem, but we can use them outside of loops too!

In [50]:
if (condition) {
  # stuff to do when condition is TRUE
} else if(other_condition) {  # (OPTIONAL)
  # stuff to do if other_condition is TRUE
} else {  # (OPTIONAL)
  # stuff to do if all other conditions are FALSE
}

ERROR: Error in eval(expr, envir, enclos): object 'condition' not found
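
The template above fails only because `condition` is a placeholder. Here is a concrete sketch of the same shape, using a made-up `temperature` variable:

```r
temperature <- 31  # hypothetical value, in degrees Celsius

if (temperature > 30) {
  advice <- "ice cream weather"
} else if (temperature > 15) {
  advice <- "light jacket"
} else {
  advice <- "stay inside"
}

print(advice)
```

Only the first branch whose condition is `TRUE` runs, so with `temperature <- 31` the later branches are skipped.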


## Loops 

 - We saw one type of loop in the birthday problem

In [168]:
counter <- 0

for (i in 1:5) {
    counter <- counter + 1

    print(counter)
    
    if (counter >= 3) {
        print("Counter is now bigger than or equal to 3!")
    }
}

[1] 1
[1] 2
[1] 3
[1] "Counter is now bigger than or equal to 3!"
[1] 4
[1] "Counter is now bigger than or equal to 3!"
[1] 5
[1] "Counter is now bigger than or equal to 3!"


 - But there's another type of loop as well, the while loop

In [169]:
counter <- 0

while (counter < 5){
    counter <- counter + 1 
    
    print(counter)
    
    if (counter >=3){
        print('Counter is now bigger than or equal to 3!')
    }
}

[1] 1
[1] 2
[1] 3
[1] "Counter is now bigger than or equal to 3!"
[1] 4
[1] "Counter is now bigger than or equal to 3!"
[1] 5
[1] "Counter is now bigger than or equal to 3!"


 - Be careful that the condition in `while (condition){` eventually becomes false, otherwise the loop never ends!
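
When it is not obvious how long a `while` loop will run, one common safeguard is an iteration cap. The sketch below (the cap `max_iter`, the threshold and the seed are arbitrary choices) keeps drawing uniform numbers until one exceeds 0.99, but gives up after `max_iter` tries:

```r
set.seed(7)         # arbitrary seed for reproducibility
max_iter <- 10000   # safety cap so the loop cannot run forever
n_draws  <- 0
draw     <- 0

# Stop when a draw exceeds 0.99 OR when the cap is reached
while (draw <= 0.99 && n_draws < max_iter) {
  draw    <- runif(1)
  n_draws <- n_draws + 1
}

print(n_draws)
```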

## User Defined Functions
- Write your own functions in the form

In [170]:
name_of_function <- function(arguments) {
  # do some stuff with arguments
  return(result)
}

- You can use your functions like any other function, e.g.,

In [171]:
name_of_function(arguments)  # gives you the 'result'

ERROR: Error in name_of_function(arguments): object 'result' not found


For example, here's a function add_nums() which adds its arguments, along with a call to it.

In [308]:
add_nums <- function(x, y){
    s <- x + y
    return(s)
}
add_nums(3, 4)

## User Defined Functions: Exercise
- Write a function that takes a vector of three side lengths in $\mathbb{R}^3$ (for example, `c(1, 3, 5)`) and tells you
whether they can form a triangle (i.e., return `TRUE` if a triangle can be made and 
`FALSE` otherwise).
- Hint: you can make a triangle from three side lengths if and only if no side length is greater than the sum of the other two.

In [309]:
is_good <- function(vec) {
  for (i in 1:3) {
    # Check if element i is greater than 
    # sum of other two elements
    if (vec[i] > sum(vec[-i])) {
      return(FALSE)
    }
  }
  return(TRUE)
}
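
To see the function in action, here is a short usage sketch. It repeats the definition so it can run on its own; the test vectors are made up:

```r
is_good <- function(vec) {
  for (i in 1:3) {
    # A side longer than the other two combined rules out a triangle
    if (vec[i] > sum(vec[-i])) {
      return(FALSE)
    }
  }
  return(TRUE)
}

ok  <- is_good(c(3, 4, 5))  # TRUE: no side exceeds the sum of the other two
bad <- is_good(c(1, 3, 5))  # FALSE: 5 is greater than 1 + 3
print(c(ok, bad))
```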

## `replicate`
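- Explicit loops in `R` are often slow compared with vectorized code, so avoid them when a vectorized alternative exists
- Vectorize operations whenever possible, i.e., apply a function to a whole vector at once instead of element by element.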
- `replicate` can be used to repeat some operation (function),
  and collect the results[^apply]
- e.g., to run `some_function()` 1,000 times and collect the results:

In [310]:
replicate(1000, some_function())  

ERROR: Error in some_function(): could not find function "some_function"
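
The call above fails only because `some_function` is a placeholder. As a concrete sketch, the hypothetical `mean_of_draws()` below stands in for it:

```r
set.seed(123)  # arbitrary seed for reproducibility

# Hypothetical stand-in for some_function(): the mean of 10 uniform draws
mean_of_draws <- function() {
  mean(runif(10))
}

# Run it 1,000 times; replicate() collects the results into a vector
results <- replicate(1000, mean_of_draws())

print(length(results))
print(mean(results))  # should be close to 0.5
```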


[^apply]: `replicate` is a convenient wrapper around `sapply`, one of the more general
`apply` functions. See the documentation for details.


# Exercise

## The Question
> You are given three sticks, each of a random length between 0 and 1.

> What's the probability you can make a triangle?

- The answer is 1/2
- Use `R` to run 100,000 simulations and estimate the answer:
    1. Generate 100,000 triplets of uniform (0, 1) random variables
    2. Find the proportion that can be made into a triangle (hint: use the
    `is_good` function)

## Part 1: You're allowed to use a 'for loop' if you want!

In [312]:
#START
N <- 100000
m <- 0
for (i in 1:N) {
  X <- runif(3)
  if (is_good(X)) {
      m <- m + 1
  }
}
m/N
#END

## Part 2: Try doing it without a loop!

In [313]:
#START
N <- 100000
m <- replicate(N, is_good(runif(3)))
sum(m)/N
#END
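
For comparison, the same estimate can be computed with no loop and no `replicate` at all. This is a sketch of one possible vectorized approach, not the workshop's official solution; it puts all the sticks in a matrix (one row per triplet) and applies the triangle rule to every row at once:

```r
set.seed(2024)  # arbitrary seed for reproducibility
N <- 100000

# One row per simulation, one column per stick
sticks <- matrix(runif(3 * N), ncol = 3)

# Longest stick in each row, found element-wise across the three columns
longest <- pmax(sticks[, 1], sticks[, 2], sticks[, 3])

# A triangle is possible when the longest stick is no longer than
# the other two combined (row total minus the longest stick)
can_make <- longest <= rowSums(sticks) - longest

estimate <- mean(can_make)  # proportion of TRUE values
print(estimate)
```

Taking the `mean` of a logical vector gives the proportion of `TRUE` values, which is why no explicit counter is needed.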

# Packages

## Installing `R` Packages
- `R` has many (*MANY*) packages created by other users that implement
state-of-the-art tools (e.g., data manipulation, statistical models)
- These packages can be downloaded from the Comprehensive R Archive Network (CRAN)
- This is as simple as running a single line of code:

In [187]:
install.packages("package name")

Installing package into ‘/home/willcai/R/x86_64-pc-linux-gnu-library/3.4’
(as ‘lib’ is unspecified)
“package ‘package name’ is not available (for R version 3.4.4)”

- You may be asked to select one of many CRAN mirrors (copies across different
servers) from which to download the package
- For example, to install the `tidyverse` package, run

In [188]:
install.packages("tidyverse")

Installing package into ‘/home/willcai/R/x86_64-pc-linux-gnu-library/3.4’
(as ‘lib’ is unspecified)


- You only need to do this *once* for each machine


## Loading Packages
- Once you've installed a package on a machine, you can load the package into
your current workspace with the `library()` command
- For example, to use the `tidyverse` package, first load it with

In [None]:
library("tidyverse")

- You can also use specific functions from a package without loading it,
  by telling `R` which package the function belongs to with a namespace prefix,
  `package_name::`.
- For example, to use the `round_any()` function from the `plyr` package,
  without actually loading `plyr`, write

In [None]:
# Assuming plyr is installed
plyr::round_any(137, 10)  # round 137 to the nearest multiple of 10

## Namespace collision
- One of the (unfortunately many) things that `R` is bad at is preventing
  namespace collisions
- For example, the packages `plyr` and `dplyr`[^dplyr] have functions that are
  named the same (e.g., `mutate()`, `summarize()`), and if you ever load both,
  `R` will only "see" the function belonging to the package you loaded later
- So beware of what packages you load, and if you only intend to use a function
  or two, consider just specifying the namespace with `::`, instead of loading
  the whole package.

[^dplyr]: `dplyr` is part of the tidyverse, and is loaded when you load `tidyverse`
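
The same masking effect can be shown without installing anything. In the sketch below, a user-defined function named `sd` (a deliberately bad idea, used only for illustration) hides the `sd()` from the built-in `stats` package, and the `::` prefix still reaches the original:

```r
# A user-defined function that happens to share its name with stats::sd()
sd <- function(x) {
  "my own sd"
}

masked   <- sd(c(1, 2, 3))         # calls the definition above
explicit <- stats::sd(c(1, 2, 3))  # explicitly calls the stats version

print(masked)
print(explicit)

rm(sd)  # remove our definition so sd() means stats::sd() again
```

This is an analogy rather than the exact `plyr`/`dplyr` situation: there, one package's function masks another's, but the rule that the most recently defined name wins is the same.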

# Helpful stuff we won't go over during the workshop but will leave here for later reference!

## Vector Operations

In [262]:
X <- 1:4

In [263]:
print(X)

[1] 1 2 3 4


In [264]:
print(X + X)  # element-wise summation

[1] 2 4 6 8


In [265]:
print(X - X)  # element-wise subtraction

[1] 0 0 0 0


## Vector Operations (cont'd)

In [266]:
print(X)

[1] 1 2 3 4


In [267]:
print(X^3)   # element-wise exponentiation

[1]  1  8 27 64


In [268]:
print(X * X)  # element-wise multiplication

[1]  1  4  9 16


In [270]:
X %*% X  # dot (inner) product

     [,1]
[1,]   30


## Vector comparisons
- Comparisons are all done element-wise

In [271]:
print(c(1, 2, 3) == c(1, 2, 4))

[1]  TRUE  TRUE FALSE


In [272]:
print(c(1, 2, 3) < c(1, 2, 4))

[1] FALSE FALSE  TRUE


In [273]:
print(c(1, 2, 3) >= c(1, 2, 4))

[1]  TRUE  TRUE FALSE


- Note the double equal sign for comparing equality (a single `=` would be assignment!)

## Helpful Vector Functions

In [274]:
X <- 1:4
mean(X)                # mean
sd(X)                  # standard deviation
var(X)                 # variance
max(X)                 # maximum
min(X)                 # minimum
median(X)              # median
sum(X)                 # sum
prod(X)                # product
quantile(X,probs=0.5)  # quantile for specified probs
length(X)              # length of the vector
range(X)               # range (minimum and maximum)

# Built-in functions

## Some more built-in functions
- We've already seen many built-in functions, but here are some more!

In [275]:
print(log(X))   # element-wise log

[1] 0.0000000 0.6931472 1.0986123 1.3862944


In [276]:
print(exp(X))   # element-wise exponential

[1]  2.718282  7.389056 20.085537 54.598150


In [277]:
print(sqrt(X))  # element-wise square root

[1] 1.000000 1.414214 1.732051 2.000000


## Functions for Strings

In [153]:
# concatenate two (or more) strings
paste('one plus one equals', 1+1, '!')

In [154]:
# specify a separator
paste('one plus one', 1+1, sep='=')

## Functions for Strings (cont'd)
- Often, we want to concatenate strings with no spaces
(e.g., when constructing filenames/paths at run-time)

In [156]:
# short-hand for concatenation w/o spaces
filename = 'some_file_name.csv'
paste0('path/to/', filename)
# function specifically for constructing file paths
file.path('path', 'to', filename)

## Functions for Strings (cont'd)
- To enforce upper/lower cases

In [161]:
s <- 'SoMe CraZY STRING'
tolower(s)

In [162]:
toupper(s)

## Generic Functions
- Some functions for exploring objects

In [163]:
obj <- 1:100
head(obj, n=5)  # display first n elements of obj

In [164]:
tail(obj, n=5)  # display last n elements of obj

## Generic Functions (cont'd)

In [166]:
str(obj)  # display structure of obj

 int [1:100] 1 2 3 4 5 6 7 8 9 10 ...


In [167]:
summary(obj)  # display summary of obj

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00   25.75   50.50   50.50   75.25  100.00 