WEEK 1: Fundamentals
Introduction to RStudio
- Orientation
- RStudio configuration
- (Optional) Workstation configuration
- Workflow in RStudio
Introduction to R
- Mathematical expressions
- Built-in functions
- Comparing things
- Variables and assignment
- Vectorization
- Managing your environment
- Built-in data sets
- R Packages
Project management with RStudio
- General file management
- Create projects with RStudio
Seeking help
- Basic help syntax
- Help file format
- Special operators
- Library examples
- What if you don't know where to start?
Data structures
- R stores "atomic" data as vectors
- Every vector has a type
- Vectors and type coercion
- Challenge 1: Generate and label a vector
- Matrices
- (Optional) Factors
- Data Frames are central to working with tabular data
- Lists
- (Optional) Challenge 2: Creating matrices
Exploring data frames
- Adding columns
- Appending rows (remember, rows are lists!)
- Removing missing data
- Working with realistic data
- Challenge 3: New gapminder data frame
Subsetting data
- Subset by index
- Subset by name
- Subsetting matrices
- (Optional) Extracting list elements
- Subsetting by logical operations
- (Optional) Subset by factor
- Subsetting Data Frames
- Challenge 4: Extract data by region
WEEK 2: Building Programs in R
Control flow
- Conditionals
- Review Subsetting section
- Iteration
Vectorization
- Vector operations are element-wise by default
- Vectors of unequal length are recycled
- Logical comparisons
- Matrix operations are also element-wise by default
- Linear algebra uses matrix multiplication
- Challenge 5: Sum of squares
Higher-order functions
- apply(): Apply a function over the margins of an array
- lapply(): Apply a function over a list, returning a list
- sapply(): Apply a function polymorphically over list, returning vector, matrix, or array as appropriate
- Use apply and friends to extract nested data from a list
- (Optional) Convert nested list into data frame
Functions explained
- Defining a function
- Combining functions
- Most functions work with collections
- Defensive programming
- Working with rich data
- Challenge 6: Testing and debugging your function
Reading and writing data
- Create sample data sets and write them to the `processed` directory
- How to find files
- Read files using a for loop
- Read files using apply
- Concatenate list of data frames into a single data frame
WEEK 3: Tidyverse
Data frame manipulation with dplyr
- Orientation
- Select data frame variables
- Filter data frames by content
- (Optional) Challenge 7: Filter
- Group rows
- Summarize grouped data
- Use group counts
- Mutate the data to create new variables
- Add conditional filtering to a pipeline with ifelse
- Challenge 8: Life expectancy in random countries
Data frame manipulation with tidyr
- Gapminder data
- Wide to long with pivot_longer()
- Long to intermediate with pivot_wider()
- Long to wide with pivot_wider()
Additional tidyverse libraries
- Reading data with readr
- String processing with stringr
- Functional programming with purrr
(Optional) Database interfaces
- Data frame joins with dplyr
- Access databases using dplyr
Endnotes
Credits
References
Data Sources

WEEK 1: Fundamentals

Introduction to RStudio

Orientation

R was created by statisticians for statisticians (and other researchers)
R contains multitudes; this can be good and bad

RStudio configuration

Configuration menu

PC/Linux: Tools > Global Options
MacOS: RStudio > Preferences or Tools > Global Options

Helpful configuration settings

General > Basic
- Don't save or restore .RData
Code > Editing
- Use native pipe operator
- Ctrl+Enter executes single line (or Multi-line R statement)
Code > Display
- Rainbow parentheses
Appearance: Adjust font and syntax colors
Pane Layout: Move IDE panes

(Optional) Workstation configuration

By default, your view of your file system will be opaque. We want to make it transparent (e.g. you may have a local Desktop and a cloud Desktop folder).

Mac OS Finder > Preferences

Your local Desktop folder is in your Home directory.

General
1. New finder window shows: /Users/<home>
Sidebar
1. Favorites: /Users/<home>
2. iCloud: iCloud Drive
3. Locations: <computer name>, Cloud Storage
Advanced
1. Show all filename extensions
2. Keep folders on top (all)

Windows System > File Explorer

Your local Desktop folder is in your Home directory or Computer directory.

File > Change folder and search options > View
1. Files and Folders
  1. Show hidden files, folders, and drives
  2. Hide protected operating system files
  3. Uncheck Hide extensions for known file types
2. Navigation Pane
  1. Show all folders
View
1. File name extensions

Workflow in RStudio

Set working directory
Test code snippets in the R console [REPL]
```
print("hello")
```
Create an .R script in the working directory
```
print("hello")
```
Run the script
1. Keyboard shortcut
  - Windows/Linux: Control-Enter
  - MacOS: Command-Enter
2. Run button
3. Highlight and run lines
Source the script to reduce console clutter and make contents available to other scripts
- https://stackoverflow.com/a/24418219
- https://support.rstudio.com/hc/en-us/articles/200484448-Editing-and-Executing-Code
Insert assignment arrow <-
- MacOS: Option -
- Windows/Linux: Alt -
- Good customization: Control -
Break execution if console hangs
1. Windows: ESC
2. MacOS/Linux: Control-c
Clear console
1. RStudio: C-l
2. Emacs: C-c M-o / M-x comint-clear-buffer
Comment/Uncomment code
- MacOS: Command-/

Introduction to R

A whirlwind tour of R fundamentals

Mathematical expressions

1 + 100
(3 + 5) * 2  # operator precedence
5 * (3 ^ 2)  # powers
2/10000      # outputs 2e-04
2 * 10^(-4)  # 2e-04 explicated

Built-in functions

Some functions need inputs ("arguments")

getwd()      # no argument required
sin(1)       # requires arg
log(1)       # natural log

RStudio has auto-completion
```
log...
```
Use help() to find out more about a function
```
help(exp)
exp(0.5)    # e^(1/2)
```

Comparing things

Basic comparisons
```
1 == 1
1 != 2
1 < 2
1 <= 1
```

Use all.equal() for floating point numbers

all.equal(3.0, 3.0)        # TRUE
all.equal(2.99, 3.0)       # 2 places: reports the mean relative difference
all.equal(2.99999999, 3.0) # 8 places: TRUE
2.99999999 == 3.0          # 8 places: FALSE
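
all.equal() returns TRUE when two numbers agree within a tolerance (1.5e-8 by default). Otherwise it returns a character string describing the difference, so it is usually wrapped in isTRUE() before being used in a condition. A minimal sketch; the variable names `a`, `b`, and `result` are illustrative only:

```r
# Floating point arithmetic can introduce tiny rounding errors
a <- 0.1 + 0.2
b <- 0.3

a == b                    # FALSE: exact comparison sees the rounding error
isTRUE(all.equal(a, b))   # TRUE: equal within the default tolerance

# When the values differ, all.equal() returns a message rather than FALSE
result <- all.equal(2.99, 3.0)
is.character(result)      # TRUE
isTRUE(result)            # FALSE, so this form is safe inside if ()
```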

Variables and assignment

R uses the assignment arrow (C-c C-= in ESS)

# Assign a value to the variable name
x <- 0.025

You can inspect a variable's value in the Environment tab or by evaluating it in the console
```
# Evaluate the variable and echo its value to the console
x
```
Variables can be re-used and re-assigned
```
log(x)
x <- 100
x <- x + 1
y <- x * 2
```

Use a standard naming scheme for your variables

r.style.variable <- 10
python_style_variable <- 11
javaStyleVariable <- 12

Vectorization

Vectorize all the things! This makes idiomatic R very different from most programming languages, which use iteration ("for" loops) by default.

# Create a sequence 1 - 5
1:5

# Raise 2 to the Nth power for each element of the sequence
2^(1:5)

# Assign the resulting vector to a variable
v <- 1:5
2^v

Managing your environment

ls()             # List the objects in the environment
ls               # Echo the contents of ls(), i.e. the code
rm(x)            # Remove the x object
rm(list = ls())  # Remove all objects in environment

Note that parameter passing (=) is not the same as assignment (<-) in R!
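
One way to see the difference is to check which names exist after a call. The sketch below runs both forms in a fresh environment (the names `env`, `created_by_equals`, and `created_by_arrow` are made up for this illustration) so that objects created earlier in the session do not affect the result:

```r
# Work in a fresh environment so earlier objects do not interfere
env <- new.env()

# `=` inside a call only matches the argument named x; nothing is created
evalq(mean(x = c(1, 2, 3)), env)
created_by_equals <- exists("x", envir = env, inherits = FALSE)

# `<-` inside a call first assigns x, then passes its value to mean()
evalq(mean(x <- c(1, 2, 3)), env)
created_by_arrow <- exists("x", envir = env, inherits = FALSE)

created_by_equals   # FALSE
created_by_arrow    # TRUE
```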

Built-in data sets

data()

R Packages

"Package" and "library" are roughly interchangeable.

Install additional packages

install.packages("tidyverse")
## install.packages("rmarkdown")

Activate a package for use
```
library("tidyverse")
```

Project management with RStudio

General file management

See /scripts/curriculum.Rmd

project_name
├── project_name.Rproj
├── README.md
├── script_1.R
├── script_2.R
├── data
│   ├── processed
│   └── raw
├── results
└── temp
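
The directory skeleton above can also be created from R. This is a sketch only: it builds the folders inside a temporary directory (via tempdir()) so that nothing in your real file system is touched, and the variable name `project` is an assumption for the example.

```r
# Build the skeleton in a temporary location
project <- file.path(tempdir(), "project_name")

# recursive = TRUE also creates the missing parent folders
dir.create(file.path(project, "data", "raw"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "data", "processed"), showWarnings = FALSE)
dir.create(file.path(project, "results"), showWarnings = FALSE)
dir.create(file.path(project, "temp"), showWarnings = FALSE)

# List what was created
list.dirs(project, full.names = FALSE)
```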

Create projects with RStudio

File > New Project
Create in existing Folder
If you close RStudio and double-click Rproj, RStudio will open to the project location and set the working directory.

Seeking help

Basic help syntax

help(write.csv)
?write.csv

Help file format

Description
Usage
Arguments
Details
Examples (highlight and run with C-Enter)

Special operators

help("<-")

Library examples

vignette("dplyr")

What if you don't know where to start?

RStudio autocomplete
Fuzzy search
```
??set
```
Browse by topic: https://cran.r-project.org/web/views/

Data structures

R stores "atomic" data as vectors

There are no scalars in R; everything is a vector, even if it's a vector of length 1.

v <- 1:5

length(v)
length(3.14)

Every vector has a type

There are 5 basic (vector) data types: double, integer, complex, logical and character.

typeof(v)
typeof(3.14)
typeof(1L)
typeof(1+1i)
typeof(TRUE)
typeof("banana")

Vectors and type coercion

A vector must be all one type. If you mix types, R will perform type coercion. See coercion rules in scripts/curriculum.Rmd
```
c(2, 6, '3')
c(0, TRUE)
```
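
Coercion always moves toward the more flexible type. Leaving complex numbers aside, the order is logical, integer, double, character. A short sketch of that ordering, using made-up values:

```r
# Each mix is promoted to the more flexible of the two types
typeof(c(TRUE, 1L))      # "integer"
typeof(c(1L, 2.5))       # "double"
typeof(c(2.5, "a"))      # "character"

# Converting back can fail: strings that are not numbers become NA (with a warning)
as.numeric(c("1", "2", "three"))

# Logical values behave as 0 and 1, which makes counting easy
sum(c(TRUE, FALSE, TRUE))   # 2
```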

You can change vector types

# Create a character vector
chr_vector <- c('0', '2', '4')
str(chr_vector)

# Use it to create a numeric vector
num_vector <- as.numeric(chr_vector)

# Show the structure of the collection
str(num_vector)

There are multiple ways to generate vectors

# Two options for generating sequences
1:10
seq(10)

# The seq() function is more flexible
series <- seq(1, 10, by=0.1)
series

Get information about a collection

# Don't print everything to the screen
length(series)
head(series)
tail(series, n=2)

# You can add informative labels to most things in R
names(v) <- c("a", "b", "c", "d", "e")
v
str(v)

Get an item by its position or label
```
v[1]
v["a"]
```
Set an item by its position or label
```
v[1] <- 4
v
```

(Optional) New vectors are empty by default

# Vectors are logical by default
vector1 <- vector(length = 3)
vector1

# You can specify the type of an empty vector
vector2 <- vector(mode="character", length = 3)
vector2
str(vector2)

Challenge 1: Generate and label a vector

See /scripts/curriculum.Rmd

Matrices

A matrix is a 2-dimensional vector

# Create a matrix of zeros
mat1 <- matrix(0, ncol = 6, nrow = 3)

# Inspect it
class(mat1)
typeof(mat1)
str(mat1)

Some operations act as if the matrix is a 1-D wrapped vector

mat2 <- matrix(1:25, nrow = 5, byrow = TRUE)
str(mat2)
length(mat2)
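
The "wrapped vector" view matters when you use a single index: R stores matrices column by column, even when they were filled with byrow = TRUE. A small sketch with a hypothetical 2 x 3 matrix named `mat`:

```r
# Fill a 2 x 3 matrix row by row
mat <- matrix(1:6, nrow = 2, byrow = TRUE)
mat
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6

# A single index walks down the columns (column-major order)
mat[2]           # 4: row 2, column 1
as.vector(mat)   # 1 4 2 5 3 6

# dim() holds the shape; length() is the product of the dimensions
dim(mat)         # 2 3
length(mat)      # 6
```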

(Optional) Factors

Factors represent unique levels (e.g., experimental conditions)

coats <- c("tabby", "tortoise", "tortoise", "black", "tabby")
str(coats)

# The representation has 3 levels, some of which have multiple instances
categories <- factor(coats)
str(categories)

R treats the first factor level as the baseline, and by default levels are sorted alphabetically, so you may need to reorder the levels so that they make sense for your variables

## "control" should be the baseline, regardless of trial order
trials <- c("manipulation", "control", "control", "manipulation")

trial_factors <- factor(trials, levels = c("control", "manipulation"))
str(trial_factors)
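
A case where the alphabetical default gives an unhelpful order makes the point more visible. The sketch below uses made-up size labels; levels() shows the order and as.integer() shows the integer codes stored underneath.

```r
sizes <- c("small", "large", "medium", "small")

# Default: levels are sorted alphabetically
default_factor <- factor(sizes)
levels(default_factor)       # "large" "medium" "small"

# Explicit levels put "small" first, making it the baseline
releveled <- factor(sizes, levels = c("small", "medium", "large"))
levels(releveled)            # "small" "medium" "large"

# Underneath, a factor stores one integer code per element
as.integer(releveled)        # 1 3 2 1
```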

Data Frames are central to working with tabular data

Create a data frame

coat <- c("calico", "black", "tabby")
weight <- c(2.1, 5.0, 3.2)
chases_bugs <- c(1, 0, 1)

cats <- data.frame(coat, weight, chases_bugs)

cats         # show contents of data frame
str(cats)    # inspect structure of data frame

# Convert chases_bugs to logical vector
cats$chases_bugs <- as.logical(cats$chases_bugs)
str(cats)

Write the data frame to a CSV and re-import it. You can use read.delim() for tab-delimited files, or read.table() for flexible, general-purpose input.

write.csv(x = cats, file = "../data/feline_data.csv", row.names = FALSE)
cats <- read.csv(file = "../data/feline_data.csv", stringsAsFactors = TRUE)

str(cats) # the chr column is now a factor column

Access the columns (vectors) of the data frame
```
cats$weight
cats$coat
```
A vector can only hold one type. Therefore, in a data frame each data column (vector) has to be a single type.
```
typeof(cats$weight)
```

Use data frame vectors in operations

cats$weight + 2
paste("My cat is", cats$coat)

# Operations have to be legal for the data type
cats$coat + 2

# Operations are ephemeral unless their outputs are assigned back to the column
cats$weight <- cats$weight + 1

Data frames have column names. names() gets or sets them
```
names(cats)
names(cats)[2] <- "weight_kg"
cats
```

Lists

Lists can contain anything
```
list1 <- list(1, "a", TRUE, 1+4i)

# Inspect each element of the list
list1[[1]]
list1[[2]]
list1[[3]]
list1[[4]]
```
If you use a single bracket [], you get back a shorter section of the list, which is also a list. Use double brackets [[]] to drill down to the actual value.

(Optional) This includes complex data structures

list2 <- list(title = "Numbers", numbers = 1:10, data = TRUE)

# Single brackets retrieve a slice of the list, containing the name:value pair
list2[2]

# Double brackets retrieve the value, i.e. the contents of the list item
list2[[2]]

Data frames are lists of vectors and factors
```
typeof(cats)
```

Some operations return lists, others return vectors (basically, are you getting the column with its label, or are you drilling down to the data?)

Get list slices

# List slices
cats[1]      # list slice by index
cats["coat"] # list slice by name
cats[1, ]    # get data frame row by row number

Get list contents (in this case, vectors)

# List contents (in this case, vectors)
cats[[1]]      # content by index
cats[["coat"]] # content by name
cats$coat      # content by name; shorthand for `cats[["coat"]]`
cats[, 1]      # content by index, across all rows
cats[1, 1]     # content by index, single row

You can inspect all of these with typeof()
Note that you can address data frames by row and column
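
The sketch below compares the access styles side by side. It uses a small stand-in data frame called `pets` (hypothetical values) rather than `cats`, so it can be run on its own.

```r
# A small stand-in data frame (hypothetical values)
pets <- data.frame(coat = c("calico", "black"), weight = c(2.1, 5.0))

class(pets["weight"])     # "data.frame": single brackets keep the container
class(pets[["weight"]])   # "numeric": double brackets return the vector
class(pets$weight)        # "numeric"
class(pets[, "weight"])   # "numeric": a single column is simplified to a vector
class(pets[1, ])          # "data.frame": a row can mix types, so it stays a data frame

typeof(pets["weight"])    # "list": data frames are lists underneath
typeof(pets[["weight"]])  # "double"
```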

(Optional) Challenge 2: Creating matrices

See /scripts/curriculum.Rmd

Exploring data frames

Adding columns

age <- c(2, 3, 5)
cbind(cats, age)
cats                     # cats is unchanged
cats <- cbind(cats, age) # overwrite old cats

# Data frames enforce consistency
age <- c(2, 5)
cats <- cbind(cats, age)

Appending rows (remember, rows are lists!)

newRow <- list("tortoiseshell", 3.3, TRUE, 9)
cats <- rbind(cats, newRow)

# Legal values added, illegal values are NA
cats

# Update the factor set so that "tortoiseshell" is a legal value
levels(cats$coat) <- c(levels(cats$coat), "tortoiseshell")
cats <- rbind(cats, list("tortoiseshell", 3.3, TRUE, 9))

Removing missing data

cats is now polluted with missing data

na.omit(cats)
cats
cats <- na.omit(cats)
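
Before dropping rows, it can help to check where the missing values are. A minimal sketch with a made-up data frame (`pets` and `clean` are illustrative names):

```r
# Hypothetical data with one missing coat and one missing weight
pets <- data.frame(
  coat   = c("calico", NA, "tabby"),
  weight = c(2.1, 5.0, NA)
)

colSums(is.na(pets))     # missing values per column: coat 1, weight 1
complete.cases(pets)     # TRUE FALSE FALSE

# na.omit() keeps only the complete rows
clean <- na.omit(pets)
nrow(clean)              # 1
```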

Working with realistic data

gapminder <- read.csv("../data/gapminder_data.csv", stringsAsFactors = TRUE)

# Get an overview of the data frame
str(gapminder)
dim(gapminder)

# It's a list
length(gapminder)
colnames(gapminder)

# Look at the data
summary(gapminder$gdpPercap)  # summary varies by data type
head(gapminder)

Challenge 3: New gapminder data frame

See /scripts/curriculum.Rmd

Subsetting data

Subset by index

v <- 1:5

Index selection

v[1]
v[1:3]     # index range
v[c(1, 3)] # selected indices

(Optional) Index exclusion
```
v[-1]
v[-c(1, 3)]
```

Subset by name

letters[1:5]
names(v) <- letters[1:5]

Character selection
```
v["a"]
v[names(v) %in% c("a", "c")]
```
(Optional) Character exclusion
```
v[! names(v) %in% c("a", "c")]
```

Subsetting matrices

m <- matrix(1:28, nrow = 7, byrow = TRUE)

# Matrices are just 2D vectors
m[2:4, 1:3]
m[c(1, 3, 5), c(2, 4)]

(Optional) Extracting list elements

Single brackets get you subsets of the same type (list -> list, vector -> vector, etc.). Double brackets extract the underlying vector from a list or data frame.

# Create a new list and give it names
l <- replicate(5, sample(15), simplify = FALSE)
names(l) <- letters[1:5]

# You can extract one element
l[[1]]
l[["a"]]

# You can't extract multiple elements with double brackets; both of these raise an error
l[[1:3]]
l[[names(l) %in% c("a", "c")]]

Subsetting by logical operations

Explicitly mask each item using TRUE or FALSE. This returns the reduced vector.
```
v[c(FALSE, TRUE, TRUE, FALSE, FALSE)]
```

Evaluate the truth of each item, then produce the TRUE ones

# Use a criterion to generate a truth vector
v > 4

# Filter the original vector by the criterion
v[v > 4]

Combining logical operations
```
v[v < 3 | v > 4]
```

(Optional) Subset by factor

# First three items
gapminder$country[1:3]

# All items in factor set
north_america <- c("Canada", "Mexico", "United States")
gapminder$country[gapminder$country %in% north_america]

Subsetting Data Frames

Data frames have characteristics of both lists and matrices.

Get first three rows

gapminder <- read.csv("../data/gapminder_data.csv", stringsAsFactors = TRUE)

# Get first three rows
gapminder[1:3,]

Rows and columns

gapminder[1:6, 1:3]
gapminder[1:6, c("country", "pop")]

Data frames are lists, so one index gets you the columns
```
gapminder[1:3]
```

Filter by contents

gapminder[gapminder$country == "Mexico",]
north_america <- c("Canada", "Mexico", "United States")
gapminder[gapminder$country %in% north_america,]
gapminder[gapminder$country %in% north_america & gapminder$year > 1999,]
gapminder[gapminder$country %in% north_america & gapminder$year > 1999, c("country", "pop")]

Challenge 4: Extract data by region

See /scripts/curriculum.Rmd

WEEK 2: Building Programs in R

Control flow

Conditionals

Look at Conditional template in curriculum.Rmd

If

x <- 8

if (x >= 10) {
  print("x is greater than or equal to 10")
}

Else

if (x >= 10) {
  print("x is greater than or equal to 10")
} else {
  print("x is less than 10")
}

Else If

if (x >= 10) {
  print("x is greater than or equal to 10")
} else if (x > 5) {
  print("x is greater than 5, but less than 10")
} else {
  print("x is less than or equal to 5")
}

Vectorize your tests

x <- 1:4

if (any(x < 2)) {
  print("Some x less than 2")
}

if (all(x < 2)){
  print("All x less than 2")
}

Review Subsetting section

Subsetting is frequently an alternative to if-else statements in R
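
As an illustration, the sketch below labels a made-up vector of scores. An if statement tests one value at a time, while logical subsetting (or the vectorized ifelse()) handles every element in one step.

```r
scores <- c(4, 12, 7, 15)

# if () tests a single value ...
if (scores[1] >= 10) "high" else "low"   # "low"

# ... while logical subsetting handles every element at once
labels <- rep("low", length(scores))
labels[scores >= 10] <- "high"
labels                                    # "low" "high" "low" "high"

# ifelse() is the vectorized shorthand for the same thing
ifelse(scores >= 10, "high", "low")
```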

Iteration

Look at Iteration template in curriculum.Rmd
Basic For loop
```
for (i in 1:10) {
  print(i)
}
```

Nested For loop

for (i in 1:5) {
  for (j in letters[1:4]) {
    print(paste(i,j))
  }
}

This is where we skip the example where we append things to the end of a data frame. For loops are slow; vectorized operations are fast (and idiomatic). Use for loops where they're the appropriate tool (e.g., loading files, cycling through whole data sets, etc.). We will see more of this in the section on reading and writing data.
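
To make the comparison concrete, here is the same calculation written both ways. This is a sketch with illustrative variable names; on ten elements the speed difference is not measurable, but the vectorized form is shorter and scales better.

```r
x <- 1:10

# For loop: pre-allocate the output, then fill it one element at a time
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# Vectorized: one expression covers every element
squares_vec <- x^2

all.equal(squares_loop, squares_vec)   # TRUE
```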

Vectorization

Vector operations are element-wise by default

x <- 1:4
y <- 6:9
x + y
log(x)

# A more realistic example
gapminder$pop_millions <- gapminder$pop / 1e6
head(gapminder)

Vectors of unequal length are recycled

z <- 1:2
x + z
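
Recycling repeats the shorter vector until it matches the longer one. The sketch below spells out what happens, including the warning R gives when the lengths do not divide evenly.

```r
x <- 1:4
z <- 1:2

# z is recycled to 1 2 1 2 before the addition
x + z          # 2 4 4 6

# A length-1 vector is recycled across every element
x * 10         # 10 20 30 40

# If the longer length is not a multiple of the shorter one,
# R still recycles but issues a warning
x + 1:3        # 2 4 6 5, with a warning
```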

Logical comparisons

Do the elements match a criterion?

x > 2
a <- (x > 2) # you can assign the output to a variable

# Evaluate a boolean vector
any(a)
all(a)

Can you detect missing data?

nan_vec <- c(1, 3, NaN)

## Which elements are NaN?
is.nan(nan_vec)

## Which elements are not NaN?
!is.nan(nan_vec)

## Are any elements NaN?
any(is.nan(nan_vec))

## Are all elements NaN?
all(is.nan(nan_vec))
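
NaN ("not a number") and NA ("not available") are related but distinct: is.na() flags both, while is.nan() flags only NaN. A short sketch with a made-up vector `vals`:

```r
vals <- c(1, NA, NaN)

is.na(vals)    # FALSE TRUE TRUE: is.na() flags both NA and NaN
is.nan(vals)   # FALSE FALSE TRUE: is.nan() flags only NaN

# Summary functions return a missing value unless missing entries are dropped
is.na(mean(vals))          # TRUE
mean(vals, na.rm = TRUE)   # 1
```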

Matrix operations are also element-wise by default

m <- matrix(1:12, nrow=3, ncol=4)

# Multiply each item by -1
m * -1

Linear algebra uses matrix multiplication

# Multiply two vectors
1:4 %*% 1:4

# Matrix-wise multiplication
m2 <- matrix(1, nrow = 4, ncol = 1)
m2
m %*% m2

# Most functions operate on the whole vector or matrix
mean(m)
sum(m)
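
For %*% the inner dimensions must agree: a 3 x 4 matrix times a 4 x 1 matrix gives a 3 x 1 result. The sketch below checks this, and shows that multiplying by a column of ones is the same as summing each row (the names `ones` and `product` are illustrative):

```r
m <- matrix(1:12, nrow = 3, ncol = 4)    # 3 x 4
ones <- matrix(1, nrow = 4, ncol = 1)    # 4 x 1

# (3 x 4) %*% (4 x 1) gives 3 x 1
product <- m %*% ones
dim(product)                 # 3 1

# Multiplying by a column of ones adds up each row
as.vector(product)           # 22 26 30
rowSums(m)                   # 22 26 30

# Element-wise * keeps the original shape instead
dim(m * m)                   # 3 4
```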

Challenge 5: Sum of squares

See /scripts/curriculum.Rmd

Higher-order functions

apply() lets you apply an arbitrary function over a collection. This is an example of a higher-order function (map, apply, filter, reduce, fold, etc.) that can (and should) replace loops for most purposes. They are an intermediate case between vectorized operations (very fast) and for loops (very slow). Use them when you need to build a new collection and vectorized operations aren't available.

`apply()`: Apply a function over the margins of an array

m <- matrix(1:28, nrow = 7, byrow = TRUE)

apply(m, 1, mean)
apply(m, 2, mean)
apply(m, 1, sum)
apply(m, 2, sum)

`lapply()`: Apply a function over a list, returning a list

lst <- list(title = "Numbers", numbers = 1:10, data = TRUE)

## length() returns the length of the whole list
length(lst)

## Use lapply() to get the length of the individual elements
lapply(lst, length)

`sapply()`: Apply a function polymorphically over list, returning vector, matrix, or array as appropriate

## Simplify and return a vector by default
sapply(lst, length)

## Optionally, return the original data type
sapply(lst, length, simplify = FALSE)
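
Because sapply() chooses its return type from the data, the result can be surprising (an empty input gives back an empty list, for example). Base R also provides vapply(), a stricter sibling in which you declare what each result should look like. A brief sketch for comparison:

```r
lst <- list(title = "Numbers", numbers = 1:10, data = TRUE)

# vapply() works like sapply(), but checks every result against FUN.VALUE
lengths_checked <- vapply(lst, length, FUN.VALUE = integer(1))
lengths_checked                  # title 1, numbers 10, data 1

# With empty input, sapply() returns a list while vapply() keeps the declared type
sapply(list(), length)                 # list()
vapply(list(), length, integer(1))     # integer(0)
```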

Use `apply` and friends to extract nested data from a list

Read a JSON file into a nested list

## Read JSON file into nested list
library("jsonlite")
books <- fromJSON("../data/books.json")

## View list structure
str(books)

Extract all of the authors with lapply(). This requires us to define an anonymous function.

## Extract a single author
books[["bk110"]]$author

## Use lapply to extract all the authors
authors <- lapply(books, function(x) x$author)

## Returns list
str(authors)

Extract all of the authors with sapply()

authors <- sapply(books, function(x) x$author)

# Returns vector
str(authors)

(Optional) Convert nested list into data frame

Method 1: Create a list of data frames, then bind them together into a single data frame
```
## This approach omits the top-level book id
df <- do.call(rbind, lapply(books, data.frame))
```
- lapply() applies a given function for each element in a list, so there will be several function calls.
- do.call() applies a given function to the list as a whole, so there is only one function call.
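
The do.call() step can be checked on a toy example: passing a list to do.call(rbind, ...) gives the same result as writing out a single rbind() call by hand. The data frames below are made up for illustration.

```r
# Three small data frames (hypothetical values)
parts <- list(
  data.frame(id = 1, value = "a"),
  data.frame(id = 2, value = "b"),
  data.frame(id = 3, value = "c")
)

# do.call() passes the list elements as the arguments of one rbind() call
combined <- do.call(rbind, parts)

# Equivalent to writing the call out by hand
by_hand <- rbind(parts[[1]], parts[[2]], parts[[3]])

identical(combined, by_hand)   # TRUE
nrow(combined)                 # 3
```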

Method 2: Use the rbindlist() function from data.table

## This approach includes the top-level book id
df <- data.table::rbindlist(books, idcol = TRUE)

Functions explained

Functions let you encapsulate and re-use chunks of code. This has several benefits:

Eliminates repetition in your code. This saves labor, but more importantly it reduces errors, and makes it easier for you to find and correct errors.
Allows you to write more generic (i.e. flexible) code.
Reduces cognitive overhead.

Defining a function

Look at Function template in data/curriculum.Rmd

Define a simple function

# Convert Fahrenheit to Celsius
f_to_celcius <- function(temp) {
  celcius <- (temp - 32) * (5/9)
  return(celcius)
}

Call the function

f_to_celcius(32)

boiling <- f_to_celcius(212)

Combining functions

Define a second function and call the first function within the second.

f_to_kelvin <- function(temp) {
  celcius <- f_to_celcius(temp)
  kelvin <- celcius + 273.15
  return(kelvin)
}

f_to_kelvin(212)

Most functions work with collections

## Create a vector of temperatures
temps <- seq(from = 1, to = 101, by = 10)

# Vectorized calculation (fast)
f_to_kelvin(temps)

# Apply
sapply(temps, f_to_kelvin)

Defensive programming

Check whether input meets criteria before proceeding (this is `assert` in other languages).

f_to_celcius <- function(temp) {
  ## Check inputs
  stopifnot(is.numeric(temp), temp > -460)
  celcius <- (temp - 32) * (5/9)
  return(celcius)
}

f_to_celcius("a")
f_to_celcius(-470)

Fail with a custom error if criterion not met

f_to_celcius <- function(temp) {
  if(!is.numeric(temp)) {
    stop("temp must be a numeric vector")
  }
  celcius <- (temp - 32) * (5/9)

  return(celcius)
}
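
A caller can recover from such an error with tryCatch() instead of letting the script halt. The sketch below is one possible pattern, not the only one; `f_to_celsius_checked` is a hypothetical stand-alone copy of the function so that the example runs on its own.

```r
# Hypothetical stand-alone version of the checked conversion function
f_to_celsius_checked <- function(temp) {
  if (!is.numeric(temp)) {
    stop("temp must be a numeric vector")
  }
  (temp - 32) * (5 / 9)
}

# tryCatch() runs the handler when an error is raised
result <- tryCatch(
  f_to_celsius_checked("a"),
  error = function(e) {
    message("Conversion failed: ", conditionMessage(e))
    NA_real_   # fall back to a missing value
  }
)

result                        # NA
f_to_celsius_checked(212)     # 100
```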

Working with rich data

## Prerequisites
gapminder <- read.csv("../data/gapminder_data.csv", stringsAsFactors = TRUE)
north_america <- c("Canada", "Mexico", "United States")

Calculate the total GDP for each entry in the data set

gdp <- gapminder$pop * gapminder$gdpPercap

Write a function to perform a total GDP calculation on a filtered subset of your data.

calcGDP <- function(df, year=NULL, country=NULL) {
  if(!is.null(year)) {
    df <- df[df$year %in% year, ]
  }
  if (!is.null(country)) {
    df <- df[df$country %in% country,]
  }
  gdp <- df$pop * df$gdpPercap

  new_df <- cbind(df, gdp=gdp)
  return(new_df)
}

Mutating df inside the function doesn't affect the global gapminder data frame (because of pass-by-value and scope).
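
A small experiment illustrates this. The function and data frame below are hypothetical and unrelated to gapminder:

```r
# Adds a column to whatever data frame it receives
add_flag <- function(df) {
  df$flag <- TRUE      # modifies the function's local copy only
  df
}

original <- data.frame(id = 1:3)
modified <- add_flag(original)

names(original)   # "id": the caller's data frame is untouched
names(modified)   # "id" "flag"
```

To keep the change, assign the return value to a name, as with `modified` here.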

Challenge 6: Testing and debugging your function

See data/curriculum.Rmd

Reading and writing data

Create sample data sets and write them to the `processed` directory

Preliminaries

if (!dir.exists("../processed")) {
  dir.create("../processed")
}

north_america <- c("Canada", "Mexico", "United States")

Version 1: Use calcGDP function

for (year in unique(gapminder$year)) {
  df <- calcGDP(gapminder, year = year, country = north_america)

  ## Generate a file name. This will fail if "processed" doesn't exist
  fname <- paste("../processed/north_america_", as.character(year), ".csv", sep = "")

  ## Write the file
  write.csv(x = df, file = fname, row.names = FALSE)
}

Version 2: Bypass calcGDP function

for (year in unique(gapminder$year)) {
  df <- gapminder[gapminder$year == year, ]
  df <- df[df$country %in% north_america, ]
  fname <- paste("../processed/north_america_", as.character(year), ".csv", sep="")
  write.csv(x = df, file = fname, row.names = FALSE)
}

How to find files

## Get matching files from the `processed` subdirectory
dir(path = "../processed", pattern = "north_america_[0-9]+\\.csv$")
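
The pattern argument of dir() is a regular expression, not a shell wildcard, so it is worth testing a pattern against sample names before relying on it. A sketch with hypothetical file names; grepl() applies the same kind of matching that dir() uses:

```r
# Hypothetical file names, as dir() might return them
files <- c("north_america_1952.csv", "north_america_2002.csv",
           "north_america_1952.csv.bak", "notes.txt")

# [0-9]+ : one or more digits
# \\.    : a literal dot (an unescaped . matches any character)
# $      : end of the name
pattern <- "^north_america_[0-9]+\\.csv$"

matched <- files[grepl(pattern, files)]
matched   # "north_america_1952.csv" "north_america_2002.csv"

# glob2rx() converts a shell-style wildcard into a regular expression
glob2rx("north_america_*.csv")
```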

Read files using a for loop

Read each file into a data frame and add it to a list

## Create an empty list
df_list <- list()

## Get the locations of the matching files
file_names <- dir(path = "../processed", pattern = "north_america_[0-9]+\\.csv$")
file_paths <- file.path("../processed", file_names)

for (f in file_paths){
  df_list[[f]] <- read.csv(f, stringsAsFactors = TRUE)
}

Access the list items to view the individual data frames

length(df_list)
names(df_list)
lapply(df_list, length)
## The list names are the full file paths used in the loop
df_list[["../processed/north_america_1952.csv"]]

Read files using apply

Instead of a for loop that handles each file individually, use a single higher-order function call.

df_list <- lapply(file_paths, read.csv, stringsAsFactors = TRUE)

## The resulting list does not have names set by default
names(df_list)

## You can still access by index position
df_list[[2]]

Add names manually

names(df_list) <- file_names
df_list$north_america_1952.csv

(Optional) Automatically set names for the output list. This example sets each name to the complete path name (e.g., "../processed/north_america_1952.csv").
```
df_list <- sapply(file_paths, read.csv, simplify = FALSE, USE.NAMES = TRUE)
```

Concatenate list of data frames into a single data frame

Method 1: Create a list of data frames, then bind them together into a single data frame
```
df <- do.call(rbind, df_list)
```
- lapply() applies a given function for each element in a list, so there will be several function calls.
- do.call() applies a given function to the list as a whole, so there is only one function call.
(Optional) Method 2: Use the rbindlist() function from data.table. This can be faster for large data sets. It also gives you the option of preserving the list names (in this case, the source file names) as a new column in the new data frame.
```
df_list <- sapply(file.path("../processed", file_names), read.csv, simplify = FALSE, USE.NAMES = TRUE)
df <- data.table::rbindlist(df_list, idcol = TRUE)
```

WEEK 3: Tidyverse

Data frame manipulation with dplyr

Orientation

library("dplyr")

Explain Tidyverse briefly: https://www.tidyverse.org/packages/
(Optional) Demo unix pipes with history | grep
Explain tibbles briefly
dplyr allows you to treat data frames like relational database tables; i.e. as sets

Select data frame variables

select() provides a mini-language for selecting data frame variables
```
df <- select(gapminder, year, country, gdpPercap)
str(df)
```
select() understands negation (and many other intuitive operators)
```
df2 <- select(gapminder, -continent)
str(df2)
```

You can link multiple operations using pipes. This will be more intuitive once we see this combined with filter()

df <- gapminder %>% select(year, country, gdpPercap)

## You can also use the native pipe (R 4.1 or later). It has a few limitations
## compared with %>%; for example, it does not support the `.` placeholder:
## df <- gapminder |> select(year, country, gdpPercap)
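
A pipe passes the value on its left as the first argument of the call on its right, so `x |> f(y)` means `f(x, y)`. The sketch below shows this with base R functions only, so it runs without dplyr; it assumes R 4.1 or later for the native pipe, and the variable names are illustrative.

```r
values <- c(1, 4, 9)

# Nested calls read inside-out
nested <- round(sum(sqrt(values)), 1)

# The same steps with the native pipe read left to right
piped <- values |> sqrt() |> sum() |> round(1)

identical(nested, piped)   # TRUE
```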

Filter data frames by content

Filter by continent

df_europe <- gapminder %>%
  filter(continent == "Europe") %>%
  select(year, country, gdpPercap)

str(df_europe)

Filter by continent and year

europe_2007 <- gapminder %>%
  filter(continent == "Europe", year == 2007) %>%
  select(country, lifeExp)

str(europe_2007)

(Optional) Challenge 7: Filter

See data/curriculum.Rmd

Group rows

Group data by a data frame variable

grouped_df <- gapminder %>% group_by(continent)

## This produces a tibble
str(grouped_df)

The grouped data frame contains metadata (i.e. bookkeeping) that tracks the group membership of each row. You can inspect this metadata:
```
grouped_df %>% tally()
grouped_df %>% group_keys()
grouped_df %>% group_vars()

## These produce a lot of output:
grouped_df %>% group_indices()
grouped_df %>% group_rows()
```
- More information about grouped data frames: https://dplyr.tidyverse.org/articles/grouping.html

Summarize grouped data

Calculate mean gdp per capita by continent

grouped_df %>% summarise(mean_gdpPercap = mean(gdpPercap))

(Optional) Using pipes allows you to do ad hoc reporting without creating intermediate variables
```
gapminder %>%
  group_by(continent) %>%
  summarise(mean_gdpPercap = mean(gdpPercap))
```

Group data by multiple variables

df <- gapminder %>%
  group_by(continent, year) %>%
  summarise(mean_gdpPercap = mean(gdpPercap))

Create multiple data summaries

df <- gapminder %>%
  group_by(continent, year) %>%
  summarise(mean_gdp = mean(gdpPercap),
            sd_gdp = sd(gdpPercap),
            mean_pop = mean(pop),
            sd_pop = sd(pop))

Use group counts

count() lets you get an ad hoc count of any variable

gapminder %>%
  filter(year == 2002) %>%
  count(continent, sort = TRUE)

n() gives the number of observations in a group

## Get the standard error of life expectancy by continent
gapminder %>%
  group_by(continent) %>%
  summarise(se_le = sd(lifeExp)/sqrt(n()))
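
The standard error used here is the group standard deviation divided by the square root of the group size. As a cross-check that does not depend on dplyr, the same quantity can be computed per group with base R's tapply(). The toy data and the helper `se()` below are made up for illustration:

```r
toy <- data.frame(
  group = c("a", "a", "a", "a", "b", "b", "b", "b"),
  value = c(1, 2, 3, 4, 10, 20, 30, 40)
)

# Standard error of the mean: sd / sqrt(n)
se <- function(x) sd(x) / sqrt(length(x))

# Apply se() to the values within each group
se_by_group <- tapply(toy$value, toy$group, se)
se_by_group
```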

Mutate the data to create new variables

Mutate creates a new variable within your pipeline

## Total GDP and population by continent and year
df <- gapminder %>%
  mutate(gdp_billion = gdpPercap * pop / 10^9) %>%
  group_by(continent, year) %>%
  summarise(mean_gdp = mean(gdp_billion),
            sd_gdp = sd(gdp_billion),
            mean_pop = mean(pop),
            sd_pop = sd(pop))

Add conditional filtering to a pipeline with `ifelse`

Perform previous calculation, but only in cases in which the life expectancy is over 25

df <- gapminder %>%
  mutate(gdp_billion = ifelse(lifeExp > 25, gdpPercap * pop / 10^9, NA)) %>%
  group_by(continent, year) %>%
  summarise(mean_gdp = mean(gdp_billion),
            sd_gdp = sd(gdp_billion),
            mean_pop = mean(pop),
            sd_pop = sd(pop))

(Optional) Predict future GDP per capita for countries with higher life expectancies

df <- gapminder %>%
  mutate(gdp_expected = ifelse(lifeExp > 40, gdpPercap * 1.5, gdpPercap)) %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap),
            mean_gdpPercap_expected = mean(gdp_expected))

Challenge 8: Life expectancy in random countries

gapminder %>%
  filter(year == 2002) %>%
  group_by(continent) %>%
  sample_n(2) %>%
  summarize(mean_lifeExp = mean(lifeExp), country = country) %>%
  arrange(desc(mean_lifeExp))

Data frame manipulation with tidyr

Long format: All rows are unique observations (ideally)
1. each column is a variable
2. each row is an observation
Wide format: Rows contain multiple observations
1. Repeated measures
2. Multiple variables
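
The two layouts can hold exactly the same measurements. The sketch below builds a tiny example of each by hand (hypothetical countries and numbers) to make the difference in shape visible:

```r
# Wide: one row per country, one column per year
wide <- data.frame(
  country  = c("A", "B"),
  pop_1952 = c(10, 20),
  pop_1957 = c(11, 22)
)

# Long: one row per country-year observation
long <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(1952, 1957, 1952, 1957),
  pop     = c(10, 11, 20, 22)
)

# Both hold the same four measurements
sum(wide$pop_1952, wide$pop_1957) == sum(long$pop)   # TRUE
dim(wide)   # 2 3
dim(long)   # 4 3
```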

Gapminder data

library("tidyr")
library("dplyr")

str(gapminder)

3 ID variables: continent, country, year
3 Observation variables: pop, lifeExp, gdpPercap

Wide to long with `pivot_longer()`

Load wide gapminder data

gap_wide <- read.csv("../data/gapminder_wide.csv", stringsAsFactors = FALSE)
str(gap_wide)

Group comparable columns into a single variable. Here we group all of the "pop" columns, all of the "lifeExp" columns, and all of the "gdpPercap" columns.
```
gap_long <- gap_wide %>%
  pivot_longer(
    cols = c(starts_with('pop'), starts_with('lifeExp'), starts_with('gdpPercap')),
    names_to = "obstype_year", values_to = "obs_values"
  )

str(gap_long)
head(gap_long, n=20)
```
1. Original column headers become keys
2. Original column values become values
3. This pushes all values into a single column, which is unintuitive. We will generate the intermediate format later.

(Optional) Same pivot operation as above, specifying the columns to omit rather than the columns to include.

gap_long <- gap_wide %>%
  pivot_longer(
    cols = c(-continent, -country),
    names_to = "obstype_year", values_to = "obs_values"
  )

str(gap_long)

Split compound variables into individual variables

gap_long <- gap_long %>% separate(obstype_year, into = c('obs_type', 'year'), sep = "_")
gap_long$year <- as.integer(gap_long$year)
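
As an alternative sketch, `pivot_longer()` can split the compound column names during the pivot itself through its `names_sep` argument. The data frame `toy_wide` is a hypothetical stand-in that follows the same "metric_year" naming pattern; it is not the workshop data.

```r
library("tidyr")

## Hypothetical wide data using the same "metric_year" naming pattern
toy_wide <- data.frame(
  country = c("X", "Y"),
  pop_1952 = c(10, 20),
  pop_1957 = c(11, 22),
  lifeExp_1952 = c(40, 50),
  lifeExp_1957 = c(42, 53)
)

## names_sep splits each column name at "_" into two new columns
toy_long <- pivot_longer(
  toy_wide,
  cols = -country,
  names_to = c("obs_type", "year"),
  names_sep = "_",
  values_to = "obs_values"
)

## The year column is character after the split, so convert it
toy_long$year <- as.integer(toy_long$year)
str(toy_long)
```

Whether one step or two reads better is a matter of taste; the two-step version makes the intermediate compound column visible.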

Long to intermediate with `pivot_wider()`

Recreate the original gapminder data frame (as a tibble)

## Read in the original data without factors for comparison purposes
gapminder <- read.csv("../data/gapminder_data.csv", stringsAsFactors = FALSE)

gap_normal <- gap_long %>%
  pivot_wider(names_from = obs_type, values_from = obs_values)

str(gap_normal)
str(gapminder)

Rearrange the column order of gap_normal so that it matches gapminder
```
gap_normal <- gap_normal[, names(gapminder)]
```

Check whether the data frames are equivalent (they aren't yet)

all.equal(gap_normal, gapminder)

head(gap_normal)
head(gapminder)

Change the sort order of gap_normal so that it matches

gap_normal <- gap_normal %>% arrange(country, year)
all.equal(gap_normal, gapminder)

Long to wide with `pivot_wider()`

Create variable labels for wide columns. In this case, the new variables are all combinations of metric (pop, lifeExp, or gdpPercap) and year. Effectively we are squishing many columns together.
```
help(unite)

df_temp <- gap_long %>%
  ## unite(ID_var, continent, country, sep = "_") %>%
  unite(var_names, obs_type, year, sep = "_")

str(df_temp)
head(df_temp, n=20)
```

Pivot to wide format, distributing data into columns for each unique label

gap_wide_new <- gap_long %>%
  ## unite(ID_var, continent, country, sep = "_") %>%
  unite(var_names, obs_type, year, sep = "_") %>%
  pivot_wider(names_from = var_names, values_from = obs_values)

str(gap_wide_new)

Sort columns alphabetically by variable name, then check for equality. You can move a single column to a different position with `relocate()`.
```
gap_wide_new <- gap_wide_new[,order(colnames(gap_wide_new))]
all.equal(gap_wide, gap_wide_new)
```
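
A minimal sketch of moving one column with `relocate()`, assuming dplyr 1.0.0 or later. The data frame `toy` is hypothetical and only mimics the wide column names.

```r
library("dplyr")

## Hypothetical data frame with columns in alphabetical order
toy <- data.frame(
  continent = "Asia",
  country = "X",
  gdpPercap_1952 = 1,
  pop_1952 = 2
)

## Move pop_1952 so that it sits immediately before gdpPercap_1952
toy_moved <- toy %>% relocate(pop_1952, .before = gdpPercap_1952)
names(toy_moved)
```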

Additional tidyverse libraries

Reading data with readr

Fast, user-friendly file imports.
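
A minimal sketch of a readr import. The file contents are hypothetical and are written to a temporary file so that the example does not depend on the workshop data directory.

```r
library("readr")

## Write a small hypothetical CSV to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("country,year,pop", "X,1952,10", "Y,1952,20"), tmp)

## read_csv() returns a tibble; declaring column types makes the import explicit
toy <- read_csv(
  tmp,
  col_types = cols(
    country = col_character(),
    year = col_integer(),
    pop = col_double()
  )
)
toy
```

Declaring `col_types` is optional; without it, readr guesses the types from the data.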

String processing with stringr

Real string processing for R.
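
A few stringr functions, sketched on column names in the style of the wide gapminder data. The vector `vars` is made up for illustration.

```r
library("stringr")

## Column names in the style of the wide gapminder data
vars <- c("pop_1952", "lifeExp_1952", "gdpPercap_2007")

## Which names start with "pop"?
starts_pop <- str_detect(vars, "^pop")
starts_pop

## Extract the four-digit year from each name
years <- str_extract(vars, "[0-9]{4}")
years

## Replace the first underscore in each name with a dot
dotted <- str_replace(vars, "_", ".")
dotted
```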

Functional programming with purrr

Functional programming for the Tidyverse. The map family of functions replaces the apply family for most use cases. Map functions are strongly typed. For example, you can use `purrr::map_chr()` to extract nested data from a list:

## View the relevant map function
library("purrr")
library("jsonlite")

help(map_chr)

books <- fromJSON("books.json")

## Returns vector
authors <- map_chr(books, ~.x$author)

The `~` operator in purrr creates an anonymous function that is applied to each element of the collection. Inside the function, `.x` refers to the current element.
1. Best overview in as_mapper() documentation: https://purrr.tidyverse.org/reference/as_mapper.html
2. https://stackoverflow.com/a/53160041
3. https://stackoverflow.com/a/62488532
4. https://stackoverflow.com/a/44834671
Additional references
1. https://purrr.tidyverse.org/reference/map.html
2. https://jtr13.github.io/spring19/ss5593&fq2150.html
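
The `~` shorthand described above can be compared with its longer equivalents in a short sketch. The list `toy_books` is a hypothetical stand-in for parsed JSON, not the workshop's books file.

```r
library("purrr")

## Hypothetical nested list standing in for parsed JSON
toy_books <- list(
  list(author = "Author A", title = "Title 1"),
  list(author = "Author B", title = "Title 2")
)

## Formula shorthand: .x is the current element
authors_formula <- map_chr(toy_books, ~ .x$author)

## Equivalent explicit anonymous function
authors_function <- map_chr(toy_books, function(book) book$author)

## Equivalent shorthand: a string extracts the element with that name
authors_name <- map_chr(toy_books, "author")

identical(authors_formula, authors_function)
```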

(Optional) Database interfaces

Data frame joins with dplyr
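
A minimal sketch of two common joins, assuming two small lookup tables that share a `country` column. Both tables are hypothetical.

```r
library("dplyr")

## Hypothetical lookup tables (illustration only)
countries <- data.frame(
  country = c("X", "Y", "Z"),
  continent = c("Asia", "Africa", "Europe")
)
populations <- data.frame(
  country = c("X", "Y", "W"),
  pop = c(10, 20, 30)
)

## left_join() keeps every row of the first table; unmatched rows get NA
left <- left_join(countries, populations, by = "country")

## inner_join() keeps only rows with a match in both tables
inner <- inner_join(countries, populations, by = "country")

left
inner
```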

Access databases using dplyr

https://dbplyr.tidyverse.org

Endnotes

Credits

R for Reproducible Scientific Analysis: https://swcarpentry.github.io/r-novice-gapminder/
Andrea Sánchez-Tapia's workshop: https://github.com/AndreaSanchezTapia/UCMerced_R
Instructor notes for "R for Reproducible Scientific Analysis": https://swcarpentry.github.io/r-novice-gapminder/guide/

References

R Project documentation: https://cran.r-project.org/manuals.html
CRAN task views: https://cran.r-project.org/web/views/
R Cookbook: http://www.cookbook-r.com
RStudio cheat sheets: https://www.rstudio.com/resources/cheatsheets/
Matrix algebra operations in R: https://www.statmethods.net/advstats/matrix.html
RStudio keyboard shortcuts: https://support.rstudio.com/hc/en-us/articles/200711853-Keyboard-Shortcuts
RStudio shortcuts and tips: https://appsilon.com/rstudio-shortcuts-and-tips/
Why typeof() and class() give different outputs: https://stackoverflow.com/a/8857411
How to get function code from the different object systems: https://stackoverflow.com/questions/19226816/how-can-i-view-the-source-code-for-a-function
Various approaches to contrast coding: https://stats.oarc.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/

If you tell R that a factor is ordered, it defaults to orthogonal polynomial contrasts. This means that it assumes you want it to check for linear, quadratic, and cubic trends. If you tell R that a factor is NOT ordered, it defaults to treatment contrasts: it compares each level to a reference level. This probably doesn't make sense for a lot of psych data. For example, if I say income is ordered, R calculates linear, quadratic, etc. trends for income, which is not only not what I want but is also inappropriate unless the groups are evenly spaced. Treatment contrasts test whether each level is significantly different from a reference level (e.g. the highest income group).

So if you want first-year stats output in a design with more than 2 levels in the factor, put this at the top of the R code:
```
options(contrasts = c("contr.sum","contr.poly"))
```
contr.sum is R for deviation contrasts, which you may recall as contrasts like -1, 0, 1.
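
The contrast schemes discussed above can be inspected directly. This sketch prints the base R contrast matrices for a factor with three levels; by default, treatment contrasts use the first level as the reference.

```r
## Contrast matrices for a factor with three levels (base R, stats package)

## Treatment contrasts: each level is compared with the first (reference) level
contr.treatment(3)

## Sum (deviation) contrasts: each column sums to zero
contr.sum(3)

## Orthogonal polynomial contrasts: linear (.L) and quadratic (.Q) trends
contr.poly(3)

## Current defaults for unordered and ordered factors
getOption("contrasts")
```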

Data Sources

Gapminder data:
- https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv
- https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_wide.csv
JSON derived from Microsoft sample XML file: https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms762271(v=vs.85)
