### Installation and code editors

### Variables, types, data structures

In R, every variable refers to an R object whose type is determined by the data it holds. Some common R objects are:
- *vectors*: the simplest data structure; a scalar in R is a vector of length 1. There are six atomic vector types: `logical`, `integer`, `numeric` (reported as `double` by `typeof`), `complex`, `character`, and `raw`. All real numbers in R are stored in double-precision format.
- *factors*: objects that represent a categorical variable, storing each value as an integer code that refers to one of a fixed set of levels
- *lists*: ordered collections of elements, optionally named, which can be of different types
- *matrices*: elements of the same atomic type arranged in a two-dimensional, rectangular layout; the dimensions can be given names
- *arrays*: multidimensional data structures whose elements share one atomic type; a three-dimensional array, for example, can be viewed as a stack of matrices
- *data frames*: table-like, two-dimensional structures in which each column stores the values of one variable and each row corresponds to one observation
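
The atomic types, factors, and arrays listed above can be inspected directly. The following is a minimal sketch using only base R; the variable names (`atomicTypes`, `sizes`, `arr`) and the toy values are made up for illustration. Note that `typeof` reports `double` for ordinary real numbers.

```r
# typeof() reports the storage type of each atomic vector
atomicTypes = sapply(list(TRUE, 1L, 1.5, 2+3i, "a", as.raw(255)), typeof)
atomicTypes

# A factor stores integer codes that index into its levels
sizes = factor(c("small", "large", "small", "medium"),
               levels = c("small", "medium", "large"))
levels(sizes)
as.integer(sizes)

# A three-dimensional array: four 2x3 slices filled column by column
arr = array(1:24, dim = c(2, 3, 4))
dim(arr)
arr[, , 1]   # the first slice is an ordinary 2x3 matrix
```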

In [None]:
# Vectors
a = c(1,2,3) 
b = c(4,5,6)  
d = c("a","b","c")

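# Combining vectors as columns (cbind) or rows (rbind); including the character vector d coerces every entry to character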
cbind(a,b,d)
rbind(a,b,d)

# Lists
e = list(x = "R", y = c("is", "fun"), z = pi)
e$x
str(e)

# Matrices
f = matrix(1:9, nrow = 3, ncol = 3)
det(f)     # Right answer is 0. Matlab says this determinant is 6.6613e-16 ...
eigen(f)   # The third eigenvalue should be 0 (Matlab got this right) ...

# Data frames
mc1973 = c(30843, 27752, 31557, 31089, 35222, 33587, 31418, 30129, 27327, 31383, 30430, 34031)
mc2004 = c(215333, 204911, 226643, 222094, 242950, 222523, 214999, 216362, 214577, 217289, 211547, 225771)
q = data.frame(production1973 = mc1973, production2004 = mc2004)
summary(q)
mean(q$production1973)

### Sampling and probability

### Functions, higher-order functions, lambdas

### Linear/nonlinear statistical models

### **Tabulating data**

There are multiple functions in the R base package and elsewhere for grouping data based on one or more variables, as well as for constructing frequency and contingency tables. Useful functions include:
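- `ddply` (package: `plyr`): splits a data frame by one or more variables, applies a function to each piece, and combines the results into a data frame
- `table` (package: `base`): builds a contingency table of counts for each combination of factor levels
- `aggregate` (package: `stats`): splits the data into subsets and computes a summary statistic for each subset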

The `summary` function can also be used to obtain descriptive statistics (quartiles, mean, minimum, maximum) for each variable in a data table.
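
As a quick illustration of these functions, here is a minimal sketch that uses only base R. It assumes the built-in `mtcars` dataset as a stand-in for any data table; the variable names are hypothetical.

```r
# mtcars ships with R (datasets package); it stands in here for any data table

# Frequency table: number of cars for each cylinder count
cylCounts = table(mtcars$cyl)
cylCounts

# Contingency table: cylinder count against number of gears
cylByGear = table(cyl = mtcars$cyl, gear = mtcars$gear)
cylByGear

# Grouped summary: mean miles per gallon within each cylinder group
mpgByCyl = aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
mpgByCyl

# Descriptive statistics for selected columns
summary(mtcars[, c("mpg", "hp")])
```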

**Example 1**: For each year from 2000 to 2014, on which day were more children born in the US than on any other day of that year?

In [None]:
library(plyr)
births2000_2014 = read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_2000-2014_SSA.csv", header=TRUE)
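# For each year, keep the row(s) with the largest number of births
birthdays = ddply(births2000_2014, "year", subset, births==max(births))
# Print one line per year; cat() returns NULL, so there is nothing useful to assign
invisible(apply(birthdays, 1, function(x) { cat(x[1], ":", month.abb[x[2]], x[3], fill=TRUE) }))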

**Example 2:** Which states have had the most university commencement addresses given by US presidents?

In [None]:
commencementSpeeches = read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/presidential-commencement-speeches/commencement_speeches.csv", header = TRUE)
tail(sort(table(commencementSpeeches$state)))

### **Statistical tests**

Many functions are available in R for performing a wide range of statistical tests.  A few of the popular functions are:
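- `t.test`: one- and two-sample t-tests on the means of numeric samples, including paired tests
- `shapiro.test`: Shapiro-Wilk test of normality
- `var.test`: F test comparing the variances of two samples from normal populations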
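
The three functions can be tried on a small dataset before moving to real data. This is a minimal sketch that assumes the built-in `sleep` dataset (extra hours of sleep for the same 10 patients under two drugs); the variable names are hypothetical.

```r
# sleep ships with R: two measurements per patient, so the samples are paired
drug1 = sleep$extra[sleep$group == "1"]
drug2 = sleep$extra[sleep$group == "2"]

# Shapiro-Wilk: are the paired differences plausibly normal?
normality = shapiro.test(drug2 - drug1)
normality$p.value

# F test: do the two samples have equal variances?
# (relevant when running an unpaired t-test with var.equal = TRUE)
variances = var.test(drug1, drug2)
variances$p.value

# Paired t-test: is the mean of the paired differences nonzero?
pairedTest = t.test(drug2, drug1, paired = TRUE)
pairedTest$p.value
```

Each test returns an object of class `htest`, so fields such as `p.value`, `statistic`, and `conf.int` can be extracted rather than read off the printed output.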

**Example 1**: Is there a statistically significant difference between the daily circulation of American newspapers in 2004 and in 2013?

In [None]:
newspapers = read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/pulitzer/pulitzer-circulation-data.csv", header=TRUE)
circulation2004 = as.numeric(gsub(",", "", newspapers$Daily.Circulation..2004))
circulation2013 = as.numeric(gsub(",", "", newspapers$Daily.Circulation..2013))

In [None]:
# Paired t-test:
t.test(circulation2004, circulation2013, paired = TRUE, conf.level = 0.95)

In [None]:
# Paired t-test with the one-sided alternative hypothesis that circulation fell (i.e. mean of 2004-2013 > 0)
t.test(circulation2004, circulation2013, paired = TRUE, alternative = "greater")

### **Model selection**

Using 'tidyverse' packages to reshape a large, complex dataset: 
What proportion of the New York Philharmonic's performances from 1842 to 2016 have been Beethoven, Mozart, or Mendelssohn pieces?

In [None]:
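# jsonlite provides fromJSON(); it is installed with tidyverse but not attached by it
library(plyr);  library(tidyverse);  library(jsonlite);  library(reshape2);  options(warn=-1)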
NYPhil1842 = fromJSON("https://raw.githubusercontent.com/nyphilarchive/PerformanceHistory/master/Programs/json/complete.json")
seasons = (NYPhil1842 %>% map("season"))$programs
composers = (NYPhil1842 %>% map("works"))$programs %>% map("composerName")

beethovenCount = unlist(composers %>% map(~ sum(str_count(na.omit(.), "Beethoven"))))
mozartCount = unlist(composers %>% map(~ sum(str_count(na.omit(.), "Mozart"))))
mendelssohnCount = unlist(composers %>% map(~ sum(str_count(na.omit(.), "Mendelssohn"))))

pieceCount = unlist(composers %>% map(~ length(na.omit(.))))

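# The column is named "seasons" because the grouping, melting, and plotting steps refer to it by that name
composersBySeason = tibble(seasons = seasons, pieceCount = pieceCount, beethoven = beethovenCount, 
                           mozart = mozartCount, mendelssohn = mendelssohnCount)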

composersBySeason = plyr::ddply(composersBySeason,"seasons",numcolwise(sum))
composersBySeason[,3:5] = composersBySeason[,3:5]/composersBySeason$pieceCount

composersBySeasonLong = melt(composersBySeason[-NROW(composersBySeason),-2], id="seasons")

p1 = ggplot(data=composersBySeasonLong,  aes(x=seasons, y=value, color=variable, group=variable)) + 
            geom_line() + theme(axis.text.x = element_text(size=5, angle=90))
print(p1)
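
The two ideas the code above relies on, extracting fields from nested lists with `map` and reshaping from wide to long format before plotting, can be seen more easily on toy data. The sketch below is not the analysis itself: the nested `programs` list is hypothetical and only loosely imitates the shape of the parsed JSON, and it uses `tidyr::pivot_longer` in place of `reshape2::melt` for the reshaping step.

```r
library(purrr)
library(tidyr)

# Hypothetical nested records, loosely imitating the parsed JSON
programs = list(
  list(season = "1842-43", composers = c("Beethoven", "Weber", "Beethoven")),
  list(season = "1843-44", composers = c("Mozart", "Beethoven"))
)

# A string argument extracts the element with that name from each record
seasons = map_chr(programs, "season")

# A formula argument is an anonymous function; .x is the current record
beethoven = map_int(programs, ~ sum(.x$composers == "Beethoven"))
mozart = map_int(programs, ~ sum(.x$composers == "Mozart"))

# Wide format: one row per season, one column per composer
wide = data.frame(seasons, beethoven, mozart)

# Long format: one row per (season, composer) pair, the layout ggplot2 expects
long = pivot_longer(wide, cols = -seasons, names_to = "variable", values_to = "value")
long
```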