# Numerical R and data science packages
Many of the examples in this notebook come from *R for Data Science*,
an excellent, easy-to-read book that I recommend checking out:
https://r4ds.had.co.nz/

### Create a vector of consecutive integers

In [None]:
x <- seq(1,100)

In [None]:
x

In [None]:
1:100 == seq(1,100)

### Specifying an increment
 * (behaves similarly to `np.arange`, except that `seq` includes the endpoint)

In [None]:
seq(1,10, by=.1)

In [None]:
# Like...
# np.arange(1,10.01, .1)
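The `10.01` in the NumPy comparison is needed because `np.arange` excludes its stop value, while `seq()` includes the upper bound whenever the step lands on it. A minimal sketch in base R (the variable name `s` is only illustrative):

```r
# seq() includes the upper bound when the increment lands on it
s <- seq(1, 10, by = 0.1)
length(s)     # 91 values: 1.0, 1.1, ..., 10.0
tail(s, 1)    # the last value is the endpoint, 10

# length.out fixes the number of points instead of the step size
seq(0, 1, length.out = 5)
```

Using `length.out` is often the safer choice when the exact number of grid points matters more than the step size.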

### Reshaping a vector into a matrix by specifying number of rows

In [None]:
M <- matrix(x, nrow=10)

In [None]:
M

### Reshaping a vector into a matrix by specifying number of columns

In [None]:
N = matrix(x, ncol=10)
N

In [None]:
W = matrix(N, nrow=4)

In [None]:
W

### Reshaping a *matrix* into a new shape
Just as we reshaped a vector into a matrix
by calling `matrix` and specifying `nrow` or `ncol`,
we can reshape a matrix with the same syntax.


In [None]:
matrix(W, nrow=10)
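It may help to see the order in which `matrix` fills its cells. The sketch below uses a small illustrative vector (`v`, `m_col`, and `m_row` are made-up names) to show that R fills column by column unless `byrow = TRUE` is given, which differs from NumPy's row-major default.

```r
v <- 1:6
m_col <- matrix(v, nrow = 2)                 # filled down the columns (the default)
m_row <- matrix(v, nrow = 2, byrow = TRUE)   # filled across the rows
m_col
m_row

# Reshaping an existing matrix reads its elements in column-major order
matrix(m_col, nrow = 3)
```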

## Transpose of a Matrix

In [None]:
t(W)

## Element-wise operations on matrices 

In [None]:
A = matrix(1:12, nrow=3)
B = matrix(1:12, nrow=3) * 2
print(A)
print(B)

In [None]:
A + B  # Addition
A * B  # Multiplication
A / B  # Division
A ^ B  # Exponent
B %% A # Modulo

## Matrix-matrix products

In [None]:
A %*% t(B)

In [None]:
t(A) %*% B
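The transposes above are needed because `A` and `B` are both 3 x 4, so `A %*% B` is not conformable. The following sketch redefines `A` and `B` so it runs on its own and checks the resulting shapes; `crossprod` and `tcrossprod` are base R shortcuts for the same two products.

```r
A <- matrix(1:12, nrow = 3)   # 3 x 4
B <- A * 2                    # 3 x 4

dim(A %*% t(B))   # (3 x 4) times (4 x 3) gives 3 x 3
dim(t(A) %*% B)   # (4 x 3) times (3 x 4) gives 4 x 4

# crossprod(A, B) computes t(A) %*% B; tcrossprod(A, B) computes A %*% t(B)
all.equal(crossprod(A, B), t(A) %*% B)
all.equal(tcrossprod(A, B), A %*% t(B))

# A %*% B itself fails with a "non-conformable arguments" error
bad <- try(A %*% B, silent = TRUE)
inherits(bad, "try-error")
```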

## Common linear algebra routines

In [None]:
eigen(A %*% t(B))

In [None]:
eigen(t(A) %*% B)
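`eigen()` returns a list with components `values` and `vectors`. As a rough check, and assuming the same `A` and `B` as above (redefined here so the cell is self-contained), the symmetric matrix can be rebuilt from its decomposition. The names `S`, `e`, and `S_rebuilt` are illustrative.

```r
A <- matrix(1:12, nrow = 3)
B <- A * 2
S <- A %*% t(B)    # equals 2 * A %*% t(A), so S is symmetric

e <- eigen(S)
e$values           # eigenvalues, sorted in decreasing order
e$vectors          # one eigenvector per column

# Rebuild S as V %*% diag(lambda) %*% t(V), valid because S is symmetric
S_rebuilt <- e$vectors %*% diag(e$values) %*% t(e$vectors)
all.equal(S, S_rebuilt)
```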

## Creating a Diagonal Matrix

In [None]:
X = 1:10
X

In [None]:
diag(X)

## Random numbers

In [None]:
# Sampling from a uniform distribution over [0,1] interval.
runif(1)

In [None]:
# Change of variables to sample from alternative distributions
# qnorm(runif(1)) 

In [None]:
# Draw multiple samples from same underlying distribution
runif(10)

In [None]:
# Sample uniformly on different interval (other than [0,1])
runif(10, min=4, max=357)

In [None]:
# Sample integers uniformly from 0 to 99 by flooring uniform draws on [0, 100)
floor(runif(10, min=0, max=100))

In [None]:
# sample from values in a given vector (with replacement)
sample(1:100, 10, replace=TRUE)

In [None]:
# sample from values in a given vector (without replacement)
sample(1:100, 10, replace=FALSE)

In [None]:
# doesn't matter what values
sample(-10:-15, 3, replace=FALSE)

In [None]:
# sample from normal distribution (similar to `runif` but with `rnorm`)
rnorm(10)

In [None]:
# cannot specify min or max but can specify mean and sd
rnorm(10, mean=-5, sd=2)

In [None]:
x <- rnorm(400, mean=50, sd=10)
hist(x)

In [None]:
# rnorm() and friends are convenience functions; much of the same work can be done with a change of variables.
# E.g., to turn uniform draws on [0,1] into draws from a Gaussian with mean 0 and sd 1, apply the normal quantile function:
x <- qnorm(runif(10000000))
hist(x)
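The same change-of-variables idea (inverse transform sampling) works for any distribution whose quantile function is available. As a sketch, here it is for an exponential distribution, where the inverse CDF can also be written by hand; the seed and the rate of 2 are arbitrary choices for illustration.

```r
set.seed(1)                         # arbitrary seed, only for reproducibility
u <- runif(10000)
rate <- 2                           # illustrative rate parameter

x_manual <- -log(1 - u) / rate      # inverse CDF of the exponential, by hand
x_quant  <- qexp(u, rate = rate)    # the same transform via the quantile function

all.equal(x_manual, x_quant)
mean(x_manual)                      # should be close to 1 / rate = 0.5
```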

### Install Tidyverse

In [None]:
install.packages("tidyverse")  

In [None]:
library("tidyverse")

In [None]:
lubridate::now()

In [None]:
require("lubridate")

In [None]:
now()

In [None]:
ggplot2::mpg

In [None]:
mpg <- ggplot2::mpg 

In [None]:
?mpg

In [None]:
rownames(mpg)
colnames(mpg)

In [None]:
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))


## Mechanics of `ggplot2`
 * Create a plot by calling `ggplot(data = DATASET)`, passing the data set as an argument.
 * Add one or more layers to the plot.
 
**Template**

```
ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
```

### Learn more about the data. 
 * What's in mpg? 
 * What does `displ` mean?
 * What is `drv`?

In [None]:
?mpg

### Aesthetic plots
Aesthetics let us add information to a 2D plot by mapping a further variable to a visual property of each point, such as its size, shape, or color. This can call attention to a subset of the data, for example the category each point belongs to.

For example...

In [None]:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

Because it's R, there are too many ways to do this. Both "color" and "colour" work.

#### Aesthetic plots with size

In [None]:
?mpg

In [None]:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = year, y = hwy, size = displ))

#### Using Alpha (transparency) and Shape


In [None]:
# Top
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, alpha = class))

# Bottom
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))

### Coloring all the data points
To give every point the same color, pass `color` as an argument to the geom function, outside the call to `aes()`.

In [None]:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

## Some quick exercises
 *  What’s gone wrong with this code? Why are the points not blue?


In [None]:

    ggplot(data = mpg) + 
      geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))



 *   Which variables in mpg are categorical? Which variables are continuous? 

 *   Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?

 *   What happens if you map the same variable to multiple aesthetics?

 *   What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)

 *   What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)? Note, you’ll also need to specify x and y.



## Facets
Facets allow us to stratify the data by some feature. For example, we can draw a separate panel for each class of vehicle:

In [None]:
ggplot(data = mpg) + 
  geom_line(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

## Smooth plots
Sometimes we want to smooth the data, showing a fitted trend line together with a confidence band around it.

In [None]:
# top
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

# bottom
ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy))

### With a separate line type broken down by drive type

In [None]:
ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))

### Overlaying the raw data

In [None]:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth()


## Using `filter()` to specify different data for each layer

In [None]:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth(data = filter(mpg, class == "suv"), se = FALSE)

### Bar charts


In [None]:
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut))

### Filling in the bar plot by stratifying

In [None]:
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity))

### Tibbles
Tibbles are a modern replacement for `data.frame` that works well with the tidyverse libraries.
Most standard data frames can be converted to tibbles with `as_tibble()`.

In [None]:
# as_tibble(iris)
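One practical difference shows up when subsetting. The sketch below uses the built-in `iris` data frame (the name `iris_tbl` is illustrative) to show that `[` on a tibble keeps returning a tibble, while a plain data frame drops a single column to a vector.

```r
library(tibble)

iris_tbl <- as_tibble(iris)
class(iris_tbl)                # "tbl_df" "tbl" "data.frame"

# A data.frame simplifies a single column to a vector
class(iris[, "Species"])       # "factor"

# A tibble stays a tibble
class(iris_tbl[, "Species"])   # "tbl_df" "tbl" "data.frame"
```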

### Easily specify ranges, constants and functions

In [None]:
tb <- tibble(
  x = 1:5, 
  y = 1, 
  z = x ^ 2 + y
)

In [None]:
tb

In [None]:
rownames(tb)
colnames(tb)


## Creating tibbles with `tribble`
* Stands for transposed tibble

In [None]:
demo <- tribble(
  ~cut,         ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good",  12082,
  "Premium",    13791,
  "Ideal",      21551
)
demo

In [None]:
ggplot(data = demo) +
  geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")

### Stat_summary

In [None]:
ggplot(data = diamonds) + 
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,   # fun.ymin, fun.ymax and fun.y were deprecated in ggplot2 3.3.0
    fun.max = max,
    fun = mean
  )
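To make explicit what `stat_summary()` computes here, the same per-cut summary can be calculated by hand and drawn with `geom_pointrange()`, which is the default geom of `stat_summary()`. This is only a sketch: the column names `lower`, `middle`, and `upper` are made up for illustration, and `group_by()` and `summarise()` are the dplyr functions covered later in this notebook.

```r
library(dplyr)
library(ggplot2)

# One row per cut with the minimum, mean and maximum depth
by_cut <- group_by(diamonds, cut)
depth_summary <- summarise(by_cut,
  lower  = min(depth),
  middle = mean(depth),
  upper  = max(depth)
)
depth_summary

ggplot(data = depth_summary) +
  geom_pointrange(mapping = aes(x = cut, y = middle, ymin = lower, ymax = upper))
```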

## Box Plots and coordinate systems

In [None]:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot()
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot() +
  coord_flip()

### Lubridate

In [None]:
lubridate::now()

In [None]:
?lubridate

In [None]:
 df <- data.frame(Date = c("10/9/2009 0:00:00", "10/15/2009 0:00:00"))
 as.Date(df$Date, "%m/%d/%Y %H:%M:%S")

In [None]:
mytime<-lubridate::ymd_hms("2015-08-14-05-30-00", tz="America/Halifax")
mytime

In [None]:
library("lubridate")

In [None]:
year(mytime)
month(mytime)
day(mytime)
leap_year(mytime)
weekdays(mytime)

In [None]:
weekdays(now())

### Random numbers
 * Sample from the uniform distribution over [0,1] with `runif(n)`

 * Sample from Gaussian with `rnorm()`

In [None]:
length(rnorm(10, mean=1, sd=1))

### Sampling from Bernoulli and Binomial RVs.

In [None]:
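num_draws = 5    # number of binomial random values to generate
num_trials = 10  # Bernoulli trials per draw (use 1 for plain Bernoulli draws)
prob = .5        # success probability of each trial

rbinom(num_draws, num_trials, prob)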
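The link between the two distributions can be sketched directly: a Bernoulli draw is a binomial draw with a single trial, and summing independent Bernoulli draws gives one binomial draw. The seed and the names `flips` and `avg` are arbitrary choices for illustration.

```r
set.seed(42)   # arbitrary seed, only for reproducibility
p <- 0.5

# Bernoulli draws: a binomial with size = 1, so every value is 0 or 1
flips <- rbinom(10, size = 1, prob = p)
flips

# Summing 10 Bernoulli draws gives one Binomial(10, p) draw
sum(flips)

# The mean of many Binomial(10, p) draws should approach size * prob = 5
avg <- mean(rbinom(100000, size = 10, prob = p))
avg
```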

In [None]:
normal <- tibble(
  x = seq(-5,5, .01), 
  y = dnorm(x)
)

In [None]:
ggplot(data = normal) + 
    geom_line(mapping = aes(x = x, y = y))

## Data transformation with `dplyr`


In [None]:
install.packages("nycflights13")

In [None]:
library(nycflights13)
library(tidyverse)

In [None]:
flights

### Key dplyr functionality

 * Pick observations by their values (filter()).
 * Reorder the rows (arrange()).
 * Pick variables by their names (select()).
 * Create new variables with functions of existing variables (mutate()).
 * Collapse many values down to a single summary (summarise()).
 
 

### Filtering

In [None]:
filter(flights, month == 1, day == 1)

### Avoiding numerical precision errors with `near()`

In [None]:
filter(flights, near(month, 1))
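Since `month` is stored as an integer, `==` would be exact in the cell above; `near()` matters when comparing floating-point values. A short sketch of the kind of rounding problem it avoids:

```r
library(dplyr)

sqrt(2)^2 == 2          # FALSE: floating-point rounding
near(sqrt(2)^2, 2)      # TRUE: equal within a small tolerance

1 / 49 * 49 == 1        # FALSE
near(1 / 49 * 49, 1)    # TRUE
```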

## Filtering on disjunctions

In [None]:
filter(flights, month == 11 | month == 12)


### Shorthand: filtering on membership in a set with `%in%`



In [None]:
nov_dec <- filter(flights, month %in% c(11, 12))

### Handling missing values
 * Check for missing values with `is.na(x)`
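Missing values are contagious in comparisons, which affects filtering. The sketch below uses a small made-up data frame `df` to show that `filter()` keeps only rows where the condition is `TRUE`, so rows with `NA` are dropped unless they are requested explicitly.

```r
library(dplyr)

NA > 5      # NA: almost any operation involving NA yields NA
NA == NA    # NA as well, which is why is.na() is needed

df <- data.frame(x = c(1, NA, 3))   # illustrative data
kept <- filter(df, x > 1)                   # drops both the FALSE row and the NA row
kept_na <- filter(df, is.na(x) | x > 1)     # keeps the NA row explicitly
kept
kept_na
```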

## Arranging
* Reorder the rows. If you supply multiple columns, each successive column is used to break ties in the values of the preceding columns.

In [None]:
arrange(flights, desc(dep_delay))


In [None]:
arrange(flights, year, month, day)
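Tie-breaking is easier to see on a tiny example. In this sketch the data frame `df` and its columns `g` and `v` are made up for illustration:

```r
library(dplyr)

df <- data.frame(g = c(2, 1, 2, 1), v = c(10, 30, 20, 40))

arrange(df, g)                        # sorted by g only
sorted <- arrange(df, g, desc(v))     # ties in g broken by v, largest first
sorted
```

In `sorted`, the two rows with `g == 1` come first, ordered by decreasing `v`, followed by the two rows with `g == 2`.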

## Selecting columns

In [None]:
select(flights, year, month, day)

## Select all columns in some range (inclusive)

In [None]:
select(flights, year:day)

### Add columns (e.g., calculated columns) with `mutate()`

In [None]:
flights_sml <- select(flights, 
  year:day, 
  ends_with("delay"), 
  distance, 
  air_time
)
mutate(flights_sml,
  gain = dep_delay - arr_delay,
  speed = distance / air_time * 60
)

## Grouping operations together



In [None]:
by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
)
delay <- filter(delay, count > 20, dest != "HNL")

# It looks like delays increase with distance up to ~750 miles 
# and then decrease. Maybe as flights get longer there's more 
# ability to make up delays in the air?
ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) +
  geom_smooth(se = FALSE)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
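The intermediate variables above can be avoided by chaining the steps with the pipe operator `%>%`, which the tidyverse loads. The following is a sketch of the same computation written that way; the name `delays` is chosen here only to avoid overwriting `delay`.

```r
library(nycflights13)
library(dplyr)

# Group, summarise and filter in one chain; each step receives the previous result
delays <- flights %>%
  group_by(dest) %>%
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20, dest != "HNL")

delays
```

Reading the chain top to bottom mirrors the order of operations, which tends to be easier to follow than nested calls or a series of reassigned variables.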