# Today's Topics

### 1) R, its pros and cons

### 2) R coding practices and time

### 3) Apply functions in R

### 4) Parallel Processing in R

### 5) Useful packages in R (sqldf, dplyr, ggplot2)

### 1. The Pros  

1) The large, vibrant R community (with a strong base in academia)

2) Visualization packages (ggvis, ggplot2, googleVis, rCharts) 

3) Easy to start learning even with little programming background

4) Ideal for individual servers and data analysis  

5) Once you reach a certain skill level, a few lines of code can do a great deal of work


### 2. The Cons (although rather misunderstood)  

1) Limited computational speed 

2) Steep learning curve (can involve a lot of googling)



# Ways of Getting Around the Cons 


## 1. Code better!  
a) Avoid for loops! (well, for certain obvious cases!)   
b) Avoid supplying function arguments that don't need to be supplied  
c) UTILIZE VECTORIZATION!  
d) Avoid unnecessary ()s  
e) If possible, use matrices instead of data.frames for intensive calculations
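
As a rough illustration of points (a) and (c), the sketch below raises every element of a small matrix to the 10th power twice: once with a nested for loop and once with a single vectorized expression. The matrix size and the names `m`, `looped`, and `vectorized` are made up for this example.

```r
# A small matrix to keep the example quick; the size is arbitrary
m <- matrix(as.numeric(1:20000), ncol = 10)

# Element-by-element version with a nested for loop
looped <- m
for (i in seq_len(nrow(looped))) {
  for (j in seq_len(ncol(looped))) {
    looped[i, j] <- looped[i, j]^10
  }
}

# Vectorized version: ^ operates on the whole matrix at once
vectorized <- m^10

# Both approaches should give the same values
all.equal(looped, vectorized)
```

Wrapping each version in system.time() on a larger matrix is one way to see how much the loop costs.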

   
## 2. Use R-spinoffs if speed is your priority
a) pqR (a "pretty quick" R)    
b) Renjin (R on the JVM)  
   
## 3. Use alongside other languages
a) MATLAB is extremely fast in terms of matrix calculations  
b) Python is useful too! (pandas, scikit-learn)  
c) Jupyter is a nice way to use multiple languages  

## 4. Use parallel processing features on R
a) R uses one core by default  
b) If the task is "embarrassingly parallel", spreading it across several cores is a good option  
c) The `parallel` package, which ships with R, provides the tools  

## 5. Get better hardware
a) More memory (R keeps its objects in physical memory)  
b) Better processor


## Performance is very important in coding, especially when dealing with large data


In [None]:
# UPDATE your R! 3.3.2 was released on the 31st of October
# Copy this over to your RStudio or R GUI (installr works on Windows only)

install.packages("installr",repos = 'http://cran.us.r-project.org')
library(installr)
updateR()


In [6]:
library(parallel)
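# Example of a simple numerical computation done on a matrix.
# The argument x is ignored (it is overwritten before it is ever read),
# so create(x) works even when no x exists yet in the workspace.
create <- function(x){
  x <- 1:100000                                      # the integers 1 to 100000
  x <- cbind(x,x+1,x+2,x+3,x+4,x+5,x+6,x+7,x+8,x+9)  # 100000 x 10 matrix
  return(x)
}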

head(create(x))

     x
[1,] 1 2 3 4  5  6  7  8  9 10
[2,] 2 3 4 5  6  7  8  9 10 11
[3,] 3 4 5 6  7  8  9 10 11 12
[4,] 4 5 6 7  8  9 10 11 12 13
[5,] 5 6 7 8  9 10 11 12 13 14
[6,] 6 7 8 9 10 11 12 13 14 15


In [7]:
# Now let's see the performance of a simple for loop in R!
# Traversing every element of a matrix takes a nested for loop

x <- create(x)
system.time(
for(i in seq_len(nrow(x))){
  for(j in seq_len(ncol(x))){
    x[i,j] <- x[i,j]^10
    }
})
rm(x)

# The loop leaves its counters i and j behind in the workspace,
# which is poor practice, so remove them as well
rm(i,j)

   user  system elapsed 
   2.66    0.00    2.65 

In [8]:
# Let's look at the apply functions. They do the same task, but what happens to the code?

# apply by column: FUN receives one whole column (100000 values) per call
x <- create(x)
system.time( 
    x <- apply(x, 2, FUN = function(x){x^10}) 
)
rm(x)

# apply by row: FUN receives one row (10 values) per call
# (the result comes back transposed, 10 x 100000)
x <- create(x)
system.time(
x <- apply(x, 1, FUN = function(x){x^10})
)
rm(x)

# apply by individual element: FUN receives a single value per call
x <- create(x)
system.time(
x <- apply(x, c(1,2), FUN = function(x){x^10})
)
rm(x)

   user  system elapsed 
   0.06    0.03    0.09 

   user  system elapsed 
   0.84    0.00    0.92 

   user  system elapsed 
   2.49    0.02    2.55 

In [9]:
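# sapply and lapply, other members of the apply family
# Both visit the matrix one element at a time (1,000,000 calls to FUN)

# sapply: returns a plain numeric vector, so the matrix shape is lost
x <- create(x)
system.time(
    x <- sapply(x, FUN = function(x){x^10})
)
rm(x)

# lapply: returns a list with one entry per element
x <- create(x)
system.time(
    x <- lapply(x, FUN = function(x){x^10})
)
rm(x)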

   user  system elapsed 
    1.7     0.0     1.7 

   user  system elapsed 
   1.32    0.00    1.32 

In [None]:
# Parallel processing (advanced)
x <- create(x)

# Detect the number of cores and create a cluster, leaving one core free
# (the first run may trigger a firewall notification)
n_core <- max(1, detectCores() - 1)
cl <- makeCluster(n_core)

system.time(
    x <- parApply(cl = cl, x, 2, FUN = function(x){x^10})
)

stopCluster(cl)
rm(cl,n_core)

## Why is parApply() slower than apply()?
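
One likely explanation is overhead: parApply() has to split the matrix, send the pieces to the worker processes, and collect the results, and for an operation as cheap as x^10 that communication can cost more than the computation it saves. The sketch below assumes a local cluster of 2 workers can be started and uses a made-up small matrix `m`; it checks that both calls return the same result, so that only their timings differ.

```r
library(parallel)

m <- matrix(as.numeric(1:20000), ncol = 10)  # small illustrative matrix

cl <- makeCluster(2)  # 2 local worker processes (an arbitrary choice)

# Time the serial and the parallel version of the same column-wise task
t_serial   <- system.time(serial_res   <- apply(m, 2, function(col) col^10))
t_parallel <- system.time(parallel_res <- parApply(cl, m, 2, function(col) col^10))

stopCluster(cl)

# Same values either way; compare the "elapsed" entries to see the overhead
all.equal(serial_res, parallel_res)
t_serial["elapsed"]
t_parallel["elapsed"]
```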

In [None]:
# Let's make the task a lot more computationally expensive!
# Both runs call the same function so that the timings are comparable.
heavy <- function(x){median(median(rnorm(x^20))^median(x^(1/8)) * median(rexp(x^20)))^20}

# non-parallel
x <- create(x)
system.time(
    x <- apply(x, 2, FUN = heavy)
)
rm(x)

# parallel
x <- create(x)
n_core <- max(1, detectCores() - 1)
cl <- makeCluster(n_core)

system.time(
    x <- parApply(cl = cl, x, 2, FUN = heavy)
)
stopCluster(cl)
rm(cl,n_core,heavy)

## In practice, system.time() and proc.time() are fairly imprecise  
### 1. Timings differ from run to run depending on the state of your computer  
### 2. Some calls take only nanoseconds or microseconds, too short for the functions above to resolve 

## Use microbenchmark instead! (will show in RStudio)

## Additional Tips

1) Avoid ifelse() where possible  
2) Don't use additional brackets if you don't need to  
3) Use .subset2() to access columns of a data frame
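
To make tip 1 concrete, here is one possible replacement for a simple ifelse() call. The task (clamping negative values to zero) and the names `v`, `with_ifelse`, `with_pmax`, and `with_index` are invented for the illustration; whether the alternatives are actually faster on a given machine is something to measure rather than assume.

```r
v <- c(-2, 5, -1, 3, 0)

# ifelse() builds the answer from a test vector and two alternatives
with_ifelse <- ifelse(v < 0, 0, v)

# pmax() takes the element-wise maximum of v and 0 directly
with_pmax <- pmax(v, 0)

# Logical indexing overwrites only the elements that meet the condition
with_index <- v
with_index[with_index < 0] <- 0

# All three approaches produce the same vector
identical(with_ifelse, with_pmax)
identical(with_ifelse, with_index)
```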

In [None]:
# Let's explore different ways of accessing a data point
x <- create(x)
x <- as.data.frame(x)
names(x) <- as.character(c(1:10))


x[1,][2]         # take the first row, then its second column (returns a 1x1 data frame)
x[1,2]           # row 1, column 2 in matrix notation (perhaps the most familiar)
x[,2][1]         # take the second column, then its first element
x$'2'[1]         # take the column named "2", then its first element
x[[2]][1]        # take the 2nd column as a vector, then its first element
x[['2']][1]      # take the column named "2" as a vector, then its first element
.subset2(x,2)[1] # .subset2 skips method dispatch: 2nd column, then first element

# Let's see which is fastest with microbenchmark!
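
A minimal sketch of that comparison, assuming the microbenchmark package is installed (it is not part of base R; `install.packages("microbenchmark")` should fetch it from CRAN). The data frame `df` is a smaller stand-in built for this example, and the timings will vary from machine to machine.

```r
# Smaller stand-in for the data frame above (1000 rows instead of 100000)
df <- as.data.frame(sapply(0:9, function(k) 1:1000 + k))
names(df) <- as.character(1:10)

# Run the benchmark only if the package is available
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  timings <- microbenchmark::microbenchmark(
    df[1, ][2],
    df[1, 2],
    df[, 2][1],
    df$'2'[1],
    df[[2]][1],
    df[['2']][1],
    .subset2(df, 2)[1],
    times = 1000  # repetitions per expression
  )
  print(timings)  # one row of summary timings per expression
}
```

Because each expression is repeated many times, the summary is far less sensitive to one-off fluctuations than a single system.time() call.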