<font style="color:#68829E; font-family:calibri; font-size:36px"> <b> Introduction: R and Jupyter </b> </font>

<font style="color:black; font-family:calibri; font-size:15px">
Welcome to <em>R</em>, on HCC/Binder, through Jupyter. <br>

<p style="margin-left: 2em">
<b><em>"R is a free software environment for statistical computing and graphics" </em></b><br>
<a href="https://www.r-project.org/"> www.r-project.org </a> <br> <br>
<b><em>"The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text" </em></b><br>
<a href="https://jupyter.org/"> www.jupyter.org </a> <br> <br>
</p>

Here, we'll touch upon some of the fundamentals of programming in the <em>R</em> language. Topics covered include functions, structures, packages, manipulation, graphics, saving and loading. <br><br>
First though, let's briefly talk Jupyter ...
</font>

<font style="color:#68829E; font-family:calibri; font-size:29px">
<b> Jupyter Notebook </b>
</font>
<br>
<font style="color:black; font-family:calibri; font-size:15px">
Split into a kernel (executes code) and a dashboard/editor (interface), Jupyter permits mixing live code with interactive rich-text outputs in self-contained <em>.ipynb</em> (JSON) documents (local or remote). <br>
Kernels include Julia, Python, R, MATLAB/Octave, etc. <br>
Interact with input cells (code, markdown, etc.), which have an order and a mode - insert, edit, run. <br>
</font>

``` $ jupyter notebook ```

<font style="color:#68829E; font-family:calibri; font-size:29px">
<b> R </b>
</font>
<br>
<font style="color:black; font-family:calibri; font-size:15px">
Starting <em>R</em> loads an interpreter and creates a workspace/environment to store values and perform commands - together called a 'session'. <br>
Interact with it directly in the terminal or indirectly through scripts (<em>.R</em> extension).
</font>

In [None]:
#R - load & open (terminal)
#module load R.3.4
#R

In [None]:
#get (/set) current/working directory
getwd()
#setwd('/home/XXX');

#list.files(); #RUN BELOW, INSERT NEW CELL ...

In [None]:
list.files()

<div class="alert alert-block alert-info" style="font-style:italic; font-size:13px">
<b>#Tip 1.</b>  a popular GUI alternative is RStudio.
</div>

<font style="color:#68829E; font-family:calibri; font-size:29px">
<b> Calculations </b>
</font>
<br>
<font style="color:black; font-family:calibri; font-size:15px">
Mathematical calculations - elements & arrays (/vectors). <br>
Example operators: <code>+</code>, <code>-</code>, <code>*</code>, <code>/</code>, <code>^</code>.
</font>

In [None]:
(20+20)/5

In [None]:
c(1,2,3,4) * c(1,2,3,4)

In [None]:
81^(1/2)

In [None]:
sqrt(81); #built-in code (base functions)
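As a small sketch of how the operators above behave on vectors, the cell below uses two made-up vectors, `a` and `b` (illustrative names, not part of the workshop data). It also shows `%/%` and `%%`, two further base operators for integer division and remainder.

```r
# element-wise arithmetic on two vectors of equal length
a <- c(1, 2, 3, 4)
b <- c(10, 20, 30, 40)
a + b      # 11 22 33 44
a * b      # 10 40 90 160

# a single value is recycled across the longer vector
a * 2      # 2 4 6 8

# integer division and remainder (modulo)
17 %/% 5   # 3
17 %% 5    # 2
```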

<font style="color:#68829E; font-family:calibri; font-size:29px">
<b> Functions </b>
</font>
<br>
<font style="color:black; font-family:calibri; font-size:15px">
Perform procedures - input to output, with options/arguments passed in <em>round brackets</em>. <br>
Combining elements with <code>c()</code> is one of the simplest 'base' functions. <br>
When an argument is not supplied, the function falls back on its default value, which is listed in the documentation (<code>?</code> / <code>help()</code>).
</font>

In [None]:
#nested example (runif = generate random number) 
sort( runif(10, min=0, max=1) , decreasing=FALSE, index.return=FALSE )

In [None]:
#check documentation (usage, arguments, details, examples, etc)
?runif
?sort

In [None]:
calculate_SE <- function(x){
    #define function to calculate standard error ({body - local env})
    return( sqrt(var(x) / length(x)) )
}

calculate_SE( c(10.4, 5.6, 3.1, 6.4, 21.7) ); #run function
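To illustrate default arguments, here is a minimal sketch with a hypothetical helper, `scale_by` (not a base function), alongside the base function `round()`, whose `digits` argument defaults to 0.

```r
# hypothetical helper: multiply a vector by 'factor' (default 2)
scale_by <- function(x, factor = 2) {
    x * factor
}

scale_by(c(1, 2, 3))                # default used: 2 4 6
scale_by(c(1, 2, 3), factor = 10)   # default overridden: 10 20 30

# base functions behave the same way
round(3.14159)                      # digits defaults to 0: 3
round(3.14159, digits = 2)          # 3.14
```

Naming arguments explicitly (`factor = 10`) keeps calls readable when a function has many options.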

<div class="alert alert-block alert-info" style="font-style:italic; font-size:13px">
<b>#Tip 2.</b>  functions saved in a script can be loaded into the session with source("Rcode.R").
</div>

<font style="color:#68829E; font-family:calibri; font-size:29px">
<b> Assignment </b>
</font>
<br>
<font style="color:black; font-family:calibri; font-size:15px">
The operators <code>&lt;-</code> or <code>=</code> assign and store outputs/results, as well as functions, in memory under a name (names are case-sensitive). <br> <br>
<em>R</em> operates on named data structures, that is, sets of ordered entities.<br>
Example vectors: numeric (double, integer, complex), logical, character.
</font>

In [None]:
x <- c(10.4, 5.6, 3.1, 6.4, 21.7); 
#x = c(10.4, 5.6, 3.1, 6.4, 21.7);

typeof(x)
x

In [None]:
y <- c("test", "control", "control", "control", "test");
#y = c("test", "control", "control", "control", "test");

typeof(y)
y
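The vector types listed above can be checked with `typeof()`. The sketch below uses throwaway values; `mixed` is an illustrative name showing what happens when types are combined in one vector.

```r
typeof(c(1.5, 2))        # "double"
typeof(c(1L, 2L))        # "integer" (L suffix marks integers)
typeof(1 + 2i)           # "complex"
typeof(c(TRUE, FALSE))   # "logical"
typeof(c("a", "b"))      # "character"

# a vector holds one type only: mixing coerces to the most general type
mixed <- c(1, "a", TRUE)
typeof(mixed)            # "character"
mixed                    # "1" "a" "TRUE"
```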

<div class="alert alert-block alert-info" style="font-style:italic; font-size:13px">
<b>#Tip 3.</b>  quotation characters, double (") and single ('), are interchangeable.
</div>

<font style="color:#68829E; font-family:calibri; font-size:29px">
<b> Structures </b>
</font>
<br>
<font style="color:black; font-family:calibri; font-size:15px">
While vectors are the most important type of object/structure/variable in <em>R</em>, others exist, each with intrinsic attributes (mode and length). Conversion between types uses functions named after the target type - <code>as.character</code>, <code>as.numeric</code>, <code>as.data.frame</code>. <br> <br>
Examples: matrices, lists, dataframes.
</font>

In [None]:
#matrices - visualize toy example
set.seed(5); #define seed to reproduce random results 
X = matrix( runif(20, min=0, max=10) , nrow=4, ncol=5, byrow=FALSE);

dim(X)
X
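Lists and type conversion can be sketched in the same way. The names `sample_info` and `converted` below are made up for illustration.

```r
# a list can hold elements of different types and lengths
sample_info <- list(id = "exp1", values = c(10.4, 5.6, 3.1), passed = TRUE)
length(sample_info)       # 3 elements
sample_info$values        # access an element by name
sample_info[["id"]]       # same idea, with double square brackets

# conversion between types
as.numeric("3.5") + 1     # 4.5
as.character(42)          # "42"

# matrix to dataframe (columns are named V1, V2, ... by default)
converted <- as.data.frame(matrix(1:6, nrow = 2))
dim(converted)            # 2 3
names(converted)          # "V1" "V2" "V3"
```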

<font style="color:#68829E; font-family:calibri; font-size:29px">
<b> Indexing </b>
</font>
<br>
<font style="color:black; font-family:calibri; font-size:15px">
Parts of a dataset are accessed with <em>square brackets</em>, rows then columns, with indices starting at 1; the same selection can be coded multiple ways. <br>
The colon operator <code>:</code> generates a regular sequence from start to stop (by 1).
</font>

In [None]:
X[2, 4]

In [None]:
t(X)[2, 4]; #transpose (rows to columns, columns to rows)

In [None]:
X[c(2,4), 4] = c(0,10); #replace elements
X

In [None]:
X[1:3, 2:3]

cbind(X[1:3, 2], X[1:3, 3]); #concatenate columns

rbind(X[1, 2:3], X[2, 2:3], X[3, 2:3]); #concatenate rows
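A few more indexing forms, sketched on a small integer matrix `M` and a named vector `v` (both illustrative, kept separate from `X` so the cells above are unaffected).

```r
M <- matrix(1:12, nrow = 3, ncol = 4)   # filled column by column
M[2, ]               # whole second row: 2 5 8 11
M[, 4]               # whole fourth column: 10 11 12
M[-1, 1]             # negative index drops row 1: 2 3
M[M[, 1] > 1, 1:2]   # logical index: rows where column 1 exceeds 1

v <- c(a = 1, b = 2, c = 3)
v["b"]                       # index by name
v[c(TRUE, FALSE, TRUE)]      # index by logical vector: a and c
```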

<font style="color:#68829E; font-family:calibri; font-size:29px">
<b> Conditions </b>
</font>
<br>
<font style="color:black; font-family:calibri; font-size:15px">
Checks that control the flow of actions/execution (if / else). <br>
Comparison: less than (<code>&lt;</code>), greater than (<code>&gt;</code>), equal (<code>==</code>), different (<code>!=</code>). <br>
Logical/Boolean: AND (<code>&amp;</code>) if both <em>true</em>, OR (<code>|</code>) if either <em>true</em>, NOT (<code>!</code>) converts <em>true</em> to <em>false</em>.
</font>

In [None]:
X[X > 5]; #output - free from dimensions (byrow=FALSE)
X

In [None]:
X[X >= 6 & X <= 7]; #across whole matrix
X[1, X[1, ] >= 6 & X[1, ] <= 7]; #only first row

In [None]:
if (dim(X)[1] != 4) {
    print("check - dimensions don't match"); #first check cond.1, TRUE do something
} else if (dim(X)[2] != 5) {
    print("check - dimensions don't match"); #FALSE then try cond.2, TRUE now do something different
} else {
    print("good - dimensions match"); #FALSE again, do something else
}

In [None]:
y[y == "test"]
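Comparisons work element by element, whereas `if` expects a single TRUE/FALSE. The sketch below, on a made-up vector `scores`, shows both, along with the base helpers `ifelse()`, `any()` and `all()`.

```r
scores <- c(3, 8, 5, 10)

scores > 4 & scores < 9               # FALSE TRUE TRUE FALSE
ifelse(scores >= 5, "high", "low")    # "low" "high" "high" "high"

any(scores > 9)                       # TRUE  (at least one element)
all(scores > 9)                       # FALSE (not every element)

# inside if(), use && / || which work on single values
if (length(scores) == 4 && all(scores > 0)) {
    print("good - four positive scores")
}
```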

<div class="alert alert-block alert-info" style="font-style:italic; font-size:13px">
<b>#Tip 4.</b>  use tab for auto-completion & arrow keys for command history.
</div>

In [None]:
#CHECK-IN POINT
#EXERCISE - create new variable, randlett, of 2000 randomly sampled ?LETTERS (hint), and count occurrences (a-z).

#EXERCISE - define new function, roundndivide, to take two numeric/double inputs (x1 and x2), rounds to nearest integers,
#divides integers, and returns logical output (TRUE/FALSE) if absolute of result is greater than 1. 


<font style="color:#68829E; font-family:calibri; font-size:29px">
<b> Iterations </b>
</font>
<br>
<font style="color:black; font-family:calibri; font-size:15px">
Scalar loops, element by element, can be used to repeat procedures (for / while). <br>
Vectorized functions, which act on all elements at once, are usually faster and scale better, since the looping is done in compiled Fortran / C / C++ routines. <br>
The apply family - <code>apply</code> (array), <code>lapply</code> (list), <code>sapply</code> (vector) - expresses such repetition compactly, reducing redundancy in the code.
</font>

In [None]:
#initiate variable of NA (missing values) to fill during loop (preallocate - growing objects inefficient)
colMean = rep(NA, times=dim(X)[2]);

for (i in 1:dim(X)[2]){
    #variable i in sequence start:end
    print(i)
    colMean[i] = mean(X[ ,i])
    #break;#next
}

colMean

In [None]:
colMean = apply(X, 2, mean); #column by column
colMean

#colMeans(X) #built-in

In [None]:
rowSE = apply(X, 1, calculate_SE); #row by row
rowSE
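For lists, `lapply` and `sapply` play the role that `apply` plays for matrices. The sketch below uses a made-up list, `measurements`, and adds a short `while` loop for comparison.

```r
measurements <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

lapply(measurements, mean)    # returns a list
sapply(measurements, mean)    # simplifies to a named vector: a=2, b=15, c=5

# while loop: repeat until a condition fails
n <- 1
steps <- 0
while (n < 100) {
    n <- n * 2
    steps <- steps + 1
}
steps                         # 7 doublings to pass 100
```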

<div class="alert alert-block alert-info" style="font-style:italic; font-size:13px">
<b>#Tip 5.</b>  time your code with system.time().
</div>
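As a rough sketch of the tip above, the cell below times a scalar loop against the vectorized `sum()` on one million random values. Timings vary by machine, so none are quoted here; `vals` and `total` are illustrative names.

```r
vals <- runif(1e6)

# scalar loop: add one element at a time
system.time({
    total <- 0
    for (v in vals) {
        total <- total + v
    }
})

# vectorized: the loop runs in compiled code
system.time(sum(vals))
```

Both approaches give the same total up to floating-point rounding; only the elapsed time differs.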

<font style="color:#68829E; font-family:calibri; font-size:29px">
<b> Packages </b>
</font>
<br>
<font style="color:black; font-family:calibri; font-size:15px">
Code/functions developed by the community are organised and shared online in repositories. Packages can be retrieved/downloaded (as source (needs compiling), bundled (compressed) or binary (pre-built)), installed into directories (libraries) and loaded into memory for use. Packages will often require other packages - dependencies. <br> <br>
Repositories: CRAN (official network), Bioconductor (bioinformatic specific), GitHub. <br>
Packages: <code>tidyverse</code>, <code>shiny</code>, <code>dada2</code>, <code>phyloseq</code>, <code>MetaboAnalystR</code>, <code>lme4</code>.
</font>

In [None]:
#lapply(.libPaths(), dir)
lapply(.libPaths(), list.files) #print available/downloaded packages

In [None]:
#install.packages('ggplot2')
library(ggplot2) #load into session

#source("https://bioconductor.org/biocLite.R");biocLite()
#install.packages('devtools');devtools::install_github()

In [None]:
#packages can contain data also 
data(package = "ggplot2")

In [None]:
#data("mpg", package = "ggplot2")
data("midwest", package = "ggplot2"); #call dataset (midwest demographics) into memory/env

class(midwest)
dim(midwest)

#?detach();#?remove.packages()
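One common pattern, sketched below, is to install a package only when it is missing. The `stats` package is used here because it ships with R, so nothing is actually downloaded; swap in any package name for `pkg`.

```r
pkg <- "stats"   # example package name (ships with R)

# install only if the package cannot be found
if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
}

# a function can be called without library() using package::function
stats::median(c(1, 5, 9))   # 5
```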

<font style="color:#68829E; font-family:calibri; font-size:29px">
<b> Dataframes </b>
</font>
<br>
<font style="color:black; font-family:calibri; font-size:15px">
A special list of vectors of equal length that suggests an implicit relationship between elements. <br>
Without such restrictions, variable type would default to a basic list.
</font>

In [None]:
head(midwest, n=2)

In [None]:
print(colnames(midwest))
#rownames(midwest)

In [None]:
#get unique 'states' (with example indexing)
unique(midwest$state)
unique(midwest[ ,"state"])
unique(midwest[ ,3])
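That a dataframe is a list of equal-length vectors can be checked directly. The sketch below builds a tiny made-up dataframe, `toy`, rather than using `midwest`.

```r
toy <- data.frame(county = c("A", "B", "C"),
                  pop = c(100, 250, 75),
                  stringsAsFactors = FALSE)

is.list(toy)      # TRUE - a dataframe is a list underneath
length(toy)       # 2 - one list element per column
nrow(toy)         # 3 - every column has the same length

# three equivalent ways to pull out one column as a vector
toy$pop
toy[["pop"]]
toy[, "pop"]

str(toy)          # compact summary of column types
```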

<div class="alert alert-block alert-info" style="font-style:italic; font-size:13px">
<b>#Tip 6.</b>  tibbles are modern dataframes with subtleties (printing & subsetting).
</div>

<font style="color:#68829E; font-family:calibri; font-size:29px">
<b> Manipulation </b>
</font>
<br>
<font style="color:black; font-family:calibri; font-size:15px">
Packages <code>dplyr</code> and <code>tidyr</code> are popular tools to manipulate (<em>select, exclude, add</em>) the rows/columns of dataframes. <br>
Base functions exist also and again can be coded multiple ways.
</font>

In [None]:
#base method (logical TRUE/FALSE)
datOH = midwest[midwest$state=="OH", ];datOH[1:2,]
#datOHc = midwest[midwest$state=="OH", 2];datOHc[1:2,]

In [None]:
#base method (indexed INTEGER)
indexOH = which(midwest$state=="OH");indexOH[1:10]
#datOH = midwest[indexOH, ];head(datOH)

#datminusOH = midwest[-indexOH, ]; #exclude (-)

In [None]:
#install.packages('dplyr')
library(dplyr)

In [None]:
#dplyr uses '%>%' as a pipe - pass left side as first argument of right side
unique(midwest$state)
distinct(midwest, state)
midwest %>% distinct(state)

In [None]:
#dplyr filters rows & selects columns (%in% = belongs to/member of | invert with !(x %in% y))
datsubAAR = midwest %>%
    #filter(category=="AAR") %>%
    filter(category %in% "AAR" & state %in% "IL") %>% #multiple cond. (category & state)
    #arrange(desc(popdensity)) %>% 
    select(county, state, popdensity, popwhite, popblack, popamerindian, popasian, category)

datsubAAR

In [None]:
#add new columns (using current data or not)
datsubAAR = datsubAAR %>%
    mutate(poptotalrecal = popwhite + popblack + popamerindian + popasian,
           fakegroups = rep( c("North","South","Central"), length.out=n()) ) #n() = number of rows, so labels always fit
    #transmute(poptotalrecal = popwhite + popblack + popamerindian + popasian)

datsubAAR

In [None]:
#dplyr has helper functions for ease/speed
datsubAAR = datsubAAR %>%
    #select(-starts_with("poptotal")) #exclude (-)
    #select(-ends_with("recal")) #exclude (-)
    select(-contains("recal")) #exclude (-)

datsubAAR

In [None]:
#summarise data (statistics - with or without groups)
datsubAAR %>%
    summarise(mean(popwhite), median(popdensity), n())

#?group_by(); #operations according to groups
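Grouped summaries can also be coded with base functions. The sketch below is a base R counterpart (not the dplyr approach itself), using a made-up dataframe `toy` with columns `group` and `value`.

```r
toy <- data.frame(group = c("control", "control", "test", "test"),
                  value = c(10, 12, 11, 39),
                  stringsAsFactors = FALSE)

# mean of 'value' within each 'group', returned as a named array
group_means <- tapply(toy$value, toy$group, mean)
group_means                    # control 11, test 25

# same summary, returned as a dataframe
group_table <- aggregate(value ~ group, data = toy, FUN = mean)
group_table
```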

In [None]:
#install.packages('tidyr')
library(tidyr)

In [None]:
#dataframes - visualize toy example
newdf = data.frame(
        sampleID = c("exp1","exp2","exp3","exp4"),
        groupID = c("control","control","test","test"),
        tp1 = c(10, 12, 11, 11), #measurement @ timepoint-1
        tp2 = c(20, 20, 38, 39), #measurement @ timepoint-2
        tp3 = c(20, 20, 50, 52) #measurement @ timepoint-3
)

newdf

In [None]:
#reshape/format - wide to long (ggplot2 likes)
newdflong <- gather(newdf, tPoint, tMeasure, tp1:tp3, factor_key=TRUE)
newdflong

#?spread(); #reshape/format - long to wide
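For comparison, the same wide-to-long reshape can be sketched with the base function `reshape()`. The toy dataframe is rebuilt here so the cell runs on its own, and the result name `longbase` is illustrative.

```r
newdf <- data.frame(
        sampleID = c("exp1", "exp2", "exp3", "exp4"),
        groupID = c("control", "control", "test", "test"),
        tp1 = c(10, 12, 11, 11),
        tp2 = c(20, 20, 38, 39),
        tp3 = c(20, 20, 50, 52),
        stringsAsFactors = FALSE
)

# stack columns tp1:tp3 into one measurement column
longbase <- reshape(newdf, direction = "long",
                    varying = c("tp1", "tp2", "tp3"),  # columns to stack
                    v.names = "tMeasure",              # name for stacked values
                    timevar = "tPoint",                # name for the label column
                    times = c("tp1", "tp2", "tp3"),    # labels, one per stacked column
                    idvar = "sampleID")                # identifies each sample

dim(longbase)   # 12 rows: 4 samples x 3 timepoints
longbase
```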

In [None]:
#CHECK-IN POINT 
#EXERCISE - print group statistics - three of your choice - for 'percadultpoverty' and 'percadultpoverty', 
#using 'fakegroups', in dataframe datsubAAR (with dplyr). 

#EXERCISE - reshape dataframe newdflong from long to wide as new dataframe newdfwide (with tidyr).


<font style="color:#68829E; font-family:calibri; font-size:29px">
<b> Graphics </b>
</font>
<br>
<font style="color:black; font-family:calibri; font-size:15px">
For visualizations package <code>ggplot2</code> is preferred over base functions (<code>plot()</code>, <code>lines()</code>, <code>barplot()</code>).<br>
Mapping data, which can come from multiple sources (dataframes), to graphic attributes (<code>x</code>, <code>y</code>, <code>color</code>, <code>fill</code>, <code>group</code>) is done through aesthetics <code>(aes)</code>. <br>
Layers (geoms, stats, position, annotation), scales, facets, coordinate systems, themes can all then be added with <code>+</code>. <br><br>
Example geoms: <code>geom_point</code>, <code>geom_line</code>, <code>geom_bar</code>, <code>geom_boxplot</code>, <code>geom_polygon</code>, <code>geom_path</code>.
</font>

In [None]:
plt1 = ggplot( data=midwest, aes(x=area, y=poptotal, color=state) ) +
    #scatter geometric object (points = observation/county)
    geom_point() + ggtitle("plot1")
    
plt1

In [None]:
plt2 = ggplot( data=midwest, aes(x=poptotal, y=popasian) ) +
    #scatter geometric object (points = observation/county)
    geom_point( aes(color=county), shape=18, size=3.5, show.legend=FALSE ) + ggtitle("plot2") +
    #subplot dataset by column/variable (state)
        facet_wrap( ~state, scales="fixed", nrow=2 ) 
        #facet_wrap( ~state, scales="free", nrow=2 ) 
      
plt2

In [None]:
plt3 = plt2 + ggtitle("plot3") +
    facet_wrap( ~state, scales="free", nrow=2 ) +
    #statistical transformation (here groups = facets / local vs global)
        stat_smooth( method = "lm", formula=y~x, size=1, se=TRUE, color="black" ) +
        stat_smooth( aes(group=state), method = "loess", formula=y~x, size=1, se=TRUE, color="red" )

plt3

In [None]:
plt4 = ggplot( data=midwest, aes(x=1, y=area, fill=state) ) + ggtitle("plot4") +
    #bar geometric object (stat="identity" - raw data)
    geom_bar( stat="identity", position="stack" ) +
    labs( x="", y="Cumulative Area", fill="Midwest" ) +
    #scale discrete (not continuous) for fill (not color)
        scale_fill_brewer( palette="Set3" ) +
        #scale_fill_manual(values=c("#FFCCFF","#CC99FF","#9933FF", "#330099", "#000033")) +
    scale_x_discrete( breaks="", labels="" )

plt4

In [None]:
#print figures to file (pdf / png / svg / etc)
pdf("Figures.pdf", width=40, height=20)
print(plt1)
print(plt2)
print(plt3)
print(plt4)
dev.off()

#?ggsave();#built-in save function
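The open-device, draw, close-device pattern used above applies to base graphics too. The sketch below writes a base `plot()` to a temporary PDF; the file name `toy_figure.pdf` and the plotted values are made up for illustration.

```r
out <- file.path(tempdir(), "toy_figure.pdf")

pdf(out, width = 6, height = 4)           # open a PDF device (size in inches)
plot(1:10, (1:10)^2, main = "toy plot")   # draw into the open device
dev.off()                                 # close the device to finish the file

file.exists(out)                          # confirm the file was written
```

Forgetting `dev.off()` is a common cause of empty or unreadable output files.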

<div class="alert alert-block alert-info" style="font-style:italic; font-size:13px">
<b>#Tip 7.</b>  packages often come with tutorials called vignettes.
</div>

In [None]:
#CHECK-IN POINT 
#EXERCISE - change plt1 so that size of points is now proportional to 'percbelowpoverty', y-axis log10 transformed, 
#axis title bold size 10, axis text italics size 8 and major/minor black grid lines.     

#EXERCISE - recreate plt4 so that bars are now plotted horizontally per state (unstacked/unfaceted).  


<font style="color:#68829E; font-family:calibri; font-size:29px">
<b> Saving / Writing </b>
</font>
<br>
<font style="color:black; font-family:calibri; font-size:15px">
Workspace objects can be saved as <em> .RData/.rda </em> files for logged records or future use. <br>
Results (data, figures, apps) can be written to other file types also.
</font>

In [None]:
ls() #print env objects

In [None]:
rm(plt1,plt2,plt3,plt4) #remove env objects
ls()

In [None]:
save.image("05_29_2019.RData")
#save(list=c("datOH","datminusOH"), file="05_29_2019.RData")

#load("05_29_2019.RData")

In [None]:
?write.table

In [None]:
write.table(midwest, file="ExampleGGData.csv", sep=",")
#write.table(midwest, file="ExampleGGData.txt", sep="\t")

list.files(path=".", pattern="\\.csv$") #pattern is a regular expression: escape the dot, anchor the end

#rm(list=ls());cat("\014")
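To save a single object rather than the whole workspace, base R also offers `saveRDS()` / `readRDS()`. A minimal sketch, with a made-up object `results` written to a temporary directory:

```r
results <- list(means = c(1.5, 2.5), label = "run1")

path <- file.path(tempdir(), "results.rds")
saveRDS(results, file = path)      # write one object to file

restored <- readRDS(path)          # read it back under any name
identical(results, restored)       # TRUE
```

Unlike `load()`, `readRDS()` returns the object, so it can be assigned to a new name without overwriting anything in the session.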

<font style="color:#68829E; font-family:calibri; font-size:29px">
<b> Loading / Reading </b>
</font>
<br>
<font style="color:black; font-family:calibri; font-size:15px">
Importing data into <em>R</em> from files (.csv/.txt) can range in complexity - multiple options/arguments. <br>
Examples: separator/delimiter (<em>",", "\t", " "</em>), header.
</font>

In [None]:
df = read.table(file="ExampleGGData.csv", sep=",")
#df = read.table(file="ExampleGGData.txt", sep="\t")
head(df)

#?read.csv()
#?read.delim()
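A round trip with `write.csv()` and `read.csv()` shows the effect of the header and row-name options. The dataframe `toy` and the temporary file are made up for illustration.

```r
toy <- data.frame(id = c("s1", "s2"),
                  value = c(1.5, 2.5),
                  stringsAsFactors = FALSE)

path <- file.path(tempdir(), "toy.csv")
write.csv(toy, file = path, row.names = FALSE)   # no extra row-name column

# header=TRUE: first line holds the column names
back <- read.csv(path, header = TRUE, stringsAsFactors = FALSE)
str(back)
```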

In [None]:
#https://github.com/fivethirtyeight/data

dfOnline = read.table(file="https://raw.githubusercontent.com/fivethirtyeight/data/master/bachelorette/bachelorette.csv", sep=",", header=TRUE)
head(dfOnline)

<div class="alert alert-block alert-info" style="font-style:italic; font-size:13px">
<b>#Tip 8.</b>  try package 'openxlsx' for Excel.
</div>

In [None]:
#print info about R
sessionInfo()

In [None]:
#quit()

<font style="color:#68829E; font-family:calibri; font-size:29px">
<b> Command Line </b>
</font>
<br>
<font style="color:black; font-family:calibri; font-size:15px">
Within a terminal <em>R</em> can be interactive (open input) or not/batch (closed execution).
</font>

``` $ R ``` <br>
``` $ Rscript Rcode.R ``` <br>
``` $ R CMD BATCH Rcode.R ``` 
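In batch mode, a script can read values passed on the command line with `commandArgs()`. The sketch below is one possible body for a script like Rcode.R; the argument (a count of random values) and its fallback of 5 are assumptions for illustration.

```r
# arguments given after the script name, as character strings
args <- commandArgs(trailingOnly = TRUE)

# use the first argument as a count, or fall back to 5 if none is given
n <- if (length(args) >= 1) as.integer(args[1]) else 5L

cat("mean of", n, "random values:", mean(runif(n)), "\n")
```

It could then be run as `$ Rscript Rcode.R 100`.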

<font style="color:black; font-family:calibri; font-size:15px">
<b><em>References</em></b> <br>
</font>
<font style="color:black; font-family:calibri; font-size:15px">
Software Carpentry: Our Lessons (<a href="https://software-carpentry.org/lessons/">https://software-carpentry.org/lessons</a>) <br>
R Packages: Organize, test, document and share your code (<a href="https://r-pkgs.org/">https://r-pkgs.org</a>) <br>
Tidyverse: R packages for data science (<a href="https://www.tidyverse.org/">https://www.tidyverse.org</a>) <br>
RStudio: Cheat Sheets (<a href="https://www.rstudio.com/resources/cheatsheets/">https://www.rstudio.com/resources/cheatsheets</a>) <br>
</font>
