<font size="6" color="68829E" face="calibri"> <b> Introduction: R and Jupyter </b> </font>

<font size="4" color="#000000" face="calibri">
Welcome to R, on HCC, through Jupyter. <br> <br>
<i>"R is a free software environment for statistical computing and graphics" </i><br>
<a href="https://www.r-project.org/"> www.r-project.org </a> <br> <br>
Here, we'll touch on some of the fundamentals of programming in the R language. Topics to be covered include: functions, data types, manipulation, packages, saving and loading. <br>
First though, let's briefly talk Jupyter ... <br> <br>
<i>"The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text" </i><br>
<a href="https://jupyter.org/"> www.jupyter.org </a>
</font>

<font size="6" color="68829E" face="calibri">
<b> Jupyter Notebook </b>
</font>
<br>
<font size="4" color="black" face="calibri">
Split into a kernel (executes code) and a dashboard & editor (interface), Jupyter permits mixing real-time coding with interactive rich-text outputs in self-contained ```.ipynb``` (JSON) documents (local or remote). <br>
Kernels include Julia, Python, R, MATLAB/Octave, etc. <br>
Interact with input cells (code, markdown, etc.), which have an order and a mode - insert, edit, run. <br>
</font>

``` $ jupyter notebook ```

<font size="6" color="68829E" face="calibri">
<b> R </b>
</font>
<br>
<font size="4" color="black" face="calibri">
Starting R loads an interpreter and creates a workspace/environment to store values and perform commands. <br>
Interact with the interpreter directly at the terminal or indirectly through scripts. <br> <br>
</font>
<font size="3" color="#000000" face="calibri">
#Tip 1. a popular GUI alternative is RStudio.
</font>

In [None]:
#R - load & open (HCC terminal)
#module load R.3.4
#R

In [None]:
#print info about R
sessionInfo()

In [None]:
#get (/set) current/working directory
getwd()
#setwd('/home/XXX')

In [None]:
list.files()

<font size="6" color="68829E" face="calibri">
<b> Calculations </b>
</font>
<br>
<font size="4" color="black" face="calibri">
Mathematical calculations - elements & arrays (/vectors). <br>
Example operators: ``` + , - , * , / , ^ ```.
</font>

In [None]:
(20+20)/5

In [None]:
c(1,2,3,4) * c(1,2,3,4)

In [None]:
sqrt(81); #built-in code (base functions)
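
<font size="4" color="black" face="calibri">
Arithmetic operators work element by element, and a shorter vector is recycled to match a longer one. A minimal sketch of both behaviours follows; the names <i>a</i> and <i>b</i> are illustrative only.
</font>

```r
a <- c(1, 2, 3, 4)
b <- c(10, 20)

a * a               # element-wise product
a + b               # b is recycled: 1+10, 2+20, 3+10, 4+20
a ^ 2               # the scalar exponent is applied to every element
sum(a) / length(a)  # same value as mean(a)
```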

<font size="6" color="68829E" face="calibri">
<b> Functions </b>
</font>
<br>
<font size="4" color="black" face="calibri">
Perform procedures - Input to Output with options/arguments (<i>round brackets</i>). <br>
Combining elements ``` c() ``` is one of the simplest base functions. <br>
With no defined arguments, functions will resort to default values found in documentation (? / help).
</font>
<br> <br>
<font size="3" color="#000000" face="calibri">
#Tip 2. source("Rcode.R").
</font>

In [None]:
#nested example (runif = generate random number) 
sort( runif(10, min=0, max=1) , decreasing=FALSE, index.return=FALSE)

In [None]:
#check documentation (usage, arguments, details, examples, etc)
?sort

In [None]:
calculate_SE <- function(x){
    #define function to calculate standard error ({body - local env})
    sqrt(var(x) / length(x))
}

calculate_SE( c(10.4, 5.6, 3.1, 6.4, 21.7) ); #run function
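
<font size="4" color="black" face="calibri">
Default argument values can be set when a function is defined and overridden when it is called. The sketch below assumes a hypothetical variant of the standard error function, <i>calculate_SE_narm</i>, with an optional <i>na.rm</i> argument; it is an illustration, not a base R function.
</font>

```r
# hypothetical helper: standard error with an optional argument
calculate_SE_narm <- function(x, na.rm = FALSE) {
    if (na.rm) x <- x[!is.na(x)]   # drop missing values only when asked
    sqrt(var(x) / length(x))
}

v <- c(10.4, 5.6, NA, 6.4, 21.7)
calculate_SE_narm(v)                 # default na.rm=FALSE, so the NA propagates
calculate_SE_narm(v, na.rm = TRUE)   # override the default by name
```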

<font size="6" color="68829E" face="calibri">
<b> Assignment </b>
</font>
<br>
<font size="4" color="black" face="calibri">
The operators ``` <- ``` and ``` = ``` assign outputs/results, as well as functions, to names stored in memory. <br> <br>
R operates on named data structures, that is, sets of ordered entities.<br>
Example vectors: numeric (double, integer, complex), logical, character.
</font>

In [None]:
x <- c(10.4, 5.6, 3.1, 6.4, 21.7); 
#x = c(10.4, 5.6, 3.1, 6.4, 21.7);

typeof(x)
x

In [None]:
y <- c("test", "control", "control", "control", "test");
#y = c("test", "control", "control", "control", "test");

typeof(y)
y
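
<font size="4" color="black" face="calibri">
The remaining vector types listed above can be checked the same way. A short sketch (variable names are illustrative) that also shows what happens when types are mixed in one vector:
</font>

```r
i <- 5L                # integer (note the L suffix)
d <- 5                 # double (the default for numbers)
z <- 2 + 3i            # complex
l <- c(TRUE, FALSE)    # logical

typeof(i); typeof(d); typeof(z); typeof(l)

# mixing types in one vector coerces everything to the most general type
typeof(c(1, "a", TRUE))   # "character"
```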

<font size="6" color="68829E" face="calibri">
<b> Objects </b>
</font>
<br>
<font size="4" color="black" face="calibri">
While vectors are the most important type of object/structure/variable in R, others exist each with intrinsic attributes (mode and length). Conversion between types is relatively intuitive with ``` as. ``` functions - as.character, as.numeric, etc. <br> <br>
Examples: matrices, lists, dataframes.
</font>

In [None]:
set.seed(1); #fix the random number generator's starting point (reproducibility)

#visualize toy matrix
X = matrix( runif(20, min=0, max=10) , nrow=4, ncol=5, byrow=FALSE);

dim(X)
X
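
<font size="4" color="black" face="calibri">
Lists, dataframes and the ``` as. ``` conversions can be sketched in a few lines. The objects <i>lst</i> and <i>df_toy</i> below are made-up examples.
</font>

```r
# list: elements may differ in type and length
lst <- list(id = 1:3, label = c("a", "b", "c"), flag = TRUE)
str(lst)

# dataframe: columns may differ in type but must share a length
df_toy <- data.frame(id = 1:3, label = c("a", "b", "c"))
class(df_toy)

as.numeric("3.14")    # character to numeric
as.character(1:3)     # integer to character
as.matrix(df_toy)     # dataframe to matrix (a character matrix here, since types are mixed)
```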

<font size="6" color="68829E" face="calibri">
<b> Indexing </b>
</font>
<br>
<font size="4" color="black" face="calibri">
Access parts of a dataset - rows then columns (<i>square brackets</i>) - can be coded multiple ways. <br>
Shortcut ``` : ``` generates a regular sequence from start to end (by 1).
</font>

In [None]:
X[2, 4]

In [None]:
t(X)[2, 4]; #transpose (rows to columns, columns to rows)

In [None]:
X[c(2,4), 4] = c(0,10); #replace elements
X

In [None]:
X[1:3, 2:3]

In [None]:
cbind(X[1:3, 2], X[1:3, 3]); #concatenate columns

In [None]:
rbind(X[1, 2:3], X[2, 2:3], X[3, 2:3]); #concatenate rows
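
<font size="4" color="black" face="calibri">
A few more of the "multiple ways" to index, sketched on a small made-up matrix <i>M</i> so the results are easy to check by eye.
</font>

```r
M <- matrix(1:12, nrow = 3, ncol = 4)   # filled column by column
colnames(M) <- c("a", "b", "c", "d")

M[2, ]                              # whole second row
M[, "c"]                            # column by name
M[-1, ]                             # negative index drops the first row
M[, c(TRUE, FALSE, TRUE, FALSE)]    # logical index keeps columns a and c
5:1                                 # the ':' shortcut also counts down
```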

<font size="6" color="68829E" face="calibri">
<b> Conditions </b>
</font>
<br>
<font size="4" color="black" face="calibri">
Checks that control the flow of actions/execution (if & else). <br>
Comparison: less than ( ``` < ``` ), greater than ( ``` > ``` ), equal ( ``` == ``` ), different ( ``` != ``` ). <br>
Logical: AND ( ``` & ``` ) if both <i>true</i>, OR ( ``` | ``` ) if either <i>true</i>, NOT ( ``` ! ``` ) converts <i>true</i> to <i>false</i>.
</font>

In [None]:
X[X > 5]; #output is a plain vector (dimensions dropped), read column by column
X

In [None]:
X[X >= 6 & X <= 7]; #across whole matrix
X[1, X[1, ] >= 6 & X[1, ] <= 7]; #only first row

In [None]:
if (dim(X)[1] != 4) {
    print("check - dimensions don't match"); #first check cond.1, TRUE do something
} else if (dim(X)[2] != 5) {
    print("check - dimensions don't match"); #FALSE then try cond.2, TRUE now do something different
} else {
    print("good - dimensions match"); #FALSE again, do something else
}
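
<font size="4" color="black" face="calibri">
Comparison and logical operators act element by element, whereas ``` if ``` expects a single TRUE/FALSE. A brief sketch on an illustrative vector <i>v</i>:
</font>

```r
v <- c(2, 7, 6.5, 9)

v >= 6 & v <= 7                 # element-wise AND: FALSE TRUE TRUE FALSE
!(v > 5)                        # NOT flips each value
ifelse(v > 5, "high", "low")    # vectorized if/else, one result per element

# '&&' and '||' compare single values, which is what if() expects
if (length(v) == 4 && all(v > 0)) {
    print("four positive values")
}
```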

In [None]:
#CHECK-IN POINT
#EXERCISE - create numeric array from 10 to 100 in increments of 0.5 (?seq) 

#EXERCISE - XXX

#EXERCISE - XXX

<font size="6" color="68829E" face="calibri">
<b> Iterations </b>
</font>
<br>
<font size="4" color="black" face="calibri">
Scalar loops, element by element, can be used to repeat procedures (for & while). <br>
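Vectorization, operating on all elements at once, is however faster and more scalable, and the apply family gives a compact loop-free syntax - apply (array), lapply (list), sapply (vector). This is because vectorized functions run their loops in compiled code rather than in the R interpreter. <br> <br>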
</font>
<font size="3" color="#000000" face="calibri">
#Tip 3. check system.time().
</font>

In [None]:
#pre-allocate a vector of NA (missing values) to fill during the loop (efficiency)
colMean = rep(NA, times=dim(X)[2]);

for (i in 1:dim(X)[2]){
    #variable i in sequence start:end
    print(i)
    colMean[i] = mean(X[,i])
    #break;#next
}

colMean

In [None]:
colMean = apply(X, 2, mean)
colMean

#colMeans(X)

In [None]:
rowSE = apply(X, 1, calculate_SE)
rowSE
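
<font size="4" color="black" face="calibri">
The remaining constructs named above (while, lapply, sapply, system.time) can be sketched as follows. The matrix <i>M</i> is random toy data, and timings will differ between machines.
</font>

```r
set.seed(1)
M <- matrix(runif(20000), nrow = 200, ncol = 100)

# while loop: repeat until the condition becomes FALSE
total <- 0; k <- 0
while (total < 10) {
    k <- k + 1
    total <- total + k
}
k   # 4, because 1+2+3+4 = 10

# sapply returns a vector, lapply a list
sapply(1:3, function(j) j^2)
lapply(1:3, function(j) j^2)

# two routes to the same column means; compare timings with system.time()
system.time(m1 <- apply(M, 2, mean))
system.time(m2 <- colMeans(M))
all.equal(m1, m2)
```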

<font size="6" color="68829E" face="calibri">
<b> Packages </b>
</font>
<br>
<font size="4" color="black" face="calibri">
Code/functions developed by the community are organised and shared online in repositories - these can be retrieved/downloaded (source (compile) vs bundled (compressed) vs binary (pre-built)), structured into directories (libraries) and loaded into memory to use. Packages will often require other packages - dependencies. <br> <br>
Repositories: CRAN (official network), Bioconductor (bioinformatic specific), GitHub. <br>
Packages: Tidyverse, Shiny, dada2, phyloseq, MetaboAnalystR, lme4. <br> <br>
</font>
<font size="3" color="#000000" face="calibri">
#Tip 4. detail active libraries with .libPaths().
</font>

In [None]:
#lapply(.libPaths(), dir)
lapply(.libPaths(), list.files)

In [None]:
#install.packages('ggplot2')
library(ggplot2)

#install.packages('BiocManager');BiocManager::install() #Bioconductor (replaces the older biocLite script)
#install.packages('devtools');devtools::install_github()

In [None]:
#packages can contain data also 
data(package = "ggplot2")

In [None]:
#data("mpg", package = "ggplot2")
data("midwest", package = "ggplot2"); #call dataset (midwest demographics) into memory/env

class(midwest)

#detach("package:ggplot2")
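
<font size="4" color="black" face="calibri">
A common pattern is to install a package only when it is missing and then load it. The sketch below uses <i>stats</i> purely as a stand-in, since it ships with every R installation; swap in any package name.
</font>

```r
pkg <- "stats"   # illustrative choice

# requireNamespace() returns TRUE/FALSE without attaching the package
if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # only reached when the package is missing
}
library(pkg, character.only = TRUE)   # character.only lets library() take a string

packageVersion(pkg)   # installed version
search()              # packages currently attached
```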

<font size="6" color="68829E" face="calibri">
<b> Dataframes </b>
</font>
<br>
<font size="4" color="black" face="calibri">
A special list of vectors of equal length that suggests an implicit relationship between elements. <br>
Without such restrictions, variable type would default to a basic list. <br> <br>
</font>
<font size="3" color="#000000" face="calibri">
#Tip 5. tibbles are modern dataframes with subtleties (printing & subsetting).
</font>

In [None]:
dim(midwest)
head(midwest)
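
<font size="4" color="black" face="calibri">
To see the "list of equal-length vectors" idea directly, a dataframe can be built by hand. The dataframe <i>toy</i> and its values are made up for illustration.
</font>

```r
toy <- data.frame(
    sample = c("s1", "s2", "s3"),
    group  = c("test", "control", "control"),
    value  = c(10.4, 5.6, 3.1),
    stringsAsFactors = FALSE   # keep text columns as character
)

is.list(toy)          # TRUE: a dataframe is a list underneath
sapply(toy, typeof)   # each column keeps its own type
sapply(toy, length)   # all columns share the same length
str(toy)

# columns of unequal length are rejected
res <- try(data.frame(a = 1:3, b = 1:2), silent = TRUE)
class(res)            # "try-error"
```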

<font size="6" color="68829E" face="calibri">
<b> Manipulation </b>
</font>
<br>
<font size="4" color="black" face="calibri">
Packages ```dplyr``` and ```tidyr``` are popular tools to manipulate (select, exclude, add) dataframe rows/columns. <br>
Base functions exist also and again can be coded multiple ways.
</font>

In [None]:
print(colnames(midwest))
#rownames(midwest)

In [None]:
#get unique 'states'
unique(midwest$state)
unique(midwest[,"state"])
unique(midwest[,3])

In [None]:
#base method (logical TRUE/FALSE)
datOH = midwest[midwest$state=="OH",];head(datOH)
#datOHc = midwest[midwest$state=="OH", 2];head(datOHc)

In [None]:
#base method (indexed INTEGER)
indexOH = which(midwest$state=="OH");indexOH[1:10]
#datOH = midwest[indexOH,];head(datOH)

In [None]:
datminusOH = midwest[-indexOH,]; #exclude (-)
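
<font size="4" color="black" face="calibri">
Base R covers the other common manipulations too. The sketch below uses the built-in <i>mtcars</i> dataset as a stand-in so it runs without extra packages; the column <i>wt_kg</i> is a made-up addition.
</font>

```r
dat <- mtcars   # built-in dataset, used here only as a stand-in

# filter rows and select columns: bracket indexing vs subset()
sub1 <- dat[dat$cyl == 4 & dat$mpg > 30, c("mpg", "cyl", "wt")]
sub2 <- subset(dat, cyl == 4 & mpg > 30, select = c(mpg, cyl, wt))
all.equal(sub1, sub2)

dat$wt_kg <- dat$wt * 1000 * 0.4536            # add a column (wt is in 1000 lbs)
dat <- dat[order(-dat$mpg), ]                  # sort rows by descending mpg
aggregate(mpg ~ cyl, data = dat, FUN = mean)   # grouped summary
```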

In [None]:
#install.packages('dplyr')
library(dplyr)

In [None]:
#dplyr uses '%>%' as a pipe - pass left side as first argument of right side
unique(midwest$state)
distinct(midwest, state)
midwest %>% distinct(state)

In [None]:
#dplyr filters rows & selects columns (%in% = belong to/member of | INVERSE by negating: !(x %in% y))
datsubAAR = midwest %>%
    #filter(category=="AAR") %>%
    filter(category %in% "AAR" & state %in% "IL") %>% #multiple cond.
    #arrange(desc(popdensity)) %>%
    select(county, state, popdensity, popwhite, popblack, popamerindian, popasian, category)

datsubAAR

In [None]:
#add new columns (using current data or not)
datsubAAR = datsubAAR %>%
    mutate(poptotalrecal = popwhite + popblack + popamerindian + popasian,
           fakegroups = rep(c("North","South","Central"), length.out=n())) #recycle labels to fit any number of rows
    #transmute(poptotalrecal = popwhite + popblack + popamerindian + popasian)

datsubAAR

In [None]:
#dplyr has helper functions for ease/speed
datsubAAR = datsubAAR %>%
    #select(-starts_with("poptotal"))
    #select(-ends_with("recal"))
    select(-contains("recal"))

datsubAAR

In [None]:
#summarise data (statistics - with or without groups)
datsubAAR %>%
    #group_by(fakegroups) %>%
    summarise(mean(popwhite), min(popdensity), n())

In [None]:
library(tidyr)

![](http://journals.plos.org/plosone/article/figure/image?size=medium&id=info:doi/10.1371/journal.pone.0061217.g003)

In [113]:
#switch format (reshape) - wide to long (redundancy - picture)
datsubAARlong <- gather(datsubAAR, populationGroup, populationNum, popwhite:popasian, factor_key=TRUE)
datsubAARlong

dim(datsubAAR)
dim(datsubAARlong)

#?spread(); #format - long to wide

county,state,popdensity,category,fakegroups,populationGroup,populationNum
<chr>,<chr>,<dbl>,<chr>,<chr>,<fct>,<int>
ADAMS,IL,1270.9615,AAR,North,popwhite,63917
BOND,IL,681.4091,AAR,South,popwhite,14477
BROWN,IL,324.2222,AAR,Central,popwhite,5264
BUREAU,IL,713.7600,AAR,North,popwhite,35157
CARROLL,IL,622.4074,AAR,South,popwhite,16519
CASS,IL,559.8750,AAR,Central,popwhite,13384
CHRISTIAN,IL,819.4762,AAR,North,popwhite,34176
CLARK,IL,530.7000,AAR,South,popwhite,15842
COLES,IL,1721.4667,AAR,Central,popwhite,50177
CRAWFORD,IL,748.6154,AAR,North,popwhite,19300


In [114]:
#CHECK-IN POINT 
#EXERCISE - XXX 

#EXERCISE - XXX

#EXERCISE - XXX

<font size="6" color="68829E" face="calibri">
<b> Plotting </b>
</font>
<br>
<font size="4" color="black" face="calibri">
For visualizations package ``` ggplot2 ``` is preferred over base functions (```plot()``` , ```lines()``` , ```barplot()```).<br>
Mapping data, which can come from multiple sources/datasets, to graphic attributes ``` (x, y, color, fill, group) ``` is done through aesthetics ``` (aes) ```. <br>
Layers (geoms, stats, position, annotation), scales, facets, coordinate systems, themes can all then be added ``` (+) ```.<br><br>
Example geom - point, line, bar, boxplot, density.
</font>

In [None]:
plt1 = ggplot( data=midwest, aes(x=area, y=poptotal, color=state) ) +
    #scatter geometric object (points = observation/county)
    geom_point() + ggtitle("plot1")
    
plt1

In [None]:
plt2 = ggplot( data=midwest, aes(x=poptotal, y=popasian) ) +
    #scatter geometric object (points = observation/county)
    geom_point( aes(color=county), shape=18, size=3.5, show.legend=FALSE ) + ggtitle("plot2") +
    #subplot dataset by column/variable (state)
        facet_wrap( ~state, scales="fixed", nrow=2 ) 
        #facet_wrap( ~state, scales="free", nrow=2 ) 
      
plt2

In [None]:
plt3 = plt2 + ggtitle("plot3") +
    facet_wrap( ~state, scales="free", nrow=2 ) +
    #statistical transformation (here groups = facets / local vs global)
    stat_smooth( method = "lm", formula=y~x, size=1, se=TRUE, color="black" ) +
    stat_smooth( aes(group=state), method = "loess", formula=y~x, size=1, se=TRUE, color="red" )

plt3

In [None]:
plt4 = ggplot( data=midwest, aes(x=1, y=area, fill=state) ) + ggtitle("plot4") +
    #bar geometric object (stat="identity" - raw data)
    geom_bar( stat="identity", position="stack" ) +
    labs( x="", y="Cumulative Area", fill="Midwest" ) +
    #scale discrete (not continuous) for fill (not color)
        scale_fill_brewer( palette="Set3" ) +
        #scale_fill_manual(values=c("#FFCCFF","#CC99FF","#9933FF", "#330099", "#000033")) +
    scale_x_discrete( breaks="", labels="" )

plt4

In [None]:
#print figures to file (pdf / png / svg / etc)
pdf("Figures.pdf", width=40, height=20)
print(plt1)
print(plt2)
print(plt3)
print(plt4)
dev.off()

#?ggsave();#built-in save function

In [None]:
#CHECK-IN POINT 
#EXERCISE - change themes of plt1 so that

#EXERCISE - XXX

#EXERCISE - XXX

<font size="6" color="68829E" face="calibri">
<b> Saving / Writing </b>
</font>
<br>
<font size="4" color="black" face="calibri">
Workspace objects can be saved as ``` .Rdata/.Rda ``` files for logged records or future use. <br>
Results (data, figures, apps) can be outputted as other files also.
</font>

In [None]:
ls()

In [None]:
rm(plt1,plt2,plt3,plt4)
ls()

In [None]:
save.image("05_29_2019.RData")
#save(list=c("datOH","datminusOH"), file="05_29_2019.RData")

#load("05_29_2019.RData")
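
<font size="4" color="black" face="calibri">
Besides saving the whole workspace, selected objects can be saved and restored. A minimal sketch, assuming temporary file paths so that nothing in the working directory is overwritten; the object names are illustrative.
</font>

```r
obj <- c(10.4, 5.6, 3.1)
other <- "control"
f_rdata <- tempfile(fileext = ".RData")   # temporary paths for the demo
f_rds   <- tempfile(fileext = ".rds")

save(obj, other, file = f_rdata)   # several named objects in one file
saveRDS(obj, file = f_rds)         # exactly one object, its name is not stored

rm(obj, other)
load(f_rdata)                      # restores 'obj' and 'other' under their original names
obj_copy <- readRDS(f_rds)         # returns the object, so assign it to any name
identical(obj, obj_copy)
```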

In [None]:
?write.table

In [None]:
write.table(midwest, file="ExampleGGData.csv", sep=",")
#write.table(midwest, file="ExampleGGData.txt", sep="\t")

list.files(path=".", pattern="\\.csv$") #pattern is a regular expression: escape the dot, anchor the end

#rm(list=ls());cat("\014")

<font size="6" color="68829E" face="calibri">
<b> Loading / Reading </b>
</font>
<br>
<font size="4" color="black" face="calibri">
Importing data into R from files (.csv/.txt) can range in complexity - multiple options/arguments. <br>
Examples: separator/delimiter (<i>",", "\t", " "</i>), header. <br> <br>
</font>
<font size="3" color="#000000" face="calibri">
#Tip 6. try package 'openxlsx' for Excel.
</font>

In [None]:
df = read.table(file="ExampleGGData.csv", sep=",")
#df = read.table(file="ExampleGGData.txt", sep="\t")
head(df)

#?read.csv()
#?read.delim()
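
<font size="4" color="black" face="calibri">
The separator and header options can be checked with a small round trip. The sketch below writes a made-up dataframe to a temporary file and reads it back in two equivalent ways.
</font>

```r
toy <- data.frame(id = 1:3, group = c("test", "control", "control"),
                  stringsAsFactors = FALSE)
f_csv <- tempfile(fileext = ".csv")

write.csv(toy, file = f_csv, row.names = FALSE)   # no row-name column in the file

# read.csv is read.table with sep="," and header=TRUE as defaults
back <- read.csv(f_csv, header = TRUE, stringsAsFactors = FALSE)
same <- read.table(f_csv, sep = ",", header = TRUE, stringsAsFactors = FALSE)

str(back)
all.equal(back, same)
```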

In [None]:
#https://github.com/fivethirtyeight/data

dfOnline = read.table(file="https://raw.githubusercontent.com/fivethirtyeight/data/master/bachelorette/bachelorette.csv", sep=",", header=TRUE)
head(dfOnline)

In [None]:
#quit()

<font size="6" color="68829E" face="calibri">
<b> Command Line </b>
</font>
<br>
<font size="4" color="black" face="calibri">
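Scripts can also be run non-interactively from the shell, without opening an R session, which suits batch jobs. <br>
```Rscript``` prints output to the terminal, while ```R CMD BATCH``` writes it to a file (```Rcode.Rout```).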
</font>

``` $ Rscript Rcode.R ``` <br>
``` $ R CMD BATCH Rcode.R ``` 
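
<font size="4" color="black" face="calibri">
A script run this way can read arguments passed on the command line. The sketch below shows what a file such as <i>Rcode.R</i> might contain; the argument (a sample size) and the default of 10 are assumptions made for illustration.
</font>

```r
# possible contents of Rcode.R, run as:  Rscript Rcode.R 5
args <- commandArgs(trailingOnly = TRUE)   # only the arguments after the script name

# fall back to a default when no argument is supplied
n <- if (length(args) >= 1) as.integer(args[1]) else 10L

set.seed(1)
x <- runif(n)
cat("n =", n, " mean =", mean(x), "\n")    # cat() writes plain text to standard output
```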

<font size="4" color="black" face="calibri">
<b><i>References</i></b> <br>
</font>
<font size="4" color="black" face="calibri">
Software Carpentry: Our Lessons (<a href="https://software-carpentry.org/lessons/">https://software-carpentry.org/lessons</a>) <br>
R Packages: Organize, test, document and share your code (<a href="https://r-pkgs.org/">https://r-pkgs.org</a>) <br>
Tidyverse: R packages for data science (<a href="https://www.tidyverse.org/">https://www.tidyverse.org</a>) <br>
</font>
