<font size="6" color="68829E" face="calibri"> <b> Introduction: R and Jupyter </b> </font>

<font size="4" color="#000000" face="calibri">
Welcome to R, on HCC, through Jupyter. <br> <br>
<i>"R is a free software environment for statistical computing and graphics" </i><br>
<a href="https://www.r-project.org/"> www.r-project.org </a> <br> <br>
Here, we'll be touching upon some of the fundamentals of the R language. <br>
First, though, let's briefly talk about Jupyter ... <br> <br>
<i>"The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text" </i><br>
<a href="https://jupyter.org/"> www.jupyter.org </a>
</font>

<font size="6" color="68829E" face="calibri">
<b> Jupyter Notebook </b>
</font>
<br>
<font size="4" color="black" face="calibri">
Split into a kernel (executes code) and a dashboard/editor (interface), Jupyter permits mixing real-time coding with interactive rich-text outputs in self-contained ```.ipynb``` (JSON) documents (local or remote). <br>
Kernels include Julia, Python, R, MATLAB/Octave, etc. <br>
Interact with input cells (code, markdown, etc.) that have an execution order and two modes (edit & command). <br>
</font>

``` $ conda install -c conda-forge jupyterlab ``` <br>
``` $ jupyter notebook ```

<font size="6" color="68829E" face="calibri">
<b> R </b>
</font>
<br>
<font size="4" color="black" face="calibri">
Starting R loads an interpreter and creates a workspace/environment to store values and perform commands. <br>
Interact with terminal directly or indirectly through scripts. <br> <br>
</font>
<font size="3" color="#000000" face="calibri">
#Tip 1. a popular GUI alternative is RStudio.
</font>

In [None]:
#module load R.3.4
#R

sessionInfo()

In [None]:
getwd()
#setwd('/home/')

In [None]:
list.files()

<font size="6" color="68829E" face="calibri">
<b> Calculations </b>
</font>
<br>
<font size="4" color="black" face="calibri">
Mathematical calculations on single elements & arrays (vectors). <br>
Example operators: ``` + , - , * , / , ^ ```.
</font>
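
<font size="4" color="black" face="calibri">
As a small sketch of how these operators behave on vectors (the names a and b are made up for illustration): arithmetic is applied element by element, and a single value is recycled across the whole vector.
</font>

```r
# Two illustrative vectors of equal length
a <- c(1, 2, 3, 4)
b <- c(10, 20, 30, 40)

sum_ab  <- a + b   # element-wise addition
ratio   <- b / a   # element-wise division
squares <- a ^ 2   # each element squared
doubled <- a * 2   # the single value 2 is recycled across all elements

print(sum_ab)
print(ratio)
print(squares)
print(doubled)
```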

In [None]:
(20+20)/5

In [None]:
c(1,2,3,4) * c(1,2,3,4)

In [None]:
sqrt(81) 

<font size="6" color="68829E" face="calibri">
<b> Functions </b>
</font>
<br>
<font size="4" color="black" face="calibri">
Perform procedures - Input to Output with options/arguments (<i>round brackets</i>). <br>
Combining elements ``` c() ``` is one of the simplest functions. <br>
When arguments are not supplied, functions fall back on the default values listed in their documentation (? / help).
</font>
<br> <br>
<font size="3" color="#000000" face="calibri">
#Tip 2. source("XXX.R") runs the code stored in a script file.
</font>
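
<font size="4" color="black" face="calibri">
To illustrate default arguments, here is a minimal sketch using a hypothetical function, scale_values (not part of R), whose second argument has a default that is used whenever the caller leaves it out.
</font>

```r
# Hypothetical helper: multiply a vector by a factor (default 10)
scale_values <- function(x, factor = 10) {
    x * factor
}

default_result <- scale_values(c(1, 2, 3))              # factor falls back to 10
custom_result  <- scale_values(c(1, 2, 3), factor = 2)  # default overridden

print(default_result)
print(custom_result)

# Inspect the defaults of any function without opening its help page
formals(scale_values)$factor
args(sort)
```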

In [None]:
#nested example (runif = generate random number) 
sort(runif(10, min=0, max=1), decreasing=FALSE)

In [None]:
?sort

In [None]:
calculate_SE <- function(x){
    #function to calculate standard error ({local env})
    sqrt(var(x) / length(x))
}

calculate_SE(c(10.4, 5.6, 3.1, 6.4, 21.7))

<font size="6" color="68829E" face="calibri">
<b> Assignment </b>
</font>
<br>
<font size="4" color="black" face="calibri">
The operators ``` <- ``` or ``` = ``` assign and store outputs (results). <br> <br>
R operates on named data structures, that is, sets of ordered entities.<br>
Example vectors: numeric (double, integer, complex), logical, character.
</font>
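
<font size="4" color="black" face="calibri">
The vector types listed above can be checked with typeof(). The sketch below uses made-up variable names and also shows that mixing types in one vector coerces everything to the most flexible type.
</font>

```r
# One small vector per basic type (names are illustrative)
num <- c(10.4, 5.6)          # double
int <- c(1L, 2L)             # integer (note the L suffix)
lgl <- c(TRUE, FALSE)        # logical
chr <- c("test", "control")  # character

types <- c(typeof(num), typeof(int), typeof(lgl), typeof(chr))
print(types)

# Mixing types in one vector coerces all elements to character
mixed <- c(1, "a", TRUE)
print(mixed)
typeof(mixed)
```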

In [None]:
x <- c(10.4, 5.6, 3.1, 6.4, 21.7); 
#x = c(10.4, 5.6, 3.1, 6.4, 21.7);
x

In [None]:
y <- c("test", "control", "control", "control", "test");
#y = c("test", "control", "control", "control", "test");
y

In [None]:
as.character(x)

<font size="6" color="68829E" face="calibri">
<b> Objects </b>
</font>
<br>
<font size="4" color="black" face="calibri">
While vectors are the most important type of object/structure/variable in R, others exist, each with intrinsic attributes (mode and length). Conversion between types is relatively intuitive. <br>
Examples: matrices, lists, dataframes.
</font>
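
<font size="4" color="black" face="calibri">
A rough sketch of the object types named above, using small made-up objects (m, l, d), with their mode and length and two simple conversions.
</font>

```r
# Illustrative objects: a matrix, a list and a dataframe
m <- matrix(1:6, nrow = 2)                        # 2 x 3 matrix
l <- list(id = 1:3, label = c("a", "b", "c"))     # list of two vectors
d <- as.data.frame(l)                             # list -> dataframe

mode(m); length(m)   # a matrix is a vector with dimensions
mode(l); length(l)   # length of a list = number of components
class(d); dim(d)     # dataframe with rows and columns

# Conversion back to a plain vector drops the dimensions
v <- as.vector(m)
print(v)
```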

In [None]:
X = matrix(runif(20, min=0, max=10), nrow=4, ncol=5, byrow=FALSE);

dim(X)
X

<font size="6" color="68829E" face="calibri">
<b> Indexing </b>
</font>
<br>
<font size="4" color="black" face="calibri">
Access parts of a dataset - rows then columns (<i>square brackets</i>).
</font>
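
<font size="4" color="black" face="calibri">
Because the matrix X used in this notebook is random, the sketch below uses a small fixed matrix (M, an illustrative name) so the results of each kind of index can be followed by hand.
</font>

```r
# Fixed 3 x 4 matrix, filled column by column with 1..12
M <- matrix(1:12, nrow = 3, ncol = 4)

single   <- M[2, 4]               # row 2, column 4
col_one  <- M[, 1]                # empty row index = all rows
no_first <- M[-1, ]               # negative index excludes row 1
large    <- M[M > 10]             # logical index keeps matching elements
row_two  <- M[2, , drop = FALSE]  # drop = FALSE keeps the matrix shape

print(single)
print(col_one)
print(no_first)
print(large)
print(dim(row_two))
```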

In [None]:
X[2,4]

In [None]:
t(X)[2,4]

In [None]:
X[1:3,2:3]

In [None]:
cbind(X[1:3,2], X[1:3,3])

In [None]:
rbind(X[1,2:3], X[2,2:3], X[3,2:3])

<font size="6" color="68829E" face="calibri">
<b> Iterations </b>
</font>
<br>
<font size="4" color="black" face="calibri">
Scalar loops, element by element, can be used to repeat procedures (for & while). <br>
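Vectorization, operating on all elements at once, is however faster and more scalable, because the looping is done in compiled code rather than in the R interpreter (e.g. colMeans). The apply family - apply (array), lapply (list), sapply (vector) - offers a concise way to run a function over every element. <br> <br>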
</font>
<font size="3" color="#000000" face="calibri">
#Tip 3. check system.time().
</font>
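
<font size="4" color="black" face="calibri">
Following Tip 3, this is one possible way to compare approaches with system.time(). The matrix big and the helper loop_means are invented for the example, and timings will differ between machines, so none are quoted here.
</font>

```r
set.seed(1)
big <- matrix(runif(1e6), nrow = 1000)  # 1000 x 1000 matrix of random numbers

# Hypothetical helper: column means with an explicit for loop
loop_means <- function(m) {
    out <- numeric(ncol(m))             # pre-allocate the result
    for (i in seq_len(ncol(m))) {
        out[i] <- mean(m[, i])
    }
    out
}

system.time(r1 <- loop_means(big))      # explicit loop
system.time(r2 <- apply(big, 2, mean))  # apply over columns
system.time(r3 <- colMeans(big))        # dedicated vectorized function

# All three approaches should agree on the result
all.equal(r1, r2)
all.equal(r1, r3)
```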

In [None]:
#create variable of NA (missing values) to fill during loop (efficiency)
colMean = rep(NA, times=dim(X)[2]);
for (i in 1:dim(X)[2]){
    colMean[i] = mean(X[,i])
}

colMean

In [None]:
colMean = apply(X, 2, mean)
colMean

#colMeans(X)

In [None]:
rowSE = apply(X, 1, calculate_SE)
rowSE

<font size="6" color="68829E" face="calibri">
<b> Conditions </b>
</font>
<br>
<font size="4" color="black" face="calibri">
Checks that control the flow of actions/execution. <br>
Comparison: less than (``` < ```), greater than (``` > ```), equal (``` == ```), different (``` != ```). <br>
Logical: AND (``` & ```) if both <i>true</i>, OR (``` | ```) if either <i>true</i>, NOT (``` ! ```) converts <i>true</i> to <i>false</i>.
</font>
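
<font size="4" color="black" face="calibri">
A short sketch of these operators applied to a made-up vector v. Comparisons and logical operators work element by element, and ifelse() is the vector counterpart of if/else.
</font>

```r
v <- c(3, 8, 12)  # illustrative values

above5  <- v > 5            # comparison on every element
between <- v > 5 & v < 10   # AND: both conditions true
outside <- v < 5 | v > 10   # OR: either condition true
not8    <- !(v == 8)        # NOT: flips TRUE and FALSE

print(above5)
print(between)
print(outside)
print(not8)

# Vectorized if/else: one label per element
labels <- ifelse(v > 5, "high", "low")
print(labels)
```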

In [None]:
if (length(colMean) == 5) {
  print("good - dimensions match")
#} else if (condition) {
} else {
  print("check - dimensions don't match")
}

<font size="6" color="68829E" face="calibri">
<b> Packages </b>
</font>
<br>
<font size="4" color="black" face="calibri">
Code/functions developed by the community are organised and shared online in repositories. Packages can be downloaded as source (needs compiling), bundled (compressed) or binary (pre-built) files, installed into directories (libraries) and loaded into memory for use. Packages will often require other packages - dependencies. <br> <br>
Repositories: CRAN (official network), Bioconductor (bioinformatic specific), GitHub. <br>
Packages: Tidyverse, Shiny, dada2, phyloseq, MetaboAnalystR, lme4. <br> <br>
</font>
<font size="3" color="#000000" face="calibri">
#Tip 4. detail active libraries with .libPaths().
</font>
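
<font size="4" color="black" face="calibri">
One way to check whether a package is available before loading it is sketched below. The helper is_installed is hypothetical. requireNamespace() is a base R function that returns TRUE or FALSE instead of stopping with an error.
</font>

```r
# Hypothetical helper: TRUE if the package can be found in a library
is_installed <- function(pkg) {
    requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")   # 'stats' ships with R

# Load ggplot2 only if it is present, otherwise print a hint
if (is_installed("ggplot2")) {
    library(ggplot2)
} else {
    message("ggplot2 is not installed - try install.packages('ggplot2')")
}
```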

In [None]:
#lapply(.libPaths(), dir)
lapply(.libPaths(), list.files)

In [None]:
#install.packages('ggplot2')
library(ggplot2)

#install.packages('BiocManager');BiocManager::install()  #replaces the retired biocLite() script
#install.packages('devtools');devtools::install_github()

In [None]:
data(package = "ggplot2")

In [None]:
#data("mpg", package = "ggplot2")
data("midwest", package = "ggplot2")

dat = midwest;
class(dat)

#detach("package:ggplot2")

<font size="6" color="68829E" face="calibri">
<b> Dataframes </b>
</font>
<br>
<font size="4" color="black" face="calibri">
A special list of vectors of equal length that suggests an implicit relationship between elements. <br>
Without such restrictions, variable type would default to a basic list. <br> <br>
</font>
<font size="3" color="#000000" face="calibri">
#Tip 5. tibbles are modern dataframes with subtleties (printing & subsetting).
</font>
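
<font size="4" color="black" face="calibri">
To make the equal-length rule concrete, the sketch below builds a dataframe from two vectors like those in the Assignment section (the name df_xy is made up) and shows that columns of unequal length are refused, while a plain list accepts them.
</font>

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
y <- c("test", "control", "control", "control", "test")

# Equal-length vectors become the columns of a dataframe
df_xy <- data.frame(value = x, group = y, stringsAsFactors = FALSE)
str(df_xy)
is.list(df_xy)   # a dataframe is still a list underneath

# Unequal lengths raise an error, caught here for illustration
res <- tryCatch(
    data.frame(value = x, group = y[1:2]),
    error = function(e) "error: columns differ in length"
)
print(res)

# A basic list has no such restriction
lst <- list(value = x, group = y[1:2])
lengths(lst)
```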

In [None]:
dim(dat)
head(dat)

<font size="6" color="68829E" face="calibri">
<b> Manipulation </b>
</font>
<br>
<font size="4" color="black" face="calibri">
Packages ```dplyr``` and ```tidyr``` are popular tools for manipulating (selecting, excluding, adding) dataframe rows/columns.
</font>

In [None]:
print(colnames(dat))
#rownames(dat)

In [None]:
unique(dat$state)
unique(dat[,"state"])
unique(dat[,3])

In [None]:
#base method (TRUE/FALSE)
datOH = dat[dat$state=="OH",];head(datOH)
#datOHc = dat[dat$state=="OH", 2];head(datOHc)

#base method (index)
indexOH = which(dat$state=="OH");indexOH[1:10]
#datOH = dat[indexOH,];head(datOH)

datminusOH = dat[-indexOH,];#exclude

In [None]:
#install.packages('dplyr')
library(dplyr)

#dplyr uses '%>%' as a pipe - pass left side as first argument of right side
unique(dat$category)
distinct(dat, category)
dat %>% distinct(category)

In [None]:
#dplyr filters rows & selects columns 
datsubAAR = dat %>% 
    filter(category=="AAR" & state=="IL") %>%
    #arrange(desc(popdensity)) %>%
    select(county, state, popdensity, popwhite, popblack, popamerindian, popasian, category)
datsubAAR

In [None]:
#add new columns (using current data or not)
datsubAAR = datsubAAR %>%
    mutate(poptotalrecal = popwhite + popblack + popamerindian + popasian,
           fakegroups = rep(c("North","South","Central"), length.out=n()))
    #transmute(poptotalrecal = popwhite + popblack + popamerindian + popasian)
datsubAAR

In [None]:
#dplyr has helper functions for ease/speed
datsubAAR = datsubAAR %>%
    #select(-starts_with("poptotal"))
    #select(-ends_with("recal"))
    select(-contains("recal"))
datsubAAR

In [None]:
#summarise data (with or without groups)
datsubAAR %>%
    #group_by(fakegroups) %>%
    summarise(mean(popwhite), min(popdensity), n())

In [None]:
library(tidyr)

#switch format - wide to long 
datsubAARlong <- gather(datsubAAR, populationGroup, populationNum, popwhite:popasian, factor_key=TRUE)
datsubAARlong

#?spread();#format - long to wide

<font size="6" color="68829E" face="calibri">
<b> Plotting </b>
</font>
<br>
<font size="4" color="black" face="calibri">
For visualizations ``` ggplot2 ``` is preferred over the base plotting system.<br>
Mapping data, which can come from multiple sources, to graphic attributes ``` (x, y, color) ``` is done through aesthetics ``` (aes) ```. <br>
Layers (geoms, stats, position, annotation), scales, facets, coordinate systems, themes can all then be added ``` (+) ```.<br>
</font>

In [None]:
plt1 = ggplot(data=dat, aes(x=area, y=poptotal, color=state)) +
    geom_point() + ggtitle("plot1") +
    theme(
        axis.title = element_text(face="bold", size=9),
        axis.text = element_text(size=7),
        legend.title = element_text(face="bold", size=9),
        legend.text = element_text(size=7),
        legend.position = "bottom",
        plot.title = element_text(face="bold", size=11)
    )
    
plt1

In [None]:
plt2 = ggplot(data=dat, aes(x=poptotal, y=popasian)) +
    geom_point(aes(color=county)) + ggtitle("plot2") +
    facet_wrap(~state, scales="fixed", nrow=2) +
    #facet_wrap(~state, scales="free", nrow=2) +
    theme(
        axis.title = element_text(face="bold", size=9),
        axis.text = element_text(size=7),
        legend.title = element_text(face="bold", size=9),
        legend.text = element_text(size=7),
        legend.position = "none",
        strip.text=element_text(face="italic", size=8),
        plot.title = element_text(face="bold", size=11)
    )
    
plt2

In [None]:
plt3 = plt2 + ggtitle("plot3") +
    facet_wrap(~state, scales="free", nrow=2) +
    stat_smooth(method = "lm", formula=y~x, size=1, se=TRUE, colour="black") +
    stat_smooth(aes(group=state), method = "loess", formula=y~x, size=1, se=TRUE, colour="red")

plt3

In [None]:
#attributes(ggplot_build(plt3))

typeof(plt3)

head(ggplot_build(plt3)$data[[2]]);#LM fit
head(ggplot_build(plt3)$data[[3]]);#LOESS fit

In [None]:
plt4 = ggplot() + ggtitle("plot4") +
    geom_bar(data=dat, aes(x=1, y=area, fill=state), stat="identity", position="stack") +
    labs(x="", y="Cumulative Area", fill="Midwest") +
    scale_x_discrete(breaks="", labels="") +
    theme(
        axis.title = element_text(face="bold", size=9),
        axis.text = element_text(size=7),
        legend.title = element_text(face="bold", size=9),
        legend.text = element_text(size=7),
        plot.title = element_text(face="bold", size=11)
    )

plt4

In [None]:
pdf("Figures.pdf", width=40, height=20)
print(plt1)
print(plt2)
print(plt3)
print(plt4)
dev.off()

#ggsave()

<font size="6" color="68829E" face="calibri">
<b> Saving </b>
</font>
<br>
<font size="4" color="black" face="calibri">
Workspace objects can be saved for a later date (.Rda/.RData). <br>
Likewise, results (data, figures, apps) can be outputted as specific files.
</font>
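
<font size="4" color="black" face="calibri">
A minimal round trip with save() and load(), written to a temporary file so nothing in the working directory is touched. The object name scores is illustrative.
</font>

```r
scores <- c(1, 2, 3)                  # illustrative object to keep

tmp <- tempfile(fileext = ".RData")   # temporary file path
save(scores, file = tmp)              # write the object to disk

rm(scores)                            # remove it from the workspace
exists("scores")                      # no longer in the workspace

load(tmp)                             # restore it under its original name
print(scores)
```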

In [None]:
ls()

In [None]:
rm(plt1,plt2,plt3,plt4)
ls()

In [None]:
save.image("05_29_2019.RData")
#save(list=c("datOH","datminusOH"), file="05_29_2019.RData")

#load("05_29_2019.RData")

#library(session);
#save.session("RSession.Rda");
#restore.session("RSession.Rda");

In [None]:
write.table(dat, file="ExampleGGData.csv", sep=",")
list.files(path=".", pattern="\\.csv$")

#rm(list=ls());cat("\014")

<font size="6" color="68829E" face="calibri">
<b> Loading </b>
</font>
<br>
<font size="4" color="black" face="calibri">
Reading data from files (.csv, .txt, .xlsx) can be tricky in R (multiple options/arguments). <br>
Examples: delimiter/separator (tab, comma), header, etc. <br> <br>
</font>
<font size="3" color="#000000" face="calibri">
#Tip 6. try package 'openxlsx' for Excel.
</font>
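
<font size="4" color="black" face="calibri">
As a sketch of how the separator and header arguments work together, the example below writes a tiny tab-delimited file to a temporary location (the contents are made up) and reads it back with read.table().
</font>

```r
# Write a small tab-delimited file with a header row
tmp <- tempfile(fileext = ".tsv")
writeLines(c("id\tgroup\tvalue",
             "1\ttest\t10.4",
             "2\tcontrol\tNA"), tmp)

# sep = "\t" sets the delimiter, header = TRUE uses the first row as column names
tab <- read.table(tmp, sep = "\t", header = TRUE, stringsAsFactors = FALSE)

str(tab)
names(tab)
is.na(tab$value)   # the literal NA in the file is read as a missing value
```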

In [None]:
df = read.table(file="ExampleGGData.csv", sep=",", header=TRUE)
head(df)

#read.csv()
#read.delim()

In [None]:
#https://github.com/fivethirtyeight/data

dfOnline = read.table(file="https://raw.githubusercontent.com/fivethirtyeight/data/master/bachelorette/bachelorette.csv", sep=",", header=TRUE)
head(dfOnline)

In [None]:
#quit()

<font size="6" color="68829E" face="calibri">
<b> Command Line </b>
</font>
<br>
<font size="4" color="black" face="calibri">
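R scripts can also be run non-interactively from the shell. <br>
Anything placed after ```--args``` is passed to the script and can be read with ```commandArgs()```.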
</font>

``` $ R --slave --no-restore --file=print-args.R --args ```
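
<font size="4" color="black" face="calibri">
The contents of print-args.R are not shown in this notebook. The sketch below is one guess at what such a script might look like, using the base function commandArgs() to pick up whatever follows --args on the command line.
</font>

```r
# print-args.R (assumed contents, for illustration only)

# trailingOnly = TRUE keeps only the values given after --args
args <- commandArgs(trailingOnly = TRUE)

cat("Number of arguments:", length(args), "\n")

# Print each argument on its own line
for (a in args) {
    cat(a, "\n")
}
```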

<font size="4" color="black" face="calibri">
<b><i>References</i></b> <br>
</font>
<font size="4" color="black" face="calibri">
Software Carpentry: Our Lessons (<a href="https://software-carpentry.org/lessons/">https://software-carpentry.org/lessons</a>) <br>
R Packages: Organize, test, document and share your code (<a href="https://r-pkgs.org/">https://r-pkgs.org</a>) <br>
Tidyverse: R packages for data science (<a href="https://www.tidyverse.org/">https://www.tidyverse.org</a>) <br>
</font>
