-
Notifications
You must be signed in to change notification settings - Fork 0
/
R-bioinfo-intro.Rnw-former
5074 lines (3696 loc) · 145 KB
/
R-bioinfo-intro.Rnw-former
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
%% ;;; -*- mode: Rnw; -*-
\synctex=1
\documentclass[a4paper,11pt]{article}
\usepackage{graphics}
%\usepackage[dvips]{graphicx}
\usepackage{amssymb,amsfonts,amsmath,amsbsy}
\usepackage{geometry}
\geometry{verbose,a4paper,tmargin=28mm,bmargin=28mm,lmargin=30mm,rmargin=30mm}
\usepackage{setspace}
\singlespacing
\usepackage{url}
\usepackage{nameref}
\usepackage[english]{babel}
\usepackage[latin1]{inputenc}
\usepackage{times}
\usepackage[T1]{fontenc}
\usepackage[small]{caption}
\usepackage{hyperref}
\usepackage{color}
\newcommand{\cyan}[1]{{\textcolor {cyan} {#1}}}
\newcommand{\blu}[1]{{\textcolor {blue} {#1}}}
\newcommand{\Burl}[1]{\blu{\url{#1}}}
\newcommand{\red}[1]{{\textcolor {red} {#1}}}
\newcommand{\green}[1]{{\textcolor {green} {#1}}}
\newcommand{\mg}[1]{{\textcolor {magenta} {#1}}}
\newcommand{\og}[1]{{\textcolor {PineGreen} {#1}}}
\newcommand{\code}[1]{\texttt{#1}} %From B. Bolker
\newcommand{\myverb}[1]{{\footnotesize\texttt {\textbf{#1}}}}
\newcommand{\Rnl}{\ +\qquad\ }
\newcommand{\Emph}[1]{\emph{\mg{#1}}}
% \newcommand{\R}{{something or other R}} but this gets to be a mess. And
% ugly.
\newcommand{\R}{R}
\newcommand{\flspecific}[1]{{\textit{#1}}}
\newcounter{exercise}
\numberwithin{exercise}{section}
\newcommand{\exnumber}{\addtocounter{exercise}{1} \theexercise \thinspace}
\usepackage[copyright]{ccicons}
\usepackage[authoryear, round, sort]{natbib}
\bibliographystyle{chicago}
%% decreasing margins after knitr output
%% \setlength{\topsep}{0pt}
%% \setlength{\parskip}{0pt}
%% \setlength{\partopsep}{1pt}
\usepackage{gitinfo2}
%% For using listings, so as to later produce HTML
%% uncommented by the make-knitr-hmtl.sh script
%%listings-knitr-html%%\usepackage{listings}
%%listings-knitr-html%%\lstset{language=R}
<<setup,include=FALSE,cache=FALSE>>=
require(knitr)
opts_knit$set(concordance = TRUE)
opts_knit$set(stop_on_error = 2L)
## next are for listings, to produce HTML
##listings-knitr-html%%options(formatR.arrow = TRUE)
##listings-knitr-html%%render_listings()
@
\begin{document}
%% Only takes effect after begin document?
%%listings-knitr-html%%<<listingfigdir,error=FALSE, include=FALSE, cache=FALSE>>=
%%listings-knitr-html%%opts_chunk$set(fig.path = 'figures_html/listings-')
%%listings-knitr-html%%@
\title{A quick and crash introduction to R with a bioinformatics bent}
\date{\gitAuthorDate\ {\footnotesize (Release\gitRels: Rev: \gitAbbrevHash)}}
\author{Ramon Diaz-Uriarte\thanks{Dept.\ of Biochemistry, Universidad
Aut\'onoma de Madrid, Spain, \Burl{http://ligarto.org/rdiaz},
\texttt{r.diaz@uam.es}}}
%% <<>>=
%% opts_chunk$set(size= "small", error=FALSE)
%% @
\maketitle
\tableofcontents
%% To make sure things within page margins
<<include=FALSE>>=
rm(list = ls())
options(width = 60)
@
\section{License and copyright}\label{license}
This work is Copyright, \copyright, 2014, 2015, 2016, 2017, Ramon Diaz-Uriarte,
and is licensed under a \textbf{Creative Commons } Attribution-ShareAlike
4.0 International License:
\Burl{http://creativecommons.org/licenses/by-sa/4.0/}.
\centerline \ccbysa
\section{Scenarios}\label{scenarios}
\begin{itemize}
\item You are designing an experiment: 20 plates are to be assigned
(randomly) to 4 conditions. You are too young (or too old) to cut paper
into pieces, place it in a urn, etc. You want a better, faster
way. Specially because your next experiment will involve 300 units, not
20.
\item The authors of a paper claim there is a weak relationship between
levels of protein A and growth. However, you know that some of the
samples are from males and some are from females, and you suspect the
correlation is present only in males. The authors provide the complete
data and you want to check for differences in correlation pattern
between males and females.
\item You've been working on a microarray study. For 100 subjects (50 of
them with leukemia, 50 of them healthy) you have the $Cy3/Cy5$ intensity
ratios for 300,000 spots. You just got the email with the compressed
data file. You are leaving for home. In less than five minutes you'd
like to get a quick idea of what the data look like: maximum and minimum
values for all spots, average for 5 specific control spots
(corresponding to probes 10, 23, 56, 10,004, 20,000), and a
quick-and-dirty statistical test of differences for two specific probes,
probe 7000 and 99,000, that correspond to two well know genes.
\item Tomorrow you'll look at the data in more detail. For a set of 20
selected probes you will want to: a) take a look at the mean of the
intensity, variance of intensity, and the mean of the intensity in each
of the two groups; b) plot the intensity vs.\ the age of the subject; c)
plot the log of the intensity vs.\ the age of the subject.
%% \item A paper describes a specific growth curve model (some non-linear
%% function). You would like to see what the actual curve looks like, and
%% how much variation you get if you modify the parameters slightly.
\end{itemize}
For each of those problems, would you \ldots
\begin{itemize}
\item Know how to do it?
\item Do it quickly?
\item Save all the steps of what you did so that 6 months from today you
know \textbf{exactly} what you did, can repeat it, and apply it to new data?
\end{itemize}
This course is a quick introduction to an ``environment for statistical
computing and graphics'' that will allow you to carry out each of the above.
\section{This document and how to use it}
This document is to be used a crash and relatively quick introduction to
R, with a clear Bioinformatics bias\footnote{OK, there is an example with
birds, reptiles, metabolic rates, body size, etc, that is not really
bioinformatics but \ldots it is a neat data set and matches with some of
my early scientific loves.}.
The structure and logic is as follows:
\begin{itemize}
\item First, with the scenarios above, we try to motivate you.
\item I then (section \ref{mistery}) show a six-line real example of a common problem in
Bioinformatics (multiple testing). You might not understand much of what
is done, but I will explain it in class (this is not a textbook, but a
document for a class).
\item We next go over a few practical things we need to get out of the way
(sections \ref{basics} and \ref{console}).
\item We then (sections \ref{readingr} to \ref{plotsplots}) cover in some
detail what are the main objects of R (vectors, data frames, matrices),
how to manipulate them and some plotting. This can be boring, and this
used to be after section \ref{crashex} but some students argued strongly
that they'd rather see this first, so this comes now first.
\item After that, we jump into \R\ with three longer examples (section
\ref{crashex}). Again, on first reading you might not understand
all of what is done, but we will go over it in class.
\item Then we cover tables and a little bit of programming (sections
\ref{tables} and \ref{rprog}).
\item We then revisit some examples.
% \item Then, up to section \ref{back-scenarios}, we go over many details in
% a more systematic, but more boring, way.
\item If you understand all that is done up to that point, you should have a
decent working understanding of how to use R, and can move on your own.
\item But you do not want to skip section \ref{debug}: this section covers
debugging. Being able to debug quickly and painlessly is essential
for using R effectively (and enjoying it even while debugging).
\item Finally, in section \ref{more-ex} I include several longer commented
examples. These should bring together many of the features we have
mentioned, but also introduce new functions, and will give you practice
with programming and debugging.
\end{itemize}
The material has been ordered that way on purpose. Yes, expect some
frustration when working through sections \ref{mistery} and \ref{crashex},
but definitely look at them before class and try to understand what is
going on. After section \ref{crashex} things should be smoother (but more
boring, until we get to section \ref{more-ex}). And here, definitely, you
\textbf{must} type things on your own and understand the output. I have
tried to use a kind of spiraling lay out, working over several things
repeatedly, and repeating and going deeper on some key ideas. Hopefully,
this will allow you to understand the material better, connect it with
other pieces, and retain it for longer.
\subsection{The PDF and the code}
The primary output of this document is a PDF.%% I also provide (and will
%% use) an HTML file; this is kind of experimental (a few things might not
%% be typeset correctly, or some links not work fully, etc). The HTML offers
%% the advantage that we can accommodate, on the same screen, a running
%% session with \R\, and have the web browser size adjusted to our liking (so
%% less fiddling around than using a PDF). In the HTML, the code is in red
%% and the output in blue.
However, all the original files for the document are available (again, under a
Creative Commons license ---see section \ref{license}) from
\Burl{https://github.com/rdiaz02/R-bioinfo-intro}. (Note that in the
github repo you will not see the PDF %% HTML,
or R-bioinfo-intro.R files,
since those are derived from the Rnw file).
For many commands I do not show the output (e.g., because it
would just provide boring and space-filling output). However, make sure
you type and understand it. You can copy and paste, of course, but I
strongly suggest you type the code and change it, modify it, etc.
\subsection{Other files you need in addition to this one}
You should have (or should get) the following files:
\begin{itemize}
\item \code{hit-table-500-text.txt}
\item \code{AnotherDataSet.txt}
\item \code{anage.RData}
\item \code{lastExample.R}
\item \code{Condition\_A.txt}, \code{Condition\_B.txt}, \code{Condition\_C.txt}
%% \item \code{permafrost-zip}, a compressed directory with 5000 files.
%% \item \code{R-bioinfo-intro-html-dir.zip}: a zip file that, when
%% uncompressed, while give you a directory that contains the HTML version
%% of this PDF and a directory for the figures (used in the HTML). You can
%% open the HTML with any browser.
%% \item \code{script1.R}
\item \code{R-bioinfo-intro.R}
\end{itemize}
All of the files above (except the last) are mentioned or used in this
document. What about \code{R-bioinfo-intro.R}? That is all the \R\ code
used in this document.
\subsection{R and Bioinformatics}
If you are reading this document, it is probably because you already have
some idea of what \R\ is. So no long details here. A summary is ``R is a
free software environment for statistical computing and graphics.''
(\Burl{http://www.r-project.org/}) and ``R is 'GNU S', a freely available
language and environment for statistical computing and graphics which
provides a wide variety of statistical and graphical techniques: linear
and nonlinear modelling, statistical tests, time series analysis,
classification, clustering, etc. '' (\Burl{http://cran.r-project.org/}).
Virtually all of the statistical analysis done in Bioinformatics can be
conducted with R. Moreover, ``data mining'' (which is, according to some
authors, simply ``statistics + marketing'') is well covered in \R:
clustering (often called ``unsupervised analysis'') in many of its
variants (hierarchical, k-means and family, mixture models, fuzzy, etc),
bi-clustering, classification and discrimination (from discriminant
analysis to classification trees, bagging, support vector machines, etc),
all have many packages in \R. Thus, tasks such as finding homogeneous
subgroups in sets of genes/subjects, identifying genes that show
differential expression (with adjustment for multiple testing), building
class-prediction algorithms to separate good from bad prognosis patients
as a function of genetic profile, or identifying regions of the genome
with losses/gains of DNA (copy number alterations) can all be carried out
in \R\ out-of-the-box (see BioConductor and CRAN).
[A proselitizing note] \R\ is free software, meaning ``free'' as in free
speech (not free as in free beer; in Spanish, free as in "libre", not free
as in "gratis"). The definition of free software is explained, for
instance, in \Burl{http://www.gnu.org/philosophy/free-sw.html}. Why does
it matter that \R\ is free software? For one thing, it makes your access
to it simple and easy. As well, you can play with the system and look at
the inside (you can look at the original code) and do with that code a
variety of things, including modifying it, learning from it, etc. In
addition, that \R\ is free software is, arguably, one of the reasons of
its incredible success (and, for instance, one explanation for why there
are over 6000 contributed, and free software, packages). Moreover,
Bioinformatics, as we know it, would not exist without free
software. Newton, and others before him, used the expression ``standing on
the shoulders of giants'' when explaining how the development of science
and other intellectual pursuits builds upon past accomplishments; in
Bioinformatics (and many other fields), we are also standing on the
shoulders of millions of lines of free software.
\subsection{Some references}
When you download \R\ you also download ``An introduction to R'', which is
an excellent intro. There are many freely available documents (of variable
quality, of course) here:
\Burl{http://cran.r-project.org/other-docs.html}. Many books are listed
(and some briefly commented) here: \Burl{http://www.r-project.org/doc/bib/R-books.html}.
This is a partial list a books I like and use when preparing classes:
\begin{description}
\item[Programming] As it says, just focus on programming R:
\begin{itemize}
\item \textit{R Programming for Data Science}. Peng. (This is an
ebook and PDF, and you can pay whatever you want for it.). If you want
to start somewhere and use only a single reference, \textbf{I'd start
with this book}.
\item \textit{Advanced R}. Wickham. (If you go to the web page for the
book, in github, you can download the complete sources and build your
own pdf).
\item \textit{The art of R programming}. Matloff.
\end{itemize}
\item[Stats and some programming] Introductory statistics (or introductory
data science) with some
programming interleaved.
\begin{itemize}
\item \textit{Introductory statistics with R, 2nd ed}. Dalgaard.
\item \textit{R in Action}. Kabacoff. (A second ed.\ available since
June 2015).
\end{itemize}
\item[Linear models et al.] Linear models are fundamental in
statistics. And fascinating.
\begin{itemize}
\item \textit{An R companion to applied regression}. Cox and
Weisberg. John Fox is also the author of an excellent textbook (now in
its second edition) about linear models. This companion is absolutely
fantastic (and can be used even if you don't have the other
textbook). You probably want this book.
\item \textit{Regression modeling strategies, 2nd ed}. Harell. Among its
many virtues, this book contains excellent discussions of the
problems of variable selection.
\item Faraway has two books on linear models with \R, both published by
CRC. Wood is the author of a great book on Generalized Additive Models
(also CRC). Etc, etc.
\end{itemize}
\item[Machine learning] Machine learning, classification, etc. And many of
the examples are bioinformatics-inspired.
\begin{itemize}
\item \textit{Applied predictive modeling}. Kuhn and Johnson.
\item \textit{An introduction to statistical learning}. James, Witten,
Hastie, Tibshirani. The PDF of the book is available for download from
their web page.
\end{itemize}
\end{description}
There are of course many others (including classics such as the two by
Venables and Ripley, or Chamber's \textit{Software for data analysis},
many specific to some fields, several devoted to graphics, etc, etc).
There is also a (short) list of books I think are not worth it; ask me
about them in class.
\clearpage
\section{This will not be mysterious at the end of the course}\label{mistery}
(This is an example we go over in section \ref{example-multtest}, p.\
\pageref{example-multtest}, with a different number of genes).
We might have heard about the multiple testing problem with microarrays:
if we look at the p-values from a large number of tests, we can be mislead
into thinking there is something happening (i.e., there are differentially
expressed genes) when, in fact, there is absolutely no signal in the
data. Now, you are convinced by this. But you have a stubborn colleague
who isn't. You have decided to use a simple numerical example to show her
the problem.
This is the fictitious scenario: 50 subjects, and of those 30 have cancer
and 20 don't. You measure 1000 genes, but none of the genes have any real
difference between the two groups; for simplicity, all genes have the same
distribution. You will do a t-test per gene, show a histogram of the
p-values, and report the number of ``significant'' genes (genes with p <
0.05).
This is the R code:
<<eval=TRUE,tidy=FALSE, fig.height=5>>=
randomdata <- matrix(rnorm(50 * 1000), ncol = 50)
class <- factor(c(rep("NC", 20), rep("cancer", 30)))
pvalues <- apply(randomdata, 1,
function(x) t.test(x ~ class)$p.value)
hist(pvalues)
sum(pvalues < 0.05)
@
The example could be made faster, you could write a function, prepare
nicer plots, etc, but the key is that in six lines of code you have
settled the discussion.
Let's try to understand what we did. But first, we need to install \R, and
maybe some additional packages.
\section{Very basics of using \R}\label{basics}
\subsection{Installing \R}
Go to CRAN, \Burl{http://cran.r-project.org/}. Now, if you know what
source code is, and you want to compile R, go to Sources
(\Burl{http://cran.r-project.org/sources.html}). Otherwise, just download
a binary for your operating system (\Burl{http://cran.r-project.org/bin/}).
\begin{itemize}
\item For Linux, most distros have pre-built binaries, so with Debian
use apt-get install r-base r-base-dev, with Fedora and RH yum install
whatever, etc. There are instructions in the CRAN page if you need
them, though, for many distros.
However, if you use Ubuntu, please read the instructions in
\Burl{http://cran.r-project.org/bin/linux/ubuntu/README.html}, since the
default Ubuntu packages can be outdated.
\item If you use Windows, you want to install "base". It says so
clearly: "Binaries for base distribution (managed by Duncan
Murdoch). This is what you want to install R for the first time."
\item If you use Mac, if you play with installation options, note that
you need to install the tcl/Tk X11 libraries. If you run into
trouble, make sure to read the FAQ
(\Burl{http://cran.r-project.org/bin/macosx/RMacOSX-FAQ.html}).
\item However you do it, please make sure you have a recent version of R.
\end{itemize}
% \item You can change the language if you want. For Spanish, use ``es'' (or
% edit directly the {\tt Rconsole} file).
% \end{itemize}
\subsection{Installing RStudio}\label{rstudio}
There are a variety of ways of interacting and using R. For ease, and
because it is a really nice piece of software, we will use RStudio. We
want to use the "Dektop", that you can download from here:
\Burl{http://www.rstudio.com/products/rstudio/download/}.
\subsection{Editors and ``GUIs'' for \R, et al.}\label{guis}
Ah, this is a nice topic for a long, passionate, conversation. In this
course, we will, by default, be using RStudio. I will, however, often use
Emacs + ESS (\Burl{http://ess.r-project.org/}). For those used to Eclipse, there
is a plug-in designed to work with R: StatET
(\Burl{http://www.walware.de/goto/statet}). Another popular interface is
JGR (\Burl{http://www.rforge.net/JGR/}). RKward
(\Burl{http://rkward.sourceforge.net/wiki/Main_Page}) is also popular in
some places (this was originally Linux-only, but not anymore). Some Mac
users are very happy just the default, plain, interface provided by R
under Mac OS X. And some Windows users like Tinn-R
(\Burl{http://nbcgib.uesc.br/lec/software/editores/tinn-r/en}); I used to
use Tinn-R in R courses I taught 8 to 10 years ago, but I think it has
lost ground to RStudio, and it is only Windows. If you love vim, there is
Vim-R-plugin (\Burl{http://www.vim.org/scripts/script.php?script_id=2628};
\Burl{http://manuals.bioinformatics.ucr.edu/home/programming-in-r/vim-r}). And
then, there are many other options; %% (an outdated list is available from
%% \Burl{http://www.sciviews.org/\_rgui/}, and
some other entries around in
internet land are
\Burl{http://www.theusrus.de/blog/r-guis-which-one-fits-you/} and
\Burl{http://stats.stackexchange.com/questions/5292/good-gui-for-r-suitable-for-a-beginner-wanting-to-learn-programming-in-r}).
If you plan to spend a fair amount of time doing Bioinformatics, then
you'll spend a fair amount of time programming, probably using a variety
of languages (R, Python, C, Perl, Java, PHP, etc). Becoming used to a
programmer-friendly editor that ``understands'' all of the languages you
use is thus worth it. Choosing an editor is a highly personal issue. Emacs
is an editor and then a lot of other things (that is what I use, for
programming, editing text, email, etc); if you use Emacs then Emacs + ESS
is the perfect combination for you. For vim users, there is the
vim-R-plugin. Those who come from the Java world might be familiar with
Eclipse (and, thus, you'll want to give StatET a try). Kate is another
great editor that understands many editors and it easy to submit code to
an R process running in the terminal, but it lacks some nice features that
RStudio and Emacs+ESS have (but RKward might then be a nice option). Some
people (myself) like to use a single editor for most/all editing
tasks. Some other people jump around (they use RStudio for R, Eclipse for
Java, and maybe Kate for Python). You get to choose.
Note, though, that one thing is syntax highlighting (and syntax
highlighting for R is available for many, many editors) and another is the
ability to interact with an R session, provide shortcuts for displaying
help, offering object browsers, etc. Of course, you are the one who must
weight the choices.
The summary (highly biased?): I definitely prefer Emacs (+ ESS), but in
this course I will not attempt to teach you Emacs + ESS. So if you do not
know Emacs, then try RStudio, which is what we will ``officially''
use. However, if you like Eclipse, then use Eclipse with StatEt. If you
like Kate, use Kate, etc, but I might not be able to help you.
Note that all of the above have a different purpose from R Commander
(\Burl{http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/}) which, as it says, is
a basic statistics GUI for R. In this course we will rarely (if at all)
use R Commander, since these notes are focused on programming and using R
from the command line. However, I do recommend that you play around with R
Commander. Another GUI for statistics with R (that I have not used but
know is liked by some people) is Deducer:
\Burl{http://www.deducer.org}.
\subsection{Installing R packages}\label{packages}
Most ``for real'' work with \R\ you do will require installation of
packages. Packages provide additional functionality. Packages are
available from many different sources, but possibly the major ones now are
CRAN and BioConductor.
If a package is available from CRAN you can do
<<eval=FALSE>>=
install.packages("car")
@
(for example --- this installs the \code{car} package and its
dependencies).
If you want to install more than one package you can do (don't execute the
code below as we will not use those packages)
<<eval=FALSE>>=
install.packages(c("RJaCGH", "varSelRF"))
@
In Bioinformatics, BioConductor (\Burl{http://www.bioconductor.org}) is a
well known source of many different packages. BioConductor packages can be
installed in several ways, and there is a semi-automated tool that allows
you to install suites of BioC packages (see
\Burl{http://www.bioconductor.org/install/}). For example, go to
\Burl{http://www.bioconductor.org/packages/release/bioc/html/limma.html}
and see how instructions are clearly given there.
Note: the new (as from about summar of 2018) way of installing BioC
packages is via
<<eval=FALSE>>=
BiocManager::install("package_name")
@
See
\Burl{https://cran.r-project.org/web/packages/BiocManager/index.html}. But
this is not updated even in the BioC page for most (all?) packages
yet. The previous
<<eval=FALSE>>=
source("https://bioconductor.org/biocLite.R")
biocLite("package_name")
@
still work, though it gives a warning. (So if today --- 2018-10-17--- you
visit the above page for limma, and you follow the recommendation, you
will get a warning).
As we said above, sometimes packages depend on other packages. If this is
the case, by default, the above mechanisms will also install dependencies.
With some GUIs (under some of the operating systems) you can also install
packages from a menu entry. For instance, under Windows, there is an entry
in the menu bar called \textbf{Packages}, which allows you to install from
the Internet, change the repositories, install from local zip files,
etc. Likewise, from RStudio there is an entry for installing packages
(under ``Tools'').
Packages are also available from other places (RForge, github,
etc); you will often find instructions there.
Now, make sure you install package ``car'', which we will use below:
<<eval=FALSE>>=
install.packages("car")
@
(or do it from the menu of RStudio).
How do you find a package? Looking at a list of 6000 things in CRAN and
another thousand in BioC is not a good idea. In addition to google et al.,
there are task views in CRAN: \Burl{http://cran.r-project.org/web/views/},
and there is a not too dissimilar thing in BioC. In addition,
\code{findFn}, from package ``sos'' can help (see section \ref{help}).
\subsection{Starting \R}
If you use RStudio, just start RStudio (icons should have been placed
wherever they are placed in your operating system, or start if from the
command line if you know how to/like to do that). From what I've been
told, RStudio should be available from the menus in your desktop, in
Windows, Linux, or Mac OS.
If you use other systems (Emacs + ESS, Eclipse, RKward, Kate, etc) just use
the appropriate procedure (I assume that if you are using any of these you
know what to do).
\subsection{Stopping \R}
You can always just kill RStudio; but that is not nice. In all systems
typing \code{q()} at the command prompt should stop R/RStudio. There will
also be menu entries (e.g., ``Quit RStudio'' under ``File'', etc).
<<eval=FALSE>>=
q()
@
Say no to the question about saving the workspace.
What if things hang? Try \code{Control-C} and/or
\code{Esc}.
\section{The R console for interactive calculations}\label{console}
In what follows, I will assume that you are either running \R\ from
RStudio, or that you know your way around and are using some other means
(e.g., directly from the \R\ icon in Windows, or from Emacs + ESS in any
operating system, or using Eclipse, etc).
Regardless of how you interact with \R, once you start an interactive \R\
session, there will always be a console, which is where you can enter
commands to have them executed by \R. In RStudio, for instance, the
console is usually located on the bottom left.
Now, move to the console and at the prompt (which will often start with
\code{>}) type ``1 + 2'' (without the quotes) and press \code{Enter}:
<<>>=
1 + 2
@
(All the code for this document is available, so you can copy and paste
from the original code directly. If you copy code from other documents,
say a PDF, that show the prompt, do not copy the prompt itself. That
should not be an issue in this document, though, as the code sections do
not show the prompt).
Look at the output. In this document, code chunks, if they show output,
will show the output preceded by \code{\#\#}. In R (as in Python), \code{\#}
is the comment character. In your console, you will NOT see the \code{\#\#}
preceding the output. This is just the way it is formatted in this
document.
Note also that you see a \code{[1]}, before the \code{3}. Why? Because the
output of that operation is, really, a vector of length 1, and \R\ is
showing its index. Here it does not help much, but it would if we were to
print 40 numbers:
<<>>=
1:40
@
Now, assign \code{1 + 2} that to a variable:
<<>>=
v1 <- 1 + 2
@
\noindent(you can also use \code{=} for assignment, but I prefer not to).
And now display its value
<<>>=
v1
@
If you want to be more verbose, do
<<>>=
print(v1)
@
Alternatively, you could surround the expression in parentheses:
<<>>=
(v1 <- 1 + 2)
@
and that makes the assignment AND shows you the value just assigned to
\code{v1}.
Finally, you could do
<<>>=
v1 <- 1 + 2; v1
@
thus separating the two commands with a \code{;}, though that is rarely
a good idea except for very special cases.
It is also possible to break commands, if it is clear to \R\ that the
expression is not yet finished:
\begin{verbatim}
v2 <- 4 - ( 3 * [Enter]
2)
\end{verbatim}
You will see a \code{+} that indicates the line is being continued: \R\ is
still expecting more input (in this case, you must close the parenthesis
and add something after the \code{*}). But sometimes things get
confusing. You can bail out by typing \code{Ctrl + c} (Unix) or
\code{Escape}, and abort the calculation.
Of course, use parenthesis as you think appropriate to make the meaning of
an expression clear. \R\ uses, for the usual functions, the usual
precedence rules. If in doubt, use parentheses.
<<>>=
v11 <- 3 * ( 5 + sqrt(13) - 3^(1/(4 + 1)))
@
By the way, if you want to modify partially what you typed, you can repeat
the previous commands with the up-arrow ($\uparrow$)
in RStudio (or Alt-p in ESS); and then move around also using $\uparrow$ ,
etc. You also have tab completion: if you get at the prompt, type \code{v}
and press tab you should be given a bunch of options (that include v1 and
v2, plus several functions that start with ``v'').
\subsection{Naming variables}
We created \code{v1} and \code{v2} above. Names of variables in \R\ must
begin with a letter (also a period, though this will make them
hidden). Then you can mix letters, numbers, \code{.} and
\code{\_}. Variable names are case sensitive, so \code{v1} and \code{V1}
are different things.
Once you have something in a variable, you can just use it instead of that
something:
<<>>=
v3 <- 5
(v4 <- v1 + v3)
(v5 <- v1 * v3)
(v6 <- v1 / v3)
@
Newer assignments silently \textbf{overwrite} previous assignments:
<<>>=
(z2 <- 33)
z2 <- 999
z2
z2 <- "Now z2 is a sentence"
z2
@
You can delete a variable
<<>>=
rm(z2)
@
\subsection{Getting help}\label{help}
Look at one help page:
<<eval=FALSE>>=
help(mean)
@
Now, shorter:
<<eval=FALSE>>=
?mean
@
Now let's use the help to l ,earn about the help system (and yes, read or
take a quick look at it):
<<eval=FALSE>>=
?help
?apropos
@
Now, try
<<eval=FALSE>>=
?normal
?rnorm
apropos("normal")
apropos("norm")
help.search("normal")
@
Many help files include executable code (examples)
<<eval=FALSE, results='hide'>>=
example(rnorm)
example(graphics) ## will give an error
example(lm)
@
\noindent and note how you get to see the code that produced the figures.
example.
Some help files include demos
<<eval=FALSE>>=
demo(graphics)
@
\noindent again, note how you get to see the code that produced the
figures. example.
And some include both
<<eval=FALSE>>=
demo(persp)
example(persp)
@
But there are many other ways of searching for help about how to do
something with \R. Of course, you can google around, use stackoverflow,
etc. There are mailing lists for \R\, and for specific interest groups in \R.
There is a package, ``sos''
(\Burl{http://cran.r-project.org/web/packages/sos/index.html}), that can
help you search functions, etc, in packages that you do not have
installed, ranks search results, etc. It is well documented (see
\Burl{http://cran.r-project.org/web/packages/sos/vignettes/sos.pdf}). The
only problem I see is that only some of the BioConductor packages are
among those searched (and you need an internet connection).
Patrick Burns has a interesting blog entry about R navigation tools:
\Burl{http://www.burns-stat.com/r-navigation-tools/}.
Oh, by the way, RStudio includes an integrated help browser. Use it if you
use RStudio.
\subsection{Error messages}
The best way to learn to use \R\ is to use it. As explained before,
mistakes are harmless, so you should play and experiment. However, there
are two key attitudes that will make your learning a lot faster: first,
using the help system, and second \textbf{paying attention to the error
messages}. Yes, the error messages are written in English, not some
weird, unintelligible language. Sometimes they are a little bit cryptic,
but more often than not, if you read them carefully, you will see how to
approach to problem to fix the mistake, or will realize that what you
typed makes no sense.
Lets look at a few. These are not representative or common or anything
like it. But you should read them, understand them, and think about how to
take corrective action (or realize that I was trying to do something
silly).
<<eval=FALSE>>=
apply(something, 1, mean)
apply(v3, 1, mean)
apply(F, 1, mean) ## this is an interesting one
log("23")
rnorm("a")
lug(23)
rnorm(23, 1, 1, 1, 34)
x <- 1:10
y <- 11:21
plot(x, y)
lm(y ~ x)
z <- 1:10
t.test(x ~ z)
@
\subsection{Coding style}\label{style}
You write R code for the computer to do something, but that code should be
readable by humans (including not only other people besides yourself, but
yourself in the future). Please, make sure your code is tidy and respects
some minimal rules of civility. In particular:
\begin{itemize}
\item Do not extend beyond column 80.
\item Use spaces appropriately; for example, write
\verb@ x <- rnorm(3, mean = 2) @ and NOT
\verb@ x<-rnorm(3,mean=2) @. Thesecondformisclearlyveryhardtoread.
\end{itemize}
There are many other possible coding style guides, but the above two for
me are basic (if I grade code written by you, I will take into account
respect of the above rules). This is not my particular silly snobbery:
look at the code in the base R distribution, or look at the code in
classics such as ``Modern applied statistics with S'' or ``S
programming''(Venables and Ripley), or ``Software for data analysis''
(Chambers), or \ldots. Programming environments (e.g., Emacs + ESS) will
offer ways of tidying your code, and there is even a package that can help
you do it
(\Burl{http://cran.r-project.org/web/packages/formatR/index.html},
\Burl{http://yihui.name/formatR}).
\clearpage
\section{Entering data into \R\ and saving data from \R}\label{readingr}
There are many ways to load data into \R\ (for example, see the book by
P.\ Spector, or the ``R Data Import/Export'' manual
\Burl{http://cran.r-project.org/doc/manuals/R-data.html}). Here we will
only use \code{read.table}.
%% Let's repeat some of what we did with the BLAST example (section \ref{blast}).
<<eval=FALSE>>=
X <- read.table("hit-table-500-text.txt")
head(X)
## We could save what we care about in variables
## with better names
align.length <- X[, 5]
score <- X[, 13]
@
To see a slightly different example, open \code{AnotherDataSet.txt}. Now do:
<<>>=
another.data.set <- read.table("AnotherDataSet.txt",
header = TRUE)
summary(another.data.set)
@
Notice that we used the variable names (and took those names from the
header), and the object is not a matrix, but a data frame (we will see
this later).
\subsection{But where are those files?}\label{wherefiles}
Of course, for \R\ to read those files, you need to tell \R\
\textbf{exactly} where those files are located. This is always the source
of a lot of grief, but is really simple. These are some cases and ways
of dealing with them:
\begin{enumerate}
\item The file you are trying to read lives exactly in the same working
directory where \R\ is running. OK, easy: just read as in the examples above.
\begin{itemize}
\item How do you know what is the working directory where \R\ is
running? Type \code{getwd()}.
\item How do you know where the file you want to read is? Eh, this is up
to you! You should know that (or ask your operating system or search
facilities for it).
\end{itemize}
\item The file you are trying to read \textbf{DOES NOT} live exactly in the same working
directory where \R\ is running. You can either:
\begin{enumerate}
\item Tell \R\ where the file is: specify the full path. Suppose your
file, ``f1.txt'', is in ``C:/tmp''. Then, say \code{X <-
read.table(``C:/tmp/f1.txt'')}.
\item Move \R's working directory to the place where your files
live. Two ways:
\begin{enumerate}
\item Use \code{setwd(``someplace'')}, where ``someplace'' is the