-
Notifications
You must be signed in to change notification settings - Fork 10
/
Copy path_gapminder-data-frame.Rmd
1101 lines (735 loc) · 45.3 KB
/
_gapminder-data-frame.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Untitled"
author: "Stephen Turner"
date: "December 18, 2015"
output: html_document
---
Before coming, you should have downloaded the gapminder data. If you [downloaded this entire lesson repository](https://github.com/bioconnector/workshops/archive/master.zip), once you extract it you'll find it in `workshops/lessons/r/data/gapminder.csv`. Alternatively you can download it directly from the [data page](data.html). This dataset is an excerpt from the [Gapminder](http://www.gapminder.org/) data, that's [already been cleaned up to a degree](https://github.com/jennybc/gapminder).
Open up the data in Excel if you have it installed, and take a look. This particular dataset has 1704 observations on six variables:
* `country` a categorical variable (aka "factor") 142 levels
* `continent`, a categorical variable with 5 levels
* `year`: going from 1952 to 2007 in increments of 5 years
* `pop`: population
* `gdpPercap`: GDP per capita
* `lifeExp`: life expectancy
Let's load the data first. There are two ways to do this. You can use RStudio's menus to select a file from your computer (tools, import dataset, from text file). But that's not reproducible. The best way to do this is to save the data and the script you're using to the same place, and read in the data in using `read.csv`. It's important to tell R that the file has a header, which tells R the names of the columns. We tell this to R with the `header=TRUE` argument.
Once we've loaded it we can type the name of the object itself (`gm`) to view the entire data frame. *Note: doing this with large data frames can cause you trouble, but I'll show you soon how to get around that.*
```{r readGapminder}
# Read in the data from a file
gm <- read.csv("data/gapminder.csv", header=TRUE)
# See what kind of data it is, and print the object to the screen
class(gm)
gm
```
### The dplyr tbl_df
You probably saw that printing the entire dataset wasn't very useful. With very large datasets this could potentially slow down your computer if it tries to display too much. We're going to use a function from the **dplyr** package that we'll cover in more detail later on that will convert the regular `data.frame` into special kind of data frame that also has some additional functionality that we'll see later. First, load the dplyr package, then use the `tbl_df()` function on the data frame, reassigning it back to the object itself.
```{r convertToTbl}
library(dplyr)
gm <- tbl_df(gm)
gm
class(gm) # still a data.frame
```
This functionality comes out of the dplyr package. The first really nice thing about the dplyr package is the `tbl_df` class. A `tbl_df` is basically an improved version of the `data.frame` object. The main advantage to using a `tbl_df` over a regular data frame is the printing: `tbl` objects only print a few rows and all the columns that fit on one screen, describing the rest of it as text. We can turn our gm data frame into a `tbl_df` using the `tbl_df()` function. Let's do that, and reassign the result back to gm. Now, if we take a look at gm's class, we'll see that it's still a data frame, but it's also now a tbl_df. If we now type the name of the object, it will by default only print out a few lines. If this was a "wide" dataset with many columns, it would also not try to show us everything.
### Inspecting data.frame objects
There are several built-in functions that are useful for working with data frames.
* Content:
* `head()`: shows the first 6 rows
* `tail()`: shows the last 6 rows
* Size:
* `dim()`: returns a 2-element vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
* `nrow()`: returns the number of rows
* `ncol()`: returns the number of columns
* Summary:
* `colnames()` (or just `names()`): returns the column names
* `str()`: structure of the object and information about the class, length and content of each column
* `summary()`: works differently depending on what kind of object you pass to it. Passing a data frame to the `summary()` function prints out some summary statistics about each column (min, max, median, mean, etc.)
```{r tibble_functions}
head(gm)
tail(gm)
dim(gm)
names(gm)
str(gm)
summary(gm)
```
Finally, we can click on the object in the Environment pane, or we can use the `View()` function to bring it up in a graphical table viewer.
```{r View, eval=FALSE}
View(gm)
```
### Accessing variables & subsetting data frames
We can access individual variables within a data frame using the `$` operator, e.g., `mydataframe$specificVariable`. Let's print out the population sizes for every country. Then let's calculate the average life expectancy for every country for every year (using the built-in `mean()` function).
```{r}
# display all populations
gm$pop
#mean life expectancy (notice capital E)
mean(gm$lifeExp)
```
Now that's not too interesting. This is the average life expectancy across all countries across all years. We might be interested in something like the life expectancy for a particular country, and how that changes over time. Or maybe the average life expectancy across all countries, separately for each year. Or maybe the average life expectancy for different continents, or both -- the life expectancy for each country for each year. This is the kind of thing we're going to do in the next section.
----
**EXERCISE `r .ex``r .ex=.ex+1`**
1. What's the standard deviation of the life expectancy (hint: get help on the `sd` function with `?sd`).
1. What's the mean population size in millions? (hint: divide by 1000000, or alternatively, `1e6`).
1. What's the range of years represented in the data? (hint: `range()`).
1. What's the median per capita GDP?
----
----
**BONUS: Preview to advanced data manipulation section**
What if we wanted to compute the mean population size and median GDP for each country for each year? We have 12 different years, times 5 continents is 60, times 2 calculations (mean population size and median GDP), gives us 120 operations total.
```{r advanced, echo=TRUE, eval=FALSE}
gm %>%
group_by(continent,year) %>%
summarize(mean(pop), median(gdpPercap)) %>%
View
```
----
[_(Jump back to top...)_](#top)
<a name="dplyr"></a>
# Data manipulation with **dplyr**
[_(Jump back to top...)_](#top)
Data analysis involves a large amount of [janitor work](http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html) -- munging and cleaning data to facilitate downstream data analysis. This section assumes a basic familiarity with R and demonstrates techniques for advanced data manipulation and analysis with the split-apply-combine strategy. We will use the dplyr package in R to effectively manipulate and conditionally compute summary statistics over subsets of a "big" dataset containing many observations.
## The dplyr package
The [dplyr package](https://github.com/hadley/dplyr) is a relatively new R package that makes data manipulation fast and easy. It imports functionality from another package called magrittr that allows you to chain commands together into a pipeline that will completely change the way you write R code such that you're writing code the way you're thinking about the problem.
You already saw one function, the `tbl_df()` function, that makes a data frame nicer to work with interactively. You don't have to turn all your data.frame objects into tbl_df objects, but it does make working with large datasets a bit easier.
## dplyr verbs
The dplyr package gives you a handful of useful **verbs** for managing data. On their own they don't do anything that base R can't do.
0. filter
0. select
0. mutate
0. arrange
0. summarize
0. group_by
They all take a `data.frame` or `tbl_df` as their input for the first argument, and they all return a `data.frame` or `tbl_df`.
### filter()
If you want to filter **rows** of the data where some condition is true, use the `filter()` function.
1. The first argument is the data frame you want to filter, e.g. `filter(gm, ...`.
2. The second argument is a condition you must satisfy, e.g. `filter(gm, year==1982)`. If you want to satisfy *all* of multiple conditions, you can use the "and" operator, `&`. The "or" operator `|` (the pipe character, usually shift-backslash) will return a subset that meet *any* of the conditions.
- `==`: Equal to
- `!=`: Not equal to
- `>`, `>=`: Greater than, greater than or equal to
- `<`, `<=`: Less than, less than or equal to
Let's try it out. For this to work you have to have already loaded the dplyr package.
```{r}
# Load the dplyr package
library(dplyr)
# Show only stats for the year 1982
filter(gm, year==1982)
# Show only stats for the US
filter(gm, country=="United States")
# Show only stats for American contries in 1997
filter(gm, continent=="Americas" & year==1997)
# Show only stats for countries with per-capita GDP of less than 300 OR a life
# expectancy of less than 30. What happened those years?
filter(gm, gdpPercap<300 | lifeExp<30)
```
Finally, take a look at the class of what's returned by a `filter()` function. The `filter()` function takes a data.frame and returns a data.frame. You can operate on this new data.frame just as you would any other data.frame using the `$` operator. E.g., print out the GDP for the two oceanic countries in 2002, and take the mean of that.
```{r}
filter(gm, continent=="Oceania" & year==2002)
filter(gm, continent=="Oceania" & year==2002)$gdpPercap
mean(filter(gm, continent=="Oceania" & year==2002)$gdpPercap)
```
----
**EXERCISE `r .ex``r .ex=.ex+1`**
1. What country and what years had a low GDP (<500) but high life expectancy (>50)?
2. What's the average GDP for Asian countries in 2002?
```{r, echo=FALSE}
filter(gm, gdpPercap<500 & lifeExp>50)
mean(filter(gm, year==2002 & continent=="Asia")$gdpPercap)
mean(filter(gm, year==2002 & continent=="Europe")$gdpPercap)
mean(filter(gm, year==2002 & continent=="Americas")$gdpPercap)
```
----
### select()
The `filter()` function allows you to return only certain rows matching a condition. The `select()` function lets you subset the data and restrict to a number of columns. The first argument is the data, and subsequent arguments are the columns you want. Let's just get the year and the population variables.
```{r select}
select(gm, year, pop)
```
### mutate()
The `mutate()` function adds new columns to the data. Remember, the variable in our dataset is GDP per capita, which is the total GDP divided by the population size for that country, for that year. Let's mutate this dataset and add a column called gdp:
```{r mutate}
mutate(gm, gdp=pop*gdpPercap)
```
Mutate has a nice little feature too in that it's "lazy." You can mutate and add one variable, then continue mutating to add more variables based on that variable. Let's make another column that's GDP in billions.
```{r mutatelazy}
mutate(gm, gdp=pop*gdpPercap, gdpBil=gdp/1e9)
```
### arrange()
The `arrange()` function does what it sounds like. It takes a data frame or tbl and arranges (or sorts) by column(s) of interest. The first argument is the data, and subsequent arguments are columns to sort on. Use the `desc()` function to arrange by descending.
```{r arrange}
arrange(gm, lifeExp)
arrange(gm, year, desc(lifeExp))
```
### summarize()
The `summarize()` function summarizes multiple values to a single value. On its own the `summarize()` function doesn't seem to be all that useful. The dplyr package provides a few convenience functions called `n()` and `n_distinct()` that tell you the number of observations or the number of distinct values of a particular variable.
```{r summarize}
summarize(gm, mean(pop))
summarize(gm, meanpop=mean(pop))
summarize(gm, n())
summarize(gm, n_distinct(country))
```
### group_by()
We saw that `summarize()` isn't that useful on its own. Neither is `group_by()` All this does is takes an existing tbl and coverts it into a grouped tbl where operations are performed by group.
```{r groupby}
gm
group_by(gm, continent)
```
The real power comes in where `group_by()` and `summarize()` are used together. Let's take the same grouped tbl from last time, and pass all that as an input to summarize, where we get the mean population size. We can also group by more than one variable.
```{r gby_nopipe}
summarize(group_by(gm, continent), mean(pop))
group_by(gm, continent, year)
summarize(group_by(gm, continent, year), mean(pop))
```
## The almighty pipe: **%>%**
This is where things get awesome. The dplyr package imports functionality from the [magrittr](https://github.com/smbache/magrittr) package that lets you _pipe_ the output of one function to the input of another, so you can avoid nesting functions. It looks like this: `%>%`. You don't have to load the magrittr package to use it since dplyr imports its functionality when you load the dplyr package.
Here's the simplest way to use it. Remember of the `tail()` function. It expects a data frame as input, and the next argument is the number of lines to print. These two commands are identical:
```{r, results='markup'}
tail(gm, 5)
gm %>% tail(5)
```
So what?
Now, think about this for a minute. What if we wanted to get the life expectancy and GDP averaged across all Asian countries for each year? (See top of image). Mentally we would do something like this:
0. Take the `gm` dataset
0. `mutate()` it to add raw GDP
0. `filter()` it to restrict to Asian countries only
0. `group_by()` year
0. and `summarize()` it to get the mean life expectancy and mean GDP.
But in code, it gets ugly. First, `mutate` the data to add GDP.
```{r}
mutate(gm, gdp=gdpPercap*pop)
```
Wrap that whole command with `filter()`.
```{r}
filter(mutate(gm, gdp=gdpPercap*pop), continent=="Asia")
```
Wrap that again with `group_by()`:
```{r}
group_by(filter(mutate(gm, gdp=gdpPercap*pop), continent=="Asia"), year)
```
Finally, wrap everything with `summarize()`:
```{r nopipemess}
summarize(
group_by(
filter(
mutate(gm, gdp=gdpPercap*pop),
continent=="Asia"),
year),
mean(lifeExp), mean(gdp))
```
Now compare that with the mental process of what you're actually trying to accomplish. The way you would do this without pipes is completely inside-out and backwards from the way you express in words and in thought what you want to do. The pipe operator `%>%` allows you to pass data frame or tbl objects from one function to another, so long as those functions expect data frames or tables as input.
<img src="{{site.baseurl}}/assets/gmdplyr.png", width=750>
This is how we would do that in code. It's as simple as replacing the word "then" in words to the symbol `%>%` in code.
```{r pipe}
gm %>%
mutate(gdp=gdpPercap*pop) %>%
filter(continent=="Asia") %>%
group_by(year) %>%
summarize(mean(lifeExp), mean(gdp))
```
----
**EXERCISE `r .ex``r .ex=.ex+1`**
Here's a warm-up round. Try the following.
What was the population of Peru in 1992? Show only the population variable. Answer should be 22430449. _Hint:_ 2 pipes; use `filter()` and `select()`.
```{r, echo=FALSE, results='markup'}
gm %>% filter(country=="Peru" & year==1992) %>% select(pop)
```
Which countries and which years had the worst five GDP per capita measurements? _Hint:_ 2 pipes; use `arrange()` and `head()`.
```{r, echo=FALSE, results='markup'}
gm %>%
arrange(gdpPercap) %>%
head(5)
```
What was the average life expectancy across all contries for each year in the dataset? _Hint:_ 2 pipes; `group_by()` and `summarize()`.
```{r, echo=FALSE, results='markup'}
gm %>%
group_by(year) %>%
summarize(mean(lifeExp))
```
----
----
**EXERCISE `r .ex``r .ex=.ex+1`**
That was easy, right? How about some tougher ones.
Which five Asian countries had the highest life expectancy in 2007? _Hint:_ 3 pipes; `filter`, `arrange`, and `head`.
```{r, echo=FALSE, results='markup'}
gm %>%
filter(continent=="Asia", year==2007) %>%
arrange(desc(lifeExp)) %>%
head(5)
```
How many countries are on each continent? _Hint:_ 2 pipes; `group_by`, `summarize(n_distinct(...))`
```{r, echo=FALSE, results='markup'}
gm %>%
group_by(continent) %>%
summarize(n_distinct(country))
```
Separately for each year, compute the correlation coefficients (e.g., `cor(x,y)`) for life expectancy (y) against both log<sub>10</sub> of the population size and log<sub>10</sub> of the per capita GDP. What do these trends mean? _Hint:_ 2 pipes; `group_by` and `summarize`.
```{r, echo=FALSE, results='markup'}
gm %>%
group_by(year) %>%
summarize(cor(log10(pop), lifeExp), cor(log10(gdpPercap), lifeExp))
```
_Really tough one_: Compute the average GDP (not per-capita) in billions averaged across all contries separately for each continent separately for each year. What continents/years had the top 5 overall GDP? _Hint: 6 pipes. If you want to arrange a dataset by a value computed on grouped data, you first have to pass that resulting dataset to a funcion called `ungroup()` before continuing to operate._
```{r, echo=FALSE, results='markup'}
gm %>%
mutate(gdp=pop*gdpPercap/1e9) %>%
group_by(continent, year) %>%
summarize(meangdp=mean(gdp)) %>%
ungroup() %>%
arrange(desc(meangdp)) %>%
head(5)
```
----
## Writing out data
Remember, running functions on data frames doesn't actually change the data frame. We have to reassign it back to an object first. First, lets create a small dataset that only has the data for 1997 in it.
```{r}
filter(gm, year==1997)
gm
gm97 <- filter(gm, year==1997)
```
Next, check what your working directory is with `getwd()` with no arguments, and look up some help for `write.table()` and `write.csv()`.
```{r, eval=FALSE}
getwd()
help(write.table)
help(write.csv)
```
Now you can save the new reduced data frame to a comma-separated file called `gm97.csv` using the `write.csv()` function.
```{r, eval=FALSE}
write.csv(gm97, file="gm97.csv")
```
Later on you can load this particular dataset again either using the menus (Tools -- Import Dataset) or using `read.csv()`.
[_(Jump back to top...)_](#top)
```{r, echo=FALSE}
# Set to FALSE to no longer run the code chunks below.
# This helps when building ggplot2 lesson, but turn off for final to make sure everything works as written and to increment exercise counter.
opts_chunk$set(eval=TRUE)
```
<a name="ggplot2"></a>
# Data visualization with **ggplot2**
[_(Jump back to top...)_](#top)
This section will cover fundamental concepts for creating effective data visualization and will introduce tools and techniques for visualizing large, high-dimensional data using R and the ggplot2 package. We will cover the grammar of graphics (geoms, aesthetics, stats, and faceting), and using ggplot2 to create plots layer-by-layer.
## About ggplot2
**ggplot2** is a widely used R package that extends R's visualization capabilities. It takes the hassle out of things like creating legends, mapping other variables to scales like color, or faceting plots into small multiples. We'll learn about what all these things mean shortly.
_Where does the "gg" in ggplot2 come from?_ The **ggplot2** package provides an R implementation of Leland Wilkinson's *Grammar of Graphics* (1999). The *Grammar of Graphics* allows you to think beyond the garden variety plot types (e.g. scatterplot, barplot) and the consider the components that make up a plot or graphic, such as how data are represented on the plot (as lines, points, etc.), how variables are mapped to coordinates or plotting shape or color, what transformation or statistical summary is required, and so on.
Specifically, **ggplot2** allows you to build a plot layer-by-layer by specifying:
- a **geom**, which specifies how the data are represented on the plot (points, lines, bars, etc.),
- **aesthetics** that map variables in the data to axes on the plot or to plotting size, shape, color, etc.,
- a **stat**, a statistical transformation or summary of the data applied prior to plotting,
- **facets**, which we've already seen above, that allow the data to be divided into chunks on the basis of other categorical or continuous variables and the same plot drawn for each chunk.
_First, a note about `qplot()`._ The `qplot()` function is a quick and dirty way of making ggplot2 plots. You might see it if you look for help with ggplot2, and it's even covered extensively in the ggplot2 book. And if you're used to making plots with built-in base graphics, the `qplot()` function will probably feel more familiar. But the sooner you abandon the `qplot()` syntax the sooner you'll start to really understand ggplot2's approach to building up plots layer by layer. So we're not going to use it at all in this class.
## Plotting bivariate data: continuous Y by continuous X
The `ggplot` function has two required arguments: the *data* used for creating the plot, and an *aesthetic* mapping to describe how variables in said data are mapped to things we can see on the plot.
First let's load the package:
```{r loadggplot2, eval=TRUE}
library(ggplot2)
```
Now, let's lay out the plot. If we want to plot a continuous Y variable by a continuous X variable we're probably most interested in a scatter plot. Here, we're telling ggplot that we want to use the `gm` dataset, and the aesthetic mapping will map `gdpPercap` onto the x-axis and `lifeExp` onto the y-axis.
```{r noLayers, eval=FALSE}
ggplot(gm, aes(x = gdpPercap, y = lifeExp))
```
Look at that, we get an error, and it's pretty clear from the message what the problem is. We've laid out a two-dimensional plot specifying what goes on the x and y axes, but we haven't told it what kind of geometric object to plot. The obvious choice here is a point. Check out [docs.ggplot2.org](http://docs.ggplot2.org/) to see what kind of geoms are available.
```{r}
ggplot(gm, aes(x = gdpPercap, y = lifeExp)) + geom_point()
```
Here, we've built our plot in layers. First, we create a canvas for plotting layers to come using the `ggplot` function, specifying which **data** to use (here, the **gm** data frame), and an **aesthetic mapping** of `gdpPercap` to the x-axis and `lifeExp` to the y-axis. We next add a layer to the plot, specifying a **geom**, or a way of visually representing the aesthetic mapping.
Now, the typical workflow for building up a ggplot2 plot is to first construct the figure and save that to a variable (for example, `p`), and as you're experimenting, you can continue to re-define the `p` object as you develop "keeper commands".
First, let's construct the graphic. Notice that we don't have to specify `x=` and `y=` if we specify the arguments in the correct order (x is first, y is second).
```{r, eval=TRUE}
p <- ggplot(gm, aes(gdpPercap, lifeExp))
```
Now, if we tried to display p here alone we'd get another error because we don't have any layers in the plot. Let's experiment with adding points and a different scale to the x-axis.
```{r}
# Experiment with adding poings
p + geom_point()
# Experiment with a different scale
p + geom_point() + scale_x_log10()
```
I like the look of using a log scale for the x-axis. Let's make that stick.
```{r, eval=TRUE}
p <- p + scale_x_log10()
```
Then re-plot again with a layer of points:
```{r}
p + geom_point()
```
Now notice what I've saved to `p` at this point: only the basic plot layout and the log10 mapping on the x-axis. I didn't save any layers yet because I want to fiddle around with the points for a bit first.
Above we implied the aesthetic mappings for the x- and y- axis should be `gdpPercap` and `lifeExp`, but we can also add aesthetic mappings to the geoms themselves. For instance, what if we wanted to color the points by the value of another variable in the dataset, say, continent?
```{r}
p + geom_point(aes(color=continent))
```
Notice the difference here. If I wanted the colors to be some static value, I wouldn't wrap that in a call to `aes()`. I would just specify it outright. Same thing with other features of the points. For example, lets make all the points huge (`size=8`) blue (`color="blue"`) semitransparent (`alpha=(1/4)`) triangles (`pch=17`):
```{r}
p + geom_point(color="blue", pch=17, size=8, alpha=1/4)
```
Now, this time, let's map the aesthetics of the point character to certain features of the data. For instance, let's give the points different colors and character shapes according to the continent, and map the size of the point onto the life Expectancy:
```{r}
p + geom_point(aes(col=continent, pch=continent, size=lifeExp))
```
Now, this isn't a great plot because there are several aesthetic mappings that are redundant. Life expectancy is mapped to both the y-axis and the size of the points -- the size mapping is superfluous. Similarly, continent is mapped to both the color and the point character (the point character is superfluous). Let's get rid of that, but let's make the points a little bigger outsize of an aesthetic mapping.
```{r}
p + geom_point(aes(col=continent), size=4)
```
----
**EXERCISE `r .ex``r .ex=.ex+1`**
Re-create this same plot from scratch without saving anything to a variable. That is, start from the `ggplot` call.
* Start with the `ggplot()` function.
* Use the gm data.
* Map `gdpPercap` to the x-axis and `lifeExp` to the y-axis.
* Add points to the plot
* Make the points size 4
* Map continent onto the aesthetics of the point
* Use a log<sub>10</sub> scale for the x-axis.
```{r, echo=FALSE}
ggplot(gm, aes(gdpPercap, lifeExp)) +
geom_point(aes(col=continent), size=4) +
scale_x_log10()
```
----
### Adding layers to the plot
Let's add a fitted curve to the points. Recreate the plot in the `p` object if you need to.
```{r}
p <- ggplot(gm, aes(gdpPercap, lifeExp)) + scale_x_log10()
p + geom_point() + geom_smooth()
```
By default `geom_smooth()` will try to lowess for data with n<1000 or generalized additive models for data with n>1000. We can change that behavior by tweaking the parameters to use a thick red line, use a linear model instead of a GAM, and to turn off the standard error stripes.
```{r}
p + geom_point() + geom_smooth(lwd=2, se=FALSE, method="lm", col="red")
```
But let's add back in our aesthetic mapping to the continents. Notice what happens here. We're mapping continent as an aesthetic mapping _to the color of the points only_ -- so `geom_smooth()` still works only on the entire data.
```{r}
p + geom_point(aes(color = continent)) + geom_smooth()
```
But notice what happens here: we make the call to `aes()` outside of the `geom_point()` call, and the continent variable gets mapped as an aesthetic to any further geoms. So here, we get separate smoothing lines for each continent. Let's do it again but remove the standard error stripes and make the lines a bit thicker.
```{r}
p + aes(color = continent) + geom_point() + geom_smooth()
p + aes(color = continent) + geom_point() + geom_smooth(se=F, lwd=2)
```
### Faceting
Facets display subsets of the data in different panels. There are a couple ways to do this, but `facet_wrap()` tries to sensibly wrap a series of facets into a 2-dimensional grid of small multiples. Just give it a formula specifying which variables to facet by. We can continue adding more layers, such as smoothing. If you have a look at the help for `?facet_wrap()` you'll see that we can control how the wrapping is laid out.
```{r}
p + geom_point() + facet_wrap(~continent)
p + geom_point() + geom_smooth() + facet_wrap(~continent)
p + geom_point() + geom_smooth() + facet_wrap(~continent, ncol=1)
```
### Saving plots
There are a few ways to save ggplots. The quickest way, that works in an interactive session, is to use the `ggsave()` function. You give it a file name and by default it saves the last plot that was printed to the screen.
```{r, eval=FALSE}
p + geom_point()
ggsave(file="myplot.png")
```
But if you're running this through a script, the best way to do it is to pass `ggsave()` the object containing the plot that is meant to be saved. We can also adjust things like the width, height, and resolution. `ggsave()` also recognizes the name of the file extension and saves the appropriate kind of file. Let's save a PDF.
```{r, eval=FALSE}
pfinal <- p + geom_point() + geom_smooth() + facet_wrap(~continent, ncol=1)
ggsave(pfinal, file="myplot.pdf", width=5, height=15)
```
----
**EXERCISE `r .ex``r .ex=.ex+1`**
0. Make a scatter plot of `lifeExp` on the y-axis against `year` on the x.
0. Make a series of small multiples faceting on continent.
0. Add a fitted curve, smooth or lm, with and without facets.
0. **Bonus**: using `geom_line()` and and aesthetic mapping `country` to `group=`, make a "spaghetti plot", showing _semitransparent_ lines connected for each country, faceted by continent. Add a smoothed loess curve with a thick (`lwd=3`) line with no standard error stripe. Reduce the opacity (`alpha=`) of the individual black lines.
```{r, echo=FALSE}
p <- ggplot(gm, aes(year, lifeExp))
p + geom_point()
p + geom_point() + geom_smooth()
p + geom_point() + geom_smooth() + facet_wrap(~continent)
p + facet_wrap(~continent) + geom_line()
p + facet_wrap(~continent) + geom_line(aes(group=country))
p + facet_wrap(~continent) + geom_line(aes(group=country), alpha=.5) + geom_smooth(lwd=3, se=FALSE)
```
----
## Plotting bivariate data: continuous Y by categorical X
With the last example we examined the relationship between a continuous Y variable against a continuous X variable. A scatter plot was the obvious kind of data visualization. But what if we wanted to visualize a continuous Y variable against a categorical X variable? We sort of saw what that looked like in the last exercise. `year` is a continuous variable, but in this dataset, it's broken up into 5-year segments, so you could almost think of each year as a categorical variable. But a better example would be life expectancy against continent or country.
First, let's set up the basic plot:
```{r, eval=TRUE}
p <- ggplot(gm, aes(continent, lifeExp))
```
Then add points:
```{r}
p + geom_point()
```
That's not terribly useful. There's a big overplotting problem. We can try to solve with transparency:
```{r}
p + geom_point(alpha=1/4)
```
But that really only gets us so far. What if we spread things out by adding a little bit of horizontal noise (aka "jitter") to the data.
```{r}
p + geom_jitter()
```
Note that the little bit of horizontal noise that's added to the jitter is random. If you run that command over and over again, each time it will look slightly different. The idea is to visualize the density at each vertical position, and spreading out the points horizontally allows you to do that. If there were still lots of over-plotting you might think about adding some transparency by setting the `alpha=` value for the jitter.
```{r}
p + geom_jitter(alpha=1/2)
```
Probably a more common visualization is to show a box plot:
```{r}
p + geom_boxplot()
```
But why not show the summary and the raw data?
```{r}
p + geom_jitter() + geom_boxplot()
```
Notice how in that example we first added the jitter layer then added the boxplot layer. But the boxplot is now superimposed over the jitter layer. Let's make the jitter layer go on top. Also, go back to just the boxplots. Notice that the outliers are represented as points. But there's no distinction between the outlier point from the boxplot geom and all the other points from the jitter geom. Let's change that. Notice the British spelling.
```{r}
p + geom_boxplot(outlier.colour = "red") + geom_jitter(alpha=1/2)
```
There's another geom that's useful here, called a voilin plot.
```{r}
p + geom_violin()
p + geom_violin() + geom_jitter(alpha=1/2)
```
Let's go back to our boxplot for a moment.
```{r}
p + geom_boxplot()
```
This plot would be a lot more effective if the continents were shown in some sort of order other than alphabetical. To do that, we'll have to go back to our basic build of the plot again and use the `reorder` function in our original aesthetic mapping. Here, reorder is taking the first variable, which is some categorical variable, and ordering it by the level of the mean of the second variable, which is a continuous variable. It looks like this
```{r, eval=TRUE}
p <- ggplot(gm, aes(x=reorder(continent, lifeExp), y=lifeExp))
```
```{r}
p + geom_boxplot()
```
----
**EXERCISE `r .ex``r .ex=.ex+1`**
0. Make a jittered strip plot of GDP per capita against continent.
0. Make a box plot of GDP per capita against continent.
0. Using a log<sub>10</sub> y-axis scale, overlay semitransparent jittered points on top of box plots, where outlying points are colored.
0. **BONUS**: Try to reorder the continents on the x-axis by GDP per capita. Why isn't this working as expected? See `?reorder` for clues.
```{r, echo=FALSE}
p <- ggplot(gm, aes(continent, gdpPercap))
p + geom_jitter()
p + geom_boxplot()
p <- ggplot(gm, aes(reorder(continent, gdpPercap), gdpPercap))
p <- p + scale_y_log10()
p + geom_boxplot(outlier.colour="red") + geom_jitter(alpha=1/2)
library(dplyr)
gm %>% group_by(continent) %>% summarize(mean(gdpPercap))
gm %>% group_by(continent) %>% summarize(mean(log10(gdpPercap)))
p <- ggplot(gm, aes(reorder(continent, gdpPercap, FUN=function(x) mean(log10(x))), gdpPercap))
p <- p + scale_y_log10()
p + geom_boxplot(outlier.colour="red") + geom_jitter(alpha=1/2)
```
----
## Plotting univariate continuous data
What if we just wanted to visualize distribution of a single continuous variable? A histogram is the usual go-to visualization. Here we only have one aesthetic mapping instead of two.
```{r histogram}
p <- ggplot(gm, aes(lifeExp))
p + geom_histogram()
```
When we do this ggplot lets us know that we're automatically selecting the width of the bins, and we might want to think about this a little further.
```{r}
p + geom_histogram(binwidth=5)
p + geom_histogram(binwidth=1)
p + geom_histogram(binwidth=.25)
```
Alternative we could plot a smoothed density curve instead of a histogram:
```{r}
p + geom_density()
```
Back to histograms. What if we wanted to color this by continent?
```{r}
p + geom_histogram(aes(color=continent))
```
That's not what we had in mind. That's just the outline of the bars. We want to change the fill color of the bars.
```{r}
p + geom_histogram(aes(fill=continent))
```
Well, that's not exactly what we want either. If you look at the help for `?geom_histogram` you'll see that by default it stacks overlapping points. This isn't really an effective visualization. Let's change the position argument.
```{r}
p + geom_histogram(aes(fill=continent), position="identity")
```
But the problem there is that the histograms are blocking each other. What if we tried transparency?
```{r}
p + geom_histogram(aes(fill=continent), position="identity", alpha=1/3)
```
That's somewhat helpful, and might work for two distributions, but it gets cumbersome with 5. Let's go back and try this with density plots, first changing the color of the line:
```{r}
p + geom_density(aes(color=continent))
```
Then by changing the color of the fill and setting the transparency to 25%:
```{r}
p + geom_density(aes(fill=continent), alpha=1/4)
```
----
**EXERCISE `r .ex``r .ex=.ex+1`**
0. Plot a histogram of GDP Per Capita.
0. Do the same but use a log<sub>10</sub> x-axis.
0. Still on the log<sub>10</sub> x-axis scale, try a density plot mapping continent to the fill of each density distribution, and reduce the opacity.
0. Still on the log<sub>10</sub> x-axis scale, make a histogram faceted by continent _and_ filled by continent. Facet with a single column (see `?facet_wrap` for help). Save this to a 6x10 PDF file.
```{r, echo=FALSE, eval=FALSE}
p <- ggplot(gm, aes(gdpPercap))
p + geom_histogram()
p <- p + scale_x_log10()
p + geom_histogram()
p + geom_density(aes(fill=continent), alpha=1/4)
p + geom_histogram(aes(fill=continent)) + facet_wrap(~continent, ncol=1)
ggsave("myplot.pdf", width=6, height=10)
```
----
## Themes
Let's make a plot we made earlier (life expectancy versus the log of GDP per capita with points colored by continent with lowess smooth curves overlaid without the standard error ribbon):
```{r}
p <- ggplot(gm, aes(gdpPercap, lifeExp))
p <- p + scale_x_log10()
p <- p + aes(col=continent) + geom_point() + geom_smooth(lwd=2, se=FALSE)
p
```
Give the plot a title and axis labels:
```{r}
p <- p + ggtitle("Life expectancy vs GDP by Continent")
p <- p + xlab("GDP Per Capita (USD)") + ylab("Life Expectancy (years)")
p
```
They "gray" theme is the usual background.
```{r}
p + theme_gray()
```
We could also get a black and white background:
```{r}
p + theme_bw()
```
Or go a step further and remove the gridlines:
```{r}
p + theme_classic()
```
Finally, there's another package that gives us lots of different themes. This package isn't on CRAN, so you'll have to use devtools to install it directly from the source code on github.
```{r, eval=FALSE}
install.packages("devtools")
devtools::install_github("jrnold/ggthemes")
```
```{r themes, eval=FALSE}
library(ggthemes)
p <- ggplot(gm, aes(gdpPercap, lifeExp))
p <- p + scale_x_log10()
p <- p + aes(col=continent) + geom_point() + geom_smooth(lwd=2, se=FALSE)
p + theme_excel()
p + theme_excel() + scale_colour_excel()
p + theme_gdocs() + scale_colour_gdocs()
p + theme_stata() + scale_colour_stata()
p + theme_wsj() + scale_colour_wsj()
p + theme_economist()
p + theme_fivethirtyeight()
p + theme_tufte()
```
[_(Jump back to top...)_](#top)
<a name="resources"></a>
# Further resources
[_(Jump back to top...)_](#top)
### R resources
- <http://tryr.codeschool.com/>: TryR - an interactive, browser-based R tutor.
- <http://swirlstats.com/>: An R package that teaches you R (_and statistics!_) from within R.
- <https://stat545-ubc.github.io/>: Jenny Bryan's Stat 545 "Data wrangling, exploration, and analysis with R" course material -- excellent resource for learning R, dplyr, and ggplot2.
- <https://www.datacamp.com/courses/free-introduction-to-r>: DataCamp's free introduction to R.
- <https://cran.r-project.org/doc/contrib/Short-refcard.pdf>: Printable R command reference card.
- <https://www.rstudio.com/resources/cheatsheets/>: Printable cheat sheets for many different tasks within R.
- <http://rseek.org/>: Rseek - a custom Google search for R-related sites.
### dplyr resources
- <https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html>: The dplyr vignette - offers a great introduction.
- <http://www.dataschool.io/dplyr-tutorial-for-faster-data-manipulation-in-r/>: A longer dplyr tutorial with video and code.
- <http://genomicsclass.github.io/book/pages/dplyr_tutorial.html>: The dplyr tutorial from the HarvardX Biomedical Data Science MOOC.
- <https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf>: A dplyr cheat sheet from RStudio.
### ggplot2 resources
- <http://docs.ggplot2.org/>: The official ggplot2 documentation.
- <http://amzn.to/1akjqsR>: Edition 1 of the ggplot2 book, by the developer, Hadley Wickham.
- <https://github.com/hadley/ggplot2-book>: New version of the ggplot2 book, freely available on GitHub.
- <https://groups.google.com/d/forum/ggplot2>: The ggplot2 Google Group (mailing list, discussion forum).
- <http://learnr.wordpress.com/>: A blog with a good number of posts describing how to reproduce various kind of plots using ggplot2.
- <http://stackoverflow.com/questions/tagged/ggplot2>: Thousands of questions and answers tagged with "ggplot2" on Stack Overflow, a programming Q&A site.
- <http://stat545-ubc.github.io/>: Jenny Bryan's stat 545 class at UBC -- a great resource for learning about data manipulation and visualization in R. Much of this course material was modified from the stat 545 lesson material.
- <http://shinyapps.stat.ubc.ca/r-graph-catalog/>: A catalog of graphs made with ggplot2, complete with accompanying R code.
- <https://www.rstudio.com/wp-content/uploads/2015/05/ggplot2-cheatsheet.pdf>: RStudio's ggplot2 cheat sheet.
[_(Jump back to top...)_](#top)
<!-- This begins a comment
----
**EXERCISE `r .ex``r .ex=.ex+1`**
```{r, echo=FALSE}
```
----
comment ends here -->
<a name="intro"></a>
# Quick introduction to R and our dataset
[_(Jump back to top...)_](#top)
This section assumes little to no experience with statistical computing with R. We will introduce the R statistical computing environment, RStudio, and the dataset that we will work with for the remainder of the lesson. We will cover very basic functionality in R, including variables, functions, and importing/inspecting data frames.
## RStudio
Let's start by learning about RStudio. **R** is the underlying statistical computing environment, but using R alone is no fun. **RStudio** is a graphical integrated development environment that makes using R much easier.
- Panes in RStudio. There are four panes, and their orientation is configurable under "Tools -- Global Options." I set up my window to have the editor in the top left, console top right, environment/history on the bottom left, and plots/help on the bottom right.
- Projects: first, start a new project in a new folder somewhere easy to remember. When we start reading in data it'll be important that the _code and the data are in the same place._ Creating a project creates an Rproj file that opens R running _in that folder_. This way, when you want to read in dataset _whatever.txt_, you just tell it the filename rather than a full path. This is critical for reproducibility.
- Code that you type into the console is code that R executes. From here forward we will use the editor window to write a script that we can save to a file and run it again whenever we want to. We usually give it a `.R` extension, but it's just a plain text file. If you want to send commands from your editor to the console, use `CMD`+`Enter` (`Ctrl`+`Enter` on Windows).
- Anything after a `#` sign is a comment. Use them liberally to *comment your code*.
## Basic operations
R can be used as a glorified calculator. Try typing this in directly into the console. Make sure you're typing into into the editor, not the console, and save your script. Use the run button, or press `CMD`+`Enter` (`Ctrl`+`Enter` on Windows).
```{r calculator}
2+2
5*4
2^3
```
R Knows order of operations and scientific notation.
```{r calculator2}
2+3*4/(5+3)*15/2^2+3*4^2
5e4
```
However, to do useful and interesting things, we need to assign *values* to *objects*. To create objects, we need to give it a name followed by the assignment operator `<-` and the value we want to give it: