diff --git a/01-rstudio-intro.Rmd b/01-rstudio-intro.Rmd index 50a8c1152..810c99ef9 100644 --- a/01-rstudio-intro.Rmd +++ b/01-rstudio-intro.Rmd @@ -9,7 +9,7 @@ minutes: 45 source("tools/chunk-options.R") ``` -> ### Learning Objectives {.objectives} +> ## Learning objectives {.objectives} > > * To gain familiarity with the various panes in the RStudio IDE > * To gain familiarity with the buttons, short cuts and options in the Rstudio IDE @@ -383,12 +383,12 @@ rm(list <- ls()) > Draw diagrams showing what variables refer to what values after each > statement in the following program: > -> ~~~ {.r} +> ```{r, eval=FALSE} > mass <- 47.5 > age <- 122 > mass <- mass * 2.3 > age <- age - 20 -> ~~~ +> ``` > > #### Challenge 2 {.challenge} diff --git a/01-rstudio-intro.md b/01-rstudio-intro.md index 058b8749a..68fe25c08 100644 --- a/01-rstudio-intro.md +++ b/01-rstudio-intro.md @@ -2,12 +2,12 @@ layout: page title: R for reproducible scientific analysis subtitle: Introduction to R and RStudio -minutes: 15 +minutes: 45 --- -> ### Learning Objectives {.objectives} +> ## Learning objectives {.objectives} > > * To gain familiarity with the various panes in the RStudio IDE > * To gain familiarity with the buttons, short cuts and options in the Rstudio IDE @@ -19,7 +19,7 @@ minutes: 15 ### Introduction to RStudio -Welcome to the R portion of the Software Carpentry workshop. +Welcome to the R portion of the Software Carpentry workshop. Throughout this lesson, we're going to teach you some of the fundamentals of the R language as well as some best practices for organising code for @@ -50,26 +50,26 @@ There are two main ways one can work within Rstudio. 1. This is great way to start and work as all workings are saved for latter reference and can be read latter. > #### Tip: Pushing to the interactive R console {.callout} -> To run the current line click on the `Run` button just above the file pane. 
Or use the short cut which can be see +> To run the current line click on the `Run` button just above the file pane. Or use the short cut which can be see > by hovering the mouse over the button. > > To run a block of code, select it and then `Run`. If you have modified a line -> of code within a block of code you have just run. There is no need to reselct the section and `Run`, -> you can use the next button along, `Re-run the previous region`. This will run the previous code block inculding +> of code within a block of code you have just run. There is no need to reselct the section and `Run`, +> you can use the next button along, `Re-run the previous region`. This will run the previous code block inculding > the modifications you have made. > ### Introduction to R A lot of your time in R will be spent in the R interactive console. This is where you -will run all of your code, and can be a useful environment to try out ideas before +will run all of your code, and can be a useful environment to try out ideas before adding them to an R script file. This console in RStudio is the same as the one you would get if you just typed in `R` in your commandline environment. -The first thing you will see in the R interactive session is a bunch of information, +The first thing you will see in the R interactive session is a bunch of information, followed by a ">" and a blinking cursor. In many ways this is similar to the shell environment you learnt about during the shell lessons: it operates on the same idea -of a "Read, evaluate, print loop": you type in commands, R tries to execute them, and +of a "Read, evaluate, print loop": you type in commands, R tries to execute them, and then returns a result. #### Using R as a calculator @@ -88,7 +88,7 @@ The simplest thing you could do with R is do arithmetic: ~~~ -And R will print out the answer, with a preceding "[1]". Don't worry about this +And R will print out the answer, with a preceding "[1]". 
Don't worry about this for now, we'll explain that later. For now think of it as indicating ouput. Just like bash, if you type in an incomplete command, R will wait for you to @@ -102,7 +102,7 @@ complete it: + ~~~ -Any time you hit return and the R session shows a "+" instead of a ">", it +Any time you hit return and the R session shows a "+" instead of a ">", it means it's waiting for you to complete the command. If you want to cancel a command you can simply hit "Esc" and RStudio will give you back the ">" prompt. @@ -115,7 +115,7 @@ prompt. > > Cancelling a command isn't just useful for killing incomplete commands: > you can also use it to tell R to stop running code (for example if its -> taking much longer than you expect), or to get rid of the code you're +> taking much longer than you expect), or to get rid of the code you're > currently writing. > @@ -169,7 +169,7 @@ But this can get unwieldy when not needed: The text I've typed after each line of code is called a comment. Anything that -follows on from the octothorpe (or hash) symbol `#` is ignored by R when it +follows on from the octothorpe (or hash) symbol `#` is ignored by R when it executes code. Really small or large numbers get a scientific notation: @@ -218,7 +218,7 @@ sin(1) # trigonometry functions ~~~{.output} -[1] 0.8415 +[1] 0.841471 ~~~ @@ -254,7 +254,7 @@ exp(0.5) # e^(1/2) ~~~{.output} -[1] 1.649 +[1] 1.648721 ~~~ @@ -263,13 +263,13 @@ can simply look them up on google, or if you can remember the start of the function's name, use the tab completion in RStudio. This is one advantage that RStudio has over R on its own, it -has autocompletion abilities that allow you to more easily +has autocompletion abilities that allow you to more easily look up functions, their arguments, and the values that they take. Typing a `?` before the name of a command will open the help page for that command. 
As well as providing a detailed description of -the command and how it works, scrolling ot the bottom of the +the command and how it works, scrolling ot the bottom of the help page will usually show a collection of code examples which illustrate command usage. We'll go through an example later. @@ -350,19 +350,19 @@ We can also do comparison in R: ~~~ > #### Tip: Comparing Numbers {.callout} -> -> A word of warning about comparing numbers: you should +> +> A word of warning about comparing numbers: you should > never use `==` to compare two numbers unless they are > integers (a data type which can specifically represent -> only whole numbers). +> only whole numbers). > -> Computers may only represent decimal numbers with a +> Computers may only represent decimal numbers with a > certain degree of precision, so two numbers which look > the same when printed out by R, may actually have -> different underlying representations and therefore be -> different by a small margin of error (called Machine -> numeric tolerance). -> +> different underlying representations and therefore be +> different by a small margin of error (called Machine +> numeric tolerance). +> > Instead you should use the `all.equal` function. > > Further reading: [http://floating-point-gui.de/](http://floating-point-gui.de/) @@ -406,7 +406,7 @@ log(x) ~~~{.output} -[1] -3.689 +[1] -3.688879 ~~~ @@ -437,7 +437,7 @@ different conventions for long variable names, these include * underscores\_between_words * camelCaseToSeparateWords -What you use is up to you, but **be consistent**. +What you use is up to you, but **be consistent**. It is also possible to use the `=` operator for assignment: @@ -471,9 +471,9 @@ ls() ~~~ > #### Tip: hidden objects {.callout} -> +> > Just like in the shell, `ls` will hide any variables or functions starting -> with a "." by default. To list all objects, type `ls(all.names=TRUE)` +> with a "." by default. 
To list all objects, type `ls(all.names=TRUE)` > instead > @@ -522,7 +522,7 @@ function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE, } else all.names } - + ~~~ @@ -543,13 +543,13 @@ rm(list = ls()) ~~~ In this case we've combined the two. Just like the order of operations, anything -inside the innermost brackets is evaluated first, and so on. +inside the innermost brackets is evaluated first, and so on. -In this case we've specified that the results of `ls` should be used for the -`list` argument in `rm`. When assigning values to arguments by name, you *must* -use the `=` operator!! +In this case we've specified that the results of `ls` should be used for the +`list` argument in `rm`. When assigning values to arguments by name, you *must* +use the `=` operator!! -If instead we use `<-`, there will be unintended side effects, or you may just +If instead we use `<-`, there will be unintended side effects, or you may just get an error message: @@ -560,11 +560,11 @@ rm(list <- ls()) ~~~{.output} -Error: ... must contain names or character strings +Error in rm(list <- ls()): ... must contain names or character strings ~~~ -> #### Tip: Warnings vs. Errors {.callout} +> #### Tip: Warnings vs. Errors {.callout} > > Pay attention when R does something unexpected! Errors, like above, > are thrown when R cannot proceed with a calculation. Warnings on the @@ -578,10 +578,11 @@ Error: ... must contain names or character strings > #### Challenge 1 {.challenge} > -> Draw diagrams showing what variables refer to what values after each +> Draw diagrams showing what variables refer to what values after each > statement in the following program: > -> ~~~ {.r} +> +> ~~~{.r} > mass <- 47.5 > age <- 122 > mass <- mass * 2.3 @@ -600,4 +601,3 @@ Error: ... must contain names or character strings > Clean up your working environment by deleting the mass and age > variables. 
> - diff --git a/02-project-intro.Rmd b/02-project-intro.Rmd index a4ce6177f..4d9f7de9c 100644 --- a/02-project-intro.Rmd +++ b/02-project-intro.Rmd @@ -9,7 +9,7 @@ minutes: 30 source("tools/chunk-options.R") ``` -> ## Learning Objectives {.objectives} +> ## Learning objectives {.objectives} > > * To be able to create self-contained projects in RStudio > * To be able to use git from within RStudio @@ -114,11 +114,11 @@ get shared between projects. > 2. Load the library > 3. Initialise the project: > -> ~~~ {.r} +> ```{r, eval=FALSE} > install.packages("ProjectTemplate") > library(ProjectTemplate) > create.project("../my_project", merge.strategy = "allow.non.conflict") -> ~~~ +> ``` > > For more information on ProjectTemplate and its functionality visit the > home page [ProjectTemplate](http://projecttemplate.net/index.html) diff --git a/02-project-intro.md b/02-project-intro.md index df5e71cd3..6afb33207 100644 --- a/02-project-intro.md +++ b/02-project-intro.md @@ -2,12 +2,12 @@ layout: page title: R for reproducible scientific analysis subtitle: Project management with RStudio -minutes: 15 +minutes: 30 --- -> ## Learning Objectives {.objectives} +> ## Learning objectives {.objectives} > > * To be able to create self-contained projects in RStudio > * To be able to use git from within RStudio @@ -49,7 +49,7 @@ A good project layout will ultimately make your life easier: Fortunately, there are tools and packages which can help you manage your work effectively. One of the most powerful and useful aspects of RStudio is its project management -functionality. We'll be using this today to create a self-contianed, reproducible +functionality. We'll be using this today to create a self-contained, reproducible project. @@ -100,19 +100,20 @@ analysis. This makes it easier later, as many of my analyses are exploratory and don't end up being used in the final project, and some of the analyses get shared between projects. 
-> #### Tip: ProjectTempate - a possible solution {.callout} +> #### Tip: ProjectTemplate - a possible solution {.callout} > > One way to automate the management of projects is to install the third-party package, `ProjectTemplate`. > This package will set up an ideal directory structure for project management. > This is very useful as it enables you to have your analysis pipeline/workflow organised and structured. -> Together with the default Rstudio project functionality and Git you will be able to keep track of your +> Together with the default RStudio project functionality and Git you will be able to keep track of your > work as well as be able to share your work with collaborators. > > 1. Install `ProjectTemplate`. > 2. Load the library > 3. Initialise the project: > -> ~~~ {.r} +> +> ~~~{.r} > install.packages("ProjectTemplate") > library(ProjectTemplate) > create.project("../my_project", merge.strategy = "allow.non.conflict") @@ -143,7 +144,7 @@ one to store the analysis scripts. > make updates to code in multiple places. > > In this case I find it useful to make "symbolic links", which are essentially -> shortcuts to files somewhere else on a filesystem. On linux and OSX you can +> shortcuts to files somewhere else on a filesystem. On Linux and OS X you can > use the `ln -s` command, and on windows you can either create a shortcut or > use the `mklink` command from the windows terminal. > @@ -153,7 +154,7 @@ one to store the analysis scripts. Now we have a good directory structure we will now place/save the data file in the `data/` directory. > #### Challenge 1 {.challenge} -> Download the gapminer data from [here](https://github.com/resbaz/r-novice-gapminder-files). +> Download the gapminder data from [here](https://github.com/resbaz/r-novice-gapminder-files). > > 1. Use the `Download ZIP` located on the right hand side menu, last option. To download the `.zip` file to > your downloads folder. 
diff --git a/03-seeking-help.Rmd b/03-seeking-help.Rmd index 82002ec19..4783a71ad 100644 --- a/03-seeking-help.Rmd +++ b/03-seeking-help.Rmd @@ -10,7 +10,7 @@ source("tools/chunk-options.R") ``` -> ### Learning Objectives {.objectives} +> ## Learning objectives {.objectives} > > * To be able read R help files for functions and special operators. > * To be able to use CRAN task views to identify packages to solve a problem. @@ -114,19 +114,20 @@ your issue. > What is the difference between the `sep` and `collapse` arguments? > -> #### Solution to Challenge 1 {.challenge} +### Other ports of call + +* [Quick R](http://www.statmethods.net/) +* [RStudio cheat sheets](http://www.rstudio.com/resources/cheatsheets/) +* [Cookbook for R](http://www.cookbook-r.com/) + +## Challenge solutions + +> #### Solution to challenge 1 {.challenge} > > Look at the help for the `paste` function. You'll need to use this later. > -> ~~~ {.r} +> ```{r, eval=FALSE} > help("paste") > ?paste -> ~~~ +> ``` > - - -### Other ports of call - -* [Quick R](http://www.statmethods.net/) -* [RStudio cheat sheets](http://www.rstudio.com/resources/cheatsheets/) -* [Cookbook for R](http://www.cookbook-r.com/) diff --git a/03-seeking-help.md b/03-seeking-help.md index 1504c2fd0..f67319499 100644 --- a/03-seeking-help.md +++ b/03-seeking-help.md @@ -8,10 +8,10 @@ minutes: 15 -> ### Learning Objectives {.objectives} +> ## Learning objectives {.objectives} > -> * To be able read R helpfiles for functions and special operators. -> * To be able to use CRAN taskviews to identify packages to solve a problem. +> * To be able read R help files for functions and special operators. +> * To be able to use CRAN task views to identify packages to solve a problem. > * To be able to seek help from your peers > @@ -31,7 +31,7 @@ This will load up a help page in RStudio (or as plain text in R by itself). Each help page is broken down into sections: - - Description: An extended description of what the function does. 
+ - Description: An extended description of what the function does. - Usage: The arguments of the function and their default values. - Arguments: An explanation of the data each argument is expecting. - Details: Any important details to be aware of. @@ -43,9 +43,9 @@ Different functions might have different sections, but these are the main ones y > #### Tip: Reading help files {.callout} > -> One of the most daunting aspects of R is the large number of functions +> One of the most daunting aspects of R is the large number of functions > available. It would be prohibitive, if not impossible to remember the -> correct usage for every function you use. Luckily, the help files +> correct usage for every function you use. Luckily, the help files > mean you don't have to! > @@ -65,7 +65,7 @@ Without any arguments, `vignette()` will list all vignettes for all installed pa `vignette(package="package-name")` will list all available vignettes for `package-name`, and `vignette("vignette-name")` will open the specified vignette. -If a package doesn't have any vignettes, you can usually find help by typing +If a package doesn't have any vignettes, you can usually find help by typing `help("package-name")`. ### When you kind of remember the function @@ -79,15 +79,15 @@ If you're not sure what package a function is in, or how it's specifically spell ### When you have no idea where to begin -If you don't know what function or package you need to use -[CRAN Task Views](http://cran.at.r-project.org/web/views) +If you don't know what function or package you need to use +[CRAN Task Views](http://cran.at.r-project.org/web/views) is a specially maintained list of packages grouped into fields. This can be a good starting point. 
### When your code doesn't work: seeking help from your peers -If you're having trouble using a function, 9 times out of 10, -the answers you are seeking have already been answered on +If you're having trouble using a function, 9 times out of 10, +the answers you are seeking have already been answered on [Stack Overflow](http://stackoverflow.com/). You can search using the `[r]` tag. @@ -120,10 +120,10 @@ attached base packages: [1] stats graphics grDevices utils datasets base other attached packages: -[1] knitr_1.6 +[1] knitr_1.10.12 loaded via a namespace (and not attached): -[1] evaluate_0.5.5 formatR_1.0 stringr_0.6.2 tools_3.1.0 +[1] evaluate_0.7 formatR_1.0 stringr_0.6.2 tools_3.1.0 ~~~ @@ -131,8 +131,27 @@ Will print out your current version of R, as well as any packages you have loaded. This can be useful for others to help reproduce and debug your issue. +> #### Challenge 1 {.challenge} +> +> Look at the help for the `paste` function. You'll need to use this later. +> What is the difference between the `sep` and `collapse` arguments? +> + ### Other ports of call * [Quick R](http://www.statmethods.net/) -* [Rstudio cheat sheets](http://www.rstudio.com/resources/cheatsheets/) +* [RStudio cheat sheets](http://www.rstudio.com/resources/cheatsheets/) * [Cookbook for R](http://www.cookbook-r.com/) + +## Challenge solutions + +> #### Solution to challenge 1 {.challenge} +> +> Look at the help for the `paste` function. You'll need to use this later. 
+> +> +> ~~~{.r} +> help("paste") +> ?paste +> ~~~ +> diff --git a/04-data-structures-part1.Rmd b/04-data-structures-part1.Rmd index 20bf90684..27e610330 100644 --- a/04-data-structures-part1.Rmd +++ b/04-data-structures-part1.Rmd @@ -10,7 +10,7 @@ source("tools/chunk-options.R") ``` -> ### Learning Objectives {.objectives} +> ## Learning objectives {.objectives} > > - To be aware of the different types of data > - To be aware of the different basic data structures commonly encountered in R @@ -163,11 +163,11 @@ x > > **Guess what the following do without running them first:** > -> ~~~ {.r} +> ```{r, eval=FALSE} > xx <- c(1.7, "a") > xx <- c(TRUE, 2) > xx <- c("a", TRUE) -> ~~~ +> ``` > This is called implicit coercion. @@ -281,7 +281,7 @@ will tell you the number of rows and columns (this also applies to data frames!) while `length` will tell you the number of elements. > -> ### Challenge 3 {.challenge} +> #### Challenge 3 {.challenge} > > What do you think will be the result of > `length(x)`? @@ -290,7 +290,7 @@ while `length` will tell you the number of elements. > > -> ### Challenge 4 {.challenge} +> #### Challenge 4 {.challenge} > > Make another matrix, this time containing the numbers 1:50, > with 5 columns and 10 rows. @@ -372,7 +372,7 @@ Lists can also contain themselves: list(list(list(list()))) ``` -> ### Challenge 5 {.challenge} +> #### Challenge 5 {.challenge} > > Create a list of length two containing a character vector for each of the > sections in this part of the workshop: @@ -388,9 +388,9 @@ Lists are extremely useful inside functions. You can "staple" together lots of different kinds of results into a single object that a function can return. In fact many R functions which return complex output store their results in a list. 
+## Challenge solutions - -> #### Solution to Challenge 1: Data types {.challenge} +> #### Solution to challenge 1: Data types {.challenge} > > Use your knowledge of how to assign a value to > a variable, to create examples of data with the @@ -404,96 +404,64 @@ fact many R functions which return complex output store their results in a list. > has the data type you intended. Do you find > anything unexpected? > -> ~~~ {.r} +> ```{r} > answer <- TRUE > height <- 150 > dog_name <- "Snoopy" -> > is.logical(answer) -> ~~~ -> -> ~~~ {.output} -> [1] TRUE -> ~~~ +> ``` > -> ~~~ {.r} +> ```{r} > is.numeric(height) -> ~~~ -> -> ~~~ {.output} -> [1] TRUE -> ~~~ +> ``` > -> ~~~ {.r} +> ```{r} > is.character(dog_name) -> ~~~ -> -> ~~~ {.output} -> [1] TRUE -> ~~~ +> ``` > -> #### Solution to Challenge 2 {.challenge} +> #### Solution to challenge 2 {.challenge} > > Vectors can only contain one atomic type. If you try to combine different > types, R will create a vector that is the least common denominator: the > type that is easiest to coerce to. > -> ~~~ {.r} +> ```{r} > xx <- c(1.7, "a") > xx > typeof(xx) -> ~~~ +> ``` > -> ~~~ {.output} -> [1] "1.7" "a" -> [1] "character" -> ~~~ -> -> ~~~ {.r} +> ```{r} > xx <- c(TRUE, 2) > xx > typeof(xx) -> ~~~ -> -> ~~~ {.output} -> [1] 1 2 -> [1] "double" -> ~~~ +> ``` > -> ~~~ {.r} +> ```{r} > xx <- c("a", TRUE) > xx > typeof(xx) -> ~~~ -> -> ~~~ {.output} -> [1] "a" "TRUE" -> [1] "character" -> ~~~ +> ``` > > -> ### Solution to Challenge 3 {.challenge} +> #### Solution to challenge 3 {.challenge} > > What do you think will be the result of > `length(x)`? > -> ~~~ {.r} +> ```{r} > x <- matrix(rnorm(18), ncol=6, nrow=3) > length(x) -> ~~~ -> -> ~~~ {.output} -> [1] 18 -> ~~~ +> ``` > > Because a matrix is really just a vector with added dimension attributes, `length` > gives you the total number of elements in the matrix. 
> > -> ### Solution to Challenge 4 {.challenge} +> #### Solution to challenge 4 {.challenge} > > Make another matrix, this time containing the numbers 1:50, > with 5 columns and 10 rows. @@ -502,14 +470,14 @@ fact many R functions which return complex output store their results in a list. > See if you can figure out how to change this. > (hint: read the documentation for `matrix`!) > -> ~~~ {.r} +> ```{r, eval=FALSE} > x <- matrix(1:50, ncol=5, nrow=10) > x <- matrix(1:50, ncol=5, nrow=10, byrow = TRUE) # to fill by row -> ~~~ +> ``` > -> ### Solution to Challenge 5 {.challenge} +> #### Solution to challenge 5 {.challenge} > > Create a list of length two containing a character vector for each of the > sections in this part of the workshop: @@ -520,8 +488,10 @@ fact many R functions which return complex output store their results in a list. > Populate each character vector with the names of the data types and data > structures we've seen so far. > -> ~~~ {.r} -> my_list <- list(data_types = c("logical", "integer", "double", "complex", "character"), - data_structures = c("vector", "matrix", "factor", "list")) -> ~~~ +> ```{r, eval=FALSE} +> my_list <- list( +> data_types = c("logical", "integer", "double", "complex", "character"), +> data_structures = c("vector", "matrix", "factor", "list") +> ) +> ``` > diff --git a/04-data-structures-part1.md b/04-data-structures-part1.md index 1af64f8b8..cb5841edb 100644 --- a/04-data-structures-part1.md +++ b/04-data-structures-part1.md @@ -2,13 +2,13 @@ layout: page title: R for reproducible scientific analysis subtitle: Data structures -minutes: lots +minutes: 45 --- -> ### Learning Objectives {.objectives} +> ## Learning objectives {.objectives} > > - To be aware of the different types of data > - To be aware of the different basic data structures commonly encountered in R @@ -27,7 +27,7 @@ R has 5 basic atomic types (meaning they can't be broken down into anything smal * logical (e.g., `TRUE`, `FALSE`) * numeric - * integer 
(e.g, `3`, `2L`, `as.integer(3)`) + * integer (e.g, `2L`, `as.integer(3)`) * double (i.e. decimal) (e.g, `-24.57`, `2.0`, `pi`) * complex (i.e. complex numbers) (e.g, `1 + 0i`, `1 + 4i`) * text (called "character" in R) (e.g, `"a"`, `"swc"`, `'This is a cat'`) @@ -39,6 +39,7 @@ There are a few functions we can use to interrogate data in R to determine its t typeof() # what is its atomic type? is.logical() # is it TRUE/FALSE data? is.numeric() # is it numeric? +is.integer() # is it an integer? is.complex() # is it complex number data? is.character() # is it character data? ~~~ @@ -60,10 +61,10 @@ is.character() # is it character data? ### Data Structures -There are five data structures you will commonly encounter in R. These include: +There are five data structures you will commonly encounter in R. These are: * vector -* factors +* factor * list * matrix * data.frame @@ -231,7 +232,8 @@ x > > **Guess what the following do without running them first:** > -> ~~~ {.r} +> +> ~~~{.r} > xx <- c(1.7, "a") > xx <- c(TRUE, 2) > xx <- c("a", TRUE) @@ -240,7 +242,7 @@ x This is called implicit coercion. -The coersion rule goes `logical` -> `integer` -> `numeric` -> `complex` -> +The coercion rule goes `logical` -> `integer` -> `numeric` -> `complex` -> `character`. You can also coerce vectors explicitly using the `as.`. Example @@ -412,7 +414,7 @@ str(x) ~~~ -Like data.frames, vectors can also be named: +Vectors can be named: ~~~{.r} @@ -440,7 +442,7 @@ a b c d #### Matrices -Another data structure you'll likely encounter are Matrices. Underneath the +Another data structure you'll likely encounter are matrices. Underneath the hood, they are really just atomic vectors, with added dimension attributes. We can create one with the `matrix` function. 
Let's generate some random data: @@ -455,10 +457,10 @@ x ~~~{.output} - [,1] [,2] [,3] [,4] [,5] [,6] -[1,] -0.6265 1.5953 0.4874 -0.3054 -0.6212 -0.04493 -[2,] 0.1836 0.3295 0.7383 1.5118 -2.2147 -0.01619 -[3,] -0.8356 -0.8205 0.5758 0.3898 1.1249 0.94384 + [,1] [,2] [,3] [,4] [,5] [,6] +[1,] -0.6264538 1.5952808 0.4874291 -0.3053884 -0.6212406 -0.04493361 +[2,] 0.1836433 0.3295078 0.7383247 1.5117812 -2.2146999 -0.01619026 +[3,] -0.8356286 -0.8204684 0.5757814 0.3898432 1.1249309 0.94383621 ~~~ @@ -480,7 +482,7 @@ will tell you the number of rows and columns (this also applies to data frames!) while `length` will tell you the number of elements. > -> ### Challenge 3 {.challenge} +> #### Challenge 3 {.challenge} > > What do you think will be the result of > `length(x)`? @@ -489,7 +491,7 @@ while `length` will tell you the number of elements. > > -> ### Challenge 4 {.challenge} +> #### Challenge 4 {.challenge} > > Make another matrix, this time containing the numbers 1:50, > with 5 columns and 10 rows. @@ -502,10 +504,10 @@ while `length` will tell you the number of elements. #### Factors Factors are special vectors that represent categorical data. Factors can be -ordered or unordered and are important when for modelling functions such as +ordered or unordered and are important when for modeling functions such as `aov()`, `lm()` and `glm()` and also in plot methods. -Factors can only contain pre-defined values, and we can create one with the +Factors can only contain predefined values, and we can create one with the `factor` function: @@ -541,7 +543,7 @@ This reveals something important: while factors look (and often behave) like character vectors, they are actually integers under the hood, and here, we can see that "no" is represented by a 1, and "yes" a 2. -In modelling functions, important to know what baseline levels is. This is the +In modeling functions, important to know what baseline levels is. 
This is the first factor but by default the ordering is determined by alphabetical order of words entered. You can change this by specifying the levels: @@ -624,7 +626,7 @@ $data ~~~ -In this case our list contains a character vector of lenght one, +In this case our list contains a character vector of length one, a numeric vector with 10 entries, and a small data frame from one of R's many preloaded datasets (see `?data`). We've also given each list element a name, which is why you see `$a` instead of `[[1]]`. @@ -646,10 +648,10 @@ list() ~~~ -> ### Challenge 5 {.challenge} +> #### Challenge 5 {.challenge} > -> Create a list containing two character vectors for each of the sections in this -> part of the workshop: +> Create a list of length two containing a character vector for each of the +> sections in this part of the workshop: > > * Data types > * Data structures @@ -661,3 +663,205 @@ list() Lists are extremely useful inside functions. You can "staple" together lots of different kinds of results into a single object that a function can return. In fact many R functions which return complex output store their results in a list. + +## Challenge solutions + +> #### Solution to challenge 1: Data types {.challenge} +> +> Use your knowledge of how to assign a value to +> a variable, to create examples of data with the +> following characteristics: +> +> 1) Variable name: 'answer', Type: logical +> 2) Variable name: 'height', Type: numeric +> 3) Variable name: 'dog_name', Type: character +> +> For each variable you've created, test that it +> has the data type you intended. Do you find +> anything unexpected? 
+> +> +> ~~~{.r} +> answer <- TRUE +> height <- 150 +> dog_name <- "Snoopy" +> is.logical(answer) +> ~~~ +> +> +> +> ~~~{.output} +> [1] TRUE +> +> ~~~ +> +> +> ~~~{.r} +> is.numeric(height) +> ~~~ +> +> +> +> ~~~{.output} +> [1] TRUE +> +> ~~~ +> +> +> ~~~{.r} +> is.character(dog_name) +> ~~~ +> +> +> +> ~~~{.output} +> [1] TRUE +> +> ~~~ +> + +> #### Solution to challenge 2 {.challenge} +> +> Vectors can only contain one atomic type. If you try to combine different +> types, R will create a vector that is the least common denominator: the +> type that is easiest to coerce to. +> +> +> ~~~{.r} +> xx <- c(1.7, "a") +> xx +> ~~~ +> +> +> +> ~~~{.output} +> [1] "1.7" "a" +> +> ~~~ +> +> +> +> ~~~{.r} +> typeof(xx) +> ~~~ +> +> +> +> ~~~{.output} +> [1] "character" +> +> ~~~ +> +> +> ~~~{.r} +> xx <- c(TRUE, 2) +> xx +> ~~~ +> +> +> +> ~~~{.output} +> [1] 1 2 +> +> ~~~ +> +> +> +> ~~~{.r} +> typeof(xx) +> ~~~ +> +> +> +> ~~~{.output} +> [1] "double" +> +> ~~~ +> +> +> ~~~{.r} +> xx <- c("a", TRUE) +> xx +> ~~~ +> +> +> +> ~~~{.output} +> [1] "a" "TRUE" +> +> ~~~ +> +> +> +> ~~~{.r} +> typeof(xx) +> ~~~ +> +> +> +> ~~~{.output} +> [1] "character" +> +> ~~~ +> + +> +> #### Solution to challenge 3 {.challenge} +> +> What do you think will be the result of +> `length(x)`? +> +> +> ~~~{.r} +> x <- matrix(rnorm(18), ncol=6, nrow=3) +> length(x) +> ~~~ +> +> +> +> ~~~{.output} +> [1] 18 +> +> ~~~ +> +> Because a matrix is really just a vector with added dimension attributes, `length` +> gives you the total number of elements in the matrix. +> + +> +> #### Solution to challenge 4 {.challenge} +> +> Make another matrix, this time containing the numbers 1:50, +> with 5 columns and 10 rows. +> Did the `matrix` function fill your matrix by column, or by +> row, as its default behaviour? +> See if you can figure out how to change this. +> (hint: read the documentation for `matrix`!) 
+> +> +> ~~~{.r} +> x <- matrix(1:50, ncol=5, nrow=10) +> x <- matrix(1:50, ncol=5, nrow=10, byrow = TRUE) # to fill by row +> ~~~ +> + + +> #### Solution to challenge 5 {.challenge} +> +> Create a list of length two containing a character vector for each of the +> sections in this part of the workshop: +> +> * Data types +> * Data structures +> +> Populate each character vector with the names of the data types and data +> structures we've seen so far. +> +> +> ~~~{.r} +> my_list <- list( +> data_types = c("logical", "integer", "double", "complex", "character"), +> data_structures = c("vector", "matrix", "factor", "list") +> ) +> ~~~ +> diff --git a/05-data-structures-part2.Rmd b/05-data-structures-part2.Rmd index 1aa3e2fbf..31d1e18f0 100644 --- a/05-data-structures-part2.Rmd +++ b/05-data-structures-part2.Rmd @@ -11,7 +11,7 @@ source("tools/chunk-options.R") gapminder <- read.csv("data/gapminder-FiveYearData.csv", header=TRUE) ``` -> ## Learning Objectives {.objectives} +> ## Learning objectives {.objectives} > > * Become familiar with data frames > * To be able to read in regular data into R @@ -219,19 +219,9 @@ Let's look at some of the columns. > > Look at the first 6 rows of the gapminder data frame we loaded before: > -> ~~~ {.r} +> ```{r} > head(gapminder) -> ~~~ -> -> ~~~ {.output} -> ## country year pop continent lifeExp gdpPercap -> ## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453 -> ## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530 -> ## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007 -> ## 4 Afghanistan 1967 11537966 Asia 34.020 836.1971 -> ## 5 Afghanistan 1972 13079460 Asia 36.088 739.9811 -> ## 6 Afghanistan 1977 14880372 Asia 38.438 786.1134 -> ~~~ +> ``` > > Write down what data type you think is in each column > @@ -375,8 +365,9 @@ summary(l1) As you might expect, life expectancy has slowly been increasing over time, so we see a significant positive association! 
+## Challenge Solutions -> #### Solution to Challenge 1 {.challenge} +> #### Solution to challenge 1 {.challenge} > > Create a data frame that holds the following information for yourself: > @@ -389,10 +380,10 @@ time, so we see a significant positive association! > Now use cbind to add a column of logicals answering the question, > "Is there anything in this workshop you're finding confusing?" > -> ~~~ {.r} +> ```{r, eval=FALSE} > my_df <- data.frame(first_name = "Software", last_name = "Carpentry", age = 17) > my_df <- rbind(my_df, list("Jane", "Smith", 29)) > my_df <- rbind(my_df, list(c("Jo", "John"), c("White", "Lee"), c(23, 41))) > my_df <- cbind(my_df, confused = c(FALSE, FALSE, TRUE, FALSE)) -> ~~~ -> \ No newline at end of file +> ``` +> diff --git a/05-data-structures-part2.md b/05-data-structures-part2.md index 9edbd34c4..a84dae0de 100644 --- a/05-data-structures-part2.md +++ b/05-data-structures-part2.md @@ -2,12 +2,12 @@ layout: page title: R for reproducible scientific analysis subtitle: Data frames and reading in data -minutes: 15 +minutes: 45 --- -> ## Learning Objectives {.objectives} +> ## Learning objectives {.objectives} > > * Become familiar with data frames > * To be able to read in regular data into R @@ -42,10 +42,10 @@ df ~~~ -> #### Challenge: Dataframes {.challenge} +> #### Challenge: Data frames {.challenge} > > Try using the `length` function to query -> your dataframe `df`. Does it give the result +> your data frame `df`. Does it give the result > you expect? 
> @@ -193,9 +193,9 @@ df ~~~ -> #### Challenge {.challenge} +> #### Challenge 1 {.challenge} > -> Create a dataframe that holds the following information for yourself: +> Create a data frame that holds the following information for yourself: > > * First name > * Last name @@ -258,12 +258,12 @@ head(gapminder) ~~~{.output} country year pop continent lifeExp gdpPercap -1 Afghanistan 1952 8425333 Asia 28.80 779.4 -2 Afghanistan 1957 9240934 Asia 30.33 820.9 -3 Afghanistan 1962 10267083 Asia 32.00 853.1 -4 Afghanistan 1967 11537966 Asia 34.02 836.2 -5 Afghanistan 1972 13079460 Asia 36.09 740.0 -6 Afghanistan 1977 14880372 Asia 38.44 786.1 +1 Afghanistan 1952 8425333 Asia 28.801 779.4453 +2 Afghanistan 1957 9240934 Asia 30.332 820.8530 +3 Afghanistan 1962 10267083 Asia 31.997 853.1007 +4 Afghanistan 1967 11537966 Asia 34.020 836.1971 +5 Afghanistan 1972 13079460 Asia 36.088 739.9811 +6 Afghanistan 1977 14880372 Asia 38.438 786.1134 ~~~ @@ -288,12 +288,12 @@ head(gapminder) ~~~{.output} country year pop continent lifeExp gdpPercap -1 Afghanistan 1952 8425333 Asia 28.80 779.4 -2 Afghanistan 1957 9240934 Asia 30.33 820.9 -3 Afghanistan 1962 10267083 Asia 32.00 853.1 -4 Afghanistan 1967 11537966 Asia 34.02 836.2 -5 Afghanistan 1972 13079460 Asia 36.09 740.0 -6 Afghanistan 1977 14880372 Asia 38.44 786.1 +1 Afghanistan 1952 8425333 Asia 28.801 779.4453 +2 Afghanistan 1957 9240934 Asia 30.332 820.8530 +3 Afghanistan 1962 10267083 Asia 31.997 853.1007 +4 Afghanistan 1967 11537966 Asia 34.020 836.1971 +5 Afghanistan 1972 13079460 Asia 36.088 739.9811 +6 Afghanistan 1977 14880372 Asia 38.438 786.1134 ~~~ @@ -302,7 +302,7 @@ head(gapminder) > 1. Another type of file you might encounter are tab-separated > format. To specify a tab as a separator, use `"\t"`. > -> 2. You can also read in files from the internet by replacing +> 2. You can also read in files from the Internet by replacing > the file paths with a web address. > > 3. 
You can read directly from excel spreadsheets without @@ -312,7 +312,7 @@ head(gapminder) To make sure our analysis is reproducible, we should put the code into a script file so we can come back to it later. -> #### Challenge {.challenge} +> #### Challenge 2 {.challenge} > > Go to file -> new file -> R script, and write an R script > to load in the gapminder dataset. Put it in the `scripts/` @@ -322,7 +322,7 @@ into a script file so we can come back to it later. > as its argument (or by pressing the "source" button in RStudio). > -### Using dataframes: the `gapminder` dataset +### Using data frames: the `gapminder` dataset To recap what we've just learnt, let's have a look at our example @@ -373,22 +373,26 @@ in data, and (as we've heard) is useful for storing data with mixed types of col Let's look at some of the columns. -> #### Challenge: Data types in a real dataset {.challenge} +> #### Challenge 3: Data types in a real dataset {.challenge} > -> Look at the first 6 rows of the gapminder dataframe we loaded before: +> Look at the first 6 rows of the gapminder data frame we loaded before: > -> ~~~ {.r} +> +> ~~~{.r} > head(gapminder) > ~~~ -> -> ~~~ {.output} -> ## country year pop continent lifeExp gdpPercap -> ## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453 -> ## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530 -> ## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007 -> ## 4 Afghanistan 1967 11537966 Asia 34.020 836.1971 -> ## 5 Afghanistan 1972 13079460 Asia 36.088 739.9811 -> ## 6 Afghanistan 1977 14880372 Asia 38.438 786.1134 +> +> +> +> ~~~{.output} +> country year pop continent lifeExp gdpPercap +> 1 Afghanistan 1952 8425333 Asia 28.801 779.4453 +> 2 Afghanistan 1957 9240934 Asia 30.332 820.8530 +> 3 Afghanistan 1962 10267083 Asia 31.997 853.1007 +> 4 Afghanistan 1967 11537966 Asia 34.020 836.1971 +> 5 Afghanistan 1972 13079460 Asia 36.088 739.9811 +> 6 Afghanistan 1977 14880372 Asia 38.438 786.1134 +> > ~~~ > > Write down what data type you think is in 
each column @@ -452,7 +456,7 @@ class(gapminder$continent) One of the default behaviours of R is to treat any text columns as "factors" when reading in data. The reason for this is that text columns often represent categorical data, which need to be factors to be handled appropriately by -the statistical modelling functions in R. +the statistical modeling functions in R. However it's not obvious behaviour, and something that trips many people up. We can disable this behaviour and read in the data again. @@ -740,17 +744,17 @@ head(copy) ~~~{.output} - a b c d e f -1 Afghanistan 1952 8425333 Asia 28.80 779.4 -2 Afghanistan 1957 9240934 Asia 30.33 820.9 -3 Afghanistan 1962 10267083 Asia 32.00 853.1 -4 Afghanistan 1967 11537966 Asia 34.02 836.2 -5 Afghanistan 1972 13079460 Asia 36.09 740.0 -6 Afghanistan 1977 14880372 Asia 38.44 786.1 + a b c d e f +1 Afghanistan 1952 8425333 Asia 28.801 779.4453 +2 Afghanistan 1957 9240934 Asia 30.332 820.8530 +3 Afghanistan 1962 10267083 Asia 31.997 853.1007 +4 Afghanistan 1967 11537966 Asia 34.020 836.1971 +5 Afghanistan 1972 13079460 Asia 36.088 739.9811 +6 Afghanistan 1977 14880372 Asia 38.438 786.1134 ~~~ -There are a few related ways of retreiving and modifying this information. +There are a few related ways of retrieving and modifying this information. `attributes` will give you both the row and column names, along with the class information, while `dimnames` will give you just the rownames and column names. @@ -759,15 +763,15 @@ In both cases, the output object is stored in a `list`: ~~~{.r} -str(dimnames(df)) +str(dimnames(gapminder)) ~~~ ~~~{.output} List of 2 - $ : chr [1:10] "1" "2" "3" "4" ... - $ : chr [1:4] "id" "x" "y" "10:1" + $ : chr [1:1704] "1" "2" "3" "4" ... + $ : chr [1:6] "country" "year" "pop" "continent" ... 
~~~ @@ -805,7 +809,7 @@ lm(formula = lifeExp ~ year, data = gapminder) Coefficients: (Intercept) year - -585.652 0.326 + -585.6522 0.3259 ~~~ @@ -898,21 +902,45 @@ Call: lm(formula = lifeExp ~ year, data = gapminder) Residuals: - Min 1Q Median 3Q Max --39.95 -9.65 1.70 10.33 22.16 + Min 1Q Median 3Q Max +-39.949 -9.651 1.697 10.335 22.158 Coefficients: - Estimate Std. Error t value Pr(>|t|) -(Intercept) -585.6522 32.3140 -18.1 <2e-16 *** -year 0.3259 0.0163 20.0 <2e-16 *** + Estimate Std. Error t value Pr(>|t|) +(Intercept) -585.65219 32.31396 -18.12 <2e-16 *** +year 0.32590 0.01632 19.96 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 -Residual standard error: 11.6 on 1702 degrees of freedom -Multiple R-squared: 0.19, Adjusted R-squared: 0.189 -F-statistic: 399 on 1 and 1702 DF, p-value: <2e-16 +Residual standard error: 11.63 on 1702 degrees of freedom +Multiple R-squared: 0.1898, Adjusted R-squared: 0.1893 +F-statistic: 398.6 on 1 and 1702 DF, p-value: < 2.2e-16 ~~~ As you might expect, life expectancy has slowly been increasing over time, so we see a significant positive association! + +## Challenge Solutions + +> #### Solution to challenge 1 {.challenge} +> +> Create a data frame that holds the following information for yourself: +> +> * First name +> * Last name +> * Age +> +> Then use rbind to add the same information for the people sitting near you. +> +> Now use cbind to add a column of logicals answering the question, +> "Is there anything in this workshop you're finding confusing?" 
+> +> +> ~~~{.r} +> my_df <- data.frame(first_name = "Software", last_name = "Carpentry", age = 17) +> my_df <- rbind(my_df, list("Jane", "Smith", 29)) +> my_df <- rbind(my_df, list(c("Jo", "John"), c("White", "Lee"), c(23, 41))) +> my_df <- cbind(my_df, confused = c(FALSE, FALSE, TRUE, FALSE)) +> ~~~ +> diff --git a/06-data-subsetting.Rmd b/06-data-subsetting.Rmd index b3173f518..59cfbd9e9 100644 --- a/06-data-subsetting.Rmd +++ b/06-data-subsetting.Rmd @@ -11,7 +11,7 @@ source("tools/chunk-options.R") gapminder <- read.csv("data/gapminder-FiveYearData.csv", header=TRUE) ``` -> ### Learning Objectives {.objectives} +> ## Learning objectives {.objectives} > > * To be able to subset vectors, factors, matrices, lists, and data frames > * To be able to extract individual and multiple elements: @@ -113,13 +113,9 @@ x[c(-1, -5)] # or x[-c(1,5)] > slices of a vector. Most people first try to negate a > sequence like so: > -> ~~~ {.r} +> ```{r, error=TRUE} > x[-1:3] -> ~~~ -> -> ~~~ {.output} -> ## Error in x[-1:3] : only 0's may be mixed with negative subscripts -> ~~~ +> ``` > > This gives a somewhat cryptic error: > @@ -130,14 +126,9 @@ x[c(-1, -5)] # or x[-c(1,5)] > The correct solution is to wrap that function call in brackets, so > that the `-` operator applies to the results: > -> ~~~ {.r} +> ```{r} > x[-(1:3)] -> ~~~ -> -> ~~~ {.output} -> ## d e -> ## 4.8 7.5 -> ~~~ +> ``` > To remove elements from a vector, we need to assign the results back @@ -152,22 +143,17 @@ x > > Given the following code: > -> ~~~ {.r} +> ```{r} > x <- c(5.4, 6.2, 7.1, 4.8, 7.5) > names(x) <- c('a', 'b', 'c', 'd', 'e') > print(x) -> ~~~ -> -> ~~~ {.output} -> ## a b c d e -> ## 5.4 6.2 7.1 4.8 7.5 -> ~~~ +> ``` > > 1. Come up with at least 3 different commands that will produce the following output: > -> ~~~ {.r} +> ```{r, echo=FALSE} > x[2:4] -> ~~~ +> ``` > > 2. Compare notes with your neighbour. Did you have different strategies?
> @@ -307,16 +293,11 @@ x[x > 7] > > Given the following code: > -> ~~~ {.r} +> ```{r} > x <- c(5.4, 6.2, 7.1, 4.8, 7.5) > names(x) <- c('a', 'b', 'c', 'd', 'e') > print(x) -> ~~~ -> -> ~~~ {.output} -> ## a b c d e -> ## 5.4 6.2 7.1 4.8 7.5 -> ~~~ +> ``` > > 1. Write a subsetting command to return the values in x that are greater than 4 and less than 7. > @@ -426,17 +407,10 @@ instead of their row and column indices. > > Given the following code: > -> ~~~ {.r} +> ```{r} > m <- matrix(1:18, nrow=3, ncol=6) > print(m) -> ~~~ -> -> ~~~ {.output} -> ## [,1] [,2] [,3] [,4] [,5] [,6] -> ## [1,] 1 4 7 10 13 16 -> ## [2,] 2 5 8 11 14 17 -> ## [3,] 3 6 9 12 15 18 -> ~~~ +> ``` > > 1. Which of the following commands will extract the values 11 and 14? > @@ -512,18 +486,19 @@ xlist$data > #### Challenge 3 {.challenge} > Given the following list: > -> ~~~ {.r} +> ```{r, eval=FALSE} > xlist <- list(a = "Software Carpentry", b = 1:10, data = head(iris)) -> ~~~ +> ``` > -> Using your knowledge of both list and vector subsetting, extract the number 2 from xlist. Hint: the number 2 is contained within the "b" item in the list. +> Using your knowledge of both list and vector subsetting, extract the number 2 from xlist. +> Hint: the number 2 is contained within the "b" item in the list. > #### Challenge 4 {.challenge} > Given a linear model: > -> ~~~ {.r} +> ```{r, eval=FALSE} > mod <- aov(pop ~ lifeExp, data=gapminder) -> ~~~ +> ``` > > Extract the residual degrees of freedom (hint: `attributes()` will help you) > @@ -574,35 +549,35 @@ be changed with the third argument, `drop = FALSE`). > > 1. Extract observations collected for the year 1957 > -> ~~~ {.r} +> ```{r, eval=FALSE} > gapminder[gapminder$year = 1957,] -> ~~~ +> ``` > > 2. Extract all columns except 1 through to 4 > -> ~~~ {.r} +> ```{r, eval=FALSE} > gapminder[,-1:4] -> ~~~ +> ``` > > 3. 
Extract the rows where the life expectancy is longer than 80 years > -> ~~~ {.r} +> ```{r, eval=FALSE} > gapminder[gapminder$lifeExp > 80] -> ~~~ +> ``` > > 4. Extract the first row, and the fourth and fifth columns > (`continent` and `lifeExp`). > -> ~~~ {.r} +> ```{r, eval=FALSE} > gapminder[1, 4, 5] -> ~~~ +> ``` > > 5. Advanced: extract rows that contain information for the years 2002 > and 2007 > -> ~~~ {.r} +> ```{r, eval=FALSE} > gapminder[gapminder$year == 2002 | 2007,] -> ~~~ +> ``` > > #### Challenge 6 {.challenge} @@ -614,49 +589,41 @@ be changed with the third argument, `drop = FALSE`). > and 19 through 23. You can do this in one or two steps. > +## Challenge solutions -> #### Solution to Challenge 1 {.challenge} +> #### Solution to challenge 1 {.challenge} > > Given the following code: > -> ~~~ {.r} +> ```{r} > x <- c(5.4, 6.2, 7.1, 4.8, 7.5) > names(x) <- c('a', 'b', 'c', 'd', 'e') > print(x) -> ~~~ -> -> ~~~ {.output} -> ## a b c d e -> ## 5.4 6.2 7.1 4.8 7.5 -> ~~~ +> ``` > > 1. Come up with at least 3 different commands that will produce the following output: > -> ~~~ {.r} +> ```{r, echo=FALSE} > x[2:4] +> ``` > +> ```{r, eval=FALSE} +> x[2:4] > x[-c(1,5)] > x[c("b", "c", "d")] > x[c(2,3,4)] -> ~~~ +> ``` > > -> #### Solution to Challenge 2 {.challenge} +> #### Solution to challenge 2 {.challenge} > > Given the following code: > -> ~~~ {.r} +> ```{r} > m <- matrix(1:18, nrow=3, ncol=6) > print(m) -> ~~~ -> -> ~~~ {.output} -> ## [,1] [,2] [,3] [,4] [,5] [,6] -> ## [1,] 1 4 7 10 13 16 -> ## [2,] 2 5 8 11 14 17 -> ## [3,] 3 6 9 12 15 18 -> ~~~ +> ``` > > 1. Which of the following commands will extract the values 11 and 14? > @@ -670,83 +637,82 @@ be changed with the third argument, `drop = FALSE`).
> > Answer: D - -> #### Solution to Challenge 3 {.challenge} +> #### Solution to challenge 3 {.challenge} > Given the following list: > -> ~~~ {.r} +> ```{r} > xlist <- list(a = "Software Carpentry", b = 1:10, data = head(iris)) -> ~~~ +> ``` > -> Using your knowledge of both list and vector subsetting, extract the number 2 from xlist. Hint: the number 2 is contained within the "b" item in the list. +> Using your knowledge of both list and vector subsetting, extract the number 2 from xlist. +> Hint: the number 2 is contained within the "b" item in the list. > -> ~~~{.r} -> ## Any of the below should work: +> ```{r, eval=FALSE} > xlist$b[2] > xlist[[2]][2] > xlist[["b"]][2] -> ~~~ +> ``` -> #### Solution to Challenge 4 {.challenge} +> #### Solution to challenge 4 {.challenge} > Given a linear model: > -> ~~~ {.r} +> ```{r} > mod <- aov(pop ~ lifeExp, data=gapminder) -> ~~~ +> ``` > > Extract the residual degrees of freedom (hint: `attributes()` will help you) > -> ~~~ {.r} +> ```{r, eval=FALSE} > attributes(mod) ## `df.residual` is one of the names of `mod` > mod$df.residual -> ~~~ +> ``` -> #### Solution to Challenge 5 {.challenge} +> #### Solution to challenge 5 {.challenge} > > Fix each of the following common data frame subsetting errors: > > 1. Extract observations collected for the year 1957 > -> ~~~ {.r} -> gapminder[gapminder$year = 1957,] +> ```{r, eval=FALSE} +> # gapminder[gapminder$year = 1957,] > gapminder[gapminder$year == 1957,] -> ~~~ +> ``` > > 2. Extract all columns except 1 through to 4 > -> ~~~ {.r} -> gapminder[,-1:4] +> ```{r, eval=FALSE} +> # gapminder[,-1:4] > gapminder[,-c(1:4)] -> ~~~ +> ``` > > 3. Extract the rows where the life expectancy is longer than 80 years > -> ~~~ {.r} -> gapminder[gapminder$lifeExp > 80] +> ```{r, eval=FALSE} +> # gapminder[gapminder$lifeExp > 80] > gapminder[gapminder$lifeExp > 80,] -> ~~~ +> ``` > > 4. Extract the first row, and the fourth and fifth columns > (`continent` and `lifeExp`).
> -> ~~~ {.r} -> gapminder[1, 4, 5] +> ```{r, eval=FALSE} +> # gapminder[1, 4, 5] > gapminder[1, c(4, 5)] -> ~~~ +> ``` > > 5. Advanced: extract rows that contain information for the years 2002 > and 2007 > -> ~~~ {.r} -> gapminder[gapminder$year == 2002 | 2007,] +> ```{r, eval=FALSE} +> # gapminder[gapminder$year == 2002 | 2007,] > gapminder[gapminder$year == 2002 | gapminder$year == 2007,] > gapminder[gapminder$year %in% c(2002, 2007),] -> ~~~ +> ``` > -> #### Solution to Challenge 6 {.challenge} +> #### Solution to challenge 6 {.challenge} > > 1. Why does `gapminder[1:20]` return an error? How does it differ from `gapminder[1:20, ]`? > @@ -755,7 +721,7 @@ be changed with the third argument, `drop = FALSE`). > 2. Create a new `data.frame` called `gapminder_small` that only contains rows 1 through 9 > and 19 through 23. You can do this in one or two steps. > -> ~~~ {.r} +> ```{r} > gapminder_small <- gapminder[c(1:9, 19:23),] -> ~~~ +> ``` > diff --git a/06-data-subsetting.md b/06-data-subsetting.md index 417bde07e..892fc6fb6 100644 --- a/06-data-subsetting.md +++ b/06-data-subsetting.md @@ -2,12 +2,12 @@ layout: page title: R for reproducible scientific analysis subtitle: Subsetting data -minutes: 15 +minutes: 45 --- -> ### Learning Objectives {.objectives} +> ## Learning objectives {.objectives} > > * To be able to subset vectors, factors, matrices, lists, and data frames > * To be able to extract individual and multiple elements: @@ -20,14 +20,14 @@ minutes: 15 R has many powerful subset operators and mastering them will allow you to easily perform complex operations on any kind of dataset. -There are six different ways we can subset any kind of object, and three +There are six different ways we can subset any kind of object, and three different subsetting operators for the different data structures. Let's start with the workhorse of R: atomic vectors. 
~~~{.r} -x <- c(5.4, 6.2, 7.1, 4.8, 7.5) +x <- c(5.4, 6.2, 7.1, 4.8, 7.5) names(x) <- c('a', 'b', 'c', 'd', 'e') x ~~~ @@ -45,7 +45,7 @@ contents? ### Accessing elements using their indices -To extract elements of a vector we can give their corresponding index, starting +To extract elements of a vector we can give their corresponding index, starting from one: @@ -74,7 +74,7 @@ x[4] ~~~ -The square brackets operator is just like any other function. For atomic vectors +The square brackets operator is just like any other function. For atomic vectors (and matrices), it means "get me the nth element". We can ask for multiple elements at once: @@ -196,32 +196,40 @@ x[c(-1, -5)] # or x[-c(1,5)] > > A common trip up for novices occurs when trying to skip > slices of a vector. Most people first try to negate a -> sequence like so: +> sequence like so: > -> ~~~ {.r} +> +> ~~~{.r} > x[-1:3] > ~~~ > -> ~~~ {.output} -> ## Error in x[-1:3] : only 0's may be mixed with negative subscripts -> ~~~ +> +> +> ~~~{.output} +> Error in x[-1:3]: only 0's may be mixed with negative subscripts +> +> ~~~ > > This gives a somewhat cryptic error: > > But remember the order of operations. `:` is really a function, so > what happens is it takes its first argument as -1, and second as 3, -> so generates the sequence of numbers: `c(-1, 0, 1, 2, 3)`. +> so generates the sequence of numbers: `c(-1, 0, 1, 2, 3)`. 
> > The correct solution is to wrap that function call in brackets, so > that the `-` operator applies to the results: > -> ~~~ {.r} +> +> ~~~{.r} > x[-(1:3)] > ~~~ -> -> ~~~ {.output} -> ## d e -> ## 4.8 7.5 +> +> +> +> ~~~{.output} +> d e +> 4.8 7.5 +> > ~~~ > @@ -242,25 +250,32 @@ x ~~~ -> #### Challenge {.challenge} +> #### Challenge 1 {.challenge} > > Given the following code: > -> ~~~ {.r} +> +> ~~~{.r} > x <- c(5.4, 6.2, 7.1, 4.8, 7.5) > names(x) <- c('a', 'b', 'c', 'd', 'e') > print(x) > ~~~ -> -> ~~~ {.output} -> ## a b c d e -> ## 5.4 6.2 7.1 4.8 7.5 +> +> +> +> ~~~{.output} +> a b c d e +> 5.4 6.2 7.1 4.8 7.5 +> > ~~~ > > 1. Come up with at least 3 different commands that will produce the following output: > -> ~~~ {.r} -> x[2:4] +> +> ~~~{.output} +> b c d +> 6.2 7.1 4.8 +> > ~~~ > > 2. Compare notes with your neighbour. Did you have different strategies? @@ -283,8 +298,8 @@ x[c("a", "c")] ~~~ -This is usually a much more reliable way to subset objects: the -position of various elements can often change when chaining together +This is usually a much more reliable way to subset objects: the +position of various elements can often change when chaining together subsetting operations, but the names will always remain the same! Unfortunately we can't skip or remove elements so easily. @@ -299,14 +314,14 @@ x[-which(names(x) == "a")] ~~~{.output} - b c e -6.2 7.1 7.5 + b c d e +6.2 7.1 4.8 7.5 ~~~ The `which` function returns the indices of all `TRUE` elements of its argument. Remember that expressions evaluate before being passed to functions. Let's break -this down so that its clearer whats happening. +this down so that it's clearer what's happening.
First this happens: @@ -318,7 +333,7 @@ names(x) == "a" ~~~{.output} -[1] TRUE FALSE FALSE FALSE +[1] TRUE FALSE FALSE FALSE FALSE ~~~ @@ -353,21 +368,21 @@ x[-which(names(x) %in% c("a", "c"))] ~~~{.output} - b e -6.2 7.5 + b d e +6.2 4.8 7.5 ~~~ -The `%in%` goes through each element of its left argument, in this case the +The `%in%` goes through each element of its left argument, in this case the names of `x`, and asks, "Does this element occur in the second argument?". > #### Tip: Getting help for operators {.callout} -> +> > Remember you can search for help on operators by wrapping them in quotes: > `help("%in%")` or `?"%in%"`. > -So why can't we use `==` like before? That's an excellent question. +So why can't we use `==` like before? That's an excellent question. Let's take a look at just the comparison component: @@ -379,11 +394,19 @@ names(x) == c('a', 'c') ~~~{.output} -[1] TRUE FALSE FALSE FALSE +Warning in names(x) == c("a", "c"): longer object length is not a multiple +of shorter object length + +~~~ + + + +~~~{.output} +[1] TRUE FALSE FALSE FALSE FALSE ~~~ -Obviously "c" is in the names of `x`, so why didn't this work? `==` works +Obviously "c" is in the names of `x`, so why didn't this work? `==` works slightly differently to `%in%`. It will compare each element of its left argument to the corresponding element of its right argument. 
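The position-by-position behaviour of `==` described above can be checked directly. The following is a minimal sketch that rebuilds the lesson's `x` vector and contrasts `==` with `%in%`; the names `by_position` and `by_membership` are illustrative only and do not appear in the lesson.

```r
x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
names(x) <- c('a', 'b', 'c', 'd', 'e')

# `==` pairs elements up by position, recycling c('a', 'c') along names(x):
# "a"=="a", "b"=="c", "c"=="a", "d"=="c", "e"=="a"
# (R also warns here because 5 is not a multiple of 2)
by_position <- suppressWarnings(names(x) == c('a', 'c'))

# `%in%` asks, for each name, whether it occurs anywhere in the right argument
by_membership <- names(x) %in% c('a', 'c')

x[by_position]    # selects only the element named "a"
x[by_membership]  # selects the elements named "a" and "c"
```

Because `"c"` sits in position 3 of `names(x)` but is compared against the recycled `"a"`, the `==` version misses it, while `%in%` does not depend on position.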
@@ -417,14 +440,15 @@ names(x) == c('a', 'c', 'e') ~~~{.output} -Warning: longer object length is not a multiple of shorter object length +Warning in names(x) == c("a", "c", "e"): longer object length is not a +multiple of shorter object length ~~~ ~~~{.output} -[1] TRUE FALSE FALSE FALSE +[1] TRUE FALSE FALSE FALSE FALSE ~~~ @@ -443,8 +467,8 @@ x[c(TRUE, TRUE, FALSE, FALSE)] ~~~{.output} - a b -5.4 6.2 + a b e +5.4 6.2 7.5 ~~~ @@ -459,8 +483,8 @@ x[c(TRUE, FALSE)] ~~~{.output} - a c -5.4 7.1 + a c e +5.4 7.1 7.5 ~~~ @@ -484,7 +508,7 @@ x[x > 7] > > There are many situations in which you will wish to combine multiple conditions. > To do so several logical operations exist in R: -> +> > * `|` logical OR: returns `TRUE`, if either the left or right are `TRUE`. > * `&` logical AND: returns `TRUE` if both the left and right are `TRUE` > * `!` logical NOT: converts `TRUE` to `FALSE` and `FALSE` to `TRUE` @@ -496,19 +520,23 @@ x[x > 7] > > Given the following code: > -> ~~~ {.r} +> +> ~~~{.r} > x <- c(5.4, 6.2, 7.1, 4.8, 7.5) > names(x) <- c('a', 'b', 'c', 'd', 'e') > print(x) > ~~~ -> -> ~~~ {.output} -> ## a b c d e -> ## 5.4 6.2 7.1 4.8 7.5 +> +> +> +> ~~~{.output} +> a b c d e +> 5.4 6.2 7.1 4.8 7.5 +> > ~~~ > > 1. Write a subsetting command to return the values in x that are greater than 4 and less than 7. -> +> #### Handling special values @@ -529,7 +557,7 @@ There are a number of special functions you can use to filter out this data: Now that we've explored the different ways to subset vectors, how do we subset the other data structures? -Factor subsetting works the same way as vector subsetting. +Factor subsetting works the same way as vector subsetting. 
~~~{.r} @@ -605,9 +633,9 @@ m[3:4, c(3,1)] ~~~{.output} - [,1] [,2] -[1,] 1.12493 -0.8356 -[2,] -0.04493 1.5953 + [,1] [,2] +[1,] 1.12493092 -0.8356286 +[2,] -0.04493361 1.5952808 ~~~ @@ -622,13 +650,13 @@ m[, c(3,4)] ~~~{.output} - [,1] [,2] -[1,] -0.62124 0.82122 -[2,] -2.21470 0.59390 -[3,] 1.12493 0.91898 -[4,] -0.04493 0.78214 -[5,] -0.01619 0.07456 -[6,] 0.94384 -1.98935 + [,1] [,2] +[1,] -0.62124058 0.82122120 +[2,] -2.21469989 0.59390132 +[3,] 1.12493092 0.91897737 +[4,] -0.04493361 0.78213630 +[5,] -0.01619026 0.07456498 +[6,] 0.94383621 -1.98935170 ~~~ @@ -643,7 +671,7 @@ m[3,] ~~~{.output} -[1] -0.8356 0.5758 1.1249 0.9190 +[1] -0.8356286 0.5757814 1.1249309 0.9189774 ~~~ @@ -658,8 +686,8 @@ m[3, , drop=FALSE] ~~~{.output} - [,1] [,2] [,3] [,4] -[1,] -0.8356 0.5758 1.125 0.919 + [,1] [,2] [,3] [,4] +[1,] -0.8356286 0.5757814 1.124931 0.9189774 ~~~ @@ -674,13 +702,13 @@ m[, c(3,6)] ~~~{.output} -Error: subscript out of bounds +Error in m[, c(3, 6)]: subscript out of bounds ~~~ > #### Tip: Higher dimensional arrays {.callout} > -> when dealing with multi-dimensional arrays, each argument to `[` +> when dealing with multi-dimensional arrays, each argument to `[` > corresponds to a dimension. For example, a 3D array, the first three > arguments correspond to the rows, columns, and depth dimension. > @@ -696,13 +724,13 @@ m[5] ~~~{.output} -[1] 0.3295 +[1] 0.3295078 ~~~ This usually isn't useful. However it is useful to note that matrices -are laid out in *column-major format* by default. That is the elements of the +are laid out in *column-major format* by default. That is the elements of the vector are arranged column-wise: @@ -722,37 +750,41 @@ matrix(1:6, nrow=2, ncol=3) Matrices can also be subsetted using their rownames and column names instead of their row and column indices. 
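As a small sketch of name-based matrix subsetting: the matrix `m_named` and its row and column labels below are made up for illustration and are not part of the lesson's data.

```r
# A 3 x 6 integer matrix, filled column by column (R's default)
m_named <- matrix(1:18, nrow = 3, ncol = 6)

# Hypothetical labels, chosen only for this example
rownames(m_named) <- c("r1", "r2", "r3")
colnames(m_named) <- c("A", "B", "C", "D", "E", "F")

# Select by name instead of by position; equivalent to m_named[3, c(1, 2)]
by_name <- m_named["r3", c("A", "B")]
by_name

# As with numeric indices, drop = FALSE keeps the result as a matrix
kept <- m_named["r3", c("A", "B"), drop = FALSE]
dim(kept)
```

Names tend to be more robust than positions when rows or columns are later reordered or removed, which is the same argument the lesson makes for named vectors.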
-> #### Challenge {.challenge} +> #### Challenge 2 {.challenge} > > Given the following code: +> > -> ~~~ {.r} +> ~~~{.r} > m <- matrix(1:18, nrow=3, ncol=6) > print(m) > ~~~ > -> ~~~ {.output} -> ## [,1] [,2] [,3] [,4] [,5] [,6] -> ## [1,] 1 4 7 10 13 16 -> ## [2,] 2 5 8 11 14 17 -> ## [3,] 3 6 9 12 15 18 +> +> +> ~~~{.output} +> [,1] [,2] [,3] [,4] [,5] [,6] +> [1,] 1 4 7 10 13 16 +> [2,] 2 5 8 11 14 17 +> [3,] 3 6 9 12 15 18 +> > ~~~ > > 1. Which of the following commands will extract the values 11 and 14? -> +> > A. `m[2,4,2,5]` -> +> > B. `m[2:5]` -> +> > C. `m[4:5,2]` -> +> > D. `m[2,c(4,5)]` -> +> ### List subsetting -Now we'll introduce some new subsetting operators. There are three functions +Now we'll introduce some new subsetting operators. There are three functions used to subset lists. `[`, as we've seen for atomic vectors and matrices, as well as `[[` and `$`. @@ -761,8 +793,8 @@ Using `[` will always return a list. If you want to *subset* a list, but not ~~~{.r} -xlist <- list(a = "Software Carpentry", b = 1:10, data = head(iris)) -xlist[1] +xlist <- list(a = "Software Carpentry", b = 1:10, data = head(iris)) +xlist[1] ~~~ @@ -776,7 +808,7 @@ $a This returns a *list with one element*. We can subset elements of a list exactly the same was as atomic -vectors using `[`. Comparison operations however won't work as +vectors using `[`. Comparison operations however won't work as they're not recursive, they will try to condition on the data structures in each element of the list, not the individual elements within those data structures. @@ -824,7 +856,7 @@ xlist[[1:2]] ~~~{.output} -Error: subscript out of bounds +Error in xlist[[1:2]]: subscript out of bounds ~~~ @@ -838,7 +870,7 @@ xlist[[-1]] ~~~{.output} -Error: attempt to select more than one element +Error in xlist[[-1]]: attempt to select more than one element ~~~ @@ -876,18 +908,22 @@ xlist$data ~~~ -> #### Challenge {.challenge} -> 1. 
Given the following list: +> #### Challenge 3 {.challenge} +> Given the following list: > -> ~~~ {.r} -> xlist <- list(a = "Software Carpentry", b = 1:10, data = head(iris)) -> ~~~ -> -> Using your knowledge of both list and vector subsetting, extract the number 2 from xlist. Hint: the number 2 is contained within the "b" item in the list. > -> 2. Given a linear model: +> ~~~{.r} +> xlist <- list(a = "Software Carpentry", b = 1:10, data = head(iris)) +> ~~~ +> +> Using your knowledge of both list and vector subsetting, extract the number 2 from xlist. +> Hint: the number 2 is contained within the "b" item in the list. + +> #### Challenge 4 {.challenge} +> Given a linear model: +> > -> ~~~ {.r} +> ~~~{.r} > mod <- aov(pop ~ lifeExp, data=gapminder) > ~~~ > @@ -930,11 +966,11 @@ head(gapminder[["lifeExp"]]) ~~~{.output} -[1] 28.80 30.33 32.00 34.02 36.09 38.44 +[1] 28.801 30.332 31.997 34.020 36.088 38.438 ~~~ -And `$` provides a convenient shorthand to extact columns by name: +And `$` provides a convenient shorthand to extract columns by name: ~~~{.r} @@ -959,9 +995,9 @@ gapminder[1:3,] ~~~{.output} country year pop continent lifeExp gdpPercap -1 Afghanistan 1952 8425333 Asia 28.80 779.4 -2 Afghanistan 1957 9240934 Asia 30.33 820.9 -3 Afghanistan 1962 10267083 Asia 32.00 853.1 +1 Afghanistan 1952 8425333 Asia 28.801 779.4453 +2 Afghanistan 1957 9240934 Asia 30.332 820.8530 +3 Afghanistan 1962 10267083 Asia 31.997 853.1007 ~~~ @@ -977,57 +1013,231 @@ gapminder[3,] ~~~{.output} country year pop continent lifeExp gdpPercap -3 Afghanistan 1962 10267083 Asia 32 853.1 +3 Afghanistan 1962 10267083 Asia 31.997 853.1007 ~~~ But for a single column the result will be a vector (this can be changed with the third argument, `drop = FALSE`). -> #### Challenge {.challenge} -> +> #### Challenge 5 {.challenge} +> > Fix each of the following common data frame subsetting errors: -> +> > 1. 
Extract observations collected for the year 1957 > -> ~~~ {.r} +> +> ~~~{.r} > gapminder[gapminder$year = 1957,] > ~~~ > > 2. Extract all columns except 1 through to 4 > -> ~~~ {.r} +> +> ~~~{.r} > gapminder[,-1:4] > ~~~ > > 3. Extract the rows where the life expectancy is longer than 80 years > -> ~~~ {.r} +> +> ~~~{.r} > gapminder[gapminder$lifeExp > 80] > ~~~ > -> 4. Extract the first row, and the fourth and fifth columns +> 4. Extract the first row, and the fourth and fifth columns > (`continent` and `lifeExp`). > -> ~~~ {.r} +> +> ~~~{.r} > gapminder[1, 4, 5] > ~~~ > > 5. Advanced: extract rows that contain information for the years 2002 > and 2007 > -> ~~~ {.r} +> +> ~~~{.r} > gapminder[gapminder$year == 2002 | 2007,] > ~~~ > -> #### Challenge {.challenge} +> #### Challenge 6 {.challenge} > > 1. Why does `gapminder[1:20]` return an error? How does it differ from `gapminder[1:20, ]`? > -> -> 2. Create a new `data.frame` called `gapminder_small` that only contains rows 1 through 9 +> +> 2. Create a new `data.frame` called `gapminder_small` that only contains rows 1 through 9 > and 19 through 23. You can do this in one or two steps. > +## Challenge solutions +> #### Solution to challenge 1 {.challenge} +> +> Given the following code: +> +> +> ~~~{.r} +> x <- c(5.4, 6.2, 7.1, 4.8, 7.5) +> names(x) <- c('a', 'b', 'c', 'd', 'e') +> print(x) +> ~~~ +> +> +> +> ~~~{.output} +> a b c d e +> 5.4 6.2 7.1 4.8 7.5 +> +> ~~~ +> +> 1. Come up with at least 3 different commands that will produce the following output: +> +> +> ~~~{.output} +> b c d +> 6.2 7.1 4.8 +> +> ~~~ +> +> +> ~~~{.r} +> x[2:4] +> x[-c(1,5)] +> x[c("b", "c", "d")] +> x[c(2,3,4)] +> ~~~ +> +> + +> #### Solution to challenge 2 {.challenge} +> +> Given the following code: +> +> +> ~~~{.r} +> m <- matrix(1:18, nrow=3, ncol=6) +> print(m) +> ~~~ +> +> +> +> ~~~{.output} +> [,1] [,2] [,3] [,4] [,5] [,6] +> [1,] 1 4 7 10 13 16 +> [2,] 2 5 8 11 14 17 +> [3,] 3 6 9 12 15 18 +> +> ~~~ +> +> 1.
Which of the following commands will extract the values 11 and 14? +> +> A. `m[2,4,2,5]` +> +> B. `m[2:5]` +> +> C. `m[4:5,2]` +> +> D. `m[2,c(4,5)]` +> +> Answer: D + +> #### Solution to challenge 3 {.challenge} +> Given the following list: +> +> +> ~~~{.r} +> xlist <- list(a = "Software Carpentry", b = 1:10, data = head(iris)) +> ~~~ +> +> Using your knowledge of both list and vector subsetting, extract the number 2 from xlist. +> Hint: the number 2 is contained within the "b" item in the list. +> +> +> ~~~{.r} +> xlist$b[2] +> xlist[[2]][2] +> xlist[["b"]][2] +> ~~~ + + +> #### Solution to challenge 4 {.challenge} +> Given a linear model: +> +> +> ~~~{.r} +> mod <- aov(pop ~ lifeExp, data=gapminder) +> ~~~ +> +> Extract the residual degrees of freedom (hint: `attributes()` will help you) +> +> +> ~~~{.r} +> attributes(mod) ## `df.residual` is one of the names of `mod` +> mod$df.residual +> ~~~ + + +> #### Solution to challenge 5 {.challenge} +> +> Fix each of the following common data frame subsetting errors: +> +> 1. Extract observations collected for the year 1957 +> +> +> ~~~{.r} +> # gapminder[gapminder$year = 1957,] +> gapminder[gapminder$year == 1957,] +> ~~~ +> +> 2. Extract all columns except 1 through to 4 +> +> +> ~~~{.r} +> # gapminder[,-1:4] +> gapminder[,-c(1:4)] +> ~~~ +> +> 3. Extract the rows where the life expectancy is longer the 80 years +> +> +> ~~~{.r} +> # gapminder[gapminder$lifeExp > 80] +> gapminder[gapminder$lifeExp > 80,] +> ~~~ +> +> 4. Extract the first row, and the fourth and fifth columns +> (`lifeExp` and `gdpPercap`). +> +> +> ~~~{.r} +> # gapminder[1, 4, 5] +> gapminder[1, c(4, 5)] +> ~~~ +> +> 5. Advanced: extract rows that contain information for the years 2002 +> and 2007 +> +> +> ~~~{.r} +> # gapminder[gapminder$year == 2002 | 2007,] +> gapminder[gapminder$year == 2002 | gapminder$year == 2007,] +> gapminder[gapminder$year %in% c(2002, 2007),] +> ~~~ +> + +> #### Solution to challenge 6 {.challenge} +> +> 1. 
Why does `gapminder[1:20]` return an error? How does it differ from `gapminder[1:20, ]`? +> +> Answer: `gapminder` is a data.frame so needs to be subsetted on two dimensions. `gapminder[1:20, ]` subsets the data to give the first 20 rows and all columns. +> +> 2. Create a new `data.frame` called `gapminder_small` that only contains rows 1 through 9 +> and 19 through 23. You can do this in one or two steps. +> +> +> ~~~{.r} +> gapminder_small <- gapminder[c(1:9, 19:23),] +> ~~~ +> diff --git a/07-functions.Rmd b/07-functions.Rmd index 60a9dbdda..0b6d9350b 100644 --- a/07-functions.Rmd +++ b/07-functions.Rmd @@ -11,7 +11,7 @@ source("tools/chunk-options.R") gapminder <- read.csv("data/gapminder-FiveYearData.csv", header=TRUE) ``` -> ## Objectives {.objectives} +> ## Learning objectives {.objectives} > > * Define a function that takes arguments. > * Return a value from a function. @@ -276,25 +276,17 @@ which is much better than in our first attempt where we just got a vector of num > > The `paste` function can be used to combine text together, e.g: > -> ~~~ {.r} +> ```{r} > best_practice <- c("Write", "programs", "for", "people", "not", "computers") > paste(best_practice, collapse=" ") -> ~~~ -> -> ~~~ {.output} -> ## [1] "Write programs for people not computers" -> ~~~ +> ``` > > Write a function called `fence` that takes two vectors as arguments, called > `text` and `wrapper`, and prints out the text wrapped with the `wrapper`: > -> ~~~ {.r} +> ```{r, eval=FALSE} > fence(text=best_practice, wrapper="***") -> ~~~ -> -> ~~~ {.output} -> ## [1] "*** Write programs for people not computers ***" -> ~~~ +> ``` > > *Note:* the `paste` function has an argument called `sep`, which specifies the > separator between text. The default is a space: " ". 
The default for `paste0` @@ -347,50 +339,49 @@ which is much better than in our first attempt where we just got a vector of num [roxygen2]: http://cran.r-project.org/web/packages/roxygen2/vignettes/rd.html [testthat]: http://r-pkgs.had.co.nz/tests.html +## Challenge solutions -> #### Solution for Challenge 1 {.challenge} +> #### Solution to challenge 1 {.challenge} > > Write a function called `kelvin_to_celsius` that takes a temperature in Kelvin > and returns that temperature in Celsius > -> ~~~ {.r} +> ```{r} > kelvin_to_celsius <- function(temp) { > celsius <- temp - 273.15 > return(celsius) > } -> ~~~ +> ``` -> #### Solution for Challenge 2 {.challenge} +> #### Solution to challenge 2 {.challenge} > > Define the function to convert directly from Fahrenheit to Celsius, > by reusing these two functions above > > -> ~~~ {.r} +> ```{r} > fahr_to_celsius <- function(temp) { > temp_k <- fahr_to_kelvin(temp) > result <- kelvin_to_celsius(temp_k) > return(result) > } -> ~~~ +> ``` > -> #### Solution for Challenge 3 {.challenge} +> #### Solution to challenge 3 {.challenge} > > > Write a function called `fence` that takes two vectors as arguments, called > `text` and `wrapper`, and prints out the text wrapped with the `wrapper`: > -> ~~~ {.r} +> ```{r} > fence <- function(text, wrapper){ > text <- c(wrapper, text, wrapper) > result <- paste(text, collapse = " ") +> return(result) > } +> best_practice <- c("Write", "programs", "for", "people", "not", "computers") > fence(text=best_practice, wrapper="***") -> ~~~ -> -> ~~~ {.output} -> ## [1] "*** Write programs for people not computers ***" -> ~~~ +> ``` > diff --git a/07-functions.md b/07-functions.md index 05a2834e2..ea1ac5657 100644 --- a/07-functions.md +++ b/07-functions.md @@ -2,12 +2,12 @@ layout: page title: R for reproducible scientific analysis subtitle: Creating functions -minutes: 15 +minutes: 45 --- -> ## Objectives {.objectives} +> ## Learning objectives {.objectives} > > * Define a function that takes 
arguments. > * Return a value from a function. @@ -17,8 +17,8 @@ minutes: 15 > If we only had one data set to analyze, it would probably be faster to load the -file into a spreadsheet and use that to plot simple statistics. However, the -gapminder data is updated periodically, and we may want to pull in that new +file into a spreadsheet and use that to plot simple statistics. However, the +gapminder data is updated periodically, and we may want to pull in that new information later and re-run our analysis again. We may also obtain similar data from a different source in the future. @@ -49,7 +49,7 @@ fahr_to_kelvin <- function(temp) { ~~~ We define `fahr_to_kelvin` by assigning it to the output of `function`. -The list of argument names are containted within parentheses. +The list of argument names are contained within parentheses. Next, the [body](reference.html#function-body) of the function--the statements that are executed when it runs--is contained within curly braces (`{}`). The statements in the body are indented by two spaces. This makes the code easier to read but does not affect how the code operates. @@ -76,7 +76,7 @@ fahr_to_kelvin(32) ~~~{.output} -[1] 273.1 +[1] 273.15 ~~~ @@ -89,16 +89,16 @@ fahr_to_kelvin(212) ~~~{.output} -[1] 373.1 +[1] 373.15 ~~~ > #### Challenge 1 {.challenge} -> -> Write a function called `kelvin_to_celsius` that takes a temperature in kelvin -> and returns that tempterature in celcius -> -> Hint: To convert from kelvin to celcius you minus 273.15 +> +> Write a function called `kelvin_to_celsius` that takes a temperature in Kelvin +> and returns that temperature in Celsius +> +> Hint: To convert from Kelvin to Celsius you minus 273.15 > #### Combining functions @@ -106,7 +106,7 @@ fahr_to_kelvin(212) The real power of functions comes from mixing, matching and combining them into ever large chunks to get the effect we want. 
-Let's define two functions that will convert temparature from Fahrenheit to +Let's define two functions that will convert temperature from Fahrenheit to Kelvin, and Kelvin to Celsius: @@ -144,19 +144,19 @@ calcGDP <- function(dat) { ~~~ We define `calcGDP` by assigning it to the output of `function`. -The list of argument names are containted within parentheses. +The list of argument names are contained within parentheses. Next, the body of the function -- the statements executed when you -call the function -- is contained within curly braces (`{}`). +call the function -- is contained within curly braces (`{}`). We've indented the statements in the body by two spaces. This makes the code easier to read but does not affect how it operates. -When we call the function, the values we pass to it are assigned -to the arguments, which become variables inside the body of the +When we call the function, the values we pass to it are assigned +to the arguments, which become variables inside the body of the function. -Inside the function, we use the `return` function to send back the -result. This return function is optional: R will automatically +Inside the function, we use the `return` function to send back the +result. This return function is optional: R will automatically return the results of whatever command is executed on the last line of the function. @@ -169,12 +169,12 @@ calcGDP(head(gapminder)) ~~~{.output} -[1] 6.567e+09 7.585e+09 8.759e+09 9.648e+09 9.679e+09 1.170e+10 +[1] 6567086330 7585448670 8758855797 9648014150 9678553274 11697659231 ~~~ -That's not very informative. Let's add some more arguments so we can extract -that per year and country. +That's not very informative. Let's add some more arguments so we can extract +that per year and country. ~~~{.r} @@ -182,19 +182,19 @@ that per year and country. # with the GDP per capita column. 
calcGDP <- function(dat, year=NULL, country=NULL) { if(!is.null(year)) { - dat <- dat[dat$year %in% year, ] + dat <- dat[dat$year %in% year, ] } if (!is.null(country)) { dat <- dat[dat$country %in% country,] } gdp <- dat$pop * dat$gdpPercap - new <- cbind(dat, gdp=gdp) + new <- cbind(dat, gdp=gdp) return(new) } ~~~ -If you've been writing these functions down into a separate R script +If you've been writing these functions down into a separate R script (a good idea!), you can load in the functions into our R session by using the `source` function: @@ -203,8 +203,8 @@ If you've been writing these functions down into a separate R script source("functions/functions-lesson.R") ~~~ -Ok, so there's a lot going on in this function now. In plain english, -the function now subsets the provided data by year if the year argument isn't +Ok, so there's a lot going on in this function now. In plain English, +the function now subsets the provided data by year if the year argument isn't empty, then subsets the result by country if the country argument isn't empty. Then it calculates the GDP for whatever subset emerges from the previous two steps. 
The function then adds the GDP as a new column to the subsetted data and returns @@ -221,13 +221,13 @@ head(calcGDP(gapminder, year=2007)) ~~~{.output} - country year pop continent lifeExp gdpPercap gdp -12 Afghanistan 2007 31889923 Asia 43.83 974.6 3.108e+10 -24 Albania 2007 3600523 Europe 76.42 5937.0 2.138e+10 -36 Algeria 2007 33333216 Africa 72.30 6223.4 2.074e+11 -48 Angola 2007 12420476 Africa 42.73 4797.2 5.958e+10 -60 Argentina 2007 40301927 Americas 75.32 12779.4 5.150e+11 -72 Australia 2007 20434176 Oceania 81.23 34435.4 7.037e+11 + country year pop continent lifeExp gdpPercap gdp +12 Afghanistan 2007 31889923 Asia 43.828 974.5803 31079291949 +24 Albania 2007 3600523 Europe 76.423 5937.0295 21376411360 +36 Algeria 2007 33333216 Africa 72.301 6223.3675 207444851958 +48 Angola 2007 12420476 Africa 42.731 4797.2313 59583895818 +60 Argentina 2007 40301927 Americas 75.320 12779.3796 515033625357 +72 Australia 2007 20434176 Oceania 81.235 34435.3674 703658358894 ~~~ @@ -241,19 +241,19 @@ calcGDP(gapminder, country="Australia") ~~~{.output} - country year pop continent lifeExp gdpPercap gdp -61 Australia 1952 8691212 Oceania 69.12 10040 8.726e+10 -62 Australia 1957 9712569 Oceania 70.33 10950 1.063e+11 -63 Australia 1962 10794968 Oceania 70.93 12217 1.319e+11 -64 Australia 1967 11872264 Oceania 71.10 14526 1.725e+11 -65 Australia 1972 13177000 Oceania 71.93 16789 2.212e+11 -66 Australia 1977 14074100 Oceania 73.49 18334 2.580e+11 -67 Australia 1982 15184200 Oceania 74.74 19477 2.957e+11 -68 Australia 1987 16257249 Oceania 76.32 21889 3.559e+11 -69 Australia 1992 17481977 Oceania 77.56 23425 4.095e+11 -70 Australia 1997 18565243 Oceania 78.83 26998 5.012e+11 -71 Australia 2002 19546792 Oceania 80.37 30688 5.998e+11 -72 Australia 2007 20434176 Oceania 81.23 34435 7.037e+11 + country year pop continent lifeExp gdpPercap gdp +61 Australia 1952 8691212 Oceania 69.120 10039.60 87256254102 +62 Australia 1957 9712569 Oceania 70.330 10949.65 106349227169 +63 Australia 
1962 10794968 Oceania 70.930 12217.23 131884573002 +64 Australia 1967 11872264 Oceania 71.100 14526.12 172457986742 +65 Australia 1972 13177000 Oceania 71.930 16788.63 221223770658 +66 Australia 1977 14074100 Oceania 73.490 18334.20 258037329175 +67 Australia 1982 15184200 Oceania 74.740 19477.01 295742804309 +68 Australia 1987 16257249 Oceania 76.320 21888.89 355853119294 +69 Australia 1992 17481977 Oceania 77.560 23424.77 409511234952 +70 Australia 1997 18565243 Oceania 78.830 26997.94 501223252921 +71 Australia 2002 19546792 Oceania 80.370 30687.75 599847158654 +72 Australia 2007 20434176 Oceania 81.235 34435.37 703658358894 ~~~ @@ -267,8 +267,8 @@ calcGDP(gapminder, year=2007, country="Australia") ~~~{.output} - country year pop continent lifeExp gdpPercap gdp -72 Australia 2007 20434176 Oceania 81.23 34435 7.037e+11 + country year pop continent lifeExp gdpPercap gdp +72 Australia 2007 20434176 Oceania 81.235 34435.37 703658358894 ~~~ @@ -279,7 +279,7 @@ Let's walk through the body of the function: calcGDP <- function(dat, year=NULL, country=NULL) { ~~~ -Here we've added two argumets, `year`, and `country`. We've set +Here we've added two arguments, `year`, and `country`. We've set *default arguments* for both as `NULL` using the `=` operator in the function definition. This means that those arguments will take on those values unless the user specifies otherwise. @@ -287,7 +287,7 @@ take on those values unless the user specifies otherwise. ~~~{.r} if(!is.null(year)) { - dat <- dat[dat$year %in% year, ] + dat <- dat[dat$year %in% year, ] } if (!is.null(country)) { dat <- dat[dat$country %in% country,] @@ -295,10 +295,10 @@ take on those values unless the user specifies otherwise. ~~~ Here, we check whether each additional argument is set to `null`, -and whenever they're not `null` overwrite the dataset stored in `dat` with +and whenever they're not `null` overwrite the dataset stored in `dat` with a subset given by the non-`null` argument. 
-I did this so that our function is more flexible for later. We +I did this so that our function is more flexible for later. We can ask it to calculate the GDP for: * The whole dataset; @@ -314,11 +314,11 @@ to those arguments. > Functions in R almost always make copies of the data to operate on > inside of a function body. When we modify `dat` inside the function > we are modifying the copy of the gapminder dataset stored in `dat`, -> not the original variable we gave as the first argument. +> not the original variable we gave as the first argument. > > This is called "pass-by-value" and it makes writing code much safer: -> you can always be sure that whatever changes you make within the -> body of the function, stay inside the body of the function. +> you can always be sure that whatever changes you make within the +> body of the function, stay inside the body of the function. > > #### Tip: Function scope {.callout} @@ -326,20 +326,20 @@ to those arguments. > Another important concept is scoping: any variables (or functions!) you > create or modify inside the body of a function only exist for the lifetime > of the function's execution. When we call `calcGDP`, the variables `dat`, -> `gdp` and `new` only exist inside the body of the function. Even if we -> have variables of the same name in our interactive R session, they are +> `gdp` and `new` only exist inside the body of the function. Even if we +> have variables of the same name in our interactive R session, they are > not modified in any way when executing a function. > ~~~{.r} gdp <- dat$pop * dat$gdpPercap - new <- cbind(dat, gdp=gdp) + new <- cbind(dat, gdp=gdp) return(new) } ~~~ -Finally, we calculated the GDP on our new subset, and created a new +Finally, we calculated the GDP on our new subset, and created a new data frame with that column added. 
This means when we call the function later we can see the context for the returned GDP values, which is much better than in our first attempt where we just got a vector of numbers. @@ -347,40 +347,41 @@ which is much better than in our first attempt where we just got a vector of num > #### Challenge 3 {.challenge} > > The `paste` function can be used to combine text together, e.g: +> > -> ~~~ {.r} +> ~~~{.r} > best_practice <- c("Write", "programs", "for", "people", "not", "computers") > paste(best_practice, collapse=" ") > ~~~ -> -> ~~~ {.output} -> ## [1] "Write programs for people not computers" +> +> +> +> ~~~{.output} +> [1] "Write programs for people not computers" +> > ~~~ > > Write a function called `fence` that takes two vectors as arguments, called > `text` and `wrapper`, and prints out the text wrapped with the `wrapper`: > -> ~~~ {.r} -> fence(text=best_practice, wrapper="***") -> ~~~ > -> ~~~ {.output} -> ## [1] "*** Write programs for people not computers ***" +> ~~~{.r} +> fence(text=best_practice, wrapper="***") > ~~~ -> +> > *Note:* the `paste` function has an argument called `sep`, which specifies the > separator between text. The default is a space: " ". The default for `paste0` > is no space "". > -> ## Tip {.callout} -> -> R has some unique aspects that can be exploited when performing -> more complicated operations. We will not be writing anything that requires -> knowledge of these more advanced concepts. In the future when you are -> comfortable writing functions in R, you can learn more by reading the -> [R Language Manual][man] or this [chapter][] from -> [Advanced R Programming][adv-r] by Hadley Wickham. For context, R uses the +> ## Tip {.callout} +> +> R has some unique aspects that can be exploited when performing +> more complicated operations. We will not be writing anything that requires +> knowledge of these more advanced concepts. 
In the future when you are +> comfortable writing functions in R, you can learn more by reading the +> [R Language Manual][man] or this [chapter][] from +> [Advanced R Programming][adv-r] by Hadley Wickham. For context, R uses the > terminology "environments" instead of frames. [man]: http://cran.r-project.org/doc/manuals/r-release/R-lang.html#Environment-objects @@ -391,14 +392,14 @@ which is much better than in our first attempt where we just got a vector of num > #### Tip: Testing and documenting {.callout} > > It's important to both test functions and document them: -> Documentation helps you, and others, understand what the +> Documentation helps you, and others, understand what the > purpose of your function is, and how to use it, and its > important to make sure that your function actually does > what you think. > > When you first start out, your workflow will probably look a lot > like this: -> +> > 1. Write a function > 2. Comment parts of the function to document its behaviour > 3. Load in the source file @@ -406,12 +407,12 @@ which is much better than in our first attempt where we just got a vector of num > as you expect > 5. Make any necessary bug fixes > 6. Rinse and repeat. -> +> > Formal documentation for functions, written in separate `.Rd` > files, gets turned into the documentation you see in help -> files. The [roxygen2][] package allows R coders to write documentation alongside +> files. The [roxygen2][] package allows R coders to write documentation alongside > the function code and then process it into the appropriate `.Rd` files. -> You will want to switch to this more formal method of writing documentation +> You will want to switch to this more formal method of writing documentation > when you start writing more complicated R projects. > > Formal automated tests can be written using the [testthat][] package. 
@@ -419,14 +420,30 @@ which is much better than in our first attempt where we just got a vector of num [roxygen2]: http://cran.r-project.org/web/packages/roxygen2/vignettes/rd.html [testthat]: http://r-pkgs.had.co.nz/tests.html +## Challenge solutions + +> #### Solution to challenge 1 {.challenge} +> +> Write a function called `kelvin_to_celsius` that takes a temperature in Kelvin +> and returns that temperature in Celsius +> +> +> ~~~{.r} +> kelvin_to_celsius <- function(temp) { +> celsius <- temp - 273.15 +> return(celsius) +> } +> ~~~ + -> #### Solution for Challenge 2 {.challenge} +> #### Solution to challenge 2 {.challenge} > > Define the function to convert directly from Fahrenheit to Celsius, > by reusing these two functions above > +> > -> ~~~ {.r} +> ~~~{.r} > fahr_to_celsius <- function(temp) { > temp_k <- fahr_to_kelvin(temp) > result <- kelvin_to_celsius(temp_k) @@ -435,3 +452,27 @@ which is much better than in our first attempt where we just got a vector of num > ~~~ > +> #### Solution to challenge 3 {.challenge} +> +> +> Write a function called `fence` that takes two vectors as arguments, called +> `text` and `wrapper`, and prints out the text wrapped with the `wrapper`: +> +> +> ~~~{.r} +> fence <- function(text, wrapper){ +> text <- c(wrapper, text, wrapper) +> result <- paste(text, collapse = " ") +> return(result) +> } +> best_practice <- c("Write", "programs", "for", "people", "not", "computers") +> fence(text=best_practice, wrapper="***") +> ~~~ +> +> +> +> ~~~{.output} +> [1] "*** Write programs for people not computers ***" +> +> ~~~ +> diff --git a/08-plot-ggplot2.Rmd b/08-plot-ggplot2.Rmd index 1b273657e..057de65dc 100644 --- a/08-plot-ggplot2.Rmd +++ b/08-plot-ggplot2.Rmd @@ -12,7 +12,7 @@ opts_chunk$set(fig.path = "fig/08-plot-ggplot2-") gapminder <- read.csv("data/gapminder-FiveYearData.csv", header=TRUE) ``` -> # Learning Objectives {.objectives} +> ## Learning objectives {.objectives} > > * To be able to use ggplot2 to generate 
publication quality graphics > * To understand the basics of the grammar of graphics: @@ -92,9 +92,9 @@ ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) + > Modify the example so that the figure visualise how life expectancy has > changed over time: > -> ~~~ {.r} +> ```{r, eval=FALSE} > ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) + geom_point() -> ~~~ +> ``` > > Hint: the gapminder dataset has a column called "year", which should appear > on the x-axis. @@ -262,19 +262,19 @@ code to modify! > - Add a facet layer to panel the density plots by year. > +## Challenge solutions -> #### Solution to Challenge 1 {.challenge} +> #### Solution to challenge 1 {.challenge} > > Modify the example so that the figure visualise how life expectancy has > changed over time: > -> ~~~ {.r} +> ```{r ch1-sol} > ggplot(data = gapminder, aes(x = year, y = lifeExp)) + geom_point() -> ~~~ +> ``` > - -> #### Solution to Challenge 2 {.challenge} +> #### Solution to challenge 2 {.challenge} > > In the previous examples and challenge we've used the `aes` function to tell > the scatterplot **geom** about the **x** and **y** locations of each point. @@ -282,41 +282,41 @@ code to modify! > code from the previous challenge to **color** the points by the "continent" > column. What trends do you see in the data? Are they what you expected? > -> ~~~ {.r} +> ```{r ch2-sol} > ggplot(data = gapminder, aes(x = year, y = lifeExp, color=continent)) + > geom_point() -> ~~~ +> ``` > -> #### Solution to Challenge 3 {.challenge} +> #### Solution to challenge 3 {.challenge} > > Switch the order of the point and line layers from the previous example. What > happened? > -> ~~~ {.r} +> ```{r ch3-sol} > ggplot(data = gapminder, aes(x=year, y=lifeExp, by=country)) + > geom_point() + geom_line(aes(color=continent)) -> ~~~ +> ``` > > The lines now get drawn over the points! 
> -> #### Solution to Challenge 4 {.challenge} +> #### Solution to challenge 4 {.challenge} > > Modify the color and size of the points on the point layer in the previous > example. > > Hint: do not use the `aes` function. > -> ~~~ {.r} +> ```{r ch4-sol} > ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) + > geom_point(size=3, color="orange") + scale_y_log10() + > geom_smooth(method="lm", size=1.5) -> ~~~ +> ``` > -> #### Solution toChallenge 5 {.challenge} +> #### Solution to challenge 5 {.challenge} > > Create a density plot of GDP per capita, filled by continent. > @@ -324,8 +324,8 @@ code to modify! > - Transform the x axis to better visualise the data spread. > - Add a facet layer to panel the density plots by year. > -> ~~~ {.r} +> ```{r ch5-sol} > ggplot(data = gapminder, aes(x = gdpPercap, fill=continent)) + > geom_density(alpha=0.6) + facet_wrap( ~ year) + scale_x_log10() -> ~~~ +> ``` > diff --git a/08-plot-ggplot2.md b/08-plot-ggplot2.md index 5d41be497..c81b1d69b 100644 --- a/08-plot-ggplot2.md +++ b/08-plot-ggplot2.md @@ -2,12 +2,12 @@ layout: page title: R for reproducible scientific analysis subtitle: Creating publication quality graphics -minutes: 60 +minutes: 70 --- -> # Learning Objectives {.objectives} +> ## Learning objectives {.objectives} > > * To be able to use ggplot2 to generate publication quality graphics > * To understand the basics of the grammar of graphics: @@ -15,7 +15,7 @@ minutes: 60 > - The geometry layer > - Adding statistics > - Transforming scales -> - Coloring or panelling by groups. +> - Coloring or paneling by groups. > Plotting our data is one of the best ways to @@ -34,14 +34,14 @@ Today we'll be learning about the ggplot2 package, because it is the most effective for creating publication quality graphics. 
-ggplot2 is built on the grammar of graphics, the idea that any plot can be -expressed from the same set of components: a **data** set, a +ggplot2 is built on the grammar of graphics, the idea that any plot can be +expressed from the same set of components: a **data** set, a **coordinate system**, and a set of **geoms**--the visual representation of data points. The key to understanding ggplot2 is thinking about a figure in layers: just like -you might do in an image editing program like photoshop, illustrator, or -inkscape. +you might do in an image editing program like Photoshop, Illustrator, or +Inkscape. Let's start off with an example: @@ -52,21 +52,21 @@ ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) + geom_point() ~~~ -plot of chunk lifeExp-vs-gdpPercap-scatter +plot of chunk lifeExp-vs-gdpPercap-scatter -So the first thing we do is call the `ggplot` function. This function lets R -know that we're creating a new plot, and any of the arguments we give the -`ggplot` function are the *global* options for the plot: they apply to all +So the first thing we do is call the `ggplot` function. This function lets R +know that we're creating a new plot, and any of the arguments we give the +`ggplot` function are the *global* options for the plot: they apply to all layers on the plot. We've passed in two arguments to `ggplot`. First, we tell `ggplot` what data we -want to show on our figure, in this example the gapminder data we read in -earlier. For the second argument we passed in the `aes` function, which +want to show on our figure, in this example the gapminder data we read in +earlier. For the second argument we passed in the `aes` function, which tells `ggplot` how variables in the **data** map to *aesthetic* properties of the figure, in this case the **x** and **y** locations. Here we told `ggplot` we want to plot the "lifeExp" column of the gapminder data frame on the x-axis, and -the "gdpPercap" column on the y-axis. 
Notice that we didn't need to explicity -pass `aes` these columns (e.g. `x = gapminder[, "lifeExp"]`), this is because +the "gdpPercap" column on the y-axis. Notice that we didn't need to explicitly +pass `aes` these columns (e.g. `x = gapminder[, "lifeExp"]`), this is because `ggplot` is smart enough to know to look in the **data** for that column! By itself, the call to `ggplot` isn't enough to draw a figure: @@ -83,9 +83,9 @@ Error: No layers in plot ~~~ -We need to tell `ggplot` how we want to visually represent the data, which we +We need to tell `ggplot` how we want to visually represent the data, which we do by adding a new **geom** layer. In our example, we used `geom_point`, which -tells `ggplot` we want to visually represent the relationship between **x** and +tells `ggplot` we want to visually represent the relationship between **x** and **y** as a scatterplot of points: @@ -94,49 +94,31 @@ ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) + geom_point() ~~~ -plot of chunk lifeExp-vs-gdpPercap-scatter2 +plot of chunk lifeExp-vs-gdpPercap-scatter2 > #### Challenge 1 {.challenge} > -> Modify the example so that the figure visualise how life expectancy has +> Modify the example so that the figure visualise how life expectancy has > changed over time: > -> ~~~ {.r} +> +> ~~~{.r} > ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) + geom_point() > ~~~ > -> Hint: the gapminder dataset has a column called "year", which should appear +> Hint: the gapminder dataset has a column called "year", which should appear > on the x-axis. -> - -Solution: - - -~~~{.r} -ggplot(data = gapminder, aes(x = year, y = lifeExp)) + geom_point() -~~~ - -plot of chunk challenge-1-solution +> > #### Challenge 2 {.challenge} > > In the previous examples and challenge we've used the `aes` function to tell -> the scatterplot **geom** about the **x** and **y** locations of each point. +> the scatterplot **geom** about the **x** and **y** locations of each point. 
> Another *aesthetic* property we can modify is the point *color*. Modify the -> code from the previous challenge to **color** the points by the "continent" +> code from the previous challenge to **color** the points by the "continent" > column. What trends do you see in the data? Are they what you expected? > -Solution: - - -~~~{.r} -ggplot(data = gapminder, aes(x = year, y = lifeExp, color=continent)) + - geom_point() -~~~ - -plot of chunk challenge-2-solution - ### Layers Using a scatterplot probably isn't the best for visualising change over time. @@ -148,13 +130,13 @@ ggplot(data = gapminder, aes(x=year, y=lifeExp, by=country, color=continent)) + geom_line() ~~~ -plot of chunk lifeExp-line +plot of chunk lifeExp-line -Instead of adding a `geom_point` layer, we've added a `geom_line` layer. We've +Instead of adding a `geom_point` layer, we've added a `geom_line` layer. We've added the **by** *aesthetic*, which tells `ggplot` to draw a line for each country. -But what if we want to visualise both lines and points on the plot? We can +But what if we want to visualise both lines and points on the plot? We can simply add another layer to the plot: @@ -163,10 +145,10 @@ ggplot(data = gapminder, aes(x=year, y=lifeExp, by=country, color=continent)) + geom_line() + geom_point() ~~~ -plot of chunk lifeExp-line-point +plot of chunk lifeExp-line-point It's important to note that each layer is drawn on top of the previous layer. In -this example, the points have been drawn *on top of* the lines. Here's a +this example, the points have been drawn *on top of* the lines. 
Here's a demonstration: @@ -175,11 +157,11 @@ ggplot(data = gapminder, aes(x=year, y=lifeExp, by=country)) + geom_line(aes(color=continent)) + geom_point() ~~~ -plot of chunk lifeExp-layer-example-1 +plot of chunk lifeExp-layer-example-1 -In this example, the *aesthetic* mapping of **color** has been moved from the +In this example, the *aesthetic* mapping of **color** has been moved from the global plot options in `ggplot` to the `geom_line` layer so it no longer applies -to the points. Now we can clearly see that the points are drawn on top of the +to the points. Now we can clearly see that the points are drawn on top of the lines. > #### Challenge 3 {.challenge} @@ -188,21 +170,9 @@ lines. > happened? > -Solution: - - -~~~{.r} -ggplot(data = gapminder, aes(x=year, y=lifeExp, by=country)) + - geom_point() + geom_line(aes(color=continent)) -~~~ - -plot of chunk lifeExp-layer-example-2 - -The lines now get drawn over the points! - ### Transformations and statistics -Ggplot also makes it easy to overlay statistical models over the data. To +Ggplot also makes it easy to overlay statistical models over the data. To demonstrate we'll go back to our first example: @@ -211,11 +181,11 @@ ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap, color=continent)) + geom_point() ~~~ -plot of chunk lifeExp-vs-gdpPercap-scatter3 +plot of chunk lifeExp-vs-gdpPercap-scatter3 Currently it's hard to see the relationship between the points due to some strong -outliers in GDP per capita. We can change the scale of units on the y axis using -the *scale* functions. These control the mapping between the data values and +outliers in GDP per capita. We can change the scale of units on the y axis using +the *scale* functions. These control the mapping between the data values and visual values of an aesthetic. 
@@ -224,16 +194,16 @@ ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) + geom_point() + scale_y_log10() ~~~ -plot of chunk axis-scale +plot of chunk axis-scale The `log10` function applied a transformation to the values of the gdpPercap column before rendering them on the plot, so that each multiple of 10 now only corresponds to an increase in 1 on the transformed scale, e.g. a GDP per capita -of 1,000 is now 3 on the y axis, a value of 10,000 corresponds to 4 on the y -axis and so on. This makes it easier to visualise the spread of data on the +of 1,000 is now 3 on the y axis, a value of 10,000 corresponds to 4 on the y +axis and so on. This makes it easier to visualise the spread of data on the y-axis. -We can fit a simple relationship to the data by adding another layer, +We can fit a simple relationship to the data by adding another layer, `geom_smooth`: @@ -242,9 +212,9 @@ ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) + geom_point() + scale_y_log10() + geom_smooth(method="lm") ~~~ -plot of chunk lm-fit +plot of chunk lm-fit -We can make the line thicker by *setting* the **size** aesthetic in the +We can make the line thicker by *setting* the **size** aesthetic in the `geom_smooth` layer: @@ -253,12 +223,12 @@ ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) + geom_point() + scale_y_log10() + geom_smooth(method="lm", size=1.5) ~~~ -plot of chunk lm-fit2 +plot of chunk lm-fit2 -There are two ways an *aesthetic* can be specified. Here we *set* the **size** -aesthetic by passing it as an argument to `geom_smooth`. Previously in the -lesson we've used the `aes` function to define a *mapping* between data -variables and their visual representation. +There are two ways an *aesthetic* can be specified. Here we *set* the **size** +aesthetic by passing it as an argument to `geom_smooth`. Previously in the +lesson we've used the `aes` function to define a *mapping* between data +variables and their visual representation. 
> #### Challenge 4 {.challenge} > @@ -268,20 +238,11 @@ variables and their visual representation. > Hint: do not use the `aes` function. > - -~~~{.r} -ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) + - geom_point(size=3, color="orange") + scale_y_log10() + - geom_smooth(method="lm", size=1.5) -~~~ - -plot of chunk setting - ### Multi-panel figures -Earlier we visualised the change in life expectancy over time across all +Earlier we visualised the change in life expectancy over time across all countries in one plot. Alternatively, we can split this out over multiple panels -by adding a layer of **facet** panels: +by adding a layer of **facet** panels: ~~~{.r} @@ -289,40 +250,40 @@ ggplot(data = gapminder, aes(x = year, y = lifeExp, color=continent)) + geom_line() + facet_wrap( ~ country) ~~~ -plot of chunk facet +plot of chunk facet -The `facet_wrap` layer took a "formula" as its argument, denoted by the tilda +The `facet_wrap` layer took a "formula" as its argument, denoted by the tilde (~). This tells R to draw a panel for each unique value in the country column of the gapminder dataset. ### Modifying text -To clean this figure up for a publication we need to change some of the text -elements. The x-axis is way too cluttered, and the y axis should read +To clean this figure up for a publication we need to change some of the text +elements. The x-axis is way too cluttered, and the y axis should read "Life expectancy", rather than the column name in the data frame. -We can do this by adding a couple of different layers. The **theme** layer -controls the axis text, and overall text size, and there are special layers +We can do this by adding a couple of different layers. The **theme** layer +controls the axis text, and overall text size, and there are special layers for changing the axis labels. To change the legend title, we need to use the **scales** layer. 
~~~{.r} ggplot(data = gapminder, aes(x = year, y = lifeExp, color=continent)) + - geom_line() + facet_wrap( ~ country) + + geom_line() + facet_wrap( ~ country) + xlab("Year") + ylab("Life expectancy") + ggtitle("Figure 1") + - scale_fill_discrete(name="Continent") + + scale_fill_discrete(name="Continent") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank()) ~~~ -plot of chunk theme +plot of chunk theme -This is just a taste of what you can do with `ggplot2`. RStudio provides a +This is just a taste of what you can do with `ggplot2`. RStudio provides a really useful [cheat sheet][cheat] of the different layers available, and more extensive documentation is available on the [ggplot2 website][ggplot-doc]. Finally, if you have no idea how to change something, a quick google search will -usually send you to a relevant question and answer on stackoverflow with reusable +usually send you to a relevant question and answer on Stack Overflow with reusable code to modify! [cheat]: http://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf @@ -332,17 +293,91 @@ code to modify! > #### Challenge 5 {.challenge} > > Create a density plot of GDP per capita, filled by continent. -> +> > Advanced: -> - Transform the x axis to better visualise the data spread. -> - Add a facet layer to panel the density plots by year. +> - Transform the x axis to better visualise the data spread. +> - Add a facet layer to panel the density plots by year. 
> +## Challenge solutions -~~~{.r} -ggplot(data = gapminder, aes(x = gdpPercap, fill=continent)) + - geom_density(alpha=0.6) + facet_wrap( ~ year) + scale_x_log10() -~~~ +> #### Solution to challenge 1 {.challenge} +> +> Modify the example so that the figure visualise how life expectancy has +> changed over time: +> +> +> ~~~{.r} +> ggplot(data = gapminder, aes(x = year, y = lifeExp)) + geom_point() +> ~~~ +> +> plot of chunk ch1-sol +> + +> #### Solution to challenge 2 {.challenge} +> +> In the previous examples and challenge we've used the `aes` function to tell +> the scatterplot **geom** about the **x** and **y** locations of each point. +> Another *aesthetic* property we can modify is the point *color*. Modify the +> code from the previous challenge to **color** the points by the "continent" +> column. What trends do you see in the data? Are they what you expected? +> +> +> ~~~{.r} +> ggplot(data = gapminder, aes(x = year, y = lifeExp, color=continent)) + +> geom_point() +> ~~~ +> +> plot of chunk ch2-sol +> + +> #### Solution to challenge 3 {.challenge} +> +> Switch the order of the point and line layers from the previous example. What +> happened? +> +> +> ~~~{.r} +> ggplot(data = gapminder, aes(x=year, y=lifeExp, by=country)) + +> geom_point() + geom_line(aes(color=continent)) +> ~~~ +> +> plot of chunk ch3-sol +> +> The lines now get drawn over the points! +> -plot of chunk density +> #### Solution to challenge 4 {.challenge} +> +> Modify the color and size of the points on the point layer in the previous +> example. +> +> Hint: do not use the `aes` function. +> +> +> ~~~{.r} +> ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) + +> geom_point(size=3, color="orange") + scale_y_log10() + +> geom_smooth(method="lm", size=1.5) +> ~~~ +> +> plot of chunk ch4-sol +> + +> #### Solution to challenge 5 {.challenge} +> +> Create a density plot of GDP per capita, filled by continent. 
+> +> Advanced: +> - Transform the x axis to better visualise the data spread. +> - Add a facet layer to panel the density plots by year. +> +> +> ~~~{.r} +> ggplot(data = gapminder, aes(x = gdpPercap, fill=continent)) + +> geom_density(alpha=0.6) + facet_wrap( ~ year) + scale_x_log10() +> ~~~ +> +> plot of chunk ch5-sol +> diff --git a/09-vectorisation.Rmd b/09-vectorisation.Rmd index df663ec68..36394db2b 100644 --- a/09-vectorisation.Rmd +++ b/09-vectorisation.Rmd @@ -7,11 +7,13 @@ minutes: 30 ```{r, include=FALSE} source("tools/chunk-options.R") +opts_chunk$set(fig.path = "fig/09-vectorisation-") # Silently load in the data so the rest of the lesson works gapminder <- read.csv("data/gapminder-FiveYearData.csv", header=TRUE) +library(ggplot2) ``` -> ## Learning Objectives {.objectives} +> ## Learning objectives {.objectives} > > * To understand vectorised operations in R. > @@ -108,17 +110,10 @@ m * -1 > > Given the following matrix: > -> ~~~ {.r} +> ```{r} > m <- matrix(1:12, nrow=3, ncol=4) > m -> ~~~ -> -> ~~~ {.output} -> ## [,1] [,2] [,3] [,4] -> ## [1,] 1 4 7 10 -> ## [2,] 2 5 8 11 -> ## [3,] 3 6 9 12 -> ~~~ +> ``` > > Write down what you think will happen when you run: > @@ -134,9 +129,9 @@ m * -1 > We're interested in looking at the sum of the > following sequence of fractions: > -> ~~~ {.output} +> ```{r, eval=FALSE} > x = 1/(1^2) + 1/(2^2) + 1/(3^2) + ... + 1/(n^2) -> ~~~ +> ``` > > This would be tedious to type out, and impossible for > high values of n. @@ -145,7 +140,9 @@ m * -1 > -> #### Solution to Challenge 1 {.challenge} +## Challenge solutions + +> #### Solution to challenge 1 {.challenge} > > Let's try this on the `pop` column of the `gapminder` dataset. > @@ -154,87 +151,74 @@ m * -1 > Check the head or tail of the data frame to make sure > it worked. 
> -> ~~~ {.r} +> ```{r} > gapminder$pop_millions <- gapminder$pop / 1e6 > head(gapminder) -> ~~~ +> ``` > -> #### Solution to Challenge 2 {.challenge} +> #### Solution to challenge 2 {.challenge} > > Refresh your ggplot skills by plotting population in millions against year. > -> ~~~ {.r} +> ```{r ch2-sol} > ggplot(gapminder, aes(x = year, y = pop_millions)) + geom_point() -> ~~~ +> ``` > -> #### Solution to Challenge 3 {.challenge} +> #### Solution to challenge 3 {.challenge} > > Given the following matrix: > -> ~~~ {.r} +> ```{r} > m <- matrix(1:12, nrow=3, ncol=4) > m -> ~~~ +> ``` > -> ~~~ {.output} -> ## [,1] [,2] [,3] [,4] -> ## [1,] 1 4 7 10 -> ## [2,] 2 5 8 11 -> ## [3,] 3 6 9 12 -> ~~~ > > Write down what you think will happen when you run: > > 1. `m ^ -1` > -> ~~~ {.output} -> ## [,1] [,2] [,3] [,4] -> ## [1,] 1.0000000 0.2500000 0.1428571 0.10000000 -> ## [2,] 0.5000000 0.2000000 0.1250000 0.09090909 -> ## [3,] 0.3333333 0.1666667 0.1111111 0.08333333 -> ~~~ +> ```{r, echo=FALSE} +> m ^ -1 +> ``` > > 2. `m * c(1, 0, -1)` > -> ~~~ {.output} -> ## [,1] [,2] [,3] [,4] -> ## [1,] 1 4 7 10 -> ## [2,] 0 0 0 0 -> ## [3,] -3 -6 -9 -12 -> ~~~ +> ```{r, echo=FALSE} +> m * c(1, 0, -1) +> ``` > > 3. `m > c(0, 20)` > -> ~~~ {.output} -> ## [,1] [,2] [,3] [,4] -> ## [1,] TRUE FALSE TRUE FALSE -> ## [2,] FALSE TRUE FALSE TRUE -> ## [3,] TRUE FALSE TRUE FALSE -> ~~~ +> ```{r, echo=FALSE} +> m > c(0, 20) +> ``` > - > #### Bonus Challenge {.challenge} > > We're interested in looking at the sum of the > following sequence of fractions: > -> ~~~ {.output} +> ```{r, eval=FALSE} > x = 1/(1^2) + 1/(2^2) + 1/(3^2) + ... + 1/(n^2) -> ~~~ +> ``` > > This would be tedious to type out, and impossible for > high values of n. > Can you use vectorisation to solve for x, when n=100? > How about when n=10,000? 
> -> ~~~ {.r} -> n <- 1:100 -> y <- 1/(n^2) -> x <- sum(y) -> -> n <- 1:10000 -> ~~~ +> ```{r} +> inverse_sum_of_squares <- function(n) { +> sequence <- 1:n +> y <- 1/(sequence^2) +> result <- sum(y) +> return(result) +> } +> inverse_sum_of_squares(100) +> inverse_sum_of_squares(10000) +> ``` > diff --git a/09-vectorisation.md b/09-vectorisation.md index c08a30ca2..b25d19a3b 100644 --- a/09-vectorisation.md +++ b/09-vectorisation.md @@ -2,18 +2,18 @@ layout: page title: R for reproducible scientific analysis subtitle: Vectorisation -minutes: 15 +minutes: 30 --- -> ## Learning Objectives {.objectives} +> ## Learning objectives {.objectives} > > * To understand vectorised operations in R. > -One of the nice features of R is that most of its functions are vectorized, -that is the function will operate on all elements of a vector without +One of the nice features of R is that most of its functions are vectorised, +that is the function will operate on all elements of a vector without needing to loop through and act on each element one at a time. This makes writing code more concise, easy to read, and less error prone. @@ -62,17 +62,17 @@ y: 6 7 8 9 > #### Challenge 1 {.challenge} > -> Let's try this on the `pop` column of the `gapminder` dataset. +> Let's try this on the `pop` column of the `gapminder` dataset. > -> Make a new column in the `gapminder` dataframe that +> Make a new column in the `gapminder` data frame that > contains population in units of millions of people. -> Check the head or tail of the dataframe to make sure +> Check the head or tail of the data frame to make sure > it worked. > > #### Challenge 2 {.challenge} -> -> Refresh your ggplot skils by plotting population in millions against year. +> +> Refresh your ggplot skills by plotting population in millions against year. 
> Comparison operators also apply element-wise, as we saw in the @@ -122,7 +122,7 @@ log(x) ~~~{.output} -[1] 0.0000 0.6931 1.0986 1.3863 +[1] 0.0000000 0.6931472 1.0986123 1.3862944 ~~~ @@ -153,19 +153,23 @@ m * -1 > guide](http://www.statmethods.net/advstats/matrix.html) > #### Challenge 3 {.challenge} -> +> > Given the following matrix: +> > -> ~~~ {.r} +> ~~~{.r} > m <- matrix(1:12, nrow=3, ncol=4) > m > ~~~ -> -> ~~~ {.output} -> ## [,1] [,2] [,3] [,4] -> ## [1,] 1 4 7 10 -> ## [2,] 2 5 8 11 -> ## [3,] 3 6 9 12 +> +> +> +> ~~~{.output} +> [,1] [,2] [,3] [,4] +> [1,] 1 4 7 10 +> [2,] 2 5 8 11 +> [3,] 3 6 9 12 +> > ~~~ > > Write down what you think will happen when you run: @@ -178,11 +182,12 @@ m * -1 > > #### Bonus Challenge {.challenge} -> +> > We're interested in looking at the sum of the > following sequence of fractions: > -> ~~~ {.output} +> +> ~~~{.r} > x = 1/(1^2) + 1/(2^2) + 1/(3^2) + ... + 1/(n^2) > ~~~ > @@ -192,3 +197,150 @@ m * -1 > How about when n=10,000? > + +## Challenge solutions + +> #### Solution to challenge 1 {.challenge} +> +> Let's try this on the `pop` column of the `gapminder` dataset. +> +> Make a new column in the `gapminder` data frame that +> contains population in units of millions of people. +> Check the head or tail of the data frame to make sure +> it worked. 
+> +> +> ~~~{.r} +> gapminder$pop_millions <- gapminder$pop / 1e6 +> head(gapminder) +> ~~~ +> +> +> +> ~~~{.output} +> country year pop continent lifeExp gdpPercap pop_millions +> 1 Afghanistan 1952 8425333 Asia 28.801 779.4453 8.425333 +> 2 Afghanistan 1957 9240934 Asia 30.332 820.8530 9.240934 +> 3 Afghanistan 1962 10267083 Asia 31.997 853.1007 10.267083 +> 4 Afghanistan 1967 11537966 Asia 34.020 836.1971 11.537966 +> 5 Afghanistan 1972 13079460 Asia 36.088 739.9811 13.079460 +> 6 Afghanistan 1977 14880372 Asia 38.438 786.1134 14.880372 +> +> ~~~ +> + +> #### Solution to challenge 2 {.challenge} +> +> Refresh your ggplot skills by plotting population in millions against year. +> +> +> ~~~{.r} +> ggplot(gapminder, aes(x = year, y = pop_millions)) + geom_point() +> ~~~ +> +> plot of chunk ch2-sol +> + +> #### Solution to challenge 3 {.challenge} +> +> Given the following matrix: +> +> +> ~~~{.r} +> m <- matrix(1:12, nrow=3, ncol=4) +> m +> ~~~ +> +> +> +> ~~~{.output} +> [,1] [,2] [,3] [,4] +> [1,] 1 4 7 10 +> [2,] 2 5 8 11 +> [3,] 3 6 9 12 +> +> ~~~ +> +> +> Write down what you think will happen when you run: +> +> 1. `m ^ -1` +> +> +> ~~~{.output} +> [,1] [,2] [,3] [,4] +> [1,] 1.0000000 0.2500000 0.1428571 0.10000000 +> [2,] 0.5000000 0.2000000 0.1250000 0.09090909 +> [3,] 0.3333333 0.1666667 0.1111111 0.08333333 +> +> ~~~ +> +> 2. `m * c(1, 0, -1)` +> +> +> ~~~{.output} +> [,1] [,2] [,3] [,4] +> [1,] 1 4 7 10 +> [2,] 0 0 0 0 +> [3,] -3 -6 -9 -12 +> +> ~~~ +> +> 3. `m > c(0, 20)` +> +> +> ~~~{.output} +> [,1] [,2] [,3] [,4] +> [1,] TRUE FALSE TRUE FALSE +> [2,] FALSE TRUE FALSE TRUE +> [3,] TRUE FALSE TRUE FALSE +> +> ~~~ +> + +> #### Bonus Challenge {.challenge} +> +> We're interested in looking at the sum of the +> following sequence of fractions: +> +> +> ~~~{.r} +> x = 1/(1^2) + 1/(2^2) + 1/(3^2) + ... + 1/(n^2) +> ~~~ +> +> This would be tedious to type out, and impossible for +> high values of n. +> Can you use vectorisation to solve for x, when n=100? 
+> How about when n=10,000? +> +> +> ~~~{.r} +> inverse_sum_of_squares <- function(n) { +> sequence <- 1:n +> y <- 1/(sequence^2) +> result <- sum(y) +> return(result) +> } +> inverse_sum_of_squares(100) +> ~~~ +> +> +> +> ~~~{.output} +> [1] 1.634984 +> +> ~~~ +> +> +> +> ~~~{.r} +> inverse_sum_of_squares(10000) +> ~~~ +> +> +> +> ~~~{.output} +> [1] 1.644834 +> +> ~~~ +> diff --git a/10-control-flow.Rmd b/10-control-flow.Rmd index b0f4e868e..e1f37a886 100644 --- a/10-control-flow.Rmd +++ b/10-control-flow.Rmd @@ -13,7 +13,7 @@ gapminder <- read.csv("data/gapminder-FiveYearData.csv", header=TRUE) set.seed(10) ``` -> ## Learning Objectives {.objectives} +> ## Learning objectives {.objectives} > > * Write conditional statements with `if` and `else`. > * Write and understand `while` and `for` loops. diff --git a/10-control-flow.md b/10-control-flow.md index e1d432a0d..761ed9425 100644 --- a/10-control-flow.md +++ b/10-control-flow.md @@ -2,23 +2,23 @@ layout: page title: R for reproducible scientific analysis subtitle: Control flow -minutes: 30 +minutes: 45 --- -> ## Learning Objectives {.objectives} +> ## Learning objectives {.objectives} > > * Write conditional statements with `if` and `else`. -> * Write and understand `while` and `for` loops. +> * Write and understand `while` and `for` loops. > -Often when we're coding we want to control the flow of our actions. This can be done -by setting actions to occur only if a condition or a set of conditions are met. +Often when we're coding we want to control the flow of our actions. This can be done +by setting actions to occur only if a condition or a set of conditions are met. Alternatively, we can also set an action to occur a particular number of times. -There are several ways you can control flow in R. -For conditional statements, the most commonly used approaches are the constructs: +There are several ways you can control flow in R. 
+For conditional statements, the most commonly used approaches are the constructs: ~~~{.r} @@ -47,7 +47,7 @@ x <- rpois(1, lambda=8) if (x >= 10) { print("x is greater than or equal to 10") } - + x ~~~ @@ -59,7 +59,7 @@ x ~~~ Note you may not get the same output as your neighbour because -you may be sampling different random numbers from the same distribution. +you may be sampling different random numbers from the same distribution. Let's set a seed so that we all generate the same 'pseudo-random' number, and then print more information: @@ -88,9 +88,9 @@ if (x >= 10) { > > In the above case, the function `rpois` generates a random number following a > Poisson distribution with a mean (i.e. lambda) of 8. The function `set.seed` -> guarantees that all machines will generate the exact same 'pseudo-random' +> guarantees that all machines will generate the exact same 'pseudo-random' > number ([more about pseudo-random numbers](http://en.wikibooks.org/wiki/R_Programming/Random_Number_Generation)). -> So if we `set.seed(10)`, we see that `x` takes the value 8. You should get the +> So if we `set.seed(10)`, we see that `x` takes the value 8. You should get the > exact same number. > @@ -106,7 +106,7 @@ if (x) { } ~~~ -As we can see, the message was not printed because the vector x is `FALSE` +As we can see, the message was not printed because the vector x is `FALSE` ~~~{.r} @@ -122,9 +122,9 @@ x ~~~ > #### Challenge 1 {.challenge} -> +> > Use an `if` statement to print a suitable message -> reporting whether there are any records from 2002 in +> reporting whether there are any records from 2002 in > the `gapminder` dataset. > Now do the same for 2012. > @@ -133,8 +133,8 @@ Did anyone get a warning message like this? ~~~{.output} -Warning: the condition has length > 1 and only the first element will be -used +Warning in if (gapminder$year == 2012) {: the condition has length > 1 and +only the first element will be used ~~~ @@ -144,28 +144,28 @@ element. 
Here you need to make sure your condition is of length 1. > #### Tip: `any` and `all` {.callout} > The `any` function will return TRUE if at least one -> TRUE value is found within a vector, otherwise it will return `FALSE`. +> TRUE value is found within a vector, otherwise it will return `FALSE`. > This can be used in a similar way to the `%in%` operator. > The function `all`, as the name suggests, will only return `TRUE` if all values in -> the vector are `TRUE`. +> the vector are `TRUE`. > ### Repeating operations Sometimes you will find yourself needing to repeat an operation until a certain -condition is met. You can do this with a `while` loop. +condition is met. You can do this with a `while` loop. ~~~{.r} while(this condition is true){ do a thing -} +} ~~~ Let's try an example, shall we? We'll try to come up with some simple code that generates random numbers from a uniform distribution (the `runif` function) -between 0 and 1 until it gets one that's less than 0.1. +between 0 and 1 until it gets one that's less than 0.1. ~~~{.r} @@ -201,14 +201,14 @@ while(z > 0.1){ ~~~{.output} -[1] 0.4269 -[1] 0.6931 -[1] 0.08514 +[1] 0.4269077 +[1] 0.6931021 +[1] 0.08513597 ~~~ > #### Challenge 2 {.challenge} -> +> > Use a `while` loop to construct a vector called 'pet_list' > with the value: > 'cat', 'dog', 'dog', 'dog', 'dog' @@ -217,11 +217,11 @@ while(z > 0.1){ > `while` loops will not always be appropriate. If you want to iterate over -a set of values, when the order of iteration is important, and perform the -same operation on each, a `for` loop will do the job. -We saw `for` loops in the shell lessons earlier. This is the most -flexible of looping operations, but therefore also the hardest to use -correctly. Avoid using `for` loops unless the order of iteration is important: +a set of values, when the order of iteration is important, and perform the +same operation on each, a `for` loop will do the job. +We saw `for` loops in the shell lessons earlier. 
This is the most +flexible of looping operations, but therefore also the hardest to use +correctly. Avoid using `for` loops unless the order of iteration is important: i.e. the calculation at each iteration depends on the results of previous iterations. The basic structure of a `for` loop is: @@ -308,7 +308,7 @@ Rather than printing the results, we could write the loop output to a new object ~~~{.r} -output_vector <- c() +output_vector <- c() for (i in 1:5){ for(j in c('a', 'b', 'c', 'd', 'e')){ temp_output <- paste(i, j) @@ -373,7 +373,7 @@ output_vector2 > #### Challenge 3 {.challenge} > -> Compare the objects output_vector and +> Compare the objects output_vector and > output_vector2. Are they the same? If not, why not? > How would you change the last block of code to make output_vector2 > the same as output_vector? @@ -388,7 +388,7 @@ output_vector2 > #### Challenge 5 {.challenge} > -> Modify the script from Challenge 4 to also loop over each +> Modify the script from Challenge 4 to also loop over each > country. This time print out whether the life expectancy is > smaller than 50, between 50 and 70, or greater than 70. > @@ -396,7 +396,6 @@ output_vector2 > #### Challenge 6 - Advanced {.challenge} > > Write a script that loops over each country in the `gapminder` dataset, -> tests whether the country starts with a 'B', and graphs life expectancy -> against time as a line graph if the mean life expectancy is under 50 years. +> tests whether the country starts with a 'B', and graphs life expectancy +> against time as a line graph if the mean life expectancy is under 50 years. 
> - diff --git a/11-writing-data.Rmd b/11-writing-data.Rmd index e1f1c27f6..c324a7c33 100644 --- a/11-writing-data.Rmd +++ b/11-writing-data.Rmd @@ -16,7 +16,7 @@ gapminder <- read.csv("data/gapminder-FiveYearData.csv", header=TRUE) dir.create("cleaned-data") ``` -> ## Learning Objectives {.objectives} +> ## Learning objectives {.objectives} > > * To be able to write out plots and data from R > @@ -26,7 +26,7 @@ dir.create("cleaned-data") You have already seen how to save the most recent plot you create in `ggplot2`, using the command `ggsave`. As a refresher: -```{rm, eval=FALSE} +```{r, eval=FALSE} ggsave("My_most_recent_plot.pdf") ``` diff --git a/11-writing-data.md b/11-writing-data.md index 6e6447308..1ed5632c6 100644 --- a/11-writing-data.md +++ b/11-writing-data.md @@ -2,12 +2,12 @@ layout: page title: R for reproducible scientific analysis subtitle: Writing data -minutes: 15 +minutes: 20 --- -> ## Learning Objectives {.objectives} +> ## Learning objectives {.objectives} > > * To be able to write out plots and data from R > @@ -22,16 +22,16 @@ using the command `ggsave`. As a refresher: ggsave("My_most_recent_plot.pdf") ~~~ -You can save a plot from within RStudio using the 'Export' button -in the 'Plot' window. This will give you the option of saving as a -.pdf or as .png, .jpg or other image formats. +You can save a plot from within RStudio using the 'Export' button +in the 'Plot' window. This will give you the option of saving as a +.pdf or as .png, .jpg or other image formats. Sometimes you will want to save plots without creating them in the 'Plot' window first. Perhaps you want to make a pdf document with -multiple pages: each one a different plot, for example. Or perhaps -you're looping through multiple subsets of a file, plotting data from -each subset, and you want to save each plot, but obviously can't stop -the loop to click 'Export' for each one. +multiple pages: each one a different plot, for example. 
Or perhaps +you're looping through multiple subsets of a file, plotting data from +each subset, and you want to save each plot, but obviously can't stop +the loop to click 'Export' for each one. In this case you can use a more flexible approach. The function `pdf` creates a new pdf device. You can control the size and resolution @@ -48,7 +48,7 @@ ggplot(data=gapminder, aes(x=year, y=lifeExp, colour=country)) + dev.off() ~~~ -Open up this document and have a look. +Open up this document and have a look. > #### Challenge 1 {.challenge} > @@ -59,7 +59,7 @@ Open up this document and have a look. The commands `jpeg`, `png` etc. are used similarly to produce -documents in different formats. +documents in different formats. ### Writing data @@ -82,7 +82,7 @@ write.table(aust_subset, ~~~ Let's switch back to the shell to take a look at the data to make sure it looks -ok: +OK: ~~~{.r} @@ -107,7 +107,7 @@ head cleaned-data/gapminder-aus.csv ~~~ Hmm, that's not quite what we wanted. Where did all these -quotation marks come from? Also the row numbers are +quotation marks come from? Also the row numbers are meaningless. Let's look at the help file to work out how to change this @@ -163,9 +163,9 @@ That looks better! > > Write a data-cleaning script file that subsets the gapminder > data to include only data points collected since 1990. -> +> > Use this script to write out the new subset to a file -> in the `cleaned-data/` directory. -> +> in the `cleaned-data/` directory. +> diff --git a/12-plyr.Rmd b/12-plyr.Rmd index ec55469c4..9dc4136f7 100644 --- a/12-plyr.Rmd +++ b/12-plyr.Rmd @@ -12,7 +12,7 @@ opts_chunk$set(fig.path = "fig/12-plyr-") gapminder <- read.csv("data/gapminder-FiveYearData.csv", header=TRUE) ``` -> ## Learning Objectives {.objectives} +> ## Learning objectives {.objectives} > > * To be able to use the split-apply-combine strategy for data analysis > @@ -227,7 +227,7 @@ d_ply( > life expectancy per continent: > > 1. 
-> ~~~ {.r} +> ```{r, eval=FALSE} > ddply( > .data = gapminder, > .variables = gapminder$continent, @@ -235,19 +235,19 @@ d_ply( > mean(dataGroup$lifeExp) > } > ) -> ~~~ +> ``` > > 2. -> ~~~ {.r} +> ```{r, eval=FALSE} > ddply( > .data = gapminder, > .variables = "continent", > .fun = mean(dataGroup$lifeExp) > ) -> ~~~ +> ``` > > 3. -> ~~~ {.r} +> ```{r, eval=FALSE} > ddply( > .data = gapminder, > .variables = "continent", @@ -255,10 +255,10 @@ d_ply( > mean(dataGroup$lifeExp) > } > ) -> ~~~ +> ``` > > 4. -> ~~~ {.r} +> ```{r, eval=FALSE} > adply( > .data = gapminder, > .variables = "continent", @@ -266,5 +266,5 @@ d_ply( > mean(dataGroup$lifeExp) > } > ) -> ~~~ +> ``` > diff --git a/12-plyr.md b/12-plyr.md index 2fc1edee7..a377c2906 100644 --- a/12-plyr.md +++ b/12-plyr.md @@ -2,12 +2,12 @@ layout: page title: R for reproducible scientific analysis subtitle: Split-apply-combine -minutes: 15 +minutes: 45 --- -> ## Learning Objectives {.objectives} +> ## Learning objectives {.objectives} > > * To be able to use the split-apply-combine strategy for data analysis > @@ -23,14 +23,14 @@ additional arguments so we could filter by `year` and `country`: # with the GDP per capita column. 
calcGDP <- function(dat, year=NULL, country=NULL) { if(!is.null(year)) { - dat <- dat[dat$year %in% year, ] + dat <- dat[dat$year %in% year, ] } if (!is.null(country)) { dat <- dat[dat$country %in% country,] } gdp <- dat$pop * dat$gdpPercap - new <- cbind(dat, gdp=gdp) + new <- cbind(dat, gdp=gdp) return(new) } ~~~ @@ -45,46 +45,46 @@ We could run `calcGPD` and then take the mean of each continent: ~~~{.r} withGDP <- calcGDP(gapminder) -mean(withGDP[withGDP$continent == "Africa", "gdp"]) +mean(withGDP[withGDP$continent == "Africa", "gdp"]) ~~~ ~~~{.output} -[1] 2.09e+10 +[1] 20904782844 ~~~ ~~~{.r} -mean(withGDP[withGDP$continent == "Americas", "gdp"]) +mean(withGDP[withGDP$continent == "Americas", "gdp"]) ~~~ ~~~{.output} -[1] 3.793e+11 +[1] 379262350210 ~~~ ~~~{.r} -mean(withGDP[withGDP$continent == "Asia", "gdp"]) +mean(withGDP[withGDP$continent == "Asia", "gdp"]) ~~~ ~~~{.output} -[1] 2.272e+11 +[1] 227233738153 ~~~ But this isn't very *nice*. Yes, by using a function, you have reduced a substantial amount of repetition. That **is** nice. But there is still repetition. Repeating yourself will cost you time, both now and later, and -potentially introduce some nasty bugs. +potentially introduce some nasty bugs. We could write a new function that is flexible like `calcGDP`, but this also takes a substantial amount of effort and testing to get right. @@ -94,15 +94,15 @@ The abstract problem we're encountering here is know as "split-apply-combine": ![Split apply combine](fig/splitapply.png) We want to *split* our data into groups, in this case continents, *apply* -some calculations on that group, then optionally *combine* the results +some calculations on that group, then optionally *combine* the results together afterwards. #### The `plyr` package -For those of you who have used R before, you might be familiar with the +For those of you who have used R before, you might be familiar with the `apply` family of functions. 
While R's built in functions do work, we're going to introduce you to another method for solving the "split-apply-combine" -probelm. The [plyr](http://had.co.nz/plyr/) package provides a set of +problem. The [plyr](http://had.co.nz/plyr/) package provides a set of functions that we find more user friendly for solving this problem. We installed this package in an earlier challenge. Let's load it now: @@ -163,12 +163,12 @@ ddply( ~~~{.output} - continent V1 -1 Africa 2.090e+10 -2 Americas 3.793e+11 -3 Asia 2.272e+11 -4 Europe 2.694e+11 -5 Oceania 1.882e+11 + continent V1 +1 Africa 20904782844 +2 Americas 379262350210 +3 Asia 227233738153 +4 Europe 269442085301 +5 Oceania 188187105354 ~~~ @@ -183,11 +183,11 @@ returns another `data.frame` (2nd letter is a **d**) i column. Note that we just gave the name of the column, not the actual column itself like we've done previously with subsetting. Plyr takes care of these implementation details for you. -- The third argument is the function we want to apply to each grouping of the +- The third argument is the function we want to apply to each grouping of the data. We had to define our own short function here: each subset of the data gets stored in `x`, the first argument of our function. This is an anonymous function: we haven't defined it elsewhere, and it has no name. It only exists - in the scope of our call to `ddply`. + in the scope of our call to `ddply`. 

 What if we want a different type of output data structure?:

@@ -204,19 +204,19 @@ dlply(

 ~~~{.output}
 $Africa
-[1] 2.09e+10
+[1] 20904782844

 $Americas
-[1] 3.793e+11
+[1] 379262350210

 $Asia
-[1] 2.272e+11
+[1] 227233738153

 $Europe
-[1] 2.694e+11
+[1] 269442085301

 $Oceania
-[1] 1.882e+11
+[1] 188187105354

 attr(,"split_type")
 [1] "data.frame"
@@ -247,67 +247,67 @@ ddply(


 ~~~{.output}
-   continent year        V1
-1     Africa 1952 5.992e+09
-2     Africa 1957 7.359e+09
-3     Africa 1962 8.785e+09
-4     Africa 1967 1.144e+10
-5     Africa 1972 1.507e+10
-6     Africa 1977 1.869e+10
-7     Africa 1982 2.204e+10
-8     Africa 1987 2.411e+10
-9     Africa 1992 2.626e+10
-10    Africa 1997 3.002e+10
-11    Africa 2002 3.530e+10
-12    Africa 2007 4.578e+10
-13  Americas 1952 1.177e+11
-14  Americas 1957 1.408e+11
-15  Americas 1962 1.692e+11
-16  Americas 1967 2.179e+11
-17  Americas 1972 2.682e+11
-18  Americas 1977 3.241e+11
-19  Americas 1982 3.633e+11
-20  Americas 1987 4.394e+11
-21  Americas 1992 4.899e+11
-22  Americas 1997 5.827e+11
-23  Americas 2002 6.612e+11
-24  Americas 2007 7.767e+11
-25      Asia 1952 3.410e+10
-26      Asia 1957 4.727e+10
-27      Asia 1962 6.014e+10
-28      Asia 1967 8.465e+10
-29      Asia 1972 1.244e+11
-30      Asia 1977 1.598e+11
-31      Asia 1982 1.944e+11
-32      Asia 1987 2.418e+11
-33      Asia 1992 3.071e+11
-34      Asia 1997 3.876e+11
-35      Asia 2002 4.580e+11
-36      Asia 2007 6.275e+11
-37    Europe 1952 8.497e+10
-38    Europe 1957 1.100e+11
-39    Europe 1962 1.390e+11
-40    Europe 1967 1.734e+11
-41    Europe 1972 2.187e+11
-42    Europe 1977 2.554e+11
-43    Europe 1982 2.795e+11
-44    Europe 1987 3.165e+11
-45    Europe 1992 3.427e+11
-46    Europe 1997 3.836e+11
-47    Europe 2002 4.364e+11
-48    Europe 2007 4.932e+11
-49   Oceania 1952 5.416e+10
-50   Oceania 1957 6.683e+10
-51   Oceania 1962 8.234e+10
-52   Oceania 1967 1.060e+11
-53   Oceania 1972 1.341e+11
-54   Oceania 1977 1.547e+11
-55   Oceania 1982 1.762e+11
-56   Oceania 1987 2.095e+11
-57   Oceania 1992 2.363e+11
-58   Oceania 1997 2.893e+11
-59   Oceania 2002 3.452e+11
-60   Oceania 2007 4.037e+11
+   continent year           V1
+1     Africa 1952   5992294608
+2     Africa 1957   7359188796
+3     Africa 1962   8784876958
+4     Africa 1967  11443994101
+5     Africa 1972  15072241974
+6     Africa 1977  18694898732
+7     Africa 1982  22040401045
+8     Africa 1987  24107264108
+9     Africa 1992  26256977719
+10    Africa 1997  30023173824
+11    Africa 2002  35303511424
+12    Africa 2007  45778570846
+13  Americas 1952 117738997171
+14  Americas 1957 140817061264
+15  Americas 1962 169153069442
+16  Americas 1967 217867530844
+17  Americas 1972 268159178814
+18  Americas 1977 324085389022
+19  Americas 1982 363314008350
+20  Americas 1987 439447790357
+21  Americas 1992 489899820623
+22  Americas 1997 582693307146
+23  Americas 2002 661248623419
+24  Americas 2007 776723426068
+25      Asia 1952  34095762661
+26      Asia 1957  47267432088
+27      Asia 1962  60136869012
+28      Asia 1967  84648519224
+29      Asia 1972 124385747313
+30      Asia 1977 159802590186
+31      Asia 1982 194429049919
+32      Asia 1987 241784763369
+33      Asia 1992 307100497486
+34      Asia 1997 387597655323
+35      Asia 2002 458042336179
+36      Asia 2007 627513635079
+37    Europe 1952  84971341466
+38    Europe 1957 109989505140
+39    Europe 1962 138984693095
+40    Europe 1967 173366641137
+41    Europe 1972 218691462733
+42    Europe 1977 255367522034
+43    Europe 1982 279484077072
+44    Europe 1987 316507473546
+45    Europe 1992 342703247405
+46    Europe 1997 383606933833
+47    Europe 2002 436448815097
+48    Europe 2007 493183311052
+49   Oceania 1952  54157223944
+50   Oceania 1957  66826828013
+51   Oceania 1962  82336453245
+52   Oceania 1967 105958863585
+53   Oceania 1972 134112109227
+54   Oceania 1977 154707711162
+55   Oceania 1982 176177151380
+56   Oceania 1987 209451563998
+57   Oceania 1992 236319179826
+58   Oceania 1997 289304255183
+59   Oceania 2002 345236880176
+60   Oceania 2007 403657044512

 ~~~

@@ -324,19 +324,26 @@ daply(

 ~~~{.output}
           year
-continent       1952      1957      1962      1967      1972      1977
-  Africa   5.992e+09 7.359e+09 8.785e+09 1.144e+10 1.507e+10 1.869e+10
-  Americas 1.177e+11 1.408e+11 1.692e+11 2.179e+11 2.682e+11 3.241e+11
-  Asia     3.410e+10 4.727e+10 6.014e+10 8.465e+10 1.244e+11 1.598e+11
-  Europe   8.497e+10 1.100e+11 1.390e+11 1.734e+11 2.187e+11 2.554e+11
-  Oceania  5.416e+10 6.683e+10 8.234e+10 1.060e+11 1.341e+11 1.547e+11
+continent          1952         1957         1962         1967
+  Africa     5992294608   7359188796   8784876958  11443994101
+  Americas 117738997171 140817061264 169153069442 217867530844
+  Asia      34095762661  47267432088  60136869012  84648519224
+  Europe    84971341466 109989505140 138984693095 173366641137
+  Oceania   54157223944  66826828013  82336453245 105958863585
           year
-continent       1982      1987      1992      1997      2002      2007
-  Africa   2.204e+10 2.411e+10 2.626e+10 3.002e+10 3.530e+10 4.578e+10
-  Americas 3.633e+11 4.394e+11 4.899e+11 5.827e+11 6.612e+11 7.767e+11
-  Asia     1.944e+11 2.418e+11 3.071e+11 3.876e+11 4.580e+11 6.275e+11
-  Europe   2.795e+11 3.165e+11 3.427e+11 3.836e+11 4.364e+11 4.932e+11
-  Oceania  1.762e+11 2.095e+11 2.363e+11 2.893e+11 3.452e+11 4.037e+11
+continent          1972         1977         1982         1987
+  Africa    15072241974  18694898732  22040401045  24107264108
+  Americas 268159178814 324085389022 363314008350 439447790357
+  Asia     124385747313 159802590186 194429049919 241784763369
+  Europe   218691462733 255367522034 279484077072 316507473546
+  Oceania  134112109227 154707711162 176177151380 209451563998
+          year
+continent          1992         1997         2002         2007
+  Africa    26256977719  30023173824  35303511424  45778570846
+  Americas 489899820623 582693307146 661248623419 776723426068
+  Asia     307100497486 387597655323 458042336179 627513635079
+  Europe   342703247405 383606933833 436448815097 493183311052
+  Oceania  236319179826 289304255183 345236880176 403657044512

 ~~~

@@ -361,11 +368,11 @@ d_ply(


 ~~~{.output}
-[1] "The mean GDP per capita for Africa is 2,194"
-[1] "The mean GDP per capita for Americas is 7,136"
-[1] "The mean GDP per capita for Asia is 7,902"
-[1] "The mean GDP per capita for Europe is 14,469"
-[1] "The mean GDP per capita for Oceania is 18,622"
+[1] "The mean GDP per capita for Africa is 2,193.755"
+[1] "The mean GDP per capita for Americas is 7,136.11"
+[1] "The mean GDP per capita for Asia is 7,902.15"
+[1] "The mean GDP per capita for Europe is 14,469.48"
+[1] "The mean GDP per capita for Oceania is 18,621.61"

 ~~~

@@ -383,7 +390,7 @@ d_ply(
 >
 >
 > #### Challenge 2 {.challenge}
->
+>
 > Calculate the average life expectancy per continent and year. Which had the
 > longest and shortest in 2007? Which had the greatest change in between 1952
 > and 2007?
@@ -402,7 +409,8 @@ d_ply(
 > life expectancy per continent:
 >
 > 1.
-> ~~~ {.r}
+>
+> ~~~{.r}
 > ddply(
 >   .data = gapminder,
 >   .variables = gapminder$continent,
@@ -413,7 +421,8 @@ d_ply(
 > ~~~
 >
 > 2.
-> ~~~ {.r}
+>
+> ~~~{.r}
 > ddply(
 >   .data = gapminder,
 >   .variables = "continent",
@@ -422,7 +431,8 @@ d_ply(
 > ~~~
 >
 > 3.
-> ~~~ {.r}
+>
+> ~~~{.r}
 > ddply(
 >   .data = gapminder,
 >   .variables = "continent",
@@ -433,7 +443,8 @@ d_ply(
 > ~~~
 >
 > 4.
-> ~~~ {.r}
+>
+> ~~~{.r}
 > adply(
 >   .data = gapminder,
 >   .variables = "continent",
diff --git a/13-wrap-up.Rmd b/13-wrap-up.Rmd
index e40bb606a..25316c2fb 100644
--- a/13-wrap-up.Rmd
+++ b/13-wrap-up.Rmd
@@ -5,7 +5,7 @@ subtitle: Wrapping up
 minutes: 15
 ---

-> ### Learning Objectives {.objectives}
+> ## Learning objectives {.objectives}
 >
 > * To review the best practices for using R for
 >   scientific analysis.
diff --git a/13-wrap-up.md b/13-wrap-up.md
index 20ef27105..25316c2fb 100644
--- a/13-wrap-up.md
+++ b/13-wrap-up.md
@@ -5,27 +5,27 @@ subtitle: Wrapping up
 minutes: 15
 ---

-> ### Learning Objectives {.objectives}
+> ## Learning objectives {.objectives}
 >
-> * To review the best practices for using R for
+> * To review the best practices for using R for
 >   scientific analysis.
 >

 ### Best practices for writing nice code

-#### Make code readable
+#### Make code readable

 The most important part of writing code is making it readable and understandable.
-You want someone else to be able to pick up your code and be able to understand
+You want someone else to be able to pick up your code and be able to understand
 what it does: more often than not this someone will be you 6 months down the line,
 who will otherwise be cursing past-self.

 #### Documentation: tell us what and why, not how

 When you first start out, your comments will often describe what a command does,
-since you're still learning yourself and it can help to clarify concepts and
+since you're still learning yourself and it can help to clarify concepts and
 remind you later. However, these comments aren't particularly useful later on
-when you don't remember what problem your code is trying to solve. Try to also
+when you don't remember what problem your code is trying to solve. Try to also
 include comments that tell you *why* you're solving a problem, and *what* problem
 that is. The *how* can come after that: it's an implementation detail you ideally
 shouldn't have to worry about.
@@ -34,17 +34,17 @@ shouldn't have to worry about.

 Our recommendation is that you should separate your functions from your analysis
 scripts, and store them in a separate file that you `source` when you open the R
-session in your project. This approach is nice because it leaves you with an
-uncluttered analysis script, and a repository of useful functions that can be
-loaded into any analysis script in your project. It also lets you group related
+session in your project. This approach is nice because it leaves you with an
+uncluttered analysis script, and a repository of useful functions that can be
+loaded into any analysis script in your project. It also lets you group related
 functions together easily.

 #### Break down problem into bite size pieces

 When you first start out, problem solving and function writing can be daunting
 tasks, and hard to separate from code inexperience.
 Try to break down your
-problem into digestable chunks and worry about the implementation details later:
-keep breaking down the problem into smaller and smaller functions until you
+problem into digestible chunks and worry about the implementation details later:
+keep breaking down the problem into smaller and smaller functions until you
 reach a point where you can code a solution, and build back up from there.

 #### Know that your code is doing the right thing

@@ -64,5 +64,3 @@ for which a particular input always gives a particular output.
 #### Remember to be stylish

 Apply consistent style to your code.
-
-
diff --git a/fig/08-plot-ggplot2-axis-scale.png b/fig/08-plot-ggplot2-axis-scale.png
deleted file mode 100644
index 48fd3a83b..000000000
Binary files a/fig/08-plot-ggplot2-axis-scale.png and /dev/null differ
diff --git a/fig/08-plot-ggplot2-challenge-1-solution.png b/fig/08-plot-ggplot2-challenge-1-solution.png
deleted file mode 100644
index 6db1fc6e5..000000000
Binary files a/fig/08-plot-ggplot2-challenge-1-solution.png and /dev/null differ
diff --git a/fig/08-plot-ggplot2-challenge-2-solution.png b/fig/08-plot-ggplot2-challenge-2-solution.png
deleted file mode 100644
index 5eb3d37ee..000000000
Binary files a/fig/08-plot-ggplot2-challenge-2-solution.png and /dev/null differ
diff --git a/fig/08-plot-ggplot2-density.png b/fig/08-plot-ggplot2-density.png
deleted file mode 100644
index 4dd613fd7..000000000
Binary files a/fig/08-plot-ggplot2-density.png and /dev/null differ
diff --git a/fig/08-plot-ggplot2-facet.png b/fig/08-plot-ggplot2-facet.png
deleted file mode 100644
index 4114ae92c..000000000
Binary files a/fig/08-plot-ggplot2-facet.png and /dev/null differ
diff --git a/fig/08-plot-ggplot2-lifeExp-line-point.png b/fig/08-plot-ggplot2-lifeExp-line-point.png
deleted file mode 100644
index 14609dad2..000000000
Binary files a/fig/08-plot-ggplot2-lifeExp-line-point.png and /dev/null differ
diff --git a/fig/08-plot-ggplot2-lifeExp-line.png b/fig/08-plot-ggplot2-lifeExp-line.png
deleted file mode 100644
index 50ad3f881..000000000
Binary files a/fig/08-plot-ggplot2-lifeExp-line.png and /dev/null differ
diff --git a/fig/08-plot-ggplot2-lifeExp-vs-gdpPercap-scatter.png b/fig/08-plot-ggplot2-lifeExp-vs-gdpPercap-scatter.png
deleted file mode 100644
index d4778a50f..000000000
Binary files a/fig/08-plot-ggplot2-lifeExp-vs-gdpPercap-scatter.png and /dev/null differ
diff --git a/fig/08-plot-ggplot2-lifeExp-vs-gdpPercap-scatter2.png b/fig/08-plot-ggplot2-lifeExp-vs-gdpPercap-scatter2.png
deleted file mode 100644
index d4778a50f..000000000
Binary files a/fig/08-plot-ggplot2-lifeExp-vs-gdpPercap-scatter2.png and /dev/null differ
diff --git a/fig/08-plot-ggplot2-lifeExp-vs-gdpPercap-scatter3.png b/fig/08-plot-ggplot2-lifeExp-vs-gdpPercap-scatter3.png
deleted file mode 100644
index 2017d98c1..000000000
Binary files a/fig/08-plot-ggplot2-lifeExp-vs-gdpPercap-scatter3.png and /dev/null differ
diff --git a/fig/08-plot-ggplot2-lm-fit.png b/fig/08-plot-ggplot2-lm-fit.png
deleted file mode 100644
index cd9b278fa..000000000
Binary files a/fig/08-plot-ggplot2-lm-fit.png and /dev/null differ
diff --git a/fig/08-plot-ggplot2-lm-fit2.png b/fig/08-plot-ggplot2-lm-fit2.png
deleted file mode 100644
index 9a17a59aa..000000000
Binary files a/fig/08-plot-ggplot2-lm-fit2.png and /dev/null differ
diff --git a/fig/08-plot-ggplot2-setting.png b/fig/08-plot-ggplot2-setting.png
deleted file mode 100644
index f9bcbfedb..000000000
Binary files a/fig/08-plot-ggplot2-setting.png and /dev/null differ
diff --git a/fig/08-plot-ggplot2-theme.png b/fig/08-plot-ggplot2-theme.png
deleted file mode 100644
index d942cbbcb..000000000
Binary files a/fig/08-plot-ggplot2-theme.png and /dev/null differ