Data visualisation with R and ggplot2

Overview

This course provides a short beginner's introduction to data visualisation using R, a programming language and software environment for statistical computing and graphics. Sergio will demonstrate basic examples of how to import data, create different types of plots and export graphics using standard R functions and the ggplot2 library. Everybody is welcome; if you would like to follow along on your laptop, please download and install R and RStudio before the session.

Outline

  • Motivation
  • Getting started
  • Import data into R
  • Basic plotting
  • Exercise 1
  • Advanced plotting using the ggplot2 library
  • Export graphics

For a basic introduction to R functionality, check out our basic R course.

These short courses are inspired by the R crash course developed by Mark Dunning, Laurent Gatto and others.

Motivation

  • R is one of the most widely-used programming languages for data analysis, statistics and visualisation in academia and industry.
  • It is open-source and available on all major platforms (Mac, Linux and Windows)
  • Supported by a broad community of software developers and researchers who contribute R packages and libraries to many fields of research
  • It facilitates reproducibility in research and integration of all your analyses in individual scripts
  • Easy to write documentation and code together using a free environment like RStudio

Getting started

  • Open RStudio, e.g. go to Finder -> Applications and click on RStudio

To download today's workshop:

  • Go to your web browser e.g. Firefox and type: https://tinyurl.com/2018-DataVisR-Wolfson
  • Click on DataVisR.zip, then press Download and save the file in your preferred folder, e.g. your Desktop
  • Go to the folder where you saved DataVisR.zip and uncompress it, e.g. on a Mac just double-click on DataVisR.zip. The folder DataVisR will then appear.
  • The folder DataVisR contains two files:
    • DataVisR.Rmd - the code for today's session
    • patient-data-cleaned.csv - the dataset that we will be exploring

Now, go back to RStudio:

  • Click on File -> Open File and select DataVisR.Rmd
  • You are all set to go now :)

Also, in case you are not familiar with RStudio, a quick recap:

  • The RStudio interface is composed of four panels, listed here in anti-clockwise order:

    • Top-left: scripts panel
    • Bottom-left: R console
    • Bottom-right: plots, packages and help
    • Top-right: environment and history (a log of the commands you have run)
  • You are now looking at the scripts panel. We will be using the R console below to interact with R during the workshop

  • Blocks of code in RStudio are often written in the R Markdown format, which mixes plain text and R code within the same document

  • Each line of R code inside a block can be executed by clicking on the line and pressing CMD + ENTER (Mac) or CTRL + ENTER (Windows and Linux), e.g.:

print("R is fun!")

Alternatively, to execute the entire block, click on the green arrow tip on the right-hand side of the block.

3 + 1
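  • You can add a new block of code by selecting R in the Insert menu, or by typing the block delimiters directly: an opening line of three backticks followed by {r}, then your code, then a closing line of three backticks:
# R code goes in here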

Import data into R

We will use a small made-up dataset which is often used for training purposes. It contains information about 100 lung cancer patients aged 42-44 from different states in the US. We have saved these data as a comma-separated values (CSV) file patient-data-cleaned.csv, which can easily be opened using software like Excel. In R, use the read.csv() function to import the data:

patient_data <- read.csv("/Users/martin03/Desktop/DataVisR/patient-data-cleaned.csv") # replace with the path to the file on your computer

If you have trouble finding the exact path to patient-data-cleaned.csv, use the function file.choose() to open a dialogue box and browse through the directories to reach the file:

file.choose()

The path will then be displayed in R and you can copy it into the read.csv() command above.

The object patient_data is known as a data frame in R. To explore its contents:

# Dimensions (rows and columns)
dim(patient_data)

# Viewing contents
View(patient_data)

# Structure of the data frame
str(patient_data)

# Summary of all data frame contents
summary(patient_data)
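
The import-and-inspect cycle can also be tried without the workshop file. The following is a minimal sketch that writes a tiny made-up table to a temporary CSV file and reads it back; the data frame toy_patients and all of its values are hypothetical and are not taken from patient-data-cleaned.csv.

```r
# Hypothetical toy data, invented for illustration only
toy_patients <- data.frame(
  Name   = c("A", "B", "C", "D"),
  Weight = c(70.2, 81.5, 65.0, 90.3),
  Smokes = c("Yes", "No", "No", "Yes")
)

# Write the table to a temporary CSV file, then import it with read.csv()
csv_path <- tempfile(fileext = ".csv")
write.csv(toy_patients, csv_path, row.names = FALSE)
toy_import <- read.csv(csv_path)

dim(toy_import)      # number of rows and columns
head(toy_import, 2)  # first two rows only
names(toy_import)    # column names
```

head() and names() are handy when a data frame is too large to read comfortably with View().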

Basic plotting

Simple plotting functions are available in the base R distribution (histograms, barplots, boxplots, scatterplots ...). All that is required as input are vectors of data, e.g. columns in your data frame.

Histograms are often used to get an overview of the distribution of continuous data:

hist(patient_data$BMI)
hist(patient_data$Weight)
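
The number of bins can change how a histogram reads. As a small sketch, assuming simulated values rather than the workshop data (toy_bmi is hypothetical), the breaks argument suggests how many bins to use, and the object returned by hist() shows the bins that were actually chosen:

```r
set.seed(1)                               # make the simulated values reproducible
toy_bmi <- rnorm(100, mean = 27, sd = 3)  # hypothetical BMI values

# 'breaks' is a suggestion for the number of bins; 'main' and 'xlab' set the labels
h <- hist(toy_bmi, breaks = 20, col = "grey", main = "Simulated BMI", xlab = "BMI")

# hist() invisibly returns the bin boundaries and the counts per bin
h$breaks
h$counts
```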

Barplots are useful when you have counts of categorical data:

barplot(table(patient_data$Race))
barplot(table(patient_data$Sex))
barplot(table(patient_data$Smokes))
barplot(table(patient_data$State), las=2, cex.names=0.7) # 'las=2' rotates the axis labels to be perpendicular to the axis (vertical on the x-axis) and 'cex.names=0.7' shrinks the label text
barplot(table(patient_data$Grade))
barplot(table(patient_data$Overweight))
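
In the calls above, table() does the counting and barplot() only draws the result. A minimal sketch with a made-up vector (toy_state is hypothetical) makes the intermediate step visible and sorts the bars by frequency:

```r
# Hypothetical categorical values, invented for illustration
toy_state <- c("Texas", "Ohio", "Texas", "Utah", "Ohio", "Texas")

state_counts <- table(toy_state)  # counts per category
state_counts

# Sort from most to least frequent before plotting
sorted_counts <- sort(state_counts, decreasing = TRUE)
barplot(sorted_counts, las = 2, ylab = "Number of patients")
```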

Boxplots are good for comparing distributions. Here the ~ symbol sets up a formula, which puts the categorical variable on the x-axis and the continuous variable on the y-axis -> boxplot(y ~ x)

boxplot(patient_data$BMI ~ patient_data$Grade)
boxplot(patient_data$BMI ~ patient_data$Overweight)

boxplot(patient_data$Weight ~ patient_data$Overweight)
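
The formula interface also accepts a data argument, which avoids repeating the data frame name. The sketch below assumes a simulated data frame (toy and its columns are hypothetical) and also looks at the values boxplot() returns:

```r
set.seed(2)
# Hypothetical data: three grades with ten simulated BMI values each
toy <- data.frame(
  Grade = rep(c(1, 2, 3), each = 10),
  BMI   = rnorm(30, mean = 27, sd = 3)
)

# Column names are looked up inside 'toy', so no '$' is needed
b <- boxplot(BMI ~ Grade, data = toy, xlab = "Grade", ylab = "BMI")

b$names  # group labels shown on the x-axis
b$stats  # one column per group: lower whisker, lower hinge, median, upper hinge, upper whisker
```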

Scatter plots are useful when representing two continuous variables. Here -> plot(x, y):

plot(patient_data$Weight, patient_data$BMI)

To enhance the appearance of your plots, many different ways of customisation are possible:

  • Colours: col argument. To get a full list of possible colours type colours(), or check this online reference.
  • Point type: pch
  • Axis labels: xlab and ylab
  • Plot title: main
  • ... and many others: see ?plot and ?par for more options
# linear regression
plot(patient_data$Weight, patient_data$BMI, col="red", pch=16, xlab="Weight (kg)", ylab="BMI", main="US patient data")
abline(lm(patient_data$BMI ~ patient_data$Weight), col="blue")

# polynomial regression
quadratic.model <- lm(patient_data$BMI ~ patient_data$Weight + I(patient_data$Weight^2))
plot(patient_data$Weight, patient_data$BMI, col="red", pch=16, xlab="Weight (kg)", ylab="BMI", main="US patient data")
lines(sort(patient_data$Weight), fitted(quadratic.model)[order(patient_data$Weight)], col = "darkgreen")
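
An alternative way to draw the fitted curve, sketched here under the assumption of simulated data (toy, quad_fit and grid are hypothetical names), is to fit the model with a data argument and predict on an evenly spaced grid of weights. This gives a smooth line regardless of the order of the observations:

```r
set.seed(3)
# Hypothetical data: simulated weights (kg) and BMI values
toy <- data.frame(Weight = runif(50, min = 50, max = 110))
toy$BMI <- 10 + 0.2 * toy$Weight + rnorm(50, sd = 1.5)

# Quadratic model; I() protects the ^ operator inside the formula
quad_fit <- lm(BMI ~ Weight + I(Weight^2), data = toy)

# Predict BMI on 100 evenly spaced weights spanning the observed range
grid <- data.frame(Weight = seq(min(toy$Weight), max(toy$Weight), length.out = 100))
grid$BMI <- predict(quad_fit, newdata = grid)

plot(toy$Weight, toy$BMI, col = "red", pch = 16, xlab = "Weight (kg)", ylab = "BMI")
lines(grid$Weight, grid$BMI, col = "darkgreen")
```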

The same arguments can also be used with other plotting functions:

boxplot(patient_data$BMI ~ patient_data$Overweight, col=c("red", "green"), xlab="Overweight patient?", ylab="BMI", main="US patient data")

To explore other types of plots using R standard functions, have a look here. Dedicated R libraries, e.g. ggplot2, offer more sophisticated plotting.

Exercise 1

  1. Are there any differences in BMI between smokers and non-smokers? (hint: try boxplot)
  2. Visualise the relationship between the Height and Weight of the patients
  3. A small trick: if you attach the data.frame patient_data as follows, then you will only need the column name without the '$' notation:
attach(patient_data)
plot(Weight, BMI)
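
A related option is with(), which evaluates a single expression inside a data frame without attaching it for the rest of the session. This is only a sketch on made-up values (toy is hypothetical):

```r
# Hypothetical data, invented for illustration
toy <- data.frame(Weight = c(60, 72, 85, 93), BMI = c(21.5, 24.0, 27.8, 30.1))

# Column names are resolved inside 'toy' for this one call only
with(toy, plot(Weight, BMI))
mean_bmi <- with(toy, mean(BMI))
mean_bmi
```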

Advanced plotting using the ggplot2 library

The ggplot2 library offers a powerful graphics language for creating elegant and complex plots. It is particularly useful when creating publication-quality graphics.

The key to understanding ggplot2 is thinking about a figure in layers (e.g. data points, axes and labels, legend). This idea may be familiar to you if you have used image editing programs like Photoshop, Illustrator or Inkscape, where you can ungroup the figure into its different components.

Load ggplot2

There are two ways to do this:

  • Click on the Packages tab in the bottom-right RStudio panel and search for ggplot2, then tick its box. If you can't find it, then click on Install and type ggplot2 inside the Packages box. Leaving the rest on default, click on Install. Once installed, then tick the box.

  • Run library(ggplot2) in the console. If you get a message like Error in library("ggplot2") : there is no package called ‘ggplot2’ then run install.packages("ggplot2") in the console. Once the installation is finished, run library(ggplot2) again.

Example

Let's begin with the scatterplot of Weight and Height.

First, load the ggplot2 library:

library("ggplot2")

The first "global" layer requires the definition of the dataset, and the x and y axes:

ggplot(data = patient_data, aes(x = Weight, y = Height))

In the second layer, we need to tell ggplot how we want to visually represent the data (scatterplot, boxplot, barplot ...). For a scatterplot, we need geom_point():

ggplot(data = patient_data, aes(x = Weight, y = Height)) +
  geom_point()

Another aes (aesthetic) property we can modify is the point color, e.g. to change the color depending on the grade of the disease:

ggplot(data = patient_data, aes(x = Weight, y = Height, col = as.factor(Grade))) +
  geom_point()
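
Further layers are added with + in the same way. As a sketch, assuming ggplot2 is installed and using a made-up data frame (toy and its values are hypothetical), labs() adds axis labels, a legend title and a plot title on top of the points:

```r
library(ggplot2)

# Hypothetical toy data, invented for illustration
toy <- data.frame(
  Weight = c(60, 72, 85, 93, 68, 79),
  Height = c(160, 171, 180, 185, 165, 176),
  Grade  = c(1, 2, 3, 1, 2, 3)
)

# Global layer, then points, then labels
p <- ggplot(data = toy, aes(x = Weight, y = Height, col = as.factor(Grade))) +
  geom_point() +
  labs(x = "Weight (kg)", y = "Height (cm)", col = "Grade", title = "Toy patient data")

print(p)  # draws the plot; typing p in the console does the same
```

Storing the plot in a variable such as p makes it easy to add more layers later or to save it to a file.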

Export graphics

When running commands directly in the interactive console (bottom-left panel), plots can be exported using the Plots tab in RStudio (bottom-right panel). Click on Export -> Save as PDF ....

When plotting with R standard graphics, you can also save plots to a file by calling the pdf() or png() function before executing the code that creates the plot:

pdf("/Users/martin03/Desktop/DataVisR/BMIvsWeight.pdf")
plot(patient_data$Weight, patient_data$BMI, col="red", pch=16, xlab="Weight (kg)", ylab="BMI", main="US patient data")
abline(lm(patient_data$BMI ~ patient_data$Weight), col="blue")
dev.off()

The dev.off() line is important: it closes the graphics device and finishes writing the file. Without it, you will not be able to view the plot you have created.
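
The size of the exported figure can be set when the device is opened. The sketch below writes to a temporary file (out_file is a hypothetical location, and the plotted values are made up); pdf() takes width and height in inches, whereas png() measures them in pixels by default:

```r
out_file <- tempfile(fileext = ".pdf")  # hypothetical output location

pdf(out_file, width = 7, height = 5)    # open the device: 7 x 5 inches
plot(c(60, 72, 85, 93), c(21.5, 24.0, 27.8, 30.1),
     col = "red", pch = 16, xlab = "Weight (kg)", ylab = "BMI")
dev.off()                               # close the device and finish the file

file.exists(out_file)                   # check that the file was written
```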

If you use ggplot2, the syntax is more concise:

gg <- ggplot(data = patient_data, aes(x = Weight, y = Height, col = as.factor(Grade))) +
  geom_point()
ggsave("/Users/martin03/Desktop/DataVisR/HeightvsWeight.pdf", plot = gg)
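
ggsave() also accepts the figure size, and it picks the file format from the file extension. A minimal sketch, assuming ggplot2 is installed and using made-up data and a temporary output file (toy and out_file are hypothetical):

```r
library(ggplot2)

# Hypothetical toy data, invented for illustration
toy <- data.frame(
  Weight = c(60, 72, 85, 93),
  Height = c(160, 171, 180, 185)
)

gg <- ggplot(data = toy, aes(x = Weight, y = Height)) +
  geom_point()

# width and height are in inches unless 'units' says otherwise
out_file <- tempfile(fileext = ".pdf")
ggsave(out_file, plot = gg, width = 6, height = 4)

file.exists(out_file)
```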

That's it! Enjoy R!

Questions?

For feedback or questions about the course, please email Sergio (sermarcue@gmail.com).

Acknowledgements

Sergio is a University of Cambridge Data Champion funded by a Jisc research data fellowship to develop research data training activities for researchers. He does research in bioinformatics and computational biology within the Balasubramanian laboratories funded by the Wellcome Trust at the University of Cambridge.

License

This work is distributed under a Creative Commons CC0 license. No rights reserved.

