Data visualisation with R and ggplot2
- Date: 31st May 2018, 6 - 7pm
- Series: Wolfson College Skills for Academic Success
- Location: Gatsby room, Wolfson College, University of Cambridge, UK
- Trainer: Sergio Martínez Cuesta
- Register here
This course provides a short beginner's introduction to data visualisation using the R programming language and software environment for statistical computing and graphics. Sergio will demonstrate basic examples of how to import data, create different types of plots and export graphics using standard R functions and the ggplot2 library. Everybody is welcome; if you would like to follow along on your laptop, please download and install R and RStudio before the session.
- Getting started
- Import data into R
- Basic plotting
- Exercise 1
- Advanced plotting using the ggplot2 library
- Export graphics
For a basic introduction to R functionality, check out our basic R course.
- R is one of the most widely-used programming languages for data analysis, statistics and visualisation in academia and industry.
- It is open-source and available on all major platforms (Mac, Linux and Windows)
- Supported by a broad community of software developers and researchers who contribute R packages and libraries to many fields of research
- It facilitates reproducibility in research and integration of all your analyses in individual scripts
- Easy to write documentation and code together using a free environment like RStudio
- Open RStudio, e.g. go to Applications and click on the RStudio icon
To download today's workshop:
- Go to your web browser e.g. Firefox and type: https://tinyurl.com/2018-DataVisR-Wolfson
- Click on DataVisR.zip, then press Download and save the file in your preferred folder, e.g. your Desktop
- Go to the folder where you saved DataVisR.zip and uncompress it, e.g. on a Mac just double-click on DataVisR.zip. Only then will the folder DataVisR appear
- The folder DataVisR contains two files:
  - DataVisR.Rmd - the code for today's session
  - patient-data-cleaned.csv - the dataset that we will be exploring
Now, go back to RStudio:
- Click on Open File and select the code file for today's session from the DataVisR folder
- You are all set to go now :)
Also, in case you are not familiar with RStudio, a quick recap:
The RStudio interface is composed of four panels, listed here anti-clockwise starting from the top left:
- Top-left: scripts panel
- Bottom-left: R console
- Bottom-right: plots, packages and help
- Top-right: log panel
You are now looking at the scripts panel. We will be using the R console below it to interact with R during the workshop.
Blocks of code in RStudio are often written in the R Markdown format, which allows plain text and R code to be mixed within the same document.
Each line of R code inside a block can be executed by clicking on the line and pressing CMD + ENTER (Mac) or CTRL + ENTER (Windows and Linux), e.g.:
print("R is fun!")
Alternatively, to execute the entire block, click on the green arrow tip on the right-hand side of the block.
3 + 1
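- You can add a new block of code using the Insert menu, or by typing the delimiters directly: a line containing three backticks followed by {r} opens the block, and a line containing only three backticks closes it. Everything in between is R code, e.g.:
# R code goes in here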
Import data into R
We will use a small made-up dataset which is often used for training purposes. It contains information about 100 lung cancer patients aged 42-44 from different states in the US. We have saved these data as a comma-separated values (CSV) file, patient-data-cleaned.csv, which can easily be opened using software like Excel. In R, use the read.csv() function to import the data:
patient_data <- read.csv("/Users/martin03/Desktop/DataVisR/patient-data-cleaned.csv") # copy here the path to the file
If you have trouble finding the exact path to patient-data-cleaned.csv, use the function file.choose() to open a dialogue box and browse through the directories to reach the file:
file.choose()
The path will then be displayed in R and you can copy it into the read.csv() command above.
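If you would like to try the import step without depending on any particular folder, the following is a minimal sketch. It writes a tiny made-up table to a temporary CSV file and reads it back with read.csv(). The object names (toy, csv_path, toy_imported) and the values are hypothetical and are not part of the workshop dataset.

```r
# Hypothetical mini-dataset; the values are invented for illustration only
toy <- data.frame(
  ID     = c("P01", "P02", "P03"),
  Sex    = c("Male", "Female", "Female"),
  Weight = c(76.5, 61.2, 68.9),
  Height = c(182.0, 165.5, 170.1)
)

# Write it to a temporary CSV file so the example does not depend on your folders
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# file.exists() is a quick way to check a path before importing
file.exists(csv_path)

# Import the file back into R, in the same way as patient-data-cleaned.csv
toy_imported <- read.csv(csv_path)
dim(toy_imported)   # 3 rows and 4 columns
head(toy_imported)
```

If file.exists() returns FALSE for your own path, the path is probably mistyped, which is the most common reason for read.csv() to fail.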
patient_data is known as a data frame in R. To explore its contents:
# Dimensions (rows and columns)
dim(patient_data)
# Viewing contents
View(patient_data)
# Structure of the data frame
str(patient_data)
# Summary of all data frame contents
summary(patient_data)
Basic plotting
Simple plotting functions are available in the base R distribution (histograms, barplots, boxplots, scatterplots ...). All that is required as input are vectors of data, e.g. columns in your data frame.
Histograms are often used to get an overview of the distribution of continuous data, using the hist() function on a numeric column, e.g. hist(patient_data$Weight).
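As a rough illustration of a histogram call, here is a sketch that uses simulated values rather than the workshop data. The vector weights and the chosen mean and standard deviation are assumptions made up for this example.

```r
set.seed(1)  # make the simulated values reproducible

# Hypothetical weights in kg for 100 people (simulated, not the workshop data)
weights <- rnorm(100, mean = 75, sd = 10)

# Draw the histogram; 'breaks' is only a suggestion for the number of bins
h <- hist(weights, breaks = 10, col = "lightblue",
          xlab = "Weight (kg)", main = "Simulated weights")

# hist() also returns the bin boundaries and the counts it used
h$breaks
sum(h$counts)  # every value falls in exactly one bin, so this is 100
```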
Barplots are useful when you have counts of categorical data:
barplot(table(patient_data$Race))
barplot(table(patient_data$Sex))
barplot(table(patient_data$Smokes))
# 'las=2' draws the axis labels perpendicular to the axes (vertical on the x-axis) and 'cex.names=0.7' shrinks the bar labels to 70% of their default size
barplot(table(patient_data$State), las=2, cex.names=0.7)
barplot(table(patient_data$Grade))
barplot(table(patient_data$Overweight))
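The calls above all combine table() and barplot(). To see what each step contributes, here is a small sketch with an invented vector of smoking status; the vector smokes and its values are hypothetical.

```r
# Hypothetical smoking status for ten patients (made-up values)
smokes <- c("Smoker", "Non-Smoker", "Non-Smoker", "Smoker", "Non-Smoker",
            "Non-Smoker", "Non-Smoker", "Smoker", "Non-Smoker", "Non-Smoker")

# table() counts how many times each category occurs
counts <- table(smokes)
counts

# barplot() draws one bar per category, with the count as the bar height
barplot(counts, col = c("grey70", "tomato"), ylab = "Number of patients")
```

Here table() reports 7 Non-Smokers and 3 Smokers, and those two numbers become the bar heights.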
Boxplots are good for comparing distributions. Here the ~ symbol sets up a formula, the effect of which is to put the categorical variable on the x-axis and the continuous variable on the y-axis -> boxplot(y ~ x)
boxplot(patient_data$BMI ~ patient_data$Grade)
boxplot(patient_data$BMI ~ patient_data$Overweight)
boxplot(patient_data$Weight ~ patient_data$Overweight)
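The formula interface also accepts a data argument, which saves repeating the data frame name on both sides of the ~ symbol. The sketch below shows this on a small invented data frame; toy and its values are hypothetical.

```r
# Hypothetical data: BMI for patients in two made-up groups
toy <- data.frame(
  Overweight = c("No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"),
  BMI        = c(21.5, 23.0, 22.1, 24.3, 27.8, 29.5, 26.4, 31.0)
)

# Continuous variable on the left of '~', categorical variable on the right
b <- boxplot(BMI ~ Overweight, data = toy,
             xlab = "Overweight patient?", ylab = "BMI")

# boxplot() returns the numbers behind the picture, e.g. group names and sizes
b$names
b$n
```

With the workshop data, the equivalent call would be boxplot(BMI ~ Overweight, data = patient_data).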
Scatter plots are useful for representing two continuous variables. Here -> plot(x, y)
To enhance the appearance of your plots, many different ways of customisation are possible:
- Colour: col argument. To get a full list of possible colours type colours(), or check this online reference.
- Point type: pch
- Axis labels: xlab and ylab
- Plot title: main
- ... and many others: see ?par for more options
# linear regression
plot(patient_data$Weight, patient_data$BMI, col="red", pch=16, xlab="Weight (kg)", ylab="BMI", main="US patient data")
abline(lm(patient_data$BMI ~ patient_data$Weight), col="blue")
# polynomial regression
quadratic.model <- lm(patient_data$BMI ~ patient_data$Weight + I(patient_data$Weight^2))
plot(patient_data$Weight, patient_data$BMI, col="red", pch=16, xlab="Weight (kg)", ylab="BMI", main="US patient data")
lines(sort(patient_data$Weight), fitted(quadratic.model)[order(patient_data$Weight)], col="darkgreen")
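To experiment with the regression lines without the workshop file, the following sketch repeats the same idea on simulated data. The vectors weight and bmi, and the relationship used to generate them, are assumptions chosen only to produce a plausible cloud of points.

```r
set.seed(42)

# Simulated data (not the workshop dataset): BMI loosely increasing with weight
weight <- runif(50, min = 55, max = 110)
bmi    <- 5 + 0.25 * weight + rnorm(50, sd = 1.5)

# Fit a straight line and a quadratic curve to the same points
linear_fit    <- lm(bmi ~ weight)
quadratic_fit <- lm(bmi ~ weight + I(weight^2))

plot(weight, bmi, pch = 16, col = "red", xlab = "Weight (kg)", ylab = "BMI")
abline(linear_fit, col = "blue")  # abline() reads intercept and slope from the fit

# lines() joins points in the order given, so sort by weight before drawing the curve
idx <- order(weight)
lines(weight[idx], fitted(quadratic_fit)[idx], col = "darkgreen")

coef(linear_fit)  # estimated intercept and slope
```

Sorting both the x values and the fitted values with the same index is what keeps the curve from zig-zagging across the plot.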
The same arguments can also be used with other plotting functions!
boxplot(patient_data$BMI ~ patient_data$Overweight, col=c("red", "green"), xlab="Overweight patient?", ylab="BMI", main="US patient data")
Exercise 1
- Are there any differences in BMI between Smokers and Non-Smokers? (hint: try a boxplot)
- Visualise the relationship between the Height and Weight of the patients
- A small trick: if you attach the data frame patient_data as follows, then you will only need the column name without the '$' notation:
attach(patient_data)
plot(Weight, BMI)
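A related option, shown here only as a sketch and not as part of the original exercise, is with(), which gives the same shorthand for a single expression. The data frame toy and its values are hypothetical. The example also shows detach(), which undoes attach() once you have finished.

```r
# Hypothetical stand-in for patient_data with two numeric columns
toy <- data.frame(
  Weight = c(60.2, 72.5, 81.0, 95.3),
  BMI    = c(21.3, 24.1, 26.8, 30.2)
)

# with() evaluates the expression using the columns of 'toy' as variables,
# without changing the search path the way attach() does
with(toy, plot(Weight, BMI))

# If you do use attach(), detach() undoes it once you are finished
attach(toy)
mean_bmi <- mean(BMI)
detach(toy)
mean_bmi
```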
Advanced plotting using the ggplot2 library
The ggplot2 library offers a powerful graphics language for creating elegant and complex plots. It is particularly useful when creating publication-quality graphics.
The key to understanding ggplot2 is thinking about a figure in layers (e.g. data points, axes and labels, legend). This idea may be familiar to you if you have used image editing programs like Photoshop, Illustrator or Inkscape, where you can ungroup the figure into its different components.
There are two ways to install and load ggplot2:
- Click on the Packages tab in the bottom-right RStudio panel and search for ggplot2, then tick its box. If you can't find it, click on Install and type ggplot2 inside the Packages box. Leaving the rest on default, click on Install. Once installed, tick the box.
- Run library(ggplot2) in the console. If you get a message like Error in library("ggplot2") : there is no package called ‘ggplot2’ then run install.packages("ggplot2") in the console. Once the installation is finished, run library(ggplot2) again.
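The second route can also be written as a short check, sketched below. It only reports whether ggplot2 is available and deliberately does not install anything; the variable names pkg and installed are arbitrary.

```r
pkg <- "ggplot2"

# requireNamespace() returns TRUE or FALSE without attaching the package
installed <- requireNamespace(pkg, quietly = TRUE)

if (installed) {
  library(ggplot2)
  message(pkg, " version ", as.character(packageVersion(pkg)), " is ready to use")
} else {
  message(pkg, " is not installed yet; run install.packages(\"", pkg, "\") first")
}
```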
Let's begin with the scatterplot of Weight and Height.
First, load the ggplot2 library:
library(ggplot2)
The first "global" layer requires the definition of the dataset, and the x and y axes:
ggplot(data = patient_data, aes(x = Weight, y = Height))
In the second layer, we need to tell ggplot how we want to visually represent the data (scatterplot, boxplot, barplot ...). For a scatterplot, we need geom_point():
ggplot(data = patient_data, aes(x = Weight, y = Height)) + geom_point()
Another aes (aesthetic) property we can modify is the point color, e.g. to change the color depending on the grade of the disease:
ggplot(data = patient_data, aes(x = Weight, y = Height, col = as.factor(Grade))) + geom_point()
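Because a ggplot2 figure is built from layers, it can also be stored in an object and extended one step at a time. The sketch below illustrates this with simulated data; the data frame toy, its values, the centimetre unit for Height and the extra labs() layer are assumptions added for illustration, and ggplot2 must already be installed.

```r
library(ggplot2)

# Simulated stand-in for patient_data (made-up values)
set.seed(7)
toy <- data.frame(
  Weight = runif(30, min = 55, max = 110),
  Height = runif(30, min = 150, max = 195),
  Grade  = sample(1:3, 30, replace = TRUE)
)

# Layer 1: dataset and axes only; printing this shows an empty panel
p <- ggplot(data = toy, aes(x = Weight, y = Height))

# Layer 2: represent each row as a point, coloured by grade
# ('colour' can also be written 'col' or 'color')
p <- p + geom_point(aes(colour = as.factor(Grade)))

# Further layer: axis labels and legend title
p <- p + labs(x = "Weight (kg)", y = "Height (cm)", colour = "Grade")

p  # typing the name of the object draws the plot
```

Storing the plot in p makes it easy to try an extra layer and go back to the previous version if you do not like the result.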
Export graphics
When running commands directly in the interactive console (bottom-left panel), plots can be exported using the Plots tab in RStudio (bottom-right panel). Click on Export and choose Save as PDF ....
When plotting using R standard graphics, you can also save plots to a file by calling the pdf() or png() functions before executing the code that creates the plot:
pdf("/Users/martin03/Desktop/DataVisR/BMIvsWeight.pdf")
plot(patient_data$Weight, patient_data$BMI, col="red", pch=16, xlab="Weight (kg)", ylab="BMI", main="US patient data")
abline(lm(patient_data$BMI ~ patient_data$Weight), col="blue")
dev.off()
The dev.off() line is important: it closes the graphics device and finishes writing the file. Without it you will not be able to open the plot you have saved.
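The open, plot, close pattern can be tried safely with a temporary file, as in the sketch below. The file name comes from tempfile() and the plotted numbers are arbitrary; in practice you would use your own path and data.

```r
# Temporary file so the example works on any computer; use your own path in practice
out_file <- tempfile(fileext = ".pdf")

pdf(out_file, width = 6, height = 4)    # open the PDF device (size in inches)
plot(1:10, (1:10)^2, pch = 16, col = "red",
     xlab = "x", ylab = "x squared", main = "Saved to a PDF file")
dev.off()                               # close the device so the file is completed

file.exists(out_file)     # the file is now on disk
file.size(out_file) > 0   # and it is not empty
```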
If you use ggplot2, the syntax is more concise:
gg <- ggplot(data = patient_data, aes(x = Weight, y = Height, col = as.factor(Grade))) + geom_point()
# passing the plot explicitly makes sure that this is the figure being saved
ggsave("/Users/martin03/Desktop/DataVisR/HeightvsWeight.pdf", plot = gg)
That's it! Enjoy R!
For feedback or questions about the course, please email Sergio (email@example.com).
References and resources
- Getting started with data visualization in R using ggplot2
- End-to-end visualization using ggplot2
- ggplot2 - Easy way to mix multiple graphs on the same page
- Rookie mistakes and how to fix them when making plots of data
- BBC Visual and Data Journalism cookbook for R graphics
- Cookbook for R
- R for Data Science
- Data Visualization for Social Science. A practical introduction with R and ggplot2
- ggplot2: Elegant Graphics for Data Analysis (Use R!)
- R packages
- plotly for R
- bookdown: Authoring Books and Technical Documents with R Markdown
- Bernd Klaus teaching materials
- Modern Statistics for Modern Biology
- Statistical Inference via Data Science A moderndive into R and the tidyverse
- CRUK-CI R crash course
- R for Reproducible Scientific Analysis
- Karl Broman's mini tutorials
- Basic statistics and data handling with R
- Scripting for data analysis (with R)
- An Introduction to Solving Biological Problems with R
- Data Analysis and Visualisation using R: including dplyr and ggplot2
- Babraham institute basic/advanced R and ggplot2 courses
- R object-oriented programming and package development, link1 and link2
- R course content for the CODATA-RDA Research Data Science Summer School
- Data carpentry course for biologists by Ethan White
- Cambridge's Data carpentry using R
Sergio is a University of Cambridge Data Champion funded by a Jisc research data fellowship to develop research data training activities for researchers. He does research in bioinformatics and computational biology within the Balasubramanian laboratories funded by the Wellcome Trust at the University of Cambridge.
This work is distributed under a Creative Commons CC0 license. No rights reserved.