forked from gezel/data-carpentry-embl-2018
-
Notifications
You must be signed in to change notification settings - Fork 4
/
00_exercises.Rmd
277 lines (215 loc) · 8.98 KB
/
00_exercises.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
---
title: "Exercises for Exploratory analysis of RNAseq data"
output:
html_document:
toc: yes
toc_float: yes
toc_depth: 3
highlight: pygments
df_print: kable
code_folding: hide
number_sections: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
[back to lesson's homepage](https://tavareshugo.github.io/data-carpentry-rnaseq/)
```{r, echo = FALSE, message = FALSE}
# Load the tidyverse package
library(tidyverse)
```
# Intro
## Import data
1. Import the four CSV files into R and store in the following objects:
`raw_cts`, `trans_cts`, `sample_info` and `test_result`.
2. How many samples are there?
Was the design balanced (i.e. do all samples have the same number of replicates)?
3. How many genes do you have gene expression levels for?
```{r, results=FALSE, message=FALSE}
# 1. Import data
raw_cts <- read_csv("./data/counts_raw.csv")
trans_cts <- read_csv("./data/counts_transformed.csv")
sample_info <- read_csv("./data/sample_info.csv")
test_result <- read_csv("./data/test_result.csv")
# 2. number of samples is in the "sample_info" table
nrow(sample_info)
# 4. this can be taken from the table of counts
nrow(trans_cts)
```
# Exploratory analysis of count data
## Reshape table
Convert the `raw_cts` table to a "long" format using the `pivot_longer()` function.
Save it into an object called `raw_cts_long`.
```{r}
raw_cts_long <- raw_cts %>%
pivot_longer(wt_0_r1:mut_180_r3, names_to = "sample", values_to = "cts")
```
## Join tables
* Take the `raw_cts_long` object created in the previous exercise and _join_ it with the `sample_info` table.
* Produce the plot below for the raw count data. (hint: you might want to try
log-transforming the x-axis data using the `log10()` function).
```{r, results=FALSE}
# Join with sample information table
raw_cts_long <- full_join(raw_cts_long, sample_info, by = ("sample"))
# Make the plot
raw_cts_long %>%
# add pseudo-count of 1 because log(0) = -Inf
ggplot(aes(log10(cts + 1), colour = replicate)) +
geom_freqpoly(binwidth = 1) +
facet_grid(rows = vars(strain), cols = vars(minute))
```
* Try out other ways to visualise these data, for example as a boxplot.
```{r, results=FALSE}
# Make a boxplot
raw_cts_long %>%
# make sure minute is specified as a factor
ggplot(aes(factor(minute), log10(cts + 1), fill = strain)) +
geom_boxplot() +
facet_grid(cols = vars(replicate))
```
## Scatterplot
Compare the expression between a WT cell at T0 and T30. What can you conclude from this?
```{r}
# Scatterplot between T0 and T30
# the correlation is lower than between replicates at T0, for example
trans_cts %>%
ggplot(aes(wt_0_r1, wt_30_r1)) + geom_point() +
geom_abline(colour = "brown")
```
# PCA
## Examine `prcomp()` output
```{r, echo=FALSE}
sample_pca <- prcomp(t(trans_cts[, -1]))
pc_scores <- sample_pca$x
```
After running the PCA investigate:
1. What type of object is it? (hint: `class()`)
2. The object returned by this function is a list-type object. We haven't encountered these before. Use the `str()` function to examine what is inside the object (i.e. its _structure_).
3. Look at the `prcomp()` help page to identify which parts of the object contain the _eigenvalues_, the _variable loadings_ and the _PC scores_
(hint: look under the section "Value" of the help page). As a reminder:
* _PC scores_: new coordinates of the data on the PC axis
* _eigenvalues_: variance explained by each PC
* _variable loadings_: "weight" that each original gene has on each PC axis
4. To access elements of this list-type object we can use the `$`. For example, the _PC scores_ can be obtained with: `pc_scores <- sample_pca$x`.
Similarly to this, extract the _eigenvalues_ and _variable loadings_ to objects called `pc_eigenvalues` and `pc_loadings`.
* what class is each of these elements?
5. How many principal components do we have?
```{r, results=FALSE}
# 1. class of the object
class(sample_pca)
# 2. structure of the object
str(sample_pca)
# 3. checking the help ?prcomp, under the section "Value" is says:
# "sdev" contains the standard deviation explained by each PC, so if we square it we get the eigenvalues (or explained variance)
# "rotation" contains the variable loadings for each PC, which define the eigenvectors
# "x" contains the PC scores, i.e. the data projected on the new PC axis
# "center" in this case contains the mean of each gene, which was subtracted from each value
# "scale" contains the value FALSE because we did not scale the data by the standard deviation
# 4. we can use the 'dollar sign' to access these elements
pc_scores <- sample_pca$x # PC scores (a matrix)
pc_eigenvalues <- sample_pca$sdev^2 # eigenvalues (a vector) - notice we square the values
pc_loadings <- sample_pca$rotation # variable loadings (a matrix)
# 5. here's three ways to check this
ncol(pc_scores)
length(pc_eigenvalues)
ncol(pc_loadings)
```
## Annotating PC plot
Fix the following code (where the word <span style="color: tomato;">FIXME</span>
appears), to recreate the plot below.
Once the code is fixed, assign (`<-`) the result to an object called `pca_plot`.
What can you conclude from this result?
<pre><code># get the PC scores from prcomp object
sample_pca$x %>%
# convert it to a tibble
as_tibble(rownames = "sample") %>%
# join with "sample_info" table
<span style="color: tomato;">FIXME</span>(sample_info, by = "<span style="color: tomato;">FIXME</span>") %>%
# make the plot
ggplot(aes(x = PC1, y = PC2,
<span style="color: tomato;">FIXME</span> = factor(minute), shape = <span style="color: tomato;">FIXME</span>)) +
geom_point()
</code></pre>
```{r}
pca_plot <- sample_pca$x %>% # extract the loadings from prcomp
# convert to a tibble retaining the sample names as a new column
as_tibble(rownames = "sample") %>%
# join with "sample_info" table
full_join(sample_info, by = "sample") %>%
# create the plot
ggplot(aes(x = PC1, y = PC2, colour = factor(minute), shape = strain)) +
geom_point()
# print the result (in this case a ggplot)
pca_plot
```
## Visualise variable loadings
```{r, echo=FALSE}
top_genes <- sample_pca$rotation %>%
as_tibble(rownames = "gene") %>%
select(gene, PC1, PC2) %>%
pivot_longer(matches("PC"), names_to = "PC", values_to = "loading") %>%
group_by(PC) %>%
arrange(desc(abs(loading))) %>%
slice(1:10) %>%
pull(gene) %>% unique()
top_loadings <- sample_pca$rotation %>%
as_tibble(rownames = "gene") %>%
filter(gene %in% top_genes)
```
Fix the following code (where the word <span style="color: tomato;">FIXME</span>
appears), to recreate the plot below.
Once the code is fixed, assign (`<-`) the result to an object called `loadings_plot`.
<pre><code>ggplot(data = top_loadings) +
geom_segment(aes(x = 0, y = 0, xend = PC1, yend = <span style="color: tomato;">FIXME</span>),
arrow = arrow(length = unit(0.1, "in")),
<span style="color: tomato;">FIXME</span> = "brown") +
geom_text(aes(x = <span style="color: tomato;">FIXME</span>, y = PC2, label = gene),
nudge_y = 0.005, size = 3) +
scale_x_continuous(expand = c(0.02, 0.02))
</code></pre>
```{r}
loadings_plot <- ggplot(data = top_loadings) +
geom_segment(aes(x = 0, y = 0, xend = PC1, yend = PC2),
arrow = arrow(length = unit(0.1, "in")),
colour = "brown") +
geom_text(aes(x = PC1, y = PC2, label = gene),
nudge_y = 0.005, size = 3) +
scale_x_continuous(expand = c(0.02, 0.02))
loadings_plot
```
# Exploring test results
## MA plot
1. Recreate the plot below from the `test_result` table. (hint: notice the x-axis is log-transformed)
2. Why is the fold-change log transformed?
```{r}
# 1. making the MA plot
test_result %>%
ggplot(aes(log10(baseMean), log2FoldChange)) +
geom_point(alpha = 0.1) +
facet_wrap(vars(comparison))
# 2. The reason fold change is log-transformed is:
# Because a fold-change (FC) is a ratio between two things FC = a/b
# if a > b, then the FC can vary from 1 to infinity!
# but if a < b, then it can only go from 0 to 1
# therefore, ratios are not symmetric around equality (a = b)
# taking the log of a ratio solves this problem!
# For example:
# 4/1 = 4 and log2(4/1) = 2
# 1/4 = 0.25 and log2(1/4) = -2
# Note that another common example where log-transformation should always be used is RT-qPCR data!
```
**Bonus:** try and re-create the plot below where the x-axis is on a log-scale but
showing the original units and genes with an adjusted p-value below 0.01 are highlighted
in red. (hint: the function `ifelse()` is useful here. This may be hard if you're new
with R: but look at the solution and see if you can understand the trick we're using.)
```{r, warning=FALSE}
test_result %>%
# add column which contains value only if padj < 0.01
mutate(sig = ifelse(padj < 0.01, log2FoldChange, NA)) %>%
# make the plot
ggplot(aes(baseMean, log2FoldChange)) +
geom_point(alpha = 0.1) +
geom_point(aes(y = sig), colour = "brown", size = 1) +
scale_x_continuous(trans = "log10") +
facet_wrap(vars(comparison))
```