00_exercises.Rmd

---
title: "Exercises for Exploratory analysis of RNAseq data"
output: 
  html_document:
    toc: yes
    toc_float: yes
    toc_depth: 3
    highlight: pygments
    df_print: kable
    code_folding: hide
    number_sections: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

[back to lesson's homepage](https://tavareshugo.github.io/data-carpentry-rnaseq/)

```{r, echo = FALSE, message = FALSE}
# Load the tidyverse package
library(tidyverse)
```

# Intro

## Import data

1. Import the four CSV files into R and store in the following objects: 
`raw_cts`, `trans_cts`, `sample_info` and `test_result`.
2. How many samples are there? 
Was the design balanced (i.e. do all samples have the same number of replicates)?
3. How many genes do you have gene expression levels for?

```{r, results=FALSE, message=FALSE}
# 1. Import data
raw_cts <- read_csv("./data/counts_raw.csv")
trans_cts <- read_csv("./data/counts_transformed.csv")
sample_info <- read_csv("./data/sample_info.csv")
test_result <- read_csv("./data/test_result.csv")

# 2. number of samples is in the "sample_info" table
nrow(sample_info)

# 4. this can be taken from the table of counts
nrow(trans_cts)
```


# Exploratory analysis of count data

## Reshape table

Convert the `raw_cts` table to a "long" format using the `pivot_longer()` function.
Save it into an object called `raw_cts_long`.

```{r}
raw_cts_long <- raw_cts %>% 
  pivot_longer(wt_0_r1:mut_180_r3, names_to = "sample", values_to = "cts")
```

## Join tables

* Take the `raw_cts_long` object created in the previous exercise and _join_ it with the `sample_info` table. 
* Produce the plot below for the raw count data. (hint: you might want to try 
log-transforming the x-axis data using the `log10()` function).

```{r, results=FALSE}
# Join with sample information table
raw_cts_long <- full_join(raw_cts_long, sample_info, by = ("sample"))

# Make the plot
raw_cts_long %>%
  # add pseudo-count of 1 because log(0) = -Inf
  ggplot(aes(log10(cts + 1), colour = replicate)) + 
  geom_freqpoly(binwidth = 1) +
  facet_grid(rows = vars(strain), cols = vars(minute))
```


* Try out other ways to visualise these data, for example as a boxplot.

```{r, results=FALSE}
# Make a boxplot
raw_cts_long %>%
  # make sure minute is specified as a factor
  ggplot(aes(factor(minute), log10(cts + 1), fill = strain)) + 
  geom_boxplot() + 
  facet_grid(cols = vars(replicate))
```


## Scatterplot

Compare the expression between a WT cell at T0 and T30. What can you conclude from this?

```{r}
# Scatterplot between T0 and T30
# the correlation is lower than between replicates at T0, for example
trans_cts %>% 
  ggplot(aes(wt_0_r1, wt_30_r1)) + geom_point() +
  geom_abline(colour = "brown")
```


# PCA

## Examine `prcomp()` output

```{r, echo=FALSE}
sample_pca <- prcomp(t(trans_cts[, -1]))
pc_scores <- sample_pca$x
```


After running the PCA investigate:

1. What type of object is it? (hint: `class()`)
2. The object returned by this function is a list-type object. We haven't encountered these before. Use the `str()` function to examine what is inside the object (i.e. its _structure_).
3. Look at the `prcomp()` help page to identify which parts of the object contain the _eigenvalues_, the _variable loadings_ and the _PC scores_
(hint: look under the section "Value" of the help page). As a reminder:
    * _PC scores_: new coordinates of the data on the PC axis
    * _eigenvalues_: variance explained by each PC
    * _variable loadings_: "weight" that each original gene has on each PC axis 
4. To access elements of this list-type object we can use the `$`. For example, the _PC scores_ can be obtained with: `pc_scores <- sample_pca$x`.
Similarly to this, extract the _eigenvalues_ and _variable loadings_ to objects called  `pc_eigenvalues` and `pc_loadings`.
    * what class is each of these elements?
5. How many principal components do we have?

```{r, results=FALSE}
# 1. class of the object
class(sample_pca)

# 2. structure of the object
str(sample_pca)

# 3. checking the help ?prcomp, under the section "Value" is says:
# "sdev" contains the standard deviation explained by each PC, so if we square it we get the eigenvalues (or explained variance)
# "rotation" contains the variable loadings for each PC, which define the eigenvectors
# "x" contains the PC scores, i.e. the data projected on the new PC axis
# "center" in this case contains the mean of each gene, which was subtracted from each value
# "scale" contains the value FALSE because we did not scale the data by the standard deviation

# 4. we can use the 'dollar sign' to access these elements
pc_scores <- sample_pca$x              # PC scores (a matrix)
pc_eigenvalues <- sample_pca$sdev^2    # eigenvalues (a vector) - notice we square the values
pc_loadings <- sample_pca$rotation     # variable loadings (a matrix)

# 5. here's three ways to check this
ncol(pc_scores)
length(pc_eigenvalues)
ncol(pc_loadings)
```

## Annotating PC plot

Fix the following code (where the word <span style="color: tomato;">FIXME</span> 
appears), to recreate the plot below.
Once the code is fixed, assign (`<-`) the result to an object called `pca_plot`.

What can you conclude from this result?

<pre><code># get the PC scores from prcomp object
sample_pca$x %>% 
  # convert it to a tibble
  as_tibble(rownames = "sample") %>% 
  # join with "sample_info" table
  <span style="color: tomato;">FIXME</span>(sample_info, by = "<span style="color: tomato;">FIXME</span>") %>% 
  # make the plot
  ggplot(aes(x = PC1, y = PC2, 
             <span style="color: tomato;">FIXME</span> = factor(minute), shape = <span style="color: tomato;">FIXME</span>)) +
  geom_point()
</code></pre>

```{r}
pca_plot <- sample_pca$x %>% # extract the loadings from prcomp
  # convert to a tibble retaining the sample names as a new column
  as_tibble(rownames = "sample") %>% 
  # join with "sample_info" table 
  full_join(sample_info, by = "sample") %>% 
  # create the plot
  ggplot(aes(x = PC1, y = PC2, colour = factor(minute), shape = strain)) +
  geom_point()

# print the result (in this case a ggplot)
pca_plot
```


## Visualise variable loadings

```{r, echo=FALSE}
top_genes <- sample_pca$rotation %>% 
  as_tibble(rownames = "gene") %>% 
  select(gene, PC1, PC2) %>%
  pivot_longer(matches("PC"), names_to = "PC", values_to = "loading") %>% 
  group_by(PC) %>% 
  arrange(desc(abs(loading))) %>% 
  slice(1:10) %>% 
  pull(gene) %>% unique()
top_loadings <- sample_pca$rotation %>% 
  as_tibble(rownames = "gene") %>% 
  filter(gene %in% top_genes)
```


Fix the following code (where the word <span style="color: tomato;">FIXME</span> 
appears), to recreate the plot below.
Once the code is fixed, assign (`<-`) the result to an object called `loadings_plot`.

<pre><code>ggplot(data = top_loadings) +
  geom_segment(aes(x = 0, y = 0, xend = PC1, yend = <span style="color: tomato;">FIXME</span>), 
               arrow = arrow(length = unit(0.1, "in")),
               <span style="color: tomato;">FIXME</span> = "brown") +
  geom_text(aes(x = <span style="color: tomato;">FIXME</span>, y = PC2, label = gene),
            nudge_y = 0.005, size = 3) +
  scale_x_continuous(expand = c(0.02, 0.02))
</code></pre>


```{r}
loadings_plot <- ggplot(data = top_loadings) +
  geom_segment(aes(x = 0, y = 0, xend = PC1, yend = PC2), 
               arrow = arrow(length = unit(0.1, "in")),
               colour = "brown") +
  geom_text(aes(x = PC1, y = PC2, label = gene),
            nudge_y = 0.005, size = 3) +
  scale_x_continuous(expand = c(0.02, 0.02))
loadings_plot
```

# Exploring test results

## MA plot

1. Recreate the plot below from the `test_result` table. (hint: notice the x-axis is log-transformed)
2. Why is the fold-change log transformed? 


```{r}
# 1. making the MA plot
test_result %>% 
  ggplot(aes(log10(baseMean), log2FoldChange)) +
  geom_point(alpha = 0.1) +
  facet_wrap(vars(comparison))

# 2. The reason fold change is log-transformed is:
# Because a fold-change (FC) is a ratio between two things FC = a/b
# if a > b, then the FC can vary from 1 to infinity!
# but if a < b, then it can only go from 0 to 1
# therefore, ratios are not symmetric around equality (a = b)
# taking the log of a ratio solves this problem!
# For example:
# 4/1 = 4     and  log2(4/1) =  2
# 1/4 = 0.25  and  log2(1/4) = -2
# Note that another common example where log-transformation should always be used is RT-qPCR data!
```

**Bonus:** try and re-create the plot below where the x-axis is on a log-scale but 
showing the original units and genes with an adjusted p-value below 0.01 are highlighted 
in red. (hint: the function `ifelse()` is useful here. This may be hard if you're new 
with R: but look at the solution and see if you can understand the trick we're using.)

```{r, warning=FALSE}
test_result %>% 
  # add column which contains value only if padj < 0.01
  mutate(sig = ifelse(padj < 0.01, log2FoldChange, NA)) %>% 
  # make the plot
  ggplot(aes(baseMean, log2FoldChange)) +
  geom_point(alpha = 0.1) +
  geom_point(aes(y = sig), colour = "brown", size = 1) +
  scale_x_continuous(trans = "log10") +
  facet_wrap(vars(comparison))
```