analysis/mouse-chronic-ccl4.Rmd

---
title: "Chronic CCl4 mouse model"
author: "Christian H. Holland"
date: "2020-12-18"
output: 
  workflowr::wflow_html:
    code_folding: hide
editor_options:
  chunk_output_type: console
---

```{r chunk-setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  autodep = TRUE,
  cache = TRUE
)
```

## Introduction
Here we analyze a mouse model of CCl4-induced chronic liver disease. Transcriptomic profiles were measured at 0, 2, 6, and 12 months. Matched oil controls are available for the 2- and 12-month time points.

## Libraries and sources
These libraries and sources are used for this analysis.
```{r libs-and-src, message=FALSE, warning=FALSE, cache=FALSE}
library(tidyverse)
library(tidylog)
library(here)

library(edgeR)
library(biobroom)

library(janitor)

library(AachenColorPalette)
library(cowplot)
library(lemon)
library(patchwork)

options("tidylog.display" = list(print))
source(here("code/utils-rnaseq.R"))
source(here("code/utils-wrapper.R"))
source(here("code/utils-plots.R"))
```

## Global variables for this analysis
Definition of global variables that are used throughout the entire analysis.
```{r analysis-specific-params, cache=FALSE}
# i/o
data_path <- "data/mouse-chronic-ccl4"
output_path <- "output/mouse-chronic-ccl4"
figure_path <- "output/mouse-chronic-ccl4/figures"

# graphical parameters
# fontsize
fz <- 9
```

## Preliminary exploratory analysis
### Library size
Barplot of the library size (total counts) for each of the samples.
```{r lib-size}
count_matrix <- readRDS(here(data_path, "count_matrix.rds"))

plot_libsize(count_matrix) +
  my_theme(fsize = fz)
```
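
As a rough illustration of what the barplot summarizes, the library size of a sample is the column sum of the count matrix. The sketch below uses a made-up toy matrix; it is not the implementation of `plot_libsize()`, whose internals are not shown here.

```r
# Toy count matrix (genes x samples); all values are made up for illustration.
toy_counts <- matrix(
  c(10, 0, 5,
    20, 3, 7,
     0, 0, 1,
    15, 8, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), c("s1", "s2", "s3"))
)

# Library size = total number of counts per sample (column sums).
lib_size <- colSums(toy_counts)

# Long format that could be handed to ggplot2::geom_col() for a barplot.
lib_size_df <- data.frame(
  sample = names(lib_size),
  total_count = unname(lib_size)
)
lib_size_df
```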

### Count distribution
Violin plots of the raw read counts for each of the samples.
```{r "count-distribution"}
count_matrix <- readRDS(here(data_path, "count_matrix.rds"))
meta <- readRDS(here(data_path, "meta_data.rds"))

count_matrix %>%
  tdy("gene", "sample", "count", meta) %>%
  arrange(treatment) %>%
  ggplot(aes(x = fct_reorder(sample, as.numeric(treatment)), y=log10(count+1), 
             group = sample, fill = treatment)) +
  geom_violin() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
        legend.position = "top") +
  labs(x=NULL) +
  my_theme(grid = "no", fsize = fz)
```

### PCA of raw data
PCA plot of raw read counts, colored by time point and treatment. Beforehand, genes with constant expression across all samples are removed and count values are transformed to log2 scale. Only the top 1000 most variable genes are used as features.
```{r pca-raw-data}
count_matrix <- readRDS(here(data_path, "count_matrix.rds"))
meta <- readRDS(here(data_path, "meta_data.rds"))

stopifnot(colnames(count_matrix) == meta$sample)

# remove genes with constant expression and transform to log2 scale
preprocessed_count_matrix <- preprocess_count_matrix(count_matrix)


pca_result <- do_pca(preprocessed_count_matrix, meta, top_n_var_genes = 1000)

plot_pca(pca_result, feature = "time") + 
  plot_pca(pca_result, feature = "treatment") &
  my_theme(fsize = fz)
```
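
The preprocessing steps described above can be sketched in base R as follows. This is a minimal illustration on simulated counts, not the code behind `preprocess_count_matrix()` or `do_pca()`; the pseudocount of 1 and the use of `prcomp()` are assumptions.

```r
set.seed(1)

# Simulated count matrix: 50 genes x 6 samples (made-up data).
toy_counts <- matrix(
  rpois(300, lambda = 20), nrow = 50,
  dimnames = list(paste0("gene", 1:50), paste0("s", 1:6))
)
toy_counts[1, ] <- 7 # a gene with constant expression

# 1. Remove genes with constant expression across all samples.
gene_var <- apply(toy_counts, 1, var)
filtered <- toy_counts[gene_var > 0, , drop = FALSE]

# 2. Transform to log2 scale (the pseudocount of 1 is an assumption).
log_counts <- log2(filtered + 1)

# 3. Keep the most variable genes (10 here, 1000 in the analysis).
top_n <- 10
log_var <- apply(log_counts, 1, var)
top_genes <- names(sort(log_var, decreasing = TRUE))[seq_len(min(top_n, length(log_var)))]

# 4. PCA with samples as observations and genes as features.
pca <- prcomp(t(log_counts[top_genes, ]), center = TRUE, scale. = FALSE)
explained <- pca$sdev^2 / sum(pca$sdev^2)
```

The sample coordinates are then available in `pca$x` and can be joined with the meta data for plotting.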

## Data processing
### Normalization
Raw read counts are normalized in three steps: lowly expressed genes are filtered out, the remaining counts are TMM-normalized, and the result is transformed to logCPM.
```{r normalization}
count_matrix <- readRDS(here(data_path, "count_matrix.rds"))
meta <- readRDS(here(data_path, "meta_data.rds"))

stopifnot(meta$sample == colnames(count_matrix))

dge_obj <- DGEList(count_matrix, group = meta$group)

# filter low read counts, TMM normalization and logCPM transformation
norm <- voom_normalization(dge_obj)

saveRDS(norm, here(output_path, "normalized_expression.rds"))
```
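
For orientation, the three normalization steps could look roughly like this with edgeR on simulated data. This is only a sketch: the helper `voom_normalization()` is defined in `code/utils-rnaseq.R` and its name suggests it relies on limma's `voom()`, whereas here `cpm(log = TRUE)` is used as a simpler stand-in for the logCPM step.

```r
library(edgeR)

set.seed(1)

# Simulated counts: 200 genes x 6 samples in two groups (made-up data).
toy_counts <- matrix(
  rnbinom(1200, mu = 50, size = 5), nrow = 200,
  dimnames = list(paste0("gene", 1:200), paste0("s", 1:6))
)
toy_counts[1:20, ] <- 0 # genes without any reads
group <- factor(rep(c("ctrl", "trt"), each = 3))

y <- DGEList(toy_counts, group = group)

# 1. Filter out lowly expressed genes.
keep <- filterByExpr(y, group = group)
y <- y[keep, , keep.lib.sizes = FALSE]

# 2. TMM normalization (computes scaling factors per sample).
y <- calcNormFactors(y, method = "TMM")

# 3. logCPM transformation (stand-in for the voom-based helper).
log_cpm <- cpm(y, log = TRUE)
```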

### PCA of normalized data
PCA plot of normalized expression data contextualized based on the time point and treatment. Only the top 1000 most variable genes are used as features.
```{r pca-norm-data}
expr <- readRDS(here(output_path, "normalized_expression.rds"))
meta <- readRDS(here(data_path, "meta_data.rds"))

pca_result <- do_pca(expr, meta, top_n_var_genes = 1000)

saveRDS(pca_result, here(output_path, "pca_result.rds"))

plot_pca(pca_result, feature = "time") + 
  plot_pca(pca_result, feature = "treatment") &
  my_theme(fsize = fz)
```

## Differential gene expression analysis
### Running limma
Differential gene expression analysis via limma with the aim of identifying the effect of CCl4 intoxication while regressing out the effect of the oil.
```{r running-limma}
# load expression and meta data
expr <- readRDS(here(output_path, "normalized_expression.rds"))
meta <- readRDS(here(data_path, "meta_data.rds"))

stopifnot(colnames(expr) == meta$sample)

# build design matrix
design <- model.matrix(~ 0 + group, data = meta)
rownames(design) <- meta$sample
colnames(design) <- levels(meta$group)

# define contrasts
contrasts <- makeContrasts(
  # effect of olive oil
  oil_2m_vs_0m = oil.2 - wt,
  oil_12m_vs_0m = oil.12 - wt,
  oil_12m_vs_2m = oil.12 - oil.2,

  # treatment vs control ignoring the effect of oil
  ccl_2m_vs_0m = ccl4.2 - wt,
  ccl_6m_vs_0m = ccl4.6 - wt,
  ccl_12m_vs_0m = ccl4.12 - wt,

  # treatment vs control regressing out the effect of oil
  pure_ccl_2m_vs_0m = (ccl4.2 - wt) - (oil.2 - wt),
  pure_ccl_6m_vs_0m = (ccl4.6 - wt) - ((oil.2 + oil.12) / 2 - wt),
  pure_ccl_12m_vs_0m = (ccl4.12 - wt) - (oil.12 - wt),

  # consecutive time point comparison
  consec_12m_vs_6m = ccl4.12 - ccl4.6,
  consec_12m_vs_2m = ccl4.12 - ccl4.2,
  # consec_48w_vs_8w_2 = (ccl4.48 - oil.48) - (ccl4.8 - oil.8),
  consec_6m_vs_2m = ccl4.6 - ccl4.2,
  levels = design
)

limma_result <- run_limma(expr, design, contrasts) %>%
  assign_deg()

deg_df <- limma_result %>%
  mutate(contrast = factor(contrast, levels = c(
    "ccl_2m_vs_0m", "ccl_6m_vs_0m",
    "ccl_12m_vs_0m",
    "pure_ccl_2m_vs_0m",
    "pure_ccl_6m_vs_0m",
    "pure_ccl_12m_vs_0m",
    "consec_6m_vs_2m",
    "consec_12m_vs_2m",
    "consec_12m_vs_6m",
    "oil_2m_vs_0m", "oil_12m_vs_0m",
    "oil_12m_vs_2m"
  ))) %>%
  mutate(contrast_reference = case_when(
    str_detect(contrast, "oil") ~ "oil",
    str_detect(contrast, "^pure_ccl") ~ "pure_ccl4",
    str_detect(contrast, "^ccl") ~ "ccl4",
    str_detect(contrast, "consec") ~ "consec"
  ))

saveRDS(deg_df, here(output_path, "limma_result.rds"))
```
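
To make the "pure" contrasts more tangible, the sketch below evaluates two of them on the bare group levels. The baseline `wt` cancels out, so the 2-month contrast reduces to `ccl4.2 - oil.2`, and the 6-month contrast, which lacks a matched oil control, compares `ccl4.6` against the average of the two available oil groups. Only the group names are taken from the analysis; no data are involved.

```r
library(limma)

# Group levels mirroring the names used in the design matrix above.
lv <- c("wt", "oil.2", "oil.12", "ccl4.2", "ccl4.6", "ccl4.12")

cm <- makeContrasts(
  pure_ccl_2m_vs_0m = (ccl4.2 - wt) - (oil.2 - wt),
  pure_ccl_6m_vs_0m = (ccl4.6 - wt) - ((oil.2 + oil.12) / 2 - wt),
  levels = lv
)

# Rows are group levels, columns are contrasts; entries are the weights.
cm
```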

### Volcano plots
Volcano plots visualizing the effect of CCl4 on gene expression after regressing out the effect of the oil (the `pure_ccl4` contrasts).
```{r volcano-plots}
df <- readRDS(here(output_path, "limma_result.rds"))

df %>%
  filter(contrast_reference == "pure_ccl4") %>%
  plot_volcano() +
  my_theme(grid = "y", fsize = fz)
```

## Time series clustering
Here we cluster the gene expression trajectories using the [STEM](http://www.cs.cmu.edu/~jernst/stem/) software. The clustering algorithm is described [here](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-7-191).

### Prepare input
```{r prepare-stem-input}
# prepare input for stem analysis
df <- readRDS(here(output_path, "limma_result.rds"))

stem_inputs <- df %>%
  filter(contrast_reference %in% c("pure_ccl4")) %>%
  mutate(class = str_c("Month ", parse_number(as.character(contrast)))) %>%
  mutate(class = factor(class, levels = c("Month 2", "Month 6", "Month 12"))) %>%
  select(gene, class, logFC, contrast_reference)
  
stem_inputs %>%
  select(-contrast_reference) %>%
  pivot_wider(names_from = class, values_from = logFC) %>% 
  write_delim(here(output_path, "stem/input/pure_ccl4.txt"), delim = "\t")
```
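
The reshaping step can be illustrated on a small toy table: the month is parsed from the contrast name and the long table is pivoted into one row per gene with one logFC column per time point. The gene names and logFC values below are made up; only the contrast names are taken from the analysis.

```r
library(dplyr)
library(tidyr)
library(readr)

# Toy limma-like result in long format (values are made up).
toy <- tibble(
  gene = rep(c("Cyp2e1", "Col1a1"), each = 3),
  contrast = rep(
    c("pure_ccl_2m_vs_0m", "pure_ccl_6m_vs_0m", "pure_ccl_12m_vs_0m"),
    times = 2
  ),
  logFC = c(-1.2, -0.8, -0.5, 0.9, 1.5, 2.1)
)

wide <- toy %>%
  # parse_number() extracts the first number in the string, i.e. the month
  mutate(class = paste0("Month ", parse_number(contrast))) %>%
  mutate(class = factor(class, levels = c("Month 2", "Month 6", "Month 12"))) %>%
  select(gene, class, logFC) %>%
  # one row per gene, one column per time point
  pivot_wider(names_from = class, values_from = logFC)

wide
```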

### Run STEM
STEM is implemented in Java; its .jar file is called from R. Only significant time series clusters (p <= 0.05) are displayed.
```{r run-stem}
# execute stem
stem_res <- run_stem(here(output_path, "stem"), clear_output = TRUE)

saveRDS(stem_res, here(output_path, "stem_result.rds"))

stem_res %>%
  filter(p <= 0.05) %>%
  filter(key == "pure_ccl4") %>%
  distinct() %>%
  plot_stem_profiles(model_profile = FALSE) +
  labs(x = "Time in months", y = "logFC") +
  my_theme(grid = "y", fsize = fz)
```

## Translation to HGNC symbols
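For later comparisons with human data, the mouse gene symbols are mapped to their human orthologs. If several mouse genes map to the same human gene, only the entry with the highest absolute logFC per contrast is kept.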
```{r translate-to-hgnc-symbols}
df <- readRDS(here(output_path, "limma_result.rds"))

mapped_df <- df %>%
  translate_gene_ids(from = "symbol_mgi", to = "symbol_hgnc") %>%
  drop_na() %>%
  # for duplicated genes, keep the one with the highest absolute logFC
  group_by(contrast_reference, contrast, gene) %>%
  slice_max(order_by = abs(logFC), n = 1, with_ties = FALSE) %>%
  ungroup()

saveRDS(mapped_df, here(output_path, "limma_result_hs.rds"))
```
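
The deduplication rule can be checked on a toy table in which two entries carry the same human symbol after mapping. The symbols and logFC values are hypothetical placeholders; the grouping is reduced to `contrast` and `gene` for brevity.

```r
library(dplyr)

# Toy table after ortholog mapping: GENE_A appears twice (made-up values).
toy_mapped <- tibble(
  contrast = "pure_ccl_2m_vs_0m",
  gene = c("GENE_A", "GENE_A", "GENE_B"),
  logFC = c(0.4, -1.3, 0.7)
)

dedup <- toy_mapped %>%
  group_by(contrast, gene) %>%
  # keep the entry with the largest absolute logFC; with_ties = FALSE
  # guarantees exactly one row per group
  slice_max(order_by = abs(logFC), n = 1, with_ties = FALSE) %>%
  ungroup()

dedup
```

Note that the sign of the retained logFC is preserved, so a strongly down-regulated entry wins over a weakly up-regulated one.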