analysis/mouse-acute-bdl.Rmd

---
title: "Acute BDL mouse model"
output: 
  workflowr::wflow_html:
    code_folding: hide
editor_options:
  chunk_output_type: console
---

```{r chunk-setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  autodep = TRUE,
  cache = TRUE,
  message = FALSE,
  warning = FALSE
)
```

```{r wall-time-start, cache=FALSE, include=FALSE}
# Track time spent on performing this analysis
start_time <- Sys.time()
```

## Introduction
Here we analysis a mouse model of BDL (Bile Duct Ligation) induced acute liver damage. The transcriptomic profiles were measured at 4 different time points ranging from 1 day to 21 days. For the time points 1, 3, and 7 days time-matched controls are available.

## Libraries and sources
These libraries and sources are used for this analysis.
```{r libs-and-src, message=FALSE, warning=FALSE, cache=FALSE}
library(mogene20sttranscriptcluster.db)

library(tidyverse)
library(tidylog)
library(here)

library(oligo)
library(annotate)
library(limma)
library(biobroom)
library(progeny)
library(dorothea)

library(janitor)
library(msigdf) # remotes::install_github("ToledoEM/msigdf@v7.1")

library(AachenColorPalette)
library(cowplot)
library(lemon)
library(patchwork)

options("tidylog.display" = list(print))
source(here("code/utils-microarray.R"))
source(here("code/utils-utils.R"))
source(here("code/utils-plots.R"))
```

Definition of global variables that are used throughout this analysis.
```{r analysis-specific-params, cache=FALSE}
# i/o
data_path <- "data/mouse-acute-bdl"
output_path <- "output/mouse-acute-bdl"

# graphical parameters
# fontsize
fz <- 9
```

## Data processing
### Load .CEL files and quality control
The array quality is controlled based on the relative log expression values (RLE) and the normalized unscaled standard errors (NUSE).
```{r load-cel-files}
# load cel files and check quality
platforms <- readRDS(here("data/annotation/platforms.rds"))
raw_eset <- list.celfiles(here(data_path), listGzipped = T, full.names = T) %>%
  read.celfiles() %>%
  ma_qc() # Discarding in total 1 arrays: 489-944wt 1d Sham_(MoGene-2_0-st).CEL
```

### Normalization and probe annotation
Probe intensities are normalized with the `rma()` function. Probes are annotated with MGI symbols.
```{r normalization-and-annotation}
eset <- rma(raw_eset)

# annotate microarray probes with mgi symbols
expr <- ma_annotate(eset, platforms)

# save normalized expression
saveRDS(expr, here(output_path, "normalized_expression.rds"))
```

### Build meta data
Meta information are parsed from the sample names.
```{r build-meta-data}
# build meta data
meta <- colnames(expr) %>%
  enframe(name = NULL, value = "sample") %>%
  separate(sample, into = c(
    "tmp1", "mouse", "time", "treatment",
    "tmp"
  ), remove = F, extra = "merge") %>%
  dplyr::select(-starts_with("tmp")) %>%
  mutate(
    time = ordered(parse_number(time)),
    mouse = str_remove(mouse, "wt"),
    treatment = factor(str_to_lower(treatment), levels = c("sham", "bdl")),
    group = str_c(treatment, str_c(time, "d"), sep = "_")
  ) %>%
  mutate(group = factor(group, levels = c(
    "sham_1d", "bdl_1d", "sham_3d",
    "bdl_3d", "sham_7d", "bdl_7d",
    "bdl_21d"
  )))

# save meta data
saveRDS(meta, here(output_path, "meta_data.rds"))
```

## Exploratory analysis
### PCA of normalized data
PCA plot of normalized expression data contextualized based on the time point and treatment. Only the top 1000 most variable genes are used as features.
```{r pca-norm-data}
expr <- readRDS(here(output_path, "normalized_expression.rds"))
meta <- readRDS(here(output_path, "meta_data.rds"))

pca_result <- do_pca(expr, meta, top_n_var_genes = 1000)

saveRDS(pca_result, here(output_path, "pca_result.rds"))

plot_pca(pca_result, feature = "time") +
  plot_pca(pca_result, feature = "treatment") &
  my_theme()
```

## Differential gene expression analysis
### Running limma
Differential gene expression analysis via limma with the aim to identify the effect of BDL for the different time points.
```{r running-limma}
# load expression and meta data
expr <- readRDS(here(output_path, "normalized_expression.rds"))
meta <- readRDS(here(output_path, "meta_data.rds"))

stopifnot(colnames(expr) == meta$sample)

# build design matrix
design <- model.matrix(~ 0 + group, data = meta)
rownames(design) <- meta$sample
colnames(design) <- levels(meta$group)

# define contrasts
contrasts <- makeContrasts(
  # time matched effect of bdl
  bdl_vs_sham_1d = bdl_1d - sham_1d,
  bdl_vs_sham_3d = bdl_3d - sham_3d,
  bdl_vs_sham_7d = bdl_7d - sham_7d,
  bdl_vs_sham_21d = bdl_21d - (sham_1d + sham_3d + sham_7d) / 3,

  # pair-wise bdl comparison
  bdl_3d_vs_1d = bdl_3d - bdl_1d,
  bdl_7d_vs_1d = bdl_7d - bdl_1d,
  bdl_21d_vs_1d = bdl_21d - bdl_1d,
  bdl_7d_vs_3d = bdl_7d - bdl_3d,
  bdl_21d_vs_3d = bdl_21d - bdl_3d,
  bdl_21d_vs_7d = bdl_21d - bdl_7d,

  # pair-wise sham comparison
  sham_3d_vs_1d = sham_3d - sham_1d,
  sham_7d_vs_1d = sham_7d - sham_1d,
  sham_7d_vs_3d = sham_7d - sham_3d,

  levels = design
)

limma_result <- run_limma(expr, design, contrasts) %>%
  assign_deg()

deg_df <- limma_result %>%
  mutate(contrast = factor(contrast, levels = c(
    "bdl_vs_sham_1d", "bdl_vs_sham_3d", "bdl_vs_sham_7d", "bdl_vs_sham_21d",
    "bdl_3d_vs_1d",
    "bdl_7d_vs_1d", "bdl_21d_vs_1d", "bdl_7d_vs_3d", "bdl_21d_vs_3d",
    "bdl_21d_vs_7d", "sham_3d_vs_1d", "sham_7d_vs_1d", "sham_7d_vs_3d"
  ))) %>%
  mutate(contrast_reference = case_when(
    str_detect(contrast, "vs_sham") ~ "bdl",
    str_detect(contrast, "bdl_\\d*") ~ "pairwise_bdl",
    str_detect(contrast, "sham_\\d*") ~ "pairwise_sham"
  ))

saveRDS(deg_df, here(output_path, "limma_result.rds"))
```

### Volcano plots
Volcano plots visualizing the effect of BDL on gene expression.
```{r volcano-plots}
df <- readRDS(here(output_path, "limma_result.rds"))

df %>%
  filter(contrast_reference == "bdl") %>%
  plot_volcano() +
  my_theme(grid = "y", fsize = fz)
```

## Time series clustering
Gene expression trajectories are clustered using the [STEM](http://www.cs.cmu.edu/~jernst/stem/) software. The cluster algorithm is described [here](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-7-191).

### Prepare input
```{r prepare-stem-input}
# prepare input for stem analysis
df <- readRDS(here(output_path, "limma_result.rds"))

stem_inputs <- df %>%
  mutate(class = str_c("Day ", parse_number(as.character(contrast)))) %>%
  mutate(class = factor(class, levels = unique(.$class))) %>%
  select(gene, class, logFC, contrast_reference)

stem_inputs %>%
  filter(contrast_reference == "bdl") %>%
  select(-contrast_reference) %>%
  pivot_wider(names_from = class, values_from = logFC) %>%
  write_delim(here(output_path, "stem/input/bdl.txt"), delim = "\t")
```

### Run STEM
STEM is implemented in Java. The .jar file is called from R. Only significant time series clusters are visualized.
```{r run-stem}
# execute stem
stem_res <- run_stem(here(output_path, "stem"), clear_output = T)

saveRDS(stem_res, here(output_path, "stem_result.rds"))

stem_res %>%
  filter(p <= 0.05) %>%
  filter(key == "bdl") %>%
  distinct() %>%
  plot_stem_profiles(model_profile = F, ncol = 2) +
  labs(x = "Time in Days", y = "logFC") +
  my_theme(grid = "y", fsize = fz)
```

### Cluster characterization
STEM clusters are characterized by GO terms, [PROGENy's](http://saezlab.github.io/progeny/) pathways and [DoRothEA's](http://saezlab.github.io/dorothea/) TFs. As statistic over-representation analysis is used.
```{r cluster-characterization}
stem_res = readRDS(here(output_path, "stem_result.rds"))

signatures = stem_res %>%
  filter(p <= 0.05) %>%
  distinct(profile, gene, p_profile = p)

genesets = load_genesets() %>%
  filter(confidence %in% c(NA,"A", "B", "C")) 

ora_res = signatures %>%
  nest(sig = c(-profile)) %>%
  dplyr::mutate(ora = sig %>% map(run_ora,  sets = genesets, min_size = 10,
                           options = list(alternative = "greater"), 
                           background_n = 20000)) %>%
  select(-sig) %>%
  unnest(ora)

saveRDS(ora_res, here(output_path, "stem_characterization.rds"))
```

```{r wall-time-end, cache=FALSE, include=FALSE}
duration <- abs(as.numeric(difftime(Sys.time(), start_time, units = "secs")))
t = print(sprintf("%02d:%02d", duration %% 3600 %/% 60,  duration %% 60 %/% 1))
```
Time spend to execute this analysis: `r t` minutes.