analysis/mouse-acute-ph.Rmd

---
title: "Acute PH mouse model"
output: 
  workflowr::wflow_html:
    code_folding: hide
editor_options:
  chunk_output_type: console
---

```{r chunk-setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  autodep = TRUE,
  cache = TRUE,
  message = FALSE,
  warning = FALSE
)
```

```{r wall-time-start, cache=FALSE, include=FALSE}
# Track time spent on performing this analysis
start_time <- Sys.time()
```

## Introduction
Here we analysis a mouse model of PH (partial hepatectomy) induced acute liver damage. The transcriptomic profiles were measured at 12 different time points ranging from 1 hour to 3 months.

## Libraries and sources
These libraries and sources are used for this analysis.
```{r libs-and-src, message=FALSE, warning=FALSE, cache=FALSE}
library(mouse4302.db)

library(tidyverse)
library(tidylog)
library(here)

library(oligo)
library(annotate)
library(limma)
library(biobroom)
library(progeny)
library(dorothea)

library(janitor)
library(msigdf) # remotes::install_github("ToledoEM/msigdf@v7.1")

library(AachenColorPalette)
library(cowplot)
library(lemon)
library(patchwork)

options("tidylog.display" = list(print))
source(here("code/utils-microarray.R"))
source(here("code/utils-utils.R"))
source(here("code/utils-plots.R"))
```

Definition of global variables that are used throughout this analysis.
```{r analysis-specific-params, cache=FALSE}
# i/o
data_path <- "data/mouse-acute-ph"
output_path <- "output/mouse-acute-ph"

# graphical parameters
# fontsize
fz <- 9
```

## Data processing
### Load .CEL files and quality control
The array quality is controlled based on the relative log expression values (RLE) and the normalized unscaled standard errors (NUSE).
```{r load-cel-files}
# load cel files and check quality
platforms <- readRDS(here("data/annotation/platforms.rds"))
raw_eset <- list.celfiles(here(data_path), listGzipped = T, full.names = T) %>%
  read.celfiles() %>%
  ma_qc() # discard Jena_DPH_69_(Mouse430_2).CEL
```

### Normalization and probe annotation
Probe intensities are normalized with the `rma()` function. Probes are annotated with MGI symbols.
```{r normalization-and-annotation}
eset <- rma(raw_eset)

# annotate microarray probes with mgi symbols
expr <- ma_annotate(eset, platforms)

# save normalized expression
saveRDS(expr, here(output_path, "normalized_expression.rds"))
```

### Build meta data
Meta information are loaded from a manual constructed .csv file
```{r build-meta-data}
meta_raw = read_csv(here(data_path, "DPH_WMH_Samples_clean.csv"))

meta = meta_raw %>% 
  filter(sample %in% colnames(expr)) %>%
  mutate(time = ordered(time),
         gender = as_factor(gender),
         treatment = as_factor(treatment),
         year = ordered(year),
         mouse = fct_inorder(mouse),
         surgeon = factor(surgeon),
         group = as_factor(group)) %>%
  select(-gender)

# save meta data
saveRDS(meta, here(output_path, "meta_data.rds"))
```

## Exploratory analysis
### PCA of normalized data
PCA plot of normalized expression data contextualized based on the time point. Only the top 1000 most variable genes are used as features.
```{r pca-norm-data}
expr <- readRDS(here(output_path, "normalized_expression.rds"))
meta <- readRDS(here(output_path, "meta_data.rds"))

pca_result <- do_pca(expr, meta, top_n_var_genes = 1000)

saveRDS(pca_result, here(output_path, "pca_result.rds"))

plot_pca(pca_result, feature = "time") +
  my_theme()
```

## Differential gene expression analysis
### Running limma
Differential gene expression analysis via limma with the aim to identify the effect of PH for the different time points.
```{r running-limma}
# load expression and meta data
expr <- readRDS(here(output_path, "normalized_expression.rds"))
meta <- readRDS(here(output_path, "meta_data.rds"))

stopifnot(colnames(expr) == meta$sample)

# build design matrix
design = model.matrix(~0+group, data=meta)
rownames(design) = meta$sample


# define contrasts
contrasts = makeContrasts(
  # comparison vs time point 0
  ph_0.043d = group1h - group0h,
  ph_0.25d = group6h - group0h,
  ph_0.5d = group12h - group0h,
  ph_1d = group24h - group0h,
  ph_2d = group48h - group0h,
  ph_3d = group3d - group0h,
  ph_4d = group4d - group0h,
  ph_7d = group1w - group0h,
  ph_14d = group2w - group0h,
  ph_28d = group4w - group0h,
  ph_84d = group3m - group0h,
  
  # consecutive timepoint comparison
  consec_1h_vs_0h = group1h - group0h,
  consec_6h_vs_1h = group6h - group1h,
  consec_12h_vs_6h = group12h - group6h,
  consec_1d_vs_12h = group24h - group12h,
  consec_2d_vs_1d = group48h - group24h,
  consec_3d_vs_2d = group3d - group48h,
  consec_4d_vs_3d = group4d - group3d,
  consec_1w_vs_4d = group1w - group4d,
  consec_2w_vs_1w = group2w - group1w,
  consec_1m_vs_2w = group4w - group2w,
  consec_3m_vs_1m = group3m - group4w,
  levels = design
)

limma_result = run_limma(expr, design, contrasts) %>%
  assign_deg()

deg_df = limma_result %>%
  mutate(contrast = fct_inorder(contrast)) %>%
  mutate(contrast_reference = case_when(
    str_detect(contrast, "ph") ~ "hepatec",
    str_detect(contrast, "consec") ~ "consec"))

saveRDS(deg_df, here(output_path, "limma_result.rds"))
```

### Volcano plots
Volcano plots visualizing the effect of APAP on gene expression.
```{r volcano-plots}
df <- readRDS(here(output_path, "limma_result.rds"))

df %>%
  filter(contrast_reference == "hepatec") %>%
  plot_volcano() +
  my_theme(grid = "y", fsize = fz)
```

## Time series clustering
Gene expression trajectories are clustered using the [STEM](http://www.cs.cmu.edu/~jernst/stem/) software. The cluster algorithm is described [here](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-7-191).

### Prepare input
```{r prepare-stem-input}
# prepare input for stem analysis
df = readRDS(here(output_path, "limma_result.rds"))

stem_inputs = df %>%
  filter(contrast_reference == "hepatec") %>%
  mutate(class = str_c("Hour ", 
                       as.integer(parse_number(as.character(contrast))*24))) %>%
  arrange(contrast) %>%
  mutate(class = fct_inorder(class)) %>%
  select(gene, class, logFC, contrast_reference)
  
stem_inputs %>%
  select(-contrast_reference) %>%
  pivot_wider(names_from = class, values_from = logFC) %>% 
  write_delim(here(output_path, "stem/input/hepatec.txt"), delim = "\t")
```

### Run STEM
STEM is implemented in Java. The .jar file is called from R. Only significant time series clusters are visualized.
```{r run-stem}
# execute stem
stem_res <- run_stem(here(output_path, "stem"), clear_output = T)

saveRDS(stem_res, here(output_path, "stem_result.rds"))

stem_res %>%
  filter(p <= 0.05) %>%
  filter(key == "hepatec") %>%
  distinct() %>%
  plot_stem_profiles(model_profile = F, ncol = 2) +
  labs(x = "Time in Days", y = "logFC") +
  my_theme(grid = "y", fsize = fz)
```

### Cluster characterization
STEM clusters are characterized by GO terms, [PROGENy's](http://saezlab.github.io/progeny/) pathways and [DoRothEA's](http://saezlab.github.io/dorothea/) TFs. As statistic over-representation analysis is used.
```{r cluster-characterization}
stem_res = readRDS(here(output_path, "stem_result.rds"))

signatures = stem_res %>%
  filter(p <= 0.05) %>%
  distinct(profile, gene, p_profile = p)

genesets = load_genesets() %>%
  filter(confidence %in% c(NA,"A", "B", "C")) 

ora_res = signatures %>%
  nest(sig = c(-profile)) %>%
  dplyr::mutate(ora = sig %>% map(run_ora,  sets = genesets, min_size = 10,
                           options = list(alternative = "greater"), 
                           background_n = 20000)) %>%
  select(-sig) %>%
  unnest(ora)

saveRDS(ora_res, here(output_path, "stem_characterization.rds"))
```

```{r wall-time-end, cache=FALSE, include=FALSE}
duration <- abs(as.numeric(difftime(Sys.time(), start_time, units = "secs")))
t = print(sprintf("%02d:%02d", duration %% 3600 %/% 60,  duration %% 60 %/% 1))
```
Time spend to execute this analysis: `r t` minutes.