analysis/human-ramnath-fibrosis.Rmd

---
title: "Fibrosis patient cohort by Ramnath et al."
output: 
  workflowr::wflow_html:
    code_folding: hide
editor_options:
  chunk_output_type: console
---

```{r chunk-setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  autodep = TRUE,
  cache = TRUE
)
```

```{r wall-time-start, cache=FALSE, include=FALSE}
# Track time spent on performing this analysis
start_time <- Sys.time()
```

## Introduction
Here we analyze a cohort of patients suffering from HCV or NAFLD, generated by [Ramnath et al.](https://doi.org/10.1172/jci.insight.120274).

## Libraries and sources
These libraries and sources are used for this analysis.
```{r libs-and-src, message=FALSE, warning=FALSE, cache=FALSE}
library(tidyverse)
library(tidylog)
library(here)

library(edgeR)
library(biobroom)

library(AachenColorPalette)
library(cowplot)
library(lemon)
library(patchwork)

options("tidylog.display" = list(print))
source(here("code/utils-rnaseq.R"))
source(here("code/utils-utils.R"))
source(here("code/utils-plots.R"))
```

Definition of global variables that are used throughout this analysis.
```{r analysis-specific-params, cache=FALSE}
# i/o
data_path <- "data/human-ramnath-fibrosis"
output_path <- "output/human-ramnath-fibrosis"

# graphical parameters
# fontsize
fz <- 9
```

## Preliminary exploratory analysis
### Library size
Barplot of the library size (total counts) for each of the samples.
```{r lib-size}
count_matrix <- readRDS(here(data_path, "count_matrix.rds"))

plot_libsize(count_matrix) +
  my_theme(fsize = fz)
```
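
`plot_libsize()` is a project helper from the sourced `code/` scripts and is not shown here. As a minimal sketch, assuming the library size is simply the column sum of the count matrix, the underlying quantity could be computed as follows. The toy matrix and all names in it are hypothetical.

```r
# Toy count matrix (genes x samples); values are made up for illustration
toy_counts <- matrix(
  c(10, 0, 5,    # sample_1
    20, 3, 7),   # sample_2
  nrow = 3,
  dimnames = list(
    c("gene_a", "gene_b", "gene_c"),
    c("sample_1", "sample_2")
  )
)

# Library size = total number of counts per sample (column sums)
lib_size <- colSums(toy_counts)

# Base R bar plot as a stand-in for the project's plot_libsize() helper
barplot(lib_size, ylab = "Total counts", las = 2)
```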

### Count distribution
Violin plots of the raw read counts for each of the samples.
```{r count-distribution}
count_matrix <- readRDS(here(data_path, "count_matrix.rds"))
meta <- readRDS(here(data_path, "meta_data.rds"))

count_matrix %>%
  tdy("gene", "sample", "count", meta) %>%
  arrange(disease) %>%
  ggplot(aes(
    x = fct_reorder(sample, as.numeric(disease)), y = log10(count + 1),
    group = sample, fill = disease
  )) +
  geom_violin() +
  theme(
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
    legend.position = "top"
  ) +
  labs(x = NULL) +
  my_theme(grid = "no", fsize = fz)
```
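
The `tdy()` helper used above is project-specific and not shown. The following base R sketch illustrates one way a count matrix might be reshaped into long format and joined with sample annotations before plotting. It is an assumption about what such a helper does, not its actual implementation, and the toy data are hypothetical.

```r
# Toy count matrix (genes x samples) and matching sample annotation
toy_counts <- matrix(
  c(0, 5, 10, 20, 1, 3),
  nrow = 3,
  dimnames = list(c("g1", "g2", "g3"), c("s1", "s2"))
)
toy_meta <- data.frame(
  sample = c("s1", "s2"),
  disease = c("HCV", "NAFLD")
)

# Reshape to long format: one row per gene/sample pair
long <- as.data.frame(as.table(toy_counts), stringsAsFactors = FALSE)
names(long) <- c("gene", "sample", "count")

# Attach the sample annotation
long <- merge(long, toy_meta, by = "sample")

# Same transformation as on the y-axis of the violin plot
long$log_count <- log10(long$count + 1)
```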

### PCA of raw data
PCA plot of raw read counts contextualized by disease etiology and stage. Beforehand, genes with constant expression across all samples are removed and count values are transformed to log2 scale. Only the top 1000 most variable genes are used as features.
```{r pca-raw-data}
count_matrix <- readRDS(here(data_path, "count_matrix.rds"))
meta <- readRDS(here(data_path, "meta_data.rds"))

stopifnot(colnames(count_matrix) == meta$sample)

# remove constantly expressed genes and transform to log2 scale
preprocessed_count_matrix <- preprocess_count_matrix(count_matrix)


pca_result <- do_pca(preprocessed_count_matrix, meta, top_n_var_genes = 1000)

plot_pca(pca_result, feature = "disease") +
  plot_pca(pca_result, feature = "stage") &
  my_theme(fsize = fz)
```
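
`preprocess_count_matrix()` and `do_pca()` are project helpers whose code is not shown. The steps described above can be sketched in base R as follows. The pseudocount of 1, the use of `prcomp()` without scaling, and the simulated data are all assumptions made for illustration.

```r
set.seed(1)

# Simulated raw counts: 200 genes x 6 samples (hypothetical data)
toy_counts <- matrix(
  rpois(200 * 6, lambda = 50),
  nrow = 200,
  dimnames = list(paste0("gene_", 1:200), paste0("sample_", 1:6))
)
toy_counts[1, ] <- 7 # force one constantly expressed gene

# 1. Remove genes with zero variance across all samples
keep <- apply(toy_counts, 1, var) > 0
filtered <- toy_counts[keep, , drop = FALSE]

# 2. Transform to log2 scale (pseudocount of 1 is an assumption)
log_counts <- log2(filtered + 1)

# 3. Select the top n most variable genes (100 here instead of 1000)
top_n <- 100
gene_var <- apply(log_counts, 1, var)
n_keep <- min(top_n, length(gene_var))
top_genes <- names(sort(gene_var, decreasing = TRUE))[seq_len(n_keep)]

# 4. PCA with samples as observations and genes as features
pca <- prcomp(t(log_counts[top_genes, ]), center = TRUE, scale. = FALSE)

# Proportion of variance explained by each principal component
explained <- pca$sdev^2 / sum(pca$sdev^2)
```

The sample coordinates are then available in `pca$x` and can be joined with the meta data to color the points by `disease` or `stage`.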

## Data processing
### Normalization
Raw read counts are normalized by first filtering out lowly expressed genes, then applying TMM normalization, and finally transforming the counts to logCPM.
```{r normalization}
count_matrix <- readRDS(here(data_path, "count_matrix.rds"))
meta <- readRDS(here(data_path, "meta_data.rds"))

stopifnot(meta$sample == colnames(count_matrix))

dge_obj <- DGEList(count_matrix, group = meta$group)

# filter low read counts, TMM normalization and logCPM transformation
norm <- voom_normalization(dge_obj)

saveRDS(norm, here(output_path, "normalized_expression.rds"))
```
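
`voom_normalization()` is a project helper that is not shown here. Its name suggests that it wraps `limma::voom()`, but that is an assumption. A minimal sketch of the three steps named above, using standard edgeR and limma calls on simulated data, might look like this:

```r
library(edgeR)
library(limma)

set.seed(1)

# Simulated counts: 500 genes x 6 samples in two groups (hypothetical data)
toy_counts <- matrix(
  rnbinom(500 * 6, mu = 100, size = 5),
  nrow = 500,
  dimnames = list(paste0("gene_", 1:500), paste0("sample_", 1:6))
)
toy_group <- factor(rep(c("early", "advanced"), each = 3))

dge <- DGEList(toy_counts, group = toy_group)

# 1. Filter out lowly expressed genes (uses the group stored in the DGEList)
keep <- filterByExpr(dge)
dge <- dge[keep, , keep.lib.sizes = FALSE]

# 2. TMM normalization: estimates per-sample scaling factors
dge <- calcNormFactors(dge, method = "TMM")

# 3. logCPM transformation via voom, which also estimates precision weights
design <- model.matrix(~ 0 + toy_group)
v <- voom(dge, design)
log_cpm <- v$E
```

An alternative without precision weights would be `cpm(dge, log = TRUE)` from edgeR. Which variant the helper actually uses cannot be determined from this document.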

### PCA of normalized data
PCA plot of normalized expression data contextualized by disease etiology and stage. Only the top 1000 most variable genes are used as features.
```{r pca-norm-data}
expr <- readRDS(here(output_path, "normalized_expression.rds"))
meta <- readRDS(here(data_path, "meta_data.rds"))

pca_result <- do_pca(expr, meta, top_n_var_genes = 1000)

saveRDS(pca_result, here(output_path, "pca_result.rds"))

plot_pca(pca_result, feature = "disease") +
  plot_pca(pca_result, feature = "stage") &
  my_theme(fsize = fz)
```

## Differential gene expression analysis
### Running limma
Differential gene expression analysis via limma, aiming to identify the transcriptomic signatures of advanced versus early HCV and NAFLD.
```{r running-limma}
# load expression and meta data
expr <- readRDS(here(output_path, "normalized_expression.rds"))
meta <- readRDS(here(data_path, "meta_data.rds"))

stopifnot(colnames(expr) == meta$sample)

# build design matrix
design <- model.matrix(~ 0 + group, data = meta)
rownames(design) <- meta$sample
colnames(design) <- levels(meta$group)


# define contrasts
contrasts <- makeContrasts(
  hcv_adv_vs_early = hcv_advanced - hcv_early,
  nafld_adv_vs_early = nafld_advanced - nafld_early,
  levels = design
)

limma_result <- run_limma(expr, design, contrasts) %>%
  assign_deg()

deg_df <- limma_result %>%
  mutate(contrast = factor(contrast)) %>%
  mutate(contrast_reference = contrast)

saveRDS(deg_df, here(output_path, "limma_result.rds"))
```
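
`run_limma()` and `assign_deg()` are project helpers that are not shown here. As a hedged sketch, a typical limma workflow for a group-means design with the two contrasts above could look as follows. The simulated expression matrix and meta data are hypothetical, and the helper may differ in its details.

```r
library(limma)

set.seed(1)

# Simulated meta data with the four groups used in the contrasts
toy_groups <- c("hcv_early", "hcv_advanced", "nafld_early", "nafld_advanced")
toy_meta <- data.frame(
  sample = paste0("sample_", 1:12),
  group = factor(rep(toy_groups, each = 3), levels = toy_groups)
)

# Simulated normalized expression: 100 genes x 12 samples
toy_expr <- matrix(
  rnorm(100 * 12, mean = 8),
  nrow = 100,
  dimnames = list(paste0("gene_", 1:100), toy_meta$sample)
)

# Group-means parametrization: one coefficient per group, no intercept
design <- model.matrix(~ 0 + group, data = toy_meta)
colnames(design) <- levels(toy_meta$group)

contrasts <- makeContrasts(
  hcv_adv_vs_early = hcv_advanced - hcv_early,
  nafld_adv_vs_early = nafld_advanced - nafld_early,
  levels = design
)

# Fit the linear model, apply the contrasts and moderate the statistics
fit <- lmFit(toy_expr, design)
fit <- contrasts.fit(fit, contrasts)
fit <- eBayes(fit)

# Full result table for one contrast (logFC, P.Value, adj.P.Val, ...)
hcv_table <- topTable(fit, coef = "hcv_adv_vs_early", number = Inf)
```

Setting `number = Inf` returns all genes rather than only the top hits, which is what a downstream volcano plot requires.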

### Volcano plots
Volcano plots visualizing the transcriptomic signatures of HCV and NAFLD.
```{r volcano-plots}
df <- readRDS(here(output_path, "limma_result.rds"))

df %>%
  plot_volcano() +
  my_theme(grid = "y", fsize = fz)
```

```{r wall-time-end, cache=FALSE, include=FALSE}
duration <- abs(as.numeric(difftime(Sys.time(), start_time, units = "secs")))
# format as mm:ss; total minutes are used so that runs longer than one hour are not truncated
t <- sprintf("%02d:%02d", duration %/% 60, duration %% 60 %/% 1)
```
Time spent executing this analysis: `r t` minutes.